I ran a number of load tests with the following setups:
Azure Kubernetes Service (AKS) cluster → MongoDB Atlas on Azure (M20, M30, same region, network peering)
Google Kubernetes Engine (GKE) cluster → MongoDB Atlas on GCP (M20, M30, same region, network peering)
I expected much better results from GCP because of its much higher IOPS, but to my surprise the Azure results were better (???).
The load tests sent 50 or 100 business operations per second. One business operation involves 1-2 find and 7-8 insert/replace db operations across 3 microservices and their 3 corresponding MongoDB databases, all on the same MongoDB Atlas server.
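For a rough sense of the database load this implies, the rates above can simply be multiplied out. This is only a back-of-the-envelope sketch: the function name is made up for illustration, and db operations do not map one-to-one to disk IOPS.

```python
def db_ops_per_second(business_ops, finds=(1, 2), writes=(7, 8)):
    """Return (min, max) db operations per second for a given business-op rate.

    finds and writes are the per-business-operation ranges quoted in the post.
    """
    reads = (business_ops * finds[0], business_ops * finds[1])
    write_ops = (business_ops * writes[0], business_ops * writes[1])
    total = (reads[0] + write_ops[0], reads[1] + write_ops[1])
    return {"reads": reads, "writes": write_ops, "total": total}


for rate in (50, 100):
    print(rate, db_ops_per_second(rate))
```

So the tests generate roughly 400-500 db operations per second at 50 business operations per second and 800-1000 at 100, most of them writes.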
All measurements are taken server-side, based on traces sent to Azure Application Insights.
I am not sure how to paste a picture or an XLS file with the results here (there is no option to attach a file).
I have the feeling I am doing something wrong. Does anyone have an idea what that might be, or is Azure really more performant than GCP? I always had the opposite impression, since Azure has some serious disk IOPS issues, which I wrote about here: MongoDB Atlas & Azure - a forced marriage? - DEV Community
Sending the traces from GKE to Azure App Insights is certainly more costly: I saw 5-10% more CPU usage in the GKE pods than in the AKS pods, which could be due to this.
However, the timestamps in the traces (for the roundtrips to MongoDB) are calculated locally, in the pod running in GKE, so it does not matter how long it takes afterwards to send the trace to Azure App Insights.
For example, a trace measures the roundtrip time, including db processing time, of a find operation in MongoDB. The duration is calculated inside the running pod (e.g. 2 ms), and the trace carrying that 2 ms value is then sent to Azure App Insights. Sending can take 20-30 ms, but the 2 ms inside the trace is still 2 ms …
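To illustrate the point, here is a minimal sketch of that measurement pattern. It is not the actual tracing code: `fake_find` and `send_trace` are hypothetical stand-ins for the MongoDB call and the App Insights exporter, with sleeps chosen to mimic the 2 ms and 20-30 ms figures above.

```python
import time


def timed(operation, *args, **kwargs):
    """Run operation and return (result, elapsed milliseconds), measured locally."""
    start = time.perf_counter()
    result = operation(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def fake_find(query):
    # Hypothetical stand-in for a MongoDB find roundtrip (about 2 ms).
    time.sleep(0.002)
    return {"_id": 1, **query}


def send_trace(trace):
    # Hypothetical stand-in for exporting the trace (about 20 ms).
    time.sleep(0.02)


doc, ms = timed(fake_find, {"name": "x"})
trace = {"operation": "find", "duration_ms": ms}  # duration is fixed here
send_trace(trace)  # the export delay happens after the measurement
print(f"recorded duration: {trace['duration_ms']:.2f} ms")
```

The recorded duration is fixed before the trace is exported, so a slow export adds CPU and network cost but does not change the measured value.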
I ran a bunch of tests again without Azure App Insights tracing, and the results are similar: Azure M20 (2 vCPUs, 4 GB RAM) with 128 GB storage (1100 IOPS) performs better than GCP M20 (1 vCPU, 4 GB RAM) with 128 GB storage (7k+ IOPS) …
I also ran tests against AWS, and the striking difference seems to be the Max Write Disk Latency, e.g. for M20 with 128 GB storage:
Azure: 40-60 ms, 1 spike within 1 h up to 80 ms
GCP: 30-40 ms, 2-3 spikes within 1 h up to 50 ms
AWS: 1-2 ms stable, only 1-2 spikes within 1 h up to 6 ms
Can someone explain the gigantic difference in max write disk latency? How come AWS is 10x better??
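For reference, the gap implied by the typical ranges above can be worked out directly. This is just arithmetic on the numbers I listed (spikes excluded), and the helper name is made up for illustration.

```python
# Max write disk latency ranges in ms, copied from the list above.
latency_ms = {"Azure": (40, 60), "GCP": (30, 40), "AWS": (1, 2)}


def ratio_range(provider, baseline="AWS"):
    """Return the (smallest, largest) latency ratio of provider vs. baseline."""
    low, high = latency_ms[provider]
    base_low, base_high = latency_ms[baseline]
    return low / base_high, high / base_low


for provider in ("Azure", "GCP"):
    smallest, largest = ratio_range(provider)
    print(f"{provider}: {smallest:.0f}x to {largest:.0f}x the AWS latency")
```

By these numbers the difference is 20x to 60x for Azure and 15x to 40x for GCP, so "10x" is, if anything, an understatement.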