Slow throughput on M2

I’ve run into the following situation with Atlas Device Sync (M2 cluster on AWS Frankfurt):

  • Imported 600,000 objects into Realm (took 1 minute)
  • Realm Sync uploaded all 600,000 objects (took 15 minutes or so)
  • The Sync Server is writing the objects to Atlas at a rate of about 10 objects/second (which would take 20+ hours for the full set)

While the Atlas M0 (Free Cluster), M2, and M5 Limitations documentation does mention some limitations, the throughput numbers listed even for the free cluster are quite a bit higher than what I’m seeing.
Is this expected performance for an M2 cluster?
Is there some overview of expected performance for larger and dedicated cluster types?

Is there any way to see the Sync Server queue (number of unsynchronized objects)? Terminating Sync after schema upgrades seems quite risky if there’s no way to ensure that all pending changesets have been written to Atlas.
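
For now, the only workaround I can think of is a rough client-side check along these lines: count the documents that have actually landed in Atlas via the SDK’s remote MongoDB access and compare that with the local object count. This is only a sketch — “mongodb-atlas” is the default linked data source name, and the database and collection names are placeholders for my schema.

```swift
import RealmSwift

// Rough sanity check before terminating Sync: ask Atlas how many documents
// have actually been written and compare with the local object count.
// "mongodb-atlas" is the default linked data source name; the database and
// collection names below are placeholders.
func printAtlasCount(app: App, expectedLocalCount: Int) {
    guard let user = app.currentUser else { return }

    let collection = user
        .mongoClient("mongodb-atlas")
        .database(named: "MyDatabase")
        .collection(withName: "Item")

    collection.count(filter: Document()) { result in
        switch result {
        case .success(let atlasCount):
            print("Atlas has \(atlasCount) of \(expectedLocalCount) expected objects")
        case .failure(let error):
            print("Count failed: \(error)")
        }
    }
}
```

Is something along these lines really the best option, or is there a proper way to inspect the Sync queue?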

Hi. I’m not sure I 100% follow your bullet points about what is being “uploaded” from where to where. I will say that we recommend always using a dedicated instance with sync. This is because:
(a) You can scale up to bigger instances without terminating sync (scaling a shared tier, e.g. M2->M5 or M2->M10+, is not possible without terminating sync)
(b) You get access to cluster metrics
(c) You do not get any rate limiting. One of the biggest issues with Sync on shared clusters is that rate limiting is applied to them.

Thus, for any performance-related testing I would highly recommend using >= M10.

If you still have any issues, let me know; if you send me the URL to your app in the Atlas / Realm UI, I can take a look at the logs.

Thanks,
Tyler

Thanks, Tyler.

I’ve upgraded to an M10 instance, which massively improved performance. Since the project is still in development, performance is not the main issue and I’m more worried about losing data in production later on.

Here’s again what happened on the M2 instance:

  1. I imported 600,000 objects into a macOS app that’s using a Realm database (with Atlas Device Sync). The import, which ended up in the local .realm file, took 1 minute. The objects were written to the Realm in batches of 1000 objects (see the sketch after this list).

  2. The Realm SDK then automatically started to upload the new objects from the local .realm file to the Realm Sync server (Atlas App Services / Device Sync). This took about 15 minutes, iirc.

  3. The Realm Sync server apparently buffered the whole changeset and then started transferring the new data into Atlas at a rate of less than 10 objects/second. At that rate, writing 600,000 objects to Atlas would take more than 20 hours.

  4. When I terminated Device Sync in Atlas App Services, the queue was wiped. 500,000 objects that weren’t yet synchronized to Atlas were gone server-side.
    The warning that appears when terminating Device Sync doesn’t mention such data loss. I also couldn’t find any way to see the status of Device Sync (e.g. “Writing 500,000 objects to Atlas…” or something along these lines).
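
For reference, the batched import from step 1 looked roughly like this; the Item class here is a simplified stand-in for my actual schema.

```swift
import RealmSwift

// Simplified stand-in for the actual object schema.
final class Item: Object {
    @Persisted(primaryKey: true) var _id: ObjectId
    @Persisted var name = ""
}

// Step 1: import the objects into the local, synced Realm in batches of
// 1000 so that each write transaction stays reasonably small.
func importItems(_ items: [Item], into realm: Realm, batchSize: Int = 1000) throws {
    var index = 0
    while index < items.count {
        let end = min(index + batchSize, items.count)
        try realm.write {
            realm.add(Array(items[index..<end]))
        }
        index = end
    }
}
```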

On the M10 instance, the transfer from Device Sync to Atlas was basically done as soon as the data was uploaded. Still, it would be great to have the ability to see the status/queue of Device Sync in the MongoDB Cloud dashboard.
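
For what it’s worth, the only client-side visibility I’ve found so far is the SDK’s upload progress notification, which covers step 2 (device to Sync server) but says nothing about step 3 (Sync server to Atlas), where the backlog actually built up. Roughly:

```swift
import RealmSwift

// Observe upload progress for the currently outstanding local changes.
// Note: this only tracks device -> Sync server upload; it does not tell you
// whether the Sync server has finished writing those changes to Atlas.
// The returned token must be retained for notifications to keep firing.
func observeUploadProgress(for realm: Realm) -> ProgressNotificationToken? {
    return realm.syncSession?.addProgressNotification(
        for: .upload,
        mode: .forCurrentlyOutstandingWork
    ) { progress in
        print("Uploaded \(progress.transferredBytes) of \(progress.transferrableBytes) bytes")
        if progress.isTransferComplete {
            print("All local changes have been uploaded to the Sync server")
        }
    }
}
```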

Hi, glad to see that the M10 was a much better experience. I have found that it is the minimum required tier for any sync application with real load (600,000 objects is non-trivial).

As for the last question, we are planning on releasing a series of metrics endpoints and visualizations in the coming months. One of the metrics is not the “size” of the queue but rather the “lag”: the time from when a changeset is received by the server to when it is inserted into Atlas. I suspect that will be helpful to you.

Hello Andreas,

I see my colleague Tyler has done an amazing job resolving this; I would like to add some background context for the behavior you observed.

These behaviors are common on shared tier clusters.

Information about shared tier clusters

  • Shared tier clusters share resources between all tenants on the same tier.
  • The resources provided to each tenant increase with each tier level.
  • M0 has the smallest amount of resources in the shared tiers, while M5 has the most.
  • Higher shared tiers do, in theory, gain better throughput from the networking and writing services, due to lower tenant density.
  • All tenants on the same shared tier cluster share the same networking, writing, and other resources.

Cluster Recommendations
MongoDB does not recommend shared tier clusters for full production environments. At the least, we encourage an M10 dedicated cluster for a production environment with extreme lows in traffic, though an M20 would be the better starting point. Our overall official recommendation is an M30 cluster for production environments.

Notice
We cannot guarantee stability for a live production environment on a shared tier cluster for the reasons mentioned above, as these clusters are intended for educational and development environments. As a shared cluster gains tenant density, the available resources beyond the RAM and CPU allotted to each tenant at creation are spread more thinly.

The jump in performance once you moved to the M10 comes from having dedicated resources, such as the writer, that on a shared tier instance would otherwise be shared, as mentioned above.

I hope this clears up why you observed such a difference with the M10.

Regards,
Brock
