Mongo Realm GraphQL performance - response time, query/sec, max load

mba_cat · June 30, 2021, 11:02am

Hi Mongo!

I am currently load testing my MongoDB Realm GraphQL endpoint to identify service bottlenecks / limitations / the theoretical maximum number of users for my app. Is someone at Mongo able to answer the following questions?

My MongoDB setup / configuration:

Cluster:
- M0 testing cluster in AWS / Frankfurt (eu-central-1)
Collection:
- Single collection ‘goals’ with single field ‘name’ a string <50 characters
- The goals collection contains 12 items
Realm:
- Setup as a global app because our users will be global with users in US and Europe
- Auth configured as Email/Pass with a single user
- Realm schema setup on goals with all users read access
Testing:
- For all tests, authenticate once at the beginning of the test and reuse the same token for each query
- Testing query: query { goals { _id name } }
- Ad hoc test: POST testing query to GraphQL endpoint using Postman running on local machine in UK
- Load tests: POST testing query to GraphQL endpoint repeatedly using artillery running on local machine in UK
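
For anyone wanting to reproduce the ad hoc part of this setup without Postman, here is a minimal Python sketch. It assumes the endpoint accepts a JSON body with a "query" key and a Bearer access token in the Authorization header; the URL and token are placeholders you would supply yourself, and the helper names (build_request, time_requests, summarise) are made up for illustration.

```python
import json
import statistics
import time
import urllib.request

# The testing query from the setup above.
QUERY = "query { goals { _id name } }"


def build_request(url, access_token):
    """Build the POST request that carries the GraphQL query as JSON."""
    body = json.dumps({"query": QUERY}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            # Assumption: the token obtained at login is sent as a Bearer token.
            "Authorization": f"Bearer {access_token}",
        },
        method="POST",
    )


def time_requests(url, access_token, count, pause_s=1.0):
    """Send the query `count` times sequentially; return timings in ms."""
    timings_ms = []
    for _ in range(count):
        request = build_request(url, access_token)
        start = time.perf_counter()
        with urllib.request.urlopen(request) as response:
            response.read()  # include body download in the measurement
        timings_ms.append((time.perf_counter() - start) * 1000.0)
        time.sleep(pause_s)
    return timings_ms


def summarise(timings_ms):
    """Reduce a list of timings to the min / median / max figures."""
    return {
        "min": min(timings_ms),
        "median": statistics.median(timings_ms),
        "max": max(timings_ms),
    }
```

Calling summarise(time_requests(url, token, 20)) gives figures comparable to the min / med / max values artillery reports, although a sequential loop like this measures single-request latency only and says nothing about behaviour under concurrent load.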

My questions:

1. Using MongoDB Realm GraphQL, what is the architecture behind the default resolvers provided by Mongo? For example, are they isolated functions running per query / per cluster, or some sort of serverless function? Also, where do these functions run - on cluster / at edge / somewhere else?
2. Does the above change if we use custom resolvers, and are there any limits on memory / runtime for custom resolver functions?
3. When testing ad hoc sequential queries from my local UK machine using Postman, I have noticed that my first query appears to take 2-3 times longer than subsequent queries. For example, the first query response can be up to ~350ms, where the second identical query response can be ~80->120ms. I have also replicated the same behaviour using an online curl tool, which shows the response from a US location at ~1000ms for the first query and 300-600ms for subsequent ones. Is there a warmup period for queries or a cache for subsequent queries?
4. When running repeat ad hoc single queries, the response time can vary significantly. For example, on my cluster under no load and with no other queries running, I can run the exact same query several times at random over 10 minutes from my local UK machine with the response time in the range 80ms → 700ms. Testing the same from a US machine gave a range of 300ms → 1300ms. Is there a reason for this? Are repeat queries routed to the same region for a global Realm app, and would using a dedicated / multi region cluster improve / stabilise this?
   NOTE: I have verified that during the test my own connection latency was stable at 20ms to a local UK server.
5. What is the maximum number of queries that the MongoDB Realm GraphQL endpoint can sustain within a reasonable response time of <200ms? Does this change if using dedicated / multi region clusters, or is this a limit of Realm?
6. Artillery load tests over 180s (3 mins) at up to 30 requests/s show a response of min=61ms / med=72ms / max=533ms. However, load testing at 50 requests/s showed response times peaking randomly during testing, up to 26 seconds in some cases. Is there a reason for the varying response times, and would using a dedicated / multi region cluster improve / stabilise this, or is this a limitation of Realm GraphQL?
   NOTE: When running some more load tests at 50 requests/s I am now seeing response times consistently at min=61ms / med=75ms / max=433ms. Did something change between 2021-06-29 18:00 and 2021-06-30 11:45 UK time?
7. While running the load tests, Atlas shows connections ramping up, but these connections do not seem to be released after the Realm queries complete and still remain 2 hours after the requests were made. When will Realm release these connections, and is there a way to release them faster to prevent other client connections being blocked by these Realm connections?
   NOTE: For example, when connecting using the Mongo javascript client I can close the connection after the query completes, or even time out the connection after a short period.
8. Are we able to see the default function used by Realm (to modify and use it in our own custom resolvers)?

Thank you

Sumedha_Mehta1 · June 30, 2021, 6:52pm

1. We are using essentially a similar architecture to Realm functions for GraphQL. This is a layer of serverless compute separate from the cluster, and we don’t have cold start times - the requests are almost immediately executed.
2. No, given we use the same architecture for both, there shouldn’t be a difference in memory/runtime limits (e.g. at the moment there is a 90s timeout for both, a 256 MB memory limit, etc.).
3. We very recently released caching of the GraphQL schema that gets generated on request, to improve the majority of GQL request times that were taking longer due to generating the GraphQL schema (e.g. it looks like your times reduce by ~75% due to it). There may be occasional cache misses, which will contribute to the variation. If you have feedback on this behavior, do let us know! Also note that this is specific to GraphQL queries rather than all requests that go through Realm (e.g. via the SDK, Realm Functions).
4. This seems more related to the caching than to the region…but I’m curious how often you run into a spike/variation. Are you running these ad hoc single queries within seconds of each other, or is the time gap longer?
5. Are you referring to ‘maximum queries/second’? There are a lot of nuances to this question that depend on your region and cluster size. I’d love to know more about your use-case and what load you’re expecting on your app. (Feel free to message me or email me at sumedha.mehta@mongodb.com if this is something a bit more nuanced or private.)
6. Not sure why this may be the case without a full investigation, but it’s likely due to the M0 cluster.
7. Realm will pool connections in order to improve data access performance to your cluster when requests are made through Realm. The behavior you’re noticing is a side effect of connection pooling. Curious how many connections you had open when this happened.
8. Is there a reason you’re looking to see the default function used by Realm rather than using docs? While you can’t at the moment, basic CRUD functions are available here (documented here).

Hope that helps - in general, using a dedicated cluster will definitely improve request times; we don’t recommend M0s for any production use-case or even for load-testing purposes. Multi-region/multi-cloud clusters will increase availability in case the replica set members in a specific region/cloud go down. Geo-sharding can also help with latency.

mba_cat · June 30, 2021, 8:58pm

Hi Sumedha

Thank you for the quick and detailed reply.

1. Thank you for the information. That does seem to be what I am observing now. Yesterday (2021-06-29) I seemed to be observing a higher ‘first’ run response of >200ms, then subsequent queries stabilised around 80-120ms, but now I am seeing consistent 90-100ms response times.
2. Thank you for the information.
3. Thank you for the information. Yes, it does appear the cache is working for my query and has reduced the response times. Is there any way to use the cache for custom functions?
4. I was running the ad hoc queries at intervals of a few seconds and a few minutes and observed the behaviour, although now this seems to have stabilised, as I assume the cache is now working in Europe. When I run a query using reqbin.com to execute the query from the US it reports 300ms->1300ms as before, although I am dubious about the reported response times as it could be due to their own network.
5. Yes, specifically I am looking at metrics to identify what sort of clusters we will need to support our userbase, and also, if using Realm, how much of an improvement dedicated vs multi region clusters would provide. I appreciate the answer is nuanced and ‘depends’ - more details of my thought process, testing and use case to date are at the end of this post.
6. That is sort of expected. I assume that a more powerful cluster would positively impact the response times and queries/s.
7. The pool is expected and I observed connections increase in Atlas, but I am not seeing all connections released. Please see a snapshot of the connections below.

08:00 - 10:00 - Was running load tests against Realm and observed the expected
connection increase, but it appears to be a reused pool because the connection
count stabilised.
10:00 - 12:00 - Running more load tests and ad hoc queries and observed the pool
increase slightly.
12:00 - 14:15 - Ran various tests using a node server connecting using the
MongoDB javascript driver using serverless functions. Identified that in some
cases the connections were correctly opened and closed; tried the MongoDB
recommended approach to caching the connections but observed that these were
not always released.
14:15 - Stopped testing both Realm and ad hoc functions and expected connections
to drop to ~0 but still see around 80 connections that did not get closed - these
appear to be zombie connections either from Realm or partial leftovers from the
serverless tests.

8. Yes, I am specifically interested to see how some of the default function options such as *_lt / *_gt / *_in etc., and also pagination, are implemented and what best practices they follow. The basic CRUD documentation is awesome but shows only the most basic implementation.
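
To make the question concrete, this is roughly the kind of logic I would expect to write in a custom resolver. It is only a sketch in Python of one possible mapping from suffixed input fields to a MongoDB filter, plus _id-based (keyset) pagination. It is not Realm's actual implementation (which is not visible), and the names to_mongo_filter and page_spec as well as the "priority" field are hypothetical.

```python
# Suffix -> MongoDB operator. Only a small, illustrative subset.
OPERATORS = {
    "lt": "$lt",
    "lte": "$lte",
    "gt": "$gt",
    "gte": "$gte",
    "in": "$in",
    "nin": "$nin",
    "ne": "$ne",
}


def to_mongo_filter(query_input):
    """Translate {"name_in": [...], "priority_gt": 2} style input into a
    MongoDB filter document. Keys without a known suffix become $eq."""
    mongo_filter = {}
    for key, value in query_input.items():
        field, sep, suffix = key.rpartition("_")
        if sep and field and suffix in OPERATORS:
            # e.g. "priority_gt" -> {"priority": {"$gt": value}}
            mongo_filter.setdefault(field, {})[OPERATORS[suffix]] = value
        else:
            # e.g. "name" or "_id" -> exact match
            mongo_filter.setdefault(key, {})["$eq"] = value
    return mongo_filter


def page_spec(base_filter, limit, after_id=None):
    """Keyset pagination: return documents whose _id sorts after the last
    _id of the previous page, in ascending _id order."""
    page_filter = dict(base_filter)
    if after_id is not None:
        id_clause = dict(page_filter.get("_id", {}))
        id_clause["$gt"] = after_id
        page_filter["_id"] = id_clause
    return {"filter": page_filter, "sort": [("_id", 1)], "limit": limit}
```

The resulting filter, sort and limit would then be passed to a find call in whichever driver or function runtime is used. Keyset pagination is chosen here over skip/limit because it does not get slower on later pages, but it only works when the sort key is unique.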

The current use case is fluid but can be summarised as:

Seeking to move our current node REST api servers, which connect to a Mongo database, to a more serverless and resilient architecture, and looking into Mongo offerings.
Ideally I want to use Realm sync but cannot at this time (flutter driver in dev preview)
In the interim, if I can migrate the current api to GraphQL it should help in the preparation for sync and also provide a more easily scalable architecture using Mongo Realm GraphQL.
Realm GraphQL looks like it could work, but I am testing whether the response time is good enough. If it is, then I may be able to work around the current user authentication restrictions and error handling and migrate our current users onto Realm email/password auth…
Ideally Realm auth would enable me to use my existing user collection (there is a feature request for this on the Realm feature request site but I am unsure of timings)
Failing that, I am experimenting with using a third party GraphQL server and connecting it to Mongo Atlas directly (not great, but it would enable more customisation than Realm GraphQL currently offers). The only downside is that running this on serverless architecture means the driver leaves zombie connections open, even when using the Mongo recommended approach of caching the client connection etc. Despite intense investigation there does not seem to be any way round this, and AWS offer their own database to counter the issue when using serverless functions.
Finally, a more complex approach could involve using the third party GraphQL server as a gateway to the Realm GraphQL server. This would increase response time but would bypass the mongo client connection issue, because Realm would manage its own connection pool. The third party GraphQL functions could be deployed on serverless functions at the edge and would then proxy to the global Realm GraphQL backed by several multi region clusters to minimise response times. However, this is even more complex, and ideally I would just prefer to use Realm for everything!
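
For reference, the client caching approach mentioned above looks roughly like the following. This is a language-neutral illustration written in Python with a stand-in client class so it runs without a database. FakeClient, get_client and handler are hypothetical names, and in a real function the factory would construct the actual driver client (for example a MongoClient with a small maxPoolSize and a maxIdleTimeMS, which is an assumption about how one might configure it, not something tested here).

```python
# Module-level cache: survives between invocations for as long as the
# serverless container stays warm, and is lost when it is recycled.
_cached_client = None


def get_client(factory):
    """Return the cached client, creating it on first use only."""
    global _cached_client
    if _cached_client is None:
        _cached_client = factory()
    return _cached_client


class FakeClient:
    """Stand-in for a real driver client; counts how many were created."""

    instances = 0

    def __init__(self):
        FakeClient.instances += 1

    def find_goals(self):
        return [{"_id": 1, "name": "example goal"}]


def handler(event, factory=FakeClient):
    """Entry point called once per request by the serverless platform."""
    client = get_client(factory)  # reused across warm invocations
    return client.find_goals()
```

Calling handler several times in the same process creates only one client. The pattern limits how many clients a warm container opens, but it does not by itself close connections when a container is frozen or recycled, which matches the leftover connections I am seeing.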

Sumedha_Mehta1 · July 7, 2021, 1:07pm

A few thoughts -

The pool is expected and I observed connections increase in Atlas but I am not seeing all connections released, please see a snapshot of the connections below.

Per our engineering team, the ~100 open connections lasting a few hours is expected, especially given we are caching and running some optimizations as part of the serverless layer.

Custom Resolvers should benefit from the caching given they become a part of the GraphQL Schema

Ideally Realm auth would enable me to use my existing user collection (there is a feature request for this on the Realm feature request site but I am unsure of timings)

While Realm auth doesn’t let you explicitly import users, a couple of ways around this would be to use an auth trigger to create a user on every login or use the admin API.

Failing that I am experimenting with using a third party GraphQL server and connecting it to Mongo Atlas directly (not great but would enable more customisation than Realm GraphQL currently offers)

Is there any specific customization you’re looking for that isn’t available on Realm yet? I’d love to hear more about that.

Let me know if I missed something or if you have follow up questions.

mba_cat · July 8, 2021, 12:33pm

Hi Sumedha

… ~100 open connections lasting a few hours is expected, especially given we are caching and running some optimizations …

Thank you for the information, can you clarify - would unused open connections cause the connection limit of the server to be exceeded and prevent new connections or will the cluster optimise and close these connections to allow new ones?

… use an auth trigger to create a user on every login or use the admin API.

I have experimented with both of these. However, Realm assigns a new user ID, which then requires all related user IDs to be updated across all collections. The feature request to allow developers to use their existing users collection would retain these IDs.

Is there any specific customization you’re looking for that isn’t available on Realm yet? …

Most of my desired customizations are already open as feature requests or available at some level:

Ability to directly customise graphql schema
Ability to customise error messages returned to clients
Ability to use custom domain with Realm GraphQL endpoint + auth

But these are just workarounds while I wait for a Realm Flutter SDK with sync (ideally cross-platform for iOS/Android/web). That is the most upvoted request and we have seen a developer preview for it, but there do not appear to be any updates or movement on any of the SDK pages:

Flutter Pub Dev Realm packages - last update Apr 01

Realm package GitHub repository - open issues / unassigned / not commented from 8 Feb

Mongo Feedback Realm Flutter SDK Request - In ‘started’, 8 old comments with only update pointing to GitHub repository

Hope that helps!