Schema sampling fails due to sort memory limit exceeded, how to add allowDiskUse:true

Hi there
This post relates to Compass version 1.28.1. As the title says:

The schema sampling for a collection with large string fields fails because the sort memory limit is exceeded.
The error message suggests the solution: add allowDiskUse:true. Makes sense; it will slow the process down, but the allowed “max time ms” can be increased via the Options button. So far so good. I tried several ways to extend the find or sort, but obviously was not successful…

Just where should allowDiskUse:true be put?

Hi Michael,

That’s an interesting error during schema analysis. How large are the string fields? I ask because Compass should sample no more than 1000 documents while analyzing the schema, so under normal circumstances it should never hit that limit unless the values are huge, I assume.

I’m curious to see someone explain this behavior and suggest a workaround.

Thanks,
Mahi

Hello @mahisatya,

thank you for the information. The behaviour itself is as expected. Compass uses the $sample aggregation stage to collect some random documents, and whether $sample can do this efficiently is governed by a certain rule set:

  • $sample is the first stage of the pipeline :white_check_mark:
  • N is less than 5% of the total documents in the collection :firecracker:
  • The collection contains more than 100 documents :white_check_mark:

The initial sample size for the schema analysis is 1000, but the collection has only 284 documents, so the second condition is not fulfilled:
284 * 0.05 = 14.2, so sampling 14 docs should work and 15 docs should not. This is exactly what happens, please see the screenshots.
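To illustrate the threshold, a minimal mongosh sketch (the collection name `myCollection` is a placeholder for the real one): a sample size of 14 still satisfies the 5% rule, while 15 does not.

```js
// Collection with 284 documents: 5% of 284 is 14.2.
// 14 < 14.2, so $sample can use its optimized random-cursor path.
db.myCollection.aggregate([ { $sample: { size: 14 } } ])

// 15 > 14.2, so $sample falls back to a collection scan plus a random sort,
// which is subject to the 100 MB in-memory sort limit.
db.myCollection.aggregate([ { $sample: { size: 15 } } ])
```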


Since that rule is not fulfilled, the second option is taken:

If any of the above conditions are NOT met, $sample performs a collection scan followed by a random sort to select N documents. In this case, the $sample stage is subject to the sort memory restrictions.

So we do a collection scan followed by a sort that is restricted to 100 megabytes of RAM for in-memory sorting, and this fails.
The error message states the standard “solution”: allow writing to disk, which is enabled by setting allowDiskUse:true.
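For reference, when running the same sampling pipeline manually (outside Compass), allowDiskUse is passed as an option of the aggregate() call, not inside a pipeline stage. A sketch in mongosh, again with `myCollection` as a placeholder name:

```js
// With allowDiskUse enabled, the blocking sort behind $sample may spill to
// temporary files on disk instead of failing at the 100 MB in-memory limit.
db.myCollection.aggregate(
  [ { $sample: { size: 1000 } } ],
  { allowDiskUse: true }
)
```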

So after this round trip the question still is:

Where should allowDiskUse:true be put so that up to 1000 documents can be sampled?

Regards.
Michael

Hi @Michael,

A Compass engineer confirmed allowDiskUse is enabled for sampling (checked Compass code and via server logging), and wasn’t able to reproduce this issue with Compass 1.28.1.
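If you would like to verify this on your own deployment, one option (an illustrative sketch, not necessarily how the Compass engineer checked, and only where the database profiler is available, e.g. not on shared Atlas tiers) is to temporarily raise the profiler level and inspect the aggregation command Compass actually sends:

```js
// Record all operations for the current database (use with care on busy systems).
db.setProfilingLevel(2)

// After triggering schema analysis in Compass, inspect the most recent recorded
// aggregate command; the profiled command document should show whether
// allowDiskUse was set. "myCollection" is a placeholder name.
db.system.profile.find({ "command.aggregate": "myCollection" }).sort({ ts: -1 }).limit(1)

// Reset profiling when done.
db.setProfilingLevel(0)
```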

Can you share some more information about the deployment you are querying:

  • Is this on-premises or Atlas?
  • Are you using a standalone, replica set, or sharded cluster connection?
  • If Atlas, what cluster tier are you using?

Also, what is the average size of documents for this collection? It sounds like this is a similar sampling issue to Chart get's a timeout - how to debug?, where the average document size was problematic. A similar workaround (sampling from a view of the collection with fewer fields) may also be helpful.
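A sketch of that view workaround in mongosh, assuming the oversized field is called `largeStringField` (adjust the names to your schema), is to project the heavy fields away and run schema analysis against the view instead of the base collection:

```js
// Create a view that excludes the large field(s) so that sampled documents
// stay small enough for the in-memory sort behind $sample.
db.createView(
  "myCollection_slim",    // name of the view
  "myCollection",         // source collection
  [ { $project: { largeStringField: 0 } } ]
)
```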

Regards,
Stennie

Hello Stennie

I will quote and add:


It seems that this is still the behaviour described above. Since N is not less than 5% of the collection, we get a collection scan and sort. At least the error message indicates that allowDiskUse is not active.

This is an Atlas cluster, not sharded… and not really big. The document size could be smaller, but ~4.7 MB is well below the 16 MB maximum.

I can provide a dump or access to the cluster; the issue is consistently reproducible.

Regards,
Michael

Hi @Michael,

What is the Atlas cluster tier? I want to set up a similar cluster to try to reproduce the issue.

I recall you using dedicated clusters, but would note that shared clusters (M0, M2, M5) do not support the allowDiskUse option.

If the data isn’t confidential, a dump would be helpful for testing. DM me with details :).

Thanks,
Stennie


Hello @Stennie

Thanks for the note, now it becomes clear. I moved a staging cluster to M2; I was aware that there are restrictions on M0 but forgot to check for M2, even though your link is actually on my link list. My fault. I double-checked with an M50 cluster, and there I don’t get the message. It still fails on the M50 cluster, but with a timeout, which is another issue mentioned in Chart get’s a timeout - how to debug?. The workaround in that post is to use a view, which generally works but is suboptimal for a schema analysis.
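As an aside, the equivalent manual sampling command with a raised server-side time limit would look roughly like this (placeholder collection name, illustrative values), similar to increasing “max time ms” via the Options button in Compass:

```js
// Same sampling pipeline, with disk spilling allowed and a 5 minute
// server-side time limit instead of the client default.
db.myCollection.aggregate(
  [ { $sample: { size: 1000 } } ],
  { allowDiskUse: true, maxTimeMS: 300000 }
)
```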

Thanks a lot Stennie!
Michael


This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.