allowDiskUse has no effect with $accumulator

I stumbled upon two issues when migrating code from MapReduce to an aggregation pipeline using $accumulator in a $group stage on a MongoDB 6.0.8 server using mongosh 1.10.1.

I explicitly set allowDiskUseByDefault to true and also pass allowDiskUse in the runCommand for the aggregation:

db.adminCommand(
	{
		setParameter: 1,
		allowDiskUseByDefault: true
	}
)
...
db.runCommand({
	"aggregate":sourceCollection,
	"pipeline":pipeline,
	allowDiskUse: true,
	cursor:{},
});

The first was an OOM error once the data fed into the $group stage exceeded a certain size. The average size of the documents going into the $group was 231 KB; the aggregation bombed when processing 620 such documents (~140 MB), but worked up to 612 documents (~138 MB):

MongoServerError: PlanExecutor error during aggregation :: caused by :: Out of memory

I worked around this one by limiting the number of documents processed in one run to stay well below that threshold (I chose a maximum of 200 documents at a time). As this is a time-based aggregation, I cannot freely choose the exact number of documents, since I always need to process whole hours or days respectively.
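
To illustrate the batching, here is a minimal, simplified sketch: sourceCollection is the variable from the runCommand above and _id.t is the time field my hourly documents are keyed by; everything else is made up for the example.

// Process one whole-day window only if it stays below the chosen cap.
const maxDocsPerRun = 200;   // chosen well below the ~620-document failure point
const windowStart = ISODate("2022-11-26T00:00:00Z");
const windowEnd   = ISODate("2022-11-27T00:00:00Z");   // whole-day window
const selection = { "_id.t": { $gte: windowStart, $lt: windowEnd } };

if (db.getCollection(sourceCollection).countDocuments(selection) <= maxDocsPerRun) {
  // safe to aggregate this window in one go
  db.getCollection(sourceCollection).aggregate(
    [ { $match: selection } /* ...rest of the pipeline... */ ],
    { allowDiskUse: true }
  );
} else {
  print("Too many documents in this window, falling back to single whole hours");
}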

The second error I stumbled over was this one:

JavaScript execution interrupted

Each of the processed documents contains an array of objects, and the $accumulator merges these arrays: a sub-document is added to the target array if it is not present yet; when it is, some of its properties are summed up and others are simply overwritten. If the number of sub-documents in these arrays exceeded a certain amount, I got the aforementioned “JavaScript execution interrupted”; in most cases it would be fine, but there were a couple of hours where we suffered a bot attack that led to an excessive number of documents in each array.
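
For illustration, here is a heavily simplified sketch of such an $accumulator; the field names (items, key, count, label, day) are placeholders, not my real schema.

// Entries are matched by "key": numeric properties are summed, others overwritten.
function mergeItems(target, source) {
  for (const item of source) {
    const existing = target.find(e => e.key === item.key);
    if (!existing) {
      target.push(item);               // not present yet: add the sub-document
    } else {
      existing.count += item.count;    // add up numeric properties
      existing.label = item.label;     // simply set the others
    }
  }
  return target;
}

const groupStage = {
  $group: {
    _id: "$day",                       // placeholder grouping key
    items: {
      $accumulator: {
        init: function () { return []; },
        accumulate: mergeItems,        // receives (state, itemsArray)
        accumulateArgs: ["$items"],
        merge: mergeItems,             // partial states are merged the same way
        lang: "js"
      }
    }
  }
};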

I worked around this problem by inserting an additional $set stage before the $group, in which I sort these arrays by a specific key and then limit each array to 2,000 elements. I didn’t experiment to find the exact limit at which it starts to fail, but with the chosen limit of 2,000 I consistently got successful aggregations.
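
On 6.0 that pre-trimming can be expressed with $sortArray plus $slice; a minimal sketch, again with "items" and "count" as placeholder field names:

// Sort each array by a key and keep at most 2,000 elements
// before the documents reach the $group stage.
const trimStage = {
  $set: {
    items: {
      $slice: [
        { $sortArray: { input: "$items", sortBy: { count: -1 } } },
        2000
      ]
    }
  }
};
// trimStage is inserted into the pipeline right before the $group stage.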

So my question is: shouldn’t $group just spill to disk if allowDiskUse is enabled? Why is there still an OOM error, and how else could this have been avoided? MapReduce ran very long and consumed quite some memory, but I could depend on it completing eventually; with the aggregation framework I have a somewhat bad feeling that there may be circumstances where an aggregation bombs and we have to hunt for a workaround. Is there maybe some configuration setting I missed that would allocate more resources to specific aggregations? This is running on a server with 128 GB of RAM, and I assume such a server is not that out of the ordinary, so it would be nice if larger aggregations could actually make use of the available memory.
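
For reference, this is how I check what the server actually reports for the relevant settings; the two internal limits below are the ones I assume apply to blocking $group/sort stages, so treat those parameter names as my assumption:

// Confirm the default actually took effect on the running server:
db.adminCommand({ getParameter: 1, allowDiskUseByDefault: 1 });

// Internal memory limits I assume are relevant; double-check the names
// against the server version in use:
db.adminCommand({
  getParameter: 1,
  internalDocumentSourceGroupMaxMemoryBytes: 1,
  internalQueryMaxBlockingSortMemoryUsageBytes: 1
});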

Hi @Markus_Wollny,

Thanks for providing the very detailed analysis. I briefly did some testing of my own using $accumulator and $group on ~10 million documents with an average object size of 59, but wasn’t able to replicate any OOM errors. For what it’s worth, my test environment was also MongoDB 6.0.8 and I also used mongosh 1.10.1.

To help with replicating this behaviour you’ve experienced, can you provide some further information:

  1. Whether you tuned any default settings on your mongod instance where this error occurred
  2. Details of the pipeline variable (i.e. what the pipeline stages and values are)
  3. Sample documents from any collections involved in the pipeline (a rough sketch for gathering 1. and 3. follows below)
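
For 1. and 3., something along these lines would be enough (a rough sketch only; adjust the collection reference to your setup):

// Dump all server parameters so any non-default values can be spotted:
db.adminCommand({ getParameter: "*" });

// Grab a few representative (redacted) documents from the source collection;
// sourceCollection is the same variable used in your runCommand call:
db.getCollection(sourceCollection).aggregate([ { $sample: { size: 3 } } ]);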

Please redact any personal or sensitive information before posting here.

Regards,
Jason

Hello,

I am sorry for the long delay; unfortunately, due to the deployment of workarounds for the original problem, I no longer have the original raw data that caused it, and I needed some time to compile a new test case that demonstrates the problem. I have provided a ZIP archive of test data and aggregation code that will reproduce the Out of Memory error; you can download the dump of the data and the aggregation code here: Sign in to your account (the link is valid for 30 days, the download size is 561 MB).

The archive contains a BSON dump of a collection of preprocessed hourly documents that need to be aggregated into days (the same code can then be used to aggregate days into weeks and/or months). The source collection has 27.4k documents; the average document size is 151.7 KB.

You’ll want to restore the dump to a test installation and then run the aggregation:

tar -zxvf impression_hourly_collection.tar.gz
mongorestore --db=testdb --collection=bi_impression_hourly ./impression_hourly_collection/track/bi_impression_hourly.bson

time /usr/bin/mongo testdb --verbose -f ./aggregationDemo/aggregateImpressionsToDay.js

This will crash (at least on my machine with 64 GB RAM) with maxDocumentsPerRun = 1850, but will run fine with maxDocumentsPerRun = 1800.

I haven’t tuned anything in the configuration of the mongod instance. I hope this helps to narrow down the issue. I wasn’t able to reproduce the “JavaScript execution interrupted” error with this data, but as I said before, sanitizing the number of nested documents in the arrays did the trick, and that is a good enough solution for my purposes.

There is no personal or sensitive information contained in the data whatsoever.

Kind regards

Markus

@Markus_Wollny - Firstly, thank you for the detailed reproduction dataset and instructions.

After running:

time /usr/bin/mongo testdb --verbose -f ./aggregationDemo/aggregateImpressionsToDay.js

I was not able to get a crash. This is my output (I did alter the /usr/bin/mongo portion to instead use mongosh in my own environment).

Connecting to:		mongodb://127.0.0.1:27017/testdb?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+1.10.1
Using MongoDB:		6.0.8
Using Mongosh:		1.10.1

For mongosh info see: https://docs.mongodb.com/mongodb-shell/

------
   The server generated these startup warnings when booting
   2023-07-27T12:41:42.510+10:00: Access control is not enabled for the database. Read and write access to data and configuration is unrestricted
   2023-07-27T12:41:42.510+10:00: This server is bound to localhost. Remote systems will be unable to connect to this server. Start the server with --bind_ip <address> to specify which IP addresses it should serve responses from, or with --bind_ip_all to bind to all interfaces. If this behavior is desired, start the server with --bind_ip 127.0.0.1 to disable this warning
   2023-07-27T12:41:42.510+10:00: Soft rlimits for open file descriptors too low
------

Loading file: ./aggregationDemo/aggregateImpressionsToDay.js
Limiting time to Sun Nov 27 2022 05:00:00 GMT+1000 (GMT+10:00)
Still here, writing the lock...
Processing selection:
{ '_id.t': { '$lt': ISODate("2022-11-26T19:00:00.000Z") } }
Number of items found:
(node:29056) [MONGODB DRIVER] Warning: cursor.count is deprecated and will be removed in the next major version, please use `collection.estimatedDocumentCount` or `collection.countDocuments` instead 
(Use `node --trace-warnings ...` to show where the warning was created)
1849
Done.
mongosh testdb --verbose -f ./aggregationDemo/aggregateImpressionsToDay.js  1.01s user 0.23s system 0% cpu 7:06.42 total

The environment I’m running this test on has around 16 GB of RAM.

I’ll see what else I can find out digging through the execution stats of the aggregation.
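
For reference, I’m collecting those stats roughly like this ("pipeline" stands in for the stages built by aggregateImpressionsToDay.js):

// Run the same aggregation through explain to see per-stage memory usage and
// whether the $group reports spilling to disk.
const stats = db.bi_impression_hourly
  .explain("executionStats")
  .aggregate(pipeline, { allowDiskUse: true });
printjson(stats);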

Thanks for your patience.

Jason

Ran it a second time with the same maxDocumentsPerRun value of 1850 to see if anything changed:

Loading file: ./aggregationDemo/aggregateImpressionsToDay.js
Limiting time to Sat Dec 10 2022 02:00:00 GMT+1000 (GMT+10:00)
Still here, writing the lock...
Processing selection:
{
  '_id.t': {
    '$lt': ISODate("2022-12-09T16:00:00.000Z"),
    '$gte': ISODate("2022-11-26T19:00:00.000Z")
  }
}
Number of items found:
(node:30490) [MONGODB DRIVER] Warning: cursor.count is deprecated and will be removed in the next major version, please use `collection.estimatedDocumentCount` or `collection.countDocuments` instead 
(Use `node --trace-warnings ...` to show where the warning was created)
1855
Done.
mongosh testdb --verbose -f ./aggregationDemo/aggregateImpressionsToDay.js  1.00s user 0.16s system 0% cpu 4:58.61 total

Here’s what I got when running the demo. The output differs a little from yours, so maybe the issue has something to do with a difference in config. Regarding mongo vs. mongosh: I simply symlinked mongosh to the old name, so I am really running mongosh, too.

~# time /usr/bin/mongo testdb --verbose -f ./aggregationDemo/aggregateImpressionsToDay.js
Current Mongosh Log ID: 64c76de0abfc62b06937b75b
Connecting to:          mongodb://127.0.0.1:27017/testdb?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+1.10.1
Using MongoDB:          6.0.8
Using Mongosh:          1.10.1

For mongosh info see: https://docs.mongodb.com/mongodb-shell/

------
   The server generated these startup warnings when booting
   2023-07-14T08:03:55.986+02:00: Using the XFS filesystem is strongly recommended with the WiredTiger storage engine. See http://dochub.mongodb.org/core/prodnotes-filesystem
   2023-07-14T08:03:58.367+02:00: Access control is not enabled for the database. Read and write access to data and configuration is unrestricted
------

Loading file: ./aggregationDemo/aggregateImpressionsToDay.js
Limiting time to Sat Nov 26 2022 20:00:00 GMT+0100 (Mitteleuropäische Normalzeit)
Still here, writing the lock...
Processing selection:
{ '_id.t': { '$lt': ISODate("2022-11-26T19:00:00.000Z") } }
Number of items found:
1849
MongoServerError: PlanExecutor error during aggregation :: caused by :: Out of memory

real    14m21,631s
user    0m2,134s
sys     0m0,256s

Here’s all of my /etc/mongod.conf:

storage:
  dbPath: /var/lib/mongodb
  directoryPerDB: true
  engine: wiredTiger
  wiredTiger:
     engineConfig:
         directoryForIndexes: true

systemLog:
  destination: file
  logAppend: true
  path: /var/log/mongodb/mongod.log
  # traceAllExceptions: true # this needs to be disabled in production

net:
  port: 27017
  bindIp: 0.0.0.0
  maxIncomingConnections: 20000

processManagement:
  timeZoneInfo: /usr/share/zoneinfo

replication:
   oplogSizeMB: 2048
   replSetName: rs0

So I have enabled replication, though this is the sole member of its set. Could this be causing the issue? Maybe something to do with the oplog? I also noticed that I didn’t get the deprecation warning about using cursor.count, but I assume that this is irrelevant here.

Thank you for looking into this!

Kind regards

Markus

Thanks @Markus_Wollny,

I tried with a similar config file and even increased the maximum number of documents to 10,000, but wasn’t able to get the crash to occur.

jason.tran@M-VNYW6V6WX4 2023-07-31_MongoDB_issue % time mongosh --verbose -f ./aggregationDemo/aggregateImpressionsToDay.js
Current Mongosh Log ID:	64cae7f649cced044592f79d
Connecting to:		mongodb://127.0.0.1:27017/?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+1.10.1
Using MongoDB:		6.0.8
Using Mongosh:		1.10.1

For mongosh info see: https://docs.mongodb.com/mongodb-shell/

------
   The server generated these startup warnings when booting
   2023-08-03T09:32:47.201+10:00: Access control is not enabled for the database. Read and write access to data and configuration is unrestricted
   2023-08-03T09:32:47.201+10:00: This server is bound to localhost. Remote systems will be unable to connect to this server. Start the server with --bind_ip <address> to specify which IP addresses it should serve responses from, or with --bind_ip_all to bind to all interfaces. If this behavior is desired, start the server with --bind_ip 127.0.0.1 to disable this warning
   2023-08-03T09:32:47.201+10:00: Soft rlimits for open file descriptors too low
------

Loading file: ./aggregationDemo/aggregateImpressionsToDay.js
Limiting time to Sat Apr 15 2023 01:00:00 GMT+1000 (GMT+10:00)
Still here, writing the lock...
Processing selection:
{
  '_id.t': {
    '$lt': ISODate("2023-04-14T15:00:00.000Z"),
    '$gte': ISODate("2023-02-04T05:00:00.000Z")
  }
}
Number of items found:
10003
Done.
mongosh --verbose -f ./aggregationDemo/aggregateImpressionsToDay.js  1.09s user 0.19s system 0% cpu 25:55.11 total

I will see if the team knows anything further about the crashes, but it’s a bit difficult without being able to reproduce the issue on my system. That said, could you describe the environment details (RAM, OS, etc.) in which you’re running the script / aggregation?
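
If it’s easier, the output of the following two commands covers most of those details (they only read host and memory info from the server side):

db.hostInfo();           // OS, CPU and total memory of the machine running mongod
db.serverStatus().mem;   // resident / virtual memory currently used by mongod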

Regards,
Jason

Hi,

I’m running MongoDB/mongosh on Debian 11.7 (Bullseye); the system has 32 GB of RAM. This is running on a VMware ESXi host with two CPU cores assigned. The kernel is 5.10.0-23-amd64 #1 SMP Debian 5.10.179-1 (2023-05-12) x86_64 GNU/Linux. The machine has a single ext4 partition with a total size of 75 GB, 36 GB of which are still available. free -m output:

# LANG=C free -m
               total        used        free      shared  buff/cache   available
Mem:           32101       12585       14017          64        5498       18996
Swap:            974         107         867

I have set fs.file-max=786432 and vm.max_map_count=262144 in /etc/sysctl.d/30-mongodb.conf. I have disabled transparent huge pages and adjusted the following in the mongod.service unit:

[Service]
LimitFSIZE=infinity
LimitCPU=infinity
LimitAS=infinity
LimitMEMLOCK=infinity
LimitNOFILE=64000
LimitNPROC=64000
OOMScoreAdjust=-1000

I get mongo packages via apt:

# cat /etc/apt/sources.list.d/mongodb-org-6.0.list
deb [ signed-by=/usr/share/keyrings/mongodb-server-6.0.gpg] http://repo.mongodb.org/apt/debian bullseye/mongodb-org/6.0 main

Kind regards

Markus