Performance issues with processing large volumes of data from a cursor

Hi! The use-case I have here is that I have a large volume of data I’m returning from a collection with one query – up to several million documents. I’m running a series of calculations on this data and generating visualizations with it. Each document is < 300B, but each query is returning millions. I’ve already indexed what I can in the collection and am using projections to return a subset of the document fields, and my actual query time itself (the execution time in the .explain() method) is reasonable. However, I’m seeing really slow times when I try to convert the cursor to a data structure that I can work with, like a list. Is there any way to better optimize this process? I’ve tried using batching and parallelization, but neither was very effective. Any advice at all is appreciated, thank you so much!
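In pseudo-code terms, the pattern is basically: a find() with a projection, then turning the cursor into a list. A simplified sketch of that (PyMongo, with made-up database/collection/field names):

```python
from pymongo import MongoClient

# Placeholder connection string, database, collection, and field names.
client = MongoClient("mongodb://localhost:27017")
coll = client["mydb"]["readings"]

# Projection so only the needed fields cross the wire, and a larger
# batch_size to reduce the number of round trips to the server.
cursor = coll.find(
    {"sensor_id": 42},
    projection={"_id": 0, "timestamp": 1, "value": 1},
    batch_size=10_000,
)

# This is the slow part: materializing millions of documents at once.
docs = list(cursor)
```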

Do your calculations with aggregation rather than downloading the documents and processing them in Python.
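For example (collection and field names below are placeholders), something like this computes the result on the server and returns only a handful of documents instead of millions:

```python
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
coll = client["mydb"]["readings"]  # placeholder names

# The $group stage does the calculation server-side; only the small
# aggregated result set is sent back to the application.
pipeline = [
    {"$match": {"timestamp": {"$gte": datetime(2024, 1, 1, tzinfo=timezone.utc)}}},
    {"$group": {
        "_id": "$sensor_id",                  # placeholder field names
        "avg_value": {"$avg": "$value"},
        "doc_count": {"$sum": 1},
    }},
]
results = list(coll.aggregate(pipeline))
```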

Hi! I would do the calculations with aggregation, but I need all of the fields accessible so they can be used in other places too, and when I'm plotting visualizations that need all the fields, every field has to be returned anyway.

If the explain plan indicates that the query time is reasonable, then the issue is really about transferring millions of documents from the server to your application.

If mongod and your application run on different machines, then you need a faster network.

If mongod and your application run on the same machine, then you need more RAM and/or CPU.

If your mongod is on Atlas, then you might want to try Charts for your visualisation.

More details about your system are needed to provide any further comments.

One last thing, about how you convert the cursor to your data structure.

How do you convert? One document at a time, or do you download all documents into an array and then convert the whole array? The latter might require more memory because all documents in raw form and all documents in your data structure are present at the same time. Converting one by one might reduce the memory requirement.
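To illustrate the difference (again with placeholder names, assuming PyMongo):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
coll = client["mydb"]["readings"]  # placeholder names

# Option A: materialize everything first. At the peak, both the raw decoded
# documents and the converted structure are held in memory at the same time.
cursor = coll.find({}, projection={"_id": 0, "timestamp": 1, "value": 1},
                   batch_size=10_000)
docs = list(cursor)
points = [(d["timestamp"], d["value"]) for d in docs]

# Option B: convert while iterating the cursor. Only one batch of raw
# documents plus the growing result structure is in memory at any moment.
cursor = coll.find({}, projection={"_id": 0, "timestamp": 1, "value": 1},
                   batch_size=10_000)
points = [(d["timestamp"], d["value"]) for d in cursor]
```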