I have a MongoDB cluster containing more than 50 collections. One of those collections holds more than 91M documents (records/rows). Indexes have been properly created on this collection.
I need to count the unique companies in this collection. However, because the collection is so large, I am unable to fetch those records: as soon as I start executing my simple group-by query on that field, CPU usage climbs to 100% and my live server crashes.
Because of these long-running queries, I am also unable to analyze this kind of large collection. Is there any way to write queries that consume less CPU and still return accurate results?
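For reference, a minimal sketch of the kind of group-by I am running (collection and field names are placeholders for the real ones):

```javascript
// mongosh: count documents per company (placeholder names)
db.collection_name.aggregate([
  { $group: { _id: "$field_name", count: { $sum: 1 } } }
])
```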
Query performance depends on indexing. You could try partial index strategies, materialized views, or revisiting your data model (which may change over time), e.g. referencing versus embedding documents. If you can share your query and its explain output, as @Kobe_W commented, we can help you check whether there is any tuning to be done. It would also help if you could briefly describe the type of application that runs this query, how frequently, etc.
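As one example of a partial index, you can index only the documents matching a filter, which keeps the index smaller and cheaper to maintain. A hedged sketch; the `status: "active"` filter is purely hypothetical and depends on your schema:

```javascript
// mongosh: index field_name only on a subset of documents
// (the filter below is a made-up example, adapt it to your data)
db.collection_name.createIndex(
  { field_name: 1 },
  { partialFilterExpression: { status: "active" } }
)
```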
Here are some solutions I came across while searching:
Use $limit & $skip to optimize the query
Apply Indexing to the collection
But the collection on which I am trying to run this query already has proper indexing, and I don't think $skip & $limit are feasible here: they won't give me an accurate count, since some values of field_name may fall beyond the limit.
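To illustrate why $skip/$limit cannot give an accurate distinct count, here is a tiny self-contained sketch with made-up data (plain Node.js, no MongoDB involved):

```javascript
// Hypothetical company values across 10 documents.
const docs = ["acme", "beta", "acme", "gamma", "beta",
              "delta", "acme", "epsilon", "beta", "zeta"];

// Distinct count over the full data: the accurate answer.
const fullCount = new Set(docs).size; // 6

// Distinct count after limiting to the first 5 documents:
// any company that first appears later is silently missed.
const limitedCount = new Set(docs.slice(0, 5)).size; // 3

console.log(fullCount, limitedCount);
```

The limited count undercounts whenever a value's first occurrence lies past the limit, which is exactly the concern above.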
Can you confirm that it stops because of 100% CPU? I suspect instead that you got an out-of-memory error or a 16MB document-size exception during the $group.
The $group stage is blocking: it has to process the whole collection before producing its first result document. So if you have a lot of unique values, the memory consumed by the $group stage might exceed the server's RAM.
You could try the following aggregation, which should cut the memory requirement.
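(The snippet itself did not survive in this thread. As a rough illustration of the idea described below, grouping on bare _id values and doing the $count in a separate stage, a pipeline along these lines; this is a sketch, not necessarily the exact one originally posted:)

```javascript
// Sketch only: $group emits one bare _id per unique field_name value,
// so it accumulates no per-group payload; $count then tallies them.
db.collection_name.aggregate(
  [
    { $group: { _id: "$field_name" } },
    { $count: "total" }
  ],
  { allowDiskUse: true } // let the blocking stage spill to disk instead of exhausting RAM
)
```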
You think it is wrong, but have you tried it, at least on a test collection, to understand what it does?
I would not have shared it if I thought it was wrong.
Where is the other collection? I specified the same collection_name that you used in the original code you shared, so there is no other collection: the $lookup is on the same collection. Doing the $count in a separate stage reduces the amount of memory used by $group. I was focusing my answer on limiting memory because at first you wrote:
which is quite different from
One thing is sure: you have to try what people share, even if you think it is wrong.
You may also try to $sort on field_name:1 before you $group, but only if you have an index with field_name as a prefix; otherwise it will be worse.
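A sketch of that variant, assuming an index whose leading key is field_name (so the $sort becomes an index scan rather than an in-memory sort):

```javascript
// Only worthwhile with an index like { field_name: 1, ... };
// without it, the extra $sort makes things worse, not better.
db.collection_name.aggregate([
  { $sort: { field_name: 1 } },
  { $group: { _id: "$field_name" } },
  { $count: "total" }
])
```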
Another thing: analytic use cases like yours are preferably executed on secondary nodes. That way you can live with the temporary 100% CPU spike.
But frequent 100% CPU might indicate that your hardware is under-specified for the workload.
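One way to route such a query to a secondary from mongosh (assuming a replica set) is to set a read preference on the connection before running the aggregation:

```javascript
// mongosh: direct subsequent reads on this connection to a secondary,
// keeping the CPU spike off the primary that serves live traffic
db.getMongo().setReadPref("secondary")
db.collection_name.aggregate([
  { $group: { _id: "$field_name" } },
  { $count: "total" }
])
```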