Running a query on a huge collection brings CPU usage to 100%

I have a MongoDB cluster containing more than 50 collections. One collection contains more than 91M documents (records/rows), and it is properly indexed.
My query fetches a count of the unique companies in this collection. However, because the collection is so large, I am unable to get those results: as soon as I start executing my simple group-by query on that field, CPU usage climbs to 100% and my live server goes down.

Because of these long-running queries, I am also unable to analyze large collections like this one. Is there any way to write queries that consume less CPU and still return accurate results?


Can you show the output of explain()?

Hello Mehul, welcome to the MongoDB community!

Query performance depends on indexing. You could try partial index strategies or materialized views, revisit your data model (which may need to change over time), and reconsider whether to reference or embed documents; see the sketch after this paragraph. If you can share your query and the explain output, as @Kobe_W suggested, we can help you see if there is any tuning to do. It would also help if you could briefly describe the application that runs this query, how frequently it runs, etc.
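For example, the materialized-view idea could pre-compute the per-company counts on a schedule with $merge. This is a minimal sketch on my part, not something from the thread; the summary collection name company_counts is hypothetical:

db.getCollection("collection_name").aggregate([
    // Group once, offline, instead of on every analytic query
    { "$group" : { "_id" : "$field_name", "count" : { "$sum" : 1 } } },
    // Upsert the results into a small summary collection (hypothetical name)
    { "$merge" : { "into" : "company_counts", "whenMatched" : "replace" } }
]);

// Later analytic queries read the small summary collection instead:
db.getCollection("company_counts").countDocuments({});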

Thanks for responding.

Query: this query fetches a distinct count of field_name (I can't share the real field name for confidentiality reasons).
collection_name → a huge collection with more than 99M records (documents)

db.getCollection("collection_name").aggregate([
    {
        "$group" : {
            "_id" : "$field_name",               // one group per distinct value
            "count" : { "$sum" : NumberInt(1) }  // number of documents per value
        }
    }
]);

Hope this information helps you. @Kobe_W @Samuel_84194

Here are some suggestions I found while searching online:

  • Use $limit & $skip to optimize the query
  • Apply Indexing to the collection

But the collection I am querying is already properly indexed, and I don't think skip & limit are feasible here: they would not give me an accurate count, since some values of field_name could fall beyond the limit.

Can you confirm that it stops because of 100% CPU? I rather suspect that you hit an OOM or a 16MB exception during the $group.

The $group stage is blocking: it has to process the whole collection before producing its first result document. So if field_name has a lot of unique values, the memory consumed by the $group stage might exceed the server's RAM.
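As a first mitigation (my sketch, not something the original poster mentioned trying), you could pass the allowDiskUse option so that blocking stages spill to temporary disk files instead of holding everything in RAM:

db.getCollection("collection_name").aggregate(
    [ { "$group" : { "_id" : "$field_name", "count" : { "$sum" : 1 } } } ],
    { "allowDiskUse" : true }   // let $group spill to disk rather than exhaust memory
);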

You could try the following aggregation, which should cut the memory requirement.

// First stage: emit one document per distinct value of field_name
distinct = { "$group" : {
    "_id" : "$field_name"
} }
// Second stage: $lookup into the same collection (a self-join) to count
// the documents matching each distinct value; the inner $count avoids
// materializing the matched documents themselves
// (note: combining localField/foreignField with pipeline needs MongoDB 5.0+)
count = { "$lookup" : {
    "from" : "collection_name" ,
    "localField" : "_id" ,
    "foreignField" : "field_name" ,
    "as" : "count" ,
    "pipeline" : [ { "$count" : "result" } ]
} }
db.getCollection( "collection_name" ).aggregate( [ distinct , count ] )

@steevej Actually, I have to stop my query because of the 100% CPU; my DBA kills it.

So, I need to find a way to run queries on huge collections without driving CPU usage up like this.

And I think the query you sent me is wrong, because I am running a query on a single collection; there is no need for another collection, so why should I use $lookup?

You think it is wrong, but have you tried it, at least on a test collection, to understand what it does?

I would not have shared it if I thought it was wrong.

Where is the other collection? I specified the same collection_name that you used in the original code you shared, so there is no other collection: the $lookup joins the collection back to itself. Doing the $count in a separate stage reduces the amount of memory used by $group. I was focusing my answer on limiting memory because at first you wrote that your live server crashes, which is quite different than your DBA killing the query.

One thing is sure: you have to try what people share, even if you think it is wrong.

You may also try a $sort on field_name:1 before you $group, but only if you have an index with field_name as a prefix; otherwise it will be worse. A sketch of that variant follows.
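Something like this (my sketch of the idea, assuming an index on field_name already exists):

// Assumes an index such as: { "field_name" : 1 }
db.getCollection("collection_name").aggregate([
    { "$sort" : { "field_name" : 1 } },       // index-backed sort, no in-memory sort
    { "$group" : { "_id" : "$field_name" } }  // distinct values arrive pre-ordered
])

With the index in place, the planner can satisfy the sorted $group from the index rather than scanning every document.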

Another thing: analytic use-cases like yours are best executed on secondary nodes. That way you can live with the temporary 100% CPU spike; a sketch of how to do that follows.
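In mongosh that could look like this (a sketch; it assumes your cluster actually has a secondary available to read from):

// Route reads on this connection to a secondary when one is available
db.getMongo().setReadPref("secondaryPreferred")
// then run the heavy aggregation as before
db.getCollection("collection_name").aggregate([ { "$group" : { "_id" : "$field_name", "count" : { "$sum" : 1 } } } ])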

But frequent 100% CPU might indicate that your hardware is under-specified for the workload.


Sorry @steevej, you are correct. I had forgotten that concept for a while and missed that you were doing a self-join. When I tried it, it ran properly even on the huge collection.

I saved lots of time.
THANKS BRO. It was a great help.


Please mark one of the posts as the solution.


To quote @steevej

And the difference between the two explain plans makes me irrationally mad at the query planner.

