Cause of poor performance - 1.7 million records

Hi everybody,

The context:

I need to build a report that shows 1 year's worth of data (around 1.7 million records).

The database contains one collection per report; here are the db stats:

/* 1 */
{
    "db" : "dbarc",
    "collections" : 839,
    "views" : 0,
    "objects" : 34745518,
    "avgObjSize" : 3965.76271463272,
    "dataSize" : 137792479785.0,
    "storageSize" : 76157050880.0,
    "numExtents" : 0,
    "indexes" : 9287,
    "indexSize" : 17899974656.0,
    "fsUsedSize" : 276794646528.0,
    "fsTotalSize" : 540052742144.0,
    "ok" : 1.0
}

The problem:

The response time of queries for that amount of data is too long: about 200 seconds for 1.7 million records. Here is the explain plan:

/* 1 */
{
    "queryPlanner" : {
        "plannerVersion" : 1,
        "namespace" : "dbarc.sps",
        "indexFilterSet" : false,
        "parsedQuery" : {},
        "winningPlan" : {
            "stage" : "COLLSCAN",
            "direction" : "forward"
        },
        "rejectedPlans" : []
    },
    "executionStats" : {
        "executionSuccess" : true,
        "nReturned" : 1727172,
        "executionTimeMillis" : 141746,
        "totalKeysExamined" : 0,
        "totalDocsExamined" : 1727172,
        "executionStages" : {
            "stage" : "COLLSCAN",
            "nReturned" : 1727172,
            "executionTimeMillisEstimate" : 141095,
            "works" : 1727174,
            "advanced" : 1727172,
            "needTime" : 1,
            "needYield" : 0,
            "saveState" : 14246,
            "restoreState" : 14246,
            "isEOF" : 1,
            "invalidates" : 0,
            "direction" : "forward",
            "docsExamined" : 1727172
        }
    },
    "serverInfo" : {
        "host" : "MONGO1",
        "port" : 27017,
        "version" : "4.0.2",
        "gitVersion" : "fc1573ba18aee42f97a3bb13b67af7d837826b47"
    },
    "ok" : 1.0
}

The collection has many indexes, including one on a text field which can be large. I'm sure I'm missing something big and basic. I'm new to NoSQL databases; I think the problem could be the total index size, but I have no clue at the moment.

Thank you very much for any help.

Welcome to the forum!

If you look at your execution plan, you are doing a COLLSCAN: nReturned is 1727172 and no index keys are examined (totalKeysExamined: 0). This means the query you are running is not using an index. So I would take another look at the indexes you already have and at the query you are using, and try to build an index that supports it.
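
For illustration, a rough sketch with the Java driver; the field name reportDate is only a placeholder for whatever field your report actually filters on, and database stands for your MongoDatabase:

import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

// Create an ascending index on the field the one-year report filters on,
// so the query can use an IXSCAN instead of scanning every document.
MongoCollection<Document> sps = database.getCollection("sps");
sps.createIndex(Indexes.ascending("reportDate"));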

That’s a very good point from tapiocaPENGUIN. This also means that you are shipping your 1.7 million records over the wire, and if you look at dataSize, that is a lot of data to send.

I would take a look at the aggregation framework, https://docs.mongodb.com/manual/aggregation/. See if you can reduce the amount of data sent over the wire by doing the report computation on the server rather than on the client with the complete data set.
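
Something along these lines, for example (a sketch only; the field names eventDate, category and amount are placeholders, and feedCollection is the collection variable used later in this thread):

import static com.mongodb.client.model.Accumulators.sum;
import static com.mongodb.client.model.Aggregates.group;
import static com.mongodb.client.model.Aggregates.match;
import static com.mongodb.client.model.Filters.and;
import static com.mongodb.client.model.Filters.gte;
import static com.mongodb.client.model.Filters.lt;

import com.mongodb.client.AggregateIterable;
import java.time.Instant;
import java.util.Arrays;
import java.util.Date;
import org.bson.Document;

// Filter one year of data and let the server compute per-category totals,
// so only the summary rows cross the network instead of 1.7 million documents.
Date from = Date.from(Instant.parse("2018-01-01T00:00:00Z"));
Date to = Date.from(Instant.parse("2019-01-01T00:00:00Z"));

AggregateIterable<Document> summary = feedCollection.aggregate(Arrays.asList(
        match(and(gte("eventDate", from), lt("eventDate", to))),
        group("$category", sum("count", 1L), sum("total", "$amount"))
)).allowDiskUse(true);

for (Document row : summary) {
    System.out.println(row.toJson());
}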

You might still have issues if the whole 1.7 million records are needed and you do not have enough RAM.

I would also take a look at the computed pattern at Building With Patterns: The Computed Pattern | MongoDB Blog to see if some subset of the report can be precomputed from historical data that will not change.

Thank you, you are right, but while investigating the situation I found that even when querying on _id, the response time is still too long:

Execution time in seconds : 99
total_records: 1000000

The problem is elsewhere, not in the indexes; I think Steevej pointed it out. Thanks again.

First of all thank you very much for the help.

I do use aggregation:

AggregateIterable<Document> resultAit = feedCollection.aggregate(aggregatePipeline)
        .allowDiskUse(true)
        .useCursor(true);
Iterator<Document> resultIt = resultAit.iterator();

Then I iterate over the documents and create a list of BSON documents that I pass to the application.

The problem is, in fact, that I run into an out of memory error (GC out of memory, heap memory saturated).
I need all those documents to build the report. I could use a subset of the data for some reports, but I might need that amount of data in the future, so I don't know what the solution might be.
Could using a search engine like Elasticsearch be useful?

P.S.: I've read the computed pattern (and the subset pattern); those approaches could be helpful. Or maybe I could try to read the data locally (back end on the same machine as the db), but I don't think that is a good solution…

Thanks again

Any update/suggestion @AlesRoma? I am facing the same issue, i.e.

AggregateIterable<Document> resultAit = feedCollection.aggregate(aggregatePipeline)
        .allowDiskUse(true)
        .useCursor(true);
Iterator<Document> resultIt = resultAit.iterator();

When I iterate over the documents and create a list of BSON documents, I hit the same problem.

One more question: when we run an aggregation pipeline on a collection with a $merge operation as the last stage, the target data is not updated until we read/iterate the results. Is this the correct behaviour of $merge in an aggregation pipeline?

Hello @Srikanth_Kalakanti, welcome to the MongoDB Community forum!

Please tell us about your actual query, with sample input data and the result you are expecting. Also, include the versions of the MongoDB Server and the Java driver you are working with, and the kind of deployment.

To discuss the performance issue, I'd also like to know the size of the collection. If you can, run explain in "executionStats" mode and post the results.
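
In case it helps, one way to get the executionStats from the Java driver is the explain command (a sketch; database is your MongoDatabase, and "yourCollection" and the empty pipeline are placeholders for the real collection name and stages):

import java.util.Arrays;
import org.bson.Document;

// Ask the server to explain the aggregation in "executionStats" mode.
Document explainCmd = new Document("explain",
        new Document("aggregate", "yourCollection")
                .append("pipeline", Arrays.asList())
                .append("cursor", new Document()))
        .append("verbosity", "executionStats");
System.out.println(database.runCommand(explainCmd).toJson());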


System details:
MongoDB Server 4.4
mongo-java-driver-3.12.7.jar

Consider a collection Employee which has a child field Details (an array of objects):
Employee: first_name, last_name, etc.
Details: an array of objects, where each element contains an _id and some other data (the total size of the array is up to 16 MB).

Since transferring such a large amount of data is difficult with the MongoDB Kafka Connector, we chunked this data (using the _id element in the array and an aggregation pipeline), i.e. we created a new collection (Employee_details) by merging the elements of the Employee collection and the Details array, so that each array element becomes a new document with its _id.

So initially Employee has 5000+ records, and after splitting into chunks there are 242,145 records.

Now, on this new collection (Employee_details), we configured the MongoDB Kafka connector to load the data into a Kafka topic.

We apply the CDC changes on this new collection using the MongoDB aggregation pipeline from a Java program:

pipeline = Arrays.asList(
        match(and(exists("Details", true), ne("Details", new BsonNull()))),
        match(ne("Details", Arrays.asList())),
        unwind("$Details"),
        group("$Details._id",
                first("DetailsId", "$Details.Id"),
                first("GroupType", "$Details.GroupType"),
                ... etc));
AggregateIterable<Document> result = collection.aggregate(pipeline,
        eq("$merge", and(eq("into", "Employee_details"), eq("on", "_id"),
                eq("whenMatched", "merge"), eq("whenNotMatched", "insert"))));

Only when we iterate this AggregateIterable result does the data get updated, and it works for only a few Employee records.
If we process all 5000+ Employee records, we get the error below:

Exception in thread "main" com.mongodb.MongoCommandException: Command failed with error 292 (QueryExceededMemoryLimitNoDiskUseAllowed): 'Exceeded memory limit for $group, but didn't allow external sort. Pass allowDiskUse:true to opt in.' on server xx-shard-00-01.mongodb.net:27017. The full response is {"operationTime": {"$timestamp": {"t": 1615971176, "i": 2}}, "ok": 0.0, "errmsg": "Exceeded memory limit for $group, but didn't allow external sort. Pass allowDiskUse:true to opt in.", "code": 292, "codeName": "QueryExceededMemoryLimitNoDiskUseAllowed", "$clusterTime": {"clusterTime": {"$timestamp": {"t": 1615971176, "i": 2}}, "signature": {"hash": {"$binary": "XXXXXX=", "$type": "00"}, "keyId": {"$numberLong": "6880913173017264129"}}}}

I tried adding allowDiskUse: true to the aggregation, but I still get the same issue.
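
For reference, with the Java driver allowDiskUse is set on the AggregateIterable returned by aggregate(), not inside the pipeline document itself; a sketch, with collection and pipeline being the variables from the code above:

import com.mongodb.client.AggregateIterable;
import org.bson.Document;

// allowDiskUse is an option of the aggregate operation, enabled through
// the fluent API on the returned AggregateIterable.
AggregateIterable<Document> result = collection.aggregate(pipeline)
        .allowDiskUse(true);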

Question:
Once we get the result object from the aggregation pipeline, is it compulsory to iterate the result (AggregateIterable) for the CDC changes to be applied?

Code identified as causing the memory issue:

ArrayList<Document> results = new ArrayList<>();
for (Document document : result) {
    results.add(document);
}

Since the code above is causing the memory issue, is there any alternative way of processing the result (AggregateIterable) to update the DB?

Is iterating the result (AggregateIterable) mandatory in order to update the DB?

Observation: when we do not read this result object, the DB updates do not happen.
So what is the best option for reading this result (AggregateIterable)?
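
As a side note, if each document only needs to be handled once, a streaming sketch like this avoids building the whole list in the Java heap (processDocument is a hypothetical per-document handler):

import com.mongodb.client.MongoCursor;
import org.bson.Document;

// Stream the results one document at a time instead of collecting them
// into an ArrayList; only the current document is held in memory.
try (MongoCursor<Document> cursor = result.iterator()) {
    while (cursor.hasNext()) {
        Document document = cursor.next();
        processDocument(document); // hypothetical per-document handler
    }
}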

Hello @Srikanth_Kalakanti, please clarify these:

I tried adding allowDiskUse: true to the aggregation, but I still get the same issue.

What is the issue, i.e. what error? Please post the error message from the aggregation after adding allowDiskUse: true to the pipeline.


Question:
Once we get the result object from the aggregation pipeline, is it compulsory to iterate the result (AggregateIterable) for the CDC changes to be applied?

You are getting a result from the aggregation as an AggregateIterable. I am not able to understand what you mean by: "is it compulsory to iterate the result (AggregateIterable) for the CDC changes to be applied?"

What is it you are updating? It is not clear what you are trying to say. Please explain clearly what you are trying to do and what the problem is.

Also, please format the code and error messages properly.

Please find below the error message from the aggregation after adding allowDiskUse: true to the pipeline.

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

Mar 18, 2021 7:15:14 AM com.mongodb.diagnostics.logging.JULLogger log
INFO: Exception in monitor thread while connecting to server xxx-shard-00-04.xxxx.mongodb.net:27017
com.mongodb.MongoException: java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.mongodb.internal.connection.InternalStreamConnection.open(InternalStreamConnection.java:138)
at com.mongodb.internal.connection.DefaultServerMonitor$ServerMonitorRunnable.run(DefaultServerMonitor.java:117)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOfRange(Arrays.java:3664)

My question is:
We take an input source collection, apply the aggregation pipeline, and finally $merge into a new collection, i.e. (pseudocode):

AggregateIterable result = collection.aggregate(pipeline, $merge(destination collection));

So if changes are made in the source collection and we then run the aggregation pipeline and $merge code above, the destination collection does not change until we iterate the result object. To make it update, I iterated the result (AggregateIterable), and that iteration is what causes the GC/memory issue.
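
One possible direction (only a sketch, not something confirmed in this thread): put the $merge stage inside the pipeline itself with Aggregates.merge, then execute it with AggregateIterable.toCollection(), which runs a pipeline ending in $out/$merge on the server without returning documents to the client (worth double-checking that your driver version, 3.12.7, exposes it). Here collection and pipeline are the variables from the earlier code:

import com.mongodb.client.model.Aggregates;
import com.mongodb.client.model.MergeOptions;
import java.util.ArrayList;
import java.util.List;
import org.bson.conversions.Bson;

// Append $merge as the final stage of the pipeline rather than passing it
// as a separate argument to aggregate(); _id is the default "on" field.
List<Bson> pipelineWithMerge = new ArrayList<>(pipeline);
pipelineWithMerge.add(Aggregates.merge("Employee_details",
        new MergeOptions()
                .whenMatched(MergeOptions.WhenMatched.MERGE)
                .whenNotMatched(MergeOptions.WhenNotMatched.INSERT)));

// toCollection() executes the whole aggregation on the server; no documents
// come back to the Java process, so nothing accumulates in the heap.
collection.aggregate(pipelineWithMerge).allowDiskUse(true).toCollection();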

Hi,

actually I have not been able to resolve the problem yet.
I tried to mitigate it by raising the Java RAM size parameters and scaling up the machine.
I also created methods for showing these records on a chart, reducing the projection (and therefore the size of the data).
We are also considering using Elasticsearch as a search engine.

Thank you for sharing your experiences

From the Java API documentation, java.lang.OutOfMemoryError is defined as:

Thrown when the Java Virtual Machine cannot allocate an object because it is out of memory, and no more memory could be made available by the garbage collector.


According to a topic on Oracle’s Java web pages, Understand the OutOfMemoryError Exception, there are various causes for which this error can occur. Each of those causes and possible solutions are explained in the article (which relates to Java SE 8). Your case is: Exception in thread thread_name: java.lang.OutOfMemoryError: GC Overhead limit exceeded:

Cause: The detail message “GC overhead limit exceeded” indicates that the garbage collector is running all the time and the Java program is making very slow progress. After a garbage collection, if the Java process is spending more than approximately 98% of its time doing garbage collection and if it is recovering less than 2% of the heap and has been doing so for the last 5 (compile time constant) consecutive garbage collections, then a java.lang.OutOfMemoryError is thrown. This exception is typically thrown because the amount of live data barely fits into the Java heap, leaving little free space for new allocations.

Action: Increase the heap size. The java.lang.OutOfMemoryError exception for GC Overhead limit exceeded can be turned off with the command line flag -XX:-UseGCOverheadLimit.