Cursor of the aggregate method fetches only the first batch of data

I have a question about aggregation iteration over a collection.

Problem description:

I created a pipeline (the source of the pipeline is not relevant) and I want to fetch all data using an aggregation over a collection. The source is below. The problem is that only the first batch of data is fetched (that is, 100 items) and then the while loop ends. The iterator doesn't fetch the next batch of data. I tried the find() function (that is, without the pipeline) instead of aggregate() and it works as expected.

Question:

What is wrong? How can I fetch all the data using the aggregate() function?

Java code:

MongoDatabase database = mongoClient.getDatabase(configuration.getDatabase());
MongoCollection<Document> collection = database.getCollection(configuration.getCollection());
List<Bson> pipeline = preparePipeline(collection.getNamespace().getDatabaseName(), collection.getNamespace().getCollectionName());

MongoCursor<Document> iterator = collection.aggregate(pipeline).batchSize(100).cursor();
long count = 0;
while (iterator.hasNext()) {
    Document document = iterator.next();
    log.info("Count: {}, doc: {}", count++, document.toJson());
}
iterator.close();

Hi all,

I finally found the problem. I was wrong when I said the pipeline doesn't matter. The reason was the pipeline definition. It seems the aggregation doesn't like it when the pipeline returns an object like:

{"ns": {"db":"some value"}}

My definition of the non-working pipeline was the following:

[
  {"$replaceRoot":{"newRoot":{"fullDocument":"$$ROOT"}}}, 
  {"$addFields":
    {
      "operationType":"insert",
      "documentKey":{"_id":"$fullDocument._id"},
      "ns":{"db":"customer-db", "coll":"customerdata"}
    }
  }
]

When I changed one of the names, "ns" or "db", to something else, the aggregation started to work as expected. My final version of the working pipeline is here:

[
  {"$replaceRoot":{"newRoot":{"fullDocument":"$$ROOT"}}}, 
  {"$addFields":
    {
      "operationType":"insert",
      "documentKey":{"_id":"$fullDocument._id"},
      "namespace":{"db":"customer-db", "coll":"customerdata"}
    }
  }
]
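For reference, the same working pipeline can also be built with the Java driver's Aggregates and Field helpers instead of raw JSON (a sketch, assuming the sync driver's com.mongodb.client.model classes; the PipelineBuilder class name is mine):

```java
import java.util.Arrays;
import java.util.List;

import org.bson.Document;
import org.bson.conversions.Bson;

import com.mongodb.client.model.Aggregates;
import com.mongodb.client.model.Field;

public class PipelineBuilder {

    // Builds the working pipeline: $replaceRoot wraps the original document
    // under "fullDocument", then $addFields adds the renamed "namespace" field.
    public static List<Bson> buildPipeline() {
        return Arrays.asList(
                Aggregates.replaceRoot(new Document("fullDocument", "$$ROOT")),
                Aggregates.addFields(
                        new Field<>("operationType", "insert"),
                        new Field<>("documentKey",
                                new Document("_id", "$fullDocument._id")),
                        new Field<>("namespace",
                                new Document("db", "customer-db")
                                        .append("coll", "customerdata"))));
    }
}
```

The builders render to the same BSON as the JSON version, so the resulting pipeline can be passed directly to collection.aggregate(...).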

I would be extremely surprised if such a little difference in the pipeline made such a big difference in the result.

I suspect that there is something else at stake and that using namespace rather than ns hides a bug somewhere else in your code.

Your pipeline comes from a function we do not see, and I suspect that you redacted the prepared pipeline to remove some other stages that you think are not related to the issue but are the real culprits behind the different results. To convince myself that my gut feeling is right (sometimes it is wrong), I ran some tests.

I first modified the two pipelines so that I keep the _id of the full document as the top-level _id of the documents. You will see why I did that next. The first pipeline uses $out into a _result collection. The second one uses $merge into _result. The result of this would be a _result collection where ns and namespace both exist for all documents. Any bug on the MongoDB side would result in documents without namespace or documents without ns.

In the few tests I ran, all documents in the _result collection have both fields, which is comforting.
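For what it's worth, the check described above can be sketched like this in Java (my own sketch; the field and collection names come from this thread, the NamespaceCheck class name is mine, and $merge requires MongoDB 4.2+):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.bson.Document;
import org.bson.conversions.Bson;

import com.mongodb.client.model.Aggregates;
import com.mongodb.client.model.Field;

public class NamespaceCheck {

    // Shared prefix: keep the source _id as the top-level _id so that
    // $merge (second run) matches the documents written by $out (first run).
    private static List<Bson> prefix(String nsField) {
        return Arrays.asList(
                Aggregates.replaceRoot(
                        new Document("_id", "$_id")
                                .append("fullDocument", "$$ROOT")),
                Aggregates.addFields(
                        new Field<>("operationType", "insert"),
                        new Field<>("documentKey",
                                new Document("_id", "$fullDocument._id")),
                        new Field<>(nsField,
                                new Document("db", "customer-db")
                                        .append("coll", "customerdata"))));
    }

    // First pipeline: the "ns" variant, written to _result with $out.
    public static List<Bson> outPipeline() {
        List<Bson> stages = new ArrayList<>(prefix("ns"));
        stages.add(Aggregates.out("_result"));
        return stages;
    }

    // Second pipeline: the "namespace" variant, merged into the same
    // _result collection. With the default $merge behavior (whenMatched:
    // merge), every matched document ends up with both ns and namespace.
    public static List<Bson> mergePipeline() {
        List<Bson> stages = new ArrayList<>(prefix("namespace"));
        stages.add(Aggregates.merge("_result"));
        return stages;
    }
}
```

Running outPipeline() first and mergePipeline() second against the same source collection should leave every document in _result with both fields, which is what I observed.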

So my suggestion is to run your code with the hard-coded pipelines that you shared, one after the other. This should confirm that you should look at your preparePipeline() code rather than expect a bug in MongoDB.

To help you further, we need non-redacted sample documents and the non-redacted pipelines produced by preparePipeline().

Since your fields (fullDocument, ns, db, coll, operationType, documentKey) are inspired by (and probably come from) a change stream, I suspect that the ns field in your pipeline collides with the ns field from the change stream.