Aggregate stage lighter with a $addField before

Axel_Morvan · December 28, 2022, 10:37am

Hello,

I’m using MongoDB 4.2 for now and use it with a NodeJs application with Mongoose 6.0.14.

I’ve created an aggregate (see below) to retrieve some data and it has a strange behavior. With the base aggregate too many documents are filtered and I receive an error 4031700 (document constructed by $facet is 104857724 bytes, which exceeds the limit of 104857600 bytes) on the $facet stage.
But if I add an empty $addFields stage before my $lookup, there is no more error. In fact, I can retrieve about 10 times more documents with the base query filter in the $match stage.
This behavior is the same when replacing the $addFields by a $sort or a $project with all fields within documents.

So my question is : Do aggregates have a specific behavior with $addFields, $sort or $project concerning the way the data is stored for later stages ?

My aggregate :

[
    { $match: my_global_query },
    {
        $facet: {
            outputField1: [
                { $match: my_specific_query },
                { $unwind: { path: "array" } },
                { $addFields: {} }, // Miraculous $addField (or $sort or $project)
                {
                    $lookup: {
                        from: "collection",
                        localField: "local_field",
                        foreignField: "foreign_field",
                        as: "output",
                    },
                },
            ],
        },
    },
];

steevej · December 28, 2022, 1:42pm

Your pipeline seems to be highly redacted so it is hard to make a real assessment of what is going on.

It is really hard to imagine that increasing the data would reduce the data of the $facet. May be some detrimental optimization is performed without the empty $addFields. A $project that eliminates data, yes. A $sort may be, as some data might be piped to the next stage faster.

Do you get the error if you only $match and $unwind? That is without any other stages in the $facet pipeline.

Doing $unwind inside the $facet might be useless and the culprit as it multiplies the data. You do not need to $unwind even if localField refers to the array. The $lookup is smart enough to get all the elements of the array.

Having $facet with only 1 field is useless, but again you might have more but the redacted pipeline you shared stops us from doing a real assessment.

Axel_Morvan · December 30, 2022, 9:38am

Thank you for response.

I understand that not having a lot of details about the aggregate make it hard to understand, but I’m using it in a professional context and so I cannot shared sensitive information, including the schemas. However, I’ve just noticed that I forgot a stage in the upper aggregate (another $unwind) and it should looks like :

[
    { $match: my_global_query },
    {
        $facet: {
            outputField1: [
                { $match: my_specific_query },
                { $unwind: { path: "array" } },
                { $addFields: {} }, // Miraculous $addField (or $sort or $project)
                {
                    $lookup: {
                        from: "collection",
                        localField: "local_field",
                        foreignField: "foreign_field",
                        as: "output",
                    },
                },
                { $unwind: { path: "second_array" } }
            ],
        },
    },
];

With only the $match and first $unwind, I have no issue. I think this is the second $unwind which create a lot of data (all of them are unique). The fact is even if I $project all existing fields (so there is theoretically no eliminated data) or I use $sort or $addFields, I have no longer any problem. Maybe MongoDB is doing some weird operations to transform data and change the way it is memory stored in one case, what do you think ?

About the $facet with only 1 field, it’s because I omited the second one to not be confusing as it is similar to the first one with only one $unwind). The complete aggregate have hundreds of lines to avoid javascript compute (maybe not the best idea we had, but we do not have the time to re-write it for now). So I only focussed on the problematic stage.