How to use aggregation on a large collection?

I am using MongoDB Atlas M10. I want to transform all document data into formatted tree data using the aggregation framework. It only works up to a certain number of documents.
With a larger number of documents I get the error below:
“MongoError: BSONObj size: 20726581 (0x13C4335) is invalid. Size must be between 0 and 16793600(16MB)”

I have already set allowDiskUse to true, but I still get that error.

Is there a solution for this error?

Below are my aggregation stages:

model.aggregate([
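        // Stage 1: select the root documents (parent: 0)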
        {
            $match: {
                parent: 0
            }
        },
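        // Stage 2: recursively collect all descendants of each root ("level" records the depth)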
        {
            $graphLookup: {
                from: appId + "_" + viewName + "s",
                startWith: "$id",
                connectFromField: "id",
                connectToField: "parent",
                depthField: "level",
                as: "data"
            }
        },
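        // Stage 3: drop bookkeeping fields from the collected descendants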
        {
            $unset: [
                "data._id",
                "data.createdAt",
                "data.updatedAt",
                "data.updateBy"
            ]
        },
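        // Stage 4: one document per descendant; keep roots that have no descendants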
        {
            $unwind: {
                path: "$data",
                preserveNullAndEmptyArrays: true
            }
        },
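        // Stage 5: deepest levels first, so the tree can be assembled bottom-up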
        {
            $sort: {
                "data.level": -1
            }
        },
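        // Stage 6: regroup the descendants per root, now ordered deepest-first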
        {
            $group: {
                _id: "$id",
                parent: {
                    $first: "$parent"
                },
                value: {
                    $first: "$value"
                },
                type: {
                    $first: "$type"
                },
                data: {
                    $push: "$data"
                }
            }
        },
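        // Stage 7: fold the depth-ordered descendants into a nested tree:
        // "prevData" holds the finished nodes of the deeper level, and each node of
        // the current level picks its children out of it before joining "presentData"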
        {
            $addFields: {
                data: {
                    $reduce: {
                        input: "$data",
                        initialValue: {
                            level: -1,
                            presentData: [],
                            prevData: []
                        },
                        in: {
                            $let: {
                                vars: {
                                    prev: {
                                        $cond: [
                                            {
                                                $eq: [
                                                    "$$value.level",
                                                    "$$this.level"
                                                ]
                                            },
                                            "$$value.prevData",
                                            "$$value.presentData"
                                        ]
                                    },
                                    current: {
                                        $cond: [
                                            {
                                                $eq: [
                                                    "$$value.level",
                                                    "$$this.level"
                                                ]
                                            },
                                            "$$value.presentData",
                                            []
                                        ]
                                    }
                                },
                                in: {
                                    level: "$$this.level",
                                    prevData: "$$prev",
                                    presentData: {
                                        $concatArrays: [
                                            "$$current",
                                            [
                                                {
                                                    $mergeObjects: [
                                                        "$$this",
                                                        {
                                                            data: {
                                                                $filter: {
                                                                    input: "$$prev",
                                                                    as: "e",
                                                                    cond: {
                                                                        $eq: [
                                                                            "$$e.parent",
                                                                            "$$this.id"
                                                                        ]
                                                                    }
                                                                }
                                                            }
                                                        }
                                                    ]
                                                }
                                            ]
                                        ]
                                    }
                                }
                            }
                        }
                    }
                }
            }
        },
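        // Stage 8: keep only the top-level nodes (each now carrying its nested children)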
        {
            $addFields: {
                data: "$data.presentData"
            }
        }
    ]).allowDiskUse(true)

Hi Eden,

Looks like some stages of your pipeline are hitting the 16MB BSON limit. My understanding is that you need to make sure that every document output by each stage of your pipeline is less than 16MB (in your example, one of your stages is blocked from outputting a ~21MB document).

When I hit this problem for the first time, I also felt that MongoDB's documentation could have done a better job of proposing possible solutions, or examples of solutions, instead of just stating the limit.
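For anyone debugging this: one way to see which documents are getting close to the limit is the $bsonSize operator (available in MongoDB 4.4+). A diagnostic sketch, to append to a prefix of the failing pipeline:

model.aggregate([
    // ...your stages up to the one you suspect go here...
    // then measure each resulting document and show the biggest ones
    { $project: { approxBytes: { $bsonSize: "$$ROOT" } } },
    { $sort: { approxBytes: -1 } },
    { $limit: 5 }
])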

Xavier Robitaille
Feather Finance

For reference, one possible solution to consider is to add a $project stage early to exclude fields that are non-essential to your query, and which use up part of the 21MB.

Exclude Fields with $project:

{ $project: { "<field1>": 0, "<field2>": 0, ... } } // Return all but the specified fields

add a $project stage early to exclude fields that are non-essential

This is (usually) bad advice. You never need to do this, because the pipeline already analyzes which fields are needed and only requests those fields from the collection.

You can see this by using explain; look for the fields section in the explain output.
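For example, a sketch (the collection, fields, and pipeline here are made up, and the exact layout of explain output varies by server version):

db.orders.explain().aggregate([
    { $group: { _id: "$status", total: { $sum: "$amount" } } }
])
// the initial $cursor stage of the output includes a "fields" section,
// e.g. fields: { amount: 1, status: 1, _id: 0 }, showing which fields
// the engine decided it needs from the collection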

Asya

4 Likes

@edenOo if you’re doing $graphLookup from a view, could you reduce the size of the view? I see you are unsetting several fields that come from the view, but excluding them upfront may limit the size of the entire tree enough to fit into 100MBs.

Note that $graphLookup is fundamentally limited to 100MB and cannot spill to disk. So if the expected tree structure is bigger than 100MB, you'll probably need to find a different solution to your problem. Maybe give us more details about what the data is and what exactly you are trying to do with it?
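For example, a minimal sketch (the view name is hypothetical, and "myapp_items" stands in for the appId + "_" + viewName + "s" collection in your pipeline):

db.createView(
    "items_slim",      // hypothetical view name
    "myapp_items",     // the collection $graphLookup currently reads from
    [ { $unset: [ "createdAt", "updatedAt", "updateBy" ] } ]  // drop the fields you $unset later anyway
)
// then point the $graphLookup "from" at "items_slim" and remove the later $unset stage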

Asya

3 Likes

@Asya_Kamsky thanks for stepping in. The reason why I stumbled on Eden’s post is that I had this problem myself, and I was looking for the best way to solve it.

Let me describe my use case: our web app handles stock market transactions (aka account "activities"), and we use a Bucket Pattern because many of our users have 20k-50k transactions/activities in their account (i.e. several times the 16MB limit). Our use case is pretty much exactly the example described in these two articles by Justin LaBreck.

I was getting BSON size limit error messages from the following query when querying users with many activityBuckets. I added the { $project: { "activities": 0 } } stage and it solved my problem. The query returns all of the user's activityBuckets, but without the actual activity data (i.e. only the high-level bucket fields).

Would you have recommended a different solution?

//-----------------------------------------------------------------------------------------------------
// get user and all of its activityBuckets (without the actual activities, which would otherwise bust 16MB)
//-----------------------------------------------------------------------------------------------------
db.users.aggregate( [
    { $match: { 'email': 'johndoe@featherfinance.com' } },
    { $lookup: {
        from: "activitybuckets",
        let: { users_id: "$_id" },
        pipeline: [
            // exclude the bulky activities array before returning the buckets
            { $project: { "activities": 0 } },
            {
                $match: {
                    $expr: {
                        $and: [
                            { $eq: [ '$$users_id', "$user" ] }
                        ]
                    }
                }
            }
        ],
        as: "activities"
    } }
] );

The problem you describe is quite different: without the $project in the inner pipeline, you're saying you want all of the document to be in the activities array, and that would make it bigger than the legal BSON size for a single document. $project is needed when you have to tell the engine which fields you want/need.

In the original answer you implied that it's necessary to exclude fields not essential to your query, but the engine will attempt to determine those by itself, based on which fields you use in the pipeline and which you return to the client. So what matters is to specify correctly (the end of the pipeline is usually the best place) which fields you want back. Sometimes, in complex sub-pipelines, where you need to specify that is less obvious.
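For instance (collection and field names here are illustrative only): rely on the engine's dependency analysis for everything upstream, and declare the shape you want once, at the end:

db.users.aggregate([
    { $match: { status: "active" } },
    { $group: { _id: "$country", total: { $sum: 1 } } },
    // the final stage states exactly which fields go back to the client
    { $project: { _id: 0, country: "$_id", total: 1 } }
])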

In the case of $graphLookup, as in the original question, there is a limitation that means there's no way to apply $project or $unset to the looked-up documents, other than by creating a view to make the collection you're running $graphLookup against smaller.

Hope this is more helpful, rather than more confusing :slight_smile:

Asya

2 Likes

P.S. I would put $project after $match inside the sub-pipeline, by the way. I would also use the localField/foreignField syntax; as of 5.0.0 you can use it and still add more pipeline stages (due to https://jira.mongodb.org/browse/SERVER-34927 being implemented).
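Something like this sketch (requires MongoDB 5.0+; collection and field names are taken from the query above):

db.users.aggregate([
    { $match: { 'email': 'johndoe@featherfinance.com' } },
    { $lookup: {
        from: "activitybuckets",
        localField: "_id",      // replaces the manual $expr/$match on user
        foreignField: "user",
        pipeline: [
            { $project: { "activities": 0 } }   // applied to the already-matched buckets
        ],
        as: "activities"
    } }
] );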

1 Like

@Asya_Kamsky thank you so much!

It is much clearer now.

I ran into the same problem. Thanks for the explanation! My problem is solved. But I think it would be much nicer if we could apply a pipeline before $graphLookup, instead of having to create a view.

2 Likes

Hello. We have more than 300M documents in one of our collections, and we have written an aggregation pipeline to separate out the records that fall within a one-year range. The pipeline is pretty simple, with just two stages (sketched below):

  1. Stage 1: $match the documents with a date $gte one year ago.
  2. Stage 2: $out, to add the matched documents to a new collection.

We are using the allowDiskUse option as well. Any suggestions for getting this query to run without issues, given that we have a large data set?
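Roughly, the pipeline looks like this (collection and field names are placeholders standing in for our real ones):

db.events.aggregate(
    [
        // Stage 1: only documents from the last year
        { $match: { createdAt: { $gte: new Date(Date.now() - 365 * 24 * 3600 * 1000) } } },
        // Stage 2: write the matching documents to a new collection
        { $out: "events_last_year" }
    ],
    { allowDiskUse: true }
)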

Without seeing the real pipeline you are running, it is impossible for us to pinpoint any issues you might hit.