Using Views, lookup and sort

Michal_Janocha · January 7, 2022, 11:38am

Hello. I’ve a collection (let’s call it COL_A) with ~25k documents simillar to newly introduced Timeseries collection (basically data with timestamps), to easily access newest documents in this collection I’ve created a view (VIEW_A). I can’t put more than one image in the post, but view aggregation is built of first 4 stages (group, replaceRoot, match, sort) visible on the next screenshot.

I am trying to join VIEW_A with second view (VIEW_B) using lookup and sort VIEW_A using field from VIEW_B. If I am trying to lookup COL_B (~9k documents) from VIEW_A it works perfectly fine (and fast), even with ‘double sorting’:

But when I try to lookup VIEW_B from VIEW_A it takes a lot more time. I would like to use Views anyway because we’ve a lot of collections and all of them are based on the same “time series” like mechanism. Probably I could get rid off View’s sorting, but then pagination using skip/limit won’t work

The question is:
Am I able to somehow make it running efficiently?

Best regards

John_Page · January 7, 2022, 5:16pm

I have to admit I struggled to understand your question a bit here, rather than screenshots you could copy and paste as code examples if it helps.

What I understand if you have a view on COL_A . Reading the aggregation above it gets the first document for each value if data._id, removes those that have been ‘deleted’ then sorts by _id.

This itself has some issues. Your sort cannot use an index in this case and also you are using $first but not sorting first so you have no guarantee what is first (unless all values are the same an you just want any of them). for ViewA you should first sort (using an index) by data._id. You might also want to put the match before the group too .

This get to the heart of your lookup issue lookup - lookup uses a nested loop search - so for every document in ViewA it will either run a find() (on a collection, ideally using an index) or the aggregation defined in the view - if you view defines an unoptimised aggregation ( like your view on A - you don’t share the view on B) you will run that many times (once per document in A) and the performance will suck.

There are a whole bunch of ways to optimise something like this but unless you can share more - like a minimal example of col_a,col_b and the code of view_a and view_b I can’t help you redo them, aside from the fields you show data._id, projectId and srting.data.projectName you don’t need any other fields in your examples unless you user them in the views.

Michal_Janocha · January 10, 2022, 10:23am

Hello @John_Page, at first, thank you for answer!

I agree, description is a bit overcomplicated. I’ll try to simplify things a bit. I’ll start with views definitions:
VIEW_A:

[{
    $sort: {
        timestamp: -1
    }
}, {
    $group: {
        _id: {
            entId: '$data._id'
        },
        root: {
            $first: '$$ROOT'
        }
    }
}, {
    $replaceRoot: {
        newRoot: '$root.data'
    }
}, {
    $match: {
        'isDeleted': {
            $ne: true
        }
    }
}, {
    $sort: {
        _id: 1
    }
}]

VIEW_B:

[{
    $sort: {
        timestamp: -1
    }
}, {
    $group: {
        _id: {
            entId: '$data._id'
        },
        root: {
            $first: '$$ROOT'
        }
    }
}, {
    $replaceRoot: {
        newRoot: '$root.data'
    }
}, {
    $match: {
        'isDeleted': {
            $ne: true
        }
    }
}, {
    $sort: {
        _id: 1
    }
}]

As you can see views are the same, regarding $match, I can’t move it before group because then I wouldn’t get newest “snapshot” of data. Example:
Collection holds each document change in the system (each removal, update, etc.), in the view I would like to always get newest snapshot (and filter out deleted ones, becase they’re deleted )

Each ‘entity’ has its own set of raw-collections (which contains all snapshots of data) and view which always displays newest snaphsot (or doesn’t display anything if last snapshot had isDeleted flag set to true.

Example aggregation, which works in acceptable way (~2seconds) but points to raw collection instead of the view, this aggregation basically takes all steps from the view but without sort as a lookup pipeline:

[{
    $lookup: {
        from: 'projects-ts',
        'let': {
            projectId: '$projectId'
        },
        pipeline: [{
                $match: {
                    $expr: {
                        $eq: [
                            '$data._id',
                            '$$projectId'
                        ]
                    }
                }
            },
            {
                $group: {
                    _id: {
                        entId: '$data._id'
                    },
                    root: {
                        $first: '$$ROOT'
                    }
                }
            },
            {
                $replaceRoot: {
                    newRoot: '$root.data'
                }
            },
            {
                $match: {
                    isDeleted: {
                        $ne: true
                    }
                }
            }
        ],
        as: 'project'
    }
}, {
    $unwind: {
        path: '$project',
        preserveNullAndEmptyArrays: true
    }
}, {
    $sort: {
        'project.projectName': -1
    }
}, {
    $facet: {
        meta: [{
            $count: 'total'
        }],
        data: [{
            $skip: 0
        }, {
            $limit: 15
        }]
    }
}, {
    $unwind: {
        path: '$meta'
    }
}]

I hope now it is more clear what I am trying to achieve. Two questions:

Am I able to execute this aggregation in optimal way but on a view?
Is it a proper solution at all?