Collation unnecessary used in lookup slows down the query

Grzegorz_Skupiewski · June 14, 2023, 1:51pm

Let’s assume we have a collection with field name, which we need to sort using collation. Additionally, we have a second collection, which is going to be joined to the first collection on _id fields of both collections. _id fields are being set by ourselves (it is a string identifier for each document and consists of digits only).

There is an index on field name in first collection. Entire first collection has collation settings

{ collation: { locale: "pl", strength: 2 } }

Example of the pipeline:

[
    {
        $sort: {
            name: 1
        }
    },
    {
        $lookup: {
            from: "col2",
            localField: "_id",
            foreignField: "_id",
            as: "joined"
        }
    },
    {
        $match: {
            joined: {
                $ne: []
            }
        }
    },
    {
        $limit: 10
    }
]

It seems, that even if collation is needed only to sort name in the first collection, it is used as well when comparing both _id fields between collections in lookup stage, therefore unless there is also an index with the same collation settings on joined collection (index on _id) the query is very slow.

Benchmarks: (aggregation is performed on col1 and col2 is being joined)
collation is not being used - 309ms
col1 with collation, col2 without collation - 4296ms
col1 with collation, col2 with collation - 266ms

So this is my question - it is not convenient to set collation for col2, as it is being created every day anew. Is there a way, to sort first collection using collation, but to not suffer from very slow lookup operation?

I have already tested setting collation only on index itself, not the entire collection and using collation options in aggregate, but the result is the same or even worse.