I am working with a set of ~5M items. I have two collections:
keywords.metrics
[
{
keyword: "tesla",
metrics: {
rating: 1000,
...
}
},
...
]
indexed: keyword
, metrics.rating
keywords.ancestries
[
{
keyword: "tesla model x",
ancestry: [
{
keyword: "tesla"
},
...
]
},
...
]
indexed: ancestry.keyword
The task is:
For a given keyword, retrieve all descendants and sort them by
metrics.rating
.
In the most extreme case, a single keyword can have up to 100k descendants.
The following aggregation pipeline executed on keywords.ancestries
will do that:
[
{
$match:
{
"ancestry.keyword": "tesla",
}
},
{
$lookup:
{
from: "keywords.metrics",
let: {
keyword: "$keyword",
},
pipeline: [
{
$match: {
$expr: {
$eq: ["$keyword", "$$keyword"],
}
}
}
],
as: "metrics"
}
},
{
$sort:
{
"metrics.metrics.rating": 1
}
}
]
It works and is quite performant even for 5M keywords in keywords.metrics
, however, explain
shows that only the $match
and $lookup
stages are supported by indexes. The $sort
stage operates on an indexed field, but the index is not used.
Since this query will be the foundation for a keyword browser frontend where several sorting and filtering options can be changed by users frequently, I anticipate the query to run very often. For that purpose, a query that runs in ~2s according to explain
and narrowly avoids bleeding into disk in the $sort
stage (80MB) seems suboptimal to me.
So I would really like to avoid having to sort up to 100k documents without index.
Is there a clever way to do this?
One possible solution for the query to be fully covered by indexes, is to unify both collections such that the metrics
$lookup
can be skipped. However, I’m hesitating to pursue that solution, because it feels wrong to restructure my data just to support one additional query. If I perform the restructure, I can for example no longer get a keyword’s descendants without having to ‘drag along’ the metrics field in database calculations. Adding all the data to a singlekeywords
collection seems to make documents rather big and I always thought bigger documents meant degrading performance for all other queries.