Use index of joined collection from $lookup in $sort after the join

cabus · August 6, 2023, 3:33pm

I am working with a set of ~5M items. I have two collections:

keywords.metrics

[
  {
    keyword: "tesla",
    metrics: {
      rating: 1000,
      ...
    }
  },
  ...
]

indexed: keyword, metrics.rating

keywords.ancestries

[
  {
    keyword: "tesla model x",
    ancestry: [
      {
        keyword: "tesla"
      },
      ...
    ]
  },
  ...
]

indexed: ancestry.keyword

The task is:

For a given keyword, retrieve all descendants and sort them by metrics.rating.
In the most extreme case, a single keyword can have up to 100k descendants.

The following aggregation pipeline executed on keywords.ancestries will do that:

[
  {
    $match:
      {
        "ancestry.keyword": "tesla",
      }
  },
  {
    $lookup:
      {
        from: "keywords.metrics",
        let: {
          keyword: "$keyword",
        },
        pipeline: [
          {
            $match: {
              $expr: {
                $eq: ["$keyword", "$$keyword"],
              }
            }
          }
        ],
        as: "metrics"
      }
  },
  {
    $sort:
      {
        "metrics.metrics.rating": 1
      }
  }
]

It works and is quite performant even for 5M keywords in keywords.metrics, however, explain shows that only the $match and $lookup stages are supported by indexes. The $sort stage operates on an indexed field, but the index is not used.
Since this query will be the foundation for a keyword browser frontend where several sorting and filtering options can be changed by users frequently, I anticipate the query to run very often. For that purpose, a query that runs in ~2s according to explain and narrowly avoids bleeding into disk in the $sort stage (80MB) seems suboptimal to me.

So I would really like to avoid having to sort up to 100k documents without index.

Is there a clever way to do this?

One possible solution for the query to be fully covered by indexes, is to unify both collections such that the metrics $lookup can be skipped. However, I’m hesitating to pursue that solution, because it feels wrong to restructure my data just to support one additional query. If I perform the restructure, I can for example no longer get a keyword’s descendants without having to ‘drag along’ the metrics field in database calculations. Adding all the data to a single keywords collection seems to make documents rather big and I always thought bigger documents meant degrading performance for all other queries.

slava · August 6, 2023, 8:24pm

Hello, @cabus! It’s been a while since your last post

It seems, that the description of your issue needs some refinement.

You have few mistakes in your aggregation code. For example, in your $lookup stage, defined let-keyword results to an array and you compare it with string later on in $expr. This will lead to no joined documents by $lookup stage, so your $sort stage will have nothing to sort.

Worth to notice, that ancestry field in keywords.ancestries is an array. Can one entry in keywords.ancestries collection relate to multiple entries in keywords.metrics collection?

cabus · August 7, 2023, 12:54pm

Hi,
thanks for your response!

The $lookup stage is working as intended on my side.

Maybe the description of the two involved collections was a bit poorly worded:
The [ and ] denote the collection boundary. I wrote the collection as a JSON array. Every object in that array represents a document.

The pipeline operates as follows:

$match fetches all documents that have the seed keyword (tesla) in their ancestry.
$lookup joins metrics from keywords.metrics for every document found (e.g. a document with keyword: 'tesla model x')
One entry in keywords.ancestries relates to one entry in keywords.metrics (1:1)
$sort sorts the documents by a field that is joined in the $lookup stage prior (rating, in this example)

steevej · August 8, 2023, 1:25pm

It is not with the sample date set you provided. Using your $match and $lookup I get:

[
  {
    _id: ObjectId("64d23e0a5e9e8e87c5502356"),
    keyword: 'tesla model x',
    ancestry: [ { keyword: 'tesla' } ],
    metrics: []
  }
]

However, if I change the document in metrics from “keyword”:“tesla” to “keywork”:“tesla model x” I do get some result. Could you please provide a richer sample data set that supports your use-case directly? This means also please removes the ... from the documents because we cannot cut-n-paste your documents directly.

Since the ultimate and problematic goal is to have the metrics sorted using an index, why don’t you simply start your aggregation from metrics. This way the $sort is supported by your index.