I have a fairly straightforward aggregation pipeline that joins a collection with itself, but I'm having trouble getting the performance I expect. At a high level, I want to join on collection1.dest = collection2.src AND collection2.type = 'some_constant', where collection1 and collection2 are actually the same collection.
My pipeline looks like this:
[
  // Get list of edges, returns 256 docs
  { $match: { source: "some_id", type: "1" } },
  // Lookup destinations, returns 256 docs
  { $lookup: {
      from: "grouped_assocs",
      localField: "destination_id",
      foreignField: "source_id",
      as: "target_objs",
      pipeline: [ { $match: { type: "2" } } ]
  } }
]
Today, this takes ~100ms. By upgrading to MongoDB 6.0 and removing the pipeline filter from the $lookup stage, it speeds up to ~25ms (it then returns ~3k documents, though, which can be mitigated by filtering in another stage afterwards). However, if I simply fire two separate queries at the DB, I can achieve ~1ms for both stages (at the cost of incurring network latency twice).
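In case the wording is unclear, this is roughly what I mean by the MongoDB 6.0 variant: a plain $lookup with no inner pipeline, followed by a separate stage that filters the joined array (a sketch only; the $addFields/$filter step is just one way to do that post-filtering, and the field/collection names are taken from the pipeline above):

db.grouped_assocs.aggregate([
  // Get list of edges
  { $match: { source: "some_id", type: "1" } },
  // Plain lookup, no inner pipeline (fast on 6.0, but joins docs of every type)
  { $lookup: {
      from: "grouped_assocs",
      localField: "destination_id",
      foreignField: "source_id",
      as: "target_objs"
  } },
  // Drop joined docs whose type isn't "2" in a follow-up stage
  { $addFields: {
      target_objs: {
        $filter: {
          input: "$target_objs",
          as: "t",
          cond: { $eq: [ "$$t.type", "2" ] }
        }
      }
  } }
])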
I'm confused about why the pipeline can't be treated the same as two sequential queries to the database. Here are the sequential queries, which result in much better performance:
// Get the list of edges, 256 docs
[ { $match: { source: "some_id", type: "1" } } ]
followed by:
// Load the resulting objects for each edge, 256 docs
[
{ $match: { type: '2', src: { $in: [ list of "dest" returned from first query ] } } }
]
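Issued from the shell, those two round trips look roughly like this (a sketch; the variable names are illustrative, and I'm assuming the destination_id/source_id field names from the $lookup above rather than the shorthand src/dest):

// First round trip: get the edges
const edges = db.grouped_assocs.aggregate([
  { $match: { source: "some_id", type: "1" } }
]).toArray();

// Second round trip: load the target objects for those edges
const destIds = edges.map(e => e.destination_id);
const targets = db.grouped_assocs.find(
  { type: "2", source_id: { $in: destIds } }
).toArray();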
I understand the join and the $in query have slightly different results ($in would remove any dupes and return a different order, although in my case neither of these matters). Is there some way to achieve the same performance with a single pipeline, though?
Note: I have a compound index on (src, type), and it does seem to be used in all of the cases listed above.
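For completeness, that index was created roughly like this (assuming the src/type field names from the second query; adjust if the documents actually use source_id/type):

db.grouped_assocs.createIndex({ src: 1, type: 1 })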