Smarter pipeline optimization with lots of $lookups?

I have a table with 500k rows. Each row may join 1:many with 20 other sources of data. Until now, I had all this information stored in a single table, which makes it very easy to search.

Despite an on-disk size of 6GB, even with 32GB of RAM, it can take 220+ seconds to do a table scan.

I realized that I can opportunistically $lookup each data source, so that table scans are faster.

I usually only query on 1-3 pieces of data, not all 20. So for matching, I can do a $lookup, apply my $matches, then do the next $lookup, etc.

Can I do that automatically with a View? It does “automatic pipeline optimization”, hoisting the pre-$lookup queries to before the joins, but then it just does all the $lookups before applying the $match, even though it should run that one $lookup and then the $match, which would remove most of the rows from requiring the remaining $lookups.

Is there any way to give hints? To shape my query better? I’m afraid I just have to build the query manually.
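
(To illustrate what I mean by building it manually: a sketch using the collections from the view below; the Year match comes from my real queries, and the taxes condition is just a placeholder.)

db.propertiesPointers.aggregate([
  // cheap filter on base fields first
  { $match: { Year: 2018 } },
  // join only the source I am matching on...
  { $lookup: { from: "DataTaxes", localField: "data.taxes", foreignField: "_id", as: "taxes" } },
  // ...and filter immediately, so most rows never reach the remaining lookups
  { $match: { "taxes.amount": { $gt: 1000 } } },   // placeholder condition
  // the rest of the joins now run on a much smaller set
  { $lookup: { from: "DataBusinesses", localField: "data.businesses", foreignField: "_id", as: "businesses" } },
  { $lookup: { from: "DataParking", localField: "data.parking", foreignField: "_id", as: "parking" } },
  { $lookup: { from: "DataRestaurants", localField: "data.restaurants", foreignField: "_id", as: "restaurants" } }
])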

Example:

Pipeline as a view:

db.createCollection(
  "propertiesView",
  {
    viewOn: "propertiesPointers",
    pipeline: [
      { $lookup: { from: "DataTaxes",       localField: "data.taxes",       foreignField: "_id", as: "taxes" } },
      { $lookup: { from: "DataBusinesses",  localField: "data.businesses",  foreignField: "_id", as: "businesses" } },
      { $lookup: { from: "DataParking",     localField: "data.parking",     foreignField: "_id", as: "parking" } },
      { $lookup: { from: "DataRestaurants", localField: "data.restaurants", foreignField: "_id", as: "restaurants" } }
    ]
  }
)

If I do a match on taxes, it should automatically be placed after the lookup on taxes, not at the end of the entire pipeline, which is what happens now on MongoDB v4.2.1 and the latest v4.2.7.

Oddly, it works on normal queries, hoisting the $match to right after the relevant $lookup:

db.getCollection('propertiesView').find({"Year":2018, "data.price":4}).explain()

But not on $size: it puts that $match at the end, which is ironic, since this is the one it could figure out before doing the lookup at all (“how large is the array of IDs to look up?”):

db.getCollection('propertiesView').find({"Year":2018, "data.parking":{$size:4}}).explain()

But regardless, it should arrange the actual $lookup order based on which matches I’ll be doing.

Is there a way to tell mongodb that?

If you need a simple way to filter the results received from your $lookup stages, you can open a nested pipeline for each $lookup stage, like this:

db.your_collection.aggregate([
  {
    $lookup: {
      from: 'DataTaxes',
      let: {
        targetId: '$_id',
      },
      pipeline: [
        {
          $match: {
            $expr: {
              // filter the results of this current $lookup
              $eq: ['$_id', '$$targetId'],
            },
          },
        },
        // ... your other stages for the results of
        // this $lookup go here ...
      ],
      as: 'DataTaxes',
    },
  },
  // ... other $lookups ...
]);
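
To mirror the schema from your view, where data.taxes holds the referenced ids, the nested pipeline could also carry an extra filter; the Year condition below is just an assumption:

db.propertiesPointers.aggregate([
  {
    $lookup: {
      from: 'DataTaxes',
      let: { taxIds: '$data.taxes' },
      pipeline: [
        {
          $match: {
            // match the referenced ids, as before
            $expr: { $in: ['$_id', '$$taxIds'] },
            // extra filter, applied while the lookup runs
            Year: 2018,
          },
        },
      ],
      as: 'taxes',
    },
  },
]);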

If the waiting time is a problem, then you can do the following:

  • use $limit + $skip stages to run the aggregation on a smaller part of the data rather than on all 500K rows.
  • use multiple aggregations and run them in parallel: one aggregation that uses the first (in your example) $lookup, a second aggregation for the second $lookup, and so on.
  • add an $out stage to your aggregation, so the results are stored in a dedicated collection. Afterwards, create a cron job that runs your aggregation every hour (or at any other frequency) and refreshes the data in that collection. With this you only have to do a simple db.collection.find() to get all the results, but much faster (see the sketch after this list).
  • if that data is so closely related, consider putting everything that you $lookup into the base document, so you won’t have to join anything with $lookups afterwards: everything will be in the same document.
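
A sketch of the $out variant (the target collection name is arbitrary):

// run this on a schedule (e.g. hourly via cron), not per request
db.propertiesPointers.aggregate([
  { $lookup: { from: 'DataTaxes', localField: 'data.taxes', foreignField: '_id', as: 'taxes' } },
  // ... the other $lookups ...
  { $out: 'propertiesMaterialized' },  // replaces the target collection with these results
]);

// user-facing reads then become plain finds against the precomputed data
db.propertiesMaterialized.find({ Year: 2018, 'data.price': 4 });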

I don’t understand what the $eq: ['$_id', '$$targetId'] is for? I’m already doing a direct ID match.

What do you mean by “use multiple aggregations and run them in parallel”? How do I tell Mongo to run them in parallel?

“if that data is so closely related, consider putting everything that you $lookup into the base document”
That created a 7GB file on disk, which took several minutes to do a full table scan on… that’s why I decomposed it, to merge it back as needed.

The ‘let’ statement below declares a new variable ‘targetId’ and assigns it the ‘_id’ of each document that we have in ‘your_collection’.

let: {
   targetId: '$_id',
},

We need this statement so we can use ‘_id’ in the nested pipeline (see the example in the previous message).

Can you share the part of the aggregation where you’re ‘doing a direct ID match’?

I am using the MongoDB driver for Node.js. In a Node.js environment it is possible to do things in parallel using ES6 Promises.
Example:

const pipeline1 = [
  {
    $match: {},
  },
  // ... more stages
];

const pipeline2 = [
  {
    $match: {},
  },
  // ... more stages
];

// calling .toArray() starts both aggregations without awaiting them one by one
const result1 = db.your_collection.aggregate(pipeline1).toArray();
const result2 = db.your_collection.aggregate(pipeline2).toArray();

// Promise.all expects an array of promises
const arrayOfResults = await Promise.all([result1, result2]);

So, the two aggregations will be executed in parallel, not sequentially.
Check whether your language supports something similar.

Oh, now I understand. I was using a simple 1:1 lookup:

{ $lookup: { from: 'DataTaxes', localField: 'taxes', foreignField: '_id', as: 'taxes' } }

and I had assumed I could add a pipeline to that to further limit the results, e.g. when using an array. But nope, you showed me the recipe.

But the ID matching with $eq seems really slow – it’s doing a full table scan instead of using the index for IDs, I’m guessing. Any way around that? I’d love to use the same $match syntax I’ll be using elsewhere…

I settled on doing the lookup, then a $filter (inside an $addFields), then an $arrayElemAt to get just the one.
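
Roughly like this (the Year condition stands in for whatever filter I actually need):

db.propertiesPointers.aggregate([
  // plain 1:1 lookup, which can use the _id index
  { $lookup: { from: 'DataTaxes', localField: 'taxes', foreignField: '_id', as: 'taxes' } },
  // then narrow the joined array and unwrap the single element
  { $addFields: {
      taxes: {
        $arrayElemAt: [
          { $filter: { input: '$taxes', as: 't', cond: { $eq: ['$$t.Year', 2018] } } },
          0
        ]
      }
  } }
])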

(I’m using Node.js too. Really, if I call a long list of $lookups it will do them in order? But if I access it as a cursor and load them up at the same level, they will happen at the same time? How would I check that?
I don’t think that’s my issue though; the issue is generally with doing full table scans…)

I’d really love to ADD a filter to the lookup step, thereby merging far fewer documents.

It seems you can only do it after the lookup is finished.

Maybe you want to look at https://docs.mongodb.com/manual/reference/operator/aggregation/graphLookup/ and pay attention to restrictSearchWithMatch.
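
A minimal sketch, reusing the collections from your view (maxDepth: 0 keeps it to a single hop, and the Year filter is only an assumption):

db.propertiesPointers.aggregate([
  {
    $graphLookup: {
      from: 'DataTaxes',
      startWith: '$data.taxes',        // the ids to resolve
      connectFromField: 'data.taxes',  // required, but unused at maxDepth 0
      connectToField: '_id',
      maxDepth: 0,                     // no recursion: behaves like a filtered $lookup
      as: 'taxes',
      restrictSearchWithMatch: { Year: 2018 },  // filter applied during the lookup itself
    },
  },
]);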