Execute a sub-query in a pipeline

I need to execute a 2-step pipeline.

Step 1: retrieve documents from collection 1 that meet certain conditions.
Document example in collection 1:

{
  "_id": "3239179-47xTHwtFra1d4Mq4DYZuZYJEYrDXY",
  "impacted_address_canonical": "47xTHwtFra1d4Mq4DYZuZYJEYrDXY",
  "tx_hash": "47xTHwtFra1d4382325342905Mq4DYZuZYJEYrDXY",
  "date": "2022-06-24"
}

My step-1 pipeline is:

pipeline = [
    {"$match": {"date": date}},
    {"$match": {"impacted_address_canonical": address_to_match}},
    {"$project": {"_id": 0, "tx_hash": 1}},
]

This outputs a list of tx_hashes that will serve as the filter for the step-2 pipeline.
Collection 2 document example:

{
  "_id": "6159f5bbe77b1a6544448ca17b33a1bb33280182a651094921e5641c6966b572",
  "type": {
    "type": "account_transaction",
    "contents": "account_transfer"
  },
}

The step-2 pipeline on collection 2 is:

pipeline = [
    {"$match": {"_id": {"$in": tx_hashes}}},
    {"$sortByCount": "$type.contents"},
]

This outputs a list of documents, each pairing a type.contents value with its count.
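In Python, the full round trip currently looks roughly like this (connection details and variable values are placeholders):

from pymongo import MongoClient

db = MongoClient()["mydb"]  # placeholder database name

date = "2022-06-24"                                  # example values from the
address_to_match = "47xTHwtFra1d4Mq4DYZuZYJEYrDXY"   # sample document above

# Step 1: run pipeline 1 and pull the tx_hashes down to the client
tx_hashes = [
    doc["tx_hash"]
    for doc in db.collection_1.aggregate([
        {"$match": {"date": date}},
        {"$match": {"impacted_address_canonical": address_to_match}},
        {"$project": {"_id": 0, "tx_hash": 1}},
    ])
]

# Step 2: send the (possibly 100K+) hashes back up to the server
counts = list(db.collection_2.aggregate([
    {"$match": {"_id": {"$in": tx_hashes}}},
    {"$sortByCount": "$type.contents"},
]))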

Now the questions…

Collection 1 has 40M+ documents. Indexed on {'date': 1, 'impacted_address_canonical': 1}.

Collection 2 has 50M+ documents. Index (by default) on {'_id': 1}.

Question 1:

Pipeline 2 frequently takes 2+ min to execute (granted, that is on 100K+ tx_hashes). Is there a faster way?

Question 2:

It seems incredibly wasteful to run pipeline 1, process data locally in Python and then send that data (tx_hashes) back up to the server again. Can this be solved by a $lookup, or a similar concept?

You are on the right track in thinking that this can be solved with a $lookup or a similar concept. And yes, it will involve some kind of lookup.

However, it looks like the _id values in collection_2 do not match any of the tx_hash values from collection_1. An index on { _id: 1, "type.contents": 1 } might prevent the need to fetch documents from collection_2.
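In PyMongo, creating it would look like this (a sketch; connection details are placeholders):

from pymongo import MongoClient

db = MongoClient()["mydb"]  # placeholder database name

# Compound index including _id: a $match on _id that only needs
# type.contents might then be served from the index alone
db.collection_2.create_index([("_id", 1), ("type.contents", 1)])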

In addition, since there is a single tx_hash per document in collection_1 and that tx_hash is the _id in collection_2, I suspect you might as well embed the single referenced document from collection_2 as an object inside the corresponding document in collection_1.
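Roughly like this, reusing the sample fields from above (the embedded field name tx is just an illustration):

{
  "_id": "3239179-47xTHwtFra1d4Mq4DYZuZYJEYrDXY",
  "impacted_address_canonical": "47xTHwtFra1d4Mq4DYZuZYJEYrDXY",
  "tx_hash": "47xTHwtFra1d4382325342905Mq4DYZuZYJEYrDXY",
  "date": "2022-06-24",
  "tx": {
    "type": {
      "type": "account_transaction",
      "contents": "account_transfer"
    }
  }
}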

Thanks for your reply.

I’m not sure I understand your point, though? I have documents containing a tx_hash in one collection and a tx_collection where the tx_hash is the _id.

My question relates to the fact that I need to perform two (unrelated) pipelines, whereby the output of pipeline 1 is going to be the input for pipeline 2. Can I somehow prevent the round trip in Python and instead keep everything in Mongo?

Like I wrote, the _id values in collection_2 do not appear to match any tx_hash from collection_1. But your sample documents do not match your description, so we cannot be sure.

The following $lookup is based on your problem description:

$lookup: {
  from: "tx_collection",
  localField: "tx_hash",
  foreignField: "_id",
  as: "contents",
  pipeline: [
      { $project: { _id: 0, contents: "$type.contents" } }
  ]
}

but it will not work with the documents you shared.
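Note that combining localField/foreignField with an inner pipeline like this requires MongoDB 5.0 or later. On older versions, the equivalent uses let with a $match on $expr (a sketch in Python syntax, same names as above):

{"$lookup": {
    "from": "tx_collection",
    "let": {"h": "$tx_hash"},
    "pipeline": [
        # Match the document whose _id equals the outer tx_hash
        {"$match": {"$expr": {"$eq": ["$_id", "$$h"]}}},
        {"$project": {"_id": 0, "contents": "$type.contents"}},
    ],
    "as": "contents",
}}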

Hi @Sander_de_Ruiter,

@steevej is quite right that the sample documents you’ve shared don’t seem to quite match - the _id in collection 2 doesn’t look the same as the tx_hash you’ve shared from collection 1.

But to answer your (second) question: Yes, $lookup will help you avoid the round-trip and running two queries. For every document passing into that stage, $lookup will allow you to look up documents matching the criteria in the second collection.

This would look roughly like this:

{
    '$lookup': {
        'from': 'collection_2',
        'localField': 'tx_hash',
        'foreignField': '_id',
        'as': 'collection_2_documents'
    }
}

That’ll give you a new field in each document coming out of that stage, called ‘collection_2_documents’, which will be an array of subdocuments. (In this case, only one document in the array, because ‘_id’ is unique - but you’ll still get an array. You can fix this with a ‘$project’ stage if you want or need to.)

In terms of performance, one suggestion that jumps out at me is that if you have multiple documents containing the same ‘tx_hash’ value, then you probably don’t want to look up the same document multiple times in collection_2. In this case, you could use a ‘$group’ stage before the lookup to ensure that you’re only looking up each hash once.
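Putting this together with your step-1 match, a single pipeline on collection_1 could look roughly like this (a sketch, using the collection names from this thread, with date and address_to_match as before):

pipeline = [
    # Step-1 filter; served by the {date, impacted_address_canonical} index
    {"$match": {"date": date, "impacted_address_canonical": address_to_match}},
    # Deduplicate, so each tx_hash is looked up only once
    {"$group": {"_id": "$tx_hash"}},
    # The grouped _id is the tx_hash, which is the _id in collection_2
    {"$lookup": {
        "from": "collection_2",
        "localField": "_id",
        "foreignField": "_id",
        "as": "tx",
    }},
    # At most one match per hash, so unwind the single-element array
    {"$unwind": "$tx"},
    # Count occurrences of each type.contents value
    {"$sortByCount": "$tx.type.contents"},
]
result = list(db.collection_1.aggregate(pipeline))

Each $lookup then hits collection_2’s _id index once per distinct hash, and nothing makes a round trip through Python.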

Apologies for any confusion here. I took the wrong document to copy and paste from; the tx_hash and _id were meant to be the same.

However…

Collections A and B both have different document types, but collection A has a property called tx_hash that happens to be the _id in collection B.

Can $lookup execute a pipeline? I.e., I need to find a subset of documents from collection A and capture the tx_hashes of that subset to feed into a pipeline on collection B.

That has been answered twice in this thread. Have you tried what we shared?

I have, yes, but as expected, this doesn’t lead to a workable solution. I’ve tried the simplest lookup on collection B, referencing collection A, but even with a timeout of 10 minutes this doesn’t yield a result.

Please share the exact pipeline that you have tried.

Please share the exact documents that you expect the lookup to match.

Please share the collection stats.

Please share your system configuration.
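For example, in PyMongo (a sketch; substitute your real database, collection, and pipeline):

from pymongo import MongoClient

db = MongoClient()["mydb"]  # placeholder database name
pipeline = []               # paste the exact pipeline you tried here

# Collection statistics: document counts, sizes, and indexes
stats = db.command("collStats", "collection_2")

# Execution plan with per-stage timings for the pipeline you ran
explain = db.command({
    "explain": {"aggregate": "collection_1", "pipeline": pipeline, "cursor": {}},
    "verbosity": "executionStats",
})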