Execute a sub-query in a pipeline

I need to execute a 2-step pipeline.

Step 1: retrieve documents from collection 1 that meet certain conditions.
Document example in collection 1:

{
  "_id": "3239179-47xTHwtFra1d4Mq4DYZuZYJEYrDXY",
  "impacted_address_canonical": "47xTHwtFra1d4Mq4DYZuZYJEYrDXY",
  "tx_hash": "47xTHwtFra1d4382325342905Mq4DYZuZYJEYrDXY",
  "date": "2022-06-24"
}

My step-1 pipeline is:

pipeline = [
    {"$match": {"date": date}},
    {"$match": {"impacted_address_canonical": address_to_match}},
    {"$project": {"_id": 0, "tx_hash": 1}},
]

This outputs a list of tx_hashes that will serve as the filter for the step-2 pipeline.
Collection 2 document example:

{
  "_id": "6159f5bbe77b1a6544448ca17b33a1bb33280182a651094921e5641c6966b572",
  "type": {
    "type": "account_transaction",
    "contents": "account_transfer"
  },
}

The step-2 pipeline on collection 2 is:

pipeline = [
    {"$match": {"_id": {"$in": tx_hashes}}},
    {"$sortByCount": "$type.contents"},
]

This outputs a list of documents, each pairing a type.contents value with its count.
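In Python, the full round trip currently looks roughly like this (connection details and variable values are placeholders):

from pymongo import MongoClient

db = MongoClient()["mydb"]  # placeholder database name

date = "2022-06-24"                                  # example values from the
address_to_match = "47xTHwtFra1d4Mq4DYZuZYJEYrDXY"   # sample document above

# Step 1: run pipeline 1 and pull the tx_hashes down to the client
tx_hashes = [
    doc["tx_hash"]
    for doc in db.collection_1.aggregate([
        {"$match": {"date": date}},
        {"$match": {"impacted_address_canonical": address_to_match}},
        {"$project": {"_id": 0, "tx_hash": 1}},
    ])
]

# Step 2: send the (possibly 100K+) hashes back up to the server
counts = list(db.collection_2.aggregate([
    {"$match": {"_id": {"$in": tx_hashes}}},
    {"$sortByCount": "$type.contents"},
]))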

Now the questions…

Collection 1 has 40M+ documents. Indexed on {'date': 1, 'impacted_address_canonical': 1}.

Collection 2 has 50M+ documents. Index (by default) on {'_id': 1}.

Question 1:

Pipeline 2 frequently takes 2+ min to execute (granted, that is on 100K+ tx_hashes). Is there a faster way?

Question 2:

It seems incredibly wasteful to run pipeline 1, process data locally in Python and then send that data (tx_hashes) back up to the server again. Can this be solved by a $lookup, or a similar concept?

You are on the right track in thinking that this can be solved with a $lookup or a similar concept. And yes, it will involve some kind of lookup.

However, it looks like the _id values in collection_2 do not match any of the tx_hash values from collection_1. An index on { _id: 1, "type.contents": 1 } might prevent the need to fetch documents from collection_2.
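In PyMongo, creating it would look like this (a sketch; connection details are placeholders):

from pymongo import MongoClient

db = MongoClient()["mydb"]  # placeholder database name

# Compound index including _id: a $match on _id that only needs
# type.contents might then be served from the index alone
db.collection_2.create_index([("_id", 1), ("type.contents", 1)])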

In addition, since there is a single tx_hash per document in collection_1 and that tx_hash is the _id in collection_2, I suspect you might as well embed the single referenced document from collection_2 as an object inside the corresponding document in collection_1.
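Roughly like this, reusing the sample fields from above (the embedded field name tx is just an illustration):

{
  "_id": "3239179-47xTHwtFra1d4Mq4DYZuZYJEYrDXY",
  "impacted_address_canonical": "47xTHwtFra1d4Mq4DYZuZYJEYrDXY",
  "tx_hash": "47xTHwtFra1d4382325342905Mq4DYZuZYJEYrDXY",
  "date": "2022-06-24",
  "tx": {
    "type": {
      "type": "account_transaction",
      "contents": "account_transfer"
    }
  }
}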

Thanks for your reply.

I’m not sure I understand your point, though? I have documents containing a tx_hash in one collection and a tx_collection where the tx_hash is the _id.

My question relates to the fact that I need to perform two (unrelated) pipelines, whereby the output of pipeline 1 is going to be the input for pipeline 2. Can I somehow prevent the round trip in Python and instead keep everything in Mongo?

Like I wrote, the _id values in collection_2 do not appear to match any tx_hash from collection_1. But your sample documents do not match your description, so we cannot be sure.

The following $lookup is based on your problem description:

$lookup: {
  from: "tx_collection",
  localField: "tx_hash",
  foreignField: "_id",
  as: "contents",
  pipeline: [
      { $project: { _id: 0, contents: "$type.contents" } }
  ]
}

but it will not work with the documents you shared.
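Note that combining localField/foreignField with an inner pipeline like this requires MongoDB 5.0 or later. On older versions, the equivalent uses let with a $match on $expr (a sketch in Python syntax, same names as above):

{"$lookup": {
    "from": "tx_collection",
    "let": {"h": "$tx_hash"},
    "pipeline": [
        # Match the document whose _id equals the outer tx_hash
        {"$match": {"$expr": {"$eq": ["$_id", "$$h"]}}},
        {"$project": {"_id": 0, "contents": "$type.contents"}},
    ],
    "as": "contents",
}}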

Hi @Sander_de_Ruiter,

@steevej is quite right that the sample documents you’ve shared don’t seem to quite match - the _id in collection 2 doesn’t look the same as the tx_hash you’ve shared from collection 1.

But to answer your (second) question: Yes, $lookup will help you avoid the round-trip and running two queries. For every document passing into that stage, $lookup will allow you to look up documents matching the criteria in the second collection.

This would look roughly like this:

{
    '$lookup': {
        'from': 'collection_2',
        'localField': 'tx_hash',
        'foreignField': '_id',
        'as': 'collection_2_documents'
    }
}

That’ll give you a new field in each document coming out of that stage, called ‘collection_2_documents’, which will be an array of subdocuments. (In this case, only one document in the array, because ‘_id’ is unique - but you’ll still get an array. You can fix this with a ‘$project’ stage if you want or need to.)

In terms of performance, one suggestion that jumps out at me is that if you have multiple documents containing the same ‘tx_hash’ value, then you probably don’t want to look up the same document multiple times in collection_2. In this case, you could use a ‘$group’ stage before the lookup to ensure that you’re only looking up each hash once.
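Putting this together with your step-1 match, a single pipeline on collection_1 could look roughly like this (a sketch, using the collection names from this thread, with date and address_to_match as before):

pipeline = [
    # Step-1 filter; served by the {date, impacted_address_canonical} index
    {"$match": {"date": date, "impacted_address_canonical": address_to_match}},
    # Deduplicate, so each tx_hash is looked up only once
    {"$group": {"_id": "$tx_hash"}},
    # The grouped _id is the tx_hash, which is the _id in collection_2
    {"$lookup": {
        "from": "collection_2",
        "localField": "_id",
        "foreignField": "_id",
        "as": "tx",
    }},
    # At most one match per hash, so unwind the single-element array
    {"$unwind": "$tx"},
    # Count occurrences of each type.contents value
    {"$sortByCount": "$tx.type.contents"},
]
result = list(db.collection_1.aggregate(pipeline))

Each $lookup then hits collection_2’s _id index once per distinct hash, and nothing makes a round trip through Python.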

Apologies for any confusion here. I took the wrong document to copy and paste from; the tx_hash and _id were meant to be the same.

However…

Collections A and B both have different document types, but collection A has a property called tx_hash that happens to be the _id in collection B.

Can $lookup execute a pipeline? I.e., I need to find a subset of documents from collection A and capture the tx_hashes of that subset to feed into a pipeline on collection B.

That has been answered twice in this thread. Have you tried what we shared?

I have, yes, but as expected, this doesn’t lead to a workable solution. I’ve tried the simplest lookup on collection B, referencing collection A, but even with a timeout of 10 minutes this doesn’t yield a result.

Please share the exact pipeline that you have tried.

Please share the exact documents that you expect the lookup to match.

Please share the collection stats.

Please share your system configuration.
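For example, in PyMongo (a sketch; substitute your real database, collection, and pipeline):

from pymongo import MongoClient

db = MongoClient()["mydb"]  # placeholder database name
pipeline = []               # paste the exact pipeline you tried here

# Collection statistics: document counts, sizes, and indexes
stats = db.command("collStats", "collection_2")

# Execution plan with per-stage timings for the pipeline you ran
explain = db.command({
    "explain": {"aggregate": "collection_1", "pipeline": pipeline, "cursor": {}},
    "verbosity": "executionStats",
})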