I need to execute a 2-step pipeline.
Step 1: retrieve documents from collection 1 that meet certain conditions.
Document example in collection 1:
{
"_id": "3239179-47xTHwtFra1d4Mq4DYZuZYJEYrDXY",
"impacted_address_canonical": "47xTHwtFra1d4Mq4DYZuZYJEYrDXY",
"tx_hash": "47xTHwtFra1d4382325342905Mq4DYZuZYJEYrDXY",
"date": "2022-06-24"
}
My step-1 pipeline is:
pipeline = [
{"$match": {"date": date}},
{"$match": {"impacted_address_canonical": address_to_match}},
{"$project": {"_id": 0, "tx_hash": 1}},
]
This outputs a list of tx_hashes, that will serve as filter for the step-2 pipeline.
Collection 2 document example:
{
"_id": "6159f5bbe77b1a6544448ca17b33a1bb33280182a651094921e5641c6966b572",
"type": {
"type": "account_transaction",
"contents": "account_transfer"
},
}
The step-2 pipeline on collection 2 is:
pipeline = [
{"$match": {"_id": {"$in": tx_hashes}}},
{"$sortByCount": "$type.contents"},
]
This outputs a list/dict of type.contents with a corresponding count.
Now the questions…
Collection 1 has 40M+ documents. Indexed on {'date': 1, 'impacted_address_canonical': 1}.
Collection 2 has 50M+ documents. Index (by default) on {'_id: 1}.
Question 1:
Pipeline 2 frequently takes 2+ min to execute (granted, that is on 100K+ tx_hashes). Is there a faster way?
Question 2:
It seems incredibly wasteful to run pipeline 1, process data locally in Python and then send that data (tx_hashes) back up to the server again. Can this be solved by a $lookup, or a similar concept?