Complex aggregation pipelines vs complex programming logic

I have been using MongoDB for quite some time now and was exploring many powerful operators available in the aggregation pipelines.

With the increasing capabilities of aggregation pipelines, a dilemma arises about whether to:

  1. Use a complex aggregation pipeline so that the server returns the final data structure
  2. Query the data from MongoDB with minimal steps and then process it in the application logic

For example, consider the document below:

  {
    "_id": 1024,
    "data": [
        {"foo": "one", "bar": "lorem"},
        {"foo": "two", "bar": "ipsum"},
        {"foo": "three", "bar": "asdf"},
        {"foo": "four", "bar": "ljkls"},
        {"foo": "five", "bar": "afsdfk"},
        {"foo": "six", "bar": "lksflkj"},
        ...
    ]
  }

After finding the document, I want to filter the sub-documents based on whether the value of foo is in a given array of values, and finally build an object with the value of foo as the key and the value of bar as the value.

We can achieve this with the aggregation query below:

db.collection.aggregate([
    {"$match": {"_id": 1024}},
    {"$project": {
        "data": {
            "$map": {
                "input": {
                    "$filter": {
                        "input": "$data",
                        "as": "sub",
                        "cond": {"$in": ["$$sub.foo", ["one", "three", "six", ...]]}
                    }
                },
                "as": "sub",
                "in": {
                    "k": "$$sub.foo",
                    "v": "$$sub.baar"
                }
            }
        }
    }},
    {"$project": {
        "final_data": {"$arrayToObject": "$data"},
        "_id": 0
    }}
])
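To make the stage-by-stage behavior concrete, here is a pure-Python emulation of what $filter, $map, and $arrayToObject each produce, run on a small sample document (the field values are invented for illustration):

```python
# Sample document standing in for the stored one (values are made up).
doc = {
    "_id": 1024,
    "data": [
        {"foo": "one", "bar": "lorem"},
        {"foo": "two", "bar": "ipsum"},
        {"foo": "three", "bar": "asdf"},
    ],
}
wanted = ["one", "three"]

# $filter: keep only the sub-documents whose "foo" is in the list
filtered = [sub for sub in doc["data"] if sub["foo"] in wanted]

# $map: reshape each sub-document into a {"k": ..., "v": ...} pair
pairs = [{"k": sub["foo"], "v": sub["bar"]} for sub in filtered]

# $arrayToObject: fold the k/v pairs into a single object
final_data = {pair["k"]: pair["v"] for pair in pairs}

print(final_data)  # {'one': 'lorem', 'three': 'asdf'}
```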

Or I can just fetch the document and process it in Python (I am using the PyMongo driver):

data = db.collection.find_one({"_id": 1024})['data']
final_data = {}
for sub_doc in data:
    if sub_doc['foo'] in ['one', 'three', 'six', ...]:
        final_data[sub_doc['foo']] = sub_doc['bar']
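If the number of sub-documents grows, the client-side version can be tightened a little: using a set for the membership test makes each lookup O(1) instead of O(n), and a dict comprehension replaces the loop. A sketch with invented sample data in place of the find_one result:

```python
# Sample sub-documents standing in for the result of find_one (values invented).
data = [
    {"foo": "one", "bar": "lorem"},
    {"foo": "two", "bar": "ipsum"},
    {"foo": "six", "bar": "lksflkj"},
]

wanted = {"one", "three", "six"}  # set membership is O(1), list membership is O(n)
final_data = {sub["foo"]: sub["bar"] for sub in data if sub["foo"] in wanted}

print(final_data)  # {'one': 'lorem', 'six': 'lksflkj'}
```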

With the first approach, I need to know the MongoDB-specific operators and syntax, but once the query is ready I can run it from the shell or from other languages with minimal changes.

With the second, I am already comfortable with Python and other programming languages, which are designed for this kind of processing. Going forward, if I switch to a data source that doesn’t support such operators, I won’t need to migrate the logic back to the programming language.

How do I decide between the two? Does the answer change based on the number of sub-documents, say from 100 to 100 thousand?

Very interesting topic.

See some of my previous posts that are related.

There are times when it is better to do things on the server. One such time is when it reduces the amount of data transferred. Using $map and $filter to return only the subset of data of interest is definitely such a case.
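A rough back-of-the-envelope sketch of that saving, using JSON byte size as a stand-in for the BSON wire size (the document shape and sizes here are invented):

```python
import json

# Invented document: 1000 sub-documents with ~50-byte values.
doc = {
    "_id": 1024,
    "data": [{"foo": f"key{i}", "bar": "x" * 50} for i in range(1000)],
}
wanted = {"key1", "key2", "key3"}

# Payload if the whole document is fetched and filtered client-side.
full_bytes = len(json.dumps(doc).encode())

# Payload if the server filters and reshapes first (as the pipeline does).
projected = {sub["foo"]: sub["bar"] for sub in doc["data"] if sub["foo"] in wanted}
reduced_bytes = len(json.dumps(projected).encode())

print(full_bytes, reduced_bytes)  # the projected payload is far smaller
```

So with many sub-documents and a small subset of interest, the server-side pipeline transfers a fraction of the data; with a handful of sub-documents the difference is negligible and client-side processing is just as reasonable.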