Aggregation pipeline ends in a group returning illogical results

I’m developing a new aggregation pipeline for a series of queries on my Mongo database (g_reports) and one of it’s collections (zip_vendor). I know the data stored in Mongo is logically consistent, because when I query with normal find() operator and use the data for other purposes I don’t see this behavior.

However, this aggregation with group() stage returns records that are logically impossible results based on my first $match stage.

Here is the aggregation I’m using (syntax is for python + pymongo):

coll = mongo_conn['all_reports']['zip_vendor']
result = coll.aggregate([
        {
            '$match': {
                'g_client_machinename': 'foo_client'
            },
            '$match': {
                "g_uploaded_at": {
                    '$gte': dparser.parse(date_from, fuzzy=True),
                    '$lte': dparser.parse(date_to, fuzzy=True)
                },
            },
        },
        {
            '$project': {
                'MyDate': {
                    '$dateToString' : {
                        'format': '%Y-%m-%d',
                        'date': {
                             '$convert': {
                                'input' : f'${txn_date_feature}',
                                'to' : 'date'
                            }
                        }
                    }
                },
                'g_client_machinename': 1,
                'g_vendor_machinename': 1,
            }
        },
        { 
            '$group': { 
                '_id': '$MyDate', 
                'count': {'$sum': 1 },
                'g_client_machinename': {'$first': '$g_client_machinename'},
                'g_vendor_machinename': {'$first': '$g_vendor_machinename'},
            } 
        },
    ])

The resulting table contains the fields and data I require and appears to increment the counter +1 for each record grouped by date. The logic-bomb is this: I’m selecting the foo_client in the $match stage at the beginning of my aggregation, but the results show a mix of other client names:

_id count g_client_machinename g_vendor_machinename
0 2021-09-17 1 foo_client zip_vendor
1 2021-09-18 1 foo_client zip_vendor
2 2021-09-19 6 bar_client zip_vendor
3 2021-09-21 2 baz_client zip_vendor

I expect the aggregation to have already selected the client name foo_client. Does this line ('g_client_machinename': {'$first': '$g_client_machinename'}) in the group stage intentionally ignore this match stage and claw out of other records in the collection these client names (making false labels) or the whole records (contaminating the results)?


To give some MongoDB context to my aggregation example, here is the show for the database and collection names:

> show dbs
admin          0.000GB
config         0.000GB
g_reports  0.070GB
[...]

> use g_reports
switched to db g_reports
> show collections
__schema__
zip_vendor 
[...]

On the application side–
Each record, in addition to it’s content fields, is stored with 2+ additional fields used by my app when I insert content and when I retrieve it: (g_client_machinename, g_vendor_machinename).

> db.zip_vendor.findOne()
{
        "_id" : ObjectId("60a40c26184e56200062257f"),
        "date" : ISODate("2021-05-10T08:45:00Z"),
        "checkout id" : "9bf04b85-f7b0-4a62-b347-08944437a549",
        "order number" : "5",
        "your_store_id": "this is a foo client alias",
        "user id" : "177ff067-86ee-41fa-8d36-48e68cad9972",
        "this_is_a_goober_field_name_this_vendor_uses": "true",
[...]
        "items" : "Eggs Benedict, Green & White Scramble, Side Three Pepper Home Fries (2), Buttermilk Pancakes",
        "g_client_machinename" : "foo_client",
        "g_vendor_machinename" : "zip_vendor",
        "g_uploaded_at" : ISODate("2021-05-18T18:49:10.293Z"),
        "g_unique_id" : "6bbd0d4b8fc27800e680691990b99f72d964673f40ab9132623c25cfd3f0007a"
}

Basically each “batch” of records stored is from one client for a particular vendor. I label (and validate) the record when I store it. So the theory goes, all I need is this 2-tuple to get it out.

What do you consider as being your first $match stage?

There is only one match stage and it is

A query is a JSON document. In most implementation of JSON, only the last occurrence of a key is kept. In your case the second $match overwrites the first one. If you try in the shell the following

match = {
            '$match': {
                'g_client_machinename': 'foo_client'
            },
            '$match': {
                "g_uploaded_at": {
                    '$gte': "dparser.parse(date_from, fuzzy=True)",
                    '$lte': "dparser.parse(date_to, fuzzy=True)"
                },
            },
        }

and the print the variable match, you will see it is equal to

{
  '$match': {
    g_uploaded_at: {
      '$gte': 'dparser.parse(date_from, fuzzy=True)',
      '$lte': 'dparser.parse(date_to, fuzzy=True)'
    }
  }
}
1 Like

Rookie mistake! 5,4,3,2,1 character min.

1 Like

Not rookie, I still fall in this trap from time to time.

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.