Aggregation pipeline ends in a group returning illogical results

xtian_simon · June 2, 2022, 1:58am

I’m developing a new aggregation pipeline for a series of queries on my Mongo database (g_reports) and one of it’s collections (zip_vendor). I know the data stored in Mongo is logically consistent, because when I query with normal find() operator and use the data for other purposes I don’t see this behavior.

However, this aggregation with group() stage returns records that are logically impossible results based on my first $match stage.

Here is the aggregation I’m using (syntax is for python + pymongo):

coll = mongo_conn['all_reports']['zip_vendor']
result = coll.aggregate([
        {
            '$match': {
                'g_client_machinename': 'foo_client'
            },
            '$match': {
                "g_uploaded_at": {
                    '$gte': dparser.parse(date_from, fuzzy=True),
                    '$lte': dparser.parse(date_to, fuzzy=True)
                },
            },
        },
        {
            '$project': {
                'MyDate': {
                    '$dateToString' : {
                        'format': '%Y-%m-%d',
                        'date': {
                             '$convert': {
                                'input' : f'${txn_date_feature}',
                                'to' : 'date'
                            }
                        }
                    }
                },
                'g_client_machinename': 1,
                'g_vendor_machinename': 1,
            }
        },
        { 
            '$group': { 
                '_id': '$MyDate', 
                'count': {'$sum': 1 },
                'g_client_machinename': {'$first': '$g_client_machinename'},
                'g_vendor_machinename': {'$first': '$g_vendor_machinename'},
            } 
        },
    ])

The resulting table contains the fields and data I require and appears to increment the counter +1 for each record grouped by date. The logic-bomb is this: I’m selecting the foo_client in the $match stage at the beginning of my aggregation, but the results show a mix of other client names:

	_id	count	g_client_machinename	g_vendor_machinename
0	2021-09-17	1	foo_client	zip_vendor
1	2021-09-18	1	foo_client	zip_vendor
2	2021-09-19	6	bar_client	zip_vendor
3	2021-09-21	2	baz_client	zip_vendor

I expect the aggregation to have already selected the client name foo_client. Does this line ('g_client_machinename': {'$first': '$g_client_machinename'}) in the group stage intentionally ignore this match stage and claw out of other records in the collection these client names (making false labels) or the whole records (contaminating the results)?

To give some MongoDB context to my aggregation example, here is the show for the database and collection names:

> show dbs
admin          0.000GB
config         0.000GB
g_reports  0.070GB
[...]

> use g_reports
switched to db g_reports
> show collections
__schema__
zip_vendor 
[...]

On the application side–
Each record, in addition to it’s content fields, is stored with 2+ additional fields used by my app when I insert content and when I retrieve it: (g_client_machinename, g_vendor_machinename).

> db.zip_vendor.findOne()
{
        "_id" : ObjectId("60a40c26184e56200062257f"),
        "date" : ISODate("2021-05-10T08:45:00Z"),
        "checkout id" : "9bf04b85-f7b0-4a62-b347-08944437a549",
        "order number" : "5",
        "your_store_id": "this is a foo client alias",
        "user id" : "177ff067-86ee-41fa-8d36-48e68cad9972",
        "this_is_a_goober_field_name_this_vendor_uses": "true",
[...]
        "items" : "Eggs Benedict, Green & White Scramble, Side Three Pepper Home Fries (2), Buttermilk Pancakes",
        "g_client_machinename" : "foo_client",
        "g_vendor_machinename" : "zip_vendor",
        "g_uploaded_at" : ISODate("2021-05-18T18:49:10.293Z"),
        "g_unique_id" : "6bbd0d4b8fc27800e680691990b99f72d964673f40ab9132623c25cfd3f0007a"
}

Basically each “batch” of records stored is from one client for a particular vendor. I label (and validate) the record when I store it. So the theory goes, all I need is this 2-tuple to get it out.

steevej · June 2, 2022, 2:30am

What do you consider as being your first $match stage?

There is only one match stage and it is

xtian_simon:

'$match': {
                "g_uploaded_at": {
                    '$gte': dparser.parse(date_from, fuzzy=True),
                    '$lte': dparser.parse(date_to, fuzzy=True)
                },
            },

A query is a JSON document. In most implementation of JSON, only the last occurrence of a key is kept. In your case the second $match overwrites the first one. If you try in the shell the following

match = {
            '$match': {
                'g_client_machinename': 'foo_client'
            },
            '$match': {
                "g_uploaded_at": {
                    '$gte': "dparser.parse(date_from, fuzzy=True)",
                    '$lte': "dparser.parse(date_to, fuzzy=True)"
                },
            },
        }

and the print the variable match, you will see it is equal to

{
  '$match': {
    g_uploaded_at: {
      '$gte': 'dparser.parse(date_from, fuzzy=True)',
      '$lte': 'dparser.parse(date_to, fuzzy=True)'
    }
  }
}

xtian_simon · June 2, 2022, 12:43pm

Rookie mistake! 5,4,3,2,1 character min.

steevej · June 2, 2022, 12:49pm

Not rookie, I still fall in this trap from time to time.

system · June 7, 2022, 12:49pm

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.