I’m developing a new aggregation pipeline for a series of queries on my Mongo database (g_reports
) and one of it’s collections (zip_vendor
). I know the data stored in Mongo is logically consistent, because when I query with normal find()
operator and use the data for other purposes I don’t see this behavior.
However, this aggregation with group()
stage returns records that are logically impossible results based on my first $match
stage.
Here is the aggregation I’m using (syntax is for python + pymongo):
coll = mongo_conn['all_reports']['zip_vendor']
result = coll.aggregate([
{
'$match': {
'g_client_machinename': 'foo_client'
},
'$match': {
"g_uploaded_at": {
'$gte': dparser.parse(date_from, fuzzy=True),
'$lte': dparser.parse(date_to, fuzzy=True)
},
},
},
{
'$project': {
'MyDate': {
'$dateToString' : {
'format': '%Y-%m-%d',
'date': {
'$convert': {
'input' : f'${txn_date_feature}',
'to' : 'date'
}
}
}
},
'g_client_machinename': 1,
'g_vendor_machinename': 1,
}
},
{
'$group': {
'_id': '$MyDate',
'count': {'$sum': 1 },
'g_client_machinename': {'$first': '$g_client_machinename'},
'g_vendor_machinename': {'$first': '$g_vendor_machinename'},
}
},
])
The resulting table contains the fields and data I require and appears to increment the counter +1 for each record grouped by date. The logic-bomb is this: I’m selecting the foo_client
in the $match
stage at the beginning of my aggregation, but the results show a mix of other client names:
_id | count | g_client_machinename | g_vendor_machinename | |
---|---|---|---|---|
0 | 2021-09-17 | 1 | foo_client | zip_vendor |
1 | 2021-09-18 | 1 | foo_client | zip_vendor |
2 | 2021-09-19 | 6 | bar_client | zip_vendor |
3 | 2021-09-21 | 2 | baz_client | zip_vendor |
I expect the aggregation to have already selected the client name foo_client
. Does this line ('g_client_machinename': {'$first': '$g_client_machinename'}
) in the group stage intentionally ignore this match stage and claw out of other records in the collection these client names (making false labels) or the whole records (contaminating the results)?
To give some MongoDB context to my aggregation example, here is the show for the database and collection names:
> show dbs
admin 0.000GB
config 0.000GB
g_reports 0.070GB
[...]
> use g_reports
switched to db g_reports
> show collections
__schema__
zip_vendor
[...]
On the application side–
Each record, in addition to it’s content fields, is stored with 2+ additional fields used by my app when I insert content and when I retrieve it: (g_client_machinename, g_vendor_machinename)
.
> db.zip_vendor.findOne()
{
"_id" : ObjectId("60a40c26184e56200062257f"),
"date" : ISODate("2021-05-10T08:45:00Z"),
"checkout id" : "9bf04b85-f7b0-4a62-b347-08944437a549",
"order number" : "5",
"your_store_id": "this is a foo client alias",
"user id" : "177ff067-86ee-41fa-8d36-48e68cad9972",
"this_is_a_goober_field_name_this_vendor_uses": "true",
[...]
"items" : "Eggs Benedict, Green & White Scramble, Side Three Pepper Home Fries (2), Buttermilk Pancakes",
"g_client_machinename" : "foo_client",
"g_vendor_machinename" : "zip_vendor",
"g_uploaded_at" : ISODate("2021-05-18T18:49:10.293Z"),
"g_unique_id" : "6bbd0d4b8fc27800e680691990b99f72d964673f40ab9132623c25cfd3f0007a"
}
Basically each “batch” of records stored is from one client for a particular vendor. I label (and validate) the record when I store it. So the theory goes, all I need is this 2-tuple to get it out.