My collection is un-nested transaction data. Sometimes the schema changes, and I’d like to query the collection and get the _id of the first record after a change (or the last before. doesn’t matter only so it’s consistent).
The data always has a date or time feature, plus a few features I tag on g_* when the record is added to the collection:
[_id, ..., Time, ..., g_client_machinename, g_vendor_machinename, g_report_machinename, g_uploaded_at]
Below I have an aggregation pipeline which actually returns a row for each change in schema with two features in results:
-
all_keys, a concat of all features/headers -
unique_id, some record which has these features
But I need to know when the change happened. I need an explicit sort stage I can reason about. I added $sort before the concat, but I can’t get the g_uploaded_at feature into the output.
result = coll.aggregate([
{
'$project': {
'data': {'$objectToArray': "$$ROOT"},
'g_unique_id': 1,
'g_uploaded_at': 1, }
},
{'$unwind': "$data"},
{'$project': {'g_uploaded_at': 1, 'g_unique_id': 1, 'key': "$data.k", '_id': 0}},
{'$sort': {'g_uploaded_at': -1}},
{'$group': {'_id': "$g_unique_id", 'all_keys': {'$push': "$key"}}},
{
'$project': {
'all_keys': 1,
'all_keys_string': {
'$reduce': {
'input': "$all_keys",
'initialValue': "",
'in': {'$concat': ["$$value", "$$this"]}
}
}
}
},
{
'$group': {
'_id': "$all_keys_string",
'all_keys': {'$first': "$all_keys"},
'g_unique_id': {'$first': "$_id"}
}
},
{'$unset': "_id"}
])