Date sorting aggregation with group stage so I can reason about the results

My collection holds un-nested transaction data. Sometimes the schema changes, and I’d like to query the collection and get the _id of the first record after a change (or the last record before it; it doesn’t matter which, as long as it’s consistent).

The data always has a date or time feature, plus a few features I tag with a g_* prefix when the record is added to the collection:

[_id, ..., Time, ..., g_client_machinename, g_vendor_machinename, g_report_machinename, g_uploaded_at]

Below is an aggregation pipeline that does return a row for each change in schema, with two features in the results:

  • all_keys, a concatenation of all features/headers
  • g_unique_id, some record that has these features

But I need to know when the change happened. I need an explicit sort stage I can reason about. I added $sort before the concat, but I can’t get the g_uploaded_at feature into the output.

result = coll.aggregate([
    {
        '$project': {
            'data': {'$objectToArray': "$$ROOT"},
            'g_unique_id': 1,
            'g_uploaded_at': 1, }
    },
    {'$unwind': "$data"},
    {'$project': {'g_uploaded_at': 1, 'g_unique_id': 1, 'key': "$data.k", '_id': 0}},
    {'$sort': {'g_uploaded_at': -1}},
    {'$group': {'_id': "$g_unique_id", 'all_keys': {'$push': "$key"}}},
    {
        '$project': {
            'all_keys': 1,
            'all_keys_string': {
                '$reduce': {
                    'input': "$all_keys",
                    'initialValue': "",
                    'in': {'$concat': ["$$value", "$$this"]}
                }
            }
        }
    },
    {
        '$group': {
            '_id': "$all_keys_string",
            'all_keys': {'$first': "$all_keys"},
            'g_unique_id': {'$first': "$_id"}
        }
    },
    {'$unset': "_id"}
])

My first recommendation would be to $sort on g_uploaded_at as the first stage. After the two $project stages and the $unwind, your $sort ends up as a slow in-memory sort. If you $sort first, MongoDB can use an index. Even without an index it should be faster, because you are sorting before the $unwind, which multiplies the number of documents to sort. A minimal sketch of that reordering is below.
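Only the first stages are shown, using your field names; everything after $unwind stays exactly as in your pipeline:

result = coll.aggregate([
    # Sort the original documents first: an index on g_uploaded_at can be used here,
    # and there are far fewer documents to sort than after $unwind.
    {'$sort': {'g_uploaded_at': -1}},
    {
        '$project': {
            'data': {'$objectToArray': "$$ROOT"},
            'g_unique_id': 1,
            'g_uploaded_at': 1}
    },
    {'$unwind': "$data"},
    # ... remaining stages unchanged ...
])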

If schema changes are important, why not simply keep a schema version number in your documents, or something like g_schema_date? A rough example is below.
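Just a sketch, where g_schema_version and csv_row are names I am making up for illustration: stamp each document at upload time and bump the number whenever the CSV layout changes.

from datetime import datetime, timezone

doc = dict(csv_row)                           # csv_row: one parsed row of the tabular data
doc['g_uploaded_at'] = datetime.now(timezone.utc)
doc['g_schema_version'] = 3                   # bump whenever the CSV layout changes
coll.insert_one(doc)

Finding the first or last record of each schema then becomes a single $group on g_schema_version instead of reconstructing the schema from the keys.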

Note that since you do not apply $objectToArray recursively, whatever you do you will only detect fields added or removed at the root of the document. Starting with the following documents:

{ _id: ObjectId("62bb53499fc9e78118c732d5"), a: { b: 2 } }
{ _id: ObjectId("62bb53529fc9e78118c732d6"), a: { b: 2, c: 3 } }
{ _id: ObjectId("62bb54239fc9e78118c732d7"), a: 2 }

without applying $objectToArray recursively, the keys of $$ROOT will always be _id and a. You cannot detect if a field changes from a simple value to an array or object, and you cannot detect that field a.c has been added or removed.
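You can see this with a quick projection over the three documents above (again just a sketch):

for doc in coll.aggregate([
    {'$project': {
        '_id': 0,
        'keys': {
            '$map': {
                'input': {'$objectToArray': "$$ROOT"},
                'as': 'kv',
                'in': "$$kv.k"}}}}
]):
    print(doc['keys'])
# every document prints ['_id', 'a']; the change inside a is invisible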

If schema changes are important, why not simply keep a schema version number in your documents, or something like g_schema_date?

I think this comment is more about how I should design my application than about the query itself. I need a query to initialize the effective change dates for the schemas already in MongoDB.

Note that since you do not recursively […] you cannot detect that field a.c has been added or removed.

As I stated in my OP, my data is not nested. The original data is tabular transaction data from CSV. Not an issue for this project.