How to get the collected document time in MongoDB

In a MongoDB Atlas $searchMeta aggregation:

from datetime import datetime, timedelta

results = my_collection.aggregate(
    [
        {
            "$searchMeta": {
                "index": "MsgAtlasIndex",
                "count": {"type": "total"},
                "compound": {
                    "must": [
                        {
                            "range": {
                                "path": "ts",
                                "gte": datetime.now() - timedelta(hours=24),
                            }
                        },
                        # each condition must be its own clause; duplicate
                        # "equals" keys in one dict would keep only the last
                        {"equals": {"path": "st", "value": 2}},
                        {"equals": {"path": "aff", "value": 2}},
                        {"equals": {"path": "src", "value": 6}},
                    ]
                },
            }
        }
    ]
)
list(results)

Output

[{'count': {'total': 38}}]

I need to get the time of the latest collected message via the dt key.

Wanted result:

[{'count': {'total': 38},
  'last_message_datetime': datetime.datetime(2023, 8, 16, 11, 30, 21)}]

I can use "last_message_datetime": {"$max": "$dt"} in a normal aggregation, but how do I do it using Atlas?

Sample document below:

{
    "_id": ObjectId("64da2904b028a6d62965d623"),
    "ts": datetime.datetime(2023, 8, 14, 13, 15, 48, 632000),
    "dt": datetime.datetime(2023, 8, 14, 13, 14, 58),
    "aff": 2,
    "src": 2,
    "st": 2,
}
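For reference, the "normal aggregation" mentioned above could look like this in PyMongo - a sketch, assuming the field names from the sample document (ts, dt, st, aff, src); my_collection stands for the collection handle:

```python
from datetime import datetime, timedelta

# The non-Atlas-Search pipeline: $match filters the documents,
# $group counts them and takes the maximum of the dt field.
pipeline = [
    {"$match": {
        "ts": {"$gte": datetime.now() - timedelta(hours=24)},
        "st": 2,
        "aff": 2,
        "src": 6,
    }},
    {"$group": {
        "_id": None,
        "count": {"$sum": 1},
        "last_message_datetime": {"$max": "$dt"},
    }},
]

# Requires a live connection, e.g.:
# results = list(my_collection.aggregate(pipeline))
```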

Hello, @ahmad_al_sharbaji ! :wave:

It seems you're overcomplicating the solution. I think you can use simple $match + $group stages in your aggregation pipeline.

Have a look at the example below - the dataset and the solution are similar to your case.

Example dataset:

db.messages.insertMany([
  {
    _id: 'M1',
    text: 'Lorem ipsum dolor sit amet',
    channelId: 1,
    createdAt: ISODate('2023-08-11T15:00:00.000Z'), // 15:00:00
  },
  {
    _id: 'M2',
    channelId: 1,
    text: 'Eiusmod tempor incididunt',
    createdAt: ISODate('2023-08-11T15:01:00.000Z'), // 15:01:00
  },
  {
    _id: 'M3',
    channelId: 1,
    text: 'Ut labore et dolore magna aliqua',
    createdAt: ISODate('2023-08-11T15:01:45.000Z'), // 15:01:45
  },
  {
    _id: 'M4',
    channelId: 2,
    text: 'Excepteur sint occaecat cupidatat',
    createdAt: ISODate('2023-08-11T15:00:25.000Z'), // 15:00:25
  },
  {
    _id: 'M5',
    channelId: 2,
    text: 'Fugiat nulla pariatur',
    createdAt: ISODate('2023-08-11T15:01:15.000Z'), // 15:01:15
  },
]);

Example aggregation pipeline. Notice that it contains the same conditions, but outside of a $searchMeta stage:

db.messages.aggregate([
  {
    $match: {
      channelId: 1,
      createdAt: {
        $gte: ISODate('2023-08-11T15:00:00.000Z'), // 15:00:00
      }
    }
  },
  {
    $group: {
      _id: null,
      count: {
        $sum: 1
      },
      largestCreatedAt: {
        $max: '$createdAt',
      }
    }
  }
]);

Output:

[
  {
    _id: null,
    count: 3, // total 3 messages found
    largestCreatedAt: ISODate("2023-08-11T15:01:45.000Z") // 15:01:45
  }
]

I used more verbose field names and ISODate objects as values instead of timestamps, but you should see the similarity and understand the solution. If not - let me know :wink:

Dear Mr. @slava ,
I really appreciate the time you spent on this great reply! Thank you so much.

Actually, I already tried this approach, but with a collection that contains more than 70M documents this aggregation takes around 25 minutes, which is bad for our production side.

That's why I decided to use $searchMeta - it now takes around 26 seconds! But this approach returns only the count, without the document details, and here is my problem: I cannot get the document time.

I again appreciate the time you spent, but can you help me with this approach? Or could you propose any other fast approach like $searchMeta?

I can't wait for your response! Regards.

@ahmad_al_sharbaji ,

Such a difference in speed may be because an aggregation with $searchMeta relies heavily on underlying indexes, while the $match + $group aggregation I provided above does not. What indexes do you have on your collection? You can check with the db.collection.getIndexes() method.

To make the aggregation above faster, you should create a compound index that includes:

  • The field that you will use to sort documents (in your case last_message_datetime, which you use to select documents within a date range). Since you only need the latest set of documents (those from the last 24 hours), make that index key descending (-1).
  • The fields that are used to filter out unnecessary documents (in the example you provided those fields are: st, aff, src).

Example of a compound index definition (relies on the example dataset I provided initially):

db.messages.createIndex({
  createdAt: -1, // for sorting
  channelId: 1 // for filtering
});

Try this out and let me know if it helped :wink:

@slava You can’t believe how much I appreciate your help!

Indeed, I already have an Atlas compound index called "MsgAtlasIndex".

I tried the solution and it still took a lot of time in production :frowning: I clearly see no solution but the $searchMeta one… But the problem with $searchMeta is that it returns numbers, not documents, so I can't get into the result details and extract the datetime.

What do you think?

@ahmad_al_sharbaji ,

The main problem with your $searchMeta approach is that $searchMeta is meant to provide metadata (information about the data, such as bucket boundaries or the total number of documents that fall under a certain set of conditions), not the data itself.

However, I did not say it is not possible to get the data you want. But you're gonna need to play dirty :grin:
If you look closely at the example of how $searchMeta is used with facets, you can see that some data can be extracted - it just needs to be turned into good bucket names. A good bucket name, in your case, should:

  • represent the document's position in the result set, so it is clear whether it is older or newer than the other ones.
  • be unique enough that you can easily select one document among the others.

I think a good candidate for that name is the createdAt field (see my example dataset below), but stringified - createdAtString.

I will demonstrate the idea with examples below.

Example dataset:

db.messages.insertMany([
  {
    _id: 'M1',
    text: 'Lorem ipsum dolor sit amet',
    channelId: 1,
    createdAt: ISODate('2023-08-11T15:00:00.000Z'), // 15:00:00
    createdAtString: '2023-08-11T15:00:00.000Z',
  },
  {
    _id: 'M2',
    channelId: 1,
    text: 'Eiusmod tempor incididunt',
    createdAt: ISODate('2023-08-11T15:30:00.000Z'), // 15:30:00
    createdAtString: '2023-08-11T15:30:00.000Z',
  },
  {
    _id: 'M3',
    channelId: 1,
    text: 'Ut labore et dolore magna aliqua',
    createdAt: ISODate('2023-08-11T16:00:45.000Z'), // 16:00:45
    createdAtString: '2023-08-11T16:00:45.000Z',
  },
  {
    _id: 'M4',
    channelId: 2,
    text: 'Excepteur sint occaecat cupidatat',
    createdAt: ISODate('2023-08-11T15:00:25.000Z'), // 15:00:25
    createdAtString: '2023-08-11T15:00:25.000Z',
  },
  {
    _id: 'M5',
    channelId: 2,
    text: 'Fugiat nulla pariatur',
    createdAt: ISODate('2023-08-11T16:01:15.000Z'), // 16:01:15
    createdAtString: '2023-08-11T16:01:15.000Z',
  },
]);

Example aggregation pipeline that contains conditions close to your query:

db.messages.aggregate([
  {
    $searchMeta: {
      index: 'messages-test-search',
      count: {
        type: 'total'
      },
      facet: {
        operator: {
          compound: {
            must: [
              {
                range: {
                  path: 'createdAt',
                  gte: ISODate('2023-08-11T15:00:00.000Z'),
                  lte: ISODate('2023-08-11T16:00:00.000Z')
                },
              },
              {
                equals: {
                  path: 'channelId',
                  value: 1
                }
              }
              // you can add more equality conditions in this array
            ]
          },
        },
        facets: {
          myFacet: {
            type: 'string',
            // field 'createdAtString' will be used as a bucket name
            path: 'createdAtString',
          }
        }
      },
    }
  },
]);

Output:

[
  {
    count: { total: Long("2") },
    facet: {
      myFacet: {
        buckets: [
          { _id: '2023-08-11T15:00:00.000Z', count: Long("1") },
          { _id: '2023-08-11T15:30:00.000Z', count: Long("1") }
        ]
      }
    }
  }
]

As you can see, now we have both metadata (total documents selected) and actual data (the documents' createdAt) in the form of bucket names. Then, with a little effort, we can get the result you want:

db.messages.aggregate([
  {
    $searchMeta: { /* unchanged */ }
  },
  {
    $unwind: '$facet.myFacet.buckets'
  },
  {
    $project: {
      total: '$count.total',
      createdAtBoundary: {
        $toDate: '$facet.myFacet.buckets._id',
      }
    }
  },
  {
    $group: {
      _id: null,
      total: {
        $first: '$total'
      },
      largestCreatedAt: {
        $max: '$createdAtBoundary'
      }
    }
  }
]);

Final output:

[
  {
    _id: null,
    total: Long("2"),
    largestCreatedAt: ISODate("2023-08-11T15:30:00.000Z")
  }
]
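Alternatively, the $unwind / $project / $group post-processing could be done client-side in Python - a sketch, assuming the $searchMeta output shape shown above (summarize_search_meta is a hypothetical helper name; the Python driver returns Long values as plain ints):

```python
from datetime import datetime

def summarize_search_meta(meta_doc):
    """Client-side equivalent of the $unwind / $project / $group stages:
    pull the total count and the latest datetime out of the stringified
    bucket names returned by $searchMeta."""
    buckets = meta_doc["facet"]["myFacet"]["buckets"]
    dates = [
        # parse each bucket name back into a datetime
        datetime.strptime(b["_id"], "%Y-%m-%dT%H:%M:%S.%fZ")
        for b in buckets
    ]
    return {
        "total": meta_doc["count"]["total"],
        "last_message_datetime": max(dates),
    }

# Shape matches the $searchMeta output above
meta = {
    "count": {"total": 2},
    "facet": {"myFacet": {"buckets": [
        {"_id": "2023-08-11T15:00:00.000Z", "count": 1},
        {"_id": "2023-08-11T15:30:00.000Z", "count": 1},
    ]}},
}
print(summarize_search_meta(meta))
# {'total': 2, 'last_message_datetime': datetime.datetime(2023, 8, 11, 15, 30)}
```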

Atlas Search Index configuration object I used:

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "channelId": {
        "type": "number"
      },
      "createdAt": {
        "type": "date"
      },
      "createdAtString": {
        "type": "stringFacet"
      }
    }
  }
}

If you are afraid that two documents in your collection can have the exact same createdAt value, you can, for example, append the document id to that string, so the value would look like ‘2023-08-11T15:30:00.000Z__M1’. But in this case you will have to add extra stages to disassemble this string in order to work with dates.
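Building and disassembling such a composite bucket name is straightforward. A sketch in Python (make_bucket_name and parse_bucket_name are hypothetical helper names, and '__' is an arbitrary separator):

```python
from datetime import datetime

SEP = "__"  # arbitrary separator between the stringified date and the document id

def make_bucket_name(created_at_string, doc_id):
    # e.g. '2023-08-11T15:30:00.000Z__M1'
    return f"{created_at_string}{SEP}{doc_id}"

def parse_bucket_name(name):
    # split from the right, so the date part stays intact
    date_part, doc_id = name.rsplit(SEP, 1)
    return datetime.strptime(date_part, "%Y-%m-%dT%H:%M:%S.%fZ"), doc_id

name = make_bucket_name("2023-08-11T15:30:00.000Z", "M1")
print(name)                     # 2023-08-11T15:30:00.000Z__M1
print(parse_bucket_name(name))  # (datetime.datetime(2023, 8, 11, 15, 30), 'M1')
```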

Again, it is a dirty solution, but it works with $searchMeta :grin:

@slava This dirty solution needs just a little adjustment to become a complete lifesaver!
I don’t know why you made createdAtString a string. The field I use to get the latest time is a date field.

Can you please make createdAtString a date field? I applied it with a dateFacet, but it didn’t work correctly. Please convert createdAtString to a date field, give it a dateFacet index, and try the example.

This is completely what I want!! I can’t wait for your adjustment, kindly.

@slava You are a serious genius! Please just make createdAtString a date field, not a string, give it dateFacet indexing, and demonstrate it again.

@ahmad_al_sharbaji ,

The dirty solution won’t work with a date field. If you look at the date facet syntax, you will see that it requires you to specify boundaries, which you have to know beforehand. Moreover, in this case the boundaries array would have to cover every possible datetime in the given period, so that no bucket contains more than 1 message.
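To see why enumerating boundaries is impractical: even at one-second resolution, covering a single day already means tens of thousands of boundary values. A quick illustration in Python:

```python
from datetime import datetime, timedelta

# One boundary per second over 24 hours: a 'date' facet would need
# all of these listed up front to isolate individual messages.
start = datetime(2023, 8, 11)
boundaries = [start + timedelta(seconds=s) for s in range(24 * 60 * 60 + 1)]
print(len(boundaries))  # 86401
```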

Using the stringified version (createdAtString) of the createdAt field is the integral, dirty part of the solution.

It is the only way to get some data out of $searchMeta. Either go dirty or use another solution that does not involve $searchMeta.

@slava Lifesaver!!! I can’t thank you enough, man! God bless.

Could you please see this topic too? It’s way easier!

Thanks in advance!

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.