TimeSeries last x documents

Hi,

I am trying to retrieve the most recent X documents inserted into a time series collection using pymongo. The query I use is the following:

pipeline = [
    { "$sort": { "timestamp": -1 } },
    { "$limit": 10 }
]
query = mongodb[collection].aggregate(pipeline)

This results in a very slow execution time (3-4 seconds) on a ~4 GB collection, but only about 70 ms on an 8 MB collection. An explain also shows that a COLLSCAN is being performed, which would explain the performance hit.
I have already created an index for the timestamp field like this:

mongodb[collection].create_index([ ("timestamp", -1) ])

I think that this use case is quite common and I don't get what I am doing wrong.

Thank you for your help.

Hello @Daniel_Lux and welcome to the MongoDB community!

MongoDB time series collections are essentially non-materialised views backed by internal collections under system.buckets. Queries against a time series collection take advantage of this bucketed format to return results faster.
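
For illustration, here is a minimal pymongo sketch that lists the backing bucket collection (the connection, the database name "test", and the collection name "sample" are placeholders):

from pymongo import MongoClient

db = MongoClient()["test"]  # placeholder connection and database name

# On MongoDB 6.0, a time series collection such as "sample" is backed by an
# internal bucket collection named "system.buckets.sample":
print(db.list_collection_names(filter={"name": {"$regex": "^system\\.buckets"}}))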

Query optimisation on a time series collection works differently than on a normal collection; hence, db.<collection>.explain().<query> reports on the underlying bucket collection rather than on the non-materialised view.

A time series collection makes use of the clustered index that is created by default, and to further increase performance you can manually add secondary indexes to the collection.
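
For example, with pymongo (reusing the handles from the original question; "device" is a placeholder metaField name, so adjust it to your schema):

import pymongo

# A compound secondary index on the metaField followed by the timeField is
# the common pattern for time series collections:
mongodb[collection].create_index(
    [("device", pymongo.ASCENDING), ("timestamp", pymongo.DESCENDING)]
)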

However, for a dataset like the one described above, I tried to reproduce the issue on MongoDB 6.0 with around 1 GB of data and observed the following:

Using the query:

db.sample.explain().aggregate([ {$sort:{timestamp:-1}}, {$limit:10} ])

This query takes around 16 seconds to return the expected documents and uses a COLLSCAN to fetch them, which explains the delay.

....
executionStats: {
          executionSuccess: true,
          nReturned: 11001,
          executionTimeMillis: 16937,
          totalKeysExamined: 0,
          totalDocsExamined: 11001,
          executionStages: {
            stage: 'COLLSCAN',
            nReturned: 11001 
....

However, using a range query such as:

db.sample.explain().aggregate([ {$match:{timestamp:{$gte:ISODate("2021-01-10T10:02:18.242Z")}}}, {$sort:{timestamp:-1}}, {$limit:10} ])

subsequently reduces the execution time for the query and makes use of the clustered index.

....
executionStats: {
          executionSuccess: true,
          nReturned: 11001,
          executionTimeMillis: 8607,
          totalKeysExamined: 0,
          totalDocsExamined: 11001,
          executionStages: {
            stage: 'CLUSTERED_IXSCAN',
            filter: {
              '$and': [
                {
                  _id: { '$gte': ObjectId("5fd343aa0000000000000000") }
                },
                {
                  'control.max.timestamp': {
                    '$_internalExprGte': ISODate("2021-01-10T10:02:18.242Z")
                  }
                },
                {
                  'control.min.timestamp': {
                    '$_internalExprGte': ISODate("2020-12-11T10:02:18.242Z")
                  }
                }
              ]
            },
            nReturned: 11001,
            executionTimeMillisEstimate: 2,
....

This shows that the range query targets only the relevant buckets, unpacks them, and produces the output, whereas a query without the range predicate has to unpack all buckets to produce the output documents, which explains the COLLSCAN in the execution stats.

Between the two queries, the executionTimeMillis value has dropped from ~16 seconds to ~8 seconds on average.
Therefore, the recommendation is to make use of a range query to optimise the response from the collection.
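
Translated back to pymongo, the original pipeline with such a range predicate could look like this sketch (the 24-hour window is an arbitrary example; choose one that reliably contains at least the 10 documents you want):

from datetime import datetime, timedelta

# Reusing the handles from the original question.
cutoff = datetime.utcnow() - timedelta(hours=24)  # arbitrary example window
pipeline = [
    {"$match": {"timestamp": {"$gte": cutoff}}},
    {"$sort": {"timestamp": -1}},
    {"$limit": 10},
]
query = mongodb[collection].aggregate(pipeline)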

In conclusion, a time series collection works differently than a normal collection, and as of MongoDB 6.0 the usual query optimisation methods may not work as expected. Note, however, that this may change in the future.

Let us know if you have any further questions.

Best regards
Aasawari

Hi @Aasawari, thank you for answering on this topic!

The presented solution does cut down the execution time. But it seems rather inefficient to look up all documents when they are sorted by timestamp anyway (even the buckets are sorted by "time window", I guess).
My naive approach would be to scan the first bucket with a COLLSCAN and, if that does not return the required number of documents, scan the next bucket.

The issue is that the data I store is not inserted at equal intervals, so I don't know which time range to filter on first. I could store an average insertion interval for each of my collections elsewhere and use that to calculate an appropriate filter first.
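
Roughly, this is the workaround I have in mind, sketched in pymongo (untested; the starting window and the cap are guesses I would derive from that stored average interval):

from datetime import datetime, timedelta

def last_n_docs(coll, n, start=timedelta(minutes=10), cap=timedelta(days=365)):
    # Start with a small time window and double it until the query returns
    # enough documents; only the very oldest data falls through to the
    # unbounded (slow) query at the end.
    window = start
    while window <= cap:
        cutoff = datetime.utcnow() - window
        docs = list(coll.aggregate([
            {"$match": {"timestamp": {"$gte": cutoff}}},
            {"$sort": {"timestamp": -1}},
            {"$limit": n},
        ]))
        if len(docs) >= n:
            return docs
        window *= 2
    return list(coll.aggregate([{"$sort": {"timestamp": -1}}, {"$limit": n}]))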

There is one thing I noticed: your first example states executionTimeMillis: 16937, which corresponds to your 16 seconds execution time. The later example showed executionTimeMillis: 21277, which is 5 seconds more. Could you explain this discrepancy in more detail?

Thank you for your help.

Regards
Daniel

Hi @Daniel_Lux

Thank you for pointing out the difference in the execution times of the two queries.
I have edited the post with the correct execution time, which drops from 16 seconds to nearly 8 seconds on average when using the range query.

Let us know if you have any further queries.

Regards
Aasawari

I tried to investigate this further. My aggregate does not run a CLUSTERED_IXSCAN but a COLLSCAN instead.

project_7> db.data.explain().aggregate([{ $match: { "ts" : { $gte: ISODate("2022-09-14T16:31:05.187Z") } } },{ $sort: { "ts": -1 } },{ $limit: 20 }])
{
  explainVersion: '1',
  queryPlanner: {
    namespace: 'project_7.data',
    indexFilterSet: false,
    parsedQuery: { ts: { '$gte': ISODate("2022-09-14T16:31:05.187Z") } },
    queryHash: '26B52568',
    planCacheKey: '26B52568',
    optimizedPipeline: true,
    maxIndexedOrSolutionsReached: false,
    maxIndexedAndSolutionsReached: false,
    maxScansToExplodeReached: false,
    winningPlan: {
      stage: 'SORT',
      sortPattern: { ts: -1 },
      memLimit: 104857600,
      limitAmount: 20,
      type: 'simple',
      inputStage: {
        stage: 'COLLSCAN',
        filter: { ts: { '$gte': ISODate("2022-09-14T16:31:05.187Z") } },
        direction: 'forward'
      }
    },
    rejectedPlans: []
  },

Note that I replaced "timestamp" with "ts".

It seems like my collection creation code did not create a time series collection.

db.create_collection("data", timeseries={ "timeField": "ts", "metaField": "device", "granularity": "seconds" })

project_7> db.runCommand({ listCollections: 1, filter: { type: "timeseries" } })
{
  cursor: {
    id: Long("0"),
    ns: 'project_7.$cmd.listCollections',
    firstBatch: []
  },
  ok: 1
}
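
For anyone else who hits this, the same check works from pymongo as well (a quick sketch, assuming the connection from above):

from pymongo import MongoClient

db = MongoClient()["project_7"]  # same database as above

# An empty list here means the database contains no time series collection:
print(list(db.list_collections(filter={"type": "timeseries"})))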

I feel stupid for not checking that earlier.