Time series: How to get the most recent document?

Hi MongoDB community,

I’m using a time series collection to store IoT data and would like to get the most recent document in the collection to find out the current state of the IoT device that’s pushing the data.

My collection has about 300,000 documents, uses a timestamp field as the timeField, and has an index on timestamp, as suggested in the documentation.

I assumed that the following query would be efficient:

db.collection("iot-time-series")
  .find({}, { sort: { timestamp: -1 }, limit: 1 })
  .toArray()

However, it takes about two seconds to execute on an M10 cluster. If I interpret the explain output for this query correctly, MongoDB doesn’t use the index on timestamp and loads all 300,000 documents into memory to find the one with the most recent timestamp.

Explain output
{
  "explainVersion": "1",
  "stages": [
    {
      "$cursor": {
        "queryPlanner": {
          "namespace": "REDACTED.system.buckets.iot-time-series",
          "indexFilterSet": false,
          "parsedQuery": {},
          "queryHash": "8B3D4AB8",
          "planCacheKey": "D542626C",
          "maxIndexedOrSolutionsReached": false,
          "maxIndexedAndSolutionsReached": false,
          "maxScansToExplodeReached": false,
          "winningPlan": { "stage": "COLLSCAN", "direction": "forward" },
          "rejectedPlans": []
        },
        "executionStats": {
          "executionSuccess": true,
          "nReturned": 825,
          "executionTimeMillis": 1232,
          "totalKeysExamined": 0,
          "totalDocsExamined": 825,
          "executionStages": {
            "stage": "COLLSCAN",
            "nReturned": 825,
            "executionTimeMillisEstimate": 0,
            "works": 827,
            "advanced": 825,
            "needTime": 1,
            "needYield": 0,
            "saveState": 41,
            "restoreState": 41,
            "isEOF": 1,
            "direction": "forward",
            "docsExamined": 825
          },
          "allPlansExecution": []
        }
      },
      "nReturned": 825,
      "executionTimeMillisEstimate": 90
    },
    {
      "$_internalUnpackBucket": {
        "exclude": [],
        "timeField": "timestamp",
        "metaField": "metadata",
        "bucketMaxSpanSeconds": 3600
      },
      "nReturned": 294628,
      "executionTimeMillisEstimate": 1060
    },
    {
      "$sort": { "sortKey": { "timestamp": -1 }, "limit": 1 },
      "totalDataSizeSortedBytesEstimate": 0,
      "usedDisk": false,
      "nReturned": 1,
      "executionTimeMillisEstimate": 1227
    }
  ],
  "serverInfo": {
    "host": "REDACTED",
    "port": 27017,
    "version": "5.0.6",
    "gitVersion": "212a8dbb47f07427dae194a9c75baec1d81d9259"
  },
  "serverParameters": {
    "internalQueryFacetBufferSizeBytes": 104857600,
    "internalQueryFacetMaxOutputDocSizeBytes": 104857600,
    "internalLookupStageIntermediateDocumentMaxSizeBytes": 104857600,
    "internalDocumentSourceGroupMaxMemoryBytes": 104857600,
    "internalQueryMaxBlockingSortMemoryUsageBytes": 104857600,
    "internalQueryProhibitBlockingMergeOnMongoS": 0,
    "internalQueryMaxAddToSetBytes": 104857600,
    "internalDocumentSourceSetWindowFieldsMaxMemoryBytes": 104857600
  },
  "command": {
    "aggregate": "system.buckets.iot-time-series",
    "pipeline": [
      {
        "$_internalUnpackBucket": {
          "timeField": "timestamp",
          "metaField": "metadata",
          "bucketMaxSpanSeconds": 3600,
          "exclude": []
        }
      },
      { "$sort": { "timestamp": -1 } },
      { "$limit": 1 }
    ],
    "cursor": {},
    "maxTimeMS": 60000,
    "collation": { "locale": "simple" }
  },
  "ok": 1,
  "$clusterTime": {
    "clusterTime": { "$timestamp": "7068709110501343233" },
    "signature": {
      "hash": "/scbPt2f8gVBc/Jpq0GEGEc6Ze4=",
      "keyId": { "low": 2, "high": 1633075851, "unsigned": false }
    }
  },
  "operationTime": { "$timestamp": "7068709110501343233" }
}

Is there another, more efficient way to get the latest document in a time series collection? Does MongoDB make any guarantees regarding the order of time series documents?

Hi @christiank welcome to the community!

Since time-series collection is a view into a separate bucketed collection, the usual index improvements do not really apply to them. This is because indexes ties into a physical location of a document. With the underlying bucketed collection, individual time-series measurement do not really have a physical presence that can be tracked by an index.

One example: I have a weather time-series collection with a content like:

> db.weather.find()
{ "timestamp" : ISODate("2022-03-01T01:49:59.226Z"), "metadata" : "x", "temp" : 10, "_id" : ObjectId("621d7bc74cf7a4c8c6fc47cd") }
{ "timestamp" : ISODate("2022-03-01T01:50:00.342Z"), "metadata" : "x", "temp" : 10, "_id" : ObjectId("621d7bc84cf7a4c8c6fc47ce") }
{ "timestamp" : ISODate("2022-03-01T01:50:01.269Z"), "metadata" : "x", "temp" : 10, "_id" : ObjectId("621d7bc94cf7a4c8c6fc47cf") }
{ "timestamp" : ISODate("2022-03-01T01:50:02.065Z"), "metadata" : "x", "temp" : 10, "_id" : ObjectId("621d7bca4cf7a4c8c6fc47d0") }
{ "timestamp" : ISODate("2022-03-01T01:50:02.865Z"), "metadata" : "x", "temp" : 10, "_id" : ObjectId("621d7bca4cf7a4c8c6fc47d1") }

However this is only a view into the actual physical collection holding the data:

> show collections
...
weather                     [time-series]
system.buckets.weather
...

> db.system.buckets.weather.find()
[
  {
    _id: ObjectId("621d7b8c681c5835f9a3e0fa"),
    control: {
      version: 1,
      min: {
        timestamp: ISODate("2022-03-01T01:49:00.000Z"),
        temp: 10,
        _id: ObjectId("621d7bc74cf7a4c8c6fc47cd")
      },
      max: {
        timestamp: ISODate("2022-03-01T01:50:02.865Z"),
        temp: 10,
        _id: ObjectId("621d7bca4cf7a4c8c6fc47d1")
      }
    },
    meta: 'x',
    data: {
      timestamp: {
        '0': ISODate("2022-03-01T01:49:59.226Z"),
        '1': ISODate("2022-03-01T01:50:00.342Z"),
        '2': ISODate("2022-03-01T01:50:01.269Z"),
        '3': ISODate("2022-03-01T01:50:02.065Z"),
        '4': ISODate("2022-03-01T01:50:02.865Z")
      },
      temp: { '0': 10, '1': 10, '2': 10, '3': 10, '4': 10 },
      _id: {
        '0': ObjectId("621d7bc74cf7a4c8c6fc47cd"),
        '1': ObjectId("621d7bc84cf7a4c8c6fc47ce"),
        '2': ObjectId("621d7bc94cf7a4c8c6fc47cf"),
        '3': ObjectId("621d7bca4cf7a4c8c6fc47d0"),
        '4': ObjectId("621d7bca4cf7a4c8c6fc47d1")
      }
    }
  }
]

Those 5 documents inside the weather (view) collection is in reality only 1 document, physically.

Thus having a secondary index on timestamp:1 does not work the same way as a non time-series collection, and in this case cannot help to speed up your particular query. The underlying bucketed collection was designed to be aggregated instead of selecting individual “documents” inside the collection.

There is an ongoing effort to optimize your use case of querying the recent measurements though: SERVER-4507. Please watch/upvote the ticket!

Best regards
Kevin

4 Likes

I voted :up: :up: :up: !

Cheers!

1 Like

@kevinadi Thank you so much for your detailed answer and the example!

I found a workaround that uses the internal collection you mentioned to get the most recent document without loading all other documents into memory. The first query finds the bucket that contains the most recent timestamp (which is fast because the index on control.max.timestamp is used), and the second query uses the timestamp to look up the document in the “virtual” time series collection:

async function findMostRecent() {
  const bucket = await db.collection("system.buckets.weather").findOne(
    {},
    {
      sort: { "control.max.timestamp": -1 },
      projection: { "control.max.timestamp": 1 }
    }
  );

  if (!bucket) return;

  return db.collection("weather").findOne({
    timestamp: bucket.control.max.timestamp
  });
}

Do you think this limitation could be mentioned in the documentation (here)? The introduction to time series collections says: “Time series collections behave like normal collections. You can insert and query your data as you normally would.” And the first query example on that page is a lookup of a single document, so I think it’s fair to assume that an index on the timestamp is used when sorting.