Time series: How to get the most recent document?

Hi MongoDB community,

I’m using a time series collection to store IoT data and would like to get the most recent document in the collection to find out the current state of the IoT device that’s pushing the data.

My collection has about 300,000 documents, uses a timestamp field as the timeField, and has an index on timestamp, as suggested in the documentation.

I assumed that the following query would be efficient:

db.collection("iot-time-series")
  .find({}, { sort: { timestamp: -1 }, limit: 1 })
  .toArray()

However, it takes about two seconds to execute on an M10 cluster. If I interpret the explain output for this query correctly, MongoDB doesn’t use the index on timestamp and loads all 300,000 documents into memory to find the one with the most recent timestamp.

Explain output
{
  "explainVersion": "1",
  "stages": [
    {
      "$cursor": {
        "queryPlanner": {
          "namespace": "REDACTED.system.buckets.iot-time-series",
          "indexFilterSet": false,
          "parsedQuery": {},
          "queryHash": "8B3D4AB8",
          "planCacheKey": "D542626C",
          "maxIndexedOrSolutionsReached": false,
          "maxIndexedAndSolutionsReached": false,
          "maxScansToExplodeReached": false,
          "winningPlan": { "stage": "COLLSCAN", "direction": "forward" },
          "rejectedPlans": []
        },
        "executionStats": {
          "executionSuccess": true,
          "nReturned": 825,
          "executionTimeMillis": 1232,
          "totalKeysExamined": 0,
          "totalDocsExamined": 825,
          "executionStages": {
            "stage": "COLLSCAN",
            "nReturned": 825,
            "executionTimeMillisEstimate": 0,
            "works": 827,
            "advanced": 825,
            "needTime": 1,
            "needYield": 0,
            "saveState": 41,
            "restoreState": 41,
            "isEOF": 1,
            "direction": "forward",
            "docsExamined": 825
          },
          "allPlansExecution": []
        }
      },
      "nReturned": 825,
      "executionTimeMillisEstimate": 90
    },
    {
      "$_internalUnpackBucket": {
        "exclude": [],
        "timeField": "timestamp",
        "metaField": "metadata",
        "bucketMaxSpanSeconds": 3600
      },
      "nReturned": 294628,
      "executionTimeMillisEstimate": 1060
    },
    {
      "$sort": { "sortKey": { "timestamp": -1 }, "limit": 1 },
      "totalDataSizeSortedBytesEstimate": 0,
      "usedDisk": false,
      "nReturned": 1,
      "executionTimeMillisEstimate": 1227
    }
  ],
  "serverInfo": {
    "host": "REDACTED",
    "port": 27017,
    "version": "5.0.6",
    "gitVersion": "212a8dbb47f07427dae194a9c75baec1d81d9259"
  },
  "serverParameters": {
    "internalQueryFacetBufferSizeBytes": 104857600,
    "internalQueryFacetMaxOutputDocSizeBytes": 104857600,
    "internalLookupStageIntermediateDocumentMaxSizeBytes": 104857600,
    "internalDocumentSourceGroupMaxMemoryBytes": 104857600,
    "internalQueryMaxBlockingSortMemoryUsageBytes": 104857600,
    "internalQueryProhibitBlockingMergeOnMongoS": 0,
    "internalQueryMaxAddToSetBytes": 104857600,
    "internalDocumentSourceSetWindowFieldsMaxMemoryBytes": 104857600
  },
  "command": {
    "aggregate": "system.buckets.iot-time-series",
    "pipeline": [
      {
        "$_internalUnpackBucket": {
          "timeField": "timestamp",
          "metaField": "metadata",
          "bucketMaxSpanSeconds": 3600,
          "exclude": []
        }
      },
      { "$sort": { "timestamp": -1 } },
      { "$limit": 1 }
    ],
    "cursor": {},
    "maxTimeMS": 60000,
    "collation": { "locale": "simple" }
  },
  "ok": 1,
  "$clusterTime": {
    "clusterTime": { "$timestamp": "7068709110501343233" },
    "signature": {
      "hash": "/scbPt2f8gVBc/Jpq0GEGEc6Ze4=",
      "keyId": { "low": 2, "high": 1633075851, "unsigned": false }
    }
  },
  "operationTime": { "$timestamp": "7068709110501343233" }
}

Is there another, more efficient way to get the latest document in a time series collection? Does MongoDB make any guarantees regarding the order of time series documents?

2 Likes

Hi @christiank welcome to the community!

Since time-series collection is a view into a separate bucketed collection, the usual index improvements do not really apply to them. This is because indexes ties into a physical location of a document. With the underlying bucketed collection, individual time-series measurement do not really have a physical presence that can be tracked by an index.

One example: I have a weather time-series collection with a content like:

> db.weather.find()
{ "timestamp" : ISODate("2022-03-01T01:49:59.226Z"), "metadata" : "x", "temp" : 10, "_id" : ObjectId("621d7bc74cf7a4c8c6fc47cd") }
{ "timestamp" : ISODate("2022-03-01T01:50:00.342Z"), "metadata" : "x", "temp" : 10, "_id" : ObjectId("621d7bc84cf7a4c8c6fc47ce") }
{ "timestamp" : ISODate("2022-03-01T01:50:01.269Z"), "metadata" : "x", "temp" : 10, "_id" : ObjectId("621d7bc94cf7a4c8c6fc47cf") }
{ "timestamp" : ISODate("2022-03-01T01:50:02.065Z"), "metadata" : "x", "temp" : 10, "_id" : ObjectId("621d7bca4cf7a4c8c6fc47d0") }
{ "timestamp" : ISODate("2022-03-01T01:50:02.865Z"), "metadata" : "x", "temp" : 10, "_id" : ObjectId("621d7bca4cf7a4c8c6fc47d1") }

However this is only a view into the actual physical collection holding the data:

> show collections
...
weather                     [time-series]
system.buckets.weather
...

> db.system.buckets.weather.find()
[
  {
    _id: ObjectId("621d7b8c681c5835f9a3e0fa"),
    control: {
      version: 1,
      min: {
        timestamp: ISODate("2022-03-01T01:49:00.000Z"),
        temp: 10,
        _id: ObjectId("621d7bc74cf7a4c8c6fc47cd")
      },
      max: {
        timestamp: ISODate("2022-03-01T01:50:02.865Z"),
        temp: 10,
        _id: ObjectId("621d7bca4cf7a4c8c6fc47d1")
      }
    },
    meta: 'x',
    data: {
      timestamp: {
        '0': ISODate("2022-03-01T01:49:59.226Z"),
        '1': ISODate("2022-03-01T01:50:00.342Z"),
        '2': ISODate("2022-03-01T01:50:01.269Z"),
        '3': ISODate("2022-03-01T01:50:02.065Z"),
        '4': ISODate("2022-03-01T01:50:02.865Z")
      },
      temp: { '0': 10, '1': 10, '2': 10, '3': 10, '4': 10 },
      _id: {
        '0': ObjectId("621d7bc74cf7a4c8c6fc47cd"),
        '1': ObjectId("621d7bc84cf7a4c8c6fc47ce"),
        '2': ObjectId("621d7bc94cf7a4c8c6fc47cf"),
        '3': ObjectId("621d7bca4cf7a4c8c6fc47d0"),
        '4': ObjectId("621d7bca4cf7a4c8c6fc47d1")
      }
    }
  }
]

Those 5 documents inside the weather (view) collection is in reality only 1 document, physically.

Thus having a secondary index on timestamp:1 does not work the same way as a non time-series collection, and in this case cannot help to speed up your particular query. The underlying bucketed collection was designed to be aggregated instead of selecting individual “documents” inside the collection.

There is an ongoing effort to optimize your use case of querying the recent measurements though: SERVER-4507. Please watch/upvote the ticket!

Best regards
Kevin

4 Likes

I voted :up: :up: :up: !

Cheers!

1 Like

@kevinadi Thank you so much for your detailed answer and the example!

I found a workaround that uses the internal collection you mentioned to get the most recent document without loading all other documents into memory. The first query finds the bucket that contains the most recent timestamp (which is fast because the index on control.max.timestamp is used), and the second query uses the timestamp to look up the document in the “virtual” time series collection:

async function findMostRecent() {
  const bucket = await db.collection("system.buckets.weather").findOne(
    {},
    {
      sort: { "control.max.timestamp": -1 },
      projection: { "control.max.timestamp": 1 }
    }
  );

  if (!bucket) return;

  return db.collection("weather").findOne({
    timestamp: bucket.control.max.timestamp
  });
}

Do you think this limitation could be mentioned in the documentation (here)? The introduction to time series collections says: “Time series collections behave like normal collections. You can insert and query your data as you normally would.” And the first query example on that page is a lookup of a single document, so I think it’s fair to assume that an index on the timestamp is used when sorting.

1 Like

Is there any update on this question? The performance of time series collections is in this regard very disappointing. Those queries work pretty well with normal collections and now you have to use a workaround?
Even though the data is stored internally in buckets, it will be sufficient to get the most recent bucket by sorting the buckets based on the time index. I tried with version 6.0, the performance is still very poor.

Regards,
Martin

Thank you for this post. I was struggling with the same question. Your post steered me in the right direction on how to solve this, as did this post. In my case I need to get the most recent sensor values of a device. The sensor is identified by the ‘name’ field and the device by the thingID field. So aggregate using $match by thingID and $group on name. The records are added with ‘metadata’ set to “thingID” and “name” as was recommended elsewhere. This speeds up the aggregate query by factor 5 or so. (add ‘name’ didn’t make a difference though)

I found that the sort on timestamp worked as expected but using control.max.timestamp didn’t return the most recent record. Also, both have similar performance. Getting the most recent sensors from 100K records with 10 devices and 5 sensor types takes 20msec on my desktop PC, which is a i5-4570S CPU @ 2.90GHz. 300K records takes 63msec, so it looks linear.

My code below. This is in golang. I’m new to mongodb so please forgive the ugliness. It feels like a rather cludgy way to get the results. For example, how to get all fields in the result instead of listing each one using $first?

Creating the collection:

      // prepare options
	tso := &options.TimeSeriesOptions{
		TimeField: "timestamp",
	}
	tso.SetMetaField("metadata")
	tso.SetGranularity("minutes")
	co := &options.CreateCollectionOptions{
		DefaultIndexOptions: nil,
		MaxDocuments:        nil,
		StorageEngine:       nil,
	}
	co.SetTimeSeriesOptions(tso)
    // events
   err = srv.storeDB.CreateCollection(ctx, srv.eventCollectionName, co)

Adding an event:

// AddEvent adds a new event to the history store
// The event 'created' field will be used as timestamp after parsing it using time.RFC3339
func (srv *HistoryStoreServer) AddEvent(ctx context.Context, args *thing.ThingValue) (*emptypb.Empty, error) {
	// Name and ThingID are required fields
	if args.Name == "" || args.ThingID == "" {
		err := fmt.Errorf("missing name or thingID")
		logrus.Warning(err)
		return nil, err
	}
	if args.Created == "" {
		args.Created = time.Now().UTC().Format(time.RFC3339)
	}

	// It would be nice to simply use bson marshal, but that isn't possible as the
	// required timestamp needs to be added in BSON format.
	//createdTime, err := time.Parse("2006-01-02T15:04:05-07:00", args.Created)
	createdTime, err := time.Parse(time.RFC3339, args.Created)
	timestamp := primitive.NewDateTimeFromTime(createdTime)
	evBson := bson.M{
		TimeStampField: timestamp,
            // the metadata on thingID and name speeds up aggregate query by factor 5
		"metadata":     bson.M{"thingID": args.ThingID, "name": args.Name},
		"name":         args.Name,
		"thingID":      args.ThingID,
		"valueID":      args.ValueID,
		"value":        args.Value,
		"created":      args.Created,
		"actionID":     args.ActionID,
	}
	res, err := srv.eventCollection.InsertOne(ctx, evBson)
	_ = res
	return nil, err
}

Get the most recent sensor values of a device:

// GetLatestValues returns the last received event/properties of a Thing
func (srv *HistoryStoreServer) GetLatestValues(ctx context.Context,
	args *svc.GetLatest_Args) (*svc.ThingValueMap, error) {
	values := &svc.ThingValueMap{PropValues: make(map[string]*thing.ThingValue)}
	matchStage := bson.D{
		{"$match",
			bson.D{
				{"thingID", args.ThingID},
			},
		},
	}
	sortStage := bson.D{
		{"$sort",
			bson.D{
				{"timestamp", -1},
				//{"control.max.timestamp", -1},
			},
		},
	}
	groupStage := bson.D{
		{"$group",
			bson.D{
				{"_id", "$name"},
				{"name", bson.M{"$first": "$name"}},
				{"created", bson.M{"$first": "$created"}},
				{"value", bson.M{"$first": "$value"}},
				{"valueID", bson.M{"$first": "$valueID"}},
				{"thingID", bson.M{"$first": "$thingID"}},
			},
		},
	}
	pipeline := mongo.Pipeline{matchStage, sortStage, groupStage}
	aggOptions := &options.AggregateOptions{}
	cursor, err := srv.eventCollection.Aggregate(ctx, pipeline, aggOptions)
	if err != nil {
		logrus.Error(err)
		return nil, err
	}
	count := 0
	for cursor.Next(ctx) {
		value := thing.ThingValue{}
		err = cursor.Decode(&value)
		values.PropValues[value.Name] = &value
		count++
	}
	logrus.Infof("Iterated %d values", count)
	return values, nil
}

Some additional findings:

  1. I found that the sort on timestamp worked as expected but using control.max.timestamp didn’t return the most recent record. Maybe because the timestamp on ingress is random. Should it be sequential?
  2. Performance of ‘control.max.timestamp’ is slightly better than using ‘timestamp’ 170 vs 215msec.
  3. Getting the most recent sensor vales from 100K records with 10 devices (10K values per device) and 10 different sensors to aggregate takes 20msec on my desktop PC, which is a i5-4570S CPU @ 2.90GHz.
  4. As above but with all 100K records of 1 device, 10 different sensors. Aggregate takes 215msec. The increase in time looks linear with the number of records, so it is probably scanning them all.
  5. Removing the match stage - all records are a match anyways - reduced the time from 215msec to 190msec. So the match stage is pretty fast.
  6. Removing the sort stage obviously doesn’t yield the proper result but it reduced the time from 215msec to 102msec.
  7. Removing the group stage increased the time from 215msec to 680msec
    Interestingly, reducing the number of sensors in the dataset from 10 to 2 changed the timing from 215msec to 208msec. Reducing the number of sensors to aggregate to just _id reduced time from 215msec to 144msec. (I also voted for server-4507)
  8. Last, following the approach provided by @kevinadi, getting only the object ID from the group and using findOne to read the record is approximately the same speed. If more fields need to be grouped however it would be faster. Note that this doesn’t use any additional indexes so there might be faster ways.

Hi,

thanks for the explanation. My personal speculation :slight_smile: - It seems that the optimizer does not really now how many actual “documents” are stored in a bucket in order to push the limit step up in the query plan. In a normal collection, the index size corresponds to the the number of “documents” and it is enough to scan the index only to get the number of required document as specified by the limit op in the query. In a time-series collection, the number of documents in a bucket varies and this metadata is probably not present to the optimizer. Therefore it cannot push the step up in the plan but requires to first unpack all the buckets that match the query criteria and then perform the limit step.
This is a significant difference in the behavior and IMHO it has to be listed in the limitations of time series collections. In a normal collection the query predicate can match a billion documents but if you only want 10, it will return pretty fast. Whereas doing the same query on a time series collection may lead to an unresponsive state of the database for some time.

Regards,
Martin