Duplicate Data Issue

kevinadi · December 3, 2021, 8:57am

A time series collection is quite different from a normal MongoDB collection. Although it superficially behaves like a normal collection, MongoDB treats time series collections as writable non-materialized views on internal collections that automatically organize time series data into an optimized storage format on insert (see Time Series Collections: Behavior).

For this reason, indexing a time series collection means creating an index on the underlying internal collection rather than on the visible collection. Some index types are unsupported at this time: TTL, partial, and unique (see Time Series Collection Limitations: Secondary Indexes).

For example, let’s create a new timeseries collection:

> db.createCollection("test", { timeseries: { timeField: "timestamp" } } )

then create a document to insert:

> doc = {_id:0, timestamp: new Date()}

and let’s insert three of those into the collection:

> db.test.insertOne(doc)
> db.test.insertOne(doc)
> db.test.insertOne(doc)

if you then look at the contents of the collection, all three identical documents are present:

> db.test.find()
[
  { timestamp: ISODate("2021-12-03T08:43:50.503Z"), _id: 0 },
  { timestamp: ISODate("2021-12-03T08:43:50.503Z"), _id: 0 },
  { timestamp: ISODate("2021-12-03T08:43:50.503Z"), _id: 0 }
]

however, if you check the collection list, there is a mystery collection there:

> show collections
test                     [time-series]
system.buckets.test
system.views

if you delve into the mystery collection, you’ll see how the test collection is actually stored:

> db.system.buckets.test.find()
[
  {
    _id: ObjectId("61a9d8947dfd3e5b32de6144"),
    control: {
      version: 1,
      min: { _id: 0, timestamp: ISODate("2021-12-03T08:43:00.000Z") },
      max: { _id: 0, timestamp: ISODate("2021-12-03T08:43:50.503Z") }
    },
    data: {
      _id: { '0': 0, '1': 0, '2': 0 },
      timestamp: {
        '0': ISODate("2021-12-03T08:43:50.503Z"),
        '1': ISODate("2021-12-03T08:43:50.503Z"),
        '2': ISODate("2021-12-03T08:43:50.503Z")
      }
    }
  }
]

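so the test collection is just a view onto the actual system.buckets.test collection. Inside that underlying collection, the three documents are stored in a single “bucket” document. Since indexes are created on the underlying collection, where one stored document can hold many measurements, a unique index there could not enforce uniqueness across the individual measurements. This is why, as it currently stands, you cannot create a unique index on timeseries data.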
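
To make the bucket layout a bit more concrete, here is a rough plain-JavaScript sketch of how several measurements end up pivoted into one column-wise document. It only imitates the shape of the `data` field shown above and is not MongoDB's actual bucketing code; `toBucket` is a hypothetical helper, and the `control` section and bucket `_id` are left out.

```javascript
// Illustrative only: pivot measurements into a column-wise, bucket-like
// document similar to the "data" field of system.buckets.test.
// toBucket is a hypothetical helper, not a MongoDB API.
function toBucket(docs) {
  const data = {};
  docs.forEach((doc, i) => {
    for (const [field, value] of Object.entries(doc)) {
      if (!(field in data)) data[field] = {};
      // Each field becomes a column keyed by the measurement's position.
      data[field][String(i)] = value;
    }
  });
  return { data };
}

const ts = new Date("2021-12-03T08:43:50.503Z");
const doc = { _id: 0, timestamp: ts };
const bucket = toBucket([doc, doc, doc]);

// Three measurements, but only one stored document: an index on the
// bucket sees a single entry and cannot tell the measurements apart.
console.log(Object.keys(bucket.data._id).length); // 3
```

The point of the sketch is just that uniqueness would have to be checked inside the bucket, which is not something an ordinary index on the stored document does.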

In conclusion, a timeseries collection is a special collection type that is basically a view into a special underlying collection, so it behaves differently from a normal MongoDB collection. This lets MongoDB manage the storage of timeseries documents, which would otherwise be quite expensive if each measurement were stored as a regular MongoDB document. However, this capability also comes with some caveats, namely the unique index limitation that you came across.

Having said that, if you feel that having a secondary unique index is a must, you can create the collection in the normal manner, at the cost of losing the compactness of the timeseries collection storage. I suggest benchmarking your workload to check whether a normal collection can handle your data, if the features you lose by using timeseries are important to your use case.
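
If you go the regular-collection route, a minimal mongosh-style sketch might look like the following. The collection name `readings` and the `sensorId` field are assumptions for illustration only (they are not from your schema), and `createUniqueReadings` is a hypothetical helper; use whichever fields identify a measurement uniquely in your data.

```javascript
// Sketch: a regular (non-timeseries) collection with a unique compound index.
// "readings" and "sensorId" are placeholder names, not from the original post.
function createUniqueReadings(db) {
  db.createCollection("readings");
  // Reject a second document with the same sensorId + timestamp pair.
  return db.getCollection("readings").createIndex(
    { sensorId: 1, timestamp: 1 },
    { unique: true }
  );
}

// In mongosh, call it with the shell's db handle:
//   createUniqueReadings(db)
// A later insert that repeats an existing sensorId + timestamp pair should
// then fail with a duplicate key error (E11000).
```

Note that in a regular collection the default `_id` index is already unique, so the three inserts of the same `doc` from the example above would fail on the second insert regardless of any secondary index.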

Hopefully this is useful.

Best regards,
Kevin