Duplicate Data Issue

Gorkem_Erdogan · December 1, 2021, 7:24pm

Hello everyone,

I’ve been using influxdb for time series for the last past year but I wanted to try mongo time series recently. and I’ve been doing some tests however I couldnt find out why it allows duplicate values and a way to prevent it. Basically I’m writing data in bulk every few minutes and usually 1 minute of data is duplicate in the batch. with the same metadata and everything. I expect mongodb to skip these duplicate values. How can I achive this without creating an _id field out of metadata

Thanks

Gorkem_Erdogan · December 1, 2021, 7:50pm

oh man, I just noticed even generating _id out of metadata on the application side doesn’t work. It allows duplicate _id fields. I hope there is a fix for this issue.

kevinadi · December 1, 2021, 11:18pm

Hi @Gorkem_Erdogan

If you’re using MongoDB 5.0.0, 5.0.1, or 5.0.2, please upgrade to the latest version (5.0.4 currently). The same applies if you’re using the 4.4 series. Latest is 4.4.10 in the 4.4 series.

There was an issue that was detected in early versions of the 5.0 series that allows duplicate unique index entries (see SERVER-58936).

Best regards
Kevin

Gorkem_Erdogan · December 2, 2021, 5:35am

Hello,

I use serverless mongodb version 5.1.0

kevinadi · December 2, 2021, 6:48am

Hi @Gorkem_Erdogan

That is interesting. The duplicate _id should be fixed in 5.1.0. Do you mind providing more information about this?

Could you walk us through exactly what you did step by step, and how you discovered this?
If you have codes, please post the relevant part of them.
Please also post the versions of the tools/drivers that you used.

Best regards
Kevin

Gorkem_Erdogan · December 2, 2021, 11:44am

Hello,

Well it is nothing really special. I’ve been testing it on a jupyter notebook using pymongo Version: 4.0

At first, I let mongodb to handle _id field thinking it could handle duplicate data with the same metadata and timestamp. and I think this is really important for a timeseries database. It should not let insert if the metadata and the timestamp are the same. we shouldnt be dealing with generating _id field for this purpose…

using insertMany, inserted couple of hundred documents. Below is one of the documents;

{
    "timestamp": {
        "$date": "2021-12-01T18:58:00.000Z"
    },
    "metadata": {
        "chart": "candles",
        "interval": "1min",
        "market": "USD",
        "symbol": "REQ"
    },
    "open": 0.7027,
    "low": 0.7015,
    "high": 0.7083,
    "close": 0.7074,
    "volume": 163137
}

then noticed it does not handle duplicate data so I decided to generate _id field on my application side. _id generated using metadata field and timestamp as below;

{
    "timestamp": {
        "$date": "2021-12-01T18:58:00.000Z"
    },
    "metadata": {
        "chart": "candles",
        "interval": "1min",
        "market": "USD",
        "symbol": "REQ"
    },
    "open": 0.7027,
    "_id": "REQ-USD-1min-candles-1638385080",
    "low": 0.7015,
    "high": 0.7083,
    "close": 0.7074,
    "volume": 163137
}

but still, duplicates were allowed… here is a screenshot

kevinadi · December 3, 2021, 8:57am

Hi @Gorkem_Erdogan

A timeseries collection is quite different from a normal MongoDB collection. This is because although superficially it behaves like a normal collection, MongoDB treats time series collections as writable non-materialized views on internal collections that automatically organize time series data into an optimized storage format on insert (see Time Series Collections: Behavior).

For this reason, indexing a time series collection involves creating an index in the underlying internal collection, instead of creating it on the visible collection. There are index types that are unsupported at this time: TTL, partial, and unique (see Time Series Collection Limitations: Secondary Indexes).

For example, let’s create a new timeseries collection:

> db.createCollection("test", { timeseries: { timeField: "timestamp" } } )

then create a document to insert:

> doc = {_id:0, timestamp: new Date()}

and let’s insert three of those into the collection:

> db.test.insertOne(doc)
> db.test.insertOne(doc)
> db.test.insertOne(doc)

if you then see the content of the collection, all three documents with identical content will be present:

> db.test.find()
[
  { timestamp: ISODate("2021-12-03T08:43:50.503Z"), _id: 0 },
  { timestamp: ISODate("2021-12-03T08:43:50.503Z"), _id: 0 },
  { timestamp: ISODate("2021-12-03T08:43:50.503Z"), _id: 0 }
]

however, if you check the collection list, there is a mystery collection there:

> show collections
test                     [time-series]
system.buckets.test
system.views

if you delve into the mystery collection, you’ll see how the test collection is actually stored:

> db.system.buckets.test.find()
[
  {
    _id: ObjectId("61a9d8947dfd3e5b32de6144"),
    control: {
      version: 1,
      min: { _id: 0, timestamp: ISODate("2021-12-03T08:43:00.000Z") },
      max: { _id: 0, timestamp: ISODate("2021-12-03T08:43:50.503Z") }
    },
    data: {
      _id: { '0': 0, '1': 0, '2': 0 },
      timestamp: {
        '0': ISODate("2021-12-03T08:43:50.503Z"),
        '1': ISODate("2021-12-03T08:43:50.503Z"),
        '2': ISODate("2021-12-03T08:43:50.503Z")
      }
    }
  }
]

so the test collection is just a view to the actual system.buckets.test. Inside the actual underlying collection, the three documents are stored in a single “bucket”. This is why as it currently stands, you cannot create a unique index on timeseries data.

In conclusion, timeseries collection is a special collection type that is basically a view into a special underlying collection, thus it behaves differently from a normal MongoDB collection. This is done to allow MongoDB-managed storage of timeseries documents that is otherwise quite expensive to do if it’s done using a regular MongoDB document. However, having this capability also comes with some caveats, namely the unique index limitation that you came across.

Having said that, if you feel that having a secondary unique index is a must, you can create the collection in the normal manner, but lose the compactness of the timeseries collection storage. I suggest to benchmark your workload, and check if you can manage with a normal collection to store your data if the features you lose by using timeseries are important to your use case.

Hopefully this is useful.

Best regards,
Kevin

Sylvain_GOUMY · January 4, 2022, 10:08am

Thanks for your details explanations @kevinadi
Like @Gorkem_Erdogan I was assuming timeseries collections would have a mechanisms to ensure uniqueness of timestamp/sensor data.
It’s clearly a must have features for time series as network unstable connectivity may cause data to be sent multiple times.

I think I’ll go with this suggestions of querying data only through aggregation pipeline with a group stage to remove duplicates.

But it has impact in performance and complexity of development…

Tyler_Hudgins · June 7, 2022, 7:34pm

Any updates to this issue? We are building a time series collection for sensor data that might send duplicates and tried to require uniqueness on the timestamp but it still allows duplicates. I’d love to leverage the performance improvements of the collection type but that a deal breaker if we can’t prevent duplication.
thanks,

Benjamin_Behringer · June 14, 2022, 8:51pm

Same here. Planned to use time series collections, yet not being able to enforce uniqueness is a show stopper for us.

Christian_Rorvik · June 15, 2022, 6:58am

Not a blocker for us, but I do wish we had at least a way to insert conditional greater than last timestamp. That would allow us to ingest with redundancy and make our pipeline more reliable. With the new index based time sorted queries for time series in 6.0, I’d think this may not be too expensive to add?

Sainath_Kotapati · June 28, 2023, 10:41am

It would be helpful by introducing the optional add on functionality of overwriting the document if it is already exists for an insert operation based on “_id” field for only timeseries collection in upcoming releases. Coz if we have huge amount of sensors then data should be blindly written into timeseries collection without querying for duplicate check (Conditional inserts will definitely impact the performance).

Michael_Gillhofer · December 18, 2024, 5:29pm

Is there any plan of adding this feature?