How to prevent duplicates in a MongoDB time series collection

Problem

Sensors check in periodically, but network connectivity issues may cause them to check in with the same data more than once.

MongoDB does not allow the unique property on secondary indexes for time series collections (as of MongoDB 5.0; see Time Series Limitations: https://docs.mongodb.com/manual/core/timeseries/timeseries-limitations/).

In addition, calculations need to be done on the data (preferably using aggregations) that involve counting the number of entries, and those counts will be inaccurate if there are duplicates. Not to mention that duplicates bloat the database and are just messy.
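For example, a per-sensor count like the sketch below (collection and field names are hypothetical) is inflated by one for every duplicated check-in:

// Count check-ins per sensor; duplicates are counted too.
db.sensor_checkins.aggregate([
  { $group: { _id: "$metadata.sensor", checkins: { $sum: 1 } } }
])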

Question

Is there any way to prevent duplicate entries in a MongoDB Timeseries collection?

I’m wondering the same thing; has anyone found a solution?

Can’t you use an update with upsert:true?

Something along these lines:

// Usual inserted document that produces duplicates
insert = { "sensor": 1,
    "timestamp": 2,
    "measure": "pressure",
    "value": 10 }

// Split into a query that defines your uniqueness...
query = { "sensor": 1, "timestamp": 2, "measure": "pressure" }

// ...and the value to write
value = { "value": 10 }

// Then replace the following insert, which produces a new duplicate
// every time it is called,
c.insertOne( insert );

// with an update using upsert:true that will insert only once
// no matter how often it is called.
c.updateOne( query, { $set: value }, { upsert: true } );
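If you strictly want the first check-in to win, a $setOnInsert variant of the same call (a sketch, reusing the query and value above) never overwrites an existing document:

// Writes value only when the upsert actually inserts;
// a document already matching the query is left untouched.
c.updateOne( query, { $setOnInsert: value }, { upsert: true } );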

I am not yet familiar with the new time series collections, so I do not know if this is applicable or not. But I feel it could be.

I’m having the same issue here. Any solutions other than upsert?

What is wrong with upsert?

Hello man,

Well, I don’t think time series collections allow us to use update operations. They are append-only.

Insert-only would make sense in order to improve the performance of the main use case for time series.

But it would prevent some other use cases where you would want to update your time series entries with some post-insert data.

This simply confirms what you wrote, and that I must start learning more.

While you cannot prevent entries with duplicated data, you can query/aggregate the data in such a way that replicated data is ignored.

Below is an aggregation pipeline on a some_timeseries time series collection:

db.some_timeseries.aggregate([
  // ...
  {
    // Group on the fields that define uniqueness, truncating the
    // timestamp to 1-day bins so duplicates collapse into one group.
    $group: {
      _id: {
        username: '$metadata.username',
        media_id: '$metadata.media_id',
        timestamp: {
          $dateTrunc: {
            date: '$timestamp',
            unit: 'day',
            binSize: 1,
          },
        },
      },
      // ...
    },
  },
  // ...
])

On the same $group stage you could pick up the max, min, avg, etc. for a given entry.
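For instance, a sketch reusing the hypothetical fields above, with '$value' standing in for whatever measurement the documents actually carry:

db.some_timeseries.aggregate([
  {
    $group: {
      _id: {
        username: '$metadata.username',
        media_id: '$metadata.media_id',
        timestamp: { $dateTrunc: { date: '$timestamp', unit: 'day', binSize: 1 } },
      },
      count: { $sum: 1 },            // raw entries in the bin, duplicates included
      maxValue: { $max: '$value' },
      minValue: { $min: '$value' },
      avgValue: { $avg: '$value' },
    },
  },
])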

You are absolutely right. I should have followed the link provided in the first post (https://docs.mongodb.com/manual/core/timeseries/timeseries-limitations/)
and I would have read:

Time series collections only support insert operations and read queries. Updates and manual delete operations result in an error.

It appears that the limitations imposed on time series collections are temporary: in the upcoming MongoDB 5.1 we can perform some delete and update operations. I’m guessing time series might support upsert and the unique property on secondary indexes in future releases.
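MongoDB 5.1 only relaxes this a little, though: deletes may only filter on the metaField, and updates may only modify the metaField. A minimal sketch, assuming the collection’s metaField is metadata and documents carry a metadata.sensor tag:

// Allowed in 5.1: the filter touches only metaField subfields.
db.some_timeseries.deleteMany({ 'metadata.sensor': 1 })

// Allowed in 5.1: only the metaField is modified (and no upserts).
db.some_timeseries.updateMany(
  { 'metadata.sensor': 1 },
  { $set: { 'metadata.location': 'lab-2' } }
)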

I came up with my own method to ensure data retrieved from an API is never accidentally inserted twice, which was my original problem.

I’m still learning the aggregation pipeline, but @sergeimeza’s way of creating a query that ignores duplicates is probably the best solution for now.

Thanks for all the help.

Hi there,

I was hoping to see support for unique indexes in time series, but apparently it is not there yet: https://www.mongodb.com/docs/manual/core/timeseries/timeseries-limitations/#std-label-timeseries-limitations-secondary-indexes

The $group solution looks like a workaround: it doesn’t really avoid duplicates, and it adds constraints to the clients who access the collection.
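At best you can hide that constraint behind a read-only view (a sketch building on the $group pipeline above; the view name and the $first pick are my own choices):

db.createView(
  'some_timeseries_dedup',   // hypothetical view name
  'some_timeseries',
  [
    {
      $group: {
        _id: {
          username: '$metadata.username',
          media_id: '$metadata.media_id',
          timestamp: { $dateTrunc: { date: '$timestamp', unit: 'day', binSize: 1 } },
        },
        value: { $first: '$value' },  // keep a single reading per bin
      },
    },
  ]
)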

Does anyone have any other suggestion?
