I want to ingest data into a MongoDB Time Series collection as efficiently as possible without ingesting duplicate data.
At the moment I am having to process the data one record at a time, checking for a matching record with a find operation and inserting the record only if no match is found. This approach doesn't seem particularly efficient. Is there a better way of doing this when using MongoDB Time Series collections?
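For reference, this is roughly what that per-record approach looks like (a minimal pymongo sketch; the database, collection and field names are just placeholders):

```python
from datetime import datetime, timezone
from pymongo import MongoClient

# Illustrative names only: "mydb", "readings", "ts", "metadata.sensorId".
coll = MongoClient()["mydb"]["readings"]

def ingest_one(record):
    """Insert a record only if no record with the same key fields already exists."""
    key = {"ts": record["ts"], "metadata.sensorId": record["metadata"]["sensorId"]}
    if coll.find_one(key) is None:   # one round trip to check for a duplicate
        coll.insert_one(record)      # and another to insert
        return True
    return False

record = {"ts": datetime(2023, 1, 1, tzinfo=timezone.utc),
          "metadata": {"sensorId": "s1"},
          "value": 42.0}
ingest_one(record)
```

So every record costs at least two round trips, which is what I'd like to avoid.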
I am aware that, where duplicates exist, an aggregation pipeline with a $group stage can be used to filter them out for users accessing the data, but this tackles the symptom and not the cause.
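Something along these lines, for example (again a sketch, assuming the same illustrative key fields):

```python
from pymongo import MongoClient

coll = MongoClient()["mydb"]["readings"]  # illustrative names

# Keep one document per (ts, sensorId) pair at read time.
pipeline = [
    {"$group": {
        "_id": {"ts": "$ts", "sensorId": "$metadata.sensorId"},
        "doc": {"$first": "$$ROOT"},
    }},
    {"$replaceRoot": {"newRoot": "$doc"}},
]
deduped = list(coll.aggregate(pipeline))
```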
With normal collections you can add a unique index and then perform upserts efficiently using a bulk write. However, for MongoDB Time Series collections unique indexes are not yet supported, and upsert does not appear to be supported even when $setOnInsert has been specified.
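For comparison, this is the kind of pattern I mean on a regular collection (a sketch with placeholder index fields):

```python
from datetime import datetime, timezone
from pymongo import MongoClient, UpdateOne, ASCENDING

regular = MongoClient()["mydb"]["readings_regular"]  # a normal collection; illustrative names
regular.create_index([("ts", ASCENDING), ("metadata.sensorId", ASCENDING)], unique=True)

records = [{"ts": datetime(2023, 1, 1, tzinfo=timezone.utc),
            "metadata": {"sensorId": "s1"}, "value": 42.0}]

# $setOnInsert only writes the document when no match exists, so re-submitted
# data is ignored and only genuinely new records are inserted.
ops = [UpdateOne({"ts": r["ts"], "metadata.sensorId": r["metadata"]["sensorId"]},
                 {"$setOnInsert": r}, upsert=True) for r in records]
result = regular.bulk_write(ops, ordered=False)
print(result.upserted_count, "new records inserted")
```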
Are there any plans to support unique indexes and upserts for MongoDB Time Series collections in the near future?
Hi @Peter_B1 and welcome to the MongoDB community forum!!
Both of the features you ask about, unique indexes and bulk upserts in Time Series collections, are in the pipeline, but I'm not able to say when or how they will be implemented.
However, if you are on MongoDB version 5.1 or above, update operations are possible on the metaField values, provided certain conditions are fulfilled. Please visit the documentation on Updates in Time Series Collections for further details.
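As a rough illustration (a pymongo sketch assuming a metaField named "metadata"; your field names will differ), such an update may only filter on and modify fields under the metaField, using update operators:

```python
from pymongo import MongoClient

coll = MongoClient()["mydb"]["readings"]  # a time series collection; illustrative names

# On MongoDB 5.1+, updates on a time series collection may only match on and
# modify fields under the metaField, and must use update operators.
coll.update_many(
    {"metadata.sensorId": "s1"},
    {"$set": {"metadata.location": "warehouse-1"}},
)
```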
Also, to track the feature requests you could submit them to the MongoDB Feedback Engine.
To add more, could you clarify the above requirement: are you expecting a lot of duplicates in the collection?
If having a unique field is an enforced requirement, would working with a regular collection rather than a time series collection be more effective?
Please let us know if you have any further queries.
In relation to your question about duplicates, they don't happen all the time but can happen when data is resubmitted. We want to be able to insert data in bulk and identify which records were new so that we can inform downstream systems of the new data (and not the duplicates).
If it were possible to add a unique index on the key fields then the duplicate inserts would fail and we would be able to identify them by the error code.
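To illustrate what I was hoping to do (purely hypothetical until unique indexes are supported on time series collections; written here against a regular collection with a unique index and placeholder field names):

```python
from datetime import datetime, timezone
from pymongo import MongoClient, InsertOne
from pymongo.errors import BulkWriteError

coll = MongoClient()["mydb"]["readings_regular"]  # would ideally be the time series collection
records = [{"ts": datetime(2023, 1, 1, tzinfo=timezone.utc),
            "metadata": {"sensorId": "s1"}, "value": 42.0}]

try:
    coll.bulk_write([InsertOne(r) for r in records], ordered=False)
    new_records = records  # nothing was a duplicate
except BulkWriteError as exc:
    # Duplicate key violations come back with error code 11000 and the index
    # of the failed operation, so the remaining records are the new ones.
    dup = {e["index"] for e in exc.details["writeErrors"] if e["code"] == 11000}
    new_records = [r for i, r in enumerate(records) if i not in dup]
```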
Similarly, if it were possible to perform an upsert with $setOnInsert, I was expecting only insert operations to be triggered and therefore for the command to be permitted for time series collections; however, this is not the case. Had it been permitted, I expect I would have been able to determine which requests resulted in an insert.
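On a regular collection the bulk write result already tells you which operations inserted a new document, which is the behaviour I was hoping for here (sketch with placeholder names):

```python
from datetime import datetime, timezone
from pymongo import MongoClient, UpdateOne

coll = MongoClient()["mydb"]["readings_regular"]  # regular collection; illustrative names
records = [{"ts": datetime(2023, 1, 1, tzinfo=timezone.utc),
            "metadata": {"sensorId": "s1"}, "value": 42.0}]

ops = [UpdateOne({"ts": r["ts"], "metadata.sensorId": r["metadata"]["sensorId"]},
                 {"$setOnInsert": r}, upsert=True) for r in records]
result = coll.bulk_write(ops, ordered=False)

# upserted_ids maps the index of each operation that inserted a new document
# to the _id it was given, so new records can be told apart from duplicates.
new_records = [records[i] for i in result.upserted_ids]
```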
A unique composite key isn't a requirement; however, being able to insert new records in bulk, ignoring any duplicates, and being able to identify the inserted records is.
Using regular collections would work, but they wouldn't have the benefits and optimizations that come with time series collections, such as optimized internal storage and improved query efficiency.
If unique indexes and bulk upserts are both in the pipeline, do you have any open issues for them that I can track?
Hi @Peter_B1 and thank you for the detailed reply.
Since this is in the planning stage for future releases, there are no tickets to watch as of yet. Alternatively, you can raise the feature request in the MongoDB Feedback Engine, where you can track individual requests and their progress.