Batch insert/upsert: avoiding duplicates

Hi, I have looked around and am currently going over the docs, but wanted to ask.

I’m trying to avoid duplicate entries in my db. My application crawls the web and returns arrays of objects. When I pass these to insertMany(), my db is updated with all the objects and new _id’s are generated, as I’d like. The problem is that when the crawl runs again it returns a “new” array of objects which will likely be identical to the last one, so inserting as I have will create duplicate entries. I tried using updateMany() with upsert: true, but it won’t accept my array of objects in the same way. I can do a single object which then contains the array of objects, but that’s not desirable.

How can I pass in many objects at once using upsert, like with insert([{},{},{}]), creating new _id’s for objects that don’t exist and updating existing ones?
Or am I better off storing the previous result in some variable to use as a filter against new arrays coming in, and just using insertMany()?

You will need a bulkWrite filled with one updateOne: document per object from the crawler.

Each updateOne: will use upsert: true in order to insert the document if it does not exist and to update it if it does.

The filter: will contain the fields from your input object that determine the uniqueness of your documents. The update: will contain the fields that you want to modify.
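For example, a minimal sketch with the Node.js driver, assuming the collection is called pages, the crawler output is in crawledItems, and a url field uniquely identifies each document (all of these names are placeholders):

```js
// One updateOne operation per crawled object; upsert: true inserts the
// document when the filter matches nothing, otherwise it updates it.
const ops = crawledItems.map((item) => ({
  updateOne: {
    filter: { url: item.url },   // fields that define uniqueness
    update: { $set: item },      // fields to create or refresh
    upsert: true,
  },
}));

const result = await db.collection('pages').bulkWrite(ops);
console.log(`${result.upsertedCount} inserted, ${result.modifiedCount} updated`);
```

If url really is your unique key, a unique index on it is also worth considering so the database itself rejects any duplicates that slip through.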


Thanks for the reply. I kind of thought so, just wanted to check. Often in these situations I gravitate toward doing something more complicated, and someone comes around and is like “why don’t you just do this” lol.


There is another option, and that’s inserting the new documents into a new (temp) collection, then running an aggregation on that collection with a $merge into your “real” collection, with appropriate options for cases where the document already exists and nothing has changed. The on field(s) would be the immutable fields that determine that it’s “the same” document…
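A rough sketch of that approach, again assuming a url field identifies “the same” document and using placeholder collection names pages_staging and pages ($merge needs a unique index on the on field(s) in the target collection):

```js
// 1. Stage the raw crawl results in a "temp" collection.
await db.collection('pages_staging').insertMany(crawledItems);

// 2. Merge them into the real collection; $merge matches on `url`,
//    updates existing documents and inserts the rest.
await db.collection('pages_staging').aggregate([
  {
    $merge: {
      into: 'pages',
      on: 'url',
      whenMatched: 'merge',      // or 'keepExisting' if nothing has changed
      whenNotMatched: 'insert',
    },
  },
]).toArray();                    // the pipeline only runs when the cursor is consumed

// 3. Drop the temporary collection.
await db.collection('pages_staging').drop();
```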

Asya


Thank you for this insight. The initial conception of the idea I am working on had a tmp data storage stage, but I have not implemented it yet and kind of forgot about it until now. Just a noob using the MERN stack for the first time haha.
Would you create your tmp collection programmatically and then drop it after? Or would tmp be an actual collection in the db?

I’m using the term “temp” figuratively, i.e. your application would create it, populate it and then, after merging the results into the “regular” collection, it would be dropped by the same code.

Asya


Gotcha. Thanks again.

What would be the pros of this alternative?

I feel it would be more resource-intensive than bulkWrite:

  1. Extra RAM for the temporary collection and its default _id:1 index
  2. Potentially extra disk space if the collection has to be written before the final drop
  3. Extra replication for documents that are duplicates and would result in a no-op
  4. Extra space in the oplog

From my perspective, in this case bulkWrite is the more elegant solution and it’s what I am using to ensure no duplicates are entered. I think Asya was pointing out that it was possible, which reminded me that you can run data through some pipeline before entering it. I’m not sure exactly when I would do this, as for my use case I have been formatting my data before entering it, so it’s already clean, you could say, and any new data I want to derive from it I would presumably just get by aggregating on my existing collections. I haven’t got that far haha.


I am all for knowing the alternatives.

At this point, I can see one pro for the temporary collection alternative: you have a log or history of the input, especially if you do not delete the inserted documents right after you have processed them. You can maintain status information about when and how they were processed, and you can use a TTL index to eventually delete them automatically.
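For instance, a small sketch of that TTL idea, assuming the staged documents carry an insertedAt date field (placeholder names again):

```js
// Documents in the staging collection expire 7 days after `insertedAt`;
// MongoDB's TTL background task removes them automatically.
await db.collection('pages_staging').createIndex(
  { insertedAt: 1 },
  { expireAfterSeconds: 7 * 24 * 60 * 60 }
);
```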

I wanted to spark the discussion to see if there are more pros or cons that I do not see.
