How to duplicate a few hundred documents into millions within the same collection?

I have written an Atlas Search query and want to test its performance against a million records/documents. I have a collection with only 300 documents; how can I duplicate the records efficiently to fill the collection up to a million records? Merging in other collections won’t help either, because they don’t contain many documents. Any help is appreciated. Thank you in advance.

Hi @Ruchi_Ninawe,

Here is my hack:

  1. Extract the docs.
mongoexport -d test -c test -o test1.json

My file:

{"_id":{"$oid":"6153bc90a571a9a021541f49"},"domainObjectId":"1b8480b9-d11f-4a9c-ba5e-a3144bce3126","priority":1,"field1":"value1","field3":"value3"}
{"_id":{"$oid":"6153bc90a571a9a021541f4a"},"domainObjectId":"1b8480b9-d11f-4a9c-ba5e-a3144bce3126","priority":0,"field1":"value1override","field2":"value2"}
{"_id":{"$oid":"6153bc90a571a9a021541f4b"},"domainObjectId":"1b8480b9-d11f-4a9c-ba5e-a3144bce3126","priority":2,"field1":"value1lowerprio"}
  2. Remove the _id field from each line and fix the first {. Note that the first comma is 43 chars from the beginning of each line (the resulting lines are shown after the list for reference).
cat test1.json | cut -c 43- > test2.json
sed -i 's/^,/{/g' test2.json
  3. Insert a bazillion times.
for i in {1..10000}; do mongoimport -d test -c test test2.json; done
  4. Success & Fame
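
For reference, after step 2 each line of test2.json should look like the corresponding export line with its _id prefix stripped; the first sample line above would become:

{"domainObjectId":"1b8480b9-d11f-4a9c-ba5e-a3144bce3126","priority":1,"field1":"value1","field3":"value3"}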

Cheers,
Maxime.

I am sort of new to MongoDB but I can see how to implement your solution and will let you know @MaBeuLux88. Thanks a lot for the answer, appreciate it.

Hi @Ruchi_Ninawe,

@MaBeuLux88’s hack will help you create duplicate documents which differ only by _id.

Another approach would be to create additional documents using a tool like mgeneratejs that supports schema templates and data generation. mgeneratejs uses the Chance library to generate random plausible test values for different field types.

You can also combine mgeneratejs with a few other command line tools to conveniently create test documents with field types and values inferred from your existing data:

  • mongodb-schema to infer a probabilistic schema for an existing collection
  • morelikethis to convert that schema to a template
  • mgeneratejs to generate new documents according to a schema template
  • mongoimport to import the new documents into MongoDB

The first three tools are installable from npm:

npm install -g morelikethis mongodb-schema mgeneratejs

mongoimport is a part of the standard MongoDB Database Tools.

Sample usage to generate 1,000 new documents based on an analysis of the existing documents:

mongodb-schema localhost:27017 mydb.mycollection | morelikethis | mgeneratejs -n 1000 | mongoimport -d mydb -c mycollection

If you don’t have any test data yet (or prefer to describe the shape of new documents) you could always skip the schema analysis and just use mgeneratejs and mongoimport.
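
For example, here is a minimal sketch of that direct approach. It assumes mgeneratejs’s standard Chance-based template operators ($guid, $integer, $word), reuses the field names from the sample export earlier in this thread, and uses placeholder database/collection names:

mgeneratejs '{"domainObjectId": "$guid", "priority": {"$integer": {"min": 0, "max": 2}}, "field1": "$word"}' -n 1000 | mongoimport -d mydb -c mycollection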

Regards,
Stennie

You really don’t want to be copying huge quantities of data over the web; you can easily multiply the documents in place with the following aggregation, which makes 1,000 times the data:

db.mycoll.aggregate([
  { $set: { xxx: { $range: [0, 1000] } } },
  { $unwind: "$xxx" },
  { $project: { _id: 0, xxx: 0 } },
  { $out: "newhugecollection" }
])

Doing it this way is very fast and does not involve a network round trip.
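
If it helps, one way to sanity-check the result afterwards (assuming the $out collection name above) is to count the new collection in the shell:

// expect (number of original documents) * 1000
db.newhugecollection.countDocuments()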

Welcome to my bookmarks @John_Page :smiley:

This will work with any number, in theory, as long as your cluster is large enough and has the required hardware to support this in a timely manner.

You are likely to end up with an out-of-memory issue if you set the range to 1 million. I would make it set, unwind, set, unwind, so you get 1,000x more records after each pair.
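
Here is a rough sketch of that chained approach (the xxx and yyy field names are just placeholders); two $set/$unwind pairs with a range of 1,000 each multiply the collection by 1,000,000, so scale the ranges down if you need fewer documents:

db.mycoll.aggregate([
  { $set: { xxx: { $range: [0, 1000] } } },
  { $unwind: "$xxx" },
  { $set: { yyy: { $range: [0, 1000] } } },
  { $unwind: "$yyy" },
  { $project: { _id: 0, xxx: 0, yyy: 0 } },
  { $out: "newhugecollection" }
])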

Hi @Ruchi_Ninawe, this will take all the documents already in the collection and make X copies of each of them, so your 300 will become 300,000 as you asked.

Hello @John_Page. Thanks a lot for the answer again. This works really well, but when I pass a range of 0,2000 it gives me an error when creating the new collection. I think it might be because of the size of the collection.

Thanks a lot @chris, I used the solution by @John_Page because I just wanted to multiply the number of documents without worrying about the data in them. Again, thank you for the efforts.

Thank you @MaBeuLux88 for the response. I would want to try this approach as well but for the use case I have now I went with the aggregation solution. Again thanks a lot for the response.

Maybe it’s because your pipeline got too big. Please consider using the { allowDiskUse: true } option to prevent an error due to the size of the pipeline.
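
For example, with the aggregation above and the 0,2000 range you mentioned, the option is passed as the second argument to aggregate():

db.mycoll.aggregate([
  { $set: { xxx: { $range: [0, 2000] } } },
  { $unwind: "$xxx" },
  { $project: { _id: 0, xxx: 0 } },
  { $out: "newhugecollection" }
], { allowDiskUse: true })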