How to duplicate a few hundred documents into millions within the same collection?

I have written an Atlas Search query and want to test its performance against a million records/documents. I have a collection with only 300 documents; how can I duplicate the records efficiently to fill the collection up to a million records? Merging in other collections won’t help either, because they don’t contain many documents. Any help is appreciated. Thank you in advance.

Hi @Ruchi_Ninawe,

Here is my hack:

  1. Extract the docs.
mongoexport -d test -c test -o test1.json

My file:

{"_id":{"$oid":"6153bc90a571a9a021541f49"},"domainObjectId":"1b8480b9-d11f-4a9c-ba5e-a3144bce3126","priority":1,"field1":"value1","field3":"value3"}
{"_id":{"$oid":"6153bc90a571a9a021541f4a"},"domainObjectId":"1b8480b9-d11f-4a9c-ba5e-a3144bce3126","priority":0,"field1":"value1override","field2":"value2"}
{"_id":{"$oid":"6153bc90a571a9a021541f4b"},"domainObjectId":"1b8480b9-d11f-4a9c-ba5e-a3144bce3126","priority":2,"field1":"value1lowerprio"}
  2. Remove the _id field from each line and fix the first {. Note that the first comma is 43 chars from the beginning of each line (the resulting lines are shown after the list for reference).
cat test1.json | cut -c 43- > test2.json
sed -i 's/^,/{/g' test2.json
  3. Insert a bazillion times.
for i in {1..10000}; do mongoimport -d test -c test test2.json; done
  4. Success & Fame
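
For reference, after step 2 each line of test2.json should look like the corresponding export line with its _id prefix stripped; the first sample line above would become:

{"domainObjectId":"1b8480b9-d11f-4a9c-ba5e-a3144bce3126","priority":1,"field1":"value1","field3":"value3"}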

Cheers,
Maxime.

I am sort of new to MongoDB but I can see how to implement your solution and will let you know @MaBeuLux88. Thanks a lot for the answer, appreciate it.

Hi @Ruchi_Ninawe,

@MaBeuLux88’s hack will help you create duplicate documents which differ only by _id.

Another approach would be to create additional documents using a tool like mgeneratejs that supports schema templates and data generation. mgeneratejs uses the Chance library to generate random plausible test values for different field types.

You can also combine mgeneratejs with a few other command line tools to conveniently create test documents with field types and values inferred from your existing data:

  • mongodb-schema to infer a probabilistic schema for an existing collection
  • morelikethis to convert that schema to a template
  • mgeneratejs to generate new documents according to a schema template
  • mongoimport to import the new documents into MongoDB

The first three tools are installable from npm:

npm install -g morelikethis mongodb-schema mgeneratejs

mongoimport is a part of the standard MongoDB Database Tools.

Sample usage to generate 1,000 new documents based on an analysis of the existing documents:

mongodb-schema localhost:27017 mydb.mycollection | morelikethis | mgeneratejs -n 1000 | mongoimport -d mydb -c mycollection

If you don’t have any test data yet (or prefer to describe the shape of new documents) you could always skip the schema analysis and just use mgeneratejs and mongoimport.
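
For example, here is a minimal sketch of that direct approach. It assumes mgeneratejs’s standard Chance-based template operators ($guid, $integer, $word), reuses the field names from the sample export earlier in this thread, and uses placeholder database/collection names:

mgeneratejs '{"domainObjectId": "$guid", "priority": {"$integer": {"min": 0, "max": 2}}, "field1": "$word"}' -n 1000 | mongoimport -d mydb -c mycollection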

Regards,
Stennie

You really don’t want to be copying huge quantities of data over the web; you can easily multiply the documents in place with the following aggregation, which makes 1,000 times the data:

db.mycoll.aggregate([
  { $set: { xxx: { $range: [0, 1000] } } },
  { $unwind: "$xxx" },
  { $project: { _id: 0, xxx: 0 } },
  { $out: "newhugecollection" }
])

Doing it this way is very fast and does not involve a network round trip.
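
If it helps, one way to sanity-check the result afterwards (assuming the $out collection name above) is to count the new collection in the shell:

// expect (number of original documents) * 1000
db.newhugecollection.countDocuments()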

Welcome to my bookmarks @John_Page :smiley:

This will work with any number, in theory, as long as your cluster is large enough and has the required hardware to support this in a timely manner.

You are likely to end up with an out-of-memory issue if you set the range to 1 million. I would make it set, unwind, set, unwind, so you get 1,000x more records after each pair.
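
Here is a rough sketch of that chained approach (the xxx and yyy field names are just placeholders); two $set/$unwind pairs with a range of 1,000 each multiply the collection by 1,000,000, so scale the ranges down if you need fewer documents:

db.mycoll.aggregate([
  { $set: { xxx: { $range: [0, 1000] } } },
  { $unwind: "$xxx" },
  { $set: { yyy: { $range: [0, 1000] } } },
  { $unwind: "$yyy" },
  { $project: { _id: 0, xxx: 0, yyy: 0 } },
  { $out: "newhugecollection" }
])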

Hi @Ruchi_Ninawe, this will take all the documents already in the collection and make X copies of each of them, so your 300 will become 300,000 as you asked.

Hello @John_Page. Thanks a lot for the answer again. This works really well, but when I pass a range of 0,2000 it gives me an error when creating the new collection. I think it might be because of the size of the collection.

Thanks a lot @chris, I used the solution by @John_Page because I just wanted to multiply the number of documents without worrying about the data in them. Again, thank you for the efforts.

Thank you @MaBeuLux88 for the response. I would want to try this approach as well but for the use case I have now I went with the aggregation solution. Again thanks a lot for the response.

Maybe it’s because your pipeline got too big. Please consider using the { allowDiskUse: true } option to prevent an error due to the size of the pipeline.
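
For example, with the aggregation above and the 0,2000 range you mentioned, the option is passed as the second argument to aggregate():

db.mycoll.aggregate([
  { $set: { xxx: { $range: [0, 2000] } } },
  { $unwind: "$xxx" },
  { $project: { _id: 0, xxx: 0 } },
  { $out: "newhugecollection" }
], { allowDiskUse: true })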