I have written an Atlas Search query and want to test its performance against a million records/documents. I have a collection with only 300 documents; how can I efficiently duplicate the records to fill the collection with up to a million records? Merging in other collections won’t help either, because they don’t contain many documents. Any help is appreciated. Thank you in advance.
Hi @Ruchi_Ninawe,
Here is my hack:
- Extract the docs.
mongoexport -d test -c test -o test1.json
My file:
{"_id":{"$oid":"6153bc90a571a9a021541f49"},"domainObjectId":"1b8480b9-d11f-4a9c-ba5e-a3144bce3126","priority":1,"field1":"value1","field3":"value3"}
{"_id":{"$oid":"6153bc90a571a9a021541f4a"},"domainObjectId":"1b8480b9-d11f-4a9c-ba5e-a3144bce3126","priority":0,"field1":"value1override","field2":"value2"}
{"_id":{"$oid":"6153bc90a571a9a021541f4b"},"domainObjectId":"1b8480b9-d11f-4a9c-ba5e-a3144bce3126","priority":2,"field1":"value1lowerprio"}
- Remove the _id field from each line and fix the leading {. Note that the first comma is 43 characters from the beginning of each line.
cat test1.json | cut -c 43- > test2.json
sed -i 's/^,/{/g' test2.json
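As an alternative to the fixed 43-character offset (which breaks if the serialized _id prefix ever changes length), the same transform can be sketched in Python by parsing each line and dropping _id. The sample line below is a placeholder standing in for one line of test1.json:

```python
import json

# Sketch: drop the _id field from each mongoexport line so that mongoimport
# assigns fresh ObjectIds on re-import. Parsing the JSON makes the transform
# independent of where _id sits or how long it is.
sample = '{"_id":{"$oid":"6153bc90a571a9a021541f49"},"field1":"value1","priority":1}'

def strip_id(line: str) -> str:
    doc = json.loads(line)
    doc.pop("_id", None)   # remove _id regardless of its position or length
    return json.dumps(doc)

print(strip_id(sample))  # {"field1": "value1", "priority": 1}
```

Applied over the whole export file, this produces the same test2.json that the cut/sed pair builds above.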
- Insert a bazillion times.
for i in {1..10000}; do mongoimport -d test -c test test2.json; done
- Success & Fame
Cheers,
Maxime.
I am sort of new to MongoDB, but I can see how to implement your solution and will let you know, @MaBeuLux88_xxx. Thanks a lot for the answer, appreciate it.
Hi @Ruchi_Ninawe,
@MaBeuLux88_xxx’s hack will help you create duplicate documents which differ only by _id.
Another approach would be to create additional documents using a tool like mgeneratejs that supports schema templates and data generation. mgeneratejs uses the Chance library to generate random plausible test values for different field types.
You can also combine mgeneratejs with a few other command line tools to conveniently create test documents with field types and values inferred from your existing data:
- mongodb-schema to infer a probabilistic schema for an existing collection
- morelikethis to convert that schema to a template
- mgeneratejs to generate new documents according to a schema template
- mongoimport to import the new documents into MongoDB
The first three tools are installable from npm:
npm install -g morelikethis mongodb-schema mgeneratejs
mongoimport is a part of the standard MongoDB Database Tools.
Sample usage to generate 1,000 new documents based on an analysis of the existing documents:
mongodb-schema localhost:27017 mydb.mycollection | morelikethis | mgeneratejs -n 1000 | mongoimport -d mydb -c mycollection
If you don’t have any test data yet (or prefer to describe the shape of new documents) you could always skip the schema analysis and just use mgeneratejs and mongoimport.
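For instance, a hand-written template shaped like the sample documents earlier in this thread might look roughly like the following. This is a sketch only; the operator names ($guid, $integer, $word) should be checked against the mgeneratejs and Chance documentation before use:

```json
{
  "domainObjectId": "$guid",
  "priority": { "$integer": { "min": 0, "max": 2 } },
  "field1": "$word"
}
```

Saved as template.json, it could then feed the import, e.g. mgeneratejs template.json -n 1000 | mongoimport -d mydb -c mycollection.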
Regards,
Stennie
You really don’t want to be copying huge quantities of data over the web; you can easily multiply the documents in place with the following aggregation, which makes 1,000 times the data:
db.mycoll.aggregate([{ $set : { xxx : {$range:[0,1000]}}} ,
{ $unwind: "$xxx"} ,
{ $project : { _id:0, xxx:0}},
{$out:"newhugecollection"}])
Doing it this way is very fast and does not require a network round trip.
Welcome to my bookmarks @John_Page 
This will work with any number - in theory - as long as your cluster is large enough / has the required hardware to support this in a timely manner.
You are likely to end up with an out-of-memory issue if you set the range to 1 million; I would chain $set, $unwind, $set, $unwind so you get 1,000x more records after each pair.
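For intuition, the multiplication can be simulated locally. This is a plain Python sketch of the counting behavior, not the server-side pipeline, with numbers scaled down from the thread's 300 documents and 1000-wide range:

```python
# Mimic {$set: {xxx: {$range: [0, n]}}} followed by {$unwind: "$xxx"}:
# each document becomes n copies, so chaining two stages multiplies twice.
def set_unwind(docs, n):
    return [dict(d) for d in docs for _ in range(n)]

docs = [{"i": i} for i in range(3)]                # stand-in for the 300 originals
print(len(set_unwind(docs, 10)))                   # one stage: 3 * 10 = 30
print(len(set_unwind(set_unwind(docs, 10), 10)))   # chained: 3 * 10 * 10 = 300
```

With the real numbers, two chained 1000-wide stages would turn 300 documents into 300,000,000, which is why splitting the range across stages matters for memory.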
Hi @Ruchi_Ninawe, this will take all the documents already in the collection and make X copies of each of them, so your 300 will become 300,000 as you asked.
Hello @John_Page. Thanks a lot for the answer again. This works really well, but when I pass a range of 0,2000 it gives me an error when creating the new collection. I think it might be because of the size of the collection.
Thanks a lot @chris. I used the solution by @John_Page because I just wanted to multiply the number of documents without concern for the data in them. Again, thank you for the effort.
Thank you @MaBeuLux88_xxx for the response. I would want to try this approach as well but for the use case I have now I went with the aggregation solution. Again thanks a lot for the response.
Maybe it’s because your pipeline got too big. Consider using the { allowDiskUse: true } option, e.g. db.mycoll.aggregate(pipeline, { allowDiskUse: true }), to prevent an error due to the size of the pipeline.