In one of my use cases I need to insert 100 billion records (one time) and then 3 billion on a daily basis. The initial setup I tried was a single collection into which everything was dumped. The collection was sharded on the key it is filtered on, and no other indexes were used (after the initial filter only a handful of rows are returned, so indexing on date seemed like unnecessary overhead).
The first few billion records are written at a pretty decent speed: 3 billion in under 2.5 hours. But as soon as the collection grows past 9+ billion records, write speed drops drastically. Right now there are 13 billion records in the same collection, and the write time is >16 hours for 3 billion records.
Note: 3 billion is the daily dump size.
Now I'm exploring another approach where I took 30 different collections within the same database, with the same sharding configuration, and my write throughput is consistent for each day's write (finishes in under 2.5 hours for all collections).
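To make the multi-collection approach concrete, here is a minimal sketch of how each daily dump could be routed to its own collection. The `events_` prefix and the `YYYYMMDD` date format are my illustration, not the actual schema:

```python
from datetime import date


def daily_collection_name(day: date, prefix: str = "events") -> str:
    """Map a calendar day to its own collection, e.g. events_20240115.

    The prefix and date format are hypothetical; the point is that each
    daily dump of ~3 billion records lands in a fresh, small collection
    instead of an ever-growing single one.
    """
    return f"{prefix}_{day.strftime('%Y%m%d')}"


print(daily_collection_name(date(2024, 1, 15)))  # -> events_20240115
```

Each day's Spark job would then write to `daily_collection_name(today)` rather than a fixed collection name.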
I am trying to understand why this is happening, even though I'm not using any index apart from the shard key. Hardware doesn't seem to be the problem, since writing to multiple collections works fine.
That said, having multiple collections adds overhead that we're hoping to avoid by using a single collection.
Setup:

- 3 machines running mongod replica sets, 64 GB RAM, 72 cores each
- 10 shards mounted on 10 separate 2 TB disks (30 disks total across the 3 replicas)
- The rest is standard: 3 config servers and 3 mongos instances
- Using the Mongo Spark connector for writing data to MongoDB from PySpark
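For reference, the write path looks roughly like this. This is a sketch assuming Mongo Spark Connector v10.x option names (older connector versions use `"uri"` and format `"mongo"` instead); the URI, database, and collection names are placeholders:

```python
def mongo_write_options(uri: str, database: str, collection: str) -> dict:
    """Build the options dict for a Mongo Spark Connector v10.x write.

    Key names here ("connection.uri", "database", "collection") follow
    the v10.x connector; adjust if you are on an older connector version.
    """
    return {
        "connection.uri": uri,
        "database": database,
        "collection": collection,
    }


opts = mongo_write_options("mongodb://mongos-host:27017", "mydb", "events")

# In the actual PySpark job (not runnable here without a Spark cluster
# and a MongoDB deployment):
# (df.write.format("mongodb")
#    .mode("append")
#    .options(**opts)
#    .save())
print(opts["collection"])  # -> events
```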
How can I achieve my goal of writing to a single collection? If that's not possible, can someone explain why not?