Massive Number of Collections
Rate this article
Are you more of a video person? This is for you.
Let's begin by discussing why having a massive number of collections is an anti-pattern. If storage is relatively cheap, who cares how many collections you have?
To avoid this anti-pattern, examine your database and remove unnecessary collections. If you find that you have an increasing number of collections, consider remodeling your data so you have a consistent set of collections.
Let's take an example from the greatest tv show ever created: Parks and Recreation. Leslie is passionate about maintaining the parks she oversees, and, at one point, she takes it upon herself to remove the trash in the Pawnee River.
Let's say she wants to keep a minute-by-minute record of the water level and temperature of the Pawnee River, the Eagleton River, and the Wamapoke River, so she can look for trends. She could send her coworker Jerry to put 30 sensors in each river and then begin storing the sensor data in a MongoDB database.
One way to store the data would be to create a new collection every day to store sensor data. Each collection would contain documents that store information about one reading for one sensor.
Let's say that Leslie wants to be able to easily query on the
sensorfields, so she creates an index on each field.
If Leslie were to store hourly data throughout all of 2019 and create two indexes in each collection (in addition to the default index on
_id), her database would have the following stats:
- Database size: 5.2 GB
- Index size: 1.07 GB
- Total Collections: 365
Each day she creates a new collection and two indexes. As Leslie continues to collect data and her number of collections exceeds 10,000, the performance of her database will decline.
Also, when Leslie wants to look for trends across weeks and months, she'll have a difficult time doing so since her data is spread across multiple collections.
Leslie wants to query on the
sensorfields, so she creates two new indexes for this collection.
If Leslie were to store hourly data for all of 2019 using this updated schema, her database would have the following stats:
- Database size: 3.07 GB
- Index size: 27.45 MB
- Total Collections: 1
By restructuring her data, she sees a massive reduction in her index size (1.07 GB initially to 27.45 MB!). She now has a single collection with three indexes.
With this new schema, she can more easily look for trends in her data because it's stored in a single collection. Also, she's using the default index on
_idto her advantage by storing the hour the water level data was gathered in this field. If she wants to query by hour, she already has an index to allow her to efficiently do so.
In the example above, Leslie was able to remove unnecessary collections by changing how she stored her data.
Sometimes, you won't immediately know what collections are unnecessary, so you'll have to do some investigating yourself. If you find an empty collection, you can drop it. If you find a collection whose size is made up mostly of indexes, you can probably move that data into another collection and drop the original. You might be able to use to move data from one collection to another.
Below are a few ways you can begin your investigation.
To see a list of collections, run
db.getCollectionNames(). Output like the following will be displayed:
To retrieve stats about your database, run
db.stats(). Output like the following will be displayed:
You can also run
db.collection.stats()to see information about a particular collection.
Be mindful of creating a massive number of collections as each collection likely has a few indexes associated with it. An excessive number of collections and their associated indexes can drain resources and impact your database's performance. In general, try to limit your replica set to 10,000 collections.
Come back soon for the next post in this anti-patterns series!
Check out the following resources for more information: