Multiple databases vs. multiple collections

Matthew_Eva · February 2, 2023, 10:20pm

Hey folks!

I have a question about whether it would be wiser to use multiple databases or multiple collections within a single database when designing for applications at scale.

I’m anticipating using some sharding for certain types of data I’m storing, but I’d like to keep other collections unsharded to more easily query the dataset as a whole (using indexes, of course).

Since unsharded collections are all stored on the primary shard, would it make more sense for me to separate collections that I want to remain unsharded into their own databases? That way there’s less data stored on a primary shard for a given database, which - to my understanding - would reduce the need for sharding in that database.

Moreover, would this be more expensive for my backend API? I’d have to maintain connections to several databases rather than a single database, although they’d all still be in the same cluster.

Or, is this a misunderstanding on my part? Would creating more databases actually reduce the need for sharding, or is the need for sharding determined more by the storage available in the cluster?

Please let me know if you need any clarification as to what I’m asking. Thank you!

Kushagra_Kesav · February 3, 2023, 7:06am

Hi @Matthew_Eva,

Welcome to the MongoDB Community forums

The answer depends on the specific requirements of the application and the nature of the data being stored.

Using multiple databases can help ensure that each database can be optimized for a specific set of use cases, and can simplify scaling and management operations. It also helps in ensuring that the failure of one database does not affect other databases.

Using multiple collections within a single database can simplify deployment and management operations and reduce latency for cross-collection data relationships. It can also make it easier to take advantage of certain database-specific features.

Ultimately, there is no one-size-fits-all answer to this question, as the best approach will depend on the specific requirements of the application. It’s always a good idea to consider the specific use cases and workloads to determine the most appropriate design.

Yes, it depends on the use case, but a database is just a namespace. Actually, in the sharding doc page, all mentions are about collections. Not about databases. Note this part in particular:

“A database can have a mixture of sharded and un-sharded collections. Sharded collections are partitioned and distributed across the shards in the cluster. Unsharded collections are stored on a primary shard. Each database has its own primary shard.”

I’m not clear about the "expensive" concern here. Could you explain this bit more in what concern you are asking for?

Also, do you have a specific design goal that requires the use of sharding in mind that cannot be achieved with a normal replica set?

Please note that sharding requires the use of more infrastructure and planning compared to a regular replica set, and consequently, the maintenance of a sharded cluster will also be more complex.

The need for sharding is not solely determined by the storage available in the cluster, although that is one factor to consider. Sharding is a technique used to distribute the data and load across multiple machines, allowing the database to handle increased load and scale horizontally.

However, the need for sharding is ultimately determined by the scale characteristics of the data and the workloads being run. If you have a large amount of data and high write or read loads, you may still need to shard, even if you have multiple databases.

In my suggestion, it’s a good idea to evaluate the specific requirements of the application, including the data being stored, the queries being run, and the preferred scale, in order to determine the best approach to sharding and the use of multiple databases.

I hope it helps!

Let us know if you have any further questions!

Best,
Kushagra

Matthew_Eva · February 3, 2023, 8:31pm

Hey @Kushagra_Kesav !

Thank you so much for the thorough and detailed response! This is very helpful.

My question actually originated from the piece of documentation you excerpted:

A database can have a mixture of sharded and un-sharded collections. Sharded collections are partitioned and distributed across the shards in the cluster. Unsharded collections are stored on a primary shard. Each database has its own primary shard.

Specifically this last part - each database has it’s own primary shard.

My conclusion was that since each database has it’s own primary shard, storing multiple unsharded collections in one database would place them all on the primary shard of that database, which would eat up that primary shard’s storage capacity.

Whereas, creating multiple databases would create multiple primary shards (since each database has it’s own), which would mean that two unsharded collections stored in two separate databases would have more storage capacity available to each of them, which could help prevent the need for sharding, or at least help minimize the degree of sharding necessary.

Is this incorrect? The way I’m conceptualizing it is that each database essentially gets its own independent shard that is separate from shards in other databases. I.e. a cluster with seven separate databases would have seven separate primary shards, all with independent storage capacity. Is this not how it actually works?

I’m planning to use sharding because I anticipate having a volume of data that would exceed the storage capacity of an unsharded cluster. I’ve decided to use mongodb as my database because it specifically allows for horizontal scaling via sharding.

As far as “expensive” goes, I suppose I should have said “resource intensive”. I know that establishing a connection to a cluster is a relatively resource intensive operation, but once connected to that cluster, is it resource intensive to query to multiple different databases within that cluster from the same API? Or is it basically as resource intensive as querying to multiple collections within a single database?

Thank you so much for your help!

Best,

Matt

Kushagra_Kesav · February 6, 2023, 5:35am

Hi @Matthew_Eva,

As I understand your main concern is around

Need for sharding due to the expected increased scalability, and

Concern that a large number of un-sharded collections in the database will accumulate on a limited number of primary shards.

which is understandable.

In a sharded MongoDB setup, each database has its own primary shard. If you have multiple collections in one database that are not sharded, they would all be stored on the primary shard of that database, which could lead to storage capacity issues on that shard. Also, Having un-sharded collections in a sharded database can become a problem if the collections grow too large, as this can result in an imbalance of data across primary shards, leading to a concentration of un-sharded collections on a few primary shards. To mitigate this risk, it’s recommended to regularly monitor the growth of collections and shard them proactively if necessary.

But this depends on the use case and the growth rate of the data.

Please refer to this doc for sharding operational restrictions.

Furthermore, having un-sharded collections in a sharded database is not inherently an issue, as long as the collections remain relatively small. However, as the collections grow, the risk of an imbalance of data across primary shards increases. This concentration can lead to performance issues, as well as increased resource usage on the primary shards that are handling the majority of the un-sharded collections. To avoid these problems, it is important to monitor the growth of collections and shard them proactively if necessary to ensure a balanced distribution of data across primary shards.

Querying across multiple databases in MongoDB usually takes up the same amount of resources as searching across multiple collections in one database. The primary factor that affects the resource utilization of a query is the size of the data being queried and the complexity of the query itself, not the number of databases or collections.

Additionally, maintaining consistency across multiple databases can be more difficult, especially if you are using transactions or need to ensure that updates to one database are visible to another.

I hope it clarifies your doubt!

Best,
Kushagra

Matthew_Eva · February 6, 2023, 4:06pm

Hi @Kushagra_Kesav,

Yes, this exactly answers my question! Thank you so much!

Best,

Matt