I am curious to look at example of top level data structures (how the dbs and collections are organized) for different use cases, e.g. different types of apps. How dbs and collections and sub collections do different types of use cases typically have? How are they organized? Does anyone know of a resource like that?
PS. My guiding-thought re database structure is that the data-hierarchy you have in your mind should be reflected in the structure of the cluster to extent possible, but I was wondering if anyone else had any thoughts on this.
Hi @Rich_Guy, welcome!
The concept of a database in MongoDB is just a logical construct, which is similar to a namespace for a collection. The structure of collections however depends on your use case, i.e. whether you choose two have a single collection containing embedded documents, or two collections with related documents.
It’s less about the database structure, but more about the data model itself. An example case where you would consider a different data model for a cluster’s structure would be Aggregation $lookup: sharded collection restrictions.
Please review the following resources related to data modelling:
There’s also a free online course at MongoDB University, one that focuses on creating data models for MongoDB: M320: Data Modeling
Hi Rich. Thanks for the question.
I would add to what Wan said by noting that there are some anti-patterns with DB/collection usage for multi-tenant setups. I would avoid (for example) having a separate database for each customer, and then repeating the same collection names inside each database. As each collection, and each index require a separate file on the storage layer, you can get situations where there are 100’s of thousands of different files. Each fille requires a file handle in the OS, and these can add up to take a significant amount of memory which could otherwise be used for data caching.
I would instead recommend having shared collections for the entire SaaS with a customer-identifing field in the document schema, and have just one giant logical database with one collection for each entity type.
Hope this helps,
I was also planning a similar approach - one database per customer and the same collection names repeated for each customer, but you mention that it could get into a nightmare of files and OS handling (sic).
but if I use the suggested alternative of a shared collection with customer-identifying field then there is a huge risk - one programming error, one data leakage, and its all over.
Besides separate databases seemed to have some advantages
- quick extraction of a customers data for a local copy / local replication
- change in collection structure / customisation for a customer
- moving customer database around between servers for manual load balancing
what alternative approach can you suggest instead of shared collection with customer-identifier particularly to nullify the risk of cross customer data leakage due to a programming oversight
I agree that there are certainly benefits to having a single deployment and one database per customer. You can mitigate the file handle issue somewhat by having a highly sharded cluster (say 100 shards) and have 200 customer databases homed on each shard, giving 20k databases. Since each database has a primary shard you can use movePrimary to distribute them as you wish. As long as you you keep all databases/collections unsharded this will scale well.
At the other extreme end, you could to launch one deployment per customer. On-prem this is difficult, but on Atlas you could programmatically launch a new M10 for each customer via the API. This reduces the risk of a “noisy neighbor”. If one of your customer’s is highly loaded they are not going to negatively impact the performance of other customers since their VMs are isolated.
while I understand this information, i also want to clarify on the requirements:
- OnPrem deployment will usually be for 1 customer so usually 1 or a couple of databases only are required on a local community server.
- large number of customers is for the Hosted service which could be the community server alongwith the application on a commercial virtual server or a cloud database server with a separate application server
hope that is more specific