I am really looking for some guidance, I have been giving this some thought for a few days, but due to my lack of experience, I cant come to a solid decision.
I am migrating a website that services business profiles in various locations. Each Business Profile is stored in a Business Profile Collection. Each business profile can have sub pages - like about us / contact us / faqs etc etc. Depending on the type of business this can have as many as 15 sub pages.
Each Sub page JSON data will be between 2KB and 8KB so maximum of about of 120KB of data if all sub pages are enabled.
The question is do I embed this data in the main Business Collection, or maybe in a separate collection. I read many times “model your data based on the application requirements” and in my case the main business profile page and the sub pages will never been needed at the same time. (they are different pages)
So I see I have three possible options
Embed in parent document
Store in a separate collection (BusinessProfileSubPages)
Do additional question to (for example) an FAQs Collection and dont embed anything
I am favouring Option 2) which then leads to another question
If I store the Sub Pages in an additional Collection, so I store each page in a separate document, meaning I would probably need an index on BusinessId and Page Type, or just store of my sub page requirements in one doc, using different fields - maybe separate fields for about us / contact us and FAQs
This way I guess I could find by BusinssId, and retreive on field from the doc. I dont need aboutus and contactus at the same request, so maybe this might be a good reason to have one doc per sub page per business.
I hope this makes some sense, many thanks in advance for you reading or replying to this
I think your logic about embedding or not is on point and indeed, it’s not always a trivial choice. While I can’t provide an absolute answer, maybe I can ask a few questions that could tilt the balance on one side or the other.
How many documents (approx.) would you end up with in the profiles and the subpages collections and how big would be these collections in the end? Is it like 100MB? 1GB? 10GB? 1TB? More? Sometimes a solution is OK when you are still in the “little data” area but then it’s not when you hit the “big data” threshold.
Will you ever need to use $lookup to answer a user query ?
If you embed: you won’t be able to query directly the subpages (or at least without a more complex and less efficient query). Is there a need to query directly the subpages without going through a profile? What about the updates? Do you need update subpages without selecting a profile first?
I would estimate around 100,000 in Business Profiles with an average of 10 SubPages each. I dont think I will need lookup or aggregation, as I am hoping to be able to do it all from the PK/OID
The pages are just Objects/ViewModels for a web site.
The updates does not have to be super quick, as it is on the admin side, but I am trying to get <100ms on the user/public side.
Because of the nature of the design of the website, there is a query before all of this, that enables the correct collection to be queries, as it is not only Business Profiles (but the same theory applys). Due to this query, I will have the PK / OID of the business profile available.
so I am thinking it would be something like this.
db.businessprofiles.find({ _id: ObjectId(“4ecbe7f9e8c1c9092c000027”) })
or (I know the next syntax is incorrect)
db.businessprofiles.find({ _id: ObjectId(“4ecbe7f9e8c1c9092c000027”) }).aboutus
So if I embed, I only need one PK/OID, but if I create new collection, I will need to track two PK for the subpages
If you decide to embed, it’s an array of subdocuments. Nothing too fancy you can do.
But if you choose to separate in 2 collections, it also depends on how you represent the link between the 2 collections. It could be an array of IDs in the parent doc. Or just a reference of the parent ID in the children. Or both (bidirectional linking).
Another solution could also be to leverage the Extended Reference Pattern and store not only the IDs of the children in an array in the parent document, but also some information you might need frequently without the need to access the child. In this case, you would be storing an array of subdocuments in the parent doc with the IDs of each child + a few important fields for each subdocuments.
For example if I want to display a webpage with the books I read. I would have a user and a book collection (that contains the entire book in it!). But in order to display a page of all the books I read, I would just store the IDs of all the book I read along with their titles.
I would only query the book collection if I click on the link on the book’s title in the page to access the details of the book. But the cool part is that all the links are built using only the user document (parent one) and I only query the children collection (books) if I need the details of one of the entries.
This example is a bit simple, but it explains the logic behind this pattern and maybe it could be transposed in your use case here.
Based on the data you provided, the worst case scenario is that the docs are about 120KB large. I assume in average they will be around 60KB. Times 100K docs = 60GB uncompressed and probably around 40GB compressed in WiredTiger. That’s very reasonable in my opinion and I guess both solutions (embedded or not) would work.
From what you presented though, I don’t see any reason why embedding the subdocs would be an issue. But I could be missing information or business needs so…