Data duplication advice needed

Hello, I have an issue that has been troubling me for a long time; I hope you can help me and ease my doubts and worries.

I have 3 collections:

  1. Letters - queried, searched, and updated hundreds to thousands of times a day
  2. Clients - clients send letters. Very rarely updated: maybe once every few months, possibly never
  3. Categories - describes the type of any given letter. Again, very rarely updated: maybe once every few months, possibly never

Having read a lot about denormalization and data duplication, it appears that the latter two could be valid candidates for duplication, since a lot of read operations are performed on Letters while barely any updates are expected on Clients or Categories.

So, based on this assumption, the Letters collection is currently designed this way:

{
  "recipient": "John Doe",
  "categoryDetails": {
    "name": "Tracked letter",
    "categoryId": ObjectId("64130f4127049b2126a9747e")
  },
  "clientDetails": {
    "name": "Company A",
    "clientId": ObjectId("64130f4127049b2123244747")
  }
}

The data in the "categoryDetails" and "clientDetails" fields is copied directly from the Categories and Clients collections.

This means that if someone should ever want to change a Category name or Client name, an update query would have to be run on all Products to update the respective name field.
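
To make the problem concrete, such a fan-out update would look roughly like this with the Node.js driver (the collection names "clients" and "letters" and the "mail" database are just what I am assuming here):

import { MongoClient, ObjectId } from "mongodb";

// Rough sketch of the fan-out update described above; collection names
// ("clients", "letters") are assumptions based on the schema shown earlier.
async function renameClient(mongo: MongoClient, clientId: ObjectId, newName: string) {
  const db = mongo.db("mail");
  // 1. Update the source-of-truth Client document.
  await db.collection("clients").updateOne({ _id: clientId }, { $set: { name: newName } });
  // 2. Fan the change out to every letter embedding a copy of the name.
  await db.collection("letters").updateMany(
    { "clientDetails.clientId": clientId },
    { $set: { "clientDetails.name": newName } }
  );
}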

Considering this strategy, a few questions arise:

  1. Is it a good strategy? Again, the chances of somebody ever updating a Category or Client once they are inserted are very slim; however, the concern is that the list of products will grow continuously, eventually reaching thousands of products and making such an update operation increasingly expensive over time.
  2. What is the best way to update Products if a Category or Client changes? Specifically:
    a) Would transactions be a good choice here, to ensure the documents in both (or even all three) collections are updated correctly? Any better alternatives?
    b) Considering the constant growth of the product count, should this update run in the main app process, or should it be detached and placed in a queue of some kind, or assigned to a background task that performs the update behind the scenes?

Looking forward to your help,
V

What?

We absolutely have no clue what a "product" is, because you never mention products anywhere before this point; your question only describes Letters, Clients, and Categories.

If the data size of clients and categories is small, I would prefer fetching them all once and storing them in app-server memory, so that you don't have to duplicate any data.

Given that this content is rarely modified, querying once and caching forever works well. (What if there are changes? Just change it in MongoDB and then do a rolling restart of the servers.)
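
Something like this is what I have in mind, as a sketch (assuming a Node.js app; the database and collection names are illustrative):

import { MongoClient } from "mongodb";

// Sketch: warm an in-process cache once at startup; names are illustrative.
const clientNames = new Map<string, string>();
const categoryNames = new Map<string, string>();

async function warmCaches(mongo: MongoClient) {
  const db = mongo.db("mail");
  // Both collections are small and rarely change, so load them whole.
  for (const doc of await db.collection("clients").find().toArray()) {
    clientNames.set(doc._id.toHexString(), doc.name);
  }
  for (const doc of await db.collection("categories").find().toArray()) {
    categoryNames.set(doc._id.toHexString(), doc.name);
  }
}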

Of course there is a trade-off. But you know, everything is a trade-off in a distributed world.

Hello, thank you for your reply and for pointing out the error in the question. The forum won't allow me to edit my original question to correct it, but I used the words "products" and "letters" interchangeably, meaning

products are letters

We just treat letters as products and their senders as clients. Sorry for the lack of clarity in the original question.

Hello, and thank you. I find your advice interesting; however, I would like this particular case to be resolved within the DB layer. I don't think a client name change should require a restart of the whole server.

I’d say I am 51% convinced that the following strategy could work:

  1. A client name change is attempted on the Client document
  2. a) Start a new transaction in the main flow to update the client name both in the given Client document and in all products that embed it. The execution time will grow with the age of the application, as more and more products get added over time, making each such update larger and larger. (A sketch of what I mean follows this list.)
    b) Respond to the request immediately with {"job_status": "running"} and run the transaction in the background, updating the job status once it is done: {"job_status": "done"}
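
For option 2a, I picture something like this sketch (Node.js driver; this assumes a replica set, since multi-document transactions require one, and the collection names are the ones from my original question):

import { MongoClient, ObjectId } from "mongodb";

// Sketch of option 2a: update the Client document and all products that
// embed its name inside one transaction. Requires a replica set.
async function renameClientInTransaction(mongo: MongoClient, clientId: ObjectId, newName: string) {
  const session = mongo.startSession();
  try {
    await session.withTransaction(async () => {
      const db = mongo.db("mail");
      await db.collection("clients").updateOne(
        { _id: clientId },
        { $set: { name: newName } },
        { session }
      );
      await db.collection("letters").updateMany(
        { "clientDetails.clientId": clientId },
        { $set: { "clientDetails.name": newName } },
        { session }
      );
    });
  } finally {
    await session.endSession();
  }
}

One thing that worries me here is that transactions have a default runtime limit of 60 seconds (transactionLifetimeLimitSeconds), which is exactly where the ever-growing products collection could eventually bite.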

However, there is one thing preventing me from committing to this strategy:

The Products collection is bound to grow indefinitely, as more and more products are inserted every day, which means that if somebody chooses to update a Client, the number of products that need to be queried and updated is constantly increasing.

I was thinking of maybe using refs and just $lookup-ing the client details when I need them; however, that would mean running an aggregation query every time I want to fetch a product or products, which is guaranteed to happen many times a day, just to dodge a rare problem that only occurs if and when somebody decides to update a Client document. So I don't know; I don't find either of these scenarios fully foolproof.
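
For completeness, the $lookup alternative I am describing would look something like this (assuming letters store only a plain clientId reference instead of the embedded clientDetails copy):

import { MongoClient } from "mongodb";

// Sketch of the reference + $lookup alternative: join the client document
// in at read time instead of duplicating its fields into every letter.
async function findLettersWithClient(mongo: MongoClient, recipient: string) {
  return mongo.db("mail").collection("letters").aggregate([
    { $match: { recipient } },
    { $lookup: {
        from: "clients",
        localField: "clientId",   // assumes a plain reference field
        foreignField: "_id",
        as: "client",
    } },
    { $unwind: "$client" },
  ]).toArray();
}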

Is there anyone who can help me with this? MongoDB documentation, blog articles, and forums provide a lot of information, advice, and advocacy for denormalization and data duplication, but I could not find many practical examples or scenarios of how to handle updates once you actually do need to update, especially on collections that are destined to keep growing, such as the products.

Having done tons of reading and browsing, and unfortunately not having received any follow-up advice here, I have decided to go with heavy data duplication for those fields that are unlikely to change but are read very often. Since I chose this path from the very beginning, I am unfortunately unable to verify whether, and by how much, it is more performant than referencing fields and populating/$lookup-ing them. However, it appears to be the strategy that MongoDB heavily promotes both on their forums and in their docs, and that makes sense, seeing how it is one of the core things that separates document DBs from relational DBs. I guess with document DBs you really need to take the red pill and be willing to go down the rabbit hole to experience the database at its full potential.

What is really curious is that MongoDB is very vociferous and vehement about advocating for duplicated fields, yet they provide very few practical details on how to keep those duplicated fields up to date. Having done some loosely related reading, I have reached the conclusion that transactions are the safest way to go, especially for updates that affect duplicated fields across multiple collections.

MongoDB almost discourages the use of transactions, stating that they are less performant than normal queries and thus should only be chosen in edge cases. Well, updating duplicated fields appears to be exactly such an edge case, and almost exactly what transactions were made for: trading some performance for data integrity and consistency, when the latter matter more, on queries that are destined to run very rarely.

The infinitely growing products collection dilemma remains unsolved. It will result in potentially longer update times in the future, possibly pushing the updating of duplicated fields behind the scenes as a slow, tedious background process that runs very rarely, in exchange for fast daily atomic reads.
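
For anyone landing here later, the background variant I keep referring to (option 2b above) could look roughly like this; the "jobs" collection and the function name are my own invention, and a real setup would likely use a proper queue or worker process:

import { MongoClient, ObjectId } from "mongodb";

// Sketch of option 2b: respond immediately, track progress in an assumed
// "jobs" collection, and let the fan-out update run in the background.
async function scheduleClientRename(mongo: MongoClient, clientId: ObjectId, newName: string) {
  const db = mongo.db("mail");
  const { insertedId: jobId } = await db
    .collection("jobs")
    .insertOne({ type: "client-rename", clientId, newName, job_status: "running" });

  // Fire and forget; a real worker/queue would live outside the request path.
  void (async () => {
    await db.collection("clients").updateOne({ _id: clientId }, { $set: { name: newName } });
    await db.collection("letters").updateMany(
      { "clientDetails.clientId": clientId },
      { $set: { "clientDetails.name": newName } }
    );
    await db.collection("jobs").updateOne({ _id: jobId }, { $set: { job_status: "done" } });
  })();

  return jobId; // the caller can poll the job document for {"job_status": "done"}
}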