I’m aware of the document versioning pattern, but I can’t seem to find anything official in the MongoDB resources about versioning lots of large documents with frequent changes.
I’ve had a look at the data modelling course on the university site and it doesn’t cover it.
Are there any patterns that can be used to version lots of large documents with frequent changes? Or is MongoDB a bad use case for this?
What do you mean by frequent changes? Does each change create a new version document?
How are you considering keeping the versions? It sounds like if the document is large and changes are frequent, it’s best not to embed old versions but to create a new document…
Be aware that an updated document is completely written back to permanent storage, even the unmodified values.
So if your document is large and is frequently updated, you might suffer write starvation. In this case, a variation of the outlier pattern might be appropriate if only a few fields of the large document are frequently updated. You would keep the stable fields in the main large document and store the frequently modified fields in a separate outlier document or documents. This would reduce the write starvation, since the frequently updated and rewritten parts are much smaller than the stable main document.
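To make the split concrete, here is a minimal sketch of what the two document shapes could look like. All collection and field names (`order`, `orderStatus`, `mainId`, etc.) are illustrative assumptions, not from the original post:

```javascript
// Stable fields live in the main document, which is rarely rewritten.
const mainDoc = {
  _id: "order-123",
  customer: { name: "Acme Corp" },
  items: [], // imagine a large, mostly static array here
};

// Frequently updated fields live in a small companion document,
// so each update rewrites only a small document instead of the large one.
const hotDoc = {
  _id: "order-123-status",
  mainId: "order-123", // link back to the main document (index this field)
  status: "shipped",
};

// An update then touches only the small document, e.g. in mongosh:
// db.orderStatus.updateOne({ mainId: "order-123" }, { $set: { status: "delivered" } })
```

Reading the full picture back requires joining the two documents (two `findOne` calls or a `$lookup`), which is the trade-off for the cheaper writes.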
@Pavel_Duchovny Thank you so much for this. I’ve got further clarification of the requirements. Essentially, each time a change is made to a document, an entire snapshot needs to be taken of the previous version, not the tracking of individual fields as I said previously. There will always be a “main” version of a document plus all its previous versions from each time it changed. Any aspect of the document could potentially change.
The queries will be basic, i.e. just to retrieve individual documents, so it will be “byId”.
Yes, the requirement is to have a new “main” document so therefore I would expect a newer timestamp for each change.
In that case it sounds like you may benefit from splitting the data into a “latest” collection and a history collection.
In the latest collection you will store the most recent version and have it queried and indexed by id.
While the history collection will receive the previous state. So essentially an update is a “transaction” of delete => insert new with same _id => insert old.
If updates are so frequent that the critical path is the write and not the read, you may consider the following alternative:
Insert a new version into the main collection, keeping the history in the same one (you may offload the history as a batch process).
In this design I envision the collection as follows:
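A hypothetical shape for that single collection might look like this (the `docId`, `version`, and `current` field names are my assumptions, not from the thread):

```javascript
// All versions of a logical document live side by side in one collection;
// only one of them is flagged as current.
const versions = [
  { docId: "invoice-42", version: 3, current: true,  data: { total: 300 } },
  { docId: "invoice-42", version: 2, current: false, data: { total: 250 } },
  { docId: "invoice-42", version: 1, current: false, data: { total: 100 } },
];

// The write path is a cheap insert of the new version; a background/batch
// job can later move current:false documents out to an archive collection.
// The "byId" read would use a compound index, e.g. { docId: 1, current: 1 }:
// db.versions.findOne({ docId: "invoice-42", current: true })
const current = versions.find(v => v.docId === "invoice-42" && v.current);
```

The cost of this design is that the read must distinguish the current version from the historical ones, and the writer must flip the `current` flag (or compare version numbers/timestamps) when inserting.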