I would love to hear some opinions on a data model I’m working on.
I have an existing system that tracks changes to documents, so that for every current revision of a distinct document there is a history of everything that has happened to it since insertion. I dislike the existing system and am considering some alternative approaches:
- Separate collections (current system).
- Single collection, separate documents.
- Single collection, single document.
- Change streams.
Briefly, a document looks something like this:
{ "id":"foo", "revision":1, "value":"First version"}
Modifying the value field increments the revision number, and a log of the document's previous state is written along with the update. In reality there are many more document properties, but this is the gist of it. More about my design goals after I've described the approaches I'm considering.
1) Separate collections.
The current system keeps a master collection where all documents are updated in place, maintaining a single current state. For every update, I insert the document's previous state into a separate "revisions" collection.
The main reason I dislike this is that the two writes are not atomic, so for reliability I need to wrap them in a transaction. This approach also doubles my collection count, which may not sound important, except that I have a lot of collections. That's another story, but for now halving them would be a bonus.
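For reference, here's roughly what the reliable version looks like: a minimal mongosh sketch, assuming collections named documents and revisions (illustrative names) and a replica set, since multi-document transactions require one.

const session = db.getMongo().startSession();
try {
  session.withTransaction(() => {
    const docs = session.getDatabase("mydb").documents;
    const revs = session.getDatabase("mydb").revisions;

    // Archive the outgoing state, then bump the live document.
    const current = docs.findOne({ id: "foo" });
    delete current._id; // let the revision get its own _id
    revs.insertOne(current);

    const res = docs.updateOne(
      { id: "foo", revision: current.revision }, // guard against concurrent writers
      { $set: { value: "Second version" }, $inc: { revision: 1 } }
    );
    if (res.modifiedCount !== 1) {
      // Someone else got in first; throwing aborts the transaction,
      // so the revision insert rolls back too.
      throw new Error("concurrent modification");
    }
  });
} finally {
  session.endSession();
}

This does give all-or-nothing behaviour, just at the cost of the transaction machinery.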
2) Single collection, separate documents.
Another route I'm investigating is to store every revision as a separate document in the same collection, which means that every "update" is actually an insert. Listing all current documents would then be an aggregation query that returns only the highest-revision document for each id.
My main concern with this approach is that two people making an update at roughly the same time (having both pulled revision n into application memory) could end up both inserting the same incremented revision number (n+1). I'm also unsure about the performance of the aggregation query as the number of revisions per document grows.
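For what it's worth, a unique compound index turns that race into a duplicate-key error that the losing writer can catch, re-read, and retry, and the listing query could look something like the sketch below (against an assumed documents collection).

// One writer wins the race to insert revision n+1; the other gets a
// duplicate-key error instead of silently forking history.
db.documents.createIndex({ id: 1, revision: -1 }, { unique: true });

// List the current revision of every document: newest first within
// each id, then keep the first document per group.
db.documents.aggregate([
  { $sort: { id: 1, revision: -1 } },
  { $group: { _id: "$id", latest: { $first: "$$ROOT" } } },
  { $replaceRoot: { newRoot: "$latest" } }
]);

The descending index also lets the $sort stage walk the index rather than sorting in memory, though the $group still has to touch every revision, which is the part I'd benchmark.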
3) Single collection, single document.
My favourite route is to store a complete history of each document within the document. Each update pushes the dirty state of the document being modified into an array.
This works well and is atomic, but I am concerned about the 16MB document size limit. Although hitting it is unlikely, it is certainly possible, and I would need a backup plan for overflow data.
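To make that concrete, here is a minimal sketch using an aggregation-pipeline update (MongoDB 4.2+), with history as an illustrative field name. Expressions in the $set stage read the document's pre-update values, so one atomic operation archives the outgoing state and bumps the revision.

db.documents.updateOne(
  { id: "foo", revision: 1 }, // optimistic guard on the expected revision
  [
    { $set: {
        // Append the outgoing state to the embedded history array.
        history: { $concatArrays: [
          { $ifNull: ["$history", []] },
          [ { revision: "$revision", value: "$value" } ]
        ] },
        value: "Second version",
        revision: { $add: ["$revision", 1] }
    } }
  ]
);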
4) Change streams.
I’ve done some experiments with change streams whereby every update is detected by a constantly running background process which logs the change event to another collection.
I feel this is too precarious to rely upon for data integrity. If an event were missed for some reason (e.g. momentary downtime), the data would simply be gone. I am keen that every update from my application is guaranteed to log the change, as long as the main document update succeeds.
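For completeness, my experiment looks roughly like the sketch below: a long-running mongosh watcher, assuming pre-images are enabled on the collection (changeStreamPreAndPostImages, MongoDB 6.0+) so each event carries the prior state.

const cursor = db.documents.watch(
  [ { $match: { operationType: "update" } } ],
  { fullDocumentBeforeChange: "whenAvailable" }
);

while (!cursor.isClosed()) {
  const event = cursor.tryNext(); // non-blocking; null when nothing new
  if (event === null) continue;
  db.revisions.insertOne({
    docKey: event.documentKey,
    before: event.fullDocumentBeforeChange, // prior state, when captured
    clusterTime: event.clusterTime
  });
}

Resume tokens would let the watcher pick up after a restart, but that's more machinery to get right, and it still isn't the in-band guarantee I'm after.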
In terms of design goals:
Performance: The most common action will be querying/listing the current revision of documents. The next most common action is probably updating them. The least common action is likely to be querying historical revisions. The performance of that last part is not of much concern.
Integrity: I want to ensure that if a document update succeeds, the change logging also succeeds. One without the other must be impossible.
I’m open to suggestions for new approaches, or tips to make these ones work well.
Thanks for reading!