Data loss during flexible sync schema change

Hi MongoDB team! In our app we’ve seen, and been able to consistently reproduce, data loss that occurs during a flexible sync schema update. We’re hoping to gain some insight into what is causing the issue and what we should be doing differently to prevent it.

We’re working on a minimal reproduction, but in the meantime, here is the general setup:

  1. Our app is running Realm JS
  2. The relevant portion of the schema creates a document tree, essentially:
const NodeSchema = {
  name: 'Node',
  primaryKey: '_id',
  properties: {
    _id: 'uuid',
    text: 'string',
    children: 'Node[]'
  }
}
  3. Users in the client app can create new Nodes. Creating a Node both creates a new document and adds that document to another Node’s children property.
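To make step 3 concrete, here is a minimal sketch of the kind of write involved, assuming the Realm JS API. The helper name `createChildNode` and the UUID construction are illustrative, not from our actual codebase:

```javascript
// Create a new Node and link it into an existing parent's children list
// in a single write transaction.
function createChildNode(realm, parent, text) {
  let child;
  realm.write(() => {
    child = realm.create('Node', {
      _id: new Realm.BSON.UUID(),
      text,
      children: []
    });
    // This update to the parent's children field is the one that
    // ends up being lost in the scenario described below.
    parent.children.push(child);
  });
  return child;
}
```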

The issue we’re seeing comes up when we deploy a schema update of our production app that adds a field to the Node schema. Specifically we have seen issues adding embedded object and array fields. Here is the sequence of events:

  1. An authenticated user has the app open
  2. The new schema is deployed, adding a field to the Node object
  3. The user creates new Nodes in the app
  4. Upon restarting the app, Nodes created after the schema change are no longer visible in the client

A couple important things we have noted as we’ve been debugging this issue:

  • We’ve observed (via realm.objects().addListener()) a large number of collection changes in the client app beginning a few minutes after the schema is deployed (they appear to be swapping out data locally). They happen in batches of a dozen or so and can take a few minutes to complete, depending on the number of documents the user has.
  • Records created by the client during the collection changes do not seem to appear in the database until after the collection changes complete.
  • Newly created records are written to the database, but the updates to the children field of the parent in which they were created are not. This leaves them orphaned and not visible to the client.
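For reference, the listener we used to observe this churn looks roughly like the following sketch, assuming the Realm JS collection-listener API (the function name and log format are illustrative):

```javascript
// Attach a change listener to all Node objects and log batch sizes,
// to spot the post-deployment churn described above.
function watchNodes(realm) {
  const nodes = realm.objects('Node');
  nodes.addListener((collection, changes) => {
    console.log(
      `insertions=${changes.insertions.length}`,
      `deletions=${changes.deletions.length}`,
      `modifications=${changes.modifications.length}`
    );
  });
  return nodes;
}
```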

My best guess as to what is happening is that when the client determines that its own schema does not match the server schema, it begins to replace the local data with data from the server. Any local changes to existing data, such as adding a new object to the children array of an existing Node, are overwritten with the data that exists on the server.

Is this what is happening? And if so, what steps can we take to ensure that updates to documents during a schema change are persisted? If not, does anyone have thoughts on why we’re seeing this type of data loss? Again, we’re working on a minimally reproducible example right now, but any more insight in the meantime would be very helpful!

Hi, unfortunately, it sounds like you are running into a consequence of the additive-only initial sync we perform when changes are made to a collection. When a field is added to a synced collection, we start an asynchronous process that searches through all of the documents in your cluster, identifies documents that have that field set, and pushes those values into Device Sync.

It sounds like what is happening is the following:

  1. Additive change is made to the schema
  2. Asynchronous process kicks off to perform an additive initial sync
  3. Client connects and updates the field
  4. Initial sync sees the new field in the document (or just default values) and uploads it to Device Sync, overwriting the update the client made in step 3

There are a couple of things you can do here:

  1. Wait for the initial sync to complete before having a client connect and upload changes to that field (a banner will appear in the UI when this is happening)
  2. Ensure that no documents in Atlas have the new field populated. It is possible that for embedded objects we always re-write the state of the object during this process, but I will look into that if you confirm that your documents do not have a value for these new fields before the change.

Best,
Tyler

Hi Tyler,

Thanks for your response! Unfortunately I’m not sure either of those suggestions are options for us. We have a lot of end users who essentially never close their app, so we cannot guarantee that no clients are connected before making changes to the schema. The second suggestion is already true. The schema changes we make are always associated with added functionality on the client app. When we roll out a new feature, we update the database schema first, and then update client apps to work with the new fields in the schema. Doing it this way ensures that there are no documents with the new field in Atlas until the schema has been updated.

Got it. It looks like the issue stems from the fact that this is in a list of embedded objects if I understand correctly. If you add a top-level field we will only sync changes for that field, but if you add to a list of embedded objects then we have to re-write the entire list as that is the only safe thing to do to ensure the fields are present.

If you are certain that when you update your schema the new fields are not present in your documents, we have a feature flag that we can add to your application to skip this additive initial sync. If you send the URL to your application (or applications) I can apply it to the app.

The danger of this flag is that if you do happen to add a new field that is a list property, and then we skip this additive initial sync, and then that new list is appended to, it will generate an invalid history since we never synced down the initial state of the list. That being said, if your process is that the field is never populated when you add to the schema, then this is a safe change.

Best,
Tyler

It looks like the issue stems from the fact that this is in a list of embedded objects if I understand correctly. If you add a top-level field we will only sync changes for that field, but if you add to a list of embedded objects then we have to re-write the entire list as that is the only safe thing to do to ensure the fields are present.

So the field where we’re specifically seeing data loss is not a list of embedded objects as far as I can tell. It is the children field in this schema, which is an array of documents (note the embedded flag is not present):

const NodeSchema = {
  name: 'Node',
  primaryKey: '_id',
  properties: {
    _id: 'uuid',
    text: 'string',
    children: 'Node[]'
  }
}

If we change the schema above to be

const NodeSchema = {
  name: 'Node',
  primaryKey: '_id',
  properties: {
    _id: 'uuid',
    text: 'string',
    children: 'Node[]',
    // new field added to the schema
    newField: 'number[]'
  }
}

and then a user adds a new Node document to an existing Node document’s children field, it is the update to the existing document’s children field that ends up being lost.

We’ve been able to reproduce the data loss when the type of newField is an array (e.g., 'number[]', like above) or an embedded object. We do not see the same data loss when newField is a simple type like 'number' or 'string'.
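To illustrate the end state we observe, here is a plain-data sketch (deliberately not using Realm APIs, so it can stand alone) of how one might detect the orphaned Nodes: documents that exist but are referenced by no parent’s children array. The function name and the `rootId` parameter are illustrative:

```javascript
// Given plain Node records of the form { _id, children: [childId, ...] },
// return those that no other Node references (excluding the known root).
function findOrphans(nodes, rootId) {
  const referenced = new Set();
  for (const node of nodes) {
    for (const childId of node.children) referenced.add(childId);
  }
  return nodes.filter(n => !referenced.has(n._id) && n._id !== rootId);
}
```

In our case, Nodes created during the additive initial sync show up as orphans: the new documents are written, but the parent’s children update is lost.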

Hi,
Please share your app ID (you’ll find this in the URL) via the DM I just sent you, and I can look into this issue for you.

Thanks,
Niharika

Just reached out, thanks Niharika!