Relational Migrator Alert for multiple CDC and snapshot data integrity issues
This alert is a combined notification for 13 data integrity issues affecting various continuous (CDC) and snapshot migration configurations in the MongoDB Relational Migrator tool. These issues can cause data loss or corruption without a visible error message. They affect any migration that contains specific mappings for a cyclic dependency loop or ‘Array-in-Doc-in-Array’ nested embedding, or that uses the settings for idempotency, primitive arrays or ‘Merge Into Parent’. They also affect Continuous (CDC) migrations that contain embeddings.
Relational Migrator is a free developer tool that assists with migrating existing applications from relational databases into the MongoDB document database for application migration and modernization projects.
The latest version of Relational Migrator (version 1.14) contains fixes for all these issues. However, previous migrations performed on any older version of Relational Migrator (1.13.2 and earlier) may contain silent data loss or corruption if they meet the issue prerequisites below. MongoDB recommends that all Relational Migrator customers:
If your migration project matches any of these prerequisites, then you are potentially affected:
For users who have migrated data using the Continuous (CDC) migration mode, your data may be affected if it matches one or more specific project prerequisites:
If your migrations do not match any of these prerequisites, then you are unaffected, and the remainder of this advisory does not apply.
If you believe your migrated projects meet one or more of these prerequisites, or are concerned that you might be affected, please open a MongoDB support ticket and use the ‘Generate diagnostic file’ option to attach your migration project mappings. The MongoDB support team will assist with diagnosis and next steps.
Please see below for further details on these issues, which were initially identified through internal testing in April 2025. Issue index:
These issues may affect both snapshot and CDC migrations.
Fix Version: 1.13.2
Prerequisites: The mapping configuration contains a cyclic dependency and another embedded mapping is a child of one of the mappings comprising the dependency cycle.
Summary: This is best explained with an example. In Relational Migrator it is possible to include multiple mappings for a single relational table, in which case the data from the table is written to multiple MongoDB collections.
Collection1 mappings include nesting TableA > TableB > TableC
Collection2 mappings include nesting TableB > TableA
In this case, the data from TableA must be populated in Collection1 before the data from TableB is added, but the data from TableB must also be populated into Collection2 before TableA is added. The RM includes logic to detect and resolve these cycles, and uses cache collections to temporarily store the data so that the same relational table does not need to be scanned multiple times.
A bug in this cyclic dependency handling can cause nested mappings that are children of the cycle to become disconnected from the internal mapping structure and to be omitted from the migration. This bug does not trigger for all projects with the above cyclic structure as it depends on the ordering of the mappings as stored in the project files, but it will trigger deterministically for any given project.
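To check whether a project has the shape described in the prerequisites, the nesting relationships can be treated as a directed graph. The sketch below is only an illustration: it assumes the mappings have been written out by hand as (parent table, child table) pairs, which is a hypothetical representation and not the Relational Migrator project file format. It reports the tables that sit on a cycle and the nested tables hanging off that cycle.

```python
def tables_on_cycle(edges):
    """Return tables that can reach themselves by following parent > child nesting."""
    graph = {}
    for parent, child in edges:
        graph.setdefault(parent, set()).add(child)

    def reachable(start):
        # Iterative depth-first walk over everything nested below `start`.
        seen, stack = set(), list(graph.get(start, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(graph.get(node, ()))
        return seen

    return {table for table in graph if table in reachable(table)}


def children_of_cycle(edges):
    """Return nested tables whose parent is on a cycle but which are not on it themselves."""
    cycle = tables_on_cycle(edges)
    return {child for parent, child in edges if parent in cycle and child not in cycle}


# Nesting edges taken from the Collection1 and Collection2 example above.
edges = [("TableA", "TableB"), ("TableB", "TableC"), ("TableB", "TableA")]
at_risk = children_of_cycle(edges)
print(at_risk)
```

A non-empty result only means the prerequisite structure is present. Whether the bug actually triggers depends on the stored mapping order, as noted above.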
Fix Version: 1.13.1
Prerequisites: Users have manually enabled the migrator.engine.transform.idempotency: true setting in their user.properties file, and the mapping configuration includes an Embedded Array mapping which has another Embedded Document or Array nested within the Embedded Array.
Summary: By default, Relational Migrator migrations are not guaranteed to be idempotent. In particular, embedded arrays are appended to using $push operations, which append duplicate entries if run multiple times. An idempotency setting, a documented feature of Relational Migrator, can be enabled in the user.properties configuration file. This setting makes inserts of documents and embedded arrays idempotent, so that the migration can be run twice without creating duplicate documents or entries.
When this idempotent mode is enabled, the handling of embedded arrays uses the MongoDB $addToSet operation instead of the $push operation in order to make array inserts idempotent. This works correctly for embedded arrays with no nested mappings underneath them as the new document should always exactly match the existing document already present in the array.
When the embedded array mapping has a nested child mapping of any kind, the second run of the migration will add duplicate entries to the array. This is because the array entry already in the document from the previous migration run contains an additional field for the nested child mapping and the newly added entry does not contain that field. The $addToSet operation correctly sees these as two different array entries and appends the new entry.
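The effect can be reproduced without a database. The following sketch mimics the whole-element equality that $addToSet relies on, using plain Python lists and dicts. The field names (order_id, items) are hypothetical, and Python dict equality ignores field order, which is a simplification of how MongoDB compares documents.

```python
def add_to_set(array, entry):
    """Mimic $addToSet: append only if no existing element equals the entry exactly."""
    if entry not in array:
        array.append(entry)


# Embedded array with no nested child mapping: the second run is a no-op.
flat_orders = [{"order_id": 1}]
add_to_set(flat_orders, {"order_id": 1})

# Embedded array whose entry gained a nested child field during the first run.
nested_orders = [{"order_id": 1, "items": [{"sku": "A"}]}]
# The second run inserts the bare entry again, without the nested field.
add_to_set(nested_orders, {"order_id": 1})

print(len(flat_orders), len(nested_orders))
```

The bare entry differs from the stored entry by the nested field, so it is treated as a new element and appended.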
Fix Version: 1.13.1
Prerequisites: The mapping configuration includes a nested structure with at least two Embedded Array mappings (one nested under the other) and at least one Embedded Document mapping in between the arrays.
Summary: This issue affects any Snapshot or Continuous migration where the RM mappings include a nested mapping of the following structure: New Document ( Embedded Array ( Embedded Document ( Embedded Array))).
When this mapping structure is used, the second level nested array is omitted from the migration entirely. Any embedded layers nested below this layer are also omitted from the migrated data.
The above is the minimal example, collections with additional nested layers are also affected. To trigger this bug, at least 4 relational tables must be combined into one MongoDB collection in this nested structure, with at least two embedded-array mappings and at least one embedded document mapping nested in between the two arrays.
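As a rough aid for reviewing mappings against this prerequisite, the structure can be checked with a small recursive walk. This is a sketch under assumptions: the mapping tree is a hand-built dict with hypothetical type labels (new_document, embedded_array, embedded_document), not the format Relational Migrator stores.

```python
def has_array_doc_array(mapping, state=0):
    """Detect Embedded Array > Embedded Document(s) > Embedded Array on any path.

    state 0: no array seen yet; 1: inside an array; 2: array then document seen.
    """
    kind = mapping["type"]
    if kind == "embedded_array":
        if state == 2:
            return True
        state = 1
    elif kind == "embedded_document" and state >= 1:
        state = 2
    return any(has_array_doc_array(child, state) for child in mapping.get("children", []))


def node(kind, *children):
    """Build one hypothetical mapping node."""
    return {"type": kind, "children": list(children)}


# Minimal affected structure from the summary above.
affected = node("new_document",
                node("embedded_array",
                     node("embedded_document",
                          node("embedded_array"))))

# Array directly inside an array has no document in between, so it does not match.
array_in_array = node("new_document", node("embedded_array", node("embedded_array")))

print(has_array_doc_array(affected), has_array_doc_array(array_in_array))
```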
Fix Version: 1.14
Prerequisites: A Continuous migration or Snapshot migration with cyclic dependency on the relevant mappings was run including an Embedded Document mapping with the ‘Merge fields into the parent’ option enabled, and an additional Embedded Array mapping nested beneath that mapping.
Summary: When creating or editing an Embedded Document mapping, under Advanced Settings the user can choose to “Merge fields into the parent”. This means that fields are not grouped under their own BSON document field, but instead added as additional fields to the parent - whether that parent is a top-level document or another embedded document.
This Parent-Merge Embedded Document mapping can still have Embedded Document or Embedded Array children. Like all embedded mappings, they are refreshed whenever any of the parents are updated or inserted. This bug affects only the specific case of an Embedded Array that is the child of a Parent-Merge Embedded Document that is not merged directly into the top-level document of the collection. In that case, the child-refresh aggregation pipeline incorrectly writes the array into the top level document, instead of the correct nested location.
When this happens the nested array remains present in its correct location if it is already there, but is not updated to reflect any changes in the join fields of the parent. Similarly when a new parent is inserted, any pre-existing children in this nested array are not populated in the expected location.
Fix Version: 1.14
Prerequisites: Any migration with a cyclic dependency loop that includes an Embedded Array using the ‘primitive arrays’ setting.
Summary: A primitive value array is a type of Embedded Array mapping, where the user has selected a source column to be mapped as values in an array field, instead of an array of documents. During CDC, change events to the values in the primitive array are applied correctly, but the aggregation-pipeline logic to refresh the array after updates to the parent document does not correctly handle primitive value arrays. When the parent is updated, the primitive value array is replaced with an array of documents, where each document contains a field with the correct value. This can also prevent subsequent delete events from being applied, as the expected value to be deleted is not found. Update events for embedded arrays are implemented as a delete and an insert event, meaning update events can create duplicate entries. This issue affects all primitive value array mappings where the parent document receives update events during a CDC migration.
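The corrupted shape is easy to recognise once seen. The sketch below uses a hypothetical phone field to show the expected array, the array of wrapper documents left behind after a parent update, and why a later delete of the plain value finds nothing. The unwrap helper is an illustrative assumption about how such entries could be flattened, not a documented repair procedure.

```python
def wrapped_entries(values):
    """Return the entries that are documents where primitive values were expected."""
    return [v for v in values if isinstance(v, dict)]


def unwrap(values, field):
    """Replace single-value wrapper documents with the value they carry."""
    return [v[field] if isinstance(v, dict) and field in v else v for v in values]


expected = ["555-0100", "555-0199"]
# Shape after the parent refresh: each value is wrapped in a document.
corrupted = [{"phone": "555-0100"}, {"phone": "555-0199"}]

# A delete event looks for the plain value, which is no longer present.
delete_finds_value = "555-0100" in corrupted

print(delete_finds_value, len(wrapped_entries(corrupted)), unwrap(corrupted, "phone"))
```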
These issues may affect users who have migrated data using the Continuous (CDC) migration mode:
Fix Version: 1.14
Prerequisites: A Continuous migration was run with a mapping configuration that uses an autogenerated ID key strategy and contains Embedded Document or Array mappings and the relational database was changed such that a table with a New Document mapping generated both insert events and update or delete events.
Summary: During a CDC migration, when multiple rows are changed in a short time window, those change events can be grouped into a single processing batch. For each batch, the corresponding documents are updated first, and then the updates to the embedded children of those documents are combined into one aggregation-pipeline per mapping. Each aggregation pipeline starts with a $match clause which identifies the updated documents.
Actual data loss in this situation can be unpredictable, so for safety, customers should assume they are affected when running any CDC migration containing embeddings, using an autogenerated ID strategy. For different kinds of change events (insert/update/delete), the way that these documents are uniquely identified is different. In some cases the generated document ObjectID is used, and in some cases specific fields are matched by value.
The bug occurs when these different types of unique identifiers are combined incorrectly and the resulting $match clause matches no documents. This means that the original change event is correctly applied, but the child documents are not correctly refreshed. For example, a newly inserted document that happens at the same time as other change events might result in no children being added to that new document.
As above, the contents of all the child rows are stored in cache collections during a CDC migration, and deleted only when the migration is manually ended by the user. If the cache collections used during the CDC process are preserved, then this data is retained in MongoDB and can be repaired by resolving each cache document to the correct parents. If the cache collections have been deleted, then the child tables must be re-migrated from the relational database.
Fix Version: 1.14
Prerequisites: A Continuous migration was run including an Embedded Array mapping, and the columns (usually a foreign key) used to join that embedded array table with the parent table were updated to a new value (on a parent row) during the Continuous migration without also updating the primary key of the parent row.
Summary: When a row is updated during a Continuous Migration, this can potentially change the fields used to identify the child elements in embedded arrays. The aggregation pipeline to refresh the child array after an update uses an $unwind operation on the list of children matched by the parent document. In MongoDB, the $unwind operation has a preserveNullAndEmptyArrays option whose default value is false. As a result, the aggregation pipeline excluded any document from processing in subsequent stages if the result of the child array lookup was an empty array.
In cases where the parent is updated such that it used to match some array children and now no longer matches any, the correct behaviour would be to remove those array entries, but due to this bug those children are not removed. This issue can affect any Embedded Array mapping unless the immediate parent of the array is another array, or the update to the parent row includes updating a primary key field. In both those cases, updates are handled as an insert and delete operation and are not affected by this issue.
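The dropped-document behaviour of $unwind can be illustrated with a small stand-in written in plain Python. It is a simplified model of the stage, not the migrator's pipeline, and the children field name is hypothetical.

```python
def unwind(docs, field, preserve_null_and_empty=False):
    """Simplified model of $unwind on a single field."""
    out = []
    for doc in docs:
        value = doc.get(field)
        if isinstance(value, list) and value:
            # One output document per array element.
            out.extend({**doc, field: item} for item in value)
        elif value is not None and not isinstance(value, list):
            # A non-array value is treated as a single-element array.
            out.append(dict(doc))
        elif preserve_null_and_empty:
            # Keep the document; an empty array field is dropped from the output.
            out.append({k: v for k, v in doc.items() if k != field or v is None})
    return out


parents = [
    {"_id": 1, "children": [{"c": "x"}]},
    {"_id": 2, "children": []},  # parent that no longer matches any child
]

default_ids = [d["_id"] for d in unwind(parents, "children")]
preserved_ids = [d["_id"] for d in unwind(parents, "children", preserve_null_and_empty=True)]
print(default_ids, preserved_ids)
```

With the default setting, parent 2 never reaches the later stages, so nothing clears its stale array entries.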
Fix Version: 1.14
Prerequisites: A Continuous migration was run including nested Embedded Array mappings, i.e. an Embedded Array that contains an Embedded Array within it and any of the tables involved were inserted or updated during the migration.
Summary: The continuous migration logic in the RM has special handling for nested array-in-array mappings. When a relational row is inserted, deleted or updated, the change is applied directly to MongoDB. However, when a row is updated or inserted and its mapping has nested child mappings, those child mappings may also be affected: a newly inserted row may need child documents to be inserted, and an updated field in the row may change the join and thus produce a different set of child entries.
During a continuous migration, if there is a nested mapping of the form New Document > Embedded Array > Embedded Array (possibly with additional nesting layers), then an update to the fields of the upper layers will trigger an aggregation-pipeline that regenerates the nested array entries. The top level array is updated correctly, but the nested array aggregation pipeline logic is incorrect. This bug is triggered only for nested mappings with at least two arrays in the nesting stack. It is triggered when any of the rows of the upper layers are mutated.
This incorrect logic has two outcomes depending on whether the join field names match between parent and child documents.
For example, suppose a teachers collection has a field teacher_id and contains an embedded array students, where each student document also has a field called teacher_id. To create the embedded array, these two teacher_id fields are matched to determine which student document to add to which teacher document. In this case the join-field names match: they are both teacher_id. If the teacher or student document used a different name, we call that non-matching join-field names.
For nested mappings with matching join-field names, the incorrect aggregation-pipeline logic results in a no-op. This can cause the wrong set of nested child entries to remain present in some cases when the field values used to join with the parent are changed, but these kinds of changes to foreign key fields are rare. In most cases this no-op aggregation pipeline would not result in data loss.
For nested mappings with non-matching join-field names, this incorrect logic results in a very different outcome. It causes the entire set of rows from the relational table to be added to the child array, instead of only the relevant rows that match the join criteria. If this nested table is large, then it is quite likely that the update will fail due to the MongoDB 16MB document limit.
In both cases, only the documents that contain the equivalent entry for the updated relational row are affected.
This corruption can be generally repaired by removing any entries in the array where the intended join fields do not match between parent and child documents.
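A minimal sketch of that pruning step, operating on a plain dict rather than a live collection, might look as follows. The field names (id, student_ref, grades) are hypothetical, and the function assumes a single-field join. A real repair would need the join fields taken from the project mappings.

```python
def prune_mismatched(parent, array_field, parent_key, child_key):
    """Return a copy of parent keeping only array entries whose join value matches,
    together with the number of entries removed."""
    entries = parent.get(array_field, [])
    kept = [e for e in entries if e.get(child_key) == parent.get(parent_key)]
    return {**parent, array_field: kept}, len(entries) - len(kept)


# A student entry whose nested array was filled with rows for every student.
student = {
    "id": 7,
    "grades": [
        {"student_ref": 7, "score": 90},
        {"student_ref": 8, "score": 55},
        {"student_ref": 9, "score": 70},
    ],
}

repaired, removed = prune_mismatched(student, "grades", "id", "student_ref")
print(removed, repaired["grades"])
```

The same function can be applied to each element of the outer array, since the affected parent is itself an array entry.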
Fix Version: 1.14
Prerequisites: A Continuous migration was run, including a mapping configured to use the ‘calculated ID’ setting which also has a child Embedded Document or Embedded Array mapping, and update events were received that change the calculated ID without also being a change to the primary key on the source database table.
Summary: When a relational table does not have a primary key, or its primary key is not used for the _id field in the target collection, the document can either have an autogenerated _id field or a calculated _id built from the table field(s) that uniquely identify the records of the table.
In the case of a calculated ID field that is generated from non-primary-key columns, when the change events come through, the original fields are updated, but the _id of the MongoDB document is not updated. For any subsequent operations, this document can no longer be found as the _id field does not match the expected value. Hence, the aggregation pipeline for updating or inserting children has no effect, and any subsequent update or delete events do not modify or delete the original document. These update events instead result in the creation of a new document.
Fix Version: 1.14
Prerequisites: A Continuous migration was run including mapping fields with null-handling set to “omit” and in which a row was updated to set the value from a previous non-null value to a null value.
Summary: For each field on each mapping, if the column is nullable in the source database, the user can select a null handling strategy. The default is “Insert as null” but the user can also select “omit” which indicates the field should be omitted entirely if the value in the source database is null. For snapshot migrations and insertion change events, if the value is null, the document is created correctly omitting that field. For update change events during CDC, the operation that applies the update to MongoDB also omits that field if the new value is null.
The result is that for a field with null handling strategy “Omit”, setting the value to null in the source database does not apply a change to any pre-existing value in MongoDB. The result is that a non-null value is persisted in MongoDB, after the value is set to null in the source database. If the value is later set to a non-null value, that update is applied correctly. Any other fields changed in the same update event are applied correctly.
This issue affects any field in New Document or Embedded Document mappings with the “omit” null handling strategy. It does not affect Embedded Array mappings as update events for arrays are processed separately as a delete and an insertion which results in null fields being correctly omitted.
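The difference between the observed behaviour and the expected one can be shown with plain dicts standing in for update documents. In the sketch below, buggy_update models the behaviour described above, while corrected_update is an assumed correction using $unset. It is not taken from the 1.14 fix, and the nickname field is hypothetical.

```python
def buggy_update(row, omit_fields):
    """Drop omitted-null fields from $set, leaving any old value untouched."""
    return {"$set": {k: v for k, v in row.items()
                     if not (v is None and k in omit_fields)}}


def corrected_update(row, omit_fields):
    """Assumed correction: also $unset omitted fields whose new value is null."""
    update = buggy_update(row, omit_fields)
    unset = {k: "" for k, v in row.items() if v is None and k in omit_fields}
    if unset:
        update["$unset"] = unset
    return update


def apply_update(doc, update):
    """Apply a $set/$unset update to a plain dict, returning a new dict."""
    result = {**doc, **update.get("$set", {})}
    for field in update.get("$unset", {}):
        result.pop(field, None)
    return result


existing = {"_id": 1, "name": "Ann", "nickname": "Annie"}
changed_row = {"name": "Ann", "nickname": None}  # nickname set to NULL at the source

after_buggy = apply_update(existing, buggy_update(changed_row, {"nickname"}))
after_corrected = apply_update(existing, corrected_update(changed_row, {"nickname"}))
print(after_buggy, after_corrected)
```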
Fix Version: 1.13.1
Prerequisites: A Continuous migration was run including a mapping from a source table with no primary key, which included Embedded Array or Document child mappings under it, and during the migration update events were received which changed the value of a join column used to identify the child mapping rows.
Summary: During a Continuous Migration, after an update change event is applied, an aggregation pipeline is triggered to refresh or populate any children of the updated document. For update events the source table primary key is used to identify the documents to be matched in this aggregation pipeline. In cases where no primary key is available, all field values are used to match the documents both for the initial update and the aggregation-pipeline. The issue is that the same filter is re-used for both cases, but for tables with no primary key, after the update is applied the filter no longer matches.
This issue can occur for update events to tables used in New Document and Embedded Document mappings - it does not occur on updates to Embedded Arrays. This does not occur for tables with primary keys, regardless of whether the primary key field is updated or not. This is because any changes to primary key columns trigger insert and delete change events rather than update change events.
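The filter re-use problem can be seen with a list of dicts standing in for the collection. This is an illustrative model only, and the name and dept_id columns are hypothetical.

```python
def matches(doc, flt):
    """True when every field in the filter equals the document's value."""
    return all(doc.get(k) == v for k, v in flt.items())


# Row from a table with no primary key: all column values identify the document.
before = {"name": "Ann", "dept_id": 10}
after = {"name": "Ann", "dept_id": 20}  # the join column changes

collection = [dict(before)]
all_fields_filter = dict(before)

# Step 1: the update itself finds the document using the old values.
for doc in collection:
    if matches(doc, all_fields_filter):
        doc.update(after)

# Step 2: the child refresh re-uses the same filter, which now matches nothing.
stale_matches = [d for d in collection if matches(d, all_fields_filter)]
# A filter built from the new values would still find the document.
fresh_matches = [d for d in collection if matches(d, after)]
print(len(stale_matches), len(fresh_matches))
```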
Fix Version: 1.13.1
Prerequisites: A Continuous migration was run including an Embedded Array mapping from a source table with no primary key, with at least one field excluded from the migration (unchecked) and update or delete events were received for rows in that table during the migration.
Summary: When a relational table has no primary key defined, the set of all values of all columns in the table must be used to uniquely identify a row or document. For array mappings, when a row in this table is updated, it’s represented as a delete and an insert operation.
There is a bug in the deletion process for rows from an embedded array when the table has no primary key defined. The set of all fields is used to identify the array entry for the $pull operation that deletes the entry. This works correctly when all columns from the table are present as document fields, but if one or more fields are excluded from the mapping, then the $pull operation incorrectly includes these fields, which matches nothing and the entry is not deleted.
In this situation, when a row is deleted from the relational database, it is not deleted from the MongoDB embedded array. When a row in the relational database is updated, the deletion likewise fails but the subsequent insertion succeeds, resulting in duplicate entries in the embedded array.
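A small model of the $pull condition makes the failure visible. The sketch below treats the condition as a set of field equalities applied to each array entry. It ignores MongoDB's null-matching rules, and the column names are hypothetical.

```python
def pull(array, condition):
    """Simplified $pull: remove entries where every condition field is present and equal."""
    return [e for e in array
            if not all(k in e and e[k] == v for k, v in condition.items())]


# Source row with three columns; internal_note is excluded (unchecked) in the mapping.
row = {"sku": "A", "qty": 2, "internal_note": "fragile"}
mapped_fields = {"sku", "qty"}

embedded = [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]

# Condition built from all columns: the excluded field is never present, so nothing matches.
after_all_columns = pull(embedded, row)

# Condition restricted to the mapped fields removes the intended entry.
after_mapped_only = pull(embedded, {k: v for k, v in row.items() if k in mapped_fields})
print(len(after_all_columns), len(after_mapped_only))
```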
Fix Version: 1.14
Prerequisites: A Continuous migration was run that included more than one layer of nesting of Embedded Document and Array mappings and a parent mapping of at least two layers of embedded mappings received insert or update events during the migration.
Summary: This bug affects nested mappings where one embedded mapping contains another. It affects both embedded array and embedded document mappings.
When a migration has a nested mapping of the form parent > child1 > child2, an update to the row corresponding to the parent might require that the children be updated (effectively refreshed) from the relevant cache collections. For example, if the parent is a newly inserted row, then the child documents must be resolved and added. If the parent document is modified, this does not affect the children unless the join-key fields are modified, in which case the children must be re-resolved and replaced.
This issue is in the ordering of resolving those child mappings. The mappings to be resolved were stored internally in an unordered Set collection meaning that child2 might be refreshed before child1. If child2 is resolved before child1 then child1 will be resolved correctly but child2 remains unchanged as the definition of child2 depends on the contents of child1.
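One way to make the refresh order deterministic is to sort the mappings by nesting depth so that every parent is resolved before its children. The sketch below illustrates that idea under assumptions: the mappings are a hypothetical dict from mapping name to parent name, and this is not a description of the change shipped in 1.14.

```python
def refresh_order(parents):
    """Return mapping names ordered parents-first.

    parents maps each mapping name to its parent name (None for the root).
    """
    def depth(name):
        d = 0
        while parents[name] is not None:
            name = parents[name]
            d += 1
        return d

    return sorted(parents, key=depth)


# The same three mappings listed in an arbitrary, child-first order.
nesting = {"child2": "child1", "child1": "parent", "parent": None}
order = refresh_order(nesting)
print(order)
```

Resolving in this order means child1 is already populated by the time child2, which depends on it, is refreshed.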
The contents of all the child rows are stored in cache collections during a Continuous migration, and deleted only when the migration is manually ended by the user. If the cache collections used during the CDC process are preserved, then this data is retained in MongoDB and can be repaired by resolving each cache document to the correct parents. If the cache collections have been deleted, then the child tables must be re-migrated from the relational database.
If you have questions or would like further information, please contact MongoDB support.