Hi, we have upgraded along the path 4.0 => 4.2 => 4.4.
We have a large sharded collection with { _id: hashed } as the shard key, and balancing has been enabled on it in the past. In my current use case, I want to read from secondaries (with balancing disabled) because I can tolerate stale documents as long as they are not orphaned documents.
I found out recently that reading from a secondary returns duplicate documents when querying via a non-shard-key index. This is true even when balancing is disabled. I investigated further and found that the duplicate documents come from two different shards, so they are likely left over from failed migrations. I found out that this is a known issue from these articles and tickets:
Till 3.4, if you were reading from secondary shard members, there is a possibility of getting duplicate documents (orphans) which seems like your scenario.
When you read from a primary (or a secondary with read concern local starting from mongodb 3.6) the node will apply the shard filter before returning the document. So, we won’t return twice the document. If the shard doesn’t official own the document it will just not returning it even if it has it locally.
^ A MongoDB employee stated this, but I cannot find it in the referenced docs link or anywhere on Google. However, I was able to verify that { readConcern: local } with secondary reads does remove the duplicate documents. Is this information still accurate, without caveats or corner cases?
(2) Background Indexing on Secondaries and Orphaned Document Cleanup in MongoDB 2.6 | MongoDB Blog
The scenario where users typically encounter issues related to orphaned documents is when issuing secondary reads. In a sharded cluster, primary replicas for each shard are aware of the chunk placements, while secondaries are not. If you query the primary (which is the default read preference), you will not see any issues as the primary will not return orphaned documents even if it has them. But if you are using secondary reads, the presence of orphaned documents can produce unexpected results, because secondaries are not aware of the chunk ownerships and they can’t filter out orphaned documents.
^ This official article explains why secondaries return orphaned documents, but it doesn't mention readConcern: local.
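To make the behaviour described in that quote concrete, here is a toy Python model of it. This is not MongoDB code: the Shard class, the owned_ids set and the scatter_gather helper are hypothetical names invented for illustration, and ownership is tracked per _id rather than per chunk range as a real cluster does. It only sketches why a query that fans out to every shard can return the same document twice when nodes do not filter by ownership, and why an ownership filter removes the duplicate.

```python
from dataclasses import dataclass


@dataclass
class Shard:
    """Toy stand-in for one shard (hypothetical, not a MongoDB API)."""
    name: str
    docs: dict        # _id -> document physically stored on this shard
    owned_ids: set    # _ids this shard officially owns (simplified chunk map)

    def find(self, predicate, shard_filter):
        # Match on a non-shard-key field, as in the scenario above.
        hits = [d for d in self.docs.values() if predicate(d)]
        if shard_filter:
            # Drop documents that are stored here but owned elsewhere (orphans).
            hits = [d for d in hits if d["_id"] in self.owned_ids]
        return hits


def scatter_gather(shards, predicate, shard_filter):
    """Send the query to every shard and concatenate the results."""
    results = []
    for shard in shards:
        results.extend(shard.find(predicate, shard_filter))
    return results


# Assumed scenario: document 2 was migrated from shardA to shardB, but the
# copy on shardA was never cleaned up, so it is an orphan there.
shard_a = Shard(
    "shardA",
    {1: {"_id": 1, "status": "open"}, 2: {"_id": 2, "status": "open"}},
    owned_ids={1},
)
shard_b = Shard(
    "shardB",
    {2: {"_id": 2, "status": "open"}, 3: {"_id": 3, "status": "closed"}},
    owned_ids={2, 3},
)


def is_open(doc):
    return doc["status"] == "open"


shards = [shard_a, shard_b]
unfiltered_ids = sorted(d["_id"] for d in scatter_gather(shards, is_open, False))
filtered_ids = sorted(d["_id"] for d in scatter_gather(shards, is_open, True))

print(unfiltered_ids)  # [1, 2, 2] -> document 2 is returned by both shards
print(filtered_ids)    # [1, 2]    -> the orphan on shardA is filtered out
```

In this model the filter only works if the node answering the query knows the current ownership map, which is the part my questions below are about.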
- local : returns data on the local node, but with orphaned documents filtered out. This requires the node to communicate with the shard’s primary (if this read is on a secondary), or the config server to service the read. In a degraded sharded cluster, this read may stall indefinitely. This is not an issue for an unsharded collection, though. Possible to return data that could be rolled back.
^ A MongoDB engineer mentioned this about the local read concern, but I want to clarify the second statement. Does it communicate with the shard's primary, the config server, or both? I am concerned about performance degradation if it communicates with the shard's primary. Since we want to disable balancing, in theory we should be able to perform the shard/chunk ownership filter in mongos itself.
My primary questions are:
- Does readConcern: local in secondary reads guarantee that we don’t get orphaned documents if balancing is disabled?
  - Follow up question: How does it guarantee it OR if it doesn’t, how does mongo filter out duplicate documents with readConcern: local?
- Does it communicate with primary OR does it communicate with mongos only? Could there be some performance implication here?