For any MongoDB deployment, the Mongo Spark Connector sets the preferred location for a DataFrame or Dataset to be where the data is:
- For a non-sharded system, it sets the preferred location to be the hostname(s) of the standalone or the replica set.
- For a sharded system, it sets the preferred location to be the hostname(s) of the shards.
To promote data locality:

- Ensure there is a Spark Worker on one of the hosts for a non-sharded system, or one per shard for sharded systems.
- Use a `nearest` read preference to read from the local `mongod`.
- For a sharded cluster, run a `mongos` on the same nodes and use the `localThreshold` configuration to connect to the nearest `mongos`. To partition the data by shard, use the `MongoShardedPartitioner`.
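The locality settings above can be sketched as connector input options. This is a minimal illustration using the Mongo Spark Connector 2.x `spark.mongodb.input.*` key naming; the URI, database, and collection values are hypothetical placeholders.

```python
# Sketch of connector input options that promote data locality.
# Key names follow the Mongo Spark Connector 2.x conventions;
# the URI, database, and collection values are hypothetical.
locality_options = {
    "spark.mongodb.input.uri": "mongodb://localhost:27017/",
    "spark.mongodb.input.database": "test",
    "spark.mongodb.input.collection": "events",
    # Read from the nearest member rather than always from the primary.
    "spark.mongodb.input.readPreference.name": "nearest",
    # Only consider mongos instances within 15 ms of the fastest one.
    "spark.mongodb.input.localThreshold": "15",
    # Partition the data by shard.
    "spark.mongodb.input.partitioner": "MongoShardedPartitioner",
}

# These options would typically be applied while building the
# SparkSession, e.g. .config(key, value) for each pair.
for key, value in locality_options.items():
    print(f"{key}={value}")
```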
In MongoDB deployments with mixed versions of `mongod`, it is possible to get an `Unrecognized pipeline stage name: '$sample'` error. To mitigate this situation, explicitly configure the partitioner to use and define the schema when using DataFrames.
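One way to apply both mitigations can be sketched as follows. This assumes the connector's 2.x option naming and picks `MongoPaginateByCountPartitioner` as an example of a partitioner that does not rely on the `$sample` stage; the field names in the schema are hypothetical.

```python
# Sketch: avoid $sample on mixed-version deployments by choosing the
# partitioner explicitly and supplying a schema instead of inferring one.
# Option keys follow Mongo Spark Connector 2.x naming; field names
# below are hypothetical.
read_options = {
    # The default sampling-based partitioner uses the $sample
    # aggregation stage, which older mongod versions do not recognize.
    "spark.mongodb.input.partitioner": "MongoPaginateByCountPartitioner",
}

# An explicit schema skips the connector's sampling-based schema
# inference. Spark accepts a DDL-formatted schema string:
schema_ddl = "name STRING, age INT"

# With a SparkSession available, the read would look roughly like:
#   df = (spark.read.format("mongo")
#         .options(**read_options)
#         .schema(schema_ddl)
#         .load())
print(read_options["spark.mongodb.input.partitioner"], schema_ddl)
```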