FAQ
How can I achieve data locality?
For any MongoDB deployment, the MongoDB Spark Connector sets the preferred location for a DataFrame or Dataset to be where the data is:

- For a non-sharded system, it sets the preferred location to be the hostname(s) of the standalone or the replica set members.
- For a sharded system, it sets the preferred location to be the hostname(s) of the shards.
To promote data locality:

- Ensure there is a Spark worker on one of the hosts for a non-sharded system, or one per shard for sharded systems.
- Use a nearest read preference to read from the local mongod.
- For a sharded cluster, run a mongos on the same nodes and use the localThreshold configuration option to connect to the nearest mongos. To partition the data by shard, use the ShardedPartitioner configuration.
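As a minimal sketch, the settings above can be collected into the read options passed to the connector. This assumes the v10.x connector's option names (format "mongodb", the ShardedPartitioner class name) and uses a hypothetical host, database, and collection:

```python
# Hedged sketch of read options that promote data locality.
# Assumptions: MongoDB Spark Connector v10.x option names; the host,
# database, and collection names are hypothetical placeholders.
locality_read_options = {
    # Connect through the mongos running on the same node as the Spark worker.
    # readPreference=nearest lets the driver read from the closest member,
    # and localThresholdMS bounds the latency window used to pick it.
    "connection.uri": (
        "mongodb://127.0.0.1:27017/test.myCollection"
        "?readPreference=nearest&localThresholdMS=15"
    ),
    # Partition the DataFrame by shard so each task reads shard-local data.
    "partitioner": (
        "com.mongodb.spark.sql.connector.read.partitioner.ShardedPartitioner"
    ),
}

# Usage (requires a SparkSession with the connector on the classpath):
#   df = spark.read.format("mongodb").options(**locality_read_options).load()
```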
How do I resolve an Unrecognized pipeline stage name error?
In MongoDB deployments with mixed versions of mongod, it is possible to get an Unrecognized pipeline stage name: '$sample' error, because the $sample aggregation stage requires MongoDB 3.2 or later and the default partitioner relies on it. To mitigate this situation, explicitly configure a partitioner that does not use $sample, and define the schema when using DataFrames.
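A minimal sketch of that mitigation, assuming the v10.x connector's option and partitioner names (PaginateBySizePartitioner is one of the partitioners that does not rely on $sample); the URI and field names are hypothetical placeholders:

```python
# Hedged sketch: avoid the $sample-based default partitioner and supply an
# explicit schema so the connector does not sample documents to infer one.
# Assumptions: v10.x connector option and partitioner class names; the URI
# and fields are hypothetical placeholders.
mixed_version_options = {
    "connection.uri": "mongodb://host:27017/db.collection",
    # Paginate-by-size partitioning uses range queries instead of $sample,
    # so it also works against mongod versions that predate $sample.
    "partitioner": (
        "com.mongodb.spark.sql.connector.read.partitioner.PaginateBySizePartitioner"
    ),
}

# An explicit schema (Spark DDL string) prevents schema inference,
# which would otherwise sample documents from the collection.
explicit_schema = "name STRING, age INT"

# Usage (requires a SparkSession with the connector on the classpath):
#   df = (spark.read.format("mongodb")
#         .options(**mixed_version_options)
#         .schema(explicit_schema)
#         .load())
```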