For any MongoDB deployment, the Spark Connector sets the preferred location for a DataFrame or Dataset to be where the data is.
For a nonsharded system, it sets the preferred location to be the hostname(s) of the standalone or the replica set.
For a sharded system, it sets the preferred location to be the hostname(s) of the shards.
To promote data locality, we recommend taking the following actions:
Ensure there is a Spark Worker on one of the hosts for nonsharded system or one per shard for sharded systems.
For a sharded cluster, have a
mongoson the same nodes and use the
localThresholdconfiguration setting to connect to the nearest
mongos. To partition the data by shard use the
In MongoDB deployments with mixed versions of
mongod, it is
possible to get an
Unrecognized pipeline stage name: '$sample'
error. To mitigate this situation, explicitly configure the partitioner
to use and define the schema when using DataFrames.
To use mTLS, include the following options when you run
--driver-java-options -Djavax.net.ssl.trustStore=<path to your truststore.jks file> \ --driver-java-options -Djavax.net.ssl.trustStorePassword=<your truststore password> \ --driver-java-options -Djavax.net.ssl.keyStore=<path to your keystore.jks file> \ --driver-java-options -Djavax.net.ssl.keyStorePassword=<your keystore password> \ --conf spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStore=<path to your truststore.jks file> \ --conf spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStorePassword=<your truststore password> \ --conf spark.executor.extraJavaOptions=-Djavax.net.ssl.keyStore=<path to your keystore.jks file> \ --conf spark.executor.extraJavaOptions=-Djavax.net.ssl.keyStorePassword=<your keystore password> \
The MongoConnector includes a cache that lets workers
share a single
MongoClient across threads. To specify the length of time to keep a
MongoClient available, include the
mongodb.keep_alive_ms option when you run
--driver-java-options -Dmongodb.keep_alive_ms=<number of milliseconds to keep MongoClient available>
By default, this property has a value of
Because the cache is set up before the Spark Configuration is available, you must use a system property to configure it.