Spark connector not respecting readPreference configuration

Luis_Miguel_Mejia_Suarez · September 30, 2022, 11:38pm

We are trying to query a Mongo database hosted in Atlas from an Azure Databricks cluster.

The Atlas database is hosted in an M10 cluster with three electable nodes in AWS (the ones used by the transactional application) plus an additional read-only node in Azure (the one we are trying to connect to).
We already set up a peering connection between our vnet and the Atlas one, and whitelisted the appropriate IP range. We confirmed that we can ping the read-only node using its private DNS name from one of the Databricks worker nodes; we even confirmed that we can telnet to port 27017. Moreover, using pymongo from one of the workers, we are able to connect to the database and query the collections.
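
For readers who want to repeat the telnet-style check from a notebook cell on a worker, here is a minimal sketch using only the Python standard library. The function name `can_reach` is hypothetical and the host in the usage note is a placeholder; the check only proves that a TCP connection can be opened, not that the MongoDB driver will later select that node.

```python
import socket


def can_reach(host, port=27017, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened.

    Rough equivalent of `telnet host port`: it tests network
    reachability only and does not speak the MongoDB protocol.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False
```

Calling `can_reach("<private-dns-name-of-the-read-only-node>")` from a worker should return `True` if the peering and the IP access list are set up as described above.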

However, when we try to connect from Databricks we get timeout errors, which appear to be caused by the mongo-spark-connector not honoring the readPreference configuration.
This is the URI we are trying to use (omitting sensitive details):

mongodb+srv://<user>:<password>@<cluster>-pri.wrmoz.mongodb.net/<database>.<collection>?tls=true&readPreference=nearest&readPreferenceTags=provider:AZURE,region:US_EAST,nodeType:READ_ONLY&readConcernLevel=local
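
To make explicit what this URI asks for, the sketch below pulls the read preference mode and tag set out of the query string using only the standard library. It is an illustration of the URI format, not the driver's actual parser; the function name `read_preference_from_uri` is hypothetical. It falls back to `primary` when no `readPreference` option is present, which is the documented driver default.

```python
from urllib.parse import parse_qs, urlsplit


def read_preference_from_uri(uri):
    """Extract (mode, tag_sets) from a MongoDB connection string.

    Each readPreferenceTags option is a comma-separated list of
    key:value pairs and forms one tag set; the option may be repeated
    to provide several tag sets in priority order.
    """
    query = parse_qs(urlsplit(uri).query)
    # Drivers default to the "primary" mode when none is given.
    mode = query.get("readPreference", ["primary"])[0]
    tag_sets = [
        dict(pair.split(":", 1) for pair in raw.split(",") if pair)
        for raw in query.get("readPreferenceTags", [])
    ]
    return mode, tag_sets
```

For the URI above this yields the mode `nearest` and a single tag set with `provider`, `region` and `nodeType`, so only a node carrying all three tags should be eligible for reads.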

Yet, when we try to load the data as a DataFrame and perform a simple show(), we get a connection timeout error.
The stack trace of the exception shows that the driver was able to ping the desired node (while being unable to reach the AWS ones, as expected), but it refuses to connect to it because the node doesn’t match the expected read preference:

primary

We also tried specifying each of the parameters as individual options, both in the global cluster config and in code, and we tried both the v10.0 and the v3.0 versions of the connector.
Nevertheless, no matter what we tried, we always got the same error.
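
For context, the two connector generations use different data source names and different configuration keys for the connection string, so a setting written for one is not picked up by the other. The sketch below only builds the key/value pair for each version; the helper name `mongo_read_settings` is hypothetical, and the commented usage assumes an existing `spark` session, so it is a starting point rather than a confirmed fix for the timeout.

```python
def mongo_read_settings(uri, connector_major):
    """Return (format_name, spark_conf) for a given connector major version.

    v10.x registers the "mongodb" source and reads
    spark.mongodb.read.connection.uri; v3.x registers the "mongo"
    source and reads spark.mongodb.input.uri.
    """
    if connector_major >= 10:
        return "mongodb", {"spark.mongodb.read.connection.uri": uri}
    return "mongo", {"spark.mongodb.input.uri": uri}


# Possible usage (assumes the matching connector JAR is installed and
# the returned entries were added to the cluster's Spark config):
#   fmt, conf = mongo_read_settings(uri, 10)
#   df = spark.read.format(fmt).load()
#   df.show()
```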

Is this expected behavior? If so, can it be changed? Otherwise, does this count as a proper bug report?
Additionally, is there any workaround?

Paul_Dudley · February 16, 2023, 2:03pm

Hi Luis, I’m not sure if this is helpful in your use case, but if you are interested in replicating your MongoDB data to Databricks by reading the change streams, that is something you could do with Streamkap. It doesn’t solve your immediate problem, but it is an alternative way to get data into Databricks and has the advantage of not putting query load on your MongoDB instance.