We are trying to query a Mongo database hosted in Atlas from an Azure Databricks cluster.
The Atlas database is hosted in an M10 cluster with three electable nodes in AWS (the ones used by the transactional application) and an additional read-only node in Azure (the one we are trying to connect to).
We already set up a peering connection between our VNet and the Atlas one, and whitelisted the appropriate IP range. We confirmed that we can ping the read-only node using its private DNS name from one of the Databricks worker nodes, and that we can telnet to port 27017. Moreover, using pymongo from one of the workers, we are able to connect to the database and query the collections.
However, when we try to connect from Databricks through Spark, we get timeout errors that appear to be related to the mongo-spark-connector not honoring the readPreference configuration.
This is the URI we are trying to use (omitting sensitive details):
mongodb+srv://<user>:<password>@<cluster>-pri.wrmoz.mongodb.net/<database>.<collection>?tls=true&readPreference=nearest&readPreferenceTags=provider:AZURE,region:US_EAST,nodeType:READ_ONLY&readConcernLevel=local
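For reference, the read-related options carried by this URI can be listed with a few lines of standard-library Python. This is only a sketch for checking the string itself, not anything the driver or the connector does; the placeholders are kept exactly as in the URI above, and `read_options` is a hypothetical helper name.

```python
from urllib.parse import parse_qs

# Same URI as above, placeholders left untouched.
URI = (
    "mongodb+srv://<user>:<password>@<cluster>-pri.wrmoz.mongodb.net/"
    "<database>.<collection>?tls=true&readPreference=nearest"
    "&readPreferenceTags=provider:AZURE,region:US_EAST,nodeType:READ_ONLY"
    "&readConcernLevel=local"
)


def read_options(uri):
    """Return the query-string options of a MongoDB URI as a dict.

    Hypothetical helper for inspection only. If an option is repeated,
    only its first occurrence is kept.
    """
    _, _, query = uri.partition("?")
    opts = {key: values[0] for key, values in parse_qs(query).items()}
    # A readPreferenceTags value is one tag set: comma-separated key:value pairs.
    if "readPreferenceTags" in opts:
        opts["readPreferenceTags"] = dict(
            pair.split(":", 1) for pair in opts["readPreferenceTags"].split(",")
        )
    return opts


if __name__ == "__main__":
    for key, value in read_options(URI).items():
        print(key, "=", value)
```

So the URI asks for readPreference `nearest` restricted to the Azure read-only tag set, with readConcernLevel `local`; nothing in it requests `primary`.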
Yet, when we load the data as a DataFrame and perform a simple show(), we get a connection timeout error.
The stack trace of the exception shows that the driver was able to ping the desired node (while being unable to reach the AWS ones, as expected), but it refuses to connect to it because the node doesn't match the expected readPreference, primary.
We also tried specifying each of the parameters as individual options, both in the global cluster config and in code. We tried both the v10.0 and the v3.0 versions of the connector.
Nevertheless, no matter what we tried, we always got the same error.
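To make "individual options" concrete, here is a rough sketch of the kind of Spark configuration keys involved for each connector line. It is not our exact code: `connector_read_conf` is a hypothetical helper, the key names are the ones I understand the v10.x (`spark.mongodb.read.*`) and v3.x (`spark.mongodb.input.*`) connectors to read, and the JSON-array format for `readPreference.tagSets` is an assumption that should be checked against the connector documentation.

```python
import json

# Tag set matching the Azure read-only node, as in the URI above.
AZURE_TAGS = {"provider": "AZURE", "region": "US_EAST", "nodeType": "READ_ONLY"}


def connector_read_conf(uri, major_version):
    """Build Spark conf entries for reading with mongo-spark-connector.

    Hypothetical helper. `major_version` is 10 or 3.
    """
    if major_version == 10:
        # v10.x: read preference and tags stay inside the connection URI.
        return {"spark.mongodb.read.connection.uri": uri}
    if major_version == 3:
        return {
            "spark.mongodb.input.uri": uri,
            "spark.mongodb.input.readPreference.name": "nearest",
            # Assumption: tag sets are given as a JSON array of documents.
            "spark.mongodb.input.readPreference.tagSets": json.dumps([AZURE_TAGS]),
            "spark.mongodb.input.readConcern.level": "local",
        }
    raise ValueError("unsupported connector major version: %r" % major_version)


if __name__ == "__main__":
    for key, value in connector_read_conf("<uri>", 3).items():
        print(key, value)
```

These entries can go in the cluster's Spark config or be passed through `SparkSession.builder.config(key, value)` before the DataFrame is loaded.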
Is this expected behavior? If so, can it be changed? Otherwise, does this count as a proper bug report?
Additionally, is there any workaround?