Spark Connector 3.0.0 Java API for Dataset save doesn't recognize host address "host.docker.internal"

I am using the MongoDB Spark Connector 3.0.0 with Play Framework 2.8.7 and Apache Spark.
MongoDB Community Edition runs in a Docker container, and the Apache Spark cluster also runs as containers on a separate Docker bridge network.

I call the MongoSpark Java API

MongoSpark.write(resultsDataset)
     .option("collection", "maycollection")
     .option("replaceDocument", "false")
     .mode(SaveMode.Overwrite).save();

to save a Dataset/DataFrame in Java (OpenJDK 11) to the MongoDB instance on the Docker host, and I got an error from the Spark Connector saying that the host address “host.docker.internal” is unknown:

com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting to connect. Client view of cluster state is {type=UNKNOWN, servers=[{address=host.docker.internal:27017, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketException: host.docker.internal}, caused by {java.net.UnknownHostException: host.docker.internal}}]
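
For reference, I started the SparkSession along these lines; the app name and master URL below are placeholders and “mydb” stands in for my real database name, only the “spark.mongodb.output.uri” host reflects my actual setup:

import org.apache.spark.sql.SparkSession;

// Sketch of the session setup; only the output URI host matches my config.
SparkSession spark = SparkSession.builder()
     .appName("play-mongo-spark")          // placeholder app name
     .master("spark://spark-master:7077")  // placeholder cluster URL
     .config("spark.mongodb.output.uri", "mongodb://host.docker.internal:27017/mydb")
     .getOrCreate();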

Since the Spark workers/executors run in Docker containers, the common approach on macOS or Windows to reach the Docker host IP, where my MongoDB container resides, is to use “host.docker.internal”, which resolves to the real IP of the Docker host.
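
To narrow down which JVM fails to resolve the name, one can run a plain JDK lookup once in the driver process and once inside an executor (nothing connector-specific here):

import java.net.InetAddress;
import java.net.UnknownHostException;

// Prints the resolved IP, or the resolution failure, for the current JVM.
try {
    InetAddress addr = InetAddress.getByName("host.docker.internal");
    System.out.println("resolved to " + addr.getHostAddress());
} catch (UnknownHostException e) {
    System.out.println("cannot resolve host.docker.internal: " + e);
}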

The strange thing is that it only happens when I store a Dataset. Without changing anything else, I converted the Dataset to an RDD and saved the RDD using a WriteConfig via MongoSpark.save(RDD<?>, …):

WriteConfig writeConfig = genWriteConfig(jsc, collectionSb);
MongoSpark.save(resultsRDD, writeConfig);
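
genWriteConfig is a small helper of mine; a simplified sketch, assuming it only sets the collection name on top of the “spark.mongodb.output.uri” already in the SparkConf, would look like this:

import com.mongodb.spark.config.WriteConfig;
import org.apache.spark.api.java.JavaSparkContext;

// Builds a WriteConfig from the context's conf and overrides the collection.
private static WriteConfig genWriteConfig(JavaSparkContext jsc, StringBuilder collectionSb) {
    return WriteConfig.create(jsc)
         .withOption("collection", collectionSb.toString())
         .withOption("replaceDocument", "false");
}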

The call executed flawlessly; the aforementioned “host.docker.internal” unknown-host error didn’t occur. I suspect there is some inconsistent behaviour between saving an RDD and saving a Dataset, which is really annoying.

I started the SparkSession with a SparkConf whose “spark.mongodb.output.uri” contains the MongoDB host “host.docker.internal”, and that setting wasn’t changed between the two tests. Still, the RDD and Dataset save() calls behave differently. Does the save function of the Spark Connector perform different sanity checks on the MongoDB URI and host address internally?

In my last test I replaced the “host.docker.internal” string with the IP address of the Docker host in the SparkConf value for “spark.mongodb.output.uri” and called the following code to save a Dataset:

MongoSpark.write(resultsDataset)
       .option("collection", "maycollection")
       .option("replaceDocument", "false")
       .mode(SaveMode.Overwrite).save();

With the plain IP address, the save() function works fine.

I have opened an issue for the Spark Connector on Jira (https://jira.mongodb.org/browse/SPARK-287) and was advised to raise my questions here in the community support.

I am grateful for any hints and help. Do you also experience the same unknown-host error?

Hi @Yingding_Wang,

That’s strange, as both MongoSpark.write and MongoSpark.save should ultimately follow the same code path.

Does using MongoSpark.save(dataset, writeConfig) work as expected?
That bypasses using the DataFrameWriter API.
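
Something along these lines (a sketch; I’m assuming your SparkSession variable is called spark):

import com.mongodb.spark.MongoSpark;
import com.mongodb.spark.config.WriteConfig;

// Derive the WriteConfig from spark.mongodb.output.uri and save the
// Dataset directly, bypassing the DataFrameWriter code path.
WriteConfig writeConfig = WriteConfig.create(spark)
     .withOption("collection", "maycollection")
     .withOption("replaceDocument", "false");
MongoSpark.save(resultsDataset, writeConfig);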

Could you post the stacktrace of the error?

Ross

Hi @Ross_Lawley,

Thank you so much for your help. I think I have found the root cause of my issue, and it is my fault: the Apache Spark driver node runs outside the Apache Spark Docker cluster.

I accidentally performed Dataset.collect() and Dataset.take() actions with spark.mongodb.input.uri set to the MongoDB host address host.docker.internal. Since collect() and take() run on the driver node, and the driver node outside Docker cannot resolve the host string host.docker.internal, the connection failed there.
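
A minimal repro of what I did (variable names simplified; spark is my SparkSession):

import com.mongodb.spark.MongoSpark;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import java.util.List;

// load() reads spark.mongodb.input.uri, i.e. host.docker.internal in my config.
Dataset<Row> ds = MongoSpark.load(spark);
// take() is an action whose results come back to the driver; my driver runs
// outside Docker and cannot resolve host.docker.internal, hence the timeout.
List<Row> first = ds.takeAsList(5);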

Sorry again for the time you spent on my silly mistake. I probably made this logical error because of a wrong impression of how the Spark Connector works: I thought the Spark Connector would be executed solely inside the Spark executor nodes, that the Dataset/RDD would be loaded only on the Spark executors, and that it would therefore be safe for the driver node to call collect(), count() or take() actions. But it seems that the Spark Connector loads lazily and instead carries out part of the execution plan on the driver node.

Thanks again for your help.

Hi @Yingding_Wang,

Glad you were able to find the cause! Spark generally treats data as a lazy collection, and in doing so the Spark driver will send work to the Spark worker nodes. However, that is outside the control of the MongoDB (or any) Spark Connector.

All the best,

Ross
