I am using the MongoSpark connector 3.0.0 with Play Framework 2.8.7 and Apache Spark.
The MongoDB Community Edition runs in a Docker container, and the Apache Spark cluster also runs as containers in a separate bridge network in Docker.
I call the MongoSpark Java API
MongoSpark.write(resultsDataset)
.option("collection", "maycollection")
.option("replaceDocument", "false")
.mode(SaveMode.Overwrite).save();
to save a Dataset/DataFrame in Java (OpenJDK 11) to the MongoDB on the Docker host. I get an error from the Spark connector that the host address “host.docker.internal” is unknown:
com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting to connect. Client view of cluster state is {type=UNKNOWN, servers=[{address=host.docker.internal:27017, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketException: host.docker.internal}, caused by {java.net.UnknownHostException: host.docker.internal}}]
Since the Spark workers/executors run in Docker containers, the common approach on macOS or Windows to reach the Docker host IP (where my MongoDB container resides) is to use the hostname “host.docker.internal”, which resolves to the real IP of the Docker host.
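For context, this is how the alias is typically wired up when Docker does not provide it automatically (on Linux it is not). A minimal docker-compose sketch, assuming Docker Engine 20.10+; the service name and image below are placeholders, not my actual setup:

```yaml
services:
  spark-worker:            # placeholder service name
    image: bitnami/spark   # example image, not necessarily the one in use
    networks:
      - spark-net
    extra_hosts:
      # Maps host.docker.internal to the Docker host's gateway IP
      # inside the container (requires Docker Engine 20.10+).
      - "host.docker.internal:host-gateway"

networks:
  spark-net:
    driver: bridge
```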
The strange thing is that it only happens when I store a Dataset. Without changing anything else, I converted the Dataset to an RDD and saved the RDD using a WriteConfig, MongoSpark.save(RDD&lt;?&gt;, …):
WriteConfig writeConfig = genWriteConfig(jsc, collectionSb);
MongoSpark.save(resultsRDD, writeConfig);
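For completeness, a WriteConfig carrying the same information as the Dataset write above would correspond to Spark properties along these lines (I cannot show my actual genWriteConfig helper; the database name mydb is a placeholder, and the keys are the connector 3.0.0 output options):

```
spark.mongodb.output.uri=mongodb://host.docker.internal:27017
spark.mongodb.output.database=mydb
spark.mongodb.output.collection=maycollection
spark.mongodb.output.replaceDocument=false
```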
The call executed flawlessly; the aforementioned unknown-host error for “host.docker.internal” did not occur. I assume there might be some inconsistent behaviour between saving an RDD and saving a Dataset, and this is really annoying.
I started the SparkSession with a SparkConf whose “spark.mongodb.output.uri” contains the MongoDB host “host.docker.internal”, and that value was never changed. Still, the RDD and Dataset save() calls behave differently. Does the save function of the Spark connector do different sanity checks on the MongoDB URI and host address internally?
My last test was to replace the “host.docker.internal” string with the IP address of the Docker host in the SparkConf value for “spark.mongodb.output.uri” and call the following code to save a Dataset:
MongoSpark.write(resultsDataset)
.option("collection", "maycollection")
.option("replaceDocument", "false")
.mode(SaveMode.Overwrite).save();
The save() function works fine.
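The workaround boils down to rewriting the host in the connection string before it reaches the connector. A minimal, JDK-only sketch; the URI, the IP 192.168.65.2, and the helper name withHost are made-up examples for illustration, not my real settings or part of the connector API:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class MongoUriRewrite {

    // Hypothetical helper: replaces the host of a MongoDB connection URI
    // with an explicit IP, keeping scheme, port, path, and query intact.
    static String withHost(String mongoUri, String newHost) {
        try {
            URI uri = new URI(mongoUri);
            return new URI(uri.getScheme(), uri.getUserInfo(), newHost,
                    uri.getPort(), uri.getPath(), uri.getQuery(),
                    uri.getFragment()).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Invalid MongoDB URI: " + mongoUri, e);
        }
    }

    public static void main(String[] args) {
        String original = "mongodb://host.docker.internal:27017/mydb";
        // prints mongodb://192.168.65.2:27017/mydb
        System.out.println(withHost(original, "192.168.65.2"));
    }
}
```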
I have opened an issue for the Spark connector on Jira (https://jira.mongodb.org/browse/SPARK-287) and was advised to raise my questions here in the community support.
I am grateful for any hints and help. Do you also experience the same unknown-host error?