PySpark job keeps running with MongoDB Spark connector v10.0.x

Spark 3.3.0, MongoDB Atlas 5.0.9, Spark connector 10.x

I run a small PySpark job that reads from MongoDB Atlas and writes to BigQuery.

Until now, with the MongoDB Spark connector v3.0.x, I never encountered any errors, and the job ended normally after loading the MongoDB documents and saving them to BigQuery.

A few days ago, after upgrading to the newest connector version (10.0.x), I started seeing strange behavior: my job keeps running even after all tasks have finished successfully.

Here is the problematic line (by that, I mean that if I comment out just this one line, my whole job ends correctly):

df = spark.read.format("mongodb").options(database="database", collection="collection").load()

More precisely, it is the .load() call on this line that seems to be the issue; the rest of the line causes no problem on its own.
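For anyone who wants to reproduce this, a minimal sketch of a standalone script might look like the following. It assumes the connector jar is already on the classpath and that the connection string is passed through the v10 option `spark.mongodb.read.connection.uri`; the function name `run_repro`, the app name, and the placeholder database and collection names are only illustrative.

```python
def run_repro(uri, database="database", collection="collection"):
    """Load one MongoDB collection, count its rows, then stop Spark."""
    # Imported lazily so this module can be imported without PySpark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("mongo-load-repro")  # illustrative name
        .config("spark.mongodb.read.connection.uri", uri)
        .getOrCreate()
    )
    try:
        df = (
            spark.read.format("mongodb")
            .options(database=database, collection=collection)
            .load()
        )
        return df.count()
    finally:
        # Always stop Spark, even if the read fails.
        spark.stop()


if __name__ == "__main__":
    import sys

    # Usage: spark-submit repro.py "<connection string>"
    print(run_repro(sys.argv[1]))
```

If the behavior described here is present, the script should print the count, log the SparkContext shutdown, and then fail to return to the shell.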

Since then, the last lines of my logs always look like this:

22/07/27 23:00:09 INFO SparkUI: Stopped Spark web UI at http://192.168.1.173:4040
22/07/27 23:00:09 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/07/27 23:00:09 INFO MemoryStore: MemoryStore cleared
22/07/27 23:00:09 INFO BlockManager: BlockManager stopped
22/07/27 23:00:09 INFO BlockManagerMaster: BlockManagerMaster stopped
22/07/27 23:00:09 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/07/27 23:00:09 INFO SparkContext: Successfully stopped SparkContext

But then I have to force quit (with Ctrl-C, for instance, when running locally) to actually end the job. This is very problematic with cloud services such as Google Dataproc Serverless, because the job keeps running and the instance is never stopped.
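One way to investigate a process that will not exit is to list the non-daemon threads still alive in the driver JVM, since a JVM only exits on its own once all non-daemon threads have finished. The sketch below is a diagnostic, not a fix, and it does not claim to identify the cause here. It relies on `spark.sparkContext._jvm`, a private PySpark attribute that may change between versions; the helper names `non_daemon_thread_names` and `jvm_threads` are hypothetical.

```python
def non_daemon_thread_names(threads):
    """Return the sorted names of threads that are alive and not daemons.

    `threads` is any iterable of objects exposing the java.lang.Thread
    methods getName(), isAlive() and isDaemon().
    """
    return sorted(
        t.getName() for t in threads if t.isAlive() and not t.isDaemon()
    )


def jvm_threads(spark):
    """Fetch the driver JVM's threads through the Py4J gateway.

    Assumption: `_jvm` is a private attribute of SparkContext.
    Thread.getAllStackTraces() is standard Java and returns a map
    whose keys are the Thread objects.
    """
    jvm = spark.sparkContext._jvm
    return list(jvm.java.lang.Thread.getAllStackTraces().keySet())
```

Calling `print(non_daemon_thread_names(jvm_threads(spark)))` once before the `.load()` line and once just before `spark.stop()` would show which non-daemon threads appeared in between.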

I tried every 10.0.x version (x = 0, 1, 2 and 3), but I always encounter the same behavior.

Is this expected behavior in version 10 that I am missing?

Thank you @Clovis_Masson for your post, we will look into this issue and reply back. How large is the collection you are loading?

I’ve tested it mainly with two collections: one very small with only 2 documents and another slightly larger with 20 000.

Here are the different jars I tested to reproduce it:

  • (reader) mongodb-spark-connector (versions 10.0.0, 10.0.1, 10.0.2 & 10.0.3)
  • mongodb-driver-core / mongodb-driver-sync / bson (version 4.7.0 & 4.7.1)
  • (writer) spark-bigquery-with-dependencies_2.12-0.23.2

Again, the process stops successfully when using version 3.0.x of the mongodb-spark-connector jar with the other mongodb-driver jars.
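To switch quickly between connector versions while testing, the dependencies can be resolved from Maven through `spark.jars.packages` instead of passing jars by hand. The sketch below assumes the Maven coordinates `org.mongodb.spark:mongo-spark-connector` for the 10.0.x line and `com.google.cloud.spark:spark-bigquery-with-dependencies_2.12` for the BigQuery writer. It does not cover 3.0.x, whose artifact name carries a Scala suffix, and the helper `packages_for` is hypothetical. Resolving packages this way also pulls in the MongoDB driver jars transitively, so it is not an exact match for the manual jar list above.

```python
def packages_for(connector_version, bigquery_version="0.23.2"):
    """Build a spark.jars.packages value for a 10.0.x connector version."""
    # Assumed coordinates; check them against Maven Central before relying on them.
    mongo = f"org.mongodb.spark:mongo-spark-connector:{connector_version}"
    bigquery = (
        "com.google.cloud.spark:"
        f"spark-bigquery-with-dependencies_2.12:{bigquery_version}"
    )
    # Spark expects a comma-separated list of Maven coordinates.
    return ",".join([mongo, bigquery])


if __name__ == "__main__":
    for version in ("10.0.0", "10.0.1", "10.0.2", "10.0.3"):
        print(packages_for(version))
```

The result can be passed as `.config("spark.jars.packages", packages_for("10.0.3"))` when building the session, or through `--packages` on the `spark-submit` command line.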

At first I suspected a new behavior related to the structured streaming support, but that does not seem to be the case.

Hi @Clovis_Masson, I filed https://jira.mongodb.org/browse/SPARK-358 to have our engineers look into this issue.
