Environment: Spark 3.3.0, MongoDB Atlas 5.0.9, MongoDB Spark connector 10.x
I run a small PySpark job that reads from MongoDB Atlas and writes to BigQuery.
With the MongoDB Spark connector v3.0.x, I never encountered any errors: the job ended normally after loading the MongoDB documents and saving them to BigQuery.
A few days ago, after upgrading to the newest connector version (10.0.x), I started seeing some strange behavior: the job keeps running even after all tasks have finished successfully.
Here is the problematic line (meaning that if I comment out just this one line, the whole job ends correctly):
df = spark.read.format("mongodb").options(database="database", collection="collection").load()
Actually, it's precisely the .load() part of this line that seems to be the issue; the rest of the line causes no problem on its own.
Since the upgrade, my last logs always look like this:
22/07/27 23:00:09 INFO SparkUI: Stopped Spark web UI at http://192.168.1.173:4040
22/07/27 23:00:09 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/07/27 23:00:09 INFO MemoryStore: MemoryStore cleared
22/07/27 23:00:09 INFO BlockManager: BlockManager stopped
22/07/27 23:00:09 INFO BlockManagerMaster: BlockManagerMaster stopped
22/07/27 23:00:09 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/07/27 23:00:09 INFO SparkContext: Successfully stopped SparkContext
But then I have to force-quit (with Ctrl-C, for instance, when running locally) to actually end the process. This is very problematic when using cloud services such as Google Dataproc Serverless, because the job keeps running and the instance is therefore never shut down.
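My guess is that something started by .load() leaves a non-daemon thread alive in the driver process, which would keep the process from exiting even after SparkContext is stopped. As a hedged illustration of that mechanism only (plain Python threads, not the connector's actual internals), a single lingering non-daemon thread is enough to keep a process alive after the main work is done:

```python
import threading
import time

def worker(stop_event):
    # Simulates a background housekeeping thread (e.g. a client's
    # heartbeat/monitor loop) that runs until explicitly told to stop.
    while not stop_event.is_set():
        time.sleep(0.05)

stop = threading.Event()
# daemon=False: the interpreter will NOT exit while this thread runs,
# which is exactly the kind of hang I seem to be observing.
t = threading.Thread(target=worker, args=(stop,), daemon=False)
t.start()

# ... main job finishes here; without the explicit shutdown below,
# the process would never terminate on its own.
stop.set()
t.join()
```

In my job there is no such handle to signal, so the process just hangs after "Successfully stopped SparkContext".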
I tried every 10.0.x version (x = 0, 1, 2 and 3), but I always encounter the same behavior.
Is this behavior expected in version 10, and am I missing something?