Spark 3.3.0, MongoDB Atlas 5.0.9, MongoDB Spark connector tested with v3.x & v10.x
I run a basic ETL job using PySpark, reading data from various MongoDB collections and writing them to my sink (BigQuery).
Everything works fine with one or a few collections from my database, but as soon as I loop across all my collections (almost 200), I keep getting the same error after some time (and after some successful writes to BigQuery):
com.mongodb.MongoSocketOpenException: Exception opening socket
For information, I'm using only one `SparkSession` for this job, and I loop across all 200 collections, updating my `DataFrameReader`'s options (only the collection name) before loading the data in each iteration.
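To make the setup concrete, here is a minimal sketch of that loop. The option names (`connection.uri`, `database`, `collection`) follow the v10.x connector; the URI, database, collection list, BigQuery dataset, and function names are placeholders, not my actual job:

```python
def mongo_read_options(uri, database, collection):
    """Build the per-collection reader options; only the collection changes
    between iterations (hypothetical helper for illustration)."""
    return {
        "connection.uri": uri,
        "database": database,
        "collection": collection,
    }

def run_etl(spark, uri, database, collections, bq_dataset):
    """Read each MongoDB collection with the same SparkSession and
    append it to a BigQuery table of the same name."""
    for name in collections:
        df = (
            spark.read.format("mongodb")
            .options(**mongo_read_options(uri, database, name))
            .load()
        )
        (
            df.write.format("bigquery")
            .option("table", f"{bq_dataset}.{name}")
            .mode("append")
            .save()
        )
```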
I was thinking this might be a design issue, as I'm not sure whether I have to recreate the `SparkSession` or the `SparkContext` for every collection?
Or maybe it comes from the high frequency of connection attempts to the database, and I should slow the process down by introducing some manual delay between iterations.
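For illustration, one way to introduce such a delay would be a small retry wrapper with backoff around each read/write; the function name, retry count, and delay policy here are assumptions, not something the connector provides:

```python
import time

def with_retry(action, retries=3, base_delay=2.0):
    """Call `action()`; on failure, sleep with exponential backoff
    (base_delay, 2*base_delay, ...) and retry up to `retries` times."""
    for attempt in range(retries):
        try:
            return action()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts, surface the original error
            time.sleep(base_delay * (2 ** attempt))
```

Each iteration of the collection loop could then call `with_retry(lambda: process(name))` instead of calling the read/write directly.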
In your opinion, what would be the best way to read the data from all my collections and prevent this kind of error?