Spark 3.3.0, MongoDB Atlas 5.0.9, MongoDB Spark connector tested with v3.x & v10.x
I run a basic ETL job using PySpark, reading data from various MongoDB collections and writing them to my sink (BigQuery).
Everything works fine with one or a few collections from my database, but as soon as I loop across all my collections (almost 200), I keep getting the same error after some time (and after some successful writes to BigQuery):
com.mongodb.MongoSocketOpenException: Exception opening socket
For information, I'm using only one `SparkSession` for this job, and I loop across all 200 collections, updating my `DataFrameReader`'s options (only the collection name) before loading the data in each iteration.
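To make the setup concrete, here is a minimal sketch of that loop. The option names (`connection.uri`, `database`, `collection`) follow the v10.x connector; the URI, database, collection list, BigQuery dataset, and function names are placeholders, not my actual job:

```python
def mongo_read_options(uri, database, collection):
    """Build the per-collection reader options; only the collection changes
    between iterations (hypothetical helper for illustration)."""
    return {
        "connection.uri": uri,
        "database": database,
        "collection": collection,
    }

def run_etl(spark, uri, database, collections, bq_dataset):
    """Read each MongoDB collection with the same SparkSession and
    append it to a BigQuery table of the same name."""
    for name in collections:
        df = (
            spark.read.format("mongodb")
            .options(**mongo_read_options(uri, database, name))
            .load()
        )
        (
            df.write.format("bigquery")
            .option("table", f"{bq_dataset}.{name}")
            .mode("append")
            .save()
        )
```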
I was thinking this might be a design issue, as I'm not sure whether I have to recreate the `SparkSession` or the `SparkContext` for every collection?
Or maybe it comes from the high frequency of connection attempts to the database, and I should slow the process down by introducing some manual delay between iterations.
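For illustration, one way to introduce such a delay would be a small retry wrapper with backoff around each read/write; the function name, retry count, and delay policy here are assumptions, not something the connector provides:

```python
import time

def with_retry(action, retries=3, base_delay=2.0):
    """Call `action()`; on failure, sleep with exponential backoff
    (base_delay, 2*base_delay, ...) and retry up to `retries` times."""
    for attempt in range(retries):
        try:
            return action()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts, surface the original error
            time.sleep(base_delay * (2 ** attempt))
```

Each iteration of the collection loop could then call `with_retry(lambda: process(name))` instead of calling the read/write directly.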
In your opinion, what would be the best way to read the data from all my collections and prevent this kind of error?