PySpark spark-submit unable to read from MongoDB Atlas serverless (can read from free tier)

I've been using Apache Spark (PySpark) to read from MongoDB Atlas. I have a shared (free) cluster, which has a 512 MB storage limit. I'm trying to migrate to serverless, but I'm somehow unable to connect to the serverless instance. The error:

pyspark.sql.utils.IllegalArgumentException: requirement failed: Invalid uri: 'mongodb+srv://vani:<password>@versa-serverless.w9yss.mongodb.net/versa?retryWrites=true&w=majority'

Please note: I'm able to connect to the instance using pymongo, but not using PySpark.

Here is the PySpark code (not working):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MongoDB operations").getOrCreate()
print(" spark ", spark)


# cluster0 is the free (shared) cluster, and I'm able to connect to it
# mongoConnUri = "mongodb+srv://vani:password@cluster0.w9yss.mongodb.net/?retryWrites=true&w=majority"

mongoConnUri = "mongodb+srv://vani:password@versa-serverless.w9yss.mongodb.net/?retryWrites=true&w=majority"

mongoDB = "versa"
collection = "name_map_unique_ip"


df = spark.read\
     .format("mongo") \
     .option("uri", mongoConnUri) \
     .option("database", mongoDB) \
     .option("collection", collection) \
     .load()

Error:

22/07/26 12:25:36 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir.
22/07/26 12:25:36 INFO SharedState: Warehouse path is 'file:/Users/karanalang/PycharmProjects/Versa-composer-mongo/composer_dags/spark-warehouse'.
 spark  <pyspark.sql.session.SparkSession object at 0x7fa1d8b9d5e0>
Traceback (most recent call last):
  File "/Users/karanalang/PycharmProjects/Kafka/python_mongo/StructuredStream_readFromMongoServerless.py", line 30, in <module>
    df = spark.read\
  File "/Users/karanalang/Documents/Technology/spark-3.2.0-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 164, in load
  File "/Users/karanalang/Documents/Technology/spark-3.2.0-bin-hadoop3.2/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py", line 1309, in __call__
  File "/Users/karanalang/Documents/Technology/spark-3.2.0-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
pyspark.sql.utils.IllegalArgumentException: requirement failed: Invalid uri: 'mongodb+srv://vani:password@versa-serverless.w9yss.mongodb.net/?retryWrites=true&w=majority'
22/07/26 12:25:36 INFO SparkContext: Invoking stop() from shutdown hook
22/07/26 12:25:36 INFO SparkUI: Stopped Spark web UI at http://10.42.28.205:4040

pymongo code (I am able to connect using the same URI):

from pymongo import MongoClient

client = MongoClient("mongodb+srv://vani:password@versa-serverless.w9yss.mongodb.net/vani?retryWrites=true&w=majority")
print(client)

all_dbs = client.list_database_names()
print(f"all_dbs : {all_dbs}")

spark-submit command:

spark-submit --packages org.mongodb.spark:mongo-spark-connector:10.0.2 ~/PycharmProjects/Kafka/python_mongo/StructuredStream_readFromMongoServerless.py

Any ideas on how to debug/fix this?

Thanks in advance!

This is working code against an Atlas Serverless instance, using version 10.x of the Spark Connector. Note that the configuration parameter names are different from those in the 3.x connector as well (e.g. spark.mongodb.read.connection.uri / spark.mongodb.write.connection.uri instead of spark.mongodb.input.uri / spark.mongodb.output.uri).

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("pyspark-notebook2") \
    .config("spark.executor.memory", "1g") \
    .config("spark.mongodb.read.connection.uri", "mongodb+srv://sparkuser:xxxxxx@mysparktest.lkyil.mongodb.net/?retryWrites=true&w=majority") \
    .config("spark.mongodb.write.connection.uri", "mongodb+srv://sparkuser:xxxxx@mysparktest.lkyil.mongodb.net/?retryWrites=true&w=majority") \
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector:10.0.3") \
    .getOrCreate()

df = spark.read.format("mongodb") \
    .option("database", "MyDB") \
    .option("collection", "MyCollection") \
    .load()

df.show()
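
For completeness, writing back to Atlas through the 10.x connector follows the same pattern. Here is a minimal sketch (the database and collection names are placeholders) that reuses the spark.mongodb.write.connection.uri configured on the session above:

# Append the DataFrame via the 10.x connector ("mongodb" format).
# The connection string comes from spark.mongodb.write.connection.uri;
# "MyDB" and "MyOutputCollection" are placeholder names.
df.write.format("mongodb") \
    .mode("append") \
    .option("database", "MyDB") \
    .option("collection", "MyOutputCollection") \
    .save()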

You can use both versions of the connector side by side: the 3.x version registers the short name “mongo”, while the 10.x version uses “mongodb”.
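
As a minimal sketch of what that side-by-side usage looks like (placeholder URIs and names, and assuming both connector jars are on the classpath):

# 3.x connector: registered as "mongo", takes a "uri" option
# (or spark.mongodb.input.uri in the session config); the database
# and collection can ride along in the URI path as db.collection.
df_v3 = spark.read.format("mongo") \
    .option("uri", "mongodb+srv://user:password@cluster0.xxxxx.mongodb.net/MyDB.MyCollection") \
    .load()

# 10.x connector: registered as "mongodb", takes "connection.uri"
# (or spark.mongodb.read.connection.uri in the session config),
# with database and collection passed as separate options.
df_v10 = spark.read.format("mongodb") \
    .option("connection.uri", "mongodb+srv://user:password@cluster0.xxxxx.mongodb.net/") \
    .option("database", "MyDB") \
    .option("collection", "MyCollection") \
    .load()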
