PySpark MongoDB Connector

I am trying to write a basic PySpark script to connect to MongoDB. I am using Spark 3.1.2 and the MongoDB driver 3.2.2.

My code is:
from pyspark.sql import SparkSession

# Create a SparkSession with the MongoDB input/output URIs configured
spark = SparkSession \
    .builder \
    .appName("SparkSQL") \
    .config("spark.mongodb.input.uri", "mongodb://") \
    .config("spark.mongodb.output.uri", "mongodb://") \
    .getOrCreate()

df = spark.read.format("mongo").load()

When I execute this in PySpark with /usr/local/spark/bin/pyspark --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1, I get:

java.lang.NoClassDefFoundError: org/bson/conversions/Bson

I am very new to Spark. Could someone please help me understand how to install the missing Bson reference? I couldn’t see this in the sample code or MongoDB PySpark documentation.

Thanks in advance,


Looks like you don’t have all the dependencies installed for the MongoDB Spark Connector.

I do have a Docker environment that will spin up Spark, MongoDB and a Jupyter notebook. This will get you up and running quickly.

Hi Robert, thank you for your reply. My apologies for not getting back to you earlier, I had forgotten about this post.

Thanks for the link to your Docker image, I'll take a look. Do you have any instructions on how to set up all the dependencies? I have been through the MongoDB Spark documentation and couldn't find a workable solution.

Thanks in advance,


Were you able to resolve this issue?
I am also facing the same problem and haven't found a suitable solution yet.
Saswata Dutta

Hi Saswata,

I don’t remember exactly what the solution was, but I think it might have been an issue with my environment. I would try a clean installation if you can. If you are still having issues, contact me and I'll share some PySpark code with a MongoDB connection, along with the commands I use to submit it to the cluster.

Kind regards,


Hi Ben,
I am using an AWS EMR instance where I installed MongoDB 6.
I am using Spark 3+, with the MongoDB Spark connectors as provided by MongoDB.
I have tried all the different options available in the documentation, but no luck.
I am trying to connect from a notebook.
Can you please help?

Hi Saswata,

I’m not familiar with AWS EMR, so I'm probably not much help to you. The only thing I can think of is that when I submit a job to the cluster, I have to specify which packages to load. For example, this is the command I execute:
spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1 --driver-memory 6G --master spark:// ./

Is it possible that when you execute the notebook, it isn't including the MongoDB packages? Are you able to validate your solution outside of AWS (i.e. a locally installed cluster and MongoDB instance)?



Dear nawaz_nawaz, your post looks a lot like ChatGPT-generated text.

Could you please clarify the relevance of your answer?

It is clear from the previous posts on this thread that the people involved know what PySpark is.

Hi Nawaz, using ChatGPT to reply to legitimate questions is inappropriate IMO. I have flagged this response to admins.