PySpark MongoDB Connector

I am trying to write a basic PySpark script to connect to MongoDB. I am using Spark 3.1.2 and MongoDB driver 3.2.2.

My code is:
from pyspark.sql import SparkSession

# Create a SparkSession with the MongoDB connector configuration.
# (Build the session once: a second getOrCreate() call would return this
# existing session and silently ignore any new .config() settings.)
spark = SparkSession \
    .builder \
    .appName("SparkSQL") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/client.coll") \
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.coll") \
    .getOrCreate()

df = spark.read.format("mongo").load()

When I execute this in the PySpark shell with /usr/local/spark/bin/pyspark --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1, I get:

java.lang.NoClassDefFoundError: org/bson/conversions/Bson

I am very new to Spark. Could someone please help me understand how to resolve the missing Bson reference? I couldn't find this covered in the sample code or the MongoDB PySpark documentation.

Thanks in advance,

Ben.

Looks like you don’t have all the dependencies installed for the MongoDB Spark Connector.
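As a rough sketch (untested, and assuming the versions from your post), you can also pin the connector via spark.jars.packages on the session builder, so that Spark's Ivy resolver pulls the connector together with its transitive dependencies, including the org.mongodb:bson artifact that provides org.bson.conversions.Bson:

from pyspark.sql import SparkSession

# Sketch only: pinning the connector in-code lets a standalone script be
# launched with plain spark-submit. Ivy resolves transitive dependencies,
# including the bson artifact behind the NoClassDefFoundError above.
spark = SparkSession \
    .builder \
    .appName("SparkSQL") \
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/client.coll") \
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.coll") \
    .getOrCreate()

df = spark.read.format("mongo").load()
df.printSchema()

Note that spark.jars.packages only takes effect if it is set before the session starts, so this applies to spark-submit runs; in the pyspark shell, the --packages flag you already used is the right mechanism.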

I do have a Docker environment that will spin up Spark, MongoDB, and a Jupyter notebook. It will get you up and running quickly.
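Once the notebook is up, a quick smoke test along these lines (the URI and the test.coll namespace are only examples) should confirm the connector and its dependencies are wired up end to end:

# Hypothetical smoke test, assuming a SparkSession `spark` started with the
# connector on the classpath; mongodb://127.0.0.1/test.coll is an example URI.
people = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])

# Write the DataFrame out to MongoDB...
people.write.format("mongo") \
    .mode("append") \
    .option("uri", "mongodb://127.0.0.1/test.coll") \
    .save()

# ...and read the same collection back to confirm the round trip.
spark.read.format("mongo") \
    .option("uri", "mongodb://127.0.0.1/test.coll") \
    .load() \
    .show()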

Hi Robert, thank you for your reply. My apologies for not getting back to you earlier; I had forgotten about this post.

Thanks for the link to your Docker image; I'll take a look. Do you have any instructions on how to set up all the dependencies? I have been through the MongoDB Spark documentation and couldn't find a workable solution.

Thanks in advance,

Ben.