PySpark MongoDB Connector

I am trying to write a basic PySpark script to connect to MongoDB. I am using Spark 3.1.2 and MongoDB driver 3.2.2.

My code is:
from pyspark.sql import SparkSession

# Create a SparkSession with the MongoDB connector configuration.
# (Build the session once: a second getOrCreate() call would return this
# existing session and silently ignore any new .config() settings.)
spark = SparkSession \
    .builder \
    .appName("SparkSQL") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/client.coll") \
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.coll") \
    .getOrCreate()

df = spark.read.format("mongo").load()

When I execute this in the PySpark shell with /usr/local/spark/bin/pyspark --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1, I get:

java.lang.NoClassDefFoundError: org/bson/conversions/Bson

I am very new to Spark. Could someone please help me understand how to resolve the missing Bson reference? I couldn't find this covered in the sample code or the MongoDB PySpark documentation.

Thanks in advance,

Ben.

Looks like you don’t have all the dependencies installed for the MongoDB Spark Connector.
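As a rough sketch (untested, and assuming the versions from your post), you can also pin the connector via spark.jars.packages on the session builder, so that Spark's Ivy resolver pulls the connector together with its transitive dependencies, including the org.mongodb:bson artifact that provides org.bson.conversions.Bson:

from pyspark.sql import SparkSession

# Sketch only: pinning the connector in-code lets a standalone script be
# launched with plain spark-submit. Ivy resolves transitive dependencies,
# including the bson artifact behind the NoClassDefFoundError above.
spark = SparkSession \
    .builder \
    .appName("SparkSQL") \
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/client.coll") \
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.coll") \
    .getOrCreate()

df = spark.read.format("mongo").load()
df.printSchema()

Note that spark.jars.packages only takes effect if it is set before the session starts, so this applies to spark-submit runs; in the pyspark shell, the --packages flag you already used is the right mechanism.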

I do have a Docker environment that will spin up Spark, MongoDB, and a Jupyter notebook. It will get you up and running quickly.
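Once the notebook is up, a quick smoke test along these lines (the URI and the test.coll namespace are only examples) should confirm the connector and its dependencies are wired up end to end:

# Hypothetical smoke test, assuming a SparkSession `spark` started with the
# connector on the classpath; mongodb://127.0.0.1/test.coll is an example URI.
people = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])

# Write the DataFrame out to MongoDB...
people.write.format("mongo") \
    .mode("append") \
    .option("uri", "mongodb://127.0.0.1/test.coll") \
    .save()

# ...and read the same collection back to confirm the round trip.
spark.read.format("mongo") \
    .option("uri", "mongodb://127.0.0.1/test.coll") \
    .load() \
    .show()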

Hi Robert, thank you for your reply. My apologies for not getting back to you earlier; I had forgotten about this post.

Thanks for the link to your Docker image; I'll take a look. Do you have any instructions on how to set up all the dependencies? I have been through the MongoDB Spark documentation and couldn't find a workable solution.

Thanks in advance,

Ben.