Important

In version 10.0.0 and later of the Connector, use the format mongodb to read from and write to MongoDB:

df = spark.read.format("mongodb").load()

Python Spark Shell

This tutorial uses the pyspark shell, but the code works with self-contained Python applications as well.

When starting the pyspark shell, you can specify:

- the --packages option to download the MongoDB Spark Connector package. The following package is available:
  - mongo-spark-connector
- the --conf option to configure the MongoDB Spark Connector. These settings configure the SparkConf object.
Note
If you use SparkConf to configure the Spark Connector, you must prefix each setting with spark.mongodb.read. or spark.mongodb.write., as in spark.mongodb.read.connection.uri. For details and other available MongoDB Spark Connector options, see the Configuring Spark guide.
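
As a small illustration of the prefixing rule, the sketch below turns unprefixed option names into the keys you would pass with --conf. The helper name prefixed_conf is hypothetical and not part of the connector. The two prefixes are the ones that appear in the settings used in this tutorial.

```python
# Hypothetical helper (not part of the connector): prepend the read or write
# prefix used in this tutorial to unprefixed option names.
PREFIXES = {"read": "spark.mongodb.read.", "write": "spark.mongodb.write."}


def prefixed_conf(mode, options):
    """Return a dict whose keys carry the SparkConf prefix for `mode`."""
    prefix = PREFIXES[mode]  # raises KeyError for anything but "read"/"write"
    return {prefix + name: value for name, value in options.items()}


read_conf = prefixed_conf(
    "read", {"connection.uri": "mongodb://127.0.0.1/test.myCollection"}
)
for key, value in read_conf.items():
    # Print each setting in the form expected by the --conf option.
    print(f'--conf "{key}={value}"')
```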

The following example starts the pyspark shell from the command line:

./bin/pyspark --conf "spark.mongodb.read.connection.uri=mongodb://127.0.0.1/test.myCollection?readPreference=primaryPreferred" \
              --conf "spark.mongodb.write.connection.uri=mongodb://127.0.0.1/test.myCollection" \
              --packages org.mongodb.spark:mongo-spark-connector_2.12:10.2.1

- The spark.mongodb.read.connection.uri setting specifies the MongoDB server address (127.0.0.1), the database to connect to (test), the collection (myCollection) from which to read data, and the read preference (primaryPreferred).
- The spark.mongodb.write.connection.uri setting specifies the MongoDB server address (127.0.0.1), the database to connect to (test), and the collection (myCollection) to which to write data. Because the URI names no port, the connection uses port 27017 by default.
- The --packages option specifies the Spark Connector's Maven coordinates, in the format groupId:artifactId:version.
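
To make the parts of the connection URI concrete, here is a minimal sketch that splits the read URI from the example above using only the Python standard library. The function name describe_uri is hypothetical. The sketch assumes a single-host URI of the form mongodb://host[:port]/database.collection?options, so it does not cover every URI that MongoDB accepts.

```python
from urllib.parse import urlsplit, parse_qs

DEFAULT_PORT = 27017  # port used when the URI does not name one


def describe_uri(uri):
    """Split a single-host mongodb:// URI into its named parts."""
    parts = urlsplit(uri)
    # The path looks like "/test.myCollection": database, then collection.
    database, _, collection = parts.path.lstrip("/").partition(".")
    # Keep the first value of each query option, such as readPreference.
    options = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {
        "host": parts.hostname,
        "port": parts.port or DEFAULT_PORT,
        "database": database,
        "collection": collection,
        "options": options,
    }


info = describe_uri(
    "mongodb://127.0.0.1/test.myCollection?readPreference=primaryPreferred"
)
print(info)
```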

The examples in this tutorial will use this database and collection.

Create a `SparkSession` Object

Note

When you start pyspark, you get a SparkSession object called spark by default. In a standalone Python application, you need to create your SparkSession object explicitly, as shown below.

If you specified the spark.mongodb.read.connection.uri and spark.mongodb.write.connection.uri configuration options when you started pyspark, the default SparkSession object uses them. If you'd rather create your own SparkSession object from within pyspark, you can use SparkSession.builder and specify different configuration options.

from pyspark.sql import SparkSession
my_spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.mongodb.read.connection.uri", "mongodb://127.0.0.1/test.coll") \
    .config("spark.mongodb.write.connection.uri", "mongodb://127.0.0.1/test.coll") \
    .getOrCreate()

You can use a SparkSession object to write data to MongoDB, read data from MongoDB, create DataFrames, and perform SQL operations.
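
The following is a minimal sketch of those operations, assuming a SparkSession configured with the read and write connection URIs shown earlier and a MongoDB server reachable at that address. The function name write_then_read and the sample rows are hypothetical and chosen only for illustration.

```python
def write_then_read(spark, rows, columns):
    """Write rows with the "mongodb" format, then load the collection back."""
    # Build a DataFrame from plain Python tuples and column names.
    df = spark.createDataFrame(rows, columns)
    # Append to the collection named in spark.mongodb.write.connection.uri.
    df.write.format("mongodb").mode("append").save()
    # Load from the collection named in spark.mongodb.read.connection.uri.
    return spark.read.format("mongodb").load()


# Usage in the pyspark shell, where `spark` already exists
# (requires a running MongoDB server; sample data is illustrative):
# people = write_then_read(
#     spark, [("Bilbo Baggins", 50), ("Gandalf", 1000)], ["name", "age"]
# )
# people.createOrReplaceTempView("people")
# spark.sql("SELECT name FROM people WHERE age >= 100").show()
```

Passing the session in as an argument keeps the sketch usable both in the pyspark shell, with the default spark object, and in a standalone application, with a session built through SparkSession.builder.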
