Important
In version 10.0.0 and later of the Connector, use the format mongodb to read from and write to MongoDB:
df = spark.read.format("mongodb").load()
Python Spark Shell
This tutorial uses the pyspark shell, but the code works with self-contained Python applications as well.
When starting the pyspark shell, you can specify:
- the --packages option to download the MongoDB Spark Connector package. The following package is available: mongo-spark-connector
- the --conf option to configure the MongoDB Spark Connector. These settings configure the SparkConf object.
Note
When specifying the Connector configuration via SparkConf, you must prefix the settings appropriately (for example, spark.mongodb.read.connection.uri for a read setting). For details and other available MongoDB Spark Connector options, see the Configuration Options.
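The prefixing rule in the note above can be sketched in a few lines of Python. This is only an illustration: the helper name conf_args is hypothetical, and the prefixes spark.mongodb.read. and spark.mongodb.write. are taken from the setting names used in this tutorial.

```python
import shlex

# Prefixes taken from the setting names used in this tutorial.
READ_PREFIX = "spark.mongodb.read."
WRITE_PREFIX = "spark.mongodb.write."


def conf_args(read_options, write_options):
    """Return --conf arguments with the read/write prefixes applied.

    Hypothetical helper for illustration; it is not part of the connector.
    """
    args = []
    for prefix, options in ((READ_PREFIX, read_options), (WRITE_PREFIX, write_options)):
        for key, value in options.items():
            args += ["--conf", f"{prefix}{key}={value}"]
    return args


args = conf_args(
    {"connection.uri": "mongodb://127.0.0.1/test.myCollection?readPreference=primaryPreferred"},
    {"connection.uri": "mongodb://127.0.0.1/test.myCollection"},
)

# Print a shell-quoted command line that could be pasted into a terminal.
print(shlex.join(["./bin/pyspark", *args]))
```

Keeping the unprefixed option names in two dictionaries makes it harder to pass a read setting under the write prefix by accident.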
The following example starts the pyspark shell from the command line:
./bin/pyspark --conf "spark.mongodb.read.connection.uri=mongodb://127.0.0.1/test.myCollection?readPreference=primaryPreferred" \
              --conf "spark.mongodb.write.connection.uri=mongodb://127.0.0.1/test.myCollection" \
              --packages org.mongodb.spark:mongo-spark-connector_2.12:10.1.1
- The spark.mongodb.read.connection.uri specifies the MongoDB server address (127.0.0.1), the database to connect to (test), the collection (myCollection) from which to read data, and the read preference.
- The spark.mongodb.write.connection.uri specifies the MongoDB server address (127.0.0.1), the database to connect to (test), and the collection (myCollection) to which to write data. The connection uses port 27017 by default.
- The --packages option specifies the Spark Connector's Maven coordinates, in the format groupId:artifactId:version.
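To make the parts of these values concrete, the following sketch splits a connection URI and a Maven coordinate string using only the Python standard library. The function names describe_uri and split_coordinates are hypothetical, and the parser assumes a single-host URI of the form used in this tutorial; it is not how the connector itself parses URIs.

```python
from urllib.parse import parse_qs, urlsplit

DEFAULT_PORT = 27017  # used when the URI does not name a port


def describe_uri(uri):
    """Split a single-host connection URI into its parts (illustrative only)."""
    parts = urlsplit(uri)
    # The path looks like "/database.collection"; split at the first dot.
    database, _, collection = parts.path.lstrip("/").partition(".")
    return {
        "host": parts.hostname,
        "port": parts.port or DEFAULT_PORT,
        "database": database,
        "collection": collection,
        "options": {key: values[0] for key, values in parse_qs(parts.query).items()},
    }


def split_coordinates(coordinates):
    """Split Maven coordinates of the form groupId:artifactId:version."""
    group_id, artifact_id, version = coordinates.split(":")
    return group_id, artifact_id, version


print(describe_uri("mongodb://127.0.0.1/test.myCollection?readPreference=primaryPreferred"))
print(split_coordinates("org.mongodb.spark:mongo-spark-connector_2.12:10.1.1"))
```

In the artifact ID, the _2.12 suffix follows the usual Scala convention of naming the Scala version the package was built for.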
The examples in this tutorial will use this database and collection.
Create a SparkSession Object
Note
When you start pyspark you get a SparkSession object called spark by default. In a standalone Python application, you need to create your SparkSession object explicitly, as shown below.
If you specified the spark.mongodb.read.connection.uri and spark.mongodb.write.connection.uri configuration options when you started pyspark, the default SparkSession object uses them.
If you'd rather create your own SparkSession object from within pyspark, you can use SparkSession.builder and specify different configuration options.
from pyspark.sql import SparkSession

my_spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.mongodb.read.connection.uri", "mongodb://127.0.0.1/test.coll") \
    .config("spark.mongodb.write.connection.uri", "mongodb://127.0.0.1/test.coll") \
    .getOrCreate()
You can use a SparkSession object to write data to MongoDB, read data from MongoDB, create DataFrames, and perform SQL operations.
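As a rough sketch of a write followed by a read, the function below takes an existing SparkSession, creates a small DataFrame, appends it to the collection named in the write URI, and loads the collection named in the read URI. The function name write_then_read, the column names, and the sample row are assumptions made for illustration. It also assumes the session was configured with both connection URIs and that a MongoDB server is reachable.

```python
def write_then_read(spark, rows):
    """Append rows to MongoDB, then load the configured collection.

    `spark` is an existing SparkSession configured with the
    spark.mongodb.read.connection.uri and
    spark.mongodb.write.connection.uri settings.
    `rows` is a list of (name, age) tuples; the column names are
    illustrative only.
    """
    people = spark.createDataFrame(rows, ["name", "age"])

    # Write to the collection named in the write connection URI.
    people.write.format("mongodb").mode("append").save()

    # Read from the collection named in the read connection URI.
    return spark.read.format("mongodb").load()


# Example call from the pyspark shell, where `spark` already exists:
#   df = write_then_read(spark, [("Bilbo Baggins", 50)])
#   df.printSchema()
```

Passing the session in as an argument lets the same function work with the default spark object in the shell or with a session built in a standalone application.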