Spark Connector Python Guide¶
Source Code
For the source code that contains the examples below, see introduction.py.
Prerequisites¶
- Basic working knowledge of MongoDB and Apache Spark. Refer to the MongoDB documentation and Spark documentation for more details.
- Running MongoDB instance (version 2.6 or later).
- Spark 2.0.x.
- Scala 2.11.x.
Getting Started¶
Python Spark Shell¶
This tutorial uses the pyspark shell, but the code works with self-contained Python applications as well.
When starting the pyspark shell, you can specify:
- the --packages option to download the MongoDB Spark Connector package. The following package is available: mongo-spark-connector_2.11 for use with Scala 2.11.x
- the --conf option to configure the MongoDB Spark Connector. These settings configure the SparkConf object.
Note
When specifying the Connector configuration via SparkConf, you must prefix the settings appropriately. For details and other available MongoDB Spark Connector options, see the Configuration Options.
The following example starts the pyspark shell from the command line.
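A sketch of such a command, assuming the Maven groupId org.mongodb.spark and connector version 2.0.0 (substitute the coordinates that match your Spark and Scala versions):

    ./bin/pyspark --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/test.myCollection?readPreference=primaryPreferred" \
                  --conf "spark.mongodb.output.uri=mongodb://127.0.0.1/test.myCollection" \
                  --packages org.mongodb.spark:mongo-spark-connector_2.11:2.0.0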
- The spark.mongodb.input.uri specifies the MongoDB server address (127.0.0.1), the database to connect (test), the collection (myCollection) from which to read data, and the read preference.
- The spark.mongodb.output.uri specifies the MongoDB server address (127.0.0.1), the database to connect (test), and the collection (myCollection) to which to write data. Connects to port 27017 by default.
- The packages option specifies the Spark Connector’s Maven coordinates, in the format groupId:artifactId:version.
The examples in this tutorial will use this database and collection.
Create a SparkSession Object¶
Note
When you start pyspark you get a SparkSession object called spark by default. In a standalone Python application, you need to create your SparkSession object explicitly, as shown below.
If you specified the spark.mongodb.input.uri and spark.mongodb.output.uri configuration options when you started pyspark, the default SparkSession object uses them.
If you’d rather create your own SparkSession object from within pyspark, you can use SparkSession.builder and specify different configuration options.
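For example, a minimal sketch of building such a session; the application name and the test.myCollection URIs are illustrative placeholders:

    from pyspark.sql import SparkSession

    # Build a SparkSession that points the connector at test.myCollection
    # for both reads and writes.
    my_spark = SparkSession \
        .builder \
        .appName("myApp") \
        .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/test.myCollection") \
        .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.myCollection") \
        .getOrCreate()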
You can use a SparkSession
object to write data to MongoDB, read
data from MongoDB, create DataFrames, and perform SQL operations.
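As a brief sketch of those operations, assuming the connector's DataFrame data source class (com.mongodb.spark.sql.DefaultSource) and the input/output URIs configured above; the sample documents are illustrative:

    # Write a small DataFrame to the collection named in spark.mongodb.output.uri.
    people = spark.createDataFrame([("Bilbo Baggins", 50), ("Gandalf", 1000)], ["name", "age"])
    people.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()

    # Read the collection named in spark.mongodb.input.uri back into a DataFrame.
    df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
    df.printSchema()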