Spark Connector R Guide
For the source code that contains the examples below, see introduction.R.
Prerequisites
- Basic working knowledge of MongoDB and Apache Spark. Refer to the MongoDB documentation and Spark documentation for more details.
- Running MongoDB instance (version 2.6 or later).
- Spark 2.4.x.
- Scala 2.12.x.
Getting Started
sparkR Shell
This tutorial uses the sparkR shell, but the code examples work just as well with self-contained R applications.
When starting the sparkR shell, you can specify:
- the --packages option to download the MongoDB Spark Connector package. The following package is available: mongo-spark-connector_2.12 for use with Scala 2.12.x
- the --conf option to configure the MongoDB Spark Connector. These settings configure the SparkConf object.
Note: When specifying the Connector configuration via SparkConf, you must prefix the settings appropriately. For details and other available MongoDB Spark Connector options, see the Configuration Options.
For example,
./bin/sparkR --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/test.myCollection?readPreference=primaryPreferred" \
             --conf "spark.mongodb.output.uri=mongodb://127.0.0.1/test.myCollection" \
             --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1
- The spark.mongodb.input.uri specifies the MongoDB server address (127.0.0.1), the database to connect to (test), the collection (myCollection) from which to read data, and the read preference.
- The spark.mongodb.output.uri specifies the MongoDB server address (127.0.0.1), the database to connect to (test), and the collection (myCollection) to which to write data. Connects to port 27017 by default.
- The --packages option specifies the Spark Connector's Maven coordinates, in the format groupId:artifactId:version.
Create a SparkSession Object
When you start sparkR you get a SparkSession object called spark by default. In a standalone R application, you need to create your SparkSession object explicitly, as shown below.
If you specified the spark.mongodb.input.uri and spark.mongodb.output.uri configuration options when you started sparkR, the default SparkSession object uses them.
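As a quick sanity check, you can read runtime configuration back from the active session with SparkR's sparkR.conf() helper. The sketch below assumes the shell was started with the --conf options shown earlier.

sparkR.conf("spark.mongodb.input.uri")   # returns the configured input URI
sparkR.conf("spark.mongodb.output.uri")  # returns the configured output URI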
If you'd rather create your own SparkSession object from within sparkR, you can use sparkR.session() and specify different configuration options.
my_spark <- sparkR.session(
  master = "local[*]",
  sparkConfig = list(),
  appName = "my_app"
)
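If you did not pass the connection URIs on the command line, one possible approach (a sketch, not the only option) is to supply them through sparkConfig and pull in the connector with the sparkPackages argument; the URI values below are illustrative.

my_spark <- sparkR.session(
  master = "local[*]",
  appName = "my_app",
  # Assumed, illustrative values; adjust to your deployment.
  sparkPackages = "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1",
  sparkConfig = list(
    "spark.mongodb.input.uri"  = "mongodb://127.0.0.1/test.myCollection",
    "spark.mongodb.output.uri" = "mongodb://127.0.0.1/test.myCollection"
  )
)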
You can use a SparkSession object to write data to MongoDB, read data from MongoDB, create DataFrames, and perform SQL operations.