Use your local SparkSession's read method to create a DataFrame representing a collection.
Note

A DataFrame is represented by a Dataset of Rows. It is an alias of Dataset[Row].
The following example loads the collection specified in the SparkConf:
val df = spark.read.format("mongodb").load() // Uses the SparkConf for configuration
To specify a different collection, database, and other read configuration settings, use the option method:
val df = spark.read.format("mongodb")
  .option("database", "<example-database>")
  .option("collection", "<example-collection>")
  .load()
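You can also override the connection itself at read time rather than relying on the SparkConf. A minimal sketch, assuming the v10.x "mongodb" data source and its connection.uri read option; the URI below is a placeholder, not a real deployment:

```scala
// Sketch: overriding the connection at read time. The URI, database,
// and collection names here are placeholders for illustration only.
val df = spark.read
  .format("mongodb")
  .option("connection.uri", "mongodb://localhost:27017")
  .option("database", "<example-database>")
  .option("collection", "<example-collection>")
  .load()
```

Options passed this way take precedence over the corresponding SparkConf settings for this read.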
Schema Inference
When you load a Dataset or DataFrame without a schema, Spark samples the records to infer the schema of the collection.
Consider a collection named characters:
{ "_id" : ObjectId("585024d558bef808ed84fc3e"), "name" : "Bilbo Baggins", "age" : 50 }
{ "_id" : ObjectId("585024d558bef808ed84fc3f"), "name" : "Gandalf", "age" : 1000 }
{ "_id" : ObjectId("585024d558bef808ed84fc40"), "name" : "Thorin", "age" : 195 }
{ "_id" : ObjectId("585024d558bef808ed84fc41"), "name" : "Balin", "age" : 178 }
{ "_id" : ObjectId("585024d558bef808ed84fc42"), "name" : "Kíli", "age" : 77 }
{ "_id" : ObjectId("585024d558bef808ed84fc43"), "name" : "Dwalin", "age" : 169 }
{ "_id" : ObjectId("585024d558bef808ed84fc44"), "name" : "Óin", "age" : 167 }
{ "_id" : ObjectId("585024d558bef808ed84fc45"), "name" : "Glóin", "age" : 158 }
{ "_id" : ObjectId("585024d558bef808ed84fc46"), "name" : "Fíli", "age" : 82 }
{ "_id" : ObjectId("585024d558bef808ed84fc47"), "name" : "Bombur" }
The following operation loads data from the MongoDB collection specified in the SparkConf and infers the schema:
val df = spark.read.format("mongodb").load() // Uses the SparkConf for configuration
df.printSchema() // Prints the inferred schema
df.printSchema() outputs the following schema to the console:
root
 |-- _id: struct (nullable = true)
 |    |-- oid: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)
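Because inference samples the collection, it costs an extra pass and can miss fields absent from the sample. You can skip sampling entirely by supplying a schema yourself. A minimal sketch using Spark's standard StructType API, with a field layout mirroring the characters collection above:

```scala
import org.apache.spark.sql.types._

// Explicit schema matching the characters collection. Providing it to
// the reader avoids the sampling pass that schema inference requires.
val characterSchema = StructType(Seq(
  StructField("_id", StructType(Seq(
    StructField("oid", StringType, nullable = true)
  )), nullable = true),
  StructField("age", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true)
))

val df = spark.read
  .format("mongodb")
  .schema(characterSchema)
  .load()
```

Documents missing a declared field, such as Bombur's absent age, simply yield null in that column.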