
Read Configuration Options

On this page

  • Read Configuration
  • Change Streams
  • connection.uri Configuration Setting

You can configure the following properties to read from MongoDB:

Note

If you use SparkConf to set the connector's read configurations, prefix spark.mongodb.read. to each property.
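
For example, a minimal sketch of setting a few of these properties through the SparkSession builder (which populates SparkConf) might look like the following. It assumes the connector is available on the classpath, and the database and collection names are placeholders.

from pyspark.sql import SparkSession

# Build a SparkSession whose SparkConf carries the connector's read settings;
# each property from the table below is prefixed with "spark.mongodb.read.".
spark = (
    SparkSession.builder
    .appName("mongo-read-example")
    .config("spark.mongodb.read.connection.uri", "mongodb://127.0.0.1:27017/")
    .config("spark.mongodb.read.database", "exampleDb")              # placeholder database
    .config("spark.mongodb.read.collection", "exampleCollection")    # placeholder collection
    .getOrCreate()
)

# Load the collection as a DataFrame; the connector infers the schema by
# sampling documents (see sampleSize below).
df = spark.read.format("mongodb").load()
df.printSchema()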

Property name
Description
mongoClientFactory
MongoClientFactory configuration key.
You can specify a custom implementation which must implement the com.mongodb.spark.sql.connector.connection.MongoClientFactory interface.

Default: com.mongodb.spark.sql.connector.connection.DefaultMongoClientFactory
connection.uri
Required.
The connection string configuration key.

Default: mongodb://localhost:27017/
database
Required.
The database name configuration.
collection
Required.
The collection name configuration.
partitioner
The partitioner full class name.
You can specify a custom implementation which must implement the com.mongodb.spark.sql.connector.read.partitioner.Partitioner interface.
See the Partitioner Configuration section for more information on partitioners.

Default: com.mongodb.spark.sql.connector.read.partitioner.SamplePartitioner
partitioner.options.
Partitioner configuration prefix.
See the Partitioner Configuration section for more information on partitioners.
sampleSize
The number of documents to sample from the collection when inferring
the schema.

Default: 1000
sql.inferSchema.mapTypes.enabled
Whether to enable Map types when inferring the schema.
When enabled, large compatible struct types are inferred to a MapType instead.

Default: true
sql.inferSchema.mapTypes.minimum.key.size
Minimum size of a StructType before inferring as a MapType.

Default: 250
aggregation.pipeline
Specifies a custom aggregation pipeline to apply to the collection before sending data to Spark.
The value must be either an extended JSON single document or a list of documents. For an example of supplying a pipeline when loading a DataFrame, see the sketch after this table.
A single document should resemble the following:
{"$match": {"closed": false}}
A list of documents should resemble the following:
[{"$match": {"closed": false}}, {"$project": {"status": 1, "name": 1, "description": 1}}]

Important

Custom aggregation pipelines must be compatible with the partitioner strategy. For example, aggregation stages such as $group do not work with any partitioner that creates more than one partition.

aggregation.allowDiskUse
Specifies whether to allow storage to disk when running the aggregation.

Default: true
outputExtendedJson
When true, the connector converts BSON types not supported by Spark into extended JSON strings. When false, the connector uses the original relaxed JSON format for unsupported types.

Default: false
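
The following sketch pushes a custom aggregation pipeline down to MongoDB before the data reaches Spark. It assumes the session from the earlier sketch (with connection.uri, database, and collection already configured) is active, that the property names from this table are accepted as DataFrameReader options without the spark.mongodb.read. prefix, and that "closed", "status", "name", and "description" are placeholder field names.

from pyspark.sql import SparkSession

# getOrCreate() returns the already-configured session from the earlier sketch.
spark = SparkSession.builder.getOrCreate()

# Extended JSON list of pipeline stages, matching the format described above.
pipeline = """[
  {"$match": {"closed": false}},
  {"$project": {"status": 1, "name": 1, "description": 1}}
]"""

df = (
    spark.read.format("mongodb")
    .option("aggregation.pipeline", pipeline)  # assumed short-form option key
    .load()
)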

Partitioner Configuration

This section contains configuration information for the following partitioners:

  • SamplePartitioner
  • ShardedPartitioner
  • PaginateBySizePartitioner
  • PaginateIntoPartitionsPartitioner
  • SinglePartitionPartitioner

Note

If you use SparkConf to set the connector's read configurations, prefix each property with spark.mongodb.read.partitioner.options. instead of partitioner.options..

SamplePartitioner Configuration

You must specify this partitioner using the full classname: com.mongodb.spark.sql.connector.read.partitioner.SamplePartitioner.

Property name
Description
partitioner.options.partition.field

The field to use for partitioning, which must be a unique field.

Default: _id

partitioner.options.partition.size

The size (in MB) for each partition. Smaller partition sizes create more partitions containing fewer documents.

Default: 64

partitioner.options.samples.per.partition

The number of samples to take per partition. The total number of samples taken is:

samples per partition * (count / number of documents per partition)

Default: 10

Example

For a collection of 640 documents with an average document size of 0.5 MB, the default SamplePartitioner configuration values create 5 partitions with 128 documents per partition.

The MongoDB Spark Connector samples 50 documents (the default 10 per intended partition) and defines 5 partitions by selecting partition field ranges from the sampled documents.
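
As a sketch, the SamplePartitioner and its options could be configured on the SparkSession builder as shown below. The database and collection names are placeholders, and the option values simply restate the defaults from the table above.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sample-partitioner-example")
    .config("spark.mongodb.read.connection.uri", "mongodb://127.0.0.1:27017/")
    .config("spark.mongodb.read.database", "exampleDb")              # placeholder database
    .config("spark.mongodb.read.collection", "exampleCollection")    # placeholder collection
    # Spell out the partitioner with its full class name.
    .config("spark.mongodb.read.partitioner",
            "com.mongodb.spark.sql.connector.read.partitioner.SamplePartitioner")
    # Partitioner options use the spark.mongodb.read.partitioner.options. prefix.
    .config("spark.mongodb.read.partitioner.options.partition.field", "_id")
    .config("spark.mongodb.read.partitioner.options.partition.size", "64")
    .config("spark.mongodb.read.partitioner.options.samples.per.partition", "10")
    .getOrCreate()
)

df = spark.read.format("mongodb").load()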

ShardedPartitioner Configuration

The ShardedPartitioner automatically determines the partitions to use based on your shard configuration.

You must specify this partitioner using the full classname: com.mongodb.spark.sql.connector.read.partitioner.ShardedPartitioner.

Warning

This partitioner is not compatible with hashed shard keys.

PaginateBySizePartitioner Configuration

Note

If you use SparkConf to set the connector's read configurations, prefix each property with spark.mongodb.read.partitioner.options. instead of partitioner.options..

You must specify this partitioner using the full classname: com.mongodb.spark.sql.connector.read.partitioner.PaginateBySizePartitioner.

Property name
Description
partitioner.options.partition.field

The field to use for partitioning, which must be a unique field.

Default: _id

partitioner.options.partition.size

The size (in MB) for each partition. Smaller partition sizes create more partitions containing fewer documents.

Default: 64

PaginateIntoPartitionsPartitioner Configuration

Note

If you use SparkConf to set the connector's read configurations, prefix each property with spark.mongodb.read.partitioner.options. instead of partitioner.options..

You must specify this partitioner using the full classname: com.mongodb.spark.sql.connector.read.partitioner.PaginateIntoPartitionsPartitioner.

Property name
Description
partitioner.options.partition.field

The field to use for partitioning, which must be a unique field.

Default: _id

partitioner.options.maxNumberOfPartitions

The number of partitions to create.

Default: 64

SinglePartitionPartitioner Configuration

Note

If you use SparkConf to set the connector's read configurations, prefix each property with spark.mongodb.read.partitioner.options. instead of partitioner.options..

You must specify this partitioner using the full classname: com.mongodb.spark.sql.connector.read.partitioner.SinglePartitionPartitioner.

This partitioner creates a single partition.

Change Streams

Note

If you use SparkConf to set the connector's change stream configurations, prefix spark.mongodb.change.stream. to each property.

Property name
Description
change.stream.lookup.full.document

Determines what values your change stream returns on update operations.

The default setting returns the differences between the original document and the updated document.

The updateLookup setting returns the differences between the original document and updated document as well as a copy of the entire updated document.

Tip

For more information on how this change stream option works, see the MongoDB server manual guide Lookup Full Document for Update Operation.

Default: "default"

change.stream.publish.full.document.only
Specifies whether to publish the changed document or the full change stream document.

When set to true, the connector filters out messages that omit the fullDocument field and only publishes the value of the field.

Note

This setting overrides the change.stream.lookup.full.document setting.


Default: false
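
A rough sketch of a streaming read that asks the connector to publish only the fullDocument value follows. It assumes the property name from the table is accepted as a DataFrameReader option as written (adjust the key to match the prefix note above for your connector version), that a schema must be supplied explicitly for streaming reads, and that the database, collection, and schema fields are placeholders.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = (
    SparkSession.builder
    .appName("mongo-change-stream-example")
    .config("spark.mongodb.read.connection.uri", "mongodb://127.0.0.1:27017/")
    .config("spark.mongodb.read.database", "exampleDb")              # placeholder database
    .config("spark.mongodb.read.collection", "exampleCollection")    # placeholder collection
    .getOrCreate()
)

# Placeholder schema for the documents expected on the change stream.
schema = StructType([
    StructField("_id", StringType()),
    StructField("status", StringType()),
])

stream_df = (
    spark.readStream.format("mongodb")
    # Property name taken from the table above; the exact key form is an assumption.
    .option("change.stream.publish.full.document.only", "true")
    .schema(schema)
    .load()
)

# Write the stream to the console for inspection.
query = (
    stream_df.writeStream
    .format("console")
    .outputMode("append")
    .start()
)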

connection.uri Configuration Setting

You can set all read configuration options via the read connection.uri setting.

For example, consider the following read connection.uri setting:

Note

If you use SparkConf to set the connector's read configurations, prefix spark.mongodb.read. to the setting.

spark.mongodb.read.connection.uri=mongodb://127.0.0.1/databaseName.collectionName?readPreference=primaryPreferred

The configuration corresponds to the following separate configuration settings:

spark.mongodb.read.connection.uri=mongodb://127.0.0.1/
spark.mongodb.read.database=databaseName
spark.mongodb.read.collection=collectionName
spark.mongodb.read.readPreference.name=primaryPreferred

If you specify a setting both in the connection.uri and in a separate configuration, the connection.uri setting overrides the separate setting. For example, given the following configuration, the database for the connection is foobar:

spark.mongodb.read.connection.uri=mongodb://127.0.0.1/foobar
spark.mongodb.read.database=bar
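
As a sketch, the single-URI form shown earlier could be applied on the SparkSession builder as follows; databaseName and collectionName are the placeholder names carried over from the example above.

from pyspark.sql import SparkSession

# The database, collection, and read preference are all carried in the URI,
# so no separate database or collection settings are needed here.
spark = (
    SparkSession.builder
    .appName("mongo-uri-example")
    .config(
        "spark.mongodb.read.connection.uri",
        "mongodb://127.0.0.1/databaseName.collectionName?readPreference=primaryPreferred",
    )
    .getOrCreate()
)

df = spark.read.format("mongodb").load()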