Read Configuration Options
Read Configuration
You can configure the following properties to read from MongoDB:
Note
If you use SparkConf to set the connector's read configurations, prefix each property with spark.mongodb.read.
Property name | Description
---|---
mongoClientFactory | MongoClientFactory configuration key. You can specify a custom implementation, which must implement the com.mongodb.spark.sql.connector.connection.MongoClientFactory interface. Default: com.mongodb.spark.sql.connector.connection.DefaultMongoClientFactory
connection.uri | Required. The connection string configuration key. Default: mongodb://localhost:27017/
database | Required. The database name configuration.
collection | Required. The collection name configuration.
partitioner | The full class name of the partitioner. You can specify a custom implementation, which must implement the com.mongodb.spark.sql.connector.read.partitioner.Partitioner interface. See the Partitioner Configurations section for more information on partitioners. Default: com.mongodb.spark.sql.connector.read.partitioner.SamplePartitioner
partitioner.options. | Partitioner configuration prefix. See the Partitioner Configurations section for more information on partitioners.
sampleSize | The number of documents to sample from the collection when inferring the schema. Default: 1000
sql.inferSchema.mapTypes.enabled | Whether to enable Map types when inferring the schema. When enabled, large compatible struct types are inferred as a MapType instead. Default: true
sql.inferSchema.mapTypes.minimum.key.size | The minimum size of a StructType before it is inferred as a MapType. Default: 250
aggregation.pipeline | Specifies a custom aggregation pipeline to apply to the collection before sending data to Spark. The value must be either a single document or a list of documents in extended JSON format. Important: Custom aggregation pipelines must be compatible with the partitioner strategy. For example, aggregation stages such as $group do not work with any partitioner that creates more than one partition.
aggregation.allowDiskUse | Specifies whether to allow storage to disk when running the aggregation. Default: true
outputExtendedJson | When true, the connector converts BSON types not supported by Spark into extended JSON strings. When false, the connector uses the original relaxed JSON format for unsupported types. Default: false
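For illustration, a minimal PySpark read using these options might look like the following sketch. It assumes the connector package is on the classpath and a deployment is reachable at the default URI; the people database, contacts collection, and the $match pipeline are hypothetical placeholders, not values from this page.

```python
from pyspark.sql import SparkSession

# SparkConf-style setting: note the spark.mongodb.read. prefix.
spark = (
    SparkSession.builder
    .appName("read-config-example")
    .config("spark.mongodb.read.connection.uri", "mongodb://localhost:27017/")
    .getOrCreate()
)

df = (
    spark.read.format("mongodb")
    # Per-read options use the bare property names from the table above.
    .option("database", "people")          # hypothetical database
    .option("collection", "contacts")      # hypothetical collection
    .option("sampleSize", 1000)
    # Single-document extended JSON pipeline; it must remain compatible
    # with the partitioner strategy.
    .option("aggregation.pipeline", '{"$match": {"active": true}}')
    .load()
)

df.printSchema()
```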
Partitioner Configurations
This section contains configuration information for the following partitioners:
SamplePartitioner Configuration
Note
If you use SparkConf to set the connector's read configurations, prefix each property with spark.mongodb.read.partitioner.options. instead of partitioner.options.
You must specify this partitioner using the full class name: com.mongodb.spark.sql.connector.read.partitioner.SamplePartitioner.
Property name | Description
---|---
partitioner.options.partition.field | The field to use for partitioning, which must be a unique field. Default: _id
partitioner.options.partition.size | The size (in MB) for each partition. Smaller partition sizes create more partitions containing fewer documents. Default: 64
partitioner.options.samples.per.partition | The number of samples to take per partition. The total number of samples taken is: samples per partition * (count / number of documents per partition). Default: 10
Example
For a collection with 640 documents and an average document size of 0.5 MB, the default SamplePartitioner configuration values create 5 partitions with 128 documents per partition.
The MongoDB Spark Connector samples 50 documents (the default 10 per intended partition) and defines 5 partitions by selecting partition field ranges from the sampled documents.
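As a sketch of tuning those values, the following read (reusing the SparkSession from the earlier example) asks the SamplePartitioner for smaller, 32 MB partitions; the field, size, and namespace are illustrative assumptions, not recommendations.

```python
df = (
    spark.read.format("mongodb")
    .option("database", "people")        # hypothetical database
    .option("collection", "contacts")    # hypothetical collection
    .option(
        "partitioner",
        "com.mongodb.spark.sql.connector.read.partitioner.SamplePartitioner",
    )
    # Partition on _id (the default field) in 32 MB chunks,
    # sampling 10 documents per intended partition.
    .option("partitioner.options.partition.field", "_id")
    .option("partitioner.options.partition.size", 32)
    .option("partitioner.options.samples.per.partition", 10)
    .load()
)

print(df.rdd.getNumPartitions())
```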
ShardedPartitioner Configuration
The ShardedPartitioner automatically determines the partitions to use based on your shard configuration.
You must specify this partitioner using the full class name: com.mongodb.spark.sql.connector.read.partitioner.ShardedPartitioner.
Warning
This partitioner is not compatible with hashed shard keys.
PaginateBySizePartitioner Configuration
Note
If you use SparkConf to set the connector's read configurations, prefix each property with spark.mongodb.read.partitioner.options. instead of partitioner.options.
You must specify this partitioner using the full class name: com.mongodb.spark.sql.connector.read.partitioner.PaginateBySizePartitioner.
Property name | Description
---|---
partitioner.options.partition.field | The field to use for partitioning, which must be a unique field. Default: _id
partitioner.options.partition.size | The size (in MB) for each partition. Smaller partition sizes create more partitions containing fewer documents. Default: 64
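The sketch below sets this partitioner through SparkConf, so each option carries the longer spark.mongodb.read.partitioner.options. prefix from the note above; the 128 MB size and the people.contacts namespace are placeholder assumptions.

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.set(
    "spark.mongodb.read.connection.uri",
    "mongodb://localhost:27017/people.contacts",  # hypothetical namespace
)
conf.set(
    "spark.mongodb.read.partitioner",
    "com.mongodb.spark.sql.connector.read.partitioner.PaginateBySizePartitioner",
)
# Full-prefix form of partitioner.options.partition.size.
conf.set("spark.mongodb.read.partitioner.options.partition.size", "128")

# getOrCreate reuses an active session if one exists; start a fresh one
# if you need this conf to take effect.
spark = SparkSession.builder.config(conf=conf).getOrCreate()
df = spark.read.format("mongodb").load()
```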
PaginateIntoPartitionsPartitioner Configuration
Note
If you use SparkConf to set the connector's read configurations, prefix each property with spark.mongodb.read.partitioner.options. instead of partitioner.options.
You must specify this partitioner using the full class name: com.mongodb.spark.sql.connector.read.partitioner.PaginateIntoPartitionsPartitioner.
Property name | Description
---|---
partitioner.options.partition.field | The field to use for partitioning, which must be a unique field. Default: _id
partitioner.options.maxNumberOfPartitions | The number of partitions to create. Default: 64
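A short sketch of requesting a fixed upper bound on the partition count (16 here is an arbitrary illustrative value), again reusing an existing SparkSession and hypothetical namespace:

```python
df = (
    spark.read.format("mongodb")
    .option("database", "people")        # hypothetical database
    .option("collection", "contacts")    # hypothetical collection
    .option(
        "partitioner",
        "com.mongodb.spark.sql.connector.read.partitioner.PaginateIntoPartitionsPartitioner",
    )
    .option("partitioner.options.maxNumberOfPartitions", 16)
    .load()
)
```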
SinglePartitionPartitioner Configuration
Note
If you use SparkConf to set the connector's read configurations, prefix each property with spark.mongodb.read.partitioner.options. instead of partitioner.options.
You must specify this partitioner using the full class name: com.mongodb.spark.sql.connector.read.partitioner.SinglePartitionPartitioner.
This partitioner creates a single partition.
Change Streams
Note
If you use SparkConf to set the connector's change stream configurations, prefix each property with spark.mongodb.read.
Property name | Description
---|---
change.stream.lookup.full.document | Determines what values your change stream returns on update operations. The default setting returns the differences between the original document and the updated document. The updateLookup setting returns the differences between the original document and the updated document as well as a copy of the entire updated document at a point in time after the change. Tip: For more information on how this change stream option works, see the MongoDB server manual guide Lookup Full Document for Update Operations. Default: "default"
change.stream.publish.full.document.only | Specifies whether to publish the changed document or the full change stream document. When set to true, the connector filters out messages that omit the fullDocument field and publishes only the value of the field. Note: This setting overrides the change.stream.lookup.full.document setting. Default: false
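A minimal streaming-read sketch using these settings follows; the schema, database and collection names, and the console sink are assumptions for illustration, and an explicit schema is assumed here because the stream is not sampled for inference.

```python
from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical schema for the changed documents.
schema = StructType([
    StructField("_id", StringType()),
    StructField("status", StringType()),
])

query = (
    spark.readStream.format("mongodb")
    .option("database", "people")        # hypothetical database
    .option("collection", "contacts")    # hypothetical collection
    # Publish only the changed document rather than the full
    # change stream event document.
    .option("change.stream.publish.full.document.only", "true")
    .schema(schema)
    .load()
    .writeStream
    .format("console")
    .outputMode("append")
    .start()
)
```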
connection.uri Configuration Setting
You can set all read configuration options via the read connection.uri setting.
For example, consider the following configuration, which sets the read connection.uri setting:
Note
If you use SparkConf to set the connector's read configurations, prefix the setting with spark.mongodb.read.
spark.mongodb.read.connection.uri=mongodb://127.0.0.1/databaseName.collectionName?readPreference=primaryPreferred
The configuration corresponds to the following separate configuration settings:
spark.mongodb.read.connection.uri=mongodb://127.0.0.1/
spark.mongodb.read.database=databaseName
spark.mongodb.read.collection=collectionName
spark.mongodb.read.readPreference.name=primaryPreferred
If you specify a setting both in the connection.uri and in a separate configuration, the connection.uri setting overrides the separate setting. For example, given the following configuration, the database for the connection is foobar:
spark.mongodb.read.connection.uri=mongodb://127.0.0.1/foobar
spark.mongodb.read.database=bar
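As a final sketch, a combined URI like the one above can also be supplied per read through the connection.uri option, under the assumption that the URI-embedded database, collection, and read preference are parsed the same way as in the SparkConf form; databaseName and collectionName remain placeholders.

```python
df = (
    spark.read.format("mongodb")
    .option(
        "connection.uri",
        "mongodb://127.0.0.1/databaseName.collectionName?readPreference=primaryPreferred",
    )
    .load()
)
```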