Overview
You can configure the following properties when reading data from MongoDB in batch mode.
Note
If you use SparkConf to set the connector's read configurations, prefix spark.mongodb.read. to each property.
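The prefixing rule in the note above can be sketched in plain Python. The helper `with_read_prefix` is a hypothetical name used only for illustration and is not part of the connector. The property names `connection.uri`, `database`, and `collection` are taken from the examples later on this page.

```python
# Minimal sketch: add the read prefix to each connector property name
# so that the resulting keys can be set through SparkConf.
READ_PREFIX = "spark.mongodb.read."


def with_read_prefix(options):
    """Return a copy of `options` with READ_PREFIX added to each key.

    Keys that already carry the prefix are left unchanged.
    """
    return {
        key if key.startswith(READ_PREFIX) else READ_PREFIX + key: value
        for key, value in options.items()
    }


read_options = {
    "connection.uri": "mongodb://127.0.0.1/",
    "database": "myDB",
    "collection": "myCollection",
}

spark_conf_entries = with_read_prefix(read_options)
for key, value in sorted(spark_conf_entries.items()):
    print(f"{key}={value}")
```

Each printed line has the `key=value` shape used by the SparkConf examples at the end of this page.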
Property name | Description |
|---|---|
| Required. |
| Required. |
| Required. |
| The comment to append to the read operation. Comments appear in the output of the Database Profiler. |
| The parsing strategy to use when handling documents that don't match the expected schema. This option accepts the following values: |
| If you set the |
| MongoClientFactory configuration key. |
| The partitioner full class name. You can specify a custom implementation that must implement the |
| Partitioner configuration prefix. |
| The number of documents to sample from the collection when inferring |
| Whether to enable Map types when inferring the schema. |
| Minimum size of a |
| Specifies a custom aggregation pipeline to apply to the collection before sending data to Spark. A list of documents resembles the following: IMPORTANT: Custom aggregation pipelines must be compatible with the partitioner strategy. For example, aggregation stages such as |
| Specifies whether to allow storage to disk when running the aggregation. |
| When |
| Specifies a partial schema of known field types to use when inferring the schema for the collection. To learn more about the |
Partitioner Configurations
Partitioners change the read behavior of batch reads that use the Spark Connector. By dividing the data into partitions, you can run transformations in parallel.
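This section contains configuration information for the following partitioners: AutoBucketPartitioner (default), SamplePartitioner, ShardedPartitioner, PaginateBySizePartitioner, PaginateIntoPartitionsPartitioner, and SinglePartitionPartitioner.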
Note
Batch Reads Only
Because the data-stream-processing engine produces a single data stream, partitioners do not affect streaming reads.
AutoBucketPartitioner Configuration (default)
The AutoBucketPartitioner is the default partitioner configuration. It samples the data to generate partitions and uses the $bucketAuto aggregation stage to paginate. By using this configuration, you can partition the data across single or multiple fields, including nested fields.
Note
Compound Keys
The AutoBucketPartitioner configuration requires MongoDB Server version 7.0 or higher to support compound keys.
To use this configuration, set the partitioner configuration option to com.mongodb.spark.sql.connector.read.partitioner.AutoBucketPartitioner.
Property name | Description |
|---|---|
| The list of fields to use for partitioning. The value can be either a single field name or a list of comma-separated fields. Default: |
| The average size (MB) for each partition. Smaller partition sizes create more partitions containing fewer documents. Because this configuration uses the average document size to determine the number of documents per partition, partitions might not be the same size. Default: |
| The number of samples to take per partition. Default: |
| The field name to use for a projected field that contains all the fields used to partition the collection. We recommend changing the value of this property only if each document already contains the Default: |
SamplePartitioner Configuration
The SamplePartitioner configuration is similar to the AutoBucketPartitioner configuration, but does not use the $bucketAuto aggregation stage. This configuration lets you specify a partition field, partition size, and number of samples per partition.
To use this configuration, set the partitioner configuration option to com.mongodb.spark.sql.connector.read.partitioner.SamplePartitioner.
Property name | Description |
|---|---|
| The field to use for partitioning, which must be a unique field. Default: |
| The size (in MB) for each partition. Smaller partition sizes create more partitions containing fewer documents. Default: |
| The number of samples to take per partition. The total number of samples taken is the number of samples per partition multiplied by the number of partitions. Default: 10 |
Example
For a collection with 640 documents with an average document size of 0.5 MB, the default SamplePartitioner configuration creates 5 partitions with 128 documents per partition.
The Spark Connector samples 50 documents (the default 10 per intended partition) and defines 5 partitions by selecting partition field ranges from the sampled documents.
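The arithmetic behind this example can be reproduced with a short sketch. The function `sample_partition_plan` is hypothetical, and the 64 MB partition size is an assumption inferred from the numbers above (640 documents of 0.5 MB split into 5 partitions). The 10 samples per partition come from the example text. The rounding choices are also assumptions, not the connector's documented behavior.

```python
import math


def sample_partition_plan(doc_count, avg_doc_size_mb,
                          partition_size_mb=64, samples_per_partition=10):
    """Estimate partition and sample counts for a sampled partitioner.

    Assumed defaults: 64 MB partitions (inferred from the example)
    and 10 samples per partition (stated in the example).
    """
    # Documents that fit into one partition of the requested size.
    docs_per_partition = max(1, math.floor(partition_size_mb / avg_doc_size_mb))
    # Partitions needed to cover the whole collection.
    partitions = math.ceil(doc_count / docs_per_partition)
    # Total samples: samples per partition times the number of partitions.
    total_samples = samples_per_partition * partitions
    return {
        "docs_per_partition": docs_per_partition,
        "partitions": partitions,
        "total_samples": total_samples,
    }


plan = sample_partition_plan(doc_count=640, avg_doc_size_mb=0.5)
print(plan)
# {'docs_per_partition': 128, 'partitions': 5, 'total_samples': 50}
```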
ShardedPartitioner Configuration
The ShardedPartitioner configuration automatically partitions the data based on your shard configuration.
To use this configuration, set the partitioner configuration option to com.mongodb.spark.sql.connector.read.partitioner.ShardedPartitioner.
Important
ShardedPartitioner Restrictions
In MongoDB Server v6.0 and later, the sharding operation creates one large initial chunk to cover all shard key values, making the sharded partitioner inefficient. We do not recommend using the sharded partitioner when connected to MongoDB v6.0 and later.
The sharded partitioner is not compatible with hashed shard keys.
PaginateBySizePartitioner Configuration
The PaginateBySizePartitioner configuration paginates the data by using the average document size to split the collection into average-sized chunks.
To use this configuration, set the partitioner configuration option to com.mongodb.spark.sql.connector.read.partitioner.PaginateBySizePartitioner.
Property name | Description |
|---|---|
| The field to use for partitioning, which must be a unique field. Default: |
| The size (in MB) for each partition. Smaller partition sizes create more partitions containing fewer documents. Default: |
PaginateIntoPartitionsPartitioner Configuration
The PaginateIntoPartitionsPartitioner configuration paginates the data by dividing the count of documents in the collection by the maximum number of allowable partitions.
To use this configuration, set the partitioner configuration option to com.mongodb.spark.sql.connector.read.partitioner.PaginateIntoPartitionsPartitioner.
Property name | Description |
|---|---|
| The field to use for partitioning, which must be a unique field. Default: |
| The number of partitions to create. Default: |
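The division described above can be illustrated with a small sketch. The function `paginate_into_partitions` is a hypothetical helper and the input values are made up for illustration. Rounding the per-partition count up is an assumption, not behavior confirmed by this page.

```python
import math


def paginate_into_partitions(doc_count, max_partitions):
    """Split `doc_count` documents into at most `max_partitions` pages.

    Returns a list of (start, end) offsets with `end` exclusive.
    Assumes the per-partition document count is rounded up.
    """
    if max_partitions < 1:
        raise ValueError("max_partitions must be at least 1")
    if doc_count <= 0:
        return []
    docs_per_partition = math.ceil(doc_count / max_partitions)
    return [
        (start, min(start + docs_per_partition, doc_count))
        for start in range(0, doc_count, docs_per_partition)
    ]


pages = paginate_into_partitions(doc_count=10, max_partitions=4)
print(pages)
# [(0, 3), (3, 6), (6, 9), (9, 10)]
```

In this sketch the last page can hold fewer documents than the others when the document count does not divide evenly.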
SinglePartitionPartitioner Configuration
The SinglePartitionPartitioner configuration creates a single partition.
To use this configuration, set the partitioner configuration option to com.mongodb.spark.sql.connector.read.partitioner.SinglePartitionPartitioner.
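Selecting one of the partitioners in this section comes down to setting a single option to a full class name. The sketch below builds that setting in SparkConf form. It assumes the partitioner option is exposed under the key `partitioner`, which the tables on this page do not show, so check the key against your connector version. The helper `partitioner_setting` is hypothetical.

```python
# Package shared by every partitioner class named in this section.
PARTITIONER_PACKAGE = "com.mongodb.spark.sql.connector.read.partitioner"

# Short class names of the partitioners described in this section.
KNOWN_PARTITIONERS = (
    "AutoBucketPartitioner",
    "SamplePartitioner",
    "ShardedPartitioner",
    "PaginateBySizePartitioner",
    "PaginateIntoPartitionsPartitioner",
    "SinglePartitionPartitioner",
)

# Assumption: the option key is "partitioner" under the read prefix.
PARTITIONER_KEY = "spark.mongodb.read.partitioner"


def partitioner_setting(short_name):
    """Return a one-entry SparkConf mapping that selects a partitioner."""
    if short_name not in KNOWN_PARTITIONERS:
        raise ValueError(f"unknown partitioner: {short_name}")
    return {PARTITIONER_KEY: f"{PARTITIONER_PACKAGE}.{short_name}"}


setting = partitioner_setting("SinglePartitionPartitioner")
for key, value in setting.items():
    print(f"{key}={value}")
```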
Specifying Properties in connection.uri
If you use SparkConf to specify any of the previous settings, you can either include them in the connection.uri setting or list them individually.
The following code example shows how to specify the database, collection, and read preference as part of the connection.uri setting:
spark.mongodb.read.connection.uri=mongodb://127.0.0.1/myDB.myCollection?readPreference=primaryPreferred
To keep the connection.uri shorter and make the settings easier to read, you can specify them individually instead:
spark.mongodb.read.connection.uri=mongodb://127.0.0.1/
spark.mongodb.read.database=myDB
spark.mongodb.read.collection=myCollection
Important
If you specify a setting in both the connection.uri and on its own line, the connection.uri setting takes precedence. For example, in the following configuration, the connection database is foobar, because it's the value in the connection.uri setting:
spark.mongodb.read.connection.uri=mongodb://127.0.0.1/foobar
spark.mongodb.read.database=bar
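The precedence rule can be modeled with a short sketch that uses only the Python standard library. The function `effective_namespace` is hypothetical and only mimics the rule stated above. It is not the connector's own resolution code, and it assumes the URI path has the form `/database.collection`, as in the earlier example.

```python
from urllib.parse import parse_qs, urlsplit

PREFIX = "spark.mongodb.read."


def effective_namespace(conf):
    """Resolve the database and collection for a read configuration.

    Values embedded in connection.uri win over the individual settings.
    """
    uri = urlsplit(conf[PREFIX + "connection.uri"])
    # The URI path looks like "/database" or "/database.collection".
    uri_db, _, uri_coll = uri.path.lstrip("/").partition(".")
    database = uri_db or conf.get(PREFIX + "database")
    collection = uri_coll or conf.get(PREFIX + "collection")
    return database, collection


conflicting = {
    PREFIX + "connection.uri": "mongodb://127.0.0.1/foobar",
    PREFIX + "database": "bar",
}
print(effective_namespace(conflicting))  # ('foobar', None)

separate = {
    PREFIX + "connection.uri": "mongodb://127.0.0.1/",
    PREFIX + "database": "myDB",
    PREFIX + "collection": "myCollection",
}
print(effective_namespace(separate))  # ('myDB', 'myCollection')

# Query parameters such as readPreference can be read from the URI as well.
full_uri = "mongodb://127.0.0.1/myDB.myCollection?readPreference=primaryPreferred"
uri_options = parse_qs(urlsplit(full_uri).query)
print(uri_options)  # {'readPreference': ['primaryPreferred']}
```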