Streaming From Multiple Specific Collections Using MongoDB Spark Connector 10.x

Ravi_Kottu · March 15, 2023, 5:25am

Dear Community,

We are evaluating version 10.1.1 of the Spark connector to stream data into Spark, but we could not find options for the items below and would appreciate your suggestions. We are using PySpark on Databricks with Structured Streaming.

1. How to stream data from multiple collections of a database?
   .option("spark.mongodb.read.collection", collection1, collection2,…collectionN)
2. How to stream data from multiple databases?
   .option("spark.mongodb.read.database", DB1, DB2,…DBn)
3. How to read the existing data of a collection first and then start streaming?
   Example: "copy.existing", which would copy the existing data first and then start streaming changes.

Thanks in anticipation!

Ravi

Prakul_Agarwal · April 23, 2023, 4:23am

1 & 2.
The MongoDB Spark Connector transfers data between MongoDB and a Spark DataFrame. However, each read or write operation is limited to a single MongoDB collection, so the connector does not natively support reading from or writing to multiple databases, collections, or schemas in a single operation.

With that said, here is an approach you can use with the MongoDB Spark Connector: create a loop that iterates over the list of collections you want to read from and, for each collection, use the connector to read the data into Spark. You can then perform any necessary transformations and write the data to the target Delta table.
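A minimal sketch of that loop in PySpark follows. It is an illustration under assumptions, not a tested pipeline: the option keys follow the `spark.mongodb.read.*` naming used in the question, the helper names (`read_options`, `start_streams`), the Delta table naming, and the checkpoint layout are all hypothetical, and an explicit schema per collection is assumed to be supplied by the caller.

```python
def read_options(uri, database, collection):
    """Build the read options for a single collection (one stream each)."""
    return {
        "spark.mongodb.read.connection.uri": uri,
        "spark.mongodb.read.database": database,
        "spark.mongodb.read.collection": collection,
    }


def start_streams(spark, uri, database, collections, schemas, checkpoint_root):
    """Start one streaming query per collection and return the queries.

    `schemas` maps a collection name to the schema to use for that stream.
    Table names and checkpoint paths are illustrative assumptions.
    """
    queries = []
    for collection in collections:
        df = (
            spark.readStream.format("mongodb")
            .options(**read_options(uri, database, collection))
            .schema(schemas[collection])
            .load()
        )
        # Apply any per-collection transformations to `df` here.
        query = (
            df.writeStream.format("delta")
            # Each stream needs its own checkpoint location.
            .option("checkpointLocation", f"{checkpoint_root}/{database}/{collection}")
            .toTable(f"{database}_{collection}")
        )
        queries.append(query)
    return queries
```

On Databricks this might be called as `start_streams(spark, uri, "sales", ["orders", "customers"], schemas, "/tmp/checkpoints")`, where every argument value is a placeholder.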

This approach involves setting up one pipeline for each collection, but it can be automated using a loop. The same applies to working with multiple MongoDB instances: you will need to create a separate Spark configuration for each instance, because the connector's configuration is specific to a single MongoDB instance. If you want to read from only a subset of the collections in a MongoDB instance, you can create a config file that serves as the initial list to iterate over and create connections, or you can query the database for a list of collections.
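The config-file idea can be sketched as follows. The JSON layout, the host names, and the `expand_sources` helper are hypothetical and only meant to show how one entry per instance and database could be expanded into one set of read options per collection; the option keys follow the `spark.mongodb.read.*` naming used in the question.

```python
import json

# Hypothetical config: one entry per MongoDB instance and database.
CONFIG_JSON = """
[
  {"uri": "mongodb://host-a:27017", "database": "sales",
   "collections": ["orders", "customers"]},
  {"uri": "mongodb://host-b:27017", "database": "inventory",
   "collections": ["items"]}
]
"""


def expand_sources(config_text):
    """Return one dict of read options per (instance, database, collection)."""
    sources = []
    for entry in json.loads(config_text):
        for collection in entry["collections"]:
            sources.append(
                {
                    "spark.mongodb.read.connection.uri": entry["uri"],
                    "spark.mongodb.read.database": entry["database"],
                    "spark.mongodb.read.collection": collection,
                }
            )
    return sources


if __name__ == "__main__":
    for options in expand_sources(CONFIG_JSON):
        print(options)
```

Each resulting dict could then be passed to a separate `spark.readStream.format("mongodb").options(**options)` call, so that every collection gets its own stream.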

The Spark connector doesn't have a native ability to read the existing data of a collection first and then start streaming. We have a Jira ticket to track the ability to copy existing data: https://jira.mongodb.org/projects/SPARK/issues/SPARK-303

Here are some ways described to copy the data over.