Spark MongoDB connector - Spark Streaming into global MongoDB, read and write operations in a particular zone

We want to read from and write to a global MongoDB cluster from Spark Streaming. We tried to use the "shardkey" option in the v10 Spark MongoDB connector, but it is not working. With the v3.0 Spark MongoDB connector it is working, but v3.0 does not support streaming; streaming is only supported in the v10 connector. Can anyone please help me with that?

How can we read from and write to a global MongoDB cluster from Spark? Reads and writes must be routed to a particular zone.

By "not working", what exactly was the error? Can you provide any example code that you tried? Are you trying to write to a sharded collection?

We are trying to insert data into a sharded MongoDB cluster. We are able to insert data using format "mongo" (Spark MongoDB connector version 3.0) for batch jobs, but when inserting from a Spark streaming job we are using format "mongodb" (version 10.0), which is not working.

Below is the code we are using, which is not working:

.queryName("some query")
.option("spark.mongodb.connection.uri", uri)
.option("spark.mongodb.database", database)
.option("spark.mongodb.collection", collection)
.option("replaceDocument", False)
.option("shardkey", '{"first field": "hashed", "second field": "hashed"}')
.trigger(processingTime=some value)

We are getting the below error:
com.mongodb.MongoBulkWriteException: Bulk write operation error on server "mongo server name":27016. Write errors: [BulkWriteError{index=0, code=61, message='Failed to target upsert by query :: could not extract exact shard key', details={}}].
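For background on that error code: on a sharded collection, mongos can only target an upsert or replace when the filter contains every field of the shard key. Below is a simplified plain-Python sketch of that targeting rule (not connector internals); the field names "first_field" and "second_field" are placeholders standing in for the real shard-key fields.

```python
# Simplified sketch of mongos upsert targeting (not connector code):
# an upsert can be targeted only when the filter contains the complete
# shard key. Field names below are hypothetical placeholders.
SHARD_KEY_FIELDS = ("first_field", "second_field")

def can_target_upsert(filter_doc, shard_key_fields=SHARD_KEY_FIELDS):
    """Return True when the filter contains every shard-key field."""
    return all(field in filter_doc for field in shard_key_fields)

# A replace filter built only from _id cannot be targeted, which is
# consistent with a code 61 "could not extract exact shard key" error:
assert not can_target_upsert({"_id": "101"})

# A filter that also carries both shard-key fields can be targeted:
assert can_target_upsert({"_id": "101", "first_field": "a", "second_field": "b"})
```

In other words, the fix is to make sure the filter the connector builds for each document includes the shard-key fields, not to pass the shard key as a separate write option.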

  1. How can we write into a sharded MongoDB cluster (global MongoDB) from a Spark streaming job using the shard key (the MongoDB docs mention that we must use the shard key option when inserting data into a specific shard)?

  2. Can we use Spark MongoDB connector version 3.0 for both streaming and batch jobs to write data into a sharded MongoDB cluster (global MongoDB)?

  3. Is it possible to use Spark MongoDB connector version 10 for both streaming and batch jobs to write data into a global MongoDB cluster (sharded MongoDB)?

@Deepak_gusain There is no "shardkey" write option; you will need to use the "idFieldList" option and specify the fields that are used to identify the document.

V10 handles batch and structured streaming; V3.x handles just batch. That said, we are working to bring V10 to parity with V3.x; we still need to add RDD support and some other things.
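To make the suggestion concrete, here is a minimal sketch of what a v10 (format "mongodb") streaming write configuration might look like with "idFieldList" instead of the unsupported "shardkey" option. The URI, database, collection, field names, and trigger interval are all placeholders, and the comma-separated format for "idFieldList" is an assumption worth checking against the connector docs; the Spark chain itself is only shown in comments because it needs a running Spark session.

```python
# Hypothetical v10 write options. "first_field" / "second_field" stand in
# for the real shard-key fields of the collection.
write_options = {
    "spark.mongodb.connection.uri": "mongodb://host:27016",  # placeholder
    "spark.mongodb.database": "db_name",                     # placeholder
    "spark.mongodb.collection": "collection_name",           # placeholder
    # Fields used to identify a document for replace/update; including the
    # shard-key fields lets the upsert be targeted to a single shard.
    # (Comma-separated format is an assumption; check the connector docs.)
    "idFieldList": "first_field,second_field",
}

# The streaming write itself would look roughly like this (requires
# pyspark and a live cluster, so it is only sketched in comments):
#
# (df.writeStream
#    .format("mongodb")
#    .queryName("some query")
#    .options(**write_options)
#    .trigger(processingTime="10 seconds")  # placeholder interval
#    .start())

assert write_options["idFieldList"].split(",") == ["first_field", "second_field"]
```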


As per the MongoDB documentation, without the shard key documents will not be distributed, and we are using a global MongoDB cluster. Data must go into a particular zone.

Will this "idFieldList" option distribute the inserted data across the global MongoDB cluster?

When you define the sharded collection itself, you define the zones. The shard key value in each document determines which shard it will be written to. Thus, there isn't anything to configure on the Spark side; it is more to do with how you configured the sharded collection itself.
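The point above can be illustrated with a toy model in plain Python (this is an illustration, not MongoDB internals): the zone a document lands in is purely a function of its shard-key value and the zone ranges configured on the collection. The zone names and range bounds here are made up.

```python
# Toy model of zone-based routing: each zone owns a range of shard-key
# values, and a document is routed by looking up its key in those ranges.
# Zone names and bounds are invented for illustration only.
ZONE_RANGES = [
    # (zone name, inclusive lower bound, exclusive upper bound)
    ("EU", "eu-aaa", "eu-zzz"),
    ("US", "us-aaa", "us-zzz"),
]

def zone_for(shard_key_value):
    """Return the zone whose configured range covers the shard-key value."""
    for zone, lower, upper in ZONE_RANGES:
        if lower <= shard_key_value < upper:
            return zone
    return None  # value falls outside every configured zone range

assert zone_for("eu-west") == "EU"
assert zone_for("us-east") == "US"
```

Nothing the writer (Spark or otherwise) sends changes this mapping; only the document's shard-key value and the zone configuration do.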


I am trying to update a field (which is not an indexed field) in existing data using the code below, but it is not working and gives the error below. Can you please tell us if we are doing anything wrong?

.option("collection", _collection)
.option("operationType", "update")
.option("idFieldList", ["field_1", "field_2", "field_3"]).save()

field_1 = default "_id"
field_2 = first field of the compound index
field_3 = second field of the compound index

ERROR: com.mongodb.MongoBulkWriteException: Bulk write operation error on server "mongo server name":27016. Write errors: [BulkWriteError{index=0, code=11000, message='E11000 duplicate key error collection: db_name.collection_name index: id dup key: { _id: "101" }', details={}}].

Please note: with Mongo Spark connector version 3 it was working, but with version 10 we are not able to update data.
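One plausible reading of that E11000 (an assumption about the mechanism, not a confirmed diagnosis): with an upsert-style update, the match filter is built from all of the idFieldList fields, so if field_2 or field_3 in the incoming row differs from the stored document, nothing matches, the upsert falls back to an insert with the same _id, and the unique _id index rejects it. A toy plain-Python simulation of that failure mode:

```python
# Toy simulation (plain Python, not connector code) of one plausible cause
# of the E11000: the update filter uses every idFieldList field, so a
# changed non-_id field makes the filter miss and the insert collide on _id.

def upsert(collection, filter_doc, new_doc):
    """Match on every filter field; fall back to insert when nothing matches."""
    for doc in collection:
        if all(doc.get(k) == v for k, v in filter_doc.items()):
            doc.update(new_doc)  # matched: update in place
            return "updated"
    # No match: attempt an insert, which a unique _id index would reject.
    if any(doc["_id"] == new_doc["_id"] for doc in collection):
        raise ValueError("E11000 duplicate key error: _id " + new_doc["_id"])
    collection.append(new_doc)
    return "inserted"

collection = [{"_id": "101", "field_2": "old", "field_3": "x", "value": 1}]
incoming = {"_id": "101", "field_2": "new", "field_3": "x", "value": 2}
# Filter built from all three idFieldList fields, as described above:
filter_doc = {k: incoming[k] for k in ("_id", "field_2", "field_3")}

try:
    upsert(collection, filter_doc, incoming)
except ValueError as err:
    print(err)  # duplicate key on _id, mirroring the reported error
```

If this reading is right, restricting idFieldList to immutable identifying fields (e.g. just "_id", plus the shard-key fields on a sharded collection) would avoid the miss-and-reinsert path; fields whose values change between writes should not be part of the match filter.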

@Robert_Walters, can you please help us with the above error? We are not able to update data in the global MongoDB cluster. I have pasted the code above.