How to write an ObjectId value with Spark connector 10.1 using PySpark?

I’m having trouble figuring out how to write a value of type ObjectId to MongoDB using the Spark connector 10.1 with Python (PySpark).

Although I haven’t found much about it online, I have tried the solution in the link below, which says to write a StructType column containing a string field called “oid”, but this does not work. It ends up creating an object with an “oid” attribute instead of an ObjectId.
python - Write PySpark dataframe to MongoDB inserting field as ObjectId - Stack Overflow
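For reference, this is roughly the attempt (a sketch only; the column names and sample value are placeholders I made up):

from pyspark.sql import SparkSession
from pyspark.sql.functions import struct, col

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("6240b43c279082371d0e835f", "test")], ["id_str", "name"])

# Wrap the hex string in a struct with an "oid" field, as the linked answer suggests
df_with_oid = df.withColumn("_id", struct(col("id_str").alias("oid"))).drop("id_str")

# Writing this stores a nested document { "oid": "..." }, not an ObjectId
# (write configuration omitted here; see the working example later in the thread)
df_with_oid.write.format("mongodb").mode("append").save()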

I’ve also tried enabling the convertJson write option mentioned in the below reference, but that doesn’t seem to make any difference either.
Write Configuration Options — MongoDB Spark Connector


I’m also facing the same issue. I think if we could somehow write oid as $oid then it would work, but naming the column $oid directly gives an error.

Yeah, this is a problem. The document stored in the database ends up with a nested object containing an oid attribute instead of an actual ObjectId.

Did anyone solve this issue?

Solution:
Set .config("spark.mongodb.write.convertJson","object_Or_Array_Only") in your SparkSession.
ObjectId fields should receive a value in the following format: state = "{ '$oid' : '6240b43c279082371d0e835f' }"

The MongoSpark connector requires the curly brackets to parse the string as BSON, and the Mongo Java driver then parses the "$oid" key as an ObjectId.
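As a minimal sketch of the above (the connection URI, database, and collection names here are placeholders):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.mongodb.write.connection.uri", "mongodb://localhost:27017")
    .config("spark.mongodb.write.database", "test")
    .config("spark.mongodb.write.collection", "myCollection")
    .config("spark.mongodb.write.convertJson", "object_Or_Array_Only")
    .getOrCreate()
)

# The column holds extended-JSON text; with convertJson enabled the connector
# parses it and stores an ObjectId rather than a plain string
df = spark.createDataFrame([("{ '$oid' : '6240b43c279082371d0e835f' }",)], ["state"])

df.write.format("mongodb").mode("append").save()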

I have tried the above options:

  • READ always returns the ObjectId as a “string” from MongoDB 6.0/PySpark 3.4.0
  • WRITE always writes anything I try as a “string” to MongoDB 6.0/PySpark 3.4.0

I am using the MongoDB Spark Connector 10.0.4 driver.

I’d really appreciate it if anyone can share exact steps with sample code that works. This is an important requirement for us: at a minimum we need to WRITE back as ObjectId. In the worst case I can live with READ returning existing ObjectId values as strings.

@JTBS .config("spark.mongodb.write.convertJson","object_Or_Array_Only") is only available in MongoSpark connector V10.2.0. Try updating the connector and let us know.


Yes - I noticed the documentation change and switched to MongoDB Spark Connector 10.2.0.
Now it works if the value of the column is in this format: '{ "$oid" : "xxxxxxxxx" }'

But I have two problems:

  • I don’t know how to get $ in place of the struct field name. I can generate the column with to_json(struct(_id as oid)), but PySpark complains if I use $
  • If I don’t use $, the driver won’t convert the value to an ObjectId
  • In addition to this, the driver still READS _id (ObjectId) as “String” when it LOADS data from MongoDB

Finally, there is also a mismatch in the driver documentation: this post says to use “object_Or_Array_Only”, but the documentation uses “objectOrArrayOnly”.

In my case I was ONLY able to get the above working with the “any” option - when I used “object_Or_Array_Only”, it gave an error converting my DataFrame, I think due to other data in the DataFrame.

Question
How can I convert _id (String) into a value with $oid in the format '{ "$oid": "_id value" }'?

This seems like such a simple issue - with the latest MongoDB and a Spark connector built for the purpose, having to go through all these workarounds is a bit surprising. But a good sample for the question above would be greatly appreciated - I can live with whatever workarounds are needed.

While I wait to hear of any better option - for those having the same issue:

Below is what worked for me:

  • In the MongoDB options I used: spark.mongodb.write.convertJson: any
  • Changed _id (String) with: expr('replace(to_json(struct(_id as oid)), "oid", "$oid")') (see the sketch below)
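A sketch of how that expression fits together (using expr; the column name _id is from my data):

from pyspark.sql.functions import expr

# struct(_id as oid)          -> a struct column with one field named "oid"
# to_json(...)                -> the text {"oid":"<hex>"}
# replace(..., "oid", "$oid") -> the text {"$oid":"<hex>"}, because PySpark
#                                rejects "$oid" as a struct field name directly
df = df.withColumn("_id", expr('replace(to_json(struct(_id as oid)), "oid", "$oid")'))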

Open questions:

  • I don’t know why _id comes back as a String when reading
  • On write, I am not sure if there is an easier option than all the circus above

I hope either a future driver version handles this better without breaking existing support, OR there is something I am missing here.

@JTBS, I’ll try to break down everything:

  1. I do not think there’s a need to struct the column at all if you use '{ "$oid" : "xxxxxxxxx" }'. I generated an SQL expression, as seen in the code example below. I’m not sure whether this will work with PySpark built-in functions.
  2. What are your concerns regarding loading the oids? Types in Spark and Mongo do not correlate, and an ObjectId type does not exist in Spark.
  3. You are right, there’s a discrepancy between the documentation and actual use. object_Or_Array_Only is usable, while objectOrArrayOnly is not…
  4. To answer your question to the best of my understanding:
_id = "xxxxx"  # your 24-character hex string (placeholder)
formatted_id = "{ '$oid' : '" + _id + "' }"
  5. Code example:
from pyspark.sql import SparkSession

# Extended-JSON text that the connector converts to an ObjectId on write
_id = "{ '$oid' : '650898287d503960a631ccac' }"

spark = (
    SparkSession
    .builder
    # the URI should include the target database and collection,
    # or set spark.mongodb.write.database / spark.mongodb.write.collection as well
    .config("spark.mongodb.write.connection.uri","your_uri")
    .config("spark.mongodb.write.convertJson","object_Or_Array_Only")
    .config("spark.jars.packages","org.mongodb.spark:mongo-spark-connector_2.12:10.2.0")
    .getOrCreate())

# Build a one-row DataFrame via SQL: SELECT "<extended-json>" AS _id
expr = f'"{_id}" as _id'

query = f'select {expr}'

df = spark.sql(query)

df.write.format("mongodb").mode("append").save()
  6. I do not recommend using convertJson : "any" as it converts everything to JSON. You might have numeric strings, and the connector will convert them to numbers.
  7. This is a workaround, but I came to it after digging into the connector code as well as the Mongo driver code. I’m not sure that the developers of the connector planned for it to work like this…

Thank you very much for the concat tip - yes, there is no need for struct/to_json/replace… the below worked:

  • concat("{ '$oid'", ":", " '", _id, "' }") (in PySpark)
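For anyone copying this, a sketch of the same thing with explicit literals (assuming _id is the plain hex string column):

from pyspark.sql.functions import concat, lit, col

# Build the extended-JSON text "{ '$oid' : '<hex>' }" around the hex string
df = df.withColumn("_id", concat(lit("{ '$oid' : '"), col("_id"), lit("' }")))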

However, this works consistently for me with the “any” option ONLY, and I don’t want to use “any”, as you suggested. Strangely, when I use “object_Or_Array_Only” I consistently get the error below: “Cannot cast into a BsonValue. StringType has no matching BsonValue. Error: String index out of range: 0”

This error comes up even if I remove the string data above - so I am not sure exactly which column in my DataFrame has the issue - but using “any” causes no issues.

I understand that using “any” can run into unpredictable issues, which I don’t like either. But it does work for ObjectId.

I wish there were an option created just to handle ObjectId, which is the more common scenario.
I can’t afford any data-type changes, but I’m not sure I have a clear path.

Thanks

This is caused by your dataset containing empty strings, which the connector does not know how to handle. Replace all occurrences of str == '' with str == ' ' or str == null/None.
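A sketch of one way to do that across all string columns before writing (assuming empty strings can safely become nulls in your data):

from pyspark.sql.functions import col, when, lit
from pyspark.sql.types import StringType

# Replace empty strings with nulls so "object_Or_Array_Only" does not fail on them
for field in df.schema.fields:
    if isinstance(field.dataType, StringType):
        df = df.withColumn(
            field.name,
            when(col(field.name) == "", lit(None)).otherwise(col(field.name)),
        )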


Thank you very much. I really appreciate you getting back to me.
In our case, I have built a dynamic PySpark engine that works on any incoming data, masks only the requested fields, and saves the result back to the DB.

So I don’t know the incoming schema - other than the set of fields that will be masked/manipulated.

We have a contract to retain all other data and data types as-is once masking is done on the requested fields - with no changes to the rest of the schema/data.

While the PySpark approach is proving very performance-friendly for large data sets, I have to think through how to detect string-type columns and add these expressions in my dynamic engine (a rough sketch is below).
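As a rough sketch of what I have in mind (only converting _id when it arrives as a plain string; convertJson is already enabled on the write):

from pyspark.sql.functions import concat, lit, col
from pyspark.sql.types import StringType

# If _id arrived as a plain hex string, wrap it as extended JSON so the
# connector writes it back as an ObjectId; otherwise leave the column alone
id_field = next((f for f in df.schema.fields if f.name == "_id"), None)
if id_field is not None and isinstance(id_field.dataType, StringType):
    df = df.withColumn("_id", concat(lit("{ '$oid' : '"), col("_id"), lit("' }")))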

All this just to get _id (String) to _id (ObjectId) :frowning:

I only hope the authors of the MongoDB Spark Connector will improve this, maybe with a new option that handles only the ObjectId scenario, etc. But I can’t thank you enough for coming forward and helping me with this.

You can close this case.