How to write ObjectId value using Spark connector 10.1 using Pyspark?

Lior_Harel · September 18, 2023, 7:13pm

@JTBS, I'll try to break everything down:

I don't think you need to wrap the column in a struct at all if you use the string '{ "$oid" : "xxxxxxxxx" }'. I generated a SQL expression, as seen in the code example below. I'm not sure this will work with PySpark's built-in functions.
What are your concerns about loading OIDs? Spark and MongoDB types do not map one-to-one, and Spark has no ObjectId type.
You are right, there's a discrepancy between the documentation and the actual behavior: object_Or_Array_Only works, while objectOrArrayOnly does not.
To answer your question to the best of my understanding:

_id = "xxxxx"  # placeholder for the ObjectId's hex string
formatted_id = "{ '$oid' : '" + _id + "' }"
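The string construction above can be wrapped in a small helper that also checks the input first. This is only a sketch: the name format_object_id and the validation step are my own additions, not part of the connector. It assumes the id arrives as the usual 24-character hex string.

```python
import re

# An ObjectId is 12 bytes, i.e. 24 hexadecimal characters.
_OBJECT_ID_HEX = re.compile(r"[0-9a-fA-F]{24}")


def format_object_id(hex_id: str) -> str:
    """Return the extended JSON string for hex_id (hypothetical helper)."""
    if not _OBJECT_ID_HEX.fullmatch(hex_id):
        raise ValueError(f"not a 24-character hex ObjectId: {hex_id!r}")
    # Same shape as formatted_id above: single quotes inside the JSON,
    # so the value can sit inside a double-quoted SQL string literal.
    return "{ '$oid' : '" + hex_id + "' }"


formatted_id = format_object_id("650898287d503960a631ccac")
```

Failing early here seems preferable to discovering a malformed id only after it has been written to the collection as a plain string.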

code example:

from pyspark.sql import SparkSession

_id = "{ '$oid' : '650898287d503960a631ccac' }"

spark = (
    SparkSession
    .builder
    .config("spark.mongodb.write.connection.uri","your_uri")
    .config("spark.mongodb.write.convertJson","object_Or_Array_Only")
    .config("spark.jars.packages","org.mongodb.spark:mongo-spark-connector_2.12:10.2.0")
    .getOrCreate())

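# Build a SQL expression that yields the extended JSON string as the _id column.
expr = f'"{_id}" as _id'
query = f'select {expr}'
df = spark.sql(query)

# With convertJson enabled, the connector parses the string into an ObjectId on write.
# The target database and collection must also be configured,
# e.g. with .option("database", ...) and .option("collection", ...).
df.write.format("mongodb").mode("append").save()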

I do not recommend using convertJson: "any", as it converts every string that parses as JSON. If you have numeric strings, the connector will convert them to numbers.
This is a workaround, but I arrived at it after digging into the connector code as well as the MongoDB driver code. I'm not sure the connector's developers planned for it to work like this...
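To illustrate the difference between the two modes, here is a rough pure-Python model. It is not the connector's code: the function convert_value and the mode labels are hypothetical, and Python's json module stands in for the driver's parser (which is more lenient, for example about single quotes).

```python
import json


def convert_value(value, mode):
    """Rough model of the two conversion modes (illustration only)."""
    if not isinstance(value, str):
        return value
    if mode not in ("any", "object_or_array_only"):
        return value  # conversion disabled
    text = value.strip()
    # In the restricted mode, only strings that look like a JSON
    # object or array are handed to the parser.
    if mode == "object_or_array_only" and not text.startswith(("{", "[")):
        return value
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return value  # not valid JSON, keep the original string


zip_code = "12345"
as_any = convert_value(zip_code, "any")
as_restricted = convert_value(zip_code, "object_or_array_only")
```

In this model the numeric string becomes a number under "any" but stays a string under the restricted mode, which is the behavior I would want for fields such as zip codes or phone numbers.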