Pyspark UUID representation

I’m wondering if anyone has come up with a solution for this. I’m creating an AWS Glue job using the PySpark driver with a connection to Atlas MongoDB. I was able to successfully create a DataFrame and build an aggregation pipeline to retrieve data. The one issue I’m facing is that the ids in my collection are output in hex (BinData 3) format, while I need them in JUUID. In pymongo I can add a tag for the UUID representation, but no such luck with Spark. Has anyone come across a solution for this?

Hi Patrick! Firstly welcome to the MongoDB community.

Just to understand what you were trying to do:

  1. Currently you are able to retrieve data and create a PySpark DataFrame:
df = spark.read.format("com.mongodb.spark.sql.DefaultSource") \
    .option("uri", "mongodb://...") \
    .option("database", "mydb") \
    .option("collection", "mycollection") \
    .load()
  2. After this, for the relevant column, you want to convert the data from hex (BinData 3) format to JUUID format?

For this we can use a function, call it XX, and manipulate the DataFrame in the following way:
df = df.withColumn("juuid_column", XX(df["hex_column"]))

Does this summarize what you are trying to do?

Thanks for the reply - Yes, this is exactly what I’m trying to accomplish. I have been able to successfully retrieve data into a DataFrame, but the hex columns are coming back in BinData 3 format.

From your code snippet above, it appears that I would need to create a function XX to convert this data?

Hello Patrick,
I do think you will have to write a custom function to convert this data; I didn’t find a pre-existing Spark SQL function for it. Let me know if that would unblock you.
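As a rough sketch, a Python UDF along these lines might work. It assumes the binary values use the Java legacy (BinData subtype 3) byte order, where each 8-byte half of the UUID is stored little-endian; note that C# legacy uses a different reordering, and the column and function names here are just placeholders:

```python
import uuid

def bindata3_to_juuid(raw):
    """Decode 16 bytes stored as BinData subtype 3 (Java legacy UUID)
    into a standard UUID string.

    Assumption: the Java driver's legacy encoding, which writes each
    8-byte half of the UUID little-endian, so we reverse both halves
    before decoding.
    """
    if raw is None:
        return None
    b = bytes(raw)  # Spark hands binary columns to Python as bytearray
    if len(b) != 16:
        raise ValueError("expected 16 bytes, got %d" % len(b))
    # Reverse the first 8 bytes and the last 8 bytes independently.
    return str(uuid.UUID(bytes=b[7::-1] + b[:7:-1]))

# To apply it to a DataFrame column, register it as a UDF
# (column names here are placeholders):
#
# from pyspark.sql.functions import udf
# from pyspark.sql.types import StringType
#
# juuid = udf(bindata3_to_juuid, StringType())
# df = df.withColumn("juuid_column", juuid(df["hex_column"]))
```

If your data was written with a different legacy representation (e.g. the C# driver), you would need to adjust the byte reordering accordingly.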

Best,