I’m wondering if anyone has come up with a solution for this. I’m creating an AWS Glue job using the PySpark driver and a connection to MongoDB Atlas. I was able to successfully create a DataFrame and build an aggregation pipeline to retrieve data. The one issue I’m facing is that the IDs in my collection come out in hex (BinData 3) format, while I need them in JUUID format. In pymongo I can set the UUID representation, but I’ve had no such luck with Spark. Has anyone come across a solution for this?
Hi Patrick! First, welcome to the MongoDB community.
Just to understand what you were trying to do:
- Currently you are able to retrieve data and create a PySpark DataFrame:
df = spark.read.format("com.mongodb.spark.sql.DefaultSource") \
.option("uri", "mongodb://...") \
.option("database", "mydb") \
.option("collection", "mycollection") \
.load()
- After this, for the relevant column, you want to convert the data from hex (BinData 3) format to JUUID format?
For this we can use a function called XX (a placeholder name) and transform the DataFrame in the following way:
df = df.withColumn("juuid_column", XX(df["hex_column"]))
Does this summarize what you are trying to do?
Thanks for the reply. Yes, this is exactly what I’m trying to accomplish. I have been able to successfully retrieve data into a DataFrame, but the hex columns come back in BinData 3 format.
From your code snippet above, it appears that I would need to create a function XX that would convert this data?
Hello Patrick,
I do think you will have to write a custom function to convert this data. I didn’t find a pre-existing Spark SQL function for this. Let me know if that unblocks you.
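As a starting point, here is a minimal sketch of what such a function could look like. It assumes the IDs were written by the legacy Java driver (BinData subtype 3), where each 8-byte half of the UUID is stored in reversed byte order, and that the column reaches Python as the raw 16 bytes or as a 32-character hex string. If your connector version surfaces the value as a struct instead, pass its binary field. The names java_legacy_bindata_to_uuid and add_juuid_column are made up for this example, not part of any library.

```python
import uuid


def java_legacy_bindata_to_uuid(raw):
    """Convert a legacy Java UUID (BinData subtype 3) to its canonical string form.

    Assumption: `raw` is the 16-byte payload (bytes or bytearray) or a
    32-character hex string of that payload. None is passed through.
    """
    if raw is None:
        return None
    if isinstance(raw, str):
        raw = bytes.fromhex(raw)
    raw = bytes(raw)
    if len(raw) != 16:
        raise ValueError("expected 16 bytes, got %d" % len(raw))
    # The legacy Java encoding stores each 8-byte half reversed,
    # so reverse both halves to get standard UUID byte order.
    return str(uuid.UUID(bytes=raw[0:8][::-1] + raw[8:16][::-1]))


def add_juuid_column(df, src_col, dst_col):
    """Hypothetical helper: wrap the converter in a Spark UDF and apply it."""
    # Imported here so the converter above can be tested without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    to_juuid = udf(java_legacy_bindata_to_uuid, StringType())
    return df.withColumn(dst_col, to_juuid(df[src_col]))
```

With this in place, the earlier line would become df = add_juuid_column(df, "hex_column", "juuid_column"). A Python UDF is slower than built-in Spark functions, so it is worth checking a few converted values against what your Mongo shell shows as JUUID before running it on the whole collection.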
Best,