I'm having a performance problem when writing with the MongoDB Spark connector.
I want to write 0.3M records, each made up of a string and a 4 kB hex binary value, through the MongoDB Spark connector.
spark = create_c3s_spark_session(app_name, spark_config=[
    ("spark.executor.cores", "1"),
    ("spark.executor.memory", "6g"),
    ("spark.executor.instances", "50"),  # tested with both "1" and "50"
    ("spark.archives", f'{GIT_SOURCE_BASE}/{pyenv path}.tar.gz#environment'),
    ("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1"),
    ("spark.mongodb.output.uri", default_mongodb_uri),
], c3s_username=c3s_username)
However, the write takes 30 hours with the number of executor instances set to 1, and it still takes 30 hours with 50.
writer = (
    list_df.write.format("mongo")
    .mode("append")
    .option("database", mongo_config.get('database'))
    .option("collection", f'{collection_name}')
)
writer.save()
I don't understand why: the total data size is only 1.2 GB, yet the write takes the same time even when the number of instances is increased.
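To put these numbers in perspective, here is a rough calculation of the effective write throughput. It is only a sketch: it assumes that "0.3M" means 300,000 records and that "4 kB" means 4,000 bytes per record, and the variable names are purely illustrative.

```python
# Back-of-the-envelope check of the numbers in this post.
# Assumptions: "0.3M" means 300,000 records and "4 kB" means 4,000 bytes.
NUM_RECORDS = 300_000
BYTES_PER_RECORD = 4_000
WRITE_HOURS = 30

total_bytes = NUM_RECORDS * BYTES_PER_RECORD
write_seconds = WRITE_HOURS * 3600

# Effective throughput over the whole 30-hour write.
records_per_second = NUM_RECORDS / write_seconds
bytes_per_second = total_bytes / write_seconds

print(f"total size: {total_bytes / 1e9:.1f} GB")
print(f"throughput: {records_per_second:.1f} records/s, "
      f"{bytes_per_second / 1e3:.1f} kB/s")
```

Under those assumptions this works out to fewer than 3 records per second in total, which is far lower than the data volume alone would suggest.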
The strange thing is that when I test with a 400 B hex binary instead of 4 kB, the write completes within an hour, and increasing the number of instances clearly reduces the time required.
Is there a solution?
What can I do to address this performance issue?