PySpark and MongoDB connector question

Ewin_Hong · March 21, 2022, 5:03am

I am trying to see if there is a way to do the equivalent of this from within a notebook:

pyspark --conf "spark.mongodb.input.uri=mongodb://localhost:27017/db.coll?readPreference=primaryPreferred" \
        --conf "spark.mongodb.output.uri=mongodb://localhost:27017/db.coll" \
        --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1

From my notebook, I have this:

from pyspark.sql import SparkSession

# Parentheses let the chained builder calls span several lines.
spark = (
    SparkSession.builder
    .appName("spark")
    .config("spark.driver.memory", "32g")
    .config("spark.mongodb.input.uri", "mongodb://localhost:27017/collection.coll")
    .config("spark.mongodb.output.uri", "mongodb://localhost:27017/collection.coll")
    .config("spark.mongodb.output.database", "db")
    .enableHiveSupport()
    .getOrCreate()
)

I create my DataFrame with no issue, but I run into a problem when I try to write it to MongoDB with this command:
df2.write.format("mongo").mode("overwrite").save()

Running it directly from the pyspark shell:
>>> df2.write.format("mongo").mode('overwrite').save()
22/03/21 00:14:09 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: ID, SPEED, TRAVEL_TIME, STATUS, DATA_AS_OF, LINK_ID, LINK_POINTS, ENCODED_POLY_LINE, ENCODED_POLY_LINE_LVLS, OWNER, TRANSCOM_ID, BOROUGH, LINK_NAME
 Schema: ID, Speed, TravelTime, Status, timedate, LinkId, LinkPoints, EncodedLinkPoints, EncodedPolyLineLvls, Owner, TranscomId, Borough, Link_Name
Expected: TravelTime but found: TRAVEL_TIME

Any suggestions?

Robert_Walters · March 21, 2022, 1:50pm

The issue is the "ClassNotFoundException", which most likely means your MongoDB Spark Connector isn't loading.

One thing to try: use spark-submit instead of pyspark.
./bin/spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1 yourpysparkapp.py

If you are using a Spark platform like Databricks, you need to load the Mongo Spark Connector library (or import the JAR file) so that you can use it within your notebook. See "Workspace libraries" in the Databricks on AWS documentation.
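
If you would rather stay inside the notebook, one option that may work is to set spark.jars.packages on the session builder, which is the configuration-property form of --packages. The sketch below is a minimal illustration, not a tested setup. connector_conf and build_session are hypothetical helper names, the URI is taken from the pyspark command in the question, and it assumes no SparkSession is already running in the kernel (otherwise getOrCreate() returns the existing session and the package is not loaded).

```python
# Sketch only: assumes no SparkSession/JVM has been started in this kernel yet.
CONNECTOR = "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1"
MONGO_URI = "mongodb://localhost:27017/db.coll"  # database "db", collection "coll"


def connector_conf(uri=MONGO_URI, package=CONNECTOR):
    """Return the Spark settings that load the connector and point it at MongoDB."""
    return {
        "spark.jars.packages": package,  # property form of --packages
        "spark.mongodb.input.uri": uri,
        "spark.mongodb.output.uri": uri,
    }


def build_session(app_name="spark"):
    """Create a SparkSession with the connector settings applied (hypothetical helper)."""
    # Imported here so connector_conf() can be inspected without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in connector_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With that in place, spark = build_session() followed by df2.write.format("mongo").mode("overwrite").save() should find the connector classes, provided the kernel was restarted first.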

Ewin_Hong · March 23, 2022, 12:18am

Thanks for the suggestion, @Robert_Walters.

Spark is on my local machine, and I am using a notebook.

I found a way:

import findspark
findspark.init()
findspark.add_packages("org.mongodb.spark:mongo-spark-connector_2.12:3.0.1")
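
As far as I can tell, this works because add_packages puts a --packages flag into the PYSPARK_SUBMIT_ARGS environment variable, which PySpark reads when it launches the JVM. The sketch below shows that idea with only the standard library. add_packages_to_submit_args is a hypothetical name of my own, not part of findspark, and the env parameter exists only so the helper can be tried on a plain dict.

```python
import os

CONNECTOR = "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1"


def add_packages_to_submit_args(packages, env=os.environ):
    """Prepend a --packages flag to PYSPARK_SUBMIT_ARGS and return the new value."""
    if isinstance(packages, str):
        packages = [packages]
    # "pyspark-shell" has to stay at the end of the argument string.
    existing = env.get("PYSPARK_SUBMIT_ARGS", "") or "pyspark-shell"
    env["PYSPARK_SUBMIT_ARGS"] = "--packages {} {}".format(",".join(packages), existing)
    return env["PYSPARK_SUBMIT_ARGS"]
```

Either way, it has to run before the first SparkSession is created in the notebook, since the variable is only read when the JVM starts.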
