Getting NullPointerException while reading data from MongoDB - Spark Streaming (micro batch)

Hi,

I am using Spark 3.2.1 with the MongoDB Spark Connector (10.2.1) to stream directly from MongoDB.

Below is the error that I am getting:

   org.apache.spark.SparkException: Execution of the stream null failed. Please, fill a bug report in, and provide the full stack trace.
	at org.apache.spark.sql.execution.QueryExecution$.toInternalError(QueryExecution.scala:500)
	at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:324)
	at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:208)
Caused by: java.lang.NullPointerException

Below is my sample code:

    val streamingDataFrame = spark.readStream
      .format("mongodb")
      .option("spark.mongodb.connection.uri", uri)
      .option("spark.mongodb.change.stream.publish.full.document.only", "true")
      .option("spark.mongodb.collection", "plan.header")
      .load()

    val kk = streamingDataFrame.writeStream
      .option("checkpointLocation", "checkpointlocation")
      .foreachBatch { (result: DataFrame, batchId: Long) =>
        result.printSchema()
        result.show(10, false)
      }
      .start()

    kk.awaitTermination()


Hi @malay_tiwari,

Can you describe your setup in more detail - where is your Spark cluster hosted (local / cloud / a managed platform), and how are you using MongoDB (Community, EA, Atlas)?
Is the batch mode working okay?
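
If you have not tried it yet, a quick way to isolate the issue is a one-off batch read with the same connector options - the option keys below just mirror the snippet you posted, so adjust them to how you normally configure the database:

    // One-off batch read to verify the connector, URI and collection are reachable
    val batchDf = spark.read
      .format("mongodb")
      .option("spark.mongodb.connection.uri", uri)
      .option("spark.mongodb.collection", "plan.header")
      .load()

    batchDf.printSchema()
    batchDf.show(10, false)
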
How is your data flowing into MongoDB? Is there data that was updated recently? Streaming only works in that case, as described in the documentation (https://www.mongodb.com/docs/spark-connector/current/streaming-mode/streaming-read/): "The connector reads from your MongoDB deployment’s change stream. To generate change events on the change stream, perform update operations on your database."

It’s a managed service (AWS EMR).
It is a dev environment (the one I am using right now is a Community deployment), but we do have Atlas as well.
Yes, the data is getting updated frequently.
Data is updated via our web application.

So far we have been using Debezium connectors to push the data to Kafka, but now that MongoDB provides streaming, we wanted to try it out.

The problem is that as soon as I apply any action in the foreachBatch sink (further transformations are written in foreachBatch to load the data into a data warehouse), it gives me an NPE.
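
For context, the sink looks roughly like the sketch below - the added column and the Parquet write are only placeholders standing in for the real transformations and warehouse load:

    // Simplified sketch of the sink; the withColumn and the Parquet path are placeholders.
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.current_timestamp

    val query = streamingDataFrame.writeStream
      .option("checkpointLocation", "checkpointlocation")
      .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
        val transformed = batchDf.withColumn("load_ts", current_timestamp())   // placeholder transformation
        transformed.write
          .mode("append")
          .parquet(s"s3://my-bucket/plan_header/batch_$batchId")               // placeholder warehouse load
      }
      .start()

    query.awaitTermination()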