Getting started with MongoDB, PySpark, and Jupyter Notebook

Robert Walters


Jupyter Notebook is an open source web application that has become a game changer for data scientists and engineers. Its simple web UI makes it easy to create and share documents that contain live code, equations, visualizations, and narrative text. The Jupyter notebook has since evolved into JupyterLab, a web-based interactive development environment that takes notebooks to a whole new level by modularizing the environment, making it easy for developers to extend the platform, and adding new capabilities like a console, a command-line terminal, and a text editor.

Apache Spark is frequently used together with Jupyter notebooks. Spark is an open source, general-purpose cluster-computing framework and one of the most popular analytics engines for large-scale data processing. The key concept behind Spark is distributed computing: taking tasks that would normally consume massive amounts of compute resources on a single server and spreading the workload out across many worker nodes. It is the technical implementation of the English saying, "many hands make light work." Spark can efficiently consume data from a variety of data sources such as HDFS file systems and relational databases, and even from MongoDB via the MongoDB Spark Connector.

In this article, we will showcase how to leverage MongoDB data in your JupyterLab notebooks via the MongoDB Spark Connector and PySpark. We will load financial security data from MongoDB, calculate a moving average, and then update MongoDB with the new values. While you can read through this article and get the basic idea, if you’d like to get hands-on, all the Docker scripts and code are available in the GitHub repository, RWaltersMA/mongo-spark-jupyter. A special thanks to Andre Perez for his well-written article, “Apache Spark Cluster on Docker”; the Docker Compose scripts used in this article are based on the ones he provided there.

Getting started

Let’s start by building out an environment that consists of a MongoDB cluster, an Apache Spark deployment with one master and two worker nodes, and JupyterLab.


Figure 1: Components

To follow along, git clone the RWaltersMA/mongo-spark-jupyter repository, run “sh build.sh” to build the Docker images, and then run “sh run.sh” to stand up the environment shown in Figure 1.

The run.sh script runs the Docker Compose file, which creates a three-node MongoDB cluster configured as a replica set on port 27017. Spark is also deployed in this environment, with a master node located at port 8080 and two worker nodes listening on ports 8081 and 8082, respectively. The MongoDB cluster will be used both for reading data into Spark and for writing data from Spark back into MongoDB.

There are a variety of tool options for interacting with MongoDB. The mongo shell command-line tool has been the de facto standard since the inception of MongoDB itself. At the time of this writing, there is a new version of the MongoDB Shell called mongosh that is currently in preview. Mongosh addresses some of the limitations of the original shell, adding syntax highlighting, auto-complete, command history, and improved logging, to name a few. To download this new shell, visit the online mongo shell documentation.

To verify our MongoDB cluster is up and running we can connect to the default port 27017 using the mongo shell.


Figure 2: Mongosh shell tool connecting to the MongoDB cluster
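If you prefer to check the cluster from Python instead of the shell, a quick pymongo snippet can confirm that the replica set is healthy. This is a minimal sketch, not part of the repository scripts, assuming a recent version of pymongo is installed on the host and the first member is reachable on localhost:27017:

from pymongo import MongoClient

# Connect directly to one member of the replica set (host/port per the local Docker setup)
client = MongoClient("mongodb://localhost:27017/", directConnection=True)

# Print the state of each replica set member (e.g., PRIMARY, SECONDARY)
status = client.admin.command("replSetGetStatus")
for member in status["members"]:
    print(member["name"], member["stateStr"])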

Finally, we can verify that JupyterLab is up and running by navigating to http://localhost:8888.


Figure 3: Jupyter Lab web portal

To verify that our Spark master and workers are online, navigate to http://localhost:8080.


Figure 4: Spark master web portal on port 8080

Creating a moving average using PySpark

Now that our environment is up and running, it is waiting for work to do. In this example we are going to read stock data from MongoDB and calculate a moving average based on the price of each stock security. This new value will be written back to the database as a new field. The run.sh file grabbed a small database called Stocks from GitHub and restored it to the local MongoDB cluster. If you want to generate your own data, you can run the create-stock-data Python app in the DataGenerator directory of the git repository.

Let’s start by creating a new Python notebook in JupyterLab. To create a new notebook, click the Python3 icon in the Notebook section of the Launcher. This will provide you with a blank notebook as shown below:


Figure 5: New notebook

The MongoDB Connector for Spark can be used with Scala, Java, Python, and R. In this example we will use Python and the PySpark library. With PySpark, you work with specialized data structures called Resilient Distributed Datasets (RDDs). RDDs hide the complexity of transforming and distributing your data: when you run on a cluster, a scheduler automatically spreads the work across multiple nodes. The entry point of any PySpark program is a SparkSession object, which allows you to connect to a Spark cluster and create RDDs.

Let’s configure our Spark Connector to use the local MongoDB cluster as both input and output.

from pyspark.sql import SparkSession

spark = SparkSession.\
    builder.\
    appName("pyspark-notebook2").\
    master("spark://spark-master:7077").\
    config("spark.executor.memory", "1g").\
    config("spark.mongodb.input.uri","mongodb://mongo1:27017,mongo2:27018,mongo3:27019/Stocks.Source?replicaSet=rs0").\
    config("spark.mongodb.output.uri","mongodb://mongo1:27017,mongo2:27018,mongo3:27019/Stocks.Source?replicaSet=rs0").\
    config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:3.0.0").\
    getOrCreate()

Next, let’s load our MongoDB data into a data frame:

df = spark.read.format("mongo").load()

Now that the data is loaded, we can verify that the configuration worked by looking at the schema:

df.printSchema()


We can see that the tx_time field is loaded as a string. We can easily convert it to a timestamp by issuing a cast statement:

df = df.withColumn("tx_time", df.tx_time.cast("timestamp"))

Next, we can add a new "movingAverage" column that holds a moving average of the price, computed over a sliding window of neighboring rows for each company. To do this we leverage the PySpark Window function as follows:

from pyspark.sql.window import Window
from pyspark.sql import functions as F

movAvg = df.withColumn("movingAverage", F.avg("price")
    .over(Window.partitionBy("company_symbol").rowsBetween(-1, 1)))

To see our data with the new moving average column, we can issue a show() command:

movAvg.show()


Figure 6: JupyterLab output from movAvg.show() command
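Note that the window above is not ordered, so the rows that fall inside the frame depend on how the data arrives in each partition. If you want the moving average computed in time order (and depending on your Spark version, a row frame may also require an ordered window), you can order the window by tx_time. This is a sketch of that variation, not part of the original example; the variable names are illustrative:

from pyspark.sql.window import Window
from pyspark.sql import functions as F

# Order each company's rows by transaction time before applying the +/- 1 row frame
timeWindow = Window.partitionBy("company_symbol").orderBy("tx_time").rowsBetween(-1, 1)
movAvgOrdered = df.withColumn("movingAverage", F.avg("price").over(timeWindow))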

To update the data in our MongoDB cluster, we use the save method.

movAvg.write.format("mongo").option("replaceDocument", "true").mode("append").save()

Since we want to update the existing documents rather than insert new ones, we set the replaceDocument option to 'true'.
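To confirm the write, you can read the collection back through the connector and check that the new field is present. A minimal check, reusing the SparkSession configured earlier (the variable name updated is just illustrative):

# Re-read the collection and confirm the movingAverage field was written back
updated = spark.read.format("mongo").load()
updated.select("company_symbol", "price", "movingAverage").show(5)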

JupyterLab makes it easy to build out ad hoc queries and integrates easily with MongoDB data. A screenshot of the above example in JupyterLab is shown in Figure 7.


Figure 7: Code sample in the JupyterLab UI

Filtering the data set using the aggregation pipeline

In this example we loaded the entire collection, created a moving average of the data, and updated the entire dataset with our new calculation. Your particular use case may only need a subset of the data, and the Spark Connector supports providing an aggregation pipeline to be used as the source query. For example, if we wanted to focus the Spark calculation on the ITCHY ACRE CORPORATION, we could define the pipeline in the pipeline option as follows:

pipeline = "{'$match': {'company_symbol': 'IAC'}}"

df = spark.read.format("mongo").option("pipeline", pipeline).load()
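The pipeline option can also accept a list of stages, which lets MongoDB do some of the filtering and shaping before the data ever reaches Spark. Here is a sketch that matches on the same symbol and projects only the fields used in the moving average calculation; adjust the projected field list to your own schema:

# Match one symbol and project only the fields needed for the calculation
pipeline = "[{'$match': {'company_symbol': 'IAC'}}, {'$project': {'company_symbol': 1, 'price': 1, 'tx_time': 1}}]"
df = spark.read.format("mongo").option("pipeline", pipeline).load()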

Summary

In this article we created a JupyterLab notebook, loaded MongoDB data, computed a moving average, and updated the collection with the new data. This simple example shows how easy it is to integrate MongoDB data within your Spark data science application. For more information on the Spark Connector, check out the online documentation. If you have questions, feel free to ask them in the MongoDB community pages. The MongoDB Connector for Spark is open source under the Apache license, and comments and pull requests are encouraged and welcome. Happy data exploration!