PyMongoArrow: Bridging the Gap Between MongoDB and Your Data Analysis App
Mark Smith
Published Oct 15, 2021 • Updated May 20, 2022
MongoDB has always been a great database for data science and data analysis, and that's because you can:
- Import data without a fixed schema.
- Clean it up within the database.
- Listen in real-time for updates (a very handy feature that's used by our MongoDB Kafka Connector).
- Query your data with the super-powerful and intuitive Aggregation Framework.
But MongoDB is a general-purpose database, not a data analysis tool, so a common pattern when analysing data stored in MongoDB is to extract the results of a query into a Numpy array or Pandas dataframe, and to run complex and potentially long-running analyses using the toolkit those frameworks provide. Until recently, converting large amounts of BSON data, as provided by MongoDB, into these data structures has been slower than we'd like.
Fortunately, MongoDB recently released PyMongoArrow, a Python library for efficiently converting the result of a MongoDB query into the Apache Arrow data model. If you're not aware of Arrow, you may now be thinking, "Mark, how does converting to Apache Arrow help me with my Numpy or Pandas analysis?" The answer is: Conversion between Arrow, Numpy, and Pandas is super efficient, so it provides a useful intermediate format for your tabular data. This way, we get to focus on building a powerful tool for mapping between MongoDB and Arrow, and leverage the existing PyArrow library for integration with Numpy and Pandas.
You'll need a recent version of Python (I'm using 3.8) with pip available. You can use conda if you like, but PyMongoArrow is released on PyPI, so you'll still need to use pip to install it into your conda Python environment.
This tutorial was written for PyMongoArrow v0.1.1.
In this tutorial, I'm going to be using a sample database you can install when creating a cluster hosted on MongoDB Atlas. The database I'll be using is the "sample_weatherdata" database. You'll access this with a `mongodb+srv` URI, so you'll need to install PyMongo with the "srv" extra, like this:
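The exact install command isn't shown above, but given the packages this tutorial uses and the version noted earlier, it would look something like this (the `pandas` package is needed for the `find_pandas_all` examples later; the version pin is optional):

```shell
python -m pip install "pymongo[srv]" "pymongoarrow==0.1.1" pandas
```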
Useful Tip: If you just run `pip`, you may end up using a copy of `pip` that was installed for a different version of Python than the one you're using. For some reason, the `PATH` getting messed up this way happens more often than you'd think. A solution to this is to run pip via Python, with the command `python -m pip`. That way, it'll always run the version of `pip` that's associated with the version of Python you invoked. This is now the officially recommended way to run `pip`.
You'll also need a MongoDB cluster set up with the sample datasets imported. Follow these instructions to import them into your MongoDB cluster and then set an environment variable, `MDB_URI`, pointing to your database. It should look like the line below, but with the URI you copy out of the Atlas web interface. (Click the "Connect" button for your cluster.)
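It would be a line of this shape; the host and credentials below are placeholders, so paste your real URI from the Atlas "Connect" dialog instead:

```shell
# Placeholder values; copy the real URI from the Atlas web interface
export MDB_URI='mongodb+srv://<username>:<password>@<cluster-host>/sample_weatherdata?retryWrites=true&w=majority'
```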
A sample document from the "data" collection looks like this:
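The full document contains many more fields; an abridged sketch showing just the fields this tutorial uses (the values here are illustrative, not copied from the real dataset) has this shape:

```json
{
  "_id": { "$oid": "..." },
  "ts": { "$date": "1984-03-05T15:00:00.000Z" },
  "wind": {
    "direction": { "angle": 290, "quality": "1" },
    "speed": { "rate": 14.9, "quality": "1" }
  }
}
```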
To keep things simpler in this tutorial, I'll ignore all the fields except for "ts," "wind," and the "_id" field.
I set the `MDB_URI` environment variable, installed the dependencies above, and then fired up a new Python 3 Jupyter Notebook. I've put the notebook on GitHub, if you want to follow along, or run it yourself.
I added the following code to a cell at the top of the file to import the necessary modules, and to connect to my database:
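The cell itself isn't reproduced above; a minimal reconstruction looks like the sketch below. `patch_all()` is the PyMongoArrow call that attaches the `find_*_all` and `aggregate_*_all` methods to PyMongo's `Collection` class:

```python
import os

import pymongo
from pymongoarrow.monkey import patch_all

# Add the find_*_all and aggregate_*_all methods to pymongo's Collection
patch_all()

# MongoClient doesn't connect until the first operation is run
client = pymongo.MongoClient(os.environ["MDB_URI"])
database = client.get_database("sample_weatherdata")
collection = database.get_collection("data")
```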
If the data you wish to convert to Arrow, Pandas, or Numpy data tables is already flat—i.e., the fields are all at the top level of your documents—you can use the `find_arrow_all`, `find_pandas_all`, or `find_numpy_all` methods to query your collection and return the appropriate data structure.
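For example, a `find_pandas_all` call over the whole collection might look like the sketch below (the `Schema` argument is explained in a moment):

```python
import os
from datetime import datetime

import pymongo
from pymongoarrow.api import Schema
from pymongoarrow.monkey import patch_all

patch_all()
collection = pymongo.MongoClient(os.environ["MDB_URI"]).sample_weatherdata.data

# An empty filter matches every document; the schema picks out just "ts"
df = collection.find_pandas_all({}, schema=Schema({"ts": datetime}))
print(df)
```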
The resulting dataframe contains 10000 rows × 1 columns.
The first argument to `find_pandas_all` is the `filter` argument. I'm interested in all the documents in the collection, so I've left it empty. The documents in the data collection are quite nested, so the only real value I can access with a find query is the timestamp of when the data was recorded, the "ts" field. Don't worry—I'll show you how to access the rest of the data in a moment!
Because Arrow tables (and the other data types) are strongly typed, you'll also need to provide a Schema to map from MongoDB's permissive dynamic schema into the types you want to handle in your in-memory data structure.
A `Schema` is a mapping of the field name to the appropriate type to be used by Arrow, Pandas, or Numpy. At the current time, these types are 64-bit ints, 64-bit floating point numbers, and datetimes. The easiest way to specify these is with the native Python types `int` and `float`, and with `datetime.datetime`. Any fields in the document that aren't listed in the schema will be ignored.
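A sketch of a schema covering the fields used in this tutorial, assuming the native-Python-type mapping described above:

```python
from datetime import datetime

from pymongoarrow.api import Schema

# Fields not listed here are simply dropped during conversion
schema = Schema({
    "ts": datetime,        # BSON dates, converted to datetimes
    "windDirection": int,  # 64-bit integers
    "windSpeed": float,    # 64-bit floating point numbers
})
```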
PyMongoArrow currently hijacks the `projection` parameter to the `find_*_all` methods, so unfortunately, you can't write a projection to flatten the structure at the moment.
MongoDB documents are very flexible, and can support nested arrays and documents. Although Apache Arrow also supports nested lists, structs, and dictionaries, Numpy arrays and Pandas dataframes, in contrast, are tabular or columnar data structures. There are plans to support mapping to the nested Arrow data types in future, but at the moment, only scalar values are supported with all three libraries. So in all these cases, it will be necessary to flatten the data you are exporting from your documents.
To project your documents into a flat structure, you'll need to use the more powerful `aggregate_*_all` methods that PyMongoArrow adds to your PyMongo `Collection` objects.
In an aggregation pipeline, you can add a `$project` stage to your query to project the nested fields you want in your table to top-level fields in the aggregation result.
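As a sketch, the stage would look something like this as a Python dict, using the field names that appear in the results later in this tutorial:

```python
# A $project stage that pulls nested wind fields up to top-level names
project_stage = {
    "$project": {
        "_id": True,
        "ts": True,
        "windDirection": "$wind.direction.angle",
        "windSpeed": "$wind.speed.rate",
    }
}
```

In a real pipeline, this stage can follow a `$match` stage, or stand alone to project every document in the collection.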
In order to test my `$project` stage, I first ran it with the standard PyMongo `aggregate` function. I converted it to a `list` so that Jupyter would display the results.
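A reconstruction of that check might look like the following. Since the `_id` value from the original notebook isn't shown here, this sketch looks up an arbitrary document's `_id` rather than hard-coding one:

```python
import os

import pymongo

collection = pymongo.MongoClient(os.environ["MDB_URI"]).sample_weatherdata.data

# Grab an arbitrary _id to match a single document against
doc_id = collection.find_one({}, {"_id": True})["_id"]

pipeline = [
    {"$match": {"_id": doc_id}},
    {"$project": {
        "_id": True,
        "ts": True,
        "windDirection": "$wind.direction.angle",
        "windSpeed": "$wind.speed.rate",
    }},
]
# Wrap in list() so Jupyter displays the cursor's contents
list(collection.aggregate(pipeline))
```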
Because I've matched a single document by "_id," only one document is returned, but you can see that the `$project` stage has mapped `$wind.direction.angle` to the top-level "windDirection" field in the result, and the same with `$wind.speed.rate` and "windSpeed" in the result.
I can take this `$project` stage and use it to flatten all the results from an aggregation query, and then provide a schema to identify "windDirection" as an integer value, and "windSpeed" as a floating point number, like this:
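Putting it together, a sketch of that call, assuming the connection setup from earlier (`aggregate_pandas_all` is the method PyMongoArrow patches onto `Collection`):

```python
import os

import pymongo
from pymongoarrow.api import Schema
from pymongoarrow.monkey import patch_all

patch_all()
collection = pymongo.MongoClient(os.environ["MDB_URI"]).sample_weatherdata.data

pipeline = [
    {"$project": {
        "_id": True,
        "ts": True,
        "windDirection": "$wind.direction.angle",
        "windSpeed": "$wind.speed.rate",
    }}
]

# Only the two flattened fields are listed in the schema, so the
# resulting dataframe has exactly those columns
schema = Schema({"windDirection": int, "windSpeed": float})
df = collection.aggregate_pandas_all(pipeline, schema=schema)
print(df)
```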
The resulting dataframe contains 10000 rows × 2 columns.
There are only 10000 documents in this collection, but some basic benchmarks I wrote show this to be around 20% faster than working directly with PyMongo. With larger datasets, I'd expect the difference in performance to be more significant. It's early days for the PyMongoArrow library, and so there are some limitations at the moment, such as the ones I've mentioned above, but the future looks bright for this library in providing fast mappings between your rich, flexible MongoDB collections and any in-memory analysis requirements you might have with Arrow, Pandas, or Numpy.
If you're planning to do lots of analysis of data that's stored in MongoDB, then make sure you're up on the latest features of MongoDB's powerful aggregation framework. You can do many things within the database, so you may not need to export your data at all. You can connect to secondary servers in your cluster to reduce load on the primary for analytics queries, or even have dedicated analytics nodes for running these kinds of queries. Check out MongoDB 5.0's new window functions, and if you're working with time series data, you'll definitely want to know about MongoDB 5.0's new time series collections.