We are pleased to announce that PyMongoArrow, a Python library for data analysis with MongoDB, is now generally available.
PyMongoArrow lets you easily and efficiently move data between MongoDB and other popular analytics tools. The library is built on top of PyMongo, MongoDB’s popular Python driver for synchronous programming.
Why we built PyMongoArrow
Today, PyMongoArrow is the recommended way to materialize MongoDB query result sets as contiguous-in-memory, typed arrays suited for in-memory analytical processing applications. It currently supports exporting MongoDB data into Pandas DataFrames, NumPy arrays, and Apache Arrow tables.
Before MongoDB created PyMongoArrow, it was possible to move data out of MongoDB into other analytics tools and systems, but there wasn’t a unified tool for working with the variety of data formats commonly used for analysis. Because different data analysts and developers may have different approaches and use different formats, this could sometimes interrupt collaboration and create a bottleneck in teams’ analytics pipelines.
PyMongoArrow solves these challenges for our users. While PyMongoArrow has been available in Public Preview since 2021, we have now made it generally available after adding features to ensure the best user experience.
Why use PyMongoArrow
The PyMongoArrow library integrates easily into your existing analytics pipeline. Because it is built on top of PyMongo, it extends all of that library's functionality, letting you work with MongoDB data in an easy and performant manner when operating at scale.
What can PyMongoArrow do?
Read data into Pandas DataFrames, NumPy arrays, and Arrow tables
You can connect to your MongoDB instance through the PyMongoArrow library and use the following functions to output the query result sets into the desired data format:
find_pandas_all(): lets you output MongoDB query result sets as a Pandas DataFrame
find_arrow_all(): lets you output MongoDB query result sets as an Arrow table
find_numpy_all(): lets you output MongoDB query result sets as a NumPy array
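As a minimal sketch, reading the same query result set into each of the three formats might look like the following. The connection string, the test.taxis namespace, and the trip_distance field are placeholders for illustration; substitute your own deployment and schema.

```python
from pymongo import MongoClient
from pymongoarrow.api import find_arrow_all, find_numpy_all, find_pandas_all

# Connect to a MongoDB deployment (placeholder connection string).
client = MongoClient("mongodb://localhost:27017")
collection = client.test.taxis

# Filter documents server-side before materializing them client-side.
query = {"trip_distance": {"$gt": 2.0}}

df = find_pandas_all(collection, query)     # Pandas DataFrame
table = find_arrow_all(collection, query)   # Apache Arrow table
arrays = find_numpy_all(collection, query)  # dict of field name -> NumPy array
```

PyMongoArrow can also patch these methods directly onto PyMongo's Collection class via pymongoarrow.monkey.patch_all(), after which you can call collection.find_pandas_all(query) and friends directly.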
Write to other data formats
Not only does PyMongoArrow allow you to output MongoDB query result sets as Pandas DataFrames, NumPy arrays, and Arrow tables, but it also allows you to write data to many other data formats. Once a MongoDB query result set has been loaded as an Arrow table, it can be easily written to any of the other formats supported by PyArrow, such as Parquet or CSV.
Write data back to MongoDB
PyMongoArrow not only enables you to perform analytics tasks efficiently, but it also lets you write the analyzed data back into your MongoDB database so that your insights are persisted.
Result sets that have been loaded as Arrow’s table type, Pandas’ DataFrame type, or NumPy’s array type can be easily written to your MongoDB database using the write() function.
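A sketch of writing a DataFrame back to MongoDB with write() follows; the same call accepts Arrow tables and NumPy arrays. The connection string, the test.fare_summaries namespace, and the sample data are placeholders.

```python
import pandas as pd
from pymongo import MongoClient
from pymongoarrow.api import write

client = MongoClient("mongodb://localhost:27017")

# A DataFrame of analyzed results (placeholder data).
summary = pd.DataFrame({"zone": ["A", "B"], "avg_fare": [12.5, 9.75]})

# write() inserts the tabular data into the target collection,
# one document per row.
write(client.test.fare_summaries, summary)
```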
Use MongoDB's powerful aggregation pipeline with PyMongoArrow
In addition to basic find operations, you can also take advantage of MongoDB's powerful aggregation pipeline for even more complex analytical use cases.
Simply use the aggregate_pandas_all() function to query your MongoDB data using an aggregation pipeline and return the result sets as Pandas DataFrames. You can also use the aggregate_numpy_all() and aggregate_arrow_all() functions to return the result sets as NumPy arrays and Arrow tables, respectively.
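As an illustration (again with a placeholder connection string and hypothetical fare_amount and passenger_count fields), running an aggregation pipeline and landing the result in a DataFrame might look like:

```python
from pymongo import MongoClient
from pymongoarrow.api import aggregate_pandas_all

client = MongoClient("mongodb://localhost:27017")
collection = client.test.taxis

# Compute the average fare per passenger count, entirely server-side.
pipeline = [
    {"$match": {"fare_amount": {"$gt": 0}}},
    {"$group": {"_id": "$passenger_count",
                "avg_fare": {"$avg": "$fare_amount"}}},
]

df = aggregate_pandas_all(collection, pipeline)
```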
Get started today
We have plenty of resources available to help you get started quickly with the PyMongoArrow library. Here are some great resources: