Hey folks,
I am very new to MongoDB. My use case involves many collections, each containing many documents (1000+). What is the best way to run analytics at scale? The options I see so far are:
1. Use aggregation pipelines in MongoDB
2. Use PyMongoArrow (but I am not sure how that interfaces with Motor)
Am I missing any options? Also, which is most recommended for best practices and scalability? If it is option 2, how can I make it work with Motor?
Thank you.
Hello, we do not yet support async clients in PyMongoArrow; this feature is tracked at https://jira.mongodb.org/browse/ARROW-198.
Noted. What is the best option to run analytics across multiple collections and documents?
Using pipelines in MongoDB, as you suggested.
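To illustrate, here is a minimal sketch of a pipeline that runs analytics across two collections. All names (`orders`, `customers`, `customer_id`, `amount`) are hypothetical, not from your schema; the Motor call is shown in comments because it needs a running server and an async context.

```python
# Hypothetical example: per-customer analytics across two collections
# using a $lookup join. Collection and field names are illustrative.
pipeline = [
    # Join each order with its matching customer document.
    {"$lookup": {
        "from": "customers",
        "localField": "customer_id",
        "foreignField": "_id",
        "as": "customer",
    }},
    # Flatten the joined array so each order has one customer.
    {"$unwind": "$customer"},
    # Total spend and order count per customer.
    {"$group": {
        "_id": "$customer._id",
        "total": {"$sum": "$amount"},
        "order_count": {"$sum": 1},
    }},
    # Highest-spending customers first.
    {"$sort": {"total": -1}},
]

# With Motor (async), assuming `db` is an AsyncIOMotorDatabase:
#   cursor = db.orders.aggregate(pipeline)
#   results = await cursor.to_list(length=None)
```

Because the pipeline executes server-side, only the grouped results cross the network, which is what makes this approach scale well.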
Got it. As a follow-up, when should someone use PyMongoArrow for data analytics/transformations versus MongoDB's aggregation pipelines? Is it only when the input is stored as Pandas/NumPy/Apache Arrow, or are there other reasons to use PyMongoArrow (efficiency, additional capabilities, etc.)? Thanks
We have a comparison page.
In general, PyMongoArrow is faster and uses less memory when dealing with larger, un-nested documents. We will continue to improve PyMongoArrow over time.
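As a rough sketch of that workflow: you declare a schema of flat fields, then load matching documents directly into a Pandas DataFrame (or Arrow Table / NumPy arrays) in one call. The database and field names below are hypothetical, and the PyMongoArrow calls are commented out since they need the library installed and a server running (and the synchronous PyMongo client, given that Motor is not yet supported).

```python
# Illustrative flat schema: field name -> Python type. PyMongoArrow
# performs best on un-nested fields like these.
schema_fields = {"customer_id": int, "amount": float}

# With pymongoarrow installed and a server running (sketch):
#   from pymongo import MongoClient
#   from pymongoarrow.api import Schema, find_pandas_all
#
#   coll = MongoClient().analytics_db.orders  # hypothetical names
#   df = find_pandas_all(coll, {}, schema=Schema(schema_fields))
#   total = df["amount"].sum()  # analytics continue in Pandas
```

The rule of thumb: push aggregation to the server with pipelines when the result set is small, and reach for PyMongoArrow when you need the raw documents client-side as DataFrames/arrays for further transformation.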