Hey folks,
I am very new to MongoDB. My use case involves many collections, each containing many documents (1000+). What is the best way to run analytics at scale? The options I see so far are:
1. Use aggregation pipelines in MongoDB
2. Use PyMongoArrow (but I am not sure how that interfaces with Motor)
Am I missing any options? Also, which is most recommended for best practices and scalability? If it is option 2, how can I make it work with Motor?
Thank you.
Hello, we do not yet support async clients in PyMongoArrow; this feature is tracked at https://jira.mongodb.org/browse/ARROW-198.
Noted. What is the best option to run analytics across multiple collections and documents?
Using pipelines in MongoDB, as you suggested.
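To illustrate, here is a minimal sketch of a pipeline that runs analytics across two collections. All names (`orders`, `customers`, `customer_id`, `amount`) are hypothetical, not from your schema; the Motor call is shown in comments because it needs a running server and an async context.

```python
# Hypothetical example: per-customer analytics across two collections
# using a $lookup join. Collection and field names are illustrative.
pipeline = [
    # Join each order with its matching customer document.
    {"$lookup": {
        "from": "customers",
        "localField": "customer_id",
        "foreignField": "_id",
        "as": "customer",
    }},
    # Flatten the joined array so each order has one customer.
    {"$unwind": "$customer"},
    # Total spend and order count per customer.
    {"$group": {
        "_id": "$customer._id",
        "total": {"$sum": "$amount"},
        "order_count": {"$sum": 1},
    }},
    # Highest-spending customers first.
    {"$sort": {"total": -1}},
]

# With Motor (async), assuming `db` is an AsyncIOMotorDatabase:
#   cursor = db.orders.aggregate(pipeline)
#   results = await cursor.to_list(length=None)
```

Because the pipeline executes server-side, only the grouped results cross the network, which is what makes this approach scale well.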
Got it. As a follow-up, when should someone use PyMongoArrow for data analytics/transformations versus MongoDB's aggregation pipelines? Is it only when the input is stored as Pandas/NumPy/Apache Arrow, or are there other reasons to use PyMongoArrow (efficiency, additional capabilities, etc.)? Thanks
We have a comparison page.
In general, PyMongoArrow is faster and uses less memory when dealing with larger, un-nested documents. We will continue to improve PyMongoArrow over time.
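As a rough sketch of that workflow: you declare a schema of flat fields, then load matching documents directly into a Pandas DataFrame (or Arrow Table / NumPy arrays) in one call. The database and field names below are hypothetical, and the PyMongoArrow calls are commented out since they need the library installed and a server running (and the synchronous PyMongo client, given that Motor is not yet supported).

```python
# Illustrative flat schema: field name -> Python type. PyMongoArrow
# performs best on un-nested fields like these.
schema_fields = {"customer_id": int, "amount": float}

# With pymongoarrow installed and a server running (sketch):
#   from pymongo import MongoClient
#   from pymongoarrow.api import Schema, find_pandas_all
#
#   coll = MongoClient().analytics_db.orders  # hypothetical names
#   df = find_pandas_all(coll, {}, schema=Schema(schema_fields))
#   total = df["amount"].sum()  # analytics continue in Pandas
```

The rule of thumb: push aggregation to the server with pipelines when the result set is small, and reach for PyMongoArrow when you need the raw documents client-side as DataFrames/arrays for further transformation.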