What is the best way to read big collections from mongodb?

Ali_ihsan_Erdem1 · June 27, 2022, 10:05pm

Hello there,
I have some huge collections (2-10 million records). What is the fastest way for me to read them in PyMongo?
Iterating a cursor seems too slow. Is there any alternative?
mongoexport looks fairly fast, but calling a command-line tool from my codebase seems like a bad idea.

steevej · June 27, 2022, 11:39pm

Reading all the documents from your huge collections into your client code will be slow, because you transfer all your data over the network.

And mongoexport is no magic. It uses the same protocol and API you have access to, and does some kind of for loop over a cursor. Although it is fast because it is written in a faster language than Python, the whole process might be slower, because it has to write to disk and then you have to read from disk, while the direct pymongo route might let you avoid the local disk I/O.

Read about the aggregation framework.
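
A minimal sketch of why that can help, assuming the goal is a summary rather than every document: the grouping below runs on the server, so only one small document per group crosses the network. The helper names and the "status" field are hypothetical and not from this thread.

```python
def build_count_pipeline(field):
    """Aggregation pipeline that groups documents by `field` and counts each group."""
    return [
        {"$group": {"_id": f"${field}", "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
    ]


def count_by_field(coll, field):
    """Run the grouping server-side and return {field value: count}."""
    return {
        doc["_id"]: doc["count"]
        for doc in coll.aggregate(build_count_pipeline(field))
    }


# Usage with a real collection ("status" is a hypothetical field name):
#   from pymongo import MongoClient
#   coll = MongoClient(...).db.test
#   print(count_by_field(coll, "status"))
```

If you really do need every document on the client, aggregation will not avoid the transfer; it only pays off when filtering or summarising can be pushed to the server.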

Shane · July 1, 2022, 7:34pm

To improve the performance of reading large collections you should try using RawBSONDocument:

from bson.raw_bson import RawBSONDocument
from pymongo import MongoClient

client = MongoClient(...)
coll = client.db.test
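# Same collection, but documents are returned as undecoded RawBSONDocument instances.
raw_opts = coll.codec_options.with_options(document_class=RawBSONDocument)
raw_coll = coll.with_options(codec_options=raw_opts)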
for raw_doc in raw_coll.find():
    print(raw_doc)

RawBSONDocument is a read-only view of the raw BSON data for each document. It can improve performance because the BSON data is decoded lazily. The raw BSON can also be accessed directly via the RawBSONDocument.raw property.
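
As a rough sketch of using the raw property, assuming you only want to copy the data somewhere without inspecting it, the helper below writes each document's bytes straight to a file-like object, so nothing is decoded on the client. The function name and the file name in the usage comment are hypothetical.

```python
def dump_raw(raw_docs, out):
    """Write each document's undecoded BSON bytes to `out`; return how many were written."""
    count = 0
    for raw_doc in raw_docs:
        out.write(raw_doc.raw)  # bytes as received from the server, no decoding
        count += 1
    return count


# Usage with the raw_coll defined above (file name is hypothetical):
#   with open("test.bson", "wb") as out:
#       print(dump_raw(raw_coll.find(), out))
```

The result is a sequence of concatenated BSON documents, which bson.decode_file_iter should be able to read back later.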

You may also want to try enabling network compression to reduce the bytes sent over the network. See "mongo_client – Tools for connecting to MongoDB" and "Installing / Upgrading" in the PyMongo 4.3.3 documentation.

# PyMongo must be installed with snappy support via: python3 -m pip install 'pymongo[snappy]'
client = MongoClient(compressors="snappy")
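
To get a feel for whether compression is worth enabling, here is a rough standard-library estimate using zlib (another compressor PyMongo accepts, alongside snappy). It is only an approximation: the documents are serialised as JSON rather than BSON, the sample data is made up, and the driver compresses wire messages rather than whole collections.

```python
import json
import zlib


def compression_ratio(docs, level=6):
    """Compressed size divided by original size for JSON-serialised `docs`."""
    payload = "".join(json.dumps(d) for d in docs).encode("utf-8")
    return len(zlib.compress(payload, level)) / len(payload)


# Made-up sample: repeated field names and values tend to compress well.
sample = [{"name": f"user{i}", "status": "active", "score": i % 10} for i in range(1000)]
print(f"compressed size is {compression_ratio(sample):.0%} of the original")
```

Documents with many repeated keys usually shrink a lot; data that is already compressed (images, archives) will not.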