I’ve got a collection of ~4M documents, around 8 KB each, and I need all their _ids for a batch-processing task.
%%time
allsrs = list(db["srs"].find({}, projection={"_id": True}, batch_size=1000000))
print(len(allsrs), allsrs[0])
3855367 {'_id': 1686036615}
CPU times: user 5.17 s, sys: 873 ms, total: 6.05 s
Wall time: *10min 56s*
Is there a way to speed up this query? It doesn’t seem like a query that should need 10 minutes to complete.
Hi @Mike_Ishenin ,
Welcome to MongoDB Community.
Can you provide the explain plan for this query? Have you tried sorting on _id, or hinting the query to use the _id index?
Have you considered doing the task in smaller batches, fetching the first x ids and handing them to another process before continuing with the next batch?
Best regards,
Pavel
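A minimal sketch of the batched approach suggested above: page through the _ids with an `_id`-range filter and `sort`/`limit`, resuming each page after the last _id seen. It is written against the PyMongo cursor API (`find` / `sort` / `limit`); the `InMemoryCollection` stub below is a hypothetical stand-in for `db["srs"]` so the sketch runs without a server, and `batch_size` here is a page size, not the wire-protocol batch size.

```python
def iter_id_batches(coll, batch_size):
    """Yield lists of _ids; each page resumes after the last _id seen,
    so every query is a bounded scan of the _id index."""
    last_id = None
    while True:
        query = {} if last_id is None else {"_id": {"$gt": last_id}}
        docs = list(
            coll.find(query, projection={"_id": True})
                .sort("_id", 1)
                .limit(batch_size)
        )
        if not docs:
            return
        yield [d["_id"] for d in docs]
        last_id = docs[-1]["_id"]


class InMemoryCollection:
    """Stand-in implementing only the slice of the cursor API used above."""
    def __init__(self, ids):
        self._ids = ids

    def find(self, query, projection=None):
        lo = query.get("_id", {}).get("$gt")
        return _Cursor([i for i in self._ids if lo is None or i > lo])


class _Cursor:
    def __init__(self, ids):
        self._ids = ids

    def sort(self, key, direction):
        self._ids = sorted(self._ids, reverse=(direction == -1))
        return self

    def limit(self, n):
        self._ids = self._ids[:n]
        return self

    def __iter__(self):
        return iter({"_id": i} for i in self._ids)


coll = InMemoryCollection([5, 1, 4, 2, 3])
batches = list(iter_id_batches(coll, batch_size=2))
print(batches)  # [[1, 2], [3, 4], [5]]
```

Against a real collection you would pass `db["srs"]` directly; each page then only walks a bounded section of the _id index instead of re-scanning from the start, which is why this pattern scales better than `skip()`.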
Wow, that’s the magic!
%%time
allsrs = list(db["srs"].find({}, projection={"_id": True}, sort=[("_id", 1)], batch_size=1000000))
print(len(allsrs), allsrs[0])
3855367 {'_id': 1260}
CPU times: user 5.63 s, sys: 1.32 s, total: 6.95 s
Wall time: 14.1 s
But how and why?
Hi @Mike_Ishenin ,
Basically, the sort forces the query to do an index scan on _id. The same can be achieved with a hint.
In general, MongoDB will sometimes prefer a collection scan over a full index scan, as it is potentially faster. But in your case I suspect the _id index, with its numerical values, is significantly smaller than the collection, so it probably fits fully in RAM.
That is what causes the improvement, theoretically…
Thanks
Pavel
system
Closed September 27, 2021, 6:19am
This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.