I’ve got a collection of ~4M documents, around 8 KB each, and I need all their _ids for a batch-processing task.
%%time
allsrs = list(db["srs"].find({}, projection={"_id": True}, batch_size=1000000))
print(len(allsrs), allsrs[0])
3855367 {'_id': 1686036615}
CPU times: user 5.17 s, sys: 873 ms, total: 6.05 s
Wall time: *10min 56s*
Is there a way to speed up this query? It doesn’t seem like a query that should need 10 minutes to complete.
Hi @Mike_Ishenin ,
Welcome to MongoDB Community.
Can you provide the explain plan for this query? Have you tried sorting on _id, or hinting the query to use the _id index?
Have you considered doing the task in smaller batches, fetching the first x ids and handing them to another process before continuing with the next batch?
Best regards,
Pavel
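A minimal sketch of the batched approach suggested above: page through the _ids with an `_id`-range filter and `sort`/`limit`, resuming each page after the last _id seen. It is written against the PyMongo cursor API (`find` / `sort` / `limit`); the `InMemoryCollection` stub below is a hypothetical stand-in for `db["srs"]` so the sketch runs without a server, and `batch_size` here is a page size, not the wire-protocol batch size.

```python
def iter_id_batches(coll, batch_size):
    """Yield lists of _ids; each page resumes after the last _id seen,
    so every query is a bounded scan of the _id index."""
    last_id = None
    while True:
        query = {} if last_id is None else {"_id": {"$gt": last_id}}
        docs = list(
            coll.find(query, projection={"_id": True})
                .sort("_id", 1)
                .limit(batch_size)
        )
        if not docs:
            return
        yield [d["_id"] for d in docs]
        last_id = docs[-1]["_id"]


class InMemoryCollection:
    """Stand-in implementing only the slice of the cursor API used above."""
    def __init__(self, ids):
        self._ids = ids

    def find(self, query, projection=None):
        lo = query.get("_id", {}).get("$gt")
        return _Cursor([i for i in self._ids if lo is None or i > lo])


class _Cursor:
    def __init__(self, ids):
        self._ids = ids

    def sort(self, key, direction):
        self._ids = sorted(self._ids, reverse=(direction == -1))
        return self

    def limit(self, n):
        self._ids = self._ids[:n]
        return self

    def __iter__(self):
        return iter({"_id": i} for i in self._ids)


coll = InMemoryCollection([5, 1, 4, 2, 3])
batches = list(iter_id_batches(coll, batch_size=2))
print(batches)  # [[1, 2], [3, 4], [5]]
```

Against a real collection you would pass `db["srs"]` directly; each page then only walks a bounded section of the _id index instead of re-scanning from the start, which is why this pattern scales better than `skip()`.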
Wow, that’s the magic!
%%time
allsrs = list(db["srs"].find({}, projection={"_id": True}, sort=[("_id", 1)], batch_size=1000000))
print(len(allsrs), allsrs[0])
3855367 {'_id': 1260}
CPU times: user 5.63 s, sys: 1.32 s, total: 6.95 s
Wall time: 14.1 s
But how and why?
Hi @Mike_Ishenin ,
Basically, the sort forces the query to do an index scan on _id. The same can be achieved with a hint.
In general, MongoDB will sometimes prefer a collection scan over a full index scan, as it is potentially faster. But in your case I suspect the _id index, with its numerical values, is significantly smaller than the collection, so it probably fits fully in RAM.
That is what causes the improvement, theoretically…
Thanks
Pavel
system
Closed September 27, 2021, 6:19am
This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.