Randomly select 2M docs from 30M docs, suggestions?

Hi, I’m curious whether there is an efficient way to select 2M docs from within 30M+ records in the db. I’m well aware of the $sample operator, but I’m not sure about its performance when the base set is rather large; it also seems that $sample can return duplicate documents. So I’m wondering if there’s a better suggested approach? Thanks.
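
For context, here is roughly the shape of what I’m running (a minimal pymongo sketch; the connection string and the database/collection names are placeholders). Since 2M is more than 5% of 30M, the docs say $sample falls back to a collection scan followed by a random sort, which is part of my performance worry:

```python
# Minimal sketch (pymongo; the connection string, database, and
# collection names below are placeholders). $sample should be the
# first stage of the pipeline.
from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017")["mydb"]["docs"]

# Draw 2M pseudo-random documents; allowDiskUse lets the random
# sort spill to disk if it exceeds the in-memory limit.
cursor = coll.aggregate(
    [{"$sample": {"size": 2_000_000}}],
    allowDiskUse=True,
)
for doc in cursor:
    pass  # process each sampled document here
```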

I do not know about efficiency, but if you have a field with a unique index, $sample will return each document only once.

See https://docs.mongodb.com/manual/core/read-isolation-consistency-recency/#faq-developers-isolate-cursors

Thanks, creating a unique index on _id seemed to do the trick! :+1:

@a_b the _id field is already unique and has an index on it. No need to create a new one.
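
If you want to verify, every collection already reports that index. A quick pymongo check (connection string and names are placeholders):

```python
# Every MongoDB collection automatically carries a unique index
# named "_id_" on the _id field; no extra index is needed.
from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017")["mydb"]["docs"]
print(coll.index_information())
# e.g. {'_id_': {'key': [('_id', 1)], 'v': 2}}
```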

Thanks for the heads-up. Do you happen to know how a query this large would perform?

@a_b,

It’s really hard to answer such a question. There are so many variables that will influence the performance of your query. It basically depends on:

  • is it a sharded cluster or not;
  • server config: # of cores, amount of memory, # of disks, SSD or HDD, network bandwidth;
  • workload: is the server dedicated to this sampling task? does your working set fit in memory (perhaps not)? is the query covered or not?

I believe the best way to answer your question would be to run a test yourself and track metrics like execution time and the query plan. If you don’t like the results of your testing, further investigation will be needed to find the best solution for your use case.
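
Something along these lines would give you a first data point (a pymongo sketch; the connection string and names are placeholders, and wall-clock time plus explain output are only a starting point, since server-side metrics matter too):

```python
import time

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["mydb"]  # placeholder database name
pipeline = [{"$sample": {"size": 2_000_000}}]

# Rough wall-clock timing of a full pass over the sample.
start = time.monotonic()
count = sum(1 for _ in db["docs"].aggregate(pipeline, allowDiskUse=True))
print(f"sampled {count} docs in {time.monotonic() - start:.1f}s")

# Ask the server which plan it chose and for execution statistics.
plan = db.command(
    "explain",
    {"aggregate": "docs", "pipeline": pipeline, "cursor": {}},
    verbosity="executionStats",
)
print(plan)
```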

All the best,

Rodrigo
(a.k.a. Logwriter)
