Randomly select 2M docs from 30M docs, suggestions?

Hi, I’m curious whether there is an efficient way to select 2M docs from within 30M+ records in the db. I’m well aware of the $sample operator, but I’m not sure about its performance when the base set is rather large; it also seems that $sample can return duplicate documents. So I’m wondering if there’s a better suggested approach? Thanks.
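
For context, here is roughly the shape of what I’m running (a minimal pymongo sketch; the connection string and the database/collection names are placeholders). Since 2M is more than 5% of 30M, the docs say $sample falls back to a collection scan followed by a random sort, which is part of my performance worry:

```python
# Minimal sketch (pymongo; the connection string, database, and
# collection names below are placeholders). $sample should be the
# first stage of the pipeline.
from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017")["mydb"]["docs"]

# Draw 2M pseudo-random documents; allowDiskUse lets the random
# sort spill to disk if it exceeds the in-memory limit.
cursor = coll.aggregate(
    [{"$sample": {"size": 2_000_000}}],
    allowDiskUse=True,
)
for doc in cursor:
    pass  # process each sampled document here
```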

I do not know about efficiency, but if you have a field with a unique index, $sample will return each document only once.

See https://docs.mongodb.com/manual/core/read-isolation-consistency-recency/#faq-developers-isolate-cursors

Thanks, creating a unique index on _id seemed to do the trick! :+1:

@a_b the _id field is already unique and has an index on it. No need to create a new one.
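
If you want to verify, every collection already reports that index. A quick pymongo check (connection string and names are placeholders):

```python
# Every MongoDB collection automatically carries a unique index
# named "_id_" on the _id field; no extra index is needed.
from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017")["mydb"]["docs"]
print(coll.index_information())
# e.g. {'_id_': {'key': [('_id', 1)], 'v': 2}}
```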

Thanks for the heads-up. Do you happen to know how a query this large would perform?

@a_b,

It’s really hard to answer such a question. There are so many variables that will influence the performance of your query. It basically depends on:

  • is it a sharded cluster or not;
  • server config: # of cores, amount of memory, # of disks, SSD or HDD, network bandwidth;
  • workload: is the server dedicated to this sampling task? does your working set fit in memory (perhaps not)? is the query covered or not?

I believe the best way to answer your question would be to run a test yourself and track metrics like execution time and the query plan. If you don’t like the results of your testing, further investigation will be needed to find the best solution for your use case.
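
Something along these lines would give you a first data point (a pymongo sketch; the connection string and names are placeholders, and wall-clock time plus explain output are only a starting point, since server-side metrics matter too):

```python
import time

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["mydb"]  # placeholder database name
pipeline = [{"$sample": {"size": 2_000_000}}]

# Rough wall-clock timing of a full pass over the sample.
start = time.monotonic()
count = sum(1 for _ in db["docs"].aggregate(pipeline, allowDiskUse=True))
print(f"sampled {count} docs in {time.monotonic() - start:.1f}s")

# Ask the server which plan it chose and for execution statistics.
plan = db.command(
    "explain",
    {"aggregate": "docs", "pipeline": pipeline, "cursor": {}},
    verbosity="executionStats",
)
print(plan)
```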

All the best,

Rodrigo
(a.k.a. Logwriter)
