Hi, I’m curious whether there is an efficient way to select 2M documents from within 30M+ records in the database. I’m well aware of the $sample
operator, but I’m not sure about its performance when the base collection is rather large. It also seemed that $sample
can return duplicated documents, so I’m wondering whether there is a better suggested approach? Thanks.
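For reference, a minimal sketch of the $sample pipeline under discussion, assuming a running mongosh session and a hypothetical collection named `events` (substitute your own collection name):

```javascript
// Sketch only: assumes a live MongoDB connection in mongosh and a
// hypothetical collection "events" with 30M+ documents.
// Placing $sample as the first stage lets MongoDB consider its
// optimized pseudo-random cursor path; allowDiskUse helps if the
// fallback (collection scan + top-k random sort) spills to disk.
const cursor = db.events.aggregate(
  [ { $sample: { size: 2000000 } } ],
  { allowDiskUse: true }
);
```

Note that, per the MongoDB documentation, the pseudo-random cursor path is the case in which $sample may output the same document more than once.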
I do not know about efficiency, but if you have a field with a unique index, $sample will return each document once.
Thanks, creating a unique index on _id seemed to do the trick!
Thanks for the heads-up. Do you happen to know how such a large query would perform?
@a_b,
It’s really hard to answer such a question. There are so many variables that will influence the performance of your query. It basically depends on:
- is it a sharded cluster or not;
- server config: # of cores, amount of memory, # of disks, SSD or HDD, network bandwidth;
- workload: is the server dedicated to this sampling task? does your working set fit in memory (perhaps not)? covered query or not?
I believe the best way to answer your question would be to run a test yourself and track metrics like execution time and the query plan. If you are not happy with the results of your testing, further investigation will be needed to find the best solution for your use case.
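As a starting point for such a test, the execution stats can be pulled from mongosh with an explain on the aggregation itself (sketch only; assumes a live server and a hypothetical collection named `events`):

```javascript
// Sketch: inspect the query plan and execution statistics for the
// sampling pipeline. The "executionStats" verbosity actually runs
// the pipeline and reports per-stage timings and document counts.
db.events.explain("executionStats").aggregate([
  { $sample: { size: 2000000 } }
]);
```

The output shows whether the optimized random-cursor path was used or whether the server fell back to a collection scan with a random sort, which is usually the deciding factor for performance at this scale.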
All the best,
Rodrigo
(a.k.a. Logwriter)