What would be the best approach to randomly query X items from a collection?

Hi, for some context, I am trying to build a feed for my application, where data posted by other users is retrieved in random order (not chronological, so you can load up a post from a long time ago, that’s ok) and you can scroll down to load more (pagination).

I have some things I’d like to clarify with this implementation.

  1. How do I randomly select X posts from a collection without repeats (and none made by that specific user)?
  2. How do I then paginate this, i.e. fetch more with no repeats?

Suppose I have a collection of Posts like this:

{
  _id: "someID",
  title: "someTitle",
  content: "someContent",
  posted_by: "userID"
}

How do I retrieve 10 random posts where posted_by is NOT equal to the current userID, say "userA"?

And how should I be implementing pagination where 10 different random posts are queried, again where posted_by !== "userA"?

I understand that I can use _id where it’s a Mongo ObjectID and limit for simple chronological pagination, but how do I incorporate randomness into this as well?
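
For reference, the simple chronological version I have in mind looks roughly like this (lastSeenId would just be the last _id returned on the previous page):

// first page, newest first, excluding my own posts
db.Posts.find({ posted_by: { $ne: "userA" } }).sort({ _id: -1 }).limit(10)

// next page: continue from the last _id seen on the previous page
db.Posts.find({ posted_by: { $ne: "userA" }, _id: { $lt: lastSeenId } }).sort({ _id: -1 }).limit(10)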

Appreciate any help, thank you!

Hi @Ajay_Pillay ,

One of the built-in ways to do this is the $sample stage, with size set to the number of documents to retrieve.

db.Posts.aggregate([
   { $sample: { size: 10 } }
])
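
If you also need to exclude the current user's posts, as in your question, you could put a $match stage before $sample (just a sketch, using the posted_by field and "userA" from your example):

db.Posts.aggregate([
   // filter out the current user's own posts before sampling
   { $match: { posted_by: { $ne: "userA" } } },
   { $sample: { size: 10 } }
])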

Thanks
Pavel

Hi @Pavel_Duchovny, yes I did come across this but:

$sample may output the same document more than once in its result set.

Is there any way to prevent duplicates?

What do you mean by duplicates?

To group?

Oh no, what I mean is that on the $random documentation page it says the selector may return an item more than once (i.e. items with the same _id).

But I can’t have that for what I need to do, I can’t have duplicate items being returned by the query.

Is that not what the documentation implies?

EDIT: I meant to refer to $sample not $random.

$sample is not $random, it's a different stage. It returns documents once.

Sorry I made a typo, I meant to refer to $sample.

What does this bit mean when it says $sample may return a document more than once? Doesn’t that mean there could be duplicates? Or am I not understanding that correctly?

Hi @Ajay_Pillay ,

Ok, I never paid attention to this section. I've never noticed a duplicate document, and I believe it's a super rare condition if you pick just "10" documents.

However, if you really need to ensure uniqueness, you can group by _id and replace the root with the first document of each group, which guarantees no document is returned twice.

[
  // Randomly sample 10 documents
  { $sample: { size: 10 } },
  // Group by _id so any accidental duplicate collapses into a single group
  { $group: {
      _id: "$_id",
      result: { $push: "$$ROOT" }
  } },
  // Keep only the first document of each group as the output document
  { $replaceRoot: {
      newRoot: { $first: "$result" }
  } }
]

Thanks,
Pavel

Thanks for the clarification!

I need to account for uniqueness because I will be querying for 10 documents at first, and when the user scrolls down the page I need to query 10 more unique documents, and as this grows the chances of duplicates increase in the subsequent queries.

How should I be approaching this? I understand how to make a single query for 10 unique random documents but how should this be used together with pagination?

@Ajay_Pillay ,

In such a case I suggest that you add a random number to your documents, index it, and pick 10 random numbers to be queried with an $in query, then pick 10 more, making sure they were not already picked before.

Otherwise, just pick a way to sort the documents randomly and paginate them.
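
A rough sketch of that second idea, a stored random sort key you paginate in order (assuming MongoDB 4.4+ for the $rand operator; the rand field and lastRandSeen variable are just placeholders):

// one-time backfill: store a random sort key on every post, then index it
db.Posts.updateMany({}, [ { $set: { rand: { $rand: {} } } } ])
db.Posts.createIndex({ rand: 1 })

// first page: start from a random point and walk forward
db.Posts.find({ posted_by: { $ne: "userA" }, rand: { $gt: Math.random() } }).sort({ rand: 1 }).limit(10)

// next pages: continue from the last rand value returned on the previous page
db.Posts.find({ posted_by: { $ne: "userA" }, rand: { $gt: lastRandSeen } }).sort({ rand: 1 }).limit(10)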

Thanks
Pavel


Another option is to run the aggregation with a sample of 3000 and batch them into pages of 10 documents …

If a user presses next 300 times, run a new query … No way they will notice a repeated result :stuck_out_tongue:


Hi @Pavel_Duchovny, this is an interesting idea, and honestly I don't think I even need 3000; I think 100 is enough. A sample of 100, batched into pages of 10 documents.

So once I get this random sample, I paginate 10 documents at a time. Once I reach the end of the 100, I will run another sample of 100 and batch as before.

How exactly am I supposed to be doing this batching, and saving the $sample aggregation between queries? For context, I’m running a Meteor/Apollo/React stack, so my GraphQL queries will include an offset and limit argument in my resolvers, and I will use that to handle the batching logic.
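
In case it helps frame the question, here is roughly how I picture the resolver side (everything here is a sketch; the in-memory Map would really need to be a session store or a persisted list of sampled _ids):

// hypothetical resolver sketch: sample once per feed session, then serve slices by offset/limit
const sampleCache = new Map();

async function feedResolver(_, { offset, limit }, { db, userId }) {
  // re-sample when a new feed session starts (offset 0) or nothing is cached yet
  if (offset === 0 || !sampleCache.has(userId)) {
    const sample = await db.collection("Posts").aggregate([
      { $match: { posted_by: { $ne: userId } } },
      { $sample: { size: 100 } }
    ]).toArray();
    sampleCache.set(userId, sample);
  }
  // return the requested page out of the cached sample
  return sampleCache.get(userId).slice(offset, offset + limit);
}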

Aggregation has a batchSize as part of the aggregation command.

I don't know the specifics of your driver architecture, but you can just query and fetch.

No need for skip and limit.

So I tried the following query:

db.getCollection('myDB').aggregate([
    { $sample: { size: 100 } }],
    { cursor: { batchSize: 10 } }
)

But I still see all the 100 samples being returned.

Also I don’t quite understand why I don’t need the idea of skip and limit here. Correct me if I am wrong, but from what I understand I run an aggregation for a sample size of 100 documents once. Then, I can choose which batch to send back, based on the offset and limit.

So my first query would be an offset of 0 and limit of 10. So I return the first batch, and second query I return the second batch (offset 10, limit 10). But if I run this command a second time, wouldn’t it be a different $sample chosen?

I think that the shell overrides any batch size below 101 results.

Try this with your driver …

Now, skip and limit are not related to the batch; they offset and limit the entire query, and that is done on the server side …

Skip is a highly unoptimized operation, as it needs to scan past the entire offset before retrieving results, while limit just stops after x docs are returned.
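
For illustration, the server-side skip and limit pattern looks like this, though as mentioned the cost of $skip grows with the offset:

db.Posts.aggregate([
  { $match: { posted_by: { $ne: "userA" } } },
  { $skip: 20 },   // the server still scans and discards these 20 matching docs
  { $limit: 10 }
])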


I am using aggregation to fetch a single random document from a collection of over 300,000 documents. It's performing very slowly, taking 500ms-1000ms for each query.
Sample query: db.COLLECTION_NAME.aggregate([{ $match: { isAvailable: true } }, { $sample: { size: 1 } }]);
Did anyone else face performance issues while using aggregation to fetch a random document?