Let’s say we have a bunch of server logs spanning a couple of years, grouped by year,
and I want to sample 100 documents from each year. Is it possible to do that purely with the aggregation pipeline?
Hello @Fred_Wilson, welcome to the MongoDB Community forum!
Yes, it’s possible. Here is an aggregation query that gets the desired results; it runs from the mongo shell.
Note that this requires MongoDB v4.4.2 or greater (the version that introduced the $rand operator). Two variables are defined. NUMBER_OF_SAMPLES_REQUIRED is the number of random samples you are looking for in each group ($year). The random array positions generated with $rand are not unique, so we generate a little more than the needed 100 (NUMBER_OF_SAMPLES_COLLECTED) and remove the duplicates.
There is still a remote chance that a group ends up with one or two fewer documents than requested.
There is also an aggregation $sample stage, but I have not tried it in this case. Let me know how this works for you!
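The over-generate-then-de-duplicate idea can be sketched in plain JavaScript (illustrative only; the aggregation query does the same thing server-side with $range, $rand, $setDifference and $slice):

```javascript
// Sketch of the sampling trick: generate more random positions than
// needed, drop the duplicates, and keep at most the first "required".
function randomPositions(count, required, collected) {
  // "collected" random indices into an array of length "count"
  // (equivalent to mapping $rand over a $range in the pipeline).
  const positions = Array.from({ length: collected }, () =>
    Math.floor(Math.random() * count)
  );
  // De-duplicate (the pipeline uses $setDifference with [] for this).
  const unique = [...new Set(positions)];
  // Keep at most "required" positions (the pipeline uses $slice).
  return unique.slice(0, required);
}

const sample = randomPositions(5000, 100, 150);
console.log(sample.length); // usually 100; occasionally a little fewer
```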
var NUMBER_OF_SAMPLES_REQUIRED = 100;
var NUMBER_OF_SAMPLES_COLLECTED = 150;

db.collection.aggregate([
  {
    // Collect all of each year's documents into an array.
    $group: {
      _id: { year: "$year" },
      docs: { $push: "$$ROOT" },
      count: { $sum: 1 }
    }
  },
  {
    $project: {
      random_docs: {
        $let: {
          vars: {
            // Generate 150 random array positions, de-duplicate them
            // with $setDifference, and keep the first 100.
            random_positions: {
              $slice: [
                {
                  $setDifference: [
                    {
                      $map: {
                        input: { $range: [ 0, NUMBER_OF_SAMPLES_COLLECTED ] },
                        in: { $floor: { $multiply: [ { $rand: {} }, "$count" ] } }
                      }
                    },
                    []
                  ]
                },
                NUMBER_OF_SAMPLES_REQUIRED
              ]
            }
          },
          in: {
            // Pick the documents at the random positions.
            $map: {
              input: "$$random_positions",
              in: { $arrayElemAt: [ "$docs", "$$this" ] }
            }
          }
        }
      }
    }
  }
])
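For reference, the $sample alternative mentioned above draws from the whole pipeline input rather than per group, so it would need one query per year. A minimal sketch (collection and field names are assumptions):

```javascript
// Sketch: sample 100 documents for a single year with $sample.
// Runs once per year; the "year" field name is an assumption.
function samplePipelineForYear(year, size) {
  return [
    { $match: { year: year } },   // restrict to one year first
    { $sample: { size: size } }   // then let the server pick "size" docs
  ];
}

// Usage (in the mongo shell):
//   db.collection.aggregate(samplePipelineForYear(2020, 100));
const pipeline = samplePipelineForYear(2020, 100);
```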
Thanks, this worked like a charm!
This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.