Let’s say we have a bunch of server logs spanning a couple of years, grouped by year,
and I want to sample 100 documents from each year. Is it possible to do that purely with the aggregation pipeline?
Hello @Fred_Wilson, welcome to the MongoDB Community forum!
Yes, it’s possible. Here is an aggregation query that gets the desired results; it runs from the mongo shell.
Note that this requires MongoDB v4.4.2 or greater (the version that introduced the $rand operator). Two variables are defined. NUMBER_OF_SAMPLES_REQUIRED is the number of random samples you are looking for in each group ($year). The random array positions generated with $rand are not unique, so we generate a little more than the needed 100 (NUMBER_OF_SAMPLES_COLLECTED) and remove the duplicates.
There is still a remote chance that a group ends up with one or two fewer documents than requested.
There is also an aggregation $sample stage, but I have not tried it in this case. Let me know how this works for you!
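The over-generate-then-de-duplicate idea can be sketched in plain JavaScript (illustrative only; the aggregation query does the same thing server-side with $range, $rand, $setDifference and $slice):

```javascript
// Sketch of the sampling trick: generate more random positions than
// needed, drop the duplicates, and keep at most the first "required".
function randomPositions(count, required, collected) {
  // "collected" random indices into an array of length "count"
  // (equivalent to mapping $rand over a $range in the pipeline).
  const positions = Array.from({ length: collected }, () =>
    Math.floor(Math.random() * count)
  );
  // De-duplicate (the pipeline uses $setDifference with [] for this).
  const unique = [...new Set(positions)];
  // Keep at most "required" positions (the pipeline uses $slice).
  return unique.slice(0, required);
}

const sample = randomPositions(5000, 100, 150);
console.log(sample.length); // usually 100; occasionally a little fewer
```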
var NUMBER_OF_SAMPLES_REQUIRED = 100;
var NUMBER_OF_SAMPLES_COLLECTED = 150;

db.collection.aggregate([
  {
    // Collect all of each year's documents into an array.
    $group: {
      _id: { year: "$year" },
      docs: { $push: "$$ROOT" },
      count: { $sum: 1 }
    }
  },
  {
    $project: {
      random_docs: {
        $let: {
          vars: {
            // Generate 150 random array positions, de-duplicate them
            // with $setDifference, and keep the first 100.
            random_positions: {
              $slice: [
                {
                  $setDifference: [
                    {
                      $map: {
                        input: { $range: [ 0, NUMBER_OF_SAMPLES_COLLECTED ] },
                        in: { $floor: { $multiply: [ { $rand: {} }, "$count" ] } }
                      }
                    },
                    []
                  ]
                },
                NUMBER_OF_SAMPLES_REQUIRED
              ]
            }
          },
          in: {
            // Pick the documents at the random positions.
            $map: {
              input: "$$random_positions",
              in: { $arrayElemAt: [ "$docs", "$$this" ] }
            }
          }
        }
      }
    }
  }
])
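For reference, the $sample alternative mentioned above draws from the whole pipeline input rather than per group, so it would need one query per year. A minimal sketch (collection and field names are assumptions):

```javascript
// Sketch: sample 100 documents for a single year with $sample.
// Runs once per year; the "year" field name is an assumption.
function samplePipelineForYear(year, size) {
  return [
    { $match: { year: year } },   // restrict to one year first
    { $sample: { size: size } }   // then let the server pick "size" docs
  ];
}

// Usage (in the mongo shell):
//   db.collection.aggregate(samplePipelineForYear(2020, 100));
const pipeline = samplePipelineForYear(2020, 100);
```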
Thanks, this worked like a charm!
This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.