How to use $bucketAuto to divide into evenly distributed buckets

Is there a way to use $bucketAuto to divide my data into N buckets with equal distance?
I am trying to use this in my $facet query

(It is not perfect to use $bucket with boundaries, as I would then have to fetch the range of the data first to decide the boundaries)

Thank you!

Hello :wave: @williamwjs,

Welcome back to the MongoDB Community forums :sparkles:

The $bucketAuto accepts an optional granularity parameter which ensures that the boundaries of all buckets adhere to a specified preferred number series. So, here if you don’t specify the granularity it will distribute it automatically in equal sets**.

**Note: It will depend purely on the incoming number of the document.

For example:

A collection of things have an _id numbered from 1 to 100:

{ _id: 1 }
{ _id: 2 }
...
{ _id: 100 }

If I use the $bucketAuto without specifying the granularity, it will distribute it into equal counts of 20 documents.

db.things.aggregate( [
  {
    $bucketAuto: {
      groupBy: "$_id",
      buckets: 5,
      granularity: <No granularity>
    }
  }
] )
{ "_id" : { "min" : 1, "max" : 21 }, "count" : 20 }
{ "_id" : { "min" : 21, "max" : 41 }, "count" : 20 }
{ "_id" : { "min" : 41, "max" : 61 }, "count" : 20 }
{ "_id" : { "min" : 61, "max" : 81 }, "count" : 20 }
{ "_id" : { "min" : 81, "max" : 100 }, "count" : 20 }

But if I just increase one more document in the collection the result will be not consistent across each bucket.

{ "_id" : { "min" : 1, "max" : 21 }, "count" : 20 }
{ "_id" : { "min" : 21, "max" : 41 }, "count" : 20 }
{ "_id" : { "min" : 41, "max" : 61 }, "count" : 20 }
{ "_id" : { "min" : 61, "max" : 81 }, "count" : 20 }
{ "_id" : { "min" : 81, "max" : 101 }, "count" : 21 }

It is simply because 101 is not wholly divided by 5. Overall, using $bucketAuto we can specify the number of buckets, but not the number of documents each bucket will contain.

I hope it answers your question.

Best,
Kushagra

4 Likes

@Kushagra_Kesav Thank you for your response!

I guess “evenly distributed” is vague, and my intention is to divide into buckets having equal distance.

From the $bucketAuto doc, it says “Bucket boundaries are automatically determined in an attempt to evenly distribute the documents into the specified number of buckets.”, so looks like it is trying to divide the population into buckets, each of which would have almost the same frequency.

However, my intention is to make buckets boundary having equal range, and frequency of each bucket could differ

May I ask if you know how to do that? Thank you!

Hello :wave: @williamwjs,

Thanks for asking the question.

Indeed, in $bucketAuto the bucket boundaries are automatically determined.

For example, if I have the following documents:

16 documents in test_coll:
[
  { x: 1 },  { x: 2 },  { x: 2 },  { x: 3 },  { x: 3 },
  { x: 3 },  { x: 4 },  { x: 4 },  { x: 4 },  { x: 4 },
  { x: 5 }, { x: 5 }, { x: 5 }, { x: 6 }, { x: 6 },
  { x: 7 }]

And I execute the following query on it:

db.test_coll.aggregate( [
  {
    $bucketAuto: {
      groupBy: "$x",
      buckets: 2,
    }
  }])

It returns the following output:

[
  { _id: { min: 1, max: 5 }, count: 10 },
  { _id: { min: 5, max: 7 }, count: 6 }
]

Here, the 10/6 split is the best it can do with 2 buckets, as other potential boundaries would result in similarly balanced or even less balanced splits.

Further, if you increase the number of buckets to 4:

[
  { _id: { min: 1, max: 4 }, count: 6 },
  { _id: { min: 4, max: 5 }, count: 4 },
  { _id: { min: 5, max: 7 }, count: 5 },
  { _id: { min: 7, max: 7 }, count: 1 }
]

The boundaries are getting automatically determined in an attempt to evenly split within the given number of buckets.

So, if you wish to determine the bucket boundaries manually, I’ll recommend using the $bucket aggregation pipeline instead. As you stated in your first post, it may require fetching the range of the data first to determine the boundaries, but this is an essential step in the process. However, feel free to reach out if you need any assistance or further guidance.

Also please refer to Appendix: Stages Cheatsheet - Practical MongoDB Aggregations Book to learn more about the comparison of $bucket and $bucketAuto .

Best,
Kushagra

Hi @Kushagra_Kesav , thank you for your detailed reply!!!

I guess this is a feature request then.
To summarize, here is a detailed example of my request:

For example, requirement is automatically bucket the data into 5 buckets with equal-distance range, Then
a) if the total range of the data is 1-100, then it would automatically pick the following range:

[
  { _id: { min: 1, max: 20 }, count: 3 },
  { _id: { min: 21, max: 40 }, count: 20 },
  { _id: { min: 41, max: 60 }, count: 135 },
  { _id: { min: 61, max: 80 }, count: 60 },
  { _id: { min: 81, max: 100 }, count: 11 }
]

b) if the total range of the data is 21-70, then it would automatically pick the following range:

[
  { _id: { min: 21, max: 30 }, count: 3 },
  { _id: { min: 31, max: 40 }, count: 20 },
  { _id: { min: 41, max: 50 }, count: 135 },
  { _id: { min: 51, max: 60 }, count: 60 },
  { _id: { min: 61, max: 70 }, count: 11 }
]

Let me know if this makes sense to you! Thank you!