MongoDB Aggregation

When working with data in MongoDB, you may quickly find yourself needing to run complex, multi-stage operations to gather metrics for your project.

Generating reports and displaying useful metadata are just two major use cases where MongoDB aggregation operations can prove incredibly useful, powerful, and flexible.

What Is Aggregation?

In programming, we often run a series of operations on a collection of items. Take the following JavaScript sample:

let numbers = [{val: 1}, {val: 2}, {val: 3}, {val: 4}];
numbers = numbers
    .map(obj => obj.val) // [1, 2, 3, 4]
    .reduce((prev, curr) => prev + curr, 0) // 10

In this example, we have two operations that are being run on the numbers collection:

  • First, map(): we take the objects and map each one to its numerical value.
  • Second, reduce(): we consolidate the output into a single number, the sum of the values.

Aggregation operations process data records and return computed results.

Not only do we have the ability to aggregate data on the client side with JavaScript, but we can use MongoDB to run operations on the server against our collections stored in the database before the result is returned to the client.

Single-Purpose Aggregation

This article covers two ways to perform aggregation in MongoDB: single-purpose aggregation and the aggregation pipeline. The simplest is single-purpose aggregation.

Single-purpose aggregation operations are a collection of helper methods applied to a collection to calculate a result. These helper methods enable simple access to common aggregation processes.


Two such methods are distinct() and countDocuments().

Let's use a collection named “sales” that stores purchases:

{
    _id: ObjectId("5bd761dcae323e45a93ccfea"),
    saleDate: ISODate("2017-06-22T09:54:14.185+00:00"),
    items: [
      {
        "name": "printer paper",
        "price": 17.3,
        // ...
      },
    ],
    storeLocation: "Denver",
    customer: {
        age: 40,
        satisfaction: 5,
       // ...
    },
    couponUsed: false,
    purchaseMethod: "In store"
}

If we wanted to determine what the different purchasing methods are, we could call distinct() in our Node.js script:

const collection = client.db("sample_supplies").collection("sales");
const distinctPurchaseMethods = await collection.distinct("purchaseMethod");

distinctPurchaseMethods is an array that contains all of the unique purchase methods stored in the “sales” collection.

["In store", "Online", "Phone"]

If we wanted to see how many sales in total were made, we could run:

const totalNumberOfSales = await collection.countDocuments();

countDocuments() will aggregate the total number of documents in the collection and return that number for us to use.

Whenever one of these helper methods covers what we need, single-purpose aggregation is the simplest option.
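To make it concrete what these two helpers compute, here is a minimal in-memory sketch in plain JavaScript. The sampleSales array and the distinctValues and countMatching functions are hypothetical stand-ins written for illustration; they are not part of the MongoDB driver, and the real helpers run on the server.

```javascript
// Hypothetical in-memory stand-ins for a few "sales" documents.
const sampleSales = [
  { purchaseMethod: "In store", couponUsed: false },
  { purchaseMethod: "Online", couponUsed: true },
  { purchaseMethod: "In store", couponUsed: true },
  { purchaseMethod: "Phone", couponUsed: false },
];

// Roughly what distinct(field) computes: the unique values of one field.
// Sorted here only to make the output predictable.
function distinctValues(docs, field) {
  return [...new Set(docs.map((doc) => doc[field]))].sort();
}

// Roughly what countDocuments(filter) computes for simple equality filters:
// the number of documents whose fields equal every value in the filter.
function countMatching(docs, filter = {}) {
  return docs.filter((doc) =>
    Object.entries(filter).every(([key, value]) => doc[key] === value)
  ).length;
}

console.log(distinctValues(sampleSales, "purchaseMethod")); // [ 'In store', 'Online', 'Phone' ]
console.log(countMatching(sampleSales)); // 4
console.log(countMatching(sampleSales, { couponUsed: true })); // 2
```

The driver's countDocuments() also accepts a filter as its first argument, so a call such as collection.countDocuments({ couponUsed: true }) would count only the sales that used a coupon.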

How Does Aggregation Work in MongoDB?

When you need to do more complex aggregation, you can use the MongoDB aggregation pipeline. An aggregation pipeline is an ordered list of stages that, combined with the MongoDB query syntax, allows you to obtain an aggregated result.

Before we dive into the code, let's understand what the aggregation pipeline does and how it works. In an aggregation pipeline, you list a series of instructions, each of which is called a "stage." MongoDB executes the stages one after another, in the order they're defined, to produce a final output you can use. Let's look at an example usage of the aggregate command:

collection.aggregate([
   { $match: { status: "A" } },
   { $group: { _id: "$cust_id", total: { $sum: "$amount" } } }
])

In this example, we run a stage called $match. Once that stage is run, it passes its output to the $group stage.

$match filters the collection so that only the documents whose status value is "A" are passed along.

Afterwards, we use $group to group the documents by their cust_id field. As part of the $group stage, we calculate the sum of the amount fields within each group.

In addition to $sum, MongoDB provides a myriad of other operators you can use in your aggregations.
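To see what the $match and $group stages from the example produce, here is a rough in-memory equivalent in plain JavaScript. The orders array is hypothetical data invented for illustration, and the filter and Map logic only approximate what the server does for this particular pipeline.

```javascript
// Hypothetical order documents, invented for this illustration.
const orders = [
  { cust_id: "A123", amount: 500, status: "A" },
  { cust_id: "A123", amount: 250, status: "A" },
  { cust_id: "B212", amount: 200, status: "A" },
  { cust_id: "A123", amount: 300, status: "D" },
];

// Stage 1, like { $match: { status: "A" } }: keep only matching documents.
const matched = orders.filter((order) => order.status === "A");

// Stage 2, like { $group: { _id: "$cust_id", total: { $sum: "$amount" } } }:
// emit one document per cust_id holding the sum of that group's amounts.
const totals = new Map();
for (const { cust_id, amount } of matched) {
  totals.set(cust_id, (totals.get(cust_id) ?? 0) + amount);
}
const grouped = [...totals].map(([_id, total]) => ({ _id, total }));

console.log(grouped);
// [ { _id: 'A123', total: 750 }, { _id: 'B212', total: 200 } ]
```

The order with status "D" never reaches the grouping step, which is why customer A123 ends up with 750 rather than 1050.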

The Aggregation Pipeline Method

Let's look at the same collection of sales we were using earlier. Below is a document from this collection:

{
  "_id": "5bd761dcae323e45a93ccffb",
  "items": [
    {
      "name": "printer paper",
      "tags": [
        "office"
      ],
      "price": 17.3,
      "quantity": 1
    },
    {
      "name": "binder",
      "tags": [
        "school"
      ],
      "price": 23.36,
      "quantity": 3
    }
  ],
  "couponUsed": false,
  "purchaseMethod": "In store"
}

Given that we have a list of items sold for each transaction, we can calculate the average total price per transaction using the aggregation pipeline.

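We can start by using $set to add a field to each document. Combined with $sum, we're able to add a field called itemsTotal to each of the documents. Note that this expression adds up the price field of every item and does not take quantity into account.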

{ '$set': { 'itemsTotal': { '$sum': '$items.price' } } }

Now the documents in the pipeline have been transformed to contain a new property named itemsTotal.

[
 {
   "_id": "5bd761dcae323e45a93ccffb",
   "items": [
     // ...
   ],
   "itemsTotal": 360.33,
   "couponUsed": false,
   "purchaseMethod": "In store"
 }
]

Next, we can pass the documents from the $set stage to a $group stage. Inside of $group, we can use the "$avg" operator to calculate the average transaction price across all documents.

{ '$group': {
    'averageTransactionPrice': { '$avg': '$itemsTotal' },
    '_id': null
} }

Once this stage is completed, we'll be left with a single document that gives us the finalized output:

[{
 "_id": null,
 "averageTransactionPrice": 620.511328
}]

The output tells us that the average price across all transactions is $620.511328.


The finalized code for this aggregation should look something like this in Node.js:

const aggCursor = collection.aggregate([
       { '$set': { 'itemsTotal': { '$sum': '$items.price' } } },
       { '$group': { 'averageTransactionPrice': { '$avg': '$itemsTotal' }, '_id': null } }
]);
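As a sanity check on what the two stages compute, the same logic can be sketched in plain JavaScript on a couple of hypothetical documents. The docs array and its round prices are invented for illustration so the arithmetic is easy to follow; this is not how the server implements the stages.

```javascript
// Hypothetical documents with round prices so the arithmetic is easy to follow.
const docs = [
  { _id: 1, items: [{ price: 10 }, { price: 20 }] },
  { _id: 2, items: [{ price: 5 }, { price: 15 }, { price: 40 }] },
];

// Stage 1, like { $set: { itemsTotal: { $sum: "$items.price" } } }:
// copy each document and add the sum of its item prices.
const withTotals = docs.map((doc) => ({
  ...doc,
  itemsTotal: doc.items.reduce((sum, item) => sum + item.price, 0),
}));

// Stage 2, like { $group: { _id: null, averageTransactionPrice: { $avg: "$itemsTotal" } } }:
// collapse every document into one result holding the mean of itemsTotal.
const result = [
  {
    _id: null,
    averageTransactionPrice:
      withTotals.reduce((sum, doc) => sum + doc.itemsTotal, 0) / withTotals.length,
  },
];

console.log(result); // [ { _id: null, averageTransactionPrice: 45 } ]
```

The two documents total 30 and 60, so the single grouped result reports an average of 45.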

Using Aggregation Pipelines in Update Commands

aggregate isn't the only function that gets to enjoy the benefits of the aggregation syntax. As of MongoDB 4.2, a variety of commands support using aggregation pipelines to update documents.

Let's take a look at just one command that does so: updateMany.

We might want to store itemsTotal as a permanent field in our documents so that later reads don't have to recompute it.

Let's use updateMany with an aggregation pipeline to add a new field called itemsTotal.

await collection.updateMany({}, [
   { '$set': { 'itemsTotal': { '$sum': '$items.price' } } },
])

As you can tell, we've reused the $set stage from the previous example. Now, if we check our collection, we can see the new field in each document.

  {
    "_id": "5bd761dcae323e45a93ccffb",
    "items": [
      {
        "name": "printer paper",
        "price": 17.3,
        // ...
      }
    ],
    "itemsTotal": 360.33,
    "couponUsed": false,
    "purchaseMethod": "In store"
  }
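If the update might be run more than once, one possible refinement is to limit it to documents that don't have the field yet. The sketch below assumes collection is a Node.js driver Collection like the one used in the earlier snippets; the names missingItemsTotal, addItemsTotal, and backfillItemsTotal are hypothetical and chosen only for this example.

```javascript
// Only touch documents that do not have itemsTotal yet (a hypothetical refinement).
const missingItemsTotal = { itemsTotal: { $exists: false } };

// The same $set stage used earlier, expressed as an update pipeline.
const addItemsTotal = [{ $set: { itemsTotal: { $sum: "$items.price" } } }];

// `collection` is assumed to be a Node.js driver Collection, as in the earlier snippets.
async function backfillItemsTotal(collection) {
  const result = await collection.updateMany(missingItemsTotal, addItemsTotal);
  return result.modifiedCount; // number of documents actually changed
}
```

Calling await backfillItemsTotal(collection) inside an async function would then report how many documents gained the new field.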

How Fast Is MongoDB Aggregation?

While our examples have been realistic and useful in the right context, they've also been relatively small. We've only used two stages in the aggregate pipeline.

This isn't the full potential of the aggregate pipeline, though—far from it.

The aggregation pipeline lets you perform complex operations that surface a wide range of insights into your collections. There are dozens of pipeline stages, as well as a wide range of operators, that you can use to build almost any analysis of your data you can imagine.

While the aggregation pipeline is extremely powerful, how performant is it compared to doing these types of analytics ourselves in application code?

Let's use the example aggregation query from before:

const { performance } = require('perf_hooks');
const startTime = performance.now();
const totalAvg = collection.aggregate([
   {
       '$set': {
           'itemsTotal': {
               '$sum': '$items.price'
           }
       }
   }, {
       '$group': {
           '_id': null,
           'total': {
               '$avg': '$itemsTotal'
           }
       }
   }
]);
await totalAvg.toArray()
const endTime = performance.now();
console.log("Aggregation took:", endTime - startTime);

In our MongoDB example, we're using two stages: one to add an itemsTotal field, and the other to calculate the average of itemsTotal across all documents.

To match this behavior in Node.js, we'll use Array.prototype.map and Array.prototype.reduce as relevant stand-ins:

const { performance } = require('perf_hooks');
const startTime = performance.now();
const allItems = await collection.find({}).toArray();
const itemsSum = allItems
   .map(item => {
       item.itemsTotal = item.items.reduce((p, c) => p + parseFloat(c.price), 0);
       return item;
   })
   .reduce((p, item) => {
       return p + item.itemsTotal;
   }, 0);
const itemAvg = itemsSum / allItems.length;

const endTime = performance.now();
console.log("Manual took:", endTime - startTime);

Running each of the code snippets above against a collection of 5,000 documents yielded the following timing results:

Aggregation took 103.46ms.

Manual took 881.32ms.

That's a difference of over 8.5x! While the difference might be in milliseconds here, we're using an extremely small collection size. It's not difficult to imagine how drastic the timing differences would be if our collection held a million or more documents.
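The ratio quoted above can be reproduced from the two reported timings with a couple of lines of JavaScript. Because a single timed run can be noisy, the sketch also includes a small median helper you could feed with your own repeated measurements; both function names are hypothetical, and no additional timings are implied here.

```javascript
// Median of a list of timings in milliseconds (for your own repeated runs).
function median(samples) {
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

// How many times slower `slowMs` is than `fastMs`.
function speedup(slowMs, fastMs) {
  return slowMs / fastMs;
}

// The two timings reported above: manual vs. aggregation.
const ratio = speedup(881.32, 103.46);
console.log(`${ratio.toFixed(2)}x`); // 8.52x
```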

Conclusion

Aggregation has enabled us to do a lot in this example, from counting the documents in a collection with a single-purpose helper, to averaging across multiple data points with a pipeline, to modifying the collection in the database.

While we've learned a lot about the aggregation pipeline today, it's just the beginning. The aggregation pipeline is incredibly powerful and has many in-depth elements. To learn more about the pipeline and its usage, read through our documentation.

MongoDB Atlas also allows you to create and run aggregation pipelines via the aggregation pipeline builder. This makes it possible to export your finished pipeline to one of the supported driver languages.
