
MongoDB Aggregation

When working with data in MongoDB, you may quickly need to run complex, multi-stage operations to gather metrics for your project. Generating reports and displaying useful metadata are just two major use cases where MongoDB aggregation operations can prove incredibly useful, powerful, and flexible.


What is aggregation?

In programming, we often run a series of operations on a collection of items. Take the following JavaScript sample:

let numbers = [{val: 1}, {val: 2}, {val: 3}, {val: 4}];
numbers = numbers
    .map(obj => obj.val) // [1, 2, 3, 4]
    .reduce((prev, curr) => prev + curr, 0) // 10

In this example, we have two operations that are being run on the numbers array:

  • First, map(): we take the objects and convert them down to their numerical values.

  • Second, reduce(): We consolidate the output to a single number — the sum of the numbers.

Aggregation operations process data records and return computed results.

Not only do we have the ability to aggregate data on the client side with JavaScript, but we can use MongoDB to run operations on the server against our collections stored in the database before the result is returned to the client.

Single-purpose aggregation

MongoDB provides two methods to perform aggregation. The simplest is single-purpose aggregation.

Single-purpose aggregation operations are helper methods you call on a collection to calculate a result. These helpers provide simple access to common aggregation tasks.

Two of these helper methods are:

  • distinct(): returns the unique values of a specified field across a collection.

  • countDocuments(): returns the number of documents in a collection that match a query.
Let's use a collection named “sales” that stores purchases:

{
    _id: ObjectId("5bd761dcae323e45a93ccfea"),
    saleDate: ISODate("2017-06-22T09:54:14.185+00:00"),
    items: [
      {
        "name": "printer paper",
        "price": 17.3,
        // ...
      },
    ],
    storeLocation: "Denver",
    customer: {
        age: 40,
        satisfaction: 5,
       // ...
    },
    couponUsed: false,
    purchaseMethod: "In store"
}

If we wanted to determine what the different purchasing methods are, we could call distinct() in our Node.js script:

const collection = client.db("sample_supplies").collection("sales");
const distinctPurchaseMethods = await collection.distinct("purchaseMethod");

distinctPurchaseMethods is an array that contains all of the unique purchase methods stored in the “sales” collection.

["In store", "Online", "Phone"]
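For intuition, distinct() behaves like deduplicating a single field on the client. Here's a plain-JavaScript sketch (not how the server computes it) over a hypothetical in-memory docs array:

```javascript
// Hypothetical in-memory stand-in for documents in the "sales" collection.
const docs = [
  { purchaseMethod: "In store" },
  { purchaseMethod: "Online" },
  { purchaseMethod: "In store" },
  { purchaseMethod: "Phone" },
];

// distinct("purchaseMethod") ~ collect the field, then deduplicate with a Set.
const distinctPurchaseMethods = [...new Set(docs.map(d => d.purchaseMethod))];
console.log(distinctPurchaseMethods); // ["In store", "Online", "Phone"]
```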

If we wanted to see how many sales in total were made, we could run:

const totalNumberOfSales = await collection.countDocuments();

countDocuments() will aggregate the total number of documents in the collection and return that number for us to use. When one of these helper methods covers your use case, single-purpose aggregation is the simplest way to get your answer.
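countDocuments() also accepts an optional query filter, and it is conceptually equivalent to filtering and then counting. A plain-JavaScript sketch with a hypothetical docs array (against the real collection, this would be collection.countDocuments({ couponUsed: true })):

```javascript
// Hypothetical in-memory stand-in for documents in the "sales" collection.
const docs = [
  { purchaseMethod: "In store", couponUsed: false },
  { purchaseMethod: "Online", couponUsed: true },
  { purchaseMethod: "Phone", couponUsed: true },
];

// countDocuments() with no filter ~ counting every document.
const totalNumberOfSales = docs.length; // 3

// countDocuments({ couponUsed: true }) ~ filter, then count.
const couponSales = docs.filter(d => d.couponUsed).length; // 2
```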

How do I use MongoDB to aggregate data?

When you need to do more complex aggregation, you can use the MongoDB aggregation pipeline (check out our more detailed tutorial). Aggregation pipelines are sequences of stages that can query, filter, alter, and process our documents. It's a Turing-complete implementation that can be used as a (rather inefficient) programming language.

Before we dive into the code, let's understand what the aggregation pipeline does and how it works. In an aggregation pipeline, you define a series of instructions called stages. MongoDB executes the stages one after another, each stage passing its output to the next, until a finalized result is produced. Let's look at an example usage of the aggregate command:

collection.aggregate([
   { $match: { status: "A" } },
   { $group: { _id: "$cust_id", total: { $sum: "$amount" } } }
])

In this example, we run a stage called $match. Once that stage is run, it passes its output to the $group stage.

$match filters the collection so that only the documents with a status value of "A" continue down the pipeline.

Afterward, we use $group to group documents by the cust_id field. As part of the $group stage, we calculate the sum of each group's amount fields and store it in total.
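For intuition, the $match and $group stages above mirror a client-side filter followed by a grouped reduction. A plain-JavaScript sketch over a hypothetical orders array (the server does this far more efficiently):

```javascript
// Hypothetical orders with the fields the pipeline above expects.
const orders = [
  { cust_id: "A123", amount: 500, status: "A" },
  { cust_id: "A123", amount: 250, status: "A" },
  { cust_id: "B212", amount: 200, status: "A" },
  { cust_id: "A123", amount: 300, status: "D" },
];

// $match: { status: "A" } ~ keep only matching documents.
// $group: { _id: "$cust_id", total: { $sum: "$amount" } } ~ sum amount per cust_id.
const totals = orders
  .filter(o => o.status === "A")
  .reduce((acc, o) => {
    acc[o.cust_id] = (acc[o.cust_id] ?? 0) + o.amount;
    return acc;
  }, {});

console.log(totals); // { A123: 750, B212: 200 }
```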

In addition to $sum, MongoDB provides a myriad of other operators you can use in your aggregations.
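Several of those accumulator operators map onto familiar client-side reductions; $min, $max, and $avg, for instance, behave like the following plain-JavaScript equivalents (intuition only, not server code):

```javascript
// A group's collected values, as if gathered by $group.
const amounts = [500, 250, 200, 300];

// $min / $max ~ Math.min / Math.max over the group's values.
const min = Math.min(...amounts); // 200
const max = Math.max(...amounts); // 500

// $avg ~ sum divided by count.
const avg = amounts.reduce((p, c) => p + c, 0) / amounts.length; // 312.5
```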

The aggregation pipeline method

Let's return to the sales collection we were using earlier. Below is a document from this collection:

{
  "_id": "5bd761dcae323e45a93ccffb",
  "items": [
    {
      "name": "printer paper",
      "tags": [
        "office"
      ],
      "price": 17.3,
      "quantity": 1
    },
    {
      "name": "binder",
      "tags": [
        "school"
      ],
      "price": 23.36,
      "quantity": 3
    }
  ],
  "couponUsed": false,
  "purchaseMethod": "In store"
}

Given that we have a list of items sold for each transaction, we can calculate the average cost of all purchased items using the aggregation pipeline.

We can start by using $set to add a field to each document. Combined with $sum, we're able to add a field called itemsTotal to each of the documents.

{ '$set': { 'itemsTotal': { '$sum': '$items.price' } } }

Now the documents in the pipeline have been transformed to contain a new property named itemsTotal.

[
 {
   "_id": "5bd761dcae323e45a93ccffb",
   "items": [
     // ...
   ],
   "itemsTotal": 360.33,
   "couponUsed": false,
   "purchaseMethod": "In store"
 }
]
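The $sum expression above walks the items array and adds up every price. For the two items visible in the truncated sample document, the client-side equivalent looks like this (the stored itemsTotal of 360.33 comes from the full item list, which is elided above):

```javascript
// The two items visible in the truncated sample document.
const items = [
  { name: "printer paper", price: 17.3, quantity: 1 },
  { name: "binder", price: 23.36, quantity: 3 },
];

// { $sum: "$items.price" } ~ summing the price field of each array element.
// Note that it sums price alone, not price * quantity.
const itemsTotal = items.reduce((sum, item) => sum + item.price, 0);
console.log(itemsTotal); // ≈ 40.66
```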

Next, we can pass the documents from the $set stage into a $group stage. Inside $group, we can use the $avg operator to calculate the average transaction price across all documents.

{ '$group': {
    'averageTransactionPrice': { '$avg': '$itemsTotal' },
    '_id': null
} }

Once this stage is completed, we'll be left with a single document that gives us the finalized output:

[{
 "_id": null,
 "averageTransactionPrice": 620.511328
}]

The output tells us that the average price across all transactions is $620.511328.


The finalized code for this aggregation should look something like this in Node.js:

const aggCursor = collection.aggregate([
       { '$set': { 'itemsTotal': { '$sum': '$items.price' } } },
       { '$group': { 'averageTransactionPrice': { '$avg': '$itemsTotal' }, '_id': null } }
]);

Updating with the aggregation pipeline

aggregate isn't the only function that gets to enjoy the benefits of the aggregation syntax. As of MongoDB 4.2, a variety of commands support using aggregation pipelines to update documents.

Let's take a look at just one command that does so: updateMany.

We might want to persist itemsTotal as a permanent field on our documents so that reads don't have to recompute it.

Let's use updateMany with an aggregation pipeline to add a new field called itemsTotal.

await collection.updateMany({}, [
   { '$set': { 'itemsTotal': { '$sum': '$items.price' } } },
])

As you can tell, we've reused the $set stage from the previous example. Now, if we check our collection, we can see the new field in each document.

  {
    "_id": "5bd761dcae323e45a93ccffb",
    "items": [
      {
        "name": "printer paper",
        "price": 17.3,
        // ...
      }
    ],
    "itemsTotal": 360.33,
    "couponUsed": false,
    "purchaseMethod": "In store"
  }

How fast is MongoDB aggregation?

While our examples have been realistic and useful in the right context, they've also been relatively small; we've only used two stages in the aggregation pipeline.

This isn't the full potential of the aggregate pipeline, though—far from it.

The aggregation pipeline allows you to perform complex operations that can surface a wide range of insights into your collections. There are dozens of pipeline stages, as well as a wide range of operators, that you can combine to build almost any analysis of your data you can imagine.

While the aggregation pipeline is extremely powerful, how performant is it compared to doing these types of analytics on our own?

Let's use the example aggregation query from before:

const { performance } = require('perf_hooks');
const startTime = performance.now();
const totalAvg = collection.aggregate([
   {
       '$set': {
           'itemsTotal': {
               '$sum': '$items.price'
           }
       }
   }, {
       '$group': {
           '_id': null,
           'total': {
               '$avg': '$itemsTotal'
           }
       }
   }
]);
await totalAvg.toArray();
const endTime = performance.now();
console.log("Aggregation took:", endTime - startTime);

In our MongoDB example, we're using two stages: one to add an itemsTotal field, and the other to calculate the average of itemsTotal across all documents.

To match this behavior in Node.js, we'll use Array.prototype.map and Array.prototype.reduce as relevant stand-ins:

const { performance } = require('perf_hooks');
const startTime = performance.now();
const allItems = await collection.find({}).toArray();
const itemsSum = allItems
   .map(item => {
       item.itemsTotal = item.items.reduce((p, c) => p + parseFloat(c.price), 0);
       return item;
   })
   .reduce((p, item) => {
       return p + item.itemsTotal;
   }, 0);
const itemAvg = itemsSum / allItems.length;

const endTime = performance.now();
console.log("Manual took:", endTime - startTime);

Running each of the code snippets above against a collection of 5,000 documents yielded the following timing results:

Aggregation took 103.46ms.

Manual iteration through the cursor took 881.32ms.

That's a difference of over 8.5x! And while the difference is measured in milliseconds here, we used an extremely small collection; it's not difficult to imagine how drastic the gap would be if our collection held a million or more documents. Remember that an aggregation pipeline runs on the MongoDB server, where it can be optimized before execution, while iterating over a cursor to process data client-side adds significant latency from fetching pages of data over the network. In practice, a mix of both often works best: push heavy computation into the pipeline and keep client-side processing for logic the pipeline can't express.

Conclusion

The aggregation pipeline has enabled us to do a lot with this example, from determining how many documents are in a collection and being able to run complex operations against that collection, to gathering an average across multiple data points and modifying the collection in the database.

While we've learned a lot about the aggregation pipeline today, it's just the beginning. The aggregation pipeline is incredibly powerful and contains many in-depth elements. If you want to read more about the pipeline and its usage, you can read through our documentation.

And if you need a book, you can always refer to Practical MongoDB Aggregations.

MongoDB Atlas also allows you to create and run aggregation pipelines via the aggregation pipeline builder. Enterprise Advanced and on-premises users can also use Compass.

This makes it possible to export your finished pipeline to one of the supported driver languages.

FAQs

What is aggregate data?

Aggregate data is high-level data formed through the combination of numerical or non-numerical data from multiple sources.

What is data aggregation?

Data aggregation is the process of compiling a large group of data for high-level examination.

What are aggregators?

Aggregators are organizations, websites, or software applications that collect information from different sources and consolidate it in one place.

Ready to get started?

Launch a new cluster or migrate to MongoDB Atlas with zero downtime.