
Aggregation Framework

New in version 2.1.

Overview

The MongoDB aggregation framework provides a means to calculate aggregated values without having to use map-reduce. While map-reduce is powerful, it is often more difficult than necessary for many simple aggregation tasks, such as totaling or averaging field values.

If you’re familiar with SQL, the aggregation framework provides similar functionality to GROUP BY and related SQL operators as well as simple forms of “self joins.” Additionally, the aggregation framework provides projection capabilities to reshape the returned data. Using the projections in the aggregation framework, you can add computed fields, create new virtual sub-objects, and extract sub-fields into the top-level of results.
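
For example, the following $project stage sketches all three reshaping capabilities against the articles collection described later in this document; the output field names doubledViews and stats are illustrative only:

db.articles.aggregate(
  { $project : {
     title : 1,                                          // pass an existing field through
     doubledViews : { $multiply : [ "$pageViews", 2 ] }, // computed field
     stats : { views : "$pageViews" },                   // new virtual sub-object
     foo : "$other.foo"                                  // sub-field extracted to the top level
  } }
);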

See also

A presentation from MongoSV 2011: MongoDB’s New Aggregation Framework.

Additionally, consider Aggregation Framework Examples and Aggregation Framework Reference for more documentation.

Framework Components

This section provides an introduction to the two concepts that underpin the aggregation framework: pipelines and expressions.

Pipelines

Conceptually, documents from a collection pass through an aggregation pipeline, which transforms these documents as they pass through. For those familiar with UNIX-like shells (e.g. bash), the concept is analogous to the pipe (i.e. |) used to string text filters together.

In a shell environment the pipe redirects a stream of characters from the output of one process to the input of the next. The MongoDB aggregation pipeline streams MongoDB documents from one pipeline operator to the next to process the documents. A pipeline operator may appear more than once in the pipeline.

All pipeline operators process a stream of documents and the pipeline behaves as if the operation scans a collection and passes all matching documents into the “top” of the pipeline. Each operator in the pipeline transforms each document as it passes through the pipeline.

Note

Pipeline operators need not produce one output document for every input document: operators may also generate new documents or filter out documents.

Warning

The pipeline cannot operate on values of the following types: Binary, Symbol, MinKey, MaxKey, DBRef, Code, and CodeWScope.

See also

The “Aggregation Framework Reference” includes documentation of the following pipeline operators: $project, $match, $limit, $skip, $unwind, $group, and $sort.

Expressions

Expressions produce output documents based on calculations performed on input documents. The aggregation framework defines expressions in a document format using prefixed operator names (e.g. $add).

Expressions are stateless and are only evaluated when seen by the aggregation process. All aggregation expressions can only operate on the current document in the pipeline, and cannot integrate data from other documents.

The accumulator expressions used in the $group operator are an exception: they maintain state (e.g. totals, maximums, minimums, and related data) as documents progress through the pipeline.
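
As a minimal sketch against the articles collection used later in this document, the following $group stage applies the $sum accumulator; the output field name articleCount is illustrative:

db.articles.aggregate(
  { $group : {
     _id : "$author",             // group documents by author
     articleCount : { $sum : 1 }  // accumulator: adds 1 for each document in the group
  } }
);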

See also

Aggregation expressions for additional examples of the expressions provided by the aggregation framework.

Use

Invocation

Invoke an aggregation operation with the aggregate() wrapper in the mongo shell or the aggregate database command. Always call aggregate() on a collection object that determines the input documents of the aggregation pipeline. The arguments to the aggregate() method specify a sequence of pipeline operators, where each operator may have a number of operands.
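
For example, the following invocations are equivalent: the aggregate() helper wraps the aggregate database command. This is a minimal sketch assuming the articles collection introduced below.

db.articles.aggregate(
  { $match : { author : "bob" } }
);

db.runCommand( {
  aggregate : "articles",
  pipeline : [ { $match : { author : "bob" } } ]
} );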

First, consider a collection named articles that contains documents of the following form:

{
  title : "this is my title",
  author : "bob",
  posted : new Date(),
  pageViews : 5,
  tags : [ "fun", "good", "fun" ],
  comments : [
    { author : "joe", text : "this is cool" },
    { author : "sam", text : "this is bad" }
  ],
  other : { foo : 5 }
}

The following example aggregation operation pivots data to create a set of author names grouped by tags applied to an article. Call the aggregation framework by issuing the following command:

db.articles.aggregate(
  { $project : {
     author : 1,
     tags : 1
  } },
  { $unwind : "$tags" },
  { $group : {
     _id : { tags : "$tags" },
     authors : { $addToSet : "$author" }
  } }
);

The aggregation pipeline begins with the collection articles and selects the author and tags fields using the $project aggregation operator. The $unwind operator produces one output document per tag. Finally, the $group operator pivots these fields, producing one document per distinct tag that holds the set of authors who used it.

Result

The aggregation operation in the previous section returns a document with two fields:

  • result which holds an array of documents returned by the pipeline
  • ok which holds the value 1, indicating success, or another value if there was an error
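
If the articles collection contained only the sample document shown above, the result of the aggregation in the previous section would resemble the following:

{
  "result" : [
    { "_id" : { "tags" : "fun" }, "authors" : [ "bob" ] },
    { "_id" : { "tags" : "good" }, "authors" : [ "bob" ] }
  ],
  "ok" : 1
}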

As a document, the result is subject to the BSON Document size limit, which is currently 16 megabytes.

Optimizing Performance

Because you will always call aggregate on a collection object, which logically inserts the entire collection into the aggregation pipeline, you may want to optimize the operation by avoiding scanning the entire collection whenever possible.

Pipeline Operators and Indexes

Depending on the order in which they appear in the pipeline, aggregation operators can take advantage of indexes.

The following pipeline operators take advantage of an index when they occur at the beginning of the pipeline:

  • $match
  • $sort
  • $limit
  • $skip

The above operators can also use an index when placed before the following aggregation operators:

  • $project
  • $unwind
  • $group

Early Filtering

If your aggregation operation requires only a subset of the data in a collection, use the $match operator to restrict which items go into the top of the pipeline, as in a query. When placed early in a pipeline, these $match operations use suitable indexes to scan only the matching documents in a collection.

Placing a $match pipeline stage followed by a $sort stage at the start of the pipeline is logically equivalent to a single query with a sort, and can use an index.
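
As a sketch, assuming an index on { author : 1, posted : 1 } for the articles collection used earlier, the following pipeline can use that index both to select and to order documents:

db.articles.aggregate(
  { $match : { author : "bob" } },  // selects matching documents, as in a query
  { $sort : { posted : 1 } }        // logically equivalent to a query with a sort
);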

In future versions there may be an optimization phase in the pipeline that reorders the operations to increase performance without affecting the result. However, at this time place $match operators at the beginning of the pipeline when possible.

Memory for Cumulative Operators

Certain pipeline operators require access to the entire input set before they can produce any output. For example, $sort must receive all of the input from the preceding pipeline operator before it can produce its first output document. The current implementation of $sort does not go to disk in these cases: in order to sort the contents of the pipeline, the entire input must fit in memory.

$group has similar characteristics: before any $group passes its output along the pipeline, it must receive the entirety of its input. For the $group operator, this frequently does not require as much memory as $sort, because it only needs to retain one record for each unique key in the grouping specification.

The current implementation of the aggregation framework logs a warning if a cumulative operator consumes 5% or more of the physical memory on the host. Cumulative operators produce an error if they consume 10% or more of the physical memory on the host.

Sharded Operation

Note

Changed in version 2.1.

Some aggregation operations using aggregate will cause mongos instances to require more CPU resources than in previous versions. This modified performance profile may dictate alternate architectural decisions if you use the aggregation framework extensively in a sharded environment.

The aggregation framework is compatible with sharded collections.

When operating on a sharded collection, the aggregation pipeline splits into two parts. The aggregation framework pushes all of the operators up to and including the first $group or $sort operation to each shard. [1] Then a second pipeline runs on the mongos. This pipeline consists of the first $group or $sort and any remaining pipeline operators, and runs on the results received from the shards.

The $group operator brings in any “sub-totals” from the shards and combines them: in some cases these may be structures. For example, the $avg expression maintains a total and count for each shard; mongos combines these values and then divides.
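
For instance, if one shard reports a total of 30 from 3 documents and another reports 20 from 2 documents, mongos computes the overall average as (30 + 20) / (3 + 2) = 10.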

[1] If an early $match can exclude shards through the use of the shard key in the predicate, then these operators are only pushed to the relevant shards.

Limitations

Aggregation operations with the aggregate command have the following limitations:

  • The pipeline cannot operate on values of the following types: Binary, Symbol, MinKey, MaxKey, DBRef, Code, CodeWScope.
  • Output from the pipeline cannot exceed the BSON Document size limit, which is currently 16 megabytes. If your result set exceeds this limit, the aggregate command produces an error.
  • If any single aggregation operation consumes more than 10 percent of system RAM, the operation will produce an error.