Joins and Other Aggregation Enhancements Coming in MongoDB 3.2 (Part 2 of 3) – Worked Examples

Andrew Morgan

#Technical

Series:

  1. Part 1 – Introduction
  2. Part 2 – Worked Examples
  3. Part 3 - Adding Some Code Glue and Geolocation

This is the second in a three part blog series looking at the aggregation enhancements being introduced in MongoDB 3.2 (most notably $lookup which implements left-outer joins).

The previous post – Joins and Other Aggregation Enhancements Coming in MongoDB 3.2 (Part 1 of 3) – Introduction introduced the join and other new aggregation capabilities being introduced in MongoDB 3.2 as well as what value they bring.

Probably the best way to understand how $lookup and the rest of the Aggregation Framework pipeline could benefit your application is to work through an example. This is the intent of this post. If you're not a DBA or developer then don't worry too much about the syntax but instead understand the kinds of insights that are being derived.

The third and final post shows how geolocation data can be included as well as what to do when you reach the limit of what can be done using a single pipeline – including adding wrapper code. That post also summarizes some of the limitations of the aggregation pipeline and reasons why you might consider a full visualization solution such as Tableau together with the MongoDB Connectore for BI connector (also new in MongoDB 3.2).

Disclaimer

MongoDB's product plans are for informational purposes only. MongoDB's plans may change and you should not rely on them for delivery of a specific feature at a specific time.

The Data Set

The examples use two data sets, the first being a comprehensive set of home sale data for a town and the second being a mapping from postal codes to geospatial locations for that same town.

For those interested, the imported data sets needed some cleaning up to make this walkthrough more useful - the steps are described here.

The data sets used in this post (the homeSales and postcodes collections) can be downloaded here.

The two data sets can be checked using the mongo shell:

db.homeSales.findOne()
{
  "_id": ObjectId("56005dd980c3678b19792b7f"),
  "amount": 9000,
  "date": ISODate("1996-09-19T00:00:00Z"),
  "address": {
    "nameOrNumber": 25,
    "street": "NORFOLK PARK COTTAGES",
    "town": "MAIDENHEAD",
    "county": "WINDSOR AND MAIDENHEAD",
    "postcode": "SL6 7DR"
  }
}
<p>db.postcodes.findOne()
{
"_id": ObjectId("5600521e50fa77da54dfc0d2"),
"postcode": "SL6 0AA",
"location": {
"type": "Point",
"coordinates": [
51.525605,
-0.700974
]
}
}
An even better option to understand the contents of these collections is to use MongoDB Compass (to be released with MongoDB 3.2). Figure 1 shows an overview of the homeSales collection and Figure 2 delves into its address sub-document.

Figure 1: MongoDB Compass View of the homeSales Collection
Figure 2: Viewing Sub-Documents With MongoDB Compass

Building the First Pipeline

As the error messages from complex pipelines aren't always very specific, it makes sense to start with a simple pipeline and then check the results before moving onto the next.

As a collection's indexes are only used for the beginning stages in the pipeline (before any transformations are performed), it is often sensible to reduce the result set as much as possible with a $match stage to filter out any unnecessary documents. Ideally, the match would be against the sharding key so that fewer shards need to be included. For the first pipeline stage the cheaper property sales are going to excluded and so it would help to have a secondary index on the amount key:

db.homeSales.createIndex({amount: 1})

The first stage in the pipeline then filters out any sales of less than £3,000,000:

db.homeSales.aggregate([
  {$match: {
    amount: {$gte:3000000}}
  }
  ])

The results can then be inspected to understand what will flow into the next stage in the pipeline:

{
    {
      "_id": ObjectId("56005dda80c3678b19799e52"),
      "amount": 3000000,
      "date": ISODate("2012-04-19T00:00:00Z"),
      "address": {
        "nameOrNumber": "TEMPLE FERRY PLACE",
        "street": "MILL LANE",
        "town": "MAIDENHEAD",
        "county": "WINDSOR AND MAIDENHEAD",
        "postcode": "SL6 5ND"
      }
    },

...

    {
      "_id": ObjectId("56005dda80c3678b19799e5c"),
      "amount": 5425000,
      "date": ISODate("1999-03-15T00:00:00Z"),
      "address": {
        "nameOrNumber": "2 - 3",
        "street": "THE SWITCHBACK",
        "town": "MAIDENHEAD",
        "county": "WINDSOR AND MAIDENHEAD",
        "postcode": "SL6 7RJ"
      }
    }

In the next stage, a left-outer join is performed – using $lookup – to find documents from the postcodes collection with a matching postcode so that the geographic location can be determined:

db.homeSales.aggregate([
  {$match: {
    amount: {$gte:3000000}}
  }, 
  {$lookup: {
    from: "postcodes", 
    localField: "address.postcode",
    foreignField: "postcode",
    as: "postcode_docs"}
  }
])

Which yields these results:

{
    {
      "_id": ObjectId("56005dda80c3678b19799e52"),
      "amount": 3000000,
      "date": ISODate("2012-04-19T00:00:00Z"),
      "address": {
        "nameOrNumber": "TEMPLE FERRY PLACE",
        "street": "MILL LANE",
        "town": "MAIDENHEAD",
        "county": "WINDSOR AND MAIDENHEAD",
        "postcode": "SL6 5ND"
      },
      "postcode_docs": [
        {
          "_id": ObjectId("560053e280c3678b1978b293"),
          "postcode": "SL6 5ND",
          "location": {
            "type": "Point",
            "coordinates": [
              51.549516,
              -0.80702
            ]
          }
        }
      ]
    },

...

      "postcode_docs": [
        {
          "_id": ObjectId("560053e280c3678b1978b524"),
          "postcode": "SL6 7RJ",
          "location": {
            "type": "Point",
            "coordinates": [
              51.536848,
              -0.735835
            ]
          }
        }
      ]
    }

The pipeline can then be extended with a $project stage to refactor the documents, removing any information that isn't needed. The documents are then sorted in reverse-price order:

db.homeSales.aggregate([
  {$match: {
    amount: {$gte:3000000}}
  }, 
  {$lookup: {
    from: "postcodes", 
    localField: "address.postcode",
    foreignField: "postcode",
    as: "postcode_docs"}
  },
  {$project: {
    _id: 0,
    saleDate: "$date",
    price: "$amount",
    address: 1,
    location: "$postcode_docs.location"}},
  {$sort:
    {
      price: -1
    }}
])

The address and physical location of every home sale, starting with the most expensive is then shown:

{
      "address": {
        "nameOrNumber": "2 - 3",
        "street": "THE SWITCHBACK",
        "town": "MAIDENHEAD",
        "county": "WINDSOR AND MAIDENHEAD",
        "postcode": "SL6 7RJ"
      },
      "saleDate": ISODate("1999-03-15T00:00:00Z"),
      "price": 5425000,
      "location": [
        {
          "type": "Point",
          "coordinates": [
            51.536848,
            -0.735835
          ]
        }
      ]
    },

...

    {
      "address": {
        "nameOrNumber": "TEMPLE FERRY PLACE",
        "street": "MILL LANE",
        "town": "MAIDENHEAD",
        "county": "WINDSOR AND MAIDENHEAD",
        "postcode": "SL6 5ND"
      },
      "saleDate": ISODate("2012-04-19T00:00:00Z"),
      "price": 3000000,
      "location": [
        {
          "type": "Point",
          "coordinates": [
            51.549516,
            -0.80702
          ]
        }
      ]
    }

Building Further Pipelines

If all of the sales need to be analyzed (rather than just the most expensive few) then there would be too many results from the previous pipeline to be easily understood. For this reason, the pipeline is modified so that extra analysis and aggregation is performed within the database. This can be done using a $group stage – in this case to produce sales statistics by year:

db.homeSales.aggregate([
  {
    $group: 
    {
      _id: {$year: "$date"},
      higestPrice: {$max: "$amount"},
      lowestPrice: {$min: "$amount"},
      averagePrice: {$avg: "$amount"},
      priceStdDev: {$stdDevPop: "$amount"}
    }
  },
  {
    $sort: {_id: 1}
  }
])
    {
      "_id": 1995,
      "higestPrice": 1000000,
      "lowestPrice": 12000,
      "averagePrice": 114059.35206869633,
      "priceStdDev": 81540.50490801703
    },
<p>...</p>
<pre><code>{
  "_id": 2015,
  "higestPrice": 1688000,
  "lowestPrice": 125000,
  "averagePrice": 451413.23917137476,
  "priceStdDev": 228293.6005201898
}</code></pre>

Note that this is using the$stdDevPop (standard deviation) aggregation operator being introduced in MongoDB 3.2.

There's more precision than needed for some of the keys and so a common pattern can be employed – use a final $project stage to tidy up the data:

db.homeSales.aggregate([
  {
    $group: 
    {
      _id: {$year: "$date"},
      higestPrice: {$max: "$amount"},
      lowestPrice: {$min: "$amount"},
      averagePrice: {$avg: "$amount"},
      priceStdDev: {$stdDevPop: "$amount"}
    }
  },
  {
    $sort: {_id: 1}
  },
  {
    $project:
    {
      _id: 1,
      higestPrice: 1,
      lowestPrice: 1,
      averagePrice: {$trunc: "$averagePrice"},
      priceStdDev: {$trunc: "$priceStdDev"}
    }
  }
])
 {
      "_id": 1995,
      "higestPrice": 1000000,
      "lowestPrice": 12000,
      "averagePrice": 114059,
      "priceStdDev": 81540
    },
<p>...</p>
<pre><code>{
  "_id": 2015,
  "higestPrice": 1688000,
  "lowestPrice": 125000,
  "averagePrice": 451413,
  "priceStdDev": 228293
}

It often makes sense to store the results in a new collection – either because the results will be reused many times or that subsequent aggregation pipelines will reference them. This is simple to achieve using a $out stage; note that:

  • If the target collection already exists then its contents will be overwritten
  • If used then the $out must be the final stage in the pipeline

The following example writes the results to a collection called annualHomePrices:

db.homeSales.aggregate([
  {
    $group: 
    {
      _id: {$year: "$date"},
      highestPrice: {$max: "$amount"},
      lowestPrice: {$min: "$amount"},
      averagePrice: {$avg: "$amount"},
      priceStdDev: {$stdDevPop: "$amount"}
    }
  },
  {
    $sort: {_id: 1}
  },
  {
    $project:
    {
      _id: 0,
      year: "$_id",
      highestPrice: 1,
      lowestPrice: 1,
      averagePrice: {$trunc: "$averagePrice"},
      priceStdDev: {$trunc: "$priceStdDev"}
    }
  },
  {
    $out: "annualHomePrices"
  }
])
> db.annualHomePrices.findOne()
{
  "_id": ObjectId("560957ac29a5574d557d426d"),
  "highestPrice": 1000000,
  "lowestPrice": 12000,
  "averagePrice": 114059,
  "priceStdDev": 81540,
  "year": 1995
}

These "interim" results could then be used to build further result sets – without the need to run all of that processing again. As an example, using a simple $project stage, the gap between the highest and lowest house sale can be calculated for each year:

db.annualHomePrices.aggregate([
  {$project: 
    {
      Year: "$year",
      hightToLowPriceGap: {
        $subtract: ["$highestPrice", "$lowestPrice"]
      },
      _id: 0
    }
  }
])
{
   "Year": 2012,
   "hightToLowPriceGap": 2923000
},
{
   "Year": 2013,
   "hightToLowPriceGap": 5092250
},
{
   "Year": 2014,
   "hightToLowPriceGap": 3883050
},
{
  "Year": 2015,
  "hightToLowPriceGap": 1563000
}

As a final stage in this post, a pipeline is built to find the postal code and geographic location of the most expensive house sale for each of the recorded years:

  • Perform a $sort on the full homeSales data set so that the documents are ordered from most expensive sale first
  • $group the results based on the year of the home sale, deriving the priciestPostCode from the $first document in that group (year). Because the documents were sorted on price before entering the group stage, the first document is also the one with the highest price
  • $lookup the postal code in the postcodes collection to get the geolocation data
  • $sort the results by year
  • $project just the data that is of interest
db.homeSales.aggregate([
    {
      $sort: {amount: -1}
    },
    {
      $group:
      {
        _id: {$year: "$date"},
        priciestPostCode: {$first: "$address.postcode"}
      }
    },
    {
      $lookup:
      {
        from: "postcodes",
        localField: "priciestPostCode",
        foreignField: "postcode",
        as: "locationData"
      }
    },
    {
      $sort: {_id: -1}
    },
    {
      $project:
      {
        _id: 0,
        Year: "$_id",
        PostCode: "$priciestPostCode",
        Location: "$locationData.location"
      }
    }
  ])
{
  "Year": 2015,
  "PostCode": "SL6 9UD",
  "Location": [
    {
      "type": "Point",
      "coordinates": [
        51.558455,
        -0.756023
      ]
    }
  ]
},
{
  "Year": 2014,
  "PostCode": "SL6 1UP",
  "Location": [
    {
      "type": "Point",
      "coordinates": [
        51.51407,
        -0.704414
      ]
    }
  ]
},
...

Next Steps

To learn more about what's coming up in MongoDB 3.2, register for the What's new in MongoDB 3.2 webinar and review the MongoDB 3.2 release notes.

To get the best understanding of the new features then you should experiment with the software which is available in the MongoDB 3.2 (not for production) download – to use the new $lookup aggregation stage, the MongoDB Enterprise Advanced download should be used.

The reason MongoDB releases development releases is to give the community a chance to try out the new software – and we hope that you'll give us feedback, whether it be by joining the MongoDB 3.2 bug hunt or commenting on this post.


Learn more about joins and other aggregation enhancements in MongoDB 3.2, watch Andrew's on-demand webinar:

Watch the on-demand webinar


About the Author - Andrew Morgan

Andrew is a Principal Product Marketing Manager working for MongoDB. He joined at the start of this summer from Oracle where he’d spent 6+ years in product management, focussed on High Availability. He can be contacted @andrewmorgan or through comments on his blog (clusterdb.com).