Joins and Other Aggregation Enhancements Coming in MongoDB 3.2 (Part 3 of 3) – Adding Some Code Glue and Geolocation

Andrew Morgan

#Technical

Series:

  1. Part 1 – Introduction
  2. Part 2 – Worked Examples
  3. Part 3 – Adding Some Code Glue and Geolocation

This is the third in a three part blog series looking at the aggregation enhancements being introduced in MongoDB 3.2 (most notably $lookup which implements left-outer joins).

The first post – Joins and Other Aggregation Enhancements Coming in MongoDB 3.2 (Part 1 of 3) – Introduction introduced the join and other new aggregation capabilities being introduced in MongoDB 3.2 as well as what value they bring.

The second post stepped through examples of how to use these new features.

This third and final post shows how geolocation data can be included as well as what to do when you reach the limit of what can be done using a single pipeline – including adding wrapper code. This post also summarizes some of the limitations of the Aggregation Framework pipeline and reasons why you might consider a full visualization solution such as Tableau together with the MongoDB Connector for BI (also new in MongoDB 3.2).

Disclaimer

MongoDB's product plans are for informational purposes only. MongoDB's plans may change and you should not rely on them for delivery of a specific feature at a specific time.

Adding Some Coding Glue and Geolocation

Obviously, there are limits as to how much can be achieved with a single aggregation pipeline but with the addition of just a little code outside of the database (in this example, JavaScript in the mongo shell), additional results can be produced.

As a reminder, this is the structure of our starting data set:

> db.homeSales.findOne()
{
  "_id": ObjectId("56005dd980c3678b19792b7f"),
  "amount": 9000,
  "date": ISODate("1996-09-19T00:00:00Z"),
  "address": {
    "nameOrNumber": 25,
    "street": "NORFOLK PARK COTTAGES",
    "town": "MAIDENHEAD",
    "county": "WINDSOR AND MAIDENHEAD",
    "postcode": "SL6 7DR"
  }
}
<blockquote>
<p>db.postcodes.findOne()
{
"_id": ObjectId("5600521e50fa77da54dfc0d2"),
"postcode": "SL6 0AA",
"location": {
"type": "Point",
"coordinates": [
51.525605,
-0.700974
]
}
}

These data sets (the homeSales and postcodes collections) can be downloaded here.

We start by repeating an aggregation from the previous post but store the data in a collection so that we can build upon those results:

db.homeSales.aggregate([
    {
      $sort: {amount: -1}
    },
    {
      $group:
      {
        _id: {$year: "$date"},
        priciestPostCode: {$first: "$address.postcode"}
      }
    },
    {
      $lookup:
      {
        from: "postcodes",
        localField: "priciestPostCode",
        foreignField: "postcode",
        as: "locationData"
      }
    },
    {
      $sort: {_id: -1}
    },
    {
      $project:
      {
        _id: 0,
        Year: "$_id",
        PostCode: "$priciestPostCode",
        Location: "$locationData.location"
      }
    },
    {
      $out: "hottestLocations"
    }
  ])
db.hottestLocations.findOne()
{
  "_id": ObjectId("5629108c96be45aba9cb0c98"),
  "Year": 2015,
  "PostCode": "SL6 9UD",
  "Location": [
    {
      "type": "Point",
      "coordinates": [
        51.558455,
        -0.756023
      ]
    }
  ]
}

In this example, geospatial operations are performed on the location from the postcodes collection and so a geospatial index should be added:

db.postcodes.createIndex({location: "2dsphere"})

A function is created that, given a location, will find the five nearest postcodes – taking advantage of a $geoNear stage – note that this must be the first stage in the pipeline:

var findNeighbours = function (spot, yearTag) {
  var result = db.postcodes.aggregate([
      {
        $geoNear:
        {
          near: spot,
          distanceField: "distance",
          num: 5,
          spherical: true
        }
      },
      {
        $group: {
          _id: yearTag,
          "neighbours": {
            $addToSet: "$postcode"
          }
        }
      }
    ]);
  return result.result;
}

Then, for each of the hottestLocations collection, that function is called to display and the contents of the returned cursor are displayed:

db.hottestLocations.find().forEach(function(myDoc) {
  var myCursor = findNeighbours(myDoc.Location[0], myDoc.Year);
  myCursor.forEach(printjson);
})

The result is a list of the 5 closest postcodes to the top selling home each year:

{
  "_id": 1995,
  "neighbours": [
    "SL6 2NL",
    "SL6 2JL",
    "SL6 2NB",
    "SL6 2JN",
    "SL6 2NA"
  ]
}
<p>...</p>
<p>{
"_id": 2015,
"neighbours": [
"SL6 9XB",
"SL6 9XL",
"SL6 9UE",
"SL6 9UB",
"SL6 9UD"
]
}

Bonus Query – For Those Choosing a School

There's often a correlation between the house prices near a school and that school's performance. So, if you were considering a specific school then it might make sense to check house prices in the area surrounding that school.

The following pipeline will find house price statistics, by year, for all postcodes within a 3 km radius of the school – which is located at coordinates (51.5156725, -0.727387):

db.postcodes.aggregate([
      {
        $geoNear:
        {
          near: 
          {
            "type": "Point",
            "coordinates": [
              51.5156725,
              -0.727387
            ]},
          distanceField: "distance",
          num: 10000,
          maxDistance: 3000,
          spherical: true
        }
      },
      {
        $lookup: {
          from: "homeSales",
          localField: "postcode",
          foreignField: "address.postcode",
          as: "priceData"
        }
      },
      {
        $unwind: "$priceData"
      },
<pre><code>  {
    $group: 
    {
      _id: {$year: "$priceData.date"},
      highestPrice: {$max: "$priceData.amount"},
      lowestPrice: {$min: "$priceData.amount"},
      averagePrice: {$avg: "$priceData.amount"},
      priceStdDev: {$stdDevPop: "$priceData.amount"}
    }
  },
  {
    $project:
    {
      _id: 0,
      Year: "_id",
      highestPrice: 1,
      lowestPrice: 1,
      averagePrice: {$trunc: "$averagePrice"},
      priceStdDev: {$trunc: "$priceStdDev"}
    }
  },
  {
    $sort: 
    {
      "Year": -1
    }
  }
]);

{
  "highestPrice": 1350000,
  "lowestPrice": 125000,
  "averagePrice": 410593,
  "priceStdDev": 182358,
  "Year": 2015
},
...
{
  "highestPrice": 930000,
  "lowestPrice": 12000,
  "averagePrice": 103455,
  "priceStdDev": 68615,
  "Year": 1995
}

Limitations

As seen, it’s possible to build up sophisticated analytical queries using the enhanced aggregation features – especially when pipelines are combined with a little application or scripting glue.

Some limitations that you may meet include:

  • $geoNear can only be used as the first stage in the pipeline
  • $lookup only supports equality for the match and the equality has to be between a single key from each collection
  • The right-collection for $lookup cannot be sharded
  • The pipeline is linear; there are no forks and once data has been aggregated, the lost details can't be reused later in the pipeline (this is why writing results to a new collection using $out can be helpful)
  • One can remove information at each stage but it's impossible to add new raw data (other than through $lookup)
  • Indexes are only used for the beginning stages of the pipeline (and right tables in any subsequent $lookup)
  • $out can only be used in the final stage of the pipeline

When to Use Full Data Visualization Solutions

A lot can be achieved directly in the database – especially when augmented with a small amount of code. So when would it make sense to use a BI visualization tool such as Tableau. The capabilities will vary from product to product but some general considerations are given here:

  • Visualization – displaying information in graphs and on maps (rather than in JSON result sets) makes it much simpler for the human mind to see patterns and draw conclusions from the data (see Figure 1 which is based on the same data set used earlier and graphically shows the highest home sale price by location and year)
  • Multiple Data Sources – combining data from multiple sources (data blending); for example from a MongoDB database and an Excel spreadsheet can greatly broaden the context of reports
  • Interactivity – visualization tools make it simple to create interactive queries/dashboards where business user can graphically tweak parameters to get precise results and test theories
  • Skills – performing all of the analytics directly in MongoDB requires knowledge of the MongoDB Query Language and possibly some basic coding skills. Using the visualization tools is analogous to using intermediate features in Microsoft Excel such as pivot tables and graphs and so it opens up the ability to analyze the data to a broader set of users in the organization
  • Extra functions – for example, the ability to add trend lines to a chart

Figure 1: MongoDB Data Visualized in a Tableau Map

MongoDB 3.2 introduces the MongoDB Connector for BI which allows Business Intelligence tools such as Tableau to access data from MongoDB using SQL – opening up a range of new options for performing analytics on live data.

Next Steps

To learn more about what's coming up in MongoDB 3.2, register for the What's new in MongoDB 3.2 webinar and review the MongoDB 3.2 release notes.

To get the best understanding of the new features then you should experiment with the software which is available in the MongoDB 3.2 (not for production) download – to use the new $lookup aggregation theMongoDB Enterprise Advanced download should be used.

The reason MongoDB releases development releases is to give the community a chance to try out the new software – and we hope that you'll give us feedback, whether it be by joining the MongoDB 3.2 bug hunt or commenting on this post.


Learn more about joins and other aggregation enhancements in MongoDB 3.2, watch Andrew's on-demand webinar:

Watch the on-demand webinar


About the Author - Andrew Morgan

Andrew is a Principal Product Marketing Manager working for MongoDB. He joined at the start of this summer from Oracle where he’d spent 6+ years in product management, focussed on High Availability. He can be contacted @andrewmorgan or through comments on his blog (clusterdb.com).