Solving The Aggregation Rock Star Challenge
The first of Eliot's MongoDB World Weekly Challenges was an out-of-this-world challenge to analyse the records of meteorite impacts. Even if you didn't do the challenge (there's one every Wednesday up to MongoDB World), you can still be a winner, because we're going to show you how to get to the solution.
We asked you to come up with an aggregation that would give us, for each decade, the average mass and the standard deviation of that mass for meteorites within 1,000 km of McMurdo Antarctic Research Station. Read the full challenge for more background.
The Answer
The answer to the challenge that we were looking for was:
{
  "decades": [
    {
      "decade": 1970,
      "average": 1519.2301875901876,
      "stdDeviation": 16646.61807518022
    },
    {
      "decade": 1980,
      "average": 201.3780529765156,
      "stdDeviation": 2231.0730176941534
    },
    {
      "decade": 1990,
      "average": 92.88260108825031,
      "stdDeviation": 593.2355408243169
    },
    {
      "decade": 2000,
      "average": 102.90173329283111,
      "stdDeviation": 612.8858371301232
    },
    {
      "decade": 2010,
      "average": 3.586111111111111,
      "stdDeviation": 3.339007901373116
    }
  ]
}

You might see some slightly different trailing digits in your solution due to floating-point arithmetic. If your solution is accurate to 10 significant figures, you can consider it correct.
Solving the Challenge
There are many different ways to get to this solution. Let's take a look at one possible pipeline, stage by stage.
$geoNear to geographically filter
Since $geoNear must be the first stage of a pipeline, we start there:
{
  "$geoNear": {
    "near": { "type": "Point", "coordinates": [ 166.6863, -77.8419 ] },
    "distanceField": "distance",
    "limit": 25000,
    "maxDistance": 1000000
  }
}

We set limit to 25,000 to return more than the default of 100 documents. maxDistance is in meters, so we set it to 1,000,000 for 1,000 km. You can also set spherical to true, but that's optional: $geoNear will detect that you are using a 2dsphere index and will correctly use spherical geometry anyway. And since there is only one geospatial index on the collection, we don't need to specify the key field; the $geoNear stage will assume you want that single geospatial index.
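If you loaded the data set yourself, remember that $geoNear needs a geospatial index to run against. Here's a minimal sketch of creating one in the mongo shell, assuming the collection is called meteorites and the GeoJSON point is stored in a field called location (both names are assumptions; adjust them to match your data):

// Create the 2dsphere index that $geoNear will use.
// "meteorites" and "location" are assumed names -- use your own.
db.meteorites.createIndex({ "location": "2dsphere" })
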
Making dates into years
Next, we want to start bucketing the meteorites by decade. Let's simplify the year by converting it from a string representing the full date to just a number. We use $addFields to overwrite the year field:
{
  "$addFields": {
    "year": {
      "$year": {
        "$dateFromString": {
          "dateString": "$year"
        }
      }
    }
  }
}

Here $dateFromString parses the dates, then $year extracts just the year.
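If you want to check this conversion in isolation, here's a quick sketch for the mongo shell. The collection name meteorites and the ISO-style sample string are assumptions; substitute a real value from one of your own documents:

// Convert a single, literal date string and extract the year.
// "1976-01-01T00:00:00.000" is an assumed example of the stored format.
db.meteorites.aggregate([
  { "$limit": 1 },
  { "$project": {
      "_id": 0,
      "converted": {
        "$year": { "$dateFromString": { "dateString": "1976-01-01T00:00:00.000" } }
      }
  } }
])
// => { "converted" : 1976 }
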
At this point, we want to filter out any meteorites recorded before 1970, before we begin bucketing by decade. This stage is actually optional, since there are no Antarctic meteorites in this data set from before 1970 (remember, ANSMET started in 1976). But for completeness, let's add it:
{
  "$match": {
    "year": {
      "$gte": 1970
    }
  }
}

Decades and buckets of meteorites
Next we're going to use the $bucket stage to sort meteorites into the correct decade. While the $group stage is great for grouping documents that share the same value, the $bucket stage is best for grouping documents together by a range of values.
{
  "$bucket": {
    "groupBy": "$year",
    "boundaries": [
      1970, 1980, 1990, 2000, 2010, 2019
    ],
    "default": "notset",
    "output": {
      "stdDev": {
        "$stdDevPop": "$mass"
      },
      "avg": {
        "$avg": "$mass"
      }
    }
  }
}

We set the boundaries to 1970, 1980, 1990, 2000, 2010, and 2019. In the $bucket stage, lower boundaries are inclusive and upper boundaries are exclusive, so the first bucket contains every meteorite with a year from 1970 to 1979, and so on up the decades. The default of "notset" simply catches any document whose year falls outside the boundaries. For each bucket, we calculate the $stdDevPop and $avg of the mass.
Presenting the results
We've now done all the heavy lifting. We just need to get the output in the correct format. We're going to use the $group stage to group all the documents into a single document:
{
  "$group": {
    "_id": null,
    "decades": {
      "$push": {
        "decade": "$_id",
        "stdDeviation": "$stdDev",
        "average": "$avg"
      }
    }
  }
}

Here we set the _id to null. If you set _id to any constant, $group will accumulate all documents together into one document. We push each decade into the decades array.
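At this point the pipeline produces a single document that is almost the final answer; the only difference is the leftover _id. Schematically (the decades array holds exactly the values shown at the top of this post):

{
  "_id": null,
  "decades": [
    { "decade": 1970, "stdDeviation": 16646.61807518022, "average": 1519.2301875901876 },
    ...
  ]
}
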
Finally, we need to get rid of the _id field:
{
  "$project": {
    "_id": 0
  }
}

And that's it. We now get the result shown at the beginning of this post.
The whole pipeline
Here's the full pipeline that you can copy into MongoDB Compass to test out:
[
  {
    "$geoNear": {
      "near": {
        "type": "Point",
        "coordinates": [
          166.6863, -77.8419
        ]
      },
      "distanceField": "distance",
      "limit": 25000,
      "maxDistance": 1000000
    }
  },
  {
    "$addFields": {
      "year": {
        "$year": {
          "$dateFromString": {
            "dateString": "$year"
          }
        }
      }
    }
  },
  {
    "$match": {
      "year": {
        "$gte": 1970
      }
    }
  },
  {
    "$bucket": {
      "groupBy": "$year",
      "boundaries": [
        1970, 1980, 1990, 2000, 2010, 2019
      ],
      "default": "notset",
      "output": {
        "stdDev": {
          "$stdDevPop": "$mass"
        },
        "avg": {
          "$avg": "$mass"
        }
      }
    }
  },
  {
    "$group": {
      "_id": null,
      "decades": {
        "$push": {
          "decade": "$_id",
          "stdDeviation": "$stdDev",
          "average": "$avg"
        }
      }
    }
  },
  {
    "$project": {
      "_id": 0
    }
  }
]
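
If you'd rather use the mongo shell than Compass, the same pipeline runs with a single aggregate call. A quick sketch, assuming the collection is called meteorites (an assumed name; use your own):

// Paste the pipeline array from above in place of the comment.
var pipeline = [ /* the six stages shown above */ ];
db.meteorites.aggregate(pipeline).pretty()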

Common mistakes
Among the errors that caught people out, the most common were formatting mistakes. For example:
- Many submissions returned a document for each decade instead of a single document.
- Several submissions also had issues manipulating the date.
Another common error, found in several answers, was calculating the average of the mass but the standard deviation of the distance from McMurdo Station. We wanted the average and the standard deviation of the mass; the distance calculated in the $geoNear stage is never used in the result.
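In other words, the slip was in the $bucket output specification, something like this (a reconstruction for illustration, not an actual submission):

"output": {
  "stdDev": { "$stdDevPop": "$distance" },   // spread of distances from McMurdo -- not what was asked
  "avg": { "$avg": "$mass" }
}

Both accumulators should read from "$mass", as in the pipeline above.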
Wrapping up
We hope you expanded your MongoDB knowledge by completing this challenge and, hopefully, had fun doing it. As we said, there's a challenge every Wednesday up to MongoDB World 2019, so why not get involved, keep learning, and maybe even win some prizes?