Solving The Aggregation Rock Star Challenge

Tim Fogarty

#Developer#MongoDB World 2019

The first of Eliot's MongoDB World Weekly Challenges was an out of this world challenge to analyse the records of meteorite impacts. Even if you didn't do the challenge (there's one every Wednesday up to MongoDB World), you can still be a winner because we're going to show you how to get the solution.

We asked you to come up with an aggregation that would give us the average mass and standard deviation for that mass for meteorites within 1000km of McMurdo Antarctic Research Station. Read the full challenge for more background.

The Answer

The answer to the challenge that we were looking for was:

{
  "decades": [
    {
      "decade": 1970,
      "average": 1519.2301875901876,
      "stdDeviation": 16646.61807518022
    },
    {
      "decade": 1980,
      "average": 201.3780529765156,
      "stdDeviation": 2231.0730176941534
    },
    {
      "decade": 1990,
      "average": 92.88260108825031,
      "stdDeviation": 593.2355408243169
    },
    {
      "decade": 2000,
      "average": 102.90173329283111,
      "stdDeviation": 612.8858371301232
    },
    {
      "decade": 2010,
      "average": 3.586111111111111,
      "stdDeviation": 3.339007901373116
    }
  ]
}

You might see some slightly different trailing digits in your solution due to floating-point arithmetic. If your solution is accurate to 10 significant figures, you can consider it correct.

Solving the Challenge

There are many different ways to get to this solution. Let's take a look at one possible pipeline, stage by stage.

$geoNear to geographically filter

Since $geoNear must be the first stage of a pipeline, we start there:

{ 
    "$geoNear": {
        "near": { "type": "Point", "coordinates": [ 166.6863, -77.8419 ]},
        "distanceField": "distance",
        "limit": 25000,
        "maxDistance": 1000000
    }
}

We set limit to 25,000 to return more than 100 documents (which is the default). maxDistance is in meters so we set that to 1,000,000 for 1,000 km. You can optionally set spherical to true. This is optional since $geoNear will detect that you are using a 2dsphere index and will correctly use spherical geometry anyway. Since we only have one geospatial index on the collection, we don't need to specify the key field. The $geoNear stage will assume you are going to use the single geospatial index on the collection.

Making dates into years

Next, we want to get started on bucketing the meteorites by decade. Let's simplify the year by converting it from a string representing the full date to just a number. We use $addFields to overwrite the year field:

{
    "$addFields": {
      "year": {
        "$year": {
          "$dateFromString": {
            "dateString": "$year"
          }
        }
      }
    }
  }

Here $dateFromString parses the dates, then $year extracts just the year.

At this point, we want to filter out any meteorites from before the year 1970 before we begin bucketing them by decade. This stage is actually optional, since there are no antarctic meteorites in this data set from before 1970 (remember ANSMET started in 1976). But for completeness, let's add this stage:

{
    "$match": {
      "year": {
        "$gte": 1970
      }
    }
  }

Decades and buckets of meteorites

Next we're going to use the $bucket stage to sort meteorites into the correct decade. While the $group stage is great for grouping documents that share the same value, the $bucket stage is best for grouping documents together by a range of values.

{
    "$bucket": {
      "groupBy": "$year", 
      "boundaries": [
        1970, 1980, 1990, 2000, 2010, 2019
      ], 
      "default": "notset", 
      "output": {
        "stdDev": {
          "$stdDevPop": "$mass"
        }, 
        "avg": {
          "$avg": "$mass"
        }
      }
    }
}

We set the boundaries to 1970, 1980, 1990, 2000, 2010, and 2020. In the $bucket stage, lower boundaries are inclusive and upper boundaries are exclusive. So the first bucket will contain all meteorites where the year is 1970-1979. For each bucket, we calculate the $stdDevPop and $avg of the mass.

Presenting the results

We've now done all the heavy lifting. We just need to get the output in the correct format. We're going to use the $group stage to group all the documents into a single document:

{
    "$group": {
      "_id": null, 
      "decades": {
        "$push": {
          "decade": "$_id", 
          "stdDeviation": "$stdDev", 
          "average": "$avg"
        }
      }
    }
  }

Here we set the _id to null. If you set _id to any constant, $group will accumulate all documents together into one document. We push each decade into the decades array. Finally, we need to get rid of the _id field:

{
    "$project": {
      "_id": 0
    }
  }

And that's it. We now get the result we had at the beginning of our solution.

The whole pipeline

Here's the full pipeline that you can copy into MongoDB Compass to test out:

[
  {
    "$geoNear": {
      "near": {
        "type": "Point", 
        "coordinates": [
          166.6863, -77.8419
        ]
      }, 
      "distanceField": "distance", 
      "limit": 25000, 
      "maxDistance": 1000000
    }
  }, {
    "$addFields": {
      "year": {
        "$year": {
          "$dateFromString": {
            "dateString": "$year"
          }
        }
      }
    }
  }, {
    "$match": {
      "year": {
        "$gte": 1970
      }
    }
  }, {
    "$bucket": {
      "groupBy": "$year", 
      "boundaries": [
        1970, 1980, 1990, 2000, 2010, 2019
      ], 
      "default": "notset", 
      "output": {
        "stdDev": {
          "$stdDevPop": "$mass"
        }, 
        "avg": {
          "$avg": "$mass"
        }
      }
    }
  }, {
    "$group": {
      "_id": null, 
      "decades": {
        "$push": {
          "decade": "$_id", 
          "stdDeviation": "$stdDev", 
          "average": "$avg"
        }
      }
    }
  }, {
    "$project": {
      "_id": 0
    }
  }
]

Common mistakes

Among the errors which caught people out, the most common mistakes were formatting errors. For example:

  • Many submissions returned a document for each decade instead of a single document.
  • Several submissions also had issues manipulating the date.

Another common error was found in several answers where they calculated the average of the mass, but the standard deviation of the distance from McMurdo Station. We wanted the average and standard deviation of the mass. The distance calculated in the $geoNear stage is never used.

Wrapping up

We hope you had expanded your MongoDB knowledge completing this challenge and, hopefully, had fun doing it. As we said, there's a challenge every Wednesday up to MongoDB World 2019, so why not get involved, get learning and even win prizes.