Weather of the Century: Part 2



In Part 1 of this series, we took a look at the Weather of the Century app, which uses NOAA's Integrated Surface Data, loaded into MongoDB, to display the weather anywhere on Earth for any hour since 1901. In Part 2, we'll review the schema we use to store the weather data and examine the two queries the app uses to do its work.

Recap of Schema

Each weather observation is stored as a single document in the collection, such as this one:

<pre>
{
    "st" : "u725053",
    "ts" : ISODate("2013-06-03T22:51:00Z"),
    "position" : {
        "type" : "Point",
        "coordinates" : [ -96.4, 39.117 ]
    },
    "elevation" : 231,
    "airTemperature" : {
        "value" : 21.1,
        "quality" : "1"
    },
    "skyCondition" : {
        "cavok" : "N",
        "ceilingHeight" : {
            "determination" : "9",
            "quality" : "1",
            "value" : 1433
        }
    },
    "atmosphericPressure" : {
        "value" : 1009.7,
        "quality" : "5"
    }
    [etc]
}
</pre>

The Power of GeoJSON

The position field of the observation record is a GeoJSON value type, which MongoDB supports for geospatial indexes. Geospatial query operators can be used to query documents that conform to the GeoJSON geographical specification. This feature allows the Weather of the Century app to find the weather at a specific location and time. This can be seen in the second of the two queries we will examine in the next section.
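As a sketch of what MongoDB expects here, this is the GeoJSON Point shape used by the position field, alongside the index specification for a 2dsphere index (the `db.data` collection name in the comment is an illustrative assumption):

```python
# A GeoJSON Point as MongoDB expects it: note [longitude, latitude] order.
position = {
    'type': 'Point',
    'coordinates': [-96.4, 39.117],
}

# Index specification for a 2dsphere geospatial index on that field;
# with a live connection this would be passed to create_index, e.g.:
#   db.data.create_index([('position', '2dsphere')])
index_spec = [('position', '2dsphere')]
```

The 2dsphere index is what makes operators like $near efficient against GeoJSON values.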

The Quality Field

One peculiarity of the Integrated Surface Data is that it contains weather observations from many different origins, obtained by a wide variety of methods, and some of these measurements can be unreliable. NOAA therefore encodes every individual measurement with a quality value. In this article and subsequent installments in this series, you will see that we often filter on this value. We aren't going into what the values mean – suffice it to say we often only care about measurements with certain quality values.
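In query terms, such a filter is just an extra criterion in the query document; a minimal sketch, with the quality code following the schema shown above:

```python
# Only measurements whose quality code is '1' (passed all quality
# control checks) will match; this combines with any other criteria.
quality_filter = {'airTemperature.quality': '1'}
```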

The Queries

When the user submits a place and a time to the Weather of the Century app, it orients the globe in the Google Earth plugin at the specified location, and populates the map with an air temperature label for every station in the world that recorded a measurement from that hour. It also fetches all of the recorded data from the station closest to the specified location, although that data is hidden until the user toggles the 'info' button. The app waits a few seconds, then repeats the queries for the subsequent hour, and continues until the user clicks the 'stop' button.

The Weather of the Century app performs its duties using two queries, one to retrieve all the temperature data available everywhere in the world at one time, and one for all the measured data for the station closest to the specified area.

All the Temperatures

To populate the Google Earth plugin with markers for the air temperature measurements from all the stations in the world for any given hour, the app issues an aggregation framework query to MongoDB using the Python driver. That pipeline looks like this:

<pre>
# 'dt' is the date-time of the given hour
pipeline = [
    {'$match': {
        'ts': {'$gte': dt, '$lt': dt + timedelta(hours=1)},
        'airTemperature.quality': '1'
    }},
    {'$group': {
        '_id': '$st',
        'position': {'$first': '$position'},
        'airTemperature': {'$first': '$airTemperature'}
    }}
]
</pre>

This is a basic aggregation pipeline with two phases. The first phase, the $match, allows only those documents that have a ts field (the timestamp of the weather record) between the time specified by the user and one hour thereafter. The other parameter in the match document is the quality of the airTemperature measurement – we only permit the display of measurements that passed all quality control checks. We don't want our non-existent support personnel fielding imaginary support calls from theoretical users who were in any way harmed or disgruntled by inaccurate data about the temperature at the hour of their birth.
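The one-hour window in the $match stage can be sketched on its own; `dt` here is a sample hour, and the quality code follows the schema above:

```python
from datetime import datetime, timedelta

# The user's chosen hour; any datetime on an hour boundary works.
dt = datetime(2013, 6, 3, 22)

# Matches documents timestamped within [dt, dt + 1 hour),
# restricted to fully quality-controlled temperature readings.
match = {
    '$match': {
        'ts': {'$gte': dt, '$lt': dt + timedelta(hours=1)},
        'airTemperature.quality': '1',
    }
}
```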

The display requirements of the app are one temperature label per station that recorded data within any single hour timespan. If every station recorded exactly one measurement per hour, the $match phase alone would suffice. However, that is not the case; many stations take several measurements per hour, and we do not want multiple data points per label, we want one. We could take all the measurements and average them, but for our purposes that doesn't seem worthwhile, and in some cases (as will be explained in a moment), it would be inaccurate.
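If averaging were desired, the $group stage could use the $avg accumulator instead of $first; a sketch of that alternative (not what the app does):

```python
# Averages every airTemperature.value recorded by each station
# in the matched hour, rather than picking a single reading.
group_avg = {
    '$group': {
        '_id': '$st',
        'airTemperature': {'$avg': '$airTemperature.value'},
    }
}
```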

So the next phase in the pipeline is the $group phase. The documents are grouped by station (identified by the $st field path), and for each group a single document is emitted with the station ID as its _id, and the position and airTemperature fields taken from the first document in the group via the $first accumulator.

The order of documents in the group is not defined, as the $match phase makes no guarantees, and we did not include a $sort phase, but we do not care – for our purposes, one measurement from the hour is as good as another.
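Had a deterministic choice mattered, a $sort stage between $match and $group would pin down which document $first sees; a sketch:

```python
# Sorting by station, then timestamp, would make $first pick each
# station's earliest measurement in the hour, deterministically.
sort_stage = {'$sort': {'st': 1, 'ts': 1}}
```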

(It is important that the position of the measurement be taken from the same document as the airTemperature, because some of these measurements come from mobile monitoring stations, such as those deployed on sea vessels.)

The Data for One Station

The other feature of the app's UI is to display all the weather data from the station closest to the specified location for the given hour. We can use the $near geospatial query operator to find the document with that data.

<pre>
# we are looking for the observation data closest to
# the point specified by 'lng' and 'lat'
db.data.find(
    {'ts': dt,
     'position': {
         '$near': {
             '$geometry': {
                 'type': 'Point',
                 'coordinates': [lng, lat]
             }
         }
     }},
    as_class=SON).hint([
        ('ts', 1),
        ('position', '2dsphere')
    ]).limit(-1).max_time_ms(10000)[0]
</pre>

Here we have a fairly ordinary query, except that in the match criteria we see the $near geospatial query operator being matched against the position field of the documents. Recall that the position field encodes the location of the observation station in GeoJSON format. By encoding our search location as a GeoJSON Point object and specifying it as the value of the $near field in the match document, our query returns documents in order of proximity to that point.

Because the app is written in Python, the as_class=SON argument is passed to the find call, as without it, the results of the query would be returned in a non-order-preserving Python dictionary. The bson.son.SON class, supplied by the pymongo library, acts just like a Python dictionary, only it preserves key order.
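SON offers the same guarantee the standard library's OrderedDict does: dictionary behavior plus preserved key order. A sketch of that behavior using OrderedDict as a stand-in (so it runs without pymongo installed):

```python
from collections import OrderedDict

# bson.son.SON behaves like this: mapping semantics, but iteration
# follows insertion order, which matters because BSON documents
# are ordered.
doc = OrderedDict([('ts', 1), ('position', 2)])
assert list(doc) == ['ts', 'position']
```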

We also pass a hint to MongoDB as to which index to use, because it turned out that MongoDB did not make an optimal choice by itself for this query. It preferred (as of this writing) the position_1 index, rather than the ts_1_position_1 index. It is always a good idea to use explain to verify which indexes are used for your queries when developing your applications!
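The hint and the explain check can be sketched like this; the `db.data` collection name and the `query` variable in the comments are illustrative assumptions:

```python
# The compound index we want the planner to use, named explicitly
# by its field specification.
hint_spec = [('ts', 1), ('position', '2dsphere')]

# With a live connection, explain() reveals the planner's choice:
#   db.data.find(query).hint(hint_spec).explain()
```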

By this point in the query, we have requested all the documents in the collection that match the requested timeframe, ordered from closest to farthest. It is the limit(-1) cursor method that narrows this down to what we really want – only the closest result. Negative values behave just like positive values when given to the limit method, except that they additionally prevent the creation of a standing cursor – this optimization lets MongoDB return after a single batch of results, no matter the number specified in the limit.

So limit(-1) essentially means "limited to one document, and you can also not bother creating a cursor."

Lastly, we specify a max_time_ms to protect the application's performance in case of a slow query. Because results are returned as an array, even though we only asked for one document, we take the 0th item in the array and return it.

Next: High Performance MongoDB using this data set

This article describes the operation of the Weather of the Century app, but that app only uses two queries. We've got all this great weather data available... what else can we do with it?

In the next article in this series, we'll play around with different deployments of MongoDB, and see what kind of performance we can achieve when querying this massive and cool data set. In the interim, if you're looking for a more in-depth look at MongoDB's architecture, download our guide:

Download the Architecture Guide

<< Read Part 1

Read Part 3 >>



About the Author - Avery

Avery is an infrastructure engineer, designer, and strategist with 20 years' experience in every facet of internet technology and software development. As principal of Bringing Fire Consulting, he offers clients his expertise at the intersection of technology, business strategy, and product formulation. He earned a B.A. in Computer Science from Brown University, where he specialized in systems and network programming, while also studying anthropology, fiction, cognitive science, and semiotics. Avery got his start in internet technology in 1993, configuring Apache and automating systems at Panix, the third-oldest ISP in the world. He has an obsession with getting to the heart of a problem, a flair for communication, and a devotion to providing delight to end users.