Aggregation Options on Big Data Sets Part 1: Basic Analysis using a Flights Data Set
By Daniel Alabi and Sweet Song, MongoDB Summer Interns
Flights Dataset Overview
This is the first of three blog posts from this summer internship project showing how to answer questions concerning big datasets stored in MongoDB using MongoDB’s frameworks and connectors.
The first dataset explored was a domestic flights dataset. The Bureau of Transportation Statistics provides information for every commercial flight from 1987, but we narrowed down our project to focus on the most recent available data for the past year (April 2012-March 2013).
We were particularly attracted to this dataset because it contains a lot of fields that are well suited for manipulation using the MongoDB aggregation framework.
To get started, we wanted to answer a few basic questions concerning the dataset:
- When is the best time of day/day of week/time of year to fly to minimize delays?
- What types of planes suffer the most delays? How old are these planes?
- How often does a delay cascade into other flight delays?
- What was the effect of Hurricane Sandy on air transportation in New York? How quickly did the state return to normal?
Importing the Data
The first order of business was uploading the dataset into MongoDB. The downloaded data sets are in monthly csv files of 150MB each. Many fields such as diversions or cancellations have blank values. mongoimport
would include a lot of these blank field values. To avoid field sparsity, we used an upload Python script to weed out the unnecessary fields and incomplete entries. This means that all fields which are blank or zero will not be included.
After deliberation, we decided to keep most of the flight information aside from the diverting airports (we’d come to realize that it was better to include more fields than few). The fields we chose to keep can be found in this gist. One important distinction that must be made is the difference between fields of depDelay, the difference in minutes between the scheduled departure time and actual departure time, versus arrDelay, the difference in minutes between the scheduled arrival time and the actual arrival time. Some of the other more important fields of our schema are:
FieldDescriptionSampledateFlight DateISODate(“2012-08-10T00:00:00Z”)carrierIdUnique Carrier Code”EV”tailNumUnique Tail Number of Aircraft”N607LR”origAirportIdOrigin Airport”IAH”origStateIdOrigin State”TX”destAirportIdDestination Airport”DTW”destStateIdDestination State”MI”crsDeptTimeScheduled Departure TimeISODate(“2012-08-10T14:55:00Z”)depDelayMinutes Delayed for Departure421depTimeTime of DepartureISODate(“2012-08-10T21:56:00Z”)crsArrTimeScheduled Arrival TimeISODate(“2012-08-10T18:46:00Z”)arrDelayMinutes Delayed for Arrival398arrTimeTime of ArrivalISODate(“2012-08-11T01:24:00Z”)cancellationCodeReason for Cancellation (0 for not cancelled, 1 for carrier, 2 for weather, 3 for national air system, 4 for security)0
In total, we imported 6,155,752 recorded trips into the flying database’s flights collection.. It took 2 hours to upload this dataset using 1TB of space and 7GB of RAM with again with individual document insertions.
Adding the Age
In order to find out the ages of the aircrafts that caused the most delays, we had to download a different dataset (MASTER.txt —131MB). MASTER.txt contained information about the MFR-Year of an aircraft, the year the aircraft was created.
We wrote a script, masterNNumbers.py, with the function addAge that adds an age field into the documents for which a corresponding tailNum (also known as the n-number) exists in the current flights collection in the flying database. Since a single aircraft can have different departure and arrival times, an aircraft with a specific tail number must be updated in several documents. We used the multi flag to enable multiple updates per aircraft.
Making Sense of the Dataset
Simple aggregations
So what can we do with this data? We started by answering the questions above with simple aggregations and queries. Then we moved to more investigations, such as how Hurricane Sandy affected flights going in and leaving New York? Finally, we used the D3 JavaScript library to visualize some of our results.
The least delays
Let us assume that the delays mentioned in this question are the arrival delays, as that would matter more to the traveler.
Using the following query, where field
can be replaced by hour, week, month, or any combination therefore:
db.flights.aggregate( {"$group" : {"_id" : , "delays" : {"$avg" : "$arrDelay"}}} , {"$sort" : {"delays" : 1}} );
The best time to leave during the day with the least average arrival delays is in the morning:
Hour of DayAverage arrDelay in Minutes5-2.9542668678479396-2.9005757014293282-2.34680851063829777-2.0244167912547824-0.5765124555160143
This advice to catch flights in the morning to avoid delays is also supported when we graphed the delays vs time of week. The least delays (white) occurred on Wednesday and Thursday mornings with arrival 7 minutes early on average, while the most delays (blue) happened the evenings of Thursdays, Fridays, and Sundays with arrivals being 10 minutes late on average.
Additionally, if we look at the delays over the entire year of our data set from April 2012 to March 2013, we notice that the summer and Christmas have the most delays (red). This is probably due to the heavier flights traffic during those times.
Age affecting performance
We used this query to obtain the n most delayed aircrafts (with ages):
db.flights.find({"age" : {"$exists" : true}}, {"_id" : 1 , "arrDelay" : 1 , "depDelay" : 1 , "carrier" : 1 , "origCity" : 1 , "destCity" : 1 , "age" : 1}).sort({"arrDelay" : -1, "depDelay" : -1}).limit(n);
In order to make this query run faster, we built a compound index on both attributes arrDelay and depDelay, {arrDelay : -1, depDelay : -1}
. See the indexing paragraph summary for more information.
Cascading delays
Another statistic we sought after was the number of cascading delays caused by an initial delay, as calculated in cascadingDelays.py. For every flight which started at least on time and arrived late, the Python script finds all the flights scheduled after that particular flight’s scheduled arrival and searches for the number of immediate, consecutive delays. One consecutive delay for aircraft 123 for flight ABC is determined to be the scheduled flight immediately after flight ABC also on the aircraft 123, where the delay reason is specified as a late aircraft arrival, and this next flight both departs and arrives late.
Initially, for every consecutive delay, the program recursively searched for the next scheduled flight to determine whether it was delayed. If it was, it would search for the next one, and so on. Unfortunately this reached Python’s maximum recursion limit, and it caused too many queries, which also slowed down the performance. Instead, we used another strategy: we first found all the aircraft that were delayed; then for each delayed aircraft, we queried the next 10 scheduled flights, and iterated through these results to find the number of consecutive flights delayed. Essentially, we wrote an iterative version of the initial recursive solution. This proved to be more efficient. See cascadingDelays.py for more documentation on the function.
Interestingly, each starting flight causes on average only 0.189 consecutive trip delays for the same aircraft.
Effects of Hurricane Sandy
Hurricane Sandy devastated much of New England last October, and New York City experienced one of its worst natural disasters.
Up to this point, our calculations mostly considered scheduled flights. Thus, we looked at the number of scheduled flights throughout the period around Hurricane Sandy. However, flights are scheduled months in advance, obviously without the knowledge of Sandy, so this measurement of scheduled flights was incorrect.
Our data set documented which flights were cancelled due to weather, and the number of flights cancelled due to weather would be the measurement for Hurricane Sandy’s effects.
The following aggregation returned the total number of NY flights cancelled due to weather for dates around which Hurricane Sandy occurred.
db.flights.aggregate( {"$match" : {"cancelCode" : 2, "month" : 10, "$or" : [{"origStateId" : "NY"}, {"destStateId" : "NY"}]}} , {"$group" : {"_id" : "$dayOfMonth", "sum" : {"$sum" : 1}}} , {"$sort" : {"_id" : 1}} );
We can tell that the dates during Hurricane Sandy (October 25 - October 31) have significantly higher weather delays (over 1000x times). The only caveat to this data is that there’s another super storm that a second super storm hit on November 4th, which caused numerous cancellations on its own. Therefore, the data is not isolated and wholly controlled.
DateNumber of Cancelled Flights In or Out of NYOctober 250October 260October 270October 28246October 291316October 301239October 31951November 1361November 270November 32November 4505November 5143November 613
Busiest states
The following query finds 10 states with the most departing flights:
db.flights.aggregate({"$group" : {"_id" : "$origState" , "flew" : {"$sum" : 1 }}} , {"$sort" : {"flew" : -1 }} , {"$limit" : 10} );
StateNumber of Flights Leaving from an Airport in this StateCalifornia760113Texas725487Florida446024Georgia411166Illinois410492New York268986Colorado260408Arizona206408North Carolina204447Virginia188618
A similar query with $destState
replacing $origState
will show the 10 states sorted by the by most arriving flights.
The results of these queries have been mapped in D3. The color key denotes yellow to have fewer flights and green states have more flights in the past year. There was no record of any flights to or from Delaware in the past year. The first map is solely of flights originating from a state and the second map is of flights destined for a state.
Worst states for delays
On a similar note, we also want to discover which states experience the worst delays. The following query finds the states with the highest average delays, which is also represented in the graph below:
db.flights.aggregate( {"$group" : {"_id" : "$origStateId" , delay : {"$avg": "$depDelay"}}} , {"$sort" : {"delay" : -1}} , {"$limit" : 10} );
Yellow in the maps represents no delay while red means extreme delay. The first map summarizes the states for departure delay while the second map summarizes the states for arrival delay.
Challenges/Lessons Learned
Trips spanning multiple days
A particular point of concern for the data was the presentation of departure and arrival times. In the given data set, they were shown in military time, which was not convenient for our calculations. Instead, the script changed them to BSON date values. Next day arrivals, or delays causing next day departures, caused problems converting from time-of-day to a datetime. The problem was rectified in rectified by the correctDays() function in inputMongo.py.
The merits of indexing
MongoDB stresses the importance of indexing, which we experienced first-hand. As our calculations gradually migrated toward delays and aggregations, it became evident that certain fields needed indexes. In the end, we ended up with indexes for the following fields:
- date
- arrDelay
- origStateId
- destStateId
- arrDelay, depDelay
On average, each index took around 2 minutes on average to create. But their sizes ranged from 120MB to 200MB. This is a worthwhile trade-off since all our queries ran much faster after creating the indexes.
Note: We created a compound index on the fields, arrDelay and depDelay ({arrDelay : -1, depDelay : -1}
). Furthermore, we created an index on depDelay alone but not on arrDelay because the prefixes of the compound index can satisfy queries on any prefix of its fields. See this page for more details.
Depending on the size of the machine and the relevant indexes, the query times differed significantly. Consider the query for all flights which depart on time but arrive late:
db.flights.aggregate( {"$match" : {"arrDelay" : {"$gt" : 0}, "depDelay" : {"$lte" : 0}}} , {"$group" : {"_id" : "$origStateId", "numFlights" : {"$sum" : 1}}} , {"$sort" : {"numFlights" : -1}} , {"$limit" : 5} );
This query would take about 20 minutes on a medium EC2 instance. Creating the compound index took 2 minutes and 20 seconds. However, with the index, the query finished in less than 5 seconds indexed.
Possible Improvements
- Make the field names shorter — saves a significant amount of document storage space. But the caveat is that shorter field names are less descriptive than longer ones. It would also not affect any of the queries we created.