$sample Football 2016: The Winner’s in the Data

Jacob Ribnik

Technical
Facebook ShareLinkedin ShareReddit ShareTwitter Share

Introduction

Denver and Carolina have earned berths in this year’s main event. I don’t have a horse in this race -- I’m a San Diego fan -- and I thought it would be fun to develop an unbiased data-driven approach to predicting this year's champion using nothing more than the world’s most popular non-relational database, MongoDB.

Huddle up fellow nerds, here’s the play: There’s a wealth of play-by-play data available here for download. The first step in any data analysis is to actually look at the data to determine what you have to work with and whether the data makes any sense. For this we’ll utilize MongoDB’s new data visualization tool, MongoDB Compass, and try to identify some variables that correlate strongly with success in America’s most popular sport thirty years running. After that, we’ll utilize the new aggregation $sample operator to randomly select a game’s worth of plays from actual plays run during the regular season, and with them simulate an ensemble of championships to determine the probable winner.

Ready? Break!

Looking at the data

In order to use MongoDB Compass to start exploring the data, we must first import the data into MongoDB. The data is available for download as CSV files, which is generally a blessing since the mongoimport tool natively supports this format. Unfortunately, the process proved non-trivial this time around:

$ mongoimport --type=csv --headerline -d nfl -c pbp_2015 pbp-2015.csv
2016-01-26T21:40:44.549-0500    Failed: fields cannot be identical: '' and ''
2016-01-26T21:40:44.549-0500    imported 0 documents

Looking at the first line of the CSV file, which contains the field names, we see that there are a number of null values. A null field name isn’t a problem for a flexible schema database like MongoDB -- though we’d encourage you to rethink your schema if you have one -- but the presence of multiple non-unique field names breaks mongoimport, as seen in the above output. Oh well. Rather than try to cleanup the CSV for use with mongoimport, I’ve elected to write a script as it will give us a chance to massage the data upon entry, and, frankly, it’s a lot more fun! Here’s the script:

#!/usr/bin/env python

import csv
import dateutil.parser
import pymongo
import sys

if len(sys.argv) != 2:
    print("usage: %s csvfile" % sys.argv[0])
    sys.exit(1)

# Open the CSV file
try:
    csvfile = open(sys.argv[1], 'r')
except IOError as e:
    print("error: unable to read %s" % sys.argv[1])
    sys.exit(2)

# Connect to MongoDB
try:
    mongo = pymongo.MongoClient()
    # NOTE dashes in collection names are problematic in the shell, hence the
    # conversion to underscores
    collection = mongo.nfl[sys.argv[1].split('.')[0].replace('-', '_')]
except pymongo.errors.PyMongoError as e:
    print("error: unable to connect to MongoDB")
    sys.exit(3)

try:
    reader = csv.reader(csvfile, escapechar='\\')
    # The header line contains field names
    keys = reader.next()

    # The CSV contains multiple null fields. We are removing these by
    # identifying their positions and skipping their positions on each line.
    nulls = {int(i) for i in range(len(keys)) if keys[i] == ""}

    for vals in reader:
        try:
            assert(len(keys) == len(vals))
        except AssertionError as e:
            print("error: unable to process %s as csv" % sys.argv[1])
            sys.exit(4)

        doc = {}
        for i in range(len(vals)):
            # Is this a null position?
            if i in nulls:
                continue

            val = vals[i]

            # Convert GameDate string to datetime
            if keys[i] == "GameDate":
                try:
                    val = dateutil.parser.parse(val)
                except ValueError as e:
                    pass
            else:
                # Save integers as such instead of strings
                try:
                    val = int(val)
                except ValueError as e:
                    pass

            doc[keys[i]] = val

        collection.insert_one(doc)
except csv.Error as e:
    print("error: " + e)
    sys.exit(5)

Let me know if you have any questions by commenting below.

Now that the data is in MongoDB we can take a look at it with Compass. The Compass interface gives a pretty good overview of the data without the need to have any understanding of the MongoDB Query Language.

Taking a closer look at the left-hand section of the U.I., it tells us the instance we are exploring, the version of MongoDB this instance is running (3.2.1 in this case), and both the databases and collections on the instance.

Moving our focus to the center, we see the sampling information from the pbp_2015 collection. Compass derived our schema from a random sample of the 46,277 documents in the collection. We see the field names along with their data types, as well as distributions of the field values.

Finally on the right-hand side, I have opened the document viewer section which allows me to examine the individual documents in the sample set.

We can see that each document describes a single play in a game, with fields like “Yards”, “Down”, “Quarter”, “IsFumble” and “IsInterception”. Perhaps the most interesting aspect of the data is what isn’t there. The documents contain no “Score” field, so it’s not immediately evident which team won the game. That’s okay because we want to assess each team’s performance on a play-by-play basis. Looking at the data for this year’s most successful teams (Carolina and Denver) and least successful teams (Tennessee and Cleveland), we find that winning teams have offenses that gain a lot of yards and commit few turnovers. These are characteristics of a good team, any football fan would blindly attest to that.

Yards good, turnovers bad. Great. Here’s our first order attempt to predict a team’s final score utilizing these metrics: If every possession begins at the twenty yard line, then eighty yards of offense is required for a touchdown. Assuming the extra point is made -- a big assumption now that it's no longer a gimme -- a team scores seven points for every eighty yards gained. To penalize an offense for sloppy play, we’ll negate a touchdown for each turnover committed. Our formula for a team’s final score is thus (YARDS / 80 - TURNOVERS) * 7.

Clearly, this recipe for success on the gridiron is overly simplistic. Defense wins games, but we don’t take a team’s defensive prowess into account at all in our formulation. We’ll leave this as an exercise for the reader! Rather than attribute turnovers to good defense, we attribute them to poor offense. The challenge is to introduce a correlation between one team’s offense and the other’s defense in the simulation pipeline and final score determination. The final score might then look something like (YARDS_1 / 80 - TURNOVERS_1) * 7 + TURNOVERS_2 * 3, where _1 indicates a value for one team, and _2 the competing team. This formulation, in essence, awards a team a field goal for each turnover forced by their defense. What effect will this have on our predictions? We look forward to your submissions! Until then, offense puts points on the board.

Aggregation simulations

Time to kickoff… our aggregation simulations! Our pipeline consists of four stages. The first is a $match that filters from the regular season play-by-play data only the pass and rush plays performed by the offensive team in question. It looks like this:

    match = {"$match":
        {
            "OffenseTeam": "DEN",
            "$or": [{"PlayType": "PASS"}, {"PlayType": "RUSH"}],
            "IsNoPlay": 0
        }
    }

The additional {“IsNoPlay”: 0} requirement ensures that these are not so-called no plays in which either team was penalized during the course of the play rendering the true outcome of the play unknown.

The second stage is the $sample stage. $sample was introduced in MongoDB 3.2 and simply reduces the upstream resultset to a random selection therein of a specified size. For each championship game simulation, we’ll allow each team sixty offensive plays, a rough estimate of the average value. The stage looks like this:

    sample = {"$sample":
        {
            "size": 60
        }
    }

The next stage is a $project stage. It’s fairly trivial, actually, serving only to combine the number of interceptions and the number of fumbles into a single number of turnovers. Here it is:

    proj = {"$project":
        {
            "team": "$OffenseTeam",
            "yards": "$Yards",
            "turnovers": {"$sum": ["$IsInterception", "$IsFumble"]}
        }
    }

Note that we must remember to include in the projection any fields we’ll require in future logic, but we are free to change the names to something more palpable; in this case, OffenseTeam becomes team and Yards becomes yards.

The final stage in the pipeline is a $group stage that adds up the total yards and turnovers from the sixty randomly selected plays. It looks like this:

    group = {"$group":
        {
            "_id": "$team",
            "yards": {"$sum": "$yards"},
            "turnovers": {"$sum": "$turnovers"}
        }
    }

Each aggregation represents a simulated game for a given offense. Therefore, to create an ensemble of complete game simulations, we perform the aggregation once for each team, store the result, and do this one hundred thousand times. Here’s the complete code snippet executed directly in the mongo shell:

for (var i = 0; i < 100000; i++) {
    match = {"$match":
        {
            "OffenseTeam": "DEN",
            "$or": [{"PlayType": "PASS"}, {"PlayType": "RUSH"}],
            "IsNoPlay": 0
        }
    }
    sample = {"$sample":
        {
            "size": 60
        }
    }
    proj = {"$project":
        {
            "team": "$OffenseTeam",
            "yards": "$Yards",
            "turnovers": {"$sum": ["$IsInterception", "$IsFumble"]}
        }
    }
    group = {"$group":
        {
            "_id": "$team",
            "yards": {"$sum": "$yards"},
            "turnovers": {"$sum": "$turnovers"}
        }
    }

    res = db.pbp_2015.aggregate([match, sample, proj, group])
    simulation_doc = res['result'][0]
    simulation_doc["team"] = simulation_doc["_id"]
    simulation_doc["i"] = i
    delete simulation_doc["_id"]
    db.pbp_2015_simulations.insert(simulation_doc)

    match["$match"]["OffenseTeam"] = "CAR"
    res = db.pbp_2015.aggregate([match, sample, proj, group])
    simulation_doc = res['result'][0]
    simulation_doc["team"] = simulation_doc["_id"]
    simulation_doc["i"] = i
    delete simulation_doc["_id"]
    db.pbp_2015_simulations.insert(simulation_doc)
}

Here’s an example game simulation that resulted from running the above code:

> db.pbp_2015_simulations.find({"i": 0})
{
  "_id": ObjectId("56a7e1bc87776355710ff107"),
  "yards": 479,
  "turnovers": 2,
  "team": "DEN",
  "i": 0
}
{
  "_id": ObjectId("56a7e1bc87776355710ff108"),
  "yards": 444,
  "turnovers": 2,
  "team": "CAR",
  "i": 0
}

In this case, if we plug the numbers into our formula, we see that Denver won 28 to 25. However, one simulated win does not a champion make! We use a new aggregation pipeline to do the grunt work of analyzing all one hundred thousand games and produce a single final result. Here are the four stages in this new pipeline:

proj0 = {"$project":
            {
                "i": 1,
                "DEN_score": {"$sum": {"$cond": [{"$eq": ["$team", "DEN"]}, {"$multiply": [{"$subtract": [{"$divide": ["$yards", 80]}, "$turnovers"]}, 7]}, 0]}},
                "CAR_score": {"$sum": {"$cond": [{"$eq": ["$team", "CAR"]}, {"$multiply": [{"$subtract": [{"$divide": ["$yards", 80]}, "$turnovers"]}, 7]}, 0]}}
            }
        }
group0 = {"$group":
            {
                "_id": "$i",
                "DEN_score": {"$sum": "$DEN_score"},
                "CAR_score": {"$sum": "$CAR_score"}
            }
        }
proj1 = {"$project":
            {
                "DEN_score": 1,
                "DEN_win": {"$cond": [{"$gt": ["$DEN_score", "$CAR_score"]}, 1, 0]},
                "CAR_score": 1,
                "CAR_win": {"$cond": [{"$gt": ["$CAR_score", "$DEN_score"]}, 1, 0]}
            }
        }
group1 = {"$group":
            {
                "_id": 1,
                "DEN_avg_score": {"$avg": "$DEN_score"},
                "DEN_avg_score_std": {"$stdDevSamp": "$DEN_score"},
                "DEN_win_pct": {"$avg": "$DEN_win"},
                "CAR_avg_score": {"$avg": "$CAR_score"},
                "CAR_avg_score_std": {"$stdDevSamp": "$CAR_score"},
                "CAR_win_pct": {"$avg": "$CAR_win"}
            }
        }

The first $project stage employs our formula; the following $group stage creates a single document from the separate simulations; the second $project stage determines which team won by comparing the scores; and the final $group stage averages the results of all one hundred thousand simulated games.

Testing our theory

Let’s see how well our method would have done in predicting the outcomes of the last two championship games, as we have the complete play-by-play data from those seasons as well.

In 2014, Seattle absolutely dismantled Denver by a final score of 43 to 8. What would our simulations have yielded? Here’s the result employing the exact same method:

{
  "waitedMS": NumberLong("0"),
  "result": [
    {
      "_id": 1,
      "DEN_avg_score": 28.743076249999753,
      "DEN_avg_score_std": 9.011384256852498,
      "DEN_win_pct": 0.58883,
      "SEA_avg_score": 25.868169249999788,
      "SEA_avg_score_std": 9.032120640996355,
      "SEA_win_pct": 0.40855
    }
  ],
  "ok": 1
}

Denver won 59% of the simulations, by an average score of 2.8 points. Yikes. That does not bode well for our method. Hindsight is 20-20 I guess. How did Vegas fare that year? The Vegas line was for a Denver victory by 2.5 points and an over/under of 47.5. Not bad! How’d we do the following year when New England bested Seattle 28-24?

{
  "waitedMS": NumberLong("0"),
  "result": [
    {
      "_id": 1,
      "NE_avg_score": 25.848926249999767,
      "NE_avg_score_std": 7.987140220020844,
      "NE_win_pct": 0.45769,
      "SEA_avg_score": 27.127344999999792,
      "SEA_avg_score_std": 9.1468726553133,
      "SEA_win_pct": 0.53967
    }
  ],
  "ok": 1
}

It’s essentially a tossup, Seattle winning 54% of the time and New England the other 46%. And what did Vegas say? The line was even, each team equally likely to lose.

Great. Our method is as good as Vegas at predicting the outcomes of these games. That’s terrific considering the House always wins in the long run.

Picking our winner

And the winner is (will be)… Carolina!

{
  "waitedMS": NumberLong("0"),
  "result": [
    {
      "_id": 1,
      "DEN_avg_score": 19.734966125000213,
      "DEN_avg_score_std": 10.506752595928164,
      "DEN_win_pct": 0.33637,
      "CAR_avg_score": 25.478733875000064,
      "CAR_avg_score_std": 9.017951575045746,
      "CAR_win_pct": 0.66156
    }
  ],
  "ok": 1
}

Carolina won 66% of our simulated games, by an average score of six points. Well, that was fun. I guess we’ll see how our prediction fares this weekend.

Until then, it is with as much confidence as a Vegas bookie that we set the line on this year’s championship game at -6 to Carolina. At the time of this writing, Vegas has the line at -5.5 points.

Enjoy the game, everyone!


Have a better system for picking winners? Prove it! Join the Advocacy Hub and see if you can modify Jacob’s method for predicting the winner.
Join the Advocacy Hub


About the Author - Jacob Ribnik

Jacob hails from Southern California and enjoys wearing sandals to work. A rabid UCLA Bruins fan and alumnus, he went on to earn a PhD in physics from UCSB where he studied the origins of the universe through collisions of high energy particles at the Large Hadron Collider at CERN. After discovering the Higgs boson, he left academia to build a data analytics infrastructure on top of MongoDB for We-Care.com, and he liked that so much he ended up at MongoDB.

At MongoDB, Jacob is the self-proclaimed world's foremost expert on Saved by the Bell trivia knowledge, and Senior Technical Services Engineer on the Proactive Support Team. There he develops software to identify issues in customer deployments before they become problematic, along with internal tools used by the Support Team to deliver the best customer support experience that exists in enterprise software.

While his computer is charging, he enjoys making pottery, scuba diving, talking to dogs, and being generally merry.