Crunching 30 Years of NBA Data with MongoDB Aggregation

by Valeri Karpov@code_barbarianFebruary 14, 2014

When you are looking to run analytics on large and complex data sets, you might instinctively reach for Hadoop. However, if your data's in MongoDB, using the Hadoop connector seems like overkill if your data fits on your laptop. Luckily, MongoDB's built-in aggregation framework offers a quick solution for running sophisticated analytics right from your MongoDB instance without needing any extra setup.

As a lifelong basketball fan, I often daydreamed about being able to run sophisticated analyses on NBA stats. So, when the MongoDB Driver Days Hackathon came around and Ruby driver lead Gary Murakami suggested putting together an interesting data set, we sat down and spent an afternoon building and running a scraper for basketball-reference.com. The resulting data set contains the final score and box scores for every NBA regular season game since the 1985-1986 season.

In the aggregation framework documentation, we often use a zip code data set to illustrate the uses of the aggregation framework. However, crunching numbers about the population of the United States doesn't exactly captivate my imagination, and there are certain uses of the aggregation framework which the zip codes data set doesn't highlight as well as it could. Hopefully this data set will let you take a new look at the aggregation framework while you have some fun digging through NBA stats. You can download data set here and put it into your MongoDB instance using mongorestore.

Digging into the Data

First off, lets take a look at the structure of the data. There have been 31,686 NBA regular season games since 1985-86. Each individual document represents a game. Here is the high level metadata for the 1985-86 season opener between the Washington Bullets and the Atlanta Hawks:

{
    "_id" : ObjectId("52f29f91ddbd75540aba6dae"),
    "box" : [
        {
            "players" : [ ... ],
            "team" : { ... },
            "won" : 1
        },
        {
            "players" : [ ... ],
            "team" : { ... },
            "won" : 0
        }
    ],
    "date" : ISODate("1985-10-25T04:00:00Z"),
    "teams" : [
        {
            "name" : "Washington Bullets",
            "abbreviation" : "WSB",
            "score" : 100,
            "home" : false,
            "won" : 1
        },
        {
            "name" : "Atlanta Hawks",
            "abbreviation" : "ATL",
            "score" : 91,
            "home" : true,
            "won" : 0
        }
    ]
}

The document contains a rich box score subdocument, a date field, and information on the teams that played. We can see that the Bullets won 100-91 as the road team. The box score data is similarly broken down by team in an array, with the winning team first. Note that the won flag is a member of the top level box object, along with team and players.

{
    "_id" : ObjectId("52f29f91ddbd75540aba6dae"),
    "box" : [
        {
            "players" : [ ... ],
            "team" : {
        "ast" : 21,
                "blk" : 7,
                "drb" : 29,
                "fg" : 40,
                "fg3" : 0,
                "fg3_pct" : ".000",
                "fg3a" : 2,
                "fg_pct" : ".513",
                "fga" : 78,
                "ft" : 20,
                "ft_pct" : ".769",
                "fta" : 26,
                "mp" : 240,
                "orb" : 11,
                "pf" : 19,
                "pts" : 100,
                "stl" : 11,
                "tov" : 17,
                "trb" : 40
      },
            "won" : 1
        },
        {
            "players" : [ ... ],
            "team" : {
        "ast" : 25,
                "blk" : 5,
                "drb" : 28,
                "fg" : 41,
                "fg3" : 0,
                "fg3_pct" : ".000",
                "fg3a" : 3,
                "fg_pct" : ".446",
                "fga" : 92,
                "ft" : 9,
                "ft_pct" : ".500",
                "fta" : 18,
                "mp" : 240,
                "orb" : 16,
                "pf" : 23,
                "pts" : 91,
                "stl" : 8,
                "tov" : 16,
                "trb" : 44
      },
            "won" : 0
        }
    ],
    "date" : ISODate("1985-10-25T04:00:00Z"),
    "teams" : [
        {
            "name" : "Washington Bullets",
            "abbreviation" : "WSB",
            "score" : 100,
            "home" : false,
            "won" : 1
        },
        {
            "name" : "Atlanta Hawks",
            "abbreviation" : "ATL",
            "score" : 91,
            "home" : true,
            "won" : 0
        }
    ]
}

The box score for each team is further broken down by team statistics and player statistics. The team stats above show the cumulative statistics for the Atlanta Hawks, showing that they shot 41-92 from the field and an atrocious 9-18 from the line. The players array shows the same statistics, but broken down for an individual player. For example, below you'll see that the Hawks' star Dominique Wilkins scored 32 points on 15-29 shooting and recorded 3 steals.

{
    "_id" : ObjectId("52f29f91ddbd75540aba6dae"),
    "box" : [
        {
            "players" : [ ... ],
            "team" : {
        "ast" : 21,
                "blk" : 7,
                "drb" : 29,
                "fg" : 40,
                "fg3" : 0,
                "fg3_pct" : ".000",
                "fg3a" : 2,
                "fg_pct" : ".513",
                "fga" : 78,
                "ft" : 20,
                "ft_pct" : ".769",
                "fta" : 26,
                "mp" : 240,
                "orb" : 11,
                "pf" : 19,
                "pts" : 100,
                "stl" : 11,
                "tov" : 17,
                "trb" : 40
      },
            "won" : 1
        },
        {
            "players" : [
        {
                    "ast" : 4,
                    "blk" : 0,
                    "drb" : 1,
                    "fg" : 15,
                    "fg3" : 0,
                    "fg3_pct" : ".000",
                    "fg3a" : 2,
                    "fg_pct" : ".517",
                    "fga" : 29,
                    "ft" : 2,
                    "ft_pct" : ".400",
                    "fta" : 5,
                    "mp" : "42:00",
                    "orb" : 5,
                    "pf" : 2,
                    "player" : "Dominique Wilkins",
                    "pts" : 32,
                    "stl" : 3,
                    "tov" : 5,
                    "trb" : 6
                },
        ...
      ],
            "team" : {
        "ast" : 25,
                "blk" : 5,
                "drb" : 28,
                "fg" : 41,
                "fg3" : 0,
                "fg3_pct" : ".000",
                "fg3a" : 3,
                "fg_pct" : ".446",
                "fga" : 92,
                "ft" : 9,
                "ft_pct" : ".500",
                "fta" : 18,
                "mp" : 240,
                "orb" : 16,
                "pf" : 23,
                "pts" : 91,
                "stl" : 8,
                "tov" : 16,
                "trb" : 44
      },
            "won" : 0
        }
    ],
    "date" : ISODate("1985-10-25T04:00:00Z"),
    "teams" : [
        {
            "name" : "Washington Bullets",
            "abbreviation" : "WSB",
            "score" : 100,
            "home" : false,
            "won" : 1
        },
        {
            "name" : "Atlanta Hawks",
            "abbreviation" : "ATL",
            "score" : 91,
            "home" : true,
            "won" : 0
        }
    ]
}

Running Some Aggregations

At a high level, the MongoDB aggregation framework is exposed as a shell function called aggregate, which takes in a list of aggregation pipeline stages. Each stage of the pipeline operates on the results of the preceding stage, and each stage can filter and transform the individual documents.

Before we start doing some serious number crunching, lets start out with a simple sanity check and compute which 5 teams had the most wins in the 1999-2000 season. This can be achieved using a 6-stage pipeline:

Use the $match stage to limit ourselves to games that took place between August 1, 1999, and August 1, 2000, two dates that are sufficiently far removed from any NBA games to safely bound the season.
Use the $unwind stage to generate one document for each team in the game.
Use $match again to limit ourselves to teams that won.
Use the $group stage to count how many times a given team appears in the output of step 3.
Use the $sort stage to sort by number of wins, descending.
Use the $limit stage to limit ourselves to the 5 winningest teams.

The actual shell command is below. This command executes in essentially real-time on my laptop, even without any indices on the data, because there are only 31,686 documents in the collection.

db.games.aggregate([
  {
    $match : {
      date : {
        $gt : ISODate("1999-08-01T00:00:00Z"),
        $lt : ISODate("2000-08-01T00:00:00Z")
      }
    }
  },
  {
    $unwind : '$teams'
  },
  {
    $match : {
      'teams.won' : 1
    }
  },
  {
    $group : {
      _id : '$teams.name',
      wins : { $sum : 1 }
    }
  },
  {
    $sort : { wins : -1 }
  },
  {
    $limit : 5
  }
]);

Below is the output:

{ "_id" : "Los Angeles Lakers", "wins" : 67 }
{ "_id" : "Portland Trail Blazers", "wins" : 59 }
{ "_id" : "Indiana Pacers", "wins" : 56 }
{ "_id" : "Utah Jazz", "wins" : 55 }
{ "_id" : "San Antonio Spurs", "wins" : 53 }

We can expand on this simple example to answer the question of which team won the most games between the 2000-2001 season and the 2009-2010 season, by changing the $match step to limit ourselves to games that took place between August 1, 2000 and August 1, 2010. Turns out, the San Antonio Spurs won 579 games in that time period, narrowly beating the Dallas Mavericks' 568.

db.games.aggregate([
  {
    $match : {
      date : {
        $gt : ISODate("2000-08-01T00:00:00Z"),
        $lt : ISODate("2010-08-01T00:00:00Z")
      }
    }
  },
  {
    $unwind : '$teams'
  },
  {
    $match : {
      'teams.won' : 1
    }
  },
  {
    $group : {
      _id : '$teams.name',
      wins : { $sum : 1 }
    }
  },
  {
    $sort : { wins : -1 }
  },
  {
    $limit : 5
  }
]);

Below is the aggregation output:

{ "_id" : "San Antonio Spurs", "wins" : 579 }
{ "_id" : "Dallas Mavericks", "wins" : 568 }
{ "_id" : "Los Angeles Lakers", "wins" : 524 }
{ "_id" : "Phoenix Suns", "wins" : 495 }
{ "_id" : "Detroit Pistons", "wins" : 489 }

Correlating Stats With Wins

Lets do something a bit more interesting using a couple of aggregation operators that you don't often see when analyzing the zip codes data set: the $gte operator and the $cond operator in the $project stage. Lets use these to compute how often a team wins when they record more defensive rebounds than their opponent across the entire data set.

The tricky bit here is getting a notion of the difference between the winning team's defensive rebounding total and the losing team's defensive rebounding total. The aggregation framework makes computing the difference a bit tricky, but using $cond, we can transform the document so that the defensive rebounding total is negative if the team lost. We can then use $group to compute the defensive rebounding difference for each game. Lets walk through this step by step:

Use $unwind to get a document containing the box score for each team in the game.
Use $project with $cond to transform each document so the team's defensive rebounding total is negative if the team lost, as defined by the won flag.
Use $group and $sum to add up the rebounding totals for each game. Since the previous stage made the losing team's rebounding total negative, each document now has the difference between the winning team's defensive rebounds and the losing team's defensive rebounds.
Use $project and $gte to create a document which has a winningTeamHigher flag that is true if the winning team had more defensive rebounds than the losing team.
Use $group and $sum to compute for how many games winningTeamHigher was true.

db.games.aggregate([
  {
    $unwind : '$box'
  },
  {
    $project : {
      _id : '$_id',
      stat : {
        $cond : [
          { $gt : ['$box.won', 0] },
          '$box.team.drb',
          { $multiply : ['$box.team.drb', -1] }
        ]
      }
    }
  },
  {
    $group : {
      _id : '$_id',
      stat : { $sum : '$stat' }
    }
  },
  {
    $project : {
      _id : '$_id',
      winningTeamHigher : { $gte : ['$stat', 0] }
    }
  },
  {
    $group : {
      _id : '$winningTeamHigher',
      count : { $sum : 1 }
    }
  }
]);

Below is the output:

{ "_id" : false, "count" : 7693 }
{ "_id" : true, "count" : 23993 }

In other words, the team that recorded more defensive rebounds won 23,993 times and lost 7,693 times in this data set.

This result is pretty interesting: the team which recorded more defensive rebounds won about 75% of the time. To put this in perspective, the team that recorded more field goals than the other team only wins 78.8% of the time! Try rewriting the above aggregation for other statistics, such as field goals, 3 pointers, turnovers, etc. You'll find some rather interesting results. Offensive rebounds turn out to be a very bad predictor of which team won, as the team which recorded more offensive rebounds only won 51% of the time. 3 pointers turn out to be a very good predictor of which team won: the team which recorded more 3 pointers won about 64% of the time.

Defensive Rebounds and Total Rebounds Versus Win Percentage

Lets compute some data related to this that will be fun to graph. We're going to compute what percentage of the time a team wins as a function of the number of defensive rebounds they recorded. This aggregation is pretty simple, all we need to do is $unwind the box score, and use $group to compute the average value of the won flag across each different defensive rebounding total.

db.games.aggregate([
  {
    $unwind : '$box'
  },
  {
    $group : {
      _id : '$box.team.drb',
      winPercentage : { $avg : '$box.won' }
    }
  },
  {
    $sort : { _id : 1 }
  }
]);

And when we graph out the results of this aggregation, we can create a nice graph which neatly shows a pretty solid correlation between defensive rebounds and win percentage. An interesting factoid: the team that recorded the fewest defensive rebounds in a win was the 1995-96 Toronto Raptors, who beat the Milwaukee Bucks 93-87 on 12/26/1995 despite recording only 14 defensive rebounds.

We can pretty easily modify the above aggregation to compute the same breakdown for total rebounds (TRB) versus defensive rebounds, and see if we get a different result.

db.games.aggregate([
  {
    $unwind : '$box'
  },
  {
    $group : {
      _id : '$box.team.trb',
      winPercentage : { $avg : '$box.won' }
    }
  },
  {
    $sort : { _id : 1 }
  }
]);

And in fact we do! After about 53 total rebounds, the positive correlation between total rebounds and win percentage vanishes completely! The correlation is definitely not as strong here as it was for defensive rebounds. As an aside, the Cleveland Cavaliers beat the New York Knicks 101-97 on April 11, 1996, despite recording only 21 total rebounds. Inversely, the San Antonio Spurs lost to the Houston Rockets, 112-110, on January 4, 1992 despite recording 75 total rebounds.

Conclusion

I hope this blog post has gotten you as excited about the aggregation framework as I am. Once again, you can download the data set here, and you're very much encouraged to play with it yourself. I look forward to seeing what unique NBA analyses y'all will come up with.

Legal note: the attached data set is property of Sports Reference, LLC, and may only be used for education and evaluation under clause #1 of their terms of use. If you have not read or do not agree to Sports Reference, LLC's terms of use, please do not download the data set.

Found a typo or error? Open up a pull request! This post is available as markdown on Github