Hacking NHL.com's API Part 2: Deconstructing & Reconstructing the API Endpoint

Posted on May 15, 2020

We've cracked the API endpoint! We're basically at the part of Indiana Jones where he's found where the Temple of Doom is, but now we need to actually get through the Temple of Doom.

If you'll recall, the API endpoint we dug up in Part I (well worth a read, in my humble opinion) was as follows:

"https://api.nhle.com/stats/rest/en/skater/summary?isAggregate=false&isGame=false&sort=%5B%7B%22property%22:%22points%22,%22direction%22:%22DESC%22%7D,%7B%22property%22:%22goals%22,%22direction%22:%22DESC%22%7D,%7B%22property%22:%22assists%22,%22direction%22:%22DESC%22%7D%5D&start=0&limit=50&factCayenneExp=gamesPlayed%3E=1&cayenneExp=gameTypeId=2%20and%20seasonId%3C=20192020%20and%20seasonId%3E=20192020"

But what do we actually do with this link we've found?

Well, I am glad you asked! This post is all about loading the API endpoint into python, deconstructing the query string to understand how it asks for the data it wants, and then reconstructing it to ask for the boatload of data that we want. In this way we're basically like Edward Elric.

Shoutout to FMA: Brotherhood fans (highly recommended to those who have absolutely no idea what I'm talking about). But I digress. As a reminder, our goal is to pull NHL player data by season in order to create an autoregressive model to predict a player's goals over the next season.

Our first step is to load the data into python with the requests and json modules, using the script below (note that, compared to Part I's url, I've simplified the sort to goals only and bumped the limit to 100).

import pandas as pd
import requests
import json
import copy
import os

url = ('https://api.nhle.com/stats/rest/en/skater/summary'
       '?isAggregate=false'
       '&isGame=false'
       '&sort=%5B%7B%22property%22:%22goals%22,%22direction%22:%22DESC%22%7D%5D'
       '&start=0'
       '&limit=100'
       '&factCayenneExp=gamesPlayed%3E=1'
       '&cayenneExp=gameTypeId=2%20and%20'
           'seasonId%3C=20192020%20and%20seasonId%3E=20192020'
      )

response = requests.get(url)
data = json.loads(response.text)
data.keys()
>>> dict_keys(['data', 'total'])

Having received a 200 HTTP status code, we know we got the data. Thanks to the json module, the response is now a handy python dictionary, and calling data.keys() shows its top-level structure.
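
If you want to check that status explicitly, requests makes it a one-liner. This is just a defensive sketch and isn't required for anything that follows.

# Optional sanity check: raise an error if the request didn't succeed
response.raise_for_status()    # throws requests.HTTPError on any 4xx/5xx response
print(response.status_code)    # 200 for our successful call above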

We have two parts here. The data itself we can assume is the value under "data", while the value under "total" most likely contains some metadata we can ignore. So what type of format is the data under "data"?

type(data['data'])
>>> list
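
While we're poking around, we can also peek at the other piece. I'd guess "total" is the overall count of skaters matching the query, and the length of the list under "data" should be capped by our limit of 100. Both are assumptions worth a quick check (outputs omitted).

print(data['total'])        # likely the total number of skaters matching the query
print(len(data['data']))    # should be at most 100, thanks to our limit parameter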

So the data under "data" is a list. What is the structure of the entries in that list?

data['data'][0]
>>> {'assists': 21,
 'evGoals': 14,
 'evPoints': 31,
 'faceoffWinPct': 0.6923,
 'gameWinningGoals': 3,
 'gamesPlayed': 68,
 'goals': 18,
 'lastName': 'Beauvillier',
 'otGoals': 2,
 'penaltyMinutes': 15,
 'playerId': 8478463,
 'plusMinus': -11,
 'points': 39,
 'pointsPerGame': 0.57352,
 'positionCode': 'L',
 'ppGoals': 3,
 'ppPoints': 7,
 'seasonId': 20192020,
 'shGoals': 1,
 'shPoints': 1,
 'shootingPct': 0.13636,
 'shootsCatches': 'L',
 'shots': 132,
 'skaterFullName': 'Anthony Beauvillier',
 'teamAbbrevs': 'NYI',
 'timeOnIcePerGame': 1035.6029}

Boom! We have a list of dictionaries with player data. Here we have Anthony Beauvillier's data from the 2019-2020 season.

Having confirmed the data, let's now begin deconstructing the url. The go-to python library for this type of work is urllib, which has functions for encoding and decoding urls. Since we can tell the query string contains all the parameters for pulling from the API, our first step is to split the url into the base and the query string (check out wikipedia if you aren't familiar with url structure). We'll then load the query string into urllib.parse's parse_qs function, which returns a dictionary with the key-value pairs decoded.

import urllib.parse as up

url_base, qstr = url.split('?')
up.parse_qs(qstr)
>>> {'isAggregate': ['false'],
 'isGame': ['false'],
 'sort': ['[{"property":"goals","direction":"DESC"}]'],
 'start': ['0'],
 'limit': ['100'],
 'factCayenneExp': ['gamesPlayed>=1'],
 'cayenneExp': ['gameTypeId=2 and seasonId<=20192020 and seasonId>=20192020']}

Wow, that is refreshingly readable! Going through the keys, we can assume isAggregate and isGame ask for player data aggregated over a full career and broken out by individual game, respectively. We'll keep both of those false.

The sort parameter determines the sort order of the data. Using goals in descending order is fine here.

The start and limit parameters are the important ones for us, since they most likely control the pagination of the data: start is the offset into the result set where the page begins, and limit caps the response at 100 entries.
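
Assuming that reading is right, stepping through pages is just a matter of bumping start by limit each time. A quick sketch of the idea:

# Sketch of pagination, assuming start is a zero-based offset and limit is the page size
limit = 100
page_starts = [i * limit for i in range(3)]    # [0, 100, 200] covers the first 300 entries
# start=0 returns entries 0-99, start=100 returns 100-199, start=200 returns 200-299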

The last two were a little tricky for me. They are clearly delineating some filters for the data. After googling, I figured out that Cayenne is an Apache Java framework for object-relational mapping (ORM) to a database. If that seemed like the jargon-y bunch of nonsense it was, all you have to know is that the backend of this endpoint is a Java program that parses these expressions into SQL queries, pulls the data, and sends it back to us.

The first filter removes players that didn't play any games. We'll want to keep that. The second filter sets some gameTypeId (no idea what that is) along with the seasonId. We'll update that seasonId to pull data across seasons.
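
For instance, assuming the API accepts any valid season pair in that filter (I haven't verified every year), pulling the 2010-11 season instead would just mean swapping the bounds:

# Hypothetical cayenneExp filter for the 2010-11 season: same structure, different bounds
season_filter = 'gameTypeId=2 and seasonId<=20102011 and seasonId>=20102011'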

Before updating, let's create a function that can re-encode this dictionary back into the same query string format we had before. This took a bit of hacky code, as there were some discrepancies between the urllib parsing and encoding, but the following function fixes all that.

def urlencode_wrapper(qstrobj):
    # Re-encode the parsed dictionary, then undo the places where urlencode
    # is stricter than the original url: keep '=', ':' and ',' literal, and
    # strip the encoded "['" and "']" that appear because parse_qs wraps
    # every value in a list.
    return (up.urlencode(qstrobj, quote_via=up.quote)
        .replace('%3D', '=')
        .replace('%3A', ':')
        .replace('%2C', ',')
        .replace('%5B%27', '')
        .replace('%27%5D', '')
    )

assert(qstr == urlencode_wrapper(up.parse_qs(qstr)))

The assert statement confirms we are good to go.

Next, we'll swap the hard-coded season in the cayenneExp value for a named placeholder that we can fill in later. Then we'll create the generate_qstrobj function below, which copies our template object and updates the start field and the year in the cayenneExp field.

qstrobj = up.parse_qs(qstr)
qstrobj['cayenneExp'][0] = qstrobj['cayenneExp'][0].replace('20192020', '{year}')

def generate_qstrobj(qstrobj, start, year):
    qstrobj = copy.deepcopy(qstrobj)
    qstrobj['start'] = start
    qstrobj['cayenneExp'][0] = qstrobj['cayenneExp'][0].format(year=year)
    return urlencode_wrapper(qstrobj)
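
As a quick usage example (the start and year values here are arbitrary, purely for illustration), the query string for the second page of the 2018-19 season would be built like this:

# Example: second page (entries 100-199) of the 2018-19 season
example_qstr = generate_qstrobj(qstrobj, start=100, year='20182019')
example_url = url_base + '?' + example_qstr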

Now that we have our functions ready, let's loop through the seasons from 2000-01 through 2019-20 and pull the first 300 players for each season. To collect the data, we'll use the frames list. Before appending each response to that list, we'll convert it from a list of dictionaries to a pandas dataframe using the pandas.DataFrame.from_records function. This will allow us to easily concatenate the data into one dataset and then output it to a csv.

Once the loop finishes, we'll concatenate everything into a single dataframe and check the columns.

years = [str(x) + str(x + 1) for x in range(2000, 2020)]  # '20002001' through '20192020'
pagination = [x * 100 for x in range(3)]  # start offsets 0, 100, 200 -> first 300 players
frames = []

for year in years:
    for start in pagination:
        url = url_base + '?' + generate_qstrobj(qstrobj, start, year)
        response = requests.get(url)
        data = json.loads(response.text)
        frames.append(pd.DataFrame.from_records(data['data']))
df = pd.concat(frames, ignore_index=True)
df.columns
>>> Index(['assists', 'evGoals', 'evPoints', 'faceoffWinPct', 'gameWinningGoals',
       'gamesPlayed', 'goals', 'lastName', 'otGoals', 'penaltyMinutes',
       'playerId', 'plusMinus', 'points', 'pointsPerGame', 'positionCode',
       'ppGoals', 'ppPoints', 'seasonId', 'shGoals', 'shPoints', 'shootingPct',
       'shootsCatches', 'shots', 'skaterFullName', 'teamAbbrevs',
       'timeOnIcePerGame'],
      dtype='object')

Woah! That's a lot of columns. Let's take a look at a few of them, along with a few rows, below.

df.loc[0:15, ['skaterFullName', 'teamAbbrevs', 'seasonId', 'goals', 'assists']]
skaterFullName      teamAbbrevs  seasonId  goals  assists
Jaromir Jagr        PIT          20002001     52       69
Joe Sakic           COL          20002001     54       64
Patrik Elias        NJD          20002001     40       56
Alex Kovalev        PIT          20002001     44       51
Jason Allison       BOS          20002001     36       59
Martin Straka       PIT          20002001     27       68
Pavel Bure          FLA          20002001     59       33
Doug Weight         EDM          20002001     25       65
Ziggy Palffy        LAK          20002001     38       51
Peter Forsberg      COL          20002001     27       62
Alexei Yashin       OTT          20002001     40       48
Luc Robitaille      LAK          20002001     37       51
Bill Guerin         EDM,BOS      20002001     40       45
Mike Modano         DAL          20002001     33       51
Alexander Mogilny   NJD          20002001     43       40

This data looks great! All we have to do now is output it to a csv and we're good to get started on our autoregressive model. Tune in next time for that!

os.makedirs('data', exist_ok=True)  # make sure the output directory exists
filename = os.path.join('data', 'nhl_player_data_2000-2020.csv')
df.to_csv(filename, index=False)

Thanks for reading!