Hacking NHL.com's API Part 2: Deconstructing & Reconstructing the API Endpoint

Posted on May 15, 2020

We've cracked the API endpoint! We're basically at the part of Indiana Jones where he's found where the Temple of Doom is, but now we need to actually get through the Temple of Doom.

If you'll recall, the API endpoint we dug up in Part I (well worth a read, in my humble opinion) was as follows:

"https://api.nhle.com/stats/rest/en/skater/summary?isAggregate=false&isGame=false&sort=%5B%7B%22property%22:%22points%22,%22direction%22:%22DESC%22%7D,%7B%22property%22:%22goals%22,%22direction%22:%22DESC%22%7D,%7B%22property%22:%22assists%22,%22direction%22:%22DESC%22%7D%5D&start=0&limit=50&factCayenneExp=gamesPlayed%3E=1&cayenneExp=gameTypeId=2%20and%20seasonId%3C=20192020%20and%20seasonId%3E=20192020"

But what do we actually do with this link we've found?

Well, I am glad you asked! This post is all about loading the API endpoint into python, deconstructing the query string to understand how it asks for the data it wants, and then reconstructing it to ask for the boatload of data that we want. In this way we're basically like Edward Elric.

Shoutout to FMA: Brotherhood fans (highly recommended to those who have absolutely no idea what I'm talking about). But I digress. As a reminder, our goal is to pull NHL player data by season in order to create an autoregressive model to predict a player's goals over the next season.

Our first step is to load the data into python with the requests and json modules, using the script below (note that, compared to Part I's url, I've simplified the sort to goals only and bumped the limit to 100).

import pandas as pd
import requests
import json
import copy
import os

url = ('https://api.nhle.com/stats/rest/en/skater/summary'
       '?isAggregate=false'
       '&isGame=false'
       '&sort=%5B%7B%22property%22:%22goals%22,%22direction%22:%22DESC%22%7D%5D'
       '&start=0'
       '&limit=100'
       '&factCayenneExp=gamesPlayed%3E=1'
       '&cayenneExp=gameTypeId=2%20and%20'
           'seasonId%3C=20192020%20and%20seasonId%3E=20192020'
      )

response = requests.get(url)
data = json.loads(response.text)
data.keys()
>>> dict_keys(['data', 'total'])

Having received a 200 HTTP status code, we know we got the data. Thanks to the json module, the response is now a handy python dictionary, and calling data.keys() shows its top-level structure.
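
If you want to check that status explicitly, requests makes it a one-liner. This is just a defensive sketch and isn't required for anything that follows.

# Optional sanity check: raise an error if the request didn't succeed
response.raise_for_status()    # throws requests.HTTPError on any 4xx/5xx response
print(response.status_code)    # 200 for our successful call above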

We have two parts here. The data itself we can assume is the value under "data", while the value under "total" most likely contains some metadata we can ignore. So what type of format is the data under "data"?

type(data['data'])
>>> list
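
While we're poking around, we can also peek at the other piece. I'd guess "total" is the overall count of skaters matching the query, and the length of the list under "data" should be capped by our limit of 100. Both are assumptions worth a quick check (outputs omitted).

print(data['total'])        # likely the total number of skaters matching the query
print(len(data['data']))    # should be at most 100, thanks to our limit parameter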

So the data under "data" is a list. What is the structure of the entries in that list?

data['data'][0]
>>> {'assists': 21,
 'evGoals': 14,
 'evPoints': 31,
 'faceoffWinPct': 0.6923,
 'gameWinningGoals': 3,
 'gamesPlayed': 68,
 'goals': 18,
 'lastName': 'Beauvillier',
 'otGoals': 2,
 'penaltyMinutes': 15,
 'playerId': 8478463,
 'plusMinus': -11,
 'points': 39,
 'pointsPerGame': 0.57352,
 'positionCode': 'L',
 'ppGoals': 3,
 'ppPoints': 7,
 'seasonId': 20192020,
 'shGoals': 1,
 'shPoints': 1,
 'shootingPct': 0.13636,
 'shootsCatches': 'L',
 'shots': 132,
 'skaterFullName': 'Anthony Beauvillier',
 'teamAbbrevs': 'NYI',
 'timeOnIcePerGame': 1035.6029}

Boom! We have a list of dictionaries with player data. Here we have Anthony Beauvillier's data from the 2019-2020 season.

Having confirmed the data, let's now begin deconstructing the url. The go-to python library for this type of work is urllib, which has functions for encoding and decoding urls. Since we can tell the query string contains all the parameters for pulling from the API, our first step is to split the url into the base and the query string (check out wikipedia if you aren't familiar with url structure). We'll then load the query string into urllib.parse's parse_qs function, which returns a dictionary with the key-value pairs decoded.

import urllib.parse as up

url_base, qstr = url.split('?')
up.parse_qs(qstr)
>>> {'isAggregate': ['false'],
 'isGame': ['false'],
 'sort': ['[{"property":"goals","direction":"DESC"}]'],
 'start': ['0'],
 'limit': ['100'],
 'factCayenneExp': ['gamesPlayed>=1'],
 'cayenneExp': ['gameTypeId=2 and seasonId<=20192020 and seasonId>=20192020']}

Wow, that is refreshingly readable! Going through the keys, we can assume isAggregate and isGame ask for player data aggregated over a full career and broken out by individual game, respectively. We'll keep both of those false.

The sort parameter determines the sort order of the data. Using goals in descending order is fine here.

The start and limit parameters are the important ones for us, since they most likely control the pagination of the data: start is the offset into the result set where the page begins, and limit caps the response at 100 entries.
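
Assuming that reading is right, stepping through pages is just a matter of bumping start by limit each time. A quick sketch of the idea:

# Sketch of pagination, assuming start is a zero-based offset and limit is the page size
limit = 100
page_starts = [i * limit for i in range(3)]    # [0, 100, 200] covers the first 300 entries
# start=0 returns entries 0-99, start=100 returns 100-199, start=200 returns 200-299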

The last two were a little tricky for me. They are clearly delineating some filters for the data. After googling, I figured out that Cayenne is an Apache Java framework for object-relational mapping (ORM) to a database. If that seemed like the jargon-y bunch of nonsense it was, all you have to know is that the backend of this endpoint is a Java program that parses these expressions into SQL queries, pulls the data, and sends it back to us.

The first filter removes players that didn't play any games. We'll want to keep that. The second filter sets some gameTypeId (no idea what that is) along with the seasonId. We'll update that seasonId to pull data across seasons.
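
For instance, assuming the API accepts any valid season pair in that filter (I haven't verified every year), pulling the 2010-11 season instead would just mean swapping the bounds:

# Hypothetical cayenneExp filter for the 2010-11 season: same structure, different bounds
season_filter = 'gameTypeId=2 and seasonId<=20102011 and seasonId>=20102011'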

Before updating, let's create a function that can re-encode this dictionary back into the same query string format we had before. This took a bit of hacky code, as there were some discrepancies between the urllib parsing and encoding, but the following function fixes all that.

def urlencode_wrapper(qstrobj):
    # Re-encode the parsed dictionary, then undo the places where urlencode
    # is stricter than the original url: keep '=', ':' and ',' literal, and
    # strip the encoded "['" and "']" that appear because parse_qs wraps
    # every value in a list.
    return (up.urlencode(qstrobj, quote_via=up.quote)
        .replace('%3D', '=')
        .replace('%3A', ':')
        .replace('%2C', ',')
        .replace('%5B%27', '')
        .replace('%27%5D', '')
    )

assert(qstr == urlencode_wrapper(up.parse_qs(qstr)))

The assert statement confirms we are good to go.

Next, we'll swap the hard-coded season in the cayenneExp value for a named placeholder that we can fill in later. Then we'll create the generate_qstrobj function below, which copies our template object and updates the start field and the year in the cayenneExp field.

qstrobj = up.parse_qs(qstr)
qstrobj['cayenneExp'][0] = qstrobj['cayenneExp'][0].replace('20192020', '{year}')

def generate_qstrobj(qstrobj, start, year):
    qstrobj = copy.deepcopy(qstrobj)
    qstrobj['start'] = start
    qstrobj['cayenneExp'][0] = qstrobj['cayenneExp'][0].format(year=year)
    return urlencode_wrapper(qstrobj)
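
As a quick usage example (the start and year values here are arbitrary, purely for illustration), the query string for the second page of the 2018-19 season would be built like this:

# Example: second page (entries 100-199) of the 2018-19 season
example_qstr = generate_qstrobj(qstrobj, start=100, year='20182019')
example_url = url_base + '?' + example_qstr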

Now that we have our functions ready, let's loop through the seasons from 2000-01 through 2019-20 and pull the first 300 players for each season. To collect the data, we'll use the frames list. Before appending each response to that list, we'll convert it from a list of dictionaries to a pandas dataframe using the pandas.DataFrame.from_records function. This will allow us to easily concatenate the data into one dataset and then output it to a csv.

Once the loop finishes, we'll concatenate everything into a single dataframe and check the columns.

years = [str(x) + str(x + 1) for x in range(2000, 2020)]  # '20002001' through '20192020'
pagination = [x * 100 for x in range(3)]  # start offsets 0, 100, 200 -> first 300 players
frames = []

for year in years:
    for start in pagination:
        url = url_base + '?' + generate_qstrobj(qstrobj, start, year)
        response = requests.get(url)
        data = json.loads(response.text)
        frames.append(pd.DataFrame.from_records(data['data']))
df = pd.concat(frames, ignore_index=True)
df.columns
>>> Index(['assists', 'evGoals', 'evPoints', 'faceoffWinPct', 'gameWinningGoals',
       'gamesPlayed', 'goals', 'lastName', 'otGoals', 'penaltyMinutes',
       'playerId', 'plusMinus', 'points', 'pointsPerGame', 'positionCode',
       'ppGoals', 'ppPoints', 'seasonId', 'shGoals', 'shPoints', 'shootingPct',
       'shootsCatches', 'shots', 'skaterFullName', 'teamAbbrevs',
       'timeOnIcePerGame'],
      dtype='object')

Woah! That's a lot of columns. Let's take a look at a few of them, along with a few rows, below.

df.loc[0:15, ['skaterFullName', 'teamAbbrevs', 'seasonId', 'goals', 'assists']]
skaterFullName      teamAbbrevs  seasonId  goals  assists
Jaromir Jagr        PIT          20002001     52       69
Joe Sakic           COL          20002001     54       64
Patrik Elias        NJD          20002001     40       56
Alex Kovalev        PIT          20002001     44       51
Jason Allison       BOS          20002001     36       59
Martin Straka       PIT          20002001     27       68
Pavel Bure          FLA          20002001     59       33
Doug Weight         EDM          20002001     25       65
Ziggy Palffy        LAK          20002001     38       51
Peter Forsberg      COL          20002001     27       62
Alexei Yashin       OTT          20002001     40       48
Luc Robitaille      LAK          20002001     37       51
Bill Guerin         EDM,BOS      20002001     40       45
Mike Modano         DAL          20002001     33       51
Alexander Mogilny   NJD          20002001     43       40

This data looks great! All we have to do now is output it to a csv and we're good to get started on our autoregressive model. Tune in next time for that!

os.makedirs('data', exist_ok=True)  # make sure the output directory exists
filename = os.path.join('data', 'nhl_player_data_2000-2020.csv')
df.to_csv(filename, index=False)

Thanks for reading!