NHL Player Goals by Season AutoRegressive Analysis

Posted on May 30, 2020

Time series data is its own particular type of beast. It's like that friend who needs a little more attention and has to be treated their own special way. One special way to deal with time series data is through autoregressive modeling (FANCY TERM ALERT!).

An autoregressive model is a regression model that uses values from previous periods of a time series as predictors for the dependent variable. It takes the form of the equation below.

\begin{equation*} y_{t}=\beta_{0}+\beta_{1}y_{t-1}+\epsilon_{t}. \end{equation*}

In this regression model, the response variable from the previous time period has become the predictor. We describe autoregressive models by their order: the number of preceding values in the series used to predict the value at the present time. The equation above uses one preceding value, so it is written as AR(1). As in a simple linear regression model, autoregressive models are subject to our usual assumptions about the errors.

First things first, how do we lag the dependent variable using Python? Well, lucky for you, I have created a function using pandas' shift function. The basis for this function comes from Tom Augspurger's Modern Pandas blog (a must-read for any heavy pandas user).

import pandas as pd


def create_lags(input_df, metric, n=5, include_metric_name_in_columns=False):
    # Assumes input_df has one row per player-season, sorted by season
    # within each player.
    df = input_df.copy()
    # shift(0) is the current-season value, shift(1) is last season, etc.,
    # computed within each player's career so lags never cross players.
    frames = [df.groupby('skaterFullName')[metric].shift(i) for i in range(n + 1)]
    if include_metric_name_in_columns:
        keys = [metric] + [metric + '_L%s' % i for i in range(1, n + 1)]
    else:
        keys = ['y'] + ['L%s' % i for i in range(1, n + 1)]
    # Join the shifted series side by side and drop rows that lack a
    # full n-season history.
    df = pd.concat(frames, axis=1, keys=keys).dropna()
    return df

This function takes an input_df and a metric and lags that variable n times, joining all the lags together. The resulting dataframe contains the current value of the metric plus n lagged columns. The below table image illustrates how goals would be lagged for Mark Messier: the green boxes show how his rookie-year goals are lagged across the next 5 years, and the blue boxes show how his 7th-year goals are lagged across the next 5 years.
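
If you want to kick the tires yourself, here's a minimal, hypothetical example; the skaterFullName and goals columns match what the function expects, but the numbers are made up:

import pandas as pd

# Hypothetical skater-season data (values are illustrative, not real stats).
skater_df = pd.DataFrame({
    'skaterFullName': ['Mark Messier'] * 7,
    'seasonId': list(range(1980, 1987)),
    'goals': [12, 23, 48, 37, 26, 50, 35],
})

lagged = create_lags(skater_df, 'goals', n=5)
# lagged has columns y, L1, ..., L5; the first five seasons are dropped
# by dropna() because they lack a full 5-year history.
print(lagged)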

Now let's get to the fun modeling part. First, let's try an AR(3) goals model.
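The summaries below are statsmodels OLS output, and the C(positionCode) terms later in the post point to the formula API. A sketch of how such a fit might look, assuming the lagged dataframe from create_lags above:

import statsmodels.formula.api as smf

# Regress this year's goals on the previous three seasons' goals.
# 'lagged' is assumed to come from create_lags(skater_df, 'goals', n=3).
ar3_model = smf.ols('y ~ L1 + L2 + L3', data=lagged).fit()
print(ar3_model.summary())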

AR(3) Model Results

Dep. Variable:     y                  R-squared:           0.589
Model:             OLS                Adj. R-squared:      0.589
Method:            Least Squares      F-statistic:         6065.
Date:              Tue, 02 Jun 2020   Prob (F-statistic):  0.00
Time:              09:19:12           Log-Likelihood:      -41399.
No. Observations:  12717              AIC:                 8.281e+04
Df Residuals:      12713              BIC:                 8.284e+04
Df Model:          3
Covariance Type:   nonrobust

                 coef   std err        t    P>|t|   [0.025   0.975]
Intercept      1.2496     0.085   14.651    0.000    1.082    1.417
L1             0.4560     0.008   53.933    0.000    0.439    0.473
L2             0.2154     0.009   23.877    0.000    0.198    0.233
L3             0.1420     0.008   17.486    0.000    0.126    0.158

Omnibus:        1599.138   Durbin-Watson:     2.097
Prob(Omnibus):  0.000      Jarque-Bera (JB):  5318.147
Skew:           0.637      Prob(JB):          0.00
Kurtosis:       5.901      Cond. No.:         37.3


We can see that this model explains roughly 58.9% of the variance in a player's goals. Not a bad result. Interpretation: knowing a player's goals from the last three years lets us make a guess that is 58.9% better (in terms of squared error) than just guessing the mean goals for all players. All variables have p-values less than 5%, so we can assume they are useful in predicting this year's goals. In fact, all variables have p-values of 0.000! Let's get a little wild with an AR(5) model.

AR(5) Model Results

Dep. Variable:     y                  R-squared:           0.623
Model:             OLS                Adj. R-squared:      0.622
Method:            Least Squares      F-statistic:         2866.
Date:              Tue, 02 Jun 2020   Prob (F-statistic):  0.00
Time:              09:19:12           Log-Likelihood:      -28028.
No. Observations:  8695               AIC:                 5.607e+04
Df Residuals:      8689               BIC:                 5.611e+04
Df Model:          5
Covariance Type:   nonrobust

                 coef   std err        t    P>|t|   [0.025   0.975]
Intercept      0.4006     0.106    3.764    0.000    0.192    0.609
L1             0.4161     0.010   40.694    0.000    0.396    0.436
L2             0.2154     0.011   20.268    0.000    0.195    0.236
L3             0.1577     0.010   15.131    0.000    0.137    0.178
L4             0.0625     0.010    6.088    0.000    0.042    0.083
L5             0.0029     0.009    0.313    0.754   -0.015    0.021

Omnibus:        874.612    Durbin-Watson:     2.065
Prob(Omnibus):  0.000      Jarque-Bera (JB):  2677.525
Skew:           0.528      Prob(JB):          0.00
Kurtosis:       5.505      Cond. No.:         53.9


This resulted in a 6% increase in our R2, from 0.589 to 0.623. Note, though, that L5 is not significant in this model (p-value of 0.754). We'll stop at AR(5) because it has good explanatory value and because players typically renegotiate contracts after 5 years. That makes 5 years a good horizon for using this model to predict next year's goals as a metric for determining a player's contract value. Let's add a variable for the season number for that player (i.e. their rookie year is 1, their second year is 2, etc.).
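
For reference, here's one way you might derive season_number and feed it into the model. This is a sketch: it assumes one row per player-season sorted chronologically, and relies on create_lags preserving the original index so the join lines up.

# Number each player's seasons 1, 2, 3, ... in chronological order.
skater_df['season_number'] = skater_df.groupby('skaterFullName').cumcount() + 1

# create_lags keeps the original row index, so the feature joins cleanly.
model_df = lagged.join(skater_df[['season_number']])
ar5_season_model = smf.ols(
    'y ~ L1 + L2 + L3 + L4 + L5 + season_number', data=model_df
).fit()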

AR(5) w/ Season Number Model Results

Dep. Variable:     y                  R-squared:           0.632
Model:             OLS                Adj. R-squared:      0.632
Method:            Least Squares      F-statistic:         2486.
Date:              Tue, 02 Jun 2020   Prob (F-statistic):  0.00
Time:              09:18:33           Log-Likelihood:      -27917.
No. Observations:  8695               AIC:                 5.585e+04
Df Residuals:      8688               BIC:                 5.590e+04
Df Model:          6
Covariance Type:   nonrobust

                     coef   std err         t    P>|t|   [0.025   0.975]
Intercept          3.3540     0.224    14.987    0.000    2.915    3.793
L1                 0.4035     0.010    39.822    0.000    0.384    0.423
L2                 0.2103     0.011    20.021    0.000    0.190    0.231
L3                 0.1575     0.010    15.314    0.000    0.137    0.178
L4                 0.0718     0.010     7.070    0.000    0.052    0.092
L5                 0.0334     0.009     3.595    0.000    0.015    0.052
season_number     -0.3491     0.023   -14.948    0.000   -0.395   -0.303

Omnibus:        801.168    Durbin-Watson:     2.034
Prob(Omnibus):  0.000      Jarque-Bera (JB):  2548.447
Skew:           0.470      Prob(JB):          0.00
Kurtosis:       5.480      Cond. No.:         118.


Season number is also significant, but notice that its coefficient is negative. Does that make sense?

A player's goal total usually follows an inverted U-curve as their career progresses, i.e. a player starts with a low number of goals in their first few years, then scores more as they gain experience and hit their career peak, until their goals drop off again as age starts to take effect. The below chart illustrates this.

However, our model is only trained on seasons after a player already has 5 years of data, so it is mostly picking up on the final leg of that career journey. This explains why we get a negative coefficient for season_number. If we based the model on all years of a player's career, we would need to account for that nonlinear relationship; but since we only track players after they have played 5 years, this makes sense. Next, let's add games played to our model.
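
Adding games played follows the same pattern, assuming a gamesPlayed column sits in the per-season dataframe:

# Join the extra feature and extend the formula.
model_df = lagged.join(skater_df[['season_number', 'gamesPlayed']])
ar5_gp_model = smf.ols(
    'y ~ L1 + L2 + L3 + L4 + L5 + season_number + gamesPlayed',
    data=model_df,
).fit()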

AR(5) w/ Season Number & Games Played Model Results

Dep. Variable:     y                  R-squared:           0.714
Model:             OLS                Adj. R-squared:      0.714
Method:            Least Squares      F-statistic:         3105.
Date:              Tue, 02 Jun 2020   Prob (F-statistic):  0.00
Time:              09:18:33           Log-Likelihood:      -26814.
No. Observations:  8695               AIC:                 5.364e+04
Df Residuals:      8687               BIC:                 5.370e+04
Df Model:          7
Covariance Type:   nonrobust

                     coef   std err         t    P>|t|   [0.025   0.975]
Intercept         -3.6290     0.241   -15.030    0.000   -4.102   -3.156
L1                 0.3281     0.009    36.251    0.000    0.310    0.346
L2                 0.1919     0.009    20.722    0.000    0.174    0.210
L3                 0.1460     0.009    16.100    0.000    0.128    0.164
L4                 0.0729     0.009     8.143    0.000    0.055    0.090
L5                 0.0417     0.008     5.092    0.000    0.026    0.058
season_number     -0.3619     0.021   -17.588    0.000   -0.402   -0.322
gamesPlayed        0.1348     0.003    50.087    0.000    0.130    0.140

Omnibus:        1208.057   Durbin-Watson:     1.879
Prob(Omnibus):  0.000      Jarque-Bera (JB):  3082.868
Skew:           0.782      Prob(JB):          0.00
Kurtosis:       5.463      Cond. No.:         302.


Another significant variable. The F-statistic is still very large as well, indicating that our model as a whole is solid. Let's keep moving forward (pun intended) and add a variable for position.

AR(5) w/ Season Number, Games Played & Position Code Model Results

Dep. Variable:     y                  R-squared:           0.718
Model:             OLS                Adj. R-squared:      0.717
Method:            Least Squares      F-statistic:         2207.
Date:              Tue, 02 Jun 2020   Prob (F-statistic):  0.00
Time:              09:18:33           Log-Likelihood:      -26766.
No. Observations:  8695               AIC:                 5.355e+04
Df Residuals:      8684               BIC:                 5.363e+04
Df Model:          10
Covariance Type:   nonrobust

                            coef   std err         t    P>|t|   [0.025   0.975]
Intercept                -2.9364     0.267   -11.009    0.000   -3.459   -2.414
C(positionCode)[T.D]     -1.4347     0.165    -8.695    0.000   -1.758   -1.111
C(positionCode)[T.L]     -0.2017     0.169    -1.197    0.231   -0.532    0.129
C(positionCode)[T.R]      0.0742     0.169     0.439    0.661   -0.257    0.406
L1                        0.3162     0.009    34.813    0.000    0.298    0.334
L2                        0.1810     0.009    19.510    0.000    0.163    0.199
L3                        0.1369     0.009    15.095    0.000    0.119    0.155
L4                        0.0660     0.009     7.391    0.000    0.049    0.084
L5                        0.0371     0.008     4.548    0.000    0.021    0.053
season_number            -0.3473     0.021   -16.909    0.000   -0.388   -0.307
gamesPlayed               0.1381     0.003    51.119    0.000    0.133    0.143

Omnibus:        1204.406   Durbin-Watson:     1.873
Prob(Omnibus):  0.000      Jarque-Bera (JB):  3054.431
Skew:           0.782      Prob(JB):          0.00
Kurtosis:       5.447      Cond. No.:         375.


Only being a defenseman is significant in this model. Let's transform that column to only differentiate between forwards and defensemen.
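
One way to collapse the position codes, assuming positionCode uses the NHL API's C/L/R/D values (everything that isn't a defenseman becomes a forward):

import numpy as np

# Collapse centers, left wings, and right wings into a single 'F' code.
skater_df['positionCode'] = np.where(skater_df['positionCode'] == 'D', 'D', 'F')

With 'D' first alphabetically, patsy treats defensemen as the baseline level, which is why the summary below shows a single C(positionCode)[T.F] term.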

AR(5) w/ Season Number, Games Played & Position Code (D vs F) Model Results

Dep. Variable:     y                  R-squared:           0.718
Model:             OLS                Adj. R-squared:      0.717
Method:            Least Squares      F-statistic:         2758.
Date:              Tue, 02 Jun 2020   Prob (F-statistic):  0.00
Time:              09:18:33           Log-Likelihood:      -26767.
No. Observations:  8695               AIC:                 5.355e+04
Df Residuals:      8686               BIC:                 5.362e+04
Df Model:          8
Covariance Type:   nonrobust

                            coef   std err         t    P>|t|   [0.025   0.975]
Intercept                -4.3846     0.252   -17.376    0.000   -4.879   -3.890
C(positionCode)[T.F]      1.3897     0.142     9.756    0.000    1.110    1.669
L1                        0.3164     0.009    34.840    0.000    0.299    0.334
L2                        0.1810     0.009    19.518    0.000    0.163    0.199
L3                        0.1368     0.009    15.095    0.000    0.119    0.155
L4                        0.0662     0.009     7.408    0.000    0.049    0.084
L5                        0.0373     0.008     4.570    0.000    0.021    0.053
season_number            -0.3464     0.021   -16.879    0.000   -0.387   -0.306
gamesPlayed               0.1381     0.003    51.187    0.000    0.133    0.143

Omnibus:        1208.329   Durbin-Watson:     1.873
Prob(Omnibus):  0.000      Jarque-Bera (JB):  3073.302
Skew:           0.783      Prob(JB):          0.00
Kurtosis:       5.456      Cond. No.:         324.


Our final model is looking great. We've increased our R2 by 22% from our first model (0.589 to 0.718). Our F-statistic is large and all of our features are significant. Let's interpret the features one by one, analyzing each by sign, size, and significance. All variables are significant, so we'll focus on the first two.

Intercept

Our model assumes you have scored -4.3846 goals if all other variables are 0. This doesn't make a ton of sense, as you can't score negative goals, but it is interpretable in the context of the other variables. Basically, our model is saying you don't automatically start scoring goals; you have to traverse -4.4 theoretical goals before we give you any real ones. We'll look at gamesPlayed next.

gamesPlayed

Our gamesPlayed coefficient is 0.1381. The sign is positive, meaning our model thinks you'll score more goals as you play more games (duh!). More specifically, each game you play adds 0.1381 goals. Fractional goals are nonsense, but put more literally, our model says you'll score a goal roughly every 7 games (1/0.1381 ≈ 7.2), ceteris paribus.

C(positionCode)[T.F]

Our model allots 1.3897 goals per season if you are a forward. Basically, it will give you more goals for being a forward than for being a defenseman. This seems low to me, as forwards average about 14 goals per year (assuming 82 games played) in our dataset while defensemen only average 5, though much of that raw gap is presumably already captured by the lagged goal totals, which run higher for forwards. But, close enough; can't win 'em all.

L1

Now let's dig into the autoregressive variables. Last year's goals (L1) have a coefficient of 0.3164. This means we add ~32% of last year's goals to our total.

L2

We will add 18% of the player's goals from two years ago to our total. The cool thing about this model is that each lagged variable's coefficient gets smaller as we move further from the current year. This is intuitive, as a player's performance from last year is more likely to be predictive of this year's performance than their performance from 5 years ago.

L3

We will add 13.7% of the player's goals from three years ago to our total.

L4

We will add 6.6% of the player's goals from four years ago to our total.

L5

We will add 3.7% of the player's goals from five years ago to our total.
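
Putting all the pieces together, here's a worked prediction using the final model's coefficients for a completely made-up player: a forward in his 8th season who played 82 games and scored 30, 28, 25, 22, and 20 goals over his last five seasons.

# Final model coefficients, copied from the summary table above.
coef = {
    'Intercept': -4.3846, 'forward': 1.3897,
    'L1': 0.3164, 'L2': 0.1810, 'L3': 0.1368, 'L4': 0.0662, 'L5': 0.0373,
    'season_number': -0.3464, 'gamesPlayed': 0.1381,
}
last_five = [30, 28, 25, 22, 20]  # goals in seasons t-1 through t-5

prediction = (
    coef['Intercept']
    + coef['forward']  # the position indicator is 1 for a forward
    + sum(coef['L%s' % i] * g for i, g in enumerate(last_five, start=1))
    + coef['season_number'] * 8
    + coef['gamesPlayed'] * 82
)
print(round(prediction, 1))  # -> 25.7 predicted goals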

Wrapping Up

Overall, this was a really fun model to analyze. There are definitely some issues with the errors that I will dig into in future posts, but good times nonetheless.

Thanks for reading!