Forward Selection Regression Model for Predicting Movie ROI

Posted on March 21, 2019

In statistics, there are many strategies for choosing the best subset of predictor variables. One is forward selection: start with no predictors, then repeatedly add whichever variable improves the model the most. Let's use this method on our Box Office Mojo dataset to find variables predictive of ROI!

import os
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
import luther_util as lu

We'll first apply our standard movie data transformations.

fname = sorted([x for x in os.listdir('data')
                if re.match('box_office_mojo_pp', x)])[-1]
df = (pd.read_csv('data/%s' % fname)
      .set_index('title')
      .assign(release_date=lambda x: x.release_date.astype('datetime64'),
              release_month=lambda x: x.release_date.dt.month,
              release_year=lambda x: x.release_date.dt.year,
              log_gross=lambda x: np.log(x.domestic_total_gross),
              roi=lambda x: x.domestic_total_gross.div(x.budget) - 1)
      .query('roi < 15')) # filter out ROI outliers
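The roi column above follows the standard definition: domestic gross divided by budget, minus one. A quick sanity check on toy numbers:

```python
# ROI as computed in the assign() above: gross over budget, minus one.
# A $10M-budget movie grossing $30M returns 200% on its budget.
budget = 10_000_000
domestic_total_gross = 30_000_000
roi = domestic_total_gross / budget - 1
print(roi)  # 2.0, i.e. a 200% return
```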

Next, let's list the candidate variables that might have predictive value for movie ROI. Some entries are grouped into sub-lists (the rating dummies, and release month/year) so that related columns always enter a model together.

independents = [
  'budget',
  'domestic_total_gross',
  'open_wkend_gross',
  'runtime',
  'widest_release',
  'in_release_days',
  ['rating[T.PG]', 'rating[T.PG-13]', 'rating[T.R]'],
  ['release_month', 'release_year']]

Next we'll iterate through these variables, fit a model for each (including polynomial fits up to degree 3 for the single numeric variables), and log the results to a list.

results = list()
for variable in independents:
  if isinstance(variable, list):
    X = df.loc[:, variable]
    y = df.loc[:, 'roi']
    lr = LinearRegression()
    lu.log_model(results, lr, X, y, variable)
  else:
    X = df.loc[:, variable].values.reshape(-1, 1)
    y = df.loc[:, 'roi']
    for degree in range(1, 4):
      if degree == 1:
        lr = LinearRegression()
        lu.log_model(results, lr, X, y, variable)
      else:
        lr = Pipeline([('poly', PolynomialFeatures(degree)),
                       ('regr', LinearRegression())])
        lu.log_model(results, lr, X, y, variable, degree)
# Also log a bias-only (intercept) model as a baseline
X = np.ones((df.shape[0], 1))
y = df.loc[:, 'roi']
lr = LinearRegression(fit_intercept=False)
lu.log_model(results, lr, X, y, 'bias')

lu.results_df(results).head(5)
features                                      degree  training_r2  test_r2    mse       rmse
budget                                        2       0.298164     0.221162   2.751055  1.658631
budget                                        3       0.162094     0.142412   3.235131  1.798647
budget                                        1       0.159838     0.107614   3.261668  1.806009
in_release_days                               1       0.029881     -0.028430  3.764404  1.940207
[rating[T.PG], rating[T.PG-13], rating[T.R]]  1       0.029980     -0.044607  3.765328  1.940445
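The luther_util module isn't shown in the post. Here's a minimal sketch of what its two helpers might look like, judging from how they're called and from the columns in the results table; the metric names and the train/test split details are assumptions, not the actual implementation:

```python
# Hypothetical sketch of the luther_util helpers (log_model, results_df).
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def log_model(results, model, X, y, features, degree=1):
    # Hold out a test set, fit the model, and record train/test metrics.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)
    model.fit(X_tr, y_tr)
    mse = mean_squared_error(y_te, model.predict(X_te))
    results.append({'features': features,
                    'degree': degree,
                    'training_r2': model.score(X_tr, y_tr),
                    'test_r2': model.score(X_te, y_te),
                    'mse': mse,
                    'rmse': np.sqrt(mse)})

def results_df(results):
    # Rank all logged models by held-out R^2, best first.
    return pd.DataFrame(results).sort_values('test_r2', ascending=False)
```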

Using only one variable, our top model for predicting return is budget. Let's add another variable and see what we get.

import itertools
import functools
import operator

number_independents = 2

independents = [x if isinstance(x, list) else [x] for x in independents]
combs = list(itertools.combinations(independents, number_independents))
variables_list = [functools.reduce(operator.iconcat, x, []) for x in combs]
for variables in variables_list:
    X = df.loc[:, variables]
    y = df.loc[:, 'roi']
    lr = LinearRegression()
    lu.log_model(results, lr, X, y, variables)

lu.results_df(results).head(5)
features                        degree  training_r2  test_r2   mse       rmse
[budget, domestic_total_gross]  1       0.418192     0.352395  2.293047  1.514281
[budget, open_wkend_gross]      1       0.339812     0.299092  2.591965  1.609958
budget                          2       0.298164     0.221162  2.751055  1.658631
[budget, in_release_days]       1       0.243988     0.128253  2.988644  1.728770
[budget, widest_release]        1       0.186577     0.119619  3.200082  1.788877

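To see what the combinations-and-flatten step is doing, here's a toy run with three of the groups. Grouped columns (like the rating dummies or release month/year) stay together through the combinatorics, and each combination is then flattened into a single list of column names:

```python
import itertools
import functools
import operator

# Three candidate groups; the last one always travels as a pair.
groups = [['budget'], ['open_wkend_gross'], ['release_month', 'release_year']]

combs = itertools.combinations(groups, 2)
# operator.iconcat mutates its first argument (a += b), which is why the
# reduce starts from a fresh empty list for each combination.
flattened = [functools.reduce(operator.iconcat, c, []) for c in combs]
print(flattened)
# [['budget', 'open_wkend_gross'],
#  ['budget', 'release_month', 'release_year'],
#  ['open_wkend_gross', 'release_month', 'release_year']]
```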
Now, instead of budget alone, budget and domestic total gross together make the best model. We'll ignore for now that domestic total gross isn't known until after release, so we couldn't actually use it to predict ROI ahead of time.

Let's get wild and crazy with three variables!

number_independents = 3

independents = [x if isinstance(x, list) else [x] for x in independents]
combs = list(itertools.combinations(independents, number_independents))
variables_list = [functools.reduce(operator.iconcat, x, []) for x in combs]
for variables in variables_list:
    X = df.loc[:, variables]
    y = df.loc[:, 'roi']
    lr = LinearRegression()
    lu.log_model(results, lr, X, y, variables)

lu.results_df(results).head(5)
features                                                     degree  training_r2  test_r2   mse       rmse
[budget, domestic_total_gross, release_month, release_year]  1       0.422910     0.346552  2.281242  1.510378
[budget, domestic_total_gross, release_month, release_year]  1       0.422910     0.346552  2.281242  1.510378
[budget, domestic_total_gross, in_release_days]              1       0.422280     0.354251  2.289236  1.513022
[budget, domestic_total_gross, in_release_days]              1       0.422280     0.354251  2.289236  1.513022
[budget, domestic_total_gross]                               1       0.418192     0.352395  2.293047  1.514281

Adding release month and year tops the table on training R², though note its test R² is actually a hair below the [budget, domestic_total_gross, in_release_days] model.

We'll stop here, but we could imagine continuing to add variables until we've analyzed every subset and found the best model overall.