In statistics, there are many different strategies for determining the best subset of predictive variables. One method is forward selection. Let's use this method on our Box Office Mojo dataset to find some variables predictive of ROI!
import os
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
import luther_util as lu
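The `luther_util` module is a small project-local helper that isn't shown in this post. A minimal sketch of the two helpers used below (as `lu.log_model` and `results_df`) might look like this; the signatures and behavior here are assumptions, not the actual module:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def log_model(results, model, X, y, features, degree=1):
    """Fit `model` on a train split and append its scores to `results`."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42)
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    results.append({'features': features,
                    'degree': degree,
                    'training_r2': model.score(X_train, y_train),
                    'test_r2': model.score(X_test, y_test),
                    'mse': mse,
                    'rmse': np.sqrt(mse)})

def results_df(results):
    """Collect logged runs into a DataFrame, best training R^2 first
    (the result tables below appear to be sorted this way)."""
    return (pd.DataFrame(results)
            .sort_values('training_r2', ascending=False)
            .reset_index(drop=True))
```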
We'll first apply our standard movie data transformations.
fname = sorted([x for x in os.listdir('data')
                if re.match('box_office_mojo_pp', x)])[-1]
df = (pd.read_csv('data/%s' % fname)
      .set_index('title')
      .assign(release_date=lambda x: pd.to_datetime(x.release_date),
              release_month=lambda x: x.release_date.dt.month,
              release_year=lambda x: x.release_date.dt.year,
              log_gross=lambda x: np.log(x.domestic_total_gross),
              roi=lambda x: x.domestic_total_gross.div(x.budget) - 1)
      .query('roi < 15'))  # filter out extreme ROI outliers
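As a quick sanity check of the ROI formula used above (domestic gross over budget, minus one): a $50M movie that grosses $150M domestically returns (150 / 50) - 1 = 2.0, i.e. 200%.

```python
# ROI as defined in the transformation pipeline above,
# on illustrative numbers (not from the dataset):
budget = 50_000_000
domestic_total_gross = 150_000_000
roi = domestic_total_gross / budget - 1
print(roi)  # 2.0
```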
Next, let's list the candidate variables that might have predictive value for movie ROI. Variables that only make sense together, like the rating dummies and the release month/year pair, are grouped as sub-lists.
independents = [
    'budget',
    'domestic_total_gross',
    'open_wkend_gross',
    'runtime',
    'widest_release',
    'in_release_days',
    ['rating[T.PG]', 'rating[T.PG-13]', 'rating[T.R]'],
    ['release_month', 'release_year']]
Next we'll iterate through these variables, fit a linear model on each (trying polynomial terms up to degree 3 for the single numeric variables), and log the results to a list.
results = list()
for variable in independents:
    if isinstance(variable, list):
        X = df.loc[:, variable]
        y = df.loc[:, 'roi']
        lr = LinearRegression()
        lu.log_model(results, lr, X, y, variable)
    else:
        X = df.loc[:, variable].values.reshape(-1, 1)
        y = df.loc[:, 'roi']
        for degree in range(1, 4):
            if degree == 1:
                lr = LinearRegression()
                lu.log_model(results, lr, X, y, variable)
            else:
                lr = Pipeline([('poly', PolynomialFeatures(degree)),
                               ('regr', LinearRegression())])
                lu.log_model(results, lr, X, y, variable, degree)

# Let's also add a bias model
X = np.ones((df.shape[0], 1))
y = df.loc[:, 'roi']
lr = LinearRegression(fit_intercept=False)
lu.log_model(results, lr, X, y, 'bias')
results_df(results).head(5)
features | degree | training_r2 | test_r2 | mse | rmse |
---|---|---|---|---|---|
budget | 2 | 0.298164 | 0.221162 | 2.751055 | 1.658631 |
budget | 3 | 0.162094 | 0.142412 | 3.235131 | 1.798647 |
budget | 1 | 0.159838 | 0.107614 | 3.261668 | 1.806009 |
in_release_days | 1 | 0.029881 | -0.028430 | 3.764404 | 1.940207 |
[rating[T.PG], rating[T.PG-13], rating[T.R]] | 1 | 0.029980 | -0.044607 | 3.765328 | 1.940445 |
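The "bias" model above regresses on a column of ones with the intercept disabled, which amounts to always predicting the mean of the training target; it's the baseline any real feature has to beat. A quick check on toy numbers:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

y = np.array([1.0, 2.0, 3.0, 6.0])
X = np.ones((len(y), 1))  # a single constant "bias" feature

lr = LinearRegression(fit_intercept=False).fit(X, y)
# The single fitted coefficient equals the mean of y...
print(lr.coef_[0])    # 3.0
# ...so every prediction is that mean.
print(lr.predict(X))  # [3. 3. 3. 3.]
```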
Using only one variable, our top model for predicting return uses budget, with a quadratic term. Let's add another variable and see what we get.
import itertools
import functools
import operator
number_independents = 2
independents = [x if isinstance(x, list) else [x] for x in independents]
combs = list(itertools.combinations(independents, number_independents))
variables_list = [functools.reduce(operator.iconcat, x, []) for x in combs]
for variables in variables_list:
    X = df.loc[:, variables]
    y = df.loc[:, 'roi']
    lr = LinearRegression()
    lu.log_model(results, lr, X, y, variables)
results_df(results).head(5)
features | degree | training_r2 | test_r2 | mse | rmse |
---|---|---|---|---|---|
[budget, domestic_total_gross] | 1 | 0.418192 | 0.352395 | 2.293047 | 1.514281 |
[budget, open_wkend_gross] | 1 | 0.339812 | 0.299092 | 2.591965 | 1.609958 |
budget | 2 | 0.298164 | 0.221162 | 2.751055 | 1.658631 |
[budget, in_release_days] | 1 | 0.243988 | 0.128253 | 2.988644 | 1.728770 |
[budget, widest_release] | 1 | 0.186577 | 0.119619 | 3.200082 | 1.788877 |
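The combination-and-flatten step above is worth unpacking: `itertools.combinations` picks groups of variable lists, and `functools.reduce(operator.iconcat, x, [])` flattens each chosen combination into one flat feature list, so grouped dummies stay together. A toy demonstration (with made-up column names):

```python
import itertools
import functools
import operator

independents = ['budget', ['rating_PG', 'rating_R'], 'runtime']
# Wrap bare strings in lists so every element is a group.
independents = [x if isinstance(x, list) else [x] for x in independents]

combs = list(itertools.combinations(independents, 2))
# Flatten each pair of groups into a single feature list.
variables_list = [functools.reduce(operator.iconcat, x, []) for x in combs]
print(variables_list)
# [['budget', 'rating_PG', 'rating_R'],
#  ['budget', 'runtime'],
#  ['rating_PG', 'rating_R', 'runtime']]
```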
Now, instead of budget alone, we find that budget and domestic total gross together are best! We'll ignore, for now, the fact that domestic total gross can't actually be used to predict ROI in advance, since ROI is computed directly from it.
Let's get wild and crazy with three variables!
number_independents = 3
independents = [x if isinstance(x, list) else [x] for x in independents]
combs = list(itertools.combinations(independents, number_independents))
variables_list = [functools.reduce(operator.iconcat, x, []) for x in combs]
for variables in variables_list:
    X = df.loc[:, variables]
    y = df.loc[:, 'roi']
    lr = LinearRegression()
    lu.log_model(results, lr, X, y, variables)
results_df(results).head(5)
features | degree | training_r2 | test_r2 | mse | rmse |
---|---|---|---|---|---|
[budget, domestic_total_gross, release_month, release_year] | 1 | 0.422910 | 0.346552 | 2.281242 | 1.510378 |
[budget, domestic_total_gross, release_month, release_year] | 1 | 0.422910 | 0.346552 | 2.281242 | 1.510378 |
[budget, domestic_total_gross, in_release_days] | 1 | 0.422280 | 0.354251 | 2.289236 | 1.513022 |
[budget, domestic_total_gross, in_release_days] | 1 | 0.422280 | 0.354251 | 2.289236 | 1.513022 |
[budget, domestic_total_gross] | 1 | 0.418192 | 0.352395 | 2.293047 | 1.514281 |
Adding release month and year gives us our best model yet by training R², though note that budget, domestic total gross, and days in release actually edges it out on test R².
We'll stop here but we could imagine continuing until we've analyzed all subsets to find the best model.
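For the record, scikit-learn (0.24 and later) can automate this forward-selection loop with `SequentialFeatureSelector`. A sketch on synthetic data, since the Box Office Mojo frame isn't bundled here; the column names and coefficients below are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = pd.DataFrame({
    'budget': rng.normal(size=200),
    'runtime': rng.normal(size=200),
    'widest_release': rng.normal(size=200),
    'noise': rng.normal(size=200),
})
# The synthetic target depends only on budget and widest_release.
y = 2 * X['budget'] - 3 * X['widest_release'] + rng.normal(scale=0.1, size=200)

# Greedily add features one at a time, keeping whichever most
# improves cross-validated score, until two are selected.
sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=2,
                                direction='forward',
                                cv=5)
sfs.fit(X, y)
print(list(X.columns[sfs.get_support()]))  # ['budget', 'widest_release']
```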