It's that time of year again: the NCAA tournament is almost upon us!
Filling out your bracket can be an overwhelming process, even if you've watched a lot of college hoops over the year. For the first round especially, it's tough to evaluate 64 teams separately. Luckily, we have our good friend statistics to help us out! In this blog, we'll use historical game results from 1985 (the first year the tournament had 64 teams) through 2019 (our last year of data) to build distributions for the higher seed's probability of winning under a binomial model.
Our strategy will be to use Bayesian modeling to determine a higher seed's probability of winning. This will allow even the most passive fan to get some good W's in their bracket. We'll use the beta-binomial conjugate family, applying a Bayesian conjugate update to derive our final beta distributions of binomial win probabilities. We're going to gloss over the mechanics behind the distribution for the sake of time (and to avoid losing all readers!), but the beta-binomial is a great introduction to Bayesian statistics if you're interested in digging deeper.
The reason we're using distributions rather than solely assuming the historical win rates is that historical win rates aren't necessarily a proper indicator of future win rates. By using the beta-binomial conjugate family, we can create a confidence interval for this metric. Over time the win rate will change, but it should always fall within this interval if we do our job right.
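For the curious, the conjugate update we're glossing over is really just one line of bookkeeping: if our prior belief about the higher seed's win probability is $\mathrm{Beta}(a, b)$ and we then observe $x$ higher-seed wins in $n$ games, the posterior is another beta distribution,

$$
p \mid x \sim \mathrm{Beta}(a + x,\; b + n - x),
$$

which has posterior mean $(a + x) / (a + b + n)$. That posterior is exactly what we'll plot and summarize below.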
First, we'll need some good ole data to work with. Since this is a smaller project, rather than using my personal favorite scraping tool, Scrapy, I used urllib and lxml. You can find my spider here where I scraped sports-reference.com.
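The real spider is in the linked repo; below is just a minimal sketch of the urllib + lxml approach. The bracket URL pattern and the XPath selector are assumptions for illustration, not necessarily how sports-reference.com structures its pages.

```python
# A minimal sketch of the scraping approach -- see the linked spider for the real thing.
from urllib.request import urlopen
from lxml import html


def fetch_bracket(year):
    # Assumed URL pattern for the men's tournament bracket page
    url = f"https://www.sports-reference.com/cbb/postseason/{year}-ncaa.html"
    with urlopen(url) as response:
        tree = html.fromstring(response.read())
    # Hypothetical selector: grab the text of every link inside the bracket section
    return tree.xpath('//div[@id="brackets"]//a/text()')


# teams_2019 = fetch_bracket(2019)
```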
This gave us a dataset that looks like this:
|   | year | region | round | home_seed | home_team | home_score | away_seed | away_team | away_score | seed_matchup | winner | seed_result |
|---|------|--------|-------|-----------|-----------|------------|-----------|-----------|------------|--------------|--------|-------------|
| 0 | 1985 | east | 1 | 1 | Georgetown | 68 | 16 | Lehigh | 43 | 01v16 | home_team | higher_seed_win |
| 1 | 1985 | east | 1 | 8 | Temple | 60 | 9 | Virginia Tech | 57 | 08v09 | home_team | higher_seed_win |
| 2 | 1985 | east | 1 | 5 | SMU | 85 | 12 | Old Dominion | 68 | 05v12 | home_team | higher_seed_win |
| 3 | 1985 | east | 1 | 4 | Loyola (IL) | 59 | 13 | Iona | 58 | 04v13 | home_team | higher_seed_win |
| 4 | 1985 | east | 1 | 6 | Georgia | 67 | 11 | Wichita State | 59 | 06v11 | home_team | higher_seed_win |
Next, we'll run some data checks to make sure everything looks good, followed by a bit of data cleaning. These checks confirm that we have the proper number of games and that there are no ties.
```python
# Tournament years covered by the dataset
START_YEAR = 1985
END_YEAR = 2019


def data_checks(df):
    # 63 games are played in each 64-team tournament, so check the total game count
    if df.shape[0] != 63 * (END_YEAR - START_YEAR + 1):
        return {"result": "failed", "test": "games total"}
    # basketball games can't end in a tie, so no row should have equal scores
    if not df.loc[df.home_score == df.away_score].empty:
        return {"result": "failed", "test": "no ties"}
    return {"result": "passed"}
```
Then we'll take this data and filter it down to just the first-round matchups using the code below.
```python
import matplotlib.pyplot as plt

mask = df.loc[:, 'round'] == 1
first_round_matchups = (df.loc[mask, 'seed_matchup']
                        .drop_duplicates()
                        .sort_values()
                        .tolist()
                        )

models = dict()
for matchup in first_round_matchups:
    df_matchup = df.loc[df.seed_matchup == matchup, ]
    higher_seed, lower_seed = list(map(int, matchup.split('v')))
    # prior: seed the beta distribution with the matchup's own seed numbers
    model = BayesBetaBinomial(matchup, a_prior=lower_seed, b_prior=higher_seed)
    # observed higher-seed wins (x) out of total games played (n)
    x = sum(df_matchup.seed_result == 'higher_seed_win')
    n = df_matchup.shape[0]
    model.update(x, n)
    title = f'{higher_seed} vs {lower_seed} Seed Matchups - Bayesian Posterior Distribution'
    ax = model.plot_posterior(title)
    plt.savefig(f'data/{matchup}_posterior_distribution.png')
    plt.show()
    models[matchup] = model
```
The models used are instances of the `BayesBetaBinomial` class I wrote to calculate prior and posterior distributions, which you can find here.
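For readers who don't want to click through, here is a minimal sketch of what such a class might look like. Only `update()` and `plot_posterior()` appear in the loop above; the `posterior_mean()` and `interval()` helpers are hypothetical additions used for illustration, and the author's actual implementation may differ.

```python
# A minimal sketch, not the author's actual implementation (see the linked repo).
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats


class BayesBetaBinomial:
    def __init__(self, name, a_prior, b_prior):
        self.name = name
        self.a = a_prior  # prior "wins" for the higher seed
        self.b = b_prior  # prior "losses" for the higher seed

    def update(self, x, n):
        # Conjugate update: add observed wins to a, observed losses to b
        self.a += x
        self.b += n - x

    def posterior_mean(self):
        return self.a / (self.a + self.b)

    def interval(self, alpha=0.05):
        # Central 95% interval of the posterior beta distribution by default
        return stats.beta.interval(1 - alpha, self.a, self.b)

    def plot_posterior(self, title):
        p = np.linspace(0, 1, 500)
        ax = plt.gca()
        ax.plot(p, stats.beta.pdf(p, self.a, self.b))
        ax.set_title(title)
        ax.set_xlabel('Higher seed win probability')
        return ax
```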
Now that we have our models, let's clean this up and summarize it into the data frame shown below so we can dive into the results a bit.
| Seed Matchup | Higher Seed Wins | Lower Seed Wins | Higher Seed Win Pct | Higher Seed Win Pct Posterior Mean | Higher Seed Win Pct Lower Bound | Higher Seed Win Pct Upper Bound | Confidence Interval Band |
|---|---|---|---|---|---|---|---|
| 01v16 | 139 | 1 | 99.3% | 98.7% | 96.5% | 99.8% | 3.4% |
| 02v15 | 132 | 8 | 94.3% | 93.6% | 89.3% | 96.9% | 7.6% |
| 03v14 | 119 | 21 | 85.0% | 84.7% | 78.7% | 89.9% | 11.2% |
| 04v13 | 111 | 29 | 79.3% | 79.0% | 72.3% | 85.0% | 12.7% |
| 05v12 | 90 | 50 | 64.3% | 65.0% | 57.4% | 72.2% | 14.9% |
| 06v11 | 88 | 52 | 62.9% | 63.1% | 55.4% | 70.4% | 15.0% |
| 07v10 | 85 | 55 | 60.7% | 60.5% | 52.8% | 68.0% | 15.2% |
| 08v09 | 68 | 72 | 48.6% | 49.0% | 41.3% | 56.8% | 15.6% |
Clearly, the higher the seed, the higher its expected win percentage against its first-round opponent, as shown by the `Higher Seed Win Pct Posterior Mean` metric.
We can also see that our confidence interval gets larger as the seeds become closer, meaning we have less certainty of the true win probability for 8v9 matchups compared to 1v16 matchups.
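A summary like the table above can be pulled straight from the fitted models. The sketch below assumes the hypothetical `posterior_mean()` and `interval()` helpers from the class sketch earlier, so it illustrates the idea rather than reproducing the exact code in the repo.

```python
# Sketch: assemble the posterior summary for each matchup into a data frame.
import pandas as pd

rows = []
for matchup, model in models.items():
    lower, upper = model.interval(alpha=0.05)
    rows.append({
        'Seed Matchup': matchup,
        'Higher Seed Win Pct Posterior Mean': model.posterior_mean(),
        'Higher Seed Win Pct Lower Bound': lower,
        'Higher Seed Win Pct Upper Bound': upper,
        'Confidence Interval Band': upper - lower,
    })

summary = pd.DataFrame(rows).sort_values('Seed Matchup')
```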
This pretty little visual shows the differences between matchups even more clearly.
1v16 matchups have a 98.7% probability that the 1-seed will win. The extreme value here is not surprising given there's only been one time in history that a 16-seed has won (UMBC over Virginia in 2018). 2-seeds have a 93.6% chance of winning, while 3-seeds drop down to 84.7%. 5- through 7-seeds all sit in the low-to-mid 60s, while 8v9 is essentially a coin flip. A more detailed plot of each posterior can be found below.
So what are our takeaways? Clearly, it's best to have all 1-seeds, 2-seeds, and arguably 3-seeds winning their four first-round games. 4-seeds sit just under 80%, so you'd want to dig into those matchups for an upset or two. Finally, 5- through 8-seeds all have a 65% chance or less of winning, so that's where we really want to focus our research.
Good luck with your brackets! Hopefully this helps in your selections. The full analysis can be found in my sport analysis repo here. If you want any other seed matchup distributions, feel free to contact me!
Thanks for reading!