Regex not only sounds cool, it IS cool! Even better, it's useful! In my humble opinion, it's one of the more underrated tools in a data scientist's toolbelt.
I'm currently in the process of putting together a dashboard to compare the heavyweights in boxing. One field that is important in boxing is height. It determines striking angles and defensive tactics, among other things. Obviously, we would want to include a view on height in any dashboard comparing boxers.
The major issue when looking at height is that it comes in a string form usually like 6'3". This isn't a great way to analyze and compare heights. Therefore we need to take that and get it into a numerical format in a single metric (i.e. only inches instead of feet and inches).
However, the next challenge is how do we actually extract that information and translate it to inches? We could split the string twice but that's a bit clunky. Instead, being the clever bloke that I am, I chose to use named regex capture groups. The below illustrates the steps taken to do so, including code and table outputs.
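To make the idea concrete before touching the real data, here is a minimal sketch on a single plain-ASCII string like 6'3", using Python's built-in `re` module. The helper names `to_inches` and `to_inches_split` are made up for illustration, and the pattern assumes straight quotes rather than the typographic marks that show up in the actual dataset later on.

```python
import re

# Hypothetical pattern for illustration: assumes ASCII quotes, as in 6'3".
HEIGHT_PAT = re.compile(r'''(?P<feet>\d+)'\s*(?P<inches>\d+(?:\.\d+)?)"''')

def to_inches(height: str) -> float:
    """Convert a feet-and-inches string to total inches via named groups."""
    m = HEIGHT_PAT.fullmatch(height.strip())
    if m is None:
        raise ValueError(f"unrecognised height: {height!r}")
    # Named groups let us refer to each part by what it means.
    return int(m["feet"]) * 12 + float(m["inches"])

def to_inches_split(height: str) -> float:
    """The 'split twice' alternative, shown for comparison."""
    feet, rest = height.strip().split("'")
    inches = rest.rstrip('"')
    return int(feet) * 12 + float(inches)

print(to_inches("6'3\""))        # 75.0
print(to_inches_split("6'3\""))  # 75.0
```

Both give the same answer on clean input. The regex version has the advantage of rejecting anything that doesn't fit the expected shape, instead of quietly splitting it into nonsense.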
First, let's import the data and only take the columns we need for this exercise.
import pandas as pd
# Options for DataFrame.to_html when rendering tables for the post
html_kwargs = dict(
    classes="table",
    justify="left",
    border=0,
    index_names=False
)
fin = r'data\FuryWilderFightHistory.xlsx'
# Keep only the columns needed and index by fighter and opponent
df = (pd.read_excel(fin)
      .loc[:, ['name', 'opponent_name', 'opponent_height']]
      .drop_duplicates()
      .set_index(['name', 'opponent_name'])
)
df.head()
name | opponent_name | opponent_height |
---|---|---|
Tyson Fury | Deontay Wilder | 6′ 7″ / 201cm |
Tyson Fury | Otto Wallin | 6′ 5½″ / 197cm |
Tyson Fury | Tom Schwarz | 6′ 5½″ / 197cm |
Tyson Fury | Deontay Wilder | 6′ 7″ / 201cm |
Tyson Fury | Francesco Pianeta | 6′ 5″ / 196cm |
Second, let's get rid of those pesky metric measures to only have the feet and inches string.
# Everything before ' / ' is the feet-and-inches part
ft_inches_str = (df.opponent_height.str.split(' / ', expand=True)[0]
                 .rename('opponent_height_ft_and_inches')
)
(df.join(ft_inches_str)
   .head()
)
name | opponent_name | opponent_height | opponent_height_ft_and_inches |
---|---|---|---|
Tyson Fury | Deontay Wilder | 6′ 7″ / 201cm | 6′ 7″ |
Tyson Fury | Otto Wallin | 6′ 5½″ / 197cm | 6′ 5½″ |
Tyson Fury | Tom Schwarz | 6′ 5½″ / 197cm | 6′ 5½″ |
Tyson Fury | Francesco Pianeta | 6′ 5″ / 196cm | 6′ 5″ |
Tyson Fury | Sefer Seferi | NaN | NaN |
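Before creating the named capture groups, let's make sure that all our inputs match the pattern (excluding NAs). The mask below keeps only the rows that fail to match, so anything it returns is a problem.
pat = r'\d+′ \d+\.?\d?″'
# Flag rows that do NOT match; na=True counts NaNs as matches so they are left out
mask = ~ft_inches_str.str.match(pat, na=True)
(df.join(ft_inches_str)
   .loc[mask]
   .head()
)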
name | opponent_name | opponent_height | opponent_height_ft_and_inches |
---|---|---|---|
Tyson Fury | Otto Wallin | 6′ 5½″ / 197cm | 6′ 5½″ |
Tyson Fury | Tom Schwarz | 6′ 5½″ / 197cm | 6′ 5½″ |
Tyson Fury | Christian Hammer | 6′ 2½″ / 189cm | 6′ 2½″ |
Tyson Fury | Dereck Chisora | 6′ 1½″ / 187cm | 6′ 1½″ |
Tyson Fury | Mathew Ellis | 5′ 11½″ / 182cm | 5′ 11½″ |
Oh no! Looks like the ½ character is causing some issues, because `\d` only matches ordinary digits. Let's replace it with '.5' and try again. Good thing we checked!
ft_inches_str = (df.opponent_height.str.split(' / ', expand=True)[0]
                 .str.strip()
                 # Literal replacement, so 5½ becomes 5.5
                 .str.replace('½', '.5', regex=False)
                 .rename('opponent_height_ft_and_inches')
)
pat = r'\d+′ \d+\.?\d?″'
mask = ~ft_inches_str.str.match(pat, na=True)
(df.join(ft_inches_str)
   .loc[mask]
   .head()
)
opponent_height |
---|
No results means everything matches. Now, let's add those named capture groups.
# Each named group becomes a column of the same name in the extracted frame
pat = r'(?P<feet>\d+)′ (?P<inches>\d+\.?\d?)″'
ft_inch_columns = ft_inches_str.str.extract(pat).astype('float64')
(df.join(ft_inch_columns)
   .head()
   .to_html(**html_kwargs)
)
name | opponent_name | opponent_height | feet | inches |
---|---|---|---|---|
Tyson Fury | Deontay Wilder | 6′ 7″ / 201cm | 6.0 | 7.0 |
Tyson Fury | Otto Wallin | 6′ 5½″ / 197cm | 6.0 | 5.5 |
Tyson Fury | Tom Schwarz | 6′ 5½″ / 197cm | 6.0 | 5.5 |
Tyson Fury | Francesco Pianeta | 6′ 5″ / 196cm | 6.0 | 5.0 |
Tyson Fury | Sefer Seferi | NaN | NaN | NaN |
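A side note on the ½ fix: swapping in '.5' works here because ½ is the only fraction in this dataset. If a source ever slipped in a ¼ or a ¾, one option is to let Python's `unicodedata` module look up the numeric value of the character. This is a sketch and not the approach used above; `expand_fractions` is a hypothetical helper name.

```python
import re
import unicodedata

def expand_fractions(text: str) -> str:
    """Replace single-character fractions (½, ¼, ¾, ...) with a decimal suffix."""
    out = []
    for ch in text:
        # Category "No" is "other number", which includes the fraction characters.
        if unicodedata.category(ch) == "No":
            value = unicodedata.numeric(ch)
            # Only rewrite true fractions; leave things like superscript 2 alone.
            if 0 < value < 1:
                out.append(f"{value:g}".lstrip("0"))  # 0.5 -> ".5"
                continue
        out.append(ch)
    return "".join(out)

# In Python's re, \d matches decimal digits only, so the fraction is not a digit.
print(re.match(r"\d", "½"))          # None
print(expand_fractions("6′ 5½″"))    # 6′ 5.5″
print(expand_fractions("5′ 11¾″"))   # 5′ 11.75″
```

On a pandas Series this could be applied with something like `ft_inches_str.map(expand_fractions, na_action='ignore')`. Note that a value like 11.75 has two decimal digits, so the inches group in the pattern would need `\d*` rather than `\d?` after the dot.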
Finally (drumroll please...) we will convert feet to inches and add inches to that.
# 12 inches per foot, plus the leftover inches
height_in_inches = ((ft_inch_columns.feet * 12 + ft_inch_columns.inches)
                    .rename('opponent_height_in_inches')
)
(df.join(height_in_inches)
   .head()
)
name | opponent_name | opponent_height | opponent_height_in_inches |
---|---|---|---|
Tyson Fury | Deontay Wilder | 6′ 7″ / 201cm | 79.0 |
Tyson Fury | Otto Wallin | 6′ 5½″ / 197cm | 77.5 |
Tyson Fury | Tom Schwarz | 6′ 5½″ / 197cm | 77.5 |
Tyson Fury | Francesco Pianeta | 6′ 5″ / 196cm | 77.0 |
Tyson Fury | Sefer Seferi | NaN | NaN |
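Since the raw column also carries the height in centimetres, there is a cheap sanity check available: convert the inches back to centimetres (2.54 cm per inch) and see whether it lands near the stated figure. The sketch below does this on a few of the raw strings from the tables above. The helper name `check_height` and the 1 cm tolerance are assumptions for illustration, on the guess that the source rounds to the nearest centimetre.

```python
import re

# Matches the raw format shown in the tables, e.g. "6′ 5½″ / 197cm".
RAW_PAT = re.compile(
    r"(?P<feet>\d+)′ (?P<inches>\d+)(?P<half>½?)″ / (?P<cm>\d+)cm"
)
CM_PER_INCH = 2.54

def check_height(raw: str, tolerance_cm: float = 1.0) -> bool:
    """Return True if the feet/inches part agrees with the cm part."""
    m = RAW_PAT.fullmatch(raw.strip())
    if m is None:
        raise ValueError(f"unrecognised height: {raw!r}")
    inches = int(m["feet"]) * 12 + int(m["inches"]) + (0.5 if m["half"] else 0.0)
    return abs(inches * CM_PER_INCH - int(m["cm"])) <= tolerance_cm

samples = ["6′ 7″ / 201cm", "6′ 5½″ / 197cm", "6′ 5″ / 196cm"]
print([check_height(s) for s in samples])  # [True, True, True]
```

For example, 79 inches is 200.66 cm, which sits comfortably within a centimetre of the listed 201cm.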
Bada-bing bada-boom! Keep checking in for my heavyweight dashboard, coming soon to a computer near you!
Thanks for reading!