Extracting Parts-of-Speech with TextBlob

Posted on January 16, 2020

When starting a project, it's good to do some exploratory data analysis (EDA) to get to know your data better. It's similar to the awkward part of a first date where you ask questions such as "Where are you from?" and "Who's the best player from Mighty Ducks?" (trick question, Coach Bombay is obviously the best). The difference is that your questions are "What's the schema of the data?" and "What's the distribution of this column [x]?".

When starting out on my fourth Metis project, aka My Harry Potter House Sorting Hat Algorithm, I decided to visualize each house as a word cloud to get an idea of what kinds of words were associated with the different houses. Thinking about this a little more, I decided I only wanted adjectives to be used to sort people, because adjectives describe people and sorting people into a house is a way of describing them.

The data for this came from the Traits section of each house's respective page on the Pottermore Wiki. I then used TextBlob's tags feature to extract adjectives ("JJ") and also some select nouns ("NN"), which I filtered using stop words.
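To make the tagging step concrete, here is a minimal sketch of how that filtering might look. It is not the full script: the function names and the stop-word list are hypothetical, and I'm assuming the stop words are applied to the nouns only. The one real TextBlob call is the `tags` property, which returns `(word, tag)` pairs and needs the TextBlob corpora to be downloaded first.

```python
# Hypothetical stop-word list; this is only a placeholder, not the list
# used for the actual project.
STOP_WORDS = {"house", "student", "member", "trait"}


def filter_tagged_words(tagged_words, stop_words=STOP_WORDS):
    """Keep adjectives (JJ) and any nouns (NN) that are not stop words."""
    kept = []
    for word, tag in tagged_words:
        word = word.lower()
        # Exact tag matches only, so JJR/JJS and NNS/NNP are skipped.
        if tag == "JJ" or (tag == "NN" and word not in stop_words):
            kept.append(word)
    return kept


def extract_house_words(text, stop_words=STOP_WORDS):
    """Tag raw text with TextBlob and return the filtered words."""
    # Imported here so the filter above can be used without TextBlob installed.
    from textblob import TextBlob

    # TextBlob(text).tags is a list of (word, part-of-speech tag) tuples.
    return filter_tagged_words(TextBlob(text).tags, stop_words)
```

Duplicates are kept on purpose, since a word cloud usually sizes words by how often they appear.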

The result is written to a JSON file with the houses as keys and their word lists as the values. I then piped this to my Hogwarts House Word Clouds D3 visual. See below for the full script.
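As a rough illustration of that output shape, a sketch along these lines would do the job. The function name, file name, and example words are placeholders I made up for this snippet, not the real output.

```python
import json


def write_house_words(house_words, path):
    """Write a {house: [words, ...]} dict to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(house_words, f, indent=2)


# Placeholder data, only here to show the structure.
example = {
    "Gryffindor": ["brave", "daring"],
    "Slytherin": ["ambitious", "cunning"],
}
write_house_words(example, "house_words.json")
```

The D3 side can then load the file and look up each word list by house name.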

Enjoy!