For my third project at Metis, I built a pitch classification model. The data was scraped from Kickstarter using Selenium (see post here for more detail). The model turns each pitch into a row of a term frequency-inverse document frequency (TF-IDF) matrix with one column per word. Those columns are the independent variables, and the pitch outcome (1 for successfully funded, 0 for failure to fund) is the dependent variable. To create the bag-of-words model we'll use sklearn's TfidfVectorizer. First, we import and clean the data.
import os
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
"""
Data
"""
fname = os.path.join(
    'Data',
    'kickstarter_data.json'
)
# Pass orient by keyword; newer pandas versions discourage positional use here
df = pd.read_json(fname, orient='records')
"""
Preprocessing
"""
pat = r'(?P<pledged>.+)\npledged of (?P<goal>.+) goal\n[,\d]+\nbackers?'
df[['pledged', 'goal']] = df.goal_and_pledged_backers.str.extract(pat)
df = df.assign(
    # Currency: the first run of non-digit characters in the string
    pledged_currency=lambda x: x.pledged.str.extract(r'([^\d]+)', expand=False),
    # Amount: every digit in the string, joined and cast to int
    pledged_amount=lambda x: x.pledged.astype(str).map(lambda s: ''.join([i for i in s if i.isdigit()])).astype(int),
    goal_currency=lambda x: x.goal.str.extract(r'([^\d]+)', expand=False),
    goal_amount=lambda x: x.goal.astype(str).map(lambda s: ''.join([i for i in s if i.isdigit()])).astype(int),
)
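To make the preprocessing step concrete, here is a minimal sketch of the same extraction on two made-up strings. The sample text and the helper `digits_to_int` are hypothetical; the layout of the scraped field is only assumed from the shape of the pattern, with named groups `pledged` and `goal`.

```python
import pandas as pd

# Hypothetical samples; the real scraped text may be laid out differently
sample = pd.Series([
    "$1,250\npledged of $1,000 goal\n42\nbackers",
    "CA$300\npledged of CA$5,000 goal\n1\nbacker",
])

# Named groups become the column names of the extracted frame
pat = r'(?P<pledged>.+)\npledged of (?P<goal>.+) goal\n[,\d]+\nbackers?'
parsed = sample.str.extract(pat)


def digits_to_int(text):
    """Keep only the digit characters of text and cast the result to int."""
    return int(''.join(ch for ch in text if ch.isdigit()))


parsed['pledged_amount'] = parsed['pledged'].map(digits_to_int)
parsed['goal_amount'] = parsed['goal'].map(digits_to_int)
# expand=False returns a Series rather than a one-column DataFrame
parsed['currency'] = parsed['pledged'].str.extract(r'([^\d]+)', expand=False)

print(parsed)
```

Each row ends up with the raw `pledged` and `goal` strings plus a numeric amount and a currency prefix, which is the same shape the `assign` call produces on the full data.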
Now let's look at some Kickstarter pitches, aka stories:
story |
---|
Yes, it's Christmas in August! And you are among the first to hear about The Christmas Stairs, a brand new children’s book that will be released this fall! \n The Author \n Patty Stoner, a mother of five grown children, lives with her husband Tim in Grand Rapids, Michigan, where she has directed the Discipleship program at The Potters House School for the past 20 years. Patty loves children of all ages, as well as writing and creating events for their enrichment. One of her weekly highlights has been reading aloud to the second and third graders. “I especially loved reading during the Christmas season, and it was during those years a deep desire began to grow within me to write a Christmas story of my own - The Christmas Stairs is the result!”\n ... |
with the help of an amazingly Talented Lady @thingymabobsboutique \nmy Wendy Darling inspired Pin was bought to life.\n I have created a few pins previously but nothing quite this big.\n\nThe pin will meas approximately 3 inches and will have Glitter, sandblasted and Screen Printed elements to it. |
Hai!\nWelcome to my kickstarter, I’m Rosanna, and I’m a small girl with a vivid imagination hiding out in rainy England. This all started with a doodle in the back of a notebook, which when I shared with friends took a life of its own and requests started coming in. This went from stickers to now enamel pin requests, so here we are!\nWhy pledge rather than buy? ... |
Those lengthy pitches make for a lot of NLP data to work with! Next we'll run our stories through a TfidfVectorizer and then send that to our model.
documents = df.story.values
vectorizer = TfidfVectorizer(stop_words="english")
# GaussianNB needs a dense array, so convert the sparse TF-IDF matrix
doc_vectors = vectorizer.fit_transform(documents).toarray()
y_true = df.loc[:, 'success']
model = GaussianNB().fit(doc_vectors, y_true)
y_pred = model.predict(doc_vectors)
"Accuracy Score:{:-10.2%}".format(accuracy_score(y_true, y_pred))
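This returns an accuracy score of 99.46%. Now we, the potential Kickstarter investors, can estimate before a project starts whether it will be funded, keeping in mind that this score is measured on the same stories the model was trained on and so is likely optimistic for unseen pitches.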
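Since the score above is computed on the training stories themselves, a held-out split would give a fairer estimate of how the model handles new pitches. The following is a minimal sketch, not the original project code: the `stories` and `labels` lists are made-up stand-ins for `df.story` and `df.success`, and the split size and `random_state` are arbitrary choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Made-up stand-ins for df.story and df.success
stories = [
    "A beautifully illustrated children's book about Christmas stairs",
    "Enamel pin inspired by Wendy Darling with glitter details",
    "Sticker and enamel pin set drawn from notebook doodles",
    "Hardcover picture book for second and third graders",
    "Untested gadget prototype with vague delivery plans",
    "Expensive documentary film with no crew attached yet",
    "Mobile game concept lacking artwork or demo footage",
    "Restaurant opening that still needs a location and permits",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hold out a quarter of the stories; stratify keeps both classes in each split
X_train, X_test, y_train, y_test = train_test_split(
    stories, labels, test_size=0.25, stratify=labels, random_state=0
)

# Fit the vocabulary on training stories only, then reuse it for the test set
vectorizer = TfidfVectorizer(stop_words="english")
train_vectors = vectorizer.fit_transform(X_train).toarray()
test_vectors = vectorizer.transform(X_test).toarray()

model = GaussianNB().fit(train_vectors, y_train)
train_acc = accuracy_score(y_train, model.predict(train_vectors))
test_acc = accuracy_score(y_test, model.predict(test_vectors))

print("Train accuracy:{:-10.2%}".format(train_acc))
print("Test accuracy: {:-10.2%}".format(test_acc))
```

Fitting the vectorizer on the training split alone keeps words that appear only in the test stories from leaking into the features. On the real data, the gap between the two printed scores indicates how much of the in-sample accuracy comes from memorizing the training pitches.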