Sourcing Data From an API

Posted on November 08, 2018

There are several strategies for acquiring data programmatically in the wild. One of the most common is finding an API (Application Programming Interface) that allows access to an entity's database.

An example is the MTA's Turnstile website. The site lets you access text files containing CSV data for subway turnstile traffic. The URL's structure is:

      http://web.mta.info/developers/data/nyct/turnstile/turnstile_[YYMMDD].txt

To access multiple files at once, we need a function that interpolates the date of each weekly file in place of [YYMMDD]. For example, Saturday, January 11th, 2020 becomes 200111.

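import pandas as pd

def mta_data(week_numbers):
    """Download the weekly turnstile files and combine them into one DataFrame."""
    url = "http://web.mta.info/developers/data/nyct/turnstile/turnstile_{}.txt"
    frames = []
    for week_number in week_numbers:
        # Fill the placeholder with the YYMMDD code, e.g. 200111.
        file_url = url.format(week_number)
        # read_csv accepts a URL and parses the comma-separated text directly.
        df = pd.read_csv(file_url)
        frames.append(df)
    # Stack the weekly frames and renumber the rows so the index stays unique.
    return pd.concat(frames, ignore_index=True)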
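
The date codes passed to mta_data do not have to be typed by hand. Below is a minimal sketch of one way to generate them with the standard library. It assumes the files are always dated on a Saturday, as in the example above. The helper name recent_week_numbers is hypothetical and not part of the original code. It returns strings rather than integers so that a leading zero in the year would survive, and url.format accepts either.

```python
from datetime import date, timedelta

def recent_week_numbers(last_saturday, count):
    """Return `count` YYMMDD codes, oldest first, ending at `last_saturday`."""
    # date.weekday() counts Monday as 0, so Saturday is 5.
    if last_saturday.weekday() != 5:
        raise ValueError("expected a Saturday")
    # Step back one week at a time from the most recent Saturday.
    days = [last_saturday - timedelta(weeks=i) for i in range(count)]
    # %y%m%d gives two-digit year, month and day, e.g. 200111.
    return [d.strftime("%y%m%d") for d in reversed(days)]

print(recent_week_numbers(date(2020, 1, 11), 4))
# ['191221', '191228', '200104', '200111']
```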

Let's access the last four weeks of data as an example (as of January 13th, 2020).

week_numbers = [191221, 191228, 200104, 200111]
df = mta_data(week_numbers)
df.head()

    C/A  UNIT       SCP  STATION  LINENAME  DIVISION        DATE      TIME     DESC  ENTRIES    EXITS
0  A002  R051  02-00-00    59 ST   NQR456W       BMT  12/14/2019  03:00:00  REGULAR  7309003  2477349
1  A002  R051  02-00-00    59 ST   NQR456W       BMT  12/14/2019  07:00:00  REGULAR  7309008  2477362
2  A002  R051  02-00-00    59 ST   NQR456W       BMT  12/14/2019  11:00:00  REGULAR  7309080  2477433
3  A002  R051  02-00-00    59 ST   NQR456W       BMT  12/14/2019  15:00:00  REGULAR  7309289  2477498
4  A002  R051  02-00-00    59 ST   NQR456W       BMT  12/14/2019  19:00:00  REGULAR  7309595  2477541
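
The loop in mta_data stops at the first file that cannot be downloaded. The sketch below shows one possible way to skip unavailable weeks instead, using only the standard library. The names download_text and fetch_available are hypothetical, and the download step is passed in as a parameter only so the logic can be tried without a network connection.

```python
import urllib.error
import urllib.request

URL = "http://web.mta.info/developers/data/nyct/turnstile/turnstile_{}.txt"

def download_text(url):
    """Fetch a URL and return its body as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")

def fetch_available(week_numbers, download=download_text):
    """Return ({week_number: text}, [missing week_numbers])."""
    texts = {}
    missing = []
    for week_number in week_numbers:
        try:
            texts[week_number] = download(URL.format(week_number))
        except urllib.error.URLError:
            # HTTPError (e.g. a 404) is a subclass of URLError, so a missing
            # file is recorded here instead of ending the loop.
            missing.append(week_number)
    return texts, missing
```

Each downloaded text can then be handed to pandas with pd.read_csv(io.StringIO(text)), and the list of missing weeks shows which files were skipped.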