There are several strategies for acquiring data programmatically in the wild. One of the most common is to find an API (Application Programming Interface) that allows access to an entity's database.
An example is the MTA's turnstile data page. The site serves text files with CSV data for subway turnstile traffic. The URL structure is:
http://web.mta.info/developers/data/nyct/turnstile/turnstile_[YYMMDD].txt
To access multiple weeks at once, we can write a function that interpolates each week's date stamp in place of [YYMMDD]. Each file is stamped with a Saturday, formatted as a six-digit date, e.g. Saturday, January 11th, 2020 → 200111.
import pandas as pd

def mta_data(week_numbers):
    """Fetch and combine weekly MTA turnstile files."""
    url = "http://web.mta.info/developers/data/nyct/turnstile/turnstile_{}.txt"
    frames = list()
    for week_number in week_numbers:
        # Interpolate the week's date stamp into the URL and read the CSV.
        file_url = url.format(week_number)
        df = pd.read_csv(file_url)
        frames.append(df)
    # Stack all weekly frames into a single DataFrame.
    return pd.concat(frames)
Let's access the last four weeks of data as an example (as of January 13th, 2020).
week_numbers = [191221, 191228, 200104, 200111]
df = mta_data(week_numbers)
df.head()
 | C/A | UNIT | SCP | STATION | LINENAME | DIVISION | DATE | TIME | DESC | ENTRIES | EXITS
---|---|---|---|---|---|---|---|---|---|---
0 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 12/14/2019 | 03:00:00 | REGULAR | 7309003 | 2477349
1 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 12/14/2019 | 07:00:00 | REGULAR | 7309008 | 2477362
2 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 12/14/2019 | 11:00:00 | REGULAR | 7309080 | 2477433
3 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 12/14/2019 | 15:00:00 | REGULAR | 7309289 | 2477498
4 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 12/14/2019 | 19:00:00 | REGULAR | 7309595 | 2477541
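Rather than typing week numbers by hand, we can compute them. One possible approach, sketched below, assumes the files are stamped with Saturdays and uses the standard-library datetime module; the helper name recent_week_numbers is our own invention, not part of the MTA's API.

```python
from datetime import date, timedelta

def recent_week_numbers(end_date, weeks=4):
    """Hypothetical helper: return YYMMDD strings for the `weeks`
    most recent Saturdays on or before end_date, oldest first."""
    # Step back from end_date to the most recent Saturday (weekday() == 5).
    offset = (end_date.weekday() - 5) % 7
    last_saturday = end_date - timedelta(days=offset)
    # Walk backwards one week at a time, then format oldest-first.
    saturdays = [last_saturday - timedelta(weeks=w) for w in range(weeks)]
    return [d.strftime("%y%m%d") for d in reversed(saturdays)]

recent_week_numbers(date(2020, 1, 13))
# → ['191221', '191228', '200104', '200111']
```

The output matches the hardcoded list above, so the two can be swapped: df = mta_data(recent_week_numbers(date.today())) would pull the latest four files, assuming the most recent one has already been posted.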