Example 1: scraping a table from a web page¶
- Go to https://en.wikipedia.org/wiki/List_of_S%26P_500_companies to see a table of S&P 500 companies.
- Right-click on the table and select "View Page Source" to see the HTML code for the table.
- The table is in a <table> tag. Use CTRL-F and search for "<table" to find it.
- th = table header, tr = table row, td = table data (see the toy example after this list).
- To find the table on the page in code, we can use BeautifulSoup. Then, we pass the table to pd.read_html().
- We don't need to do the manual inspection we just did. Just run the code.
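Before running it, here is a toy illustration of the th/tr/td anatomy mentioned above. The markup is made up for illustration, not the actual Wikipedia HTML:

from bs4 import BeautifulSoup

# A made-up two-row table illustrating th = header cell,
# tr = row, and td = data cell.
html = """
<table id="constituents">
  <tr><th>Symbol</th><th>Security</th></tr>
  <tr><td>MMM</td><td>3M</td></tr>
  <tr><td>AOS</td><td>A. O. Smith</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
print([th.text for th in soup.find_all("th")])  # ['Symbol', 'Security']
print([td.text for td in soup.find_all("td")])  # ['MMM', '3M', 'AOS', 'A. O. Smith']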
In [2]:
import pandas as pd
import requests
from io import StringIO
from bs4 import BeautifulSoup
# URL of the Wikipedia page containing the list of S&P 500 constituents
url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
# Fetch the page
response = requests.get(url)
response.raise_for_status() # Ensure the request was successful
# Parse the HTML
soup = BeautifulSoup(response.content, 'html.parser')
# Find the table containing the S&P 500 constituents
table = soup.find('table', {'id': 'constituents'})
# Extract the data into a DataFrame.
# pd.read_html returns a list of DataFrames (a single element here);
# wrapping the HTML string in StringIO avoids the FutureWarning about
# passing literal HTML to read_html.
df = pd.read_html(StringIO(str(table)))[0]
df.head()
Out[2]:
| | Symbol | Security | GICS Sector | GICS Sub-Industry | Headquarters Location | Date added | CIK | Founded |
|---|---|---|---|---|---|---|---|---|
| 0 | MMM | 3M | Industrials | Industrial Conglomerates | Saint Paul, Minnesota | 1957-03-04 | 66740 | 1902 |
| 1 | AOS | A. O. Smith | Industrials | Building Products | Milwaukee, Wisconsin | 2017-07-26 | 91142 | 1916 |
| 2 | ABT | Abbott Laboratories | Health Care | Health Care Equipment | North Chicago, Illinois | 1957-03-04 | 1800 | 1888 |
| 3 | ABBV | AbbVie | Health Care | Biotechnology | North Chicago, Illinois | 2012-12-31 | 1551152 | 2013 (1888) |
| 4 | ACN | Accenture | Information Technology | IT Consulting & Other Services | Dublin, Ireland | 2011-07-06 | 1467373 | 1989 |
Explore this code to see how it works: put each of the following lines in its own code cell and execute it.
response.content
table
type(table)
str(table)
pd.read_html(str(table))
Example 2: getting CSV files from a website¶
- FINRA reports short interest (cumulative short sales that have not been covered) on a semi-monthly basis for all stocks, based on shorts that were not executed on an exchange. This is almost all retail trades.
- You could google FINRA short interest files and click around if necessary to find the data links. You should end up at https://www.finra.org/finra-data/browse-catalog/equity-short-interest/files.
- We want to get all of the files, not just the two shown.
- If you really want FINRA data, you should get an API key at https://developer.finra.org/.
Step 1: Find the URLs¶
- Method A: right-click on the Aug 30 link and select "Copy link address".
- Method B: right-click on the page, choose "View Page Source," then use CTRL-F to search for ".csv" or "Aug 30".
- Method C: use BeautifulSoup to find the links, as sketched after this list.
- We only get two links from this page, but we can guess that all of the files share the same URL structure.
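A minimal sketch of Method C, assuming the links appear in the page's initial HTML. The page builds parts of itself with JavaScript, so a plain GET may return nothing, which is one reason Method B below uses Selenium:

import requests
from bs4 import BeautifulSoup

page_url = "https://www.finra.org/finra-data/browse-catalog/equity-short-interest/files"
response = requests.get(page_url)
soup = BeautifulSoup(response.content, "html.parser")

# Print every link that points to a .csv file (may be empty if the
# links are rendered by JavaScript rather than served in the HTML)
for a in soup.find_all("a", href=True):
    if a["href"].endswith(".csv"):
        print(a["href"])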
Step 2: Read a single file¶
- Download a file and look at it.
- Try reading Aug 30 with pd.read_csv.
- Use pd.read_csv("url", sep="|") because the files are pipe-delimited.
- You'll get an HTTP 403 (Forbidden) error, as shown below. As a workaround, use the requests library to fetch the file and then read the response text with pd.read_csv.
- Ask Julius how to do it. The next cell is Julius's response.
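For reference, the failing direct attempt looks like this. pandas fetches http URLs with urllib under the hood, and FINRA's CDN rejects that request, likely (an assumption) because of urllib's default User-Agent:

# The direct attempt that fails with "HTTP Error 403: Forbidden"
import pandas as pd

url = "https://cdn.finra.org/equity/otcmarket/biweekly/shrt20240830.csv"
df = pd.read_csv(url, sep="|")  # raises urllib.error.HTTPError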
In [3]:
import pandas as pd
import requests
import io

# URL of the CSV file
data_url = 'https://cdn.finra.org/equity/otcmarket/biweekly/shrt20240830.csv'

# Send a GET request to the URL
response = requests.get(data_url)

# Check if the request was successful
if response.status_code == 200:
    # Read the content of the response into a pandas DataFrame
    df = pd.read_csv(io.StringIO(response.text), sep='|')

    # Display the first few rows of the DataFrame
    print(df.head())

    # Display information about the DataFrame
    print("\nDataFrame Info:")
    df.info()
else:
    print(f"Failed to retrieve the data. Status code: {response.status_code}")
    print(f"Response content: {response.text[:500]}...")  # Print first 500 characters of the response
   accountingYearMonthNumber symbolCode                       issueName  \
0                   20240830          A       Agilent Technologies Inc.
1                   20240830         AA               Alcoa Corporation
2                   20240830        AAA  Alternative Access First Prior
3                   20240830      AAALY  Aareal Bank AG Unsponsored Ame
4                   20240830       AAAU  Goldman Sachs Physical Gold ET

  issuerServicesGroupExchangeCode marketClassCode  \
0                               A            NYSE
1                               A            NYSE
2                               E            ARCA
3                               S             OTC
4                               H             BZX

   currentShortPositionQuantity  previousShortPositionQuantity stockSplitFlag  \
0                       3160001                        2855618            NaN
1                      11370782                       13010958            NaN
2                         25705                          25920            NaN
3                             2                              2            NaN
4                        741002                         749049            NaN

   averageDailyVolumeQuantity  daysToCoverQuantity revisionFlag  \
0                     1550973                 2.04          NaN
1                     4758982                 2.39          NaN
2                        6139                 4.19          NaN
3                           0               999.99          NaN
4                     1908691                 1.00          NaN

   changePercent  changePreviousNumber settlementDate
0          10.66                304383     2024-08-30
1         -12.61              -1640176     2024-08-30
2          -0.83                  -215     2024-08-30
3           0.00                     0     2024-08-30
4          -1.07                 -8047     2024-08-30

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19514 entries, 0 to 19513
Data columns (total 14 columns):
 #   Column                           Non-Null Count  Dtype
---  ------                           --------------  -----
 0   accountingYearMonthNumber        19514 non-null  int64
 1   symbolCode                       19513 non-null  object
 2   issueName                        19514 non-null  object
 3   issuerServicesGroupExchangeCode  19514 non-null  object
 4   marketClassCode                  19514 non-null  object
 5   currentShortPositionQuantity     19514 non-null  int64
 6   previousShortPositionQuantity    19514 non-null  int64
 7   stockSplitFlag                   42 non-null     object
 8   averageDailyVolumeQuantity       19514 non-null  int64
 9   daysToCoverQuantity              19514 non-null  float64
 10  revisionFlag                     1 non-null      object
 11  changePercent                    19514 non-null  float64
 12  changePreviousNumber             19514 non-null  int64
 13  settlementDate                   19514 non-null  object
dtypes: float64(2), int64(5), object(7)
memory usage: 2.1+ MB
Step 3: Download all files¶
- We can run a loop over URLs and use pd.read_csv.
- Method A: Try all possible dates.
- Method B: Find the dates by scraping the page.
- With either method, we can paste the files together as we download them, or we could save them individually to disk and read them all later (a sketch of the save-to-disk variant follows this list).
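A minimal sketch of the save-to-disk variant, using the same date range as the Method A cell below. The shorts/ folder name is hypothetical, chosen just for illustration:

import os
import pandas as pd
import requests

dates = pd.date_range("2024-06-01", "2024-09-23").strftime("%Y%m%d")
os.makedirs("shorts", exist_ok=True)  # hypothetical local folder
for date in dates:
    url = f"https://cdn.finra.org/equity/otcmarket/biweekly/shrt{date}.csv"
    response = requests.get(url)
    if response.status_code == 200:
        # Write the raw bytes to disk; read them later with pd.read_csv
        with open(f"shorts/shrt{date}.csv", "wb") as f:
            f.write(response.content)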
In [7]:
# Method A: small example
start = "2024-06-01"
end = "2024-09-23"
dates = pd.date_range(start, end)
dates = dates.strftime("%Y%m%d")

df = None
good_dates = []
for date in dates:
    url = f"https://cdn.finra.org/equity/otcmarket/biweekly/shrt{date}.csv"
    response = requests.get(url)
    if response.status_code == 200:
        # Read without the header row so the files stack by position
        new_df = pd.read_csv(io.StringIO(response.text), sep='|', skiprows=1, header=None)
        df = pd.concat((df, new_df))  # pd.concat silently drops the initial None
        good_dates.append(date)
        print(date)

# Re-read the last good file with its header row to recover the column names
date = good_dates[-1]
url = f"https://cdn.finra.org/equity/otcmarket/biweekly/shrt{date}.csv"
response = requests.get(url)
new_df = pd.read_csv(io.StringIO(response.text), sep='|')
df.columns = new_df.columns
20240614
20240628
20240715
20240731
20240815
20240830
In [8]:
df.head()
Out[8]:
| | accountingYearMonthNumber | symbolCode | issueName | issuerServicesGroupExchangeCode | marketClassCode | currentShortPositionQuantity | previousShortPositionQuantity | stockSplitFlag | averageDailyVolumeQuantity | daysToCoverQuantity | revisionFlag | changePercent | changePreviousNumber | settlementDate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20240614 | A | Agilent Technologies Inc. | A | NYSE | 2874792 | 4423968 | NaN | 2422474 | 1.19 | NaN | -35.02 | -1549176 | 2024-06-14 |
| 1 | 20240614 | AA | Alcoa Corporation | A | NYSE | 12774970 | 12496353 | NaN | 3824253 | 3.34 | NaN | 2.23 | 278617 | 2024-06-14 |
| 2 | 20240614 | AAA | Alternative Access First Prior | E | ARCA | 21440 | 27895 | NaN | 9855 | 2.18 | NaN | -23.14 | -6455 | 2024-06-14 |
| 3 | 20240614 | AAALY | Aareal Bank AG Unsponsored Ame | S | OTC | 2 | 1 | NaN | 0 | 999.99 | NaN | 100.00 | 1 | 2024-06-14 |
| 4 | 20240614 | AAAU | Goldman Sachs Physical Gold ET | H | BZX | 304291 | 674000 | NaN | 2212406 | 1.00 | NaN | -54.85 | -369709 | 2024-06-14 |
Method B: Find the dates by scraping the page¶
- Use Selenium to work the drop-down menus and get the dates.
- First, we have to find the drop-down menus.
In [2]:
# Find the drop-down menus using selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Set up the WebDriver (e.g., Chrome)
driver = webdriver.Chrome()

# Navigate to the desired webpage
driver.get('https://www.finra.org/finra-data/browse-catalog/equity-short-interest/files')

# Wait until the page is loaded and the drop-down menus are present
wait = WebDriverWait(driver, 10)
drop_downs = wait.until(EC.presence_of_all_elements_located((By.TAG_NAME, 'select')))

# Print the drop-down menus
for drop_down in drop_downs:
    print(drop_down.get_attribute('outerHTML'))

# Close the WebDriver
driver.quit()
<select data-drupal-selector="edit-custom-month-month" id="edit-custom-month-month" name="custom_month[month]" class="form-select form-control" data-once="chosen" style="display:visible; position:absolute; width:0px; height: 0px; clip:rect(0,0,0,0)" tabindex="-1"><option value="any">- Any -</option><option value="01">January</option><option value="02">February</option><option value="03">March</option><option value="04">April</option><option value="05">May</option><option value="06">June</option><option value="07">July</option><option value="08" selected="selected">August</option><option value="09">September</option><option value="10">October</option><option value="11">November</option><option value="12">December</option></select>
<select data-drupal-selector="edit-custom-year-year" id="edit-custom-year-year" name="custom_year[year]" class="form-select form-control" data-once="chosen" style="display:visible; position:absolute; width:0px; height: 0px; clip:rect(0,0,0,0)" tabindex="-1"><option value="any">- Any -</option><option value="0" selected="selected">2024</option><option value="1">2023</option><option value="2">2022</option><option value="3">2021</option><option value="4">2020</option><option value="5">2019</option><option value="6">2018</option><option value="7">2017</option><option value="8">2016</option></select>
In [ ]:
# Select a month from the months drop-down menu and print the csv file links
from selenium.webdriver.support.ui import Select

# Set up the WebDriver (e.g., Chrome)
driver = webdriver.Chrome()

# Navigate to the desired webpage
driver.get('https://www.finra.org/finra-data/browse-catalog/equity-short-interest/files')

# Wait until the drop-down menu is present
wait = WebDriverWait(driver, 10)
drop_down = wait.until(EC.presence_of_element_located((By.ID, 'edit-custom-month-month')))

# Create a Select object
select = Select(drop_down)

# Select an item by index (index 0 is "- Any -", so index 1 is January)
select.select_by_index(1)

# Wait for the page to update after selecting the drop-down item
# You can wait for a specific element that appears only after the page update
wait.until(EC.presence_of_element_located((By.XPATH, "//div[@class='view-content']")))

# Locate the links to CSV files on the updated page
csv_links = driver.find_elements(By.XPATH, "//a[contains(@href, '.csv')]")

# Print the links to CSV files
for link in csv_links:
    print(link.get_attribute('href'))

# Close the WebDriver
driver.quit()
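If the cell above does print links on your machine (the page is JavaScript-rendered, so results can vary), the harvested URLs plug straight into the requests pattern from Method A. A minimal sketch, assuming csv_urls holds the hrefs collected before driver.quit():

import io
import pandas as pd
import requests

# csv_urls is assumed to be built inside the Selenium session, e.g.
# csv_urls = [link.get_attribute('href') for link in csv_links]
frames = []
for url in csv_urls:
    response = requests.get(url)
    if response.status_code == 200:
        frames.append(pd.read_csv(io.StringIO(response.text), sep="|"))
df_all = pd.concat(frames, ignore_index=True)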