Plan for today¶
- Missing data
- Datetime objects
- Time series operations (Treasury yield example)
- Panel data operations (CRSP example)
- Extended panel data example (CRSP and Compustat)
In [134]:
import numpy as np
import pandas as pd
from pandas_datareader import DataReader as pdr
import wrds
from datetime import datetime
Missing data¶
Get Treasury yields from FRED¶
In [135]:
yields = pdr('DGS10', 'fred')
yields.head()
Out[135]:
DGS10 | |
---|---|
DATE | |
2019-09-11 | 1.75 |
2019-09-12 | 1.79 |
2019-09-13 | 1.90 |
2019-09-16 | 1.84 |
2019-09-17 | 1.81 |
In [136]:
yields.tail()
Out[136]:
DGS10 | |
---|---|
DATE | |
2024-08-30 | 3.91 |
2024-09-02 | NaN |
2024-09-03 | 3.84 |
2024-09-04 | 3.77 |
2024-09-05 | 3.73 |
Checking for and dropping missing values¶
In [137]:
yields.isna()
Out[137]:
DGS10 | |
---|---|
DATE | |
2019-09-11 | False |
2019-09-12 | False |
2019-09-13 | False |
2019-09-16 | False |
2019-09-17 | False |
... | ... |
2024-08-30 | False |
2024-09-02 | True |
2024-09-03 | False |
2024-09-04 | False |
2024-09-05 | False |
1302 rows × 1 columns
In [138]:
yields.isna().sum()
Out[138]:
DGS10    54
dtype: int64
In [139]:
yields = yields.dropna()
yields.tail()
Out[139]:
DGS10 | |
---|---|
DATE | |
2024-08-29 | 3.87 |
2024-08-30 | 3.91 |
2024-09-03 | 3.84 |
2024-09-04 | 3.77 |
2024-09-05 | 3.73 |
Datetime objects¶
- Datetime methods (.month, .strftime, timedelta, ...)
- String to datetime
- Pandas period to datetime
In [140]:
x = yields.index[0]
x
Out[140]:
Timestamp('2019-09-11 00:00:00')
In [141]:
x.year
Out[141]:
2019
In [142]:
[x.day for x in yields.index]
Out[142]:
[11, 12, 13, 16, 17, 18, 19, 20, 23, 24, 25, 26, 27, 30, 1, 2, 3, 4, 7, 8, 9, 10, 11, 15, 16, 17, 18, 21, 22, 23, 24, 25, 28, 29, 30, 31, ...]
In [143]:
# differences in dates
y = yields.index[-1]
z = y-x
print(type(z))
z.days
<class 'pandas._libs.tslibs.timedeltas.Timedelta'>
Out[143]:
1821
Datetime to string and back¶
In [144]:
print("x", x, type(x))
y = x.strftime('%Y-%m-%d')
print("y", y, type(y))
x 2019-09-11 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
y 2019-09-11 <class 'str'>
In [145]:
help(x.strftime)
Help on method strftime in module pandas._libs.tslibs.timestamps:

strftime(format) method of pandas._libs.tslibs.timestamps.Timestamp instance
    Return a formatted string of the Timestamp.

    Parameters
    ----------
    format : str
        Format string to convert Timestamp to string.
        See strftime documentation for more information on the format string:
        https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior.

    Examples
    --------
    >>> ts = pd.Timestamp('2020-03-14T15:32:52.192548651')
    >>> ts.strftime('%Y-%m-%d %X')
    '2020-03-14 15:32:52'
In [146]:
# convert between string formats using
# datetime as intermediate step
date = "Sep 9, 2024"
print(date, type(date))
date_datetime = datetime.strptime(date, "%b %d, %Y")
print(date_datetime, type(date_datetime))
new_date = date_datetime.strftime("%Y-%m-%d")
print(new_date, type(new_date))
Sep 9, 2024 <class 'str'>
2024-09-09 00:00:00 <class 'datetime.datetime'>
2024-09-09 <class 'str'>
In [147]:
datetime.strptime("Sep 9, 2024", "%b %d, %Y").strftime("%Y-%m-%d")
Out[147]:
'2024-09-09'
In [148]:
# pandas can convert indexes and columns
# between types - a simple example:
yields.index.astype(str)
Out[148]:
Index(['2019-09-11', '2019-09-12', '2019-09-13', '2019-09-16', '2019-09-17', '2019-09-18', '2019-09-19', '2019-09-20', '2019-09-23', '2019-09-24', ... '2024-08-22', '2024-08-23', '2024-08-26', '2024-08-27', '2024-08-28', '2024-08-29', '2024-08-30', '2024-09-03', '2024-09-04', '2024-09-05'], dtype='object', name='DATE', length=1248)
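Going the other direction, pd.to_datetime converts strings back to datetimes. A minimal sketch, reusing the objects above:

# a string index back to a DatetimeIndex
pd.to_datetime(yields.index.astype(str))

# also works on a single string
pd.to_datetime("2024-09-09")   # Timestamp('2024-09-09 00:00:00')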
Pandas period objects¶
- When we downloaded the Fama-French factors from French's data library, the dates came in pandas monthly period format
- Julius converted them to datetime because datetimes plot better
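A minimal sketch of that conversion, using a made-up monthly series in place of the Fama-French factors:

# a small series indexed by monthly periods, like the Fama-French dates
ff = pd.Series([0.01, -0.02, 0.03], index=pd.period_range("2024-01", periods=3, freq="M"))
print(ff.index)    # PeriodIndex(['2024-01', '2024-02', '2024-03'], dtype='period[M]')

# convert the PeriodIndex to a DatetimeIndex (first day of each month by default)
ff.index = ff.index.to_timestamp()
print(ff.index)    # DatetimeIndex(['2024-01-01', '2024-02-01', '2024-03-01'], ...)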
Time Series operations in Pandas¶
- Lags and changes
- Downsampling and upsampling
- Rolling windows
- Autocorrelation
Lags and changes¶
Try
- yields.shift()
- yields.shift(2)
- yields.shift(-1)
- yields.diff()
- yields.pct_change()
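A sketch of what each of these returns (all preserve the DatetimeIndex and introduce NaNs at the edges):

lag1 = yields.shift()      # values lagged one row (one trading day here)
lag2 = yields.shift(2)     # lagged two rows
lead1 = yields.shift(-1)   # led one row: tomorrow's yield on today's date
chg = yields.diff()        # first difference: yields - yields.shift()
pct = yields.pct_change()  # percent change: yields / yields.shift() - 1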
Downsampling and upsampling¶
- downsampling examples:
- calculate monthly averages, std devs, etc from daily data
- extract first or last day of each month
- upsampling example:
- fill forward or backward to get daily data from monthly data
- can resample at other frequencies also - yearly, quarterly, weekly, 5 seconds, etc.
In [149]:
yields.resample("ME").mean()
Out[149]:
DGS10 | |
---|---|
DATE | |
2019-09-30 | 1.755714 |
2019-10-31 | 1.706818 |
2019-11-30 | 1.812105 |
2019-12-31 | 1.862857 |
2020-01-31 | 1.757619 |
... | ... |
2024-05-31 | 4.482273 |
2024-06-30 | 4.305263 |
2024-07-31 | 4.248636 |
2024-08-31 | 3.870909 |
2024-09-30 | 3.780000 |
61 rows × 1 columns
In [150]:
yields.resample("ME").std()
Out[150]:
DGS10 | |
---|---|
DATE | |
2019-09-30 | 0.070133 |
2019-10-31 | 0.104808 |
2019-11-30 | 0.065028 |
2019-12-31 | 0.058236 |
2020-01-31 | 0.107977 |
... | ... |
2024-05-31 | 0.068656 |
2024-06-30 | 0.075966 |
2024-07-31 | 0.090832 |
2024-08-31 | 0.062022 |
2024-09-30 | 0.055678 |
61 rows × 1 columns
In [151]:
yields.resample("ME").last()
Out[151]:
DGS10 | |
---|---|
DATE | |
2019-09-30 | 1.68 |
2019-10-31 | 1.69 |
2019-11-30 | 1.78 |
2019-12-31 | 1.92 |
2020-01-31 | 1.51 |
... | ... |
2024-05-31 | 4.51 |
2024-06-30 | 4.36 |
2024-07-31 | 4.09 |
2024-08-31 | 3.91 |
2024-09-30 | 3.73 |
61 rows × 1 columns
- "ME" marks the month by the last date of the month.
- "MS" marks the month by the first date of the month.
- but .first() and .last() determine whether you get the first or the last observation within each month
- use .resample("ME").last() to get the last date of the month
- use .resample("MS").first() to get the first date of the month
In [152]:
# this will be confusing
yields.resample("MS").last()
Out[152]:
DGS10 | |
---|---|
DATE | |
2019-09-01 | 1.68 |
2019-10-01 | 1.69 |
2019-11-01 | 1.78 |
2019-12-01 | 1.92 |
2020-01-01 | 1.51 |
... | ... |
2024-05-01 | 4.51 |
2024-06-01 | 4.36 |
2024-07-01 | 4.09 |
2024-08-01 | 3.91 |
2024-09-01 | 3.73 |
61 rows × 1 columns
In [153]:
# this makes more sense
yields.resample("MS").first()
Out[153]:
DGS10 | |
---|---|
DATE | |
2019-09-01 | 1.75 |
2019-10-01 | 1.65 |
2019-11-01 | 1.73 |
2019-12-01 | 1.83 |
2020-01-01 | 1.88 |
... | ... |
2024-05-01 | 4.63 |
2024-06-01 | 4.41 |
2024-07-01 | 4.48 |
2024-08-01 | 3.99 |
2024-09-01 | 3.84 |
61 rows × 1 columns
In [154]:
# you can daisy chain time-series methods with resampling
yields.resample("ME").last().diff()
Out[154]:
DGS10 | |
---|---|
DATE | |
2019-09-30 | NaN |
2019-10-31 | 0.01 |
2019-11-30 | 0.09 |
2019-12-31 | 0.14 |
2020-01-31 | -0.41 |
... | ... |
2024-05-31 | -0.18 |
2024-06-30 | -0.15 |
2024-07-31 | -0.27 |
2024-08-31 | -0.18 |
2024-09-30 | -0.18 |
61 rows × 1 columns
Upsampling¶
In [155]:
# create an example with monthly data
yields_monthly = yields.resample("ME").last()
yields_monthly.head()
Out[155]:
DGS10 | |
---|---|
DATE | |
2019-09-30 | 1.68 |
2019-10-31 | 1.69 |
2019-11-30 | 1.78 |
2019-12-31 | 1.92 |
2020-01-31 | 1.51 |
In [156]:
# upsample to daily with forward fill
yields_monthly.resample("D").ffill()
Out[156]:
DGS10 | |
---|---|
DATE | |
2019-09-30 | 1.68 |
2019-10-01 | 1.68 |
2019-10-02 | 1.68 |
2019-10-03 | 1.68 |
2019-10-04 | 1.68 |
... | ... |
2024-09-26 | 3.91 |
2024-09-27 | 3.91 |
2024-09-28 | 3.91 |
2024-09-29 | 3.91 |
2024-09-30 | 3.73 |
1828 rows × 1 columns
Rolling windows¶
- can aggregate (mean, std dev, etc) over rolling windows
- can apply custom functions over rolling windows - e.g., regression
- can create expanding windows with .expanding()
In [157]:
yields.rolling(21).mean()
Out[157]:
DGS10 | |
---|---|
DATE | |
2019-09-11 | NaN |
2019-09-12 | NaN |
2019-09-13 | NaN |
2019-09-16 | NaN |
2019-09-17 | NaN |
... | ... |
2024-08-29 | 3.869048 |
2024-08-30 | 3.865238 |
2024-09-03 | 3.867143 |
2024-09-04 | 3.866667 |
2024-09-05 | 3.858571 |
1248 rows × 1 columns
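The cell above uses a built-in aggregation. A sketch of the custom-function and expanding cases mentioned in the bullets above (the interquartile range is just an illustrative choice):

# apply a custom function to each 21-observation window
iqr = yields.DGS10.rolling(21).apply(
    lambda x: x.quantile(0.75) - x.quantile(0.25)
)

# expanding windows use all data from the start through each date
expanding_mean = yields.DGS10.expanding().mean()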
If you want monthly averages at the end of each month only, how do you get that?
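One possible answer, as a sketch: compute the rolling mean first, then downsample to month ends.

# 21-day rolling mean, keeping only the last observation of each month
yields.rolling(21).mean().resample("ME").last()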
Autocorrelation¶
In [158]:
yields.DGS10.autocorr(1)
Out[158]:
np.float64(0.9989251861649782)
In [159]:
yields.DGS10.resample("ME").last().autocorr()
Out[159]:
np.float64(0.9795807592048992)
In [160]:
yields.DGS10.resample("ME").last().diff().autocorr()
Out[160]:
np.float64(0.17386711059598828)
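So yield levels are highly persistent at both daily and monthly frequencies (autocorrelations near 1), while monthly changes in yields are much closer to serially uncorrelated.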
Panel data¶
CRSP¶
In [161]:
import numpy as np
import pandas as pd
import wrds
conn = wrds.Connection()
WRDS recommends setting up a .pgpass file.
You can create this file yourself at any time
with the create_pgpass_file() function.
Loading library list...
Done
In [198]:
crsp = conn.raw_sql(
    """
    select a.permno, a.permco, a.date, a.ret, abs(a.prc)*a.shrout as me, b.exchcd
    from crsp.msf a inner join crsp.msenames b
    on a.permno=b.permno and a.date between b.namedt and b.nameendt
    and b.exchcd in (1,2,3) and b.shrcd in (10,11)
    where a.date >= '2020-01-01'
    order by a.permno, a.date
    """,
    date_cols=['date']
)

# change strings or floats to integers
for col in ['permno','permco']:
    crsp[col] = crsp[col].astype(int)

# define market equity as sum of market equities of all permnos associated with a permco
crsp['me'] = crsp.groupby(['date','permco'], group_keys=False).me.transform("sum")
crsp.drop(columns=["permco"], inplace=True)
crsp.set_index(["permno","date"], inplace=True)
crsp.head()
Out[198]:
ret | me | exchcd | ||
---|---|---|---|---|
permno | date | |||
10026 | 2020-01-31 | -0.100016 | 3.137527e+06 | 3 |
2020-02-28 | -0.030270 | 3.042554e+06 | 3 | |
2020-03-31 | -0.244031 | 2.285448e+06 | 3 | |
2020-04-30 | 0.049835 | 2.399343e+06 | 3 | |
2020-05-29 | 0.012596 | 2.429563e+06 | 3 |
MultiIndexes¶
In [199]:
crsp.index
Out[199]:
MultiIndex([(10026, '2020-01-31'), (10026, '2020-02-28'), (10026, '2020-03-31'), (10026, '2020-04-30'), (10026, '2020-05-29'), (10026, '2020-06-30'), (10026, '2020-07-31'), (10026, '2020-08-31'), (10026, '2020-09-30'), (10026, '2020-10-30'), ... (93436, '2023-03-31'), (93436, '2023-04-28'), (93436, '2023-05-31'), (93436, '2023-06-30'), (93436, '2023-07-31'), (93436, '2023-08-31'), (93436, '2023-09-29'), (93436, '2023-10-31'), (93436, '2023-11-30'), (93436, '2023-12-29')], names=['permno', 'date'], length=195627)
In [200]:
# extract a cross section with .loc
crsp.loc[10026].head()
Out[200]:
ret | me | exchcd | |
---|---|---|---|
date | |||
2020-01-31 | -0.100016 | 3.137527e+06 | 3 |
2020-02-28 | -0.030270 | 3.042554e+06 | 3 |
2020-03-31 | -0.244031 | 2.285448e+06 | 3 |
2020-04-30 | 0.049835 | 2.399343e+06 | 3 |
2020-05-29 | 0.012596 | 2.429563e+06 | 3 |
In [201]:
# extract a cross section with .xs
crsp.xs(10026, level="permno").head()
Out[201]:
ret | me | exchcd | |
---|---|---|---|
date | |||
2020-01-31 | -0.100016 | 3.137527e+06 | 3 |
2020-02-28 | -0.030270 | 3.042554e+06 | 3 |
2020-03-31 | -0.244031 | 2.285448e+06 | 3 |
2020-04-30 | 0.049835 | 2.399343e+06 | 3 |
2020-05-29 | 0.012596 | 2.429563e+06 | 3 |
In [202]:
# swap levels
crsp.index = crsp.index.swaplevel()
crsp = crsp.sort_index()
crsp.head()
Out[202]:
ret | me | exchcd | ||
---|---|---|---|---|
date | permno | |||
2020-01-31 | 10026 | -0.100016 | 3137526.96 | 3 |
10028 | 0.607407 | 58425.08 | 2 | |
10032 | -0.075643 | 2078268.64 | 3 | |
10044 | -0.098592 | 49920.00 | 3 | |
10051 | -0.115176 | 912167.34 | 1 |
In [203]:
# unstack
crsp.ret.unstack().head()
Out[203]:
permno | 10026 | 10028 | 10032 | 10044 | 10051 | 10066 | 10104 | 10107 | 10138 | 10145 | ... | 93384 | 93393 | 93397 | 93401 | 93420 | 93422 | 93423 | 93426 | 93434 | 93436 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | |||||||||||||||||||||
2020-01-31 | -0.100016 | 0.607407 | -0.075643 | -0.098592 | -0.115176 | NaN | -0.005474 | 0.079455 | 0.095946 | -0.021356 | ... | 0.082519 | 0.120582 | 0.015081 | -0.076132 | -0.309816 | -0.295556 | -0.154733 | 0.015882 | 0.023810 | 0.555160 |
2020-02-28 | -0.030270 | 0.225806 | -0.067070 | -0.064904 | -0.055669 | NaN | -0.057007 | -0.045292 | -0.116229 | -0.058596 | ... | -0.258776 | -0.352505 | -0.144467 | 0.328508 | -0.273333 | -0.290221 | -0.337005 | -0.204690 | 0.372093 | 0.026777 |
2020-03-31 | -0.244031 | -0.052632 | -0.177694 | -0.373368 | -0.324664 | NaN | -0.022847 | -0.026542 | -0.164901 | -0.175002 | ... | -0.654939 | 0.180516 | -0.215569 | -0.275775 | -0.785933 | -0.842444 | -0.494066 | -0.269021 | -0.305085 | -0.215557 |
2020-04-30 | 0.049835 | 0.448413 | 0.149010 | -0.052083 | 0.178434 | NaN | 0.100972 | 0.136326 | 0.184127 | 0.060617 | ... | NaN | 2.036408 | 0.024809 | 0.081019 | 1.011143 | 1.947683 | 0.595694 | 0.151394 | -0.073171 | 0.492137 |
2020-05-29 | 0.012596 | -0.016438 | 0.024406 | -0.057143 | 0.000000 | NaN | 0.015103 | 0.025389 | 0.045576 | 0.034179 | ... | NaN | -0.241407 | 0.000749 | 0.059957 | -0.345646 | -0.134280 | 0.148426 | 0.021626 | 0.121053 | 0.067939 |
5 rows × 5113 columns
Time series stuff for a panel¶
- do everything grouping by permno
- group_keys=False means don't keep the permno as an additional part of the index. This lets us put the results back in the dataframe with the original (date, permno) index.
- examples:
- changes and lags
- custom functions with apply
- rolling windows
In [204]:
# changes
crsp["me_pct_chg"] = crsp.groupby("permno", group_keys=False).me.pct_change()
crsp.head()
Out[204]:
ret | me | exchcd | me_pct_chg | ||
---|---|---|---|---|---|
date | permno | ||||
2020-01-31 | 10026 | -0.100016 | 3137526.96 | 3 | NaN |
10028 | 0.607407 | 58425.08 | 2 | NaN | |
10032 | -0.075643 | 2078268.64 | 3 | NaN | |
10044 | -0.098592 | 49920.00 | 3 | NaN | |
10051 | -0.115176 | 912167.34 | 1 | NaN |
In [205]:
# shifts
crsp["me_lag"] = crsp.groupby("permno", group_keys=False).me.shift(1)
crsp.head()
Out[205]:
ret | me | exchcd | me_pct_chg | me_lag | ||
---|---|---|---|---|---|---|
date | permno | |||||
2020-01-31 | 10026 | -0.100016 | 3137526.96 | 3 | NaN | NaN |
10028 | 0.607407 | 58425.08 | 2 | NaN | NaN | |
10032 | -0.075643 | 2078268.64 | 3 | NaN | NaN | |
10044 | -0.098592 | 49920.00 | 3 | NaN | NaN | |
10051 | -0.115176 | 912167.34 | 1 | NaN | NaN |
In [206]:
# custom functions
crsp["accum"] = crsp.groupby("permno", group_keys=False).apply(
lambda x: (1+x.ret).cumprod()
)
crsp.head()
Out[206]:
ret | me | exchcd | me_pct_chg | me_lag | accum | ||
---|---|---|---|---|---|---|---|
date | permno | ||||||
2020-01-31 | 10026 | -0.100016 | 3137526.96 | 3 | NaN | NaN | 0.899984 |
10028 | 0.607407 | 58425.08 | 2 | NaN | NaN | 1.607407 | |
10032 | -0.075643 | 2078268.64 | 3 | NaN | NaN | 0.924357 | |
10044 | -0.098592 | 49920.00 | 3 | NaN | NaN | 0.901408 | |
10051 | -0.115176 | 912167.34 | 1 | NaN | NaN | 0.884824 |
In [210]:
# another custom function
# mom = return over first 11 of prior 12 months
# if this is September, shift(13) is end of August a year ago
# shift(2) is end of July this year
crsp["mom"] = crsp.groupby("permno", group_keys=False).accum.apply(
lambda x: x.shift(2)/x.shift(13)-1
)
crsp.dropna().head()
Out[210]:
ret | me | exchcd | me_pct_chg | me_lag | accum | mom | mom2 | ||
---|---|---|---|---|---|---|---|---|---|
date | permno | ||||||||
2021-02-26 | 10026 | 0.039958 | 3.013265e+06 | 3 | 0.039958 | 2897486.80 | 0.876640 | -0.046738 | 0.186441 |
10028 | -0.161342 | 1.413562e+05 | 2 | -0.161342 | 168550.50 | 3.888887 | 1.396313 | 4.939689 | |
10032 | 0.091784 | 2.420220e+06 | 3 | 0.093795 | 2212680.72 | 1.091500 | 0.099690 | 0.328714 | |
10044 | 0.218894 | 3.239596e+04 | 3 | 0.228928 | 26361.16 | 0.582110 | -0.505595 | 1.083334 | |
10051 | 0.072718 | 8.376358e+05 | 1 | 0.072718 | 780853.41 | 0.796088 | -0.099877 | 0.539223 |
Exercise¶
Grab the first 13 months of returns for permno 10026 and verify the momentum calculation.
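One possible solution sketch, checking against the mom value of -0.046738 shown for 2021-02 above:

# first 13 months of returns for permno 10026 (2020-01 through 2021-01)
r = crsp.xs(10026, level="permno").ret.iloc[:13]

# mom for 2021-02 skips the two most recent months and compounds
# the 11 returns from 2020-02 through 2020-12
mom_check = (1 + r.iloc[1:12]).prod() - 1
print(mom_check)   # should be about -0.046738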
Rolling windows for panels¶
In [220]:
# momentum with a rolling window
crsp["mom2"] = crsp.groupby("permno", group_keys=False).ret.apply(
lambda x: x.rolling(11).apply(lambda y:
(1+y).prod() - 1
)
)
crsp["mom2"] = crsp.groupby("permno", group_keys=False).mom2.shift(2)
In [221]:
# verify that we get the same result both ways
crsp[["mom", "mom2"]].dropna().head()
Out[221]:
mom | mom2 | ||
---|---|---|---|
date | permno | ||
2021-02-26 | 10026 | -0.046738 | -0.046738 |
10028 | 1.396313 | 1.396313 | |
10032 | 0.099690 | 0.099690 | |
10044 | -0.505595 | -0.505595 | |
10051 | -0.099877 | -0.099877 |
In [222]:
# better verification
(crsp.mom - crsp.mom2).abs().max()
Out[222]:
np.float64(2.842170943040401e-14)
Cross-sectional operations on a panel¶
In [233]:
crsp.groupby("date").me.median()
Out[233]:
date
2020-01-31    716361.5500
2020-02-28    659297.4500
2020-03-31    471096.1800
2020-04-30    563783.3600
2020-05-29    604435.6500
2020-06-30    638028.0500
2020-07-31    642562.6000
2020-08-31    685895.9000
2020-09-30    660193.0350
2020-10-30    667150.5700
2020-11-30    804351.6000
2020-12-31    876951.1800
2021-01-29    930408.6450
2021-02-26    997188.4950
2021-03-31    961598.2750
2021-04-30    907535.9375
2021-05-28    891399.5550
2021-06-30    937154.1600
2021-07-30    885187.9600
2021-08-31    885851.0800
2021-09-30    872242.5700
2021-10-29    880532.6500
2021-11-30    785550.5500
2021-12-31    784997.3200
2022-01-31    669696.6600
2022-02-28    653404.2000
2022-03-31    649834.1800
2022-04-29    570975.0000
2022-05-31    544259.1500
2022-06-30    498053.8900
2022-07-29    532845.6300
2022-08-31    527356.0600
2022-09-30    474613.3400
2022-10-31    506608.4000
2022-11-30    502639.1500
2022-12-30    492462.0800
2023-01-31    549974.4250
2023-02-28    530868.8700
2023-03-31    502809.1600
2023-04-28    478404.0900
2023-05-31    484659.7750
2023-06-30    523683.1950
2023-07-31    565771.7600
2023-08-31    529945.0850
2023-09-29    497064.7150
2023-10-31    459637.3000
2023-11-30    505252.5500
2023-12-29    601071.3800
Name: me, dtype: float64
In [234]:
crsp["rnk"] = crsp.groupby("date", group_keys=False).me.rank(pct=True)
crsp[["me", "rnk"]].head()
Out[234]:
me | rnk | ||
---|---|---|---|
date | permno | ||
2020-01-31 | 10026 | 3137526.96 | 0.735097 |
10028 | 58425.08 | 0.143175 | |
10032 | 2078268.64 | 0.668802 | |
10044 | 49920.00 | 0.127577 | |
10051 | 912167.34 | 0.538162 |
Exercise¶
Find the 0.2, 0.4, 0.6, and 0.8 quantiles of me each month using the subset of firms for which exchcd=1.
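One possible solution sketch:

# restrict to NYSE stocks (exchcd=1), then compute quantiles by month;
# .quantile accepts a list of quantiles
nyse = crsp[crsp.exchcd == 1]
nyse.groupby("date").me.quantile([0.2, 0.4, 0.6, 0.8])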
Compustat¶
In [246]:
comp = conn.raw_sql(
    """
    select gvkey, datadate as date, at
    from comp.funda
    where datadate >= '2020-01-01' and at>0
    and indfmt='INDL' and datafmt='STD' and popsrc='D' and consol='C'
    order by gvkey, datadate
    """,
    date_cols=['date']
)
link = conn.raw_sql(
    """
    select distinct gvkey, lpermno as permno, linkdt, linkenddt
    from crsp.Ccmxpf_linktable
    where linktype in ('LU', 'LC')
    and LINKPRIM in ('P', 'C')
    """
)

# convert strings or floats to ints
comp['gvkey'] = comp.gvkey.astype(int)
link['gvkey'] = link.gvkey.astype(int)
link['permno'] = link.permno.astype(int)

# fill in missing end dates with a future date
link['linkenddt'] = pd.to_datetime(
    link.linkenddt
).fillna(pd.Timestamp('21000101'))

# merge with Compustat data and keep rows with Compustat datadate
# between link date and link end date
comp = comp.merge(link, on='gvkey', how='inner')
comp = comp[
    (comp.date>=comp.linkdt) & (comp.date<=comp.linkenddt)
]
comp = comp.drop(columns=['gvkey', 'linkdt', 'linkenddt'])
Merge CRSP with Compustat¶
- Change dates to monthly period format before merging, because Compustat date is the last day of the month, and CRSP date is the last trading day of the month.
- Merge keeping all rows of CRSP data. There will be NaNs for Compustat data for 11 months each year.
In [235]:
crsp.reset_index(inplace=True)
crsp.head()
Out[235]:
date | permno | ret | me | exchcd | me_pct_chg | me_lag | accum | mom | mom2 | rnk | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2020-01-31 | 10026 | -0.100016 | 3137526.96 | 3 | NaN | NaN | 0.899984 | NaN | NaN | 0.735097 |
1 | 2020-01-31 | 10028 | 0.607407 | 58425.08 | 2 | NaN | NaN | 1.607407 | NaN | NaN | 0.143175 |
2 | 2020-01-31 | 10032 | -0.075643 | 2078268.64 | 3 | NaN | NaN | 0.924357 | NaN | NaN | 0.668802 |
3 | 2020-01-31 | 10044 | -0.098592 | 49920.00 | 3 | NaN | NaN | 0.901408 | NaN | NaN | 0.127577 |
4 | 2020-01-31 | 10051 | -0.115176 | 912167.34 | 1 | NaN | NaN | 0.884824 | NaN | NaN | 0.538162 |
In [252]:
crsp["date"] = crsp.date.dt.to_period('M')
comp["date"] = comp.date.dt.to_period('M')
df = crsp.merge(comp, on=['permno', 'date'], how='left')
df.set_index(["date", "permno"], inplace=True)
df.head()
Out[252]:
ret | me | exchcd | me_pct_chg | me_lag | accum | mom | mom2 | rnk | at | ||
---|---|---|---|---|---|---|---|---|---|---|---|
date | permno | ||||||||||
2020-01 | 10026 | -0.100016 | 3137526.96 | 3 | NaN | NaN | 0.899984 | NaN | NaN | 0.735097 | NaN |
10028 | 0.607407 | 58425.08 | 2 | NaN | NaN | 1.607407 | NaN | NaN | 0.143175 | NaN | |
10032 | -0.075643 | 2078268.64 | 3 | NaN | NaN | 0.924357 | NaN | NaN | 0.668802 | NaN | |
10044 | -0.098592 | 49920.00 | 3 | NaN | NaN | 0.901408 | NaN | NaN | 0.127577 | NaN | |
10051 | -0.115176 | 912167.34 | 1 | NaN | NaN | 0.884824 | NaN | NaN | 0.538162 | NaN |
Fill Compustat data into months¶
- Group by permno when filling forward so we don't fill from one stock into another
- Can add a limit on ffill so we don't fill in missing years of data
In [258]:
df['at'] = df.groupby('permno', group_keys=False).at.ffill()
df.xs(10026, level="permno")["at"].dropna().head(15)
Out[258]:
date
2020-09    1056.553
2020-10    1056.553
2020-11    1056.553
2020-12    1056.553
2021-01    1056.553
2021-02    1056.553
2021-03    1056.553
2021-04    1056.553
2021-05    1056.553
2021-06    1056.553
2021-07    1056.553
2021-08    1056.553
2021-09    1122.219
2021-10    1122.219
2021-11    1122.219
Freq: M, Name: at, dtype: float64
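The fill above is uncapped. A sketch of the limit mentioned in the bullets (12 months is an illustrative choice, matching annual reporting):

# cap forward-filling at 12 months so a firm that stops reporting
# does not carry stale total assets forward indefinitely
df['at'] = df.groupby('permno', group_keys=False)['at'].ffill(limit=12)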
In [259]:
# shift 6 months to ensure data is public
df["at"] = df.groupby("permno")["at"].shift(6)
df.xs(10026, level="permno")["at"].dropna().head(15)
Out[259]:
date
2021-03    1056.553
2021-04    1056.553
2021-05    1056.553
2021-06    1056.553
2021-07    1056.553
2021-08    1056.553
2021-09    1056.553
2021-10    1056.553
2021-11    1056.553
2021-12    1056.553
2022-01    1056.553
2022-02    1056.553
2022-03    1122.219
2022-04    1122.219
2022-05    1122.219
Freq: M, Name: at, dtype: float64