Day 4¶

BUSI 520: Python for Business Research¶

Kerry Back, JGSB, Rice University¶

Plan for today¶

  • Missing data
  • Datetime objects
  • Time series operations (Treasury yield example)
  • Panel data operations (CRSP example)
  • Extended panel data example (CRSP and Compustat)
In [134]:
import numpy as np
import pandas as pd
from pandas_datareader import DataReader as pdr
import wrds       
from datetime import datetime

Missing data¶

Get Treasury yields from FRED¶

In [135]:
yields = pdr('DGS10', 'fred')
yields.head()
Out[135]:
DGS10
DATE
2019-09-11 1.75
2019-09-12 1.79
2019-09-13 1.90
2019-09-16 1.84
2019-09-17 1.81
In [136]:
yields.tail()
Out[136]:
DGS10
DATE
2024-08-30 3.91
2024-09-02 NaN
2024-09-03 3.84
2024-09-04 3.77
2024-09-05 3.73

Checking for and dropping missing values¶

In [137]:
yields.isna()
Out[137]:
DGS10
DATE
2019-09-11 False
2019-09-12 False
2019-09-13 False
2019-09-16 False
2019-09-17 False
... ...
2024-08-30 False
2024-09-02 True
2024-09-03 False
2024-09-04 False
2024-09-05 False

1302 rows × 1 columns

In [138]:
yields.isna().sum()
Out[138]:
DGS10    54
dtype: int64


In [139]:
yields = yields.dropna()
yields.tail()
Out[139]:
DGS10
DATE
2024-08-29 3.87
2024-08-30 3.91
2024-09-03 3.84
2024-09-04 3.77
2024-09-05 3.73
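The missing values here fall on market holidays (e.g., 2024-09-02 was Labor Day), so dropping them is reasonable. If filling were preferable, a minimal sketch (applied to a fresh download, before any dropna):

# sketch: fill instead of drop (not run above)
raw = pdr('DGS10', 'fred')   # the pre-dropna data
filled = raw.ffill()         # carry the last observed yield forward over holidays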

Datetime objects¶

  • Datetime attributes and methods (.year, .month, .strftime, ...) and timedelta arithmetic
  • String to datetime
  • Pandas period to datetime
In [140]:
x = yields.index[0]
x
Out[140]:
Timestamp('2019-09-11 00:00:00')
In [141]:
x.year
Out[141]:
2019
In [142]:
[x.day for x in yields.index]
Out[142]:
[11,
 12,
 13,
 16,
 17,
 ...]
In [143]:
# differences in dates 

y = yields.index[-1]
z = y-x
print(type(z))
z.days
<class 'pandas._libs.tslibs.timedeltas.Timedelta'>
Out[143]:
1821
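Timedeltas can also be constructed directly and added to or subtracted from timestamps; a small sketch:

x + pd.Timedelta(days=30)   # 30 calendar days after x
x - pd.Timedelta(weeks=1)   # one week before x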

Datetime to string and back¶

In [144]:
print("x", x, type(x))

y = x.strftime('%Y-%m-%d')
print("y", y, type(y))
x 2019-09-11 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
y 2019-09-11 <class 'str'>
In [145]:
help(x.strftime)
Help on method strftime in module pandas._libs.tslibs.timestamps:

strftime(format) method of pandas._libs.tslibs.timestamps.Timestamp instance
    Return a formatted string of the Timestamp.

    Parameters
    ----------
    format : str
        Format string to convert Timestamp to string.
        See strftime documentation for more information on the format string:
        https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior.

    Examples
    --------
    >>> ts = pd.Timestamp('2020-03-14T15:32:52.192548651')
    >>> ts.strftime('%Y-%m-%d %X')
    '2020-03-14 15:32:52'

In [146]:
# convert between string formats using 
# datetime as intermediate step 

date = "Sep 9, 2024" 
print(date, type(date))
date_datetime = datetime.strptime(date, "%b %d, %Y")
print(date_datetime, type(date_datetime))
new_date = date_datetime.strftime("%Y-%m-%d")
print(new_date, type(new_date))
Sep 9, 2024 <class 'str'>
2024-09-09 00:00:00 <class 'datetime.datetime'>
2024-09-09 <class 'str'>
In [147]:
datetime.strptime("Sep 9, 2024", "%b %d, %Y").strftime("%Y-%m-%d")
Out[147]:
'2024-09-09'
In [148]:
# pandas can convert indexes and columns
# between types - a simple example:

yields.index.astype(str)
Out[148]:
Index(['2019-09-11', '2019-09-12', '2019-09-13', '2019-09-16', '2019-09-17',
       '2019-09-18', '2019-09-19', '2019-09-20', '2019-09-23', '2019-09-24',
       ...
       '2024-08-22', '2024-08-23', '2024-08-26', '2024-08-27', '2024-08-28',
       '2024-08-29', '2024-08-30', '2024-09-03', '2024-09-04', '2024-09-05'],
      dtype='object', name='DATE', length=1248)
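Going the other way, pd.to_datetime converts an index of strings back to a DatetimeIndex; a one-line sketch:

pd.to_datetime(yields.index.astype(str))   # back to a DatetimeIndex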

Pandas period objects¶

  • When we downloaded the Fama-French factors from French's data library, the dates came in pandas monthly period format.
  • Julius converted them to datetime because datetime plots better; a quick sketch of the period-to-datetime conversion follows.
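
A minimal sketch of the conversion (the period_range here is illustrative, not French's data):

months = pd.period_range("2024-01", "2024-04", freq="M")  # monthly PeriodIndex
months.to_timestamp()            # DatetimeIndex at the start of each month
months.to_timestamp(how="end")   # DatetimeIndex at the end of each month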

Time Series operations in Pandas¶

  • Lags and changes
  • Downsampling and upsampling
  • Rolling windows
  • Autocorrelation

Lags and changes¶

Try the following; a toy illustration appears after the list.

  • yields.shift()
  • yields.shift(2)
  • yields.shift(-1)
  • yields.diff()
  • yields.pct_change()
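
A toy illustration of what these return (the same methods apply column-wise to the yields DataFrame):

s = pd.Series([1.0, 2.0, 4.0])
s.shift()       # lag by one observation: NaN, 1.0, 2.0
s.shift(-1)     # lead by one observation: 2.0, 4.0, NaN
s.diff()        # first differences: NaN, 1.0, 2.0
s.pct_change()  # percent changes: NaN, 1.0, 1.0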

Downsampling and upsampling¶

  • downsampling examples:
    • calculate monthly averages, std devs, etc from daily data
    • extract first or last day of each month
  • upsampling example:
    • fill forward or backward to get daily data from monthly data
  • can resample at other frequencies also - yearly, quarterly, weekly, 5 seconds, etc.
In [149]:
yields.resample("ME").mean()
Out[149]:
DGS10
DATE
2019-09-30 1.755714
2019-10-31 1.706818
2019-11-30 1.812105
2019-12-31 1.862857
2020-01-31 1.757619
... ...
2024-05-31 4.482273
2024-06-30 4.305263
2024-07-31 4.248636
2024-08-31 3.870909
2024-09-30 3.780000

61 rows × 1 columns

In [150]:
yields.resample("ME").std()
Out[150]:
DGS10
DATE
2019-09-30 0.070133
2019-10-31 0.104808
2019-11-30 0.065028
2019-12-31 0.058236
2020-01-31 0.107977
... ...
2024-05-31 0.068656
2024-06-30 0.075966
2024-07-31 0.090832
2024-08-31 0.062022
2024-09-30 0.055678

61 rows × 1 columns

In [151]:
yields.resample("ME").last()
Out[151]:
DGS10
DATE
2019-09-30 1.68
2019-10-31 1.69
2019-11-30 1.78
2019-12-31 1.92
2020-01-31 1.51
... ...
2024-05-31 4.51
2024-06-30 4.36
2024-07-31 4.09
2024-08-31 3.91
2024-09-30 3.73

61 rows × 1 columns

  • "ME" marks the month by the last date of the month.
  • "MS" marks the month by the first date of the month.
  • but .first() and .last() determine whether you get the first date or the last date
  • use .resample("ME").last() to get the last date of the month
  • use .resample("MS").first() to get the first date of the month
In [152]:
# this will be confusing 

yields.resample("MS").last()
Out[152]:
DGS10
DATE
2019-09-01 1.68
2019-10-01 1.69
2019-11-01 1.78
2019-12-01 1.92
2020-01-01 1.51
... ...
2024-05-01 4.51
2024-06-01 4.36
2024-07-01 4.09
2024-08-01 3.91
2024-09-01 3.73

61 rows × 1 columns

In [153]:
# this makes more sense 

yields.resample("MS").first()
Out[153]:
DGS10
DATE
2019-09-01 1.75
2019-10-01 1.65
2019-11-01 1.73
2019-12-01 1.83
2020-01-01 1.88
... ...
2024-05-01 4.63
2024-06-01 4.41
2024-07-01 4.48
2024-08-01 3.99
2024-09-01 3.84

61 rows × 1 columns

In [154]:
# you can daisy chain time-series methods with resampling 

yields.resample("ME").last().diff()    
Out[154]:
DGS10
DATE
2019-09-30 NaN
2019-10-31 0.01
2019-11-30 0.09
2019-12-31 0.14
2020-01-31 -0.41
... ...
2024-05-31 -0.18
2024-06-30 -0.15
2024-07-31 -0.27
2024-08-31 -0.18
2024-09-30 -0.18

61 rows × 1 columns

Upsampling¶

In [155]:
# create an example with monthly data 

yields_monthly = yields.resample("ME").last()
yields_monthly.head()
Out[155]:
DGS10
DATE
2019-09-30 1.68
2019-10-31 1.69
2019-11-30 1.78
2019-12-31 1.92
2020-01-31 1.51
In [156]:
# upsample to daily with forward fill 

yields_monthly.resample("D").ffill()
Out[156]:
DGS10
DATE
2019-09-30 1.68
2019-10-01 1.68
2019-10-02 1.68
2019-10-03 1.68
2019-10-04 1.68
... ...
2024-09-26 3.91
2024-09-27 3.91
2024-09-28 3.91
2024-09-29 3.91
2024-09-30 3.73

1828 rows × 1 columns

Rolling windows¶

  • can aggregate (mean, std dev, etc.) over rolling windows
  • can apply custom functions over rolling windows - e.g., regression (a sketch follows the next cell)
  • can create expanding windows with .expanding()
In [157]:
yields.rolling(21).mean() 
Out[157]:
DGS10
DATE
2019-09-11 NaN
2019-09-12 NaN
2019-09-13 NaN
2019-09-16 NaN
2019-09-17 NaN
... ...
2024-08-29 3.869048
2024-08-30 3.865238
2024-09-03 3.867143
2024-09-04 3.866667
2024-09-05 3.858571

1248 rows × 1 columns

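A sketch of the second and third bullets above: a custom function (here, a regression slope) over a rolling window, and an expanding window.

# rolling 21-day trend: slope of a regression of yield on time
def slope(y):
    t = np.arange(len(y))
    return np.polyfit(t, y, 1)[0]

trend = yields.DGS10.rolling(21).apply(slope, raw=True)

# expanding window: mean of all observations up to each date
cum_mean = yields.DGS10.expanding().mean()
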
If you want monthly averages at the end of each month only, how do you get that?

Autocorrelation¶

In [158]:
yields.DGS10.autocorr(1)
Out[158]:
np.float64(0.9989251861649782)
In [159]:
yields.DGS10.resample("ME").last().autocorr()
Out[159]:
np.float64(0.9795807592048992)
In [160]:
yields.DGS10.resample("ME").last().diff().autocorr()
Out[160]:
np.float64(0.17386711059598828)

Panel data¶

CRSP¶

In [161]:
import numpy as np
import pandas as pd
import wrds       
conn = wrds.Connection() 
WRDS recommends setting up a .pgpass file.
You can create this file yourself at any time with the create_pgpass_file() function.
Loading library list...
Done
In [198]:
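# monthly returns and market equity for NYSE/AMEX/Nasdaq common stocks
# (exchcd in 1,2,3 and shrcd in 10,11), using the names history for valid date ranges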
crsp = conn.raw_sql(
    """
    select a.permno, a.permco, a.date, a.ret, abs(a.prc)*a.shrout as me, b.exchcd
    from crsp.msf a inner join crsp.msenames b
    on a.permno=b.permno and a.date between b.namedt and b.nameendt
    and b.exchcd in (1,2,3) and b.shrcd in (10,11)
    where a.date >= '2020-01-01'
    order by a.permno, a.date
    """,
    date_cols=['date']
)

# change strings or floats to integers
for col in ['permno','permco']:
    crsp[col] = crsp[col].astype(int)

# define market equity as sum of market equities of all permnos associated with a permco
crsp['me'] = crsp.groupby(['date','permco'], group_keys=False).me.transform("sum")

crsp.drop(columns=["permco"], inplace=True)
crsp.set_index(["permno","date"], inplace=True)
crsp.head()
Out[198]:
ret me exchcd
permno date
10026 2020-01-31 -0.100016 3.137527e+06 3
2020-02-28 -0.030270 3.042554e+06 3
2020-03-31 -0.244031 2.285448e+06 3
2020-04-30 0.049835 2.399343e+06 3
2020-05-29 0.012596 2.429563e+06 3

MultiIndexes¶

In [199]:
crsp.index
Out[199]:
MultiIndex([(10026, '2020-01-31'),
            (10026, '2020-02-28'),
            (10026, '2020-03-31'),
            (10026, '2020-04-30'),
            (10026, '2020-05-29'),
            (10026, '2020-06-30'),
            (10026, '2020-07-31'),
            (10026, '2020-08-31'),
            (10026, '2020-09-30'),
            (10026, '2020-10-30'),
            ...
            (93436, '2023-03-31'),
            (93436, '2023-04-28'),
            (93436, '2023-05-31'),
            (93436, '2023-06-30'),
            (93436, '2023-07-31'),
            (93436, '2023-08-31'),
            (93436, '2023-09-29'),
            (93436, '2023-10-31'),
            (93436, '2023-11-30'),
            (93436, '2023-12-29')],
           names=['permno', 'date'], length=195627)
In [200]:
# extract a cross section with .loc 

crsp.loc[10026].head()
Out[200]:
ret me exchcd
date
2020-01-31 -0.100016 3.137527e+06 3
2020-02-28 -0.030270 3.042554e+06 3
2020-03-31 -0.244031 2.285448e+06 3
2020-04-30 0.049835 2.399343e+06 3
2020-05-29 0.012596 2.429563e+06 3
In [201]:
# extract a cross section with .xs 

crsp.xs(10026, level="permno").head()
Out[201]:
ret me exchcd
date
2020-01-31 -0.100016 3.137527e+06 3
2020-02-28 -0.030270 3.042554e+06 3
2020-03-31 -0.244031 2.285448e+06 3
2020-04-30 0.049835 2.399343e+06 3
2020-05-29 0.012596 2.429563e+06 3
In [202]:
# swap levels

crsp.index = crsp.index.swaplevel() 
crsp = crsp.sort_index()
crsp.head()
Out[202]:
ret me exchcd
date permno
2020-01-31 10026 -0.100016 3137526.96 3
10028 0.607407 58425.08 2
10032 -0.075643 2078268.64 3
10044 -0.098592 49920.00 3
10051 -0.115176 912167.34 1
In [203]:
# unstack 

crsp.ret.unstack().head()
Out[203]:
permno 10026 10028 10032 10044 10051 10066 10104 10107 10138 10145 ... 93384 93393 93397 93401 93420 93422 93423 93426 93434 93436
date
2020-01-31 -0.100016 0.607407 -0.075643 -0.098592 -0.115176 NaN -0.005474 0.079455 0.095946 -0.021356 ... 0.082519 0.120582 0.015081 -0.076132 -0.309816 -0.295556 -0.154733 0.015882 0.023810 0.555160
2020-02-28 -0.030270 0.225806 -0.067070 -0.064904 -0.055669 NaN -0.057007 -0.045292 -0.116229 -0.058596 ... -0.258776 -0.352505 -0.144467 0.328508 -0.273333 -0.290221 -0.337005 -0.204690 0.372093 0.026777
2020-03-31 -0.244031 -0.052632 -0.177694 -0.373368 -0.324664 NaN -0.022847 -0.026542 -0.164901 -0.175002 ... -0.654939 0.180516 -0.215569 -0.275775 -0.785933 -0.842444 -0.494066 -0.269021 -0.305085 -0.215557
2020-04-30 0.049835 0.448413 0.149010 -0.052083 0.178434 NaN 0.100972 0.136326 0.184127 0.060617 ... NaN 2.036408 0.024809 0.081019 1.011143 1.947683 0.595694 0.151394 -0.073171 0.492137
2020-05-29 0.012596 -0.016438 0.024406 -0.057143 0.000000 NaN 0.015103 0.025389 0.045576 0.034179 ... NaN -0.241407 0.000749 0.059957 -0.345646 -0.134280 0.148426 0.021626 0.121053 0.067939

5 rows × 5113 columns

Time series stuff for a panel¶

  • do everything grouping by permno
  • group_keys=False means don't keep the permno as an additional level of the index. This lets us assign the results back into the dataframe with its existing (date, permno) index.
  • examples:
    • changes and lags
    • custom functions with apply
    • rolling windows
In [204]:
# changes 

crsp["me_pct_chg"] = crsp.groupby("permno", group_keys=False).me.pct_change()
crsp.head()
Out[204]:
ret me exchcd me_pct_chg
date permno
2020-01-31 10026 -0.100016 3137526.96 3 NaN
10028 0.607407 58425.08 2 NaN
10032 -0.075643 2078268.64 3 NaN
10044 -0.098592 49920.00 3 NaN
10051 -0.115176 912167.34 1 NaN
In [205]:
# shifts 

crsp["me_lag"] = crsp.groupby("permno", group_keys=False).me.shift(1)
crsp.head()
Out[205]:
ret me exchcd me_pct_chg me_lag
date permno
2020-01-31 10026 -0.100016 3137526.96 3 NaN NaN
10028 0.607407 58425.08 2 NaN NaN
10032 -0.075643 2078268.64 3 NaN NaN
10044 -0.098592 49920.00 3 NaN NaN
10051 -0.115176 912167.34 1 NaN NaN
In [206]:
# custom functions 

crsp["accum"] = crsp.groupby("permno", group_keys=False).apply(
    lambda x: (1+x.ret).cumprod()
)
crsp.head()
Out[206]:
ret me exchcd me_pct_chg me_lag accum
date permno
2020-01-31 10026 -0.100016 3137526.96 3 NaN NaN 0.899984
10028 0.607407 58425.08 2 NaN NaN 1.607407
10032 -0.075643 2078268.64 3 NaN NaN 0.924357
10044 -0.098592 49920.00 3 NaN NaN 0.901408
10051 -0.115176 912167.34 1 NaN NaN 0.884824
In [210]:
# another custom function 
# mom = return over first 11 of prior 12 months
# if this is September, shift(13) is end of August a year ago 
# shift(2) is end of July this year

crsp["mom"] = crsp.groupby("permno", group_keys=False).accum.apply(
    lambda x: x.shift(2)/x.shift(13)-1
)
crsp.dropna().head()
Out[210]:
ret me exchcd me_pct_chg me_lag accum mom mom2
date permno
2021-02-26 10026 0.039958 3.013265e+06 3 0.039958 2897486.80 0.876640 -0.046738 0.186441
10028 -0.161342 1.413562e+05 2 -0.161342 168550.50 3.888887 1.396313 4.939689
10032 0.091784 2.420220e+06 3 0.093795 2212680.72 1.091500 0.099690 0.328714
10044 0.218894 3.239596e+04 3 0.228928 26361.16 0.582110 -0.505595 1.083334
10051 0.072718 8.376358e+05 1 0.072718 780853.41 0.796088 -0.099877 0.539223

Exercise¶

Grab the first 13 months of returns for permno 10026 and verify the momentum calculation.

Rolling windows for panels¶

In [220]:
# momentum with a rolling window 

crsp["mom2"] = crsp.groupby("permno", group_keys=False).ret.apply(
    lambda x: x.rolling(11).apply(lambda y: 
        (1+y).prod() - 1 
    )
)
crsp["mom2"] = crsp.groupby("permno", group_keys=False).mom2.shift(2)
In [221]:
# verify that we get the same result both ways 

crsp[["mom", "mom2"]].dropna().head()
Out[221]:
mom mom2
date permno
2021-02-26 10026 -0.046738 -0.046738
10028 1.396313 1.396313
10032 0.099690 0.099690
10044 -0.505595 -0.505595
10051 -0.099877 -0.099877
In [222]:
# better verification 

(crsp.mom - crsp.mom2).abs().max()
Out[222]:
np.float64(2.842170943040401e-14)

Cross-sectional operations on a panel¶

In [233]:
crsp.groupby("date").me.median()
Out[233]:
date
2020-01-31    716361.5500
2020-02-28    659297.4500
2020-03-31    471096.1800
2020-04-30    563783.3600
2020-05-29    604435.6500
2020-06-30    638028.0500
2020-07-31    642562.6000
2020-08-31    685895.9000
2020-09-30    660193.0350
2020-10-30    667150.5700
2020-11-30    804351.6000
2020-12-31    876951.1800
2021-01-29    930408.6450
2021-02-26    997188.4950
2021-03-31    961598.2750
2021-04-30    907535.9375
2021-05-28    891399.5550
2021-06-30    937154.1600
2021-07-30    885187.9600
2021-08-31    885851.0800
2021-09-30    872242.5700
2021-10-29    880532.6500
2021-11-30    785550.5500
2021-12-31    784997.3200
2022-01-31    669696.6600
2022-02-28    653404.2000
2022-03-31    649834.1800
2022-04-29    570975.0000
2022-05-31    544259.1500
2022-06-30    498053.8900
2022-07-29    532845.6300
2022-08-31    527356.0600
2022-09-30    474613.3400
2022-10-31    506608.4000
2022-11-30    502639.1500
2022-12-30    492462.0800
2023-01-31    549974.4250
2023-02-28    530868.8700
2023-03-31    502809.1600
2023-04-28    478404.0900
2023-05-31    484659.7750
2023-06-30    523683.1950
2023-07-31    565771.7600
2023-08-31    529945.0850
2023-09-29    497064.7150
2023-10-31    459637.3000
2023-11-30    505252.5500
2023-12-29    601071.3800
Name: me, dtype: float64
In [234]:
crsp["rnk"] = crsp.groupby("date", group_keys=False).me.rank(pct=True)
crsp[["me", "rnk"]].head()
Out[234]:
me rnk
date permno
2020-01-31 10026 3137526.96 0.735097
10028 58425.08 0.143175
10032 2078268.64 0.668802
10044 49920.00 0.127577
10051 912167.34 0.538162

Exercise¶

Find the 0.2, 0.4, 0.6, and 0.8 quantiles of me each month using the subset of firms for which exchcd=1.

Compustat¶

In [246]:
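# Compustat annual fundamentals: total assets (at); the indfmt/datafmt/popsrc/consol
# conditions are the standard screen for one record per firm-year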
comp = conn.raw_sql(
    """
    select gvkey, datadate as date, at
    from comp.funda
    where datadate >= '2020-01-01' and at>0 
    and indfmt='INDL' and datafmt='STD' and popsrc='D' and consol='C'
    order by gvkey, datadate
    """, 
    date_cols=['date']
)

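# CRSP/Compustat link table: maps gvkey to permno over link validity windows,
# keeping standard (LU, LC) and primary (P, C) links only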
link = conn.raw_sql(
    """
    select distinct gvkey, lpermno as permno, linkdt, linkenddt
    from crsp.Ccmxpf_linktable
    where linktype in ('LU', 'LC')
    and LINKPRIM in ('P', 'C')
    """
)

# convert strings or floats to ints
comp['gvkey'] = comp.gvkey.astype(int)
link['gvkey'] = link.gvkey.astype(int)
link['permno'] = link.permno.astype(int)

# fill in missing end dates with a future date
link['linkenddt'] = pd.to_datetime(
    link.linkenddt
).fillna(pd.Timestamp('21000101'))

# merge with Compustat data and keep rows with Compustat datadate between link date and link end date
comp = comp.merge(link, on='gvkey', how='inner')
comp = comp[
    (comp.date>=comp.linkdt) & (comp.date<=comp.linkenddt)
]

comp = comp.drop(columns=['gvkey', 'linkdt', 'linkenddt'])

Merge CRSP with Compustat¶

  • Change dates to monthly period format before merging, because Compustat date is the last day of the month, and CRSP date is the last trading day of the month.
  • Merge keeping all rows of CRSP data. There will be NaNs for Compustat data for 11 months each year.
In [235]:
crsp.reset_index(inplace=True) 
crsp.head()
Out[235]:
date permno ret me exchcd me_pct_chg me_lag accum mom mom2 rnk
0 2020-01-31 10026 -0.100016 3137526.96 3 NaN NaN 0.899984 NaN NaN 0.735097
1 2020-01-31 10028 0.607407 58425.08 2 NaN NaN 1.607407 NaN NaN 0.143175
2 2020-01-31 10032 -0.075643 2078268.64 3 NaN NaN 0.924357 NaN NaN 0.668802
3 2020-01-31 10044 -0.098592 49920.00 3 NaN NaN 0.901408 NaN NaN 0.127577
4 2020-01-31 10051 -0.115176 912167.34 1 NaN NaN 0.884824 NaN NaN 0.538162
In [252]:
crsp["date"] = crsp.date.dt.to_period('M')
comp["date"] = comp.date.dt.to_period('M')

df = crsp.merge(comp, on=['permno', 'date'], how='left')
df.set_index(["date", "permno"], inplace=True)
df.head()
Out[252]:
ret me exchcd me_pct_chg me_lag accum mom mom2 rnk at
date permno
2020-01 10026 -0.100016 3137526.96 3 NaN NaN 0.899984 NaN NaN 0.735097 NaN
10028 0.607407 58425.08 2 NaN NaN 1.607407 NaN NaN 0.143175 NaN
10032 -0.075643 2078268.64 3 NaN NaN 0.924357 NaN NaN 0.668802 NaN
10044 -0.098592 49920.00 3 NaN NaN 0.901408 NaN NaN 0.127577 NaN
10051 -0.115176 912167.34 1 NaN NaN 0.884824 NaN NaN 0.538162 NaN

Fill Compustat data into months¶

  • Group by permno when filling forward so we don't fill from one stock into another
  • Can add a limit on ffill so we don't fill in missing years of data (a sketch follows the next cell)
In [258]:
df['at'] = df.groupby('permno', group_keys=False)['at'].ffill()
df.xs(10026, level="permno")["at"].dropna().head(15)
Out[258]:
date
2020-09    1056.553
2020-10    1056.553
2020-11    1056.553
2020-12    1056.553
2021-01    1056.553
2021-02    1056.553
2021-03    1056.553
2021-04    1056.553
2021-05    1056.553
2021-06    1056.553
2021-07    1056.553
2021-08    1056.553
2021-09    1122.219
2021-10    1122.219
2021-11    1122.219
Freq: M, Name: at, dtype: float64
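The limit mentioned in the bullet above caps how far a value is carried; a sketch (not applied here):

# stop carrying assets forward after 12 months without a new report
at_limited = df.groupby('permno', group_keys=False)['at'].ffill(limit=12)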
In [259]:
# shift 6 months to ensure data is public

df["at"] = df.groupby("permno")["at"].shift(6)
df.xs(10026, level="permno")["at"].dropna().head(15)
Out[259]:
date
2021-03    1056.553
2021-04    1056.553
2021-05    1056.553
2021-06    1056.553
2021-07    1056.553
2021-08    1056.553
2021-09    1056.553
2021-10    1056.553
2021-11    1056.553
2021-12    1056.553
2022-01    1056.553
2022-02    1056.553
2022-03    1122.219
2022-04    1122.219
2022-05    1122.219
Freq: M, Name: at, dtype: float64