Today's plan¶
- Other ways to run python (scripts) and using python on jgsrc1
- Linear regression
1. Scripts, environments, and jgsrc1¶
Python scripts¶
- Scripts are text files with python code. We can create a script with any text editor.
- In VS Code, we can select File/New File/Python File.
- We can also convert a notebook to a script in VS Code (more later).
- Scripts are usually saved with the file extension .py.
- We can run a script in VS Code (click run button at top) or at command line with
python scriptname.py
- Print statements will print to the console (terminal). We can save files, figures, etc. as in a notebook.
Why scripts?¶
- For me, notebooks are for exploring. I move around in the notebook, execute cells in the wrong order, add throwaway code just to experiment or check something, and then may forget to delete it.
- If I want to save a record of what I did, to go back to when I next revise a paper, I save the code as a script. Scripts execute in the order they were written, so they are easier to follow and a better record of what was done.
- Also, scripts are very useful if you want to run code that will take a long time in the background on a remote machine (like jgsrc1).
Example¶
- Select File/New File/Python file in VS Code.
- Type the following code into the file.
import pandas as pd
df = pd.DataFrame(
{"a": [1, 2, 3], "b": [4, 5, 6]}
)
df.to_csv("example.csv")
print(df)
- Click the "run" button at the top of the VS Code window.
Running at the command line¶
- Save the file as example.py.
- Open a terminal window and navigate to the directory where example.py is saved (use cd directoryname to move around).
- Type
python example.py
at the command line.
- If command line stuff is new to you, do yourself a favor and spend a few minutes googling "MS DOS shell commands."
Exporting notebooks to scripts¶
- After working in a notebook to write code, it is easy to export the notebook as a script.
- Type the above code into multiple cells in a notebook (just break it up arbitrarily - for illustration).
- Do CTRL-SHFT-P to open the command palette and find "Jupyter: Export to Python Script."
- The comments inserted between what were different cells in the notebook do not affect execution, but we can delete them to clean up the script.
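For example, exporting the notebook version of the code above produces something like this (the # %% lines are the cell markers VS Code inserts; they are plain comments, so the script runs the same with or without them):
# %%
import pandas as pd

# %%
df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6]}
)

# %%
df.to_csv("example.csv")
print(df)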
Running code on jgsrc1¶
- Any SSH client: bitvise, remote desktop, etc.
- jupyterhub.rice.edu in a browser
- VS Code
We can run either notebooks or scripts with either jupyterhub or VS Code.
Using VS Code on jgsrc1¶
- Go to the VS Code marketplace and install the Remote - SSH extension from Microsoft
- Do CTRL-SHFT-P to open the command palette and type "Remote-SSH: Add New SSH Host" and enter username@jgsrc1.rice.edu.
- Select File/New Window in VS Code. This isn't necessary but it will allow you to work in two places at the same time.
- For the remainder, you should be on Rice Owls not Rice Visitor.
- Do CTRL-SHFT-P to open the command palette in the new window and type "Remote-SSH: Connect to Host" and select jgsrc1.
- Enter your password. You should be on the server in your home folder. You can now work on the server just as on your own machine (though you have to enter your password again each time you execute "Open Folder").
Virtual environments¶
- Everyone shares the same environment on jupyterhub. So, you cannot control what packages or what versions of packages you have.
- For using notebooks in VS Code or for running scripts, you can customize your python environment.
- In VS Code on jgsrc1, do CTRL-SHFT-P and type "Python: Select Interpreter" and select "Create Virtual Environment"
- You should end up with a new folder in your home directory called .venv.
- You could do the same thing in a terminal with
python -m venv .venv
You need to activate the environment to install packages in it. Open a terminal pane in VS Code on the server. Execute the following.
source .venv/bin/activate
pip install pandas matplotlib jupyter
You need to activate the environment again each time you want to install packages (e.g., in a new terminal session).
Do CTRL-SHFT-P and type "Python: Select Interpreter" and select the one in .venv. You're now running code in your new environment.
- You can also create new environments on a Windows computer.
- The only change is that you use
.venv\Scripts\activate
instead of
source .venv/bin/activate
to activate the environment.
Using tmux on linux¶
- To run things on jgsrc1 that will keep running if you lose your connection (e.g., you close your laptop), run scripts with tmux.
- You can do this from a VS Code connection or using any ssh client (e.g., bitvise) or a terminal window opened from jupyterhub.
- At the command line type
tmux new -s mysessionname
to start a new session.
- Activate your python environment within the tmux session (if you want) with
source .venv/bin/activate
- Start your script running with
python myscript.py
- Do CTRL-b and then d to detach from the session.
- Do
tmux attach -t mysessionname
to reattach to the session.
- You can have multiple sessions running simultaneously with different names.
- Google 'tmux cheat sheet' for more details.
2. Regression¶
- statsmodels implements OLS with HAC standard errors (Newey-West, etc.)
- statsmodels has two flavors of OLS: formula-based (like R) and array-based.
- linearmodels is a possible Stata replacement with fixed effects, clustered standard errors, 2SLS, SUR, etc.
- use statsmodels for repeated regressions and use linearmodels for more complex models.
- pystout saves regression output to latex (a python version of Stata's estout).
- scikit-learn implements OLS, LASSO, and Ridge.
Lecture plan¶
- We'll run the two flavors of statsmodels.
- We'll look at an example of linearmodels and pystout. You should check the linearmodels docs for more.
- We'll code Fama-MacBeth. linearmodels includes a Fama-MacBeth function but it is not as flexible as we might like (as far as I can tell).
- We'll code rolling-window beta estimates.
- LASSO and Ridge will be next week.
In [83]:
import numpy as np
import pandas as pd
df = pd.read_stata("WAGE1.DTA")
df.columns
Out[83]:
Index(['wage', 'educ', 'exper', 'tenure', 'nonwhite', 'female', 'married', 'numdep', 'smsa', 'northcen', 'south', 'west', 'construc', 'ndurman', 'trcommpu', 'trade', 'services', 'profserv', 'profocc', 'clerocc', 'servocc', 'lwage', 'expersq', 'tenursq'], dtype='object')
Old-fashioned statsmodels¶
- pass X and y arrays
- must add column of 1's to X to get an intercept
- define model, fit it, then get results with .summary()
- there are other attributes to results, including .params, .pvalues, ...
In [84]:
import statsmodels.api as sm
# right-hand side variables
X = df[
[
"educ",
"exper",
"tenure",
"nonwhite",
"female",
"married"
]
]
# include an intercept
X = sm.add_constant(X)
# left-hand side variable
y = df.wage
# define the model
model = sm.OLS(exog=X, endog=y)
# fit the model
result = model.fit()
# look at results
result.summary()
Out[84]:
Dep. Variable: | wage | R-squared: | 0.368 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.361 |
Method: | Least Squares | F-statistic: | 50.41 |
Date: | Wed, 11 Sep 2024 | Prob (F-statistic): | 8.27e-49 |
Time: | 07:52:53 | Log-Likelihood: | -1312.3 |
No. Observations: | 526 | AIC: | 2639. |
Df Residuals: | 519 | BIC: | 2668. |
Df Model: | 6 | ||
Covariance Type: | nonrobust |
 | coef | std err | t | P>|t| | [0.025 | 0.975]
---|---|---|---|---|---|---|
const | -1.6022 | 0.731 | -2.192 | 0.029 | -3.038 | -0.166 |
educ | 0.5551 | 0.050 | 11.090 | 0.000 | 0.457 | 0.653 |
exper | 0.0187 | 0.012 | 1.557 | 0.120 | -0.005 | 0.042 |
tenure | 0.1388 | 0.021 | 6.562 | 0.000 | 0.097 | 0.180 |
nonwhite | -0.0658 | 0.427 | -0.154 | 0.877 | -0.904 | 0.772 |
female | -1.7424 | 0.267 | -6.530 | 0.000 | -2.267 | -1.218 |
married | 0.5566 | 0.287 | 1.941 | 0.053 | -0.007 | 1.120 |
Omnibus: | 188.311 | Durbin-Watson: | 1.795 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 728.036 |
Skew: | 1.611 | Prob(JB): | 8.11e-159 |
Kurtosis: | 7.779 | Cond. No. | 143. |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
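Besides .summary(), the results object exposes the numbers directly, which is handy for building custom tables. A quick sketch using standard statsmodels result attributes:
# pull pieces of the fitted results directly
print(result.params)    # coefficient estimates (a pandas Series)
print(result.bse)       # standard errors
print(result.tvalues)   # t-statistics
print(result.pvalues)   # p-values
print(result.rsquared)  # R-squared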
In [86]:
# if you want only the main table
result.summary().tables[1]
Out[86]:
 | coef | std err | t | P>|t| | [0.025 | 0.975]
---|---|---|---|---|---|---|
const | -1.6022 | 0.731 | -2.192 | 0.029 | -3.038 | -0.166 |
educ | 0.5551 | 0.050 | 11.090 | 0.000 | 0.457 | 0.653 |
exper | 0.0187 | 0.012 | 1.557 | 0.120 | -0.005 | 0.042 |
tenure | 0.1388 | 0.021 | 6.562 | 0.000 | 0.097 | 0.180 |
nonwhite | -0.0658 | 0.427 | -0.154 | 0.877 | -0.904 | 0.772 |
female | -1.7424 | 0.267 | -6.530 | 0.000 | -2.267 | -1.218 |
married | 0.5566 | 0.287 | 1.941 | 0.053 | -0.007 | 1.120 |
In [102]:
# getting the main table as a dataframe
# we can ignore the future warning (for now!)
table_as_html = result.summary().tables[1].as_html()
table_as_df = pd.read_html(table_as_html, header=0, index_col=0)[0]
table_as_df.round(3)
C:\Users\kerry\AppData\Local\Temp\ipykernel_22164\2205508342.py:4: FutureWarning: Passing literal html to 'read_html' is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.
  table_as_df = pd.read_html(table_as_html, header=0, index_col=0)[0]
Out[102]:
 | coef | std err | t | P>|t| | [0.025 | 0.975]
---|---|---|---|---|---|---|
const | -1.602 | 0.731 | -2.192 | 0.029 | -3.038 | -0.166 |
educ | 0.555 | 0.050 | 11.090 | 0.000 | 0.457 | 0.653 |
exper | 0.019 | 0.012 | 1.557 | 0.120 | -0.005 | 0.042 |
tenure | 0.139 | 0.021 | 6.562 | 0.000 | 0.097 | 0.180 |
nonwhite | -0.066 | 0.427 | -0.154 | 0.877 | -0.904 | 0.772 |
female | -1.742 | 0.267 | -6.530 | 0.000 | -2.267 | -1.218 |
married | 0.557 | 0.287 | 1.941 | 0.053 | -0.007 | 1.120 |
In [101]:
# save to excel (may need to pip install openpyxl)
table_as_df.to_excel("table.xlsx")
# save to latex
table_as_df.to_latex("table.tex")
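The FutureWarning above tells us the fix: wrap the literal HTML in a StringIO object before passing it to read_html.
from io import StringIO

table_as_df = pd.read_html(StringIO(table_as_html), header=0, index_col=0)[0]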
In [14]:
# standardize regressors
for col in ["educ", "exper", "tenure"]:
X[col] = X[col] / X[col].std()
sm.OLS(exog=X, endog=y).fit().summary().tables[1]
Out[14]:
 | coef | std err | t | P>|t| | [0.025 | 0.975]
---|---|---|---|---|---|---|
const | -1.6022 | 0.731 | -2.192 | 0.029 | -3.038 | -0.166 |
educ | 1.5371 | 0.139 | 11.090 | 0.000 | 1.265 | 1.809 |
exper | 0.2544 | 0.163 | 1.557 | 0.120 | -0.067 | 0.575 |
tenure | 1.0030 | 0.153 | 6.562 | 0.000 | 0.703 | 1.303 |
nonwhite | -0.0658 | 0.427 | -0.154 | 0.877 | -0.904 | 0.772 |
female | -1.7424 | 0.267 | -6.530 | 0.000 | -2.267 | -1.218 |
married | 0.5566 | 0.287 | 1.941 | 0.053 | -0.007 | 1.120 |
In [ ]:
# more compactly,
X[["educ", "exper", "tenure"]] /= X[["educ", "exper", "tenure"]].std()
R-like statsmodels¶
- R-like formula, pass dataframe containing the data to the function
- Can include transformations of variables within the formula, including interactions
- Includes intercept by default
- If you don't want intercept use -1 in the formula
- Do model.fit().summary() to get the summary or .params or whatever like before.
- Convert to dataframe and save to excel or latex as before.
In [24]:
import statsmodels.formula.api as smf
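# careful: tenure**2 in a formula does not square tenure; patsy reduces
# it to plain tenure (see the output below). Use I(tenure**2) to include
# the square.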
model = smf.ols(
"""
wage ~ educ + np.log(exper) + tenure**2 + nonwhite
+ female + married + educ*female - 1
""",
data=df
)
result = model.fit()
result.summary().tables[1]
Out[24]:
 | coef | std err | t | P>|t| | [0.025 | 0.975]
---|---|---|---|---|---|---|
educ | 0.4237 | 0.024 | 17.734 | 0.000 | 0.377 | 0.471 |
np.log(exper) | 0.3510 | 0.139 | 2.521 | 0.012 | 0.077 | 0.625 |
tenure | 0.1247 | 0.021 | 6.072 | 0.000 | 0.084 | 0.165 |
nonwhite | -0.2982 | 0.424 | -0.703 | 0.482 | -1.131 | 0.535 |
female | -2.6244 | 1.015 | -2.585 | 0.010 | -4.619 | -0.630 |
married | 0.2517 | 0.303 | 0.830 | 0.407 | -0.344 | 0.848 |
educ:female | 0.0503 | 0.081 | 0.620 | 0.536 | -0.109 | 0.210 |
In [20]:
# the covariance estimator to use when computing std errors is passed
# as an argument to .fit() in both statsmodels flavors
help(model.fit)
Help on method fit in module statsmodels.regression.linear_model:

fit(method: "Literal['pinv', 'qr']" = 'pinv', cov_type: "Literal['nonrobust', 'fixed scale', 'HC0', 'HC1', 'HC2', 'HC3', 'HAC', 'hac-panel', 'hac-groupsum', 'cluster']" = 'nonrobust', cov_kwds=None, use_t: 'bool | None' = None, **kwargs) method of statsmodels.regression.linear_model.OLS instance
    Full fit of the model.

    The results include an estimate of covariance matrix, (whitened) residuals and an estimate of scale.

    Parameters
    ----------
    method : str, optional
        Can be "pinv", "qr". "pinv" uses the Moore-Penrose pseudoinverse to solve the least squares problem. "qr" uses the QR factorization.
    cov_type : str, optional
        See `regression.linear_model.RegressionResults` for a description of the available covariance estimators.
    cov_kwds : list or None, optional
        See `linear_model.RegressionResults.get_robustcov_results` for a description required keywords for alternative covariance estimators.
    use_t : bool, optional
        Flag indicating to use the Student's t distribution when computing p-values. Default behavior depends on cov_type. See `linear_model.RegressionResults.get_robustcov_results` for implementation details.
    **kwargs
        Additional keyword arguments that contain information used when constructing a model using the formula interface.

    Returns
    -------
    RegressionResults
        The model estimation results.

    See Also
    --------
    RegressionResults
        The results container.
    RegressionResults.get_robustcov_results
        A method to change the covariance estimator used when fitting the model.

    Notes
    -----
    The fit method uses the pseudoinverse of the design/exogenous variables to solve the least squares minimization.
Newey-West is HAC with Bartlett kernel.
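For example, Newey-West standard errors with 12 lags for the model above (the lag count is a choice, and result_nw is just an illustrative name):
# Newey-West = HAC with the default Bartlett kernel
result_nw = model.fit(cov_type="HAC", cov_kwds={"maxlags": 12})
result_nw.summary().tables[1]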
Logit¶
You can also run logit with either old-fashioned or R-like statsmodels.
In [23]:
smf.logit("female ~ wage", df).fit().summary().tables[1]
Optimization terminated successfully.
         Current function value: 0.622458
         Iterations 6
Out[23]:
 | coef | std err | z | P>|z| | [0.025 | 0.975]
---|---|---|---|---|---|---|
Intercept | 1.3893 | 0.215 | 6.458 | 0.000 | 0.968 | 1.811 |
wage | -0.2654 | 0.037 | -7.096 | 0.000 | -0.339 | -0.192 |
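The logit results object works like the OLS one. For example, fitted probabilities come from .predict and average marginal effects from .get_margeff (both are statsmodels methods; a sketch):
logit_result = smf.logit("female ~ wage", df).fit()
print(logit_result.predict().head())         # fitted probabilities
print(logit_result.get_margeff().summary())  # average marginal effects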
Linearmodels¶
- created by Kevin Sheppard
- docs are very good
- has R-like formula option
- define model, fit, and then get summary like before - but .summary instead of .summary()
- pass covariance matrix type to fit method as before
In [38]:
from linearmodels.panel import PanelOLS
from linearmodels.datasets import wage_panel
data = wage_panel.load()
data = data.set_index(["nr", "year"])
data.head()
Out[38]:
 | | black | exper | hisp | hours | married | educ | union | lwage | expersq | occupation
---|---|---|---|---|---|---|---|---|---|---|---|
nr | year | | | | | | | | | | |
13 | 1980 | 0 | 1 | 0 | 2672 | 0 | 14 | 0 | 1.197540 | 1 | 9 |
 | 1981 | 0 | 2 | 0 | 2320 | 0 | 14 | 1 | 1.853060 | 4 | 9 |
 | 1982 | 0 | 3 | 0 | 2940 | 0 | 14 | 0 | 1.344462 | 9 | 9 |
 | 1983 | 0 | 4 | 0 | 2960 | 0 | 14 | 0 | 1.433213 | 16 | 9 |
 | 1984 | 0 | 5 | 0 | 3071 | 0 | 14 | 0 | 1.568125 | 25 | 5 |
Indexing for linearmodels data¶
- The indexing is crucial for fixed effects and clustered standard errors to work.
- The outside index should be the entity (firm, person, ...)
- The inside index should be a time index.
In [41]:
model = PanelOLS.from_formula(
"lwage ~ exper + educ + black + hisp + married + union + TimeEffects",
data=data,
)
result = model.fit(cov_type="clustered", cluster_time=True, cluster_entity=True)
result.summary
Out[41]:
Dep. Variable: | lwage | R-squared: | 0.1216 |
---|---|---|---|
Estimator: | PanelOLS | R-squared (Between): | 0.9312 |
No. Observations: | 4360 | R-squared (Within): | 0.1305 |
Date: | Tue, Sep 10 2024 | R-squared (Overall): | 0.8962 |
Time: | 16:23:01 | Log-likelihood | -2986.3 |
Cov. Estimator: | Clustered | ||
F-statistic: | 100.30 | ||
Entities: | 545 | P-value | 0.0000 |
Avg Obs: | 8.0000 | Distribution: | F(6,4346) |
Min Obs: | 8.0000 | ||
Max Obs: | 8.0000 | F-statistic (robust): | 36.580 |
P-value | 0.0000 | ||
Time periods: | 8 | Distribution: | F(6,4346) |
Avg Obs: | 545.00 | ||
Min Obs: | 545.00 | ||
Max Obs: | 545.00 | ||
 | Parameter | Std. Err. | T-stat | P-value | Lower CI | Upper CI
---|---|---|---|---|---|---|
exper | 0.0303 | 0.0133 | 2.2750 | 0.0230 | 0.0042 | 0.0565 |
educ | 0.0929 | 0.0100 | 9.3247 | 0.0000 | 0.0733 | 0.1124 |
black | -0.1373 | 0.0478 | -2.8760 | 0.0040 | -0.2310 | -0.0437 |
hisp | 0.0137 | 0.0350 | 0.3906 | 0.6961 | -0.0550 | 0.0823 |
married | 0.1110 | 0.0219 | 5.0604 | 0.0000 | 0.0680 | 0.1540 |
union | 0.1863 | 0.0269 | 6.9147 | 0.0000 | 0.1335 | 0.2392 |
F-test for Poolability: 3.1404
P-value: 0.0026
Distribution: F(7,4346)
Included effects: Time
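Entity (person) fixed effects work the same way. A sketch that drops the time-invariant regressors (black, hisp, educ), which entity effects absorb, and exper, which entity plus time effects absorb in this balanced panel:
# person and year fixed effects, clustered by person
model_fe = PanelOLS.from_formula(
    "lwage ~ expersq + married + union + EntityEffects + TimeEffects",
    data=data,
)
result_fe = model_fe.fit(cov_type="clustered", cluster_entity=True)
result_fe.summary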
pystout¶
- works with statsmodels or linearmodels
- pass fitted models
- specify the filename to save to and various formatting options (significance stars, etc.) in the pystout function call
In [50]:
# example regressions
model1 = smf.ols("wage ~ female", data=df)
result1 = model1.fit()
model2 = smf.ols("wage ~ female + educ + female*educ", data=df)
result2 = model2.fit()
In [44]:
from pystout import pystout
pystout(
models=[result1, result2],
file="regressions.tex",
endog_names = ["wage", "wage"],
exogvars=[
'female',
'educ',
'female:educ',
],
stars={0.1: "*", 0.05: "**", 0.01: "***"},
addnotes=["$^*p<0.1$, $^{**}p<0.05$, $^{***}p<0.01$"],
modstat={"nobs": "Obs", "rsquared_adj": "Adj $R^2$"},
title="Wage Equation",
label="tab:wage"
)
Fama-MacBeth¶
Run cross-sectional regressions and then use $t$ tests for the means of the time series of cross-sectional coefficients.
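In symbols: in each month $t$, estimate the cross-sectional regression

$$r_{it} = \gamma_{0t} + \gamma_{1t}\,\text{acc}_{it} + \gamma_{2t}\,\text{agr}_{it} + e_{it},$$

and then test whether the time-series means $\bar{\gamma}_j = \frac{1}{T}\sum_{t=1}^{T} \hat{\gamma}_{jt}$ are zero.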
In [77]:
# A small data set with acc=accruals and agr=asset growth,
# monthly data since 2010, roughly 2,000 stocks per month.
data = pd.read_csv(
"https://www.dropbox.com/s/012c6y4gxsxss6y/ghz.csv?dl=1",
parse_dates=["date"]
)
# change dates to period format (to merge with FF factors)
data["date"] = data.date.dt.to_period("M")
data = data.set_index(["permno", "date"])
data = data.sort_index()
data.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 282598 entries, (np.float64(10025.0), Period('2012-10', 'M')) to (np.float64(93436.0), Period('2021-12', 'M'))
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   ret     282598 non-null  float64
 1   acc     282598 non-null  float64
 2   agr     282598 non-null  float64
dtypes: float64(3)
memory usage: 7.7 MB
In [67]:
data.head()
Out[67]:
 | | ret | acc | agr
---|---|---|---|---|
permno | date | | | |
10025.0 | 2012-10 | 0.055125 | -0.028995 | 0.184931 |
 | 2012-11 | -0.053652 | -0.028995 | 0.184931 |
 | 2012-12 | -0.020992 | -0.028995 | 0.184931 |
 | 2013-01 | 0.086949 | -0.028995 | 0.184931 |
 | 2013-02 | 0.095527 | -0.028995 | 0.184931 |
In [68]:
# cross-sectional regression function
def xreg(df):
model = smf.ols("ret ~ acc + agr", data=df)
result = model.fit()
return result.params
# apply to each cross-section to get time series of coefficients
regressions = data.groupby('date').apply(xreg)
regressions.head()
Out[68]:
 | Intercept | acc | agr
---|---|---|---|
date | |||
2010-01 | -0.029367 | 0.021650 | -0.011795 |
2010-02 | 0.040089 | -0.052454 | -0.003666 |
2010-03 | 0.081821 | 0.037491 | -0.024873 |
2010-04 | 0.042478 | -0.090209 | -0.009522 |
2010-05 | -0.074971 | 0.008434 | 0.005427 |
In [69]:
# We want to run t-tests on the means (the null is that they are zero).
# We probably want to use Newey-West standard errors
# One way to do that is to regress on 1 (the coefficient is the mean) and use
# Newey-West standard errors in statsmodels OLS.
for col in ["acc", "agr"]:
model = smf.ols(f"{col} ~ 1", data=regressions)
result = model.fit(cov_type='HAC', cov_kwds={"kernel": "bartlett", "maxlags": 12})
print(col, result.tvalues.item())
acc 0.43832436517863066
agr -1.5746161411946056
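As a cross-check, the built-in estimator in linearmodels should produce similar means. A sketch, assuming linearmodels accepts the monthly PeriodIndex (if not, convert the dates with .to_timestamp() first); note that linearmodels formulas need an explicit 1 for the intercept:
from linearmodels import FamaMacBeth

fm = FamaMacBeth.from_formula("ret ~ 1 + acc + agr", data=data)
# the default standard errors are not Newey-West; see the docs for kernel options
print(fm.fit().summary)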
Rolling Window Betas¶
- We'll run time-series regressions for each stock over rolling windows.
- It is common to use a 60-month window but to include all stock/months with at least 24 of the prior 60 months available. We do that with window=60, min_nobs=24, and expanding=True.
- The RollingOLS function crashes if you specify a window size that is larger than the number of rows in the data frame.
- So, we construct a function to "pass" if the number of rows is less than 24 and specify the window size as the smaller of 60 and the number of rows.
In [78]:
# merge Fama-French factors (we want Mkt-RF and RF)
# this pastes the factors for each month beside each stock for that month
from pandas_datareader import DataReader as pdr
ff = pdr("F-F_Research_Data_Factors", "famafrench", start=2010)[0] / 100
data_ff = data.merge(ff, left_on='date', right_index=True, how="left")
# compute excess returns
data_ff["ret_RF"] = data_ff.ret - data_ff.RF
# we can't put Mkt-RF in the ols formula so we change to an underscore
data_ff = data_ff.rename(columns={"Mkt-RF": "Mkt_RF"})
data_ff.head()
C:\Users\kerry\AppData\Local\Temp\ipykernel_22164\4004107026.py:6: FutureWarning: The argument 'date_parser' is deprecated and will be removed in a future version. Please use 'date_format' instead, or read your data in as 'object' dtype and then call 'to_datetime'.
  ff = pdr("F-F_Research_Data_Factors", "famafrench", start=2010)[0] / 100
Out[78]:
 | | ret | acc | agr | Mkt_RF | SMB | HML | RF | ret_RF
---|---|---|---|---|---|---|---|---|---|
permno | date | | | | | | | | |
10025.0 | 2012-10 | 0.055125 | -0.028995 | 0.184931 | -0.0176 | -0.0115 | 0.0359 | 0.0001 | 0.055025 |
 | 2012-11 | -0.053652 | -0.028995 | 0.184931 | 0.0078 | 0.0063 | -0.0084 | 0.0001 | -0.053752 |
 | 2012-12 | -0.020992 | -0.028995 | 0.184931 | 0.0118 | 0.0148 | 0.0351 | 0.0001 | -0.021092 |
 | 2013-01 | 0.086949 | -0.028995 | 0.184931 | 0.0557 | 0.0034 | 0.0096 | 0.0000 | 0.086949 |
 | 2013-02 | 0.095527 | -0.028995 | 0.184931 | 0.0129 | -0.0027 | 0.0011 | 0.0000 | 0.095527 |
In [81]:
from statsmodels.regression.rolling import RollingOLS
def rolling_betas(df):
# we'll pass the time series for each stock to this function
# we first check how many months the stock is in the data
n = df.shape[0]
# if there are at least 24 months, we proceed
if n >= 24:
# data = df.set_index("date")
model = RollingOLS.from_formula(
"ret_RF ~ Mkt_RF",
window=min(n, 60),
min_nobs=24,
expanding=True,
data=df
)
result = model.fit()
# return the time series of rolling window betas
return result.params['Mkt_RF'].dropna()
# if number of months < 24, we skip this stock
else:
pass
betas = data_ff.groupby("permno", group_keys=False).apply(rolling_betas)
betas.dropna().head()
Out[81]:
permno   date
10026.0  2011-12    0.635232
         2012-01    0.582387
         2012-02    0.557631
         2012-03    0.565231
         2012-04    0.545091
dtype: float64
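Because betas is a Series indexed by (permno, date), it aligns with the panel by index, so the estimates can be attached back to data_ff in one line (stock/months without an estimate get NaN):
# attach the rolling betas by (permno, date) index alignment
data_ff["beta"] = betas
data_ff[["ret_RF", "beta"]].dropna().head()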