Outline¶
- Debugging
- Timing code
- Parallel processing
- GPUs
- Geometry of ridge regression
Error messages¶
- Find the last line of your own code in the error message (the last line before the traceback descends into Python libraries, if it does).
- The error is in that line or in some object that appears in that line.
- Ask Julius, GitHub Copilot, or ChatGPT, or Google the message if it is not clear.
- Inspect all of the objects in that line to see if they are what you expect them to be.
How to inspect objects¶
- Use type() or .shape or .head() or whatever
- Use print statements
- Use CTRL-SHFT-P to bring up the command palette and search for Jupyter: Open Variables View to see the current values of all variables.
- Use the Data Wrangler extension to see the data in a table.
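A minimal sketch of this kind of inspection; df is a made-up example object:
In [ ]:
import pandas as pd

# df is a made-up example object to inspect
df = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

print(type(df))   # what kind of object is it?
print(df.shape)   # how many rows and columns?
print(df.dtypes)  # column types
df.head()         # peek at the first few rows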
Errors inside functions¶
- Try to avoid them by testing code on examples before you put it into a function.
- Take code out of the function if there's an error inside the function and test on examples.
- Use CTRL-] to indent a block of code and CTRL-[ to unindent it.
- Use print statements inside the function.
- Use the debugger.
pdb debugger¶
- Ask GitHub Copilot how to use the pdb debugger in Jupyter.
- pdb.set_trace()
- %debug magic command
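A minimal sketch of both approaches; buggy() is a made-up function with a deliberate error:
In [ ]:
import pdb

def buggy(x):
    # pdb.set_trace() pauses execution here and opens an interactive prompt
    # (n = next line, p x = print x, c = continue, q = quit)
    pdb.set_trace()
    return x / 0  # deliberate error for illustration

# Alternatively, after a cell raises an exception, run the %debug magic
# in a new cell to open the debugger at the point of the error:
# %debug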
Timing code¶
- Use time.time or timeit.timeit
- Or use %%time or %%timeit magic commands in Jupyter
- time runs once, timeit runs multiple times and averages the time.
In [39]:
import numpy as np
def add_numbers(n):
    return np.arange(n+1).sum()
In [40]:
%%timeit
add_numbers(100000)
113 μs ± 10.3 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [47]:
import time
start = time.time()
add_numbers(100000)
end = time.time()
print(f"elapsed time: {end-start}")
elapsed time: 0.0010008811950683594
Parallel processing¶
- Use the joblib library or multiprocessing.Pool to parallelize loops (both illustrated below).
- Other libraries are available for more complex parallel processing.
- Only worthwhile for tasks that take a long time to run.
- For small tasks, parallel processing is too much overhead.
In [11]:
from joblib import Parallel, delayed
In [48]:
start = time.time()
lst = [add_numbers(n) for n in range(100000)]
end = time.time()
print(f"elapsed time: {end-start}")
elapsed time: 4.282323360443115
In [49]:
start = time.time()
lst = Parallel(n_jobs=-1, verbose=0)(
    delayed(add_numbers)(n) for n in range(100000)
)
end = time.time()
print(f"elapsed time: {end-start}")
elapsed time: 2.8435566425323486
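For comparison, a sketch of the multiprocessing.Pool version mentioned above. (On Windows or macOS, worker functions may need to live in an importable module rather than in the notebook itself.)
In [ ]:
from multiprocessing import Pool

start = time.time()
# map add_numbers over the same range using all available CPU cores
with Pool() as pool:
    lst = pool.map(add_numbers, range(100000))
end = time.time()
print(f"elapsed time: {end-start}")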
GPU processing¶
- Many cores, each slower than CPU cores.
- Good for tasks that can be parallelized.
- Code still runs on CPU, but tasks are sent to GPU.
- Various libraries will automatically send tasks to the GPU and parallelize them.
- Nvidia: numpy -> cupy, pandas -> cudf, scikit-learn -> cuml (cupy sketch below).
- Google Colab has free GPU access: Runtime > Change runtime type > choose a GPU.
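A minimal cupy sketch (assumes an Nvidia GPU and the cupy package are available; the API mirrors numpy):
In [ ]:
import cupy as cp

# same computation as add_numbers, but the array lives on the GPU
x = cp.arange(100001)
total = x.sum()           # computed on the GPU
print(cp.asnumpy(total))  # copy the result back to the CPU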
cudf¶
- Can rewrite code to use cudf instead of pandas. Almost everything is exactly the same - just use cudf.DataFrame instead of pd.DataFrame, etc.
- Or can use the magic %load_ext cudf.pandas to automatically run pandas code with cudf, as sketched below.
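A sketch of the magic-command route (assumes a GPU with the RAPIDS cudf package installed; the DataFrame is just a made-up example). The extension should be loaded before pandas is imported:
In [ ]:
%load_ext cudf.pandas

import pandas as pd

df = pd.DataFrame({"grp": ["a", "b"] * 500_000, "x": range(1_000_000)})
df.groupby("grp")["x"].mean()  # ordinary pandas code, executed by cudf under the hood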
Geometry of Ridge Regression¶
OLS¶
- $N$ observations, $K$ regressors, $K<N$
- Demean $X$ and $y$, with no constant column in $X$
- OLS is $$\hat\beta = (X'X)^{-1}X'y = ((1/N)X'X)^{-1}((1/N)X'y)$$ $$ = (\text{sample cov of $X$})^{-1} \times \text{sample cov of $X$ with $y$}$$
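A quick numerical check of this identity on made-up data, demeaning $X$ and $y$ and fitting without a constant:
In [ ]:
import numpy as np

rng = np.random.default_rng(0)
N, K = 200, 3
X = rng.normal(size=(N, K))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)

# demean, no constant column
X = X - X.mean(axis=0)
y = y - y.mean()

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_cov = np.linalg.solve((X.T @ X) / N, (X.T @ y) / N)
print(np.allclose(beta_ols, beta_cov))  # True: the 1/N factors cancel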
Why ridge is called ridge¶
- Ridge is obtained by $$\min \quad \frac{1}{2}(y-X\hat\beta)'(y-X\hat\beta) + \frac{1}{2}\lambda \hat\beta'\hat\beta$$
- FOC is $$X'(y-X\hat\beta) - \lambda \hat\beta = 0$$
- Solution is $$\hat\beta = (X'X + \lambda I)^{-1}X'y$$
- So replace sample cov matrix of $X$ with $(1/N)(X'X + \lambda I)$.
- $\lambda I$ adds a ridge to the diagonal of the sample cov matrix of $X$.
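A sketch checking the closed form against scikit-learn's Ridge, continuing with the $X$, $y$, and $K$ defined above (scikit-learn's alpha plays the role of $\lambda$, and fit_intercept=False because the data are already demeaned):
In [ ]:
from sklearn.linear_model import Ridge

lam = 10.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ y)

# sklearn minimizes ||y - Xb||^2 + alpha ||b||^2, which has the same solution
model = Ridge(alpha=lam, fit_intercept=False).fit(X, y)
print(np.allclose(beta_ridge, model.coef_))  # True up to solver tolerance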
Bias-variance tradeoff¶
- Ridge predictions are $$\hat y = X\hat\beta = X(X'X + \lambda I)^{-1}X'y$$
- If $y=X\beta +$ conditional-mean-zero noise, then OLS is unbiased
- Ridge is biased but could have lower variance
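A small simulation illustrating the tradeoff, continuing with the $X$ above: hold $X$ and the true $\beta$ fixed, redraw the noise many times, and compare the sampling distributions of the two estimators.
In [ ]:
true_beta = np.array([1.0, -2.0, 0.5])
lam = 10.0
ols_draws, ridge_draws = [], []
for _ in range(2000):
    y_sim = X @ true_beta + rng.normal(size=N)
    ols_draws.append(np.linalg.solve(X.T @ X, X.T @ y_sim))
    ridge_draws.append(np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ y_sim))
ols_draws, ridge_draws = np.array(ols_draws), np.array(ridge_draws)

print("OLS bias:    ", ols_draws.mean(axis=0) - true_beta)    # approximately zero
print("ridge bias:  ", ridge_draws.mean(axis=0) - true_beta)  # nonzero (shrinkage)
print("OLS variance:  ", ols_draws.var(axis=0))
print("ridge variance:", ridge_draws.var(axis=0))             # smaller than OLS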
Singular value decomposition¶
Take the singular value decomposition $X = UDV'$.
- U is $N \times N$, D is $N \times K$, V is $K \times K$.
- U and V are orthogonal matrices.
- Columns of U are eigenvectors of $XX'$.
- Columns of V are eigenvectors of $X'X$.
- The first $K$ rows of $D$ form a diagonal matrix containing the nonzero singular values of $X$. The other rows are zero.
$$\begin{pmatrix} X_1 & \cdots & X_K \end{pmatrix} = \begin{pmatrix} U_1 & \cdots & U_N \end{pmatrix} \begin{pmatrix} d_1 & 0 & 0 & \cdots & 0 \\ 0 & d_2 & 0 & \cdots & 0\\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & d_K \\ 0 & 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 0 \end{pmatrix} \begin{pmatrix} V_1'\\ \vdots \\ V_K' \end{pmatrix} $$
- Let $\hat D$ denote the first $K$ rows of $D$ (the nonzero rows).
- Let $\hat U$ denote the $N \times K$ matrix consisting of the first $K$ columns of $U$.
- Then, if we multiply out $UDV'$ using the fact that $$D = \begin{pmatrix} \hat D \\ 0 \end{pmatrix}$$ we get $$X = UDV' = \hat U \hat D V'$$
- $X = \hat U\hat D V'$ is called the compact SVD of $X$.
- Because $X = \hat U \hat D V'$ and $U$ is orthogonal, we have $X'X = V\hat D^2V'$.
- This is the diagonalization of the nonsingular matrix $X'X$.
- The $d_i^2$ are the eigenvalues of $X'X$.
- Because $X = \hat U \hat D V'$, we have $\hat U = XV\hat{D}^{-1}$.
- So the first $K$ columns of $U$ are linear combinations of the columns of $X$.
- The change of variables $X \mapsto \hat U$ is a conversion to $K$ orthogonal variables with unit variances.
- Change of variables $X \mapsto \hat U \hat D$ is a conversion to $K$ orthogonal variables with variances equal to $d_i^2$.
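A sketch verifying these relationships numerically, continuing with the $X$ above. np.linalg.svd with full_matrices=False returns exactly $\hat U$, the diagonal of $\hat D$, and $V'$:
In [ ]:
U_hat, d, Vt = np.linalg.svd(X, full_matrices=False)  # U_hat is N x K, d has length K

print(np.allclose(X, U_hat @ np.diag(d) @ Vt))          # X = U_hat D_hat V'
print(np.allclose(U_hat.T @ U_hat, np.eye(K)))          # orthonormal columns
print(np.allclose(X.T @ X, Vt.T @ np.diag(d**2) @ Vt))  # X'X = V D_hat^2 V'
print(np.allclose(U_hat, X @ Vt.T @ np.diag(1 / d)))    # U_hat = X V D_hat^{-1}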
OLS predictions¶
- We have $X'X = V\hat D^2V'$, so $(X'X)^{-1} = V\hat D^{-2} V'$
- For OLS, $$X \hat\beta = XV\hat D^{-2}V'X'y = \hat U \hat D \hat D^{-2} \hat D \hat U'y = \sum_{i=1}^K (U_i'y) U_i$$
- This just repeats what you know: multivariate betas = univariate betas after you orthogonalize the regressors, and if the regressors have unit variances, then betas = covariances.
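A quick check of the prediction formula, continuing with the X, y, and SVD objects above:
In [ ]:
# OLS fitted values two ways: the usual formula and the SVD form
yhat_ols = X @ np.linalg.solve(X.T @ X, X.T @ y)
yhat_svd = sum((U_hat[:, i] @ y) * U_hat[:, i] for i in range(K))
print(np.allclose(yhat_ols, yhat_svd))  # True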
Ridge predictions¶
- We have $$X'X + \lambda I = V\hat D^2V' + \lambda VV' = V(\hat D^2 + \lambda I)V'$$
- So $$X \hat\beta = \hat U \hat D (\hat D^2 + \lambda I)^{-1} \hat D \hat U' y = \sum_{i=1}^K \frac{d_i^2}{d_i^2 + \lambda} (U_i'y) U_i$$
- So, the regression is as if the $U_i$ had variances equal to $$\frac{d_i^2 + \lambda}{d_i^2}$$
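The same check for ridge, continuing with the objects above:
In [ ]:
# ridge fitted values two ways: the closed form and the SVD shrinkage form
lam = 10.0
yhat_ridge = X @ np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ y)
yhat_shrunk = sum(
    (d[i]**2 / (d[i]**2 + lam)) * (U_hat[:, i] @ y) * U_hat[:, i] for i in range(K)
)
print(np.allclose(yhat_ridge, yhat_shrunk))  # True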
What if there are more regressors than observations?¶
- In the singular value decomposition, $D$ has $N$ nonzero columns and $K-N$ zero columns when $K>N$.
- And $X = UDV' = U\hat D \hat V'$ where $\hat D$ is $N \times N$ and $\hat V$ is the first $N$ columns of $V$.
- But we get the same result for the predictions: $$X\hat\beta =\sum_{i=1}^N \frac{d_i^2}{d_i^2 + \lambda} (U_i'y) U_i$$
- Here, the $U_i$ are $N$ linearly independent (and orthogonal and unit variance) regressors constructed from the $K>N$ regressors.
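A sketch of the $K>N$ case on made-up data, checking the same prediction identity:
In [ ]:
# more regressors than observations
N2, K2 = 50, 200
X2 = rng.normal(size=(N2, K2))
X2 = X2 - X2.mean(axis=0)
y2 = rng.normal(size=N2)
y2 = y2 - y2.mean()

U2, d2, V2t = np.linalg.svd(X2, full_matrices=False)  # U2 is N2 x N2, d2 has length N2
lam = 10.0
yhat_ridge = X2 @ np.linalg.solve(X2.T @ X2 + lam * np.eye(K2), X2.T @ y2)
yhat_shrunk = sum(
    (d2[i]**2 / (d2[i]**2 + lam)) * (U2[:, i] @ y2) * U2[:, i] for i in range(N2)
)
print(np.allclose(yhat_ridge, yhat_shrunk))  # True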