Day 11: Odds and Ends

BUSI 520 - Python for Business Research

Kerry Back, JGSB, Rice University

Outline

  • Debugging
  • Timing code
  • Parallel processing
  • GPUs
  • Geometry of ridge regression

Error messages

  • Find the last line of your own code in the error traceback (before the traceback descends into any Python libraries, if it does).
  • The error is in that line or in some object that appears in that line.
  • Ask Julius, GitHub Copilot, or ChatGPT, or google the message if it is not clear.
  • Inspect all of the objects in that line to see if they are what you expect them to be.

How to inspect objects

  • Use type() or .shape or .head() or whatever (see the sketch after this list).
  • Use print statements.
  • Use CTRL-SHIFT-P to bring up the command palette and search for Jupyter: Open Variables View to see the current values of all variables.
  • Use the Data Wrangler extension to see the data in a table.
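
A minimal sketch of these checks on a small made-up DataFrame (the variable and column names are just for illustration):

In [ ]:
import pandas as pd

# a small made-up DataFrame standing in for whatever object is misbehaving
df = pd.DataFrame({"ticker": ["AAPL", "MSFT", "NVDA"], "ret": [0.01, -0.02, 0.03]})

print(type(df))     # <class 'pandas.core.frame.DataFrame'>
print(df.shape)     # (3, 2)
print(df.head())    # first few rows
print(df.dtypes)    # column types often explain "weird" errors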

Errors inside functions

  • Try to avoid them by testing code on examples before you put it into a function.
  • If there is an error inside a function, take the code out of the function and test it on examples.
    • Use CTRL-] to indent a block of code and CTRL-[ to unindent it.
  • Use print statements inside the function.
  • Use the debugger.

pdb debugger

  • Ask GitHub Copilot how to use the pdb debugger in Jupyter.
  • pdb.set_trace() pauses execution at that line so you can inspect variables interactively.
  • The %debug magic command opens a post-mortem debugger after a cell raises an error (see the sketch below).
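
A minimal sketch of both approaches; the buggy function here is made up for illustration:

In [ ]:
import pdb

def buggy_mean(values):
    total = sum(values)
    pdb.set_trace()        # execution pauses here; inspect total and values, then type c to continue
    return total / len(values)

# buggy_mean([])           # uncomment to step through interactively (a ZeroDivisionError is coming)

# Or, after a cell raises an error, run the %debug magic in the next cell
# to open a post-mortem debugger at the point of failure:
# %debug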

Timing code

  • Use time.time() or timeit.timeit()
  • Or use the %%time or %%timeit magic commands in Jupyter
  • time measures a single run; timeit runs the code many times and reports an average.
In [39]:
import numpy as np 

def add_numbers(n):
    # sum of the integers 0, 1, ..., n
    return np.arange(n+1).sum()
In [40]:
%%timeit 

add_numbers(100000)
113 μs ± 10.3 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [47]:
import time 

start = time.time()
add_numbers(100000)
end = time.time()
print(f"elapsed time: {end-start}")
elapsed time: 0.0010008811950683594


Parallel processing

  • Use the joblib library or multiprocessing.Pool to parallelize loops.
  • Other libraries are available for more complex parallel processing.
  • Only worthwhile for tasks that take a long time to run.
  • For small tasks, the overhead of parallel processing outweighs the benefit.
In [11]:
from joblib import Parallel, delayed
In [48]:
start = time.time()
lst = [add_numbers(n) for n in range(100000)]
end = time.time()
print(f"elapsed time: {end-start}")
elapsed time: 4.282323360443115
In [49]:
start = time.time()

lst = Parallel(n_jobs=-1, verbose=0)(      # n_jobs=-1 uses all available CPU cores
    delayed(add_numbers)(n) for n in range(100000)
)

end = time.time()
print(f"elapsed time: {end-start}")
elapsed time: 2.8435566425323486

GPU processing

  • Many cores, each slower than a CPU core.
  • Good for tasks that can be parallelized.
  • Code still runs on the CPU, but tasks are sent to the GPU.
  • Various libraries will automatically send tasks to the GPU and parallelize them.
    • Nvidia: numpy -> cupy, pandas -> cudf, scikit-learn -> cuml
  • Google Colab has free GPU access: Runtime / Change runtime type / select a GPU.

cupy

https://colab.research.google.com/drive/1xhORH4VQr5vuaDVsvKY54JW_dIE5EMUz?usp=sharing
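
A minimal sketch of the numpy-to-cupy swap, assuming a runtime with a GPU and cupy installed (e.g., Colab with a GPU selected):

In [ ]:
import numpy as np
import cupy as cp

# the same operation on the CPU (numpy) and on the GPU (cupy); the APIs are nearly identical
x_cpu = np.random.rand(1000, 1000)
x_gpu = cp.random.rand(1000, 1000)   # this array lives in GPU memory

y_cpu = x_cpu @ x_cpu                # runs on the CPU
y_gpu = x_gpu @ x_gpu                # runs on the GPU

# copy the GPU result back to a numpy array when needed
y_back = cp.asnumpy(y_gpu)
print(type(y_back), y_back.shape)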

cudf

  • You can rewrite code to use cudf instead of pandas. Almost everything is exactly the same - just use cudf.DataFrame instead of pd.DataFrame, etc.
  • Or you can use the magic %load_ext cudf.pandas to automatically run pandas code with cudf:

https://colab.research.google.com/github/rapidsai-community/showcase/blob/main/getting_started_tutorials/cudf_pandas_colab_demo.ipynb
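
A minimal sketch of both approaches, assuming a RAPIDS environment (e.g., a Colab GPU runtime) with cudf installed; the column names are made up:

In [ ]:
# Option 1: use cudf directly in place of pandas
import cudf

df = cudf.DataFrame({"industry": ["tech", "energy", "tech"], "ret": [0.02, -0.01, 0.03]})
print(df.groupby("industry")["ret"].mean())   # same API as pandas, but runs on the GPU

# Option 2: keep pandas code unchanged and let cudf accelerate it.
# Run this magic in a cell *before* importing pandas:
# %load_ext cudf.pandas
# import pandas as pd
# ...existing pandas code now runs on the GPU where possible...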

Geometry of Ridge Regression

OLS

  • $N$ observations, $K$ regressors, $K<N$
  • Demean $X$ and $y$, with no constant column in $X$
  • OLS is $$\hat\beta = (X'X)^{-1}X'y = ((1/N)X'X)^{-1}((1/N)X'y)$$ $$ = (\text{sample cov of $X$})^{-1} \times \text{sample cov of $X$ with $y$}$$ (checked numerically below)
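
A quick numerical check of the demeaned OLS formula against numpy's least-squares solver, on simulated data with made-up dimensions:

In [ ]:
import numpy as np

rng = np.random.default_rng(0)
N, K = 200, 3
X = rng.normal(size=(N, K))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)

# demean X and y, no constant column
Xd = X - X.mean(axis=0)
yd = y - y.mean()

beta_formula = np.linalg.solve(Xd.T @ Xd, Xd.T @ yd)    # (X'X)^{-1} X'y
beta_lstsq = np.linalg.lstsq(Xd, yd, rcond=None)[0]     # numpy's least squares
print(np.allclose(beta_formula, beta_lstsq))            # expect True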

Why ridge is called ridge

  • Ridge is obtained by $$\min \quad \frac{1}{2}(y-X\hat\beta)'(y-X\hat\beta) + \frac{1}{2}\lambda \hat\beta'\hat\beta$$
  • FOC is $$X'(y-X\hat\beta) - \lambda \hat\beta = 0$$
  • Solution is $$\hat\beta = (X'X + \lambda I)^{-1}X'y$$ (checked against scikit-learn below)
  • So replace sample cov matrix of $X$ with $(1/N)(X'X + \lambda I)$.
  • $\lambda I$ adds a ridge to the diagonal of the sample cov matrix of $X$.
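
A check of the closed form against scikit-learn's Ridge, which minimizes $(y-X\hat\beta)'(y-X\hat\beta) + \alpha\hat\beta'\hat\beta$ and so has the same minimizer as the objective above with $\alpha = \lambda$ (simulated data again):

In [ ]:
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
N, K, lam = 200, 3, 10.0
X = rng.normal(size=(N, K))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)
X = X - X.mean(axis=0)
y = y - y.mean()

beta_formula = np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ y)   # (X'X + lambda I)^{-1} X'y

# sklearn's Ridge penalizes alpha * b'b, so set alpha = lambda and drop the intercept
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
print(np.allclose(beta_formula, beta_sklearn))                       # expect True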

Bias-variance tradeoff

  • Ridge predictions are $$\hat y = X\hat\beta = X(X'X + \lambda I)^{-1}X'y$$
  • If $y=X\beta +$ conditional-mean-zero noise, then OLS is unbiased
  • Ridge is biased but can have lower variance (see the simulation sketch below)
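
A small simulation sketch of the tradeoff with made-up parameters: across many simulated samples (fixed, correlated regressors), ridge estimates are pulled toward zero (bias) but vary less from sample to sample:

In [ ]:
import numpy as np

rng = np.random.default_rng(0)
N, K, lam = 50, 5, 25.0
beta = np.ones(K)

# fixed, highly correlated regressors
corr = 0.9 * np.ones((K, K)) + 0.1 * np.eye(K)
X = rng.normal(size=(N, K)) @ np.linalg.cholesky(corr).T

ols, ridge = [], []
for _ in range(2000):
    y = X @ beta + rng.normal(size=N)    # conditional-mean-zero noise
    ols.append(np.linalg.solve(X.T @ X, X.T @ y))
    ridge.append(np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ y))
ols, ridge = np.array(ols), np.array(ridge)

for name, est in [("OLS", ols), ("ridge", ridge)]:
    bias = np.linalg.norm(est.mean(axis=0) - beta)    # distance of the average estimate from the truth
    var = est.var(axis=0).sum()                       # total variance of the estimates across samples
    print(f"{name}: bias {bias:.3f}, total variance {var:.3f}")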

Singular value decomposition

Take the singular value decomposition $X = UDV'$.

  • U is $N \times N$, D is $N \times K$, V is $K \times K$.
  • U and V are orthogonal matrices.
    • Columns of U are eigenvectors of $XX'$.
    • Columns of V are eigenvectors of $X'X$.
  • The first $K$ rows of $D$ form a diagonal matrix containing the nonzero singular values of $X$. The remaining rows are zero.

$$\begin{pmatrix} X_1 & \cdots & X_K \end{pmatrix} = \begin{pmatrix} U_1 & \cdots & U_N \end{pmatrix} \begin{pmatrix} d_1 & 0 & 0 & \cdots & 0 \\ 0 & d_2 & 0 & \cdots & 0\\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & d_K \\ 0 & 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 0 \end{pmatrix} \begin{pmatrix} V_1'\\ \vdots \\ V_K' \end{pmatrix} $$

  • Let $\hat D$ denote the first $K$ rows of $D$ (the nonzero rows).
  • Let $\hat U$ denote the $N \times K$ matrix consisting of the first $K$ columns of $U$.
  • Then, if we multiply out $UDV'$ using the fact that $$D = \begin{pmatrix} \hat D \\ 0 \end{pmatrix}$$ we get $$X = UDV' = \hat U \hat D V'$$
  • $X = \hat U\hat D V'$ is called the compact SVD of $X$ (computed numerically in the sketch after this list).
  • Because $X = \hat U \hat D V'$ and $U$ is orthogonal, we have $X'X = V\hat D^2V'$.
  • This is the diagonalization of the nonsingular matrix $X'X$.
  • The $d_i^2$ are the eigenvalues of $X'X$.
  • Because $X = \hat U \hat D V'$, we have $\hat U = XV\hat{D}^{-1}$.
  • So the first $K$ columns of $U$ are linear combinations of the columns of $X$
  • The change of variables $X \mapsto \hat U$ is a conversion to $K$ orthogonal variables with unit variances.
  • Change of variables $X \mapsto \hat U \hat D$ is a conversion to $K$ orthogonal variables with variances equal to $d_i^2$.
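
A numerical check of the compact SVD and of $X'X = V\hat D^2V'$ on simulated data; np.linalg.svd with full_matrices=False returns $\hat U$, the diagonal of $\hat D$, and $V'$ directly:

In [ ]:
import numpy as np

rng = np.random.default_rng(0)
N, K = 200, 3
X = rng.normal(size=(N, K))

# full_matrices=False gives the compact SVD: U_hat (N x K), singular values d, and V'
U_hat, d, Vt = np.linalg.svd(X, full_matrices=False)

print(np.allclose(X, U_hat @ np.diag(d) @ Vt))               # X = U_hat D_hat V'
print(np.allclose(U_hat.T @ U_hat, np.eye(K)))               # columns of U_hat are orthonormal
print(np.allclose(X.T @ X, Vt.T @ np.diag(d**2) @ Vt))       # X'X = V D_hat^2 V'
print(np.allclose(np.linalg.eigvalsh(X.T @ X)[::-1], d**2))  # eigenvalues of X'X are the d_i^2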

OLS predictions

  • We have $X'X = V\hat D^2V'$, so $(X'X)^{-1} = V\hat D^{-2} V'$
  • For OLS, $$X \hat\beta = XV\hat D^{-2}V'X'y = \hat U \hat D \hat D^{-2} \hat D \hat U'y = \sum_{i=1}^K (U_i'y) U_i$$
  • This just repeats what you know: multivariate betas = univariate betas after you orthogonalize the regressors, and if the regressors have unit variances, then betas = covariances (checked numerically below).
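
A check that the OLS fitted values equal $\hat U\hat U'y = \sum_i (U_i'y)U_i$ (simulated data again):

In [ ]:
import numpy as np

rng = np.random.default_rng(0)
N, K = 200, 3
X = rng.normal(size=(N, K))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)

U_hat, d, Vt = np.linalg.svd(X, full_matrices=False)

yhat_ols = X @ np.linalg.solve(X.T @ X, X.T @ y)   # X (X'X)^{-1} X'y
yhat_svd = U_hat @ (U_hat.T @ y)                   # sum_i (U_i'y) U_i
print(np.allclose(yhat_ols, yhat_svd))             # expect True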

Ridge predictions

  • We have $$X'X + \lambda I = V\hat D^2V' + \lambda VV' = V(\hat D^2 + \lambda I)V'$$
  • So $$X \hat\beta = \hat U \hat D (\hat D^2 + \lambda I)^{-1} \hat D \hat U' y = \sum_{i=1}^K \frac{d_i^2}{d_i^2 + \lambda} (U_i'y) U_i$$ (checked numerically below)
  • So, the regression is as if the $U_i$ had variances equal to $$\frac{d_i^2 + \lambda}{d_i^2}$$
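
The corresponding check for the ridge fitted values (same simulated data, a made-up $\lambda$):

In [ ]:
import numpy as np

rng = np.random.default_rng(0)
N, K, lam = 200, 3, 10.0
X = rng.normal(size=(N, K))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)

U_hat, d, Vt = np.linalg.svd(X, full_matrices=False)

yhat_ridge = X @ np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ y)
shrink = d**2 / (d**2 + lam)                     # shrinkage factor for each component
yhat_svd = U_hat @ (shrink * (U_hat.T @ y))      # sum_i d_i^2/(d_i^2 + lam) (U_i'y) U_i
print(np.allclose(yhat_ridge, yhat_svd))         # expect True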

What if there are more regressors than observations?

  • In the singular value decomposition, $D$ has $N$ nonzero columns and $K-N$ zero columns when $K>N$.
  • And $X = UDV' = U\hat D \hat V'$ where $\hat D$ is $N \times N$ and $\hat V$ is the first $N$ columns of $V$.
  • But we get the same result for the predictions: $$X\hat\beta =\sum_{i=1}^N \frac{d_i^2}{d_i^2 + \lambda} (U_i'y) U_i$$
  • Here, the $U_i$ are $N$ linearly independent (and orthogonal and unit variance) regressors constructed from the $K>N$ regressors (see the sketch below).
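
The same check with more regressors than observations; $X'X$ is now singular, so only the ridge ($\lambda > 0$) closed form is used:

In [ ]:
import numpy as np

rng = np.random.default_rng(0)
N, K, lam = 50, 200, 10.0                 # more regressors than observations
X = rng.normal(size=(N, K))
y = rng.normal(size=N)

# the compact SVD now has N singular values; U_hat is N x N
U_hat, d, Vt = np.linalg.svd(X, full_matrices=False)
print(U_hat.shape, d.shape)               # (50, 50) (50,)

yhat_ridge = X @ np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ y)
yhat_svd = U_hat @ ((d**2 / (d**2 + lam)) * (U_hat.T @ y))
print(np.allclose(yhat_ridge, yhat_svd))  # expect True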