Outline¶
- Debugging
- Timing code
- Parallel processing
- GPUs
- Geometry of ridge regression
Error messages¶
- Find the last line of your own code in the error message (the last line before the traceback descends into Python libraries, if it does).
- The error is in that line or in some object that appears in that line.
- Ask Julius, GitHub Copilot, or ChatGPT, or Google the message if it is not clear.
- Inspect all of the objects in that line to see if they are what you expect them to be.
How to inspect objects¶
- Use type() or .shape or .head() or whatever
- Use print statements
- Use CTRL-SHFT-P to bring up the command palette and search for Jupyter: Open Variables View to see the current values of all variables.
- Use the Data Wrangler extension to see the data in a table.
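A minimal sketch of this kind of inspection; df is a made-up example object:
In [ ]:
import pandas as pd

# df is a made-up example object to inspect
df = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

print(type(df))   # what kind of object is it?
print(df.shape)   # how many rows and columns?
print(df.dtypes)  # column types
df.head()         # peek at the first few rows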
Errors inside functions¶
- Try to avoid them by testing code on examples before you put it into a function.
- Take code out of the function if there's an error inside the function and test on examples.
- Use CTRL-] to indent a block of code and CTRL-[ to unindent it.
- Use print statements inside the function.
- Use the debugger.
pdb debugger¶
- Ask GitHub Copilot how to use the pdb debugger in Jupyter.
- pdb.set_trace()
- %debug magic command
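A minimal sketch of both approaches; buggy() is a made-up function with a deliberate error:
In [ ]:
import pdb

def buggy(x):
    # pdb.set_trace() pauses execution here and opens an interactive prompt
    # (n = next line, p x = print x, c = continue, q = quit)
    pdb.set_trace()
    return x / 0  # deliberate error for illustration

# Alternatively, after a cell raises an exception, run the %debug magic
# in a new cell to open the debugger at the point of the error:
# %debug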
Timing code¶
- Use time.time or timeit.timeit
- Or use %%time or %%timeit magic commands in Jupyter
- time runs once, timeit runs multiple times and averages the time.
In [39]:
import numpy as np
def add_numbers(n):
    return np.arange(n+1).sum()
In [40]:
%%timeit
add_numbers(100000)
113 μs ± 10.3 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [47]:
import time
start = time.time()
add_numbers(100000)
end = time.time()
print(f"elapsed time: {end-start}")
elapsed time: 0.0010008811950683594
Parallel processing¶
- Use the joblib library or multiprocessing.Pool to parallelize loops (both illustrated below).
- Other libraries are available for more complex parallel processing.
- Only worthwhile for tasks that take a long time to run.
- For small tasks, parallel processing is too much overhead.
In [11]:
from joblib import Parallel, delayed
In [48]:
start = time.time()
lst = [add_numbers(n) for n in range(100000)]
end = time.time()
print(f"elapsed time: {end-start}")
elapsed time: 4.282323360443115
In [49]:
start = time.time()
lst = Parallel(n_jobs=-1, verbose=0)(
    delayed(add_numbers)(n) for n in range(100000)
)
end = time.time()
print(f"elapsed time: {end-start}")
elapsed time: 2.8435566425323486
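For comparison, a sketch of the multiprocessing.Pool version mentioned above. (On Windows or macOS, worker functions may need to live in an importable module rather than in the notebook itself.)
In [ ]:
from multiprocessing import Pool

start = time.time()
# map add_numbers over the same range using all available CPU cores
with Pool() as pool:
    lst = pool.map(add_numbers, range(100000))
end = time.time()
print(f"elapsed time: {end-start}")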
GPU processing¶
- Many cores, each slower than CPU cores.
- Good for tasks that can be parallelized.
- Code still runs on CPU, but tasks are sent to GPU.
- Various libraries will automatically send tasks to the GPU and parallelize them.
- Nvidia: numpy -> cupy, pandas -> cudf, scikit-learn -> cuml (cupy sketch below).
- Google Colab has free GPU access: Runtime > Change runtime type > choose a GPU.
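A minimal cupy sketch (assumes an Nvidia GPU and the cupy package are available; the API mirrors numpy):
In [ ]:
import cupy as cp

# same computation as add_numbers, but the array lives on the GPU
x = cp.arange(100001)
total = x.sum()           # computed on the GPU
print(cp.asnumpy(total))  # copy the result back to the CPU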
cudf¶
- Can rewrite code to use cudf instead of pandas. Almost everything is exactly the same - just use cudf.DataFrame instead of pd.DataFrame, etc.
- Or can use the magic %load_ext cudf.pandas to automatically run pandas code with cudf, as sketched below.
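A sketch of the magic-command route (assumes a GPU with the RAPIDS cudf package installed; the DataFrame is just a made-up example). The extension should be loaded before pandas is imported:
In [ ]:
%load_ext cudf.pandas

import pandas as pd

df = pd.DataFrame({"grp": ["a", "b"] * 500_000, "x": range(1_000_000)})
df.groupby("grp")["x"].mean()  # ordinary pandas code, executed by cudf under the hood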
Geometry of Ridge Regression¶
OLS¶
- $N$ observations, $K$ regressors, $K<N$
- Demean $X$ and $y$, with no constant column in $X$
- OLS is $$\hat\beta = (X'X)^{-1}X'y = ((1/N)X'X)^{-1}((1/N)X'y)$$ $$ = (\text{sample cov of $X$})^{-1} \times \text{sample cov of $X$ with $y$}$$
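A quick numerical check of this identity on made-up data, demeaning $X$ and $y$ and fitting without a constant:
In [ ]:
import numpy as np

rng = np.random.default_rng(0)
N, K = 200, 3
X = rng.normal(size=(N, K))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)

# demean, no constant column
X = X - X.mean(axis=0)
y = y - y.mean()

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_cov = np.linalg.solve((X.T @ X) / N, (X.T @ y) / N)
print(np.allclose(beta_ols, beta_cov))  # True: the 1/N factors cancel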
Why ridge is called ridge¶
- Ridge is obtained by $$\min \quad \frac{1}{2}(y-X\hat\beta)'(y-X\hat\beta) + \frac{1}{2}\lambda \hat\beta'\hat\beta$$
- FOC is $$X'(y-X\hat\beta) - \lambda \hat\beta = 0$$
- Solution is $$\hat\beta = (X'X + \lambda I)^{-1}X'y$$
- So replace sample cov matrix of $X$ with $(1/N)(X'X + \lambda I)$.
- $\lambda I$ adds a ridge to the diagonal of the sample cov matrix of $X$.
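A sketch checking the closed form against scikit-learn's Ridge, continuing with the $X$, $y$, and $K$ defined above (scikit-learn's alpha plays the role of $\lambda$, and fit_intercept=False because the data are already demeaned):
In [ ]:
from sklearn.linear_model import Ridge

lam = 10.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ y)

# sklearn minimizes ||y - Xb||^2 + alpha ||b||^2, which has the same solution
model = Ridge(alpha=lam, fit_intercept=False).fit(X, y)
print(np.allclose(beta_ridge, model.coef_))  # True up to solver tolerance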
Bias-variance tradeoff¶
- Ridge predictions are $$\hat y = X\hat\beta = X(X'X + \lambda I)^{-1}X'y$$
- If $y=X\beta +$ conditional-mean-zero noise, then OLS is unbiased
- Ridge is biased but could have lower variance
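A small simulation illustrating the tradeoff, continuing with the $X$ above: hold $X$ and the true $\beta$ fixed, redraw the noise many times, and compare the sampling distributions of the two estimators.
In [ ]:
true_beta = np.array([1.0, -2.0, 0.5])
lam = 10.0
ols_draws, ridge_draws = [], []
for _ in range(2000):
    y_sim = X @ true_beta + rng.normal(size=N)
    ols_draws.append(np.linalg.solve(X.T @ X, X.T @ y_sim))
    ridge_draws.append(np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ y_sim))
ols_draws, ridge_draws = np.array(ols_draws), np.array(ridge_draws)

print("OLS bias:    ", ols_draws.mean(axis=0) - true_beta)    # approximately zero
print("ridge bias:  ", ridge_draws.mean(axis=0) - true_beta)  # nonzero (shrinkage)
print("OLS variance:  ", ols_draws.var(axis=0))
print("ridge variance:", ridge_draws.var(axis=0))             # smaller than OLS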
Singular value decomposition¶
Take the singular value decomposition $X = UDV'$.
- U is $N \times N$, D is $N \times K$, V is $K \times K$.
- U and V are orthogonal matrices.
- Columns of U are eigenvectors of $XX'$.
- Columns of V are eigenvectors of $X'X$.
- The first $K$ rows of $D$ form a diagonal matrix containing the nonzero singular values of $X$. The other rows are zero.
$$\begin{pmatrix} X_1 & \cdots & X_K \end{pmatrix} = \begin{pmatrix} U_1 & \cdots & U_N \end{pmatrix} \begin{pmatrix} d_1 & 0 & 0 & \cdots & 0 \\ 0 & d_2 & 0 & \cdots & 0\\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & d_K \\ 0 & 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 0 \end{pmatrix} \begin{pmatrix} V_1'\\ \vdots \\ V_K' \end{pmatrix} $$
- Let $\hat D$ denote the first $K$ rows of $D$ (the nonzero rows).
- Let $\hat U$ denote the $N \times K$ matrix consisting of the first $K$ columns of $U$.
- Then, if we multiply out $UDV'$ using the fact that $$D = \begin{pmatrix} \hat D \\ 0 \end{pmatrix}$$ we get $$X = UDV' = \hat U \hat D V'$$
- $X = \hat U\hat D V'$ is called the compact SVD of $X$.
- Because $X = \hat U \hat D V'$ and $U$ is orthogonal, we have $X'X = V\hat D^2V'$.
- This is the diagonalization of the nonsingular matrix $X'X$.
- The $d_i^2$ are the eigenvalues of $X'X$.
- Because $X = \hat U \hat D V'$, we have $\hat U = XV\hat{D}^{-1}$.
- So the first $K$ columns of $U$ are linear combinations of the columns of $X$.
- The change of variables $X \mapsto \hat U$ is a conversion to $K$ orthogonal variables with unit variances.
- Change of variables $X \mapsto \hat U \hat D$ is a conversion to $K$ orthogonal variables with variances equal to $d_i^2$.
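A sketch verifying these relationships numerically, continuing with the $X$ above. np.linalg.svd with full_matrices=False returns exactly $\hat U$, the diagonal of $\hat D$, and $V'$:
In [ ]:
U_hat, d, Vt = np.linalg.svd(X, full_matrices=False)  # U_hat is N x K, d has length K

print(np.allclose(X, U_hat @ np.diag(d) @ Vt))          # X = U_hat D_hat V'
print(np.allclose(U_hat.T @ U_hat, np.eye(K)))          # orthonormal columns
print(np.allclose(X.T @ X, Vt.T @ np.diag(d**2) @ Vt))  # X'X = V D_hat^2 V'
print(np.allclose(U_hat, X @ Vt.T @ np.diag(1 / d)))    # U_hat = X V D_hat^{-1}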
OLS predictions¶
- We have $X'X = V\hat D^2V'$, so $(X'X)^{-1} = V\hat D^{-2} V'$
- For OLS, $$X \hat\beta = XV\hat D^{-2}V'X'y = \hat U \hat D \hat D^{-2} \hat D \hat U'y = \sum_{i=1}^K (U_i'y) U_i$$
- This just repeats what you know: multivariate betas = univariate betas after you orthogonalize the regressors, and if the regressors have unit variances, then betas = covariances.
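A quick check of the prediction formula, continuing with the X, y, and SVD objects above:
In [ ]:
# OLS fitted values two ways: the usual formula and the SVD form
yhat_ols = X @ np.linalg.solve(X.T @ X, X.T @ y)
yhat_svd = sum((U_hat[:, i] @ y) * U_hat[:, i] for i in range(K))
print(np.allclose(yhat_ols, yhat_svd))  # True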
Ridge predictions¶
- We have $$X'X + \lambda I = V\hat D^2V' + \lambda VV' = V(\hat D^2 + \lambda I)V'$$
- So $$X \hat\beta = \hat U \hat D (\hat D^2 + \lambda I)^{-1} \hat D \hat U' y = \sum_{i=1}^K \frac{d_i^2}{d_i^2 + \lambda} (U_i'y) U_i$$
- So, the regression is as if the $U_i$ had variances equal to $$\frac{d_i^2 + \lambda}{d_i^2}$$
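The same check for ridge, continuing with the objects above:
In [ ]:
# ridge fitted values two ways: the closed form and the SVD shrinkage form
lam = 10.0
yhat_ridge = X @ np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ y)
yhat_shrunk = sum(
    (d[i]**2 / (d[i]**2 + lam)) * (U_hat[:, i] @ y) * U_hat[:, i] for i in range(K)
)
print(np.allclose(yhat_ridge, yhat_shrunk))  # True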
What if there are more regressors than observations?¶
- In the singular value decomposition, $D$ has $N$ nonzero columns and $K-N$ zero columns when $K>N$.
- And $X = UDV' = U\hat D \hat V'$ where $\hat D$ is $N \times N$ and $\hat V$ is the first $N$ columns of $V$.
- But we get the same result for the predictions: $$X\hat\beta =\sum_{i=1}^N \frac{d_i^2}{d_i^2 + \lambda} (U_i'y) U_i$$
- Here, the $U_i$ are $N$ linearly independent (and orthogonal and unit variance) regressors constructed from the $K>N$ regressors.
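A sketch of the $K>N$ case on made-up data, checking the same prediction identity:
In [ ]:
# more regressors than observations
N2, K2 = 50, 200
X2 = rng.normal(size=(N2, K2))
X2 = X2 - X2.mean(axis=0)
y2 = rng.normal(size=N2)
y2 = y2 - y2.mean()

U2, d2, V2t = np.linalg.svd(X2, full_matrices=False)  # U2 is N2 x N2, d2 has length N2
lam = 10.0
yhat_ridge = X2 @ np.linalg.solve(X2.T @ X2 + lam * np.eye(K2), X2.T @ y2)
yhat_shrunk = sum(
    (d2[i]**2 / (d2[i]**2 + lam)) * (U2[:, i] @ y2) * U2[:, i] for i in range(N2)
)
print(np.allclose(yhat_ridge, yhat_shrunk))  # True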