Assignment 6
BUSI 520: Python for Business Research
Jones Graduate School of Business
Rice University
- The UnivariateSpline function from scipy fits a cubic spline to data, as we’ve seen before. The following code wraps the function into a scikit-learn Estimator, which can be used like the other estimators we’ve studied. For example, you can execute
model = Spline(s=10)
model.fit(...)
import numpy as np
from scipy.interpolate import UnivariateSpline
from sklearn.base import BaseEstimator
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted

class Spline(BaseEstimator):
    def __init__(self, s=1):
        self.s = s

    def fit(self, X, y):
        # Check that X and y have correct shape
        X, y = check_X_y(X, y)
        # store the fitted estimator
        self.X_ = X
        self.y_ = y
        self.spline = UnivariateSpline(X, y, s=self.s)
        return self

    def predict(self, X):
        # Check if fit has been called
        check_is_fitted(self)
        # Input validation
        X = check_array(X)
        return self.spline(X)
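One caveat: check_X_y and check_array return 2-D arrays, while UnivariateSpline expects a 1-D x with increasing values, so fitting on shuffled training data can raise an error. If it does, a minimal fix (our addition, not part of the code above) is to flatten and sort inside fit and flatten inside predict:

    def fit(self, X, y):
        X, y = check_X_y(X, y)
        self.X_ = X
        self.y_ = y
        # UnivariateSpline wants a 1-D, increasing x, so flatten and sort first
        order = np.argsort(X[:, 0])
        self.spline = UnivariateSpline(X[order, 0], y[order], s=self.s)
        return self

    def predict(self, X):
        check_is_fitted(self)
        X = check_array(X)
        return self.spline(X[:, 0])  # evaluate on the flattened inputs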
Run train-test-split on our “noisy sine curve example” data and then run GridSearchCV on Spline on the training data to find the best value of s among (1, 10, 100, 1000, 10000); a sketch follows below. [To learn more about creating Estimators, see https://scikit-learn.org/stable/developers/develop.html.]
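A minimal sketch of this step, using stand-in noisy sine data (the actual data is in the course notebook) and the flattened-and-sorted fit from the note above; because Spline does not define a score method, GridSearchCV needs an explicit scoring argument:

from sklearn.model_selection import train_test_split, GridSearchCV

# stand-in for the notebook's noisy sine curve data (an assumption)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=500)
y = np.sin(x) + 0.2*rng.normal(size=x.size)
X = x.reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Spline has no score method, so give GridSearchCV an explicit scoring rule
cv = GridSearchCV(
    Spline(),
    param_grid={"s": [1, 10, 100, 1000, 10000]},
    scoring="neg_mean_squared_error",
)
cv.fit(X_train, y_train)
print(cv.best_params_)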
- The following code generates data in which the target takes the high value (2) in the northeast and southwest quadrants and the low value (0) in the other two quadrants, except that it takes the middle value (1) in a band around the origin.
import pandas as pd

np.random.seed(0)
X = pd.DataFrame(np.random.normal(size=(1000, 2)))
y = X[0]*X[1]
y = 1*(y>-0.3) + 1*(y>0.3)
- Run the following code from the Visualization notebook to see the data. The horizontal and vertical axis labels will be the percentiles of X[0] and X[1] from low to high.
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import binned_statistic_2d

statistic, x_edge, y_edge, binnumber = binned_statistic_2d(
    X[0], X[1], y,
    statistic='mean',
    bins=[100, 100]
)
sns.heatmap(
    statistic.T,
    cmap='coolwarm',
    cbar=True
)
plt.gca().invert_yaxis()
plt.show()
- Run GridSearchCV on RandomForestClassifier for max_depth values in (4, 6, 8, 10, 12, 16). Report the best max_depth and the score on the test data; a sketch follows below.
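A sketch of this step on the quadrant data above; the 80/20 split and random_state are assumptions, and the scoring and cv settings are GridSearchCV's defaults:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

cv = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [4, 6, 8, 10, 12, 16]},
)
cv.fit(X_train, y_train)
print(cv.best_params_["max_depth"])   # best max_depth
print(cv.score(X_test, y_test))       # accuracy on the test data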
- Run train-test-split on scikit-learn’s wine dataset. Create pipelines with StandardScaler and (a) LogisticRegression with an \(\ell^2\) penalty, (b) RandomForestClassifier, and (c) MLPClassifier. Run GridSearchCV on the pipeline in each case to get some idea of the best hyperparameters. Report (i) the best hyperparameters, (ii) the fraction correct on the test data, and (iii) the confusion matrix for the test data. For MLPClassifier, use
model = MLPClassifier(solver="adam")
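A sketch for case (a); cases (b) and (c) swap in RandomForestClassifier or the MLPClassifier above with their own parameter grids. The grid over C and the 80/20 split are assumptions, not part of the assignment:

from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# case (a): scale the features, then logistic regression with an l2 penalty
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(penalty="l2", max_iter=1000)),
])
cv = GridSearchCV(pipe, param_grid={"clf__C": [0.01, 0.1, 1, 10, 100]})
cv.fit(X_train, y_train)

print(cv.best_params_)                               # (i) best hyperparameters
print(cv.score(X_test, y_test))                      # (ii) fraction correct on test data
print(confusion_matrix(y_test, cv.predict(X_test)))  # (iii) confusion matrix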