Assignment 6
BUSI 520: Python for Business Research
Jones Graduate School of Business
Rice University

  1. The UnivariateSpline function from scipy fits a cubic spline to data, as we’ve seen before. The following code wraps the function in a scikit-learn Estimator, which can be used like the other estimators we’ve studied. For example, you can execute
  model = Spline(s=10)
  model.fit(...)
  import numpy as np
  from scipy.interpolate import UnivariateSpline
  from sklearn.base import BaseEstimator
  from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
  class Spline(BaseEstimator):

    def __init__(self, s=1):
        self.s = s

    def fit(self, X, y):

        # Check that X and y have correct shape
        X, y = check_X_y(X, y)

        # store the training data
        self.X_ = X
        self.y_ = y

        # UnivariateSpline requires a 1-D, increasing x, so fit on the
        # single feature column sorted in increasing order
        order = np.argsort(X[:, 0])
        self.spline_ = UnivariateSpline(X[order, 0], y[order], s=self.s)

        return self

    def predict(self, X):

        # Check if fit has been called
        check_is_fitted(self)

        # Input validation
        X = check_array(X)

        # evaluate the fitted spline on the single feature column
        return self.spline_(X[:, 0])

Run train-test-split on our “noisy sine curve example” data and then run GridSearchCV on Spline on the training data to find the best value of s among (1, 10, 100, 1000, 10000); a sketch is given below. [To learn more about creating Estimators, see https://scikit-learn.org/stable/developers/develop.html.]
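A minimal sketch of one way to do this: the noisy sine data here is a hypothetical stand-in for the course example (the sample size, noise level, and split fractions are illustrative choices), and the scoring argument is needed because Spline does not define a score method.

    import numpy as np
    from sklearn.model_selection import train_test_split, GridSearchCV

    # hypothetical stand-in for the noisy sine curve data
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=200)
    y = np.sin(x) + rng.normal(scale=0.5, size=x.shape)
    X = x.reshape(-1, 1)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0
    )

    cv = GridSearchCV(
        Spline(),
        param_grid={"s": [1, 10, 100, 1000, 10000]},
        scoring="neg_mean_squared_error",  # Spline has no score method
    )
    cv.fit(X_train, y_train)
    print(cv.best_params_, cv.score(X_test, y_test))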

  1. The following generates data in which the target takes the high value (2) in the northeast and southwest quadrants and the low value (0) in the other quadrants, except that it takes the middle value (1) in a band around the axes, where the product of the two features is near zero.
    import numpy as np
    import pandas as pd

    np.random.seed(0)
    X = pd.DataFrame(
        np.random.normal(size=(1000, 2))
    )
    y = X[0] * X[1]
    y = 1*(y > -0.3) + 1*(y > 0.3)
  1. Run the following code from the Visualization notebook to see the data. The horizontal and vertical axis labels are bin numbers for X[0] and X[1], ordered from low to high values.
    from scipy.stats import binned_statistic_2d
    import matplotlib.pyplot as plt
    import seaborn as sns

    statistic, x_edge, y_edge, binnumber = binned_statistic_2d(
        X[0], X[1], y,
        statistic='mean', 
        bins=[100, 100]
    )

    sns.heatmap(
        statistic.T, 
        cmap='coolwarm',
        cbar=True
    )
    plt.gca().invert_yaxis()
    plt.show()
  1. Run train-test-split on this data, then run GridSearchCV on RandomForestClassifier with max_depth in (4, 6, 8, 10, 12, 16). Report the best max_depth and the score on the test data. A sketch is given below.
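A minimal sketch, assuming X and y from the snippet above; the 80/20 split and random_state are illustrative choices, not prescribed by the assignment.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split, GridSearchCV

    # assumes X and y from the data-generation snippet above
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0
    )

    cv = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"max_depth": [4, 6, 8, 10, 12, 16]},
    )
    cv.fit(X_train, y_train)
    print(cv.best_params_["max_depth"], cv.score(X_test, y_test))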
  1. Run train-test-split on scikit-learn’s wine dataset. Create pipelines with StandardScaler and (a) LogisticRegression with an \(\ell^2\) penalty, (b) RandomForestClassifier, and (c) MLPClassifier. Run GridSearchCV on the pipeline in each case to get some idea of the best hyperparameters (a sketch for case (a) follows the code line below). Report (i) the best hyperparameters, (ii) the fraction correct on the test data, and (iii) the confusion matrix for the test data. For MLPClassifier, use
model = MLPClassifier(solver="adam")
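A minimal sketch of the pipeline-plus-grid-search pattern for case (a); the C grid and split parameters are illustrative assumptions, and cases (b) and (c) follow the same pattern with their own parameter grids.

    from sklearn.datasets import load_wine
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix

    X, y = load_wine(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0
    )

    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(penalty="l2", max_iter=5000)),
    ])

    # the C grid is an illustrative choice, not prescribed by the assignment
    cv = GridSearchCV(pipe, param_grid={"clf__C": [0.01, 0.1, 1, 10, 100]})
    cv.fit(X_train, y_train)

    print(cv.best_params_)                               # (i) best hyperparameters
    print(cv.score(X_test, y_test))                      # (ii) fraction correct
    print(confusion_matrix(y_test, cv.predict(X_test)))  # (iii) confusion matrix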