To clone and set hyper-parameter or not using SFS with GridSearch?


Peter Aaby

Feb 4, 2020, 5:42:37 AM
to mlxtend
Greetings all, in particular Sebastian, without whom this project would not exist. Thank you so much for the hard work!

I am currently using the SFS to identify the 'best' feature sets for several models. 
However, I am unsure how the 'clone_estimator' and 'cv' parameters of SequentialFeatureSelector() interact with the GridSearch object, and whether I should call 'set_params' with the best_estimator_'s hyperparameters myself before using the GridSearch object for prediction (probabilities).

Here is a toy example and demonstration of my understanding together with how I implemented the SFS+GS objects:
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector
import mlxtend
from sklearn.metrics import confusion_matrix, roc_curve, auc


# Load data
X, y = datasets.load_breast_cancer(return_X_y=True)

# Split dataset with low train size to increase difficulty
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.10, random_state=42, stratify=y)


clf = KNeighborsClassifier()
SFS = SequentialFeatureSelector(
    estimator=clf,
    k_features='best',        # 'best' searches for the optimal feature combination
    scoring='roc_auc',        # threshold independent
    cv=5,                     # inner CV? GridSearch will do CV!?
    n_jobs=1,                 # must be 1 or grid.cv_results_ will be empty when NOT cloning the estimator!
    clone_estimator=False)    # False when used in GridSearch!
 
pipe = Pipeline([('fSel', SFS), ('clf', clf)])
gParams = {'fSel__estimator__n_neighbors': [3, 7, 9]}  # avoid ties :)


gs = GridSearchCV(
    estimator=pipe,
    param_grid=gParams,
    scoring='roc_auc',
    cv=5,
    n_jobs=-1,
    return_train_score=True,
    refit=True,
    verbose=0)


model1 = gs.fit(X_train, y_train)
#_ = model1.best_estimator_.named_steps['clf'].set_params(n_neighbors = model1.best_estimator_.named_steps["fSel"].estimator.n_neighbors)

print(f'Best params from gridsearch model1 {model1.best_params_}')
print(f'Best features GS selected: {model1.best_estimator_.named_steps["fSel"].k_feature_idx_}')
print(f'model1.estimator.named_steps["fSel"].estimator.n_neighbors:       {model1.estimator.named_steps["fSel"].estimator.n_neighbors}')
print(f'model1.estimator.named_steps["clf"].estimator.n_neighbors:        {model1.estimator.named_steps["clf"].n_neighbors}')
print(f'model1.best_estimator_.named_steps["fSel"].estimator.n_neighbors: {model1.best_estimator_.named_steps["fSel"].estimator.n_neighbors}')
print(f'model1.best_estimator_.named_steps["clf"].n_neighbors:            {model1.best_estimator_.named_steps["clf"].n_neighbors}')


print(f'\nConfusion Matrix (GridSearch.predict)')
print(confusion_matrix(y_test, model1.predict(X_test)))
fpr, tpr, threshold = roc_curve(y_true=y_test, y_score=model1.predict_proba(X_test)[:, 1], pos_label=1)
print(f'AUC: {auc(fpr, tpr)}')

print(f'\nConfusion Matrix (GridSearch.best_estimator_.predict)')
print(confusion_matrix(y_test, model1.best_estimator_.predict(X_test)))
fpr, tpr, threshold = roc_curve(y_true=y_test, y_score=model1.best_estimator_.predict_proba(X_test)[:, 1], pos_label=1)
print(f'AUC: {auc(fpr, tpr)}')

# Manually train a model using best features and hyperparameter to confirm GridSearch results!
X_train_sfs = X_train[:, list(model1.best_estimator_.named_steps["fSel"].k_feature_idx_)]
X_test_sfs  = X_test[:, list(model1.best_estimator_.named_steps["fSel"].k_feature_idx_)]

# Manual model2: confirm whether model 1 used the default n_neighbors=5?!
model2 = KNeighborsClassifier(n_neighbors=5).fit(X_train_sfs, y_train)
print(f'\nConfusion Matrix (Model2 knn=5 (static and default clf h-param))')
print(confusion_matrix(y_test, model2.predict(X_test_sfs)))
fpr, tpr, threshold = roc_curve(y_true=y_test, y_score=model2.predict_proba(X_test_sfs)[:, 1], pos_label=1)
print(f'AUC: {auc(fpr, tpr)}')

# Manual model3: confirm if model 1 used best_param n_neighbors
model3 = KNeighborsClassifier(n_neighbors=model1.best_estimator_.named_steps["fSel"].estimator.n_neighbors).fit(X_train_sfs, y_train)
print(f'\nConfusion Matrix (model 3 using val from GridSearch best est knn={model1.best_estimator_.named_steps["fSel"].estimator.n_neighbors})')
print(confusion_matrix(y_test, model3.predict(X_test_sfs)))
fpr, tpr, threshold = roc_curve(y_true=y_test, y_score=model3.predict_proba(X_test_sfs)[:, 1], pos_label=1)
print(f'AUC: {auc(fpr, tpr)}')

Which produces:
Best params from gridsearch model1 {'fSel__estimator__n_neighbors': 9}
Best features GS selected: (7, 8, 9, 14, 15, 17, 18, 19, 29)
model1.estimator.named_steps["fSel"].estimator.n_neighbors:       5
model1.estimator.named_steps["clf"].estimator.n_neighbors:        5
model1.best_estimator_.named_steps["fSel"].estimator.n_neighbors: 9
model1.best_estimator_.named_steps["clf"].n_neighbors:            5

Confusion Matrix (GridSearch.predict)
[[157  34]
 [ 19 303]]
AUC: 0.9460505349419532

Confusion Matrix (GridSearch.best_estimator_.predict)
[[157  34]
 [ 19 303]]
AUC: 0.9460505349419532

Confusion Matrix (Model2 knn=5 (static and default clf h-param))
[[157  34]
 [ 19 303]]
AUC: 0.9460505349419532

Confusion Matrix (model 3 using val from GridSearch best est knn=9)
[[151  40]
 [ 15 307]]
AUC: 0.9552778771422068

It appears that the best parameter was not applied to the final 'clf' step, and that using the best parameter in the manual model 3 produces a slightly better AUC score.
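
If that reading is right, i.e. only 'fSel__estimator__n_neighbors' is tuned while the final 'clf' step keeps its default of 5, would something like the following be a sensible workaround? This is just a minimal sketch on my side (the gParamsCoupled/gsCoupled names are mine, and I am assuming GridSearchCV's list-of-dicts form for param_grid) that couples both steps so they always share the same value:

gParamsCoupled = [
    {'fSel__estimator__n_neighbors': [k], 'clf__n_neighbors': [k]}  # same k for both steps
    for k in [3, 7, 9]
]
gsCoupled = GridSearchCV(estimator=pipe, param_grid=gParamsCoupled,
                         scoring='roc_auc', cv=5, n_jobs=-1, refit=True)
gsCoupled.fit(X_train, y_train)

That way the classifier that is refit on the SFS-selected features would use the same n_neighbors as the SFS's inner estimator.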

As such, I thought to confirm by using the SFS module on its own, setting the parameters manually and searching the feature space again using:
simpleSFS = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=9),
    scoring='roc_auc',
    k_features='best',
    cv=5,
    clone_estimator=False).fit(X_train, y_train)

print(f'\nBest features chosen by simpleSFS knn={simpleSFS.estimator.n_neighbors}: {simpleSFS.k_feature_idx_}')
X_train_sfs = X_train[:, list(simpleSFS.k_feature_idx_)]
X_test_sfs  = X_test[:, list(simpleSFS.k_feature_idx_)]

model4 = KNeighborsClassifier(n_neighbors=simpleSFS.estimator.n_neighbors).fit(X_train_sfs, y_train)

print(f'Confusion Matrix (simpleSFS control knn={simpleSFS.estimator.n_neighbors})')
print(confusion_matrix(y_test, model4.predict(X_test_sfs)))
fpr, tpr, threshold = roc_curve(y_true=y_test, y_score=model4.predict_proba(X_test_sfs)[:, 1], pos_label=1)
print(auc(fpr, tpr))

 Producing:
Best features chosen by simpleSFS knn=9: (7, 8, 9, 14, 15, 17, 18, 19, 29)
Confusion Matrix (simpleSFS control knn=9)
[[151  40]
 [ 15 307]]
0.9552778771422068

From the above experimentation, I believe it is necessary to manually set the best-found hyperparameter on the GridSearch's best_estimator_ so that the correct hyperparameter is actually used for prediction. Is that correct?
I also notice that it does not matter whether I use .predict or best_estimator_.predict, which appears perfectly valid since 'refit' is True.
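
In case it helps to see concretely what I mean by setting it manually, this is the workaround I currently have in mind (my own guess, essentially the commented-out set_params line from the script above, not something taken from the mlxtend docs):

# Copy the tuned value from the SFS's inner estimator onto the final 'clf' step.
best_k = model1.best_estimator_.named_steps['fSel'].estimator.n_neighbors
model1.best_estimator_.named_steps['clf'].set_params(n_neighbors=best_k)

# For KNeighborsClassifier this should be enough, since n_neighbors is only
# consulted at prediction time; for an estimator whose fit() depends on the
# hyperparameter, I assume the whole best_estimator_ pipeline would need
# refitting, e.g. model1.best_estimator_.fit(X_train, y_train).
print(confusion_matrix(y_test, model1.best_estimator_.predict(X_test)))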

Additionally, it seems reasonable to infer that the features are correctly identified and used by the GridSearch object, since simpleSFS and model1 find the same feature set. Similarly, the results of model 3 and simpleSFS match up too.

However, even though this example shows similar results between model1 and the manually fitted model 4 (simpleSFS), when using my own data it appears that manually fitting a model 4 using the best parameters and feature set found with GridSearch (model1) produces slightly different results.


Hope someone with better experience and understanding of using SFS together with hyper-parameter tuning would be kind enough to chime in.



Best wishes,
Peter
