Running Exhaustive Feature Selection with .632 Bootstrap

jh

Oct 23, 2018, 1:03:37 PM
to mlxtend
Hi, I recently came across this library and am very excited to see some of the native options/features it implements.

I'm attempting to find the best feature subset by minimizing the .632 bootstrap prediction error (as opposed to the CV offered by the Exhaustive Feature Selector (EFS)).

Since bootstrap_point632_score returns an array, I've attempted to define a "scorer" function as directed in the EFS documentation. See below:

import numpy as np

from mlxtend.evaluate import bootstrap_point632_score
from mlxtend.feature_selection import ExhaustiveFeatureSelector
from sklearn.linear_model import LinearRegression

lr = LinearRegression()

# average the array of per-round bootstrap scores into a single value
def scorer(estimator, datamatrix, outputvector):
    return np.mean(bootstrap_point632_score(estimator, datamatrix, outputvector))

efs = ExhaustiveFeatureSelector(lr, 
          min_features=1,
          max_features=9,
          scoring=scorer,
          cv=0)

efs.fit(X, y)

# print('Best MSE score: %.2f' % (efs.best_score_ * -1))
# print('Best subset:', efs.best_idx_)

Currently I'm receiving the error message:

KeyError: 'None of [[7 5 4 0 2 0 9 6 5 5]] are in the [index]'

which seems to stem from the following:
C:\ProgramData\Anaconda3\lib\site-packages\mlxtend\feature_selection\exhaustive_feature_selector.py in _calc_score(selector, X, y, indices, **fit_params)
     34                                  n_jobs=1,
     35                                  pre_dispatch=selector.pre_dispatch,
---> 36                                  fit_params=fit_params)
     37     else:
     38         selector.est_.fit(X[:, indices], y, **fit_params)

Is what I'm trying to do even possible? Am I missing some understanding in terms of how these functions operate?

Sebastian Raschka

Oct 26, 2018, 5:29:27 PM
to jh...@utk.edu, mlxtend
Hi Jeremy,

> Thanks for your reply. I noticed that this reply didn't get posted onto the message boards, and wondered if I should post your reply there.

I think this happened by accident, because I used a different email to reply, which I didn't register with the group.

> I did end up manually iterating the .632 bootstrap scoring across each feature combination. Of course, I always knew this was an option, but the point of some of the mlxtend library features is to automate this process, so I wanted to try to use it!

Yes, of course this would ideally be automated. The only issue is that I haven't found the time yet to implement it this way for the 0.632 bootstrap.

>
> If it is helpful to have some feedback, as a suggestion: if there is any way to allow the exhaustive (and sequential) feature selection to determine which features to drop based on any given output value (perhaps any given "scorer" that outputs a single, real value), that would be the ideal solution for what I foresee as my needs in the future. Depending on my use case, I could see this library being very useful if I could easily determine the best feature subsets based on any given scorer: simple MSE, p-value, f-value, AIC, BIC, 5- or 10-fold CV, bootstrap, etc., or even the possibility to use test rather than training data ... I think there's a huge gap for a library to streamline these processes within Python.

If I understand it correctly, this is already supported via the `scorer` attribute. The reason it didn't work with the 0.632 bootstrap method specifically is the following:

- the bootstrap (accuracy) score is computed on BOTH the training and the test set in each split, i.e., the average of (0.632 * test accuracy + 0.368 * training accuracy) over the bootstrap samples, as described in Efron, Bradley, and Robert Tibshirani. 1994. An Introduction to the Bootstrap. Chapman & Hall.

However, the way scikit-learn and the exhaustive/sequential feature selectors work is that the scorer is only applied to the test fold (i.e., it doesn't include the weighted training fold). Hence, custom functions would need to be written such that both the training and the test folds get passed on to the scorer in order to compute the performance value.
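
To make this concrete, here is a minimal sketch of what such a custom scorer could look like if both folds were passed in. The signature and function name are hypothetical, not part of mlxtend's current API:

# hypothetical signature: current scorers only receive the test fold,
# but the .632 estimate also needs the training fold
def point632_scorer(estimator, X_train, y_train, X_test, y_test):
    train_acc = estimator.score(X_train, y_train)
    test_acc = estimator.score(X_test, y_test)
    # weighted average described in Efron & Tibshirani (1994)
    return 0.632 * test_acc + 0.368 * train_acc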

I added this to the issue tracker the other day in hope that I or someone else adds support for these scenarios some time: https://github.com/rasbt/mlxtend/issues/455

Best,
Sebastian


> On Oct 26, 2018, at 2:35 PM, Hale, Jeremy <xxl...@vols.utk.edu> wrote:
>
> Thanks, again, for your reply!
>
> On Tue, Oct 23, 2018 at 3:40 PM Sebastian Raschka <ma...@sebastianraschka.com> wrote:
> Hi there,
>
> unfortunately, I don't think this is currently possible. The `bootstrap_point632_score` function is more like scikit-learn's `cross_val_score` which basically bundles a `scorer` with a `cv` method into one function, but here we need both separately.
>
> The `scorer` in the context of scikit-learn is more like accuracy, precision, or recall, so the function signature is not designed to accept the input features. The `cv` includes KFold, StratifiedKFold, etc., which iterate over the dataset to produce the splits. So, in order to make this work with the ExhaustiveFeatureSelector interface, one would have to factorize bootstrap_point632_score into a BootstrapIterator plus a 0.632 scoring function (which scores and weights both the training and the out-of-bag samples), just like cross_val_score can be refactored into a KFold/StratifiedKFold iterator plus a scoring function (which only scores the test folds).
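>
> For illustration, the cross-validation side of that analogy can be written in both the bundled and the factorized form using only standard scikit-learn calls (a sketch, not mlxtend code):
>
> from sklearn.datasets import load_iris
> from sklearn.model_selection import cross_val_score, StratifiedKFold
> from sklearn.neighbors import KNeighborsClassifier
>
> X, y = load_iris(return_X_y=True)
> knn = KNeighborsClassifier(n_neighbors=4)
>
> # bundled: iterator and scorer rolled into one call
> scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy')
>
> # factorized: an explicit iterator plus a separate scorer,
> # which is the shape the feature selectors expect for `cv`
> scores = cross_val_score(knn, X, y, cv=StratifiedKFold(n_splits=5), scoring='accuracy')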
>
> I will add this to the issue tracker, but I will probably not have time to get to this soon.
>
> So for the time being, the easiest thing would probably be to iterate over all possible feature combinations manually and apply the bootstrap scoring function (you basically need the all_comb variable shown at https://github.com/rasbt/mlxtend/blob/master/mlxtend/feature_selection/exhaustive_feature_selector.py#L264)
>
>
> Best,
> Sebastian

Sebastian Raschka

Oct 26, 2018, 5:52:36 PM
to jh...@utk.edu, mlxtend
To better clarify my previous email: you can do the generic (out-of-bag) bootstrapping with the current implementation, e.g.,

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from mlxtend.evaluate import BootstrapOutOfBag

iris = load_iris()
X = iris.data
y = iris.target
knn = KNeighborsClassifier(n_neighbors=4)
oob = BootstrapOutOfBag(n_splits=200)


sfs1 = SFS(knn,
           k_features=3,
           forward=True,
           floating=False,
           verbose=2,
           scoring='accuracy',
           cv=oob)

sfs1 = sfs1.fit(X, y)


However, the performance metric (you can pass any scorer to "scoring") will only be computed on the test fold (here: the out-of-bag examples). The tricky part is finding a general solution such that both the training and the test set can be considered via the "scoring" function, as the 0.632 bootstrap method requires.
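
In the meantime, here is a minimal sketch of the manual workaround mentioned earlier: iterating over all feature combinations yourself and averaging the .632 bootstrap scores. The loop structure and variable names are illustrative, not mlxtend code:

import itertools
import numpy as np
from mlxtend.evaluate import bootstrap_point632_score
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=4)

n_features = X.shape[1]
results = {}
for k in range(1, n_features + 1):
    for subset in itertools.combinations(range(n_features), k):
        cols = list(subset)
        # .632 bootstrap score for this feature subset
        scores = bootstrap_point632_score(knn, X[:, cols], y,
                                          n_splits=200, method='.632')
        results[subset] = np.mean(scores)

best_subset = max(results, key=results.get)
print('Best subset:', best_subset)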

Best,
Sebastian