Assessing Model Robustness for SFFS

Christopher Remmel

Nov 1, 2019, 12:08:34 PM
to mlxtend
Thanks so much for this great library! It has been a real joy to work with.

I've used SFFS in a feature selection pipeline on biological datasets with between 250 and 500 features and between 27 and 100 samples.

It has been working well, but we are interested in assessing the robustness of the results by running the feature search process on the same data with permuted labels, to see whether it finds models that perform just as well on noise. The test is framed as follows:

H0: The results are no better than random chance.
HA: The results are better than random chance.

The problem I'm running into is that the permuted versions consistently reach mean accuracies between 60% and 70%. So in cases where that level of accuracy would be an exciting result on the real labels, we can't be confident that a similarly performing feature set and model could not have been found for noise.

The pipeline and permutation test are performed as follows:

True Labels:
1. Center, scale, and prefilter features.
2. Find a parsimonious feature set using SFFS and repeated stratified k-fold CV (5 folds, 20 repeats)
3. Record the parsimonious number of features (k_p)

Permuted Labels:
1. Randomize the labels 10 times, preserving the number of samples in each class
2. For each randomization:
     a. Center, scale, and prefilter features
     b. Find a feature set of size k_p using SFFS and repeated stratified k-fold CV (5 folds, 20 repeats)

The feature set size for the permuted versions is limited to k_p to avoid arbitrarily overfitting to the "fake" labels.
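In code, the true-label pass looks roughly like this. It's a stripped-down sketch rather than my exact pipeline: logistic regression and a zero-variance filter stand in for our actual estimator and prefiltering, and the random X/y are just placeholders for a real dataset.

import numpy as np
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import StandardScaler

# Placeholder data; in practice X, y come from one of our biological datasets.
rng = np.random.RandomState(0)
X = rng.rand(50, 300)
y = rng.randint(0, 2, size=50)

# 1. Center, scale, and prefilter features (zero-variance filter as a stand-in).
X = StandardScaler().fit_transform(X)
X = VarianceThreshold(threshold=0.0).fit_transform(X)

# 2. Find a parsimonious feature set with SFFS and repeated stratified k-fold CV.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=0)
sffs = SFS(LogisticRegression(max_iter=1000),
           k_features="parsimonious",    # smallest subset within 1 SE of the best score
           forward=True, floating=True,  # floating=True makes it SFFS
           scoring="accuracy", cv=cv, n_jobs=-1)
sffs = sffs.fit(X, y)

# 3. Record the parsimonious number of features (k_p).
k_p = len(sffs.k_feature_idx_)
print(k_p, sffs.k_score_)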

Here's how this tends to look visually for a handful of different setups/datasets:

[Figure: 20191028_cmv_transmission_i_permutation_violins.png]

[Figure: 20191024_FgcR2_3t0trimmed_permutation_violins.png]

[Figure: 20191029_cmv_primlat_j_permutation_violins.png]


The left violin shows the mean accuracy for each of the 20 CV repeats on the true labels; the right violin shows, for each of the 10 label permutations, the mean accuracy over its 20 repeats.


As you can see, if there is a particularly strong signal for the true labels, this works fine, but if 70% is actually a good result, this approach would be unable to assess robustness.
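For what it's worth, the way I've been thinking of quantifying that comparison is as an empirical permutation p-value over the permuted means; a rough sketch, with hypothetical numbers standing in for the actual scores:

import numpy as np

# Hypothetical values for illustration: the mean CV accuracy on the real
# labels, and the mean CV accuracy for each of the 10 label permutations.
true_score = 0.82
perm_scores = np.array([0.64, 0.61, 0.70, 0.66, 0.68,
                        0.63, 0.67, 0.65, 0.69, 0.62])

# Empirical p-value with the usual +1 so it can never be zero; with only
# 10 permutations the smallest attainable value is 1/11, about 0.09.
p_value = (np.sum(perm_scores >= true_score) + 1) / (len(perm_scores) + 1)
print(p_value)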


Is there something I'm missing in terms of how I am setting up the test that is making the permuted accuracy so consistently high? Is there a better way to approach this question? Or is this method just so good at finding features that will work for any labels that it can't be used in this way?


Thanks so much again,


Chris




Christopher Remmel

Nov 6, 2019, 3:26:55 PM
to mlxtend
More concisely:

I'm finding that if I permute the labels in a dataset, SFFS will often still find a feature set that achieves ~70%+ accuracy with those permuted labels.

Since the point of doing this is to make sure I wouldn't have "found" a relationship where there isn't one, the test is only informative if the version with the real labels achieves an extremely high accuracy.

So: Why is the accuracy so high with the permuted labels? Am I doing something that overfits to the permuted labels, or failing to correct for something obvious? Or could there be another way to test this that would make more sense?

Best,
Chris

Christopher Remmel

Nov 21, 2019, 3:57:08 PM
to mlxtend
Ok, I tried another run that illustrates the problem.

I generated a completely random dataset using np.random.rand with 150 features and 100 samples -- a pretty common size in bioengineering. I assigned the classes completely at random using np.random.randint.
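Roughly this, with an arbitrary seed for reproducibility:

import numpy as np

np.random.seed(42)                       # arbitrary seed, just for reproducibility
X = np.random.rand(100, 150)             # 100 samples, 150 pure-noise features
y = np.random.randint(0, 2, size=100)    # classes assigned completely at random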

I still achieved substantially better than random accuracy with a pretty small set of features.

[Figure: 20191121_random_data_sffs_plot.png]

[Figure: 20191121_random_data_permutation_violins.png]

Is it at all possible to trust results of ~70% on high-dimensional data? Or is this just a drawback of automatic feature search algorithms?

Best,
Chris

Sebastian Raschka

Nov 21, 2019, 4:31:10 PM
to Christopher Remmel, mlxtend
Hi Christopher,

Just wanted to send you a quick response to say that I've seen your emails and questions. However, I currently have a lot on my plate (2 weeks left in the semester, plus lots of things to prepare, grading, etc.) and haven't been able to read through them yet. I hope to take a more detailed look in the upcoming days.

Best,
Sebastian


Christopher Remmel

Nov 21, 2019, 4:39:45 PM
to mlxtend
Thanks a ton, Sebastian! I really appreciate it.

I've been combing through papers on assessing feature selection methods, but haven't been able to find much on hypothesis testing for the selection method, rather than just the individual model. I'd be delighted to learn I'm missing something simple.

Hope the end of the semester is going well, despite the madness of it.

Best,
Chris

Christopher Remmel

Nov 26, 2019, 11:04:16 AM
to mlxtend
I think I've figured out what I was doing wrong.

I was using the SFFS's internal CV scores to evaluate the whole method, so the same folds that drive the feature search also score the final result, which leaks information and introduces an optimistic bias.

What I need to do to fix this is perform the feature selection inside each training fold of an outer CV, and use the held-out folds to evaluate the method.

I don't know if anything like that is already implemented, but it shouldn't be too hard to write.
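Actually, it looks like something close to it can already be expressed with a scikit-learn Pipeline, since the SequentialFeatureSelector is a transformer. Here's a rough sketch of what I mean, with logistic regression and k=3 as placeholders:

import numpy as np
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Pure-noise data again, just to check that the bias disappears.
np.random.seed(0)
X = np.random.rand(100, 150)
y = np.random.randint(0, 2, size=100)

# Scaling and SFFS both live inside the pipeline, so in each outer fold they
# are fit on the training portion only; the held-out fold that produces the
# score never influences the feature search.
pipe = make_pipeline(
    StandardScaler(),
    SFS(LogisticRegression(max_iter=1000),
        k_features=3, forward=True, floating=True,
        scoring="accuracy", cv=3, n_jobs=-1),
    LogisticRegression(max_iter=1000),
)

outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=outer_cv, scoring="accuracy")
print(scores.mean())   # should hover around chance (~0.5) on pure noise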

Here's the test I ran that showed me what I was doing wrong:

On a dataset of wholly random data and classes with 150 samples, here is what happens if I run SFFS on the entire dataset and report the internal CV results for the best 3-feature model, varying the number of random features from 5 to 100:

[Figure: 20191122_benchmark.png]


And here are the results if I instead perform the feature selection within 3-fold cross-validation, selecting the best 3 features on the training fold, fitting the model using the estimator and those features, then testing on the test fold:


[Figure: 20191125_benchmark_b.png]


The bias goes away entirely.


So, what I think I need to do in order to evaluate the method on a dataset and pick the best number of features k is the following (a rough code sketch follows the list):


1. Split data into CV folds

2. For each fold:

    a. Run SFFS on the entire training fold.

    b. For each feature set in .get_metric_dict():

        b1. Train model with estimator on training fold using feature set

        b2. Test on testing fold, record metrics

    c. Select the final k using the best or parsimonious value of whatever metric (accuracy, F1, etc.)

3. Repeat the process with randomized labels, limiting the subset size to the chosen k to keep it computationally feasible

4. Compare the true results to the randomized results to judge whether the method's performance on unseen data is likely to be better than random
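Here's a rough sketch of steps 1 and 2, with logistic regression, 3 outer folds, and a maximum subset size of 10 as placeholders:

import numpy as np
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

def sffs_outer_cv(X, y, estimator, max_k=10, n_splits=3, seed=0):
    """Run SFFS on each outer training fold only, then score every candidate
    subset size on the untouched test fold. Returns {k: mean test accuracy}."""
    outer = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    results = {}
    for train_idx, test_idx in outer.split(X, y):
        # Fit the scaler on the training fold only, so nothing leaks into the test fold.
        scaler = StandardScaler().fit(X[train_idx])
        X_tr, X_te = scaler.transform(X[train_idx]), scaler.transform(X[test_idx])
        y_tr, y_te = y[train_idx], y[test_idx]

        sffs = SFS(clone(estimator), k_features=max_k, forward=True,
                   floating=True, scoring="accuracy", cv=3, n_jobs=-1)
        sffs.fit(X_tr, y_tr)

        # get_metric_dict() is keyed by subset size; 'feature_idx' holds the
        # column indices selected for that size.
        for k, info in sffs.get_metric_dict().items():
            cols = list(info["feature_idx"])
            model = clone(estimator).fit(X_tr[:, cols], y_tr)
            results.setdefault(k, []).append(model.score(X_te[:, cols], y_te))
    return {k: float(np.mean(v)) for k, v in results.items()}

# Hypothetical usage on pure-noise data; every k should hover around 0.5.
rng = np.random.RandomState(0)
X, y = rng.rand(100, 150), rng.randint(0, 2, size=100)
print(sffs_outer_cv(X, y, LogisticRegression(max_iter=1000)))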


Having done that to get a robust idea of performance, I can then proceed to run SFFS on the entire dataset, select the feature set of size k, and evaluate from there.


Overkill? Am I missing something? Either way, I'm sure glad I have access to the university's computing cluster.


Best,

Chris
