The left violin shows the mean accuracy for each of the 20 repeats with the true labels; the right violin shows, for each of the 10 label permutations, the mean accuracy across its 20 repeats.
As you can see, this works fine when there is a particularly strong signal for the true labels, but if ~70% is actually a good result for the problem, the permuted runs score about as well and this approach can't be used to assess robustness.
Is there something I'm missing in terms of how I am setting up the test that is making the permuted accuracy so consistently high? Is there a better way to approach this question? Or is this method just so good at finding features that will work for any labels that it can't be used in this way?
Thanks so much again,
Chris
And here are the results if I instead perform the feature selection within 3-fold cross-validation, selecting the best 3 features on the training fold, fitting the estimator with those features, and then testing on the test fold:
The bias goes away entirely.
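For reference, here's a minimal sketch of what I mean by that setup, assuming X and y are NumPy arrays and using LogisticRegression purely as a stand-in for whatever estimator is actually in play:

import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

# Stand-in estimator; substitute whatever classifier is actually being used
clf = LogisticRegression(max_iter=1000)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in outer_cv.split(X, y):
    X_tr, X_te = X[train_idx], X[test_idx]
    y_tr, y_te = y[train_idx], y[test_idx]

    # Select the best 3 features using only the training fold (SFFS = forward + floating)
    sffs = SFS(clone(clf), k_features=3, forward=True, floating=True,
               scoring='accuracy', cv=5, n_jobs=-1)
    sffs.fit(X_tr, y_tr)
    feats = list(sffs.k_feature_idx_)

    # Refit on the training fold with those features, then score on the held-out fold
    model = clone(clf).fit(X_tr[:, feats], y_tr)
    fold_scores.append(model.score(X_te[:, feats], y_te))

print(np.mean(fold_scores))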
So, to evaluate the method on a dataset and pick the best number of features k, what I think I need to do is (a rough sketch follows the list):
1. Split data into CV folds
2. For each fold:
a. Run SFFS on the full training fold.
b. For each feature set in .get_metric_dict():
b1. Train the estimator on the training fold using that feature set
b2. Test on the test fold and record metrics
c. Select the final k using the best or parsimonious rule for whatever metric (accuracy, F1, etc.)
3. Repeat the process with randomized labels, capping the subset size at the chosen k to keep it computationally feasible
4. Compare the true results to the randomized results to judge whether the method is likely to perform better than chance on unseen data
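Roughly, in code (again with LogisticRegression as a placeholder estimator, X and y as NumPy arrays, and illustrative names like k_max and n_perms that aren't from mlxtend):

import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

clf = LogisticRegression(max_iter=1000)  # stand-in estimator
k_max = 10                               # largest subset size to consider
n_perms = 10                             # number of label permutations

def evaluate(X, y, k_features, n_splits=3, seed=0):
    """Outer CV; SFFS on each training fold; score every visited subset size on the test fold."""
    scores = {}  # subset size -> list of test-fold accuracies
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in cv.split(X, y):
        X_tr, X_te = X[train_idx], X[test_idx]
        y_tr, y_te = y[train_idx], y[test_idx]
        sffs = SFS(clone(clf), k_features=k_features, forward=True, floating=True,
                   scoring='accuracy', cv=5, n_jobs=-1)
        sffs.fit(X_tr, y_tr)
        for k, subset in sffs.get_metric_dict().items():
            feats = list(subset['feature_idx'])
            model = clone(clf).fit(X_tr[:, feats], y_tr)
            scores.setdefault(k, []).append(model.score(X_te[:, feats], y_te))
    return {k: np.mean(v) for k, v in scores.items()}

# Steps 1-2: true labels, searching sizes 1..k_max
true_scores = evaluate(X, y, k_features=(1, k_max))

# Step 2c: pick k -- here simply the best mean accuracy; a parsimonious rule would
# instead take the smallest k whose score is within tolerance of the best
best_k = max(true_scores, key=true_scores.get)

# Steps 3-4: repeat with permuted labels, capped at the chosen k, and compare
rng = np.random.RandomState(0)
perm_scores = [evaluate(X, rng.permutation(y), k_features=best_k)[best_k]
               for _ in range(n_perms)]
print(true_scores[best_k], np.mean(perm_scores), np.max(perm_scores))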
Having done that to get a robust estimate of performance, I can then run SFFS on the entire dataset, select the feature set of size k, and evaluate from there.
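That final step would just be something like this (best_k being the size chosen above, same placeholder estimator):

from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

clf = LogisticRegression(max_iter=1000)  # placeholder estimator

# Run SFFS once on the full dataset and keep the subset of the chosen size
sffs_final = SFS(clf, k_features=best_k, forward=True, floating=True,
                 scoring='accuracy', cv=5, n_jobs=-1)
sffs_final.fit(X, y)
selected = list(sffs_final.k_feature_idx_)
final_model = clone(clf).fit(X[:, selected], y)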
Overkill? Am I missing something? Either way, I'm sure glad I have access to the university's computing cluster.
Best,
Chris