Setting a validation set for CV


B K

Sep 18, 2018, 5:56:05 AM
to mlxtend
Hi guys,

I am using mlxtend's sequential feature selector (SFS) to find the best feature combination within a range of k for a model. I am following the example in the user guide (https://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/#example-8-selecting-the-best-feature-combination-in-a-k-range).

Now my question is: is it possible to set your own validation sample for the cross-validation (so that it is not a split of the total training set)? The thing is that I am using an oversampled data set for training, but I would like the algorithm to use a data sample that is not oversampled for the validation. Something like this is what I want for every cross-validation step (see the rough sketch after the list):
-) first, split off a certain fraction of the data instances (the validation sample)
-) then oversample the rest
-) start the training with CV
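A minimal sketch of what I mean (imbalanced-learn's RandomOverSampler is just one possible choice of oversampler here, and the variable names are made up):

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

# 1) split off a fixed validation sample first
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# 2) oversample only the training portion
X_train_os, y_train_os = RandomOverSampler(random_state=0).fit_resample(X_train, y_train)

# 3) train / run the feature selection, evaluating on the untouched X_valid, y_valid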

Any ideas would be appreciated!

Cheers!


Sebastian Raschka

Sep 18, 2018, 10:35:20 AM
to B K, mlxtend
Hi there,

Currently, it's unfortunately not possible to use your own validation dataset with the SequentialFeatureSelector. However, I would welcome a PR if you would like to modify the code to support this.

For instance, one could extend the API to accept a validation set as the "cv" argument, like

SequentialFeatureSelector(cv=own_validation_set), where own_validation_set is a dictionary with

own_validation_set['features'] = X_valid
own_validation_set['targets'] = y_valid

Best,
Sebastian

Sebastian Raschka

Sep 23, 2018, 12:38:48 PM
to mlxtend
Hi there,

my previous answer was incorrect. You can actually use the PredefinedSplit class in sklearn to get a fixed train/validation split. Here, you would need to combine both data subsets that you have, and then:


from sklearn.model_selection import PredefinedSplit
import numpy as np

piter = PredefinedSplit(np.arange(20))

where `np.arange(20)` has to be replaced with the indices of the test or validation fold. Then, you can use the `piter` object in the SFS:

from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.neighbors import KNeighborsClassifier

# any estimator works here; a k-NN classifier is used as an example
knn = KNeighborsClassifier(n_neighbors=3)

sfs1 = SFS(knn,
           k_features=3,
           forward=True,
           floating=False,
           verbose=2,
           scoring='accuracy',
           cv=piter)

sfs1 = sfs1.fit(X, y)

Best,
Sebastian

B K

Sep 23, 2018, 12:56:39 PM
to mlxtend
Hi!

Thanks a lot, that helps immensely. Really awesome!

Cheers

B K

Sep 23, 2018, 2:35:54 PM
to mlxtend
In case someone else wants to do this, a short update:

Using PredefinedSplit works nicely. But PredefinedSplit(my_test_fold) doesn't take the indices of the validation set; it takes an array of length len(X) consisting of 0s and -1s, where 0 marks the instances of X used for validation and -1 marks the instances used for training.
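For example, a minimal sketch (the sizes and variable names here are made up):

import numpy as np
from sklearn.model_selection import PredefinedSplit

# suppose the last 20 of 100 samples should form the fixed validation set
my_test_fold = np.full(100, -1)   # -1: only ever used for training
my_test_fold[80:] = 0             # 0: used for validation
piter = PredefinedSplit(my_test_fold)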

Cheers!

Sebastian Raschka

Sep 23, 2018, 9:21:14 PM
to B K, mlxtend
Thanks for the clarification. I had just seen this function when someone pointed it out on the mlxtend issue tracker and didn't check it carefully.

Best,
Sebastian

Sebastian Raschka

Sep 24, 2018, 11:38:13 AM
to Sebastian Raschka, B K, mlxtend
Just noticed a small "issue" (depending on what you intend to do): it does cross-validation in the sense that it rotates the training and validation split.

I wanted to use it in GridSearchCV to demonstrate model selection with simple holdout validation before explaining cross-validation in my ML class, so I went ahead and implemented a PredefinedHoldoutSplit class. It's currently only in the master branch until the new mlxtend version is released. However, you can install that version via

pip install git+git://github.com/rasbt/mlxtend.git

A quick usage example can be found here:
https://github.com/rasbt/mlxtend/blob/master/docs/sources/user_guide/evaluate/PredefinedHoldoutSplit.ipynb
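Roughly, usage looks like this (a minimal sketch, assuming the class takes the indices of the validation samples; see the notebook above for the exact API):

from mlxtend.evaluate import PredefinedHoldoutSplit
from sklearn.model_selection import GridSearchCV

# indices of the samples that should form the fixed validation set
holdout = PredefinedHoldoutSplit(valid_indices=[97, 98, 99])

# gs = GridSearchCV(estimator, param_grid, cv=holdout)
# gs.fit(X, y)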

Best,
Sebastian

Sebastian Raschka

Sep 24, 2018, 12:00:44 PM
to B K, mlxtend
> Sorry, I am not sure I understand what you mean by rotating the training and validation split. Could you clarify?

Say you have a dataset [1, 2, 3, 4, 5], which you split into [1, 2, 3] and [4, 5]. If I am not mistaken, PredefinedSplit inside the SFS will then use it as

Training: [1, 2, 3]
Validation: [4, 5]

then rotate it

Training: [4, 5]
Validation: [1, 2, 3]

And then compute the validation performance as the average. It's actually not a bad thing to do, but it may not be the intended thing. In my case, I want to use the simplest possible holdout validation (no cross-validation), since this is for teaching purposes.
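For reference, a minimal sketch of how the fold labels control this (made-up 5-sample data; whether the split rotates depends on how the fold array is labelled):

import numpy as np
from sklearn.model_selection import PredefinedSplit

# labelling both parts with fold ids rotates through them (two splits)
print(PredefinedSplit(np.array([0, 0, 0, 1, 1])).get_n_splits())     # 2

# marking the training part with -1 keeps a single fixed split
print(PredefinedSplit(np.array([-1, -1, -1, 0, 0])).get_n_splits())  # 1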

Best,
Sebastian

> On Sep 24, 2018, at 10:54 AM, B K <bernade...@gmail.com> wrote:
>
> Hi,
>
> Sorry, I am not sure I understand what you mean by rotating the training and validation split. Could you clarify?
>
> My intention was to set up the cross-validation so that the validation step does not use a subset of the over-sampled data but a sample that's untouched (just to be careful and not possibly tune the model to the over-sampled data).
>
> Is this not what the PredefinedSplit() function can be used for? Should I rather use PredefinedHoldoutSplit()?
>
> Thanks for commenting again and putting more thought into it.
>
> Cheers