Feature selection gives suspiciously high accuracy

Andrea Ivan Costantino

Mar 29, 2023, 5:59:40 AM
to CoSMoMVPA
Hi all,

I am running a classification analysis on some brain data I collected, using an LDA classifier and then computing the cross-validated distance along the linear discriminant. The task is a visual task with some higher cognitive components.

The results for one subject are shown below. The x axis shows several ROIs; for each ROI, the blue bar is the average distance for classification in the original mask (no feature selection) and the orange bar is the distance for the mask with feature selection.

The good news is that feature selection improves classifier performance across the board. The bad news is that we really would not expect any above-chance classification in control regions (i.e., non-visual regions) such as auditory or motor cortex.

Does anyone have an idea of what is going on? Any help would be very much appreciated.

Best regards,
Andrea

[Attachment: Figure 2023-03-29 114623.png]

Nick Oosterhof

Mar 29, 2023, 2:05:42 PM
to Andrea Ivan Costantino, CoSMoMVPA

Andrea Ivan Costantino

Oct 30, 2023, 7:46:19 AM
to Nick Oosterhof, CoSMoMVPA
Thanks. I am aware of the problem of double dipping (selecting features on the whole dataset lets test-set information leak into the training stage), and it seems that this is indeed what is happening here.

However, it is not clear to me how the train/test partitioning should be implemented when we want to do feature selection. I can implement it manually for each fold (sketched below, after my current code), but I feel there must be an easier way using CoSMoMVPA's native functions. This is how I am running the MVPA:


% Define labels for the data samples and other arguments needed for classification
ds.sa.targets = results.targets_table.CheckmateTarget; % use CheckmateTarget as the target labels
cosmoArgs = struct(); % initialize an empty structure to hold classification arguments
% Define the classifier function to be used (linear discriminant analysis)
cosmoArgs.classifier = @cosmo_classify_lda;
% Define how to partition the data for cross-validation
cosmoArgs.partitions = cosmo_nfold_partitioner(ds);
% Specify the type of output ('fold_accuracy' gives the accuracy for each fold)
cosmoArgs.output = 'fold_accuracy';
% Set the maximum number of features allowed in the dataset
cosmoArgs.max_feature_count = 10000;
% Run the MVPA classification
checkRes = cosmo_crossvalidation_measure(ds, cosmoArgs);
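
For reference, this is roughly what I mean by implementing it manually for each fold (just a sketch; the 50% ratio and the use of cosmo_anova_feature_selector to score features are arbitrary choices of mine):

partitions = cosmo_nfold_partitioner(ds);
nfolds = numel(partitions.train_indices);
fold_acc = zeros(nfolds, 1);
for f = 1:nfolds
    % slice out the train and test samples for this fold
    ds_train = cosmo_slice(ds, partitions.train_indices{f});
    ds_test = cosmo_slice(ds, partitions.test_indices{f});
    % select features using the training data only, so no test
    % information leaks into the selection (avoids double dipping)
    keep_count = round(.5 * size(ds_train.samples, 2)); % keep top 50% (example value)
    keep = cosmo_anova_feature_selector(ds_train.samples, ...
                                        ds_train.sa.targets, keep_count);
    % train and test the LDA classifier on the selected features only
    pred = cosmo_classify_lda(ds_train.samples(:, keep), ...
                              ds_train.sa.targets, ...
                              ds_test.samples(:, keep));
    fold_acc(f) = mean(pred(:) == ds_test.sa.targets(:));
end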

How would I use the cosmo_meta_feature_selection_classifier function here or, more generally, how would I do feature selection in this LDA analysis?

Andrea

On 29 Mar 2023, at 8:05 pm, Nick Oosterhof <n.n.oo...@googlemail.com> wrote:



Nick Oosterhof

Oct 30, 2023, 7:58:29 AM
to Andrea Ivan Costantino, CoSMoMVPA
Greetings,

> On Oct 30, 2023, at 12:46, Andrea Ivan Costantino <andreaivan...@gmail.com> wrote:
>
> […]
>
> How would I use the cosmo_meta_feature_selection_classifier function here or, more generally, how would I do feature selection in this LDA analysis?

Actually, cosmo_meta_feature_selection_classifier is deprecated; the updated function is called cosmo_classify_meta_feature_selection. Its documentation contains an example using a searchlight.

There the final line of code is:

res=cosmo_searchlight(ds_tl,nbrhood,measure,measure_args,...
'progress',false);

which, if you want to run the analysis only once on the entire dataset in ds_tl (without a searchlight), can be changed into:

res=measure(ds_tl, measure_args);

The idea is that for the measure arguments used with cosmo_classify_meta_feature_selection:
- child_classifier is used as the classifier; you would use @cosmo_classify_lda there
- feature_selector and feature_selection_ratio_to_keep define how to select the 'best' features
- other arguments, such as partitions, are passed on to the child_classifier.
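
Putting those pieces together for your case, a minimal sketch might look like this, assuming ds_tl already has .sa.targets and .sa.chunks set; the ANOVA-based selector and the ratio of .5 are just example choices:

measure = @cosmo_crossvalidation_measure;
measure_args = struct();
% meta classifier that repeats feature selection inside each training fold
measure_args.classifier = @cosmo_classify_meta_feature_selection;
measure_args.child_classifier = @cosmo_classify_lda;           % classifier run after selection
measure_args.feature_selector = @cosmo_anova_feature_selector; % example scoring function
measure_args.feature_selection_ratio_to_keep = .5;             % example value: keep the top 50%
measure_args.partitions = cosmo_nfold_partitioner(ds_tl);
res = measure(ds_tl, measure_args);

Because the selector only ever sees the training samples of each fold, classification in your control regions should drop back towards chance.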

Does that help?

