By and large, studies have used cross-validation performance as the gold
standard for model comparison. The motivation for using this measure
is that it is a proxy for the model's generalization ability: how
well it can predict labels for data it hasn't seen before.
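To make that measure concrete, here is a minimal sketch of a
cross-validated accuracy estimate; scikit-learn and the synthetic
dataset are my own stand-ins, not anything from the studies being
discussed:

# Mean hold-out accuracy over 5 folds -- the usual proxy for generalization.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Toy stand-in for real data (e.g. voxel patterns vs. condition labels).
X, y = make_classification(n_samples=200, n_features=50, random_state=0)

scores = cross_val_score(LinearSVC(), X, y, cv=5)
print("cross-validated accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))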
What we should
ultimately seek is converging evidence from these different options,
each of which has its own strengths and weaknesses.
The essence of his argument is that if we entertain enough models,
just by chance one of them might do really well on the hold-out set
even though it actually has poor generalization ability, because the
hold-out samples are corrupted versions of the underlying function
that the model is trying to learn.
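A quick toy simulation (mine, not part of the original argument) makes
the point: score a large pile of guessing-level "models" on a small
hold-out set and the best of them can look deceptively good.

import numpy as np

rng = np.random.default_rng(0)
n_holdout, n_models = 30, 1000          # small test set, many candidate models
labels = rng.integers(0, 2, n_holdout)  # true binary labels

# Every "model" guesses at random, so true generalization is exactly 50%.
guesses = rng.integers(0, 2, (n_models, n_holdout))
accuracies = (guesses == labels).mean(axis=1)

print("best hold-out accuracy among %d chance-level models: %.2f"
      % (n_models, accuracies.max()))   # typically well above 0.70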
2) It was shown quite a while ago, by Stone (1977), that leave-one-out
cross-validation as a model-selection criterion is asymptotically
equivalent to the Akaike Information Criterion (AIC). The AIC derives
from information theory; it estimates the KL-divergence between
"reality" and the model (which means you want to choose the model with
the smallest AIC). But AIC's fixed penalty on the number of parameters
is known to be lenient toward extra complexity (compared with, say,
BIC), so asymptotically, models selected by leave-one-out
cross-validation will be insufficiently conservative about model
complexity.
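For reference, the criterion in question is just a penalized
log-likelihood; here is a tiny sketch with made-up numbers:

# AIC for a model with maximized log-likelihood logL and k fitted parameters
# (smaller is better).
def aic(log_likelihood, n_params):
    return 2 * n_params - 2 * log_likelihood

# Two hypothetical fits: the richer model must gain enough likelihood to pay
# for its extra parameters, otherwise the simpler one wins.
print(aic(log_likelihood=-120.0, n_params=5))   # 250.0
print(aic(log_likelihood=-118.5, n_params=12))  # 261.0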
I will venture
to make the tendentious assertion that if we are to be biased in one
direction, we should be biased AGAINST complex models. This is the
heart of Occam's Razor, and it falls naturally out of Bayesian statistics
(as long as you don't use nonsensical priors).
1) The subset of features contains enough information that you can
achieve high classification accuracy (e.g. Haxby et al., 2001).
2) Classifier/model X presented here shows how information is encoded
in the brain. This includes things like saying that a voxel
contains more information because it is assigned a higher weight by
the discriminant, which is really a statement about a particular set of
parameters fit by your model. (I am not trying to argue against making
or publishing importance maps, just urging caution against
over-interpreting them.)
3) Following Kay et al. (Nature, 2008), one could imagine the following
theory-testing framework. If you have two theories that predict
different sets of variables the brain might be using for its internal
representation, use their approach to engineer two different models
based on those theoretical assumptions, and then do some kind of
principled model comparison, e.g. one of the various Bayesian
approaches (see the sketch after this list).
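To give a rough idea of what such a comparison could look like, here is
a sketch that scores two linear encoding models with BIC, a crude
stand-in for a fuller Bayesian comparison; the feature matrices
theory_A / theory_B and the simulated responses are placeholders of my
own, not anything from Kay et al.:

import numpy as np

rng = np.random.default_rng(1)
n = 300
theory_A = rng.normal(size=(n, 20))   # predictors one theory says the brain uses
theory_B = rng.normal(size=(n, 40))   # predictors a second theory says it uses
response = theory_A @ rng.normal(size=20) + rng.normal(size=n)  # simulated response

def bic_linear(X, y):
    # BIC for ordinary least squares with Gaussian noise; lower is better.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    k = X.shape[1] + 1                # coefficients plus the noise variance
    return len(y) * np.log(rss / len(y)) + k * np.log(len(y))

print("BIC, theory A: %.1f" % bic_linear(theory_A, response))
print("BIC, theory B: %.1f" % bic_linear(theory_B, response))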
For 1) I am less convinced that cross-validation is a problem, because
to make the point it is sufficient to show one classifier that works.
For 2) I think the criticism is more appropriate, but I think we
should move towards 3) and compare several different theoretically
motivated models, which makes more sense to me than trying to infer
anything from one particular model's parameters (especially with only
a point estimate of them).
> This is the
> heart of Occam's Razor, and it falls naturally out of Bayesian statistics
> (as long as you don't use nonsensical priors).
I was reading an interesting discussion on Andrew Gelman's blog
yesterday arguing against parsimony.
http://www.stat.columbia.edu/~cook/movabletype/archives/2004/12/against_parsimo.html
> Permutation tests will work for
> this purpose, as will a few analytical shortcuts. I'd hesitate to inflict my thesis
> work on the list, but I'm trying to write a nicer version :)
I am looking forward to this, as although I trust permutation testing,
it can be rather taxing on computational resources to run all the
models thousands of times.
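(For anyone following along, the kind of permutation test in question
looks roughly like the sketch below -- my own illustration on synthetic
data -- which is exactly why it is costly: every permutation is a full
refit, and real analyses use 1000 or more of them.)

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=100, n_features=30, random_state=0)

# Observed cross-validated accuracy with the true labels.
observed = cross_val_score(LinearSVC(), X, y, cv=5).mean()

# Null distribution: same pipeline, labels shuffled each time.
null = [cross_val_score(LinearSVC(), X, rng.permutation(y), cv=5).mean()
        for _ in range(200)]

p_value = (np.sum(np.array(null) >= observed) + 1) / (len(null) + 1)
print("accuracy %.2f, permutation p = %.3f" % (observed, p_value))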
--
Vicente Malave
Let me make up a silly analogy -- no matter how many generic,
mass-produced Ford Focuses you take, none of them would ever beat a
stock Lamborghini Diablo ;-) unless the Diablo is broken, or unless
you tune your Focus for that specific race.
The point is that your linear classifier would probably never
perform well 'by chance' on a non-linear decision surface (unless
that surface is degenerate and piecewise linear in the neighborhood
of interest).
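(A toy demonstration of that point, mine rather than anything from the
thread: an XOR-style problem stands in for a non-linear decision
surface.)

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC, LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (np.sign(X[:, 0]) == np.sign(X[:, 1])).astype(int)   # XOR-style labels

# The linear classifier stays at chance; the non-linear one does fine.
print("linear SVM:", cross_val_score(LinearSVC(), X, y, cv=5).mean())        # ~0.5
print("RBF SVM:   ", cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean())  # ~1.0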
You touch on an interesting issue that I didn't mention. If I
understand you correctly, you are making a distinction between
"comparison between models" and "comparison between a model and
chance". I don't believe in such magical things... "Chance" is just
another model.
There are situations where chance cannot be
conveniently specified in terms of an unbiased coin or an
equiprobable multinomial distribution.
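(One small example of my own: with imbalanced classes, the sensible
"chance" model is not a fair coin but a majority-class predictor, and
its accuracy is already well above 50%.)

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (rng.random(300) < 0.25).astype(int)   # roughly 75/25 class imbalance

# A null model that always predicts the majority class: this is the
# baseline any real classifier has to beat, not 0.5.
baseline = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5)
print("accuracy of the 'chance' model: %.2f" % baseline.mean())   # ~0.75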
If I have two models with identical cross-validation performance, is
there any way to know which one is better? My point is that two such
models do not necessarily beat "chance" by the same margin, because our
notion of chance is determined by our prior beliefs and experience,
which could differ for each model.