The case against cross-validation


Sam Gershman

Aug 20, 2008, 8:16:33 PM
to Princeton MVPA Toolbox for Matlab
I'd like to draw attention to an issue that hasn't really been
addressed in the MVPA literature. It regards the way in which we
select between models, where I construe model broadly as encompassing
learning algorithm, feature selection, pre-processing, etc. By and
large, studies have used cross-validation performance as the gold
standard for model comparison. The motivation for using this measure
is that it is a proxy for the generalization ability of the model: how
well it can predict labels for data it hasn't seen before. What I
want to argue is not that cross-validation is bad or incorrect, but
that there are other options for model comparison and what we should
ultimately seek is converging evidence from these different options,
which have various strengths and weaknesses. I'll make two theoretical
arguments and one more philosophical argument.

1) Models that minimize the cross-validation error are not necessarily
the ones that have the best generalization ability. This assertion
comes from an astonishing paper by Andrew Ng: http://ai.stanford.edu/~ang/papers/cv-final.pdf.
The essence of his argument is that if we entertain enough models,
just by chance one of them might do really well on the hold-out set,
even if it actually has poor generalization ability, because the hold-
out samples are corrupted versions of the underlying function that the
model is trying to learn. Hence, the model that minimizes cross-
validation error may end up fitting noise on the hold-out samples.
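
A toy simulation makes the point concrete (this is just a sketch with
made-up numbers, not Ng's analysis): every candidate model below is at
chance, yet the best hold-out accuracy among them looks better and better
the more models we entertain.

    import numpy as np

    rng = np.random.default_rng(0)
    n_test = 40       # hypothetical hold-out set size
    true_acc = 0.5    # every candidate model is actually at chance

    for n_models in (1, 10, 100, 1000):
        # each model's hold-out accuracy is a binomial draw around its true accuracy
        holdout_acc = rng.binomial(n_test, true_acc, size=n_models) / n_test
        print(n_models, "models -> best hold-out accuracy:", holdout_acc.max())

The winner's hold-out score says more about how many models were tried
than about its generalization ability.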

2) It was shown quite a while ago, by Stone (1977), that leave-one-out
cross-validation as a model-selection criterion is asymptotically
equivalent to the Akaike Information Criterion (AIC). The AIC derives
from information theory; it converges to the KL-divergence between
"reality" and the model (which means you want to choose the model with
the smallest AIC). This leads naturally to the question: is AIC a good
model-comparison measure? Often AIC is compared with the Bayesian
Information Criterion (BIC), which has a very different derivation.
The BIC converges asymptotically to the marginal likelihood (the
denominator in Bayes rule). Nonetheless, the equations for BIC and AIC
are quite similar, the main difference being that the BIC penalty for
complex models is harsher than that for AIC. Burnham and Anderson
(http://smr.sagepub.com/cgi/content/abstract/33/2/261) observed that
the AIC can be derived as a special case of the BIC for a certain
prior. However, it strikes me that this "prior" is a bastardization of
the notion of a prior because it is a function of the number of
datapoints in the training set, which are not actually part of your
prior experience. This has the effect of overstating the complexity of
your prior experience, and thus artificially licensing more complex
models. So asymptotically, models selected by cross-validation will be
insufficiently conservative about model complexity.
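
For reference, the two criteria side by side (standard textbook
definitions, sketched in Python with made-up numbers): both are -2 times
the log-likelihood plus a complexity penalty, but the BIC penalty grows
with the number of observations n, so it is harsher whenever log(n) > 2,
i.e. n greater than about 7.

    import numpy as np

    def aic(log_likelihood, k):
        # k = number of free parameters
        return 2 * k - 2 * log_likelihood

    def bic(log_likelihood, k, n):
        # n = number of observations
        return k * np.log(n) - 2 * log_likelihood

    # e.g. a model with 10 free parameters fit to 200 datapoints
    print(aic(-500.0, k=10), bic(-500.0, k=10, n=200))  # BIC penalizes it more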

3) Given that cross-validation is biased towards more complex models,
it's worth considering that the onus is on us, as scientists, to find
the simplest explanations of natural phenomena, and so I will venture
to make the tendentious assertion that if we are to be biased in one
direction, we should be biased AGAINST complex models. This is the
heart of Occam's Razor, and falls naturally out of Bayesian statistics
(as long as you don't use non-sensical priors).

So what are the other options?

One is to start trying to compute the marginal likelihood for various
models (a tough thing to do if yours is not probabilistic, or you use
something like RFE). There are numerous (approximate) ways to do this
that I won't get into here. We can also use the asymptotic measures
like AIC and BIC mentioned above. As I said, I think the best strategy
in the end will be to seek converging evidence from all of these for a
particular model, and when they disagree, to look more closely at what
in the data is causing the disagreement (a ripe opportunity for
serendipitous discoveries!).

Sam

Yaroslav Halchenko

Aug 20, 2008, 10:58:50 PM
to Princeton MVPA Toolbox for Matlab
> 1) Models that minimize the cross-validation error are not necessarily
> the ones that have the best generalization ability. This assertion
> comes from an astonishing paper by Andrew Ng: http://ai.stanford.edu/~ang/papers/cv-final.pdf.
> The essence of his argument is that if we entertain enough models,
> just by chance one of them might do really well on the hold-out set,
> even if it actually has poor generalization ability, because the hold-
> out samples are corrupted versions of the underlying function that the
> model is trying to learn. Hence, the model that minimizes cross-
> validation error may end up fitting noise on the hold-out samples.
Well -- if you carry out cross-validation not on a single hold-out set
(and not on just a few hypothesized sets), but on all possible ones, and
if you ensure that those sets are identically distributed (or close to
it) and independent (i.e. like independent runs or even separate sessions
in an fMRI experiment with a randomized design), then I don't see how
you could fit noise on all (or most) hold-out samples.

Another useful strategy is to compare the best achieved result to the
null distribution obtained on the same data with permuted labels -- i.e.
carrying out the same training/model_selection but with permuted labels
(a sufficient number of times to get a reliable estimate of the null
distribution). (We do have some hooks for that in PyMVPA, although the
implementation might not be the most efficient ... yet.) The same
approach is also valid for feature sensitivity estimation or feature
selection.
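
A minimal sketch of that label-permutation idea (independent of PyMVPA's
actual hooks, with made-up data and scikit-learn standing in for whatever
classifier you use):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(80, 200))   # fake data: 80 trials x 200 voxels
    y = np.repeat([0, 1], 40)        # two conditions

    cv = StratifiedKFold(n_splits=8)
    clf = LogisticRegression(max_iter=1000)
    observed = cross_val_score(clf, X, y, cv=cv).mean()

    # null distribution: same cross-validation, shuffled labels
    # (200 permutations here to keep it quick; more gives a finer null)
    null = np.array([cross_val_score(clf, X, rng.permutation(y), cv=cv).mean()
                     for _ in range(200)])
    p_value = (np.sum(null >= observed) + 1) / (len(null) + 1)
    print(observed, p_value)

If any selection (feature selection, parameter tuning) is done on the real
labels, the same selection has to be redone inside each permutation,
otherwise the null is too easy to beat.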

Francisco Pereira

Aug 21, 2008, 7:43:00 AM
to mvpa-t...@googlegroups.com
I think there is indeed a serious issue in the points that you raise, but I
think it's more related to model selection (and model comparison) than to
using cross-validation :) I entirely agree with Yaroslav's points, and will add
a few things.


On Wed, Aug 20, 2008 at 8:16 PM, Sam Gershman <sam.ge...@gmail.com> wrote:
> By and large, studies have used cross-validation performance as the gold
> standard for model comparison. The motivation for using this measure
> is that it is a proxy for the generalization ability of the model: how
> well it can predict labels for data it hasn't seen before.

In general, people tend to compare accuracy results (from cross-validation
or obtained by simply splitting into train and test sets) without necessarily
considering that they are *estimates* of the true accuracy (the probability
that the model would label any new example correctly). As estimates, they
have an uncertainty that is a function of how large the test set is; hence,
the comparison has to take that into account, e.g. through confidence intervals.
A good reference on this is the pair of tech reports starting with "small sample statistics":
http://www.ics.uci.edu/~dan/pub.html

and a summary article on confidence intervals of binomial probabilities
(which is essentially what we are trying to estimate) is

http://www.ppsw.rug.nl/~boomsma/confbin.pdf

(see, in particular, the reference Agresti/Coull 1998 in that)
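
As a rough illustration of treating test accuracy as a binomial estimate,
here is a Wilson score interval (one of the intervals discussed in that
literature) for k correct predictions out of n test examples; the numbers
are made up:

    import math

    def wilson_interval(k, n, z=1.96):  # 95% interval
        p_hat = k / n
        denom = 1 + z**2 / n
        centre = (p_hat + z**2 / (2 * n)) / denom
        half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
        return centre - half, centre + half

    # 42 correct out of 60 test trials: point estimate 0.70, but the interval
    # is wide enough that a competitor at 0.65 is not obviously worse
    print(wilson_interval(42, 60))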
 
Cross-validation is an attempt to estimate what the classifier accuracy
would be if you could train using the entire dataset (or almost all of it, if
you consider the limit case of leave-one-out cross-validation).
You could point out a different (philosophical) problem with cross-validation:
that the classifier learnt from each fold is slightly different from the ones learnt
on other folds, and hence the estimate does not pertain to any single classifier.
The classifiers are very similar, as they share almost all the training data, so
this is generally glossed over.

There are several other ways of getting an estimate of true accuracy and uncertainty
around it. A really good introduction to these is
http://www.hunch.net/~jl/projects/prediction_bounds/tutorial/langford05a.pdf

Note that some of the approaches there make statements about
the true accuracy from the training accuracy, and those might be
a more interesting thing to contrast with the Bayesian approaches you mention.

> what we should
> ultimately seek is converging evidence from these different options,
> which have various strengths and weaknesses.

All for that.  :)

> The essence of his argument is that if we entertain enough models,
> just by chance one of them might do really well on the hold-out set,
> even if it actually has poor generalization ability, because the hold-
> out samples are corrupted versions of the underlying function that the
> model is trying to learn.

As Yaroslav pointed out, you will actually get to test on the entire dataset;
you don't need a separate hold-out set. Because the classifier is being tested on
examples it hasn't seen before, the accuracy is an unbiased estimate.
Because the test set is finite, there's uncertainty (with an infinite test set
it would be the true accuracy).

But you raise an important issue: if you train enough classifiers, some of
them *will* do well by chance. And generally people train many, by
using various voxel selection methods, selecting different numbers of voxels
or, of course, trying to learn different kinds of classifier on the products of
the previous choices.

This is, however, a multiple-comparisons problem rather than a problem with
cross-validation. If the question you want to ask is "is there any combination
of choices that gives me a result that indicates a classifier learnt something",
you *will* need to address the problem.

Yaroslav also hits the nail on the head there. Permutation tests will work for
this purpose, as will a few analytical shortcuts. I'd hesitate to inflict my thesis
work on the list, but I'm trying to write a nicer version :)
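
One simple way to fold the selection step into the test is a max-statistic
permutation scheme (a schematic with made-up accuracies, not the analytical
shortcuts just mentioned): for each permutation, rerun every pipeline and
keep only the best accuracy, then compare the best accuracy on the real
labels against that null.

    import numpy as np

    rng = np.random.default_rng(0)
    n_pipelines, n_perms, n_test = 50, 1000, 60
    observed_best = 0.68   # best accuracy over all pipelines on the real labels (made up)

    # pretend every pipeline is at chance on permuted labels; in practice these
    # accuracies come from rerunning the full analysis on each permutation
    perm_acc = rng.binomial(n_test, 0.5, size=(n_perms, n_pipelines)) / n_test
    null_of_max = perm_acc.max(axis=1)   # best pipeline per permutation
    p = (np.sum(null_of_max >= observed_best) + 1) / (n_perms + 1)
    print(p)

With these made-up numbers, a best accuracy that would look significant
for a single pre-specified pipeline is unremarkable once the search over
pipelines is taken into account.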


> 2) It was shown quite a while ago, by Stone (1977), that leave-one-out
> cross-validation as a model-selection criterion is asymptotically
> equivalent to the Akaike Information Criterion (AIC). The AIC derives
> from information theory; it converges to the KL-divergence between
> "reality" and the model (which means you want to choose the model with
> the smallest AIC). [...] So asymptotically, models selected by
> cross-validation will be insufficiently conservative about model complexity.

I think there are two points to address there. The first is "asymptotically"; our
datasets are not just finite, they are very small, so many asymptotic results either
do not apply or apply with caveats or in particular regions of parameter space
(again, Agresti/Coull 1998 is eye-opening on this topic).

The second thing is what question you want to ask. Most people want to ask the
question above, which is not about model selection: it's about using the accuracy
of the models tried as "sensors" of whether there is information in the data.

I think a more interesting thing to discuss would be the extent to which we
can say that any of the models we currently have actually corresponds with
fMRI data and neuroscience knowledge in any way. At one extreme, we
have discriminative models whose sole purpose is to learn to distinguish
classes. Those can be modified to incorporate prior knowledge (brain locations,
smoothness of weight maps, etc). This leads all the way to complex generative
models that could, in principle, model the entire pattern of activation in any
class.

The problem here is that you are viewing cross-validation as a selection criterion
that allows ordered comparisons between models. I tend to think - conservatively -
that the most I can say from cross-validation is that a model is "good enough" if it
can predict, with the multiple-testing caveats. In general, I can learn many
different models, with more or fewer assumptions, that predict equally well
(up to accuracy estimate uncertainty, of course).

> I will venture
> to make the tendentious assertion that if we are to be biased in one
> direction, we should be biased AGAINST complex models. This is the
> heart of Occam's Razor, and falls naturally out of Bayesian statistics
> (as long as you don't use non-sensical priors).

That is what we have to do in practice, anyway, even if the goal is
to merely classify. I don't think anyone will use a nonlinear classifier
if a linear one suffices :)

Now, the question that interests me is whether it's scientifically legit to
say that a model A that predicts as well as model B (as measured by
cross-validation accuracy) is "better" if it incorporates/is constrained by
more neuroscience knowledge. Perhaps we could discuss this?

cheers,
Francisco

Francisco Pereira

Aug 21, 2008, 7:47:42 AM
to mvpa-t...@googlegroups.com
I forgot to add that you and others interested in this should get into/check the
archives of the comp-neuro mailing list for the last month or so. There is an ongoing
discussion of various issues in what I think one could call neuroscience epistemology,
with several well-known scientists participating. It's especially nice because it is way
less formal than one might expect and people are quite open to stating somewhat
polemic beliefs :)

cheers,
Francisco

Yaroslav Halchenko

Aug 21, 2008, 11:26:48 AM
to Princeton MVPA Toolbox for Matlab
Very well done, Francisco -- thanks for plenty of nice references. Let
me just make a more "optimistic" statement in regard to sampling the
space of classifiers (or possible models), i.e.

> But you raise an important issue: if you train enough classifiers, some of
> them *will* do well by chance. And generally people train many, by
> using various voxel selection methods, selecting different numbers of voxels
> or, of course, trying to learn different kinds of classifier on the products
> of the previous choices.
let me make up a silly analogy -- no matter how many generic
manufactured Ford Focuses you take, none of them would ever beat a
stock Lamborghini Diablo ;-) unless the Diablo is broken, or unless you
boost your Focus for that specific race.
The point is that your linear classifier would probably never perform
well 'by chance' on a non-linear decision surface (unless that surface
is degenerate and piece-wise linear in the neighborhood of interest).

Vicente Malave

Aug 21, 2008, 11:29:32 AM
to mvpa-t...@googlegroups.com
I think before getting into model selection we need to think about
what kind of argument we are trying to make with our papers, of which
there are several in the mvpa literature:

1) The subset of features contains enough information that you can achieve
high classification accuracy (e.g. Haxby et al., 2001)

2) Classifier/model X presented here shows how information is encoded
in the brain. This includes things like talking about how a voxel
contains more information because it is assigned a higher weight by
the discriminant, which is a statement about a particular set of
parameters fit by your model. (I am not trying to argue against making
or publishing importance maps, just urging caution against
over-interpretation of them.)

3) Following Kay et al. (Nature, 2008), one could imagine the following
theory-testing framework: if you have two theories that predict different
sets of variables the brain might be using for its internal representation,
use their approach to engineer two different models based on your
theoretical assumptions, and then do some kind of reasonable model
comparison, e.g. via the various Bayesian approaches.

For 1) I am less convinced that cross validation is a problem, because
to make the point it is sufficient to show one classifier that works.
For 2) I think the criticism is more appropriate, but I think we
should move towards 3) and do comparison of several different
theoretically motivated models, which I think makes more sense than
trying to infer anything from one particular model's parameters (esp
with a point estimate of them).

> This is the
> heart of Occam's Razor, and falls naturally out of Bayesian statistics
> (as long as you don't use non-sensical priors).

I was reading this interesting discussion on Andrew Gelman's blog
earlier yesterday arguing against parsimony.
http://www.stat.columbia.edu/~cook/movabletype/archives/2004/12/against_parsimo.html

> Permutation tests will work for
> this purpose, as will a few analytical shortcuts. I'd hesitate to inflict my thesis
> work on the list, but I'm trying to write a nicer version :)

I am looking forward to this, as although I trust permutation testing,
it can be kind of taxing on various resources to run all the models
thousands of times.

--
Vicente Malave

Francisco Pereira

Aug 21, 2008, 11:35:46 AM
to mvpa-t...@googlegroups.com
> let me make up a silly analogy -- no matter how many generic
> manufactured Ford Focuses you take, none of them would ever beat a
> stock Lamborghini Diablo ;-) unless the Diablo is broken, or unless you
> boost your Focus for that specific race.

There's also the option of mobbing the Diablo with the Fords as
they get lapped, but that might be pushing the analogy a bit far ;) 
 
> The point is that your linear classifier would probably never perform
> well 'by chance' on a non-linear decision surface (unless that surface
> is degenerate and piece-wise linear in the neighborhood of interest).

I agree. Mostly I just like to encourage people to train a linear classifier first,
on grounds of simplicity and interpretability, but also to avoid overfitting and
the need to control complexity or explore a larger parameter space (with
nested cross-validation to set parameters, say).
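
A minimal sketch of that nested scheme (made-up data, scikit-learn as a
stand-in): the inner loop picks the regularization strength, the outer
loop provides the accuracy estimate, so no parameter is ever tuned on
data it is later tested on.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

    rng = np.random.default_rng(1)
    X = rng.normal(size=(80, 200))   # fake data: 80 trials x 200 voxels
    y = np.repeat([0, 1], 40)

    # inner loop: choose C by 4-fold cross-validation on the training portion
    inner = GridSearchCV(LogisticRegression(max_iter=1000),
                         {"C": [0.01, 0.1, 1.0, 10.0]},
                         cv=StratifiedKFold(n_splits=4))
    # outer loop: 8-fold cross-validation gives the accuracy estimate
    outer_scores = cross_val_score(inner, X, y, cv=StratifiedKFold(n_splits=8))
    print(outer_scores.mean())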

cheers,
Francisco

Sam Gershman

Aug 21, 2008, 12:42:35 PM
to Princeton MVPA Toolbox for Matlab
Francisco,

You touch on an interesting issue that I didn't mention. If I
understand you correctly, you are making a distinction between
"comparison between models" and "comparison between a model and
chance". I don't believe in such magical things... "Chance" is just
another model. There are situations where chance cannot be
conveniently specified in terms of an unbiased coin, or an
equiprobable multinomial distribution. Moreover (and this gets to
your last point), what we define as chance depends on our prior
experience. For example, if you bring a child to a magic show for the
first time, he is astounded that the magician can correctly determine
the top card in the deck. For him, never having seen a magician
before, his prior probability of correctly determining the card is 1/52.
The magician did something very special relative to "chance." But you,
as a veteran of magic shows, have a very different prior belief about
chance in this case.

If I have two models with identical cross-validation performance, is
there any way to know which one is better? My point is that two such
models are not necessarily equally better than "chance" because our
notion of chance is determined by our prior beliefs and experience,
which could differ for each model. Let me give a concrete example.
Research on "default mode" activity has shown that when people are
supposedly doing nothing, their brains are quite active. Suppose I build a
model incorporating this information and it gets the same cross-
validation performance as a model without that information. Let's also
say that they both assign the same likelihood to the data. Then I
should prefer the default mode model. Why? Because the default mode
model is simpler: it concentrates its probability mass around a
smaller number of possible activation patterns. If you actually
compute the marginal likelihood, you will get this result. The
alternative model, being more flexible, predicts a wider range of
patterns. It is closer to our intuitive notion of "chance," which
assigns equal probability to all possible patterns.
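
A toy, discrete version of that argument (nothing to do with real fMRI
models): both models can "explain" the observed pattern, but the simpler
one spreads its prior predictive mass over fewer possible patterns, so
its marginal likelihood is higher.

    # each model puts a uniform prior predictive over a set of activation patterns
    patterns_A = set(range(10))    # the "default mode" model commits to 10 patterns
    patterns_B = set(range(100))   # the flexible model hedges over 100 patterns
    observed = 3                   # the pattern we actually saw

    marginal_A = 1 / len(patterns_A) if observed in patterns_A else 0.0
    marginal_B = 1 / len(patterns_B) if observed in patterns_B else 0.0
    print(marginal_A, marginal_B)  # 0.1 vs 0.01 -> a Bayes factor of 10 for model A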

Sam

Francisco Pereira

Aug 21, 2008, 12:57:24 PM
to mvpa-t...@googlegroups.com
On Thu, Aug 21, 2008 at 12:42 PM, Sam Gershman <sam.ge...@gmail.com> wrote:

> You touch on an interesting issue that I didn't mention. If I
> understand you correctly, you are making a distinction between
> "comparison between models" and "comparison between a model and
> chance". I don't believe in such magical things... "Chance" is just
> another model.

Oh, I agree with this :) Mostly I wanted to stress that what people are typically
doing is that comparison to "chance", that cross-validation is used to do it and
that model selection is a different problem.

I honestly don't think that's the best question to ask of the data, or that it is
generally asked well. To give an example: with enough data, 55% accuracy at
distinguishing two classes is significantly different from chance (50%), yet I
don't think that has much scientific use.
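
To make that concrete, here is the one-sided binomial test of 55% accuracy
against chance for a few (made-up) test-set sizes; the p-value keeps
shrinking even though the effect stays the same size.

    from scipy.stats import binom

    for n in (100, 400, 1600):
        k = int(0.55 * n)              # observed correct predictions
        p = binom.sf(k - 1, n, 0.5)    # P(at least k correct) under chance
        print(n, k, round(p, 5))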
 
> There are situations where chance cannot be
> conveniently specified in terms of an unbiased coin, or an
> equiprobable multinomial distribution.

Agreed, which is why <insert creation model author here> saw fit to
give us permutation tests ;) (even though they will still require a model
of the scenario where labels are exchangeable, say).

> If I have two models with identical cross-validation performance, is
> there any way to know which one is better? My point is that two such
> models are not necessarily equally better than "chance" because our
> notion of chance is determined by our prior beliefs and experience,
> which could differ for each model.

Exactly. That's what I think is the interesting discussion to have.
Still reading through the comp-neuro discussion, where they are  going
over this for lower-level computational neuroscience...

> Let's also
> say that they both assign the same likelihood to the data. Then I
> should prefer the default mode model. Why? Because the default mode
> model is simpler: it concentrates its probability mass around a
> smaller number of possible activation patterns.

I should probably note that I like Bayesian model selection as much as you do,
lest you fear I harbour some kind of frequentist grudge. That said, I'd never trust
a model if I hadn't seen how it performed on unseen data ;)

cheers,
Francisco

