Leave-one-out validation and best model in linear discriminant function analysis


Felix_B

Feb 28, 2011, 6:26:32 AM
to MedStats
Dear all,

I want to compare the performance of linear discriminant function
analysis (DFA) with regularized DFA. The most commonly recommended
method to validate the classification procedure is the leave-one-out
method (or alternatively n-fold cross-validation). However, in each of
my training samples I will get slightly different parameter estimates.
How do I select the best model? Do I, e.g., average the parameters?

Thank you for your help in advance
Best wishes, Felix

Frank Harrell

Mar 6, 2011, 5:56:33 PM
to MedStats
Leave-one-out cross-validation is slow and is sometimes not as good as
the bootstrap or 100 repeats of 10-fold cross-validation.

Expect the estimates to differ. You are trying to validate a 'final'
model using this procedure. Don't replace the final model with an
average of the intermediate models.
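Frank's point can be sketched in code. The following is a minimal numpy sketch (the toy data, the two-class LDA helper, and all function names are mine, not from the thread): repeated k-fold cross-validation estimates the error of the *procedure*, while the reported final model is refit once on all the data, never assembled by averaging the fold-wise fits.

```python
import numpy as np

def fit_lda(X, y):
    """Two-class linear discriminant: class means plus pooled covariance."""
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    S = (np.cov(X[y == 0].T) * (sum(y == 0) - 1) +
         np.cov(X[y == 1].T) * (sum(y == 1) - 1)) / (len(y) - 2)
    w = np.linalg.solve(S, m1 - m0)      # discriminant direction
    c = w @ (m0 + m1) / 2                # midpoint cutoff (equal priors)
    return w, c

def predict_lda(model, X):
    w, c = model
    return (X @ w > c).astype(int)

def repeated_kfold_error(X, y, k=10, repeats=10, seed=0):
    """Cross-validated error of the fitting PROCEDURE, averaged over
    `repeats` random partitions into `k` folds."""
    rng = np.random.default_rng(seed)
    n, errs = len(y), []
    for _ in range(repeats):
        idx = rng.permutation(n)
        for fold in np.array_split(idx, k):
            train = np.setdiff1d(idx, fold)
            model = fit_lda(X[train], y[train])
            errs.append(np.mean(predict_lda(model, X[fold]) != y[fold]))
    return float(np.mean(errs))

# toy data: two Gaussian classes in 2D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.repeat([0, 1], 50)

cv_error = repeated_kfold_error(X, y)
final_model = fit_lda(X, y)   # the model you report: fit on ALL the data
```

The intermediate fold models exist only to produce `cv_error`; they are discarded afterwards.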

LDA is regarded as obsolete in some circles, compared with logistic
regression.

Frank

Felix_B

Mar 7, 2011, 3:18:15 PM
to MedStats
Dear Frank, thanks for your answer. Yes, I realized that I should
estimate my model with all the data because that will give the best
estimate. I am not sure what the cross-validated error means. Is that
an estimate of the error if I construct a model with 90% of my data?
I guess I should go to the library and have a look at your book.
Logistic regression is not an option in my case because I have a
relatively small sample size and need to use a regularized DFA.
Best wishes, Felix

Juliet Hannah

Mar 7, 2011, 4:10:28 PM
to meds...@googlegroups.com
At each step of cross-validation, some samples are held out. The model
is estimated on the retained samples, and accuracy is assessed on
those held out. So if 10% is held out at each step, you go through 10
steps to get the overall error.
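The bookkeeping Juliet describes can be shown in a few lines of plain Python (the round-robin fold assignment below is just one illustrative way to partition; it is not from the thread): each sample is held out exactly once, and the overall error pools the held-out results from all 10 steps.

```python
# 10-fold split: every sample is held out in exactly one of the 10 steps
n, k = 100, 10
indices = list(range(n))
folds = [indices[i::k] for i in range(k)]   # round-robin fold assignment

held_out = sorted(i for fold in folds for i in fold)
assert held_out == indices                  # each sample in exactly one fold

for fold in folds:
    train = [i for i in indices if i not in set(fold)]
    assert len(train) == 90 and len(fold) == 10
    # fit the model on `train` (90%), record errors on `fold` (the held-out
    # 10%); after all 10 steps, pool the held-out errors for the CV error
```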

Frank's book explains why classification accuracy is not the way to go. This
measure is the most common summary with the data sets I work with now, so
I'm still learning about this aspect.

There are regularized versions of logistic regression, and Frank's rms
package fits penalized logistic regression. But there are also several
versions of penalized LDA, which are popular in microarray expression
analysis.

Many statisticians use LDA and versions of it for microarray analysis.
This, coupled with Frank's comments on LDA in this thread and in his
book, presents another unresolved area for me to understand.

In addition to the references you have been given, I would read some
of Richard Simon's papers on cross-validation in microarray studies.
The ideas are not specific to microarrays, but this is an area in
which cross-validation was abused, so a lot can be learned.

> --
> To post a new thread to MedStats, send email to MedS...@googlegroups.com .
> MedStats' home page is http://groups.google.com/group/MedStats .
> Rules: http://groups.google.com/group/MedStats/web/medstats-rules
>

Frank Harrell

Mar 7, 2011, 4:20:22 PM
to MedStats
Use penalized maximum likelihood estimation with the logistic model
[data mining folks took a perfectly acceptable term 'penalization' and
turned it into 'regularization' which makes less sense].
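Penalized maximum likelihood for the logistic model can be sketched with a quadratic (ridge-type) penalty and Newton iterations; Frank's rms package does this properly, and the numpy version below (toy data, function name, and the choice not to penalize the intercept are my assumptions, not from the thread) only illustrates the idea.

```python
import numpy as np

def penalized_logistic(X, y, lam=1.0, iters=25):
    """L2-penalized MLE for logistic regression via Newton's method.
    The intercept (first design column) is left unpenalized."""
    n, p = X.shape
    Xd = np.hstack([np.ones((n, 1)), X])     # add intercept column
    P = lam * np.eye(p + 1)
    P[0, 0] = 0.0                            # do not penalize the intercept
    beta = np.zeros(p + 1)
    for _ in range(iters):
        mu = 1 / (1 + np.exp(-Xd @ beta))    # fitted probabilities
        W = mu * (1 - mu)                    # IRLS weights
        grad = Xd.T @ (y - mu) - P @ beta    # penalized score
        hess = Xd.T @ (Xd * W[:, None]) + P  # penalized information
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# toy data from a known logistic model
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (rng.random(200) < 1 / (1 + np.exp(-(X @ [1.0, -1.0, 0.5])))).astype(float)

beta_mle = penalized_logistic(X, y, lam=1e-8)   # ~ ordinary MLE
beta_pen = penalized_logistic(X, y, lam=10.0)   # coefficients shrunk to zero
```

Larger `lam` shrinks the slope coefficients toward zero, which is what stabilizes the fit in small samples.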

Frank

Felix_B

Mar 9, 2011, 11:51:40 AM
to MedStats
Thanks Juliet and Frank for the useful information!
Best wishes, Felix

Basilio de Bragança Pereira

Mar 9, 2011, 6:39:01 PM
to meds...@googlegroups.com
Actually the term regularization is much older and is due to the
mathematician Andrey Nikolayevich Tikhonov; it covers many statistical
results like ridge regression, the lasso, AIC, etc. Tikhonov was
awarded the Lenin Prize for his work on ill-posed problems in 1966.
It is not a data mining term.
Basilio
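The connection Basilio draws is concrete in ridge regression, the simplest statistical instance of Tikhonov regularization (the toy data below is mine, not from the thread): adding lambda to the diagonal of X'X makes an ill-posed least-squares problem well-posed.

```python
import numpy as np

# Tikhonov regularization as ridge regression:
#   beta = (X'X + lam*I)^{-1} X'y
# well-defined even when X'X is singular (here: a duplicated column).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
X = np.hstack([X, X[:, :1]])                 # duplicate a column -> X'X singular
y = X[:, 0] + rng.normal(scale=0.1, size=30)

lam = 1.0
p = X.shape[1]
beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```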




--

Basilio de Bragança Pereira, DIC and PhD (Imperial College), DL (COPPE)
*UFRJ-Federal University of Rio de Janeiro
*Titular Professor of Biostatistics and of Applied Statistics
*FM-School of Medicine and COPPE-Postgraduate School of Engineering and
HUCFF-University Hospital Clementino Fraga Filho.

*Tel: 55 21 2562-7045/7047/2618/2558
www.po.ufrj.br/basilio/

*MailAddress:
COPPE/UFRJ
Caixa Postal 68507
CEP 21941-972 Rio de Janeiro,RJ
Brazil



Frank Harrell

Mar 10, 2011, 7:20:13 AM
to MedStats
Interesting. I stand corrected.
Frank
