probabilities and discriminant analysis

Richard Wright

Dec 11, 2004, 5:13:27 PM

In a submitted manuscript we have reported the classification of an
unknown case by linear discriminant analysis.

A referee has stated that we must attach an error term to the
probability of the classification (i.e. to the highest probability).

Software and publications that I have seen do not offer such a
statistic and I can't get my head around the meaningfulness of such an
error term.

Error terms, as I understand it, are associated with such statistics
as the mean of a set of observations and with regression coefficients.

If I am right I would be grateful for any suggestions about the
statistical wording of a tactful response that we can make to the
editor. If I am wrong then I would be grateful for a pointer to
literature on the matter.

Richard Ulrich

Dec 12, 2004, 2:02:39 PM

On Sun, 12 Dec 2004 09:13:27 +1100, Richard Wright
<richwri...@tig.com.au> wrote:

> In a submitted manuscript we have reported the classification of an
> unknown case by linear discriminant analysis.

When I first read through the question, I read it as simple --
Nobody reports errors on the "classification table".

On re-reading, I noticed that this was *one* case, "... classification
of an unknown case ...."
Well, I don't remember anyone trying to predict just one case,
either. And when they are trying to "predict" for several
new cases, it does not hurt to have some error rate in mind.

>
> A referee has stated that we must attach an error term to the
> probability of the classification (i.e. to the highest probability).

What? Do you have more than two groups, given that you say "highest"?

>
> Software and publications that I have seen do not offer such a
> statistic and I can't get my head around the meaningfulness of such an
> error term.

Here are notes on "meaningful".
In logistic regression, it is useful to look at the grouped deciles
of predicted scores (fitting) as a diagnostic tool for whether the
regression meets its regularity assumptions. The deciles should
show a progression of "fit" from one end to the other -- 100%
in this group, to (near) 100% in the other.

- You can make a similar descriptive plot for the discriminant
function scores between two groups, if you really want.
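
A minimal sketch of the decile check described above, in plain Python. The
function name `decile_table` and the toy data are made up for illustration;
it assumes you already have a predicted probability and a 0/1 outcome for
every case in the fitting sample.

```python
def decile_table(probs, outcomes, bins=10):
    """Sort cases by predicted probability, cut them into equal-sized
    groups, and compare the mean prediction with the observed rate."""
    pairs = sorted(zip(probs, outcomes))
    n = len(pairs)
    rows = []
    for b in range(bins):
        chunk = pairs[b * n // bins:(b + 1) * n // bins]
        if not chunk:
            continue  # fewer cases than bins
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        obs_rate = sum(y for _, y in chunk) / len(chunk)
        rows.append((b + 1, len(chunk), mean_pred, obs_rate))
    return rows

# Hypothetical toy data: 20 cases, predictions spread evenly over (0, 1).
probs = [i / 20 + 0.025 for i in range(20)]
outcomes = [0] * 10 + [1] * 10

for dec, size, pred, obs in decile_table(probs, outcomes):
    print(f"decile {dec:2d}  n={size}  predicted={pred:.2f}  observed={obs:.2f}")
```

If the observed rate does not climb steadily from the first group to the
last, that is a hint that the predicted probabilities should not be taken
at face value.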

Overfitting is potentially present in either. If I were using an
ordinary discriminant function, and attempting a *useful*
prediction, I think I would be compelled to create some
indicator of the expected error. One obvious choice would
be jack-knifing -- Wasn't that a classical application?
You do need the special software if you are going to do
a *real* jack-knife instead of a leave-one-out imitation,
but I'm not sure whether that really matters if you are looking
at predicted-cases -- instead of the unbiased coefficients.

Leave out 10% of the sample for each of 10 predictive runs,
or 1 case per run.

You have an overall rate of mis-prediction, and you can
describe your misses, further, in terms of distances from
the centroids. Or whatever. All this becomes easier, of
course, if you have hundreds or thousands of cases to
play with, instead of some bare minimum.
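
A rough sketch of the leave-out idea, assuming the simplest possible case:
one measurement per case and a pooled-variance linear discriminant. The
helper names (`fit_lda_1d`, `posteriors`, `cv_error_rate`) and the data are
hypothetical; with k equal to the number of cases this is leave-one-out,
and with k = 10 each run leaves out about 10% of the sample.

```python
import math

def fit_lda_1d(xs, labels):
    """One-variable LDA: group means, pooled variance, priors from group sizes."""
    groups = sorted(set(labels))
    members = {g: [x for x, lab in zip(xs, labels) if lab == g] for g in groups}
    means = {g: sum(v) / len(v) for g, v in members.items()}
    ss = sum((x - means[lab]) ** 2 for x, lab in zip(xs, labels))
    pooled_var = ss / (len(xs) - len(groups))
    priors = {g: len(v) / len(xs) for g, v in members.items()}
    return means, pooled_var, priors

def posteriors(model, x):
    """Posterior probability of each group for a case with score x."""
    means, pooled_var, priors = model
    score = {g: math.log(priors[g]) - (x - means[g]) ** 2 / (2 * pooled_var)
             for g in means}
    top = max(score.values())
    weight = {g: math.exp(s - top) for g, s in score.items()}
    total = sum(weight.values())
    return {g: w / total for g, w in weight.items()}

def cv_error_rate(xs, labels, k):
    """k-fold cross-validated misclassification rate (k = n is leave-one-out)."""
    n = len(xs)
    wrong = 0
    for fold in range(k):
        held_out = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [labels[i] for i in range(n) if i not in held_out]
        model = fit_lda_1d(train_x, train_y)  # refit without the held-out cases
        for i in held_out:
            post = posteriors(model, xs[i])
            if max(post, key=post.get) != labels[i]:
                wrong += 1
    return wrong / n

# Hypothetical toy sample: two groups of 10 cases each.
xs = [4.1, 4.8, 5.0, 5.3, 5.9, 6.2, 4.5, 5.6, 6.6, 5.1,
      6.0, 6.9, 7.2, 7.8, 8.1, 6.4, 7.5, 8.4, 7.0, 5.7]
labels = ["A"] * 10 + ["B"] * 10

print("10-fold error rate:      ", cv_error_rate(xs, labels, 10))
print("leave-one-out error rate:", cv_error_rate(xs, labels, len(xs)))
```

Every case is classified by a model that never saw it, so the rate is not
flattered by overfitting the way a resubstitution rate is.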

>
> Error terms, as I understand it, are associated with such statistics
> as the mean of a set of observations and with regression coefficients.
>
> If I am right I would be grateful for any suggestions about the
> statistical wording of a tactful response that we can make to the
> editor. If I am wrong then I would be grateful for a pointer to
> literature on the matter.

It could be that I'm off track here. At first, I was headed
in the direction of "tactful response", but then I started
staring at the *single prediction*. What is the value of
a prediction without some indication of its likelihood?

For the time being, I guess I'm on the side of the editor.

--
Rich Ulrich, wpi...@pitt.edu
http://www.pitt.edu/~wpilib/index.html

Ray Koopman

Dec 12, 2004, 3:36:50 PM

In the absence of the manuscript and the referee's comments, the
situation is ambiguous. I can see two potential interpretations.
First, the referee may be objecting to statements whose general form
is "discriminant analysis predicts that a case with such scores will
be Type A." If so then the solution is to report the probability of
the most likely category, plus the probability of the next most likely
category or two.

However, that addresses only uncertainty arising from the phenomenon
itself. It ignores uncertainty in the classification probabilities
due to sampling error and/or inappropriateness of the model. If these
are the referee's concerns then you should look into the literature
on cross-validation. Googling for <cross validation> got me 2,400,000
results; check out some of these. The subject should also be covered
in any decent text on classification.
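
One way to put a number on the sampling uncertainty described above is a
bootstrap of the training set. This is a substitute technique that nobody
in the thread proposes by name, offered only as a sketch: it assumes a
one-variable, pooled-variance linear discriminant, and the function names
and data are hypothetical.

```python
import math
import random

def fit_lda_1d(xs, labels):
    """One-variable LDA: group means, pooled variance, priors from group sizes."""
    groups = sorted(set(labels))
    members = {g: [x for x, lab in zip(xs, labels) if lab == g] for g in groups}
    means = {g: sum(v) / len(v) for g, v in members.items()}
    ss = sum((x - means[lab]) ** 2 for x, lab in zip(xs, labels))
    pooled_var = ss / (len(xs) - len(groups))
    priors = {g: len(v) / len(xs) for g, v in members.items()}
    return means, pooled_var, priors

def posteriors(model, x):
    """Posterior probability of each group for a case with score x."""
    means, pooled_var, priors = model
    score = {g: math.log(priors[g]) - (x - means[g]) ** 2 / (2 * pooled_var)
             for g in means}
    top = max(score.values())
    weight = {g: math.exp(s - top) for g, s in score.items()}
    total = sum(weight.values())
    return {g: w / total for g, w in weight.items()}

def bootstrap_interval(xs, labels, x_new, group, reps=1000, seed=0):
    """95% percentile interval for P(group | x_new) under resampling
    of the training cases, drawn with replacement within each group."""
    rng = random.Random(seed)
    members = {}
    for x, lab in zip(xs, labels):
        members.setdefault(lab, []).append(x)
    draws = []
    for _ in range(reps):
        bx, by = [], []
        for g, vals in members.items():
            bx.extend(rng.choice(vals) for _ in vals)
            by.extend([g] * len(vals))
        draws.append(posteriors(fit_lda_1d(bx, by), x_new)[group])
    draws.sort()
    return draws[int(0.025 * reps)], draws[int(0.975 * reps) - 1]

# Hypothetical training sample and one unknown case.
xs = [4.1, 4.8, 5.0, 5.3, 5.9, 6.2, 4.5, 5.6, 6.6, 5.1,
      6.0, 6.9, 7.2, 7.8, 8.1, 6.4, 7.5, 8.4, 7.0, 5.7]
labels = ["A"] * 10 + ["B"] * 10
x_new = 5.8

point = posteriors(fit_lda_1d(xs, labels), x_new)["A"]
low, high = bootstrap_interval(xs, labels, x_new, "A")
print(f"P(A | x) = {point:.2f}, bootstrap 95% interval ({low:.2f}, {high:.2f})")
```

The interval reflects sampling error in the estimated means and variance
only; it says nothing about whether the linear model is appropriate.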

Art Kendall

Dec 12, 2004, 5:26:52 PM

Did you use SPSS, where you did a discriminant function analysis on a
set of cases, then in the classification phase, you had one case that
had a value on the group variable that said it was an unclassified case?

Do you report the classification table -- actual vs predicted for the
cases in the "training" set?

Did you specify the option to save case by case information to the
system file? I.e., the value of each case (training and unclassified) on
each of the discriminant functions, the probability of the case being in
each of the groups, and the probability of a case in the group being so
far from the centroid of the group.
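
For the last of those saved quantities, a hedged sketch: I am assuming it
is the usual textbook tail area, namely the chance that a true member of
the group lies at least as far from the centroid as this case does, taken
from a chi-square distribution on the squared standardized distance. I
have not checked that SPSS computes it exactly this way. With a single
discriminant score the chi-square has one degree of freedom and the tail
area reduces to `erfc`. The name `typicality_1d` and the numbers are
hypothetical.

```python
import math

def typicality_1d(x, group_mean, pooled_var):
    """P(a member of the group is at least this far from its centroid),
    for one score: chi-square(1) upper tail of the squared distance."""
    d2 = (x - group_mean) ** 2 / pooled_var
    return math.erfc(math.sqrt(d2 / 2))

# Hypothetical group centroid 5.3 and pooled variance 0.6.
for x in (5.3, 6.0, 7.5):
    print(x, round(typicality_1d(x, 5.3, 0.6), 3))
```

A case can have a high posterior probability for a group and still have a
tiny value here, which signals that it does not look like any group in the
training set.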

Art
A...@DrKendall.org
Social Research Consultants
University Park, MD USA
(301) 864-5570

Richard Ulrich

Dec 12, 2004, 8:33:49 PM

- google note -
not relevant to the main question.

On 12 Dec 2004 12:36:50 -0800, "Ray Koopman" <koo...@sfu.ca> wrote:

[ ... ] " Googling for <cross validation> got me 2,400,000 results; "

That is a few too many, since "cross" and "validation" may
appear anywhere in the note, without being a phrase.

Try one of these.

< "cross validation" > yields 289,000.
< cross-validation > yields 305,000 .

I don't know why these two forms don't work precisely
the same, since the hyphen seems to be optional, and
so does a single space (at times).

Greg

Dec 14, 2004, 9:00:52 AM

Assume a 1-of-c classification assignment is made by comparing
estimates of c input-conditional discriminants (e.g.,
input-conditional posterior probabilities) obtained from the
output of a c-class classifier.

Based on the classifier outputs from a validation
(i.e., non-training, out-of-sample, calibration) set, an
unbiased empirical class-conditional CDF can be tabulated
for each class-conditional discriminant.

Given an input x and an estimated maximum input-conditional
discriminant D(J|x) = MAX(j=1,c){ d(j|x)}, a cumulative
class-conditional probability P( d <= D | J ) can be attached
as a measure of confidence.
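
As I read the recipe above, it can be sketched as follows. The function
names and the validation outputs are hypothetical, and each classifier
output is assumed to be a dict mapping class label to discriminant value.

```python
from bisect import bisect_right

def class_conditional_cdfs(val_outputs, val_labels):
    """For each class j, the sorted values of d(j|x) over the validation
    cases whose true class is j (an empirical CDF in tabular form)."""
    tables = {}
    for out, lab in zip(val_outputs, val_labels):
        tables.setdefault(lab, []).append(out[lab])
    return {j: sorted(v) for j, v in tables.items()}

def confidence(tables, output):
    """Assign to the class J with the largest discriminant D and return
    J with the empirical P(d <= D | J)."""
    J = max(output, key=output.get)
    D = output[J]
    values = tables[J]
    return J, bisect_right(values, D) / len(values)

# Hypothetical out-of-sample classifier outputs with their true classes.
val_outputs = [
    {"A": 0.90, "B": 0.10}, {"A": 0.70, "B": 0.30},
    {"A": 0.60, "B": 0.40}, {"A": 0.40, "B": 0.60},
    {"A": 0.20, "B": 0.80}, {"A": 0.30, "B": 0.70},
    {"A": 0.45, "B": 0.55}, {"A": 0.55, "B": 0.45},
]
val_labels = ["A", "A", "A", "A", "B", "B", "B", "B"]

tables = class_conditional_cdfs(val_outputs, val_labels)
print(confidence(tables, {"A": 0.65, "B": 0.35}))  # ('A', 0.5)
```

Here half of the genuine class-A validation cases scored no higher than
the new case on the class-A discriminant, so the assignment is about as
typical as a class-A assignment gets in this toy sample.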

Hope this helps.

Greg

Dec 15, 2004, 3:15:33 PM

Apologies if this is a repeated post. I can't find my first two
attempts.

The following is not restricted to linear classification.

An assignment to 1-of-c classes is made by comparing c discriminants,
d(i) (i=1,2...c) from the output of a classifier. The discriminants
may or may not be estimates of input-conditional posterior
probabilities P(i|x) (i=1,2,...c).

Two measures of confidence (MOC) can be obtained by using a validation
set to create two empirical MOC curves for each of the c classes.

The MOC curves for class i are

1. The CDF for the class i discriminant, P(D <= d(i)).

2. The complementary CDF for the largest of the other
class discriminants, P(D >= d(J)), where J is the
argmax of d(j) over all j != i.
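
A sketch of the two curves under my reading of the description, which is
an assumption: both are tabulated from validation cases whose true class
is i, the first from their own-class discriminant and the second from
their largest rival discriminant. The names `moc_tables` and `moc` and all
the numbers are hypothetical; the discriminants need not be probabilities.

```python
def moc_tables(val_outputs, val_labels):
    """Per true class i: its own discriminant values d(i), and the largest
    discriminant among the other classes, over the validation cases."""
    own, rival = {}, {}
    for out, lab in zip(val_outputs, val_labels):
        own.setdefault(lab, []).append(out[lab])
        rival.setdefault(lab, []).append(
            max(v for j, v in out.items() if j != lab))
    return own, rival

def moc(own, rival, output):
    """Return (assigned class i, P(D <= d(i)), P(D >= d(J)))."""
    i = max(output, key=output.get)
    d_i = output[i]
    d_J = max(v for j, v in output.items() if j != i)  # strongest rival
    cdf = sum(v <= d_i for v in own[i]) / len(own[i])
    ccdf = sum(v >= d_J for v in rival[i]) / len(rival[i])
    return i, cdf, ccdf

# Hypothetical validation outputs for a three-class problem.
val_outputs = [
    {"A": 2.0, "B": 0.5, "C": 0.1}, {"A": 1.5, "B": 1.0, "C": 0.2},
    {"A": 1.0, "B": 1.2, "C": 0.3}, {"A": 2.5, "B": 0.2, "C": 0.4},
    {"A": 0.3, "B": 1.8, "C": 0.6}, {"A": 0.9, "B": 1.1, "C": 0.5},
    {"A": 0.2, "B": 0.4, "C": 1.9}, {"A": 0.1, "B": 0.8, "C": 1.4},
]
val_labels = ["A", "A", "A", "A", "B", "B", "C", "C"]

own, rival = moc_tables(val_outputs, val_labels)
print(moc(own, rival, {"A": 1.8, "B": 0.6, "C": 0.3}))  # ('A', 0.5, 0.5)
```

Both numbers are large when the assignment is comfortable: the winning
discriminant is high for its class and the strongest rival is low.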

Hope this helps.

Richard Wright

Dec 16, 2004, 1:02:55 AM

Thanks to those who responded publicly and privately.

Rich's comments on jack-knifing describe what could obviously be done,
but I have nevertheless decided to write a polite note of
non-compliance to the editor on the grounds of 'why us?'. In the
particular field of analysis nobody, so far as I know, has ever got
round to computing error terms for the probabilities of classification
to a particular group.

I didn't bother to describe the context in my original post. The
database, and its relatives, are widely used in the linear
discriminant analysis of human cranial metric variables. The
particular database consists of 29 linear dimensions measured on 2,870
crania divided into 66 groups from all over the world. The
resubstitution classification error rate (note, some responders, that
is not the statistic I was concerned about) is around 32%, but that is
a crude error rate. By that I mean that if, for example, a skull from
one Mediterranean group is misclassified as a member of another
Mediterranean group that is as much of an error as if it were
classified as a Native American. In other words a substantively
sensible error rate is much less than 32%.
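
The distinction between the crude rate and a substantively sensible rate
can be made concrete with a small sketch. The group labels, the
region mapping and the six classifications below are invented for
illustration and are not taken from the database.

```python
def error_rates(true_groups, predicted_groups, region_of):
    """Crude rate: any wrong group counts as an error.
    Regional rate: only a wrong region counts as an error."""
    pairs = list(zip(true_groups, predicted_groups))
    crude = sum(t != p for t, p in pairs) / len(pairs)
    regional = sum(region_of[t] != region_of[p] for t, p in pairs) / len(pairs)
    return crude, regional

# Hypothetical groups and the broader regions they belong to.
region_of = {
    "Med1": "Mediterranean",
    "Med2": "Mediterranean",
    "NatAm1": "Native American",
}
true_groups = ["Med1", "Med1", "Med2", "Med2", "NatAm1", "NatAm1"]
predicted = ["Med1", "Med2", "Med1", "Med2", "NatAm1", "Med1"]

crude, regional = error_rates(true_groups, predicted, region_of)
print(f"crude error rate:    {crude:.2f}")     # 0.50
print(f"regional error rate: {regional:.2f}")  # 0.17
```

Two of the three crude errors here are swaps between neighbouring groups,
which is the kind of miss that matters little for ancestry.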

The purpose of the database is to help identify the ancestry of
individual crania of unknown origin. So at any one time there may be
only one skull to be classified against the database training set - an
unavoidable circumstance that Rich and others seemed to have some
difficulty with (sorry if I misunderstood that worry).

In the MS, linear discriminant analysis is somewhat incidental. We
deal with one skull only, and we listed the probabilities of group
membership for all values above 0.01. The highest probability value is
clear cut and for a group that is substantively uncontroversial.

So all in all, and taking into account the helpful replies, I don't
think the linear discriminant analysis needs to be extended in this
applied instance. Maybe some research on the question of error terms
using this database would interest a statistician . . . ?

I am still interested in learning of any published example where a
classified case has a probability of group membership with an error
term attached.


Greg

Dec 20, 2004, 12:42:26 AM

I posted this one or two days ago. Sorry if it's a duplicate.

For the past 10 years or so, military sponsors in certain
life-threatening scenarios have been requiring measures of
confidence to be quoted for each individual classification
assignment. Finding no silver bullet references in the
literature, I was forced to develop several. Although I
could not publish application details, I have discussed
various aspects of the problem in several sci.stat.consult and
comp.ai.neural-nets threads. One such thread is titled
"Classification Measure of Confidence" , 3 Aug. 1997.
Others can be found in groups.google.com using

"greg heath" "measure of confidence"
and

"greg heath" measure of confidence

Hope this helps.

Greg

Aleks Jakulin

Dec 21, 2004, 1:44:14 PM

Greg Heath wrote:
> For the past 10 years or so, military sponsors in certain
> life-threatening scenarios have been requiring measures of
> confidence to be quoted for each individual classification
> assignment. Finding no silver bullet references in the
> literature, I was forced to develop several. Although I
> could not publish application details, I have discussed
> various aspects of the problem in several sci.stat.consult and
> comp.ai.neural-nets threads. One such thread is titled
> "Classification Measure of Confidence" , 3 Aug. 1997.

This reminds me of proper loss functions. The concept is defined in
Bernardo & Smith (2000) "Bayesian Theory".

Proper loss functions have the property that if you minimize their
average on a sample, the probability model Q will be calibrated.
Examples of proper loss functions are the Brier score (the squared
difference between the predicted probabilities and the observed
outcome) and the logarithmic loss (negation of the logarithm of the
outcome's probability). On the other hand, error rate, zero-one loss,
etc., are not proper. Under the "true" probability model P, the
expected logarithmic loss of Q exceeds its minimum by the
KL-divergence D(P||Q), and that minimum, attained when Q = P, is the
Shannon entropy H(P).

Proper loss functions can be interpreted as "noninformative" loss
functions, ones that yield models that have good frequentist
properties. Minimizing the logarithmic loss function is an agnostic
re-phrasing of maximum likelihood.
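
A small numerical sketch of the two losses, with hypothetical function
names and made-up probabilities. It also checks that the expected log loss
of Q under P equals H(P) + D(P||Q), so the excess over the entropy is the
KL-divergence. The Brier score is taken in its multi-category form, summed
over the groups.

```python
import math

def brier(probs, outcome):
    """Squared difference between predicted probabilities and the 0/1
    indicator of what actually happened, summed over groups."""
    return sum((p - (1.0 if g == outcome else 0.0)) ** 2
               for g, p in probs.items())

def log_loss(probs, outcome):
    """Negative log of the probability given to the actual outcome."""
    return -math.log(probs[outcome])

def expected_log_loss(P, Q):
    """Average log loss of predictions Q when outcomes follow P."""
    return -sum(P[g] * math.log(Q[g]) for g in P if P[g] > 0)

def entropy(P):
    return expected_log_loss(P, P)

def kl_divergence(P, Q):
    return sum(P[g] * math.log(P[g] / Q[g]) for g in P if P[g] > 0)

# Hypothetical "true" model P and a fitted model Q over three groups.
P = {"A": 0.7, "B": 0.2, "C": 0.1}
Q = {"A": 0.5, "B": 0.3, "C": 0.2}

print("Brier score of Q when A occurs:", round(brier(Q, "A"), 3))
print("log loss of Q when A occurs:   ", round(log_loss(Q, "A"), 3))
print("expected log loss of Q under P:", round(expected_log_loss(P, Q), 4))
print("H(P) + D(P||Q):                ", round(entropy(P) + kl_divergence(P, Q), 4))
```

Reporting Q = P minimizes the expected value of either loss, which is what
makes them proper; the misclassification rate has no such property because
it looks only at which group gets the highest probability.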

--
mag. Aleks Jakulin
http://www.ailab.si/aleks/
Artificial Intelligence Laboratory,
Faculty of Computer and Information Science,
University of Ljubljana, Slovenia.

David Winsemius

Dec 24, 2004, 12:30:51 AM

Richard Ulrich wrote in news:bmrpr0dvqcbc7e7fk...@4ax.com:

> - google note -
> not relevant to the main question.
>
> On 12 Dec 2004 12:36:50 -0800, "Ray Koopman" <koo...@sfu.ca> wrote:
>
> [ ... ] " Googling for <cross validation> got me 2,400,000 results; "
>
> That is a few too many, since "cross" and "validation" may
> appear anywhere in the note, without being a phrase.
>
> Try one of these.
>
> < "cross validation" > yields 289,000.
> < cross-validation > yields 305,000 .
>
> I don't know why these two forms don't work precisely
> the same, since the hyphen seems to be optional, and
> so does a single space (at times).

From experimentation I have found that a google-dash matches pretty much
any character or no character, so the second form matches:
crossvalidation, cross-validation, cross/validation, "cross validation",
cross_validation, etc. I have never found offishul documentation for this
behavior.

--
David Winsemius