
Classify Error: pooled covariance matrix of TRAINING must be positive definite


Jorge

Sep 21, 2007, 6:32:23 AM
I am trying to use the classify function of Matlab, but I
always get the same message. I have removed columns from the
Training matrix so that rank(Training) = the number of columns
of Training, and that got it working; the problem is that those
columns seem to be important. If instead of classify I use a
tree for the original Training and the Training with removed
columns, the performance on the latter decreases a lot (50%).

Any ideas of how to get classify working without removing
the columns?

Thanks a lot!!!

Best regards,
J.

Peter Perkins

Sep 21, 2007, 1:22:09 PM
Jorge wrote:
> I am trying to use the classify function of Matlab, but I
> always get the same message. I have removed columns from the
> Training matrix so that rank(Training) = the number of columns
> of Training, and that got it working; the problem is that those
> columns seem to be important. If instead of classify I use a
> tree for the original Training and the Training with removed
> columns, the performance on the latter decreases a lot (50%).

Jorge, you haven't said, but I expect you're using the default linear
discriminant analysis options in CLASSIFY.

You're describing this in terms that make me think you're looking
for a work-around for a bug. That isn't the case. LDA requires enough
information to be able to estimate a full-rank covariance matrix, and at
a minimum that means more observations than variables. You might think
about selecting a good subset of variables somehow, or constructing new
variables using, for example, PCA. You might think about using one of
the two "naive Bayes" options in CLASSIFY.

Hope this helps.

- Peter Perkins
The MathWorks, Inc.

Greg Heath

Sep 23, 2007, 12:20:37 PM

If you use BACKSLASH or PINV instead of CLASSIFY, the effect of
ill-conditioned matrices can be mitigated.

BACKSLASH will effectively remove variables
by generating n-r zero coefficients in the weight vector.

PINV will tend to use all of the variables. However, it effectively
creates r significant linear combinations of the n original variables
and n-r insignificant linear combinations. This is equivalent to
projecting the data into an r-dimensional subspace.
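
A sketch of the contrast for a two-class problem (names are
illustrative: X is the case-by-variable training matrix, t a column
of +1/-1 class targets):

Xa = [ones(size(X,1),1) X];   % augment with a bias column
w_bs   = Xa \ t;              % BACKSLASH: a basic solution that zeros
                              % coefficients when Xa is rank deficient
w_pinv = pinv(Xa) * t;        % PINV: the minimum-norm solution, which
                              % spreads weight over all the variables
yhat = sign(Xa * w_pinv);     % resulting class assignments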

Hope this helps.

Greg

Julie Oswald

Jan 21, 2008, 4:26:02 PM
Greg Heath <he...@alumni.brown.edu> wrote in message
<1190564437....@r29g2000hsg.googlegroups.com>...

Hello,
I am getting the same error message when I try to run DFA
on my data. Removing variables helps, but this reduces my
variable set too much. I do have more cases than variables,
and I'm trying to compare the performance of linear,
quadratic, and Mahalanobis DFA. I've run DFA on the same
data set using another statistical package (SPSS) and did
not get any errors. So, I went into the classify function
in Matlab and removed the 'if' statement that produces the
covariance matrix error. I did get results when I ran that
version of classify, but I don't know whether to trust
those results. Can you tell me why Matlab requires positive
definite covariance matrices and whether I can trust the
results I get when I remove that 'if' statement?

Thank you.

Julie

Peter Perkins

Jan 22, 2008, 1:29:45 PM
Julie Oswald wrote:

> I am getting the same error message when I try to run DFA
> on my data. Removing variables helps, but this reduces my
> variable set too much. I do have more cases than variables,
> and I'm trying to compare the performance of linear,
> quadratic, and Mahalanobis DFA. I've run DFA on the same
> data set using another statistical package (SPSS) and did
> not get any errors. So, I went into the classify function
> in Matlab and removed the 'if' statement that produces the
> covariance matrix error. I did get results when I ran that
> version of classify, but I don't know whether to trust
> those results. Can you tell me why Matlab requires positive
> definite covariance matrices and whether I can trust the
> results I get when I remove that 'if' statement?

I can almost guarantee that the results you get after removing that
error check are not reliable if the error was indeed occurring.

CLASSIFY uses a discriminant analysis algorithm whose training step is
equivalent to fitting a multivariate normal distribution to each group
in your training data. For LDA, those multivariate normals are assumed
to have the same covariance matrix; for QDA they are assumed to have
different cov matrices. Given the estimated MVN dist'ns from training,
the algorithm's classification step simply compares the probability
densities for an observation across each of the estimated MVNs. You
can't reasonably do that if the MVN distributions are degenerate, i.e.,
if their cov matrix is not positive definite.

Some ways to get positive-definiteness: select a good subset of
variables somehow, or construct a small set of new variables using, for
example, PCA. Use one of the two "naive Bayes" options in CLASSIFY.
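
To see where the error comes from, here is a rough sketch of that
training-plus-comparison procedure for two groups (illustrative
names; GROUP holds labels 1 and 2, and priors are ignored):

X1 = TRAINING(GROUP==1,:);    X2 = TRAINING(GROUP==2,:);
n1 = size(X1,1);              n2 = size(X2,1);
Sp = ((n1-1)*cov(X1) + (n2-1)*cov(X2)) / (n1+n2-2);  % pooled covariance
% MVNPDF cannot evaluate a degenerate normal; a rank-deficient Sp is
% exactly where the procedure breaks down.
p1 = mvnpdf(SAMPLE, mean(X1), Sp);
p2 = mvnpdf(SAMPLE, mean(X2), Sp);
class = 1 + (p2 > p1);        % assign each row to the denser group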

Greg Heath

Jan 24, 2008, 8:05:16 AM
On Jan 22, 1:29 pm, Peter Perkins
<Peter.PerkinsRemoveT...@mathworks.com> wrote:
> Julie Oswald wrote:
> > I am getting the same error message when I try to run DFA
> > on my data. Removing variables helps, but this reduces my
> > variable set too much. I do have more cases than variables,
> > and I'm trying to compare the performance of linear,
> > quadratic, and Mahalanobis DFA. I've run DFA on the same
> > data set using another statistical package (SPSS) and did
> > not get any errors. So, I went into the classify function
> > in Matlab and removed the 'if' statement that produces the
> > covariance matrix error. I did get results when I ran that
> > version of classify, but I don't know whether to trust
> > those results. Can you tell me why Matlab requires positive
> > definite covariance matrices and whether I can trust the
> > results I get when I remove that 'if' statement?
>
> I can almost guarantee that the results you get after removing that
> error check are not reliable if the error was indeed occurring.
>
> CLASSIFY uses a discriminant analysis algorithm whose training step is
> equivalent to fitting a multivariate normal distribution to each group
> in your training data. For LDA, those multivariate normals are assumed
> to have the same covariance matrix; for QDA they are assumed to have
> different cov matrices. Given the estimated MVN dist'ns from training,
> the algorithm's classification step simply compares the probability
> densities for an observation across each of the estimated MVNs. You
> can't reasonably do that if the MVN distributions are degenerate, i.e.,
> if their cov matrix is not positive definite.
>
> Some ways to get positive-definiteness: select a good subset of
> variables somehow, or construct a small set of new variables using, for
> example, PCA. Use one of the two "naive Bayes" options in CLASSIFY.

Covariance matrices cannot be negative definite. If they are
singular, it indicates that one or more variables are linearly
dependent. If rank(Cov) = r < n, you can find a group of r
variables that contain all of the spread information.

Those r variables could be a subset of the original n, or they
could be linear combinations of them. rank(Cov) = r < n implies
that the data in the original n-dimensional space only occupies r
dimensions of that space and can be projected from the original
n-dimensional space to a proper r-dimensional subspace with a
positive-definite r x r covariance matrix.

As Peter has indicated, this is usually done in one of two ways
(a code sketch for both follows the list).

1. Input variable subset selection.
   a. Use your knowledge of the problem to choose q (q <= r) of
      the variables that you believe to be the most important.
      If you wish to play dumb you can use q = 0 and skip to
      step d. However, this is not recommended, because you
      should never try regression or classification without at
      least obtaining some information via scatter plots of
      1- and 2-D projections.
   b. Check the corresponding q x q covariance submatrix to be
      sure the chosen variables are linearly independent.
   c. If q = r, you are done.
   d. If q < r, use STEPWISEFIT or STEPWISE with the 'keep'
      option to find r-q more variables.
2. PCA dimensionality reduction.
   a. Project the data onto the space spanned by the r
      dominant eigenvectors of the covariance matrix.
   b. In regression you can reduce the dimensionality even
      further by using only enough dimensions to recover
      some high percentage (e.g., 90, 95, 99 or 99.9%)
      of the original spread, quantified by trace(Cov) =
      sum(eigenvalues).
   c. In classification, however, dimensions with small data
      spreads can contain all of the group-separability
      information, so keep all r dimensions. For an example,
      search Google Groups using

      greg-heath thin parallel disks
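
In code, the two approaches might look like this (illustrative
names: X is the case-by-variable data matrix, y the response or
class labels, and the starting picks are hypothetical):

% 1. Stepwise selection, starting from q hand-picked variables.
start = false(1,size(X,2));   start([3 7 12]) = true;  % hypothetical picks
[b,se,pval,inmodel] = stepwisefit(X, y, 'inmodel', start);
Xsel = X(:,inmodel);          % the selected variable subset

% 2. PCA dimensionality reduction.
[coeff,score,latent] = princomp(X);   % latent = eigenvalues of cov(X)
r = rank(cov(X));
Xr = score(:,1:r);            % classification: keep all r dimensions
% Regression: keep only enough to recover, say, 95% of trace(Cov).
k = find(cumsum(latent)/sum(latent) >= 0.95, 1);
Xk = score(:,1:k);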

Hope this helps.

Greg

Nida Aziz

Aug 3, 2009, 8:45:01 PM
I have a set of EEG data that I am trying to classify into
imagination and rest using the LDA option of the CLASSIFY function.
I am using a training set of 480 x 2570 values and a test set of
240 x 2570 values. I keep getting this error:
"The pooled covariance matrix of TRAINING must be positive definite."
Can anyone please help me out with it?

Peter Perkins

Aug 4, 2009, 10:07:32 AM
Nida Aziz wrote:
> I have a set of EEG data that I am trying to classify into
> imagination and rest using the LDA option of the CLASSIFY function.
> I am using a training set of 480 x 2570 values and a test set of
> 240 x 2570 values. I keep getting this error:
> "The pooled covariance matrix of TRAINING must be positive definite."

What does LDA do? It fits a multivariate normal distribution to the data from each class. Even with the shared cov matrix model in LDA, that means estimating, in your case, a 2570x2570 covariance matrix. With 480 observations, that isn't going to work. You either need _a lot_ more data, _a lot_ fewer dimensions, or a different kind of classifier.

Consider using something like PCA or partial least squares to transform to a smaller set of variables.
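
For instance, a sketch with PCA (illustrative names: TRAIN is your 480 x 2570 training set, TEST the 240 x 2570 test set, GROUP the class labels, and k is a choice you would tune):

[coeff, score] = princomp(TRAIN, 'econ');  % economy-size PCA, since 480 << 2570
k = 50;                                    % far fewer dimensions than observations
ctr = repmat(mean(TRAIN), size(TEST,1), 1);
class = classify((TEST - ctr)*coeff(:,1:k), score(:,1:k), GROUP);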

Hope this helps.
