What is the relationship between pdist2(..., 'mahalanobis') and mahal?

kj

Apr 28, 2016, 5:12:49 PM

This question is about the relationship between two different ways to
compute Mahalanobis distances with MATLAB: `pdist2(..., 'mahalanobis')`
and `mahal(...)`. (Both functions are part of MATLAB's Statistics and
Machine Learning Toolbox.)

To be concrete, suppose we have the following:

>> rng(0)
>> X = rand(5, 3)
X =
0.8147 0.0975 0.1576
0.9058 0.2785 0.9706
0.1270 0.5469 0.9572
0.9134 0.9575 0.4854
0.6324 0.9649 0.8003
>> Y = rand(4, 3)
Y =
0.1419 0.9595 0.9340
0.4218 0.6557 0.6787
0.9157 0.0357 0.7577
0.7922 0.8491 0.7431
>> PD2 = pdist2(Y, X, 'mahalanobis')
PD2 =
7.6812 2.9122 1.4405 4.5954 1.9977
5.2445 4.4568 2.8991 2.9615 2.4820
6.9355 2.4993 2.3643 4.3252 2.4828
6.9843 3.2129 3.5415 2.7514 0.5562
>> MAH = mahal(Y, X)
MAH =
3.4400
0.8204
3.1139
0.7368

Is it possible to get the same result `PD2` (ignoring round-off
errors) using only `mahal`? Conversely, can one obtain `MAH` using
`pdist2(..., 'mahalanobis')` alone?

More generally, what exactly is the relationship between `PD2` and
`MAH`?

(Just to be clear: I'm asking about the mathematical/conceptual
relationship between these two functions. I know that `pdist2(...,
'mahalanobis')` does not invoke `mahal`, and I suspect that the
converse is also true.)

---

(!!!IMPORTANT!!!: All the post's essential information is given above.
What follows may be skipped.)

The documentation for the function `pdist2` begins as follows:

D = pdist2(X,Y) returns a matrix D containing the Euclidean distances
between each pair of observations in the MX-by-N data matrix X and
MY-by-N data matrix Y. Rows of X and Y correspond to observations,
and columns correspond to variables. D is an MX-by-MY matrix, with the
(I,J) entry equal to distance between observation I in X and
observation J in Y.

D = pdist2(X,Y,DISTANCE) computes D using DISTANCE.

One of the possible values for the `DISTANCE` parameter is
`mahalanobis`; the relevant part of the `pdist2` documentation begins
as follows:

'mahalanobis' - Mahalanobis distance, using the sample covariance
of X as computed by NANCOV.

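AFAICT, in the definition of `pdist2`, `X` and `Y` are the same "sort
of thing". IOW, `pdist2(X, Y, 'mahalanobis')` and `pdist2(Y, X,
'mahalanobis')` both make sense, even though their results are in
general not transposes of each other (each call uses the sample
covariance of its own first argument).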
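
To make the documentation excerpt above concrete, here is a minimal sketch of what I understand `pdist2(A, B, 'mahalanobis')` to compute, taking the quoted description at face value. The names `A`, `B`, `C` and `D` are mine, chosen only for illustration, and I assume the data contain no NaNs, so that `cov` can stand in for `NANCOV`. This is not the toolbox's actual implementation.

```matlab
% Hand-rolled pairwise Mahalanobis distances (illustrative sketch only).
rng(0)
A = rand(5, 3);                 % first argument: MA-by-N observations
B = rand(4, 3);                 % second argument: MB-by-N observations

C = cov(A);                     % sample covariance of the FIRST argument
                                % (assumes no NaNs, so cov matches nancov)

D = zeros(size(A, 1), size(B, 1));
for i = 1:size(A, 1)
    for j = 1:size(B, 1)
        d = A(i, :) - B(j, :);          % 1-by-N difference of two observations
        D(i, j) = sqrt(d / C * d');     % sqrt(d * inv(C) * d'), not squared
    end
end
```

If this reading is right, `D` should agree with `pdist2(A, B, 'mahalanobis')` up to round-off. Swapping the arguments replaces `cov(A)` with `cov(B)`, which is why the two orderings need not be transposes of each other.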

In contrast, the main arguments to the built-in `mahal` function are
neither semantically nor mathematically symmetric:

D2 = mahal(Y,X) returns the Mahalanobis distance (in squared units) of
each observation (point) in Y from the sample data in X, i.e.,

D2(I) = (Y(I,:)-MU) * SIGMA^(-1) * (Y(I,:)-MU)',

where MU and SIGMA are the sample mean and covariance of the data in X.
Rows of Y and X correspond to observations, and columns to variables. X
and Y must have the same number of columns, but can have different numbers
of rows. X must have more rows than columns.
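
The formula quoted above can be written out directly. The following is a sketch under the assumption that the documented formula is exactly what `mahal` evaluates; the variable names `mu`, `sigma`, `Yc` and `D2` are mine, and `bsxfun` is used so that the code does not rely on implicit expansion.

```matlab
% Squared Mahalanobis distance of each row of Y from the sample X,
% following the formula in the mahal documentation (illustrative sketch).
rng(0)
X = rand(5, 3);                   % reference sample: more rows than columns
Y = rand(4, 3);                   % points whose distances we want

mu    = mean(X, 1);               % 1-by-N sample mean of X
sigma = cov(X);                   % N-by-N sample covariance of X
Yc    = bsxfun(@minus, Y, mu);    % subtract mu from every row of Y

% Row-wise quadratic form (Y(i,:)-mu) * inv(sigma) * (Y(i,:)-mu)'
D2 = sum((Yc / sigma) .* Yc, 2);  % MY-by-1 vector, in squared units
```

Because `X` and `Y` are generated in the same order as in the first example of this post, `D2` can be compared against `mahal(Y, X)` there. Note that only `X` contributes to `mu` and `sigma`; `Y` enters only as the set of points being measured.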

In particular, for some pairs of `Y` and `X`, `mahal(Y, X)` may make
sense, while `mahal(X, Y)` will result in an error. For example:

>> rng(0)
>> Y = rand(1, 3); X = rand(5, 3);
>> mahal(Y, X)
ans =
7.3939
>> mahal(X, Y)
Error using mahal (line 38)
The number of rows of X must exceed the number of columns.

IOW, the *domain* of `pdist2(..., ..., 'mahalanobis')` is a symmetric
Cartesian product A x A, whereas the domain of `mahal`
is an asymmetric one, A x B with A ~= B: the second argument must have
more rows than columns, while the first may have any number of rows.
