On 19/01/2012 07:50, Elliot Saba wrote:
> I recently needed to calculate the cross-covariance of two random
> vectors (e.g. I have two matrices, X and Y, the columns of which are
> observations of one variable, and I wish to generate a matrix pairing
> each value of X and Y)
I don't see how your function relates to numpy.cov [1]. Is it an
"extended case" function, or is there a difference in the underlying math?
Best,
Pierre
[1] numpy.cov docstring:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.cov.html
If X has shape n x p, then np.cov(X, rowvar=False) is equal
to S after

cX = X - X.mean(axis=0)[np.newaxis,:]
S = np.dot(cX.T, cX)/(n-1.)

If we also have Y of shape n x p, then np.cov(X, y=Y, rowvar=False)
is equal to S after

XY = np.hstack((X, Y))
cXY = XY - XY.mean(axis=0)[np.newaxis,:]
S = np.dot(cXY.T, cXY)/(n-1.)

Thus we can see that the total covariance matrix is composed
of four blocks (with cY = Y - Y.mean(axis=0)[np.newaxis,:]):

S[:p,:p] = np.dot(cX.T, cX)/(n-1.)  # upper left
S[:p,p:] = np.dot(cX.T, cY)/(n-1.)  # upper right
S[p:,:p] = np.dot(cY.T, cX)/(n-1.)  # lower left
S[p:,p:] = np.dot(cY.T, cY)/(n-1.)  # lower right

Often we just want the upper-right p x p quadrant. Thus
we can save 75% of the CPU time by not computing the rest.
Sturla
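For concreteness, a small numerical check of this block decomposition (shapes chosen arbitrarily; this is just an illustrative sketch, not code from numpy):

import numpy as np

n, p = 50, 3
rng = np.random.RandomState(0)
X = rng.randn(n, p)
Y = rng.randn(n, p)

cX = X - X.mean(axis=0)
cY = Y - Y.mean(axis=0)
cross = np.dot(cX.T, cY) / (n - 1.)       # upper-right p x p block only

S = np.cov(X, y=Y, rowvar=False)          # full 2p x 2p covariance matrix
print(np.allclose(cross, S[:p, p:]))      # True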
I would go further and ask why it is so. People around may have use cases
in mind; I do not.
Otherwise, I feel that the default behavior of cov when called with two
arguments should be what Sturla and Elliot just described.
Best,
Pierre
(that is, something like this:

def cov(X, Y=None):
    if Y is None:
        Y = X
    else:
        assert Y.shape == X.shape  # or something like that
    # [...jumping to the end of the existing code...]
    if not rowvar:
        return (dot(X.T, Y.conj()) / fact).squeeze()
    else:
        return (dot(X, Y.T.conj()) / fact).squeeze()
)
[source code]
https://github.com/numpy/numpy/blob/master/numpy/lib/function_base.py
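To make the skeleton above concrete, here is a self-contained sketch of what a dedicated cross-covariance helper could look like (the name xcov and the exact keyword handling are hypothetical, not an existing numpy function; the real cov also handles bias, complex conjugation, etc., which are omitted here):

import numpy as np

def xcov(X, Y, rowvar=True, ddof=1):
    # Cross-covariance between the variables of X and the variables of Y.
    # Follows np.cov's convention: by default each row is a variable and
    # each column an observation.
    X = np.atleast_2d(np.asarray(X, dtype=float))
    Y = np.atleast_2d(np.asarray(Y, dtype=float))
    if rowvar:
        X, Y = X.T, Y.T               # now observations are rows
    n = X.shape[0]
    if Y.shape[0] != n:
        raise ValueError("X and Y must have the same number of observations")
    cX = X - X.mean(axis=0)
    cY = Y - Y.mean(axis=0)
    return np.dot(cX.T, cY) / (n - ddof)

For the 1-D example discussed later in the thread, xcov(X, Y) returns a 1x1 array holding just the cross term (about -4.286) instead of the full 2x2 matrix.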
I ran into this a while ago and was confused why cov did not behave the way Pierre suggested.
same here,
When I rewrote scipy.stats.spearmanr, I matched the numpy behavior for
two arrays, while R only returns the cross-correlation part.
Josef
http://projects.scipy.org/numpy/ticket/2031
If people feel ok with this proposal, I can try to expand the proposed
implementation skeleton into something more serious. But maybe Elliot
already has something ready to pull-request on GitHub?
Pierre
Really, I do not understand what you want to do, especially when the
ticket contains some very basic errors. Can you please provide a
couple of real examples with expected output that clearly show what
you want?
From a statistical viewpoint, np.cov is correct because it outputs the
variance/covariance matrix. Also I believe that changing the np.cov
function will cause major havoc because numpy and people's code depend
on the current behavior.
Bruce
Changing the behavior of `cov` is IMHO not really possible at this point
--- the current behavior is not a bug, but a documented feature that has
been around probably already since Numeric.
However, adding a new function could be possible.
--
Pauli Virtanen
Thanks for your ticket feedback! It's precisely because I see a big
potential impact of the proposed change that I sent first a ML message,
then a ticket, before jumping to a pull request like a Sergio Leone
cowboy (sorry, I watched "For a Few Dollars More" last weekend...).
Now, I realize that in the ticket write-up I made the wrong trade-off
between conciseness and accuracy, which led to some of the errors you
raised. Let me use your example to share what I have in mind.
>>> X = array([-2.1, -1. , 4.3])
>>> Y = array([ 3. , 1.1 , 0.12])

Indeed, with today's cov behavior we have a 2x2 array:

>>> cov(X, Y)
array([[ 11.71      ,  -4.286     ],
       [ -4.286     ,   2.14413333]])
Now, when I used the word 'concatenation', I wasn't precise enough,
because I meant assembling X and Y in the sense of 2 vectors of
observations from 2 random variables X and Y.
This is achieved by concatenate(X, Y) *when properly playing with the
dimensions* (which I didn't mention):
>>> XY = np.concatenate((X[None, :], Y[None, :]))
>>> XY
array([[-2.1 , -1.  ,  4.3 ],
       [ 3.  ,  1.1 ,  0.12]])

In this case, I can indeed say that "cov(X,Y) is equivalent to cov(XY)".

>>> np.cov(XY)
array([[ 11.71      ,  -4.286     ],
       [ -4.286     ,   2.14413333]])
(And indeed, the actual cov Python code does use concatenate() )
Now let me come back to my assertion about this behavior's *usefulness*.
You'll acknowledge that np.cov(XY) is made of four blocks (here just 4
simple scalar blocks):
* the diagonal blocks are just cov(X) and cov(Y) (which in this case come
down to var(X) and var(Y) when setting ddof to 1)
* the off-diagonal blocks are symmetric and are actually the covariance
estimate of the X, Y observations (from
http://en.wikipedia.org/wiki/Covariance),
that is:

>>> ((X - X.mean()) * (Y - Y.mean())).sum() / (3 - 1)
-4.2860000000000005
The new proposed behaviour for cov is that cov(X,Y) would return
array(-4.2860000000000005) instead of the 2x2 matrix.
* This would be in line with the cov(X,Y) mathematical definition, as
well as with R behavior.
* This would save memory and computing resources. (and therefore help
save the planet ;-) )
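Put differently (just an illustration with today's API), the proposed result is nothing more than the off-diagonal entry of the matrix that cov currently returns:

>>> np.cov(X, Y)[0, 1]   # approximately -4.286: the cross term only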
However, I do understand that the impact for this change may be big.
This indeed requires careful reviewing.
I (and apparently some other people) am simply questioning why there is
such a behavior. Indeed, the second variable `y` is presented as "An
additional set of variables and observations".
This raises two different questions for me:
* What is the use case for such an additional set of variables that
could just be concatenated to the first set `m`?
* Or, if indeed this sort of integrated concatenation is useful, why
just add one "additional set" and not several "additional sets", like:
>>> cov(m, y1, y2, y3, ....) ?
But I would understand that numpy's responsibility to provide a stable
computing API would prevent any change in cov's behavior. You have the
long-term experience to judge that. (I certainly don't ;-) )
However, in case this change is not possible, I would see this
solution:
* add an xcov function that does what Elliot, Sturla and I described
* possibly deprecate the `y` 2nd argument of cov, because I feel it
brings more definition complication than real programming benefit
(but I still find that changing cov would lead to a leaner numpy API,
which was my motivation for reacting to Elliot's first message)
Pierre
The current np.cov implementation returns the cross-covariance the way
it is commonly used in statistics. If MATLAB does not, then that is
MATLAB's problem I think.
http://www.stat.washington.edu/research/reports/2001/tr391.pdf
Sturla
The discussion had this reversed: numpy matches the behavior of
MATLAB, while R (statistics) only returns the cross-covariance part, as
proposed.
If there is a new xcov, then I think there should also be an xcorrcoef.
This case needs a different implementation than corrcoef, since the
xcov doesn't contain the variances and they need to be calculated
separately.
Josef
In this context, I find stacking, np.vstack((X,Y)), more appropriate
than concatenate.
>
> In this case, I can indeed say that "cov(X,Y) is equivalent to cov(XY)".
>> >> np.cov(XY)
> array([[ 11.71 , -4.286 ],
> [ -4.286 , 2.14413333]])
>
Sure, the resulting array is the same, but the whole process is totally different.
> (And indeed, the actual cov Python code does use concatenate() )
Yes, but the user does not see that, whereas you are forcing the user
to do the stacking in the correct dimensions.
>
>
> Now let me come back to my assertion about this behavior *usefulness*.
> You'll acknowledge that np.cov(XY) is made of four blocks (here just 4
> simple scalars blocks).
No, there are not '4' blocks, just rows and columns.
> * diagonal blocks are just cov(X) and cov(Y) (which in this case comes
> to var(X) and var(Y) when setting ddof to 1)
Sure but variances are still covariances.
> * off diagonal blocks are symetric and are actually the covariance
> estimate of X, Y observations (from
> http://en.wikipedia.org/wiki/Covariance)
Sure
>
> that is :
>> >> ((X-X.mean()) * (Y-Y.mean())).sum()/ (3-1)
> -4.2860000000000005
>
> The new proposed behaviour for cov is that cov(X,Y) would return :
> array(-4.2860000000000005) instead of the 2*2 matrix.
But how do you interpret a 2D array where the number of rows is greater than 2?
>>> Z=Y+X
>>> np.cov(np.vstack((X,Y,Z)))
array([[ 11.71 , -4.286 , 7.424 ],
[ -4.286 , 2.14413333, -2.14186667],
[ 7.424 , -2.14186667, 5.28213333]])
>
> * This would be in line with the cov(X,Y) mathematical definition, as
> well as with R behavior.
I don't care what R does because I am using Python and Python is
infinitely better than R is!
But I think that is only in the 1D case.
> * This would save memory and computing resources. (and therefore help
> save the planet ;-) )
Nothing that you have provided shows that it will.
>
> However, I do understand that the impact for this change may be big.
> This indeed requires careful reviewing.
>
> Pierre
Bruce
Sturla showed the 4 blocks in his first message.
>
>> * diagonal blocks are just cov(X) and cov(Y) (which in this case comes
>> to var(X) and var(Y) when setting ddof to 1)
> Sure but variances are still covariances.
>
>> * off diagonal blocks are symetric and are actually the covariance
>> estimate of X, Y observations (from
>> http://en.wikipedia.org/wiki/Covariance)
> Sure
>>
>> that is :
>>> >> ((X-X.mean()) * (Y-Y.mean())).sum()/ (3-1)
>> -4.2860000000000005
>>
>> The new proposed behaviour for cov is that cov(X,Y) would return :
>> array(-4.2860000000000005) instead of the 2*2 matrix.
>
> But how you interpret an 2D array where the rows are greater than 2?
>>>> Z=Y+X
>>>> np.cov(np.vstack((X,Y,Z)))
> array([[ 11.71 , -4.286 , 7.424 ],
> [ -4.286 , 2.14413333, -2.14186667],
> [ 7.424 , -2.14186667, 5.28213333]])
>
>
>>
>> * This would be in line with the cov(X,Y) mathematical definition, as
>> well as with R behavior.
> I don't care what R does because I am using Python and Python is
> infinitely better than R is!
>
> But I think that is only in the 1D case.
I just checked R to make sure I remember correctly:
> xx = matrix((1:20)^2, nrow=4)
> xx
[,1] [,2] [,3] [,4] [,5]
[1,] 1 25 81 169 289
[2,] 4 36 100 196 324
[3,] 9 49 121 225 361
[4,] 16 64 144 256 400
> cov(xx, 2*xx[,1:2])
[,1] [,2]
[1,] 86.0000 219.3333
[2,] 219.3333 566.0000
[3,] 352.6667 912.6667
[4,] 486.0000 1259.3333
[5,] 619.3333 1606.0000
> cov(xx)
[,1] [,2] [,3] [,4] [,5]
[1,] 43.0000 109.6667 176.3333 243.0000 309.6667
[2,] 109.6667 283.0000 456.3333 629.6667 803.0000
[3,] 176.3333 456.3333 736.3333 1016.3333 1296.3333
[4,] 243.0000 629.6667 1016.3333 1403.0000 1789.6667
[5,] 309.6667 803.0000 1296.3333 1789.6667 2283.0000
>
>> * This would save memory and computing resources. (and therefore help
>> save the planet ;-) )
> Nothing that you have provided shows that it will.
I don't know about saving the planet, but if X and Y have the same
number of columns, we save 3 quarters of the calculations, as Sturla
also explained in his first message.
Josef
Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    cX = X - X.mean(axis=0)[np.newaxis,:]
IndexError: 0-d arrays can only use a single () or a list of newaxes
(and a single ...) as an index
whoops!
Anyhow, the variance-covariance matrix is symmetric, but numpy and scipy
lack LAPACK's symmetric matrix routines
(http://www.netlib.org/lapack/explore-html/de/d9e/group___s_y.html).
I cannot figure out those savings:
a 2 by 2 output has 3 distinct covariances (so 3/4 = 0.75 is 'needed', not 25%),
a 3 by 3 output has 6 covariances,
a 5 by 5 output 15 covariances.
If you want to save memory and calculation, then use symmetric storage
and associated methods.
Bruce
What numpy calculates are 4, 9 and 25 covariances; we might care only
about 1, 2 and 4 of them.
>
> If you want to save memory and calculation then use symmetric storage
> and associated methods.
Actually, for the covariance matrix we still need to subtract the means, so we
won't save a full 75%, but we save 75% in the cross-product.

Suppose X and Y are (nobs, n_x) and (nobs, n_y) (means already subtracted),
and ignoring that numpy "likes" rows instead of columns, the partitioned dot
product [X, Y]'[X, Y] is

[[X'X, X'Y],
 [Y'X, Y'Y]]

X'Y is (n_x, n_y); the total shape is (n_x + n_y, n_x + n_y).

If we are only interested in X'Y, we don't need the other three submatrices.
If n_x = 99 and n_y is 1, we save .... ?
(we get a (99, 1) instead of a (100, 100) matrix)

And X'Y, np.dot(X.T, Y), doesn't have any duplicated symmetry, so
exploiting symmetry is a different issue.
Josef
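As a rough illustration of this saving (a sketch only; actual speedups depend on the shapes and the BLAS used):

import numpy as np

nobs, n_x, n_y = 10000, 99, 1
rng = np.random.RandomState(0)
X = rng.randn(nobs, n_x)
Y = rng.randn(nobs, n_y)

# Full route: covariance of all 100 variables, then keep the cross block.
S = np.cov(np.hstack((X, Y)), rowvar=False)     # (100, 100) matrix
cross_full = S[:n_x, n_x:]                      # the (99, 1) block we wanted

# Direct route: compute only the cross block.
cX = X - X.mean(axis=0)
cY = Y - Y.mean(axis=0)
cross_direct = np.dot(cX.T, cY) / (nobs - 1.)   # (99, 1), no 100x100 intermediate

print(np.allclose(cross_full, cross_direct))    # True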
Thanks to someone for clearly stating what they want.
But it still lacks evidence that it will save the world - when nobs is
large, n_x and n_y are meaningless and thus (99,1) vs (100, 100) is
also meaningless.
Further, dealing separately with the two arrays also brings additional
overhead - small, but not zero.
Any idea why the "transposed" convention was selected in np.cov?
(I'm raising this question for informative purposes only... ;-) )
I also compared with Octave to see how it works:
-- Function File: cov (X, Y)
Compute covariance.
If each row of X and Y is an observation and each column is a
variable, the (I, J)-th entry of `cov (X, Y)' is the covariance
between the I-th variable in X and the J-th variable in Y. If
called with one argument, compute `cov (X, X)'.
(http://www.gnu.org/software/octave/doc/interpreter/Correlation-and-Regression-Analysis.html)
I like the clear tone of this description. But strangely enough, this is a
bit different from Matlab.
(http://webcache.googleusercontent.com/search?q=cache:L3kF8BHcB4EJ:octave.1599824.n4.nabble.com/cov-m-function-behaves-different-from-Matlab-td1634956.html+&cd=1&hl=fr&ct=clnk&client=iceweasel-a)
> If there is a new xcov, then I think there should also be a xcorrcoef.
> This case needs a different implementation than corrcoef, since the
> xcov doesn't contain the variances and they need to be calculated
> separately.
Adding xcorrcoef as well would make sense. Using np.var, with the
`axis` and `ddof` arguments set to appropriate values, should then
bring the variances needed for the normalization.
In the end, if adding xcov is the path of least resistance, this may be
the way to go. What do people think?
Pierre
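For what it's worth, a minimal sketch of such an xcorrcoef, built from the cross-covariance block plus separately computed variances (the function name and signature are hypothetical, not an existing numpy API; X and Y are assumed to hold observations as rows):

import numpy as np

def xcorrcoef(X, Y, ddof=1):
    # X: (nobs, k_x), Y: (nobs, k_y); returns the (k_x, k_y) matrix of
    # correlation coefficients between the columns of X and of Y.
    n = X.shape[0]
    cX = X - X.mean(axis=0)
    cY = Y - Y.mean(axis=0)
    cross = np.dot(cX.T, cY) / (n - ddof)       # cross-covariance block
    sx = np.sqrt(X.var(axis=0, ddof=ddof))      # std dev of each X column
    sy = np.sqrt(Y.var(axis=0, ddof=ddof))      # std dev of each Y column
    return cross / np.outer(sx, sy)

Note that the ddof factor cancels between the cross-covariance and the standard deviations, so the correlation itself does not depend on it.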
On 27/01/2012 17:28, Bruce Southey wrote:
> The output is still a covariance so do we really need yet another set
> of very similar functions to maintain?
> Or can we get away with a new keyword?
>
The idea of an additional keyword seems appealing.
Just to make sure I understood it well, you would be proposing a new
signature like:
def cov(.... get_full_cov_matrix=True)
and when `get_full_cov_matrix` is set to False, only the cross-covariance
part would be returned.
Am I right?
> If speed really matters to you guys then surely moving np.cov into C
> would have more impact on 'saving the world' than this proposal. That
> also ignores algorithm used (
> http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Covariance).
>
>
I didn't get your point about the algorithm here. From this
nomenclature, I would say that numpy.cov is based on a vectorized
"two-pass algorithm", which computes the means first and then subtracts
them before computing the matrix product. Would you do it differently?
> Actually np.cov also is deficient in that it does not have the dtype
> argument so it is prone to numerical precision errors (especially
> getting the mean of the array). Probably should be a ticket...
I'm not a specialist in numerical precision, but I was very impressed
by the recent example raised on Jan 24th by Michael Aye, which was one of
the first "real life" examples I've seen.
The way I see the cov algorithm, there is first a possibility to propagate
an optional dtype argument to the mean computation.
However, I'm unsure about what to do afterwards, for the matrix product,
since "dot(X.T, X.conj()) / fact" is also a sort of mean computation.
Therefore it can also be affected by numerical precision issues. What
would you suggest?
(the only solution I see would be to use the running variance algorithm.
Since the code wouldn't be vectorized anymore, this would indeed
benefit from going to C)
Best,
Pierre
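As a rough sketch of the first idea above, here is how an optional dtype might be threaded through a two-pass covariance (the function name and keyword are hypothetical, this is not the actual np.cov code, and it assumes real-valued input):

import numpy as np

def cov_with_dtype(m, rowvar=True, ddof=1, dtype=np.float64):
    # Two-pass covariance: upcast once, so both the mean and the
    # cross-product are accumulated in the requested precision
    # (e.g. np.float64 or np.longdouble) even for float32 input.
    X = np.atleast_2d(np.asarray(m).astype(dtype))
    if not rowvar:
        X = X.T                        # variables as rows, observations as columns
    n = X.shape[1]
    cX = X - X.mean(axis=1)[:, np.newaxis]
    return np.dot(cX, cX.T) / (n - ddof)

# e.g. cov_with_dtype(np.random.rand(3, 10).astype(np.float32), dtype=np.longdouble)

For low-precision data this keeps the mean subtraction and the accumulation in double (or extended) precision, which is one way to address the precision concern above without abandoning the vectorized two-pass form.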