On 19/01/2012 07:50, Elliot Saba wrote:
> I recently needed to calculate the cross-covariance of two random
> vectors (e.g. I have two matrices, X and Y, the columns of which are
> observations of one variable, and I wish to generate a matrix pairing
> each value of X and Y)
I don't see how your function relates to numpy.cov [1]. Is it an
"extended case" function, or is there a difference in the underlying math?
Best,
Pierre
[1] numpy.cov docstring:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.cov.html
If X has shape n x p, then np.cov(X, rowvar=False) is equal
to S after

cX = X - X.mean(axis=0)[np.newaxis,:]
S = np.dot(cX.T, cX)/(n-1.)

If we also have Y of shape n x p, then np.cov(X, y=Y, rowvar=False)
is equal to S after

XY = np.hstack((X, Y))
cXY = XY - XY.mean(axis=0)[np.newaxis,:]
S = np.dot(cXY.T, cXY)/(n-1.)

Thus we can see that the total covariance matrix is composed
of four blocks (with cY = Y - Y.mean(axis=0)[np.newaxis,:]):

S[:p,:p] = np.dot(cX.T, cX)/(n-1.)  # upper left
S[:p,p:] = np.dot(cX.T, cY)/(n-1.)  # upper right
S[p:,:p] = np.dot(cY.T, cX)/(n-1.)  # lower left
S[p:,p:] = np.dot(cY.T, cY)/(n-1.)  # lower right

Often we just want the upper-right p x p quadrant. Thus
we can save 75% of the CPU time by not computing the rest.
Sturla
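For concreteness, a small numerical check of this block decomposition (shapes chosen arbitrarily; this is just an illustrative sketch, not code from numpy):

import numpy as np

n, p = 50, 3
rng = np.random.RandomState(0)
X = rng.randn(n, p)
Y = rng.randn(n, p)

cX = X - X.mean(axis=0)
cY = Y - Y.mean(axis=0)
cross = np.dot(cX.T, cY) / (n - 1.)       # upper-right p x p block only

S = np.cov(X, y=Y, rowvar=False)          # full 2p x 2p covariance matrix
print(np.allclose(cross, S[:p, p:]))      # True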
I would go further and ask why it is so. People around may have use cases
in mind; I do not.
Otherwise, I feel that the default behavior of cov when called with two
arguments should be what Sturla and Elliot just described.
Best,
Pierre
(that is, something like this:

def cov(X, Y=None):
    if Y is None:
        Y = X
    else:
        assert Y.shape == X.shape  # or something like that
    # [...jumping to the end of the existing code...]
    if not rowvar:
        return (dot(X.T, Y.conj()) / fact).squeeze()
    else:
        return (dot(X, Y.T.conj()) / fact).squeeze()
)
[source code]
https://github.com/numpy/numpy/blob/master/numpy/lib/function_base.py
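To make the skeleton above concrete, here is a self-contained sketch of what a dedicated cross-covariance helper could look like (the name xcov and the exact keyword handling are hypothetical, not an existing numpy function; the real cov also handles bias, complex conjugation, etc., which are omitted here):

import numpy as np

def xcov(X, Y, rowvar=True, ddof=1):
    # Cross-covariance between the variables of X and the variables of Y.
    # Follows np.cov's convention: by default each row is a variable and
    # each column an observation.
    X = np.atleast_2d(np.asarray(X, dtype=float))
    Y = np.atleast_2d(np.asarray(Y, dtype=float))
    if rowvar:
        X, Y = X.T, Y.T               # now observations are rows
    n = X.shape[0]
    if Y.shape[0] != n:
        raise ValueError("X and Y must have the same number of observations")
    cX = X - X.mean(axis=0)
    cY = Y - Y.mean(axis=0)
    return np.dot(cX.T, cY) / (n - ddof)

For the 1-D example discussed later in the thread, xcov(X, Y) returns a 1x1 array holding just the cross term (about -4.286) instead of the full 2x2 matrix.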
I ran into this a while ago and was confused why cov did not behave the way Pierre suggested.
same here,
When I rewrote scipy.stats.spearmanr, I matched the numpy behavior for
two arrays, while R only returns the cross-correlation part.
Josef
http://projects.scipy.org/numpy/ticket/2031
If people feel ok with this proposal, I can try to expand the proposed
implementation skeleton into something more serious. But maybe Elliot
already has something ready to pull-request on GitHub?
Pierre
Really, I do not understand what you want to do, especially when the
ticket contains some very basic errors. Can you please provide a
couple of real examples with expected output that clearly show what
you want?
From a statistical viewpoint, np.cov is correct because it outputs the
variance/covariance matrix. Also I believe that changing the np.cov
function will cause major havoc because numpy and people's code depend
on the current behavior.
Bruce
Changing the behavior of `cov` is IMHO not really possible at this point
--- the current behavior is not a bug, but a documented feature that has
been around probably already since Numeric.
However, adding a new function could be possible.
--
Pauli Virtanen
Thanks for your ticket feedback! It's precisely because I see a big
potential impact of the proposed change that I sent first a ML message,
then a ticket, before jumping to a pull request like a Sergio Leone
cowboy (sorry, I watched "For a Few Dollars More" last weekend...).
Now, I realize that in the ticket write-up I made the wrong trade-off
between conciseness and accuracy, which led to some of the errors you
raised. Let me use your example to share what I have in mind.
>>> X = array([-2.1, -1. , 4.3])
>>> Y = array([ 3. , 1.1 , 0.12])

Indeed, with today's cov behavior we have a 2x2 array:

>>> cov(X, Y)
array([[ 11.71      ,  -4.286     ],
       [ -4.286     ,   2.14413333]])
Now, when I used the word 'concatenation', I wasn't precise enough,
because I meant assembling X and Y in the sense of 2 vectors of
observations from 2 random variables X and Y.
This is achieved by concatenate(X, Y) *when properly playing with the
dimensions* (which I didn't mention):
>>> XY = np.concatenate((X[None, :], Y[None, :]))
>>> XY
array([[-2.1 , -1.  ,  4.3 ],
       [ 3.  ,  1.1 ,  0.12]])

In this case, I can indeed say that "cov(X,Y) is equivalent to cov(XY)".

>>> np.cov(XY)
array([[ 11.71      ,  -4.286     ],
       [ -4.286     ,   2.14413333]])
(And indeed, the actual cov Python code does use concatenate() )
Now let me come back to my assertion about this behavior's *usefulness*.
You'll acknowledge that np.cov(XY) is made of four blocks (here just 4
simple scalar blocks):
* the diagonal blocks are just cov(X) and cov(Y) (which in this case come
down to var(X) and var(Y) when setting ddof to 1)
* the off-diagonal blocks are symmetric and are actually the covariance
estimate of the X, Y observations (from
http://en.wikipedia.org/wiki/Covariance),
that is:

>>> ((X - X.mean()) * (Y - Y.mean())).sum() / (3 - 1)
-4.2860000000000005
The new proposed behaviour for cov is that cov(X,Y) would return
array(-4.2860000000000005) instead of the 2x2 matrix.
* This would be in line with the cov(X,Y) mathematical definition, as
well as with R behavior.
* This would save memory and computing resources. (and therefore help
save the planet ;-) )
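Put differently (just an illustration with today's API), the proposed result is nothing more than the off-diagonal entry of the matrix that cov currently returns:

>>> np.cov(X, Y)[0, 1]   # approximately -4.286: the cross term only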
However, I do understand that the impact for this change may be big.
This indeed requires careful reviewing.
I (and apparently some other people) am simply questioning why there is
such a behavior. Indeed, the second variable `y` is presented as "An
additional set of variables and observations".
This raises two different questions for me:
* What is the use case for such an additional set of variables that
could just be concatenated to the first set `m`?
* Or, if indeed this sort of integrated concatenation is useful, why
just add one "additional set" and not several "additional sets", like:
>>> cov(m, y1, y2, y3, ....) ?
But I would understand that numpy's responsibility to provide a stable
computing API would prevent any change in cov's behavior. You have the
long-term experience to judge that. (I certainly don't ;-) )
However, in case this change is not possible, I would see this
solution:
* add an xcov function that does what Elliot, Sturla and I described
* possibly deprecate the `y` 2nd argument of cov, because I feel it
brings more definition complication than real programming benefit
(but I still find that changing cov would lead to a leaner numpy API,
which was my motivation for reacting to Elliot's first message)
Pierre
The current np.cov implementation returns the cross-covariance the way
it is commonly used in statistics. If MATLAB does not, then that is
MATLAB's problem I think.
http://www.stat.washington.edu/research/reports/2001/tr391.pdf
Sturla
The discussion had this reversed: numpy matches the behavior of
MATLAB, while R (statistics) only returns the cross-covariance part, as
proposed.
If there is a new xcov, then I think there should also be an xcorrcoef.
This case needs a different implementation than corrcoef, since the
xcov doesn't contain the variances and they need to be calculated
separately.
Josef
In this context, I find stacking, np.vstack((X,Y)), more appropriate
than concatenate.
>
> In this case, I can indeed say that "cov(X,Y) is equivalent to cov(XY)".
>> >> np.cov(XY)
> array([[ 11.71 , -4.286 ],
> [ -4.286 , 2.14413333]])
>
Sure, the resulting array is the same, but the whole process is totally different.
> (And indeed, the actual cov Python code does use concatenate() )
Yes, but the user does not see that, whereas you are forcing the user
to do the stacking in the correct dimensions.
>
>
> Now let me come back to my assertion about this behavior *usefulness*.
> You'll acknowledge that np.cov(XY) is made of four blocks (here just 4
> simple scalars blocks).
No, there are not '4' blocks, just rows and columns.
> * diagonal blocks are just cov(X) and cov(Y) (which in this case comes
> to var(X) and var(Y) when setting ddof to 1)
Sure but variances are still covariances.
> * off diagonal blocks are symetric and are actually the covariance
> estimate of X, Y observations (from
> http://en.wikipedia.org/wiki/Covariance)
Sure
>
> that is :
>> >> ((X-X.mean()) * (Y-Y.mean())).sum()/ (3-1)
> -4.2860000000000005
>
> The new proposed behaviour for cov is that cov(X,Y) would return :
> array(-4.2860000000000005) instead of the 2*2 matrix.
But how do you interpret a 2D array where the number of rows is greater than 2?
>>> Z=Y+X
>>> np.cov(np.vstack((X,Y,Z)))
array([[ 11.71 , -4.286 , 7.424 ],
[ -4.286 , 2.14413333, -2.14186667],
[ 7.424 , -2.14186667, 5.28213333]])
>
> * This would be in line with the cov(X,Y) mathematical definition, as
> well as with R behavior.
I don't care what R does because I am using Python and Python is
infinitely better than R is!
But I think that is only in the 1D case.
> * This would save memory and computing resources. (and therefore help
> save the planet ;-) )
Nothing that you have provided shows that it will.
>
> However, I do understand that the impact for this change may be big.
> This indeed requires careful reviewing.
>
> Pierre
Bruce
Sturla showed the 4 blocks in his first message.
>
>> * diagonal blocks are just cov(X) and cov(Y) (which in this case comes
>> to var(X) and var(Y) when setting ddof to 1)
> Sure but variances are still covariances.
>
>> * off diagonal blocks are symetric and are actually the covariance
>> estimate of X, Y observations (from
>> http://en.wikipedia.org/wiki/Covariance)
> Sure
>>
>> that is :
>>> >> ((X-X.mean()) * (Y-Y.mean())).sum()/ (3-1)
>> -4.2860000000000005
>>
>> The new proposed behaviour for cov is that cov(X,Y) would return :
>> array(-4.2860000000000005) instead of the 2*2 matrix.
>
> But how you interpret an 2D array where the rows are greater than 2?
>>>> Z=Y+X
>>>> np.cov(np.vstack((X,Y,Z)))
> array([[ 11.71 , -4.286 , 7.424 ],
> [ -4.286 , 2.14413333, -2.14186667],
> [ 7.424 , -2.14186667, 5.28213333]])
>
>
>>
>> * This would be in line with the cov(X,Y) mathematical definition, as
>> well as with R behavior.
> I don't care what R does because I am using Python and Python is
> infinitely better than R is!
>
> But I think that is only in the 1D case.
I just checked R to make sure I remember correctly:
> xx = matrix((1:20)^2, nrow=4)
> xx
[,1] [,2] [,3] [,4] [,5]
[1,] 1 25 81 169 289
[2,] 4 36 100 196 324
[3,] 9 49 121 225 361
[4,] 16 64 144 256 400
> cov(xx, 2*xx[,1:2])
[,1] [,2]
[1,] 86.0000 219.3333
[2,] 219.3333 566.0000
[3,] 352.6667 912.6667
[4,] 486.0000 1259.3333
[5,] 619.3333 1606.0000
> cov(xx)
[,1] [,2] [,3] [,4] [,5]
[1,] 43.0000 109.6667 176.3333 243.0000 309.6667
[2,] 109.6667 283.0000 456.3333 629.6667 803.0000
[3,] 176.3333 456.3333 736.3333 1016.3333 1296.3333
[4,] 243.0000 629.6667 1016.3333 1403.0000 1789.6667
[5,] 309.6667 803.0000 1296.3333 1789.6667 2283.0000
>
>> * This would save memory and computing resources. (and therefore help
>> save the planet ;-) )
> Nothing that you have provided shows that it will.
I don't know about saving the planet, but if X and Y have the same
number of columns, we save 3 quarters of the calculations, as Sturla
also explained in his first message.
Josef
Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    cX = X - X.mean(axis=0)[np.newaxis,:]
IndexError: 0-d arrays can only use a single () or a list of newaxes
(and a single ...) as an index
whoops!
Anyhow, the variance-covariance matrix is symmetric, but numpy and scipy
lack LAPACK's symmetric matrix routines
(http://www.netlib.org/lapack/explore-html/de/d9e/group___s_y.html).
I cannot figure out those savings:
a 2 by 2 output has 3 distinct covariances (so 3/4 = 0.75 is 'needed', not 25%),
a 3 by 3 output has 6 covariances,
a 5 by 5 output 15 covariances.
If you want to save memory and calculation, then use symmetric storage
and associated methods.
Bruce
What numpy calculates are 4, 9 and 25 covariances; we might care only
about 1, 2 and 4 of them.
>
> If you want to save memory and calculation then use symmetric storage
> and associated methods.
Actually, for the covariance matrix we still need to subtract the means, so we
won't save a full 75%, but we save 75% in the cross-product.

Suppose X and Y are (nobs, n_x) and (nobs, n_y) (means already subtracted),
and ignoring that numpy "likes" rows instead of columns, the partitioned dot
product [X, Y]'[X, Y] is

[[X'X, X'Y],
 [Y'X, Y'Y]]

X'Y is (n_x, n_y); the total shape is (n_x + n_y, n_x + n_y).

If we are only interested in X'Y, we don't need the other three submatrices.
If n_x = 99 and n_y is 1, we save .... ?
(we get a (99, 1) instead of a (100, 100) matrix)

And X'Y, np.dot(X.T, Y), doesn't have any duplicated symmetry, so
exploiting symmetry is a different issue.
Josef
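As a rough illustration of this saving (a sketch only; actual speedups depend on the shapes and the BLAS used):

import numpy as np

nobs, n_x, n_y = 10000, 99, 1
rng = np.random.RandomState(0)
X = rng.randn(nobs, n_x)
Y = rng.randn(nobs, n_y)

# Full route: covariance of all 100 variables, then keep the cross block.
S = np.cov(np.hstack((X, Y)), rowvar=False)     # (100, 100) matrix
cross_full = S[:n_x, n_x:]                      # the (99, 1) block we wanted

# Direct route: compute only the cross block.
cX = X - X.mean(axis=0)
cY = Y - Y.mean(axis=0)
cross_direct = np.dot(cX.T, cY) / (nobs - 1.)   # (99, 1), no 100x100 intermediate

print(np.allclose(cross_full, cross_direct))    # True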
Thanks to someone for clearly stating what they want.
But it still lacks evidence that it will save the world - when nobs is
large, n_x and n_y are meaningless and thus (99,1) vs (100, 100) is
also meaningless.
Further, dealing separately with the two arrays also brings additional
overhead - small, but not zero.
Any idea why the "transposed" convention was selected in np.cov?
(I'm raising this question for informative purposes only... ;-) )
I also compared with Octave to see how it works:
-- Function File: cov (X, Y)
Compute covariance.
If each row of X and Y is an observation and each column is a
variable, the (I, J)-th entry of `cov (X, Y)' is the covariance
between the I-th variable in X and the J-th variable in Y. If
called with one argument, compute `cov (X, X)'.
(http://www.gnu.org/software/octave/doc/interpreter/Correlation-and-Regression-Analysis.html)
I like the clear tone of this description. But strangely enough, this is a
bit different from Matlab.
(http://webcache.googleusercontent.com/search?q=cache:L3kF8BHcB4EJ:octave.1599824.n4.nabble.com/cov-m-function-behaves-different-from-Matlab-td1634956.html+&cd=1&hl=fr&ct=clnk&client=iceweasel-a)
> If there is a new xcov, then I think there should also be a xcorrcoef.
> This case needs a different implementation than corrcoef, since the
> xcov doesn't contain the variances and they need to be calculated
> separately.
Adding xcorrcoef as well would make sense. Using np.var, with the
`axis` and `ddof` arguments set to appropriate values, should then
bring the variances needed for the normalization.
In the end, if adding xcov is the path of least resistance, this may be
the way to go. What do people think?
Pierre
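For what it's worth, a minimal sketch of such an xcorrcoef, built from the cross-covariance block plus separately computed variances (the function name and signature are hypothetical, not an existing numpy API; X and Y are assumed to hold observations as rows):

import numpy as np

def xcorrcoef(X, Y, ddof=1):
    # X: (nobs, k_x), Y: (nobs, k_y); returns the (k_x, k_y) matrix of
    # correlation coefficients between the columns of X and of Y.
    n = X.shape[0]
    cX = X - X.mean(axis=0)
    cY = Y - Y.mean(axis=0)
    cross = np.dot(cX.T, cY) / (n - ddof)       # cross-covariance block
    sx = np.sqrt(X.var(axis=0, ddof=ddof))      # std dev of each X column
    sy = np.sqrt(Y.var(axis=0, ddof=ddof))      # std dev of each Y column
    return cross / np.outer(sx, sy)

Note that the ddof factor cancels between the cross-covariance and the standard deviations, so the correlation itself does not depend on it.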
On 27/01/2012 17:28, Bruce Southey wrote:
> The output is still a covariance so do we really need yet another set
> of very similar functions to maintain?
> Or can we get away with a new keyword?
>
The idea of an additional keyword seems appealing.
Just to make sure I understood it well, you would be proposing a new
signature like:
def cov(.... get_full_cov_matrix=True)
and when `get_full_cov_matrix` is set to False, only the cross-covariance
part would be returned.
Am I right?
> If speed really matters to you guys then surely moving np.cov into C
> would have more impact on 'saving the world' than this proposal. That
> also ignores algorithm used (
> http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Covariance).
>
>
I didn't get your point about the algorithm here. From this
nomenclature, I would say that numpy.cov is based on a vectorized
"two-pass algorithm", which computes the means first and then subtracts
them before computing the matrix product. Would you do it differently?
> Actually np.cov also is deficient in that it does not have the dtype
> argument so it is prone to numerical precision errors (especially
> getting the mean of the array). Probably should be a ticket...
I'm not a specialist in numerical precision, but I was very impressed
by the recent example raised on Jan 24th by Michael Aye, which was one of
the first "real life" examples I've seen.
The way I see the cov algorithm, there is first a possibility to propagate
an optional dtype argument to the mean computation.
However, I'm unsure about what to do afterwards, for the matrix product,
since "dot(X.T, X.conj()) / fact" is also a sort of mean computation.
Therefore it can also be affected by numerical precision issues. What
would you suggest?
(the only solution I see would be to use the running variance algorithm.
Since the code wouldn't be vectorized anymore, this would indeed
benefit from going to C)
Best,
Pierre
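As a rough sketch of the first idea above, here is how an optional dtype might be threaded through a two-pass covariance (the function name and keyword are hypothetical, this is not the actual np.cov code, and it assumes real-valued input):

import numpy as np

def cov_with_dtype(m, rowvar=True, ddof=1, dtype=np.float64):
    # Two-pass covariance: upcast once, so both the mean and the
    # cross-product are accumulated in the requested precision
    # (e.g. np.float64 or np.longdouble) even for float32 input.
    X = np.atleast_2d(np.asarray(m).astype(dtype))
    if not rowvar:
        X = X.T                        # variables as rows, observations as columns
    n = X.shape[1]
    cX = X - X.mean(axis=1)[:, np.newaxis]
    return np.dot(cX, cX.T) / (n - ddof)

# e.g. cov_with_dtype(np.random.rand(3, 10).astype(np.float32), dtype=np.longdouble)

For low-precision data this keeps the mean subtraction and the accumulation in double (or extended) precision, which is one way to address the precision concern above without abandoning the vectorized two-pass form.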