Incremental covariance matrix


Andrei Zh

Dec 9, 2014, 4:54:24 PM
to julia...@googlegroups.com
In Julia, ``cov(a)`` calculates the covariance matrix, but it takes the whole dataset as its argument. Instead, I want to build it vector by vector, so that I can handle big datasets. I know this is certainly possible and not really hard to implement (e.g. see [1]), but I wonder if there's something ready to use.

[1]: http://rebcabin.github.io/blog/2013/01/22/covariance-matrices/
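For reference, the approach in that post boils down to keeping running sums of the observations and of their outer products; a minimal sketch (``covstream`` is a made-up name, and ``obs`` is assumed to be a collection of observation vectors):

```julia
# Streaming covariance from running sums: only (n, s, ss) stays in memory,
# never the full dataset. `covstream` is a hypothetical name.
function covstream(obs)
    p  = length(first(obs))
    n  = 0
    s  = zeros(p)        # running sum of observations
    ss = zeros(p, p)     # running sum of outer products x*x'
    for x in obs
        n  += 1
        s  += x
        ss += x * x'
    end
    m = s / n
    (ss - n * (m * m')) / (n - 1)    # unbiased covariance estimate
end
```

On a small in-memory example this agrees with ``cov(X)`` up to floating point, so the same loop could be fed rows from disk without ever materializing X.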

Sam Lendle

Dec 9, 2014, 11:40:02 PM
to julia...@googlegroups.com
You can call cov on two vectors to compute the covariance between them, or on two matrices to compute the covariance between each column of the first matrix paired with each column of the second.

For example:
julia> x = rand(10, 3);

julia> y = rand(10, 4);

julia> cov(x, y)
3x4 Array{Float64,2}:
  0.0190784  -0.000211094  -0.0314128  -0.0111155
 -0.0165716  -0.00996252    0.0161174  -0.000363669
 -0.0376887  -0.0322297     0.0179769  -0.00729683


So the (i,j)th element of the result is the covariance between the ith column of x and the jth column of y.

Does that help?

Sam

Andrei

Dec 10, 2014, 1:54:58 AM
to julia...@googlegroups.com
Hi Sam,

thanks for your answer, but it seems I misled you. Let me explain in more detail.

I'm working on an algorithm that uses the multivariate normal distribution (more concretely, ``MvNormal()`` from Distributions.jl). One of the parameters of this distribution is the covariance matrix. So, in Julia, given a dataset ``X`` we can instantiate a multivariate normal distribution like this:

  mu = squeeze(mean(X, 1), 1)
  C = cov(X)
  MvNormal(mu, C)

Each row in X here is an observation, and we use the list of these observations to calculate covariances between variables. But what if X has size ``(100_000_000, 20)``, i.e. 100M observations and only 20 variables (a pretty typical pattern in large-scale machine learning)? The covariance matrix will be pretty small, but to calculate it we would need to load the entire dataset into memory, which is not always an option. Thus I'm looking for a way to build the covariance matrix iteratively, adding new vectors (or mini-batches) one by one. Something like this:

  batches = ... # split X into list of smaller matrices, e.g. of size (100, 20)
  C = ... # init covariance matrix
  for batch in batches
      updatecov(C, batch)
  end
 
Alternatively, some kind of bookkeeping data structure could be used. E.g.:

  dstats = DataStats()
  for batch in batches
      partial_fit(dstats, batch)
  end
  mean(dstats)  # ==> variables' mean
  cov(dstats)     # ==> covariance matrix
  ...
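For what it's worth, the bookkeeping can be as small as an ``(n, sums, crossprods)`` triple; a sketch along those lines (all the names here are made up, not an existing API):

```julia
# State is a tuple (n, s, ss): observation count, per-variable sums, and
# the accumulated cross-product matrix X'X. All names are hypothetical.
initstats(p) = (0, zeros(p), zeros(p, p))

function partialfit(state, X)                    # X: batch, rows = observations
    n, s, ss = state
    (n + size(X, 1), s + X' * ones(size(X, 1)), ss + X' * X)
end

statmean(state) = state[2] / state[1]

function statcov(state)
    n, s, ss = state
    m = s / n
    (ss - n * (m * m')) / (n - 1)
end
```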

If there's nothing like this in Julia yet, I'll probably come up with my own solution, but it's always worth checking whether somebody has already done it.



--
You received this message because you are subscribed to a topic in the Google Groups "julia-stats" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/julia-stats/Qh_iZPO-dmI/unsubscribe.
To unsubscribe from this group and all its topics, send an email to julia-stats...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Sam Lendle

Dec 10, 2014, 3:06:43 AM
to julia...@googlegroups.com
You didn't mislead me; I misunderstood your question. I should have looked at the link in your first message. As far as I know, that is not available in the main stats packages, at least.

Andreas Noack

Dec 11, 2014, 9:01:45 PM
to julia...@googlegroups.com
I haven't seen any functions for this either. You probably want to consider the `BLAS.syr` function when implementing this if performance is critical.
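To spell that out: ``BLAS.syr!`` does a symmetric rank-1 update in place, so per-observation accumulation of the cross-product matrix looks something like this (a sketch; ``crossprods`` is a made-up name, and ``BLAS`` here is ``Base.LinAlg.BLAS``):

```julia
import Base.LinAlg.BLAS

# Accumulate A := A + x*x' one observation at a time, writing only the
# lower ('L') triangle. After the loop, tril(A) holds the cross products X'X.
function crossprods(rows)
    p = length(first(rows))
    A = zeros(p, p)
    for x in rows
        BLAS.syr!('L', 1.0, x, A)
    end
    A
end
```

Since only one triangle is touched, the upper triangle of the result stays at its initial zeros.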


Simon Byrne

Dec 12, 2014, 6:37:58 AM
to julia...@googlegroups.com
BLAS.syr! should work for individual rows. Otherwise, if you're processing multiple rows at a time, the quickest might be to use BLAS.gemm! (I don't think there is a symmetric version). Something like:

p = 20                    # number of variables
C = zeros(p, p)
n = 0
while more_rows()
    X = read_more_rows()
    n += size(X, 1)
    BLAS.gemm!('T', 'N', 1.0, X, X, 1.0, C)   # C += X'X
end
C /= n - 1                # gives the covariance only if the columns have zero mean

Andreas Noack

Dec 12, 2014, 1:37:32 PM
to julia...@googlegroups.com
The symmetric version is BLAS.syrk!, the symmetric rank-k update.
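That is, one call per batch doing roughly half the work of gemm! (again with ``BLAS`` = ``Base.LinAlg.BLAS``):

```julia
import Base.LinAlg.BLAS

# C := 1.0 * X'X + 1.0 * C, computed and stored in the lower ('L') triangle only
X = rand(100, 5)
C = zeros(5, 5)
BLAS.syrk!('L', 'T', 1.0, X, 1.0, C)
# tril(C) now matches tril(X' * X); the upper triangle is left at zero
```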

Andrei Zh

Dec 14, 2014, 3:39:41 PM
to julia...@googlegroups.com
Thanks for all your answers. Eventually, I came up with code that boils down to the following (more structured code can be found in [1]):

# axpy! and syrk! are from Base.LinAlg.BLAS
x_sums = zeros(Float64, n_vars)
cross_sums = zeros(Float64, n_vars, n_vars)
n = 0
while more_rows()
    X = read_more_rows()
    axpy!(1.0, sum(X, 1), x_sums)              # x_sums += column sums of the batch
    syrk!('L', 'T', 1.0, X, 1.0, cross_sums)   # cross_sums += X'X (lower triangle)
    n += size(X, 1)
end
m = x_sums / n                                 # mean of each variable
C = (cross_sums - n * (m * m')) / (n - 1)
for j = 1:n_vars, i = 1:j
    C[i, j] = C[j, i]                          # copy lower triangle to upper
end
The last loop is needed to turn the lower-triangular matrix produced by ``syrk!()`` into a full covariance matrix, though there may be a more efficient/elegant way to do this.

Andreas Noack

Dec 14, 2014, 4:02:32 PM
to julia...@googlegroups.com
Depending on your use case, you could consider storing only the Cholesky factorization, or wrapping the result in the Symmetric type. Both solutions make the explicit symmetrization unnecessary. Alternatively, you could use the Base.LinAlg.copytri! function for this.
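For example (a minimal illustration of the latter two options, argument forms from memory):

```julia
# Only the lower triangle of C is meaningful here
C = [1.0  0.0;
     5.0  2.0]

# Option 1: wrap it; Symmetric is a view over one triangle, no copying happens
S = Symmetric(C, :L)            # behaves as [1.0 5.0; 5.0 2.0]

# Option 2: fill the upper triangle in place
Base.LinAlg.copytri!(C, 'L')    # C is now [1.0 5.0; 5.0 2.0]
```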

Andrei Zh

Dec 15, 2014, 4:32:19 PM
to julia...@googlegroups.com
It seems Symmetric doesn't support some of the operations that plain Arrays do, so I ended up using Base.LinAlg.copytri!(). Thanks!