Incremental covariance matrix


Andrei Zh

Dec 9, 2014, 4:54:24 PM
to julia...@googlegroups.com
In Julia, ``cov(a)`` calculates the covariance matrix, but it takes the whole dataset as its argument. Instead, I want to build it vector by vector, so that I can handle big datasets. I know this is certainly possible and not really hard to implement (e.g. see [1]), but I wonder if there's something ready to use.

[1]: http://rebcabin.github.io/blog/2013/01/22/covariance-matrices/
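For reference, the approach in that post boils down to keeping running sums of the observations and of their outer products; a minimal sketch (``covstream`` is a made-up name, and ``obs`` is assumed to be a collection of observation vectors):

```julia
# Streaming covariance from running sums: only (n, s, ss) stays in memory,
# never the full dataset. `covstream` is a hypothetical name.
function covstream(obs)
    p  = length(first(obs))
    n  = 0
    s  = zeros(p)        # running sum of observations
    ss = zeros(p, p)     # running sum of outer products x*x'
    for x in obs
        n  += 1
        s  += x
        ss += x * x'
    end
    m = s / n
    (ss - n * (m * m')) / (n - 1)    # unbiased covariance estimate
end
```

On a small in-memory example this agrees with ``cov(X)`` up to floating point, so the same loop could be fed rows from disk without ever materializing X.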

Sam Lendle

Dec 9, 2014, 11:40:02 PM
to julia...@googlegroups.com
You can call cov on two vectors to compute the covariance between them, or on two matrices to compute the covariance between each column of the first matrix paired with each column of the second.

For example:
julia> x = rand(10, 3);

julia> y = rand(10, 4);

julia> cov(x, y)
3x4 Array{Float64,2}:
  0.0190784  -0.000211094  -0.0314128  -0.0111155
 -0.0165716  -0.00996252    0.0161174  -0.000363669
 -0.0376887  -0.0322297     0.0179769  -0.00729683


So the (i,j)th element of the result is the covariance between the ith column of x and the jth column of y.

Does that help?

Sam

Andrei

Dec 10, 2014, 1:54:58 AM
to julia...@googlegroups.com
Hi Sam,

thanks for your answer, but it seems I misled you. Let me explain in more detail.

I'm working on an algorithm that uses the multivariate normal distribution (more concretely, ``MvNormal()`` from Distributions.jl). One of the parameters of this distribution is the covariance matrix. So, in Julia, given a dataset ``X`` we can instantiate a multivariate normal distribution like this:

  mu = squeeze(mean(X, 1), 1)
  C = cov(X)
  MvNormal(mu, C)

Each row in X here is an observation, and we use the list of these observations to calculate covariances between variables. But what if X has size ``(100_000_000, 20)``, i.e. 100M observations and only 20 variables (a pretty typical pattern in large-scale machine learning)? The covariance matrix will be pretty small, but to calculate it we would need to load the entire dataset into memory, which is not always an option. Thus I'm looking for a way to build the covariance matrix iteratively, adding new vectors (or mini-batches) one by one. Something like this:

  batches = ... # split X into list of smaller matrices, e.g. of size (100, 20)
  C = ... # init covariance matrix
  for batch in batches
      updatecov(C, batch)
  end
 
Alternatively, some kind of bookkeeping data structure could be used. E.g.:

  dstats = DataStats()
  for batch in batches
      partial_fit(dstats, batch)
  end
  mean(dstats)  # ==> variables' mean
  cov(dstats)     # ==> covariance matrix
  ...
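For what it's worth, the bookkeeping can be as small as an ``(n, sums, crossprods)`` triple; a sketch along those lines (all the names here are made up, not an existing API):

```julia
# State is a tuple (n, s, ss): observation count, per-variable sums, and
# the accumulated cross-product matrix X'X. All names are hypothetical.
initstats(p) = (0, zeros(p), zeros(p, p))

function partialfit(state, X)                    # X: batch, rows = observations
    n, s, ss = state
    (n + size(X, 1), s + X' * ones(size(X, 1)), ss + X' * X)
end

statmean(state) = state[2] / state[1]

function statcov(state)
    n, s, ss = state
    m = s / n
    (ss - n * (m * m')) / (n - 1)
end
```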

If there's nothing like this in Julia yet, I'll probably come up with my own solution, but it's always worth checking whether somebody has already done it.



--
You received this message because you are subscribed to a topic in the Google Groups "julia-stats" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/julia-stats/Qh_iZPO-dmI/unsubscribe.
To unsubscribe from this group and all its topics, send an email to julia-stats...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Sam Lendle

Dec 10, 2014, 3:06:43 AM
to julia...@googlegroups.com
You didn't mislead me; I misunderstood your question. I should have looked at the link in your first message. As far as I know, that is not available in the main stats packages, at least.

Andreas Noack

Dec 11, 2014, 9:01:45 PM
to julia...@googlegroups.com
I haven't seen any functions for this either. You probably want to consider the `BLAS.syr` function when implementing this if performance is critical.
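To spell that out: ``BLAS.syr!`` does a symmetric rank-1 update in place, so per-observation accumulation of the cross-product matrix looks something like this (a sketch; ``crossprods`` is a made-up name, and ``BLAS`` here is ``Base.LinAlg.BLAS``):

```julia
import Base.LinAlg.BLAS

# Accumulate A := A + x*x' one observation at a time, writing only the
# lower ('L') triangle. After the loop, tril(A) holds the cross products X'X.
function crossprods(rows)
    p = length(first(rows))
    A = zeros(p, p)
    for x in rows
        BLAS.syr!('L', 1.0, x, A)
    end
    A
end
```

Since only one triangle is touched, the upper triangle of the result stays at its initial zeros.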


Simon Byrne

Dec 12, 2014, 6:37:58 AM
to julia...@googlegroups.com
BLAS.syr! should work for individual rows. Otherwise, if you're processing multiple rows at a time, the quickest might be to use BLAS.gemm! (I don't think there is a symmetric version). Something like:

p = 20                    # number of variables
C = zeros(p, p)
n = 0
while more_rows()
    X = read_more_rows()
    n += size(X, 1)
    BLAS.gemm!('T', 'N', 1.0, X, X, 1.0, C)   # C += X'X
end
C /= n - 1                # gives the covariance only if the columns have zero mean

Andreas Noack

Dec 12, 2014, 1:37:32 PM
to julia...@googlegroups.com
The symmetric version is BLAS.syrk!, the symmetric rank-k update.
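That is, one call per batch doing roughly half the work of gemm! (again with ``BLAS`` = ``Base.LinAlg.BLAS``):

```julia
import Base.LinAlg.BLAS

# C := 1.0 * X'X + 1.0 * C, computed and stored in the lower ('L') triangle only
X = rand(100, 5)
C = zeros(5, 5)
BLAS.syrk!('L', 'T', 1.0, X, 1.0, C)
# tril(C) now matches tril(X' * X); the upper triangle is left at zero
```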

Andrei Zh

Dec 14, 2014, 3:39:41 PM
to julia...@googlegroups.com
Thanks for all your answers. Eventually, I came up with code that boils down to the following (more structured code can be found in [1]):

# axpy! and syrk! are from Base.LinAlg.BLAS
x_sums = zeros(Float64, n_vars)
cross_sums = zeros(Float64, n_vars, n_vars)
n = 0
while more_rows()
    X = read_more_rows()
    axpy!(1.0, sum(X, 1), x_sums)              # x_sums += column sums of the batch
    syrk!('L', 'T', 1.0, X, 1.0, cross_sums)   # cross_sums += X'X (lower triangle)
    n += size(X, 1)
end
m = x_sums / n                                 # mean of each variable
C = (cross_sums - n * (m * m')) / (n - 1)
for j = 1:n_vars, i = 1:j
    C[i, j] = C[j, i]                          # copy lower triangle to upper
end
The last loop is needed to turn the lower-triangular matrix produced by ``syrk!()`` into a full covariance matrix, though there may be a more efficient/elegant way to do this.

Andreas Noack

Dec 14, 2014, 4:02:32 PM
to julia...@googlegroups.com
Depending on your use case, you could consider storing only the Cholesky factorization, or wrapping the result in the Symmetric type. Both solutions make the explicit symmetrization unnecessary. Alternatively, you could use the Base.LinAlg.copytri! function for this.
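For example (a minimal illustration of the latter two options, argument forms from memory):

```julia
# Only the lower triangle of C is meaningful here
C = [1.0  0.0;
     5.0  2.0]

# Option 1: wrap it; Symmetric is a view over one triangle, no copying happens
S = Symmetric(C, :L)            # behaves as [1.0 5.0; 5.0 2.0]

# Option 2: fill the upper triangle in place
Base.LinAlg.copytri!(C, 'L')    # C is now [1.0 5.0; 5.0 2.0]
```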

Andrei Zh

Dec 15, 2014, 4:32:19 PM
to julia...@googlegroups.com
It seems Symmetric doesn't support some of the operations that plain Arrays do, so I ended up using Base.LinAlg.copytri!(). Thanks!