Multi-threaded Cholesky Decomposition, etc.


Eric Brown

Jan 17, 2013, 11:38:13 PM
to stan-...@googlegroups.com
I have noticed (or at least believe) that some of my models spend a lot of time in the Cholesky decomposition, as I am using the multivariate Matt trick.

I believe that this function comes from Eigen, and if the code is compiled with -fopenmp then Eigen may use multiple threads. I have set the OMP_NUM_THREADS environment variable, but I don't see cholesky_decompose using multiple processors.

Has anyone been able to get multiple processors working for this operation and/or any others, like matrix multiplication?

Eric Brown

Jan 17, 2013, 11:42:19 PM
to stan-...@googlegroups.com
P.S. I have doctored up my Makeconf file with -fopenmp flags so that the stan fit object reports that the model was compiled with -fopenmp (I'm using gfortran).

Marcus Brubaker

Jan 18, 2013, 12:00:57 AM
to stan-...@googlegroups.com
Hi Eric,

Parallelization probably won't help in this regard, at least not right now.  The main bottleneck is not the computation itself but the instantiation of the automatic differentiation variables necessary to differentiate through the Cholesky decomposition.  I'm not certain how the allocation will handle OpenMP; it may require locking, which would defeat the purpose of the parallelization.  We haven't really tested any form of parallelization at that level as far as I'm aware.

Cheers,
Marcus




Jeffrey Arnold

Jan 18, 2013, 12:12:14 AM
to stan-...@googlegroups.com
I have tried compiling Stan and Stan models with the -fopenmp flag.  You'll likely run into nasty errors when compiling models sooner or later; e.g., this bug report of mine: http://code.google.com/p/stan/issues/detail?id=80

Ben Goodrich

Jan 18, 2013, 1:31:31 AM
to stan-...@googlegroups.com

Instead of defining a covariance matrix as a parameter and calculating its Cholesky factor as a local variable (which involves a lot of auto-diff stuff), I think it would be better to do the reverse and define the (free elements of the) Cholesky factor as a parameter and (if necessary) calculate the covariance matrix as a local variable or generated quantity. There is an example in section 13.1 of the manual that implies a Wishart prior on the covariance matrix. In other words, something like this

data {
  int<lower=1> K;
  cov_matrix[K] S;   // scale matrix
  real<lower=K> nu;  // degrees of freedom
}
transformed data {
  matrix[K,K] L_of_S;
  L_of_S <- cholesky_decompose(S);
}
parameters {
  real subdiagonals[0.5 * K * (K - 1)];
  real<lower=0> diagonals[K];
}
model {
  matrix[K,K] L;
  int count;
  count <- 1;
  for (j in 1:K) {
    L[j,j] <- sqrt(diagonals[j]);
    for (i in 1:(j - 1)) {
      L[i,j] <- subdiagonals[count];
      count <- count + 1;
    }
    for (i in (j+1):K) {
      L[i,j] <- 0.0;
    }
  }

  // priors imply L * L_of_S * L_of_S' * L' ~ Wishart(nu, S);
  subdiagonals ~ normal(0.0, 1.0);
  for (k in 1:K) {
    diagonals[k] ~ chi_squared(nu - k + 1);
  }

  // rest of model
}
generated quantities {
  cov_matrix[K] Sigma;
  Sigma <- multiply_lower_tri_self_transpose(L * L_of_S);
}

Ben

Ben Goodrich

Jan 18, 2013, 1:30:35 AM
to stan-...@googlegroups.com
I said it wrong: L * L_of_S * L_of_S' * L' is distributed Wishart(nu,S), rather than L * L'. Editing my post now.

Ben


On Friday, January 18, 2013 1:15:58 AM UTC-5, Ben Goodrich wrote:
Instead of defining a covariance matrix as a parameter and calculating its Cholesky factor as a local variable (which involves a lot of auto-diff stuff), I think it would be better to do the reverse and define the (free elements of the) Cholesky factor as a parameter and (if necessary) calculate the covariance matrix as a local variable or generated quantity. There is an example in section 13.1 of the manual that implies a Wishart prior on the covariance matrix. In other words, something like this

data {
  int<lower=1> K;
  cov_matrix[K] S;   // scale matrix
  real<lower=K> nu;  // degrees of freedom
}
transformed data {
  matrix[K,K] L_of_S;
  L_of_S <- cholesky_decompose(S);
}
parameters {
  real subdiagonals[0.5 * K * (K - 1)];
  real<lower=0> diagonals[K];
}
model {
  matrix[K,K] L;
  int count;
  count <- 1;
  for (j in 1:K) {
    L[j,j] <- sqrt(diagonals[j]);
    for (i in 1:(j - 1)) {
      L[i,j] <- subdiagonals[count];
      count <- count + 1;
    }
    for (i in (j+1):K) {
      L[i,j] <- 0.0;
    }
  }

  // priors imply L * L' ~ Wishart(nu, S);
  subdiagonals ~ normal(0.0, 1.0);
  for (k in 1:K) {
    diagonals[k] ~ chi_squared(nu - k + 1);
  }

  // rest of model
}
generated quantities {
  cov_matrix[K] Sigma;
  Sigma <- multiply_lower_tri_self_transpose(L);
}

Ben

Bob Carpenter

Jan 18, 2013, 2:30:03 PM
to stan-...@googlegroups.com
We certainly haven't tested with OpenMP.

But the reason I closed the issue was that
the bulk of our time is spent on matrix calcs
that involve parameters and hence need gradients,
and hence can't be easily sped up with things
like OpenMP.

Also, the rest of what we do uses double variables
for precision, which can also be harder to speed
up with hardware, much of which is targeted at
32-bit values.

- Bob
terrance savitsky

Jan 21, 2013, 8:33:11 PM
to stan-...@googlegroups.com
Ben, per your guidance in the attached note, would the geometry / mixing improve if one samples a covariance matrix through the parameters of its Cholesky decomposition, with the off-diagonals drawn from standard normals and the diagonals from inverse gammas or folded normals, on the one hand, versus decomposing the covariance matrix into a correlation matrix and a vector of scales for sampling, on the other hand?

Terrance



Bob Carpenter

Jan 21, 2013, 11:03:10 PM
to stan-...@googlegroups.com
That really depends on how you do it!

What matters is the position-invariance of the Hessian
of the unconstrained log probability function. That is,
you don't want the Hessian of log p(theta) to vary with the parameters
theta, especially in terms of scale. See the funnel example
in the manual for an example of what goes wrong when the posterior
scale is position dependent.
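
Roughly, the example looks something like this (sketched from memory; the dimensions are arbitrary):

parameters {
  real y;         // controls the scale of x
  vector[9] x;
}
model {
  y ~ normal(0, 3);
  // the scale of x depends on y, so the posterior curvature
  // (and hence the right step size) changes across the space
  x ~ normal(0, exp(y / 2));
}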

The transform for Stan's cov_matrix data type
is based on a Cholesky factor representation of
the covariance matrix with the diagonal log-transformed
to give it unbounded support. The correlation matrix
transform is more involved, but involves (n choose 2)
parameters representing canonical partial correlations
(again see the doc on transforms later in the manual).

That's a long way of saying the geometry's a bit too hard
for me to comprehend directly. Maybe Ben or Michael will
have some insight.

For interpretability, I like the scaled lkj_correlation
densities, which can shrink the posterior correlation matrix
toward the identity.

- Bob

Ben Goodrich

Jan 22, 2013, 12:04:29 AM
to stan-...@googlegroups.com
On Monday, January 21, 2013 8:33:11 PM UTC-5, tds151 wrote:
Ben, per your guidance in the attached note, would the geometry / mixing improve if one samples a covariance matrix through the parameters of its Cholesky decomposition, with the off-diagonals drawn from standard normals and the diagonals from inverse gammas or folded normals, on the one hand, versus decomposing the covariance matrix into a correlation matrix and a vector of scales for sampling, on the other hand?

With standard normal off-diagonals and diagonals being the square roots of chi-squared parameters with particular degrees of freedom, we know the implied covariance matrix has a Wishart distribution. If you put inverse gammas or folded normals on the diagonal, then the implied prior is something else, probably without a name. I don't like imposing priors that I don't understand the implications of, but that doesn't stop some people. But as long as the off-diagonals have independent standard normal priors, the geometry is probably pretty good along those dimensions of the parameter space.

The lkj_corr() prior is equivalent to a bunch of independent student t priors on the primitives, which also probably have good geometry, although not quite as good as standard normals. What you do with the priors on the standard deviations is another question, but as long as the priors are independent, I doubt it would run into too much trouble.

At the moment, it is not possible to get a Cholesky factor of a correlation matrix, except by declaring a corr_matrix as a parameter and taking the Cholesky decomposition of it, which will spend a lot of wall time on the auto-diff. This is sort of a shame because Stan creates the corr_matrix internally by multiplying its Cholesky factor by its transpose. You could write Stan code to mimic the C++ that creates a Cholesky factor of a correlation matrix, but the code is somewhat opaque. In contrast, constructing a Cholesky factor of a covariance matrix is pretty easy in Stan code, and about the only thing to remember is to make the diagonal positive.

Ben

Asim

Jan 22, 2013, 9:18:33 AM
to stan-...@googlegroups.com

Here is another prior on Covariance matrices that you may find useful.

http://www.uow.edu.au/~mwand/wispap.pdf

Asim

Ben Goodrich

Jan 22, 2013, 10:00:30 AM
to stan-...@googlegroups.com
On Tuesday, January 22, 2013 9:18:33 AM UTC-5, Asim wrote:

Here is another prior on Covariance matrices that you may find useful.

http://www.uow.edu.au/~mwand/wispap.pdf


Thanks. The authors sent the first draft of that paper to us for comments a few months ago and it is even better now. That said, I'm not sure it is especially applicable to Stan. One of the virtues of that distribution is that everything is conjugate, so it is easy to use in a Gibbs sampler. There is nothing wrong with using conjugate priors in Stan if they correspond to your prior beliefs, but they don't yield any computational benefits for an HMC sampler.

The other thing is that the authors claim that the fact that the correlations are all *marginally* uniform for a certain choice of a hyperparameter is a good thing. To me, this is a bad thing because marginally uniform correlations are *jointly* non-uniform in an extreme and implausible way, where the density is concentrated in the corners of the admissible parameter space so that the correlation matrix is almost singular. I much prefer the lkj_corr prior for correlation matrices, which for the default value of its hyperparameter is *jointly* uniform over correlation matrices of a given size (and for what it is worth, implies that the correlations are marginally beta distributed after a shift-and-scale). Then you could put half-t priors (or something else) on the standard deviations and go from there.
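
For example, something along these lines (just a sketch; the hyperparameter values are illustrative, not a recommendation):

data {
  int<lower=1> K;
}
parameters {
  corr_matrix[K] Omega;      // correlation matrix
  vector<lower=0>[K] sigma;  // standard deviations
}
model {
  matrix[K,K] Sigma;
  Omega ~ lkj_corr(1.0);     // jointly uniform over correlation matrices
  sigma ~ cauchy(0, 2.5);    // half-Cauchy (a half-t) because of the lower bound
  Sigma <- diag_matrix(sigma) * Omega * diag_matrix(sigma);
  // ... then use Sigma in the likelihood
}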

Ben

terrance savitsky

Jan 22, 2013, 1:12:04 PM
to stan-...@googlegroups.com
Thanks Ben and Asim. My question on the direct construction of the Cholesky decomposition of a covariance matrix was motivated by the possibility of improving mixing, which is a gate the model must pass through to be usable.  My model has three parameter types: multivariate random effects, ordered cutpoints, and correlation matrices with associated scales.  Sampling pre-transformed standard normal versions of the first two does not produce adequate mixing performance, so I've begun to wonder about the covariance matrix, which I employ only through its Cholesky factor.  In particular, since standard normals are the ideal geometry for Stan, I keyed on Ben's comment about constructing a covariance matrix under an IW through the Cholesky parameters.  I want to avoid the property of the IW where correlations may be dependent on the variances/scales, but still use a similar construction mechanism (through the Cholesky) as with the IW if this approach improves mixing performance; otherwise, I would use other approaches for constructing covariance matrices.

terrance.



terrance savitsky

Jan 22, 2013, 8:04:13 PM
to stan-...@googlegroups.com
Two questions:

1. The following declaration won't work because Stan interprets the computed index as real, not integer: real subdiagonals[0.5 * K * (K - 1)];. Fixing it by assigning the result to an integer-declared variable doesn't work either.
2. If one wants to compose the L[i,j] as transformed parameters so that one can extract the transformed standard normal parameters (against which L is applied), will declaring "count" in a transformed data block allow it to be used in a transformed parameters block?

Terrance.



Bob Carpenter

Jan 22, 2013, 9:55:59 PM
to stan-...@googlegroups.com


On 1/22/13 8:04 PM, terrance savitsky wrote:
> Two questions:
>
> 1. The following declaration won't work because Stan interprets the computed index as real, not integer:
>    real subdiagonals[0.5 * K * (K - 1)];

You should use (K * (K - 1)) / 2.

That will produce an integer result.

Any operation involving a real produces a real result.

You could use real to integer conversion like floor(),
but it's better just to do the arithmetic with integers if
you know you'll get an integer result.
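
So the declaration from the earlier example would become something like:

parameters {
  real subdiagonals[(K * (K - 1)) / 2];  // integer division, since K is declared int
  real<lower=0> diagonals[K];
}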

> fixing it by assigning the result to an integer-declared variable doesn't work.

Right -- you can't assign a real to an integer in Stan.
Or in Java. C++ will let you do it, and of course,
R, etc., are very free with their typing and assignments.

> 2. If one wants to compose the L[i,j] as transformed parameters so that one can extract the transformed standard normal
> parameters (against which L is applied), will declaring "count" in a transformed data block allow it to be used in a
> transformed parameters block?

Yes, you can use transformed data in the transformed
parameter block.

You can use a variable in a statement any time after
it has been declared. Variables in dimension declarations
have to be data.

- Bob

terrance savitsky

Jan 23, 2013, 5:21:14 PM
to stan-...@googlegroups.com
Thanks, Bob, but I receive an 'expectation failure location' message when attempting to use the integer "count" to fill in elements of a matrix of transformed parameters, as shown below.  Is there some way I can use convenience variables like "count", which aren't parameters, in the transformed parameters block?  I need L_t and L_s to be transformed parameters, so I can't move the whole thing to the model block.


transformed data {
  int<lower=1> count;
}

transformed parameters {
  /* parameters of a cholesky decomposition */
  matrix[D,D] L_t;
  matrix[Ds,Ds] L_s;

  count <- 1;
  for (j in 1:D) {
    L_t[j,j] <- 1;
    for (i in 1:(j - 1)) {
      L_t[i,j] <- subdiagonals_t[count];
      count <- count + 1;
    }
    for (i in (j+1):D) {
      L_t[i,j] <- 0.0;
    }
  }
  L_t <- diag_post_multiply(L_t, diagonals_t);

  count <- 1;
  for (j in 1:Ds) {
    L_s[j,j] <- 1;
    for (i in 1:(j - 1)) {
      L_s[i,j] <- subdiagonals_s[count];
      count <- count + 1;
    }
    for (i in (j+1):Ds) {
      L_s[i,j] <- 0.0;
    }
  }
  L_s <- diag_post_multiply(L_s, diagonals_s);
}




Bob Carpenter

Jan 23, 2013, 6:51:14 PM
to stan-...@googlegroups.com
You need to make count a local variable in the transformed
parameters block. Variables can only be assigned in the block
in which they are declared.

You can always put brackets around a group of statements and
declare local variables at the top, so it'd just be:

{
int count;
count <- 1;
...
}

and count is defined and assignable within the
brackets but doesn't escape outside of them.

- Bob