Multivariate normal parameterizations - difficulties with precision/Cholesky


Kyle Foreman

Mar 29, 2013, 8:11:32 PM3/29/13
to stan-...@googlegroups.com
Hi list,

I've been playing around with various ways to sample from a CAR/GMRF, as you may have seen from my posts last month trying to replicate the GeoBUGS lip cancer example.

It works with either multi_normal() or by decrementing lp__ directly with the MVN log density (multi_normal.stan.txt and lp_method.stan.txt, respectively). Both methods get the same results (which are close to, but not exactly the same as, GeoBUGS). Using lp__ is significantly faster: multi_normal() takes 344s for 2k samples including compilation, while lp__ takes only 35s with compilation.

I'm having trouble getting multi_normal_prec() or multi_normal_cholesky() to work, however.

multi_normal_prec() samples but I get a really poor fit. The random effects are all effectively stuck at 0 and the lp is much worse (826 for the first two models, 2015 for the precision version). Unclear why this is happening. It runs a bit slower than the lp__ method but faster than the regular multi_normal() at 55s.

multi_normal_cholesky() I can't test because I'm unable to run cholesky_decompose() on my covariance matrix. I get an error that it isn't symmetric [Error : error in call to cholesky decomposition; require symmetric matrix, but found; x[0,3]=1.41057e-07:0; x[3,0]=1.41057e-07:0], which I assume may be related to tolerance since the covariance matrix itself is symmetric... (I recognize that the way I have it written out in multi_normal_cholesky.stan.txt negates any speedups from using the cholesky parameterization, but I'm just trying to work backwards from the multi_normal() version to test it out to start).

Any suggestions on what might be happening with the precision or cholesky formulations? I've attached the data/inits and code to run all 4 models (run_models.R.txt).

Also, any thoughts on whether a proper multi_normal_cholesky() implementation would be faster than the lp__ method? It'd certainly look cleaner...

Thanks,
Kyle
lip_cancer.data.R.txt
lip_cancer.init.R.txt
lp_method.stan.txt
multi_normal_cholesky.stan.txt
multi_normal_prec.stan.txt
multi_normal.stan.txt
run_models.R.txt

Bob Carpenter

Mar 29, 2013, 8:34:56 PM3/29/13
to stan-...@googlegroups.com
On 3/29/13 8:11 PM, Kyle Foreman wrote:
> Hi list,
>
> I've been playing around with various ways to sample from a CAR/GMRF, as you may have seen from my posts last month
> <https://groups.google.com/forum/#!topic/stan-users/he7BxxHfBfk> trying to replicate the GeoBUGS lip cancer example.
>
> It works with either multi_normal() or lp__ -= with a MVN density (multi_normal.stan.txt and lp_method.stan.txt
> respectively). Either method gets the same results (which are close but not the same as GeoBUGS). Using lp__ is
> significantly faster (multi_normal() takes 344s for 2k samples including compilation, lp__ only 35s with compilation).

Thanks for attaching the code. The reason the lp__ version is
faster is that it's not checking if the covariance matrix is
positive definite (there are some other checks for finiteness,
but they're fast).

At some point, we'll start evaluating the derivatives
more efficiently inside the probability function compared
to writing out the expression.

We've discussed before whether to keep the error checking
given how expensive it is. We can probably optimize the code
a bit further.

The Cholesky version should be much faster, either called directly
or implemented directly with lp__.

(Also, I noticed an unused "matrix[N,N] Sigma" declaration in the lp_method.stan
example.)


> I'm having trouble getting multi_normal_prec() or multi_normal_cholesky() to work, however.
>
> multi_normal_prec() samples but I get a really poor fit. The random effects are all effectively stuck at 0 and the lp is
> much worse (826 for the first two models, 2015 for the precision version). Unclear why this is happening. It runs a bit
> slower than the lp__ method but faster than the regular multi_normal() at 55s.
>
> multi_normal_cholesky() I can't test because I'm unable to run cholesky_decompose() on my covariance matrix. I get an
> error that it isn't symmetric [Error : error in call to cholesky decomposition; require symmetric matrix, but found;
> x[0,3]=1.41057e-07:0; x[3,0]=1.41057e-07:0], which I assume may be related to tolerance since the covariance matrix
> itself is symmetric... (I recognize that the way I have it written out in multi_normal_cholesky.stan.txt negates any
> speedups from using the cholesky parameterization, but I'm just trying to work backwards from the multi_normal() version
> to test it out to start).

I didn't understand the parameterization here.
Maybe Ben will.

Absolutely --- this looks like a tolerance issue.
You can always symmetrize explicitly:

    for (i in 1:N)
      for (j in (i+1):N)
        Tau[i,j] <- Tau[j,i];

Given that you have:

Tau <- (tau * tau) * (D - p * A);

and D is clearly diagonal and tau and p are
simple scalars, I don't see how it could not be
exactly symmetric unless A wasn't symmetric coming
in. So you could probably just symmetrize A if
that's where the issue is.

There are also two things we could do internally to
correct for this. First, we could increase the error
tolerance -- your inputs look like they're identical to
6 decimal places. Second, we could use a symmetric
triangular view.
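A quick way to see this kind of tolerance failure in floating point (a NumPy sketch with a made-up matrix, not the actual lip-cancer data):

```python
import numpy as np

# Build an exactly symmetric positive-definite matrix, analogous to D - p * A.
rng = np.random.default_rng(0)
B = rng.normal(size=(5, 5))
Q = B @ B.T + 5.0 * np.eye(5)

# Inversion runs in floating point, so the result is typically symmetric
# only up to rounding error -- enough to trip a strict symmetry check.
Sigma = np.linalg.inv(Q)
print(np.abs(Sigma - Sigma.T).max())  # tiny, but often not exactly zero

# Explicit symmetrization makes a strict check pass exactly.
Sigma_sym = 0.5 * (Sigma + Sigma.T)
print(np.abs(Sigma_sym - Sigma_sym.T).max())  # exactly 0.0
```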

> Any suggestions on what might be happening with the precision or cholesky formulations? I've attached the data/inits and
> code to run all 4 models (run_models.R.txt).
>
> Also, any thoughts on whether a proper multi_normal_cholesky() implementation would be faster than the lp__ method? It'd
> certainly look cleaner...

See above.

- Bob

Kyle Foreman

Mar 29, 2013, 9:01:59 PM3/29/13
to stan-...@googlegroups.com
Thanks, Bob! 

A is indeed symmetric (checked in R via isSymmetric(A) => TRUE), so it must be a tolerance issue.

The Tau code you mention is for the multi_normal_prec() version, which doesn't have problems with symmetry. But I applied the same idea to the cholesky version (attached):

// covariance matrix
Sigma <- sigma_sq * inverse(D - p * A);
for (i in 1:N)
  for (j in 1:i)
    Sigma[i,j] <- Sigma[j,i];
chol_Sigma <- cholesky_decompose(Sigma);

// CAR model
beta1 ~ multi_normal_cholesky(zeros, chol_Sigma);

It's rather slow at 285s (as expected, since it's still doing a cholesky decomposition every iteration plus the new loop to force symmetry), but it does get the right fit (i.e. same as the first two models).

So yes, I think allowing for higher tolerances would be useful. But I think I should be able to do the decomposition outside of the model block and thus speed things up significantly. Need to pull out my old matrix algebra book though, as I haven't dealt with that in ages...
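For reference, one standard way to move the expensive linear algebra out of the model block in CAR models is a determinant identity: since D is diagonal, log|D - p*A| can be written in terms of the eigenvalues of D^{-1/2} A D^{-1/2}, which depend only on the data and so can be computed once. A NumPy sketch of the algebra (my own illustration with a random adjacency matrix, not code from this thread):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 8
A = (rng.random((N, N)) < 0.3).astype(float)
A = np.triu(A, 1); A = A + A.T        # symmetric 0/1 adjacency, zero diagonal
d = A.sum(axis=1) + 1.0               # diagonal of D, kept strictly positive here
D = np.diag(d)

# One-time work, data only: eigenvalues of D^{-1/2} A D^{-1/2}.
lam = np.linalg.eigvalsh(np.diag(d ** -0.5) @ A @ np.diag(d ** -0.5))

def logdet_fast(p):
    # log|D - p*A| = sum(log d_i) + sum(log(1 - p*lambda_i)),
    # so each iteration only needs O(N) work once lam is in hand.
    return np.sum(np.log(d)) + np.sum(np.log1p(-p * lam))

p = 0.4
_, logdet_direct = np.linalg.slogdet(D - p * A)
print(logdet_fast(p) - logdet_direct)  # ~0
```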
multi_normal_cholesky.stan.txt

Ben Goodrich

Mar 29, 2013, 10:59:46 PM3/29/13
to stan-...@googlegroups.com
On Friday, March 29, 2013 8:34:56 PM UTC-4, Bob Carpenter wrote:
The Cholesky version should be much faster, either called directly
or implemented directly with lp__.

In this kind of model, the fastest way is going to be with multi_normal_prec() or the equivalent by manipulating lp__. That is because the precision matrix is directly parameterized, so there is no need for matrix decompositions. And Marcus did some optimizing here.
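For concreteness, the precision-parameterized MVN density needs only a quadratic form and a log determinant (a NumPy sketch of the algebra, not Stan's actual implementation):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Zero-mean MVN log density in terms of a precision matrix Tau:
#   log p(x) = 0.5*log|Tau| - 0.5 * x' Tau x - (N/2)*log(2*pi)
# No inverse or Cholesky factor of a covariance matrix is required.
def mvn_prec_logpdf(x, Tau):
    _, logdet = np.linalg.slogdet(Tau)
    return 0.5 * logdet - 0.5 * x @ Tau @ x - 0.5 * len(x) * np.log(2.0 * np.pi)

rng = np.random.default_rng(1)
B = rng.normal(size=(4, 4))
Tau = B @ B.T + 4.0 * np.eye(4)   # a symmetric positive-definite precision
x = rng.normal(size=4)

# Cross-check against the covariance-based density.
ref = multivariate_normal(mean=np.zeros(4), cov=np.linalg.inv(Tau)).logpdf(x)
print(mvn_prec_logpdf(x, Tau) - ref)  # ~0
```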
 
So you could probably just symmetrize A if
that's where the issue is.

That would be best if you were going to do something that checks for symmetry.
 
There are also two things we could do internally to
correct for this.  First, we could increase the error
tolerance -- your inputs look like they're identical to
6 decimal places.  

The tolerance is already 1e-8. I don't know how much we can afford to weaken it further.

Ben

Kyle Foreman

Mar 29, 2013, 11:43:35 PM3/29/13
to stan-...@googlegroups.com

Thanks Ben! Any idea why the multi_normal_prec() version doesn't seem to work? It's not in the manual, but I assumed I should just pass in the precision matrix, i.e. the inverse of the covariance matrix. But it predicts all zeroes and gives a much higher lp than the covariance parameterization (as well as being a bit slower, probably due to the checks for symmetry, etc.).

--
You received this message because you are subscribed to a topic in the Google Groups "stan users mailing list" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/stan-users/S8dhFpn0Lg8/unsubscribe?hl=en.
To unsubscribe from this group and all its topics, send an email to stan-users+...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Bob Carpenter

Mar 30, 2013, 12:11:01 PM3/30/13
to stan-...@googlegroups.com


On 3/29/13 10:59 PM, Ben Goodrich wrote:
> On Friday, March 29, 2013 8:34:56 PM UTC-4, Bob Carpenter wrote:
>
> The Cholesky version should be much faster, either called directly
> or implemented directly with lp__.
>
>
> In this kind of model, the fastest way is going to be with multi_normal_prec() or the equivalent by manipulating lp__.
> That is because the precision matrix is directly parameterized, so there is no need for matrix decompositions.

Thanks for the correction. Listen to Ben on multivariate
models, not me!

Any idea why the precision's not working for Kyle? It
looks like all the tests are in place for values and
autodiff.

> And
> Marcus did some optimizing here.

The big gain will come from direct vectorizations of
the probability function.

> So you could probably just symmetrize A if
> that's where the issue is.
>
>
> That would be best if you were going to do something that checks for symmetry.
>
> There are also two things we could do internally to
> correct for this. First, we could increase the error
> tolerance -- your inputs look like they're identical to
> 6 decimal places.
>
>
> The tolerance is already 1e-8. I don't know how much we can afford to weaken it further.

OK. I'm out of my depth numerical analysis-wise.

We can work on making it possible to disable testing.
The refactoring of math should go a long way to making this
easier. Our previous error handling was rather inconsistent
in implementation and it's all easier to see and fix now.

- Bob

Ben Goodrich

Mar 30, 2013, 2:12:14 PM3/30/13
to stan-...@googlegroups.com
On Saturday, March 30, 2013 12:11:01 PM UTC-4, Bob Carpenter wrote:
Any idea why the precision's not working for Kyle?  It
looks like all the tests are in place for values and
autodiff.

Comparing the test_grad of the lp__ method and the multi_normal_prec() method reveals some differences in the log posterior and the partial derivatives with respect to the last two parameters, which are

    real<lower=1e-5> tau;       // precision of CAR
    real<lower=0,upper=1> p;    // strength of spatial correlation

Namely the partials seem to differ by a factor of two:

> g_lp <- stan(file = 'lp_method.stan', data = dat, init = init,
+             chains = 1, test_grad = TRUE)

TRANSLATING MODEL 'lp_method' FROM Stan CODE TO C++ CODE NOW.
COMPILING THE C++ CODE FOR MODEL 'lp_method' NOW.
TESTING GRADIENT FOR MODEL 'lp_method' NOW (CHAIN 1).

TEST GRADIENT MODE

Log probability=683.794

 param idx           value           model     finite diff           error
         0               0            -0.2            -0.2     1.58657e-07
...
        58    -1.00001e-05         56.4989         56.4989    -9.02741e-09
        59               0        -1.28883        -1.28883    -2.64132e-08

versus

> g_mnp <- stan(file = 'multi_normal_prec.stan', data = dat, init = init,
+              chains = 1, test_grad = TRUE)

TRANSLATING MODEL 'multi_normal_prec' FROM Stan CODE TO C++ CODE NOW.
COMPILING THE C++ CODE FOR MODEL 'multi_normal_prec' NOW.
TESTING GRADIENT FOR MODEL 'multi_normal_prec' NOW (CHAIN 1).

TEST GRADIENT MODE

Log probability=721.016

 param idx           value           model     finite diff           error
         0               0            -0.2            -0.2     1.58657e-07
...
        58    -1.00001e-05         112.498         112.498    -1.68086e-08
        59               0        -2.57767        -2.57767       4.017e-09


If we do it in doubles and keep the constants, we get

> log_prob(g_lp, pars = unlist(init))
[1] 739.4773
> log_prob(g_mnp, pars = unlist(init))
[1] 831.9454

Kyle, do you know which is the correct log posterior for this data-generating process?

Ben


Kyle Foreman

Mar 30, 2013, 6:28:08 PM3/30/13
to stan-...@googlegroups.com

If we do it in doubles and keep the constants, we get

> log_prob(g_lp, pars = unlist(init))
[1] 739.4773
> log_prob(g_mnp, pars = unlist(init))
[1] 831.9454

Kyle, do you know which is the correct log posterior for this data-generating process?

I believe that the first one (g_lp) is correct (basing that assumption just on the behavior of the model fitting). I just wrote up the model in PyMC to try to confirm, but it gives -446 for logp, which makes me think something is different in the treatment of constants between the two.

I am confident that the lp_method and multi_normal() models are giving "correct" fits. They give the same mean values as the version I coded up in pymc (attached for the sake of completeness), which uses the precision matrix for the multivariate normal.

Does that at least partially answer your question, Ben? I'll look more into what rstan's log_prob() is calculating exactly, maybe I'm missing something there.

And thanks again for the help!
lip_cancer_pymc.py.txt

Ben Goodrich

Mar 30, 2013, 6:52:44 PM3/30/13
to stan-...@googlegroups.com

I think the lp__ method is correct too, but it would be nice to confirm the bug is in multi_normal_prec().  log_prob() should be calculating the log posterior the same way that it would be calculated in R or BUGS.

Ben

Marcus Brubaker

Mar 31, 2013, 12:02:47 PM3/31/13
to stan-...@googlegroups.com
I took a quick look at multi_normal_prec and it looks like there was a
bug with the determinant term. I'm not sure if this is what was
causing your problem, but it could be related.

Cheers,
Marcus

Ben Goodrich

Mar 31, 2013, 2:34:10 PM3/31/13
to stan-...@googlegroups.com
Yeah, the log posterior and the gradients are the same in master now.

Kyle Foreman

Mar 31, 2013, 6:16:47 PM3/31/13
to stan-...@googlegroups.com
It works great with the 0.5 fix, thanks so much for the help everyone!

Bob Carpenter

Mar 31, 2013, 6:25:29 PM3/31/13
to stan-...@googlegroups.com
We really appreciate the bug reports. And thanks
for being so patient. This one was our fault
for not testing enough, so we're doubly appreciative.

- Bob

On 3/31/13 6:16 PM, Kyle Foreman wrote:
> It works great with the 0.5 fix, thanks so much for the help everyone!
>

Kyle Foreman

Apr 1, 2013, 12:18:13 AM4/1/13
to stan-...@googlegroups.com
Thanks, Bob. I must say I'm incredibly impressed with how well the software works already, though - early stage academic software isn't supposed to be this good!

On another note, I did some benchmarking to see how long the various checks take for the multi_normal_prec() implementation. If I comment out all the checks for multi_normal_prec_log (i.e. lines 552-584 and 663-695) it results in about a 25% speed up. But using the code Ben helped with a couple weeks ago to use lp__ directly is even faster, about 50% faster than multi_normal_prec(). So I'm guessing there's some additional overhead in the multi_normal_prec() function that slows it down a bit. Of course they're all much, much faster than the basic multi_normal() which uses the covariance matrix.

Sampling times for 10k iterations:
multi_normal()                      1190s
multi_normal_prec()                 61s
multi_normal_prec with no checks    47s
lp__ method                         32s

Anyhow, just wanted to post that for reference in case anyone else is trying to optimize multivariate normals in their models. For my purposes, I'll probably stick with the basic multi_normal_prec() during development and then swap in the lp__ method instead when I'm doing big runs.

Ben Goodrich

Apr 1, 2013, 3:03:38 AM4/1/13
to stan-...@googlegroups.com
On Monday, April 1, 2013 12:18:13 AM UTC-4, Kyle Foreman wrote:
But using the code Ben helped with a couple weeks ago to use lp__ directly is even faster, about 50% faster than multi_normal_prec(). So I'm guessing there's some additional overhead in the multi_normal_prec() function that slows it down a bit.

Part of the issue is that in this line

lp__ <- lp__ - 0.5 / (sigma * sigma) * (beta1' * D * beta1 - p * (beta1' * A * beta1));

the partial derivative wrt p is very simple, which is why I wrote it the long way. Unfortunately, although the partial derivative of

lp__ <- lp__ - 0.5 * N * log(sigma * sigma) + 0.5 * log(determinant(D - p * A));

also turns out to be pretty simple, we can't exploit that fact in Stan code. Eventually, we might write some C++ that does so.

But in contrast, if you do

Tau <- (tau * tau) * (D - p * A);

beta1 ~ multi_normal_prec(zeros, Tau);

Stan does a massive chain-rule computation to obtain the partial derivative with respect to p.

Basically, the general rule is never do a calculation that mixes a (function of a) parameter with a (function of a) constant unless you have to.
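A quick finite-difference check of that claim (a NumPy illustration, not Stan internals): the derivative of the quadratic-form term with respect to p reduces to a single quadratic form in the data, which is what writing it out the long way lets autodiff exploit:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 6
A = rng.normal(size=(N, N)); A = 0.5 * (A + A.T)   # symmetric stand-in for the adjacency matrix
D = np.diag(np.abs(A).sum(axis=1) + 1.0)           # diagonal, as in the CAR precision
beta = rng.normal(size=N)
sigma = 0.7

# The quadratic-form term, written "the long way" as in the lp__ model:
def term(p):
    return -0.5 / sigma**2 * (beta @ D @ beta - p * (beta @ A @ beta))

# Its partial derivative w.r.t. p is a single quadratic form,
# and beta' A beta does not depend on p at all.
analytic = 0.5 / sigma**2 * (beta @ A @ beta)

eps = 1e-6
finite_diff = (term(0.3 + eps) - term(0.3 - eps)) / (2 * eps)
print(analytic - finite_diff)  # ~0
```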

Ben

Bob Carpenter

Apr 1, 2013, 12:32:56 PM4/1/13
to stan-...@googlegroups.com


On 4/1/13 12:18 AM, Kyle Foreman wrote:
> Thanks, Bob. I must say I'm incredibly impressed with how well the software works already, though - early stage academic
> software isn't supposed to be this good!

I spent the last ten years in professional software
dev, so to me it feels like we're being very sloppy!

It is getting better, though --- we're most of the way
through refactoring the code base, so it should all
be much more transparent and have better error checking.

> Sampling times for 10k iterations:
> multi_normal()                      1190s
> multi_normal_prec()                 61s
> multi_normal_prec with no checks    47s
> lp__ method                         32s

We'll be speeding up multi_normal_prec() and the
other multivariate distributions soon. We had to
finalize the design for vectorizing them first.
And now we have to calculate a bunch of vector
derivatives so we can get the kind of advantages Ben
was talking about.

> Anyhow, just wanted to post that for reference in case anyone else is trying to optimize multivariate normals in their
> models. For my purposes, I'll probably stick with the basic multi_normal_prec() during development and then swap in the
> lp__ method instead when I'm doing big runs.

That's a very good strategy. Get a model working with
small data with a simple encoding, then build out the
model, then optimize, then dump in bigger data sets.

- Bob

Kyle Foreman

Mar 4, 2014, 6:34:13 PM3/4/14
to stan-...@googlegroups.com
You're correct. An older version of Stan allowed for a vector there, but more recent versions require an array of ints.


On Tue, Mar 4, 2014 at 2:31 PM, Michael T <micha...@gmail.com> wrote:
Hi all,

I've been bouncing back and forth on all things related to the spatial CAR implementation in Stan.
For other people in a similar situation, here's a new tidbit from my scavenging.

To implement Kyle's examples, a slight modification needs to be made in the data section of the Stan file:

wrong:
vector[N] O;

right:
int O[N];

I could be completely wrong, as this is my first dive into Stan, but it's motivated by reading
https://groups.google.com/forum/#!msg/stan-dev/vNx8FyKcsLs/b7ft0lyZWXoJ


Mike


Kyle Foreman

Mar 31, 2014, 11:52:48 AM3/31/14
to stan-...@googlegroups.com
In MT's thread I've added an example of how to manually implement a sparse precision matrix version of the MVN distribution. For cases with hundreds or thousands of spatial units, I've found it to speed things up by a couple orders of magnitude or more.

Since I often get emails from Stan users referencing this thread, I'm going to attach that example here as well so that it's easy to find.
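For a rough sense of why the sparse version wins (a NumPy/SciPy sketch with a random sparse adjacency matrix; the attached R file is the real example): the quadratic form beta' (D - p*A) beta only needs to touch the nonzero entries of A:

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(3)
N = 1000
# Random sparse symmetric 0/1 adjacency matrix (spatial neighbors in practice).
A = sp.random(N, N, density=0.005, random_state=3, format='csr')
A = ((A + A.T) > 0).astype(float)
D = sp.diags(np.asarray(A.sum(axis=1)).ravel())   # diagonal of neighbor counts
beta = rng.normal(size=N)
p = 0.5

# Dense quadratic form: O(N^2) work per evaluation.
dense_val = beta @ ((D - p * A).toarray() @ beta)

# Sparse quadratic form: work proportional to the number of neighbor pairs.
sparse_val = beta @ ((D - p * A) @ beta)

print(dense_val - sparse_val)  # ~0
```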

Best
Kyle


On Sun, Mar 30, 2014 at 5:21 PM, MT <micha...@gmail.com> wrote:
Hi all,

I hope this isn't regarded as redundant spam.

I've started a separate thread trying to extend the spatial CAR model (a single random effect) into the BYM convolution model (a CAR random effect plus a normal random effect), as well as scaling the model up to 480 spatial neighbors (instead of the 50 or so here).

https://groups.google.com/forum/#!searchin/stan-users/spatial|sort:date/stan-users/M7T7EIlyhoo/jli3CRobjeIJ

Stan isn't able to handle it and I was wondering if people in this thread would have some knowledge as to what's going on.



scotland_lip_cancer.RData.txt
sparse_MVN.R.txt

Yan Liu

Sep 8, 2014, 12:29:49 PM9/8/14
to stan-...@googlegroups.com
Hi, Kyle

I am working with your code now. It seems that the results from your code are not the same as the results from GeoBUGS. Have you run a simulation to test whether Stan can handle this? For example: generate data from the true model, fit the model, repeat 1000 times, and take the average to see whether any bias exists.

Thanks a lot

