Sampling from multivariate normal

3,238 views
Skip to first unread message

eligu...@yandex.ru

unread,
Aug 28, 2013, 6:43:38 PM8/28/13
to stan-...@googlegroups.com
Dear STANsters,

First, thanks for developing an exciting and well-designed and well-documented tool for Bayesian sampling.  

I would like to use STAN to estimate parameters that are nestled inside of the mean and variance-covariance matrix of a multi-variate normal distribution, i.e. X ~ MVN(Mu(theta), Sigma(theta)), where "thetas" are the parameters and the likelihood is just the probability density of the data vector X.  I am using STAN from within R, and the ultimate objective is to develop a package that obtains these estimates (and C.I.'s) from data.

As a simplistic example, I attach an example [.stan and .r files] where I estimate the autocorrelation coefficient rho of an AR(1) process by using the variance-covariance of the complete time-series [i.e. Sigma[i,j] = rho^(abs(i-j))].  This is not the best way to estimate rho, but it is exactly the template I'm looking for. 

I got this to work, but have several questions.  (I am a bit of a novice here, so apologize in advance if the answers are well-known or documented elsewhere - I really did dig around):

1) The compilation worked using "set_cppo("fast")", but it would not compile using "set_cppo("debug")", giving me a non-specific error, with only a Cygwin POSIX path warning.  How/why is that possible?  It takes quite a while to compile - and if the "debug" is faster [plus more informative output], then that would be much more convenient for development. 

2) I see that the compiled C++ code is stored somewhere deep (Users/eli/AppData/Temp/Local ... etc.) with some complicated auto-generated name. I couldn't figure out how to control where that code goes or how it is named.  A further question: Will it eventually be straightforward to implement the compiled C++ code in an R package via Rcpp (which is the ultimate goal)?  

3) If there are any comments on the optimality/efficiency of the "stan" code for sampling from the MVN itself, they would be appreciated.  For the desired application, speed is highly desired. 

Thanks,
- Eli
multinormal.r
multinormal.stan

Bob Carpenter

unread,
Aug 28, 2013, 7:12:55 PM8/28/13
to stan-...@googlegroups.com


On 8/28/13 6:43 PM, eligu...@yandex.ru wrote:
> Dear STANsters,
>
> First, thanks for developing an exciting and well-designed and well-documented tool for Bayesian sampling.
>
> I would like to use STAN to estimate parameters that are nestled inside of the mean and variance-covariance matrix of a
> multi-variate normal distribution, i.e. X ~ MVN(Mu(theta), Sigma(theta)), where "thetas" are the parameters and the
> likelihood is just the probability density of the data vector X. I am using STAN from within R, and the ultimate
> objective is to develop a package that obtains these estimates (and C.I.'s) from data.

So it's like a Gaussian process example where Mu and Sigma are
functions of parameters? There are some examples in the manual.

> As a simplistic example, I attach an example [.stan and .r files] where I estimate the autocorrelation coefficient rho
> of an AR(1) process by using the variance-covariance of the complete time-series [i.e. Sigma[i,j] = rho^(abs(i-j))].
> This is not the best way to estimate rho, but it is exactly the template I'm looking for.
>
> I got this to work, but have several questions. (I am a bit of a novice here, so apologize in advance if the answers
> are well-known or documented elsewhere - I really did dig around):
>
> 1) The compilation worked using "set_cppo("fast")", but it would not compile using "set_cppo("debug")", giving me a
> non-specific error, with only a Cygwin POSIX path warning. How/why is that possible?



set_cppo("fast") sets C++ compiler
optimization to level 3 and set_cppo("debug") makes
it optimization level 0. The current Stan (1.3.0) also uses set_cppo("fast")
to turn off some error handling, leading to potential R crashes.

In 2.0, the options will only set optimization and not error handling.

The problem is the C++ toolchain on Windows, which typically runs out
of memory or just dies when setting optimization level 0.

> It takes quite a while to compile
> - and if the "debug" is faster [plus more informative output], then
> that would be much more convenient for development.

We don't want to rely on error messages from C++, but rather catch
them in Stan directly. The debug version won't have more informative
output as far as Stan's concerned.

> 2) I see that the compiled C++ code is stored somewhere deep (Users/eli/AppData/Temp/Local ... etc.) with some
> complicated auto-generated name. I couldn't figure out how to control where that code goes or how it is named.

Why would you want to control this?

Jiqiang should be able to help if you really need to control it.

> A
> further question: Will it eventually be straightforward to implement the compiled C++ code in an R package via Rcpp
> (which is the ultimate goal)?

I'm not sure what you mean by "implement the compiled C++ code".

You can use the command-line version of Stan to compile the C++
code for a model and then do whatever you want with it. But the I/O
is tricky given the constraints.

> 3) If there are any comments on the optimality/efficiency of the "stan" code for sampling from the MVN itself, they
> would be appreciated. For the desired application, speed is highly desired.

We're working on making multivariate normal faster in a few different
ways. One, we'll vectorize the distributions, and two, we'll support
Cholesky factor representations everywhere. For now, Ben and Marcus have
provided a lot of examples on the mailing list. But the basic idea is
to do the heavy lifting outside of a loop as much as possible. So rather than

for (n in 1:N) y[n] ~ multi_normal(mu,Sigma);

which causes Sigma to be solved in every iteration, you can do something like:

L <- cholesky_decompose(Sigma);
for (n in 1:N) y[n] ~ multi_normal_cholesky(mu,L);

- Bob

eligu...@yandex.ru

unread,
Aug 28, 2013, 8:34:35 PM8/28/13
to stan-...@googlegroups.com
Thanks Bob for a prompt response, 


We don't want to rely on error messages from C++, but rather catch
them in Stan directly.  The debug version won't have more informative
output as far as Stan's concerned.

So, long story short, there is no good reason to use "debug"? 
 
You can use the command-line version of Stan to compile the C++
code for a model and then do whatever you want with it.  But the I/O
is tricky given the constraints.

I will experiment with incorporating the code into a rstan dependent package. 

We're working on making multivariate normal faster in a few different
ways.  One, we'll vectorize the distributions, and two, we'll support
Cholesky factor representations everywhere.  For now, Ben and Marcus have
provided a lot of examples on the mailing list.  But the basic idea is
to do the heavy lifting outside of a loop as much as possible.  So rather than

  for (n in 1:N)   y[n] ~ multi_normal(mu,Sigma);

which causes Sigma to be solved in every iteration, you can do something like:

  L <- cholesky_decompose(Sigma);
  for (n in 1:N) y[n] ~ multi_normal_cholesky(mu,L);

I believe that is more or less what I am doing.  Here is the (fairly compact) stan code:
 
data {
int<lower=0> N; // length of data set
vector[N] y;    // data
}
parameters {
real mu; 
real<lower=0> sigma;
real<lower=0,upper = 1> rho;
}

transformed parameters {
// declare mu, Sigma, chol_Sigma
vector[N] Mu;
matrix[N,N] Sigma;
matrix[N,N] chol_Sigma;

// define mu
for(i in 1:N)
Mu[i] <- mu;

// define Sigma
for(i in 1:N)
for(j in 1:N)
Sigma[i,j] <- pow(sigma,2)*pow(rho,abs(i-j));

// define chol_Sigma
chol_Sigma <- cholesky_decompose(Sigma);
}

model {
// prior distribution
        rho ~ uniform(0.0,1.0); 
// likelihood
y ~ multi_normal_cholesky(Mu, chol_Sigma);
}

Perhaps a quick glance will tell you if there are any low-hanging fruit to speed it up. 

Thanks again, 
Eli

Ben Goodrich

unread,
Aug 28, 2013, 8:47:34 PM8/28/13
to stan-...@googlegroups.com
On Wednesday, August 28, 2013 8:34:35 PM UTC-4, eligu...@yandex.ru wrote:
Thanks Bob for a prompt response, 

We don't want to rely on error messages from C++, but rather catch
them in Stan directly.  The debug version won't have more informative
output as far as Stan's concerned.

So, long story short, there is no good reason to use "debug"? 

There is a good reason: If you have run-time errors.
 
Perhaps a quick glance will tell you if there are any low-hanging fruit to speed it up. 

For an AR(1) model, I think you should pretty much never use the covariance matrix. Define a "cov_matrix" that is actually the inverse of the covariance matrix, which will be tridiagonal and is know analytically as a function of rho and sigma. Then call multi_normal_prec(). Or define the Cholesky factor of the precision matrix, which is bidiagonal, and manually increment the log-posterior in the model{} block.

Ben

eligu...@yandex.ru

unread,
Aug 28, 2013, 9:38:40 PM8/28/13
to stan-...@googlegroups.com
Thanks for your comment Ben, 


For an AR(1) model, I think you should pretty much never use the covariance matrix. Define a "cov_matrix" that is actually the inverse of the covariance matrix, which will be tridiagonal and is know analytically as a function of rho and sigma. Then call multi_normal_prec(). Or define the Cholesky factor of the precision matrix, which is bidiagonal, and manually increment the log-posterior in the model{} block.

I'm aware that this is a ridiculous way to estimate rho for an AR(1) - but it is a perfect minimal template for my actual problem.  In the real thing, the inverse matrix is unfortunately not a trivial function of the parameters ... but it should (I think) be somewhat sparse, in which case numerically using "multi_normal_prec()" may be faster.  I'll try that when I weave it all together.

Let me reiterate that implementing this estimation in STAN has been a smooth and (relatively) pleasant process!   It estimates well, is faster than anything a hack like myself could ever code, and the syntax / workflow / implementation is smooth and very transparent (e.g. only took me a couple of hours from downloading rstan to getting the AR(1) example working).   So kudos again on a great tool.  Am looking forward to watching it evolve. 

Best,
- Eli

  


Reply all
Reply to author
Forward
0 new messages