Is it possible to generate simulated data for model testing using Stan?

Michael Benezra

unread,

Oct 24, 2016, 11:37:33 AM10/24/16

to Stan users mailing list

Hi,

I am relatively new to Stan and to Bayesian statistics in general. I am implementing an linear regression with multiple predictors and multiple outcomes, similar to the example in based on the example provided in Chapter 6 of the stan-reference manual. In the manual it is also suggested that "One of the best ways to make sure your model is doing the right thing computationally is to generate simulated (i.e., “fake”) data with known parameter values, then see if the model can recover these parameters from the data. If not, there is very little hope that it will do the right thing with data from the wild." I would like to validate my model by generating simulated data as suggested. One can generate simulated data with R for example (it gets a little tricky if you need to have LKJ priors, but I'm sure it's doable), but since Stan has already built in functions for all the distributions, it would be very useful if Stan could also help generate the simulated data. Is that possible and if so how? I read in the manual that among other things, the generated quantities section can be used to "forward sampling to generate simulated data for model testing", but there are no examples of how this can be done. It would be great if someone could provide an example. Much thanks for your help.

Best,

Michael

James Savage

unread,

Oct 24, 2016, 1:39:34 PM10/24/16

to Stan users mailing list

Hi Michael -

There are a couple of ways of doing this. The simplest might be to simulate the data like so:

data {

int N;

int P; // number of predictors

real<lower = 0> sigma; // your "known" error scale

vector[P] beta; // your "known" coefficients

matrix[N, P] X; // your predictors, including a column of 1s

}

parameters {

vector[N] yhat;

}

model {

yhat ~ normal(X*beta, sigma);

}

If you run that, each draw would be a sample you could use to test your model with. Your question mentioned having LKJ priors (I presume for random effects on betas covariance?). So you want to provide known group-level betas to exhibit correlation consistent with your prior? If so, you could export the lkj_corr_rng() function from Stan to R. Your Stan program would be

// saved as export_lkj.stan

functions {

matrix lkj_rng(int size, real eta) {

return( lkj_corr_rng(size, eta));

}

model {}

at the top. You can then extract the lkj rng function using

rstan::expose_stan_functions(export_lkj.stan)

After which the function lkj_rng() will be available for use in your R session.

Hope this helps. FWIW I normally do all my fake data stuff in R.

Jim

Bob Carpenter

unread,

Oct 24, 2016, 1:47:00 PM10/24/16

to stan-...@googlegroups.com

A simple example for generating data from a univariate
regression:

data {
real alpha;
real beta;
real<lower=0> sigma;
}
model { }
generated quantities {
real y[10];
real x[10];
for (n in 1:10) {
x[n] = normal_rng(0, 1);
y[n] = normal_rng(alpha * x[n] + beta, sigma);
}
}

I turned it into an issue to add an example like this to the
manual:

https://github.com/stan-dev/stan/issues/2051#issuecomment-255812404

How to get the generated x and y back out will depend on interface.
You can run a single iteration in no-parameters mode (also how to
specify that will depend on interface). There's a section in the manual
on "Sampling without Parameters" in the HMC chapter.

- Bob

> --
> You received this message because you are subscribed to the Google Groups "Stan users mailing list" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.
> To post to this group, send email to stan-...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Bob Carpenter

unread,

Oct 24, 2016, 2:14:52 PM10/24/16

to stan-...@googlegroups.com

The advantage of the approach Jim outlines is that it
works for any kind of model specification. The RNG approach may
require some translation, but it's a lot more efficient computationally
and statistically (it iterates much faster and takes independent
draws by design).

- Bob

Reply all

Reply to author

Forward