rstan - how to use/re-use a stanfit object for different prediction

2,909 views
Skip to first unread message

andy...@cubist-asia.com

unread,
Feb 25, 2015, 10:03:26 PM2/25/15
to stan-...@googlegroups.com
Dear all,

This is perhaps a more rstan and software design related question.

My goal is to do a linear regression fitting and then use the fitted model to compute posterior for a range of different predictor values.

In R, using lm, i would do:

# y is dep, x is indepedent
fit <- lm (y ~ x)

Now, I can pass this fit (class: lm) object to other parts of the code that do the "prediction"
predict (lm, newdata=data.frame(....), ...)

In rstan, my current approach for fitting is to do:

1. create a Stan file:

data {
  int<lower=0> N; // num of obs
  vector[N] y;   
  vector[N] x;
}
parameters {
  real<lower=0> beta;
  real<lower=0> sigma;
}
model {
  beta ~ normal (1, 2);

  y ~ normal (x * beta, sigma);
}

2. create a rstan object using stan_model

R> sm6 <- stan_model (file="t6.stan", model_name="t6")

3. fit the data with sampling function to get a stanfit object

R> sfit <- sampling (sm6, data=some_data)

The question is: what's the best/recommended way to do prediction with new values of x now I have a stanfit object?

1. I can added a generated quantities block into the STAN code, but that would require me to supply the new data at the same time with the fitting process - something not very clean from a design perspective (also practically, my system may not have all the new_data values available during the fitting phrase).

2. Should I read off the mean and sd of the fitted parameters from stanfit object and then specify them in the data section of a different STAN code that does the sampling?

data {
  real beta_location;
  real beta_scale;

  real sigma_location;
  real sigma_scale;

  real x_tilde;
}

model {

}

generated quantities {
  real y_new;
  y_new <- normal_rng (x_tilde * normal_rng(beta_location, beta_scale),
               normal_rng(sigma_location, sigma_scale));
}

# beta_*, simga_* values come from summary of a stanfit object
R> sampling (sm6, data=list (beta_location=1.62, beta_scale = 0.16, sigma_location = 0.12, sigma_scale = 0.02, x_tilde = 0.05), algorithm="Fixed_param")


3. none of the above?

I guess the more generic version question is: what's the best way to "pass" estimated values from one STAN run (via stanfit object) to another STAN run?

Many thanks,

Andy





Bob Carpenter

unread,
Feb 25, 2015, 10:22:18 PM2/25/15
to stan-...@googlegroups.com

> On Feb 26, 2015, at 2:03 PM, andy...@cubist-asia.com wrote:
>
> ...
> The question is: what's the best/recommended way to do prediction with new values of x now I have a stanfit object?
>
> 1. I can added a generated quantities block into the STAN code, but that would require me to supply the new data at the same time with the fitting process - something not very clean from a design perspective (also practically, my system may not have all the new_data values available during the fitting phrase).

This is the cleanest way to do it.

>
> 2. Should I read off the mean and sd of the fitted parameters from stanfit object and then specify them in the data section of a different STAN code that does the sampling?

This will not give you full Bayesian inference.
It will use the posterior means as point estimates
and thus underestimate uncertainty in your estimates.

> 3. none of the above?

What you want to do is simulate one or more predictions
using an RNG for each draw in the posterior sample.
Then that gives you a sample for the predictions.

If you only care about the mean prediction, then you
can take the mean of this. In some cases, this will work
out to the prediction you get using the posterior means
as point parameters in prediction.

> I guess the more generic version question is: what's the best way to "pass" estimated values from one STAN run (via stanfit object) to another STAN run?

If you really need to do this, the only option is to create
a Stan program that reads in the whole sample object and then
iterates through them generating predictions for each draw in
the sample. Here's a super-duper simple case to illustrate:

data {
int<lower=1> M; // num posterior draws
real mu[M]; // posterior draws for mu
real<lower=0> sigma[M]; // posterior draws for sigma
}
model {
}
generated quantities {
real y[M];
for (m in 1:M)
y[m] <- normal_rng(mu[m], sigma[m]);
}

You'll need to run it in the special fixed parameters mode or it'll
fail.

And of course, if you have predictors or other parameters, those
all need to be passed in the same way.

- Bob

UD1989

unread,
Jul 9, 2015, 8:22:53 AM7/9/15
to stan-...@googlegroups.com
Hi Andy

Could you let me know what solution you finally implemented?? I have a similar use case as you in which new data will be fed into the prediction engine periodically . Then I plan to use the predictions to be wrapped in an API to be further used by another applications. Using the generated quantities is not exactly possible here because I wouldn't want the STAN model to run for another half-a-day everytime someone wants to predict from a new set of input variables.  Any help would be appeciated . Thanks!!

Regards

Sebastian Weber

unread,
Jul 9, 2015, 8:46:14 AM7/9/15
to stan-...@googlegroups.com
Hi!

I had a similar issues and I think rstan should improve here. For the moment I do the following:

1. I assume that you do compute in Stan the mean prediction for your model, i.e. commonly called mu or ypred (for the example model, the x*beta would need to be saved somewhere)
2. I wrote myself a function which takes as input the data I want to simulate and in addition those parameter values I want to simulate
3. This data I pass into a function together with the posterior
4. I then iterate over each draw from the posterior and combine the data + parameters to simulate with the posterior draw and pass this into Stan, calling the "Fixed_Param" algorithm
5. You end up with a prediction of the mean for each posterior draw and all you need to do is add your error term (if needed)

This procedure works nice and reliable, but is terribly slow if large amounts of data are generated as the communication between rstan and Stan itself is for reasons yet unknown (most likely Rcpp related) terribly slow. But, it is a nice method to simulate off a posterior after the fit has been done.

@ Ben: A function like this would be wonderful to have in rstan :)

Best,
Sebastian

Krzysztof Sakrejda

unread,
Jul 9, 2015, 9:06:19 AM7/9/15
to stan-...@googlegroups.com
On Thursday, July 9, 2015 at 8:46:14 AM UTC-4, Sebastian Weber wrote:
This procedure works nice and reliable, but is terribly slow if large amounts of data are generated as the communication between rstan and Stan itself is for reasons yet unknown (most likely Rcpp related) terribly slow.

I doubt this is Rcpp-related.  Rcpp make it relatively easy to write thin wrappers for std::vector<double> as well as for Eigen types and view them directly in R.  It makes it even easier to copy everything umpteen times and pass it around as R lists.  I don't really know which way rstan does it and I would never blame Stan's developers for not having the time to optimize this but blaming it on Rcpp is a big stretch.  Is the issue really unknown? It would not take long to profile with a slow example model.  I've had trouble getting data into Stan via stan_rdump which can hang on long strings but I usually work with CmdStan's .csv output so I haven't seen issues like this with rstan. 

Krzysztof
 

Sebastian Weber

unread,
Jul 9, 2015, 10:44:10 AM7/9/15
to stan-...@googlegroups.com
Hi!

We were trying to hunt it down already and only came to a partial solution of

https://github.com/stan-dev/stan/pull/772

Here is the code which is really killing the machines here:

code1 <- '
    data {
      int N;
    }
    parameters {
      vector[N] p;
    }
    model {
      p ~ normal(0, 1);
    }
'

library(rstan)
sm <- stan_model(model_code = code1)

## ok
system.time(sampling(sm, data = list(N = 10000), iter = 10))

## really slow!
system.time(sampling(sm, data = list(N = 100000), iter = 10))

It takes forever to sample from a N(0,1) !!

UD1989

unread,
Jul 9, 2015, 11:36:20 AM7/9/15
to stan-...@googlegroups.com
Hi Sebastian

Thanks so much for your reply? Do you happen to have any example code that explains your method? I 'm having trouble understanding how exactly your procedure will be achieved and an exmple will be helpful :)

Thanks and Regards

Krzysztof Sakrejda

unread,
Jul 9, 2015, 1:35:49 PM7/9/15
to stan-...@googlegroups.com
Ok, one of the related issues is mine (https://github.com/stan-dev/rstan/issues/76) unfortunately when I was last looking at it the
conclusion I came to was that maybe lists are not the best way to pass data back to R since the segfault results from recursion going
too far in libR.so  and it's a known lists-only problem.  I don't have time to dig up the details right now, sorry...

Krzysztof

Sebastian Weber

unread,
Jul 9, 2015, 2:48:25 PM7/9/15
to stan-...@googlegroups.com

Yep, right. I also think that using some array structure would have been a much better choice. Generally, I was surprised to learn that rstan uses internally many nested lists to handle stan output. R is orders of magnitude faster if such things are handled with multi-dimensional arrays. They are a pain to program (index war), but are worth the effort. Maybe worth while for rstan3...

Sebastian

Bob Carpenter

unread,
Jul 10, 2015, 2:13:46 PM7/10/15
to stan-...@googlegroups.com

> On Jul 9, 2015, at 2:48 PM, Sebastian Weber <sdw....@gmail.com> wrote:
>
>
> Yep, right. I also think that using some array structure would have been a much better choice. Generally, I was surprised to learn that rstan uses internally many nested lists to handle stan output. R is orders of magnitude faster if such things are handled with multi-dimensional arrays. They are a pain to program (index war), but are worth the effort. Maybe worth while for rstan3...

RStan's open source and we're always looking for volunteers. :-)
This would hopefully be a relatively modular project.

- Bob

Ben Goodrich

unread,
Jul 10, 2015, 2:48:21 PM7/10/15
to stan-...@googlegroups.com, ca...@alias-i.com
On Friday, July 10, 2015 at 2:13:46 PM UTC-4, Bob Carpenter wrote:
> Yep, right. I also think that using some array structure would have been a much better choice. Generally, I was surprised to learn that rstan uses internally many nested lists to handle stan output. R is orders of magnitude faster if such things are handled with multi-dimensional arrays. They are a pain to program (index war), but are worth the effort. Maybe worth while for rstan3...

RStan's open source and we're always looking for volunteers. :-)
This would hopefully be a relatively modular project.  

This is probably going to get addressed to some extent for rstan3 so we are not necessarily super-eager for volunteers to refactor this at the moment. Although if someone were to get it working with rstan 2.x we could merge it.

Ben

andy...@cubist-asia.com

unread,
Jul 14, 2015, 2:08:51 AM7/14/15
to stan-...@googlegroups.com
Hi!

At the end, i went with solution - as my fitting was relatively fast, I can "afford" to run fitting twice. Also, i was (and still is) very new to STAN at the time, it was an easier approach.

But, agree with Sebastian, it would be nice if there is a built-in function in rstan to "automate" this.

Cheers,

Andy

Sidharth Gupta

unread,
May 16, 2017, 5:35:39 PM5/16/17
to stan-...@googlegroups.com
Was reading this post. I had a related question. When we use the rng(I used bernoulli_logit_rng) which has the signature 


int bernoulli_logit_rng(real alpha)


If I have N examples to predict on and D is the # posterior samples of the parameters. Then I have to do this:


generated quantities {
    matrix[D, N] y_pred;
    for(n in 1:N) { 
        for(d in 1:D) {
            y_pred[d, n] = bernoulli_logit_rng(p[d, n]);
        }
    }
}

AFAICT, The looping will make things slow. Shouldn't stan also have rng functions vectorized? Maybe it's already an open issue?

Secondly, in the above for each the posterior sample: p[d,n]  the generated quantities draw many samples again. Which is kind of odd shaped(understandably), It results in a y_pred that has dimensions: (nrow=1, ncol=d*n), because I call stan() with parameters iterations=1, warmup=0, chains=1 for prediction, hence the 1 row. Which is all I want. Is there a way to parameterize the call to stan to maybe not do this?

Best
Sid

Bob Carpenter

unread,
May 17, 2017, 3:43:45 PM5/17/17
to stan-...@googlegroups.com
No, looping with multiple calls to an _rng function isn't slow.
Unlike R or Python in their normal interactive forms, Stan compiles
down to C++ where the looping is fast.

Sometimes we can get a bit of boost in speed if there are shared
computations. This is a *much* bigger deal for parameters, where
vectorizing can cut down the work required for derivatives. But
there are no derivatives for RNGs.

We do have an issue open for vectorizing the _rng functions, but
it'll just make the code shorter, not faster.

- Bob

Reply all
Reply to author
Forward
0 new messages