what Andrew wants for posterior fiddling

56 views
Skip to first unread message

Bob Carpenter

unread,
Apr 14, 2015, 11:53:01 PM4/14/15
to stan...@googlegroups.com
OK, I think I now understand what Andrew wants after thinking
about it for a bit:

* write a generated quantities block, and

* evaluate it AFTER the samples are drawn from Stan

so that the results are the same as if it had been originally
included in the Stan program.

I don't think this has anything to do with fiddling with
functions in R, which is why I was so confused during the
meeting. (I also thinking changing the name from "testify"
would be helpful for those of us who take names too literally.)

The only reason we can't do this very easily is that we have
no way of taking the CSV-formatted draws from CmdStan or the extracted
draws from RStan or whatever you get out of PyStan and
passing them back in.

- Bob

Ben Goodrich

unread,
Apr 15, 2015, 1:58:35 AM4/15/15
to stan...@googlegroups.com
On Tuesday, April 14, 2015 at 11:53:01 PM UTC-4, Bob Carpenter wrote:
OK, I think I now understand what Andrew wants after thinking
about it for a bit:

  * write a generated quantities block, and

  * evaluate it AFTER the samples are drawn from Stan

I think it would be better to have a new block with a new name. The things in generated quantities get evaluated once per iteration, while the things in the unnamed block get evaluated once using all the results from all the iterations (e.g. calculating a WAIC, doing importance weighting, etc.). And I suppose they need to get written to a separate output file.
 
I don't think this has anything to do with fiddling with
functions in R, which is why I was so confused during the
meeting.  (I also thinking changing the name from "testify"
would be helpful for those of us who take names too literally.)

I originally envisioned it was just for testing the self-written Stan functions. But people seem to want to use it in ways that blur the distinction between Stan and the interface, so it does need a new name.

But that is also why having a new block might not even be sufficient because people will want to do things that they don't realize they want to do until later. I think calling a user-written Stan function from the interface is fine, but there is really no avoiding the looping. It is just a question of whether you do it in the interface and call a function every iteration of the loop or call the function once and loop through all the simulations in C++.
 
The only reason we can't do this very easily is that we have
no way of taking the CSV-formatted draws from CmdStan or the extracted
draws from RStan or whatever you get out of PyStan and
passing them back in.

I agree it is hard in general. It is not super hard with RStan and user-written functions. If you can extract() to a vector or matrix of simulations, that is easy to pass back to a user-written Stan function. What is harder is when you have a covariance matrix or something and extract() makes that into a multidimensional array. You would have to pass it back as a R list of matrices I think to be compatible with a matrix[] in the signature of your function.

Ben

Bob Carpenter

unread,
Apr 15, 2015, 3:17:01 AM4/15/15
to stan...@googlegroups.com

> On Apr 15, 2015, at 3:58 PM, Ben Goodrich <goodri...@gmail.com> wrote:
>
> On Tuesday, April 14, 2015 at 11:53:01 PM UTC-4, Bob Carpenter wrote:
> OK, I think I now understand what Andrew wants after thinking
> about it for a bit:
>
> * write a generated quantities block, and
>
> * evaluate it AFTER the samples are drawn from Stan
>
> I think it would be better to have a new block with a new name. The things in generated quantities get evaluated once per iteration, while the things in the unnamed block get evaluated once using all the results from all the iterations (e.g. calculating a WAIC, doing importance weighting, etc.). And I suppose they need to get written to a separate output file.

OK, if that's what Andrew wants, that's yet another different
thing. And I agree that it would go into a different block.
But then we'd have to have some way to pass it the sample array
for evaluation.

> I don't think this has anything to do with fiddling with
> functions in R, which is why I was so confused during the
> meeting. (I also thinking changing the name from "testify"
> would be helpful for those of us who take names too literally.)
>
> I originally envisioned it was just for testing the self-written Stan functions. But people seem to want to use it in ways that blur the distinction between Stan and the interface, so it does need a new name.

Right --- the name made sense in that context.

> But that is also why having a new block might not even be sufficient because people will want to do things that they don't realize they want to do until later. I think calling a user-written Stan function from the interface is fine, but there is really no avoiding the looping.

I don't think we should tangle calling Stan functions with looping.
That should be done externally, perhaps with helper functions.

> It is just a question of whether you do it in the interface and call a function every iteration of the loop or call the function once and loop through all the simulations in C++.

Agreed. I don't see how we could figure out how to do that in
C++ other than with one of the above two strategies (rerunning
generated quantities post-hoc or having something that operated
over all the draws in the sample).

- Bob

Michael Betancourt

unread,
Apr 15, 2015, 3:58:24 AM4/15/15
to stan...@googlegroups.com
Trying to extract specific structured data from the fit object to pass into
user-defined functions will be a pain in the ass, whether it’s done on
the interface level or internal to the API (the implementation of which
already makes me sad).

But why not just have a new type of Stan program instead of just
a new block? Something like

parameters {
… list all of the model parameters here;
}

postprocess {
… code that gets applied to each sample;
}

Then this can be called from the interfaces via

stan.postprocess(fit_object);

Internally we read in the fit object, parse it into the original parameters
(the stan_csv_reader has methods that do much of this already), check
that the parameters are the same as those defined in this new program,
then just loop over the samples.

Something new will have to be compiled and the user will have to type
the parameters block twice but I think that is an absolutely minimal
burden compared to benefits internally.
> --
> You received this message because you are subscribed to the Google Groups "stan development mailing list" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to stan-dev+u...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Bob Carpenter

unread,
Apr 15, 2015, 9:13:01 AM4/15/15
to stan...@googlegroups.com
Makes sense to me. The result would be the same
as if the postprocss block had been the original generated
quantities.

I'm guessing Andrew's going to balk at having to rewrite
all the parameters, which is why I was imagining something
that was just a version of the original model with a tweaked
generated quantities block.

It makes me nervous thinking people are going to be doing
all their analyses in the R interpreter (or any other interpreter)
using function calls. It's just not going to be reproducible
unless it's all put into a script and tested on its own.
Whatever its pitfalls, the benefit of doing everything at
once with the generated quantities block is that you know the
sample came from the program the generated quantities block is
attached to.

- Bob

Andrew Gelman

unread,
Apr 15, 2015, 4:27:24 PM4/15/15
to stan...@googlegroups.com
This is all fine, but, now that it’s all been explained to me, that it would be fine to replace “write a generated quantities block” with “write a function.” But, I guess that a generated quantities block would be more general than a function because it allows the return of multiple arguments.

Andrew Gelman

unread,
Apr 15, 2015, 4:31:02 PM4/15/15
to stan...@googlegroups.com

> On Apr 15, 2015, at 1:58 AM, Ben Goodrich <goodri...@gmail.com> wrote:
>
> I think it would be better to have a new block with a new name. The things in generated quantities get evaluated once per iteration, while the things in the unnamed block get evaluated once using all the results from all the iterations (e.g. calculating a WAIC, doing importance weighting, etc.). And I suppose they need to get written to a separate output file.

There are actually 2 different ideas floating around here. The first idea is to have a function, or generated quantities block, that does a certain action separately on each posterior simulation draw. The idea is to pass it a posterior-fit Stan object with 1000 (say) draws, and it will return 1000 sets of this output.

The second idea is to do operations on all the draws together. This is needed for EP, importance weighting, WAIC, and, for that matter, R-hat. I hadn’t been thinking about doing the second idea in Stan but maybe it would make sense? It would require a different sort of notation.

Also related is to allow iteration number and chain number as arguments in the model block or in the generated quantities block.

A


Andrew Gelman

unread,
Apr 15, 2015, 4:32:17 PM4/15/15
to stan...@googlegroups.com
I like the name “postprocess”

Andrew Gelman

unread,
Apr 15, 2015, 4:38:34 PM4/15/15
to stan...@googlegroups.com
Bob:
Just to be clear, the issue is that I need to run the generated quantities block (or the function) several times, using the same posterior draws but different inputs. To put it another way, the Stan program I’m working with has lots of data. Some of the data get used in the model block, some of the data get used in the generated quantities block. I want to rerun the Stan program, just changing some of the data: just changing the data that get used in the generated quantities block, not changing the data that get used in the model block. So I don’t want to rerun the model, I just want to rerun the generated quantities block.
I hope this helps.
A

Bob Carpenter

unread,
Apr 15, 2015, 6:54:01 PM4/15/15
to stan...@googlegroups.com

> On Apr 16, 2015, at 6:38 AM, Andrew Gelman <gel...@stat.columbia.edu> wrote:
>
> Bob:
> Just to be clear, the issue is that I need to run the generated quantities block (or the function) several times, using the same posterior draws but different inputs. To put it another way, the Stan program I’m working with has lots of data. Some of the data get used in the model block, some of the data get used in the generated quantities block. I want to rerun the Stan program, just changing some of the data: just changing the data that get used in the generated quantities block, not changing the data that get used in the model block. So I don’t want to rerun the model, I just want to rerun the generated quantities block.
> I hope this helps.

Yes, knowing that you need more data helps. Then we'd need a bit
more than Ben suggested --- you'd also need to declare a new data
block and provide it as an argument.

I don't know how we can do it with just functions without the need
for the index fiddling you don't like.

- Bob

Andrew Gelman

unread,
Apr 15, 2015, 8:53:38 PM4/15/15
to stan...@googlegroups.com
Hi, attached is my code. I don’t know whether this is helpful, you can feel free to ignore it. I ended up writing the “generated quantities block” as a function and passing the information in that way. This actually ended up being cleaner than using generated quantities, because it meant that I din’t have to have such a big pile of data coming into the main Stan program. The only awkward thing was that a Stan function can return only one thing, so I had to hack it by concatenating two Stan variables into a single vector output, then untangle them at the other end.
A

extrap.R
testify.R
stan.R
extrap_impsampling_function.stan
data_fit.stan

Bob Carpenter

unread,
Apr 16, 2015, 1:11:02 AM4/16/15
to stan...@googlegroups.com
I'm soooo confused. I thought you were adamantly opposed
to having to loop over the iterations, as in:

n_iter <- 10
for (i in 1:n_iter){
generated <- list(delta=array(NA, c(n_sims,K)), log_weight=rep(NA, n_sims))
for (n in 1:n_sims) {

And I'm really surprised given your opposition to writing
data = c("x", "y", "N") that you'd be OK with this:

vector extrap_impsampling_rng(int J_prime, int T_prime, int K, vector y_prime_bar,
vector x_prime, vector mu_delta_p, matrix Sigma_delta_p, vector mu_delta_g,
matrix Sigma_delta_g, real df_g, int J_tilde, vector mu_a, vector sigma_a, real sigma_y) {
...

which seems just the kind of duplication you wanted to avoid.

I'd like to add tuple types to Stan so we can have proper structs.
I should probably stop looking at and responding to all these e-mails
and spend more time adding to the Stan language!

- Bob

> On Apr 16, 2015, at 10:53 AM, Andrew Gelman <Gel...@stat.columbia.edu> wrote:
>
> Hi, attached is my code. I don’t know whether this is helpful, you can feel free to ignore it. I ended up writing the “generated quantities block” as a function and passing the information in that way. This actually ended up being cleaner than using generated quantities, because it meant that I din’t have to have such a big pile of data coming into the main Stan program. The only awkward thing was that a Stan function can return only one thing, so I had to hack it by concatenating two Stan variables into a single vector output, then untangle them at the other end.
> A
>
> --
> You received this message because you are subscribed to the Google Groups "stan development mailing list" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to stan-dev+u...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
> <extrap.R><testify.R><stan.R><extrap_impsampling_function.stan><data_fit.stan>

Andrew Gelman

unread,
Apr 16, 2015, 1:21:20 AM4/16/15
to stan...@googlegroups.com
1. I don’t want to loop over the iterations! But I did so, because that’s what I needed to get my answer!

2. I no longer need c(“x”, “y”, “n”), now that Jonah taught me about nlist(x, y, n). As far as I’m concerned, c(“x”, “y”, “n”) is history and we can deprecate it as of yesterday.

3. I passed all the data to the Stan function because I had to! But it’s actually not duplication, because these data objects are actually no longer in the Stan program that fits the model to y, they’re only in the function.

Bob Carpenter

unread,
Apr 16, 2015, 6:54:02 AM4/16/15
to stan...@googlegroups.com

> On Apr 16, 2015, at 3:21 PM, Andrew Gelman <gel...@stat.columbia.edu> wrote:
>
> 1. I don’t want to loop over the iterations! But I did so, because that’s what I needed to get my answer!

How do you feel about Vectorize as an alternative?

> 2. I no longer need c(“x”, “y”, “n”), now that Jonah taught me about nlist(x, y, n). As far as I’m concerned, c(“x”, “y”, “n”) is history and we can deprecate it as of yesterday.

Let's let Ben and Jiqiang decide this one. I'm OK either way.

> 3. I passed all the data to the Stan function because I had to! But it’s actually not duplication, because these data objects are actually no longer in the Stan program that fits the model to y, they’re only in the function.

Didn't some of those variables correspond to parameters?

- Bob
Reply all
Reply to author
Forward
0 new messages