specs for Rstan!

128 views
Skip to first unread message

Andrew Gelman

unread,
Jun 21, 2014, 3:13:58 AM6/21/14
to stan...@googlegroups.com
Hi all. Following my conversation with Bob the other day, here are a first draft of specs for the next rstan. Bob can connect me if I’ve garbled anything.
A

The logic will be to consider a sequence of files or objects:

foo.stan [Stan program: a text file]
foo.so [compiled model: a dynamic shared object that exists in the R workspace]
foo.rda [compiled model: stored as a file in the working directory for R, or maybe in a library directory of compiled Stan models]
foo.instantiated [instantiated model: I assume this is a dynamic shared object too, it is created by putting the data into the compiled model]
foo.adapt [adaptation information for 1 chain: a stan object of some sort]
foo.fit.single [fitted model for 1 chains: a stan object]
foo.fit.multiple [fitted model for multiple chains: a stan object that includes the simulations, R-hat, and n_eff, and the scramble code]
foo.fit.mle [fitted point estimate with cov matrix, conf intervals, possibly simulations]

And now the functions (I’ve tried to follow Bob’s principles here and keep the number of arguments to a minimum):

First, the sequence of primitives that lead to a fitted stan model:

foo.so <- stan_compile(“foo.stan”, boost_lib, eigen_lib, verbosity)
foo.instantiated <- stan_instantiate(foo.so, data, verbosity)
foo.adapt <- stan_adapt(foo.instantiated, algorithm(sub_arguments), init, seed, chain_id, n_warmup, verbosity)
foo.fit.single <- stan_simulate(foo.instantiated, foo.adapt, algorithm(sub_arguments), parameters_to_save, init, seed, chain_id, n_iter, verbosity)
foo.fit.multiple <- stan_combine(list(foo.fit.single.1,foo.fit.single.2,…), seed, verbosity)

Next, the primitives that do other things:

point.estimate <- stan_optimize(foo.instantiated, algorithm(sub_arguments), init, seed, verbosity)
log.prob <- stan_logprob(foo.instantiated, parameters, format)
gradient <- stan_gradient(foo.instantiated, parameters, format)
hessian <- stan_hessian(foo.instantiated, parameters, format)
In the above 3 functions, the “format” argument tells whether these are on the original or transformed scale, also they could be supplied as a list of named parameters or as a vector with all of them strung together.
parameters_new <- stan_transform(foo.instantiated, parameters) to transform forward and back

Next, the make function:

stan_make (“foo.stan”, data, file_save)
This creates (if necessary) the compiled model foo.so in the R workspace, and, if file_save=TRUE, also saves it as foo.rda. If data are supplied, this function call also creates (if necessary) the instantiated model and saves it as foo.instantiated in the R workspace. This function can also have a bunch of optional arguments telling it what directory to use and what names to call the R object and the files.

And, finally, the user-facing functions:

foo.fit.multiple <- stan(“foo.stan”, algorithm(sub_arguments), parallelism, parameters_to_save, init, seed, n_iter, n_warmup, thin, n_chains, verbosity)
foo.fit.mle <- stan(“foo.stan”, algorithm(sub_arguments), parameters_to_save, init, seed, verbosity)
and various extractor functions, for example:
sims <- stan_sims(foo.fit.multiple)
diagnostics <- stan_diagnostics(foo.fit.multiple)
estimates <- stan_estimate(foo.fit.mle)
stderrs <- stan_se(foo.fit.mle)
sims <- stan_sims(foo.fit.mle, algorithm(sub_arguments)), here the algorithm could be normal approx, stabilized importance sampling, or wedge sampling

I’m probably missing a few things but the above is the basic idea.

Ben Goodrich

unread,
Jun 21, 2014, 1:16:03 PM6/21/14
to stan...@googlegroups.com, gel...@stat.columbia.edu
This looks mostly fine. We should put in on a wiki page. If we are going to break everything, we might as well do it all at once and call it version 3.0 for the next Stan release and implement it for PyStan too. But if we are going to break everything, then it is worth seriously considering changing to "reference class" semantics. In R, see

help("setRefClass", "methods")

this is more Pythonic and C++-like, so it would probably help keep the interfaces consistent-ish.

I don't really get the purpose of the distinction between the .so / .dll dynamic shared object and the instantiated dynamic shared object with data. We don't currently associate the data with anything really, and it would seem like a superfluous copy to me.

I would rather not put the parallelism stuff into interfaces like rstan. It is easy enough to do on any particular computer and almost impossible to do in the abstract across all platforms where rstan might be used. And it would be different in PyStan. Chain-level parallelism should be done in Stan itself, possibly ASAP.

I got rstan to build and run without copying lib/, and we are down to 14MB on disk inclusive of a lot of cruft. Basically, we only need to copy src/stan and src/models into an RStan package, and even then we should probably split into multiple packages.

Ben

Andrew Gelman

unread,
Jun 21, 2014, 1:32:22 PM6/21/14
to stan...@googlegroups.com

On Jun 21, 2014, at 7:16 PM, Ben Goodrich <goodri...@gmail.com> wrote:

> I don't really get the purpose of the distinction between the .so / .dll dynamic shared object and the instantiated dynamic shared object with data. We don't currently associate the data with anything really, and it would seem like a superfluous copy to me.

I don’t know from .so or .dll, but the distinction between the compiled model and the instantiated model came from Bob. His reasoning was as follows: In many settings we will want to work with the compiled model and fit to new data, so we definitely need to be able to save the compiled model. But in some settings people will want to make repeated queries of a model with particular data, and for that reason he wanted the option to save the instantiated model.

Bob Carpenter

unread,
Jun 21, 2014, 1:39:41 PM6/21/14
to stan...@googlegroups.com

On Jun 21, 2014, at 7:16 PM, Ben Goodrich <goodri...@gmail.com> wrote:

> This looks mostly fine. We should put in on a wiki page. If we are going to break everything, we might as well do it all at once and call it version 3.0 for the next Stan release and implement it for PyStan too.

That's exactly what I said to Andrew.

> But if we are going to break everything, then it is worth seriously considering changing to "reference class" semantics. In R, see
>
> help("setRefClass", "methods")
>
> this is more Pythonic and C++-like, so it would probably help keep the interfaces consistent-ish.

I'm in favor of anything that makes the interfaces more
consistent in coding.

> I don't really get the purpose of the distinction between the .so / .dll dynamic shared object and the instantiated dynamic shared object with data. We don't currently associate the data with anything really, and it would seem like a superfluous copy to me.

I don't think the shared object should be called .so in programs.

You can only instantiate the model class with data. Before
that, it's just the shared code. I was trying to break things
down as modularly as possible.

> I would rather not put the parallelism stuff into interfaces like rstan. It is easy enough to do on any particular computer and almost impossible to do in the abstract across all platforms where rstan might be used. And it would be different in PyStan. Chain-level parallelism should be done in Stan itself, possibly ASAP.

Andrew and I also talked about this and implications for CRAN.
I'll go with your judgment here, but most users are going to
be like Andrew and want everything wrapped up.

>
> I got rstan to build and run without copying lib/, and we are down to 14MB on disk inclusive of a lot of cruft. Basically, we only need to copy src/stan and src/models into an RStan package, and even then we should probably split into multiple packages.
>

Cool.

- Bob

Ben Goodrich

unread,
Jun 21, 2014, 1:41:45 PM6/21/14
to stan...@googlegroups.com, gel...@stat.columbia.edu
On Saturday, June 21, 2014 1:32:22 PM UTC-4, Andrew Gelman wrote:
> I don't really get the purpose of the distinction between the .so / .dll dynamic shared object and the instantiated dynamic shared object with data. We don't currently associate the data with anything really, and it would seem like a superfluous copy to me.

I don’t know from .so or .dll, but the distinction between the compiled model and the instantiated model came from Bob.  His reasoning was as follows:  In many settings we will want to work with the compiled model and fit to new data, so we definitely need to be able to save the compiled model.  But in some settings people will want to make repeated queries of a model with particular data, and for that reason he wanted the option to save the instantiated model.

We'll see. It wouldn't hurt much to be duplicative if we can figure out some way to not make copies of the data. But it seems like it would be okay to initialize the thing with empty data and then associate data with it afterward, rather than having two objects that only differ in that one has data associated with it and the other doesn't.

Ben

Ben Goodrich

unread,
Jun 21, 2014, 7:17:28 PM6/21/14
to stan...@googlegroups.com
On Saturday, June 21, 2014 1:39:41 PM UTC-4, Bob Carpenter wrote:

On Jun 21, 2014, at 7:16 PM, Ben Goodrich wrote:

> This looks mostly fine. We should put in on a wiki page.
> I don't really get the purpose of the distinction between the .so / .dll dynamic shared object and the instantiated dynamic shared object with data. We don't currently associate the data with anything really, and it would seem like a superfluous copy to me.

I don't think the shared object should be called .so in programs.

You can only instantiate the model class with data.  Before
that, it's just the shared code.  I was trying to break things
down as modularly as possible.  

Some models may not have data though. I think instantiating with empty data and then setting the data would be fine.
 
> I would rather not put the parallelism stuff into interfaces like rstan. It is easy enough to do on any particular computer and almost impossible to do in the abstract across all platforms where rstan might be used. And it would be different in PyStan. Chain-level parallelism should be done in Stan itself, possibly ASAP.

Andrew and I also talked about this and implications for CRAN.
I'll go with your judgment here, but most users are going to
be like Andrew and want everything wrapped up.

I'm sure each user does, but they collectively want different implementations. The only easy way to make it so that we can just have a

threads = 8

option would be if Stan had the capability to spawn 8 threads in parallel that the interfaces could utilize.

Ben

Bob Carpenter

unread,
Jun 22, 2014, 6:51:42 AM6/22/14
to stan...@googlegroups.com
There is no model object on the C++ side before the data.
The model class's constructor takes a data reader argument and
upon construction, an instance will have its data assigned
to member variables. The omission of no-arg constructor is by design
here, because none of the functions implemented by models can
be evaluated before the data is read in. On construction, the
data is read in and the constructed model is guaranteed to be well
formed and immutable (and hence thread safe modulo the thread safety
of autodiff).

One place we need a model instance is to compute a log probability
function --- that doesn't take data as an argument. It could
if we were willing to load in the data each time, but it seems
more likely that someone would re-use the same model object with
data in multiple calls.

One place where we need the dynamically linked code is to read
in new data in a call to stan(...).

- Bob

Ben Goodrich

unread,
Jun 22, 2014, 12:35:36 PM6/22/14
to stan...@googlegroups.com
On Sunday, June 22, 2014 6:51:42 AM UTC-4, Bob Carpenter wrote:
There is no model object on the C++ side before the data.
The model class's constructor takes a data reader argument and
upon construction, an instance will have its data assigned
to member variables.  The omission of no-arg constructor is by design
here, because none of the functions implemented by models can
be evaluated before the data is read in.  On construction, the
data is read in and the constructed model is guaranteed to be well
formed and immutable (and hence thread safe modulo the thread safety
of autodiff).

On the C++ side, having instantiation be a separate step makes sense. I was talking about the R side, where in the proposed

foo.instantiated <- stan_instantiate(foo.so, data, verbosity)

I don't think the distinction between foo.so and foo.instantiated makes enough sense to justify the copying. I'd prefer

foo$set_data(list)
foo$instantiate
()

One place we need a model instance is to compute a log probability
function --- that doesn't take data as an argument.  It could
if we were willing to load in the data each time, but it seems
more likely that someone would re-use the same model object with
data in multiple calls.

Yeah, that is currently a bit annoying. I'd prefer the R syntax to be

foo$logprob(theta) # with previously set data
foo$set_data
(another_list)
foo$logprob
(theta)

Another thing to think about is whether we can use the data in R's memory and just shuffle pointers around.

Ben

Andrew Gelman

unread,
Jun 22, 2014, 12:41:03 PM6/22/14
to stan...@googlegroups.com
In R, I would prefer to do this using functions rather than $ signs.
A

--
You received this message because you are subscribed to the Google Groups "stan development mailing list" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stan-dev+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ben Goodrich

unread,
Jun 22, 2014, 1:07:59 PM6/22/14
to stan...@googlegroups.com, gel...@stat.columbia.edu
On Sunday, June 22, 2014 12:41:03 PM UTC-4, Andrew Gelman wrote:
In R, I would prefer to do this using functions rather than $ signs.

There would still be a stan() function that is similar to the existing one that most users would call. The $methods() are mostly low-level stuff and intended to make RStan be as close as possible to PyStan. What is the case for using functional semantics exclusively?

Ben


Andrew Gelman

unread,
Jun 22, 2014, 2:05:28 PM6/22/14
to Ben Goodrich, stan...@googlegroups.com
As long as it’s clear, I guess.  For an R user such as myself it’s helpful to just be able to have a list of functions.

Ben Goodrich

unread,
Jun 22, 2014, 2:14:27 PM6/22/14
to stan...@googlegroups.com, gel...@stat.columbia.edu
On Sunday, June 22, 2014 2:05:28 PM UTC-4, Andrew Gelman wrote:
As long as it’s clear, I guess.  For an R user such as myself it’s helpful to just be able to have a list of functions.

A short list of functions. Currently,

library(rstan)
ls
("package:rstan")

is too long

 [1] "constrain_pars"      "extract"             "get_adaptation_info" "get_cppcode"        
 
[5] "get_cppo"            "get_cppo_mode"       "get_cxxflags"        "get_inits"          
 
[9] "get_logposterior"    "get_num_upars"       "get_posterior_mean"  "get_sampler_params"
[13] "get_seed"            "get_seeds"           "get_stancode"        "get_stanmodel"      
[17] "grad_log_prob"       "log_prob"            "makeconf_path"       "monitor"            
[21] "optimizing"          "plot"                "read_rdump"          "read_stan_csv"      
[25] "reset_cppo"          "rstan_options"       "sampling"            "set_cppo"          
[29] "sflist2stanfit"      "show"                "stan"                "stanc"              
[33] "stan_demo"           "stan_model"          "stan_rdump"          "stan_version"      
[37] "summary"             "traceplot"           "unconstrain_pars"

and does not emphasize which functions are intended to be called by the user. All those get_ functions, plus some others, should just be methods and could be listed by calling $methods(), which have Pythonic documentation and would allow us to condense the RStan reference manual.

Ben

Andrew Gelman

unread,
Jun 22, 2014, 2:47:55 PM6/22/14
to stan...@googlegroups.com
Yes, I like the functions in my spec but i agree that 39 functions could be too much.

Daniel Lee

unread,
Jun 22, 2014, 11:26:36 PM6/22/14
to stan...@googlegroups.com
Ben, I'm starting to get confused. Why would we have $methods() as opposed to plain functions? What's the value in having the R interface be close to PyStan in terms of calling semantics? I guess I'm not seeing RStan users being PyStan users and vice versa.

With coding it as a reference class, that's great, but can you really spec out the interface so it's clear what's happening to the class at every point? Andrew's spec was easy enough to follow... I'm not sure if it was optimal, but it was clear what was being created when. In your counter-proposal, it's not clear where the work is being done. For example, to create foo, what function should be called exactly? What fields are created with the creation of foo? What happens when functions need to act on the object with it not instantiated in the right way? How do we tell that from inspection of foo?.... or does everything get created all at once?


Reply all
Reply to author
Forward
0 new messages