Suitability of Stan for imputation of multinomial outcomes with hierarchical models


Ross Boylan

Aug 1, 2013, 7:40:33 PM
to stan-...@googlegroups.com
Hi, everyone. I have a relatively well-defined initial question, and
then some progressively broader issues.

The manual says "Stan 1.0 does not do discrete sampling." I'm trying to
figure out the implications of that for using Stan to do imputation of
multinomial outcomes with hierarchical models. I think it means that
Stan will not allow a discrete variable to be among the values whose
distribution it estimates, i.e., one of the theta in the language of
section 1.1.

This is not necessarily a show-stopper, since given a draw from the
regular model parameters I could simulate a discrete outcome. However,
shoe-horning that in to Stan might be awkward.

More generally, any assessment of how suitable Stan would be for this
task would be great. Some earlier discussion refers to Stan as being
very slow for the multinomial, but it looks as if the promised
multinomial_log function has since been implemented (section 33.1).

Background: we have data with a lot of missing values, at all
measurement levels, and marked clustering and are trying to impute. I
have been extending the mice package for R to handle some of the cases
(mlprobit branch of https://github.com/RossBoylan/mice). The general
framework is a Gibbs sampler for variables to be imputed; this implies
that any particular imputation will be working with complete data
(except for the variable to be imputed). The task is to create inner
samplers for certain types of variables.

I've implemented models for clustered binary; the first cut was
painfully slow because of high auto-correlation, and so I tried a Hybrid
MC which was somewhat better, though not dramatically so. I spent some
time tuning it, though probably it could be tuned further.

The alternatives I'm currently contemplating are
1) extend the first cut clustered binary model, based on work by Albert
and Chib, to multinomial outcomes as they suggest. The drawback is that
the performance was bad for a 2-level variable, suggesting it will be
worse for >2.
2) Use a hybrid MC approach by hand. Theoretically, this will probably
require a different framework (likely multinomial logit) since simple
extension of the current method requires high dimensional integrals of
the normal distribution. Practically, calculating derivatives was
time-consuming and error-prone (for me) in the binary case, and the
chance of doing it incorrectly for multinomial seems too high.
3) Create a bunch of dummy variables for each categorical outcome,
predict them separately, and then use the highest predicted probability
to give the imputed value (or maybe it would be better to normalize the
probabilities and draw randomly?).
4) Use Stan.
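The normalize-and-draw variant of option 3 is only a few lines. Below is a hypothetical Python sketch (the probabilities are invented, and `impute_category` is not a function from mice or any package mentioned here); drawing rather than taking the arg max preserves the imputation variability:

```python
import random

def impute_category(dummy_probs, rng):
    """Normalize per-dummy predicted probabilities into a simplex and
    draw the imputed category at random, rather than taking the arg max."""
    total = sum(dummy_probs)
    simplex = [p / total for p in dummy_probs]
    return rng.choices(range(len(simplex)), weights=simplex, k=1)[0]

# hypothetical predicted probabilities for one missing 3-category value;
# they need not sum to 1 before normalization
rng = random.Random(0)
print(impute_category([0.5, 0.3, 0.4], rng))  # draws 0, 1, or 2
```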

Comments welcome.

Ben Goodrich

Aug 1, 2013, 9:05:55 PM
to stan-...@googlegroups.com
On Thursday, August 1, 2013 7:40:33 PM UTC-4, Ross Boylan wrote:
More generally, any assessment of how suitable Stan would be for this
task would be great.  

I would say that Stan, for the foreseeable future, is ill-suited for any missing data problem, except the one (possibly your case) where only the outcome variable has missingness. In that one case, Stan is decent.

For a nominal outcome, I guess the first thing to consider is whether it is reasonable to consider the missing values to be an additional category. In that case, there is no missing data problem.

Failing that, you probably want to introduce an unknown simplex parameter for each missing outcome. For observed data, the vector of probabilities goes into the likelihood as a function of covariates. For missing data, this function of covariates essentially constitutes a prior on the unknown vector of probabilities. In which case, given the posterior distribution of the unknown probability vectors, you can draw imputations in the generated quantity block from a categorical distribution.
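A rough Python sketch of that last step, assuming posterior draws of the coefficients are already in hand (the draws and covariates below are invented, and `impute_missing_outcome` is a hypothetical name); in Stan itself this would be a categorical RNG call in the generated quantities block:

```python
import math, random

def softmax(alpha):
    m = max(alpha)                       # subtract max for stability
    exps = [math.exp(a - m) for a in alpha]
    s = sum(exps)
    return [e / s for e in exps]

def impute_missing_outcome(beta_draws, x, rng):
    """For each posterior draw of the per-category coefficient vectors,
    form the category probabilities from the covariates and draw one
    imputed category -- the analogue of a categorical RNG call in
    Stan's generated quantities block."""
    imputations = []
    for beta in beta_draws:
        alpha = [sum(b * xi for b, xi in zip(bk, x)) for bk in beta]
        p = softmax(alpha)
        imputations.append(rng.choices(range(len(p)), weights=p, k=1)[0])
    return imputations

# invented posterior draws: 2 draws x 3 categories x 2 covariates
draws = [[[0.2, 0.1], [-0.5, 0.3], [0.0, 0.0]],
         [[0.1, 0.0], [-0.4, 0.2], [0.1, -0.1]]]
print(impute_missing_outcome(draws, [1.0, 2.0], random.Random(0)))
```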

In short, as with all things Stan, if you can write down a joint posterior distribution for all the continuous unknowns, then it is straightforward to use Stan and it will probably work well. That is really a different question than the one asked with Gibbs samplers for missing data where you need a full-conditional distribution for each unknown.

Ben

Bob Carpenter

Aug 1, 2013, 11:39:25 PM
to stan-...@googlegroups.com


On 8/1/13 7:40 PM, Ross Boylan wrote:
> Hi, everyone. I have a relatively well-defined initial question, and then some progressively broader issues.
>
> The manual says "Stan 1.0 does not do discrete sampling." I'm trying to figure out the implications of that for using
> Stan to do imputation of multinomial outcomes with hierarchical models. I think it means that Stan will not allow a
> discrete variable to be among the values whose distribution it estimates, i.e., one of the theta in the language of
> section 1.1.

That's right.

> This is not necessarily a show-stopper, since given a draw from the regular model parameters I could simulate a discrete
> outcome. However, shoe-horning that in to Stan might be awkward.

Awkward is putting it kindly. A better approach is to marginalize out
what you can, but it's hard to do in anything other than simple mixture
models.
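For a simple two-component mixture, the marginalization amounts to a log-sum-exp over the discrete indicator. A minimal Python sketch of the idea (the function names are illustrative, not Stan API):

```python
import math

def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_mix(lam, lp1, lp2):
    """Log density of a two-component mixture with the discrete
    component indicator summed out:
    log(lam * exp(lp1) + (1 - lam) * exp(lp2))."""
    return log_sum_exp([math.log(lam) + lp1, math.log(1 - lam) + lp2])

# mixing component density values 0.2 and 0.4 with weights 0.5 / 0.5
print(log_mix(0.5, math.log(0.2), math.log(0.4)))  # log(0.3)
```

Because the discrete indicator is summed out analytically, the remaining log density is a smooth function of the continuous parameters, which is exactly what a gradient-based sampler needs.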

> More generally, any assessment of how suitable Stan would be for this task would be great. Some earlier discussion
> refers to Stan as being very slow for the multinomial, but it looks as if the promised multinomial_log function has
> since been implemented (section 33.1).

I think that's just the usual multinomial on the log
probability scale --- that's the convention in the manual,
which we're going to try to explain better in the 2.0
manual because it's confusing as is.

Stan 2.0 will have a multinomial_logit, for multinomial
logistic regression, with

multinomial_logit(y|alpha) = multinomial(y|softmax(alpha))

The form on the right is much faster in 2.0, and the customized
one on the left is even faster. softmax() and multinomial_logit
now have analytic partials instead of using auto-diff internally.
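The identity can be checked numerically. This Python sketch (not Stan's C++ implementation) computes both sides on the log scale, using lgamma for the multinomial coefficient:

```python
import math

def log_softmax(alpha):
    m = max(alpha)
    lse = m + math.log(sum(math.exp(a - m) for a in alpha))
    return [a - lse for a in alpha]

def multinomial_lpmf(y, p):
    """Multinomial log-pmf with explicit probabilities p."""
    n = sum(y)
    coef = math.lgamma(n + 1) - sum(math.lgamma(yi + 1) for yi in y)
    return coef + sum(yi * math.log(pi) for yi, pi in zip(y, p))

def multinomial_logit_lpmf(y, alpha):
    """Same density parameterized by unconstrained logits alpha;
    staying on the log scale skips the explicit normalization."""
    n = sum(y)
    coef = math.lgamma(n + 1) - sum(math.lgamma(yi + 1) for yi in y)
    return coef + sum(yi * lp for yi, lp in zip(y, log_softmax(alpha)))

alpha = [0.5, -1.0, 0.2]
y = [3, 1, 2]
p = [math.exp(lp) for lp in log_softmax(alpha)]
assert abs(multinomial_lpmf(y, p) - multinomial_logit_lpmf(y, alpha)) < 1e-12
```

The speed advantage of the logit form comes from never materializing the normalized probabilities and from the analytic partials mentioned above.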

> Background: we have data with a lot of missing values, at all measurement levels, and marked clustering and are trying
> to impute. I have been extending the mice package for R to handle some of the cases (mlprobit branch of
> https://github.com/RossBoylan/mice). The general framework is a Gibbs sampler for variables to be imputed; this implies
> that any particular imputation will be working with complete data (except for the variable to be imputed). The task is
> to create inner samplers for certain types of variables.

This sounds like what Andrew and Jennifer and most recently Ben have worked
on in the mi package. As Ben said, Stan's not a good choice for this.
You couldn't even hack it up well in C++ because of the data immutability
in a Stan model.

> I've implemented models for clustered binary; the first cut was painfully slow because of high auto-correlation, and so
> I tried a Hybrid MC which was somewhat better, though not dramatically so. I spent some time tuning it, though probably
> it could be tuned further.

NUTS is pretty good for tuning. Also, did you
tune a diagonal mass matrix for HMC?

> The alternatives I'm currently contemplating are
> 1) extend the first cut clustered binary model, based on work by Albert and Chib, to multinomial outcomes as they
> suggest. The drawback is that the performance was bad for a 2-level variable, suggesting it will be worse for >2.

Performance as in speed? Speed goes up with number of outcomes
because of the need to normalize.

> 2) Use a hybrid MC approach by hand. Theoretically, this will probably require a different framework (likely
> multinomial logit) since simple extension of the current method requires high dimensional integrals of the normal
> distribution. Practically, calculating derivatives was time-consuming and error-prone (for me) in the binary case, and
> the chance of doing it incorrectly for multinomial seems too high.

They're all coded and tested vs. auto-diff in Stan, but that's
C++.

> 3) Create a bunch of dummy variables for each categorical outcome, predict them separately, and then use the highest
> predicted probability to give the imputed value (or maybe it would be better to normalize the probabilities and draw
> randomly?).

> 4) Use Stan.

You could use Stan for various pieces of the above plans, but it's
not going to solve the problem for you out of the box, or even
on its own.

Or you could write a big joint model and use Stan. It's still
ugly syntactically, but the manual shows you how to do the same
kind of joint modeling approach to missing data you get with BUGS.

- Bob

Ross Boylan

Aug 2, 2013, 12:53:03 AM
to stan-...@googlegroups.com, Bob Carpenter
Thanks to Bob and Ben for their responses. I can answer one specific question.

On Thursday, August 01, 2013 08:39:25 PM Bob Carpenter wrote:
> > I've implemented models for clustered binary; the first cut was painfully
> > slow because of high auto-correlation, and so I tried a Hybrid MC which
> > was somewhat better, though not dramatically so. I spent some time
> > tuning it, though probably it could be tuned further.
>
> NUTS is pretty good for tuning. Also, did you
> tune a diagonal mass matrix for HMC?
I tried various things, and at least moved the effective sample sizes for
different parameters closer to parity. I had major scaling problems, with
appropriate step sizes varying by a factor of at least 100 between parameters.

I was a bit baffled how to approach that in principle, since it seemed tuning
the mass would move the initial speed and the later step sizes in opposite
directions, whereas I really wanted them to both go up or down with the scale.

I ended up using weights of abs(d2/d1) where d2 and d1 are the second and first
derivatives of the log probability evaluated at the initial guess for
parameter values.
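That heuristic is small enough to state directly; a Python sketch with invented gradient and Hessian values (the function name is hypothetical, not from the HybridMC package):

```python
def curvature_weights(grad, hess_diag):
    """The |d2/d1| heuristic: for each parameter, the ratio of the
    second to the first derivative of the log posterior, evaluated
    at the initial guess for the parameter values."""
    return [abs(d2 / d1) for d1, d2 in zip(grad, hess_diag)]

# hypothetical gradient and Hessian diagonal at the starting values
print(curvature_weights([-2.0, 0.5], [-1.0, -4.0]))  # [0.5, 8.0]
```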

I was a bit surprised the scale differences cause a complete failure of the
algorithm; I worried that

I'm leaning toward working out the analytic derivatives and using HMC for my
problem (I'm using the HybridMC package in R).

Ross

Andrew Gelman

Aug 2, 2013, 12:54:50 AM
to stan-...@googlegroups.com
Now I'm confused. If you're going to be using HMC, why not use Stan? I think that will be more generalizable and reliable (and possibly faster) than rolling it yourself.
A

Bob Carpenter

Aug 2, 2013, 11:25:52 AM
to stan-...@googlegroups.com


On 8/2/13 12:54 AM, Andrew Gelman wrote:
> Now I'm confused. If you're going to be using HMC, why not use Stan?

That's what I meant by using Stan for part of the
problem. You can't just build the whole model in a single
Stan program, but you can implement the HMC parts in
Stan.

> I think that will be more generalizable and reliable (and possibly faster) than rolling it yourself.

Almost certainly. Though with tight coding in C++ and
analytical derivatives you should be able to build something
faster than Stan. There's a price for the generality.

- Bob

Bob Carpenter

Aug 2, 2013, 11:29:32 AM
to stan-...@googlegroups.com
On 8/2/13 12:53 AM, Ross Boylan wrote:
> Thanks to Bob and Ben for their responses. I can answer one specific question.
>
> On Thursday, August 01, 2013 08:39:25 PM Bob Carpenter wrote:

>> NUTS is pretty good for tuning. Also, did you
>> tune a diagonal mass matrix for HMC?

> I tried various things, and at least moved the effective sample sizes for
> different parameters closer to parity. I had major scaling problems, with
> appropriate step sizes varying by a factor of at least 100 between parameters.
>
> I was a bit baffled how to approach that in principle, since it seemed tuning
> the mass would move the initial speed and the later step sizes in opposite
> directions, whereas I really wanted them to both go up or down with the scale.

The mass matrix should account for the covariance in the
posterior and the overall step size for accuracy of the integrator.
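Concretely, a diagonal mass matrix can be set from the per-parameter posterior variances estimated during warmup draws, which is what Stan's adaptation does for its default diagonal metric. A toy Python sketch (pure Python, invented draws; `diagonal_metric` is an illustrative name, not Stan API):

```python
def diagonal_metric(draws):
    """Estimate a diagonal inverse mass matrix from warmup draws:
    inverse mass per parameter ~ its marginal posterior variance,
    so widely scaled parameters get comparable effective step sizes."""
    n = len(draws)
    dims = len(draws[0])
    means = [sum(d[i] for d in draws) / n for i in range(dims)]
    return [sum((d[i] - means[i]) ** 2 for d in draws) / (n - 1)
            for i in range(dims)]

# two invented warmup draws of a 2-parameter posterior with very
# different scales; the second parameter gets 100x the inverse mass
print(diagonal_metric([[0.0, 10.0], [2.0, 30.0]]))  # [2.0, 200.0]
```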

> I ended up using weights of abs(d2/d1) where d2 and d1 are the second and first
> derivatives of the log probability evaluated at the initial guess for
> parameter values.

There are ways to set this up based on derivatives. Check
out Michael Betancourt's papers on SoftAbs. It's like RMHMC
with a metric that doesn't vary by position.

> I was a bit suprised the scale differences cause a complete failure of the
> algorithm; I worried that

It is a tricky algorithm that's very sensitive to
tuning params.

- Bob

Ross Boylan

Aug 2, 2013, 1:58:24 PM
to stan-...@googlegroups.com
On 8/2/2013 8:25 AM, Bob Carpenter wrote:
>
>
> On 8/2/13 12:54 AM, Andrew Gelman wrote:
>> Now I'm confused. If you're going to be using HMC, why not use Stan?
>
> That's what I meant by using Stan for part of the
> problem. You can't just build the whole model in a single
> Stan program, but you can implement the HMC parts in
> Stan.
OK. Thinking about it.

Just to indicate why I was leaning to HMC without Stan:
General:
Adding technology to a late project often makes it later, and certainly
makes it more complex.
To use it I have to learn it.
It might turn out to be unsuitable (in either the "won't do what I want
at all" sense or the "slow" sense).
[The upshot of the discussion seems to be Stan is suitable in the "will
do that" sense for my intended use,
which is doing an estimate of one particular model within a larger Gibbs
sampler.]

Specific:
The dependence on the compiler brings installation issues, esp. since
some of my work is on Windows.
My intended use is at least a bit off the standard usage pattern.


The good news is that the data Stan would get is complete, since all
imputed values would be filled in, including those for the outcome
variable. In a perfect world those values would be adjusted within each
iteration of the markov chain (that is, the inner chain for a particular
outcome, the part Stan might do, not the outer Gibbs sampler driving the
overall imputation), but I've already given that up by using the
HybridMC package.


>
>> I think that will be more generalizable and reliable (and possibly
>> faster) than rolling it yourself.
I'm not redoing everything from scratch since I'm using the HybridMC
package.
>
> Almost certainly. Though with tight coding in C++ and
> analytical derivatives you should be able to build something
> faster than Stan. There's a price for the generality.
>
> - Bob
>
The alternative I'm using is pure R, though with analytic derivatives.

Also, an earlier remark got completely garbled. It was
> I was a bit suprised the scale differences cause a complete failure of the
> algorithm; I worried that
That should have been "I was a bit surprised the scale differences did
not cause a complete failure of the
algorithm; I worried that they would." My concern was that a step size
small enough to match the parameter with the least variation would be
too small to overcome the correlations in the other parameters, while a
step large enough for those parameters would take the "small" parameters
way out of their plausible range, leading to rejection of the entire
move by the Metropolis phase of the algorithm.
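That intuition can be checked with a toy experiment: a unit-mass leapfrog integrator on a one-dimensional Gaussian is stable only when the step size is below twice the standard deviation, so a step tuned for a unit-scale parameter wrecks the Hamiltonian on a parameter 100 times smaller. A Python sketch (invented values, not the HybridMC package):

```python
def energy_error(sd, eps, n_steps=20):
    """Absolute Hamiltonian error of a unit-mass leapfrog trajectory
    on a 1-D Gaussian with standard deviation sd; the integrator is
    stable only when eps < 2 * sd."""
    theta, r = sd, 1.0
    grad = lambda t: -t / sd ** 2            # gradient of log N(t | 0, sd)
    H = lambda t, p: t * t / (2 * sd ** 2) + p * p / 2
    h0 = H(theta, r)
    for _ in range(n_steps):
        r += 0.5 * eps * grad(theta)         # half momentum step
        theta += eps * r                     # full position step
        r += 0.5 * eps * grad(theta)         # half momentum step
    return abs(H(theta, r) - h0)

# a step size fine for a unit-scale parameter...
print(energy_error(sd=1.0, eps=0.5))    # small: proposal would be accepted
# ...blows up on a parameter two orders of magnitude smaller
print(energy_error(sd=0.01, eps=0.5))   # enormous: certain rejection
```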

Thanks, Bob, for the pointers about mass. I see Stan can find good
values automatically, which is certainly a plus.

Ross

Ross Boylan

Aug 3, 2013, 11:14:19 AM
to stan-...@googlegroups.com, ro...@biostat.ucsf.edu
OK, so I'm thinking of trying stan, and thinking of packaging it for
Debian. Any tips?

In particular, has the external code in libs got any local
modifications? Are there any notable version dependencies? Debian
packages generally rely on existing Debian packages for external libraries.

Also, what's up with the recursive rstan directories
https://github.com/stan-dev/rstan/tree/develop/rstan/rstan?

I think the binary packages that would result include one for the stan
library, one for the stan executable, and one for the r package.

Ross Boylan

Ben Goodrich

Aug 3, 2013, 1:19:21 PM
to stan-...@googlegroups.com, ro...@biostat.ucsf.edu
On Saturday, August 3, 2013 11:14:19 AM UTC-4, Ross Boylan wrote:
OK, so I'm thinking of trying stan, and thinking of packaging it for
Debian.  Any tips?

I think Debian policy would dictate that libstan (the library) be packaged separately from r-noncran-rstan (the R package). I don't think this would be too hard but would require some doing. In particular, the R package would have to be patched a bit to use external libraries. I could help. I've used Debian unstable for ages but never gotten involved in packaging anything. Are you a DD?

In particular, has the external code in libs got any local
modifications?

No.
 
Are there any notable version dependencies?

Short answer no; long answer yes. For Eigen, every release in the 3.x series should work including the one currently in Debian unstable, which is the same one that is embedded with the develop branch of Stan on github. We don't know how far back you can go with Boost but 1.54 is available in the Debian repositories and is embedded with Stan.

Bob is really averse to supporting older releases of the libraries (or for that matter older versions of the C++ compiler, but we should be okay with any g++ or clang++ in testing or unstable). So, officially Stan only supports the versions of the dependencies that it embeds, and it would be up to the Debian maintainer(s) to patch around that for alternative versions of the dependencies if the need arises, which it shouldn't for the moment.

 Debian
packages generally rely on existing Debian packages for external libraries.

Right, there is also gtest but that almost never gets updated and is only necessary for the unit tests so it could be merely suggested.

Also, what's up with the recursive rstan directories
https://github.com/stan-dev/rstan/tree/develop/rstan/rstan?

It isn't essential to anything.

I think the binary packages that would result include one for the stan
library, one for the stan executable, and one for the r package.

Yeah, technically there is no stan executable, but there is a stanc binary that parses the .stan files. And then the patched R package would need to depend on libstan and stanc.

Ben

Ross Boylan

Aug 3, 2013, 3:26:10 PM
to Ben Goodrich, stan-...@googlegroups.com
On Saturday, August 03, 2013 10:19:21 AM Ben Goodrich wrote:
> I think the binary packages that would result include one for the stan
>
> > library, one for the stan executable, and one for the r package.
>
> Yeah, technically there is no stan executable, but there is a stanc binary
> that parses the .stan files. And then the patched R package would need to
> depend on libstan and stanc.
Why do you say stanc is not the stan executable? I thought it was.
Obviously it is not the compiled stan program, which the package is not going
to provide.

Ross

Bob Carpenter

Aug 3, 2013, 3:46:12 PM
to stan-...@googlegroups.com


On 8/3/13 11:14 AM, Ross Boylan wrote:
> OK, so I'm thinking of trying stan, and thinking of packaging it for
> Debian. Any tips?
>
> In particular, has the external code in libs got any local
> modifications?

No. RStan doesn't include all of Boost, but the regular
Stan distribution does.

> Are there any notable version dependencies?

We only test the latest versions of the libs, which we distribute
with Stan and RStan. I don't recall any dependencies, but having said that,
Boost and Eigen and Stan have all changed considerably over the last two years
of development. We'd love to hear back about compatibility issues --- we just
don't have the resources to test all combinations of compiler, lib and platform.
So we restrict ourselves to relatively recent g++ and clang++ and mingw,
the most recent versions of the libs, Mac OS latest, Windows 7, and various versions
of Linux our developers use.

> Debian
> packages generally rely on existing Debian packages for external libraries.

That's typical for linux, but you can always use other versions of
the libs. Stan is very easy to customize for the location of the
libs. It assumes they're where they are in the Stan distribution, so
you'd have to use the right set of properties for the makefile to have
them point elsewhere.

> Also, what's up with the recursive rstan directories
> https://github.com/stan-dev/rstan/tree/develop/rstan/rstan?

I'm guessing that is just mirroring the R distribution structure,
but you'll have to ask the R wizards.

> I think the binary packages that would result include one for the stan
> library, one for the stan executable, and one for the r package.

Right --- there's the library archive libstan.a, the Stan-to-C++ translator bin/stanc,
and then the R package.

The major issue with packaging Stan itself is that it
needs a C++ compiler to compile the models.

- Bob

Bob Carpenter

Aug 3, 2013, 3:53:51 PM
to stan-...@googlegroups.com
On 8/3/13 1:19 PM, Ben Goodrich wrote:

> Bob is really averse to supporting older releases of the libraries (or for that matter older versions of the C++
> compiler, but we should be okay with any g++ or clang++ in testing or unstable).

I just don't think it should be a priority given everything
else we have going on and how easy it is to use the most
recent libs on all platforms.

I put an even lower priority on supporting older versions of Stan.

> So, officially Stan only supports the versions
> of the dependencies that it embeds.

Even more officially, we offer no legal guarantees whatsoever. :-)

But we do want to make things work for as many people as possible
given our limited resources. So if there are incompatibilities with
older versions of Boost or Eigen that users run into that we can
fix, I'm all for fixing them.

- Bob

Bob Carpenter

Aug 3, 2013, 3:55:55 PM
to stan-...@googlegroups.com
Just terminology. We think of the translator as "stanc" and Stan as the
whole package. As such, there's no Stan executable per se. But I understood
what you meant.

And now that I think about it, you might also want to take bin/print, which does
the posterior analysis in the same way as RStan.

But before going down the whole installation route, I'd suggest just trying
to use it as is to see if it'll work for you. Then if you don't use it,
all the integration with Debian won't be wasted effort.

- Bob

Ross Boylan

Aug 3, 2013, 3:56:24 PM
to Ben Goodrich, stan-...@googlegroups.com
Maybe I'm overthinking it. Maybe you just meant that "stan" is not the name
of the executable. My intent was "the executable for the stan system".
Ross

Dan Stowell

Aug 3, 2013, 4:06:46 PM
to stan-...@googlegroups.com
2013/8/3 Bob Carpenter <ca...@alias-i.com>:
The point is not whether it's easy to use different lib versions. The
difference is that debian maintainers will look very suspiciously at
any new package that bundles a nonstandard lib version rather than
using the system one. It's a principle which helps the debian
maintainers ensure that bugfixes and security issues can be handled
coherently. (Imagine if they had to patch a security issue in boost,
and they had dozens of different versions of boost in the repo which
they had to individually check and patch.)

I've had trouble with this while packaging some other software. I
think you'll find that if you can make "stan.deb" use system
libraries, the debian admins will be much much happier with it and the
flow to getting it available will be much smoother.

>> Also, what's up with the recursive rstan directories
>> https://github.com/stan-dev/rstan/tree/develop/rstan/rstan?
>
>
> I'm guessing that is just mirroring the R distribution structure,
> but you'll have to ask the R wizards.
>
>
>> I think the binary packages that would result include one for the stan
>> library, one for the stan executable, and one for the r package.
>
>
> Right --- there's the library archive libstan.a, the Stan-to-C++ translator
> bin/stanc,
> and then the R package.
>
> The major issue with packaging Stan itself is that it
> needs a C++ compiler to compile the models.

That's no problem - just a runtime dependency declared in the debian
control file, which ensures one of the appropriate compilers is
available.

I'm not an experienced debian maintainer but I've been through it, so
I might be able to test/advise on packaging.

Best
Dan


> - Bob
>
>



--
http://www.mcld.co.uk

Ben Goodrich

Aug 3, 2013, 4:20:22 PM
to stan-...@googlegroups.com
On Saturday, August 3, 2013 3:55:55 PM UTC-4, Bob Carpenter wrote:
But before going down the whole installation route, I'd suggest just trying
to use it as is to see if it'll work for you.  Then if you don't use it,
all the integration with Debian won't be wasted effort.

If Ross / Dan / whoever are up for it, packaging for Debian would be really beneficial. It would mean Stan gets a lot of testing on non-Intel platforms, facilitate the python integration, etc. It just means doing a lot of things the Debian way instead of the wrong way.

Ben

Ben Goodrich

Aug 3, 2013, 4:21:19 PM
to stan-...@googlegroups.com
and then I forgot to CC the correct list

On Saturday, August 3, 2013 4:17:44 PM UTC-4, Ben Goodrich wrote:
forgot to CC the list

On Sat, Aug 3, 2013 at 3:56 PM, Ross Boylan wrote:
Maybe I'm overthinking it.  Maybe you just meant that "stan" is not the name
of the executable.  My intent was "the executable for the stan system".

Right, I just meant that the binary is called stanc

goodrich@CYBERPOWERPC:/opt/stan$ bin/stanc --help
stanc version 1.3.0
USAGE:  stanc [options] <model_file>

and there is no binary called stan. Although, as Bob points out, there is a print binary (that would need to be renamed in the Debian package) that prints a summary of the posterior distribution from the .csv files. Those are the only two executables.

Anyway, packaging Stan wouldn't be very hard. Making r-noncran-rstan is the harder part, but probably more useful. It is true, as Bob mentioned on another post, that the R package doesn't use all of Boost, but neither does Stan. It is just that Stan embeds all of Boost because no one has gotten around to pruning the parts we don't need with bcp. The easiest thing to do would be to depend on libboost1.54-all, but Debian frowns on that sort of thing.

And then reconfiguring the R package to work with a system-wide Stan is a bit more work. Nothing too hard, it is just that Stan has followed a very Windows-centric philosophy of distributing itself and that is often at odds with the Debian / UNIX way.

Ben



Bob Carpenter

Aug 3, 2013, 4:39:53 PM
to stan-...@googlegroups.com


On 8/3/13 4:06 PM, Dan Stowell wrote:
> 2013/8/3 Bob Carpenter <ca...@alias-i.com>:
>> On 8/3/13 11:14 AM, Ross Boylan wrote:
...
>> That's typical for linux, but you can always use other versions of
>> the libs. Stan is very easy to customize for the location of the
>> libs. It assumes they're where they are in the Stan distribution, so
>> you'd have to use the right set of properties for the makefile to have
>> them point elsewhere.
>
> The point is not whether it's easy to use different lib versions. The
> difference is that debian maintainers will look very suspiciously at
> any new package that bundles a nonstandard lib version rather than
> using the system one.

I see -- I didn't understand you were intending to package it
for Debian distribution. I thought you just wanted to get something
running on Debian.

> It's a principle which helps the debian
> maintainers ensure that bugfixes and security issues can be handled
> coherently. (Imagine if they had to patch a security issue in boost,
> and they had dozens of different versions of boost in the repo which
> they had to individually check and patch.)

That means there is a single official version of Boost which Debian
requires? Do other linuxes require the same one?

Boost isn't part of either Mac OS or Windows, so we don't have any
constraints there.

> I've had trouble with this while packaging some other software. I
> think you'll find that if you can make "stan.deb" use system
> libraries, the debian admins will be much much happier with it and the
> flow to getting it available will be much smoother.

That would be great.

- Bob

Ben Goodrich

Aug 3, 2013, 4:56:00 PM
to stan-...@googlegroups.com
On Saturday, August 3, 2013 4:39:53 PM UTC-4, Bob Carpenter wrote:
That means there is a single official version of Boost which Debian
requires?  Do other linuxes require the same one?

No and no. There are actually at least 2 versions of Debian, depending on how you count. For this purpose, we only need to consider testing and unstable, which at the moment have essentially the same versions of Boost available. But neither has a single official Boost and at any moment, there could be a more recent version in unstable than in testing. And aside from Debian-derivatives like Ubuntu, any Linux distribution can do whatever it wants in terms of packaging Boost, although almost all of them would have the most recently released version of Boost available.

Ben

Dan Stowell

Aug 3, 2013, 4:57:40 PM
to stan-...@googlegroups.com
2013/8/3 Bob Carpenter <ca...@alias-i.com>:
>
>
> On 8/3/13 4:06 PM, Dan Stowell wrote:
>>
>> 2013/8/3 Bob Carpenter <ca...@alias-i.com>:
>>>
>>> On 8/3/13 11:14 AM, Ross Boylan wrote:
>
> ...
>
>>> That's typical for linux, but you can always use other versions of
>>> the libs. Stan is very easy to customize for the location of the
>>> libs. It assumes they're where they are in the Stan distribution, so
>>> you'd have to use the right set of properties for the makefile to have
>>> them point elsewhere.
>>
>>
>> The point is not whether it's easy to use different lib versions. The
>> difference is that debian maintainers will look very suspiciously at
>> any new package that bundles a nonstandard lib version rather than
>> using the system one.
>
> I see -- I didn't understand you were intending to package it
> for Debian distribution. I thought you just wanted to get something
> running on Debian.

OK. Getting it running on debian just requires following the install
instructions in the manual ;) makefile works well here.

>> It's a principle which helps the debian
>> maintaners ensure that bugfixes and security issues can be handled
>> coherently. (Imagine if they had to patch a security issue in boost,
>> and they had dozens of different versions of boost in the repo which
>> they had to individually check and patch.)
>
> That means there is a single official version of Boost which Debian
> requires?

Not exactly. There _can_ be more than one version in the repository,
if certain software dependencies require it and there are the people
available to maintain the other versions.

> Do other linuxes require the same one?

Highly unlikely. Different distros run to different schedules. Ubuntu
is derived from debian so it is in lockstep with it though.

Cheers
Dan


> Boost isn't part of either Mac OS or Windows, so we don't have any
> constraints there.
>
>
>> I've had trouble with this while packaging some other software. I
>> think you'll find that if you can make "stan.deb" use system
>> libraries, the debian admins will be much much happier with it and the
>> flow to getting it available will be much smoother.
>
>
> That would be great.
>
>
> - Bob
>

Ben Goodrich

Aug 3, 2013, 5:05:16 PM
to stan...@googlegroups.com, stan-...@googlegroups.com
On Saturday, August 3, 2013 4:36:20 PM UTC-4, Bob Carpenter wrote:
On 8/3/13 4:20 PM, Ben Goodrich wrote:

> If Ross / Dan / whomever are up for it, packaging for Debian would be really beneficial. It would mean Stan gets a lot
> of testing on non-Intel platforms, facilitate the python integration, etc.

That would be great.

Also, it would re-open the possibility of getting Stan-related packages onto CRAN.
 
How does it facilitate Python integration?  (I'm pretty much
completely ignorant about Python and Linux packaging.)

The PyStan repo is like RStan in that it embeds its dependencies. If there were Stan-related Debian packages, then it would be pretty easy to make a tiny python-pystan Debian package that also used the system libraries. It doesn't make anything easier on Mac or Windows.
 
> It just means doing a lot of things the
> Debian way instead of the wrong way.

Could you be more specific?  I know we've talked before about
lib and compiler dependencies.

Debian is as much a religion as it is a Linux distribution. But I would say that three of the packaging principles are

-- Don't embed dependencies on other software
-- Make version requirements as lenient as possible
-- Separate the library from the headers from the binaries from the test-suite, etc.

(R)Stan is pretty much the opposite of that, but it wouldn't be that hard to have a Debian git branch that facilitated doing things the way Debian requires them.
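As a concrete illustration of those three principles, a hypothetical debian/control stanza might look like this -- the package names, the version bound, and the exact split are my assumptions, not anything agreed for Stan:

```
Source: stan
# Principle 1: depend on the system Boost/Eigen, embed nothing.
# Principle 2: keep version bounds as lenient as possible.
Build-Depends: debhelper (>= 9), g++, libboost-dev (>= 1.49), libeigen3-dev

# Principle 3: split the headers/library from the compiler binary.
Package: libstan-dev
Architecture: any
Depends: libboost-dev, libeigen3-dev, ${misc:Depends}
Description: Stan headers and development library

Package: stanc
Architecture: any
Depends: ${shlibs:Depends}, ${misc:Depends}
Description: compiler from Stan models to C++
```

Any real packaging would of course need to follow Debian Policy for the details.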

Ben

Andrew Gelman

unread,
Aug 3, 2013, 6:05:27 PM8/3/13
to stan-...@googlegroups.com
I agree.  Supporting old versions seems like a nightmare for so many reasons.

Ross Boylan

unread,
Aug 3, 2013, 8:15:13 PM8/3/13
to stan-...@googlegroups.com, ro...@biostat.ucsf.edu
I'll try to respond to some of the issues that have been raised.

I'm not a debian developer, but I have discussed this on debian-science
(and just now r-sig-debian) and there seem to be people there who could
do a sponsored upload and perhaps help in other ways.

Why not try just installing the thing? I might, but it makes me a
little nervous. Debian has a nice way of managing things, and stomping
on it can have unfortunate effects. For example, I don't particularly
want to have multiple versions of libraries like boost and eigen on my
system, and if the installer touched areas that are supposed to be under
management of the package system it would be a bad thing (e.g.,
installing stuff to /usr or /lib--/usr/local is OK). The worst case would
be if make install overwrote some libraries that are part of a Debian
package and made random other software that uses those libraries fail.

I figure if I go as far as rearranging things for myself I might as well
try to make Debian package(s) as a public service.

On the issue of library versions, a few more comments. Debian is not
just the OS or the low-level stuff, it is supposed to be a coherent set
of packages that work together. So each release will have its own set
of libraries like boost and eigen (boost is actually packaged in several
different parts and the current stable distribution has libeigen3 and
libeigen2 available). So the natural way to package stan would be to
declare a debian package dependency on packages containing the necessary
libraries (as well as the g++ compiler, which again varies with the
release). Ideally this means that people with different versions of the
libraries (e.g., because they are running different releases, or perhaps
running a Debian derivative like Ubuntu) can get the source and build
it, using whatever libraries they have. Of course, if some version of a
library is absolutely necessary one can declare that, and life becomes
more painful for those who are not at the cutting edge.

All of which underlines how important a good test suite is; I see there
are some tests in the R package and Ben mentioned gtest and unit tests
for the main program. Does gtest run the unit tests, or is it a program
that is used by (some of?) the unit tests?

Bob remarked, on the subject of supporting older versions of the libs
(and perhaps of g++):

> I just don't think it should be a priority given everything
> else we have going on and how easy it is to use the most
> recent libs on all platforms.
>

I don't disagree about the priorities, but using the most recent libs is not easy
on most linux distributions. The distribution (which in this context means not
just Debian but a particular release, e.g. Debian wheezy) will come with a particular version
of the tools/libraries and using a later one will not be straightforward for many
users. Attempting to install a later version improperly could destabilize the system.
Debian has versions that are more current and less stable; in order of
increasing currentness and decreasing reliability they are testing, unstable, and
experimental (the latter is not a full distribution and the former two are essentially pre-releases).
It also has backports of more recent versions of some packages to the current stable system. In particular
there are backports of R available through cran, though not the usual backports locations.

So an informed user can do anything, but it's not necessarily straightforward.
And some people stick with the official releases, aka stable, because it is more
stable and has better support (including security fixes).

Debian provides a way of specifying tests that are optionally run at the
time the package is built; we could hook available tests into that
mechanism. That would make it easy to build and test the packages on
all the Debian architectures.
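That mechanism could be hooked up with the usual debhelper override. A minimal sketch, assuming the dh sequencer is used and that the upstream makefile exposes a unit-test target (the target name here is a guess):

```
# debian/rules fragment (sketch, untested): run the upstream tests at
# build time unless the builder opts out via DEB_BUILD_OPTIONS=nocheck.
override_dh_auto_test:
ifeq (,$(findstring nocheck,$(DEB_BUILD_OPTIONS)))
	$(MAKE) test-unit
endif
```

Restricting this to the faster unit tests would keep package builds tractable.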

With the R package I guess there's a tension between wanting to provide
a nearly "one-click" install experience and providing a regular R
library. Definitely for Debian the r-noncran-stan package would be
basically rstan without the included stan subproject.

Apparently stan has a static library, libstan.a. I think Debian policy
favors using dynamic libraries; is there any particular reason for the
static form?

Ross

Bob Carpenter

unread,
Aug 3, 2013, 9:06:14 PM8/3/13
to stan-...@googlegroups.com


On 8/3/13 8:15 PM, Ross Boylan wrote:
> I'll try to respond to some of the issues that have been raised.
>
> I'm not a debian developer, but I have discussed this on debian-science
> (and just now r-sig-debian) and there seem to be people there who could
> do a sponsored upload and perhaps help in other ways.
>
> Why not try just installing the thing? I might, but it makes me a
> little nervous. Debian has a nice way of managing things, and stomping
> on it can have unfortunate effects.

Stan won't stomp on things in the OS --- it keeps to the directory in
which it was unpacked. And it's self contained.

The problem for us is that we have a small team and the Linux way and
the Mac/Windows way are at odds. We're getting enough users and enough
outside help that if we can manage to keep everything together, we can
probably sort out how to have separate releases and release instructions
for different platforms.

For instance, it'd be nice if we had a Windows installer, because that's
what Windows users expect. And it'd be nice if we had a CRAN install,
because that's what R users expect. And it'd be nice if we released a
Mac .pkg, because that's what Mac users expect.

And we need Python and MATLAB interfaces, because right now we're
not talking to either of those communities.

> For example, I don't particularly
> want to have multiple versions of libraries like boost and eigen on my
> system, and if the installer

There isn't an installer for Stan.

> touched areas that are supposed to be under
> management of the package system it would be a bad thing (e.g.,
> installing stuff to /usr or /lib--/usr/local is OK). Worst case would
> be if make install overwrite some libraries that are part of debian
> package and made random other software that uses the libraries fail.
>
> I figure if I go as far as rearranging things for myself I might as well
> try to make Debian package(s) as a public service.

Thanks!

> On the issue of library versions, a few more comments. Debian is not
> just the OS or the low-level stuff, it is supposed to be a coherent set
> of packages that work together. So each release will have its own set
> of libraries like boost and eigen (boost is actually packaged in several
> different parts and the current stable distribution has libeigen3 and
> libeigen2 available). So the natural way to package stan would be to
> declare a debian package dependency on packages containing the necessary
> libraries (as well as the g++ compiler, which again varies with the
> release). Ideally this means that people with different versions of the
> libraries (e.g., because they are running different releases, or perhaps
> running a Debian derivative like Ubuntu) can get the source and build
> it, using whatever libraries they have. Of course, if some version of a
> library is absolutely necessary one can declare that, and life becomes
> more painful for those who are not at the cutting edge.
>
> All of which underlines how important a good test suite is; I see there
> are some tests in the R package and Ben mentioned gtest and unit tests
> for the main program. Does gtest run the unit tests, or is it a program
> that is used by (some of?) the unit tests?

Our test suite's pretty extensive. The tests can all be run through
make. They're mostly implemented using google test --- some of them
require code generation to handle the combinatorics.

> Bob remarked, on the subject of supporting older versions of the libs
> (and perhaps of g++):
>
>> I just don't think it should be a priority given everything
>> else we have going on and how easy it is to use the most
>> recent libs on all platforms.

>
> I don't disagree about the priorities, but using the most recent libs is not easy
> on most linux distributions. The distribution (which in this context means not
> just Debian but a particular release, e.g. Debian wheezy) will come with a particular version
> of the tools/libraries and using a later one will not be straightforward for many
> users. Attempting to install a later version improperly could destabilize the system.

I wasn't suggesting installing a new version of Boost into
your Linux install!

The reason we're distributing all our libs is to cut down
on external dependencies.

> Debian has versions that are more current and less stable; in order of
> increasing currentness and decreasing reliability they are testing, unstable, and
> experimental (the latter is not a full distribution and the former two are essentially pre-releases).
> It also has backports of more recent versions of some packages to the current stable system. In particular
> there are backports of R available through cran, though not the usual backports locations.
>
> So an informed user can do anything, but it's not necessarily straightforward.
> And some people stick with the official releases, aka stable, because it is more
> stable and has better support (including security fixes).

Makes sense to me.

> Debian provides a way of specifying tests that are optionally run at the
> time the package is built; we could hook available tests into that
> mechanism. That would make it easy to build and test the packages on
> all the Debian architectures.

Our tests take on the order of hours to run if you run the model
tests. The unit tests take 12 minutes or so on a decent quad-core
CPU using make -j4 (using four cores).

> With the R package I guess there's a tension between wanting to provide
> a nearly "one-click" install experience and providing a regular R
> library. Definitely for Debian the r-noncran-stan package would be
> basically rstan without the included stan subproject.

We'd love to put Stan on CRAN and have it be one-click that way.
But Stan's too big (takes too long to compile, too large in size
with the external code, even stripped down). There are no Boost and Eigen libs built
into R (there are some partial ones and some unstable ones, but nothing
we can rely on).

> Apparently stan has a static library, libstan.a. I think Debian policy
> favors using dynamic libraries; is there any particular reason for the
> static form?

Not that I know of. I don't know how to build non-static ones, but
someone on the team can probably figure it out. We use R's build tools and
the inline package to build dynamically linkable versions of the models for R.

- Bob

Ross Boylan

unread,
Aug 4, 2013, 12:01:56 AM8/4/13
to stan-...@googlegroups.com, ro...@biostat.ucsf.edu
On Sat, 2013-08-03 at 21:06 -0400, Bob Carpenter wrote:
>
> On 8/3/13 8:15 PM, Ross Boylan wrote:
...
> > Why not try just installing the thing? I might, but it makes me a
> > little nervous. Debian has a nice way of managing things, and stomping
> > on it can have unfortunate effects.
>
> Stan won't stomp on things in the OS --- it keeps to the directory in
> which it was unpacked. And it's self contained.
Good to know.
...
> > For example, I don't particularly
> > want to have multiple versions of libraries like boost and eigen on my
> > system, and if the installer
>
> There isn't an installer for Stan.
I meant "make install", though a quick look at the makefile suggests
there's no install target--consistent with everything staying in the
original directory.
>
....
> >
> > I don't disagree about the priorities, but using the most recent libs is not easy
> > on most linux distributions. The distribution (which in this context means not
> > just Debian but a particular release, e.g. Debian wheezy) will come with a particular version
> > of the tools/libraries and using a later one will not be straightforward for many
> > users. Attempting to install a later version improperly could destabilize the system.
>
> I wasn't suggesting installing a new version of Boost into
> your Linux install!
If I follow the regular install procedure I will end up with 2 versions
of boost (or at least 2 copies), won't I? As you pointed out,
apparently there's no risk of the version for Stan getting installed
over the ones on the system.
...
>
> We'd love to put Stan on CRAN and have it be one-click that way.
> But Stan's too big (takes too long to compile, too large in size
> with the external code, even stripped down). There is no Boost and Eigen libs built
> into R (there are some partial ones and some unstable ones, but nothing
> we can rely on).
>
In terms of the overall architecture of rstan, does it use the stanc
executable? The code looks as if it does not. And I think the docs say
the result is a loadable library, not the executable that stanc
produces.



> > Apparently stan has a static library, libstan.a. I think Debian policy
> > favors using dynamic libraries; is there any particular reason for the
> > static form?
>
> Not that I know of. I don't know how to build non-static ones, but
> someone on the team can probably figure it out. We use R's build tools and
> the inline package to build dynamically linkable versions of the models for R.
>
I think the static libraries provide more reliable control over the
libraries linked to. With dynamic libraries the libraries chosen are
deliberately deferred until execution time, which increases the risk of
picking up the "wrong" version of the library (e.g., the system version
of a library, rather than the one shipped with stan).

These issues are probably mostly irrelevant to boost and eigen, which
are primarily compile-time header libraries (though of course there are
analogous issues getting the right include file). Some boost
components have small run-time libraries they rely on.

I've also read reports that static libraries run faster.

Ross


Bob Carpenter

unread,
Aug 4, 2013, 12:37:21 AM8/4/13
to stan-...@googlegroups.com


On 8/4/13 12:01 AM, Ross Boylan wrote:
> On Sat, 2013-08-03 at 21:06 -0400, Bob Carpenter wrote:
>>
>> On 8/3/13 8:15 PM, Ross Boylan wrote:
> ...

>> I wasn't suggesting installing a new version of Boost into
>> your Linux install!
> If I follow the regular install procedure I will end up with 2 versions
> of boost (or at least 2 copies), won't I?

Yes, installing Stan will add one copy. And it's a fat lib.

> As you pointed out,
> apparently there's no risk of the version for Stan getting installed
> over the ones on the system.

Right.


> In terms of the overall architecture of rstan, does it use the stanc
> executable? The code looks as if it does not. And I think the docs say
> the result is a loadable library, not the executable that stanc
> produces.

Yes, it uses stanc, but it uses a flag we need to expose to not
generate a main(). Then it gets linked in so that we can read memory
directly from R.

RStan builds its own version of everything from the Stan source.

>>> Apparently stan has a static library, libstan.a. I think Debian policy
>>> favors using dynamic libraries; is there any particular reason for the
>>> static form?
>>
>> Not that I know of. I don't know how to build non-static ones, but
>> someone on the team can probably figure it out. We use R's build tools and
>> the inline package to build dynamically linkable versions of the models for R.
>>
> I think the static libraries provide more reliable control over the
> libraries linked to. With dynamic libraries the libraries chosen are
> deliberately deferred until execution time, which increases the risk of
> picking up the "wrong" version of the library (e.g., the system version
> of a library, rather than the one shipped with stan).

Good point. Stan reintroduces that risk by compiling and linking
every model at what looks like run time to the RStan user
or command-line user.

> These issues are probably mostly irrelevant to boost and eigen, which
> are primarily compile time header libraries (though of course there are
> analogous issues getting the right include file). Some boost
> components have small run-time libraries they rely on.

We've so far stuck to the header-only parts of Boost. Eigen's header only.

That's why our compile times are so slow. We are trying to speed them up
and may wind up creating template instantiations we can precompile.

- Bob

Ben Goodrich

unread,
Aug 4, 2013, 9:51:48 AM8/4/13
to stan-...@googlegroups.com, ro...@biostat.ucsf.edu
On Saturday, August 3, 2013 8:15:13 PM UTC-4, Ross Boylan wrote:
I figure if I go as far as rearranging things for myself I might as well
try to make Debian package(s) as a public service.

As much as I would like Debian packages to be available, you can easily get it working yourself on a Debian system where the only cost is a lot of wasted disk space. I have about 4 copies of Boost, which is crazy, but doesn't really hurt anything.

All that said, making Debian packages would be conceptually simple. It would just be

  1. Patch the makefile to replace these lines
    EIGEN ?= lib/eigen_3.2.0
    BOOST ?= lib/boost_1.54.0
    GTEST ?= lib/gtest_1.6.0
    with the Debian equivalents
    EIGEN ?= /usr/include/eigen3/
    BOOST ?= /usr/include
    GTEST ?= /usr/include
    (or better to use pkg-config) and delete everything under lib/
  2. For the libstan package, make bin/libstan.a and mv it to /usr/lib/stan/bin
  3. For the libstanc package, make bin/libstanc.a and mv it to /usr/lib/stan/bin
  4. For the stanc package, make bin/stanc and mv it to /usr/bin
  5. For the stan_print package, make bin/print and mv it to /usr/bin/stan_print. Replace any calls in stan to bin/print with /usr/bin/stan_print.
  6. For the libstan package, mv src/stan to /usr/include/stan
  7. For the libstan package, mv src/docs to /usr/share/stan/docs and mv src/models to /usr/share/stan/models

I'm not quite sure what to do with the unit tests under src/test but they should go somewhere. The tricky part is that some of the unit tests depend on stanc while the libstan package shouldn't necessarily have a hard dependency on stanc. And that's basically it. The rest is just compliance with Debian policy.

With the R package I guess there's a tension between wanting to provide
a nearly "one-click" install experience and providing a regular R
library.  Definitely for Debian the r-noncran-stan package would be
basically rstan without the included stan subproject.

Doing a r-noncran-rstan package would take more work. We would probably have to utilize autotools.
 
Apparently stan has a static library, libstan.a.  I think Debian policy
favors using dynamic libraries; is there any particular reason for the
static form?

Not a compelling one as far as I know of.

Ben

Ross Boylan

unread,
Aug 5, 2013, 3:23:57 AM8/5/13
to stan-...@googlegroups.com, ro...@biostat.ucsf.edu
While trying to find out how things are put together, I think I've made some discoveries,
some of which are bugs. To summarize the bugs: the R build is having trouble
with the dataset saved in the R directory; I have a work-around that seems crude.
I think I found how to cut the boost headers drastically.

About the rstan/rstan/rstan directory structure:
rstan is the git project.
rstan/stan is a submodule that imports the stan project
rstan/rstan is for building the rstan tarball.
rstan/rstan/rstan is the actual R library that the upper directory builds.

rstan/rstan/Makefile copies or links stuff from rstan/stan into rstan/rstan/rstan (!).

Now for the bugs.
cd rstan/rstan
make
produced the error
[stuff omitted]
* looking to see if a ‘data/datalist’ file should be added
Error in if (any(update)) { : missing value where TRUE/FALSE needed
Execution halted

This seems to be an R bug that has been corrected since the version I'm using (2.15.1).
I added --resave-data="best" as a work-around from the bug indicated in the comments
below; I think moving the data file into the data directory from the R directory might
work too.

Next try:
ross@tempserver:~/UCSF/Choi/GitHub/rstan/rstan$ date; make
Sun Aug 4 22:44:41 PDT 2013
if test -d tmpstanlib; then ln -s ../../../tmpstanlib/ ./rstan/inst/include/stanlib; else ln -s ../../../../stan/lib ./rstan/inst/include/stanlib; fi
ln -s ../../../../stan/src ./rstan/inst/include/stansrc
# --resave-data="best" to work around https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=14947
R --vanilla CMD build rstan --md5 --resave-data="best" # --no-vignettes --no-manual
* checking for file ‘rstan/DESCRIPTION’ ... OK
* preparing ‘rstan’:
* checking DESCRIPTION meta-information ... OK
* cleaning src
* installing the package to process help pages
* saving partial Rd database
* creating vignettes ... OK
* cleaning src
* checking for LF line-endings in source and make files
* checking for empty or unneeded directories
Removed empty directory ‘rstan/inst/include/stanlib/eigen_3.2.0/unsupported/doc/snippets’
Removed empty directory ‘rstan/inst/include/stansrc/models/bugs_examples/vol2/biopsies’
Removed empty directory ‘rstan/inst/include/stansrc/models/bugs_examples/vol3/jama’
* looking to see if a ‘data/datalist’ file should be added
* re-saving image files
Error in loadNamespace(name) : there is no package called ‘rstan’
Execution halted
make: *** [build] Error 1

I think the problem is that it is trying to load the data file, the data file includes a
reference to the package rstan, and there is no such package loaded. I can get the same
effect in a new R session:
> load("GitHub/rstan/rstan/rstan/R/sysdata.rda")
Error in loadNamespace(name) : there is no package called ‘rstan’

I suppose you developers already have rstan on your systems and so don't see this.

I changed to --no-resave-data, which allowed the build. This doesn't seem like a great
solution, however.

I'm working off the latest rstan git, c14f746ec7b85320b2c113468331132bda15b4d8.
I created a local debian branch; I could clone the project on github and push my changes
there if needed.

When the tarball is expanded over 97% of the space is the boost headers. The following suggests
few of them are used:
$ fgrep boost rstan/src/*.cpp | tee boostlist | wc
14 46 1031
ross@tempserver:~/UCSF/Choi/GitHub/rstan/rstan/tmp$ cat boostlist
rstan/src/chains.cpp:#include <boost/accumulators/accumulators.hpp>
rstan/src/chains.cpp:#include <boost/accumulators/statistics/stats.hpp>
rstan/src/chains.cpp:#include <boost/accumulators/statistics/mean.hpp>
rstan/src/chains.cpp:#include <boost/lexical_cast.hpp>
rstan/src/chains.cpp:#include <boost/random/additive_combine.hpp> // L'Ecuyer RNG
rstan/src/chains.cpp:#include <boost/random/uniform_int_distribution.hpp>
rstan/src/chains.cpp: return boost::lexical_cast<unsigned int>(Rcpp::as<std::string>(seed));
rstan/src/chains.cpp: using boost::accumulators::accumulator_set;
rstan/src/chains.cpp: using boost::accumulators::stats;
rstan/src/chains.cpp: using boost::accumulators::tag::mean;
rstan/src/chains.cpp: return boost::accumulators::mean(acc);
rstan/src/chains.cpp: boost::uintmax_t DISCARD_STRIDE = static_cast<boost::uintmax_t>(1) << 50;
rstan/src/chains.cpp: typedef boost::random::ecuyer1988 RNG;
rstan/src/chains.cpp: boost::random::uniform_int_distribution<int> uid(0, i);

Of course, the referenced modules might use other boost modules behind the scenes. It looks as if
bcp is a tool for dealing with this:
http://www.boost.org/doc/libs/1_49_0/tools/bcp/doc/html/index.html

ross@tempserver:~/UCSF/Choi/GitHub/rstan/rstan/tmp$ bcp --boost=rstan/inst/include/stanlib/boost_1.54.0 accumulators/accumulators.hpp accumulators/statistics/stats.hpp accumulators/statistics/mean.hpp lexical_cast.hpp random/additive_combine.hpp random/uniform_int_distribution.hpp myboost
#lots of stuff
ross@tempserver:~/UCSF/Choi/GitHub/rstan/rstan/tmp$ du -s myboost
19016 myboost
ross@tempserver:~/UCSF/Choi/GitHub/rstan/rstan/tmp$ du -s rstan/inst/include/stanlib/boost_1.54.0/
117504 rstan/inst/include/stanlib/boost_1.54.0/

i.e., saves ~80% of the space.
Ross

Ben Goodrich

unread,
Aug 5, 2013, 9:46:40 AM8/5/13
to stan-...@googlegroups.com, ro...@biostat.ucsf.edu
On Monday, August 5, 2013 3:23:57 AM UTC-4, Ross Boylan wrote:
Now for the bugs.
cd rstan/rstan
make
produced the error
[stuff omitted]
* looking to see if a ‘data/datalist’ file should be added
Error in if (any(update)) { : missing value where TRUE/FALSE needed
Execution halted

This seems to be an R bug that has been corrected since the version I'm using (2.15.1).
I added --resave-data="best" as a work-around from the bug indicated in the comments
below; I think moving the data file into the data directory from the R directory might
work too.

We've had some issues with that. The idea was to have a cached model that could be used in many of the examples. But it seems to have caused problems with 2.15.1 -> 3.x transitions and other things. I'll look into it more.

I changed to --no-resave-data, which allowed the build.  This doesn't seem like a great
solution, however.

OK, that may work.
 
I'm working off the latest rstan git, c14f746ec7b85320b2c113468331132bda15b4d8.
I created a local debian branch; I could clone the project on github and push my changes
there if needed.

When the tarball is expanded over 97% of the space is the boost headers.

Yes, when you create an rstan package locally, it includes all of Boost, much of which is unnecessary. The same goes for Stan. For actual rstan releases, Jiqiang does something like what you mentioned to cut out the unnecessary headers. We've talked about doing that in the github repository too, but basically decided against it. The cost of having hundreds of MB of unused headers was deemed less than the cost of running a one-line bcp command. Obviously, that wouldn't fly in Debian, which is why I mentioned that instead of depending on libboost1.54-all, we would have to depend on the particular parts of Boost that Stan uses, which differs depending on whether we are talking about the stanc package or the libstan package.

Ben

Ross Boylan

unread,
Aug 5, 2013, 11:54:46 AM8/5/13
to stan-...@googlegroups.com, ro...@biostat.ucsf.edu
Another issue with sysdata.rda is that it is a binary file. It would
be desirable to have whatever the "source" is (presumably an R script)
for Debian packaging (I'm not sure if policy requires it).

On Mon, Aug 05, 2013 at 06:46:40AM -0700, Ben Goodrich wrote:
> On Monday, August 5, 2013 3:23:57 AM UTC-4, Ross Boylan wrote:
> >
> > Now for the bugs.
> > cd rstan/rstan
> > make
> > produced the error
> > [stuff omitted]
> > * looking to see if a ‘data/datalist’ file should be added
> > Error in if (any(update)) { : missing value where TRUE/FALSE needed
> > Execution halted
> >
> > This seems to be an R bug that has been corrected since the version I'm
> > using (2.15.1).
> > I added --resave-data="best" as a work-around from the bug indicated in
> > the comments
> > below; I think moving the data file into the data directory from the R
> > directory might
> > work too.
> >
>
> We've had some issues with that. The idea was to have a cached model that
> could be used in many of the examples. But it seems to have caused problems
> with 2.15.1 -> 3.x transitions and other things. I'll look into it more.

How about the
Error in loadNamespace(name) : there is no package called ‘rstan’
problem?

One cheat that might work is to have each example check for the cached model in
a global variable, using it if available and filling it in otherwise. I think
R CMD build and R CMD check stick all the examples together before
trying to run them, so the caching would be effective.

I haven't looked at the examples, but this might make them more transparent
to the user, as well as eliminating the mysterious binary from the source
package.

>
> I changed to --no-resave-data, which allowed the build. This doesn't seem
> > like a great
> > solution, however.
> >
>
> OK, that may work.

Note this may result in a larger tarball.

>
>
> > I'm working off the latest rstan git,
> > c14f746ec7b85320b2c113468331132bda15b4d8.
> > I created a local debian branch; I could clone the project on github and
> > push my changes
> > there if needed.
> >
> > When the tarball is expanded over 97% of the space is the boost headers.
>
>
> Yes, when you create an rstan package locally, it includes all of Boost,
> much of which is unnecessary. The same goes for Stan. For actual rstan
> releases, Jiqiang does something like what you mentioned to cut out the
> unnecessary headers. We've talked about doing that in the github repository
> too, but basically decided against it. The cost of having hundreds of MB of
> unused headers was deemed less than the cost of running a one-line bcp
> command.

The problem with using bcp, e.g., in rstan/rstan/makefile, is that this
makes the build system dependent on another external binary.

However, that part of the build system is really the meta-build system (I think):
it is the system that prepares the tarballs that end users will receive.
Since it does not require the end user to have bcp, it might be more acceptable.

> Obviously, that wouldn't fly in Debian, which is why I mentioned
> that instead of depending on libboost1.54-all, we would have to depend on
> the particular parts of Boost that Stan uses, which differs depending on
> whether we are talking about the stanc package or the libstan package.

Actually, the headers of boost all seem to be in a single package. It is the little
boost binary libraries that are in separate packages.

The Debian way would be to remove the external libraries from the (Debian) source package
entirely and simply have the package depend, or build-depend, on the appropriate
libraries.

The current packaging is inconsistent on handling of external
dependencies. To get things working one needs a compiler, R (at least
for rstan), boost, eigen, and (sometimes?) gtest. The first two are left
as an exercise for the user, while the rest are provided.

Ross

Ben Goodrich

unread,
Aug 5, 2013, 1:22:50 PM8/5/13
to stan-...@googlegroups.com, ro...@biostat.ucsf.edu
On Monday, August 5, 2013 11:54:46 AM UTC-4, Ross Boylan wrote:
Another issue with sysdata.rda is that it is a binary file.  It would
be desirable to have whatever the "source" is (presumably an R script)
for Debian packaging (I'm not sure if policy requires it).

Yeah, if we continue having sysdata.rda, then we can easily include the script that generates it in inst/ or whatever.

Ben

Jiqiang Guo

Aug 5, 2013, 3:19:48 PM8/5/13
to stan-...@googlegroups.com
The code generating sysdata.rda is https://github.com/stan-dev/rstan/blob/develop/rstan/example/examplemodel.R

I would like to discontinue having sysdata.rda.  We could put more tests to 
It is tested on Jenkins daily now after a successful build of rstan. 

--
Jiqiang 

Ben Goodrich

Aug 5, 2013, 3:26:54 PM8/5/13
to stan-...@googlegroups.com
On Monday, August 5, 2013 3:19:48 PM UTC-4, Jiqiang Guo wrote:
> The code generating sysdata.rda is https://github.com/stan-dev/rstan/blob/develop/rstan/example/examplemodel.R
>
> I would like to discontinue having sysdata.rda.  We could put more tests to 
> It is tested on Jenkins daily now after a successful build of rstan.

I never thought of sysdata.rda as being primarily for unit-testing but more as a way to avoid a bunch of \dontrun in the examples.

Would you rather distribute a bunch of .csv files from a model and have the examples generate a stanfit object from those .csv files and then call whatever function is being illustrated in the example?

Ben


Jiqiang Guo

Aug 5, 2013, 4:00:31 PM8/5/13
to stan-...@googlegroups.com
On Mon, Aug 5, 2013 at 3:26 PM, Ben Goodrich <goodri...@gmail.com> wrote:
> On Monday, August 5, 2013 3:19:48 PM UTC-4, Jiqiang Guo wrote:
>> The code generating sysdata.rda is https://github.com/stan-dev/rstan/blob/develop/rstan/example/examplemodel.R
>>
>> I would like to discontinue having sysdata.rda.  We could put more tests to 
>> It is tested on Jenkins daily now after a successful build of rstan.
>
> I never thought of sysdata.rda as being primarily for unit-testing but more as a way to avoid a bunch of \dontrun in the examples.

I thought being part of the tests was one of your goals.  I do not have a problem with \dontrun, as it does not mean the code cannot be run, though after the doc is written we do not run it every time we release rstan.

> Would you rather distribute a bunch of .csv files from a model and have the examples generate a stanfit object from those .csv files and then call whatever function is being illustrated in the example?

I do not like that either, as the files would need to be regenerated.  But on second thought, how does the code in the example section typically get used?  If users really try the code and look at the result, I can see the merit of using csv files.  But if users mostly just take a quick look at the code, putting it in \dontrun is enough.

--
Jiqiang 

Ben Goodrich

Aug 5, 2013, 4:15:52 PM8/5/13
to stan-...@googlegroups.com
On Monday, August 5, 2013 4:00:31 PM UTC-4, Jiqiang Guo wrote:
> On Mon, Aug 5, 2013 at 3:26 PM, Ben Goodrich <goodri...@gmail.com> wrote:
>> On Monday, August 5, 2013 3:19:48 PM UTC-4, Jiqiang Guo wrote:
>>> The code generating sysdata.rda is https://github.com/stan-dev/rstan/blob/develop/rstan/example/examplemodel.R
>>>
>>> I would like to discontinue having sysdata.rda.  We could put more tests to 
>>> It is tested on Jenkins daily now after a successful build of rstan.
>>
>> I never thought of sysdata.rda as being primarily for unit-testing but more as a way to avoid a bunch of \dontrun in the examples.
>
> I thought being part of the tests was one of your goals.  I do not have a problem with \dontrun, as it does not mean the code cannot be run, though after the doc is written we do not run it every time we release rstan.
>
>> Would you rather distribute a bunch of .csv files from a model and have the examples generate a stanfit object from those .csv files and then call whatever function is being illustrated in the example?
>
> I do not like that either, as the files would need to be regenerated.  But on second thought, how does the code in the example section typically get used?  If users really try the code and look at the result, I can see the merit of using csv files.  But if users mostly just take a quick look at the code, putting it in \dontrun is enough.

Maybe I hate \dontrun more than most people, but I hate it, particularly for plots. Even if you use example() to see the code, it is hard to copy and paste because of the prompt prefixes. Sometimes it is useful just to see the code, but sometimes it is useful to see the output of the code, as with example(plotmath). I think the rstan examples tend to be more like the latter, so if we can do it by making stanfit objects out of the .csv files, that would be good, I think.
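A sketch of what such an example section might look like, assuming a helper along the lines of rstan's read_stan_csv and a hypothetical inst/misc directory holding the shipped sample files:

```r
# Sketch only: rebuild a stanfit object from CSV sample files shipped
# with the package, instead of compiling and sampling inside the
# example. The directory name and file layout are assumptions about
# how this would be packaged.
library(rstan)
csvfiles <- dir(system.file("misc", package = "rstan"),
                pattern = "\\.csv$", full.names = TRUE)
fit <- read_stan_csv(csvfiles)   # assemble a stanfit from saved draws
print(fit)                       # the example now runs without a compiler
```

Because nothing here invokes the C++ toolchain, the example (and its plots) could run at R CMD check time without \dontrun.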

Ben

Jiqiang Guo

Aug 6, 2013, 12:02:57 PM8/6/13
to stan-...@googlegroups.com
I will add some csv files so you can hate a bit less. 

--
Jiqiang 
