precompiling RStan vs. process dependency


Bob Carpenter

Dec 1, 2013, 10:07:55 PM
to stan...@googlegroups.com
The only thing we can precompile for RStan is stanc and the
posterior analysis tools (equivalent of bin/print in
the command-line Stan). We could already precompile these if we wanted
to. Then we'd still need to be able to compile models dynamically,
which means we need the entire tool chain, including Rcpp and
a recent enough C++ compiler. And we'd still have the same issue
with R's global config of compiler options to contend with.

So I still don't think being distributed on CRAN is
going to make it much easier for users to get the whole system
working. It will make the initial install easier and make us
easier to find. Maybe it will give us more credibility with R users.

Here's the alternative I'm currently thinking is best:

Refactor RStan so that it calls Stan command-line in
a separate process (or processes). This is the way that R2WinBUGS and
R2jags work. This has a number of pleasant properties:

* make it trivial to update RStan for new versions of Stan

* completely eliminate the possibility of Stan crashing R or
leaking resources like memory or file handles in R

* make it trivial to get onto CRAN

* move the responsibility for supporting Windows, RStudio, etc.,
off of the RStan maintainers and put it in a single place that
is going to be maintained anyway

* move the responsibility for the majority of doc off of
RStan --- RStan will only need to point to the Stan command-line
doc for configuring the actual samplers

* remove the Rcpp dependency from RStan

I think this is the way we have to go longer term to reduce the number
of dependencies in the project and make it maintainable going forward.
It will require more disk-based I/O:

* RStan will need to write data to file (already implemented with RStan's dump)

* RStan will need to read Stan's CSV output (also already implemented)

I now realize that neither of these is really a bottleneck for RStan.

It will also require:

* RStan will need to know where Stan is installed

* RStan users will need to install Stan

I actually think the latter won't be much harder than getting RStan
installed --- we can still let them install Ripley's toolchain for
Windows, use Xcode for the Mac, and let Linux users have whatever
version of make and clang++ or g++ they want as long as it's
recent enough.
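
To make that concrete, the whole loop could look roughly like this from R
(a sketch only: run_stan() is a made-up name, and the argument syntax
assumes Stan 2.x-style command-line arguments):

run_stan <- function(model_exe, data, chain_id = 1) {
  data_file   <- tempfile(fileext = ".data.R")
  output_file <- tempfile(fileext = ".csv")
  # write the data in Stan's dump format (already implemented in RStan)
  rstan::stan_rdump(names(data), file = data_file, envir = list2env(data))
  # run the sampler in a separate process; a crash can't take R down with it
  system2(model_exe,
          args = c("sample",
                   paste0("data file=", data_file),
                   paste0("output file=", output_file),
                   paste0("id=", chain_id)))
  # Stan writes config and adaptation info as '#' comment lines in the CSV
  read.csv(output_file, comment.char = "#")
}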

- Bob

Ben Goodrich

Dec 1, 2013, 10:40:10 PM
to stan...@googlegroups.com
On Sunday, December 1, 2013 10:07:55 PM UTC-5, Bob Carpenter wrote:
Here's the alternative I'm currently thinking is best:

Refactor RStan so that it calls Stan command-line in
a separate process (or processes).  This is the way that R2WinBUGS and
R2jags work.  This has a number of pleasant properties:

This may make things easier for us, but there are going to be more instances where the user doesn't have enough RAM to hold the data in R and a copy of the data in Stan. However, usually compiling the user model consumes more RAM than one copy of the data, so if they can compile the model, then they can run it most of the time. But there are going to be more complaints that Stan is slow, when really they are just swapping.

   * make it trivial to update RStan for new versions of Stan

   * completely eliminate the possibility of Stan crashing R or
     leaking resources like memory or file handles in R

   * make it trivial to get onto CRAN

Iff we make regular Stan a system requirement, but binary installs of an R package don't help that much if the user has to compile stanc, libstan, etc. from source. And we would need something like pkg-config

https://groups.google.com/d/msg/stan-dev/fAxadsIJ5ss/V9o6Os3yaqcJ

Maintaining a different command.hpp file for RStan is somewhat annoying, but I don't think the rest of RStan has been that difficult to do.

Ben

Bob Carpenter

Dec 1, 2013, 11:22:02 PM
to stan...@googlegroups.com


On 12/1/13, 10:40 PM, Ben Goodrich wrote:
> On Sunday, December 1, 2013 10:07:55 PM UTC-5, Bob Carpenter wrote:
>
> Here's the alternative I'm currently thinking is best:
>
> Refactor RStan so that it calls Stan command-line in
> a separate process (or processes). This is the way that R2WinBUGS and
> R2jags work. This has a number of pleasant properties:
>
>
> This may make things easier for us, but there are going to be more instances where the user doesn't have enough RAM to
> hold the data in R and a copy of the data in Stan. However, usually compiling the user model consumes more RAM than one
> copy of the data, so if they can compile the model, then they can run it most of the time. But there are going to be more
> complaints that Stan is slow, when really they are just swapping.

Although it would be nice if we shared memory between R and Stan,
we don't currently do that. The memory gets copied from R into member
variables in the Stan model.

Changing to an external process, we could actually delete the local
copy of the data after it's written and before calling Stan.

There's also no extra overhead on reading the output from a Stan CSV
file back into R.

>
> * make it trivial to update RStan for new versions of Stan
>
> * completely eliminate the possibility of Stan crashing R or
> leaking resources like memory or file handles in R
>
> * make it trivial to get onto CRAN
>
>
> Iff we make regular Stan a system requirement,

I don't know what this means.

> but binary installs of an R package doesn't help that much if the user
> has to compile stanc, libstan, etc. from source.

It would break the dependency between R's compiler config and
Stan's. So far, a huge number of our R install problems have been
due to R's compiler config.

> And we would need something like pkg-config
>
> https://groups.google.com/d/msg/stan-dev/fAxadsIJ5ss/V9o6Os3yaqcJ

Which references: http://en.wikipedia.org/wiki/Pkg-config

If we can make that work, it would probably be better than what
R2jags and R2WinBUGS do, which is require the R lib to be configured
with the location of JAGS or BUGS.
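
Something like this, for instance (purely hypothetical, since no stan.pc
pkg-config module exists today):

# ask pkg-config where a hypothetical 'stan' module keeps its binaries
stan_bindir <- system2("pkg-config", c("--variable=bindir", "stan"),
                       stdout = TRUE)
stanc_path <- file.path(stan_bindir, "stanc")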

> Maintaining a different command.hpp file for RStan is somewhat annoying, but I don't think the rest of RStan has been
> that difficult to do.

Jiqiang reported that he didn't want to deal with
Windows. Allen said the same thing w.r.t. PyStan.
RStudio's caused some issues that would go away.
And installing Rcpp has been the major pain, though
we're left with a large part of that pain in getting the
toolchain installed.

I think porting the command as it stands is pretty challenging.
Jiqiang's done a great job of it, but it's still work, it's still
error-prone, and it still means that every time we change Stan's command
line, we have to change RStan and PyStan. A process-based approach reduces
dependencies. It means we only have to update RStan when the format of
Stan's command-line output changes, which we don't anticipate happening
as often as the samplers and their command-line config get updated.

Porting would be easier if we make the Stan samplers and optimizers
more modular and less tied into the command line.

- Bob

Ben Goodrich

Dec 2, 2013, 12:04:07 AM
to stan...@googlegroups.com
On Sunday, December 1, 2013 11:22:02 PM UTC-5, Bob Carpenter wrote:
On 12/1/13, 10:40 PM, Ben Goodrich wrote:
> On Sunday, December 1, 2013 10:07:55 PM UTC-5, Bob Carpenter wrote:
>
>     Here's the alternative I'm currently thinking is best:
>
>     Refactor RStan so that it calls Stan command-line in
>     a separate process (or processes).  This is the way that R2WinBUGS and
>     R2jags work.  This has a number of pleasant properties:
>
>
> This may make things easier for us, but there are going to be more instances where the user doesn't have enough RAM to
> hold the data in R and a copy of the data in Stan. However, usually compiling the user model consumes more RAM than one
> copy of the data, so if they can compile the model, then they can run it most of the time. But there are going to be more
> complaints that Stan is slow, when really they are just swapping.

Although it would be nice if we shared memory between R and Stan,
we don't currently do that.  The memory gets copied from R into member
variables in the Stan model.

I did not know that. Was there a reason besides my inability to recognize the need to add DUP = FALSE to all the .C() and .Call() calls?
 
> Iff we make regular Stan a system requirement,

I don't know what this means.

I thought that was what you were advocating: Requiring the users to have Stan locally and then installing an R package that accesses that. No?
 
> but binary installs of an R package doesn't help that much if the user
> has to compile stanc, libstan, etc. from source.

It would break the dependency between R's compiler config and
Stan's.  So far, a huge number of our R install problems have been
due to R's compiler config.

There have been a number of issues where the user has -g or something in their Makevars and then they can't build it. But that is a solved problem for the latest version of R. In any event, if we could just have users do install.packages("rstan") and have it prompt to install Rtools if necessary, that would minimize the problems. We just have to get the binary package to build on the CRAN servers.
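
The check itself could be as simple as something like this in rstan's
.onLoad (a sketch only; the prompt-to-install behavior doesn't exist yet):

if (.Platform$OS.type == "windows" && !nzchar(Sys.which("g++"))) {
  message("No C++ toolchain found; please install Rtools from ",
          "http://cran.r-project.org/bin/windows/Rtools/")
}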
 
> And we would need something like pkg-config
>
> https://groups.google.com/d/msg/stan-dev/fAxadsIJ5ss/V9o6Os3yaqcJ

Which references:  http://en.wikipedia.org/wiki/Pkg-config

If we can make that work, it would probably be better than what
R2jags and R2WinBUGS do, which is require the R lib to be configured
with the location of JAGS or BUGS.

> Maintaining a different command.hpp file for RStan is somewhat annoying, but I don't think the rest of RStan has been
> that difficult to do.

Jiqiang reported that he didn't want to deal with
Windows.  Allen said the same thing w.r.t. PyStan.

I hate Windows as much as anyone, but I don't think this is going to help with Windows nearly as much as having binary Windows packages on CRAN. I think Windows users are going to have just as many problems getting stanc, libstan, etc. built.
 
RStudio's caused some issues that would go away.
And installing Rcpp has been the major pain, though
we're left with a large part of that pain in getting the
toolchain installed.

I think it would be within our rights to refer people to the Rcpp list.
 
I think porting the command as it stands is pretty challenging.
Jiqiang's done a great job of it, but it's still work, it's still
error-prone, and it still means that every time we change Stan's command
line, we have to change RStan and PyStan.  A process-based approach reduces
dependencies.  It means we only have to update RStan when the format of
Stan's command-line output changes, which we don't anticipate happening
as often as the samplers and their command-line config get updated.

I can see the value in having the rstan package just make system calls, but that is a separate issue from what to do about CRAN or whether rstan should include Stan.

There is another issue in that we would be lucky if we could get CRAN to update a local installation of Stan once a year on three platforms. Once a year might be enough if the old Stan were adequate to build the new rstan package and users were downloading the new Stan themselves. But there would be a lot of emails with us pleading for CRAN to update Stan and CRAN telling us to fuck off, that we are ignoring the fact that they have to manage thousands of packages, that we don't know how to develop software in a civilized fashion, etc.

Ben

Bob Carpenter

Dec 2, 2013, 12:36:33 AM
to stan...@googlegroups.com


On 12/2/13, 12:04 AM, Ben Goodrich wrote:
> On Sunday, December 1, 2013 11:22:02 PM UTC-5, Bob Carpenter wrote:

> Although it would be nice if we shared memory between R and Stan,
> we don't currently do that. The memory gets copied from R into member
> variables in the Stan model.
>
>
> I did not know that. Was there a reason besides my inability to recognize the need to add DUP = FALSE to all the .C()
> and .Call() calls?

Only that Stan itself copies the data in its model constructor.
Nothing you can do about that from the outside. We could potentially
rewrite it so that if the memory were local coming from R, we could
share the memory in Eigen using Map. It'd be a pretty big change to
Stan and it wouldn't be possible with array (std::vector) types, because
they don't have constructors that don't copy the data (at least as far
as I know).

>
> > Iff we make regular Stan a system requirement,
>
> I don't know what this means.
>
>
> I thought that was what you were advocating: Requiring the users to have Stan locally and then installing an R package
> that accesses that. No?

Yes. I just didn't know if the "system requirement" terminology
entailed something more than that.

> > but binary installs of an R package doesn't help that much if the user
> > has to compile stanc, libstan, etc. from source.
>
> It would break the dependency between R's compiler config and
> Stan's. So far, a huge number of our R install problems have been
> due to R's compiler config.
>
>
> There have been a number of issues where the user has -g or something in their Makevars and then they can't build it.
> But that is a solved problem for the latest version of R.

That's good news --- it should cut down on our install help mail.

> In any event, if we could just have users do
> install.packages("rstan") and have it prompt to install Rtools if necessary, that would minimize the problems.

Yes, that would be great. That's still just stanc and libstan, though.
And I'm not sure how much libstan is even saving us in compile time
for models vs. a header-only approach.

We could distribute Windows binaries of it ourselves. In
fact, we probably should, given the problems people have been
having getting it to build with Windows' poor tool chain.

> We just
> have to get the binary package to build on the CRAN servers.

No idea what that entails. Making the model compiles go faster
by precompiling things is probably in direct opposition to getting
CRAN to build our packages for us.

> I hate Windows as much as anyone, but I don't think this is going to help with Windows nearly as much as having binary
> Windows packages on CRAN. I think Windows users are going to have just as many problems getting stanc, libstan, etc. built.

Windows was fine when I was doing Java development.
But the C++ tool chain there is miserable. So users will
still have trouble getting stanc, libstan and models built.
My point was just that it won't be an RStan issue any more,
so whoever maintains RStan won't have to deal with it.

> RStudio's caused some issues that would go away.
> And installing Rcpp has been the major pain, though
> we're left with a large part of that pain in getting the
> toolchain installed.
>
>
> I think it would be within our rights to refer people to the Rcpp list.

We could. But I want to try to be more helpful. The Rcpp list
wasn't particularly friendly to us when we were trying to build in
the first place.

>
> I think porting the command as it stands is pretty challenging.
> Jiqiang's done a great job of it, but it's still work, it's still
> error-prone, and it still means that every time we change Stan's command
> line, we have to change RStan and PyStan. A process-based approach reduces
> dependencies. It means we only have to update RStan when the format of
> Stan's command-line output changes, which we don't anticipate happening
> as often as the samplers and their command-line config get updated.
>
>
> I can see the value in having the rstan package just make system calls, but that is a separate issue from what to do
> about CRAN or whether rstan should include Stan.

It won't help us get on CRAN if there isn't any C++ code
in RStan? I thought that'd make it trivial, but admit I
know squat about CRAN or R package build and distribution.

> There is another issue in that we would be lucky if we could get CRAN to update a local installation of Stan once a year
> on three platforms.

Given our development cycle, I don't think that'll work.
At least not until we have much more stable versions of Stan.

> Once a year might be enough if the old Stan were adequate to build the new rstan package and users
> were downloading the new Stan themselves. But there would be a lot of emails with us pleading for CRAN to update Stan
> and CRAN telling us to fuck off, that we are ignoring the fact that they have to manage thousands of packages, that we
> don't know how to develop software in a civilized fashion, etc.

I'd rather just reduce our dependence on the kindness of strangers.

- Bob

Ben Goodrich

Dec 2, 2013, 2:15:47 AM
to stan...@googlegroups.com
On Monday, December 2, 2013 12:36:33 AM UTC-5, Bob Carpenter wrote:
On 12/2/13, 12:04 AM, Ben Goodrich wrote:
> On Sunday, December 1, 2013 11:22:02 PM UTC-5, Bob Carpenter wrote:

>     Although it would be nice if we shared memory between R and Stan,
>     we don't currently do that.  The memory gets copied from R into member
>     variables in the Stan model.
>
>
> I did not know that. Was there a reason besides my inability to recognize the need to add DUP = FALSE to all the .C()
> and .Call() calls?

Only that Stan itself copies the data in its model constructor.
Nothing you can do about that from the outside.  We could potentially
rewrite it so that if the memory were local coming from R, we could
share the memory in Eigen using Map.  It'd be a pretty big change to
Stan and it wouldn't be possible with array (std::vector) types, because
they don't have constructors that don't copy the data (at least as far
as I know).

So we have two extra copies of the data, because R copies the data by default before passing it to an external library. That copy can be turned off by specifying DUP = FALSE, which is safe when the external library does not modify the data (in R's workspace). But if Stan is making another copy anyway, then there is not much of an advantage to the in-memory approach.
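
For anyone following along, the copy in question is the one R makes at the
.C() boundary. A tiny sketch (the routine name is hypothetical and the
calls are commented out since the symbol doesn't exist; DUP applies to .C()
as it worked at the time of this thread):

x <- rnorm(1e6)
# default: R duplicates x before handing it to the compiled code
# .C("stan_sample", as.double(x), as.integer(length(x)))
# DUP = FALSE skips that copy; safe only if the C code never writes into x
# .C("stan_sample", x, as.integer(length(x)), DUP = FALSE)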
 
> In any event, if we could just have users do
> install.packages("rstan") and have it prompt to install Rtools if necessary, that would minimize the problems.

Yes, that would be great.  That's still just stanc and libstan, though.
And I'm not sure how much libstan is even saving us in compile time
for models vs. a header-only approach.

We have not had many instances on stan-users where someone had a valid model that couldn't be compiled on their machine. We have had parser bugs that were producing flawed C++ and a lot of problems building stanc. So, if we could get CRAN to provide binary installation of stanc, we would be 90% of the way there (particularly since rstan could check for the existence of a C++ compiler and offer to download / install one if it is not found).
 
We could distribute Windows binaries of it ourselves.  In
fact, we probably should given the problems people have been
having getting it to build given Windows' poor tool chain.

I wouldn't want anyone to trust us with binaries that we generate / sign ourselves. People should barely be trusting the binaries they get from CRAN.
 
> We just
> have to get the binary package to build on the CRAN servers.

No idea what that entails.  Making the model compiles go faster
by precompiling things is probably in direct opposition to getting
CRAN to build our packages for us.

It's documented pretty extensively

http://cran.r-project.org/doc/manuals/r-release/R-exts.html

but the short version is that rstan was rejected the first time for including many MB of Boost.

So, we could say that Boost is a SystemRequirement in rstan's DESCRIPTION file, in which case it would probably be fine for rstan to bundle src/stan/ and maybe lib/eigen_*. But I don't think we could get away with saying that a specific version of Boost is a SystemRequirement because CRAN won't promise to install that version (it probably has 1.48 or 1.49 or something installed).

Or we could say that Stan is a SystemRequirement, in which case it would have whatever version of Boost and Eigen that we have under lib. And CRAN does install open-source libraries on the CRAN servers that R packages access. But I don't think they would be happy to update CRAN's copy of Stan very often since it requires doing so on 3 platforms by a human who disagrees with this non-UNIX approach to packaging and distributing software anyway.

Or we can say that the BH R package is a Depends of rstan. In that case, rstan would include Stan (sans most of Boost but maybe with Eigen and a few parts of Boost) and build stanc, libstan, etc. This is what I think would be best, because we could update rstan once a month if we wanted to and no human associated with CRAN has to do much. That would require that Stan build with the version of Boost in BH plus whatever parts of Boost rstan includes, but I don't think that is very daunting (I did it a few weeks ago).
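
Under that option, the relevant DESCRIPTION fields might look roughly like
this (illustrative values only; strictly speaking, a header-only package
like BH is listed in LinkingTo rather than Depends):

Package: rstan
Depends: R (>= 2.15.1)
Imports: Rcpp
LinkingTo: Rcpp, BH
SystemRequirements: GNU make

Eigen could either ride along under rstan's inst/include or come from
RcppEigen the same way.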

> I can see the value in having the rstan package just make system calls, but that is a separate issue from what to do
> about CRAN or whether rstan should include Stan.

It won't help us get on CRAN if there isn't any C++ code
in RStan?   I thought that'd make it trivial, but admit I
know squat about CRAN or R package build and distribution.

There are tons of R packages that use C++ (although nowadays most do so via Rcpp). Our problems with CRAN before weren't even technical. But under the non-CRAN approach I think you prefer, we have to get all the (mostly Windows) users to build stanc and libstan locally and then install our rstan package from GitHub and get it talking to the local installation of Stan. And I am skeptical that will be much of a net gain from where we are now. It could be a net loss. But I don't think having rstan on CRAN would help much if it required users and CRAN to have a local installation of Stan. Conversely, I think it would be worth almost any cost if we could get binary installations of rstan from CRAN that didn't require a local installation of Stan.

Ben

Bob Carpenter

Dec 2, 2013, 12:58:32 PM
to stan...@googlegroups.com
[I modified the subject to get everyone else's attention who
may have dropped out thinking this issue only affects RStan.]

I believe what Ben is suggesting would:

* make RStan installation easier,

* increase the burden on RStan devs for supporting platforms
like Windows and RStudio, and

* tie the Stan C++ API down to whatever version and subset of Boost
and Eigen are supplied by RcppEigen and BH (we could perhaps influence
these packages' choices within the guidelines of CRAN compatibility).


On 12/2/13, 2:15 AM, Ben Goodrich wrote:
>...
> There are tons of R packages that use C++ (although nowadays most do so via Rcpp). Our problems with CRAN before weren't
> even technical. But under the non-CRAN approach I think you prefer, we have to get all the (mostly Windows) users to
> build stanc and libstan locally and then install our rstan package from GitHub and get it talking to the local
> installation of Stan. And I am skeptical that will be much of a net gain from where we are now.

Not necessarily in terms of ease of installation, but it takes
the burden of worrying about Windows and other platforms off the
RStan devs.

> It could be a net loss.

I can vouch for the fact that when I was a very novice R and Windows
user, I couldn't get R2WinBUGS or R2jags installed on Windows without
help from Yu-Sung and Masanao. So I agree that the approach I'm advocating
is not without its costs.

> But I don't think having rstan on CRAN would help much if it required users and CRAN to have a local installation of
> Stan. Conversely, I think it would be worth almost any cost if we could get binary installations of rstan from CRAN that
> didn't require a local installation of Stan.

I think this latter statement is where we fundamentally disagree.

I'd say that avoiding dependencies is worth almost any cost.
We're going to have the same issues with Python, Debian and other
Linux distros, Stata, etc., and I have a very strong preference for
minimizing the dependencies in both the code and between Stan
and its interfaces over making it slightly easier for users to install.
If we keep the interfaces abstracted from the executables of Stan,
all of these dependencies go away.

This is exactly the kind of issue we have no process in place to
resolve. Reasonable people can disagree about priorities.

Does anyone else have any opinions? This is a decision that will impact
what we can do in the Stan C++ API, as well as in the CmdStan, RStan
and PyStan interfaces, assuming we want RStan to be able to keep up with
CmdStan.

- Bob

Ben Goodrich

Dec 2, 2013, 2:27:48 PM
to stan...@googlegroups.com
On Monday, December 2, 2013 12:58:32 PM UTC-5, Bob Carpenter wrote:
> There are tons of R packages that use C++ (although nowadays most do so via Rcpp). Our problems with CRAN before weren't
> even technical. But under the non-CRAN approach I think you prefer, we have to get all the (mostly Windows) users to
> build stanc and libstan locally and then install our rstan package from GitHub and get it talking to the local
> installation of Stan. And I am skeptical that will be much of a net gain from where we are now.

Not necessarily in terms of ease of installation, but it takes
the burden of worrying about Windows and other platforms off the
RStan devs.

I don't know if I understand your position here. I think you are saying that if a Windows person can't build Stan locally then it is a Stan problem rather than an RStan problem. And if so, then it would be a collective responsibility for us to deal with that person rather than something that falls more heavily on Mav and me. I guess I just feel that an installation problem is an installation problem, so if there is one I will probably reply on stan-users unless it is a Mac thing that I don't know anything about. I think what is more relevant is to get the probability of an installation problem down. Maybe going away from Rcpp would help some, but what I think would help most is binary installs of stanc.

> But I don't think having rstan on CRAN would help much if it required users and CRAN to have a local installation of
> Stan. Conversely, I think it would be worth almost any cost if we could get binary installations of rstan from CRAN that
> didn't require a local installation of Stan.

I think this latter statement is where we fundamentally disagree.

I'd say that avoiding dependencies is worth almost any cost.
We're going to have the same issues with Python, Debian and other
Linux distros, Stata, etc.,

Yes, we have those issues, but Debian in particular will refuse as a matter of policy to embed Boost or Eigen with Stan, primarily for security reasons (and I'm pretty sure every Linux distro has that policy). You are only allowed to use one of the versions of Boost that is available in a particular repository. At the moment, Debian is trying to get Boost 1.54 into the next stable version of Debian, which is likely a year or more away, and the latest stable version of Debian has Boost 1.49. So, dealing with multiple versions of Boost or Eigen, or refraining from updating Boost in Stan, is inevitable if Stan (or PyStan) is ever going to be packaged for Linux distros. But it is not a huge deal. Someone discovers that Stan doesn't work with some version of Boost on ARM or MIPS or whatever, and they usually send you a patch to fix it.
 
This is exactly the kind of issue we have no process in place to
resolve.  Reasonable people can disagree about priorities.

I agree we are making predictions about the future states of the world. But I am influenced a lot by the experience of Sage, which is an open-source alternative to Mathematica and other closed-source math software, and they bundle all of their (dozens of) dependencies into a 340+ MB tarball with particular (and sometimes patched) versions:

http://www.sagemath.org/packages/standard/

And that makes it easier to write code. But they get a ton of emails from users who can't build it successfully on their machine and ask why they need to build such and such when they already have it installed. And someone spent a lot of time getting everything packaged for Debian experimental, but it eventually got kicked out of Debian. Then they got a GSOC student to refactor everything so Sage could utilize system dependencies and thus be easier for Linux distros to package, but then Sage wouldn't accept the patches. Basically, it has been a giant mess, and for most of the Mathematica users they are trying to lure away, I think it would have been much easier if they had sought to do binary installation of the dependencies from the outset.

Ben

Bob Carpenter

Dec 2, 2013, 2:48:46 PM
to stan...@googlegroups.com


On 12/2/13, 2:27 PM, Ben Goodrich wrote:
> I don't know if I understand your position here. I think you are saying that if a Windows person can't build Stan
> locally then it is a Stan problem rather than an RStan problem.

Exactly. There's just a single set of install instructions for
Stan.

As I keep stressing, to me this is all about minimizing dependencies
in code, doc, and development.

I'm less worried about responsibility than ease of development, doc,
testing, and maintenance.

> And if so, then it would be a collective responsibility
> for us to deal with that person rather than something that falls more heavily on Mav and me.

Unfortunately, none of the rest of us know much about the
intricacies of R.

My suggestion to integrate at the process level would
mean that the installation of Stan only has to be documented,
tested, and supported in a single way that works across
platform.

Net, I think it'll be a little harder on users, but a whole
lot easier on us.

> I guess I just feel that an
> installation problem is an installation problem, so if there is one I will probably reply on stan-users unless it is a
> Mac thing that I don't know anything about.

I feel the same way about Linux, Python, R, and Windows.
I'm terrible at installing things and our makefiles are
pretty much completely opaque to me.

>I think what is more relevant is to get the probability of an installation
> problem down.

I agree that this is a big issue.

> Maybe going away from Rcpp would help some, but what I think would help
> most is binary installs of stanc.

Going away from Rcpp isn't so much to make installation easier as to
reduce dependencies in the code.

Binary installs of stanc and libstan would help, but
unless the users can install Rcpp and get the toolchain working to
compile models, they're hosed no matter how easy it is to install
stanc and libstan.a.

And we're not going to go to a 340MB tarball for distribution! That's
one of the reasons I don't want to just bring in every lib we think
might be useful. I really think we can get by with just Boost and
Eigen (and maybe the PRNG package you want us to use).

It'd be great to have a Debian package, binary R installs, official
Python installs, and so forth, but what I'm questioning is whether
it's worth our ongoing efforts.

- Bob

Ben Goodrich

Dec 4, 2013, 6:37:49 PM
to stan...@googlegroups.com
On Monday, December 2, 2013 12:36:33 AM UTC-5, Bob Carpenter wrote:
On 12/2/13, 12:04 AM, Ben Goodrich wrote:
> On Sunday, December 1, 2013 11:22:02 PM UTC-5, Bob Carpenter wrote:

>     Although it would be nice if we shared memory between R and Stan,
>     we don't currently do that.  The memory gets copied from R into member
>     variables in the Stan model.
>
>
> I did not know that. Was there a reason besides my inability to recognize the need to add DUP = FALSE to all the .C()
> and .Call() calls?

Only that Stan itself copies the data in its model constructor.
Nothing you can do about that from the outside.  We could potentially
rewrite it so that if the memory were local coming from R, we could
share the memory in Eigen using Map.  It'd be a pretty big change to
Stan and it wouldn't be possible with array (std::vector) types, because
they don't have constructors that don't copy the data (at least as far
as I know).

FYI: I got the following to work in a couple of minutes. It uses memory-mapped files

http://www.boost.org/doc/libs/1_55_0/doc/html/interprocess/sharedmemorybetweenprocesses.html#interprocess.sharedmemorybetweenprocesses.mapped_file
http://en.wikipedia.org/wiki/Memory-mapped_file

which ostensibly let the application work with files on disk that are bigger than the available RAM, with the OS managing which parts of the file are needed at a particular time.

#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <Eigen/Dense>
#include <iostream>

int main() {
  using namespace boost::interprocess;

  // map the whole file read-only; the OS pages it in as needed
  file_mapping m_file("test.bin", read_only);
  const std::size_t FileSize = 10000;  // too big but works (unused here)
  mapped_region region(m_file, read_only);

  // view the mapped bytes as a 10 x 3 Eigen matrix without copying
  double *mem = static_cast<double*>(region.get_address());
  Eigen::Map<Eigen::MatrixXd> m(mem, 10, 3);

  std::cout << m << std::endl;
  return 0;
}

With the attached file that has 30 doubles in binary format,

goodrich@CYBERPOWERPC:/tmp$ Rscript --quiet -e "matrix(readBin('test.bin', what = 'double', n = 30), 10, 3)"
            [,1]       [,2]        [,3]
 [1,]  1.6352946  2.1331513 -0.17195892
 [2,] -0.4512861  0.2377836  0.57613430
 [3,] -0.3510201  0.5198343  0.88528544
 [4,]  0.2336205 -1.3662959  1.12978856
 [5,] -0.8145016 -0.9678363 -1.79383955
 [6,] -0.4750529  1.2324999  0.47748876
 [7,]  1.8257707 -0.2309029 -0.07845102
 [8,]  0.1406174 -0.2587299 -0.96731231
 [9,]  1.3540141 -0.5267550 -0.12012563
[10,]  0.5181718 -1.0233718 -1.56548363
goodrich@CYBERPOWERPC:/tmp$ clang++ -I/usr/include -I/usr/include/eigen3 fm.cpp
goodrich@CYBERPOWERPC:/tmp$ ./a.out
  1.63529   2.13315 -0.171959
-0.451286  0.237784  0.576134
 -0.35102  0.519834  0.885285
 0.233621   -1.3663   1.12979
-0.814502 -0.967836  -1.79384
-0.475053    1.2325  0.477489
  1.82577 -0.230903 -0.078451
 0.140617  -0.25873 -0.967312
  1.35401 -0.526755 -0.120126
 0.518172  -1.02337  -1.56548


This might be a cool option to have someday. It is presumably slower in general, but faster than if your computer has to swap. And I guess lots of other applications can write data to binary files, rather than having to learn the text format Stan expects now.

Ben

[Attachment: test.bin]

Bob Carpenter

Dec 4, 2013, 6:57:59 PM
to stan...@googlegroups.com
I started out being very worried about both memory consumption and
I/O issues from data, but I now realize that Stan isn't fast enough to
deal with data sets that would give memory problems and that the I/O
time is dominated by sampling time. The auto-diff will kill us before lack
of memory sharing will.

Nifty. I didn't even know Boost had it. Usually you use
memory mapping for non-random access to files.

It'll be significantly slower (many orders of magnitude)
than in memory for random access, more so if the machine
has spinny disks instead of solid state.

But if you have multiple processes hitting the same memory-mapped
file, you get disk-head contention (again, the problem's there
but less so with SSDs because seek times are several orders of
magnitude faster) --- with a spinny disk it takes multiple milliseconds
(more than 5 last time I checked) for the first random access byte from a
file system.

If the data can be pipelined, it can be relatively efficient,
but then so is non-memory mapped I/O.

There's a great package called Vowpal Wabbit (an Elmer Fudd pronouncing
Monty Python referring to Lewis Carroll -- and you thought Andrew's
names were obscure), which does stochastic gradient descent on huge
files by streaming data from file as needed. We are going to have to
start thinking about these issues when our stochastic VB gets off
the ground. Matt's stochastic variational implementation of LDA got
rolled into Vowpal Wabbit, but I think VW is mostly used for linear/logistic
regression and CRF/MRF type problems. VW also does really cool things
with dimensionality reduction using hash-based feature encodings. Not
exactly the way social scientists want to roll!

- Bob

Ben Goodrich

Dec 4, 2013, 7:24:17 PM
to stan...@googlegroups.com
On Wednesday, December 4, 2013 6:57:59 PM UTC-5, Bob Carpenter wrote:
I started out being very worried about both memory consumption and
I/O issues from data, but I now realize that Stan isn't fast enough to
deal with data sets that would give memory problems and that the I/O
time is dominated by sampling time.  The auto-diff will kill us before lack
of memory sharing will.

But currently you can hit a RAM bound for a model with somewhat regular data requirements if you do a bunch of chains in parallel. I'm sure there would still be performance tradeoffs, but I doubt many Stan models are doing much random access (as opposed to matrix ops) if they are using Eigen types.

Ben

Bob Carpenter

Dec 4, 2013, 8:05:46 PM
to stan...@googlegroups.com
If we want to fit a simple GLM with a lot of data, the auto-diff memory size will
be small relative to the data size. In that case, if the predictors are kept
memory local (e.g., an array of Eigen vectors, or columns of an Eigen matrix),
something like this might make sense. For more complex models, auto-diff memory
requirements will be the main bottleneck.

I'm guessing it would still be faster to just run the chains one at a time
rather than in parallel with memory mapping to disk I/O.

- Bob
