cloud-stan primer


Devin Pastoor

Jul 29, 2015, 5:12:40 PM
to stan development mailing list
Hi all,

First, I wanted to introduce myself, as I'm new to the group. I'm a PhD student at the University of Maryland working in pharmaceuticals (highly aligned with Sebastian's work), and since my thesis and interests revolve around visualization, application development, and deployment, I'm going to take a crack at those areas.

Since this group will likely turn into the alpha/beta testers, it would be extremely helpful to me if people could chime in on the following:

a) current/desired dev environment (e.g. RStudio + ShinyStan vs. R REPL + Emacs vs. IPython notebook + PyStan, etc.)

b) project resource consumption - especially the upper range of RAM consumption and the range of model run times

c) use case - transient (a temp environment destroyed after logging off) vs. a day-to-day driver (maintaining results/projects/etc.)

d) if day-to-day use, what are the ideal sync scenarios (Dropbox vs. GitHub vs. direct upload/download, etc.)?

e) yes/no to the desire to ever self-host, and if so, the hosting platform (internal cluster vs. something like AWS)


Thanks, and I look forward to interacting more with everyone!

Devin

Bob Carpenter

Jul 29, 2015, 6:59:47 PM
to stan...@googlegroups.com

> On Jul 29, 2015, at 5:12 PM, Devin Pastoor <devin....@gmail.com> wrote:
>
> Hi all,
>
> First, I wanted to introduce myself as I'm new to the group. I'm a PhD student at the University of Maryland working in pharmaceuticals (highly aligned with Sebastian's work) and since my thesis/interests revolve around both visualization/application development and deployment I'm going to take a crack at those areas.
>
> Since this group will likely turn into alpha/beta testers a couple things that would be extremely helpful to me are if people could chime in on the following:
>
> a) current/desired dev environment (eg rstudio + shinystan vs r-repl+emacs vs ipython-notebook+pystan etc)

Dev for what? I do C++ almost exclusively in emacs and the shell
with a combination of make and Python scripting.


> b) project resource consumption - especially upper range of RAM consumption and range of model run times

Upper range for whom? We have people running models that take tens of
gigabytes and run for a week. But someone like Andrew moves to approximate
methods if something's going to take a week to run with full Bayes.

Building from source can take up to 8 GB of RAM due to all the
templating in the parser (stanc).

> c) use case - transient (temp environment to be destroyed after logging off) vs day-to-day driver (maintain results/projects/etc).

I don't understand the question. I don't log into any environments
(not counting my notebook, GitHub, and our web host for static web
pages, which just switched to GitHub).

> d) If day-to-day use, what are ideal sync scenarios (dropbox vs github vs direct upload/download, etc)

Synch for what? We use GitHub for everything Stan
development related. Those who are shy often work in their
own sandboxes, but most of us work on the stan-dev GitHub
account in various repos.

> e) yes/no to desire to ever self-host, and if so hosting platform (internal cluster vs something like AWS).

I don't know what "self-hosting" is, but we're not doing any
kind of hosting other than on GitHub.

We'd like to be able to deploy two things. The first is a
Jupyter-like environment; Allen's already nailed this, I think,
for Python, R, and Julia. We can't afford to host it for high
volume, but Allen found a spare machine at Dartmouth we can
use for demos. The second is a GUI that's not based on R,
Python, or Julia, just a simple demo-type web app. It's not
fully designed, as you pointed out to Andrew in previous mail.

- Bob

Devin Pastoor

Jul 29, 2015, 7:32:25 PM
to stan development mailing list, ca...@alias-i.com
Thanks Bob, good questions. Hopefully my responses below make the questions a little clearer.

On Wednesday, July 29, 2015 at 6:59:47 PM UTC-4, Bob Carpenter wrote:
> > On Jul 29, 2015, at 5:12 PM, Devin Pastoor <devin....@gmail.com> wrote:
> >
> > Hi all,
> >
> > First, I wanted to introduce myself as I'm new to the group. I'm a PhD student at the University of Maryland working in pharmaceuticals (highly aligned with Sebastian's work) and since my thesis/interests revolve around both visualization/application development and deployment I'm going to take a crack at those areas.
> >
> > Since this group will likely turn into alpha/beta testers a couple things that would be extremely helpful to me are if people could chime in on the following:
> >
> > a) current/desired dev environment (eg rstudio + shinystan vs r-repl+emacs vs ipython-notebook+pystan etc)
>
> Dev for what? I do C++ almost exclusively in emacs and the shell
> with a combination of make and Python scripting.
>
>

Dev for modeling/projects in Stan. The question concerns what should be available in the default environment(s). My thought is that people would have access to a couple of different pre-configured environments (plus shell access for customizations). As you mentioned below, this would initially be for simple models/teaching purposes.

> > b) project resource consumption - especially upper range of RAM consumption and range of model run times
>
> Upper range for whom? We have people running models that take tens of
> gigabytes and run for a week. But someone like Andrew moves to approximate
> methods if something's going to take a week to run with full Bayes.
>
> Building from source can take up to 8 GB due to all the templating
> in the parser (stanc).
>

I'm looking for what a "reasonable" cloud-stan resource allocation would be. Having Stan pre-compiled inside a Docker image should negate the need to allocate that kind of memory during use; however, I want to get an idea of the ranges of RAM people run into during day-to-day modeling. I appreciate that this will be highly variable, but I want to get a feel for the distribution if possible.

> > c) use case - transient (temp environment to be destroyed after logging off) vs day-to-day driver (maintain results/projects/etc).
>
> I don't understand the question. I don't log into any environments
> (not counting my notebook, GitHub, and our web host for static web
> pages, which just switched to GitHub).
>

Transient would be a one-off notebook environment. To Andrew's request in another thread, it might mean forking a model from GitHub, running it, and then being OK with the environment being destroyed after downloading the results. Another use case could be testing, where a model is run, the output collected, and then the environment destroyed and its resources re-allocated.

I don't know if that is much of a use case, or if everyone would 'always' want to be able to keep projects around long term.


> > d) If day-to-day use, what are ideal sync scenarios (dropbox vs github vs direct upload/download, etc)
>
> Synch for what? We use GitHub for everything Stan
> development related. Those who are shy often work in their
> own sandboxes, but most of us work on the stan-dev GitHub
> account in various repos.

Sync for modeling/results. E.g., after running a model and getting output, say in an IPython notebook, how would you (want to) back up said notebook/project?

> > e) yes/no to desire to ever self-host, and if so hosting platform (internal cluster vs something like AWS).
>
> I don't know what "self-hosting" is, but we're not doing any
> kind of hosting other than on GitHub.
>
> We'd like to be able to deploy two things --- a Jupyter-like
> environment --- Allen's already nailed this, I think, for
> Python, R, and Julia. We can't afford to host it for high
> volume, but Allen found a spare machine at Dartmouth we can
> use for demos. We'd also like to deploy a GUI that's not
> based on R, Python, or Julia --- just a simple demo-type web
> app. It's not fully designed, as you pointed out to Andrew
> in previous mail.
>
> - Bob

The question comes from both a testing and a documentation perspective: whether people would want to host it themselves (e.g., on the spare machine at Dartmouth), deploy onto AWS or a personal server to have their own personal sandbox, or only be interested in paying for compute resources while having everything managed. It comes down to the flexibility of the implementation. It sounds like for now we should keep it flexible and not tie it down to a specific environment/setup.

Regarding the GUI, I absolutely agree that it should be language agnostic; however, (I think) we can do some 'tricks' to provide hooks into the various implementations to make it easy for people to, say, do data manipulation/creation in their language of choice but manage the run/output in the GUI.


Let me know if I can clarify further!

Bob Carpenter

Jul 30, 2015, 11:47:49 AM
to stan...@googlegroups.com

> On Jul 29, 2015, at 7:32 PM, Devin Pastoor <devin....@gmail.com> wrote:
>
> Thanks Bob, good questions. Hopefully I made the questions a little more clear in my responses below.
>
> On Wednesday, July 29, 2015 at 6:59:47 PM UTC-4, Bob Carpenter wrote:
>>> On Jul 29, 2015, at 5:12 PM, Devin Pastoor <devin....@gmail.com> wrote:
>>>
>>> Hi all,
>>>
>>> First, I wanted to introduce myself as I'm new to the group. I'm a PhD student at the University of Maryland working in pharmaceuticals (highly aligned with Sebastian's work) and since my thesis/interests revolve around both visualization/application development and deployment I'm going to take a crack at those areas.
>>>
>>> Since this group will likely turn into alpha/beta testers a couple things that would be extremely helpful to me are if people could chime in on the following:
>>>
>>> a) current/desired dev environment (eg rstudio + shinystan vs r-repl+emacs vs ipython-notebook+pystan etc)
>>
>> Dev for what? I do C++ almost exclusively in emacs and the shell
>> with a combination of make and Python scripting.
>>
>>
>
> Dev for modeling/projects in stan - the question comes for what should be available in default environment(s). My thought would be that people would have access to a couple different pre-configured environments (plus shell access for customizations). As you mentioned below, this initially would be for simple models/teaching purposes.

I usually use CmdStan from the shell when I'm doing development
and testing of Stan models for the purposes of Stan development or
if they're big hairy things like the PK/PD models we do with Novartis.
Otherwise, for casual use or exploring models that don't take days to
run, I'll use RStan in the terminal. I ran Stan.jl when I had to
give a talk about it, but I'm not fluent in Julia. I sometimes use
knitr through the terminal to create short demos with text, but I hate
hate hate mixing code and text in one file, so my knitr documents are
bunches of R scripts I can run. Whenever I do R, I make it all scripted and
run from clean environments for testing. I have an equal amount of
animosity toward the whole REPL lifestyle other than for very simple
exploration.


>>> b) project resource consumption - especially upper range of RAM consumption and range of model run times
>>
>> Upper range for whom? We have people running models that take tens of
>> gigabytes and run for a week. But someone like Andrew moves to approximate
>> methods if something's going to take a week to run with full Bayes.
>>
>> Building from source can take up to 8 GB due to all the templating
>> in the parser (stanc).
>>
> Looking for what "reasonable" cloud-stan resource allocation would be. Having stan inside a docker image pre-compiled should negate the need during use to allocate that kind of memory, however want to get an idea of what ranges of RAM people run into during day-to-day modeling. I appreciate that this will be highly variable, but want to get a feel for the distribution if possible.

30 MB to 30 GB. I don't know what the distribution is.
Rarely do models need more than 1 GB to run.

Memory usage is very predictable. In CmdStan, it's the size to
store the data (with a very small bit of overhead for the
container structures, but not much) plus the size to store a
single set of parameters; everything else is streamed. R is much
less efficient, and I think (hope?) people are working on making
it tighter in memory. It keeps every draw in memory, and then
there's something like a 2x overhead on top of that, if I recall
correctly. So you get the people who have 2000 parameters and
believe they need 100K draws running into problems with the 200M
numbers they want to store, which can consume multiple GB in
RStan. Then multiply all this by four to run Andrew's recommended
four chains in parallel. That's why these 2000 x 100K x 8 (raw
storage) x 2 (R overhead) x 4 (chains) > 10 GB models are so hairy.
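Bob's arithmetic can be sketched in a few lines of Python (a back-of-envelope estimate only; the function name is mine, and the 2x R overhead factor is his rough recollection rather than a measured constant):

```python
# Back-of-envelope estimate of RStan memory for storing all draws.
# The 2x R overhead factor is a rough recollection, not a measured
# constant; treat the result as an order-of-magnitude sketch.

def rstan_memory_gb(n_params, n_draws, n_chains=4,
                    bytes_per_double=8, r_overhead=2.0):
    """Rough RAM (in GB) to hold every draw of every chain in R."""
    return (n_params * n_draws * bytes_per_double
            * r_overhead * n_chains / 1e9)

# The hairy case from the text: 2000 parameters x 100K draws x 4 chains.
print(rstan_memory_gb(2000, 100_000))  # 12.8 GB, past the 10 GB mark
```

Dropping the R overhead and chain multipliers recovers the much smaller per-process footprint of CmdStan, which streams draws to disk instead of holding them all in memory.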

>>> c) use case - transient (temp environment to be destroyed after logging off) vs day-to-day driver (maintain results/projects/etc).
>>
>> I don't understand the question. I don't log into any environments
>> (not counting my notebook, GitHub, and our web host for static web
>> pages, which just switched to GitHub).
>>
>
> transient would be a "one"-off notebook environment, so to Andrew's request in another thread, it may be forking a model from github, running it, then being ok with the environment being destroyed after downloading results. Another use case could be for testing, where a model could be run, the output collected, then the environment destroyed and resources re-allocated.
>
> I don't know if that is much of a use case, or if everyone would 'always' want to be able to keep projects around long term.

This is getting into differences in use case. If we want to build
a demo rather than a useful online GUI, no long-term storage at all.
If we want to be persistent, then we need user IDs, logins, ways to
identify and store models, maybe output (that'll get expensive), etc.
I believe this will send us down the rabbit hole. Then we'd need backup,
of course --- maybe some kind of GitHub tie-in or something?

Andrew really really wants that demo judging by how often he mentions
it. So a demo with no persistence in a month is preferable to waiting
a year for a system with persistence.

>>> d) If day-to-day use, what are ideal sync scenarios (dropbox vs github vs direct upload/download, etc)
>>
>> Synch for what? We use GitHub for everything Stan
>> development related. Those who are shy often work in their
>> own sandboxes, but most of us work on the stan-dev GitHub
>> account in various repos.
>
> Sync for modeling/results. Eg after running a model, and getting output, say in an ipython notebook. How would you (want to) backup said notebook/project.


See above w.r.t. backup. I don't want to do anything this way,
so I can't really anticipate what anyone else might want. Basically,
I don't like things in the way, which is why I like emacs and the terminal.

>>> e) yes/no to desire to ever self-host, and if so hosting platform (internal cluster vs something like AWS).
>>
>> I don't know what "self-hosting" is, but we're not doing any
>> kind of hosting other than on GitHub.
>>
>> We'd like to be able to deploy two things --- a Jupyter-like
>> environment --- Allen's already nailed this, I think, for
>> Python, R, and Julia. We can't afford to host it for high
>> volume, but Allen found a spare machine at Dartmouth we can
>> use for demos. We'd also like to deploy a GUI that's not
>> based on R, Python, or Julia --- just a simple demo-type web
>> app. It's not fully designed, as you pointed out to Andrew
>> in previous mail.
>>
>> - Bob
>
> The question comes from both a testing/documentation perspective. If people would want to host (eg on the spare machine in dartmouth) vs deploy onto AWS or personal server to have their own personal sandbox, vs only be interested in just paying for compute resources but having everything managed. It comes down to a flexibility of the implementation. Sounds like for now should keep it flexible and not tied down to a specific environment/setup.

That would be good, but one working instance to start is better
than waiting for something more flexible.

So maybe break this into two projects.

1. Quick demo.

2. Usable IDE on the web with persistence, etc. etc.


> Regarding the GUI, I absolutely agree that the GUI should be language agnostic, however (I think) we can do some 'tricks' to provide hooks to the various implementations to make it easy for people to say, do data manipulation/creation in their language of choice, but manage the run/output in the gui.

Throttling will be critical if we do that and host somewhere
other than on a user's machine.

- Bob

Andrew Gelman

Jul 30, 2015, 11:26:35 PM
to stan...@googlegroups.com
I learned probability and stochastic processes at the University of Maryland (many many years ago)!
A



Jeffrey Arnold

Jul 31, 2015, 12:26:40 AM
to stan...@googlegroups.com
If it helps, I've started experimenting with Docker. Here are a few of my Docker images: https://hub.docker.com/u/jrnold/.


Devin



Bob Carpenter

Jul 31, 2015, 4:52:46 PM
to stan...@googlegroups.com
Allen put up Docker versions for PyStan, RStan, and Stan.jl
through Jupyter. That may have only gone out to stan-core, which
you may not be on.

- Bob

Devin Pastoor

Aug 3, 2015, 10:36:29 AM
to stan development mailing list
Thanks Jeff, I saw those and figured I'd work from there.