RStan memory usage?


Bob Carpenter

Nov 27, 2015, 5:36:15 PM
to stan...@googlegroups.com
[moved to stan-dev; removed cc list]

If I have N parameters and M posterior draws,
the amount of memory required to store the draws in
R should be roughly (N * M * 8) bytes + overhead.

Does anyone know what the overhead is in RStan as a
function of N and M? And if the total is much more
than (N * M * 8), which it seems to be if I understand
user reports correctly, where is the extra memory
being used?

My understanding is that it's possible to generate the
draws, store to a CSV file, then read them back into R
using less memory than would be required to store them
in R in the first place. Is that right?
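
(A quick sanity check of that arithmetic in plain R, nothing RStan-specific;
the N and M values here are just for illustration:)

N <- 10000    # parameters
M <- 8000     # saved draws, e.g. 4 chains * 2000 iterations (warmup included)
N * M * 8 / 2^20                  # ~610 MiB of raw doubles (640 MB in decimal units)
print(object.size(matrix(0, M, N)), units = "Mb")    # ~610.4 Mb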

- Bob

> On Nov 27, 2015, at 12:55 PM, Jiqiang Guo <guo...@gmail.com> wrote:
>
> If one really doesn't want to save the draws into R's memory, one can set the argument pars=character(0) and specify the argument sample_file to save the draws into external files.
>
> Jiqiang
>
> On Thu, Nov 26, 2015 at 8:58 PM, Andrew Gelman <gel...@stat.columbia.edu> wrote:
> cc-ing Ben, Jonah, and Jiqiang, since the issue is R’s memory hogging. Also cc-ing Hadley in case he has any thoughts on this.
> A
>
> > On Nov 26, 2015, at 11:13 AM, Bob Carpenter <ca...@alias-i.com> wrote:
> >
> > These queries can go to our users list.
> >
> > Yes, CmdStan is very economical with memory compared
> > to other interfaces. It produces CSV file outputs.
> >
> > We really need to fix the memory hogging issues in
> > R, but I don't know the first thing about it.
> >
> > - Bob
> >
> >
> >> On Nov 26, 2015, at 4:15 AM, Grant, Robert L <Robert...@sgul.kingston.ac.uk> wrote:
> >>
> >> I now have the scaling-up run on Stata 14.1 and rstan (in Rgui). JAGS is next but here's a question for you all. Stata has the same (or similar) memory issue as before, bombing out at i=20, p=5000. I think I'll scale down to p=100 to get at least three points to suggest a trend (currently aiming for 500, 1000, 5000, 10000). And rstan uses more than my puny PC will allow at that size too, when it tries the hrasch. Unfortunately that's the only computer that has both Stata and the freedom to compile for Stan. Such are the challenges enjoyed by isolated statisticians. So I was wondering about trying StataStan as it's just a CmdStan wrapper. I suspect CmdStan uses less memory by writing results out to text files as it goes, but would be interested to hear your thoughts. I can re-run that and JAGS straight away to get this paper going.
> >>
> >> And happy thanksgiving to y'all!
> >>
> >> Robert Grant
> >> Senior Lecturer in health & social care statistics
> >> Kingston University & St George's, University of London
> >> www.robertgrantstats.co.uk
>
>
>
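
Spelled out, Jiqiang's suggestion above would be something along these lines
(untested sketch; the sample_file name is just a placeholder, and as far as I
know rstan appends a chain number to it when there are multiple chains):

library(rstan)
model_code <- "parameters { vector[10000] a; } model { a ~ normal(0, 1); }"
fit <- stan(model_code = model_code,
            pars = character(0),          # keep no draws in the returned object
            sample_file = "draws.csv")    # stream the draws to external CSV files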

Ben Goodrich

Nov 27, 2015, 6:52:01 PM
to stan development mailing list
On Friday, November 27, 2015 at 5:36:15 PM UTC-5, Bob Carpenter wrote:
My understanding is that it's possible to generate the
draws, store to a CSV file, then read them back into R
using less memory than would be required to store them
in R in the first place.  Is that right?

I don't think so. The known areas where RStan uses more memory than CmdStan are
  1. RStan stores the warmup whereas CmdStan by default does not
  2. RStan stores the draws as a nested list
  3. In RStan, the chains are typically run in parallel, which entails extra copies of data, transformed data, etc.

When you read them from a csv file, it creates a nested list out of the output, including warmup if the csv file was generated by rstan. So you have the same memory, plus the memory required to read each line of the csv file as a character string, one character at a time (and then convert it into doubles).
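
(For reference, reading the files back is roughly this; the file names are
made up, and read_stan_csv rebuilds a full stanfit object in memory:)

library(rstan)
csv_files <- paste0("draws_", 1:4, ".csv")    # hypothetical per-chain csv files
fit <- read_stan_csv(csv_files)               # parses the csv and builds the nested list
print(object.size(fit), units = "Mb")         # ends up no smaller than keeping the draws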


It might be the case that there is an extra copy when the sampler finishes but before it hands control back to R. But people having memory issues seem to crash before it gets to that point.


Ben

Bob Carpenter

Nov 27, 2015, 10:14:49 PM
to stan...@googlegroups.com
I fired up the R terminal (not RStudio, not the R GUI).

With a trivial model:

parameters { vector[10000] a; } model { a ~ normal(0, 1); }

and default call:

> fit <- stan(model_code=model_code);

Storing the default 2000 iterations * 4 chains should involve

10K parameters
* 2K iterations/chain
* 4 chains
* 8 bytes/parameter
= 640MB

The system Activity Monitor on my Mac says R is using 775 MB.
And here's the result of calling gc() after everything's done:

> gc()
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   492180  26.3     818163  43.7   712782  38.1
Vcells 81187641 619.5  104938442 800.7 90431885 690.0
                                 ^^^^^

You can see the GC trigger is at 800 MB. Could it be an
issue that it's set so high? Or is it sensitive to the
amount of system memory available?

So I have no idea what's going on. If that's all for storing
the fit object, that's only 10% overhead, which isn't much.
There was no spike on completion I could see, by the way.
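
(Side note: the gc() numbers do line up with the estimate once you convert
units; Vcells are 8 bytes each according to ?gc, and gc()'s "Mb" column is
really mebibytes:)

81187641 * 8 / 2^20    # ~619.4 MiB: the 619.5 "used (Mb)" shown above (gc rounds up)
81187641 * 8 / 1e6     # ~649.5 MB in decimal units, vs. the 640 MB estimate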

But then, if I set fit <- 0, and call gc(), I get:

> fit <- 0
> gc()
         used (Mb) gc trigger  (Mb) max used  (Mb)
Ncells 441810 23.6     818163  43.7   712782  38.1
Vcells 868437  6.7   83950753 640.5 90431885 690.0
              ^^^^            ^^^^^

So that all looks hunky-dory. But the GC trigger's still
huge. Then if I run gc() 20 more times, I get this:

> gc()
         used (Mb) gc trigger (Mb) max used  (Mb)
Ncells 441585 23.6     818163 43.7   712782  38.1
Vcells 691899  5.3    2363001 18.1 90431885 690.0
                              ^^^^

I also tried a more nested data structure, with exactly the
same parameters in a different shape:

parameters { matrix[10, 10] a[10, 10]; }
model { for (m in 1:10) for (n in 1:10) to_vector(a[m,n]) ~ normal(0, 1); }

Almost identical behavior, so it's not the nesting of the
parameters that's taking up the space.

Oddly, if I keep calling gc() after I fit a model, the "gc trigger"
just keeps growing up to about 50% more than the memory being used.

- Bob

Jiqiang Guo

Nov 27, 2015, 11:05:05 PM
to stan...@googlegroups.com
What is "gc trigger"?  Should we just look at column used from gc()?  

Jiqiang 

Ben Goodrich

Nov 27, 2015, 11:06:52 PM
to stan development mailing list
On Friday, November 27, 2015 at 10:14:49 PM UTC-5, Bob Carpenter wrote:
You can see the GC trigger is at 800 MB.  Could it be an
issue that it's set so high?  Or is it sensitive to the
amount of system memory available?

It is sort of explained, along with configuration options, under help(Memory).
 
So I have no idea what's going on.  If that's all for storing
the fit object, that's only 10% overhead, which isn't much.
There was no spike on completion I could see, by the way.

We've never seen anything consistent with what users sporadically experience.
You can enable gcinfo(TRUE) before sampling and see when it garbage collects.
After running your example, I see
Garbage collection 54 = 23+6+25 (level 2) ...
60.1 Mbytes of cons cells used (64%)
643.5 Mbytes of vectors used (70%)

           used  (Mb) gc trigger  (Mb)  max used  (Mb)
Ncells  1124993  60.1    1770749  94.6   1442291  77.1
Vcells 84336928 643.5  120666218 920.7 100488229 766.7
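
(In case anyone wants to reproduce this, the incantation is roughly:)

library(rstan)
gcinfo(TRUE)      # print a message at every garbage collection
fit <- stan(model_code = model_code)   # model_code as in Bob's example above
gcinfo(FALSE)     # back to quiet
gc()              # final summary like the one above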

Ben

Bob Carpenter

Nov 28, 2015, 11:04:18 AM
to stan...@googlegroups.com

> On Nov 27, 2015, at 11:06 PM, Ben Goodrich <goodri...@gmail.com> wrote:
>
> ...

> We've never seen anything consistent with what users sporadically experience.

I didn't realize you couldn't reproduce the problem!
In that case, the burden's on them to give us a reproducible
error case. I feel much better about R!

- Bob