Fwd: rstan vs cmd stan speed


Sebastian Weber

Aug 11, 2014, 4:19:59 PM
to stan...@googlegroups.com
Hi!

As suggested by Daniel, here is the thread I started:

To add to this: I think rstan could be sped up considerably if it knew about the declared data types in a model. Right now - at least from what I gather from the code - variable names (like arrays and their indices) get assembled from the strings in the CSV files. This step is somewhat time-consuming and certainly not needed if the layout of the data could be derived from the Stan model; i.e., a large array is processed much faster from a declaration in the model file than by parsing its individual elements out of the CSV.
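A minimal sketch of the idea (a hypothetical helper, not rstan's actual get_all_flatnames): with the declared size known from the model, the per-element column names can be generated directly rather than assembled by parsing CSV header strings:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not rstan's actual API): given a variable's base name
// and its declared length, produce the flat column names "y[1]", "y[2]", ...
// directly, with no string parsing of the CSV header required.
std::vector<std::string> flat_names(const std::string& base, int n) {
  std::vector<std::string> names;
  names.reserve(n);  // one allocation up front, since the size is known
  for (int i = 1; i <= n; ++i)
    names.push_back(base + "[" + std::to_string(i) + "]");
  return names;
}
```

With the layout fully determined by the model, the reader could even skip the names entirely and simply read n doubles per row.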

>
> Hi Bob!
>
> Sure, we can take it to the dev list. I am not sure that writing out files with Stan and then importing them is faster. At least this would require a rewrite of the read_stan_csv function (writing the CSV and importing it with read_stan_csv takes longer).
>
> From my fairly naive profiling with outputs to rstan::cout I got the impression that the extra time rstan takes in comparison to cmdstan comes from a couple of places. Quite surprisingly, the vector<vector> construct used to save the simulated values (common/recorder/values.hpp) does not appear to be the bottleneck; i.e., I replaced it with flat storage where I did the indexing myself, but this did not help.
>
> To properly solve the problem, this would need a bit of profiling of the Rcpp code - but here I am lost, as this would require R to be built in profiling mode as well, from what I read. Is it possible to run the rstan<->stan thing without R? Then one could easily run a gdb session on it.
>
> Best,
> Sebastian
>
> Am 10.08.2014 um 19:29 schrieb Bob Carpenter <ca...@alias-i.com>:
>
>> I keep forgetting to hit reply-all to these messages not
>> to our lists. Is there a reason we're doing this all off-list?
>>
>> Here's what I just replied to Sebastian:
>>
>> One thing we could do is set up a mode whereby RStan
>> writes output to file. Then the runs won't be as slow,
>> and we'd take the hit all at once in reading in the data
>> structure --- not sure if it'd save time overall, but
>> it should allow more scalability, because everything
>> won't be copied around in memory.
>>
>> - Bob
>>
>> On Aug 10, 2014, at 7:41 AM, Sebastian Weber <sdw....@gmail.com> wrote:
>>
>>> Hi!
>>>
>>> This week I tried to solve the speed hit of rstan in comparison to cmdstan when you simulate lots of variables, but I am afraid I am running out of ideas where to look (I did "cout"-profiling all over rstan but was not able to find out where the speed is effectively lost).
>>>
>>> Attached is an example which demonstrates that rstan is about 5x slower than cmdstan. I know R is slower, but it would be great if the performance hit were not that dramatic.
>>>
>>> Best,
>>>
>>> --
>>> Sebastian Weber
>>>
>>> <slow.Rout><slow.R><slow.stan>
>>
>
> --
> Sebastian Weber
>
>

--
Sebastian Weber


Bob Carpenter

Aug 11, 2014, 4:31:47 PM
to stan...@googlegroups.com
The RStan developers might correct me, but I'm pretty sure
RStan doesn't write to a CSV file and then read back in. If it
does, it shouldn't be doing that! It should be keeping everything
in memory.

The layout of the data may indeed be determined from an instantiated
model. This is what RStan uses to impose the structure on it you
get for the parameters in the extracted sample.

- Bob
> --
> You received this message because you are subscribed to the Google Groups "stan development mailing list" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to stan-dev+u...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>
>

Ben Goodrich

Aug 11, 2014, 11:30:50 PM
to stan...@googlegroups.com
On Monday, August 11, 2014 4:31:47 PM UTC-4, Bob Carpenter wrote:
The RStan developers might correct me, but I'm pretty sure
RStan doesn't write to a CSV file and then read back in.  If it
does, it shouldn't be doing that!  It should be keeping everything
in memory.

There is an option to write to a CSV file, but it is FALSE by default. I actually get a stack overflow when running the example:

SAMPLING FOR MODEL 'slow' NOW (CHAIN 1).
Error: segfault from C stack overflow
Error: C stack usage  140737242506772 is too close to the limit

which is probably not good.

Ben


Rob J. Goedman

Aug 11, 2014, 11:53:23 PM
to stan...@googlegroups.com
Hi Ben,

If this is Sebastian's 'slow' example with few rows (10 or so) and many columns (>>1000), I think R simply takes a very long time to create such a data.frame (and occasionally it will segfault).

Rob J. Goedman




Ben Goodrich

Aug 12, 2014, 12:36:45 AM
to stan...@googlegroups.com
On Monday, August 11, 2014 11:53:23 PM UTC-4, Rob J Goedman wrote:
If this is Sebastian's 'slow' example with few rows (10 or so) and many columns (>>1000), I think R simply takes a very long time to create such a data.frame (and occasionally it will segfault).

It is a slow example from Sebastian, but there is no data.frame involved. It is just sampling an iid normal in N dimensions where N = 10^6.

Ben

Sebastian Weber

Aug 13, 2014, 4:30:17 PM
to stan...@googlegroups.com
Hi!

The vector ends up being saved in rstan as a data.frame. However, I played with the code and translated the data.frame to a matrix (i.e. I changed the Stan C++ part), which is a lot more memory efficient (making matrices the default storage for samples could be considered for rstan 3.0). The result was that this was not the bottleneck; it appears to me that rstan uses a lot of extra resources in a number of places.
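For concreteness, the flat-storage experiment described above can be sketched like this (an illustration under assumed names, not the actual rstan patch): one contiguous buffer of n_iter x n_param draws with manual indexing, instead of a vector of per-iteration vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustration of the flat-storage experiment (not the actual rstan code):
// keep all draws in one contiguous buffer, row-major by iteration, and
// compute the offset by hand instead of using vector<vector<double>>.
struct FlatDraws {
  std::size_t n_param;
  std::vector<double> data;  // length n_iter * n_param

  FlatDraws(std::size_t n_iter, std::size_t n_param_)
      : n_param(n_param_), data(n_iter * n_param_, 0.0) {}

  // manual row-major indexing: draw `param` of iteration `iter`
  double& at(std::size_t iter, std::size_t param) {
    return data[iter * n_param + param];
  }
};
```

This avoids one heap allocation per iteration, though as noted above it did not turn out to be the bottleneck.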

Has someone managed to profile rstan C++ code already? This is not a straightforward exercise, from what I read.

Best,
Sebastian

Bob Carpenter

Aug 13, 2014, 6:45:54 PM
to stan...@googlegroups.com
It's pretty easy to profile just about anything by
attaching to its process. At least on the Mac.

I've been swamped after traveling 7 of the last 7.5
months.

- Bob

Ben Goodrich

Aug 13, 2014, 7:05:24 PM
to stan...@googlegroups.com
I have done some profiling (attached) with N = 10^5, but nothing really stands out. It seems as if the R process is what is taking a lot of time before and after the executable is called. But the annotated C++ file (also attached) from oprofile seems to imply that almost all the time was taken by "line 0", which is a get_all_flatnames thing.

Ben
report.txt
file5f2c21c234a4.cpp

Jiqiang Guo

Aug 13, 2014, 9:47:48 PM
to stan...@googlegroups.com
On Wed, Aug 13, 2014 at 7:05 PM, Ben Goodrich <goodri...@gmail.com> wrote:
I have done some profiling (attached) with N = 10^5, but nothing really stands out. It seems as if the R process is what is taking a lot of time before and after the executable is called. But the annotated C++ file (also attached) from oprofile seems to imply that almost all the time was taken by "line 0", which is a get_all_flatnames thing.

I guess this function might be written better. I tried calling this function in standalone code and it takes less than 1 second for N=1e6. As this function is only called once, I really don't think there is much motivation to work on it.


Ben


On Wednesday, August 13, 2014 4:30:17 PM UTC-4, Sebastian Weber wrote:
Hi!

The vector ends up in rstan as being saved as a data.frame. However, I played with the code and translated the data.frame to a matrix (i.e. I change the stan C++ part) which is a lot more memory efficient (making matrices as default storage for samples could be considered for rstan 3.0). The result was that this was not the bottleneck, i.e. rstan uses a lot of extra resources in a number of places as it appears to me.

Has someone managed to profile rstan C++ code already? This is not a straightforward exercises from what I read.

Best,
Sebastian


Ben Goodrich

Aug 14, 2014, 12:27:32 AM
to stan...@googlegroups.com
On Wednesday, August 13, 2014 9:47:48 PM UTC-4, Jiqiang Guo wrote:
On Wed, Aug 13, 2014 at 7:05 PM, Ben Goodrich <goodri...@gmail.com> wrote:
I have done some profiling (attached) with N = 10^5, but nothing really stands out. It seems as if the R process is what is taking a lot of time before and after the executable is called. But the annotated C++ file (also attached) from oprofile seems to imply that almost all the time was taken by "line 0", which is a get_all_flatnames thing.

I guess this function might be written better.  I tried call this function in a standalone code and it takes less than 1 second for N=1e6.  As this function is only called once, I really don't think there is motivation to work on it.   

Upon further review, i.e. doing the profiling on libR.so instead of the Stan .so

opreport -l /usr/lib/libR.so | head

it is dominated by RecursiveRelease

Overflow stats not available
CPU: Intel Architectural Perfmon, speed 900.703 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
7171237  99.8434  RecursiveRelease
5619      0.0782  R_gc_internal
1215      0.0169  R_ReleaseObject
1126      0.0157  TAG
546       0.0076  Rf_install
367       0.0051  Rf_mkCharLenCE
149       0.0021  Rf_protect

Still trying to figure out what that means.

Ben

Ben Goodrich

Aug 14, 2014, 11:20:52 AM
to stan...@googlegroups.com
On Wednesday, August 13, 2014 7:05:24 PM UTC-4, Ben Goodrich wrote:
I have done some profiling (attached) with N = 10^5, but nothing really stands out.

I don't think it is the main cause of the slowdown, but as a side note, it is sort of absurd to be taking half as much time to evaluate check_not_nan as we do to evaluate the gradient:

CPU: Intel Architectural Perfmon, speed 900.703 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask
samples  %        symbol name
1730     17.8719  void stan::agrad::gradient<stan::model::model_functional<model
871       8.9979  bool stan::math::check_not_nan<Eigen::Matrix<stan::agrad::var,
497       5.1343  rstan::sample_recorder_factory(std::ostream*, std::string, uns
468       4.8347  std::vector<std::string, std::allocator<std::string> >::_M_ins
394       4.0702  stan::common::recorder::filtered_values<Rcpp::Vector<14, Rcpp:
392       4.0496  _ZN4stan4prob10normal_logILb1EN5Eigen6MatrixINS_5agrad3varELin
358       3.6983  void rstan::(anonymous namespace)::get_all_flatnames<std::vect

If NaN input could result in a bogus calculation, then that function has to check for NaN and maybe throw exceptions or whatever. But in this case, it is just checking for NaNs inside normal_log(), in which case the worst thing that could happen is that lp__ goes NaN, which is reasonable behavior. If the only thing that 9% gets us is an occasional exception message that says "something is nan and must be not nan", then we have got to implement a mechanism that lets the user turn the checking off.

If we were to switch to C++11, then there would be a standard mechanism to check whether any calculations resulted in a NaN since the last time you checked. So, if one of the proposed parameters goes NaN, then we would know not to bother with that proposal, and we wouldn't need to check for NaN inside the probability functions, unless we reserve NaN to indicate a missing value.
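A hedged sketch of that C++11 mechanism (an assumption about how <cfenv> could be used, not Stan code): clear the FE_INVALID flag before a block of arithmetic, then test it once afterwards to learn whether any operation produced a NaN, instead of checking every intermediate value.

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>

// Sketch of the C++11 <cfenv> idea (assumed usage, not Stan's implementation):
// FE_INVALID is raised by the hardware whenever an operation generates a NaN,
// so one flag test after a block of math replaces per-value NaN checks.
bool produced_nan(double (*compute)(double), double x) {
  std::feclearexcept(FE_INVALID);       // reset the "invalid operation" flag
  volatile double result = compute(x);  // volatile: keep the call at runtime
  (void)result;
  return std::fetestexcept(FE_INVALID) != 0;  // raised iff a NaN was generated
}

double safe_term(double y)   { return -0.5 * y * y; }  // ordinary log-density term
double unsafe_term(double y) { return std::sqrt(y); }  // raises FE_INVALID for y < 0
```

Note that compilers only promise well-defined <cfenv> behavior with FENV_ACCESS enabled, so this is a sketch of the idea rather than a drop-in patch.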

Ben

Daniel Lee

Aug 14, 2014, 11:31:38 AM
to stan...@googlegroups.com
Ben, the last time this issue came up, I profiled the code in a different way. Can you take the absolute time it takes to run the model with the current code (with seeds fixed, ignoring compilation time, etc.), then comment out that error-checking line and rerun?

If I remember correctly, profilers showed that check taking up 10% of the time, but in reality there was no noticeable difference in absolute runtime. I did that last time through CmdStan, where I could just compile separate executables. I think I was the only one to do this, so if you're able to either confirm what the profiler is seeing or what I've seen using optimized code, I'd be interested.


Daniel




Ben Goodrich

Aug 14, 2014, 11:55:12 AM
to stan...@googlegroups.com
My computer can't compile anything with a main() function at the moment. But this gets back to the problem Sebastian is trying to figure out: the absolute time is 507.281 seconds for 10 iterations, of which only 12.4178 seconds is spent in the Stan executable. So 9% of that would be negligible in wall time.

I can say from rstan that changing the model block from y ~ normal(0,1) to increment_log_prob(-0.5 * dot_self(y)) reduces the Stan time to 1.1225 seconds, but that also skips the subtracting zero and dividing by one steps in normal_log_prob plus some other stuff in normal_log().
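The redundant arithmetic is easy to check: for mu = 0 and sigma = 1, the general kernel ((y - mu)/sigma)^2 reduces exactly to y^2 in IEEE arithmetic, so the shortcut loses nothing (a sketch, not Stan's implementation):

```cpp
#include <cassert>
#include <cmath>

// Sketch (not Stan's implementation): the unnormalized standard-normal
// log-density kernel computed the general way (subtract mu, divide by sigma)
// versus the -0.5 * y^2 shortcut. For mu = 0, sigma = 1 the two are
// bit-for-bit identical, so the general path does strictly redundant work.
double kernel_general(double y, double mu, double sigma) {
  double z = (y - mu) / sigma;
  return -0.5 * z * z;
}

double kernel_shortcut(double y) {
  return -0.5 * y * y;
}
```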

Ben

Daniel Lee

Aug 14, 2014, 12:15:21 PM
to stan...@googlegroups.com
If I read that correctly:
- y ~ normal(0, 1) takes 12.4178 seconds
- increment_log_prob(-0.5 * dot_self(y)) takes 1.1225 seconds

Really??? That's not good.




On Thu, Aug 14, 2014 at 11:55 AM, Ben Goodrich <goodri...@gmail.com> wrote:
My computer can't compile anything with a main() function at the moment.

??? In general? (how did you get there? how do I make sure I never get there?)

 
But this gets back to the problem Sebastian is trying to figure out: The absolute time is 507.281 seconds for 10 iterations, of which only 12.4178 is spent in the Stan executable. So, 9% of that would be negligible in wall time.

That's still a second, but I guess a second relative to 500 is too little to matter.


 

I can say from rstan that changing the model block from y ~ normal(0,1) to increment_log_prob(-0.5 * dot_self(y)) reduces the Stan time to 1.1225 seconds, but that also skips the subtracting zero and dividing by one steps in normal_log_prob plus some other stuff in normal_log().


Once again, really???

Bob Carpenter

Aug 14, 2014, 11:57:27 PM
to stan...@googlegroups.com
I used the following two models:

norm1.stan
-------------------
parameters {
  vector[1000] y;
}
model {
  y ~ normal(0,1);
}

norm2.stan
---------------------
parameters {
  vector[1000] y;
}
model {
  increment_log_prob(-0.5 * dot_self(y));
}


and fit them in RStan with the same seed using 2000 iterations per
chain, with total chain times reported as:

norm1.stan: 1.67, 1.79, 1.68, 1.67
norm2.stan: 1.38, 1.36, 1.37, 1.49

Ben --- did you have profiling turned on? That can dramatically
change real timings. The above were just run straight
up in RStan using

fit1 <- stan("norm1.stan",seed=12345)

- Bob

Ben Goodrich

Aug 15, 2014, 10:02:04 AM
to stan...@googlegroups.com
On Thursday, August 14, 2014 11:57:27 PM UTC-4, Bob Carpenter wrote:
Ben --- did you have profiling turned on?  That can dramatically
change real timings.  

I don't think so, but the timings are really variable when there are a million parameters. Sometimes it stack-overflows, sometimes it takes a lot longer; it depends on whether you do vector[N] y; or real y[N]; etc. So I think increment_log_prob(-0.5 * dot_self(y)) is faster, but not as dramatically faster as the one set of timings I reported (which was just 1 chain, 10 iterations, no warmup).

Ben