The RStan developers might correct me, but I'm pretty sure
RStan doesn't write to a CSV file and then read back in. If it
does, it shouldn't be doing that! It should be keeping everything
in memory.
SAMPLING FOR MODEL 'slow' NOW (CHAIN 1).
Error: segfault from C stack overflow
Error: C stack usage 140737242506772 is too close to the limit
If this is Sebastian's 'slow' example with few rows (10 or so) and many columns (>>1000), I think R simply takes a very long time to create such a data.frame (and occasionally it will segfault).
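As a rough illustration of that (the dimensions here are invented, not the actual 'slow' example, and this is not how rstan builds its storage), even the mildest way of getting a wide data.frame is noticeably slower than building the matrix it came from:

n_iter <- 10                                  # a handful of rows, as in the 'slow' example
n_par  <- 50000                               # very many columns
x <- rnorm(n_iter * n_par)
system.time(m  <- matrix(x, nrow = n_iter))   # essentially instant
system.time(df <- as.data.frame(m))           # noticeably slower; the gap grows with n_par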
The vector ends up being saved in rstan as a data.frame. However, I played with the code and translated the data.frame to a matrix (i.e., I changed the Stan C++ part), which is a lot more memory efficient (making matrices the default storage for samples could be considered for rstan 3.0). The result was that this was not the bottleneck; it appears to me that rstan uses a lot of extra resources in a number of places.
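(A minimal sketch of the memory side of this, with invented dimensions; it is not the rstan storage code itself:)

n_iter <- 10
n_par  <- 50000
m  <- matrix(0, nrow = n_iter, ncol = n_par)
df <- as.data.frame(m)                  # one separately allocated vector, plus a name, per column
print(object.size(m),  units = "Mb")    # roughly n_iter * n_par * 8 bytes of numeric data
print(object.size(df), units = "Mb")    # noticeably more; the extra is per-column headers and names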
Has someone managed to profile the rstan C++ code already? This is not a straightforward exercise, from what I have read.
Best,
Sebastian
I have done some profiling (attached) with N = 10^5, but nothing really stands out. It seems as if the R process is what is taking a lot of time before and after the executable is called. But the annotated C++ file (also attached) from oprofile seems to imply that almost all the time was taken by "line 0", which is a get_all_flatnames thing.
Ben
On Wed, Aug 13, 2014 at 7:05 PM, Ben Goodrich <goodri...@gmail.com> wrote:
But the annotated C++ file (also attached) from oprofile seems to imply that almost all the time was taken by "line 0", which is a get_all_flatnames thing.

I guess this function might be written better. I tried calling this function in standalone code, and it takes less than 1 second for N=1e6. As this function is only called once, I really don't think there is much motivation to work on it.
opreport -l /usr/lib/libR.so | head
Overflow stats not available
CPU: Intel Architectural Perfmon, speed 900.703 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples % symbol name
7171237 99.8434 RecursiveRelease
5619 0.0782 R_gc_internal
1215 0.0169 R_ReleaseObject
1126 0.0157 TAG
546 0.0076 Rf_install
367 0.0051 Rf_mkCharLenCE
149 0.0021 Rf_protect
I have done some profiling (attached) with N = 10^5, but nothing really stands out.
CPU: Intel Architectural Perfmon, speed 900.703 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask
samples % symbol name
1730 17.8719 void stan::agrad::gradient<stan::model::model_functional<model
871 8.9979 bool stan::math::check_not_nan<Eigen::Matrix<stan::agrad::var,
497 5.1343 rstan::sample_recorder_factory(std::ostream*, std::string, uns
468 4.8347 std::vector<std::string, std::allocator<std::string> >::_M_ins
394 4.0702 stan::common::recorder::filtered_values<Rcpp::Vector<14, Rcpp:
392 4.0496 _ZN4stan4prob10normal_logILb1EN5Eigen6MatrixINS_5agrad3varELin
358 3.6983 void rstan::(anonymous namespace)::get_all_flatnames<std::vect
My computer can't compile anything with a main() function at the moment.
But this gets back to the problem Sebastian is trying to figure out: the absolute time is 507.281 seconds for 10 iterations, of which only 12.4178 seconds are spent in the Stan executable. So, 9% of that would be negligible in terms of overall wall time.
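Spelling out that arithmetic with the numbers above (reading the 9% as a share of the executable's own profile):

total_wall <- 507.281            # seconds for 10 iterations (reported above)
stan_exec  <- 12.4178            # seconds inside the Stan executable (reported above)
stan_exec / total_wall           # ~0.024, i.e. only ~2.4% of wall time is in the executable
0.09 * stan_exec                 # ~1.12 s saved even if that 9% vanished entirely...
0.09 * stan_exec / total_wall    # ...which is roughly 0.2% of total wall time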
I can say from rstan that changing the model block from y ~ normal(0,1) to increment_log_prob(-0.5 * dot_self(y)) reduces the Stan time to 1.1225 seconds, but that also skips the subtracting-zero and dividing-by-one steps in the normal log probability, plus some other work in normal_log().
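For anyone wanting to reproduce that comparison, here is a sketch of the two variants. Only the two model-block lines come from the discussion; the data/parameters blocks and the calls below are guesses about the 'slow' example:

library(rstan)

# Variant 1: the sampling-statement version
code_normal <- "
data { int<lower=1> N; }
parameters { vector[N] y; }
model { y ~ normal(0, 1); }
"

# Variant 2: the hand-rolled log density
code_dot_self <- "
data { int<lower=1> N; }
parameters { vector[N] y; }
model { increment_log_prob(-0.5 * dot_self(y)); }
"

# e.g., mirroring the 10-iteration timing above:
# fit1 <- stan(model_code = code_normal,   data = list(N = 1e5), iter = 10, chains = 1)
# fit2 <- stan(model_code = code_dot_self, data = list(N = 1e5), iter = 10, chains = 1)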
Ben --- did you have profiling turned on? That can dramatically
change real timings.