On 5/17/13 1:18 PM, Steve wrote:
> Bob:
>
> Thanks - this is all helpful, although puzzling on some points.
>
> 1. Your times indicate that the matrix form is faster than the row_vectors form (like the results we posted yesterday),
> while your conclusion is that the row_vectors form speeds up the computation. Maybe I'm misinterpreting the
> proc.time output.
I mislabeled the results --- it was the other way around.
I reran below to make sure.
The array of vectors was faster than matrix.
> 2. The time we care about is "elapsed time" - time it takes to obtain the results. It is surprising that sometimes the
> ordering given by elapsed time and by system CPU time are different, and not just by small margins.
I think part of the system time may be time spent waiting
for access to things like file handles, which is going to be
affected by whatever else is going on in the system.
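For reference, here's a minimal R sketch of where those three numbers come from (the `replicate` call is just a stand-in for running a Stan fit):

```r
# Time a block of work and inspect the components proc.time() reports.
t0 <- proc.time()
x <- replicate(50, sum(rnorm(1e5)))  # stand-in workload
t1 <- proc.time() - t0

t1["user.self"]  # CPU time spent executing our own code
t1["sys.self"]   # CPU time spent in system calls made on our behalf
t1["elapsed"]    # wall-clock time --- the number a user actually waits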
> 3. The Rstan manual has now omitted changing the makefile to set flags to -O3, and I understood the reason was that
> using the R command set_cppo("fast") achieved the same optimization in C++ compilation. Does editing Makeconf
> optimize C++ compilation further than solely using set_cppo("fast")?
I'm still trying to work this out myself. The key is that
you want to compile the rstan binaries with -O3 optimization.
We'll fix the instructions over the next week to make this
clearer.
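In the meantime, a sketch of both routes (the Makeconf path and flag line assume a standard R installation; check `R.home("etc")` on your machine):

```r
# Route 1: from R, ask rstan to compile models at high optimization.
library(rstan)
set_cppo("fast")  # sets the C++ optimization mode rstan uses for models

# Route 2: edit R's own Makeconf, e.g. the file at
#   file.path(R.home("etc"), "Makeconf")
# and set the C++ flags line to use -O3, along the lines of:
#   CXXFLAGS = -O3
# This affects all compiled code R builds, including the rstan binaries.
```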
> Our timing from yesterday was based on R 2.15. We upgraded to R 3.0, and reinstalled STAN and Rstan. This is on a PC
> with 3.46GB Intel DUO CPU. The new timing is:
>
> user.self sys.self elapsed
> time1 105.50 3.53 109.26 real mat[R,C] big joint model of independent parameters
> time2 37.73 161.17 199.30 real vec[C] many individual models via looping in R
> time3 68.32 2.80 71.41 row_vector[C] mat[R] big joint model of independent parameters
>
> So this run exhibits the reduction in timing for row_vectors you originally suggested, about a 35% reduction.
Sorry -- I mislabeled the output, because I'd done them
in the opposite order. I just reran all the tests to be sure.
Original seed=444
user system elapsed
116.242 0.303 116.397 loop over rows of matrix
81.561 0.274 81.581 loop over array of row vectors
49.315 0.625 50.781 loop in R
With a different seed, we're actually faster than looping in
R:
New seed=9879087987 (seed = 444 still used in R for data)
user system elapsed
63.034 0.517 63.575 loop over rows of matrix
46.661 1.034 48.728 loop in R
44.313 0.188 44.525 loop over array of row vectors
The looping in R is more stably estimated because it's essentially
an average of 1000 independent trials, one for each row.
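For concreteness, the two Stan data layouts being compared can be sketched like this (the dimensions and the normal likelihood are placeholders, not the actual model):

```stan
data {
  int<lower=1> R;          // rows (independent replications)
  int<lower=1> C;          // columns
  matrix[R, C] y_mat;      // variant 1: one big matrix
  row_vector[C] y_arr[R];  // variant 2: array of row vectors
}
parameters {
  vector[C] mu;
}
model {
  // variant 1: y_mat[r] extracts a row from column-major storage (a copy)
  // for (r in 1:R)
  //   y_mat[r] ~ normal(mu', 1);

  // variant 2: each array element is already a contiguous row vector
  for (r in 1:R)
    y_arr[r] ~ normal(mu', 1);
}
```

The array-of-row-vectors form avoids the per-iteration row extraction from column-major matrix storage, which is where the speedup appears to come from.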
This also highlights how much variability there is in Stan's
implementation of NUTS and warmup adaptation. We're in the process
of figuring out how to quantify all the timing and variability issues
across a large number of models.
I think the user time here is measuring the work that's actually
being done by the code locally. System time's going to include time
to get memory or file handles from the system, and probably
process management time as well.
The other big issue here in profiling is that what we really care
about is effective sample size per second, not number of iterations.
In this case, the two models are equal, so we don't need to adjust
for that, because the samples and hence effective sample sizes should
be the same.
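As a sketch of that metric in R (this assumes a fitted rstan object `fit` and an `elapsed` time saved from `proc.time()`; the `n_eff` column name is from rstan's summary output):

```r
library(rstan)
# fit <- stan(...)            # fitted model, timed with proc.time()
# elapsed <- t1["elapsed"]    # wall-clock seconds for the fit

ess <- summary(fit)$summary[, "n_eff"]  # effective sample size per parameter
ess_per_sec <- min(ess) / elapsed       # conservative: slowest-mixing parameter
```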
> What is
> puzzling about this run is that the looping in R 3.0 now takes 199 for elapsed time instead of 37 in R 2.15.
I can imagine the system could get bogged down if there
are lots of open file handles or something --- I really don't
know what R does on the back end. You should try it again with
R as the only thing running and other processes shut down.
But this is also one of the reasons why it's tough to measure
software performance. What can work well under low system load
(whether for file handles, threads, processes, or whatever) might not
work well under high system load. In general, it's better to pool/batch
system requests as much as possible because they tend to be really
slow and scarce compared to executing code.
> Given these results, we plan on implementing the hierarchical model with the row_vector data structure.
That's what I'd recommend.
> Thanks for all your work on this issue. Looking forward to your discussion of it in the next version of the manual.
The next release is going to be a big one.
- Bob