In examining the source, I see that steady_clock::now() reads the performance
counter and then multiplies it by the reciprocal of the counter's frequency
(acquired on first use of the clock) to return the time in nanoseconds. How
expensive is this 64-bit multiply on a 32-bit system? How hard would it be
to have now() return a time_point that stores the raw tick value and only
converts it to nanoseconds when a calculation in real units is required?
(Comparison to another time_point could still be done in performance
counter ticks, saving the multiply.) Is it worth the extra complexity to
demand-convert ticks to nanoseconds?
Source is in boost_1_48_0\boost\chrono\detail\inlined\win\chrono.cpp
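The pattern in question boils down to roughly the following (a paraphrase of
the code described above, not the verbatim chrono.cpp source):

    // Rough shape of steady_clock::now() on Windows:
    LARGE_INTEGER count;
    QueryPerformanceCounter(&count);
    // nanosecs_per_tic was computed once as 1e9 / QueryPerformanceFrequency()
    return steady_clock::time_point(steady_clock::duration(
        static_cast<steady_clock::rep>(nanosecs_per_tic * count.QuadPart)));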
For 1000000 iterations I get the following figures:

debug:
  system_clock              107995476 nanoseconds   changes: 137173
  steady_clock               27631675 nanoseconds   changes: 999999
  high_resolution_clock      27683032 nanoseconds   changes: 999999
  process_real_cpu_clock    640388694 nanoseconds   changes: 560828
  process_user_cpu_clock   1090746878 nanoseconds   changes: 50
  process_system_cpu_clock 1083428100 nanoseconds   changes: 62
  process_cpu_clock        1093743446 nanoseconds   changes: 238

release:
  system_clock               44870075 nanoseconds   changes: 46173
  steady_clock               23371245 nanoseconds   changes: 999999
  high_resolution_clock      23554410 nanoseconds   changes: 999999
  process_real_cpu_clock    619828025 nanoseconds   changes: 531914
  process_user_cpu_clock   1079721262 nanoseconds   changes: 45
  process_system_cpu_clock 1081643605 nanoseconds   changes: 63
  process_cpu_clock        1077525848 nanoseconds   changes: 217
That is ~24 ns per call for steady_clock::now() (23371245 ns / 1000000 calls in release).
> How hard would it be
> to have now() return a time_point that stores the raw tick value and only
> converts it to nanoseconds when a calculation in real units is required?
Ummm, we can store whatever we want in
steady_clock::time_point::duration::rep, but duration::period must be
known at compile time, and on Windows it is run-time information.
So no, we cannot store the raw result of the performance counter and
convert it when required without making it a special case that would
break the whole chrono architecture.
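To make the constraint concrete (a minimal illustration; the typedef is an
example, not the actual Boost source):

    #include <boost/chrono.hpp>
    #include <boost/ratio.hpp>

    // duration's period is a compile-time ratio, for example:
    typedef boost::chrono::duration<long long, boost::ratio<1, 1000000000> > ns;
    // The counter's ticks-per-second comes from QueryPerformanceFrequency()
    // at run time, so there is no way to spell a ratio<1, freq> type for it
    // at compile time.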
>
> (Comparison to another time_point could still be done in performance
> counter ticks, saving the multiply.) Is it worth the extra complexity to
> demand-convert ticks to nanoseconds?
I see the advantage; I would be willing to consider it if the performance
results are really catastrophic for you.
Best,
Vicente
I've been thinking about the cost induced by retrieving the period for
steady_clock and the other clocks at run time.
If I'm not wrong, the fact that this period (the processor frequency) can
change at run time has the consequence that the measured time might be
neither monotonic, nor steady, nor exact. Maybe others with more knowledge
could confirm my suspicion or, on the contrary, explain why this is not a
real problem.
When using these clocks to measure performance, it is clear that what
we want to measure is the number of ticks, independently of any
processor frequency change. In addition, as you suggest, the conversion
to a time_point or a duration could be delayed until the unit is actually
needed. Unfortunately the duration class cannot be used for this purpose.
Maybe Boost could have a run-time duration class that stores the number
of ticks and that is able to convert to a duration using a discrete
sample of the period at the time of the conversion (see the sketch below).
In this way we could hope that the time spent doing performance
measurements might be halved.
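In miniature, such a class might look like this (a sketch assuming we only
target Windows; rt_duration is a hypothetical name, and
QueryPerformanceFrequency() is what the implementation already reads):

    #include <windows.h>
    #include <boost/chrono.hpp>

    // Hypothetical run-time-period duration: stores raw performance-counter
    // ticks; the period is sampled only when a conversion is requested.
    class rt_duration
    {
    public:
        explicit rt_duration(long long ticks) : ticks_(ticks) {}
        long long count() const { return ticks_; }

        boost::chrono::nanoseconds to_nanoseconds() const
        {
            LARGE_INTEGER freq;
            QueryPerformanceFrequency(&freq); // ticks per second, sampled now
            return boost::chrono::nanoseconds(static_cast<long long>(
                ticks_ * (1000000000.0 / freq.QuadPart)));
        }

    private:
        long long ticks_;
    };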
Waiting for some comments from the Boost community,
Vicente
P.S. Beman has added a new version of his Boost.Timer library that might
help you.
> Glad to hear you are considering it. Have you any measurements that let you
> think there is a serious overhead? How do you use the ::now() function
> to make your performance measurements?
Thanks for the feedback. A typical use case:
const steady_clock::time_point startTime = steady_clock::now();
// stuff to measure here
const steady_clock::time_point endTime = steady_clock::now();
const steady_clock::duration elapsed = endTime - startTime;
Ideally the conversion from ticks to nanoseconds would be done by the
subtraction operator.
Another use case, waiting a precise amount of time:
void wait(const steady_clock::duration timeToWait)
{
    const steady_clock::time_point endTime = steady_clock::now() + timeToWait;
    while (steady_clock::now() < endTime)
        ; // spin until the deadline passes
}
Here, the conversion from nanoseconds to ticks would be done by the
addition operator, so that the time_point comparison would be in ticks.
I'd use this where I need to wait a small amount of time (much less than a
scheduler tick) for a hardware device to respond, before giving up my time
slice.
> I have recently added a performance test in
> https://svn.boost.org/svn/boost/branches/release/libs/chrono/perf/store_now_in_vector.cpp
> I have no access to a Windows platform at the moment, but could you run it
> on your system?
Grepping for "steady" from the output files:
boost_1_48_0\bin.v2\libs\chrono\perf\store_now_in_vector_header.test\msvc-10.0\debug\asynch-exceptions-on\threading-multi\store_now_in_vector_header.output: steady_clock 649713231 nanoseconds
boost_1_48_0\bin.v2\libs\chrono\perf\store_now_in_vector_header.test\msvc-10.0\release\asynch-exceptions-on\threading-multi\store_now_in_vector_header.output: steady_clock 697898058 nanoseconds
boost_1_48_0\bin.v2\libs\chrono\perf\store_now_in_vector_shared.test\msvc-10.0\debug\asynch-exceptions-on\threading-multi\store_now_in_vector_shared.output: steady_clock 645896685 nanoseconds
boost_1_48_0\bin.v2\libs\chrono\perf\store_now_in_vector_shared.test\msvc-10.0\release\asynch-exceptions-on\threading-multi\store_now_in_vector_shared.output: steady_clock 659919417 nanoseconds
boost_1_48_0\bin.v2\libs\chrono\perf\store_now_in_vector_static.test\msvc-10.0\debug\asynch-exceptions-on\threading-multi\store_now_in_vector_static.output: steady_clock 646621148 nanoseconds
boost_1_48_0\bin.v2\libs\chrono\perf\store_now_in_vector_static.test\msvc-10.0\release\asynch-exceptions-on\threading-multi\store_now_in_vector_static.output: steady_clock 636746989 nanoseconds
So about 650 nanoseconds using a 32-bit build on Vista 64 Ultimate.
BTW, I'd like to suggest changing the output format in the program to put
the times first in fixed-width right-justified format followed by the clock
names, so the times will line up nicely in columns.
> So about 650 nanoseconds using a 32-bit build on Vista 64 Ultimate.
I recoded now() to eliminate the math, with no improvement. The Windows API
call totally dominates the time spent, so there's no significant saving in
deferring the conversion.
The lower numbers on XP may be due to it being single-core, so it doesn't
need to deal with which core is reading the counter. My Vista system is
dual-core, and I understand there's logic in the API to handle that.
We could see a little improvement in this case, as the conversion would be
done only once instead of twice. Note the rt_ prefixes in the code (for
run-time):

const rt_time_point startTime = rt_steady_clock::now();
// stuff to measure here
const rt_time_point endTime = rt_steady_clock::now();
const duration elapsed =
    duration_cast<duration>(endTime - startTime); // conversion done by the cast

The conversion from rt_duration to duration could also be implicit, and the
last line would become

const duration elapsed = endTime - startTime; // done by an implicit conversion operator
> Another use case, waiting a precise amount of time:
>
> void wait(const steady_clock::duration timeToWait)
> {
>     const steady_clock::time_point endTime = steady_clock::now() + timeToWait;
>     while (steady_clock::now() < endTime)
>         ;
> }
>
> Here, the conversion from nanoseconds to ticks would be done by the
> addition operator, so that the time_point comparison would be in ticks.
More precisely, it would be done in the conversion from timeToWait to the
number of ticks to wait:

void wait(const rt_duration ticksToWait)
{
    const rt_time_point endTime = rt_steady_clock::now() + ticksToWait; // ** ticks
    while (rt_steady_clock::now() < endTime)
        ;
}

void wait(const duration timeToWait)
{
    wait(rt_duration_cast<rt_duration>(timeToWait)); // conversion to ticks
}

Again, if the conversion from duration to rt_duration is implicit, the last
line becomes

    wait(timeToWait);

and so the wait overload would not really be needed.
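Filling in the remaining pieces, a minimal sketch of what rt_time_point and
rt_steady_clock might look like on Windows, building on the rt_duration
sketch earlier in the thread (all names hypothetical; to_rt_duration below
plays the role of the rt_duration_cast used above):

    // Hypothetical time_point counterpart: arithmetic and comparison stay
    // in raw ticks, so no multiply happens until a unit conversion is asked for.
    class rt_time_point
    {
    public:
        explicit rt_time_point(long long ticks) : ticks_(ticks) {}

        friend rt_duration operator-(rt_time_point a, rt_time_point b)
        { return rt_duration(a.ticks_ - b.ticks_); }    // ticks only

        friend rt_time_point operator+(rt_time_point t, rt_duration d)
        { return rt_time_point(t.ticks_ + d.count()); } // ticks only

        friend bool operator<(rt_time_point a, rt_time_point b)
        { return a.ticks_ < b.ticks_; }                 // ticks only

    private:
        long long ticks_;
    };

    struct rt_steady_clock
    {
        static rt_time_point now()
        {
            LARGE_INTEGER count;
            QueryPerformanceCounter(&count); // raw ticks, no conversion
            return rt_time_point(count.QuadPart);
        }
    };

    // Conversion in the other direction, sampling the period once:
    inline rt_duration to_rt_duration(boost::chrono::nanoseconds d)
    {
        LARGE_INTEGER freq;
        QueryPerformanceFrequency(&freq);
        return rt_duration(static_cast<long long>(
            d.count() * (freq.QuadPart / 1000000000.0)));
    }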
> I'd use this where I need to wait a small amount of time (much less than a
> scheduler tick) for a hardware device to respond, before giving up my time
> slice.
I understand the use case. I don't see a major difference in this case, as
what is important is the accuracy, which IMO is preserved. With the current
design/implementation your processor will just be busier, which is not good,
however.
I will add some performance tests that measure the time spent getting the
ticks and the period, to see if it is worth continuing.
Best,
Vicente
> We could see a little improvement in this case, as the conversion would be
> done only once instead of twice. Note the rt_ prefixes in the code (for
> run-time):
Aha, that looks perfect.
> 10.0\release\asynch-exceptions-on\threading-multi\store_now_in_vector_static.output: steady_clock 636746989 nanoseconds
>
> So about 650 nanoseconds using a 32-bit build on Vista 64 Ultimate.
Some of that may reflect quantization error.
E.g., the clock output might be truncated to microsecond precision, which
introduces a 500 ns error on average (half the 1 µs quantum), while the
actual read overhead is something like 150 ns.
- Marsh
> On 11/29/2011 08:31 PM, Kenneth Porter wrote:
>
>> 10.0\release\asynch-exceptions-on\threading-multi\store_now_in_vector_static.output: steady_clock 636746989 nanoseconds
>>
>> So about 650 nanoseconds using a 32-bit build on Vista 64 Ultimate.
>
> Some of that may reflect quantization error.
>
> E.g., the clock output might be truncated to microsecond precision,
> which introduces a 500 ns error on average and the actual read
> overhead is something like 150 ns.
The profiling program reads the clock a million times, storing the results
in a pre-allocated array, and times the whole operation. (An initial run
that constructs a million time_points is used to factor out the loop and
array member constructor time.) How would microsecond jitter affect the
overall operation to that degree?
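(For reference, the loop is shaped roughly like this; a sketch of the
methodology just described, not the actual store_now_in_vector.cpp source:)

    std::vector<steady_clock::time_point> samples(1000000); // pre-allocated
    const steady_clock::time_point t0 = steady_clock::now();
    for (std::size_t i = 0; i < samples.size(); ++i)
        samples[i] = steady_clock::now();    // one stored read per iteration
    const steady_clock::time_point t1 = steady_clock::now();
    // (t1 - t0) / samples.size() approximates the per-call cost of now()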
Ah, well not if you do it that way.
When you look at the array, does it reflect any particular quantization?
It might be interesting to run the test on bare metal and again in a
VMware-type VM. The high resolution counters tend to be virtualized in
VMs and could produce greatly different results.
- Marsh
The times in the array have the precision of the clock and store the
actual time, not a duration.
I have no access to a VM. Could others run the performance test?
Best,
Vicente.
> I have no access to a VM. Could others run the performance test?
For those interested in running the test, simply go to the root of your
boost installation and run this command:
bjam libs/chrono/perf
This is currently broken on trunk for 64-bit Ubuntu Linux/g++ 4.6
Anthony
--
Author of C++ Concurrency in Action http://www.stdthread.co.uk/book/
just::thread C++0x thread library http://www.stdthread.co.uk
Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk
15 Carrallack Mews, St Just, Cornwall, TR19 7UL, UK. Company No. 5478976
Index: Jamfile.v2
===================================================================
--- Jamfile.v2 (revision 75799)
+++ Jamfile.v2 (working copy)
@@ -42,6 +42,7 @@
# remark #383: value copied to temporary, reference to temporary used
# remark #1418: external function definition with no prior declaration
<toolset>intel:<cxxflags>-wd304,383,1418
+ <define>BOOST_CHRONO_VERSION=2
;
rule chrono-run ( sources )
Committed revision 75827.
Best,
Vicente