[boost] [chrono] steady_clock efficiency

Kenneth Porter

Nov 28, 2011, 10:22:16 PM
to bo...@lists.boost.org
I expect most people will use boost::chrono::steady_clock to measure
performance. I've been considering switching from my raw calls to the
Windows performance counter APIs to steady_clock. I'm a little worried
about the overhead of what should be a non-intrusive facility.

In examining the source, I see that steady_clock::now() reads the performance
counter and then multiplies it by the reciprocal of the counter's frequency
(acquired on first use of the clock) to return the time in nanoseconds. How
expensive is this 64-bit multiply on a 32-bit system? How hard would it be
to have now() return a time_point that stores the raw tick value and only
converts it to nanoseconds when a calculation in real units is required?
(Comparison to another time_point could still be done in performance
counter ticks, saving the multiply.) Is it worth the extra complexity to
demand-convert ticks to nanoseconds?

Source is in boost_1_48_0\boost\chrono\detail\inlined\win\chrono.cpp
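
For reference, the scheme in that file boils down to something like the
following. This is only a simplified sketch of the approach described above,
not the actual Boost source (names and error handling omitted):

#include <windows.h>

// Done once, on first use of the clock: nanoseconds per counter tick.
double nanosecs_per_tick()
{
    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);          // counts per second
    return 1.0e9 / static_cast<double>(freq.QuadPart);
}

// What now() boils down to: read the counter, then do the multiply in question.
long long now_in_nanoseconds()
{
    static const double ns_per_tick = nanosecs_per_tick();
    LARGE_INTEGER count;
    QueryPerformanceCounter(&count);           // raw ticks
    return static_cast<long long>(count.QuadPart * ns_per_tick);
}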


_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Vicente J. Botet Escriba

Nov 29, 2011, 3:39:03 AM
to bo...@lists.boost.org
On 29/11/11 04:22, Kenneth Porter wrote:

> I expect most people will use boost::chrono::steady_clock to measure
> performance.
I guess so too.

> I've been considering switching from my raw calls to the
> Windows peformance counter APIs to the steady_clock. I'm a little worried
> about the overhead of what should be a non-intrusive facility.
Glad to hear you are considering it. Do you have any measurements that
suggest there is serious overhead? How do you use the ::now() function to
take your performance measurements?

> In examining the source, I see that steady_clock::now() reads the performance
> counter and then multiplies it by the reciprocal of the counter's frequency
> (acquired on first use of the clock) to return the time in nanoseconds. How
> expensive is this 64 bit multiply on a 32 bit system?
I recently added a performance test in
https://svn.boost.org/svn/boost/branches/release/libs/chrono/perf/store_now_in_vector.cpp.
I have no access to a Windows platform right now, but could you run it on
your system? On Mac OS I get the following figures for 100000 iterations:

debug

system_clock 107995476 nanoseconds
changes: 137173
steady_clock 27631675 nanoseconds
changes: 999999
high_resolution_clock 27683032 nanoseconds
changes: 999999
process_real_cpu_clock 640388694 nanoseconds
changes: 560828
process_user_cpu_clock 1090746878 nanoseconds
changes: 50
process_system_cpu_clock 1083428100 nanoseconds
changes: 62
process_cpu_clock 1093743446 nanoseconds
changes: 238

release

system_clock 44870075 nanoseconds
changes: 46173
steady_clock 23371245 nanoseconds
changes: 999999
high_resolution_clock 23554410 nanoseconds
changes: 999999
process_real_cpu_clock 619828025 nanoseconds
changes: 531914
process_user_cpu_clock 1079721262 nanoseconds
changes: 45
process_system_cpu_clock 1081643605 nanoseconds
changes: 63
process_cpu_clock 1077525848 nanoseconds
changes: 217

That is ~24ns for steady_clock::now().

> How hard would it be
> to have now() return a time_point that stores the raw tick value and only
> converts it to nanoseconds when a calculation in real units is required?

Ummm, we can store whatever we want in
steady_clock::time_point::duration::rep, but duration::period must be
known at compile time, and on Windows this is run-time information.
So no, we cannot store the result of the performance counter and
convert it when required without making it a special case that would
break the whole chrono architecture.
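
To illustrate the constraint (a sketch, not the actual Boost declarations):
a duration's period is a ratio<>, a pure compile-time constant, whereas the
performance counter frequency is only available at run time.

#include <boost/chrono.hpp>
#include <iostream>

int main()
{
    // The period of steady_clock's duration is a compile-time ratio<>:
    typedef boost::chrono::steady_clock::duration d;
    std::cout << d::period::num << "/" << d::period::den << "\n";  // e.g. 1/1000000000

    // The QueryPerformanceFrequency value, by contrast, is only known at
    // run time, so no ratio<> can be named for "one counter tick" -- hence
    // the conversion has to happen inside now().
}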


>
> (Comparison to another time_point could still be done in performance
> counter ticks, saving the multiply.) Is it worth the extra complexity to
> demand-convert ticks to nanoseconds?

I see the advantage; I would consider it if the performance results are
really catastrophic for you.

Best,
Vicente

Vicente J. Botet Escriba

Nov 29, 2011, 5:09:27 PM
to bo...@lists.boost.org
On 29/11/11 09:39, Vicente J. Botet Escriba wrote:

> Le 29/11/11 04:22, Kenneth Porter a écrit :
>
>> How hard would it be
>> to have now() return a time_point that stores the raw tick value and
>> only
>> converts it to nanoseconds when a calculation in real units is required?
> Ummm, we can store whatever we want in
> steady_clock::time_point::duration::rep, but duration::period must be
> known at compile time, and on Windows this is run-time information.
> So no, we cannot store the result of the performance counter and
> convert it when required without making it a special case that would
> break the whole chrono architecture.
>>
>> (Comparison to another time_point could still be done in performance
>> counter ticks, saving the multiply.) Is it worth the extra complexity to
>> demand-convert ticks to nanoseconds?
> I see the advantage; I would consider it if the performance results are
> really catastrophic for you.
>
Hi again,

I've been thinking about the cost induced by obtaining the period at run
time for steady_clock and the other clocks.

If I'm not mistaken, the fact that this period (the processor frequency)
can change at run time means that the measured time might be neither
monotonic, nor steady, nor exact. Maybe others with more knowledge could
confirm my suspicion or, on the contrary, explain why this is not a real
problem.

When using these clocks to measure performance, it is clear that what we
want to measure is the number of ticks, independently of any processor
frequency change. In addition, as you suggest, the conversion to a
time_point or a duration could be delayed until real units are needed.
Unfortunately, the duration class cannot be used for this purpose. Maybe
Boost could have a run-time duration class that stores the number of ticks
and is able to convert to a duration using a sample of the period taken at
the time of the conversion, along the lines of the sketch below.
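
A rough, Windows-only sketch of such a class (hypothetical names, not a
proposed Boost interface):

#include <boost/chrono.hpp>
#include <windows.h>

// Raw counter ticks, with the period sampled only when real units are asked for.
class rt_duration
{
    long long ticks_;
public:
    explicit rt_duration(long long ticks) : ticks_(ticks) {}
    long long count() const { return ticks_; }

    // The conversion samples the period here, at conversion time.
    boost::chrono::nanoseconds to_nanoseconds() const
    {
        LARGE_INTEGER freq;
        QueryPerformanceFrequency(&freq);       // run-time period sample
        const double ns = static_cast<double>(ticks_) * 1.0e9
                        / static_cast<double>(freq.QuadPart);
        return boost::chrono::nanoseconds(static_cast<long long>(ns));
    }
};

// now() that stays in ticks; subtracting two results also stays in ticks.
inline long long rt_now()
{
    LARGE_INTEGER count;
    QueryPerformanceCounter(&count);
    return count.QuadPart;
}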


In this way we could expect the time spent taking performance measurements
to be perhaps divided by two.

Waiting for some comments from the Boost community,
Vicente

P.S. Beman has added a new version of his Boost.Timer library; maybe it
could help you.

Kenneth Porter

Nov 29, 2011, 8:25:47 PM
to bo...@lists.boost.org
"Vicente J. Botet Escriba" <vicent...@wanadoo.fr> wrote in
news:4ED49A27...@wanadoo.fr:

> Glad to hear you are considering it. Have you any measures that let you
> think there is a serious overhead? How do you use the ::now() function
> to make your performance measures?

Thanks for the feedback. A typical use case:

const time_point startTime = steady_clock::now();
// stuff to measure here
const time_point endTime = steady_clock::now();
const duration elapsed = endTime - startTime;

Ideally the conversion from ticks to nanoseconds would be done by the
subtraction operator.

Another use case, waiting a precise amount of time:

void wait(const duration timeToWait)
{
    const time_point endTime = steady_clock::now() + timeToWait;
    while (steady_clock::now() < endTime)
        ;
}

Here, the conversion from nanoseconds to ticks would be done by the
addition operator, so that the time_point comparison would be in ticks.

I'd use this where I need to wait a small amount of time (much less than a
scheduler tick) for a hardware device to respond, before giving up my time
slice.

Kenneth Porter

Nov 29, 2011, 9:31:04 PM
to bo...@lists.boost.org
"Vicente J. Botet Escriba" <vicent...@wanadoo.fr> wrote in
news:4ED49A27...@wanadoo.fr:

> I have added recently a performance test in

> https://svn.boost.org/svn/boost/branches/release/libs/chrono/perf/store_now_in_vector.cpp.
> I have no access to a Windows platform right now, but could you run it on
> your system?

Grepping for "steady" from the output files:

boost_1_48_0\bin.v2\libs\chrono\perf\store_now_in_vector_header.test\msvc-10.0\debug\asynch-exceptions-on\threading-multi\store_now_in_vector_header.output: steady_clock 649713231 nanoseconds
boost_1_48_0\bin.v2\libs\chrono\perf\store_now_in_vector_header.test\msvc-10.0\release\asynch-exceptions-on\threading-multi\store_now_in_vector_header.output: steady_clock 697898058 nanoseconds
boost_1_48_0\bin.v2\libs\chrono\perf\store_now_in_vector_shared.test\msvc-10.0\debug\asynch-exceptions-on\threading-multi\store_now_in_vector_shared.output: steady_clock 645896685 nanoseconds
boost_1_48_0\bin.v2\libs\chrono\perf\store_now_in_vector_shared.test\msvc-10.0\release\asynch-exceptions-on\threading-multi\store_now_in_vector_shared.output: steady_clock 659919417 nanoseconds
boost_1_48_0\bin.v2\libs\chrono\perf\store_now_in_vector_static.test\msvc-10.0\debug\asynch-exceptions-on\threading-multi\store_now_in_vector_static.output: steady_clock 646621148 nanoseconds
boost_1_48_0\bin.v2\libs\chrono\perf\store_now_in_vector_static.test\msvc-10.0\release\asynch-exceptions-on\threading-multi\store_now_in_vector_static.output: steady_clock 636746989 nanoseconds

So about 650 nanoseconds using a 32-bit build on Vista 64 Ultimate.

BTW, I'd like to suggest changing the output format in the program to put
the times first in fixed-width right-justified format followed by the clock
names, so the times will line up nicely in columns.

Kenneth Porter

Nov 29, 2011, 11:38:11 PM
to bo...@lists.boost.org
On XP SP3 (32-bit) I see values of 260-280 nanoseconds.

Kenneth Porter

Nov 30, 2011, 12:10:24 AM
to bo...@lists.boost.org
Kenneth Porter <shiva.b...@sewingwitch.com> wrote in
news:Xns9FACBC567FE...@80.91.229.10:

> So about 650 nanoseconds using a 32-bit build on Vista 64 Ultimate.

I recoded now() to eliminate the math, with no improvement. The Windows API
call totally dominates the time spent, so there's no significant saving in
deferring the conversion.

The lower numbers on XP may be due to it being single-core, so it doesn't
need to deal with which core is reading the counter. My Vista system is
dual-core, and I understand there's logic in the API to handle that.
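
For illustration, a raw-tick variant of now() along these lines (a minimal
sketch, not the exact code I tried) still spends essentially all of its time
in the API call:

#include <windows.h>

// Skip the tick-to-nanosecond multiply entirely and return raw ticks.
inline long long raw_now()
{
    LARGE_INTEGER count;
    QueryPerformanceCounter(&count);   // this call dominates the cost
    return count.QuadPart;
}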

Vicente Botet

Nov 30, 2011, 4:07:31 AM
to bo...@lists.boost.org

Kenneth Porter-2 wrote
>
> "Vicente J. Botet Escriba" &lt;vicente.botet@&gt; wrote in
> news:4ED49A27.2030500@:

>
>> Glad to hear you are considering it. Have you any measures that let you
>> think there is a serious overhead? How do you use the ::now() function
>> to make your performance measures?
>
> Thanks for the feedback. A typical use case:
>
> const time_point startTime = steady_clock::now();
> // stuff to measure here
> const time_point endTime = steady_clock::now();
> const duration elapsed = endTime - startTime;
>
> Ideally the conversion from ticks to nanoseconds would be done by the
> subtraction operator.
>

We could see a little improvement in this case, as the conversion would be
done only once instead of twice. Note the rt_ prefixes in the code (for
run-time):

const rt_time_point startTime = rt_steady_clock::now();

// stuff to measure here

const rt_time_point endTime = rt_steady_clock::now();
const duration elapsed = duration_cast<duration>(endTime - startTime); // conversion done by the cast

The conversion from rt_duration to duration could also be implicit, and the
last line would become

const duration elapsed = endTime - startTime; // done by an implicit conversion operator


> Another use case, waiting a precise amount of time:
>
> void wait(const duration timeToWait)
> {
>     const time_point endTime = steady_clock::now() + timeToWait;
>     while (steady_clock::now() < endTime)
>         ;
> }
>
> Here, the conversion from nanoseconds to ticks would be done by the
> addition operator, so that the time_point comparison would be in ticks.


More precisely, it would be done in the conversion from timeToWait to the
number of ticks to wait:

void wait(const rt_duration ticksToWait)
{
    const rt_time_point endTime = rt_steady_clock::now() + ticksToWait; // in ticks
    while (rt_steady_clock::now() < endTime)
        ;
}

void wait(const duration timeToWait)
{
    wait(rt_duration_cast<rt_duration>(timeToWait)); // conversion to ticks
}

Again, if the conversion from duration to rt_duration is implicit, the last
line becomes

wait(timeToWait);

and so the wait overload would not really be needed.


> I'd use this where I need to wait a small amount of time (much less than a
> scheduler tick) for a hardware device to respond, before giving up my time
> slice.


I understand the use case. I don't see a major difference here, as what
matters is the accuracy, which IMO is preserved. With the current
design/implementation your processor will just be kept busier, which is not
good, however.

I will add some performance tests that measure the time spent getting the
ticks and the period, to see whether it is worth continuing.

Best,
Vicente

--
View this message in context: http://boost.2283326.n4.nabble.com/chrono-steady-clock-efficiency-tp4117923p4122296.html
Sent from the Boost - Dev mailing list archive at Nabble.com.

Kenneth Porter

Nov 30, 2011, 4:35:59 AM
to bo...@lists.boost.org
Vicente Botet <vicent...@wanadoo.fr> wrote in news:1322644051575-
412229...@n4.nabble.com:

> We could see a little improvement in this case as the conversion could be
> done only once, instead of two. Note the rt_ prefixes in the code (for
> run-time):

Aha, that looks perfect.

Marsh Ray

Dec 1, 2011, 12:55:11 PM
to bo...@lists.boost.org, Kenneth Porter
On 11/29/2011 08:31 PM, Kenneth Porter wrote:

> 10.0\release\asynch-exceptions-on\threading-multi
> \store_now_in_vector_static.output: steady_clock 636746989 nanoseconds
>
> So about 650 nanoseconds using a 32-bit build on Vista 64 Ultimate.

Some of that may reflect quantization error.

E.g., the clock output might be truncated to microsecond precision; a
truncation error uniformly distributed over [0, 1000) ns averages about
500 ns, so the actual read overhead could be something like 150 ns.

- Marsh

Kenneth Porter

Dec 1, 2011, 11:23:41 PM
to bo...@lists.boost.org
Marsh Ray <ma...@extendedsubset.com> wrote in
news:4ED7BF7F...@extendedsubset.com:

> On 11/29/2011 08:31 PM, Kenneth Porter wrote:
>
>> 10.0\release\asynch-exceptions-on\threading-multi
>> \store_now_in_vector_static.output: steady_clock 636746989
>> nanoseconds
>>
>> So about 650 nanoseconds using a 32-bit build on Vista 64 Ultimate.
>
> Some of that may reflect quantization error.
>
> E.g., the clock output might be truncated to microsecond precision
> which introduces a 500 ns error on average and the actual read
> overhead is something like 150 ns.

The profiling program reads the clock a million times, storing the results
in a pre-allocated array, and times the whole operation. (An initial run
that constructs a million time_points is used to factor out the loop and
array member constructor time.) How would microsecond jitter affect the
overall operation to that degree?
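
In outline, the timed loop has roughly this shape (a simplified sketch, not
the actual store_now_in_vector.cpp):

#include <boost/chrono.hpp>
#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
    using namespace boost::chrono;
    const std::size_t n = 1000000;

    // Pre-allocated so nothing is allocated inside the timed region.
    std::vector<steady_clock::time_point> samples(n);

    const steady_clock::time_point start = steady_clock::now();
    for (std::size_t i = 0; i < n; ++i)
        samples[i] = steady_clock::now();        // the call under test
    const steady_clock::time_point stop = steady_clock::now();

    std::cout << duration_cast<nanoseconds>(stop - start).count()
              << " nanoseconds for " << n << " calls\n";
}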

Marsh Ray

Dec 4, 2011, 10:41:04 PM
to bo...@lists.boost.org, Kenneth Porter
On 12/01/2011 10:23 PM, Kenneth Porter wrote:
>>
>> Some of that may reflect quantization error.
>>
>> E.g., the clock output might be truncated to microsecond precision
>> which introduces a 500 ns error on average and the actual read
>> overhead is something like 150 ns.
>
> The profiling program reads the clock a million times, storing the results
> in a pre-allocated array, and times the whole operation. (An initial run
> that constructs a million time_points is used to factor out the loop and
> array member constructor time.) How would microsecond jitter affect the
> overall operation to that degree?

Ah, well not if you do it that way.

When you look at the array, does it reflect any particular quantization?

It might be interesting to run the test on bare metal and again in a
VMware-type VM. The high resolution counters tend to be virtualized in
VMs and could produce greatly different results.

- Marsh

Vicente J. Botet Escriba

Dec 5, 2011, 2:04:02 AM
to bo...@lists.boost.org
On 05/12/11 04:41, Marsh Ray wrote:

> On 12/01/2011 10:23 PM, Kenneth Porter wrote:
>>>
>>> Some of that may reflect quantization error.
>>>
>>> E.g., the clock output might be truncated to microsecond precision
>>> which introduces a 500 ns error on average and the actual read
>>> overhead is something like 150 ns.
>>
>> The profiling program reads the clock a million times, storing the
>> results
>> in a pre-allocated array, and times the whole operation. (An initial run
>> that constructs a million time_points is used to factor out the loop and
>> array member constructor time.) How would microsecond jitter affect the
>> overall operation to that degree?
>
> Ah, well not if you do it that way.
>
> When you look at the array, does it reflect any particular quantization?
>
> It might be interesting to run the test on bare metal and again in a
> VMware-type VM. The high resolution counters tend to be virtualized in
> VMs and could produce greatly different results.
>
Hi,

The entries in the array have the precision of the clock and store the
actual time point, not a duration.

I have no access to a VM. Could others run the performance test?

Best,
Vicente.

Kenneth Porter

Dec 6, 2011, 3:07:39 AM
to bo...@lists.boost.org
"Vicente J. Botet Escriba" <vicent...@wanadoo.fr> wrote in
news:4EDC6CE2...@wanadoo.fr:

> I have no access to a VM. Could others run the performance test?

For those interested in running the test, simply go to the root of your
Boost installation and run this command:

bjam libs/chrono/perf

Anthony Williams

Dec 6, 2011, 5:01:26 AM
to bo...@lists.boost.org
On 06/12/11 08:07, Kenneth Porter wrote:
> bjam libs/chrono/perf

This is currently broken on trunk for 64-bit Ubuntu Linux/g++ 4.6

Anthony
--
Author of C++ Concurrency in Action http://www.stdthread.co.uk/book/
just::thread C++0x thread library http://www.stdthread.co.uk
Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk
15 Carrallack Mews, St Just, Cornwall, TR19 7UL, UK. Company No. 5478976

Vicente J. Botet Escriba

Dec 6, 2011, 12:05:00 PM
to bo...@lists.boost.org
On 06/12/11 11:01, Anthony Williams wrote:

> On 06/12/11 08:07, Kenneth Porter wrote:
>> bjam libs/chrono/perf
>
> This is currently broken on trunk for 64-bit Ubuntu Linux/g++ 4.6
>
You are right. I hadn't updated it since I switched to version 2.

Index: Jamfile.v2
===================================================================
--- Jamfile.v2 (revision 75799)
+++ Jamfile.v2 (working copy)
@@ -42,6 +42,7 @@
# remark #383: value copied to temporary, reference to temporary used
# remark #1418: external function definition with no prior declaration
<toolset>intel:<cxxflags>-wd304,383,1418
+ <define>BOOST_CHRONO_VERSION=2
;

rule chrono-run ( sources )

Committed revision 75827.

Best,
Vicente
