
measuring clock cycles per second


Chris

Nov 13, 2008, 7:38:40 PM
Hello,

How do I measure the execution time of an algorithm/function in units of
clock cycles per second?

I'm in the process of measuring an algorithm in terms of units of time. To
measure the wall-clock time I am using gettimeofday(). To measure the
processor time, I am using getrusage(RUSAGE_SELF). So far, all is good.

Now I would like to measure the execution time of the algorithm in terms
of clock cycles per second. My goal is to determine/guestimate the minimum
power requirements (e.g. psu and cpu) my algorithm would/could/may require.

After googling and reading the first few pages of chapter 6 in Linux Device
Drivers, 2nd edition (I don't have the 3rd edition yet), it appears I can
use rdtscl() to access the TSC?

That said, rdtscl() has been removed as of the latest kernels (I am using
kernel 2.6.22.19-0.1-default - opensuse 10.3) as per this post
<http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=47;att=0;bug=436332>.

I did manage to get the test program, using rdtscl, to compile and run. The
code is at the end of this post.

Ultimately, I am going to create a Linux boot CD (probably based off of the
openSUSE 10.3 kiwi project) that will run the code. So, compatibility of
different kernels will not be an issue. My plan is to boot the CD on
different computers (Intel and AMD based) and record the results.

So, is there a way to measure the clock cycles or perhaps convert the time
elapsed from getrusage() to clock cycles per second?

Any information or guidance would be greatly appreciated, thank you.

---8<----
// test program calling rdtscl()
// compile using: gcc -o testcycles testcycles.c

// the following two lines are needed since asm-i386/msr.h does not
// contain the definitions for rdtsc() and rdtscl().
// On my machine, opensuse 10.3 kernel 2.6.22 amd x64 4200+, rdtscl() is
// only defined in <asm-x86_64/msr.h>
#define __x86_64__
typedef unsigned int u32;

#include <asm/msr.h>
#include <stdio.h>
#include <unistd.h>   // for sleep()

typedef unsigned int cycles_t;

int main(int argc, char *argv[])
{
    cycles_t start = 0;
    cycles_t end = 0;

    rdtscl(start);
    rdtscl(end);

    // start and end are unsigned ints, so print with %u (not %li)
    printf("rdtscl: No. of cycles: %u\n", end - start);

    rdtscl(start);
    sleep(1);
    rdtscl(end);

    printf("sleep 1s: No. of cycles: %u\n", end - start);

    return 0;
}
--->8----

Here are the results from my machine:
OS: openSUSE 10.3

chris@Desktop:~/tests/rdtscl_test> uname -a
Linux Desktop 2.6.22.19-0.1-default #1 SMP 2008-10-14 22:17:43 +0200 i686
athlon i386

chris@Desktop:~/tests/rdtscl_test> ./testcycles
rdtscl: No. of cycles: 9
sleep 1s: No. of cycles: 1004509086

--
Chris

David Schwartz

Nov 13, 2008, 10:15:56 PM
On Nov 13, 4:38 pm, Chris <ch...@thisisnotanemailaddress.ca> wrote:

> That said, rdtscl() has been removed as of the latest kernels (I am using
> kernel 2.6.22.19-0.1-default - opensuse 10.3) as per this post
> <http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=47;att=0;bug=436332>.

Since you're not writing kernel code (are you?!) why do you care
whether rdtscl is in the kernel or not?

DS

scholz...@gmail.com

Nov 14, 2008, 12:40:49 AM
There is no way to do this. Even Linus agrees that profiling is worse on
Linux and is one of the things that still needs to be done.

TSC and other CPU counter events are not working because they are
not saved across task switches. So it's almost useless to use them.

I have a Blade 1000 SPARC machine here with Solaris which is able to do
something like this. You can also use Itanium2 CPUs with HP-UX.

Jasen Betts

Nov 14, 2008, 4:10:48 AM
On 2008-11-14, Chris <ch...@thisisnotanemailaddress.ca> wrote:
> Hello,

> How do I measure the execution time of an algorithm/function in units of
> clock cycles per second?

you have a unit mismatch.

> I'm in the processes of measuring an algorithm in terms of units of time. To
> measure the wall-clock time I am using gettimeofday(). To measure the
> processor time, I am using getrusage(RUSAGE_SELF). So far, all is good.

> Now I would like to measure the the execution time of the algorithm in terms
> of clock cycles per second. My goal is to determine/guestimate the minimum
> power requirements (e.g. psu and cpu) my algorithm would/could/may require.

divide by the speed of the processor you are using ??

clock cycles are a rather meaningless unit; they come in all
different sizes.

Rainer Weikusat

Nov 14, 2008, 7:47:43 AM
Chris <ch...@thisisnotanemailaddress.ca> writes:
> How do I measure the execution time of an algorithm/function in units of
> clock cycles per second?

Such a measure isn't very useful. Except in special cases, execution
time will be dominated by memory access latencies and these will vary
wildly, depending on such things as 'past history of the system' (ie
what's [still] contained in the various cache(s)) or seemingly
innocuous minor changes in machine code layout due to C-level changes
completely unrelated to the algorithm you plan to measure.

> I'm in the processes of measuring an algorithm in terms of units of time. To
> measure the wall-clock time I am using gettimeofday(). To measure the
> processor time, I am using getrusage(RUSAGE_SELF). So far, all is
> good.

Consider using a profiler. gprof is basically useless except on
ancient hardware. I found OProfile to be fairly usable, though.

http://oprofile.sourceforge.net/about

Rainer Weikusat

Nov 14, 2008, 7:50:12 AM
scholz...@gmail.com writes:
> There is not way to do this. Even Linus agrees that profiling is worse
> on Linux and one of things that still needs to be done.
>
> TSC and other CPU counter events are not working because they are
> not saved across task switches.

There is no instruction to set the time stamp counter and there cannot
even be one, because this would directly contradict its purpose.

Chris

Nov 14, 2008, 9:08:11 AM
Rainer Weikusat wrote:

Thanks for the link. I will look into OProfile in more detail.
--
Chris

Chris

Nov 14, 2008, 9:11:07 AM
Jasen Betts wrote:

I figured I could do that, but I have several machines that I wish to test,
ranging from a Pentium 150 MHz up to an AMD 64 X2 6000+. My goal is to create
a boot CD that boots the 2.6 kernel (not sure how that will work with the
P150, since the last OS I used on it was Red Hat 6.0, possibly the version
before that. It's been a while since I turned it on :)).

I was kind of hoping there would be some dynamic way of
measuring/calculating/deriving the clock cycles.

--
Chris

Chris

Nov 14, 2008, 9:17:54 AM
David Schwartz wrote:

Ugh, I was afraid that this might be something that I cannot do from
userspace.

I have several computers that I would like to test the algorithm on: Pentium
150 -> AMD 64x2 6000+. To minimise the variables, I am thinking of creating
a boot CD based on opensuse 10.3 or 11.0 (I use 10.3 to develop). I haven't
got this far yet, so the details of the CD are sketchy at best.

--
Chris

David Schwartz

Nov 14, 2008, 1:01:16 PM
On Nov 14, 6:17 am, Chris <ch...@thisisnotanemailaddress.ca> wrote:

> David Schwartz wrote:

> > Since you're not writing kernel code (are you?!) why do you care
> > whether rdtscl is in the kernel or not?

> Ugh, I was afraid that this might be something that I cannot do from
> userspace.

You can. You can call 'rdtscl', whether the kernel has it or not.
You're not writing kernel code, so it doesn't matter whether the
kernel has it or not. Since you're calling from user space, all that
matters is whether user space has it or not, and you have complete
control over user space.
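
For illustration, a minimal user-space sketch of that (just a sketch, not
code anyone posted in this thread; it assumes an x86/x86-64 CPU and
GCC-style inline assembly, so it needs no kernel headers at all):

---8<----
/* Sketch: read the TSC from user space with inline assembly.
   Assumes x86/x86-64 and GCC; illustrative only. */
#include <stdio.h>

static inline unsigned long long rdtsc(void)
{
    unsigned int lo, hi;
    /* RDTSC puts the low 32 bits in EAX and the high 32 bits in EDX */
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((unsigned long long)hi << 32) | lo;
}

int main(void)
{
    unsigned long long start = rdtsc();
    /* ... code to be measured ... */
    unsigned long long end = rdtsc();

    printf("elapsed: %llu TSC ticks\n", end - start);
    return 0;
}
--->8----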

> I have several computers that I would like to test the algorithm on: Pentium
> 150 -> AMD 64x2 6000+. To minimise the variables, I am thinking of creating
> a boot CD based on opensuse 10.3 or 11.0 (I use 10.3 to develop). I haven't
> got this far yet, so the details of the CD are sketchy at best.

Depending on what the algorithm does and what you hope to measure,
there may or may not be a sensible way to do what you're trying to do.

Why not just come up with a realistic test scenario that stresses your
algorithm, measure its wall time, and if you really want to, multiply
by the system's clock speed.

DS

Chris

Nov 14, 2008, 2:09:50 PM
David Schwartz wrote:
>> I have several computers that I would like to test the algorithm on:
>> Pentium 150 -> AMD 64x2 6000+. To minimise the variables, I am thinking
>> of creating a boot CD based on opensuse 10.3 or 11.0 (I use 10.3 to
>> develop). I haven't got this far yet, so the details of the CD are
>> sketchy at best.
>
> Depending on what the algorithm does and what you hope to measure,
> there may or may not be a sensible way to do what you're trying to do.
>
> Why not just come up with a realistic test scenario that stresses your
> algorithm, measure its wall time, and if you really want to, multiply
> by the system's clock speed.
>
> DS

I think I will do this since this was my original plan.

What timings would you suggest I be concerned with: the system (wallclock)
time or the process time (time in CPU) or both? The timing code I wrote
gives me both. The algorithm isn't simply a FP calculation but it also
includes reading large chunks of memory. I guess, if I want to truly gauge
the performance, the system time would be better since it would include bus
delays, interrupts, etc. which are important if I want to be able to
determine the minimum hardware requirements for the algorithm.

--
Chris

Nate Eldredge

Nov 14, 2008, 2:27:09 PM
Rainer Weikusat <rwei...@mssgmbh.com> writes:

Interestingly, that's not quite true. On my Opteron CPU, the TSC is a
model-specific register and can be written with the appropriate
(privileged) WRMSR instruction. (I just tried it and it works, though
the documentation says this feature is not to be relied upon.) So in
principle, you could have the TSC saved on task switches, so it would
count cycles for each process. Not that it seems like a particularly
good idea.

FreeBSD has support for CPU performance-monitoring counters (PMC) which
can count not only clock cycles but many other CPU events (jumps taken,
cache misses, pipeline stalls, etc). These can be set to run on a
systemwide or per-process basis. It doesn't appear that Linux has this
support yet, unless I am missing something.

David Schwartz

Nov 14, 2008, 2:27:21 PM
On Nov 14, 11:09 am, Chris <ch...@thisisnotanemailaddress.ca> wrote:

> What timings would you suggest I be concerned with: the system (wallclock)
> time or the process time (time in CPU) or both? The timing code I wrote
> gives me both. The algorithm isn't simply a FP calculation but it also
> includes reading large chunks of memory. I guess, if I want to truly gauge
> the performance, the system time would be better since it would include bus
> delays, interrupts, etc. which are important if I want to be able to
> determine the minimum hardware requirements for the algorithm.

Bus delays and memory delay will be counted as user time. Interrupts
will not be counted as user time or system time. System time is time
that your process spends using the CPU in kernel space.
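
To make the distinction concrete, a small sketch (illustrative only; the
work() body is just a placeholder busy loop) that reports all three figures:

---8<----
/* Sketch: report wall-clock, user, and system time for a chunk of work.
   Illustrative only; work() is a placeholder. */
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

static double tv_to_sec(struct timeval tv)
{
    return tv.tv_sec + tv.tv_usec / 1e6;
}

static void work(void)
{
    volatile double x = 0.0;
    long i;
    for (i = 0; i < 50000000L; i++)
        x += i * 0.5;
}

int main(void)
{
    struct timeval w0, w1;
    struct rusage r;

    gettimeofday(&w0, NULL);
    work();
    gettimeofday(&w1, NULL);
    getrusage(RUSAGE_SELF, &r);

    printf("wall:   %.6f s\n", tv_to_sec(w1) - tv_to_sec(w0));
    printf("user:   %.6f s\n", tv_to_sec(r.ru_utime));
    printf("system: %.6f s\n", tv_to_sec(r.ru_stime));
    return 0;
}
--->8----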

DS

Jasen Betts

Nov 15, 2008, 8:28:39 AM

Whatever it is you are trying to do, you are going about it the wrong way.

Chris

Nov 15, 2008, 8:58:10 AM
Jasen Betts wrote:

It's better for me to find out now before I invest too much time in this.

Basically, I am benchmarking my algorithm/process for its
appropriateness/likelihood to be executed on specialised, embedded
hardware. The benchmarking app I have written takes a set of input and
executes the algorithm and writes a report on the various timings
(milliseconds) and memory consumption.

It was suggested to me, by a coworker, that I look into counting the clock
cycle execution of the algorithm (to report the number of clock cycles needed
to execute the algorithm and various parts of it). The idea is to provide a
metric that can be used to determine/guesstimate the minimum CPU requirements
and to (possibly) determine/guesstimate the power requirements of the
algorithm (i.e. if it takes N clock cycles on a Pentium 150 MHz CPU, then it
requires X watts/volts/etc.).

I know there are a lot of factors that I haven't even begun to consider. I
was just curious if there was a way to report that the algorithm took N clock
cycles to complete, where N would be a number on the order of 1e9, I'm sure!

The more reading I have done on this suggests that if it takes N clock
cycles on a Pentium 150Mhz CPU, it may not necessarily take M cycles on a
Pentium 3 1Ghz where M < N. It all depends on how many operations the CPU
can execute in one clock cycle.

For some reason I am failing to construct the right query for Google in
order to get more information on this.

--
Chris

Jasen Betts

Nov 15, 2008, 5:58:45 PM
On 2008-11-15, Chris <ch...@thisisnotanemailaddress.ca> wrote:

> It's better for me to find out now before I invest too much time in this.
>
> Basically, I am benchmarking my algorithm/process for its
> appropriateness/likelihood to be executed on specialised, embedded
> hardware. The benchmarking app I have written takes a set of input and
> executes the algorithm and writes a report on the various timings
> (milliseconds) and memory consumption.

measure the execution of the sample algorithm either in wall time or in time
spent executing the process, and then multiply that by the clock speed.
that'll give you the suggested measure.
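
something like this, roughly (a sketch only; the frequency value is an
assumption you'd have to supply for each machine, e.g. from /proc/cpuinfo):

---8<----
/* Sketch: turn CPU time from getrusage() into an estimated cycle count
   by multiplying by a nominal clock frequency.  The frequency here is
   an assumed per-machine value, not something detected automatically. */
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

int main(void)
{
    const double cpu_hz = 150e6;   /* e.g. a 150 MHz Pentium; adjust per box */
    struct rusage r;
    double cpu_sec;

    /* ... run the algorithm to be measured here ... */

    getrusage(RUSAGE_SELF, &r);
    cpu_sec = r.ru_utime.tv_sec + r.ru_utime.tv_usec / 1e6
            + r.ru_stime.tv_sec + r.ru_stime.tv_usec / 1e6;

    printf("approx. cycles: %.0f (%.3f s at %.0f MHz)\n",
           cpu_sec * cpu_hz, cpu_sec, cpu_hz / 1e6);
    return 0;
}
--->8----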

> It was suggested to be, by a coworker, that I look into counting the clock
> cycle execution of the algorithm (to report the number lock cycles to
> execute the algorithm and various parts of the algorithm). The idea is to
> provide a metric that can be used to determine/guestimate the minimum cpu
> requirements and to (possible) determine/guestimate the power requirements
> of the algorithm

probably better to pick a common benchmark (used to be Dhrystone or
Whetstone) that is quoted for the potential target hardware and
compare that to the same benchmark on your hardware.

> (i.e. if it takes N clock cycles on a Pentium 150Mhz CPU,
> then it requires X watts/volts/etc.).


> I know there are a lot of factors that I haven't even begun to consider. I
> was just curious if there was a way to report that the algorithm took N
> clock cycles to complete. Where N would be a number on the order of 1e9 I'm
> sure!
>
> The more reading I have done on this suggests that if it takes N clock
> cycles on a Pentium 150Mhz CPU, it may not necessarily take M cycles on a
> Pentium 3 1Ghz where M < N. It all depends on how many operations the CPU
> can execute in one clock cycle.

yeah, benchmarks were invented for this reason (and also to enable comparison
of wildly different architectures like ARM and 68K). back in the 90s, when I
was more interested in this, the benchmarks were Dhrystone and Whetstone, but
time has moved on. queries involving the potential target and the word
'benchmark' may yield useful results.

eg a Via Nehemiah based processor seems to take about twice as many cycles as a
celeron on many benchmarks but only uses 1/4 of the power. non-x86 cores may be
more efficient.

Rainer Weikusat

Nov 16, 2008, 10:04:58 AM
Nate Eldredge <na...@vulcan.lan> writes:
> Rainer Weikusat <rwei...@mssgmbh.com> writes:
>> scholz...@gmail.com writes:
>>> There is not way to do this. Even Linus agrees that profiling is worse
>>> on Linux and one of things that still needs to be done.
>>>
>>> TSC and other CPU counter events are not working because they are
>>> not saved across task switches.
>>
>> There is no instruction to set the time stamp counter and there cannot
>> even be one, because this would directly contradict its purpose.
>
> Interestingly, that's not quite true. On my Opteron CPU, the TSC is a
> model-specific register and can be written with the appropriate
> (privileged) WRMSR instruction. (I just tried it and it works, though
> the documentation says this feature is not to be relied upon.) So in
> principle, you could have the TSC saved on task switches, so it would
> count cycles for each process.

Not without hardware support. How would you determine the correction
needed to account for the time between the start of interrupt handling
and the point where the TSC value is read and stored to some memory
location?

> FreeBSD has support for CPU performance-monitoring counters (PMC) which
> can count not only clock cycles but many other CPU events (jumps taken,
> cache misses, pipeline stalls, etc). These can be set to run on a
> systemwide or per-process basis. It doesn't appear that Linux has this
> support yet, unless I am missing something.

I posted a link to the sourceforge page of a project which provides
exactly this in another subthread (OProfile).

Rainer Weikusat

Nov 16, 2008, 10:11:20 AM
Jasen Betts <ja...@xnet.co.nz> writes:
> On 2008-11-15, Chris <ch...@thisisnotanemailaddress.ca> wrote:
>
>> It's better for me to find out now before I invest too much time in this.
>>
>> Basically, I am benchmarking my algorithm/process for its
>> appropriateness/likelihood to be executed on specialised, embedded
>> hardware. The benchmarking app I have written takes a set of input and
>> executes the algorithm and writes a report on the various timings
>> (milliseconds) and memory consumption.
>
> measure the execution of the sample algorithm ether in wall time or
> time spent executing the process and then divide that by the clock
> speed. that'll give you the suggested measure.

Only when the assumption is true that the CPU spent significantly more time
working than waiting for some external event, e.g. data transfers to/from
some sort of memory (plus a few other assumptions, e.g. that nothing was
causing lots of interrupts for some reason during the measurement period and
that no other tasks ran for significant amounts of time). The nasty detail
about this is that all of these assumptions will usually be true on a
dedicated 'test system', i.e. a developer workstation, but not on a computer
which is actually busy executing multiple tasks concurrently.

Nate Eldredge

Nov 16, 2008, 3:37:14 PM
Rainer Weikusat <rwei...@mssgmbh.com> writes:

> Nate Eldredge <na...@vulcan.lan> writes:
>> Rainer Weikusat <rwei...@mssgmbh.com> writes:
>>> scholz...@gmail.com writes:
>>>> There is not way to do this. Even Linus agrees that profiling is worse
>>>> on Linux and one of things that still needs to be done.
>>>>
>>>> TSC and other CPU counter events are not working because they are
>>>> not saved across task switches.
>>>
>>> There is no instruction to set the time stamp counter and there cannot
>>> even be one, because this would directly contradict its purpose.
>>
>> Interestingly, that's not quite true. On my Opteron CPU, the TSC is a
>> model-specific register and can be written with the appropriate
>> (privileged) WRMSR instruction. (I just tried it and it works, though
>> the documentation says this feature is not to be relied upon.) So in
>> principle, you could have the TSC saved on task switches, so it would
>> count cycles for each process.
>
> Not without hardware support. How is the correction necessary to
> account for the time between start of interrupt handling and reading
> the TSC value in order to store it to some memory location supposed to
> be determined?

It couldn't, of course.

I was imagining this as a high-resolution clock() that could be read
without a system call. In this case, the time you mention would get
charged to the process, just as it is in the usual implementation. But
at least now it would be a pretty good approximation to the actual
number of cycles used by the process.

If you want to get it exactly on the nose, I think you'd have to run in
privileged mode and disable interrupts. I'm not sure what the value of
that is, however, since there are plenty of other things that will make
your timing fluctuate.

>> FreeBSD has support for CPU performance-monitoring counters (PMC) which
>> can count not only clock cycles but many other CPU events (jumps taken,
>> cache misses, pipeline stalls, etc). These can be set to run on a
>> systemwide or per-process basis. It doesn't appear that Linux has this
>> support yet, unless I am missing something.
>
> I posted a link to the sourceforge page of a project which provides
> exactly this in another subthread (OProfile).

Aha. I didn't follow the link and so I didn't realize that was how
OProfile worked.

scholz...@gmail.com

Nov 19, 2008, 5:56:53 AM
On 14 Nov., 13:50, Rainer Weikusat <rweiku...@mssgmbh.com> wrote:

Are you really so inexperienced?

I don't give a shit about the TSC as such, just the counter values. So of
course the system needs to keep an adjustment factor on a per-thread basis
which is added to the counter.

It is also not really important if you have latencies caused by memory
delays or interrupts. You only need relative values. You compare parts of
your program and improvements in your algorithms with these relative
numbers.

The fact and key point is: Linux sucks here.

Almost 12 years after the introduction of these counters, the kernel guys
still haven't implemented any way in the normal kernel code to measure
CPU events. Yes, I know there are special kernel patches, but that sucks
and makes them useless for the usual software developer.

Rainer Weikusat

Nov 19, 2008, 7:24:40 AM
scholz...@gmail.com writes:
> On 14 Nov., 13:50, Rainer Weikusat <rweiku...@mssgmbh.com> wrote:
>> scholz.lot...@gmail.com writes:
>> > There is not way to do this. Even Linus agrees that profiling is worse
>> > on Linux and one of things that still needs to be done.
>>
>> > TSC and other CPU counter events are not working because they are
>> > not saved across task switches.
>>
>> There is no instruction to set the time stamp counter and there cannot
>> even be one, because this would directly contradict its purpose.
>
> Are you really so unexperienced?

That hardly follows from a statement of fact which happens to be only
partially true (there is no instruction to store the TSC as such).

> I give a shit about TSC just the counter values. So of course the
> System needs to keep an adjustment factor on a thread base which is
> added to the counter.

There still isn't an instruction to do so. In line with that, while
the register could be written to by 'other means', there is no way to
determine such an adjustment factor, because the counter is just going
to keep on counting (which happens to be its purpose), no matter what
the CPU is presently doing or not doing.

> It is also not really important if you have latencies caused by
> memory delays or interrupts.

Latencies caused by memory delays or execution of other tasks,
including interrupt handlers, mean that the TSC difference, divided by
the (nominal) CPU speed is not an accurate measurement of the amount
of CPU clock cycles necessary to execute a particular instruction
sequence. Which happens to be what was asked for.

[...]

> Almost 12 years after the introduction of this counters the kernel
> guys still haven't implemented any way in the normal kernel code to
> measure CPU events. Yes i know there are special kernel patches but
> it sucks and makes it useless for the usual software developers.

There is no difference between 'normal kernel code' and 'kernel
patches' which would justify a classification of the code itself.
Your statement should have been "Linus Torvalds sucks, because even
after twelve years ..., he still isn't distributing a kernel source
tree with a particular feature in it".

Coincidentally, this claim is wrong.


scholz...@gmail.com

Nov 20, 2008, 3:08:32 AM
> There still isn't an instruction to do so. In line with that, while
> the register could be written to by 'other means', there is no way to
> determine such an adjustment factor, because the counter is just going
> to keep on counting (which happens to be its purpose), no matter what
> the CPU is presently doing or not doing.

The kernel needs to read the current value of the counter register before a
thread is scheduled away, and when the thread comes back it needs to add the
difference between the last counter value and the current counter value to an
adjustment factor. When the user code reads the counter register again, it
must subtract the adjustment factor and gets a correct value that does not
(primarily) depend on the CPU load (if memory is behaving well and interrupts
are routed to other CPU cores).
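
Roughly, the bookkeeping would look like this (pseudocode only; this is not
real Linux kernel code, and read_tsc() and the hook names are made up for
illustration):

---8<----
/* Pseudocode sketch of the per-thread adjustment described above.
   NOT real kernel code; read_tsc() and the hooks are hypothetical. */
extern unsigned long long read_tsc(void);

struct thread_counters {
    unsigned long long adjustment;  /* cycles accumulated while descheduled */
    unsigned long long last_seen;   /* TSC value when we were switched out  */
};

/* called by the scheduler when a thread is switched out */
void on_deschedule(struct thread_counters *t)
{
    t->last_seen = read_tsc();
}

/* called when the same thread is switched back in */
void on_reschedule(struct thread_counters *t)
{
    t->adjustment += read_tsc() - t->last_seen;
}

/* what a "per-thread cycle count" read would then return */
unsigned long long thread_cycles(struct thread_counters *t)
{
    return read_tsc() - t->adjustment;
}
--->8----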

Of course the best would be to add this to the CPU hardware.
We can also blame Intel here for a typical brainless implementation,
thanks to the fact that there is no real competition among CPU vendors
anymore.

> > It is also not really important if you have latencies caused by
> > memory delays or interrupts.
>
> Latencies caused by memory delays or execution of other tasks,
> including interrupt handlers, mean that the TSC difference, divided by
> the (nominal) CPU speed is not an accurate measurement of the amount
> of CPU clock cycles necessary to execute a particular instruction
> sequence. Which happens to be what was asked for.

AFAIK there is a counter that tracks the number of executed instructions.
Then we need to use this counter. There are hundreds of them. Don't hang
on to the TSC; that's decade-old technology from the first Pentium. Core2
CPUs have much more to offer exactly for this reason.

> There is no difference between 'normal kernel code' and 'kernel
> patches' which would justify an classification of the code itself.

No, there is an official kernel, as opposed to the hundreds of different
Linux-based distributions. But here you can really blame the Linux kernel
developers and Linus Torvalds personally.

Jacek Dziedzic

Nov 30, 2008, 5:37:59 PM
Rainer Weikusat wrote:
> gprof is basically useless except on ancient hardware.

Could you elaborate why?

- J.

Rainer Weikusat

Nov 30, 2008, 5:46:03 PM

Because its analysis is based on 100Hz-sampling of the instruction
pointer in order to determine where a program spent its running time?

John Hasler

Nov 30, 2008, 6:47:48 PM
Rainer Weikusat wrote:
> gprof is basically useless except on ancient hardware.

Jacek Dziedzic writes:
> Could you elaborate why?

Rainer Weikusat wrote:
> Because its analysis is based on 100Hz-sampling of the instruction
> pointer in order to determine where a program spent its running time?

<http://sourceware.org/ml/binutils/2005-07/msg00423.html>
--
John Hasler
jo...@dhh.gt.org
Dancing Horse Hill
Elmwood, WI USA

Rainer Weikusat

Dec 1, 2008, 5:42:30 AM
John Hasler <jo...@dhh.gt.org> writes:
> Rainer Weikusat wrote:
>> gprof is basically useless except on ancient hardware.
>
> Jacek Dziedzic writes:
>> Could you elaborate why?
>
> Rainer Weikusat wrote:
>> Because its analysis is based on 100Hz-sampling of the instruction
>> pointer in order to determine where a program spent its running time?
>
> <http://sourceware.org/ml/binutils/2005-07/msg00423.html>

This communicates that some random routine in some version of glibc
tries to autodetect the resolution of the system clock.
Which is supposed to mean precisely what?

Jacek Dziedzic

Dec 1, 2008, 7:22:13 AM

I was hoping more for an answer than a question.

I still don't see how sampling with an, admittedly, low 100Hz
frequency makes it useless except on ancient hardware. Even if a short
function called from within an inner loop takes 100 us, it will get hits
(at an expected rate of about 0.01 hits per call) in a statistical sense,
provided it is executed many times. Right, the timings may not be very
accurate, but the percentages will be more or less right.

I can understand how programs that have a walltime of under a second
could get poor statistics, but who profiles such programs?

- J.

Rainer Weikusat

Dec 1, 2008, 7:43:55 AM
Jacek Dziedzic <jacek.dziedzi...@gmail.com> writes:
> Rainer Weikusat wrote:
>> Jacek Dziedzic <jacek.dziedzi...@gmail.com> writes:
>>> Rainer Weikusat wrote:
>>>> gprof is basically useless except on ancient hardware.
>>> Could you elaborate why?
>> Because its analysis is based on 100Hz-sampling of the instruction
>> pointer in order to determine where a program spent its running time?
>
> I was hoping more for an answer than a question.

And that was an answer. In the form of a rhetorical question, i.e. one which
has an implied answer.

> I still don't see how sampling with an, admittedly, low 100Hz
> frequency makes it useless except on ancient hardware. Even if a short
> function called from within an inner loop takes 100 us, it will get
> hits with a sampling rate of 0.01/s in a statistical sense, provided
> it is executed many times.

Why do you think you can predict which of the millions, if not
billions of values the instruction pointer had during the sampling
interval will be recorded?

Jacek Dziedzic

Dec 1, 2008, 9:46:17 AM
Rainer Weikusat wrote:
>
> And that was an answer. In form of a rethorical question, ie one which
> has an implied answer.

OK, fair enough.

>> I still don't see how sampling with an, admittedly, low 100Hz
>> frequency makes it useless except on ancient hardware. Even if a short
>> function called from within an inner loop takes 100 us, it will get
>> hits with a sampling rate of 0.01/s in a statistical sense, provided
>> it is executed many times.
>
> Why do you think you can predict which of the millions, if not
> billions of values the instruction pointer had during the sampling
> interval will be recorded?

Mind you, I'm not saying I can predict a single hit of the profiler
at a single EIP value. I'm saying that, given adequate number of
repetitions, every part of the code covered by the sampling interval
gets its hits proportionally, however short the part of the code (even a
single instruction) is. Or, in other words, with the number of
iterations N going to infinity, I can predict that every part of the
code that is executed within these iterations will be sampled any
desired number of times, regardless of its size and the sampling
resolution. Or, in other words still, however rare the sampling and
however short the code portion, I can guarantee, with any desired
probability, e.g. 99.9999% that every EIP value will get at least a
desired number of hits, say 300. Sure, it might take a long time (i.e.
make N really large) if the sampling is done, say, once per second, but
we're arguing about the general principle, right?

So I tend towards "realizing the limits of profiling with a small
resolution", yet the claim that the technique is "useless except on
ancient hardware" eludes me.

- J.

Rainer Weikusat

Dec 1, 2008, 10:13:11 AM
Jacek Dziedzic <jacek.dziedzi...@gmail.com> writes:
> Rainer Weikusat wrote:
>> And that was an answer. In form of a rethorical question, ie one
>> which
>> has an implied answer.
>
> OK, fair enough.
>
>>> I still don't see how sampling with an, admittedly, low 100Hz
>>> frequency makes it useless except on ancient hardware. Even if a short
>>> function called from within an inner loop takes 100 us, it will get
>>> hits with a sampling rate of 0.01/s in a statistical sense, provided
>>> it is executed many times.
>> Why do you think you can predict which of the millions, if not
>> billions of values the instruction pointer had during the sampling
>> interval will be recorded?
>
> Mind you, I'm not saying I can predict a single hit of the profiler
> at a single EIP value. I'm saying that, given adequate number of
> repetitions, every part of the code covered by the sampling interval
> gets its hits proportionally, however short the part of the code (even
> a single instruction) is.

You already asserted this in your last posting. But this amounts to
'it works because I say it does'. Trying a little thought experiment:
Let's assume that a program only executes a single loop and this loop
invokes two subroutines. The average execution time of
subroutine #1 is 1/4 of the sampling interval, the average execution
time of #2 is 3/4. This means the program spends 1/4 of its time in
subroutine one, yet if the profiling timer was started 'close' to the
start of the loop, the instruction pointer should basically always be
somewhere in #2 when the value is recorded.

What's wrong with this example?

[...]

> So I tend towards "realizing the limits of profiling with a small
> resolution", yet the claim that the technique is "useless except on
> ancient hardware" eludes me.

The assumption that gprof actually provided useful output at some
point in the past is just a complimentary assumption of mine, because
I have never seen this happen ever since I first encountered the
program on a 25 MHz processor.

Jacek Dziedzic

Dec 1, 2008, 1:59:17 PM
Rainer Weikusat wrote:

> You already asserted this in your last posting. But this amounts to
> 'it works because I say it does'.

No, not really. I could've backed this up with statistical reasoning,
Bernoulli trials come to mind, but thought that would be overkill. In
fact, I am not trying to convince you, I'm trying to find a reason why
you would think it's useless.

> Trying a little though experiment:
> Let's assume that a program only executes a single loop and this loop
> calls invokes two subroutines. The average execution time of
> subroutine #1 is 1/4 of the sampling interval, the average execution
> time of #2 3/4. This means the program spends 1/4 of its time in
> subroutine one, yet if the profiling timer was started 'close' to the
> start of the loop, the instruction pointer should basically always be
> somewhere in #2 when the value is recorded.
>
> What's wrong with this example?

It's the assumption that what happens "on average" happens "in every
instance" that you, it seems to me, made above.

Even if the average execution time of #1 is exactly 1/4 of the
sampling interval and the average execution time of #2 is exactly 3/4 of
the sampling interval, then, provided the standard deviation of the
respective probability distributions of times is nonzero, the total time
taken by the loop will, in general, be unequal to the sampling interval,
even if by a mere 1%. Accumulation of those little differences will, as
time progresses, quickly rid #2 of its preference for gathering sampling
ticks.

You also seem to conveniently

a) forget that there are other schedulable entities which can,
apparently stochastically, cause the program in question to be
preempted anywhere inside the loop, and as it runs again, the offset of
the sampling timer will change things drastically, because of things
like stalls, cache invalidation, page faults,...

b) assume an extraordinary view that _all_ loops that are timed are
somehow exact multiples of sampling time (which, imho, follows from the
adjective "useless" that you have used). Or was this example only to
demonstrate something else?

>> So I tend towards "realizing the limits of profiling with a small
>> resolution", yet the claim that the technique is "useless except on
>> ancient hardware" eludes me.
>
> The assumption that gprof actually provides useful output at some
> point in the past is just a complimentary assumption of mine, because
> I have never seen this happen ever since I first encountered the
> program on a 25 Mhz processor.

My view is rather that "gprof actually provides useful output today,
even with 100/s resolution". I see it happen once in a while when I look
for bottlenecks in code that runs on a 2.4 GHz processor. The function in
which they occur, it seems, I can isolate pretty easily from the gprof
output. It's the one up there in the list, with >90% of the time taken. You
optimize it and bang, the running time is slashed. That's the opposite of
useless, happening every now and then. Rather than the poor resolution,
I find the fact that the measurement itself leads to changes in timing (via
stalls, interference with the cache, etc.) more difficult to work with,
especially when -g is used (not sure about the reason).

- J.

Nate Eldredge

Dec 1, 2008, 2:03:03 PM
Rainer Weikusat <rwei...@mssgmbh.com> writes:

It seems like an easy fix would be to introduce some noise, so that the
sampling interval is not fixed but fluctuates randomly. This would
eliminate the problem that occurs if the sample interval happens to be
commensurable with the execution time of some significant piece of the
program.
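
A quick way to see the effect is a toy simulation (a sketch of my own, and
nothing to do with gprof's actual implementation): sample a synthetic loop
whose body is split 25%/75% between two routines, once with a fixed interval
equal to the loop period and once with a randomly jittered interval.

---8<----
/* Toy simulation: fixed vs. jittered sampling of a loop whose body takes
   exactly one sampling interval, split 25%/75% between two "subroutines".
   The fixed sampler aliases badly; the jittered one recovers the split.
   Compile: gcc -o sim sim.c -lm */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const double period  = 0.010;   /* loop body = 10 ms = sampling interval */
    const double split   = 0.25;    /* first 25% of the body is routine #1   */
    const int    samples = 100000;
    int jitter;

    for (jitter = 0; jitter <= 1; jitter++) {
        double t = 0.0;
        long hits1 = 0, hits2 = 0;
        int i;
        srand(1);
        for (i = 0; i < samples; i++) {
            double step = period;
            double phase;
            if (jitter)                               /* +/-50% random jitter */
                step *= 0.5 + (double)rand() / RAND_MAX;
            t += step;
            phase = fmod(t, period);          /* where inside the loop body? */
            if (phase < split * period) hits1++; else hits2++;
        }
        printf("%s sampling: routine #1 %5.1f%%, routine #2 %5.1f%%\n",
               jitter ? "jittered" : "fixed   ",
               100.0 * hits1 / samples, 100.0 * hits2 / samples);
    }
    return 0;
}
--->8----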

Rainer Weikusat

Dec 2, 2008, 5:31:07 AM
Jacek Dziedzic <jacek.dziedzi...@gmail.com> writes:
> Rainer Weikusat wrote:
>> You already asserted this in your last posting. But this amounts to
>> 'it works because I say it does'.
>
> No, not really. I could've backed this up with statistical
> reasoning, Bernoulli trials come to mind, but

... for some weird reason, you'd rather continue to assert that what
must not be cannot be and that you certainly know why it would be this
way, and so please everyone just believe you.

>> Trying a little though experiment: Let's assume that a program only
>> executes a single loop and this loop calls invokes two
>> subroutines. The average execution time of subroutine #1 is 1/4 of
>> the sampling interval, the average execution time of #2 3/4. This
>> means the program spends 1/4 of its time in subroutine one, yet if
>> the profiling timer was started 'close' to the start of the loop,
>> the instruction pointer should basically always be somewhere in #2
>> when the value is recorded.

>> What's wrong with this example?
>
> It's the assumption that what happens "on average" happens "in every
> instance" that you, it seems to me, made above.

Since this directly contradicts the fact that I wrote 'on average', which
means something completely different from what you claim I had
certainly meant to say, that's obviously (another) nonsense assertion
you would like to make for an unspoken reason.

> Even if the average execution time of #1 is exactly 1/4 of the
> sampling interval and the average execution time of #2 is exactly 3/4
> of the sampling interval, then, provided the standard deviation of the
> respective probability distributions of times is nonzero, the total
> time taken by the loop will, in general, be unequal to the sampling
> interval,

You kind-of contradict yourself: The only way for the standard
deviation to be zero is when the values are exact and not
averages of a sequence of varying values. This means that the complete
pompous paragraph above can be dropped because it just re-iterates a
well-known property of a sequence of unequal values whose average is
some value: they are not generally equal to the average.

> even if by a mere 1%. Accumulation of those little
> differences will, as time progresses, quickly rid #2 of its preference
> for gathering sampling ticks.

And here we have assertion #3: The differences will accumulate. There
is no reason why they should. Since execution of the loop body getting
progressively out of sync with the sampling interval would obviously
render the example useless for what it intended to demonstrate, you
have basically just evaded the question by assuming that "it will not
happen" (because you say so).

[...]

>> The assumption that gprof actually provides useful output at some
>> point in the past is just a complimentary assumption of mine, because
>> I have never seen this happen ever since I first encountered the
>> program on a 25 Mhz processor.
>
> My view is rather that "gprof actually provides useful output today,
> even with 100/s resolution". I see it happen once in a while when I
> look for bottlenecks in code that runs on 2.4 GHz processor. The
> function in which they occur, it seems I can isolate it pretty easily
> having gprof output. It's the one up there in the list, with >90% of
> time taken.

If this is typical runtime behaviour of code you happen to profile,
rest assured that your myopic experience in this respect isn't
universal.

Jacek Dziedzic

Dec 2, 2008, 3:15:47 PM
Rainer Weikusat wrote:

>> It's the assumption that what happens "on average" happens "in every
>> instance" that you, it seems to me, made above.
>
> Since this would directly contradict that I wrote 'on average' which
> means something completely different than what you claim I had
> certainly meant to say, that's obviously (another) nonsense assertion
> you would like to make for an unspoken-of reason.

That I could not parse.

>> Even if the average execution time of #1 is exactly 1/4 of the
>> sampling interval and the average execution time of #2 is exactly 3/4
>> of the sampling interval, then, provided the standard deviation of the
>> respective probability distributions of times is nonzero, the total
>> time taken by the loop will, in general, be unequal to the sampling
>> interval,
>
> You kind-of contradict yourself: The only way for the standard
> deviation to be zero is when the values are exact and not
> averages of a sequence of varying values. This means that the complete
> pompous paragraph above can be dropped because it just re-iterates
> well known property of a sequence of inequal values whose average is
> some value: They are not generally equal to the average.

So you seem to agree with me on that. Yet if we agree, I fail to see
where I "kind-of contradict myself".

The "pompous" paragraph I wrote could not be dropped without changing
the argument. I specifically excluded the corner case (std.dev=0) of the
corner case (perfect loop time/sampling time synchronization) that you
provided.

>> even if by a mere 1%. Accumulation of those little
>> differences will, as time progresses, quickly rid #2 of its preference
>> for gathering sampling ticks.
>
> And here we have assertion #3: The differences will accumulate. There
> is no reason why they should.

Oh sure there are, it's called the Random Walk. You should google it!
This is no assumption, this is simple maths. With some modest assumptions
about the stochasticity of the noise, you immediately wind up with the
conclusion that the accumulated difference grows with sqrt(t). Grows.

> Since execution of the loop body getting
> progressively out of sync with the sampling interval would obviously
> render the example useless for what it intended to demonstrate, you
> have basically just evaded the question by assuming that "it will not
> happen" (because you say so).

Not because I say so, but because this is an example of a Random Walk,
for which the result is known; you cannot simply ignore it.

I fail to see how your corner-case example might serve to strengthen
your, very general and all-encompassing, argument that the tool is
"useless".

> If this is typical runtime behaviour of code you happen to profile,
> rest assured that your myopic experience in this respect isn't
> universal.

I am hyperopic.

Sure my experience is not universal. Neither is yours. Yet, I can
safely claim "X is not useless, because there are instances when it's
useful" and you cannot safely claim "X is useless because I don't find
it useful".

cheers,
- J.
