I want to use gprof to check the performance of my program. I compile the
program using the shared library method.
But after using gprof to check the time usage of each subroutine, I found
that there are some functions which gprof hasn't marked out, even though
these functions are definitely used in my program.
Is it right that I need to compile my program against the static library
version of the third-party code?
By the way, are there any better tools to check the time usage of a
program? I just want to track down my worst code.
Thanks
Water Lin
--
Water Lin's notes and pencils: http://en.waterlin.org
Email: Wate...@ymail.com
Without an example, it's hard to tell which of several issues is
affecting you. You may just need to add '-rdynamic' (GCC/G++) or '-
E' (LD) to your final link line.
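For what it's worth, a minimal sketch of a gprof build against a shared
third-party library might look like this (GCC on Linux assumed; the file
and library names are just placeholders):
----------
% gcc -pg -g -c main.c
% gcc -pg -g -rdynamic -o myprog main.o -L/opt/thirdparty/lib -lthirdparty
% ./myprog
% gprof ./myprog gmon.out > report.txt
----------
Running the program writes gmon.out in the current directory, which gprof
then reads. Keep in mind that only code compiled with -pg gets the
call-count instrumentation, so time spent inside a shared library that
wasn't built that way may still not show up.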
DS
The short answer is 'forget it'. 'gprof' samples the instruction
pointer at 100Hz and its output is essentially random on modern
computers (actually, it was already next to useless on the 25MHz
486-based computer where I first tried to use it).
[...]
> By the way, are there any better tools to check the time usage of a
> program? I just want to track down my worst code.
oprofile.
http://oprofile.sourceforge.net/news/
The main drawback (I am aware of) is that it doesn't count how often a
particular routine was invoked.
> oprofile.
>
> http://oprofile.sourceforge.net/news/
>
> The main drawback (I am aware of) is that it doesn't count how often a
> particular routine was invoked.
I've never been able to get useful results out of oprofile. Maybe I'm
doing something wrong, but I put quite a bit of effort into it and got
nothing back. It seems to be good for profiling your whole system
(what's the system doing?) but useless for analyzing a single program.
For example, I have one program that has performance issues. I believe
it's because it's calling memory allocation functions and
synchronization functions way too much. So I run it with oprofile.
What does oprofile tell me? That it's calling memory allocation and
synchronization functions way too much. And that's all it will tell
me.
I've turned the call graph depth up to 10, but I still only get two levels.
My low-level lock function, for example, is the performance problem,
and it's being called by my high-level lock function. Well, duh, of
course. And my low-level memory allocator is the performance problem,
and it's being called by my high-level memory allocator. Great, how
does any of that help me?
On the other hand, 'gprof' *does* tell me who is calling the high-
level functions and from what context.
DS
http://google-perftools.googlecode.com/svn/trunk/doc/cpuprofile.html
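For reference, hooking the CPU profiler into a program as that page
describes looks roughly like this (a sketch only; the header path is the
one the google-perftools packages install, busy_work is a made-up
stand-in for the code being profiled, and the program must be linked with
-lprofiler):
----------
#include <google/profiler.h>   /* google-perftools CPU profiler API */
#include <stdio.h>

/* Hypothetical stand-in for the code one actually wants to profile. */
static double busy_work(void)
{
    double x = 0;
    for (long i = 0; i < 50000000L; i++)
        x += i * 0.5;
    return x;
}

int main(void)
{
    ProfilerStart("google-profiler.prof");   /* start recording CPU samples */
    printf("%f\n", busy_work());
    ProfilerStop();                          /* flush and close the profile */
    return 0;
}
----------
According to the same page, the resulting file is then examined with the
pprof script shipped with google-perftools itself, e.g.
'% pprof --text ./myprog google-profiler.prof' ('myprog' again being a
placeholder name).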
And that is, coincidentally, what people usually call 'profiling
based on statistical sampling'. If this is not what you want, for
instance, because you have really no idea what your program is doing,
as opposed to 'having no clear idea how expensive what your program is
doing might be', using a tool which is intended to provide information
for the second case is obviously pointless. Generation of accurate
call graphs requires 'active software support' and the
gprof-instrumentation code can do this just fine, as you are probably
aware. But I was writing about the statistical sampling profiler part
of gprof and about the ludicrous idea of getting an accurate trace of a
value which changes a couple of billion times per second by storing a
value every 0.01s, with this being done by userspace code in response to
UNIX(*) signals.
You have a problem with the oprofile call graph support.
I recently developed callgraph support for oprofile on MIPS/n32,
so I have experience with this.
oprofile computes call graphs in the kernel. It does so without
access to any dwarf frame info from the binary.
For instance, the callgraph code I wrote is a bunch of code-walking
hacks consisting of heuristics for identifying the function-starting
instructions, frame sizes, and the frame linkage pointers to
continue the stack search. I have special cases for certain glibc
routines, and other cruft.
For instance, an early draft of my code did not produce call
traces from threads stuck in glibc system calls. Why? Because
syscalls are wrapped in hand-rolled machine code, not conforming
to the patterns that the code looks for.
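To give a rough idea of what such frame walking does in the easy case,
here is a simplified sketch for x86-style frames (this is not the
MIPS/n32 code in question, and it assumes the binary keeps frame
pointers, e.g. compiled with -O0 or -fno-omit-frame-pointer; real
profilers cannot rely on this layout, which is exactly why the heuristics
above are needed):
----------
#include <stdio.h>

/* On x86/x86-64 with frame pointers, each frame starts with the saved
   caller frame pointer followed by the return address. */
struct frame {
    struct frame *caller_fp;   /* saved frame pointer of the caller */
    void *return_address;      /* where execution resumes in the caller */
};

void print_backtrace(void)
{
    struct frame *fp = __builtin_frame_address(0);
    for (int depth = 0; fp != NULL && depth < 16; depth++) {
        printf("#%d  return address %p\n", depth, fp->return_address);
        /* Stop if the chain does not move up the stack -- a crude guard
           against running off the end of the valid frames. */
        if (fp->caller_fp <= fp)
            break;
        fp = fp->caller_fp;
    }
}

int main(void)
{
    print_backtrace();
    return 0;
}
----------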
> My low-level lock function, for example, is the performance problem,
> and it's being called by my high-level lock function. Well, duh, of
> course. And my low-level memory allocator is the performance problem,
> and it's being called by my high-level memory allocator. Great, how
> does any of that help me?
It sounds like you clearly have some call graph issue. Maybe
it's not able to back-trace through some kinds of frames properly on
your platform? Is this just x86?
You should be able to see the callers, as well as the callees of a
function, like for instance which caller is responsible for most of the
time being spent in the allocator, or which child function
of the top-level allocator function is the biggest pig.
Also, be sure that oprofile is really using the depth you configured. Set
up the depth in the configuration file; don't mess around with
/dev/oprofile. But you can peek at the /dev/oprofile filesystem to confirm
that the backtrace_depth is not zero.
The userland tools for oprofile are badly designed.
Sometimes when you change a configuration, it does not go into
effect right away; you must do an ``opcontrol --reset''.
If you don't reset, you are doing cumulative profiling, and some
parameter changes don't take effect. For example, if you are working with
combined totals across all cores but reconfigure to split the profile up
by CPU, that won't take effect until you reset.
I can't remember whether the backtrace depth is this way.
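For example (the exact option spellings here are from memory, so
double-check the opcontrol and opreport man pages), a session that
enables call graphs and discards stale samples might look like this,
with ./myprog as a placeholder for your binary:
----------
% opcontrol --no-vmlinux --callgraph=10
% opcontrol --reset
% opcontrol --start
% ./myprog
% opcontrol --dump
% opcontrol --stop
% opreport --callgraph ./myprog
% cat /dev/oprofile/backtrace_depth
----------
The last command just confirms that the configured depth really took
effect.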
E.g. actual output:
Here, a program called txr is being run with ``gc debugging''
turned on, which exercises the garbage collector deliberately,
in order to flush bugs.
So we can see that the gc function is taking 78% of the time.
Now 80% of /its/ time is in itself, while about 20% is in
the mark_obj function. Only a tiny amount of time is spent
in finalizing malloced objects (the call to free).
Processing weak hashes is also negligible.
In the next section, we can see that mark_obj spends most of its time in
itself, or a recursive call, with some minor time spent marking hash tables in
hash_mark, stdio streams in stdio_stream_mark, etc.
From this I can infer that it's mostly marking objects for which the marking
functions are inlined, like cons cells, symbols and strings. (If I wanted
details, I'd have to rebuild the program to defeat some inlining within the
gc).
OTOH, I'm pretty sure that the _IO_file_open reference is wrong; no way does
the garbage collector open files! This could be a consequence of the backtrace
code being confused by the stack frame of __sigjmp_save. There were two hits
in the __sigjmp_save, one of which maybe backtraced wrongly.
CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples % image name app name symbol name
-------------------------------------------------------------------------------
81546 100.000 txr txr make_obj
65604 78.5112 txr txr gc
65604 80.4394 txr txr gc [self]
15911 19.5091 txr txr mark_obj
18 0.0221 libc-2.5.so txr free
9 0.0110 txr txr hash_process_weak
6 0.0074 txr txr gc_mark
5 0.0061 txr txr stdio_stream_mark
2 0.0025 libc-2.5.so txr __sigjmp_save
1 0.0012 libc-2.5.so txr _IO_file_open
1 0.0012 txr txr bind_cdr
-------------------------------------------------------------------------------
15911 30.4546 txr txr gc
36334 69.5454 txr txr mark_obj
15978 19.1216 txr txr mark_obj
36334 69.4059 txr txr mark_obj
15978 30.5215 txr txr mark_obj [self]
19 0.0363 txr txr hash_mark
10 0.0191 txr txr gc_mark
6 0.0115 txr txr stdio_stream_mark
3 0.0057 txr txr c_num
[ ... etc: rest of report snipped ]
I am trying to use google-perftools, but I found that I can't analyze my
google-perftools output. I installed google-perftools from the Ubuntu
sources. I added the code
----------
ProfilerStart("google-profiler.prof");
----------
to my code, and I get the output file google-profiler.prof. But when I
use the command
-----------
% pprof myprogramname google-profiler.prof
-----------
to analyze the output file, my shell reports the following errors:
----------
Reading Profile files in profile.*
Error: Could not open profile.0.0.0: No such file or directory
profile.ftab: No such file or directory
----------
Is anything wrong with my steps?
BTW: the tool pprof is provided by installing Tau.
[...]
> I am trying to use google-perftools, but I found that I can't analyze my
> google-perftools output. I installed google-perftools from the Ubuntu
> sources.
Judging from a glance at the documentation, this doesn't really
provide anything gprof doesn't already provide. In particular, with
default settings, it suffers from the same kind of data collection
problem. I have some small C program here which is part of an SMS-like
messaging system used by a social networking application running on
iPhones. It has 3129 SLOC and the compiled binary has 4810
x86-instructions. It is running on a 2.4GHz Core2. Doing some naive
back-of-the-envelope calculation indicates that the processor could
execute all of the program 12 million times in the time between two
gprof / 'google perftools' sampling points. The gprof output will
actually inform you that the 'output will be inaccurate for functions
whose total running time was less than or equal to the sampling
interval'. The program I was using as an example has been in
continuous operation since 2009/09/07 and has used 5 minutes and 67
seconds of CPU time since then. This means that even if I had been
gprofiling it all the time, the output would still be useless. Why
don't you rather try to square a few circles? That would at least be a
famous problem which is impossible to solve.
That's odd; 5 minutes and 67 seconds is usually 6 minutes and 7 seconds.
Maybe your procps utilities installation is as broken as your gprof and
oprofile?
:)
> This means that even if I had been
> gprofiling it all the time, the output would still be useless.
No, the output would confirm that you have a program which has taken several
months of continuous use to rack up 6 minutes of CPU time.
I.e. maybe profile something else?
> No, the output would confirm that you have a program which has taken several
> months of continuous use to rack up 6 minutes of CPU time.
>
> I.e. maybe profile something else?
What, and miss the chance to reduce its CPU usage from .09% to .085%?!
DS
Indeed. It turns out I misinterpreted the number, and it is actually
5.69 seconds.
> Maybe your procps utilities installation is as broken as your gprof and
> oprofile?
>
> :)
Maybe I often write postings and actually 'send' them without reading
through the text a couple of times, which is a requirement when one
desires to catch obvious mistakes like this. But this is, after all,
USENET, and not going to be executed, so I would classify this as a minor
sin. The 'He has made an error! Everything he ever did or will ever do
must be wrong!!' routine you try to pull above, even if it was meant in a
humorous way, isn't.
>> This means that even if I had been
>> gprofiling it all the time, the output would still be useless.
>
> No, the output would confirm that you have a program which has taken
> several months of continuous use to rack up 6 minutes of CPU time.
And neither is this attempt to again reinterpret my statements on this
topic in a misleading way. 'gprof' tries to collect instruction
pointer traces by recording a value 100 times per second and to use
the recorded values to assign 'fractions of total execution time' and
'absolute execution times' to individual subroutines. Due to
limitations in the data collection stage, such as the '7 orders of
magnitude difference between sampling frequency and signal frequency',
this doesn't work.
While the program is actually running, its 'CPU usage' is always 100%.
That's vacuously true of everything. The instantaneous CPU usage of
any machine instruction in the system is 100%.
So what do you hope to achieve by profiling a program that runs for 5.69
seconds in three months?
Suppose you do a fantastic optimization job and you reduce this program
from 5.69 to 0.01 seconds.
What does that buy you?
Is it the case that the program actually serviced only one event
during the entire three months of being runnable, but took 5.69 seconds
to do it, where the requirement is for a much faster response?
(If that's the case, can't you just reproduce such a triggering event
to generate profiles?)
There is a more general question hiding behind this second beside-the-point
ad-hominem attempt: What do you believe is the purpose of a profiler?
Is it
a) an analysis tool which can be used to gain improved
understanding regarding what a program is actually doing at
run time, as opposed what is assumed about its behaviour?
b) an emergency repair tool one fires up (reluctantly) after
having created a real-world problem which cannot be ignored
anymore, in the hope that it will 'magically' help somehow?
In my opinion, the answer is a). I like to verify whether my assumptions
about where the program will spend most of its time are actually true,
if only to gain a better understanding of it, since this will help me
with future programs. BTW, please spare me the obvious 'why do you want
to understand something which doesn't matter to me [that is, you]'
retort.
Sorry, no ad-hominem was intended. Permit me to amend the sentence like this:
``what can be achieved by profiling a program that ...''
Thanks.
> There is a more general question hiding behind this second beside-the-point
> ad-hominem attempt: What do you believe is the purpose of a profiler?
> Is it
>
> a) an analysis tool which can be used to gain improved
> understanding regarding what a program is actually doing at
> run time, as opposed what is assumed about its behaviour?
>
> b) an emergency repair tool one fires up (reluctantly) after
> having created a real-world problem which cannot be ignored
> anymore, in the hope that it will 'magically' help somehow?
>
> In my opinion, the answer is a). I like to verify whether my assumptions
> about where the program will spend most of its time are actually true,
> if only to gain a better understanding of it, since this will help me
> with future programs. BTW, please spare me the obvious 'why do you want
> to understand something which doesn't matter to me [that is, you]'
> retort.
The truth is somewhere between these two extremes. It's not reasonable
to expect a useful profiler, out of the box, to accommodate either of
these extremes particularly well. However, they're quite useful for
the more typical in-between cases.
DS
How much do you trust this figure, by the way?
It was obtained using a technique not unlike that of profilers; a
counter is incremented at interrupt time. (See update_process_times
in the kernel).
So isn't this 5.69 also random garbage, then?
If you have a reason to believe that the tick accounting mechanism is
too inaccurate for something particular that you would like to do with
it, you will need to find an alternate way. It is sufficient to
demonstrate that 0.01s is a very long time when compared to the
execution speed of 'modern CPUs', especially since this program is
guaranteed to do 'something' once per minute (and this something is
the most expensive individual operation it does at all -- writing a
log record).
The more complete answer is that this depends on the ratio of
scheduling frequency to tick frequency, and the discrepancies are not
that awful in this case. Someone with more knowledge in signal
processing theory than I possess could probably provide a more
accurate estimate.
[...]
> The more complete answer
Below is a second 'more complete text', this time regarding my original
statement (which I couldn't possibly write while still being at
work). I hope it is detailed enough to at least remove any possible
misunderstanding.
NB: Most of the text below is based on conclusions drawn from the
documentation of the various involved pieces. I haven't inspected the
code and thus, could be completely wrong. It is just a possible
explanation for the seemingly random fluctuations in the gprof-output
I have observed in the past.
How a profiler works (or doesn't).
----------------------------------
First, a few terms are needed:
'instruction pointer'(IP)
This is a CPU-register which contains the memory address of
the instruction which is currently being executed
(simplification). It is a numerical value whose theoretical
range is (0, <highest virtual memory address>). Whenever a new
instruction is executed, the IP changes to contain the new
address.
'execution trace'
An 'execution trace' is the sequence of values the IP had
while executing a particular program for some time.
Assuming that a particular execution trace is known, it is possible to
calculate the amount of 'work' which was done in each subroutine of the
program (or at least in each subroutine actually executed during the
traced run). Each IP value of the trace is mapped to a subroutine by
comparing the numerical value with the start and end addresses of all
subroutines; this information is available as part of the debugging
information generated by the compiler while the source code files
making up the program were translated. After this has been done, a
histogram is calculated by counting how often each subroutine appears
in the mapped execution trace. This count, divided by the total number
of recorded IP values, gives a measure for the amount of work which was
done in each particular subroutine relative to the total amount of work
during the execution the trace came from.
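As a concrete illustration of the mapping and histogram step just
described, here is a small C sketch; the subroutine table and the
sampled IP values are entirely made up, standing in for the debugging
information and the recorded trace:
----------
#include <stdio.h>
#include <stdint.h>

/* Hypothetical symbol table: start/end addresses per subroutine, as a
   profiler would obtain them from the debugging information. */
struct subroutine {
    const char *name;
    uintptr_t start, end;   /* address range [start, end) */
    unsigned long hits;
};

static struct subroutine subs[] = {
    { "gc",       0x400100, 0x400400, 0 },
    { "mark_obj", 0x400400, 0x400800, 0 },
    { "main",     0x400800, 0x400900, 0 },
};
enum { NSUBS = sizeof subs / sizeof subs[0] };

int main(void)
{
    /* A made-up (approximate) execution trace: one IP value per sample. */
    uintptr_t trace[] = { 0x400104, 0x400410, 0x400555, 0x400120, 0x400810 };
    size_t nsamples = sizeof trace / sizeof trace[0];

    /* Map each sampled IP to the subroutine containing it and count hits. */
    for (size_t i = 0; i < nsamples; i++)
        for (size_t j = 0; j < NSUBS; j++)
            if (trace[i] >= subs[j].start && trace[i] < subs[j].end)
                subs[j].hits++;

    /* Hit count divided by the number of samples approximates the
       fraction of the total work done in each subroutine. */
    for (size_t j = 0; j < NSUBS; j++)
        printf("%-10s %5.1f%%\n", subs[j].name,
               100.0 * subs[j].hits / nsamples);
    return 0;
}
----------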
In practice, it is usually infeasible to record each IP value because
this would be very slow and need a lot of memory. What is usually done
instead is that the IP value is sampled in certain intervals and the
approximate execution trace gained in this way is used to do the
calculations described in the last paragraph. The better this
approximation is, the more realistic the generated output will
be. Now, what is a 'good' approximation in this respect? This can be
visualized as follows: assuming a real execution trace was available,
it could be diagrammed by using the time since 'program start' at which
each IP value was recorded as the X-coordinate and the IP value itself
as the Y-coordinate, making a 'dot' on a piece of paper for every IP
value. The same can be done for an approximate execution trace, and one
approximation is 'better' than another when the number of dots missing
from its diagram, compared to the real execution trace diagram, is
smaller than the number missing from the other approximation's diagram.
The way this sampling works in gprof (AFAIK) is that the so-called
profiling timer (which ticks whenever the program is actually executing)
is programmed to cause a signal to be sent to the process whenever the
program has been executing for another 0.01s, and the IP value at the
time the timer expired is recorded. This sequence of recorded values is
the approximate execution trace used to calculate the output I am
writing about. The clock frequency of a modern CPU is usually in the
range of more than one billion Hz, and this implies that the CPU can
execute a number of instructions per second which is at least a large
fraction of the clock frequency. Sampling the IP value 100 times per
second thus means that a number of IP values in the range of n * 10^6
(bc syntax) to n * 10^7 is missed between each pair of samples, or, in
other words, the overwhelming majority of the values making up an actual
execution trace just 'whizzes by' the profiler, making the approximation
used by it a really bad one. Further, the time needed to actually
execute each single instruction is variable and depends on unpredictable
events external to the process which, for instance, influence the
contents of the various CPU caches, the branch-prediction unit and
associated buffers, the TLB etc. This introduces a certain element of
random fluctuation into the approximate execution trace: continuous runs
of the program will produce different execution traces, and the output
produced from each single one will vary randomly, especially for
programs which spend the majority of their time in a lot of different
'small' subroutines. Over time, assuming that many approximate execution
traces are collected and merged, the combined trace can be expected to
approximate the real execution trace better and better, again subject to
a certain amount of randomness, because each individual approximation
can be better or worse than the previous one. But given the monstrous
difference between the 'number of dots' in each approximate trace
diagram and the 'number of dots' in a 'real' diagram, this will need a
lot of program runs, presumably on the order of the ratio between the
sampling interval and the average instruction time, which isn't
practical.
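To make the sampling mechanism described above concrete, here is a
minimal sketch of signal-driven IP sampling (this is not gprof's actual
implementation; it assumes Linux/x86-64 and GCC/glibc, where the
interrupted IP is available as REG_RIP in the ucontext passed to a
SA_SIGINFO handler):
----------
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <ucontext.h>

#define MAX_SAMPLES 4096
static volatile unsigned long long samples[MAX_SAMPLES];
static volatile int nsamples;

/* SIGPROF is delivered only while the process is consuming CPU time;
   the handler records the interrupted instruction pointer. */
static void on_sigprof(int sig, siginfo_t *si, void *ucv)
{
    ucontext_t *uc = ucv;
    (void)sig; (void)si;
    if (nsamples < MAX_SAMPLES)
        samples[nsamples++] = uc->uc_mcontext.gregs[REG_RIP];
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sigprof;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGPROF, &sa, NULL);

    /* 100 Hz profiling timer: fire every 10000 microseconds of CPU time. */
    struct itimerval it = { { 0, 10000 }, { 0, 10000 } };
    setitimer(ITIMER_PROF, &it, NULL);

    /* Burn some CPU so that a few samples accumulate. */
    volatile double x = 0;
    for (long i = 0; i < 300000000L; i++)
        x += i * 0.5;

    printf("collected %d IP samples\n", nsamples);
    if (nsamples > 0)
        printf("first sampled IP: %#llx\n", samples[0]);
    return 0;
}
----------
Everything the program does between two SIGPROF deliveries is invisible
to the recorded samples, which is precisely the approximation problem
described above.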
But you know what it's doing - it's doing nothing 99.999% of the time.
Phil
--
Any true emperor never needs to wear clothes. -- Devany on r.a.s.f1
> I've never been able to get useful results out of oprofile. Maybe I'm
> doing something wrong, but I put quite a bit effort into it and got
> nothing back. It seems to be good for profiling your whole system
> (what's the system doing?) but useless for analyzing a single
> program.
You may want to take a look at sysprof
http://www.daimi.au.dk/~sandmann/sysprof
Soren
I care not what you believe; your ability to analyse situations
has already been proved to be fundamentally flawed.
[ text irrelevant to this flame-thread ]
> I care not what you believe; your ability to analyse situations
> has already been proved to be fundamentally flawed.
If you didn't care, you wouldn't have posted another nonsense
text. WARNING: The following text contains some information which must
be absolutely NEW to you: There is no such thing as 'a fundamentally
flawed human being' (I might make an exception for you). Human beings
are not machines (or programs), but individuals who are capable of
learning from their mistakes (I again might make an exception for you
here, since your language strongly suggests that you would never admit
to having made a mistake, and this means that YOU cannot learn).
Now, you may go back to beating your wife or whatever the pastime of
a troll like you happens to be.
[...]
>> a) an analysis tool which can be used to gain improved
>> understanding regarding what a program is actually doing at
>> run time, as opposed what is assumed about its behaviour?
>>
>> b) an emergency repair tool one fires up (reluctantly) after
>> having created a real-world problem which cannot be ignored
>> anymore, in the hope that it will 'magically' help somehow?
>>
>> In my opinion, the answer is a).
>
> But you know what it's doing - it's doing nothing 99.999% of the time.
Since this is apparently not as self-evident as I had hoped: This
program is a network server and the reason it is 'doing nothing' most
of the time is - of course - that it spends most of its time waiting
for input, like any (non-overloaded) network server. But since the
program is doing nothing while there is nothing to do for it, the
behaviour in this case is ... well ... not particularly
interesting. What is interesting, however, is what it is doing when it
is processing input, especially since, again, like any network server,
the total amount of work it has to do during a given time period is
at least proportional to the number of requests it has to process
during this time, and this number is in principle unbounded (it is
effectively bounded by the speed at which requests can be processed or
the capacity of the uplink of the machine it is running on,
whichever is smaller).
And this implies that the remark by 'Mr Oxford-Mathematician who is
convinced he has seen it all' (despite being even younger than me) is
a profoundly stupid one.
I'm confident I could meaningfully profile this software using a tool
like oprofile, driven by a measly 100 Hz interrupt timer.
BTW^2: While I don't usually write very favorably about BSD, something
I'd like to add is that the FreeBSD-kernel (and presumably, the
others, too) supports (or used to support, the information I am
referring to is about 5.2 [TDAIOTFOS]) a so-called 'profclock' which
ticks 1024 times per second and that, besides the higher frequency,
this clock is supposed to be 'relatively prime' to the system clock
because of known problems with the traditional approach (admittedly,
other 'known problems'). I'll see if I can locate the corresponding
paper and post a link to it later during the day.
Since I am all too well aware of the problems of this approach, I