I've recently been using a very nice set of tools to generate
execution-time reports (i.e. cycle estimates) and call graphs for code
compiled with g95. The tools are Linux-specific and easy to use. Since
questions about profiling have come up a few times on the group, here
are some details on how to use them.
You'll need:
0) g95
- compile your code with the option -g to obtain useful line info, but
keep the usual optimisation flags (e.g. use '-g -O2 -march=pentium4').
I'll call the executable a.out in the following.
1) valgrind
- can be installed easily directly from source. I recommend using the
latest version (http://valgrind.org/), which includes the tool
'callgrind'
- in addition to what I describe here, it is a very powerful tool for
finding memory leaks and references to undefined variables (execute
'valgrind --tool=memcheck ./a.out').
2) kcachegrind (optional, but *very* much recommended)
- sources and docs are available
(http://kcachegrind.sourceforge.net/cgi-bin/show.cgi)
- AFAICT it requires KDE3 and is quite difficult to install from source
- the best option seems to be installing a prepackaged binary from your
distribution. In my case (SUSE) it came with kdesdk.
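Before going further, it may help to check which of the tools listed
above are actually on the PATH. The following is only a minimal sketch:
have_tool is a hypothetical helper name, and the list of program names
simply repeats the tools mentioned above.

```shell
#!/bin/sh
# Report which of the tools named above can be found on the PATH.
# have_tool is a hypothetical helper, not part of any of these packages.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

missing=0
for tool in g95 valgrind callgrind_annotate kcachegrind; do
    if have_tool "$tool"; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
        missing=$((missing + 1))
    fi
done
echo "$missing tool(s) missing"
```

Since kcachegrind is optional, a missing kcachegrind only means you are
limited to the text reports from callgrind_annotate.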
To proceed:
1) execute your code through valgrind to generate the data to be
analysed. Since valgrind is a kind of 'CPU simulator', it will run very
slowly (up to 100 times slower) and require quite a bit of memory.
Recent valgrind releases document the cache option as --cache-sim=yes.
valgrind --simulate-cache=yes --tool=callgrind ./a.out
2) the generated result file (callgrind.out.XYZ) can be annotated to
yield human-readable text, so kcachegrind is not strictly needed. The
output includes a list of routines sorted by instruction count (which
isn't quite run time, because of cache effects, but is already quite
useful) and auto-annotated source showing the bottlenecks in the
program.
callgrind_annotate --auto=yes --include=/path/to/the/sources callgrind.out.XYZ
3) All the data can be visualised much better in kcachegrind, with many
options to play with. A most useful configuration option is to specify
where the sources for the program are (unless they are in the working
directory). It can be interesting to select 'Cycle estimation' as the
primary event (roughly the run time), or for example the cache misses
(L1 or L2). One can also sort subroutines by their run time (including
or excluding the subroutines they call) or by the number of times they
are called. The source window shows the execution cost of every source
line, so it can be used quite nicely to locate bottlenecks in large
subroutines. The graphics are very nice (and provide a good reason to
finally order a 21-inch monitor), with very intuitive navigation
through the graph.
kcachegrind callgrind.out.XYZ
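The first two steps above can be strung together in a small script.
This is only a sketch under some assumptions: the function names
(newest_profile, profile_program), the SRC_DIR variable and the name of
the report file are all made up for illustration, and the script
assumes the newest callgrind.out.* file in the working directory is the
one just written. The valgrind and callgrind_annotate options are the
ones quoted in the steps.

```shell
#!/bin/sh
# Sketch of steps 1 and 2 as one script. Function names, SRC_DIR and the
# report file name are hypothetical; the tool options come from the text.

# Print the most recently modified callgrind.out.* file in directory $1
# (prints nothing if there is none).
newest_profile() {
    ls -t "$1"/callgrind.out.* 2>/dev/null | head -n 1
}

# Run program $1 under callgrind, then write an annotated text report.
profile_program() {
    prog=$1
    src_dir=${SRC_DIR:-.}
    valgrind --simulate-cache=yes --tool=callgrind "$prog" || return 1
    out=$(newest_profile .)
    if [ -z "$out" ]; then
        echo "no callgrind.out.* file found" >&2
        return 1
    fi
    # The report name must not match callgrind.out.*, otherwise it would
    # be picked up as profile data on the next run.
    report="annotated-$(basename "$out").txt"
    callgrind_annotate --auto=yes --include="$src_dir" "$out" > "$report"
    echo "profile data: $out, text report: $report"
}

# Only attempt a run when valgrind and the executable are both present.
if command -v valgrind >/dev/null 2>&1 && [ -x ./a.out ]; then
    profile_program ./a.out || echo "profiling failed" >&2
fi
```

Set SRC_DIR to the directory holding the sources before running it, and
open the resulting callgrind.out.XYZ file in kcachegrind as in step 3.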
And of course, I refer to the documentation that comes with these tools
for the fine details ...
Cheers,
Joost