How to interpret the tree profiler results

Scott

unread,

Apr 6, 2010, 12:53:57 PM4/6/10

to oktech-profiler

Hey, I think you're doing a great thing in this profiler with
supporting statistical analysis; The other java profilers I've looked
at using are invasive, and perturb the results so much that they are
useless. In the case of the JVMTI profiler in eclipse, a cryptographic
operation that was responsible for 80% of the runtime didn't show up
in the top 30 results.

However, I'm having difficulty in understanding the results, or maybe
I'm just not finding it in the manual. How do I interpret the fields?
What are cnt/cntT/cntP/sys/sysr/sysP? Also, have you considered
generating output that can be visualized by kcachegrind?

|-java.lang.ref.ReferenceQueue.remove() cnt=41522/
cntT=9.223372036854776E15%/cntP=9.223372036854776E15%
sys=4167152.247ms/sysr=0.0ms/sysP=100.0%
| |-java.lang.ref.ReferenceQueue.remove() cnt=20761/
cntT=9.223372036854776E15%/cntP=50.0% sys=2083576.124ms/sysr=0.0ms/
sysP=50.0%
| |-java.lang.Object.wait() cnt=20761/
cntT=9.223372036854776E15%/cntP=50.0% sys=2083576.124ms/
sysr=2083576.124ms/sysP=50.0%
| <|-java.lang.ref.Finalizer$FinalizerThread.run() cnt=20761/
cntT=9.223372036854776E15%/cntP=50.0% sys=2083576.124ms/sysr=0.0ms/
sysP=50.0%
| <|-java.lang.ref.ReferenceQueue.remove() cnt=20761/
cntT=9.223372036854776E15%/cntP=50.0% sys=2083576.124ms/sysr=0.0ms/
sysP=50.0%

Istvan Soos

unread,

Apr 7, 2010, 2:58:39 AM4/7/10

to oktech-...@googlegroups.com, scro...@gmail.com

Hi Scott,

I'm glad that you have met your expectations with this profiler.
Although the next version seems to take a bit longer time than
expected, it is on the way out...

> However, I'm having difficulty in understanding the results, or maybe
> I'm just not finding it in the manual. How do I interpret the fields?
> What are cnt/cntT/cntP/sys/sysr/sysP? Also, have you considered
> generating output that can be visualized by kcachegrind?

cnt = sample count
cntT = percentage of the sample count compared to the total. the
actual live version contains a bug, it has been fixed in the SVN
version
cntP = percentage of the sample count compared to the parent node's
count. the same bug here

sys = Estimated system time. This is an experimental calculated value
that tried to give an time guess in the sampling profiler field,
because it looked good at first in our measurements. I wouldn't
suggest to take it serious now...
sysr = the 'real' time spent in the actual method, this is an other
estimate, same notes go here...
sysP = the percentage compared to the parent node's value. Given that
we are talking about the division of two estimated value, this is not
something you should highly rely on...

I'd use only the cnt value to look for performance bottlenecks. Please
let us know if you have further questions or need guidance.
Regards,
Istvan

Scott

unread,

Apr 9, 2010, 12:11:01 PM4/9/10

to oktech-profiler

On Apr 7, 1:58 am, Istvan Soos <istvan.s...@gmail.com> wrote:
> Hi Scott,
>
> I'm glad that you have met your expectations with this profiler.
> Although the next version seems to take a bit longer time than
> expected, it is on the way out...

I'm definitely interested to see what improvements you can add. I have
a java profiler that can pinpoint my performance bottlenecks, and the
results seem to be fairly non-perturbing. If I divide the number of
samples in non-blocked situations by the sampling rate, I should be
measuring the used CPU time. For CPU bound applications, the number is
within 25\% of correct.

I have a few improvements of my own to share.

I've made a generator to kcachegrind, which is a *really* nice open
source profile visualization tool. Check out the screen shots and the
website. My generator uses the undocumented 'output.method.filter' to
indicate functions to be completely removed from the profile, meaning
that the counts through these functions are subtracted from their
parents. Using this to filter thread sleep and thread blocking
functions, I can pinpoint where CPU is being used. 'Method.filter'
remains and functions by removing functions from appearing in the
profile, but accounts for their time in their parents.

Further integration is possible. For instance, I've not included
source code lines, which can be used to disambiguate overloaded
function names. Overloaded functions with the same name are merged
together; these might be distinguishable by using source code lines?

An example command line, generating a report on a profile of the
report generator is:

java -cp ~/source/Misc/oktech-profiler/hu.oktech.profiler/bin
hu.oktech.profiler.report.kcachegrind.KCachegrindReport input=tmp/
profiler/2010-04-09-08-10-24.dump.gz report.sampling.tree=true
'output.method.filter=-java.lang.Object.wait!,-
java.lang.Thread.sleep!,-sun.misc.Unsafe.park!,-
sun.nio.ch.FileDispatcher.read0!,-
sun.nio.ch.ServerSocketChannelImpl.accept!,-
java.net.SocketInputStream.socketRead0!,+.*' > /tmp/grind1.out &

Running this on my current codebase indicates that the report
generator spends about 80% of its time parsing strings and integers,
15% in decompressing, and 4% in generating the kcachegrind report.
(Contrast the output with using method.filter or no filter.)

I have a few other changes. As it only takes 10-20 minutes to collect
1gb of stacktraces in the dump file, I added a hack that gzips the
dump files on the fly giving 30:1 compression. The correct fix would
be a denser and redundancy-reducing format for dumping stack traces,
which would also make the report generator faster. I also added some
code to make the reader more robust to corrupt dump files, so I can
visualize the profile results before the program finishes running.

I will be submitting issues in the tracker with the individual
patches.

Scott

Istvan Soos

unread,

Apr 19, 2010, 8:18:12 AM4/19/10

to oktech-...@googlegroups.com

> samples in non-blocked situations by the sampling rate, I should be
> measuring the used CPU time. For CPU bound applications, the number is
> within 25\% of correct.

Estimating correct CPU time from sampling is a grey area. Although
there are applications and use cases where it can be correct or
near-to-correct, there are situations where in itself it is
non-reliable to the extent it is not meaningful at all. First of all
you shall measure the thread CPU time to separate often-sleeping
threads from full-CPU threads. Because this operation does cost a lot
time, the new version allows to measure the time in a cyclic manner
(e.g. one time in every 10 sampling or like that). From that point you
still have the continuity problem: two consecutive sample might
contain very large common stack trace, even it can be the same but it
doesn't necessarily means it takes its time only there. This is a
research area we will definitely put some efforts to create good
estimates.

> I've made a generator to kcachegrind, which is a *really* nice open
> source profile visualization tool. Check out the screen shots and the
> website. My generator uses the undocumented 'output.method.filter' to
> indicate functions to be completely removed from the profile, meaning
> that the counts through these functions are subtracted from their
> parents. Using this to filter thread sleep and thread blocking
> functions, I can pinpoint where CPU is being used. 'Method.filter'
> remains and functions by removing functions from appearing in the
> profile, but accounts for their time in their parents.

Sounds good. This (and such) call graph analysis will be possibly more
frequent in the next version, because we are planning to make use of
partial stack traces, leaving the full tree and focusing more on the
bottom of the stack trace.

> Further integration is possible. For instance, I've not included
> source code lines, which can be used to disambiguate overloaded
> function names. Overloaded functions with the same name are merged
> together; these might be distinguishable by using source code lines?

Unfortunately source code lines are not always present, however when
they are, they include the file name also. Check the StackTraceElement
class of the JVM. In the end it allows the separation of overloaded
functions.

> Running this on my current codebase indicates that the report
> generator spends about 80% of its time parsing strings and integers,
> 15% in decompressing, and 4% in generating the kcachegrind report.
> (Contrast the output with using method.filter or no filter.)

Yeah, this is valid, although known issue we shall try to solve...

> I have a few other changes. As it only takes 10-20 minutes to collect
> 1gb of stacktraces in the dump file, I added a hack that gzips the
> dump files on the fly giving 30:1 compression.

This is a good point, again.

> I will be submitting issues in the tracker with the individual
> patches.

Thank you for sending the patches! Although it might take a bit more
time (as you might have checked my other mail about the new repository
and the new code base), I'll integrate your patches in the code...

Thanks and regards,
Istvan

--
Subscription settings: http://groups.google.com/group/oktech-profiler/subscribe?hl=en

Scott

unread,

Apr 19, 2010, 11:11:41 AM4/19/10

to oktech-profiler

On Apr 19, 7:18 am, Istvan Soos <istvan.s...@gmail.com> wrote:
> > samples in non-blocked situations by the sampling rate, I should be
> > measuring the used CPU time. For CPU bound applications, the number is
> > within 25\% of correct.
>
> Estimating correct CPU time from sampling is a grey area. Although
> there are applications and use cases where it can be correct or
> near-to-correct, there are situations where in itself it is
> non-reliable to the extent it is not meaningful at all. First of all
> you shall measure the thread CPU time to separate often-sleeping
> threads from full-CPU threads.

What are the situations that make it unreliable and what are the
causes? I do observe a lot of samples occuring in the JVM::sampling
bucket that appear to be not accounted for anywhere else. (I'm very
new to java profiling.) I see more distortions in applications with
often-sleeping threads, and better numbers looking at applications
with full-CPU threads.

> Because this operation does cost a lot
> time, the new version allows to measure the time in a cyclic manner
> (e.g. one time in every 10 sampling or like that). From that point you
> still have the continuity problem: two consecutive sample might
> contain very large common stack trace, even it can be the same but it
> doesn't necessarily means it takes its time only there.

Sure, this situation can occur by chance, but with enough samples,
won't the estimate be very close to correct?
For instance, if a program is spending 90% of its time in function A,
and 1000 random samples are taken, can't I apply the binomial
distribution and calculate that there is a 70% probability that we
will observe A in 890 to 910 samples and a 99.997% chance that A will
occur in 860 to 940 samples, which implies that with 99.997%
probability, the statistical sample estimate will be within 5% of
being correct.

Scott

Reply all

Reply to author

Forward