Profiling Kona with gprof

90 views
Skip to first unread message

Tom Szczesny

unread,
Mar 9, 2014, 1:32:22 PM3/9/14
to kona...@googlegroups.com
Profiling the test suite on 32-bit-Linux and Cygwin gives inconsistent results.  Only 2 functions are in the top 10 time-users for both environments (wd_ and capture).  The top 10 time-users in each are:

32-bit-Linux (test suite)
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 92.68      0.76     0.76   237195     0.00     0.00  doMergeGrade
  1.22      0.77     0.01   566458     0.00     0.00  distributionGrade
  1.22      0.78     0.01   286987     0.00     0.00  capture
  1.22      0.79     0.01    55818     0.00     0.00  Kv
  1.22      0.80     0.01    10353     0.00     0.00  dv_ex
  1.22      0.81     0.01     9133     0.00     0.00  wd_
  1.22      0.82     0.01     5990     0.00     0.00  cirRef
  0.00      0.82     0.00   564432     0.00     0.00  plus_scan
  0.00      0.82     0.00   353953     0.00     0.00  merger
  0.00      0.82     0.00   241650     0.00     0.00  minus_eachpair

Cygwin (test suite)
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  us/call  us/call  name    
 87.10      0.54     0.54                             bv_ex
  4.84      0.57     0.03     6868     4.37     4.37  capture
  1.61      0.58     0.01   326719     0.03     0.03  sd_
  1.61      0.59     0.01    26242     0.38     0.38  wd_
  1.61      0.60     0.01    24529     0.41     0.41  ci
  1.61      0.61     0.01    17848     0.56     0.59  FC
  0.00      0.61     0.00  1202504     0.00     0.00  vf_ex
  0.00      0.61     0.00   572706     0.00     0.00  _lsq
  0.00      0.61     0.00   221944     0.00     0.00  KC
  0.00      0.61     0.00   190539     0.00     0.00  _sm

But, the complete tests suite of 870 tests takes less than a second to complete, and all the tests are very short.

On the other hand, a single execution of ray.k takes over a minute to run.  Profiling an execution of ray.k gives quite consistent results in 32-bit-Linux and Cygwin.  The top 9 time-users are exactly the same (in somewhat different in order).  The top 10 time-users in each are:

32-bit-Linux (ray.k)
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 32.69     25.69    25.69  4799040     0.00     0.00  wd_
 17.03     39.07    13.38 116088485     0.00     0.00  cd
  7.24     44.76     5.69 37800866     0.00     0.00  capture
  5.01     48.70     3.94  4799050     0.00     0.00  complete
  4.30     52.08     3.38 1009351442     0.00     0.00  max
  3.75     55.03     2.95 55828898     0.00     0.00  ci
  3.41     57.71     2.68 82795586     0.00     0.00  newK
  3.35     60.34     2.63 137026270     0.00     0.00  charpos
  2.33     62.17     1.83  1117081     0.00     0.00  kclone
  2.06     63.79     1.62 10973880     0.00     0.00  ex2

Cygwin (ray.k)
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  us/call  us/call  name    
 28.58     12.79    12.79  4799040     2.67     4.79  wd_
 14.32     19.20     6.41 76687503     0.08     0.08  cd
  6.61     22.16     2.96 148453584     0.02     0.02  newK
  6.32     24.99     2.83  4799050     0.59     1.10  complete
  6.23     27.78     2.79 1009351442     0.00     0.00  max
  6.06     30.49     2.71 37800866     0.07     0.20  capture
  3.49     32.05     1.56 137026270     0.01     0.01  charpos
  3.15     33.46     1.41  1117081     1.26     2.13  kclone
  3.06     34.83     1.37 55828898     0.02     0.02  ci
  1.41     35.46     0.63 127018929     0.00     0.00  bottom

Kevin Lawler

unread,
Mar 9, 2014, 6:24:05 PM3/9/14
to kona...@googlegroups.com
That is interesting. I can't satisfactorily explain why the test suite differs on 32-bit Linux versus 32-bit Cygwin. However, my starting frame would be that the test suite is short and broad. The test suite's intention is to cover all the code paths in the language. So differences in system implementation are more likely to manifest there, rather than in a deep, repetitive execution path like the ray tracing script. If the system calls underpinning the ray tracer are implemented similarly in both platforms, then the ray tracer benchmarks will look similar. If the system implementation aren't similar, the outputs will differ.

In this case Cygwin and Linux seem to execute the ray tracer much the same. The test suite on the other hand executes quite differently. So likely there exists a single test which produces substantially different results. You could find such a test (perhaps via random selection, perhaps via repeated bisection) and drill in from there, if the goal is to identify differences in Cygwin. Another less-likely possibility is that the profiler operates differently.

Kevin



--
You received this message because you are subscribed to the Google Groups "Kona Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kona-dev+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Tom Szczesny

unread,
Mar 12, 2014, 12:45:29 AM3/12/14
to kona...@googlegroups.com
As a follow-up to the profiling exercise with ray-tracer where both cd and ci were in the top 10 time-users, 
I once again removed recursive ci/cd (see issue #205) from a test copy of the latest kevinlawler/kona codebase.
Surprisingly, the complete test-suite ran fine (no errors, no memory leaks).
The ray-tracer script ran fine also.  (The runs were done on Cygwin.)

However, the test suite ran slightly faster WITH recursive ci/cd.  (0.561 seconds with, 0.577 seconds without).
The ray-tracer script ran slightly faster WITHOUT recursive ci/cd. (56.737 seconds with, 54.226 seconds without).
When I get a chance, I'll try a similar exercise on 32-bit-Linux.

Kevin Lawler

unread,
Mar 12, 2014, 2:13:11 AM3/12/14
to kona...@googlegroups.com
cd works like calloc in that it zeroes memory. it's not returning new pages however (they're pool entries), so it actually has to zero out memory. this usually causes a slowdown. cd should instead work like malloc. i would change this now except the codebase assumes it in too many places. calloc cd should disappear later on.

the timing differences between recursive/non-recursive cd are possibly due to random lucky ordering of references, and the recursive version warms the data structures in the right order.

recursive cd is also something that should disappear later on.



Reply all
Reply to author
Forward
0 new messages