difference between runtime RDTSC and CPU_CLOCKS_UNHALTED. Also why are times always different when you rerun the same code again?


D Br

Aug 24, 2016, 2:46:21 PM
to likwid-users
I am trying to measure the time taken to run a loop (omp parallel for on 4 cores). The loop is inside a function. I measure time by putting clock(), gettimeofday(), getrusage(RUSAGE_SELF) user time, and getrusage(RUSAGE_SELF) user+system time around the function. I also put likwid markers inside the function around the code. I ran it on my laptop with 4 cores.
I ran it twice and have pasted both results below.
The results differ quite a bit between the two runs.
Are the results different because some other daemons (or unknown programs) are running in the background? I am running on my laptop with no other known program running.

Is there a way I can consistently get the same time?
To find out, I ran it under likwid, but here too the times for the two runs are quite different.

Also, likwid reports RDTSC (runtime) and CPU_CLOCKS_UNHALTED (runtime). Which one is meaningful?

The code consists of a for loop (the loop has 3 multiplies and 3 adds per iteration; the iteration count is 2048).

Theoretically this should be memory bound. 

So if we fire four threads and the memory cannot supply data fast enough, will the CPU wait (go into a halted state)?
In that case the unhalted clock cycles will not capture the waiting time. So is the difference between RDTSC and CPU_CLOCKS_UNHALTED the result of memory waits?

So is RDTSC the correct time to report (since it also accounts for waits due to the memory-bound nature of the code)?
Also, the computation is the same for the two runs, so why is CPU_CLOCKS_UNHALTED different between them?

Thanks and Regards,
D Brahme

The results of the two runs (from my timing functions and the likwid output) are given below. I used
likwid-perfctr -m -C 0-3 -g FLOPS_SP ./myprogram -n 2048 -tx 4

size=2048, ignore=0.000000
CLOCKS_PER_SEC =1000000 sec
velocity0 clock time=0.030000 sec
velocity0 gettimeofday=1026.154000 ms
velocity0 rusagetime=8.000000 ms
velocity0 Total rusagetime=22.000000 ms
--------------------------------------------------------------------------------
CPU name: Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz
CPU type: Intel Core IvyBridge processor
CPU clock: 2.59 GHz
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
Region velocity0, Group 1: FLOPS_SP
+-------------------+----------+----------+----------+----------+
|    Region Info    |  Core 0  |  Core 1  |  Core 2  |  Core 3  |
+-------------------+----------+----------+----------+----------+
| RDTSC Runtime [s] | 0.002435 | 0.000010 | 0.000170 | 0.001092 |
|     call count    |     1    |     1    |     1    |     1    |
+-------------------+----------+----------+----------+----------+

+--------------------------------------+---------+---------+--------+--------+---------+
|                 Event                | Counter |  Core 0 | Core 1 | Core 2 |  Core 3 |
+--------------------------------------+---------+---------+--------+--------+---------+
|           INSTR_RETIRED_ANY          |  FIXC0  | 1850665 |  47866 | 215379 | 1034260 |
|         CPU_CLK_UNHALTED_CORE        |  FIXC1  | 5574614 |  55544 | 552581 | 3379475 |
|         CPU_CLK_UNHALTED_REF         |  FIXC2  | 5809856 |  46566 | 463424 | 2834390 |
| FP_COMP_OPS_EXE_SSE_FP_PACKED_SINGLE |   PMC0  |    0    |    0   |    0   |    0    |
| FP_COMP_OPS_EXE_SSE_FP_SCALAR_SINGLE |   PMC1  |   3166  |  3147  |  3159  |   3192  |
|       SIMD_FP_256_PACKED_SINGLE      |   PMC2  |    0    |    0   |    0   |    0    |
+--------------------------------------+---------+---------+--------+--------+---------+

+-------------------------------------------+---------+---------+-------+---------+--------------+
|                   Event                   | Counter |   Sum   |  Min  |   Max   |      Avg     |
+-------------------------------------------+---------+---------+-------+---------+--------------+
|           INSTR_RETIRED_ANY STAT          |  FIXC0  | 3148170 | 47866 | 1850665 |  787042.5000 |
|         CPU_CLK_UNHALTED_CORE STAT        |  FIXC1  | 9562214 | 55544 | 5574614 | 2.390554e+06 |
|         CPU_CLK_UNHALTED_REF STAT         |  FIXC2  | 9154236 | 46566 | 5809856 |    2288559   |
| FP_COMP_OPS_EXE_SSE_FP_PACKED_SINGLE STAT |   PMC0  |    0    |   0   |    0    |       0      |
| FP_COMP_OPS_EXE_SSE_FP_SCALAR_SINGLE STAT |   PMC1  |  12664  |  3147 |   3192  |     3166     |
|       SIMD_FP_256_PACKED_SINGLE STAT      |   PMC2  |    0    |   0   |    0    |       0      |
+-------------------------------------------+---------+---------+-------+---------+--------------+

+----------------------+-----------+--------------+-----------+-----------+
|        Metric        |   Core 0  |    Core 1    |   Core 2  |   Core 3  |
+----------------------+-----------+--------------+-----------+-----------+
|  Runtime (RDTSC) [s] |   0.0024  | 9.925235e-06 |   0.0002  |   0.0011  |
| Runtime unhalted [s] |   0.0022  | 2.143253e-05 |   0.0002  |   0.0013  |
|      Clock [MHz]     | 2486.6415 |   3091.2346  | 3090.1614 | 3089.9637 |
|          CPI         |   3.0122  |    1.1604    |   2.5656  |   3.2675  |
|      SP MFLOP/s      |   1.3004  |   317.0706   |  18.5671  |   2.9242  |
|    AVX SP MFLOP/s    |     0     |       0      |     0     |     0     |
|    Packed MUOPS/s    |     0     |       0      |     0     |     0     |
|    Scalar MUOPS/s    |   1.3004  |   317.0706   |  18.5671  |   2.9242  |
+----------------------+-----------+--------------+-----------+-----------+

+---------------------------+------------+--------------+-----------+-----------+
|           Metric          |     Sum    |      Min     |    Max    |    Avg    |
+---------------------------+------------+--------------+-----------+-----------+
|  Runtime (RDTSC) [s] STAT |   0.0037   | 9.925235e-06 |   0.0024  |   0.0009  |
| Runtime unhalted [s] STAT |   0.0037   | 2.143253e-05 |   0.0022  |   0.0009  |
|      Clock [MHz] STAT     | 11758.0012 |   2486.6415  | 3091.2346 | 2939.5003 |
|          CPI STAT         |   10.0057  |    1.1604    |   3.2675  |   2.5014  |
|      SP MFLOP/s STAT      |  339.8623  |    1.3004    |  317.0706 |  84.9656  |
|    AVX SP MFLOP/s STAT    |      0     |       0      |     0     |     0     |
|    Packed MUOPS/s STAT    |      0     |       0      |     0     |     0     |
|    Scalar MUOPS/s STAT    |  339.8623  |    1.3004    |  317.0706 |  84.9656  |
+---------------------------+------------+--------------+-----------+-----------+

Second run (with different timings). Both runs are on my laptop with no other known jobs.


size=2048, ignore=0.000000
CLOCKS_PER_SEC =1000000 sec
velocity0 clock time=0.000000 sec
velocity0 gettimeofday=1014.440000 ms
velocity0 rusagetime=5.000000 ms
velocity0 Total rusagetime=11.000000 ms
--------------------------------------------------------------------------------
CPU name: Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz
CPU type: Intel Core IvyBridge processor
CPU clock: 2.59 GHz
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
Region velocity0, Group 1: FLOPS_SP
+-------------------+----------+----------+----------+----------+
|    Region Info    |  Core 0  |  Core 1  |  Core 2  |  Core 3  |
+-------------------+----------+----------+----------+----------+
| RDTSC Runtime [s] | 0.001172 | 0.000014 | 0.000039 | 0.000014 |
|     call count    |     1    |     1    |     1    |     1    |
+-------------------+----------+----------+----------+----------+

+--------------------------------------+---------+---------+--------+--------+--------+
|                 Event                | Counter |  Core 0 | Core 1 | Core 2 | Core 3 |
+--------------------------------------+---------+---------+--------+--------+--------+
|           INSTR_RETIRED_ANY          |  FIXC0  | 1155591 |  47869 |  68655 |  49074 |
|         CPU_CLK_UNHALTED_CORE        |  FIXC1  | 3596837 |  65758 | 140504 |  71419 |
|         CPU_CLK_UNHALTED_REF         |  FIXC2  | 3016728 |  55120 | 117832 |  59878 |
| FP_COMP_OPS_EXE_SSE_FP_PACKED_SINGLE |   PMC0  |    0    |    0   |    0   |    0   |
| FP_COMP_OPS_EXE_SSE_FP_SCALAR_SINGLE |   PMC1  |   3158  |  3276  |  3124  |  3284  |
|       SIMD_FP_256_PACKED_SINGLE      |   PMC2  |    0    |    0   |    0   |    0   |
+--------------------------------------+---------+---------+--------+--------+--------+

+-------------------------------------------+---------+---------+-------+---------+-------------+
|                   Event                   | Counter |   Sum   |  Min  |   Max   |     Avg     |
+-------------------------------------------+---------+---------+-------+---------+-------------+
|           INSTR_RETIRED_ANY STAT          |  FIXC0  | 1321189 | 47869 | 1155591 | 330297.2500 |
|         CPU_CLK_UNHALTED_CORE STAT        |  FIXC1  | 3874518 | 65758 | 3596837 | 968629.5000 |
|         CPU_CLK_UNHALTED_REF STAT         |  FIXC2  | 3249558 | 55120 | 3016728 | 812389.5000 |
| FP_COMP_OPS_EXE_SSE_FP_PACKED_SINGLE STAT |   PMC0  |    0    |   0   |    0    |      0      |
| FP_COMP_OPS_EXE_SSE_FP_SCALAR_SINGLE STAT |   PMC1  |  12842  |  3124 |   3284  |  3210.5000  |
|       SIMD_FP_256_PACKED_SINGLE STAT      |   PMC2  |    0    |   0   |    0    |      0      |
+-------------------------------------------+---------+---------+-------+---------+-------------+

+----------------------+-----------+--------------+--------------+--------------+
|        Metric        |   Core 0  |    Core 1    |    Core 2    |    Core 3    |
+----------------------+-----------+--------------+--------------+--------------+
|  Runtime (RDTSC) [s] |   0.0012  | 1.402813e-05 | 3.945990e-05 | 1.401655e-05 |
| Runtime unhalted [s] |   0.0014  | 2.537375e-05 |    0.0001    | 2.755814e-05 |
|      Clock [MHz]     | 3089.9288 |   3091.7421  |   3090.2193  |   3091.0807  |
|          CPI         |   3.1126  |    1.3737    |    2.0465    |    1.4553    |
|      SP MFLOP/s      |   2.6935  |   233.5308   |    79.1690   |   234.2945   |
|    AVX SP MFLOP/s    |     0     |       0      |       0      |       0      |
|    Packed MUOPS/s    |     0     |       0      |       0      |       0      |
|    Scalar MUOPS/s    |   2.6935  |   233.5308   |    79.1690   |   234.2945   |
+----------------------+-----------+--------------+--------------+--------------+

+---------------------------+------------+--------------+-----------+-----------+
|           Metric          |     Sum    |      Min     |    Max    |    Avg    |
+---------------------------+------------+--------------+-----------+-----------+
|  Runtime (RDTSC) [s] STAT |   0.0013   | 1.401655e-05 |   0.0012  |   0.0003  |
| Runtime unhalted [s] STAT |   0.0016   | 2.537375e-05 |   0.0014  |   0.0004  |
|      Clock [MHz] STAT     | 12362.9709 |   3089.9288  | 3091.7421 | 3090.7427 |
|          CPI STAT         |   7.9881   |    1.3737    |   3.1126  |   1.9970  |
|      SP MFLOP/s STAT      |  549.6878  |    2.6935    |  234.2945 |  137.4220 |
|    AVX SP MFLOP/s STAT    |      0     |       0      |     0     |     0     |
|    Packed MUOPS/s STAT    |      0     |       0      |     0     |     0     |
|    Scalar MUOPS/s STAT    |  549.6878  |    2.6935    |  234.2945 |  137.4220 |
+---------------------------+------------+--------------+-----------+-----------+

Thomas Röhl

Aug 26, 2016, 5:58:06 AM
to likwid-users
Hi,

I'll try to answer your questions inline.


On Wednesday, August 24, 2016 at 8:46:21 PM UTC+2, D Br wrote:
I am trying to measure the time taken to run a loop (omp parallel for on 4 cores). The loop is inside a function. I measure time by putting clock(), gettimeofday(), getrusage(RUSAGE_SELF) user time, and getrusage(RUSAGE_SELF) user+system time around the function. I also put likwid markers inside the function around the code. I ran it on my laptop with 4 cores.
I ran it twice and have pasted both results below.
The results differ quite a bit between the two runs.
Are the results different because some other daemons (or unknown programs) are running in the background? I am running on my laptop with no other known program running.
 
There is always something running, but it might not affect your application. I'm no expert in the different measurement facilities; I commonly stick with the TSC counter.


Is there a way I can consistently get the same time?

For small loops executed only once this is very difficult. There are many things happening that can defer or delay the execution of your actual code. Examples are the starting of threads and the implicit barrier at the end of OpenMP-parallelized loops.
 
To find out, I ran it under likwid, but here too the times for the two runs are quite different.

Also, likwid reports RDTSC (runtime) and CPU_CLOCKS_UNHALTED (runtime). Which one is meaningful?

Both are meaningful but have different sources. The TSC counter is invariant on Intel IvyBridge and thus increments at each cycle of a fixed-frequency clock, even when the CPU is in a halted state. CPU_CLOCKS_UNHALTED instead counts only cycles where the CPU is in the unhalted state and is influenced by the varying CPU frequency. The CPU_CLK_UNHALTED_REF event corresponds more closely to the TSC counter, as it uses the same fixed-frequency clock source but counts only in the unhalted state.
If your system also offers the CPU_CLOCKS_UNHALTED event on the PMC counters, you can get the total clocks by setting some options. I have already added this event to the event list for IvyBridge; it is called CPU_CLOCK_UNHALTED_TOTAL_CYCLES. Despite the name, it also increments in the halted state at the actual CPU clock.
 

The code consists of a for loop (the loop has 3 multiplies and 3 adds per iteration; the iteration count is 2048).

That is a very small loop count for reliable results, especially with the work shared among 4 threads.
 

Theoretically this should be memory bound. 
 
If you run the loop only once you can see the memory traffic, but it is too little traffic to distinguish from the normal background traffic. Increase the iteration count so that the system reaches a steady state.
If you ran the loop multiple times, all data would sit in some cache level; for an iteration count of 2048 with 4 threads, it could fit in the L1 cache.
 

So if we fire four threads and the memory cannot supply data fast enough, will the CPU wait (go into a halted state)?
 
Yes, and you can measure these stalls with the CYCLE_ACTIVITY_CYCLES events. The CYCLE_ACTIVITY_STALLS events increment only when a stall in the selected cache level causes an execution stall in the core.

So in this case the unhalted clock cycles will not capture the waiting time. So is the difference between RDTSC and CPU_CLOCKS_UNHALTED the result of memory waits?

I wouldn't say that a stall is equal to a halted state. Stalls are generally short, and changing the state would cost additional time. The differences come from the different clock sources and from the activity in the different states.

So is RDTSC the correct time to report (since it also accounts for waits due to the memory-bound nature of the code)?
 
I would design the code differently to get reliable results: do some warmup iterations, run the loop multiple times, and report min/max/mean values. Take a look at other benchmarking codes, e.g. STREAM, to see how they perform their measurements.

Also, the computation is the same for the two runs, so why is CPU_CLOCKS_UNHALTED different between them?

Almost every number differs between your two runs. In my opinion there are some problems with your code: each thread performs a different number of instructions, the fixed-clock unhalted cycle counts differ, and the CPI differs. It is possible that the threads are waiting in a barrier.


I hope I answered your questions. If not, most of them would also be suitable for the Intel Performance Monitoring forum (https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring).

Greetings,
Thomas

D Br

Sep 7, 2016, 5:55:10 AM
to likwid-users
Thanks,
D Brahme