I am trying to measure the time taken to run a loop (omp parallel for on 4 cores). The loop is inside a function. I measure time by putting clock_time, gettimeofday getrusage(RUSAGE_SELF,utime ) and getrusage(RUSAGE_SELF, utime+stime) around the function. I also put likwid markers inside the function around the code. I ran it on my laptop with 4 cores.
I ran it twice and I have pasted the results. ( 2 results).
The results are different for the two runs (quiet a bit different).
Are the results different because some other daemons (or unknown programs) run and hence results are different. I am running on my laptop with no other known program running.
Is there a way that I can consistently get the same time?
To find that out I ran it under likwid, but here too the times for two runs are quiet different.
Also likwid reports RDTSC (runtime) and CPU_CLOCKS_UNHALTED (runtime). Which one is meaningful.
The code consists of a for loop (the loop has 3 multiply and 3 adds per iteration. Iteration size is 2048.
Theoretically this should be memory bound.
So if we fire four threads and if the memory cannot supply data fast enough will the cpu wait (go in halt state).
So in this case the clock cycles (unhalted) will not capture the unhalted time. So is the difference between RDTSC and CPU_CLOCKS_UNHALTED the result of memory waits?
So is RDTSC the correct time to report (which is also accounting for waits due to memory bound nature of the code) ?
Also for the 2 runs the computation is the same so why are the CPU_CLOCKS_UNHALTED different for the 2 runs?
Thanks and Regards,
D Brahme
The results of the two runs (from my timing functions and likwid output are given below). I used
likwid-perfctr -m -C 0-3 -g FLOPS_SP ./myprogram -n 2048 -tx 4.
size=2048, ignore=0.000000
CLOCKS_PER_SEC =1000000 sec
velocity0 clock time=0.030000 sec
velocity0 gettimeofday=1026.154000 ms
velocity0 rusagetime=8.000000 ms
velocity0 Total rusagetime=22.000000 ms
--------------------------------------------------------------------------------
CPU name: Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz
CPU type: Intel Core IvyBridge processor
CPU clock: 2.59 GHz
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Region velocity0, Group 1: FLOPS_SP
+-------------------+----------+----------+----------+----------+
| Region Info | Core 0 | Core 1 | Core 2 | Core 3 |
+-------------------+----------+----------+----------+----------+
| RDTSC Runtime [s] | 0.002435 | 0.000010 | 0.000170 | 0.001092 |
| call count | 1 | 1 | 1 | 1 |
+-------------------+----------+----------+----------+----------+
+--------------------------------------+---------+---------+--------+--------+---------+
| Event | Counter | Core 0 | Core 1 | Core 2 | Core 3 |
+--------------------------------------+---------+---------+--------+--------+---------+
| INSTR_RETIRED_ANY | FIXC0 | 1850665 | 47866 | 215379 | 1034260 |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 5574614 | 55544 | 552581 | 3379475 |
| CPU_CLK_UNHALTED_REF | FIXC2 | 5809856 | 46566 | 463424 | 2834390 |
| FP_COMP_OPS_EXE_SSE_FP_PACKED_SINGLE | PMC0 | 0 | 0 | 0 | 0 |
| FP_COMP_OPS_EXE_SSE_FP_SCALAR_SINGLE | PMC1 | 3166 | 3147 | 3159 | 3192 |
| SIMD_FP_256_PACKED_SINGLE | PMC2 | 0 | 0 | 0 | 0 |
+--------------------------------------+---------+---------+--------+--------+---------+
+-------------------------------------------+---------+---------+-------+---------+--------------+
| Event | Counter | Sum | Min | Max | Avg |
+-------------------------------------------+---------+---------+-------+---------+--------------+
| INSTR_RETIRED_ANY STAT | FIXC0 | 3148170 | 47866 | 1850665 | 787042.5000 |
| CPU_CLK_UNHALTED_CORE STAT | FIXC1 | 9562214 | 55544 | 5574614 | 2.390554e+06 |
| CPU_CLK_UNHALTED_REF STAT | FIXC2 | 9154236 | 46566 | 5809856 | 2288559 |
| FP_COMP_OPS_EXE_SSE_FP_PACKED_SINGLE STAT | PMC0 | 0 | 0 | 0 | 0 |
| FP_COMP_OPS_EXE_SSE_FP_SCALAR_SINGLE STAT | PMC1 | 12664 | 3147 | 3192 | 3166 |
| SIMD_FP_256_PACKED_SINGLE STAT | PMC2 | 0 | 0 | 0 | 0 |
+-------------------------------------------+---------+---------+-------+---------+--------------+
+----------------------+-----------+--------------+-----------+-----------+
| Metric | Core 0 | Core 1 | Core 2 | Core 3 |
+----------------------+-----------+--------------+-----------+-----------+
| Runtime (RDTSC) [s] | 0.0024 | 9.925235e-06 | 0.0002 | 0.0011 |
| Runtime unhalted [s] | 0.0022 | 2.143253e-05 | 0.0002 | 0.0013 |
| Clock [MHz] | 2486.6415 | 3091.2346 | 3090.1614 | 3089.9637 |
| CPI | 3.0122 | 1.1604 | 2.5656 | 3.2675 |
| SP MFLOP/s | 1.3004 | 317.0706 | 18.5671 | 2.9242 |
| AVX SP MFLOP/s | 0 | 0 | 0 | 0 |
| Packed MUOPS/s | 0 | 0 | 0 | 0 |
| Scalar MUOPS/s | 1.3004 | 317.0706 | 18.5671 | 2.9242 |
+----------------------+-----------+--------------+-----------+-----------+
+---------------------------+------------+--------------+-----------+-----------+
| Metric | Sum | Min | Max | Avg |
+---------------------------+------------+--------------+-----------+-----------+
| Runtime (RDTSC) [s] STAT | 0.0037 | 9.925235e-06 | 0.0024 | 0.0009 |
| Runtime unhalted [s] STAT | 0.0037 | 2.143253e-05 | 0.0022 | 0.0009 |
| Clock [MHz] STAT | 11758.0012 | 2486.6415 | 3091.2346 | 2939.5003 |
| CPI STAT | 10.0057 | 1.1604 | 3.2675 | 2.5014 |
| SP MFLOP/s STAT | 339.8623 | 1.3004 | 317.0706 | 84.9656 |
| AVX SP MFLOP/s STAT | 0 | 0 | 0 | 0 |
| Packed MUOPS/s STAT | 0 | 0 | 0 | 0 |
| Scalar MUOPS/s STAT | 339.8623 | 1.3004 | 317.0706 | 84.9656 |
+---------------------------+------------+--------------+-----------+-----------+
Second Run: (Giving different timings) Both are run on my laptop with no other known jobs.
size=2048, ignore=0.000000
CLOCKS_PER_SEC =1000000 sec
velocity0 clock time=0.000000 sec
velocity0 gettimeofday=1014.440000 ms
velocity0 rusagetime=5.000000 ms
velocity0 Total rusagetime=11.000000 ms
--------------------------------------------------------------------------------
CPU name: Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz
CPU type: Intel Core IvyBridge processor
CPU clock: 2.59 GHz
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Region velocity0, Group 1: FLOPS_SP
+-------------------+----------+----------+----------+----------+
| Region Info | Core 0 | Core 1 | Core 2 | Core 3 |
+-------------------+----------+----------+----------+----------+
| RDTSC Runtime [s] | 0.001172 | 0.000014 | 0.000039 | 0.000014 |
| call count | 1 | 1 | 1 | 1 |
+-------------------+----------+----------+----------+----------+
+--------------------------------------+---------+---------+--------+--------+--------+
| Event | Counter | Core 0 | Core 1 | Core 2 | Core 3 |
+--------------------------------------+---------+---------+--------+--------+--------+
| INSTR_RETIRED_ANY | FIXC0 | 1155591 | 47869 | 68655 | 49074 |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 3596837 | 65758 | 140504 | 71419 |
| CPU_CLK_UNHALTED_REF | FIXC2 | 3016728 | 55120 | 117832 | 59878 |
| FP_COMP_OPS_EXE_SSE_FP_PACKED_SINGLE | PMC0 | 0 | 0 | 0 | 0 |
| FP_COMP_OPS_EXE_SSE_FP_SCALAR_SINGLE | PMC1 | 3158 | 3276 | 3124 | 3284 |
| SIMD_FP_256_PACKED_SINGLE | PMC2 | 0 | 0 | 0 | 0 |
+--------------------------------------+---------+---------+--------+--------+--------+
+-------------------------------------------+---------+---------+-------+---------+-------------+
| Event | Counter | Sum | Min | Max | Avg |
+-------------------------------------------+---------+---------+-------+---------+-------------+
| INSTR_RETIRED_ANY STAT | FIXC0 | 1321189 | 47869 | 1155591 | 330297.2500 |
| CPU_CLK_UNHALTED_CORE STAT | FIXC1 | 3874518 | 65758 | 3596837 | 968629.5000 |
| CPU_CLK_UNHALTED_REF STAT | FIXC2 | 3249558 | 55120 | 3016728 | 812389.5000 |
| FP_COMP_OPS_EXE_SSE_FP_PACKED_SINGLE STAT | PMC0 | 0 | 0 | 0 | 0 |
| FP_COMP_OPS_EXE_SSE_FP_SCALAR_SINGLE STAT | PMC1 | 12842 | 3124 | 3284 | 3210.5000 |
| SIMD_FP_256_PACKED_SINGLE STAT | PMC2 | 0 | 0 | 0 | 0 |
+-------------------------------------------+---------+---------+-------+---------+-------------+
+----------------------+-----------+--------------+--------------+--------------+
| Metric | Core 0 | Core 1 | Core 2 | Core 3 |
+----------------------+-----------+--------------+--------------+--------------+
| Runtime (RDTSC) [s] | 0.0012 | 1.402813e-05 | 3.945990e-05 | 1.401655e-05 |
| Runtime unhalted [s] | 0.0014 | 2.537375e-05 | 0.0001 | 2.755814e-05 |
| Clock [MHz] | 3089.9288 | 3091.7421 | 3090.2193 | 3091.0807 |
| CPI | 3.1126 | 1.3737 | 2.0465 | 1.4553 |
| SP MFLOP/s | 2.6935 | 233.5308 | 79.1690 | 234.2945 |
| AVX SP MFLOP/s | 0 | 0 | 0 | 0 |
| Packed MUOPS/s | 0 | 0 | 0 | 0 |
| Scalar MUOPS/s | 2.6935 | 233.5308 | 79.1690 | 234.2945 |
+----------------------+-----------+--------------+--------------+--------------+
+---------------------------+------------+--------------+-----------+-----------+
| Metric | Sum | Min | Max | Avg |
+---------------------------+------------+--------------+-----------+-----------+
| Runtime (RDTSC) [s] STAT | 0.0013 | 1.401655e-05 | 0.0012 | 0.0003 |
| Runtime unhalted [s] STAT | 0.0016 | 2.537375e-05 | 0.0014 | 0.0004 |
| Clock [MHz] STAT | 12362.9709 | 3089.9288 | 3091.7421 | 3090.7427 |
| CPI STAT | 7.9881 | 1.3737 | 3.1126 | 1.9970 |
| SP MFLOP/s STAT | 549.6878 | 2.6935 | 234.2945 | 137.4220 |
| AVX SP MFLOP/s STAT | 0 | 0 | 0 | 0 |
| Packed MUOPS/s STAT | 0 | 0 | 0 | 0 |
| Scalar MUOPS/s STAT | 549.6878 | 2.6935 | 234.2945 | 137.4220 |
+---------------------------+------------+--------------+-----------+-----------+