What do these multithreaded results mean?

Doug Graham

Sep 21, 2020, 3:45:50 PM
to benchmark-discuss
Hi,

I'm experimenting with multithreaded benchmarks and trying to figure out what the numbers mean. I'll include my test code below, but I think the question is clear even without it, so let me start with the results:

$ ./multithreaded
2020-09-21T15:15:07-04:00
Running ./multithreaded
Run on (4 X 3500 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 6144 KiB (x1)
Load Average: 0.16, 0.21, 0.10
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
***WARNING*** Library was built as DEBUG. Timings may be affected.
---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
BM_Empty/threads:1        54545 ns        54545 ns        12542
BM_Empty/threads:2        28071 ns        56140 ns        12494
BM_Empty/threads:4        14431 ns        57719 ns        12120
BM_Empty/threads:8        11468 ns        57578 ns        12128
BM_Empty/threads:16        8769 ns        57702 ns        12176
BM_Empty/threads:32        6269 ns        57727 ns        12128
BM_Empty/threads:64        2308 ns        57729 ns        12160
BM_Empty/threads:128        484 ns        57747 ns        12160

I just noticed that I should have given BM_Empty a better name: it actually runs a burncpu() function that uses about 55 us of CPU per call. So the numbers for 1, 2, and 4 threads make sense. What I don't understand is how Time keeps improving beyond 4 threads. This is running on a quad-core machine without hyperthreading, so I can't see how Time could ever be less than CPU/4.

I could just ignore the results when there are more threads than cores, except that I ran into this while benchmarking real code that we suspected was bottlenecked on mutex contention. With a regular non-shared mutex, Time was about 1/4 of CPU with four or more threads (still on the quad-core machine), which to me means very little contention. But when the mutex was changed to a shared_mutex, the results looked a lot like the ones above. What does that mean?

The benchmark code looks like:

$ cat multithreaded.cc
#include <benchmark/benchmark.h>
#include "burnCpu.h"

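static void BM_Empty(benchmark::State& state) {
    // Range-based loop: the lower-overhead equivalent of KeepRunning().
    for (auto _ : state) {
        // Keep the result live so the call cannot be optimized away.
        benchmark::DoNotOptimize(burncpu());
    }
}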

// Registers thread counts 1, 2, 4, ..., 128 (doubling each time).
BENCHMARK(BM_Empty)->ThreadRange(1, 128);

BENCHMARK_MAIN();

burncpu() lives in a separate source file to try to prevent inlining. It looks like:

$ cat burnCpu.cc
#include "burnCpu.h"
#include <cmath>

// Burns a fixed amount of CPU by summing 10000 square roots.
double burncpu()
{
    double tot = 0.0;

    for (int i = 0; i < 10000; i++)
        tot += std::sqrt(i);
    return tot;
}
