What do these multithreaded results mean?

Doug Graham

Sep 21, 2020, 3:45:50 PM

to benchmark-discuss

Hi,

I'm experimenting with multithreaded benchmarks and trying to figure out what the numbers mean. I'll include my test code below but I think the question is clear even without the code. So let me start with the results:

$ ./multithreaded

2020-09-21T15:15:07-04:00

Running ./multithreaded

Run on (4 X 3500 MHz CPU s)

CPU Caches:

L1 Data 32 KiB (x4)

L1 Instruction 32 KiB (x4)

L2 Unified 256 KiB (x4)

L3 Unified 6144 KiB (x1)

Load Average: 0.16, 0.21, 0.10

***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.

***WARNING*** Library was built as DEBUG. Timings may be affected.

---------------------------------------------------------------

Benchmark Time CPU Iterations

---------------------------------------------------------------

BM_Empty/threads:1 54545 ns 54545 ns 12542

BM_Empty/threads:2 28071 ns 56140 ns 12494

BM_Empty/threads:4 14431 ns 57719 ns 12120

BM_Empty/threads:8 11468 ns 57578 ns 12128

BM_Empty/threads:16 8769 ns 57702 ns 12176

BM_Empty/threads:32 6269 ns 57727 ns 12128

BM_Empty/threads:64 2308 ns 57729 ns 12160

BM_Empty/threads:128 484 ns 57747 ns 12160

I just noticed that I should have given BM_Empty a better name: it actually runs a "burnCpu" function that uses about 55 µs of CPU per call. The numbers for 1, 2, and 4 threads make sense, since Time is roughly CPU divided by the thread count. What I don't understand is how Time keeps improving beyond 4 threads. This is a quad-core machine without hyperthreading, so I can't see how Time could ever be less than CPU/4 (about 14.4 µs here).
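To make the puzzle concrete, here is a small sketch that divides the reported CPU by the reported Time for each row of the table above. The `Row` struct and the function names are made up for illustration; only the numbers come from the output. If Time were plain wall-clock time per iteration, the ratio should never exceed the core count (4).

```cpp
#include <cassert>
#include <cstdio>
#include <vector>

// One row of the results table: thread count, reported Time and CPU in ns.
struct Row {
    int threads;
    double time_ns;
    double cpu_ns;
};

// Ratio of reported CPU to reported Time for one row.
double apparent_parallelism(const Row& r) {
    assert(r.time_ns > 0.0);
    return r.cpu_ns / r.time_ns;
}

// Rows copied from the benchmark output above.
std::vector<Row> reported_rows() {
    return {
        {1, 54545.0, 54545.0},  {2, 28071.0, 56140.0},
        {4, 14431.0, 57719.0},  {8, 11468.0, 57578.0},
        {16, 8769.0, 57702.0},  {32, 6269.0, 57727.0},
        {64, 2308.0, 57729.0},  {128, 484.0, 57747.0},
    };
}

// Prints the ratio per row and flags rows that exceed the core count
// by more than 5% (the tolerance is an arbitrary choice).
void print_report(int cores) {
    for (const Row& r : reported_rows()) {
        const double p = apparent_parallelism(r);
        std::printf("threads:%-4d CPU/Time = %7.2f%s\n", r.threads, p,
                    p > cores * 1.05 ? "  <-- exceeds core count" : "");
    }
}
```

Calling `print_report(4)` flags every row from 8 threads upward, which are exactly the rows I can't explain.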

I could just ignore the result when there are more threads than there are cores, except that I encountered this issue while benchmarking real code that we thought might have bottlenecked on mutex contention. With a regular non-shared mutex, Time was about 1/4 CPU when using four threads or more (still running on a quad core machine), which to me means very little contention. But when the mutex was changed to a shared_mutex, the results look a lot like the above. What does that mean?

The benchmark code looks like:

$ cat multithreaded.cc

#include <benchmark/benchmark.h>

#include "burnCpu.h"

static void BM_Empty(benchmark::State& state) {

while (state.KeepRunning()) {

burncpu();

}

}

BENCHMARK(BM_Empty)->Threads(1);

BENCHMARK(BM_Empty)->Threads(2);

BENCHMARK(BM_Empty)->Threads(4);

BENCHMARK(BM_Empty)->Threads(8);

BENCHMARK(BM_Empty)->Threads(16);

BENCHMARK(BM_Empty)->Threads(32);

BENCHMARK(BM_Empty)->Threads(64);

BENCHMARK(BM_Empty)->Threads(128);

BENCHMARK_MAIN();

burncpu() is in a separate source file to try to prevent inlining. It looks like:

$ cat burnCpu.cc

#include "burnCpu.h"

#include <cmath>

double burncpu()

{

double tot = 0.0;

for (int i = 0; i < 10000; i++)

tot += std::sqrt(i);

return tot;

}
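As a cross-check that does not depend on how the benchmark library derives its Time column, the same workload could be timed from the outside with one wall-clock interval around all threads. This is only a sketch under assumptions: `burn_cpu_copy`, `WallResult`, and `measure_wall` are hypothetical names, the workload is duplicated here so the block is self-contained, and thread start-up cost is included in the measurement.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <thread>
#include <vector>

// Copy of the burncpu() workload above, renamed so this sketch stands alone.
double burn_cpu_copy() {
    double tot = 0.0;
    for (int i = 0; i < 10000; i++)
        tot += std::sqrt(i);
    return tot;
}

struct WallResult {
    long total_iterations;         // iterations summed over all threads
    double wall_ns_per_iteration;  // elapsed wall time / total iterations
};

// Runs `threads` threads, each calling the workload `iters_per_thread` times,
// and measures a single wall-clock interval from before the first thread is
// created until the last one has been joined.
WallResult measure_wall(int threads, int iters_per_thread) {
    assert(threads > 0 && iters_per_thread > 0);
    std::vector<double> sinks(threads, 0.0);  // keeps the results observable
    std::vector<std::thread> pool;
    pool.reserve(threads);

    const auto start = std::chrono::steady_clock::now();
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&sinks, t, iters_per_thread] {
            double local = 0.0;
            for (int i = 0; i < iters_per_thread; ++i)
                local += burn_cpu_copy();
            sinks[t] = local;  // each thread writes only its own slot
        });
    }
    for (auto& th : pool) th.join();
    const auto stop = std::chrono::steady_clock::now();

    const long total = static_cast<long>(threads) * iters_per_thread;
    const double wall_ns =
        std::chrono::duration<double, std::nano>(stop - start).count();
    return {total, wall_ns / static_cast<double>(total)};
}
```

For example, `measure_wall(128, 95)` roughly mirrors the 128-thread run above (about 12160 iterations in total). On a quad-core machine I would expect its wall time per iteration to stay near CPU/4 rather than drop to a few hundred nanoseconds, but that is my expectation, not a measured result. An optimizing compiler may also inline or hoist the workload here; keeping it in a separate source file, as in the original, avoids that.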
