I'm using Google Benchmark to measure SPDK performance, and I need to manage the threads myself so that each one can be bound to a specific core. Currently, I use an atomic counter to synchronize the master thread with the worker threads, and run the waiting loop in the master thread:
    for (auto _ : state) {
        while (iterations == counter);
        iterations++;
    }
The worker threads increment counter each time an operation completes.
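For reference, here is a minimal, self-contained sketch of the setup described above, with the benchmark loop replaced by a plain for loop so it builds without any library. The names do_operation, worker, and run_master are hypothetical, do_operation is an empty stand-in for one SPDK operation, and core pinning is left out.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Number of operations completed so far by the worker thread.
std::atomic<std::uint64_t> counter{0};

// Hypothetical stand-in for one SPDK operation (assumption, does nothing here).
void do_operation() {}

// Worker: performs total_ops operations and publishes each completion.
void worker(std::uint64_t total_ops) {
    for (std::uint64_t i = 0; i < total_ops; ++i) {
        do_operation();
        counter.fetch_add(1, std::memory_order_release);
    }
}

// Master: one loop iteration per completed operation, mirroring
// the body of `for (auto _ : state)` in the real benchmark.
std::uint64_t run_master(std::uint64_t total_ops) {
    std::thread t(worker, total_ops);
    std::uint64_t iterations = 0;
    for (std::uint64_t i = 0; i < total_ops; ++i) {
        // Spin until the worker has completed at least one more operation.
        while (counter.load(std::memory_order_acquire) == iterations) {
        }
        ++iterations;
    }
    t.join();
    return iterations;
}
```

In this sketch the master spins on the shared counter once per operation, which is the per-iteration synchronization in question.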
This approach works, but the inter-thread synchronization seems to add a huge overhead: compared with the non-threaded benchmark, the threaded benchmark is more than 30% slower.
Is there a better way to run a benchmark in this situation?