Design goals:
1) The benchmark should run on a stock OS. Requiring special-purpose hardware would increase benchmark accuracy, but would make running it much more difficult.
2) Running under an OS might distort the benchmark with context switches or interrupts. To mitigate that, repeat the test multiple times and remove outliers.
3) It should produce a few simple outputs, so that a lock designer can use the results as a proxy for how to optimize the lock for their use case.
4) The initial design assumes x86 for brevity, but porting to other architectures should be easy.
Implementation:
Input parameters:
LENGTH - estimated number of cycles the lock would be held.
N - number of CPUs contending for the lock.
1) Count cycles (e.g., via rdtsc) for an uncontended acquire and release - that would be the first output.
2) Measure the memory footprint of an uncontended acquire and release - that would be the second output. This can be done by invalidating the cache, running a lock acquire/release, and reading the cache-miss count from the performance counters. To eliminate unrelated overhead, do the same with a NOP lock; that measurement is the harness overhead to subtract.
3) Pin N threads to N different logical CPUs. We can afford to ignore the logical/physical CPU distinction, since in most use cases you cannot ensure two threads run on different hardware CPUs anyway. Each CPU would hold the lock for LENGTH cycles, by running instructions that do NOT use the memory bus.
4) Count time (e.g., via rdtsc) from the moment the first CPU acquires the lock to the moment the last one does.
Do that by acquiring the lock, keeping the timestamp in a register, releasing the lock, and then, after a ~1ms pause, writing the value to a known location, so that the store does not interfere with the test. Read the results after 10ms, when all CPUs are definitely done with the benchmark.
The problem here is, the timestamps between CPUs might not be synchronized.
Possible solutions are:
4.a Use the HPET instead; then again, HPET read overhead might distort the test case.
4.b On newer Intel hardware, I understand there is an invariant TSC, which could be a solution. I'm not sure how portable that would be.
4.c Assuming the time drift is small, averaging over multiple experiments might give a small enough standard deviation.
I'm not sure how problematic that would be in practice.
5?) I'm not sure it's possible, or meaningful, but one could measure memory bus utilization during lock contention - perhaps by counting the total cache misses across all CPUs while they contend for the lock.
You'll end up with four actionable parameters:
CPU usage for the uncontended case.
Memory footprint for the uncontended case.
CPU overhead for the contended case.
Memory bus usage (or a close proxy of it) in the contended case.