In the 20-thread benchmark of tests/misc-mutex.so, we have 20 threads on 4 CPUs, each thread in a tight loop acquiring and releasing the same mutex.
The lock/unlock cycle time reported for this benchmark is around 1100 ns with a mutex, but just 120 ns with a spin-lock.
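For anyone who wants to reproduce the general shape of this measurement, here is a minimal sketch of such a contention loop. It is not the code of tests/misc-mutex.so: the names `spin_lock`, `contention_result` and `run_contention` are made up for illustration, and it uses `std::mutex` and a naive `std::atomic_flag` spin-lock rather than the real implementations, so the absolute numbers will differ.

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical minimal spin-lock, for illustration only.
class spin_lock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // Busy-wait until the flag was previously clear.
        while (flag.test_and_set(std::memory_order_acquire)) { }
    }
    void unlock() { flag.clear(std::memory_order_release); }
};

struct contention_result {
    long total_cycles;    // lock/unlock cycles completed by all threads
    double ns_per_cycle;  // wall-clock time divided by total_cycles
};

// Run `nthreads` threads, each doing `iters` lock/unlock cycles
// on one shared lock of type Lock.
template <typename Lock>
contention_result run_contention(int nthreads, long iters) {
    Lock lock;
    long counter = 0;  // protected by `lock`
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t) {
        threads.emplace_back([&] {
            for (long i = 0; i < iters; ++i) {
                std::lock_guard<Lock> guard(lock);
                ++counter;
            }
        });
    }
    for (auto& th : threads) th.join();
    auto elapsed = std::chrono::steady_clock::now() - start;
    double ns = std::chrono::duration<double, std::nano>(elapsed).count();
    return {counter, ns / counter};
}
```

Calling `run_contention<std::mutex>(20, 100000)` and `run_contention<spin_lock>(20, 100000)` on a 4-CPU machine gives the two per-cycle figures to compare. The iteration count here is an arbitrary choice.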
The question I would like to raise is whether we actually *want* the performance of the mutex to approach that of the spin-lock in this case.
If you think about *how* the spin-lock achieves this good performance in the 20-thread benchmark, the explanation is this: At any particular point in time, 4 of the 20 threads are running on the 4 CPUs. Each of these 4 threads will spin for the lock, and will not context-switch until its time slice is over. This results in a minimal number of context switches on each CPU (each costing around 300 ns), and none of the even slower cross-CPU wakeups.
But the price is that the spin-lock doesn't guarantee any short-term fairness or prevent starvation: In the 2 ms (or whatever the time slice is) during which these 4 threads are spinning and each getting the lock thousands of times, we have 16 other threads which all want the lock as well, but don't get to run at all.
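To put rough numbers on this, here is a back-of-the-envelope sketch. It rests on several assumptions that are mine: that 120 ns is the system-wide time per lock/unlock cycle, that the time slice really is 2 ms, and that the scheduler rotates equal-sized groups of threads round-robin. The names `slice_estimate` and `estimate_spin_slice` are hypothetical.

```cpp
#include <cstdint>

struct slice_estimate {
    std::int64_t cycles_per_slice;      // lock cycles completed in one time slice
    std::int64_t cycles_per_spinner;    // share of each thread that is on a CPU
    int starved_threads;                // threads that do not run during the slice
    std::int64_t wait_between_runs_ns;  // time a thread waits for its next slice
};

// Assumes `threads` is a multiple of `cpus` and that time slices are
// handed out round-robin to groups of `cpus` threads.
slice_estimate estimate_spin_slice(std::int64_t cycle_ns, std::int64_t slice_ns,
                                   int cpus, int threads) {
    slice_estimate e;
    e.cycles_per_slice = slice_ns / cycle_ns;
    e.cycles_per_spinner = e.cycles_per_slice / cpus;
    e.starved_threads = threads - cpus;
    e.wait_between_runs_ns = (threads / cpus - 1) * slice_ns;
    return e;
}
```

With the figures above, `estimate_spin_slice(120, 2000000, 4, 20)` gives 16666 cycles per slice, 4166 per spinning thread, 16 threads that get none, and (under the round-robin assumption) a wait of 8 ms between runs for each thread.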
On the other hand, the mutex is much more prompt: If 20 threads want the lock and are on the wait queue, they will be woken one by one. If 20 threads want to run in order to put themselves on the queue, the fair scheduler will let all of them run quickly (because the mutex sleeps and wakes, it causes a reschedule on each cycle, without waiting for the time slice to expire).
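The difference in who gets the lock can be illustrated with a toy model. This is only a sketch under strong assumptions, not a model of the real scheduler: the lock is handed over in strict FIFO order among the threads that get to run at all, each cycle costs a fixed time, and the window is a single 2 ms slice. The function name `acquisitions_in_window` is hypothetical.

```cpp
#include <cstdint>
#include <deque>
#include <vector>

// Toy model: `waiters` threads want one lock for `window_ns`.
// Only the first `runnable` of them ever get to try. Each lock/unlock
// cycle costs `cycle_ns`, and the lock is passed in FIFO order among
// the runnable threads. Returns the number of acquisitions per thread.
std::vector<std::int64_t> acquisitions_in_window(int waiters, int runnable,
                                                 std::int64_t cycle_ns,
                                                 std::int64_t window_ns) {
    std::vector<std::int64_t> counts(waiters, 0);
    if (runnable <= 0) return counts;
    std::deque<int> queue;
    for (int t = 0; t < runnable; ++t) queue.push_back(t);
    for (std::int64_t now = 0; now + cycle_ns <= window_ns; now += cycle_ns) {
        int owner = queue.front();  // next thread in line takes the lock
        queue.pop_front();
        ++counts[owner];
        queue.push_back(owner);     // and goes to the back of the line
    }
    return counts;
}
```

Using the figures quoted earlier, `acquisitions_in_window(20, 20, 1100, 2000000)` models the mutex: every one of the 20 threads gets the lock 90 or 91 times in 2 ms. `acquisitions_in_window(20, 4, 120, 2000000)` models the spin-lock: 4 threads get it 4166 or 4167 times each, and the other 16 get it zero times. Total throughput is about nine times higher in the second case, but all of it goes to a fifth of the threads.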
So my question is: are we OK with a faster mutex behaving more like the spin-lock, where a few threads can monopolize the CPUs for whole time slices while other threads are waiting? I'm guessing that for most workloads it simply won't matter, but I don't really know.