[PATCH] RFC v2: Mutex with spinning


Nadav Har'El

Jul 28, 2014, 3:55:38 PM
to osv...@googlegroups.com, Nadav Har'El
v2: Make this patch less likely to be really bad on a non-benchmark
workload: limit the spinning to 1,000 instead of 10,000, and stop
spinning if the lock holder is sleeping.

When a thread locks an already-held mutex, it goes to sleep, which incurs
the penalty of a context switch, and on SMP, likely also a slow cross-CPU
wakeup. In many cases, the lock holder will be holding the lock for a very
short time, and it will be beneficial to spin a bit waiting for the lock
to be released, before going to sleep.

This patch implements such spinning, using technique 2 described in
issue #25: we spin after incrementing the mutex's count, but before
adding ourselves to the queue, so unlock() may wake us using the handoff
mechanism.

For uncontended locks, the added spinning does nothing (we only spin when
we find a taken lock). For heavily-contended locks it also usually does
nothing, because spinning stops as soon as the mutex's wait queue becomes
non-empty; if the wait queue is never empty (a very heavily contended
lock) we never spin. So I believe this patch will mostly benefit
moderately-contended mutexes.
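Stripped of the OSv-specific count/handoff protocol and wait queue, the shape of the algorithm is roughly the following. This is a portable sketch under assumed simplifications, not the actual lfmutex code: a plain test-and-set flag stands in for the lock-free count, and a yield loop stands in for enqueueing and sleeping.

```cpp
#include <atomic>
#include <thread>

// Sketch of "spin a bit before sleeping" (NOT OSv's lfmutex: the real
// code uses the count/handoff protocol and a real wait queue; here a
// yield loop stands in for going to sleep).
class spin_then_sleep_mutex {
    std::atomic<bool> taken{false};
public:
    void lock() {
        // Fast path: uncontended lock, no spinning at all.
        if (!taken.exchange(true, std::memory_order_acquire)) {
            return;
        }
        // Bounded spin: a holder with a short critical section often
        // releases within a few hundred cycles, letting us skip the
        // context switch entirely.
        for (int c = 0; c < 100; c++) {
            if (!taken.load(std::memory_order_relaxed) &&
                !taken.exchange(true, std::memory_order_acquire)) {
                return;
            }
        }
        // Slow path: give up the CPU (OSv would enqueue and sleep).
        while (taken.exchange(true, std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }
    void unlock() {
        taken.store(false, std::memory_order_release);
    }
};
```

A real implementation would also issue a pause instruction inside the spin loop, as the patch below does, and would abandon the spin early under the conditions discussed above (non-empty wait queue, holder not running).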

In the test/misc-mutex.so benchmarks, this patch didn't change the
uncontended measurements, and only sporadically improved the heavily
contended case (20 threads constantly trying to obtain the mutex).
Performance is significantly improved in the less heavily contended
benchmarks, where context switches aren't necessary, with 2 or 4
lock-unlock loop threads on 4 CPUs: from around 800ns and 1800ns for
the 2- and 4-thread cases respectively, this patch brings us down to
33ns and 153ns, more or less the same results we obtain with a
spinlock.

Please do not commit this patch, but feel free to try it on your favorite
benchmark. I'm continuing to investigate different ways of doing spinning
and their tradeoffs.

Signed-off-by: Nadav Har'El <n...@cloudius-systems.com>
---
core/lfmutex.cc | 25 +++++++++++++++++++++++++
include/osv/sched.hh | 4 ++++
2 files changed, 29 insertions(+)

--- .before/include/osv/sched.hh 2014-07-28 22:54:45.958268928 +0300
+++ .after/include/osv/sched.hh 2014-07-28 22:54:45.959268934 +0300
@@ -615,6 +615,10 @@ public:
* lot of stack
*/
static void register_exit_notifier(std::function<void (thread *)> &&n);
+ bool running() {
+ return _detached_state.get()->st.load(std::memory_order_relaxed) ==
+ status::running;
+ }
private:
class reaper;
friend class reaper;
--- .before/core/lfmutex.cc 2014-07-28 22:54:46.042269433 +0300
+++ .after/core/lfmutex.cc 2014-07-28 22:54:46.042269433 +0300
@@ -44,6 +44,31 @@ void mutex::lock()
++depth;
return;
}
+ // If we're still here, the lock is owned by a different thread. Before
+ // going to sleep, let's try to spin for a bit, to avoid the context
+ // switch costs when the critical section is very short. We're spinning
+ // without putting ourselves in the waitqueue first, so when the lock
+ // holder unlock()s, we expect to see a handoff.
+ // FIXME: do not spin on UP.
+ // If someone is on the waitqueue, there's no point in continuing to
+ // spin, as it will be woken first, and no handoff will be done.
+ for (int c = 0; c < 1000 && waitqueue.empty(); c++) {
+ // If the lock holder got preempted, we are better served by going
+ // to sleep and letting it run again than by spinning futilely.
+ // Especially if the lock holder wants to run on this CPU.
+ auto holder = owner.load(std::memory_order_relaxed);
+ if (holder && !holder->running()) {
+ break;
+ }
+ auto old_handoff = handoff.load();
+ if (old_handoff && handoff.compare_exchange_strong(old_handoff, 0U)) {
+ owner.store(current, std::memory_order_relaxed);
+ depth = 1;
+ return;
+ }
+ asm volatile ("pause");
+ }
+

// If we're here still here the lock is owned by a different thread.
// Put this thread in a waiting queue, so it will eventually be woken

Dor Laor

Jul 28, 2014, 5:53:21 PM
to Nadav Har'El, Osv Dev
I tried it on my laptop (as opposed to a server previously), with
cassandra-stress instead of ycsb. I didn't see any noticeable change,
except that idle time decreased on updates. More data tomorrow.



--
You received this message because you are subscribed to the Google Groups "OSv Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to osv-dev+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Avi Kivity

Jul 30, 2014, 3:59:09 AM
to Nadav Har'El, osv...@googlegroups.com

On 07/28/2014 10:55 PM, Nadav Har'El wrote:
> v2: Make this patch less likely to be really bad on a non-benchmark
> workload: limit the spinning to 1,000 instead of 10,000, and stop
> spinning if the lock holder is sleeping.

1000 still seems way too high: "pause" can be a hundred cycles (or
more), and 100,000 cycles is excessive.

100 or even less seems better.

This looks unsafe: holder can point to a dead thread.

One option is to store _detached_state instead of owner, the other is to
store the cpu along with owner, and verify that cpu->current == owner.
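The second suggestion could look roughly like the sketch below (toy stand-in types with hypothetical names, not OSv's actual sched::thread/sched::cpu layout). The key safety property: the spinner never dereferences the owner pointer, which may point to an already-exited thread; it only compares it against the cpu's current-thread pointer, and cpu structures are never freed.

```cpp
#include <atomic>

// Toy stand-ins for the scheduler's thread and cpu types
// (hypothetical names, not OSv's actual definitions).
struct thread_t {};
struct cpu_t {
    std::atomic<thread_t*> current{nullptr};  // thread now running here
};

// The mutex would store both fields, written together by lock():
struct owner_record {
    std::atomic<thread_t*> owner{nullptr};
    std::atomic<cpu_t*>    owner_cpu{nullptr};
};

// Safe "is the holder still running?" check: we never dereference the
// thread pointer (it may be stale); a stale pointer merely makes the
// comparison with the cpu's current-thread pointer fail.
inline bool holder_running(owner_record& r) {
    thread_t* t = r.owner.load(std::memory_order_relaxed);
    cpu_t*    c = r.owner_cpu.load(std::memory_order_relaxed);
    return t && c && c->current.load(std::memory_order_relaxed) == t;
}
```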

Nadav Har'El

Jul 30, 2014, 4:49:44 AM
to Avi Kivity, Osv Dev
On Wed, Jul 30, 2014 at 10:59 AM, Avi Kivity <a...@cloudius-systems.com> wrote:

On 07/28/2014 10:55 PM, Nadav Har'El wrote:
v2: Make this patch less likely to be really bad on a non-benchmark
workload: limit the spinning to 1,000 instead of 10,000, and stop
spinning if the lock holder is sleeping.

1000 seems still way too high, "pause" can be a hundred cycles (or more), and 100,000 cycles is excessive.

100 or even less seems better.

You're absolutely right. I'm now investigating why I seem to need this spin-count limit to be as high as 3,000 to get the best results from the misc-mutex benchmarks. There's probably something I'm still missing in understanding the behavior of the mutex under this spinning.

+        // If the lock holder got preempted, we would better be served
+        // by going to sleep and let it run again, than futile spinning.
+        // Especially if the lock holder wants to run on this CPU.
+        auto holder = owner.load(std::memory_order_relaxed);
+        if (holder && !holder->running()) {
+            break;
+        }

This looks unsafe, holder can point to a dead thread.

Right. My latest version of this patch contains this FIXME:

        // FIXME: what if thread exits while we want to check if it's
        // running? as soon as we find holder, it might already be gone :(


One option is to store _detached_state instead of owner

Definitely possible. Not beautiful (will require a bunch of other changes for recursive-mutex checking), but should work. It will also mean I need to hold the RCU read lock.
 
, the other is to store the cpu along with owner, and verify that cpu->current == owner.

Nice idea, but this will grow the mutex, which is already not very slim, even more :(




Gleb Natapov

Aug 10, 2014, 5:48:56 AM
to Nadav Har'El, osv...@googlegroups.com
In my version of the patch I had two tracepoints: one for when we decide
not to spin any more (it also recorded the reason for the decision), and
another for when spinning caused the mutex to be acquired. They were very
useful for seeing what is going on.
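In OSv these would be declared with the TRACEPOINT macro from <osv/trace.hh> (the probe names and format strings below are hypothetical, not from Gleb's actual patch). As a portable illustration of what the two probes record, here is a sketch using plain atomic counters in place of real tracepoints:

```cpp
#include <atomic>

// Portable stand-in for the two tracepoints described above (OSv
// itself would use the TRACEPOINT macro); names are hypothetical.
enum class spin_stop {
    queue_nonempty,      // someone is already waiting, no handoff will come
    holder_not_running,  // lock holder was preempted or is sleeping
    budget_exhausted,    // spun for the full budget without success
};

struct spin_stats {
    std::atomic<unsigned> stopped[3] {};      // indexed by spin_stop
    std::atomic<unsigned> acquired_by_spin{0};

    // Probe 1: spinning ended without the lock, and why.
    void trace_stop(spin_stop reason) {
        stopped[static_cast<int>(reason)].fetch_add(1);
    }
    // Probe 2: spinning succeeded in acquiring the mutex.
    void trace_acquired() {
        acquired_by_spin.fetch_add(1);
    }
};
```

Comparing the "stopped" histogram against the "acquired by spin" count is what makes the spin-budget tuning discussed earlier in the thread observable rather than guesswork.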

--
Gleb.

Gleb Natapov

Aug 17, 2014, 7:06:56 AM
to Nadav Har'El, osv...@googlegroups.com
On Mon, Jul 28, 2014 at 10:55:31PM +0300, Nadav Har'El wrote:
> v2: Make this patch less likely to be really bad on a non-benchmark
> workload: limit the spinning to 1,000 instead of 10,000, and stop
> spinning if the lock holder is sleeping.
>
I changed it to 100,000 and got this while running cassandra:

Assertion failed: other->thread() != sched::thread::current() (/data/gleb/osv/core/lfmutex.cc: unlock: 271)

[backtrace]
0x0000000000223578 <__assert_fail+24>
0x00000000003a7eb9 <lockfree::mutex::unlock()+553>
0x00000000003e47be <waitqueue::wait(lockfree::mutex&)+126>
0x00000000003a8c4a <rwlock::reader_wait_lockable()+26>
0x00000000003a8c76 <rwlock::rlock()+22>
0x00000000002c2000 <dmu_zfetch+192>
0x00000000002b34e8 <dbuf_read+1480>
0x00000000002c388f <dnode_hold_impl+303>
0x00000000002b94f8 <???+2856184>
0x00000000002ba6c0 <dmu_map_uio+64>
0x0000000000315ec6 <???+3235526>
0x0000000000403cde <vfs_file::get_arcbuf(void*, long)+78>
0x00000000003afd57 <pagecache::get(vfs_file*, long, mmu::hw_ptep<0>, mmu::pt_element<0>, bool, bool)+247>
0x000000000040393f <vfs_file::map_page(unsigned long, mmu::hw_ptep<0>, mmu::pt_element<0>, bool, bool)+31>
0x000000000032dc13 <mmu::map_file_page_mmap::map(unsigned long, mmu::hw_ptep<0>, mmu::pt_element<0>, bool)+51>
0x000000000033471d <mmu::map_level<mmu::populate_small<(mmu::account_opt)1>, 4>::operator()(mmu::hw_ptep<4>, unsigned long)+1213>
0x0000000000335c1d <unsigned long mmu::populate_vma<(mmu::account_opt)1>(mmu::vma*, void*, unsigned long, bool)+173>
0x000000000032cda4 <mmu::file_vma::fault(unsigned long, exception_frame*)+148>
0x0000000000329106 <mmu::vm_fault(unsigned long, exception_frame*)+230>
0x000000000037ab19 <page_fault+105>
0x0000000000379a06 <???+3643910>

--
Gleb.