[PATCH] RFC: Mutex with spinning

Nadav Har'El

Jul 28, 2014, 10:51:40 AM
to osv...@googlegroups.com, Nadav Har'El
When a thread locks an already-held mutex, it goes to sleep, which incurs
the penalty of a context switch, and on SMP, likely also a slow cross-CPU
wakeup. In many cases, the lock holder will be holding the lock for a very
short time, and it will be beneficial to spin a bit waiting for the lock
to be released, before going to sleep.

This patch implements such spinning, using technique 2 described in
issue #25: we spin after incrementing the mutex's count, but before
adding ourselves to the queue, so unlock() may wake us using the handoff
mechanism.

For uncontended locks, the added spinning does nothing (we only spin
when we find the lock already taken). For heavily-contended locks it
also usually does nothing, because spinning stops as soon as the
mutex's wait queue becomes non-empty, so if the wait queue is never
empty (as on a very heavily contended lock) we never spin. So I believe
this patch will mostly benefit lightly-contended mutexes.

In the test/misc-mutex.so benchmarks, this patch didn't change the
uncontended measurements, and only sporadically improved the heavily
contended case (20 threads constantly trying to obtain the mutex).
Performance is significantly improved in the less heavily contended
benchmarks, where context switches aren't necessary, with 2 or 4
lock-unlock loop threads on 4 CPUs: from around 800ns and 1800ns for the
2- and 4-thread cases respectively, with this patch we are down to 33ns
and 153ns respectively, more or less the same results we obtain with
a spinlock.
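
For reference, each lock-unlock loop thread in that benchmark is
essentially doing the following - a rough sketch, the actual test code
differs in its details:

void lock_unlock_loop(mutex& m, long iterations)
{
    for (long i = 0; i < iterations; i++) {
        m.lock();
        // empty critical section: measures the pure lock/unlock cost
        m.unlock();
    }
}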

This patch is relatively simple. I tried further changes, not included
in this patch, which didn't seem to help my tests but which we might
want in the future. One such heuristic is to stop spinning if the
lock-holding thread is not running (see the sketch below). A narrower
heuristic is to stop spinning if the lock holder is on the same CPU
we're on (and in particular, never to spin on a single processor). In
the final patch we'll also need to determine the best amount of
spinning (or make it a parameter, or adaptive).
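
For illustration, here is roughly what the first heuristic would look
like grafted onto the spin loop in the patch below; note that
spin_limit and the is_running() owner check are placeholders - the real
scheduler API may expose the thread's state differently:

for (int c = 0; c < spin_limit && waitqueue.empty(); c++) {
    auto old_handoff = handoff.load();
    if (old_handoff && handoff.compare_exchange_strong(old_handoff, 0U)) {
        owner.store(current, std::memory_order_relaxed);
        depth = 1;
        return;
    }
    // Hypothetical check: a lock holder that is not running cannot
    // release the lock, so further spinning only burns cycles.
    auto o = owner.load(std::memory_order_relaxed);
    if (o && !o->is_running()) {
        break;
    }
    asm volatile ("pause");
}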

Please do not commit this patch, but feel free to try it on your
favorite benchmark. I'm continuing to investigate different ways of
doing spinning and their tradeoffs.

Signed-off-by: Nadav Har'El <n...@cloudius-systems.com>
---
core/lfmutex.cc | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

--- .before/core/lfmutex.cc 2014-07-28 17:46:46.154625393 +0300
+++ .after/core/lfmutex.cc 2014-07-28 17:46:46.154625393 +0300
@@ -44,6 +44,24 @@ void mutex::lock()
++depth;
return;
}
+ // If we're still here, the lock is owned by a different thread. Before
+ // going to sleep, let's try to spin for a bit, to avoid the context
+ // switch costs when the critical section is very short. We're spinning
+ // without putting ourselves in the waitqueue first, so when the lock
+ // holder unlock()s, we expect to see a handoff.
+ // FIXME: do not spin on UP.
+ // If someone is on the waitqueue, there's no point in continuing to
+ // spin, as it will be woken first, and no handoff will be done.
+ for (int c = 0; c < 10000 && waitqueue.empty(); c++) {
+ auto old_handoff = handoff.load();
+ if(old_handoff && handoff.compare_exchange_strong(old_handoff, 0U)) {
+ owner.store(current, std::memory_order_relaxed);
+ depth = 1;
+ return;
+ }
+ asm volatile ("pause");
+ }
+

// If we're still here, the lock is owned by a different thread.
// Put this thread in a waiting queue, so it will eventually be woken

Avi Kivity

Jul 28, 2014, 11:31:24 AM
to Nadav Har'El, osv...@googlegroups.com
> +    for (int c = 0; c < 10000 && waitqueue.empty(); c++) {

10000? What's the average spin count?

Shouldn't we stop when the owner is not running? Otherwise, two threads
contending for the lock and then sleeping will always spin to completion
since the second thread will never encounter a non-empty waitqueue.
> + auto old_handoff = handoff.load();
> + if(old_handoff && handoff.compare_exchange_strong(old_handoff, 0U)) {

I don't understand how this can work without a space after the "if".

Nadav Har'El

Jul 28, 2014, 2:42:06 PM
to Avi Kivity, Osv Dev
On Mon, Jul 28, 2014 at 6:31 PM, Avi Kivity <a...@cloudius-systems.com> wrote:
> +    for (int c = 0; c < 10000 && waitqueue.empty(); c++) {
>
> 10000? What's the average spin count?

The "average" spin count obviously depends on the workload, but I think the more important question is what a context switch and/or remote wakeup costs. I.e., if a remote wakeup costs as much as spinning N times, and we spin N times before sleeping, then we get an online algorithm with competitive ratio 2 (the worst-case scenario is that we spin N times, give up and go to sleep, and a nanosecond later we need to be woken up).
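
(To spell out that ratio - this is the standard ski-rental argument, not something in the patch: say one spin iteration costs s and a sleep-plus-wakeup costs W, and we spin N = W/s times before sleeping. If the lock is released while we are still spinning, we pay only the actual wait, the same as the optimal offline choice. If it is released after we give up, we pay N*s + W = 2W, while sleeping immediately would have cost W. So we are never more than twice the optimum.)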

In any case, since I'm only running the mutex benchmark now, which doesn't do any other computation, basically any number will do. You're right that I probably shouldn't have left it at 10,000 for anybody else to try - 10,000 is probably much too much. I didn't even bother to test how much time 10,000 iterations take...
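
If anyone wants a quick number, a throwaway sketch along these lines (mine, not part of the patch) should give a ballpark figure for the loop alone:

#include <chrono>
#include <cstdio>

int main()
{
    auto start = std::chrono::steady_clock::now();
    for (int c = 0; c < 10000; c++) {
        asm volatile ("pause"); // same busy-wait hint as in the patch
    }
    auto end = std::chrono::steady_clock::now();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
    std::printf("10000 pause iterations: %lld ns\n", (long long)ns.count());
    return 0;
}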
 

> Shouldn't we stop when the owner is not running? Otherwise, two threads contending for the lock and then sleeping will always spin to completion since the second thread will never encounter a non-empty waitqueue.

Right. As I commented in the commit message, I already have this heuristic in the code, but didn't include it in the patch because it isn't needed in the benchmark. You're right, I should include this heuristic in the patch.
 

> +        auto old_handoff = handoff.load();
> +        if(old_handoff && handoff.compare_exchange_strong(old_handoff, 0U)) {
>
> I don't understand how this can work without a space after the "if".

Actually, having fewer spaces means we can utilize the instruction cache more efficiently ;-)
 


Dor Laor

Jul 28, 2014, 5:44:33 PM
to Nadav Har'El, Osv Dev
Initial tests with Cassandra + YCSB show a nice overall gain. However, I need to repeat them to be
sure, and there are newer versions of these locks. I think I observe higher latency on one of the tests; I need to redo it:

OSv + spinning mutex

YCSB Client 0.1
Command line: -db com.yahoo.ycsb.db.CassandraClient10 -threads 100 -p operationcount=100000 -p recordcount=800000 -p hosts=10.0.0.216 -P workloads/workloadf -s -load
Loading workload...
Starting test.
 0 sec: 0 operations; 
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
 10 sec: 217647 operations; 21693.11 current ops/sec; [INSERT AverageLatency(us)=4556.22] 
 20 sec: 452146 operations; 23445.21 current ops/sec; [INSERT AverageLatency(us)=4185.44] 
 30 sec: 667337 operations; 21519.1 current ops/sec; [INSERT AverageLatency(us)=4727.9] 
 37 sec: 800000 operations; 18502.51 current ops/sec; [INSERT AverageLatency(us)=4702.08] 
[OVERALL], RunTime(ms), 37206.0
[OVERALL], Throughput(ops/sec), 21501.908294361125
[INSERT], Operations, 800000
[INSERT], AverageLatency(us), 4517.90316
[INSERT], MinLatency(us), 81
[INSERT], MaxLatency(us), 589856
[INSERT], 95thPercentileLatency(ms), 8
[INSERT], 99thPercentileLatency(ms), 45
[INSERT], Return=0, 800000



[READ], Operations, 100000
[READ], AverageLatency(us), 5974.56879
[READ], MinLatency(us), 131
[READ], MaxLatency(us), 251539
[READ], 95thPercentileLatency(ms), 11
[READ], 99thPercentileLatency(ms), 28
[READ], Return=0, 100000


[READ-MODIFY-WRITE], Operations, 49741
[READ-MODIFY-WRITE], AverageLatency(us), 6.621217908767415
[READ-MODIFY-WRITE], MinLatency(us), 0
[READ-MODIFY-WRITE], MaxLatency(us), 252
[READ-MODIFY-WRITE], 95thPercentileLatency(ms), 0
[READ-MODIFY-WRITE], 99thPercentileLatency(ms), 0

YCSB Client 0.1
Command line: -db com.yahoo.ycsb.db.CassandraClient10 -threads 100 -p operationcount=100000 -p recordcount=800000 -p hosts=10.0.0.216 -P workloads/workloadf -s -t
Loading workload...
Starting test.
 0 sec: 0 operations; 
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
 6 sec: 100000 operations; 15477.48 current ops/sec; [UPDATE AverageLatency(us)=638.87] [READ-MODIFY-WRITE AverageLatency(us)=6.62] [READ AverageLatency(us)=5974.57] 
[OVERALL], RunTime(ms), 6462.0
[OVERALL], Throughput(ops/sec), 15475.085112968121
[UPDATE], Operations, 49741
[UPDATE], AverageLatency(us), 638.8723588186808
[UPDATE], MinLatency(us), 55
[UPDATE], MaxLatency(us), 204667
[UPDATE], 95thPercentileLatency(ms), 1
[UPDATE], 99thPercentileLatency(ms), 2
[UPDATE], Return=0, 49741
[UPDATE], 0, 41394
[UPDATE], 1, 6989
[UPDATE], 2, 889


OSv standard - stock master

[READ], Operations, 100000
[READ], AverageLatency(us), 6873.94427
[READ], MinLatency(us), 128
[READ], MaxLatency(us), 53526
[READ], 95thPercentileLatency(ms), 9
[READ], 99thPercentileLatency(ms), 15
[READ], Return=0, 100000

[READ-MODIFY-WRITE], Operations, 50077
[READ-MODIFY-WRITE], AverageLatency(us), 9.920921780458094
[READ-MODIFY-WRITE], MinLatency(us), 0
[READ-MODIFY-WRITE], MaxLatency(us), 56
[READ-MODIFY-WRITE], 95thPercentileLatency(ms), 0
[READ-MODIFY-WRITE], 99thPercentileLatency(ms), 0
[READ-MODIFY-WRITE], 0, 50077


SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
 0 sec: 0 operations; 
 8 sec: 100000 operations; 11708.23 current ops/sec; [UPDATE AverageLatency(us)=3051.68] [READ-MODIFY-WRITE AverageLatency(us)=9.92] [READ AverageLatency(us)=6873.94] 
[OVERALL], RunTime(ms), 8533.0
[OVERALL], Throughput(ops/sec), 11719.207781553967
[UPDATE], Operations, 50077
[UPDATE], AverageLatency(us), 3051.6789144717136
