I have an experimental patchset that implements FUTEX_LOCK and
FUTEX_LOCK_ADAPTIVE in the kernel and uses something akin to
mutex_spin_on_owner() for the first waiter to spin. What I'm finding is
that adaptive spinning actually hurts my particular test case, so I was
hoping to poll people for context regarding the existing adaptive
spinning implementations in the kernel as to where we see benefit. Under
which conditions does adaptive spinning help?
I presume locks with a short average hold time stand to gain the most, as
the longer the lock is held, the more likely the spinner will expire its
timeslice or the scheduling gain will become noise in the acquisition
time. My test case simply calls "lock();unlock()" for a fixed number of
iterations and reports the iterations per second at the end of the run.
It can run with an arbitrary number of threads as well. I typically run
with 256 threads for 10M iterations.
futex_lock: Result: 635 Kiter/s
futex_lock_adaptive: Result: 542 Kiter/s
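For reference, the shape of the test is roughly the following. This is
only a self-contained sketch: it uses an ordinary FUTEX_WAIT/FUTEX_WAKE
mutex so it builds anywhere, whereas the real test case calls the
experimental FUTEX_LOCK/FUTEX_LOCK_ADAPTIVE operations and also counts
how often the lock and unlock slow paths are taken.

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 256
#define NITERS   10000000L

/* 0 = unlocked, 1 = locked, 2 = locked with possible waiters */
static atomic_int lock_word;
static atomic_long remaining = NITERS;

static long sys_futex(atomic_int *uaddr, int op, int val)
{
	return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void lock(void)
{
	int c = 0;

	if (atomic_compare_exchange_strong(&lock_word, &c, 1))
		return;					/* uncontended fastpath */
	if (c != 2)
		c = atomic_exchange(&lock_word, 2);
	while (c != 0) {
		sys_futex(&lock_word, FUTEX_WAIT, 2);	/* block in the kernel */
		c = atomic_exchange(&lock_word, 2);
	}
}

static void unlock(void)
{
	if (atomic_exchange(&lock_word, 0) == 2)	/* waiters may be queued */
		sys_futex(&lock_word, FUTEX_WAKE, 1);
}

static void *worker(void *arg)
{
	(void)arg;
	while (atomic_fetch_sub(&remaining, 1) > 0) {
		lock();
		unlock();				/* empty critical section */
	}
	return NULL;
}

int main(void)
{
	pthread_t t[NTHREADS];
	struct timespec t0, t1;
	double secs;
	int i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("Result: %.0f Kiter/s\n", NITERS / secs / 1000.0);
	return 0;
}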
I've limited the number of spinners to 1 but feel that perhaps this
should be configurable as locks with very short hold times could benefit
from up to NR_CPUS-1 spinners.
I'd really appreciate any data, or just general insight, you may have
acquired while implementing adaptive spinning for rt-mutexes and
mutexes. Open questions for me regarding conditions where adaptive
spinning helps are:
o What type of lock hold times do we expect to benefit?
o How much contention is a good match for adaptive spinning?
- this is related to the number of threads to run in the test
o How many spinners should be allowed?
I can share the kernel patches if people are interested, but they are
really early, and I'm not sure they are of much value until I better
understand the conditions where this is expected to be useful.
Thanks,
--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
--
I think you may have the wrong Chris. Chris Wright wasn't at the most
recent kernel summit and I do have a vague recollection of Chris Mason
mentioning that Oracle developers found that sched_yield() performed
better than futexes + kernel scheduling.
--
Roland Dreier <rol...@cisco.com> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
> o What type of lock hold times do we expect to benefit?
0 (that's a zero) :-p
I haven't seen your patches but you are not doing a heuristic approach,
are you? That is, do not "spin" hoping the lock will suddenly become
free. I was against that for -rt and I would be against that for futex
too.
> o How much contention is a good match for adaptive spinning?
> - this is related to the number of threads to run in the test
> o How many spinners should be allowed?
>
> I can share the kernel patches if people are interested, but they are
> really early, and I'm not sure they are of much value until I better
> understand the conditions where this is expected to be useful.
Again, I don't know how you implemented your adaptive spinners, but the
trick to it in -rt was that it would only spin while the owner of the
lock was actually running. If it was not running, it would sleep. No
point waiting for a sleeping task to release its lock.
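In other words, the waiter side boils down to something like this (just
a sketch of the idea, not the actual -rt rtmutex code; try_to_take_lock(),
lock_owner_on_cpu() and sleep_on_lock() are placeholders):

	/*
	 * Sketch only -- not the real rtmutex code.  try_to_take_lock(),
	 * lock_owner_on_cpu() and sleep_on_lock() are placeholders.
	 */
	for (;;) {
		if (try_to_take_lock(lock))
			break;			/* acquired while spinning */
		if (lock_owner_on_cpu(lock))
			cpu_relax();		/* owner is running: keep spinning */
		else
			sleep_on_lock(lock);	/* owner scheduled out: block */
	}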
Is this what you did? Because, IIRC, this only benefited spinlocks
converted to mutexes. It did not help with semaphores, because
semaphores could be held for a long time. Thus, it was good for short
held locks, but hurt performance on long held locks.
If userspace is going to do this, I guess the blocked task would need to
go into kernel, and spin there (with preempt enabled) if the task is
still active and holding the lock.
Then the application would need to determine which to use. An adaptive
spinner for short held locks, and a normal futex for long held locks.
-- Steve
Right. This was *critical* for the adaptive rtmutex. Note in the RT
patch, everybody spins as long as the current owner is on CPU.
FWIW, IIRC, Solaris has a heuristic approach where incoming tasks spin
for a period of time before going to sleep. (Cray UNICOS did the same)
>
> Is this what you did? Because, IIRC, this only benefited spinlocks
> converted to mutexes. It did not help with semaphores, because
> semaphores could be held for a long time. Thus, it was good for short
> held locks, but hurt performance on long held locks.
>
nod. The entire premise was based on the fact that we were converting
spinlocks, which by definition were short held locks. What I found
during early development was that the sleep/wakeup cycle was more
intrusive for RT than spinning.
IIRC, I measured something like 380k context switches/second prior to
the adaptive patches for a dbench test and we cut this down to somewhere
around 50k, with a corresponding increase in throughput. (I can't
remember specific numbers any more, it was a while ago... ;-)
When applied to semaphores, the benefit was in the noise range as I
recall..
(dbench was chosen due to the heavy contention on the dcache spinlock)
Best,
-PWM
Oh bother, Sorry Chris! I even talked to Chris Wright about this already
in person and he told me I probably was looking for Chris Mason, and I
still got the last names crossed up!
--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
I'm not sure what you're getting at here. Adaptive spinning is indeed
hoping the lock will become free while you are spinning and checking
its owner...
>
>> o How much contention is a good match for adaptive spinning?
>> - this is related to the number of threads to run in the test
>> o How many spinners should be allowed?
>>
>> I can share the kernel patches if people are interested, but they are
>> really early, and I'm not sure they are of much value until I better
>> understand the conditions where this is expected to be useful.
>
> Again, I don't know how you implemented your adaptive spinners, but the
> trick to it in -rt was that it would only spin while the owner of the
> lock was actually running. If it was not running, it would sleep. No
> point waiting for a sleeping task to release its lock.
It does exactly this.
> Is this what you did? Because, IIRC, this only benefited spinlocks
> converted to mutexes. It did not help with semaphores, because
> semaphores could be held for a long time. Thus, it was good for short
> held locks, but hurt performance on long held locks.
Trouble is, I'm still seeing performance penalties even on the shortest
critical section possible (lock();unlock();)
> If userspace is going to do this, I guess the blocked task would need to
> go into kernel, and spin there (with preempt enabled) if the task is
> still active and holding the lock.
It is currently under preempt_disable() just like mutexes. I asked Peter
why it was done that way for mutexes, but didn't really get an answer.
He did point out that since we check need_resched() at every iteration,
we won't run longer than our timeslice anyway, so it shouldn't be a
problem.
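FWIW the spin section itself looks roughly like this (a sketch in the
style of mutex_spin_on_owner(); lock_owner_running() is a stand-in for
the actual owner run-state check, not a real kernel helper):

	preempt_disable();
	while (lock_owner_running(lock)) {	/* placeholder check */
		if (need_resched())
			break;		/* don't spin past our timeslice */
		cpu_relax();
	}
	preempt_enable();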
> Then the application would need to determine which to use. An adaptive
> spinner for short held locks, and a normal futex for long held locks.
Yes, this was intended to be an optional thing (and certainly not the
default).
--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
I'm talking about the original idea people had of "let's spin for 50us
and hope it is unlocked before then", which I thought was not a good
idea.
>
> >
> >> o How much contention is a good match for adaptive spinning?
> >> - this is related to the number of threads to run in the test
> >> o How many spinners should be allowed?
> >>
> >> I can share the kernel patches if people are interested, but they are
> >> really early, and I'm not sure they are of much value until I better
> >> understand the conditions where this is expected to be useful.
> >
> > Again, I don't know how you implemented your adaptive spinners, but the
> > trick to it in -rt was that it would only spin while the owner of the
> > lock was actually running. If it was not running, it would sleep. No
> > point waiting for a sleeping task to release its lock.
>
> It does exactly this.
OK, that's good.
>
> > Is this what you did? Because, IIRC, this only benefited spinlocks
> > converted to mutexes. It did not help with semaphores, because
> > semaphores could be held for a long time. Thus, it was good for short
> > held locks, but hurt performance on long held locks.
>
> Trouble is, I'm still seeing performance penalties even on the shortest
> critical section possible (lock();unlock();)
Performance penalties compared to what? Not having adaptive at all?
>
> > If userspace is going to do this, I guess the blocked task would need to
> > go into kernel, and spin there (with preempt enabled) if the task is
> > still active and holding the lock.
>
> It is currently under preempt_disable() just like mutexes. I asked Peter
> why it was done that way for mutexes, but didn't really get an answer.
> He did point out that since we check need_resched() at every iteration
> that we won't run longer than our timeslice anyway, so it shouldn't be a
> problem.
Sure it's not a problem ;-)
-- Steve
Everybody spins? Really? For RT tasks I suppose that makes sense as they
will sort out the priority for themselves and if they preempt the owner
they will all immediately schedule out and boost the priority of the
owner.... but then we lose the benefit of spinning since we just put
everyone to sleep. I'll have to take a look at that and see what I'm
missing.
> FWIW, IIRC, Solaris has a heuristic approach where incoming tasks spin
> for a period of time before going to sleep. (Cray UNICOS did the same)
I suppose a heuristic approach could still be used so long as continued
spinning was conditional on the owner continuing to run on a CPU.
>> Is this what you did? Because, IIRC, this only benefited spinlocks
>> converted to mutexes. It did not help with semaphores, because
>> semaphores could be held for a long time. Thus, it was good for short
>> held locks, but hurt performance on long held locks.
>>
>
> nod. The entire premise was based on the fact that we were converting
> spinlocks, which by definition were short held locks. What I found
> during early development was that the sleep/wakeup cycle was more
> intrusive for RT than spinning.
Right, and I'm looking to provide some kernel assistance for userspace
spinlocks here, and am targeting short-lived critical sections as well.
>
> IIRC, I measured something like 380k context switches/second prior to
> the adaptive patches for a dbench test and we cut this down to somewhere
> around 50k, with a corresponding increase in throughput. (I can't
> remember specific numbers any more, it was a while ago... ;-)
>
> When applied to semaphores, the benefit was in the noise range as I
> recall..
>
> (dbench was chosen due to the heavy contention on the dcache spinlock)
Interesting, thanks for the input.
--
Darren
>
>
> Best,
> -PWM
>
>
>> If userspace is going to do this, I guess the blocked task would need to
>> go into kernel, and spin there (with preempt enabled) if the task is
>> still active and holding the lock.
>>
>> Then the application would need to determine which to use. An adaptive
>> spinner for short held locks, and a normal futex for long held locks.
>>
>> -- Steve
>>
>>
>
>
--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
Right. See the data in the original mail:
futex_lock: Result: 635 Kiter/s
futex_lock_adaptive: Result: 542 Kiter/s
So 15% fewer lock/unlock iterations per second with in-kernel adaptive
spinning enabled for a critical section approaching 0 in length. But if
we agree I'm taking the right approach, then it's time for me to polish
things up a bit and send them out for review.
--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
Hi Darren,
Part of the magic of adaptive locks is the avoidance of the sleep path,
and part of it is using the SMP resources in what is effectively a
distributed search (e.g. having N cpus actively looking for when they
can acquire, versus going to sleep and having the scheduler wake them
up later).

I haven't seen your algorithm, but if it's not simply trading the sleep
path for searching for acquisition + lock-owner changes, that may
perhaps be a reason for an observed regression. For instance, if you
have to do substantial work to get to the adaptive part of the
algorithm, there could be an unbalanced amount of overhead in selecting
this path.
The whole premise is predicated on the following sequence chart:
non-adaptive:

time | threadA            | threadB
-----------------------------------------------------
     | lock(X) (granted)  |
t0   |                    | lock(X) (contended)
     |                    | add_wait_list(X, B)
     | unlock(X)          |
     | grant(X, B)        |
     | wake_up(B)         |
     |                    | schedule()
     |                    |    |
     |                    |    |
     |                    |    V
t1   |                    | schedule() returns
     |                    | (returns from lock())

adaptive:

time | threadA            | threadB
-----------------------------------------------------
     | lock(X) (granted)  |
t0   |                    | lock(X) (contended)
     | unlock(X)          |
     | grant(X, B)        |
     |                    | while (!is_granted(X, B) &&
     |                    |        is_running(lock_owner(X)))
t1   |                    | (returns from lock())
The idea is that the time interval t0-t1 is shorter in the adaptive
case (at least most of the time), and that is why the spinning doesn't
end up hurting us (the cpu would otherwise be busy in the schedule()
code anyway). This is the win-win scenario for adaptive, and it is
generally the case with short-hold locks (like the spinlocks that were
converted to mutexes in -rt tend to be).

For cases where the lock is held longer than the scheduling overhead,
we will undoubtedly see more cpu utilization than in the non-adaptive
case. However, we may _still_ see performance improvements in some
scenarios, because the "grant" operation still has less overhead
(thread A can skip the "wakeup", and therefore thread B still comes
out of the loop with less latency than it would otherwise). In this
scenario you are trading cpu cycles for reduced latency, though the
performance gains (if any) are not as profound as in the ideal case.

..and then of course there are cpu-bound workloads with long(er) held
locks. They don't like adaptive-locks much at all ;)

Long story short, I would try to quantify what your "t1-t0" delta
actually is for the workload you are observing, to start with. That
will give you a ballpark of whether you should expect any gains or not.
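Even something crude like timing the contended lock() call from
userspace would give you that ballpark. A sketch (lock() here is
whatever acquisition call your test uses):

#include <time.h>

/* Ballpark the t0-t1 interval: time how long one (possibly contended)
 * lock() takes from first attempt to acquisition. */
static inline long lock_wait_ns(void)
{
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	lock();				/* the acquisition under test */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	return (t1.tv_sec - t0.tv_sec) * 1000000000L +
	       (t1.tv_nsec - t0.tv_nsec);
}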
-Greg
We tried something similar in the original adaptive mutex
implementation. I just went back and reread the threads and the biggest
boost in performance came when we:
1) didn't limit the number of spinners
2) didn't try to be fair to waiters
So, let's say we've spun for a while and given up and tossed a process
onto a wait queue. One of the mutex iterations would see the process on
the wait queue and nicely hop on behind it.
We ended up changing things to spin regardless of what other processes
were doing, and that made a big difference. The spinning loops have
cond_resched() sprinkled in important places to make sure we don't keep
the CPU away from the process that actually owns the mutex.
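From memory, the loop ended up shaped something like this (a sketch,
not the actual mutex/btrfs patch; lock_owner_running() is a placeholder
for the owner run-state check):

	/*
	 * Sketch from memory, not the actual patch.  Note that we keep
	 * spinning even if other waiters are already queued behind us.
	 */
	while (!mutex_trylock(lock)) {
		if (!lock_owner_running(lock))
			break;		/* owner slept: fall back to waiting */
		cond_resched();		/* don't starve the owner of CPU */
		cpu_relax();
	}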
> >
> >I'd really appreciate any data, just general insight, you may have
> >acquired while implementing adaptive spinning for rt-mutexes and
> >mutexes. Open questions for me regarding conditions where adaptive
> >spinning helps are:
> >
> >o What type of lock hold times do we expect to benefit?
> >o How much contention is a good match for adaptive spinning?
> > - this is related to the number of threads to run in the test
> >o How many spinners should be allowed?
> >
The btrfs benchmarks I was doing on the mutexes had 50 processes on a 4
CPU system, and no limits on the number of spinning processes. The
locks they were hitting were btree locks that were heavily contended for
each operation.
Most of the time, btrfs is able to take the mutex, do a short operation
and release the mutex without scheduling.
-chris
Great idea. I'll be doing a more rigorous investigation on this of
course, but I thought I'd share the results of just dumping this into
the testcase:
# ./futex_lock -i10000000
futex_lock: Measure FUTEX_LOCK operations per second
Arguments: iterations=10000000 threads=256 adaptive=0
Result: 420 Kiter/s
lock calls: 9999872
lock syscalls: 665824 (6.66%)
unlock calls: 9999872
unlock syscalls: 861240 (8.61%)
# ./futex_lock -a -i10000000
futex_lock: Measure FUTEX_LOCK operations per second
Arguments: iterations=10000000 threads=256 adaptive=1
Result: 426 Kiter/s
lock calls: 9999872
lock syscalls: 558787 (5.59%)
unlock calls: 9999872
unlock syscalls: 603412 (6.03%)
This is the first time I've seen adaptive locking have an advantage! The
second set of runs showed a slightly greater advantage. Note that this
was still with spinners being limited to one.
My thanks to everyone for their insight. I'll be preparing some result
matrices and will share the patches and testcases here shortly.
A lock(); unlock(); loop spends most of its time with the lock held or
contended. Can you try something like this:
int i;

lock();
/* ~1000 iterations of busy-work inside the critical section */
for (i = 0; i < 1000; ++i)
	asm volatile ("" : : : "memory");
unlock();
/* ~10000 iterations of busy-work outside the critical section */
for (i = 0; i < 10000; ++i)
	asm volatile ("" : : : "memory");
This simulates a lock hold ratio of about 10%, with the lock hold time
exceeding the acquisition time. It will be interesting to lower both
loop bounds as well.
--
error compiling committee.c: too many arguments to function
Question - do all threads finish at the same time, or wildly different
times?
--
error compiling committee.c: too many arguments to function
--
No worries Darren, I happen to be interested in the topic too ;-)
(from KVM point of view, which has a slightly different take)
I'm not sure. I can add some fairness metrics to the test that will help
characterize how that's working. My suspicion is that there will be
several threads that don't make any progress until the very end - since
adaptive spinning is an "unfair" locking technique.
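Something as simple as per-thread completion timestamps, reported as
the spread between the first and the last finisher, should show it. A
sketch:

#include <time.h>

#define NTHREADS 256

/* Crude fairness metric: record when each thread finishes and report
 * the spread between the first and the last finisher. */
static struct timespec done[NTHREADS];

static void record_done(int tid)
{
	clock_gettime(CLOCK_MONOTONIC, &done[tid]);
}

static double finish_spread_secs(void)
{
	double min = 1e18, max = 0.0, t;
	int i;

	for (i = 0; i < NTHREADS; i++) {
		t = done[i].tv_sec + done[i].tv_nsec / 1e9;
		if (t < min)
			min = t;
		if (t > max)
			max = t;
	}
	return max - min;	/* large spread => some threads were starved */
}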
--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
Well, if the amount of unfairness differs between the tests (unfair
unfairness?) then you may see results that are not directly related to
spin vs yield. You need to make the test more self-regulating so the
results are more repeatable (yet not make it a round-robin test).
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
> Right. This was *critical* for the adaptive rtmutex. Note in the RT
> patch, everybody spins as long as the current owner is on CPU.
>
> FWIW, IIRC, Solaris has a heuristic approach where incoming tasks spin
> for a period of time before going to sleep. (Cray UNICOS did the same)
IIRC Solaris mutexes are declared either simple spin or adaptive;
an acquisition attempt on the latter checks only whether the mutex
is held and, if it is, the owner's run status, before making the
spin vs. block decision.
I don't believe mixing of the two mutex types within a given path
was permissible as a previously acquired simple spin mutex could
remain held when a subsequent adaptive mutex decided to block.
Although presumably an elevated IPL could have been sufficient to
flag that scenario.
-john
--
john....@third-harmonic.com
What did you have in mind beyond existing mechanisms
which address sibling contention?
One scenario which AFAICT isn't yet addressed is that
of a userspace spinlock holder being preempted by the
scheduler, which may result in other threads piling up
on the lock for orders of magnitude longer than normal
wait times, until the lock holder is rescheduled.
-john
--
john....@third-harmonic.com
Maybe not a good idea when running on bare metal, but it
could be a big help when running virtualized.
A lock with a short hold time can, occasionally, have a
very long hold time, when the VCPU holding the lock is
preempted by the host/hypervisor.
An adaptive lock would spin-and-acquire if the lock holder
is running, while turning into a sleep lock when the lock
holder has been preempted.
Which is precisely what the RT variant does. Each iteration of the
'spin' loop verifies that the lock owner is on CPU. If the owner is
not, all other tasks stop spinning and sleep.
So how does a timeout help in the VCPU case?
Best,
-PWM
That is an excellent example.
Another is the highly fragile performance characteristics that
sched_yield()-based spinlock implementations lead to. As sched_yield()
implementations change, the scheduling behavior of the spinning tasks
also changes. As the number of cores grows, more performance tuning is
required. sched_yield() essentially allows the spinner to spin for a
time and then get off the cpu for a time - but it has no idea about the
state of the lock owner, and it's pure chance whether the spinning task
will schedule back in at an opportune time or just add scheduling
overhead and consume CPU resources the lock owner is still trying to
acquire.
The idea here is to leverage the additional information we have in the
kernel to make more intelligent decisions about how long to spin (as
well as how many tasks should spin).
--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
Right, Steven was referring to a braindead 50us spin, regardless of the
state of the owner. This is the kind of spinning userspace spinlocks
currently implement because they have no information regarding the state
of the owner. By adding adaptive spinning to the kernel, these userspace
locks can make more intelligent spinning decisions.
--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team