On Wed, 2012-09-26 at 15:20 +0200, Andrew Jones wrote:
> Wouldn't a clean solution be to promote a task's scheduler
> class to the spinner class when we PLE (or come from some special
> syscall
> for userspace spinlocks?)?
Userspace spinlocks are typically employed to avoid syscalls..
> That class would be higher priority than the
> fair class and would schedule in FIFO order, but it would only run its
> tasks for short periods before switching.
Since lock hold times aren't limited, esp. for things like userspace
'spin' locks, you've got a very good denial of service / opportunity for
abuse right there.
On Wed, Sep 26, 2012 at 03:26:11PM +0200, Peter Zijlstra wrote:
> On Wed, 2012-09-26 at 15:20 +0200, Andrew Jones wrote:
> > Wouldn't a clean solution be to promote a task's scheduler
> > class to the spinner class when we PLE (or come from some special
> > syscall
> > for userspace spinlocks?)?
> Userspace spinlocks are typically employed to avoid syscalls..
I'm guessing there could be a slow path - spin N times and then give
up and yield.
> > That class would be higher priority than the
> > fair class and would schedule in FIFO order, but it would only run its
> > tasks for short periods before switching.
> Since lock hold times aren't limited, esp. for things like userspace
> 'spin' locks, you've got a very good denial of service / opportunity for
> abuse right there.
Maybe add some throttling to avoid overuse/maliciousness?
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2012-09-26 at 15:39 +0200, Andrew Jones wrote:
> On Wed, Sep 26, 2012 at 03:26:11PM +0200, Peter Zijlstra wrote:
> > On Wed, 2012-09-26 at 15:20 +0200, Andrew Jones wrote:
> > > Wouldn't a clean solution be to promote a task's scheduler
> > > class to the spinner class when we PLE (or come from some special
> > > syscall
> > > for userspace spinlocks?)?
> > Userspace spinlocks are typically employed to avoid syscalls..
> I'm guessing there could be a slow path - spin N times and then give
> up and yield.
Much better they should do a blocking futex call or so, once you do the
syscall you're in kernel space anyway and have paid the transition cost.
> > > That class would be higher priority than the
> > > fair class and would schedule in FIFO order, but it would only run its
> > > tasks for short periods before switching.
> > Since lock hold times aren't limited, esp. for things like userspace
> > 'spin' locks, you've got a very good denial of service / opportunity for
> > abuse right there.
> Maybe add some throttling to avoid overuse/maliciousness?
At which point you're pretty much back to where you started.
A much better approach is using things like priority inheritance, which
can be extended to cover the fair class just fine..
Also note that user-space spinning is inherently prone to live-locks
when combined with the static priority RT scheduling classes.
On Tue, Sep 25, 2012 at 10:54:21AM +0200, Avi Kivity wrote:
> On 09/25/2012 10:09 AM, Raghavendra K T wrote:
> > On 09/24/2012 09:36 PM, Avi Kivity wrote:
> >> On 09/24/2012 05:41 PM, Avi Kivity wrote:
> >>>> I agree that checking rq1 length is not proper in this case, and as
> >>>> you
> >>>> rightly pointed out, we are in trouble here.
> >>>> nr_running()/num_online_cpus() would give more accurate picture here,
> >>>> but it seemed costly. May be load balancer save us a bit here in not
> >>>> running to such sort of cases. ( I agree load balancer is far too
> >>>> complex).
> >>> In theory preempt notifier can tell us whether a vcpu is preempted or
> >>> not (except for exits to userspace), so we can keep track of whether
> >>> we're overcommitted in kvm itself. It also avoids false positives
> >>> from other guests and/or processes being overcommitted while our vm
> >>> is fine.
> >> It also allows us to cheaply skip running vcpus.
> > Hi Avi,
> > Could you please elaborate on how preempt notifiers can be used
> > here to keep track of overcommit or skip running vcpus?
> > Are we planning set some flag in sched_out() handler etc?
> Keep a bitmap kvm->preempted_vcpus.
> In sched_out, test whether we're TASK_RUNNING, and if so, set a vcpu
> flag and our bit in kvm->preempted_vcpus. On sched_in, if the flag is
> set, clear our bit in kvm->preempted_vcpus. We can also keep a counter
> of preempted vcpus.
> We can use the bitmap and the counter to quickly see if spinning is
> worthwhile (if the counter is zero, better to spin). If not, we can use
> the bitmap to select target vcpus quickly.
> The only problem is that in order to keep this accurate we need to keep
> the preempt notifiers active during exits to userspace. But we can
> prototype this without this change, and add it later if it works.
Can a user return notifier be used instead? Set bit in
kvm->preempted_vcpus on return to userspace.
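The preempted_vcpus bookkeeping quoted above can be modelled in plain userspace C. This is only a sketch under assumptions: the struct and function names are made up for illustration, the `runnable` argument stands in for the TASK_RUNNING test in sched_out, and the non-atomic bit operations would have to be replaced with atomic ones (set_bit()/clear_bit() and an atomic counter) in real kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_VCPUS 64

/* Hypothetical per-VM state; field names are invented for this sketch. */
struct vm_state {
    uint64_t preempted_vcpus;  /* bit i set => vcpu i was preempted while runnable */
    int nr_preempted;          /* count of set bits, kept for a cheap zero test */
};

/* Hypothetical per-vcpu state. */
struct vcpu_state {
    int id;
    bool preempted;            /* the per-vcpu flag from the description */
};

/* sched_out: only a vcpu that is still runnable counts as preempted. */
void vcpu_sched_out(struct vm_state *vm, struct vcpu_state *v, bool runnable)
{
    if (runnable && !v->preempted) {
        v->preempted = true;
        vm->preempted_vcpus |= UINT64_C(1) << v->id;
        vm->nr_preempted++;
    }
}

/* sched_in: undo whatever sched_out recorded for this vcpu. */
void vcpu_sched_in(struct vm_state *vm, struct vcpu_state *v)
{
    if (v->preempted) {
        v->preempted = false;
        vm->preempted_vcpus &= ~(UINT64_C(1) << v->id);
        vm->nr_preempted--;
    }
}

/* If nobody is preempted, the lock holder is running: better to keep spinning. */
bool spinning_is_worthwhile(const struct vm_state *vm)
{
    return vm->nr_preempted == 0;
}

/* Return the lowest-numbered preempted vcpu other than `self`, or -1 if none. */
int pick_yield_target(const struct vm_state *vm, int self)
{
    uint64_t candidates = vm->preempted_vcpus & ~(UINT64_C(1) << self);

    for (int i = 0; i < MAX_VCPUS; i++)
        if (candidates & (UINT64_C(1) << i))
            return i;
    return -1;
}
```

A PLE handler built on this would first call spinning_is_worthwhile() and return immediately when it is true, and only otherwise use pick_yield_target() instead of walking every vcpu.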
> On 09/24/2012 07:46 PM, Raghavendra K T wrote:
>> On 09/24/2012 07:24 PM, Peter Zijlstra wrote:
>>> On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:
>>>> However Rik had a genuine concern in the cases where runqueue is not
>>>> equally distributed and lockholder might actually be on a different run
>>>> queue but not running.
>>> Load should eventually get distributed equally -- that's what the
>>> load-balancer is for -- so this is a temporary situation.
>>> We already try and favour the non running vcpu in this case, that's what
>>> yield_to_task_fair() is about. If its still not eligible to run, tough
>>> luck.
>> Yes, I agree.
>>>> Do you think instead of using rq->nr_running, we could get a global
>>>> sense of load using avenrun (something like avenrun/num_onlinecpus)
>>> To what purpose? Also, global stuff is expensive, so you should try and
>>> stay away from it as hard as you possibly can.
>> Yes, that concern only had made me to fall back to rq->nr_running.
>> Will come back with the result soon.
> Got the result with the patches:
> So here is the result,
> Tried this on a 32 core ple box with HT disabled. 32 guest vcpus with
> 1x and 2x overcommits
> Base = 3.6.0-rc5 + ple handler optimization patches
> A = Base + checking rq_running in vcpu_on_spin() patch
> B = Base + checking rq->nr_running in sched/core
> C = Base - PLE
> ---+-----------+-----------+-----------+-----------+
>    |   Ebizzy result (rec/sec, higher is better)   |
> ---+-----------+-----------+-----------+-----------+
>    |   Base    |     A     |     B     |     C     |
> ---+-----------+-----------+-----------+-----------+
> 1x | 2374.1250 | 7273.7500 | 5690.8750 | 7364.3750 |
> 2x | 2536.2500 | 2458.5000 | 2426.3750 |   48.5000 |
> ---+-----------+-----------+-----------+-----------+
> % improvements w.r.t. Base
> ---+------------+------------+------------+
>    |     A      |     B      |     C      |
> ---+------------+------------+------------+
> 1x |  206.37603 |  139.70410 |  210.19323 |
> 2x |   -3.06555 |   -4.33218 |  -98.08773 |
> ---+------------+------------+------------+
> we are getting the benefit of almost PLE disabled case with this
> approach. With patch B, we have dropped a bit in gain.
> (because we still would iterate vcpus until we decide to do a directed
> yield).
This gives us a good case for tracking preemption on a per-vm basis. As
long as we aren't preempted, we can keep the PLE window high, and also
return immediately from the handler without looking for candidates.
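For reference, the "% improvements" rows quoted above follow from the raw Ebizzy numbers as (value - Base) / Base * 100. A minimal helper, with a function name chosen only for this illustration:

```c
/* Percentage change of `value` relative to `base`. */
double pct_improvement(double base, double value)
{
    return (value - base) / base * 100.0;
}
```

For example, pct_improvement(2374.1250, 7273.7500) gives roughly 206.376, which is column A at 1x, and pct_improvement(2536.2500, 48.5000) gives roughly -98.088, which is column C at 2x.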
-- error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
> On Tue, 25 Sep 2012 10:12:49 +0200
> Avi Kivity <a...@redhat.com> wrote:
>> It will. The tradeoff is between false-positive costs (undercommit) and
>> true positive costs (overcommit). I think undercommit should perform
>> well no matter what.
>> If we utilize preempt notifiers to track overcommit dynamically, then we
>> can vary the spin time dynamically. Keep it long initially, as we get
>> more preempted vcpus make it shorter.
> What will happen if we pin each vcpu thread to some core?
> I don't want to see so many vcpu threads moving around without
> being pinned at all.
If you do that you've removed a lot of flexibility from the scheduler,
so overcommit becomes even less likely to work well (a trivial example
is pinning two vcpus from the same vm to the same core -- it's so
obviously bad no one considers doing it).
> In that case, we don't want to make KVM do any work of searching
> a vcpu thread to yield to.
Why not? If a vcpu thread on another core has been preempted, and is
the lock holder, and we can boost it, then we've fixed our problem.
Even if the spinning thread keeps spinning because it is the only task
eligible to run on its core.
-- error compiling committee.c: too many arguments to function
> I've actually implemented this preempted_bitmap idea.
Interesting, please share the code if you can.
> However, I'm doing this to expose this information to the guest, so the
> guest is able to know if the lock holder is preempted or not before
> spinning. Right now, I'm running an experiment to show that this idea works.
> I'm wondering what do you guys think of the relationship between the
> pv_ticketlock approach and PLE handler approach. Are we going to adopt
> PLE instead of the pv ticketlock, and why?
Right now we're searching for the best solution. The tradeoffs are more
or less:
PLE:
- works for unmodified / non-Linux guests
- works for all types of spins (e.g. smp_call_function*())
- utilizes an existing hardware interface (PAUSE instruction) so likely
more robust compared to a software interface
PV:
- has more information, so it can perform better
Given these tradeoffs, if we can get PLE to work for moderate amounts of
overcommit then I'll prefer it (even if it slightly underperforms PV).
If we are unable to make it work well, then we'll have to add PV.
-- error compiling committee.c: too many arguments to function
> On Tue, Sep 25, 2012 at 10:54:21AM +0200, Avi Kivity wrote:
>> On 09/25/2012 10:09 AM, Raghavendra K T wrote:
>> > On 09/24/2012 09:36 PM, Avi Kivity wrote:
>> >> On 09/24/2012 05:41 PM, Avi Kivity wrote:
>> >>>> I agree that checking rq1 length is not proper in this case, and as
>> >>>> you
>> >>>> rightly pointed out, we are in trouble here.
>> >>>> nr_running()/num_online_cpus() would give more accurate picture here,
>> >>>> but it seemed costly. May be load balancer save us a bit here in not
>> >>>> running to such sort of cases. ( I agree load balancer is far too
>> >>>> complex).
>> >>> In theory preempt notifier can tell us whether a vcpu is preempted or
>> >>> not (except for exits to userspace), so we can keep track of whether
>> >>> we're overcommitted in kvm itself. It also avoids false positives
>> >>> from other guests and/or processes being overcommitted while our vm
>> >>> is fine.
>> >> It also allows us to cheaply skip running vcpus.
>> > Hi Avi,
>> > Could you please elaborate on how preempt notifiers can be used
>> > here to keep track of overcommit or skip running vcpus?
>> > Are we planning set some flag in sched_out() handler etc?
>> Keep a bitmap kvm->preempted_vcpus.
>> In sched_out, test whether we're TASK_RUNNING, and if so, set a vcpu
>> flag and our bit in kvm->preempted_vcpus. On sched_in, if the flag is
>> set, clear our bit in kvm->preempted_vcpus. We can also keep a counter
>> of preempted vcpus.
>> We can use the bitmap and the counter to quickly see if spinning is
>> worthwhile (if the counter is zero, better to spin). If not, we can use
>> the bitmap to select target vcpus quickly.
>> The only problem is that in order to keep this accurate we need to keep
>> the preempt notifiers active during exits to userspace. But we can
>> prototype this without this change, and add it later if it works.
> Can a user return notifier be used instead? Set bit in
> kvm->preempted_vcpus on return to userspace.
User return notifier is per-cpu, not per-task. There is a new task_work
(<linux/task_work.h>) that does what you want. With these
technicalities out of the way, I think it's the wrong idea. If a vcpu
thread is in userspace, that doesn't mean it's preempted, there's no
point in boosting it if it's already running.
btw, we can have secondary effects. A vcpu can be waiting for a lock in
the host kernel, or for a host page fault. There's no point in boosting
anything for that. Or a vcpu in userspace can be waiting for a lock
that is held by another thread, which has been preempted. This is (like
I think Peter already said) a priority inheritance problem. However
with fine-grained locking in userspace, we can make it go away. The
guest kernel is unlikely to access one device simultaneously from two
threads (and if it does, we just need to improve the threading in the
device model).
-- error compiling committee.c: too many arguments to function
On Thu, Sep 27, 2012 at 10:59:21AM +0200, Avi Kivity wrote:
> On 09/27/2012 09:44 AM, Gleb Natapov wrote:
> > On Tue, Sep 25, 2012 at 10:54:21AM +0200, Avi Kivity wrote:
> >> On 09/25/2012 10:09 AM, Raghavendra K T wrote:
> >> > On 09/24/2012 09:36 PM, Avi Kivity wrote:
> >> >> On 09/24/2012 05:41 PM, Avi Kivity wrote:
> >> >>>> I agree that checking rq1 length is not proper in this case, and as
> >> >>>> you
> >> >>>> rightly pointed out, we are in trouble here.
> >> >>>> nr_running()/num_online_cpus() would give more accurate picture here,
> >> >>>> but it seemed costly. May be load balancer save us a bit here in not
> >> >>>> running to such sort of cases. ( I agree load balancer is far too
> >> >>>> complex).
> >> >>> In theory preempt notifier can tell us whether a vcpu is preempted or
> >> >>> not (except for exits to userspace), so we can keep track of whether
> >> >>> we're overcommitted in kvm itself. It also avoids false positives
> >> >>> from other guests and/or processes being overcommitted while our vm
> >> >>> is fine.
> >> >> It also allows us to cheaply skip running vcpus.
> >> > Hi Avi,
> >> > Could you please elaborate on how preempt notifiers can be used
> >> > here to keep track of overcommit or skip running vcpus?
> >> > Are we planning set some flag in sched_out() handler etc?
> >> Keep a bitmap kvm->preempted_vcpus.
> >> In sched_out, test whether we're TASK_RUNNING, and if so, set a vcpu
> >> flag and our bit in kvm->preempted_vcpus. On sched_in, if the flag is
> >> set, clear our bit in kvm->preempted_vcpus. We can also keep a counter
> >> of preempted vcpus.
> >> We can use the bitmap and the counter to quickly see if spinning is
> >> worthwhile (if the counter is zero, better to spin). If not, we can use
> >> the bitmap to select target vcpus quickly.
> >> The only problem is that in order to keep this accurate we need to keep
> >> the preempt notifiers active during exits to userspace. But we can
> >> prototype this without this change, and add it later if it works.
> > Can a user return notifier be used instead? Set bit in
> > kvm->preempted_vcpus on return to userspace.
> User return notifier is per-cpu, not per-task. There is a new task_work
> (<linux/task_work.h>) that does what you want. With these
> technicalities out of the way, I think it's the wrong idea. If a vcpu
> thread is in userspace, that doesn't mean it's preempted, there's no
> point in boosting it if it's already running.
Ah, so you want to set bit in kvm->preempted_vcpus if task is _not_
TASK_RUNNING in sched_out (you wrote opposite in your email)? If a task
is in userspace it is definitely not preempted.
> btw, we can have secondary effects. A vcpu can be waiting for a lock in
> the host kernel, or for a host page fault. There's no point in boosting
> anything for that. Or a vcpu in userspace can be waiting for a lock
> that is held by another thread, which has been preempted.
Do you mean userspace spinlock? Because otherwise a task that waits on
a kernel lock will sleep in the kernel.
> This is (like
> I think Peter already said) a priority inheritance problem. However
> with fine-grained locking in userspace, we can make it go away. The
> guest kernel is unlikely to access one device simultaneously from two
> threads (and if it does, we just need to improve the threading in the
> device model).
> -- error compiling committee.c: too many arguments to function
>> User return notifier is per-cpu, not per-task. There is a new task_work
>> (<linux/task_work.h>) that does what you want. With these
>> technicalities out of the way, I think it's the wrong idea. If a vcpu
>> thread is in userspace, that doesn't mean it's preempted, there's no
>> point in boosting it if it's already running.
> Ah, so you want to set bit in kvm->preempted_vcpus if task is _not_
> TASK_RUNNING in sched_out (you wrote opposite in your email)? If a task
> is in userspace it is definitely not preempted.
No, as I originally wrote. If it's TASK_RUNNING when it saw sched_out,
then it is preempted (i.e. runnable), not sleeping on some waitqueue,
voluntarily (HLT) or involuntarily (page fault).
>> btw, we can have secondary effects. A vcpu can be waiting for a lock in
>> the host kernel, or for a host page fault. There's no point in boosting
>> anything for that. Or a vcpu in userspace can be waiting for a lock
>> that is held by another thread, which has been preempted.
> Do you mean userspace spinlock? Because otherwise a task that waits on
> a kernel lock will sleep in the kernel.
I meant a kernel mutex.
vcpu 0: take guest spinlock
vcpu 0: vmexit
vcpu 0: spin_lock(some_lock)
vcpu 1: take same guest spinlock
vcpu 1: PLE vmexit
vcpu 1: wtf?
Waiting on a host kernel spinlock is not too bad because we expect to be
out shortly. Waiting on a host kernel mutex can be a lot worse.
-- error compiling committee.c: too many arguments to function
> On 09/24/2012 02:02 PM, Raghavendra K T wrote:
>> On 09/24/2012 02:12 PM, Dor Laor wrote:
>>> In order to help PLE and pvticketlock converge I thought that a small
>>> test code should be developed to test this in a predictable,
>>> deterministic way.
>>> The idea is to have a guest kernel module that spawn a new thread each
>>> time you write to a /sys/.... entry.
>>> Each such a thread spins over a spin lock. The specific spin lock is
>>> also chosen by the /sys/ interface. Let's say we have an array of spin
>>> locks *10 times the amount of vcpus.
>>> All the threads are running a
>>> while (1) {
>>> spin_lock(my_lock);
>>> sum += execute_dummy_cpu_computation(time);
>>> spin_unlock(my_lock);
>>> if (sys_tells_thread_to_die()) break;
>>> }
>>> print_result(sum);
>>> Instead of calling the kernel's spin_lock functions, clone them and make
>>> the ticket lock order deterministic and known (like a linear walk of all
>>> the threads trying to catch that lock).
>> By Cloning you mean hierarchy of the locks?
> No, I meant to clone the implementation of the current spin lock code in
> order to set any order you may like for the ticket selection.
> (even for a non pvticket lock version)
> For instance, let's say you have N threads trying to grab the lock, you
> can always make the ticket go linearly from 1->2...->N.
> Not sure it's a good idea, just a recommendation.
>> Also I believe time should be passed via sysfs / hardcoded for each
>> type of lock we are mimicking
> Yap
>>> This way you can easy calculate:
>>> 1. the score of a single vcpu running a single thread
>>> 2. the score of sum of all thread scores when #thread==#vcpu all
>>> taking the same spin lock. The overall sum should be close as
>>> possible to #1.
>>> 3. Like #2 but #threads > #vcpus and other versions of #total vcpus
>>> (belonging to all VMs) > #pcpus.
>>> 4. Create #thread == #vcpus but let each thread have it's own spin
>>> lock
>>> 5. Like 4 + 2
>>> Hopefully this way will allows you to judge and evaluate the exact
>>> overhead of scheduling VMs and threads since you have the ideal result
>>> in hand and you know what the threads are doing.
>>> My 2 cents, Dor
>> Thank you,
>> I think this is an excellent idea. ( Though I am trying to put all the
>> pieces together you mentioned). So overall we should be able to measure
>> the performance of pvspinlock/PLE improvements with a deterministic
>> load in guest.
>> Only thing I am missing is,
>> How to generate different combinations of the lock.
>> Okay, let me see if I can come with a solid model for this.
> Do you mean the various options for PLE/pvticket/other? I haven't
> thought of it and assumed its static but it can also be controlled
> through the temporary /sys interface.
No, I am not there yet.
So in summary, we are suffering from inconsistent benchmark results
while measuring the benefit of our improvements in PLE/pvlock etc.
So the good points of your suggestion are:
- Giving predictability to workload that runs in guest, so that we have
pi-pi comparison of improvement.
- we can easily tune the workload via sysfs, and we can have script to
automate them.
What is complicated is:
- How can we simulate a workload close to what we measure with
benchmarks?
- How can we mimic lock holding time/ lock hierarchy close to the way
it is seen with real workloads (for e.g. highly contended zone lru lock
with similar amount of lockholding times).
- How close it would be to when we forget about other types of spinning
(for e.g, flush_tlb).
On Thu, Sep 27, 2012 at 11:33:56AM +0200, Avi Kivity wrote:
> On 09/27/2012 11:11 AM, Gleb Natapov wrote:
> >> User return notifier is per-cpu, not per-task. There is a new task_work
> >> (<linux/task_work.h>) that does what you want. With these
> >> technicalities out of the way, I think it's the wrong idea. If a vcpu
> >> thread is in userspace, that doesn't mean it's preempted, there's no
> >> point in boosting it if it's already running.
> > Ah, so you want to set bit in kvm->preempted_vcpus if task is _not_
> > TASK_RUNNING in sched_out (you wrote opposite in your email)? If a task
> > is in userspace it is definitely not preempted.
> No, as I originally wrote. If it's TASK_RUNNING when it saw sched_out,
> then it is preempted (i.e. runnable), not sleeping on some waitqueue,
> voluntarily (HLT) or involuntarily (page fault).
Of course, I got it all backwards. Need more coffee.
> >> btw, we can have secondary effects. A vcpu can be waiting for a lock in
> >> the host kernel, or for a host page fault. There's no point in boosting
> >> anything for that. Or a vcpu in userspace can be waiting for a lock
> >> that is held by another thread, which has been preempted.
> > Do you mean userspace spinlock? Because otherwise a task that waits on
> > a kernel lock will sleep in the kernel.
> I meant a kernel mutex.
> vcpu 0: take guest spinlock
> vcpu 0: vmexit
> vcpu 0: spin_lock(some_lock)
> vcpu 1: take same guest spinlock
> vcpu 1: PLE vmexit
> vcpu 1: wtf?
> Waiting on a host kernel spinlock is not too bad because we expect to be
> out shortly. Waiting on a host kernel mutex can be a lot worse.
We can't do much about it without PV spinlock since there is not
information about what vcpu holds which guest spinlock, no?
>> >> btw, we can have secondary effects. A vcpu can be waiting for a lock in
>> >> the host kernel, or for a host page fault. There's no point in boosting
>> >> anything for that. Or a vcpu in userspace can be waiting for a lock
>> >> that is held by another thread, which has been preempted.
>> > Do you mean userspace spinlock? Because otherwise a task that waits on
>> > a kernel lock will sleep in the kernel.
>> I meant a kernel mutex.
>> vcpu 0: take guest spinlock
>> vcpu 0: vmexit
>> vcpu 0: spin_lock(some_lock)
>> vcpu 1: take same guest spinlock
>> vcpu 1: PLE vmexit
>> vcpu 1: wtf?
>> Waiting on a host kernel spinlock is not too bad because we expect to be
>> out shortly. Waiting on a host kernel mutex can be a lot worse.
> We can't do much about it without PV spinlock since there is not
> information about what vcpu holds which guest spinlock, no?
It doesn't help. If the lock holder is waiting for another lock in the
host kernel, boosting it doesn't help even if we know who it is. We
need to boost the real lock holder, but we have no idea who it is (and
even if we did, we often can't do anything about it).
-- error compiling committee.c: too many arguments to function
On Thu, Sep 27, 2012 at 12:04:58PM +0200, Avi Kivity wrote:
> On 09/27/2012 11:58 AM, Gleb Natapov wrote:
> >> >> btw, we can have secondary effects. A vcpu can be waiting for a lock in
> >> >> the host kernel, or for a host page fault. There's no point in boosting
> >> >> anything for that. Or a vcpu in userspace can be waiting for a lock
> >> >> that is held by another thread, which has been preempted.
> >> > Do you mean userspace spinlock? Because otherwise a task that waits on
> >> > a kernel lock will sleep in the kernel.
> >> Waiting on a host kernel spinlock is not too bad because we expect to be
> >> out shortly. Waiting on a host kernel mutex can be a lot worse.
> > We can't do much about it without PV spinlock since there is not
> > information about what vcpu holds which guest spinlock, no?
> It doesn't help. If the lock holder is waiting for another lock in the
> host kernel, boosting it doesn't help even if we know who it is. We
> need to boost the real lock holder, but we have no idea who it is (and
> even if we did, we often can't do anything about it).
Without PV lock we will boost random preempted vcpu instead of going to
sleep in the situation you described.
> On Tue, Sep 25, 2012 at 05:00:30PM +0200, Dor Laor wrote:
>> On 09/24/2012 02:02 PM, Raghavendra K T wrote:
>>> On 09/24/2012 02:12 PM, Dor Laor wrote:
>>>> In order to help PLE and pvticketlock converge I thought that a small
>>>> test code should be developed to test this in a predictable,
>>>> deterministic way.
>>>> The idea is to have a guest kernel module that spawn a new thread each
>>>> time you write to a /sys/.... entry.
>>>> Each such a thread spins over a spin lock. The specific spin lock is
>>>> also chosen by the /sys/ interface. Let's say we have an array of spin
>>>> locks *10 times the amount of vcpus.
>>>> All the threads are running a
>>>> while (1) {
>>>> spin_lock(my_lock);
>>>> sum += execute_dummy_cpu_computation(time);
>>>> spin_unlock(my_lock);
>>>> if (sys_tells_thread_to_die()) break;
>>>> }
>>>> print_result(sum);
>>>> Instead of calling the kernel's spin_lock functions, clone them and make
>>>> the ticket lock order deterministic and known (like a linear walk of all
>>>> the threads trying to catch that lock).
>>> By Cloning you mean hierarchy of the locks?
>> No, I meant to clone the implementation of the current spin lock
>> code in order to set any order you may like for the ticket
>> selection.
>> (even for a non pvticket lock version)
> Wouldn't that defeat the purpose of trying the test the different
> implementations that try to fix the lock-holder preemption problem?
> You want something that you can shoe-in for all work-loads - also
> for this test system.
Hmm true. I think it is indeed difficult to shoe-in all workloads.
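For readers following the ticket-ordering discussion, a ticket lock can be sketched in portable C11 as below. This is a simplified userspace model written for illustration, not the kernel's arch spinlock code, and all names are made up. It shows why the order is strictly FIFO: each waiter is served only when now_serving reaches its own ticket, so a preempted waiter or holder stalls everyone queued behind it.

```c
#include <stdatomic.h>

/* Hypothetical minimal ticket lock; not the kernel implementation. */
struct ticket_lock {
    atomic_uint next_ticket;   /* next ticket to hand out */
    atomic_uint now_serving;   /* ticket currently allowed to hold the lock */
};

void ticket_lock_init(struct ticket_lock *l)
{
    atomic_init(&l->next_ticket, 0);
    atomic_init(&l->now_serving, 0);
}

/* Take a ticket and spin until it is served; returns the ticket taken. */
unsigned ticket_lock_acquire(struct ticket_lock *l)
{
    unsigned me = atomic_fetch_add_explicit(&l->next_ticket, 1,
                                            memory_order_relaxed);

    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != me)
        ;  /* a real lock would execute PAUSE (cpu_relax()) in this loop */
    return me;
}

/* Hand the lock to the next ticket in line. */
void ticket_lock_release(struct ticket_lock *l)
{
    unsigned cur = atomic_load_explicit(&l->now_serving, memory_order_relaxed);

    atomic_store_explicit(&l->now_serving, cur + 1, memory_order_release);
}
```

The PAUSE in the spin loop is what the PLE hardware counts, which is why a long wait on now_serving ends in a PLE vmexit.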
> On Thu, Sep 27, 2012 at 12:04:58PM +0200, Avi Kivity wrote:
>> On 09/27/2012 11:58 AM, Gleb Natapov wrote:
>> >> >> btw, we can have secondary effects. A vcpu can be waiting for a lock in
>> >> >> the host kernel, or for a host page fault. There's no point in boosting
>> >> >> anything for that. Or a vcpu in userspace can be waiting for a lock
>> >> >> that is held by another thread, which has been preempted.
>> >> > Do you mean userspace spinlock? Because otherwise a task that waits on
>> >> > a kernel lock will sleep in the kernel.
>> >> Waiting on a host kernel spinlock is not too bad because we expect to be
>> >> out shortly. Waiting on a host kernel mutex can be a lot worse.
>> > We can't do much about it without PV spinlock since there is not
>> > information about what vcpu holds which guest spinlock, no?
>> It doesn't help. If the lock holder is waiting for another lock in the
>> host kernel, boosting it doesn't help even if we know who it is. We
>> need to boost the real lock holder, but we have no idea who it is (and
>> even if we did, we often can't do anything about it).
> Without PV lock we will boost random preempted vcpu instead of going to
> sleep in the situation you described.
True. In theory boosting a random vcpu shouldn't have any negative
effects though. Right now the problem is that the boosting itself is
expensive.
-- error compiling committee.c: too many arguments to function
> On Mon, Sep 24, 2012 at 02:36:05PM +0200, Peter Zijlstra wrote:
>> On Mon, 2012-09-24 at 17:22 +0530, Raghavendra K T wrote:
>>> On 09/24/2012 05:04 PM, Peter Zijlstra wrote:
>>>> On Fri, 2012-09-21 at 17:29 +0530, Raghavendra K T wrote:
>>>>> In some special scenarios like #vcpu<= #pcpu, PLE handler may
>>>>> prove very costly, because there is no need to iterate over vcpus
>>>>> and do unsuccessful yield_to burning CPU.
>>>> What's the costly thing? The vm-exit, the yield (which should be a nop
>>>> if its the only task there) or something else entirely?
>>> Both vmexit and yield_to() actually,
>>> because unsuccessful yield_to() overall is costly in PLE handler.
>>> This is because when we have large guests, say 32/16 vcpus, and one
>>> vcpu is holding lock, rest of the vcpus waiting for the lock, when they
>>> do PL-exit, each of the vcpu try to iterate over rest of vcpu list in
>>> the VM and try to do directed yield (unsuccessful). (O(n^2) tries).
>>> this results is fairly high amount of cpu burning and double run queue
>>> lock contention.
>>> (if they were spinning probably lock progress would have been faster).
>>> As Avi/Chegu Vinod had felt it is better to avoid vmexit itself, which
>>> seems little complex to achieve currently.
>> OK, so the vmexit stays and we need to improve yield_to.
> Can't we do this check sooner as well, as it only requires per-cpu data?
> If we do it way back in kvm_vcpu_on_spin, then we avoid get_pid_task()
> and a bunch of read barriers from kvm_for_each_vcpu. Also, moving the test
> into kvm code would allow us to do other kvm things as a result of the
> check in order to avoid some vmexits. It looks like we should be able to
> avoid some without much complexity by just making a per-vm ple_window
> variable, and then, when we hit the nr_running == 1 condition, also doing
> vmcs_write32(PLE_WINDOW, (kvm->ple_window += PLE_WINDOW_BUMP))
> Reset the window to the default value when we successfully yield (and
> maybe we should limit the number of bumps).
We did indeed check early in the original undercommit patch, and it gave
a result closer to the PLE-disabled case. But I agree with Peter that it
is ugly to export nr_running info to the PLE handler.
Looking at the result and comparing result of A and C,
> Base = 3.6.0-rc5 + ple handler optimization patches
> A = Base + checking rq_running in vcpu_on_spin() patch
> B = Base + checking rq->nr_running in sched/core
> C = Base - PLE
> % improvements w.r.t BASE
> ---+------------+------------+------------+
>    |     A      |     B      |     C      |
> ---+------------+------------+------------+
> 1x |  206.37603 |  139.70410 |  210.19323 |
I have a feeling that the vmexit has not caused significant overhead
compared to iterating over vcpus in the PLE handler. Does that sound right?
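The per-VM ple_window idea quoted above (bump the window when the
nr_running == 1 condition is hit, reset it after a successful yield,
and limit the number of bumps) could be sketched roughly as below. This
is only a standalone model of the policy, not KVM code: the struct, the
helper names, and the constant values are all assumptions made for
illustration, and the returned value stands in for what would be passed
to vmcs_write32(PLE_WINDOW, ...).

```c
#include <stdint.h>

/* Illustrative values only; none of these are taken from the thread. */
#define PLE_WINDOW_DEFAULT   4096u
#define PLE_WINDOW_BUMP      4096u
#define PLE_WINDOW_MAX_BUMPS 4u

/* Hypothetical per-VM state for the adaptive window. */
struct vm_ple_state {
    uint32_t ple_window; /* current window value */
    uint32_t bumps;      /* bumps applied since the last reset */
};

static inline void ple_state_init(struct vm_ple_state *s)
{
    s->ple_window = PLE_WINDOW_DEFAULT;
    s->bumps = 0;
}

/*
 * PLE exit taken while this CPU's runqueue holds a single task: there is
 * nobody to yield to, so widen the window to make further exits less
 * likely. The number of bumps is capped.
 */
static inline uint32_t ple_on_undercommit(struct vm_ple_state *s)
{
    if (s->bumps < PLE_WINDOW_MAX_BUMPS) {
        s->ple_window += PLE_WINDOW_BUMP;
        s->bumps++;
    }
    return s->ple_window;
}

/* A directed yield succeeded, so we are overcommitted: back to default. */
static inline uint32_t ple_on_successful_yield(struct vm_ple_state *s)
{
    ple_state_init(s);
    return s->ple_window;
}
```

Capping the bumps keeps the window from growing without bound if the
host later becomes overcommitted and exits are wanted again.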
On Thu, Sep 27, 2012 at 03:19:45PM +0530, Raghavendra K T wrote:
> On 09/25/2012 08:30 PM, Dor Laor wrote:
> >On 09/24/2012 02:02 PM, Raghavendra K T wrote:
> >>On 09/24/2012 02:12 PM, Dor Laor wrote:
> >>>In order to help PLE and pvticketlock converge I thought that a small
> >>>test code should be developed to test this in a predictable,
> >>>deterministic way.
> >>>The idea is to have a guest kernel module that spawn a new thread each
> >>>time you write to a /sys/.... entry.
> >>>Each such a thread spins over a spin lock. The specific spin lock is
> >>>also chosen by the /sys/ interface. Let's say we have an array of spin
> >>>locks *10 times the amount of vcpus.
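> >>>All the threads are running a
> >>>while (1) {
> >>>spin_lock(my_lock);
> >>>sum += execute_dummy_cpu_computation(time);
> >>>spin_unlock(my_lock);
> >>>if (sys_tells_thread_to_die()) break;
> >>>}
> >>>print_result(sum);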
> >>>Instead of calling the kernel's spin_lock functions, clone them and make
> >>>the ticket lock order deterministic and known (like a linear walk of all
> >>>the threads trying to catch that lock).
> >>By Cloning you mean hierarchy of the locks?
> >No, I meant to clone the implementation of the current spin lock code in
> >order to set any order you may like for the ticket selection.
> >(even for a non pvticket lock version)
> >For instance, let's say you have N threads trying to grab the lock, you
> >can always make the ticket go linearly from 1->2...->N.
> >Not sure it's a good idea, just a recommendation.
> >>Also I believe time should be passed via sysfs / hardcoded for each
> >>type of lock we are mimicking
> >Yap
> >>>This way you can easy calculate:
> >>>1. the score of a single vcpu running a single thread
> >>>2. the score of sum of all thread scores when #thread==#vcpu all
> >>>taking the same spin lock. The overall sum should be close as
> >>>possible to #1.
> >>>3. Like #2 but #threads > #vcpus and other versions of #total vcpus
> >>>(belonging to all VMs) > #pcpus.
> >>>4. Create #thread == #vcpus but let each thread have it's own spin
> >>>lock
> >>>5. Like 4 + 2
> >>>Hopefully this way will allows you to judge and evaluate the exact
> >>>overhead of scheduling VMs and threads since you have the ideal result
> >>>in hand and you know what the threads are doing.
> >>>My 2 cents, Dor
> >>Thank you,
> >>I think this is an excellent idea. ( Though I am trying to put all the
> >>pieces together you mentioned). So overall we should be able to measure
> >>the performance of pvspinlock/PLE improvements with a deterministic
> >>load in guest.
> >>Only thing I am missing is,
> >>How to generate different combinations of the lock.
> >>Okay, let me see if I can come with a solid model for this.
> >Do you mean the various options for PLE/pvticket/other? I haven't
> >thought of it and assumed its static but it can also be controlled
> >through the temporary /sys interface.
> No, I am not there yet.
> So In summary, we are suffering with inconsistent benchmark result,
> while measuring the benefit of our improvement in PLE/pvlock etc..
Are you measuring the combined throughput of all running guests, or
just looking at the results of the benchmarks in a single test guest?
I've done some benchmarking as well and my stddevs look pretty good for
kcbench, ebizzy, dbench, and sysbench-memory. I do 5 runs for each
overcommit level (1.0 - 3.0, stepped by .25 or .5), and 2 runs of that
full sequence of tests (one with the overcommit levels in scrambled
order). The relative stddevs for each of the sets of 5 runs look pretty
good, and the data for the 2 runs match nicely as well.
To try and get consistent results I do the following
- interleave the memory of all guests across all numa nodes on the
machine
- echo 0 > /proc/sys/kernel/randomize_va_space on both host and test
guest
- echo 3 > /proc/sys/vm/drop_caches on both host and test guest before
each run
- use a ramdisk for the benchmark output files on all running guests
- no periodically running services installed on the test guest
- HT is turned off as you do, although I'd like to try running again
with it turned back on
Although, I still need to run again measuring the combined throughput
of all running vms (including the ones launched just to generate busy
vcpus). Maybe my results won't be as consistent then...
> So good point from your suggestion is,
> - Giving predictability to workload that runs in guest, so that we have
> pi-pi comparison of improvement.
> - we can easily tune the workload via sysfs, and we can have script to
> automate them.
> What is complicated is:
> - How can we simulate a workload close to what we measure with
> benchmarks?
> - How can we mimic lock holding time/ lock hierarchy close to the way
> it is seen with real workloads (for e.g. highly contended zone lru lock
> with similar amount of lockholding times).
> - How close it would be to when we forget about other types of spinning
> (for e.g, flush_tlb).
> So I feel it is not as trivial as it looks like.
> On 09/25/2012 08:30 PM, Dor Laor wrote:
>> On 09/24/2012 02:02 PM, Raghavendra K T wrote:
>>> On 09/24/2012 02:12 PM, Dor Laor wrote:
>>>> In order to help PLE and pvticketlock converge I thought that a small
>>>> test code should be developed to test this in a predictable,
>>>> deterministic way.
>>>> The idea is to have a guest kernel module that spawn a new thread each
>>>> time you write to a /sys/.... entry.
>>>> Each such a thread spins over a spin lock. The specific spin lock is
>>>> also chosen by the /sys/ interface. Let's say we have an array of spin
>>>> locks *10 times the amount of vcpus.
>>>> All the threads are running a
>>>> while (1) {
>>>> spin_lock(my_lock);
>>>> sum += execute_dummy_cpu_computation(time);
>>>> spin_unlock(my_lock);
>>>> if (sys_tells_thread_to_die()) break;
>>>> }
>>>> print_result(sum);
>>>> Instead of calling the kernel's spin_lock functions, clone them and
>>>> make
>>>> the ticket lock order deterministic and known (like a linear walk of
>>>> all
>>>> the threads trying to catch that lock).
>>> By Cloning you mean hierarchy of the locks?
>> No, I meant to clone the implementation of the current spin lock code in
>> order to set any order you may like for the ticket selection.
>> (even for a non pvticket lock version)
>> For instance, let's say you have N threads trying to grab the lock, you
>> can always make the ticket go linearly from 1->2...->N.
>> Not sure it's a good idea, just a recommendation.
>>> Also I believe time should be passed via sysfs / hardcoded for each
>>> type of lock we are mimicking
>> Yap
>>>> This way you can easy calculate:
>>>> 1. the score of a single vcpu running a single thread
>>>> 2. the score of sum of all thread scores when #thread==#vcpu all
>>>> taking the same spin lock. The overall sum should be close as
>>>> possible to #1.
>>>> 3. Like #2 but #threads > #vcpus and other versions of #total vcpus
>>>> (belonging to all VMs) > #pcpus.
>>>> 4. Create #thread == #vcpus but let each thread have it's own spin
>>>> lock
>>>> 5. Like 4 + 2
>>>> Hopefully this way will allows you to judge and evaluate the exact
>>>> overhead of scheduling VMs and threads since you have the ideal result
>>>> in hand and you know what the threads are doing.
>>>> My 2 cents, Dor
>>> Thank you,
>>> I think this is an excellent idea. ( Though I am trying to put all the
>>> pieces together you mentioned). So overall we should be able to measure
>>> the performance of pvspinlock/PLE improvements with a deterministic
>>> load in guest.
>>> Only thing I am missing is,
>>> How to generate different combinations of the lock.
>>> Okay, let me see if I can come with a solid model for this.
>> Do you mean the various options for PLE/pvticket/other? I haven't
>> thought of it and assumed its static but it can also be controlled
>> through the temporary /sys interface.
> No, I am not there yet.
> So In summary, we are suffering with inconsistent benchmark result,
> while measuring the benefit of our improvement in PLE/pvlock etc..
> So good point from your suggestion is,
> - Giving predictability to workload that runs in guest, so that we have
> pi-pi comparison of improvement.
> - we can easily tune the workload via sysfs, and we can have script to
> automate them.
> What is complicated is:
> - How can we simulate a workload close to what we measure with
> benchmarks?
> - How can we mimic lock holding time/ lock hierarchy close to the way
> it is seen with real workloads (for e.g. highly contended zone lru lock
> with similar amount of lockholding times).
You can spin for a similar instruction count that you're interested in.
> - How close it would be to when we forget about other types of spinning
> (for e.g, flush_tlb).
> So I feel it is not as trivial as it looks like.
Indeed, this is mainly a tool that can serve to optimize a few synthetic
workloads.
I still believe that it is worth going through this exercise, since a
100% predictable and controlled case can help us assess the state of the
PLE and pvticket code in isolation. Otherwise we're dealing w/ too many
parameters and assumptions at once.
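As a rough illustration of the "deterministic ticket order" suggested
above, here is a minimal sketch using C11 atomics. It is not a clone of
the kernel's spinlock code; the names (det_ticket_lock, det_ticket,
det_is_my_turn, det_lock, det_unlock) are hypothetical. Instead of
handing out tickets in arrival order, each thread computes its own
ticket from its index, so the lock always passes 0 -> 1 -> ... -> N-1
and then wraps around.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock whose hand-off order is fixed by thread index. */
struct det_ticket_lock {
    atomic_uint owner;   /* ticket currently allowed to take the lock */
    unsigned nthreads;   /* number of threads sharing this lock */
};

static inline void det_init(struct det_ticket_lock *l, unsigned nthreads)
{
    atomic_init(&l->owner, 0);
    l->nthreads = nthreads;
}

/* Ticket for thread `tid` on its `round`-th acquisition (both from 0). */
static inline unsigned det_ticket(const struct det_ticket_lock *l,
                                  unsigned tid, unsigned round)
{
    return round * l->nthreads + tid;
}

/* True when `ticket` is the one currently being served. */
static inline bool det_is_my_turn(struct det_ticket_lock *l, unsigned ticket)
{
    return atomic_load_explicit(&l->owner, memory_order_acquire) == ticket;
}

/* Spin until it is this ticket's turn. */
static inline void det_lock(struct det_ticket_lock *l, unsigned ticket)
{
    while (!det_is_my_turn(l, ticket))
        ; /* busy-wait; a real guest would issue PAUSE here */
}

/* Pass the lock to the next ticket in the fixed order. */
static inline void det_unlock(struct det_ticket_lock *l)
{
    atomic_fetch_add_explicit(&l->owner, 1, memory_order_release);
}
```

Note that with a fixed order every thread must keep taking its turn, or
the threads behind it wait forever; that is acceptable for a synthetic
test but is one reason this would not be a general-purpose lock.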
>> So In summary, we are suffering with inconsistent benchmark result,
>> while measuring the benefit of our improvement in PLE/pvlock etc..
> Are you measuring the combined throughput of all running guests, or
> just looking at the results of the benchmarks in a single test guest?
> I've done some benchmarking as well and my stddevs look pretty good for
> kcbench, ebizzy, dbench, and sysbench-memory. I do 5 runs for each
> overcommit level (1.0 - 3.0, stepped by .25 or .5), and 2 runs of that
> full sequence of tests (one with the overcommit levels in scrambled
> order). The relative stddevs for each of the sets of 5 runs look pretty
> good, and the data for the 2 runs match nicely as well.
> To try and get consistent results I do the following
> - interleave the memory of all guests across all numa nodes on the
> machine
> - echo 0 > /proc/sys/kernel/randomize_va_space on both host and test
> guest
> - echo 3 > /proc/sys/vm/drop_caches on both host and test guest before
> each run
> - use a ramdisk for the benchmark output files on all running guests
> - no periodically running services installed on the test guest
> - HT is turned off as you do, although I'd like to try running again
> with it turned back on
> Although, I still need to run again measuring the combined throughput
> of all running vms (including the ones launched just to generate busy
> vcpus). Maybe my results won't be as consistent then...
Another way to test is to execute
perf stat -e 'kvm_exit exit_reason==40' sleep 10
to see how many PAUSEs were intercepted in a given time (except I just
invented the filter syntax). The fewer we get, the more useful work the
system does. This ignores kvm_vcpu_on_spin overhead though, so it's
just a rough measure.
-- error compiling committee.c: too many arguments to function
> On 09/25/2012 03:40 PM, Raghavendra K T wrote:
>> On 09/24/2012 07:46 PM, Raghavendra K T wrote:
>>> On 09/24/2012 07:24 PM, Peter Zijlstra wrote:
>>>> On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:
>>>>> However Rik had a genuine concern in the cases where runqueue is not
>>>>> equally distributed and lockholder might actually be on a different run
>>>>> queue but not running.
>>>> Load should eventually get distributed equally -- that's what the
>>>> load-balancer is for -- so this is a temporary situation.
>>>> We already try and favour the non running vcpu in this case, that's what
>>>> yield_to_task_fair() is about. If its still not eligible to run, tough
>>>> luck.
>>> Yes, I agree.
>>>>> Do you think instead of using rq->nr_running, we could get a global
>>>>> sense of load using avenrun (something like avenrun/num_onlinecpus)
>>>> To what purpose? Also, global stuff is expensive, so you should try and
>>>> stay away from it as hard as you possibly can.
>>> Yes, that concern only had made me to fall back to rq->nr_running.
>>> Will come back with the result soon.
>> Got the result with the patches:
>> So here is the result,
>> Tried this on a 32 core ple box with HT disabled. 32 guest vcpus with
>> 1x and 2x overcommits
>> Base = 3.6.0-rc5 + ple handler optimization patches
>> A = Base + checking rq_running in vcpu_on_spin() patch
>> B = Base + checking rq->nr_running in sched/core
>> C = Base - PLE
>> ---+-----------+-----------+-----------+-----------+
>>    |   Ebizzy result (rec/sec, higher is better)   |
>> ---+-----------+-----------+-----------+-----------+
>>    |   Base    |     A     |     B     |     C     |
>> ---+-----------+-----------+-----------+-----------+
>> 1x | 2374.1250 | 7273.7500 | 5690.8750 | 7364.3750 |
>> 2x | 2536.2500 | 2458.5000 | 2426.3750 |   48.5000 |
>> ---+-----------+-----------+-----------+-----------+
>> % improvements w.r.t BASE
>> ---+------------+------------+------------+
>>    |     A      |     B      |     C      |
>> ---+------------+------------+------------+
>> 1x |  206.37603 |  139.70410 |  210.19323 |
>> 2x |   -3.06555 |   -4.33218 |  -98.08773 |
>> ---+------------+------------+------------+
>> we are getting the benefit of almost PLE disabled case with this
>> approach. With patch B, we have dropped a bit in gain.
>> (because we still would iterate vcpus until we decide to do a directed
>> yield).
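For reference, the "% improvements w.r.t BASE" figures follow from the
ebizzy numbers as (value - base) / base * 100. A small helper (the name
pct_improvement is made up here) shows the arithmetic:

```c
/*
 * Percentage change of `value` relative to `base`. For a
 * higher-is-better metric such as ebizzy records/sec, a positive
 * result is an improvement and a negative one is a regression.
 */
static inline double pct_improvement(double base, double value)
{
    return (value - base) / base * 100.0;
}
```

For example, pct_improvement(2374.1250, 7273.7500) gives about 206.376
(column A at 1x), and pct_improvement(2536.2500, 48.5000) gives about
-98.088 (column C at 2x).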
> This gives us a good case for tracking preemption on a per-vm basis. As
> long as we aren't preempted, we can keep the PLE window high, and also
> return immediately from the handler without looking for candidates.
1) So do you think the deferring-preemption patch (which Vatsa mentioned
long back) is also worth trying, so that we reduce the chance of LHP?
IIRC, with deferred preemption:
we will have a hook in the spinlock/unlock path to measure the depth of
locks held, shared with the host scheduler (maybe via MSRs now).
The host scheduler 'prefers' not to preempt a lock-holding vcpu (or
rather gives it, say, one chance).
2) Looking at the result (comparing A & C), I do feel we have
significant overhead in iterating over vcpus (even when compared to the
vmexit), so we would still need the undercommit fix suggested by PeterZ
(improving by 140%)?
Looking back at the threads so far, I am trying to summarize the
discussion. I feel that at least the following are potential candidates
to go in:
So here are my action items:
- I plan to repost this series with what PeterZ and Rik suggested, along
with performance analysis.
- I'll go back and explore (3) and (6) ..
> On 09/25/2012 04:43 PM, Jiannan Ouyang wrote:
>> I've actually implemented this preempted_bitmap idea.
> Interesting, please share the code if you can.
>> However, I'm doing this to expose this information to the guest, so the
>> guest is able to know if the lock holder is preempted or not before
>> spining. Right now, I'm doing experiment to show that this idea works.
>> I'm wondering what do you guys think of the relationship between the
>> pv_ticketlock approach and PLE handler approach. Are we going to adopt
>> PLE instead of the pv ticketlock, and why?
> Right now we're searching for the best solution. The tradeoffs are more
> or less:
> PLE:
> - works for unmodified / non-Linux guests
> - works for all types of spins (e.g. smp_call_function*())
> - utilizes an existing hardware interface (PAUSE instruction) so likely
> more robust compared to a software interface
> PV:
> - has more information, so it can perform better
Should we also consider that PV always has an edge here on non-PLE
machines?
> Given these tradeoffs, if we can get PLE to work for moderate amounts of
> overcommit then I'll prefer it (even if it slightly underperforms PV).
> If we are unable to make it work well, then we'll have to add PV.
Avi,
Thanks for this summary. It is a great help in proceeding in the right
direction.
> On Thu, Sep 27, 2012 at 03:19:45PM +0530, Raghavendra K T wrote:
>> On 09/25/2012 08:30 PM, Dor Laor wrote:
>>> On 09/24/2012 02:02 PM, Raghavendra K T wrote:
>>>> On 09/24/2012 02:12 PM, Dor Laor wrote:
>>>>> In order to help PLE and pvticketlock converge I thought that a small
>>>>> test code should be developed to test this in a predictable,
>>>>> deterministic way.
>>>>> The idea is to have a guest kernel module that spawn a new thread each
>>>>> time you write to a /sys/.... entry.
>>>>> Each such a thread spins over a spin lock. The specific spin lock is
>>>>> also chosen by the /sys/ interface. Let's say we have an array of spin
>>>>> locks *10 times the amount of vcpus.
>>>>> All the threads are running a
>>>>> while (1) {
>>>>> spin_lock(my_lock);
>>>>> sum += execute_dummy_cpu_computation(time);
>>>>> spin_unlock(my_lock);
>>>>> if (sys_tells_thread_to_die()) break;
>>>>> }
>>>>> print_result(sum);
>>>>> Instead of calling the kernel's spin_lock functions, clone them and make
>>>>> the ticket lock order deterministic and known (like a linear walk of all
>>>>> the threads trying to catch that lock).
>>>> By Cloning you mean hierarchy of the locks?
>>> No, I meant to clone the implementation of the current spin lock code in
>>> order to set any order you may like for the ticket selection.
>>> (even for a non pvticket lock version)
>>> For instance, let's say you have N threads trying to grab the lock, you
>>> can always make the ticket go linearly from 1->2...->N.
>>> Not sure it's a good idea, just a recommendation.
>>>> Also I believe time should be passed via sysfs / hardcoded for each
>>>> type of lock we are mimicking
>>> Yap
>>>>> This way you can easy calculate:
>>>>> 1. the score of a single vcpu running a single thread
>>>>> 2. the score of sum of all thread scores when #thread==#vcpu all
>>>>> taking the same spin lock. The overall sum should be close as
>>>>> possible to #1.
>>>>> 3. Like #2 but #threads > #vcpus and other versions of #total vcpus
>>>>> (belonging to all VMs) > #pcpus.
>>>>> 4. Create #thread == #vcpus but let each thread have it's own spin
>>>>> lock
>>>>> 5. Like 4 + 2
>>>>> Hopefully this way will allows you to judge and evaluate the exact
>>>>> overhead of scheduling VMs and threads since you have the ideal result
>>>>> in hand and you know what the threads are doing.
>>>>> My 2 cents, Dor
>>>> Thank you,
>>>> I think this is an excellent idea. ( Though I am trying to put all the
>>>> pieces together you mentioned). So overall we should be able to measure
>>>> the performance of pvspinlock/PLE improvements with a deterministic
>>>> load in guest.
>>>> Only thing I am missing is,
>>>> How to generate different combinations of the lock.
>>>> Okay, let me see if I can come with a solid model for this.
>>> Do you mean the various options for PLE/pvticket/other? I haven't
>>> thought of it and assumed its static but it can also be controlled
>>> through the temporary /sys interface.
>> No, I am not there yet.
>> So In summary, we are suffering with inconsistent benchmark result,
>> while measuring the benefit of our improvement in PLE/pvlock etc..
> Are you measuring the combined throughput of all running guests, or
> just looking at the results of the benchmarks in a single test guest?
> I've done some benchmarking as well and my stddevs look pretty good for
> kcbench, ebizzy, dbench, and sysbench-memory. I do 5 runs for each
> overcommit level (1.0 - 3.0, stepped by .25 or .5), and 2 runs of that
> full sequence of tests (one with the overcommit levels in scrambled
> order). The relative stddevs for each of the sets of 5 runs look pretty
> good, and the data for the 2 runs match nicely as well.
> To try and get consistent results I do the following
> - interleave the memory of all guests across all numa nodes on the
> machine
> - echo 0 > /proc/sys/kernel/randomize_va_space on both host and test
> guest
I was not doing this.
> - echo 3 > /proc/sys/vm/drop_caches on both host and test guest before
> each run
was doing already as you know
> - use a ramdisk for the benchmark output files on all running guests
Yes.. this is also helpful
> - no periodically running services installed on the test guest
> - HT is turned off as you do, although I'd like to try running again
> with it turned back on
> Although, I still need to run again measuring the combined throughput
> of all running vms (including the ones launched just to generate busy
> vcpus). Maybe my results won't be as consistent then...
>> So good point from your suggestion is,
>> - Giving predictability to workload that runs in guest, so that we have
>> pi-pi comparison of improvement.
>> - we can easily tune the workload via sysfs, and we can have script to
>> automate them.
>> What is complicated is:
>> - How can we simulate a workload close to what we measure with
>> benchmarks?
>> - How can we mimic lock holding time/ lock hierarchy close to the way
>> it is seen with real workloads (for e.g. highly contended zone lru lock
>> with similar amount of lockholding times).
>> - How close it would be to when we forget about other types of spinning
>> (for e.g, flush_tlb).
>> So I feel it is not as trivial as it looks like.
>> This gives us a good case for tracking preemption on a per-vm basis. As
>> long as we aren't preempted, we can keep the PLE window high, and also
>> return immediately from the handler without looking for candidates.
> 1) So do you think, deferring preemption patch ( Vatsa was mentioning
> long back) is also another thing worth trying, so we reduce the chance
> of LHP.
Yes, we have to keep it in mind. It will be useful for fine-grained
locks, not so much for coarse locks or IPIs.
I would still of course prefer a PLE solution, but if we can't get it to
work we can consider preemption deferral.
> IIRC, with defer preemption :
> we will have hook in spinlock/unlock path to measure depth of lock held,
> and shared with host scheduler (may be via MSRs now).
> Host scheduler 'prefers' not to preempt lock holding vcpu. (or rather
> give say one chance.
A downside is that we have to do that even when undercommitted.
Also there may be a lot of false positives (deferred preemptions even
when there is no contention).
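To make the deferral idea concrete, here is a minimal model of the
scheme as described in the quoted text: the guest tracks lock depth in
an area shared with the host, and the host gives a lock-holding vcpu
one extra chance before preempting it. Everything here (the struct,
the helper names, the use of a plain shared struct rather than MSRs) is
an assumption for illustration, not an existing interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-vcpu area visible to both guest and host. */
struct vcpu_shared {
    uint32_t lock_depth;  /* number of spinlocks the vcpu holds */
    bool deferred_once;   /* host already granted one deferral */
};

/* Guest side: hooks in the lock and unlock paths. */
static inline void guest_lock_hook(struct vcpu_shared *v)
{
    v->lock_depth++;
}

static inline void guest_unlock_hook(struct vcpu_shared *v)
{
    v->lock_depth--;
}

/*
 * Host side: called when the scheduler wants to preempt the vcpu.
 * Returns false (defer) once while a lock is held, true otherwise.
 */
static inline bool host_should_preempt(struct vcpu_shared *v)
{
    if (v->lock_depth > 0 && !v->deferred_once) {
        v->deferred_once = true;  /* the "one chance" */
        return false;
    }
    v->deferred_once = false;     /* re-arm for the next time slice */
    return true;
}
```

The model also shows both downsides mentioned above: the hooks run on
every lock and unlock even when undercommitted, and a held but
uncontended lock still triggers a deferral.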
> 2) looking at the result (comparing A & C) , I do feel we have
> significant in iterating over vcpus (when compared to even vmexit)
> so We still would need undercommit fix sugested by PeterZ (improving by
> 140%). ?
Looking only at the current runqueue? My worry is that it misses a lot
of cases. Maybe try the current runqueue first and then others.
> So looking back at threads/ discussions so far, I am trying to
> summarize, the discussions so far. I feel, at least here are the few
> potential candidates to go in:
> So here are my action items:
> - I plan to repost this series with what PeterZ, Rik suggested with
> performance analysis.
> - I ll go back and explore on (3) and (6) ..
> Please Let me know..
Undoubtedly we'll think of more stuff. But this looks like a good start.
-- error compiling committee.c: too many arguments to function
> On 09/27/2012 02:20 PM, Avi Kivity wrote:
>> On 09/25/2012 04:43 PM, Jiannan Ouyang wrote:
>>> I've actually implemented this preempted_bitmap idea.
>> Interesting, please share the code if you can.
>>> However, I'm doing this to expose this information to the guest, so the
>>> guest is able to know if the lock holder is preempted or not before
>>> spining. Right now, I'm doing experiment to show that this idea works.
>>> I'm wondering what do you guys think of the relationship between the
>>> pv_ticketlock approach and PLE handler approach. Are we going to adopt
>>> PLE instead of the pv ticketlock, and why?
>> Right now we're searching for the best solution. The tradeoffs are more
>> or less:
>> PLE:
>> - works for unmodified / non-Linux guests
>> - works for all types of spins (e.g. smp_call_function*())
>> - utilizes an existing hardware interface (PAUSE instruction) so likely
>> more robust compared to a software interface
>> PV:
>> - has more information, so it can perform better
> Should we also consider that we always have an edge here for non-PLE
> machine?
True. The deployment share for these is decreasing rapidly though. I
hate optimizing for obsolete hardware.
-- error compiling committee.c: too many arguments to function