
Re: [kernel.org users] [KORG] Panics on master backend


Peter Zijlstra

Aug 23, 2011, 4:00:03 PM
On Tue, 2011-08-23 at 11:09 -0700, J.H. wrote:
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145813.702389] <<EOE>> <IRQ> [<ffffffff81049469>] try_to_wake_up+0x89/0x1ca
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145813.709536] [<ffffffff810495d3>] ? wake_up_process+0x15/0x17
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145813.715422] [<ffffffff810495bc>] default_wake_function+0x12/0x14
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145813.721659] [<ffffffff8103d165>] __wake_up_common+0x4e/0x84
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145813.727489] [<ffffffffa0177066>] ? _xfs_buf_ioend+0x2f/0x36 [xfs]
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145813.733852] [<ffffffff810428b0>] complete+0x3f/0x52
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145813.738985] [<ffffffffa0177017>] xfs_buf_ioend+0xc2/0xe2 [xfs]
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145813.745079] [<ffffffffa0177066>] _xfs_buf_ioend+0x2f/0x36 [xfs]
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145813.751259] [<ffffffffa017727a>] xfs_buf_bio_end_io+0x2a/0x37 [xfs]
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145813.757758] [<ffffffff8114f684>] bio_endio+0x2d/0x2f
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145813.762945] [<ffffffff81229cd1>] req_bio_endio.clone.31+0x9e/0xa7
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145813.772811] [<ffffffff81229e6b>] blk_update_request+0x191/0x32e
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145813.782563] [<ffffffff8122a029>] blk_update_bidi_request+0x21/0x6d
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145813.792529] [<ffffffff8122a180>] blk_end_bidi_request+0x1f/0x5d
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145813.802230] [<ffffffff8122a1f8>] blk_end_request+0x10/0x12
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145813.811446] [<ffffffff81313300>] scsi_io_completion+0x1e2/0x4d8
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145813.821002] [<ffffffff814d47e7>] ? _raw_spin_unlock_irqrestore+0x17/0x19
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145813.831405] [<ffffffff8130bf80>] scsi_finish_command+0xe8/0xf1
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145813.840948] [<ffffffff8131305f>] scsi_softirq_done+0x109/0x112
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145813.850485] [<ffffffff8122e3c8>] blk_done_softirq+0x73/0x87
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145813.859729] [<ffffffff8105757a>] __do_softirq+0xc9/0x1b6
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145813.868705] [<ffffffff81014e01>] ? paravirt_read_tsc+0x9/0xd
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145813.878093] [<ffffffff814dd2ac>] call_softirq+0x1c/0x30
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145813.887025] [<ffffffff81010c5e>] do_softirq+0x46/0x84
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145813.895767] [<ffffffff81057849>] irq_exit+0x52/0xaf
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145813.904246] [<ffffffff814ddc21>] smp_apic_timer_interrupt+0x7c/0x8a
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145813.914171] [<ffffffff814dbb1e>] apic_timer_interrupt+0x6e/0x80
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145813.923618] <EOI> [<ffffffff81080526>] ? arch_local_irq_restore+0x6/0xd
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145813.933756] [<ffffffff814d47e7>] _raw_spin_unlock_irqrestore+0x17/0x19
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145813.943630] [<ffffffffa00446c4>] hpsa_scsi_queue_command+0x2f9/0x319 [hpsa]
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145813.953851] [<ffffffff8130c9bb>] scsi_dispatch_cmd+0x18c/0x251
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145813.962917] [<ffffffff81312d70>] scsi_request_fn+0x3cd/0x3f9
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145813.971781] [<ffffffff81225214>] __blk_run_queue+0x1b/0x1d
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145813.980439] [<ffffffff8123b637>] cfq_insert_request+0x4cf/0x504
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145813.989469] [<ffffffff812248d4>] __elv_add_request+0x19c/0x1e7
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145813.998481] [<ffffffff8122a934>] blk_flush_plug_list+0x166/0x19b
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145814.007670] [<ffffffff814d2ce4>] schedule+0x2a3/0x6dc
> 20110823.log:Aug 23 16:49:53 console1 port05 RXDATA: [145814.015903] [<ffffffff814d35e1>] schedule_timeout+0x36/0xe3

Whee, you must be real 'lucky' to hit this so reliably.

Both panics included the above bit, so what I think happens is that we
release rq->lock while calling that blk_flush_plug muck; it however then
goes and re-enables interrupts *FAIL*

An interrupt happens and it (or in fact a pending softirq) then tries
to wake the current task, which is in the process of going to sleep.
Since the task is still on the cpu we spin until it's gone (assuming it's
a remote task, since interrupts aren't supposed to happen here).

Now I can fudge around this, see below, but really the whole
block/scsi/hpsa crap should *NOT* re-enable interrupts here.

It looks like the offender is:

scsi_request_fn(), which does:

/*
* XXX(hch): This is rather suboptimal, scsi_dispatch_cmd will
* take the lock again.
*/
spin_unlock_irq(shost->host_lock);

---
kernel/sched.c | 9 +++------
1 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index ccacdbd..30d9788 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2630,7 +2630,6 @@ static void ttwu_queue_remote(struct task_struct *p, int cpu)
smp_send_reschedule(cpu);
}

-#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
static int ttwu_activate_remote(struct task_struct *p, int wake_flags)
{
struct rq *rq;
@@ -2647,7 +2646,6 @@ static int ttwu_activate_remote(struct task_struct *p, int wake_flags)
return ret;

}
-#endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
#endif /* CONFIG_SMP */

static void ttwu_queue(struct task_struct *p, int cpu)
@@ -2705,7 +2703,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
* this task as prev, wait until its done referencing the task.
*/
while (p->on_cpu) {
-#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
/*
* In case the architecture enables interrupts in
* context_switch(), we cannot busy wait, since that
@@ -2713,11 +2710,11 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
* tries to wake up @prev. So bail and do a complete
* remote wakeup.
*/
- if (ttwu_activate_remote(p, wake_flags))
+ if (cpu == smp_processor_id() &&
+ ttwu_activate_remote(p, wake_flags))
goto stat;
-#else
+
cpu_relax();
-#endif
}
/*
* Pairs with the smp_wmb() in finish_lock_switch().


James Bottomley

Aug 23, 2011, 5:40:02 PM
Cc: linux-scsi added

Um, that one looks like a mistake missed in cleanup (probably years
ago). The other host_lock acquisitions aren't _irq, so this one
shouldn't be either. I can patch it up (or you can).

James

Peter Zijlstra

Aug 24, 2011, 6:10:02 AM
On Tue, 2011-08-23 at 16:32 -0500, James Bottomley wrote:
> > scsi_request_fn(), which does:
> >
> > /*
> > * XXX(hch): This is rather suboptimal, scsi_dispatch_cmd will
> > * take the lock again.
> > */
> > spin_unlock_irq(shost->host_lock);
>
> Um, that one looks like a mistake missed in cleanup (probably years
> ago). The other host_lock acquisitions aren't _irq, so this one
> shouldn't be either. I can patch it up (or you can).

I would much appreciate it if someone who knows that whole stack would
audit it; my preferred solution is removing that blk plug stuff from the
scheduler guts, since clearly nobody bothered making sure it's sane.


Something like the three patches below, lifted from -rt:

sched-separate-the-scheduler-entry-for-preemption.patch
sched-move-blk_schedule_flush_plug-out-of-__schedule.patch
block-shorten-interrupt-disabled-regions.patch


Oleg Nesterov

Aug 24, 2011, 12:20:02 PM
Looking at the next emails, I guess this is already off-topic, but still...

I think this needs "task_cpu(p) == smp_processor_id()". We can't trust
"cpu", task_cpu() was called before ->on_rq check.

This task_cpu() looks really confusing imho, even if it is fine (afaics).
Perhaps it makes sense to do

--- x/kernel/sched.c
+++ x/kernel/sched.c
@@ -2694,10 +2694,11 @@ try_to_wake_up(struct task_struct *p, un
goto out;

success = 1; /* we're going to change ->state */
- cpu = task_cpu(p);

- if (p->on_rq && ttwu_remote(p, wake_flags))
+ if (p->on_rq && ttwu_remote(p, wake_flags)) {
+ cpu = task_cpu(p); /* for ttwu_stat() */
goto stat;
+ }

#ifdef CONFIG_SMP
/*

to make this more clear. Or even the patch below, I dunno.

Oleg.

--- x/kernel/sched.c
+++ x/kernel/sched.c
@@ -2446,13 +2446,14 @@ static void update_avg(u64 *avg, u64 sam
#endif

static void
-ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
+ttwu_stat(struct task_struct *p, int wake_flags)
{
#ifdef CONFIG_SCHEDSTATS
struct rq *rq = this_rq();

#ifdef CONFIG_SMP
int this_cpu = smp_processor_id();
+ int cpu = task_cpu(p);

if (cpu == this_cpu) {
schedstat_inc(rq, ttwu_local);
@@ -2694,7 +2695,6 @@ try_to_wake_up(struct task_struct *p, un
goto out;

success = 1; /* we're going to change ->state */
- cpu = task_cpu(p);

if (p->on_rq && ttwu_remote(p, wake_flags))
goto stat;
@@ -2739,7 +2739,7 @@ try_to_wake_up(struct task_struct *p, un

ttwu_queue(p, cpu);
stat:
- ttwu_stat(p, cpu, wake_flags);
+ ttwu_stat(p, wake_flags);
out:
raw_spin_unlock_irqrestore(&p->pi_lock, flags);

@@ -2775,7 +2775,7 @@ static void try_to_wake_up_local(struct
ttwu_activate(rq, p, ENQUEUE_WAKEUP);

ttwu_do_wakeup(rq, p, 0);
- ttwu_stat(p, smp_processor_id(), 0);
+ ttwu_stat(p, 0);
out:
raw_spin_unlock(&p->pi_lock);

Peter Zijlstra

Aug 25, 2011, 6:30:01 AM

Isn't us holding ->pi_lock sufficient to stabilize task_cpu()? If it's a
running task the initial ->state check would have failed, and thus it's a
proper wakeup when we get here and thus ->pi_lock is serializing things.

> This task_cpu() looks really confusing imho, even if it is fine (afaics).
> Perhaps it makes sense to do
>
> --- x/kernel/sched.c
> +++ x/kernel/sched.c
> @@ -2694,10 +2694,11 @@ try_to_wake_up(struct task_struct *p, un
> goto out;
>
> success = 1; /* we're going to change ->state */
> - cpu = task_cpu(p);
>
> - if (p->on_rq && ttwu_remote(p, wake_flags))
> + if (p->on_rq && ttwu_remote(p, wake_flags)) {
> + cpu = task_cpu(p); /* for ttwu_stat() */
> goto stat;
> + }
>
> #ifdef CONFIG_SMP
> /*

Would result in the same problem as below...

> to make this more clear. Or even the patch below, I dunno.
>
> Oleg.
>
> --- x/kernel/sched.c
> +++ x/kernel/sched.c
> @@ -2446,13 +2446,14 @@ static void update_avg(u64 *avg, u64 sam
> #endif
>
> static void
> -ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
> +ttwu_stat(struct task_struct *p, int wake_flags)
> {
> #ifdef CONFIG_SCHEDSTATS
> struct rq *rq = this_rq();
>
> #ifdef CONFIG_SMP
> int this_cpu = smp_processor_id();
> + int cpu = task_cpu(p);
>
> if (cpu == this_cpu) {
> schedstat_inc(rq, ttwu_local);
> @@ -2694,7 +2695,6 @@ try_to_wake_up(struct task_struct *p, un
> goto out;
>
> success = 1; /* we're going to change ->state */
> - cpu = task_cpu(p);
>
> if (p->on_rq && ttwu_remote(p, wake_flags))
> goto stat;
> @@ -2739,7 +2739,7 @@ try_to_wake_up(struct task_struct *p, un
>
> ttwu_queue(p, cpu);

Both suggestions result in the above cpu possibly being used
uninitialized for SMP=n.

Oleg Nesterov

Aug 25, 2011, 10:00:01 AM
On 08/25, Peter Zijlstra wrote:
>
> On Wed, 2011-08-24 at 18:08 +0200, Oleg Nesterov wrote:
> > >
> > > static void ttwu_queue(struct task_struct *p, int cpu)
> > > @@ -2705,7 +2703,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> > > * this task as prev, wait until its done referencing the task.
> > > */
> > > while (p->on_cpu) {
> > > -#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
> > > /*
> > > * In case the architecture enables interrupts in
> > > * context_switch(), we cannot busy wait, since that
> > > @@ -2713,11 +2710,11 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> > > * tries to wake up @prev. So bail and do a complete
> > > * remote wakeup.
> > > */
> > > - if (ttwu_activate_remote(p, wake_flags))
> > > + if (cpu == smp_processor_id() &&
> >
> > I think this needs "task_cpu(p) == smp_processor_id()". We can't trust
> > "cpu", task_cpu() was called before ->on_rq check.
>
> Isn't us holding ->pi_lock sufficient to stabilize task_cpu()? If its a
> running task the initial ->state check would have failed,

Of course it is not TASK_RUNNING, but it can be running or not.

> and thus its a
> proper wakeup when we get here and thus ->pi_lock is serializing things.

I am not sure. If ->on_rq is true, we need rq->lock. Say, pull_task() can
change its cpu.

> > --- x/kernel/sched.c
> > +++ x/kernel/sched.c
> > @@ -2694,10 +2694,11 @@ try_to_wake_up(struct task_struct *p, un
> > goto out;
> >
> > success = 1; /* we're going to change ->state */
> > - cpu = task_cpu(p);
> >
> > - if (p->on_rq && ttwu_remote(p, wake_flags))
> > + if (p->on_rq && ttwu_remote(p, wake_flags)) {
> > + cpu = task_cpu(p); /* for ttwu_stat() */
> > goto stat;
> > + }
> >
> > #ifdef CONFIG_SMP
> > /*
>
> Would result in the same problem as below...

I see, thanks.

Yes, ttwu_queue(p, cpu) needs this task_cpu() without CONFIG_SMP.

Oleg.

Yong Zhang

Aug 26, 2011, 2:10:02 AM
On Thu, Aug 25, 2011 at 03:54:29PM +0200, Oleg Nesterov wrote:
> On 08/25, Peter Zijlstra wrote:
> >
> > On Wed, 2011-08-24 at 18:08 +0200, Oleg Nesterov wrote:
> > > >
> > > > static void ttwu_queue(struct task_struct *p, int cpu)
> > > > @@ -2705,7 +2703,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> > > > * this task as prev, wait until its done referencing the task.
> > > > */
> > > > while (p->on_cpu) {
> > > > -#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
> > > > /*
> > > > * In case the architecture enables interrupts in
> > > > * context_switch(), we cannot busy wait, since that
> > > > @@ -2713,11 +2710,11 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> > > > * tries to wake up @prev. So bail and do a complete
> > > > * remote wakeup.
> > > > */
> > > > - if (ttwu_activate_remote(p, wake_flags))
> > > > + if (cpu == smp_processor_id() &&
> > >
> > > I think this needs "task_cpu(p) == smp_processor_id()". We can't trust
> > > "cpu", task_cpu() was called before ->on_rq check.
> >
> > Isn't us holding ->pi_lock sufficient to stabilize task_cpu()? If its a
> > running task the initial ->state check would have failed,
>
> Of course it is not TASK_RUNNING, but it can be running or not.

Yup. Before we go beyond ttwu_remote() in ttwu(), 'cpu' is not safe.
For example, wait_event() could be preempted in between.

But after we go beyond ttwu_remote(), ->pi_lock will stabilize it.

So if we take Oleg's suggestion ("task_cpu(p) == smp_processor_id()"),
what's left is just how to account the stat correctly.

IMHO, we could fetch the cpu in ttwu_remote() to avoid the side effect of
pull_task().

Something like below?

Thanks,
Yong

---
diff --git a/kernel/sched.c b/kernel/sched.c
index ccacdbd..4a1d05d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2540,7 +2540,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags)
* since all we need to do is flip p->state to TASK_RUNNING, since
* the task is still ->on_rq.
*/
-static int ttwu_remote(struct task_struct *p, int wake_flags)
+static int ttwu_remote(struct task_struct *p, int wake_flags, int *cpu)
{
struct rq *rq;
int ret = 0;
@@ -2548,6 +2548,7 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
rq = __task_rq_lock(p);
if (p->on_rq) {
ttwu_do_wakeup(rq, p, wake_flags);
+ *cpu = task_cpu(p);
ret = 1;
}
__task_rq_unlock(rq);
@@ -2696,7 +2697,12 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)

success = 1; /* we're going to change ->state */
cpu = task_cpu(p);

- if (p->on_rq && ttwu_remote(p, wake_flags))
+ /*
+ * read cpu for another time if ttwu_remote() success,
+ * just to prevent task migration in between, otherwise
+ * we maybe account stat incorrectly.
+ */
+ if (p->on_rq && ttwu_remote(p, wake_flags, &cpu))
goto stat;

#ifdef CONFIG_SMP

Oleg Nesterov

Aug 26, 2011, 10:10:02 AM
On 08/26, Yong Zhang wrote:
>
> On Thu, Aug 25, 2011 at 03:54:29PM +0200, Oleg Nesterov wrote:
> >
> > Of course it is not TASK_RUNNING, but it can be running or not.
>
> Yup. Before we go beyond ttwu_remote() in ttwu(), 'cpu' is not safe.
> For example, wait_event() could be preempted in between.
>
> But after we go beyond ttwu_remote(), ->pi_lock will stabilize it.

Yes.

> So after we take Oleg's suggestion("task_cpu(p) == smp_processor_id()"),
> things we left is just how to account stat correctly.

Imho, we don't really care. This race is very unlikely, and I think
that the "wrong" cpu argument in ttwu_stat() is harmless.

My only point was that this "cpu = task_cpu(p)" looks confusing, as if we
could trust it below, during the actual wakeup.

> @@ -2696,7 +2697,12 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> success = 1; /* we're going to change ->state */
> cpu = task_cpu(p);
>
> - if (p->on_rq && ttwu_remote(p, wake_flags))
> + /*
> + * read cpu for another time if ttwu_remote() success,
> + * just to prevent task migration in between, otherwise
> + * we maybe account stat incorrectly.
> + */
> + if (p->on_rq && ttwu_remote(p, wake_flags, &cpu))

I don't think this makes things better. p->on_rq can already be
false, or ttwu_remote() can fail; in either case we still use the result
of the initial task_cpu().

Oleg.

Yong Zhang

Aug 28, 2011, 10:30:01 PM
On Fri, Aug 26, 2011 at 03:57:39PM +0200, Oleg Nesterov wrote:
> On 08/26, Yong Zhang wrote:
> >
> > On Thu, Aug 25, 2011 at 03:54:29PM +0200, Oleg Nesterov wrote:
> > >
> > > Of course it is not TASK_RUNNING, but it can be running or not.
> >
> > Yup. Before we go beyond ttwu_remote() in ttwu(), 'cpu' is not safe.
> > For example, wait_event() could be preempted in between.
> >
> > But after we go beyond ttwu_remote(), ->pi_lock will stabilize it.
>
> Yes.
>
> > So after we take Oleg's suggestion("task_cpu(p) == smp_processor_id()"),
> > things we left is just how to account stat correctly.
>
> Imho, we don't really care. This race is very unlikely,

Yup.

> and I think
> that the "wrong" cpu argument in ttwu_stat() is harmless.

Hmm, the affected accounting is sched_domain->ttwu_wake_remote.

> > @@ -2696,7 +2697,12 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> > success = 1; /* we're going to change ->state */
> > cpu = task_cpu(p);
> >
> > - if (p->on_rq && ttwu_remote(p, wake_flags))
> > + /*
> > + * read cpu for another time if ttwu_remote() success,
> > + * just to prevent task migration in between, otherwise
> > + * we maybe account stat incorrectly.
> > + */
> > + if (p->on_rq && ttwu_remote(p, wake_flags, &cpu))
>
> I don't think this makes the things better. p->on_rq can be already
> false or ttwu_remote() can fail, in this case we still use the result
> of initial task_cpu().

Ah, my code (ttwu_remote(p, wake_flags, &cpu)) only catches the case where
the task is dancing around with 'p->on_rq == true'.

But it misses the case where the task has already danced away, ending up
with 'p->on_rq == false'.

So we should reread task_cpu(p) if 'p->on_cpu == true &&
__ARCH_WANT_INTERRUPTS_ON_CTXSW == y', but the code would be a little ugly,
and it only aims to account the stat more correctly.
We need some balance here :)
BTW, I don't think we should care much about the 'incorrect stat' either,
as long as we don't make an intolerable mistake.

Thanks,
Yong

Peter Zijlstra

Aug 29, 2011, 9:10:02 AM
On Thu, 2011-08-25 at 15:54 +0200, Oleg Nesterov wrote:
> > Isn't us holding ->pi_lock sufficient to stabilize task_cpu()? If its a
> > running task the initial ->state check would have failed,
>
> Of course it is not TASK_RUNNING, but it can be running or not.
>
> > and thus its a
> > proper wakeup when we get here and thus ->pi_lock is serializing things.
>
> I am not sure. If ->on_rq is true, we need rq->lock. Say, pull_task() can
> change its cpu.

If it's !TASK_RUNNING but ->on_rq is true, it must be current, and
pull_task() will never move current around.

Oleg Nesterov

Aug 29, 2011, 10:50:02 AM
On 08/29, Peter Zijlstra wrote:
>
> On Thu, 2011-08-25 at 15:54 +0200, Oleg Nesterov wrote:
> > > Isn't us holding ->pi_lock sufficient to stabilize task_cpu()? If its a
> > > running task the initial ->state check would have failed,
> >
> > Of course it is not TASK_RUNNING, but it can be running or not.
> >
> > > and thus its a
> > > proper wakeup when we get here and thus ->pi_lock is serializing things.
> >
> > I am not sure. If ->on_rq is true, we need rq->lock. Say, pull_task() can
> > change its cpu.
>
> If its !TASK_RUNNING but ->on_rq is true, it must be current, and
> pull_task() will never move current around.

You meant task_running() ? But it can be preempted in !TASK_RUNNING
state.

Oleg.
