[PATCH RT] rt: Make cpu_chill() use hrtimer instead of msleep()

Steven Rostedt

unread,

Feb 5, 2014, 12:00:02 PM2/5/14

to

Ulrich Obergfell pointed out that cpu_chill() calls msleep() which is woken
up by the ksoftirqd running the TIMER softirq. But as the cpu_chill() is
called from softirq context, it may block the ksoftirqd() from running, in
which case, it may never wake up the msleep() causing the deadlock.

I checked the vmcore, and irq/74-qla2xxx is stuck in the msleep() call,
running on CPU 8. The one ksoftirqd that is stuck, happens to be the one that
runs on CPU 8, and it is blocked on a lock held by irq/74-qla2xxx. As that
ksoftirqd is the one that will wake up irq/74-qla2xxx, and it happens to be
blocked on a lock that irq/74-qla2xxx holds, we have our deadlock.

The solution is not to convert the cpu_chill() back to a cpu_relax() as that
will re-create a possible live lock that the cpu_chill() fixed earlier, and may
also leave this bug open on other softirqs. The fix is to remove the
dependency on ksoftirqd from cpu_chill(). That is, instead of calling
msleep() that requires ksoftirqd to wake it up, use the
hrtimer_nanosleep() code that does the wakeup from hard irq context.

Found-by: Ulrich Obergfell <uobe...@redhat.com>
Signed-off-by: Steven Rostedt <ros...@goodmis.org>

diff --git a/include/linux/delay.h b/include/linux/delay.h
index e23a7c0..37caab3 100644
--- a/include/linux/delay.h
+++ b/include/linux/delay.h
@@ -53,7 +53,7 @@ static inline void ssleep(unsigned int seconds)
}

#ifdef CONFIG_PREEMPT_RT_FULL
-# define cpu_chill() msleep(1)
+extern void cpu_chill(void);
#else
# define cpu_chill() cpu_relax()
#endif
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index aa5eb4f..2f023aa 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1852,6 +1852,21 @@ SYSCALL_DEFINE2(nanosleep, struct timespec __user *, rqtp,
return hrtimer_nanosleep(&tu, rmtp, HRTIMER_MODE_REL, CLOCK_MONOTONIC);
}

+#ifdef CONFIG_PREEMPT_RT_FULL
+/*
+ * Sleep for 1 ms in hope whoever holds what we want will let it go.
+ */
+void cpu_chill(void)
+{
+ struct timespec tu = {
+ .tv_nsec = NSEC_PER_MSEC,
+ };
+
+ hrtimer_nanosleep(&tu, NULL, HRTIMER_MODE_REL, CLOCK_MONOTONIC);
+}
+EXPORT_SYMBOL(cpu_chill);
+#endif
+
/*
* Functions related to boot-time initialization:
*/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Steven Rostedt

unread,

Feb 5, 2014, 12:00:04 PM2/5/14

to

On Wed, 5 Feb 2014 11:51:25 -0500
Steven Rostedt <ros...@goodmis.org> wrote:

>
> Ulrich Obergfell pointed out that cpu_chill() calls msleep() which is woken
> up by the ksoftirqd running the TIMER softirq. But as the cpu_chill() is
> called from softirq context, it may block the ksoftirqd() from running, in
> which case, it may never wake up the msleep() causing the deadlock.
>
> I checked the vmcore, and irq/74-qla2xxx is stuck in the msleep() call,
> running on CPU 8. The one ksoftirqd that is stuck, happens to be the one that
> runs on CPU 8, and it is blocked on a lock held by irq/74-qla2xxx. As that
> ksoftirqd is the one that will wake up irq/74-qla2xxx, and it happens to be
> blocked on a lock that irq/74-qla2xxx holds, we have our deadlock.
>
> The solution is not to convert the cpu_chill() back to a cpu_relax() as that
> will re-create a possible live lock that the cpu_chill() fixed earlier, and may
> also leave this bug open on other softirqs. The fix is to remove the
> dependency on ksoftirqd from cpu_chill(). That is, instead of calling
> msleep() that requires ksoftirqd to wake it up, use the
> hrtimer_nanosleep() code that does the wakeup from hard irq context.
>
> Found-by: Ulrich Obergfell <uobe...@redhat.com>
> Signed-off-by: Steven Rostedt <ros...@goodmis.org>
>

I should have added:

Cc: stab...@vger.kernel.org

-- Steve

Sebastian Andrzej Siewior

unread,

Feb 7, 2014, 6:40:05 AM2/7/14

to

* Steven Rostedt | 2014-02-05 11:51:25 [-0500]:

>Ulrich Obergfell pointed out that cpu_chill() calls msleep() which is woken
>up by the ksoftirqd running the TIMER softirq. But as the cpu_chill() is
>called from softirq context, it may block the ksoftirqd() from running, in
>which case, it may never wake up the msleep() causing the deadlock.
>
>I checked the vmcore, and irq/74-qla2xxx is stuck in the msleep() call,
>running on CPU 8. The one ksoftirqd that is stuck, happens to be the one that
>runs on CPU 8, and it is blocked on a lock held by irq/74-qla2xxx. As that
>ksoftirqd is the one that will wake up irq/74-qla2xxx, and it happens to be
>blocked on a lock that irq/74-qla2xxx holds, we have our deadlock.

could you please tell me which two locks are invovled here?

Sebastian

Steven Rostedt

unread,

Feb 7, 2014, 9:10:03 AM2/7/14

to

On Fri, 7 Feb 2014 12:30:06 +0100
Sebastian Andrzej Siewior <big...@linutronix.de> wrote:

> * Steven Rostedt | 2014-02-05 11:51:25 [-0500]:
>
> >Ulrich Obergfell pointed out that cpu_chill() calls msleep() which is woken
> >up by the ksoftirqd running the TIMER softirq. But as the cpu_chill() is
> >called from softirq context, it may block the ksoftirqd() from running, in
> >which case, it may never wake up the msleep() causing the deadlock.
> >
> >I checked the vmcore, and irq/74-qla2xxx is stuck in the msleep() call,
> >running on CPU 8. The one ksoftirqd that is stuck, happens to be the one that
> >runs on CPU 8, and it is blocked on a lock held by irq/74-qla2xxx. As that
> >ksoftirqd is the one that will wake up irq/74-qla2xxx, and it happens to be
> >blocked on a lock that irq/74-qla2xxx holds, we have our deadlock.
>
> could you please tell me which two locks are invovled here?
>

Looks to be the lock of the block softirq. I don't have the core dump
anymore, but from what I could tell the ksoftirqd was blocked on the
block softirq lock, where the block softirq handler did a msleep
(called by the qla2xxx interrupt handler).

Looking at trigger_softirq() in block/blk-softirq.c, it can do a
smp_callfunction() to another cpu to run the block softirq. If that
happens to be the cpu where the qla2xx irq handler is doing the block
softirq and is in a middle of a msleep(), I believe the ksoftirqd will
try to run the softirq. If it does that, then BOOM, it's deadlocked
because the ksoftirqd will never run the timer softirq either.

-- Steve

Steven Rostedt

unread,

Feb 7, 2014, 9:20:02 AM2/7/14

to

On Fri, 7 Feb 2014 09:08:34 -0500
Steven Rostedt <ros...@goodmis.org> wrote:

> > could you please tell me which two locks are invovled here?

I should have also stated that it was only one lock that was involved.
But the lock owner was doing a msleep() that requires a wakeup by
ksoftirqd to continue. If ksoftirqd happens to be blocked on a lock
held by the msleep() caller, then you have your deadlock.

It's best not to have any softirqs going to sleep requiring another
softirq to wake it up. Note, if we ever require a timer softirq to do a
cpu_chill() it will most definitely hit this deadlock.

Sebastian Andrzej Siewior

unread,

Feb 7, 2014, 9:30:02 AM2/7/14

to

On 02/07/2014 03:13 PM, Steven Rostedt wrote:
> On Fri, 7 Feb 2014 09:08:34 -0500
> Steven Rostedt <ros...@goodmis.org> wrote:
>
>
>>> could you please tell me which two locks are invovled here?
>
> I should have also stated that it was only one lock that was involved.
> But the lock owner was doing a msleep() that requires a wakeup by
> ksoftirqd to continue. If ksoftirqd happens to be blocked on a lock
> held by the msleep() caller, then you have your deadlock.

That makes sense.

> It's best not to have any softirqs going to sleep requiring another
> softirq to wake it up. Note, if we ever require a timer softirq to do a
> cpu_chill() it will most definitely hit this deadlock.

Yes. And that sleep in softirq is also not really nice but this isn't
new. Thanks for the patch & explanation.
>
> -- Steve

Sebastian

Sebastian Andrzej Siewior

unread,

Feb 19, 2014, 7:00:02 AM2/19/14

to

Since we replaced msleep() by hrtimer I see now and then (rarely) this:

| [....] Waiting for /dev to be fully populated...
| =====================================
| [ BUG: udevd/229 still has locks held! ]
| 3.12.11-rt17 #23 Not tainted
| -------------------------------------
| 1 lock held by udevd/229:
| #0: (&type->i_mutex_dir_key#2){+.+.+.}, at: lookup_slow+0x28/0x98
|
| stack backtrace:
| CPU: 0 PID: 229 Comm: udevd Not tainted 3.12.11-rt17 #23
| (unwind_backtrace+0x0/0xf8) from (show_stack+0x10/0x14)
| (show_stack+0x10/0x14) from (dump_stack+0x74/0xbc)
| (dump_stack+0x74/0xbc) from (do_nanosleep+0x120/0x160)
| (do_nanosleep+0x120/0x160) from (hrtimer_nanosleep+0x90/0x110)
| (hrtimer_nanosleep+0x90/0x110) from (cpu_chill+0x30/0x38)
| (cpu_chill+0x30/0x38) from (dentry_kill+0x158/0x1ec)
| (dentry_kill+0x158/0x1ec) from (dput+0x74/0x15c)
| (dput+0x74/0x15c) from (lookup_real+0x4c/0x50)
| (lookup_real+0x4c/0x50) from (__lookup_hash+0x34/0x44)
| (__lookup_hash+0x34/0x44) from (lookup_slow+0x38/0x98)
| (lookup_slow+0x38/0x98) from (path_lookupat+0x208/0x7fc)
| (path_lookupat+0x208/0x7fc) from (filename_lookup+0x20/0x60)
| (filename_lookup+0x20/0x60) from (user_path_at_empty+0x50/0x7c)
| (user_path_at_empty+0x50/0x7c) from (user_path_at+0x14/0x1c)
| (user_path_at+0x14/0x1c) from (vfs_fstatat+0x48/0x94)
| (vfs_fstatat+0x48/0x94) from (SyS_stat64+0x14/0x30)
| (SyS_stat64+0x14/0x30) from (ret_fast_syscall+0x0/0x48)

For now I see no better way but to disable the freezer for the sleep period.

Signed-off-by: Sebastian Andrzej Siewior <big...@linutronix.de>
---
kernel/hrtimer.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 5c26d2c..083815d 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1899,8 +1899,12 @@ void cpu_chill(void)
struct timespec tu = {
.tv_nsec = NSEC_PER_MSEC,
};
+ unsigned int freeze_flag = current->flags & PF_NOFREEZE;

+ current->flags |= PF_NOFREEZE;
hrtimer_nanosleep(&tu, NULL, HRTIMER_MODE_REL, CLOCK_MONOTONIC);
+ if (!freeze_flag)
+ current->flags &= ~PF_NOFREEZE;
}
EXPORT_SYMBOL(cpu_chill);
#endif
--
1.9.0.rc3

Steven Rostedt

unread,

Feb 19, 2014, 10:30:01 AM2/19/14

to

On Wed, 19 Feb 2014 12:53:53 +0100

Sebastian Andrzej Siewior <big...@linutronix.de> wrote:

Hmm, I wonder if this is because msleep() sets the tasks state to
TASK_UNINTERRUPTIBLE, and hrtimer_nanosleep sets it to
TASK_INTERRUPTIBLE.

>
> Signed-off-by: Sebastian Andrzej Siewior <big...@linutronix.de>
> ---
> kernel/hrtimer.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
> index 5c26d2c..083815d 100644
> --- a/kernel/hrtimer.c
> +++ b/kernel/hrtimer.c
> @@ -1899,8 +1899,12 @@ void cpu_chill(void)
> struct timespec tu = {
> .tv_nsec = NSEC_PER_MSEC,
> };
> + unsigned int freeze_flag = current->flags & PF_NOFREEZE;
>
> + current->flags |= PF_NOFREEZE;
> hrtimer_nanosleep(&tu, NULL, HRTIMER_MODE_REL, CLOCK_MONOTONIC);
> + if (!freeze_flag)
> + current->flags &= ~PF_NOFREEZE;

We could remove the branch with:

current->flags &= ~PF_NOFREEZE | freeze_flag;

But as we just slept for a millisecond, that optimization may not be
worth anything ;-)

Acked-by: Steven Rostedt <ros...@goodmis.org>

-- Steve

> }
> EXPORT_SYMBOL(cpu_chill);
> #endif

Sebastian Andrzej Siewior

unread,

Feb 19, 2014, 10:50:01 AM2/19/14

to

On 02/19/2014 04:27 PM, Steven Rostedt wrote:

> Hmm, I wonder if this is because msleep() sets the tasks state to
> TASK_UNINTERRUPTIBLE, and hrtimer_nanosleep sets it to
> TASK_INTERRUPTIBLE.

No, it is because freezable_schedule() is called via do_nanosleep() in
hrtimer call path and msleep() calls schedule() without it.

Sebastian