I'm debugging a problem where pulseaudio is getting killed with a SIGKILL
out of the blue. It appears to be a problem where pulseaudio sets
RLIMIT_RTTIME and the bound gets exceeded. Analysis with 'top' shows
a short spike of 96% system time, and the tail end of strace shows this:
[pid 25065] 01:50:20.371484 ioctl(28, USBDEVFS_CONTROL, 0x7fd3d76f630c) = 0 <0.000015>
[pid 25065] 01:50:20.371548 ioctl(28, 0x40045532, 0x7fd3d76f636c) = 0 <0.000016>
[pid 25065] 01:50:20.371611 open("/dev/snd/pcmC0D0p", O_RDWR|O_NONBLOCK|O_CLOEXEC <unfinished ...>
[pid 25064] 01:50:20.371678 <... write resumed> ) = 8 <0.002104>
[pid 25064] 01:50:20.371718 futex(0xc2ec00, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 25066] 01:50:21.408392 +++ killed by SIGKILL +++
PANIC: handle_group_exit: 25066 leader 25064
[pid 25065] 01:50:21.408442 +++ killed by SIGKILL +++
PANIC: handle_group_exit: 25065 leader 25064
01:50:21.420354 +++ killed by SIGKILL +++
thread 25064 apparently gets gunned down due to RTTIME because it spent a whole
second in a futex() call - is it reasonable for futex() to not return for that
long?
In other words - kernel bug because futex() should return, or pulseaudio bug
for not understanding futex() can snooze a while?
If a kernel bug, anybody got a better idea than nuking the RLIMIT_RTTIME call,
waiting for it to repeat (takes between 1 minute and 1 hour or so), and
whomping it a few times with sysrq-T?
is that second spend in processing sysrq-t?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
No, currently that second is spent in a futex() syscall - I'm wondering:
1) should it get killed for RLIMIT_RTTIME because it's been in a futex()
for multiple seconds? It seems suspicious - docs say a blocking syscall
resets RTTIME - so if futex() blocks it shouldn't kill, and if it's in the
kernel for a second without blocking it's a bug too.
2) Is sysrq-T my best bet here, or should I be trying something else first?
> 1) should it get killed for RLIMIT_RTTIME because it's been in a futex()
> for multiple seconds? It seems suspicious - docs say a blocking syscall
> resets RTTIME - so if futex() blocks it shouldn't kill, and if it's in the
> kernel for a second without blocking it's a bug too.
sys_futex() shouldn't be taking that long indeed, and if its not caused
by the sysrq interrupt then something funny is going on indeed.
You could poke around in kernel/futex.c:futex_wait() to see where it
thinks its spending it time? There's a single retry loop in there, maybe
instrument that to see if its retrying too many times..
Well, it's in futex_wait(). If nothing unlocks the futex, then it
stays there forever.
> > > In other words - kernel bug because futex() should return, or pulseaudio bug
> > > for not understanding futex() can snooze a while?
> > >
> > > If a kernel bug, anybody got a better idea than nuking the RLIMIT_RTTIME call,
> > > waiting for it to repeat (takes between 1 minute and 1 hour or so), and
> > > whomping it a few times with sysrq-T?
> >
> > is that second spend in processing sysrq-t?
>
> No, currently that second is spent in a futex() syscall - I'm wondering:
>
> 1) should it get killed for RLIMIT_RTTIME because it's been in a futex()
> for multiple seconds? It seems suspicious - docs say a blocking syscall
> resets RTTIME - so if futex() blocks it shouldn't kill, and if it's in the
> kernel for a second without blocking it's a bug too.
If it schedules out, then the RLIMIT_RTTIME should not be hit.
There are several possibilities why this happens:
- the futex code has a bug which causes it to busy loop
- the rlimit code is wreckaged
> 2) Is sysrq-T my best bet here, or should I be trying something else first?
Can you enable the function tracer and check whether it reproduces
with the function tracer. If yes, then we can put a tracing_off() into
the code which handles the rlimit, so we can see in the trace what
happened before it triggered.
Thanks,
tglx
So it received the signal and the task waiting in
sys_futex(FUTEX_WAIT_PRIVATE...) wakes up to handle the signal and
return right? What do you see here that suggests the task is spinning
inside the futex syscall?
This is a non-rt kernel right? So spinning could occur on the hash
bucket lock if something else is really horked. That seems unlikely.
> [pid 25066] 01:50:21.408392 +++ killed by SIGKILL +++
> PANIC: handle_group_exit: 25066 leader 25064
> [pid 25065] 01:50:21.408442 +++ killed by SIGKILL +++
> PANIC: handle_group_exit: 25065 leader 25064
> 01:50:21.420354 +++ killed by SIGKILL +++
>
> thread 25064 apparently gets gunned down due to RTTIME because it spent a whole
> second in a futex() call - is it reasonable for futex() to not return for that
> long?
Certainly if a corresponding FUTEX_WAKE op was not sent.
--
Darrem
> In other words - kernel bug because futex() should return, or pulseaudio bug
> for not understanding futex() can snooze a while?
> If a kernel bug, anybody got a better idea than nuking the RLIMIT_RTTIME call,
> waiting for it to repeat (takes between 1 minute and 1 hour or so), and
> whomping it a few times with sysrq-T?
>
--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team