Processes hang in an unkillable state

Robert Święcki

Apr 12, 2011, 6:32:44 AM

to linux-...@vger.kernel.org

Hi, while fuzzing Linux system calls (32-bit fuzzer, 64-bit Linux
kernel), after some time (10-20 minutes) some processes enter a state
in which they cannot be killed. They are in either R or D state.

# strace ps wwuax
..
..
open("/proc/450/cmdline", O_RDONLY) = 6
read(6, - hangs....

# kill -9 450
# kill -9 450 (no ESRCH)

More data in the attachment - I'll keep it in the kdb session for
further examination.

--
Robert Święcki

450.txt
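
A note for readers who want to check for the R/D state described above without running ps: the state is the third field of /proc/<pid>/stat. The C helper below is a minimal sketch and was not posted in the thread; the name proc_stat_state is made up for illustration. It scans from the last ')' because the command name in the second field may itself contain spaces or parentheses.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not from the thread): return the state character
 * ('R', 'D', 'S', ...) from one line of /proc/<pid>/stat, or '?' if the
 * line does not look like "pid (comm) state ...".
 */
char proc_stat_state(const char *line)
{
    const char *close;

    if (line == NULL)
        return '?';

    /* comm may contain ')' itself, so search from the end of the line */
    close = strrchr(line, ')');
    if (close == NULL || close[1] != ' ' || close[2] == '\0')
        return '?';

    return close[2];
}
```

Given an invented sample line such as "450 (iknowthis) D 1 450 450 0", the helper returns 'D'.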

Américo Wang

Apr 12, 2011, 8:44:42 AM

to Robert Święcki, linux-...@vger.kernel.org, ol...@redhat.com

2011/4/12 Robert Święcki <rob...@swiecki.net>:

> Hi, while fuzzing Linux system calls (32bit fuzzer, 64bit linux
> kernel), it happens after some time (10-20mins) that some processes
> enter a state which makes them un-killable. They are either in R or D
> state.
>
> # strace ps wwuax
> ...
> ...
> open("/proc/450/cmdline", O_RDONLY) = 6
> read(6, - hangs....
>
> # kill -9 450
> # kill -9 450 (no ESRCH)
>
> More data in the attachment - I'll keep it in the kdb session for
> further examination.

Hmm, it must be stuck at

lib/rwsem.c

    /* wait to be given the lock */
    for (;;) {
        if (!waiter.task)
            break;
        schedule();
        set_task_state(tsk, TASK_UNINTERRUPTIBLE);
    }

don't know why it still can't acquire the ->mmap_sem...

Cc'ing Oleg...

Robert Święcki

Apr 12, 2011, 9:03:27 AM

to Américo Wang, linux-...@vger.kernel.org, ol...@redhat.com

On Tue, Apr 12, 2011 at 2:44 PM, Américo Wang <xiyou.w...@gmail.com> wrote:
> 2011/4/12 Robert Święcki <rob...@swiecki.net>:
>> Hi, while fuzzing Linux system calls (32bit fuzzer, 64bit linux
>> kernel), it happens after some time (10-20mins) that some processes
>> enter a state which makes them un-killable. They are either in R or D
>> state.
>>
>> # strace ps wwuax
>> ...
>> ...
>> open("/proc/450/cmdline", O_RDONLY) = 6
>> read(6, - hangs....
>>
>> # kill -9 450
>> # kill -9 450 (no ESRCH)
>>
>> More data in the attachment - I'll keep it in the kdb session for
>> further examination.
>
> Hmm, it must be stuck at
>
> lib/rwsem.c
>
> /* wait to be given the lock */
> for (;;) {
> if (!waiter.task)
> break;
> schedule();
> set_task_state(tsk, TASK_UNINTERRUPTIBLE);
> }
>
> don't know why it still can't acquire the ->mmap_sem...

btw, the ps process trying to read /proc/450/cmdline is stuck in

[0]kdb> bt
Stack traceback for pid 6959
0xffff880113334590 6959 18384 0 1 D 0xffff880113334a10 ps
<c> ffff88011f8f9d00<c> 0000000000000082<c> 00000040ffffffff<c>
0000000000000000<c>
<c> ffff88012bffcc08<c> ffff88011f8f8000<c> ffff88011f8f8000<c>
ffff880113334590<c>
<c> ffff88011f8f8010<c> ffff880113334948<c> ffff88011f8f9fd8<c>
ffff88011f8f9fd8<c>
Call Trace:
[<ffffffff8224f665>] rwsem_down_failed_common+0xc5/0x160
[<ffffffff8224f735>] rwsem_down_read_failed+0x15/0x17
[<ffffffff81595694>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffff810b31d0>] ? get_task_mm+0x40/0x80
[<ffffffff8224e957>] ? down_read+0x17/0x20
[<ffffffff811788eb>] access_process_vm+0x4b/0x1f0
[<ffffffff8224ffba>] ? _raw_spin_unlock+0x1a/0x40
[<ffffffff8120b15d>] proc_pid_cmdline+0x6d/0x120
[<ffffffff811925c1>] ? alloc_pages_current+0xa1/0x100
[<ffffffff8120bc9d>] proc_info_read+0xad/0xf0
[<ffffffff811abc55>] vfs_read+0xc5/0x190
[<ffffffff811abe21>] sys_read+0x51/0x90
[<ffffffff8104f082>] system_call_fastpath+0x16/0x1b

--
Robert Święcki

Oleg Nesterov

Apr 12, 2011, 2:28:46 PM

to Américo Wang, Robert Święcki, linux-...@vger.kernel.org

On 04/12, Américo Wang wrote:
>
> 2011/4/12 Robert Święcki <rob...@swiecki.net>:
> > Hi, while fuzzing Linux system calls (32bit fuzzer, 64bit linux
> > kernel), it happens after some time (10-20mins) that some processes
> > enter a state which makes them un-killable. They are either in R or D
> > state.
> >
> > # strace ps wwuax
> > ...
> > ...
> > open("/proc/450/cmdline", O_RDONLY) = 6
> > read(6, - hangs....
> >
> > # kill -9 450
> > # kill -9 450 (no ESRCH)
> >
> > More data in the attachment - I'll keep it in the kdb session for
> > further examination.
>
> http://marc.info/?t=130260440100004
>
> Hmm, it must be stuck at
>
> lib/rwsem.c
>
> /* wait to be given the lock */
> for (;;) {
> if (!waiter.task)
> break;
> schedule();
> set_task_state(tsk, TASK_UNINTERRUPTIBLE);
> }
>
> don't know why it still can't acquire the ->mmap_sem...
>
> Cc'ing Oleg...

I seem to understand...

Please wait a bit, I need to recheck.

Oleg.

Robert Święcki

Apr 12, 2011, 2:34:15 PM

to Oleg Nesterov, Américo Wang, linux-...@vger.kernel.org

Btw, Linus Torvalds is looking into a similar case in another thread -
http://marc.info/?l=linux-kernel&m=130262886420218&w=2

--
Robert Święcki

Oleg Nesterov

Apr 12, 2011, 3:08:55 PM

to Américo Wang, Linus Torvalds, Hugh Dickins, Robert Święcki, linux-...@vger.kernel.org

(add cc's)

On 04/12, Oleg Nesterov wrote:
>
> On 04/12, Américo Wang wrote:
> >
> > Hmm, it must be stuck at
> >
> > lib/rwsem.c
> >
> > /* wait to be given the lock */
> > for (;;) {
> > if (!waiter.task)
> > break;
> > schedule();
> > set_task_state(tsk, TASK_UNINTERRUPTIBLE);
> > }
> >
> > don't know why it still can't acquire the ->mmap_sem...
> >
> > Cc'ing Oleg...
>
> I seem to understand...
>
> Please wait a bit, I need to recheck.

Yes, mlock looks buggy. I'll report more info a bit later, but
I think we need something like this patch.

Oleg Nesterov

Apr 12, 2011, 3:09:31 PM

to Américo Wang, Linus Torvalds, Hugh Dickins, Robert Święcki, linux-...@vger.kernel.org

__mlock_vma_pages_range() simply changes addr/nr_pages when
stack_guard_page(vma, start). But this means that __get_user_pages()
returns a number which doesn't match the [start, end) interval and
the caller can be confused.

If we skip the first page, we should return 1 if gup fails, or add
1 to the number it returns.

Signed-off-by: Oleg Nesterov <ol...@redhat.com>
---

mm/mlock.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)

--- sigprocmask/mm/mlock.c~do_mlock_pages_stack_guard_page 2011-04-06 21:33:50.000000000 +0200
+++ sigprocmask/mm/mlock.c 2011-04-12 20:50:30.000000000 +0200
@@ -159,9 +159,8 @@ static long __mlock_vma_pages_range(stru
                     int *nonblocking)
 {
     struct mm_struct *mm = vma->vm_mm;
-    unsigned long addr = start;
     int nr_pages = (end - start) / PAGE_SIZE;
-    int gup_flags;
+    int gup_flags, skip_page, ret;
 
     VM_BUG_ON(start & ~PAGE_MASK);
     VM_BUG_ON(end & ~PAGE_MASK);
@@ -189,13 +188,22 @@ static long __mlock_vma_pages_range(stru
         gup_flags |= FOLL_MLOCK;
 
     /* We don't try to access the guard page of a stack vma */
+    skip_page = 0;
     if (stack_guard_page(vma, start)) {
-        addr += PAGE_SIZE;
+        skip_page = 1;
+        start += PAGE_SIZE;
         nr_pages--;
     }
 
-    return __get_user_pages(current, mm, addr, nr_pages, gup_flags,
+    ret = __get_user_pages(current, mm, start, nr_pages, gup_flags,
                 NULL, NULL, nonblocking);
+
+    if (ret >= 0)
+        ret += skip_page;
+    else if (skip_page)
+        ret = 1;
+
+    return ret;
 }

/*
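
The accounting problem described in the patch above can be tried out in user space. What follows is a toy model, not kernel code: addresses are counted in whole pages, toy_gup() stands in for __get_user_pages() and always succeeds, and walk_range() is an assumed caller that advances by whatever count it gets back. All four names are invented for this sketch. Under those assumptions, a range that consists only of the guard page makes the pre-patch shape report zero pages, so the caller never moves forward, while the post-patch shape reports one page and the walk finishes.

```c
/* Toy stand-in for __get_user_pages(): pretends every requested page is
 * handled and reports how many that was. An assumption for illustration. */
long toy_gup(long nr_pages)
{
    return nr_pages > 0 ? nr_pages : 0;
}

/* Pre-patch shape: skip the guard page, return gup's count unchanged. */
long lock_range_old(long start, long end, long guard)
{
    long nr_pages = end - start;

    if (start == guard)
        nr_pages--;    /* the kernel code also moves addr up by one page */

    return toy_gup(nr_pages);
}

/* Post-patch shape: count the skipped page in the return value. */
long lock_range_new(long start, long end, long guard)
{
    long nr_pages = end - start;
    int skip_page = 0;
    long ret;

    if (start == guard) {
        skip_page = 1;
        nr_pages--;
    }

    ret = toy_gup(nr_pages);
    if (ret >= 0)
        ret += skip_page;
    else if (skip_page)    /* never taken here: toy_gup() cannot fail */
        ret = 1;

    return ret;
}

/*
 * Assumed caller: keep asking for [start, end) and advance by the number
 * of pages reported. Returns the number of calls needed, or -1 if the
 * range is still not covered after max_calls calls.
 */
int walk_range(long (*lock_range)(long, long, long),
               long start, long end, long guard, int max_calls)
{
    int calls = 0;

    while (start < end) {
        long ret;

        if (calls == max_calls)
            return -1;    /* no forward progress: treat as stuck */
        calls++;

        ret = lock_range(start, end, guard);
        if (ret < 0)
            return -1;
        start += ret;
    }
    return calls;
}
```

With guard page 0 and the range [0, 1), walk_range(lock_range_old, 0, 1, 0, 100) gives up after 100 calls, while walk_range(lock_range_new, 0, 1, 0, 100) finishes in one call.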

Robert Święcki

Apr 12, 2011, 3:18:53 PM

to Oleg Nesterov, Américo Wang, Linus Torvalds, Hugh Dickins, linux-...@vger.kernel.org

Compiling with Linus' new patch now, lemme know if you agree on which
one might be the more correct one :). Otherwise I'll stick to the
first choice, and let you know tomorrow if it worked after some more
extensive testing.

--
Robert Święcki

Oleg Nesterov

Apr 12, 2011, 3:22:13 PM

to Robert Święcki, Américo Wang, linux-...@vger.kernel.org, Linus Torvalds, Hugh Dickins

On 04/12, Robert Święcki wrote:
>
> On Tue, Apr 12, 2011 at 8:28 PM, Oleg Nesterov <ol...@redhat.com> wrote:
> >
> > I seem to understand...
> >
> > Please wait a bit, I need to recheck.
>
> Btw, Linus Torvalds is looking into a similar case in another thread -
> http://marc.info/?l=linux-kernel&m=130262886420218&w=2

Argh. thanks ;)

I wish I had known this before I started to investigate...

So, Linus's patch does the same, we can ignore the patch I sent.

Oleg.

Linus Torvalds

Apr 12, 2011, 4:23:38 PM

to Robert Święcki, Oleg Nesterov, Américo Wang, linux-...@vger.kernel.org, Hugh Dickins, Miklos Szeredi

On Tue, Apr 12, 2011 at 1:03 PM, Robert Święcki <rob...@swiecki.net> wrote:
>>
>> Ok, applied Linus' patch and got the following (kdb dump in the attachment):
>>
>> It contains references to sys_mlock, but in another process/user that
>> oopsed (there are iknowthis and iknowthis2 processes running under
>> test and test2 users). I think I'll simply disable sys_madvise in the
>> fuzzer; and treat this oops as a separate issue.

This does seem to be something else.

It looks like some kind of live-lock situation between two processes
both doing madvise() and causing vmtruncate_range() calls.

Miklos wrote this patch for something bad in this area to serialize
concurrent unmap_mapping_range() calls in order to not restart forever
on vm_truncate_count. That got merged into 2.6.38, so it's there, but
I wonder if there is some case it misses.

Linus

Linus Torvalds

Apr 12, 2011, 5:47:16 PM

to Robert Święcki, Oleg Nesterov, Américo Wang, linux-...@vger.kernel.org, Hugh Dickins, Miklos Szeredi

On Tue, Apr 12, 2011 at 1:56 PM, Robert Święcki <rob...@swiecki.net> wrote:
>
> Ok, just to update you with what I'm currently doing:
>
> I'm testing now with 2.6.39-rc3 - according to
> http://www.kernel.org/pub/linux/kernel/v2.6/testing/ChangeLog-2.6.39-rc3
> it has vma_to_resize patch included
> (982134ba62618c2d69fbbbd166d0a11ee3b7e3d8) - I applied the latest
> Linus' patch for sys_mlock (the one patching memory.c and mlock.c),
> disabled the sys_madvise in the fuzzer, and now I got the following
> (full kdb dump attached)

Ok, that's different from the apparent livelock.

Except it once again is one of the BUG_ON's in vma_prio_tree_add() -
and again, your kgdb thing has corrupted the bug information.

Can you make a bug-report to the kgdb people? It's annoying as hell
that all the *critical* bug information that the kernel prints out
apparently gets totally lost when you attach with the debugger. It's
not an Oops, it should have that nice BUG: together with filename and
line number.

> <d>Pid: 18598, comm: iknowthis Not tainted 2.6.39-rc3 #1<c> Dell Inc.
> Precision WorkStation 390 <c>/0GH911<c>
> <d>RIP: 0010:[<ffffffff8116c842>] [<ffffffff8116c842>] vma_prio_tree_add+0xc2/0xd0

Code disassembly shows:

0: 58 pop %rax
1: 48 89 7e 68 mov %rdi,0x68(%rsi)
5: c9 leaveq
6: c3 retq
7: 66 90 xchg %ax,%ax
9: 48 8b 56 50 mov 0x50(%rsi),%rdx
d: 48 8d 47 50 lea 0x50(%rdi),%rax
11: 48 89 42 08 mov %rax,0x8(%rdx)
15: 48 89 57 50 mov %rdx,0x50(%rdi)
19: 48 8d 56 50 lea 0x50(%rsi),%rdx
1d: 48 89 57 58 mov %rdx,0x58(%rdi)
21: 48 89 46 50 mov %rax,0x50(%rsi)
25: c9 leaveq
26: c3 retq
27:* 0f 0b ud2 <-- trapping instruction
29: eb fe jmp 0x29
2b:* 0f 0b ud2 <-- trapping instruction
2d: eb fe jmp 0x2d
2f: eb 08 jmp 0x39

and scripts/decodecode is wrong, it's the _second_ of the two ud2's
that traps, as shown by the Code: line.

But whether that is the first or the second in the source code, who
knows? Gcc may have re-ordered things completely, and kdb has thrown
away the information that the kernel should have printed out.

Anyway, it looks _very_ much exactly like the old mremap() issue. But
if you are running -rc3, then you already have commit 42933bac11e8 in
your tree, so maybe there is some other way to trigger a vm_pgoff
overflow.

You've lost Hugh's patch that did the vma dump instead of having the
BUG_ON(). Can you try that one? And once more, I think that if you had
CONFIG_OPTIMIZE_SIZE on, then I think gcc wouldn't re-order the basic
blocks, and the BUG_ON() info would be easier to track.

> Call Trace:
> [<ffffffff8116c9a1>] vma_prio_tree_insert+0x41/0x60
> [<ffffffff8117cb8c>] __vma_link_file+0x4c/0x90
> [<ffffffff8117d568>] vma_adjust+0xe8/0x570
> [<ffffffff8117db31>] __split_vma+0x141/0x280
> [<ffffffff8117dc95>] split_vma+0x25/0x30
> [<ffffffff8117c1a1>] mlock_fixup+0x171/0x1c0
> [<ffffffff8117c529>] do_mlock+0xc9/0x100
> [<ffffffff8117c6d7>] sys_mlock+0xe7/0x130
> [<ffffffff82284e03>] ia32_do_call+0x13/0x13

Hmm. mlock() itself should not be causing any pgoff expansion.

I wonder if this is related to that whole stack expansion thing (you
clearly are hitting the stack vma judging by the other bug you found),
and we have a pgoff underflow when expanding the stack?

Attached patch for your enjoyment. COMPLETELY UNTESTED, as usual.

Guys, can you think of any other thing that might expand a mapping?
Rather than find them one-by-one as Robert plays with his fuzzer?

Linus

patch.diff

Robert Święcki

Apr 12, 2011, 5:59:44 PM

to Linus Torvalds, Oleg Nesterov, Américo Wang, linux-...@vger.kernel.org, Hugh Dickins, Miklos Szeredi

>> Ok, just to update you with what I'm currently doing:
>>
>> I'm testing now with 2.6.39-rc3 - according to
>> http://www.kernel.org/pub/linux/kernel/v2.6/testing/ChangeLog-2.6.39-rc3
>> it has vma_to_resize patch included
>> (982134ba62618c2d69fbbbd166d0a11ee3b7e3d8) - I applied the latest
>> Linus' patch for sys_mlock (the one patching memory.c and mlock.c),
>> disabled the sys_madvise in the fuzzer, and now I got the following
>> (full kdb dump attached)
>
> Ok, that's different from the apparent livelock.
>
> Except it once again is one of the BUG_ON's in vma_prio_tree_add() -
> and again, your kgdb thing has corrupted the bug information.
>
> Can you make a bug-report to the kgdb people?

Ok,

> You've lost Hugh's patch that did the vma dump instead of having the
> BUG_ON(). Can you try that one? And once more, I think that if you had
> CONFIG_OPTIMIZE_SIZE on, then I think gcc wouldn't re-order the basic
> blocks, and the BUG_ON() info would be easier to track.

Compiling now with CONFIG_OPTIMIZE_SIZE and vma dump code. Will
probably post some results tomorrow.

--
Robert Święcki

Linus Torvalds

Apr 12, 2011, 6:13:56 PM

to Robert Święcki, Oleg Nesterov, Américo Wang, linux-...@vger.kernel.org, Hugh Dickins, Miklos Szeredi

On Tue, Apr 12, 2011 at 2:59 PM, Robert Święcki <rob...@swiecki.net> wrote:
>
> Compiling now with CONFIG_OPTIMIZE_SIZE and vma dump code. Will
> probably post some results tomorrow.

.. and if you've added my patch to the stack growth case, hopefully
there won't _be_ any results ;)

Linus

Robert Święcki

Apr 12, 2011, 6:16:22 PM

to Linus Torvalds, Oleg Nesterov, Américo Wang, linux-...@vger.kernel.org, Hugh Dickins, Miklos Szeredi

On Wed, Apr 13, 2011 at 12:12 AM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
> On Tue, Apr 12, 2011 at 2:59 PM, Robert Święcki <rob...@swiecki.net> wrote:
>>
>> Compiling now with CONFIG_OPTIMIZE_SIZE and vma dump code. Will
>> probably post some results tomorrow.
>
> .. and if you've added my patch to the stack growth case, hopefully
> there won't _be_ any results ;)

I can, depending on whether you'd like to see vma dump results for
this case or not. Let me know, it's still cooooompiling :).

--
Robert Święcki

Linus Torvalds

Apr 12, 2011, 6:19:38 PM

to Robert Święcki, Oleg Nesterov, Américo Wang, linux-...@vger.kernel.org, Hugh Dickins, Miklos Szeredi

On Tue, Apr 12, 2011 at 3:16 PM, Robert Święcki <rob...@swiecki.net> wrote:
> On Wed, Apr 13, 2011 at 12:12 AM, Linus Torvalds
> <torv...@linux-foundation.org> wrote:
>> On Tue, Apr 12, 2011 at 2:59 PM, Robert Święcki <rob...@swiecki.net> wrote:
>>>
>>> Compiling now with CONFIG_OPTIMIZE_SIZE and vma dump code. Will
>>> probably post some results tomorrow.
>>
>> .. and if you've added my patch to the stack growth case, hopefully
>> there won't _be_ any results ;)
>
> I can, depending on whether you'd like to see vma dump results for
> this case or not. Let me know, it's still cooooompiling :).

Please do add it. Since it can take a long time to trigger, it's best
to just try to fix this issue asap. If it never triggers, and we don't
see any vma dumps, I won't cry.

Linus

Robert Święcki

Apr 12, 2011, 6:31:04 PM

to Linus Torvalds, Oleg Nesterov, Américo Wang, linux-...@vger.kernel.org, Hugh Dickins, Miklos Szeredi

On Wed, Apr 13, 2011 at 12:18 AM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
> On Tue, Apr 12, 2011 at 3:16 PM, Robert Święcki <rob...@swiecki.net> wrote:
>> On Wed, Apr 13, 2011 at 12:12 AM, Linus Torvalds
>> <torv...@linux-foundation.org> wrote:
>>> On Tue, Apr 12, 2011 at 2:59 PM, Robert Święcki <rob...@swiecki.net> wrote:
>>>>
>>>> Compiling now with CONFIG_OPTIMIZE_SIZE and vma dump code. Will
>>>> probably post some results tomorrow.
>>>
>>> .. and if you've added my patch to the stack growth case, hopefully
>>> there won't _be_ any results ;)
>>
>> I can, depending on whether you'd like to see vma dump results for
>> this case or not. Let me know, it's still cooooompiling :).
>
> Please do add it. Since it can take a long time to trigger, it's best
> to just try to fix this issue asap. If it never triggers, and we don't
> see any vma dumps, I won't cry.

Ok,

btw, here might be another path which hits this (at least I think so).

http://alt.swiecki.net/linux_kernel/sys_mprotect-2.6.38.txt

And, generally: here are a few deadlocks/BUG_ONs/oopses gathered
earlier (some fixed already) - http://alt.swiecki.net/linux_kernel/ -
I'll try to ask for fixes for them one by one, as soon as they repeat
and I have proper kdb/perf dumps.

--
Robert Święcki

Linus Torvalds

Apr 12, 2011, 6:44:44 PM

to Robert Święcki, Oleg Nesterov, Américo Wang, linux-...@vger.kernel.org, Hugh Dickins, Miklos Szeredi

On Tue, Apr 12, 2011 at 3:30 PM, Robert Święcki <rob...@swiecki.net> wrote:
>
> btw, here might be another path which hits this (at least I think so).

So both mprotect and mlock will do the same "split/merge vma's as
necessary", but neither of them should actually ever _expand_ a
mapping or change the vm_pgoff of a vma (except to fix up the pgoff as
a vma is split).

So what I think is happening is that a previous vma operation (like
the mremap or the stack expansion) did the expand and created a vma
with a wrapping vm_pgoff. But nothing bad happened, because nobody
really _cares_ about the wrapping until later, when we split the vma.

So I think (and hope) that your mprotect issue is exactly the same as
your mlock issue, and that the deeper problem was the earlier stack
expansion.

That said, I'm not at all going to guarantee that it's about stack
expansion. There might be something else going on, and the stack
expansion was just the first thing that I could think of as doing
something similar to mremap(), causing a wrapping vm_pgoff.

Linus
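
The "wrapping vm_pgoff" idea in the message above comes down to unsigned arithmetic: vm_pgoff is an unsigned long counted in pages, so adding or subtracting pages can wrap without any error. The helpers below are a rough user-space sketch with invented names; they are not taken from the kernel or from any patch in this thread. They show why a wrapped range can sit unnoticed until a split recomputes the offset of the upper half, and what a guard for upward and downward growth could look like.

```c
/*
 * Hypothetical model: pgoff the upper half would get when a vma that
 * starts at vm_pgoff is split split_pages pages into the mapping.
 * Unsigned addition wraps silently.
 */
unsigned long split_pgoff(unsigned long vm_pgoff, unsigned long split_pages)
{
    return vm_pgoff + split_pages;
}

/* Nonzero if a mapping of nr_pages pages starting at vm_pgoff wraps. */
int pgoff_range_wraps(unsigned long vm_pgoff, unsigned long nr_pages)
{
    return vm_pgoff + nr_pages < vm_pgoff;
}

/*
 * Nonzero if growing a mapping downward by grow_pages pages would push
 * its pgoff below zero.
 */
int pgoff_underflows(unsigned long vm_pgoff, unsigned long grow_pages)
{
    return grow_pages > vm_pgoff;
}
```

For instance, a mapping whose vm_pgoff is two below the maximum unsigned long value and which is split four pages in ends up with an upper half at pgoff 2, which is smaller than the pgoff of the lower half.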

Robert Święcki

Apr 13, 2011, 8:19:18 AM

to Linus Torvalds, Oleg Nesterov, Américo Wang, linux-...@vger.kernel.org, Hugh Dickins, Miklos Szeredi

On Wed, Apr 13, 2011 at 12:43 AM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
>> btw, here might be another path which hits this (at least I think so).
>
> So both mprotect and mlock will do the same "split/merge vma's as
> necessary", but neither of them should actually ever _expand_ a
> mapping or change the vm_pgoff of a vma (except to fix up the pgoff as
> a vma is split).
>
> So what I think is happening is that a previous vma operation (like
> the mremap or the stack expansion) did the expand and created a vma
> with a wrapping vm_pgoff. But nothing bad happened, because nobody
> really _cares_ about the wrapping until later, when we split the vma.
>
> So I think (and hope) that your mprotect issue is exactly the same as
> your mlock issue, and that the deeper problem was the earlier stack
> expansion.

So, after ~12h of testing I don't see any crashes. Currently, I'm
testing with 2.6.39-rc3 with 2 of your patches applied (1st patching
mlock.c/memory.c, 2nd: mmap.c).

It's still crashing with sys_madvise (as reported earlier), and I'm
going to re-enable all syscalls now (madvise, getdents(64), readdir),
which were disabled before. If something unrelated to the problems
discussed in this thread happens, I'll report it in another thread.

--
Robert Święcki