Re: INFO: rcu detected stall in sys_kill

Dmitry Vyukov

Jan 9, 2020, 3:50:25 AM
to Casey Schaufler, Daniel Axtens, Alexander Potapenko, clang-built-linux, Tetsuo Handa, syzbot, kasan-dev, Andrew Morton, LKML, syzkaller-bugs
On Thu, Jan 9, 2020 at 9:19 AM Dmitry Vyukov <dvy...@google.com> wrote:
>
> On Wed, Jan 8, 2020 at 6:19 PM Casey Schaufler <ca...@schaufler-ca.com> wrote:
> >
> > On 1/8/2020 2:25 AM, Tetsuo Handa wrote:
> > > On 2020/01/08 15:20, Dmitry Vyukov wrote:
> > >> I temporarily re-enabled smack instance and it produced another 50
> > >> stalls all over the kernel, and now keeps spewing a dozen every hour.
> >
> > Do I have to be using clang to test this? I'm setting up to work on this,
> > and don't want to waste time using my current tool chain if the problem
> > is clang specific.
>
> Humm, interesting. Initially I was going to say that most likely it's
> not clang-related. But smack instance is actually the only one that
> uses clang as well (except for KMSAN of course). So maybe it's indeed
> clang-related rather than smack-related. Let me try to build a kernel
> with clang.

+clang-built-linux, glider

[clang-built linux has been severely broken since early Dec]

Building the kernel with clang, I can immediately reproduce this locally:

$ syz-manager
2020/01/09 09:27:15 loading corpus...
2020/01/09 09:27:17 serving http on http://0.0.0.0:50001
2020/01/09 09:27:17 serving rpc on tcp://[::]:45851
2020/01/09 09:27:17 booting test machines...
2020/01/09 09:27:17 wait for the connection from test machine...
2020/01/09 09:29:23 machine check:
2020/01/09 09:29:23 syscalls : 2961/3195
2020/01/09 09:29:23 code coverage : enabled
2020/01/09 09:29:23 comparison tracing : enabled
2020/01/09 09:29:23 extra coverage : enabled
2020/01/09 09:29:23 setuid sandbox : enabled
2020/01/09 09:29:23 namespace sandbox : enabled
2020/01/09 09:29:23 Android sandbox : /sys/fs/selinux/policy does not exist
2020/01/09 09:29:23 fault injection : enabled
2020/01/09 09:29:23 leak checking : CONFIG_DEBUG_KMEMLEAK is not enabled
2020/01/09 09:29:23 net packet injection : enabled
2020/01/09 09:29:23 net device setup : enabled
2020/01/09 09:29:23 concurrency sanitizer : /sys/kernel/debug/kcsan does not exist
2020/01/09 09:29:23 devlink PCI setup : PCI device 0000:00:10.0 is not available
2020/01/09 09:29:27 corpus : 50226 (0 deleted)
2020/01/09 09:29:27 VMs 20, executed 0, cover 0, crashes 0, repro 0
2020/01/09 09:29:37 VMs 20, executed 45, cover 0, crashes 0, repro 0
2020/01/09 09:29:47 VMs 20, executed 74, cover 0, crashes 0, repro 0
2020/01/09 09:29:57 VMs 20, executed 80, cover 0, crashes 0, repro 0
2020/01/09 09:30:07 VMs 20, executed 80, cover 0, crashes 0, repro 0
2020/01/09 09:30:17 VMs 20, executed 80, cover 0, crashes 0, repro 0
2020/01/09 09:30:27 VMs 20, executed 80, cover 0, crashes 0, repro 0
2020/01/09 09:30:37 VMs 20, executed 80, cover 0, crashes 0, repro 0
2020/01/09 09:30:47 VMs 20, executed 80, cover 0, crashes 0, repro 0
2020/01/09 09:30:57 VMs 20, executed 80, cover 0, crashes 0, repro 0
2020/01/09 09:31:07 VMs 20, executed 80, cover 0, crashes 0, repro 0
2020/01/09 09:31:17 VMs 20, executed 80, cover 0, crashes 0, repro 0
2020/01/09 09:31:26 vm-10: crash: INFO: rcu detected stall in do_idle
2020/01/09 09:31:27 VMs 13, executed 80, cover 0, crashes 0, repro 0
2020/01/09 09:31:28 vm-1: crash: INFO: rcu detected stall in sys_futex
2020/01/09 09:31:29 vm-4: crash: INFO: rcu detected stall in sys_futex
2020/01/09 09:31:31 vm-0: crash: INFO: rcu detected stall in sys_getsockopt
2020/01/09 09:31:33 vm-18: crash: INFO: rcu detected stall in sys_clone3
2020/01/09 09:31:35 vm-3: crash: INFO: rcu detected stall in sys_futex
2020/01/09 09:31:36 vm-8: crash: INFO: rcu detected stall in do_idle
2020/01/09 09:31:37 VMs 7, executed 80, cover 0, crashes 6, repro 0
2020/01/09 09:31:38 vm-19: crash: INFO: rcu detected stall in schedule_tail
2020/01/09 09:31:40 vm-6: crash: INFO: rcu detected stall in schedule_tail
2020/01/09 09:31:42 vm-2: crash: INFO: rcu detected stall in schedule_tail
2020/01/09 09:31:44 vm-12: crash: INFO: rcu detected stall in sys_futex
2020/01/09 09:31:46 vm-15: crash: INFO: rcu detected stall in sys_nanosleep
2020/01/09 09:31:47 VMs 1, executed 80, cover 0, crashes 11, repro 0
2020/01/09 09:31:48 vm-16: crash: INFO: rcu detected stall in sys_futex
2020/01/09 09:31:50 vm-9: crash: INFO: rcu detected stall in schedule
2020/01/09 09:31:52 vm-13: crash: INFO: rcu detected stall in schedule_tail
2020/01/09 09:31:54 vm-11: crash: INFO: rcu detected stall in schedule_tail
2020/01/09 09:31:56 vm-17: crash: INFO: rcu detected stall in sys_futex
2020/01/09 09:31:57 VMs 0, executed 80, cover 0, crashes 16, repro 0
2020/01/09 09:31:58 vm-7: crash: INFO: rcu detected stall in sys_futex
2020/01/09 09:32:00 vm-5: crash: INFO: rcu detected stall in dput
2020/01/09 09:32:02 vm-14: crash: INFO: rcu detected stall in sys_nanosleep
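
A log like the one above is easier to read once the crash lines are tallied. The following is a minimal Python sketch, not part of syzkaller; `tally_crashes`, `CRASH_RE` and `SAMPLE_LOG` are names made up for this illustration, and the sample holds only a few lines copied from the output above.

```python
import re
from collections import Counter

# A few lines copied from the syz-manager output (not the full log).
SAMPLE_LOG = """\
2020/01/09 09:31:26 vm-10: crash: INFO: rcu detected stall in do_idle
2020/01/09 09:31:27 VMs 13, executed 80, cover 0, crashes 0, repro 0
2020/01/09 09:31:28 vm-1: crash: INFO: rcu detected stall in sys_futex
2020/01/09 09:31:29 vm-4: crash: INFO: rcu detected stall in sys_futex
2020/01/09 09:31:31 vm-0: crash: INFO: rcu detected stall in sys_getsockopt
"""

# Matches "<date> <time> vm-N: crash: <title>"; captures VM index and title.
CRASH_RE = re.compile(r"^\S+ \S+ vm-(\d+): crash: (.+)$")


def tally_crashes(log_text):
    """Return (set of crashed VM indices, Counter of crash titles)."""
    vms = set()
    titles = Counter()
    for line in log_text.splitlines():
        match = CRASH_RE.match(line.strip())
        if match:
            vms.add(int(match.group(1)))
            titles[match.group(2)] += 1
    return vms, titles


if __name__ == "__main__":
    vms, titles = tally_crashes(SAMPLE_LOG)
    print(f"{len(vms)} VMs crashed")
    for title, count in titles.most_common():
        print(f"{count:3d}  {title}")
```

Counting the crash lines in the full log above by hand gives the same picture: all 20 VMs (vm-0 through vm-19) report an RCU stall between 09:31:26 and 09:32:02, with sys_futex the most frequent reported location (7 of 20).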


Then I switched the LSM to selinux and I can _still_ reproduce this. So,
Casey, you may relax, this is not smack-specific :)

Then I disabled CONFIG_KASAN_VMALLOC and CONFIG_VMAP_STACK and it
started working normally.

So this is somehow related to both clang and KASAN/VMAP_STACK.

The clang I used is:
https://storage.googleapis.com/syzkaller/clang-kmsan-362913.tar.gz
(the one we use on syzbot).

Dmitry Vyukov

Jan 9, 2020, 4:29:36 AM
to Casey Schaufler, Daniel Axtens, Alexander Potapenko, clang-built-linux, Tetsuo Handa, syzbot, kasan-dev, Andrew Morton, LKML, syzkaller-bugs
Clustering the hangs shows that they all happen within a very limited section of the code:

1 free_thread_stack+0x124/0x590 kernel/fork.c:284
5 free_thread_stack+0x12e/0x590 kernel/fork.c:280
39 free_thread_stack+0x12e/0x590 kernel/fork.c:284
6 free_thread_stack+0x133/0x590 kernel/fork.c:280
5 free_thread_stack+0x13d/0x590 kernel/fork.c:280
2 free_thread_stack+0x141/0x590 kernel/fork.c:280
6 free_thread_stack+0x14c/0x590 kernel/fork.c:280
9 free_thread_stack+0x151/0x590 kernel/fork.c:280
3 free_thread_stack+0x15b/0x590 kernel/fork.c:280
67 free_thread_stack+0x168/0x590 kernel/fork.c:280
6 free_thread_stack+0x16d/0x590 kernel/fork.c:284
2 free_thread_stack+0x177/0x590 kernel/fork.c:284
1 free_thread_stack+0x182/0x590 kernel/fork.c:284
1 free_thread_stack+0x186/0x590 kernel/fork.c:284
16 free_thread_stack+0x18b/0x590 kernel/fork.c:284
4 free_thread_stack+0x195/0x590 kernel/fork.c:284
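
As a rough cross-check of the clustering above, the following Python sketch re-parses that listing. The names `CLUSTERS`, `LINE_RE` and `summarize` are made up for illustration; only the sixteen data lines come from the report. It sums the hit counts per kernel/fork.c source line and reports the lowest and highest offsets seen.

```python
import re
from collections import defaultdict

# Format: "<count> <symbol>+<offset>/<size> <file>:<line>"
CLUSTERS = """\
1 free_thread_stack+0x124/0x590 kernel/fork.c:284
5 free_thread_stack+0x12e/0x590 kernel/fork.c:280
39 free_thread_stack+0x12e/0x590 kernel/fork.c:284
6 free_thread_stack+0x133/0x590 kernel/fork.c:280
5 free_thread_stack+0x13d/0x590 kernel/fork.c:280
2 free_thread_stack+0x141/0x590 kernel/fork.c:280
6 free_thread_stack+0x14c/0x590 kernel/fork.c:280
9 free_thread_stack+0x151/0x590 kernel/fork.c:280
3 free_thread_stack+0x15b/0x590 kernel/fork.c:280
67 free_thread_stack+0x168/0x590 kernel/fork.c:280
6 free_thread_stack+0x16d/0x590 kernel/fork.c:284
2 free_thread_stack+0x177/0x590 kernel/fork.c:284
1 free_thread_stack+0x182/0x590 kernel/fork.c:284
1 free_thread_stack+0x186/0x590 kernel/fork.c:284
16 free_thread_stack+0x18b/0x590 kernel/fork.c:284
4 free_thread_stack+0x195/0x590 kernel/fork.c:284
"""

LINE_RE = re.compile(
    r"^(\d+) (\w+)\+(0x[0-9a-f]+)/(0x[0-9a-f]+) (\S+):(\d+)$"
)


def summarize(text):
    """Return (hits per source line, lowest offset, highest offset)."""
    per_line = defaultdict(int)
    offsets = []
    for raw in text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue
        count = int(match.group(1))
        offset = int(match.group(3), 16)
        src_line = int(match.group(6))
        per_line[src_line] += count
        offsets.append(offset)
    return dict(per_line), min(offsets), max(offsets)


if __name__ == "__main__":
    per_line, lowest, highest = summarize(CLUSTERS)
    for src_line, hits in sorted(per_line.items()):
        print(f"kernel/fork.c:{src_line}: {hits} stalls")
    print(f"offset window: {lowest:#x}..{highest:#x}")
```

On this data the 173 stalls split into 103 at fork.c:280 and 70 at fork.c:284, and every sample lies between +0x124 and +0x195, a window of 0x71 (113) bytes in a function of 0x590 (1424) bytes.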

Here is the disassembly of the function:
https://gist.githubusercontent.com/dvyukov/a283d1aaf2ef7874001d56525279ccbd/raw/ac2478bff6472bc473f57f91a75f827cd72bb6bf/gistfile1.txt

But if I am not mistaken, the function only ever jumps down. So how
can it loop?...

Dmitry Vyukov

Jan 9, 2020, 5:05:26 AM
to Casey Schaufler, Daniel Axtens, Alexander Potapenko, clang-built-linux, Tetsuo Handa, syzbot, kasan-dev, Andrew Morton, LKML, syzkaller-bugs
This is a miscompilation related to static branches.

objdump shows:

ffffffff814878f8: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
./arch/x86/include/asm/jump_label.h:25
asm_volatile_goto("1:"

However, the actual instruction in memory at the time is:

0xffffffff814878f8 <+408>: jmpq 0xffffffff8148787f <free_thread_stack+287>

This jumps to the wrong location in free_thread_stack and makes it loop.
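
To see why the patched instruction produces a loop, one can work out its displacement. The sketch below is a back-of-the-envelope check in Python. It assumes the site was patched with the 5-byte `jmp rel32` form (opcode 0xE9), which fits the 5-byte nop that objdump shows at that address; the function names are illustrative only.

```python
import struct

JMP_REL32_LEN = 5  # opcode 0xE9 followed by a signed 32-bit displacement

SITE = 0xFFFFFFFF814878F8    # free_thread_stack+408, the jump-label site
TARGET = 0xFFFFFFFF8148787F  # free_thread_stack+287, where the jmpq lands


def rel32_for(site, target):
    """Displacement a 5-byte jmp at `site` needs to land on `target`.

    The CPU adds the displacement to the address of the *next* instruction.
    """
    return target - (site + JMP_REL32_LEN)


def encode_jmp(site, target):
    """Encode the jmp as bytes: 0xE9 plus little-endian signed rel32."""
    return b"\xe9" + struct.pack("<i", rel32_for(site, target))


def inside_loop(offset, loop_start=287, loop_end=408):
    """True if a function-relative offset lies between jump target and site."""
    return loop_start <= offset <= loop_end


if __name__ == "__main__":
    disp = rel32_for(SITE, TARGET)
    print(f"displacement: {disp} ({'backward' if disp < 0 else 'forward'})")
    print(f"encoding: {encode_jmp(SITE, TARGET).hex()}")
    # Lowest and highest stall offsets from the clustering earlier in the thread.
    print(all(inside_loop(off) for off in (0x124, 0x195)))
```

The displacement comes out as -126, so the jump goes backward from +408 to +287. Unless something in between branches out, execution falls through from +287 back to +408 and repeats. The stall offsets reported earlier (+0x124 to +0x195, i.e. 292 to 405) all sit inside that range, which is consistent with the loop being exactly this stretch of code.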

The static branch is this:

static inline bool memcg_kmem_enabled(void)
{
	return static_branch_unlikely(&memcg_kmem_enabled_key);
}

static inline void memcg_kmem_uncharge(struct page *page, int order)
{
	if (memcg_kmem_enabled())
		__memcg_kmem_uncharge(page, order);
}

I suspect it may have something to do with loop unrolling. It may jump
to the right location, but in the wrong unrolled iteration.
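
The suspicion above can be pictured with a toy model. This Python sketch is purely illustrative and says nothing about what LLVM actually emitted: it models two unrolled loop iterations as a flat list of pseudo-instructions, each with its own branch site and branch target, and checks whether execution terminates when the second site points at its own target versus the first iteration's target. All names are hypothetical.

```python
def runs_to_completion(program, max_steps=1000):
    """Run a toy program; True if it falls off the end within max_steps.

    Each entry is ("op",) or ("jmp", target_index). Execution starts at
    index 0 and an "op" falls through to the next index.
    """
    pc = 0
    for _ in range(max_steps):
        if pc >= len(program):
            return True
        kind, *args = program[pc]
        pc = args[0] if kind == "jmp" else pc + 1
    return pc >= len(program)


def unrolled_twice(second_site_target):
    """Two unrolled iterations; the second branch site is parameterised."""
    return [
        ("op",),                      # 0: iteration 0 body
        ("jmp", 2),                   # 1: iteration 0 branch site
        ("op",),                      # 2: iteration 0 branch target
        ("op",),                      # 3: iteration 1 body
        ("jmp", second_site_target),  # 4: iteration 1 branch site
        ("op",),                      # 5: iteration 1 branch target
    ]


if __name__ == "__main__":
    print(runs_to_completion(unrolled_twice(5)))  # site 4 -> its own target
    print(runs_to_completion(unrolled_twice(2)))  # site 4 -> iteration 0's target
```

With the second site pointing at index 5 the program finishes; pointing it at index 2 sends control back into the first copy, which falls through to the same site again and never exits. That is the same shape as the backward jmpq observed in free_thread_stack, though the real cause has to be confirmed in the compiler.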

Dmitry Vyukov

Jan 9, 2020, 5:39:19 AM
to Casey Schaufler, Daniel Axtens, Alexander Potapenko, clang-built-linux, Tetsuo Handa, syzbot, kasan-dev, Andrew Morton, LKML, syzkaller-bugs
Kernel built with clang version 10.0.0
(https://github.com/llvm/llvm-project.git
c2443155a0fb245c8f17f2c1c72b6ea391e86e81) works fine.

Alex, please update clang on syzbot machines.

Alexander Potapenko

Jan 9, 2020, 11:23:33 AM
to Dmitry Vyukov, Casey Schaufler, Daniel Axtens, clang-built-linux, Tetsuo Handa, syzbot, kasan-dev, Andrew Morton, LKML, syzkaller-bugs
Done ~3 hours ago, guess we'll see the results within a day.

--
Alexander Potapenko
Software Engineer


Nick Desaulniers

Jan 9, 2020, 12:17:08 PM
to Alexander Potapenko, Dmitry Vyukov, Casey Schaufler, Daniel Axtens, clang-built-linux, Tetsuo Handa, syzbot, kasan-dev, Andrew Morton, LKML, syzkaller-bugs
On Thu, Jan 9, 2020 at 8:23 AM 'Alexander Potapenko' via Clang Built
Linux <clang-bu...@googlegroups.com> wrote:
>
> On Thu, Jan 9, 2020 at 11:39 AM Dmitry Vyukov <dvy...@google.com> wrote:
> >
> > On Thu, Jan 9, 2020 at 11:05 AM Dmitry Vyukov <dvy...@google.com> wrote:
> > > > > > [clang-built linux has been severely broken since early Dec]

Is there automated reporting? Consider adding our mailing list for
Clang-specific failures.
clang-built-linux <clang-bu...@googlegroups.com>
Our CI looks green, but there's a very long tail of combinations of
configs that we don't have coverage of, so bug reports are
appreciated:
https://github.com/ClangBuiltLinux/linux/issues
I disabled loop unrolling and loop unswitching in LLVM when the loop
contained asm goto in:
https://github.com/llvm/llvm-project/commit/c4f245b40aad7e8627b37a8bf1bdcdbcd541e665
I have a fix for loop unrolling in:
https://reviews.llvm.org/D64101
that I should dust off. I haven't looked into loop unswitching yet.

> >
> >
> > Kernel built with clang version 10.0.0
> > (https://github.com/llvm/llvm-project.git
> > c2443155a0fb245c8f17f2c1c72b6ea391e86e81) works fine.
> >
> > Alex, please update clang on syzbot machines.
>
> Done ~3 hours ago, guess we'll see the results within a day.

Please let me know if you encounter any other miscompiles with
Clang; I treat `asm goto` issues in particular as P0.
--
Thanks,
~Nick Desaulniers

Dmitry Vyukov

Jan 9, 2020, 12:23:34 PM
to Nick Desaulniers, Alexander Potapenko, Casey Schaufler, Daniel Axtens, clang-built-linux, Tetsuo Handa, syzbot, kasan-dev, Andrew Morton, LKML, syzkaller-bugs
On Thu, Jan 9, 2020 at 6:17 PM Nick Desaulniers <ndesau...@google.com> wrote:
> > > > > > > [clang-built linux has been severely broken since early Dec]
>
> Is there automated reporting? Consider adding our mailing list for
> Clang-specific failures.
> clang-built-linux <clang-bu...@googlegroups.com>
> Our CI looks green, but there's a very long tail of combinations of
> configs that we don't have coverage of, so bug reports are
> appreciated:
> https://github.com/ClangBuiltLinux/linux/issues

syzbot does automatic reporting, but it does not automatically
classify bugs as clang-specific.
FTR, this combination is clang+KASAN+VMAP_STACK (relatively recent
changes, and that's what triggered the infinite loop). But note that
the kernel booted, and you can ssh in and do some basic things.
c4f245b40aad7e8627b37a8bf1bdcdbcd541e665 is in the range between the
broken compiler and the newer compiler that seems to work, so I would
assume that that commit fixes this.
We will get the final stamp from syzbot hopefully by tomorrow.

Nick Desaulniers

Jan 9, 2020, 12:39:04 PM
to Dmitry Vyukov, Alexander Potapenko, Casey Schaufler, Daniel Axtens, clang-built-linux, Tetsuo Handa, syzbot, kasan-dev, Andrew Morton, LKML, syzkaller-bugs
How often do you refresh the build of Clang in syzbot? Is it manual? I
understand the tradeoffs of living on the tip of the spear, but
c4f245b40aad7e8627b37a8bf1bdcdbcd541e665 is 6 months old. So upstream
LLVM could be regressing more often, and you wouldn't notice for half a
year or more. :-/

--
Thanks,
~Nick Desaulniers

Daniel Axtens

Jan 9, 2020, 6:25:33 PM
to Dmitry Vyukov, Casey Schaufler, Alexander Potapenko, clang-built-linux, Tetsuo Handa, syzbot, kasan-dev, Andrew Morton, LKML, syzkaller-bugs
Wow, what a bug. Very happy to be off the hook for causing it, and
feeling a lot better about my inability to reproduce it with a GCC-built
kernel!

Regards,
Daniel

Casey Schaufler

Jan 10, 2020, 12:54:57 AM
to Dmitry Vyukov, Daniel Axtens, Alexander Potapenko, clang-built-linux, Tetsuo Handa, syzbot, kasan-dev, Andrew Morton, LKML, syzkaller-bugs, Casey Schaufler
Wow. I wasn't expecting clang to be the problem, just a possible
required condition. I am, of course, quite relieved.

Alexander Potapenko

Jan 10, 2020, 3:37:19 AM
to Nick Desaulniers, Dmitry Vyukov, Casey Schaufler, Daniel Axtens, clang-built-linux, Tetsuo Handa, syzbot, kasan-dev, Andrew Morton, LKML, syzkaller-bugs
KMSAN used to be the only user of Clang on syzbot, so I didn't bother
updating it too often.
Now that there are other users, we'll need a better strategy.
The Clang revisions I've picked so far came from Chromium's
Clang distributions. This is nice, because Chromium folks usually pick
a revision that has already been extensively tested at Google, plus
they make sure Chromium tests also pass.
They don't roll the compiler often, however (typically once every month
or two, but this time there were holidays, plus some nasty breakages).

Dmitry Vyukov

Jan 14, 2020, 5:15:39 AM
to Alexander Potapenko, Nick Desaulniers, Casey Schaufler, Daniel Axtens, clang-built-linux, Tetsuo Handa, syzbot, kasan-dev, Andrew Morton, LKML, syzkaller-bugs
The clang instances are back to life (incl smack).

#syz invalid
