And there are configurations and workloads that might require division
by three, so that (assuming one chance to run per cycle) the 14 seconds
becomes about 5 and the 40 seconds becomes about 13.
> o The current RCU CPU stall warning defaults remain in
> place. These are set by the CONFIG_RCU_CPU_STALL_TIMEOUT
> Kconfig parameter, which may in turn be overridden by the
> rcupdate.rcu_cpu_stall_timeout kernel boot parameter.
>
> o The current SCHED_DEADLINE default for providing spare cycles
> for other uses remains in place.
>
> o Other kthreads might have other constraints, but given that you
> were seeing RCU CPU stall warnings instead of other failures,
> the needs of RCU's kthreads seem to be a good place to start.
>
> Again, the candidate rough rule of thumb is that the SCHED_DEADLINE
> cycle be no more than 14 seconds when testing mainline kernels on the one
> hand and 40 seconds when testing enterprise distro kernels on the other.
>
> Dmitry, does that help?
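(As an aside, the stall timeout mentioned in the first bullet above can
be adjusted without rebuilding the kernel. For example, booting with
the illustrative setting below raises the limit to 60 seconds, in the
vicinity of common enterprise-distro defaults:

        rcupdate.rcu_cpu_stall_timeout=60

The parameter takes a value in seconds.)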
I checked with the people running the Linux Plumbers Conference Scheduler
Microconference, and they said that they would welcome a proposal on
this topic, which I have submitted (please see below). Would anyone
like to join as co-conspirator?
Thanx, Paul
------------------------------------------------------------------------
Title: Making SCHED_DEADLINE safe for kernel kthreads
Abstract:
Dmitry Vyukov's testing work identified some (ab)uses of sched_setattr()
that can result in SCHED_DEADLINE tasks starving RCU's kthreads for
extended time periods, not milliseconds, not seconds, not minutes, not even
hours, but days. Given that RCU CPU stall warnings are issued whenever
an RCU grace period fails to complete within a few tens of seconds,
the system did not suffer silently. Although one could argue that people
should avoid abusing sched_setattr(), people are human and humans make
mistakes. Responding to simple mistakes with RCU CPU stall warnings is
all well and good, but a more severe case could OOM the system, which
is a particularly unhelpful error message.
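To make the failure mode concrete, the (ab)use in question looks
something like the following user-space sketch. The parameters are
illustrative rather than taken from the actual reproducers, the syscall
wrapper follows the sched_setattr(2) man page (glibc provides none),
and more recent kernels bound sched_period via the
kernel.sched_deadline_period_max_us sysctl, so this exact call might
simply be rejected there:

#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6
#endif

/* Local definition, as in sched_setattr(2): glibc has no wrapper. */
struct sched_attr {
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;         /* These three are nanoseconds. */
        uint64_t sched_deadline;
        uint64_t sched_period;
};

int main(void)
{
        struct sched_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.sched_policy = SCHED_DEADLINE;
        /* 900s of runtime per 1000s period: within the default 95%
         * utilization cap, so admission control accepts it, yet the
         * task may then monopolize its CPU for 15 minutes at a
         * stretch, far longer than any RCU CPU stall timeout. */
        attr.sched_runtime  =  900ULL * 1000 * 1000 * 1000;
        attr.sched_deadline = 1000ULL * 1000 * 1000 * 1000;
        attr.sched_period   = 1000ULL * 1000 * 1000 * 1000;
        if (syscall(SYS_sched_setattr, 0, &attr, 0))
                return 1;
        for (;;)
                continue;       /* Starve kthreads bound to this CPU. */
}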
It would be better if the system were capable of operating reasonably
despite such abuse. Several approaches have been suggested.
First, sched_setattr() could recognize parameter settings that put
kthreads at risk and refuse to honor those settings. This approach
of course requires that we identify precisely what combinations of
sched_setattr() parameter settings are risky, especially given that there
are likely to be parameter settings that are both risky and highly useful.
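For example (and this is a hypothetical sketch, not existing or
proposed kernel code), the admission path might veto any single runtime
burst long enough to outlast the RCU CPU stall timeout:

/* Hypothetical check: reject SCHED_DEADLINE parameters whose runtime
 * burst exceeds the default mainline RCU CPU stall timeout, since such
 * a burst can starve a CPU's kthreads long enough to splat. */
static bool dl_runtime_looks_risky(const struct sched_attr *attr)
{
        u64 stall_ns = 21ULL * NSEC_PER_SEC;    /* Default timeout. */

        return attr->sched_runtime > stall_ns;
}

The 21-second constant matches the current CONFIG_RCU_CPU_STALL_TIMEOUT
default, but any such hard-coded threshold is exactly the sort of thing
that would need to be hashed out.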
Second, in theory, RCU could detect this situation and take the "dueling
banjos" approach of increasing its priority as needed to get the CPU time
that its kthreads need to operate correctly. However, the required amount
of CPU time can vary greatly depending on the workload. Furthermore,
non-RCU kthreads also need some amount of CPU time, and replicating
"dueling banjos" across all such Linux-kernel subsystems seems both
wasteful and error-prone. Finally, experience has shown that setting
RCU's kthreads to real-time priorities significantly harms performance
by increasing context-switch rates.
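For reference, because SCHED_DEADLINE outranks even SCHED_FIFO priority
99, "dueling banjos" would presumably require RCU's kthreads to enter
SCHED_DEADLINE themselves, along the lines of this hypothetical (and,
per the above, not actually proposed) sketch:

/* Hypothetical escalation: make the grace-period kthread a deadline
 * task in its own right, since only another deadline task can compete
 * with a deadline task.  The parameters are illustrative guesses. */
static void rcu_gp_escalate(struct task_struct *t)
{
        struct sched_attr attr = {
                .sched_policy   = SCHED_DEADLINE,
                .sched_runtime  =   1 * NSEC_PER_MSEC,
                .sched_deadline = 100 * NSEC_PER_MSEC,
                .sched_period   = 100 * NSEC_PER_MSEC,
        };

        /* In-kernel variant that skips the permission checks. */
        if (sched_setattr_nocheck(t, &attr))
                pr_warn("rcu: grace-period kthread escalation failed\n");
}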
Third, stress testing could be limited to non-risky regimes, such that
kthreads get CPU time every 5-40 seconds, depending on configuration
and experience. People needing risky parameter settings could then test
the settings that they actually need, and also take responsibility for
ensuring that kthreads get the CPU time that they need. (This of course
includes per-CPU kthreads!)
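Under this third approach, a stress-testing harness might clamp
whatever parameters it generates, along these purely illustrative lines
(reusing the user-space struct sched_attr definition from the earlier
sketch; the 4-second bound is one conservative reading of the 5-40
second range above):

/* Illustrative clamp: bound any single SCHED_DEADLINE runtime burst
 * to 4 seconds, so that starved kthreads wait at most about that long,
 * then restore the runtime <= deadline <= period invariant. */
static void clamp_dl_attr(struct sched_attr *attr)
{
        const uint64_t max_runtime_ns = 4ULL * 1000 * 1000 * 1000;

        if (attr->sched_runtime > max_runtime_ns)
                attr->sched_runtime = max_runtime_ns;
        if (attr->sched_deadline < attr->sched_runtime)
                attr->sched_deadline = attr->sched_runtime;
        if (attr->sched_period < attr->sched_deadline)
                attr->sched_period = attr->sched_deadline;
}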
Fourth, bandwidth throttling could treat tasks in other scheduling classes
as an aggregate group having a reasonable aggregate deadline and CPU
budget. This has the advantage of allowing "abusive" testing to proceed,
which in turn lets people requiring risky parameter settings rely on
that testing. Additionally, it avoids complex progress checking and priority
setting on the part of many kthreads throughout the system. However,
if this were an easy choice, the SCHED_DEADLINE developers would likely
have selected it. For example, it is necessary to determine what might
be a "reasonable" aggregate deadline and CPU budget. Reserving 5%
seems quite generous, and RCU's grace-period kthread would optimally
like a deadline in the milliseconds, but would do reasonably well with
many tens of milliseconds, and absolutely needs a deadline of no more
than a few seconds. However, for CONFIG_RCU_NOCB_CPU=y, RCU's
callback-offload kthreads might
well need a full CPU each! (This happens when the CPU being offloaded
generates a high rate of callbacks.)
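For reference, the 5% figure corresponds to the existing global
real-time throttling defaults, which SCHED_DEADLINE admission control
also consults:

        /proc/sys/kernel/sched_rt_period_us     1000000
        /proc/sys/kernel/sched_rt_runtime_us     950000

In other words, real-time and deadline tasks may together consume at
most 95% of each second, leaving 5% for everything else, including
kthreads running at normal priority.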
The goal of this proposal is therefore to generate face-to-face
discussion, hopefully resulting in a good and sufficient solution to
this problem.