INFO: rcu detected stall in do

syzbot

unread,

Oct 13, 2018, 3:31:03 AM10/13/18

to b...@alien8.de, h...@zytor.com, linux-...@vger.kernel.org, mi...@redhat.com, nst...@suse.de, syzkall...@googlegroups.com, tg...@linutronix.de, x...@kernel.org

Hello,

syzbot found the following crash on:

HEAD commit: 6b3944e42e2e afs: Fix cell proc list
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=1545a479400000
kernel config: https://syzkaller.appspot.com/x/.config?x=88e9a8a39dc0be2d
dashboard link: https://syzkaller.appspot.com/bug?extid=385468161961cee80c31
compiler: gcc (GCC) 8.0.1 20180413 (experimental)
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=14825d5e400000
C reproducer: https://syzkaller.appspot.com/x/repro.c?x=122cc531400000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+385468...@syzkaller.appspotmail.com

sshd (5193) used greatest stack depth: 15496 bytes left
sched: DL replenish lagged too much
hrtimer: interrupt took 417330 ns
rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
rcu: (detected by 0, t=10520 jiffies, g=-1179, q=0)
rcu: All QSes seen, last rcu_sched kthread activity 10521
(4295005425-4294994904), jiffies_till_next_fqs=1, root ->qsmask 0x0
swapper/0 R running task 21816 0 0 0x80000000
Call Trace:
<IRQ>
sched_show_task.cold.83+0x2b6/0x30a kernel/sched/core.c:5296
print_other_cpu_stall.cold.79+0xa83/0xba5 kernel/rcu/tree.c:1430
check_cpu_stall kernel/rcu/tree.c:1557 [inline]
__rcu_pending kernel/rcu/tree.c:3276 [inline]
rcu_pending kernel/rcu/tree.c:3319 [inline]
rcu_check_callbacks+0xafc/0x1990 kernel/rcu/tree.c:2665
update_process_times+0x2d/0x70 kernel/time/timer.c:1636
tick_sched_handle+0x9f/0x180 kernel/time/tick-sched.c:164
tick_sched_timer+0x45/0x130 kernel/time/tick-sched.c:1274
__run_hrtimer kernel/time/hrtimer.c:1398 [inline]
__hrtimer_run_queues+0x41c/0x10d0 kernel/time/hrtimer.c:1460
hrtimer_interrupt+0x313/0x780 kernel/time/hrtimer.c:1518
local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1029 [inline]
smp_apic_timer_interrupt+0x1a1/0x760 arch/x86/kernel/apic/apic.c:1054
apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:864
RIP: 0010:arch_local_irq_enable arch/x86/include/asm/paravirt.h:798 [inline]
RIP: 0010:__do_softirq+0x2c3/0xad8 kernel/softirq.c:276
Code: c7 c0 98 f2 31 89 48 c1 e8 03 42 80 3c 30 00 0f 85 17 07 00 00 48 83
3d e2 ef 51 01 00 0f 84 6e 06 00 00 fb 66 0f 1f 44 00 00 <48> c7 85 68 fe
ff ff 00 91 20 89 b8 ff ff ff ff 0f bc 85 74 fe ff
RSP: 0018:ffff8801dae07ae8 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
RAX: 1ffffffff1263e53 RBX: ffffffff89276e40 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff892776bc
RBP: ffff8801dae07cc8 R08: ffffffff89276e40 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 1ffff1003b5c0fa1
R13: ffffffff89276e40 R14: dffffc0000000000 R15: 0000000000000000
invoke_softirq kernel/softirq.c:372 [inline]
irq_exit+0x17f/0x1c0 kernel/softirq.c:412
scheduler_ipi+0x55a/0xad0 kernel/sched/core.c:1780
smp_reschedule_interrupt+0x109/0x650 arch/x86/kernel/smp.c:278
reschedule_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:888
</IRQ>
RIP: 0010:native_safe_halt+0x6/0x10 arch/x86/include/asm/irqflags.h:57
Code: e9 2c ff ff ff 48 89 c7 48 89 45 d8 e8 e3 e0 11 fa 48 8b 45 d8 e9 ca
fe ff ff 48 89 df e8 d2 e0 11 fa eb 82 55 48 89 e5 fb f4 <5d> c3 0f 1f 84
00 00 00 00 00 55 48 89 e5 f4 5d c3 90 90 90 90 90
RSP: 0018:ffffffff89207bb8 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff02
RAX: dffffc0000000000 RBX: 1ffffffff1240f7b RCX: 0000000000000000
RDX: 1ffffffff1263e54 RSI: 0000000000000001 RDI: ffffffff8931f2a0
RBP: ffffffff89207bb8 R08: ffffffff89276e40 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff89207c78
R13: ffffffff89f3be20 R14: 0000000000000000 R15: 0000000000000000
arch_safe_halt arch/x86/include/asm/paravirt.h:94 [inline]
default_idle+0xbf/0x490 arch/x86/kernel/process.c:498
arch_cpu_idle+0x10/0x20 arch/x86/kernel/process.c:489
default_idle_call+0x6d/0x90 kernel/sched/idle.c:93
cpuidle_idle_call kernel/sched/idle.c:153 [inline]
do_idle+0x3db/0x5b0 kernel/sched/idle.c:262
cpu_startup_entry+0x10c/0x120 kernel/sched/idle.c:368
rest_init+0xe2/0xe5 init/main.c:442
start_kernel+0x8f4/0x92f init/main.c:739
x86_64_start_reservations+0x29/0x2b arch/x86/kernel/head64.c:470
x86_64_start_kernel+0x76/0x79 arch/x86/kernel/head64.c:451
secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:243
rcu: rcu_sched kthread starved for 10601 jiffies! g-1179 f0x2
RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=1
rcu: RCU grace-period kthread stack dump:
rcu_sched R running task 25144 11 2 0x80000000
Call Trace:
context_switch kernel/sched/core.c:2825 [inline]
__schedule+0x86c/0x1ed0 kernel/sched/core.c:3473
schedule+0xfe/0x460 kernel/sched/core.c:3517
schedule_timeout+0x140/0x260 kernel/time/timer.c:1804
rcu_gp_kthread+0x9d9/0x2310 kernel/rcu/tree.c:2194
kthread+0x35a/0x420 kernel/kthread.c:246
ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:413

---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzk...@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with
syzbot.
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches

Thomas Gleixner

unread,

Oct 16, 2018, 9:24:38 AM10/16/18

to syzbot, Borislav Petkov, H. Peter Anvin, LKML, mi...@redhat.com, nst...@suse.de, syzkall...@googlegroups.com, x...@kernel.org, Peter Zijlstra

On Sat, 13 Oct 2018, syzbot wrote:
> syzbot found the following crash on:
>
> HEAD commit: 6b3944e42e2e afs: Fix cell proc list
> git tree: upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=1545a479400000
> kernel config: https://syzkaller.appspot.com/x/.config?x=88e9a8a39dc0be2d
> dashboard link: https://syzkaller.appspot.com/bug?extid=385468161961cee80c31
> compiler: gcc (GCC) 8.0.1 20180413 (experimental)
> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=14825d5e400000
> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=122cc531400000
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+385468...@syzkaller.appspotmail.com
>
> sshd (5193) used greatest stack depth: 15496 bytes left
> sched: DL replenish lagged too much
> hrtimer: interrupt took 417330 ns
> rcu: INFO: rcu_sched detected stalls on CPUs/tasks:

It does reproduce here but with a kworker stall. Looking at the reproducer:

*(uint32_t*)0x20000000 = 0;
*(uint32_t*)0x20000004 = 6;
*(uint64_t*)0x20000008 = 0;
*(uint32_t*)0x20000010 = 0;
*(uint32_t*)0x20000014 = 0;
*(uint64_t*)0x20000018 = 0x9917;
*(uint64_t*)0x20000020 = 0xffff;
*(uint64_t*)0x20000028 = 0;
syscall(__NR_sched_setattr, 0, 0x20000000, 0);

which means:

struct sched_attr {
.size = 0,
.policy = 6,
.flags = 0,
.nice = 0,
.priority = 0,
.deadline = 0x9917,
.runtime = 0xffff,
.period = 0,
}

policy 6 is SCHED_DEADLINE

That makes the thread hog the CPU and prevents all kind of stuff to run.

Peter, is that expected behaviour?

Thanks,

tglx

Peter Zijlstra

unread,

Oct 16, 2018, 10:03:26 AM10/16/18

to Thomas Gleixner, syzbot, Borislav Petkov, H. Peter Anvin, LKML, mi...@redhat.com, nst...@suse.de, syzkall...@googlegroups.com, Juri Lelli

On Tue, Oct 16, 2018 at 03:24:06PM +0200, Thomas Gleixner wrote:
> It does reproduce here but with a kworker stall. Looking at the reproducer:
>
> *(uint32_t*)0x20000000 = 0;
> *(uint32_t*)0x20000004 = 6;
> *(uint64_t*)0x20000008 = 0;
> *(uint32_t*)0x20000010 = 0;
> *(uint32_t*)0x20000014 = 0;
> *(uint64_t*)0x20000018 = 0x9917;
> *(uint64_t*)0x20000020 = 0xffff;
> *(uint64_t*)0x20000028 = 0;
> syscall(__NR_sched_setattr, 0, 0x20000000, 0);
>
> which means:
>
> struct sched_attr {
> .size = 0,
> .policy = 6,
> .flags = 0,
> .nice = 0,
> .priority = 0,
> .deadline = 0x9917,
> .runtime = 0xffff,
> .period = 0,
> }
>
> policy 6 is SCHED_DEADLINE
>
> That makes the thread hog the CPU and prevents all kind of stuff to run.
>
> Peter, is that expected behaviour?

Sorta, just like FIFO-99 while(1);. Except we should be rejecting the
above configuration, because of the rule:

runtime <= deadline <= period

Juri, where were we supposed to check that?

Juri Lelli

unread,

Oct 16, 2018, 10:41:33 AM10/16/18

to Peter Zijlstra, Thomas Gleixner, syzbot, Borislav Petkov, H. Peter Anvin, LKML, mi...@redhat.com, nst...@suse.de, syzkall...@googlegroups.com, juri....@redhat.com

Not if period == 0.

https://elixir.bootlin.com/linux/latest/source/kernel/sched/deadline.c#L2632
https://elixir.bootlin.com/linux/latest/source/kernel/sched/deadline.c#L2515

Now, maybe we should be checking also against the default 95% cap?

Best,

- Juri

Thomas Gleixner

unread,

Oct 16, 2018, 10:45:32 AM10/16/18

to Juri Lelli, Peter Zijlstra, syzbot, Borislav Petkov, H. Peter Anvin, LKML, mi...@redhat.com, nst...@suse.de, syzkall...@googlegroups.com, juri....@redhat.com

If the cap is active, then yes. But you want to use the actual
configuration not the default.

Thanks,

tglx

Juri Lelli

unread,

Oct 16, 2018, 11:36:12 AM10/16/18

to Thomas Gleixner, Juri Lelli, Peter Zijlstra, syzbot, Borislav Petkov, H. Peter Anvin, LKML, mi...@redhat.com, nst...@suse.de, syzkall...@googlegroups.com

Sure.

Although DEADLINE bandwidth is "replicated" across the CPUs of a domain,
so we can still admit a while(1) on multi-CPUs domains. Mmm, guess we
should be able to fix this however if we limit also the per-task maximum
bandwidth considering rt_runtime/rt_period.

Juri Lelli

unread,

Oct 18, 2018, 4:28:45 AM10/18/18

to Thomas Gleixner, Juri Lelli, Peter Zijlstra, syzbot, Borislav Petkov, H. Peter Anvin, LKML, mi...@redhat.com, nst...@suse.de, syzkall...@googlegroups.com, Luca Abeni, hen...@austad.us, Tommaso Cucinotta, Claudio Scordino, Daniel Bristot de Oliveira

OK, looks like the "which means" part above had me fooled, as
we actually have ([1], where the comment is wrong)

struct sched_attr {
.size = 0,
.policy = 6,
.flags = 0,
.nice = 0,
.priority = 0,

.runtime = 0x9917,
.deadline = 0xffff,
.period = 0,
}

So, we seem to be correctly (in theory, see below) accepting the task.

What seems to generate the problem here is that CONFIG_HZ=100 and
reproducer task has "tiny" runtime (~40us) and deadline (~66us)
parameters, combination that "bypasses" the enforcing mechanism
(performed at each tick).

Another side problem seems also to be that with such tiny parameters we
spend lot of time in the while (dl_se->runtime <= 0) loop of replenish_dl_
entity() (actually uselessly, as deadline is most probably going to
still be in the past when eventually runtime becomes positive again), as
delta_exec is huge w.r.t. runtime and runtime has to keep up with tiny
increments of dl_runtime. I guess we could ameliorate things here by
limiting the number of time we execute the loop before bailing out.

Enabling HRTICK makes a difference [2]. I played a bit with several
combinations and could verify that parameters in the ~50us range seem
usable. However, still to mention that when runtime gets close to
deadline (very high bandwidth) enforcing could be tricked again, as
hrtick overheads might make the task effectively executing for more than
the runtime, over passing the replenish instant (old deadline), so
replenish timer is not set, and letting the task continuing executing
after a replenishment.

This is all however very much platform and config dependent, of course.

So, I tend to think that we might want to play safe and put some higher
minimum value for dl_runtime (it's currently at 1ULL << DL_SCALE).
Guess the problem is to pick a reasonable value, though. Maybe link it
someway to HZ? Then we might add a sysctl (or similar) thing with which
knowledgeable users can do whatever they think their platform/config can
support?

Thoughts?

I'm adding more people on Cc as I'm not sure they are following this.
Thread starts here [3].

1 - https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/sched/types.h#L70
2 - noticed that we don't actually start hrtick on setup_new_dl_entity()
and think we should
3 - https://lore.kernel.org/lkml/000000000000a4...@google.com/

Peter Zijlstra

unread,

Oct 18, 2018, 5:48:57 AM10/18/18

to Juri Lelli, Thomas Gleixner, Juri Lelli, syzbot, Borislav Petkov, H. Peter Anvin, LKML, mi...@redhat.com, nst...@suse.de, syzkall...@googlegroups.com, Luca Abeni, hen...@austad.us, Tommaso Cucinotta, Claudio Scordino, Daniel Bristot de Oliveira

On Thu, Oct 18, 2018 at 10:28:38AM +0200, Juri Lelli wrote:

> Another side problem seems also to be that with such tiny parameters we
> spend lot of time in the while (dl_se->runtime <= 0) loop of replenish_dl_
> entity() (actually uselessly, as deadline is most probably going to
> still be in the past when eventually runtime becomes positive again), as
> delta_exec is huge w.r.t. runtime and runtime has to keep up with tiny
> increments of dl_runtime. I guess we could ameliorate things here by
> limiting the number of time we execute the loop before bailing out.

That's the "DL replenish lagged too much" case, right? Yeah, there is
only so much we can recover from.

Funny that GCC actually emits that loop; sometimes we've had to fight
GCC not to turn that into a division.

But yes, I suppose we can put a limit on how many periods we can lag
before just giving up.

> So, I tend to think that we might want to play safe and put some higher
> minimum value for dl_runtime (it's currently at 1ULL << DL_SCALE).
> Guess the problem is to pick a reasonable value, though. Maybe link it
> someway to HZ? Then we might add a sysctl (or similar) thing with which
> knowledgeable users can do whatever they think their platform/config can
> support?

Yes, a HZ related limit sounds like something we'd want. But if we're
going to do a minimum sysctl, we should also consider adding a maximum,
if you set a massive period/deadline, you can, even with a relatively
low u, incur significant delays.

And do we want to put the limit on runtime or on period ?

That is, something like:

TICK_NSEC/2 < period < 10*TICK_NSEC

and/or

TICK_NSEC/2 < runtime < 10*TICK_NSEC

Hmm, for HZ=1000 that ends up with a max period of 10ms, that's far too
low, 24Hz needs ~41ms. We can of course also limit the runtime by
capping u for users (as we should anyway).

Juri Lelli

unread,

Oct 18, 2018, 6:10:12 AM10/18/18

to Peter Zijlstra, Thomas Gleixner, Juri Lelli, syzbot, Borislav Petkov, H. Peter Anvin, LKML, mi...@redhat.com, nst...@suse.de, syzkall...@googlegroups.com, Luca Abeni, hen...@austad.us, Tommaso Cucinotta, Claudio Scordino, Daniel Bristot de Oliveira

On 18/10/18 11:48, Peter Zijlstra wrote:
> On Thu, Oct 18, 2018 at 10:28:38AM +0200, Juri Lelli wrote:
>
> > Another side problem seems also to be that with such tiny parameters we
> > spend lot of time in the while (dl_se->runtime <= 0) loop of replenish_dl_
> > entity() (actually uselessly, as deadline is most probably going to
> > still be in the past when eventually runtime becomes positive again), as
> > delta_exec is huge w.r.t. runtime and runtime has to keep up with tiny
> > increments of dl_runtime. I guess we could ameliorate things here by
> > limiting the number of time we execute the loop before bailing out.
>
> That's the "DL replenish lagged too much" case, right? Yeah, there is
> only so much we can recover from.

Right.

> Funny that GCC actually emits that loop; sometimes we've had to fight
> GCC not to turn that into a division.
>
> But yes, I suppose we can put a limit on how many periods we can lag
> before just giving up.

OK.

> > So, I tend to think that we might want to play safe and put some higher
> > minimum value for dl_runtime (it's currently at 1ULL << DL_SCALE).
> > Guess the problem is to pick a reasonable value, though. Maybe link it
> > someway to HZ? Then we might add a sysctl (or similar) thing with which
> > knowledgeable users can do whatever they think their platform/config can
> > support?
>
> Yes, a HZ related limit sounds like something we'd want. But if we're
> going to do a minimum sysctl, we should also consider adding a maximum,
> if you set a massive period/deadline, you can, even with a relatively
> low u, incur significant delays.
>
> And do we want to put the limit on runtime or on period ?
>
> That is, something like:
>
> TICK_NSEC/2 < period < 10*TICK_NSEC
>
> and/or
>
> TICK_NSEC/2 < runtime < 10*TICK_NSEC
>
> Hmm, for HZ=1000 that ends up with a max period of 10ms, that's far too
> low, 24Hz needs ~41ms. We can of course also limit the runtime by
> capping u for users (as we should anyway).

I also thought of TICK_NSEC/2 as a reasonably safe lower limit, that
will implicitly limit period as well since

runtime <= deadline <= period

Not sure about the upper limit, though. Lower limit is something related
to the inherent granularity of the platform/config, upper limit is more
to do with highest prio stuff with huge period delaying everything else;
doesn't seem to be related to HZ?

Maybe we could just pick something that seems reasonably big to handle
SCHED_DEADLINE users needs and not too big to jeopardize everyone else,
say 0.5s?

luca abeni

unread,

Oct 18, 2018, 6:23:36 AM10/18/18

to Juri Lelli, Thomas Gleixner, Juri Lelli, Peter Zijlstra, syzbot, Borislav Petkov, H. Peter Anvin, LKML, mi...@redhat.com, nst...@suse.de, syzkall...@googlegroups.com, hen...@austad.us, Tommaso Cucinotta, Claudio Scordino, Daniel Bristot de Oliveira

Hi Juri,

On Thu, 18 Oct 2018 10:28:38 +0200
Juri Lelli <juri....@redhat.com> wrote:
[...]

> struct sched_attr {
> .size = 0,
> .policy = 6,
> .flags = 0,
> .nice = 0,
> .priority = 0,
> .runtime = 0x9917,
> .deadline = 0xffff,
> .period = 0,
> }
>
> So, we seem to be correctly (in theory, see below) accepting the task.
>
> What seems to generate the problem here is that CONFIG_HZ=100 and
> reproducer task has "tiny" runtime (~40us) and deadline (~66us)
> parameters, combination that "bypasses" the enforcing mechanism
> (performed at each tick).

Ok, so the task can execute for at most 1 tick before being throttled...
Which does not look too bad.

I missed the original emails, but maybe the issue is that the task
blocks before the tick, and when it wakes up again something goes wrong
with the deadline and runtime assignment? (maybe because the deadline
is in the past?)

> Another side problem seems also to be that with such tiny parameters
> we spend lot of time in the while (dl_se->runtime <= 0) loop of
> replenish_dl_ entity() (actually uselessly, as deadline is most
> probably going to still be in the past when eventually runtime
> becomes positive again), as delta_exec is huge w.r.t. runtime and
> runtime has to keep up with tiny increments of dl_runtime. I guess we
> could ameliorate things here by limiting the number of time we
> execute the loop before bailing out.

Actually, I think the loop will iterate at most 10ms / 39us times, which
is about 256 times, right? If this is too much (I do not know how much
time it is spent executing the loop), then the solution is (as you
suggest) to increase the minimum allowed runtime.

[...]

> So, I tend to think that we might want to play safe and put some
> higher minimum value for dl_runtime (it's currently at 1ULL <<
> DL_SCALE). Guess the problem is to pick a reasonable value, though.
> Maybe link it someway to HZ?

Yes, a value dependent on HZ looks like a good idea. I would propose
HZ / N, where N is the maximum number of times you want the loop above
to be executed.

> Then we might add a sysctl (or similar)
> thing with which knowledgeable users can do whatever they think their
> platform/config can support?

I guess this can be related to the utilization limits we were
discussing some time ago... I would propose a cgroup-based interface to
set all of these limits.

Luca

luca abeni

unread,

Oct 18, 2018, 6:33:39 AM10/18/18

to Peter Zijlstra, Juri Lelli, Thomas Gleixner, Juri Lelli, syzbot, Borislav Petkov, H. Peter Anvin, LKML, mi...@redhat.com, nst...@suse.de, syzkall...@googlegroups.com, hen...@austad.us, Tommaso Cucinotta, Claudio Scordino, Daniel Bristot de Oliveira

Hi Peter,

On Thu, 18 Oct 2018 11:48:50 +0200
Peter Zijlstra <pet...@infradead.org> wrote:
[...]

> > So, I tend to think that we might want to play safe and put some
> > higher minimum value for dl_runtime (it's currently at 1ULL <<
> > DL_SCALE). Guess the problem is to pick a reasonable value, though.
> > Maybe link it someway to HZ? Then we might add a sysctl (or
> > similar) thing with which knowledgeable users can do whatever they
> > think their platform/config can support?
>
> Yes, a HZ related limit sounds like something we'd want. But if we're
> going to do a minimum sysctl, we should also consider adding a
> maximum, if you set a massive period/deadline, you can, even with a
> relatively low u, incur significant delays.

I agree with this.

> And do we want to put the limit on runtime or on period ?

I think we should have a minimum allowed runtime, a maximum allowed
runtime, a minimum allowed period and a (per-user? per-control
group?) maximum allowed utilization.

I suspect having a maximum period is useless, if we already enforce a
maximum runtime.

> That is, something like:
>
> TICK_NSEC/2 < period < 10*TICK_NSEC

As written above I would not enforce a maximum period.

>
> and/or
>
> TICK_NSEC/2 < runtime < 10*TICK_NSEC

I think (but I might be wrong) that "TICK_NSEC/2" is too large... I
would divide the tick for a larger number (how many time do we want to
allow the loop to run?)

And I think the maximum runtime should not be TICK-dependent... It is
the maximum amount of time for which we allow the dealdine task to
starve non-deadline tasks, so it should be an absolute time, not
something HZ-dependent... No?

> Hmm, for HZ=1000 that ends up with a max period of 10ms, that's far
> too low, 24Hz needs ~41ms. We can of course also limit the runtime by
> capping u for users (as we should anyway).

Regarding capping u for users: some time ago, with Juri we discussed
the idea of having per-cgroup limits on the deadline utilization... I
think this is a good idea (and if the userspace creates a cgroup per
user, this results in per-user capping - but it is more flexible in
general)

Luca

luca abeni

unread,

Oct 18, 2018, 6:38:11 AM10/18/18

to Juri Lelli, Peter Zijlstra, Thomas Gleixner, Juri Lelli, syzbot, Borislav Petkov, H. Peter Anvin, LKML, mi...@redhat.com, nst...@suse.de, syzkall...@googlegroups.com, hen...@austad.us, Tommaso Cucinotta, Claudio Scordino, Daniel Bristot de Oliveira

Hi Juri,

On Thu, 18 Oct 2018 12:10:08 +0200

Juri Lelli <juri....@redhat.com> wrote:
[...]

> > Yes, a HZ related limit sounds like something we'd want. But if
> > we're going to do a minimum sysctl, we should also consider adding
> > a maximum, if you set a massive period/deadline, you can, even with
> > a relatively low u, incur significant delays.
> >
> > And do we want to put the limit on runtime or on period ?
> >
> > That is, something like:
> >
> > TICK_NSEC/2 < period < 10*TICK_NSEC
> >
> > and/or
> >
> > TICK_NSEC/2 < runtime < 10*TICK_NSEC
> >
> > Hmm, for HZ=1000 that ends up with a max period of 10ms, that's far
> > too low, 24Hz needs ~41ms. We can of course also limit the runtime
> > by capping u for users (as we should anyway).
>
> I also thought of TICK_NSEC/2 as a reasonably safe lower limit

I tend to think that something larger than "2" should be used (maybe
10? I mean: even if HZ = 100, it might make sense to allow a runtime
equal to 1ms...)

> that will implicitly limit period as well since
>
> runtime <= deadline <= period

I agree that if we end up with TICK_NSEC/2 for the runtime limit then
explicitly enforcing a minimum period is not needed.

> Not sure about the upper limit, though. Lower limit is something
> related to the inherent granularity of the platform/config, upper
> limit is more to do with highest prio stuff with huge period delaying
> everything else; doesn't seem to be related to HZ?

I agree

Luca

Juri Lelli

unread,

Oct 18, 2018, 6:47:18 AM10/18/18

to luca abeni, Thomas Gleixner, Juri Lelli, Peter Zijlstra, syzbot, Borislav Petkov, H. Peter Anvin, LKML, mi...@redhat.com, nst...@suse.de, syzkall...@googlegroups.com, hen...@austad.us, Tommaso Cucinotta, Claudio Scordino, Daniel Bristot de Oliveira

Hi,

No, the problem is that the task won't be throttled at all, because its
replenishing instant is always way in the past when tick occurs. :-/

> > Another side problem seems also to be that with such tiny parameters
> > we spend lot of time in the while (dl_se->runtime <= 0) loop of
> > replenish_dl_ entity() (actually uselessly, as deadline is most
> > probably going to still be in the past when eventually runtime
> > becomes positive again), as delta_exec is huge w.r.t. runtime and
> > runtime has to keep up with tiny increments of dl_runtime. I guess we
> > could ameliorate things here by limiting the number of time we
> > execute the loop before bailing out.
>
> Actually, I think the loop will iterate at most 10ms / 39us times, which
> is about 256 times, right? If this is too much (I do not know how much
> time it is spent executing the loop), then the solution is (as you
> suggest) to increase the minimum allowed runtime.

Yeah, it's maybe not a big issue (and fixing it won't change anything
regarding the real problem at hand). Just thought I'd mention what I was
seeing; and having the loop limit won't harm anyway I guess.

> [...]
> > So, I tend to think that we might want to play safe and put some
> > higher minimum value for dl_runtime (it's currently at 1ULL <<
> > DL_SCALE). Guess the problem is to pick a reasonable value, though.
> > Maybe link it someway to HZ?
>
> Yes, a value dependent on HZ looks like a good idea. I would propose
> HZ / N, where N is the maximum number of times you want the loop above
> to be executed.

Mmm, it's not really about the loop, but about the granularity at which
we do enforcement.

> > Then we might add a sysctl (or similar)
> > thing with which knowledgeable users can do whatever they think their
> > platform/config can support?
>
> I guess this can be related to the utilization limits we were
> discussing some time ago... I would propose a cgroup-based interface to
> set all of these limits.

Guess we can go that path as well. But I'd leave it for a later stage.

Thanks,

- Juri

luca abeni

unread,

Oct 18, 2018, 7:08:16 AM10/18/18

to Juri Lelli, Thomas Gleixner, Juri Lelli, Peter Zijlstra, syzbot, Borislav Petkov, H. Peter Anvin, LKML, mi...@redhat.com, nst...@suse.de, syzkall...@googlegroups.com, hen...@austad.us, Tommaso Cucinotta, Claudio Scordino, Daniel Bristot de Oliveira

On Thu, 18 Oct 2018 12:47:13 +0200

Ok, I see the issue now: the problem is that the "while (dl_se->runtime
<= 0)" loop is executed at replenishment time, but the deadline should
be postponed at enforcement time.

I mean: in update_curr_dl() we do:
dl_se->runtime -= scaled_delta_exec;
if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
...
enqueue replenishment timer at dl_next_period(dl_se)
But dl_next_period() is based on a "wrong" deadline!

I think that inserting a
while (dl_se->runtime <= -pi_se->dl_runtime) {
dl_se->deadline += pi_se->dl_period;
dl_se->runtime += pi_se->dl_runtime;
}
immediately after "dl_se->runtime -= scaled_delta_exec;" would fix the
problem, no?
If we go this way, then we can remove the while loop from
replenish_dl_entity(), and change it in
WARN_ON(dl_se->runtime <= -pi_se->dl_runtime);
WARN_ON(dl_se->runtime > 0);
dl_se->deadline += pi_se->dl_period;
dl_se->runtime += pi_se->dl_runtime;
or something similar.

Luca

Juri Lelli

unread,

Oct 18, 2018, 8:21:47 AM10/18/18

to luca abeni, Thomas Gleixner, Juri Lelli, Peter Zijlstra, syzbot, Borislav Petkov, H. Peter Anvin, LKML, mi...@redhat.com, nst...@suse.de, syzkall...@googlegroups.com, hen...@austad.us, Tommaso Cucinotta, Claudio Scordino, Daniel Bristot de Oliveira

Mmm, I also thought of letting the task "pay back" its overrunning. But,
doesn't this get us quite far from what one would expect. I mean,
enforcement granularity will be way different from task period, no?

luca abeni

unread,

Oct 18, 2018, 8:36:11 AM10/18/18

to Juri Lelli, Thomas Gleixner, Juri Lelli, Peter Zijlstra, syzbot, Borislav Petkov, H. Peter Anvin, LKML, mi...@redhat.com, nst...@suse.de, syzkall...@googlegroups.com, hen...@austad.us, Tommaso Cucinotta, Claudio Scordino, Daniel Bristot de Oliveira

Hi Juri,

On Thu, 18 Oct 2018 14:21:42 +0200

Juri Lelli <juri....@redhat.com> wrote:
[...]

Yes, the granularity will be what the kernel can provide (due to the HZ
value and to the hrtick on/off state). But at least the task will not
starve non-deadline tasks (which is bug that originated this
discussion, I think).

If I understand well, there are two different (and orthogonal) issues
here:
1) Due to a bug in the accounting / enforcement mechanisms
(the wrong placement of the while() loop), the tasks consumes 100% of
the CPU time, starving non-deadline tasks
2) Due to the large HZ value, the small runtime (and period) and the
fact that hrtick is disabled, the kernel cannot provide the
requested scheduling granularity

The second issue can be fixed by imposing limits on minimum and maximum
runtime and the first issue can be fixed by changing the code as I
suggested in my previous email.

I would suggest to address both the two issues, with separate changes
(the current replenishment code looks strange anyway).

Luca

Peter Zijlstra

unread,

Oct 19, 2018, 7:40:00 AM10/19/18

to luca abeni, Juri Lelli, Thomas Gleixner, Juri Lelli, syzbot, Borislav Petkov, H. Peter Anvin, LKML, mi...@redhat.com, nst...@suse.de, syzkall...@googlegroups.com, hen...@austad.us, Tommaso Cucinotta, Claudio Scordino, Daniel Bristot de Oliveira

On Thu, Oct 18, 2018 at 01:08:11PM +0200, luca abeni wrote:
> Ok, I see the issue now: the problem is that the "while (dl_se->runtime
> <= 0)" loop is executed at replenishment time, but the deadline should
> be postponed at enforcement time.
>
> I mean: in update_curr_dl() we do:
> dl_se->runtime -= scaled_delta_exec;
> if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
> ...
> enqueue replenishment timer at dl_next_period(dl_se)
> But dl_next_period() is based on a "wrong" deadline!
>
>
> I think that inserting a
> while (dl_se->runtime <= -pi_se->dl_runtime) {
> dl_se->deadline += pi_se->dl_period;
> dl_se->runtime += pi_se->dl_runtime;
> }
> immediately after "dl_se->runtime -= scaled_delta_exec;" would fix the
> problem, no?

That certainly makes sense to me. The only remaining issue would then be
placing a limit on the amount of times we can take that loop; which, as
you propose in a later email; can be done separately as a limit on
runtime.

Peter Zijlstra

unread,

Oct 19, 2018, 9:14:26 AM10/19/18

to luca abeni, Juri Lelli, Thomas Gleixner, Juri Lelli, syzbot, Borislav Petkov, H. Peter Anvin, LKML, mi...@redhat.com, nst...@suse.de, syzkall...@googlegroups.com, hen...@austad.us, Tommaso Cucinotta, Claudio Scordino, Daniel Bristot de Oliveira

On Thu, Oct 18, 2018 at 12:33:32PM +0200, luca abeni wrote:
> Hi Peter,
>
> On Thu, 18 Oct 2018 11:48:50 +0200
> Peter Zijlstra <pet...@infradead.org> wrote:
> [...]
> > > So, I tend to think that we might want to play safe and put some
> > > higher minimum value for dl_runtime (it's currently at 1ULL <<
> > > DL_SCALE). Guess the problem is to pick a reasonable value, though.
> > > Maybe link it someway to HZ? Then we might add a sysctl (or
> > > similar) thing with which knowledgeable users can do whatever they
> > > think their platform/config can support?
> >
> > Yes, a HZ related limit sounds like something we'd want. But if we're
> > going to do a minimum sysctl, we should also consider adding a
> > maximum, if you set a massive period/deadline, you can, even with a
> > relatively low u, incur significant delays.
>
> I agree with this.
>
>
> > And do we want to put the limit on runtime or on period ?
>
> I think we should have a minimum allowed runtime, a maximum allowed
> runtime, a minimum allowed period and a (per-user? per-control
> group?) maximum allowed utilization.

I was talking about a global !root max-u, but yes the cgroup max-u makes
definite sense as well.

> I suspect having a maximum period is useless, if we already enforce a
> maximum runtime.

Probably; yes. The asymmetry is unfortunate of course.

> > That is, something like:
> >
> > TICK_NSEC/2 < period < 10*TICK_NSEC
>
> As written above I would not enforce a maximum period.

I'm confused: 'period < 10*TICK_NSEC' reads like a max to me.

(irrespective of the argument on wether the max should be HZ related;
and I think you and Juri made good argument for it not to be)

> > and/or
> >
> > TICK_NSEC/2 < runtime < 10*TICK_NSEC
>
> I think (but I might be wrong) that "TICK_NSEC/2" is too large... I
> would divide the tick for a larger number (how many time do we want to
> allow the loop to run?)

It depends on how strict we want to enforce the no-interference rule.
The smaller we make this, the less accurate we enforce, the worse the
interference between tasks.

Note that we're only talking about a default; and HZ=100 is daft in any
case.

> And I think the maximum runtime should not be TICK-dependent... It is
> the maximum amount of time for which we allow the dealdine task to
> starve non-deadline tasks, so it should be an absolute time, not
> something HZ-dependent... No?

Agreed.

> > Hmm, for HZ=1000 that ends up with a max period of 10ms, that's far
> > too low, 24Hz needs ~41ms. We can of course also limit the runtime by
> > capping u for users (as we should anyway).
>
> Regarding capping u for users: some time ago, with Juri we discussed
> the idea of having per-cgroup limits on the deadline utilization... I
> think this is a good idea (and if the userspace creates a cgroup per
> user, this results in per-user capping - but it is more flexible in
> general)

Agreed, that makes sense.

luca abeni

unread,

Oct 19, 2018, 4:50:12 PM10/19/18

to Peter Zijlstra, Juri Lelli, Thomas Gleixner, Juri Lelli, syzbot, Borislav Petkov, H. Peter Anvin, LKML, mi...@redhat.com, nst...@suse.de, syzkall...@googlegroups.com, hen...@austad.us, Tommaso Cucinotta, Claudio Scordino, Daniel Bristot de Oliveira

On Fri, 19 Oct 2018 13:39:42 +0200
Peter Zijlstra <pet...@infradead.org> wrote:

> On Thu, Oct 18, 2018 at 01:08:11PM +0200, luca abeni wrote:
> > Ok, I see the issue now: the problem is that the "while
> > (dl_se->runtime <= 0)" loop is executed at replenishment time, but
> > the deadline should be postponed at enforcement time.
> >
> > I mean: in update_curr_dl() we do:
> > dl_se->runtime -= scaled_delta_exec;
> > if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
> > ...
> > enqueue replenishment timer at dl_next_period(dl_se)
> > But dl_next_period() is based on a "wrong" deadline!
> >
> >
> > I think that inserting a
> > while (dl_se->runtime <= -pi_se->dl_runtime) {
> > dl_se->deadline += pi_se->dl_period;
> > dl_se->runtime += pi_se->dl_runtime;
> > }
> > immediately after "dl_se->runtime -= scaled_delta_exec;" would fix
> > the problem, no?
>
> That certainly makes sense to me.

Good; I'll try to work on this idea in the weekend.

Thanks,
Luca

Juri Lelli

unread,

Oct 24, 2018, 8:03:42 AM10/24/18

to Peter Zijlstra, luca abeni, Thomas Gleixner, Juri Lelli, syzbot, Borislav Petkov, H. Peter Anvin, LKML, mi...@redhat.com, nst...@suse.de, syzkall...@googlegroups.com, hen...@austad.us, Tommaso Cucinotta, Claudio Scordino, Daniel Bristot de Oliveira

On 19/10/18 22:50, luca abeni wrote:
> On Fri, 19 Oct 2018 13:39:42 +0200
> Peter Zijlstra <pet...@infradead.org> wrote:
>
> > On Thu, Oct 18, 2018 at 01:08:11PM +0200, luca abeni wrote:
> > > Ok, I see the issue now: the problem is that the "while
> > > (dl_se->runtime <= 0)" loop is executed at replenishment time, but
> > > the deadline should be postponed at enforcement time.
> > >
> > > I mean: in update_curr_dl() we do:
> > > dl_se->runtime -= scaled_delta_exec;
> > > if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
> > > ...
> > > enqueue replenishment timer at dl_next_period(dl_se)
> > > But dl_next_period() is based on a "wrong" deadline!
> > >
> > >
> > > I think that inserting a
> > > while (dl_se->runtime <= -pi_se->dl_runtime) {
> > > dl_se->deadline += pi_se->dl_period;
> > > dl_se->runtime += pi_se->dl_runtime;
> > > }
> > > immediately after "dl_se->runtime -= scaled_delta_exec;" would fix
> > > the problem, no?
> >
> > That certainly makes sense to me.
>
> Good; I'll try to work on this idea in the weekend.

So, we (me and Luca) managed to spend some more time on this and found a
few more things worth sharing. I'll try to summarize what we have got so
far (including what already discussed) because impression is that each
point might deserve a fix or at least consideration (just amazing how a
simple random fuzzer thing can highlight all that :). Apologies for the
long email.

Reproducer runs on a CONFIG_HZ=100, CONFIG_IRQ_TIME_ACCOUNTING kernel
and does something like this (only the bits that seems to matter here)

int main(void)
{
[...]
[setup stuff at 0x2001d000]
syscall(__NR_perf_event_open, 0x2001d000, 0, -1, -1, 0);

*(uint32_t*)0x20000000 = 0;
*(uint32_t*)0x20000004 = 6;
*(uint64_t*)0x20000008 = 0;
*(uint32_t*)0x20000010 = 0;
*(uint32_t*)0x20000014 = 0;

*(uint64_t*)0x20000018 = 0x9917; <-- ~40us
*(uint64_t*)0x20000020 = 0xffff; <-- ~65us (~60% bandwidth)

*(uint64_t*)0x20000028 = 0;
syscall(__NR_sched_setattr, 0, 0x20000000, 0);

[busy loop]
return 0;
}

And this causes problems because the task is actually never throttled.

Pain points:

1. Granularity of enforcement (at each tick) is huge compared with
the task runtime. This makes starting the replenishment timer,
when runtime is depleted, always to fail (because old deadline
is way in the past). So, the task is fully replenished and put
back to run.

- Luca's proposal should help here, since the deadline is postponed
at throttling time, and replenishment timer set to that (and it
should be in the future)

1.1 Even if we fix 1. in a configuration like this, the task would
still be able to run for ~10ms (worst case) and potentially starve
other tasks. It doesn't seem a too big interval maybe, but there
might be other very short activities that might miss an occasion
to run "quickly".

- Might be fixed by imposing (via sysctl) reasonable defaults for
minimum runtime (w.r.t. HZ, like HZ/2) and maximum for period
(as also a very small bandwidth task can have a big runtime if
period is big as well)

(1.2) When runtime becomes very negative (because delta_exec was big)
we seem to spend lot of time inside the replenishment loop.

- Not sure it's such a big problem, might need more profiling.
Feeling is that once the other points will be addressed this
won't matter anymore

2. This is related to perf_event_open syscall reproducer does before
becoming DEADLINE and entering the busy loop. Enabling of perf
swevents generates lot of hrtimers load that happens in the
reproducer task context. Now, DEADLINE uses rq_clock() for setting
deadlines, but rq_clock_task() for doing runtime enforcement.
In a situation like this it seems that the amount of irq pressure
becomes pretty big (I'm seeing this on kvm, real hw should maybe do
better, pain point remains I guess), so rq_clock() and
rq_clock_task() might become more a more skewed w.r.t. each other.
Since rq_clock() is only used when setting absolute deadlines for
the first time (or when resetting them in certain cases), after a
bit the replenishment code will start to see postponed deadlines
always in the past w.r.t. rq_clock(). And this brings us back to the
fact that the task is never stopped, since it can't keep up with
rq_clock().

- Not sure yet how we want to address this [1]. We could use
rq_clock() everywhere, but tasks might be penalized by irq
pressure (theoretically this would mandate that irqs are
explicitly accounted for I guess). I tried to use the skew between
the two clocks to "fix" deadlines, but that puts us at risks of
de-synchronizing userspace and kernel views of deadlines.

3. HRTICK is not started for new entities.

- Already got a patch for it.

This should be it, I hope. Luca (thanks a lot for your help) and please
add or correct me if I was wrong.

Thoughts?

Best,

- Juri

1 - https://elixir.bootlin.com/linux/latest/source/kernel/sched/deadline.c#L1162

Dmitry Vyukov

unread,

Oct 27, 2018, 7:17:14 AM10/27/18

to Juri Lelli, Peter Zijlstra, luca abeni, Thomas Gleixner, Juri Lelli, syzbot, Borislav Petkov, H. Peter Anvin, LKML, Ingo Molnar, nst...@suse.de, syzkall...@googlegroups.com, hen...@austad.us, Tommaso Cucinotta, Claudio Scordino, Daniel Bristot de Oliveira

1. Fuzzing finds bugs in any code. Always.
If a code wasn't fuzzed, there are bugs.

2. This fuzzer is not so simple ;)

> --
> You received this message because you are subscribed to the Google Groups "syzkaller-bugs" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to syzkaller-bug...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/syzkaller-bugs/20181024120335.GE29272%40localhost.localdomain.
> For more options, visit https://groups.google.com/d/optout.

Juri Lelli

unread,

Oct 28, 2018, 4:34:03 AM10/28/18

to Dmitry Vyukov, Peter Zijlstra, luca abeni, Thomas Gleixner, Juri Lelli, syzbot, Borislav Petkov, H. Peter Anvin, LKML, Ingo Molnar, nst...@suse.de, syzkall...@googlegroups.com, hen...@austad.us, Tommaso Cucinotta, Claudio Scordino, Daniel Bristot de Oliveira

Indeed! I meant that it's amazing how the fuzzer was able to forge a
relatively simple reproducer that highlighted the problem.

Peter Zijlstra

unread,

Oct 30, 2018, 6:46:08 AM10/30/18

to Juri Lelli, luca abeni, Thomas Gleixner, Juri Lelli, syzbot, Borislav Petkov, H. Peter Anvin, LKML, mi...@redhat.com, nst...@suse.de, syzkall...@googlegroups.com, hen...@austad.us, Tommaso Cucinotta, Claudio Scordino, Daniel Bristot de Oliveira

On Wed, Oct 24, 2018 at 02:03:35PM +0200, Juri Lelli wrote:
> Pain points:
>
> 1. Granularity of enforcement (at each tick) is huge compared with
> the task runtime. This makes starting the replenishment timer,
> when runtime is depleted, always to fail (because old deadline
> is way in the past). So, the task is fully replenished and put
> back to run.
>
> - Luca's proposal should help here, since the deadline is postponed
> at throttling time, and replenishment timer set to that (and it
> should be in the future)

ACK

> 1.1 Even if we fix 1. in a configuration like this, the task would
> still be able to run for ~10ms (worst case) and potentially starve
> other tasks. It doesn't seem a too big interval maybe, but there
> might be other very short activities that might miss an occasion
> to run "quickly".
>
> - Might be fixed by imposing (via sysctl) reasonable defaults for
> minimum runtime (w.r.t. HZ, like HZ/2) and maximum for period
> (as also a very small bandwidth task can have a big runtime if
> period is big as well)

ACK

> (1.2) When runtime becomes very negative (because delta_exec was big)
> we seem to spend lot of time inside the replenishment loop.
>
> - Not sure it's such a big problem, might need more profiling.
> Feeling is that once the other points will be addressed this
> won't matter anymore

Right, once the sysctl limits are in place, we should not have such
excessive cases anymore.

> 2. This is related to perf_event_open syscall reproducer does before
> becoming DEADLINE and entering the busy loop. Enabling of perf
> swevents generates lot of hrtimers load that happens in the
> reproducer task context. Now, DEADLINE uses rq_clock() for setting
> deadlines, but rq_clock_task() for doing runtime enforcement.
> In a situation like this it seems that the amount of irq pressure
> becomes pretty big (I'm seeing this on kvm, real hw should maybe do
> better, pain point remains I guess), so rq_clock() and
> rq_clock_task() might become more a more skewed w.r.t. each other.
> Since rq_clock() is only used when setting absolute deadlines for
> the first time (or when resetting them in certain cases), after a
> bit the replenishment code will start to see postponed deadlines
> always in the past w.r.t. rq_clock(). And this brings us back to the
> fact that the task is never stopped, since it can't keep up with
> rq_clock().
>
> - Not sure yet how we want to address this [1]. We could use
> rq_clock() everywhere, but tasks might be penalized by irq
> pressure (theoretically this would mandate that irqs are
> explicitly accounted for I guess). I tried to use the skew between
> the two clocks to "fix" deadlines, but that puts us at risks of
> de-synchronizing userspace and kernel views of deadlines.

Hurm.. right. We knew of this issue back when we did it.
I suppose now it hurts and we need to figure something out.

By virtue of being a real-time class, we do indeed need to have deadline
on the wall-clock. But if we then don't account runtime on that same
clock, but on a potentially slower clock, we get the problem that we can
run longer than our period/deadline, which is what we're running into
here I suppose.

And yes, at some point RT workloads need to be aware of the jitter
injected by things like IRQs and such. But I believe the rationale was
that for soft real-time workloads this current semantic was 'easier'
because we get to ignore IRQ overhead for workload estimation etc.

What we could maybe do is track runtime in both rq_clock_task() and
rq_clock() and detect where the rq_clock based one exceeds the period
and then push out the deadline (and add runtime).

Maybe something along such lines; does that make sense?

---
include/linux/sched.h | 3 +++
kernel/sched/deadline.c | 53 ++++++++++++++++++++++++++++++++-----------------
2 files changed, 38 insertions(+), 18 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8f8a5418b627..6aec81cb3d2e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -522,6 +522,9 @@ struct sched_dl_entity {
u64 deadline; /* Absolute deadline for this instance */
unsigned int flags; /* Specifying the scheduler behaviour */

+ u64 wallstamp;
+ s64 walltime;
+
/*
* Some bool flags:
*
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 91e4202b0634..633c8f36c700 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -683,16 +683,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se,
if (dl_se->dl_yielded && dl_se->runtime > 0)
dl_se->runtime = 0;

- /*
- * We keep moving the deadline away until we get some
- * available runtime for the entity. This ensures correct
- * handling of situations where the runtime overrun is
- * arbitrary large.
- */
- while (dl_se->runtime <= 0) {
- dl_se->deadline += pi_se->dl_period;
- dl_se->runtime += pi_se->dl_runtime;
- }
+ /* XXX what do we do with pi_se */

/*
* At this point, the deadline really should be "in
@@ -1148,9 +1139,9 @@ static void update_curr_dl(struct rq *rq)
{
struct task_struct *curr = rq->curr;
struct sched_dl_entity *dl_se = &curr->dl;
- u64 delta_exec, scaled_delta_exec;
+ u64 delta_exec, scaled_delta_exec, delta_wall;
int cpu = cpu_of(rq);
- u64 now;
+ u64 now, wall;

if (!dl_task(curr) || !on_dl_rq(dl_se))
return;
@@ -1171,6 +1162,17 @@ static void update_curr_dl(struct rq *rq)
return;
}

+ wall = rq_clock();
+ delta_wall = wall - dl_se->wallstamp;
+ if (delta_wall > 0) {
+ dl_se->walltime += delta_wall;
+ dl_se->wallstamp = wall;
+ }
+
+ /* check if rq_clock_task() has been too slow */
+ if (unlikely(dl_se->walltime > dl_se->period))
+ goto throttle;
+
schedstat_set(curr->se.statistics.exec_max,
max(curr->se.statistics.exec_max, delta_exec));

@@ -1204,14 +1206,27 @@ static void update_curr_dl(struct rq *rq)

dl_se->runtime -= scaled_delta_exec;

-throttle:
if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
+throttle:
dl_se->dl_throttled = 1;

- /* If requested, inform the user about runtime overruns. */
- if (dl_runtime_exceeded(dl_se) &&
- (dl_se->flags & SCHED_FLAG_DL_OVERRUN))
- dl_se->dl_overrun = 1;
+ if (dl_runtime_exceeded(dl_se)) {
+ /* If requested, inform the user about runtime overruns. */
+ if (dl_se->flags & SCHED_FLAG_DL_OVERRUN)
+ dl_se->dl_overrun = 1;
+
+ }
+
+ /*
+ * We keep moving the deadline away until we get some available
+ * runtime for the entity. This ensures correct handling of
+ * situations where the runtime overrun is arbitrary large.
+ */
+ while (dl_se->runtime <= 0 || dl_se->walltime > dl_se->period) {
+ dl_se->deadline += dl_se->dl_period;
+ dl_se->runtime += dl_se->dl_runtime;
+ dl_se->walltime -= dl_se->dl_period;
+ }

__dequeue_task_dl(rq, curr, 0);
if (unlikely(dl_se->dl_boosted || !start_dl_timer(curr)))
@@ -1751,9 +1766,10 @@ pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)

p = dl_task_of(dl_se);
p->se.exec_start = rq_clock_task(rq);
+ dl_se->wallstamp = rq_clock(rq);

/* Running task will never be pushed. */
- dequeue_pushable_dl_task(rq, p);
+ dequeue_pushable_dl_task(rq, p);

if (hrtick_enabled(rq))
start_hrtick_dl(rq, p);
@@ -1811,6 +1827,7 @@ static void set_curr_task_dl(struct rq *rq)
struct task_struct *p = rq->curr;

p->se.exec_start = rq_clock_task(rq);
+ p->dl_se.wallstamp = rq_clock(rq);

/* You can't push away the running task */
dequeue_pushable_dl_task(rq, p);

Juri Lelli

unread,

Oct 30, 2018, 7:12:27 AM10/30/18

to Peter Zijlstra, luca abeni, Thomas Gleixner, Juri Lelli, syzbot, Borislav Petkov, H. Peter Anvin, LKML, mi...@redhat.com, nst...@suse.de, syzkall...@googlegroups.com, hen...@austad.us, Tommaso Cucinotta, Claudio Scordino, Daniel Bristot de Oliveira

On 30/10/18 11:45, Peter Zijlstra wrote:

[...]

> Hurm.. right. We knew of this issue back when we did it.
> I suppose now it hurts and we need to figure something out.
>
> By virtue of being a real-time class, we do indeed need to have deadline
> on the wall-clock. But if we then don't account runtime on that same
> clock, but on a potentially slower clock, we get the problem that we can
> run longer than our period/deadline, which is what we're running into
> here I suppose.
>
> And yes, at some point RT workloads need to be aware of the jitter
> injected by things like IRQs and such. But I believe the rationale was
> that for soft real-time workloads this current semantic was 'easier'
> because we get to ignore IRQ overhead for workload estimation etc.

Right. In this case the task is self injecting IRQ load, but it maybe
doesn't make a big difference on how we need to treat it (supposing we
can actually distinguish).

> What we could maybe do is track runtime in both rq_clock_task() and
> rq_clock() and detect where the rq_clock based one exceeds the period
> and then push out the deadline (and add runtime).
>
> Maybe something along such lines; does that make sense?

Yeah, I think I've got the gist of the idea. I'll play with it.

Thanks,

- Juri

Daniel Bristot de Oliveira

unread,

Oct 31, 2018, 12:18:03 PM10/31/18

to luca abeni, Peter Zijlstra, Juri Lelli, Thomas Gleixner, Juri Lelli, syzbot, Borislav Petkov, H. Peter Anvin, LKML, mi...@redhat.com, nst...@suse.de, syzkall...@googlegroups.com, hen...@austad.us, Tommaso Cucinotta, Claudio Scordino

On 10/30/18 12:08 PM, luca abeni wrote:
> Hi Peter,
>

> On Tue, 30 Oct 2018 11:45:54 +0100

> Peter Zijlstra <pet...@infradead.org> wrote:
> [...]

> I might be hugely misunderstanding something here, but in my impression
> the issue is just that if the IRQ time is not accounted to the
> -deadline task, then the non-deadline tasks might be starved.
>
> I do not see this as a skew between two clocks, but as an accounting
> thing:
> - if we decide that the IRQ time is accounted to the -deadline
> task (this is what happens with CONFIG_IRQ_TIME_ACCOUNTING disabled),
> then the non-deadline tasks are not starved (but of course the
> -deadline tasks executes for less than its reserved time in the
> period);
> - if we decide that the IRQ time is not accounted to the -deadline task
> (this is what happens with CONFIG_IRQ_TIME_ACCOUNTING enabled), then
> the -deadline task executes for the expected amount of time (about
> 60% of the CPU time), but an IRQ load of 40% will starve non-deadline
> tasks (this is what happens in the bug that triggered this discussion)
>
> I think this might be seen as an adimission control issue: when
> CONFIG_IRQ_TIME_ACCOUNTING is disabled, the IRQ time is accounted for
> in the admission control (because it ends up in the task's runtime),
> but when CONFIG_IRQ_TIME_ACCOUNTING is enabled the IRQ time is not
> accounted for in the admission test (the IRQ handler becomes some sort
> of entity with a higher priority than -deadline tasks, on which no
> accounting or enforcement is performed).
>

I am sorry for taking to long to join in the discussion.

I agree with Luca. I've seem this behavior two time before. Firstly when we were
trying to make the rt throttling to have a very short runtime for non-rt
threads, and then in the proof of concept of the semi-partitioned scheduler.

Firstly, I started thinking on this as a skew between both clocks and disabled
IRQ_TIME_ACCOUNTING. But by ignoring IRQ accounting, we are assuming that the
IRQ runtime will be accounted as the thread's runtime. In other words, we are
just sweeping the trash under the rug, where the rug is the worst case execution
time estimation/definition (which is an even more complex problem). In the
Brazilian part of the Ph.D we are dealing with probabilistic worst case
execution time, and to be able to use probabilistic methods, we need to remove
the noise of the IRQs in the execution time [1]. So, IMHO, using
CONFIG_IRQ_TIME_ACCOUNTING is a good thing.

The fact that we have barely no control of the execution of IRQs, at first
glance, let us think that the idea of considering an IRQ as a task seems to be
absurd. But, it is not. The IRQs run a piece of code that is, in the vast
majority of the case, not related to the current thread, so it runs another
"task". In the occurrence of more than one IRQ concurrently, the processor
serves the IRQ in a predictable order [2], so the processor schedules the IRQs
as a "task". Finally, there are precedence constraints among threads and IRQs.
For instance, the latency can be seen as the response time of the timer IRQ
handler, plus the delta of the return of the handler and the starting of the
execution of cyclictest [3]. In the theory, the idea of precedence constraints
is also about "task".

So IMHO, IRQs can be considered as a task (I am considering in my model), and
the place to account this would be in the admission test.

The problem is that, for the best of my knowledge, there is no admissions test
for such task model/system:

Two level of schedulers. A high priority scheduler that schedules a non
preemptive task set (IRQ) under a fixed priority (the processor scheduler do it,
and on intel it is a fixed priority). A lower priority task set (threads)
scheduled by the OS.

But assuming that our current admission control is more about a safe guard than
an exact admission control - that is, for multiprocessor it is necessary, but
not sufficient. (Theoretically, it works for uniprocessor, but... there is a
paper of Rob Davis somewhere that shows that if we have "context switch" (and so
scheduler for our case)) with different costs, the many things does not hold
true, for instance, Deadline Monotonic is not optimal... but I will have to read
more to enter in this point, anyway, multiprocessor is only necessary).

With this in mind: we do *not* use/have an exact admission test for all cases.
By not having an exact admission test, we assume the user knows what he/she is
doing. In this case, if they have a high load of IRQs... they need to know that:

1) Their periods should be consistent with the "interference" they might receive.
2) Their tasks can miss the deadline because of IRQs (and there is no way to
avoid this without "throttling" IRQs...)

So, is it worth to put a duct tape for this case?

My fear is that, by putting a duct tape here, we would turn things prone to more
complex errors/undeterminism... so...

I think we have another point to add to the discussion at plumbers, Juri.

[1] http://bristot.me/wp-content/uploads/2018/09/conference_071817.pdf
[2] Intel® 64 and IA-32 Architectures Software Developer’s Manual: Volume 3,
section: 6.9 PRIORITY AMONG SIMULTANEOUS EXCEPTIONS AND INTERRUPTS.
[3] I will detail this a little bit more in the plumbers presentation.

>
>> And yes, at some point RT workloads need to be aware of the jitter
>> injected by things like IRQs and such. But I believe the rationale was
>> that for soft real-time workloads this current semantic was 'easier'
>> because we get to ignore IRQ overhead for workload estimation etc.
>>
>> What we could maybe do is track runtime in both rq_clock_task() and
>> rq_clock() and detect where the rq_clock based one exceeds the period
>> and then push out the deadline (and add runtime).
>>
>> Maybe something along such lines; does that make sense?
>

> Uhm... I have to study and test your patch... I'll comment on this
> later.
>
>
>
> Thanks,
> Luca

If I got it correctly, it can be the case that we would throttle a thread that,
because of IRQs, received less CPU time than expected, right?

Thanks!
-- Daniel

Juri Lelli

unread,

Oct 31, 2018, 12:40:17 PM10/31/18

to Daniel Bristot de Oliveira, luca abeni, Peter Zijlstra, Thomas Gleixner, Juri Lelli, syzbot, Borislav Petkov, H. Peter Anvin, LKML, mi...@redhat.com, nst...@suse.de, syzkall...@googlegroups.com, hen...@austad.us, Tommaso Cucinotta, Claudio Scordino

Yeah, sure. My fear in a case like this though is that the task that
ends up starving other is "creating" IRQ overhead on itself. Kind of
DoS, no?

I'm seeing something along the lines of what Peter suggested as a last
resort measure we probably still need to put in place.

Thanks,

- Juri

Peter Zijlstra

unread,

Oct 31, 2018, 1:39:04 PM10/31/18

to Daniel Bristot de Oliveira, luca abeni, Juri Lelli, Thomas Gleixner, Juri Lelli, syzbot, Borislav Petkov, H. Peter Anvin, LKML, mi...@redhat.com, nst...@suse.de, syzkall...@googlegroups.com, hen...@austad.us, Tommaso Cucinotta, Claudio Scordino

On Wed, Oct 31, 2018 at 05:18:00PM +0100, Daniel Bristot de Oliveira wrote:
> Brazilian part of the Ph.D we are dealing with probabilistic worst case
> execution time, and to be able to use probabilistic methods, we need to remove
> the noise of the IRQs in the execution time [1]. So, IMHO, using
> CONFIG_IRQ_TIME_ACCOUNTING is a good thing.

> With this in mind: we do *not* use/have an exact admission test for all cases.
> By not having an exact admission test, we assume the user knows what he/she is
> doing. In this case, if they have a high load of IRQs... they need to know that:

So I mostly agree with the things you said; IRQs are a different
'context' or 'task' from the normal scheduled task. For AC we can
consider an average IRQ load etc..

But even if we get AC sufficient with an average IRQ load, there are
still the cases where the IRQs cluster. So, esp. for very short
runtimes, you can get this scenario.

> 1) Their periods should be consistent with the "interference" they might receive.
> 2) Their tasks can miss the deadline because of IRQs (and there is no way to
> avoid this without "throttling" IRQs...)

True.

> So, is it worth to put a duct tape for this case?

My point was mostly to to not misbehave. Maybe I got it wrong, that
happens ;-)

> >> @@ -1171,6 +1162,17 @@ static void update_curr_dl(struct rq *rq)
> >> return;
> >> }
> >>
> >> + wall = rq_clock();
> >> + delta_wall = wall - dl_se->wallstamp;
> >> + if (delta_wall > 0) {
> >> + dl_se->walltime += delta_wall;
> >> + dl_se->wallstamp = wall;
> >> + }
> >> +
> >> + /* check if rq_clock_task() has been too slow */
> >> + if (unlikely(dl_se->walltime > dl_se->period))
> >> + goto throttle;
> >> +
>
> If I got it correctly, it can be the case that we would throttle a thread that,
> because of IRQs, received less CPU time than expected, right?

So the thinking was that when the actual walltime runtime (which
includes IRQs) exceeds the period, our whole CBS model breaks and we
need to replenish and push forward the deadline.

Maybe it should do that and not throttle.

Also, the walltime -= period thing should be fixed to never go negative
I think.

Peter Zijlstra

unread,

Oct 31, 2018, 1:39:56 PM10/31/18

to Juri Lelli, Daniel Bristot de Oliveira, luca abeni, Thomas Gleixner, Juri Lelli, syzbot, Borislav Petkov, H. Peter Anvin, LKML, mi...@redhat.com, nst...@suse.de, syzkall...@googlegroups.com, hen...@austad.us, Tommaso Cucinotta, Claudio Scordino

On Wed, Oct 31, 2018 at 05:40:10PM +0100, Juri Lelli wrote:
> I'm seeing something along the lines of what Peter suggested as a last
> resort measure we probably still need to put in place.

Yes, you certainly hope to never hit that, and should not in a proper
setup. But we need something to keep things 'working'.

Daniel Bristot de Oliveira

unread,

Oct 31, 2018, 1:58:11 PM10/31/18

to Juri Lelli, luca abeni, Peter Zijlstra, Thomas Gleixner, Juri Lelli, syzbot, Borislav Petkov, H. Peter Anvin, LKML, mi...@redhat.com, nst...@suse.de, syzkall...@googlegroups.com, hen...@austad.us, Tommaso Cucinotta, Claudio Scordino

I see your point.

But how about a non-rt thread that creates a lot of timers, then a DL task
arrives and preempts it, receiving the interfere from interrupts that were
caused by the previous thread?

Actually, enabling/disabling sched stats in a loop generates a IPI storm on all
(other) CPUs because of updates in jump labels (we will reduce/bound that with
the batch of jump label update, but still, the problem will exist). But not
only, iirc we can cause this with a madvise to cause the flush of pages.

> I'm seeing something along the lines of what Peter suggested as a last
> resort measure we probably still need to put in place.

I meant, I am not against the/a fix, i just think that... it is more complicated
that it seems.

For example: Let's assume that we have a non-rt bad thread A in CPU 0 generating
IPIs because of static key update, and a good dl thread B in the CPU 1.

In this case, the thread B could run less than what was reserved for it, but it
was not causing the interrupts. It is not fair to put a penalty in the thread B.

The same is valid for a dl thread running in the same CPU that is receiving a
lot of network packets to another application, and other legit cases.

In the end, if we want to avoid non-rt threads starving, we need to prioritize
them some time, but in this case, we return to the DL server for non-rt threads.

Thoughts?

Thanks,
-- Daniel

> Thanks,
>
> - Juri
>

Juri Lelli

unread,

Nov 1, 2018, 1:55:23 AM11/1/18

to Daniel Bristot de Oliveira, luca abeni, Peter Zijlstra, Thomas Gleixner, Juri Lelli, syzbot, Borislav Petkov, H. Peter Anvin, LKML, mi...@redhat.com, nst...@suse.de, syzkall...@googlegroups.com, hen...@austad.us, Tommaso Cucinotta, Claudio Scordino

And I see your point. :-)

I'd also add (maybe you mentioned this as well) that it seems the same
could happen with RT throttling safety measure, as we are using
clock_task there as well to account runtime and throttle stuff.

OTOH, when something like you describe happens, guarantees are probably
already out of the window and we should just do our best to at least
keep the system "working"? (maybe only to warn the user that something
bad has happened)

luca abeni

unread,

Nov 2, 2018, 4:25:40 AM11/2/18

to Peter Zijlstra, Juri Lelli, Thomas Gleixner, Juri Lelli, syzbot, Borislav Petkov, H. Peter Anvin, LKML, mi...@redhat.com, nst...@suse.de, syzkall...@googlegroups.com, hen...@austad.us, Tommaso Cucinotta, Claudio Scordino, Daniel Bristot de Oliveira

Hi Peter,

On Tue, 30 Oct 2018 11:45:54 +0100
Peter Zijlstra <pet...@infradead.org> wrote:
[...]

I might be hugely misunderstanding something here, but in my impression
the issue is just that if the IRQ time is not accounted to the
-deadline task, then the non-deadline tasks might be starved.

I do not see this as a skew between two clocks, but as an accounting
thing:
- if we decide that the IRQ time is accounted to the -deadline
task (this is what happens with CONFIG_IRQ_TIME_ACCOUNTING disabled),
then the non-deadline tasks are not starved (but of course the
-deadline tasks executes for less than its reserved time in the
period);
- if we decide that the IRQ time is not accounted to the -deadline task
(this is what happens with CONFIG_IRQ_TIME_ACCOUNTING enabled), then
the -deadline task executes for the expected amount of time (about
60% of the CPU time), but an IRQ load of 40% will starve non-deadline
tasks (this is what happens in the bug that triggered this discussion)

I think this might be seen as an adimission control issue: when
CONFIG_IRQ_TIME_ACCOUNTING is disabled, the IRQ time is accounted for
in the admission control (because it ends up in the task's runtime),
but when CONFIG_IRQ_TIME_ACCOUNTING is enabled the IRQ time is not
accounted for in the admission test (the IRQ handler becomes some sort
of entity with a higher priority than -deadline tasks, on which no
accounting or enforcement is performed).

> And yes, at some point RT workloads need to be aware of the jitter
> injected by things like IRQs and such. But I believe the rationale was
> that for soft real-time workloads this current semantic was 'easier'
> because we get to ignore IRQ overhead for workload estimation etc.
>
> What we could maybe do is track runtime in both rq_clock_task() and
> rq_clock() and detect where the rq_clock based one exceeds the period
> and then push out the deadline (and add runtime).
>
> Maybe something along such lines; does that make sense?

Uhm... I have to study and test your patch... I'll comment on this
later.

Thanks,
Luca

>

Daniel Bristot de Oliveira

unread,

Nov 2, 2018, 6:00:39 AM11/2/18

to Juri Lelli, luca abeni, Peter Zijlstra, Thomas Gleixner, Juri Lelli, syzbot, Borislav Petkov, H. Peter Anvin, LKML, mi...@redhat.com, nst...@suse.de, syzkall...@googlegroups.com, hen...@austad.us, Tommaso Cucinotta, Claudio Scordino

On 11/1/18 6:55 AM, Juri Lelli wrote:
>> I meant, I am not against the/a fix, i just think that... it is more complicated
>> that it seems.
>>
>> For example: Let's assume that we have a non-rt bad thread A in CPU 0 generating
>> IPIs because of static key update, and a good dl thread B in the CPU 1.
>>
>> In this case, the thread B could run less than what was reserved for it, but it
>> was not causing the interrupts. It is not fair to put a penalty in the thread B.
>>
>> The same is valid for a dl thread running in the same CPU that is receiving a
>> lot of network packets to another application, and other legit cases.
>>
>> In the end, if we want to avoid non-rt threads starving, we need to prioritize
>> them some time, but in this case, we return to the DL server for non-rt threads.
>>
>> Thoughts?
> And I see your point. :-)
>
> I'd also add (maybe you mentioned this as well) that it seems the same
> could happen with RT throttling safety measure, as we are using
> clock_task there as well to account runtime and throttle stuff.

Yes! The same problem can happen with rt scheduler as well! I saw this problem
first with the rt throttling mechanism when I was trying to make it work in the
microseconds granularity (it is only enforced in the schedule tick, so it is in
an ms granularity in practice). After using hr timers to do the enforcement in
the microseconds granularity, I was trying to let just fewer us for the non-rt.
But as the IRQ runtime was higher than these fewer us, the rt_rq was never
throttled. It is the same/similar behavior we see here.

As we think in the rt throttling as "avoiding rt workload to consume more than
rt_runtime/rt_period", and considering that IRQs are a level of task with a
fixed priority higher than all the real-time related schedulers, i.e., deadline
and rt, we can safely argue that we can consider the IRQ time into the pool of
rt workload and account it in the rt_runtime. The easiest way to do it is to use
the rq_clock() in the measurement. I agree.

The point is that the CBS has a dual goal: it avoids a task running for more
than its runtime (a throttling behavior), but it also is used as a guarantee of
runtime for the case in which the task behaves, and the system is not
overloaded. Considering we can have more load than we can schedule in a
multiprocessor - but that is another story.

The the obvious reasoning here is: Ok boy, but the system IS overloaded in this
case, we have a RCU stall! And that is true if you look at the processor
starving RCU. But if the system has mode than one CPU, it could have CPU time
available in another CPU. So, we could just move the dl task from one CPU to
another.

Btw, that is another point. We have the AC with the sum of the utilization of
all CPUs. But we do no enforcement for per-cpu utilization. If one set a single
thread with runtime=deadline=period (in a system with more than one CPU), and
run in a busy-loop, we will eventually have an RCU stall as well (I just did on
my box, I got a soft lockup). I know this is a different problem. But, maybe,
there is a general solution for both issues:

For instance, if the sum of the execution time of all "task" with priority
higher than the OTHER class (rt, dl, stop_machine, IRQs, NMIs, Hypervisor?) in a
CPU is higher than rt_runtime in the rt_period, we need to avoid what is
"avoidable" by trying to move rt and dl threads away from that CPU. Another
possibility is to bump the priority of the OTHER class (and we are back to the
DL server).

- Dude, would not be easy just changing the CBS?

Yeah, but by changing the CBS, we may end up breaking the algorithms/properties
that rely on CBS... like GRUB, user-space/kernel-space synchronization...

> OTOH, when something like you describe happens, guarantees are probably
> already out of the window and we should just do our best to at least
> keep the system "working"? (maybe only to warn the user that something
> bad has happened)

Btw, don't get me wrong, I am not against changing CBS: I am just trying to
raise other viewpoints to avoid touching in the base of the DL scheduler, and
avoid punishing a thread that behaves well.

Anyway, notifying that dl+rt+IRQ time is higher than the rt_runtime is another
good thing to do as well. We will be notified anyway, either by RCU or
softlockup... but they are side effects warning. By notifying that we have an
overload of rt or higher workload we will be pointing to the cause.

Thoughts?

-- Daniel

Juri Lelli

unread,

Nov 5, 2018, 5:55:45 AM11/5/18

to Daniel Bristot de Oliveira, luca abeni, Peter Zijlstra, Thomas Gleixner, Juri Lelli, syzbot, Borislav Petkov, H. Peter Anvin, LKML, mi...@redhat.com, nst...@suse.de, syzkall...@googlegroups.com, hen...@austad.us, Tommaso Cucinotta, Claudio Scordino

Mmm, only that in this particular case I believe IRQ load will move
together with the migrating task and problem won't really be solved. :-/

> Btw, that is another point. We have the AC with the sum of the utilization of
> all CPUs. But we do no enforcement for per-cpu utilization. If one set a single
> thread with runtime=deadline=period (in a system with more than one CPU), and
> run in a busy-loop, we will eventually have an RCU stall as well (I just did on
> my box, I got a soft lockup). I know this is a different problem. But, maybe,
> there is a general solution for both issues:

This is true. However, the single 100% bandwidth task problem can be
solved by limiting the maximum bandwidth a single entity can ask for. Of
course we can get again to a similar sort of problem if multiple
entities are then co-scheduled on the same CPU, for which we would need
(residual) capacity awareness. This should happen less likely though, as
there is a general tendency to spread tasks.

> For instance, if the sum of the execution time of all "task" with priority
> higher than the OTHER class (rt, dl, stop_machine, IRQs, NMIs, Hypervisor?) in a
> CPU is higher than rt_runtime in the rt_period, we need to avoid what is
> "avoidable" by trying to move rt and dl threads away from that CPU. Another
> possibility is to bump the priority of the OTHER class (and we are back to the
> DL server).

Kind of weird though having to migrate RT (everything higher than OTHER)
only to make some room for non-RT stuff. Also because one can introduce
unwanted side effects on high prio workloads (cache related overheads,
etc.). OTHER has also already have some knowledge about higher prio
activities (rt,dl,irq PELT). So this seems to really leave us with
affined tasks, of all priorities and kinds (real vs. irq).

>
> - Dude, would not be easy just changing the CBS?
>
> Yeah, but by changing the CBS, we may end up breaking the algorithms/properties
> that rely on CBS... like GRUB, user-space/kernel-space synchronization...
>
> > OTOH, when something like you describe happens, guarantees are probably
> > already out of the window and we should just do our best to at least
> > keep the system "working"? (maybe only to warn the user that something
> > bad has happened)
>
> Btw, don't get me wrong, I am not against changing CBS: I am just trying to
> raise other viewpoints to avoid touching in the base of the DL scheduler, and
> avoid punishing a thread that behaves well.
>
> Anyway, notifying that dl+rt+IRQ time is higher than the rt_runtime is another
> good thing to do as well. We will be notified anyway, either by RCU or
> softlockup... but they are side effects warning. By notifying that we have an
> overload of rt or higher workload we will be pointing to the cause.

Right. It doesn't solve the problem, but I guess it could help debugging.

Juri Lelli

unread,

Nov 6, 2018, 6:44:58 AM11/6/18

to Peter Zijlstra, luca abeni, Thomas Gleixner, Juri Lelli, syzbot, Borislav Petkov, H. Peter Anvin, LKML, mi...@redhat.com, nst...@suse.de, syzkall...@googlegroups.com, hen...@austad.us, Tommaso Cucinotta, Claudio Scordino, Daniel Bristot de Oliveira

So, even though I consider Luca and Daniel's points/concerns more then
valid, I spent I bit of time playing with this idea and ended up with
what follows.

I'm adding deadline adjustment when we realize that there is a
difference between delta_exec and delta_wall (IRQ load) and walltime is
bigger then configured dl_period. The adjustment to rq_clock (even
though it of course de-synchs user/kernel deadlines) seems needed, as w/o
this kind of thing task doesn't get throttled for too long, a dl_period
at max I guess, after having been running for quite a while.

Another change is that I believe we want to keep runtime always below
dl_runtime while pushing deadline away.

This doesn't seem to modify behavior for non misbehaving and non
affected by irq load tasks and should fix stalls and starvation of lower
prio activities problem.

Does it make any sense? :-)

--->8---
include/linux/sched.h | 3 ++
kernel/sched/deadline.c | 78 ++++++++++++++++++++++++++++++-----------
2 files changed, 60 insertions(+), 21 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 977cb57d7bc9..e5aaf6deab7e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -521,6 +521,9 @@ struct sched_dl_entity {

u64 deadline; /* Absolute deadline for this instance */
unsigned int flags; /* Specifying the scheduler behaviour */

+ u64 wallstamp;
+ s64 walltime;
+
/*
* Some bool flags:
*
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c

index 93da990ba458..061562589282 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -696,16 +696,8 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se,

if (dl_se->dl_yielded && dl_se->runtime > 0)
dl_se->runtime = 0;

- /*
- * We keep moving the deadline away until we get some
- * available runtime for the entity. This ensures correct
- * handling of situations where the runtime overrun is
- * arbitrary large.
- */
- while (dl_se->runtime <= 0) {
- dl_se->deadline += pi_se->dl_period;
- dl_se->runtime += pi_se->dl_runtime;
- }

+ /* XXX have to deal with PI */
+ dl_se->walltime = 0;

/*
@@ -917,7 +909,7 @@ static int start_dl_timer(struct task_struct *p)
* that it is actually coming from rq->clock and not from
* hrtimer's time base reading.
*/
- act = ns_to_ktime(dl_next_period(dl_se));
+ act = ns_to_ktime(dl_se->deadline - dl_se->dl_period);
now = hrtimer_cb_get_time(timer);
delta = ktime_to_ns(now) - rq_clock(rq);
trace_printk("%s: cpu:%d pid:%d dl_runtime:%llu dl_deadline:%llu dl_period:%llu runtime:%lld deadline:%llu rq_clock:%llu rq_clock_task:%llu act:%lld now:%lld delta:%lld",
@@ -1174,9 +1166,10 @@ static void update_curr_dl(struct rq *rq)

{
struct task_struct *curr = rq->curr;
struct sched_dl_entity *dl_se = &curr->dl;
- u64 delta_exec, scaled_delta_exec;
+ u64 delta_exec, scaled_delta_exec, delta_wall;
int cpu = cpu_of(rq);
- u64 now;
+ u64 now, wall;

+ bool adjusted = false;

if (!dl_task(curr) || !on_dl_rq(dl_se))
return;
@@ -1191,12 +1184,28 @@ static void update_curr_dl(struct rq *rq)
*/
now = rq_clock_task(rq);
delta_exec = now - curr->se.exec_start;
+
+ wall = rq_clock(rq);

+ delta_wall = wall - dl_se->wallstamp;
+

if (unlikely((s64)delta_exec <= 0)) {
if (unlikely(dl_se->dl_yielded))
goto throttle;
return;

}

+ if (delta_wall > 0) {
+ dl_se->walltime += delta_wall;
+ dl_se->wallstamp = wall;
+ }
+

schedstat_set(curr->se.statistics.exec_max,
max(curr->se.statistics.exec_max, delta_exec));

@@ -1230,22 +1239,47 @@ static void update_curr_dl(struct rq *rq)

dl_se->runtime -= scaled_delta_exec;

+ if (dl_runtime_exceeded(dl_se) ||
+ dl_se->dl_yielded ||
+ unlikely(dl_se->walltime > dl_se->dl_period)) {
throttle:
- if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
dl_se->dl_throttled = 1;

/* If requested, inform the user about runtime overruns. */

if (dl_runtime_exceeded(dl_se) &&
(dl_se->flags & SCHED_FLAG_DL_OVERRUN))
dl_se->dl_overrun = 1;
-
__dequeue_task_dl(rq, curr, 0);

+ /*
+ * We keep moving the deadline away until we get some available
+ * runtime for the entity. This ensures correct handling of
+ * situations where the runtime overrun is arbitrary large.
+ */

+ while (dl_se->runtime <= 0 || dl_se->walltime > (s64)dl_se->dl_period) {
+
+ if (delta_exec != delta_wall &&
+ dl_se->walltime > (s64)dl_se->dl_period && !adjusted) {
+ dl_se->deadline = wall;
+ adjusted = true;
+ }
+

+ dl_se->deadline += dl_se->dl_period;
+

+ if (dl_se->runtime <= 0)

+ dl_se->runtime += dl_se->dl_runtime;
+
+ dl_se->walltime -= dl_se->dl_period;
+ }

+
+ WARN_ON_ONCE(dl_se->runtime > dl_se->dl_runtime);
+
if (unlikely(dl_se->dl_boosted || !start_dl_timer(curr)))
enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);

@@ -1783,9 +1817,10 @@ pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)

p = dl_task_of(dl_se);
p->se.exec_start = rq_clock_task(rq);
+ dl_se->wallstamp = rq_clock(rq);

/* Running task will never be pushed. */
- dequeue_pushable_dl_task(rq, p);
+ dequeue_pushable_dl_task(rq, p);

if (hrtick_enabled(rq))
start_hrtick_dl(rq, p);

@@ -1843,6 +1878,7 @@ static void set_curr_task_dl(struct rq *rq)

struct task_struct *p = rq->curr;

p->se.exec_start = rq_clock_task(rq);

+ p->dl.wallstamp = rq_clock(rq);

Daniel Bristot de Oliveira

unread,

Nov 7, 2018, 5:12:20 AM11/7/18

to Juri Lelli, luca abeni, Peter Zijlstra, Thomas Gleixner, Juri Lelli, syzbot, Borislav Petkov, H. Peter Anvin, LKML, mi...@redhat.com, nst...@suse.de, syzkall...@googlegroups.com, hen...@austad.us, Tommaso Cucinotta, Claudio Scordino

The thread would move to another CPU. Allowing the (pinned) non-rt tasks to have
time to run. Later, the bad dl task would be able to return, to avoid the
problem of the deadline task doing the wrong thing in the other CPU. In this
way, non-rt threads would be able to run, avoiding RCU stall/softlockup.

That is the idea of the rt throttling.

>> Btw, that is another point. We have the AC with the sum of the utilization of
>> all CPUs. But we do no enforcement for per-cpu utilization. If one set a single
>> thread with runtime=deadline=period (in a system with more than one CPU), and
>> run in a busy-loop, we will eventually have an RCU stall as well (I just did on
>> my box, I got a soft lockup). I know this is a different problem. But, maybe,
>> there is a general solution for both issues:
>
> This is true. However, the single 100% bandwidth task problem can be
> solved by limiting the maximum bandwidth a single entity can ask for. Of
> course we can get again to a similar sort of problem if multiple
> entities are then co-scheduled on the same CPU, for which we would need
> (residual) capacity awareness. This should happen less likely though, as
> there is a general tendency to spread tasks.

Limiting the U of a task does not solve the problem. Moreover, a U = 1 task is
not exactly a problem, if the proper way to avoid the starvation of non-rt
thread exists. A U = 1 task can exist without causing damage by moving the
thread between two CPUs, for instance. I know this is very controversial, but
there are many use cases for it. For instance, NFV polling to NIC,
high-frequency trading -rt-users use polling mechanism as well (the discussion
of whether it is right or wrong is another chapter). In practice, these cases
are a significant part of -rt-users.

Still, the problem can happen even if you limit the U per task. You just need
two U = 0.5 tasks to fulfill the CPU. The global scheduler tends to spread the
load (because it migrates the threads very often), I agree. But the problem can
happen, and it will, sooner or later it always happens.

>> For instance, if the sum of the execution time of all "task" with priority
>> higher than the OTHER class (rt, dl, stop_machine, IRQs, NMIs, Hypervisor?) in a
>> CPU is higher than rt_runtime in the rt_period, we need to avoid what is
>> "avoidable" by trying to move rt and dl threads away from that CPU. Another
>> possibility is to bump the priority of the OTHER class (and we are back to the
>> DL server).
>
> Kind of weird though having to migrate RT (everything higher than OTHER)
> only to make some room for non-RT stuff.

It is not. That is the idea of the RT throttling. The rq is throttle, to avoid
starving (per-cpu) non-rt threads that need to run. One can prevent migrating rt
threads, but this is not correct for a global scheduler, as it would break the
working conserving properties.

> Also because one can introduce
> unwanted side effects on high prio workloads (cache related overheads,
> etc.).

Considering one thread will have to migrate once per rt_period (1ms by default),
only if an rq becomes overloaded, we can say that this is barely insignificant,
given the amount of migrations we have in the global scheduler. Well, global has
a tendency to spread tasks by migrating them anyway.

> OTHER has also already have some knowledge about higher prio
> activities (rt,dl,irq PELT). So this seems to really leave us with
> affined tasks, of all priorities and kinds (real vs. irq).

I am not an expert in PELT, how does PELT deal with RT throttling?

>>
>> - Dude, would not be easy just changing the CBS?
>>
>> Yeah, but by changing the CBS, we may end up breaking the algorithms/properties
>> that rely on CBS... like GRUB, user-space/kernel-space synchronization...
>>
>>> OTOH, when something like you describe happens, guarantees are probably
>>> already out of the window and we should just do our best to at least
>>> keep the system "working"? (maybe only to warn the user that something
>>> bad has happened)
>>
>> Btw, don't get me wrong, I am not against changing CBS: I am just trying to
>> raise other viewpoints to avoid touching in the base of the DL scheduler, and
>> avoid punishing a thread that behaves well.
>>
>> Anyway, notifying that dl+rt+IRQ time is higher than the rt_runtime is another
>> good thing to do as well. We will be notified anyway, either by RCU or
>> softlockup... but they are side effects warning. By notifying that we have an
>> overload of rt or higher workload we will be pointing to the cause.
>
> Right. It doesn't solve the problem, but I guess it could help debugging.

I did not say it was a solution.

I was having a look at the reproducer, and... well, a good part of the problem
can be bounded in the other part of the equation. The reproducer enables perf
sampling, and it is known that perf sampling can cause problems, and that is why
we have limits for it.

The limits point to a 25 percent of CPU time for perf sampling... considering
throttling imprecision because of HZ... we can clearly see that the system is
with > 100% of CPU usage for dl + IRQ.

Again: don't get me wrong, I am aware and agree that there is another problem,
about the "readjustment of the period/runtime considering the drift in the
execution of the task caused by IRQs." What I am pointing here is that there are
more general problems w.r.t. the possibility of causing starvation of per-cpu
housekeeping threads needed by the system (for instance, RCU).

There are many open issues w.r.t the throttling mechanism, for instance:

1) We need to take the imprecision in the account of runtime in the AC.
2) The throttling needs to be designed in such a way that we try not to starve
non-rt threads in a CPU/rq - rather than the system (accounting per CPU).
3) We need to consider the IRQ workload as well to avoid RT+DL+IRQ to use all
the CPU time.

... among other things.

-- Daniel

Reply all

Reply to author

Forward

INFO: rcu detected stall in do_idle

syzbot

Thomas Gleixner

Peter Zijlstra

Juri Lelli

Thomas Gleixner

Juri Lelli

Juri Lelli

Peter Zijlstra

Juri Lelli

luca abeni

luca abeni

luca abeni

Juri Lelli

luca abeni

Juri Lelli

luca abeni

Peter Zijlstra

Peter Zijlstra

luca abeni

Juri Lelli

Dmitry Vyukov

Juri Lelli

Peter Zijlstra

Juri Lelli

Daniel Bristot de Oliveira

Juri Lelli

Peter Zijlstra

Peter Zijlstra

Daniel Bristot de Oliveira

Juri Lelli

luca abeni

Daniel Bristot de Oliveira

Juri Lelli

Juri Lelli

Daniel Bristot de Oliveira