Re: [PATCH rcu 04/11] kernel/smp: Provide boot-time timeout for CSD lock diagnostics

73 views
Skip to first unread message

Dmitry Vyukov

unread,
Apr 19, 2022, 10:11:54 AM4/19/22
to pau...@kernel.org, Hillf Danton, linux-...@vger.kernel.org, syzkaller
On Tue, 19 Apr 2022 at 15:46, Paul E. McKenney <pau...@kernel.org> wrote:
>
> On Tue, Apr 19, 2022 at 04:56:07PM +0800, Hillf Danton wrote:
> > On Mon, 18 Apr 2022 15:53:52 -0700 Paul E. McKenney wrote:
> > > Debugging of problems involving insanely long-running SMI handlers
> > > proceeds better if the CSD-lock timeout can be adjusted. This commit
> > > therefore provides a new smp.csd_lock_timeout kernel boot parameter
> > > that specifies the timeout in milliseconds. The default remains at the
> > > previously hard-coded value of five seconds.
> > >
> > > Cc: Rik van Riel <ri...@surriel.com>
> > > Cc: Peter Zijlstra <pet...@infradead.org>
> > > Cc: Ingo Molnar <mi...@kernel.org>
> > > Cc: Thomas Gleixner <tg...@linutronix.de>
> > > Cc: Sebastian Andrzej Siewior <big...@linutronix.de>
> > > Cc: Juergen Gross <jgr...@suse.com>
> > > Signed-off-by: Paul E. McKenney <pau...@kernel.org>
> > > ---
> > > Documentation/admin-guide/kernel-parameters.txt | 11 +++++++++++
> > > kernel/smp.c | 7 +++++--
> > > 2 files changed, 16 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > > index 3f1cc5e317ed..645c4c001b16 100644
> > > --- a/Documentation/admin-guide/kernel-parameters.txt
> > > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > > @@ -5377,6 +5377,17 @@
> > > smart2= [HW]
> > > Format: <io1>[,<io2>[,...,<io8>]]
> > >
> > > + smp.csd_lock_timeout= [KNL]
> > > + Specify the period of time in milliseconds
> > > + that smp_call_function() and friends will wait
> > > + for a CPU to release the CSD lock. This is
> > > + useful when diagnosing bugs involving CPUs
> > > + disabling interrupts for extended periods
> > > + of time. Defaults to 5,000 milliseconds, and
> > > + setting a value of zero disables this feature.
> > > + This feature may be more efficiently disabled
> > > + using the csdlock_debug- kernel parameter.
> > > +
> >
> > Can non-responsive CSD lock detected trigger syzbot (warning) report?
>
> If they enable it by building with CONFIG_CSD_LOCK_WAIT_DEBUG=y, yes.

+syzkaller mailing list

Currently we don't enable CONFIG_CSD_LOCK_WAIT_DEBUG in syzbot configs.
Is it a generally useful debugging feature recommended to be enabled
in kernel testing systems?
If we enabled it, we also need to figure out where it fits into the
timeout hierarchy and adjust smp.csd_lock_timeout:
https://github.com/google/syzkaller/blob/master/dashboard/config/linux/bits/x86_64.yml#L15-L40


> Thanx, Paul
>
> > Hillf
> > > smsc-ircc2.nopnp [HW] Don't use PNP to discover SMC devices
> > > smsc-ircc2.ircc_cfg= [HW] Device configuration I/O port
> > > smsc-ircc2.ircc_sir= [HW] SIR base I/O port
> > > diff --git a/kernel/smp.c b/kernel/smp.c
> > > index 01a7c1706a58..d82439bac401 100644
> > > --- a/kernel/smp.c
> > > +++ b/kernel/smp.c
> > > @@ -183,7 +183,9 @@ static DEFINE_PER_CPU(smp_call_func_t, cur_csd_func);
> > > static DEFINE_PER_CPU(void *, cur_csd_info);
> > > static DEFINE_PER_CPU(struct cfd_seq_local, cfd_seq_local);
> > >
> > > -#define CSD_LOCK_TIMEOUT (5ULL * NSEC_PER_SEC)
> > > +static ulong csd_lock_timeout = 5000; /* CSD lock timeout in milliseconds. */
> > > +module_param(csd_lock_timeout, ulong, 0444);
> > > +
> > > static atomic_t csd_bug_count = ATOMIC_INIT(0);
> > > static u64 cfd_seq;
> > >
> > > @@ -329,6 +331,7 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 *
> > > u64 ts2, ts_delta;
> > > call_single_data_t *cpu_cur_csd;
> > > unsigned int flags = READ_ONCE(csd->node.u_flags);
> > > + unsigned long long csd_lock_timeout_ns = csd_lock_timeout * NSEC_PER_MSEC;
> > >
> > > if (!(flags & CSD_FLAG_LOCK)) {
> > > if (!unlikely(*bug_id))
> > > @@ -341,7 +344,7 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 *
> > >
> > > ts2 = sched_clock();
> > > ts_delta = ts2 - *ts1;
> > > - if (likely(ts_delta <= CSD_LOCK_TIMEOUT))
> > > + if (likely(ts_delta <= csd_lock_timeout_ns || csd_lock_timeout_ns <= 0))
> > > return false;
> > >
> > > firsttime = !*bug_id;
> > > --
> > > 2.31.1.189.g2e36527f23

Hillf Danton

unread,
Apr 19, 2022, 7:18:35 PM4/19/22
to Dmitry Vyukov, linux-...@vger.kernel.org, Paul E. McKenney, syzkaller
On Tue, 19 Apr 2022 09:49:08 -0700 Paul E. McKenney wrote:
> With the default value for smp.csd_lock_timeout, it detects CPUs that have
> had interrupts disabled for more than five seconds, which can be useful
> for detecting what would otherwise be silent response-time failures.

The five seconds take precedence over the 143s in reports [1] IMO.
The shorter timeout helps select reproducers which in turn help find answer
to questions there.

Hillf

[1] https://lore.kernel.org/lkml/20220321210140.GK4285@paulmck-ThinkPad-P17-Gen-1/

>
> > If we enabled it, we also need to figure out where it fits into the
> > timeout hierarchy and adjust smp.csd_lock_timeout:
> > https://github.com/google/syzkaller/blob/master/dashboard/config/linux/bits/x86_64.yml#L15-L40
>
> On this, I must defer to you guys. I can say that the larger the value
> that you choose for smp.csd_lock_timeout, the nearer the beginnning of
> that group it should go, but you guys knew that already. ;-)
>
> Thanx, Paul

Dmitry Vyukov

unread,
Apr 20, 2022, 8:17:33 AM4/20/22
to Hillf Danton, linux-...@vger.kernel.org, Paul E. McKenney, syzkaller
I've sent https://github.com/google/syzkaller/pull/3090 to enable the config.
5 seconds won't be reliable, I've set it to 100s to match CPU stall timeout.



> Hillf
>
> [1] https://lore.kernel.org/lkml/20220321210140.GK4285@paulmck-ThinkPad-P17-Gen-1/
>
> >
> > > If we enabled it, we also need to figure out where it fits into the
> > > timeout hierarchy and adjust smp.csd_lock_timeout:
> > > https://github.com/google/syzkaller/blob/master/dashboard/config/linux/bits/x86_64.yml#L15-L40
> >
> > On this, I must defer to you guys. I can say that the larger the value
> > that you choose for smp.csd_lock_timeout, the nearer the beginnning of
> > that group it should go, but you guys knew that already. ;-)
> >
> > Thanx, Paul
>
> --
> You received this message because you are subscribed to the Google Groups "syzkaller" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to syzkaller+...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/syzkaller/20220419231820.2089-1-hdanton%40sina.com.

Dmitry Vyukov

unread,
Apr 20, 2022, 10:00:22 AM4/20/22
to pau...@kernel.org, Hillf Danton, linux-...@vger.kernel.org, syzkaller
> Thank you, Dmitry!
>
> Is there something I should be doing to enable the RCU CPU stall timeout
> to be reduced?

No.
We just bumped it to a high enough value to not cause false positives.
If it's stalled, it's stalled.


> Thanx, Paul

Paul E. McKenney

unread,
Apr 21, 2022, 4:27:46 AM4/21/22
to Dmitry Vyukov, Hillf Danton, linux-...@vger.kernel.org, syzkaller
On Tue, Apr 19, 2022 at 04:11:36PM +0200, Dmitry Vyukov wrote:
With the default value for smp.csd_lock_timeout, it detects CPUs that have
had interrupts disabled for more than five seconds, which can be useful
for detecting what would otherwise be silent response-time failures.

> If we enabled it, we also need to figure out where it fits into the
> timeout hierarchy and adjust smp.csd_lock_timeout:
> https://github.com/google/syzkaller/blob/master/dashboard/config/linux/bits/x86_64.yml#L15-L40

On this, I must defer to you guys. I can say that the larger the value
that you choose for smp.csd_lock_timeout, the nearer the beginnning of
that group it should go, but you guys knew that already. ;-)

Paul E. McKenney

unread,
Apr 21, 2022, 4:28:15 AM4/21/22
to Dmitry Vyukov, Hillf Danton, linux-...@vger.kernel.org, syzkaller
Thank you, Dmitry!

Is there something I should be doing to enable the RCU CPU stall timeout
to be reduced?

Thanx, Paul

Paul E. McKenney

unread,
Apr 21, 2022, 4:28:19 AM4/21/22
to Dmitry Vyukov, Hillf Danton, linux-...@vger.kernel.org, syzkaller
Ah, your goal is to find perma-stalls. Got it, thank you!
Reply all
Reply to author
Forward
0 new messages