[PATCH v2 0/5] workqueue: Detect stalled in-flight workers

Breno Leitao

unread,

Mar 5, 2026, 11:16:05 AMMar 5

to Tejun Heo, Lai Jiangshan, Andrew Morton, linux-...@vger.kernel.org, Omar Sandoval, Song Liu, Danielle Costantino, kasa...@googlegroups.com, Petr Mladek, kerne...@meta.com, Breno Leitao

There is a blind spot exists in the work queue stall detecetor (aka
show_cpu_pool_hog()). It only prints workers whose task_is_running() is
true, so a busy worker that is sleeping (e.g. wait_event_idle())
produces an empty backtrace section even though it is the cause of the
stall.

Additionally, when the watchdog does report stalled pools, the output
doesn't show how long each in-flight work item has been running, making
it harder to identify which specific worker is stuck.

Example of the sample code:

BUG: workqueue lockup - pool cpus=4 node=0 flags=0x0 nice=0 stuck for 132s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x100
pwq 18: cpus=4 node=0 flags=0x0 nice=0 active=4 refcnt=5
in-flight: 178:stall_work1_fn [wq_stall]
pending: stall_work2_fn [wq_stall], free_obj_work, psi_avgs_work
...
Showing backtraces of running workers in stalled
CPU-bound worker pools:
<nothing here>

I see it happening on real machines, causing some stalls that doesn't
have any backtrace. This is one of the code path:

1) kfence executes toggle_allocation_gate() as a delayed workqueue
item (kfence_timer) on the system WQ.

2) toggle_allocation_gate() enables a static key, which IPIs every
CPU to patch code:
static_branch_enable(&kfence_allocation_key);

3) toggle_allocation_gate() then sleeps in TASK_IDLE waiting for a
kfence allocation to occur:
wait_event_idle(allocation_wait,
atomic_read(&kfence_allocation_gate) > 0 || ...);

This can last indefinitely if no allocation goes through the
kfence path (or IPIing all the CPUs take longer, which is common on
platforms that do not have NMI).

The worker remains in the pool's busy_hash
(in-flight) but is no longer task_is_running().

4) The workqueue watchdog detects the stall and calls
show_cpu_pool_hog(), which only prints backtraces for workers
that are actively running on CPU:

static void show_cpu_pool_hog(struct worker_pool *pool) {
...
if (task_is_running(worker->task))
sched_show_task(worker->task);
}

5) Nothing is printed because the offending worker is in TASK_IDLE
state. The output shows "Showing backtraces of running workers in
stalled CPU-bound worker pools:" followed by nothing, effectively
hiding the actual culprit.

Given I am using this detector a lot, I am also proposing additional
improvements here as well.

This series addresses these issues:

Patch 1 fixes a minor semantic inconsistency where pool flags were
checked against a workqueue-level constant (WQ_BH instead of POOL_BH).
No behavioral change since both constants have the same value.

Patch 2 renames pool->watchdog_ts to pool->last_progress_ts to better
describe what the timestamp actually tracks.

Patch 3 adds a current_start timestamp to struct worker, recording when
a work item began executing. This is printed in show_pwq() as elapsed
wall-clock time (e.g., "in-flight: 165:stall_work_fn [wq_stall] for
100s"), giving immediate visibility into how long each worker has been
busy.

Patch 4 removes the task_is_running() filter from show_cpu_pool_hog()
so that every in-flight worker in the pool's busy_hash is dumped. This
catches workers that are busy but sleeping or blocked, which were
previously invisible in the watchdog output.

With this series applied, stall output shows the backtrace for all
tasks, and for how long the work is stall. Example:

BUG: workqueue lockup - pool cpus=14 node=0 flags=0x0 nice=0 stuck for 42!
Showing busy workqueues and worker pools:
workqueue events: flags=0x100
pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=1 refcnt=2
pending: vmstat_shepherd
pwq 58: cpus=14 node=0 flags=0x0 nice=0 active=4 refcnt=5
in-flight: 184:stall_work1_fn [wq_stall] for 39s
...
Showing backtraces of busy workers in stalled CPU-bound worker pools:
pool 58:
task:kworker/14:1 state:I stack:0 pid:184 tgid:184 ppid:2 task_flags:0x4208040 flags:0x00080000
Call Trace:
<TASK>
__schedule+0x1521/0x5360
schedule+0x165/0x350
stall_work1_fn+0x17f/0x250 [wq_stall]
...

---
Changes in v2:
- Drop the task_running() filter in show_cpu_pool_hog() instead of assuming a
work item cannot stay running forever.
- Add a sample code to exercise the stall detector
- Link to v1: https://patch.msgid.link/20260211-wqstall_star...@debian.org

---
Breno Leitao (5):
workqueue: Use POOL_BH instead of WQ_BH when checking pool flags
workqueue: Rename pool->watchdog_ts to pool->last_progress_ts
workqueue: Show in-flight work item duration in stall diagnostics
workqueue: Show all busy workers in stall diagnostics
workqueue: Add stall detector sample module

kernel/workqueue.c | 47 +++++++-------
kernel/workqueue_internal.h | 1 +
samples/workqueue/stall_detector/Makefile | 1 +
samples/workqueue/stall_detector/wq_stall.c | 98 +++++++++++++++++++++++++++++
4 files changed, 124 insertions(+), 23 deletions(-)
---
base-commit: c107785c7e8dbabd1c18301a1c362544b5786282
change-id: 20260210-wqstall_start-at-e7319a005ab4

Best regards,
--
Breno Leitao <lei...@debian.org>

Breno Leitao

unread,

Mar 5, 2026, 11:16:05 AMMar 5

to Tejun Heo, Lai Jiangshan, Andrew Morton, linux-...@vger.kernel.org, Omar Sandoval, Song Liu, Danielle Costantino, kasa...@googlegroups.com, Petr Mladek, kerne...@meta.com, Breno Leitao

pr_cont_worker_id() checks pool->flags against WQ_BH, which is a
workqueue-level flag (defined in workqueue.h). Pool flags use a
separate namespace with POOL_* constants (defined in workqueue.c).
The correct constant is POOL_BH. Both WQ_BH and POOL_BH are defined
as (1 << 0) so this has no behavioral impact, but it is semantically
wrong and inconsistent with every other pool-level BH check in the
file.

Fixes: 4cb1ef64609f ("workqueue: Implement BH workqueues to eventually replace tasklets")
Signed-off-by: Breno Leitao <lei...@debian.org>
---
kernel/workqueue.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index aeaec79bc09c4..1e5b6cb0fbda6 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -6274,7 +6274,7 @@ static void pr_cont_worker_id(struct worker *worker)
{
struct worker_pool *pool = worker->pool;

- if (pool->flags & WQ_BH)
+ if (pool->flags & POOL_BH)
pr_cont("bh%s",
pool->attrs->nice == HIGHPRI_NICE_LEVEL ? "-hi" : "");
else

--
2.47.3

Breno Leitao

unread,

Mar 5, 2026, 11:16:10 AMMar 5

to Tejun Heo, Lai Jiangshan, Andrew Morton, linux-...@vger.kernel.org, Omar Sandoval, Song Liu, Danielle Costantino, kasa...@googlegroups.com, Petr Mladek, kerne...@meta.com, Breno Leitao

The watchdog_ts name doesn't convey what the timestamp actually tracks.
This field tracks the last time a workqueue got progress.

Rename it to last_progress_ts to make it clear that it records when the
pool last made forward progress (started processing new work items).

No functional change.

Signed-off-by: Breno Leitao <lei...@debian.org>
---

kernel/workqueue.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 1e5b6cb0fbda6..687d5c55c6174 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -190,7 +190,7 @@ struct worker_pool {
int id; /* I: pool ID */
unsigned int flags; /* L: flags */

- unsigned long watchdog_ts; /* L: watchdog timestamp */
+ unsigned long last_progress_ts; /* L: last forward progress timestamp */
bool cpu_stall; /* WD: stalled cpu bound pool */

/*
@@ -1697,7 +1697,7 @@ static void __pwq_activate_work(struct pool_workqueue *pwq,
WARN_ON_ONCE(!(*wdb & WORK_STRUCT_INACTIVE));
trace_workqueue_activate_work(work);
if (list_empty(&pwq->pool->worklist))
- pwq->pool->watchdog_ts = jiffies;
+ pwq->pool->last_progress_ts = jiffies;
move_linked_works(work, &pwq->pool->worklist, NULL);
__clear_bit(WORK_STRUCT_INACTIVE_BIT, wdb);
}
@@ -2348,7 +2348,7 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
*/
if (list_empty(&pwq->inactive_works) && pwq_tryinc_nr_active(pwq, false)) {
if (list_empty(&pool->worklist))
- pool->watchdog_ts = jiffies;
+ pool->last_progress_ts = jiffies;

trace_workqueue_activate_work(work);
insert_work(pwq, work, &pool->worklist, work_flags);
@@ -3352,7 +3352,7 @@ static void process_scheduled_works(struct worker *worker)
while ((work = list_first_entry_or_null(&worker->scheduled,
struct work_struct, entry))) {
if (first) {
- worker->pool->watchdog_ts = jiffies;
+ worker->pool->last_progress_ts = jiffies;
first = false;
}
process_one_work(worker, work);
@@ -4850,7 +4850,7 @@ static int init_worker_pool(struct worker_pool *pool)
pool->cpu = -1;
pool->node = NUMA_NO_NODE;
pool->flags |= POOL_DISASSOCIATED;
- pool->watchdog_ts = jiffies;
+ pool->last_progress_ts = jiffies;
INIT_LIST_HEAD(&pool->worklist);
INIT_LIST_HEAD(&pool->idle_list);
hash_init(pool->busy_hash);
@@ -6462,7 +6462,7 @@ static void show_one_worker_pool(struct worker_pool *pool)

/* How long the first pending work is waiting for a worker. */
if (!list_empty(&pool->worklist))
- hung = jiffies_to_msecs(jiffies - pool->watchdog_ts) / 1000;
+ hung = jiffies_to_msecs(jiffies - pool->last_progress_ts) / 1000;

/*
* Defer printing to avoid deadlocks in console drivers that
@@ -7691,7 +7691,7 @@ static void wq_watchdog_timer_fn(struct timer_list *unused)
touched = READ_ONCE(per_cpu(wq_watchdog_touched_cpu, pool->cpu));
else
touched = READ_ONCE(wq_watchdog_touched);
- pool_ts = READ_ONCE(pool->watchdog_ts);
+ pool_ts = READ_ONCE(pool->last_progress_ts);

if (time_after(pool_ts, touched))
ts = pool_ts;

--
2.47.3

Breno Leitao

unread,

Mar 5, 2026, 11:16:12 AMMar 5

to Tejun Heo, Lai Jiangshan, Andrew Morton, linux-...@vger.kernel.org, Omar Sandoval, Song Liu, Danielle Costantino, kasa...@googlegroups.com, Petr Mladek, kerne...@meta.com, Breno Leitao

When diagnosing workqueue stalls, knowing how long each in-flight work
item has been executing is valuable. Add a current_start timestamp
(jiffies) to struct worker, set it when a work item begins execution in
process_one_work(), and print the elapsed wall-clock time in show_pwq().

Unlike current_at (which tracks CPU runtime and resets on wakeup for
CPU-intensive detection), current_start is never reset because the
diagnostic cares about total wall-clock time including sleeps.

Before: in-flight: 165:stall_work_fn [wq_stall]
After: in-flight: 165:stall_work_fn [wq_stall] for 100s

Signed-off-by: Breno Leitao <lei...@debian.org>
---

kernel/workqueue.c | 3 +++
kernel/workqueue_internal.h | 1 +
2 files changed, 4 insertions(+)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 687d5c55c6174..56d8af13843f8 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3204,6 +3204,7 @@ __acquires(&pool->lock)
worker->current_pwq = pwq;
if (worker->task)
worker->current_at = worker->task->se.sum_exec_runtime;
+ worker->current_start = jiffies;
work_data = *work_data_bits(work);
worker->current_color = get_work_color(work_data);

@@ -6359,6 +6360,8 @@ static void show_pwq(struct pool_workqueue *pwq)
pr_cont(" %s", comma ? "," : "");
pr_cont_worker_id(worker);
pr_cont(":%ps", worker->current_func);
+ pr_cont(" for %us",
+ jiffies_to_msecs(jiffies - worker->current_start) / 1000);
list_for_each_entry(work, &worker->scheduled, entry)
pr_cont_work(false, work, &pcws);
pr_cont_work_flush(comma, (work_func_t)-1L, &pcws);
diff --git a/kernel/workqueue_internal.h b/kernel/workqueue_internal.h
index f6275944ada77..8def1ddc5a1bf 100644
--- a/kernel/workqueue_internal.h
+++ b/kernel/workqueue_internal.h
@@ -32,6 +32,7 @@ struct worker {
work_func_t current_func; /* K: function */
struct pool_workqueue *current_pwq; /* K: pwq */
u64 current_at; /* K: runtime at start or last wakeup */
+ unsigned long current_start; /* K: start time of current work item */
unsigned int current_color; /* K: color */

int sleeping; /* S: is worker sleeping? */

--
2.47.3

Breno Leitao

unread,

Mar 5, 2026, 11:16:17 AMMar 5

to Tejun Heo, Lai Jiangshan, Andrew Morton, linux-...@vger.kernel.org, Omar Sandoval, Song Liu, Danielle Costantino, kasa...@googlegroups.com, Petr Mladek, kerne...@meta.com, Breno Leitao

show_cpu_pool_hog() only prints workers whose task is currently running
on the CPU (task_is_running()). This misses workers that are busy
processing a work item but are sleeping or blocked — for example, a
worker that clears PF_WQ_WORKER and enters wait_event_idle(). Such a
worker still occupies a pool slot and prevents progress, yet produces
an empty backtrace section in the watchdog output.

This is happening on real arm64 systems, where
toggle_allocation_gate() IPIs every single CPU in the machine (which
lacks NMI), causing workqueue stalls that show empty backtraces because
toggle_allocation_gate() is sleeping in wait_event_idle().

Remove the task_is_running() filter so every in-flight worker in the
pool's busy_hash is dumped. The busy_hash is protected by pool->lock,
which is already held.

Signed-off-by: Breno Leitao <lei...@debian.org>
---

kernel/workqueue.c | 28 +++++++++++++---------------
1 file changed, 13 insertions(+), 15 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 56d8af13843f8..09b9ad78d566c 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -7583,9 +7583,9 @@ MODULE_PARM_DESC(panic_on_stall_time, "Panic if stall exceeds this many seconds

/*
* Show workers that might prevent the processing of pending work items.
- * The only candidates are CPU-bound workers in the running state.
- * Pending work items should be handled by another idle worker
- * in all other situations.
+ * A busy worker that is not running on the CPU (e.g. sleeping in
+ * wait_event_idle() with PF_WQ_WORKER cleared) can stall the pool just as
+ * effectively as a CPU-bound one, so dump every in-flight worker.
*/

static void show_cpu_pool_hog(struct worker_pool *pool)
{

@@ -7596,19 +7596,17 @@ static void show_cpu_pool_hog(struct worker_pool *pool)
raw_spin_lock_irqsave(&pool->lock, irq_flags);

hash_for_each(pool->busy_hash, bkt, worker, hentry) {
- if (task_is_running(worker->task)) {
- /*
- * Defer printing to avoid deadlocks in console
- * drivers that queue work while holding locks
- * also taken in their write paths.
- */
- printk_deferred_enter();
+ /*
+ * Defer printing to avoid deadlocks in console
+ * drivers that queue work while holding locks
+ * also taken in their write paths.
+ */
+ printk_deferred_enter();

- pr_info("pool %d:\n", pool->id);
- sched_show_task(worker->task);
+ pr_info("pool %d:\n", pool->id);
+ sched_show_task(worker->task);

- printk_deferred_exit();
- }
+ printk_deferred_exit();
}

raw_spin_unlock_irqrestore(&pool->lock, irq_flags);
@@ -7619,7 +7617,7 @@ static void show_cpu_pools_hogs(void)
struct worker_pool *pool;
int pi;

- pr_info("Showing backtraces of running workers in stalled CPU-bound worker pools:\n");
+ pr_info("Showing backtraces of busy workers in stalled CPU-bound worker pools:\n");

rcu_read_lock();

--
2.47.3

Breno Leitao

unread,

Mar 5, 2026, 11:16:21 AMMar 5

to Tejun Heo, Lai Jiangshan, Andrew Morton, linux-...@vger.kernel.org, Omar Sandoval, Song Liu, Danielle Costantino, kasa...@googlegroups.com, Petr Mladek, kerne...@meta.com, Breno Leitao

Add a sample module under samples/workqueue/stall_detector/ that
reproduces a workqueue stall caused by PF_WQ_WORKER misuse. The
module queues two work items on the same per-CPU pool, then clears
PF_WQ_WORKER and sleeps in wait_event_idle(), hiding from the
concurrency manager and stalling the second work item indefinitely.

This is useful for testing the workqueue watchdog stall diagnostics.

Signed-off-by: Breno Leitao <lei...@debian.org>
---

samples/workqueue/stall_detector/Makefile | 1 +
samples/workqueue/stall_detector/wq_stall.c | 98 +++++++++++++++++++++++++++++

2 files changed, 99 insertions(+)

diff --git a/samples/workqueue/stall_detector/Makefile b/samples/workqueue/stall_detector/Makefile
new file mode 100644
index 0000000000000..8849e85e95bb9
--- /dev/null
+++ b/samples/workqueue/stall_detector/Makefile
@@ -0,0 +1 @@
+obj-m += wq_stall.o
diff --git a/samples/workqueue/stall_detector/wq_stall.c b/samples/workqueue/stall_detector/wq_stall.c
new file mode 100644
index 0000000000000..6f4a497b18814
--- /dev/null
+++ b/samples/workqueue/stall_detector/wq_stall.c
@@ -0,0 +1,98 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * wq_stall - Test module for the workqueue stall detector.
+ *
+ * Deliberately creates a workqueue stall so the watchdog fires and
+ * prints diagnostic output. Useful for verifying that the stall
+ * detector correctly identifies stuck workers and produces useful
+ * backtraces.
+ *
+ * The stall is triggered by clearing PF_WQ_WORKER before sleeping,
+ * which hides the worker from the concurrency manager. A second
+ * work item queued on the same pool then sits in the worklist with
+ * no worker available to process it.
+ *
+ * After ~30s the workqueue watchdog fires:
+ * BUG: workqueue lockup - pool cpus=N ...
+ *
+ * Build:
+ * make -C <kernel tree> M=samples/workqueue/stall_detector modules
+ *
+ * Copyright (c) 2026 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2026 Breno Leitao <lei...@debian.org>
+ */
+
+#include <linux/module.h>
+#include <linux/workqueue.h>
+#include <linux/wait.h>
+#include <linux/atomic.h>
+#include <linux/sched.h>
+
+static DECLARE_WAIT_QUEUE_HEAD(stall_wq_head);
+static atomic_t wake_condition = ATOMIC_INIT(0);
+static struct work_struct stall_work1;
+static struct work_struct stall_work2;
+
+static void stall_work2_fn(struct work_struct *work)
+{
+ pr_info("wq_stall: second work item finally ran\n");
+}
+
+static void stall_work1_fn(struct work_struct *work)
+{
+ pr_info("wq_stall: first work item running on cpu %d\n",
+ raw_smp_processor_id());
+
+ /*
+ * Queue second item while we're still counted as running
+ * (pool->nr_running > 0). Since schedule_work() on a per-CPU
+ * workqueue targets raw_smp_processor_id(), item 2 lands on the
+ * same pool. __queue_work -> kick_pool -> need_more_worker()
+ * sees nr_running > 0 and does NOT wake a new worker.
+ */
+ schedule_work(&stall_work2);
+
+ /*
+ * Hide from the workqueue concurrency manager. Without
+ * PF_WQ_WORKER, schedule() won't call wq_worker_sleeping(),
+ * so nr_running is never decremented and no replacement
+ * worker is created. Item 2 stays stuck in pool->worklist.
+ */
+ current->flags &= ~PF_WQ_WORKER;
+
+ pr_info("wq_stall: entering wait_event_idle (PF_WQ_WORKER cleared)\n");
+ pr_info("wq_stall: expect 'BUG: workqueue lockup' in ~30-60s\n");
+ wait_event_idle(stall_wq_head, atomic_read(&wake_condition) != 0);
+
+ /* Restore so process_one_work() cleanup works correctly */
+ current->flags |= PF_WQ_WORKER;
+ pr_info("wq_stall: woke up, PF_WQ_WORKER restored\n");
+}
+
+static int __init wq_stall_init(void)
+{
+ pr_info("wq_stall: loading\n");
+
+ INIT_WORK(&stall_work1, stall_work1_fn);
+ INIT_WORK(&stall_work2, stall_work2_fn);
+ schedule_work(&stall_work1);
+
+ return 0;
+}
+
+static void __exit wq_stall_exit(void)
+{
+ pr_info("wq_stall: unloading\n");
+ atomic_set(&wake_condition, 1);
+ wake_up(&stall_wq_head);
+ flush_work(&stall_work1);
+ flush_work(&stall_work2);
+ pr_info("wq_stall: all work flushed, module unloaded\n");
+}
+
+module_init(wq_stall_init);
+module_exit(wq_stall_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("Reproduce workqueue stall caused by PF_WQ_WORKER misuse");
+MODULE_AUTHOR("Breno Leitao <lei...@debian.org>");

--
2.47.3

Song Liu

unread,

Mar 5, 2026, 12:13:50 PMMar 5

to Breno Leitao, Tejun Heo, Lai Jiangshan, Andrew Morton, linux-...@vger.kernel.org, Omar Sandoval, Danielle Costantino, kasa...@googlegroups.com, Petr Mladek, kerne...@meta.com

On Thu, Mar 5, 2026 at 8:16 AM Breno Leitao <lei...@debian.org> wrote:
>
> pr_cont_worker_id() checks pool->flags against WQ_BH, which is a
> workqueue-level flag (defined in workqueue.h). Pool flags use a
> separate namespace with POOL_* constants (defined in workqueue.c).
> The correct constant is POOL_BH. Both WQ_BH and POOL_BH are defined
> as (1 << 0) so this has no behavioral impact, but it is semantically
> wrong and inconsistent with every other pool-level BH check in the
> file.
>
> Fixes: 4cb1ef64609f ("workqueue: Implement BH workqueues to eventually replace tasklets")
> Signed-off-by: Breno Leitao <lei...@debian.org>

Acked-by: Song Liu <so...@kernel.org>

Song Liu

unread,

Mar 5, 2026, 12:16:43 PMMar 5

to Breno Leitao, Tejun Heo, Lai Jiangshan, Andrew Morton, linux-...@vger.kernel.org, Omar Sandoval, Danielle Costantino, kasa...@googlegroups.com, Petr Mladek, kerne...@meta.com

On Thu, Mar 5, 2026 at 8:16 AM Breno Leitao <lei...@debian.org> wrote:
>

> The watchdog_ts name doesn't convey what the timestamp actually tracks.
> This field tracks the last time a workqueue got progress.
>
> Rename it to last_progress_ts to make it clear that it records when the
> pool last made forward progress (started processing new work items).
>
> No functional change.
>
> Signed-off-by: Breno Leitao <lei...@debian.org>

Acked-by: Song Liu <so...@kernel.org>

Song Liu

unread,

Mar 5, 2026, 12:17:31 PMMar 5

to Breno Leitao, Tejun Heo, Lai Jiangshan, Andrew Morton, linux-...@vger.kernel.org, Omar Sandoval, Danielle Costantino, kasa...@googlegroups.com, Petr Mladek, kerne...@meta.com

On Thu, Mar 5, 2026 at 8:16 AM Breno Leitao <lei...@debian.org> wrote:
>

> When diagnosing workqueue stalls, knowing how long each in-flight work
> item has been executing is valuable. Add a current_start timestamp
> (jiffies) to struct worker, set it when a work item begins execution in
> process_one_work(), and print the elapsed wall-clock time in show_pwq().
>
> Unlike current_at (which tracks CPU runtime and resets on wakeup for
> CPU-intensive detection), current_start is never reset because the
> diagnostic cares about total wall-clock time including sleeps.
>
> Before: in-flight: 165:stall_work_fn [wq_stall]
> After: in-flight: 165:stall_work_fn [wq_stall] for 100s
>
> Signed-off-by: Breno Leitao <lei...@debian.org>

Acked-by: Song Liu <so...@kernel.org>

This shows really useful information. Thanks!

Song Liu

unread,

Mar 5, 2026, 12:18:09 PMMar 5

to Breno Leitao, Tejun Heo, Lai Jiangshan, Andrew Morton, linux-...@vger.kernel.org, Omar Sandoval, Danielle Costantino, kasa...@googlegroups.com, Petr Mladek, kerne...@meta.com

On Thu, Mar 5, 2026 at 8:16 AM Breno Leitao <lei...@debian.org> wrote:
>

> show_cpu_pool_hog() only prints workers whose task is currently running
> on the CPU (task_is_running()). This misses workers that are busy
> processing a work item but are sleeping or blocked — for example, a
> worker that clears PF_WQ_WORKER and enters wait_event_idle(). Such a
> worker still occupies a pool slot and prevents progress, yet produces
> an empty backtrace section in the watchdog output.
>
> This is happening on real arm64 systems, where
> toggle_allocation_gate() IPIs every single CPU in the machine (which
> lacks NMI), causing workqueue stalls that show empty backtraces because
> toggle_allocation_gate() is sleeping in wait_event_idle().
>
> Remove the task_is_running() filter so every in-flight worker in the
> pool's busy_hash is dumped. The busy_hash is protected by pool->lock,
> which is already held.
>
> Signed-off-by: Breno Leitao <lei...@debian.org>

Acked-by: Song Liu <so...@kernel.org>

Song Liu

unread,

Mar 5, 2026, 12:25:28 PMMar 5

to Breno Leitao, Tejun Heo, Lai Jiangshan, Andrew Morton, linux-...@vger.kernel.org, Omar Sandoval, Danielle Costantino, kasa...@googlegroups.com, Petr Mladek, kerne...@meta.com

On Thu, Mar 5, 2026 at 8:16 AM Breno Leitao <lei...@debian.org> wrote:
>

> Add a sample module under samples/workqueue/stall_detector/ that
> reproduces a workqueue stall caused by PF_WQ_WORKER misuse. The
> module queues two work items on the same per-CPU pool, then clears
> PF_WQ_WORKER and sleeps in wait_event_idle(), hiding from the
> concurrency manager and stalling the second work item indefinitely.

Clearing PF_WQ_WORKER is an interesting way to trigger the stall.

>
> This is useful for testing the workqueue watchdog stall diagnostics.
>
> Signed-off-by: Breno Leitao <lei...@debian.org>

Acked-by: Song Liu <so...@kernel.org>

Tejun Heo

unread,

Mar 5, 2026, 12:39:30 PMMar 5

to Breno Leitao, Lai Jiangshan, Andrew Morton, linux-...@vger.kernel.org, Omar Sandoval, Song Liu, Danielle Costantino, kasa...@googlegroups.com, Petr Mladek, kerne...@meta.com

Hello,

> Breno Leitao (5):
> workqueue: Use POOL_BH instead of WQ_BH when checking pool flags
> workqueue: Rename pool->watchdog_ts to pool->last_progress_ts
> workqueue: Show in-flight work item duration in stall diagnostics
> workqueue: Show all busy workers in stall diagnostics
> workqueue: Add stall detector sample module

Applied 1-5 to wq/for-7.0-fixes.

One minor note for a future follow-up: show_cpu_pool_hog() and
show_cpu_pools_hogs() function names no longer reflect the broadened
scope after patch 4 - they now dump all busy workers, not just CPU
hogs.

Thanks.

--
tejun

Petr Mladek

unread,

Mar 12, 2026, 12:38:32 PMMar 12

to Breno Leitao, Tejun Heo, Lai Jiangshan, Andrew Morton, linux-...@vger.kernel.org, Omar Sandoval, Song Liu, Danielle Costantino, kasa...@googlegroups.com, kerne...@meta.com

I am trying to better understand the situation. There was a reason
why only the worker in the running state was shown.

Normally, a sleeping worker should not cause a stall. The scheduler calls
wq_worker_sleeping() which should wake up another idle worker. There is
always at least one idle worker in the poll. It should start processing
the next pending work. Or it should fork another worker when it was
the last idle one.

I wonder what blocked the idle worker from waking or forking
a new worker. Was it caused by the IPIs?

Did printing the sleeping workers helped to analyze the problem?

I wonder if we could do better in this case. For example, warn
that the scheduler failed to wake up another idle worker when
no worker is in the running state. And maybe, print backtrace
of the currently running process on the given CPU because it
likely blocks waking/scheduling the idle worker.

Otherwise, I like the other improvements.

Best Regards,
Petr

Petr Mladek

unread,

Mar 12, 2026, 1:03:10 PMMar 12

to Breno Leitao, Tejun Heo, Lai Jiangshan, Andrew Morton, linux-...@vger.kernel.org, Omar Sandoval, Song Liu, Danielle Costantino, kasa...@googlegroups.com, kerne...@meta.com

On Thu 2026-03-05 08:15:40, Breno Leitao wrote:
> show_cpu_pool_hog() only prints workers whose task is currently running
> on the CPU (task_is_running()). This misses workers that are busy
> processing a work item but are sleeping or blocked — for example, a
> worker that clears PF_WQ_WORKER and enters wait_event_idle().

IMHO, it is misleading. AFAIK, workers clear PF_WQ_WORKER flag only
when they are going to die. They never do so when going to sleep.

> Such a
> worker still occupies a pool slot and prevents progress, yet produces
> an empty backtrace section in the watchdog output.
>
> This is happening on real arm64 systems, where
> toggle_allocation_gate() IPIs every single CPU in the machine (which
> lacks NMI), causing workqueue stalls that show empty backtraces because
> toggle_allocation_gate() is sleeping in wait_event_idle().

The wait_event_idle() called in toggle_allocation_gate() should not
cause a stall. The scheduler should call wq_worker_sleeping(tsk)
and wake up another idle worker. It should guarantee the progress.

> Remove the task_is_running() filter so every in-flight worker in the
> pool's busy_hash is dumped. The busy_hash is protected by pool->lock,
> which is already held.

As I explained in reply to the cover letter, sleeping workers should
not block forward progress. It seems that in this case, the system was
not able to wake up the other idle worker or it was the last idle
worker and was not able to fork a new one.

IMHO, we should warn about this when there is no running worker.
It might be more useful than printing backtraces of the sleeping
workers because they likely did not cause the problem.

I believe that the problem, in this particular situation, is that
the system can't schedule or fork new processes. It might help
to warn about it and maybe show backtrace of the currently
running process on the stalled CPU.

Anyway, I think we could do better here. And blindly printing backtraces
from all workers would do more harm then good in most situations.

Best Regards,
Petr

Breno Leitao

unread,

Mar 13, 2026, 8:25:17 AMMar 13

to Petr Mladek, Tejun Heo, Lai Jiangshan, Andrew Morton, linux-...@vger.kernel.org, Omar Sandoval, Song Liu, Danielle Costantino, kasa...@googlegroups.com, kerne...@meta.com

Hello Petr,

Right, but let's look at this case:

BUG: workqueue lockup - pool 55 cpu 13 curr 0 (swapper/13) stack ffff800085640000 cpus=13 node=0 flags=0x0 nice=-20 stuck for 679s!
work func=blk_mq_timeout_work data=0xffff0000ad7e3a05

Showing busy workqueues and worker pools:

workqueue events_unbound: flags=0x2
pwq 288: cpus=0-71 flags=0x4 nice=0 active=1 refcnt=2
in-flight: 4083734:btrfs_extent_map_shrinker_worker
workqueue mm_percpu_wq: flags=0x8
pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=1 refcnt=2
pending: vmstat_update
pool 288: cpus=0-71 flags=0x4 nice=0 hung=0s workers=17 idle: 3800629 3959700 3554824 3706405 3759881 4065549 4041361 4065548 1715676 4086805 3860852 3587585 4065550 4014041 3944711 3744484

Showing backtraces of running workers in stalled CPU-bound worker pools:

# Nothing in here

It seems CPU 13 is idle (curr = 0) and blk_mq_timeout_work has been pending for
679s ?

> I wonder what blocked the idle worker from waking or forking
> a new worker. Was it caused by the IPIs?

Not sure, keep in mind that these hosts (arm64) do not have NMI, so,
IPIs are just regular interrupts that could take a long time to be handled. The
toggle_allocation_gate() was good example, given it was sending IPIs very
frequently and I took it as an example for the cover letter, but, this problem
also show up with diferent places. (more examples later)

> Did printing the sleeping workers helped to analyze the problem?

That is my hope. I don't have a reproducer other than the one in this
patchset.

I am currently rolling this patchset to production, and I can report once
I get more information.

> I wonder if we could do better in this case. For example, warn
> that the scheduler failed to wake up another idle worker when
> no worker is in the running state. And maybe, print backtrace
> of the currently running process on the given CPU because it
> likely blocks waking/scheduling the idle worker.

I am happy to improve this, given this has been a hard issue. let me give more
instances of the "empty" stalls I am seeing. All with empty backtraces:

# Instance 1
BUG: workqueue lockup - pool cpus=33 node=0 flags=0x0 nice=0 stuck for 33s!

Showing busy workqueues and worker pools:

workqueue events: flags=0x0
pwq 134: cpus=33 node=0 flags=0x0 nice=0 active=3 refcnt=4
pending: 3*psi_avgs_work
pwq 218: cpus=54 node=0 flags=0x0 nice=0 active=1 refcnt=2
in-flight: 842:key_garbage_collector
workqueue mm_percpu_wq: flags=0x8
pwq 134: cpus=33 node=0 flags=0x0 nice=0 active=1 refcnt=2
pending: vmstat_update
pool 218: cpus=54 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 11200 524627

Showing backtraces of running workers in stalled CPU-bound worker pools:

# Instance 2
BUG: workqueue lockup - pool cpus=53 node=0 flags=0x0 nice=0 stuck for 459s!

Showing busy workqueues and worker pools:

workqueue events: flags=0x0

pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=1 refcnt=2

pending: psi_avgs_work
pwq 214: cpus=53 node=0 flags=0x0 nice=0 active=4 refcnt=5
pending: 2*psi_avgs_work, drain_local_memcg_stock, iova_depot_work_func
workqueue events_freezable: flags=0x4

pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=1 refcnt=2

pending: pci_pme_list_scan
workqueue slub_flushwq: flags=0x8
pwq 214: cpus=53 node=0 flags=0x0 nice=0 active=1 refcnt=3
pending: flush_cpu_slab BAR(7520)
workqueue mm_percpu_wq: flags=0x8
pwq 214: cpus=53 node=0 flags=0x0 nice=0 active=1 refcnt=2
pending: vmstat_update
workqueue mlx5_cmd_0002:03:00.1: flags=0x6000a
pwq 576: cpus=0-143 flags=0x4 nice=0 active=1 refcnt=146
pending: cmd_work_handler

Showing backtraces of running workers in stalled CPU-bound worker pools:

# Instance 3
BUG: workqueue lockup - pool cpus=74 node=1 flags=0x0 nice=0 stuck for 31s!

Showing busy workqueues and worker pools:

workqueue mm_percpu_wq: flags=0x8
pwq 298: cpus=74 node=1 flags=0x0 nice=0 active=1 refcnt=2
pending: vmstat_update

Showing backtraces of running workers in stalled CPU-bound worker pools:

# Instance 4
BUG: workqueue lockup - pool cpus=71 node=0 flags=0x0 nice=0 stuck for 32s!

Showing busy workqueues and worker pools:

workqueue events: flags=0x0
pwq 286: cpus=71 node=0 flags=0x0 nice=0 active=2 refcnt=3
pending: psi_avgs_work, fuse_check_timeout
workqueue events_freezable: flags=0x4

pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=1 refcnt=2

pending: pci_pme_list_scan
workqueue mm_percpu_wq: flags=0x8
pwq 286: cpus=71 node=0 flags=0x0 nice=0 active=1 refcnt=2
pending: vmstat_update

Showing backtraces of running workers in stalled CPU-bound worker pools:

Thanks for your help,
--breno

Breno Leitao

unread,

Mar 13, 2026, 8:58:14 AMMar 13

to Petr Mladek, Tejun Heo, Lai Jiangshan, Andrew Morton, linux-...@vger.kernel.org, Omar Sandoval, Song Liu, Danielle Costantino, kasa...@googlegroups.com, kerne...@meta.com

Do you mean checking if pool->busy_hash is empty, and then warning?

Commit fc36ad49ce7160907bcbe4f05c226595611ac293
Author: Breno Leitao <lei...@debian.org>
Date: Fri Mar 13 05:35:02 2026 -0700

workqueue: warn when stalled pool has no running workers

When the workqueue watchdog detects a pool stall and the pool's
busy_hash is empty (no workers executing any work item), print a
diagnostic warning with the pool state and trigger a backtrace of
the currently running task on the stalled CPU.

Signed-off-by: Breno Leitao <lei...@debian.org>
Suggested-by: Petr Mladek <pml...@suse.com>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 6ee52ba9b14f7..d538067754123 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -7655,6 +7655,17 @@ static void show_cpu_pool_busy_workers(struct worker_pool *pool)

raw_spin_lock_irqsave(&pool->lock, irq_flags);

+ if (hash_empty(pool->busy_hash)) {
+ raw_spin_unlock_irqrestore(&pool->lock, irq_flags);
+
+ pr_info("pool %d: no running workers, cpu=%d is %s (nr_workers=%d nr_idle=%d)\n",
+ pool->id, pool->cpu,
+ idle_cpu(pool->cpu) ? "idle" : "busy",
+ pool->nr_workers, pool->nr_idle);
+ trigger_single_cpu_backtrace(pool->cpu);
+ return;
+ }
+

hash_for_each(pool->busy_hash, bkt, worker, hentry) {

if (task_is_running(worker->task)) {
/*

Petr Mladek

unread,

Mar 13, 2026, 10:39:03 AMMar 13

to Breno Leitao, Tejun Heo, Lai Jiangshan, Andrew Morton, linux-...@vger.kernel.org, Omar Sandoval, Song Liu, Danielle Costantino, kasa...@googlegroups.com, kerne...@meta.com

It looks like that progress is not blocked by an overloaded CPU.

One interesting thing is there is no "pwq XXX: cpus=13" in the list
of busy workqueues and worker pools. IMHO, the watchdog should report
a stall only when there is a pending work. It does not make much sense
to me.

BTW: I look at pr_cont_pool_info() in the mainline and it does not
not print the name of the current process and its stack address.
I guess that it is printed by another debugging patch ?

> pool 288: cpus=0-71 flags=0x4 nice=0 hung=0s workers=17 idle: 3800629 3959700 3554824 3706405 3759881 4065549 4041361 4065548 17

> > I wonder what blocked the idle worker from waking or forking
> > a new worker. Was it caused by the IPIs?
>
> Not sure, keep in mind that these hosts (arm64) do not have NMI, so,
> IPIs are just regular interrupts that could take a long time to be handled. The
> toggle_allocation_gate() was good example, given it was sending IPIs very
> frequently and I took it as an example for the cover letter, but, this problem
> also show up with diferent places. (more examples later)
>
> > Did printing the sleeping workers helped to analyze the problem?
>
> That is my hope. I don't have a reproducer other than the one in this
> patchset.

Good to know. Note that the reproducer is not "realistic".
PF_WQ_WORKER is an internal flag and must not be manipulated
by the queued work callbacks. It is like shooting into an own leg.

> I am currently rolling this patchset to production, and I can report once
> I get more information.

That would be great. I am really curious what is the root problem here.

In all these cases, there is listed some pending work on the stuck
"cpus=XXX". So, it looks more sane than the 1st report.

I agree that it looks ugly that it did not print any backtraces.
But I am not sure if the backtraces would help.

If there is no running worker then wq_worker_sleeping() should wake up
another idle worker. And if this is the last idle worker in the
per-CPU pool than it should create another worker.

Honestly, I think that there is only small chance that the backtraces
of the sleeping workers will help to solve the problem.

IMHO, the problem is that wq_worker_sleeping() was not able to
guarantee forward progress. Note that there should always be
at least one idle work on CPU-bound worker pools.

Now, the might be more reasons why it failed:

1. It did not wake up any idle worker because it though
it has already been done, for example because a messed
worker->sleeping flag, worker->flags & WORKER_NOT_RUNNING flag,
pool->nr_running count.

IMHO, the chance of this bug is small.

2. The scheduler does not schedule the woken idle worker because:

+ too big load
+ soft/hardlockup on the given CPU
+ the scheduler does not schedule anything, e.g. because of
stop_machine()

It seems that this not the case on the 1st example where
the CPU is idle. But I am not sure how exactly are the IPIs
handled on arm64.

3. There always must be at least one idle worker in each pool.
But the last idle worker newer processes pending work.
It has to create another worker instead.

create_worker() might fail from more reasons:

+ worker pool limit (is there any?)
+ PID limit
+ memory limit

I have personally seen these problems caused by PID limit.
Note that containers might have relatively small limits by
default !!!

4. ???

I think that it might be interesting to print backtrace and
state of the worker which is supposed to guarantee progress.
Is it "pool->manager" ?

Also create_worker() prints an error when it can't create worker.
But the error is printed only once. And it might get lost on
huge systems with extensive load and logging.

Maybe, we could add some global variable allow to print
these errors once again when workqueue stall is detected.

Or store some timestamps when the function tried to create a new worker
and when it succeeded last time. And print it in the stall report.

Best Regards,
Petr

Petr Mladek

unread,

Mar 13, 2026, 12:27:45 PMMar 13

to Breno Leitao, Tejun Heo, Lai Jiangshan, Andrew Morton, linux-...@vger.kernel.org, Omar Sandoval, Song Liu, Danielle Costantino, kasa...@googlegroups.com, kerne...@meta.com

This would print it only when there is no in-flight work.

But I think that the problem is when there in no worker in
the running state. There should always be one to guarantee
the forward progress.

I took inspiration from your patch. This is what comes to my mind
on top of the current master (printing only running workers):

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index aeaec79bc09c..a044c7e42139 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -7588,12 +7588,15 @@ static void show_cpu_pool_hog(struct worker_pool *pool)
{
struct worker *worker;
unsigned long irq_flags;
+ bool found_running;
int bkt;

raw_spin_lock_irqsave(&pool->lock, irq_flags);

+ found_running = false;

hash_for_each(pool->busy_hash, bkt, worker, hentry) {
if (task_is_running(worker->task)) {

+ found_running = true;
/*

* Defer printing to avoid deadlocks in console

* drivers that queue work while holding locks

@@ -7609,6 +7612,19 @@ static void show_cpu_pool_hog(struct worker_pool *pool)
}

raw_spin_unlock_irqrestore(&pool->lock, irq_flags);
+
+ if (!found_running) {
+ pr_info("pool %d: no worker in running state, cpu=%d is %s (nr_workers=%d nr_idle=%d)\n",

+ pool->id, pool->cpu,
+ idle_cpu(pool->cpu) ? "idle" : "busy",
+ pool->nr_workers, pool->nr_idle);

+ pr_info("The pool might have troubles to wake up another idle worker.\n");
+ if (pool->manager) {
+ pr_info("Backtrace of the pool manager:\n");
+ sched_show_task(pool->manager->task);
+ }
+ trigger_single_cpu_backtrace(pool->cpu);
+ }
}

static void show_cpu_pools_hogs(void)

Warning: The code is not safe. We would need add some synchronization
of the pool->manager pointer.

Even better might be to print state and backtrace of the process
which was woken by kick_pool() when the last running worker
went asleep.

Motivation: AFAIK, if there is a pending work in CPU bound workqueue
than at least one worker in the related worker pool should be
in "task_is_running()" state to guarantee forward progress.

If we find the running worker then it will likely be the
culprit. It either runs for too long. Or it is the last
idle worker and it fails to create a new one.

If there is no worker in running state then there is likely
a problem in the core workqueue code. Or some work shoot
the workqueue into its leg. Anyway, we might need to print
much more details to nail it down.

Best Regards,
Petr

Breno Leitao

unread,

Mar 13, 2026, 1:36:24 PMMar 13

to Petr Mladek, Tejun Heo, Lai Jiangshan, Andrew Morton, linux-...@vger.kernel.org, Omar Sandoval, Song Liu, Danielle Costantino, kasa...@googlegroups.com, kerne...@meta.com

On Fri, Mar 13, 2026 at 03:38:57PM +0100, Petr Mladek wrote:
> On Fri 2026-03-13 05:24:54, Breno Leitao wrote:

> > Right, but let's look at this case:
> >
> > BUG: workqueue lockup - pool 55 cpu 13 curr 0 (swapper/13) stack ffff800085640000 cpus=13 node=0 flags=0x0 nice=-20 stuck for 679s!
> > work func=blk_mq_timeout_work data=0xffff0000ad7e3a05
> > Showing busy workqueues and worker pools:
> > workqueue events_unbound: flags=0x2
> > pwq 288: cpus=0-71 flags=0x4 nice=0 active=1 refcnt=2
> > in-flight: 4083734:btrfs_extent_map_shrinker_worker
> > workqueue mm_percpu_wq: flags=0x8
> > pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=1 refcnt=2
> > pending: vmstat_update
> > pool 288: cpus=0-71 flags=0x4 nice=0 hung=0s workers=17 idle: 3800629 3959700 3554824 3706405 3759881 4065549 4041361 4065548 1715676 4086805 3860852 3587585 4065550 4014041 3944711 3744484
> > Showing backtraces of running workers in stalled CPU-bound worker pools:
> > # Nothing in here
> >
> > It seems CPU 13 is idle (curr = 0) and blk_mq_timeout_work has been pending for
> > 679s ?
>
> It looks like that progress is not blocked by an overloaded CPU.

Looking at data address, it seems it always have the last 0x5 bits set,
meaning that WORK_STRUCT_PENDING and WORK_STRUCT_PWQ set, right?

So, the work is peding for a huge amount of time (see more examples below)

> One interesting thing is there is no "pwq XXX: cpus=13" in the list
> of busy workqueues and worker pools. IMHO, the watchdog should report
> a stall only when there is a pending work. It does not make much sense
> to me.
>
> BTW: I look at pr_cont_pool_info() in the mainline and it does not
> not print the name of the current process and its stack address.
> I guess that it is printed by another debugging patch ?

Sorry, this was an simple change we got in initially, that is basically doing:

void *curr_stack;
curr_stack = try_get_task_stack(curr)
pr_emerg("BUG: workqueue lockup - pool %d cpu %d curr %d (%s) stack %px",
pool->id, pool->cpu, curr->pid,
curr->comm, curr_stack);

>
>
> > pool 288: cpus=0-71 flags=0x4 nice=0 hung=0s workers=17 idle: 3800629 3959700 3554824 3706405 3759881 4065549 4041361 4065548 17
>
> > > I wonder what blocked the idle worker from waking or forking
> > > a new worker. Was it caused by the IPIs?
> >
> > Not sure, keep in mind that these hosts (arm64) do not have NMI, so,
> > IPIs are just regular interrupts that could take a long time to be handled. The
> > toggle_allocation_gate() was good example, given it was sending IPIs very
> > frequently and I took it as an example for the cover letter, but, this problem
> > also show up with diferent places. (more examples later)
> >
> > > Did printing the sleeping workers helped to analyze the problem?
> >
> > That is my hope. I don't have a reproducer other than the one in this
> > patchset.
>
> Good to know. Note that the reproducer is not "realistic".
> PF_WQ_WORKER is an internal flag and must not be manipulated
> by the queued work callbacks. It is like shooting into an own leg.

Ack!

> > I am currently rolling this patchset to production, and I can report once
> > I get more information.
>
> That would be great. I am really curious what is the root problem here.

In fact, I got some instances of this issue with this new patchset, and,
still, the backtrace is empty. These are the only 3 issues I got with the new
patches applied. All of them wiht the "blk_mq_timeout_work" function.

BUG: workqueue lockup - pool 11 cpu 2 curr 686384 (thrmon_agg) stack ffff8002bd200000 cpus=2 node=0 flags=0x0 nice=-20 stuck for 276s!
work func=blk_mq_timeout_work data=0xffff0000b88e3405

Showing busy workqueues and worker pools:

workqueue kblockd: flags=0x18
pwq 11: cpus=2 node=0 flags=0x0 nice=-20 active=1 refcnt=2
pending: blk_mq_timeout_work
Showing backtraces of busy workers in stalled CPU-bound worker pools:

BUG: workqueue lockup - pool 7 cpu 1 curr 0 (swapper/1) stack ffff800084f80000 cpus=1 node=0 flags=0x0 nice=-20 stuck for 114s!
work func=blk_mq_timeout_work data=0xffff0000b88e3205

Showing busy workqueues and worker pools:
workqueue events: flags=0x0

pwq 510: cpus=127 node=1 flags=0x0 nice=0 active=1 refcnt=2
pending: psi_avgs_work
Showing backtraces of busy workers in stalled CPU-bound worker pools:

BUG: workqueue lockup - pool 11 cpu 2 curr 24596 (mcrcfg-fci) stack ffff8002b5a40000 cpus=2 node=0 flags=0x0 nice=-20 stuck for 282s!
work func=blk_mq_timeout_work data=0xffff0000b8706805

Showing busy workqueues and worker pools:

Showing backtraces of busy workers in stalled CPU-bound worker pools:

I don't have information about the load of those machines when the problem
happens, but, in some case the problem happen when there is no workload
(production job) running on those machine, thus, it is hard to assume that the
load is high.

> 3. There always must be at least one idle worker in each pool.
> But the last idle worker newer processes pending work.
> It has to create another worker instead.
>
> create_worker() might fail from more reasons:
>
> + worker pool limit (is there any?)
> + PID limit
> + memory limit
>
> I have personally seen these problems caused by PID limit.
> Note that containers might have relatively small limits by
> default !!!

Might this justify the fact that WORK_STRUCT_PENDING bit is set for ~200
seconds?

> I think that it might be interesting to print backtrace and
> state of the worker which is supposed to guarantee progress.
> Is it "pool->manager" ?
>
> Also create_worker() prints an error when it can't create worker.
> But the error is printed only once. And it might get lost on
> huge systems with extensive load and logging.

That is definitely not the case. I've scan Meta's whole fleet for create_worker
error, and there is a single instance on a unrelated host.

Breno Leitao

unread,

Mar 18, 2026, 7:31:23 AMMar 18

to Petr Mladek, Tejun Heo, Lai Jiangshan, Andrew Morton, linux-...@vger.kernel.org, Omar Sandoval, Song Liu, Danielle Costantino, kasa...@googlegroups.com, kerne...@meta.com

Hello Petr,

I agree. We should probably store the last woken worker in the worker_pool
structure and print it later.

I've spent some time verifying that the locking and lifecycle management are
correct. While I'm not completely certain, I believe it's getting closer. An
extra pair of eyes would be helpful.

This is the new version of this patch:

commit feccca7e696ead3272669ee4d4dc02b6946d0faf
Author: Breno Leitao <lei...@debian.org>
Date: Mon Mar 16 09:47:09 2026 -0700

workqueue: print diagnostic info when no worker is in running state

show_cpu_pool_busy_workers() iterates over busy workers but gives no
feedback when none are found in running state, which is a key indicator
that a pool may be stuck — unable to wake an idle worker to process
pending work.

Add a diagnostic message when no running workers are found, reporting
pool id, CPU, idle state, and worker counts. Also trigger a single-CPU
backtrace for the stalled CPU.

To identify the task most likely responsible for the stall, add
last_woken_worker (L: pool->lock) to worker_pool and record it in
kick_pool() just before wake_up_process(). This captures the idle
worker that was kicked to take over when the last running worker went to
sleep; if the pool is now stuck with no running worker, that task is the
prime suspect and its backtrace is dumped.

Using struct worker * rather than struct task_struct * avoids any
lifetime concern: workers are only destroyed via set_worker_dying()
which requires pool->lock, and set_worker_dying() clears
last_woken_worker when the dying worker matches. show_cpu_pool_busy_workers()
holds pool->lock while calling sched_show_task(), so last_woken_worker
is either NULL or points to a live worker with a valid task. More
precisely, set_worker_dying() clears last_woken_worker before setting
WORKER_DIE, so a non-NULL last_woken_worker means the kthread has not
yet exited and worker->task is still alive.

The pool info message is printed inside pool->lock using
printk_deferred_enter/exit, the same pattern used by the existing
busy-worker loop, to avoid deadlocks with console drivers that queue
work while holding locks also taken in their write paths.
trigger_single_cpu_backtrace() is called after releasing the lock.

Suggested-by: Petr Mladek <pml...@suse.com>
Signed-off-by: Breno Leitao <lei...@debian.org>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index b77119d71641a..38aebf4514c03 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -217,6 +217,7 @@ struct worker_pool {
/* L: hash of busy workers */

struct worker *manager; /* L: purely informational */
+ struct worker *last_woken_worker; /* L: last worker woken by kick_pool() */
struct list_head workers; /* A: attached workers */

struct ida worker_ida; /* worker IDs for task name */
@@ -1295,6 +1296,9 @@ static bool kick_pool(struct worker_pool *pool)
}
}
#endif
+ /* Track the last idle worker woken, used for stall diagnostics. */
+ pool->last_woken_worker = worker;
+
wake_up_process(p);
return true;
}
@@ -2902,6 +2906,13 @@ static void set_worker_dying(struct worker *worker, struct list_head *list)
pool->nr_workers--;
pool->nr_idle--;

+ /*
+ * Clear last_woken_worker if it points to this worker, so that
+ * show_cpu_pool_busy_workers() cannot dereference a freed worker.
+ */
+ if (pool->last_woken_worker == worker)
+ pool->last_woken_worker = NULL;
+
worker->flags |= WORKER_DIE;

list_move(&worker->entry, list);
@@ -7582,20 +7593,58 @@ module_param_named(panic_on_stall_time, wq_panic_on_stall_time, uint, 0644);
MODULE_PARM_DESC(panic_on_stall_time, "Panic if stall exceeds this many seconds (0=disabled)");

/*
- * Show workers that might prevent the processing of pending work items.
- * A busy worker that is not running on the CPU (e.g. sleeping in
- * wait_event_idle() with PF_WQ_WORKER cleared) can stall the pool just as
- * effectively as a CPU-bound one, so dump every in-flight worker.
+ * Report that a pool has no worker in running state, which is a sign that the
+ * pool may be stuck. Print pool info. Must be called with pool->lock held and
+ * inside a printk_deferred_enter/exit region.
+ */
+static void show_pool_no_running_worker(struct worker_pool *pool)
+{
+ lockdep_assert_held(&pool->lock);
+
+ printk_deferred_enter();

+ pr_info("pool %d: no worker in running state, cpu=%d is %s (nr_workers=%d nr_idle=%d)\n",
+ pool->id, pool->cpu,
+ idle_cpu(pool->cpu) ? "idle" : "busy",
+ pool->nr_workers, pool->nr_idle);

+ pr_info("The pool might have trouble waking an idle worker.\n");
+ /*
+ * last_woken_worker and its task are valid here: set_worker_dying()
+ * clears it under pool->lock before setting WORKER_DIE, so if
+ * last_woken_worker is non-NULL the kthread has not yet exited and
+ * worker->task is still alive.
+ */
+ if (pool->last_woken_worker) {
+ pr_info("Backtrace of last woken worker:\n");
+ sched_show_task(pool->last_woken_worker->task);
+ } else {
+ pr_info("Last woken worker empty\n");
+ }
+ printk_deferred_exit();
+}
+
+/*
+ * Show running workers that might prevent the processing of pending work items.
+ * If no running worker is found, the pool may be stuck waiting for an idle
+ * worker to be woken, so report the pool state and the last woken worker.
*/

static void show_cpu_pool_busy_workers(struct worker_pool *pool)

{
struct worker *worker;
unsigned long irq_flags;

- int bkt;
+ bool found_running = false;
+ int cpu, bkt;

raw_spin_lock_irqsave(&pool->lock, irq_flags);

+ /* Snapshot cpu inside the lock to safely use it after unlock. */
+ cpu = pool->cpu;
+

hash_for_each(pool->busy_hash, bkt, worker, hentry) {

+ /* Skip workers that are not actively running on the CPU. */
+ if (!task_is_running(worker->task))
+ continue;
+

+ found_running = true;
/*
* Defer printing to avoid deadlocks in console
* drivers that queue work while holding locks

@@ -7609,7 +7658,23 @@ static void show_cpu_pool_busy_workers(struct worker_pool *pool)
printk_deferred_exit();
}

+ /*
+ * If no running worker was found, the pool is likely stuck. Print pool
+ * state and the backtrace of the last woken worker, which is the prime
+ * suspect for the stall.
+ */
+ if (!found_running)
+ show_pool_no_running_worker(pool);
+
raw_spin_unlock_irqrestore(&pool->lock, irq_flags);
+
+ /*
+ * Trigger a backtrace on the stalled CPU to capture what it is
+ * currently executing. Called after releasing the lock to avoid
+ * any potential issues with NMI delivery.
+ */
+ if (!found_running)
+ trigger_single_cpu_backtrace(cpu);
}

static void show_cpu_pools_busy_workers(void)

Petr Mladek

unread,

Mar 18, 2026, 11:12:00 AMMar 18

to Breno Leitao, Tejun Heo, Lai Jiangshan, Andrew Morton, linux-...@vger.kernel.org, Omar Sandoval, Song Liu, Danielle Costantino, kasa...@googlegroups.com, kerne...@meta.com

This is a bit ambiguous. It sounds like that the worker is idle.
I would write something like:

pr_info("There is no info about the last woken worker\n");
pr_info("Missing info about the last woken worker.\n");

> + }
> + printk_deferred_exit();
> +}
> +

Otherwise, I like this patch.

I still think what might be the reason that there is no worker
in the running state. Let's see if this patch brings some useful info.

One more idea. It might be useful to store a timestamp when the last
worker was woken. And then print either the timestamp or delta.
It would help to make sure that kick_pool() was really called
during the reported stall.

Best Regards,
Petr

Petr Mladek

unread,

Mar 18, 2026, 12:46:26 PMMar 18

to Breno Leitao, Tejun Heo, Lai Jiangshan, Andrew Morton, linux-...@vger.kernel.org, Omar Sandoval, Song Liu, Danielle Costantino, kasa...@googlegroups.com, kerne...@meta.com

On Fri 2026-03-13 10:36:09, Breno Leitao wrote:
> On Fri, Mar 13, 2026 at 03:38:57PM +0100, Petr Mladek wrote:
> > On Fri 2026-03-13 05:24:54, Breno Leitao wrote:
> > > I am currently rolling this patchset to production, and I can report once
> > > I get more information.
> >
> > That would be great. I am really curious what is the root problem here.
>
> In fact, I got some instances of this issue with this new patchset, and,
> still, the backtrace is empty. These are the only 3 issues I got with the new
> patches applied. All of them wiht the "blk_mq_timeout_work" function.
>
> BUG: workqueue lockup - pool 11 cpu 2 curr 686384 (thrmon_agg) stack ffff8002bd200000 cpus=2 node=0 flags=0x0 nice=-20 stuck for 276s!
> work func=blk_mq_timeout_work data=0xffff0000b88e3405
> Showing busy workqueues and worker pools:
> workqueue kblockd: flags=0x18
> pwq 11: cpus=2 node=0 flags=0x0 nice=-20 active=1 refcnt=2
> pending: blk_mq_timeout_work

This is report is showing the stalled "pool 11" in the list of busy
worker pools.

> Showing backtraces of busy workers in stalled CPU-bound worker pools:
>
> BUG: workqueue lockup - pool 7 cpu 1 curr 0 (swapper/1) stack ffff800084f80000 cpus=1 node=0 flags=0x0 nice=-20 stuck for 114s!
> work func=blk_mq_timeout_work data=0xffff0000b88e3205
> Showing busy workqueues and worker pools:
> workqueue events: flags=0x0
> pwq 510: cpus=127 node=1 flags=0x0 nice=0 active=1 refcnt=2
> pending: psi_avgs_work

It is strange that "pwq 7" is not listed here.

> Showing backtraces of busy workers in stalled CPU-bound worker pools:
>
> BUG: workqueue lockup - pool 11 cpu 2 curr 24596 (mcrcfg-fci) stack ffff8002b5a40000 cpus=2 node=0 flags=0x0 nice=-20 stuck for 282s!
> work func=blk_mq_timeout_work data=0xffff0000b8706805
> Showing busy workqueues and worker pools:

And the list of busy worker pools is even empty here.

> Showing backtraces of busy workers in stalled CPU-bound worker pools:

I would expect that the stalled pool was shown by show_one_workqueue().

show_one_workqueue() checks pwq->nr_active instead of
list_empty(&pool->worklist). But my understanding is that work items
added to pool->worklist should be counted by the related
pwq->nr_active. In fact, pwq->nr_active seems to be decremented
only when the work is proceed or removed from the queue. So that
it should be counted as nr_active even when it is already in progress.
As a result, show_one_workqueue() should print even pools which have
the last assigned work in-flight.

Maybe, I miss something. For example, the barriers are not counted
as nr_active, ...

Anyway, the backtrace of the last woken worker might give us
some pointers. It might show that the pool is stuck on some
wq_barrier or so.

Good to know. I am more and more curious what would be the culprit
here.

Best Regards,
Petr

Breno Leitao

unread,

Mar 20, 2026, 6:41:29 AMMar 20

to Petr Mladek, Tejun Heo, Lai Jiangshan, Andrew Morton, linux-...@vger.kernel.org, Omar Sandoval, Song Liu, Danielle Costantino, kasa...@googlegroups.com, kerne...@meta.com

Ack, this is the following patch I will deploy in production, let's see
how useful it is.

commit c78b175971888da3c2ae6d84971e9beb01269a92

Author: Breno Leitao <lei...@debian.org>
Date: Mon Mar 16 09:47:09 2026 -0700

workqueue: print diagnostic info when no worker is in running state

show_cpu_pool_busy_workers() iterates over busy workers but gives no
feedback when none are found in running state, which is a key indicator
that a pool may be stuck — unable to wake an idle worker to process
pending work.

Add a diagnostic message when no running workers are found, reporting
pool id, CPU, idle state, and worker counts. Also trigger a single-CPU
backtrace for the stalled CPU.

To identify the task most likely responsible for the stall, add

last_woken_worker and last_woken_tstamp (both L: pool->lock) to
worker_pool and record them in kick_pool() just before

wake_up_process(). This captures the idle worker that was kicked to
take over when the last running worker went to sleep; if the pool is
now stuck with no running worker, that task is the prime suspect and

its backtrace is dumped along with how long ago it was woken.

Using struct worker * rather than struct task_struct * avoids any
lifetime concern: workers are only destroyed via set_worker_dying()
which requires pool->lock, and set_worker_dying() clears
last_woken_worker when the dying worker matches. show_cpu_pool_busy_workers()
holds pool->lock while calling sched_show_task(), so last_woken_worker
is either NULL or points to a live worker with a valid task. More
precisely, set_worker_dying() clears last_woken_worker before setting
WORKER_DIE, so a non-NULL last_woken_worker means the kthread has not
yet exited and worker->task is still alive.

The pool info message is printed inside pool->lock using
printk_deferred_enter/exit, the same pattern used by the existing
busy-worker loop, to avoid deadlocks with console drivers that queue
work while holding locks also taken in their write paths.
trigger_single_cpu_backtrace() is called after releasing the lock.

Sample output from a stall triggered by the wq_stall test now.

pool 174: no worker in running state, cpu=43 is idle (nr_workers=2 nr_idle=1)

The pool might have trouble waking an idle worker.

Last worker woken 48977 ms ago:
task:kworker/43:1 state:I stack:0 pid:631 tgid:631 ppid:2
Call Trace:
<stack trace>

Suggested-by: Petr Mladek <pml...@suse.com>
Signed-off-by: Breno Leitao <lei...@debian.org>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c

index b77119d71641a..f8b1741824117 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -217,6 +217,8 @@ struct worker_pool {

/* L: hash of busy workers */

struct worker *manager; /* L: purely informational */
+ struct worker *last_woken_worker; /* L: last worker woken by kick_pool() */

+ unsigned long last_woken_tstamp; /* L: timestamp of last kick_pool() wake */

struct list_head workers; /* A: attached workers */

struct ida worker_ida; /* worker IDs for task name */

@@ -1295,6 +1297,10 @@ static bool kick_pool(struct worker_pool *pool)

}
}
#endif
+ /* Track the last idle worker woken, used for stall diagnostics. */
+ pool->last_woken_worker = worker;

+ pool->last_woken_tstamp = jiffies;
+
wake_up_process(p);
return true;
}
@@ -2902,6 +2908,13 @@ static void set_worker_dying(struct worker *worker, struct list_head *list)

pool->nr_workers--;
pool->nr_idle--;

+ /*
+ * Clear last_woken_worker if it points to this worker, so that
+ * show_cpu_pool_busy_workers() cannot dereference a freed worker.

+ */

+ if (pool->last_woken_worker == worker)
+ pool->last_woken_worker = NULL;
+
worker->flags |= WORKER_DIE;

list_move(&worker->entry, list);

@@ -7582,20 +7595,59 @@ module_param_named(panic_on_stall_time, wq_panic_on_stall_time, uint, 0644);

+ pr_info("Last worker woken %lu ms ago:\n",
+ jiffies_to_msecs(jiffies - pool->last_woken_tstamp));

+ sched_show_task(pool->last_woken_worker->task);
+ } else {

+ pr_info("Missing info about the last woken worker.\n");

+ }
+ printk_deferred_exit();
+}
+

+/*
+ * Show running workers that might prevent the processing of pending work items.
+ * If no running worker is found, the pool may be stuck waiting for an idle
+ * worker to be woken, so report the pool state and the last woken worker.
*/
static void show_cpu_pool_busy_workers(struct worker_pool *pool)
{
struct worker *worker;
unsigned long irq_flags;
- int bkt;
+ bool found_running = false;
+ int cpu, bkt;

raw_spin_lock_irqsave(&pool->lock, irq_flags);

+ /* Snapshot cpu inside the lock to safely use it after unlock. */
+ cpu = pool->cpu;
+
hash_for_each(pool->busy_hash, bkt, worker, hentry) {
+ /* Skip workers that are not actively running on the CPU. */
+ if (!task_is_running(worker->task))
+ continue;
+
+ found_running = true;
/*
* Defer printing to avoid deadlocks in console

* drivers that queue work while holding locks
@@ -7609,7 +7661,23 @@ static void show_cpu_pool_busy_workers(struct worker_pool *pool)

Breno Leitao

unread,

Mar 20, 2026, 6:44:17 AMMar 20

to Petr Mladek, so...@kernel.org, Tejun Heo, Lai Jiangshan, Andrew Morton, linux-...@vger.kernel.org, Omar Sandoval, Song Liu, Danielle Costantino, kasa...@googlegroups.com, kerne...@meta.com

> > Showing backtraces of busy workers in stalled CPU-bound worker pools:
>
> I would expect that the stalled pool was shown by show_one_workqueue().
>
> show_one_workqueue() checks pwq->nr_active instead of
> list_empty(&pool->worklist). But my understanding is that work items
> added to pool->worklist should be counted by the related
> pwq->nr_active. In fact, pwq->nr_active seems to be decremented
> only when the work is proceed or removed from the queue. So that
> it should be counted as nr_active even when it is already in progress.
> As a result, show_one_workqueue() should print even pools which have
> the last assigned work in-flight.
>
> Maybe, I miss something. For example, the barriers are not counted
> as nr_active, ...

Chatting quickly to Song, he believed that we need a barrier in-between
adding the worklist and updating last_progress_ts, specifically, the
watchdog can see a non-empty worklist (from a list_add) while reading
a stale last_progress_ts value, causing a false positive stall report.
as well

Jiri Slaby

unread,

May 7, 2026, 6:20:40 AMMay 7

to Breno Leitao, Tejun Heo, Lai Jiangshan, Andrew Morton, linux-...@vger.kernel.org, Omar Sandoval, Song Liu, Danielle Costantino, kasa...@googlegroups.com, Petr Mladek, kerne...@meta.com

We see dumps from non-existent cpus on 7.0 like:
BUG: workqueue lockup - pool cpus=144 node=0 flags=0x4 nice=0 stuck
for 168224s!
...

Showing busy workqueues and worker pools:

workqueue rcu_gp: flags=0x108
pwq 578: cpus=144 node=0 flags=0x4 nice=0 active=3 refcnt=4
in:
https://bugzilla.suse.com/show_bug.cgi?id=1263947
?

Can this (or other patch from the series) cause this? Should there be
something like cpu_online() instead of task_is_running() somewhere?

> - /*
> - * Defer printing to avoid deadlocks in console
> - * drivers that queue work while holding locks
> - * also taken in their write paths.
> - */
> - printk_deferred_enter();
> + /*
> + * Defer printing to avoid deadlocks in console
> + * drivers that queue work while holding locks
> + * also taken in their write paths.
> + */
> + printk_deferred_enter();
>
> - pr_info("pool %d:\n", pool->id);
> - sched_show_task(worker->task);
> + pr_info("pool %d:\n", pool->id);
> + sched_show_task(worker->task);
>
> - printk_deferred_exit();
> - }
> + printk_deferred_exit();
> }
>
> raw_spin_unlock_irqrestore(&pool->lock, irq_flags);

thanks,
--
js
suse labs

Breno Leitao

unread,

May 7, 2026, 9:12:12 AMMay 7

to Jiri Slaby, Tejun Heo, Lai Jiangshan, Andrew Morton, linux-...@vger.kernel.org, Omar Sandoval, Song Liu, Danielle Costantino, kasa...@googlegroups.com, Petr Mladek, kerne...@meta.com

Hi Jiri,

On Thu, May 07, 2026 at 12:20:33PM +0200, Jiri Slaby wrote:
> On 05. 03. 26, 17:15, Breno Leitao wrote:
>
> BUG: workqueue lockup - pool cpus=144 node=0 flags=0x4 nice=0 stuck for
> 168224s!

That's an extremely long stall (~1.95 days).

> ...
> Showing busy workqueues and worker pools:
> workqueue rcu_gp: flags=0x108
> pwq 578: cpus=144 node=0 flags=0x4 nice=0 active=3 refcnt=4
> in:
> https://bugzilla.suse.com/show_bug.cgi?id=1263947
> ?
>
> Can this (or other patch from the series) cause this? Should there be
> something like cpu_online() instead of task_is_running() somewhere?

This series only affects stall reporting, not detection. The changes run
after the watchdog has identified a stall, so the detection logic itself
remains unchanged.

To help diagnose this issue, could you provide some additional information:

1) Was CPU 144 online at any point? If so, when was it taken offline?
2) Does this message appear repeatedly? If you bring CPU 144 online, does
the issue resolve?
3) Have you run similar tests on earlier kernel versions without seeing
this behavior, or is this a clear regression?

Thanks for the report,
--breno

Jiri Slaby

unread,

May 11, 2026, 1:21:34 AMMay 11

to Breno Leitao, Tejun Heo, Lai Jiangshan, Andrew Morton, linux-...@vger.kernel.org, Omar Sandoval, Song Liu, Danielle Costantino, kasa...@googlegroups.com, Petr Mladek, kerne...@meta.com

Hi,

we currently have several reports of this. On s390, ppc64, and x86_64.

On 07. 05. 26, 15:11, Breno Leitao wrote:
> Hi Jiri,
>
> On Thu, May 07, 2026 at 12:20:33PM +0200, Jiri Slaby wrote:
>> On 05. 03. 26, 17:15, Breno Leitao wrote:
>>
>> BUG: workqueue lockup - pool cpus=144 node=0 flags=0x4 nice=0 stuck for
>> 168224s!
>
> That's an extremely long stall (~1.95 days).
>
>> ...
>> Showing busy workqueues and worker pools:
>> workqueue rcu_gp: flags=0x108
>> pwq 578: cpus=144 node=0 flags=0x4 nice=0 active=3 refcnt=4
>> in:
>> https://bugzilla.suse.com/show_bug.cgi?id=1263947
>> ?
>>
>> Can this (or other patch from the series) cause this? Should there be
>> something like cpu_online() instead of task_is_running() somewhere?
>
> This series only affects stall reporting, not detection. The changes run
> after the watchdog has identified a stall, so the detection logic itself
> remains unchanged.
>
> To help diagnose this issue, could you provide some additional information:
>
> 1) Was CPU 144 online at any point? If so, when was it taken offline?

It was not, it's non-present.

> 2) Does this message appear repeatedly? If you bring CPU 144 online, does
> the issue resolve?

Yes, look at this new x86_64 report's dmesg (I believe it is related to
the above report):
BUG: workqueue lockup - pool cpus=2 node=0 flags=0x4 nice=0 stuck for
50s!
in:
https://bugzilla.suse.com/attachment.cgi?id=890229

$ grep -c BUG sl.txt
504
$ grep -c pwq sl.txt
509

It comes from:
https://bugzilla.suse.com/show_bug.cgi?id=1264554

> 3) Have you run similar tests on earlier kernel versions without seeing
> this behavior, or is this a clear regression?

It's new in 7.0. Going back to 6.19.12 makes it disappear.

Thorsten Leemhuis

unread,

May 13, 2026, 3:29:29 AMMay 13

to Jiri Slaby, Breno Leitao, Tejun Heo, Lai Jiangshan, Andrew Morton, linux-...@vger.kernel.org, Omar Sandoval, Song Liu, Danielle Costantino, kasa...@googlegroups.com, Petr Mladek, kerne...@meta.com, Linux kernel regressions list

On 5/11/26 07:21, Jiri Slaby wrote:
> we currently have several reports of this. On s390, ppc64, and x86_64.

I stumbled on this by accident and this is not my area of expertise, so
the following might be bogus:

Is this maybe the same as "Observed Workqueue lockups on offline CPUs.":
https://lore.kernel.org/lkml/97a7d011-d573-4754...@linux.ibm.com/

Fix is here:
https://lore.kernel.org/lkml/20260508174353....@kernel.org/

Ciao, Thorsten

Jiri Slaby

unread,

May 13, 2026, 4:04:03 AMMay 13

to Thorsten Leemhuis, Breno Leitao, Tejun Heo, Lai Jiangshan, Andrew Morton, linux-...@vger.kernel.org, Omar Sandoval, Song Liu, Danielle Costantino, kasa...@googlegroups.com, Petr Mladek, kerne...@meta.com, Linux kernel regressions list, Paul E. McKenney

On 13. 05. 26, 9:29, Thorsten Leemhuis wrote:
> On 5/11/26 07:21, Jiri Slaby wrote:
>> we currently have several reports of this. On s390, ppc64, and x86_64.
>
> I stumbled on this by accident and this is not my area of expertise, so
> the following might be bogus:
>
> Is this maybe the same as "Observed Workqueue lockups on offline CPUs.":
> https://lore.kernel.org/lkml/97a7d011-d573-4754...@linux.ibm.com/

Thanks, looks like pretty much it. All three reports have:
rcu: srcu_init: Setting srcu_struct sizes to big.

> Fix is here:
> https://lore.kernel.org/lkml/20260508174353....@kernel.org/

Building a kernel with this and serving to the reporters to test.

--
js
suse labs

Markus Elfring

unread,

May 13, 2026, 4:53:14 AMMay 13

to Breno Leitao, kasa...@googlegroups.com, kerne...@meta.com, Andrew Morton, Lai Jiangshan, Tejun Heo, LKML, Danielle Costantino, Omar Sandoval, Petr Mladek, Song Liu

> There is a blind spot exists in the work queue stall detecetor …

detector?

Regards,
Markus

Hillf Danton

unread,

May 13, 2026, 4:57:31 AMMay 13

to Breno Leitao, Petr Mladek, Tejun Heo, linux-...@vger.kernel.org, Omar Sandoval, Danielle Costantino, kasa...@googlegroups.com

An idle CPU failed to process pending work, so the root cause lies outside
workqueue, and it is difficult to understand why giving more X-ray scan
to Peter helps if Paul has a bone in throat.

Breno Leitao

unread,

Jun 11, 2026, 10:50:26 AMJun 11

to Petr Mladek, Tejun Heo, Lai Jiangshan, Andrew Morton, linux-...@vger.kernel.org, Omar Sandoval, Song Liu, Danielle Costantino, kasa...@googlegroups.com, kerne...@meta.com

On Fri, Mar 20, 2026 at 03:41:13AM -0700, Breno Leitao wrote:
> On Wed, Mar 18, 2026 at 04:11:54PM +0100, Petr Mladek wrote:
> > On Wed 2026-03-18 04:31:08, Breno Leitao wrote:
> > Otherwise, I like this patch.
> >
> > I still think what might be the reason that there is no worker
> > in the running state. Let's see if this patch brings some useful info.
> >
> > One more idea. It might be useful to store a timestamp when the last
> > worker was woken. And then print either the timestamp or delta.
> > It would help to make sure that kick_pool() was really called
> > during the reported stall.
>
> Ack, this is the following patch I will deploy in production, let's see
> how useful it is.

I got this running in production (backported to 6.16), and we finally got the culprit.

05:42:00 BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 115s!
NMI backtrace for cpu 2
CPU: 2 UID: 0 PID: 411 Comm: kworker/u288:2 Tainted: G O 6.16.1-0_fbk4_0_gb849430a436c #1 NONE
Tainted: [O]=OOT_MODULE
Hardware name: <foo>
Workqueue: efi_rts_wq efi_call_rts
pstate: 23401009 (nzCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
pc : 0x4052f10900
lr : 0x4052f10e94
sp : ffff800088cefc90
x29: ffff800088cefc90 x28: 0000000048524641 x27: 0000004052b60000
x26: 0000000000010058 x25: 0000004043ba0000 x24: 0000000001280000
x23: 000000405a02807f x22: 0000000000010080 x21: 0000004053ac0097
x20: 000000405a028080 x19: 0000004053ac0098 x18: 0000000000000000
x17: 0000000000000030 x16: 0000004052eb6de0 x15: 0000004042ba0030
x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000001
x11: 0000000001d00d09 x10: 0000004042ba0028 x9 : ffff800088cefc90
x8 : 0000000001d00cd9 x7 : 0000000000000000 x6 : 0000004043ba0000
x5 : 0000004043bb0000 x4 : 0000004053ac0098 x3 : 000000405a028080
x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffffffffffffe1e8
Call trace:
0x4052f10900 (P)
0x4052f10e94
0x4052b00ed0
0x4052b02e38
0x4052b0175c
0x4052b517b4
0x4052a70b84
0x4052cb11d4
__efi_rt_asm_wrapper+0x50/0x78
efi_call_rts+0x178/0x240
process_scheduled_works+0x17c/0x420
worker_thread+0x184/0x4d8
kthread+0xcc/0x1f8
ret_from_fork+0x10/0x20
05:42:30 BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 145s!
NMI backtrace for cpu 2
CPU: 2 UID: 0 PID: 411 Comm: kworker/u288:2 Tainted: G O 6.16.1-0_fbk4_0_gb849430a436c #1 NONE
Tainted: [O]=OOT_MODULE
Hardware name: <foo>
Workqueue: efi_rts_wq efi_call_rts
pstate: 63401009 (nZCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
pc : 0x4052f11ecc
lr : 0x4052f10b8c
sp : ffff800088cefc30
x29: ffff800088cefc40 x28: 0000000048524641 x27: 0000004052b60000
x26: 0000000000010058 x25: 0000004043fb0000 x24: 0000000001690000
x23: 0000004053ab0040 x22: 0000000000010080 x21: ffff800088cefd00

rinse and repeat..

Unfortunately I didn't get the other pr_info(), because of console settings,
but, I can say the following from this issue and previous code:

1) in show_cpu_pool_hog, found_running variable is set to false.
2) hash_for_each() never found any running task
3) The following code was trigger and was very helpful:

if (!found_running)
trigger_single_cpu_backtrace(cpu);

Petr Mladek

unread,

Jun 16, 2026, 8:38:17 AM (14 days ago) Jun 16

to Breno Leitao, Tejun Heo, Lai Jiangshan, Andrew Morton, linux-...@vger.kernel.org, Omar Sandoval, Song Liu, Danielle Costantino, kasa...@googlegroups.com, kerne...@meta.com

Great. So, the extra complexity was worth it. Should I clean it and
send a proper patch? Or would you like to do so?

Also I wonder whether it would make sense to revert the commit
8823eaef45da7f ("workqueue: Show all busy workers in stall
diagnostics"). If I get it correctly then printing all busy workers
was not that helpful. Namely, the sleeping workers should not prevent
progress.

Best Regards,
Petr

Breno Leitao

unread,

Jun 16, 2026, 8:44:27 AM (14 days ago) Jun 16

to Petr Mladek, Tejun Heo, Lai Jiangshan, Andrew Morton, linux-...@vger.kernel.org, Omar Sandoval, Song Liu, Danielle Costantino, kasa...@googlegroups.com, kerne...@meta.com

Hello Petr,

I am happy to do so. I had a chat with tejun yesterday, and I got the
impression he is OK with this patch as well.

> Also I wonder whether it would make sense to revert the commit
> 8823eaef45da7f ("workqueue: Show all busy workers in stall
> diagnostics").

This patch undo that change already. Would you like to see a full revert
or, are you OK with this patch reverting that behaviour? Maybe with
a fixes: tag?

> If I get it correctly then printing all busy workers
> was not that helpful. Namely, the sleeping workers should not prevent
> progress.

Ack!

Petr Mladek

unread,

Jun 16, 2026, 8:57:11 AM (14 days ago) Jun 16

to Breno Leitao, Tejun Heo, Lai Jiangshan, Andrew Morton, linux-...@vger.kernel.org, Omar Sandoval, Song Liu, Danielle Costantino, kasa...@googlegroups.com, kerne...@meta.com

On Tue 2026-06-16 05:44:12, Breno Leitao wrote:
> Hello Petr,
>
> On Tue, Jun 16, 2026 at 02:38:11PM +0200, Petr Mladek wrote:
> > On Thu 2026-06-11 07:50:04, Breno Leitao wrote:
> > >
> > > Unfortunately I didn't get the other pr_info(), because of console settings,
> > > but, I can say the following from this issue and previous code:
> > >
> > > 1) in show_cpu_pool_hog, found_running variable is set to false.
> > > 2) hash_for_each() never found any running task
> > > 3) The following code was trigger and was very helpful:
> > >
> > > if (!found_running)
> > > trigger_single_cpu_backtrace(cpu);
> >
> > Great. So, the extra complexity was worth it. Should I clean it and
> > send a proper patch? Or would you like to do so?
>
> I am happy to do so. I had a chat with tejun yesterday, and I got the
> impression he is OK with this patch as well.

Great!

> > Also I wonder whether it would make sense to revert the commit
> > 8823eaef45da7f ("workqueue: Show all busy workers in stall
> > diagnostics").
>
> This patch undo that change already. Would you like to see a full revert
> or, are you OK with this patch reverting that behaviour? Maybe with
> a fixes: tag?

I do not mind how exactly it is done. Reworking the behavior is
perfectly fine. And the fixes tag would make sense when it was
not a full revert.

Best Regards,
Petr

Breno Leitao

unread,

Jun 16, 2026, 12:48:18 PM (14 days ago) Jun 16

to Petr Mladek, Tejun Heo, Lai Jiangshan, Andrew Morton, linux-...@vger.kernel.org, Omar Sandoval, Song Liu, Danielle Costantino, kasa...@googlegroups.com, kerne...@meta.com

Thanks. I've sent an RFC patchset for us to discuss it there.

https://lore.kernel.org/all/20260616-wq_dump_pe...@debian.org/

Thanks for your help so far,
--breno

Reply all

Reply to author

Forward