
[PATCH 1/1] percpu-refcount: do not forget to rcu_barrier() just before freeing


Roman Pen

Aug 5, 2016, 1:50:06 PM
percpu-refcount issues RCU callbacks to synchronize its state, so before
freeing we have to wait for all those callbacks to finish.

E.g. the following simple sequence on the stack causes a nasty crash:

struct percpu_ref ref;

percpu_ref_init(&ref, release, 0, GFP_KERNEL);
percpu_ref_kill(&ref);
percpu_ref_exit(&ref);

This patch also initializes the confirm_switch callback to NULL. Without
this initialization you have to zero out the whole structure, or the
kernel frightfully complains with WARN_ON_ONCE(ref->confirm_switch) at
__percpu_ref_switch_to_atomic().
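The required ordering can be sketched with a userspace model (Python, purely illustrative, not kernel code: the worker thread stands in for the RCU callback machinery, call() for call_rcu_sched(), and barrier() for rcu_barrier_sched(); all class and method names here are invented):

```python
import threading
import queue

class DeferredCallbacks:
    """Stands in for the RCU callback machinery (invented name)."""
    def __init__(self):
        self._q = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            cb = self._q.get()   # run callbacks in order, like RCU grace periods
            cb()
            self._q.task_done()

    def call(self, cb):          # schedule a callback, like call_rcu_sched()
        self._q.put(cb)

    def barrier(self):           # wait for pending callbacks, like rcu_barrier_sched()
        self._q.join()

rcu = DeferredCallbacks()

class PercpuRef:
    def __init__(self):
        self.percpu_count = [0]  # stands in for the percpu allocation

    def kill(self):
        # The switch to atomic mode finishes in a deferred callback
        # that still dereferences the percpu storage.
        def switch_to_atomic():
            self.percpu_count[0] = 0
        rcu.call(switch_to_atomic)

    def exit(self):
        rcu.barrier()            # without this, "freeing" races with the callback
        self.percpu_count = None # free the backing storage

ref = PercpuRef()
ref.kill()
ref.exit()                       # safe: the barrier drained the callback first
```

Dropping the barrier() call in exit() lets the deferred callback dereference freed (here: None) storage, which is the userspace analogue of the crash above.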

Signed-off-by: Roman Pen <roman....@profitbricks.com>
Cc: Tejun Heo <t...@kernel.org>
Cc: linux-...@vger.kernel.org
---
lib/percpu-refcount.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 6111bcb..ddf934b 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -82,6 +82,7 @@ int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release,
 	atomic_long_set(&ref->count, start_count);
 
 	ref->release = release;
+	ref->confirm_switch = NULL;
 	return 0;
 }
 EXPORT_SYMBOL_GPL(percpu_ref_init);
@@ -101,6 +102,7 @@ void percpu_ref_exit(struct percpu_ref *ref)
 	unsigned long __percpu *percpu_count = percpu_count_ptr(ref);
 
 	if (percpu_count) {
+		rcu_barrier_sched();
 		free_percpu(percpu_count);
 		ref->percpu_count_ptr = __PERCPU_REF_ATOMIC_DEAD;
 	}
--
2.9.0

Roman Pen

Aug 5, 2016, 1:50:06 PM
A long time ago a similar fix was proposed by Akinobu Mita[1],
but it seems that at the time everyone decided to fix this subtle
race inside percpu-refcount itself, and Tejun Heo[2] made an attempt
(as far as I can see, that patchset was not applied).

The following is a description of a queue hang: the same fix, but a bug
seen from another angle.

The hang happens on queue freeze because of simultaneous calls of
blk_mq_freeze_queue() and blk_mq_unfreeze_queue() from different threads,
and because of the reference race percpu_ref_reinit() and percpu_ref_kill()
can be swapped:

CPU#0                            CPU#1
----------------                 -----------------
percpu_ref_kill()

                                 percpu_ref_kill()    << atomic reference does not
                                 percpu_ref_reinit()  << guarantee the order

blk_mq_freeze_queue_wait()       << HANG HERE

percpu_ref_reinit()

Firstly this wrong sequence raises two kernel warnings:

1st. WARNING at lib/percpu-refcount.c:309
percpu_ref_kill_and_confirm called more than once

2nd. WARNING at lib/percpu-refcount.c:331

But the most unpleasant effect is a hang in blk_mq_freeze_queue_wait(),
which waits for q_usage_counter to reach zero; that never happens,
because the percpu-ref was reinited (instead of being killed) and stays
in PERCPU mode forever.

The simplified sequence above can be reproduced on shared tags, when one
queue is going to die while another one is initializing:

CPU#0                                CPU#1
-------------------------------      ------------------------------------
q1 = blk_mq_init_queue(shared_tags)

                                     q2 = blk_mq_init_queue(shared_tags):
                                       blk_mq_add_queue_tag_set(shared_tags):
                                         blk_mq_update_tag_set_depth(shared_tags):
                                           blk_mq_freeze_queue(q1)
blk_cleanup_queue(q1)                        ...
  blk_mq_freeze_queue(q1)  <<<->>>         blk_mq_unfreeze_queue(q1)

[1] Message id: 1443287365-4244-7-git-...@gmail.com
[2] Message id: 1443563240-29306-6...@kernel.org
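The pairing that the mutex restores can be modeled in userspace (a Python sketch, purely illustrative; ModelQueue and its methods are invented names, not kernel API): holding the lock across both the depth change and the kill/reinit keeps kill on every 0->1 transition and reinit on every 1->0 transition, even under concurrent freeze/unfreeze:

```python
import threading

class ModelQueue:
    """Userspace stand-in for request_queue (invented name)."""
    def __init__(self):
        self.mq_freeze_depth = 0
        self.mq_freeze_lock = threading.Lock()
        self.killed = False              # stands in for the q_usage_counter mode

    def freeze_queue_start(self):
        with self.mq_freeze_lock:        # models q->mq_freeze_lock
            self.mq_freeze_depth += 1
            if self.mq_freeze_depth == 1:
                assert not self.killed   # kill must pair with the 0->1 transition
                self.killed = True       # models percpu_ref_kill()

    def unfreeze_queue(self):
        with self.mq_freeze_lock:
            self.mq_freeze_depth -= 1
            assert self.mq_freeze_depth >= 0
            if self.mq_freeze_depth == 0:
                assert self.killed       # reinit must pair with a prior kill
                self.killed = False      # models percpu_ref_reinit()

q = ModelQueue()

def hammer():
    for _ in range(20000):
        q.freeze_queue_start()
        q.unfreeze_queue()

threads = [threading.Thread(target=hammer) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# No assertion fired: kill/reinit stayed correctly ordered.
```

With the lock removed from around the depth change and the kill/reinit, the internal assertions can trip under contention, which mirrors the kernel warnings quoted above.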

Signed-off-by: Roman Pen <roman....@profitbricks.com>
Cc: Akinobu Mita <akinob...@gmail.com>
Cc: Tejun Heo <t...@kernel.org>
Cc: Jens Axboe <ax...@kernel.dk>
Cc: Christoph Hellwig <h...@lst.de>
Cc: linux...@vger.kernel.org
Cc: linux-...@vger.kernel.org
---
block/blk-core.c | 1 +
block/blk-mq.c | 22 +++++++++++-----------
include/linux/blkdev.h | 7 ++++++-
3 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index ef78848..01dcb02 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -740,6 +740,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	__set_bit(QUEUE_FLAG_BYPASS, &q->queue_flags);
 
 	init_waitqueue_head(&q->mq_freeze_wq);
+	mutex_init(&q->mq_freeze_lock);
 
 	/*
 	 * Init percpu_ref in atomic mode so that it's faster to shutdown.
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6d6f8fe..1f3e81b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -80,13 +80,13 @@ static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
 
 void blk_mq_freeze_queue_start(struct request_queue *q)
 {
-	int freeze_depth;
-
-	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
-	if (freeze_depth == 1) {
+	mutex_lock(&q->mq_freeze_lock);
+	if (++q->mq_freeze_depth == 1) {
 		percpu_ref_kill(&q->q_usage_counter);
+		mutex_unlock(&q->mq_freeze_lock);
 		blk_mq_run_hw_queues(q, false);
-	}
+	} else
+		mutex_unlock(&q->mq_freeze_lock);
 }
 EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_start);
EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_start);

@@ -124,14 +124,14 @@ EXPORT_SYMBOL_GPL(blk_mq_freeze_queue);
 
 void blk_mq_unfreeze_queue(struct request_queue *q)
 {
-	int freeze_depth;
-
-	freeze_depth = atomic_dec_return(&q->mq_freeze_depth);
-	WARN_ON_ONCE(freeze_depth < 0);
-	if (!freeze_depth) {
+	mutex_lock(&q->mq_freeze_lock);
+	q->mq_freeze_depth--;
+	WARN_ON_ONCE(q->mq_freeze_depth < 0);
+	if (!q->mq_freeze_depth) {
 		percpu_ref_reinit(&q->q_usage_counter);
 		wake_up_all(&q->mq_freeze_wq);
 	}
+	mutex_unlock(&q->mq_freeze_lock);
 }
 EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);

@@ -2105,7 +2105,7 @@ void blk_mq_free_queue(struct request_queue *q)
 static void blk_mq_queue_reinit(struct request_queue *q,
 				const struct cpumask *online_mask)
 {
-	WARN_ON_ONCE(!atomic_read(&q->mq_freeze_depth));
+	WARN_ON_ONCE(!q->mq_freeze_depth);
 
 	blk_mq_sysfs_unregister(q);

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f6ff9d1..d692c16 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -445,7 +445,7 @@ struct request_queue {
 	struct mutex		sysfs_lock;
 
 	int			bypass_depth;
-	atomic_t		mq_freeze_depth;
+	int			mq_freeze_depth;
 
 #if defined(CONFIG_BLK_DEV_BSG)
 	bsg_job_fn		*bsg_job_fn;
@@ -459,6 +459,11 @@ struct request_queue {
 #endif
 	struct rcu_head		rcu_head;
 	wait_queue_head_t	mq_freeze_wq;
+	/*
+	 * Protect concurrent access to q_usage_counter by
+	 * percpu_ref_kill() and percpu_ref_reinit().
+	 */
+	struct mutex		mq_freeze_lock;
 	struct percpu_ref	q_usage_counter;
 	struct list_head	all_q_node;

--
2.9.0

Roman Pen

Aug 8, 2016, 7:40:05 AM
A long time ago a similar fix was proposed by Akinobu Mita[1],
but it seems that at the time everyone decided to fix this subtle
race inside percpu-refcount itself, and Tejun Heo[2] made an attempt
(as far as I can see, that patchset was not applied).

The following is a description of a hang in blk_mq_freeze_queue_wait():
the same fix, but a bug seen from another angle.

The hang happens on an attempt to freeze a queue while another task
does a queue unfreeze.

The root cause is an incorrect sequence of percpu_ref_reinit() and
percpu_ref_kill(); as a result those two can be swapped:

CPU#0                            CPU#1
----------------                 -----------------
percpu_ref_kill()

                                 percpu_ref_kill()    << atomic reference does
                                 percpu_ref_reinit()  << not guarantee the order

blk_mq_freeze_queue_wait()       << HANG HERE

percpu_ref_reinit()

Firstly this wrong sequence raises two kernel warnings:

1st. WARNING at lib/percpu-refcount.c:309
percpu_ref_kill_and_confirm called more than once

2nd. WARNING at lib/percpu-refcount.c:331

But the most unpleasant effect is a hang in blk_mq_freeze_queue_wait(),
which waits for q_usage_counter to reach zero; that never happens,
because the percpu-ref was reinited (instead of being killed) and stays
in PERCPU mode forever.

The simplified sequence above can be reproduced on shared tags, when
queue A is going to die while another queue B is in its init state and
is trying to freeze queue A, which shares the same tag set:

CPU#0                                CPU#1
-------------------------------      ------------------------------------
q1 = blk_mq_init_queue(shared_tags)

                                     q2 = blk_mq_init_queue(shared_tags):
                                       blk_mq_add_queue_tag_set(shared_tags):
                                         blk_mq_update_tag_set_depth(shared_tags):
                                           blk_mq_freeze_queue(q1)
blk_cleanup_queue(q1)                        ...
  blk_mq_freeze_queue(q1)  <<<->>>         blk_mq_unfreeze_queue(q1)

[1] Message id: 1443287365-4244-7-git-...@gmail.com
[2] Message id: 1443563240-29306-6...@kernel.org

Signed-off-by: Roman Pen <roman....@profitbricks.com>
Cc: Akinobu Mita <akinob...@gmail.com>
Cc: Tejun Heo <t...@kernel.org>
Cc: Jens Axboe <ax...@kernel.dk>
Cc: Christoph Hellwig <h...@lst.de>
Cc: linux...@vger.kernel.org
Cc: linux-...@vger.kernel.org
---
v2:
- forgotten hunk from local repo
- minor tweaks in the commit message

block/blk-core.c | 3 ++-
block/blk-mq.c | 22 +++++++++++-----------
include/linux/blkdev.h | 7 ++++++-
3 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index ef78848..4fd27e9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -658,7 +658,7 @@ int blk_queue_enter(struct request_queue *q, gfp_t gfp)
 		return -EBUSY;
 
 	ret = wait_event_interruptible(q->mq_freeze_wq,
-			!atomic_read(&q->mq_freeze_depth) ||
+			!q->mq_freeze_depth ||
 			blk_queue_dying(q));
 	if (blk_queue_dying(q))
 		return -ENODEV;

Tejun Heo

Aug 10, 2016, 12:00:06 AM
Hello,

On Mon, Aug 08, 2016 at 01:39:08PM +0200, Roman Pen wrote:
> A long time ago a similar fix was proposed by Akinobu Mita[1],
> but it seems that at the time everyone decided to fix this subtle
> race inside percpu-refcount itself, and Tejun Heo[2] made an attempt
> (as far as I can see, that patchset was not applied).

So, I probably forgot about it while waiting for confirmation of the
fix. Can you please verify that the patchset fixes the issue? I can
apply the patchset right away.

Thanks.

--
tejun

Tejun Heo

Aug 10, 2016, 3:00:06 PM
On Wed, Aug 10, 2016 at 10:42:09AM +0200, Roman Penyaev wrote:
> Hi,
> I have not checked your patchset but according to my understanding
> it should not fix *this* issue. What happens here is a wrong order
> of invocation of percpu_ref_reinit() and percpu_ref_kill(). So what
> was observed is the following:

Ah, understood.

Acked-by: Tejun Heo <t...@kernel.org>

I'll commit the percpu_refcnt patches too. While they don't fix the
problem on their own, the changes are generally useful for all mode
switching use cases.

Thanks.

--
tejun

Roman Penyaev

Aug 10, 2016, 3:40:11 PM
On Wed, Aug 10, 2016 at 10:42 AM, Roman Penyaev
<roman....@profitbricks.com> wrote:
> Hi,
>
> On Wed, Aug 10, 2016 at 5:55 AM, Tejun Heo <t...@kernel.org> wrote:
> I have not checked your patchset but according to my understanding
> it should not fix *this* issue.

So, your patchset does not help with this issue (though for sure it
helps to keep internal percpu-refcount members consistent). That's the
backtrace I observe:

Call Trace:
[<ffffffff810ba8df>] ? vprintk_default+0x1f/0x30
[<ffffffff816a47f5>] schedule+0x35/0x80
[<ffffffff81336154>] blk_mq_freeze_queue_wait+0x124/0x1a0
[<ffffffff810a3f70>] ? wake_atomic_t_function+0x60/0x60
[<ffffffff8133821a>] blk_mq_freeze_queue+0x1a/0x20
[<ffffffff8133822e>] blk_freeze_queue+0xe/0x10
[<ffffffff81329cc2>] blk_cleanup_queue+0xe2/0x280

To ease reproduction I do the following:

-------------------------------------------
static int thread_fn(void *data)
{
	struct blk_mq_tag_set *tags = data;
	struct request_queue *q;

	while (!kthread_should_stop()) {
		q = blk_mq_init_queue(tags);
		BUG_ON(q == NULL);
		/*
		 * That is done by blk_register_queue(), but here
		 * we are reproducing blk-mq bug and do not require
		 * gendisk and friends.  Just silently switch to percpu.
		 */
		percpu_ref_switch_to_percpu(&q->q_usage_counter);

		msleep(prandom_u32_max(10));
		blk_cleanup_queue(q);
	}

	return 0;
}
-------------------------------------------

o Start 2 threads (exactly 2; we need 2 queues for 1 shared tag set)
o Pass same shared tags pointer for each thread
o Wait
o PROFIT

To make reproduction immediate, this hunk can be applied:

@@ -129,6 +142,7 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
 	freeze_depth = atomic_dec_return(&q->mq_freeze_depth);
 	WARN_ON_ONCE(freeze_depth < 0);
 	if (!freeze_depth) {
+		msleep(1000);
 		percpu_ref_reinit(&q->q_usage_counter);
 		wake_up_all(&q->mq_freeze_wq);
 	}

--
Roman

Roman Pen

Aug 10, 2016, 4:00:08 PM
percpu-refcount issues RCU callbacks to synchronize its state, so before
freeing we have to wait for all those callbacks to finish.

E.g. the following simple sequence on the stack causes a nasty crash:

struct percpu_ref ref;

percpu_ref_init(&ref, release, 0, GFP_KERNEL);
percpu_ref_kill(&ref);
percpu_ref_exit(&ref);

This patch also initializes the confirm_switch callback to NULL. Without
this initialization you have to zero out the whole structure, or the
kernel frightfully complains with WARN_ON_ONCE(ref->confirm_switch) at
__percpu_ref_switch_to_atomic().

Signed-off-by: Roman Pen <roman....@profitbricks.com>
Cc: Tejun Heo <t...@kernel.org>
Cc: linux-...@vger.kernel.org
---
v2:
- just sending one more time because v1 was accidentally
  sent as a reply to an unrelated issue

Roman Penyaev

Aug 10, 2016, 4:10:08 PM
Hi,

On Wed, Aug 10, 2016 at 5:55 AM, Tejun Heo <t...@kernel.org> wrote:
I have not checked your patchset but according to my understanding
it should not fix *this* issue. What happens here is a wrong order
of invocation of percpu_ref_reinit() and percpu_ref_kill(). So what
was observed is the following:

CPU#0                            CPU#1
----------------                 -----------------
percpu_ref_kill()

                                 percpu_ref_kill()    << atomic reference does
                                 percpu_ref_reinit()  << not guarantee the order

blk_mq_freeze_queue_wait()       !! HANG HERE

percpu_ref_reinit()
blk_mq_freeze_queue_wait() on one CPU expects percpu-refcount to be
switched to ATOMIC mode (killed), but that does not happen, because
the other CPU was faster and switched percpu-refcount back to PERCPU
mode.

This race happens inside blk-mq because the invocation of kill/reinit
is controlled by the reference counter, which does not guarantee the
order of the subsequent function calls (kill/reinit).

So the fix is the same as originally proposed by Akinobu Mita, but
the issue is different.
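The swapped ordering from the diagram can be replayed deterministically in a toy model (Python, purely illustrative; ModelRef is an invented name, not the kernel API): the ref operations arrive as kill, kill, reinit, which raises the "called more than once" warning and leaves the ref in PERCPU mode while the waiter expects it killed:

```python
class ModelRef:
    """Toy stand-in for percpu_ref mode switching (invented name)."""
    def __init__(self):
        self.mode = "PERCPU"
        self.warnings = []

    def kill(self):
        if self.mode == "ATOMIC":
            # models the WARN for percpu_ref_kill_and_confirm
            self.warnings.append("kill called more than once")
        self.mode = "ATOMIC"

    def reinit(self):
        self.mode = "PERCPU"

ref = ModelRef()
# Deterministic replay of the interleaving in the diagram:
ref.kill()       # first freeze crosses 0->1, kill runs
ref.kill()       # second freeze's kill runs late -> warning fires
ref.reinit()     # the racing unfreeze reinits too early
# The first CPU now sits in its wait: the ref is back in PERCPU mode,
# the counter never reaches zero -> hang.
```

The same replay against the mutex-serialized version cannot occur, because the depth change and the matching ref operation happen atomically with respect to each other.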

But of course I can run tests on top of your series, just to verify
that everything goes smoothly and that percpu-refcount's internal
members stay consistent.

--
Roman

Tejun Heo

Aug 10, 2016, 6:10:05 PM
On Wed, Aug 10, 2016 at 09:55:39PM +0200, Roman Pen wrote:
> percpu-refcount issues RCU callbacks to synchronize its state, so before
> freeing we have to wait for all those callbacks to finish.
>
> E.g. the following simple sequence on the stack causes a nasty crash:
>
> struct percpu_ref ref;
>
> percpu_ref_init(&ref, release, 0, GFP_KERNEL);
> percpu_ref_kill(&ref);
> percpu_ref_exit(&ref);

Hmmm... that's just an illegal sequence of operations. You can't exit
a ref which hasn't completed killing yet (the kill callback hasn't
been called).

Thanks.

--
tejun

Roman Penyaev

Aug 11, 2016, 5:10:05 AM
Yes, exactly, this is an illegal operation. But it is no more illegal
than calling kill() twice or reinit() when the counter has not yet
reached zero. And those illegal uses are covered with warnings, which
can be observed for example with this freeze/unfreeze blk-mq bug.

So what I want to say is that bugs exist above percpu-ref and can
easily trigger an illegal sequence, and a nasty crash is not a good
way to tell someone they made a mistake.

But of course, that is very minor and was discovered by my stupidity
and by tests which use percpu-ref in a non-kosher way :)


--
Roman

Tejun Heo

Aug 11, 2016, 12:30:30 PM
Hello, Roman.

On Thu, Aug 11, 2016 at 11:07:14AM +0200, Roman Penyaev wrote:
> Yes, exactly, this is an illegal operation. But it is no more illegal
> than calling kill() twice or reinit() when the counter has not yet
> reached zero. And those illegal uses are covered with warnings, which
> can be observed for example with this freeze/unfreeze blk-mq bug.

I have no objection to adding warnings for these conditions; however,
adding an RCU barrier to mask illegal usage is a very different thing.
It adds a lot of unnecessary latency to the exit function and makes it
unusable from non-sleepable contexts.

Thanks.

--
tejun

Roman Penyaev

Aug 11, 2016, 1:40:05 PM
Hi,
Yes, for sure that makes sense. I changed the patch a little and
have sent it.

--
Roman

王金浦

Aug 30, 2016, 5:40:05 AM
Hi Jens,

I didn't see this patch in your tree; what's the blocker?

Thanks,
Jinpu