
[PATCH 0/2] blk-mq: dying queue fix & improvement


Ming Lei

Mar 9, 2017, 8:10:07 AM
Hi,

The 1st patch fixes one race between timeout and dying queue.

The 2nd patch improves handling for dying queue.

thanks,
Ming


Ming Lei (2):
blk-mq: don't complete un-started request in timeout handler
blk-mq: start to freeze queue just after setting dying

block/blk-core.c | 7 +++++--
block/blk-mq.c | 11 +----------
2 files changed, 6 insertions(+), 12 deletions(-)

--
2.7.4

Ming Lei

Mar 9, 2017, 8:20:04 AM
Before commit 780db2071a ("blk-mq: decouble blk-mq freezing
from generic bypassing"), the dying flag was checked before
entering the queue. Tejun converted that check into .mq_freeze_depth,
assuming the counter is increased just after the dying flag
is set. Unfortunately we don't do that in blk_set_queue_dying().

This patch calls blk_mq_freeze_queue_start() for blk-mq in
blk_set_queue_dying(), so that we can block new I/O from entering
the queue once it is set as dying.

Given that blk_set_queue_dying() is always called in the remove path
of a block device, and the queue will be cleaned up later, we don't
need to worry about undoing the counter.

Cc: Tejun Heo <t...@kernel.org>
Signed-off-by: Ming Lei <tom.l...@gmail.com>
---
block/blk-core.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 0eeb99ef654f..559487e58296 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -500,9 +500,12 @@ void blk_set_queue_dying(struct request_queue *q)
 	queue_flag_set(QUEUE_FLAG_DYING, q);
 	spin_unlock_irq(q->queue_lock);
 
-	if (q->mq_ops)
+	if (q->mq_ops) {
 		blk_mq_wake_waiters(q);
-	else {
+
+		/* block new I/O coming */
+		blk_mq_freeze_queue_start(q);
+	} else {
 		struct request_list *rl;
 
 		spin_lock_irq(q->queue_lock);
--
2.7.4

Ming Lei

Mar 9, 2017, 8:20:05 AM
When iterating busy requests in the timeout handler,
if the STARTED flag of a request isn't set, that means
the request is still being processed in the block layer and not
dispatched to the low-level driver yet.

In the current implementation of blk_mq_check_expired(),
in case the request queue becomes dying, un-started
requests are completed immediately. This can cause a
use-after-free issue[1][2] when doing I/O and removing & resetting
an NVMe device.

This patch fixes several issues reported by Yi Zhang.

[1]. oops log 1
[ 581.789754] ------------[ cut here ]------------
[ 581.789758] kernel BUG at block/blk-mq.c:374!
[ 581.789760] invalid opcode: 0000 [#1] SMP
[ 581.789761] Modules linked in: vfat fat ipmi_ssif intel_rapl sb_edac
edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm nvme
irqbypass crct10dif_pclmul nvme_core crc32_pclmul ghash_clmulni_intel
intel_cstate ipmi_si mei_me ipmi_devintf intel_uncore sg ipmi_msghandler
intel_rapl_perf iTCO_wdt mei iTCO_vendor_support mxm_wmi lpc_ich dcdbas shpchp
pcspkr acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd dm_multipath grace
sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper
syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci
crc32c_intel tg3 libata megaraid_sas i2c_core ptp fjes pps_core dm_mirror
dm_region_hash dm_log dm_mod
[ 581.789796] CPU: 1 PID: 1617 Comm: kworker/1:1H Not tainted 4.10.0.bz1420297+ #4
[ 581.789797] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
[ 581.789804] Workqueue: kblockd blk_mq_timeout_work
[ 581.789806] task: ffff8804721c8000 task.stack: ffffc90006ee4000
[ 581.789809] RIP: 0010:blk_mq_end_request+0x58/0x70
[ 581.789810] RSP: 0018:ffffc90006ee7d50 EFLAGS: 00010202
[ 581.789811] RAX: 0000000000000001 RBX: ffff8802e4195340 RCX: ffff88028e2f4b88
[ 581.789812] RDX: 0000000000001000 RSI: 0000000000001000 RDI: 0000000000000000
[ 581.789813] RBP: ffffc90006ee7d60 R08: 0000000000000003 R09: ffff88028e2f4b00
[ 581.789814] R10: 0000000000001000 R11: 0000000000000001 R12: 00000000fffffffb
[ 581.789815] R13: ffff88042abe5780 R14: 000000000000002d R15: ffff88046fbdff80
[ 581.789817] FS: 0000000000000000(0000) GS:ffff88047fc00000(0000) knlGS:0000000000000000
[ 581.789818] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 581.789819] CR2: 00007f64f403a008 CR3: 000000014d078000 CR4: 00000000001406e0
[ 581.789820] Call Trace:
[ 581.789825] blk_mq_check_expired+0x76/0x80
[ 581.789828] bt_iter+0x45/0x50
[ 581.789830] blk_mq_queue_tag_busy_iter+0xdd/0x1f0
[ 581.789832] ? blk_mq_rq_timed_out+0x70/0x70
[ 581.789833] ? blk_mq_rq_timed_out+0x70/0x70
[ 581.789840] ? __switch_to+0x140/0x450
[ 581.789841] blk_mq_timeout_work+0x88/0x170
[ 581.789845] process_one_work+0x165/0x410
[ 581.789847] worker_thread+0x137/0x4c0
[ 581.789851] kthread+0x101/0x140
[ 581.789853] ? rescuer_thread+0x3b0/0x3b0
[ 581.789855] ? kthread_park+0x90/0x90
[ 581.789860] ret_from_fork+0x2c/0x40
[ 581.789861] Code: 48 85 c0 74 0d 44 89 e6 48 89 df ff d0 5b 41 5c 5d c3 48
8b bb 70 01 00 00 48 85 ff 75 0f 48 89 df e8 7d f0 ff ff 5b 41 5c 5d c3 <0f>
0b e8 71 f0 ff ff 90 eb e9 0f 1f 40 00 66 2e 0f 1f 84 00 00
[ 581.789882] RIP: blk_mq_end_request+0x58/0x70 RSP: ffffc90006ee7d50
[ 581.789889] ---[ end trace bcaf03d9a14a0a70 ]---

[2]. oops log2
[ 6984.857362] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[ 6984.857372] IP: nvme_queue_rq+0x6e6/0x8cd [nvme]
[ 6984.857373] PGD 0
[ 6984.857374]
[ 6984.857376] Oops: 0000 [#1] SMP
[ 6984.857379] Modules linked in: ipmi_ssif vfat fat intel_rapl sb_edac
edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_si iTCO_wdt
iTCO_vendor_support mxm_wmi ipmi_devintf intel_cstate sg dcdbas intel_uncore
mei_me intel_rapl_perf mei pcspkr lpc_ich ipmi_msghandler shpchp
acpi_power_meter wmi nfsd auth_rpcgss dm_multipath nfs_acl lockd grace sunrpc
ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea
sysfillrect crc32c_intel sysimgblt fb_sys_fops ttm nvme drm nvme_core ahci
libahci i2c_core tg3 libata ptp megaraid_sas pps_core fjes dm_mirror
dm_region_hash dm_log dm_mod
[ 6984.857416] CPU: 7 PID: 1635 Comm: kworker/7:1H Not tainted
4.10.0-2.el7.bz1420297.x86_64 #1
[ 6984.857417] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
[ 6984.857427] Workqueue: kblockd blk_mq_run_work_fn
[ 6984.857429] task: ffff880476e3da00 task.stack: ffffc90002e90000
[ 6984.857432] RIP: 0010:nvme_queue_rq+0x6e6/0x8cd [nvme]
[ 6984.857433] RSP: 0018:ffffc90002e93c50 EFLAGS: 00010246
[ 6984.857434] RAX: 0000000000000000 RBX: ffff880275646600 RCX: 0000000000001000
[ 6984.857435] RDX: 0000000000000fff RSI: 00000002fba2a000 RDI: ffff8804734e6950
[ 6984.857436] RBP: ffffc90002e93d30 R08: 0000000000002000 R09: 0000000000001000
[ 6984.857437] R10: 0000000000001000 R11: 0000000000000000 R12: ffff8804741d8000
[ 6984.857438] R13: 0000000000000040 R14: ffff880475649f80 R15: ffff8804734e6780
[ 6984.857439] FS: 0000000000000000(0000) GS:ffff88047fcc0000(0000) knlGS:0000000000000000
[ 6984.857440] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6984.857442] CR2: 0000000000000010 CR3: 0000000001c09000 CR4: 00000000001406e0
[ 6984.857443] Call Trace:
[ 6984.857451] ? mempool_free+0x2b/0x80
[ 6984.857455] ? bio_free+0x4e/0x60
[ 6984.857459] blk_mq_dispatch_rq_list+0xf5/0x230
[ 6984.857462] blk_mq_process_rq_list+0x133/0x170
[ 6984.857465] __blk_mq_run_hw_queue+0x8c/0xa0
[ 6984.857467] blk_mq_run_work_fn+0x12/0x20
[ 6984.857473] process_one_work+0x165/0x410
[ 6984.857475] worker_thread+0x137/0x4c0
[ 6984.857478] kthread+0x101/0x140
[ 6984.857480] ? rescuer_thread+0x3b0/0x3b0
[ 6984.857481] ? kthread_park+0x90/0x90
[ 6984.857489] ret_from_fork+0x2c/0x40
[ 6984.857490] Code: 8b bd 70 ff ff ff 89 95 50 ff ff ff 89 8d 58 ff ff ff 44
89 95 60 ff ff ff e8 b7 dd 12 e1 8b 95 50 ff ff ff 48 89 85 68 ff ff ff <4c>
8b 48 10 44 8b 58 18 8b 8d 58 ff ff ff 44 8b 95 60 ff ff ff
[ 6984.857511] RIP: nvme_queue_rq+0x6e6/0x8cd [nvme] RSP: ffffc90002e93c50
[ 6984.857512] CR2: 0000000000000010
[ 6984.895359] ---[ end trace 2d7ceb528432bf83 ]---

Cc: sta...@vger.kernel.org
Reported-by: Yi Zhang <yiz...@redhat.com>
Signed-off-by: Ming Lei <tom.l...@gmail.com>
---
block/blk-mq.c | 11 +----------
1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 159187a28d66..0aff380099d5 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -697,17 +697,8 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 {
 	struct blk_mq_timeout_data *data = priv;
 
-	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
-		/*
-		 * If a request wasn't started before the queue was
-		 * marked dying, kill it here or it'll go unnoticed.
-		 */
-		if (unlikely(blk_queue_dying(rq->q))) {
-			rq->errors = -EIO;
-			blk_mq_end_request(rq, rq->errors);
-		}
+	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
 		return;
-	}
 
 	if (time_after_eq(jiffies, rq->deadline)) {
 		if (!blk_mark_rq_complete(rq))
--
2.7.4

Ming Lei

Mar 9, 2017, 9:00:07 AM

Ming Lei

Mar 9, 2017, 9:20:04 PM
On Fri, Mar 10, 2017 at 12:58 AM, Bart Van Assche
<Bart.Va...@sandisk.com> wrote:
> The comment above blk_mq_freeze_queue_start() should explain more clearly
> why that call is needed. Additionally, I think this patch makes the

The "block new I/O coming" comment has been added; let me know what
else is needed, :-)

> blk_freeze_queue() call in blk_cleanup_queue() superfluous. How about the
> (entirely untested) patch below?

I don't think we need to wait in blk_set_queue_dying(); the purpose
of this patch is to block new I/O once dying is set, as pointed out in the
comment. The change in blk_cleanup_queue() isn't necessary either, since
that is exactly where we should drain the queue.
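
Just to make the ordering explicit, blk_cleanup_queue() does roughly the
following (abridged sketch from memory, not the exact tree, so details may
differ slightly):

	void blk_cleanup_queue(struct request_queue *q)
	{
		...
		/* mark @q DYING, no new request or merges allowed afterwards */
		blk_set_queue_dying(q);	/* with this patch, the freeze also starts here for blk-mq */
		...
		/* drain everything that entered before DYING was set */
		blk_freeze_queue(q);
		spin_lock_irq(q->queue_lock);
		if (!q->mq_ops)
			__blk_drain_queue(q, true);
		queue_flag_set(QUEUE_FLAG_DEAD, q);
		spin_unlock_irq(q->queue_lock);
		...
	}

so the draining already happens in the right place.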


Thanks,
Ming Lei

Ming Lei

Mar 15, 2017, 8:30:05 AM
On Wed, Mar 15, 2017 at 12:07:37AM +0000, Bart Van Assche wrote:
> On Thu, 2017-03-09 at 21:02 +0800, Ming Lei wrote:
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 159187a28d66..0aff380099d5 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -697,17 +697,8 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
> > {
> > struct blk_mq_timeout_data *data = priv;
> >
> > - if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
> > - /*
> > - * If a request wasn't started before the queue was
> > - * marked dying, kill it here or it'll go unnoticed.
> > - */
> > - if (unlikely(blk_queue_dying(rq->q))) {
> > - rq->errors = -EIO;
> > - blk_mq_end_request(rq, rq->errors);
> > - }
> > + if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
> > return;
> > - }
> >
> > if (time_after_eq(jiffies, rq->deadline)) {
> > if (!blk_mark_rq_complete(rq))
>
> Both the old and the new check look racy to me. The REQ_ATOM_STARTED bit can
> be changed concurrently by blk_mq_start_request(), __blk_mq_finish_request()

blk_mq_start_request() and __blk_mq_finish_request() won't be run concurrently.

From the view of __blk_mq_finish_request():

- if it is run from the merge-queue I/O path (blk_mq_merge_queue_io()),
blk_mq_start_request() can't be run at all, and the COMPLETE flag
is kept at its previous value (zero)

- if it is run from the normal completion path, the COMPLETE flag is cleared
before the req/tag is released to the tag set.

so there isn't a race between blk_mq_start_request() and __blk_mq_finish_request()
wrt. timeout.

> or __blk_mq_requeue_request(). Another issue with this function is that the

__blk_mq_requeue_request() can be run from two paths:

- dispatch failure, in which case the req/tag isn't released to the tag set

- the I/O completion path, in which the COMPLETE flag is cleared before the requeue.

so I can't see a race with timeout between starting a request and requeueing it.

> request passed to this function can be reinitialized concurrently. Sorry but

Yes, that is possible, but rq->atomic_flags is kept, and in that case,
when we handle the timeout, COMPLETE is cleared before the rq/tag is released to
the tag set via blk_mark_rq_complete(), so we won't complete that req twice.
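
For reference, blk_mark_rq_complete() is basically the following test-and-set
helper (sketch of block/blk.h from memory, so double-check against the tree):

	/*
	 * The EH timer and the I/O completion path both try to 'grab' the
	 * request; the atomic test-and-set guarantees only one of them wins.
	 */
	static inline int blk_mark_rq_complete(struct request *rq)
	{
		return test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
	}

which is why the request can't be completed twice.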



Thanks,
Ming

Ming Lei

Mar 15, 2017, 8:50:05 AM
On Wed, Mar 15, 2017 at 08:18:53PM +0800, Ming Lei wrote:
> On Wed, Mar 15, 2017 at 12:07:37AM +0000, Bart Van Assche wrote:
>
> > or __blk_mq_requeue_request(). Another issue with this function is that the
>
> __blk_mq_requeue_request() can be run from two pathes:
>
> - dispatch failure, in which case the req/tag isn't released to tag set
>
> - IO completion path, in which COMPLETE flag is cleared before requeue.
>
> so I can't see races with timeout in case of start rq vs. requeue rq.

Actually the rq/tag won't be released to the tag set if it is going to be requeued, so
the timeout race has nothing to do with requeue.

Thanks,
Ming

Yi Zhang

Mar 15, 2017, 10:20:06 AM
Thanks Ming.

Tested-by: Yi Zhang <yiz...@redhat.com>

Best Regards,
Yi Zhang

Ming Lei

Mar 15, 2017, 12:30:05 PM
On Wed, Mar 15, 2017 at 03:36:31PM +0000, Bart Van Assche wrote:
> Hello Ming,
>
> Please have another look at __blk_mq_requeue_request(). In that function
> the following code occurs: if (test_and_clear_bit(REQ_ATOM_STARTED,
> &rq->atomic_flags)) { ... }
>
> I think the REQ_ATOM_STARTED check in blk_mq_check_expired() races with the
> test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags) call in
> __blk_mq_requeue_request().

OK, this race should only exist when the requeue happens after a dispatch
busy result, because the COMPLETE flag isn't set. If the requeue is from I/O
completion, there is no such race because the COMPLETE flag is set.

One solution I thought of is to call blk_mark_rq_complete() before requeuing
when dispatch busy happens, but that looks a bit silly. Another way is to
set the STARTED flag just after .queue_rq() returns BLK_MQ_RQ_QUEUE_OK, which looks
reasonable too. Any comments on the 2nd solution?


Thanks,
Ming

Ming Lei

Mar 15, 2017, 12:50:06 PM
Actually it isn't possible to happen, because rq->deadline has just been set
in blk_mq_start_request(), called from .queue_rq(), and it won't trigger
timeout handling even if STARTED is observed as true in blk_mq_check_expired(),
because the timeout period is usually set big enough. So it is still safe, isn't it?

But this situation should have been commented.

Thanks,
Ming

Ming Lei

Mar 15, 2017, 7:50:05 PM
On Wed, Mar 15, 2017 at 09:34:50PM +0000, Bart Van Assche wrote:
> On Wed, 2017-03-15 at 20:18 +0800, Ming Lei wrote:
> > On Wed, Mar 15, 2017 at 12:07:37AM +0000, Bart Van Assche wrote:
> > > Both the old and the new check look racy to me. The REQ_ATOM_STARTED bit can
> > > be changed concurrently by blk_mq_start_request(), __blk_mq_finish_request()
> >
> > blk_mq_start_request() and __blk_mq_finish_request() won't be run concurrently.
> >
> > From view of __blk_mq_finish_request():
> >
> > - if it is run from merge queue io path(blk_mq_merge_queue_io()),
> > blk_mq_start_request() can't be run at all, and the COMPLETE flag
> > is kept as previous value(zero)
> >
> > - if it is run from normal complete path, COMPLETE flag is cleared
> > before the req/tag is released to tag set.
> >
> > so there isn't race in blk_mq_start_request() vs. __blk_mq_finish_request()
> > wrt. timeout.
> >
> > > or __blk_mq_requeue_request(). Another issue with this function is that the
> >
> > __blk_mq_requeue_request() can be run from two pathes:
> >
> > - dispatch failure, in which case the req/tag isn't released to tag set
> >
> > - IO completion path, in which COMPLETE flag is cleared before requeue.
> >
> > so I can't see races with timeout in case of start rq vs. requeue rq.
> >
> > > request passed to this function can be reinitialized concurrently.
>
> Hello Ming,
>
> You misinterpret what I wrote. I was referring to manipulation of
> REQ_ATOM_STARTED from different contexts and not to what you explained.

This patch addresses one race between the timeout handler and pre-queuing I/O
in the block layer (before .queue_rq()); please focus on this patch. That part
is definitely correct.

Also, I am happy to discuss this kind of race, but maybe we can do that in
another thread if you can describe the issue clearly.


--
Ming

Ming Lei

Mar 15, 2017, 8:10:06 PM
On Wed, Mar 15, 2017 at 09:35:03PM +0000, Bart Van Assche wrote:
> Sorry but I don't think that would be sufficient. There are several other
> scenarios that have not been mentioned above, e.g. that a tag gets freed and
> reused after the blk_mq_check_expired() call started and before that function
> has had the chance to examine any members of struct request. What is needed in
> blk_mq_check_expired() is the following as a single atomic operation:

We have dealt with this by checking COMPLETE & rq->deadline together in
blk_mq_check_expired() already:

- if the new rq->deadline (set in the reuse path) has been observed by the later
check in blk_mq_check_expired(), the request won't be timed out because of the timing.

- if the new rq->deadline (set in the reuse path) hasn't been observed by the
later check in blk_mq_check_expired(), that means the COMPLETE flag isn't set
yet in the reuse path, because we have a barrier to enforce the order in
blk_mq_start_request() (see the sketch below), so it won't be timed out either.

So let me know what the real race between freeing/reusing and timeout is.
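
To make the ordering concrete, blk_mq_start_request() does roughly the
following (abridged sketch of block/blk-mq.c around that time, from memory):

	void blk_mq_start_request(struct request *rq)
	{
		...
		blk_add_timer(rq);	/* sets rq->deadline */

		/*
		 * Make ->deadline visible before STARTED is set, so the
		 * timeout handler never sees STARTED with a stale deadline.
		 */
		smp_mb__before_atomic();

		if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
			set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
		if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
			clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
		...
	}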

> * Check whether REQ_ATOM_STARTED has been set.
> * Check whether REQ_ATOM_COMPLETE has not yet been set.
> * If both conditions have been met, set REQ_ATOM_COMPLETE.
>
> I don't think there is another solution than using a single state variable to
> represent the REQ_ATOM_STARTED and REQ_ATOM_COMPLETE states instead of two
> independent bits. How about the patch below?

I would review it if you can confirm that it is a real issue, :-)

Thanks,
Ming

Ming Lei

Mar 16, 2017, 8:10:04 PM
On Fri, Mar 17, 2017 at 5:35 AM, Bart Van Assche
<Bart.Va...@sandisk.com> wrote:
> On Thu, 2017-03-16 at 08:07 +0800, Ming Lei wrote:
>> > * Check whether REQ_ATOM_STARTED has been set.
>> > * Check whether REQ_ATOM_COMPLETE has not yet been set.
>> > * If both conditions have been met, set REQ_ATOM_COMPLETE.
>> >
>> > I don't think there is another solution than using a single state variable to
>> > represent the REQ_ATOM_STARTED and REQ_ATOM_COMPLETE states instead of two
>> > independent bits. How about the patch below?
>>
>> I would review it if you can confirm me that it is a real issue, :-)
>
> Hello Ming,
>
> I was chasing a v4.11 regression in the SCSI stack. Since my tests of today
> revealed that it's probably not a block layer issue, let's proceed with your
> patch.

Yeah, it shouldn't be related to the blk-mq timeout handler, which hasn't
changed much recently. Thanks for your review!



Thanks,
Ming Lei

Ming Lei

Mar 17, 2017, 6:10:05 AM
When iterating busy requests in the timeout handler,
if the STARTED flag of a request isn't set, that means
the request is being processed in the block layer or driver, and
hasn't been dispatched to hardware yet.

In the current implementation of blk_mq_check_expired(),
if the request queue becomes dying, un-started requests are
handled as being completed/freed immediately. This is
wrong and can easily cause a use-after-free issue[1][2] when
doing I/O and removing & resetting an NVMe device at the same time.
Tested-by: Yi Zhang <yiz...@redhat.com>
Reviewed-by: Bart Van Assche <bart.va...@sandisk.com>
Signed-off-by: Ming Lei <tom.l...@gmail.com>
---
block/blk-mq.c | 11 +----------
1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index a4546f060e80..08a49c69738b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -697,17 +697,8 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 {
 	struct blk_mq_timeout_data *data = priv;
 
-	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
-		/*
-		 * If a request wasn't started before the queue was
-		 * marked dying, kill it here or it'll go unnoticed.
-		 */
-		if (unlikely(blk_queue_dying(rq->q))) {
-			rq->errors = -EIO;
-			blk_mq_end_request(rq, rq->errors);
-		}
+	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
 		return;
-	}
 
 	if (time_after_eq(jiffies, rq->deadline)) {
 		if (!blk_mark_rq_complete(rq))
--
2.9.3

Ming Lei

Mar 17, 2017, 6:10:05 AM
Before commit 780db2071a ("blk-mq: decouble blk-mq freezing
from generic bypassing"), the dying flag was checked before
entering the queue. Tejun converted that check into .mq_freeze_depth,
assuming the counter is increased just after the dying flag
is set. Unfortunately we don't do that in blk_set_queue_dying().

This patch calls blk_mq_freeze_queue_start() for blk-mq in
blk_set_queue_dying(), so that we can block new I/O from entering
the queue once it is set as dying.

Given that blk_set_queue_dying() is always called in the remove path
of a block device, and the queue will be cleaned up later, we don't
need to worry about undoing the counter.

Cc: Bart Van Assche <bart.va...@sandisk.com>
Cc: Tejun Heo <t...@kernel.org>
Signed-off-by: Ming Lei <tom.l...@gmail.com>
---
block/blk-core.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d772c221cc17..62d4967c369f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -500,9 +500,12 @@ void blk_set_queue_dying(struct request_queue *q)
 	queue_flag_set(QUEUE_FLAG_DYING, q);
 	spin_unlock_irq(q->queue_lock);
 
-	if (q->mq_ops)
+	if (q->mq_ops) {
 		blk_mq_wake_waiters(q);
-	else {
+
+		/* block new I/O coming */
+		blk_mq_freeze_queue_start(q);
+	} else {
 		struct request_list *rl;
 
 		spin_lock_irq(q->queue_lock);
--
2.9.3

Ming Lei

Mar 17, 2017, 6:10:06 AM
From: Ming Lei <min...@redhat.com>

Hi,

The 1st patch fixes one race between timeout and dying queue.

The 2nd patch adds comments on races with the timeout handler.

The 3rd patch improves handling for dying queue.


V1:
- add comments on races related to the timeout handler
- add Tested-by & Reviewed-by tags

thanks,
Ming

Ming Lei (3):
blk-mq: don't complete un-started request in timeout handler
blk-mq: comment on races related timeout handler
blk-mq: start to freeze queue just after setting dying

block/blk-core.c | 7 +++++--
block/blk-mq.c | 33 +++++++++++++++++++++++----------
2 files changed, 28 insertions(+), 12 deletions(-)

--
2.9.3

Ming Lei

Mar 17, 2017, 2:50:06 PM
On Sat, Mar 18, 2017 at 1:32 AM, Bart Van Assche
<Bart.Va...@sandisk.com> wrote:
> Hello Ming,
>
> I think we need the queue freezing not only for blk-mq but also for blk-sq.

Yes, we can, but it may not be a big deal for blk-sq, since get_request() does
check the dying flag.
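
For blk-sq the dying check sits near the top of __get_request(), roughly like
this (sketch of block/blk-core.c from memory, so the exact signature may
differ):

	static struct request *__get_request(struct request_list *rl, unsigned int op,
					     struct bio *bio, gfp_t gfp_mask)
	{
		struct request_queue *q = rl->q;
		...
		if (unlikely(blk_queue_dying(q)))
			return ERR_PTR(-ENODEV);
		...
	}

so new legacy requests already fail once DYING is set.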

> Since the queue flags and the mq_freeze_depth are stored in different
> variables we need to prevent that the CPU reorders the stores to these

Not needed, see below.

> variables. The comment about blk_mq_freeze_queue_start() should be more
> clear. How about something like the patch below?
>
>
> [PATCH] blk-mq: Force block layer users to check the "dying" flag after it has been set
>
> Commit 780db2071ac4 removed the blk_queue_dying() check from the
> hot path of blk_mq_queue_enter() although that check is necessary
> when cleaning up a queue. Hence make sure that blk_queue_enter()
> and blk_mq_queue_enter() check the dying flag after it has been set.
>
> Because blk_set_queue_dying() is only called from the remove path
> of a block device we don't need to worry about unfreezing the queue.
>
> Fixes: commit 780db2071ac4 ("blk-mq: decouble blk-mq freezing from generic bypassing")
> ---
> block/blk-core.c | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index d772c221cc17..730f715b72ff 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -500,6 +500,19 @@ void blk_set_queue_dying(struct request_queue *q)
> queue_flag_set(QUEUE_FLAG_DYING, q);
> spin_unlock_irq(q->queue_lock);
>
> + /*
> + * Avoid that the updates of the queue flags and q_usage_counter
> + * are reordered.
> + */
> + smp_wmb();

atomic_inc_return() in blk_mq_freeze_queue_start() does imply a full
barrier (smp_mb()).
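
For reference, blk_mq_freeze_queue_start() is roughly the following (abridged
sketch from memory of the code at that time):

	void blk_mq_freeze_queue_start(struct request_queue *q)
	{
		int freeze_depth;

		/* atomic_inc_return() already acts as a full barrier */
		freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
		if (freeze_depth == 1) {
			percpu_ref_kill(&q->q_usage_counter);
			blk_mq_run_hw_queues(q, false);
		}
	}

so an explicit smp_wmb() before it shouldn't be necessary.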

> +
> + /*
> + * Force blk_queue_enter() and blk_mq_queue_enter() to check the
> + * "dying" flag. Despite its name, blk_mq_freeze_queue_start()
> + * affects blk-sq and blk-mq queues.
> + */
> + blk_mq_freeze_queue_start(q);

We need to change the name to blk_freeze_queue_start(), since it is quite
confusing to call a _mq function for both mq and legacy queues.

I will make that change for both in v2, if no one objects.


Thanks,
Ming Lei

Ming Lei

Mar 17, 2017, 9:10:05 PM
On Fri, Mar 17, 2017 at 11:26:26PM +0000, Bart Van Assche wrote:
> On Fri, 2017-03-17 at 17:57 +0800, Ming Lei wrote:
> > Given blk_set_queue_dying() is always called in remove path
> > of block device, and queue will be cleaned up later, we don't
> > need to worry about undoing the counter.
> >
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index d772c221cc17..62d4967c369f 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -500,9 +500,12 @@ void blk_set_queue_dying(struct request_queue *q)
> > queue_flag_set(QUEUE_FLAG_DYING, q);
> > spin_unlock_irq(q->queue_lock);
> >
> > - if (q->mq_ops)
> > + if (q->mq_ops) {
> > blk_mq_wake_waiters(q);
> > - else {
> > +
> > + /* block new I/O coming */
> > + blk_mq_freeze_queue_start(q);
> > + } else {
> > struct request_list *rl;
> >
> > spin_lock_irq(q->queue_lock);
>
> Hello Ming,
>
> The blk_freeze_queue() call in blk_cleanup_queue() waits until q_usage_counter
> drops to zero. Since the above blk_mq_freeze_queue_start() call increases that
> counter by one, how is blk_freeze_queue() expected to finish ever?

It is q->mq_freeze_depth that is increased by blk_mq_freeze_queue_start(), not
q->q_usage_counter; otherwise blk_freeze_queue() would never return, :-)

Thanks,
Ming

Hannes Reinecke

Mar 18, 2017, 7:30:05 AM
Reviewed-by: Hannes Reinecke <ha...@suse.com>

Cheers,

Hannes
--
Dr. Hannes Reinecke zSeries & Storage
ha...@suse.de +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

Hannes Reinecke

Mar 18, 2017, 7:50:06 AM
On 03/17/2017 10:57 AM, Ming Lei wrote:
> Before commit 780db2071a(blk-mq: decouble blk-mq freezing
> from generic bypassing), the dying flag is checked before
> entering queue, and Tejun converts the checking into .mq_freeze_depth,
> and assumes the counter is increased just after dying flag
> is set. Unfortunately we doesn't do that in blk_set_queue_dying().
>
> This patch calls blk_mq_freeze_queue_start() for blk-mq in
> blk_set_queue_dying(), so that we can block new I/O coming
> once the queue is set as dying.
>
> Given blk_set_queue_dying() is always called in remove path
> of block device, and queue will be cleaned up later, we don't
> need to worry about undoing the counter.
>
> Cc: Bart Van Assche <bart.va...@sandisk.com>
> Cc: Tejun Heo <t...@kernel.org>
> Signed-off-by: Ming Lei <tom.l...@gmail.com>
> ---
> block/blk-core.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>

Ming Lei

Mar 21, 2017, 10:20:05 PM
When iterating busy requests in the timeout handler,
if the STARTED flag of a request isn't set, that means
the request is being processed in the block layer or driver, and
hasn't been submitted to hardware yet.

In the current implementation of blk_mq_check_expired(),
if the request queue becomes dying, un-started requests are
handled as being completed/freed immediately. This is
wrong and can cause rq corruption or double allocation[1][2]
when doing I/O and removing & resetting an NVMe device at the same time.
Reviewed-by: Hannes Reinecke <ha...@suse.com>
Signed-off-by: Ming Lei <tom.l...@gmail.com>
---
block/blk-mq.c | 11 +----------
1 file changed, 1 insertion(+), 10 deletions(-)

Keith Busch

Mar 22, 2017, 12:00:10 PM
On Tue, Mar 21, 2017 at 11:03:59PM -0400, Jens Axboe wrote:
> On 03/21/2017 10:14 PM, Ming Lei wrote:
> > When iterating busy requests in timeout handler,
> > if the STARTED flag of one request isn't set, that means
> > the request is being processed in block layer or driver, and
> > isn't submitted to hardware yet.
> >
> > In current implementation of blk_mq_check_expired(),
> > if the request queue becomes dying, un-started requests are
> > handled as being completed/freed immediately. This way is
> > wrong, and can cause rq corruption or double allocation[1][2],
> > when doing I/O and removing&resetting NVMe device at the sametime.
>
> I agree, completing it looks bogus. If the request is in a scheduler or
> on a software queue, this won't end well at all. Looks like it was
> introduced by this patch:
>
> commit eb130dbfc40eabcd4e10797310bda6b9f6dd7e76
> Author: Keith Busch <keith...@intel.com>
> Date: Thu Jan 8 08:59:53 2015 -0700
>
> blk-mq: End unstarted requests on a dying queue
>
> Before that, we just ignored it. Keith?

The above was intended for a stopped hctx on a dying queue, such that
there's nothing in flight to the driver. NVMe had been relying on this
to end unstarted requests so we could make progress when a controller dies.

We've since obviated the need: we restart the hw queues to flush entered
requests to failure, so we don't need that brokenness.
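
For the record, that flushing is done along the lines of nvme_kill_queues()
(heavily abridged sketch from memory; the exact code differs between kernel
versions):

	void nvme_kill_queues(struct nvme_ctrl *ctrl)
	{
		struct nvme_ns *ns;

		mutex_lock(&ctrl->namespaces_mutex);
		list_for_each_entry(ns, &ctrl->namespaces, list) {
			blk_set_queue_dying(ns->queue);

			/* force entered requests through to the driver so they can fail */
			blk_mq_abort_requeue_list(ns->queue);
			blk_mq_start_stopped_hw_queues(ns->queue, true);
		}
		mutex_unlock(&ctrl->namespaces_mutex);
	}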

Ming Lei

Mar 23, 2017, 8:20:06 AM
So the brokenness was there right from the beginning.

>
> We've since obviated the need: we restart the hw queues to flush entered
> requests to failure, so we don't need that brokenness.

Looks like the following commit needs to be backported too if we backport this patch.

commit 69d9a99c258eb1d6478fd9608a2070890797eed7
Author: Keith Busch <keith...@intel.com>
Date: Wed Feb 24 09:15:56 2016 -0700

NVMe: Move error handling to failed reset handler


Thanks,
Ming