[PATCH] recovery: remove onlining of devices via sysfs

62 views
Skip to first unread message

Uday Shankar

unread,
Aug 11, 2022, 7:43:17 PM8/11/22
to open-...@googlegroups.com, Uday Shankar
In setup_full_feature_phase, iscsid calls into the kernel via
start_conn, then sets all the relevant device states to "running" via
session_online_devs. This second step is redundant since start_conn will
set the device states to running. Moreover, it can cause tasks to hang
forever: between start_conn and session_online_devs, the kernel could
detect another conn error and block the session again, which quiesces
the device queues. Setting the device state to "running" via sysfs kicks
off a rescan, and if the device queue is quiesced, the rescan will hang.
The iscsid kernel stacktrace looks like the following:

[<0>] blk_execute_rq+0x11c/0x170
[<0>] __scsi_execute+0x108/0x270
[<0>] scsi_vpd_inquiry+0x6d/0xc0
[<0>] scsi_get_vpd_size+0x33/0x70
[<0>] scsi_get_vpd_buf+0x25/0xb0
[<0>] scsi_attach_vpd+0x33/0x1a0
[<0>] scsi_rescan_device+0x2a/0x90
[<0>] store_state_field+0x1b0/0x250
[<0>] kernfs_fop_write_iter+0x130/0x1c0
[<0>] new_sync_write+0x10c/0x190
[<0>] vfs_write+0x218/0x2a0
[<0>] ksys_write+0x59/0xd0
[<0>] do_syscall_64+0x3a/0x80
[<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0

Since iscsid is responsible for recovery from the second conn error but
it is stuck, the relevant device queues will remain quiesced forever.
Tasks attempting I/O on these queues will thus also get stuck.

For these two reasons, remove the call to session_online_devs in
setup_full_feature_phase.

Signed-off-by: Uday Shankar <usha...@purestorage.com>
---
usr/initiator.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/usr/initiator.c b/usr/initiator.c
index 56bf38b..6cbdcba 100644
--- a/usr/initiator.c
+++ b/usr/initiator.c
@@ -1068,7 +1068,6 @@ setup_full_feature_phase(iscsi_conn_t *conn)
} else {
session->notify_qtask = NULL;

- session_online_devs(session->hostno, session->id);
mgmt_ipc_write_rsp(c->qtask, ISCSI_SUCCESS);
log_warning("connection%d:%d is operational after recovery "
"(%d attempts)", session->id, conn->id,
--
2.25.1

Uday Shankar

unread,
Aug 23, 2022, 5:32:33 PM8/23/22
to open-...@googlegroups.com, Lee Duncan, Chris Leech, Mike Christie
Bump and CC maintainers.

Michael Christie

unread,
Aug 24, 2022, 11:29:04 AM8/24/22
to Uday Shankar, open-...@googlegroups.com, Lee Duncan, Chris Leech

The kernel's call to unblock devices is only for setting devices to online if they are in the transport-offline state. It doesn't do anything if the scsi-eh set them to offline. The user space online code in your patch handles the scsi-eh case.

I hit the hang below and it should be fixed in this set:





From: Uday Shankar <usha...@purestorage.com>
Sent: Tuesday, August 23, 2022 2:32 PM
To: open-...@googlegroups.com <open-...@googlegroups.com>
Cc: Lee Duncan <ldu...@suse.com>; Chris Leech <cle...@redhat.com>; Michael Christie <michael....@oracle.com>
Subject: Re: [PATCH] recovery: remove onlining of devices via sysfs
 

Uday Shankar

unread,
Aug 29, 2022, 4:00:14 PM8/29/22
to Michael Christie, open-...@googlegroups.com, Lee Duncan, Chris Leech
> I hit the hang below and it should be fixed in this set:
>
> https://lore.kernel.org/all/20211105221048.6541...@oracle.com/

I hit the hang I described while running a kernel with your fix in. My
hang does not involve scsi-eh; it happens when a transport error is
detected by the kernel while iscsid is in the process of handling a
previous transport error. If the second transport error happens at an
unlucky timing (after iscsid does start_conn when handling the first
error, but before it sets the device state to running via sysfs), the
device queues can be quiesced (in the block layer sense) when iscsid
writes to sysfs and ends up in scsi_rescan_device.

I wanted to avoid this by just removing the write to sysfs from iscsid,
but it seems that interferes with scsi-eh flow:

> The user space online code in your patch handles the scsi-eh case.

So I will need to try another approach. Simply adding a call to
scsi_start_queue like so

if (rescan_dev) {
/* ...
*/
+ scsi_start_queue(sdev);
blk_mq_run_hw_queues(sdev->request_queue, true);
scsi_rescan_device(dev);
}

does not fix the issue either, since your linked patch moves the
run_hw_queues and rescan calls out of the area protected by state_mutex.
A race window still exists where the kernel could detect the second
transport error and quiesce sdev's request queue after we unquiesce it
but before we run the queues and rescan the device above.

It seems that the most straightforward way to fix my issue without
undoing your patch is to add another sync primitive that guards the sdev
queue quiesced/unquiesced state so that we can guarantee the queues are
unquiesced when we do the rescan above. Unless you have other
suggestions, I'll try that approach.

Thanks,
Uday

Mike Christie

unread,
Aug 29, 2022, 4:42:55 PM8/29/22
to Uday Shankar, open-...@googlegroups.com, Lee Duncan, Chris Leech
On 8/29/22 3:00 PM, Uday Shankar wrote:
>> I hit the hang below and it should be fixed in this set:
>>
>> https://urldefense.com/v3/__https://lore.kernel.org/all/20211105221048.6541...@oracle.com/__;!!ACWV5N9M2RV99hQ!L10OSfsVJ6Bssm8eT4oRSTBOLfCKp7-4cM5kGDF4aXnYuBfB0xlTEzbFTvMMw0nqmbfLqSpaXVl-E_tVvIc6ZQqBrA$
>
> I hit the hang I described while running a kernel with your fix in. My

Ah yeah sorry. I didn't read your first mail well. I tried to do a
quick reply from my phone on vacation :)

> hang does not involve scsi-eh; it happens when a transport error is
> detected by the kernel while iscsid is in the process of handling a
> previous transport error. If the second transport error happens at an
> unlucky timing (after iscsid does start_conn when handling the first> error, but before it sets the device state to running via sysfs), the
> device queues can be quiesced (in the block layer sense) when iscsid
> writes to sysfs and ends up in scsi_rescan_device.

Ok got it.

>
> I wanted to avoid this by just removing the write to sysfs from iscsid,
> but it seems that interferes with scsi-eh flow:
>
>> The user space online code in your patch handles the scsi-eh case.
>
> So I will need to try another approach. Simply adding a call to
> scsi_start_queue like so
>
> if (rescan_dev) {
> /* ...
> */
> + scsi_start_queue(sdev);
> blk_mq_run_hw_queues(sdev->request_queue, true);
> scsi_rescan_device(dev);
> }
>
> does not fix the issue either, since your linked patch moves the
> run_hw_queues and rescan calls out of the area protected by state_mutex.
> A race window still exists where the kernel could detect the second
> transport error and quiesce sdev's request queue after we unquiesce it
> but before we run the queues and rescan the device above.
>
> It seems that the most straightforward way to fix my issue without
> undoing your patch is to add another sync primitive that guards the sdev
> queue quiesced/unquiesced state so that we can guarantee the queues are
> unquiesced when we do the rescan above. Unless you have other
> suggestions, I'll try that approach.
>

Here is an alternative I was thinking about when I did that other
patchset.

I think we could remove the session_online_devs call for newer kernels.

The session_online_devs call was handling the case where a scsi cmd
timesout because the network is lost, we run the scsi eh, that ends
up getting escalated to us dropping the session, then iscsi_eh_session_reset
fails due to the replacement_timeout expiring.

With the patch:

commit a0c2f8b6709a9a4af175497ca65f93804f57b248
Author: Mike Christie <michael....@oracle.com>
Date: Fri Nov 5 17:10:47 2021 -0500

scsi: iscsi: Unblock session then wake up error handler


we should always end up in the transport-offline state for that
type of case. When we relogin, then the conn start's call to
__iscsi_unblock_session -> scsi_target_unblock will handle that
case and set the device to running. So we should not need the
userspace online/running call anymore.

So we could just add a CAP_SCSI_EH_TRANSPORT OFFLINE flag to
the iscsi_transport->caps struct. When userspace sees that then
it knows the kernel will now do the right thing.

The drawback is that we have to patch userspace and then also
get the the new CAP_SCSI_EH_TRANSPORT_OFFLINE patch in the stable
kernels.

Uday Shankar

unread,
Aug 29, 2022, 5:51:28 PM8/29/22
to Mike Christie, open-...@googlegroups.com, Lee Duncan, Chris Leech
> So we could just add a CAP_SCSI_EH_TRANSPORT OFFLINE flag to
> the iscsi_transport->caps struct. When userspace sees that then
> it knows the kernel will now do the right thing.
>
> The drawback is that we have to patch userspace and then also
> get the the new CAP_SCSI_EH_TRANSPORT_OFFLINE patch in the stable
> kernels.

Another drawback of this approach is that we are covering up a bug in
the kernel, which is still possible to hit if someone/something besides
iscsid decides to write to the sysfs state field at the unlucky timing.
Not sure how much we care about this. If it's considered a non-issue,
then I'll put together the change you described and send it to
linux-scsi as well.

Thanks,
Uday

michael....@oracle.com

unread,
Aug 29, 2022, 10:21:48 PM8/29/22
to Uday Shankar, open-...@googlegroups.com, Lee Duncan, Chris Leech
On 8/29/22 4:51 PM, Uday Shankar wrote:
>> So we could just add a CAP_SCSI_EH_TRANSPORT OFFLINE flag to
>> the iscsi_transport->caps struct. When userspace sees that then
>> it knows the kernel will now do the right thing.
>>
>> The drawback is that we have to patch userspace and then also
>> get the the new CAP_SCSI_EH_TRANSPORT_OFFLINE patch in the stable
>> kernels.
>
> Another drawback of this approach is that we are covering up a bug in
> the kernel, which is still possible to hit if someone/something besides

I'm still not sure if it's a bug in the kernel or not. I've been thinking
it's not a kernel bug.

Maybe it depends on if you feel there was a guarantee that writing to
that file was non-blocking. If so, then yeah I think it's a kernel bug
that we now do all this new work from that call. I don't think there is
anything that documents the behavior for that sysfs file though.

If there was no guarantee, then I think it's an issue with our design
where we assumed it was non-blocking from reading the kernel code and
thinking that was always going to be the case. If writing to that file
is allowed to block, then our design is wrong since we can't be sleeping
in that main thread.

michael....@oracle.com

unread,
Aug 29, 2022, 10:38:42 PM8/29/22
to Uday Shankar, open-...@googlegroups.com, Lee Duncan, Chris Leech
On 8/29/22 9:21 PM, michael....@oracle.com wrote:
> On 8/29/22 4:51 PM, Uday Shankar wrote:
>>> So we could just add a CAP_SCSI_EH_TRANSPORT OFFLINE flag to
>>> the iscsi_transport->caps struct. When userspace sees that then
>>> it knows the kernel will now do the right thing.
>>>
>>> The drawback is that we have to patch userspace and then also
>>> get the the new CAP_SCSI_EH_TRANSPORT_OFFLINE patch in the stable
>>> kernels.
>>
>> Another drawback of this approach is that we are covering up a bug in
>> the kernel, which is still possible to hit if someone/something besides

Oh yeah, I don't think the list of programs that can hit this problem
is very long. It's a iscsi only problem right? So there's open-iscsi,
some boot tool in go, and google's tools that we know of. I think only
open-iscsi will have the problem too.

You won't hit it in other transports because only iscsid tries to online
the device from userspace from the same thread that is needed to relogin
and unblock a device. FC/SAS/SRP don't have the same problem since they
do the block and unblock from kernel workqueues which will run independently
from some userspace thread writing to sysfs.

Or, is there other ways to hit this that you are seeing? If there are
then forget what I wrote in the last mail :)

Uday Shankar

unread,
Sep 8, 2022, 2:22:58 PM9/8/22
to michael....@oracle.com, open-...@googlegroups.com, Lee Duncan, Chris Leech
> Or, is there other ways to hit this that you are seeing? If there are
> then forget what I wrote in the last mail :)

No, I don't know of another way to hit this organically. With the change
you suggested, if someone sets a scsi device to the "running" state
concurrently with the kernel onlining devices (kicked off by iscsid),
it's possible that the kernel will observe the current state of the
device as SDEV_RUNNING in scsi_internal_device_unblock_nowait. That
function will then return -EINVAL and the device queue will remain
quiesced. AFAIK there's no way to recover from this situation without a
reboot. (This is actually the bug as I encountered it first, in Redhat
kernel 4.18.0-348.20.1). This is not a great situation, but maybe it can
be argued that if you're fiddling with things in sysfs you should really
know what you're doing, and avoid doing stuff like this.

I did write up the change you suggested though, and a couple of
questions before I mail it here and to linux-scsi:
- Can I credit you as "Suggested-by" ?
- Currently I only have the new capability in TCP's iscsi_transport
struct, because that's the one I care about. But I think it should go
everywhere?
Reply all
Reply to author
Forward
0 new messages