[PATCH] scsi: libiscsi: fix NOP race condition

39 views
Skip to first unread message

Lee Duncan

unread,
Sep 18, 2020, 5:10:30 PM9/18/20
to linux...@vger.kernel.org, open-...@googlegroups.com, Lee Duncan
From: Lee Duncan <ldu...@suse.com>

iSCSI NOPs are sometimes "lost", mistakenly sent to the
user-land iscsid daemon instead of handled in the kernel,
as they should be, resulting in a message from the daemon like:

> iscsid: Got nop in, but kernel supports nop handling.

This can occur because of the new forward- and back-locks,
and the fact that an iSCSI NOP response can occur before
processing of the NOP send is complete. This can result
in "conn->ping_task" being NULL in iscsi_nop_out_rsp(),
when the pointer is actually in the process of being set.

To work around this, we add a new state to the "ping_task"
pointer. In addition to NULL (not assigned) and a pointer
(assigned), we add the state "being set", which is signaled
with an INVALID pointer (using "-1").

Signed-off-by: Lee Duncan <ldu...@suse.com>
---
drivers/scsi/libiscsi.c | 11 ++++++++++-
include/scsi/libiscsi.h | 3 +++
2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
index 1e9c3171fa9f..5eb064787ee2 100644
--- a/drivers/scsi/libiscsi.c
+++ b/drivers/scsi/libiscsi.c
@@ -738,6 +738,9 @@ __iscsi_conn_send_pdu(struct iscsi_conn *conn, struct iscsi_hdr *hdr,
task->conn->session->age);
}

+ if (unlikely(READ_ONCE(conn->ping_task) == INVALID_SCSI_TASK))
+ WRITE_ONCE(conn->ping_task, task);
+
if (!ihost->workq) {
if (iscsi_prep_mgmt_task(conn, task))
goto free_task;
@@ -941,6 +944,11 @@ static int iscsi_send_nopout(struct iscsi_conn *conn, struct iscsi_nopin *rhdr)
struct iscsi_nopout hdr;
struct iscsi_task *task;

+ if (!rhdr) {
+ if (READ_ONCE(conn->ping_task))
+ return -EINVAL;
+ WRITE_ONCE(conn->ping_task, INVALID_SCSI_TASK);
+ }
if (!rhdr && conn->ping_task)
return -EINVAL;

@@ -957,11 +965,12 @@ static int iscsi_send_nopout(struct iscsi_conn *conn, struct iscsi_nopin *rhdr)

task = __iscsi_conn_send_pdu(conn, (struct iscsi_hdr *)&hdr, NULL, 0);
if (!task) {
+ if (!rhdr)
+ WRITE_ONCE(conn->ping_task, NULL);
iscsi_conn_printk(KERN_ERR, conn, "Could not send nopout\n");
return -EIO;
} else if (!rhdr) {
/* only track our nops */
- conn->ping_task = task;
conn->last_ping = jiffies;
}

diff --git a/include/scsi/libiscsi.h b/include/scsi/libiscsi.h
index c25fb86ffae9..b3bbd10eb3f0 100644
--- a/include/scsi/libiscsi.h
+++ b/include/scsi/libiscsi.h
@@ -132,6 +132,9 @@ struct iscsi_task {
void *dd_data; /* driver/transport data */
};

+/* invalid scsi_task pointer */
+#define INVALID_SCSI_TASK (struct iscsi_task *)-1l
+
static inline int iscsi_task_has_unsol_data(struct iscsi_task *task)
{
return task->unsol_r2t.data_length > task->unsol_r2t.sent;
--
2.26.2

ldu...@suse.com

unread,
Sep 19, 2020, 11:05:23 AM9/19/20
to linux...@vger.kernel.org, open-...@googlegroups.com, Lee Duncan

ldu...@suse.com

unread,
Sep 25, 2020, 2:42:32 PM9/25/20
to linux...@vger.kernel.org, linux-...@vger.kernel.org, open-...@googlegroups.com, martin....@oracle.com, mchr...@redhat.com, ha...@suse.com, Lee Duncan
From: Lee Duncan <ldu...@suse.com>

A customer that uses iSCSI NOPs extensively found a race
condition caused in part by the two-lock system used in
iscsi (a forward and a back lock), since sending an iSCSI
NOP uses the forward lock, and receiving one uses the
back lock. Because of this, processing of the "send"
can still be in progress when the "receive" occurs, on
a sufficiently fast multicore system.

To handle this case, we add a new state to the "ping_task"
pointer besides unassigned and assigned, called "invalid",
which means the "not yet completed sending". Tests show
this closes this race condition hole.

Changes since V1:
- Removed two redundant lines in iscsi_send_nopout()
- Updated commit text to be more clear
- Added this cover letter with even more info

Lee Duncan (1):
scsi: libiscsi: fix NOP race condition

drivers/scsi/libiscsi.c | 13 ++++++++++---
include/scsi/libiscsi.h | 3 +++
2 files changed, 13 insertions(+), 3 deletions(-)

--
2.26.2

ldu...@suse.com

unread,
Sep 25, 2020, 2:42:33 PM9/25/20
to linux...@vger.kernel.org, linux-...@vger.kernel.org, open-...@googlegroups.com, martin....@oracle.com, mchr...@redhat.com, ha...@suse.com, Lee Duncan
From: Lee Duncan <ldu...@suse.com>

iSCSI NOPs are sometimes "lost", mistakenly sent to the
user-land iscsid daemon instead of handled in the kernel,
as they should be, resulting in a message from the daemon like:

> iscsid: Got nop in, but kernel supports nop handling.

This can occur because of the forward- and back-locks
in the kernel iSCSI code, and the fact that an iSCSI NOP
response can be processed before processing of the NOP send
is complete. This can result in "conn->ping_task" being NULL
in iscsi_nop_out_rsp(), when the pointer is actually in
the process of being set.

To work around this, we add a new state to the "ping_task"
pointer. In addition to NULL (not assigned) and a pointer
(assigned), we add the state "being set", which is signaled
with an INVALID pointer (using "-1").

Signed-off-by: Lee Duncan <ldu...@suse.com>
---
drivers/scsi/libiscsi.c | 13 ++++++++++---
include/scsi/libiscsi.h | 3 +++
2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
index 1e9c3171fa9f..cade108c33b6 100644
--- a/drivers/scsi/libiscsi.c
+++ b/drivers/scsi/libiscsi.c
@@ -738,6 +738,9 @@ __iscsi_conn_send_pdu(struct iscsi_conn *conn, struct iscsi_hdr *hdr,
task->conn->session->age);
}

+ if (unlikely(READ_ONCE(conn->ping_task) == INVALID_SCSI_TASK))
+ WRITE_ONCE(conn->ping_task, task);
+
if (!ihost->workq) {
if (iscsi_prep_mgmt_task(conn, task))
goto free_task;
@@ -941,8 +944,11 @@ static int iscsi_send_nopout(struct iscsi_conn *conn, struct iscsi_nopin *rhdr)
struct iscsi_nopout hdr;
struct iscsi_task *task;

- if (!rhdr && conn->ping_task)
- return -EINVAL;
+ if (!rhdr) {
+ if (READ_ONCE(conn->ping_task))
+ return -EINVAL;
+ WRITE_ONCE(conn->ping_task, INVALID_SCSI_TASK);
+ }

memset(&hdr, 0, sizeof(struct iscsi_nopout));
hdr.opcode = ISCSI_OP_NOOP_OUT | ISCSI_OP_IMMEDIATE;
@@ -957,11 +963,12 @@ static int iscsi_send_nopout(struct iscsi_conn *conn, struct iscsi_nopin *rhdr)

Lee Duncan

unread,
Oct 2, 2020, 12:13:45 PM10/2/20
to linux...@vger.kernel.org, linux-...@vger.kernel.org, open-...@googlegroups.com, martin....@oracle.com, mchr...@redhat.com, ha...@suse.com
Ping?

--
Lee Duncan

Mike Christie

unread,
Oct 8, 2020, 1:11:29 PM10/8/20
to ldu...@suse.com, linux...@vger.kernel.org, linux-...@vger.kernel.org, open-...@googlegroups.com, martin....@oracle.com, mchr...@redhat.com, ha...@suse.com
I think the API gets a little weird now where in some cases
__iscsi_conn_send_pdu checks the opcode to see what type of request
it is but above we the caller sets the ping_task.

For login, tmfs and passthrough, we assume the __iscsi_conn_send_pdu
has sent or cleaned up everything. I think it might be nicer to just
have __iscsi_conn_send_pdu set the ping_task field before doing the
xmit/queue call. It would then work similar to the conn->login_task
case where that function knows about that special task too.

So in __iscsi_conn_send_pdu add a "if (opcode == ISCSI_OP_NOOP_OUT)",
and check if it's a nop we need to track. If so set conn->ping_task.

Mike Christie

unread,
Oct 8, 2020, 4:54:43 PM10/8/20
to ldu...@suse.com, linux...@vger.kernel.org, linux-...@vger.kernel.org, open-...@googlegroups.com, martin....@oracle.com, mchr...@redhat.com, ha...@suse.com
Ignore this. It won't work nicely either. To figure out if the nop is
our internal transport test ping vs a userspace ping that also needs
a reply, we would need to do something like you did above so there is
no point.

Ulrich Windl

unread,
Oct 9, 2020, 3:25:14 AM10/9/20
to open-iscsi
>>> Mike Christie <michael....@oracle.com> schrieb am 08.10.2020 um 22:54 in
Nachricht <47eca384-b54e-63cc...@oracle.com>:
I don't know the implementation details, but if ICMP package data would contain some "unlikely" value for iSCSI pings, you could use the packet data to decide. I guess 32 random bits could be "unlikely enough".

>
> --
> You received this message because you are subscribed to the Google Groups
> "open-iscsi" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to open-iscsi+...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/open-iscsi/47eca384-b54e-63cc-0f84-7ed6501f
> 427e%40oracle.com.




Lee Duncan

unread,
Oct 20, 2020, 12:55:48 PM10/20/20
to Mike Christie, linux...@vger.kernel.org, linux-...@vger.kernel.org, open-...@googlegroups.com, martin....@oracle.com, mchr...@redhat.com, ha...@suse.com
Hi Mike:

I've read this a few times, and I'm still no sure I'm parsing it correctly.

Are you saying that my original patch submission is ok, or are you
saying there's nothing we can do and we're up the proverbial creek?
--
Lee Duncan

Mike Christie

unread,
Nov 4, 2020, 4:33:59 PM11/4/20
to Lee Duncan, linux...@vger.kernel.org, open-...@googlegroups.com, Lee Duncan
I don't think you need this. If __iscsi_conn_send_pdu returns NULL, it
will have done __iscsi_put_task and done this already.

> iscsi_conn_printk(KERN_ERR, conn, "Could not send nopout\n");
> return -EIO;
> } else if (!rhdr) {
> /* only track our nops */
> - conn->ping_task = task;
> conn->last_ping = jiffies;
> }

Why in the send path do we always use the READ_ONCE/WRITE_ONCE, but in
the completion path like in iscsi_complete_task we don't.

Mike Christie

unread,
Nov 4, 2020, 4:37:41 PM11/4/20
to Lee Duncan, linux...@vger.kernel.org, open-...@googlegroups.com, Lee Duncan
Ignore that. That is iscsi_complete_task that would do it.

Lee Duncan

unread,
Nov 5, 2020, 1:30:38 PM11/5/20
to Mike Christie, linux...@vger.kernel.org, open-...@googlegroups.com
Not an issue, as you already replied.

>
>>           iscsi_conn_printk(KERN_ERR, conn, "Could not send nopout\n");
>>           return -EIO;
>>       } else if (!rhdr) {
>>           /* only track our nops */
>> -        conn->ping_task = task;
>>           conn->last_ping = jiffies;
>>       }
>
> Why in the send path do we always use the READ_ONCE/WRITE_ONCE, but in
> the completion path like in iscsi_complete_task we don't.
>

The answer is that I was only modifying the code that needed changing
for this bug. My first pass did not use READ_ONCE() or WRITE_ONCE(), but
Hannes suggested the change.

Now that I think about it more, the memory barrier stuff would make
sense only if all the access to that field are protected.

I will resubmit V2 of the patch.
--
Lee Duncan

Reply all
Reply to author
Forward
0 new messages