[BUG]: Null pointer exception from parallel calls to iscsi_stop_conn

ajhu...@gmail.com

Jul 16, 2024, 3:25:40 PM

to open-iscsi

Hi. I reviewed a kdump generated by a NULL pointer exception during termination of an iSCSI session. In this instance, the session was terminated because of a 'Target-Not-Found' error returned by the target during login.

The system is running SLES15 SP4 (kernel v5.14.21).

crash> bt
PID: 61755 TASK: ffff88ae57e4c380 CPU: 6 COMMAND: "kworker/u40:3"
#0 [ffffc90006b6fae8] machine_kexec at ffffffff8106af4e
#1 [ffffc90006b6fb38] __crash_kexec at ffffffff81168dce
#2 [ffffc90006b6fc00] panic at ffffffff8191aa0f
#3 [ffffc90006b6fc88] oops_end at ffffffff8102e3dd
#4 [ffffc90006b6fca8] page_fault_oops at ffffffff8107b6fb
#5 [ffffc90006b6fd28] exc_page_fault at ffffffff81923610
#6 [ffffc90006b6fd50] asm_exc_page_fault at ffffffff81a00f39
[exception RIP: iscsi_sw_tcp_release_conn+111]
RIP: ffffffffc0c8243f RSP: ffffc90006b6fe08 RFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff8881cb225388 RCX: 0000000000000001
RDX: ffff88adbf660900 RSI: ffffffff81f7cb84 RDI: ffff88adbf660980
RBP: ffff888ad68cd140 R8: 0000000000000001 R9: 0000000000000001
R10: 0000000000000000 R11: 00000000000001d2 R12: ffff8881cb225388
R13: ffff8881cb2256a8 R14: ffff8881cb2256a8 R15: ffff888105d8ca05
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffffc90006b6fe38] iscsi_sw_tcp_conn_stop at ffffffffc0c825fd [iscsi_tcp]
#8 [ffffc90006b6fe58] iscsi_stop_conn at ffffffffc0f276f3 [scsi_transport_iscsi]
#9 [ffffc90006b6fe78] iscsi_cleanup_conn_work_fn at ffffffffc0f277f8 [scsi_transport_iscsi]
#10 [ffffc90006b6fea0] process_one_work at ffffffff810b5766
#11 [ffffc90006b6fed8] worker_thread at ffffffff810b595d
#12 [ffffc90006b6ff10] kthread at ffffffff810bdb63
#13 [ffffc90006b6ff50] ret_from_fork at ffffffff8100204f

Based on code review and journal logs, iscsid detects the login error and initiates a TERM stop from user space. In parallel, the kernel driver detects a socket error and initiates a RECOVERY stop on the connection.

Initiated by iscsid

iscsi_recv_login_rsp ->
iscsi_login_eh ->
session_conn_shutdown ->
kstop_conn ->
iscsi_if_transport_conn ->
iscsi_if_stop_conn ->
iscsi_stop_conn(conn, STOP_CONN_TERM)

Initiated by error on TCP socket

iscsi_sw_sk_state_check ->
iscsi_conn_failure ->
iscsi_conn_error_event ->
queue_work(iscsi_conn_cleanup_workq, &conn->cleanup_work);
.
.
iscsi_cleanup_conn_work_fn ->
iscsi_stop_conn(conn, STOP_CONN_RECOVER);

The NULL pointer exception occurred in the iscsi_stop_conn call initiated from the cleanup worker thread. Both iscsi_sw_tcp_conn_stop and iscsi_sw_tcp_release_conn check for a NULL sock pointer in the connection, but the call to iscsi_sw_tcp_conn_restore_callbacks within iscsi_sw_tcp_release_conn does not. This leaves a small window in which the other iscsi_stop_conn call, running in parallel, can set the connection's socket pointer to NULL, resulting in this exception.
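
To make the window concrete, here is a minimal userspace sketch in C. It is not the kernel code: the model_* types, the window hook, and the function names are hypothetical stand-ins, and the interleaving is forced deterministically through a callback instead of a second thread. It only illustrates the check-then-use pattern described above, with a false return standing in for the point where the kernel would dereference NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the socket and connection; not kernel types. */
struct model_sock { int callbacks_restored; };
struct model_conn { struct model_sock *sock; };

/*
 * Models a restore-callbacks helper that re-reads conn->sock without
 * being protected by the caller's earlier NULL check. Returning false
 * marks the spot where the real code would fault.
 */
static bool model_restore_callbacks(struct model_conn *conn)
{
    struct model_sock *sock = conn->sock;   /* may have changed since the check */

    if (sock == NULL)
        return false;                       /* models the NULL dereference */
    sock->callbacks_restored = 1;
    return true;
}

/*
 * Models the release path: check sock, then call the helper. The
 * optional window() hook runs between the check and the use, which is
 * where a parallel stop call could clear the pointer.
 */
static bool model_release_conn(struct model_conn *conn,
                               void (*window)(struct model_conn *))
{
    assert(conn != NULL);
    if (conn->sock == NULL)
        return true;                        /* already released, nothing to do */
    if (window != NULL)
        window(conn);                       /* the other stop call runs here */
    if (!model_restore_callbacks(conn))
        return false;
    conn->sock = NULL;
    return true;
}

/* Models the competing stop call clearing the socket pointer. */
static void model_other_stop_clears_sock(struct model_conn *conn)
{
    conn->sock = NULL;
}
```

Calling model_release_conn with no hook succeeds, while passing model_other_stop_clears_sock as the hook makes the helper see a NULL pointer even though the caller's check passed.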

It would be simple enough to add a check for a NULL socket pointer in iscsi_sw_tcp_conn_restore_callbacks, but I'm not convinced that is the correct solution. It looks to me as though the resulting state of the session and connection would differ depending on which of the two calls executes first. If the cleanup thread successfully stops the connection with STOP_CONN_RECOVER, it will set the socket pointer in the connection to NULL, and this will short-circuit the STOP_CONN_TERM stop from iscsid and keep it from modifying the connection/session states.
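
The order dependence can be sketched with a small single-threaded model in C. Everything here is an assumption made for illustration: the model_* enums and struct are hypothetical, and the state transitions are a simplification of the behavior described in this post, not a copy of the libiscsi logic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flags and states; simplified from the description above. */
enum model_flag  { MODEL_STOP_TERM, MODEL_STOP_RECOVER };
enum model_state { MODEL_LOGGED_IN, MODEL_IN_RECOVERY, MODEL_TERMINATED };

struct model_conn {
    void *sock;                 /* non-NULL while the socket is attached */
    enum model_state state;
};

/*
 * Models a stop call that returns early when the socket pointer is
 * already NULL, so only the first caller gets to change the state.
 */
static void model_stop_conn(struct model_conn *conn, enum model_flag flag)
{
    assert(conn != NULL);
    if (conn->sock == NULL)
        return;                 /* short circuit: state left untouched */
    conn->sock = NULL;
    conn->state = (flag == MODEL_STOP_TERM) ? MODEL_TERMINATED
                                            : MODEL_IN_RECOVERY;
}

/* Runs the two stop calls back to back and reports the final state. */
static enum model_state model_run_in_order(enum model_flag first,
                                           enum model_flag second)
{
    int dummy_sock = 0;
    struct model_conn conn = { &dummy_sock, MODEL_LOGGED_IN };

    model_stop_conn(&conn, first);
    model_stop_conn(&conn, second);
    return conn.state;
}
```

In this model, running the recovery stop first leaves the connection in the recovery state and the later terminate stop is a no-op, while the opposite order ends in the terminated state. A NULL check in the restore-callbacks helper alone would avoid the crash but would not remove this difference.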

Also, I noticed that the cleanup thread's call to iscsi_stop_conn is made while holding ep_mutex, whereas the call made from iscsid is not. Should the call from iscsid to iscsi_stop_conn also be made while holding ep_mutex?

Thanks in advance,

Adam
