[BUG]: Null pointer exception from parallel calls to iscsi_stop_conn


ajhu...@gmail.com

Jul 16, 2024, 3:25:40 PM
to open-iscsi
Hi. I reviewed a kdump generated by a NULL pointer exception during termination of an iSCSI session. In this instance, the session was terminated because the target returned a 'Target-Not-Found' error during login.

The system is running SLES15 SP4 (kernel v5.14.21).
 
crash> bt
PID: 61755  TASK: ffff88ae57e4c380  CPU: 6   COMMAND: "kworker/u40:3"
 #0 [ffffc90006b6fae8] machine_kexec at ffffffff8106af4e
 #1 [ffffc90006b6fb38] __crash_kexec at ffffffff81168dce
 #2 [ffffc90006b6fc00] panic at ffffffff8191aa0f
 #3 [ffffc90006b6fc88] oops_end at ffffffff8102e3dd
 #4 [ffffc90006b6fca8] page_fault_oops at ffffffff8107b6fb
 #5 [ffffc90006b6fd28] exc_page_fault at ffffffff81923610
 #6 [ffffc90006b6fd50] asm_exc_page_fault at ffffffff81a00f39
    [exception RIP: iscsi_sw_tcp_release_conn+111]
    RIP: ffffffffc0c8243f  RSP: ffffc90006b6fe08  RFLAGS: 00010202
    RAX: 0000000000000000  RBX: ffff8881cb225388  RCX: 0000000000000001
    RDX: ffff88adbf660900  RSI: ffffffff81f7cb84  RDI: ffff88adbf660980
    RBP: ffff888ad68cd140   R8: 0000000000000001   R9: 0000000000000001
    R10: 0000000000000000  R11: 00000000000001d2  R12: ffff8881cb225388
    R13: ffff8881cb2256a8  R14: ffff8881cb2256a8  R15: ffff888105d8ca05
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffffc90006b6fe38] iscsi_sw_tcp_conn_stop at ffffffffc0c825fd [iscsi_tcp]
 #8 [ffffc90006b6fe58] iscsi_stop_conn at ffffffffc0f276f3 [scsi_transport_iscsi]
 #9 [ffffc90006b6fe78] iscsi_cleanup_conn_work_fn at ffffffffc0f277f8 [scsi_transport_iscsi]
#10 [ffffc90006b6fea0] process_one_work at ffffffff810b5766
#11 [ffffc90006b6fed8] worker_thread at ffffffff810b595d
#12 [ffffc90006b6ff10] kthread at ffffffff810bdb63
#13 [ffffc90006b6ff50] ret_from_fork at ffffffff8100204f


Based on code review and journal logs, iscsid detects the login error and initiates a TERM stop from user space. In parallel, the kernel driver detects a socket error and initiates a RECOVERY stop on the connection.  

Initiated by iscsid

iscsi_recv_login_rsp ->
  iscsi_login_eh ->
    session_conn_shutdown ->
      kstop_conn ->
        iscsi_if_transport_conn ->
         iscsi_if_stop_conn ->
           iscsi_stop_conn(conn, STOP_CONN_TERM)

Initiated by error on TCP socket

iscsi_sw_sk_state_check ->
  iscsi_conn_failure ->
    iscsi_conn_error_event ->
      queue_work(iscsi_conn_cleanup_workq, &conn->cleanup_work);
      .
      .
      iscsi_cleanup_conn_work_fn ->
        iscsi_stop_conn(conn, STOP_CONN_RECOVER);

The NULL pointer exception occurred in the iscsi_stop_conn call made from the cleanup worker thread. Both iscsi_sw_tcp_conn_stop and iscsi_sw_tcp_release_conn check for a NULL sock pointer in the connection, but iscsi_sw_tcp_conn_restore_callbacks, which is called from iscsi_sw_tcp_release_conn, does not. That leaves a small window in which the other iscsi_stop_conn call, running in parallel, can set the connection's socket pointer to NULL, resulting in this exception.
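
To make the window concrete, here is a minimal userspace sketch of the check-then-use pattern as I understand it. It is not the kernel code: every name in it (model_conn, model_sock, model_release_conn, and so on) is hypothetical, and the "other caller" is simulated deterministically with a flag instead of a real second thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; not the real layouts. */
struct model_sock { int callbacks_restored; };
struct model_conn { struct model_sock *sock; };

enum release_result { RELEASE_SKIPPED, RELEASE_DONE, RELEASE_WOULD_OOPS };

/* Models the other iscsi_stop_conn() caller clearing the socket pointer. */
static void other_stop_clears_sock(struct model_conn *conn)
{
    conn->sock = NULL;
}

/*
 * Models the restore-callbacks step, which reads conn->sock again.
 * The model reports a NULL pointer instead of dereferencing it; per the
 * analysis above, the real function has no such check.
 */
static bool model_restore_callbacks(struct model_conn *conn)
{
    struct model_sock *sk = conn->sock;

    if (!sk)
        return false;
    sk->callbacks_restored = 1;
    return true;
}

/*
 * Models the release step: NULL check first, use of the pointer later.
 * other_runs_in_window simulates the parallel stop call landing between
 * the check and the use.
 */
static enum release_result model_release_conn(struct model_conn *conn,
                                              bool other_runs_in_window)
{
    assert(conn != NULL);

    if (!conn->sock)
        return RELEASE_SKIPPED;          /* the guarded early return */

    if (other_runs_in_window)
        other_stop_clears_sock(conn);    /* the race window */

    if (!model_restore_callbacks(conn))
        return RELEASE_WOULD_OOPS;       /* where the crash would occur */

    conn->sock = NULL;
    return RELEASE_DONE;
}
```

Calling model_release_conn(&conn, true) on a connection with a valid socket returns RELEASE_WOULD_OOPS, while the same call with false returns RELEASE_DONE and a second call afterwards returns RELEASE_SKIPPED.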

It would be simple enough to add a NULL check for the socket pointer in iscsi_sw_tcp_conn_restore_callbacks, but I'm not convinced that is the correct solution. It looks to me as though the resulting state of the session and connection would differ depending on which of the two calls executes first. If the cleanup thread successfully stops the connection with STOP_CONN_RECOVER, it sets the socket pointer in the connection to NULL, which short-circuits the STOP_CONN_TERM stop from iscsid and keeps it from modifying the connection/session state.

Also, I noticed that the cleanup thread calls iscsi_stop_conn while holding ep_mutex, whereas the call made on behalf of iscsid does not. Should the iscsid-initiated call to iscsi_stop_conn also be made while holding ep_mutex?
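
As a rough illustration of what serializing the two callers would buy, here is a userspace pthread sketch. It is only a model under my own assumptions: stop_lock stands in for whatever shared lock the kernel would use (possibly ep_mutex, which is the open question), and model_conn, model_stop_conn, run_both_stops and the MODEL_STOP_* values are all hypothetical names, not kernel identifiers.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Illustrative flag values; not the kernel's STOP_CONN_* definitions. */
enum { MODEL_STOP_TERM = 1, MODEL_STOP_RECOVER = 2 };

/* Hypothetical userspace model of a connection. */
struct model_conn {
    pthread_mutex_t stop_lock;  /* stands in for a shared lock such as ep_mutex */
    void *sock;                 /* non-NULL while the connection owns a socket */
    int releases;               /* how many callers actually released the socket */
    int applied_flag;           /* flag of the caller that won */
};

struct stop_args { struct model_conn *conn; int flag; };

/* The check and the release happen under one lock, so no window remains. */
static void model_stop_conn(struct model_conn *conn, int flag)
{
    pthread_mutex_lock(&conn->stop_lock);
    if (conn->sock) {
        conn->sock = NULL;
        conn->releases++;
        conn->applied_flag = flag;
    }
    pthread_mutex_unlock(&conn->stop_lock);
}

static void *stop_thread(void *p)
{
    struct stop_args *a = p;

    model_stop_conn(a->conn, a->flag);
    return NULL;
}

/*
 * Runs the TERM and RECOVER paths concurrently.
 * Returns the number of releases performed, or -1 if a thread
 * could not be created.
 */
static int run_both_stops(struct model_conn *conn)
{
    pthread_t term, recover;
    struct stop_args a = { conn, MODEL_STOP_TERM };
    struct stop_args b = { conn, MODEL_STOP_RECOVER };

    assert(conn != NULL);

    if (pthread_create(&term, NULL, stop_thread, &a) != 0)
        return -1;
    if (pthread_create(&recover, NULL, stop_thread, &b) != 0) {
        pthread_join(term, NULL);
        return -1;
    }
    pthread_join(term, NULL);
    pthread_join(recover, NULL);
    return conn->releases;
}
```

With the shared lock, exactly one caller releases the socket and the other sees a NULL pointer at the guarded check. Note that applied_flag still depends on which thread gets the lock first, so serializing removes the crash in this model but not the ordering dependence described above.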

Thanks in advance, 
Adam 