Segfaults from iscsiuio (iscsiuio/src/unix/nic_nl.c)

31 views
Skip to first unread message

Frank Fegert

unread,
Jul 28, 2016, 5:16:12 AM7/28/16
to open-...@googlegroups.com
Hello all,

disclaimer: i'm not a programmer, so the following might be utterly
and completely wrong ;-)

TL;DR: i'm getting segfaults from iscsiuio upon any target login.
Specifically this happens in iscsiuio/src/unix/nic_nl.c. Debugging
this lead me to believe this is a case of trying to unlock a not
locked mutex. A quick and dirty hack which works around this is
available here:
https://github.com/frank-fegert/open-iscsi/commit/9f770f9eb0f302d146d455f1d68648e2d0172eb6

There is probably room for a proper fix (e.g. counter on the number of
locks?), which considers the semantics of the whole code.


The longer explanation:

My setup involves 6 Dell M630 hosts (host1, host{5,6,7,8,9}), all with
BCM57810 iSOEs. BCM57810 firmware, software (Debian 8) and targets are
exactly the same on all hosts. I'm using the Debian open-iscsi package,
which is rebuild in order to include iscsiuio and has the changes up to
Git commit 0fa43f29 - but excluding c6d1117b and 76832662 (externalization
of the open-isns code) - backported. The only difference is, host1 has
Intel E5 v3 CPUs, host{5,6,7,8,9} have Intel E5 v4 CPUs.

On host1 everything works fine, iscsiuio runs as expected, access to
targets is working flawlessly.
On host{5,6,7,8,9} i'm getting segfaults like the one in the example
shown below, while trying to log in to any target.

Searching for "__lll_unlock_elision" in conjunction with "pthread_*"
led me to the following resources:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=800574
https://lwn.net/Articles/534758/

which are pointing towards a CPU (Broadwell and Skylake) specific pro-
blem when not carefully using mutexes. The general opinion from there
and other related bug reports seems to be, that the application be-
haviour should be changed, in order to fix such an issue.

Tracking this issue further down i ended up in the function "nl_process_
handle_thread" in iscsiuio/src/unix/nic_nl.c and specifically here:

486 rc = pthread_cond_wait(&nic->nl_process_cond,
487 &nic->nl_process_mutex);

From the pthread_cond_wait manpage:
The pthread_cond_timedwait() and pthread_cond_wait() functions shall
block on a condition variable. They shall be called with mutex locked
by the calling thread or undefined behavior results.

On the first pass of the loop, this constraint seems to be true. At
the end of the loop at:

499 pthread_mutex_unlock(&nic->nl_process_mutex);

the mutex is then unlocked. Thus - if i understand the code right -
the above constraint is no longer met on the subsequent passes of the
loop. On Intel E5 v3 this seemed to be tolerated and without any impact.
But on Intel E5 v4 (and other CPUs implementing HLE and RTM) this IMHO
causes the observed segfault.

Could someone more familiar with mutex handling in phtreads and/or the
semantics of the iscsiuio code please take a look at this? I'd be inter-
ested if my analysis is correct and whether my quick'n'dirty fix has
any major side-effects. And - of course - what a proper fix for the ob-
served segfault would look like ;-)

Thanks & best regards,

Frank


host5:~# gdb /sbin/iscsiuio
GNU gdb (Debian 7.7.1+dfsg-5) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /sbin/iscsiuio...(no debugging symbols found)...done.

(gdb) run -d 4 -f
Starting program: /sbin/iscsiuio -d 4 -f
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
INFO [Wed Jul 27 10:01:45 2016]Initialize logger using log file: /var/log/iscsiuio.log
INFO [Wed Jul 27 10:01:45 2016]Started iSCSI uio stack: Ver 0.7.8.2
INFO [Wed Jul 27 10:01:45 2016]Build date: Fri Jul 22 15:40:04 CEST 2016
INFO [Wed Jul 27 10:01:45 2016]Debug mode enabled
INFO [Wed Jul 27 10:01:45 2016]Running on sysname: 'Linux', release: '3.16.0-4-amd64', version '#1 SMP Debian 3.16.7-ckt25-2+deb8u3 (2016-07-02)' machine: 'x86_64'
DBG [Wed Jul 27 10:01:45 2016]Loaded nic library 'bnx2' Version: '0.7.8.2' build on Fri Jul 22 15:40:04 CEST 2016'
DBG [Wed Jul 27 10:01:45 2016]Added 'bnx2' nic library
DBG [Wed Jul 27 10:01:45 2016]Loaded nic library 'bnx2x' Version: '0.7.8.2' build on Fri Jul 22 15:40:04 CEST 2016'
DBG [Wed Jul 27 10:01:45 2016]Added 'bnx2x' nic library
[New Thread 0x7ffff760f700 (LWP 4942)]
INFO [Wed Jul 27 10:01:45 2016]signal handling thread ready
INFO [Wed Jul 27 10:01:45 2016]nic_utils Found host[11]: host11
INFO [Wed Jul 27 10:01:45 2016]Done capturing /sys/class/iscsi_host/host11/netdev
INFO [Wed Jul 27 10:01:45 2016]Done capturing /sys/class/iscsi_host/host11/netdev
INFO [Wed Jul 27 10:01:45 2016]nic_utils looking for uio device for eth3
WARN [Wed Jul 27 10:01:45 2016]Could not find assoicate uio device with eth3
ERR [Wed Jul 27 10:01:45 2016]nic_utils Could not determine UIO name for eth3
INFO [Wed Jul 27 10:01:45 2016]nic_utils Found host[12]: host12
INFO [Wed Jul 27 10:01:45 2016]Done capturing /sys/class/iscsi_host/host12/netdev
INFO [Wed Jul 27 10:01:45 2016]Done capturing /sys/class/iscsi_host/host12/netdev
INFO [Wed Jul 27 10:01:45 2016]nic_utils looking for uio device for eth2
INFO [Wed Jul 27 10:01:45 2016]nic_utils eth2 associated with uio0
INFO [Wed Jul 27 10:01:45 2016]nic_utils NIC not found creating an instance for host_no: 12 eth2
DBG [Wed Jul 27 10:01:45 2016]Could not increase process priority: Success
[New Thread 0x7ffff6e0e700 (LWP 4943)]
DBG [Wed Jul 27 10:01:45 2016]iscsi_ipc Started iscsid listening thread
DBG [Wed Jul 27 10:01:45 2016]iscsi_ipc Waiting for iscsid command
INFO [Wed Jul 27 10:01:45 2016]NIC_NL Netlink to CNIC on pid 4938 is ready
DBG [Wed Jul 27 10:01:57 2016]iscsi_ipc recv iscsid request: cmd: 1, payload_len: 11720
INFO [Wed Jul 27 10:01:57 2016]iscsi_ipc Received request for 'eth2' to set IP address: '10.0.1.62' VLAN: '0'
INFO [Wed Jul 27 10:01:57 2016]iscsi_ipc Using netmask: 0.0.0.0
INFO [Wed Jul 27 10:01:57 2016]iscsi_ipc eth2, using existing NIC
INFO [Wed Jul 27 10:01:57 2016]nic_utils looking for uio device for eth2
INFO [Wed Jul 27 10:01:57 2016]nic_utils eth2 associated with uio0
INFO [Wed Jul 27 10:01:57 2016]Done capturing /sys/class/uio/uio0/name
INFO [Wed Jul 27 10:01:57 2016]nic_utils eth2: Verified uio name bnx2x_cnic with library bnx2x
INFO [Wed Jul 27 10:01:57 2016]eth2: found NIC with library 'bnx2x'
INFO [Wed Jul 27 10:01:57 2016]iscsi_ipc eth2 library set using transport_name bnx2i
INFO [Wed Jul 27 10:01:57 2016]iscsi_ipc eth2: requesting configuration using static IP address
DBG [Wed Jul 27 10:01:57 2016]iscsi_ipc eth2 couldn't find interface with ip_type: 0x2 creating it
INFO [Wed Jul 27 10:01:57 2016]nic eth2: Added nic interface for VLAN: 0, protocol: 2
INFO [Wed Jul 27 10:01:57 2016]iscsi_ipc eth2: created network interface
[New Thread 0x7ffff660d700 (LWP 4947)]
WARN [Wed Jul 27 10:01:57 2016]nic_utils eth2: device already disabled: flag: 0x1088 state: 0x1
INFO [Wed Jul 27 10:01:57 2016]iscsi_ipc eth2: configuring using static IP IPv4 address :10.0.1.62
INFO [Wed Jul 27 10:01:57 2016]iscsi_ipc netmask: 255.255.255.0
[New Thread 0x7ffff5e0c700 (LWP 4948)]
INFO [Wed Jul 27 10:01:57 2016]iscsi_ipc ISCSID_UIP_IPC_GET_IFACE: command: 1 name: bnx2i.d0:43:1e:51:98:53, netdev: eth2 ipaddr: 10.0.1.62 vlan: 0 transport_name:bnx2i
INFO [Wed Jul 27 10:01:57 2016]nic_utils eth2: spinning up thread for nic
DBG [Wed Jul 27 10:01:57 2016]iscsi_ipc Waiting for iscsid command
[New Thread 0x7ffff560b700 (LWP 4949)]
DBG [Wed Jul 27 10:01:57 2016]nic eth2: Waiting to be enabled
INFO [Wed Jul 27 10:01:57 2016]Created nic thread: eth2
INFO [Wed Jul 27 10:01:57 2016]iscsi_ipc eth2: started NIC enable thread state: 0x1
DBG [Wed Jul 27 10:01:57 2016]nic eth2: is now enabled
INFO [Wed Jul 27 10:01:57 2016]bnx2x eth2: bnx2x driver using version 1.78.19
ERR [Wed Jul 27 10:01:58 2016]bnx2x /dev/uio0: uio device has been brought up via pid: 4938 on fd: 7
INFO [Wed Jul 27 10:01:58 2016]Done capturing /sys/class/uio/uio0/name
INFO [Wed Jul 27 10:01:58 2016]bnx2x eth2: Verified is a cnic_uio device
DBG [Wed Jul 27 10:01:58 2016]bnx2x eth2: using rx ring size: 15, rx buffer size: 1024
INFO [Wed Jul 27 10:01:58 2016]Done capturing /sys/class/uio/uio0/event
DBG [Wed Jul 27 10:01:58 2016]bnx2x Chip ID: 168e1000
INFO [Wed Jul 27 10:01:58 2016]nic_id eth2: is found at 03:00.00
INFO [Wed Jul 27 10:01:58 2016]bnx2x eth2: func 0x0, pfid 0x0, client_id 0x88, cid 0x1
DBG [Wed Jul 27 10:01:58 2016]bnx2x eth2: mode = 0x100
INFO [Wed Jul 27 10:01:58 2016]bnx2x eth2: Using mac address: d0:43:1e:51:98:53
INFO [Wed Jul 27 10:01:58 2016]eth2: bnx2x initialized
INFO [Wed Jul 27 10:01:58 2016]nic eth2: Initialized ip stack: VLAN: 0
INFO [Wed Jul 27 10:01:58 2016]nic eth2: mac: d0:43:1e:51:98:53
INFO [Wed Jul 27 10:01:58 2016]nic eth2: Using IP address: 10.0.1.62
INFO [Wed Jul 27 10:01:58 2016]nic eth2: Using netmask: 255.255.255.0
INFO [Wed Jul 27 10:01:58 2016]nic eth2: enabled vlan 0 protocol: 2
INFO [Wed Jul 27 10:01:58 2016]nic eth2: entering main nic loop
DBG [Wed Jul 27 10:01:58 2016]nic_utils eth2: device enabled
[Thread 0x7ffff5e0c700 (LWP 4948) exited]
DBG [Wed Jul 27 10:01:59 2016]iscsi_ipc recv iscsid request: cmd: 1, payload_len: 11720
INFO [Wed Jul 27 10:01:59 2016]iscsi_ipc Received request for 'eth2' to set IP address: '10.0.1.62' VLAN: '0'
INFO [Wed Jul 27 10:01:59 2016]iscsi_ipc Using netmask: 0.0.0.0
INFO [Wed Jul 27 10:01:59 2016]iscsi_ipc eth2, using existing NIC
INFO [Wed Jul 27 10:01:59 2016]nic_utils looking for uio device for eth2
INFO [Wed Jul 27 10:01:59 2016]nic_utils eth2 associated with uio0
INFO [Wed Jul 27 10:01:59 2016]eth2: Have NIC library 'bnx2x'
INFO [Wed Jul 27 10:01:59 2016]Done capturing /sys/class/uio/uio0/name
INFO [Wed Jul 27 10:01:59 2016]nic_utils eth2: Verified uio name bnx2x_cnic with library bnx2x
INFO [Wed Jul 27 10:01:59 2016]eth2: found NIC with library 'bnx2x'
INFO [Wed Jul 27 10:01:59 2016]iscsi_ipc eth2 library set using transport_name bnx2i
INFO [Wed Jul 27 10:01:59 2016]iscsi_ipc eth2: requesting configuration using static IP address
INFO [Wed Jul 27 10:01:59 2016]iscsi_ipc eth2: using existing network interface
INFO [Wed Jul 27 10:01:59 2016]iscsi_ipc eth2: IP configuration didn't change using 0x2
INFO [Wed Jul 27 10:01:59 2016]iscsi_ipc eth2: NIC already enabled flags: 0x1084 state: 0x4

INFO [Wed Jul 27 10:01:59 2016]iscsi_ipc ISCSID_UIP_IPC_GET_IFACE: command: 1 name: bnx2i.d0:43:1e:51:98:53, netdev: eth2 ipaddr: 10.0.1.62 vlan: 0 transport_name:bnx2i
DBG [Wed Jul 27 10:01:59 2016]iscsi_ipc Waiting for iscsid command
INFO [Wed Jul 27 10:02:00 2016]NIC_NL Received path_req for host 12
INFO [Wed Jul 27 10:02:00 2016]Done capturing /sys/class/iscsi_host/host12/netdev
DBG [Wed Jul 27 10:02:00 2016]NIC_NL Pulled nl event
INFO [Wed Jul 27 10:02:00 2016]NIC_NL eth2: Processing 'path_req'
DBG [Wed Jul 27 10:02:00 2016]NIC_NL eth2: PATH_REQ with iface_num -1 VLAN 32768
DBG [Wed Jul 27 10:02:00 2016]CNIC eth2: Netlink message with VLAN ID: 0, path MTU: 9000 minor: 0 ip_addr_len: 4
DBG [Wed Jul 27 10:02:00 2016]CNIC eth2: src=10.0.1.62
DBG [Wed Jul 27 10:02:00 2016]CNIC eth2: dst=10.0.1.2
DBG [Wed Jul 27 10:02:00 2016]CNIC eth2: nm=255.255.255.0
INFO [Wed Jul 27 10:02:00 2016]CNIC eth2: Didn't find IPv4: '10.0.1.2' in ARP table
DBG [Wed Jul 27 10:02:00 2016]CNIC eth2: Sent cnic arp request for IP: 10.0.1.2
INFO [Wed Jul 27 10:02:00 2016]Found 10.0.1.2 at b0:83:fe:cc:57:bb
DBG [Wed Jul 27 10:02:00 2016]CNIC neighbor reply sent back to kernel 10.0.1.62 at b0:83:fe:cc:57:bb with vlan 0
INFO [Wed Jul 27 10:02:00 2016]NIC_NL eth2: 'path_req' operation finished

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff660d700 (LWP 4947)]
__lll_unlock_elision (lock=0x55555577fd40, private=0) at ../nptl/sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
29 ../nptl/sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file or directory.


(gdb) info threads
Id Target Id Frame
6 Thread 0x7ffff560b700 (LWP 4949) "iscsiuio" 0x00007ffff76f1ae3 in select () at ../sysdeps/unix/syscall-template.S:81
* 4 Thread 0x7ffff660d700 (LWP 4947) "iscsiuio" __lll_unlock_elision (lock=0x55555577fd40, private=0) at ../nptl/sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
3 Thread 0x7ffff6e0e700 (LWP 4943) "iscsiuio" 0x00007ffff79c9ccd in accept () at ../sysdeps/unix/syscall-template.S:81
2 Thread 0x7ffff760f700 (LWP 4942) "iscsiuio" do_sigwait (set=<optimized out>, sig=0x7ffff760eeac) at ../nptl/sysdeps/unix/sysv/linux/../../../../../sysdeps/unix/sysv/linux/sigwait.c:63
1 Thread 0x7ffff7fea700 (LWP 4938) "iscsiuio" 0x00007ffff79c9e9d in recvmsg () at ../sysdeps/unix/syscall-template.S:81


(gdb) thread apply all bt

Thread 6 (Thread 0x7ffff560b700 (LWP 4949)):
#0 0x00007ffff76f1ae3 in select () at ../sysdeps/unix/syscall-template.S:81
#1 0x000055555555ac06 in ?? ()
#2 0x000055555555d39e in nic_loop ()
#3 0x00007ffff79c30a4 in start_thread (arg=0x7ffff560b700) at pthread_create.c:309
#4 0x00007ffff76f887d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 4 (Thread 0x7ffff660d700 (LWP 4947)):
#0 __lll_unlock_elision (lock=0x55555577fd40, private=0) at ../nptl/sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
#1 0x00007ffff79c7007 in pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:94
#2 0x000055555555e803 in nl_process_handle_thread ()
#3 0x00007ffff79c30a4 in start_thread (arg=0x7ffff660d700) at pthread_create.c:309
#4 0x00007ffff76f887d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 3 (Thread 0x7ffff6e0e700 (LWP 4943)):
#0 0x00007ffff79c9ccd in accept () at ../sysdeps/unix/syscall-template.S:81
#1 0x00005555555641b0 in ?? ()
#2 0x00007ffff79c30a4 in start_thread (arg=0x7ffff6e0e700) at pthread_create.c:309
#3 0x00007ffff76f887d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 2 (Thread 0x7ffff760f700 (LWP 4942)):
#0 do_sigwait (set=<optimized out>, sig=0x7ffff760eeac) at ../nptl/sysdeps/unix/sysv/linux/../../../../../sysdeps/unix/sysv/linux/sigwait.c:63
#1 0x00007ffff79ca693 in __sigwait (set=0x7ffff760eeb0, sig=0x0) at ../nptl/sysdeps/unix/sysv/linux/../../../../../sysdeps/unix/sysv/linux/sigwait.c:97
#2 0x000055555555a49c in _start ()

Thread 1 (Thread 0x7ffff7fea700 (LWP 4938)):
#0 0x00007ffff79c9e9d in recvmsg () at ../sysdeps/unix/syscall-template.S:81
#1 0x000055555555e5e9 in ?? ()
#2 0x000055555555eea8 in nic_nl_open ()
#3 0x000055555555a1b8 in main ()

The Lee-Man

unread,
Jul 30, 2016, 8:10:20 PM7/30/16
to open-iscsi, fra.no...@gmx.de
On Thursday, July 28, 2016 at 2:16:12 AM UTC-7, Frank Fegert wrote:
Hello all,

Hi Frank.

disclaimer: i'm not a programmer, so the following might be utterly
and completely wrong ;-)

TL;DR: i'm getting segfaults from iscsiuio upon any target login.
Specifically this happens in iscsiuio/src/unix/nic_nl.c. Debugging
this lead me to believe this is a case of trying to unlock a not
locked mutex. A quick and dirty hack which works around this is
available here:
  https://github.com/frank-fegert/open-iscsi/commit/9f770f9eb0f302d146d455f1d68648e2d0172eb6

There is probably room for a proper fix (e.g. counter on the number of
locks?), which considers the semantics of the whole code.

I looked at your suggested fix, and I think it's almost correct.

The caller should indeed have the mutex locked when calling
pthread_cond_wait. It's probably just luck this hasn't blown up before.

But I think it just needs one tweak. You need to call
pthread_mutex_unlock() in the error case, i.e. in the
"rc != 0" case, before calling LOG_ERR().

Can you submit a patch or pull request updated with this
change?

Frank Fegert

unread,
Aug 1, 2016, 11:27:31 AM8/1/16
to The Lee-Man, open-iscsi
Hello Lee,

On Sat, Jul 30, 2016 at 05:10:20PM -0700, The Lee-Man wrote:
> I looked at your suggested fix, and I think it's almost correct.
>
> The caller should indeed have the mutex locked when calling
> pthread_cond_wait. It's probably just luck this hasn't blown up before.
>
> But I think it just needs one tweak. You need to call
> pthread_mutex_unlock() in the error case, i.e. in the
> "rc != 0" case, before calling LOG_ERR().
>
> Can you submit a patch or pull request updated with this
> change?

thank you very much for taking the time to look at this issue!
You're of course right about the missing "pthread_mutex_unlock()"
call in the error handling section. I've added this in the latest
Git commit and created a pull request regarding the attached patch.
open-iscsi__iscsiuio__un-lock_of_not_locked_mutex.patch
Reply all
Reply to author
Forward
0 new messages