Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

[Bug 211990] iscsi fails to reconnect and does not release devices

134 views
Skip to first unread message

bugzilla...@freebsd.org

unread,
Aug 19, 2016, 5:41:24 AM8/19/16
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

Bug ID: 211990
Summary: iscsi fails to reconnect and does not release devices
Product: Base System
Version: 10.3-RELEASE
Hardware: Any
OS: Any
Status: New
Severity: Affects Only Me
Priority: ---
Component: kern
Assignee: freebs...@FreeBSD.org
Reporter: ben.r...@gmail.com

Hello,

I'm facing an issue where iscsictl does not want to remove devices.
Here is how I can reproduce this.

### Initiator :
# iscsictl -Aa
iscsictl then reports the 17 targets as connected, perfect.

### Target :
Let's switch down the network interface
# ifconfig mlxen1 down

### Initiator :
iscsictl reports the 17 targets as disconnected, perfect.

### Target :
Let's switch up the network interface
# ifconfig mlxen1 up

### Initiator :
iscsictl reports the 17 targets as connected, however, for 4 devices, I get the
following :
09:59:43 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:lg2): timed out
waiting for iscsid(8) for 11 seconds; reconnecting
09:59:54 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:lg2): timed out
waiting for iscsid(8) for 11 seconds; reconnecting
09:59:57 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:lg2): handoff on
already connected session
07:59:57 srv1 iscsid[1372]: 192.168.2.2 (iqn.2012-06.srv2:lg2): ISCSIDHANDOFF:
Device busy
09:59:57 srv1 iscsid[581]: child process 1372 terminated with exit status 1
09:59:57 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:lg2): connection
error; reconnecting
09:59:57 srv1 kernel: (da21:iscsi8:0:0:0): got CAM status 0x8
09:59:57 srv1 kernel: (da21:iscsi8:0:0:0): fatal error, failed to attach to
device
10:00:07 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:lg2): no ping
reply (NOP-In) after 10 seconds; reconnecting
10:00:08 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:lg2): no ping
reply (NOP-In) after 10 seconds; reconnecting

### Target :
09:58:50 srv2 kernel: mlxen1: link state changed to DOWN
09:58:50 srv2 kernel: mlx4_en: mlxen1: Link Down
09:58:53 srv2 kernel: WARNING: 192.168.2.1 (iqn.1994-09.org.freebsd:srv1): no
ping reply (NOP-Out) after 5 seconds; dropping connection
09:58:53 srv2 last message repeated 16 times
09:59:49 srv2 kernel: mlx4_en: mlxen1: Link Up
09:59:49 srv2 kernel: mlxen1: link state changed to UP
09:59:49 srv2 devd: Executing '/etc/rc.d/dhclient quietstart mlxen1'
09:59:59 srv2 kernel: WARNING: 192.168.2.1 (iqn.1994-09.org.freebsd:srv1):
connection error; dropping connection
09:59:59 srv2 last message repeated 3 times

### Initiator :
# iscsictl -Ra
# iscsictl -L
Target name Target portal State
iqn.2012-06.srv2:sW1 192.168.2.2 Connected: da18
iqn.2012-06.srv2:sW2 192.168.2.2 Connected: da23
iqn.2012-06.srv2:rT3 192.168.2.2 Connected: da17
iqn.2012-06.srv2:lg2 192.168.2.2 Connected: da21

As you can see, the 4 problematic devices remain "connected", nodes exist in
/dev/, but they are unusable.

Each time I "iscsictl -Ra", I get the following on initiator side :
10:09:35 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:lg2): connection
error; reconnecting
10:09:35 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:sW1): connection
error; reconnecting
10:09:35 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:rT3): connection
error; reconnecting
10:09:35 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:sW2): connection
error; reconnecting

No logs however on target side, even if I start ctld with -d.

The only workaround I found is to reboot, or to change the target name to
properly reconnect...

# uname -v
FreeBSD 10.3-RELEASE-p7 #0: Thu Aug 11 18:38:15 UTC 2016
ro...@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC

Thank you for your support,

Best regards,

Ben

--
You are receiving this mail because:
You are the assignee for the bug.
_______________________________________________
freebs...@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs...@freebsd.org"

bugzilla...@freebsd.org

unread,
Aug 19, 2016, 6:05:06 AM8/19/16
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

--- Comment #1 from Ben RUBSON <ben.r...@gmail.com> ---
On initiator, netstat also returns the following :
tcp4 0 0 192.168.2.1.49715 192.168.2.2.3260 CLOSED
tcp4 0 0 192.168.2.1.49708 192.168.2.2.3260 CLOSED
tcp4 0 0 192.168.2.1.49707 192.168.2.2.3260 CLOSED
tcp4 0 0 192.168.2.1.49706 192.168.2.2.3260 CLOSED

bugzilla...@freebsd.org

unread,
Aug 19, 2016, 8:16:15 AM8/19/16
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

--- Comment #2 from Ben RUBSON <ben.r...@gmail.com> ---
Quite difficult to reproduce, but I managed to do it again.
This time the problematic target (only one /17) is seen as disconnected.

### Initiator :
# iscsictl -Ra
# iscsictl -L
Target name Target portal State
iqn.2012-06.srv2:ch3 192.168.2.2 Disconnected
# netstat -an | grep 3260
// nothing returned

### Initiator logs :
12:38:02 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:ch3): login timed
out after 11 seconds; reconnecting
12:38:06 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:ch3): handoff on
already connected session
12:38:06 srv1 iscsid[7395]: 192.168.2.2 (iqn.2012-06.srv2:ch3): ISCSIDHANDOFF:
Device busy
12:38:06 srv1 iscsid[581]: child process 7395 terminated with exit status 1
12:38:06 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:ch3): connection
error; reconnecting
12:38:06 srv1 kernel: da8 at iscsi63 bus 0 scbus1 target 0 lun 0
12:38:06 srv1 kernel: da8: <FREEBSD CTLDISK 0001> s/n G22020PY1EZHAch detached
12:38:16 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:ch3): no ping
reply (NOP-In) after 10 seconds; reconnecting

### Target logs :
12:37:41 srv2 kernel: mlxen1: link state changed to DOWN
12:37:41 srv2 kernel: mlx4_en: mlxen1: Link Down
12:37:46 srv2 kernel: WARNING: 192.168.2.1 (iqn.1994-09.org.freebsd:srv1): no
ping reply (NOP-Out) after 5 seconds; dropping connection
12:37:46 srv2 last message repeated 16 times
12:38:05 srv2 kernel: mlx4_en: mlxen1: Link Up
12:38:05 srv2 kernel: mlxen1: link state changed to UP
12:38:05 srv2 devd: Executing '/etc/rc.d/dhclient quietstart mlxen1'
12:38:09 srv2 kernel: WARNING: 192.168.2.1 (iqn.1994-09.org.freebsd:srv1):
connection error; dropping connection

On initiator side, I also get the same log messages as above for the other
devices, but they correctly reconnect.

bugzilla...@freebsd.org

unread,
Aug 19, 2016, 10:36:30 AM8/19/16
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

--- Comment #3 from Ben RUBSON <ben.r...@gmail.com> ---
(In reply to Ben RUBSON from comment #0)
>The only workaround I found is to reboot

I even have to hard-reboot, because of this never ending :
iscsi_terminate_sessions: waiting for sessions to terminate

bugzilla...@freebsd.org

unread,
Aug 19, 2016, 6:19:44 PM8/19/16
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

--- Comment #4 from Ben RUBSON <ben.r...@gmail.com> ---
One strange thing I noticed.
(I put all things that could be interesting from my troubleshooting)

As soon as I put the network interface down, I get the following message on
target side, one per target :
17:01:00 srv2 kernel: WARNING: 192.168.2.1 (iqn.1994-09.org.freebsd:srv1): no
ping reply (NOP-Out) after 5 seconds; dropping connection

Then, on initiator side, I get these messages for each target :
Aug 19 17:01:07 srv1 kernel: iscsi_maintenance_thread_reconnect: 192.168.2.2
(iqn.2012-06.srv2:hm4): connection failed, destroying devices
Aug 19 17:01:07 srv1 kernel: iscsi_session_cleanup: 192.168.2.2
(iqn.2012-06.srv2:hm4): freezing
Aug 19 17:01:07 srv1 kernel: iscsi_session_cleanup: 192.168.2.2
(iqn.2012-06.srv2:hm4): deregistering SIM

At this moment, on initiator side, one iscsid process per target appears.

10 seconds later, on initiator side, I get these messages for each target :
Aug 19 17:01:18 srv1 kernel: WARNING: 192.168.2.2 (iqn.2012-06.srv2:hm4): login
timed out after 11 seconds; reconnecting
Aug 19 17:01:18 srv1 kernel: iscsi_maintenance_thread_reconnect: 192.168.2.2
(iqn.2012-06.srv2:hm4): connection failed, destroying devices

And at the same time, a second iscsid process per target appears, so that I get
2 iscsid processes per target :
# ps auxxw | grep iscsid:
root 866 0.0 0.0 16632 2144 - I 4:58pm 0:00.00 iscsid:
192.168.2.2 (iqn.2012-06.srv2:hm4) (iscsid)
root 881 0.0 0.0 16632 2144 - I 4:58pm 0:00.00 iscsid:
192.168.2.2 (iqn.2012-06.srv2:hm4) (iscsid)
(...)
However sounds like there is a limit to 30 processes, as for 17 targets I would
have expected 34 processes, but I only get 30.

If I put the NIC up before the second process is created, I only get one
reconnection message per target in target logs.
If I put the NIC up after the second process is created, I get a lot more
reconnection messages in target logs, between 40 and 50 for 17 targets.

Do we expect these additional processes ?
I think we would only expect one process / one reconnection message per target
?
Seems strange to have all these "duplicated" connection retries.

Another related question to the "30" processes found :
Is there any limit to 30 targets ?
I found a maxproc option in ctl.conf (default to 30) but I don't exactly know
what it means (I tested values of 1 to 50 without seeing any change).
No option found however on initiator side.

I noticed that we can reproduce this bug easier when we "stress" the devices :
disconnect network as soon as targets are reconnected, and reconnect it as soon
as they are disconnected.



Additionally to this, I had 8 kernel crashes, initator or target, each time
with the same address / pointer :
kernel: Fatal trap 12: page fault while in kernel mode
kernel: fault virtual address = 0x1e8
kernel: instruction pointer = 0x20:0xffffffff80936933

I also got a stacktrace, but did not get it's pointer address.
http://img4.hostingpics.net/pics/707217211990.png

I'm also trying to get a full dump.
However I'm not sure this kernel crash issue is related to the reconnection
issue, perhaps there are 2 issues.

# uname -v
FreeBSD 10.3-RELEASE-p7 #0: Thu Aug 11 18:38:15 UTC 2016
ro...@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC



A lot of info !
I hope we will be able to correct these issues.

Many thanks,

bugzilla...@freebsd.org

unread,
Aug 20, 2016, 1:30:08 PM8/20/16
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

--- Comment #5 from Ben RUBSON <ben.r...@gmail.com> ---
I made further troubleshooting.

Regarding the number of iscsid processes which increases, I've found timeout
settings which permit to always have only one process per target :
sysctl kern.iscsi.login_timeout=85
sysctl kern.iscsi.iscsid_timeout=5
sysctl kern.iscsi.ping_timeout=5
These are based on net.inet.tcp.keepinit which is 75000 by default (75
seconds).

Regarding the "30" processes limit, it can be tuned with the "-d" option to
iscsid (through /etc/rc.d/iscsid for example).

And the most important, regarding the bug :
when devices do not want to correctly reconnect, I found that it is because
iscsi is stuck in the following :
cam_sim_free(is->is_sim, TRUE /*free_devq*/);
However I don't know why.

bugzilla...@freebsd.org

unread,
Aug 21, 2016, 6:27:16 AM8/21/16
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

--- Comment #6 from Ben RUBSON <ben.r...@gmail.com> ---
(In reply to Ben RUBSON from comment #5)
>And the most important, regarding the bug :
>when devices do not want to correctly reconnect, I found that it is because
>iscsi is stuck in the following :
>cam_sim_free(is->is_sim, TRUE /*free_devq*/);
>However I don't know why.
I put some printf in cam_sim.c and found that iscsi is stuck by :
msleep(sim, sim->mtx, PRIBIO, "simfree", 0);
cam_sim_release() is never fired, so msleep is blocking iscsi.

bugzilla...@freebsd.org

unread,
Aug 21, 2016, 4:15:13 PM8/21/16
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

--- Comment #7 from Ben RUBSON <ben.r...@gmail.com> ---
I tried 11-RC1 and was not able to reproduce the issue.

With 10.3, after about 10 NIC toggles, some devices fail.
With 11-RC1, more than one thousand toggles later, all my devices are still
reconnecting correctly.

bugzilla...@freebsd.org

unread,
Sep 6, 2016, 4:48:21 AM9/6/16
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

Ben RUBSON <ben.r...@gmail.com> changed:

What |Removed |Added
----------------------------------------------------------------------------
Version|10.3-RELEASE |11.0-STABLE
Severity|Affects Only Me |Affects Some People
CC| |freebs...@FreeBSD.org,
| |tr...@FreeBSD.org

--- Comment #8 from Ben RUBSON <ben.r...@gmail.com> ---
I'm facing this issue in 11-RC2 where I have 2 iscsi devices which are
disconnected, don't want to reconnect, and I can't manage to remove them :

# iscsictl
Target name Target portal State
iqn.2012-06.g2:rT3 192.168.2.2 Disconnected
iqn.2012-06.g2:ch1 192.168.2.2 Disconnected

# iscsictl -Ra

# iscsictl -v
Session ID: 6
Initiator name: iqn.2012-06.b2g1
Initiator portal:
Initiator alias:
Target name: iqn.2012-06.g2:rT3
Target portal: 192.168.2.2
Target alias: G22020PY1EZHArT
User:
Secret:
Mutual user:
Mutual secret:
Session type: Normal
Enable: Yes
Session state: Disconnected
Failure reason:
Header digest: None
Data digest: None
DataSegmentLen: 131072
MaxBurstLen: 262144
FirstBurstLen: 131072
ImmediateData: Yes
iSER (RDMA): No
Offload driver: None
Device nodes: da11

Session ID: 10
Initiator name: iqn.2012-06.b2g1
Initiator portal:
Initiator alias:
Target name: iqn.2012-06.g2:ch1
Target portal: 192.168.2.2
Target alias: G22000PY1EVVAch
User:
Secret:
Mutual user:
Mutual secret:
Session type: Normal
Enable: Yes
Session state: Disconnected
Failure reason:
Header digest: None
Data digest: None
DataSegmentLen: 131072
MaxBurstLen: 262144
FirstBurstLen: 131072
ImmediateData: Yes
iSER (RDMA): No
Offload driver: None
Device nodes: da12

# ls -l /dev/da11 /dev/da12
ls: /dev/da11: No such file or directory
ls: /dev/da12: No such file or directory

Any help please ?

Thank you very much,

bugzilla...@freebsd.org

unread,
Sep 6, 2016, 4:57:11 AM9/6/16
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

Ben RUBSON <ben.r...@gmail.com> changed:

What |Removed |Added
----------------------------------------------------------------------------
Version|11.0-STABLE |10.3-RELEASE

--- Comment #9 from Ben RUBSON <ben.r...@gmail.com> ---
Mmmmhh these 2 devices were spares in a zpool, but strangely seen as available.
I removed these 2 spares from the zpool and then the 2 iscsi devices
disappeared properly from the system.
OK, not related then to the 10.3 bug above.

bugzilla...@freebsd.org

unread,
Sep 7, 2016, 5:55:53 AM9/7/16
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

Ben RUBSON <ben.r...@gmail.com> changed:

What |Removed |Added
----------------------------------------------------------------------------
Status|New |Closed
Resolution|--- |Feedback Timeout

--- Comment #10 from Ben RUBSON <ben.r...@gmail.com> ---
If I'm the only one interested in this bug report, it can then be closed, as I
now run FreeBSD 11.
Feel free then to re-open it if needed.

bugzilla...@freebsd.org

unread,
Oct 17, 2016, 5:18:24 AM10/17/16
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

Julien Cigar <jul...@perdition.city> changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |jul...@perdition.city

--- Comment #11 from Julien Cigar <jul...@perdition.city> ---
I got a similar issue on:

FreeBSD filer1.prod.lan 10.3-RELEASE-p7 FreeBSD 10.3-RELEASE-p7 #0: Thu Aug 11
18:38:15 UTC 2016
ro...@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64

with:

root@filer1:/etc/rc.d # sysctl -a|grep -i 'iscsi'
kern.iscsi.fail_on_shutdown: 1
kern.iscsi.fail_on_disconnection: 1
kern.iscsi.maxtags: 255
kern.iscsi.login_timeout: 5
kern.iscsi.iscsid_timeout: 5
kern.iscsi.ping_timeout: 5
kern.iscsi.debug: 1

On the initiator side:

WARNING: 10.20.30.31 (iqn.1994-09.org.freebsd:filer1.prod.lan): no ping reply
(NOP-Out) after 5 seconds; dropping connection
WARNING: 10.20.30.31 (iqn.1994-09.org.freebsd:filer1.prod.lan): connection
error; dropping connection
WARNING: 10.20.30.31 (iqn.1994-09.org.freebsd:filer1.prod.lan): connection
error; dropping connection

On the target side:

WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): no ping reply (NOP-In)
after 5 seconds; reconnecting
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In)
after 5 seconds; reconnecting
(da3:iscsi2:0:0:0): READ(10). CDB: 28 00 08 03 02 7a 00 00 01 00
(da3:iscsi2:0:0:0): CAM status: CCB request aborted by the host
(da3:(da2:iscsi1:0:0:0): READ(10). CDB: 28 00 02 d7 49 e7 00 00 20 00
iscsi2:0:(da2:iscsi1:0:0:0): CAM status: CCB request aborted by the host
0:(da2:0): iscsi1:0:Retrying command
0:0): Retrying command
(da3:iscsi2:0:0:0): READ(10). CDB: 28 00 02 d7 4a e7 00 00 20 00
(da2:iscsi1:0:0:0): READ(10). CDB: 28 00 02 d7 4a 07 00 00 20 00
(da3:iscsi2:0:0:0): CAM status: CCB request aborted by the host
(da2:iscsi1:0:0:0): CAM status: CCB request aborted by the host
(da3:(da2:iscsi2:0:iscsi1:0:0:0:0): 0): Retrying command
Retrying command
da3 at iscsi2 bus 0 scbus4 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> s/n MYSERIAL 1 detached
da2 at iscsi1 bus 0 scbus3 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> s/n MYSERIAL 0 detached
(da3:iscsi2:0:0:0): Periph destroyed
(da2:iscsi1:0:0:0): Periph destroyed
da2 at iscsi2 bus 0 scbus3 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da2: Serial Number MYSERIAL 1
da2: 150.000MB/s transfers
da2: Command Queueing enabled
da2: 1840144MB (471076881 4096 byte sectors)
da3 at iscsi1 bus 0 scbus4 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da3: Serial Number MYSERIAL 0
da3: 150.000MB/s transfers
da3: Command Queueing enabled
da3: 1840144MB (471076881 4096 byte sectors)
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In)
after 5 seconds; reconnecting
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): no ping reply (NOP-In)
after 5 seconds; reconnecting
(da3:iscsi1:0:0:0): READ(10). CDB: 28 00 02 d7 4c a7 00 00 20 00
(da3:iscsi1:0:0:0): CAM status: CCB request aborted by the host
(da2:iscsi2:0:0:0): READ(10). CDB: 28 00 02 e0 f3 d4 00 00 04 00
(da3:(da2:iscsi2:0:0:0): CAM status: CCB request aborted by the host
iscsi1:0:(da2:0:iscsi2:0:0): 0:Retrying command
0): Retrying command
(da3:iscsi1:0:0:0): READ(10). CDB: 28 00 02 d7 4d 07 00 00 20 00
(da2:iscsi2:0:0:0): READ(10). CDB: 28 00 02 d7 4b c7 00 00 20 00
(da3:iscsi1:0:0:0): CAM status: CCB request aborted by the host
(da2:iscsi2:0:0:0): CAM status: CCB request aborted by the host
(da3:(da2:iscsi1:0:iscsi2:0:0:0:0): 0): Retrying command
Retrying command
da3 at iscsi1 bus 0 scbus4 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> s/n MYSERIAL 0 detached
da2 at iscsi2 bus 0 scbus3 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> s/n MYSERIAL 1 detached
(da2:iscsi2:0:0:0): Periph destroyed
(da3:iscsi1:0:0:0): Periph destroyed
da2 at iscsi1 bus 0 scbus4 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da2: Serial Number MYSERIAL 0
da2: 150.000MB/s transfers
da2: Command Queueing enabled
da2: 1840144MB (471076881 4096 byte sectors)
da3 at iscsi2 bus 0 scbus3 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da3: Serial Number MYSERIAL 1
da3: 150.000MB/s transfers
da3: Command Queueing enabled
da3: 1840144MB (471076881 4096 byte sectors)
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In)
after 5 seconds; reconnecting
(da2:iscsi1:0:0:0): READ(6). CDB: 08 00 00 00 20 00
(da2:iscsi1:0:0:0): CAM status: CCB request aborted by the host
(da2:iscsi1:0:0:0): Retrying command
(da2:iscsi1:0:0:0): READ(6). CDB: 08 00 00 10 02 00
(da2:iscsi1:0:0:0): CAM status: CCB request aborted by the host
(da2:iscsi1:0:0:0): Retrying command
da2 at iscsi1 bus 0 scbus4 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> s/n MYSERIAL 0 detached
(da2:iscsi1:0:0:0): Periph destroyed
da2 at iscsi1 bus 0 scbus4 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da2: Serial Number MYSERIAL 0
da2: 150.000MB/s transfers
da2: Command Queueing enabled
da2: 1840144MB (471076881 4096 byte sectors)
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): no ping reply (NOP-In)
after 5 seconds; reconnecting
(da3:iscsi2:0:0:0): READ(6). CDB: 08 00 00 20 20 00
(da3:iscsi2:0:0:0): CAM status: CCB request aborted by the host
(da3:iscsi2:0:0:0): Retrying command
da3 at iscsi2 bus 0 scbus3 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> s/n MYSERIAL 1 detached
(da3:iscsi2:0:0:0): Periph destroyed
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In)
after 5 seconds; reconnecting
(da2:iscsi1:0:0:0): READ(6). CDB: 08 00 00 00 20 00
(da2:iscsi1:0:0:0): CAM status: CCB request aborted by the host
(da2:iscsi1:0:0:0): Retrying command
da2 at iscsi1 bus 0 scbus4 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> s/n MYSERIAL 0 detached
(da2:iscsi1:0:0:0): Periph destroyed
da2 at iscsi1 bus 0 scbus3 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da2: Serial Number MYSERIAL 0
da2: 150.000MB/s transfers
da2: Command Queueing enabled
da2: 1840144MB (471076881 4096 byte sectors)
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): login timed out after 6
seconds; reconnecting
da3 at iscsi2 bus 0 scbus4 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da3: Serial Number MYSERIAL 1
da3: 150.000MB/s transfers
da3: Command Queueing enabled
da3: 1840144MB (471076881 4096 byte sectors)
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In)
after 5 seconds; reconnecting
da2 at iscsi1 bus 0 scbus3 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> s/n MYSERIAL 0 detached
(da2:iscsi1:0:0:0): READ(10). CDB: 28 00 1c 14 10 0f 00 00 01 00
(da2:iscsi1:0:0:0): CAM status: CCB request aborted by the host
(da2:iscsi1:0:0:0): Error 5, Periph was invalidated
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In)
after 5 seconds; reconnecting
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In)
after 5 seconds; reconnecting
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In)
after 5 seconds; reconnecting
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In)
after 5 seconds; reconnecting
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In)
after 5 seconds; reconnecting
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): no ping reply (NOP-In)
after 5 seconds; reconnecting
(da3:iscsi2:0:0:0): READ(10). CDB: 28 00 1c 14 10 0f 00 00 01 00
(da3:iscsi2:0:0:0): CAM status: CCB request aborted by the host
(da3:iscsi2:0:0:0): Retrying command
da3 at iscsi2 bus 0 scbus4 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> s/n MYSERIAL 1 detached
(da2:iscsi1:0:0:0): Periph destroyed
(da3:iscsi2:0:0:0): Periph destroyed
da2 at iscsi1 bus 0 scbus4 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da2: Serial Number MYSERIAL 0
da2: 150.000MB/s transfers
da2: Command Queueing enabled
da2: 1840144MB (471076881 4096 byte sectors)
da3 at iscsi2 bus 0 scbus3 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da3: Serial Number MYSERIAL 1
da3: 150.000MB/s transfers
da3: Command Queueing enabled
da3: 1840144MB (471076881 4096 byte sectors)
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): handoff on already
connected session
Oct 16 01:02:04 filer1 iscsid[88600]: 10.20.30.32
(iqn.2016-08.lan.prod:target1): ISCSIDHANDOFF: Device busy
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): connection error;
reconnecting
da3 at iscsi2 bus 0 scbus3 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> s/n MYSERIAL 1 detached
(da3:iscsi2:0:0:0): Periph destroyed
da3 at iscsi2 bus 0 scbus3 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da3: Serial Number MYSERIAL 1
da3: 150.000MB/s transfers
da3: Command Queueing enabled
da3: 1840144MB (471076881 4096 byte sectors)
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In)
after 5 seconds; reconnecting
(da2:iscsi1:0:0:0): READ(6). CDB: 08 00 00 20 20 00
(da2:iscsi1:0:0:0): CAM status: CCB request aborted by the host
(da2:iscsi1:0:0:0): Retrying command
da2 at iscsi1 bus 0 scbus4 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> s/n MYSERIAL 0 detached
(da2:iscsi1:0:0:0): Periph destroyed
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): no ping reply (NOP-In)
after 5 seconds; reconnecting
(da3:iscsi2:0:0:0): READ(6). CDB: 08 00 00 20 20 00
(da3:iscsi2:0:0:0): CAM status: CCB request aborted by the host
(da3:iscsi2:0:0:0): Retrying command
da3 at iscsi2 bus 0 scbus3 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> s/n MYSERIAL 1 detached
(da3:iscsi2:0:0:0): Periph destroyed
da2 at iscsi2 bus 0 scbus3 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da2: Serial Number MYSERIAL 1
da2: 150.000MB/s transfers
da2: Command Queueing enabled
da2: 1840144MB (471076881 4096 byte sectors)
da3 at iscsi1 bus 0 scbus4 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da3: Serial Number MYSERIAL 0
da3: 150.000MB/s transfers
da3: Command Queueing enabled
da3: 1840144MB (471076881 4096 byte sectors)
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): connection error;
reconnecting
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): connection error;
reconnecting
da3 at iscsi1 bus 0 scbus4 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> s/n MYSERIAL 0 detached
da2 at iscsi2 bus 0 scbus3 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> s/n MYSERIAL 1 detached
(da3:iscsi1:0:0:0): Periph destroyed
(da2:iscsi2:0:0:0): Periph destroyed
da2 at iscsi2 bus 0 scbus4 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da2: Serial Number MYSERIAL 1
da2: 150.000MB/s transfers
da2: Command Queueing enabled
da2: 1840144MB (471076881 4096 byte sectors)
da3 at iscsi1 bus 0 scbus3 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da3: Serial Number MYSERIAL 0
da3: 150.000MB/s transfers
da3: Command Queueing enabled
da3: 1840144MB (471076881 4096 byte sectors)

bugzilla...@freebsd.org

unread,
Oct 17, 2016, 7:54:28 AM10/17/16
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

--- Comment #12 from Julien Cigar <jul...@perdition.city> ---
Some more info:

on filer2:

root@filer2:/home/jcigar # cat /etc/ctl.conf

auth-group ag0 {
chap xxx xxx
}

portal-group pg0 {
discovery-auth-group no-authentication
listen 10.20.30.32
}

target iqn.2016-08.lan.prod:target0 {
auth-group ag0
portal-group pg0

lun 0 {
blocksize 4096
path /dev/gpt/f2data0
}
}

target iqn.2016-08.lan.prod:target1 {
auth-group ag0
portal-group pg0

lun 0 {
blocksize 4096
path /dev/gpt/f2data1
}
}

root@filer2:/home/jcigar # ifconfig bge1
bge1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000

options=c019b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>
ether d0:bf:9c:3a:b7:25
inet 10.20.30.32 netmask 0xffffff00 broadcast 10.20.30.255
nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
media: Ethernet autoselect (1000baseT <full-duplex>)
status: active

root@filer2:/home/jcigar # pciconf -lv bge1
bge1@pci0:2:0:1: class=0x020000 card=0x22e8103c chip=0x165f14e4 rev=0x00
hdr=0x00
vendor = 'Broadcom Corporation'
device = 'NetXtreme BCM5720 Gigabit Ethernet PCIe'
class = network
subclass = ethernet

root@filer2:/home/jcigar # grep 'bge1' /var/run/dmesg.boot
bge1: <Broadcom BCM5720 A0, ASIC rev. 0x5720000> mem
0x92b00000-0x92b0ffff,0x92b10000-0x92b1ffff,0x92b20000-0x92b2ffff at device 0.1
on pci2
bge1: APE FW version: NCSI v1.3.12.0
bge1: CHIP ID 0x05720000; ASIC REV 0x5720; CHIP REV 0x57200; PCI-E
miibus1: <MII bus> on bge1
bge1: Using defaults for TSO: 65518/35/2048
bge1: Ethernet address: d0:bf:9c:3a:b7:25
bge1: link state changed to DOWN
bge1: link state changed to UP

on filer1.prod.lan:

root@filer1:/home/jcigar # cat /etc/iscsi.conf

t0 {
TargetAddress = 10.20.30.32
TargetName = iqn.2016-08.lan.prod:target0
AuthMethod = CHAP
chapIName = xxx
chapSecret = xxx
}

t1 {
TargetAddress = 10.20.30.32
TargetName = iqn.2016-08.lan.prod:target1
AuthMethod = CHAP
chapIName = xxx
chapSecret = xxx
}

root@filer1:/home/jcigar # ifconfig bge1
bge1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000

options=c019b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>
ether d0:bf:9c:3a:c6:e1
inet 10.20.30.31 netmask 0xffffff00 broadcast 10.20.30.255
nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
media: Ethernet autoselect (1000baseT <full-duplex,master>)
status: active

root@filer1:/home/jcigar # pciconf -lv bge1
bge1@pci0:2:0:1: class=0x020000 card=0x22e8103c chip=0x165f14e4 rev=0x00
hdr=0x00
vendor = 'Broadcom Corporation'
device = 'NetXtreme BCM5720 Gigabit Ethernet PCIe'
class = network
subclass = ethernet

root@filer1:/home/jcigar # grep 'bge1' /var/run/dmesg.boot
bge1: <Broadcom BCM5720 A0, ASIC rev. 0x5720000> mem
0x92b00000-0x92b0ffff,0x92b10000-0x92b1ffff,0x92b20000-0x92b2ffff at device 0.1
on pci2
bge1: APE FW version: NCSI v1.3.12.0
bge1: CHIP ID 0x05720000; ASIC REV 0x5720; CHIP REV 0x57200; PCI-E
miibus1: <MII bus> on bge1
bge1: Using defaults for TSO: 65518/35/2048
bge1: Ethernet address: d0:bf:9c:3a:c6:e1
bge1: link state changed to DOWN
bge1: link state changed to UP

bugzilla...@freebsd.org

unread,
Oct 18, 2016, 9:57:34 AM10/18/16
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

--- Comment #13 from Edward Tomasz Napierala <tr...@FreeBSD.org> ---
The "WARNING: 10.20.30.31 (iqn.1994-09.org.freebsd:filer1.prod.lan): no ping
reply (NOP-Out) after 5 seconds; dropping connection" and "WARNING: 10.20.30.32
(iqn.2016-08.lan.prod:target0): connection error; reconnecting" messages
suggest a network problem.

bugzilla...@freebsd.org

unread,
Oct 18, 2016, 10:26:13 AM10/18/16
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

--- Comment #14 from Julien Cigar <jul...@perdition.city> ---
This is also what I thought at first, but there is no error messages related to
the network interfaces in the logs, and all Ierrs, Idrop, and Coll counters are
zeros.

There is a dedicated interface for the iSCSI traffic which I turned on/off
quite a lot to test a failover script, and the symptoms are exactly the ones
described by Ben Rubson in the post above (altough I only have 4 disks in
total).

One more thing, I'm managing everything with SaltStack and in some cases ctld
on the target and iscsid on the initiator are restarted at the same time, could
it be because of that?

I forgot to paste my rc.conf filers, but I don't think there is something wrong
..:

root@filer1:/home/jcigar # cat /etc/rc.conf
dumpdev="AUTO"
ifconfig_bge0="DHCP"
ip6addrctl_policy="ipv4_prefer"
ipv6_network_interfaces="none"
keymap="be.iso"
mountd_enable="YES"
mountd_flags="-r -S -h 192.168.10.15"
nfs_server_enable="YES"
nfs_server_flags="-t -h 192.168.10.15"
nfsuserd_enable="YES"
nfsuserd_flags="-domain prod.lan"
nfsv4_server_enable="YES"
ntpdate_flags="-b -4"
ntpdate_hosts="ntp.belnet.be"
salt_minion_enable="YES"
sendmail_enable="NO"
sendmail_msp_queue_enable="NO"
sendmail_outbound_enable="NO"
sendmail_submit_enable="NO"
zfs_enable="YES"

hostname="filer1.prod.lan"

ifconfig_bge1="inet 10.20.30.31 netmask 255.255.255.0 mtu 9000"

ifconfig_bge0_alias0="inet vhid 54 advskew 50 pass xxx alias 192.168.10.15/32"

sshd_enable="YES"

ntpdate_enable="YES"

iscsid_enable="YES"
iscsictl_enable="YES"
iscsictl_flags="-Aa"

root@filer2:/home/jcigar # cat /etc/rc.conf
dumpdev="AUTO"
ifconfig_bge0="DHCP"
ip6addrctl_policy="ipv4_prefer"
ipv6_network_interfaces="none"
keymap="be.iso"
ntpdate_flags="-b -4"
ntpdate_hosts="ntp.belnet.be"
salt_minion_enable="YES"
sendmail_enable="NO"
sendmail_msp_queue_enable="NO"
sendmail_outbound_enable="NO"
sendmail_submit_enable="NO"
zfs_enable="YES"

hostname="filer2.prod.lan"

ifconfig_bge1="inet 10.20.30.32 netmask 255.255.255.0 mtu 9000"

ifconfig_bge0_alias0="inet vhid 54 advskew 100 pass xxx alias 192.168.10.15/32"

sshd_enable="YES"

ntpdate_enable="YES"
ctld_enable="YES"

bugzilla...@freebsd.org

unread,
Oct 18, 2016, 10:34:50 AM10/18/16
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

Ben RUBSON <ben.r...@gmail.com> changed:

What |Removed |Added
----------------------------------------------------------------------------
Status|Closed |Open
Resolution|Feedback Timeout |---

--- Comment #15 from Ben RUBSON <ben.r...@gmail.com> ---
I re-open this bug for Julien.

bugzilla...@freebsd.org

unread,
Oct 18, 2016, 10:54:41 AM10/18/16
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

--- Comment #16 from Julien Cigar <jul...@perdition.city> ---
Thank you Ben. Can you confirm (once again) that you haven't had this problem
anymore in 11.0-RELEASE? If so I could just upgrade then .. :)

bugzilla...@freebsd.org

unread,
Oct 18, 2016, 11:02:14 AM10/18/16
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

--- Comment #17 from Ben RUBSON <ben.r...@gmail.com> ---
I switched to 11 at the time of 11-RC1 as I did not manage to reproduce the bug
with it.
I did not try to explicitly reproduce it with RC2, RC3, 11.0-RELEASE and
11.0-RELEASE-p1, but it did not occur by itself in these versions :)
Sounds like this bug only affects 10.3.

bugzilla...@freebsd.org

unread,
Nov 10, 2016, 4:58:06 AM11/10/16
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

--- Comment #18 from Julien Cigar <jul...@perdition.city> ---
Problem appeared again today, after ~15 days of uptime, always on FreeBSD
filer1.prod.lan 10.3-RELEASE-p11 FreeBSD 10.3-RELEASE-p11 #0: Mon Oct 24
18:49:24 UTC 2016
ro...@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64

WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In)
after 5 seconds; reconnecting
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): no ping reply (NOP-In)
after 5 seconds; reconnecting
(da3:iscsi1:0:0:0): READ(10). CDB: 28 00 01 ef ec 90 00 00 01 00
(da3:iscsi1:0:0:0): CAM status: CCB request aborted by the host
(da2:iscsi2:0:0:0): READ(10). CDB: 28 00 01 ef ec 8e 00 00 01 00
(da3:(da2:iscsi2:0:0:0): CAM status: CCB request aborted by the host
iscsi1:0:(da2:0:iscsi2:0:0): 0:Retrying command
0): Retrying command
da3 at iscsi1 bus 0 scbus4 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> s/n MYSERIAL 0 detached
da2 at iscsi2 bus 0 scbus3 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> s/n MYSERIAL 1 detached
(da2:iscsi2:0:0:0): Periph destroyed
(da3:iscsi1:0:0:0): Periph destroyed
da2 at iscsi2 bus 0 scbus3 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da2: Serial Number MYSERIAL 1
da2: 150.000MB/s transfers
da2: Command Queueing enabled
da2: 1840144MB (471076881 4096 byte sectors)
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): no ping reply (NOP-In)
after 5 seconds; reconnecting
(da2:iscsi2:0:0:0): READ(10). CDB: 28 00 1c 14 10 0f 00 00 01 00
(da2:iscsi2:0:0:0): CAM status: CCB request aborted by the host
(da2:iscsi2:0:0:0): Retrying command
da2 at iscsi2 bus 0 scbus3 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> s/n MYSERIAL 1 detached
(da2:iscsi2:0:0:0): Periph destroyed
da2 at iscsi2 bus 0 scbus3 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da2: Serial Number MYSERIAL 1
da2: 150.000MB/s transfers
da2: Command Queueing enabled
da2: 1840144MB (471076881 4096 byte sectors)
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): login timed out after 6
seconds; reconnecting
da3 at iscsi1 bus 0 scbus4 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da3: Serial Number MYSERIAL 0
da3: 150.000MB/s transfers
da3: Command Queueing enabled
da3: 1840144MB (471076881 4096 byte sectors)
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): no ping reply (NOP-In)
after 5 seconds; reconnecting
(da2:iscsi2:0:0:0): READ(10). CDB: 28 00 1c 14 10 0f 00 00 01 00
(da2:iscsi2:0:0:0): CAM status: CCB request aborted by the host
(da2:iscsi2:0:0:0): Retrying command
da2 at iscsi2 bus 0 scbus3 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> s/n MYSERIAL 1 detached
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): no ping reply (NOP-In)
after 5 seconds; reconnecting
(da2:iscsi2:0:0:0): Periph destroyed
da2 at iscsi2 bus 0 scbus3 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da2: Serial Number MYSERIAL 1
da2: 150.000MB/s transfers
da2: Command Queueing enabled
da2: 1840144MB (471076881 4096 byte sectors)
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): handoff on already
connected session
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): connection error;
reconnecting
da3 at iscsi1 bus 0 scbus4 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> s/n MYSERIAL 0 detached
(da3:iscsi1:0:0:0): Periph destroyed
da3 at iscsi1 bus 0 scbus4 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da3: Serial Number MYSERIAL 0
da3: 150.000MB/s transfers
da3: Command Queueing enabled
da3: 1840144MB (471076881 4096 byte sectors)

After a zpool online, and with vfs.zfs.scrub_delay = 0 and
vfs.zfs.resilver_delay = 0 I issued a zpool scrub and again I had a timeout:

WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target0): no ping reply (NOP-In)
after 5 seconds; reconnecting
WARNING: 10.20.30.32 (iqn.2016-08.lan.prod:target1): no ping reply (NOP-In)
after 5 seconds; reconnecting
(da3:iscsi1:0:0:0): READ(10). CDB: 28 00 00 b9 9a 67 00 00 01 00
(da3:iscsi1:0:0:0): CAM status: CCB request aborted by the host
(da2:iscsi2:0:0:0): READ(10). CDB: 28 00 00 b9 98 3c 00 00 01 00
(da3:(da2:iscsi2:0:0:0): CAM status: CCB request aborted by the host
iscsi1:0:(da2:0:iscsi2:0:0): 0:Retrying command
0): Retrying command
(da3:iscsi1:0:0:0): READ(10). CDB: 28 00 00 b9 9f e4 00 00 01 00
(da2:iscsi2:0:0:0): READ(10). CDB: 28 00 00 b9 a8 cc 00 00 01 00
(da3:iscsi1:0:0:0): CAM status: CCB request aborted by the host
(da2:iscsi2:0:0:0): CAM status: CCB request aborted by the host
(da3:(da2:iscsi1:0:iscsi2:0:0:0:0): 0): Retrying command
Retrying command
(da3:iscsi1:0:0:0): READ(10). CDB: 28 00 00 b9 a2 42 00 00 01 00
(da2:iscsi2:0:0:0): READ(10). CDB: 28 00 00 b9 95 6e 00 00 20 00
(da3:iscsi1:0:0:0): CAM status: CCB request aborted by the host
(da2:iscsi2:0:0:0): CAM status: CCB request aborted by the host
(da3:(da2:iscsi1:0:iscsi2:0:0:0:0): 0): Retrying command
Retrying command
(da3:iscsi1:0:0:0): READ(10). CDB: 28 00 00 b9 96 4e 00 00 20 00
(da2:iscsi2:0:0:0): READ(10). CDB: 28 00 00 b9 95 8e 00 00 20 00
(da3:iscsi1:0:0:0): CAM status: CCB request aborted by the host
(da2:iscsi2:0:0:0): CAM status: CCB request aborted by the host
(da3:(da2:iscsi1:0:iscsi2:0:0:0:0): 0): Retrying command
Retrying command
(da3:iscsi1:0:0:0): READ(10). CDB: 28 00 00 b9 96 6e 00 00 20 00
(da2:iscsi2:0:0:0): READ(10). CDB: 28 00 00 b9 a9 b2 00 00 01 00
(da3:iscsi1:0:0:0): CAM status: CCB request aborted by the host
(da2:iscsi2:0:0:0): CAM status: CCB request aborted by the host
(da3:(da2:iscsi1:0:iscsi2:0:0:0:0): 0): Retrying command
Retrying command
da3 at iscsi1 bus 0 scbus4 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> s/n MYSERIAL 0 detached
da2 at iscsi2 bus 0 scbus3 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> s/n MYSERIAL 1 detached
(da3:iscsi1:0:0:0): Periph destroyed
(da2:iscsi2:0:0:0): Periph destroyed
da2 at iscsi1 bus 0 scbus4 target 0 lun 0
da2: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da2: Serial Number MYSERIAL 0
da2: 150.000MB/s transfers
da2: Command Queueing enabled
da2: 1840144MB (471076881 4096 byte sectors)
da3 at iscsi2 bus 0 scbus3 target 0 lun 0
da3: <FREEBSD CTLDISK 0001> Fixed Direct Access SPC-4 SCSI device
da3: Serial Number MYSERIAL 1
da3: 150.000MB/s transfers
da3: Command Queueing enabled
da3: 1840144MB (471076881 4096 byte sectors)

I've raised those timeouts a little bit:
kern.iscsi.login_timeout: 30
kern.iscsi.iscsid_timeout: 30
kern.iscsi.ping_timeout: 30

and see if it makes any difference

bugzilla...@freebsd.org

unread,
Nov 10, 2016, 5:06:05 AM11/10/16
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

--- Comment #19 from Julien Cigar <jul...@perdition.city> ---
this is what I have on the target:

root@filer2:/home/jcigar # sysctl -a|grep -i 'iscsi'
kern.iscsi.fail_on_shutdown: 1
kern.iscsi.fail_on_disconnection: 1
kern.iscsi.maxtags: 255
kern.iscsi.login_timeout: 30
kern.iscsi.iscsid_timeout: 30
kern.iscsi.ping_timeout: 30
kern.iscsi.debug: 1
kern.cam.ctl.iscsi.maxcmdsn_delta: 256
kern.cam.ctl.iscsi.login_timeout: 60
kern.cam.ctl.iscsi.ping_timeout: 5
kern.cam.ctl.iscsi.debug: 1

and the following in the error logs:
WARNING: 10.20.30.31 (iqn.1994-09.org.freebsd:filer1.prod.lan): connection
error; dropping connection
WARNING: 10.20.30.31 (iqn.1994-09.org.freebsd:filer1.prod.lan): connection
error; dropping connection
WARNING: 10.20.30.31 (iqn.1994-09.org.freebsd:filer1.prod.lan): connection
error; dropping connection

bugzilla...@freebsd.org

unread,
Nov 10, 2016, 6:18:33 AM11/10/16
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

--- Comment #20 from Julien Cigar <jul...@perdition.city> ---
Also I noticed some Idrop packets on the iscsi interface of the initiator side:

root@filer1:/home/jcigar # netstat -I bge1
Name Mtu Network Address Ipkts Ierrs Idrop Opkts Oerrs
Coll
bge1 9000 <Link#2> d0:bf:9c:3a:c6:e1 77396222 0 107 45800806 0
0
bge1 - 10.20.30.0 10.20.30.31 77394879 - - 73726273 -
-

and:

root@filer1:/home/jcigar # vmstat -z | grep -i 'mbuf_jumbo_9k:'
mbuf_jumbo_9k: 9216, 74075, 256, 161,51922039, 214, 0

could it be related to jumbo frames?

(on the target side counters are 0)

bugzilla...@freebsd.org

unread,
Nov 10, 2016, 6:28:00 AM11/10/16
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

--- Comment #21 from Ben RUBSON <ben.r...@gmail.com> ---
I was also using jumbo frames with 10.3, but unfortunately I did not look at
dropped packets when I had the issues.
Note that with 11, I still use jumbo frames and never encounter this issue
again.

bugzilla...@freebsd.org

unread,
Nov 10, 2016, 6:47:32 AM11/10/16
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

--- Comment #22 from Julien Cigar <jul...@perdition.city> ---
Yep it's strange, I will definitively upgrade to 11 when possible ..

Other infos:

root@filer1:/home/jcigar # netstat -m
1285/6575/7860 mbufs in use (current/cache/total)
1024/3526/4550/500010 mbuf clusters in use (current/cache/total/max)
1024/3178 mbuf+clusters out of packet secondary zone in use (current/cache)
0/267/267/250005 4k (page size) jumbo clusters in use (current/cache/total/max)
256/181/437/74075 9k jumbo clusters in use (current/cache/total/max)
0/0/0/41667 16k jumbo clusters in use (current/cache/total/max)
4673K/11392K/16066K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/214/0 requests for jumbo clusters denied (4k/9k/16k)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile

root@filer1:/home/jcigar # sysctl kern.ipc | fgrep nmb
kern.ipc.nmbufs: 3200070
kern.ipc.nmbjumbo16: 166668
kern.ipc.nmbjumbo9: 222225
kern.ipc.nmbjumbop: 250005
kern.ipc.nmbclusters: 500010

bugzilla...@freebsd.org

unread,
Nov 10, 2016, 5:25:46 PM11/10/16
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

--- Comment #23 from Julien Cigar <jul...@perdition.city> ---
Ok.. I definitively think it's an issue with jumbo frames, I got another
disconnection and again the "requests for jumbo clusters denied" (9k) counter
increased:

0/304/0 requests for jumbo clusters denied (4k/9k/16k)

any idea?

bugzilla...@freebsd.org

unread,
Nov 10, 2016, 5:47:46 PM11/10/16
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

Steven Hartland <s...@FreeBSD.org> changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |s...@FreeBSD.org

--- Comment #24 from Steven Hartland <s...@FreeBSD.org> ---
Any mention of nmbjumbo9 in /var/log/messages?

Specifically do you see:
kern.ipc.nmbjumbo9 limit reached

If so have you tried bumping kern.ipc.nmbjumbo9?

bugzilla...@freebsd.org

unread,
Nov 10, 2016, 6:12:22 PM11/10/16
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

--- Comment #25 from Julien Cigar <jul...@perdition.city> ---
(In reply to Steven Hartland from comment #24)

I don't have any mention of nmbjumbo9 in /var/log/messages, I haven't raised
kern.ipc.nmbjumbo9 yet, currently it's:

root@filer1:/var/log # sysctl kern.ipc.nmbjumbo9
kern.ipc.nmbjumbo9: 222225

but it does not seem necessary to raise it if I can trust netstat -m:

root@filer1:/var/log # netstat -m
1282/6533/7815 mbufs in use (current/cache/total)
1024/3564/4588/500010 mbuf clusters in use (current/cache/total/max)
1024/3431 mbuf+clusters out of packet secondary zone in use (current/cache)
0/161/161/250005 4k (page size) jumbo clusters in use (current/cache/total/max)
256/120/376/74075 9k jumbo clusters in use (current/cache/total/max)
0/0/0/41667 16k jumbo clusters in use (current/cache/total/max)
4672K/10485K/15157K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/304/0 requests for jumbo clusters denied (4k/9k/16k)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile

(the box isn't heavily loaded)

what I don't understand if why there are requests for 4k jumbo clusters denied
as there are plenty free .. ?

could it be related to this change:
https://svnweb.freebsd.org/base/head/sys/dev/bge/if_bge.c?r1=276299&r2=276750 ?

bugzilla...@freebsd.org

unread,
Nov 10, 2016, 6:15:52 PM11/10/16
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

--- Comment #26 from Julien Cigar <jul...@perdition.city> ---
I don't know if it's related but I also noticed some incorrect cksum:

00:14:03.380734 IP (tos 0x0, ttl 64, id 19369, offset 0, flags [DF], proto TCP
(6), length 100, bad cksum 0 (->9e84)!)
10.20.30.31.24995 > 10.20.30.32.iscsi-target: Flags [P.], cksum 0x50bd
(incorrect -> 0xfff9), seq 1205926059:1205926107, ack 1518444144, win 16384,
options [nop,nop,TS val 1388742470 ecr 2474468733], length 48

00:14:03.380742 IP (tos 0x0, ttl 64, id 19370, offset 0, flags [DF], proto TCP
(6), length 100, bad cksum 0 (->9e83)!)
10.20.30.31.55645 > 10.20.30.32.iscsi-target: Flags [P.], cksum 0x50bd
(incorrect -> 0x3912), seq 3104128225:3104128273, ack 1948017382, win 16384,
options [nop,nop,TS val 1388742470 ecr 4001288677], length 48

00:14:03.381060 IP (tos 0x0, ttl 64, id 54596, offset 0, flags [DF], proto TCP
(6), length 100)
10.20.30.32.iscsi-target > 10.20.30.31.55645: Flags [P.], cksum 0x4c6a
(correct), seq 1:49, ack 48, win 16384, options [nop,nop,TS val 4001289988 ecr
1388742470], length 48

00:14:03.381070 IP (tos 0x0, ttl 64, id 54597, offset 0, flags [DF], proto TCP
(6), length 100)
10.20.30.32.iscsi-target > 10.20.30.31.24995: Flags [P.], cksum 0x0d62
(correct), seq 1:49, ack 48, win 16384, options [nop,nop,TS val 2474470019 ecr
1388742470], length 48

... but I guess it's because TSO is enabled on the interfaces ?

bugzilla...@freebsd.org

unread,
Nov 10, 2016, 6:18:01 PM11/10/16
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

--- Comment #27 from Steven Hartland <s...@FreeBSD.org> ---
checksums a likely just due to hw checksum offloading.

bugzilla...@freebsd.org

unread,
Nov 10, 2016, 6:19:53 PM11/10/16
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

--- Comment #28 from Steven Hartland <s...@FreeBSD.org> ---

bugzilla...@freebsd.org

unread,
Nov 10, 2016, 6:31:00 PM11/10/16
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

--- Comment #29 from Steven Hartland <s...@FreeBSD.org> ---
I'm assuming vmstat agrees with netstat -m e.g.
vmstat -z | head -n 1; vmstat -z | grep mbuf

It could be that the disconnect is the cause of mbuf issue and not the other
way round. You'd have to catch it when the issue is occurring to confirm.

If you have plenty of ram you could just trying doubling kern.ipc.nmbjumbo9 to
see if the problem goes away?

bugzilla...@freebsd.org

unread,
Nov 10, 2016, 6:42:31 PM11/10/16
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

--- Comment #30 from Steven Hartland <s...@FreeBSD.org> ---
If its dropping in the driver due to failure go get a jumbo mbuf you should see
this in the output netstat -i.

bugzilla...@freebsd.org

unread,
Nov 10, 2016, 6:47:26 PM11/10/16
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

--- Comment #31 from Julien Cigar <jul...@perdition.city> ---
Y(In reply to Steven Hartland from comment #29)

Yes it seems so:

root@filer1:/var/log # vmstat -z | head -n 1; vmstat -z | grep mbuf
ITEM SIZE LIMIT USED FREE REQ FAIL SLEEP
mbuf_packet: 256, 3200070, 1024, 3431,94881716, 0, 0
mbuf: 256, 3200070, 257, 3103,566359041, 0, 0
mbuf_cluster: 2048, 500010, 4455, 133,15367798, 0, 0
mbuf_jumbo_page: 4096, 250005, 0, 161,102798544, 0, 0
mbuf_jumbo_9k: 9216, 74075, 256, 120,52036055, 304, 0
mbuf_jumbo_16k: 16384, 41667, 0, 0, 0, 0, 0
mbuf_ext_refcnt: 4, 0, 0, 0, 0, 0, 0

> You'd have to catch it when the issue is occurring to confirm

not so easy :) but every time I have a WARNING: 10.20.30.32
(iqn.2016-08.lan.prod:target0): no ping reply (NOP-In) after 30 seconds;
reconnecting

I have a small increase in the "requests for 9k jumbo clusters denied" counter

> If you have plenty of ram you could just trying doubling kern.ipc.nmbjumbo9 to see if the problem goes away?

The machine has only 8GB of RAM (and ZFS), I think I'll first test with a 1500
MTU and -tso

bugzilla...@freebsd.org

unread,
Nov 10, 2016, 6:50:08 PM11/10/16
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

--- Comment #32 from Julien Cigar <jul...@perdition.city> ---
(In reply to Steven Hartland from comment #30)

Yes I already checked the error counters, but it's 0:

root@filer1:/var/log # netstat -I bge1
Name Mtu Network Address Ipkts Ierrs Idrop Opkts Oerrs
Coll
bge1 9000 <Link#2> d0:bf:9c:3a:c6:e1 77755195 0 152 46072898 0
0
bge1 - 10.20.30.0 10.20.30.31 77753810 - - 74150354 -
-

... so I guess it's not a driver issue

bugzilla...@freebsd.org

unread,
Dec 15, 2016, 4:41:07 AM12/15/16
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

--- Comment #34 from Julien Cigar <jul...@perdition.city> ---
No problems for more than a month now.. I tend to conclude that the MTU 9000
was the root cause of this problem (?)

bugzilla...@freebsd.org

unread,
Jan 4, 2017, 9:02:56 AM1/4/17
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

--- Comment #35 from Ben RUBSON <ben.r...@gmail.com> ---
Did you try to increase some sysctls with jumbo frames on, as proposed by
Steven ?

bugzilla...@freebsd.org

unread,
Jan 5, 2017, 4:18:34 AM1/5/17
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

--- Comment #36 from Julien Cigar <jul...@perdition.city> ---
(In reply to Ben RUBSON from comment #35)

yes, .. but only slightly as the machine has only 8GB of RAM

bugzilla...@freebsd.org

unread,
Jan 5, 2017, 4:24:16 AM1/5/17
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

--- Comment #37 from Ben RUBSON <ben.r...@gmail.com> ---
And no improvements ?

bugzilla...@freebsd.org

unread,
Jan 5, 2017, 4:36:06 AM1/5/17
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

--- Comment #38 from Julien Cigar <jul...@perdition.city> ---
(In reply to Ben RUBSON from comment #37)

it only delayed the problem, .. but it would be interesting if you could test
on your side..!

bugzilla...@freebsd.org

unread,
Jan 5, 2017, 9:04:54 AM1/5/17
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

--- Comment #39 from Ben RUBSON <ben.r...@gmail.com> ---
Strange, as if mbufs were "consumed" but not released.
And when devices are disconnected, can you reconnect to them properly ?
Or do you need to change their target name or to reboot, as explained in my
first message of this bug report ?

bugzilla...@freebsd.org

unread,
Jan 6, 2017, 4:57:30 AM1/6/17
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

--- Comment #40 from Julien Cigar <jul...@perdition.city> ---
I had to reboot..

bugzilla...@freebsd.org

unread,
Jan 6, 2017, 5:00:12 AM1/6/17
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

--- Comment #41 from Ben RUBSON <ben.r...@gmail.com> ---
OK, so same issue as mine (for which I opened this bug report), with 10.3.
Which I was not able to reproduce with 11.
Any plan to upgrade to 11 to be safe ?

bugzilla...@freebsd.org

unread,
Jan 6, 2017, 5:01:51 AM1/6/17
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

--- Comment #42 from Julien Cigar <jul...@perdition.city> ---
also, looks similar to bug #215727

bugzilla...@freebsd.org

unread,
Apr 11, 2017, 3:55:48 PM4/11/17
to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211990

Edward Tomasz Napierala <tr...@FreeBSD.org> changed:

What |Removed |Added
----------------------------------------------------------------------------
Status|Open |Closed
Resolution|--- |Works As Intended

--- Comment #44 from Edward Tomasz Napierala <tr...@FreeBSD.org> ---
From what I see, it's a networking problem, not a bug in iSCSI.
0 new messages