connection error kills iscsi session

470 views
Skip to first unread message

djmorris

unread,
Sep 24, 2009, 3:45:24 PM9/24/09
to open-iscsi, drew....@sickkids.ca
Problem: While running a backup to a mounted iscsi device, iscsid gets
a "conn error (1011)", and my drive starts getting input/output
errors.

Details:
Kernel Version: 2.6.23.17-88.fc7PAE
Distribution: Fedora core 7
open-iscsi version: 2.0-871

ISCSI Target Device: Promise Vtrak M610i
backup program: BackupPC (uses rsync)
my linux box and target are both plugged into a Netgear 516T Switch
(1000 baseT, 16 port, unmanaged)

When I log into the target after this happens, i notice that there are
no active iscsi sessions
However, when i run
iscsiadm -m session, it lists the active session

tcp: [2] TAR.GET.IP.ADR:3260,1 reserve.vtrak

I can log out with
iscsiadm -m node -U manual
but i cannot log in after that or even discover the target device
unless i reboot that device

sample from my /var/log/messages:
Aug 22 23:56:02 bold kernel: connection12:0: detected conn error
(1011)
Aug 22 23:56:03 bold iscsid: Kernel reported iSCSI connection 12:0
error (1011)
state (3)
Aug 22 23:56:06 bold iscsid: connect to 142.20.199.117:3260 failed
(Connection r
efused)
Aug 22 23:56:39 bold last message repeated 9 times
Aug 22 23:57:43 bold last message repeated 17 times
Aug 22 23:58:43 bold last message repeated 16 times
Aug 22 23:58:47 bold iscsid:
Aug 22 23:58:47 bold iscsid: connect to 142.20.199.117:3260 failed
(Connection r
efused)
Aug 22 23:59:21 bold last message repeated 9 times
Aug 23 00:00:24 bold last message repeated 17 times
Aug 23 00:01:02 bold last message repeated 10 times
Aug 23 00:01:02 bold kernel: session12: session recovery timed out
after 300 se
cs
Aug 23 00:01:02 bold kernel: sd 18:0:0:0: [sdd] Result:
hostbyte=DID_NO_CONNECT
driverbyte=DRIVER_OK,SUGGEST_OK
Aug 23 00:01:02 bold kernel: end_request: I/O error, dev sdd, sector
7141111779
Aug 23 00:01:02 bold kernel: sd 18:0:0:0: [sdd] Result:
hostbyte=DID_NO_CONNECT
driverbyte=DRIVER_OK,SUGGEST_OK
Aug 23 00:01:02 bold kernel: end_request: I/O error, dev sdd, sector
6591795491
....
Aug 23 00:01:02 bold kernel: I/O error in filesystem ("sdd1") meta-
data dev sdd1
block 0x4091c90 ("xfs_trans_read_buf") error 5 buf count 4096
....
Aug 23 00:01:02 bold kernel: sd 18:0:0:0: [sdd] Result:
hostbyte=DID_NO_CONNECT
driverbyte=DRIVER_OK,SUGGEST_OK
Aug 23 00:01:02 bold kernel: end_request: I/O error, dev sdd, sector
6042486114
Aug 23 00:01:02 bold kernel: Device sdd1, XFS metadata write error
block 0x1a9a4
b7c1 in sdd1
Aug 23 00:01:02 bold kernel: I/O error in filesystem ("sdd1") meta-
data dev sdd1
block 0x20be08de9 ("xlog_iodone") error 5 buf count 31744
Aug 23 00:01:02 bold kernel: xfs_force_shutdown(sdd1,0x2) called from
line 958 o
f file fs/xfs/xfs_log.c. Return address = 0xf92bd608
Aug 23 00:01:02 bold kernel: Filesystem "sdd1": Log I/O Error
Detected. Shuttin
g down filesystem: sdd1
Aug 23 00:01:02 bold kernel: Please umount the filesystem, and rectify
the probl
em(s)


I'm trying to figure out the source of this problem... anyone have any
ideas?
different open-iscsi version? different linux kernel? different
switch?
Or if someone can at least tell me what ""kernel: connection12:0:
detected conn error (1011)" means, so that the device vendor can try
to debug on their side.

Thanks so much for any advice.

Mike Christie

unread,
Sep 25, 2009, 8:45:54 PM9/25/09
to open-...@googlegroups.com, drew....@sickkids.ca
djmorris wrote:
> Problem: While running a backup to a mounted iscsi device, iscsid gets
> a "conn error (1011)", and my drive starts getting input/output
> errors.
>
> Details:
> Kernel Version: 2.6.23.17-88.fc7PAE
> Distribution: Fedora core 7
> open-iscsi version: 2.0-871
>
> ISCSI Target Device: Promise Vtrak M610i
> backup program: BackupPC (uses rsync)
> my linux box and target are both plugged into a Netgear 516T Switch
> (1000 baseT, 16 port, unmanaged)
>
> When I log into the target after this happens, i notice that there are
> no active iscsi sessions
> However, when i run
> iscsiadm -m session, it lists the active session
>
> tcp: [2] TAR.GET.IP.ADR:3260,1 reserve.vtrak
>
> I can log out with
> iscsiadm -m node -U manual
> but i cannot log in after that or even discover the target device
> unless i reboot that device
>
> sample from my /var/log/messages:
> Aug 22 23:56:02 bold kernel: connection12:0: detected conn error
> (1011)

Is there anything before this? About a nop or ping maybe?


> Aug 22 23:56:03 bold iscsid: Kernel reported iSCSI connection 12:0
> error (1011)
> state (3)
> Aug 22 23:56:06 bold iscsid: connect to 142.20.199.117:3260 failed
> (Connection r
> efused)
> Aug 22 23:56:39 bold last message repeated 9 times
> Aug 22 23:57:43 bold last message repeated 17 times
> Aug 22 23:58:43 bold last message repeated 16 times
> Aug 22 23:58:47 bold iscsid:
> Aug 22 23:58:47 bold iscsid: connect to 142.20.199.117:3260 failed
> (Connection r
> efused)
> Aug 22 23:59:21 bold last message repeated 9 times
> Aug 23 00:00:24 bold last message repeated 17 times
> Aug 23 00:01:02 bold last message repeated 10 times
> Aug 23 00:01:02 bold kernel: session12: session recovery timed out
> after 300 se


We can get a conn error (1011) for any old reason. It is a generic error
value (1011 == connection error). The target could have died, or someone
could have pulled a cable or the target could have dropped the
session/connection for some reasons, or we the target could have sent us
some invalid data.


When you see that message, it means the initiator is going to drop the
tcp/ip and iscsi connection if it is not already gone then try to create
a new tcp connection and relogin. The connect to 142.20.199.117:3260
failed (Connection refused) means that when we tried to reconnect, the
target did not allow the connection. We do not have any more info than
that. You should check the target logs. Maybe there was some resource
issue or maybe it got rebooted. I have no idea.

The recovery timed out after 300 secs message means we tried to
reconnect for 300 secs but we could not. At this time, the iscsi
initiator will just fail IO and you get the FS errors. However, it will
continue to try and reconnect to the target (this is more useful for
multipath).

iscsiadm -m session will show the iscsi sessions that we are logged in
to and sessions that were connected and we are trying to log back into.
This is why it will show sessions where the target may only show targets
that are logged in (I have not idea what vtrak does).

If you run

iscsiadm -m session -P 3

It will tell if the state of the session.

Drew Morris

unread,
Sep 28, 2009, 10:39:37 AM9/28/09
to Mike Christie, open-...@googlegroups.com
Mike,

Thanks for your prompt/detailed reply.

Vtrak is an iscsi disk array, which unfortunately doesn't seem to be reporting any errors.
Is there any sort of debugging I could do to discern what the cause of the problem is? I'm pretty sure it's not a pulled cable, but either the target or the initiator dropping the connection for some reason...
What sort of "invalid data" would cause the initiator to drop the connection?

There are no errors about pings or nops. The fragment I sent was the "beginning" of the problem. Can the verbosity of the iscsi messages be increased?

Below is the output of iscsiadm -m session -P 3. The connection failed again over the weekend, this time I wasn't even doing any reads/writes.


Some different error messages in there too, though I've seen these before.
Sep 27 02:51:04 spin kernel: connection4:0: iscsi: detected conn error (1011)
Sep 27 02:51:05 spin iscsid: Kernel reported iSCSI connection 4:0 error (1011) s
tate (3)
Sep 27 02:51:08 spin iscsid: connect to 142.20.199.117:3260 failed (Connection r
efused)
Sep 27 02:51:41 spin last message repeated 9 times
Sep 27 02:52:45 spin last message repeated 17 times
Sep 27 02:53:49 spin last message repeated 17 times
Sep 27 02:54:52 spin last message repeated 17 times
Sep 27 02:55:56 spin last message repeated 17 times
Sep 27 02:56:00 spin iscsid: connect to 142.20.199.117:3260 failed (Connection r
efused)
Sep 27 02:56:04 spin kernel: session4: iscsi: session recovery timed out after
300 secs
Sep 27 02:56:04 spin iscsid: connect to 142.20.199.117:3260 failed (Connection r
efused)
...
Sep 27 04:02:33 spin last message repeated 7 times
Sep 27 04:02:35 spin kernel: iscsi: cmd 0x28 is not queued (8)
Sep 27 04:02:35 spin kernel: sd 8:0:0:0: SCSI error: return code = 0x00010000
Sep 27 04:02:35 spin kernel: end_request: I/O error, dev sde, sector 549316386
Sep 27 04:02:35 spin kernel: I/O error in filesystem ("sde1") meta-data dev sde1
block 0x20bde700 ("xfs_trans_read_buf") error 5 buf count 8192
Sep 27 04:02:37 spin iscsid: connect to 142.20.199.117:3260 failed (Connection r
efused)
Sep 27 04:02:39 spin kernel: iscsi: cmd 0x28 is not queued (8)
Sep 27 04:02:39 spin kernel: sd 8:0:0:0: SCSI error: return code = 0x00010000
Sep 27 04:02:39 spin kernel: end_request: I/O error, dev sde, sector 446314
Sep 27 04:02:39 spin kernel: I/O error in filesystem ("sde1") meta-data dev sde1
block 0x6cf48 ("xfs_trans_read_buf") error 5 buf count 4096
...

iSCSI Transport Class version 2.0-724
version 2.0-871
Target: reserve.vtrak
Current Portal: 142.20.199.117:3260,1
Persistent Portal: 142.20.199.117:3260,1
**********
Interface:
**********
Iface Name: default
Iface Transport: tcp
Iface Initiatorname: unknown
Iface IPaddress: 142.20.199.72
Iface HWaddress: default
Iface Netdev: default
SID: 4
iSCSI Connection State: TRANSPORT WAIT
iSCSI Session State: Unknown
Internal iscsid Session State: REOPEN
************************
Negotiated iSCSI params:
************************
HeaderDigest: None
DataDigest: None
MaxRecvDataSegmentLength: 131072
MaxXmitDataSegmentLength: 524288
FirstBurstLength: 131072
MaxBurstLength: 262144
ImmediateData: Yes
InitialR2T: Yes
MaxOutstandingR2T: 1
************************
Attached SCSI devices:
************************
Host Number: 8 State: running
scsi8 Channel 00 Id 0 Lun: 0
Attached scsi disk sde State: running

If you run

No virus found in this incoming message.
Checked by AVG - www.avg.com
Version: 8.5.409 / Virus Database: 270.13.112/2394 - Release Date: 09/25/09 17:52:00

This e-mail may contain confidential, personal and/or health information(information which may be subject to legal restrictions on use, retention and/or disclosure) for the sole use of the intended recipient. Any review or distribution by anyone other than the person for whom it was originally intended is strictly prohibited. If you have received this e-mail in error, please contact the sender and delete all copies.

Mike Christie

unread,
Sep 29, 2009, 2:58:46 PM9/29/09
to Drew Morris, open-...@googlegroups.com
On 09/28/2009 09:39 AM, Drew Morris wrote:
> Mike,
>
> Thanks for your prompt/detailed reply.
>
> Vtrak is an iscsi disk array, which unfortunately doesn't seem to be reporting any errors.
> Is there any sort of debugging I could do to discern what the cause of the problem is? I'm pretty sure it's not a pulled cable, but either the target or the initiator dropping the connection for some reason...
> What sort of "invalid data" would cause the initiator to drop the connection?
>
> There are no errors about pings or nops. The fragment I sent was the "beginning" of the problem. Can the verbosity of the iscsi messages be increased?
>

Are you using the kernel modules and tools from open-iscsi.org? If so,
each iscsi module has a mod param that can used to turn on different
debugging. You can also set it through sysfs like so:

echo 1 > /sys/module/libiscsi/parameters/debug_libiscsi_conn
echo 1 > /sys/module/libiscsi/parameters/debug_libiscsi_session
echo 1 > /sys/module/libiscsi/parameters/debug_libiscsi_eh
echo 1 > /sys/module/iscsi_tcp/parameters/debug_iscsi_tcp
echo 1 > /sys/module/libiscsi_tcp/parameters/debug_libiscsi_tcp

This turns on all the logging, so if you run IO it will be a lot of
info. You probably just want to turn all this on when not doing any
reads/writes. That way we will just get what causes the initial conn error.

Reply all
Reply to author
Forward
0 new messages