Very strange problem with an Infortrend A16E iSCSI storage array

Santi Saez

Feb 3, 2009, 11:55:22 AM
to open-...@googlegroups.com

Hi,

We have a very strange problem with an Infortrend A16E iSCSI storage
array [1]. I don't think it's an Open-iSCSI problem, but someone
here may shed some light :-)

This array has 4 iSCSI interfaces to distribute/balance ethernet
traffic. There are 16 hosts connected to this array via iSCSI, with 4
hosts per channel/interface.

*Randomly*, one of these channels resets, making the 4 servers connected
to the channel time out. The other 3 channels are not affected at all.

Open-iSCSI logs this:

ping timeout of 5 secs expired, last rx 502453156, last ping 502446907, now 502463156
connection4:0: iscsi: detected conn error (1011)
session4: iscsi: session recovery timed out after 120 secs
iscsi: cmd 0x28 is not queued (8)
iscsi: cmd 0x28 is not queued (8)
iscsi: cmd 0x28 is not queued (8)
sd 4:0:0:0: SCSI error: return code = 0x00010000
end_request: I/O error, dev sdc, sector 338694423
(..)


The switch port where it is connected shows:

%LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/5, changed state to down
%LINK-3-UPDOWN: Interface GigabitEthernet0/5, changed state to down
%LINK-3-UPDOWN: Interface GigabitEthernet0/5, changed state to up
%LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/5, changed state to up


It appears that the iSCSI channel *resets* and the port goes through a
down/up cycle. We have replaced the cable and the switch, and we still
get the same error.

The Infortrend array is logging nothing and the official support people
have no idea about this issue :-/

We believe that the source of the problem is a single server. When we
move this server to a different iSCSI channel we get the same error
there, and the channel where it previously was starts working as
expected, with no interface resets.

One could argue that something in that faulty server is making the
interface reset, but we've checked it several times and we really
believe that it is configured the same way as the other servers attached
to the array.

The switch connecting the servers and the array is a Cisco Catalyst 2960G.

Has anyone experienced anything similar?

Regards,

[1] http://www.infortrend.com/main/2_product/es_a16e-g2130-4.asp

--
Santi Saez
http://woop.es

Mike Christie

Feb 3, 2009, 2:19:40 PM
to open-...@googlegroups.com
Santi Saez wrote:
>
> Hi,
>
> We have a very strange problem with an Infortrend A16E iSCSI storage
> array [1]. I don't think it's an Open-iSCSI problem, but someone
> here may shed some light :-)
>
> This array has 4 iSCSI interfaces to distribute/balance ethernet
> traffic. There are 16 hosts connected to this array via iSCSI, with 4
> hosts per channel/interface.
>
> *Randomly*, one of these channels resets, making the 4 servers connected
> to the channel time out. The other 3 channels are not affected at all.
>
> Open-iSCSI logs this:
>
> ping timeout of 5 secs expired, last rx 502453156, last ping 502446907, now 502463156

The initiator sends an iSCSI ping every X seconds. If we do not get a
response within Y seconds we drop the session (drop the connection and
relogin).
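
To make the numbers in the quoted "ping timeout" line easier to read, here is a minimal Python sketch that pulls out the three counters and converts the gaps to seconds. It assumes the counters are kernel jiffies and that the kernel runs at HZ=1000; neither is stated in this thread, so check your kernel config before trusting the result. The function name `ping_timeout_deltas` is made up for this example.

```python
import re

# Matches the initiator's "ping timeout" message as quoted in this thread.
PING_TIMEOUT = re.compile(
    r"ping timeout of (?P<timeout>\d+) secs expired, "
    r"last rx (?P<last_rx>\d+), last ping (?P<last_ping>\d+), "
    r"now (?P<now>\d+)"
)


def ping_timeout_deltas(line, hz=1000):
    """Return the gaps since the last receive and the last ping, in seconds.

    ASSUMPTION: the counters are jiffies and `hz` is the kernel tick rate.
    Returns None if the line is not a ping-timeout message.
    """
    match = PING_TIMEOUT.search(line)
    if match is None:
        return None
    fields = {name: int(value) for name, value in match.groupdict().items()}
    return {
        "timeout_secs": fields["timeout"],
        "secs_since_last_rx": (fields["now"] - fields["last_rx"]) / hz,
        "secs_since_last_ping": (fields["now"] - fields["last_ping"]) / hz,
    }


line = ("ping timeout of 5 secs expired, last rx 502453156, "
        "last ping 502446907, now 502463156")
print(ping_timeout_deltas(line))
```

Under those assumptions, the line from the original post works out to 10.0 seconds since anything was received and 16.249 seconds since the last ping was sent.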

There was a bug in the initiator where we would spit out this timeout
error by accident. What kernel are you using? Are you using the iSCSI
modules that ship with the kernel or modules from an open-iscsi.org
release, and if the latter, which release?

> connection4:0: iscsi: detected conn error (1011)
> session4: iscsi: session recovery timed out after 120 secs

I do not think it is the bug, because you would normally log right back in.

The recovery timed out error means that the initiator tried to log back
in for 120 seconds and during that time we could not reconnect/relogin.

I think this makes sense when looking at the switch messages below. If
something causes the link to go down, the iscsi ping would fail/timeout.

I am not sure if the iscsi layer dropping the session would cause the
link to go down/up.

Santi Saez

Feb 4, 2009, 7:16:26 AM
to open-...@googlegroups.com

Hi Mike,

On 3/2/09 20:19, Mike Christie wrote:

>> *Randomly*, one of these channels resets, making the 4 servers connected
>> to the channel time out. The other 3 channels are not affected at all.

(..)

> The initiator sends an iSCSI ping every X seconds. If we do not get a
> response within Y seconds we drop the session (drop the connection and
> relogin).

Yes, we were aware of this bug. In fact, you helped us with it not too
long ago:

http://tinyurl.com/cywy3j


> There was a bug in the initiator where we would spit out this timeout
> error by accident. What kernel are you using? Are you using the iSCSI
> modules that ship with the kernel or modules from an open-iscsi.org
> release, and if the latter, which release?

# iscsiadm -m session -P 3
iSCSI Transport Class version 2.0-724
iscsiadm version 2.0-868
Target: iqn.2002-10.com.infortrend:raid.sn7457155.30
Current Portal: 10.15.17.133:3260,1
Persistent Portal: 10.15.17.133:3260,1
**********
Interface:
**********
Iface Name: default
Iface Transport: tcp
Iface Initiatorname: iqn.2001-05.net.example:vz11
Iface IPaddress: 10.15.17.137
Iface HWaddress: default
Iface Netdev: default
SID: 2
iSCSI Connection State: LOGGED IN
iSCSI Session State: Unknown
Internal iscsid Session State: NO CHANGE
************************
Negotiated iSCSI params:
************************
HeaderDigest: None
DataDigest: None
MaxRecvDataSegmentLength: 131072
MaxXmitDataSegmentLength: 65536
FirstBurstLength: 65536
MaxBurstLength: 262144
ImmediateData: Yes
InitialR2T: No
MaxOutstandingR2T: 1
************************
Attached SCSI devices:
************************
Host Number: 2 State: running
scsi2 Channel 00 Id 0 Lun: 0
Attached scsi disk sdb State: running


We're using CentOS 5.2 with default "iscsi-initiator-utils" package:

# rpm -qa iscsi-initiator-utils
iscsi-initiator-utils-6.2.0.868-0.7.el5

Also, using default iSCSI modules.


>> connection4:0: iscsi: detected conn error (1011)
>> session4: iscsi: session recovery timed out after 120 secs
>
> I do not think it is the bug, because you would normally log right back in.
>
> The recovery timed out error means that the initiator tried to log back
> in for 120 seconds and during that time we could not reconnect/relogin.
>
> I think this makes sense when looking at the switch messages below. If
> something causes the link to go down, the iscsi ping would fail/timeout.
>
> I am not sure if the iscsi layer dropping the session would cause the
> link to go down/up.

The link that goes down/up isn't the one between the switch and the host;
it is the link between the *switch and the array*, which is very strange.
It appears that some iSCSI client is doing "something" that makes the
iSCSI interface in the array reset.

I don't think it's a problem with Open-iSCSI; it looks like an Infortrend
array bug, but perhaps someone can shed some light on this problem.

As I said, when this occurs it affects all servers connected to this
iSCSI interface/channel, including Windows hosts, etc.

Regards,

Tristan Ball

Feb 4, 2009, 5:10:52 PM
to open-...@googlegroups.com
You should be able to turn up the logging verbosity on the switch quite
a bit. If the switch is "making the choice" to disconnect the port, the
higher log levels should show why. If it's the Infortrend making the
choice - well, the switch probably won't show much more than what you've
got, as it will just see the link go down.

If you're not already, make sure the switch is sending its log messages
to a syslog server so you don't miss them!
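
Once the messages are landing on a syslog server, something along these lines could tally link-down events per switch port so that flaps on the array-facing ports stand out. This is only a sketch: it assumes each switch message arrives on a single line in the %LINK-3-UPDOWN format quoted earlier in this thread, and the function name `count_link_downs` is made up for this example.

```python
import re
from collections import Counter

# Matches the physical link state messages quoted earlier in the thread.
LINK_EVENT = re.compile(
    r"%LINK-3-UPDOWN: Interface (?P<iface>[^,\s]+), "
    r"changed state to (?P<state>up|down)"
)


def count_link_downs(lines):
    """Count physical link-down events per interface.

    ASSUMPTION: each switch message occupies exactly one line.
    """
    downs = Counter()
    for line in lines:
        match = LINK_EVENT.search(line)
        if match and match.group("state") == "down":
            downs[match.group("iface")] += 1
    return downs


sample = [
    "%LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/5, changed state to down",
    "%LINK-3-UPDOWN: Interface GigabitEthernet0/5, changed state to down",
    "%LINK-3-UPDOWN: Interface GigabitEthernet0/5, changed state to up",
    "%LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/5, changed state to up",
]
print(count_link_downs(sample))
```

The LINEPROTO lines are skipped on purpose so that a single flap is counted once and not twice.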

Regards,
T