Help with some random iSCSI connection errors

Santi Saez

Jun 25, 2009, 7:33:11 AM

to open-...@googlegroups.com

Hi,

I randomly get the following iSCSI errors on a Linux box with CentOS 5.3,
running the default kernel (2.6.18) and using Open-iSCSI
(6.2.0.868-0.18.el5_3.1):

ping timeout of 5 secs expired, last rx (..)
connection1:0: iscsi: detected conn error (1011)
Kernel reported iSCSI connection 1:0 error (1011) state (3)
session1: iscsi: session recovery timed out after 120 secs
iscsi: cmd 0x28 is not queued (8)
sd 1:0:0:0: SCSI error: return code = 0x00010000
end_request: I/O error, dev sdb, sector 226732039
sd 1:0:0:0: SCSI error: return code = 0x00010000
end_request: I/O error, dev sdb, sector 187040175

Full log is available at: http://pastebin.com/f40472f99

After that, we need to reboot the server to regain read-write access to
the ext3 filesystem.

We use the default Open-iSCSI config:

http://pastebin.com/f9f15d82

More info about this device:

# cat /sys/block/sdb/device/timeout
60

# cat /sys/class/iscsi_session/session1/recovery_tmo
120
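
To check the same value on every session rather than one at a time, a small helper along these lines could be used. It is only a sketch: the function name `show_recovery_tmo` is made up for illustration, and it assumes each session directory exposes a `recovery_tmo` file as in the output above.

```shell
#!/bin/sh
# Print "<session> <recovery_tmo>" for every iSCSI session found.
# The optional first argument overrides the sysfs directory to scan
# (handy for trying the function against a copy of the tree).
show_recovery_tmo() {
    root=${1:-/sys/class/iscsi_session}
    for s in "$root"/session*; do
        # Skip entries without a readable recovery_tmo (or an unmatched glob).
        [ -r "$s/recovery_tmo" ] || continue
        printf '%s %s\n' "${s##*/}" "$(cat "$s/recovery_tmo")"
    done
}
```

Calling `show_recovery_tmo` with no arguments scans `/sys/class/iscsi_session`.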

There are more initiators connected to the same target and switch, and
they are not affected by this problem, so we think that changing some
Open-iSCSI configuration parameter might solve it. Any ideas? Thanks!

Regards,

--
Santi Saez
http://woop.es

Mike Christie

Jun 25, 2009, 11:17:05 AM

to open-...@googlegroups.com

On 06/25/2009 06:33 AM, Santi Saez wrote:
>
> Hi,
>
> Randomly I get those iSCSI errors on a Linux box with CentOS 5.3,
> running default kernel (2.6.18) and using Open-iSCSI
> (6.2.0.868-0.18.el5_3.1):
>
> ping timeout of 5 secs expired, last rx (..)

This indicates that the initiator sent an iSCSI ping but did not get a
reply. When this happens, the initiator drops the session, tries to
log back in, and retries the IO.

> connection1:0: iscsi: detected conn error (1011)
> Kernel reported iSCSI connection 1:0 error (1011) state (3)
> session1: iscsi: session recovery timed out after 120 secs
> iscsi: cmd 0x28 is not queued (8)

This indicates that we tried to log back in for 2 minutes but could not.
At that point, we fail the IO.

> sd 1:0:0:0: SCSI error: return code = 0x00010000
> end_request: I/O error, dev sdb, sector 226732039
> sd 1:0:0:0: SCSI error: return code = 0x00010000
> end_request: I/O error, dev sdb, sector 187040175
>
> Full log is available at: http://pastebin.com/f40472f99
>
> After that, we need to reboot the server to recover read-write into ext3 fs.
>

You might be able to avoid this problem by increasing
node.session.timeo.replacement_timeout in iscsid.conf (don't forget to
rediscover the storage so the new value gets picked up). However, if we
are not able to reconnect for a couple of minutes, then something is wrong here.
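
As a sketch of how the change could be applied to an existing node record instead of rediscovering, iscsiadm's update operation can set a single parameter. The helper below only prints the command so it can be reviewed before being run. The function name `node_update_cmd`, the target name, the portal address, and the value 300 are placeholders chosen for illustration, not values from this thread.

```shell
#!/bin/sh
# Print (do not run) the iscsiadm command that updates one setting in an
# existing node record. Arguments: target IQN, portal, setting name, value.
node_update_cmd() {
    target=$1 portal=$2 name=$3 value=$4
    printf 'iscsiadm -m node -T %s -p %s -o update -n %s -v %s\n' \
        "$target" "$portal" "$name" "$value"
}

# Example with placeholder target and portal; 300 is an arbitrary value.
node_update_cmd iqn.2009-06.com.example:test 192.0.2.10:3260 \
    node.session.timeo.replacement_timeout 300
```

A record updated this way is expected to take effect on the next login to that target, so the session would need to be logged out and back in.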

Maybe running iscsid by hand with debugging on will give us more info:

iscsid -d 8

Or, if you could run it by hand, create a test disk, log in, and then
pull the cable, we could check whether relogin is working. You should
see:

ping timeout of 5 secs expired,

connection1:0: iscsi: detected conn error (1011)
Kernel reported iSCSI connection 1:0 error (1011) state (3)

When you see that you should plug the cable back in. Then instead of

session1: iscsi: session recovery timed out after 120 secs

you should see:
connection1:0 is operational after recovery
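
The outcome of that cable-pull test could be checked from a saved log with something like the following. This is a rough sketch: the function name `recovery_outcome` is made up, it matches only the two message fragments quoted above, and it does not look at the order of messages, so it is best run on a log that covers a single test.

```shell
#!/bin/sh
# Classify a kernel log excerpt from the cable-pull test.
# Prints "recovered", "timed-out", or "unknown".
recovery_outcome() {
    log=$1
    if grep -q 'is operational after recovery' "$log"; then
        echo recovered
    elif grep -q 'session recovery timed out' "$log"; then
        echo timed-out
    else
        echo unknown
    fi
}
```

For example, after the test: `dmesg > /tmp/iscsi-test.log` followed by `recovery_outcome /tmp/iscsi-test.log`.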

Mike Christie

Jun 25, 2009, 8:02:44 PM

to open-...@googlegroups.com

Oh yeah, as for config settings: when not using dm-multipath, you can
just turn the nops (the iSCSI pings) off by setting:

node.conn[0].timeo.noop_out_interval = 0
node.conn[0].timeo.noop_out_timeout = 0
