Devs,
We seem to be having an issue with the time it takes to fail over with
iSCSI. The end goal is to force a failover to an alternate path, as
defined by dm-multipath, within 10 seconds.
Distro: CentOS
Kernel version: 2.6.29.5
dm-multipath version: device-mapper-multipath-0.4.7-17.el5
iscsid version: iscsi-initiator-utils-6.2.0.868-0.7.el5
We have dm-multipath installed and configured with the following
settings:
udev_dir /dev
polling_interval 3
selector "round-robin 0"
path_grouping_policy failover
getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
prio_callout /bin/true
path_checker tur
rr_min_io 10
max_fds 8192
rr_weight uniform
failback manual
no_path_retry fail
user_friendly_names yes
We have also modified the SCSI PDU timeout:
ACTION=="add", SUBSYSTEM=="scsi" , SYSFS{type}=="0|7|14", \
RUN+="/bin/sh -c 'echo 60 > /sys$$DEVPATH/timeout'"
in /etc/udev/rules.d/50-udev.rules.
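To check whether the udev rule really took effect on a given disk, something like the following could be used. It is only a minimal sketch: it assumes the per-device timeout is exposed at /sys/block/<dev>/device/timeout, and the function name and the sysfs_root parameter are made up for illustration.

```python
from pathlib import Path


def read_scsi_timeout(dev: str, sysfs_root: str = "/sys") -> int:
    """Return the timeout (in seconds) currently set for a SCSI disk.

    dev is a kernel block device name such as "sdb".
    sysfs_root is a hypothetical knob so the function can be pointed
    at a fake tree for testing; on a real host it stays "/sys".
    """
    path = Path(sysfs_root) / "block" / dev / "device" / "timeout"
    return int(path.read_text().strip())
```

On the initiator, read_scsi_timeout("sdb") should return the value that the rule wrote, if the rule fired for that device.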
We have also modified some parameters in /etc/iscsi/iscsi.conf:
node.session.timeo.replacement_timeout = 5
node.conn[0].timeo.login_timeout = 5
node.conn[0].timeo.logout_timeout = 5
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 10
Given the above configuration, the failover takes 2 minutes.
After reading a few posts on this group, I tried changing the SCSI
PDU timeout and node.session.timeo.replacement_timeout, but that
did not change the failover time.
However, if I set the SCSI PDU timeout to 3,
node.conn[0].timeo.noop_out_interval = 1 and
node.conn[0].timeo.noop_out_timeout = 2, we get a failover in about
65 seconds (that's still too long for our purposes):
/etc/iscsi/iscsi.conf:
node.session.timeo.replacement_timeout = 120
node.conn[0].timeo.login_timeout = 15
node.conn[0].timeo.logout_timeout = 15
node.conn[0].timeo.noop_out_interval = 1
node.conn[0].timeo.noop_out_timeout = 2
/etc/udev/rules.d/50-udev.rules:
ACTION=="add", SUBSYSTEM=="scsi" , SYSFS{type}=="0|7|14", \
RUN+="/bin/sh -c 'echo 60 > /sys$$DEVPATH/timeout'"
I'm not sure why these values make a difference in the failover
time, but changing any other parameter doesn't seem to help.
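For comparison, here is a back-of-the-envelope estimate of what I would have expected from the iSCSI timers alone. It rests on an assumption that may well be wrong: that a dead path is detected after at most noop_out_interval + noop_out_timeout seconds, and that queued I/O is then failed up to dm-multipath after a further replacement_timeout seconds. The function name is made up for illustration.

```python
def expected_failover_bound(noop_out_interval: int,
                            noop_out_timeout: int,
                            replacement_timeout: int) -> int:
    """Rough upper bound (seconds) on failover time under the assumed model."""
    # Assumed step 1: a ping goes out every noop_out_interval seconds and
    # the connection is declared dead noop_out_timeout seconds later.
    detection = noop_out_interval + noop_out_timeout
    # Assumed step 2: I/O stays queued for replacement_timeout seconds
    # before it is failed and multipath can switch paths.
    return detection + replacement_timeout


# First set of iscsi.conf values from this mail.
first = expected_failover_bound(5, 10, 5)
# Second set of iscsi.conf values from this mail.
second = expected_failover_bound(1, 2, 120)
print(first, second)  # 20 123
```

Neither figure matches the 2 minutes and roughly 65 seconds we measure, so either this model is wrong or the settings are not taking effect the way we expect.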
/var/log/messages:
When the cable (power) is pulled from the primary:
Jul 8 15:38:23 cschi-mbxdsg-0226.cleversafelabs.com kernel: connection2:0: ping timeout of 2 secs expired, last rx 4295169719, last ping 4295170719, now 4295172719
Jul 8 15:38:23 cschi-mbxdsg-0226.cleversafelabs.com kernel: connection2:0: detected conn error (1011)
Jul 8 15:38:24 cschi-mbxdsg-0226.cleversafelabs.com iscsid: Kernel reported iSCSI connection 2:0 error (1011) state (3)
.
.
No messages for a bit and then: (the failover occurs at this point)
Jul 8 15:39:35 cschi-mbxdsg-0226.cleversafelabs.com kernel: sd 2:0:0:0: timing out command, waited 18s
Jul 8 15:39:35 cschi-mbxdsg-0226.cleversafelabs.com kernel: sd 2:0:0:0: [sdb] Unhandled error code
Jul 8 15:39:35 cschi-mbxdsg-0226.cleversafelabs.com kernel: sd 2:0:0:0: [sdb] Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK,SUGGEST_OK
Jul 8 15:39:35 cschi-mbxdsg-0226.cleversafelabs.com kernel: end_request: I/O error, dev sdb, sector 1599400
Jul 8 15:39:35 cschi-mbxdsg-0226.cleversafelabs.com kernel: device-mapper: multipath: Failing path 8:16.
Jul 8 15:39:35 cschi-mbxdsg-0226.cleversafelabs.com kernel: sd 2:0:0:0: timing out command, waited 18s
Jul 8 15:39:35 cschi-mbxdsg-0226.cleversafelabs.com kernel: sd 2:0:0:0: [sdb] Unhandled error code
Jul 8 15:39:35 cschi-mbxdsg-0226.cleversafelabs.com kernel: sd 2:0:0:0: [sdb] Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK,SUGGEST_OK
Jul 8 15:39:35 cschi-mbxdsg-0226.cleversafelabs.com kernel: end_request: I/O error, dev sdb, sector 1600424
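To put a number on the gap in the log above, the timestamps of the connection error and of the path failure can be subtracted. The sketch below does that for two lines copied from this excerpt. The helper names are made up, and since syslog lines carry no year, the year argument is an arbitrary placeholder.

```python
from datetime import datetime


def syslog_time(line: str, year: int = 2000) -> datetime:
    """Parse the leading 'Mon D HH:MM:SS' stamp of a syslog line.

    The year is a placeholder; only differences within one log matter here.
    """
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}",
                             "%Y %b %d %H:%M:%S")


def seconds_between(lines, start_marker: str, end_marker: str) -> float:
    """Seconds from the first line containing start_marker to the first
    line containing end_marker."""
    start = next(l for l in lines if start_marker in l)
    end = next(l for l in lines if end_marker in l)
    return (syslog_time(end) - syslog_time(start)).total_seconds()


LOG = [
    "Jul 8 15:38:23 cschi-mbxdsg-0226.cleversafelabs.com kernel: "
    "connection2:0: detected conn error (1011)",
    "Jul 8 15:39:35 cschi-mbxdsg-0226.cleversafelabs.com kernel: "
    "device-mapper: multipath: Failing path 8:16.",
]

gap = seconds_between(LOG, "detected conn error", "Failing path")
print(gap)  # 72.0
```

So in this run, 72 seconds pass between the kernel detecting the connection error and dm-multipath failing the path.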
Any clues on how we can reduce this failover time would be
appreciated.
--
Akshay Lal