I've simulated a remote network failure by adding a blackhole route for one of the
remote iSCSI ports, watching the kernel messages of SLES 10 SP1 (open-iscsi-
2.0.707-0.32). I tried to follow the configuration guidelines in README for using
multipath. The host sees three target LUNs, each reachable via 16 paths like this:
two NICs * two iSCSI boxes * two FC adapters * two storage controllers
I feel that some error messages would benefit from being throttled or more
specific:
It all starts with this:
iscsid: cannot make a connection to 172.20.76.1:3260 (101)
kernel: connection224:0: iscsi: detected conn error (1011)
kernel: connection225:0: iscsi: detected conn error (1011)
kernel: connection226:0: iscsi: detected conn error (1011)
kernel: connection227:0: iscsi: detected conn error (1011)
[...more and repeating about every second...]
iscsid: detected iSCSI connection 224:0 error (1011) state (3)
iscsid: cannot make a connection to 172.20.76.1:3260 (101)
[...]
multipathd: sdaa: tur checker reports path is down
multipathd: checker failed path 65:160 in map EVA-2_iSCSI-1
multipathd: EVA-2_iSCSI-1: remaining active paths: 15
iscsid: detected iSCSI connection 222:0 error (1011) state (3)
iscsid: cannot make a connection to 172.20.76.1:3260 (101)
iscsid: cannot make a connection to 172.20.76.1:3260 (101)
iscsid: cannot make a connection to 172.20.76.1:3260 (101)
iscsid: cannot make a connection to 172.20.76.1:3260 (101)
iscsid: cannot make a connection to 172.20.76.1:3260 (101)
iscsid: cannot make a connection to 172.20.76.1:3260 (101)
iscsid: cannot make a connection to 172.20.76.1:3260 (101)
iscsid: cannot make a connection to 172.20.76.1:3260 (101)
[ here the attempts for a connection appeared within two seconds, poosibly for
different targets/LUNs, but the message doesn't tell which. There are at least one
hundred of these messages within a very short time ]
kernel: session223: iscsi: session recovery timed out after 120 secs
multipathd: sdab: tur checker reports path is down
kernel: iscsi: cmd 0x0 is not queued (8)
kernel: device-mapper: dm-multipath: Failing path 65:176.
multipathd: checker failed path 65:176 in map EVA-2_iSCSI-1
[ In the message from iscsi above a few details about the "owner" of the command
being queued would make much sense IMHO ]
[...]
iscsid: cannot make a connection to 172.20.76.1:3260 (101)
iscsid: cannot make a connection to 172.20.76.1:3260 (101)
kernel: iscsi: cmd 0x0 is not queued (8)
kernel: iscsi: cmd 0x0 is not queued (8)
kernel: iscsi: cmd 0x0 is not queued (8)
kernel: iscsi: cmd 0x0 is not queued (8)
kernel: iscsi: cmd 0x0 is not queued (8)
kernel: iscsi: cmd 0x0 is not queued (8)
kernel: iscsi: cmd 0x0 is not queued (8)
kernel: iscsi: cmd 0x0 is not queued (8)
kernel: iscsi: cmd 0x0 is not queued (8)
kernel: iscsi: cmd 0x0 is not queued (8)
kernel: iscsi: cmd 0x0 is not queued (8)
kernel: iscsi: cmd 0x0 is not queued (8)
multipathd: sdaa: tur checker reports path is down
multipathd: sdab: tur checker reports path is down
multipathd: sdac: tur checker reports path is down
multipathd: sdad: tur checker reports path is down
multipathd: sdae: tur checker reports path is down
[...]
[ Here the problem mentioned before becomes obvious. I'm wondering why the syslog
daemon does not say "last message repeated # times"; the iscsi messages all
arrived in the same second ]
[ the other thing is that multipathd REPEATEDLY reports the paths being down, i.e.
it reports the STATE, not the TRANSITION of the state.: ]
kernel: iscsi: cmd 0x0 is not queued (8)
kernel: iscsi: cmd 0x0 is not queued (8)
kernel: iscsi: cmd 0x0 is not queued (8)
kernel: iscsi: cmd 0x0 is not queued (8)
multipathd: sdaa: tur checker reports path is down
multipathd: sdab: tur checker reports path is down
multipathd: sdac: tur checker reports path is down
multipathd: sdad: tur checker reports path is down
multipathd: sdae: tur checker reports path is down
[...]
[ The other amazing thing is that the path checker still reports a "down" three
seconds after iSCSI detected a successful connection: ]
iscsid: connection222:0 is operational after recovery (83 attempts)
iscsid: connection223:0 is operational after recovery (84 attempts)
iscsid: connection227:0 is operational after recovery (84 attempts)
iscsid: connection229:0 is operational after recovery (84 attempts)
iscsid: connection224:0 is operational after recovery (84 attempts)
iscsid: connection225:0 is operational after recovery (84 attempts)
iscsid: connection226:0 is operational after recovery (84 attempts)
iscsid: connection228:0 is operational after recovery (84 attempts)
multipathd: sdav: tur checker reports path is down
multipathd: sdaw: tur checker reports path is down
[...more like these...for another 8 seconds; finally an UP transition]
multipathd: sdaw: tur checker reports path is down
multipathd: sdax: tur checker reports path is down
multipathd: sdaa: tur checker reports path is up
multipathd: 65:160: reinstated
multipathd: EVA-2_iSCSI-1: remaining active paths: 13
multipathd: sdab: tur checker reports path is up
multipathd: 65:176: reinstated
multipathd: EVA-2_iSCSI-1: remaining active paths: 14
multipathd: sdac: tur checker reports path is up
multipathd: 65:192: reinstated
[...]
It seems syslog could not log all of those messges:
syslog-ng[10113]: STATS: dropped 927
I know that this is not all iSCSI, but can something be done about the problem,
like an exponential backuff strategy for emitting error messages?
If I configure something like 10 LUNs with 16 paths each, a failure of a network
connection for a few seconds will flood the syslog with thousands of entries.
Regards,
Ulrich Windl
The detected conn error messages are not repeating are they? It should
only fire once when we detect a problem.
> iscsid: detected iSCSI connection 224:0 error (1011) state (3)
> iscsid: cannot make a connection to 172.20.76.1:3260 (101)
These ones do spit out over and over. I agree we should have some back
off code for that. Will fix it in a second.
> [...]
> multipathd: sdaa: tur checker reports path is down
> multipathd: checker failed path 65:160 in map EVA-2_iSCSI-1
> multipathd: EVA-2_iSCSI-1: remaining active paths: 15
> iscsid: detected iSCSI connection 222:0 error (1011) state (3)
> iscsid: cannot make a connection to 172.20.76.1:3260 (101)
> iscsid: cannot make a connection to 172.20.76.1:3260 (101)
> iscsid: cannot make a connection to 172.20.76.1:3260 (101)
> iscsid: cannot make a connection to 172.20.76.1:3260 (101)
> iscsid: cannot make a connection to 172.20.76.1:3260 (101)
> iscsid: cannot make a connection to 172.20.76.1:3260 (101)
> iscsid: cannot make a connection to 172.20.76.1:3260 (101)
> iscsid: cannot make a connection to 172.20.76.1:3260 (101)
> [ here the attempts for a connection appeared within two seconds, poosibly for
> different targets/LUNs, but the message doesn't tell which. There are at least one
> hundred of these messages within a very short time ]
This should be fixed soon. I just fixed up the kernel messages to add
some prefix so you can match this up. I am going to get to the userspace
messsages next.
> kernel: session223: iscsi: session recovery timed out after 120 secs
> multipathd: sdab: tur checker reports path is down
> kernel: iscsi: cmd 0x0 is not queued (8)
> kernel: device-mapper: dm-multipath: Failing path 65:176.
> multipathd: checker failed path 65:176 in map EVA-2_iSCSI-1
> [ In the message from iscsi above a few details about the "owner" of the command
> being queued would make much sense IMHO ]
In the upstream code it should be fixed, so that if it is prtinted you
know where it is coming from.
However, upstream does not normally turn this on. It is SUSE specific,
so if you do not like it you have to make a suse bugzilla and ask to
turn it off.
Do you mean it repeatedly reports this for each path or that you get a
message for each path that is on the same session/connection?
>
> Ulrich Windl wrote:
> > Hello,
> >
> > I've simulated a remote network failure by adding a blackhole route for one of the
> > remote iSCSI ports, watching the kernel messages of SLES 10 SP1 (open-iscsi-
> > 2.0.707-0.32). I tried to follow the configuration guidelines in README for using
> > multipath. The host sees three target LUNs, each reachable via 16 paths like this:
[...]
> > It all starts with this:
> > iscsid: cannot make a connection to 172.20.76.1:3260 (101)
> > kernel: connection224:0: iscsi: detected conn error (1011)
> > kernel: connection225:0: iscsi: detected conn error (1011)
> > kernel: connection226:0: iscsi: detected conn error (1011)
> > kernel: connection227:0: iscsi: detected conn error (1011)
> > [...more and repeating about every second...]
>
> The detected conn error messages are not repeating are they? It should
> only fire once when we detect a problem.
Yes, there is only one "conn error" per connection.
>
> > iscsid: detected iSCSI connection 224:0 error (1011) state (3)
> > iscsid: cannot make a connection to 172.20.76.1:3260 (101)
>
> These ones do spit out over and over. I agree we should have some back
> off code for that. Will fix it in a second.
Very good!
[...]
> > [ the other thing is that multipathd REPEATEDLY reports the paths being down, i.e.
> > it reports the STATE, not the TRANSITION of the state.: ]
>
> Do you mean it repeatedly reports this for each path or that you get a
> message for each path that is on the same session/connection?
Does that extract from the log answer your question? ;-)
Feb 26 13:24:30 rkdvmso1 multipathd: sdaa: tur checker reports path is down
Feb 26 13:26:28 rkdvmso1 multipathd: sdab: tur checker reports path is down
Feb 26 13:26:29 rkdvmso1 multipathd: sdac: tur checker reports path is down
Feb 26 13:26:29 rkdvmso1 multipathd: sdad: tur checker reports path is down
Feb 26 13:26:29 rkdvmso1 multipathd: sdae: tur checker reports path is down
Feb 26 13:26:29 rkdvmso1 multipathd: sdaf: tur checker reports path is down
Feb 26 13:26:29 rkdvmso1 multipathd: sdag: tur checker reports path is down
Feb 26 13:26:29 rkdvmso1 multipathd: sdah: tur checker reports path is down
Feb 26 13:26:29 rkdvmso1 multipathd: sdau: tur checker reports path is down
Feb 26 13:26:29 rkdvmso1 multipathd: sdav: tur checker reports path is down
Feb 26 13:26:29 rkdvmso1 multipathd: sdaw: tur checker reports path is down
Feb 26 13:26:29 rkdvmso1 multipathd: sdax: tur checker reports path is down
Feb 26 13:26:34 rkdvmso1 multipathd: sdaa: tur checker reports path is down
Feb 26 13:26:34 rkdvmso1 multipathd: sdab: tur checker reports path is down
Feb 26 13:26:34 rkdvmso1 multipathd: sdac: tur checker reports path is down
Feb 26 13:26:34 rkdvmso1 multipathd: sdad: tur checker reports path is down
Feb 26 13:26:34 rkdvmso1 multipathd: sdae: tur checker reports path is down
Feb 26 13:26:34 rkdvmso1 multipathd: sdaf: tur checker reports path is down
Feb 26 13:26:34 rkdvmso1 multipathd: sdag: tur checker reports path is down
Feb 26 13:26:34 rkdvmso1 multipathd: sdah: tur checker reports path is down
Feb 26 13:26:34 rkdvmso1 multipathd: sdau: tur checker reports path is down
Feb 26 13:26:34 rkdvmso1 multipathd: sdav: tur checker reports path is down
Feb 26 13:26:34 rkdvmso1 multipathd: sdaw: tur checker reports path is down
Feb 26 13:26:34 rkdvmso1 multipathd: sdax: tur checker reports path is down
Feb 26 13:26:39 rkdvmso1 multipathd: sdaa: tur checker reports path is down
Feb 26 13:26:39 rkdvmso1 multipathd: sdab: tur checker reports path is down
Feb 26 13:26:39 rkdvmso1 multipathd: sdac: tur checker reports path is down
Feb 26 13:26:39 rkdvmso1 multipathd: sdad: tur checker reports path is down
Feb 26 13:26:39 rkdvmso1 multipathd: sdae: tur checker reports path is down
Feb 26 13:26:39 rkdvmso1 multipathd: sdaf: tur checker reports path is down
Feb 26 13:26:39 rkdvmso1 multipathd: sdag: tur checker reports path is down
Feb 26 13:26:39 rkdvmso1 multipathd: sdah: tur checker reports path is down
Feb 26 13:26:39 rkdvmso1 multipathd: sdau: tur checker reports path is down
Feb 26 13:26:39 rkdvmso1 multipathd: sdav: tur checker reports path is down
Feb 26 13:26:39 rkdvmso1 multipathd: sdaw: tur checker reports path is down
Feb 26 13:26:39 rkdvmso1 multipathd: sdax: tur checker reports path is down
Feb 26 13:26:44 rkdvmso1 multipathd: sdaa: tur checker reports path is down
Feb 26 13:26:44 rkdvmso1 multipathd: sdab: tur checker reports path is down
Feb 26 13:26:44 rkdvmso1 multipathd: sdac: tur checker reports path is down
Feb 26 13:26:44 rkdvmso1 multipathd: sdad: tur checker reports path is down
Feb 26 13:26:44 rkdvmso1 multipathd: sdae: tur checker reports path is down
Feb 26 13:26:44 rkdvmso1 multipathd: sdaf: tur checker reports path is down
Feb 26 13:26:44 rkdvmso1 multipathd: sdag: tur checker reports path is down
Feb 26 13:26:44 rkdvmso1 multipathd: sdah: tur checker reports path is down
Feb 26 13:26:44 rkdvmso1 multipathd: sdau: tur checker reports path is down
Feb 26 13:26:44 rkdvmso1 multipathd: sdav: tur checker reports path is down
Feb 26 13:26:44 rkdvmso1 multipathd: sdaw: tur checker reports path is down
Feb 26 13:26:44 rkdvmso1 multipathd: sdax: tur checker reports path is down
Feb 26 13:26:49 rkdvmso1 multipathd: sdaa: tur checker reports path is down
Feb 26 13:26:49 rkdvmso1 multipathd: sdab: tur checker reports path is down
Feb 26 13:26:49 rkdvmso1 multipathd: sdac: tur checker reports path is down
Feb 26 13:26:49 rkdvmso1 multipathd: sdad: tur checker reports path is down
Feb 26 13:26:49 rkdvmso1 multipathd: sdae: tur checker reports path is down
Feb 26 13:26:49 rkdvmso1 multipathd: sdaf: tur checker reports path is down
Feb 26 13:26:49 rkdvmso1 multipathd: sdag: tur checker reports path is down
Feb 26 13:26:49 rkdvmso1 multipathd: sdah: tur checker reports path is down
Feb 26 13:26:49 rkdvmso1 multipathd: sdau: tur checker reports path is down
Feb 26 13:26:49 rkdvmso1 multipathd: sdav: tur checker reports path is down
Feb 26 13:26:49 rkdvmso1 multipathd: sdaw: tur checker reports path is down
Feb 26 13:26:49 rkdvmso1 multipathd: sdax: tur checker reports path is down
Feb 26 13:26:54 rkdvmso1 multipathd: sdaa: tur checker reports path is down
Feb 26 13:26:54 rkdvmso1 multipathd: sdab: tur checker reports path is down
Feb 26 13:26:54 rkdvmso1 multipathd: sdac: tur checker reports path is down
Feb 26 13:26:54 rkdvmso1 multipathd: sdad: tur checker reports path is down
Feb 26 13:26:54 rkdvmso1 multipathd: sdae: tur checker reports path is down
Feb 26 13:26:54 rkdvmso1 multipathd: sdaf: tur checker reports path is down
[...many more...]
Feb 26 13:28:15 rkdvmso1 multipathd: sdag: tur checker reports path is down
Feb 26 13:28:15 rkdvmso1 multipathd: sdah: tur checker reports path is down
Feb 26 13:28:15 rkdvmso1 multipathd: sdau: tur checker reports path is down
Feb 26 13:28:15 rkdvmso1 multipathd: sdav: tur checker reports path is down
Feb 26 13:28:15 rkdvmso1 multipathd: sdaw: tur checker reports path is down
Feb 26 13:28:15 rkdvmso1 multipathd: sdax: tur checker reports path is down
Regards,
Ulrich
> Feb 26 13:26:44 rkdvmso1 multipathd: sdaa: tur checker reports path is down
:) Ok, I get it. You should report that to the dm-devel list,
dm-d...@redhat.com, so they fix that up. Also report the path down
messages 3 seconds after the connection up ones. I thought it might be a
race with the logging or something but 3 seconds is a long time so maybe
it is bug (multipathd getting backed up or something).