I started the dd at 07:13:51
Cable pulled at:
Mar 1 07:14:27 bentCluster-1 kernel: connection4:0: ping timeout of
5 secs expired, recv timeout 5, last rx 4884304, last ping 4889304,
now 4894304
iSCSI errors at:
Mar 1 07:14:28 bentCluster-1 iscsid: Kernel reported iSCSI connection
4:0 error (1011) state (3)
SCSI error and multipath failures at:
Mar 1 07:15:35 bentCluster-1 kernel: session2: session recovery
timed out after 15 secs
Mar 1 07:15:35 bentCluster-1 kernel: sd 3:0:0:1: SCSI error: return
code = 0x000f0000
Mar 1 07:15:35 bentCluster-1 kernel: end_request: I/O error, dev sdf,
sector 3164079
Mar 1 07:15:35 bentCluster-1 kernel: device-mapper: multipath:
Failing path 8:80.
And then I/O starts again on the device I am sending I/O down.
Finally the other devices fail:
Mar 1 07:15:48 bentCluster-1 kernel: device-mapper: multipath:
Failing path 8:112.
The entire dd took 138 seconds. It looks like the delay is in the
iSCSI layer. It took from 07:14:28 to 07:15:35 for the iSCSI session
to fail.
I am using the timeouts:
● node.session.timeo.replacement_timeout = 15
● node.conn[0].timeo.noop_out_timeout = 5
● node.conn[0].timeo.noop_out_interval = 5
http://kbase.redhat.com/faq/docs/DOC-2877
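For anyone following along, these timeouts live in /etc/iscsi/iscsid.conf for newly discovered nodes; a minimal fragment matching the values above (sketch only):

```
# /etc/iscsi/iscsid.conf -- timeout values used in this thread
node.session.timeo.replacement_timeout = 15
node.conn[0].timeo.noop_out_timeout = 5
node.conn[0].timeo.noop_out_interval = 5
```

An existing node record can also be updated with something like
iscsiadm -m node -T <target> -o update -n node.session.timeo.replacement_timeout -v 15
(the <target> is a placeholder, and you need to log out and back in for it to take effect).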
So I guess I have two questions:
1. Based on my timeouts I would think that my session would time out
after 15 seconds. Anyone have an idea why it is taking 67 seconds?
Am I missing any other timeout values?
2. In a perfect world what is the best case scenario for the failure
of my iSCSI session?
Thanks in advance.
-Ben
Yes. It should timeout about 15 secs after you see
> Mar 1 07:14:27 bentCluster-1 kernel: connection4:0: ping timeout of
> 5 secs expired, recv timeout 5, last rx 4884304, last ping 4889304,
> now 4894304
You might be hitting a bug where the network layer gets stuck trying to
send data. I attached a patch that should fix the problem.
If you do not know how to build a RHEL kernel let me know the arch you
are using and I can build a kernel here (it takes about a day).
> after 15 seconds. Anyone have an idea why it is taking 67 seconds?
> Am I missing any other timeout values?
No. The ones you have set are it.
>
> 2. In a perfect world what is the best case scenario for the failure
> of my iSCSI session?
>
It should work like in that doc.
Wouldn't the abort timeout also have an effect here? Or will iSCSI
immediately fail the abort that the mid-layer sends when it gets an
error sending a SCSI command?
--guy
Doing some multipath testing with iscsi/tcp I didn't hit this bug; any
hint as to what it takes for this to come into play? I did failover on a
silent line (other than nops) and during I/O.
Mike, is this patch production ready? If yes, are you pushing it upstream?
Or.
In the case in this thread abort timeout does not come into play,
because when the iscsi layer's nop times out it will fail the session
right away and that will prevent the scsi layer's eh (aborts, device
reset, target reset, etc) from starting up.
The problem I am referring to is the one we are discussing in the
"[PATCH] decrease sndtmo" thread. The problem is that if there is a
problem at the same time the write space window is closed we get caught
in a wait for at least one sndtmo period (sometimes more if we were in
the middle of a send).
> (other then nops) and during I/O.
>
> Mike, is this patch production ready? if yes, are you pushing it upstream?
>
It is in scsi-misc for the next feature window. I think James has sent
it already.
Mar 1 12:32:37 bentCluster-1 kernel: tg3: eth0: Link is down.
Mar 1 12:33:03 bentCluster-1 multipathd: checker failed path 8:224 in
map mpath0
Mar 1 12:33:03 bentCluster-1 kernel: end_request: I/O error, dev sdo,
sector 1249431
Mar 1 12:33:03 bentCluster-1 multipathd: mpath0: remaining active
paths: 1
I ended up setting:
[root@bentCluster-1 ~]# echo noop > /sys/block/sdn/queue/scheduler
[root@bentCluster-1 ~]# echo noop > /sys/block/sdo/queue/scheduler
[root@bentCluster-1 ~]# echo 64 > /sys/block/sdn/queue/max_sectors_kb
[root@bentCluster-1 ~]# echo 64 > /sys/block/sdo/queue/max_sectors_kb
[root@bentCluster-1 ~]# echo "5" > /sys/block/sdn/device/timeout
[root@bentCluster-1 ~]# echo "5" > /sys/block/sdo/device/timeout
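A small helper along these lines avoids repeating the echoes per device. This is a sketch only: the function name and the SYSFS override (there purely so the loop can be exercised outside a live box) are my own, and against the real /sys it needs root.

```shell
# Hypothetical helper: apply the same tunables to each path device.
# SYSFS is overridable only for illustration; defaults to the real /sys.
set_path_tunables() {
    sysfs=${SYSFS:-/sys}
    for dev in "$@"; do
        echo noop > "$sysfs/block/$dev/queue/scheduler"      # no elevator reordering
        echo 64   > "$sysfs/block/$dev/queue/max_sectors_kb" # cap per-request size
        echo 5    > "$sysfs/block/$dev/device/timeout"       # scsi cmd timeout, seconds
    done
}
```

Usage would be e.g. `set_path_tunables sdn sdo` as root; note these settings do not survive a reboot.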
I couldn't get it under 90 seconds without setting
/sys/block/sdn/device/timeout, and in my best test I hit 26 seconds.
I have a couple of questions:
1. Do I need the scsi timeout to be turned down or could I be hitting
the bug Mike mentioned?
2. The patch that Mike attached to this thread, is there a Red Hat BZ
associated with it so I can track its progress? If not should I open
a BZ?
3. In a best-case scenario, what kind of failover time can I expect
with multipath and iSCSI? I see about 25-30 seconds; is this
accurate? I saw 3-second failover using bonded NICs instead of
dm-multipath; is there any specific reason to use multipathd instead
of channel bonding?
Thanks for all the help everyone!
-Ben
It looks like we have two bugs.
1. We can get stuck in the network code.
2. There is a race where the session->state can get reset due to the
xmit thread throwing an error after we have set the session->state but
before we have set the stop_stage.
The attached patch for RHEL 5.5 should fix them all.
Hello,
Will this patch be in the next RHEL 5.5 beta kernel? It's easier to test
if there's no need to build a custom kernel :)
-- Pasi
I am not sure if it will be in the next 5.5 beta. It should be in 5.5
though. Do you have a bugzilla account? I made this bugzilla
https://bugzilla.redhat.com/show_bug.cgi?id=570681
You can add yourself to it and when the patch is merged you will get a
notification and a link to a test kernel.
If you do not have a bugzilla account, just let me know and I will ping
you when it is available in a test kernel.
I just added myself to the bug. Thanks!
-- Pasi
Hi Mike,
The bugzilla ticket requests a merge of two git commits, but neither of those
contain the libiscsi.c change that addresses bug #2. Was this a mistake, or did
you deliberately omit that part of your speed-up-conn-fail-take3.patch when you
raised the ticket?
TIA,
Alex
Hey,
It was laziness. I did not update the bugzilla. When I made it, I
thought we were only hitting #1 (this was the first patch I sent in this
thread). But when I was testing those 2 patches with RHEL 5, I finally
hit the problem that Ben was hitting. When I figured out that we were
hitting #2, I made the second patch in this thread. I then just did not
update the bugzilla with the new patch. For RHEL I ended up sending the
second patch though.
Thanks for the clarification. Is the fix for #2 being upstreamed? If so, is
there a git commit I can reference? (This will make it easier for us to drop
the patch when we pull a kernel which has the fix in it.)
Thanks in advance,
Alex
I sent it to linux-scsi/James a couple days after I sent the patch in
this thread. It is not merged yet.
> is there a git commit I can reference? (This will make it easier for us
> to drop the patch when we pull a kernel which has the fix in it.)
Do you want me to cc you on all future iscsi patches that go upstream?
When James merges it and sends it to Linus, I get an automated
message from him. If I cc you, you can get one too.
>
> Thanks in advance,
>
> Alex
failover time = nop timeout + nop interval + replacement_timeout
seconds + scsi block device timeout (/sys/block/sdX/device/timeout)
Is there anything else that I am missing?
-b
/sys/block/sdX/device/timeout is the scsi cmd timeout. It only comes
into play if you have nops off or have their timers set higher than the
scsi cmd timeout (you do not want to do this). When nops are in use and
one times out, then if the scsi cmd timer fires, the iscsi code will
basically tell the scsi layer that it is handling the problem, so the
scsi error handler does not run.
So it is:
failover time = nop timeout + nop interval + replacement_timeout
or
failover time = /sys/block/sdX/device/timeout + replacement_timeout +
min(abort timeout, lun reset timeout, target reset timeout).
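Plugging in the settings from earlier in this thread, a quick sanity check of the first formula (the second depends on the error-handler timeouts, which vary):

```shell
# Expected failover time with nops enabled, using the values in this thread.
noop_interval=5; noop_timeout=5; replacement_timeout=15
failover=$((noop_interval + noop_timeout + replacement_timeout))
echo "expected failover: ${failover}s"   # 25s, in line with the 25-30s Ben measured
```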