iscsi timeouts

Dennis Jacobfeuerborn

Jan 17, 2012, 4:27:09 PM

to open-...@googlegroups.com

Hi,
I'm currently looking into setting up my first iscsi storage system(s) for virtualization.
The issue I'm currently trying to find more details on is the timeouts involved in the iscsi traffic.

From what I found on the web the following timeouts are relevant:

os: /sys/block/sda/device/timeout
initiator: node.session.timeo.replacement_timeout
target/initiator: DefaultTime2Retain
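
For reference, the OS-level value in the list above can be read straight from sysfs. This is only a minimal sketch: the helper name `read_scsi_timeout` and its `sys_block` parameter are made up for illustration, and it assumes the file holds a single integer number of seconds.

```python
from pathlib import Path


def read_scsi_timeout(device: str, sys_block: str = "/sys/block") -> int:
    """Return the SCSI command timeout in seconds for a device such as 'sda'.

    `sys_block` is a hypothetical parameter, present only so the function can
    be pointed at a different directory (for example in a test).
    """
    path = Path(sys_block) / device / "device" / "timeout"
    # The file is expected to contain one integer followed by a newline.
    return int(path.read_text().strip())
```

On a host with an iSCSI disk at sda, `read_scsi_timeout("sda")` would return whatever is currently stored in /sys/block/sda/device/timeout.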

Apparently the "Time2Retain" is negotiated between initiator and target.

Can somebody explain how these values are used in a running iscsi session?
Which values are relevant if I want to prevent I/O errors on the client side due to a short disruption of the network connection?

Regards,
Dennis

Mike Christie

Jan 17, 2012, 9:01:20 PM

to open-...@googlegroups.com, Dennis Jacobfeuerborn

On 01/17/2012 03:27 PM, Dennis Jacobfeuerborn wrote:
> Hi,
> I'm currently looking into setting up my first iscsi storage system(s)
> for virtualization.
> The issue I'm currently trying to find more details on is the timeouts
> involved in the iscsi traffic.
>
> From what I found on the web the following timeouts are relevant:
>
> os: /sys/block/sda/device/timeout
> initiator: node.session.timeo.replacement_timeout

What version of open-iscsi and kernel are you using? Check out section 8
of the readme (I attached the current one). Does that help? If not, let
me know what other info you need.

> target/initiator: DefaultTime2Retain

With ERL0 and single-connection sessions this does not really do
anything. When a problem is detected that forces us to drop the
session and relogin, the IO is going to be failed and retried. So if your
network disruption caused the kernel network layer to return an error or
tcp/ip state change notification, then we are going to relogin and IO is
going to be failed and retried if possible. For block/FS IO you get 5
retries for most errors; if replacement_timeout fires, though, the IO is
failed right away (the readme should have more info). Or, if your
disruption lasts longer than the ping/nop timeout (see the readme) or the
device timeout, then we are going to have to fail the IO and retry if possible.
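
The retry rule described above can be written down as a tiny decision function. This is only a rough sketch of the rule as stated in this message, not code from open-iscsi or the kernel: the names `MAX_DISK_IO_RETRIES` and `next_action` are hypothetical, and the "most errors" caveat is ignored.

```python
# Hypothetical constant: "5 retries for most errors" for block/FS IO.
MAX_DISK_IO_RETRIES = 5


def next_action(retries_used: int, replacement_timeout_fired: bool) -> str:
    """Return 'retry' or 'fail' for an IO that was just failed by the iscsi layer."""
    if replacement_timeout_fired:
        # Once replacement_timeout fires, the IO is failed right away,
        # regardless of how many retries are left.
        return "fail"
    if retries_used < MAX_DISK_IO_RETRIES:
        return "retry"
    return "fail"
```

For example, `next_action(2, False)` gives "retry", while `next_action(2, True)` gives "fail" because the replacement timeout overrides the retry count.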

>
> Apparently the "Time2Retain" is negotiated between initiator and target.
>
> Can somebody explain how these values are used in running iscsi session?
> Which values are relevant if I want to prevent I/O errors on the clients
> side due to a short disruption of the network connection?
>
> Regards,
> Dennis
>


[Attachment: README]

Dennis Jacobfeuerborn

Jan 18, 2012, 1:58:41 PM

to open-...@googlegroups.com, Dennis Jacobfeuerborn

Thanks for the response. I'm currently toying with this on my Fedora 15 system, but eventually this will be implemented on a CentOS 6.2 system with:

root@dus1san1:~# iscsiadm -V
iscsiadm version 2.0-872.33.el6
root@dus1san1:~# uname -a
Linux dus1san1.cvsn.local 2.6.32-220.el6.x86_64 #1 SMP Tue Dec 6 19:48:22 GMT 2011 x86_64 x86_64 x86_64 GNU/Linux

My problem is that I'm not sure how the various timeouts relate to each other. What I basically want to be able to do is to guarantee that if e.g. a network outage lasts X seconds I want the virtual machines to recover and not get an I/O error resulting in a corrupt filesystem.

From the readme it sounds like the first thing that happens is the 5 "ping" retries, and this would last 5*noop_out_timeout seconds. What happens after that?
It sounds like a re-establishment of the connection is then attempted. Will this then generate a new noop retry cycle and last until the replacement_timeout has passed? At which point does the os device timeout come into play (/sys/block/sdX/...)?

I guess what I'm looking for is a sort of timeline. The network gets unplugged and an I/O request is issued (e.g. a simple "ls" on the filesystem on an iscsi device) to the device. What happens with this I/O request until it hits the wall and the failure manifests itself and shows up as an I/O error on the console?
(Currently I'm not using multipath in the setup I'm experimenting with)

Regards,
Dennis

Mike Christie

Jan 18, 2012, 5:36:45 PM

to open-...@googlegroups.com, Dennis Jacobfeuerborn

On 01/18/2012 12:58 PM, Dennis Jacobfeuerborn wrote:
> Thanks for the response. I'm currently toying with this on my Fedora 15
> system but eventually this will be implemented on a centos 6.2 system with:
>
> root@dus1san1:~# iscsiadm -V
> iscsiadm version 2.0-872.33.el6
> root@dus1san1:~# uname -a
> Linux dus1san1.cvsn.local 2.6.32-220.el6.x86_64 #1 SMP Tue Dec 6 19:48:22
> GMT 2011 x86_64 x86_64 x86_64 GNU/Linux

>
> My problem is that I'm not sure how the various timeouts relate to each
> other. What I basically want to be able to do is to guarantee that if e.g.
> a network outage lasts X seconds I want the virtual machines to recover and
> not get an I/O error resulting in a corrupt filesystem.
>
> From the readme it sound like the first thing that happens are the 5 "ping"
> retries and this would last 5*noop_out_timeout seconds. What happens after

There are no ping retries, just one chance. There are 5 retries for
disk IO.

> that?
> It sounds like a re-establishment of the connection is then attempted. Will
> this then generate new noop retry cycle and last until the
> replacement_timeout has passed? At which point does the os device timeout
> come into play (/sys/block/sdX/...)?

No.

>
> I guess what I'm looking for is a sort of timeline. The network gets
> unplugged and an I/O request is issued (e.g. a simple "ls" on the
> filesystem on an iscsi device) to the device. What happens with this I/O
> request until it hits the wall and the failure manifest itself and show up
> as an I/O error on the console?

1. The initiator sends a ping if there is no activity (no READ/WRITE
request being sent) on the connection for timeo.noop_out_interval seconds.
2. If we do not get a response to the ping within noop_out_timeout seconds,
we fail the connection.
3. The iscsi layer will try to relogin to the target.
4.

A. If the command was running (it has not timed out and the scsi eh is
not running) then the IO will be failed to the scsi layer, and if it has
retries left (so if it has been retried fewer than 5 times for disk IO)
it will be queued in the block/scsi layer.

B. If the command had already timed out then it is sort of stuck in the
scsi eh until we relogin or replacement_timeout fires. It will sit in
there waiting for the outcome of #5.

5.

A. If we relogin within replacement_timeout seconds then IO will be
restarted if the command had enough retries left.

B. If we cannot relogin within replacement_timeout seconds then the IO
will be failed upwards (if you are using dm-multipath then it will
handle the problem).
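
To make the timeline above a bit more concrete, here is a rough model of it in Python. It is a simplification under several assumptions that are not from the readme: the connection is idle when the outage starts, the outage is only noticed through the ping (not through a tcp/ip error), relogin succeeds as soon as the network is back, and the scsi device timeout and the retry count are ignored. The names `Timeouts` and `outcome` are made up for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Timeouts:
    noop_out_interval: int    # seconds of inactivity before a ping is sent
    noop_out_timeout: int     # seconds to wait for the ping response
    replacement_timeout: int  # seconds allowed for relogin before IO is failed


def outcome(outage_seconds: float, t: Timeouts) -> str:
    """Classify an outage of the given length according to steps 1 to 5."""
    # Steps 1 and 2: the ping goes out after noop_out_interval seconds and
    # the connection is failed noop_out_timeout seconds later.
    detect_at = t.noop_out_interval + t.noop_out_timeout
    if outage_seconds < detect_at:
        return "no connection failure detected"
    # Steps 3 and 5A: relogin succeeds before replacement_timeout expires.
    if outage_seconds < detect_at + t.replacement_timeout:
        return "relogin in time: IO restarted if retries remain"
    # Step 5B: replacement_timeout fired before we could relogin.
    return "replacement_timeout fired: IO failed upwards"
```

With example values such as `Timeouts(5, 5, 120)` (check your own iscsid.conf for the real settings), a 60 second outage falls into the "relogin in time" case, while a 200 second outage ends with the IO being failed upwards.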
