What version of open-iscsi and kernel are you using? Check out section 8
of the readme (I attached the current one). Does that help? If not, let
me know what other info you need.
> target/initiator: DefaultTime2Retain
With ERL0 and single-connection sessions this does not really do
anything. When a problem is detected that forces us to drop the session
and relogin, the IO is going to be failed and retried. So if your
network disruption caused the kernel network layer to return an error or
a tcp/ip state change notification, then we are going to relogin and IO
is going to be failed and retried if possible. For block/FS IO you get 5
retries for most errors, but if replacement_timeout fires then IO is
failed right away (the readme should have more info). Or, if your
disruption lasts longer than the ping/nop (see the readme) or device
timeout, then we are going to have to fail the IO and retry if possible.
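
As a rough illustration of the retry rule just described, here is a
minimal sketch. It is not open-iscsi or kernel code: the helper name
disposition() and the constant MAX_DISK_RETRIES are made up for this
example, and "most errors" is simplified to "every error".

```python
MAX_DISK_RETRIES = 5  # from the text: 5 retries for most errors on block/FS IO


def disposition(retries_used: int, replacement_timeout_fired: bool) -> str:
    """Hypothetical helper: what happens to an IO the iscsi layer just failed."""
    if replacement_timeout_fired:
        return "fail"  # failed right away, whether or not retries are left
    if retries_used < MAX_DISK_RETRIES:
        return "retry"  # requeued and sent again once we have relogged in
    return "fail"  # retries exhausted
```

So a command that has been retried fewer than 5 times is retried again,
unless replacement_timeout has already fired.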
>
> Apparently the "Time2Retain" is negotiated between initiator and target.
>
> Can somebody explain how these values are used in running iscsi session?
> Which values are relevant if I want to prevent I/O errors on the clients
> side due to a short disruption of the network connection?
>
> Regards,
> Dennis
>
> My problem is that I'm not sure how the various timeouts relate to each
> other. What I basically want to be able to do is to guarantee that if e.g.
> a network outage lasts X seconds I want the virtual machines to recover and
> not get an I/O error resulting in a corrupt filesystem.
>
> From the readme it sound like the first thing that happens are the 5 "ping"
> retries and this would last 5*noop_out_timeout seconds. What happens after
There are no ping retries. The ping gets just one chance. There are 5
retries for disk IO.
> that?
> It sounds like a re-establishment of the connection is then attempted. Will
> this then generate new noop retry cycle and last until the
> replacement_timeout has passed? At which point does the os device timeout
> come into play (/sys/block/sdX/...)?
No.
>
> I guess what I'm looking for is a sort of timeline. The network gets
> unplugged and an I/O request is issued (e.g. a simple "ls" on the
> filesystem on an iscsi device) to the device. What happens with this I/O
> request until it hits the wall and the failure manifest itself and show up
> as an I/O error on the console?
1. The initiator sends a ping if there is no activity (READ/WRITE
request being sent) on the connection for timeo.noop_out_interval
seconds.
2. If we do not get a response to the ping within noop_out_timeout
seconds we fail the connection.
3. The iscsi layer will try to relogin to the target.
4.
A. If the command was running (it has not timed out and the scsi eh is
not running) then the IO will be failed to the scsi layer, and if it has
retries left (so if it has been retried less than 5 times for disk IO)
it will be queued in the block/scsi layer.
B. If the command had already timed out then it is sort of stuck in the
scsi eh until we relogin or replacement_timeout fires. It will sit in
there waiting for the outcome of #5.
5.
A. If we relogin within replacement_timeout seconds then IO will be
restarted if the command had enough retries left.
B. If we cannot relogin within replacement_timeout seconds then the IO
will be failed upwards (if you are using dm-multipath then it will
handle the problem).
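
To put rough numbers on that timeline, here is a small sketch. It is a
simplified model, not something taken from the open-iscsi code: the
class and function names are made up, the timeout values are only
example values, and it assumes replacement_timeout starts counting at
the moment the connection failure is detected. It ignores the device
timeout and the scsi eh.

```python
from dataclasses import dataclass


@dataclass
class Timeouts:
    noop_out_interval: float    # idle seconds before a ping is sent
    noop_out_timeout: float     # seconds to wait for the ping response
    replacement_timeout: float  # seconds allowed for relogin


def latest_detection(t: Timeouts) -> float:
    """Worst case for steps 1-2: a full idle interval, then a full ping timeout."""
    return t.noop_out_interval + t.noop_out_timeout


def fail_upwards_window(t: Timeouts) -> tuple:
    """Earliest and latest time (seconds after the outage starts) at which
    IO is failed upwards if relogin never succeeds (step 5B)."""
    # Earliest: the network layer reports the error immediately.
    earliest = t.replacement_timeout
    # Latest: the outage is only noticed through the ping.
    latest = latest_detection(t) + t.replacement_timeout
    return earliest, latest


# Example values only; check your own iscsid.conf / node settings.
example = Timeouts(noop_out_interval=5, noop_out_timeout=5,
                   replacement_timeout=120)
print(latest_detection(example))     # 10
print(fail_upwards_window(example))  # (120, 130)
```

Under these assumptions an outage shorter than replacement_timeout
leaves room to relogin before IO is failed upwards, provided the relogin
succeeds as soon as the network is back.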