Immediate fs errors on connection problem


Niels de Carpentier

Apr 21, 2008, 7:58:40 AM
to open-...@googlegroups.com
I have a number of systems with an iscsi root filesystem. These systems
connect to a redundant pair of iscsi servers, using tgtd. I use heartbeat
to fail over the iscsi target. I'm using open-iscsi 869. I tried both the
iscsi transport 869 and the default centos 724. The iscsid used was
always 869.

I've set the replacement timeout high, so the iscsi root system should be
able to recover from the short outage when the iscsi target fails over to
another server:

node.session.timeo.replacement_timeout = 86400

Unfortunately, this doesn't always work. Sometimes the OS will report
filesystem errors and mount the fs read-only. A short time later the iscsi
targets will be reconnected, but the filesystem is already read-only by
then.
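(For reference, a timeout like that can also be applied to an existing node record with iscsiadm; the target name and portal below are placeholders, not the real values from this setup:)

```shell
# Hypothetical target/portal; substitute your own. Updates the persistent
# node record so the next login uses the new replacement timeout.
iscsiadm -m node -T iqn.2008-04.com.example:storage \
    -p 192.168.0.1:3260 \
    -o update -n node.session.timeo.replacement_timeout -v 86400
```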

The logs show (default iscsi transport 724 was used for this test):

Apr 21 11:35:25 front003 kernel: end_request: I/O error, dev sda, sector
1336006
Apr 21 11:35:25 front003 kernel: end_request: I/O error, dev sda, sector
1336006
Apr 21 11:35:25 front003 kernel: Buffer I/O error on device sda1, logical
block 166993
Apr 21 11:35:25 front003 kernel: Buffer I/O error on device sda1, logical
block 166993
<more disk errors>
Apr 21 11:35:26 front003 kernel: ext3_abort called.
Apr 21 11:35:26 front003 kernel: ext3_abort called.
Apr 21 11:35:26 front003 kernel: EXT3-fs error (device sda1):
ext3_journal_start_sb: Detected aborted journal
Apr 21 11:35:26 front003 kernel: EXT3-fs error (device sda1):
ext3_journal_start_sb: Detected aborted journal
Apr 21 11:35:26 front003 kernel: Remounting filesystem read-only
Apr 21 11:35:26 front003 kernel: Remounting filesystem read-only
Apr 21 11:35:36 front003 kernel: connection1:0: iscsi: detected conn error
(1011)
Apr 21 11:35:36 front003 kernel: connection1:0: iscsi: detected conn error
(1011)
Apr 21 11:35:36 front003 iscsid: Kernel reported iSCSI connection 1:0
error (1011) state (3)
Apr 21 11:35:40 front003 kernel: connection5:0: iscsi: detected conn error
(1011)
Apr 21 11:35:40 front003 kernel: connection5:0: iscsi: detected conn error
(1011)
Apr 21 11:35:40 front003 kernel: connection8:0: iscsi: detected conn error
(1011)
Apr 21 11:35:40 front003 kernel: connection8:0: iscsi: detected conn error
(1011)
Apr 21 11:35:40 front003 iscsid: received iferror -38
Apr 21 11:35:40 front003 iscsid: received iferror -38
Apr 21 11:35:40 front003 iscsid: received iferror -38
Apr 21 11:35:40 front003 iscsid: received iferror -38
Apr 21 11:35:40 front003 iscsid: received iferror -38
Apr 21 11:35:40 front003 iscsid: connection1:0 is operational after
recovery (1 attempts)

Is there any way to prevent this, so an iscsi root system can recover
gracefully from a short outage?

Niels

Mike Christie

Apr 22, 2008, 11:15:59 AM
to open-...@googlegroups.com
Niels de Carpentier wrote:
> I have a number of systems with an iscsi root filesystem. These systems
> connect to a redundant pair of iscsi servers, using tgtd. I use heartbeat
> to fail over the iscsi target. I'm using open-iscsi 869. I tried both the
> iscsi transport 869 and the default centos 724. The iscsid used was
> always 869.
>
> I've set the replacement timeout high, so the iscsi root system should be
> able to recover from the short outage if the iscsi target fails over to
> another server:
>
> node.session.timeo.replacement_timeout = 86400
>
> Unfortunately, this doesn't always work. Sometimes the OS will report
> filesystem errors and mount the fs read-only. A short time later the iscsi
> targets will be reconnected, but the filesystem is already read-only by
> then.
>
> The logs show (default iscsi transport 724 was used for this test):

Could you send the parts of the log before the fs errors? We want to see
how many conn errors there were, when they occurred, and when the
replacement/recovery timeout fired.

Could you also run
iscsiadm -m node -T target -p ip:port
for the root target and send the output?

And could you run

cat /sys/class/iscsi_session/session1/recovery_tmo

(session1 is the session for the root disk, right? If not, replace the
number with whatever it is.)

and send that output?
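(A small loop can print the value for every session at once; a sketch, assuming the standard sysfs layout:)

```shell
# Print recovery_tmo for each active iSCSI session; prints nothing if no
# sessions exist.
for s in /sys/class/iscsi_session/session*; do
    [ -e "$s" ] || continue
    echo "$(basename "$s"): $(cat "$s/recovery_tmo")"
done
```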

> Apr 21 11:35:40 front003 iscsid: connection1:0 is operational after
> recovery (1 attempts)
>

This is weird, because it only took one recovery attempt, so it looks
like it was a really short outage that should have completed within the
replacement timeout you set.

Tomasz Chmielewski

Apr 22, 2008, 11:20:42 AM
to open-...@googlegroups.com
Niels de Carpentier schrieb:

> I have a number of systems with an iscsi root filesystem. These systems
> connect to an redundant pair of iscsi servers, using tgtd. I use heartbeat
^^^^


> Is there any way to prevent this, so an iscsi root system can recover
> gracefully from a short outage?

You are using tgtd.

What is causing these disconnections? I guess they're caused manually by
you, but is it:

1) cabling, switches
2) firewall, routing etc.
3) tgtd restart?

--
Tomasz Chmielewski
http://wpkg.org


Tomasz Chmielewski

Apr 22, 2008, 11:28:03 AM
to open-...@googlegroups.com
Tomasz Chmielewski schrieb:

Anyway, if it's 3) tgtd restart - it's "by design", and you should
complain on the stgt-devel mailing list. There have been some changes in
git lately, but AFAIK it hasn't improved in all areas.

If it's either 1) or 2), there is something fishy here.


Supposing it's 3): there is a slight race between starting tgtd and being
able to configure targets with tgtadm, so a "sleep 2s" would be
recommended. The rest is like below: tgtd already listens on 3260 but has
no targets configured, so any initiator that connects will be rejected,
and hence your immediate fs errors.

One workaround is to block iSCSI traffic with iptables before starting
tgtd, and to remove the block after all targets are configured.
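A sketch of that workaround, assuming the default iSCSI port 3260 and hypothetical target/LUN values:

```shell
# Block initiators until tgtd is fully configured, then lift the block.
iptables -I INPUT -p tcp --dport 3260 -j DROP
tgtd
sleep 2   # let tgtd finish starting before tgtadm can talk to it
tgtadm --lld iscsi --op new --mode target --tid 1 \
    -T iqn.2008-04.com.example:storage
tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 -b /dev/sdb
iptables -D INPUT -p tcp --dport 3260 -j DROP
```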


target              initiator
-----------------------------------------------
not started         trying to reconnect
start tgtd          trying to reconnect
sleep 2s            trying to reconnect
nothing configured  login I/O error - non fatal
configure target1   conn to target1 OK
no such target      conn to target2 FAIL
                    I/O error to target2
configure target2   too late, fatal, we lost it

Niels de Carpentier

Apr 22, 2008, 11:33:45 AM
to open-...@googlegroups.com

In this case the failover is indeed manual for failover testing. The
failover process basically is:

server1: remove virtual IP
server1: remove luns from tgtd
server1: Make local DRBD device secondary
server2: Make DRBD device primary
server2: add luns to tgtd
server2: Add virtual IP

Since the iscsi initiators connect to the virtual IP, there can be no
network connectivity while the switch is in progress. This should prevent
any race conditions in tgtd.
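Scripted, the steps above might look roughly like this (a sketch; the VIP, interface, DRBD resource, and tid/lun values are placeholders):

```shell
# -- on server1 --
ip addr del 10.40.99.1/24 dev eth0          # remove virtual IP
tgtadm --lld iscsi --op delete --mode logicalunit --tid 1 --lun 1
drbdadm secondary r0                        # make local DRBD device secondary
# -- on server2 --
drbdadm primary r0                          # make DRBD device primary
tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 -b /dev/drbd0
ip addr add 10.40.99.1/24 dev eth0          # add virtual IP
```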

The network trace looks normal: some retries to the VIP while the switch
is in progress, and an RST of the connection once the switch is done. The
initiator reconnects normally after the reset, but the damage has already
been done by then. (And it looks like the problems even start before the
RST of the connection.)

Niels

Tomasz Chmielewski

Apr 22, 2008, 11:40:24 AM
to open-...@googlegroups.com
Niels de Carpentier schrieb:

>> Niels de Carpentier schrieb:
>>> I have a number of systems with an iscsi root filesystem. These systems
>>> connect to an redundant pair of iscsi servers, using tgtd. I use
>>> heartbeat
>> ^^^^
>>
>>
>>> Is there any way to prevent this, so an iscsi root system can recover
>>> gracefully from a short outage?
>> You are using tgtd.
>>
>> What is causing these disconnections? I guess they're caused manually by
>> you, but is it:
>>
>> 1) cabling, switches
>> 2) firewall, routing etc.
>> 3) tgtd restart?
>
> In this case the failover is indeed manual for failover testing. The
> failover process basically is:
>
> server1: remove virtual IP
> server1: remove luns from tgtd

Have you tried doing everything on server1:

server1: remove virtual IP
server1: remove luns from tgtd

server1: add luns to tgtd
server1: Add virtual IP


If it works, try running DRBD in multi-master mode, and do:

server1: remove virtual IP
server1: remove luns from tgtd

server2: add luns to tgtd
server2: Add virtual IP


Actually, I don't think removing luns in tgtd is supported while
initiators are still connected. And tgtd only reacts to a brutal "pkill
-9 tgtd".


> Since the iscsi initiators connect to the virtual IP, there can be no
> network connectivity while the switch is in progress. This should prevent
> any race conditions in tgtd.

Yep. Unless you omitted something important ;)

Niels de Carpentier

Apr 22, 2008, 12:04:59 PM
to open-...@googlegroups.com
>
> Niels de Carpentier wrote:
>>
>> Unfortunately, this doesn't always work. Sometimes the OS will report
>> filesystem errors and mount the fs read-only. A short time later the
>> iscsi
>> targets will be reconnected, but the filesystem is already read-only by
>> then.
>>
>> The logs show (default iscsi transport 724 was used for this test):
>
> Could you send the parts of the log before the fs errors? We want to see
> how many conn errors there were, when they occurred, and when the
> replacement/recovery timeout fired.

These were actually the first error messages for that test. The I/O errors
happen almost instantly, and it takes almost 10 seconds after that to
detect the actual connection error (after which it almost immediately
reconnects). Possibly this has something to do with the way I do the
switchover? (Remove the virtual IP address from one server and add it to
the other.)


>
> Could you also run
> iscsiadm -m node -T target -p ip:port
> for the root target and send the output?

#iscsiadm -m node -T iqn.2007-11.com.smys:storage.front003 -p 10.40.99.1:3260
iscsiadm: Config file line 47 too long.
node.name = iqn.2007-11.com.smys:storage.front003
node.tpgt = 1
node.startup = manual
iface.hwaddress = default
iface.iscsi_ifacename = default
iface.net_ifacename = default
iface.transport_name = tcp
node.discovery_address = 10.40.99.1
node.discovery_port = 3260
node.discovery_type = send_targets
node.session.initial_cmdsn = 0
node.session.initial_login_retry_max = 4
node.session.cmds_max = 32
node.session.queue_depth = 32
node.session.auth.authmethod = CHAP
node.session.auth.username = frontend
node.session.auth.password = ********
node.session.auth.username_in = <empty>
node.session.auth.password_in = <empty>
node.session.timeo.replacement_timeout = 86400
node.session.err_timeo.abort_timeout = 15
node.session.err_timeo.lu_reset_timeout = 30
node.session.err_timeo.host_reset_timeout = 60
node.session.iscsi.FastAbort = Yes
node.session.iscsi.InitialR2T = No
node.session.iscsi.ImmediateData = Yes
node.session.iscsi.FirstBurstLength = 262144
node.session.iscsi.MaxBurstLength = 16776192
node.session.iscsi.DefaultTime2Retain = 0
node.session.iscsi.DefaultTime2Wait = 2
node.session.iscsi.MaxConnections = 1
node.session.iscsi.MaxOutstandingR2T = 1
node.session.iscsi.ERL = 0
node.conn[0].address = 10.40.99.1
node.conn[0].port = 3260
node.conn[0].startup = manual
node.conn[0].tcp.window_size = 524288
node.conn[0].tcp.type_of_service = 0
node.conn[0].timeo.logout_timeout = 15
node.conn[0].timeo.login_timeout = 30
node.conn[0].timeo.auth_timeout = 45
node.conn[0].timeo.noop_out_interval = 0
node.conn[0].timeo.noop_out_timeout = 0
node.conn[0].iscsi.MaxRecvDataSegmentLength = 131072
node.conn[0].iscsi.HeaderDigest = None,CRC32C
node.conn[0].iscsi.DataDigest = None
node.conn[0].iscsi.IFMarker = No
node.conn[0].iscsi.OFMarker = No

Some values have changed since the log I sent you:

node.session.cmds_max 128 -> 32
node.session.iscsi.InitialR2T Yes -> No
DefaultTime2Retain 20 -> 86400
DefaultTime2Wait 2 -> 10

>
> And could you run
>
> cat /sys/class/iscsi_session/session1/recovery_tmo
>
> (session1 is the session for the root disk, right? If not, replace the
> number with whatever it is.)
>
> and send that output?

#cat /sys/class/iscsi_session/session1/recovery_tmo
86400

>
>> Apr 21 11:35:40 front003 iscsid: connection1:0 is operational after
>> recovery (1 attempts)
>>
>
> This is weird, because it only took one recovery attempt, so it looks
> like it was a really short outage that should have completed within the
> replacement timeout you set.

The outage is probably 15-20 seconds, while the target is switched over to
another server. The actual reconnect is to another identically configured
server/target. The weird thing is the timeline:

11:35:25 I/O errors
11:35:36 connection error detected
11:35:40 reconnected

Niels

Niels de Carpentier

Apr 22, 2008, 12:21:17 PM
to open-...@googlegroups.com
>
> Anyway, if it's 3) tgtd restart - it's "by design", and you should
> complain on the stgt-devel mailing list. There have been some changes in
> git lately, but AFAIK it hasn't improved in all areas.
>
> If it's either 1) or 2), there is something fishy here.

I'm using the "offline" patch, which fixes this nicely. I can restart tgtd
without any issues. (The patch is not in git yet though)


>
> Supposing it's 3): there is a slight race between starting tgtd and being
> able to configure targets with tgtadm, so a "sleep 2s" would be
> recommended. The rest is like below: tgtd already listens on 3260 but has
> no targets configured, so any initiator that connects will be rejected,
> and hence your immediate fs errors.

The "offline" patch fixed this. In this case, tgtd will only respond to
commands once you set it to running.


>
> One workaround is to block iSCSI traffic with iptables before starting
> tgtd and removing the block after all targets are configured.

Removing the IP address should have the same effect. I'll also do some
tests with iptables to see if that helps.

Niels

Mike Christie

Apr 22, 2008, 12:36:30 PM
to open-...@googlegroups.com
Niels de Carpentier wrote:
>> Niels de Carpentier wrote:
>>> Unfortunately, this doesn't always work. Sometimes the OS will report
>>> filesystem errors and mount the fs read-only. A short time later the
>>> iscsi
>>> targets will be reconnected, but the filesystem is already read-only by
>>> then.
>>>
>>> The logs show (default iscsi transport 724 was used for this test):
>> Could you send the parts of the log before the fs errors? We want to see
>> how many conn errors there were, when they occurred, and when the
>> replacement/recovery timeout fired.
>
> These were actually the first error messages for that test. The I/O errors

Huh, that does not make sense. It is also weird that there are no scsi
errors before the block layer ones like here:

> Apr 21 11:35:25 front003 kernel: end_request: I/O error, dev sda, sector
> 1336006

The iscsi layer can only fail commands after getting the connection
error, so you should see


> Apr 21 11:35:36 front003 kernel: connection1:0: iscsi: detected conn error
> (1011)

Then something about the recovery/replacement timeout expiring, then
some scsi error messages, and finally the block errors above.

> happen almost instantly, and it takes almost 10 seconds after that to
> detect the actual connection error. (After which it almost immediately
> reconnects). Possibly this has something to do with the way I do the
> switchover? (Remove virtual IP address from one server and add it to the
> other.)

Maybe.

Niels de Carpentier

Apr 22, 2008, 1:07:44 PM
to open-...@googlegroups.com
>>
>> These were actually the first error messages for that test. The I/O
>> errors
>
> Huh, that does not make sense. It is also weird that there are no scsi
> errors before the block layer ones like here:
>
> Apr 21 11:35:25 front003 kernel: end_request: I/O error, dev sda, sector
> 1336006
>
> The iscsi layer can only fail commands after getting the connection
> error, so you should see
>
> Apr 21 11:35:36 front003 kernel: connection1:0: iscsi: detected conn error
> (1011)
>
> Then something about the recovery/replacement timeout expiring, then
> some scsi error messages, and finally the block errors above.

Yes, that's what you would expect, but somehow this isn't happening.
The kernel is a centos dom0 kernel: 2.6.18-53.1.14.el5xen

The error messages relating to connection1 before the error are from an
earlier test:

Apr 21 11:33:34 front003 kernel: connection1:0: iscsi: detected conn error
(1011)
Apr 21 11:33:34 front003 kernel: connection1:0: iscsi: detected conn error
(1011)
Apr 21 11:33:34 front003 iscsid: Kernel reported iSCSI connection 1:0
error (1011) state (3)
Apr 21 11:33:38 front003 iscsid: received iferror -38
Apr 21 11:33:38 front003 iscsid: received iferror -38
Apr 21 11:33:38 front003 iscsid: received iferror -38
Apr 21 11:33:38 front003 iscsid: received iferror -38
Apr 21 11:33:38 front003 iscsid: received iferror -38
Apr 21 11:33:38 front003 iscsid: connection1:0 is operational after
recovery (1 attempts)

In this earlier case the failover worked correctly. (Maybe no commands
were queued?)

The full logs for the error I sent earlier:

Apr 21 11:35:25 front003 kernel: end_request: I/O error, dev sda, sector
1336006
Apr 21 11:35:25 front003 kernel: end_request: I/O error, dev sda, sector
1336006

Apr 21 11:35:25 front003 kernel: Buffer I/O error on device sda1, logical
block 166993
Apr 21 11:35:25 front003 kernel: Buffer I/O error on device sda1, logical
block 166993
Apr 21 11:35:25 front003 kernel: lost page write due to I/O error on sda1
Apr 21 11:35:25 front003 kernel: lost page write due to I/O error on sda1
Apr 21 11:35:25 front003 kernel: Buffer I/O error on device sda1, logical
block 166994
Apr 21 11:35:25 front003 kernel: Buffer I/O error on device sda1, logical
block 166994
Apr 21 11:35:25 front003 kernel: lost page write due to I/O error on sda1
Apr 21 11:35:25 front003 kernel: lost page write due to I/O error on sda1
Apr 21 11:35:25 front003 kernel: Buffer I/O error on device sda1, logical
block 166995
Apr 21 11:35:25 front003 kernel: Buffer I/O error on device sda1, logical
block 166995
Apr 21 11:35:25 front003 kernel: lost page write due to I/O error on sda1
Apr 21 11:35:25 front003 kernel: lost page write due to I/O error on sda1
Apr 21 11:35:25 front003 kernel: Buffer I/O error on device sda1, logical
block 166996
Apr 21 11:35:25 front003 kernel: Buffer I/O error on device sda1, logical
block 166996
Apr 21 11:35:25 front003 kernel: lost page write due to I/O error on sda1
Apr 21 11:35:25 front003 kernel: lost page write due to I/O error on sda1
Apr 21 11:35:25 front003 kernel: Buffer I/O error on device sda1, logical
block 166997
Apr 21 11:35:25 front003 kernel: Buffer I/O error on device sda1, logical
block 166997
Apr 21 11:35:25 front003 kernel: lost page write due to I/O error on sda1
Apr 21 11:35:25 front003 kernel: lost page write due to I/O error on sda1
Apr 21 11:35:25 front003 kernel: Buffer I/O error on device sda1, logical
block 166998
Apr 21 11:35:25 front003 kernel: Buffer I/O error on device sda1, logical
block 166998
Apr 21 11:35:25 front003 kernel: lost page write due to I/O error on sda1
Apr 21 11:35:25 front003 kernel: lost page write due to I/O error on sda1
Apr 21 11:35:25 front003 kernel: Buffer I/O error on device sda1, logical
block 166999
Apr 21 11:35:25 front003 kernel: Buffer I/O error on device sda1, logical
block 166999
Apr 21 11:35:25 front003 kernel: lost page write due to I/O error on sda1
Apr 21 11:35:25 front003 kernel: lost page write due to I/O error on sda1
Apr 21 11:35:25 front003 kernel: Buffer I/O error on device sda1, logical
block 167000
Apr 21 11:35:25 front003 kernel: Buffer I/O error on device sda1, logical
block 167000
Apr 21 11:35:25 front003 kernel: lost page write due to I/O error on sda1
Apr 21 11:35:25 front003 kernel: lost page write due to I/O error on sda1
Apr 21 11:35:25 front003 kernel: Buffer I/O error on device sda1, logical
block 167001
Apr 21 11:35:25 front003 kernel: Buffer I/O error on device sda1, logical
block 167001
Apr 21 11:35:25 front003 kernel: lost page write due to I/O error on sda1
Apr 21 11:35:25 front003 kernel: lost page write due to I/O error on sda1
Apr 21 11:35:25 front003 kernel: Buffer I/O error on device sda1, logical
block 167002
Apr 21 11:35:25 front003 kernel: Buffer I/O error on device sda1, logical
block 167002
Apr 21 11:35:25 front003 kernel: lost page write due to I/O error on sda1
Apr 21 11:35:25 front003 kernel: lost page write due to I/O error on sda1
Apr 21 11:35:25 front003 kernel: end_request: I/O error, dev sda, sector
1337070
Apr 21 11:35:25 front003 kernel: end_request: I/O error, dev sda, sector
1337070
Apr 21 11:35:25 front003 kernel: end_request: I/O error, dev sda, sector
1337078
Apr 21 11:35:25 front003 kernel: end_request: I/O error, dev sda, sector
1337078
Apr 21 11:35:25 front003 kernel: end_request: I/O error, dev sda, sector
1335606
Apr 21 11:35:25 front003 kernel: end_request: I/O error, dev sda, sector
1335606
Apr 21 11:35:25 front003 kernel: end_request: I/O error, dev sda, sector
1335710
Apr 21 11:35:25 front003 kernel: end_request: I/O error, dev sda, sector
1335710
Apr 21 11:35:25 front003 kernel: Aborting journal on device sda1.
Apr 21 11:35:25 front003 kernel: Aborting journal on device sda1.
Apr 21 11:35:25 front003 kernel: end_request: I/O error, dev sdf, sector
18344
Apr 21 11:35:25 front003 kernel: end_request: I/O error, dev sdf, sector
18344
Apr 21 11:35:25 front003 kernel: end_request: I/O error, dev sdf, sector
18520
Apr 21 11:35:25 front003 kernel: end_request: I/O error, dev sdf, sector
18520
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
787520
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
787520
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
788224
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
788224
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector 0
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector 0
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
788240
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
788240
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
788384
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
788384
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
1572880
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
1572880
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
1572888
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
1572888
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
1572944
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
1572944
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
1573000
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
1573000
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
1573016
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
1573016
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
1573024
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
1573024
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
1573328
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
1573328
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
1573600
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
1573600
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
2097168
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
2097168
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
2097176
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
2097176
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
2097576
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
2097576
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
2883584
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
2883584
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
2885008
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
2885008
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
3407888
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
3407888
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
3407904
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
3407904
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
3407936
Apr 21 11:35:26 front003 kernel: end_request: I/O error, dev sdj, sector
3407936


Apr 21 11:35:26 front003 kernel: ext3_abort called.
Apr 21 11:35:26 front003 kernel: ext3_abort called.
Apr 21 11:35:26 front003 kernel: EXT3-fs error (device sda1):
ext3_journal_start_sb: Detected aborted journal
Apr 21 11:35:26 front003 kernel: EXT3-fs error (device sda1):
ext3_journal_start_sb: Detected aborted journal
Apr 21 11:35:26 front003 kernel: Remounting filesystem read-only
Apr 21 11:35:26 front003 kernel: Remounting filesystem read-only

Apr 21 11:35:36 front003 kernel: connection1:0: iscsi: detected conn error
(1011)
Apr 21 11:35:36 front003 kernel: connection1:0: iscsi: detected conn error
(1011)

Apr 21 11:35:36 front003 iscsid: Kernel reported iSCSI connection 1:0
error (1011) state (3)

Apr 21 11:35:40 front003 kernel: connection5:0: iscsi: detected conn error
(1011)
Apr 21 11:35:40 front003 kernel: connection5:0: iscsi: detected conn error
(1011)
Apr 21 11:35:40 front003 kernel: connection8:0: iscsi: detected conn error
(1011)
Apr 21 11:35:40 front003 kernel: connection8:0: iscsi: detected conn error
(1011)
Apr 21 11:35:40 front003 iscsid: received iferror -38
Apr 21 11:35:40 front003 iscsid: received iferror -38
Apr 21 11:35:40 front003 iscsid: received iferror -38
Apr 21 11:35:40 front003 iscsid: received iferror -38
Apr 21 11:35:40 front003 iscsid: received iferror -38


Apr 21 11:35:40 front003 iscsid: connection1:0 is operational after
recovery (1 attempts)

Apr 21 11:35:40 front003 iscsid: Kernel reported iSCSI connection 5:0
error (1011) state (3)
Apr 21 11:35:40 front003 iscsid: Kernel reported iSCSI connection 8:0
error (1011) state (3)
Apr 21 11:35:40 front003 kernel: connection4:0: iscsi: detected conn error
(1011)
Apr 21 11:35:40 front003 kernel: connection4:0: iscsi: detected conn error
(1011)
Apr 21 11:35:41 front003 iscsid: Kernel reported iSCSI connection 4:0
error (1011) state (3)
Apr 21 11:35:44 front003 iscsid: received iferror -38
Apr 21 11:35:44 front003 iscsid: received iferror -38
Apr 21 11:35:44 front003 iscsid: received iferror -38
Apr 21 11:35:44 front003 iscsid: received iferror -38
Apr 21 11:35:44 front003 iscsid: received iferror -38
Apr 21 11:35:44 front003 iscsid: connection5:0 is operational after
recovery (1 attempts)
Apr 21 11:35:44 front003 iscsid: received iferror -38
Apr 21 11:35:44 front003 iscsid: received iferror -38
Apr 21 11:35:44 front003 iscsid: received iferror -38
Apr 21 11:35:44 front003 iscsid: received iferror -38
Apr 21 11:35:44 front003 iscsid: received iferror -38
Apr 21 11:35:44 front003 iscsid: connection4:0 is operational after
recovery (1 attempts)
Apr 21 11:35:44 front003 iscsid: received iferror -38
Apr 21 11:35:44 front003 iscsid: received iferror -38
Apr 21 11:35:44 front003 iscsid: received iferror -38
Apr 21 11:35:44 front003 iscsid: received iferror -38
Apr 21 11:35:44 front003 iscsid: received iferror -38
Apr 21 11:35:44 front003 iscsid: connection8:0 is operational after
recovery (1 attempts)

Niels

Niels de Carpentier

Apr 23, 2008, 5:16:35 AM
to open-...@googlegroups.com
>
> Actually, I don't think removing luns in tgtd is supported when
> initiators are still connected. And tgtd only reacts to a brutal "pkill
> -9 tgtd".

I have rewritten the failover to use a killall -9 tgtd, and now everything
works as expected. It looks like the server would still have some
connections active for a short time, even though the IP address was
already removed. (netstat also shows the old connections, even though the
IP address is not configured on the server anymore.) In that case, the lun
would be removed while the initiator was using it, which would probably
cause the problems I saw. The actual connection would still be up, but the
data would not be accessible anymore.
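With that conclusion, the server1 half of the failover might simply kill tgtd so no stale connections survive (a sketch; names are placeholders):

```shell
ip addr del 10.40.99.1/24 dev eth0   # remove virtual IP first
killall -9 tgtd                      # drop all initiator connections at once
drbdadm secondary r0                 # demote the local DRBD device
```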

>> Since the iscsi initiators connect to the virtual IP, there can be no
>> network connectivity while the switch is in progress. This should
>> prevent
>> any race conditions in tgtd.
>
> Yep. Unless you omitted something important ;)

Indeed, it seems this was an invalid assumption.

Niels
