We are running open-iscsi 2.0.870~rc3-0.4 (Debian) on
a couple of servers.
One of these servers has had frequent unexplained reboots, and I saw
another one freeze hard when the sdb/sdc device became
unwritable/unreadable.
Could open-iscsi be the culprit here? We've had a number of other
servers from the same vendor (not connected to a SAN and not using
open-iscsi) and have not seen unexplained reboots like this before.
If so, is there hope that this could be fixed in a short amount of time, or
do we need to consider dumping (selling) the SAN and going for regular
hard drives instead?
The SAN is an MD3000i BTW.
Thanks,
Morten
--
Morten W. Petersen
Manager
Nidelven IT Ltd
Phone: +47 45 44 00 69
Email: mor...@nidelven-it.no
Is this the open-iscsi that comes with Debian, or is it an open-iscsi.org
release of 870-rc3?
> a couple of servers.
>
> One of these servers has been frequent unexplained reboots, and another
> I saw had a hard freeze due to the sdb/sdc device becoming
> unwritable/unreadable.
>
> Could open-iscsi be the culprit here?
It could be. Is there anything in /var/log/messages when the reboot
occurs? Do you see something about a conn error, a ping/nop timing out,
or a host reset failing?
How do you know the disk is not read/writable?
Are you doing failover with the MD3000i target?
What values are you using for the noops and replacement_timeout?
Is 14:35:16 when the system hangs? Is there anything else suspicious in
the log? Some IO/request errors, maybe?
It looks like we hit some problem with the transport connection at
14:34:45. We try to reconnect to the target but we get the 113 error
from the network layer when we try to connect. 113 is "No route to host."
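As a quick check on that error number, here is a minimal sketch that looks up an errno value using Python's standard errno and os modules. The helper name describe_errno is made up for illustration, and errno numbering is platform-specific: 113 corresponds to EHOSTUNREACH on Linux, but may mean something else (or nothing) on other systems.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform."""
    # errno.errorcode maps numeric values to names such as 'EHOSTUNREACH'
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # 113 is the value iscsid logged in "connect failed (113)"
    print(describe_errno(113))
```

Run on the initiator itself, this should report the same meaning the kernel's network layer attaches to the failed connect.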
However, it looks like we reconnect about 30 secs later and then there
is no IO error, and so if there was IO running at the time it should
have been completed ok (at least no IO errors to indicate there was a
problem).
So it looks like we handled some transient problem ok.
Mike Christie wrote, On 04/11/09 12:43 AM:
> Morten W. Petersen wrote:
>> Hi,
>>
>> we have some servers running open-iscsi 2.0.870~rc3-0.4 (Debian) on
>
> Is this open-iscsi that comes with debian or is that a open-iscsi.org
> release of 870-rc3?
This is the debian package.
>
>> a couple of servers.
>>
>> One of these servers has been frequent unexplained reboots, and another
>> I saw had a hard freeze due to the sdb/sdc device becoming
>> unwritable/unreadable.
>>
>> Could open-iscsi be the culprit here?
>
> It could be. Is there anything in /var/log/messages when the reboot
> occurs? Do you see something about a conn error or ping/nop timing out
> or something about a host reset failing?
>
I do see some messages in the logs, but the timestamps show they were
logged at boot.
daemon.log:Nov 5 14:34:45 death-magnetic iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)
daemon.log:Nov 5 14:34:49 death-magnetic iscsid: connect failed (113)
daemon.log:Nov 5 14:34:55 death-magnetic iscsid: connect failed (113)
daemon.log:Nov 5 14:34:59 death-magnetic iscsid: connect failed (113)
daemon.log:Nov 5 14:35:05 death-magnetic iscsid: connect failed (113)
daemon.log:Nov 5 14:35:12 death-magnetic iscsid: connect failed (113)
daemon.log:Nov 5 14:35:16 death-magnetic iscsid: connection1:0 is operational after recovery (6 attempts)
# date ; uptime
Fri Nov 6 20:29:11 CET 2009
20:29:11 up 1 day, 5:55, 1 user, load average: 0.00, 0.00, 0.00
So the system came up at about 14:34 on Nov 5, which means the messages
above were logged at boot.
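The boot time follows from subtracting the uptime from the date output above. A minimal sketch of that arithmetic, using only Python's standard datetime module (the function name boot_time is just for illustration; uptime reports hours and minutes only, so the result is accurate to within a minute):

```python
from datetime import datetime, timedelta


def boot_time(now, uptime):
    """Estimate when the system booted from the current time and its uptime."""
    return now - uptime


# Values copied from the "date ; uptime" output above
now = datetime(2009, 11, 6, 20, 29, 11)
uptime = timedelta(days=1, hours=5, minutes=55)

print(boot_time(now, uptime))  # 2009-11-05 14:34:11
```

That lands about half a minute before the first iscsid message at 14:34:45.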
> How do you know the disk is not read/writable?
>
> Are you doing failover with the MD3000i target?
No failover.
>
> What values are you using for the noops and replacement_timeout?
>
node.session.timeo.replacement_timeout = 120
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5
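To put these values in context, here is a rough sketch that compares the recovery window from the daemon.log excerpt above (first error at 14:34:45, operational again at 14:35:16) with the replacement_timeout. It assumes replacement_timeout is the number of seconds the initiator waits for the session to come back before failing commands upward; the helper seconds_between is hypothetical and does not handle a window that crosses midnight.

```python
from datetime import datetime

# node.session.timeo.replacement_timeout from the settings above, in seconds
REPLACEMENT_TIMEOUT = 120


def seconds_between(start, end, fmt="%H:%M:%S"):
    """Seconds elapsed between two same-day timestamps given as HH:MM:SS."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


# Timestamps taken from the first and last iscsid messages in daemon.log
outage = seconds_between("14:34:45", "14:35:16")
print(outage, outage < REPLACEMENT_TIMEOUT)  # 31.0 True
```

Under that assumption, a 31 second outage is well inside the 120 second window, so IO queued during it would be expected to be retried rather than failed.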
--
Vinay S Shastry
Consultant
Nidelven IT Ltd
Phone: +91 98866 57877
Email: sha...@nidelven-it.no
Mike Christie wrote, On 07/11/09 2:36 AM:
> Vinay S Shastry wrote:
>>> It could be. Is there anything in /var/log/messages when the reboot
>>> occurs? Do you see something about a conn error or ping/nop timing out
>>> or something about a host reset failing?
>>>
>>
>> I do see some in the logs, but the time says it happened at boot.
>>
>> daemon.log:Nov 5 14:34:45 death-magnetic iscsid: Kernel reported iSCSI
>> connection 1:0 error (1011) state (3)
>> daemon.log:Nov 5 14:34:49 death-magnetic iscsid: connect failed (113)
>> daemon.log:Nov 5 14:34:55 death-magnetic iscsid: connect failed (113)
>> daemon.log:Nov 5 14:34:59 death-magnetic iscsid: connect failed (113)
>> daemon.log:Nov 5 14:35:05 death-magnetic iscsid: connect failed (113)
>> daemon.log:Nov 5 14:35:12 death-magnetic iscsid: connect failed (113)
>> daemon.log:Nov 5 14:35:16 death-magnetic iscsid: connection1:0 is
>> operational after recovery (6 attempts)
>>
>>
>>
>> # date ; uptime
>> Fri Nov 6 20:29:11 CET 2009
>> 20:29:11 up 1 day, 5:55, 1 user, load average: 0.00, 0.00, 0.00
>>
>> So that means the system was up at 14:34 and the above was logged at
>> boot.
>>
>
> Is 14:35:16 when the system hangs? Is there anything else suspicious in
> the log? Some IO/request errors, maybe?
>
No, Mike, the system booted at 14:34:11.
Those messages were logged at boot, probably while the connection was
being initiated.
I was unable to find any errors in the logs.
> It looks like we hit some problem with the transport connection at
> 14:34:45. We try to reconnect to the target but we get the 113 error
> from the network layer when we try to connect. 113 is "No route to host."
>
> However, it looks like we reconnect about 30 secs later and then there
> is no IO error, and so if there was IO running at the time it should
> have been completed ok (at least no IO errors to indicate there was a
> problem).
>
> So it looks like we handled some transient problem ok.
--
Vinay S Shastry
Consultant
Nidelven IT Ltd
Phone: +91 98866 57877
Email: sha...@nidelven-it.no
What target are you using? Do you see anything in the target log?