We have open-iscsi 2.0.870~rc3-0.4 (Debian) running on a couple of servers.
One of these servers has had frequent unexplained reboots, and on another I saw a hard freeze due to the sdb/sdc device becoming unwritable/unreadable.
Could open-iscsi be the culprit here? We've had a number of other servers from the same vendor (not connected to a SAN via open-iscsi) and have not seen unexplained reboots like this before.
If so, is there hope that this could be fixed in a short amount of time, or do we need to consider dumping (selling) the SAN and going for regular hard drives instead?
> we have some servers running open-iscsi 2.0.870~rc3-0.4 (Debian) on
Is this the open-iscsi that comes with Debian, or is it an open-iscsi.org release of 870-rc3?
> a couple of servers.
> One of these servers has been frequent unexplained reboots, and another
> I saw had a hard freeze due to the sdb/sdc device becoming
> unwritable/unreadable.
> Could open-iscsi be the culprit here?
It could be. Is there anything in /var/log/messages when the reboot occurs? Do you see something about a conn error or ping/nop timing out, or something about a host reset failing?
How do you know the disk is not read/writable?
Are you doing failover with the MD3000i target?
What values are you using for the noops and replacement_timeout?
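For anyone following along, here is a rough sketch of one way to collect those values. It assumes the settings live in an iscsid.conf-style file (typically /etc/iscsi/iscsid.conf) under the key names used by the stock open-iscsi configuration. The helper name `read_timeouts` and the sample values are made up for illustration and are not taken from the poster's setup.

```python
# Keys for the nop-out (ping) and replacement timeouts, as named in the
# stock open-iscsi iscsid.conf. Treat the list as an assumption if your
# packaged config differs.
TIMEOUT_KEYS = (
    "node.conn[0].timeo.noop_out_interval",
    "node.conn[0].timeo.noop_out_timeout",
    "node.session.timeo.replacement_timeout",
)


def read_timeouts(conf_text):
    """Return {key: int} for the timeout keys present in the given text."""
    found = {}
    for line in conf_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key in TIMEOUT_KEYS:
            found[key] = int(value)
    return found


# Hypothetical excerpt; the numbers are illustrative only.
SAMPLE = """\
# timeouts
node.session.timeo.replacement_timeout = 120
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5
"""

if __name__ == "__main__":
    for key, value in read_timeouts(SAMPLE).items():
        print(key, "=", value)
```

To use it on a real machine, read the config file into a string and pass it to `read_timeouts` in place of `SAMPLE`. Settings stored per node may override the file defaults, so this only shows the defaults.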
Vinay S Shastry wrote:
>> It could be. Is there anything in /var/log/messages when the reboot
>> occurs? Do you see something about a conn error or ping/nop timing out
>> or something about a host reset failing?
> I do see some in the logs, but the time says it happened at boot.
> # date ; uptime
> Fri Nov 6 20:29:11 CET 2009
> 20:29:11 up 1 day, 5:55, 1 user, load average: 0.00, 0.00, 0.00
> So that means the system was up at 14:34 and the above was logged at boot.
Is 14:35:16 when the system hangs? Is there anything else suspicious in the log? Some IO/request errors, maybe?
It looks like we hit some problem with the transport connection at 14:34:45. We try to reconnect to the target but we get the 113 error from the network layer when we try to connect. 113 is "No route to host."
However, it looks like we reconnect about 30 secs later and then there is no IO error, and so if there was IO running at the time it should have been completed ok (at least no IO errors to indicate there was a problem).
So it looks like we handled some transient problem ok.
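As a side note on the 113 mentioned above: the mapping from an errno number to its name and message can be checked with Python's standard library. This is only a sketch. The helper name `describe_errno` is made up here, and errno numbering is platform-specific, so 113 should only be expected to mean EHOSTUNREACH on Linux.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # 113 is the value quoted in the thread; its meaning depends on the OS.
    print(describe_errno(113))
    # The numeric value of EHOSTUNREACH on the machine running this script.
    print(errno.EHOSTUNREACH)
```

Run on the initiator host itself, this confirms what the network layer meant by the error code in the log.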
>> we have some servers running open-iscsi 2.0.870~rc3-0.4 (Debian) on
> Is this open-iscsi that comes with debian or is that a open-iscsi.org
> release of 870-rc3?
This is the Debian package.
>> a couple of servers.
>> One of these servers has been frequent unexplained reboots, and another
>> I saw had a hard freeze due to the sdb/sdc device becoming
>> unwritable/unreadable.
>> Could open-iscsi be the culprit here?
> It could be. Is there anything in /var/log/messages when the reboot
> occurs? Do you see something about a conn error or ping/nop timing out
> or something about a host reset failing?
I do see some in the logs, but the time says it happened at boot.
> Vinay S Shastry wrote:
>>> It could be. Is there anything in /var/log/messages when the reboot
>>> occurs? Do you see something about a conn error or ping/nop timing out
>>> or something about a host reset failing?
>> I do see some in the logs, but the time says it happened at boot.
>> # date ; uptime
>> Fri Nov 6 20:29:11 CET 2009
>> 20:29:11 up 1 day, 5:55, 1 user, load average: 0.00, 0.00, 0.00
>> So that means the system was up at 14:34 and the above was logged at
>> boot.
> Is 14:35:16 when the system hangs? Is there anything else suspicious in
> the log? Some IO/request errors, maybe?
No, Mike, the system booted at 14:34:11. Those messages were logged at boot, probably while the connection was being initiated.
I was unable to find any log with any errors.
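The boot time comes from subtracting the uptime from the `date` output quoted above. A small sketch of that arithmetic follows. `uptime` only reports hours and minutes, so the seconds in the result are an approximation, and the timezone (CET) is ignored here.

```python
from datetime import datetime, timedelta

# Values copied from the quoted `date ; uptime` output.
now = datetime(2009, 11, 6, 20, 29, 11)           # Fri Nov 6 20:29:11 CET 2009
uptime = timedelta(days=1, hours=5, minutes=55)   # "up 1 day, 5:55"

# Approximate boot time: current time minus uptime.
boot_time = now - uptime

if __name__ == "__main__":
    print(boot_time.isoformat(sep=" "))
```

This puts the boot on the previous day at about 14:34, matching the timestamps of the connection messages in the log.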
> It looks like we hit some problem with the transport connection at
> 14:34:45. We try to reconnect to the target but we get the 113 error
> from the network layer when we try to connect. 113 is "No route to host."
> However, it looks like we reconnect about 30 secs later and then there
> is no IO error, and so if there was IO running at the time it should
> have been completed ok (at least no IO errors to indicate there was a
> problem).
> So it looks like we handled some transient problem ok.
--
Vinay S Shastry
Consultant
Nidelven IT Ltd
Phone: +91 98866 57877
Email: shas...@nidelven-it.no
Vinay S Shastry wrote:
> Mike Christie wrote, On 07/11/09 2:36 AM:
>> Vinay S Shastry wrote:
>>>> It could be. Is there anything in /var/log/messages when the reboot
>>>> occurs? Do you see something about a conn error or ping/nop timing out
>>>> or something about a host reset failing?
>>> I do see some in the logs, but the time says it happened at boot.