How to automatically reset a Linux server on hard disk failure?

Frank Langner

Jun 29, 2009, 4:47:13 AM

Hi,

The system is a 2-node cluster based on OES2 Linux (SP0). While testing
the cluster failover functions I found that a cluster node stays alive
when the local file system is unavailable (e.g. all local hard disks
have failed). This means that cluster services are still running on the
machine, but the resources in fact become unusable (clustered NSS
volume etc.).

The node does *not* die and therefore the cluster resources do *not*
fail over to a surviving node!

How can I solve this problem? My first idea is to have some kind of
service that checks the local file system and resets the server if it is
unavailable. Resetting the server would cause the cluster resources to
fail over.

Is such a service available, or what other methods exist to solve the
problem?

Thanks in advance!

Nico Kadel-Garcia

Jun 29, 2009, 5:16:53 AM

Nagios? I personally detest clustering software: my peers and I wind
up having to write userland monitors and switchover tools anyway,
because components of the clustering tools are non-functional or
missing. So why waste all the money, support difficulty and computation
power on clustering when we can use normal high-availability and
failover techniques, such as the old "whackamole" program and MySQL
master/slave setups?

> Thanks in advance!

Frank Langner

Jun 30, 2009, 8:20:18 AM

> Nagios? I personally detest clustering software: my peers and I wind

I don't think that Nagios would work. It relies on script files, and when
the local file system is gone it will not fire these scripts. What I
have in mind is a tool that is loaded into memory on startup and does not
depend on any additional files (therefore a cron job will not work
either). It just polls the local file system and causes a reset or halt
(in this case the server's management card will reset the server) if the
file system is not available anymore.

The Natural Philosopher

Jun 30, 2009, 8:45:53 AM

Frank Langner wrote:
>> Nagios? I personally detest clustering software: my peers and I wind
>
> I don't think that Nagios would work. It relies on script files, and when
> the local file system is gone it will not fire these scripts. What I
> have in mind is a tool that is loaded into memory on startup

BUT you can't actually guarantee it will stay there and not get paged out.

> and does not
> depend on any additional files (therefore a cron job will not work
> either). It just polls the local file system and causes a reset or halt
> (in this case the server's management card will reset the server) if the
> file system is not available anymore.

What you actually need is a custom modified kernel, with a disk driver
that executes a full authority JMP REBOOT if it detects failure.

Kernels are always resident in RAM IIRC. And device drivers that take
interrupts HAVE to be.

That is actually quite a simple way out of the problem, if that is what
you want.

However, a machine that loses its disk drive is not, in my opinion, a
machine that is configured correctly. I have never had such a thing
other than with gross hardware faults - which rebooting didn't fix - or
gross software bugs in the driver (RAID controller) that eventually
caused us to send the whole unit back and get a refund.

John Hasler

Jun 30, 2009, 9:17:00 AM

The Natural Philosopher writes:
> BUT you can't actually guarantee it will stay there and not get paged
> out.

man mlockall

> However, a machine that loses its disk drive is not in my opinion a
> machine that is configured correctly.

Right.
--
John Hasler
jo...@dhh.gt.org
Dancing Horse Hill
Elmwood, WI USA

The Natural Philosopher

Jun 30, 2009, 9:28:02 PM

John Hasler wrote:
> The Natural Philosopher writes:
>> BUT you can't actually guarantee it will stay there and not get paged
>> out.
>
> man mlockall
>

# man mlockall
No manual entry for mlockall

Mmm. That sounds interesting. I'll have to see what it does.

John Hasler

Jun 30, 2009, 9:55:46 PM

The Natural Philosopher writes:
> No manual entry for mlockall

You must have the appropriate development package installed, of course. In
Debian that's manpages-dev.
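
For reference, calling mlockall(2) from a long-running process takes only
a few lines. The following is a minimal sketch in Python using ctypes,
not a finished tool. The function name lock_all_memory and its injectable
libc parameter are made up for illustration, and the flag values are the
ones from <sys/mman.h> on Linux x86, which is an assumption about the
target machine.

```python
import ctypes
import os

# Flag values from <sys/mman.h> on Linux x86/x86-64.
# Assumption: other architectures may define different values.
MCL_CURRENT = 1  # lock the pages that are mapped right now
MCL_FUTURE = 2   # also lock pages that get mapped later


def lock_all_memory(libc=None):
    """Ask the kernel to keep all pages of this process in RAM.

    Returns True on success and raises OSError on failure. The libc
    argument exists only so that a stub can be passed in for testing.
    """
    if libc is None:
        # dlopen(NULL): resolve mlockall from the already loaded C library.
        libc = ctypes.CDLL(None, use_errno=True)
    if libc.mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return True
```

According to the mlockall(2) man page, memory locks are not inherited by
a child created with fork and are removed on execve, so the call has to
be made inside the process that needs to stay resident. It usually also
needs root (CAP_IPC_LOCK) or a large enough RLIMIT_MEMLOCK.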

Frank Langner

Jul 1, 2009, 6:29:49 AM

> What you actually need is a custom modified kernel, with a disk driver
> that executes a full authority JMP REBOOT if it detects failure.
>
> Kernels are always resident in RAM IIRC. And device drivers that take
> interrupts HAVE to be.
>
> That is actually quite a simple way out of the problem, if that is what
> you want.

Sounds good. But how do I get such a driver? I cannot write it on my
own. :-) I'm hoping that I am not the first one asking for such
functionality.

>
> However, a machine that loses its disk drive is not, in my opinion, a
> machine that is configured correctly. I have never had such a thing
> other than with gross hardware faults - which rebooting didn't fix - or
> gross software bugs in the driver (RAID controller) that eventually
> caused us to send the whole unit back and get a refund.

That's not the background of my problem. The server is fine and runs
well. But I want it to be prepared for the worst case, which means the
customer is running the server until both hard disks of its RAID1 are
defective. In this case the other cluster node has to migrate all
resources to itself, but right now the failed node is still alive enough
to make the cluster think that everything is ok.

Please let us not argue about the fact that no reasonable IT department
or service personnel would let it come so far. Agreed. But reality has
shown us more than once that some people in fact wait until the system
is completely broken before taking action. Therefore I want to harden
the cluster for such circumstances.

Keith Keller

Jul 1, 2009, 2:23:57 PM

On 2009-07-01, Frank Langner <f.nospa...@isam-ag.de> wrote:
>
> That's not the background of my problem. The server is fine and runs
> well.

Okay, I think this was the root of people's confusion; they thought you
were already having problems.

> But I want it to be prepared for the worst case, which means the
> customer is running the server until both hard disks of its RAID1 are
> defective. In this case the other cluster node has to migrate all
> resources to itself, but right now the failed node is still alive enough
> to make the cluster think that everything is ok.

What you might want is something Nagios-like (if not Nagios itself)
that monitors the RAID itself. If you have a hardware RAID controller,
it probably has a way of returning the status of the array. If the
array fails for some reason, you can have the still-functional node
take over. Without knowing how the node determines that it needs to
take over, it's hard to recommend an exact route. If it takes over
whenever a ping fails, then one very crude way to go would be to write
a script which monitors the other node's RAID; if it detects a failure,
use iptables or some such to drop all packets (including ICMP!) from
the dead host.

Another option would be to connect the nodes to some sort of switched
PDU. When a RAID failure is detected, the still-working node asks the
PDU to power off the outlet where the dead node is plugged in. (That
may not really be what you want, if you're afraid that an abrupt
power-off will have an adverse effect on already-broken hardware.)

Also, you might consider putting more disks into the RAID1. That's more
money into hardware, but if this thing really has to be ultra-reliable
you get what you pay for. No, that won't cover the scenario when all N
disks fail before the client tells you. :)

If you can provide more details about how the failover works, I or
someone else may be able to provide more suggestions on failure
detection.

> Please let us not argue about the fact that no reasonable IT department
> or service personnel would let it come so far. Agreed. But reality has
> shown us more than once that some people in fact wait until the system
> is completely broken before taking action. Therefore I want to harden
> the cluster for such circumstances.

Too true. ;-/

If this is an ongoing support contract, you might consider using Nagios
and having it send you messages about failures, instead of your client.
This way you know about them before the client does, and can be
proactive about repairing them. (And if there isn't a support contract,
well, there's not much you can do to solve a client's incompetence.)

--keith

--
kkeller...@wombat.san-francisco.ca.us
(try just my userid to email me)
AOLSFAQ=http://www.therockgarden.ca/aolsfaq.txt
see X- headers for PGP signature information

The Natural Philosopher

Jul 1, 2009, 3:00:33 PM

John Hasler wrote:
> The Natural Philosopher writes:
>> No manual entry for mlockall
>
> You must have the appropriate development package installed, of course. In
> Debian that's manpages-dev.

John, many thanks for that.

However, I had expected a user-level command rather than a C-level
function. Still, it might come in handy one day.

The Natural Philosopher

Jul 1, 2009, 3:24:29 PM

Frank Langner wrote:
>> What you actually need is a custom modified kernel, with a disk driver
>> that executes a full authority JMP REBOOT if it detects failure.
>>
>> Kernels are always resident in RAM IIRC. And device drivers that take
>> interrupts HAVE to be.
>>
>> That is actually quite a simple way out of the problem, if that is what
>> you want.
>
> Sounds good. But how to get such a driver, because I can not write it on
> my own. :-) I'm hoping that I might not be the first one asking for such
> a functionality.

Oh, I think you are. The rest of us found it easier to diagnose and fix
disk drive problems than to write custom code to kludge around them. ;-)

I have done that sort of thing occasionally, very badly, to help in
identifying problems when working very near hardware, but I certainly
wouldn't recommend it as an approach in any user environment.

>> However, a machine that loses its disk drive is not, in my opinion, a
>> machine that is configured correctly. I have never had such a thing
>> other than with gross hardware faults - which rebooting didn't fix - or
>> gross software bugs in the driver (RAID controller) that eventually
>> caused us to send the whole unit back and get a refund.
>
> That's not the background of my problem. The server is fine and runs
> well. But I want it to be prepared for the worst case, which means the
> customer is running the server until both hard disks of its RAID1 are
> defective. In this case the other cluster node has to migrate all
> resources to itself, but right now the failed node is still alive enough
> to make the cluster think that everything is ok.
>

Ah. In which case rebooting is not what you want, you want to shut it
down completely.

> Please let us not argue about the fact that no reasonable IT department
> or service personnel would let it come so far. Agreed. But reality has
> shown us more than once that some people in fact wait until the system
> is completely broken before taking action. Therefore I want to harden
> the cluster for such circumstances.
>

OK. Now that the context is clear, it seems more reasonable.

I am no cluster expert. But I would have thought that perhaps you should
be doing something like this:

First of all, use a third, non-RAID disk for swap. That solves the
problem of not being able to page stuff back in if the main 'disk' array
goes bad.

Then write a watchdog timer, and incorporate the relevant bits from reboot.c:

http://www.gelato.unsw.edu.au/lxr/source/arch/i386/kernel/reboot.c

It's a long time since I wrote a daemon, but it goes something like this.

Fork a process and exit the main program. That puts the daemon in the
background. I can't remember what happens to stdout and stderr.

Set up an infinite loop using long-period sleeps that occasionally wakes up.

Try to read or write the raw disk (so as to evade RAM caches).
On failure, execute the shutdown code. Or shut down whatever is needed to
sort out the cluster dynamics (Ethernet?).

This ain't a simple script, but it's not an impossible program to write.

Provided you have a separate swap disk, that will run even if the main
disk goes down. Or maybe use mlockall, as John said, to keep the little
watchdog in memory.
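
The steps above can be sketched roughly as follows. This is a minimal
sketch under stated assumptions, not a hardened tool: it assumes Linux,
Python 3.7 or later (for os.preadv), a process running as root, and a
kernel built with magic SysRq support. All function names are
hypothetical. It triggers the reboot through /proc/sysrq-trigger rather
than borrowing code from reboot.c, and redirects stdout and stderr to
/dev/null when daemonizing. O_DIRECT bypasses the kernel page cache, but
a RAID controller's own cache might still answer the read.

```python
import mmap
import os
import time

BLOCK = 4096  # read size; assumed to be a multiple of the device sector size


def disk_readable(path, direct=True):
    """Return True if one block can be read from path, else False."""
    # O_DIRECT makes the read go to the device instead of the page cache.
    flags = os.O_RDONLY | (os.O_DIRECT if direct else 0)
    try:
        fd = os.open(path, flags)
    except OSError:
        return False
    try:
        # O_DIRECT needs an aligned buffer; anonymous mmap memory is page-aligned.
        with mmap.mmap(-1, BLOCK) as buf:
            return os.preadv(fd, [buf], 0) > 0
    except OSError:
        return False
    finally:
        os.close(fd)


def force_reboot():
    """Reboot at once via magic SysRq, without syncing or unmounting."""
    # Writing "o" instead of "b" would power the machine off.
    with open("/proc/sysrq-trigger", "w") as trigger:
        trigger.write("b")


def daemonize():
    """Fork, let the parent exit, and detach stdin/stdout/stderr."""
    if os.fork() > 0:
        os._exit(0)
    os.setsid()
    devnull = os.open(os.devnull, os.O_RDWR)
    for fd in (0, 1, 2):
        os.dup2(devnull, fd)


def watch(path, interval=10.0, failures_allowed=3,
          check=disk_readable, on_failure=force_reboot, sleep=time.sleep):
    """Poll path; call on_failure after failures_allowed failed checks in a row."""
    failures = 0
    while True:
        failures = 0 if check(path) else failures + 1
        if failures >= failures_allowed:
            on_failure()
            return
        sleep(interval)
```

A caller would run daemonize(), lock the process into memory, and then
call watch() on the raw device, for example watch("/dev/sda"), where the
device name is only a placeholder. Requiring several consecutive failures
is a design choice to avoid rebooting on a single transient read error.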

John Hasler

Jul 1, 2009, 4:39:56 PM

The Natural Philosopher writes:
> However, I had expected a user-level command [for mlockall] rather than
> a C-level function. Still, it might come in handy one day.

There doesn't seem to be one, though it shouldn't be hard to write one
along the lines of schedtool (or perhaps add memory locking functionality
to schedtool).