I have a 2 node cluster (Lenny amd64, ganeti 2.0.1), node1 (master) &
node2 (nominated secondary for the master).
I am currently testing a masterfailover situation.
Say node1 has just crashed or lost its network and is effectively dead.
I have a single DRBD/HA instance running on this node, which is now down.
(This instance can be failed over to the secondary, when the master is up)
Now I want to promote node2 to be the master of the cluster, so that
the ganeti-watcher script can run and start my instance again.
# node2:~# gnt-cluster masterfailover
# Failure: prerequisites not met for this operation:
# Cluster is inconsistent, most nodes did not respond.
This appears to be at odds with the documentation in
http://ganeti-doc.googlecode.com/svn/ganeti-2.0/admin.html#failing-over-the-master-node
--snip--
Failing over the master node
This is all good as long as the Ganeti Master Node is up. Should it go
down, or should you wish to decommission it, just run on any other
node the command:
gnt-cluster masterfailover
and the node you ran it on is now the new master.
--/snip--
Is our interpretation of the docs correct, or have we got this wrong?
And the cluster master role can only be failed over when the master
node is up?
Or do we have some other issues here, with our ganeti install?
Instance failovers work, as do the gnt-instance commands and all
other gnt-cluster commands required to configure the cluster.
Regards,
Paul De Audney
I guess you may encounter side issues when you only have 2 nodes since
in this case "most nodes" is 1 :)
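The arithmetic behind that remark can be shown with a short sketch. This is a simplified illustration, not Ganeti's actual vote-counting code: the prerequisite check wants a strict majority of all cluster nodes to respond, so on a 2-node cluster with the old master down, the single surviving node can never reach quorum.

```python
# Simplified illustration of majority voting on small clusters.
# NOT Ganeti's actual implementation -- just the arithmetic: a strict
# majority of all cluster nodes must respond for the check to pass.

def has_majority(total_nodes, responding_nodes):
    """True if strictly more than half of all nodes responded."""
    return responding_nodes > total_nodes // 2

# 2-node cluster, old master down: only the local node responds.
print(has_majority(2, 1))   # False: "most nodes did not respond"

# 3-node cluster, old master down: two nodes can still respond.
print(has_majority(3, 2))   # True
```

This is why 3 is often quoted as the smallest cluster size that can survive a master failure without manual intervention.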
What does "gnt-cluster verify" say?
> And the cluster master role can only be failed over when the master
> node is up?
I am pretty sure I already failed over a master when it was down and
unreachable; it just takes time for the RPC calls to timeout. The
source code (see lib/bootstrap.py::MasterFailover() definition)
implies that if the old master is unreachable, a warning will be
displayed and the command will continue.
308 if not rpc.call_node_stop_master(old_master, True):
309 logging.error("could disable the master role on the old master"
310 " %s, please disable manually", old_master)
(Side note: the error message should say "could *not* disable")
Your error message implies you are not even reaching this part of the
code. Is node2 marked as a master candidate? (gnt-node list -o
+master_candidate) If not, mark it as such with gnt-node modify -C yes
node2. Of course, this should be done before your master crashes, but
I guess you are just testing now :)
--
olive
Yes, this is perfectly normal, but if you look at masterfailover --help
there should be a force option to allow you to fail over the role
without a vote being called!
>> And the cluster master role can only be failed over when the master
>> node is up?
No, of course, but on very small clusters you have to force the operation! :)
Thanks,
Guido
Here is what gnt-cluster verify says when both nodes are up, run from
the master:
Thu Jul 2 18:45:38 2009 * Verifying global settings
Thu Jul 2 18:45:38 2009 * Gathering data (2 nodes)
Thu Jul 2 18:45:39 2009 * Verifying node node2 (master candidate)
Thu Jul 2 18:45:39 2009 * Verifying node node1 (master)
Thu Jul 2 18:45:39 2009 * Verifying instance garboard.anchor.net.au
Thu Jul 2 18:45:39 2009 * Verifying orphan volumes
Thu Jul 2 18:45:39 2009 - ERROR: volume root on node node2 should not exist
Thu Jul 2 18:45:39 2009 - ERROR: volume swap on node node2 should not exist
Thu Jul 2 18:45:39 2009 - ERROR: volume root on node node1 should not exist
Thu Jul 2 18:45:39 2009 - ERROR: volume swap on node node1 should not exist
Thu Jul 2 18:45:39 2009 * Verifying remaining instances
Thu Jul 2 18:45:39 2009 * Verifying N+1 Memory redundancy
Thu Jul 2 18:45:39 2009 * Other Notes
Thu Jul 2 18:45:39 2009 * Hooks Results
> I am pretty sure I already failed over a master when it was down and
> unreachable; it just takes time for the RPC calls to timeout. The
> source code (see lib/bootstrap.py::MasterFailover() definition)
> implies that if the old master is unreachable, a warning will be
> displayed and the command will continue.
Hmm, ok.
As seen in the verify output above, node2 is indeed a master candidate,
i.e. nominated to take over the master role.
Here is the command run with -d from node2, while node1 is down:
node2:~# gnt-cluster masterfailover -d
2009-07-02 17:35:01,577: gnt-cluster masterfailover pid=9632 cli:739
INFO run with arguments '-d'
2009-07-02 17:35:01,578: gnt-cluster masterfailover pid=9632
workerpool:263 DEBUG Resizing to 10 workers
2009-07-02 17:35:01,616: gnt-cluster masterfailover pid=9632
workerpool:91 DEBUG Worker 1: waiting for tasks
-- snip lots of worker stuff/spam --
2009-07-02 17:35:01,620: gnt-cluster masterfailover pid=9632
workerpool:96 DEBUG Worker 1: notified while waiting
2009-07-02 17:35:01,621: gnt-cluster masterfailover pid=9632
workerpool:117 DEBUG Worker 1: starting task
(<ganeti.http.client._HttpClientPendingRequest object at 0x139bfd0>,)
2009-07-02 17:35:04,624: gnt-cluster masterfailover pid=9632
workerpool:120 DEBUG Worker 1: done with task
(<ganeti.http.client._HttpClientPendingRequest object at 0x139bfd0>,)
2009-07-02 17:35:04,624: gnt-cluster masterfailover pid=9632 rpc:242
ERROR RPC error in master_info from node node1: Connection failed
(113: No route to host)
2009-07-02 17:35:04,625: gnt-cluster masterfailover pid=9632
workerpool:91 DEBUG Worker 1: waiting for tasks
2009-07-02 17:35:04,625: gnt-cluster masterfailover pid=9632
workerpool:331 DEBUG Terminating all workers
2009-07-02 17:35:04,626: gnt-cluster masterfailover pid=9632
workerpool:263 DEBUG Resizing to 0 workers
2009-07-02 17:35:04,626: gnt-cluster masterfailover pid=9632
workerpool:96 DEBUG Worker 4: notified while waiting
--snip lots of worker stuff/spam --
2009-07-02 17:35:04,634: gnt-cluster masterfailover pid=9632
workerpool:290 DEBUG Waiting for thread Thread-9
2009-07-02 17:35:04,634: gnt-cluster masterfailover pid=9632
workerpool:290 DEBUG Waiting for thread Thread-10
2009-07-02 17:35:04,635: gnt-cluster masterfailover pid=9632
workerpool:342 DEBUG All workers terminated
2009-07-02 17:35:04,635: gnt-cluster masterfailover pid=9632 cli:748
ERROR Error durring command processing
Traceback (most recent call last):
File "/var/lib/python-support/python2.5/ganeti/cli.py", line 744, in
GenericMain
result = func(options, args)
File "/var/lib/python-support/python2.5/ganeti/cli.py", line 416, in wrapper
return fn(*args, **kwargs)
File "/usr/sbin/gnt-cluster", line 427, in MasterFailover
return bootstrap.MasterFailover()
File "/var/lib/python-support/python2.5/ganeti/bootstrap.py", line
409, in MasterFailover
raise errors.OpPrereqError("Cluster is inconsistent, most nodes did not"
OpPrereqError: Cluster is inconsistent, most nodes did not respond.
Failure: prerequisites not met for this operation:
Cluster is inconsistent, most nodes did not respond.
Paul
> Yes, this is perfectly normal, but if you look at masterfailover --help
> there should be a force option to allow you to failover the role
> without a vote being called!
Okay, I am not seeing that option at all. I am using 2.0.1 and checked git here
http://git.ganeti.org/?p=ganeti.git;a=blob_plain;f=scripts/gnt-cluster;hb=HEAD
--help did not print any force option either, and attempts to use
--force fail with
gnt-cluster: error: no such option: --force
Can you confirm in which version I should be seeing a --force option?
Paul
Uhm, I think my memory failed, and we just have the force option to
start the master but not to fail it over.
I believe this is a bug, and we can try to target a fix at 2.0.2 or
2.0.3... In the meantime you could poke a bit at
bootstrap.MasterFailover to override the check, since you only have one
other node, and of course it cannot provide an answer (basically
temporarily commenting out lines 417 to 426)... Also please
file an issue in the bug tracker for us to look at.
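For reference, the check being discussed has roughly this shape. This is a hedged sketch reconstructed from the traceback earlier in the thread, not the exact bootstrap.py source; the no_voting flag here is hypothetical and models the force behaviour requested in this thread (2.0.1 has no such option on masterfailover).

```python
# Hedged sketch of the prerequisite check discussed above, reconstructed
# from the traceback in this thread -- not the exact Ganeti source.

class OpPrereqError(Exception):
    """Raised when an operation's prerequisites are not met."""

def check_master_votes(positive_votes, total_nodes, no_voting=False):
    # 'no_voting' models the force behaviour requested in this thread;
    # it is hypothetical for 2.0.1, which has no such flag.
    if no_voting:
        return
    if positive_votes <= total_nodes // 2:
        raise OpPrereqError("Cluster is inconsistent, most nodes did not"
                            " respond.")

# On a 2-node cluster with the master down, only one vote arrives:
try:
    check_master_votes(1, 2)
except OpPrereqError as err:
    print(err)  # Cluster is inconsistent, most nodes did not respond.

# A hypothetical bypass flag would skip the vote entirely:
check_master_votes(1, 2, no_voting=True)
```

Commenting out the raise, as suggested above, has the same effect as the hypothetical flag, but only do that temporarily and on the node you actually want to promote.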
Thanks,
Guido
Yep, indeed, we only have “ganeti-masterd --no-voting”, but no force
option on failover. However, my recollection was that we already handle
failover correctly (without the flag) on a 2-node cluster, which seems
not to be the case…
iustin
Possibly we handle it correctly "if the second node is up"?
Guido
You're probably right, but in that case, it's not a special cluster. 2-node
clusters are special only when one node is down…
iustin