Two nodes cluster - voting issue


John N.

Jul 29, 2015, 12:08:38 PM7/29/15
to ganeti
Hello,

On my ganeti 2.12 (Debian 8) test cluster I have two nodes, and for some reason, if only one of the nodes is available, I can't use any gnt-* commands. I see the following error message:

Jul 27 11:41:00 node1a ganeti[1526]: Error in the RPC HTTP reply from 'Node {nodeName = "node1b.domain.tld", nodePrimaryIp = "192.168.10.133", nodeSecondaryIp = "10.1.0.2", nodeMasterCandidate = True, nodeOffline = False, nodeDrained = False, nodeGroup = "e6d1810f-e9f2-4599-a8f5-3e395129d3e1", nodeMasterCapable = True,
Jul 27 11:41:00 node1a ganeti[1526]: No voting RPC result from ["node1b.domain.tld"]

I think ganeti refuses to start if one of the nodes is missing. This wasn't the case with my older ganeti 2.9 cluster, if I remember correctly. Any ideas what changed?

Regards
J.

Klaus Aehlig

Jul 29, 2015, 12:33:04 PM7/29/15
to gan...@googlegroups.com

Hello,

> I think ganeti refuses to start if one of the nodes is missing. This wasn't
> the cast with my previous older ganeti 2.9 cluster if I remember correctly.
> Any ideas what changed?

Ganeti master-node daemons have always refused to start unless more than half
of all nodes are available and confirm the master status. For a two-node
cluster, this means both nodes must be present. The only exception is if
the --no-voting option is given (which requires --yes-do-it).
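The "more than half" rule can be illustrated with a tiny arithmetic sketch (plain Python for illustration only, not Ganeti's actual code):

```python
def quorum(cluster_size: int) -> int:
    """Smallest node count that is strictly more than half the cluster."""
    return cluster_size // 2 + 1

# A two-node cluster needs both nodes to reach a majority,
# which is why losing one node blocks the master daemons.
for n in (1, 2, 3, 5):
    print(f"{n} nodes -> {quorum(n)} votes needed")
```

Note that a three-node cluster already tolerates one node failure, which is what the tie-breaker suggestion later in this thread exploits.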

It is a different situation, however, if the daemon is already running; then
it will continue to run, as the master status is only verified on startup.

This is a problem with two-node setups, as you cannot rely on a healthy
majority to give enough guidance to avoid a split-brain situation once
something breaks.

Regards,
Klaus

--
Klaus Aehlig
Google Germany GmbH, Dienerstr. 12, 80331 Muenchen
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Geschaeftsfuehrer: Graham Law, Christine Elizabeth Flores

John N.

Jul 29, 2015, 1:30:37 PM7/29/15
to ganeti, aeh...@google.com
Hi Klaus,

I am aware that a two-node cluster is not optimal, but this is only a test cluster, and I would like to force ganeti to start on either node even if the other node is not available. As far as I understand, for that purpose I will need to modify the /etc/default/ganeti file to pass the right parameter, but right now I am confused about the following: a) which daemon requires a special parameter (--no-voting?), and b) which parameter should I pass to it?

Here is the current content of my /etc/default/ganeti file:

# Default arguments for Ganeti daemons
NODED_ARGS=""
RAPI_ARGS=""
CONFD_ARGS=""
WCONFD_ARGS=""
LUXID_ARGS=""

Regards
John

Klaus Aehlig

Jul 30, 2015, 4:33:41 AM7/30/15
to John N., ganeti

Hi John,

> I am aware that a two-node cluster is not optimal but it is only for my
> test cluster and I would like to force ganeti to start on both nodes no
> matter if the other second node is not available.

the whole point is: I'm pretty sure you want luxid and wconfd to run only
on one node, the master node. Having two entities believing they are authoritative
for the job queue (luxid) or the configuration (wconfd) is a good way
to corrupt your data.

> As far as I understand
> for that purpose I will need to modify the /etc/default/ganeti file to pass
> the right parameter but right now I am confused to the following a) which
> daemon requires a special parameter (--no-voting?) and b) which parameter
> should I pass it?

It's not that easy, as you don't know in advance which node will be the
surviving one of the next failure. The --no-voting option basically tells
the daemon that you have verified manually that this node is the only one
that is part of the cluster and that the other one is gone (e.g., because
you powered it off).

The way you usually operate a two-node cluster is that you leave the default
options and as long as both nodes are healthy, everything is fine.

If the non-master node dies, you go to the master node and offline
the other node. If the master dies, you go to the other node,
do a 'gnt-cluster master-failover --no-voting', and then offline
the other node. Instead of offlining, you could also remove the node
from the cluster, which has the advantage that you then have a one-node
cluster (and on a one-node cluster, the node's own vote
is a majority).

Should the surviving node be rebooted while your cluster is degraded
to a single live node, you start the central daemons (wconfd, luxid)
manually with --no-voting --yes-do-it.
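Summarizing the recovery procedure above as a shell sketch (node names taken from the original post; this is an illustration, so verify each step against your own cluster and the gnt-node/gnt-cluster man pages before running anything):

```shell
# Case 1: the non-master node (node1b) died.
# On the surviving master (node1a), mark the dead node offline:
gnt-node modify --offline yes node1b.domain.tld

# Case 2: the master (node1a) died.
# On the surviving node (node1b), take over the master role without a vote,
# then offline the dead node:
gnt-cluster master-failover --no-voting
gnt-node modify --offline yes node1a.domain.tld

# Alternatively, remove the dead node entirely, leaving a one-node cluster
# whose single vote is already a majority:
# gnt-node remove node1a.domain.tld
```

The --no-voting failover is the dangerous step: run it only after you have convinced yourself the old master really is down, otherwise you risk the split-brain Klaus describes.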

candlerb

Jul 30, 2015, 5:05:03 AM7/30/15
to ganeti, hosting...@gmail.com, aeh...@google.com
There is a documented approach, which is to run a third ganeti node somewhere purely for voting purposes (--master-capable=yes --vm-capable=no), which can itself be a VM.

I don't think this VM can be a ganeti instance on the same two-node cluster. Consider that after a restart with one node, VMs are restarted by the ganeti master, but the master won't be running until the voting is complete. But you could run the third virtual node as a libvirt/kvm instance, with libvirt set to autostart. Maybe even a docker or lxc container would do the job.

The best solution would be if you could run the virtual node somewhere else in your infrastructure - then either of the two nodes can come up by itself. The third node can have additional value as a remote backup of your config, and somewhere to export instances to.
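Adding such a tie-breaker node might look like the following sketch, using the flags mentioned above (the hostname is made up for illustration; check gnt-node(8) on your version before relying on this):

```shell
# On the current master, register the voting-only VM as a master candidate
# that can never host instances:
gnt-node add --master-capable=yes --vm-capable=no voter.domain.tld
```

With three master-capable nodes, any two form a majority, so either physical node can boot its master daemons while the other is down.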

(As a feature idea, it would be neat if you could statically add a +0.5 vote to a node of your choice. Then in a two-node cluster, at least if that node comes up it can vote itself master)

Regards, Brian.

John N.

Aug 3, 2015, 2:16:16 AM8/3/15
to ganeti, hosting...@gmail.com
Hi Klaus,

Thank you for your exhaustive explanation. It makes sense to have these voting and protection mechanisms. For now I have no choice but to run a two-node test ganeti cluster, so I will just keep in mind to restart wconfd and luxid with --no-voting --yes-do-it if one node should fail. I guess that's all I really need to know on the practical side.

Regards
J.