Can't start instances after reboot


David Sedeño

Oct 8, 2014, 6:18:07 AM
to gan...@googlegroups.com
Hi,

I have a two-node testing cluster and I have run into a situation where the cluster can't start instances.

Background: Ganeti 2.11.5 on Ubuntu Server 14.04

* node1 is master.
* shutdown node1
* master-failover to node2.
* mark node1 as offline
* node2 works normally, I can failover instance and start it.
* reboot node2
* when it comes up again, I can't start any instance: their status is ERROR_down, and any command on node2 gets stuck and queued (as seen in gnt-job list).

* When I start node1 and re-add it to the cluster, everything works as expected and I can start instances.

List of commands to reproduce:

node1# gnt-cluster getmaster
node1.mydomain.com

node2# gnt-cluster getmaster
node1.mydomain.com

node1# shutdown -h now

node2# gnt-cluster getmaster
node1.mydomain.com

node2# gnt-instance list
Failure: prerequisites not met for this operation:
error type: wrong_input, error details:
This is not the master node, please connect to node 'node1.mydomain.com' and rerun the command


Switch master:

node2# gnt-cluster master-failover --no-voting

node2# gnt-cluster getmaster
node2.mydomain.com

Set node1 offline:
node2# gnt-node modify -O yes node1.mydomain.com
[..]
Modified node node1.mydomain.com
 - master_candidate -> False
 - offline -> True


node2# reboot

node2:~# gnt-cluster getmaster
node2.mydomain.com

node2:~# gnt-instance list
Instance    Hypervisor OS                  Primary_node     Status            Memory
couchbase1  kvm        snf-image+default   node2.mydomain.com ERROR_down             -
couchbase2  kvm        snf-image+default   node2.mydomain.com ERROR_down             -
debian      kvm        debootstrap+default node2.mydomain.com ERROR_down             -
debian2     kvm        debootstrap+default node1.mydomain.com ERROR_nodeoffline      *

node2# gnt-instance start couchbase1
Waiting for job 2571 for couchbase1 ...
^CAborted. Note that if the operation created any jobs, they might have been submitted and will continue to run in the background.

Start node1 and add it to the cluster:

node2# gnt-node add --readd node1.mydomain.com
Wed Oct  8 12:15:21 2014  - INFO: Readding a node, the offline/drained flags were reset
Wed Oct  8 12:15:21 2014  - INFO: Node will be a master candidate

node2:~# gnt-instance list
Instance    Hypervisor OS                  Primary_node     Status     Memory
couchbase1  kvm        snf-image+default   node2.mydomain.com running      4.0G

node2:~# gnt-cluster getmaster
node2.mydomain.com

node1:~# gnt-cluster getmaster
node2.mydomain.com


Regards,
--
David Sedeño

Phil Regnauld

Oct 8, 2014, 6:22:46 AM
to gan...@googlegroups.com
David Sedeño (tcoldsf) writes:
> Hi,
>
> I have a two-node testing cluster and I have run into a situation where the
> cluster can't start instances.
>
> Background: Ganeti 2.11.5 in Ubuntu Server 14.04

Hi David,

What do you see in the following log files under /var/log/ganeti/:

master-daemon.log
node-daemon.log
watcher.log

Cheers,
Phil

David Sedeño

Oct 8, 2014, 6:47:00 AM
to gan...@googlegroups.com
On Wednesday, October 8, 2014 at 12:22:46 PM UTC+2, Phil Regnauld wrote:
> Background: Ganeti 2.11.5 in Ubuntu Server 14.04

        Hi David,

        What do you see in the following log files under /var/log/ganeti/:

master-daemon.log
node-daemon.log
watcher.log

In master-daemon.log I don't see anything after the node2 reboot, so I checked:

node2# service ganeti status
 * ganeti-noded is running
 * ganeti-masterd is not running
 * ganeti-rapi is running
 * ganeti-luxid is running
 * ganeti-kvmd is not running
 * ganeti-confd is running
 * ganeti-mond is running

masterd isn't running! Trying to start it gives these errors:

node2# service ganeti start
 * Starting Ganeti cluster
 * ganeti-noded...                                           [ OK ]
 * ganeti-masterd...
ERROR:root:RPC error in master_node_name on node node1.mydomain.com: Error 7: Failed to connect to 192.168.111.201 port 1811: No route to host
WARNING:root:Error contacting node node1.mydomain.com: Error 7: Failed to connect to 192.168.111.201 port 1811: No route to host
ERROR:root:RPC error in master_node_name on node node1.mydomain.com: Error 7: Failed to connect to 192.168.111.201 port 1811: No route to host
WARNING:root:Error contacting node node1.mydomain.com: Error 7: Failed to connect to 192.168.111.201 port 1811: No route to host
ERROR:root:RPC error in master_node_name on node node1.mydomain.com: Error 7: Failed to connect to 192.168.111.201 port 1811: No route to host
WARNING:root:Error contacting node node1.mydomain.com: Error 7: Failed to connect to 192.168.111.201 port 1811: No route to host
ERROR:root:RPC error in master_node_name on node node1.mydomain.com: Error 7: Failed to connect to 192.168.111.201 port 1811: No route to host
WARNING:root:Error contacting node node1.mydomain.com: Error 7: Failed to connect to 192.168.111.201 port 1811: No route to host
ERROR:root:RPC error in master_node_name on node node1.mydomain.com: Error 7: Failed to connect to 192.168.111.201 port 1811: No route to host
WARNING:root:Error contacting node node1.mydomain.com: Error 7: Failed to connect to 192.168.111.201 port 1811: No route to host
ERROR:root:RPC error in master_node_name on node node1.mydomain.com: Error 7: Failed to connect to 192.168.111.201 port 1811: No route to host
WARNING:root:Error contacting node node1.mydomain.com: Error 7: Failed to connect to 192.168.111.201 port 1811: No route to host
CRITICAL:root:Cluster inconsistent, most of the nodes didn't answer after multiple retries. Aborting startup
CRITICAL:root:Use the --no-voting option if you understand what effects it has on the cluster state

So the problem is that ganeti-masterd can't start because it can't communicate with node1, even though I marked node1 offline earlier.

I put the --no-voting option in /etc/default/ganeti, and when I start the service it warns that this option is dangerous and asks for confirmation. After confirming, masterd starts.

So, is this the correct recovery process for a two-node cluster in this situation?
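
For reference, what I added to /etc/default/ganeti was along these lines (the exact variable name depends on the packaging's init script, so treat MASTERD_ARGS as an assumption and check the script your version ships):

```shell
# /etc/default/ganeti
# Extra arguments passed to ganeti-masterd at startup.
# WARNING: --no-voting skips the master-vote safety check. Only use it
# when you understand the split-brain risk, e.g. a two-node cluster
# where the other node is known to be down.
MASTERD_ARGS="--no-voting"
```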

---
David Sedeño
 

Helga Velroyen

Oct 8, 2014, 7:14:37 AM
to gan...@googlegroups.com
Hi David,

Yes, the --no-voting option is actually the way to go with two-node clusters. The problem with two nodes is that neither of them is "more important" than the other, so on their own they cannot decide which one is supposed to be the master.

The man page of 'gnt-cluster' explains this a bit.
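
To illustrate the voting problem with a toy sketch (not Ganeti's actual implementation): at startup, masterd asks every configured node who it thinks the master is and only continues if a strict majority agrees. With two nodes and one of them down, the surviving node only ever has one vote out of two, which is never a majority:

```python
def can_start_masterd(total_nodes: int, votes_for_me: int) -> bool:
    # Simplified model of the masterd startup vote: require a strict
    # majority of ALL configured nodes (not just reachable ones) to
    # agree that this node is the master.
    return votes_for_me * 2 > total_nodes

# Two-node cluster, node1 unreachable: node2 has only its own vote,
# 1 of 2 is not a strict majority, so masterd refuses to start.
print(can_start_masterd(2, 1))

# Three-node cluster with one node down: the two survivors do form
# a majority, so no --no-voting override is needed.
print(can_start_masterd(3, 2))
```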

Cheers,
Helga
--
Helga Velroyen | Software Engineer | hel...@google.com | 

Google Germany GmbH
Dienerstr. 12
80331 München

Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Geschäftsführer: Graham Law, Christine Elizabeth Flores