gnt-cluster master-failover does not succeed on 2.12.4


John McNally

Jun 26, 2015, 2:40:36 PM
to gan...@googlegroups.com
Hi,

I am running a two-node cluster on CentOS 7.1. I recently updated Ganeti from 2.12.1 to 2.12.4. Now, when I run "gnt-cluster master-failover -d" on the master candidate node, I get this:

-------

2015-06-26 14:20:40,788: gnt-cluster master-failover pid=4790 cli:2706 DEBUG Command line: gnt-cluster master-failover -d
2015-06-26 14:20:40,789: gnt-cluster master-failover pid=4790 node:93 INFO Using PycURL libcurl/7.29.0 NSS/3.15.4 zlib/1.2.7 libidn/1.28 libssh2/1.4.3
2015-06-26 14:20:40,792: gnt-cluster master-failover pid=4790 client:142 DEBUG Starting request <ganeti.http.client.HttpClientRequest 192.168.2.202:1811 POST /master_node_name at 0x13313d0>
2015-06-26 14:20:40,792: gnt-cluster master-failover pid=4790 client:142 DEBUG Starting request <ganeti.http.client.HttpClientRequest 192.168.2.201:1811 POST /master_node_name at 0x1331490>
2015-06-26 14:20:40,903: gnt-cluster master-failover pid=4790 client:228 DEBUG Request <ganeti.http.client.HttpClientRequest 192.168.2.201:1811 POST /master_node_name at 0x1331490> finished, errmsg=None
2015-06-26 14:20:40,904: gnt-cluster master-failover pid=4790 client:228 DEBUG Request <ganeti.http.client.HttpClientRequest 192.168.2.202:1811 POST /master_node_name at 0x13313d0> finished, errmsg=None
2015-06-26 14:20:40,905: gnt-cluster master-failover pid=4790 bootstrap:1055 INFO Setting master to navy-a.posnet.psfc.coop, old master: navy-b.posnet.psfc.coop
2015-06-26 14:20:40,906: gnt-cluster master-failover pid=4790 process:217 INFO RunCmd /usr/lib64/ganeti/daemon-util start ganeti-wconfd --force-node --no-voting --yes-do-it
2015-06-26 14:20:40,960: gnt-cluster master-failover pid=4790 process:217 INFO RunCmd /usr/lib64/ganeti/daemon-util stop ganeti-wconfd
2015-06-26 14:20:41,008: gnt-cluster master-failover pid=4790 cli:2713 ERROR Error during command processing
Traceback (most recent call last):
  File "/usr/share/ganeti/2.12/ganeti/cli.py", line 2709, in GenericMain
    result = func(options, args)
  File "/usr/share/ganeti/2.12/ganeti/rpc/node.py", line 141, in wrapper
    return fn(*args, **kwargs)
  File "/usr/share/ganeti/2.12/ganeti/client/gnt_cluster.py", line 861, in MasterFailover
    rvlaue, msgs = bootstrap.MasterFailover(no_voting=opts.no_voting)
  File "/usr/share/ganeti/2.12/ganeti/bootstrap.py", line 1071, in MasterFailover
    cfg = config.GetConfig(None, livelock, accept_foreign=True)
  File "/usr/share/ganeti/2.12/ganeti/config.py", line 105, in GetConfig
    kwargs['wconfd'] = wc.Client()
  File "/usr/share/ganeti/2.12/ganeti/wconfd.py", line 64, in __init__
    self._InitTransport()
  File "/usr/share/ganeti/2.12/ganeti/rpc/client.py", line 199, in _InitTransport
    timeouts=self.timeouts)
  File "/usr/share/ganeti/2.12/ganeti/rpc/transport.py", line 101, in __init__
    args=(self.socket, address, self._ctimeout))
  File "/usr/share/ganeti/2.12/ganeti/utils/retry.py", line 173, in Retry
    return fn(*args)
  File "/usr/share/ganeti/2.12/ganeti/rpc/transport.py", line 130, in _Connect
    raise errors.NoMasterError(address)
NoMasterError: /var/run/ganeti/socket/ganeti-wconfd
Cannot communicate with socket '/var/run/ganeti/socket/ganeti-wconfd'.
Is the process running and listening for connections?

-------
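For anyone else hitting this error: a quick way to check by hand whether anything is listening on that socket is simply to try connecting to it. This is just a diagnostic sketch, not Ganeti code; the socket path is taken from the traceback above, and the helper name is made up for illustration:

```python
import socket

def wconfd_socket_alive(path, timeout=1.0):
    """Return True if a UNIX stream socket at `path` accepts connections."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        return True
    except OSError:
        # Covers "no such file", "connection refused", and timeouts alike.
        return False
    finally:
        sock.close()

# Socket path from the traceback above:
if not wconfd_socket_alive("/var/run/ganeti/socket/ganeti-wconfd"):
    print("ganeti-wconfd is not listening on its socket")
```

If this reports the socket as dead while `daemon-util` claims wconfd started, the daemon either exited again or has not finished binding its socket yet.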

The cluster is healthy otherwise: Instances are running and performing normally, and I can migrate instances between nodes.

Any advice on how to proceed?

Thanks.

Petr Pudlák

Jun 29, 2015, 1:25:15 PM
to gan...@googlegroups.com
Hi John,

So far I don't have a recommendation, but I'll try to reproduce the issue and get back to you. Are you running Ganeti in split-user mode, that is, with the daemons running under different users?

  Cheers,
  Petr
--
Petr Pudlák
Software Engineer
Google Germany GmbH
Dienerstraße 12
80331 München


John McNally

Jun 29, 2015, 2:42:54 PM
to gan...@googlegroups.com
Petr,

Thanks for your response. No -- the daemons are all running as the root user. I installed Ganeti from the RPMs available for CentOS/RHEL in the integ-ganeti repo (http://jfut.integ.jp/linux/ganeti/). The latest release available there is 2.12.4.

Even if you can't reproduce it, can you suggest a workaround to make the failover happen? I need to do some maintenance, and I need an operational plan for when the master goes down hard.

Thanks again.

 
_________________
John McNally
(718) 834-0549
jmcn...@acm.org

Petr Pudlák

Jun 30, 2015, 8:51:30 AM
to gan...@googlegroups.com
Hi John,

Could you share or examine the log of the WConf daemon, usually located at /var/log/ganeti/wconf-daemon.log, on the node where the master failover failed? The daemon should have been started, so the log should shed some light on why the communication failed.

  Thanks,
  Petr

John McNally

Jun 30, 2015, 3:06:20 PM
to gan...@googlegroups.com
Petr,

I have attached the last two wconfd-daemon log files. Here are some other details about the cluster, FYI:

OS: CentOS 7.1.1503
Kernel: 3.10.0-229.7.2.el7.x86_64
Ganeti version: 2.12.4
DRBD version: 8.4.6
KVM (QEMU) version: 1.5.3 
LVM version: 2.02.115(2)-RHEL7 (2015-01-28)

Let me know if you need further info.

Thanks.

John McNally

wconf-daemon-logs.tgz

Petr Pudlák

Jul 1, 2015, 6:55:26 AM
to gan...@googlegroups.com
Hi John,

Thank you. I couldn't replicate the issue, and I didn't find anything wrong in the logs: there are no errors in the WConf daemon log, and the logs also show that the daemon was running successfully beforehand.

The only idea I have right now is that there might be a race condition during the failover operation: the main process tries to communicate with the daemon before it has fully started.

Could you try applying the attached patch to bootstrap.py (please back it up first) on the node that you want to fail over to? It just adds a small delay after starting the WConf daemon during the operation. Please let me know how it goes.
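The attached patch is the authoritative change; purely as an illustration of the idea, a workaround of this kind can either sleep for a fixed interval or poll until the daemon's socket file shows up. A minimal sketch of the polling variant follows -- the helper name, timeout, and interval are my own choices, not Ganeti code:

```python
import os
import time

def wait_for_socket(path, timeout=10.0, interval=0.2):
    """Block until a socket file appears at `path`, or raise on timeout.

    Illustrative only: the attached workaround patch simply sleeps for a
    fixed interval after daemon-util has started ganeti-wconfd.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if os.path.exists(path):
            return
        time.sleep(interval)
    raise RuntimeError("socket %s did not appear within %.1fs" % (path, timeout))
```

Called right after the daemon-util start, this bounds the wait instead of hard-coding a sleep; note that the socket file existing still doesn't guarantee the daemon is accepting connections, so the fixed delay in the patch is the simpler, more conservative choice.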

  Thanks,
  Petr
wconfd-workaround.patch

John McNally

Jul 1, 2015, 12:40:58 PM
to gan...@googlegroups.com
Petr,

Good news! Your patch is effective and allows gnt-cluster master-failover to complete normally. I installed it on both nodes and was able to fail over and back twice. I have attached log and console output for your information.

Thanks for addressing this so quickly.

Best,

John McNally

gnt-cluster-master-failover-d.txt
wconf-daemon.log

Petr Pudlák

Jul 1, 2015, 12:43:30 PM
to gan...@googlegroups.com
Hi John,

Thanks for trying it out. This confirms the hypothesis, so I'll make a proper fix, which will then go into 2.12.5.

  All the best
  Petr
