gnt-cluster master-failover does not succeed on 2.12.4


John McNally

Jun 26, 2015, 2:40:36 PM
to gan...@googlegroups.com
Hi,

I am running a two-node cluster on CentOS 7.1. I recently updated Ganeti from 2.12.1 to 2.12.4. Now, when I run "gnt-cluster master-failover -d" on the master candidate node, I get this:

-------

2015-06-26 14:20:40,788: gnt-cluster master-failover pid=4790 cli:2706 DEBUG Command line: gnt-cluster master-failover -d
2015-06-26 14:20:40,789: gnt-cluster master-failover pid=4790 node:93 INFO Using PycURL libcurl/7.29.0 NSS/3.15.4 zlib/1.2.7 libidn/1.28 libssh2/1.4.3
2015-06-26 14:20:40,792: gnt-cluster master-failover pid=4790 client:142 DEBUG Starting request <ganeti.http.client.HttpClientRequest 192.168.2.202:1811 POST /master_node_name at 0x13313d0>
2015-06-26 14:20:40,792: gnt-cluster master-failover pid=4790 client:142 DEBUG Starting request <ganeti.http.client.HttpClientRequest 192.168.2.201:1811 POST /master_node_name at 0x1331490>
2015-06-26 14:20:40,903: gnt-cluster master-failover pid=4790 client:228 DEBUG Request <ganeti.http.client.HttpClientRequest 192.168.2.201:1811 POST /master_node_name at 0x1331490> finished, errmsg=None
2015-06-26 14:20:40,904: gnt-cluster master-failover pid=4790 client:228 DEBUG Request <ganeti.http.client.HttpClientRequest 192.168.2.202:1811 POST /master_node_name at 0x13313d0> finished, errmsg=None
2015-06-26 14:20:40,905: gnt-cluster master-failover pid=4790 bootstrap:1055 INFO Setting master to navy-a.posnet.psfc.coop, old master: navy-b.posnet.psfc.coop
2015-06-26 14:20:40,906: gnt-cluster master-failover pid=4790 process:217 INFO RunCmd /usr/lib64/ganeti/daemon-util start ganeti-wconfd --force-node --no-voting --yes-do-it
2015-06-26 14:20:40,960: gnt-cluster master-failover pid=4790 process:217 INFO RunCmd /usr/lib64/ganeti/daemon-util stop ganeti-wconfd
2015-06-26 14:20:41,008: gnt-cluster master-failover pid=4790 cli:2713 ERROR Error during command processing
Traceback (most recent call last):
  File "/usr/share/ganeti/2.12/ganeti/cli.py", line 2709, in GenericMain
    result = func(options, args)
  File "/usr/share/ganeti/2.12/ganeti/rpc/node.py", line 141, in wrapper
    return fn(*args, **kwargs)
  File "/usr/share/ganeti/2.12/ganeti/client/gnt_cluster.py", line 861, in MasterFailover
    rvlaue, msgs = bootstrap.MasterFailover(no_voting=opts.no_voting)
  File "/usr/share/ganeti/2.12/ganeti/bootstrap.py", line 1071, in MasterFailover
    cfg = config.GetConfig(None, livelock, accept_foreign=True)
  File "/usr/share/ganeti/2.12/ganeti/config.py", line 105, in GetConfig
    kwargs['wconfd'] = wc.Client()
  File "/usr/share/ganeti/2.12/ganeti/wconfd.py", line 64, in __init__
    self._InitTransport()
  File "/usr/share/ganeti/2.12/ganeti/rpc/client.py", line 199, in _InitTransport
    timeouts=self.timeouts)
  File "/usr/share/ganeti/2.12/ganeti/rpc/transport.py", line 101, in __init__
    args=(self.socket, address, self._ctimeout))
  File "/usr/share/ganeti/2.12/ganeti/utils/retry.py", line 173, in Retry
    return fn(*args)
  File "/usr/share/ganeti/2.12/ganeti/rpc/transport.py", line 130, in _Connect
    raise errors.NoMasterError(address)
NoMasterError: /var/run/ganeti/socket/ganeti-wconfd
Cannot communicate with socket '/var/run/ganeti/socket/ganeti-wconfd'.
Is the process running and listening for connections?

-------
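For anyone else hitting this error: a quick way to check by hand whether anything is listening on that socket is simply to try connecting to it. This is just a diagnostic sketch, not Ganeti code; the socket path is taken from the traceback above, and the helper name is made up for illustration:

```python
import socket

def wconfd_socket_alive(path, timeout=1.0):
    """Return True if a UNIX stream socket at `path` accepts connections."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        return True
    except OSError:
        # Covers "no such file", "connection refused", and timeouts alike.
        return False
    finally:
        sock.close()

# Socket path from the traceback above:
if not wconfd_socket_alive("/var/run/ganeti/socket/ganeti-wconfd"):
    print("ganeti-wconfd is not listening on its socket")
```

If this reports the socket as dead while `daemon-util` claims wconfd started, the daemon either exited again or has not finished binding its socket yet.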

The cluster is healthy otherwise: Instances are running and performing normally, and I can migrate instances between nodes.

Any advice on how to proceed?

Thanks.

Petr Pudlák

Jun 29, 2015, 1:25:15 PM
to gan...@googlegroups.com
Hi John,

So far I don't have a recommendation, but I'll try to reproduce the issue and get back to you. Are you running Ganeti in split-user mode, that is, with the daemons running under different users?

  Cheers,
  Petr
--
Petr Pudlák
Software Engineer
Google Germany GmbH
Dienerstraße 12
80331 München


John McNally

Jun 29, 2015, 2:42:54 PM
to gan...@googlegroups.com
Petr,

Thanks for your response. No -- the daemons are all running as the root user. I installed Ganeti from the RPMs available for CentOS/RHEL in the integ-ganeti repo (http://jfut.integ.jp/linux/ganeti/). The latest release available there is 2.12.4.

Even if you can't reproduce it, can you suggest a workaround to make the failover happen? I need to do some maintenance, and I need an operational plan for when the master goes down hard.

Thanks again.

 
_________________
John McNally
(718) 834-0549
jmcn...@acm.org

Petr Pudlák

Jun 30, 2015, 8:51:30 AM
to gan...@googlegroups.com
Hi John,

Could you share or examine the log of the WConf daemon, usually located at /var/log/ganeti/wconf-daemon.log, on the node where the master failover failed? The daemon should have been started, so the log should shed some light on why the communication failed.

  Thanks,
  Petr

John McNally

Jun 30, 2015, 3:06:20 PM
to gan...@googlegroups.com
Petr,

I have attached the last two wconfd-daemon log files. Here are some other details about the cluster, FYI:

OS: CentOS 7.1.1503
Kernel: 3.10.0-229.7.2.el7.x86_64
Ganeti version: 2.12.4
DRBD version: 8.4.6
KVM (QEMU) version: 1.5.3 
LVM version: 2.02.115(2)-RHEL7 (2015-01-28)

Let me know if you need further info.

Thanks.

John McNally

wconf-daemon-logs.tgz

Petr Pudlák

Jul 1, 2015, 6:55:26 AM
to gan...@googlegroups.com
Hi John,

Thank you. I couldn't replicate the issue, and I didn't find anything wrong in the logs: there are no errors in the WConf daemon log, and the logs also show that the daemon was running successfully beforehand.

The only idea I have right now is that there might be a race condition during the failover operation: the main process tries to communicate with the daemon before it has fully started.

Could you try applying the attached patch to bootstrap.py (please back it up first) on the node that you want to fail over to? It just adds a small delay after starting the WConf daemon during the operation. Please let me know how it goes.
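The attached patch is the authoritative change; purely as an illustration of the idea, a workaround of this kind can either sleep for a fixed interval or poll until the daemon's socket file shows up. A minimal sketch of the polling variant follows -- the helper name, timeout, and interval are my own choices, not Ganeti code:

```python
import os
import time

def wait_for_socket(path, timeout=10.0, interval=0.2):
    """Block until a socket file appears at `path`, or raise on timeout.

    Illustrative only: the attached workaround patch simply sleeps for a
    fixed interval after daemon-util has started ganeti-wconfd.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if os.path.exists(path):
            return
        time.sleep(interval)
    raise RuntimeError("socket %s did not appear within %.1fs" % (path, timeout))
```

Called right after the daemon-util start, this bounds the wait instead of hard-coding a sleep; note that the socket file existing still doesn't guarantee the daemon is accepting connections, so the fixed delay in the patch is the simpler, more conservative choice.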

  Thanks,
  Petr
wconfd-workaround.patch

John McNally

Jul 1, 2015, 12:40:58 PM
to gan...@googlegroups.com
Petr,

Good news! Your patch is effective and allows gnt-cluster master-failover to complete normally. I installed it on both nodes and was able to fail over and back twice. I have attached log and console output for your information.

Thanks for addressing this so quickly.

Best,

John McNally

gnt-cluster-master-failover-d.txt
wconf-daemon.log

Petr Pudlák

Jul 1, 2015, 12:43:30 PM
to gan...@googlegroups.com
Hi John,

Thanks for trying it out. This confirms the hypothesis, so I'll make a proper fix, which will then go into 2.12.5.

  All the best
  Petr
