Ganeti - how to remove a shut-down node


Pavel Műller

Mar 21, 2016, 6:41:42 AM
to ganeti
Hello guys,

I am trying to remove a node (gnt-node remove ganeti2) from a two-node Ganeti cluster, but I am stuck at this step. I get the following output:

# gnt-node remove ganeti2
Timeout while talking to the master daemon. Jobs might have been submitted and will continue to run even if the call timed out. Useful commands in this situation are "gnt-job list", "gnt-job cancel" and "gnt-job watch". Error:
Connect timed out

When I tried other commands, I got the same output: Connect timed out.

Could you please help?

Pavel

P.S. Here is the output when I start Ganeti:


# systemctl status ganeti.service
● ganeti.service - LSB: Ganeti Cluster Manager
   Loaded: loaded (/etc/init.d/ganeti)
   Active: active (running) since Mon 2016-03-21 10:40:18 CET; 20min ago
  Process: 15001 ExecStop=/etc/init.d/ganeti stop (code=exited, status=0/SUCCESS)
  Process: 15047 ExecStart=/etc/init.d/ganeti start (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/ganeti.service
           ├─15065 /usr/bin/python /usr/sbin/ganeti-noded -b 192.168.4.47
           └─15106 /usr/bin/python /usr/sbin/ganeti-rapi -b 192.168.4.47

Mar 21 10:40:03 sunfirex2200 ganeti[15047]: Error in the RPC HTTP reply from 'Node {nodeName = "ganeti2", nodePrimaryIp = "192.168.4.27", nodeSecondaryIp =...apable = T
Mar 21 10:40:03 sunfirex2200 ganeti[15047]: No voting RPC result from ["sunfirex2200","ganeti2"]
Mar 21 10:40:13 sunfirex2200 ganeti[15047]: Error in the RPC HTTP reply from 'Node {nodeName = "sunfirex2200", nodePrimaryIp = "192.168.4.47", nodeSecondar...deVmCapabl
Mar 21 10:40:13 sunfirex2200 ganeti[15047]: Error in the RPC HTTP reply from 'Node {nodeName = "ganeti2", nodePrimaryIp = "192.168.4.27", nodeSecondaryIp =...apable = T
Mar 21 10:40:13 sunfirex2200 ganeti[15047]: No voting RPC result from ["sunfirex2200","ganeti2"]
Mar 21 10:40:13 sunfirex2200 ganeti[15047]: Failed to verify master status: Couldn't gather voting results of enough nodes
Mar 21 10:40:13 sunfirex2200 ganeti[15047]: failed (exit code 1).
Mar 21 10:40:15 sunfirex2200 ganeti[15047]: ganeti-kvmd...done.
Mar 21 10:40:17 sunfirex2200 ganeti[15047]: ganeti-confd...done.
Mar 21 10:40:18 sunfirex2200 ganeti[15047]: ganeti-mond...done.
Hint: Some lines were ellipsized, use -l to show in full.



Benjamin Redling

Mar 21, 2016, 7:21:58 AM
to gan...@googlegroups.com
Hello Pavel,

On 03/21/2016 11:41, Pavel Műller wrote:
> I am trying to remove node (gnt-node remove ganeti2) from two nodes ganeti
> cluster, [...]

Have you found and read:
https://groups.google.com/forum/#!topic/ganeti/pl1B_-C28qM

What's your current output of gnt-cluster verify?

Regards,
Benjamin
--
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
vox: +49 3641 9 44323 | fax: +49 3641 9 44321

Pavel Muller

Mar 21, 2016, 7:54:45 AM
to gan...@googlegroups.com
I cannot run any gnt-* commands. Here is the output of gnt-cluster
verify:
# gnt-cluster verify
Timeout while talking to the master daemon. Jobs might have been
submitted and will continue to run even if the call timed out. Useful
commands in this situation are "gnt-job list", "gnt-job cancel" and
"gnt-job watch". Error:
Connect timed out

...and yes, I have found the link you sent. But I still don't know
exactly how to remove the node, which is offline now (it is a test
environment). How do I restart wconfd with the parameters "--no-voting
--yes-do-it"?

"# ganeti-wconfd --no-user-checks --no-voting --yes-do-it"?

Thank you in advance.

On 21.3.2016 12:21, Benjamin Redling wrote:
> Hello Pavel,
>
> On 03/21/2016 11:41, Pavel Műller wrote:
>> I am trying to remove node (gnt-node remove ganeti2) from two nodes ganeti
>> cluster, [...]
> Have you found and read:
> https://groups.google.com/forum/#!topic/ganeti/pl1B_-C28qM
>
> What's your current output of gnt-cluster verify?
>
> Regards,
> Benjamin

--
Best regards

Pavel Muller
Systems Engineer
Mobile: +420 603 436 153


Benjamin Redling

Mar 21, 2016, 9:38:03 AM
to gan...@googlegroups.com
On 03/21/2016 12:54, Pavel Muller wrote:
> I can not start any gnt-* commands.

Just to make sure: on which node are you submitting your commands?
And which version of ganeti are you using? OS?

Just asking because of similarities to
https://code.google.com/p/ganeti/issues/detail?id=1051

And depending on your Ganeti version and systemd, you apparently have to
pass --no-voting and the related options explicitly:
https://code.google.com/p/ganeti/issues/detail?id=1007
https://code.google.com/p/ganeti/issues/detail?id=1084


> Here is the output of gnt-cluster
> verify:
> # gnt-cluster verify
> Timeout while talking to the master daemon. Jobs might have been
> submitted and will continue to run even if the call timed out. Useful
> commands in this situation are "gnt-job list", "gnt-job cancel" and
> "gnt-job watch". Error:
> Connect timed out
>
> ...and yes I have found the link, which you sent. But I still dont know
> how exactly remove the node, which is offline now (it is test
> environment). How to restart wconfd with parameter "--no-voting
> --yes-do-it".
>
> "# ganeti-wconfd --no-user-checks --no-voting --yes-do-it"?

AFAIK yes, and I would double-check that no old processes are still running.

Pavel Muller

Mar 21, 2016, 10:07:59 AM
to gan...@googlegroups.com
Hello Benjamin,

I am on the master node. I originally created a node called "sunfirex2200"
(master), installed Ganeti, and it worked properly. After that, I
added a second node (which is now shut down). Since I shut down the
second node, I have had the problems described.

OS: Debian Jessie, kernel 3.16.0-4-amd64
Ganeti version: 2.12.4-1+deb8u3 (installed as a Debian package)

Thank you for the links - I am going to check them.

Benjamin Redling

Mar 21, 2016, 10:52:07 AM
to gan...@googlegroups.com
On 03/21/2016 15:07, Pavel Muller wrote:
> I am on the master node. I originally created node called "sunfirex2200"
> (master) and installed Ganeti and It worked properly. After that, I
> added second node (which is now shutted down). Since that moment (shut
> down second node) I have had described problems.

Did you have to restart the master node or the ganeti services on the
master?

Then read the link from my first answer again.
Klaus Aehlig explains it there early on:
"
Ganeti master-node daemons always refused to start unless more than half
of all nodes are available and confirm the master status. For a two-node
cluster, this means both nodes must be present. The only exception is, if
the --no-voting option is given (requires --yes-do-it).

It is a different situation, however, if the daemon is already running; then
it will continue to run, as the master status is only verified on startup.
"
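
The "more than half of all nodes" rule quoted above can be sketched as simple arithmetic. This is only an illustration of the quoted statement, not Ganeti's actual voting code; the helper name required_votes is hypothetical.

```shell
# Hypothetical helper: smallest node count that is "more than half"
# of a cluster with $1 nodes (integer division, then add one).
required_votes() {
  echo $(( $1 / 2 + 1 ))
}

# Print the threshold for a few example cluster sizes.
for n in 2 3 5; do
  echo "nodes=$n required=$(required_votes "$n")"
done
```

For a two-node cluster the threshold equals the cluster size, which is why one shut-down node is enough to block a restart of the master daemons unless voting is disabled.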

I guess as soon as you added the second node you ran into the scenario
Klaus Aehlig described. Everything ran fine, but your master won't
restart with the second node shut down.

Set "--no-voting --yes-do-it" in /etc/default/ganeti at WCONFD_ARGS and
LUXID_ARGS.
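
As a rough sketch, the relevant lines in /etc/default/ganeti might then look like the fragment below. It assumes the file uses plain shell variable assignments and that neither variable already carries other options; if yours do, keep those options and append the two flags.

```shell
# /etc/default/ganeti (fragment, assumed layout)
# Skip the master vote on startup; only sensible while the other
# node is known to be down.
WCONFD_ARGS="--no-voting --yes-do-it"
LUXID_ARGS="--no-voting --yes-do-it"
```

The daemons read these arguments only at startup, so the ganeti service has to be restarted afterwards.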

Regards,

Pavel Muller

Mar 21, 2016, 11:31:15 AM
to gan...@googlegroups.com
I restarted the ganeti service on the master node several times.

# systemctl restart ganeti.service

I set "--no-voting --yes-do-it" in /etc/default/ganeti at WCONFD_ARGS and
LUXID_ARGS, as you said. Now it seems that it has partially started to work.

For example:

###
# gnt-cluster verify

Failure: command execution error:
detected death of job 5870
###



But the following command works:

###
# gnt-cluster info
Cluster name: ganeti
Cluster UUID: d65b386b-44c2-433a-93a3-58394c1978e4
Creation time: 2016-02-02 10:58:16
Modification time: 2016-02-02 10:58:16
Master node: sunfirex2200
Architecture (this node): 64bits (x86_64)
Tags: (none)
Default hypervisor: kvm
Enabled hypervisors: kvm
...
###
-- 
Best regards

Pavel Muller
Systems Engineer
Mobile: +420 603 436 153