gnt-instance reboot slow

Daniel Howard

Aug 12, 2015, 1:56:02 PM
to ganeti
Hello,

One thing that is really slowing down my efforts is that "gnt-instance reboot" on my test cluster takes 2-3 minutes to execute.  I see this in the logs:

==> /var/log/ganeti/wconf-daemon.log <==
2015-08-12 10:42:24,587454000000 PDT: ganeti-wconfd pid=3192/ThreadId 28318 WARNING Error during message receiving: user error (Timeout in reading a response)

I found a forum post blaming this on a regression bug that was fixed in 2.12.4.  Alas, I am running 2.12.4-1~trusty+1 from the PPA http://ppa.launchpad.net/pkg-ganeti-devel/lts/ubuntu

Thanks,
-danny

Daniel Howard

Aug 12, 2015, 2:36:34 PM
to ganeti
Workaround: I evacuated the instance's node, which relocated the instance to the master node. There, reboot happens in an instant. I then evacuated the master node; the instance is back on its original node, and reboot still happens in an instant.

*scratches head*

-danny

Daniel Howard

Aug 12, 2015, 6:52:03 PM
to ganeti
Various man pages allude to a two-minute grace period to allow a VM to shut down cleanly. There is an argument --timeout which can not be applied to reboot, but can be applied to shutdown. In my case, shutdown --timeout 0 brings the cycle time down to ... one minute.

1-15:46 djh@ganeti06-23 ~$ time sudo gnt-instance reboot --timeout 0 ganeti-test1
Usage
=====
  gnt-instance reboot <instance>

gnt-instance: error: no such option: --timeout

real    0m0.175s
user    0m0.129s
sys     0m0.047s
2-15:46 djh@ganeti06-23 ~$ time sudo gnt-instance shutdown --timeout 0 ganeti-test1
Waiting for job 819 for ganeti-test1.mtv.qxxxxxxxxd.com ...

real    0m58.496s
user    0m0.152s
sys     0m0.036s
0-15:47 djh@ganeti06-23 ~$ time sudo gnt-instance start ganeti-test1
Waiting for job 820 for ganeti-test1.mtv.qxxxxxxxxd.com ...

real    0m4.075s
user    0m0.150s
sys     0m0.040s


In my earlier workaround, I suspect that the important difference was having successfully installed an OS.  Now that I have Ubuntu bootstrapping from virtual disk:

0-15:49 djh@ganeti06-23 ~$ time sudo gnt-instance reboot ganeti-test1
Waiting for job 822 for ganeti-test1.mtv.qxxxxxxxxd.com ...

real    0m7.875s
user    0m0.148s
sys     0m0.031s




candlerb

Aug 13, 2015, 8:37:55 AM
to ganeti
> There is an argument --timeout which can not be applied to reboot

The flag there is annoyingly called `--shutdown-timeout` instead.

It will wait *up to* 2 minutes. The guest gets sent an ACPI shutdown signal, and then ganeti waits for it to halt. If your instance does actually halt, then as soon as it has done so ganeti will return.

However if your instance is locked up (e.g. because it failed to boot in the first place), you may as well pass timeout 0, otherwise you will sit there for 2 minutes twiddling your thumbs.

I don't know what your 1 minute extra delay is though.
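
A minimal sketch of passing that flag on reboot, using the instance name ganeti-test1 from earlier in the thread. The helper name reboot_cmd is made up for illustration, and the 120-second default is only an assumption matching the two-minute wait described above. It prints the command rather than running it, so it can be checked first:

```shell
# Hypothetical helper: build a reboot command with an explicit shutdown timeout.
# $1 = instance name, $2 = timeout in seconds (assumed default: 120).
reboot_cmd() {
  instance="$1"
  timeout="${2:-120}"
  printf 'gnt-instance reboot --shutdown-timeout=%s %s\n' "$timeout" "$instance"
}

# Print the command for a guest that is known to be hung (timeout 0).
reboot_cmd ganeti-test1 0
```

Once the printed command looks right, it can be run with sudo on the master node. A timeout of 0 only makes sense when the guest is not going to respond to the ACPI signal anyway.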

Vegard Hansen

Sep 3, 2015, 6:20:07 AM
to gan...@googlegroups.com
Hi,

I think I've hit the same problem as you guys, and I did some further
digging.

When Ganeti shuts down an instance it uses socat to connect to the KVM
monitor socket, where it runs "system_powerdown". For whatever reason
this has no effect inside the instance, so we have to wait until the
timeout, at which point Ganeti just pulls the plug. However, you can
run "system_reboot" and the instance will immediately reboot itself,
which leads me to believe that ACPI is actually working somewhat. I've
yet to figure out why it doesn't respond to "system_powerdown", so more
digging is required.

You can reproduce this by running the following on the node where the
instance is running:

/usr/bin/socat STDIO UNIX-CONNECT:/var/run/ganeti/kvm-hypervisor/ctrl/test.monitor

where "test" is the instance name.
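
The same test can be done non-interactively. This is a rough sketch, assuming the monitor socket path shown above; the function names monitor_socket and monitor_send are invented for illustration and are not part of Ganeti:

```shell
# Hypothetical helper: path of the KVM monitor socket for an instance,
# following the path pattern quoted in this message.
monitor_socket() {
  printf '/var/run/ganeti/kvm-hypervisor/ctrl/%s.monitor\n' "$1"
}

# Hypothetical helper: send one monitor command to an instance via socat.
# $1 = instance name, remaining arguments = the monitor command.
monitor_send() {
  instance="$1"
  shift
  sock="$(monitor_socket "$instance")"
  if [ ! -S "$sock" ]; then
    echo "no monitor socket at $sock" >&2
    return 1
  fi
  # "-" is socat's shorthand for STDIO.
  printf '%s\n' "$*" | socat - "UNIX-CONNECT:$sock"
}

# Example (run as root on the node hosting the instance):
#   monitor_send test system_powerdown
```

If the guest ignores the command, nothing visible happens and the instance keeps running, which matches the behaviour described above.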
--
Med vennlig hilsen
Vegard Hansen

Vegard Hansen

Sep 4, 2015, 4:28:13 AM
to gan...@googlegroups.com
A short update,

With a little help from a few guys on the IRC channel we've found the
problem, and it might not be related to what you guys are having
issues with. I hadn't installed the kernel image on the instances, so
while the kernel was built with support for it, the kernel modules
themselves were not present, and therefore not loaded. After
installing linux-image-amd64 on the instances and rebooting them,
"lsmod" reported "button" as a module, and "system_powerdown" triggered
an ACPI shutdown.

So to sum up, if you're using the node/cluster kernel with
"kernel_path" and "initrd_path" you also need to install the relevant
kernel on the instance to load all the modules.
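
A quick way to check for this from inside a guest is sketched below. The function name has_module is made up for illustration; it simply looks for a module name at the start of a line in /proc/modules, which is the same list "lsmod" prints:

```shell
# Hypothetical helper: succeed if a kernel module is currently loaded.
# $1 = module name, $2 = modules list to read (default: /proc/modules).
has_module() {
  name="$1"
  modules_file="${2:-/proc/modules}"
  grep -q "^${name} " "$modules_file" 2>/dev/null
}

# Run inside the instance, not on the node.
if has_module button; then
  echo "button module is loaded"
else
  echo "button module is not loaded"
fi
```

Note that this only detects loadable modules. If a kernel has the ACPI button driver built in, it will not be listed even though it is active.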
--
Med vennlig hilsen
Vegard Hansen

