Can't upgrade to 3.0 due to running jobs I can't kill

Imre Jonk

May 12, 2022, 5:15:52 AM
to ganeti
Hi there,

I'm attempting to upgrade my Ganeti 2.16 clusters to 3.0. However, in order to upgrade, the job queue must be drained, as can be seen in the upgrade output:
"""
root@catwalk:~# gnt-cluster upgrade --to=3.0 --debug
2022-05-12 10:58:14,057: gnt-cluster upgrade pid=27075 cli:1250 DEBUG Command line: gnt-cluster upgrade --to=3.0 -d
An upgrade is already in progress. Target version matches, resuming.
Verifying 3.0 present on all nodes
Draining queue
"""

This queue-draining operation never finishes because there are three stuck jobs:
"""
root@catwalk:~# gnt-job list --running
    ID Status  Summary
163446 running GROUP_VERIFY_DISKS(a7564447-e5c7-4e23-ac17-e3ac12385839)
163489 running GROUP_VERIFY_DISKS(a7564447-e5c7-4e23-ac17-e3ac12385839)
163526 running GROUP_VERIFY_DISKS(a7564447-e5c7-4e23-ac17-e3ac12385839)
"""

Trying to kill these jobs only yields an error:
"""
root@catwalk:~# gnt-job cancel --kill --yes-do-it 163446
Job 163446 not found in queue
"""

Any help is much appreciated.

Imre Jonk

May 12, 2022, 7:51:06 AM
to ganeti
Never mind! A reboot of the master node made the stuck jobs disappear. Somehow, a Ganeti service restart was not enough.
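
For the record, this is roughly what I tried before resorting to the reboot (the unit name is as packaged on Debian); in my case it was not enough:
"""
root@catwalk:~# systemctl restart ganeti
root@catwalk:~# gnt-job list --running
"""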

I did experience some issues with live migrating instances on Debian 11, but quickly found out that the 3.0.2 update (available in bullseye-backports) fixes this.
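
For other Debian users: once the cluster itself is on 3.0, pulling the fixed packages onto a node from backports is a one-liner (apt's -t option selects the backports release):
"""
root@catwalk:~# apt install -t bullseye-backports ganeti
"""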

Ganeti is a great piece of software. Thanks everyone who helps developing it.

Sascha Lucas

May 12, 2022, 4:16:38 PM
to 'Imre Jonk' via ganeti
Hi Imre,

On Thu, 12 May 2022, 'Imre Jonk' via ganeti wrote:

> Never mind! A reboot of the master node made the stuck jobs disappear.
> Somehow, a Ganeti service restart was not enough.

I'm glad to hear that you could recover and proceed with the upgrade.
Restarting the Ganeti services helped in the distant past. Today, deleting
the stale job files from /var/lib/ganeti/queue/ may also be necessary?
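
If it comes to that, I'd try something like the following, with the daemons stopped first; completely untested, and the job-<id> file naming is from memory:
"""
root@master:~# systemctl stop ganeti
root@master:~# mv /var/lib/ganeti/queue/job-163446 /root/
root@master:~# systemctl start ganeti
"""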

However, I assume you also did a master-failover before rebooting? Maybe
this alone has done the trick?
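
For the archives, the failover itself is a single command, run on the node that should take over as master (hostname is a placeholder):
"""
root@new-master:~# gnt-cluster master-failover
"""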

> I did experience some issues with live migrating instances on Debian 11,
> but quickly found out that the 3.0.2 update (available in
> bullseye-backports) fixes this.

Yes, 3.0.2 has a few bug fixes for newer QEMU versions.

> Ganeti is a great piece of software. Thanks everyone who helps developing
> it.

Indeed, it is! But in the age of the cloud, it feels like it's becoming a dinosaur :-).

Thanks, Sascha.

Imre Jonk

May 13, 2022, 3:20:55 AM
to gan...@googlegroups.com
Hi Sascha,

On Thu, 2022-05-12 at 22:16 +0200, Sascha Lucas wrote:
> I'm glad to hear that you could recover and proceed with the
> upgrade.
> Restarting the Ganeti services helped in the distant past. Today,
> deleting the stale job files from /var/lib/ganeti/queue/ may also be
> necessary?

Maybe; I haven't tried that. I'll give it a shot if it ever happens
again on my 3.0 cluster.

> However, I assume you also did a master-failover before rebooting?
> Maybe this alone has done the trick?

I'm quite sure it didn't. I did some master-failovers before the reboot,
and they didn't help.

>
> Indeed, it is! But in the age of cloud it feels like becoming a
> dinosaur :-).

Ah well, my company is still mostly on-prem and happy with it :)

c sights

May 13, 2022, 7:30:48 AM
to gan...@googlegroups.com
Thanks for keeping us updated! I have yet to upgrade on Debian, so this info will be useful.


Imre Jonk

May 17, 2022, 8:35:25 AM
to ganeti
I have now upgraded a second (physically remote) cluster from Debian 10 to 11 as well, without instance downtime. It's a two-node cluster with DRBD. The upgrade procedure looked roughly like this:

1. Evacuate all instances from node2: `# gnt-node evacuate --primary-only node2.example.com`
2. Add bullseye-backports APT sources to node2. Configure APT to prefer the backported Ganeti packages by putting this in /etc/apt/preferences.d/pin_ganeti.pref:
"""
Explanation: Live migration bugfix in 3.0.2
Package: ganeti ganeti-*
Pin: release a=bullseye-backports
Pin-Priority: 500
"""
3. Temporarily hold the 'ganeti' package on node2; it must be upgraded *after* running 'gnt-cluster upgrade': `# echo ganeti hold | dpkg --set-selections`
4. Upgrade node2 to Debian 11 [1] and reboot. The Ganeti cluster version is now still 2.16, but the 3.0.2 packages are installed.


5. Add the Bullseye APT sources, including bullseye-backports, to node1. Add pin_ganeti.pref from step 2 to /etc/apt/preferences.d/. Install Ganeti 3.0.2 packages from bullseye-backports: `# apt install ganeti-3.0 ganeti-haskell-3.0 ganeti-htools-3.0`. This will upgrade some packages, namely glibc and openssh, and will also install Python 3.9.
6. Upgrade the Ganeti cluster: `# gnt-cluster upgrade --to=3.0`

7. Remove the 'ganeti' package hold from node2: `# echo ganeti install | dpkg --set-selections`

8. Upgrade the 'ganeti' package on both node1 and node2: `# apt install ganeti=3.0.2-1~bpo11+1`
9. Restart the 'ganeti' service on both node1 and node2.
10. Try live-migrating an unimportant VM. (Commands for steps 9 and 10 are sketched after the list.)

11. Evacuate all remaining instances from node1 and upgrade node1 to Debian 11 as well.
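
For completeness, steps 9 and 10 looked roughly like this; the 'ganeti' unit name is as packaged on Debian, and the instance name is a placeholder:
"""
root@node1:~# systemctl restart ganeti
root@node2:~# systemctl restart ganeti
root@node1:~# gnt-instance migrate unimportant-vm.example.com
"""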

Note: I performed a master-failover before each Debian version upgrade. This would have allowed me to still manage my instances if the upgrade went south.

Hope this is useful to someone. Feel free to ask me any questions about my upgrade.