ganeti migration stuck in loop


KenJi

Apr 26, 2026, 11:42:39 PM
to ganeti
Hello, 
I'm migrating an instance with gnt-migrate and the migration got stuck in a loop. What should I do?
Mon Apr 27 09:34:35 2026 * memory transfer progress: 42372.01 %
Mon Apr 27 09:34:46 2026 * memory transfer progress: 42374.17 %
Mon Apr 27 09:34:57 2026 * memory transfer progress: 42376.36 %
Mon Apr 27 09:35:08 2026 * memory transfer progress: 42378.53 %
Mon Apr 27 09:35:19 2026 * memory transfer progress: 42380.71 %
Mon Apr 27 09:35:31 2026 * memory transfer progress: 42382.88 %
Mon Apr 27 09:35:42 2026 * memory transfer progress: 42385.07 %
Mon Apr 27 09:35:53 2026 * memory transfer progress: 42387.26 %
Mon Apr 27 09:36:03 2026 * memory transfer progress: 42389.24 %
Mon Apr 27 09:36:14 2026 * memory transfer progress: 42391.41 %
Mon Apr 27 09:36:26 2026 * memory transfer progress: 42393.60 %
Mon Apr 27 09:36:37 2026 * memory transfer progress: 42395.76 %
Mon Apr 27 09:36:48 2026 * memory transfer progress: 42397.93 %
Mon Apr 27 09:36:59 2026 * memory transfer progress: 42400.10 %

Sascha Lucas

Apr 27, 2026, 2:40:34 AM
to ganeti
Hi,
This is a non-converging migration, which Ganeti cannot break by itself.

The affected instance changes memory faster than your migration settings /
network can handle.

You have the following options:

* If you can control the affected instance from inside, stop the process
that changes memory (often Java). Then the migration should converge.

* If not, break the current migration manually: on the primary node of
the affected instance run the following command:

echo "migrate_cancel" | socat stdio unix-connect:/var/run/ganeti/kvm-hypervisor/ctrl/<instance-name>.monitor

For the next migration, tune your migration settings:

* set migration_bandwidth to ~2/3 of the NIC speed (the value is in MiB/s). This is
- ~83 on 1G
- ~833 on 10G
- etc.

* set migration_downtime to 1000 (meaning one second)

If all this does not help, the brave admin switches to post-copy migration.
Set migration_caps=postcopy-ram either at cluster level for all instances
or just on the affected instance. With that, the migration is
guaranteed to finish.
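For illustration, the settings above map to the Ganeti CLI roughly like this (a sketch; <instance-name> and the 1G bandwidth value are placeholders, and you should verify the parameter names against your Ganeti version):

```shell
# Tune migration settings cluster-wide for the KVM hypervisor
# (migration_bandwidth in MiB/s, migration_downtime in ms):
gnt-cluster modify -H kvm:migration_bandwidth=83,migration_downtime=1000

# Or enable post-copy migration for a single instance only:
gnt-instance modify -H migration_caps=postcopy-ram <instance-name>
```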

HTH, Sascha.

Rudolph Bott

Apr 27, 2026, 8:39:54 AM
to gan...@googlegroups.com
Hi,

We recently implemented a backoff mechanism for our automated instance migrations. If the transferred memory amount passes 200%, it starts to slowly increase migration_downtime, starting from a very low default of 10 ms, until it crosses an upper threshold; only then does it cancel the migration process. What do you think, Sascha: would it make sense to implement something like that directly within Ganeti?
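A minimal sketch of such a backoff (purely illustrative; the values are hypothetical, and the real mechanism would push each new value to the QEMU monitor and wait between steps rather than just computing numbers):

```shell
# Downtime backoff sketch: start low, grow ~10% per step,
# cancel only once the configured ceiling is reached. All values in ms.
downtime=10          # very low starting migration_downtime
ceiling=500          # upper threshold before giving up
steps=0
while [ "$downtime" -lt "$ceiling" ]; do
    # real mechanism: set the new downtime on QEMU, then wait a while
    downtime=$(( downtime + downtime / 10 + 1 ))   # +10%, rounded up
    steps=$(( steps + 1 ))
done
# downtime >= ceiling here: the real mechanism would now cancel the migration
echo "reached ${downtime} ms after ${steps} steps"
```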

Regards,
Rudi



--
 Rudolph Bott - bo...@sipgate.de

 sipgate GmbH - Gladbacher Str. 74 - 40219 Düsseldorf
 HRB Düsseldorf 39841 - Geschäftsführer: Thilo Salmon, Tim Mois
 Steuernummer: 106/5724/7147, Umsatzsteuer-ID: DE219349391

Sascha Lucas

Apr 27, 2026, 10:50:43 AM
to 'Rudolph Bott' via ganeti
Hi Rudi,

On Mon, 27 Apr 2026, 'Rudolph Bott' via ganeti wrote:

> We recently implemented a backoff-mechanism for our automated instance
> migrations. If transferred memory amount passes 200% it starts to slowly
> increase the migration_downtime, starting from a very low default of 10ms
> until it crosses an upper threshold and only then cancels the migration
> process. What do you think Sascha, would it make sense to implement
> something like that directly within Ganeti?

Yes, it makes sense to have a Ganeti control here. That reminds me of the
following:

* having an upper threshold means already tolerating possible maximum
downtime: why not set it to max from the beginning?

* my example of 1000 ms downtime is derived from VMware's maximum stun time.

* a new HV parameter for the percentage of transferred RAM at which Ganeti
should cancel sounds good

* postcopy-ram has a similar threshold, dirty_sync_count, but it's not
necessarily equal to % transferred RAM. Mostly postcopy kicks in at 80-100%.

* possible unification for either cancel or postcopy-ram?

* qemu supports converged migration[1,2].

* if I initiate a migration, I need it to finish and not to cancel. That's
why I'm using postcopy-ram.

Thanks, Sascha.

[1] https://wiki.qemu.org/Features/AutoconvergeLiveMigration
[2] https://www.qemu.org/docs/master/devel/migration/dirty-limit.html
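For reference, the auto-converge capability from [1] can be toggled on the instance's QEMU monitor, in the same way as the earlier migrate_cancel example (illustrative command; the socket path follows Ganeti's layout, and you should verify the capability name against your QEMU version):

```shell
# Enable auto-converge on a running instance via the HMP monitor socket:
echo "migrate_set_capability auto-converge on" | \
  socat stdio unix-connect:/var/run/ganeti/kvm-hypervisor/ctrl/<instance-name>.monitor
```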

Rudolph Bott

Apr 27, 2026, 3:19:37 PM
to gan...@googlegroups.com
Hi Sascha,


* having an upper threshold means already tolerating possible maximum
downtime: why not set it to max from the beginning?

Some (not all) of our services / VMs are somewhat time-sensitive, so we try to keep the downtime as short as possible. We'd rather let QEMU try for a while to use a migration window of e.g. 30 ms and only then start to increase it (with a waiting period between the steps). If we set it to e.g. 500 ms right away, QEMU might more often than not use that maximum and cause service interruptions/trouble. But that's more of an educated guess than validated behaviour.
 
* a new HV parameter for the percentage of transferred RAM at which Ganeti
should cancel sounds good

I could imagine the following parameters and workflow:
- migration_downtime: same as now; this is the starting value for QEMU
- migration_downtime_max: if set and larger than migration_downtime, Ganeti will increase the downtime window configured in the running QEMU instance by e.g. 10% and then wait for a defined period of time
- migration_cancel_threshold: if set, the percentage of transferred memory will be observed and the migration will be canceled once the threshold has been exceeded

Of course the above logic must not be applied if postcopy-ram is used. Maybe it is also the perfect time to start with a new set of migration-related HV params to configure the general migration-mode, thresholds, capabilities and so on and have cfgupgrade/downgrade code so that we can start clean.
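The decision flow for these parameters could be sketched like this (purely illustrative; the parameter names are from the proposal above, not existing Ganeti HV params, and the progress value is a made-up example):

```shell
# Illustrative decision step, evaluated each time the migration
# progress is polled. Downtime values in ms, threshold in % of RAM.
migration_downtime=30          # starting value handed to QEMU
migration_downtime_max=500     # ceiling for the backoff
migration_cancel_threshold=800 # % transferred memory before giving up

transferred=850                # hypothetical current progress in %
if [ "$transferred" -gt "$migration_cancel_threshold" ]; then
    action="cancel"            # threshold exceeded: abort the migration
elif [ "$migration_downtime" -lt "$migration_downtime_max" ]; then
    action="raise_downtime"    # grow the downtime window, then wait
else
    action="keep_waiting"      # at the ceiling: let QEMU keep trying
fi
echo "$action"
```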
 
* qemu supports converged migration[1,2].

That also sounds interesting, and it might make sense to have it optionally available (not sure whether it would help or break more things in our scenario, but it seems useful in general).
 
* if I initiate a migration, I need it to finish and not to cancel. That's
why I'm using postcopy-ram.

We are running Debian unattended-upgrades (with a pre/post hook system added to it) on (almost) all of our systems, including Ganeti nodes. A full upgrade cycle through all of our systems takes roughly five workdays. If a single Ganeti instance fails to migrate (effectively blocking the UU run on that node that day), all instances are migrated back and it tries again at most 7 days later. With the migration-window-increasing logic we reduced failing migrations to an absolute minimum (without broadly raising migration_downtime), so we did not (yet) have to switch to postcopy-ram :-)

Thanks for your input!
 

Rudolph Bott

Apr 27, 2026, 6:02:11 PM
to gan...@googlegroups.com
Hi,

I have created an experimental branch (aided by Claude Code, as mentioned in the commit message) with an implementation of migration_downtime_max and migration_cancel_threshold. The QA suite is currently running, but it neither uses the new HV parameters nor is there any significant load inside the QA guest instances that would delay live migrations. So the only take-away from the QA run will be that the changes do not break anything existing. I won't open a PR until I have done at least some basic testing in a multi-node Ganeti cluster, but I am not sure when I will find the time for that.

If anyone is interested, please check out the following branch:


Cheers,
Rudi