ganeti migration stuck in loop


KenJi

Apr 26, 2026, 11:42:39 PM
to ganeti
Hello, 
I'm migrating an instance with gnt-migrate and the migration got stuck in a loop. What should I do?
Mon Apr 27 09:34:35 2026 * memory transfer progress: 42372.01 %
Mon Apr 27 09:34:46 2026 * memory transfer progress: 42374.17 %
Mon Apr 27 09:34:57 2026 * memory transfer progress: 42376.36 %
Mon Apr 27 09:35:08 2026 * memory transfer progress: 42378.53 %
Mon Apr 27 09:35:19 2026 * memory transfer progress: 42380.71 %
Mon Apr 27 09:35:31 2026 * memory transfer progress: 42382.88 %
Mon Apr 27 09:35:42 2026 * memory transfer progress: 42385.07 %
Mon Apr 27 09:35:53 2026 * memory transfer progress: 42387.26 %
Mon Apr 27 09:36:03 2026 * memory transfer progress: 42389.24 %
Mon Apr 27 09:36:14 2026 * memory transfer progress: 42391.41 %
Mon Apr 27 09:36:26 2026 * memory transfer progress: 42393.60 %
Mon Apr 27 09:36:37 2026 * memory transfer progress: 42395.76 %
Mon Apr 27 09:36:48 2026 * memory transfer progress: 42397.93 %
Mon Apr 27 09:36:59 2026 * memory transfer progress: 42400.10 %

Sascha Lucas

Apr 27, 2026, 2:40:34 AM
to ganeti
Hi,
This is a non-converging migration, which Ganeti cannot break by itself.

The affected instance changes memory faster than your migration settings /
network can handle.

You have the following options:

* If you can control the affected instance from inside, stop the process
that changes memory (often Java). Then the migration should converge.

* If not, break the current migration manually: on the primary node of
the affected instance run the following command:

echo "migrate_cancel" | socat stdio unix-connect:/var/run/ganeti/kvm-hypervisor/ctrl/<instance-name>.monitor

For the next migration, tune your migration settings:

* set migration_bandwidth to ~2/3 of the NIC speed (the value is in MiB/s). This is
- ~83 on 1G
- ~833 on 10G
- etc.

* set migration_downtime to 1000 (meaning one second)

If all this does not help, the brave admin switches to post-copy migration.
Set migration_caps=postcopy-ram either at cluster level for all instances
or just on the affected instance. With that, the migration is
guaranteed to finish.
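For illustration, the settings above map to the Ganeti CLI roughly like this (a sketch; <instance-name> and the 1G bandwidth value are placeholders, and you should verify the parameter names against your Ganeti version):

```shell
# Tune migration settings cluster-wide for the KVM hypervisor
# (migration_bandwidth in MiB/s, migration_downtime in ms):
gnt-cluster modify -H kvm:migration_bandwidth=83,migration_downtime=1000

# Or enable post-copy migration for a single instance only:
gnt-instance modify -H migration_caps=postcopy-ram <instance-name>
```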

HTH, Sascha.

Rudolph Bott

Apr 27, 2026, 8:39:54 AM
to gan...@googlegroups.com
Hi,

We recently implemented a backoff mechanism for our automated instance migrations. If the transferred memory amount passes 200%, it starts to slowly increase migration_downtime, starting from a very low default of 10 ms, until it crosses an upper threshold; only then does it cancel the migration process. What do you think, Sascha: would it make sense to implement something like that directly within Ganeti?
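A minimal sketch of such a backoff (purely illustrative; the values are hypothetical, and the real mechanism would push each new value to the QEMU monitor and wait between steps rather than just computing numbers):

```shell
# Downtime backoff sketch: start low, grow ~10% per step,
# cancel only once the configured ceiling is reached. All values in ms.
downtime=10          # very low starting migration_downtime
ceiling=500          # upper threshold before giving up
steps=0
while [ "$downtime" -lt "$ceiling" ]; do
    # real mechanism: set the new downtime on QEMU, then wait a while
    downtime=$(( downtime + downtime / 10 + 1 ))   # +10%, rounded up
    steps=$(( steps + 1 ))
done
# downtime >= ceiling here: the real mechanism would now cancel the migration
echo "reached ${downtime} ms after ${steps} steps"
```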

Regards,
Rudi



--
 Rudolph Bott - bo...@sipgate.de

 sipgate GmbH - Gladbacher Str. 74 - 40219 Düsseldorf
 HRB Düsseldorf 39841 - Geschäftsführer: Thilo Salmon, Tim Mois
 Steuernummer: 106/5724/7147, Umsatzsteuer-ID: DE219349391

Sascha Lucas

Apr 27, 2026, 10:50:43 AM
to 'Rudolph Bott' via ganeti
Hi Rudi,

On Mon, 27 Apr 2026, 'Rudolph Bott' via ganeti wrote:

> We recently implemented a backoff-mechanism for our automated instance
> migrations. If transferred memory amount passes 200% it starts to slowly
> increase the migration_downtime, starting from a very low default of 10ms
> until it crosses an upper threshold and only then cancels the migration
> process. What do you think Sascha, would it make sense to implement
> something like that directly within Ganeti?

Yes, it makes sense to have a Ganeti control here. That reminds me of the
following:

* having an upper threshold means already tolerating possible maximum
downtime: why not set it to max from the beginning?

* my example of 1000 ms downtime is derived from VMware's maximum stun time.

* a new HV parameter for the percentage of transferred RAM at which Ganeti
should cancel sounds good

* postcopy-ram has a similar threshold, dirty_sync_count, but it's not
necessarily equal to % transferred RAM. Mostly postcopy kicks in at 80-100%.

* possible unification for either cancel or postcopy-ram?

* qemu supports converged migration[1,2].

* if I initiate a migration, I need it to finish and not to cancel. That's
why I'm using postcopy-ram.

Thanks, Sascha.

[1] https://wiki.qemu.org/Features/AutoconvergeLiveMigration
[2] https://www.qemu.org/docs/master/devel/migration/dirty-limit.html
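For reference, the auto-converge capability from [1] can be toggled on the instance's QEMU monitor, in the same way as the earlier migrate_cancel example (illustrative command; the socket path follows Ganeti's layout, and you should verify the capability name against your QEMU version):

```shell
# Enable auto-converge on a running instance via the HMP monitor socket:
echo "migrate_set_capability auto-converge on" | \
  socat stdio unix-connect:/var/run/ganeti/kvm-hypervisor/ctrl/<instance-name>.monitor
```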

Rudolph Bott

Apr 27, 2026, 3:19:37 PM
to gan...@googlegroups.com
Hi Sascha,


* having an upper threshold means already tolerating possible maximum
downtime: why not set it to max from the beginning?

Some (not all) of our services / VMs are somewhat time-sensitive, so we try to keep the downtime as short as possible. We'd rather let QEMU try for a while to use a migration window of e.g. 30 ms and only then start to increase it (with a waiting period between the steps). If we set it to e.g. 500 ms right away, QEMU might more often than not use that maximum and cause service interruptions/trouble. But that's more of an educated guess than validated behaviour.
 
* a new HV parameter for the percentage of transferred RAM at which Ganeti
should cancel sounds good

I could imagine the following parameters and workflow:
- migration_downtime: same as now; this is the starting value for QEMU
- migration_downtime_max: if set and larger than migration_downtime, Ganeti will increase the downtime window configured in the running QEMU instance by e.g. 10% and then wait for a defined period of time
- migration_cancel_threshold: if set, the percentage of transferred memory will be observed and the migration will be canceled once the threshold has been exceeded

Of course the above logic must not be applied if postcopy-ram is used. Maybe it is also the perfect time to start with a new set of migration-related HV params to configure the general migration-mode, thresholds, capabilities and so on and have cfgupgrade/downgrade code so that we can start clean.
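The decision flow for these parameters could be sketched like this (purely illustrative; the parameter names are from the proposal above, not existing Ganeti HV params, and the progress value is a made-up example):

```shell
# Illustrative decision step, evaluated each time the migration
# progress is polled. Downtime values in ms, threshold in % of RAM.
migration_downtime=30          # starting value handed to QEMU
migration_downtime_max=500     # ceiling for the backoff
migration_cancel_threshold=800 # % transferred memory before giving up

transferred=850                # hypothetical current progress in %
if [ "$transferred" -gt "$migration_cancel_threshold" ]; then
    action="cancel"            # threshold exceeded: abort the migration
elif [ "$migration_downtime" -lt "$migration_downtime_max" ]; then
    action="raise_downtime"    # grow the downtime window, then wait
else
    action="keep_waiting"      # at the ceiling: let QEMU keep trying
fi
echo "$action"
```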
 
* qemu supports converged migration[1,2].

That also sounds interesting, and it might make sense to have it optionally available (not sure whether it would help or break more things in our scenario, but it seems useful in general).
 
* if I initiate a migration, I need it to finish and not to cancel. That's
why I'm using postcopy-ram.

We are running Debian unattended-upgrades (with a pre/post hook system added to it) on (almost) all of our systems, including Ganeti nodes. A full upgrade cycle through all of our systems takes roughly five workdays. If a single Ganeti instance fails to migrate (effectively blocking the UU run on that node that day), all instances are migrated back and it tries again at most 7 days later. With the migration-window-increasing logic we reduced failing migrations to an absolute minimum (without broadly raising migration_downtime), so we did not (yet) have to switch to postcopy-ram :-)

Thanks for your input!
 

Rudolph Bott

Apr 27, 2026, 6:02:11 PM
to gan...@googlegroups.com
Hi,

I have created an experimental branch (aided by Claude Code, as mentioned in the commit message) with an implementation of migration_downtime_max and migration_cancel_threshold. The QA suite is currently running, but it neither uses the new HV parameters nor is there any significant load inside the QA guest instances that would delay live migrations. So the only take-away from the QA run will be that the changes do not break anything existing. I won't open a PR until I have done at least some basic testing in a multi-node Ganeti cluster, but I am not sure when I will find the time for that.

If anyone is interested, please check out the following branch:


Cheers,
Rudi