The job log states the following:
job-640 pid=50932 INFO Op 1/1: opcode INSTANCE_MIGRATE(kvm-test-instance01) waiting for locks
job-640 pid=50932 INFO Selected nodes for instance kvm-test-instance01 via iallocator hail: gnt-test02
job-640 pid=50932 ERROR Instance migration failed, trying to revert disk status: Failed to get migration status: Failed to send command 'info migrate' to instance 'kvm-test-instance01', reason 'exited with exit code 1', output: 2020/04/03 23:49:55 socat[32327] E connect(5, AF=1 "/var/run/ganeti/kvm-hypervisor/ctrl/kvm-test-instance01.monitor", 65): Connection refused
job-640 pid=50932 ERROR Op 1/1: Caught exception in INSTANCE_MIGRATE(kvm-test-instance01)
Traceback (most recent call last):
File "/usr/share/ganeti/3.0/ganeti/jqueue/__init__.py", line 933, in _ExecOpCodeUnlocked
result = self.opexec_fn(op.input,
File "/usr/share/ganeti/3.0/ganeti/mcpu.py", line 705, in ExecOpCode
result = self._LockAndExecLU(lu, locking.LEVEL_CLUSTER + 1,
File "/usr/share/ganeti/3.0/ganeti/mcpu.py", line 631, in _LockAndExecLU
result = self._LockAndExecLU(lu, level + 1, calc_timeout,
File "/usr/share/ganeti/3.0/ganeti/mcpu.py", line 639, in _LockAndExecLU
result = self._LockAndExecLU(lu, level + 1, calc_timeout, pending=pending)
File "/usr/share/ganeti/3.0/ganeti/mcpu.py", line 631, in _LockAndExecLU
result = self._LockAndExecLU(lu, level + 1, calc_timeout,
File "/usr/share/ganeti/3.0/ganeti/mcpu.py", line 631, in _LockAndExecLU
result = self._LockAndExecLU(lu, level + 1, calc_timeout,
File "/usr/share/ganeti/3.0/ganeti/mcpu.py", line 639, in _LockAndExecLU
result = self._LockAndExecLU(lu, level + 1, calc_timeout, pending=pending)
File "/usr/share/ganeti/3.0/ganeti/mcpu.py", line 547, in _LockAndExecLU
result = self._ExecLU(lu)
File "/usr/share/ganeti/3.0/ganeti/mcpu.py", line 505, in _ExecLU
result = _ProcessResult(submit_mj_fn, lu.op, lu.Exec(self.Log))
File "/usr/share/ganeti/3.0/ganeti/cmdlib/base.py", line 351, in Exec
tl.Exec(feedback_fn)
File "/usr/share/ganeti/3.0/ganeti/cmdlib/instance_migration.py", line 1163, in Exec
return self._ExecMigration()
File "/usr/share/ganeti/3.0/ganeti/cmdlib/instance_migration.py", line 953, in _ExecMigration
raise errors.OpExecError("Could not migrate instance %s: %s" %
ganeti.errors.OpExecError: Could not migrate instance kvm-test-instance01: Failed to get migration status: Failed to send command 'info migrate' to instance 'kvm-test-instance01', reason 'exited with exit code 1', output: 2020/04/03 23:49:55 socat[32327] E connect(5, AF=1 "/var/run/ganeti/kvm-hypervisor/ctrl/kvm-test-instance01.monitor", 65): Connection refused
From monitoring the sockets and the debug log of ganeti-noded, the following seems to happen:
* the qemu process on the secondary node starts
* migration parameters are set through the socket on the primary node
* the migration is started through the socket
* ganeti-noded issues three "info migrate" commands to the socket (the first two seem to work; the third always fails because the qemu process is already gone)
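For anyone who wants to reproduce the monitor exchange without a cluster, here is a minimal sketch of what the "info migrate" round trip looks like. Everything in it is a hypothetical stand-in, not ganeti code: fake_monitor/query_migrate_status are made-up helpers, and the socket path and replies are invented. The real socket is the .monitor path shown in the log above, which ganeti reaches via socat; the retry/sleep in the client only mirrors the connection-refused failure seen here, it is not the actual fix.

```python
# Hypothetical sketch of the QEMU human-monitor exchange ganeti performs.
# fake_monitor stands in for the qemu HMP socket; paths and replies are
# invented for illustration only.
import os
import socket
import tempfile
import threading
import time

def fake_monitor(path, replies):
    """Minimal stand-in for the QEMU monitor socket: answer each
    incoming command once, then go away (like a qemu that has exited)."""
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(path)
    srv.listen(1)
    def serve():
        for reply in replies:
            conn, _ = srv.accept()
            conn.recv(1024)               # read "info migrate\n"
            conn.sendall(reply.encode())
            conn.close()
        srv.close()                       # further connects are refused
    threading.Thread(target=serve, daemon=True).start()

def query_migrate_status(path, retries=3, delay=0.1):
    """Send 'info migrate' to the monitor socket, retrying briefly on a
    refused connection (the failure mode in the log above)."""
    for _ in range(retries):
        try:
            with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
                sock.connect(path)
                sock.sendall(b"info migrate\n")
                return sock.recv(1024).decode()
        except (ConnectionRefusedError, FileNotFoundError):
            time.sleep(delay)
    raise RuntimeError("monitor socket never answered")

path = os.path.join(tempfile.mkdtemp(), "kvm-test-instance01.monitor")
fake_monitor(path, ["Migration status: completed\n"])
print(query_migrate_status(path).strip())
```

This only illustrates the mechanism; in the failing case the third connect is refused because the source qemu process has already exited.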
Attached you can find the noded logs from both nodes.
* initial scenario: the instance runs on gnt-test03
* after a gnt-instance failover it runs on gnt-test02
* the live migration back to gnt-test03 fails
Simply adding a 'sleep 2' between the two ganeti commands works around the issue. Both Focal and Bullseye currently ship with QEMU 4.2. Does anyone have an idea why this happens? I am not done with testing/debugging yet, but maybe someone already has an idea what this might be about.
Cheers,
Rudi
--
sipgate GmbH - Gladbacher Str. 74 - 40219 Düsseldorf
HRB Düsseldorf 39841 - Geschäftsführer: Thilo Salmon, Tim Mois
Steuernummer: 106/5724/7147, Umsatzsteuer-ID: DE219349391