QA suite KVM failover/migrate timing issue on Ubuntu 19.10 and Debian Bullseye


Rudolph Bott

Apr 4, 2020, 7:46:41 AM
to ganeti...@googlegroups.com
Hi list,

I have come across a "bug" in QEMU which probably only affects the QA suite. My test environment consists of the following:
- three-node ganeti cluster
- KVM Hypervisor
- DRBD and sharedfile disks
- small VMs with a ramdisk-only setup which run a busybox/acpid to react to shutdown signals

The QA suite passes all tests for DRBD backed instances just fine. However, when I run a gnt-instance failover immediately followed by gnt-instance migrate (as the QA suite does), the migration fails for "sharedfile":

 - INFO: Selected nodes for instance kvm-test-instance01 via iallocator hail:   gnt-test02
Migrating instance kvm-test-instance01
* checking disk consistency between source and target
* closing instance disks on node gnt-test02
* opening instance disks on node gnt-test03 in shared mode
* opening instance disks on node gnt-test02 in shared mode
* preparing gnt-test02 to accept the instance
* migrating instance to gnt-test02
* starting memory transfer
Migration failed, aborting
* closing instance disks on node gnt-test02
* opening instance disks on node gnt-test03 in exclusive mode
Failure: command execution error:
Could not migrate instance kvm-test-instance01: Failed to get migration status: Failed to send command 'info migrate' to instance 'kvm-test-instance01', reason 'exited with exit code 1', output: 2020/04/03 23:49:55 socat[32327] E connect(5, AF=1 "/var/run/ganeti/kvm-hypervisor/ctrl/kvm-test-instance01.monitor", 65): Connection refused


The job log states the following:
job-640 pid=50932 INFO Op 1/1: opcode INSTANCE_MIGRATE(kvm-test-instance01) waiting for locks
job-640 pid=50932 INFO Selected nodes for instance kvm-test-instance01 via iallocator hail: gnt-test02
job-640 pid=50932 ERROR Instance migration failed, trying to revert disk status: Failed to get migration status: Failed to send command 'info migrate' to instance 'kvm-test-instance01', reason 'exited with exit code 1', output: 2020/04/03 23:49:55 socat[32327] E connect(5, AF=1 "/var/run/ganeti/kvm-hypervisor/ctrl/kvm-test-instance01.monitor", 65): Connection refused

job-640 pid=50932 ERROR Op 1/1: Caught exception in INSTANCE_MIGRATE(kvm-test-instance01)
Traceback (most recent call last):
  File "/usr/share/ganeti/3.0/ganeti/jqueue/__init__.py", line 933, in _ExecOpCodeUnlocked
    result = self.opexec_fn(op.input,
  File "/usr/share/ganeti/3.0/ganeti/mcpu.py", line 705, in ExecOpCode
    result = self._LockAndExecLU(lu, locking.LEVEL_CLUSTER + 1,
  File "/usr/share/ganeti/3.0/ganeti/mcpu.py", line 631, in _LockAndExecLU
    result = self._LockAndExecLU(lu, level + 1, calc_timeout,
  File "/usr/share/ganeti/3.0/ganeti/mcpu.py", line 639, in _LockAndExecLU
    result = self._LockAndExecLU(lu, level + 1, calc_timeout, pending=pending)
  File "/usr/share/ganeti/3.0/ganeti/mcpu.py", line 631, in _LockAndExecLU
    result = self._LockAndExecLU(lu, level + 1, calc_timeout,
  File "/usr/share/ganeti/3.0/ganeti/mcpu.py", line 631, in _LockAndExecLU
    result = self._LockAndExecLU(lu, level + 1, calc_timeout,
  File "/usr/share/ganeti/3.0/ganeti/mcpu.py", line 639, in _LockAndExecLU
    result = self._LockAndExecLU(lu, level + 1, calc_timeout, pending=pending)
  File "/usr/share/ganeti/3.0/ganeti/mcpu.py", line 547, in _LockAndExecLU
    result = self._ExecLU(lu)
  File "/usr/share/ganeti/3.0/ganeti/mcpu.py", line 505, in _ExecLU
    result = _ProcessResult(submit_mj_fn, lu.op, lu.Exec(self.Log))
  File "/usr/share/ganeti/3.0/ganeti/cmdlib/base.py", line 351, in Exec
    tl.Exec(feedback_fn)
  File "/usr/share/ganeti/3.0/ganeti/cmdlib/instance_migration.py", line 1163, in Exec
    return self._ExecMigration()
  File "/usr/share/ganeti/3.0/ganeti/cmdlib/instance_migration.py", line 953, in _ExecMigration
    raise errors.OpExecError("Could not migrate instance %s: %s" %
ganeti.errors.OpExecError: Could not migrate instance kvm-test-instance01: Failed to get migration status: Failed to send command 'info migrate' to instance 'kvm-test-instance01', reason 'exited with exit code 1', output: 2020/04/03 23:49:55 socat[32327] E connect(5, AF=1 "/var/run/ganeti/kvm-hypervisor/ctrl/kvm-test-instance01.monitor", 65): Connection refused


From monitoring the sockets and the debug log of ganeti-noded, the following seems to happen:

* the qemu process on the secondary node starts
* migration parameters are set through the socket on the primary node
* the migration is started through the socket
* ganeti-noded issues three "info migrate" commands to the socket (two seem to work, the third always fails as the qemu process is gone)

Attached you can find the noded logs from both nodes. 

* initial scenario: the instance runs on gnt-test03
* after a gnt-instance failover it runs on gnt-test02
* the live migration back to gnt-test03 fails

Simply adding a 'sleep 2' between the two ganeti commands fixes the issue. Both Focal and Bullseye currently ship with QEMU 4.2. Does anyone have an idea why this happens? I am not done with testing/debugging yet, but maybe someone already has an idea what this might be about.
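
The workaround boils down to something like the following sketch. `run_cmd` is a hypothetical callable that executes a gnt-* command line; the helper name and signature are illustrative and not the actual QA suite API:

```python
import time

def failover_then_migrate(run_cmd, instance, settle_secs=2):
    """Workaround sketch: give the freshly started qemu process time to
    get the guest past its early boot phase before live-migrating it.

    run_cmd is a hypothetical callable executing one gnt-* command.
    """
    run_cmd(["gnt-instance", "failover", "-f", instance])
    time.sleep(settle_secs)   # the 'sleep 2' that makes the run pass
    run_cmd(["gnt-instance", "migrate", "-f", instance])
```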

Cheers,
Rudi

--
 Rudolph Bott - bo...@sipgate.de

 sipgate GmbH - Gladbacher Str. 74 - 40219 Düsseldorf
 HRB Düsseldorf 39841 - Geschäftsführer: Thilo Salmon, Tim Mois
 Steuernummer: 106/5724/7147, Umsatzsteuer-ID: DE219349391

gnt-test02.log
gnt-test03.log

Sascha Lucas

Apr 4, 2020, 11:52:56 AM
to ganeti...@googlegroups.com
Hi Rudi,

On Sat, 4 Apr 2020, Rudolph Bott wrote:

> * the qemu process on the secondary node starts
> * migration parameters are set through the socket on the primary node
> * the migration is started through the socket
> * ganeti-noded issues three "info migrate" commands to the socket (two seem
> to work, the third always fails as the qemu process is gone)

Apparently the qemu process crashed. I wonder if there is something in
the logs[1] (/var/log/ganeti/kvm/<instance name>.log)?

> Simply adding a 'sleep 2' between the two ganeti commands fixes the issue.

That would explain why DRBD does not trigger the bug: the disk transition from
disconnected to primary/primary reconnect includes these extra seconds.

Thanks, Sascha.

[1] https://github.com/ganeti/ganeti/commit/c138dbe4c

Rudolph Bott

Apr 5, 2020, 9:41:47 AM
to Sascha Lucas, ganeti...@googlegroups.com
Hi Sascha,

On Sat, Apr 4, 2020 at 5:52 PM Sascha Lucas <sascha...@web.de> wrote:
Apparently the qemu process crashed. I wonder if there is something in
the logs[1] (/var/log/ganeti/kvm/<instance name>.log)?

I entirely forgot about the QEMU log files. Yes, we can see an error message from the crashing QEMU process:

qemu-system-x86_64: /build/qemu-oknQD6/qemu-4.2/accel/kvm/kvm-all.c:653: kvm_log_clear_one_slot: Assertion `mem->dirty_bmap' failed.

Looking into that, I stumbled over these bugs:


It seems that it's related to starting a migration while the QEMU process (or rather: the VM inside) is still in the boot phase. The fix for this is already upstream:


However, it seems it will only be included in the upcoming QEMU 5 release. I think we should open a Debian bug for this (although after quickly reading through the guide I am not 100% sure I understood how to open a bug against QEMU in bullseye). Maybe the fix can be backported.

Any ideas how to work around this issue in the QA suite for now? Swap the failover/migration tests? Ganeti users issuing failover/reboot/start plus migrate in rapid succession is probably not very likely on production systems. But from what I understand, this can also be an issue when someone triggers a live migration through Ganeti while the VM is internally in a rebooting state. That is something which might well hit people in production.
 

> Simply adding a 'sleep 2' between the two ganeti commands fixes the issue.

That would explain why DRBD does not trigger the bug: the disk transition from
disconnected to primary/primary reconnect includes these extra seconds.

From what we know now, this slight difference in timing should be enough to get the QEMU process/VM out of its (re)booting state.
 
Cheers, Rudi
 

Sascha Lucas

Apr 6, 2020, 6:39:03 AM
to ganeti...@googlegroups.com
Hi Rudi,

On Sun, 5 Apr 2020, Rudolph Bott wrote:

> I entirely forgot about the QEMU log files. Yes, we can see an error
> message from the crashing QEMU process:
>
> qemu-system-x86_64: /build/qemu-oknQD6/qemu-4.2/accel/kvm/kvm-all.c:653:
> kvm_log_clear_one_slot: Assertion `mem->dirty_bmap' failed.

Nice finding. Well done!

> It seems that it's related to starting a migration while the QEMU process
> (or rather: the VM inside) is still in the boot phase. The fix for this is
> already upstream:
>
> https://github.com/qemu/qemu/commit/9b3a31c745b61758aaa5466a3a9fc0526d409188
>
> However, it seems it is only in for the next QEMU 5 release. I think we
> should open a Debian Bug for this (however, after quickly reading through
> the guide I am not a 100% sure I understood how to open a bug for QEMU in
> bullseye). Maybe the fix can be backported.

I'm also unsure about bullseye. By the time of the package freeze, QEMU 5 might
be included. I don't know if someone is willing to backport patches here.
However, Ubuntu focal is shipping with QEMU 4.2. Maybe it's worth opening
an issue there?

> Any ideas how to work around this issue in the QA suite for now? Swap
> failover / migration tests?

Does swapping the tests help? ATM I see no other options besides
inserting sleeps/delays.

> But from what I understand, this can also be an issue when
> someone triggers a live migration through ganeti while the VM is in a
> rebooting state internally. That again is something which might hit
> people in production.

I understand it in the same way, so this bug should be fixed.

Thanks, Sascha.

Rudolph Bott

Apr 11, 2020, 8:47:52 AM
to ganeti...@googlegroups.com
Hey everyone,

I have opened a bug at launchpad for the qemu-kvm package in Ubuntu - let's see what happens there:


However, we still need an intermediate solution to continue testing on Debian Bullseye and/or Ubuntu Focal Fossa. I took a look at all the "info *" queries which the QEMU monitor socket provides, but none of them seems able to indicate that the instance is still in an early boot phase. That means that, for the QA suite, the only options left for now are swapping the order of the tests or adding a short sleep between them. I will open a PR for that soon.
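
Since none of the monitor's "info *" queries exposes the guest's boot state, one purely illustrative option outside the QA suite would be to track instance start times ourselves and enforce a grace period before migrating. A hedged Python sketch (all names hypothetical, not Ganeti code):

```python
import time

class MigrationGracePeriod:
    """Hypothetical guard: remember when an instance was last (re)started
    and delay a live migration until a grace period has passed, since the
    QEMU monitor offers no query for the guest's early-boot state."""

    def __init__(self, grace_secs=2.0):
        self.grace = grace_secs
        self._started = {}   # instance name -> monotonic start timestamp

    def note_start(self, instance):
        self._started[instance] = time.monotonic()

    def wait_until_safe(self, instance):
        started = self._started.get(instance)
        if started is None:
            return           # never seen a start; nothing to wait for
        remaining = self.grace - (time.monotonic() - started)
        if remaining > 0:
            time.sleep(remaining)
```

This is of course only a band-aid with the same character as the 'sleep 2', not a real fix for the underlying QEMU bug.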

However, even with a fix for the QA suite, any live migration on Bullseye or Focal will currently crash the instance if a reboot is going on within the VM (which, depending on how many people use and administer your cluster and its instances, is a more or less likely thing to happen).

Looking forward to any suggestions here.

Cheers,
Rudi


Rudolph Bott

Apr 17, 2020, 6:17:42 AM
to ganeti...@googlegroups.com
Hey everyone,

Quick update: the bug has presumably been fixed in Ubuntu; I'll verify this later. I will also try to figure out whether Debian needs to be informed separately or whether there is some information flow between the Ubuntu and Debian package maintainers.


Cheers,
Rudi

Rudolph Bott

Apr 28, 2020, 3:26:45 PM
to ganeti...@googlegroups.com
Hey everyone,

I finally got around to opening the same bug in the Debian BTS:


Also, the QA suite now passes on Ubuntu Focal with the latest qemu packages. There are still some issues with the nightly-builds repository, but as soon as those are fixed, packages for Focal will be provided as well (and I will also configure a nightly build for Focal).

Cheers,
Rudi

Sedat Dilek

Apr 29, 2020, 2:51:52 AM
to Rudolph Bott, ganeti...@googlegroups.com
On Tue, Apr 28, 2020 at 9:26 PM Rudolph Bott <bo...@sipgate.de> wrote:

> I finally got around to open the same bug in the Debian BTS:
>
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=958926
>

Moin Rudolph,

thanks for opening a bug report in the Debian BTS.

For reference, the shorter form of the Debian BTS link: https://bugs.debian.org/958926

Regards,
- Sedat -