I tried doing a migration from the primary node to the secondary node
and it succeeded without any issue. However, when I tried to migrate
the instance back to the original primary node, the VM ends up locking
up hard. I don't see any error messages on the command line or in the
logs. I can see that the VM has moved back to the original primary
node, but the console is completely locked.
I have also tried using the "--non-live" option to see if it made any
difference, but I had the same problem. I also tried this with several
different guest OSes and hit the same problem each time.
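For reference, the migration commands I'm running look roughly like
this (the instance name is just a placeholder for one of ours):

```shell
# Live migration (the default), run from the ganeti master node:
gnt-instance migrate instance1.example.com

# Non-live migration (the instance is paused during the transfer):
gnt-instance migrate --non-live instance1.example.com
```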
Have you encountered this issue before? I thought it was odd that it
would work one way but not the other. If you'd prefer, I can open up an
issue instead of debugging this over the list.
A list of versions of relevant software:
ganeti-2.1.0-rc4
node kernel: 2.6.29 (Gentoo Hardened)
drbd-8.0.16
qemu-kvm-0.12.2
Thanks, and sorry for spamming your list so much lately ;-)
--
Lance Albertson
Yes, I'm facing such problems too, although I don't use ganeti. I use
2.6.31.6 and have tried various qemu-kvm versions, and with all of the
0.12 releases I have that problem.
Which hardware are you using? I'm using a Phenom II and an Athlon II,
whose CPU flags are identical. Deactivating C1E in the BIOS and
switching back to 0.11 helped somewhat, but it's not as rock-stable as
it was some months ago. Maybe it is necessary to deactivate all
power-saving features, as I also had some weird timing issues (half an
hour of drift).
-t
Interesting information, thanks.
Just curious, is this always the second migration (i.e. A→B→A and
A→B→C both broken) or is it special about coming back to the original
node?
iustin
I can confirm that using 0.11.1 "fixes" migrations, so there must be
some kind of regression in the 0.12.x series. We're using HP DL360s
with 2x dual-core 3 GHz Xeon processors. I'm confused about why
deactivating power saving on the physical server matters in this case.
After doing some digging, I noticed there's a similar thread on the
libvirt mailing list [1] discussing exactly the same problem. It
appears that this is indeed a problem with qemu-kvm.
[1] http://thread.gmane.org/gmane.comp.emulators.libvirt/20771
> Interesting information, thanks.
>
> Just curious, is this always the second migration (i.e. A→B→A and
> A→B→C both broken) or is it special about coming back to the
> original node?
What exactly would "C" be in a primary/secondary setup? With 0.12.x it
always failed going back to "A" for me.
Thanks for the feedback.
--
Lance Albertson
>>> Yes, I'm facing such problems too, although I don't use ganeti.
>>> I use 2.6.31.6 and have tried various qemu-kvm versions, and with
>>> all of the 0.12 releases I have that problem.
>>>
>>> Which hardware are you using? I'm using a Phenom II and an
>>> Athlon II, whose CPU flags are identical. Deactivating C1E in the
>>> BIOS and switching back to 0.11 helped somewhat, but it's not as
>>> rock-stable as it was some months ago. Maybe it is necessary to
>>> deactivate all power-saving features, as I also had some weird
>>> timing issues (half an hour of drift).
>
> I can confirm that using 0.11.1 "fixes" migrations, so there must
> be some kind of regression in the 0.12.x series. We're using HP
> DL360s with 2x dual-core 3 GHz Xeon processors. I'm confused about
> why deactivating power saving on the physical server matters in
> this case.
>
> After doing some digging, I noticed there's a similar thread on the
> libvirt mailing list [1] discussing exactly the same problem. It
> appears that this is indeed a problem with qemu-kvm.
>
> [1] http://thread.gmane.org/gmane.comp.emulators.libvirt/20771
I can also confirm the clock drift on the VM when migrating from A->B.
It was off by approximately 600 seconds into the future. I fixed the
clock with NTP, tried migrating it back B->A, and the VM remained
online with the time correct. A rather annoying bug, and I intend to
open one with the qemu-kvm project soon. I'll keep you updated.
So for now, either use 0.11.x or make sure you resync the clock on the
VM before migrating it over.
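Concretely, the pre-migration resync looks something like this (the
instance name and NTP server are placeholders; adjust for your setup):

```shell
# On the guest, step the clock before migrating (ntpdate is one
# option; a running ntpd will also correct the drift, just slowly):
ntpdate -u pool.ntp.org

# Then run the migration from the ganeti master node as usual:
gnt-instance migrate instance1.example.com
```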
Cheers-
--
Lance Albertson
Well, a third, unrelated node. I.e., does migrating from nodeA →
nodeB → nodeC also break things, or only A→B→A?
But since the problem isn't in ganeti itself, it's more of an academic
question; before knowing that it was KVM, I was wondering if we were
doing something wrong with the DRBD minors or some such.
iustin
> I can also confirm the clock drift on the VM when migrating from
> A->B. It was off by approximately 600 seconds into the future. I
> fixed the clock with NTP, tried migrating it back B->A, and the VM
> remained online with the time correct. A rather annoying bug, and I
> intend to open one with the qemu-kvm project soon. I'll keep you
> updated.
If you wish to follow progress on this bug, you can check it out here [1].
[1]
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2948108&group_id=180599
> So for now, either use 0.11.x or make sure you resync the clock on
> the VM before migrating it over.
I noticed this is only the case on CentOS, since it uses such an old
kernel that kvmclock doesn't even exist. So far, another workaround is
to either disable kvmclock or switch the system clock source on the
guest.
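In case it helps anyone, switching the clock source on the guest is
just a sysfs write (these are the standard Linux paths; acpi_pm is
only an example, pick whatever the available list offers):

```shell
# List the clock sources the guest kernel offers, and the active one:
cat /sys/devices/system/clocksource/clocksource0/available_clocksource
cat /sys/devices/system/clocksource/clocksource0/current_clocksource

# Switch away from kvm-clock at runtime (needs root on the guest):
# echo acpi_pm > /sys/devices/system/clocksource/clocksource0/current_clocksource

# Alternatively, boot the guest kernel with "no-kvmclock", or mask the
# feature from the host side when starting qemu-kvm:
#   qemu-kvm -cpu qemu64,-kvmclock ...
```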
Cheers-
--
Lance Albertson