Reboot failures on Hyper-V

Neil Mayhew

Apr 27, 2019, 2:05:13 PM
to CoreOS Dev
I have CoreOS running in a VM on Hyper-V, on the Stable channel. Sometimes when it reboots automatically to apply an update, it never comes back up, and when I check later the VM is switched off. When the VM is then started manually, it's back on the old version. This is happening with the latest update, from 2023.5.0 to 2079.3.0. It had similar problems updating to 1967.4.0 and 1911.3.0, but handled all the others OK.

Unfortunately, I don't have access to the host that's running Hyper-V. The host is owned by my client, so I have to go through its IT staff, and they can't seem to find out what's wrong or give me much information. I've asked for a screen capture of the reboot process, but so far they haven't been able to give me that.

My suspicion is that the new CoreOS version fails to boot and the VM reboots into the old version almost immediately, and that Hyper-V sees this as an instability and shuts the machine off. However, it doesn't always happen exactly like this. Sometimes I see multiple reboots during the locksmith reboot window, all of them into the old version, which then goes through the update and reboot process within a few minutes. Perhaps it depends on how long it takes for the new version to crash, and if it's longer than some threshold Hyper-V doesn't shut off the machine. I usually don't see any logs, even partial, for the new version, so I assume it's dying very early in the startup process, probably before the root is switched from the initrd.
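
Since the new version leaves no logs, one way to test this from inside the VM would be to look at the unlogged gaps between consecutive boots of the old version: a normal reboot should leave a short gap, while a failed attempt at the new version in between should leave a longer one. Here is a minimal Python sketch, assuming the start and end time of each boot have been copied by hand from the output of `journalctl --list-boots` and that the script is run on any machine with Python. The function name `boot_gaps` and the sample timestamps are made up for illustration and are not real data from this VM.

```python
from datetime import datetime

# Timestamp layout assumed for the hand-copied values below.
FMT = "%Y-%m-%d %H:%M:%S"


def boot_gaps(boots):
    """Return (previous_end, next_start, gap_seconds) for consecutive boots.

    `boots` is a list of (start, end) timestamp strings, oldest first.
    """
    parsed = [
        (datetime.strptime(start, FMT), datetime.strptime(end, FMT))
        for start, end in boots
    ]
    gaps = []
    # Pair each boot with the one that follows it.
    for (_, prev_end), (next_start, _) in zip(parsed, parsed[1:]):
        seconds = (next_start - prev_end).total_seconds()
        gaps.append((prev_end, next_start, seconds))
    return gaps


# Hypothetical sample data, NOT taken from the real VM.
sample_boots = [
    ("2019-04-27 02:00:00", "2019-04-27 02:10:00"),
    ("2019-04-27 02:10:20", "2019-04-27 02:15:00"),
    ("2019-04-27 02:17:30", "2019-04-27 02:30:00"),
]

for prev_end, next_start, gap in boot_gaps(sample_boots):
    print(f"{prev_end} -> {next_start}: {gap:.0f}s with no logs")
```

Comparing the gaps from a failed update window with those from an update that went through cleanly might show whether there is a threshold of the kind I'm guessing at.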

I realize I really need to get the client's staff to give me better information, but I don't have much control over that. Their view is that it's something to do with their Windows Server 2012 being set up for Cluster Aware Updating. They think that when the VM reboots, the host is migrating it to a different host in the cluster, and that if we could just migrate the VM once and for all to a non-clustered infrastructure, the problem would go away. Migrating the VM is going to be a bit of a hassle for me, because the networking will change, and last time it took a while for the client staff to get their NAT working properly. So I'd much prefer it if there were a way to fix the problem without doing that.