Hyper-V manual merge - problems booting


Poppy Lochridge

Jul 29, 2024, 9:59:03 AM
to NTSysAdmin

Hi all,

 

Over the weekend, we needed to manually merge (in PowerShell) some Hyper-V differencing (AVHDX) files that aren't showing up as checkpoints in the Hyper-V console. We've been testing after each merge, and selecting the newly merged AVHDX file for the VM gives us a message that it's unable to select that VHD because a merge is pending. There are about 17 more differencing files to go before we get back to the original VHD.
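
For reference, the per-file step we're running each time looks roughly like this (paths are placeholders, not our real file names):

# Confirm which parent this AVHDX points at before merging it
$child = 'D:\VMs\FileServer\FileServer_1A2B3C.avhdx'
Get-VHD -Path $child | Select-Object Path, ParentPath

# Merge the child one level up into its immediate parent
Merge-VHD -Path $child -DestinationPath (Get-VHD -Path $child).ParentPath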

 

We’ve tried rebooting the host several times, and it does not seem to change the situation. It’s Monday morning now, and our only plan is to continue the manual merge until we are down to a single VHD (as it should be) – but this is the primary file server, and our ETA on the merge process is a couple more days.

 

Any suggestions on getting past the “merge pending” message??

 

--P

 

-----

If you never know what your infrastructure is, you can never know if it’s been breached.

 

Poppy Lochridge (she/her)

NetCorps

1385-B Oak Street

Eugene, OR 97401

541-465-1127 x4

 

po...@netcorps.org

http://www.netcorps.org

 

 

Michael B. Smith

Jul 29, 2024, 10:01:50 AM
to ntsys...@googlegroups.com

I think you are resolving it properly, but what are you trying to do that you are being stopped from?

 

The key thing is, I think, to determine what was causing the differencing files to be created, so that this isn’t an issue moving forward. Is it due to a backup failure?


Brian Illner

Jul 29, 2024, 11:33:45 AM
to ntsys...@googlegroups.com

Did you just look in the Hyper-V Manager GUI for those checkpoints? Or did you also verify with Get-VMSnapshot via PoSH?

 

(Yes, MS still hasn’t gotten their naming convention standardized for this.)

 

I’ve had some that showed in PowerShell and not the GUI.
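
Something like this is what I'd run to compare the two views (VM name is just an example):

# Checkpoints as PowerShell sees them; these can exist even when the GUI shows none
Get-VMSnapshot -VMName 'FileServer01'

# Disks the VM is actually configured with; an .avhdx path here means Hyper-V
# still expects a merge
Get-VM -Name 'FileServer01' | Select-Object -ExpandProperty HardDrives |
    Select-Object ControllerType, ControllerNumber, ControllerLocation, Path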

 

Is this part of a cluster where you could live migrate that VM to another host and see if that allows the merges to start automatically?

 

BRIAN ILLNER

 

Senior Systems Administrator

864.250.9227 Office

864.679.2537 Fax

Canal Insurance Company

101 N. Main Street, Suite 400

Greenville, SC 29601


 



Poppy Lochridge

Jul 29, 2024, 1:21:17 PM
to ntsys...@googlegroups.com

It was a backup failure that caused the problem; we identified and resolved that back in 2022 – but when we started merging the differencing files, the time investment was so large that we did a few and decided to defer the rest. We deferred a little too long, and ended up in a drive-space crunch this month. Yeah, a mistake, and I won't go into the reasons why the mistake happened.

 

What we'd like to do is boot the guest VM from the most recently merged AVHDX so people can work during the week, then pick up the manual merge process during scheduled downtime. Right now, we're unable to start the guest VM because the linked VHD no longer exists, and we're unable to link a different VHD because it says a merge is pending.

 

--P

 


Michael B. Smith

Jul 29, 2024, 1:25:26 PM
to ntsys...@googlegroups.com

The root vhd/vhdx no longer exists? That’s bad news. I don’t think you have any option but to continue the merge process. And you may find, at the end, that a restore is required anyway.

Poppy Lochridge

Jul 29, 2024, 1:26:04 PM
to ntsys...@googlegroups.com

We looked in the Hyper-V Manager GUI for the checkpoints, and they do not/did not exist there.

We followed this process, which describes the original problem exactly: https://learn.microsoft.com/en-us/troubleshoot/windows-server/virtualization/merge-checkpoints-with-many-differencing-disks

 

It uses PowerShell and Get-VM, and it DID produce the series of differencing files and the commands for the manual merge.
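
In case it helps anyone else, walking the chain looks roughly like this (the starting path is a placeholder for the AVHDX the VM references):

# Walk the differencing chain from the disk the VM references down to the root VHDX
$disk = Get-VHD -Path 'D:\VMs\FileServer\FileServer_current.avhdx'
while ($disk.ParentPath) {
    "{0}  ->  {1}" -f $disk.Path, $disk.ParentPath
    $disk = Get-VHD -Path $disk.ParentPath
}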

 

We don’t have a cluster, sadly, but we do have a secondary server – I can check into the possibility of setting that up as a host and trying a migration to it.

 

--P

Poppy Lochridge

Jul 29, 2024, 1:44:24 PM
to ntsys...@googlegroups.com

Thankfully, the root VHDX does exist. We’re about 16 merges away from it, but it does exist.

 

“the linked VHD no longer exists” – the AVHDX file that Hyper-V says is linked to the VM has already been merged, as have the next ~6 files in the chain.

Benoit Segonnes

Jul 29, 2024, 2:15:08 PM
to ntsys...@googlegroups.com

I have the same problem with a guest cluster and .vhds files…

 

Have you tried to live migrate the VM? It should release the ghost lock on the file and let the merge proceed.
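
Something along these lines – the host name, VM name and path are examples only:

# Shared-nothing live migration to another Hyper-V host; this should release the
# ghost lock on the file and let the merge run
Move-VM -Name 'FileServer01' -DestinationHost 'HYPERV02' -IncludeStorage -DestinationStoragePath 'E:\VMs\FileServer01'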

 

Regards,

 

Benoit SEGONNES

Works Manager

 

Tel. 01.55.23.24.24

SUPii Mécavenir

12 bis, rue des pavillons

92800 PUTEAUX

 

www.mecavenir.com

 

 

 


Philip Elder

Aug 1, 2024, 4:59:21 PM
to ntsys...@googlegroups.com

Did you get this resolved?

 

One can trigger a merge for orphaned .AVHDX files by creating a snapshot/checkpoint in Hyper-V Management for the impacted VM.

 

That doesn’t always work though. PowerShell is the way to go if it does not clear them all up.
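
If the GUI checkpoint trick doesn't clear them, the rough PowerShell equivalent is (VM and checkpoint names are examples):

# Create a throwaway checkpoint, then delete it; deleting a checkpoint is what
# kicks off Hyper-V's own merge of outstanding .AVHDX files
Checkpoint-VM -Name 'FileServer01' -SnapshotName 'MergeKick'
Remove-VMSnapshot -VMName 'FileServer01' -Name 'MergeKick'

# Keep an eye on merge progress
Get-VM -Name 'FileServer01' | Select-Object Name, Status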

 

Process Explorer will give you the ability to find out where the locks are coming from. My guess is the backup software.
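
If you prefer a console, Sysinternals handle.exe (run elevated) will tell you the same thing; the file name below is just an example:

# List processes holding an open handle whose path contains the given string
handle.exe FileServer_current.avhdx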

 

This situation is common. In my experience it indicates a weak storage subsystem, as we tend to see it where IOPS are lower.

 

What happens:

  1. Backup software calls a VSS snapshot
  2. OS creates checkpoint/snapshot
  3. OS calls VSS
  4. VSS creates snapshot of current VM state
  5. Backup software writes VSS image to backup destination
  6. Backup software completes write
  7. Backup software closes VSS Call
  8. OS releases VSS
  9. VM snapshot/checkpoint deleted
  10. Snapshot/Checkpoint is merged
  11. Done

 

Step 8 is where things get messed up. Because of lagging I/O, the later steps get called while the clean-up from the earlier ones is still outstanding.

 

As a result, we get what's called a "ghost process" that was orphaned by the OS going, "Okay, VSS is released, delete that checkpoint/snapshot and yup, we're all good now!" Fortunately, the .AVHDX file remains, otherwise we'd be hooped.

 

As above, trigger a Checkpoint/Snapshot in Hyper-V Management to see if they get merged. In most cases they will.

 

Eric has a good article here:

https://www.altaro.com/hyper-v/clean-up-hyper-v-checkpoint/

 

This Veeam Forums post has good info, though you have to piece it together:

https://forums.veeam.com/microsoft-hyper-v-f25/9-5-hyper-v-known-issues-t38927-270.html

 

 

 

Philip Elder MCTS

Senior Technical Architect

Microsoft High Availability MVP

E-mail: Phili...@mpecsinc.ca

Phone: +1 (780) 458-2028

Web: www.mpecsinc.com

Blog: blog.mpecsinc.com

Twitter: Twitter.com/MPECSInc

Skype: MPECSInc.

 

Please note: Although we may sometimes respond to email, text and phone calls instantly at all hours of the day, our regular business hours are 8:00 AM - 5:00 PM, Monday thru Friday.

 


Poppy Lochridge

Aug 2, 2024, 12:09:23 PM
to ntsys...@googlegroups.com

We did! We ended up detaching the hard drive from the affected VM entirely, then attaching a new virtual SCSI drive pointing at the currently merged AVHDX. It booted, and folks were able to access their data.
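
For anyone who lands in the same spot, the shape of what we did (VM name, controller positions and paths are placeholders – check Get-VMHardDiskDrive for your own layout first):

# See how the disks are currently attached
Get-VMHardDiskDrive -VMName 'FileServer01'

# Detach the disk Hyper-V insists has a merge pending (IDE 0/0 here is just an example)
Remove-VMHardDiskDrive -VMName 'FileServer01' -ControllerType IDE -ControllerNumber 0 -ControllerLocation 0

# Attach the already-merged AVHDX as a new SCSI disk and start the VM
Add-VMHardDiskDrive -VMName 'FileServer01' -ControllerType SCSI -Path 'D:\VMs\FileServer\FileServer_merged.avhdx'
Start-VM -Name 'FileServer01'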

 

We still have about 9 AVHDX files to merge, but this gives us breathing room to schedule the downtime to do those.

 

Still to be determined: who will do the remaining merges and when, AND how to replace the backup system that caused the initial problem.

 

--P

 


Philip Elder

Aug 2, 2024, 3:30:46 PM
to ntsys...@googlegroups.com

Is there enough free space on the host somewhere to set up a new .VHDX file, attach it to the VM, use BeyondCompare (Run As Admin, set NTFS Permission Copy, Date Creation Copy) to create a copy, and finally just drop the bad one?
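
Rough sketch of the disk side of that – size, paths and VM name are examples, and the in-guest copy could be BeyondCompare as above, or robocopy if that's handier:

# Create a new dynamically expanding data disk and hand it to the VM
New-VHD -Path 'E:\VMs\FileServer\Data-New.vhdx' -SizeBytes 4TB -Dynamic
Add-VMHardDiskDrive -VMName 'FileServer01' -ControllerType SCSI -Path 'E:\VMs\FileServer\Data-New.vhdx'

# From inside the guest (elevated), a permission- and timestamp-preserving copy:
#   robocopy D:\ E:\ /MIR /COPYALL /DCOPY:DAT /R:1 /W:1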

 

Note that this will have an impact on backups depending on the backup in use. So, keep that in mind.

Philip Elder

Aug 2, 2024, 3:32:19 PM
to ntsys...@googlegroups.com

Oops … hit SEND too quick.

 

If that does work, then you can schedule some downtime, drop the bad OS .VHDX, create and attach a new one, and restore that OS partition. It should be relatively quick if it is just the OS and some app/server files and folders.

 


 

Poppy Lochridge

Aug 2, 2024, 6:34:37 PM
to ntsys...@googlegroups.com

Yes, probably. I think part of our correction-of-error process might end up being a discussion of whether virtualizing the file server was the right plan in the first place, and whether we should migrate the bulk of the stored files to a physical server instead. So it might be a moot point by the time we finish mapping out what we want to do for prevention.

Kurt Buff

Aug 2, 2024, 6:54:35 PM
to ntsys...@googlegroups.com
WRT putting a file server in a VM, I've not used Hyper-V in a lot of years, and never in a real server environment, but.....

In our VMware environment we put the VM on the ESXi/vSphere host, then attach disks over iSCSI from the SAN. Actually, all of the disks are on the SAN – that's where the OS disk lives too – but the disks holding user files are kept as separate disks on the SAN, and the VMware host loads all of that from the SAN. We have 6 ESXi hosts in two clusters, all running from a single Nimble SAN.

I would be fairly surprised if that weren't available under Hyper-V as well. Of course if you don't have a SAN, that's not gonna work, but I would turn your discussion to getting a SAN and keeping your virtualization.

Kurt



Philip Elder

Aug 2, 2024, 8:02:14 PM
to ntsys...@googlegroups.com

There are simple scripts out there that send an e-mail if there are orphaned .AVHDX files on the host(s)/nodes.
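
A minimal version of that kind of check – the storage path, addresses and SMTP server are placeholders, and note it only looks at the disks directly attached to each VM, not the whole parent chain:

# Flag .avhdx files that no VM currently references and mail a report
$vmRoot  = 'D:\VMs'
$inUse   = Get-VM | ForEach-Object { $_.HardDrives.Path }
$orphans = Get-ChildItem -Path $vmRoot -Recurse -Filter *.avhdx |
    Where-Object { $inUse -notcontains $_.FullName }

if ($orphans) {
    Send-MailMessage -From 'hyperv@example.com' -To 'admins@example.com' `
        -Subject "Orphaned AVHDX files on $env:COMPUTERNAME" `
        -Body ($orphans.FullName -join "`n") -SmtpServer 'smtp.example.com'
}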

 

Putting a file server on iron is not a good use of resources IMO.

 

One other point: Are the NTFS partitions _within_ the guest at least 25% free?

 

Also, are there any collisions between the in-guest Volume Shadow Copy (Previous Versions) schedule and the backup product's schedule? Make sure they are at least 5-10 minutes apart to avoid simultaneous VSS calls from inside the guest and from the backup software. This can corrupt data within the guest.

 

And one more: is there enough VSC (shadow copy) storage allocated? If the file server has a largish repository of 2 TB or more and there's a fair amount of churn, then a full shadow copy cache can cause issues with the snapshot process.
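
Quick ways to eyeball both of those from inside the guest (drive letter is an example):

# Percent free on each volume inside the guest
Get-Volume | Where-Object DriveLetter |
    Select-Object DriveLetter, @{n='FreePct'; e={[math]::Round(100 * $_.SizeRemaining / $_.Size, 1)}}

# How much shadow copy storage is allocated and used; bump the cap if it's full
vssadmin list shadowstorage
# vssadmin resize shadowstorage /For=D: /On=D: /MaxSize=10%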

Philip Elder

Aug 2, 2024, 8:04:43 PM
to ntsys...@googlegroups.com

A question or two about that setup:

1: Are the LUNs evenly divided between the two controllers?

2: Are snapshots enabled, and if so, do the destination snapshots get tested?

 

Just curious more than anything, as our primary cluster type is HCI (S2D/AzSHCI), along with some standalone Hyper-V + Storage Spaces setups.

 


 

Kurt Buff

Aug 3, 2024, 12:37:08 AM
to ntsys...@googlegroups.com
1. Yes.
2. Yes, and yes. We use Veeam, and test with the solution they provide. More on this below.

Sorry to turn this into a product intro, but I think it's relevant/interesting.

We're just now (sigh, it's been a very long road) starting to implement Stonefly backup units, one local, one to be remote after initial replication. Since I've been here (about 4.5 years), the backups have been stored on the same Nimble NAS that is used to support the clusters. That's pretty horrible, but getting management to understand the risks and assign funds has been a struggle.

I have no opinion on the Stonefly units yet. The interesting thing is that they tout the ability to spin up VMs on themselves for DR in case our cluster goes down, using ESXi in a virtual config. Interesting, but again I have no opinion, as they are untested.

The initial test was FUBARed because the sysadmins didn't realize the subnet had a DHCP scope assigned that had no reservations. Some of the hosts on the iSCSI subnet didn't respond to pings, so the Stonefly unit grabbed addresses that were already in use by other hosts on the subnet, and it took down both clusters. I guess you could say it was a clusterf*...

I'm not directly on the sysadmin team (I'm the IT security guy), but I was quite embarrassed for them, as they did it in the middle of the day, and we incurred about 1.5 hours of downtime.

It's the 5P principle (proper planning prevents p**s poor performance) or the "failing to plan is planning to fail" principle.

We did an AAR, and then another risk evaluation. They identified a few risks and mitigations, I identified a fair number they hadn't accounted for, and both sets were incorporated into the updated implementation plan for the next attempt.

One of the downsides I identified on the Stonefly units is that Stonefly was very reluctant to give us the implementation and management docs - and they really insisted on being the ones to do the setup/configuration. I understand that, but it tends to leave the customer very dependent on the vendor, which is not the position I (we) wish to be in. We did get the docs eventually, which makes me happier.

Kurt

Philip Elder

Aug 6, 2024, 2:13:03 AM
to ntsys...@googlegroups.com

Something sure sounds off there.

 

DHCP?

 

Reservations should be set up for all static IP addresses, or at the very least exclusions set for the blocks assigned to devices/servers with static IPs. Ouch – that was a "should not have happened" situation, plus a "we failed to document" one as well, I think.
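
Both are quick to set on a Windows DHCP server – the scope, range and MAC below are examples:

# Exclude a block reserved for statically addressed hosts from the scope
Add-DhcpServerv4ExclusionRange -ScopeId 10.10.10.0 -StartRange 10.10.10.1 -EndRange 10.10.10.50

# Or pin a specific host to its address by MAC
Add-DhcpServerv4Reservation -ScopeId 10.10.10.0 -IPAddress 10.10.10.21 -ClientId '00-15-5D-01-02-03' -Name 'iSCSI-Host01'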

 

Planning is one thing; having good documentation and an understanding of how things are set up is yet another. :0(

Kurt Buff

Aug 6, 2024, 9:28:46 AM
to ntsys...@googlegroups.com
Yes, it was poorly managed on our end.

If you're going to serve DHCP on a critical subnet, the scope must have reservations set for the existing hosts, and the existing hosts should be set with static addresses.

I have no problem either way - have DHCP with reservations/exclusions, or don't use DHCP - but the critical hosts must have statically assigned addresses, and those addresses must be documented.

The odd thing is that the DHCP server was set to test addresses with two pings before issuing the lease, but apparently either the firewall was configured to not allow ICMP, or the hosts simply didn't respond.

Either way it turned into a several hour nightmare in the middle of the day while the network guys worked on resolving it.

Kurt

Philip Elder

Aug 6, 2024, 2:10:26 PM
to ntsys...@googlegroups.com

We tried the DHCP high-availability thing for a couple of infrastructure domains. It should just work, right?

 

Nah, it didn't. So we stick with static IPs. They're easier to manage with PowerShell, too.
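
For example (interface alias and addresses are placeholders):

# Assign a static address and DNS servers
New-NetIPAddress -InterfaceAlias 'Ethernet' -IPAddress 10.10.10.21 -PrefixLength 24 -DefaultGateway 10.10.10.1
Set-DnsClientServerAddress -InterfaceAlias 'Ethernet' -ServerAddresses 10.10.10.10, 10.10.10.11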
