VMware: A Fault Has Occurred Causing a Virtual CPU to Enter the Shutdown State


Maricel Fergason

Aug 5, 2024, 2:25:05 AM
to taisucsueta
A fault has occurred causing a virtual CPU to enter the shutdown state. If this fault had occurred outside of a virtual machine, it would have caused the physical machine to restart. The shutdown state can be reached by incorrectly configuring the virtual machine, by a bug in the guest operating system, or by a problem in VMware Workstation.

I'm not sure why the mov gs,ax instruction would cause Windows to immediately triple fault, but it would soon enough cause it to crash. In the 64-bit Windows kernel, the GS segment is used as a pointer to access the current CPU's Processor Control Region (PCR). Each CPU has a different GS base value pointing to a different PCR. Your mov ax,gs / mov gs,ax sequence breaks this because it loads an incorrect value for the GS base into the descriptor cache.


The GDT doesn't actually contain the correct base for the GS register. Since the GDT can only hold 32-bit addresses, it's not actually used to load the GS base. Instead, the IA32_GS_BASE and IA32_KERNEL_GS_BASE MSRs, the latter in combination with the SWAPGS instruction, are used to set a 64-bit base address for the GS segment. The selector value stored in the GS register is just a dummy value.


So your mov gs,ax instruction loads the dummy 32-bit base value stored in the GDT, not the 64-bit value stored in IA32_GS_BASE. This means that the base address of the GS segment is set to 0 and not to the address of the PCR for the current CPU. After loading this incorrect GS base it's only a matter of time before the Windows kernel tries to use the GS register to access the PCR (with an instruction like mov rax, gs:[10]) and ends up reading what is probably unmapped memory causing an unexpected kernel page fault and crashing.
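To make the mechanism above concrete, here is a short, hedged x86-64 assembly sketch (an illustration only, not actual Windows kernel code):

```asm
; In 64-bit mode the CPU takes the GS base from IA32_GS_BASE (MSR 0xC0000101),
; not from the GDT descriptor -- but writing a selector to GS reloads the hidden
; base from the (32-bit) descriptor, discarding the 64-bit MSR value.

mov     ax, gs              ; read the (dummy) GS selector
mov     gs, ax              ; reload GS -> hidden base reset from the GDT, effectively 0

; The correct way to save and restore the 64-bit GS base is via the MSR:
mov     ecx, 0C0000101h     ; IA32_GS_BASE
rdmsr                       ; EDX:EAX = current GS base (the PCR address)
; ... code that needs to temporarily change GS ...
wrmsr                       ; restore EDX:EAX back into IA32_GS_BASE
```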


I recently bought a new PC with an AMD Ryzen 2600 CPU; before that I was running an Intel Core i7-860. Now I wonder if it is possible to reuse my existing OS X Yosemite 10.10 VMware image with the new PC.


When starting it up without any changes, I get a message telling me that "a fault has occurred causing a virtual CPU to enter the shutdown state". I did a quick Google search and found some parameters to add to the .vmx file:
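For reference, the CPUID-masking entries that circulate in the community for running macOS guests on AMD hosts look like the following. This is my assumption of what was found, not necessarily the exact parameters from the post; the binary values spoof Intel's CPUID leaves 0 and 1, so treat this as a sketch and back up the .vmx first:

```ini
smc.version = "0"
cpuid.0.eax = "0000:0000:0000:0000:0000:0000:0000:1011"
cpuid.0.ebx = "0111:0101:0110:1110:0110:0101:0100:0111"
cpuid.0.ecx = "0110:1100:0110:0101:0111:0100:0110:1110"
cpuid.0.edx = "0100:1001:0110:0101:0110:1110:0110:1001"
cpuid.1.eax = "0000:0000:0000:0001:0000:0110:0111:0001"
```

The leaf-0 ebx/edx/ecx values encode the "GenuineIntel" vendor string, and leaf-1 eax reports an Intel family/model/stepping to the guest.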


So probably I have to reinstall OS X with the proper AMD kernel extensions, but I wonder if anybody has faced the same switch from an Intel to an AMD CPU and could tell me whether I can do it without a reinstall? Is it somehow possible to install the kernel extensions in an existing VMware image that cannot boot?


Can you attach a copy of your VMX file to a post for the VM, please? No guarantees, but I have found some old CPUID patching details that may work. Make sure you have a complete backup of this VM before we start.


Sorry I missed following up on this. I will take a look and see if there is anything I can find, but my first guess is that an Intel instruction is being used that AMD does not implement. I need to check the CPUID dumps for both processors.


So these are my next thoughts on this. I had to do a bit of checking because some things have changed in Workstation 14 and 15. I'm not sure there is anything else we can do if this doesn't work, apart from trying to find a way to add an AMD kernel to the image.


The only other thing I can think of is to get a Linux VM, mount the image's virtual disk, and see if an AMD kernel could be copied over. Whatever else we try is going to be convoluted, with no guarantee of success. I will have one more think about it. It's a shame the masking didn't work. I really only wanted to get it booted well enough to get an AMD-specific kernel installed.
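The Linux-VM approach could be sketched with libguestfs, assuming a helper VM with the libguestfs tools installed and a guest filesystem that libguestfs can read (HFS+ support varies by build, so this may not work for a macOS disk). All paths, partition numbers, and file names below are placeholders:

```shell
# Inspect what libguestfs can see inside the virtual disk:
virt-filesystems -a osx-yosemite.vmdk --all

# Mount the guest's system partition read-write on the helper VM:
mkdir -p /mnt/guest
guestmount -a osx-yosemite.vmdk -m /dev/sda2 /mnt/guest

# Copy a replacement kernel in (destination path is illustrative only):
cp amd-kernel /mnt/guest/System/Library/Kernels/kernel
guestunmount /mnt/guest
```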


The snapshot functionality provides a massive advantage when performing any type of upgrade of the VM's operating system (OS), for example. If the upgrade ends up rendering the VM useless, there is an easy way to restore the VM to its state from before the upgrade. Many backup solutions, such as Veeam or Commvault, use the same functionality: they create a snapshot and then download and store that snapshot as a backup.


However, there is also a downside to snapshots: every snapshot kept on a VM can cost performance. Each snapshot has its own delta file, and the ESXi node needs to calculate the differences between those files, which becomes more expensive with every additional snapshot. The best practice is not to keep snapshots longer than absolutely needed and to remove them via the snapshot manager.
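That cleanup can be scripted; here is a hedged PowerCLI sketch (the cmdlets are standard, but the hostname and the 7-day threshold are my own placeholders):

```powershell
# Sketch: list and remove snapshots older than 7 days (threshold is arbitrary).
Connect-VIServer -Server vcenter.example.local   # placeholder hostname

$old = Get-VM | Get-Snapshot | Where-Object { $_.Created -lt (Get-Date).AddDays(-7) }
$old | Select-Object VM, Name, Created, SizeGB | Format-Table

# Remove-Snapshot consolidates the delta files back into the base disk:
$old | Remove-Snapshot -Confirm:$false
```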


As about 100 VMs needed to be migrated, this procedure would have to be repeated around 100 times. After a couple of VMs, I decided to be lazier and created a script to do the job for me. I found out that there is a PowerCLI option to connect to multiple vCenters at the same time and changed my script accordingly. I ran the script and everything went as expected, except that there was still a snapshot left on one VM. When I tried to remove it manually through vCenter, it threw a very useful error: A general system error occurred: Fault cause: vim.fault.GenericVmConfigFault
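For reference, the multi-vCenter connection mentioned above looks like this in PowerCLI (a sketch; the server names are placeholders):

```powershell
# Allow PowerCLI to hold sessions to several vCenters at once:
Set-PowerCLIConfiguration -DefaultVIServerMode Multiple -Scope Session -Confirm:$false

# Connect to both the source and the destination vCenter:
Connect-VIServer -Server source-vc.example.local, dest-vc.example.local

# Subsequent cmdlets now see inventory from both, e.g. removing leftover snapshots:
Get-VM | Get-Snapshot | Remove-Snapshot -Confirm:$false
```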


So the first thing I needed to do was make sure I got a maintenance window approved by the customer, because the VM would need to be powered off in the process. These are the steps that I ended up taking:


In the end, it was a lot of work to remove all the stuck snapshots. It showed me that automation can make dull tasks quicker, but when it goes wrong it can create extra work; this situation is a clear example of that. All in all, I was happy to be able to migrate the VMs from the old platform to the new one, and that I managed to remove the snapshots in the end.


We recently had an outage and I powered off all VMs manually, including the vCLS VMs, instead of migrating the vCLS VMs to the last host to be powered off. After the hosts powered back on, manually rebooting the vCLS VMs did not re-enable DRS services, as shown by this error:


I read documentation somewhere suggesting manual deletion of the vCLS VMs to let them respawn on their own. I tried one to see the effect, but this resulted in a vCLS VM that is deleted yet now displays as "orphaned" while still being present in the inventory list. I thought better of going through with the other two vCLS VMs. I am getting the following error, with a "Failed to relocate vCLS" event message, for the orphaned vCLS VM that was deleted manually from the host.


I am unable to perform any meaningful actions on the orphaned VM. I have tried disabling vCLS via Retreat Mode to reboot all vCLS VMs, to no avail. It is essentially a ghost artifact displayed in my vCenter web client, as I am unable to locate it in the datastores on the hosts themselves. The remaining two vCLS VMs are live, but DRS is still unhealthy. I would really appreciate any insight on remediation the community here can provide. Thanks.


vSphere Cluster Services (vCLS) relies on dedicated virtual machines (vCLS VMs) to function. vCenter Server automatically manages the power state and resources allocated to vCLS VMs; manually manipulating vCLS VMs is not a recommended practice.


Unfortunately, the manual deletion seems to have caused this "Failed to relocate vCLS" issue. You need to remove the "ghost" vCLS VM from the inventory.

Focus on cleaning up the orphaned VM; don't touch the remaining healthy vCLS VMs, and ensure they keep operating normally. Don't attempt to relocate or power on the orphaned vCLS VM, as it is in a broken state and won't function properly. And since, as you mentioned, the two remaining vCLS VMs are live, it is important to check their health as well. Once this is done, the error message about power state and resource management may disappear and DRS should function normally.
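Removing the orphaned entry from the inventory can be done with PowerCLI; a hedged sketch (the VM name pattern is a placeholder for your actual orphaned vCLS VM):

```powershell
# Remove the orphaned vCLS entry from the vCenter inventory only.
# Without -DeletePermanently, Remove-VM unregisters the VM and leaves any
# files on the datastore untouched (here none remain anyway, since the VM
# was already deleted from the host).
Get-VM -Name "vCLS-1234*" | Remove-VM -Confirm:$false
```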


There is no way to disable vCLS on a vSphere cluster and still have vSphere DRS being functional on that cluster.

Disabling vCLS on a cluster can be done via Retreat Mode, but this will impact some cluster services, such as DRS, for that cluster: the VMs running inside your cluster are no longer load-balanced and will not be migrated to a different host if the host running a particular VM runs out of resources.
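For reference, Retreat Mode is toggled through a vCenter advanced setting keyed to the cluster's domain ID (the domain-c1234 value below is a placeholder; your cluster's actual ID appears in the vSphere Client URL when the cluster is selected):

```ini
config.vcls.clusters.domain-c1234.enabled = false
```

Setting it back to true takes the cluster out of Retreat Mode and lets vCenter recreate the vCLS VMs.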


Unfortunately, this whole ordeal started because I manually shut down the vCLS VMs during the outage, which I realized in retrospect was erroneous on my part. I had to power them back on manually, but that did not get rid of the vSphere Cluster Service VM message, which is why I tried to remove one of them manually in the hope that it would trigger a respawn, to no avail. I suspect the other manually powered-on vCLS VMs I haven't touched are not healthy either: they show as powered on, but under Monitor, in the Tasks and Events menu, only a power-off task displays, not the power-on I performed. However, I do see a "bandwidth usage is normal" information update under Events. Could the below be the source of the issue? Have you ever seen this situation, and/or any ideas on the best way to resolve it?


I have reviewed the document that you provided. The workaround/cause looks similar to your scenario. As mentioned earlier, the workaround would be to clean the orphaned vCLS VMs from the inventory.

In my view, for the other manually powered-on vCLS VMs that are up but that you assume aren't healthy, it is worth checking whether there are enough free resources in the cluster, as a shortage may cause this issue. But I don't think we can judge their health and availability from the details provided above.


I definitely understand what you are saying, that I should clean up the orphaned VM. But unfortunately, I am at a loss as to how to do that, since it was deleted from the host but still shows up in the vCenter web client with "(orphaned)" in parentheses. That's what makes this odd. I have also tried Retreat Mode, but that did not restart the vCLS VMs. How can I clear the orphaned vCLS VM from the portal? Could a VCSA restart do the trick? Is there something else I can pursue?


VMware will spawn a new vCLS VM automatically when one gets deleted. Likewise, when vCLS VMs are shut down, vCenter will auto power them on again. Shutting down the vCLS VMs is fine if you have to put a host into Maintenance Mode, for example.
