We could see something was wrong because the vCenter Server was showing as down in the VMware Horizon Admin Console. When checking on the vCenter server itself, the VMware VirtualCenter Server service was not started. When trying to start the service manually, it stayed stuck in a starting state for several minutes and then failed. You may also see event log errors with ID 1000.
It turned out that a Windows Update had enabled the built-in Windows BranchCache feature. By going to Non-Plug and Play Drivers in Device Manager and opening the HTTP entry, as seen above, we could see what was trying to use the HTTP protocol. (I was a little surprised, as our environment uses HTTPS, but it looks like vCenter still relies on port 80 when starting.) Checking here showed that BranchCache had taken over port 80, which was causing vCenter to fail to start. (Thank you, VMware Support, for showing me!)
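If you want to verify the port conflict yourself before calling support, the commands below are a rough sketch of what can be run on the vCenter Windows server. PeerDistSvc is the service name behind the BranchCache feature; confirm it in services.msc before disabling anything in your environment.

    rem show what is listening on port 80 (HTTP.sys listeners appear as PID 4 / System)
    netstat -ano | findstr ":80 "
    rem list the HTTP.sys URL reservations bound to port 80
    netsh http show urlacl | findstr ":80"
    rem stop BranchCache and keep it from grabbing the port again after a reboot
    net stop PeerDistSvc
    sc config PeerDistSvc start= disabled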
The first thing I checked was that the time was set correctly, since incorrect time can cause several issues. Since we were able to log in to the vCenter Server Appliance Management Interface (VAMI), I could check that the NTP source was set up and that there were no issues there. This can also be done from the vCenter appliance shell via SSH, using the command ntp.get to check the configured NTP servers and the command date to check the current time. I also checked whether the STS certificate had expired.
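For reference, the time checks from the appliance shell look something like this (assuming the default appliancesh prompt on the vCenter Server Appliance; the output format varies by version):

    Command> ntp.get
    Command> date

ntp.get should list the same NTP servers configured in VAMI, and date should match your NTP source; if it does not, fix time sync before chasing certificate errors.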
I then checked whether there was enough disk space with df -h. After confirming that the services were still getting stuck at 85% after a reboot of the vCenter, I checked the certificate manager log, which you can find at /var/log/vmware/vmcad/certificate-manager.log. In the log, I could see that there were inconsistencies in how the Fully Qualified Domain Name (FQDN) of the vCenter was written, and there were errors pointing to problems with the Subject Alternative Name (SAN).
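The disk and log checks are simple enough to repeat; a minimal sketch from the appliance's Bash shell (the grep pattern is just an example of how I narrowed down the SAN-related lines):

    # confirm no partition is full
    df -h
    # review the most recent certificate-manager output
    tail -n 100 /var/log/vmware/vmcad/certificate-manager.log
    # narrow down lines mentioning the SAN
    grep -iE "san|subjectaltname" /var/log/vmware/vmcad/certificate-manager.log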
After all of this, I was confident that I had found the problem, but the services were still stuck at 85% after resetting the certificates. I then tried resetting the Security Token Service (STS) certificate and replacing all the certificates through certificate-manager, but with no luck.
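For anyone repeating this, the tool lives at a fixed path on the appliance; a sketch of how it was invoked (the menu numbering differs between vCenter versions, so go by the menu text rather than a memorized option number):

    # launch the certificate manager menu as root
    /usr/lib/vmware-vmca/bin/certificate-manager
    # from the menu, I used the options to reset/replace all certificates,
    # then restarted the services and waited to see whether they passed 85%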
In conclusion, there was a lot to learn from this issue. Firstly, what might seem like red herrings may very well be underlying problems that also need to be solved. In the end, the final issue was fixed by VMware Support, but without all the troubleshooting steps performed before they were brought on board, the fix might not have been so straightforward.
In this article I will cover the steps and actions we took to ensure a seamless migration of our current SQL database server to a new one. This includes a number of recommendations, as well as the issues we faced both in testing and in the production environment.
Recommendation: have a DNS alias created for the current DB server before starting this procedure. When the new SQL DB is ready, simply modify the alias in DNS to point to the new SQL Server. This gives you some leverage: if something goes wrong with the new server, you can revert in a short amount of time (it is just a matter of pointing the alias back to the old server and restarting the services), as sketched below.
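As a sketch only (the server, zone, and host names below are made-up examples), the alias can be managed with dnscmd from any machine with the DNS tools installed:

    rem create the alias pointing at the current SQL Server
    dnscmd dc01 /RecordAdd corp.local sqldb CNAME oldsql.corp.local.
    rem at cutover, repoint the alias at the new SQL Server
    dnscmd dc01 /RecordDelete corp.local sqldb CNAME oldsql.corp.local. /f
    dnscmd dc01 /RecordAdd corp.local sqldb CNAME newsql.corp.local.

Reverting is the same delete/add pair pointing back at the old server, followed by a service restart on the application side.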
- Master password for the SSO (admin@System-Domain). This is the initial password; even if you have changed it in the meantime, you need this initial password to continue. If you do not have it or have forgotten it, then the only option is to completely reinstall the vCenter Server.
EDIT 2: The other issue we ran into was that all of our vCenter permissions were stripped from the server when we updated the certificates. We did not see this during testing. Fortunately, we were able to roll back the changes we had made using snapshots and back up the permissions before proceeding a second time. I would highly recommend backing up all vCenter permissions before starting this process if you decide to follow the steps below.
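For reference, one minimal way to capture the permissions is with PowerCLI, assuming you have it available; the server name and output paths below are examples:

    # connect and export every permission entry (entity, principal, role, propagation)
    Connect-VIServer vcenter01.corp.local
    Get-VIPermission | Select-Object Entity, Principal, Role, Propagate |
        Export-Csv C:\Backup\vcenter-permissions.csv -NoTypeInformation
    # custom roles are worth capturing too, since the permissions reference them by name
    Get-VIRole | Select-Object Name, @{N='Privileges';E={$_.PrivilegeList -join ';'}} |
        Export-Csv C:\Backup\vcenter-roles.csv -NoTypeInformation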
The vCenter servers with the older certificates have been upgraded several times and still use the older key length in their self-signed certificates. Because they have been upgraded across many past versions, the servers in question also have some extra certificates still in use, which caused other issues during the certificate replacement process.
It is here that I ran into the first gotcha surrounding the multiple certificates. The issue relates to KB 2045422, but our circumstances did not exactly mirror the symptoms from the KB. I checked, re-checked, and checked my syntax again and again; I would receive different results (9 9, 8 8, and 3 3), but nothing I did would make register-is.bat complete successfully. The script kept giving errors similar to:
Upon further investigation, I noticed that the following two entries in the vpxd.cfg file (C:\ProgramData\VMware\VMware VirtualCenter\vpxd.cfg) would change every time I registered the new certificates with the MOB. The original entries read:
We also ran into an issue later where the vpxd.cfg entries needed to be changed back to the sso.crt and sso.key entries before the service would restart. If you determine that you need to make this change as well, back up vpxd.cfg before modifying it. Back to the certificate changes:
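Backing the file up is a one-liner; a sketch using the same path as above:

    rem keep a copy of vpxd.cfg before editing the ssl entries
    copy "C:\ProgramData\VMware\VMware VirtualCenter\vpxd.cfg" "C:\ProgramData\VMware\VMware VirtualCenter\vpxd.cfg.bak"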
We have an issue on a few of our Windows 2016 servers that started sometime in the last few months and seems to be getting progressively worse. Resetting the system fixes it for a few days, but inevitably it slides back into the same useless state and requires another hard reset. So far the systems have bounced back, sometimes with nothing more than a chkdsk, but on our SQL servers recovery can sometimes take a few minutes.
These systems run well for a few days, but then we notice that we can no longer connect via RDP. If we try to log in on the VM console, it will usually hang on "Waiting for the User Profile Service", which never resolves, and the console stays stuck on that login until reset. The SQL or web services on the VM continue to run as if nothing is wrong for several hours, but eventually we notice that the IP address vCenter shows for the server disappears and the box is completely isolated. We have to hard reset it to restore service.
I have run SFC on all of these servers and no corruption is reported. I ran the DISM tools and they do report that the component store can be repaired, but looking in the DISM and CBS logs, there are no errors reported, only Info and Warning entries. We don't seem to have any problem installing Windows updates; we are patched up to the March roll-up. These servers can't reach the Microsoft Update servers, so I'm not sure how to clear these DISM issues. I have injected a fix from a KB CAB before, but if the logs don't identify a KB, then what?
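For anyone else in the same spot, the offline repair route I'm aware of looks roughly like this; the drive letter and WIM index are examples and must match installation media of the same build:

    rem integrity checks (already run here, shown for completeness)
    sfc /scannow
    DISM /Online /Cleanup-Image /ScanHealth
    rem repair the component store from local media instead of Windows Update
    DISM /Online /Cleanup-Image /RestoreHealth /Source:WIM:D:\sources\install.wim:1 /LimitAccess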
This behavior, where everything works OK for a few days and then services start to die off, sounds to me like a memory leak in some component, but I'm sure it could be other things. We recently installed Elastic Metricbeat to see if we can spot a process that might be running amok.
So I am looking for some tips on things to watch that might cause RDP or the User Profile Service to die, or a NIC to suddenly stop working. I assume the VMware Tools installed on these servers are getting killed or choked out by this supposed runaway process.
We got nowhere with this. It just stopped happening, so $500 was wasted on Microsoft Technical Services. I am going to assume this was some kind of conflict between our antivirus suite and the Microsoft Trusted Installer; that was a common pattern we saw in the log files when the crashes occurred. I guess a Windows update or a McAfee update resolved the issue at some unknown point. I just hope it doesn't come back.
I have the same problem. It started, I think, in late February or the beginning of March.
At first I would reset the VMs and they would last a few weeks; lately it's a few days, with luck.
They all stop responding, and if I try to log on with the console it hangs on the profile.
The only "error" I can see in the logs is this:
svchost (1068) SoftwareUsageMetrics-Svc: A request to write to the file "C:\Windows\system32\LogFiles\Sum\Svc.log"
I don't fully know if this is the actual problem or a byproduct of the hang...
I moved the VM to another server and the problem is the same.
I have Malwarebytes Anti-Ransomware on the servers; I think I'm going to disable it to see if that solves it.
It seems to be the same for us. We found out the IP is not actually disappearing; it's just the VMware Tools service being taken down that makes the IP disappear in vCenter. The VM still pings; it's just that all the services have stopped.
When deploying VMware VCSA in a lab environment, the installer often gets stuck at Stage 2 while starting services; in my case, at "starting authentication network 2%". This seems to be an issue with VCSA wanting to verify the SSO domain through DNS resolution.
The solution for me was to create an entry in the hosts file of the VCSA appliance via SSH to the CLI. Before doing so, it is best to start from scratch: delete the partially deployed VCSA appliance, then deploy VCSA through Stage 1 but do not start Stage 2.
Once Stage 1 is complete and the VM is booted, SSH can be enabled by accessing the console of the VCSA machine, hitting F2 and enabling SSH. Alternatively, one can use the Alt+F1 method to access the CLI directly through the console.
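A sketch of the hosts entry, assuming the appliance will be called vcsa.lab.local at 192.168.10.20 (substitute the FQDN and IP you entered during Stage 1):

    # drop to the Bash shell first if you land in appliancesh (type: shell)
    echo "192.168.10.20  vcsa.lab.local  vcsa" >> /etc/hosts
    # confirm the entry, then go back to the installer and run Stage 2
    cat /etc/hosts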
Ubuntu 16.04.3 LTS server running on a VPS. I've got another VPS on the same host machine running CentOS 6. The CentOS VPS is still chugging along; the Ubuntu VPS won't boot. The last change was adding a Virtualmin "virtual server" (really just a separate user with privileges to some daemons) and some fiddling with Postfix. Everything I've read online says to rip out my graphics drivers and reinstall them. Well, I don't have any graphics drivers because I don't have graphics: no X, no window managers, and certainly no Intel or NVIDIA graphics drivers for X.
The closest I've come to a sane-sounding solution so far is a semi-ancient forum post about Arch Linux, where the same problem was caused by a missing symlink from /var/run to /run. Well, I have that symlink, so that's probably not it. And other than the tremendously unhelpful message above, there is nothing to indicate what might be wrong.
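For completeness, this is how I checked the symlink from the rescue shell (expected output shown as a comment):

    # verify that /var/run is a symlink to /run
    ls -ld /var/run
    # expected: lrwxrwxrwx 1 root root 4 ... /var/run -> /run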