We have a client with a Windows Server 2016 standard installation that gives BSOD at random times. We're talking about only three times a month, but since it is a server with important roles (Hyper-V, AD, DNS, DHCP) we're investigating the issue. I am aware that you'd normally want to split some of the roles, but that is not within our power to change.
We ran BlueScreenView, which reports that the culprit is hal.dll together with ntoskrnl.exe for each of the blue screens that we still have a .dmp file for. Looking into the MEMORY.dmp file, I can see:
DPC_WATCHDOG_VIOLATION (133)
DPC_QUEUE_EXECUTION_TIMEOUT_EXCEEDED
DEFAULT_BUCKET_ID: WIN8_DRIVER_FAULT
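For anyone comparing several dumps, here is a minimal Python sketch that pulls the bug check name, its code and any `KEY: value` fields out of debugger text like the three lines above. The function name `parse_bugcheck` and both regular expressions are my own assumptions for illustration. Real `!analyze -v` output contains many more fields and its layout can differ, so treat this as a starting point only. The code in parentheses is read as hexadecimal, since 133 here is bug check 0x133.

```python
import re

# Sample input: only the three lines quoted in the post, nothing invented.
SAMPLE = """\
DPC_WATCHDOG_VIOLATION (133)
DPC_QUEUE_EXECUTION_TIMEOUT_EXCEEDED
DEFAULT_BUCKET_ID: WIN8_DRIVER_FAULT
"""

# Assumed line shapes: "NAME (hexcode)" and "KEY: value".
BUGCHECK_RE = re.compile(r"^([A-Z0-9_]+) \(([0-9a-fA-F]+)\)\s*$")
FIELD_RE = re.compile(r"^([A-Z0-9_]+):\s+(\S.*)$")


def parse_bugcheck(text):
    """Return the first bug check name/code and any KEY: value fields found."""
    result = {"name": None, "code": None, "fields": {}}
    for raw in text.splitlines():
        line = raw.strip()
        match = BUGCHECK_RE.match(line)
        if match and result["name"] is None:
            result["name"] = match.group(1)
            result["code"] = int(match.group(2), 16)  # debugger prints hex
            continue
        match = FIELD_RE.match(line)
        if match:
            result["fields"][match.group(1)] = match.group(2).strip()
    return result


summary = parse_bugcheck(SAMPLE)
```

Lines that match neither shape, such as the second sample line, are simply ignored.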
I'm unfortunately unable to dig much deeper into this, as I have never had to before. I ran some commands within the Windows debugger and all the information is uploaded in the attached text file. Is anyone experienced with troubleshooting such a problem?
We couldn't run sfc /scannow at first, so we repaired the image with DISM using a local source, and then ran sfc /scannow successfully. The repaired files pointed towards Windows Defender, and I do not think it's related. Good to know: Symantec is running on the server, and Defender real-time protection is off.
@Dave Patrick Host BSOD, unfortunately. And nope, all the roles are installed on the host itself. It is not something done by me or my colleagues, and it's not within our power to change. The host is powerful enough to virtualise the roles, but the client is reluctant to do so.
Which makes it all the harder to really guarantee a stable system, even if we end up finding what causes this. The host has not crashed yet since we last ran some basic repair commands, but we'll keep an eye on it.
I'd work to move the roles (other than Hyper-V) off the host by standing up the required guests. You can do this rather easily. I'd use the dcdiag / repadmin tools to verify health, correcting all errors found before starting. Then stand up the new guest, patch it fully, license it, join it to the existing domain, add Active Directory Domain Services, promote it (also making it a GC, which is recommended), transfer the FSMO roles over (optional), transfer the PDC emulator role (optional), and use the dcdiag / repadmin tools to verify health again. When all is good, you can decommission / demote the old one.
He already said he can't do that... and whilst I agree it is far from ideal having them all on the same box - it shouldn't be causing a BSOD, should it? So just moving the DC role to another machine is almost certainly not going to fix this anyway.
It seems much more likely this is from a hardware driver. @DLans would it be possible to ZIP the MEMORY.dmp file and upload it somewhere for us to take a look at? You've done a great job with that initial text file showing various outputs from the debugger, but there's a few more things I want to take a look at and it will be a lot quicker to explore the file rather than relaying commands and results back and forth between us on here.
This is a Windows XP error. Other Windows operating systems, like Windows 11 or Windows 10, might also experience this problem. See how to fix hal.dll errors in newer versions of Windows if you're not using Windows XP.
Since hal.dll errors appear before Windows is fully loaded, it's not possible to properly restart your computer. Instead, you'll need to force a restart. You can do that by pressing or holding down the physical power button until the computer shuts down; press it once to start it back up.
HDDs use magnetic platters that are divided into sectors, where the actual data is stored. The HDD firmware is clever enough to determine whether these sectors are working correctly. When an issue is found, whether a hard fault or a soft fault, the sector is marked as bad and is no longer used. The firmware maps the bad sector to another good location on the HDD platter, but it only has limited memory for handling these bad-sector mappings, and at some point it can't handle any more bad sectors...
In any case, when a sector that is used to store some file becomes bad, it means that the portion of data stored on that sector can no longer be trusted. It may be OK, it may be corrupted, but it can't be trusted. And if the file using this now bad sector was an important Windows executable, it could indeed be corrupted and the OS may crash or behave oddly...
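The remapping idea described above can be pictured with a small toy model in Python. This is only an illustrative sketch, not how any real drive firmware is implemented: the class `ToyDrive`, its method names and the sector counts are all made up for the example. It shows the one point that matters here, which is that the pool of spare locations is finite, so remapping eventually stops working.

```python
class ToyDrive:
    """Toy model of bad-sector remapping with a limited spare pool.

    Hypothetical illustration only; real firmware is far more involved.
    """

    def __init__(self, data_sectors, spare_sectors):
        self.data_sectors = data_sectors
        # Spare sectors are modelled as living past the user-visible range.
        self.free_spares = list(range(data_sectors, data_sectors + spare_sectors))
        self.remap = {}  # logical sector -> spare physical sector

    def physical(self, logical):
        """Return where a logical sector is actually stored."""
        return self.remap.get(logical, logical)

    def mark_bad(self, logical):
        """Remap a failing sector; return False once no spares are left."""
        if logical in self.remap:
            return True  # already remapped earlier
        if not self.free_spares:
            return False  # spare pool exhausted, sector stays bad
        self.remap[logical] = self.free_spares.pop(0)
        return True


# Two spares, three failing sectors: the third cannot be remapped.
drive = ToyDrive(data_sectors=100, spare_sectors=2)
results = [drive.mark_bad(sector) for sector in (7, 42, 63)]
```

Note that the model only moves the mapping. Whatever data was in the failing sector at the time is exactly the part that can no longer be trusted.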
Sometimes "chkdsk" will find and repair the damaged file (or identify it), so it could be worth a go. Also, the command "sfc /scannow" can be used to scan and fix protected system files. Doing these two things may give you a bit of a window to better handle the suspect HDD and its data.
If that doesn't work because the files are corrupt, a more intensive data recovery process may be needed, but cost and complexity go up, and your options become fewer the more the suspect HDD is used.