Both the ipmctl and ndctl utilities can display health information for Optane PMem modules, regions, and namespaces. You'll also find health information for the PMem modules in the platform management interface (e.g., BMC, iLO, iDRAC).
To show health information for the modules, regions, and namespaces, use some or all of the following:
- ipmctl show -memoryresources
- ipmctl show -region
- ipmctl show -a -region
- ipmctl show -dimm
- ipmctl show -a -dimm
- ndctl list -DH // DIMM + Health
- ndctl list -RH // Region + Health
- ndctl list -NH // Namespace + Health
- ndctl list -DRNH // All combined in one output
The output of these commands can reveal conditions such as the following (not an exhaustive list):
- Fatal Media Error - The PMem module cannot be read or written and needs to be replaced. All data is lost on the DIMM and on the Region/Namespace(s) it belongs to.
- High/Low Media or Controller Temperature condition
- All consumable spare capacity has been used
- Package Sparing has occurred - This indicates that one of the Optane chips on the PMem module has failed, but the spare one has taken over
- dirty_shutdown - This indicates the platform lost power and did not successfully complete the ADR sync of data from the memory controller to the PMem module(s). This is a potential data loss/corruption scenario.
- Boot Status - Whether the PMem module initialized correctly during POST. If not, the Region will be marked as Faulty and the Namespaces will be unavailable.
- ARS Status - Whether 'Address Range Scrub' (ARS) started or failed. BIOS options determine whether ARS is enabled (the default) or disabled.
- Poisoned Data - The application is responsible for recovery, i.e. writing the data. See the book or PMDK documentation.
- PMem is Locked - If a User or Master Passphrase has been configured, the Locked status indicates the passphrase has not been entered yet to unlock the PMem module(s). Data is not accessible until the PMem is Unlocked
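
If you want to check a few of these conditions automatically, here is a minimal Python sketch that scans the JSON printed by `ndctl list -DH`. The function name `dimm_health_warnings` is hypothetical, and the field names it reads (`health_state`, `alarm_temperature`, `alarm_controller_temperature`, `alarm_spares`, `shutdown_state`) are my assumption of what ndctl emits, so compare them with the output on your own system before relying on it. It only covers the health state, temperature alarms, spare capacity, and dirty shutdown items from the list above.

```python
import json


def dimm_health_warnings(ndctl_json):
    """Return (dimm_name, message) pairs for DIMMs that look unhealthy.

    The field names used here are assumptions about `ndctl list -DH` output;
    any field that is missing is simply skipped.
    """
    data = json.loads(ndctl_json)
    # Tolerate several shapes: a list of DIMMs, an object with a "dimms" key
    # (as in combined listings), or a single DIMM object.
    if isinstance(data, dict):
        dimms = data.get("dimms", [data])
    else:
        dimms = data

    warnings = []
    for dimm in dimms:
        name = dimm.get("dev", "unknown")
        health = dimm.get("health", {})

        state = health.get("health_state")
        if state is not None and state != "ok":
            warnings.append((name, f"health_state is {state}"))

        if health.get("alarm_temperature") or health.get("alarm_controller_temperature"):
            warnings.append((name, "temperature alarm is set"))

        if health.get("alarm_spares"):
            warnings.append((name, "spare capacity alarm is set"))

        if health.get("shutdown_state") == "dirty":
            warnings.append((name, "last shutdown was dirty"))

    return warnings
```

To use it, capture the command output first, for example with `subprocess.run(["ndctl", "list", "-DH"], capture_output=True, text=True, check=True).stdout`, and pass that string to the function. An empty result only means none of the checked fields raised a flag; it does not cover boot status, ARS, poison, or lock state.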
/Steve