Hardware Error Detection Capability in PMEM


Madhava Krishnan

Dec 19, 2021, 4:42:56 PM
to pm...@googlegroups.com
Hello All,
I have a question about the hardware error detection capabilities of PMEM. Chapter 17 of the PMDK book clearly explains how an uncorrectable error is detected and reported by the hardware and how the OS handles it. My understanding is that the hardware ECC detects all hardware errors and tries to correct some of them; errors it cannot correct are reported to the OS as "uncorrectable errors". But I wanted to know whether there is any case or situation in which the hardware ECC may fail to detect an error. Can somebody shed some light on this? Also, are there any differences in error detection capabilities between the different generations of PMEM hardware?



Best regards,
Madhav

Steve Scargall

Dec 20, 2021, 11:26:53 AM
to pmem
Hi

Firstly, thank you for reading the book :)

>> But I wanted to know if there is any case or situation where the hardware ECC may fail to detect an error? Can somebody shed some light on this?

Each PMem product will have different RAS (Reliability, Availability, Serviceability) features. The Intel(R) Optane(TM) Persistent Memory product implements ECC, which works as you would expect (using DRAM as the comparison). Expect a correctable error (CE) when the ECC engine can resolve the checksum and an uncorrectable error (UE) when it cannot. Unlike DRAM, Optane PMem has additional features, discussed at https://www.intel.com/content/www/us/en/developer/articles/technical/pmem-RAS.html. Not all errors reported by Optane PMem result in CE or UE errors.

>> Also, are there any differences in the error detection capabilities between the different generations of the PMEM hardware? 

All current generations of Optane PMem implement the same RAS features. There are no new or deprecated RAS features between the Optane 100 and 200 Series. As discussed in the book, some hardware RAS features require software to understand the errors and handle them within the app/software layer. The Persistent Memory Development Kit helps developers by abstracting the hardware intricacies behind a stable API.

/Steve


Madhava Krishnan

Dec 20, 2021, 12:12:20 PM
to Steve Scargall, pmem
Hello Steve,
Thank you for your answer! Thank you also to all the Intel engineers who put the PMDK book together; it is well written and a very useful resource.
I have a follow-up question about your answer, particularly the statement "Not all errors reported by Optane PMem result in CE or UE errors". Can you elaborate on what you mean by this? Can you also tell me what other errors Optane can report?
As I understand from the PMDK book and from your answer above, the RAS features and hardware ECC report hardware errors that are uncorrectable, and the PMEM application should then handle or fix these errors with its own techniques or by using PMDK APIs.


Best regards,
Madhav


Steve Scargall

Dec 21, 2021, 11:14:51 AM
to pmem
>> I have a follow-up question to your answer particularly on "Not all errors reported by Optane PMem result in CE or UE errors", can you elaborate a bit on what do you mean by this? Can you also tell me what are the other possible errors that optane reports?

Both the ipmctl and ndctl utilities can display health information for Optane PMem modules, regions, and namespaces. You'll also find health information for the PMem modules in the platform management interface (e.g., BMC, iLO, iDRAC).

To show health information for the modules, regions, and namespaces, use some or all of the following:
  • ipmctl show -memoryresources
  • ipmctl show -region
  • ipmctl show -a -region
  • ipmctl show -dimm
  • ipmctl show -a -dimm
  • ndctl list -DH // DIMM + Health
  • ndctl list -RH // Region + Health
  • ndctl list -NH // Namespace + Health
  • ndctl list -DRNH // All combined in one output
You'll see from the above that the following conditions or scenarios could occur (not an exhaustive list):
  • Fatal Media Error (Cannot Read or Write to the PMem module). PMem needs to be replaced. All data is lost on the DIMM and Region/Namespace(s) it belongs to.
  • High/Low Media or Controller Temperature condition
  • Used all consumable spare capacity
  • Package Sparing has occurred - This indicates one of the Optane chips on the PMem module has failed but the spare one has taken over
  • dirty_shutdown - This indicates the platform lost power and did not successfully complete the ADR sync of data from the memory controller to the PMem module(s). This is a potential data loss/corruption scenario.
  • Boot Status - Whether the PMem module initialized correctly during POST. If not, the Region will be marked as Faulty and the Namespaces will be unavailable
  • ARS Status - Whether an 'Address Range Scrub' has started or failed. There are BIOS options to determine if ARS is enabled (default) or disabled.
  • Poisoned Data - The app is responsible for rewriting the affected data (recovery). See the book or PMDK documentation.
  • PMem is Locked - If a User or Master Passphrase has been configured, the Locked status indicates the passphrase has not been entered yet to unlock the PMem module(s). Data is not accessible until the PMem is Unlocked
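
As a rough illustration of how some of the conditions above could be checked from a script, here is a minimal Python sketch. It assumes that `ndctl list -DH` prints JSON in which each DIMM has a `dev` name and a `health` object with fields such as `health_state`, `alarm_temperature`, `alarm_spares`, and `shutdown_state`. Those field names, the sample data, and the helper name `summarize_dimm_health` are assumptions for illustration and may differ between ndctl versions, so treat this as a starting point rather than a definitive tool.

```python
import json


def summarize_dimm_health(ndctl_json):
    """Return one entry per DIMM listing any health warnings found.

    Assumes the JSON layout described above (hypothetical field names).
    """
    data = json.loads(ndctl_json)
    # Accept either a bare array of DIMMs or an object with a "dimms" key.
    dimms = data.get("dimms", []) if isinstance(data, dict) else data

    report = []
    for dimm in dimms:
        health = dimm.get("health", {})
        warnings = []

        state = health.get("health_state", "unknown")
        if state != "ok":
            warnings.append("health_state=" + state)
        if health.get("alarm_temperature"):
            warnings.append("temperature alarm")
        if health.get("alarm_spares"):
            warnings.append("spare capacity alarm")
        if health.get("shutdown_state", "clean") != "clean":
            warnings.append("dirty shutdown")

        report.append({"dev": dimm.get("dev", "?"), "warnings": warnings})
    return report


# Made-up sample data in the assumed format; not output from a real system.
SAMPLE = """
[
  {"dev": "nmem0",
   "health": {"health_state": "ok", "alarm_temperature": false,
              "alarm_spares": false, "shutdown_state": "clean"}},
  {"dev": "nmem1",
   "health": {"health_state": "non-critical", "alarm_temperature": false,
              "alarm_spares": true, "shutdown_state": "dirty"}}
]
"""

if __name__ == "__main__":
    for entry in summarize_dimm_health(SAMPLE):
        status = ", ".join(entry["warnings"]) or "no warnings"
        print(entry["dev"] + ": " + status)
```

To try this against a live system, the sample string could be replaced with the output of `subprocess.run(["ndctl", "list", "-DH"], capture_output=True, text=True, check=True).stdout`, which typically needs root privileges. Conditions such as a locked module or a failed ARS are not covered by this sketch.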
/Steve
