
Bug#1050940: linux: Enable Correctable Errors Collector RAS_CEC feature


Miguel Bernal Marin

Aug 31, 2023, 12:20:05 PM
Source: linux
Version: 6.5~rc7-1~exp1
Severity: wishlist
Tags: patch
X-Debbugs-Cc: miguel.be...@linux.intel.com, jair.g...@linux.intel.com

Dear Maintainer,

Please enable the Reliability, Availability and Serviceability (RAS)
Correctable Errors Collector (RAS_CEC) feature on arch amd64/x86_64,
on Debian Trixie.

RAS_CEC introduces a simple data structure for collecting correctable
errors, along with accessors.

This is a small cache which collects correctable memory errors per 4K
page PFN and counts their repeated occurrence. Once the counter for a
PFN overflows, we try to soft-offline that page as we take it to mean
that it has reached a relatively high error count and would probably
be best if we don't use it anymore.
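As a rough illustration of the counting behaviour described above, here
is a minimal Python sketch. It is a toy model, not the kernel
implementation: the class name, the dict-based storage, and the small
threshold value are assumptions made for the example, and the
soft-offline step is only simulated by recording the PFN.

```python
from dataclasses import dataclass, field

PAGE_SHIFT = 12  # 4K pages: PFN = physical address >> 12


@dataclass
class CorrectableErrorCache:
    """Toy model of a per-PFN correctable-error counter (not kernel code)."""

    action_threshold: int = 4  # hypothetical value, chosen for illustration
    counts: dict = field(default_factory=dict)
    offlined: list = field(default_factory=list)

    def add_error(self, phys_addr: int) -> bool:
        """Record one correctable error; return True if the page was offlined."""
        pfn = phys_addr >> PAGE_SHIFT
        count = self.counts.get(pfn, 0) + 1
        if count >= self.action_threshold:
            # Stand-in for the soft-offline request: forget the counter
            # and remember that this page should no longer be used.
            self.counts.pop(pfn, None)
            self.offlined.append(pfn)
            return True
        self.counts[pfn] = count
        return False
```

Errors at different offsets within the same 4K page map to the same PFN,
so they accumulate in a single counter until the threshold is reached.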

Error decoding now goes through the decoding chain:
mce_first_notifier() gets to see the error first, and the CEC decides
either to log it itself, in which case the rest of the chain doesn't
hear about it (basically the main reason for the CE collector), or to
continue running the notifiers.

When the CEC hits the action threshold, it will try to soft-offline the
page containing the ECC and then the whole decoding chain gets to see
the error.
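The chain behaviour described above can be sketched as follows. This is
a simplified Python model under stated assumptions: the constant names
mirror the idea of "stop" versus "continue" return values, while
run_chain(), make_collector(), make_decoder() and the event dictionary
are hypothetical helpers invented for the example.

```python
NOTIFY_DONE = 0  # keep walking the chain
NOTIFY_STOP = 1  # event consumed; later notifiers are skipped


def run_chain(notifiers, event):
    """Call notifiers in order until one of them consumes the event."""
    for notify in notifiers:
        if notify(event) == NOTIFY_STOP:
            return NOTIFY_STOP
    return NOTIFY_DONE


def make_collector(threshold):
    """First notifier: absorbs correctable errors until the threshold is hit."""
    counts = {}

    def collector(event):
        pfn = event["pfn"]
        counts[pfn] = counts.get(pfn, 0) + 1
        if counts[pfn] >= threshold:
            # Threshold reached: request soft-offline (simulated here)
            # and let the rest of the chain see the error.
            del counts[pfn]
            event["soft_offline_requested"] = True
            return NOTIFY_DONE
        return NOTIFY_STOP  # absorbed quietly by the collector

    return collector


def make_decoder(log):
    """Later notifier: stands in for the rest of the decoding chain."""

    def decoder(event):
        log.append(event["pfn"])
        return NOTIFY_DONE

    return decoder
```

With a threshold of 2, the first error on a page is absorbed by the
collector and the decoder never runs; the second one passes through to
the decoder.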

To disable the Correctable Errors Collector, a kernel parameter is used:
> ras=cec_disable
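
To check whether a running system was booted with that parameter, one
option is to look at the kernel command line. The sketch below is a
small Python helper written for illustration; the function names are
hypothetical, and it assumes the command line is available at
/proc/cmdline, as is usual on Linux.

```python
def cec_disabled(cmdline: str) -> bool:
    """Return True if the command line contains the exact token ras=cec_disable."""
    return "ras=cec_disable" in cmdline.split()


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the kernel command line of the running system."""
    with open(path, encoding="ascii", errors="replace") as f:
        return f.read().strip()
```

On a Linux machine, cec_disabled(read_cmdline()) reports whether the
parameter was passed at boot. It does not tell whether the kernel was
built with RAS_CEC in the first place.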

An MR with this proposal was created at:

https://salsa.debian.org/kernel-team/linux/-/merge_requests/827

Thanks,
Miguel Bernal Marin

Debian Bug Tracking System

Sep 3, 2023, 7:50:04 AM
Your message dated Sun, 03 Sep 2023 11:40:55 +0000
with message-id <E1qclTL-...@fasolo.debian.org>
and subject line Bug#1050940: fixed in linux 6.5.1-1~exp1
has caused the Debian Bug report #1050940,
regarding linux: Enable Correctable Errors Collector RAS_CEC feature
to be marked as done.

This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact ow...@bugs.debian.org
immediately.)


--
1050940: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1050940
Debian Bug Tracking System
Contact ow...@bugs.debian.org with problems