Crashes caused by bad hardware are now being flagged


Gabriele Svelto

Sep 11, 2022, 4:50:35 AM
to stability, dev-platform
[cross-posting to dev-platform]

Hello all,
:willkg just deployed a change to crash-stats that explicitly flags
crashes known to be caused by faulty hardware and thus not actionable.
The crash signature for these crashes will start with "bad hardware"
making it abundantly clear that there's nothing we can do to address
them. You can see an example of these crashes with this query:

https://crash-stats.mozilla.org/search/?signature=~bad%20hardware
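
For anyone who wants to script the same lookup, here is a minimal sketch that builds the equivalent query URL. It assumes the public SuperSearch API endpoint at /api/SuperSearch/ accepts the same `signature` and `_facets` parameters as the search page linked above. The function name is hypothetical, and the "~" prefix is simply copied from that link.

```python
from urllib.parse import quote, urlencode

# Assumed endpoint: the thread links only the search UI, not the API.
API_BASE = "https://crash-stats.mozilla.org/api/SuperSearch/"


def bad_hardware_query(facet="reason"):
    """Build a query URL for crashes whose signature matches "bad hardware".

    Hypothetical helper for illustration; it only builds the URL and does
    not contact crash-stats.
    """
    params = {
        "signature": "~bad hardware",  # same "~" operator as in the UI link
        "_facets": facet,              # aggregate the matches by this field
    }
    # quote_via=quote encodes the space as %20, matching the link above.
    return API_BASE + "?" + urlencode(params, quote_via=quote)


print(bad_hardware_query())
```

Fetching that URL and reading the response is left out on purpose, since the response format is not described in this thread.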

Currently this is limited to crashes on Windows that are caused by
accessing faulty disks. We'll flag more crash types as we introduce new
ways of detecting them.

Currently these make up 0.28% of all Windows crashes in the past 24
hours, which is a stunningly large number IMHO.

This information is also available at crash time so we might evaluate
ways of informing the user that their machine is at fault. I'm open to
suggestions on this topic as they'd involve dedicated UX and front-end
work which is outside of my area of expertise.

Gabriele

Jeff Muizelaar

Sep 11, 2022, 9:26:26 PM
to Gabriele Svelto, stability, dev-platform
On Sun, Sep 11, 2022 at 4:50 AM Gabriele Svelto <gsv...@mozilla.com> wrote:
>
> [cross-posting to dev-platform]
>
> Hello all,
> :willkg just deployed a change to crash-stats that explicitly flags
> crashes known to be caused by faulty hardware and thus not actionable.
> The crash signature for these crashes will start with "bad hardware"
> making it abundantly clear that there's nothing we can do to address
> them. You can see an example of these crashes with this query:
>
> https://crash-stats.mozilla.org/search/?signature=~bad%20hardware
>
> Currently this is limited to crashes on Windows that are caused by
> accessing faulty disks. We'll flag more crash types as we introduce new
> ways of detecting them.

Is the current detection using reason: "EXCEPTION_IN_PAGE_ERROR_READ /
STATUS_DEVICE_DATA_ERROR" to detect bad hardware? Is it possible to
get this exception reason when using a network drive along with a
flaky network?

-Jeff

Gabriele Svelto

Sep 12, 2022, 4:44:36 AM
to Jeff Muizelaar, stability, dev-platform
On 12/09/22 03:26, Jeff Muizelaar wrote:
> Is the current detection using reason: "EXCEPTION_IN_PAGE_ERROR_READ /
> STATUS_DEVICE_DATA_ERROR" to detect bad hardware? Is it possible to
> get this exception reason when using a network drive along with a
> flaky network?

I don't think so, all the references I found about it indicate that it's
what the NT kernel throws when hitting bad blocks on a drive (see
below). Looking at the type of errors we get on Socorro I think that
other codes are used for network issues:

https://crash-stats.mozilla.org/search/?reason=~EXCEPTION_IN_PAGE_ERROR_READ&_facets=signature&_facets=reason#facet-reason

In particular I'd say that STATUS_UNEXPECTED_NETWORK_ERROR,
STATUS_CONNECTION_DISCONNECTED, STATUS_BAD_NETWORK_PATH,
STATUS_CONNECTION_ABORTED, STATUS_INVALID_CONNECTION and
STATUS_CONNECTION_RESET should be the ones indicating an issue with a
network drive.

Note that there are many other potential candidates in there to be
ignored as non-actionable, STATUS_IN_PAGE_ERROR being one of them (and
high-volume too). I found this doc:
https://winprotocoldoc.blob.core.windows.net/productionwindowsarchives/MS-ERREF/%5BMS-ERREF%5D-210407-diff.pdf
describing it as "The required data was not placed into memory because
of an I/O error status [...]". The same document describes
STATUS_DEVICE_DATA_ERROR as "There are bad blocks (sectors) on the hard
disk."
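
To make the grouping above concrete, here is a rough sketch of how a crash reason string could be bucketed by its NT status code. This is illustrative only and is not the rule crash-stats actually applies. The function name and the category labels are made up for the example; only the status codes come from this message.

```python
# Status codes named in this thread. The grouping is an illustration,
# not the rule set deployed on crash-stats.
DISK_HARDWARE_CODES = {"STATUS_DEVICE_DATA_ERROR"}
NETWORK_CODES = {
    "STATUS_UNEXPECTED_NETWORK_ERROR",
    "STATUS_CONNECTION_DISCONNECTED",
    "STATUS_BAD_NETWORK_PATH",
    "STATUS_CONNECTION_ABORTED",
    "STATUS_INVALID_CONNECTION",
    "STATUS_CONNECTION_RESET",
}


def classify_reason(reason):
    """Bucket a reason such as "EXCEPTION_IN_PAGE_ERROR_READ / STATUS_X".

    Hypothetical helper: returns "disk-hardware", "network",
    "other-in-page-error" or "unrelated".
    """
    parts = [p.strip() for p in reason.split("/")]
    if not parts[0].startswith("EXCEPTION_IN_PAGE_ERROR"):
        return "unrelated"
    # The NT status code, when present, follows the slash.
    status = parts[1] if len(parts) > 1 else ""
    if status in DISK_HARDWARE_CODES:
        return "disk-hardware"
    if status in NETWORK_CODES:
        return "network"
    return "other-in-page-error"


print(classify_reason("EXCEPTION_IN_PAGE_ERROR_READ / STATUS_DEVICE_DATA_ERROR"))
```

Codes that are in neither set, such as STATUS_IN_PAGE_ERROR, land in the catch-all bucket rather than being guessed at.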

Gabriele

Valentin Gosu

Sep 27, 2022, 4:51:21 PM
to Gabriele Svelto, Jeff Muizelaar, stability, dev-platform
On Mon, 12 Sept 2022 at 10:44, Gabriele Svelto <gsv...@mozilla.com> wrote:
> On 12/09/22 03:26, Jeff Muizelaar wrote:
> > Is the current detection using reason: "EXCEPTION_IN_PAGE_ERROR_READ /
> > STATUS_DEVICE_DATA_ERROR" to detect bad hardware? Is it possible to
> > get this exception reason when using a network drive along with a
> > flaky network?
>
> I don't think so, all the references I found about it indicate that it's
> what the NT kernel throws when hitting bad blocks on a drive (see
> below). Looking at the type of errors we get on Socorro I think that
> other codes are used for network issues:

We did actually encounter these exceptions a while back in JAR code.
We were loading an extension JAR by memory-mapping it, and when the
network shared drive got disconnected, we got
EXCEPTION_IN_PAGE_ERROR_READ / STATUS_DEVICE_DATA_ERROR or similar. See
https://bugzilla.mozilla.org/show_bug.cgi?id=1551562#c26
We fixed this on Windows by handling that exception specifically. I
think this should not happen unless we are using memory-mapped files
(which we usually aren't).
 

