Windows Error Reporting crash interception has landed

Gabriele Svelto

Apr 7, 2021, 3:23:54 PM
to dev-platform, stability
[cross-posting to stability]

TL;DR

We now leverage the Windows Error Reporting service to capture crashes
that elude our regular crash reporting machinery. Nightly's crash rate
will go up in the coming days, but only because we now know about those
crashes, so don't worry: it's a good thing!

Longer version:

I discovered a way to leverage the API used to interact with the Windows
Error Reporting (WER) [1] service to intercept crashes that were eluding
our exception handler. This includes many forms of OOM crashes, issues
that badly corrupted the stack or heap, weird DLL injections and even
__fastfail() [2] crashes.

The implementation involves deploying a DLL with WER-specific hooks that
are called when Firefox crashes. We leverage the hooks to capture a
minidump of the crashed process, generate a minimal set of crash
annotations and launch the crash reporter client to notify the user.
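
The three steps in the paragraph above can be pictured roughly as follows. This is only a sketch of the control flow, not the code from the bugs listed below: the `CrashBackend` trait, the `RecordingBackend` test double, the function names and the annotation keys are all hypothetical, and the real module talks to Windows APIs where this one only records calls.

```rust
use std::collections::BTreeMap;
use std::path::{Path, PathBuf};

/// Hypothetical abstraction over the OS-facing steps; the names are
/// illustrative and do not come from the actual implementation.
pub trait CrashBackend {
    /// Write a minidump of the crashed process into `dir`, returning its path.
    fn write_minidump(&mut self, pid: u32, dir: &Path) -> Result<PathBuf, String>;
    /// Launch the crash reporter client so it can notify the user.
    fn launch_reporter(&mut self, minidump: &Path) -> Result<(), String>;
}

/// Annotations that can be produced from outside the crashed process.
/// The keys used here are assumptions made for this sketch.
pub fn minimal_annotations(pid: u32, exception_code: u32) -> BTreeMap<String, String> {
    let mut annotations = BTreeMap::new();
    annotations.insert("CrashedPid".to_string(), pid.to_string());
    annotations.insert(
        "ExceptionCode".to_string(),
        format!("0x{:08X}", exception_code),
    );
    annotations
}

/// Run the three steps in order: minidump, annotations, crash reporter client.
pub fn handle_crash<B: CrashBackend>(
    backend: &mut B,
    pid: u32,
    exception_code: u32,
    dir: &Path,
) -> Result<BTreeMap<String, String>, String> {
    let minidump = backend.write_minidump(pid, dir)?;
    let annotations = minimal_annotations(pid, exception_code);
    backend.launch_reporter(&minidump)?;
    Ok(annotations)
}

/// Test double that records calls instead of touching the system.
#[derive(Default)]
pub struct RecordingBackend {
    pub calls: Vec<String>,
}

impl CrashBackend for RecordingBackend {
    fn write_minidump(&mut self, pid: u32, dir: &Path) -> Result<PathBuf, String> {
        let path = dir.join(format!("{}.dmp", pid));
        self.calls.push(format!("minidump:{}", path.display()));
        Ok(path)
    }

    fn launch_reporter(&mut self, minidump: &Path) -> Result<(), String> {
        self.calls.push(format!("reporter:{}", minidump.display()));
        Ok(())
    }
}
```

Calling `handle_crash` with a `RecordingBackend` shows the ordering: the minidump is written first, and the reporter is launched only if that succeeded. Keeping the OS calls behind a trait is a choice made for this sketch so that the flow can be exercised without crashing a real process.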

This currently only supports parent process crashes. I'll add child
process interception in the coming weeks; progress can be tracked in bug
1682507 [3].

If you're curious you can find the various bits of the implementation
(in Rust) in bug 1682509 [4], bug 1682514 [5], bug 1682511 [6] and bug
1682516 [7]. Beware that it's doing hacky-wacky Windows things to
achieve its goal; don't expect elegant stuff :-P

Interestingly, it seems that we're the first to be able to intercept
__fastfail() crashes. Looking for info on the topic I found posts,
tweets and even bugs on Chromium's tracker that all said it
couldn't be done. My guess is that Microsoft changed how the system
works since the last time someone tried.

Gabriele

[1] https://docs.microsoft.com/en-us/windows/win32/api/_wer/
[2] https://docs.microsoft.com/en-us/cpp/intrinsics/fastfail
[3] Use the Windows Error Reporting API to generate minidumps for
crashes which cannot be caught using Breakpad
https://bugzilla.mozilla.org/show_bug.cgi?id=1682507
[4] Write a WER runtime exception module capable of writing out minidumps
https://bugzilla.mozilla.org/show_bug.cgi?id=1682509
[5] Register the WER runtime exception module at runtime
https://bugzilla.mozilla.org/show_bug.cgi?id=1682514
[6] Record the WER runtime exception module in the Windows registry
https://bugzilla.mozilla.org/show_bug.cgi?id=1682511
[7] Make the WER runtime exception module launch the crash reporter
client when the browser crashes
https://bugzilla.mozilla.org/show_bug.cgi?id=1682516

Chris Peterson

Apr 7, 2021, 5:18:01 PM
to Gabriele Svelto, dev-platform, stability
Impressive work!

Will these crash reports be submitted to Socorro like regular crash
reports, even though we will be using the Windows Error Reporting
service to capture the crashes? Will crash reports have an annotation to
indicate whether they were caught by Breakpad or WER? Or do consumers of
crash reports not need to care?

Do you expect this feature to ride the trains with Firefox 89? As
someone monitoring a Beta channel experiment now (Fission), I'm eager to
get more insight into stability. :)

Gabriele Svelto

Apr 7, 2021, 5:28:08 PM
to Chris Peterson, dev-platform, stability
On 07/04/21 23:17, Chris Peterson wrote:
> Will these crash reports be submitted to Socorro like regular crash
> reports, even though we will be using the Windows Error Reporting
> service to capture the crashes?

Yes, these crashes will follow the regular crash reporting flow and end
up in Socorro.

> Will crash reports have an annotation to
> indicate whether they were caught by Breakpad or WER? Or do consumers of
> crash reports not need to care?

These crashes don't have an annotation to tell them apart yet. I'll file
a bug to add one so that we can easily distinguish them from the
others. In the meantime you can still spot them because they have far
fewer annotations than a regular crash. I sent this one as a test:

https://crash-stats.mozilla.org/report/index/f3e9a779-34e4-48a9-a3d5-9980c0210407#tab-details

If you look in the "Crash Annotations" tab you'll see that only a
handful of fields are present.

> Do you expect this feature to ride the trains with Firefox 89? As
> someone monitoring a Beta channel experiment now (Fission), I'm eager to
> get more insight into stability. :)

I was planning for it to ride the trains but if we feel it works well we
could always uplift. The only limitation is that we can't catch child
process crashes with it just yet.

Gabriele

Daniel Veditz

Apr 7, 2021, 7:22:13 PM
to Gabriele Svelto, dev-platform, stability
On Wed, Apr 7, 2021 at 2:28 PM Gabriele Svelto <gsv...@mozilla.com> wrote:
> https://crash-stats.mozilla.org/report/index/f3e9a779-34e4-48a9-a3d5-9980c0210407#tab-details
>
> If you look in the "Crash Annotations" tab you'll see that only a
> handful of fields are present.

Awesome! It will be super to be able to account for these "missing" stability issues.

Would it be possible to capture the fast-fail code and add that to the annotations? Somewhere down the road, I mean, definitely after getting these captures working in all our process types. It would be nice to be able to distinguish, for example, FAST_FAIL_FATAL_APP_EXIT (we bailed) from FAST_FAIL_STACK_COOKIE_CHECK_FAILURE (OS slapped us down for stack corruption). Just the actual integer code is good enough if that makes it easier.


Gabriele Svelto

Apr 8, 2021, 4:18:53 AM
to Daniel Veditz, dev-platform, stability
On 08/04/21 01:21, Daniel Veditz wrote:
> Awesome! It will be super to be able to account for these "missing"
> stability issues.
>
> Would it be possible to capture the fast-fail code and add that to the
> annotations? Somewhere down the road, I mean, definitely after getting
> these captures working in all our process types. It would be nice to be
> able to distinguish, for example, FAST_FAIL_FATAL_APP_EXIT (we bailed)
> from FAST_FAIL_STACK_COOKIE_CHECK_FAILURE (OS slapped us down for stack
> corruption). Just the actual integer code is good enough if that makes
> it easier.

Absolutely! I already filed bug 1703248 [1] for that and I will
implement it shortly.

Gabriele

[1] Print the exception subcode of EXCEPTION_STACK_BUFFER_OVERRUN crashes
https://bugzilla.mozilla.org/show_bug.cgi?id=1703248
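
To make the request above concrete, here is a minimal sketch of how a fast-fail subcode could be turned into an annotation value. It assumes that a __fastfail() crash is reported with exception code STATUS_STACK_BUFFER_OVERRUN (0xC0000409) and that the subcode is the first exception parameter. The `ExceptionRecord` type is a simplified, hypothetical stand-in for the Windows structure, the function names are made up for the example, and the name table covers only a few codes from winnt.h.

```rust
/// Exception code raised by __fastfail() (STATUS_STACK_BUFFER_OVERRUN).
pub const STATUS_STACK_BUFFER_OVERRUN: u32 = 0xC000_0409;

/// Simplified stand-in for the EXCEPTION_RECORD fields this sketch needs.
/// The real structure has more fields; this type is hypothetical.
pub struct ExceptionRecord {
    pub code: u32,
    /// Mirrors ExceptionInformation; for fast-fail crashes the first
    /// entry is assumed to hold the code passed to __fastfail().
    pub information: Vec<u64>,
}

/// Return the fast-fail subcode, or None if this is not a fast-fail crash.
pub fn fast_fail_code(record: &ExceptionRecord) -> Option<u64> {
    if record.code != STATUS_STACK_BUFFER_OVERRUN {
        return None;
    }
    record.information.first().copied()
}

/// Map a few well-known subcodes to their names; deliberately incomplete.
pub fn fast_fail_name(code: u64) -> Option<&'static str> {
    match code {
        2 => Some("FAST_FAIL_STACK_COOKIE_CHECK_FAILURE"),
        3 => Some("FAST_FAIL_CORRUPT_LIST_ENTRY"),
        7 => Some("FAST_FAIL_FATAL_APP_EXIT"),
        10 => Some("FAST_FAIL_GUARD_ICALL_CHECK_FAILURE"),
        _ => None,
    }
}

/// Build the annotation value: the name plus the integer when the code is
/// known, otherwise just the integer.
pub fn fast_fail_annotation(record: &ExceptionRecord) -> Option<String> {
    let code = fast_fail_code(record)?;
    Some(match fast_fail_name(code) {
        Some(name) => format!("{} ({})", name, code),
        None => code.to_string(),
    })
}
```

Falling back to the bare integer for unknown codes matches the "just the actual integer code is good enough" suggestion and avoids having to keep the name table complete.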

Tom Ritter

Apr 8, 2021, 12:59:52 PM
to Gabriele Svelto, Daniel Veditz, dev-platform, stability
This is awesome. With content process coverage, this will close a big
gap in our ability to deploy security mitigations with confidence,
since many of the crashes that result from, e.g., unexpected behavior
clashing with a mitigation we're trying to deploy surface through
this mechanism.

-tom

Gabriele Svelto

Apr 11, 2021, 4:03:13 PM
to William Lachance, dev-platform, stability
On 08/04/2021 16:30, William Lachance wrote:
> This sounds really useful! Are we also sending crash pings for these?

I had to double-check the code to be sure about my answer: yes, they
will be sent, but only after Firefox restarts. Normally parent process
crashes send the ping via the crash reporter client, but that works only
because the required telemetry bits are available in the crash
annotations. With crashes captured by WER we don't have the required
data (like the telemetry URL), so we have to rely on Firefox picking up
the crash after restart and sending the crash ping.

Gabriele
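
The routing described in this reply can be summarized with a small sketch. It assumes the decision hinges only on whether a telemetry URL is present among the crash annotations; the `PingRoute` enum, the `ping_route` function and the `TelemetryServerURL` key are assumptions made for illustration, not taken from the crash reporter sources.

```rust
use std::collections::HashMap;

/// Who ends up sending the crash ping (hypothetical type for this sketch).
#[derive(Debug, PartialEq, Eq)]
pub enum PingRoute {
    /// The crash reporter client sends it right after the crash.
    CrashReporterClient,
    /// Firefox finds the crash on its next start and sends the ping then.
    AfterRestart,
}

/// Annotation key assumed to carry the telemetry URL.
pub const TELEMETRY_URL_KEY: &str = "TelemetryServerURL";

/// Pick the route based on whether the telemetry URL annotation is available.
pub fn ping_route(annotations: &HashMap<String, String>) -> PingRoute {
    match annotations.get(TELEMETRY_URL_KEY) {
        Some(url) if !url.is_empty() => PingRoute::CrashReporterClient,
        _ => PingRoute::AfterRestart,
    }
}
```

A crash captured through WER carries only a handful of annotations, so under this assumption it takes the `AfterRestart` route, while a regular parent process crash with the full annotation set goes through the crash reporter client.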