Feedback wanted: surfacing more information in crash reports


Gabriele Svelto

Feb 2, 2023, 8:14:41 AM
to stability, dev-platform
[cross-posting to dev-platform]

Hello everybody,
last year we replaced the tool used to extract stacks and additional
information from crashes with a new Rust-based implementation. One of
the goals behind the change was to go beyond what the legacy tool could
do and surface richer information about crashes.

As we've implemented a few different analyses, we were left wondering
what the best way to surface this information in crash reports would be.
Here are a few things we can detect now:

* We accessed a dead object (i.e. the crash is a UAF)
* We jumped into a dead object (same as above, but we were probably
going through the vtable or a function pointer)
* The crash was a NULL pointer access even though it might not look like
one (e.g. it was NULL plus a fixed offset, like when reading a field of
a structure)
* The crash was caused by bad hardware (corrupted data on disk, bad
memory with stuck bits, etc...)
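To give an idea of what these heuristics boil down to, here's a rough
sketch of the NULL-plus-offset and dead-object checks in Rust. The
page-size threshold and the poison byte below are illustrative
assumptions, not necessarily the exact values the analyzer uses:

```rust
/// Illustrative crash-address classification; not the real analyzer code.
#[derive(Debug, PartialEq)]
enum AddressKind {
    NullDeref,  // NULL plus a small fixed offset (e.g. reading a field)
    DeadObject, // address made of a "freed memory" poison byte (probable UAF)
    Unknown,
}

// Assumption: accesses below one page are treated as NULL dereferences.
const PAGE_SIZE: u64 = 0x1000;
// Assumption: poison byte used to fill freed allocations.
const POISON_BYTE: u8 = 0xe5;

fn classify_address(crash_address: u64) -> AddressKind {
    if crash_address < PAGE_SIZE {
        // Reading a field of a structure through a NULL pointer lands here.
        return AddressKind::NullDeref;
    }
    // If every byte of the faulting address is the poison byte, we very
    // likely loaded a pointer out of a dead (freed) object before crashing.
    if crash_address.to_le_bytes().iter().all(|&b| b == POISON_BYTE) {
        return AddressKind::DeadObject;
    }
    AddressKind::Unknown
}

fn main() {
    assert_eq!(classify_address(0x18), AddressKind::NullDeref);
    assert_eq!(classify_address(0xe5e5_e5e5_e5e5_e5e5), AddressKind::DeadObject);
}
```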

And here are a few more we'll be able to detect soon:

* The crash was caused by the data being misaligned when accessed (this
happens sometimes with SIMD/vector instructions)
* The crash is impossible - likely caused by a CPU bug (e.g. the crash
is a segmentation fault but the crashing instruction doesn't access memory)
* The crash is a stack overflow (these have a specific crash reason on
Windows, but not on macOS and Linux)
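For the stack overflow case on macOS and Linux, one possible heuristic is
to check whether the faulting address sits just below the crashing
thread's stack. A minimal sketch, with an arbitrarily chosen guard-region
size (this is not the actual implementation):

```rust
/// Illustrative stack-overflow check for platforms (macOS/Linux) where the
/// OS reports a plain segmentation fault instead of a dedicated reason.
/// The guard-region size is an arbitrary assumption for this sketch.
const GUARD_REGION: u64 = 64 * 1024;

/// `stack_limit` is the lowest address of the crashing thread's stack as
/// recorded in the minidump; `fault_address` is the address that faulted.
fn looks_like_stack_overflow(fault_address: u64, stack_limit: u64) -> bool {
    // A fault just below the committed stack (i.e. inside the guard region)
    // is a strong hint that the thread simply ran out of stack space.
    fault_address < stack_limit && stack_limit - fault_address <= GUARD_REGION
}

fn main() {
    let stack_limit = 0x7fff_f000_0000_u64;
    assert!(looks_like_stack_overflow(stack_limit - 0x800, stack_limit));
    assert!(!looks_like_stack_overflow(0x1000, stack_limit));
}
```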

So the question is: where would you like to see this data? In some cases
we already surface some of it in the crash signature; see this crash as
an example:

https://crash-stats.mozilla.org/report/index/2ccae112-2e4e-4904-ba18-aaba60230202

This isn't the best approach for all types of crashes though. For example,
accesses to dead objects can't be detected all the time, so putting that
information in the signature would bucket crashes in unwanted ways
depending on what was in the object.

Another possibility is to add a new field to the crashes that would then
be shown as a column when looking at a list of crash reports (something
like the crash reason, but richer since we know more about the crash).

What do you think? What would help you more when looking at crashes?

Gabriele

Chris H-C

Feb 2, 2023, 8:53:10 AM
to Gabriele Svelto, stability, dev-platform
For me, I interact with crashes mostly through Bugzilla. This means that the signature, to me, is the best place for any important characteristic of the crash. For the types of crashes that permit it, I'd like to see the trend of "put that in the sig" continue.

The next place I look for context is the Crash Data section in BMO where Crash Stop[1] obligingly fills in a lot of helpful info. If we can't put the data in the sig, maybe we can put it there?

Otherwise, field/annotation away. I'll find it eventually (or at the very least someone will point it out to me once I've missed it). I'm just happy we'll have this information as it'll help me out, wherever it is displayed.

:chutten




Tom Ritter

Feb 2, 2023, 11:01:26 AM
to Chris H-C, Gabriele Svelto, stability, dev-platform
If it's in the signature, is it possible to give Bugzilla a signature
like `[@ * | TrueTypeFontMetricsBuilder::GetBlackBox]` to get all the
reasons? And then optionally/additionally to populate this data into
crash stop like "30/30 are uaf, 2/32 are bad hardware"?

Gabriele Svelto

Feb 14, 2023, 4:43:19 AM
to Tom Ritter, Chris H-C, stability, dev-platform
On 02/02/23 17:01, Tom Ritter wrote:
> If it's in the signature, is it possible to give Bugzilla a signature
> like `[@ * | TrueTypeFontMetricsBuilder::GetBlackBox]` to get all the
> reasons? And then optionally/additionally to populate this data into
> crash stop like "30/30 are uaf, 2/32 are bad hardware"?

You can do that on Socorro already by searching for a signature; for
example, this query:

https://crash-stats.mozilla.org/search/?signature=~TrueTypeFontMetricsBuilder%3A%3AGetBlackBox

yields these results:

1 shutdownhang | (anonymous namespace)::TrueTypeFontMetricsBuilder::GetBlackBox   270   54.77%
2 bad hardware | TrueTypeFontMetricsBuilder::GetBlackBox   161   32.66%
3 TrueTypeFontMetricsBuilder::GetBlackBox   50   10.14%
4 shutdownhang | TrueTypeFontMetricsBuilder::GetBlackBox   12   2.43%

Note how we already flag some of those crashes as bad hardware.
Interestingly, if you click on the third signature, go to the
"Aggregations" page and aggregate on the reason field, you'll notice
that these are also non-actionable (or bad hardware):

EXCEPTION_IN_PAGE_ERROR_READ / STATUS_BAD_COMPRESSION_BUFFER 30 60.00%
EXCEPTION_IN_PAGE_ERROR_READ / STATUS_IO_DEVICE_ERROR 10 20.00%
EXCEPTION_IN_PAGE_ERROR_READ / STATUS_DEVICE_HARDWARE_ERROR 6 12.00%
EXCEPTION_IN_PAGE_ERROR_READ / STATUS_NO_SUCH_DEVICE 2 4.00%
EXCEPTION_IN_PAGE_ERROR_READ / STATUS_OBJECT_NAME_NOT_FOUND 1 2.00%
EXCEPTION_IN_PAGE_ERROR_READ / STATUS_UNEXPECTED_NETWORK_ERROR 1 2.00%

We probably need to flag those too.

Gabriele

Gabriele Svelto

Feb 14, 2023, 5:40:23 AM
to stability, dev-platform
Hello all,
based on the feedback I received, it seems like the place where most
people would like to see extra information is the crash signature. We
can certainly improve that by flagging more crashes; here's my proposal
for the things that should go in the signature:

* Extend the "bad hardware" signatures to flag more data. For example we
only flag Windows crashes where the exception contains the
`STATUS_DEVICE_DATA_ERROR`. Looking through our crashes it seems like
all crashes where the reason is a variant of `STATUS_IN_PAGE_ERROR` are
non-actionable: they cover network disconnections, corrupt data, disk
being full, etc... These are not necessarily instances of "bad hardware"
so we might want to split them up, but still flag them as clearly
non-actionable.

* Surface bit-flip detection. We're not yet 100% confident about our
bit-flip detection heuristic (though the false positives appear to be
few). We could put the bit-flip detection into a dedicated field which
can be searched and thus made visible in the aggregations tab. This
would make it easy to spot crash signatures with a high number of
potential bit-flips. It's worth noting that early testing indicates that
crashes caused by bit-flips represent a double-digit percentage of
reported crashes.

* Always replace the crash address with the adjusted address, including
for NULL pointers. This will make understanding crashes easier; we can
put the raw crash address in a separate field.

* Last but not least, I'd like to add a field providing a high-level
description of the crash, possibly obtained by cross-referencing the
address, the crash reason, the platform and other information. The idea
is to have a platform-independent place to store what kind of crash
we're dealing with (stack overflow, assertion, NULL-dereference, UAF,
misaligned access, etc...); see the sketch below.
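To make the last point more concrete, here is a rough sketch of what such
a platform-independent field could look like. The kind names and the
mapping below (including lumping the `STATUS_IN_PAGE_ERROR` variants into
a non-actionable I/O kind) are illustrative assumptions, not a finished
design:

```rust
/// Illustrative platform-independent crash classification
/// (names are placeholders, not final field values).
#[allow(dead_code)]
#[derive(Debug)]
enum CrashKind {
    StackOverflow,
    Assertion,
    NullDereference,
    UseAfterFree,
    MisalignedAccess,
    BadHardware,
    /// I/O failure delivered as a fatal in-page exception (network drop,
    /// corrupt data, full disk, ...): non-actionable, but not necessarily
    /// "bad hardware".
    NonActionableIo,
    Unknown,
}

/// Sketch of mapping a Windows crash reason string to a CrashKind.
/// The string matching below is an assumption for illustration only.
fn classify_windows_reason(reason: &str) -> CrashKind {
    match reason {
        r if r.contains("STATUS_STACK_OVERFLOW") => CrashKind::StackOverflow,
        r if r.contains("STATUS_DEVICE_DATA_ERROR") => CrashKind::BadHardware,
        // All the EXCEPTION_IN_PAGE_ERROR / STATUS_* variants seen so far
        // boil down to I/O trouble we cannot act on.
        r if r.contains("EXCEPTION_IN_PAGE_ERROR") => CrashKind::NonActionableIo,
        _ => CrashKind::Unknown,
    }
}

fn main() {
    let kind = classify_windows_reason(
        "EXCEPTION_IN_PAGE_ERROR_READ / STATUS_IO_DEVICE_ERROR",
    );
    println!("{:?}", kind); // NonActionableIo
}
```

The nice property of a single field like this is that tooling (Socorro
facets, Crash Stop, Bugzilla) could consume one value instead of parsing
platform-specific reasons.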

WDYT?

Gabriele

ISHIKAWA,chiaki

Feb 15, 2023, 10:48:54 PM
to dev-pl...@mozilla.org, ishikawa, chiaki
On 2023/02/14 18:43, Gabriele Svelto wrote:
> EXCEPTION_IN_PAGE_ERROR_READ / STATUS_UNEXPECTED_NETWORK_ERROR 1  2.00%

I know this is tangential to the original topic discussed.

I can certainly understand that a crash in the face of a malfunctioning
I/O device may be unavoidable.

But a network error, when Firefox (presumably) tries to load a page, should
not cause FF to crash IMHO.
I mean, we need to be resistant to malicious data (or the lack of it).
Correct?

But I digress. This is not directly related to the original discussion.
(I have been horrified over the years at the lack of much low-level I/O
error handling in C-C TB code, and networking errors are among them.)

Chiaki


Gabriele Svelto

Feb 16, 2023, 2:58:34 AM
to ISHIKAWA,chiaki, dev-pl...@mozilla.org
On 16/02/23 04:48, ISHIKAWA,chiaki wrote:
> On 2023/02/14 18:43, Gabriele Svelto wrote:
>> EXCEPTION_IN_PAGE_ERROR_READ / STATUS_UNEXPECTED_NETWORK_ERROR 1  2.00%
>
> I know this is tangential to the original topic discussed.
>
> I can certainly understand that a crash in the face of a malfunctioning
> I/O device may be unavoidable.
>
> But a network error, when Firefox (presumably) tries to load a page, should
> not cause FF to crash IMHO.
> I mean, we need to be resistant to malicious data (or the lack of it).
> Correct?

This is not a network error per se but rather an I/O error on a
network-mounted filesystem. In that case Windows will deliver us a fatal
exception because it cannot fill a page with data, and there's really not
much we can do. This usually means the connection dropped in the middle
of a transfer or the filesystem was unmounted from under us.

Gabriele


ISHIKAWA,chiaki

Feb 16, 2023, 6:14:46 PM
to Gabriele Svelto, dev-pl...@mozilla.org
Thank you for the comment.

I agree that there is not much we can do.
But still, for Thunderbird, I would like to see a graceful shutdown with
an easy-to-understand error message such as "the network file system did
not respond", etc.
Otherwise, the user is left with a bitter taste in the mouth, thinking
"Was my last e-mail sent successfully?", "Was the last downloaded e-mail
stored securely?", etc.

For idempotent operations, that is, operations that can be tried many
times and always return the same result, a crash is OK.
E.g. FF fetching a page that would return the same page no matter how
many times we try, or TB letting users look at the headers of already
downloaded messages: these are idempotent operations.
(I am ignoring the cache updates, the already-read flag setting, etc.)

For non-idempotent operations (and TB's mail handling such as
receiving/writing/sending e-mails is not idempotent),
a crash is too harsh on the user. That is why I try to make TB a bit
more resilient in the face of serious trouble with network file systems
and other I/O operations by handling such errors sensibly (and exiting
gracefully if not much can be done).
However, it is an uphill battle since low-level I/O error handling was
not considered/tested well in TB.

But such attention should be given to FF users as well.
I suspect that a FF user in the middle of an important transaction (such
as banking/payment), which is definitely NOT an idempotent operation,
would have a similar sentiment if FF crashed just because the underlying
network file system did not respond, etc.

BTW, there is a subtle difference between Windows and Linux regarding
network file system operations (and their errors).
Windows' I/O primitives try various network error recovery schemes, such
as retrying: they automatically handle short reads and try to read as
many octets as possible when the remote server returns fewer than the
requested number of octets on the initial call and there are still
octets remaining on the remote server.
So in that sense, if a Windows I/O system call fails for a network
operation, that is when we know a hard, unrecoverable error has occurred:
Windows has already tried a few error recovery methods.
OTOH, under Linux, the system call obviously does not do such extra
error recovery and everything is passed to user code, which needs to take
care of short reads and any other recovery measures.
Currently TB, and presumably FF too, does not handle such recovery
very well.
At least I have produced a patch for short-read issues for TB under
Linux and have tested it locally for several years.
I learned about the difference between Windows and Linux network I/O
error handling at the OS level because C-C TB under Linux could not talk
to a congested remote server which occasionally returned short responses,
whereas C-C TB under Windows did not show such behavior.
I investigated and realized that C-C TB under Linux needed a fix.
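The user-side recovery I am describing is essentially a read loop that
keeps going on short reads until the buffer is full, EOF is hit, or a
real error occurs. A generic sketch in Rust (not the actual TB/NSPR
code) to illustrate the pattern:

```rust
use std::io::{self, Read};

/// Illustrative short-read-tolerant read loop: keep reading until `buf` is
/// full, EOF is reached, or a real error occurs. This is the kind of
/// recovery Windows does for us at the OS level but that user code must do
/// itself on Linux. (Generic sketch, not the actual Thunderbird code.)
fn read_fully<R: Read>(reader: &mut R, buf: &mut [u8]) -> io::Result<usize> {
    let mut total = 0;
    while total < buf.len() {
        match reader.read(&mut buf[total..]) {
            Ok(0) => break,      // EOF: return what we got
            Ok(n) => total += n, // short read: keep going
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e), // hard error: give up
        }
    }
    Ok(total)
}

fn main() -> io::Result<()> {
    let data = b"response from a congested server";
    let mut cursor = io::Cursor::new(&data[..]);
    let mut buf = [0u8; 16];
    let n = read_fully(&mut cursor, &mut buf)?;
    println!("read {} octets", n);
    Ok(())
}
```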

While testing the code by mimicking a misbehaving remote file server
(by unplugging the network cable many times over a few weeks), I learned
there are still other I/O issues, such as failures of ftell and lseek,
which are not handled perfectly in my patch yet. (It *IS* rare, and I
suspect the ftell wrapper has a bug somewhere. The error is thrown back
as a signal which C-C TB did not catch, thus crashing.)
The issue of coping gracefully with a misbehaving remote network file
system is very hard to test without an instance of a malfunctioning
remote file server which can be controlled to "err" on demand.

Anyway, I know this topic is tangential to the original discussion.
At least being able to know the causes of a crash, including possible
hardware issues and network failures, is great.
So thank you for showing how to do it.

Chiaki

