Intent to Ship: New Socorro Backend (crash-stats changes)

43 views

Skip to first unread message

Alexis Beingessner

unread,

Dec 9, 2021, 12:07:17 PM12/9/21

to dev-platform, crash-reporting-wg

Intent to Ship: New Socorro Backend (crash-stats changes)

We are planning to enable the new rust-minidump backend for socorro (crash-stats.mozilla.org) next week (Monday?). This implementation was deployed to mozcrash.py (the `try` crash analyzer) a couple weeks ago (Bug 1741205). It has also been running on crash-stats staging for the past few weeks, and seems to be working well.

It's very likely that some crash signatures will change. Hopefully not many, but it's inevitable. If you think a signature (or anything else) is notably worse now, please needinfo :Gankra on bugzilla or ping @gankra:mozilla.org on element (she wrote and maintains the bulk of the new implementation).

Detailed user documentation for the new implementation can be found here (including a section on using it with Firefox crashes). It's much easier to run locally!

Although we've put a fair amount of effort into making this a "behind the scenes'' change you don't need to care about, there are definitely some differences. At this point we're fairly confident that all of these changes (except one, see exploitability) are either benign differences of opinion or net improvements.

You can identify which implementation processed a crash report by checking the new "Debug" tab in crash-stats (needs protected data access). The new implementation will have a "Stackwalker version" like:

minidump-stackwalk 0.9.5 (2021-12-08 c4617dc2)

while the old implementation will just say

stackwalker unknown

The rest of this message is a complete listing and discussion of these changes (to the best of our knowledge). This will be primarily focused on the raw JSON output, but those values often get verbatim reflected in the crash-stats UI. If you are unfamiliar/uncertain about the meaning of a JSON value discussed here, the new implementation also comes with a complete schema that describes the semantics of each field!

All of these differences are "WONTFIX" -- we're happy with the new behaviours! That said, they can be relitigated if anyone has a compelling argument for why an old behaviour is better.

To help you read this document, we have marked each difference with an indicator:

❌ is a regression

🚧 is a benign difference

✅ is an improvement

❌ exploitability

Let's get the one regression out of the way first. The "exploitability" field is no longer populated. Although a minimal implementation was written, we decided against landing it because of the feedback we got when we asked how people use the current exploitability implementation. Pretty unanimously our peers came back with: "it sucks and we ignore it" (If you disagree, please let us know!).

Exploitability is a feature where breakpad would combine all of the information in the minidump to try to guess how likely it is that this crash is something exploitable. For instance, if the crash was an OOM, that's probably fine; if the crash was a null pointer deref, that's a little worrying; and if you segfaulted the instruction pointer, that's a huge concern.

Unfortunately the signal-to-noise ratio was too bad to make it actually useful. Our peers frequently found that benign crashes were flagged as concerning, so checking or monitoring "exploitability" was rarely useful.

If we ever decide to reimplement this functionality, a lot more work will need to go into eliminating false alarms (e.g. properly noticing that a crash was due to an assertion and not really an otherwise-very-sketchy illegal instruction).

🚧 various stackwalking differences

Stackwalking is complicated. The two implementations largely use the same algorithms and approaches, and will mostly come back with the same answer. But there's so many little subtleties and interactions that it's impossible (and undesirable) to keep the two the same.

If you do encounter something strange, the new implementation includes debugging tools for stackwalking (not run in production, but can be run locally on "suspicious" crash reports, which socc-pair can help with).

To the best of my reckoning the differences are generally either benign or in the new implementation's favour, especially for the crashing thread, which is the most important one. Other threads are more likely to be in a zombie state that is harder (or impossible?) to stackwalk. I've definitely seen a few places where the new implementation is doing an obviously better job, but I don't have a good sense for how "common" or "significant" that is.

Most differences occur when we don't have any good information informing the stackwalk, and are basically making educated guesses (e.g. when `trust=scan` or when the CFI is seemingly wrong). In these situations it's very easy for either implementation to "hallucinate" extra frames or skip over frames. When this happens it's hard to tell if one is doing a better or worse job than the other. Garbage in, garbage out.

In these cases the new implementation is slightly more biased towards "keep it simple". So for instance, we do not implement the cfi_scan hack, because it produces dubious results.

The new implementation also tries to be more consistent between platforms when there isn't a clear reason for them to have divergent behaviour. So a heuristic we apply on x86 is more likely to also be applied on x64 and ARM. Some arbitrary inconsistency was however preserved in places where the difference between the old implementation and new implementation seemed too large. e.g. we generally tried to preserve any quirks that were observable in breakpad's testsuite.

(Note: just like the old implementation, every platform's stackwalker is a completely independent file, so keeping things consistent is an explicit decision and not just an accidental artifact of excessive code reuse.)

🚧 hard errors are reported differently

The JSON schema contains a "status" field that the old implementation would try to use to report an error (e.g. for empty/corrupt minidumps). This is a bit unreliable, because things may be in a corrupted state, resulting in corrupt JSON.

The new implementation prefers to produce no input if it encounters an error, and prints an error message to stderr. That said, in the new implementation this mostly only ever happens for empty/corrupt minidumps, where there isn't anything interesting to be done.

The Socorro processor rule that runs minidump-stackwalk now captures stderr and truncates it to the last 10 lines. It includes this data in the "mdsw_stderr" field and it's visible in the Debug tab for people who have protected data access.

🚧 null vs nothing

There are a bunch of fields that the old stackwalker doesn't emit at all if there's no value, whereas the new implementation will always emit the field with the value `null`. This is partially a quirk of the implementation, but it also just makes it easier to tell if some data could be there but isn't.

mac_crash_info
lsb_release
crash_info.assertion
system_info.cpu_microcode_version
crashing_thread.last_error_value
crashing_thread.thread_name
crashing_thread.frames.N.line
crashing_thread.frames.N.file
crashing_thread.frames.N.module
crashing_thread.frames.N.module_offset
crashing_thread.frames.N.function_offset
crashing_thread.frames.N.function
threads.N.last_error_value
threads.N.thread_name
threads.N.frames.N.file
threads.N.frames.N.function
threads.N.frames.N.function_offset
threads.N.frames.N.line
threads.N.frames.N.module
threads.N.frames.N.module_offset
modules.N.cert_subject
modules.N.symbol_url
unloaded_modules.N.cert_subject

🚧 false vs nothing

Similar to null vs. nothing, there are a couple of fields where the old stackwalker emits nothing, but rust minidump-stackwalk emits false.

modules.N.loaded_symbols
modules.N.corrupt_symbol

🚧 module version is null instead of empty string

Breakpad stackwalker emitted modules.N.version as empty string ("") and Rust minidump-stackwalk emits a null.

Crash Stats doesn't do anything with this, but scripts that look at processed crash data and depend on this will need to be updated.

🚧 modules have a different order

The old implementation liked to sort modules (executable, libraries, mapped files like fonts) by address. The new implementation simply preserves the order of the modules in the minidump.

(Due to this difference, main_module -- an index into the modules array -- may also change if a minidump is reprocessed, but it will still point to the same module as before.)

🚧 bad module code_ids are preserved

The old implementation would replace a seemingly corrupt (too short) modules.N.code_id value with the string "id" (no clue why). The new implementation just reports whatever value it finds.

✅ module names have their case preserved

The old implementation would lower-case threads.N.module strings, the new implementation preserves the casing of the module's name.

(e.g. instead of accessiblehandler.dll, you will see AccessibleHandler.dll)

✅ unloaded_modules are now correctly messier

The old implementation similarly sorted and deduplicated the contents of "unloaded_modules". However this isn't appropriate, because unloaded_modules is a log of unload events. Modules can (and do) show up multiple times in unloaded_modules when they are loaded and unloaded multiple times. These "repeat" entries may or may not have the same address. The ordering of the log is also mostly chronological, so the ordering has significance.

However (at least for Microsoft's native implementation of minidumps), unloaded_modules is not a complete log, because they use a fixed-size circular buffer to track unload events. If the buffer fills up, the oldest entries will get overwritten by the newest ones. This circular buffer also results in the "start" of the log not actually being the first element, and there's no way to tell which element is first. Hence mostly chronological.

The new implementation faithfully returns the log as it appears in the minidump.

(Unfortunately I don't actually know if the log is ascending or descending in time at the moment)

🚧 legacy memory statistics are removed

Several fields have been deprecated and aren't emitted in the new implementation because they have been obsoleted by our more robust memory reporting infrastructure in the "Crash Annotations" tab (example).

tiny_block_size
largest_free_vm_block
write_combine_size

🚧 legacy stack truncation fields are removed

The old implementation used to truncate long backtraces (presumably to simplify the output). This ended up confusing people, so it was disabled about a year ago. Since the new implementation never supported this (mis)feature, it does not output the fields that were used to report on when this happened:

crashing_thread.frames_truncated (was always false)
crashing_thread.total_frames (this stored the untruncated count)

Note that crashing_thread.frame_count is still present (and always has been), and contains the number of frames. For the last year this has always contained the exact same value as total_frames.

🚧 symbol debugging fields removed/changed

These were just things that maintainers of the old implementation added to help debug their own work. We did not feel the need to track them in the new implementation.

modules.N.symbol_disk_cache_hit
modules.N.symbols_fetch_time

The other module debug statistics (like symbols_corrupt) will not necessarily have the same values, as they reflect internal implementation details.

✅ symbols_url is now correct

modules.N.symbols_url specifies what url the symbols for this module were downloaded from. In the old implementation this value would be wrong if the symbols were found in the cache (it made a sloppy guess). The new implementation stores this value in the cache so it will always be correct.

Consumers of this value hacked around the value being incorrect, but these hacks can now be removed.

🚧 last_error_value may change

threads.N.last_error_value contains the GetLastError() value for each thread (which for unix people is similar to errno). Interpreting error codes on windows is an imperfect art because the major error types have overlapping values. For instance STATUS_WAIT_3 and ERROR_PATH_NOT_FOUND are the exact same value, and we just have to pick one arbitrarily.

The new implementation has a different order it tries to resolve errors in (we think it's better), so some of these values may change if you reprocess a minidump.