Heads Up: New Crash Analyzer (minidump-stackwalk) for "Try" and Local Builds

40 views
Skip to first unread message

Alexis Beingessner

unread,
Nov 23, 2021, 2:45:08 PM11/23/21
to dev-platform
High Level Details

Bug 1741205 has just landed.

minidump-stackwalk, the tool that is used to produce a backtrace from a minidump when a test unexpectedly crashes (via mozcrash.py) has been completely rewritten in Rust. The core behaviour of the tool should be largely unchanged. It should produce basically the exact same backtraces, although I expect it will be faster and more reliable. If you notice any problems, please let us know by filing a bug against "Toolkit > Crash Reporting" or by leaving a comment on Bug 174120.

Although this is ideally a non-functional change for almost everyone, this is a major step in the ongoing oxidation of our crash reporting infrastructure: this is the first deployment of rust-minidump as a replacement for breakpad's minidump processor. This same implementation will (if testing goes smoothly) become the new backend for crash-stats.mozilla.org.

Moving to rust-minidump will make it easier to maintain our crash reporting infrastructure and make improvements to it. We've already found several long-standing issues that it accidentally fixes and found that it does runs faster than the old version. And of course, moving from C++ to Rust trivially improves the security of our infrastructure. Hooray!

Perhaps more immediately interesting to some of you: rust-minidump also makes it a lot easier to be a crash-analysis power user, and is easier to build and run locally (see the last section).

Sections:
  • What is affected (and how to tell which version is being used)
  • Nitty Gritty details (how builds/tooling configs have changed)
  • New crash analysis power user features (digging deeper into minidumps)


What is affected

This tool is used to generate this kind of line in the try frontend:

PROCESS-CRASH | Last test finished | application crashed [@ static mozilla::image::SurfaceCache::ReleaseImageOnMainThread(already_AddRefed<mozilla::image::Image>, bool)]

And this information in the logs:

INFO - mozcrash Copy/paste: Z:/task_163718277621850/fetches\minidump_stackwalk\minidump_stackwalk.exe --human C:\Users\task_163718277621850\AppData\Local\Temp\tmpra9trz2u.mozrunner\minidumps\b2b1c4f0-3e0a-46b8-b469-988591c3c015.dmp Z:\task_163718277621850\build\symbols --symbols-url=https://symbols.mozilla.org/
INFO - mozcrash Saved minidump as Z:\task_163718277621850\build\blobber_upload_dir\b2b1c4f0-3e0a-46b8-b469-988591c3c015.dmp
INFO - PROCESS-CRASH | Last test finished | application crashed [@ static mozilla::image::SurfaceCache::ReleaseImageOnMainThread(already_AddRefed<mozilla::image::Image>, bool)]
INFO - Crash dump filename: C:\Users\task_163718277621850\AppData\Local\Temp\tmpra9trz2u.mozrunner\minidumps\b2b1c4f0-3e0a-46b8-b469-988591c3c015.dmp
INFO - Operating system: Windows NT
INFO -                   10.0.19041
INFO - CPU: amd64
INFO -      family 6 model 85 stepping 7
INFO -      8 CPUs
INFO -
INFO - Crash reason:  EXCEPTION_BREAKPOINT
INFO - Crash address: 0x7ff8ae0df019
INFO - Process uptime: 2 seconds
INFO -
INFO - Thread 3 TaskController #2 (crashed)
INFO -  0  xul.dll!static mozilla::image::SurfaceCache::ReleaseImageOnMainThread(already_AddRefed<mozilla::image::Image>, bool) [SurfaceCache.cpp:694ce55b85c51b3381eaf432020924e4f0ca4717 : 1831 + 0x40]
INFO -     rax = 0x00007ff8b541d9f9   rdx = 0x0000000000000000
INFO -     rcx = 0x00007ff8e277c978   rbx = 0x0000000000000001
INFO -     rsi = 0x00000041a977f350   rdi = 0x00000177b0f6f740
INFO -     rbp = 0x00000177aa172130   rsp = 0x00000041a977f2b0
INFO -      r8 = 0x00000041a977f820    r9 = 0x00007ff8ea530000
INFO -     r10 = 0x00007ff8ea582651   r11 = 0x00000041a977ec70
INFO -     r12 = 0x00007ff8e26e9630   r13 = 0x00000177b3a15120
INFO -     r14 = 0x00000177b0f970d8   r15 = 0x00000177b28a2a00
INFO -     rip = 0x00007ff8ae0df019
INFO -     Found by: given as instruction pointer in context
INFO -  1  xul.dll!mozilla::image::DecodedSurfaceProvider::FinishDecoding() [DecodedSurfaceProvider.cpp:694ce55b85c51b3381eaf432020924e4f0ca4717 : 200 + 0x37]
INFO -     rbx = 0x0000000000000001   rbp = 0x00000177aa172130
INFO -     rsp = 0x00000041a977f310   r12 = 0x00007ff8e26e9630
INFO -     r13 = 0x00000177b3a15120   r14 = 0x00000177b0f970d8
INFO -     r15 = 0x00000177b28a2a00   rip = 0x00007ff8ae0b0dbf
INFO -     Found by: call frame info
INFO -  2  xul.dll!mozilla::image::DecodedSurfaceProvider::Run() [DecodedSurfaceProvider.cpp:694ce55b85c51b3381eaf432020924e4f0ca4717 : 129 + 0x7]
INFO -     rbx = 0x0000000000000001   rbp = 0x00000177aa172130
INFO -     rsp = 0x00000041a977f390   r12 = 0x00007ff8e26e9630
INFO -     r13 = 0x00000177b3a15120   r14 = 0x00000177b0f970d8
INFO -     r15 = 0x00000177b28a2a00   rip = 0x00007ff8ae0b0929
INFO -     Found by: call frame info


Only the new implementation will include the `--human` flag in the "Copy/paste: Z:/task_163718277621850/fetches\minidump_stackwalk\minidump_stackwalk.exe" line, if you aren't sure which implementation is being used. However I believe this line is omitted if fix-stacks.py is involved. In that case, you can use the fact that the backtraces will end with "unimplemented streams", a self-debugging feature unique to the new tool:

INFO -  Unimplemented streams encountered:
INFO -  Stream 0x00000000 UnusedStream (Official) @ 0x00000000
INFO -  Stream 0x00000015 SystemMemoryInfoStream (Official) @ 0x00002d98
INFO -  Stream 0x00000016 ProcessVmCountersStream (Official) @ 0x00002f84


These differences similarly apply to local crashes, as far as I know (unverified).



Nitty Gritty Details

On paper, this should be a significant upgrade to try's minidump-stackwalk, in that the one that is currently being used is a weird unmaintained fork of what's used on crash-stats. The new one should be faster and more reliable, and will be easy to update by just changing the commit in its toolchain fetch:

rust-minidump:
    description: rust-minidump source code (for minidump-stackwalk)
    fetch:
        type: git
        repo: https://github.com/luser/rust-minidump/
        revision: 0c90e02544797317503d1c4cff8daab0cabdea86


However this will introduce some changes to how minidump-stackwalk is built:

  • minidump-stackwalk no longer builds from checked-in source, so it will only need to be built if the toolchain fetch is updated (less builds and churn, yay!)
  • This introduces a currently orphaned win64-minidump-stackwalk build, for future use in solving Bug 1410840. That can be removed if having an orphan tool sets off some annoying warnings/errors for the sheriffs.
  • There was technically some wiring in `mach` to support building minidump-stackwalk locally. this no longer works, as there is no local source to use. mozboot was already downloading `minidump_stackwalk` for you, so it's unlikely this will affect anyone's workflow.
  • If for whatever reason you want to build your own copy of minidump-stackwalk, you can:
    •  `cargo install minidump-stackwalk`
    • or checkout the rust-minidump tree and build it with `cargo build --release` (that's all our build servers do!)
    • NB: the rust binary is officially "minidump-stackwalk" but we rename it to "minidump_stackwalk" in CI to avoid needless churn, if you do either of these things, you will get a binary called "minidump-stackwalk"


Crash Analysis Power User Features

The new minidump-stackwalk should build and run locally on all mainstream platforms without any issue via "cargo install minidump-stackwalk". I have done my best to write user documentation and the process for analyzing a firefox minidump is streamlined:

> minidump-stackwalk --symbols-url=https://symbols.mozilla.org/ /path/to/minidump.dmp

Because it is designed to be the backend for crash-stats (aka socorro), it can do ~all the analysis you expect there (and more details are surfaced in the default JSON output than the --human one used by "try" and local builds).

To help me test this, I have also created a tool for downloading minidumps/annotations from crash-stats and comparing local minidump-stackwalk output to the values on crash-stats: socc-pair. While the comparison machinery is probably not necessarily useful to you, as a side-effect of its purpose it automates fetching all the details of a crash and running local analysis, and saves the results to files you can search through for details:

      socc-pair --api-token=f0c129d4467bf58eeca0ad8e8e5d --crash-id=b4f58e9f-49be-4ba5-a203-8ef160211027

<lots of interesting analysis>
...

Output Files:
  * Minidump: C:\Users\gankra\AppData\Local\Temp\socc-pair\dumps\b4f58e9f-49be-4ba5-a203-8ef160211027.dmp
  * Socorro Processed Crash: C:\Users\gankra\AppData\Local\Temp\socc-pair\dumps\b4f58e9f-49be-4ba5-a203-8ef160211027.json
  * Raw JSON: C:\Users\gankra\AppData\Local\Temp\socc-pair\dumps\b4f58e9f-49be-4ba5-a203-8ef160211027.raw.json
  * Local minidump-stackwalk Output: C:\Users\gankra\AppData\Local\Temp\socc-pair\dumps\b4f58e9f-49be-4ba5-a203-8ef160211027.local.json
  * Local minidump-stackwalk Logs: C:\Users\gankra\AppData\Local\Temp\socc-pair\dumps\b4f58e9f-49be-4ba5-a203-8ef160211027.log.txt


NOTE: these files can contain protected user information. Although they are written to your system's default "temp", I recommend deleting the temp `socc-pair` directory regularly to ensure compliance with our protected data policies.

Notably this includes all of the logging rust-minidump did (in the `.log.txt`), including tracing for the backtracer's analysis, which can help you debug strange backtraces:


[TRACE] unwind: starting stack unwind
[TRACE] unwind: unwinding NtGetContextThread
[TRACE] unwind: trying cfi
[TRACE] unwind: found symbols for address, searching for cfi entries
[TRACE] unwind: trying STACK CFI exprs
[TRACE] unwind:   .cfa: $rsp 8 + .ra: .cfa 8 - ^
[TRACE] unwind:   .cfa: $rsp 8 +
[TRACE] unwind: STACK CFI parse successful
[TRACE] unwind: STACK CFI seems reasonable, evaluating
[TRACE] unwind: successfully evaluated .cfa (frame address)
[TRACE] unwind: successfully evaluated .ra (return address)
[TRACE] unwind: cfi evaluation was successful -- caller_ip: 0x000000ec00000000, caller_sp: 0x000000ec7fbfd790
[TRACE] unwind: cfi result seems valid
[TRACE] unwind: unwinding 1013612281855
[TRACE] unwind: trying cfi
[TRACE] unwind: trying frame pointer
[TRACE] unwind: trying scan
[TRACE] unwind: scan seems valid -- caller_ip: 0x7ffd172c2a24, caller_sp: 0xec7fbfd7f8
[TRACE] unwind: unwinding <unknown in ntdll.dll>
[TRACE] unwind: trying cfi
[TRACE] unwind: found symbols for address, searching for cfi entries
[TRACE] unwind: trying frame pointer
[TRACE] unwind: trying scan
[TRACE] unwind: scan seems valid -- caller_ip: 0x7ffd162b7034, caller_sp: 0xec7fbfd828
[TRACE] unwind: unwinding BaseThreadInitThunk
[TRACE] unwind: trying cfi
[TRACE] unwind: found symbols for address, searching for cfi entries
[TRACE] unwind: trying STACK CFI exprs
[TRACE] unwind:   .cfa: $rsp 8 + .ra: .cfa 8 - ^
[TRACE] unwind:   .cfa: $rsp 48 +
[TRACE] unwind: STACK CFI parse successful
[TRACE] unwind: STACK CFI seems reasonable, evaluating
[TRACE] unwind: successfully evaluated .cfa (frame address)
[TRACE] unwind: successfully evaluated .ra (return address)
[TRACE] unwind: cfi evaluation was successful -- caller_ip: 0x0000000000000000, caller_sp: 0x000000ec7fbfd858
[TRACE] unwind: cfi result seems valid
[TRACE] unwind: instruction pointer was nullish, assuming unwind complete
[TRACE] unwind: finished stack unwind



If you are using minidump-stackwalk directly, you can get these same logs with `--verbose=trace`. `--output-file=x/y/z` and `--log-file=x/y/z` arguments are included to make it easier to pipe these streams to files.
Reply all
Reply to author
Forward
0 new messages