A SIGBUS Handler for Chrome


Brian White

Apr 16, 2018, 2:54:30 PM
to chromium-dev
"Bus Error" memory-fault exceptions have become problematic on Posix systems due to the expanding use of memory-mapped files used for persistent metrics, breadcrumbs, inter-process parameter passing, and more.

I have a Q2 OKR to build an exception handler for these types of memory faults.  I've written up a basic design and would like comments on it:

https://docs.google.com/document/d/1FeWeSVC7CC370soiYwSA__m-eu4BaxjvTVxebnQGOrw/edit?usp=sharing

  Brian
  bcw...@google.com
-----------------------------------------------------------------------------------------
Treat someone as they are and they will remain that way.
Treat someone as they can be and they will become that way.

Torne (Richard Coles)

Apr 16, 2018, 5:08:00 PM
to bcw...@google.com, chromium-dev
The doc is only shared with @google.com - could you make a public version on chromium.org maybe?

This seems very interesting, and I'd be interested in reviewing the design/implementation from an Android and WebView perspective; we have in the past had issues with the exact details of how breakpad's signal handler works on Android, due to various platform bugs and incorrect assumptions in our code that resulted in incorrect handling, so I'd like to look into those areas when you're ready ;)


Lei Zhang

Apr 16, 2018, 5:29:49 PM
to Brian White, chromium-dev, Torne (Richard Coles)
BTW, I recently added the ability to write out the si_code for SIGBUS
[1] and read it from the crash dump [2]. It may be of interest to you
to know why these SIGBUS crashes are happening. Also, have you looked
at whether many of the crashes are just coming from a small number of
users with chronic problems?

[1] https://chromium-review.googlesource.com/990799
[2] https://chromium-review.googlesource.com/1012589
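
For reference, si_code is what distinguishes the likely causes. The CLs above record it in the crash dump; a rough sketch of reading it directly (not what the CLs do) looks like this:

// Rough sketch of inspecting si_code directly (the CLs above instead record
// it in the crash dump); the constants come from <signal.h>.
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

void SigbusHandler(int sig, siginfo_t* info, void* /*ucontext*/) {
  const char* cause = "unknown";
  switch (info->si_code) {
    case BUS_ADRALN: cause = "misaligned access"; break;
    case BUS_ADRERR: cause = "nonexistent physical address (typical for a truncated mapped file)"; break;
    case BUS_OBJERR: cause = "object-specific hardware error"; break;
  }
  fprintf(stderr, "SIGBUS at %p: %s\n", info->si_addr, cause);  // not async-signal-safe; demo only
  _exit(1);
}

void Install() {
  struct sigaction sa = {};
  sa.sa_sigaction = SigbusHandler;
  sa.sa_flags = SA_SIGINFO;
  sigaction(SIGBUS, &sa, nullptr);
}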

Brian White

Apr 16, 2018, 5:55:58 PM
to Lei Zhang, chromium-dev, Torne (Richard Coles)
BTW, I recently added the ability to write out the si_code for SIGBUS
[1] and read it from the crash dump [2]. It may be of interest to you
to know why these SIGBUS crashes are happening. Also, have you looked
at whether many of the crashes are just coming from a small number of
users with chronic problems?

It used to be a fairly widespread problem, but that may have changed with the forced pre-allocation of the files.  I haven't looked at si_code, but I know it's not a misaligned access.

-- Brian



Primiano Tucci

Apr 17, 2018, 12:25:47 AM
to bcw...@google.com, Lei Zhang, chromium-dev, Torne (Richard Coles)
I am quite worried about introducing a handler for SIGBUS errors outside of breakpad/crashpad, especially if, as the doc suggests, it gives the ability to attach *arbitrary* callbacks to the handler. Handling exceptions in a signal handler is full of footguns, mostly because signal handling is riddled with platform-level bugs (e.g., crbug.com/448968, crbug.com/483399, crbug.com/481420, crbug.com/477444, crbug.com/473973). We had quite a number of bugs in breakpad itself before we got chaining of signal handlers right on all versions of Android. How are these arbitrary callbacks not going to hit the same chain of bugs?
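
To make the chaining concern concrete: even the "simple" part of such a handler looks roughly like the sketch below (simplified, not breakpad's actual code), and nearly every branch of it has bitten us at some point.

// Simplified chaining sketch (not breakpad's actual code): the handler has to
// remember and re-invoke whatever handler was installed before it, otherwise
// crash reporting silently stops working for these signals.
#include <signal.h>
#include <string.h>

static struct sigaction g_previous_sigbus;

void ChainingSigbusHandler(int sig, siginfo_t* info, void* ucontext) {
  // ... try to handle the fault here; if it isn't ours, fall through ...

  if (g_previous_sigbus.sa_flags & SA_SIGINFO) {
    g_previous_sigbus.sa_sigaction(sig, info, ucontext);
  } else if (g_previous_sigbus.sa_handler == SIG_DFL) {
    // Restore the default action and re-raise: the pending signal is delivered
    // when we return, so the process still dies (and gets reported) normally.
    signal(sig, SIG_DFL);
    raise(sig);
  } else if (g_previous_sigbus.sa_handler != SIG_IGN) {
    g_previous_sigbus.sa_handler(sig);
  }
}

void InstallChainingHandler() {
  struct sigaction sa;
  memset(&sa, 0, sizeof(sa));
  sa.sa_sigaction = ChainingSigbusHandler;
  sa.sa_flags = SA_SIGINFO;
  sigemptyset(&sa.sa_mask);
  sigaction(SIGBUS, &sa, &g_previous_sigbus);  // remember what was there before
}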

P.S.: It seems we had this discussion ~1 year ago on this doc, where concerns were expressed about the impact on crash reporting. What is different in this proposal?


Brian White

Apr 17, 2018, 12:48:35 PM
to Primiano Tucci, Lei Zhang, chromium-dev, Torne (Richard Coles)
I am quite worried about introducing a handler for SIGBUS errors outside of breakpad/crashpad, especially if, as the doc suggests, it gives the ability to attach *arbitrary* callbacks to the handler. Handling exceptions in a signal handler is full of footguns, mostly because signal handling is riddled with platform-level bugs (e.g., crbug.com/448968, crbug.com/483399, crbug.com/481420, crbug.com/477444, crbug.com/473973). We had quite a number of bugs in breakpad itself before we got chaining of signal handlers right on all versions of Android. How are these arbitrary callbacks not going to hit the same chain of bugs?

P.S.: It seems we had this discussion ~1 year ago on this doc, where concerns were expressed about the impact on crash reporting. What is different in this proposal?

The other proposals in "this doc" have been done.  They've been insufficient.  Sadly.
-- Brian

Albert J. Wong (王重傑)

Apr 17, 2018, 1:06:44 PM
to bcw...@google.com, Primiano Tucci, Lei Zhang, Chromium-dev, Torne (Richard Coles)
I'm similarly concerned about the generality of the solution.

The driving motivation, per the doc, seems to be "Due to [memory mapped files] use in persistent metrics and the breadcrumbs project, ..."  I wonder if we should examine alternatives to that usage?

Trying to handle SIGBUS (or SEGV in the future) arising from an OS resource-exhaustion scenario by adding a dummy page seems fraught. My imagination immediately goes to logic reading/writing a value in that (now possibly shared) memory and then basing program decisions on the results without any awareness of concurrent modifications.

If the goal is to report SIGBUS better, can we go for a narrower solution...which probably still allows a crash?

If we are attempting to fix the crash, then I wonder if we should examine the architecture of the code causing the SIGBUS.

-Albert



Brian White

Apr 17, 2018, 1:30:24 PM
to ajw...@chromium.org, Primiano Tucci, Lei Zhang, chromium-dev, Torne (Richard Coles)
The driving motivation, per the doc, seems to be "Due to [memory mapped files] use in persistent metrics and the breadcrumbs project, ..."  I wonder if we should examine alternatives to that usage?

Both of these involve the persistence of information during abnormal terminations (aka "crashes").  There are very few options to achieve this.

During the original planning, it was expected that Crashpad would be able to grab the "persistent" memory and write it out to a file, but the Crashpad team didn't like that idea and there was serious concern about whether it could even manage it with sufficient accuracy to be useful.
 

Trying to handle SIGBUS (or SEGV in the future) arising from an OS resource-exhaustion scenario by adding a dummy page seems fraught. My imagination immediately goes to logic reading/writing a value in that (now possibly shared) memory and then basing program decisions on the results without any awareness of concurrent modifications.

There's no question that it has to be done carefully but it's not new.  The X server uses something similar in case a shared memory segment provided by a client suddenly goes away.

There is certainly a danger in accessing a page of unknown data that has suddenly taken the place of what was expected.  Great care is necessary to be confident of reasonable results.

But this isn't something that would suddenly be covering arbitrary memory.  Handlers would have to be written to cover a specific memory area of interest for a defined amount of time.
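
To make that concrete, the shape I have in mind is roughly the following.  This is a hedged sketch with made-up names, not the API from the doc:

// Rough sketch (hypothetical names, not the design doc's API): a fault inside
// a registered range gets an anonymous page mapped over it and a flag set so
// the owner stops trusting the range; anything else is not ours to handle.
#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>

struct WatchedRange {
  uintptr_t start;
  size_t length;
  volatile sig_atomic_t faulted;  // the owning code polls this and backs off
};

static WatchedRange g_metrics_range;  // registered before the mapping is used

void RangeSigbusHandler(int /*sig*/, siginfo_t* info, void* /*ucontext*/) {
  const uintptr_t addr = reinterpret_cast<uintptr_t>(info->si_addr);
  const uintptr_t kPageMask = ~static_cast<uintptr_t>(4095);  // assume 4K pages
  if (addr >= g_metrics_range.start &&
      addr < g_metrics_range.start + g_metrics_range.length) {
    // Map a zeroed anonymous page over the faulting page so the retried
    // instruction succeeds. (mmap is a plain syscall in practice, though POSIX
    // doesn't list it as async-signal-safe.)
    mmap(reinterpret_cast<void*>(addr & kPageMask), 4096,
         PROT_READ | PROT_WRITE,
         MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    g_metrics_range.faulted = 1;  // tell the owner its backing store is gone
    return;  // the kernel restarts the faulting instruction
  }
  // Not a watched range: chain to whatever handler (e.g. crashpad) was
  // installed before us, as in an ordinary crash.
}

The consumer (e.g. PersistentMemoryAllocator) would then check the flag on its next access and fall back to non-persistent storage.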

 

If the goal is to report SIGBUS better, can we go for a narrower solution...which probably still allows a crash?

If we are attempting to fix the crash, then I wonder if we should examine the architecture of the code causing the SIGBUS.

We have reporting.  The goal here is to not crash for disk errors outside of our control.  I'm certainly open to other improvements to the PersistentMemoryAllocator, but when it's used with file-backed memory, it's at the mercy of the OS properly connecting RAM addresses to disk blocks.  The SIGBUS handler just adds a method for the OS to inform Chrome of a problem.

Metric persistence is a very useful thing, but we shouldn't sacrifice stability for it.  If persistence isn't available, the code can simply stop using it.  It already does this, in fact, should the storage become full or corruption be detected.

-- Brian

Albert J. Wong (王重傑)

Apr 17, 2018, 2:38:22 PM
to bcw...@google.com, Primiano Tucci, Lei Zhang, Chromium-dev, Torne (Richard Coles)
I think I'm understanding the problem more.  To restate:

  (1) We want to disable instead of crash for metrics persistence since metrics is optional
  (2) The current design uses in-process memory-mapped sparse files
  (3) Metrics can also be triggered in abnormal shutdown
  (4) Something has happened such that more crashes due to disk-full are occurring which
      is causing urgency.

Because of (2), we're stuck on POSIX with SIGBUS being the only error-condition signal. In that setup, to avoid crashing, we have the following chain of reasoning:
  (a) There MUST be code to swap in a dummy page to make the address valid; otherwise the
        process has no choice but to hard terminate
  (b) Since this is process-global state, a global manager is preferred for visibility to other
        modules and for coordination
  (c) Hiding the tools doesn't actually prevent someone else from writing a bad version, so
        we might as well expose a good (if dangerous) version

And the first (maybe only) consumer of this is PersistentMemoryAllocator, which would swap in the dummy page, likely continue and set a flag that no-ops itself, and then at some indeterminate time in the future undo the dummy page.

Results and Assumptions:
  (i) Metrics collection no longer crashes Chrome
  (ii) There's reasonable belief that Chrome can and will continue to run for some useful
       period of time inside this resource exhaustion/error environment
  (iii) OR we can persist enough data to give us good telemetry
  (iv) Those results are worth the cost of introducing the likely subtle, hard to understand,
        kernel-bug-prone, but probably rarely touched logic of handling SIGBUS.

And that's all preferable to
  (α) accepting the current set of crashes
  (β) rearchitecting the persistent metrics to do something else (e.g., offload the recording out of
        process, like OOPHP or ETW) such that crashes in the logging system are isolated.

Is that an accurate redux?

-Albert

Jochen Eisinger

Apr 17, 2018, 2:41:14 PM
to ajw...@chromium.org, bradn...@chromium.org, bcw...@google.com, Primiano Tucci, Lei Zhang, Chromium-dev, Torne (Richard Coles)

Brian White

Apr 17, 2018, 4:20:20 PM
to joc...@chromium.org, ajw...@chromium.org, bradn...@chromium.org, Primiano Tucci, Lei Zhang, chromium-dev, Torne (Richard Coles)
I think I'm understanding the problem more.  To restate:

  (1) We want to disable instead of crash for metrics persistence since metrics is optional

Metrics won't actually get disabled.  If an error causes memory to be unavailable, the system will fall back to non-persistent storage.
 
  (2) The current design uses in-process memory-mapped sparse files

It uses memory-mapped files.  How the OS treats them is not directly under our control, though we do make an effort to fully realize the file by calling fallocate() or explicitly writing a byte to every page.
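
The pre-allocation is roughly along these lines (a sketch of the idea, not the actual allocator code):

// Sketch of the pre-allocation idea (not the actual allocator code):
// reserve real blocks with fallocate() where supported, otherwise fall
// back to touching every page so the filesystem commits storage up front.
#include <fcntl.h>
#include <unistd.h>

bool PreallocateFile(int fd, size_t size) {
  if (fallocate(fd, 0, 0, size) == 0)
    return true;  // blocks are reserved (on filesystems that honor it)

  // Fallback: extend the file and write one byte per page.
  if (ftruncate(fd, size) != 0)
    return false;
  const size_t kPageSize = 4096;
  char zero = 0;
  for (size_t offset = 0; offset < size; offset += kPageSize) {
    if (pwrite(fd, &zero, 1, offset) != 1)
      return false;
  }
  return true;
}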
 
  (3) Metrics can also be triggered in abnormal shutdown
  (4) Something has happened such that more crashes due to disk-full are occurring which
      is causing urgency.

It shouldn't be disk-full due to the fallocate() that is done.  All we know for sure is that the OS is unable to map a page into the virtual address space.
 

Because of (2), we're stuck on POSIX with SIGBUS being the only error-condition signal. In that setup, to avoid crashing, we have the following chain of reasoning:
  (a) There MUST be code to swap in a dummy page to make the address valid; otherwise the
        process has no choice but to hard terminate
  (b) Since this is process-global state, a global manager is preferred for visibility to other
        modules and for coordination
  (c) Hiding the tools doesn't actually prevent someone else from writing a bad version, so
        we might as well expose a good (if dangerous) version

And the first (maybe only) consumer of this is PersistentMemoryAllocator, which would swap in the dummy page, likely continue and set a flag that no-ops itself, and then at some indeterminate time in the future undo the dummy page.

Results and Assumptions:
  (i) Metrics collection no longer crashes Chrome
  (ii) There's reasonable belief that Chrome can and will continue to run for some useful
       period of time inside this resource exhaustion/error environment

This is the biggest question I have.  If there is some sort of device fault causing this, will handling it here just lead to other problems a short while later?  I tend to think, though, that other device errors are most likely to surface through I/O system calls, which already have proper handling for those errors.  And even if Chrome is unable to continue meaningfully in that state, we'd likely get better insight into the true cause of the problem.
 

  (iii) OR we can persist enough data to give us good telemetry
  (iv) Those results are worth the cost of introducing the likely subtle, hard to understand,
        kernel-bug-prone, but probably rarely touched logic of handling SIGBUS.

And that's all preferable to
  (α) accepting the current set of crashes
  (β) rearchitecting the persistent metrics to do something else (e.g., offload the recording out of
        process, like OOPHP or ETW) such that crashes in the logging system are isolated.

Low latency is very important to both metrics collection and breadcrumbs.  Management by an outside process doesn't meet that requirement and is likely to fail in some cases, such as system shutdown.
 

Is that an accurate redux?

Yes.

-- Brian
 

Torne (Richard Coles)

Apr 17, 2018, 4:27:28 PM
to Brian White, joc...@chromium.org, ajw...@chromium.org, bradn...@chromium.org, Primiano Tucci, Lei Zhang, chromium-dev
If we are already fallocate()ing the pages we map in advance, then it seems like the SIGBUS errors we're getting are just general "the world is on fire, oh no" results that come from having a diverse population of often-crappy computers running the code, and trying to handle them doesn't seem likely to be that helpful.

erik...@chromium.org

Apr 17, 2018, 5:57:40 PM
to Chromium-dev, bcw...@google.com, joc...@chromium.org, ajw...@chromium.org, bradn...@chromium.org, prim...@chromium.org, the...@chromium.org
The crash that the design doc aims to fix occurs overwhelmingly often on Linux 4.4.0.*. Less than 1% of the crashes occur on Android, and less than 0.1% occur on macOS/Windows.

This makes me suspect:
1) A kernel bug in 4.4.0.* 
and/or
2) Something pathological occurring in the implementation of PersistentMemoryAllocator on Linux, perhaps because Linux users are more likely to use non-standard [e.g. distributed] file systems.

Before we implement this proposal, let's disable PersistentMemoryAllocator on 4.4.0.*, and add some more checks to try to reduce crash rates on Linux [e.g. don't use PersistentMemoryAllocator on atypical filesystems]. 
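
For the filesystem check, I'm imagining something along these lines.  This is a sketch, not a proposed patch; the constants are the usual statfs() f_type magic numbers:

// Sketch only (not a proposed patch): gate the persistent, file-backed
// allocator on the backing filesystem looking "ordinary".
#include <sys/vfs.h>  // statfs()

bool IsOrdinaryLocalFilesystem(const char* path) {
  struct statfs fs;
  if (statfs(path, &fs) != 0)
    return false;  // can't tell; be conservative
  switch (fs.f_type) {
    case 0xEF53:      // EXT2/3/4_SUPER_MAGIC
    case 0x58465342:  // XFS_SUPER_MAGIC
      return true;
    default:          // NFS, FUSE, btrfs, overlayfs, etc.: skip persistence
      return false;
  }
}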

Torne (Richard Coles)

Apr 17, 2018, 6:04:35 PM
to erik...@chromium.org, Chromium-dev, bcw...@google.com, joc...@chromium.org, ajw...@chromium.org, bradn...@chromium.org, prim...@chromium.org, the...@chromium.org
It'd be interesting to see what filesystem the affected devices have. I'm not sure that unusual things like btrfs/zfs really do anything meaningful with fallocate(), so they may be out of disk anyway, and btrfs on older kernels is also... fun :)


Primiano Tucci

Apr 18, 2018, 12:27:29 AM
to Albert J. Wong (王重傑), bcw...@google.com, Lei Zhang, Chromium-dev, Torne (Richard Coles)
There is a chapter missing here, and it is my primary concern:

Risk (at the end of the day there is no right or wrong, this is all risk management):
1) If the signal handling is good but the signal chaining is wrong (see bugs linked above), Chrome stops receiving crash reports for (a portion of) SIGBUS crashes. If that happens only on a class of devices, it's going to be hard to realize that this happened.

2) If something goes wrong in the signal handling (e.g., we fail to mmap a replacement page), we can easily end up in a situation where the cause of the SIGBUS is not resolved and the signal handler is invoked repeatedly, in a sort of while(true) loop between us and the kernel. If that happens, it will deliver an awful experience to the user (Chrome burning their battery) in a way that is going to be absolutely undetectable for us: none of our watchdogs will be able to do anything useful if we get stuck in that state.
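
And guarding against that loop is itself more subtle handler code to get right; something like the sketch below (my own illustration, not part of the proposal) would be needed on top of everything else:

// Illustration only (not part of the proposal): if the same address keeps
// faulting, give up and restore the default action so the process dies
// visibly instead of spinning between the kernel and the handler.
#include <signal.h>

static void* g_last_fault_addr = nullptr;
static int g_repeat_count = 0;  // fault is synchronous, so plain statics suffice for a sketch

void GuardedSigbusHandler(int sig, siginfo_t* info, void* /*ucontext*/) {
  if (info->si_addr == g_last_fault_addr) {
    if (++g_repeat_count > 8) {
      signal(sig, SIG_DFL);  // give up: the next delivery terminates the process
      return;
    }
  } else {
    g_last_fault_addr = info->si_addr;
    g_repeat_count = 0;
  }
  // ... try to repair the mapping; if that keeps failing, the guard above fires ...
}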

Both options are, IMHO, not entirely unrealistic (they can happen in the same way it happened in those 5 bugs linked). So my question here is: is the risk worth the benefit? 

Egor Pasko

Apr 18, 2018, 9:39:48 AM
to Primiano Tucci, Brian White, Lei Zhang, chromium-dev, Torne (Richard Coles)
On Tue, Apr 17, 2018 at 6:25 AM Primiano Tucci <prim...@chromium.org> wrote:
I am quite worried about introducing a handler for SIGBUS errors outside of breakpad/crashpad, especially if, as the doc suggests, it gives the ability to attach *arbitrary* callbacks to the handler. Handling exceptions in a signal handler is full of footguns, mostly because signal handling is riddled with platform-level bugs (e.g., crbug.com/448968, crbug.com/483399, crbug.com/481420, crbug.com/477444, crbug.com/473973). We had quite a number of bugs in breakpad itself before we got chaining of signal handlers right on all versions of Android. How are these arbitrary callbacks not going to hit the same chain of bugs?

Just to add to the collection of issues with signal handlers: http://crbug.com/249880

Brian White

Apr 21, 2018, 12:12:12 PM
to Egor Pasko, Primiano Tucci, Lei Zhang, chromium-dev, Torne (Richard Coles)
After all the discussion and a mitigation against one especially susceptible Linux kernel version, I'm going to put this on hold for now.

-- Brian
