--
--
Chromium Developers mailing list: chromi...@chromium.org
View archives, change email options, or unsubscribe:
http://groups.google.com/a/chromium.org/group/chromium-dev
---
You received this message because you are subscribed to the Google Groups "Chromium-dev" group.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/chromium-dev/CAB8qB%2Btak-%2BjvUPAzBvo6hHXCws5K3G43Smjcf4gz8tPp78vXA%40mail.gmail.com.
BTW, I recently added the ability to write out the si_code for SIGBUS
[1] and read it from the crash dump [2]. It may be of interest to you
to know why SIGBUS crashes are happening. Also, have you checked whether
many of the crashes are coming from just a small number
of users with chronic problems?
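As a rough illustration of what reading the si_code involves, here is a minimal POSIX sketch. It is not the code from [1] or [2]; the names `DescribeSigbusCode`, `RecordSigbus` and `InstallSigbusRecorder` are hypothetical. Only `sigaction`, `SA_SIGINFO`, `siginfo_t` and the `BUS_*` constants are standard. On Linux, touching a file mapping whose backing blocks are missing typically reports `BUS_ADRERR`, but that is an assumption about the platform, not a guarantee.

```cpp
#include <signal.h>

#include <cassert>
#include <cstring>

// Hypothetical helper: turns a SIGBUS si_code into a short label.
const char* DescribeSigbusCode(int code) {
  switch (code) {
    case BUS_ADRALN: return "BUS_ADRALN";  // Invalid address alignment.
    case BUS_ADRERR: return "BUS_ADRERR";  // Nonexistent physical address.
    case BUS_OBJERR: return "BUS_OBJERR";  // Object-specific hardware error.
    default: return "other";               // E.g. sent by kill() or raise().
  }
}

volatile sig_atomic_t g_sigbus_seen = 0;
volatile sig_atomic_t g_sigbus_code = 0;

// Records the code only. A real crash reporter would write it into the dump
// and then let the crash proceed; returning from a handler after a genuine
// fault simply re-executes the faulting instruction.
void RecordSigbus(int, siginfo_t* info, void*) {
  g_sigbus_code = info->si_code;
  g_sigbus_seen = 1;
}

void InstallSigbusRecorder() {
  struct sigaction action;
  std::memset(&action, 0, sizeof(action));
  action.sa_sigaction = RecordSigbus;
  action.sa_flags = SA_SIGINFO;  // Needed to receive the siginfo_t argument.
  sigemptyset(&action.sa_mask);
  const int rc = sigaction(SIGBUS, &action, nullptr);
  assert(rc == 0);
  (void)rc;
}
```

Bucketing crash reports by this label would separate alignment faults from faults on file-backed mappings, which is the distinction that matters for the disk-full case discussed in this thread.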
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/chromium-dev/CAB8qB%2BvLR6EnZPTOoArZEfN3KxUVqD4b4xkx8bugC7fBbhZQ%2Bg%40mail.gmail.com.
I am quite worried about introducing a handler for SIGBUS errors outside of breakpad/crashpad, especially if, as the doc suggests, that gives the ability of attaching *arbitrary* callbacks to the handler. Handling exceptions in a signal handler is full of footguns, mostly because signal handlers are full of platform-level bugs (e.g., crbug.com/448968, crbug.com/483399, crbug.com/481420, crbug.com/477444, crbug.com/473973). We had quite a number of bugs in breakpad itself before we got chaining of signal handlers right in all versions of Android. How are these arbitrary callbacks not going to hit the same chain of bugs?
P.S: It seems we had this discussion ~1 y ago on this doc, where concerns were expressed about the impact on crash reporting. What is different in this proposal?
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/chromium-dev/CAB8qB%2BuSbEo1F-6sHeMynsgfmS-r6eXC-dVD%3DdY%3D37dKUPSyDg%40mail.gmail.com.
The driving motivation, per the doc, seems to be "Due to [memory mapped files] use in persistent metrics and the breadcrumbs project, ...". I wonder whether we should examine alternatives to that usage.
Trying to handle SIGBUS (or SIGSEGV in the future) arising from an OS resource exhaustion scenario by adding a dummy page seems fraught. My imagination immediately goes to logic reading/writing a value in that (now possibly shared) memory and basing program logic on the results without any awareness of concurrent modifications.
If the goal is to report SIGBUS better, can we go for a narrower solution, one that probably still allows a crash?
If we are attempting to fix the crash, then I wonder if we should examine the architecture of the code causing the sigbus.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/chromium-dev/CALcbsXBsbiRTn9qSxehhjzO7ParM63Ff7%3Dzsf%2Bacd58m%3DqM2tQ%40mail.gmail.com.
I think I'm understanding the problem more. To restate:
(1) We want to disable metrics persistence instead of crashing, since metrics are optional.
(2) The current design uses in-process memory-mapped sparse files.
(3) Metrics can also be triggered in abnormal shutdown.
(4) Something has happened such that more crashes due to disk-full are occurring, which is causing urgency.
Because of (2), we're stuck on POSIX with SIGBUS being the only error-condition signal. In that setup, to avoid crashing, we have the following chain of reasoning:
(a) There MUST be code to swap in a dummy page to make the address valid; otherwise the process has no choice but to hard terminate.
(b) Since this is process-global state, a global manager is preferred for visibility to other modules and coordination.
(c) Hiding the tools doesn't actually prevent someone else from writing a bad version, so we might as well expose a good (if dangerous) version.
And the first (maybe only) consumer of this is PersistentMemoryAllocator, which would swap in the dummy page, likely continue and set a flag that no-ops itself, and then at some indeterminate time in the future undo the dummy page.
Results and Assumptions:
(i) Metrics collection no longer crashes Chrome.
(ii) There's reasonable belief that Chrome can and will continue to run for some useful period of time inside this resource exhaustion/error environment,
(iii) OR we can persist enough data to give us good telemetry.
(iv) Those results are worth the cost of introducing the likely subtle, hard to understand, kernel-bug-prone, but probably rarely touched logic of handling SIGBUS.
And that's all preferable to
(α) accepting the current set of crashes
(β) rearchitecting the persistent metrics to do something else (e.g., offload log recording off-process like OOPHP or ETW) such that crashes in the logging system are isolated.
Is that an accurate redux?
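To make the dummy-page step and the "set a flag that no-ops itself" behaviour concrete, here is a minimal Linux-oriented sketch. It is not the design from the doc and not Chromium code: every name (`OnSigbus`, `g_persistence_broken`, `DemoSwapInDummyPage`, the guard range globals) is hypothetical. It also assumes `mmap()` can be called from the handler, which works in practice on Linux but is not on the POSIX list of async-signal-safe functions. Faults outside the guarded range are handed back to the previously installed handler, which is the chaining behaviour the crash-reporting concerns are about.

```cpp
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#include <cassert>
#include <cstring>

namespace {

// All names here are hypothetical; none come from Chromium.
volatile uintptr_t g_guard_begin = 0;  // Start of the guarded mapping.
volatile uintptr_t g_guard_end = 0;    // One past its last byte.
volatile sig_atomic_t g_persistence_broken = 0;
uintptr_t g_page_size = 0;
struct sigaction g_previous_action;

void OnSigbus(int signo, siginfo_t* info, void*) {
  const uintptr_t addr = reinterpret_cast<uintptr_t>(info->si_addr);
  if (addr >= g_guard_begin && addr < g_guard_end) {
    // Replace the faulting page with zero-filled anonymous memory.
    // Assumption: mmap() is usable here even though POSIX does not list it
    // as async-signal-safe.
    void* page = reinterpret_cast<void*>(addr & ~(g_page_size - 1));
    void* mapped = mmap(page, g_page_size, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (mapped == page) {
      g_persistence_broken = 1;  // Tell the owner to stop using the file.
      return;                    // The faulting instruction is retried.
    }
  }
  // Not our fault: give SIGBUS back to whoever owned it before (for example
  // a crash reporter). Returning re-executes the faulting access, which then
  // reaches the restored handler.
  sigaction(signo, &g_previous_action, nullptr);
}

}  // namespace

// Returns true if a write to a truncated file mapping survived via the
// dummy page and the "broken" flag was raised.
bool DemoSwapInDummyPage() {
  const long page = sysconf(_SC_PAGESIZE);
  assert(page > 0);
  g_page_size = static_cast<uintptr_t>(page);
  const size_t length = 2 * g_page_size;

  FILE* file = tmpfile();
  if (!file) return false;
  const int fd = fileno(file);
  if (ftruncate(fd, static_cast<off_t>(length)) != 0) {
    fclose(file);
    return false;
  }
  void* map = mmap(nullptr, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if (map == MAP_FAILED) {
    fclose(file);
    return false;
  }

  struct sigaction action;
  std::memset(&action, 0, sizeof(action));
  action.sa_sigaction = OnSigbus;
  action.sa_flags = SA_SIGINFO;
  sigemptyset(&action.sa_mask);
  sigaction(SIGBUS, &action, &g_previous_action);
  g_guard_begin = reinterpret_cast<uintptr_t>(map);
  g_guard_end = g_guard_begin + length;

  // Simulate the backing store disappearing, then touch the mapping.
  const bool truncated = ftruncate(fd, 0) == 0;
  volatile char* bytes = static_cast<volatile char*>(map);
  bytes[0] = 42;  // Faults with SIGBUS; the handler swaps in a dummy page.
  const bool survived =
      truncated && g_persistence_broken == 1 && bytes[0] == 42;

  // Stop guarding the range and restore the previous handler.
  g_guard_begin = 0;
  g_guard_end = 0;
  sigaction(SIGBUS, &g_previous_action, nullptr);
  munmap(map, length);
  fclose(file);
  return survived;
}
```

Note what the sketch does not solve: after the swap, the write lands in a private zero page and is silently lost, so any logic that later reads that memory sees data unrelated to the file. That is the hazard raised earlier in the thread, and the reason the owner has to check the flag and stop using the region.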
On Tue, Apr 17, 2018 at 10:28 AM Brian White <bcw...@google.com> wrote:
> The driving motivation, per the doc, seems to be "Due to [memory mapped files] use in persistent metrics and the breadcrumbs project, " I wonder if we should examine alternates to that usage?
Both of these involve the persistence of information during abnormal terminations (aka "crashes"). There are very few options to achieve this.
During the original planning, it was expected that Crashpad would be able to grab the "persistent" memory and write it out to a file, but the Crashpad team didn't like that idea and there was serious concern about whether it could even manage that with sufficient accuracy to be useful.
> Trying to handle sigbus (or segv in the future) arising from an OS resource exhaustion scenario via adding of a dummy page seems fraught. My imagination immediately goes to logic reading/writing a value in that (now possibly shared) memory and basic program logic on the results without any awareness of concurrent modifications.
There's no question that it has to be done carefully, but it's not new. The X server uses something similar in case a shared memory segment provided by a client suddenly goes away.
There is certainly a danger in accessing a page of unknown data that has suddenly taken the place of what was expected. Great care is necessary to be confident in reasonable results.
But this isn't something that would suddenly be covering arbitrary memory. Handlers would have to be written to cover a specific memory area of interest for a defined amount of time.
> If the goal is to report better sigbus, can we go for a more narrow solution...which probably still allows a crash?
> If we are attempting to fix the crash, then I wonder if we should examine the architecture of the code causing the sigbus.
We have reporting. The goal here is to not crash for disk errors outside of our control. I'm certainly open to other improvements to the PersistentMemoryAllocator, but when it's used with file-backed memory, it's at the mercy of the OS properly connecting RAM addresses to disk blocks. The SIGBUS handler just adds a method for the OS to inform Chrome of a problem.
Metric persistence is a very useful thing, but we shouldn't sacrifice stability for it. If persistence isn't available, the code can simply stop using it. It already does this, in fact, should it become full or corruption be detected.
-- Brian
On Tue, Apr 17, 2018 at 9:48 AM 'Brian White' via Chromium-dev <chromi...@chromium.org> wrote:
I am quite worried about introducing a handler for SIGBUS errors outside of breakpad/crashpad, especially if, as the doc suggests, that gives the ability of attaching *arbitrary* callbacks to the handler. Handling exceptions in a signal handler is full of footguns, mostly because signal handlers are full of platform-level bugs (e.g., crbug.com/448968, crbug.com/483399, crbug.com/481420, crbug.com/477444, crbug.com/473973). We had quite a number of bugs in breakpad itself before we got chaining of signal handlers right in all versions of Android. How are these arbitrary callbacks not going to hit the same chain of bugs?
P.S: It seems we had this discussion ~1 y ago where concerns were expressed about the impact on crash reporting. What is different in this proposal?