Out of memory (OOM) crashes for Chromium-based browsers


Joe Laughlin

Aug 22, 2019, 5:49:02 PM
to memor...@chromium.org, Mike Decker, Tim Scudder, Mike Rorke, Michael Lynch, Sebastian Poulose

Hello Everyone,

 

The Anaheim Reliability team is working on creating a plan to better analyze and mitigate OOM scenarios in Anaheim. Since OOM is often a symptom of other underlying performance issues (e.g. an increase in working set), we wanted to discuss our plans and any potential areas of collaboration in this space.

 

Please respond if this is an area of interest so we can start the dialog among interested parties.

 

Bruce Dawson

Aug 22, 2019, 6:15:53 PM
to Joe Laughlin, memor...@chromium.org, Mike Decker, Tim Scudder, Mike Rorke, Michael Lynch, Sebastian Poulose
As you know, I am quite interested in reducing Chrome's memory footprint and have been investigating the browser process in crbug.com/982452.


Erik Chen

Aug 22, 2019, 6:16:17 PM
to Joe Laughlin, memor...@chromium.org, Mike Decker, Tim Scudder, Mike Rorke, Michael Lynch, Sebastian Poulose, Kentaro Hara, Benoit Lize, Bruce Dawson, siddhartha sivakumar, Alexei Filippov, Etienne Bergeron, Takashi Sakamoto
Thanks for reaching out! I think that there's a lot of space for potential collaborations to reduce OOMs for Chromium-based browsers. I imagine that there are many people on the Chromium side who might be interested. Maybe you could provide some more context on the scope of potential projects? [e.g. I'm guessing this will be targeted at Windows (10?), and at browser and/or renderer processes?]

+ a bunch of potentially interested people from the Chromium side.


Mike Rorke

Aug 23, 2019, 1:06:26 PM
to Erik Chen, Joe Laughlin, memor...@chromium.org, Mike Decker, Tim Scudder, Michael Lynch, Sebastian Poulose, Kentaro Hara, Benoit Lize, Bruce Dawson, siddhartha sivakumar, Alexei Filippov, Etienne Bergeron, Takashi Sakamoto

Hi! I am a developer on the reliability team for Edge, and we are starting to ramp up our investigations into OOM analysis and mitigation. To start with, we are trying to get an idea of the scale of the problem.

 

We started by looking at crash dumps where the exception indicates the failure was due to a lack of necessary resources (usually memory). We found this only represents part of the problem, though, as the job limits are also involved here: they terminate processes with high memory usage but do not produce dumps for those.

 

Are there any other recommendations for how to measure OOM rate? Are there any measurements/metrics around the job limit (e.g. how often it's hit, whether there are specific URLs that cause it to be hit more often than others, etc.) that might be useful here? OOM rate is often directly tied to the physical hardware, so any usable measure needs to take this into account (e.g. pivoting the failures on RAM size, HDD vs. SSD, etc.) – do you have any common pivots that might be useful here?

 

These are the very initial stages of our investigation and we fully expect to get more involved with proposing actual code changes, etc. as we gain more experience in the area. Initially, our experience will probably skew more towards Windows, but we are also looking to engage with and improve other platforms too.
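For the RAM-size pivot mentioned above, a minimal sketch of the kind of bucketing we have in mind (plain Win32 via GlobalMemoryStatusEx; the bucket boundaries are made up for illustration and are not an existing Edge/Chromium metric):

#include <windows.h>
#include <cstdio>

// Hypothetical bucket boundaries for a RAM-size pivot.
const char* RamBucket(unsigned long long total_phys_bytes) {
  const unsigned long long gib = 1024ull * 1024ull * 1024ull;
  if (total_phys_bytes <= 4 * gib) return "<=4GB";
  if (total_phys_bytes <= 8 * gib) return "<=8GB";
  if (total_phys_bytes <= 16 * gib) return "<=16GB";
  return ">16GB";
}

int main() {
  MEMORYSTATUSEX status = {};
  status.dwLength = sizeof(status);  // Must be set before the call.
  if (GlobalMemoryStatusEx(&status)) {
    printf("RAM pivot bucket: %s\n", RamBucket(status.ullTotalPhys));
  }
  return 0;
}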

Bruce Dawson

Aug 23, 2019, 1:10:49 PM
to Mike Rorke, Erik Chen, Joe Laughlin, memor...@chromium.org, Mike Decker, Tim Scudder, Michael Lynch, Sebastian Poulose, Kentaro Hara, Benoit Lize, siddhartha sivakumar, Alexei Filippov, Etienne Bergeron, Takashi Sakamoto
Are you interested in browser-process OOM or renderer-process OOM?

In my own experience on my home laptop (a 32 GB machine with a growable page file) the renderer-process OOM failures that I occasionally hit are entirely due to badly behaving web pages. I've seen them on msn.com more than on most pages, perhaps due to the choice of ads that run there. In these cases it seems like an OOM failure is actually the best possible outcome, since it stops a runaway page from making the machine useless.

The silent failures when processes hit the job limits (no crash dump) seem like a problem. Are there any thoughts on how to get data on those?

Mike Rorke

Aug 23, 2019, 1:35:42 PM
to Bruce Dawson, Erik Chen, Joe Laughlin, memor...@chromium.org, Mike Decker, Tim Scudder, Michael Lynch, Sebastian Poulose, Kentaro Hara, Benoit Lize, siddhartha sivakumar, Alexei Filippov, Etienne Bergeron, Takashi Sakamoto

For measurement purposes, I think we would like to get data on OOMs from all process types. I think the pivots we would use to aggregate that data would differ per process type (e.g. a renderer process may pivot on host/URL while the browser would pivot on available RAM). We are in early planning for this work right now, so any input you have on useful pivots, etc. would be most welcome.

 

From a user impact point of view, OOM in the browser process is much more disruptive than in other processes, so we would definitely focus more on investigating and mitigating any issues we find there.

 

For the job limits, I assume there are already some histograms available to track the prevalence of these (I haven't yet investigated this myself) which we would want to leverage. Is there any other existing telemetry around this area that you think we might find useful in categorizing these events? As for getting more data, we could try collecting a dump, though that might have a high failure rate in an already memory-stressed system. I was wondering if anyone on this thread knew whether that had been tried before and/or why it isn't done currently? If we are able to get dumps from these processes before they are terminated by the job limit code, we have discussed using a crash-key to try to journal any large memory allocations – we could then mine this data from the dumps to try to detect memory allocation patterns that often result in hitting the job limit.
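For reference, one generic Win32 mechanism for getting a trigger at the moment a job memory limit is hit is to associate an I/O completion port with the job and listen for the memory-limit messages. This is a sketch only, not existing Edge or Chromium sandbox code, and the 512 MiB limit is an arbitrary example:

#include <windows.h>
#include <cstdio>

int main() {
  HANDLE job = CreateJobObjectW(nullptr, nullptr);
  HANDLE port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, nullptr, 0, 1);

  // Route job notifications to the completion port.
  JOBOBJECT_ASSOCIATE_COMPLETION_PORT assoc = {};
  assoc.CompletionKey = job;
  assoc.CompletionPort = port;
  SetInformationJobObject(job, JobObjectAssociateCompletionPortInformation,
                          &assoc, sizeof(assoc));

  // Example per-process commit limit for processes placed in this job.
  JOBOBJECT_EXTENDED_LIMIT_INFORMATION limits = {};
  limits.BasicLimitInformation.LimitFlags = JOB_OBJECT_LIMIT_PROCESS_MEMORY;
  limits.ProcessMemoryLimit = 512 * 1024 * 1024;
  SetInformationJobObject(job, JobObjectExtendedLimitInformation, &limits,
                          sizeof(limits));

  // ... AssignProcessToJobObject(job, child_process) for each monitored child.

  DWORD message = 0;
  ULONG_PTR key = 0;
  LPOVERLAPPED data = nullptr;  // Carries the process id for these messages.
  while (GetQueuedCompletionStatus(port, &message, &key, &data, INFINITE)) {
    if (message == JOB_OBJECT_MSG_PROCESS_MEMORY_LIMIT) {
      // Trigger point: snapshot breadcrumbs, request a dump, or fire an
      // ETW trigger event while the process is still alive.
      printf("Process %llu hit its job memory limit\n",
             (unsigned long long)(ULONG_PTR)data);
    }
  }
  return 0;
}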

 

We also have the ability, for Windows Insider users, to set up a circular buffer of system performance data (background ETL collection, basically) that we can snap and report in response to specified trigger events. We could add a trigger event to the job limit code that could then be used to collect performance traces from users that hit job limits. This type of tracing is very verbose and needs to be quite targeted in order to be useful and not quickly exceed the allowed circular buffer size – so we would probably need to use existing telemetry/crash dumps to nail down specific scenarios for which we wanted to collect more detailed traces.

Erik Chen

Aug 23, 2019, 2:59:31 PM
to Mike Rorke, Sébastien Marchand, Bruce Dawson, Joe Laughlin, memor...@chromium.org, Mike Decker, Tim Scudder, Michael Lynch, Sebastian Poulose, Kentaro Hara, Benoit Lize, siddhartha sivakumar, Alexei Filippov, Etienne Bergeron, Takashi Sakamoto, Will Harris
We have some basic metrics:

We measure Memory.{Browser,Renderer,...}.PrivateMemoryFootprint. On Windows this reports commit charge via PROCESS_MEMORY_COUNTERS_EX::PrivateUsage. See doc 1 and doc 2 for more background. This measurement is tracked reasonably closely [e.g. by default, A/B experiments will show changes to this metric, and we have in-lab testing that reports changes to it.]
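For anyone unfamiliar with that counter, here is a minimal standalone sketch of reading it (plain Win32, not the actual Chromium metrics code):

#include <windows.h>
#include <psapi.h>  // Link against psapi.lib on older SDKs.
#include <cstdio>

int main() {
  PROCESS_MEMORY_COUNTERS_EX pmc = {};
  if (GetProcessMemoryInfo(GetCurrentProcess(),
                           reinterpret_cast<PROCESS_MEMORY_COUNTERS*>(&pmc),
                           sizeof(pmc))) {
    // PrivateUsage is the process's commit charge, in bytes.
    printf("Private usage (commit): %zu MiB\n",
           static_cast<size_t>(pmc.PrivateUsage) / (1024 * 1024));
  }
  return 0;
}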

We've previously invested some time into finding pivots for this data [e.g. process uptime, whether a renderer is foreground or background, URL for renderer, etc.] via a tool named UKM. While this yielded some insights, it didn't end up providing a lot of actionable data. Maybe we didn't find the right pivots to use. We have not put very much time into this recently.

Our crash-reporting toolchain will automatically tag certain types of crashes with OOM. For crashes we also track breadcrumbs: some useful debugging info like process commit charge, system commit limit, system commit charge, etc. I'm not sure that these are carefully tracked, although major shifts to crash rates would show up. I also don't know if OOMing due to hitting a job limit on Windows will emit a crash [or any other metric, for that matter]. Maybe +wfh knows.
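The system-level breadcrumbs (system commit charge and commit limit) can be read with GetPerformanceInfo. A minimal sketch, not the actual breadcrumb-recording code; the values are reported in pages, so multiply by PageSize:

#include <windows.h>
#include <psapi.h>
#include <cstdio>

int main() {
  PERFORMANCE_INFORMATION perf = {};
  perf.cb = sizeof(perf);
  if (GetPerformanceInfo(&perf, sizeof(perf))) {
    const unsigned long long page = perf.PageSize;
    printf("System commit charge: %llu MiB, commit limit: %llu MiB\n",
           (unsigned long long)perf.CommitTotal * page / (1024 * 1024),
           (unsigned long long)perf.CommitLimit * page / (1024 * 1024));
  }
  return 0;
}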

We have one mostly automated tool called memlog for tracking malloc-based memory issues [some public documentation here]. For a subset of live users, we collect anonymized heap profiles [a poisson sampling based snapshot of malloc allocations, with callsite info]. We have automation that filters for large/frequent allocations [that were never freed]. This tends to find and make it easy to root cause a variety of memory issues [dead leaks, live leaks, etc.]. See examples of bugs here [Note: these are Restrict-View-EditIssue]. 
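To give a feel for the sampling approach, here is a conceptual sketch of byte-weighted (Poisson-style) allocation sampling. It is illustrative only, not Chromium's actual sampler: each byte has roughly equal probability of triggering a sample, so large or frequent allocations are recorded more often.

#include <cstddef>
#include <random>

class ByteSampler {
 public:
  explicit ByteSampler(double mean_sample_interval_bytes)
      : mean_(mean_sample_interval_bytes), bytes_left_(NextInterval()) {}

  // Returns true if this allocation should be recorded (with its call stack).
  bool ShouldSample(size_t allocation_size) {
    if (allocation_size < bytes_left_) {
      bytes_left_ -= allocation_size;
      return false;
    }
    bytes_left_ = NextInterval();  // Reset the countdown to the next sample.
    return true;
  }

 private:
  // Exponentially distributed "bytes until next sample" interval.
  double NextInterval() {
    std::exponential_distribution<double> dist(1.0 / mean_);
    return dist(rng_);
  }
  double mean_;
  double bytes_left_;
  std::mt19937_64 rng_{std::random_device{}()};
};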

There are some metrics that we don't have right now that we could use your help on, as experts for the platform. :)

  • # of OOMs, segmented by process type and by cause: system commit charge exhaustion, job limit, page fault failure (I assume this is possible on Windows?). See the classification sketch after this list. Note: all 3 of these are somewhat outside of the control of Chrome, e.g. the user could be running a commit-charge-heavy non-Chrome application, or a popular website could change to use a lot more memory. If there is a regression to a metric, how do we know whether it is caused by Chrome? What is an appropriate follow-up?

  • Relationship between Chrome's memory usage [be it commit charge, working set, etc.] and system performance [e.g. swap/compression thrashing?]. +Sébastien Marchand has been doing some research in this area. Is there a proxy metric we should be using to evaluate overall system performance?
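As a sketch of what the cause segmentation might look like if the breadcrumbs above were captured at crash time (the enum, field names, and 95% thresholds here are all hypothetical, not existing code):

#include <cstdint>

enum class OomCause { kSystemCommitExhaustion, kJobLimit, kOther };

// Hypothetical breadcrumb values recorded at crash time.
struct OomBreadcrumbs {
  uint64_t process_commit_bytes;
  uint64_t system_commit_bytes;
  uint64_t system_commit_limit_bytes;
  uint64_t job_memory_limit_bytes;  // 0 if the process is not in a limited job.
};

OomCause ClassifyOom(const OomBreadcrumbs& b) {
  // System-wide commit exhaustion: total commit charge is within a few percent
  // of the commit limit, regardless of what this process was doing.
  if (b.system_commit_bytes >= b.system_commit_limit_bytes * 95 / 100)
    return OomCause::kSystemCommitExhaustion;
  // Job limit: the process's own commit is at or near its job memory limit.
  if (b.job_memory_limit_bytes != 0 &&
      b.process_commit_bytes >= b.job_memory_limit_bytes * 95 / 100)
    return OomCause::kJobLimit;
  return OomCause::kOther;
}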


> From a user impact point of view, OOM in the browser process is much more disruptive than in other processes, so we would definitely focus more on investigating and mitigating any issues we find there.
Agreed. I browsed through a couple of OOM crashes on our end, and system commit charge limit exhaustion seemed to be a common cause. 

 

> As for getting more data, we could try collecting a dump, though that might have a high failure rate in an already memory-stressed system. I was wondering if anyone on this thread knew whether that had been tried before and/or why it isn't done currently?
We are doing this, and it's been quite successful. :)
Unfortunately, the tools/pipeline are not publicly available. All the hooks are there for your own integration though.


Erik Chen

Aug 23, 2019, 3:06:02 PM
to Mike Rorke, Sébastien Marchand, Bruce Dawson, Joe Laughlin, memor...@chromium.org, Mike Decker, Tim Scudder, Michael Lynch, Sebastian Poulose, Kentaro Hara, Benoit Lize, siddhartha sivakumar, Alexei Filippov, Etienne Bergeron, Takashi Sakamoto, Will Harris
Sorry, prematurely sent email before I finished responding.

> If we are able to get dumps from these processes before they are terminated by the job limit code, we have discussed using a crash-key to try to journal any large memory allocations – we could then mine this data from the dumps to try to detect memory allocation patterns that often result in hitting the job limit.
We haven't tried to correlate our heap profiles with actual OOMs. While that would be nice, it's not clear how that makes the heap profiles more actionable than they already are.

> We also have the ability, for Windows Insider users, to set up a circular buffer of system performance data (background ETL collection, basically) that we can snap and report in response to specified trigger events.
We have some similar tools in Chrome [slow reports, Chrometto is a WIP, UMA sampling profiler, etc.] that are able to provide actionable performance data. Integration with ETL data seems like it could potentially be useful, but also would add additional complexity to the code base. If we could demonstrate that this is useful for finding and fixing bugs, then I think it would be worth integrating.

We're trying to be very targeted with our performance work in the memory space. e.g. if we add a metric that we care about regressions to, then we want a corresponding follow-up mechanism that provides actionable data on regressions in that metric.

Very curious to hear more about your thoughts/prior-art in the space. :)

Erik


Siddhartha S

Aug 23, 2019, 4:45:48 PM
to Erik Chen, Oystein Eftevaag, Mike Rorke, Sébastien Marchand, Bruce Dawson, Joe Laughlin, memor...@chromium.org, Mike Decker, Tim Scudder, Michael Lynch, Sebastian Poulose, Kentaro Hara, Benoit Lize, siddhartha sivakumar, Alexei Filippov, Etienne Bergeron, Takashi Sakamoto, Will Harris
On Fri, Aug 23, 2019 at 12:06 PM Erik Chen <erik...@chromium.org> wrote:
> Sorry, prematurely sent email before I finished responding.

>> If we are able to get dumps from these processes before they are terminated by the job limit code, we have discussed using a crash-key to try to journal any large memory allocations – we could then mine this data from the dumps to try to detect memory allocation patterns that often result in hitting the job limit.
> We haven't tried to correlate our heap profiles with actual OOMs. While that would be nice, it's not clear how that makes the heap profiles more actionable than they already are.

I see the point where you want to investigate OOMs using memory dumps of the process just before the crash or when the crash occurred. This method has not yielded any useful insights in previous trials on Android: OOMs and the memory usage of the process itself were not well correlated. Of course, the cases where an OOM happened have a higher median total memory usage than normal cases, but memory dumps taken at OOM will mostly just show normal process memory usage, which varies from very low to very high. It is more actionable to concentrate on solving problems around high memory usage, even if they did not cause OOMs. This is exactly what memlog does (described by Erik). I believe there is also work around managing memory resources effectively across multiple renderers and purging or suspending unused processes.

On the other hand, OOMs are typically caused by a lot of processes using a lot of memory, and these need not be Chrome processes at all. So understanding OOM causes involves system-wide tracking that can tell us everything happening in the OS that led to the OOM of the browser (for example, the OS incorrectly setting priorities and killing the browser instead of a renderer). On Android we are building a tracing system called Perfetto, which integrates system and Chrome tracing in a single trace to provide that kind of information.
 
>> We also have the ability, for Windows Insider users, to set up a circular buffer of system performance data (background ETL collection, basically) that we can snap and report in response to specified trigger events.
> We have some similar tools in Chrome [slow reports, Chrometto is a WIP, UMA sampling profiler, etc.] that are able to provide actionable performance data. Integration with ETL data seems like it could potentially be useful, but also would add additional complexity to the code base. If we could demonstrate that this is useful for finding and fixing bugs, then I think it would be worth integrating.

As I mentioned above, Chrome has a tracing service backed by Perfetto (Chrometto) which is used to collect tracing data to debug performance issues. Currently we are only able to get traces limited to data that Chrome knows about. This data is collected in the same system as UMA. I am unsure about the possibility of integrating this service with ETL, but it does seem useful.
 

On Fri, Aug 23, 2019 at 11:59 AM Erik Chen <erik...@chromium.org> wrote:
> Our crash-reporting toolchain will automatically tag certain types of crashes with OOM. For crashes we also track breadcrumbs: some useful debugging info like process commit charge, system commit limit, system commit charge, etc. I'm not sure that these are carefully tracked, although major shifts to crash rates would show up. I also don't know if OOMing due to hitting a job limit on Windows will emit a crash [or any other metric, for that matter]. Maybe +wfh knows.

We also track all kinds of browser crashes by checking whether the last session ended in a clean shutdown or a crash; if it was not a clean shutdown, a crash is recorded. There should be a way to take the difference between the total number of crashes and the number of crash reports to identify unexpected failures. Maybe we can assume these to be OOMs? I'm not sure about other causes where we wouldn't get a crash report.
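A minimal sketch of that "clean shutdown" check, using a marker file. This is illustrative only: the file name and location are made up, and it is not Chrome's actual mechanism.

#include <cstdio>
#include <filesystem>

namespace fs = std::filesystem;
// Hypothetical beacon path for illustration.
const fs::path kBeacon = fs::temp_directory_path() / "browser_running.beacon";

bool PreviousSessionCrashed() { return fs::exists(kBeacon); }

void OnStartup() {
  // If the beacon is still present, the previous session never reached a
  // clean shutdown: record a "crash" (which may or may not have produced a
  // crash report, e.g. a job-limit kill would not).
  if (PreviousSessionCrashed())
    printf("Previous session did not shut down cleanly\n");
  FILE* f = fopen(kBeacon.string().c_str(), "w");
  if (f) fclose(f);
}

void OnCleanShutdown() { fs::remove(kBeacon); }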
 

Bruce Dawson

Aug 23, 2019, 4:52:34 PM
to Siddhartha S, Erik Chen, Oystein Eftevaag, Mike Rorke, Sébastien Marchand, Joe Laughlin, memor...@chromium.org, Mike Decker, Tim Scudder, Michael Lynch, Sebastian Poulose, Kentaro Hara, Benoit Lize, siddhartha sivakumar, Alexei Filippov, Etienne Bergeron, Takashi Sakamoto, Will Harris
We definitely can't assume that unexpected failures are OOM crashes. There are (at least on Windows) many types of unexpected failures other than job-limit OOMs. Heap corruption, stack-canary overwrites, and some other security-related failures don't record a crash dump on Windows. It would be great to have a way to get crash dumps from these failures, and crash dumps from job-limit OOMs, just so that we don't have blind spots around these failure types.

Mike Rorke

Aug 23, 2019, 5:11:34 PM
to Bruce Dawson, Siddhartha S, Erik Chen, Oystein Eftevaag, Sébastien Marchand, Joe Laughlin, memor...@chromium.org, Mike Decker, Tim Scudder, Michael Lynch, Sebastian Poulose, Kentaro Hara, Benoit Lize, siddhartha sivakumar, Alexei Filippov, Etienne Bergeron, Takashi Sakamoto, Will Harris

Thanks all for the info! I will take this back to my team for discussion and we will let you know what our plans are.

 

For the issue of getting crash dumps when we hit a job limit – when we run chrome://memory-exhaust, I do not see a crash dump generated. I assumed this means I won’t get any crash dumps for cases where we hit the job limit, but that might be an incorrect assumption.
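For what it's worth, a process inside a job can at least inspect its own job memory limit and peak usage, which could feed telemetry about how close we run to the limit even when the eventual kill produces no dump. This is generic Win32 via QueryInformationJobObject, not existing Edge/Chromium code:

#include <windows.h>
#include <cstdio>

int main() {
  JOBOBJECT_EXTENDED_LIMIT_INFORMATION info = {};
  // A NULL job handle means "the job of the calling process".
  if (QueryInformationJobObject(nullptr, JobObjectExtendedLimitInformation,
                                &info, sizeof(info), nullptr)) {
    if (info.BasicLimitInformation.LimitFlags & JOB_OBJECT_LIMIT_PROCESS_MEMORY) {
      printf("Job process-memory limit: %zu MiB, peak used: %zu MiB\n",
             info.ProcessMemoryLimit / (1024 * 1024),
             info.PeakProcessMemoryUsed / (1024 * 1024));
    }
  }
  return 0;
}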

 

For issues like heap corruption, I do see crash dumps generated, though they often go through the Windows OS crash handler rather than Crashpad.

 

To find the crashes we believe are due to OOM, we are filtering all crash dumps for the following:

  • Look for specific exception codes that indicate a memory allocation failure (e.g. 0xe0000008 - ERROR_NOT_ENOUGH_MEMORY).
  • Look for issues where the system profile shows peak commit at or near the commit limit (i.e. the system is out of memory overall). As mentioned below, this is often due to other processes running on the machine and not just the browser, so making sense of these often requires more detailed analysis. One thought is to apply a further filter that pulls out cases where our binary is the top user of memory.
  • Look for crash stacks that include well-known APIs dealing with OOM-type scenarios (e.g. ReportOOMErrorInMainThread).

Bruce Dawson

Aug 23, 2019, 6:14:47 PM
to Mike Rorke, Siddhartha S, Erik Chen, Oystein Eftevaag, Sébastien Marchand, Joe Laughlin, memor...@chromium.org, Mike Decker, Tim Scudder, Michael Lynch, Sebastian Poulose, Kentaro Hara, Benoit Lize, siddhartha sivakumar, Alexei Filippov, Etienne Bergeron, Takashi Sakamoto, Will Harris
Getting Chrome access to WER crash dumps in a useful way is an ongoing discussion. I'm not sure of the details, but I think a big part of the problem has always been that WER can't ingest Chrome's Canary symbols quickly enough, so it can't do useful bucketing. Until that is resolved we don't get any heap-corruption crash dumps. Ditto for other fast-fail crashes.