Hello Everyone,
The Anaheim Reliability team is working on creating a plan to better analyze and mitigate OOM scenarios in Anaheim. Since OOM is often a symptom of other underlying performance issues (e.g. an increase in working set), we wanted to discuss our plans and any potential areas of collaboration in this space.
Please respond if this is an area of interest so we can start a dialog among the interested parties.
--
Hi! I am a developer on the reliability team for Edge and we are starting to ramp up our investigations into OOM analysis and mitigation. To start with, we are trying to get an idea of the scale of the problem.
We started by looking at crash dumps where the exception indicates the failure was due to a lack of necessary resources (usually memory). We found this only represents part of the problem, though, as the job limits are also involved here: they terminate processes with high memory usage but do not send dumps for those cases.
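(As an illustration of how we might observe those terminations at all, here is a minimal sketch, not code we have written: it assumes the sandbox job handle is available and that its limits are configured to deliver notifications, and it associates an I/O completion port with the job so the memory-limit messages can be counted.)

    // Sketch only: count job memory-limit notifications for a job object.
    // Assumes |job| is a handle to the job that hosts the child processes and
    // that its limits are set up so these messages are actually delivered.
    #include <windows.h>
    #include <cstdio>

    void CountJobMemoryLimitHits(HANDLE job) {
      HANDLE port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, nullptr, 0, 1);
      JOBOBJECT_ASSOCIATE_COMPLETION_PORT assoc = {};
      assoc.CompletionKey = job;
      assoc.CompletionPort = port;
      SetInformationJobObject(job, JobObjectAssociateCompletionPortInformation,
                              &assoc, sizeof(assoc));

      DWORD message = 0;
      ULONG_PTR key = 0;
      LPOVERLAPPED data = nullptr;  // For job messages this carries the process id.
      while (GetQueuedCompletionStatus(port, &message, &key, &data, INFINITE)) {
        if (message == JOB_OBJECT_MSG_JOB_MEMORY_LIMIT ||
            message == JOB_OBJECT_MSG_PROCESS_MEMORY_LIMIT) {
          // In real code this would feed a histogram/telemetry counter.
          printf("Job memory limit hit by process %u\n",
                 static_cast<unsigned>(reinterpret_cast<ULONG_PTR>(data)));
        }
      }
    }

Counting those messages would at least give us a job-limit hit rate to compare against the dump-based numbers.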
Are there any other recommendations for how to measure OOM rate? Are there any measurements/metrics around the job limit (e.g. how often it's hit, whether specific URLs cause it to be hit more often than others, etc.) that might be useful here? OOM rate is often directly tied to the physical hardware, so any usable measure needs to take this into account (e.g. pivoting the failures on RAM size, HDD vs. SSD, etc.) – do you have any common pivots that might be useful here?
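(To make the hardware pivot concrete, here is a minimal sketch of the kind of client-side bucketing we have in mind; the bucket boundaries are arbitrary and the reporting side is omitted.)

    // Sketch only: classify the machine by installed RAM so OOM rates can be
    // pivoted on hardware class instead of being reported as a single number.
    #include <windows.h>

    const char* PhysicalRamBucket() {
      MEMORYSTATUSEX status = {};
      status.dwLength = sizeof(status);
      if (!GlobalMemoryStatusEx(&status))
        return "unknown";
      const DWORDLONG gb = status.ullTotalPhys / (1024ull * 1024 * 1024);
      if (gb < 4) return "<4GB";
      if (gb < 8) return "4-8GB";
      if (gb < 16) return "8-16GB";
      return ">=16GB";
    }
    // An OOM report would then be annotated with PhysicalRamBucket() so the
    // failure rate can be computed per bucket on the backend.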
These are the very initial stages of our investigation and we fully expect to get more involved with proposing actual code changes, etc. as we gain more experience in the area. Initially, our experience will probably skew more towards Windows, but we are also looking to engage and improve other platforms too.
For measurement purposes, I think we would like to get data on OOMs from all process types. I think the pivots we would use to aggregate that data would differ per process type (e.g. a renderer process may pivot on host/URL while the browser process would pivot on available RAM). We are in early planning for this work right now, so any input you have into useful pivots, etc. would be most welcome.
From a user impact point of view, OOM on the browser process is much more disruptive than other processes, so we would definitely focus more on investigating and mitigating any issues we find there.
For the job limits, I assume there are already some histograms available to track the prevalence of these (I haven't yet investigated this myself) which we would want to leverage. Is there any other existing telemetry around this area that you think we might find useful in categorizing these events? As for getting more data, we could try collecting a dump, though that might have a high failure rate in an already memory-stressed system. I was wondering if anyone on this thread knew whether that has been tried before and/or why it isn't done currently? If we are able to get dumps from these processes before they are terminated by the job limit code, we have discussed using a crash-key to try and journal any large memory allocations – we could then mine this data from the dumps, to try and detect memory allocation patterns that often result in hitting the job limit.
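(To illustrate the crash-key idea, a rough sketch only, assuming Chromium's crash_reporter::CrashKeyString helper; the key name, size threshold, and hook point are hypothetical, and the journal here is deliberately tiny and not thread-safe.)

    // Sketch only: keep a small rolling journal of large allocations in a crash
    // key so it shows up in any dump collected before the job limit kills us.
    #include <string>
    #include "components/crash/core/common/crash_key.h"

    void JournalLargeAllocation(size_t size_in_bytes) {
      constexpr size_t kLargeAllocationThreshold = 64 * 1024 * 1024;  // 64 MiB.
      if (size_in_bytes < kLargeAllocationThreshold)
        return;

      static crash_reporter::CrashKeyString<256> journal_key("large-alloc-journal");
      static std::string journal;  // Newest entries last; illustration only.
      journal += std::to_string(size_in_bytes / (1024 * 1024)) + "MB;";
      if (journal.size() > 256)
        journal.erase(0, journal.size() - 256);  // Stay within the key's capacity.
      journal_key.Set(journal);
    }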
We also have the ability, for Windows Insider users, to set up a circular buffer of system performance data (essentially background ETL collection) that we can snap and report in response to specified trigger events. We could add a trigger event to the job limit code that could then be used to collect performance traces from users that hit job limits. This type of tracing is very verbose and needs to be quite targeted in order to be useful and not quickly exceed the allowed circular buffer size – so we would probably need to use existing telemetry/crash dumps to nail down the specific scenarios for which we want to collect more detailed traces.
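(A sketch of what the trigger side might look like, assuming a TraceLogging ETW provider; the provider name, GUID, and event fields below are placeholders rather than an existing provider. The background circular-buffer session itself would be configured outside the browser and would snap when it sees this event.)

    // Sketch only: emit an ETW event that a pre-configured circular-buffer
    // trace session can use as its trigger to snapshot and upload the buffer.
    #include <windows.h>
    #include <TraceLoggingProvider.h>

    // Placeholder provider; in real code it would be registered once at startup.
    TRACELOGGING_DEFINE_PROVIDER(
        g_jobLimitProvider, "Edge.Reliability.JobLimit",
        (0x01234567, 0x89ab, 0xcdef, 0x01, 0x23, 0x45, 0x67, 0x89, 0xab, 0xcd, 0xef));

    void ReportJobLimitHit(DWORD process_id, ULONGLONG commit_bytes) {
      TraceLoggingRegister(g_jobLimitProvider);
      TraceLoggingWrite(g_jobLimitProvider, "JobMemoryLimitHit",
                        TraceLoggingUInt32(process_id, "ProcessId"),
                        TraceLoggingUInt64(commit_bytes, "CommitBytes"));
      TraceLoggingUnregister(g_jobLimitProvider);
    }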
Sorry, I prematurely sent that email before I finished responding.

> If we are able to get dumps from these processes before they are terminated by the job limit code, we have discussed using a crash-key to try and journal any large memory allocations – we could then mine this data from the dumps, to try and detect memory allocation patterns that often result in hitting the job limit.

We haven't tried to correlate our heap profiles with actual OOMs. While that would be nice, it's not clear how that makes the heap profiles more actionable than they already are.
> We also have the ability, for Windows Insider users, to set up a circular buffer of system performance data (essentially background ETL collection) that we can snap and report in response to specified trigger events.

We have some similar tools in Chrome [slow reports, Chrometto (a WIP), the UMA sampling profiler, etc.] that are able to provide actionable performance data. Integration with ETL data seems like it could potentially be useful, but it would also add complexity to the code base. If we could demonstrate that this is useful for finding and fixing bugs, then I think it would be worth integrating.
We're trying to be very targeted with our performance work in the memory space, e.g. if we add a metric that we care about regressions to, then we want a corresponding follow-up mechanism that provides actionable data for regressions in the metric.

Very curious to hear more about your thoughts/prior-art in the space. :)

Erik

On Fri, Aug 23, 2019 at 11:59 AM Erik Chen <erik...@chromium.org> wrote:

> We have some basic metrics:
>
> We measure Memory.{Browser,Renderer,...}.PrivateMemoryFootprint. On Windows this reports commit charge via PROCESS_MEMORY_COUNTERS_EX::PrivateUsage. See doc 1 and doc 2 for more background. This measurement is tracked reasonably closely [e.g. by default, A/B experiments will show changes to this metric, and we have in-lab testing that reports changes to this metric].
>
> We've previously invested some time into finding pivots for this data [e.g. process uptime, whether a renderer is foreground or background, URL for a renderer, etc.] via a tool named UKM. While this yielded some insights, it didn't end up providing a lot of actionable data. Maybe we didn't find the right pivots to use. We have not put very much time into this recently.
>
> Our crash-reporting toolchain will automatically tag certain types of crashes with OOM. For crashes we also track breadcrumbs: some useful debugging info like process commit charge, system commit limit, system commit charge, etc. I'm not sure that these are carefully tracked, although major shifts to crash rates would show up. I also don't know if OOMing due to hitting a job limit on Windows will emit a crash [or any other metric, for that matter]. Maybe +wfh knows.
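(For anyone less familiar with the Windows side, a minimal sketch of the measurement described above, reading the private commit charge that PrivateMemoryFootprint reports via PROCESS_MEMORY_COUNTERS_EX::PrivateUsage; the helper name is ours.)

    // Sketch only: read the per-process private commit charge that
    // Memory.*.PrivateMemoryFootprint reports on Windows.
    #include <windows.h>
    #include <psapi.h>

    ULONGLONG GetPrivateMemoryFootprintBytes() {
      PROCESS_MEMORY_COUNTERS_EX pmc = {};
      if (!GetProcessMemoryInfo(GetCurrentProcess(),
                                reinterpret_cast<PROCESS_MEMORY_COUNTERS*>(&pmc),
                                sizeof(pmc))) {
        return 0;
      }
      return pmc.PrivateUsage;  // Private commit charge, in bytes.
    }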
Thanks all for the info! I will take this back to my team for discussion and we will let you know what our plans are.
For the issue of getting crash dumps when we hit a job limit – when we run chrome://memory-exhaust, I do not see a crash dump generated. I assumed this means I won’t get any crash dumps for cases where we hit the job limit, but that might be an incorrect assumption.
For issues like heap corruption, I do see crash dumps generated, though they often go through the Windows OS crash handler, rather than crashpad.
To find the crashes we believe are due to OOM, we are filtering all crash dumps for the following: