Proposed OOM improvements.


Greg Spencer

Aug 4, 2010, 6:24:40 PM
to chromium-dev
[I also already sent this to chromium-os-dev under a separate e-mail for various annoying reasons]

Hi Folks,

Here's my proposal for improving OOM situations on ChromeOS. In a nutshell, the idea is that we'll tune the OOM killer's algorithm to match what we want, and make the UI more explicit about what happened when a tab is killed by the OOM killer.
Please let me know if you have any suggestions/comments.
-Greg.

Document link: https://docs.google.com/a/google.com/document/edit?id=1ddPY1-v7ZFr0jmhuxw04ehNLMQrPzxHhL6vUoBzELBo&hl=en

(but that probably won't work outside google.com: see below for full text).

-------------------

Out of Memory Management for ChromeOS

Greg Spencer (gspencer), ChromeOS UI team.

Intro

Like all computers, ChromeOS devices have limited memory, and bad things happen when we run out of physical memory.  We’d like to make ChromeOS be more elegant than most OSs when it runs into this situation.  To that end, we’re looking to improve the user experience around out of memory (OOM) conditions.

Current State

Currently, when a ChromeOS device runs out of memory, processes are killed by the OOM killer (a part of the kernel) until enough memory is available.  Because we have no swap configured, but do allow overcommit (i.e. malloc pretends it has nearly unlimited memory when handing out addresses), eventually a process tries to use memory assigned to it in the virtual address space that isn’t actually available, and the kernel asks the OOM killer to kill processes on the system until enough memory is available.  The processes killed don’t necessarily include the one that started to use the unavailable memory; they are chosen by the OOM killer’s “badness” algorithm.

We don’t have a swap partition configured because we’re afraid that it will start killing blocks on the SSD after an unreasonably short time.  [I haven’t verified that this is indeed a problem, but I’m assuming that the original decision wasn’t made in a vacuum.  It does seem to me that wear leveling in the SSD hardware should mitigate this somewhat, however].

Currently, the renderers, plugins, and browser processes are (not surprisingly) the largest users of memory on Chrome OS.  The renderers and plugins can be killed without crashing the system, but killing the browser process (which can grow quite large) causes the entire session to restart, so we want to kill that as a last resort (or at least after all the renderers and plugins have been killed).

Linux Chrome already uses the /proc/<PID>/oom_adj method to prioritize renderers and plugins over the browser process (or system processes) for killing by the OOM killer, so that they’ll be killed in the right order.  This works fairly well, but the default OOM killer algorithm prefers to kill recent processes over older ones, which is not quite optimal for us: we would prefer to kill older tabs before newer ones, and non-pinned tabs before pinned ones.  [The file /proc/<PID>/oom_adj contains a bit-shift value from -17 to 15 that adjusts the badness value of the process.]
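
As a rough illustration of the bit-shift semantics described above, here is a simplified model of how an oom_adj value scales a badness score.  This is a sketch, not the kernel's code: apply_oom_adj and OOM_DISABLE are names chosen for this example, and treating -17 as "never kill this process" is an assumption based on my understanding of the kernel's disable value.

```python
OOM_DISABLE = -17  # assumption: this value exempts a process from OOM killing


def apply_oom_adj(badness: int, oom_adj: int) -> int:
    """Scale a badness score by an oom_adj bit shift (simplified model)."""
    if not -17 <= oom_adj <= 15:
        raise ValueError("oom_adj must be in the range -17..15")
    if oom_adj == OOM_DISABLE:
        return 0  # modelled here as "never selected"
    if oom_adj > 0:
        return badness << oom_adj  # positive values make the process look worse
    return badness >> -oom_adj     # negative values make it look better


# With equal raw badness, a renderer at oom_adj 5 scores 32x a browser at 0.
print(apply_oom_adj(1000, 5))   # 32000
print(apply_oom_adj(1000, 0))   # 1000
print(apply_oom_adj(1000, -3))  # 125
```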

Also, when things are killed, the “Sad Tab” page is displayed, which doesn’t communicate the nature of the failure.

Possible Methods of Controlling Memory Usage on ChromeOS

[This is the brainstorming part of the doc.  Not all of these will be implemented.]

Kernel Level
  • Change overcommit behavior (change “overcommit_ratio”), to encourage more NULLs being returned from malloc instead of the OOM killer getting happy and killing stuff randomly.   This might not actually help things -- it'll mean that the process that is trying to allocate always gets killed via segfault instead of another less important process.
  • Use the mem_notify kernel module to send notifications when thresholds are reached, if we aren't already.   This is useful for clearing caches, garbage collecting, etc., but isn’t a solution to the overall problem. This may be useful for marking which tabs are killed by the OOM killer and which are killed for other reasons.
  • Severely re-nice or stop processes that abuse memory in order to have resources to let the user pick what to do (but it may not be possible for this to happen fast enough in all cases).
  • Setup some small swap space (e.g. 50M) so that any very static data in memory gets swapped out.  We currently have at least 25M of data that never gets accessed again once the app is loaded.
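
To give a feel for what tuning overcommit_ratio would mean: as I understand the kernel's overcommit accounting, the ratio only takes effect in strict mode (vm.overcommit_memory = 2), where the commit limit is swap plus overcommit_ratio percent of physical RAM (ignoring huge pages).  The arithmetic below is a sketch under that assumption; the 1 GiB RAM figure is a made-up example, not a measurement from a ChromeOS device.

```python
MIB = 1024 * 1024


def commit_limit_bytes(ram_bytes: int, swap_bytes: int, overcommit_ratio: int) -> int:
    """Commit limit in strict overcommit mode, ignoring huge pages."""
    return swap_bytes + ram_bytes * overcommit_ratio // 100


# Hypothetical device with 1 GiB of RAM and a ratio of 50.
print(commit_limit_bytes(1024 * MIB, 0, 50) // MIB)         # 512 (no swap)
print(commit_limit_bytes(1024 * MIB, 50 * MIB, 50) // MIB)  # 562 (with 50M swap)
```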


Chrome Level Changes
  • Respond to mem_notify events (in order of how draconian they are) with actions that don’t require user notification.  This is by its nature a bandaid, as any memory sponge will quickly eat up the freed memory:
    • Flushing memory HTML caches
    • Garbage collecting all V8, Crore, Flash instances.
    • Sharing renderers among more tabs, killing some renderers.  [Darin says this probably won’t gain us much -- only means we can share a few more font tables, etc, and will slow things down considerably due to swapping out DOMs, etc.]
    • Empty Flash and HTML5 audio/video buffers (and maybe notify user because they’ll all start rebuffering if they are playing).
  • Other measures that may require user notification:
    • Closing no-content tabs (new tab pages, about: pages).
    • Closing windows that only have non-content tabs in them (e.g. an empty window running with just the new tab page).
  • Reduce memory usage in the first place by:
    • mmap’ing large images (which would get swapped out on low memory by the kernel).  [We may already be doing this]

Implementation Plan

Given that some of the suggestions above require more work than others, I’m planning to pick the low-hanging items first, see how much bang that gives us, and then move on to more time-consuming mitigation if that’s not sufficient.

Phase 1 -- Tune OOM killer algorithm

I'm going to collect the following information:
  • Whether or not a tab is pinned
  • When was the last time the user clicked on or entered something into the tab
  • When was the last time the user clicked on the tab to make it current
  • How much memory the tab is using

And then I'm going to come up with an algorithm (TBD) that ranks tabs based on these criteria.  The algorithm will probably prefer to kill tabs that aren’t pinned, have been idle for the longest, and use the most memory.  It’ll probably kill plugins before killing renderers.
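
The algorithm is still TBD, but one possible shape for it is a simple lexicographic sort: unpinned before pinned, then least recently used, then largest memory footprint.  The sketch below assumes that strict priority order, which is only one of several reasonable choices; TabInfo, its fields, and kill_order are hypothetical names for illustration, not existing Chrome code.

```python
from dataclasses import dataclass


@dataclass
class TabInfo:
    title: str
    pinned: bool
    last_interaction: float  # time of last click or keystroke inside the tab
    last_selected: float     # time the tab was last made current
    memory_bytes: int


def kill_order(tabs):
    """Return tabs sorted so that the preferred victim comes first."""
    return sorted(
        tabs,
        key=lambda t: (
            t.pinned,                                   # unpinned (False) first
            max(t.last_interaction, t.last_selected),   # longest idle first
            -t.memory_bytes,                            # bigger tabs first on ties
        ),
    )


tabs = [
    TabInfo("mail", True, 100.0, 100.0, 300),
    TabInfo("news", False, 50.0, 60.0, 200),
    TabInfo("video", False, 10.0, 20.0, 500),
]
print([t.title for t in kill_order(tabs)])  # ['video', 'news', 'mail']
```

A weighted score would let a huge, slightly more recent tab outrank a tiny idle one; the strict ordering here never does that, which may or may not be what we want.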

I'm going to write a manager into the browser process that every so often (every five seconds or so) adjusts the oom_adj value of all the renderers and plugins to sort them based on the algorithm above.  I will probably only need to adjust them a little -- the current renderer and plugin processes get an adjustment of five (which shifts the badness up by five bits).  I'll probably just have three to five different levels of badness to assign, starting at five (where larger is more likely to be killed).
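
As a sketch of what the manager's periodic pass could compute, the function below spreads an already-ranked list of process IDs across a handful of oom_adj levels starting at five.  The bucketing scheme, the function names, and the default of five levels are assumptions for illustration.  write_oom_adj shows the /proc write itself and is not called here, since it needs a real PID and may fail with a permission error.

```python
def assign_oom_adj(pids_most_killable_first, base=5, levels=5):
    """Map ranked PIDs onto oom_adj values base..base+levels-1.

    The most killable processes get the highest value.
    """
    count = len(pids_most_killable_first)
    assignment = {}
    for rank, pid in enumerate(pids_most_killable_first):
        bucket = (levels - 1) - (rank * levels // count)
        assignment[pid] = base + bucket
    return assignment


def write_oom_adj(pid, value):
    """Write the adjustment for one process (requires sufficient privileges)."""
    with open(f"/proc/{pid}/oom_adj", "w") as f:
        f.write(str(value))


# Hypothetical PIDs, ranked from most to least killable.
print(assign_oom_adj([101, 102, 103, 104, 105]))
# {101: 9, 102: 8, 103: 7, 104: 6, 105: 5}
```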

I'm going to change the UI so that when a tab is killed by the OOM killer, it displays a page different from the “Sad Tab” page that tells the user what happened and why, and gives them the option to reload the page.  Determining that the OOM killer was responsible may be a little tricky, as there really isn’t a lot of warning when it kills your process.

It has been suggested that we just let the OOM killer kill a tab, mark it, and just reload it the next time the user visits it.  We can test this, but my feeling is that the user will occasionally be very surprised to find that this happens, that some web apps will handle this poorly, and that losing user data on reload is something we need to explicitly notify the user about.  It seems to me that if we can’t guarantee full reload (save DOM state, JavaScript variable state, plugin state, etc.), this is a shabby thing to do: it’s cleaner to tell them why we killed it and let them decide if they want to chance reloading it.

As I’m implementing this, I’ll write a test that will exercise the OOM killer algorithm.  Hopefully that’s not too tricky to get into our testing framework without being flaky.

Phase 1.1 -- Add in Networking info to OOM killer tuning.

Collecting the last time a tab accessed the network is complicated to implement (e.g. sandboxed network access happens in another process and so has to be tracked back to a renderer), so I’ll implement that only if we think it’ll help with tuning.  The main idea here is that music streaming apps might be likely to be killed based on the other criteria, so this helps recognize tabs that are streaming in the background.  The fallback is to have the user pin streaming tabs.

Phase 2 -- Notify user when memory is getting low

In this phase we post some kind of notification when we get a mem_notify event that we’re low on memory.  At that point, we can ask the user to kill off memory intensive applications.  This will require a UI similar to the task manager (it might even be the task manager) so that the user can make informed choices about what to kill.  In order to be able to display this UI when the memory is low, we’ll have to pre-allocate it and keep it around until needed.

This feels like a pretty heavy UI, and I’m not sure all users will feel qualified to decide what to kill.  Maybe just give them a choice of the top five candidates for killing?
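
Picking the top five candidates is cheap to compute.  A minimal sketch, assuming we already have per-tab memory figures (the names and numbers below are made up, in MB):

```python
import heapq


def top_candidates(memory_by_name, count=5):
    """Return the `count` largest memory users as (name, memory) pairs."""
    return heapq.nlargest(count, memory_by_name.items(), key=lambda item: item[1])


usage_mb = {
    "Mail": 180, "Maps": 420, "Video": 650, "Docs": 210,
    "News": 90, "Game": 510, "New Tab": 30,
}
print(top_candidates(usage_mb))
# [('Video', 650), ('Game', 510), ('Maps', 420), ('Docs', 210), ('Mail', 180)]
```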

Phase 3 -- Flush all caches on mem_notify events

In this phase we try to flush all available caches in the OS -- plugins, browsers, etc. -- when we get our first mem_notify event that we’re out of memory.  This step seems like a bandaid -- it’s only going to help the first time it happens, and thereafter there will be almost nothing freed until the caches have time to refill.  This would, however, be good in combination with user notification that there is too much memory being used, since it may buy the user some time to manage their tabs.


Evan Martin

Aug 4, 2010, 6:31:26 PM
to gspe...@google.com, chromium-dev
On Wed, Aug 4, 2010 at 3:24 PM, Greg Spencer <gspe...@chromium.org> wrote:
> Setup some small swap space (e.g. 50M) so that any very static data in
> memory gets swapped out.  We currently have at least 25M of data that never
> gets accessed again once the app is loaded.

What is this? Can we fix it?

Greg Spencer

Aug 4, 2010, 7:31:40 PM
to Evan Martin, chromium-dev
From what I understand, this is static data and code contained in (mostly) third party libraries.
We either execute it once and never again, or allocate some statics and never use them.
Zel and Dave Moore have more information.

The simplest fix seems to me to be what I proposed.  Sifting through third party libraries to remove statics seems futile.

-Greg.

Adam Langley

Aug 4, 2010, 7:38:54 PM
to gspe...@google.com, Evan Martin, chromium-dev
On Wed, Aug 4, 2010 at 7:31 PM, Greg Spencer <gspe...@chromium.org> wrote:
> From what I understand, this is static data and code contained in (mostly)
> third party libraries.
> We either execute it once and never again, or allocate some statics and
> never use them.
> Zel and Dave Moore have more information.
> The simplest fix seems to me to be what I proposed.  Sifting through third
> party libraries to remove statics seems futile.

Code pages are backed by files on disk and will be 'swapped out' by
the kernel if need be. (The memory pages are simply dropped). So it
should only be unbacked pages which can waste memory.

Changing the oom_adj value, when using the SUID sandbox, means forking
off a SUID process to change each one. That's a massive hack and, if
it concerns you, you can probably tweak the kernel so that the oom_adj
inode isn't locked down by making a process non-dumpable.

AGL

Greg Spencer

Aug 4, 2010, 7:43:46 PM
to Adam Langley, Evan Martin, chromium-dev
On Wed, Aug 4, 2010 at 4:38 PM, Adam Langley <a...@chromium.org> wrote:
> On Wed, Aug 4, 2010 at 7:31 PM, Greg Spencer <gspe...@chromium.org> wrote:
> > From what I understand, this is static data and code contained in (mostly)
> > third party libraries.
> > We either execute it once and never again, or allocate some statics and
> > never use them.
> > Zel and Dave Moore have more information.
> > The simplest fix seems to me to be what I proposed.  Sifting through third
> > party libraries to remove statics seems futile.
>
> Code pages are backed by files on disk and will be 'swapped out' by
> the kernel if need be. (The memory pages are simply dropped). So it
> should only be unbacked pages which can waste memory.

Oh, right.  So it must just be the statics.
 
> Changing the oom_adj value, when using the SUID sandbox, means forking
> off a SUID process to change each one. That's a massive hack and, if
> it concerns you, you can probably tweak the kernel so that the oom_adj
> inode isn't locked down by making a process non-dumpable.

Yeah, I've considered modifying the kernel in this way for exactly that reason, but I was going to go through the suid helper initially just to see if that gave us the behavior we wanted.  It is a massive hack.  :)

-Greg. 
