[patch 0/7 -mm] oom killer rewrite

David Rientjes

unread,

Feb 10, 2010, 11:40:02 AM2/10/10

to

This patchset is a rewrite of the out of memory killer to address several
issues that have been raised recently. The most notable change is a
complete rewrite of the badness heuristic that determines which task is
killed; the goal was to make it as simple and predictable as possible
while still addressing issues that plague the VM.

This patchset is based on mmotm-2010-02-05-15-06 because of the following
dependencies:

[patch 4/7] oom: badness heuristic rewrite:
mm-count-swap-usage.patch

[patch 5/7] oom: replace sysctls with quick mode:
sysctl-clean-up-vm-related-variable-delcarations.patch

To apply to mainline, download 2.6.33-rc7 and apply

mm-clean-up-mm_counter.patch
mm-avoid-false-sharing-of-mm_counter.patch
mm-avoid-false_sharing-of-mm_counter-checkpatch-fixes.patch
mm-count-swap-usage.patch
mm-count-swap-usage-checkpatch-fixes.patch
mm-introduce-dump_page-and-print-symbolic-flag-names.patch
sysctl-clean-up-vm-related-variable-declarations.patch
sysctl-clean-up-vm-related-variable-declarations-fix.patch

from http://userweb.kernel.org/~akpm/mmotm/broken-out.tar.gz first.
---
Documentation/filesystems/proc.txt | 78 ++++---
Documentation/sysctl/vm.txt | 51 ++---
fs/proc/base.c | 13 +-
include/linux/mempolicy.h | 13 +-
include/linux/oom.h | 18 +-
kernel/sysctl.c | 15 +-
mm/mempolicy.c | 39 +++
mm/oom_kill.c | 455 +++++++++++++++++++-----------------
mm/page_alloc.c | 3 +
9 files changed, 383 insertions(+), 302 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

David Rientjes

unread,

Feb 10, 2010, 11:40:02 AM2/10/10

to

This a complete rewrite of the oom killer's badness() heuristic which is
used to determine which task to kill in oom conditions. The goal is to
make it as simple and predictable as possible so the results are better
understood and we end up killing the task which will lead to the most
memory freeing while still respecting the fine-tuning from userspace.

The baseline for the heuristic is a proportion of memory that each task
is currently using in memory plus swap compared to the amount of
"allowable" memory. "Allowable," in this sense, means the system-wide
resources for unconstrained oom conditions, the set of mempolicy nodes,
the mems attached to current's cpuset, or a memory controller's limit.
The proportion is given on a scale of 0 (never kill) to 1000 (always
kill), roughly meaning that if a task has a badness() score of 500 that
the task consumes approximately 50% of allowable memory resident in RAM
or in swap space.

The proportion is always relative to the amount of "allowable" memory and
not the total amount of RAM systemwide so that mempolicies and cpusets
may operate in isolation; they shall not need to know the true size of
the machine on which they are running if they are bound to a specific set
of nodes or mems, respectively.

Forkbomb detection is done in a completely different way: a threshold is
configurable from userspace to determine how many first-generation execve
children (those with their own address spaces) a task may have before it
is considered a forkbomb. This can be tuned by altering the value in
/proc/sys/vm/oom_forkbomb_thres, which defaults to 1000.

When a task has more than 1000 first-generation children with different
address spaces than itself, a penalty of

(average rss of children) * (# of 1st generation execve children)
-----------------------------------------------------------------
oom_forkbomb_thres

is assessed. So, for example, using the default oom_forkbomb_thres of
1000, the penalty is twice the average rss of all its execve children if
there are 2000 such tasks. A task is considered to count toward the
threshold if its total runtime is less than one second; for 1000 of such
tasks to exist, the parent process must be forking at an extremely high
rate either erroneously or maliciously.

Even though a particular task may be designated a forkbomb and selected
as the victim, the oom killer will still kill the 1st generation execve
child with the highest badness() score in its place. The avoids killing
important servers or system daemons.

Root tasks are given 3% extra memory just like __vm_enough_memory()
provides in LSMs. In the event of two tasks consuming similar amounts of
memory, it is generally better to save root's task.

Because of the change in the badness() heuristic's baseline, a change
must also be made to the user interface used to tune it. Instead of a
scale from -16 to +15 to indicate a bitshift on the point value given to
a task, which was very difficult to tune accurately or with any degree of
precision, /proc/pid/oom_adj now ranges from -1000 to +1000. That is, it
can be used to polarize the heuristic such that certain tasks are never
considered for oom kill while others are always considered. The value is
added directly into the badness() score so a value of -500, for example,
means to discount 50% of its memory consumption in comparison to other
tasks either on the system, bound to the mempolicy, or in the cpuset.

OOM_ADJUST_MIN and OOM_ADJUST_MAX have been exported to userspace since
2006 via include/linux/oom.h. This alters their values from -16 to -1000
and from +15 to +1000, respectively. OOM_DISABLE is now the equivalent
of the lowest possible value, OOM_ADJUST_MIN. Adding its value, -1000,
to any badness score will always return 0.

Although redefining these values may be controversial, it is much easier
to understand when the units are fully understood as described above.
In the short-term, there may be userspace breakage for tasks that
hardcode -17 meaning OOM_DISABLE, for example, but the long-term will
make the semantics much easier to understand and oom killing much more
effective.

Signed-off-by: David Rientjes <rien...@google.com>
---
Documentation/filesystems/proc.txt | 78 ++++++----
Documentation/sysctl/vm.txt | 21 +++
fs/proc/base.c | 13 +-
include/linux/oom.h | 15 ++-
kernel/sysctl.c | 8 +
mm/oom_kill.c | 298 +++++++++++++++++++----------------
6 files changed, 256 insertions(+), 177 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -1195,39 +1195,51 @@ CHAPTER 3: PER-PROCESS PARAMETERS
3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score
------------------------------------------------------

-This file can be used to adjust the score used to select which processes
-should be killed in an out-of-memory situation. Giving it a high score will
-increase the likelihood of this process being killed by the oom-killer. Valid
-values are in the range -16 to +15, plus the special value -17, which disables
-oom-killing altogether for this process.
-
-The process to be killed in an out-of-memory situation is selected among all others
-based on its badness score. This value equals the original memory size of the process
-and is then updated according to its CPU time (utime + stime) and the
-run time (uptime - start time). The longer it runs the smaller is the score.
-Badness score is divided by the square root of the CPU time and then by
-the double square root of the run time.
-
-Swapped out tasks are killed first. Half of each child's memory size is added to
-the parent's score if they do not share the same memory. Thus forking servers
-are the prime candidates to be killed. Having only one 'hungry' child will make
-parent less preferable than the child.
-
-/proc/<pid>/oom_score shows process' current badness score.
-
-The following heuristics are then applied:
- * if the task was reniced, its score doubles
- * superuser or direct hardware access tasks (CAP_SYS_ADMIN, CAP_SYS_RESOURCE
- or CAP_SYS_RAWIO) have their score divided by 4
- * if oom condition happened in one cpuset and checked process does not belong
- to it, its score is divided by 8
- * the resulting score is multiplied by two to the power of oom_adj, i.e.
- points <<= oom_adj when it is positive and
- points >>= -(oom_adj) otherwise
-
-The task with the highest badness score is then selected and its children
-are killed, process itself will be killed in an OOM situation when it does
-not have children or some of them disabled oom like described above.
+This file can be used to adjust the badness heuristic used to select which
+process gets killed in out of memory conditions.
+
+The badness heuristic assigns a value to each candidate task ranging from 0
+(never kill) to 1000 (always kill) to determine which process is targeted. The
+units are roughly a proportion along that range of allowed memory the process
+may allocate from based on an estimation of its current memory and swap use.
+For example, if a task is using all allowed memory, its badness score will be
+1000. If it is using half of its allowed memory, its score will be 500.
+
+There are a couple of additional factors included in the badness score: root
+processes are given 3% extra memory over other tasks, and tasks which forkbomb
+an excessive number of child processes are penalized by their average size.
+The number of child processes considered to be a forkbomb is configurable
+via /proc/sys/vm/oom_forkbomb_thres (see Documentation/sysctl/vm.txt).
+
+The amount of "allowed" memory depends on the context in which the oom killer
+was called. If it is due to the memory assigned to the allocating task's cpuset
+being exhausted, the allowed memory represents the set of mems assigned to that
+cpuset. If it is due to a mempolicy's node(s) being exhausted, the allowed
+memory represents the set of mempolicy nodes. If it is due to a memory
+limit (or swap limit) being reached, the allowed memory is that configured
+limit. Finally, if it is due to the entire system being out of memory, the
+allowed memory represents all allocatable resources.
+
+The value of /proc/<pid>/oom_adj is added to the badness score before it is used
+to determine which task to kill. Acceptable values range from -1000
+(OOM_MIN_ADJUST) to +1000 (OOM_MAX_ADJUST). This allows userspace to polarize
+the preference for oom killing either by always preferring a certain task or
+completely disabling it. OOM_DISABLE is equivalent to the lowest possible
+value, -1000, since such tasks will always report a badness score of 0.
+
+Consequently, it is very simple for userspace to define the amount of memory to
+consider for each task. Setting a /proc/<pid>/oom_adj value of +500, for
+example, is roughly equivalent to allowing the remainder of tasks sharing the
+same system, cpuset, mempolicy, or memory controller resources to use at least
+50% more memory. A value of -500, on the other hand, would be roughly
+equivalent to discounting 50% of the task's allowed memory from being considered
+as scoring against the task.
+
+Caveat: when a parent task is selected, the oom killer will sacrifice any first
+generation children with seperate address spaces instead, if possible. This
+avoids servers and important system daemons from being killed and loses the
+minimal amount of work.
+

3.2 /proc/<pid>/oom_score - Display current oom-killer score
-------------------------------------------------------------
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -44,6 +44,7 @@ Currently, these files are in /proc/sys/vm:
- nr_trim_pages (only if CONFIG_MMU=n)
- numa_zonelist_order
- oom_dump_tasks
+- oom_forkbomb_thres
- oom_kill_allocating_task
- overcommit_memory
- overcommit_ratio
@@ -490,6 +491,26 @@ The default value is 0.

==============================================================

+oom_forkbomb_thres
+
+This value defines how many children with a seperate address space a specific
+task may have before being considered as a possible forkbomb. Tasks with more
+children not sharing the same address space as the parent will be penalized by a
+quantity of memory equaling
+
+ (average rss of execve children) * (# of 1st generation execve children)
+ ------------------------------------------------------------------------
+ oom_forkbomb_thres
+
+in the oom killer's badness heuristic. Such tasks may be protected with a lower
+oom_adj value (see Documentation/filesystems/proc.txt) if necessary.
+
+A value of 0 will disable forkbomb detection.
+
+The default value is 1000.
+
+==============================================================
+
oom_kill_allocating_task

This enables or disables killing the OOM-triggering task in
diff --git a/fs/proc/base.c b/fs/proc/base.c
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -81,6 +81,7 @@
#include <linux/elf.h>
#include <linux/pid_namespace.h>
#include <linux/fs_struct.h>
+#include <linux/swap.h>
#include "internal.h"

/* NOTE:
@@ -458,7 +459,6 @@ static const struct file_operations proc_lstats_operations = {
#endif

/* The badness from the OOM killer */
-unsigned long badness(struct task_struct *p, unsigned long uptime);
static int proc_oom_score(struct task_struct *task, char *buffer)
{
unsigned long points;
@@ -466,7 +466,13 @@ static int proc_oom_score(struct task_struct *task, char *buffer)

do_posix_clock_monotonic_gettime(&uptime);
read_lock(&tasklist_lock);
- points = badness(task->group_leader, uptime.tv_sec);
+ points = oom_badness(task->group_leader,
+ global_page_state(NR_INACTIVE_ANON) +
+ global_page_state(NR_ACTIVE_ANON) +
+ global_page_state(NR_INACTIVE_FILE) +
+ global_page_state(NR_ACTIVE_FILE) +
+ total_swap_pages,
+ uptime.tv_sec);
read_unlock(&tasklist_lock);
return sprintf(buffer, "%lu\n", points);
}
@@ -1137,8 +1143,7 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
err = strict_strtol(strstrip(buffer), 0, &oom_adjust);
if (err)
return -EINVAL;
- if ((oom_adjust < OOM_ADJUST_MIN || oom_adjust > OOM_ADJUST_MAX) &&
- oom_adjust != OOM_DISABLE)
+ if (oom_adjust < OOM_ADJUST_MIN || oom_adjust > OOM_ADJUST_MAX)
return -EINVAL;

task = get_proc_task(file->f_path.dentry->d_inode);
diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -1,14 +1,18 @@
#ifndef __INCLUDE_LINUX_OOM_H
#define __INCLUDE_LINUX_OOM_H

-/* /proc/<pid>/oom_adj set to -17 protects from the oom-killer */
-#define OOM_DISABLE (-17)
+/* /proc/<pid>/oom_adj set to -1000 disables OOM killing for pid */
+#define OOM_DISABLE (-1000)
/* inclusive */
-#define OOM_ADJUST_MIN (-16)
-#define OOM_ADJUST_MAX 15
+#define OOM_ADJUST_MIN OOM_DISABLE
+#define OOM_ADJUST_MAX 1000
+
+/* See Documentation/sysctl/vm.txt */
+#define DEFAULT_OOM_FORKBOMB_THRES 1000

#ifdef __KERNEL__

+#include <linux/sched.h>
#include <linux/types.h>
#include <linux/nodemask.h>

@@ -24,6 +28,8 @@ enum oom_constraint {
CONSTRAINT_MEMORY_POLICY,
};

+extern unsigned int oom_badness(struct task_struct *p,
+ unsigned long totalpages, unsigned long uptime);
extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags);
extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);

@@ -47,6 +53,7 @@ static inline void oom_killer_enable(void)
extern int sysctl_panic_on_oom;
extern int sysctl_oom_kill_allocating_task;
extern int sysctl_oom_dump_tasks;
+extern int sysctl_oom_forkbomb_thres;

#endif /* __KERNEL__*/
#endif /* _INCLUDE_LINUX_OOM_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -963,6 +963,14 @@ static struct ctl_table vm_table[] = {
.proc_handler = proc_dointvec,
},
{
+ .procname = "oom_forkbomb_thres",
+ .data = &sysctl_oom_forkbomb_thres,
+ .maxlen = sizeof(sysctl_oom_forkbomb_thres),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &zero,
+ },
+ {
.procname = "overcommit_ratio",
.data = &sysctl_overcommit_ratio,
.maxlen = sizeof(sysctl_overcommit_ratio),
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -4,6 +4,8 @@
* Copyright (C) 1998,2000 Rik van Riel
* Thanks go out to Claus Fischer for some serious inspiration and
* for goading me into coding this file...
+ * Copyright (C) 2010 Google, Inc
+ * Rewritten by David Rientjes
*
* The routines in this file are used to kill a process when
* we're seriously out of memory. This gets called from __alloc_pages()
@@ -32,8 +34,8 @@
int sysctl_panic_on_oom;
int sysctl_oom_kill_allocating_task;
int sysctl_oom_dump_tasks;
+int sysctl_oom_forkbomb_thres = DEFAULT_OOM_FORKBOMB_THRES;
static DEFINE_SPINLOCK(zone_scan_lock);
-/* #define DEBUG */

/*
* Do all threads of the target process overlap our allowed nodes?
@@ -68,138 +70,125 @@ static bool has_intersects_mems_allowed(struct task_struct *tsk,
return false;
}

-/**
- * badness - calculate a numeric value for how bad this task has been
- * @p: task struct of which task we should calculate
- * @uptime: current uptime in seconds
+/*
+ * Tasks that fork a very large number of children with seperate address spaces
+ * may be the result of a bug, user error, or a malicious application. The oom
+ * killer assesses a penalty equaling
*
- * The formula used is relatively simple and documented inline in the
- * function. The main rationale is that we want to select a good task
- * to kill when we run out of memory.
+ * (average rss of children) * (# of 1st generation execve children)
+ * -----------------------------------------------------------------
+ * sysctl_oom_forkbomb_thres
*
- * Good in this context means that:
- * 1) we lose the minimum amount of work done
- * 2) we recover a large amount of memory
- * 3) we don't kill anything innocent of eating tons of memory
- * 4) we want to kill the minimum amount of processes (one)
- * 5) we try to kill the process the user expects us to kill, this
- * algorithm has been meticulously tuned to meet the principle
- * of least surprise ... (be careful when you change it)
+ * for such tasks to target the parent. oom_kill_process() will attempt to
+ * first kill a child, so there's no risk of killing an important system daemon
+ * via this method. The goal is to give the user a chance to recover from the
+ * error rather than deplete all memory.
*/
-
-unsigned long badness(struct task_struct *p, unsigned long uptime)
+static unsigned long oom_forkbomb_penalty(struct task_struct *tsk)
{
- unsigned long points, cpu_time, run_time;
- struct mm_struct *mm;
struct task_struct *child;
- int oom_adj = p->signal->oom_adj;
- struct task_cputime task_time;
- unsigned long utime;
- unsigned long stime;
+ unsigned long child_rss = 0;
+ int forkcount = 0;

- if (oom_adj == OOM_DISABLE)
+ if (!sysctl_oom_forkbomb_thres)
return 0;
+ list_for_each_entry(child, &tsk->children, sibling) {
+ struct task_cputime task_time;
+ unsigned long runtime;

- task_lock(p);
- mm = p->mm;
- if (!mm) {
- task_unlock(p);
- return 0;
+ task_lock(child);
+ if (!child->mm || child->mm == tsk->mm) {
+ task_unlock(child);
+ continue;
+ }
+ thread_group_cputime(child, &task_time);
+ runtime = cputime_to_jiffies(task_time.utime) +
+ cputime_to_jiffies(task_time.stime);
+ /*
+ * Only threads that have run for less than a second are
+ * considered toward the forkbomb penalty, these threads rarely
+ * get to execute at all in such cases anyway.
+ */
+ if (runtime < HZ) {
+ child_rss += get_mm_rss(child->mm);
+ forkcount++;
+ }
+ task_unlock(child);
}

/*
- * The memory size of the process is the basis for the badness.
+ * Forkbombs get penalized by:
+ *
+ * (average rss of children) * (# of first-generation execve children) /
+ * sysctl_oom_forkbomb_thres
*/
- points = mm->total_vm;
+ return forkcount > sysctl_oom_forkbomb_thres ?
+ (child_rss / sysctl_oom_forkbomb_thres) : 0;
+}
+
+/**
+ * oom_badness - heuristic function to determine which candidate task to kill
+ * @p: task struct of which task we should calculate
+ * @totalpages: total present RAM allowed for page allocation
+ * @uptime: current uptime in seconds
+ *
+ * The heuristic for determining which task to kill is made to be as simple and
+ * predictable as possible. The goal is to return the highest value for the
+ * task consuming the most memory to avoid subsequent oom conditions.
+ */
+unsigned int oom_badness(struct task_struct *p, unsigned long totalpages,
+ unsigned long uptime)
+{
+ struct mm_struct *mm;
+ int points;

/*
- * After this unlock we can no longer dereference local variable `mm'
+ * Shortcut check for OOM_DISABLE so the entire heuristic doesn't need
+ * to be executed for something that can't be killed.
*/
- task_unlock(p);
+ if (p->signal->oom_adj == OOM_DISABLE)
+ return 0;

/*
- * swapoff can easily use up all memory, so kill those first.
+ * When the PF_OOM_ORIGIN bit is set, it indicates the task should have
+ * priority for oom killing.
*/
if (p->flags & PF_OOM_ORIGIN)
- return ULONG_MAX;
+ return 1000;

- /*
- * Processes which fork a lot of child processes are likely
- * a good choice. We add half the vmsize of the children if they
- * have an own mm. This prevents forking servers to flood the
- * machine with an endless amount of children. In case a single
- * child is eating the vast majority of memory, adding only half
- * to the parents will make the child our kill candidate of choice.
- */
- list_for_each_entry(child, &p->children, sibling) {
- task_lock(child);
- if (child->mm != mm && child->mm)
- points += child->mm->total_vm/2 + 1;
- task_unlock(child);
+ task_lock(p);
+ mm = p->mm;
+ if (!mm) {
+ task_unlock(p);
+ return 0;
}

/*
- * CPU time is in tens of seconds and run time is in thousands
- * of seconds. There is no particular reason for this other than
- * that it turned out to work very well in practice.
- */
- thread_group_cputime(p, &task_time);
- utime = cputime_to_jiffies(task_time.utime);
- stime = cputime_to_jiffies(task_time.stime);
- cpu_time = (utime + stime) >> (SHIFT_HZ + 3);
-
-
- if (uptime >= p->start_time.tv_sec)
- run_time = (uptime - p->start_time.tv_sec) >> 10;
- else
- run_time = 0;
-
- if (cpu_time)
- points /= int_sqrt(cpu_time);
- if (run_time)
- points /= int_sqrt(int_sqrt(run_time));
-
- /*
- * Niced processes are most likely less important, so double
- * their badness points.
+ * The baseline for the badness score is the proportion of RAM that each
+ * task's rss and swap space use.
*/
- if (task_nice(p) > 0)
- points *= 2;
-
- /*
- * Superuser processes are usually more important, so we make it
- * less likely that we kill those.
- */
- if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
- has_capability_noaudit(p, CAP_SYS_RESOURCE))
- points /= 4;
+ points = (get_mm_rss(mm) + get_mm_counter(mm, MM_SWAPENTS)) * 1000 /
+ totalpages;
+ task_unlock(p);
+ points += oom_forkbomb_penalty(p);

/*
- * We don't want to kill a process with direct hardware access.
- * Not only could that mess up the hardware, but usually users
- * tend to only have this flag set on applications they think
- * of as important.
+ * Root processes get 3% bonus, just like the __vm_enough_memory() used
+ * by LSMs.
*/
- if (has_capability_noaudit(p, CAP_SYS_RAWIO))
- points /= 4;
+ if (has_capability_noaudit(p, CAP_SYS_ADMIN))
+ points -= 30;

/*
- * Adjust the score by oom_adj.
+ * /proc/pid/oom_adj ranges from -1000 to +1000 such that the range
+ * may either completely disable oom killing or always prefer a certain
+ * task.
*/
- if (oom_adj) {
- if (oom_adj > 0) {
- if (!points)
- points = 1;
- points <<= oom_adj;
- } else
- points >>= -(oom_adj);
- }
+ points += p->signal->oom_adj;

-#ifdef DEBUG
- printk(KERN_DEBUG "OOMkill: task %d (%s) got %lu points\n",
- p->pid, p->comm, points);
-#endif
- return points;
+ if (points < 0)
+ return 0;
+ return (points <= 1000) ? points : 1000;
}

/*
@@ -207,11 +196,21 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
*/
#ifdef CONFIG_NUMA
static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
- gfp_t gfp_mask, nodemask_t *nodemask)
+ gfp_t gfp_mask, nodemask_t *nodemask,
+ unsigned long *totalpages)
{
struct zone *zone;
struct zoneref *z;
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
+ bool cpuset_limited = false;
+ int nid;
+
+ /* Default to all anonymous memory, page cache, and swap */
+ *totalpages = global_page_state(NR_INACTIVE_ANON) +
+ global_page_state(NR_ACTIVE_ANON) +
+ global_page_state(NR_INACTIVE_FILE) +
+ global_page_state(NR_ACTIVE_FILE) +
+ total_swap_pages;

/*
* Reach here only when __GFP_NOFAIL is used. So, we should avoid
@@ -222,25 +221,41 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
return CONSTRAINT_NONE;

/*
- * The nodemask here is a nodemask passed to alloc_pages(). Now,
- * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
- * feature. mempolicy is an only user of nodemask here.
- * check mempolicy's nodemask contains all N_HIGH_MEMORY
+ * This is not a __GFP_THISNODE allocation, so a truncated nodemask in
+ * the page allocator means a mempolicy is in effect. Cpuset policy
+ * is enforced in get_page_from_freelist().
*/
- if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
+ if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask)) {
+ *totalpages = total_swap_pages;
+ for_each_node_mask(nid, *nodemask)
+ *totalpages += node_page_state(nid, NR_INACTIVE_ANON) +
+ node_page_state(nid, NR_ACTIVE_ANON) +
+ node_page_state(nid, NR_INACTIVE_FILE) +
+ node_page_state(nid, NR_ACTIVE_FILE);
return CONSTRAINT_MEMORY_POLICY;
+ }

/* Check this allocation failure is caused by cpuset's wall function */
for_each_zone_zonelist_nodemask(zone, z, zonelist,
high_zoneidx, nodemask)
if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
- return CONSTRAINT_CPUSET;
-
+ cpuset_limited = true;
+
+ if (cpuset_limited) {
+ *totalpages = total_swap_pages;
+ for_each_node_mask(nid, cpuset_current_mems_allowed)
+ *totalpages += node_page_state(nid, NR_INACTIVE_ANON) +
+ node_page_state(nid, NR_ACTIVE_ANON) +
+ node_page_state(nid, NR_INACTIVE_FILE) +
+ node_page_state(nid, NR_ACTIVE_FILE);
+ return CONSTRAINT_CPUSET;
+ }
return CONSTRAINT_NONE;
}
#else
static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
- gfp_t gfp_mask, nodemask_t *nodemask)
+ gfp_t gfp_mask, nodemask_t *nodemask,
+ unsigned long *totalpages)
{
return CONSTRAINT_NONE;
}
@@ -252,9 +267,9 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
*
* (not docbooked, we don't want this one cluttering up the manual)
*/
-static struct task_struct *select_bad_process(unsigned long *ppoints,
- struct mem_cgroup *mem, enum oom_constraint constraint,
- const nodemask_t *mask)
+static struct task_struct *select_bad_process(unsigned int *ppoints,
+ unsigned long totalpages, struct mem_cgroup *mem,
+ enum oom_constraint constraint, const nodemask_t *mask)
{
struct task_struct *p;
struct task_struct *chosen = NULL;
@@ -263,7 +278,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,

do_posix_clock_monotonic_gettime(&uptime);
for_each_process(p) {
- unsigned long points;
+ unsigned int points;

/*
* skip kernel threads and tasks which have already released
@@ -308,13 +323,13 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
return ERR_PTR(-1UL);

chosen = p;
- *ppoints = ULONG_MAX;
+ *ppoints = 1000;
}

if (p->signal->oom_adj == OOM_DISABLE)
continue;

- points = badness(p, uptime.tv_sec);
+ points = oom_badness(p, totalpages, uptime.tv_sec);
if (points > *ppoints || !chosen) {
chosen = p;
*ppoints = points;
@@ -449,12 +464,12 @@ static int oom_kill_task(struct task_struct *p)
}

static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
- unsigned long points, struct mem_cgroup *mem,
- const char *message)
+ unsigned int points, unsigned long totalpages,
+ struct mem_cgroup *mem, const char *message)
{
struct task_struct *victim = p;
struct task_struct *c;
- unsigned long victim_points = 0;
+ unsigned int victim_points = 0;
struct timespec uptime;

if (printk_ratelimit())
@@ -469,19 +484,19 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
return 0;
}

- pr_err("%s: Kill process %d (%s) with score %lu or sacrifice child\n",
+ pr_err("%s: Kill process %d (%s) with score %d or sacrifice child\n",
message, task_pid_nr(p), p->comm, points);

/* Try to sacrifice the worst child first */
do_posix_clock_monotonic_gettime(&uptime);
list_for_each_entry(c, &p->children, sibling) {
- unsigned long cpoints;
+ unsigned int cpoints;

if (c->mm == p->mm)
continue;

- /* badness() returns 0 if the thread is unkillable */
- cpoints = badness(c, uptime.tv_sec);
+ /* oom_badness() returns 0 if the thread is unkillable */
+ cpoints = oom_badness(c, totalpages, uptime.tv_sec);
if (cpoints > victim_points) {
victim = c;
victim_points = cpoints;
@@ -493,19 +508,22 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
{
- unsigned long points = 0;
+ unsigned int points = 0;
struct task_struct *p;
+ unsigned long limit;

+ limit = (res_counter_read_u64(&mem->res, RES_LIMIT) >> PAGE_SHIFT) +
+ (res_counter_read_u64(&mem->memsw, RES_LIMIT) >> PAGE_SHIFT);
read_lock(&tasklist_lock);
retry:
- p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL);
+ p = select_bad_process(&points, limit, mem, CONSTRAINT_NONE, NULL);
if (PTR_ERR(p) == -1UL)
goto out;

if (!p)
p = current;

- if (oom_kill_process(p, gfp_mask, 0, points, mem,
+ if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
"Memory cgroup out of memory"))
goto retry;
out:
@@ -580,22 +598,22 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
/*
* Must be called with tasklist_lock held for read.
*/
-static void __out_of_memory(gfp_t gfp_mask, int order,
+static void __out_of_memory(gfp_t gfp_mask, int order, unsigned long totalpages,
enum oom_constraint constraint, const nodemask_t *mask)
{
struct task_struct *p;
- unsigned long points;
+ unsigned int points;

if (sysctl_oom_kill_allocating_task)
- if (!oom_kill_process(current, gfp_mask, order, 0, NULL,
- "Out of memory (oom_kill_allocating_task)"))
+ if (!oom_kill_process(current, gfp_mask, order, 0, totalpages,
+ NULL, "Out of memory (oom_kill_allocating_task)"))
return;
retry:
/*
* Rambo mode: Shoot down a process and hope it solves whatever
* issues we may have.
*/
- p = select_bad_process(&points, NULL, constraint, mask);
+ p = select_bad_process(&points, totalpages, NULL, constraint, mask);

if (PTR_ERR(p) == -1UL)
return;
@@ -607,7 +625,7 @@ retry:
panic("Out of memory and no killable processes...\n");
}

- if (oom_kill_process(p, gfp_mask, order, points, NULL,
+ if (oom_kill_process(p, gfp_mask, order, points, totalpages, NULL,
"Out of memory"))
goto retry;
}
@@ -618,6 +636,7 @@ retry:
*/
void pagefault_out_of_memory(void)
{
+ unsigned long totalpages;
unsigned long freed = 0;

blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
@@ -635,9 +654,14 @@ void pagefault_out_of_memory(void)
if (sysctl_panic_on_oom)
panic("out of memory from page fault. panic_on_oom is selected.\n");

+ totalpages = global_page_state(NR_INACTIVE_ANON) +
+ global_page_state(NR_ACTIVE_ANON) +
+ global_page_state(NR_INACTIVE_FILE) +
+ global_page_state(NR_ACTIVE_FILE) +
+ total_swap_pages;
read_lock(&tasklist_lock);
/* unknown gfp_mask and order */
- __out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
+ __out_of_memory(0, 0, totalpages, CONSTRAINT_NONE, NULL);
read_unlock(&tasklist_lock);

/*
@@ -665,6 +689,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
int order, nodemask_t *nodemask)
{
unsigned long freed = 0;
+ unsigned long totalpages = 0;
enum oom_constraint constraint;

blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
@@ -681,7 +706,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
* Check if there were limitations on the allocation (only relevant for
* NUMA) that may require different handling.
*/
- constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
+ constraint = constrained_alloc(zonelist, gfp_mask, nodemask,
+ &totalpages);
read_lock(&tasklist_lock);
if (unlikely(sysctl_panic_on_oom)) {
/*
@@ -694,7 +720,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
panic("Out of memory: panic_on_oom is enabled\n");
}
}
- __out_of_memory(gfp_mask, order, constraint, nodemask);
+ __out_of_memory(gfp_mask, order, totalpages, constraint, nodemask);
read_unlock(&tasklist_lock);

/*

David Rientjes

unread,

Feb 10, 2010, 11:40:02 AM2/10/10

to

Two VM sysctls, oom dump_tasks and oom_kill_allocating_task, were
implemented for very large systems to avoid excessively long tasklist
scans. The former suppresses helpful diagnostic messages that are
emitted for each thread group leader that are candidates for oom kill
including their pid, uid, vm size, rss, oom_adj value, and name; this
information is very helpful to users in understanding why a particular
task was chosen for kill over others. The latter simply kills current,
the task triggering the oom condition, instead of iterating through the
tasklist looking for the worst offender.

Both of these sysctls are combined into one for use on the aforementioned
large systems: oom_kill_quick. This disables the now-default
oom_dump_tasks and kills current whenever the oom killer is called.

The oom killer rewrite is the perfect opportunity to combine both sysctls
into one instead of carrying around the others for years to come for
nothing else than legacy purposes.

Signed-off-by: David Rientjes <rien...@google.com>
---

Documentation/sysctl/vm.txt | 44 +++++-------------------------------------
include/linux/oom.h | 3 +-
kernel/sysctl.c | 13 ++---------
mm/oom_kill.c | 9 +++----
4 files changed, 14 insertions(+), 55 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt

@@ -43,9 +43,8 @@ Currently, these files are in /proc/sys/vm:
- nr_pdflush_threads

- nr_trim_pages (only if CONFIG_MMU=n)
- numa_zonelist_order

-- oom_dump_tasks
- oom_forkbomb_thres
-- oom_kill_allocating_task
+- oom_kill_quick
- overcommit_memory
- overcommit_ratio
- page-cluster
@@ -470,27 +469,6 @@ this is causing problems for your system/application.

==============================================================

-oom_dump_tasks
-
-Enables a system-wide task dump (excluding kernel threads) to be
-produced when the kernel performs an OOM-killing and includes such
-information as pid, uid, tgid, vm size, rss, cpu, oom_adj score, and
-name. This is helpful to determine why the OOM killer was invoked
-and to identify the rogue task that caused it.
-
-If this is set to zero, this information is suppressed. On very
-large systems with thousands of tasks it may not be feasible to dump
-the memory state information for each one. Such systems should not
-be forced to incur a performance penalty in OOM conditions when the
-information may not be desired.
-
-If this is set to non-zero, this information is shown whenever the
-OOM killer actually kills a memory-hogging task.
-
-The default value is 0.
-
-==============================================================
-
oom_forkbomb_thres

This value defines how many children with a seperate address space a specific

@@ -511,22 +489,12 @@ The default value is 1000.

==============================================================

-oom_kill_allocating_task
-
-This enables or disables killing the OOM-triggering task in
-out-of-memory situations.
-
-If this is set to zero, the OOM killer will scan through the entire
-tasklist and select a task based on heuristics to kill. This normally
-selects a rogue memory-hogging task that frees up a large amount of
-memory when killed.
-
-If this is set to non-zero, the OOM killer simply kills the task that
-triggered the out-of-memory condition. This avoids the expensive
-tasklist scan.
+oom_kill_quick

-If panic_on_oom is selected, it takes precedence over whatever value
-is used in oom_kill_allocating_task.
+When enabled, this will always kill the task that triggered the oom killer, i.e.
+the task that attempted to allocate memory that could not be found. It also
+suppresses the tasklist dump to the kernel log whenever the oom killer is
+called. Typically set on systems with an extremely large number of tasks.

The default value is 0.

diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h

@@ -51,8 +51,7 @@ static inline void oom_killer_enable(void)
}
/* for sysctl */
extern int sysctl_panic_on_oom;
-extern int sysctl_oom_kill_allocating_task;
-extern int sysctl_oom_dump_tasks;
+extern int sysctl_oom_kill_quick;

extern int sysctl_oom_forkbomb_thres;

#endif /* __KERNEL__*/

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c

@@ -949,16 +949,9 @@ static struct ctl_table vm_table[] = {
.proc_handler = proc_dointvec,
},
{
- .procname = "oom_kill_allocating_task",
- .data = &sysctl_oom_kill_allocating_task,
- .maxlen = sizeof(sysctl_oom_kill_allocating_task),
- .mode = 0644,
- .proc_handler = proc_dointvec,
- },
- {
- .procname = "oom_dump_tasks",
- .data = &sysctl_oom_dump_tasks,
- .maxlen = sizeof(sysctl_oom_dump_tasks),
+ .procname = "oom_kill_quick",
+ .data = &sysctl_oom_kill_quick,
+ .maxlen = sizeof(sysctl_oom_kill_quick),
.mode = 0644,
.proc_handler = proc_dointvec,

},
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c

@@ -32,9 +32,8 @@
#include <linux/security.h>

int sysctl_panic_on_oom;
-int sysctl_oom_kill_allocating_task;
-int sysctl_oom_dump_tasks;
int sysctl_oom_forkbomb_thres = DEFAULT_OOM_FORKBOMB_THRES;
+int sysctl_oom_kill_quick;
static DEFINE_SPINLOCK(zone_scan_lock);

/*
@@ -397,7 +396,7 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
dump_stack();
mem_cgroup_print_oom_info(mem, p);
show_mem();
- if (sysctl_oom_dump_tasks)
+ if (!sysctl_oom_kill_quick)
dump_tasks(mem);
}

@@ -604,9 +603,9 @@ static void __out_of_memory(gfp_t gfp_mask, int order, unsigned long totalpages,
struct task_struct *p;
unsigned int points;

- if (sysctl_oom_kill_allocating_task)
+ if (sysctl_oom_kill_quick)

if (!oom_kill_process(current, gfp_mask, order, 0, totalpages,

- NULL, "Out of memory (oom_kill_allocating_task)"))
+ NULL, "Out of memory (quick mode)"))
return;
retry:
/*

David Rientjes

unread,

Feb 10, 2010, 11:40:02 AM2/10/10

to

When a task is chosen for oom kill, the oom killer first attempts to
sacrifice a child not sharing its parent's memory instead.
Unfortunately, this often kills in a seemingly random fashion based on
the ordering of the selected task's child list. Additionally, it is not
guaranteed at all to free a large amount of memory that we need to
prevent additional oom killing in the very near future.

Instead, we now only attempt to sacrifice the worst child not sharing its
parent's memory, if one exists. The worst child is indicated with the
highest badness() score. This serves two advantages: we kill a
memory-hogging task more often, and we allow the configurable
/proc/pid/oom_adj value to be considered as a factor in which child to
kill.

Reviewers may observe that the previous implementation would iterate
through the children and attempt to kill each until one was successful
and then the parent if none were found while the new code simply kills
the most memory-hogging task or the parent. Note that the only time
oom_kill_task() fails, however, is when a child does not have an mm or
has a /proc/pid/oom_adj of OOM_DISABLE. badness() returns 0 for both
cases, so the final oom_kill_task() will always succeed.

Signed-off-by: David Rientjes <rien...@google.com>
---

mm/oom_kill.c | 23 +++++++++++++++++------
1 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c

@@ -432,7 +432,10 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,

unsigned long points, struct mem_cgroup *mem,

const char *message)
{
+ struct task_struct *victim = p;
struct task_struct *c;
+ unsigned long victim_points = 0;
+ struct timespec uptime;

if (printk_ratelimit())
dump_header(p, gfp_mask, order, mem);
@@ -446,17 +449,25 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
return 0;
}

- printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
- message, task_pid_nr(p), p->comm, points);
+ pr_err("%s: Kill process %d (%s) with score %lu or sacrifice child\n",
+ message, task_pid_nr(p), p->comm, points);

- /* Try to kill a child first */
+ /* Try to sacrifice the worst child first */
+ do_posix_clock_monotonic_gettime(&uptime);
list_for_each_entry(c, &p->children, sibling) {
+ unsigned long cpoints;
+

if (c->mm == p->mm)
continue;

- if (!oom_kill_task(c))
- return 0;
+
+ /* badness() returns 0 if the thread is unkillable */
+ cpoints = badness(c, uptime.tv_sec);
+ if (cpoints > victim_points) {
+ victim = c;
+ victim_points = cpoints;
+ }
}
- return oom_kill_task(p);
+ return oom_kill_task(victim);
}

#ifdef CONFIG_CGROUP_MEM_RES_CTLR

David Rientjes

unread,

Feb 10, 2010, 11:40:02 AM2/10/10

to

Tasks that do not share the same set of allowed nodes with the task that
triggered the oom should not be considered as candidates for oom kill.

Tasks in other cpusets with a disjoint set of mems would be unfairly
penalized otherwise because of oom conditions elsewhere; an extreme
example could unfairly kill all other applications on the system if a
single task in a user's cpuset sets itself to OOM_DISABLE and then uses
more memory than allowed.

Killing tasks outside of current's cpuset rarely would free memory for
current anyway.

Signed-off-by: David Rientjes <rien...@google.com>
---

mm/oom_kill.c | 12 +++---------
1 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c

@@ -35,7 +35,7 @@ static DEFINE_SPINLOCK(zone_scan_lock);
/* #define DEBUG */

/*
- * Is all threads of the target process nodes overlap ours?
+ * Do all threads of the target process overlap our allowed nodes?
*/
static int has_intersects_mems_allowed(struct task_struct *tsk)
{
@@ -167,14 +167,6 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
points /= 4;

/*
- * If p's nodes don't overlap ours, it may still help to kill p
- * because p may have allocated or otherwise mapped memory on
- * this node before. However it will be less likely.
- */
- if (!has_intersects_mems_allowed(p))
- points /= 8;
-
- /*

* Adjust the score by oom_adj.

*/
if (oom_adj) {
@@ -266,6 +258,8 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
continue;
if (mem && !task_in_mem_cgroup(p, mem))
continue;
+ if (!has_intersects_mems_allowed(p))
+ continue;

/*
* This task already has access to memory reserves and is

David Rientjes

unread,

Feb 10, 2010, 11:40:02 AM2/10/10

to

Remove the redundancy in __oom_kill_task() since:

- init can never be passed to this function: it will never be PF_EXITING
or selectable from select_bad_process(), and

- it will never be passed a task from oom_kill_task() without an ->mm
and we're unconcerned about detachment from exiting tasks, there's no
reason to protect them against SIGKILL or access to memory reserves.

Also moves the kernel log message to a higher level since the verbosity
is not always emitted here; we need not print an error message if an
exiting task is given a longer timeslice.

Signed-off-by: David Rientjes <rien...@google.com>
---

mm/oom_kill.c | 64 ++++++++++++++------------------------------------------
1 files changed, 16 insertions(+), 48 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c

@@ -400,67 +400,35 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
dump_tasks(mem);
}

-#define K(x) ((x) << (PAGE_SHIFT-10))
-
/*
- * Send SIGKILL to the selected process irrespective of CAP_SYS_RAW_IO
- * flag though it's unlikely that we select a process with CAP_SYS_RAW_IO
- * set.
+ * Give the oom killed task high priority and access to memory reserves so that
+ * it may quickly exit and free its memory.
*/
-static void __oom_kill_task(struct task_struct *p, int verbose)
+static void __oom_kill_task(struct task_struct *p)
{
- if (is_global_init(p)) {
- WARN_ON(1);
- printk(KERN_WARNING "tried to kill init!\n");
- return;
- }
-
- task_lock(p);
- if (!p->mm) {
- WARN_ON(1);
- printk(KERN_WARNING "tried to kill an mm-less task %d (%s)!\n",
- task_pid_nr(p), p->comm);
- task_unlock(p);
- return;
- }
-
- if (verbose)
- printk(KERN_ERR "Killed process %d (%s) "
- "vsz:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
- task_pid_nr(p), p->comm,
- K(p->mm->total_vm),
- K(get_mm_counter(p->mm, MM_ANONPAGES)),
- K(get_mm_counter(p->mm, MM_FILEPAGES)));
- task_unlock(p);
-
- /*
- * We give our sacrificial lamb high priority and access to
- * all the memory it needs. That way it should be able to
- * exit() and clear out its resources quickly...
- */
p->rt.time_slice = HZ;
set_tsk_thread_flag(p, TIF_MEMDIE);
-
force_sig(SIGKILL, p);
}

+#define K(x) ((x) << (PAGE_SHIFT-10))

static int oom_kill_task(struct task_struct *p)

{
- /* WARNING: mm may not be dereferenced since we did not obtain its
- * value from get_task_mm(p). This is OK since all we need to do is
- * compare mm to q->mm below.
- *
- * Furthermore, even if mm contains a non-NULL value, p->mm may
- * change to NULL at any time since we do not hold task_lock(p).
- * However, this is of no concern to us.
- */
- if (!p->mm || p->signal->oom_adj == OOM_DISABLE)
+ task_lock(p);
+ if (!p->mm || p->signal->oom_adj == OOM_DISABLE) {
+ task_unlock(p);
return 1;
+ }
+ pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
+ task_pid_nr(p), p->comm, K(p->mm->total_vm),
+ K(get_mm_counter(p->mm, MM_ANONPAGES)),
+ K(get_mm_counter(p->mm, MM_FILEPAGES)));
+ task_unlock(p);

- __oom_kill_task(p, 1);
-
+ __oom_kill_task(p);
return 0;
}
+#undef K

static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,

unsigned int points, unsigned long totalpages,

@@ -479,7 +447,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
* its children or threads, just set TIF_MEMDIE so it can die quickly
*/
if (p->flags & PF_EXITING) {
- __oom_kill_task(p, 0);
+ __oom_kill_task(p);
return 0;

Rik van Riel

unread,

Feb 10, 2010, 12:10:02 PM2/10/10

to

On 02/10/2010 11:32 AM, David Rientjes wrote:
> Tasks that do not share the same set of allowed nodes with the task that
> triggered the oom should not be considered as candidates for oom kill.
>
> Tasks in other cpusets with a disjoint set of mems would be unfairly
> penalized otherwise because of oom conditions elsewhere; an extreme
> example could unfairly kill all other applications on the system if a
> single task in a user's cpuset sets itself to OOM_DISABLE and then uses
> more memory than allowed.
>
> Killing tasks outside of current's cpuset rarely would free memory for
> current anyway.
>
> Signed-off-by: David Rientjes<rien...@google.com>

Acked-by: Rik van Riel <ri...@redhat.com>

--
All rights reversed.

Rik van Riel

unread,

Feb 10, 2010, 4:00:01 PM2/10/10

to

On 02/10/2010 11:32 AM, David Rientjes wrote:

> When a task is chosen for oom kill, the oom killer first attempts to
> sacrifice a child not sharing its parent's memory instead.
> Unfortunately, this often kills in a seemingly random fashion based on
> the ordering of the selected task's child list. Additionally, it is not
> guaranteed at all to free a large amount of memory that we need to
> prevent additional oom killing in the very near future.
>
> Instead, we now only attempt to sacrifice the worst child not sharing its
> parent's memory, if one exists. The worst child is indicated with the
> highest badness() score. This serves two advantages: we kill a
> memory-hogging task more often, and we allow the configurable
> /proc/pid/oom_adj value to be considered as a factor in which child to
> kill.
>
> Reviewers may observe that the previous implementation would iterate
> through the children and attempt to kill each until one was successful
> and then the parent if none were found while the new code simply kills
> the most memory-hogging task or the parent. Note that the only time
> oom_kill_task() fails, however, is when a child does not have an mm or
> has a /proc/pid/oom_adj of OOM_DISABLE. badness() returns 0 for both
> cases, so the final oom_kill_task() will always succeed.
>
> Signed-off-by: David Rientjes<rien...@google.com>

Acked-by: Rik van Riel <ri...@redhat.com>

--
All rights reversed.

Rik van Riel

unread,

Feb 10, 2010, 11:20:01 PM2/10/10

to

On 02/10/2010 11:32 AM, David Rientjes wrote:

> OOM_ADJUST_MIN and OOM_ADJUST_MAX have been exported to userspace since
> 2006 via include/linux/oom.h. This alters their values from -16 to -1000
> and from +15 to +1000, respectively.

That seems like a bad idea. Google may have the luxury of
being able to recompile all its in-house applications, but
this will not be true for many other users of /proc/<pid>/oom_adj

> +/*
> + * Tasks that fork a very large number of children with seperate address spaces
> + * may be the result of a bug, user error, or a malicious application. The oom
> + * killer assesses a penalty equaling

It could also be the result of the system getting many client
connections - think of overloaded mail, web or database servers.

--
All rights reversed.

David Rientjes

unread,

Feb 11, 2010, 4:20:02 AM2/11/10

to

On Wed, 10 Feb 2010, Rik van Riel wrote:

> > OOM_ADJUST_MIN and OOM_ADJUST_MAX have been exported to userspace since
> > 2006 via include/linux/oom.h. This alters their values from -16 to -1000
> > and from +15 to +1000, respectively.
>
> That seems like a bad idea. Google may have the luxury of
> being able to recompile all its in-house applications, but
> this will not be true for many other users of /proc/<pid>/oom_adj
>

Changing any value that may have a tendency to be hardcoded elsewhere is
always controversial, but I think the nature of /proc/pid/oom_adj allows
us to do so for two specific reasons:

- hardcoded values tend not the fall within a range, they tend to either
always prefer a certain task for oom kill first or disable oom killing
entirely. The current implementation uses this as a bitshift on a
seemingly unpredictable and unscientific heuristic that is very
difficult to predict at runtime. This means that fewer and fewer
applications would hardcode a value of '8', for example, because its
semantics depends entirely on RAM capacity of the system to begin with
since badness() scores are only useful when used in comparison with
other tasks.

- the badness() heuristic is radically changed from what it is currently
so this gives applications that hardcoded /proc/pid/oom_adj values into
their software a reason to notice the change and adjust to the new
semantics of the badness score. Using /proc/pid/oom_adj as a bitshift
has no real application to any sane heuristic that represents scores in
units of meaning, so users should end up with a net benefit of the
change by being able to better tune the oom killing behavior with a
much more powerful and easier to understand heuristic that requires
them to recalculate exactly what oom_adj should be for any given
application in terms of real units and business goals.

As mentioned in the changelog, we've exported these minimum and maximum
values via a kernel header file since at least 2006. At what point do we
assume they are going to be used and not hardcoded into applications?
That was certainly the intention when making them user visible.

> > +/*
> > + * Tasks that fork a very large number of children with seperate address
> > spaces
> > + * may be the result of a bug, user error, or a malicious application. The
> > oom
> > + * killer assesses a penalty equaling
>
> It could also be the result of the system getting many client
> connections - think of overloaded mail, web or database servers.
>

True, that's a great example of why child tasks should be sacrificed for
the parent: if the oom killer is being called then we are truly overloaded
and there's no shame in killing excessive client connections to recover,
otherwise we might find the entire server becoming unresponsive. The user
can easily tune to /proc/sys/vm/oom_forkbomb_thres to define what
"excessive" is to assess the penalty, if any. I'll add that to the
comment if we require a second revision.

Thanks for your speedy review of this patchset so far, Rik!

Nick Bowler

unread,

Feb 11, 2010, 10:10:01 AM2/11/10

to

On 01:14 Thu 11 Feb , David Rientjes wrote:
> On Wed, 10 Feb 2010, Rik van Riel wrote:
>
> > > OOM_ADJUST_MIN and OOM_ADJUST_MAX have been exported to userspace since
> > > 2006 via include/linux/oom.h. This alters their values from -16 to -1000
> > > and from +15 to +1000, respectively.

<snip>

> As mentioned in the changelog, we've exported these minimum and maximum
> values via a kernel header file since at least 2006. At what point do we
> assume they are going to be used and not hardcoded into applications?
> That was certainly the intention when making them user visible.

The thing is, even when the macros are used, their values are hardcoded
into programs once the code is run through a compiler. That's why it's
called an ABI.

--
Nick Bowler, Elliptic Technologies (http://www.elliptictech.com/)

David Rientjes

unread,

Feb 11, 2010, 4:10:02 PM2/11/10

to

On Thu, 11 Feb 2010, Nick Bowler wrote:

> > As mentioned in the changelog, we've exported these minimum and maximum
> > values via a kernel header file since at least 2006. At what point do we
> > assume they are going to be used and not hardcoded into applications?
> > That was certainly the intention when making them user visible.
>
> The thing is, even when the macros are used, their values are hardcoded
> into programs once the code is run through a compiler. That's why it's
> called an ABI.
>

Right, that's the second point that I listed: since the semantics of the
tunable have radically changed from the bitshift to an actual unit
(proportion of available memory), those applications need to change how
they use oom_adj anyway. The bitshift simply isn't extendable with any
sane heuristic that is predictable or works with any reasonable amount of
granularity, so this change seems inevitable in the long term.

We may be forced to abandon /proc/pid/oom_adj itself and introduce the
tunable with a different name: oom_score_adj, for example, to make it
clear that it's a different entity.

Andrew Morton

unread,

Feb 11, 2010, 4:50:04 PM2/11/10

to

On Thu, 11 Feb 2010 01:14:43 -0800 (PST)
David Rientjes <rien...@google.com> wrote:

> On Wed, 10 Feb 2010, Rik van Riel wrote:
>
> > > OOM_ADJUST_MIN and OOM_ADJUST_MAX have been exported to userspace since
> > > 2006 via include/linux/oom.h. This alters their values from -16 to -1000
> > > and from +15 to +1000, respectively.
> >
> > That seems like a bad idea. Google may have the luxury of
> > being able to recompile all its in-house applications, but
> > this will not be true for many other users of /proc/<pid>/oom_adj
> >
>
> Changing any value that may have a tendency to be hardcoded elsewhere is
> always controversial, but I think the nature of /proc/pid/oom_adj allows
> us to do so for two specific reasons:
>
> - hardcoded values tend not the fall within a range, they tend to either
> always prefer a certain task for oom kill first or disable oom killing
> entirely. The current implementation uses this as a bitshift on a
> seemingly unpredictable and unscientific heuristic that is very
> difficult to predict at runtime. This means that fewer and fewer
> applications would hardcode a value of '8', for example, because its
> semantics depends entirely on RAM capacity of the system to begin with
> since badness() scores are only useful when used in comparison with
> other tasks.

You'd be amazed what dumb things applications do. Get thee to
http://google.com/codesearch?hl=en&lr=&q=[^a-z]oom_adj[^a-z]&sbtn=Search
and start reading. All 641 matches ;)

Here's one which which writes -16:
http://google.com/codesearch/p?hl=en#eN5TNOm7KtI/trunk/wlan/vendor/asus/eeepc/init.rc&q=[^a-z]oom_adj[^a-z]&sa=N&cd=70&ct=rc

Let's not change the ABI please.

David Rientjes

unread,

Feb 11, 2010, 5:00:02 PM2/11/10

to

On Thu, 11 Feb 2010, Andrew Morton wrote:

> > Changing any value that may have a tendency to be hardcoded elsewhere is
> > always controversial, but I think the nature of /proc/pid/oom_adj allows
> > us to do so for two specific reasons:
> >
> > - hardcoded values tend not the fall within a range, they tend to either
> > always prefer a certain task for oom kill first or disable oom killing
> > entirely. The current implementation uses this as a bitshift on a
> > seemingly unpredictable and unscientific heuristic that is very
> > difficult to predict at runtime. This means that fewer and fewer
> > applications would hardcode a value of '8', for example, because its
> > semantics depends entirely on RAM capacity of the system to begin with
> > since badness() scores are only useful when used in comparison with
> > other tasks.
>
> You'd be amazed what dumb things applications do. Get thee to
> http://google.com/codesearch?hl=en&lr=&q=[^a-z]oom_adj[^a-z]&sbtn=Search
> and start reading. All 641 matches ;)
>
> Here's one which which writes -16:
> http://google.com/codesearch/p?hl=en#eN5TNOm7KtI/trunk/wlan/vendor/asus/eeepc/init.rc&q=[^a-z]oom_adj[^a-z]&sa=N&cd=70&ct=rc
>
> Let's not change the ABI please.
>

Sigh, this is going to require the amount of system memory to be
partitioned into OOM_ADJUST_MAX, 15, chunks and that's going to be the
granularity at which we'll be able to either bias or discount memory usage
of individual tasks by: instead of being able to do this with 0.1%
granularity we'll now be limited to 100 / 15, or ~7%. That's ~9GB on my
128GB system just because this was originally a bitshift. The upside is
that it's now linear and not exponential.

Andrew Morton

unread,

Feb 11, 2010, 5:40:01 PM2/11/10

to

Can you add newly-named knobs (rather than modifying the existing
ones), deprecate the old ones and then massage writes to the old ones
so that they talk into the new framework?

David Rientjes

unread,

Feb 11, 2010, 5:50:02 PM2/11/10

to

On Thu, 11 Feb 2010, Andrew Morton wrote:

> > Sigh, this is going to require the amount of system memory to be
> > partitioned into OOM_ADJUST_MAX, 15, chunks and that's going to be the
> > granularity at which we'll be able to either bias or discount memory usage
> > of individual tasks by: instead of being able to do this with 0.1%
> > granularity we'll now be limited to 100 / 15, or ~7%. That's ~9GB on my
> > 128GB system just because this was originally a bitshift. The upside is
> > that it's now linear and not exponential.
>
> Can you add newly-named knobs (rather than modifying the existing
> ones), deprecate the old ones and then massage writes to the old ones
> so that they talk into the new framework?
>

That's what I was thinking, add /proc/pid/oom_score_adj that is just added
into the badness score (and is then exported with /proc/pid/oom_score)
like this patch did with oom_adj and then scale it into oom_adj units for
that tunable. A write to either oom_adj or oom_score_adj would change the
other, the same thing I did for /proc/sys/vm/dirty_{bytes,ratio} and
/proc/sys/vm/dirty_background_{bytes,ratio} which I guess we have to
support forever since the predecessors are part of the ABI and there's no
way to deprecate them since they'll never be removed for that reason.

Andrew Morton

unread,

Feb 11, 2010, 6:20:03 PM2/11/10

to

On Thu, 11 Feb 2010 14:42:39 -0800 (PST)
David Rientjes <rien...@google.com> wrote:

> On Thu, 11 Feb 2010, Andrew Morton wrote:
>
> > > Sigh, this is going to require the amount of system memory to be
> > > partitioned into OOM_ADJUST_MAX, 15, chunks and that's going to be the
> > > granularity at which we'll be able to either bias or discount memory usage
> > > of individual tasks by: instead of being able to do this with 0.1%
> > > granularity we'll now be limited to 100 / 15, or ~7%. That's ~9GB on my
> > > 128GB system just because this was originally a bitshift. The upside is
> > > that it's now linear and not exponential.
> >
> > Can you add newly-named knobs (rather than modifying the existing
> > ones), deprecate the old ones and then massage writes to the old ones
> > so that they talk into the new framework?
> >
>
> That's what I was thinking, add /proc/pid/oom_score_adj that is just added
> into the badness score (and is then exported with /proc/pid/oom_score)
> like this patch did with oom_adj and then scale it into oom_adj units for
> that tunable. A write to either oom_adj or oom_score_adj would change the
> other,

How ugly is all this?

> the same thing I did for /proc/sys/vm/dirty_{bytes,ratio} and
> /proc/sys/vm/dirty_background_{bytes,ratio} which I guess we have to
> support forever since the predecessors are part of the ABI and there's no
> way to deprecate them since they'll never be removed for that reason.

Ah, OK, I was trying to remember where we did that ;)

There _are_ things we can do though. Detect a write to the old file and
emit a WARN_ON_ONCE("you suck"). Wait a year, turn it into
WARN_ON("you really suck"). Wait a year, then remove it.

Andrew Morton

unread,

Feb 11, 2010, 6:40:03 PM2/11/10

to

On Thu, 11 Feb 2010 15:31:14 -0800 (PST)
David Rientjes <rien...@google.com> wrote:

> > There _are_ things we can do though. Detect a write to the old file and
> > emit a WARN_ON_ONCE("you suck"). Wait a year, turn it into
> > WARN_ON("you really suck"). Wait a year, then remove it.
> >
>

> Ok, I'll use WARN_ON_ONCE() to let the user know of the deprecation and
> then add an entry to Documentation/feature-removal-schedule.txt.

A printk_once() would be better - WARN() generates a big stack
spew which often is wholly irrelevant.

David Rientjes

unread,

Feb 11, 2010, 6:40:02 PM2/11/10

to

On Thu, 11 Feb 2010, Andrew Morton wrote:

> > > > Sigh, this is going to require the amount of system memory to be
> > > > partitioned into OOM_ADJUST_MAX, 15, chunks and that's going to be the
> > > > granularity at which we'll be able to either bias or discount memory usage
> > > > of individual tasks by: instead of being able to do this with 0.1%
> > > > granularity we'll now be limited to 100 / 15, or ~7%. That's ~9GB on my
> > > > 128GB system just because this was originally a bitshift. The upside is
> > > > that it's now linear and not exponential.
> > >
> > > Can you add newly-named knobs (rather than modifying the existing
> > > ones), deprecate the old ones and then massage writes to the old ones
> > > so that they talk into the new framework?
> > >
> >
> > That's what I was thinking, add /proc/pid/oom_score_adj that is just added
> > into the badness score (and is then exported with /proc/pid/oom_score)
> > like this patch did with oom_adj and then scale it into oom_adj units for
> > that tunable. A write to either oom_adj or oom_score_adj would change the
> > other,
>
> How ugly is all this?
>

The advantages outweigh the disadvantages, users need to be able to
specify how much memory vital tasks should be able to use compared to
others without getting penalized and that needs to be done as a fraction
of available memory. I wanted to avoid it originally by not having to
introduce another tunable, but I understand the need for a stable ABI and
backwards compatability. The way /proc/pid/oom_adj currently works as a
bitshift on the badness score is nearly impossible to tune correctly so
change in scoring is inevitable. Luckily, users who tune either can
ignore the other until such time as oom_adj can be removed.

> There _are_ things we can do though. Detect a write to the old file and
> emit a WARN_ON_ONCE("you suck"). Wait a year, turn it into
> WARN_ON("you really suck"). Wait a year, then remove it.
>

Ok, I'll use WARN_ON_ONCE() to let the user know of the deprecation and

then add an entry to Documentation/feature-removal-schedule.txt.

KAMEZAWA Hiroyuki

unread,

Feb 11, 2010, 7:00:01 PM2/11/10

to

On Wed, 10 Feb 2010 08:32:08 -0800 (PST)
David Rientjes <rien...@google.com> wrote:

> Tasks that do not share the same set of allowed nodes with the task that
> triggered the oom should not be considered as candidates for oom kill.
>
> Tasks in other cpusets with a disjoint set of mems would be unfairly
> penalized otherwise because of oom conditions elsewhere; an extreme
> example could unfairly kill all other applications on the system if a
> single task in a user's cpuset sets itself to OOM_DISABLE and then uses
> more memory than allowed.
>
> Killing tasks outside of current's cpuset rarely would free memory for
> current anyway.
>
> Signed-off-by: David Rientjes <rien...@google.com>

Reviewed-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>

KAMEZAWA Hiroyuki

unread,

Feb 11, 2010, 7:10:01 PM2/11/10

to

On Wed, 10 Feb 2010 08:32:10 -0800 (PST)
David Rientjes <rien...@google.com> wrote:

> When a task is chosen for oom kill, the oom killer first attempts to
> sacrifice a child not sharing its parent's memory instead.
> Unfortunately, this often kills in a seemingly random fashion based on
> the ordering of the selected task's child list. Additionally, it is not
> guaranteed at all to free a large amount of memory that we need to
> prevent additional oom killing in the very near future.
>
> Instead, we now only attempt to sacrifice the worst child not sharing its
> parent's memory, if one exists. The worst child is indicated with the
> highest badness() score. This serves two advantages: we kill a
> memory-hogging task more often, and we allow the configurable
> /proc/pid/oom_adj value to be considered as a factor in which child to
> kill.
>
> Reviewers may observe that the previous implementation would iterate
> through the children and attempt to kill each until one was successful
> and then the parent if none were found while the new code simply kills
> the most memory-hogging task or the parent. Note that the only time
> oom_kill_task() fails, however, is when a child does not have an mm or
> has a /proc/pid/oom_adj of OOM_DISABLE. badness() returns 0 for both
> cases, so the final oom_kill_task() will always succeed.
>
> Signed-off-by: David Rientjes <rien...@google.com>

Reviewed-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>

Maybe better than current logic..but I'm not sure why we have to check children ;)

BTW,
==
list_for_each_entry(child, &p->children, sibling) {
task_lock(child);

if (child->mm != mm && child->mm)

points += child->mm->total_vm/2 + 1;

task_unlock(child);
}
==
I wonder this part should be
points += (child->total_vm/2) >> child->signal->oom_adj + 1

If not, in following situation,
==
parent (oom_adj = 0)
-> child (oom_adj=-15, very big memory user)
==
the child may be killd at first, anyway. Today, I have to explain customers
"When you set oom_adj to a process, please set the same value to all ancestors.
Otherwise, your oom_adj value will be ignored."

No ?

Thanks,
-Kame

David Rientjes

unread,

Feb 11, 2010, 7:20:03 PM2/11/10

to

On Fri, 12 Feb 2010, KAMEZAWA Hiroyuki wrote:

> Maybe better than current logic..but I'm not sure why we have to check children ;)
>
> BTW,
> ==
> list_for_each_entry(child, &p->children, sibling) {
> task_lock(child);
> if (child->mm != mm && child->mm)
> points += child->mm->total_vm/2 + 1;
> task_unlock(child);
> }
> ==
> I wonder this part should be
> points += (child->total_vm/2) >> child->signal->oom_adj + 1
>
> If not, in following situation,
> ==
> parent (oom_adj = 0)
> -> child (oom_adj=-15, very big memory user)
> ==
> the child may be killd at first, anyway. Today, I have to explain customers
> "When you set oom_adj to a process, please set the same value to all ancestors.
> Otherwise, your oom_adj value will be ignored."
>

This is a different change than the forkbomb detection which is rewritten
in the fourth patch in the series. We must rely on badness() being able
to tell us how beneficial it will be to kill a task, so iterating through
the child list and picking the most beneficial is the goal of this patch.
It reduces the chances of needlessly killing a child using very little
memory for no benefit just because it was forked first.

KAMEZAWA Hiroyuki

unread,

Feb 11, 2010, 7:20:03 PM2/11/10

to

On Wed, 10 Feb 2010 08:32:24 -0800 (PST)
David Rientjes <rien...@google.com> wrote:

> Remove the redundancy in __oom_kill_task() since:
>
> - init can never be passed to this function: it will never be PF_EXITING
> or selectable from select_bad_process(), and
>
> - it will never be passed a task from oom_kill_task() without an ->mm
> and we're unconcerned about detachment from exiting tasks, there's no
> reason to protect them against SIGKILL or access to memory reserves.
>
> Also moves the kernel log message to a higher level since the verbosity
> is not always emitted here; we need not print an error message if an
> exiting task is given a longer timeslice.
>
> Signed-off-by: David Rientjes <rien...@google.com>

If you say "never", it's better to add BUG_ON() rather than
if (!p->mm)...

But yes, this patch seesm to remove unnecessary codes.
Reviewed-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>

David Rientjes

unread,

Feb 11, 2010, 7:30:02 PM2/11/10

to

On Fri, 12 Feb 2010, KAMEZAWA Hiroyuki wrote:

> > Remove the redundancy in __oom_kill_task() since:
> >
> > - init can never be passed to this function: it will never be PF_EXITING
> > or selectable from select_bad_process(), and
> >
> > - it will never be passed a task from oom_kill_task() without an ->mm
> > and we're unconcerned about detachment from exiting tasks, there's no
> > reason to protect them against SIGKILL or access to memory reserves.
> >
> > Also moves the kernel log message to a higher level since the verbosity
> > is not always emitted here; we need not print an error message if an
> > exiting task is given a longer timeslice.
> >
> > Signed-off-by: David Rientjes <rien...@google.com>
>
> If you say "never", it's better to add BUG_ON() rather than
> if (!p->mm)...
>

As the description says, oom_kill_task() never passes __oom_kill_task() a
task, p, where !p->mm, but it doesn't imply that p cannot detach its ->mm
before __oom_kill_task() gets a chance to run. The point is that we don't
really care about giving it access to memory reserves anymore since it's
exiting and won't be allocating anything. Warning about that scenario is
unnecessary and would simply spam the kernel log, a recall to the oom
killer would no longer select this task in case the oom condition persists
anyway.

> But yes, this patch seesm to remove unnecessary codes.
> Reviewed-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
>

Thanks!

KAMEZAWA Hiroyuki

unread,

Feb 11, 2010, 7:40:01 PM2/11/10

to

On Wed, 10 Feb 2010 08:32:17 -0800 (PST)
David Rientjes <rien...@google.com> wrote:

> Two VM sysctls, oom dump_tasks and oom_kill_allocating_task, were
> implemented for very large systems to avoid excessively long tasklist
> scans. The former suppresses helpful diagnostic messages that are
> emitted for each thread group leader that are candidates for oom kill
> including their pid, uid, vm size, rss, oom_adj value, and name; this
> information is very helpful to users in understanding why a particular
> task was chosen for kill over others. The latter simply kills current,
> the task triggering the oom condition, instead of iterating through the
> tasklist looking for the worst offender.
>
> Both of these sysctls are combined into one for use on the aforementioned
> large systems: oom_kill_quick. This disables the now-default
> oom_dump_tasks and kills current whenever the oom killer is called.
>
> The oom killer rewrite is the perfect opportunity to combine both sysctls
> into one instead of carrying around the others for years to come for
> nothing else than legacy purposes.
>
> Signed-off-by: David Rientjes <rien...@google.com>

seems reasonable..but how old these APIs are ? Replacement is ok ?

Reviewed-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>

> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majo...@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"do...@kvack.org"> em...@kvack.org </a>

David Rientjes

unread,

Feb 12, 2010, 5:00:02 AM2/12/10

to

On Fri, 12 Feb 2010, KAMEZAWA Hiroyuki wrote:

> > Two VM sysctls, oom dump_tasks and oom_kill_allocating_task, were
> > implemented for very large systems to avoid excessively long tasklist
> > scans. The former suppresses helpful diagnostic messages that are
> > emitted for each thread group leader that are candidates for oom kill
> > including their pid, uid, vm size, rss, oom_adj value, and name; this
> > information is very helpful to users in understanding why a particular
> > task was chosen for kill over others. The latter simply kills current,
> > the task triggering the oom condition, instead of iterating through the
> > tasklist looking for the worst offender.
> >
> > Both of these sysctls are combined into one for use on the aforementioned
> > large systems: oom_kill_quick. This disables the now-default
> > oom_dump_tasks and kills current whenever the oom killer is called.
> >
> > The oom killer rewrite is the perfect opportunity to combine both sysctls
> > into one instead of carrying around the others for years to come for
> > nothing else than legacy purposes.
> >
> > Signed-off-by: David Rientjes <rien...@google.com>
>
> seems reasonable..but how old these APIs are ? Replacement is ok ?
>

I'm not concerned about /proc/sys/vm/oom_dump_tasks because it was
disabled by default and is now enabled by default (unless the user sets
this new /proc/sys/vm/oom_kill_quick). So existing users of
oom_dump_tasks will just have their write fail but identical behavior as
before.

/proc/sys/vm/oom_kill_allocating_task is different since it now requires
enabling /proc/sys/vm/oom_kill_quick, but I think there are such few users
(SGI originally requested it a couple years ago when we started scanning
the tasklist for CONSTRAINT_CPUSET in 2.6.24) and the side-effect of not
enabling it is minimal, it's just a long delay at oom kill time because
they must scan the tasklist. Therefore, I don't see it as a major problem
that will cause large disruptions, instead I see it as a great opportunity
to get rid of one more sysctl without taking away functionality.

> Reviewed-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
>

Thanks!

Minchan Kim

unread,

Feb 12, 2010, 9:00:02 AM2/12/10

to

Hi, David.

On Thu, 2010-02-11 at 01:14 -0800, David Rientjes wrote:
> > > +/*
> > > + * Tasks that fork a very large number of children with seperate address
> > > spaces
> > > + * may be the result of a bug, user error, or a malicious application. The
> > > oom
> > > + * killer assesses a penalty equaling
> >
> > It could also be the result of the system getting many client
> > connections - think of overloaded mail, web or database servers.
> >
>
> True, that's a great example of why child tasks should be sacrificed for
> the parent: if the oom killer is being called then we are truly overloaded
> and there's no shame in killing excessive client connections to recover,
> otherwise we might find the entire server becoming unresponsive. The user
> can easily tune to /proc/sys/vm/oom_forkbomb_thres to define what
> "excessive" is to assess the penalty, if any. I'll add that to the
> comment if we require a second revision.
>

I am worried about opposite case.

If forkbomb parent makes so many children in a short time(ex, 2000 per
second) continuously and we kill a child continuously not parent, system
is almost unresponsible, I think.
I suffered from that case in LTP and no swap system.
It might be a corner case but might happen in real.

I think we could have two types of forkbomb.

Normal forkbomb : apache, DB server and so on.
Buggy forkbomb: It's mistake of user.

We can control normal forkbomb by oom_forkbomb_thres.
But how about handling buggy forkbomb?

If we make sure this task is buggy forkbomb, it would be better to kill
it. But it's hard to make sure it's a buggy forkbomb.

Could we solve this problem by following as?
If OOM selects victim and then the one was selected victim right before
and it's repeatable 5 times for example, then we kill the victim(buggy
forkbom) itself not child of one. It is assumed normal forkbomb is
controlled by admin who uses oom_forkbomb_thres well. So it doesn't
happen selecting victim continuously above five time.

--
Kind regards,
Minchan Kim

David Rientjes

unread,

Feb 12, 2010, 4:10:02 PM2/12/10

to

On Fri, 12 Feb 2010, Minchan Kim wrote:

> > True, that's a great example of why child tasks should be sacrificed for
> > the parent: if the oom killer is being called then we are truly overloaded
> > and there's no shame in killing excessive client connections to recover,
> > otherwise we might find the entire server becoming unresponsive. The user
> > can easily tune to /proc/sys/vm/oom_forkbomb_thres to define what
> > "excessive" is to assess the penalty, if any. I'll add that to the
> > comment if we require a second revision.
> >
>
> I am worried about opposite case.
>
> If forkbomb parent makes so many children in a short time(ex, 2000 per
> second) continuously and we kill a child continuously not parent, system
> is almost unresponsible, I think.

The oom killer is not the appropriate place for a kernel forkbomb policy
to be implemented, you'd need to address that concern in the scheduler.
When I've brought that up in the past, the response is that if we aren't
out of memory, then it isn't a problem. It is a problem for buggy
applications because their timeslice is now spread across an egregious
amount of tasks that they are perhaps leaking and is detrimental to their
server's performance. I'm not saying that we need to enforce a hard limit
on how many tasks a server forks, for instance, but the scheduler can
detect forkbombs much easier than the oom killer's tasklist scan by at
least indicating to us with a process flag that it is a likely forkbomb.

> I suffered from that case in LTP and no swap system.
> It might be a corner case but might happen in real.
>

If you look at the patchset overall and not just this one patch, you'll
notice that we now kill the child with the highest badness() score first,
i.e. generally the one consuming the most memory. That is radically
different than the previous behavior and should prevent the system from
becoming unresponsive. The goal is to allow the user to react to the
forkbomb rather than implement a strict detection and handling heuristic
that kills innocent servers and system daemons.

> If we make sure this task is buggy forkbomb, it would be better to kill
> it. But it's hard to make sure it's a buggy forkbomb.
>
> Could we solve this problem by following as?
> If OOM selects victim and then the one was selected victim right before
> and it's repeatable 5 times for example, then we kill the victim(buggy
> forkbom) itself not child of one. It is assumed normal forkbomb is
> controlled by admin who uses oom_forkbomb_thres well. So it doesn't
> happen selecting victim continuously above five time.
>

That doesn't work with Rik's example of a webserver that forks a large
number of threads to handle client connections. It is _always_ better to
kill a child instead of making the entire webserver unresponsive.

In other words, doing anything in the oom killer other than slightly
penalizing these tasks and killing a child is really a non-starter because
there are too many critical use cases (we have many) that would be
unfairly biased against.

Minchan Kim

unread,

Feb 12, 2010, 9:50:01 PM2/12/10

to

On Sat, Feb 13, 2010 at 6:00 AM, David Rientjes <rien...@google.com> wrote:
> On Fri, 12 Feb 2010, Minchan Kim wrote:
>
>> > True, that's a great example of why child tasks should be sacrificed for
>> > the parent: if the oom killer is being called then we are truly overloaded
>> > and there's no shame in killing excessive client connections to recover,
>> > otherwise we might find the entire server becoming unresponsive. The user
>> > can easily tune to /proc/sys/vm/oom_forkbomb_thres to define what
>> > "excessive" is to assess the penalty, if any. I'll add that to the
>> > comment if we require a second revision.
>> >
>>
>> I am worried about opposite case.
>>
>> If forkbomb parent makes so many children in a short time(ex, 2000 per
>> second) continuously and we kill a child continuously not parent, system
>> is almost unresponsible, I think.
>
> The oom killer is not the appropriate place for a kernel forkbomb policy
> to be implemented, you'd need to address that concern in the scheduler.

I agree. but your's patch try to implement policy(avg rss of children < HZ)
in oom killer as well as detection.
so I pointed out that.
I think if we want to implement it, we also consider above scenario.
As you said, it would be better to detect forkbom in scheduler.
Then, let's remove forkbomb detection in OOM killer.
Afterward, we can implement it in scheduler and can use it in OOM killer.

It makes OOM killer more simple and predictable.

> When I've brought that up in the past, the response is that if we aren't
> out of memory, then it isn't a problem. It is a problem for buggy

I said the situation out of memory.

> applications because their timeslice is now spread across an egregious
> amount of tasks that they are perhaps leaking and is detrimental to their
> server's performance. I'm not saying that we need to enforce a hard limit
> on how many tasks a server forks, for instance, but the scheduler can
> detect forkbombs much easier than the oom killer's tasklist scan by at
> least indicating to us with a process flag that it is a likely forkbomb.
>
>> I suffered from that case in LTP and no swap system.
>> It might be a corner case but might happen in real.
>>
>
> If you look at the patchset overall and not just this one patch, you'll
> notice that we now kill the child with the highest badness() score first,
> i.e. generally the one consuming the most memory. That is radically

It would work well just in case children have big difference badness scores.

> different than the previous behavior and should prevent the system from
> becoming unresponsive. The goal is to allow the user to react to the
> forkbomb rather than implement a strict detection and handling heuristic
> that kills innocent servers and system daemons.
>
>> If we make sure this task is buggy forkbomb, it would be better to kill
>> it. But it's hard to make sure it's a buggy forkbomb.
>>
>> Could we solve this problem by following as?
>> If OOM selects victim and then the one was selected victim right before
>> and it's repeatable 5 times for example, then we kill the victim(buggy
>> forkbom) itself not child of one. It is assumed normal forkbomb is
>> controlled by admin who uses oom_forkbomb_thres well. So it doesn't
>> happen selecting victim continuously above five time.
>>
>
> That doesn't work with Rik's example of a webserver that forks a large
> number of threads to handle client connections. It is _always_ better to
> kill a child instead of making the entire webserver unresponsive.

In such case, admin have to handle it by oom_forkbom_thres.
Isn't it your goal?

My suggestion is how handle buggy forkbomb processes which make
system almost hang by user's mistake. :)

>
> In other words, doing anything in the oom killer other than slightly
> penalizing these tasks and killing a child is really a non-starter because
> there are too many critical use cases (we have many) that would be
> unfairly biased against.
>

Tend to agree. So I hope we shouldn't have consider forkbomb in OOM killer.
I guess forkbomb logic in OOM killer could make many issues in future, still.
As you said, it would be better to implement in scheduler and OOM
killer just uses it.

Thanks for quick reply, David.

--
Kind regards,
Minchan Kim

Minchan Kim

unread,

Feb 12, 2010, 10:30:02 PM2/12/10

to

On Thu, Feb 11, 2010 at 1:32 AM, David Rientjes <rien...@google.com> wrote:
> When a task is chosen for oom kill, the oom killer first attempts to
> sacrifice a child not sharing its parent's memory instead.
> Unfortunately, this often kills in a seemingly random fashion based on
> the ordering of the selected task's child list. Additionally, it is not
> guaranteed at all to free a large amount of memory that we need to
> prevent additional oom killing in the very near future.
>
> Instead, we now only attempt to sacrifice the worst child not sharing its
> parent's memory, if one exists. The worst child is indicated with the
> highest badness() score. This serves two advantages: we kill a
> memory-hogging task more often, and we allow the configurable
> /proc/pid/oom_adj value to be considered as a factor in which child to
> kill.
>
> Reviewers may observe that the previous implementation would iterate
> through the children and attempt to kill each until one was successful
> and then the parent if none were found while the new code simply kills
> the most memory-hogging task or the parent. Note that the only time
> oom_kill_task() fails, however, is when a child does not have an mm or
> has a /proc/pid/oom_adj of OOM_DISABLE. badness() returns 0 for both
> cases, so the final oom_kill_task() will always succeed.
>
> Signed-off-by: David Rientjes <rien...@google.com>

Reviewed-by: Minchan Kim <minch...@gmail.com>

Regardless of forkbom detection, It does makes sense to me.

KOSAKI Motohiro

unread,

Feb 14, 2010, 10:00:02 PM2/14/10

to

> Tasks that do not share the same set of allowed nodes with the task that
> triggered the oom should not be considered as candidates for oom kill.
>
> Tasks in other cpusets with a disjoint set of mems would be unfairly
> penalized otherwise because of oom conditions elsewhere; an extreme
> example could unfairly kill all other applications on the system if a
> single task in a user's cpuset sets itself to OOM_DISABLE and then uses
> more memory than allowed.
>
> Killing tasks outside of current's cpuset rarely would free memory for
> current anyway.
>
> Signed-off-by: David Rientjes <rien...@google.com>

This patch does right thing and looks promissing. but unfortunately
I have to NAK this patch temporary.

This patch is nearly just revert of the commit 7887a3da75. We have to
dig archaeology mail log and find why this reverting don't cause
the old pain again.

However, I personally think we end up to merge this. It is the reason
I've used "temporary".

> ---
> mm/oom_kill.c | 12 +++---------
> 1 files changed, 3 insertions(+), 9 deletions(-)

>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c

> @@ -35,7 +35,7 @@ static DEFINE_SPINLOCK(zone_scan_lock);
> /* #define DEBUG */
>
> /*
> - * Is all threads of the target process nodes overlap ours?
> + * Do all threads of the target process overlap our allowed nodes?
> */
> static int has_intersects_mems_allowed(struct task_struct *tsk)
> {
> @@ -167,14 +167,6 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
> points /= 4;
>
> /*
> - * If p's nodes don't overlap ours, it may still help to kill p
> - * because p may have allocated or otherwise mapped memory on
> - * this node before. However it will be less likely.
> - */
> - if (!has_intersects_mems_allowed(p))
> - points /= 8;
> -
> - /*
> * Adjust the score by oom_adj.
> */
> if (oom_adj) {
> @@ -266,6 +258,8 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
> continue;
> if (mem && !task_in_mem_cgroup(p, mem))
> continue;
> + if (!has_intersects_mems_allowed(p))
> + continue;
>
> /*
> * This task already has access to memory reserves and is

>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majo...@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"do...@kvack.org"> em...@kvack.org </a>

--

KOSAKI Motohiro

unread,

Feb 14, 2010, 10:00:02 PM2/14/10

to

> This patchset is a rewrite of the out of memory killer to address several
> issues that have been raised recently. The most notable change is a
> complete rewrite of the badness heuristic that determines which task is
> killed; the goal was to make it as simple and predictable as possible
> while still addressing issues that plague the VM.
>
> This patchset is based on mmotm-2010-02-05-15-06 because of the following
> dependencies:

At first, I'm glad that you tackle this issue. unfortunatelly I'm very busy now.
but I'll make a time for reviewing this patches asap.

>
> [patch 4/7] oom: badness heuristic rewrite:
> mm-count-swap-usage.patch
>
> [patch 5/7] oom: replace sysctls with quick mode:
> sysctl-clean-up-vm-related-variable-delcarations.patch
>
> To apply to mainline, download 2.6.33-rc7 and apply
>
> mm-clean-up-mm_counter.patch
> mm-avoid-false-sharing-of-mm_counter.patch
> mm-avoid-false_sharing-of-mm_counter-checkpatch-fixes.patch
> mm-count-swap-usage.patch
> mm-count-swap-usage-checkpatch-fixes.patch
> mm-introduce-dump_page-and-print-symbolic-flag-names.patch
> sysctl-clean-up-vm-related-variable-declarations.patch
> sysctl-clean-up-vm-related-variable-declarations-fix.patch
>
> from http://userweb.kernel.org/~akpm/mmotm/broken-out.tar.gz first.
> ---
> Documentation/filesystems/proc.txt | 78 ++++---
> Documentation/sysctl/vm.txt | 51 ++---
> fs/proc/base.c | 13 +-
> include/linux/mempolicy.h | 13 +-
> include/linux/oom.h | 18 +-
> kernel/sysctl.c | 15 +-
> mm/mempolicy.c | 39 +++
> mm/oom_kill.c | 455 +++++++++++++++++++-----------------
> mm/page_alloc.c | 3 +
> 9 files changed, 383 insertions(+), 302 deletions(-)

KOSAKI Motohiro

unread,

Feb 14, 2010, 10:10:01 PM2/14/10

to

> When a task is chosen for oom kill, the oom killer first attempts to
> sacrifice a child not sharing its parent's memory instead.
> Unfortunately, this often kills in a seemingly random fashion based on
> the ordering of the selected task's child list. Additionally, it is not
> guaranteed at all to free a large amount of memory that we need to
> prevent additional oom killing in the very near future.
>
> Instead, we now only attempt to sacrifice the worst child not sharing its
> parent's memory, if one exists. The worst child is indicated with the
> highest badness() score. This serves two advantages: we kill a
> memory-hogging task more often, and we allow the configurable
> /proc/pid/oom_adj value to be considered as a factor in which child to
> kill.
>
> Reviewers may observe that the previous implementation would iterate
> through the children and attempt to kill each until one was successful
> and then the parent if none were found while the new code simply kills
> the most memory-hogging task or the parent. Note that the only time
> oom_kill_task() fails, however, is when a child does not have an mm or
> has a /proc/pid/oom_adj of OOM_DISABLE. badness() returns 0 for both
> cases, so the final oom_kill_task() will always succeed.
>
> Signed-off-by: David Rientjes <rien...@google.com>

Probably, kamezawa-san talked about right thing. but this patch is
enough small and it have no regression risk. So, we can choice step-by-step
development.
Reviewed-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>

> ---
> mm/oom_kill.c | 23 +++++++++++++++++------
> 1 files changed, 17 insertions(+), 6 deletions(-)

>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c

KOSAKI Motohiro

unread,

Feb 15, 2010, 3:10:01 AM2/15/10

to

> This a complete rewrite of the oom killer's badness() heuristic which is
> used to determine which task to kill in oom conditions. The goal is to
> make it as simple and predictable as possible so the results are better
> understood and we end up killing the task which will lead to the most
> memory freeing while still respecting the fine-tuning from userspace.
>
> The baseline for the heuristic is a proportion of memory that each task
> is currently using in memory plus swap compared to the amount of
> "allowable" memory. "Allowable," in this sense, means the system-wide
> resources for unconstrained oom conditions, the set of mempolicy nodes,
> the mems attached to current's cpuset, or a memory controller's limit.
> The proportion is given on a scale of 0 (never kill) to 1000 (always
> kill), roughly meaning that if a task has a badness() score of 500 that
> the task consumes approximately 50% of allowable memory resident in RAM
> or in swap space.
>
> The proportion is always relative to the amount of "allowable" memory and
> not the total amount of RAM systemwide so that mempolicies and cpusets
> may operate in isolation; they shall not need to know the true size of
> the machine on which they are running if they are bound to a specific set
> of nodes or mems, respectively.
>
> Forkbomb detection is done in a completely different way: a threshold is
> configurable from userspace to determine how many first-generation execve
> children (those with their own address spaces) a task may have before it
> is considered a forkbomb. This can be tuned by altering the value in
> /proc/sys/vm/oom_forkbomb_thres, which defaults to 1000.
>
> When a task has more than 1000 first-generation children with different
> address spaces than itself, a penalty of
>
> (average rss of children) * (# of 1st generation execve children)
> -----------------------------------------------------------------
> oom_forkbomb_thres
>
> is assessed. So, for example, using the default oom_forkbomb_thres of
> 1000, the penalty is twice the average rss of all its execve children if
> there are 2000 such tasks. A task is considered to count toward the
> threshold if its total runtime is less than one second; for 1000 of such
> tasks to exist, the parent process must be forking at an extremely high
> rate either erroneously or maliciously.
>
> Even though a particular task may be designated a forkbomb and selected
> as the victim, the oom killer will still kill the 1st generation execve
> child with the highest badness() score in its place. The avoids killing
> important servers or system daemons.
>
> Root tasks are given 3% extra memory just like __vm_enough_memory()
> provides in LSMs. In the event of two tasks consuming similar amounts of
> memory, it is generally better to save root's task.
>
> Because of the change in the badness() heuristic's baseline, a change
> must also be made to the user interface used to tune it. Instead of a
> scale from -16 to +15 to indicate a bitshift on the point value given to
> a task, which was very difficult to tune accurately or with any degree of
> precision, /proc/pid/oom_adj now ranges from -1000 to +1000. That is, it
> can be used to polarize the heuristic such that certain tasks are never
> considered for oom kill while others are always considered. The value is
> added directly into the badness() score so a value of -500, for example,
> means to discount 50% of its memory consumption in comparison to other
> tasks either on the system, bound to the mempolicy, or in the cpuset.

>
> OOM_ADJUST_MIN and OOM_ADJUST_MAX have been exported to userspace since
> 2006 via include/linux/oom.h. This alters their values from -16 to -1000

> and from +15 to +1000, respectively. OOM_DISABLE is now the equivalent
> of the lowest possible value, OOM_ADJUST_MIN. Adding its value, -1000,
> to any badness score will always return 0.
>
> Although redefining these values may be controversial, it is much easier
> to understand when the units are fully understood as described above.
> In the short-term, there may be userspace breakage for tasks that
> hardcode -17 meaning OOM_DISABLE, for example, but the long-term will
> make the semantics much easier to understand and oom killing much more
> effective.

I NAK this patch as same as many other people. This patch bring to a lot of
compatibility issue. and it isn't necessary.

KOSAKI Motohiro

unread,

Feb 15, 2010, 3:20:02 AM2/15/10

to

> Two VM sysctls, oom dump_tasks and oom_kill_allocating_task, were
> implemented for very large systems to avoid excessively long tasklist
> scans. The former suppresses helpful diagnostic messages that are
> emitted for each thread group leader that are candidates for oom kill
> including their pid, uid, vm size, rss, oom_adj value, and name; this
> information is very helpful to users in understanding why a particular
> task was chosen for kill over others. The latter simply kills current,
> the task triggering the oom condition, instead of iterating through the
> tasklist looking for the worst offender.
>
> Both of these sysctls are combined into one for use on the aforementioned
> large systems: oom_kill_quick. This disables the now-default
> oom_dump_tasks and kills current whenever the oom killer is called.
>
> The oom killer rewrite is the perfect opportunity to combine both sysctls
> into one instead of carrying around the others for years to come for
> nothing else than legacy purposes.

"_quick" is always bad sysctl name. instead, turnning oom_dump_tasks on
by default is better.

plus, this patch makes unnecessary compatibility issue.

>
> Signed-off-by: David Rientjes <rien...@google.com>

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c

> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majo...@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"do...@kvack.org"> em...@kvack.org </a>

--

KOSAKI Motohiro

unread,

Feb 15, 2010, 3:40:02 AM2/15/10

to

> On Fri, 12 Feb 2010, KAMEZAWA Hiroyuki wrote:
>
> > > Remove the redundancy in __oom_kill_task() since:
> > >
> > > - init can never be passed to this function: it will never be PF_EXITING
> > > or selectable from select_bad_process(), and
> > >
> > > - it will never be passed a task from oom_kill_task() without an ->mm
> > > and we're unconcerned about detachment from exiting tasks, there's no
> > > reason to protect them against SIGKILL or access to memory reserves.
> > >
> > > Also moves the kernel log message to a higher level since the verbosity
> > > is not always emitted here; we need not print an error message if an
> > > exiting task is given a longer timeslice.
> > >
> > > Signed-off-by: David Rientjes <rien...@google.com>
> >
> > If you say "never", it's better to add BUG_ON() rather than
> > if (!p->mm)...
> >
>
> As the description says, oom_kill_task() never passes __oom_kill_task() a
> task, p, where !p->mm, but it doesn't imply that p cannot detach its ->mm
> before __oom_kill_task() gets a chance to run. The point is that we don't
> really care about giving it access to memory reserves anymore since it's
> exiting and won't be allocating anything. Warning about that scenario is
> unnecessary and would simply spam the kernel log, a recall to the oom
> killer would no longer select this task in case the oom condition persists
> anyway.

I agree this description is correct and this code is unnecessary.
Reviewed-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>

>
> > But yes, this patch seesm to remove unnecessary codes.
> > Reviewed-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
> >
>
> Thanks!
>
> --

> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majo...@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"do...@kvack.org"> em...@kvack.org </a>

--

David Rientjes

unread,

Feb 15, 2010, 5:00:04 PM2/15/10

to

On Sat, 13 Feb 2010, Minchan Kim wrote:

> > The oom killer is not the appropriate place for a kernel forkbomb policy
> > to be implemented, you'd need to address that concern in the scheduler.
>
> I agree. but your's patch try to implement policy(avg rss of children < HZ)
> in oom killer as well as detection.
> so I pointed out that.

That's not what's used, we detect whether a child should be included in
the forkbomb count by checking for two traits: (i) it doesn't share an
->mm with the parent, otherwise it wouldn't free any memory unless the
parent was killed as well, and (ii) its total runtime is less than a
second since threads in forkbomb scenarios don't typically get any
runtime. The _penalization_ is then the average rss of those children
times how many times the count exceeds oom_forkbomb_thres.

> I think if we want to implement it, we also consider above scenario.
> As you said, it would be better to detect forkbom in scheduler.
> Then, let's remove forkbomb detection in OOM killer.
> Afterward, we can implement it in scheduler and can use it in OOM killer.
>

We're not enforcing a global, system-wide forkbomb policy in the oom
killer, but we do need to identify tasks that fork a very large number of
tasks to break ties with other tasks: in other words, it would not be
helpful to kill an application that has been running for weeks because
another application with the same or less memory usage has forked 1000
children and has caused an oom condition. That unfairly penalizes the
former application that is actually doing work.

Again, I'd encourage you to look at this as only a slight penalization
rather than a policy that strictly needs to be enforced. If it were
strictly enforced, it would be a prerequisite for selection if such a task
were to exist; in my implementation, it is part of the heuristic.

> > That doesn't work with Rik's example of a webserver that forks a large
> > number of threads to handle client connections. It is _always_ better to
> > kill a child instead of making the entire webserver unresponsive.
>
> In such case, admin have to handle it by oom_forkbom_thres.
> Isn't it your goal?
>

oom_forkbomb_thres has a default value, which is 1000, so it should be
enabled by default.

> My suggestion is how handle buggy forkbomb processes which make
> system almost hang by user's mistake. :)
>

I don't think you've given a clear description (or, even better, a patch)
of your suggestion.

David Rientjes

unread,

Feb 15, 2010, 5:10:03 PM2/15/10

to

On Mon, 15 Feb 2010, KOSAKI Motohiro wrote:

> > Tasks that do not share the same set of allowed nodes with the task that
> > triggered the oom should not be considered as candidates for oom kill.
> >
> > Tasks in other cpusets with a disjoint set of mems would be unfairly
> > penalized otherwise because of oom conditions elsewhere; an extreme
> > example could unfairly kill all other applications on the system if a
> > single task in a user's cpuset sets itself to OOM_DISABLE and then uses
> > more memory than allowed.
> >
> > Killing tasks outside of current's cpuset rarely would free memory for
> > current anyway.
> >
> > Signed-off-by: David Rientjes <rien...@google.com>
>
> This patch does right thing and looks promissing. but unfortunately
> I have to NAK this patch temporary.
>
> This patch is nearly just revert of the commit 7887a3da75. We have to
> dig archaeology mail log and find why this reverting don't cause
> the old pain again.
>

Nick is probably wondering why I cc'd him on this patchset, and this is it
:)

We now determine whether an allocation is constrained by a cpuset by
iterating through the zonelist and checking
cpuset_zone_allowed_softwall(). This checks for the necessary cpuset
restrictions that we need to validate (the GFP_ATOMIC exception is
irrelevant, we don't call into the oom killer for those). We don't need
to kill outside of its cpuset because we're not guaranteed to find any
memory on those nodes, in fact it allows for needless oom killing if a
task sets all of its threads to have OOM_DISABLE in its own cpuset and
then runs out of memory. The oom killer would have killed every other
user task on the system even though the offending application can't
allocate there. That's certainly an undesired result and needs to be
fixed in this manner.

David Rientjes

unread,

Feb 15, 2010, 5:20:02 PM2/15/10

to

On Mon, 15 Feb 2010, KOSAKI Motohiro wrote:

> > Two VM sysctls, oom dump_tasks and oom_kill_allocating_task, were
> > implemented for very large systems to avoid excessively long tasklist
> > scans. The former suppresses helpful diagnostic messages that are
> > emitted for each thread group leader that are candidates for oom kill
> > including their pid, uid, vm size, rss, oom_adj value, and name; this
> > information is very helpful to users in understanding why a particular
> > task was chosen for kill over others. The latter simply kills current,
> > the task triggering the oom condition, instead of iterating through the
> > tasklist looking for the worst offender.
> >
> > Both of these sysctls are combined into one for use on the aforementioned
> > large systems: oom_kill_quick. This disables the now-default
> > oom_dump_tasks and kills current whenever the oom killer is called.
> >
> > The oom killer rewrite is the perfect opportunity to combine both sysctls
> > into one instead of carrying around the others for years to come for
> > nothing else than legacy purposes.
>
> "_quick" is always bad sysctl name.

Why? It does exactly what it says: it kills current without doing an
expensive tasklist scan and suppresses the possibly long tasklist dump.
That's the oom killer's "quick mode."

> instead, turnning oom_dump_tasks on
> by default is better.
>

It's now on by default and can be disabled by enabling oom_kill_quick.

> plus, this patch makes unnecessary compatibility issue.
>

It's the perfect opportunity when totally rewriting the oom killer to
combine two sysctls with the exact same users into one. Users will notice
that the tasklist is always dumped now (we're defaulting oom_dump_tasks
to be enabled), so there is no reason why we can't remove oom_dump_tasks,
we're just giving them a new way to disable it. oom_kill_allocating_task
no longer always means what it once did: with the mempolicy-constrained
oom rewrite, we now iterate the tasklist for such cases to kill a task.
So users need to reassess whether this should be set if all tasks on the
system are constrained by mempolicies, a typical configuration for
extremely large systems.

David Rientjes

unread,

Feb 15, 2010, 5:30:02 PM2/15/10

to

Tasks that do not share the same set of allowed nodes with the task that
triggered the oom should not be considered as candidates for oom kill.

Tasks in other cpusets with a disjoint set of mems would be unfairly
penalized otherwise because of oom conditions elsewhere; an extreme
example could unfairly kill all other applications on the system if a
single task in a user's cpuset sets itself to OOM_DISABLE and then uses
more memory than allowed.

Killing tasks outside of current's cpuset rarely would free memory for
current anyway.

Acked-by: Rik van Riel <ri...@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>

Signed-off-by: David Rientjes <rien...@google.com>
---

mm/oom_kill.c | 12 +++---------
1 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c

David Rientjes

unread,

Feb 15, 2010, 5:30:02 PM2/15/10

to

If memory has been depleted in lowmem zones even with the protection
afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that
killing current users will help. The memory is either reclaimable (or
migratable) already, in which case we should not invoke the oom killer at
all, or it is pinned by an application for I/O. Killing such an
application may leave the hardware in an unspecified state and there is
no guarantee that it will be able to make a timely exit.

Lowmem allocations are now failed in oom conditions so that the task can
perhaps recover or try again later. Killing current is an unnecessary
result for simply making a GFP_DMA or GFP_DMA32 page allocation and no
lowmem allocations use the now-deprecated __GFP_NOFAIL bit so retrying is
unnecessary.

Previously, the heuristic provided some protection for those tasks with
CAP_SYS_RAWIO, but this is no longer necessary since we will not be
killing tasks for the purposes of ISA allocations.

high_zoneidx is gfp_zone(gfp_flags), meaning that ZONE_NORMAL will be the
default for all allocations that are not __GFP_DMA, __GFP_DMA32,
__GFP_HIGHMEM, and __GFP_MOVABLE on kernels configured to support those
flags. Testing for high_zoneidx being less than ZONE_NORMAL will only
return true for allocations that have either __GFP_DMA or __GFP_DMA32.

Acked-by: Rik van Riel <ri...@redhat.com>

Reviewed-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>

Signed-off-by: David Rientjes <rien...@google.com>
---

mm/page_alloc.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1914,6 +1914,9 @@ rebalance:
* running out of options and have to consider going OOM
*/
if (!did_some_progress) {
+ /* The oom killer won't necessarily free lowmem */
+ if (high_zoneidx < ZONE_NORMAL)
+ goto nopage;
if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
if (oom_killer_disabled)
goto nopage;

David Rientjes

unread,

Feb 15, 2010, 5:30:03 PM2/15/10

to

This patchset is a rewrite of the out of memory killer to address several
issues that have been raised recently. The most notable change is a
complete rewrite of the badness heuristic that determines which task is
killed; the goal was to make it as simple and predictable as possible
while still addressing issues that plague the VM.

Changes from version 1:

- updated to mmotm-2010-02-11-21-55

- when iterating the tasklist for mempolicy-constrained oom conditions,
the node of the cpu that a MPOL_F_LOCAL task is running on is now
intersected with the page allocator's nodemask to determine whether it
should be a candidate for oom kill.

- added: [patch 4/9] oom: remove compulsory panic_on_oom mode

- /proc/pid/oom_score_adj was added to prevent ABI breakage for
applications using /proc/pid/oom_adj. /proc/pid/oom_adj may still be
used with the old range but it is then scaled to oom_score_adj units
for a rough linear approximation. There is no loss in functionality
from the old interface.

- added: [patch 6/9] oom: deprecate oom_adj tunable

This patchset is based on mmotm-2010-02-11-21-55 because of the following
dependencies:

[patch 5/9] oom: badness heuristic rewrite:
mm-count-swap-usage.patch

[patch 7/9] oom: replace sysctls with quick mode:
sysctl-clean-up-vm-related-variable-delcarations.patch

To apply to mainline, download 2.6.33-rc8 and apply

mm-clean-up-mm_counter.patch
mm-avoid-false-sharing-of-mm_counter.patch
mm-avoid-false_sharing-of-mm_counter-checkpatch-fixes.patch
mm-count-swap-usage.patch
mm-count-swap-usage-checkpatch-fixes.patch
mm-introduce-dump_page-and-print-symbolic-flag-names.patch
sysctl-clean-up-vm-related-variable-declarations.patch
sysctl-clean-up-vm-related-variable-declarations-fix.patch

from http://userweb.kernel.org/~akpm/mmotm/broken-out.tar.gz first.
---

David Rientjes

unread,

Feb 15, 2010, 5:30:03 PM2/15/10

to

When a task is chosen for oom kill, the oom killer first attempts to
sacrifice a child not sharing its parent's memory instead.
Unfortunately, this often kills in a seemingly random fashion based on
the ordering of the selected task's child list. Additionally, it is not
guaranteed at all to free a large amount of memory that we need to
prevent additional oom killing in the very near future.

Instead, we now only attempt to sacrifice the worst child not sharing its
parent's memory, if one exists. The worst child is indicated with the
highest badness() score. This serves two advantages: we kill a
memory-hogging task more often, and we allow the configurable
/proc/pid/oom_adj value to be considered as a factor in which child to
kill.

Reviewers may observe that the previous implementation would iterate
through the children and attempt to kill each until one was successful
and then the parent if none were found while the new code simply kills
the most memory-hogging task or the parent. Note that the only time
oom_kill_task() fails, however, is when a child does not have an mm or
has a /proc/pid/oom_adj of OOM_DISABLE. badness() returns 0 for both
cases, so the final oom_kill_task() will always succeed.

Acked-by: Rik van Riel <ri...@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minch...@gmail.com>

Reviewed-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
Signed-off-by: David Rientjes <rien...@google.com>
---

mm/oom_kill.c | 23 +++++++++++++++++------
1 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c

David Rientjes

unread,

Feb 15, 2010, 5:30:03 PM2/15/10

to

Two VM sysctls, oom dump_tasks and oom_kill_allocating_task, were
implemented for very large systems to avoid excessively long tasklist
scans. The former suppresses helpful diagnostic messages that are
emitted for each thread group leader that are candidates for oom kill
including their pid, uid, vm size, rss, oom_adj value, and name; this
information is very helpful to users in understanding why a particular
task was chosen for kill over others. The latter simply kills current,
the task triggering the oom condition, instead of iterating through the
tasklist looking for the worst offender.

Both of these sysctls are combined into one for use on the aforementioned
large systems: oom_kill_quick. This disables the now-default
oom_dump_tasks and kills current whenever the oom killer is called.

The oom killer rewrite is the perfect opportunity to combine both sysctls
into one instead of carrying around the others for years to come for
nothing else than legacy purposes.

Reviewed-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>

Signed-off-by: David Rientjes <rien...@google.com>
---

@@ -63,8 +63,7 @@ static inline void oom_killer_enable(void)

}
/* for sysctl */
extern int sysctl_panic_on_oom;
-extern int sysctl_oom_kill_allocating_task;
-extern int sysctl_oom_dump_tasks;
+extern int sysctl_oom_kill_quick;
extern int sysctl_oom_forkbomb_thres;

#endif /* __KERNEL__*/
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c

@@ -941,16 +941,9 @@ static struct ctl_table vm_table[] = {

.proc_handler = proc_dointvec,
},
{
- .procname = "oom_kill_allocating_task",
- .data = &sysctl_oom_kill_allocating_task,
- .maxlen = sizeof(sysctl_oom_kill_allocating_task),
- .mode = 0644,
- .proc_handler = proc_dointvec,
- },
- {
- .procname = "oom_dump_tasks",
- .data = &sysctl_oom_dump_tasks,
- .maxlen = sizeof(sysctl_oom_dump_tasks),
+ .procname = "oom_kill_quick",
+ .data = &sysctl_oom_kill_quick,
+ .maxlen = sizeof(sysctl_oom_kill_quick),
.mode = 0644,
.proc_handler = proc_dointvec,
},

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c

@@ -32,9 +32,8 @@
#include <linux/security.h>

int sysctl_panic_on_oom;
-int sysctl_oom_kill_allocating_task;
-int sysctl_oom_dump_tasks;
int sysctl_oom_forkbomb_thres = DEFAULT_OOM_FORKBOMB_THRES;
+int sysctl_oom_kill_quick;
static DEFINE_SPINLOCK(zone_scan_lock);

/*

@@ -402,7 +401,7 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,

dump_stack();
mem_cgroup_print_oom_info(mem, p);
show_mem();
- if (sysctl_oom_dump_tasks)
+ if (!sysctl_oom_kill_quick)
dump_tasks(mem);
}

@@ -609,9 +608,9 @@ static void __out_of_memory(gfp_t gfp_mask, int order, unsigned long totalpages,

struct task_struct *p;
unsigned int points;

- if (sysctl_oom_kill_allocating_task)
+ if (sysctl_oom_kill_quick)
if (!oom_kill_process(current, gfp_mask, order, 0, totalpages,
- NULL, "Out of memory (oom_kill_allocating_task)"))
+ NULL, "Out of memory (quick mode)"))
return;
retry:
/*

David Rientjes

unread,

Feb 15, 2010, 5:30:02 PM2/15/10

to

important servers or system daemons. When a web server forks a very
large number of threads for client connections, for example, it is much
better to kill one of those threads than to kill the server and make it
unresponsive.

Root tasks are given 3% extra memory just like __vm_enough_memory()
provides in LSMs. In the event of two tasks consuming similar amounts of
memory, it is generally better to save root's task.

Because of the change in the badness() heuristic's baseline, it is also
necessary to introduce a new user interface to tune it. It's not
possible to redefine the meaning of /proc/pid/oom_adj with a new scale
since the ABI cannot be changed for backward compatability. Instead, a
new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to
+1000. It may be used to polarize the heuristic such that certain tasks
are never considered for oom kill while others may always be considered.
The value is added directory into the badness() score so a value of -500,

for example, means to discount 50% of its memory consumption in
comparison to other tasks either on the system, bound to the mempolicy,

in the cpuset, or sharing the same memory controller.

/proc/pid/oom_adj is changed so that its meaning is rescaled into the
units used by /proc/pid/oom_score_adj, and vice versa. Changing one of
these per-task tunables will rescale the value of the other to an
equivalent meaning. Although /proc/pid/oom_adj was originally defined as
a bitshift on the badness score, it now shares the same linear growth as
/proc/pid/oom_score_adj but with different granularity. This is required
so the ABI is not broken with userspace applications and allows oom_adj
to be deprecated for future removal.

Signed-off-by: David Rientjes <rien...@google.com>
---

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -33,7 +33,8 @@ Table of Contents
2 Modifying System Parameters

3 Per-Process Parameters
- 3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score
+ 3.1 /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj - Adjust the oom-killer
+ score
3.2 /proc/<pid>/oom_score - Display current oom-killer score
3.3 /proc/<pid>/io - Display the IO accounting fields
3.4 /proc/<pid>/coredump_filter - Core dump filtering settings
@@ -1193,42 +1194,64 @@ of the kernel.
CHAPTER 3: PER-PROCESS PARAMETERS
------------------------------------------------------------------------------

-3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score
-------------------------------------------------------
-
-This file can be used to adjust the score used to select which processes
-should be killed in an out-of-memory situation. Giving it a high score will
-increase the likelihood of this process being killed by the oom-killer. Valid
-values are in the range -16 to +15, plus the special value -17, which disables
-oom-killing altogether for this process.
-
-The process to be killed in an out-of-memory situation is selected among all others
-based on its badness score. This value equals the original memory size of the process
-and is then updated according to its CPU time (utime + stime) and the
-run time (uptime - start time). The longer it runs the smaller is the score.
-Badness score is divided by the square root of the CPU time and then by
-the double square root of the run time.
-
-Swapped out tasks are killed first. Half of each child's memory size is added to
-the parent's score if they do not share the same memory. Thus forking servers
-are the prime candidates to be killed. Having only one 'hungry' child will make
-parent less preferable than the child.
-
-/proc/<pid>/oom_score shows process' current badness score.
-
-The following heuristics are then applied:
- * if the task was reniced, its score doubles
- * superuser or direct hardware access tasks (CAP_SYS_ADMIN, CAP_SYS_RESOURCE
- or CAP_SYS_RAWIO) have their score divided by 4
- * if oom condition happened in one cpuset and checked process does not belong
- to it, its score is divided by 8
- * the resulting score is multiplied by two to the power of oom_adj, i.e.
- points <<= oom_adj when it is positive and
- points >>= -(oom_adj) otherwise
-
-The task with the highest badness score is then selected and its children
-are killed, process itself will be killed in an OOM situation when it does
-not have children or some of them disabled oom like described above.
+3.1 /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj- Adjust the oom-killer score
+--------------------------------------------------------------------------------
+
+These file can be used to adjust the badness heuristic used to select which
+process gets killed in out of memory conditions.
+
+The badness heuristic assigns a value to each candidate task ranging from 0
+(never kill) to 1000 (always kill) to determine which process is targeted. The
+units are roughly a proportion along that range of allowed memory the process
+may allocate from based on an estimation of its current memory and swap use.
+For example, if a task is using all allowed memory, its badness score will be
+1000. If it is using half of its allowed memory, its score will be 500.
+
+There are a couple of additional factors included in the badness score: root
+processes are given 3% extra memory over other tasks, and tasks which forkbomb
+an excessive number of child processes are penalized by their average size.
+The number of child processes considered to be a forkbomb is configurable
+via /proc/sys/vm/oom_forkbomb_thres (see Documentation/sysctl/vm.txt).
+
+The amount of "allowed" memory depends on the context in which the oom killer
+was called. If it is due to the memory assigned to the allocating task's cpuset
+being exhausted, the allowed memory represents the set of mems assigned to that
+cpuset. If it is due to a mempolicy's node(s) being exhausted, the allowed
+memory represents the set of mempolicy nodes. If it is due to a memory
+limit (or swap limit) being reached, the allowed memory is that configured
+limit. Finally, if it is due to the entire system being out of memory, the
+allowed memory represents all allocatable resources.
+
+The value of /proc/<pid>/oom_score_adj is added to the badness score before it
+is used to determine which task to kill. Acceptable values range from -1000
+(OOM_SCORE_ADJ_MIN) to +1000 (OOM_SCORE_ADJ_MAX). This allows userspace to
+polarize the preference for oom killing either by always preferring a certain
+task or completely disabling it. The lowest possible value, -1000, is
+equivalent to disabling oom killing entirely for that task since it will always
+report a badness score of 0.
+
+Consequently, it is very simple for userspace to define the amount of memory to
+consider for each task. Setting a /proc/<pid>/oom_score_adj value of +500, for
+example, is roughly equivalent to allowing the remainder of tasks sharing the
+same system, cpuset, mempolicy, or memory controller resources to use at least
+50% more memory. A value of -500, on the other hand, would be roughly
+equivalent to discounting 50% of the task's allowed memory from being considered
+as scoring against the task.
+
+For backwards compatibility with previous kernels, /proc/<pid>/oom_adj may also
+be used to tune the badness score. Its acceptable values range from -16
+(OOM_ADJUST_MIN) to +15 (OOM_ADJUST_MAX) and a special value of -17
+(OOM_DISABLE) to disable oom killing entirely for that task. Its value is
+scaled linearly with /proc/<pid>/oom_score_adj.
+
+Writing to /proc/<pid>/oom_score_adj or /proc/<pid>/oom_adj will change the
+other with its scaled value.
+
+Caveat: when a parent task is selected, the oom killer will sacrifice any first
+generation children with seperate address spaces instead, if possible. This
+avoids servers and important system daemons from being killed and loses the
+minimal amount of work.
+

3.2 /proc/<pid>/oom_score - Display current oom-killer score
-------------------------------------------------------------

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt

@@ -44,6 +44,7 @@ Currently, these files are in /proc/sys/vm:

- nr_trim_pages (only if CONFIG_MMU=n)
- numa_zonelist_order

- oom_dump_tasks
+- oom_forkbomb_thres
- oom_kill_allocating_task
- overcommit_memory
- overcommit_ratio
@@ -490,6 +491,26 @@ The default value is 0.

==============================================================

+oom_forkbomb_thres
+
+This value defines how many children with a seperate address space a specific
+task may have before being considered as a possible forkbomb. Tasks with more
+children not sharing the same address space as the parent will be penalized by a
+quantity of memory equaling
+
+ (average rss of execve children) * (# of 1st generation execve children)
+ ------------------------------------------------------------------------
+ oom_forkbomb_thres
+
+in the oom killer's badness heuristic. Such tasks may be protected with a lower
+oom_adj value (see Documentation/filesystems/proc.txt) if necessary.
+
+A value of 0 will disable forkbomb detection.
+
+The default value is 1000.
+
+==============================================================
+
oom_kill_allocating_task

This enables or disables killing the OOM-triggering task in

diff --git a/fs/proc/base.c b/fs/proc/base.c
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -81,6 +81,7 @@
#include <linux/elf.h>
#include <linux/pid_namespace.h>
#include <linux/fs_struct.h>
+#include <linux/swap.h>
#include "internal.h"

/* NOTE:
@@ -458,7 +459,6 @@ static const struct file_operations proc_lstats_operations = {
#endif

/* The badness from the OOM killer */
-unsigned long badness(struct task_struct *p, unsigned long uptime);
static int proc_oom_score(struct task_struct *task, char *buffer)
{
unsigned long points;
@@ -466,7 +466,13 @@ static int proc_oom_score(struct task_struct *task, char *buffer)

do_posix_clock_monotonic_gettime(&uptime);
read_lock(&tasklist_lock);
- points = badness(task->group_leader, uptime.tv_sec);
+ points = oom_badness(task->group_leader,
+ global_page_state(NR_INACTIVE_ANON) +
+ global_page_state(NR_ACTIVE_ANON) +
+ global_page_state(NR_INACTIVE_FILE) +
+ global_page_state(NR_ACTIVE_FILE) +
+ total_swap_pages,
+ uptime.tv_sec);
read_unlock(&tasklist_lock);
return sprintf(buffer, "%lu\n", points);
}
@@ -1152,7 +1158,15 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
}

task->signal->oom_adj = oom_adjust;
-
+ /*
+ * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
+ * value is always attainable.
+ */
+ if (task->signal->oom_adj == OOM_ADJUST_MAX)
+ task->signal->oom_score_adj = OOM_SCORE_ADJ_MAX;
+ else
+ task->signal->oom_score_adj = (oom_adjust * OOM_SCORE_ADJ_MAX) /
+ -OOM_DISABLE;
unlock_task_sighand(task, &flags);
put_task_struct(task);

@@ -1164,6 +1178,82 @@ static const struct file_operations proc_oom_adjust_operations = {
.write = oom_adjust_write,
};

+static ssize_t oom_score_adj_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
+ char buffer[PROC_NUMBUF];
+ int oom_score_adj = OOM_SCORE_ADJ_MIN;
+ unsigned long flags;
+ size_t len;
+
+ if (!task)
+ return -ESRCH;
+ if (lock_task_sighand(task, &flags)) {
+ oom_score_adj = task->signal->oom_score_adj;
+ unlock_task_sighand(task, &flags);
+ }
+ put_task_struct(task);
+ len = snprintf(buffer, sizeof(buffer), "%d\n", oom_score_adj);
+ return simple_read_from_buffer(buf, count, ppos, buffer, len);
+}
+
+static ssize_t oom_score_adj_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct task_struct *task;
+ char buffer[PROC_NUMBUF];
+ unsigned long flags;
+ long oom_score_adj;
+ int err;
+
+ memset(buffer, 0, sizeof(buffer));
+ if (count > sizeof(buffer) - 1)
+ count = sizeof(buffer) - 1;
+ if (copy_from_user(buffer, buf, count))
+ return -EFAULT;
+
+ err = strict_strtol(strstrip(buffer), 0, &oom_score_adj);
+ if (err)
+ return -EINVAL;
+ if (oom_score_adj < OOM_SCORE_ADJ_MIN ||
+ oom_score_adj > OOM_SCORE_ADJ_MAX)
+ return -EINVAL;
+
+ task = get_proc_task(file->f_path.dentry->d_inode);
+ if (!task)
+ return -ESRCH;
+ if (!lock_task_sighand(task, &flags)) {
+ put_task_struct(task);
+ return -ESRCH;
+ }
+ if (oom_score_adj < task->signal->oom_score_adj &&
+ !capable(CAP_SYS_RESOURCE)) {
+ unlock_task_sighand(task, &flags);
+ put_task_struct(task);
+ return -EACCES;
+ }
+
+ task->signal->oom_score_adj = oom_score_adj;
+ /*
+ * Scale /proc/pid/oom_adj appropriately ensuring that OOM_DISABLE is
+ * always attainable.
+ */
+ if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+ task->signal->oom_adj = OOM_DISABLE;
+ else
+ task->signal->oom_adj = (oom_score_adj * OOM_ADJUST_MAX) /
+ OOM_SCORE_ADJ_MAX;
+ unlock_task_sighand(task, &flags);
+ put_task_struct(task);
+ return count;
+}
+
+static const struct file_operations proc_oom_score_adj_operations = {
+ .read = oom_score_adj_read,
+ .write = oom_score_adj_write,
+};
+
#ifdef CONFIG_AUDITSYSCALL
#define TMPBUFLEN 21
static ssize_t proc_loginuid_read(struct file * file, char __user * buf,
@@ -2727,6 +2817,7 @@ static const struct pid_entry tgid_base_stuff[] = {
#endif
INF("oom_score", S_IRUGO, proc_oom_score),
REG("oom_adj", S_IRUGO|S_IWUSR, proc_oom_adjust_operations),
+ REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations),
#ifdef CONFIG_AUDITSYSCALL
REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations),
REG("sessionid", S_IRUGO, proc_sessionid_operations),
@@ -3061,6 +3152,7 @@ static const struct pid_entry tid_base_stuff[] = {
#endif
INF("oom_score", S_IRUGO, proc_oom_score),
REG("oom_adj", S_IRUGO|S_IWUSR, proc_oom_adjust_operations),
+ REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations),
#ifdef CONFIG_AUDITSYSCALL
REG("loginuid", S_IWUSR|S_IRUGO, proc_loginuid_operations),
REG("sessionid", S_IRUSR, proc_sessionid_operations),

diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h

@@ -1,14 +1,27 @@
#ifndef __INCLUDE_LINUX_OOM_H
#define __INCLUDE_LINUX_OOM_H

-/* /proc/<pid>/oom_adj set to -17 protects from the oom-killer */
+/*
+ * /proc/<pid>/oom_adj set to -17 protects from the oom-killer
+ */
#define OOM_DISABLE (-17)
/* inclusive */
#define OOM_ADJUST_MIN (-16)
#define OOM_ADJUST_MAX 15

+/*
+ * /proc/<pid>/oom_score_adj set to OOM_SCORE_ADJ_MIN disables oom killing for
+ * pid.
+ */
+#define OOM_SCORE_ADJ_MIN (-1000)
+#define OOM_SCORE_ADJ_MAX 1000
+
+/* See Documentation/sysctl/vm.txt */
+#define DEFAULT_OOM_FORKBOMB_THRES 1000
+
#ifdef __KERNEL__

+#include <linux/sched.h>
#include <linux/types.h>
#include <linux/nodemask.h>

@@ -24,6 +37,8 @@ enum oom_constraint {
CONSTRAINT_MEMORY_POLICY,
};

+extern unsigned int oom_badness(struct task_struct *p,
+ unsigned long totalpages, unsigned long uptime);
extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags);
extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);

@@ -47,6 +62,7 @@ static inline void oom_killer_enable(void)
extern int sysctl_panic_on_oom;
extern int sysctl_oom_kill_allocating_task;
extern int sysctl_oom_dump_tasks;
+extern int sysctl_oom_forkbomb_thres;

#endif /* __KERNEL__*/
#endif /* _INCLUDE_LINUX_OOM_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -624,7 +624,8 @@ struct signal_struct {
struct tty_audit_buf *tty_audit_buf;
#endif

- int oom_adj; /* OOM kill score adjustment (bit shift) */
+ int oom_adj; /* OOM kill score adjustment (bit shift) */
+ int oom_score_adj; /* OOM kill score adjustment */
};

/* Context switch must be unlocked if interrupts are to be enabled */
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -878,6 +878,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
tty_audit_fork(sig);

sig->oom_adj = current->signal->oom_adj;
+ sig->oom_score_adj = current->signal->oom_score_adj;

return 0;

}
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c

@@ -955,6 +955,14 @@ static struct ctl_table vm_table[] = {
.proc_handler = proc_dointvec,
},
{
+ .procname = "oom_forkbomb_thres",
+ .data = &sysctl_oom_forkbomb_thres,
+ .maxlen = sizeof(sysctl_oom_forkbomb_thres),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &zero,
+ },
+ {
.procname = "overcommit_ratio",
.data = &sysctl_overcommit_ratio,
.maxlen = sizeof(sysctl_overcommit_ratio),

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c

@@ -4,6 +4,8 @@
* Copyright (C) 1998,2000 Rik van Riel
* Thanks go out to Claus Fischer for some serious inspiration and
* for goading me into coding this file...
+ * Copyright (C) 2010 Google, Inc
+ * Rewritten by David Rientjes
*
* The routines in this file are used to kill a process when
* we're seriously out of memory. This gets called from __alloc_pages()
@@ -32,8 +34,8 @@
int sysctl_panic_on_oom;
int sysctl_oom_kill_allocating_task;
int sysctl_oom_dump_tasks;
+int sysctl_oom_forkbomb_thres = DEFAULT_OOM_FORKBOMB_THRES;
static DEFINE_SPINLOCK(zone_scan_lock);
-/* #define DEBUG */

/*

* Do all threads of the target process overlap our allowed nodes?

@@ -68,138 +70,129 @@ static bool has_intersects_mems_allowed(struct task_struct *tsk,
return false;
}

-/**
- * badness - calculate a numeric value for how bad this task has been
- * @p: task struct of which task we should calculate
- * @uptime: current uptime in seconds

+/*
+ * Tasks that fork a very large number of children with seperate address spaces

+ * may be the result of a bug, user error, malicious applications, or even those
+ * with a very legitimate purpose. The oom killer assesses a penalty equaling
*
- * The formula used is relatively simple and documented inline in the
- * function. The main rationale is that we want to select a good task
- * to kill when we run out of memory.
+ * (average rss of children) * (# of 1st generation execve children)
+ * -----------------------------------------------------------------
+ * sysctl_oom_forkbomb_thres
*
- * Good in this context means that:
- * 1) we lose the minimum amount of work done
- * 2) we recover a large amount of memory
- * 3) we don't kill anything innocent of eating tons of memory
- * 4) we want to kill the minimum amount of processes (one)
- * 5) we try to kill the process the user expects us to kill, this
- * algorithm has been meticulously tuned to meet the principle
- * of least surprise ... (be careful when you change it)
+ * for such tasks to target the parent. oom_kill_process() will attempt to
+ * first kill a child, so there's no risk of killing an important system daemon
+ * via this method. A web server, for example, may fork a very large number of
+ * threads to response to client connections; it's much better to kill a child
+ * than to kill the parent, making the server unresponsive. The goal here is
+ * to give the user a chance to recover from the error rather than deplete all
+ * memory such that the system is unusable, it's not meant to effect a forkbomb
+ * policy.
*/
-
-unsigned long badness(struct task_struct *p, unsigned long uptime)
+static unsigned long oom_forkbomb_penalty(struct task_struct *tsk)
{
- unsigned long points, cpu_time, run_time;
- struct mm_struct *mm;
struct task_struct *child;
- int oom_adj = p->signal->oom_adj;
- struct task_cputime task_time;
- unsigned long utime;
- unsigned long stime;
+ unsigned long child_rss = 0;
+ int forkcount = 0;

- if (oom_adj == OOM_DISABLE)
+ if (!sysctl_oom_forkbomb_thres)
return 0;
+ list_for_each_entry(child, &tsk->children, sibling) {
+ struct task_cputime task_time;
+ unsigned long runtime;

- task_lock(p);
- mm = p->mm;
- if (!mm) {
- task_unlock(p);
- return 0;
+ task_lock(child);
+ if (!child->mm || child->mm == tsk->mm) {
+ task_unlock(child);
+ continue;
+ }
+ thread_group_cputime(child, &task_time);
+ runtime = cputime_to_jiffies(task_time.utime) +
+ cputime_to_jiffies(task_time.stime);
+ /*
+ * Only threads that have run for less than a second are
+ * considered toward the forkbomb penalty, these threads rarely
+ * get to execute at all in such cases anyway.
+ */
+ if (runtime < HZ) {
+ child_rss += get_mm_rss(child->mm);
+ forkcount++;
+ }
+ task_unlock(child);
}

/*
- * The memory size of the process is the basis for the badness.
+ * Forkbombs get penalized by:
+ *
+ * (average rss of children) * (# of first-generation execve children) /
+ * sysctl_oom_forkbomb_thres
*/
- points = mm->total_vm;
+ return forkcount > sysctl_oom_forkbomb_thres ?
+ (child_rss / sysctl_oom_forkbomb_thres) : 0;
+}
+
+/**
+ * oom_badness - heuristic function to determine which candidate task to kill
+ * @p: task struct of which task we should calculate
+ * @totalpages: total present RAM allowed for page allocation
+ * @uptime: current uptime in seconds
+ *
+ * The heuristic for determining which task to kill is made to be as simple and
+ * predictable as possible. The goal is to return the highest value for the
+ * task consuming the most memory to avoid subsequent oom conditions.
+ */
+unsigned int oom_badness(struct task_struct *p, unsigned long totalpages,
+ unsigned long uptime)
+{
+ struct mm_struct *mm;
+ int points;

/*
- * After this unlock we can no longer dereference local variable `mm'
+ * Shortcut check for OOM_SCORE_ADJ_MIN so the entire heuristic doesn't
+ * need to be executed for something that can't be killed.
*/
- task_unlock(p);
+ if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+ return 0;

/*
- * swapoff can easily use up all memory, so kill those first.
+ * When the PF_OOM_ORIGIN bit is set, it indicates the task should have
+ * priority for oom killing.
*/
if (p->flags & PF_OOM_ORIGIN)
- return ULONG_MAX;
+ return 1000;

- /*
- * Processes which fork a lot of child processes are likely
- * a good choice. We add half the vmsize of the children if they
- * have an own mm. This prevents forking servers to flood the
- * machine with an endless amount of children. In case a single
- * child is eating the vast majority of memory, adding only half
- * to the parents will make the child our kill candidate of choice.
- */
- list_for_each_entry(child, &p->children, sibling) {
- task_lock(child);
- if (child->mm != mm && child->mm)
- points += child->mm->total_vm/2 + 1;
- task_unlock(child);
+ task_lock(p);
+ mm = p->mm;
+ if (!mm) {
+ task_unlock(p);
+ return 0;
}

/*
- * CPU time is in tens of seconds and run time is in thousands
- * of seconds. There is no particular reason for this other than
- * that it turned out to work very well in practice.
- */
- thread_group_cputime(p, &task_time);
- utime = cputime_to_jiffies(task_time.utime);
- stime = cputime_to_jiffies(task_time.stime);
- cpu_time = (utime + stime) >> (SHIFT_HZ + 3);
-
-
- if (uptime >= p->start_time.tv_sec)
- run_time = (uptime - p->start_time.tv_sec) >> 10;
- else
- run_time = 0;
-
- if (cpu_time)
- points /= int_sqrt(cpu_time);
- if (run_time)
- points /= int_sqrt(int_sqrt(run_time));
-
- /*
- * Niced processes are most likely less important, so double
- * their badness points.
+ * The baseline for the badness score is the proportion of RAM that each
+ * task's rss and swap space use.
*/
- if (task_nice(p) > 0)
- points *= 2;
-
- /*
- * Superuser processes are usually more important, so we make it
- * less likely that we kill those.
- */
- if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
- has_capability_noaudit(p, CAP_SYS_RESOURCE))
- points /= 4;
+ points = (get_mm_rss(mm) + get_mm_counter(mm, MM_SWAPENTS)) * 1000 /
+ totalpages;
+ task_unlock(p);
+ points += oom_forkbomb_penalty(p);

/*
- * We don't want to kill a process with direct hardware access.
- * Not only could that mess up the hardware, but usually users
- * tend to only have this flag set on applications they think
- * of as important.
+ * Root processes get 3% bonus, just like the __vm_enough_memory() used
+ * by LSMs.
*/
- if (has_capability_noaudit(p, CAP_SYS_RAWIO))
- points /= 4;
+ if (has_capability_noaudit(p, CAP_SYS_ADMIN))
+ points -= 30;

/*
- * Adjust the score by oom_adj.
+ * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that the
+ * range may either completely disable oom killing or always prefer a
+ * certain task.
*/
- if (oom_adj) {
- if (oom_adj > 0) {
- if (!points)
- points = 1;
- points <<= oom_adj;
- } else
- points >>= -(oom_adj);
- }
+ points += p->signal->oom_score_adj;

-#ifdef DEBUG
- printk(KERN_DEBUG "OOMkill: task %d (%s) got %lu points\n",
- p->pid, p->comm, points);
-#endif
- return points;
+ if (points < 0)
+ return 0;
+ return (points <= 1000) ? points : 1000;
}

/*
@@ -207,11 +200,21 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
*/
#ifdef CONFIG_NUMA
static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
- gfp_t gfp_mask, nodemask_t *nodemask)
+ gfp_t gfp_mask, nodemask_t *nodemask,
+ unsigned long *totalpages)
{
struct zone *zone;
struct zoneref *z;
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
+ bool cpuset_limited = false;
+ int nid;
+
+ /* Default to all anonymous memory, page cache, and swap */
+ *totalpages = global_page_state(NR_INACTIVE_ANON) +
+ global_page_state(NR_ACTIVE_ANON) +
+ global_page_state(NR_INACTIVE_FILE) +
+ global_page_state(NR_ACTIVE_FILE) +
+ total_swap_pages;

/*
* Reach here only when __GFP_NOFAIL is used. So, we should avoid
@@ -222,25 +225,41 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
return CONSTRAINT_NONE;

/*
- * The nodemask here is a nodemask passed to alloc_pages(). Now,
- * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
- * feature. mempolicy is an only user of nodemask here.
- * check mempolicy's nodemask contains all N_HIGH_MEMORY
+ * This is not a __GFP_THISNODE allocation, so a truncated nodemask in
+ * the page allocator means a mempolicy is in effect. Cpuset policy
+ * is enforced in get_page_from_freelist().
*/
- if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
+ if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask)) {
+ *totalpages = total_swap_pages;
+ for_each_node_mask(nid, *nodemask)
+ *totalpages += node_page_state(nid, NR_INACTIVE_ANON) +
+ node_page_state(nid, NR_ACTIVE_ANON) +
+ node_page_state(nid, NR_INACTIVE_FILE) +
+ node_page_state(nid, NR_ACTIVE_FILE);
return CONSTRAINT_MEMORY_POLICY;
+ }

/* Check this allocation failure is caused by cpuset's wall function */
for_each_zone_zonelist_nodemask(zone, z, zonelist,
high_zoneidx, nodemask)
if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
- return CONSTRAINT_CPUSET;
-
+ cpuset_limited = true;
+
+ if (cpuset_limited) {
+ *totalpages = total_swap_pages;
+ for_each_node_mask(nid, cpuset_current_mems_allowed)
+ *totalpages += node_page_state(nid, NR_INACTIVE_ANON) +
+ node_page_state(nid, NR_ACTIVE_ANON) +
+ node_page_state(nid, NR_INACTIVE_FILE) +
+ node_page_state(nid, NR_ACTIVE_FILE);
+ return CONSTRAINT_CPUSET;
+ }
return CONSTRAINT_NONE;
}
#else
static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
- gfp_t gfp_mask, nodemask_t *nodemask)
+ gfp_t gfp_mask, nodemask_t *nodemask,
+ unsigned long *totalpages)
{
return CONSTRAINT_NONE;
}
@@ -252,9 +271,9 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
*
* (not docbooked, we don't want this one cluttering up the manual)
*/
-static struct task_struct *select_bad_process(unsigned long *ppoints,
- struct mem_cgroup *mem, enum oom_constraint constraint,
- const nodemask_t *mask)
+static struct task_struct *select_bad_process(unsigned int *ppoints,
+ unsigned long totalpages, struct mem_cgroup *mem,
+ enum oom_constraint constraint, const nodemask_t *mask)
{
struct task_struct *p;
struct task_struct *chosen = NULL;
@@ -263,7 +282,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,

do_posix_clock_monotonic_gettime(&uptime);
for_each_process(p) {
- unsigned long points;
+ unsigned int points;

/*
* skip kernel threads and tasks which have already released
@@ -308,13 +327,13 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
return ERR_PTR(-1UL);

chosen = p;
- *ppoints = ULONG_MAX;
+ *ppoints = 1000;
}

- if (p->signal->oom_adj == OOM_DISABLE)
+ if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
continue;

- points = badness(p, uptime.tv_sec);
+ points = oom_badness(p, totalpages, uptime.tv_sec);
if (points > *ppoints || !chosen) {
chosen = p;
*ppoints = points;
@@ -330,7 +349,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
*
* Dumps the current memory state of all system tasks, excluding kernel threads.
* State information includes task's pid, uid, tgid, vm size, rss, cpu, oom_adj
- * score, and name.
+ * value, oom_score_adj value, and name.
*
* If the actual is non-NULL, only tasks that are a member of the mem_cgroup are
* shown.
@@ -342,7 +361,7 @@ static void dump_tasks(const struct mem_cgroup *mem)
struct task_struct *g, *p;

printk(KERN_INFO "[ pid ] uid tgid total_vm rss cpu oom_adj "
- "name\n");
+ "oom_score_adj name\n");
do_each_thread(g, p) {
struct mm_struct *mm;

@@ -362,10 +381,10 @@ static void dump_tasks(const struct mem_cgroup *mem)
task_unlock(p);
continue;
}
- printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d %3d %s\n",
+ pr_info("[%5d] %5d %5d %8lu %8lu %3d %3d %4d %s\n",
p->pid, __task_cred(p)->uid, p->tgid, mm->total_vm,
get_mm_rss(mm), (int)task_cpu(p), p->signal->oom_adj,
- p->comm);
+ p->signal->oom_score_adj, p->comm);
task_unlock(p);
} while_each_thread(g, p);
}
@@ -374,8 +393,9 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
struct mem_cgroup *mem)
{
pr_warning("%s invoked oom-killer: gfp_mask=0x%x, order=%d, "
- "oom_adj=%d\n",
- current->comm, gfp_mask, order, current->signal->oom_adj);
+ "oom_adj=%d, oom_score_adj=%d\n",
+ current->comm, gfp_mask, order, current->signal->oom_adj,
+ current->signal->oom_score_adj);
task_lock(current);
cpuset_print_task_mems_allowed(current);
task_unlock(current);
@@ -440,7 +460,7 @@ static int oom_kill_task(struct task_struct *p)

* change to NULL at any time since we do not hold task_lock(p).

* However, this is of no concern to us.

*/
- if (!p->mm || p->signal->oom_adj == OOM_DISABLE)

+ if (!p->mm || p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
return 1;

__oom_kill_task(p, 1);
@@ -449,12 +469,12 @@ static int oom_kill_task(struct task_struct *p)
}

static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
- unsigned long points, struct mem_cgroup *mem,
- const char *message)
+ unsigned int points, unsigned long totalpages,
+ struct mem_cgroup *mem, const char *message)
{

struct task_struct *victim = p;
struct task_struct *c;

- unsigned long victim_points = 0;
+ unsigned int victim_points = 0;

struct timespec uptime;

if (printk_ratelimit())

@@ -469,19 +489,19 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
return 0;
}

- pr_err("%s: Kill process %d (%s) with score %lu or sacrifice child\n",
+ pr_err("%s: Kill process %d (%s) with score %d or sacrifice child\n",

message, task_pid_nr(p), p->comm, points);

/* Try to sacrifice the worst child first */

do_posix_clock_monotonic_gettime(&uptime);
list_for_each_entry(c, &p->children, sibling) {

- unsigned long cpoints;
+ unsigned int cpoints;

if (c->mm == p->mm)
continue;

- /* badness() returns 0 if the thread is unkillable */
- cpoints = badness(c, uptime.tv_sec);
+ /* oom_badness() returns 0 if the thread is unkillable */
+ cpoints = oom_badness(c, totalpages, uptime.tv_sec);
if (cpoints > victim_points) {
victim = c;
victim_points = cpoints;
@@ -493,19 +513,22 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
{
- unsigned long points = 0;
+ unsigned int points = 0;
struct task_struct *p;
+ unsigned long limit;

+ limit = (res_counter_read_u64(&mem->res, RES_LIMIT) >> PAGE_SHIFT) +
+ (res_counter_read_u64(&mem->memsw, RES_LIMIT) >> PAGE_SHIFT);
read_lock(&tasklist_lock);
retry:
- p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL);
+ p = select_bad_process(&points, limit, mem, CONSTRAINT_NONE, NULL);
if (PTR_ERR(p) == -1UL)
goto out;

if (!p)
p = current;

- if (oom_kill_process(p, gfp_mask, 0, points, mem,
+ if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
"Memory cgroup out of memory"))
goto retry;
out:
@@ -580,22 +603,22 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
/*
* Must be called with tasklist_lock held for read.
*/
-static void __out_of_memory(gfp_t gfp_mask, int order,
+static void __out_of_memory(gfp_t gfp_mask, int order, unsigned long totalpages,
enum oom_constraint constraint, const nodemask_t *mask)
{
struct task_struct *p;
- unsigned long points;
+ unsigned int points;

if (sysctl_oom_kill_allocating_task)
- if (!oom_kill_process(current, gfp_mask, order, 0, NULL,
- "Out of memory (oom_kill_allocating_task)"))
+ if (!oom_kill_process(current, gfp_mask, order, 0, totalpages,
+ NULL, "Out of memory (oom_kill_allocating_task)"))
return;
retry:
/*
* Rambo mode: Shoot down a process and hope it solves whatever
* issues we may have.
*/
- p = select_bad_process(&points, NULL, constraint, mask);
+ p = select_bad_process(&points, totalpages, NULL, constraint, mask);

if (PTR_ERR(p) == -1UL)
return;
@@ -607,7 +630,7 @@ retry:
panic("Out of memory and no killable processes...\n");
}

- if (oom_kill_process(p, gfp_mask, order, points, NULL,
+ if (oom_kill_process(p, gfp_mask, order, points, totalpages, NULL,
"Out of memory"))
goto retry;
}
@@ -618,6 +641,7 @@ retry:
*/
void pagefault_out_of_memory(void)
{
+ unsigned long totalpages;
unsigned long freed = 0;

blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
@@ -635,9 +659,14 @@ void pagefault_out_of_memory(void)
if (sysctl_panic_on_oom)
panic("out of memory from page fault. panic_on_oom is selected.\n");

+ totalpages = global_page_state(NR_INACTIVE_ANON) +
+ global_page_state(NR_ACTIVE_ANON) +
+ global_page_state(NR_INACTIVE_FILE) +
+ global_page_state(NR_ACTIVE_FILE) +
+ total_swap_pages;
read_lock(&tasklist_lock);
/* unknown gfp_mask and order */
- __out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
+ __out_of_memory(0, 0, totalpages, CONSTRAINT_NONE, NULL);
read_unlock(&tasklist_lock);

/*
@@ -665,6 +694,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
int order, nodemask_t *nodemask)
{
unsigned long freed = 0;
+ unsigned long totalpages = 0;
enum oom_constraint constraint;

blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
@@ -676,7 +706,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
* Check if there were limitations on the allocation (only relevant for
* NUMA) that may require different handling.
*/
- constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
+ constraint = constrained_alloc(zonelist, gfp_mask, nodemask,
+ &totalpages);
read_lock(&tasklist_lock);
if (unlikely(sysctl_panic_on_oom)) {
/*
@@ -689,7 +720,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
panic("Out of memory: panic_on_oom is enabled\n");
}
}
- __out_of_memory(gfp_mask, order, constraint, nodemask);
+ __out_of_memory(gfp_mask, order, totalpages, constraint, nodemask);
read_unlock(&tasklist_lock);

/*

David Rientjes

unread,

Feb 15, 2010, 5:30:02 PM2/15/10

to

/proc/pid/oom_adj is now deprecated so that that it may eventually be
removed. The target date for removal is December 2011.

A warning will be printed to the kernel log if a task attempts to use
this interface. Future warning will be suppressed until the kernel is
rebooted to prevent spamming the kernel log.

Signed-off-by: David Rientjes <rien...@google.com>
---

Documentation/feature-removal-schedule.txt | 30 ++++++++++++++++++++++++++++
Documentation/filesystems/proc.txt | 3 ++
fs/proc/base.c | 8 +++++++
include/linux/oom.h | 3 ++
4 files changed, 44 insertions(+), 0 deletions(-)

diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -168,6 +168,36 @@ Who: Eric Biederman <ebie...@xmission.com>

---------------------------

+What: /proc/<pid>/oom_adj
+When: December 2011
+Why: /proc/<pid>/oom_adj allows userspace to influence the oom killer's
+ badness heuristic used to determine which task to kill when the kernel
+ is out of memory.
+
+ The badness heuristic has since been rewritten since the introduction of
+ this tunable such that its meaning is deprecated. The value was
+ implemented as a bitshift on a score generated by the badness()
+ function that did not have any precise units of measure. With the
+ rewrite, the score is given as a proportion of available memory to the
+ task allocating pages, so using a bitshift which grows the score
+ exponentially is, thus, impossible to tune with fine granularity.
+
+ A much more powerful interface, /proc/<pid>/oom_score_adj, was
+ introduced with the oom killer rewrite that allows users to increase or
+ decrease the badness() score linearly. This interface will replace
+ /proc/<pid>/oom_adj.
+
+ See Documentation/filesystems/proc.txt for information on how to use the
+ new tunable.
+
+ A warning will be emitted to the kernel log if an application uses this
+ deprecated interface. After it is printed once, future warning will be
+ suppressed until the kernel is rebooted.
+
+Who: David Rientjes <rien...@google.com>
+
+---------------------------
+
What: remove EXPORT_SYMBOL(kernel_thread)
When: August 2006
Files: arch/*/kernel/*_ksyms.c

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt

@@ -1247,6 +1247,9 @@ scaled linearly with /proc/<pid>/oom_score_adj.

Writing to /proc/<pid>/oom_score_adj or /proc/<pid>/oom_adj will change the

other with its scaled value.

+NOTICE: /proc/<pid>/oom_adj is deprecated and will be removed, please see
+Documentation/feature-removal-schedule.txt.
+

Caveat: when a parent task is selected, the oom killer will sacrifice any first

generation children with seperate address spaces instead, if possible. This

avoids servers and important system daemons from being killed and loses the

diff --git a/fs/proc/base.c b/fs/proc/base.c
--- a/fs/proc/base.c
+++ b/fs/proc/base.c

@@ -1157,6 +1157,14 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
return -EACCES;
}

+ /*
+ * Warn that /proc/pid/oom_adj is deprecated, see
+ * Documentation/feature-removal-schedule.txt.
+ */
+ printk_once(KERN_WARNING "%s (%d): /proc/%d/oom_adj is deprecated, "
+ "please use /proc/%d/oom_score_adj instead.\n",
+ current->comm, task_pid_nr(current),
+ task_pid_nr(task), task_pid_nr(task));

task->signal->oom_adj = oom_adjust;

/*

* Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum

diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h

@@ -2,6 +2,9 @@
#define __INCLUDE_LINUX_OOM_H

/*
+ * /proc/<pid>/oom_adj is deprecated, see
+ * Documentation/feature-removal-schedule.txt.
+ *

* /proc/<pid>/oom_adj set to -17 protects from the oom-killer

*/
#define OOM_DISABLE (-17)

David Rientjes

unread,

Feb 15, 2010, 5:30:02 PM2/15/10

to

The oom killer presently kills current whenever there is no more memory
free or reclaimable on its mempolicy's nodes. There is no guarantee that
current is a memory-hogging task or that killing it will free any
substantial amount of memory, however.

In such situations, it is better to scan the tasklist for nodes that are
allowed to allocate on current's set of nodes and kill the task with the
highest badness() score. This ensures that the most memory-hogging task,
or the one configured by the user with /proc/pid/oom_adj, is always
selected in such scenarios.

Signed-off-by: David Rientjes <rien...@google.com>
---

include/linux/mempolicy.h | 13 +++++++-
mm/mempolicy.c | 39 +++++++++++++++++++++++
mm/oom_kill.c | 77 +++++++++++++++++++++++++++-----------------
3 files changed, 98 insertions(+), 31 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -202,6 +202,8 @@ extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
unsigned long addr, gfp_t gfp_flags,
struct mempolicy **mpol, nodemask_t **nodemask);
extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
+extern bool mempolicy_nodemask_intersects(struct task_struct *tsk,
+ const nodemask_t *mask);
extern unsigned slab_node(struct mempolicy *policy);

extern enum zone_type policy_zone;
@@ -329,7 +331,16 @@ static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
return node_zonelist(0, gfp_flags);
}

-static inline bool init_nodemask_of_mempolicy(nodemask_t *m) { return false; }
+static inline bool init_nodemask_of_mempolicy(nodemask_t *m)
+{
+ return false;
+}
+
+static inline bool mempolicy_nodemask_intersects(struct task_struct *tsk,
+ const nodemask_t *mask)
+{
+ return false;
+}

static inline int do_migrate_pages(struct mm_struct *mm,
const nodemask_t *from_nodes,
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1638,6 +1638,45 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
}
#endif

+/*
+ * mempolicy_nodemask_intersects
+ *
+ * If tsk's mempolicy is "default" [NULL], return 'true' to indicate default
+ * policy. Otherwise, check for intersection between mask and the policy
+ * nodemask for 'bind' or 'interleave' policy, or mask to contain the single
+ * node for 'preferred' or 'local' policy.
+ */
+bool mempolicy_nodemask_intersects(struct task_struct *tsk,
+ const nodemask_t *mask)
+{
+ struct mempolicy *mempolicy;
+ bool ret = true;
+
+ mempolicy = tsk->mempolicy;
+ mpol_get(mempolicy);
+ if (!mask || !mempolicy)
+ goto out;
+
+ switch (mempolicy->mode) {
+ case MPOL_PREFERRED:
+ if (mempolicy->flags & MPOL_F_LOCAL)
+ ret = node_isset(cpu_to_node(task_cpu(tsk)), *mask);
+ else
+ ret = node_isset(mempolicy->v.preferred_node,
+ *mask);
+ break;
+ case MPOL_BIND:
+ case MPOL_INTERLEAVE:
+ ret = nodes_intersects(mempolicy->v.nodes, *mask);
+ break;
+ default:
+ BUG();
+ }
+out:
+ mpol_put(mempolicy);
+ return ret;
+}
+
/* Allocate a page in interleaved policy.
Own path because it needs to do special accounting. */
static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c

@@ -26,6 +26,7 @@
#include <linux/module.h>
#include <linux/notifier.h>
#include <linux/memcontrol.h>
+#include <linux/mempolicy.h>
#include <linux/security.h>

int sysctl_panic_on_oom;
@@ -36,19 +37,35 @@ static DEFINE_SPINLOCK(zone_scan_lock);

/*
* Do all threads of the target process overlap our allowed nodes?

+ * @tsk: task struct of which task to consider
+ * @mask: nodemask passed to page allocator for mempolicy ooms
*/
-static int has_intersects_mems_allowed(struct task_struct *tsk)
+static bool has_intersects_mems_allowed(struct task_struct *tsk,
+ const nodemask_t *mask)
{
- struct task_struct *t;
+ struct task_struct *start = tsk;

- t = tsk;
do {
- if (cpuset_mems_allowed_intersects(current, t))
- return 1;
- t = next_thread(t);
- } while (t != tsk);
-
- return 0;
+ if (mask) {
+ /*
+ * If this is a mempolicy constrained oom, tsk's
+ * cpuset is irrelevant. Only return true if its
+ * mempolicy intersects current, otherwise it may be
+ * needlessly killed.
+ */
+ if (mempolicy_nodemask_intersects(tsk, mask))
+ return true;
+ } else {
+ /*
+ * This is not a mempolicy constrained oom, so only
+ * check the mems of tsk's cpuset.
+ */
+ if (cpuset_mems_allowed_intersects(current, tsk))
+ return true;
+ }
+ tsk = next_thread(tsk);
+ } while (tsk != start);
+ return false;
}

/**
@@ -236,7 +253,8 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,

* (not docbooked, we don't want this one cluttering up the manual)
*/

static struct task_struct *select_bad_process(unsigned long *ppoints,

- struct mem_cgroup *mem)
+ struct mem_cgroup *mem, enum oom_constraint constraint,
+ const nodemask_t *mask)

{
struct task_struct *p;
struct task_struct *chosen = NULL;

@@ -258,7 +276,9 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,

continue;
if (mem && !task_in_mem_cgroup(p, mem))
continue;

- if (!has_intersects_mems_allowed(p))
+ if (!has_intersects_mems_allowed(p,
+ constraint == CONSTRAINT_MEMORY_POLICY ? mask :
+ NULL))
continue;

/*
@@ -478,7 +498,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)

read_lock(&tasklist_lock);
retry:
- p = select_bad_process(&points, mem);
+ p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL);

if (PTR_ERR(p) == -1UL)
goto out;

@@ -560,7 +580,8 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)

/*
* Must be called with tasklist_lock held for read.
*/

-static void __out_of_memory(gfp_t gfp_mask, int order)

+static void __out_of_memory(gfp_t gfp_mask, int order,

+ enum oom_constraint constraint, const nodemask_t *mask)
{
struct task_struct *p;

unsigned long points;
@@ -574,7 +595,7 @@ retry:

* Rambo mode: Shoot down a process and hope it solves whatever
* issues we may have.
*/

- p = select_bad_process(&points, NULL);
+ p = select_bad_process(&points, NULL, constraint, mask);

if (PTR_ERR(p) == -1UL)
return;

@@ -615,7 +636,8 @@ void pagefault_out_of_memory(void)

panic("out of memory from page fault. panic_on_oom is selected.\n");

read_lock(&tasklist_lock);
- __out_of_memory(0, 0); /* unknown gfp_mask and order */
+ /* unknown gfp_mask and order */
+ __out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
read_unlock(&tasklist_lock);

/*
@@ -632,6 +654,7 @@ rest_and_return:
* @zonelist: zonelist pointer
* @gfp_mask: memory allocation flags
* @order: amount of memory being requested as a power of 2
+ * @nodemask: nodemask passed to page allocator
*
* If we run out of memory, we have the choice between either
* killing a random task (bad), letting the system crash (worse)
@@ -660,24 +683,18 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
*/

constraint = constrained_alloc(zonelist, gfp_mask, nodemask);

read_lock(&tasklist_lock);
-
- switch (constraint) {
- case CONSTRAINT_MEMORY_POLICY:
- oom_kill_process(current, gfp_mask, order, 0, NULL,
- "No available memory (MPOL_BIND)");
- break;
-
- case CONSTRAINT_NONE:
- if (sysctl_panic_on_oom) {
+ if (unlikely(sysctl_panic_on_oom)) {
+ /*
+ * panic_on_oom only affects CONSTRAINT_NONE, the kernel
+ * should not panic for cpuset or mempolicy induced memory
+ * failures.
+ */
+ if (constraint == CONSTRAINT_NONE) {
dump_header(NULL, gfp_mask, order, NULL);
- panic("out of memory. panic_on_oom is selected\n");
+ panic("Out of memory: panic_on_oom is enabled\n");
}
- /* Fall-through */
- case CONSTRAINT_CPUSET:
- __out_of_memory(gfp_mask, order);
- break;
}
-
+ __out_of_memory(gfp_mask, order, constraint, nodemask);
read_unlock(&tasklist_lock);

/*

David Rientjes

unread,

Feb 15, 2010, 5:30:02 PM2/15/10

to

Remove the redundancy in __oom_kill_task() since:

- init can never be passed to this function: it will never be PF_EXITING
or selectable from select_bad_process(), and

- it will never be passed a task from oom_kill_task() without an ->mm
and we're unconcerned about detachment from exiting tasks, there's no
reason to protect them against SIGKILL or access to memory reserves.

Also moves the kernel log message to a higher level since the verbosity
is not always emitted here; we need not print an error message if an
exiting task is given a longer timeslice.

Reviewed-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
Reviewed-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>

Signed-off-by: David Rientjes <rien...@google.com>
---

mm/oom_kill.c | 64 ++++++++++++++------------------------------------------
1 files changed, 16 insertions(+), 48 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -405,67 +405,35 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,

-
- /*

- * We give our sacrificial lamb high priority and access to
- * all the memory it needs. That way it should be able to
- * exit() and clear out its resources quickly...
- */
p->rt.time_slice = HZ;
set_tsk_thread_flag(p, TIF_MEMDIE);
-
force_sig(SIGKILL, p);
}

+#define K(x) ((x) << (PAGE_SHIFT-10))

static int oom_kill_task(struct task_struct *p)

{
- /* WARNING: mm may not be dereferenced since we did not obtain its
- * value from get_task_mm(p). This is OK since all we need to do is
- * compare mm to q->mm below.
- *
- * Furthermore, even if mm contains a non-NULL value, p->mm may

- * change to NULL at any time since we do not hold task_lock(p).
- * However, this is of no concern to us.
- */
- if (!p->mm || p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+ task_lock(p);
+ if (!p->mm || p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {

+ task_unlock(p);
return 1;
+ }
+ pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
+ task_pid_nr(p), p->comm, K(p->mm->total_vm),
+ K(get_mm_counter(p->mm, MM_ANONPAGES)),
+ K(get_mm_counter(p->mm, MM_FILEPAGES)));
+ task_unlock(p);

- __oom_kill_task(p, 1);
-
+ __oom_kill_task(p);
return 0;
}
+#undef K

static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,

unsigned int points, unsigned long totalpages,

@@ -484,7 +452,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,

* its children or threads, just set TIF_MEMDIE so it can die quickly

*/

if (p->flags & PF_EXITING) {
- __oom_kill_task(p, 0);
+ __oom_kill_task(p);
return 0;
}

David Rientjes

unread,

Feb 15, 2010, 5:30:03 PM2/15/10

to

If /proc/sys/vm/panic_on_oom is set to 2, the kernel will panic
regardless of whether the memory allocation is constrained by either a
mempolicy or cpuset.

Since mempolicy-constrained out of memory conditions now iterate through
the tasklist and select a task to kill, it is possible to panic the
machine if all tasks sharing the same mempolicy nodes (including those
with default policy, they may allocate anywhere) or cpuset mems have
/proc/pid/oom_adj values of OOM_DISABLE. This is functionally equivalent
to the compulsory panic_on_oom setting of 2, so the mode is removed.

Signed-off-by: David Rientjes <rien...@google.com>
---

Documentation/sysctl/vm.txt | 20 ++++----------------
mm/oom_kill.c | 5 -----
2 files changed, 4 insertions(+), 21 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt

@@ -559,25 +559,13 @@ swap-intensive.

panic_on_oom

-This enables or disables panic on out-of-memory feature.
+If this is set to zero, the oom killer will be invoked when the kernel is out of
+memory and direct reclaim cannot free any pages. It will select a memory-
+hogging task that frees up a large amount of memory to kill.

-If this is set to 0, the kernel will kill some rogue process,
-called oom_killer. Usually, oom_killer can kill rogue processes and
-system will survive.
-
-If this is set to 1, the kernel panics when out-of-memory happens.
-However, if a process limits using nodes by mempolicy/cpusets,
-and those nodes become memory exhaustion status, one process
-may be killed by oom-killer. No panic occurs in this case.
-Because other nodes' memory may be free. This means system total status
-may be not fatal yet.
-
-If this is set to 2, the kernel panics compulsorily even on the
-above-mentioned.
+If this is set to non-zero, the machine will panic when out of memory.

The default value is 0.

-1 and 2 are for failover of clustering. Please select either
-according to your policy of failover.

=============================================================

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c

@@ -672,11 +672,6 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
/* Got some memory back in the last second. */
return;

- if (sysctl_panic_on_oom == 2) {
- dump_header(NULL, gfp_mask, order, NULL);
- panic("out of memory. Compulsory panic_on_oom is selected.\n");
- }
-
/*

* Check if there were limitations on the allocation (only relevant for
* NUMA) that may require different handling.

Alan Cox

unread,

Feb 15, 2010, 5:40:02 PM2/15/10

to

On Mon, 15 Feb 2010 14:20:16 -0800 (PST)
David Rientjes <rien...@google.com> wrote:

> /proc/pid/oom_adj is now deprecated so that that it may eventually be
> removed. The target date for removal is December 2011.

There are systems that rely on this feature. It's ABI, its sacred. We are
committed to it and it has users. That doesn't really detract from the
good/bad of the rest of the proposal, it's just one step we can't quite
make.

Alan

David Rientjes

unread,

Feb 15, 2010, 5:40:02 PM2/15/10

to

On Mon, 15 Feb 2010, Alan Cox wrote:

> > /proc/pid/oom_adj is now deprecated so that that it may eventually be
> > removed. The target date for removal is December 2011.
>
> There are systems that rely on this feature. It's ABI, its sacred. We are
> committed to it and it has users. That doesn't really detract from the
> good/bad of the rest of the proposal, it's just one step we can't quite
> make.
>

Andrew suggested that it be deprecated in this way, so that's what was
done. I don't have any strong opinions about leaving it around forever
now that it's otherwise unused beyond simply converting itself into units
for /proc/pid/oom_score_adj at a much higher granularity.

KAMEZAWA Hiroyuki

unread,

Feb 15, 2010, 7:10:01 PM2/15/10

to

On Mon, 15 Feb 2010 14:20:09 -0800 (PST)
David Rientjes <rien...@google.com> wrote:

> If /proc/sys/vm/panic_on_oom is set to 2, the kernel will panic
> regardless of whether the memory allocation is constrained by either a
> mempolicy or cpuset.
>
> Since mempolicy-constrained out of memory conditions now iterate through
> the tasklist and select a task to kill, it is possible to panic the
> machine if all tasks sharing the same mempolicy nodes (including those
> with default policy, they may allocate anywhere) or cpuset mems have
> /proc/pid/oom_adj values of OOM_DISABLE. This is functionally equivalent
> to the compulsory panic_on_oom setting of 2, so the mode is removed.
>
> Signed-off-by: David Rientjes <rien...@google.com>

NACK. In an enviroment which depends on cluster-fail-over, this is useful
even if in such situation.

Thanks,
-Kame

KAMEZAWA Hiroyuki

unread,

Feb 15, 2010, 7:10:02 PM2/15/10

to

WARN_ON((high_zoneidx < ZONE_NORMAL) && (gfp_mask & __GFP_NOFAIL))
plz.

Thanks,
-Kame

David Rientjes

unread,

Feb 15, 2010, 7:20:02 PM2/15/10

to

As I already explained when you first brought this up, the possibility of
not invoking the oom killer is not unique to GFP_DMA, it is also possible
for GFP_NOFS. Since __GFP_NOFAIL is deprecated and there are no current
users of GFP_DMA | __GFP_NOFAIL, that warning is completely unnecessary.
We're not adding any additional __GFP_NOFAIL allocations.

David Rientjes

unread,

Feb 15, 2010, 7:20:02 PM2/15/10

to

On Tue, 16 Feb 2010, KAMEZAWA Hiroyuki wrote:

> > If /proc/sys/vm/panic_on_oom is set to 2, the kernel will panic
> > regardless of whether the memory allocation is constrained by either a
> > mempolicy or cpuset.
> >
> > Since mempolicy-constrained out of memory conditions now iterate through
> > the tasklist and select a task to kill, it is possible to panic the
> > machine if all tasks sharing the same mempolicy nodes (including those
> > with default policy, they may allocate anywhere) or cpuset mems have
> > /proc/pid/oom_adj values of OOM_DISABLE. This is functionally equivalent
> > to the compulsory panic_on_oom setting of 2, so the mode is removed.
> >
> > Signed-off-by: David Rientjes <rien...@google.com>
>
> NACK. In an enviroment which depends on cluster-fail-over, this is useful
> even if in such situation.
>

You don't understand that the behavior has changed ever since
mempolicy-constrained oom conditions are now affected by a compulsory
panic_on_oom mode, please see the patch description. It's absolutely
insane for a single sysctl mode to panic the machine anytime a cpuset or
mempolicy runs out of memory and is more prone to user error from setting
it without fully understanding the ramifications than any use it will ever
do. The kernel already provides a mechanism for doing this, OOM_DISABLE.
if you want your cpuset or mempolicy to risk panicking the machine, set
all tasks that share its mems or nodes, respectively, to OOM_DISABLE.
This is no different from the memory controller being immune to such
panic_on_oom conditions, stop believing that it is the only mechanism used
in the kernel to do memory isolation.

KAMEZAWA Hiroyuki

unread,

Feb 15, 2010, 7:30:02 PM2/15/10

to

You don't explain why "we _have to_ remove API which is used"

Thanks,
-Kame

KAMEZAWA Hiroyuki

unread,

Feb 15, 2010, 7:30:02 PM2/15/10

to

Please add documentation about that to gfp.h before doing this.
Doing this without writing any documenation is laziness.
(WARNING is a style of documentation.)

Thanks,
-Kame

David Rientjes

unread,

Feb 15, 2010, 8:20:01 PM2/15/10

to

On Tue, 16 Feb 2010, KAMEZAWA Hiroyuki wrote:

> > As I already explained when you first brought this up, the possibility of
> > not invoking the oom killer is not unique to GFP_DMA, it is also possible
> > for GFP_NOFS. Since __GFP_NOFAIL is deprecated and there are no current
> > users of GFP_DMA | __GFP_NOFAIL, that warning is completely unnecessary.
> > We're not adding any additional __GFP_NOFAIL allocations.
> >
>
> Please add documentation about that to gfp.h before doing this.
> Doing this without writing any documenation is laziness.
> (WARNING is a style of documentation.)
>

This is already documented in the page allocator, but I guess doing it in
include/linux/gfp.h as well doesn't hurt.

mm: add comment about deprecation of __GFP_NOFAIL

__GFP_NOFAIL was deprecated in dab48dab, so add a comment that no new
users should be added.

Cc: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>

Signed-off-by: David Rientjes <rien...@google.com>
---

include/linux/gfp.h | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -30,7 +30,8 @@ struct vm_area_struct;
* _might_ fail. This depends upon the particular VM implementation.
*
* __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
- * cannot handle allocation failures.
+ * cannot handle allocation failures. This modifier is deprecated and no new
+ * users should be added.
*
* __GFP_NORETRY: The VM implementation must not retry indefinitely.
*

KAMEZAWA Hiroyuki

unread,

Feb 15, 2010, 8:40:02 PM2/15/10

to

On Mon, 15 Feb 2010 17:13:57 -0800 (PST)
David Rientjes <rien...@google.com> wrote:

> On Tue, 16 Feb 2010, KAMEZAWA Hiroyuki wrote:
>
> > > As I already explained when you first brought this up, the possibility of
> > > not invoking the oom killer is not unique to GFP_DMA, it is also possible
> > > for GFP_NOFS. Since __GFP_NOFAIL is deprecated and there are no current
> > > users of GFP_DMA | __GFP_NOFAIL, that warning is completely unnecessary.
> > > We're not adding any additional __GFP_NOFAIL allocations.
> > >
> >
> > Please add documentation about that to gfp.h before doing this.
> > Doing this without writing any documenation is laziness.
> > (WARNING is a style of documentation.)
> >
>
> This is already documented in the page allocator, but I guess doing it in
> include/linux/gfp.h as well doesn't hurt.
>

I want warning when someone uses OBSOLETE interface but...

Reviewed-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>

I hope no 3rd vendor (proprietary) driver uses __GFP_NOFAIL, they tend to
believe API is trustable and unchanged.

>
>
> mm: add comment about deprecation of __GFP_NOFAIL
>
> __GFP_NOFAIL was deprecated in dab48dab, so add a comment that no new
> users should be added.
>
> Cc: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
> Signed-off-by: David Rientjes <rien...@google.com>
> ---
> include/linux/gfp.h | 3 ++-
> 1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -30,7 +30,8 @@ struct vm_area_struct;
> * _might_ fail. This depends upon the particular VM implementation.
> *
> * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
> - * cannot handle allocation failures.
> + * cannot handle allocation failures. This modifier is deprecated and no new
> + * users should be added.
> *
> * __GFP_NORETRY: The VM implementation must not retry indefinitely.
> *
>
> --

> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majo...@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"do...@kvack.org"> em...@kvack.org </a>
>

--

KOSAKI Motohiro

unread,

Feb 16, 2010, 12:00:02 AM2/16/10

to

> On Mon, 15 Feb 2010, KOSAKI Motohiro wrote:
>
> > > Tasks that do not share the same set of allowed nodes with the task that
> > > triggered the oom should not be considered as candidates for oom kill.
> > >
> > > Tasks in other cpusets with a disjoint set of mems would be unfairly
> > > penalized otherwise because of oom conditions elsewhere; an extreme
> > > example could unfairly kill all other applications on the system if a
> > > single task in a user's cpuset sets itself to OOM_DISABLE and then uses
> > > more memory than allowed.
> > >
> > > Killing tasks outside of current's cpuset rarely would free memory for
> > > current anyway.
> > >
> > > Signed-off-by: David Rientjes <rien...@google.com>
> >
> > This patch does right thing and looks promissing. but unfortunately
> > I have to NAK this patch temporary.
> >
> > This patch is nearly just revert of the commit 7887a3da75. We have to
> > dig archaeology mail log and find why this reverting don't cause
> > the old pain again.
> >
>
> Nick is probably wondering why I cc'd him on this patchset, and this is it
> :)

Good decision :)

>
> We now determine whether an allocation is constrained by a cpuset by
> iterating through the zonelist and checking
> cpuset_zone_allowed_softwall(). This checks for the necessary cpuset
> restrictions that we need to validate (the GFP_ATOMIC exception is
> irrelevant, we don't call into the oom killer for those). We don't need
> to kill outside of its cpuset because we're not guaranteed to find any
> memory on those nodes, in fact it allows for needless oom killing if a
> task sets all of its threads to have OOM_DISABLE in its own cpuset and
> then runs out of memory. The oom killer would have killed every other
> user task on the system even though the offending application can't
> allocate there. That's certainly an undesired result and needs to be
> fixed in this manner.

But this explanation is irrelevant and meaningless. CPUSET can change
restricted node dynamically. So, the tsk->mempolicy at oom time doesn't
represent the place of task's usage memory. plus, OOM_DISABLE can
always makes undesirable result. it's not special in this case.

The fact is, both current and your heuristics have a corner case. it's
obvious. (I haven't seen corner caseless heuristics). then talking your
patch's merit doesn't help to merge the patch. The most important thing
is, we keep no regression. personally, I incline your one. but It doesn't
mean we can ignore its demerit.

KOSAKI Motohiro

unread,

Feb 16, 2010, 12:30:02 AM2/16/10

to

> On Mon, 15 Feb 2010, KOSAKI Motohiro wrote:
>
> > > Two VM sysctls, oom dump_tasks and oom_kill_allocating_task, were
> > > implemented for very large systems to avoid excessively long tasklist
> > > scans. The former suppresses helpful diagnostic messages that are
> > > emitted for each thread group leader that are candidates for oom kill
> > > including their pid, uid, vm size, rss, oom_adj value, and name; this
> > > information is very helpful to users in understanding why a particular
> > > task was chosen for kill over others. The latter simply kills current,
> > > the task triggering the oom condition, instead of iterating through the
> > > tasklist looking for the worst offender.
> > >
> > > Both of these sysctls are combined into one for use on the aforementioned
> > > large systems: oom_kill_quick. This disables the now-default
> > > oom_dump_tasks and kills current whenever the oom killer is called.
> > >
> > > The oom killer rewrite is the perfect opportunity to combine both sysctls
> > > into one instead of carrying around the others for years to come for
> > > nothing else than legacy purposes.
> >
> > "_quick" is always bad sysctl name.
>
> Why? It does exactly what it says: it kills current without doing an
> expensive tasklist scan and suppresses the possibly long tasklist dump.
> That's the oom killer's "quick mode."

Because, an administrator think "_quick" implies "please use it always".
plus, "quick" doesn't describe clealy meanings. oom_dump_tasks does.

> > instead, turnning oom_dump_tasks on
> > by default is better.
> >
>
> It's now on by default and can be disabled by enabling oom_kill_quick.
>
> > plus, this patch makes unnecessary compatibility issue.
> >
>
> It's the perfect opportunity when totally rewriting the oom killer to
> combine two sysctls with the exact same users into one. Users will notice
> that the tasklist is always dumped now (we're defaulting oom_dump_tasks
> to be enabled), so there is no reason why we can't remove oom_dump_tasks,
> we're just giving them a new way to disable it. oom_kill_allocating_task
> no longer always means what it once did: with the mempolicy-constrained
> oom rewrite, we now iterate the tasklist for such cases to kill a task.
> So users need to reassess whether this should be set if all tasks on the
> system are constrained by mempolicies, a typical configuration for
> extremely large systems.

No.
Your explanation doesn't answer why this change don't cause any comatibility
issue to _all_ user. Merely "opportunity" doesn't allow we ignore real world user.
I had made some incompatibility patch too, but all one have unavoidable reason.

KOSAKI Motohiro

unread,

Feb 16, 2010, 12:40:02 AM2/16/10

to

No current user? I don't think so.

int bio_integrity_prep(struct bio *bio)
{
(snip)
buf = kmalloc(len, GFP_NOIO | __GFP_NOFAIL | q->bounce_gfp);

and

void blk_queue_bounce_limit(struct request_queue *q, u64 dma_mask)
{
(snip)
if (dma) {
init_emergency_isa_pool();
q->bounce_gfp = GFP_NOIO | GFP_DMA;
q->limits.bounce_pfn = b_pfn;
}

I don't like rumor based discussion, I like fact based one.

Thanks.

KOSAKI Motohiro

unread,

Feb 16, 2010, 1:10:02 AM2/16/10

to

typo.

> But this explanation is irrelevant and meaningless. CPUSET can change

s/CPUSET/mempolicy/

Nick Piggin

unread,

Feb 16, 2010, 1:20:01 AM2/16/10

to

On Mon, Feb 15, 2010 at 02:20:01PM -0800, David Rientjes wrote:
> Tasks that do not share the same set of allowed nodes with the task that
> triggered the oom should not be considered as candidates for oom kill.
>
> Tasks in other cpusets with a disjoint set of mems would be unfairly
> penalized otherwise because of oom conditions elsewhere; an extreme
> example could unfairly kill all other applications on the system if a
> single task in a user's cpuset sets itself to OOM_DISABLE and then uses
> more memory than allowed.
>
> Killing tasks outside of current's cpuset rarely would free memory for
> current anyway.
>
> Acked-by: Rik van Riel <ri...@redhat.com>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
> Signed-off-by: David Rientjes <rien...@google.com>

Acked-by: Nick Piggin <npi...@suse.de>

Nick Piggin

unread,

Feb 16, 2010, 1:20:02 AM2/16/10

to

Acked-by: Nick Piggin <npi...@suse.de>

Nick Piggin

unread,

Feb 16, 2010, 1:30:01 AM2/16/10

to

On Mon, Feb 15, 2010 at 02:20:09PM -0800, David Rientjes wrote:
> If /proc/sys/vm/panic_on_oom is set to 2, the kernel will panic
> regardless of whether the memory allocation is constrained by either a
> mempolicy or cpuset.
>
> Since mempolicy-constrained out of memory conditions now iterate through
> the tasklist and select a task to kill, it is possible to panic the
> machine if all tasks sharing the same mempolicy nodes (including those
> with default policy, they may allocate anywhere) or cpuset mems have
> /proc/pid/oom_adj values of OOM_DISABLE. This is functionally equivalent
> to the compulsory panic_on_oom setting of 2, so the mode is removed.
>
> Signed-off-by: David Rientjes <rien...@google.com>

What is the point of removing it, though? If it doesn't significantly
help some future patch, just leave it in. It's not worth breaking the
user/kernel interface just to remove 3 trivial lines of code.

Nick Piggin

unread,

Feb 16, 2010, 1:30:01 AM2/16/10

to

On Mon, Feb 15, 2010 at 02:20:18PM -0800, David Rientjes wrote:
> Two VM sysctls, oom dump_tasks and oom_kill_allocating_task, were
> implemented for very large systems to avoid excessively long tasklist
> scans. The former suppresses helpful diagnostic messages that are
> emitted for each thread group leader that are candidates for oom kill
> including their pid, uid, vm size, rss, oom_adj value, and name; this
> information is very helpful to users in understanding why a particular
> task was chosen for kill over others. The latter simply kills current,
> the task triggering the oom condition, instead of iterating through the
> tasklist looking for the worst offender.
>
> Both of these sysctls are combined into one for use on the aforementioned
> large systems: oom_kill_quick. This disables the now-default
> oom_dump_tasks and kills current whenever the oom killer is called.
>
> The oom killer rewrite is the perfect opportunity to combine both sysctls
> into one instead of carrying around the others for years to come for
> nothing else than legacy purposes.

I just don't understand this either. There appears to be simply no
performance or maintainability reason to change this.

Nick Piggin

unread,

Feb 16, 2010, 1:50:02 AM2/16/10

to

Completely agree with this request. Actually, I think even better you
should just add && !(gfp_mask & __GFP_NOFAIL). Deprecated doesn't mean
it is OK to break the API (callers *will* oops or corrupt memory if
__GFP_NOFAIL returns NULL).

David Rientjes

unread,

Feb 16, 2010, 2:00:01 AM2/16/10

to

On Tue, 16 Feb 2010, Nick Piggin wrote:

> What is the point of removing it, though? If it doesn't significantly
> help some future patch, just leave it in. It's not worth breaking the
> user/kernel interface just to remove 3 trivial lines of code.
>

Because it is inconsistent at the user's expense, it has never panicked
the machine for memory controller ooms, so why is a cpuset or mempolicy
constrained oom conditions any different? It also panics the machine even
on VM_FAULT_OOM which is ridiculous, the tunable is certainly not being
used how it was documented and so given the fact that mempolicy
constrained ooms are now much smarter with my rewrite and we never simply
kill current unless oom_kill_quick is enabled anymore, the compulsory
panic_on_oom == 2 mode is no longer required. Simply set all tasks
attached to a cpuset or bound to a specific mempolicy to be OOM_DISABLE,
the kernel need not provide confusing alternative modes to sysctls for
this behavior. Before panic_on_oom == 2 was introduced, it would have
only panicked the machine if panic_on_oom was set to a non-zero integer,
defining it be something different for '2' after it has held the same
semantics for years is inappropriate. There is just no concrete example
that anyone can give where they want a cpuset-constrained oom to panic the
machine when other tasks on a disjoint set of mems can continue to do
work and the cpuset of interest cannot have its tasks set to OOM_DISABLE.

Nick Piggin

unread,

Feb 16, 2010, 2:10:02 AM2/16/10

to

On Tue, Feb 16, 2010 at 01:52:02PM +0900, KOSAKI Motohiro wrote:
> But this explanation is irrelevant and meaningless. CPUSET can change
> restricted node dynamically. So, the tsk->mempolicy at oom time doesn't
> represent the place of task's usage memory. plus, OOM_DISABLE can
> always makes undesirable result. it's not special in this case.
>
> The fact is, both current and your heuristics have a corner case. it's
> obvious. (I haven't seen corner caseless heuristics). then talking your
> patch's merit doesn't help to merge the patch. The most important thing
> is, we keep no regression. personally, I incline your one. but It doesn't
> mean we can ignore its demerit.

Yes we do need to explain the downside of the patch. It is a
heuristic and we can't call either approach perfect.

The fact is that even if 2 tasks are on completely disjoint
memory policies and never _allocate_ from one another's nodes,
you can still have one task pinning memory of the other task's
node.

Most shared and userspace-pinnable resources (pagecache, vfs
caches and fds files sockes etc) are allocated by first-touch
basically.

I don't see much usage of cpusets and oom killer first hand in
my experience, so I am happy to defer to others when it comes
to heuristics. Just so long as we are all aware of the full
story :)

David Rientjes

unread,

Feb 16, 2010, 2:10:02 AM2/16/10

to

On Tue, 16 Feb 2010, KAMEZAWA Hiroyuki wrote:

> I hope no 3rd vendor (proprietary) driver uses __GFP_NOFAIL, they tend to
> believe API is trustable and unchanged.
>

I hope they don't use it with GFP_ATOMIC, either, because it's never been
respected in that context. We can easily audit the handful of cases in
the kernel that use __GFP_NOFAIL (it takes five minutes at the max) and
prove that none use it with GFP_ATOMIC or GFP_NOFS. We don't need to add
multitudes of warnings about using a deprecated flag with ludicrous
combinations (does anyone really expect GFP_ATOMIC | __GFP_NOFAIL to work
gracefully)?

Nick Piggin

unread,

Feb 16, 2010, 2:30:02 AM2/16/10

to

On Mon, Feb 15, 2010 at 10:59:26PM -0800, David Rientjes wrote:
> On Tue, 16 Feb 2010, Nick Piggin wrote:
>
> > What is the point of removing it, though? If it doesn't significantly
> > help some future patch, just leave it in. It's not worth breaking the
> > user/kernel interface just to remove 3 trivial lines of code.
> >
>
> Because it is inconsistent at the user's expense, it has never panicked
> the machine for memory controller ooms, so why is a cpuset or mempolicy
> constrained oom conditions any different?

Well memory controller was added later, wasn't it? So if you think
that's a bug then a fix to panic on memory controller ooms might
be in order.

> It also panics the machine even
> on VM_FAULT_OOM which is ridiculous,

Why?

> the tunable is certainly not being
> used how it was documented

Why not? The documentation seems to match the implementation.

> and so given the fact that mempolicy
> constrained ooms are now much smarter with my rewrite and we never simply
> kill current unless oom_kill_quick is enabled anymore, the compulsory
> panic_on_oom == 2 mode is no longer required. Simply set all tasks
> attached to a cpuset or bound to a specific mempolicy to be OOM_DISABLE,
> the kernel need not provide confusing alternative modes to sysctls for
> this behavior. Before panic_on_oom == 2 was introduced, it would have
> only panicked the machine if panic_on_oom was set to a non-zero integer,
> defining it be something different for '2' after it has held the same
> semantics for years is inappropriate.

Well it was always defined in the documentation that it should be 0
or 1. Just that the limit wasn't enforced. I agree that's not ideal,
but anyway the existing and documented 0/1/2 has been there for 3 years
and so now removing the 2 is even worse.

> There is just no concrete example
> that anyone can give where they want a cpuset-constrained oom to panic the
> machine when other tasks on a disjoint set of mems can continue to do
> work and the cpuset of interest cannot have its tasks set to OOM_DISABLE.

But this is changing the way the environment is required to set up. So
a kernel upgrade can break previously working setups. We don't do this
without really good reason.

David Rientjes

unread,

Feb 16, 2010, 2:30:01 AM2/16/10

to

On Tue, 16 Feb 2010, KOSAKI Motohiro wrote:

> No current user? I don't think so.
>
> int bio_integrity_prep(struct bio *bio)
> {
> (snip)
> buf = kmalloc(len, GFP_NOIO | __GFP_NOFAIL | q->bounce_gfp);
>
> and
>
> void blk_queue_bounce_limit(struct request_queue *q, u64 dma_mask)
> {
> (snip)
> if (dma) {
> init_emergency_isa_pool();
> q->bounce_gfp = GFP_NOIO | GFP_DMA;
> q->limits.bounce_pfn = b_pfn;
> }
>
>
>
> I don't like rumor based discussion, I like fact based one.
>

The GFP_NOIO will prevent the oom killer from being called, it requires
__GFP_FS.

I can change this to invoke the should_alloc_retry() logic by testing for
!(gfp_mask & __GFP_NOFAIL), but there's nothing else the page allocator
can currently do to increase its probability of allocating pages; the
memory compaction patchset might be particularly helpful for these types
of scenarios.

Nick Piggin

unread,

Feb 16, 2010, 2:30:02 AM2/16/10

to

On Mon, Feb 15, 2010 at 11:03:50PM -0800, David Rientjes wrote:
> On Tue, 16 Feb 2010, KAMEZAWA Hiroyuki wrote:
>
> > I hope no 3rd vendor (proprietary) driver uses __GFP_NOFAIL, they tend to
> > believe API is trustable and unchanged.
> >
>
> I hope they don't use it with GFP_ATOMIC, either, because it's never been
> respected in that context. We can easily audit the handful of cases in
> the kernel that use __GFP_NOFAIL (it takes five minutes at the max) and
> prove that none use it with GFP_ATOMIC or GFP_NOFS. We don't need to add
> multitudes of warnings about using a deprecated flag with ludicrous
> combinations (does anyone really expect GFP_ATOMIC | __GFP_NOFAIL to work
> gracefully)?

You don't need to add warnings, just don't break existing working
combinations and nobody has anything to complain about.

David Rientjes

unread,

Feb 16, 2010, 2:50:02 AM2/16/10

to

On Tue, 16 Feb 2010, Nick Piggin wrote:

> > As I already explained when you first brought this up, the possibility of
> > not invoking the oom killer is not unique to GFP_DMA, it is also possible
> > for GFP_NOFS. Since __GFP_NOFAIL is deprecated and there are no current
> > users of GFP_DMA | __GFP_NOFAIL, that warning is completely unnecessary.
> > We're not adding any additional __GFP_NOFAIL allocations.
>
> Completely agree with this request. Actually, I think even better you
> should just add && !(gfp_mask & __GFP_NOFAIL). Deprecated doesn't mean
> it is OK to break the API (callers *will* oops or corrupt memory if
> __GFP_NOFAIL returns NULL).
>

... unless it's used with GFP_ATOMIC, which we've always returned NULL
for when even ALLOC_HARDER can't find pages, right?

I'm wondering where this strong argument in favor of continuing to support
__GFP_NOFAIL was when I insisted we call the oom killer for them even for
allocations over PAGE_ALLOC_COSTLY_ORDER when __alloc_pages_nodemask() was
refactored back in 2.6.31. The argument was that nobody is allocating
that high of orders of __GFP_NOFAIL pages so we don't need to free memory
for them and that's where the deprecation of the modifier happened in the
first place. Ultimately, we did invoke the oom killer for those
allocations because there's no chance of forward progress otherwise and,
unlike __GFP_DMA, GFP_KERNEL | __GFP_NOFAIL actually is popular.

I'll add this check to __alloc_pages_may_oom() for the !(gfp_mask &
__GFP_NOFAIL) path since we're all content with endlessly looping.

David Rientjes

unread,

Feb 16, 2010, 3:00:02 AM2/16/10

to

On Tue, 16 Feb 2010, Nick Piggin wrote:

> > Because it is inconsistent at the user's expense, it has never panicked
> > the machine for memory controller ooms, so why is a cpuset or mempolicy
> > constrained oom conditions any different?
>
> Well memory controller was added later, wasn't it? So if you think
> that's a bug then a fix to panic on memory controller ooms might
> be in order.
>

But what about the existing memcg users who set panic_on_oom == 2 and
don't expect the memory controller to be influenced by that?

> > It also panics the machine even
> > on VM_FAULT_OOM which is ridiculous,
>
> Why?
>

Because the oom killer was never called for VM_FAULT_OOM before, we simply
sent a SIGKILL to current, i.e. the original panic_on_oom semantics were
not even enforced.

> > the tunable is certainly not being
> > used how it was documented
>
> Why not? The documentation seems to match the implementation.
>

It was meant to panic the machine anytime it was out of memory, regardless
of the constraint, but that obviously doesn't match the memory controller
case. Just because cpusets and mempolicies decide to use the oom killer
as a mechanism for enforcing a user-defined policy does not mean that we
want to panic for them: mempolicies, for example, are user created and do
not require any special capability. Does it seem reasonable that an oom
condition on those mempolicy nodes should panic the machine when killing
the offender is possible (and perhaps even encouraged if the user sets a
high /proc/pid/oom_score_adj?) In other words, is an admin setting
panic_on_oom == 2 really expecting that no application will use
set_mempolicy() or do an mbind()? This is a very error-prone interface
that needs to be dealt with on a case-by-case basis and the perfect way to
do that is by setting the affected tasks to be OOM_DISABLE; that
interface, unlike panic_on_oom == 2, is very well understood by those with
CAP_SYS_RESOURCE.

Nick Piggin

unread,

Feb 16, 2010, 3:00:01 AM2/16/10

to

On Mon, Feb 15, 2010 at 11:41:49PM -0800, David Rientjes wrote:
> On Tue, 16 Feb 2010, Nick Piggin wrote:
>
> > > As I already explained when you first brought this up, the possibility of
> > > not invoking the oom killer is not unique to GFP_DMA, it is also possible
> > > for GFP_NOFS. Since __GFP_NOFAIL is deprecated and there are no current
> > > users of GFP_DMA | __GFP_NOFAIL, that warning is completely unnecessary.
> > > We're not adding any additional __GFP_NOFAIL allocations.
> >
> > Completely agree with this request. Actually, I think even better you
> > should just add && !(gfp_mask & __GFP_NOFAIL). Deprecated doesn't mean
> > it is OK to break the API (callers *will* oops or corrupt memory if
> > __GFP_NOFAIL returns NULL).
> >
>
> ... unless it's used with GFP_ATOMIC, which we've always returned NULL
> for when even ALLOC_HARDER can't find pages, right?

Ye, it's never worked with GFP_ATOMIC.

> I'm wondering where this strong argument in favor of continuing to support
> __GFP_NOFAIL was when I insisted we call the oom killer for them even for
> allocations over PAGE_ALLOC_COSTLY_ORDER when __alloc_pages_nodemask() was
> refactored back in 2.6.31. The argument was that nobody is allocating
> that high of orders of __GFP_NOFAIL pages so we don't need to free memory
> for them and that's where the deprecation of the modifier happened in the
> first place. Ultimately, we did invoke the oom killer for those
> allocations because there's no chance of forward progress otherwise and,
> unlike __GFP_DMA, GFP_KERNEL | __GFP_NOFAIL actually is popular.

I don't know. IMO we should never just randomly weaken or break such
flag as the page allocator API.

>
> I'll add this check to __alloc_pages_may_oom() for the !(gfp_mask &
> __GFP_NOFAIL) path since we're all content with endlessly looping.

Thanks. Yes endlessly looping is far preferable to randomly oopsing
or corrupting memory.

Nick Piggin

unread,

Feb 16, 2010, 3:10:02 AM2/16/10

to

On Mon, Feb 15, 2010 at 11:53:33PM -0800, David Rientjes wrote:
> On Tue, 16 Feb 2010, Nick Piggin wrote:
>
> > > Because it is inconsistent at the user's expense, it has never panicked
> > > the machine for memory controller ooms, so why is a cpuset or mempolicy
> > > constrained oom conditions any different?
> >
> > Well memory controller was added later, wasn't it? So if you think
> > that's a bug then a fix to panic on memory controller ooms might
> > be in order.
> >
>
> But what about the existing memcg users who set panic_on_oom == 2 and
> don't expect the memory controller to be influenced by that?

But that was a bug in the addition of the memory controller. Either the
documentation should be fixed, or the implementation should be fixed.

> > > It also panics the machine even
> > > on VM_FAULT_OOM which is ridiculous,
> >
> > Why?
> >
>
> Because the oom killer was never called for VM_FAULT_OOM before, we simply
> sent a SIGKILL to current, i.e. the original panic_on_oom semantics were
> not even enforced.

No but now they are. I don't know what your point is here because there
is no way the users of this interface can be expected to know about
VM_FAULT_OOM versus pagefault_out_of_memory let alone do anything useful
with that.

>
> > > the tunable is certainly not being
> > > used how it was documented
> >
> > Why not? The documentation seems to match the implementation.
> >
>
> It was meant to panic the machine anytime it was out of memory, regardless
> of the constraint, but that obviously doesn't match the memory controller
> case.

Right, and it's been like that for 3 years and people who don't use
the memory controller will be using that tunable.

Let's fix the memory controller case.

> Just because cpusets and mempolicies decide to use the oom killer
> as a mechanism for enforcing a user-defined policy does not mean that we
> want to panic for them: mempolicies, for example, are user created and do
> not require any special capability. Does it seem reasonable that an oom
> condition on those mempolicy nodes should panic the machine when killing
> the offender is possible (and perhaps even encouraged if the user sets a
> high /proc/pid/oom_score_adj?) In other words, is an admin setting
> panic_on_oom == 2 really expecting that no application will use
> set_mempolicy() or do an mbind()? This is a very error-prone interface
> that needs to be dealt with on a case-by-case basis and the perfect way to
> do that is by setting the affected tasks to be OOM_DISABLE; that
> interface, unlike panic_on_oom == 2, is very well understood by those with
> CAP_SYS_RESOURCE.

I assume it is reasonable to want to panic on any OOM if you're after
fail-stop kind of behaviour. I guess that is why it was added. I see
more use for that case than panic_on_oom==1 case myself.

KAMEZAWA Hiroyuki

unread,

Feb 16, 2010, 3:20:01 AM2/16/10

to

On Tue, 16 Feb 2010 19:08:17 +1100
Nick Piggin <npi...@suse.de> wrote:

> On Mon, Feb 15, 2010 at 11:53:33PM -0800, David Rientjes wrote:
> > On Tue, 16 Feb 2010, Nick Piggin wrote:
> >
> > > > Because it is inconsistent at the user's expense, it has never panicked
> > > > the machine for memory controller ooms, so why is a cpuset or mempolicy
> > > > constrained oom conditions any different?
> > >
> > > Well memory controller was added later, wasn't it? So if you think
> > > that's a bug then a fix to panic on memory controller ooms might
> > > be in order.
> > >
> >
> > But what about the existing memcg users who set panic_on_oom == 2 and
> > don't expect the memory controller to be influenced by that?
>
> But that was a bug in the addition of the memory controller. Either the
> documentation should be fixed, or the implementation should be fixed.
>

I'll add a documentation to memcg. As

"When you exhaust memory resource under memcg, oom-killer may be invoked.
But in this case, the system never panics even when panic_on_oom is set."

Maybe I should add "memcg_oom_notify (netlink message or file-decriptor or some".
Because memcg's oom is virtual oom, automatic management software can show
report to users and can do fail-over. I'll consider something useful for
memcg oom-fail-over instead of panic. In the simplest case, cgroup's notiifer
file descriptor can be used.

Thanks,
-Kame

David Rientjes

unread,

Feb 16, 2010, 3:30:02 AM2/16/10

to

On Tue, 16 Feb 2010, Nick Piggin wrote:

> > I'll add this check to __alloc_pages_may_oom() for the !(gfp_mask &
> > __GFP_NOFAIL) path since we're all content with endlessly looping.
>
> Thanks. Yes endlessly looping is far preferable to randomly oopsing
> or corrupting memory.
>

Here's the new patch for your consideration.

oom: avoid oom killer for lowmem allocations

If memory has been depleted in lowmem zones even with the protection
afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that
killing current users will help. The memory is either reclaimable (or
migratable) already, in which case we should not invoke the oom killer at
all, or it is pinned by an application for I/O. Killing such an
application may leave the hardware in an unspecified state and there is
no guarantee that it will be able to make a timely exit.

Lowmem allocations are now failed in oom conditions when __GFP_NOFAIL is
not used so that the task can perhaps recover or try again later.

Previously, the heuristic provided some protection for those tasks with
CAP_SYS_RAWIO, but this is no longer necessary since we will not be
killing tasks for the purposes of ISA allocations.

high_zoneidx is gfp_zone(gfp_flags), meaning that ZONE_NORMAL will be the
default for all allocations that are not __GFP_DMA, __GFP_DMA32,
__GFP_HIGHMEM, and __GFP_MOVABLE on kernels configured to support those
flags. Testing for high_zoneidx being less than ZONE_NORMAL will only
return true for allocations that have either __GFP_DMA or __GFP_DMA32.

Signed-off-by: David Rientjes <rien...@google.com>

---
mm/page_alloc.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c

@@ -1705,6 +1705,9 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
*/
if (gfp_mask & __GFP_THISNODE)
goto out;

+ /* The oom killer won't necessarily free lowmem */
+ if (high_zoneidx < ZONE_NORMAL)

+ goto out;
}
/* Exhausted what can be done so it's blamo time */
out_of_memory(zonelist, gfp_mask, order, nodemask);

David Rientjes

unread,

Feb 16, 2010, 3:50:02 AM2/16/10

to

On Tue, 16 Feb 2010, Nick Piggin wrote:

> Yes we do need to explain the downside of the patch. It is a
> heuristic and we can't call either approach perfect.
>
> The fact is that even if 2 tasks are on completely disjoint
> memory policies and never _allocate_ from one another's nodes,
> you can still have one task pinning memory of the other task's
> node.
>
> Most shared and userspace-pinnable resources (pagecache, vfs
> caches and fds files sockes etc) are allocated by first-touch
> basically.
>
> I don't see much usage of cpusets and oom killer first hand in
> my experience, so I am happy to defer to others when it comes
> to heuristics. Just so long as we are all aware of the full
> story :)
>

Unless you can present a heuristic that will determine how much memory
usage a given task has allocated on nodes in current's zonelist, we must
exclude tasks from cpusets with a disjoint set of nodes, otherwise we
cannot determine the optimal task to kill. There's a strong possibility
that killing a task on a disjoint set of mems will never free memory for
current, making it a needless kill. That's a much more serious
consequence than not having the patch, in my opinion, than rather simply
killing current.

David Rientjes

unread,

Feb 16, 2010, 3:50:02 AM2/16/10

to

On Tue, 16 Feb 2010, Nick Piggin wrote:

> > > > Because it is inconsistent at the user's expense, it has never panicked
> > > > the machine for memory controller ooms, so why is a cpuset or mempolicy
> > > > constrained oom conditions any different?
> > >
> > > Well memory controller was added later, wasn't it? So if you think
> > > that's a bug then a fix to panic on memory controller ooms might
> > > be in order.
> > >
> >
> > But what about the existing memcg users who set panic_on_oom == 2 and
> > don't expect the memory controller to be influenced by that?
>
> But that was a bug in the addition of the memory controller. Either the
> documentation should be fixed, or the implementation should be fixed.
>

The memory controller behavior seems intentional because it prevents
panicking in two places: mem_cgroup_out_of_memory() never considers it and
sysctl_panic_on_oom is preempted in pagefault_out_of_memory() if current's
memcg is oom.

The documentation is currently right because it only mentions an
application to cpusets and mempolicies.

That's the reason why I think we should eliminate it: it is completely
bogus as it stands because it allows tasks to be killed in memory
controller environments if their hard limit is reached unless they are set
to OOM_DISABLE. That doesn't have fail-stop behavior and trying to make
exceptions to the rule is not true "fail-stop" that we need to preserve
with this interface.

> > Because the oom killer was never called for VM_FAULT_OOM before, we simply
> > sent a SIGKILL to current, i.e. the original panic_on_oom semantics were
> > not even enforced.
>
> No but now they are. I don't know what your point is here because there
> is no way the users of this interface can be expected to know about
> VM_FAULT_OOM versus pagefault_out_of_memory let alone do anything useful
> with that.
>

I think VM_FAULT_OOM should panic the machine for panic_on_oom == 1 as it
presently does, it needs no special handling otherwise. But this is an
example of where semantics of panic_on_oom have changed in the past where
OOM_DISABLE would remove any ambiguity. Instead of redefining the
sysctl's semantics everytime we add another usecase for the oom killer,
why can't we just use a single interface that has been around for years
when a certain task shouldn't be killed?

> Let's fix the memory controller case.
>

I doubt you'll find much support from the memory controller folks on that
since they probably won't agree this is fail-stop behavior and killing a
task when constrained by a memcg is appropriate because the user asked for
a hard limit.

Again, OOM_DISABLE would remove all ambiguity and we wouldn't need to
concern ourselves of what the semantics of a poorly chosen interface such
as panic_on_oom == 2 is whenever we change the oom killer.

> I assume it is reasonable to want to panic on any OOM if you're after
> fail-stop kind of behaviour. I guess that is why it was added. I see
> more use for that case than panic_on_oom==1 case myself.
>

panic_on_oom == 1 is reasonable since no system task can make forward
progress in allocating memory, that isn't necessarily true of cpuset or
mempolicy (or memcg) constrained applications. Other cpusets, for
instance, can continue to do work uninterrupted and without threat of
having one of their tasks being oom killed.

David Rientjes

unread,

Feb 16, 2010, 3:50:03 AM2/16/10

to

On Tue, 16 Feb 2010, KOSAKI Motohiro wrote:

> > We now determine whether an allocation is constrained by a cpuset by
> > iterating through the zonelist and checking
> > cpuset_zone_allowed_softwall(). This checks for the necessary cpuset
> > restrictions that we need to validate (the GFP_ATOMIC exception is
> > irrelevant, we don't call into the oom killer for those). We don't need
> > to kill outside of its cpuset because we're not guaranteed to find any
> > memory on those nodes, in fact it allows for needless oom killing if a
> > task sets all of its threads to have OOM_DISABLE in its own cpuset and
> > then runs out of memory. The oom killer would have killed every other
> > user task on the system even though the offending application can't
> > allocate there. That's certainly an undesired result and needs to be
> > fixed in this manner.
>
> But this explanation is irrelevant and meaningless. CPUSET can change
> restricted node dynamically. So, the tsk->mempolicy at oom time doesn't
> represent the place of task's usage memory. plus, OOM_DISABLE can
> always makes undesirable result. it's not special in this case.
>

It depends whether memory_migrate is set or not when changing a cpuset's
set of mems. The point is that we cannot penalize tasks in cpusets with a
disjoint set of mems because another cpuset is out of memory. Unless a
candidate task will definitely free memory on a node that the zonelist
allows, we should not consider it because it may needlessly kill that
task, it would be better to kill current. Otherwise, our badness()
heuristic cannot possibly determine the optimal task to kill, anyway.

David Rientjes

unread,

Feb 16, 2010, 4:00:02 AM2/16/10

to

On Tue, 16 Feb 2010, Nick Piggin wrote:

> > Two VM sysctls, oom dump_tasks and oom_kill_allocating_task, were
> > implemented for very large systems to avoid excessively long tasklist
> > scans. The former suppresses helpful diagnostic messages that are
> > emitted for each thread group leader that are candidates for oom kill
> > including their pid, uid, vm size, rss, oom_adj value, and name; this
> > information is very helpful to users in understanding why a particular
> > task was chosen for kill over others. The latter simply kills current,
> > the task triggering the oom condition, instead of iterating through the
> > tasklist looking for the worst offender.
> >
> > Both of these sysctls are combined into one for use on the aforementioned
> > large systems: oom_kill_quick. This disables the now-default
> > oom_dump_tasks and kills current whenever the oom killer is called.
> >
> > The oom killer rewrite is the perfect opportunity to combine both sysctls
> > into one instead of carrying around the others for years to come for
> > nothing else than legacy purposes.
>
> I just don't understand this either. There appears to be simply no
> performance or maintainability reason to change this.
>

When oom_dump_tasks() is always emitted for out of memory conditions as my
patch does, then these two tunables have the exact same audience: users
with large systems that have extremely long tasklists. They want to avoid
tasklist scanning (either to select a bad process to kill or dump their
information) in oom conditions and simply kill the allocating task. I
chose to combine the two: we're not concerned about breaking the
oom_dump_tasks ABI since it's now the default behavior and since we scan
the tasklist for mempolicy-constrained ooms, users may now choose to
enable oom_kill_allocating_task when they previously wouldn't have. To do
that, they can either use the old sysctl or convert to this new sysctl
with the benefit that we've removed one unnecessary sysctl from
/proc/sys/vm.

As far as I know, oom_kill_allocating_task is only used by SGI, anyway,
since they are the ones who asked for it when I implemented cpuset
tasklist scanning. It's certainly not widely used and since the semantics
for mempolicies have changed, oom_kill_quick may find more users.

David Rientjes

unread,

Feb 16, 2010, 4:10:02 AM2/16/10

to

On Tue, 16 Feb 2010, KAMEZAWA Hiroyuki wrote:

> > You don't understand that the behavior has changed ever since
> > mempolicy-constrained oom conditions are now affected by a compulsory
> > panic_on_oom mode, please see the patch description. It's absolutely
> > insane for a single sysctl mode to panic the machine anytime a cpuset or
> > mempolicy runs out of memory and is more prone to user error from setting
> > it without fully understanding the ramifications than any use it will ever
> > do. The kernel already provides a mechanism for doing this, OOM_DISABLE.
> > if you want your cpuset or mempolicy to risk panicking the machine, set
> > all tasks that share its mems or nodes, respectively, to OOM_DISABLE.
> > This is no different from the memory controller being immune to such
> > panic_on_oom conditions, stop believing that it is the only mechanism used
> > in the kernel to do memory isolation.
> >
> You don't explain why "we _have to_ remove API which is used"
>

First, I'm not stating that we _have_ to remove anything, this is a patch
proposal that is open for review.

Second, I believe we _should_ remove panic_on_oom == 2 because it's no
longer being used as it was documented: as we've increased the exposure of
the oom killer (memory controller, pagefault ooms, now mempolicy tasklist
scanning), we constantly have to re-evaluate the semantics of this option
while a well-understood tunable with a long history, OOM_DISABLE, already
does the equivalent. The downside of getting this wrong is that the
machine panics when it shouldn't have because of an unintended consequence
of the mode being enabled (a mempolicy ooms, for example, that was created
by the user). When reconsidering its semantics, I'd personally opt on the
safe side and make sure the machine doesn't panic unnecessarily and
instead require users to use OOM_DISABLE for tasks they do not want to be
oom killed.

Nick Piggin

unread,

Feb 16, 2010, 4:10:02 AM2/16/10

to

On Tue, Feb 16, 2010 at 12:49:14AM -0800, David Rientjes wrote:
> On Tue, 16 Feb 2010, Nick Piggin wrote:
>
> > Yes we do need to explain the downside of the patch. It is a
> > heuristic and we can't call either approach perfect.
> >
> > The fact is that even if 2 tasks are on completely disjoint
> > memory policies and never _allocate_ from one another's nodes,
> > you can still have one task pinning memory of the other task's
> > node.
> >
> > Most shared and userspace-pinnable resources (pagecache, vfs
> > caches and fds files sockes etc) are allocated by first-touch
> > basically.
> >
> > I don't see much usage of cpusets and oom killer first hand in
> > my experience, so I am happy to defer to others when it comes
> > to heuristics. Just so long as we are all aware of the full
> > story :)
> >
>
> Unless you can present a heuristic that will determine how much memory
> usage a given task has allocated on nodes in current's zonelist, we must
> exclude tasks from cpusets with a disjoint set of nodes, otherwise we
> cannot determine the optimal task to kill. There's a strong possibility
> that killing a task on a disjoint set of mems will never free memory for
> current, making it a needless kill. That's a much more serious
> consequence than not having the patch, in my opinion, than rather simply
> killing current.

I don't really agree with your black and white view. We equally
can't tell a lot of cases about who is pinning memory where. The
fact is that any task can be pinning memory and the heuristic
was specifically catering for that.

It's not an issue of yes/no, but of more/less probability. Anyway
I wasn't really arguing against your patch.

David Rientjes

unread,

Feb 16, 2010, 4:10:02 AM2/16/10

to

On Tue, 16 Feb 2010, KOSAKI Motohiro wrote:

> > > "_quick" is always bad sysctl name.
> >
> > Why? It does exactly what it says: it kills current without doing an
> > expensive tasklist scan and suppresses the possibly long tasklist dump.
> > That's the oom killer's "quick mode."
>
> Because, an administrator think "_quick" implies "please use it always".
> plus, "quick" doesn't describe clealy meanings. oom_dump_tasks does.
>

The audience for both of these tunables (now that oom_dump_tasks is
default to enabled) is users with extremely long tasklists that want to
avoid those scans, so oom_kill_quick implies that it won't waste any time
and will act how it's documented: simply kill current and move on.

David Rientjes

unread,

Feb 16, 2010, 4:20:02 AM2/16/10

to

On Tue, 16 Feb 2010, Nick Piggin wrote:

> I don't really agree with your black and white view. We equally
> can't tell a lot of cases about who is pinning memory where. The
> fact is that any task can be pinning memory and the heuristic
> was specifically catering for that.
>

That's a main source of criticism of the current heuristic: it needlessly
kills tasks. There is only one thing we know for certain: current is
trying to allocate memory on its nodes. We can either kill a task that
is allowed that same set or current itself; there's no evidence that
killing anything else will lead to memory freeing that will allow the
allocation to succeed. The heuristic will never perfectly select the task
that it should kill 100% of the time when playing around with mempolicy
nodes or cpuset mems, but relying on their current placement is a good
indicator of what is more likely than not to free memory of interest.

Minchan Kim

unread,

Feb 16, 2010, 8:20:02 AM2/16/10

to

On Mon, 2010-02-15 at 13:54 -0800, David Rientjes wrote:
> We're not enforcing a global, system-wide forkbomb policy in the oom
> killer, but we do need to identify tasks that fork a very large number of
> tasks to break ties with other tasks: in other words, it would not be
> helpful to kill an application that has been running for weeks because
> another application with the same or less memory usage has forked 1000
> children and has caused an oom condition. That unfairly penalizes the
> former application that is actually doing work.
>
> Again, I'd encourage you to look at this as only a slight penalization
> rather than a policy that strictly needs to be enforced. If it were
> strictly enforced, it would be a prerequisite for selection if such a task
> were to exist; in my implementation, it is part of the heuristic.

Okay. I can think it of slight penalization in this patch.
But in current OOM logic, we try to kill child instead of forkbomb
itself. My concern was that.
Of course, It's not a part of your patch[2/7] which is good.
It has been in there during long time. I hope we could solve that in
this chance. Pz, look at below my example.

>
> > > That doesn't work with Rik's example of a webserver that forks a large
> > > number of threads to handle client connections. It is _always_ better to
> > > kill a child instead of making the entire webserver unresponsive.
> >
> > In such case, admin have to handle it by oom_forkbom_thres.
> > Isn't it your goal?
> >
>
> oom_forkbomb_thres has a default value, which is 1000, so it should be
> enabled by default.
>
> > My suggestion is how handle buggy forkbomb processes which make
> > system almost hang by user's mistake. :)
> >
>
> I don't think you've given a clear description (or, even better, a patch)
> of your suggestion.

I write down my suggestion, again.
My concern is following as.

1. Forkbomb A task makes 2000 children in a second.
2. 2000 children has almost same memory usage. I know another factors
affect oom_score. but in here, I assume all of children have almost same
badness score.
3. Your heuristic penalizes A task so it would be detected as forkbomb.
4. So OOM killer select A task as bad task.
5. oom_kill_process kills high badness one of children, _NOT_ task A
itself. Unfortunately high badness child doesn't has big memory usage
compared to sibling. It means sooner or later we would need OOM again.

My point was 5.

1. oom_kill_process have to take a long time to scan tasklist for
selecting just one high badness task. Okay. It's right since OOM system
hang is much bad and it would be better to kill just first task(ie,
random one) in tasklist.

2. But in above scenario, sibling have almost same memory. So we would
need OOM again sooner or later and OOM logic could do above scenario
repeatably.

Yes. Our system is already unresponsible since time slice is spread out
many child tasks. Then in here, it would be better to kill dumb child
instead of BUGGY forkbomb task A? How long time do we have to wait
system responsible?

I said _BUGGY_ forkbomb task. That's because Rik's example isn't buggy
task. Administrator already knows apache can make many task in a second.
So he can handle it by your oom_forkbomb_thres knob. It's goal of your
knob.

So my suggestion is following as.

I assume normal forkbomb tasks are handled well by admin who use your
oom_forkbom_thres. The remained problem is just BUGGY forkbomb process.
So if your logic selects same victim task as forkbomb by your heuristic
and it's 5th time continuously in 10 second, let's kill forkbomb instead
of child.

tsk = select_victim_task(&cause);
if (tsk == last_victim_tsk && cause == BUGGY_FORKBOMB)
if (++count == 5 && time_since_first_detect_forkbomb <= 10*HZ)
kill(tsk);
else {
last_victim_tsk = NULL; count = 0; time_since... = 0;
kill(tsk's child);
}

It's just example of my concern. It might never good solution.
What I mean is just whether we have to care this.

--
Kind regards,
Minchan Kim

David Rientjes

unread,

Feb 16, 2010, 4:50:02 PM2/16/10

to

On Tue, 16 Feb 2010, Minchan Kim wrote:

> > Again, I'd encourage you to look at this as only a slight penalization
> > rather than a policy that strictly needs to be enforced. If it were
> > strictly enforced, it would be a prerequisite for selection if such a task
> > were to exist; in my implementation, it is part of the heuristic.
>
> Okay. I can think it of slight penalization in this patch.
> But in current OOM logic, we try to kill child instead of forkbomb
> itself. My concern was that.

We still do with my rewrite, that is handled in oom_kill_process(). The
forkbomb penalization takes place in badness().

> 1. Forkbomb A task makes 2000 children in a second.
> 2. 2000 children has almost same memory usage. I know another factors
> affect oom_score. but in here, I assume all of children have almost same
> badness score.
> 3. Your heuristic penalizes A task so it would be detected as forkbomb.
> 4. So OOM killer select A task as bad task.
> 5. oom_kill_process kills high badness one of children, _NOT_ task A
> itself. Unfortunately high badness child doesn't has big memory usage
> compared to sibling. It means sooner or later we would need OOM again.
>

Couple points: killing a task with a comparatively small rss and swap
usage to the parent does not imply that we need the call the oom killer
again later, killing the child will allow for future memory freeing that
may be all that is necessary. If the parent continues to fork, that will
continue to be an issue, but the constant killing of its children should
allow the user to intervene without bring the system to a grinding halt.
I'd strongly prefer to kill a child from a forkbombing task, however, than
an innocent application that has been running for days or weeks only to
find that the forkbombing parent will consume its memory as well and then
need have its children killed. Secondly, the forkbomb detection does not
simply require 2000 children to be forked in a second, it requires
oom_forkbomb_thres children that have called execve(), i.e. they have
seperate address spaces, to have a runtime of less than one second.

> My point was 5.
>
> 1. oom_kill_process have to take a long time to scan tasklist for
> selecting just one high badness task. Okay. It's right since OOM system
> hang is much bad and it would be better to kill just first task(ie,
> random one) in tasklist.
>
> 2. But in above scenario, sibling have almost same memory. So we would
> need OOM again sooner or later and OOM logic could do above scenario
> repeatably.
>

In Rik's web server example, this is the preferred outcome: kill a thread
handling a single client connection rather than kill a "legitimate"
forkbombing server to make the entire service unresponsive.

> I said _BUGGY_ forkbomb task. That's because Rik's example isn't buggy
> task. Administrator already knows apache can make many task in a second.
> So he can handle it by your oom_forkbomb_thres knob. It's goal of your
> knob.
>

We can't force all web servers to tune oom_forkbomb_thres.

> So my suggestion is following as.
>
> I assume normal forkbomb tasks are handled well by admin who use your
> oom_forkbom_thres. The remained problem is just BUGGY forkbomb process.
> So if your logic selects same victim task as forkbomb by your heuristic
> and it's 5th time continuously in 10 second, let's kill forkbomb instead
> of child.
>
> tsk = select_victim_task(&cause);
> if (tsk == last_victim_tsk && cause == BUGGY_FORKBOMB)
> if (++count == 5 && time_since_first_detect_forkbomb <= 10*HZ)
> kill(tsk);
> else {
> last_victim_tsk = NULL; count = 0; time_since... = 0;
> kill(tsk's child);
> }
>
> It's just example of my concern. It might never good solution.
> What I mean is just whether we have to care this.
>

This unfairly penalizes tasks that have a large number of execve()
children, we can't possibly know how to define BUGGY_FORKBOMB. In other
words, a system-wide forkbombing policy in the oom killer will always have
a chance of killing a legitimate task, such as a web server, that will be
an undesired result. Setting the parent to OOM_DISABLE isn't really an
option in this case since that value is inherited by children and would
need to explicitly be cleared by each thread prior to execve(); this is
one of the reasons why I proposed /proc/pid/oom_adj_child a few months
ago, but it wasn't well received.

KAMEZAWA Hiroyuki

unread,

Feb 16, 2010, 6:50:01 PM2/16/10

to

Please don't. I had a chance to talk with customer support team and talked
about panic_on_oom briefly. I understood that panic_on_oom_alyways+kdump
is the strongest tool for investigating customer's OOM situtation and do
the best advice to them. panic_on_oom_always+kdump is the 100% information
as snapshot when oom-killer happens. Then, it's easy to investigate and
explain what is wront. They sometimes discover memory leak (by some prorietary
driver) or miss-configuration of the system (as using unnecessary bounce buffer.)

Then, please leave panic_on_oom=always.
Even with mempolicy or cpuset 's OOM, we need panic_on_oom=always option.
And yes, I'll add something similar to memcg. freeze_at_oom or something.

Thanks,
-Kame

KAMEZAWA Hiroyuki

unread,

Feb 16, 2010, 7:00:03 PM2/16/10

to

On Tue, 16 Feb 2010 00:25:22 -0800 (PST)
David Rientjes <rien...@google.com> wrote:

> On Tue, 16 Feb 2010, Nick Piggin wrote:
>
> > > I'll add this check to __alloc_pages_may_oom() for the !(gfp_mask &
> > > __GFP_NOFAIL) path since we're all content with endlessly looping.
> >
> > Thanks. Yes endlessly looping is far preferable to randomly oopsing
> > or corrupting memory.
> >
>
> Here's the new patch for your consideration.
>

Then, can we take kdump in this endlessly looping situaton ?

panic_on_oom=always + kdump can do that.

Thanks,
-Kame

David Rientjes

unread,

Feb 16, 2010, 7:00:02 PM2/16/10

to

On Wed, 17 Feb 2010, KAMEZAWA Hiroyuki wrote:

> Please don't. I had a chance to talk with customer support team and talked
> about panic_on_oom briefly. I understood that panic_on_oom_alyways+kdump
> is the strongest tool for investigating customer's OOM situtation and do
> the best advice to them. panic_on_oom_always+kdump is the 100% information
> as snapshot when oom-killer happens. Then, it's easy to investigate and
> explain what is wront. They sometimes discover memory leak (by some prorietary
> driver) or miss-configuration of the system (as using unnecessary bounce buffer.)
>

Ok, I'm not looking to cause your customers unnecessary grief by removing
an option that they use, even though the same effect is possible by
setting all tasks to OOM_DISABLE. I'll remove this patch in the next
revision.

> Then, please leave panic_on_oom=always.
> Even with mempolicy or cpuset 's OOM, we need panic_on_oom=always option.
> And yes, I'll add something similar to memcg. freeze_at_oom or something.
>

Memcg isn't a special case here, it should also panic the machine if
panic_on_oom == 2, so if we aren't going to remove this option then I
agree with Nick that we need to panic from mem_cgroup_out_of_memory() as
well. Some users use cpusets, for example, for the same effect of memory
isolation as you use memcg, so panicking in one scenario and not the other
is inconsistent.

KAMEZAWA Hiroyuki

unread,

Feb 16, 2010, 7:10:02 PM2/16/10

to

On Tue, 16 Feb 2010 15:54:50 -0800 (PST)
David Rientjes <rien...@google.com> wrote:

> On Wed, 17 Feb 2010, KAMEZAWA Hiroyuki wrote:
> > Then, please leave panic_on_oom=always.
> > Even with mempolicy or cpuset 's OOM, we need panic_on_oom=always option.
> > And yes, I'll add something similar to memcg. freeze_at_oom or something.
> >
>
> Memcg isn't a special case here, it should also panic the machine if
> panic_on_oom == 2, so if we aren't going to remove this option then I
> agree with Nick that we need to panic from mem_cgroup_out_of_memory() as
> well. Some users use cpusets, for example, for the same effect of memory
> isolation as you use memcg, so panicking in one scenario and not the other
> is inconsistent.
>

Hmm, I have a few reason to add special behavior to memcg rather than panic.

- freeze_at_oom is enough.
If OOM can be notified, the management daemon can do useful jobs. Shutdown
all other cgroups or migrate them to other host and do kdump.

- memcg's oom is not very complicated.
Because we just counts RSS+FileCache

But, Hmm...I'd like to go this way.

1. At first, support panic_on_oom=2 in memcg.

2. Second, I'll add OOM-notifier and freeze_at_oom to memcg.
and don't call memcg_out_of_memory in oom_kill.c in this case. Because
we don't kill anything. Taking coredumps of all procs in memcg is not
very difficult.

I need to discuss with memcg guys. But this will be a way to go, I think

Thanks,
-Kame

David Rientjes

unread,

Feb 16, 2010, 7:10:02 PM2/16/10

to

On Wed, 17 Feb 2010, KAMEZAWA Hiroyuki wrote:

> > > > I'll add this check to __alloc_pages_may_oom() for the !(gfp_mask &
> > > > __GFP_NOFAIL) path since we're all content with endlessly looping.
> > >
> > > Thanks. Yes endlessly looping is far preferable to randomly oopsing
> > > or corrupting memory.
> > >
> >
> > Here's the new patch for your consideration.
> >
>
> Then, can we take kdump in this endlessly looping situaton ?
>
> panic_on_oom=always + kdump can do that.
>

The endless loop is only helpful if something is going to free memory
external to the current page allocation: either another task with
__GFP_WAIT | __GFP_FS that invokes the oom killer, a task that frees
memory, or a task that exits.

The most notable endless loop in the page allocator is the one when a task
has been oom killed, gets access to memory reserves, and then cannot find
a page for a __GFP_NOFAIL allocation:

do {
page = get_page_from_freelist(gfp_mask, nodemask, order,
zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
preferred_zone, migratetype);

if (!page && gfp_mask & __GFP_NOFAIL)
congestion_wait(BLK_RW_ASYNC, HZ/50);
} while (!page && (gfp_mask & __GFP_NOFAIL));

We don't expect any such allocations to happen during the exit path, but
we could probably find some in the fs layer.

I don't want to check sysctl_panic_on_oom in the page allocator because it
would start panicking the machine unnecessarily for the integrity
metadata GFP_NOIO | __GFP_NOFAIL allocation, for any
order > PAGE_ALLOC_COSTLY_ORDER, or for users who can't lock the zonelist
for oom kill that wouldn't have panicked before.

KAMEZAWA Hiroyuki

unread,

Feb 16, 2010, 7:10:01 PM2/16/10

to

Then, why don't you check higzone_idx in oom_kill.c

Thanks,
-Kame

David Rientjes

unread,

Feb 16, 2010, 7:30:01 PM2/16/10

to

out_of_memory() doesn't return a value to specify whether the page
allocator should retry the allocation or just return NULL, all that policy
is kept in mm/page_alloc.c. For highzone_idx < ZONE_NORMAL, we want to
fail the allocation when !(gfp_mask & __GFP_NOFAIL) and call the oom
killer when it's __GFP_NOFAIL.
---

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c

@@ -1696,6 +1696,9 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
/* The OOM killer will not help higher order allocs */
if (order > PAGE_ALLOC_COSTLY_ORDER)
goto out;
+ /* The OOM killer does not needlessly kill tasks for lowmem */

+ if (high_zoneidx < ZONE_NORMAL)
+ goto out;
/*

* GFP_THISNODE contains __GFP_NORETRY and we never hit this.
* Sanity check for bare calls of __GFP_THISNODE, not real OOM.
@@ -1924,15 +1927,23 @@ rebalance:
if (page)
goto got_pg;

- /*
- * The OOM killer does not trigger for high-order
- * ~__GFP_NOFAIL allocations so if no progress is being
- * made, there are no other options and retrying is
- * unlikely to help.
- */
- if (order > PAGE_ALLOC_COSTLY_ORDER &&
- !(gfp_mask & __GFP_NOFAIL))
- goto nopage;
+ if (!(gfp_mask & __GFP_NOFAIL)) {
+ /*
+ * The oom killer is not called for high-order
+ * allocations that may fail, so if no progress
+ * is being made, there are no other options and
+ * retrying is unlikely to help.
+ */
+ if (order > PAGE_ALLOC_COSTLY_ORDER)
+ goto nopage;
+ /*
+ * The oom killer is not called for lowmem
+ * allocations to prevent needlessly killing
+ * innocent tasks.
+ */
+ if (high_zoneidx < ZONE_NORMAL)
+ goto nopage;
+ }

goto restart;

David Rientjes

unread,

Feb 16, 2010, 7:40:02 PM2/16/10

to

On Wed, 17 Feb 2010, KAMEZAWA Hiroyuki wrote:

> Hmm, I have a few reason to add special behavior to memcg rather than panic.
>
> - freeze_at_oom is enough.
> If OOM can be notified, the management daemon can do useful jobs. Shutdown
> all other cgroups or migrate them to other host and do kdump.
>

The same could be said for cpusets if users use that for memory isolation.

> But, Hmm...I'd like to go this way.
>
> 1. At first, support panic_on_oom=2 in memcg.
>

This should panic in mem_cgroup_out_of_memory() and the documentation
should be added to Documentation/sysctl/vm.txt.

The memory controller also has some protection in the pagefault oom
handler that seems like it could be made more general: instead of checking
for mem_cgroup_oom_called(), I'd rather do a tasklist scan to check for
already oom killed task (checking for the TIF_MEMDIE bit) and check all
zones for ZONE_OOM_LOCKED. If no oom killed tasks are found and no zones
are locked, we can check sysctl_panic_on_oom and invoke the system-wide
oom.

> 2. Second, I'll add OOM-notifier and freeze_at_oom to memcg.
> and don't call memcg_out_of_memory in oom_kill.c in this case. Because
> we don't kill anything. Taking coredumps of all procs in memcg is not
> very difficult.
>

The oom notifier would be at a higher level than the oom killer, the oom
killer's job is simply to kill a task when it is called. So for these
particular cases, you would never even call into out_of_memory() to panic
the machine in the first place. Hopefully, the oom notifier can be made
to be more generic as its own cgroup rather than only being used by memcg,
but if such a userspace notifier would defer to the kernel oom killer, it
should panic when panic_on_oom == 2 is selected regardless of whether it
is constrained or not. Thus, we can keep the sysctl_panic_on_oom logic in
the oom killer (both in out_of_memory() and mem_cgroup_out_of_memory())
without risk of unnecessarily panic whenever an oom notifier or
freeze_at_oom setting intercepts the condition.

KAMEZAWA Hiroyuki

unread,

Feb 16, 2010, 7:50:02 PM2/16/10

to

On Tue, 16 Feb 2010 16:31:39 -0800 (PST)
David Rientjes <rien...@google.com> wrote:

> On Wed, 17 Feb 2010, KAMEZAWA Hiroyuki wrote:
>
> > Hmm, I have a few reason to add special behavior to memcg rather than panic.
> >
> > - freeze_at_oom is enough.
> > If OOM can be notified, the management daemon can do useful jobs. Shutdown
> > all other cgroups or migrate them to other host and do kdump.
> >
>
> The same could be said for cpusets if users use that for memory isolation.
>

cpuset's difficulty is that there are some methods which share the limitation.

It's not simple that we have
- cpuset
- mempolicy per task
- mempolicy per vma

Sigh..but they are for their own purpose.

> > But, Hmm...I'd like to go this way.
> >
> > 1. At first, support panic_on_oom=2 in memcg.
> >
>
> This should panic in mem_cgroup_out_of_memory() and the documentation
> should be added to Documentation/sysctl/vm.txt.
>
> The memory controller also has some protection in the pagefault oom
> handler that seems like it could be made more general: instead of checking
> for mem_cgroup_oom_called(), I'd rather do a tasklist scan to check for
> already oom killed task (checking for the TIF_MEMDIE bit) and check all
> zones for ZONE_OOM_LOCKED. If no oom killed tasks are found and no zones
> are locked, we can check sysctl_panic_on_oom and invoke the system-wide
> oom.
>

plz remove memcg's hook after doing that. Current implemantation is desgined
not to affect too much to other cgroups by doing unnecessary jobs.

> > 2. Second, I'll add OOM-notifier and freeze_at_oom to memcg.
> > and don't call memcg_out_of_memory in oom_kill.c in this case. Because
> > we don't kill anything. Taking coredumps of all procs in memcg is not
> > very difficult.
> >
>
> The oom notifier would be at a higher level than the oom killer, the oom
> killer's job is simply to kill a task when it is called.
> So for these particular cases, you would never even call into out_of_memory() to panic
> the machine in the first place.

That's my point.

Thanks,
-Kame

David Rientjes

unread,

Feb 16, 2010, 8:00:01 PM2/16/10

to

On Wed, 17 Feb 2010, KAMEZAWA Hiroyuki wrote:

> > This should panic in mem_cgroup_out_of_memory() and the documentation
> > should be added to Documentation/sysctl/vm.txt.
> >
> > The memory controller also has some protection in the pagefault oom
> > handler that seems like it could be made more general: instead of checking
> > for mem_cgroup_oom_called(), I'd rather do a tasklist scan to check for
> > already oom killed task (checking for the TIF_MEMDIE bit) and check all
> > zones for ZONE_OOM_LOCKED. If no oom killed tasks are found and no zones
> > are locked, we can check sysctl_panic_on_oom and invoke the system-wide
> > oom.
> >
> plz remove memcg's hook after doing that. Current implemantation is desgined
> not to affect too much to other cgroups by doing unnecessary jobs.
>

Ok, I'll eliminate pagefault_out_of_memory() and get it to use
out_of_memory() by only checking for constrained_alloc() when
gfp_mask != 0.

> > > 2. Second, I'll add OOM-notifier and freeze_at_oom to memcg.
> > > and don't call memcg_out_of_memory in oom_kill.c in this case. Because
> > > we don't kill anything. Taking coredumps of all procs in memcg is not
> > > very difficult.
> > >
> >
> > The oom notifier would be at a higher level than the oom killer, the oom
> > killer's job is simply to kill a task when it is called.
> > So for these particular cases, you would never even call into out_of_memory() to panic
> > the machine in the first place.
>
> That's my point.
>

Great, are you planning on implementing a cgroup that is based on roughly
on the /dev/mem_notify patchset so userspace can poll() a file and be
notified of oom events? It would help beyond just memcg, it has an
application to cpusets (adding more mems on large systems) as well. It
can also be used purely to preempt the kernel oom killer and move all the
policy to userspace even though it would be sacrificing TIF_MEMDIE.