[PATCH 0/4] coredump: core dump masking support v3

Kawai, Hidehiro

unread,

Feb 16, 2007, 8:40:17 AM2/16/07

to

Hi,

This patch series is version 3 of the core dump masking feature,
which provides a per-process flag not to dump anonymous shared
memory segments.

In this version, /proc/<pid>/coredump_omit_anonymous_shared file
is provided as an interface instead of the previous
/proc/<pid>/core_flags. If you have written a non-zero value to the
file, anonymous shared memory segments of the process not to be
dumped.

Another change of this version is removal of kernel.core_flags_enables
sysctl which enables/disables core dump flags globally. The purpose
of this sysctl is for the system administrator to force all anonymous
memory to be dumped. But ulimit -c 0 breaks it. So this sysctl is
not helpful for the current linux.

This patch series can be applied against 2.6.20-mm1.
The supported core file formats are ELF and ELF-FDPIC. ELF has been
tested, but ELF-FDPIC has not been build and tested because I don't
have the test environment.

Background:
Some software programs share huge memory among hundreds of
processes. If a failure occurs on one of these processes, they can
be signaled by a monitoring process to generate core files and
restart the service. However, it can develop into a system-wide
failure such as system slow down for a long time and disk space
shortage because the total size of the core files is very huge!

To avoid the above situation we can limit the core file size by
setrlimit(2) or ulimit(1). But this method can lose important data
such as stack because core dumping is terminated halfway.
So I suggest keeping shared memory segments from being dumped for
particular processes. Because the shared memory attached to processes
is common in them, we don't need to dump the shared memory every time.

Usage:
Get all shared memory segments of pid 1234 not to dump:

$ echo 1 > /proc/1234/coredump_omit_anonymous_shared

When a new process is created, the process inherits the flag status
from its parent. It is useful to set the core dump flags before the
program runs. For example:

$ echo 1 > /proc/self/coredump_omit_anonymous_shared
$ ./some_program

ChangeLog:
v3:
- remove `/proc/<pid>/core_flags' proc entry
- add `/proc/<pid>/coredump_anonymous_shared' as a named flag
- remove kernel.core_flags_enable sysctl parameter

v2:
http://groups.google.com/group/linux.kernel/browse_frm/thread/cb254465971d4a42/
http://groups.google.com/group/linux.kernel/browse_frm/thread/da78f2702e06fa11/
- rename `coremask' to `core_flags'
- change `core_flags' member in mm_struct to a bit field
next to `dumpable'
- introduce a global spin lock to protect adjacent two bit fields
(core_flags and dumpable) from race condition
- fix a bug that the generated core file can be corrupted when
core dumping and updating core_flags occur concurrently
- add kernel.core_flags_enable sysctl parameter to enable/disable
flags in /proc/<pid>/core_flags
- support ELF-FDPIC binary format, but not tested

v1:
http://groups.google.com/group/linux.kernel/browse_frm/thread/1381fc54d716e3e6/

--
Hidehiro Kawai
Hitachi, Ltd., Systems Development Laboratory
E-mail: hidehiro...@hitachi.com

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Kawai, Hidehiro

unread,

Feb 16, 2007, 8:50:21 AM2/16/07

to

This patch adds an interface to set/reset a flag which determines
anonymous shared memory segments should be dumped or not when a core
file is generated.

/proc/<pid>/coredump_omit_anonymous_shared file is provided to access
the flag. You can change the flag status for a particular process by
writing to or reading from the file.

The flag status is inherited to the child process when it is created.

The flag is stored into coredump_omit_anon_shared member of mm_struct,
which shares bytes with dumpable member because these two are adjacent
bit fields. In order to avoid write-write race between the two, we use
a global spin lock.

smp_wmb() at updating dumpable is removed because set_dumpable()
includes a pair of spin lock and unlock which has the effect of
memory barrier.

Index: linux-2.6.20-mm1/fs/proc/base.c
===================================================================
--- linux-2.6.20-mm1.orig/fs/proc/base.c
+++ linux-2.6.20-mm1/fs/proc/base.c
@@ -74,6 +74,7 @@
#include <linux/poll.h>
#include <linux/nsproxy.h>
#include <linux/oom.h>
+#include <linux/elf.h>
#include "internal.h"

/* NOTE:
@@ -1753,6 +1754,100 @@ static const struct inode_operations pro

#endif

+#if defined(USE_ELF_CORE_DUMP) && defined(CONFIG_ELF_CORE)
+static ssize_t proc_coredump_omit_anon_shared_read(struct file *file,
+ char __user *buf,
+ size_t count,
+ loff_t *ppos)
+{
+ struct task_struct *task = get_proc_task(file->f_dentry->d_inode);
+ struct mm_struct *mm;
+ char buffer[PROC_NUMBUF];
+ size_t len;
+ loff_t __ppos = *ppos;
+ int ret;
+
+ ret = -ESRCH;
+ if (!task)
+ goto out_no_task;
+
+ ret = 0;
+ mm = get_task_mm(task);
+ if (!mm)
+ goto out_no_mm;
+
+ len = snprintf(buffer, sizeof(buffer), "%u\n",
+ mm->coredump_omit_anon_shared);
+ if (__ppos >= len)
+ goto out;
+ if (count > len - __ppos)
+ count = len - __ppos;
+
+ ret = -EFAULT;
+ if (copy_to_user(buf, buffer + __ppos, count))
+ goto out;
+
+ ret = count;
+ *ppos = __ppos + count;
+
+ out:
+ mmput(mm);
+ out_no_mm:
+ put_task_struct(task);
+ out_no_task:
+ return ret;
+}
+
+static ssize_t proc_coredump_omit_anon_shared_write(struct file *file,
+ const char __user *buf,
+ size_t count,
+ loff_t *ppos)
+{
+ struct task_struct *task;
+ struct mm_struct *mm;
+ char buffer[PROC_NUMBUF], *end;
+ unsigned int val;
+ int ret;
+
+ ret = -EFAULT;
+ memset(buffer, 0, sizeof(buffer));
+ if (count > sizeof(buffer) - 1)
+ count = sizeof(buffer) - 1;
+ if (copy_from_user(buffer, buf, count))
+ goto out_no_task;
+
+ ret = -EINVAL;
+ val = (unsigned int)simple_strtoul(buffer, &end, 0);
+ if (*end == '\n')
+ end++;
+ if (end - buffer == 0)
+ goto out_no_task;
+
+ ret = -ESRCH;
+ task = get_proc_task(file->f_dentry->d_inode);
+ if (!task)
+ goto out_no_task;
+
+ ret = end - buffer;
+ mm = get_task_mm(task);
+ if (!mm)
+ goto out_no_mm;
+
+ set_coredump_omit_anon_shared(mm, (val != 0));
+
+ mmput(mm);
+ out_no_mm:
+ put_task_struct(task);
+ out_no_task:
+ return ret;
+}
+
+static struct file_operations proc_coredump_omit_anon_shared_operations = {
+ .read = proc_coredump_omit_anon_shared_read,
+ .write = proc_coredump_omit_anon_shared_write,
+};
+#endif
+
/*
* /proc/self:
*/
@@ -1972,6 +2067,10 @@ static struct pid_entry tgid_base_stuff[
#ifdef CONFIG_FAULT_INJECTION
REG("make-it-fail", S_IRUGO|S_IWUSR, fault_inject),
#endif
+#if defined(USE_ELF_CORE_DUMP) && defined(CONFIG_ELF_CORE)
+ REG("coredump_omit_anonymous_shared", S_IRUGO|S_IWUSR,
+ coredump_omit_anon_shared),
+#endif
#ifdef CONFIG_TASK_IO_ACCOUNTING
INF("io", S_IRUGO, pid_io_accounting),
#endif
Index: linux-2.6.20-mm1/include/linux/sched.h
===================================================================
--- linux-2.6.20-mm1.orig/include/linux/sched.h
+++ linux-2.6.20-mm1/include/linux/sched.h
@@ -366,7 +366,13 @@ struct mm_struct {
unsigned int token_priority;
unsigned int last_interval;

+ /*
+ * Writing to these bit fields can cause race condition. To avoid
+ * the race, use dump_bits_lock. You may also use set_dumpable() or
+ * set_coredump_*() macros to set a value.
+ */
unsigned char dumpable:2;
+ unsigned char coredump_omit_anon_shared:1; /* don't dump anon shm */

/* coredumping support */
int core_waiters;
@@ -1721,6 +1727,33 @@ static inline void inc_syscw(struct task
}
#endif

+#include <linux/elf.h>
+/*
+ * These macros are used to protect dumpable and coredump_omit_anon_shared bit
+ * fields in mm_struct from write race between the two.
+ */
+extern spinlock_t dump_bits_lock;
+#define __set_dump_bits(dest, val) \
+ do { \
+ spin_lock(&dump_bits_lock); \
+ (dest) = (val); \
+ spin_unlock(&dump_bits_lock); \
+ } while (0)
+
+#if defined(USE_ELF_CORE_DUMP) && defined(CONFIG_ELF_CORE)
+# define set_dumpable(mm, val) \
+ __set_dump_bits((mm)->dumpable, val)
+# define set_coredump_omit_anon_shared(mm, val) \
+ __set_dump_bits((mm)->coredump_omit_anon_shared, val)
+#else
+# define set_dumpable(mm, val) \
+ do { \
+ (mm)->dumpable = (val); \
+ smp_wmb(); \
+ } while (0)
+# define set_coredump_omit_anon_shared(mm, val)
+#endif
+
#endif /* __KERNEL__ */

#endif
Index: linux-2.6.20-mm1/fs/exec.c
===================================================================
--- linux-2.6.20-mm1.orig/fs/exec.c
+++ linux-2.6.20-mm1/fs/exec.c
@@ -62,6 +62,8 @@ int core_uses_pid;
char core_pattern[128] = "core";
int suid_dumpable = 0;

+DEFINE_SPINLOCK(dump_bits_lock);
+
EXPORT_SYMBOL(suid_dumpable);
/* The maximal length of core_pattern is also specified in sysctl.c */

@@ -853,9 +855,9 @@ int flush_old_exec(struct linux_binprm *
current->sas_ss_sp = current->sas_ss_size = 0;

if (current->euid == current->uid && current->egid == current->gid)
- current->mm->dumpable = 1;
+ set_dumpable(current->mm, 1);
else
- current->mm->dumpable = suid_dumpable;
+ set_dumpable(current->mm, suid_dumpable);

name = bprm->filename;

@@ -883,7 +885,7 @@ int flush_old_exec(struct linux_binprm *
file_permission(bprm->file, MAY_READ) ||
(bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP)) {
suid_keys(current);
- current->mm->dumpable = suid_dumpable;
+ set_dumpable(current->mm, suid_dumpable);
}

/* An exec changes our domain. We are no longer part of the thread
@@ -1477,7 +1479,7 @@ int do_coredump(long signr, int exit_cod
flag = O_EXCL; /* Stop rewrite attacks */
current->fsuid = 0; /* Dump root private */
}
- mm->dumpable = 0;
+ set_dumpable(mm, 0);

retval = coredump_wait(exit_code);
if (retval < 0)
Index: linux-2.6.20-mm1/kernel/fork.c
===================================================================
--- linux-2.6.20-mm1.orig/kernel/fork.c
+++ linux-2.6.20-mm1/kernel/fork.c
@@ -333,6 +333,9 @@ static struct mm_struct * mm_init(struct
atomic_set(&mm->mm_count, 1);
init_rwsem(&mm->mmap_sem);
INIT_LIST_HEAD(&mm->mmlist);
+ /* don't need to use set_coredump_omit_anon_shared() */
+ mm->coredump_omit_anon_shared =
+ (current->mm) ? current->mm->coredump_omit_anon_shared : 0;
mm->core_waiters = 0;
mm->nr_ptes = 0;
set_mm_counter(mm, file_rss, 0);
Index: linux-2.6.20-mm1/kernel/sys.c
===================================================================
--- linux-2.6.20-mm1.orig/kernel/sys.c
+++ linux-2.6.20-mm1/kernel/sys.c
@@ -1022,10 +1022,8 @@ asmlinkage long sys_setregid(gid_t rgid,
else
return -EPERM;
}
- if (new_egid != old_egid) {
- current->mm->dumpable = suid_dumpable;
- smp_wmb();
- }
+ if (new_egid != old_egid)
+ set_dumpable(current->mm, suid_dumpable);
if (rgid != (gid_t) -1 ||
(egid != (gid_t) -1 && egid != old_rgid))
current->sgid = new_egid;
@@ -1052,16 +1050,12 @@ asmlinkage long sys_setgid(gid_t gid)
return retval;

if (capable(CAP_SETGID)) {
- if (old_egid != gid) {
- current->mm->dumpable = suid_dumpable;
- smp_wmb();
- }
+ if (old_egid != gid)
+ set_dumpable(current->mm, suid_dumpable);
current->gid = current->egid = current->sgid = current->fsgid = gid;
} else if ((gid == current->gid) || (gid == current->sgid)) {
- if (old_egid != gid) {
- current->mm->dumpable = suid_dumpable;
- smp_wmb();
- }
+ if (old_egid != gid)
+ set_dumpable(current->mm, suid_dumpable);
current->egid = current->fsgid = gid;
}
else
@@ -1089,10 +1083,8 @@ static int set_user(uid_t new_ruid, int

switch_uid(new_user);

- if (dumpclear) {
- current->mm->dumpable = suid_dumpable;
- smp_wmb();
- }
+ if (dumpclear)
+ set_dumpable(current->mm, suid_dumpable);
current->uid = new_ruid;
return 0;
}
@@ -1145,10 +1137,8 @@ asmlinkage long sys_setreuid(uid_t ruid,
if (new_ruid != old_ruid && set_user(new_ruid, new_euid != old_euid) < 0)
return -EAGAIN;

- if (new_euid != old_euid) {
- current->mm->dumpable = suid_dumpable;
- smp_wmb();
- }
+ if (new_euid != old_euid)
+ set_dumpable(current->mm, suid_dumpable);
current->fsuid = current->euid = new_euid;
if (ruid != (uid_t) -1 ||
(euid != (uid_t) -1 && euid != old_ruid))
@@ -1195,10 +1185,8 @@ asmlinkage long sys_setuid(uid_t uid)
} else if ((uid != current->uid) && (uid != new_suid))
return -EPERM;

- if (old_euid != uid) {
- current->mm->dumpable = suid_dumpable;
- smp_wmb();
- }
+ if (old_euid != uid)
+ set_dumpable(current->mm, suid_dumpable);
current->fsuid = current->euid = uid;
current->suid = new_suid;

@@ -1240,10 +1228,8 @@ asmlinkage long sys_setresuid(uid_t ruid
return -EAGAIN;
}
if (euid != (uid_t) -1) {
- if (euid != current->euid) {
- current->mm->dumpable = suid_dumpable;
- smp_wmb();
- }
+ if (euid != current->euid)
+ set_dumpable(current->mm, suid_dumpable);
current->euid = euid;
}
current->fsuid = current->euid;
@@ -1290,10 +1276,8 @@ asmlinkage long sys_setresgid(gid_t rgid
return -EPERM;
}
if (egid != (gid_t) -1) {
- if (egid != current->egid) {
- current->mm->dumpable = suid_dumpable;
- smp_wmb();
- }
+ if (egid != current->egid)
+ set_dumpable(current->mm, suid_dumpable);
current->egid = egid;
}
current->fsgid = current->egid;
@@ -1336,10 +1320,8 @@ asmlinkage long sys_setfsuid(uid_t uid)
if (uid == current->uid || uid == current->euid ||
uid == current->suid || uid == current->fsuid ||
capable(CAP_SETUID)) {
- if (uid != old_fsuid) {
- current->mm->dumpable = suid_dumpable;
- smp_wmb();
- }
+ if (uid != old_fsuid)
+ set_dumpable(current->mm, suid_dumpable);
current->fsuid = uid;
}

@@ -1365,10 +1347,8 @@ asmlinkage long sys_setfsgid(gid_t gid)
if (gid == current->gid || gid == current->egid ||
gid == current->sgid || gid == current->fsgid ||
capable(CAP_SETGID)) {
- if (gid != old_fsgid) {
- current->mm->dumpable = suid_dumpable;
- smp_wmb();
- }
+ if (gid != old_fsgid)
+ set_dumpable(current->mm, suid_dumpable);
current->fsgid = gid;
key_fsgid_changed(current);
proc_id_connector(current, PROC_EVENT_GID);
@@ -2163,7 +2143,7 @@ asmlinkage long sys_prctl(int option, un
error = -EINVAL;
break;
}
- current->mm->dumpable = arg2;
+ set_dumpable(current->mm, arg2);
break;

case PR_SET_UNALIGN:
Index: linux-2.6.20-mm1/security/commoncap.c
===================================================================
--- linux-2.6.20-mm1.orig/security/commoncap.c
+++ linux-2.6.20-mm1/security/commoncap.c
@@ -244,7 +244,7 @@ void cap_bprm_apply_creds (struct linux_

if (bprm->e_uid != current->uid || bprm->e_gid != current->gid ||
!cap_issubset (new_permitted, current->cap_permitted)) {
- current->mm->dumpable = suid_dumpable;
+ set_dumpable(current->mm, suid_dumpable);

if (unsafe & ~LSM_UNSAFE_PTRACE_CAP) {
if (!capable(CAP_SETUID)) {
Index: linux-2.6.20-mm1/security/dummy.c
===================================================================
--- linux-2.6.20-mm1.orig/security/dummy.c
+++ linux-2.6.20-mm1/security/dummy.c
@@ -130,7 +130,7 @@ static void dummy_bprm_free_security (st
static void dummy_bprm_apply_creds (struct linux_binprm *bprm, int unsafe)
{
if (bprm->e_uid != current->uid || bprm->e_gid != current->gid) {
- current->mm->dumpable = suid_dumpable;
+ set_dumpable(current->mm, suid_dumpable);

if ((unsafe & ~LSM_UNSAFE_PTRACE_CAP) && !capable(CAP_SETUID)) {
bprm->e_uid = current->uid;

Kawai, Hidehiro

unread,

Feb 16, 2007, 8:50:31 AM2/16/07

to

This patch enables to omit anonymous shared memory from an ELF
formatted core file when it is generated.

Signed-off-by: Hidehiro Kawai <hidehiro...@hitachi.com>
---

fs/binfmt_elf.c | 20 ++++++++++++++------
1 files changed, 14 insertions(+), 6 deletions(-)

Index: linux-2.6.20-mm1/fs/binfmt_elf.c
===================================================================
--- linux-2.6.20-mm1.orig/fs/binfmt_elf.c
+++ linux-2.6.20-mm1/fs/binfmt_elf.c
@@ -1181,7 +1181,7 @@ static int dump_seek(struct file *file,
*
* I think we should skip something. But I am not sure how. H.J.
*/
-static int maydump(struct vm_area_struct *vma)
+static int maydump(struct vm_area_struct *vma, unsigned int omit_anon_shared)
{
/* The vma can be set up to tell us the answer directly. */
if (vma->vm_flags & VM_ALWAYSDUMP)
@@ -1191,9 +1191,15 @@ static int maydump(struct vm_area_struct
if (vma->vm_flags & (VM_IO | VM_RESERVED))
return 0;

- /* Dump shared memory only if mapped from an anonymous file. */
- if (vma->vm_flags & VM_SHARED)
- return vma->vm_file->f_path.dentry->d_inode->i_nlink == 0;
+ /*
+ * Dump shared memory only if mapped from an anonymous file and
+ * /proc/<pid>/coredump_omit_anonymous_shared flag is not set.
+ */
+ if (vma->vm_flags & VM_SHARED) {
+ if (vma->vm_file->f_path.dentry->d_inode->i_nlink)
+ return 0;
+ return omit_anon_shared == 0;
+ }

/* If it hasn't been written to, don't write it out */
if (!vma->anon_vma)
@@ -1491,6 +1497,7 @@ static int elf_core_dump(long signr, str
#endif
int thread_status_size = 0;
elf_addr_t *auxv;
+ unsigned int omit_anon_shared = 0;

/*
* We no longer stop all VM operations.
@@ -1629,6 +1636,7 @@ static int elf_core_dump(long signr, str
}

dataoff = offset = roundup(offset, ELF_EXEC_PAGESIZE);
+ omit_anon_shared = current->mm->coredump_omit_anon_shared;

/* Write program headers for segments dump */
for (vma = first_vma(current, gate_vma); vma != NULL;
@@ -1642,7 +1650,7 @@ static int elf_core_dump(long signr, str
phdr.p_offset = offset;
phdr.p_vaddr = vma->vm_start;
phdr.p_paddr = 0;
- phdr.p_filesz = maydump(vma) ? sz : 0;
+ phdr.p_filesz = maydump(vma, omit_anon_shared) ? sz : 0;
phdr.p_memsz = sz;
offset += phdr.p_filesz;
phdr.p_flags = vma->vm_flags & VM_READ ? PF_R : 0;
@@ -1685,7 +1693,7 @@ static int elf_core_dump(long signr, str
vma = next_vma(vma, gate_vma)) {
unsigned long addr;

- if (!maydump(vma))
+ if (!maydump(vma, omit_anon_shared))
continue;

for (addr = vma->vm_start;

Kawai, Hidehiro

unread,

Feb 16, 2007, 8:50:34 AM2/16/07

to

This patch enables to omit anonymous shared memory from an ELF-FDPIC

formatted core file when it is generated.

The debug messages from maydump() in fs/binfmt_elf_fdpic.c are changed
appropriately so that we can know what kind of memory segments are
dumped or not.

Signed-off-by: Hidehiro Kawai <hidehiro...@hitachi.com>
---

fs/binfmt_elf_fdpic.c | 37 +++++++++++++++++++++++++------------
1 files changed, 25 insertions(+), 12 deletions(-)

Index: linux-2.6.20-mm1/fs/binfmt_elf_fdpic.c
===================================================================
--- linux-2.6.20-mm1.orig/fs/binfmt_elf_fdpic.c
+++ linux-2.6.20-mm1/fs/binfmt_elf_fdpic.c
@@ -1168,7 +1168,7 @@ static int dump_seek(struct file *file,

*
* I think we should skip something. But I am not sure how. H.J.
*/
-static int maydump(struct vm_area_struct *vma)
+static int maydump(struct vm_area_struct *vma, unsigned int omit_anon_shared)
{

/* Do not dump I/O mapped devices or special mappings */
if (vma->vm_flags & (VM_IO | VM_RESERVED)) {
@@ -1184,15 +1184,22 @@ static int maydump(struct vm_area_struct

return 0;
}

- /* Dump shared memory only if mapped from an anonymous file. */

+ /*
+ * Dump shared memory only if mapped from an anonymous file and
+ * /proc/<pid>/coredump_omit_anonymous_shared flag is not set.
+ */

if (vma->vm_flags & VM_SHARED) {

- if (vma->vm_file->f_path.dentry->d_inode->i_nlink == 0) {
+ if (vma->vm_file->f_path.dentry->d_inode->i_nlink) {
kdcore("%08lx: %08lx: no (share)", vma->vm_start, vma->vm_flags);
+ return 0;
+ }
+ if (omit_anon_shared) {
+ kdcore("%08lx: %08lx: no (anon-share)", vma->vm_start, vma->vm_flags);
+ return 0;
+ } else {
+ kdcore("%08lx: %08lx: yes (anon-share)", vma->vm_start, vma->vm_flags);
return 1;
}
-
- kdcore("%08lx: %08lx: no (share)", vma->vm_start, vma->vm_flags);
- return 0;
}

#ifdef CONFIG_MMU
@@ -1444,14 +1451,15 @@ static int elf_dump_thread_status(long s
*/
#ifdef CONFIG_MMU
static int elf_fdpic_dump_segments(struct file *file, struct mm_struct *mm,
- size_t *size, unsigned long *limit)
+ size_t *size, unsigned long *limit,
+ unsigned int omit_anon_shared)
{
struct vm_area_struct *vma;

for (vma = current->mm->mmap; vma; vma = vma->vm_next) {

unsigned long addr;

- if (!maydump(vma))
+ if (!maydump(vma, omit_anon_shared))
continue;

for (addr = vma->vm_start;

@@ -1499,14 +1507,15 @@ end_coredump:
*/
#ifndef CONFIG_MMU
static int elf_fdpic_dump_segments(struct file *file, struct mm_struct *mm,
- size_t *size, unsigned long *limit)
+ size_t *size, unsigned long *limit,
+ unsigned int omit_anon_shared)
{
struct vm_list_struct *vml;

for (vml = current->mm->context.vmlist; vml; vml = vml->next) {
struct vm_area_struct *vma = vml->vma;

- if (!maydump(vma))
+ if (!maydump(vma, omit_anon_shared))
continue;

if ((*size += PAGE_SIZE) > *limit)
@@ -1557,6 +1566,7 @@ static int elf_fdpic_core_dump(long sign
struct vm_list_struct *vml;
#endif

elf_addr_t *auxv;
+ unsigned int omit_anon_shared = 0;

/*
* We no longer stop all VM operations.

@@ -1694,6 +1704,8 @@ static int elf_fdpic_core_dump(long sign
/* Page-align dumped data */

dataoff = offset = roundup(offset, ELF_EXEC_PAGESIZE);

+ omit_anon_shared = current->mm->coredump_omit_anon_shared;

+
/* write program headers for segments dump */
for (
#ifdef CONFIG_MMU
@@ -1715,7 +1727,7 @@ static int elf_fdpic_core_dump(long sign

phdr.p_offset = offset;
phdr.p_vaddr = vma->vm_start;
phdr.p_paddr = 0;
- phdr.p_filesz = maydump(vma) ? sz : 0;
+ phdr.p_filesz = maydump(vma, omit_anon_shared) ? sz : 0;
phdr.p_memsz = sz;
offset += phdr.p_filesz;
phdr.p_flags = vma->vm_flags & VM_READ ? PF_R : 0;

@@ -1749,7 +1761,8 @@ static int elf_fdpic_core_dump(long sign

DUMP_SEEK(dataoff);

- if (elf_fdpic_dump_segments(file, current->mm, &size, &limit) < 0)
+ if (elf_fdpic_dump_segments(file, current->mm, &size, &limit,
+ omit_anon_shared) < 0)
goto end_coredump;

#ifdef ELF_CORE_WRITE_EXTRA_DATA

Kawai, Hidehiro

unread,

Feb 16, 2007, 8:50:35 AM2/16/07

to

This patch adds the documentation for
/proc/<pid>/coredump_omit_anonymous_shared.

Signed-off-by: Hidehiro Kawai <hidehiro...@hitachi.com>
---

Documentation/filesystems/proc.txt | 38 +++++++++++++++++++++++++++
1 files changed, 38 insertions(+)

Index: linux-2.6.20-mm1/Documentation/filesystems/proc.txt
===================================================================
--- linux-2.6.20-mm1.orig/Documentation/filesystems/proc.txt
+++ linux-2.6.20-mm1/Documentation/filesystems/proc.txt
@@ -41,6 +41,7 @@ Table of Contents
2.11 /proc/sys/fs/mqueue - POSIX message queues filesystem
2.12 /proc/<pid>/oom_adj - Adjust the oom-killer score
2.13 /proc/<pid>/oom_score - Display current oom-killer score
+ 2.14 /proc/<pid>/coredump_omit_anonymous_shared - Core dump coordinator

------------------------------------------------------------------------------
Preface
@@ -1982,6 +1983,43 @@ This file can be used to check the curre
any given <pid>. Use it together with /proc/<pid>/oom_adj to tune which
process should be killed in an out-of-memory situation.

+2.14 /proc/<pid>/coredump_omit_anonymous_shared - Core dump coordinator
+---------------------------------------------------------------------
+When a process is dumped, all anonymous memory is written to a core file as
+long as the size of the core file isn't limited. But sometimes we don't want
+to dump some memory segments, for example, huge shared memory.
+
+The /proc/<pid>/coredump_omit_anonymous_shared is a flag which enables you to
+omit anonymous shared memory segments from a core file when it is generated.
+When the <pid> process is dumped, the core dump routine decides whether a
+given memory segment should be dumped into a core file or not based on the
+type of the memory segment and the flag.
+
+If you have written a non-zero value to this proc file, anonymous shared
+memory segments are not dumped. There are three types of anonymous shared
+memory:
+
+ - IPC shared memory
+ - the memory segments created by mmap(2) with MAP_ANONYMOUS and MAP_SHARED
+ flags
+ - the memory segments created by mmap(2) with MAP_SHARED flag, and the
+ mapped file has already been unlinked
+
+Because current core dump routine doesn't distinguish these segments, you can
+only choose either dumping all anonymous shared memory segments or not.
+
+If you don't want to dump all shared memory segments attached to pid 1234,
+write 0 to the process's proc file.
+
+ $ echo 1 > /proc/1234/coredump_omit_anonymous_shared
+
+When a new process is created, the process inherits the flag status from its
+parent. It is useful to set the flag before the program runs.
+For example:
+
+ $ echo 1 > /proc/self/coredump_omit_anonymous_shared
+ $ ./some_program
+
------------------------------------------------------------------------------
Summary
------------------------------------------------------------------------------

David Howells

unread,

Feb 16, 2007, 10:10:21 AM2/16/07

to

Kawai, Hidehiro <hidehiro...@hitachi.com> wrote:

> To avoid the above situation we can limit the core file size by
> setrlimit(2) or ulimit(1). But this method can lose important data
> such as stack because core dumping is terminated halfway.
> So I suggest keeping shared memory segments from being dumped for
> particular processes.

A better way might be to place the shared memory segments last if that's
possible (I'm not sure ELF supports out-of-order segments).

> Because the shared memory attached to processes is common in them, we don't
> need to dump the shared memory every time.

So there's no guarantee that they'll be dumped at all... I'm not sure there's
any way around that, however.

David

David Howells

unread,

Feb 16, 2007, 10:10:26 AM2/16/07

to

Kawai, Hidehiro <hidehiro...@hitachi.com> wrote:

> static int elf_fdpic_dump_segments(struct file *file, struct mm_struct *mm,
> - size_t *size, unsigned long *limit)
> + size_t *size, unsigned long *limit,
> + unsigned int omit_anon_shared)

Why are you passing it as an extra argument when you could just use
mm->coredump_omit_anon_shared here:

> + if (!maydump(vma, omit_anon_shared))

And here:

> + phdr.p_filesz = maydump(vma, omit_anon_shared) ? sz : 0;

Actually, I think I would just pass the mm pointer you have into maydump() and
let that dereference it here:

> + if (omit_anon_shared) {

which would then be:

if (mm->coredump_omit_anon_shared) {

Then the calls to maydump() would be:

if (!maydump(vma, mm))

and:

phdr.p_filesz = maydump(vma, current->mm) ? sz : 0;

I wouldn't worry, were I you, about the setting changing whilst it's being
used. If you are worried about that, you can probably do some locking against
that.

David

Robin Holt

unread,

Feb 16, 2007, 12:20:51 PM2/16/07

to

On Fri, Feb 16, 2007 at 03:05:35PM +0000, David Howells wrote:
> Actually, I think I would just pass the mm pointer you have into maydump() and
> let that dereference it here:
>
> > + if (omit_anon_shared) {
>
> which would then be:
>
> if (mm->coredump_omit_anon_shared) {

How about:
if (vma->vm_mm->coredump_omit_anon_shared) {

Then the calls to maydump() would be unchanged:

David Howells

unread,

Feb 16, 2007, 3:20:07 PM2/16/07

to

Robin Holt <ho...@sgi.com> wrote:

> How about:
> if (vma->vm_mm->coredump_omit_anon_shared) {
>
> Then the calls to maydump() would be unchanged:

VMAs are a shared resource under NOMMU conditions.

David

Kawai, Hidehiro

unread,

Feb 20, 2007, 4:50:16 AM2/20/07

to

Hi,

David Howells wrote:

> Kawai, Hidehiro <hidehiro...@hitachi.com> wrote:
>
>>To avoid the above situation we can limit the core file size by
>>setrlimit(2) or ulimit(1). But this method can lose important data
>>such as stack because core dumping is terminated halfway.
>>So I suggest keeping shared memory segments from being dumped for
>>particular processes.
>
> A better way might be to place the shared memory segments last if that's
> possible (I'm not sure ELF supports out-of-order segments).

Placing the shared memory segments last and limiting by ulimit -c
is one of the solutions. But there is no guarantee that the memory
segments other than anonymous shared memory are always dumped.
So your idea cannot alternate my suggesting feature.
But if your idea has a merit which my idea doesn't have, I try to
consider coexistence of both idea.

>>Because the shared memory attached to processes is common in them, we don't
>>need to dump the shared memory every time.
>
> So there's no guarantee that they'll be dumped at all... I'm not sure there's
> any way around that, however.

Indeed. However, some people don't want to dump anonymous shared memory
at all. Taking into account this requirement, I don't guarantee that.
But this feature allows per-process setting. So if you want to dump
the shared memory at least once, you can manage to do that in userland.

Thanks,

--
Hidehiro Kawai
Hitachi, Ltd., Systems Development Laboratory

-

Kawai, Hidehiro

unread,

Feb 20, 2007, 4:50:30 AM2/20/07

to

Hi,

Thank you for your comments.

David Howells wrote:

>> static int elf_fdpic_dump_segments(struct file *file, struct mm_struct *mm,
>>- size_t *size, unsigned long *limit)
>>+ size_t *size, unsigned long *limit,
>>+ unsigned int omit_anon_shared)
>
> Why are you passing it as an extra argument when you could just use
> mm->coredump_omit_anon_shared here:
>
>>+ if (!maydump(vma, omit_anon_shared))

> I wouldn't worry, were I you, about the setting changing whilst it's being

> used. If you are worried about that, you can probably do some locking against
> that.

Core dumping is separated two phases, one is the phase of writing
headers, the other is the phase of writing memory segments. If the
coredump_omit_anon_shared setting is changed between these two phases,
a corrupted core file will be generated because the offsets written
in headers don't match their bodies. So we need to use the same
setting in both phases.

I think that locking makes codes complex and generates overhead.
So I wouldn't like to use lock as far as possible. I think passing
the flag as an extra argument is the simplest implementation to
avoid the core file corruption.

Thanks,

--
Hidehiro Kawai
Hitachi, Ltd., Systems Development Laboratory

-

David Howells

unread,

Feb 20, 2007, 6:00:56 AM2/20/07

to

Kawai, Hidehiro <hidehiro...@hitachi.com> wrote:

> Core dumping is separated two phases, one is the phase of writing
> headers, the other is the phase of writing memory segments. If the
> coredump_omit_anon_shared setting is changed between these two phases,
> a corrupted core file will be generated because the offsets written
> in headers don't match their bodies. So we need to use the same
> setting in both phases.

Hmmm... Okay.

> I think that locking makes codes complex and generates overhead.
> So I wouldn't like to use lock as far as possible. I think passing
> the flag as an extra argument is the simplest implementation to
> avoid the core file corruption.

Actually, I don't think the locking is that hard or that complex.

int do_coredump(long signr, int exit_code, struct pt_regs * regs)
{
<setup vars>

down_read(&coredump_settings_sem);

...

fail:
up_read(&coredump_settings_sem);
return retval;
}

And:

static ssize_t proc_coredump_omit_anon_shared_write(struct file *file,
const char __user *buf,
size_t count,
loff_t *ppos)
{
<setup vars>

down_write(&coredump_settings_sem);

...

out_no_task:
up_write(&coredump_settings_sem);
return ret;
}

The same could be applied to all controls that change the coredumping
variables, in particular the sysctl for core_pattern could be wrapped so as to
remove one of the reliances on lock_kernel() and the lock_kernel pair could be
removed from do_coredump().

David

Robin Holt

unread,

Feb 20, 2007, 8:00:16 AM2/20/07

to

On Tue, Feb 20, 2007 at 10:58:17AM +0000, David Howells wrote:
> Kawai, Hidehiro <hidehiro...@hitachi.com> wrote:

> Actually, I don't think the locking is that hard or that complex.
>
> int do_coredump(long signr, int exit_code, struct pt_regs * regs)
> {
> <setup vars>
>
> down_read(&coredump_settings_sem);
>
> ...
>
> fail:
> up_read(&coredump_settings_sem);
> return retval;
> }
>
> And:
>
> static ssize_t proc_coredump_omit_anon_shared_write(struct file *file,
> const char __user *buf,
> size_t count,
> loff_t *ppos)
> {
> <setup vars>
>
> down_write(&coredump_settings_sem);

If the dump has started, do we want to change this to a trylock and skip
changing the value if it is already locked?

Thanks,
Robin

Kawai, Hidehiro

unread,

Feb 21, 2007, 5:10:14 AM2/21/07

to

Hi,

Thank you for your reply.

David Howells wrote:

>>I think that locking makes codes complex and generates overhead.
>>So I wouldn't like to use lock as far as possible. I think passing
>>the flag as an extra argument is the simplest implementation to
>>avoid the core file corruption.
>
> Actually, I don't think the locking is that hard or that complex.
>
> int do_coredump(long signr, int exit_code, struct pt_regs * regs)
> {
> <setup vars>
>
> down_read(&coredump_settings_sem);
>
> ...
>
> fail:
> up_read(&coredump_settings_sem);
> return retval;
> }
>
> And:
>
> static ssize_t proc_coredump_omit_anon_shared_write(struct file *file,
> const char __user *buf,
> size_t count,
> loff_t *ppos)
> {
> <setup vars>
>
> down_write(&coredump_settings_sem);
>
> ...
>
> out_no_task:
> up_write(&coredump_settings_sem);
> return ret;
> }

Is coredump_setting_sem a global semaphore? If so, it prevents concurrent
core dumping. Additionally, while some process is dumped, writing to
coredump_omit_anon_shared of unrelated process will be blocked.

So we should use per process or per mm locking. But where should we
place the variable for locking? Because we don't want to increase the
struct size, we might want to add another bit field in mm_struct:

struct mm_struct {
...
unsigned char dumpable:2;
unsigned char coredump_in_progress:1; /* this */
unsigned char coredump_omit_anon_shared:1;
...

And we use it to determine whether core dumping is in progress or not:

int do_coredump(long signr, int exit_code, struct pt_regs * regs)
{
<setup vars>

spin_lock(&dump_bits_lock);
current->mm->coredump_in_progress = 1;
spin_unlock(&dump_bits_lock);
...

Additionally:

static ssize_t proc_coredump_omit_anon_shared_write(struct file *file,
const char __user *buf,
size_t count,
loff_t *ppos)
{
<setup vars>

ret = -EBUSY;
spin_lock(&dump_bits_lock);
if (mm->coredump_in_progress) {
spin_unlock(&dump_bits_lock);
goto out;
}
mm->coredump_omit_anon_shared = (val != 0);
spin_unlock(&dump_bits_lock);
...

Do you think which is better this method or passing argument method
used by my patchset?
Or, are there even better way else?

--
Hidehiro Kawai
Hitachi, Ltd., Systems Development Laboratory

-

David Howells

unread,

Feb 21, 2007, 6:40:26 AM2/21/07

to

Kawai, Hidehiro <hidehiro...@hitachi.com> wrote:

> Is coredump_setting_sem a global semaphore? If so, it prevents concurrent
> core dumping.

No, it doesn't. Look again:

int do_coredump(long signr, int exit_code, struct pt_regs * regs)
{
<setup vars>

>>>> down_read(&coredump_settings_sem);

> Additionally, while some process is dumped, writing to

> coredump_omit_anon_shared of unrelated process will be blocked.

Yes, but that's probably reasonable. How likely (a) is a process to coredump,
and (b) is a coredump to occur simultaneously with someone altering their
settings?

David

Robin Holt

unread,

Feb 21, 2007, 7:00:30 AM2/21/07

to

On Wed, Feb 21, 2007 at 11:33:31AM +0000, David Howells wrote:
> Kawai, Hidehiro <hidehiro...@hitachi.com> wrote:
>
> > Is coredump_setting_sem a global semaphore? If so, it prevents concurrent
> > core dumping.
>
> No, it doesn't. Look again:
>
> int do_coredump(long signr, int exit_code, struct pt_regs * regs)
> {
> <setup vars>
>
> >>>> down_read(&coredump_settings_sem);
>
> > Additionally, while some process is dumped, writing to
> > coredump_omit_anon_shared of unrelated process will be blocked.
>
> Yes, but that's probably reasonable. How likely (a) is a process to coredump,
> and (b) is a coredump to occur simultaneously with someone altering their
> settings?

And (c) altering the setting during a core dump going to produce an
unusable core dump. I don't think the locking is that difficult to add
and it just makes sense. I would venture a guess that it will take less
time to actually do the locking than to continue arguing it is not needed
when it clearly appears it is needed for even a small number of cases.

Thanks,
Robin

Kawai, Hidehiro

unread,

Feb 22, 2007, 12:40:07 AM2/22/07

to

Hi David and Robin,

Thank you for your reply.

Robin Holt wrote:
> On Wed, Feb 21, 2007 at 11:33:31AM +0000, David Howells wrote:
>
>>Kawai, Hidehiro <hidehiro...@hitachi.com> wrote:
>>
>>>Is coredump_setting_sem a global semaphore? If so, it prevents concurrent
>>>core dumping.
>>
>>No, it doesn't. Look again:
>>
>> int do_coredump(long signr, int exit_code, struct pt_regs * regs)
>> {
>> <setup vars>
>>
>> >>>> down_read(&coredump_settings_sem);

Oh, I'm sorry. I have overlooked it. There is no problem.

>>>Additionally, while some process is dumped, writing to
>>>coredump_omit_anon_shared of unrelated process will be blocked.
>>
>>Yes, but that's probably reasonable. How likely (a) is a process to coredump,
>>and (b) is a coredump to occur simultaneously with someone altering their
>>settings?
>
> And (c) altering the setting during a core dump going to produce an
> unusable core dump. I don't think the locking is that difficult to add
> and it just makes sense. I would venture a guess that it will take less
> time to actually do the locking than to continue arguing it is not needed
> when it clearly appears it is needed for even a small number of cases.

Okay, the probability that the process is blocked in the proc handler seems
to be small. But I'm not sure if problems never occur in enterprise use.
So I'd like to use down_write_trylock() as Robin said before. And if it
fails to acquire the lock, it returns EBUSY immediately.
Do you have any comments?

Thanks,

--
Hidehiro Kawai
Hitachi, Ltd., Systems Development Laboratory

-

David Howells

unread,

Feb 22, 2007, 6:50:12 AM2/22/07

to

Kawai, Hidehiro <hidehiro...@hitachi.com> wrote:

> Okay, the probability that the process is blocked in the proc handler seems
> to be small. But I'm not sure if problems never occur in enterprise use.
> So I'd like to use down_write_trylock() as Robin said before. And if it
> fails to acquire the lock, it returns EBUSY immediately.
> Do you have any comments?

That's probably reasonable, but that means userspace then has to handle EBUSY.

David

Markus Gutschke

unread,

Feb 23, 2007, 10:40:09 PM2/23/07

to

Kawai, Hidehiro wrote:
> This patch series is version 3 of the core dump masking feature,
> which provides a per-process flag not to dump anonymous shared
> memory segments.

I just wanted to remind you that you need to be careful about dumping
the [vdso] segment no matter whether you omit other segments. I didn't
actually try running your patches, and if the kernel doesn't actually
consider this segment anonymous and shared, things might already work
fine as is.

In any case, you can check with "readelf -a", if the [vdso] segment is
there. And you will find that if you forget to dump it, "gdb" can no
longer give you stack traces on call chains that involve signal handlers.

As an alternative to your kernel patch, you could achieve the same goal
in user space, by linking my coredumper
http://code.google.com/p/google-coredumper/ into your binaries and
setting up appropriate signal handlers. An equivalent patch for
selectively omitting memory regions would be trivial to add. While this
does give you more flexibility, it of course has the drawback of
requiring you to change your applications, so there still is some
benefit in a kernelspace solution.

Markus

David Howells

unread,

Feb 24, 2007, 5:10:11 AM2/24/07

to

Markus Gutschke <mar...@google.com> wrote:

> As an alternative to your kernel patch, you could achieve the same goal in
> user space, by linking my coredumper

How does it work when you can't actually get back to userspace to have
userspace do the coredump? You still have to handle the userspace equivalents
of double/triple faults.

David

Pavel Machek

unread,

Feb 24, 2007, 6:50:10 AM2/24/07

to

> Kawai, Hidehiro wrote:
> >This patch series is version 3 of the core dump masking feature,
> >which provides a per-process flag not to dump anonymous shared
> >memory segments.
>
> I just wanted to remind you that you need to be careful about dumping
> the [vdso] segment no matter whether you omit other segments. I didn't
> actually try running your patches, and if the kernel doesn't actually
> consider this segment anonymous and shared, things might already work
> fine as is.
>
> In any case, you can check with "readelf -a", if the [vdso] segment is
> there. And you will find that if you forget to dump it, "gdb" can no
> longer give you stack traces on call chains that involve signal handlers.
>
> As an alternative to your kernel patch, you could achieve the same goal
> in user space, by linking my coredumper
> http://code.google.com/p/google-coredumper/ into your binaries and
> setting up appropriate signal handlers. An equivalent patch for
> selectively omitting memory regions would be trivial to add. While this
> does give you more flexibility, it of course has the drawback of
> requiring you to change your applications, so there still is some
> benefit in a kernelspace solution.

"We are too lazy to change 0.01% of apps that actually need it" is not
good enough reason to push the feature into kernel, I'd say.

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Markus Gutschke

unread,

Feb 24, 2007, 3:10:07 PM2/24/07

to

David Howells wrote:
> How does it work when you can't actually get back to userspace to have
> userspace do the coredump? You still have to handle the userspace equivalents
> of double/triple faults.

My experience shows that there are only very rare occurrences of
situations where you cannot get back into userspace to do the coredump,
and the coredumper tries very hard never to cause additional faults.

While I am sure you could construct scenarios where this would happen,
realistically the only one I have run into were stack overflows, and
they can be handled by carefully setting up an alternate stack for
signal handlers -- just make sure the entire stack is already dirtied
before you run out of memory (or, turn of overcommitting).

Markus

David Howells

unread,

Feb 26, 2007, 7:00:15 AM2/26/07

to

Markus Gutschke <mar...@google.com> wrote:

> > How does it work when you can't actually get back to userspace to have
> > userspace do the coredump? You still have to handle the userspace
> > equivalents of double/triple faults.
>
> My experience shows that there are only very rare occurrences of situations
> where you cannot get back into userspace to do the coredump, and the
> coredumper tries very hard never to cause additional faults.

So what? If they can occur, you have to handle them.

> While I am sure you could construct scenarios where this would happen,
> realistically the only one I have run into were stack overflows, and they can
> be handled by carefully setting up an alternate stack for signal handlers --
> just make sure the entire stack is already dirtied before you run out of
> memory (or, turn of overcommitting).

Duff SIGSEGV or SIGBUS signal handlers are just as realistic. All that takes
is for someone to make a programming error. Remember: error paths are the
least frequently tested.

And any time you say "by carefully setting up" you can guarantee someone's
going to do it wrong.

David

Pavel Machek

unread,

Feb 26, 2007, 7:10:08 AM2/26/07

to

> > While I am sure you could construct scenarios where this would happen,
> > realistically the only one I have run into were stack overflows, and they can
> > be handled by carefully setting up an alternate stack for signal handlers --
> > just make sure the entire stack is already dirtied before you run out of
> > memory (or, turn of overcommitting).
>
> Duff SIGSEGV or SIGBUS signal handlers are just as realistic. All that takes
> is for someone to make a programming error. Remember: error paths are the
> least frequently tested.
>
> And any time you say "by carefully setting up" you can guarantee someone's
> going to do it wrong.

By same argument, we should just give up the coredumping in kernel. It
is rarely tested, so someone will just get it wrong.

Remember: we are having people with huge apps, and therefore huge
coredumps. They want to hack a kernel in ugly way to make their dumps
smaller.

...but there's better solution for them, create (& debug) userland
coredumping library. No need to hack a kernel.
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

David Howells

unread,

Feb 26, 2007, 7:50:13 AM2/26/07

to

Pavel Machek <pa...@ucw.cz> wrote:

> By same argument, we should just give up the coredumping in kernel. It
> is rarely tested, so someone will just get it wrong.

Not so rare... Lots of people still use Evolution after all:-)

However, to counter your point, I'll point out that there's just one main code
path to do coredumping in the kernel (for ELF; there are other binfmts that do
coredumping too), as opposed to lots of applications all trying to set up
coredumping themselves.

> Remember: we are having people with huge apps, and therefore huge
> coredumps. They want to hack a kernel in ugly way to make their dumps
> smaller.

MMAP_NODUMP or MADV_NODUMP might be better. Let the apps mark out for
themselves what they want.

> ...but there's better solution for them, create (& debug) userland
> coredumping library. No need to hack a kernel.

Actually, a better way to do this may be similar the way I assume Windows to
work. On fatal fault: start up a debugger and PT_ATTACH corpse to it. The
debugger could then be something that'll just save the core. No need to make
sure you link in the core dumping library which might not catch the event if
it's bad enough as it's working from *inside* the program, and so is subject
to being corrupted by beserk code.

David

Kawai, Hidehiro

unread,

Mar 1, 2007, 7:40:10 AM3/1/07

to

Hi,

Markus Gutschke wrote:

> Kawai, Hidehiro wrote:
>
>> This patch series is version 3 of the core dump masking feature,
>> which provides a per-process flag not to dump anonymous shared
>> memory segments.
>
> I just wanted to remind you that you need to be careful about dumping
> the [vdso] segment no matter whether you omit other segments. I didn't
> actually try running your patches, and if the kernel doesn't actually
> consider this segment anonymous and shared, things might already work
> fine as is.

Thank you for your advice and sorry for not replying soon.

Fortunately, the latest kernel uses VM_ALWAYSDUMP flag to always dump
the vdso segment. My patchset doesn't change this behavior. So we
don't need to worry about the vdso segment.

> As an alternative to your kernel patch, you could achieve the same goal
> in user space, by linking my coredumper
> http://code.google.com/p/google-coredumper/ into your binaries and
> setting up appropriate signal handlers. An equivalent patch for
> selectively omitting memory regions would be trivial to add.

As far as I can see, google-coredumper has more flexibility.
Can google-coredumper satisfy the following requirements easily?

Requirements are:
(1) a user can change the core dump settings _anytime_
- sometimes want to dump anonymous shared memory segments and
sometimes don't want to dump them
(2) a user can change the core dump settings of _any processes_
(although permission checks are performed)
- in a huge application which forks many processes, a user
hopes that some processes dump anonymous shared memory
segments and some processes don't dump them

And reliability of the core dump feature is also important.

> While this
> does give you more flexibility, it of course has the drawback of
> requiring you to change your applications, so there still is some
> benefit in a kernelspace solution.

And all the software vendors don't necessarily apply
google-coredumper. If the vendor doesn't apply it, the user will
be bothered by huge core dumps or the buggy application which
remains unfixed. So I believe that in kernel solution is still
needed.

Thanks,

--
Hidehiro Kawai
Hitachi, Ltd., Systems Development Laboratory

-

Markus Gutschke

unread,

Mar 1, 2007, 1:20:08 PM3/1/07

to

Kawai, Hidehiro wrote:
> Requirements are:
> (1) a user can change the core dump settings _anytime_
> - sometimes want to dump anonymous shared memory segments and
> sometimes don't want to dump them

I might not have been sufficiently clear about this in my previous
e-mail. Currently, the Google coredumper does not have the feature that
you asked about, but adding it would be trivial -- it just hadn't been
needed, yet, as on-the-fly compression was good enough for most users.

Answering your question, I don't see any reason why the new API would
not be able to make changes at any time.

> (2) a user can change the core dump settings of _any processes_
> (although permission checks are performed)
> - in a huge application which forks many processes, a user
> hopes that some processes dump anonymous shared memory
> segments and some processes don't dump them

The Google coredumper is a library that needs to be linked into the
application and needs to be called from appropriate signal handlers. As
such, it is the application's responsibility what management API it
wants to expose externally, and what tools it wants to provide for
managing a group of processes.

> And reliability of the core dump feature is also important.

We have generally had very good reliability with the Google coredumper.
In some cases, it even works a little more reliably than the default
in-kernel dumper (e.g. because we can control where to write the file,
and whether it should be compressed on-the-fly; or because we can get
multi-threaded coredumps even in situations where the particular
combination of libc and kernel doesn't support this), and in other cases
the in-kernel dumper works a little better (e.g. if an application got
too corrupted to even run any signal handlers).

Realistically, it just works. But we did have to make sure that we set
up alternate stacks for signal processing, and that we made sure that
these stacks have been dirtied in order to avoid problems with memory
overcomitting.

> And all the software vendors don't necessarily apply
> google-coredumper. If the vendor doesn't apply it, the user will
> be bothered by huge core dumps or the buggy application which
> remains unfixed. So I believe that in kernel solution is still
> needed.

I agree that the Google coredumper is only one possible solution to your
problem. Depending on how your production environment looks like, it
might help a lot, or it might be completely useless.

If it is cheap for you to modify your applications, but expensive to
upgrade your kernels, the Google coredumper is the way to go. Also, if
you need the extra features, such as the ability to compress core files
on-the-fly, or the ability to send corefiles to somewhere other than an
on-disk file, you definitely should look at a user-space solution. On
the other hand, if you can easily upgrade all your kernels, but you
don't even have access to the source of your applications, then an
in-kernel solution is much better.

Markus

Hugh Dickins

unread,

Mar 2, 2007, 12:00:14 PM3/2/07

to

On Fri, 16 Feb 2007, David Howells wrote:
> Robin Holt <ho...@sgi.com> wrote:
>
> > How about:
> > if (vma->vm_mm->coredump_omit_anon_shared) {
> >
> > Then the calls to maydump() would be unchanged:
>
> VMAs are a shared resource under NOMMU conditions.

That's a disturbing remark. Under precisely what NOMMU conditions?

I had thought Robin's suggestion very sensible; and throughout mm/
it has seemed pretty random whether we pass an "mm" argument down
in addition to "vma", or just take vma->vm_mm at whatever level needs.

You seem to be suggesting vma->vm_mm is dangerous when CONFIG_NOMMU,
but we MMU people are scarily unaware of that. Perhaps you need to
put #ifndef CONFIG_NOMMU around vm_mm in struct vm_area_struct?

Or am I totally misunderstanding?

Hugh

David Howells

unread,

Mar 3, 2007, 9:20:07 AM3/3/07

to

Hugh Dickins <hu...@veritas.com> wrote:

> > VMAs are a shared resource under NOMMU conditions.
>
> That's a disturbing remark.

Why? No-one complained when I first put up my rewrite patches three years ago.

> Under precisely what NOMMU conditions?

CONFIG_MMU=n.

> I had thought Robin's suggestion very sensible; and throughout mm/
> it has seemed pretty random whether we pass an "mm" argument down
> in addition to "vma", or just take vma->vm_mm at whatever level needs.
>
> You seem to be suggesting vma->vm_mm is dangerous when CONFIG_NOMMU,

vm_mm is never set to anything other than NULL if CONFIG_MMU=n and it doesn't
seem to be a problem. I don't think anything in the mm/ directory is left that
looks at vm_mm once MMU support is disabled (in fact I've just checked, and I
can compile with vm_mm #ifdef'd out)

> but we MMU people are scarily unaware of that.

If you're worryied that you can't compile anything for NOMMU, an FRV compiler
is available, and a suitable NOMMU default config can be provided.
Alternatively, you can pick ARM, M68K, ...

> Perhaps you need to put #ifndef CONFIG_NOMMU around vm_mm in struct
> vm_area_struct?

I can if it makes you happier. It's not strictly necessary, but it does make
the struct smaller which is good.

David

David Howells

unread,

Mar 3, 2007, 9:30:16 AM3/3/07

to

From: David Howells <dhow...@redhat.com>

Hide struct vm_area_struct::vm_mm when in NOMMU mode as this isn't used there.

Signed-Off-By: David Howells <dhow...@redhat.com>
---

include/linux/mm.h | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 60e0e4a..ba394e7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -58,7 +58,9 @@ extern int sysctl_legacy_va_layout;
* library, the executable area etc).
*/
struct vm_area_struct {
+#ifdef CONFIG_MMU
struct mm_struct * vm_mm; /* The address space we belong to. */
+#endif
unsigned long vm_start; /* Our start address within vm_mm. */
unsigned long vm_end; /* The first byte after our end address
within vm_mm. */

Hugh Dickins

unread,

Mar 5, 2007, 2:10:13 PM3/5/07

to

On Sat, 3 Mar 2007, David Howells wrote:
> Hugh Dickins <hu...@veritas.com> wrote:
>
> > > VMAs are a shared resource under NOMMU conditions.
> >
> > That's a disturbing remark.
>
> Why? No-one complained when I first put up my rewrite patches three years ago.

Never mind whether anyone read your patches three years ago,
it's disturbing when one body of active developers perceives an object
as shared and another (largely disjoint) body of active developers
perceives it as private: great scope for disaster.

>
> > Under precisely what NOMMU conditions?
>
> CONFIG_MMU=n.

Believe it or not, I had just about grasped that subtlety. I was
rather expecting some helpful response along the lines of "we share
all the vmas across a fork" or "whenever we go to allocate a vma, we
look to see if there's already an isomorphic vma which we can share"
or something like that. But don't worry about it, the source is
there for me to look at whenever.

>
> > I had thought Robin's suggestion very sensible; and throughout mm/
> > it has seemed pretty random whether we pass an "mm" argument down
> > in addition to "vma", or just take vma->vm_mm at whatever level needs.
> >
> > You seem to be suggesting vma->vm_mm is dangerous when CONFIG_NOMMU,
>
> vm_mm is never set to anything other than NULL if CONFIG_MMU=n and it doesn't
> seem to be a problem. I don't think anything in the mm/ directory is left that
> looks at vm_mm once MMU support is disabled (in fact I've just checked, and I
> can compile with vm_mm #ifdef'd out)

That's reassuring, thanks.

>
> > but we MMU people are scarily unaware of that.
>
> If you're worryied that you can't compile anything for NOMMU, an FRV compiler
> is available, and a suitable NOMMU default config can be provided.
> Alternatively, you can pick ARM, M68K, ...
>
> > Perhaps you need to put #ifndef CONFIG_NOMMU around vm_mm in struct
> > vm_area_struct?
>
> I can if it makes you happier. It's not strictly necessary, but it does make
> the struct smaller which is good.

No, it doesn't really make me any happier: I expect that if I look any
deeper, I'll just find plenty more to worry about there. I have the
impression, now reinforced by your defensive posture, that NOMMU is
a hack that squeamish outsiders had better not look too deeply into:
so long as it mostly works for those who need it to work,
we'd probably just be wasting your time anyway.

Hugh

David Howells

unread,

Mar 6, 2007, 1:20:22 PM3/6/07

to

Hugh Dickins <hu...@veritas.com> wrote:

> > > Under precisely what NOMMU conditions?
> >
> > CONFIG_MMU=n.
>
> Believe it or not, I had just about grasped that subtlety. I was
> rather expecting some helpful response along the lines of "we share
> all the vmas across a fork" or "whenever we go to allocate a vma, we
> look to see if there's already an isomorphic vma which we can share"
> or something like that. But don't worry about it, the source is
> there for me to look at whenever.

Any VMAs that can be shared are, as a general rule. It works this way because
shared MAP_PRIVATE read-only file data (such as a segment of an ELF
executable) is stored in a kmalloc()'d buffer attached to the VMA. The
lifetime of this buffer is handled by the VMA, and the buffer is not part of
the pagecache.

Maybe it could be; that'd make the ramfs handling more consistent with normal
file handling.

Such things as mappable chardevs do get separate VMA's, but those don't own
the backing store. That's configured through the BDI information in the
address_space.

> > I can if it makes you happier. It's not strictly necessary, but it does
> > make the struct smaller which is good.
>
> No, it doesn't really make me any happier: I expect that if I look any
> deeper, I'll just find plenty more to worry about there.

Good. I was hoping someone else would review it.

> I have the impression, now reinforced by your defensive posture,

*shrug* I was trying to keep the VMA as little changed as possible and add as
little extra to it as possible.

> that NOMMU is a hack that squeamish outsiders had better not look too deeply
> into:

Anyone that squeamish should avoid the full MMU VM like the plague then - it's
way way worse. NOMMU is considerably simpler. Yes, some liberties have to be
taken, but there's no avoiding that as there's no MMU available. I've tried to
take as few as possible, and I've managed to make sure almost all the
functionality afforded by the MMU kernels is available.

> so long as it mostly works for those who need it to work, we'd probably just
> be wasting your time anyway.

*shrug* as you wish.

David

David Howells

unread,

Mar 9, 2007, 9:20:06 AM3/9/07

to

I've been considering how to deal with the SYSV SHM problem, and I think we
may have to move to unshared VMAs in NOMMU mode to deal with this. Currently,
what we have is each mm_struct has in its arch-specific context argument a
list of VMLs. Take the FRV context for example:

[include/asm-frv/mmu.h]
typedef struct {
#ifdef CONFIG_MMU
...
struct vm_list_struct *vmlist;
unsigned long end_brk;

#endif
...
} mm_context_t;

Each VML struct containes a pointer to a systemwide VMA and the next VML in
the list:

struct vm_list_struct {
struct vm_list_struct *next;
struct vm_area_struct *vma;
};

The VMAs themselves are kept in an rb-tree in mm/nommu.c:

/* list of shareable VMAs */
struct rb_root nommu_vma_tree = RB_ROOT;

which can then be displayed through /proc/maps.

There are some restrictions of this system, mainly due to the NOMMU constraints:

(*) mmap() may not be used to overlay one mapping upon another

(*) mmap() may not be used with MAP_FIXED.

(*) mmap()'s of the same part of the same file will result in multiple
mappings returning the same base address, assuming the maps are shareable.
If they aren't shareable, they'll be at different base addresses.

(*) for normal shareable file mappings, two mappings will only be shared if
they precisely match offset, size and protection, otherwise a new mapping
will be created (this is because VMAs will be shared). Splitting VMAs
would reduce the this restriction, though subsequent mappings would have
to be bounded by the first mapping, but wouldn't have to be the same size.

(*) munmap() may only unmap a precise match amongst the mappings made; it may
not be used to cut down or punch a hole in an existing mapping.

The VMAs for private file mappings, private blockdev mappings and anonymous
mappings, be they shared[*] or unshared, hold a pointer to the kmalloc()'d
region of memory in which the mapping contents reside. This region is
discarded when the VMA is deleted. When a region can be shared the VMA is also
shared, and so no reference counting need take place on the mapping contents as
that is implied by the VMA.

[*] MAP_PRIVATE+!PROT_WRITE+!PT_PTRACED regions may be shared

Note that for mappable chardevs with special BDI capability flags, extra VMAs
may be allocated because (a) they may need to overlap non-exactly, and (b) the
chardev itself pins the backing storage, if the backing storage is potentially
transient.

If VMAs are not shared for shared memory regions then some other means of
retaining the actual allocated memory region must be found. The obvious way to
do this is to have the VMA point to a shared, refcounted record that keeps
track of the region:

struct vm_region {
/* the first parameters define the region as for the VMA */
pgprot_t vm_page_prot;
unsigned long vm_start;
unsigned long vm_end
unsigned long vm_pgoff;
struct file *vm_file;

atomic_t vm_usage; /* region usage count */
struct rb_node vm_rb; /* region tree */
};

The VMA itself would then have to be modified to include a pointer to this, but
wouldn't then need its own refcount. VMAs would belong, once again, to the
mm_struct, the VML struct would vanish, and the VML list rooted in mm_context_t
would vanish.

For R/O shareable file mappings, it might be possible to actually use the
target file's pagecache for the mapping. I do something of that sort for
shared-writable mappings on ramfs files (to support POSIX SHM and SYSV SHM).

The downside of allocating all these extra VMAs is that, of course, it takes up
more memory, though that may not be too bad, especially if it's at the gain of
additional consistency with the MM code.

However, consistency isn't for the most part a real issue. As I see it,
drivers and filesystems should not concern themselves with anything other than
the VMA they're given, and so it doesn't matter if these are shared or not.

That brings us on to the problem with SYSV SHM which keeps an attachment count
that the VMA mmap(), open() and release() ops manipulate. This means that the
nattch count comes out wrong on NOMMU systems. Note that on MMU systems, doing
a munmap() in the middle of an attached region will *also* break the nattch
count, though this is self-correcting.

Another way of dealing with the nattch count on NOMMU systems is to do it
through the VML list, but that then needs more special casing in the SHM driver
and perhaps others.

Thoughts?

Robin Getz

unread,

Mar 12, 2007, 6:00:09 PM3/12/07

to

On Fri 9 Mar 2007 09:12, David Howells pondered:

> I've been considering how to deal with the SYSV SHM problem, and I think we
> may have to move to unshared VMAs in NOMMU mode to deal with this.

Thanks for putting some good thoughts down.

I guess I don't look at it as consistency with the MM code as being the
primary request, but consistency in operation with the MM code from a user
space perspective - hopefully the two goals are not divergent.

> However, consistency isn't for the most part a real issue. As I see it,
> drivers and filesystems should not concern themselves with anything other
> than the VMA they're given, and so it doesn't matter if these are shared or
> not.
>
> That brings us on to the problem with SYSV SHM which keeps an attachment
> count that the VMA mmap(), open() and release() ops manipulate. This means
> that the nattch count comes out wrong on NOMMU systems. Note that on MMU
> systems, doing a munmap() in the middle of an attached region will *also*
> break the nattch count, though this is self-correcting.
>
> Another way of dealing with the nattch count on NOMMU systems is to do it
> through the VML list, but that then needs more special casing in the SHM
> driver and perhaps others.

We (noMMU) folks need to have special code anyway - so why not put it there,
and try not to increase memory footprint?

-Robin

David Howells

unread,

Mar 13, 2007, 6:20:15 AM3/13/07

to

Robin Getz <rg...@blackfin.uclinux.org> wrote:

> We (noMMU) folks need to have special code anyway - so why not put it there,
> and try not to increase memory footprint?

I'd like to have the drivers and filesystems need to know as little as possible
about whether they're working in MMU-mode or NOMMU-mode - for the most part
such knowledge should be unnecessary. Additionally, I'd rather not put special
case code in the generic parts of the NOMMU code.

David

Hugh Dickins

unread,

Mar 15, 2007, 5:30:09 PM3/15/07

to

On Fri, 9 Mar 2007, David Howells wrote:
>
> I've been considering how to deal with the SYSV SHM problem, and I think we
> may have to move to unshared VMAs in NOMMU mode to deal with this. Currently,
> what we have is each mm_struct has in its arch-specific context argument a
> list of VMLs. Take the FRV context for example:

>...

> Another way of dealing with the nattch count on NOMMU systems is to do it
> through the VML list, but that then needs more special casing in the SHM
> driver and perhaps others.
>
> Thoughts?

Thoughts are in regrettably short supply at my end.

I do appreciate all the trouble you've taken to explain it,
in terms I'd understand. And the way you've taken on board
my anxiety about vma assumptions diverging between NO/MMU.

But if "the SYSV SHM problem" you mention at the beginning
is just the "nattch" problem you mention at the end, I doubt
that's worth such a redesign as you're considering here.

Actually, I'm rather surprised SHM needs any such nattch count,
I'd expect it to deducible from file->f_count and mode&SHM_DEST
(but haven't investigated whether that really works out at all).

If you just need a little CONFIG_MMU in ipc/shm.c to solve your
problem, I don't think more is justified.

Your struct vm_region idea does look more to my taste than what
you presently have; yet if you pursue it, I think it would just
make divergence worse wouldn't it? NOMMU wanting vma to contain
a pointer to vm_region, MMU wanting vm_region embedded in vma.

I don't really understand why NOMMU chooses to share vmas, or
vm_regions, rather than just sharing the data which they indicate.
Just because you can use less memory that way? But no need to
justify it now, I'm unlikely to study it in detail.

Hugh

David Howells

unread,

Mar 15, 2007, 6:50:09 PM3/15/07

to

Hugh Dickins <hu...@veritas.com> wrote:

> But if "the SYSV SHM problem" you mention at the beginning
> is just the "nattch" problem you mention at the end, I doubt
> that's worth such a redesign as you're considering here.

Yes, as far as I know that's the problem. nattch is available to userspace and
seems to misbehave as far as userspace programs are concerned (I think the
program sees that it is 1 and assumes itself to be the last user).

> Actually, I'm rather surprised SHM needs any such nattch count,
> I'd expect it to deducible from file->f_count and mode&SHM_DEST
> (but haven't investigated whether that really works out at all).

Ummm... Currently file->f_count doesn't count the number of shmats because the
VMAs are shared. If they are no longer shared then the problem goes away.

There may be several VMLs for a particular process pointing to a VMA.

sys_shmdt() doesn't malfunction because it's not possible to split a VMA in
NOMMU mode, and so the whole VMA must match.

Actually, looking carefully at it, it might go wrong it someone does shmat(),
munmap(), shmdt(). do_munmap(), however, protects against too many munmaps (in
whatever form they're issued).

> If you just need a little CONFIG_MMU in ipc/shm.c to solve your
> problem, I don't think more is justified.

Hmmm... I'm not sure it's quite that simple. SYSV SHM is provided by a chain
of shm -> tiny-shmem -> ramfs. The mapping is actually managed by ramfs.

> Your struct vm_region idea does look more to my taste than what
> you presently have; yet if you pursue it, I think it would just
> make divergence worse wouldn't it? NOMMU wanting vma to contain
> a pointer to vm_region, MMU wanting vm_region embedded in vma.

That bit of divergence is, in effect, already there. In NOMMU-mode the VMA
owns the backing store; in MMU-mode it does not. This would, at least, rectify
that: fixing it would mean that the backing store is no longer owned by the
VMA, and would permit more flexibility in overlapping mappings.

> I don't really understand why NOMMU chooses to share vmas, or
> vm_regions, rather than just sharing the data which they indicate.

Where would that data be? How do you keep track of it? How do you know when
to deallocate it?

I have considered co-opting the pagecache attached to the mapped inode (which
is exactly how I do shared-writable mappings on ramfs), but that only works for
shared mappings. I still have to have a way to handle unshareable mappings.
At the moment, they're both the same way (unless overridden by the driver/fs),
and I just share the VMA.

> Just because you can use less memory that way?

That's one consideration. The other is that it makes management of these
chunks of data simpler. If the memory isn't attached to the VMA then it must
be managed in some other manner.

David

Eric W. Biederman

unread,

Mar 19, 2007, 3:30:13 PM3/19/07

to

I'm just trying to digest this a little.

As I understand your description for non-shared mappings the VMAs are
per process.

For shared mappings you share in some sense the page cache.

My gut feel says just keep a vma per process of the regions the
process has and do the appropriate book keeping and all will be fine.

For shm_nattach it looks like you simply are not calling the
open/close methods on fork (because you have a shared pool of vmas).

Eric

David Howells

unread,

Mar 20, 2007, 7:10:11 AM3/20/07

to

Eric W. Biederman <ebie...@xmission.com> wrote:

> As I understand your description for non-shared mappings the VMAs are
> per process.

Are you talking about the current state of play? If so, not precisely. In
the current scheme of things, *all* VMAs are kept in a global tree and are
globally available; it's just that any VMA that's not shareable will not be
shared, and so, in effect, is per-process.

In my suggested revamp, VMAs revert to being per-process objects only, and
sharing is effected by indirection.

> For shared mappings you share in some sense the page cache.

Currently, no - not unless the driver does something clever as ramfs does.
Sharing through the page cache is a nice idea, but it has some limitations,
mainly that non-sharing then operates differently.

> My gut feel says just keep a vma per process of the regions the
> process has and do the appropriate book keeping and all will be fine.

I'm sure it will be, but at the cost of consuming extra memory. I'm not sure
that the amount of extra memory is, however, all that significant. Now that I
think about it, I don't imagine that a lot of processes are going to be
running at once on a NOMMU system, and so the scope for sharing isn't all that
wide.

> For shm_nattach it looks like you simply are not calling the
> open/close methods on fork (because you have a shared pool of vmas).

There is no fork.

No, the problem is that sys_shmat() relies on do_mmap_pgoff() to call the VMA
open() method. However, this assumes that a new VMA will be made per process.

David

Eric W. Biederman

unread,

Mar 20, 2007, 12:50:07 PM3/20/07

to

David Howells <dhow...@redhat.com> writes:

> Eric W. Biederman <ebie...@xmission.com> wrote:
>
>> As I understand your description for non-shared mappings the VMAs are
>> per process.
>
> Are you talking about the current state of play? If so, not precisely. In
> the current scheme of things, *all* VMAs are kept in a global tree and are
> globally available; it's just that any VMA that's not shareable will not be
> shared, and so, in effect, is per-process.
>
> In my suggested revamp, VMAs revert to being per-process objects only, and
> sharing is effected by indirection.

Ok, and I do think per process VMAs are the way to go if you are
going to stay anywhere close to the rest of linux.

>> For shared mappings you share in some sense the page cache.
>
> Currently, no - not unless the driver does something clever as ramfs does.
> Sharing through the page cache is a nice idea, but it has some limitations,
> mainly that non-sharing then operates differently.

I don't quite what your limitations are but it is a fundamental
assumption that the sharing happen in the page cache. I.e. The pages
that are in the page cache are the only ones that can be effectively
shared.

As a corollary if a page is not in the page cache it should not be
shared. I guess this limits sharing to just those things in ramfs?

You obviously don't have the hardware to enforce this but at least
you can define the sharing of anything else as undefined.

Now I don't know what it takes from your data structures to achieve
sharing of page cache pages. Or what your underlying complications
are. Not having the page cache as the fundamental underlying sharing
pool is strongly non-intuitive from the perspective of the rest
of linux.

>> My gut feel says just keep a vma per process of the regions the
>> process has and do the appropriate book keeping and all will be fine.
>
> I'm sure it will be, but at the cost of consuming extra memory. I'm not sure
> that the amount of extra memory is, however, all that significant. Now that I
> think about it, I don't imagine that a lot of processes are going to be
> running at once on a NOMMU system, and so the scope for sharing isn't all that
> wide.

Sure. But this is sharing with the kernel as well. And you always
have at least one kernel and one application. So if they can share
the file buffers that is a win. And what mmap of any file backed
pages is expected to provide.

>> For shm_nattach it looks like you simply are not calling the
>> open/close methods on fork (because you have a shared pool of vmas).
>
> There is no fork.

Scratch the fork part. You still aren't calling the open/close
methods when they would normally be called, and that is where your
problem lies.

> No, the problem is that sys_shmat() relies on do_mmap_pgoff() to call the VMA
> open() method. However, this assumes that a new VMA will be made
> per process.

Exactly.

Eric

David Howells

unread,

Mar 20, 2007, 3:20:07 PM3/20/07

to

Eric W. Biederman <ebie...@xmission.com> wrote:

> >> For shared mappings you share in some sense the page cache.
> >
> > Currently, no - not unless the driver does something clever as ramfs does.
> > Sharing through the page cache is a nice idea, but it has some limitations,
> > mainly that non-sharing then operates differently.
>
> I don't quite what your limitations are

(1) There's no MMU to provide write protection. This can be worked around in
various ways - such as returning ETXTBSY if one attempts to write() to a
shared mapped file.

(2) An NFS root changing. ETXTBSY cannot be applied to the server just
because a client has a file open. Admittedly, this is a problem for
MMU-mode too.

(3) Keeping track of what memory a VMA is currently pinning. This isn't
normally necessary in MMU-mode, but it's very important in NOMMU-mode.

(4) Mappings must be made on contiguous regions of CPU physical memory space.
This leads to fun generating contiguous regions. If someone reads, say,
the first page of a file, then attempts to map the first meg, say, of the
file (think ld.so loading a shared library), you need to make the first
meg completely contiguous. There can be problems with this:

(a) Any pages that are already in the page cache (notably the first page)
may need moving to a region that's big enough to hold the whole
segment.

(b) If a smaller mapping is already made, but can't be extended, then you
can't make the bigger mapping unless you can decide not to share them.

With ramfs on NOMMU, using truncate() to expand a zero length file
attempts to get a contiguous region of the size specified which is then
broken up and attached to the page cache for that file. This is what
makes POSIX and SYSV shared memory work on NOMMU.

(5) PROT_WRITE/MAP_SHARED mappings are not practical on files that are on
devices or filesystems that aren't directly mappable (disks and NFS vs
memory and flash).

(6) PROT_WRITE/MAP_PRIVATE mappings cannot practically be made to follow
changes to the backing file.

Point (4) is the most difficult one, I think.

> but it is a fundamental assumption that the sharing happen in the page cache.

A fundamental assumption where? There's no requirement for an O/S to work by
having a page cache.

> I.e. The pages that are in the page cache are the only ones that can be
> effectively shared.

That's not actually so. Character devices don't generally exist in the page
cache, for example. In fact, IIRC, SYSV SHM used to work without touching the
page cache (I may be wrong on that).

Remember also, you're on a NOMMU system: we might have to bend some of the
rules.

> As a corollary if a page is not in the page cache it should not be
> shared. I guess this limits sharing to just those things in ramfs?

That's nasty, and also unnecessary. It also prohibits XIP, btw.

> You obviously don't have the hardware to enforce this

The main point is not that you don't have h/w to enforce protection, it's that
you may not have h/w to do virtual address mappings.

> but at least you can define the sharing of anything else as undefined.

Why would I want to do that?

> Now I don't know what it takes from your data structures to achieve
> sharing of page cache pages. Or what your underlying complications
> are. Not having the page cache as the fundamental underlying sharing
> pool is strongly non-intuitive from the perspective of the rest
> of linux.

Again, you're on a NOMMU system. Actually, as it stands, what I have here
seems to work pretty well. There isn't much that doesn't work. fork() doesn't
work (it's just too impractical) and it turns out that SYSV SHM only mostly
works (grrr).

> >> My gut feel says just keep a vma per process of the regions the process
> >> has and do the appropriate book keeping and all will be fine.
> >
> > I'm sure it will be, but at the cost of consuming extra memory. I'm not
> > sure that the amount of extra memory is, however, all that significant.
> > Now that I think about it, I don't imagine that a lot of processes are
> > going to be running at once on a NOMMU system, and so the scope for sharing
> > isn't all that wide.
>
> Sure. But this is sharing with the kernel as well.

What do you mean by that?

> And you always have at least one kernel and one application.

Actually, that's not so. I seem to recall some router thing where userspace
used to just exit completely, leaving the kernel to run alone.

> So if they can share the file buffers that is a win.

Yes, it's a win, but there are also problems with doing so, most particularly
contiguity.

> And what mmap of any file backed pages is expected to provide.

I think it's reasonable to say that a read-only MAP_PRIVATE mapping need not
follow the backing file in NOMMU mode. Note that it's close to impossible to
do this for a writable MAP_PRIVATE mapping, even though that's the expected
behaviour if it hasn't been written through.

Out of interest, do you know of any application that relies on the behaviour in
which unwritten private mappings follow the backing file?

> Scratch the fork part. You still aren't calling the open/close
> methods when they would normally be called, and that is where your
> problem lies.

As previously stated, yes, I do realise that.

I think the behaviour I have for sharing R/O private mappings is good enough,
especially as consolidating bits of the pagecache will be a pain, and mostly
unnecessary.

This is NOMMU mode. Some things are going to have to be different, but almost
all of the MMU functionality is available, just as long as you don't look too
closely.

David

David Howells

unread,

Mar 20, 2007, 4:00:18 PM3/20/07

to

David Howells <dhow...@redhat.com> wrote:

> I don't quite what your limitations are

(7) Two shared mappings on the same offset in the same file must, of
necessity, appear at the same address. This means you get two VMAs in the
rbtree at the same place.

David Howells

unread,

Mar 21, 2007, 12:20:11 PM3/21/07

to

David Howells <dhow...@redhat.com> wrote:

> I don't quite what your limitations are

(8) Handling mappings that extend beyond the end of file.