This patch series is version 3 of the core dump masking feature,
which provides a per-process flag to omit anonymous shared memory
segments from core dumps.
In this version, the /proc/<pid>/coredump_omit_anonymous_shared file
is provided as the interface instead of the previous
/proc/<pid>/core_flags. If you write a non-zero value to the
file, the anonymous shared memory segments of the process are not
dumped.
Another change in this version is the removal of the
kernel.core_flags_enable sysctl, which enabled/disabled core dump flags
globally. The purpose of this sysctl was to let the system administrator
force all anonymous memory to be dumped. But `ulimit -c 0' defeats it
anyway, so the sysctl is not helpful on current Linux.
This patch series can be applied against 2.6.20-mm1.
The supported core file formats are ELF and ELF-FDPIC. ELF has been
tested, but ELF-FDPIC has not been built or tested because I don't
have a test environment.
Background:
Some software shares huge memory segments among hundreds of
processes. If a failure occurs in one of these processes, a monitoring
process can signal all of them to generate core files and
restart the service. However, this can develop into a system-wide
failure, such as a prolonged slowdown or a disk space shortage,
because the total size of the core files is enormous.
To avoid this situation, we can limit the core file size with
setrlimit(2) or ulimit(1). But this method can lose important data
such as the stack, because core dumping is terminated partway through.
So I suggest keeping shared memory segments from being dumped for
particular processes. Because the shared memory attached to these
processes is common to all of them, we don't need to dump it every time.
Usage:
To prevent the anonymous shared memory segments of pid 1234 from being dumped:
$ echo 1 > /proc/1234/coredump_omit_anonymous_shared
When a new process is created, it inherits the flag status
from its parent. This makes it easy to set the flag before a
program runs. For example:
$ echo 1 > /proc/self/coredump_omit_anonymous_shared
$ ./some_program
ChangeLog:
v3:
- remove `/proc/<pid>/core_flags' proc entry
- add `/proc/<pid>/coredump_omit_anonymous_shared' as a named flag
- remove kernel.core_flags_enable sysctl parameter
v2:
http://groups.google.com/group/linux.kernel/browse_frm/thread/cb254465971d4a42/
http://groups.google.com/group/linux.kernel/browse_frm/thread/da78f2702e06fa11/
- rename `coremask' to `core_flags'
- change `core_flags' member in mm_struct to a bit field
next to `dumpable'
- introduce a global spin lock to protect the two adjacent bit fields
(core_flags and dumpable) from a race condition
- fix a bug where the generated core file could be corrupted when
core dumping and updating core_flags occurred concurrently
- add kernel.core_flags_enable sysctl parameter to enable/disable
flags in /proc/<pid>/core_flags
- support ELF-FDPIC binary format, but not tested
v1:
http://groups.google.com/group/linux.kernel/browse_frm/thread/1381fc54d716e3e6/
--
Hidehiro Kawai
Hitachi, Ltd., Systems Development Laboratory
E-mail: hidehiro...@hitachi.com
The /proc/<pid>/coredump_omit_anonymous_shared file is provided to access
the flag. You can read the flag status of a particular process from the
file, or change it by writing to the file.
The flag status is inherited by the child process when it is created.
The flag is stored in the coredump_omit_anon_shared member of mm_struct,
which shares a byte with the dumpable member because the two are adjacent
bit fields. To avoid a write-write race between the two, we use
a global spin lock.
The smp_wmb() when updating dumpable is removed because set_dumpable()
includes a spin lock/unlock pair, which has the effect of a
memory barrier.
Signed-off-by: Hidehiro Kawai <hidehiro...@hitachi.com>
---
fs/exec.c | 10 ++--
fs/proc/base.c | 99 ++++++++++++++++++++++++++++++++++++++++
include/linux/sched.h | 33 +++++++++++++
kernel/fork.c | 3 +
kernel/sys.c | 62 ++++++++-----------------
security/commoncap.c | 2
security/dummy.c | 2
7 files changed, 164 insertions(+), 47 deletions(-)
Index: linux-2.6.20-mm1/fs/proc/base.c
===================================================================
--- linux-2.6.20-mm1.orig/fs/proc/base.c
+++ linux-2.6.20-mm1/fs/proc/base.c
@@ -74,6 +74,7 @@
#include <linux/poll.h>
#include <linux/nsproxy.h>
#include <linux/oom.h>
+#include <linux/elf.h>
#include "internal.h"
/* NOTE:
@@ -1753,6 +1754,100 @@ static const struct inode_operations pro
#endif
+#if defined(USE_ELF_CORE_DUMP) && defined(CONFIG_ELF_CORE)
+static ssize_t proc_coredump_omit_anon_shared_read(struct file *file,
+ char __user *buf,
+ size_t count,
+ loff_t *ppos)
+{
+ struct task_struct *task = get_proc_task(file->f_dentry->d_inode);
+ struct mm_struct *mm;
+ char buffer[PROC_NUMBUF];
+ size_t len;
+ loff_t __ppos = *ppos;
+ int ret;
+
+ ret = -ESRCH;
+ if (!task)
+ goto out_no_task;
+
+ ret = 0;
+ mm = get_task_mm(task);
+ if (!mm)
+ goto out_no_mm;
+
+ len = snprintf(buffer, sizeof(buffer), "%u\n",
+ mm->coredump_omit_anon_shared);
+ if (__ppos >= len)
+ goto out;
+ if (count > len - __ppos)
+ count = len - __ppos;
+
+ ret = -EFAULT;
+ if (copy_to_user(buf, buffer + __ppos, count))
+ goto out;
+
+ ret = count;
+ *ppos = __ppos + count;
+
+ out:
+ mmput(mm);
+ out_no_mm:
+ put_task_struct(task);
+ out_no_task:
+ return ret;
+}
+
+static ssize_t proc_coredump_omit_anon_shared_write(struct file *file,
+ const char __user *buf,
+ size_t count,
+ loff_t *ppos)
+{
+ struct task_struct *task;
+ struct mm_struct *mm;
+ char buffer[PROC_NUMBUF], *end;
+ unsigned int val;
+ int ret;
+
+ ret = -EFAULT;
+ memset(buffer, 0, sizeof(buffer));
+ if (count > sizeof(buffer) - 1)
+ count = sizeof(buffer) - 1;
+ if (copy_from_user(buffer, buf, count))
+ goto out_no_task;
+
+ ret = -EINVAL;
+ val = (unsigned int)simple_strtoul(buffer, &end, 0);
+ if (*end == '\n')
+ end++;
+ if (end - buffer == 0)
+ goto out_no_task;
+
+ ret = -ESRCH;
+ task = get_proc_task(file->f_dentry->d_inode);
+ if (!task)
+ goto out_no_task;
+
+ ret = end - buffer;
+ mm = get_task_mm(task);
+ if (!mm)
+ goto out_no_mm;
+
+ set_coredump_omit_anon_shared(mm, (val != 0));
+
+ mmput(mm);
+ out_no_mm:
+ put_task_struct(task);
+ out_no_task:
+ return ret;
+}
+
+static struct file_operations proc_coredump_omit_anon_shared_operations = {
+ .read = proc_coredump_omit_anon_shared_read,
+ .write = proc_coredump_omit_anon_shared_write,
+};
+#endif
+
/*
* /proc/self:
*/
@@ -1972,6 +2067,10 @@ static struct pid_entry tgid_base_stuff[
#ifdef CONFIG_FAULT_INJECTION
REG("make-it-fail", S_IRUGO|S_IWUSR, fault_inject),
#endif
+#if defined(USE_ELF_CORE_DUMP) && defined(CONFIG_ELF_CORE)
+ REG("coredump_omit_anonymous_shared", S_IRUGO|S_IWUSR,
+ coredump_omit_anon_shared),
+#endif
#ifdef CONFIG_TASK_IO_ACCOUNTING
INF("io", S_IRUGO, pid_io_accounting),
#endif
Index: linux-2.6.20-mm1/include/linux/sched.h
===================================================================
--- linux-2.6.20-mm1.orig/include/linux/sched.h
+++ linux-2.6.20-mm1/include/linux/sched.h
@@ -366,7 +366,13 @@ struct mm_struct {
unsigned int token_priority;
unsigned int last_interval;
+ /*
+ * Writing to these bit fields can cause a race condition. To avoid
+ * the race, use dump_bits_lock. You may also use the set_dumpable() or
+ * set_coredump_*() macros to set a value.
+ */
unsigned char dumpable:2;
+ unsigned char coredump_omit_anon_shared:1; /* don't dump anon shm */
/* coredumping support */
int core_waiters;
@@ -1721,6 +1727,33 @@ static inline void inc_syscw(struct task
}
#endif
+#include <linux/elf.h>
+/*
+ * These macros are used to protect dumpable and coredump_omit_anon_shared bit
+ * fields in mm_struct from write race between the two.
+ */
+extern spinlock_t dump_bits_lock;
+#define __set_dump_bits(dest, val) \
+ do { \
+ spin_lock(&dump_bits_lock); \
+ (dest) = (val); \
+ spin_unlock(&dump_bits_lock); \
+ } while (0)
+
+#if defined(USE_ELF_CORE_DUMP) && defined(CONFIG_ELF_CORE)
+# define set_dumpable(mm, val) \
+ __set_dump_bits((mm)->dumpable, val)
+# define set_coredump_omit_anon_shared(mm, val) \
+ __set_dump_bits((mm)->coredump_omit_anon_shared, val)
+#else
+# define set_dumpable(mm, val) \
+ do { \
+ (mm)->dumpable = (val); \
+ smp_wmb(); \
+ } while (0)
+# define set_coredump_omit_anon_shared(mm, val)
+#endif
+
#endif /* __KERNEL__ */
#endif
Index: linux-2.6.20-mm1/fs/exec.c
===================================================================
--- linux-2.6.20-mm1.orig/fs/exec.c
+++ linux-2.6.20-mm1/fs/exec.c
@@ -62,6 +62,8 @@ int core_uses_pid;
char core_pattern[128] = "core";
int suid_dumpable = 0;
+DEFINE_SPINLOCK(dump_bits_lock);
+
EXPORT_SYMBOL(suid_dumpable);
/* The maximal length of core_pattern is also specified in sysctl.c */
@@ -853,9 +855,9 @@ int flush_old_exec(struct linux_binprm *
current->sas_ss_sp = current->sas_ss_size = 0;
if (current->euid == current->uid && current->egid == current->gid)
- current->mm->dumpable = 1;
+ set_dumpable(current->mm, 1);
else
- current->mm->dumpable = suid_dumpable;
+ set_dumpable(current->mm, suid_dumpable);
name = bprm->filename;
@@ -883,7 +885,7 @@ int flush_old_exec(struct linux_binprm *
file_permission(bprm->file, MAY_READ) ||
(bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP)) {
suid_keys(current);
- current->mm->dumpable = suid_dumpable;
+ set_dumpable(current->mm, suid_dumpable);
}
/* An exec changes our domain. We are no longer part of the thread
@@ -1477,7 +1479,7 @@ int do_coredump(long signr, int exit_cod
flag = O_EXCL; /* Stop rewrite attacks */
current->fsuid = 0; /* Dump root private */
}
- mm->dumpable = 0;
+ set_dumpable(mm, 0);
retval = coredump_wait(exit_code);
if (retval < 0)
Index: linux-2.6.20-mm1/kernel/fork.c
===================================================================
--- linux-2.6.20-mm1.orig/kernel/fork.c
+++ linux-2.6.20-mm1/kernel/fork.c
@@ -333,6 +333,9 @@ static struct mm_struct * mm_init(struct
atomic_set(&mm->mm_count, 1);
init_rwsem(&mm->mmap_sem);
INIT_LIST_HEAD(&mm->mmlist);
+ /* don't need to use set_coredump_omit_anon_shared() */
+ mm->coredump_omit_anon_shared =
+ (current->mm) ? current->mm->coredump_omit_anon_shared : 0;
mm->core_waiters = 0;
mm->nr_ptes = 0;
set_mm_counter(mm, file_rss, 0);
Index: linux-2.6.20-mm1/kernel/sys.c
===================================================================
--- linux-2.6.20-mm1.orig/kernel/sys.c
+++ linux-2.6.20-mm1/kernel/sys.c
@@ -1022,10 +1022,8 @@ asmlinkage long sys_setregid(gid_t rgid,
else
return -EPERM;
}
- if (new_egid != old_egid) {
- current->mm->dumpable = suid_dumpable;
- smp_wmb();
- }
+ if (new_egid != old_egid)
+ set_dumpable(current->mm, suid_dumpable);
if (rgid != (gid_t) -1 ||
(egid != (gid_t) -1 && egid != old_rgid))
current->sgid = new_egid;
@@ -1052,16 +1050,12 @@ asmlinkage long sys_setgid(gid_t gid)
return retval;
if (capable(CAP_SETGID)) {
- if (old_egid != gid) {
- current->mm->dumpable = suid_dumpable;
- smp_wmb();
- }
+ if (old_egid != gid)
+ set_dumpable(current->mm, suid_dumpable);
current->gid = current->egid = current->sgid = current->fsgid = gid;
} else if ((gid == current->gid) || (gid == current->sgid)) {
- if (old_egid != gid) {
- current->mm->dumpable = suid_dumpable;
- smp_wmb();
- }
+ if (old_egid != gid)
+ set_dumpable(current->mm, suid_dumpable);
current->egid = current->fsgid = gid;
}
else
@@ -1089,10 +1083,8 @@ static int set_user(uid_t new_ruid, int
switch_uid(new_user);
- if (dumpclear) {
- current->mm->dumpable = suid_dumpable;
- smp_wmb();
- }
+ if (dumpclear)
+ set_dumpable(current->mm, suid_dumpable);
current->uid = new_ruid;
return 0;
}
@@ -1145,10 +1137,8 @@ asmlinkage long sys_setreuid(uid_t ruid,
if (new_ruid != old_ruid && set_user(new_ruid, new_euid != old_euid) < 0)
return -EAGAIN;
- if (new_euid != old_euid) {
- current->mm->dumpable = suid_dumpable;
- smp_wmb();
- }
+ if (new_euid != old_euid)
+ set_dumpable(current->mm, suid_dumpable);
current->fsuid = current->euid = new_euid;
if (ruid != (uid_t) -1 ||
(euid != (uid_t) -1 && euid != old_ruid))
@@ -1195,10 +1185,8 @@ asmlinkage long sys_setuid(uid_t uid)
} else if ((uid != current->uid) && (uid != new_suid))
return -EPERM;
- if (old_euid != uid) {
- current->mm->dumpable = suid_dumpable;
- smp_wmb();
- }
+ if (old_euid != uid)
+ set_dumpable(current->mm, suid_dumpable);
current->fsuid = current->euid = uid;
current->suid = new_suid;
@@ -1240,10 +1228,8 @@ asmlinkage long sys_setresuid(uid_t ruid
return -EAGAIN;
}
if (euid != (uid_t) -1) {
- if (euid != current->euid) {
- current->mm->dumpable = suid_dumpable;
- smp_wmb();
- }
+ if (euid != current->euid)
+ set_dumpable(current->mm, suid_dumpable);
current->euid = euid;
}
current->fsuid = current->euid;
@@ -1290,10 +1276,8 @@ asmlinkage long sys_setresgid(gid_t rgid
return -EPERM;
}
if (egid != (gid_t) -1) {
- if (egid != current->egid) {
- current->mm->dumpable = suid_dumpable;
- smp_wmb();
- }
+ if (egid != current->egid)
+ set_dumpable(current->mm, suid_dumpable);
current->egid = egid;
}
current->fsgid = current->egid;
@@ -1336,10 +1320,8 @@ asmlinkage long sys_setfsuid(uid_t uid)
if (uid == current->uid || uid == current->euid ||
uid == current->suid || uid == current->fsuid ||
capable(CAP_SETUID)) {
- if (uid != old_fsuid) {
- current->mm->dumpable = suid_dumpable;
- smp_wmb();
- }
+ if (uid != old_fsuid)
+ set_dumpable(current->mm, suid_dumpable);
current->fsuid = uid;
}
@@ -1365,10 +1347,8 @@ asmlinkage long sys_setfsgid(gid_t gid)
if (gid == current->gid || gid == current->egid ||
gid == current->sgid || gid == current->fsgid ||
capable(CAP_SETGID)) {
- if (gid != old_fsgid) {
- current->mm->dumpable = suid_dumpable;
- smp_wmb();
- }
+ if (gid != old_fsgid)
+ set_dumpable(current->mm, suid_dumpable);
current->fsgid = gid;
key_fsgid_changed(current);
proc_id_connector(current, PROC_EVENT_GID);
@@ -2163,7 +2143,7 @@ asmlinkage long sys_prctl(int option, un
error = -EINVAL;
break;
}
- current->mm->dumpable = arg2;
+ set_dumpable(current->mm, arg2);
break;
case PR_SET_UNALIGN:
Index: linux-2.6.20-mm1/security/commoncap.c
===================================================================
--- linux-2.6.20-mm1.orig/security/commoncap.c
+++ linux-2.6.20-mm1/security/commoncap.c
@@ -244,7 +244,7 @@ void cap_bprm_apply_creds (struct linux_
if (bprm->e_uid != current->uid || bprm->e_gid != current->gid ||
!cap_issubset (new_permitted, current->cap_permitted)) {
- current->mm->dumpable = suid_dumpable;
+ set_dumpable(current->mm, suid_dumpable);
if (unsafe & ~LSM_UNSAFE_PTRACE_CAP) {
if (!capable(CAP_SETUID)) {
Index: linux-2.6.20-mm1/security/dummy.c
===================================================================
--- linux-2.6.20-mm1.orig/security/dummy.c
+++ linux-2.6.20-mm1/security/dummy.c
@@ -130,7 +130,7 @@ static void dummy_bprm_free_security (st
static void dummy_bprm_apply_creds (struct linux_binprm *bprm, int unsafe)
{
if (bprm->e_uid != current->uid || bprm->e_gid != current->gid) {
- current->mm->dumpable = suid_dumpable;
+ set_dumpable(current->mm, suid_dumpable);
if ((unsafe & ~LSM_UNSAFE_PTRACE_CAP) && !capable(CAP_SETUID)) {
bprm->e_uid = current->uid;
Signed-off-by: Hidehiro Kawai <hidehiro...@hitachi.com>
---
fs/binfmt_elf.c | 20 ++++++++++++++------
1 files changed, 14 insertions(+), 6 deletions(-)
Index: linux-2.6.20-mm1/fs/binfmt_elf.c
===================================================================
--- linux-2.6.20-mm1.orig/fs/binfmt_elf.c
+++ linux-2.6.20-mm1/fs/binfmt_elf.c
@@ -1181,7 +1181,7 @@ static int dump_seek(struct file *file,
*
* I think we should skip something. But I am not sure how. H.J.
*/
-static int maydump(struct vm_area_struct *vma)
+static int maydump(struct vm_area_struct *vma, unsigned int omit_anon_shared)
{
/* The vma can be set up to tell us the answer directly. */
if (vma->vm_flags & VM_ALWAYSDUMP)
@@ -1191,9 +1191,15 @@ static int maydump(struct vm_area_struct
if (vma->vm_flags & (VM_IO | VM_RESERVED))
return 0;
- /* Dump shared memory only if mapped from an anonymous file. */
- if (vma->vm_flags & VM_SHARED)
- return vma->vm_file->f_path.dentry->d_inode->i_nlink == 0;
+ /*
+ * Dump shared memory only if mapped from an anonymous file and
+ * /proc/<pid>/coredump_omit_anonymous_shared flag is not set.
+ */
+ if (vma->vm_flags & VM_SHARED) {
+ if (vma->vm_file->f_path.dentry->d_inode->i_nlink)
+ return 0;
+ return omit_anon_shared == 0;
+ }
/* If it hasn't been written to, don't write it out */
if (!vma->anon_vma)
@@ -1491,6 +1497,7 @@ static int elf_core_dump(long signr, str
#endif
int thread_status_size = 0;
elf_addr_t *auxv;
+ unsigned int omit_anon_shared = 0;
/*
* We no longer stop all VM operations.
@@ -1629,6 +1636,7 @@ static int elf_core_dump(long signr, str
}
dataoff = offset = roundup(offset, ELF_EXEC_PAGESIZE);
+ omit_anon_shared = current->mm->coredump_omit_anon_shared;
/* Write program headers for segments dump */
for (vma = first_vma(current, gate_vma); vma != NULL;
@@ -1642,7 +1650,7 @@ static int elf_core_dump(long signr, str
phdr.p_offset = offset;
phdr.p_vaddr = vma->vm_start;
phdr.p_paddr = 0;
- phdr.p_filesz = maydump(vma) ? sz : 0;
+ phdr.p_filesz = maydump(vma, omit_anon_shared) ? sz : 0;
phdr.p_memsz = sz;
offset += phdr.p_filesz;
phdr.p_flags = vma->vm_flags & VM_READ ? PF_R : 0;
@@ -1685,7 +1693,7 @@ static int elf_core_dump(long signr, str
vma = next_vma(vma, gate_vma)) {
unsigned long addr;
- if (!maydump(vma))
+ if (!maydump(vma, omit_anon_shared))
continue;
for (addr = vma->vm_start;
The debug messages from maydump() in fs/binfmt_elf_fdpic.c are updated
accordingly so that we can tell which kinds of memory segments are
dumped and which are not.
Signed-off-by: Hidehiro Kawai <hidehiro...@hitachi.com>
---
fs/binfmt_elf_fdpic.c | 37 +++++++++++++++++++++++++------------
1 files changed, 25 insertions(+), 12 deletions(-)
Index: linux-2.6.20-mm1/fs/binfmt_elf_fdpic.c
===================================================================
--- linux-2.6.20-mm1.orig/fs/binfmt_elf_fdpic.c
+++ linux-2.6.20-mm1/fs/binfmt_elf_fdpic.c
@@ -1168,7 +1168,7 @@ static int dump_seek(struct file *file,
*
* I think we should skip something. But I am not sure how. H.J.
*/
-static int maydump(struct vm_area_struct *vma)
+static int maydump(struct vm_area_struct *vma, unsigned int omit_anon_shared)
{
/* Do not dump I/O mapped devices or special mappings */
if (vma->vm_flags & (VM_IO | VM_RESERVED)) {
@@ -1184,15 +1184,22 @@ static int maydump(struct vm_area_struct
return 0;
}
- /* Dump shared memory only if mapped from an anonymous file. */
+ /*
+ * Dump shared memory only if mapped from an anonymous file and
+ * /proc/<pid>/coredump_omit_anonymous_shared flag is not set.
+ */
if (vma->vm_flags & VM_SHARED) {
- if (vma->vm_file->f_path.dentry->d_inode->i_nlink == 0) {
+ if (vma->vm_file->f_path.dentry->d_inode->i_nlink) {
kdcore("%08lx: %08lx: no (share)", vma->vm_start, vma->vm_flags);
+ return 0;
+ }
+ if (omit_anon_shared) {
+ kdcore("%08lx: %08lx: no (anon-share)", vma->vm_start, vma->vm_flags);
+ return 0;
+ } else {
+ kdcore("%08lx: %08lx: yes (anon-share)", vma->vm_start, vma->vm_flags);
return 1;
}
-
- kdcore("%08lx: %08lx: no (share)", vma->vm_start, vma->vm_flags);
- return 0;
}
#ifdef CONFIG_MMU
@@ -1444,14 +1451,15 @@ static int elf_dump_thread_status(long s
*/
#ifdef CONFIG_MMU
static int elf_fdpic_dump_segments(struct file *file, struct mm_struct *mm,
- size_t *size, unsigned long *limit)
+ size_t *size, unsigned long *limit,
+ unsigned int omit_anon_shared)
{
struct vm_area_struct *vma;
for (vma = current->mm->mmap; vma; vma = vma->vm_next) {
unsigned long addr;
- if (!maydump(vma))
+ if (!maydump(vma, omit_anon_shared))
continue;
for (addr = vma->vm_start;
@@ -1499,14 +1507,15 @@ end_coredump:
*/
#ifndef CONFIG_MMU
static int elf_fdpic_dump_segments(struct file *file, struct mm_struct *mm,
- size_t *size, unsigned long *limit)
+ size_t *size, unsigned long *limit,
+ unsigned int omit_anon_shared)
{
struct vm_list_struct *vml;
for (vml = current->mm->context.vmlist; vml; vml = vml->next) {
struct vm_area_struct *vma = vml->vma;
- if (!maydump(vma))
+ if (!maydump(vma, omit_anon_shared))
continue;
if ((*size += PAGE_SIZE) > *limit)
@@ -1557,6 +1566,7 @@ static int elf_fdpic_core_dump(long sign
struct vm_list_struct *vml;
#endif
elf_addr_t *auxv;
+ unsigned int omit_anon_shared = 0;
/*
* We no longer stop all VM operations.
@@ -1694,6 +1704,8 @@ static int elf_fdpic_core_dump(long sign
/* Page-align dumped data */
dataoff = offset = roundup(offset, ELF_EXEC_PAGESIZE);
+ omit_anon_shared = current->mm->coredump_omit_anon_shared;
+
/* write program headers for segments dump */
for (
#ifdef CONFIG_MMU
@@ -1715,7 +1727,7 @@ static int elf_fdpic_core_dump(long sign
phdr.p_offset = offset;
phdr.p_vaddr = vma->vm_start;
phdr.p_paddr = 0;
- phdr.p_filesz = maydump(vma) ? sz : 0;
+ phdr.p_filesz = maydump(vma, omit_anon_shared) ? sz : 0;
phdr.p_memsz = sz;
offset += phdr.p_filesz;
phdr.p_flags = vma->vm_flags & VM_READ ? PF_R : 0;
@@ -1749,7 +1761,8 @@ static int elf_fdpic_core_dump(long sign
DUMP_SEEK(dataoff);
- if (elf_fdpic_dump_segments(file, current->mm, &size, &limit) < 0)
+ if (elf_fdpic_dump_segments(file, current->mm, &size, &limit,
+ omit_anon_shared) < 0)
goto end_coredump;
#ifdef ELF_CORE_WRITE_EXTRA_DATA
Signed-off-by: Hidehiro Kawai <hidehiro...@hitachi.com>
---
Documentation/filesystems/proc.txt | 38 +++++++++++++++++++++++++++
1 files changed, 38 insertions(+)
Index: linux-2.6.20-mm1/Documentation/filesystems/proc.txt
===================================================================
--- linux-2.6.20-mm1.orig/Documentation/filesystems/proc.txt
+++ linux-2.6.20-mm1/Documentation/filesystems/proc.txt
@@ -41,6 +41,7 @@ Table of Contents
2.11 /proc/sys/fs/mqueue - POSIX message queues filesystem
2.12 /proc/<pid>/oom_adj - Adjust the oom-killer score
2.13 /proc/<pid>/oom_score - Display current oom-killer score
+ 2.14 /proc/<pid>/coredump_omit_anonymous_shared - Core dump coordinator
------------------------------------------------------------------------------
Preface
@@ -1982,6 +1983,43 @@ This file can be used to check the curre
any given <pid>. Use it together with /proc/<pid>/oom_adj to tune which
process should be killed in an out-of-memory situation.
+2.14 /proc/<pid>/coredump_omit_anonymous_shared - Core dump coordinator
+---------------------------------------------------------------------
+When a process is dumped, all anonymous memory is written to a core file as
+long as the size of the core file isn't limited. But sometimes we don't want
+to dump some memory segments, for example, huge shared memory.
+
+The /proc/<pid>/coredump_omit_anonymous_shared file is a flag that enables
+you to omit anonymous shared memory segments from a generated core file.
+When the <pid> process is dumped, the core dump routine decides whether a
+given memory segment should be dumped into a core file or not based on the
+type of the memory segment and the flag.
+
+If you write a non-zero value to this proc file, anonymous shared memory
+segments are not dumped. There are three types of anonymous shared
+memory:
+
+ - IPC shared memory
+ - the memory segments created by mmap(2) with MAP_ANONYMOUS and MAP_SHARED
+ flags
+ - the memory segments created by mmap(2) with MAP_SHARED flag, and the
+ mapped file has already been unlinked
+
+Because the current core dump routine doesn't distinguish these segments,
+you can only choose between dumping all anonymous shared memory segments
+and dumping none of them.
+
+If you don't want to dump the anonymous shared memory segments attached
+to pid 1234, write a non-zero value to the process's proc file:
+
+ $ echo 1 > /proc/1234/coredump_omit_anonymous_shared
+
+When a new process is created, the process inherits the flag status from its
+parent. It is useful to set the flag before the program runs.
+For example:
+
+ $ echo 1 > /proc/self/coredump_omit_anonymous_shared
+ $ ./some_program
+
------------------------------------------------------------------------------
Summary
------------------------------------------------------------------------------
> To avoid this situation, we can limit the core file size with
> setrlimit(2) or ulimit(1). But this method can lose important data
> such as the stack, because core dumping is terminated partway through.
> So I suggest keeping shared memory segments from being dumped for
> particular processes.
A better way might be to place the shared memory segments last if that's
possible (I'm not sure ELF supports out-of-order segments).
> Because the shared memory attached to these processes is common to all
> of them, we don't need to dump the shared memory every time.
So there's no guarantee that they'll be dumped at all... I'm not sure there's
any way around that, however.
David
> static int elf_fdpic_dump_segments(struct file *file, struct mm_struct *mm,
> - size_t *size, unsigned long *limit)
> + size_t *size, unsigned long *limit,
> + unsigned int omit_anon_shared)
Why are you passing it as an extra argument when you could just use
mm->coredump_omit_anon_shared here:
> + if (!maydump(vma, omit_anon_shared))
And here:
> + phdr.p_filesz = maydump(vma, omit_anon_shared) ? sz : 0;
Actually, I think I would just pass the mm pointer you have into maydump() and
let that dereference it here:
> + if (omit_anon_shared) {
which would then be:
if (mm->coredump_omit_anon_shared) {
Then the calls to maydump() would be:
if (!maydump(vma, mm))
and:
phdr.p_filesz = maydump(vma, current->mm) ? sz : 0;
I wouldn't worry, were I you, about the setting changing whilst it's being
used. If you are worried about that, you can probably do some locking against
that.
David
How about:
if (vma->vm_mm->coredump_omit_anon_shared) {
Then the calls to maydump() would be unchanged:
> How about:
> if (vma->vm_mm->coredump_omit_anon_shared) {
>
> Then the calls to maydump() would be unchanged:
VMAs are a shared resource under NOMMU conditions.
David
David Howells wrote:
> Kawai, Hidehiro <hidehiro...@hitachi.com> wrote:
>
>>To avoid this situation, we can limit the core file size with
>>setrlimit(2) or ulimit(1). But this method can lose important data
>>such as the stack, because core dumping is terminated partway through.
>>So I suggest keeping shared memory segments from being dumped for
>>particular processes.
>
> A better way might be to place the shared memory segments last if that's
> possible (I'm not sure ELF supports out-of-order segments).
Placing the shared memory segments last and limiting with ulimit -c
is one possible solution. But there is no guarantee that the memory
segments other than anonymous shared memory will always be dumped,
so your idea cannot replace the feature I'm suggesting.
Still, if your idea has a merit that mine doesn't, I'll try to
make the two ideas coexist.
>>Because the shared memory attached to these processes is common to all
>>of them, we don't need to dump the shared memory every time.
>
> So there's no guarantee that they'll be dumped at all... I'm not sure there's
> any way around that, however.
Indeed. However, some people don't want to dump anonymous shared memory
at all. Taking that requirement into account, I don't guarantee it.
But this feature allows a per-process setting, so if you want to dump
the shared memory at least once, you can arrange that in userland.
Thanks,
--
Hidehiro Kawai
Hitachi, Ltd., Systems Development Laboratory
Thank you for your comments.
David Howells wrote:
>> static int elf_fdpic_dump_segments(struct file *file, struct mm_struct *mm,
>>- size_t *size, unsigned long *limit)
>>+ size_t *size, unsigned long *limit,
>>+ unsigned int omit_anon_shared)
>
> Why are you passing it as an extra argument when you could just use
> mm->coredump_omit_anon_shared here:
>
>>+ if (!maydump(vma, omit_anon_shared))
> I wouldn't worry, were I you, about the setting changing whilst it's being
> used. If you are worried about that, you can probably do some locking against
> that.
Core dumping is separated into two phases: one writes the headers,
the other writes the memory segments. If the coredump_omit_anon_shared
setting is changed between these two phases, a corrupted core file will
be generated because the offsets written in the headers don't match
their bodies. So we need to use the same setting in both phases.
I think locking makes the code complex and adds overhead, so I would
like to avoid locks as far as possible. I think passing the flag as an
extra argument is the simplest implementation that avoids the core file
corruption.
Thanks,
--
Hidehiro Kawai
Hitachi, Ltd., Systems Development Laboratory
> Core dumping is separated into two phases: one writes the headers,
> the other writes the memory segments. If the coredump_omit_anon_shared
> setting is changed between these two phases, a corrupted core file
> will be generated because the offsets written in the headers don't
> match their bodies. So we need to use the same setting in both phases.
Hmmm... Okay.
> I think locking makes the code complex and adds overhead, so I would
> like to avoid locks as far as possible. I think passing the flag as
> an extra argument is the simplest implementation that avoids the core
> file corruption.
Actually, I don't think the locking is that hard or that complex.
int do_coredump(long signr, int exit_code, struct pt_regs * regs)
{
<setup vars>
down_read(&coredump_settings_sem);
...
fail:
up_read(&coredump_settings_sem);
return retval;
}
And:
static ssize_t proc_coredump_omit_anon_shared_write(struct file *file,
const char __user *buf,
size_t count,
loff_t *ppos)
{
<setup vars>
down_write(&coredump_settings_sem);
...
out_no_task:
up_write(&coredump_settings_sem);
return ret;
}
The same could be applied to all controls that change the coredump
variables. In particular, the sysctl for core_pattern could be wrapped
so as to remove one of the reliances on lock_kernel(), and the
lock_kernel() pair could be removed from do_coredump().
David
If the dump has started, do we want to change this to a trylock and skip
changing the value if it is already locked?
Thanks,
Robin
Thank you for your reply.
David Howells wrote:
>>I think locking makes the code complex and adds overhead, so I would
>>like to avoid locks as far as possible. I think passing the flag as
>>an extra argument is the simplest implementation that avoids the core
>>file corruption.
>
> Actually, I don't think the locking is that hard or that complex.
>
> int do_coredump(long signr, int exit_code, struct pt_regs * regs)
> {
> <setup vars>
>
> down_read(&coredump_settings_sem);
>
> ...
>
> fail:
> up_read(&coredump_settings_sem);
> return retval;
> }
>
> And:
>
> static ssize_t proc_coredump_omit_anon_shared_write(struct file *file,
> const char __user *buf,
> size_t count,
> loff_t *ppos)
> {
> <setup vars>
>
> down_write(&coredump_settings_sem);
>
> ...
>
> out_no_task:
> up_write(&coredump_settings_sem);
> return ret;
> }
Is coredump_settings_sem a global semaphore? If so, it prevents concurrent
core dumping. Additionally, while some process is being dumped, writing to
coredump_omit_anon_shared of an unrelated process will be blocked.
So we should use per-process or per-mm locking. But where should we
place the variable for locking? Since we don't want to increase the
struct size, we might add another bit field in mm_struct:
struct mm_struct {
...
unsigned char dumpable:2;
unsigned char coredump_in_progress:1; /* this */
unsigned char coredump_omit_anon_shared:1;
...
And we use it to determine whether core dumping is in progress or not:
int do_coredump(long signr, int exit_code, struct pt_regs * regs)
{
<setup vars>
spin_lock(&dump_bits_lock);
current->mm->coredump_in_progress = 1;
spin_unlock(&dump_bits_lock);
...
Additionally:
static ssize_t proc_coredump_omit_anon_shared_write(struct file *file,
                                                    const char __user *buf,
                                                    size_t count,
                                                    loff_t *ppos)
{
        <setup vars>
        ret = -EBUSY;
        spin_lock(&dump_bits_lock);
        if (mm->coredump_in_progress) {
                spin_unlock(&dump_bits_lock);
                goto out;
        }
        mm->coredump_omit_anon_shared = (val != 0);
        spin_unlock(&dump_bits_lock);
        ...
Which do you think is better: this method, or the argument-passing method
used by my patchset?
Or is there an even better way?
--
Hidehiro Kawai
Hitachi, Ltd., Systems Development Laboratory
-
> Is coredump_settings_sem a global semaphore? If so, it prevents concurrent
> core dumping.
No, it doesn't. Look again:
int do_coredump(long signr, int exit_code, struct pt_regs * regs)
{
        <setup vars>
>>>>    down_read(&coredump_settings_sem);
> Additionally, while some process is dumped, writing to
> coredump_omit_anon_shared of unrelated process will be blocked.
Yes, but that's probably reasonable. How likely (a) is a process to coredump,
and (b) is a coredump to occur simultaneously with someone altering their
settings?
David
And (c) is altering the setting during a core dump going to produce an
unusable core dump? I don't think the locking is that difficult to add
and it just makes sense. I would venture a guess that it will take less
time to actually do the locking than to continue arguing it is not needed
when it clearly appears it is needed for even a small number of cases.
Thanks,
Robin
Thank you for your reply.
Robin Holt wrote:
> On Wed, Feb 21, 2007 at 11:33:31AM +0000, David Howells wrote:
>
>>Kawai, Hidehiro <hidehiro...@hitachi.com> wrote:
>>
>>>Is coredump_settings_sem a global semaphore? If so, it prevents concurrent
>>>core dumping.
>>
>>No, it doesn't. Look again:
>>
>> int do_coredump(long signr, int exit_code, struct pt_regs * regs)
>> {
>>         <setup vars>
>>
>> >>>>    down_read(&coredump_settings_sem);
Oh, I'm sorry. I had overlooked it. There is no problem.
>>>Additionally, while some process is dumped, writing to
>>>coredump_omit_anon_shared of unrelated process will be blocked.
>>
>>Yes, but that's probably reasonable. How likely (a) is a process to coredump,
>>and (b) is a coredump to occur simultaneously with someone altering their
>>settings?
>
> And (c) is altering the setting during a core dump going to produce an
> unusable core dump? I don't think the locking is that difficult to add
> and it just makes sense. I would venture a guess that it will take less
> time to actually do the locking than to continue arguing it is not needed
> when it clearly appears it is needed for even a small number of cases.
Okay, the probability that a process gets blocked in the proc handler seems
to be small. But I'm not sure that problems will never occur in enterprise
use. So I'd like to use down_write_trylock() as Robin suggested before, and
if it fails to acquire the lock, return -EBUSY immediately.
Do you have any comments?
Thanks,
--
Hidehiro Kawai
Hitachi, Ltd., Systems Development Laboratory
-
> Okay, the probability that a process gets blocked in the proc handler seems
> to be small. But I'm not sure that problems will never occur in enterprise
> use. So I'd like to use down_write_trylock() as Robin suggested before, and
> if it fails to acquire the lock, return -EBUSY immediately.
> Do you have any comments?
That's probably reasonable, but that means userspace then has to handle EBUSY.
David
I just wanted to remind you that you need to be careful about dumping
the [vdso] segment no matter whether you omit other segments. I didn't
actually try running your patches, and if the kernel doesn't actually
consider this segment anonymous and shared, things might already work
fine as is.
In any case, you can check with "readelf -a" whether the [vdso] segment is
there. And you will find that if you forget to dump it, "gdb" can no
longer give you stack traces on call chains that involve signal handlers.
As an alternative to your kernel patch, you could achieve the same goal
in user space, by linking my coredumper
http://code.google.com/p/google-coredumper/ into your binaries and
setting up appropriate signal handlers. An equivalent patch for
selectively omitting memory regions would be trivial to add. While this
does give you more flexibility, it of course has the drawback of
requiring you to change your applications, so there still is some
benefit in a kernelspace solution.
Markus
> As an alternative to your kernel patch, you could achieve the same goal in
> user space, by linking my coredumper
How does it work when you can't actually get back to userspace to have
userspace do the coredump? You still have to handle the userspace equivalents
of double/triple faults.
David
"We are too lazy to change 0.01% of apps that actually need it" is not
good enough reason to push the feature into kernel, I'd say.
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
My experience shows that there are only very rare occurrences of
situations where you cannot get back into userspace to do the coredump,
and the coredumper tries very hard never to cause additional faults.
While I am sure you could construct scenarios where this would happen,
realistically the only ones I have run into were stack overflows, and
they can be handled by carefully setting up an alternate stack for
signal handlers -- just make sure the entire stack is already dirtied
before you run out of memory (or turn off overcommitting).
Markus
> > How does it work when you can't actually get back to userspace to have
> > userspace do the coredump? You still have to handle the userspace
> > equivalents of double/triple faults.
>
> My experience shows that there are only very rare occurrences of situations
> where you cannot get back into userspace to do the coredump, and the
> coredumper tries very hard never to cause additional faults.
So what? If they can occur, you have to handle them.
> While I am sure you could construct scenarios where this would happen,
> realistically the only ones I have run into were stack overflows, and they can
> be handled by carefully setting up an alternate stack for signal handlers --
> just make sure the entire stack is already dirtied before you run out of
> memory (or turn off overcommitting).
Duff SIGSEGV or SIGBUS signal handlers are just as realistic. All it takes
is for someone to make a programming error. Remember: error paths are the
least frequently tested.
And any time you say "by carefully setting up" you can guarantee someone's
going to do it wrong.
David
By the same argument, we should just give up on coredumping in the kernel. It
is rarely tested, so someone will just get it wrong.
Remember: we have people with huge apps, and therefore huge
coredumps. They want to hack the kernel in an ugly way to make their dumps
smaller.
...but there's a better solution for them: create (& debug) a userland
coredumping library. No need to hack the kernel.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
> By the same argument, we should just give up on coredumping in the kernel. It
> is rarely tested, so someone will just get it wrong.
Not so rare... Lots of people still use Evolution after all:-)
However, to counter your point, I'll point out that there's just one main code
path to do coredumping in the kernel (for ELF; there are other binfmts that do
coredumping too), as opposed to lots of applications all trying to set up
coredumping themselves.
> Remember: we have people with huge apps, and therefore huge
> coredumps. They want to hack the kernel in an ugly way to make their dumps
> smaller.
MMAP_NODUMP or MADV_NODUMP might be better. Let the apps mark out for
themselves what they want.
> ...but there's better solution for them, create (& debug) userland
> coredumping library. No need to hack a kernel.
Actually, a better way to do this may be similar to the way I assume Windows
works. On a fatal fault: start up a debugger and PT_ATTACH the corpse to it.
The debugger could then be something that just saves the core. No need to
make sure you link in the core-dumping library, which might not catch the
event if it's bad enough, as it works from *inside* the program and so is
subject to being corrupted by berserk code.
David
Markus Gutschke wrote:
> Kawai, Hidehiro wrote:
>
>> This patch series is version 3 of the core dump masking feature,
>> which provides a per-process flag not to dump anonymous shared
>> memory segments.
>
> I just wanted to remind you that you need to be careful about dumping
> the [vdso] segment no matter whether you omit other segments. I didn't
> actually try running your patches, and if the kernel doesn't actually
> consider this segment anonymous and shared, things might already work
> fine as is.
Thank you for your advice and sorry for not replying soon.
Fortunately, the latest kernel uses VM_ALWAYSDUMP flag to always dump
the vdso segment. My patchset doesn't change this behavior. So we
don't need to worry about the vdso segment.
> As an alternative to your kernel patch, you could achieve the same goal
> in user space, by linking my coredumper
> http://code.google.com/p/google-coredumper/ into your binaries and
> setting up appropriate signal handlers. An equivalent patch for
> selectively omitting memory regions would be trivial to add.
As far as I can see, google-coredumper has more flexibility.
Can google-coredumper satisfy the following requirements easily?
Requirements are:
(1) a user can change the core dump settings _anytime_
    - sometimes they want to dump anonymous shared memory segments
      and sometimes they don't
(2) a user can change the core dump settings of _any process_
    (although permission checks are performed)
    - in a huge application which forks many processes, a user
      may want some processes to dump anonymous shared memory
      segments and others not to
And reliability of the core dump feature is also important.
> While this
> does give you more flexibility, it of course has the drawback of
> requiring you to change your applications, so there still is some
> benefit in a kernelspace solution.
Also, not all software vendors will necessarily adopt
google-coredumper. If a vendor doesn't adopt it, users will
be bothered by huge core dumps or by a buggy application that
remains unfixed. So I believe that an in-kernel solution is still
needed.
Thanks,
--
Hidehiro Kawai
Hitachi, Ltd., Systems Development Laboratory
-
I might not have been sufficiently clear about this in my previous
e-mail. Currently, the Google coredumper does not have the feature that
you asked about, but adding it would be trivial -- it just hadn't been
needed yet, as on-the-fly compression was good enough for most users.
Answering your question, I don't see any reason why the new API would
not be able to make changes at any time.
> (2) a user can change the core dump settings of _any process_
>     (although permission checks are performed)
>     - in a huge application which forks many processes, a user
>       may want some processes to dump anonymous shared memory
>       segments and others not to
The Google coredumper is a library that needs to be linked into the
application and needs to be called from appropriate signal handlers. As
such, it is the application's responsibility what management API it
wants to expose externally, and what tools it wants to provide for
managing a group of processes.
> And reliability of the core dump feature is also important.
We have generally had very good reliability with the Google coredumper.
In some cases, it even works a little more reliably than the default
in-kernel dumper (e.g. because we can control where to write the file,
and whether it should be compressed on-the-fly; or because we can get
multi-threaded coredumps even in situations where the particular
combination of libc and kernel doesn't support this), and in other cases
the in-kernel dumper works a little better (e.g. if an application got
too corrupted to even run any signal handlers).
Realistically, it just works. But we did have to make sure that we set
up alternate stacks for signal processing, and that these stacks had
been dirtied, in order to avoid problems with memory overcommitting.
> Also, not all software vendors will necessarily adopt
> google-coredumper. If a vendor doesn't adopt it, users will
> be bothered by huge core dumps or by a buggy application that
> remains unfixed. So I believe that an in-kernel solution is still
> needed.
I agree that the Google coredumper is only one possible solution to your
problem. Depending on what your production environment looks like, it
might help a lot, or it might be completely useless.
If it is cheap for you to modify your applications, but expensive to
upgrade your kernels, the Google coredumper is the way to go. Also, if
you need the extra features, such as the ability to compress core files
on-the-fly, or the ability to send corefiles to somewhere other than an
on-disk file, you definitely should look at a user-space solution. On
the other hand, if you can easily upgrade all your kernels, but you
don't even have access to the source of your applications, then an
in-kernel solution is much better.
Markus
That's a disturbing remark. Under precisely what NOMMU conditions?
I had thought Robin's suggestion very sensible; and throughout mm/
it has seemed pretty random whether we pass an "mm" argument down
in addition to "vma", or just take vma->vm_mm at whatever level needs.
You seem to be suggesting vma->vm_mm is dangerous when CONFIG_NOMMU,
but we MMU people are scarily unaware of that. Perhaps you need to
put #ifndef CONFIG_NOMMU around vm_mm in struct vm_area_struct?
Or am I totally misunderstanding?
Hugh