VFS: file-max limit 50044 reached

Serge Belyshev

unread,

Oct 15, 2005, 9:20:55 AM10/15/05

to linux-...@vger.kernel.org

This program:

#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>

int main (void)
{
int f, j;

j = 0;
while (1) {
f = open ("/dev/null", O_RDONLY);
if (f == -1) {
fprintf (stderr,"open (%i): %s\n", j, strerror (errno));
abort ();
}
close (f);
j ++;
}
return 0;
}

fails on 2.6.14-rc4 kernel with this message:

$ ./a.out
VFS: file-max limit 50044 reached
open (55499): Too many open files in system
Aborted
$

This problem was reproduced on i386 and amd64 with
kernels 2.6.14-rc1 .. 2.6.14-rc4-git4
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Serge Belyshev

unread,

Oct 15, 2005, 1:56:24 PM10/15/05

to linux-...@vger.kernel.org

>This problem was reproduced on i386 and amd64 with
>kernels 2.6.14-rc1 .. 2.6.14-rc4-git4

Caused by this change:

http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=ab2af1f5005069321c5d130f09cce577b03f43ef
or
http://tinyurl.com/cyrou

aka "[PATCH] files: files struct with RCU"

Dipankar Sarma

unread,

Oct 16, 2005, 12:30:17 PM10/16/05

to Serge Belyshev, linux-...@vger.kernel.org, Linus Torvalds, kh...@linux-fr.org, Andrew Morton, Manfred Spraul

On Sat, Oct 15, 2005 at 09:53:14PM +0400, Serge Belyshev wrote:
>
> >This problem was reproduced on i386 and amd64 with
> >kernels 2.6.14-rc1 .. 2.6.14-rc4-git4
>
> Caused by this change:
>
> http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=ab2af1f5005069321c5d130f09cce577b03f43ef
> or
> http://tinyurl.com/cyrou
>
> aka "[PATCH] files: files struct with RCU"

Linus, I don't think this has anything to do with RCU grace periods
like we discussed previously. I measured on my 3.6GHz x86_64 and
found that open()/close() pair on /dev/null takes about 45500
cycles or 12 microseconds. [Does that sound resonable?]. So
assuing 100 HZ, we can't really queue more than 1660 file
structs to free per 2 timer ticks. In fact, I looked at the
filp slabinfo and we were indeed returning file structures
to slab.

I think this is a known issue I was looking at earlier - the
way we do file struct accounting is not very suitable for batched
freeing. For scalability reasons, file accounting was constructor/destructor
based. This meant that nr_files was decremented only when
the object was removed from the slab cache. This is
susceptible to slab fragmentation. With RCU based file structure,
consequent batched freeing and a test program like Serge's,
we just speed this up and end up with a very fragmented slab -

llm22:~ # cat /proc/sys/fs/file-nr
587730 0 758844

At the same time, I see only a 2000+ objects in filp cache.
To verify this theory, I tried the following experimental patch
I had from before and it fixes this problem. However I run
into my old "bad page state" problem that I have been seeing
since 2.6.9-rc2 in that machine. That needs a separate investigation.

Serge, could you please try the following experimental patch
just to see if file counting is indeed the problem. The patch
is definitely *not* meant for inclusion. Yet. Manfred told me
a while ago that global filp counting caused scalability problems
in some benchmarks - something I haven't been able to verify.

Thanks
Dipankar

This patch changes the file counting by removing the filp_count_lock.
Instead we use a separate atomic_t, nr_files, for now and all
accesses to it are through get_nr_files() api. In the sysctl
handler for nr_files, we populate files_stat.nr_files before returning
to user.

Counting files as an when they are created and destroyed (as opposed
to inside slab) allows us to correctly count open files with RCU.

Signed-off-by: Dipankar Sarma <dipa...@in.ibm.com>
---

diff -puN fs/dcache.c~files-scale-file-counting fs/dcache.c
--- linux-2.6.14-rc1-test/fs/dcache.c~files-scale-file-counting 2005-10-16 14:03:25.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/fs/dcache.c 2005-10-16 14:03:25.000000000 -0700
@@ -1730,7 +1730,7 @@ void __init vfs_caches_init(unsigned lon
SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL);

filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
- SLAB_HWCACHE_ALIGN|SLAB_PANIC, filp_ctor, filp_dtor);
+ SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL);

dcache_init(mempages);
inode_init(mempages);
diff -puN fs/file_table.c~files-scale-file-counting fs/file_table.c
--- linux-2.6.14-rc1-test/fs/file_table.c~files-scale-file-counting 2005-10-16 14:03:25.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/fs/file_table.c 2005-10-16 14:07:20.000000000 -0700
@@ -5,6 +5,7 @@
* Copyright (C) 1997 David S. Miller (da...@caip.rutgers.edu)
*/

+#include <linux/config.h>
#include <linux/string.h>
#include <linux/slab.h>
#include <linux/file.h>
@@ -18,52 +19,67 @@
#include <linux/mount.h>
#include <linux/cdev.h>
#include <linux/fsnotify.h>
+#include <linux/sysctl.h>
+#include <asm/atomic.h>

/* sysctl tunables... */
struct files_stat_struct files_stat = {
.max_files = NR_FILE
};

-EXPORT_SYMBOL(files_stat); /* Needed by unix.o */
-
/* public. Not pretty! */
__cacheline_aligned_in_smp DEFINE_SPINLOCK(files_lock);

-static DEFINE_SPINLOCK(filp_count_lock);
+static atomic_t nr_files __cacheline_aligned_in_smp;

-/* slab constructors and destructors are called from arbitrary
- * context and must be fully threaded - use a local spinlock
- * to protect files_stat.nr_files
- */
-void filp_ctor(void * objp, struct kmem_cache_s *cachep, unsigned long cflags)
+static inline void file_free_rcu(struct rcu_head *head)
{
- if ((cflags & (SLAB_CTOR_VERIFY|SLAB_CTOR_CONSTRUCTOR)) ==
- SLAB_CTOR_CONSTRUCTOR) {
- unsigned long flags;
- spin_lock_irqsave(&filp_count_lock, flags);
- files_stat.nr_files++;
- spin_unlock_irqrestore(&filp_count_lock, flags);
- }
+ struct file *f = container_of(head, struct file, f_rcuhead);
+ kmem_cache_free(filp_cachep, f);
}

-void filp_dtor(void * objp, struct kmem_cache_s *cachep, unsigned long dflags)
+static inline void file_free(struct file *f)
{
- unsigned long flags;
- spin_lock_irqsave(&filp_count_lock, flags);
- files_stat.nr_files--;
- spin_unlock_irqrestore(&filp_count_lock, flags);
+ atomic_dec(&nr_files);
+ call_rcu(&f->f_rcuhead, file_free_rcu);
}

-static inline void file_free_rcu(struct rcu_head *head)
+/*
+ * Return the total number of open files in the system
+ */
+int get_nr_files(void)
{
- struct file *f = container_of(head, struct file, f_rcuhead);
- kmem_cache_free(filp_cachep, f);
+ return atomic_read(&nr_files);
}

-static inline void file_free(struct file *f)
+/*
+ * Return the maximum number of open files in the system
+ */
+int get_max_files(void)
{
- call_rcu(&f->f_rcuhead, file_free_rcu);
+ return files_stat.max_files;
+}
+
+EXPORT_SYMBOL(get_nr_files);
+EXPORT_SYMBOL(get_max_files);
+
+/*
+ * Handle nr_files sysctl
+ */
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
+int proc_nr_files(ctl_table *table, int write, struct file *filp,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ files_stat.nr_files = get_nr_files();
+ proc_dointvec(table, write, filp, buffer, lenp, ppos);
+}
+#else
+int proc_nr_files(ctl_table *table, int write, struct file *filp,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ return -ENOSYS;
}
+#endif

/* Find an unused file structure and return a pointer to it.
* Returns NULL, if there are no more free file structures or
@@ -77,7 +93,7 @@ struct file *get_empty_filp(void)
/*
* Privileged users can go above max_files
*/
- if (files_stat.nr_files >= files_stat.max_files &&
+ if (get_nr_files() >= files_stat.max_files &&
!capable(CAP_SYS_ADMIN))
goto over;

@@ -96,11 +112,12 @@ struct file *get_empty_filp(void)
rwlock_init(&f->f_owner.lock);
/* f->f_version: 0 */
INIT_LIST_HEAD(&f->f_list);
+ atomic_inc(&nr_files);
return f;

over:
/* Ran out of filps - report that */
- if (files_stat.nr_files > old_max) {
+ if (get_nr_files() > old_max) {
printk(KERN_INFO "VFS: file-max limit %d reached\n",
files_stat.max_files);
old_max = files_stat.nr_files;
diff -puN fs/xfs/linux-2.6/xfs_linux.h~files-scale-file-counting fs/xfs/linux-2.6/xfs_linux.h
--- linux-2.6.14-rc1-test/fs/xfs/linux-2.6/xfs_linux.h~files-scale-file-counting 2005-10-16 14:03:25.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/fs/xfs/linux-2.6/xfs_linux.h 2005-10-16 14:03:25.000000000 -0700
@@ -88,6 +88,7 @@
#include <linux/proc_fs.h>
#include <linux/version.h>
#include <linux/sort.h>
+#include <linux/fs.h>

#include <asm/page.h>
#include <asm/div64.h>
@@ -242,7 +243,7 @@ static inline void set_buffer_unwritten_

/* IRIX uses the current size of the name cache to guess a good value */
/* - this isn't the same but is a good enough starting point for now. */
-#define DQUOT_HASH_HEURISTIC files_stat.nr_files
+#define DQUOT_HASH_HEURISTIC get_nr_files()

/* IRIX inodes maintain the project ID also, zero this field on Linux */
#define DEFAULT_PROJID 0
diff -puN include/linux/file.h~files-scale-file-counting include/linux/file.h
--- linux-2.6.14-rc1-test/include/linux/file.h~files-scale-file-counting 2005-10-16 14:03:25.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/include/linux/file.h 2005-10-16 14:03:25.000000000 -0700
@@ -60,8 +60,6 @@ extern void put_filp(struct file *);
extern int get_unused_fd(void);
extern void FASTCALL(put_unused_fd(unsigned int fd));
struct kmem_cache_s;
-extern void filp_ctor(void * objp, struct kmem_cache_s *cachep, unsigned long cflags);
-extern void filp_dtor(void * objp, struct kmem_cache_s *cachep, unsigned long dflags);

extern struct file ** alloc_fd_array(int);
extern void free_fd_array(struct file **, int);
diff -puN include/linux/fs.h~files-scale-file-counting include/linux/fs.h
--- linux-2.6.14-rc1-test/include/linux/fs.h~files-scale-file-counting 2005-10-16 14:03:25.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/include/linux/fs.h 2005-10-16 14:03:25.000000000 -0700
@@ -36,6 +36,8 @@ struct files_stat_struct {
int max_files; /* tunable */
};
extern struct files_stat_struct files_stat;
+extern int get_nr_files(void);
+extern int get_max_files(void);

struct inodes_stat_t {
int nr_inodes;
diff -puN kernel/sysctl.c~files-scale-file-counting kernel/sysctl.c
--- linux-2.6.14-rc1-test/kernel/sysctl.c~files-scale-file-counting 2005-10-16 14:03:25.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/kernel/sysctl.c 2005-10-16 14:03:25.000000000 -0700
@@ -50,6 +50,9 @@
#include <linux/nfs_fs.h>
#endif

+extern int proc_nr_files(ctl_table *table, int write, struct file *filp,
+ void __user *buffer, size_t *lenp, loff_t *ppos);
+
#if defined(CONFIG_SYSCTL)

/* External variables not in a header file. */
@@ -879,7 +882,7 @@ static ctl_table fs_table[] = {
.data = &files_stat,
.maxlen = 3*sizeof(int),
.mode = 0444,
- .proc_handler = &proc_dointvec,
+ .proc_handler = &proc_nr_files,
},
{
.ctl_name = FS_MAXFILE,
diff -puN net/unix/af_unix.c~files-scale-file-counting net/unix/af_unix.c
--- linux-2.6.14-rc1-test/net/unix/af_unix.c~files-scale-file-counting 2005-10-16 14:03:25.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/net/unix/af_unix.c 2005-10-16 14:03:25.000000000 -0700
@@ -547,7 +547,7 @@ static struct sock * unix_create1(struct
struct sock *sk = NULL;
struct unix_sock *u;

- if (atomic_read(&unix_nr_socks) >= 2*files_stat.max_files)
+ if (atomic_read(&unix_nr_socks) >= 2*get_max_files())
goto out;

sk = sk_alloc(PF_UNIX, GFP_KERNEL, &unix_proto, 1);

_

Serge Belyshev

unread,

Oct 16, 2005, 2:54:30 PM10/16/05

to Dipankar Sarma, linux-...@vger.kernel.org, Linus Torvalds, kh...@linux-fr.org, Andrew Morton, Manfred Spraul

Dipankar Sarma <dipa...@in.ibm.com> writes:

> Serge, could you please try the following experimental patch
> just to see if file counting is indeed the problem. The patch

I ran my test program with this patch applied on top of 2.6.14-rc4-git4
and it worked.

Dipankar Sarma

unread,

Oct 16, 2005, 3:03:09 PM10/16/05

to Serge Belyshev, linux-...@vger.kernel.org, Linus Torvalds, kh...@linux-fr.org, Andrew Morton, Manfred Spraul

On Sun, Oct 16, 2005 at 10:51:12PM +0400, Serge Belyshev wrote:
> Dipankar Sarma <dipa...@in.ibm.com> writes:
>
> > Serge, could you please try the following experimental patch
> > just to see if file counting is indeed the problem. The patch
>
> I ran my test program with this patch applied on top of 2.6.14-rc4-git4
> and it worked.

Serge, thanks for the test.

The issue is however far from resolved. We need to find about
potential scalability problems with this approach.

Secondly, on subsequent repeated tests, I saw a very large number
of allocated objects (600000+) in filp cache. That does point to either RCU
grace period not happening or my sycall measurements completely
wrong. I did run with the following patch that adds syscall
exit as a queiescent state, but it didn't help. I am going
to have to instrument RCU to see what is really happening.

Thanks
Dipankar

It turns out that under some really heavy RCU updates under simulated
conditions, a syscall bound task that doesn't block may prevent
RCU from happening during its entire timeslice and that window
may be big enough to generate out-of-memory situations for RCU
protected objects. This patch starts counting completion of
syscalls as quiescent state in order to prevent the above situation
from happening.

It introduces a new field in thread_info called rcu_qs which
stores the RCU quiescent state counter pointer for the cpu
on which the thread runs. We increment the counter on every
syscall completion to move rcu forward. This patch adds that
support to i386 and x86_64 archs, but it doesn't break other arches. As and
when support for rcu_qs is added to thread_info structs of
other arches, we need to define ARCH_HAS_RCU_QS for that
arch.

Not-Yet-Signed-Off-By: Dipankar Sarma <dipa...@in.ibm.com>

diff -puN arch/i386/kernel/entry.S~rcu-syscall-quiescent arch/i386/kernel/entry.S
--- linux-2.6.14-rc1-test/arch/i386/kernel/entry.S~rcu-syscall-quiescent 2005-10-16 11:01:35.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/arch/i386/kernel/entry.S 2005-10-16 11:25:10.000000000 -0700
@@ -239,6 +239,8 @@ syscall_exit:
cli # make sure we don't miss an interrupt
# setting need_resched or sigpending
# between sampling and the iret
+ movl TI_rcu_qs(%ebp), %ecx # Update RCU quiescent state flag
+ movl $1,(%ecx)
movl TI_flags(%ebp), %ecx
testw $_TIF_ALLWORK_MASK, %cx # current->work
jne syscall_exit_work
diff -puN include/asm-i386/thread_info.h~rcu-syscall-quiescent include/asm-i386/thread_info.h
--- linux-2.6.14-rc1-test/include/asm-i386/thread_info.h~rcu-syscall-quiescent 2005-10-16 11:01:35.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/include/asm-i386/thread_info.h 2005-10-16 11:20:37.000000000 -0700
@@ -17,6 +17,8 @@
#include <asm/processor.h>
#endif

+#define ARCH_HAS_RCU_QS
+
/*
* low level task data that entry.S needs immediate access to
* - this struct should fit entirely inside of one cache line
@@ -39,6 +41,7 @@ struct thread_info {
0-0xFFFFFFFF for kernel-thread
*/
struct restart_block restart_block;
+ int *rcu_qs; /* RCU quiescent state flag */

unsigned long previous_esp; /* ESP of the previous stack in case
of nested (IRQ) stacks
diff -puN include/linux/rcupdate.h~rcu-syscall-quiescent include/linux/rcupdate.h
--- linux-2.6.14-rc1-test/include/linux/rcupdate.h~rcu-syscall-quiescent 2005-10-16 11:01:35.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/include/linux/rcupdate.h 2005-10-16 12:38:56.000000000 -0700
@@ -41,6 +41,7 @@
#include <linux/percpu.h>
#include <linux/cpumask.h>
#include <linux/seqlock.h>
+#include <linux/thread_info.h>

/**
* struct rcu_head - callback structure for use with RCU
@@ -271,6 +272,16 @@ static inline int rcu_pending(int cpu)
*/
#define synchronize_sched() synchronize_rcu()

+#ifdef ARCH_HAS_RCU_QS
+static inline void rcu_set_qs(struct thread_info *ti, int cpu)
+{
+ struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
+ ti->rcu_qs = &rdp->passed_quiesc;
+}
+#else
+static inline void rcu_set_qs(struct thread_info *ti, int cpu) { }
+#endif
+
extern void rcu_init(void);
extern void rcu_check_callbacks(int cpu, int user);
extern void rcu_restart_cpu(int cpu);
diff -puN init/main.c~rcu-syscall-quiescent init/main.c
--- linux-2.6.14-rc1-test/init/main.c~rcu-syscall-quiescent 2005-10-16 11:01:35.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/init/main.c 2005-10-16 12:43:19.000000000 -0700
@@ -671,6 +671,9 @@ static int init(void * unused)
*/
child_reaper = current;

+ /* Set up rcu quiscent state counter before making any syscall */
+ rcu_set_qs(current_thread_info(), smp_processor_id());
+
/* Sets up cpus_possible() */
smp_prepare_cpus(max_cpus);

diff -puN kernel/sched.c~rcu-syscall-quiescent kernel/sched.c
--- linux-2.6.14-rc1-test/kernel/sched.c~rcu-syscall-quiescent 2005-10-16 11:01:35.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/kernel/sched.c 2005-10-16 12:43:53.000000000 -0700
@@ -3006,6 +3006,7 @@ switch_tasks:
rq->nr_switches++;
rq->curr = next;
++*switch_count;
+ rcu_set_qs(next->thread_info, task_cpu(prev));

prepare_task_switch(rq, next);
prev = context_switch(rq, prev, next);
diff -puN arch/i386/kernel/asm-offsets.c~rcu-syscall-quiescent arch/i386/kernel/asm-offsets.c
--- linux-2.6.14-rc1-test/arch/i386/kernel/asm-offsets.c~rcu-syscall-quiescent 2005-10-16 11:35:28.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/arch/i386/kernel/asm-offsets.c 2005-10-16 11:36:15.000000000 -0700
@@ -53,6 +53,7 @@ void foo(void)
OFFSET(TI_preempt_count, thread_info, preempt_count);
OFFSET(TI_addr_limit, thread_info, addr_limit);
OFFSET(TI_restart_block, thread_info, restart_block);
+ OFFSET(TI_rcu_qs, thread_info, rcu_qs);
BLANK();

OFFSET(EXEC_DOMAIN_handler, exec_domain, handler);
diff -puN arch/x86_64/kernel/entry.S~rcu-syscall-quiescent arch/x86_64/kernel/entry.S
--- linux-2.6.14-rc1-test/arch/x86_64/kernel/entry.S~rcu-syscall-quiescent 2005-10-16 11:48:27.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/arch/x86_64/kernel/entry.S 2005-10-16 12:03:01.000000000 -0700
@@ -214,6 +214,8 @@ ret_from_sys_call:
sysret_check:
GET_THREAD_INFO(%rcx)
cli
+ movq threadinfo_rcu_qs(%rcx),%rdx
+ movq $1,(%rdx)
movl threadinfo_flags(%rcx),%edx
andl %edi,%edx
CFI_REMEMBER_STATE
@@ -310,6 +312,8 @@ ENTRY(int_ret_from_sys_call)
/* edi: mask to check */
int_with_check:
GET_THREAD_INFO(%rcx)
+ movq threadinfo_rcu_qs(%rcx),%rdx
+ movl $1,(%rdx)
movl threadinfo_flags(%rcx),%edx
andl %edi,%edx
jnz int_careful
diff -puN include/asm-x86_64/thread_info.h~rcu-syscall-quiescent include/asm-x86_64/thread_info.h
--- linux-2.6.14-rc1-test/include/asm-x86_64/thread_info.h~rcu-syscall-quiescent 2005-10-16 11:50:25.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/include/asm-x86_64/thread_info.h 2005-10-16 11:54:47.000000000 -0700
@@ -23,6 +23,8 @@ struct task_struct;
struct exec_domain;
#include <asm/mmsegment.h>

+#define ARCH_HAS_RCU_QS
+
struct thread_info {
struct task_struct *task; /* main task structure */
struct exec_domain *exec_domain; /* execution domain */
@@ -33,6 +35,7 @@ struct thread_info {

mm_segment_t addr_limit;
struct restart_block restart_block;
+ int *rcu_qs;
};
#endif

diff -puN arch/x86_64/kernel/asm-offsets.c~rcu-syscall-quiescent arch/x86_64/kernel/asm-offsets.c
--- linux-2.6.14-rc1-test/arch/x86_64/kernel/asm-offsets.c~rcu-syscall-quiescent 2005-10-16 11:52:13.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/arch/x86_64/kernel/asm-offsets.c 2005-10-16 11:53:14.000000000 -0700
@@ -33,6 +33,7 @@ int main(void)
ENTRY(flags);
ENTRY(addr_limit);
ENTRY(preempt_count);
+ ENTRY(rcu_qs);
BLANK();
#undef ENTRY
#define ENTRY(entry) DEFINE(pda_ ## entry, offsetof(struct x8664_pda, entry))

_

Linus Torvalds

unread,

Oct 16, 2005, 10:20:43 PM10/16/05

to Dipankar Sarma, Serge Belyshev, linux-...@vger.kernel.org, kh...@linux-fr.org, Andrew Morton, Manfred Spraul

On Mon, 17 Oct 2005, Dipankar Sarma wrote:
>
> Secondly, on subsequent repeated tests, I saw a very large number
> of allocated objects (600000+) in filp cache. That does point to either RCU
> grace period not happening or my sycall measurements completely
> wrong. I did run with the following patch that adds syscall
> exit as a queiescent state, but it didn't help. I am going
> to have to instrument RCU to see what is really happening.

I would _really_ prefer to not do this in the system call hot-path by
default. That is unquestionably the hottest path in the kernel by far.

It would be _much_ better to set one of the TIF_WORK flags when there's a
lot of RCU stuff, and do this all in the not-quit-so-hot path of
do_notify_resume() (on x86, I think others call it other things) instead.

If you use the same kind of "set the TIF flag every 1000 rcu events"
approach that my failed patch had, you'd be much better off.

In fact, in that path you could even do a full "rcu_process_callbacks()".
After all, this is not that different from signal handling.

Gaah. I had really hoped to release 2.6.14 tomorrow. It's been a week
since -rc4.

Maybe this isn't that serious in practice right now? Serge, how did you
notice it?

Linus

Linus Torvalds

unread,

Oct 16, 2005, 10:35:23 PM10/16/05

to Dipankar Sarma, Serge Belyshev, linux-...@vger.kernel.org, kh...@linux-fr.org, Andrew Morton, Manfred Spraul

On Sun, 16 Oct 2005, Dipankar Sarma wrote:
>
> Linus, I don't think this has anything to do with RCU grace periods
> like we discussed previously. I measured on my 3.6GHz x86_64 and
> found that open()/close() pair on /dev/null takes about 45500
> cycles or 12 microseconds. [Does that sound resonable?].

That sounds very slow. I can do a million open/close pairs in 4 seconds on
a 2.5GHz G5. Maybe you tested a cold-cache case?

Of course, a P4 is just about the worst architecture to test system call
performance on, so ...

Still, that's 4us. I'm pretty sure some machines will do it in 3 or less
(in fact, lmbench says 3.17us on another machine of mine for open/close).
Still, that's only four times faster, so 2 timer ticks should be less than
5000 file structs to free.

I suspect this patch is worth it for the 2.6.14 timeframe, but I'll wait
for confirmation.

In fact, for 2.6.14, I'd almost do an even more minimal one. I agree with
your changing the file counter to an atomic, but I'd rather keep that
change for later.

Serge, does this alternate patch work for you?

[ cache constructors and destructors are _stupid_. They act exactly the
wrong way from a cache standpoint. ]

Linus

---
diff --git a/fs/dcache.c b/fs/dcache.c
index fb10386..40aaa90 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1731,7 +1731,7 @@ void __init vfs_caches_init(unsigned lon

SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL);

filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
- SLAB_HWCACHE_ALIGN|SLAB_PANIC, filp_ctor, filp_dtor);
+ SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL);

dcache_init(mempages);
inode_init(mempages);

diff --git a/fs/file_table.c b/fs/file_table.c
index 86ec8ae..fbda480 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -39,21 +39,9 @@ void filp_ctor(void * objp, struct kmem_
{

if ((cflags & (SLAB_CTOR_VERIFY|SLAB_CTOR_CONSTRUCTOR)) ==

SLAB_CTOR_CONSTRUCTOR) {
- unsigned long flags;
- spin_lock_irqsave(&filp_count_lock, flags);
- files_stat.nr_files++;
- spin_unlock_irqrestore(&filp_count_lock, flags);
}
}

-void filp_dtor(void * objp, struct kmem_cache_s *cachep, unsigned long dflags)

-{

- unsigned long flags;
- spin_lock_irqsave(&filp_count_lock, flags);
- files_stat.nr_files--;
- spin_unlock_irqrestore(&filp_count_lock, flags);

-}
-

static inline void file_free_rcu(struct rcu_head *head)
{

struct file *f = container_of(head, struct file, f_rcuhead);

@@ -62,6 +50,13 @@ static inline void file_free_rcu(struct

static inline void file_free(struct file *f)
{

+ unsigned long flags;
+
+ /* Stupid. Use atomics */
+ spin_lock_irqsave(&filp_count_lock, flags);
+ files_stat.nr_files--;
+ spin_unlock_irqrestore(&filp_count_lock, flags);
+
call_rcu(&f->f_rcuhead, file_free_rcu);
}

@@ -73,6 +68,7 @@ struct file *get_empty_filp(void)
{
static int old_max;
struct file * f;
+ unsigned long flags;

/*
* Privileged users can go above max_files

@@ -85,6 +81,11 @@ struct file *get_empty_filp(void)
if (f == NULL)
goto fail;

+ /* Stupid. Use atomics */
+ spin_lock_irqsave(&filp_count_lock, flags);
+ files_stat.nr_files++;
+ spin_unlock_irqrestore(&filp_count_lock, flags);
+
memset(f, 0, sizeof(*f));
if (security_file_alloc(f))
goto fail_sec;
diff --git a/include/linux/file.h b/include/linux/file.h
index f5bbd4c..55f0572 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h

@@ -60,8 +60,6 @@ extern void put_filp(struct file *);
extern int get_unused_fd(void);
extern void FASTCALL(put_unused_fd(unsigned int fd));
struct kmem_cache_s;
-extern void filp_ctor(void * objp, struct kmem_cache_s *cachep, unsigned long cflags);
-extern void filp_dtor(void * objp, struct kmem_cache_s *cachep, unsigned long dflags);

extern struct file ** alloc_fd_array(int);
extern void free_fd_array(struct file **, int);

Roland Dreier

unread,

Oct 16, 2005, 11:57:21 PM10/16/05

to Linus Torvalds, Dipankar Sarma, Serge Belyshev, linux-...@vger.kernel.org, kh...@linux-fr.org, Andrew Morton, Manfred Spraul

> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -39,21 +39,9 @@ void filp_ctor(void * objp, struct kmem_
> {
> if ((cflags & (SLAB_CTOR_VERIFY|SLAB_CTOR_CONSTRUCTOR)) ==
> SLAB_CTOR_CONSTRUCTOR) {
> - unsigned long flags;
> - spin_lock_irqsave(&filp_count_lock, flags);
> - files_stat.nr_files++;
> - spin_unlock_irqrestore(&filp_count_lock, flags);
> }
> }

Am I missing something? Why not delete the whole filp_ctor() function
rather than just the then clause of the if()?

- R.

Serge Belyshev

unread,

Oct 17, 2005, 12:46:43 AM10/17/05

to Linus Torvalds, linux-...@vger.kernel.org, kh...@linux-fr.org, Andrew Morton, Manfred Spraul

[resend, sorry for sending mail off-list]

Linus Torvalds <torv...@osdl.org> writes:

> Serge, does this alternate patch work for you?
>

Yes, this patch works too.

> Gaah. I had really hoped to release 2.6.14 tomorrow. It's been a week
> since -rc4.
>
> Maybe this isn't that serious in practice right now? Serge, how did you
> notice it?
>

This bug causes random failures when building kernel with make -j4, all with
"Too many open files in system" message from gcc.

Jean Delvare

unread,

Oct 17, 2005, 4:46:14 AM10/17/05

to torv...@osdl.org, dipa...@in.ibm.com, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

Hi Linus, Dipankar, all,

On 2005-10-17, Linus Torvalds wrote:
> I would _really_ prefer to not do this in the system call hot-path by
> default. That is unquestionably the hottest path in the kernel by far.
>
> It would be _much_ better to set one of the TIF_WORK flags when there's a
> lot of RCU stuff, and do this all in the not-quit-so-hot path of
> do_notify_resume() (on x86, I think others call it other things) instead.
>
> If you use the same kind of "set the TIF flag every 1000 rcu events"
> approach that my failed patch had, you'd be much better off.
>
> In fact, in that path you could even do a full "rcu_process_callbacks()".
> After all, this is not that different from signal handling.
>
> Gaah. I had really hoped to release 2.6.14 tomorrow. It's been a week
> since -rc4.

Isn't reverting the original change an option? 2.6.13 was working OK if
I'm not mistaken.

Thanks,
--
Jean Delvare

Dipankar Sarma

unread,

Oct 17, 2005, 4:52:57 AM10/17/05

to Jean Delvare, torv...@osdl.org, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

On Mon, Oct 17, 2005 at 10:32:47AM +0200, Jean Delvare wrote:
>
> > In fact, in that path you could even do a full "rcu_process_callbacks()".
> > After all, this is not that different from signal handling.
> >
> > Gaah. I had really hoped to release 2.6.14 tomorrow. It's been a week
> > since -rc4.
>
> Isn't reverting the original change an option? 2.6.13 was working OK if
> I'm not mistaken.

IMO, putting the file accounting in slab ctor/dtors is not very
reliable because it depends on slab not getting fragmented.
Batched freeing in RCU is just an extreme case of it. We needed
to fix file counting anyway.

Thanks
Dipankar

Eric Dumazet

unread,

Oct 17, 2005, 5:11:26 AM10/17/05

to dipa...@in.ibm.com, Jean Delvare, torv...@osdl.org, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

Dipankar Sarma a écrit :

> On Mon, Oct 17, 2005 at 10:32:47AM +0200, Jean Delvare wrote:
>
>>>In fact, in that path you could even do a full "rcu_process_callback
s()".
>>>After all, this is not that different from signal handling.
>>>
>>>Gaah. I had really hoped to release 2.6.14 tomorrow. It's been a wee
k
>>>since -rc4.
>>
>>Isn't reverting the original change an option? 2.6.13 was working OK
if
>>I'm not mistaken.
>
>
> IMO, putting the file accounting in slab ctor/dtors is not very
> reliable because it depends on slab not getting fragmented.
> Batched freeing in RCU is just an extreme case of it. We needed
> to fix file counting anyway.
>
> Thanks
> Dipankar

But isnt this file counting a small problem ?

This small program can eat all available memory.

Fixing the 'file count' wont fix the real problem : Batch freeing is
good but
should be limited so that not more than *billions* of file struct are q
ueued
for deletion.

Dont take me wrong : I really *need* the file RCU stuff added in 2.6.14
.

I believe we can find a solution, even if it might delay 2.6.14 because
Linus
would have to release a rc5

Eric

Christoph Hellwig

unread,

Oct 17, 2005, 5:15:16 AM10/17/05

to Eric Dumazet, dipa...@in.ibm.com, Jean Delvare, torv...@osdl.org, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

On Mon, Oct 17, 2005 at 11:10:04AM +0200, Eric Dumazet wrote:
> Dont take me wrong : I really *need* the file RCU stuff added in 2.6.14.

how so? and why should we care? I'd rather see a 2.6.14 soon with
the changes backed out so we can have a proper release that more or
less sticks to the release schedule we agreed on at kernel summit.
You'll have four weeks time to sort out the issue afterwards.

Eric Dumazet

unread,

Oct 17, 2005, 5:31:34 AM10/17/05

to Christoph Hellwig, dipa...@in.ibm.com, Jean Delvare, torv...@osdl.org, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

Christoph Hellwig a écrit :

> On Mon, Oct 17, 2005 at 11:10:04AM +0200, Eric Dumazet wrote:
>
>>Dont take me wrong : I really *need* the file RCU stuff added in 2.6.
14.
>
>
> how so? and why should we care? I'd rather see a 2.6.14 soon with
> the changes backed out so we can have a proper release that more or
> less sticks to the release schedule we agreed on at kernel summit.
> You'll have four weeks time to sort out the issue afterwards.
> -

Christoph,

You can try to hide the forest by killing some trees.

Are you sure that RCU 'file structs' is the only problem lying around ?

For instance, I think other RCU freeing problem are dormant (see maxb
atch=10
and think about the number of routes a busy router (or DOS attack) can
handle...

Of course, a 'test program' is more difficult to write than a

while (1) close(open("/dev/null", 3));

Eric

Dipankar Sarma

unread,

Oct 17, 2005, 6:39:59 AM10/17/05

to Eric Dumazet, Jean Delvare, torv...@osdl.org, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

On Mon, Oct 17, 2005 at 11:10:04AM +0200, Eric Dumazet wrote:

> Dipankar Sarma a écrit :

> >
> >IMO, putting the file accounting in slab ctor/dtors is not very
> >reliable because it depends on slab not getting fragmented.
> >Batched freeing in RCU is just an extreme case of it. We needed
> >to fix file counting anyway.
> >
> >Thanks
> >Dipankar
>
> But isnt this file counting a small problem ?
>
> This small program can eat all available memory.
>
> Fixing the 'file count' wont fix the real problem : Batch freeing is
good
> but should be limited so that not more than *billions* of file struct
are
> queued for deletion.

Agreed. It is not designed to work that way, so there must be
a bug somewhere and I am trying to track it down. It could very well
be that at maxbatch=10 we are just queueing at a rate far too high
compared to processing.

> I believe we can find a solution, even if it might delay 2.6.14 becau
se
> Linus would have to release a rc5

This I am not sure, it is Linus' call. I am just trying to do the
right thing - fix the real problem.

Thanks
Dipankar

Dipankar Sarma

unread,

Oct 17, 2005, 8:01:38 AM10/17/05

to Linus Torvalds, Serge Belyshev, linux-...@vger.kernel.org, kh...@linux-fr.org, Andrew Morton, Manfred Spraul

On Sun, Oct 16, 2005 at 07:34:24PM -0700, Linus Torvalds wrote:
> On Sun, 16 Oct 2005, Dipankar Sarma wrote:
> >
> > Linus, I don't think this has anything to do with RCU grace periods
> > like we discussed previously. I measured on my 3.6GHz x86_64 and
> > found that open()/close() pair on /dev/null takes about 45500
> > cycles or 12 microseconds. [Does that sound resonable?].
>
> That sounds very slow. I can do a million open/close pairs in 4 seconds on
> a 2.5GHz G5. Maybe you tested a cold-cache case?

I measured after warming up for about a 100 times or so. It is
not a cold-cache case. I think we have a bigger problem in hand
here. I measured this with 2.6.13 and saw that I could do the
same in ~3 microseconds per iteration. It balloons to 12 microseconds
in 2.6.14-rc1. I am looking at this right now apart from the other
problems.

> I suspect this patch is worth it for the 2.6.14 timeframe, but I'll wait
> for confirmation.
>
> In fact, for 2.6.14, I'd almost do an even more minimal one. I agree with
> your changing the file counter to an atomic, but I'd rather keep that
> change for later.

Even beyond the file counter issue, we do need to address the DoS
and the open/close slowdown issue.

Thanks
Dipankar

Eric Dumazet

unread,

Oct 17, 2005, 8:17:38 AM10/17/05

to dipa...@in.ibm.com, Jean Delvare, torv...@osdl.org, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

Dipankar Sarma a écrit :

> On Mon, Oct 17, 2005 at 11:10:04AM +0200, Eric Dumazet wrote:
>>
>>Fixing the 'file count' wont fix the real problem : Batch freeing is good
>>but should be limited so that not more than *billions* of file struct are
>>queued for deletion.
>
>
> Agreed. It is not designed to work that way, so there must be
> a bug somewhere and I am trying to track it down. It could very well
> be that at maxbatch=10 we are just queueing at a rate far too high
> compared to processing.
>

I can freeze my test machine with a program that 'only' use dentries, no files.

No message, no panic, but machine becomes totally unresponsive after few seconds.

Just greping for call_rcu in kernel sources gave me another call_rcu() use
from syscalls. And yes 2.6.13 has the same problem.

Here is the killer on by HT Xeon machine (2GB ram)

Eric

stress2.c

linux-os (Dick Johnson)

unread,

Oct 17, 2005, 8:33:27 AM10/17/05

to Eric Dumazet, dipa...@in.ibm.com, Jean Delvare, torv...@osdl.org, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

No problem with linux-2.6.13.4 and ext3 file-system:
F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME CO
MMAND
4 0 1 0 16 0 1544 408 - S ? 0:00 init
[5]
[SNIPPED....]
1 0 16017 6 15 0 0 0 pdflus SW ? 0:00 [pdf
lush]
4 666 16406 5273 16 0 4464 1004 wait S tty2 0:00 -bas
h
0 666 16501 16406 18 0 1324 240 - R tty2 9:46 ./xx
x
4 0 16502 5223 15 0 4204 1248 wait S tty1 0:00 -bas
h
0 0 16563 16502 16 0 2276 584 - R tty1 0:00 ps l
axw

I just put 9:46 CPU time on your program and every thing is fine.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.13.4 on an i686 machine (5589.46 BogoMips).
Warning : 98.36% of all statistics are fiction.

****************************************************************
The information transmitted in this message is confidential and may be
privileged. Any review, retransmission, dissemination, or other use of
this information by persons or entities other than the intended recipi
ent is prohibited. If you are not the intended recipient, please notif
y Analogic Corporation immediately - by replying to this message or by
sending an email to Deliver...@analogic.com - and destroy all copie
s of this information, including any attachments, without reading or di
sclosing them.

Thank you.

Dipankar Sarma

unread,

Oct 17, 2005, 8:43:25 AM10/17/05

to Eric Dumazet, Jean Delvare, torv...@osdl.org, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

Can you try it with rcupdate.maxbatch set to 10000 in boot
command line ?

FWIW, the open/close test problem goes away if I set maxbatch to
10000. I had introduced this limit some time ago to curtail
the effect long running softirq handlers have on scheduling
latencies, which now conflicts with OOM avoidance requirements.

Thanks
Dipankar

Eric Dumazet

unread,

Oct 17, 2005, 9:29:45 AM10/17/05

to dipa...@in.ibm.com, Jean Delvare, torv...@osdl.org, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

Dipankar Sarma a écrit :
> On Mon, Oct 17, 2005 at 02:10:09PM +0200, Eric Dumazet wrote:
>
>>Dipankar Sarma a écrit :
>>
>>>On Mon, Oct 17, 2005 at 11:10:04AM +0200, Eric Dumazet wrote:
>>>
>>>Agreed. It is not designed to work that way, so there must be
>>>a bug somewhere and I am trying to track it down. It could very well
>>>be that at maxbatch=10 we are just queueing at a rate far too high
>>>compared to processing.
>>>
>>
>>I can freeze my test machine with a program that 'only' use dentries,
no
>>files.
>>
>>No message, no panic, but machine becomes totally unresponsive after
few
>>seconds.
>>
>>Just greping for call_rcu in kernel sources gave me another call_rcu(
) use
>>from syscalls. And yes 2.6.13 has the same problem.
>
>
> Can you try it with rcupdate.maxbatch set to 10000 in boot
> command line ?
>

Changing maxbatch from 10 to 10000 cures the problem.
Maybe we could initialize maxbatch to (10000000/HZ), considering no cur
rent
cpu is able to queue more than 10.000.000 items per second in a list.

> FWIW, the open/close test problem goes away if I set maxbatch to
> 10000. I had introduced this limit some time ago to curtail
> the effect long running softirq handlers have on scheduling
> latencies, which now conflicts with OOM avoidance requirements.

Yes, and probably OOM avoidance has a higher priority than latencies in
DOS
situations...

Eric

Dipankar Sarma

unread,

Oct 17, 2005, 9:39:48 AM10/17/05

to Eric Dumazet, Jean Delvare, torv...@osdl.org, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

On Mon, Oct 17, 2005 at 03:28:22PM +0200, Eric Dumazet wrote:
> Dipankar Sarma a écrit :
> >On Mon, Oct 17, 2005 at 02:10:09PM +0200, Eric Dumazet wrote:
> >
> >
> >Can you try it with rcupdate.maxbatch set to 10000 in boot
> >command line ?
> >
>
> Changing maxbatch from 10 to 10000 cures the problem.
> Maybe we could initialize maxbatch to (10000000/HZ), considering no c
urrent
> cpu is able to queue more than 10.000.000 items per second in a list.

I don't know, maybe I can look at a more adaptive heuristics.

>
>
> >FWIW, the open/close test problem goes away if I set maxbatch to
> >10000. I had introduced this limit some time ago to curtail
> >the effect long running softirq handlers have on scheduling
> >latencies, which now conflicts with OOM avoidance requirements.
>
> Yes, and probably OOM avoidance has a higher priority than latencies
in DOS
> situations...

Yes, one would think. But the audio guys would chew my head for this :)

Thanks
Dipankar

Eric Dumazet

unread,

Oct 17, 2005, 10:55:48 AM10/17/05

to Eric Dumazet, dipa...@in.ibm.com, Jean Delvare, torv...@osdl.org, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

Eric Dumazet a écrit :

> Dipankar Sarma a écrit :
>
>> On Mon, Oct 17, 2005 at 02:10:09PM +0200, Eric Dumazet wrote:
>>
>>> Dipankar Sarma a écrit :
>>>
>>>> On Mon, Oct 17, 2005 at 11:10:04AM +0200, Eric Dumazet wrote:
>>>>
>>>> Agreed. It is not designed to work that way, so there must be
>>>> a bug somewhere and I am trying to track it down. It could very we
ll
>>>> be that at maxbatch=10 we are just queueing at a rate far too hi
gh
>>>> compared to processing.
>>>>
>>>
>>> I can freeze my test machine with a program that 'only' use dentrie
s,
>>> no files.
>>>
>>> No message, no panic, but machine becomes totally unresponsive afte
r
>>> few seconds.
>>>
>>> Just greping for call_rcu in kernel sources gave me another

>>> call_rcu() use from syscalls. And yes 2.6.13 has the same problem.

>>
>>
>>
>> Can you try it with rcupdate.maxbatch set to 10000 in boot
>> command line ?
>>
>
> Changing maxbatch from 10 to 10000 cures the problem.
> Maybe we could initialize maxbatch to (10000000/HZ), considering no

> current cpu is able to queue more than 10.000.000 items per second in
a
> list.
>

Well... after one 90 minutes of stress, I got an OOM even with maxbatch
=10000

Out of Memory : Killed process 1759 (mysqld)

Maybe because on this HT machine, all (timer and network) interrupts ar
e taken
by CPU0.

So if the user program is bound on CPU1, may be this cpu only performs
syscalls and no rcu state change at all.

Oct 17 18:24:25 localhost kernel: oom-killer: gfp_mask=0xd0, order=
0
Oct 17 18:24:25 localhost kernel: Mem-info:
Oct 17 18:24:25 localhost kernel: DMA per-cpu:
Oct 17 18:24:25 localhost kernel: cpu 0 hot: low 2, high 6, batch 1 use
d:5
Oct 17 18:24:25 localhost kernel: cpu 0 cold: low 0, high 2, batch 1 us
ed:1
Oct 17 18:24:25 localhost kernel: cpu 1 hot: low 2, high 6, batch 1 use
d:2
Oct 17 18:24:25 localhost kernel: cpu 1 cold: low 0, high 2, batch 1 us
ed:0
Oct 17 18:24:25 localhost kernel: Normal per-cpu:
Oct 17 18:24:25 localhost kernel: cpu 0 hot: low 62, high 186, batch 31
used:168
Oct 17 18:24:25 localhost kernel: cpu 0 cold: low 0, high 62, batch 31
used:55
Oct 17 18:24:25 localhost kernel: cpu 1 hot: low 62, high 186, batch 31
used:95
Oct 17 18:24:25 localhost kernel: cpu 1 cold: low 0, high 62, batch 31
used:33
Oct 17 18:24:25 localhost kernel: HighMem per-cpu:
Oct 17 18:26:17 localhost kernel: cpu 0 hot: low 62, high 186, batch 31
used:166
Oct 17 18:26:17 localhost kernel: cpu 0 cold: low 0, high 62, batch 31
used:29
Oct 17 18:26:17 localhost kernel: cpu 1 hot: low 62, high 186, batch 31
used:176
Oct 17 18:26:17 localhost kernel: cpu 1 cold: low 0, high 62, batch 31
used:13
Oct 17 18:26:17 localhost kernel: Free pages: 1136620kB (1129392kB
HighMem)
Oct 17 18:26:17 localhost kernel: Active:8040 inactive:3876 dirty:1
writeback:0 unstable:0 free:284155 slab:218548 mapped:8064 pagetables:1
30
Oct 17 18:26:17 localhost kernel: DMA free:3588kB min:68kB low:84kB hig
h:100kB
active:0kB inactive:0kB present:16384kB pages_scanned:246 all_unreclaim
able? no
Oct 17 18:26:17 localhost kernel: lowmem_reserve[]: 0 880 2031
Oct 17 18:26:17 localhost kernel: Normal free:3640kB min:3756kB low:469
2kB
high:5632kB active:76kB inactive:24kB present:901120kB pages_scanned:85
81
all_unreclaimable? no
Oct 17 18:26:17 localhost kernel: lowmem_reserve[]: 0 0 9215
Oct 17 18:26:17 localhost kernel: HighMem free:1129392kB min:512kB low:
640kB
high:768kB active:32084kB inactive:15480kB present:1179520kB pages_scan
ned:0
all_unreclaimable? no
Oct 17 18:26:17 localhost kernel: lowmem_reserve[]: 0 0 0
Oct 17 18:26:17 localhost kernel: DMA: 1*4kB 0*8kB 0*16kB 0*32kB 0*64kB

0*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 3588kB
Oct 17 18:26:17 localhost kernel: Normal: 0*4kB 1*8kB 1*16kB 1*32kB 0*6
4kB
0*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 3640kB
Oct 17 18:26:17 localhost kernel: HighMem: 518*4kB 301*8kB 119*16kB 54*
32kB
22*64kB 13*128kB 6*256kB 1*512kB 0*1024kB 1*2048kB 272*4096kB = 11293
92kB
Oct 17 18:26:17 localhost kernel: Swap cache: add 0, delete 0, find 0/0
, race 0+0
Oct 17 18:26:17 localhost kernel: Free swap = 1012016kB
Oct 17 18:26:17 localhost kernel: Total swap = 1012016kB
Oct 17 18:26:17 localhost kernel: Free swap: 1012016kB
Oct 17 18:26:17 localhost kernel: 524256 pages of RAM
Oct 17 18:26:17 localhost kernel: 294880 pages of HIGHMEM
Oct 17 18:26:17 localhost kernel: 5472 reserved pages
Oct 17 18:26:17 localhost kernel: 11361 pages shared
Oct 17 18:26:18 localhost kernel: 0 pages swap cached
Oct 17 18:26:18 localhost kernel: 1 pages dirty
Oct 17 18:26:18 localhost kernel: 0 pages writeback
Oct 17 18:26:18 localhost kernel: 8064 pages mapped
Oct 17 18:26:18 localhost kernel: 218548 pages slab
Oct 17 18:26:18 localhost kernel: 130 pages pagetables
Oct 17 18:26:18 localhost kernel: Out of Memory: Killed process 1759 (m
ysqld).

Linus Torvalds

unread,

Oct 17, 2005, 11:43:15 AM10/17/05

to Dipankar Sarma, Eric Dumazet, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

On Mon, 17 Oct 2005, Dipankar Sarma wrote:
>
> Agreed. It is not designed to work that way, so there must be
> a bug somewhere and I am trying to track it down. It could very well
> be that at maxbatch=10 we are just queueing at a rate far too high
> compared to processing.

That sounds sane.

I suspect that the real fix for 2.6.14 might be to update maxbatch to be
much higher by default.

The thing is, that batching really is fundamentally wrong. If we have a
thousand thing to free, we can't just free ten of them, and leave the 990
others to wait for next time. I realize people want real-time, but
if it's INCORRECT, then real-time isn't real-time.

I just checked: increasing "maxbatch" from 10 to 10000 does fix the
problem.

> This I am not sure, it is Linus' call. I am just trying to do the
> right thing - fix the real problem.

It sure looks like the batch limiter is the fundamental problem.

Instead of limiting the batching, we should likely try to avoid the RCU
lists getting huge in the first place - ie do the RCU callback processing
more often if the list is getting longer.

So I suspect that the _real_ fix is:

- for 2.6.14: remove the batching limig (or just make it much higher for
now)

- post-14: work on making sure rcu callbacks are done in a more timely
manner when the rcu queue gets long. This would involve TIF_RCUPENDING
and whatever else to make sure that we have timely quiescent periods,
and we do the RCU callback tasklet more often if the queue is long.

Hmm?

Linus

Eric Dumazet

unread,

Oct 17, 2005, 12:02:47 PM10/17/05

to Linus Torvalds, Dipankar Sarma, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

Linus Torvalds a écrit :

> So I suspect that the _real_ fix is:
>
> - for 2.6.14: remove the batching limig (or just make it much higher
for
> now)

I would just remove it. If the limit is wrong, we crash again. And the
realtime guys already are pissed off by batch=10000 anyway.

>
> - post-14: work on making sure rcu callbacks are done in a more time
ly
> manner when the rcu queue gets long. This would involve TIF_RCUPEN
DING
> and whatever else to make sure that we have timely quiescent perio
ds,
> and we do the RCU callback tasklet more often if the queue is long
.
>

Absolutely. Keeping a count of (percpu) queued items is basically free
if kept
in the cache line used by list head, so the 'queue length on this cpu'
is a
cheap metric.

A 'realtime refinement' would be to use a different maxbatch limit depe
nding
on the caller's priority : Let a softirq thread have a lower batch coun
t than
a regular user thread.

Eric

Linus Torvalds

unread,

Oct 17, 2005, 12:17:34 PM10/17/05

to Eric Dumazet, Dipankar Sarma, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

On Mon, 17 Oct 2005, Eric Dumazet wrote:
>
> I would just remove it. If the limit is wrong, we crash again. And the
> realtime guys already are pissed off by batch=10000 anyway.

Normally I would too, but I'm still hoping I could do a 2.6.14 tonight. I
guess that's unreasonable (swtlb issues etc), but for now I just committed
the one-liner.

> Absolutely. Keeping a count of (percpu) queued items is basically free if kept
> in the cache line used by list head, so the 'queue length on this cpu' is a
> cheap metric.

Yes. I did something broken like that before Dipankar pointed me at
batching.

The only downside to TIF_RCUUPDATE is that those damn TIF-flags are
per-architecture (probably largely unnecessary, but while most
architectures don't care at all, others seem to have optimized their
layout so that they can test the work bits more efficiently). So it's a
matter of each architecture being updated with its TIF_xyz flag and their
work function.

Anybody willing to try? Dipankar apparently has a lot on his plate, this
_should_ be fairly straightforward. Eric?

Linus

Dipankar Sarma

unread,

Oct 17, 2005, 12:27:27 PM10/17/05

to Linus Torvalds, Eric Dumazet, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

On Mon, Oct 17, 2005 at 08:42:05AM -0700, Linus Torvalds wrote:
>
> On Mon, 17 Oct 2005, Dipankar Sarma wrote:
>
> > This I am not sure, it is Linus' call. I am just trying to do the
> > right thing - fix the real problem.
>
> It sure looks like the batch limiter is the fundamental problem.
>
> Instead of limiting the batching, we should likely try to avoid the RCU
> lists getting huge in the first place - ie do the RCU callback processing
> more often if the list is getting longer.
>
> So I suspect that the _real_ fix is:
>
> - for 2.6.14: remove the batching limig (or just make it much higher for
> now)

You can remove the batching limit by making maxbatch = 0 by default.
Just a one line patch.

> - post-14: work on making sure rcu callbacks are done in a more timely
> manner when the rcu queue gets long. This would involve TIF_RCUPENDING
> and whatever else to make sure that we have timely quiescent periods,
> and we do the RCU callback tasklet more often if the queue is long.

Yes, I am already looking at this. There are a number approaches
to this include adaptive algorithm to cater to naughty corner
cases and/or adding different ways to handle RCU as in
tree. I hope to experiment with these incrementally after 2.6.14 over
a period of time and see what works best for most people.

Thanks
Dipankar

Dipankar Sarma

unread,

Oct 17, 2005, 12:29:59 PM10/17/05

to Eric Dumazet, Linus Torvalds, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

On Mon, Oct 17, 2005 at 06:01:31PM +0200, Eric Dumazet wrote:
> Linus Torvalds a écrit :

>
> >
> > - post-14: work on making sure rcu callbacks are done in a more tim
ely
> > manner when the rcu queue gets long. This would involve TIF_RCUPE
NDING
> > and whatever else to make sure that we have timely quiescent peri
ods,
> > and we do the RCU callback tasklet more often if the queue is lon
g.
> >
>
> Absolutely. Keeping a count of (percpu) queued items is basically fre
e if
> kept in the cache line used by list head, so the 'queue length on thi
s cpu'
> is a cheap metric.

Or 'sudden increase in queue length on this cpu' :)

> A 'realtime refinement' would be to use a different maxbatch limit

> depending on the caller's priority : Let a softirq thread have a lowe

r
> batch count than a regular user thread.

Yes, would be interesting.

Thanks
Dipankar

Lee Revell

unread,

Oct 17, 2005, 12:32:55 PM10/17/05

to Eric Dumazet, Linus Torvalds, Dipankar Sarma, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

On Mon, 2005-10-17 at 18:01 +0200, Eric Dumazet wrote:
> A 'realtime refinement' would be to use a different maxbatch limit depending

> on the caller's priority : Let a softirq thread have a lower batch count than
> a regular user thread.

Or just make the whole thing preemptible like in the -rt tree and forget
about it.

Lee

Dipankar Sarma

unread,

Oct 17, 2005, 12:36:14 PM10/17/05

to Linus Torvalds, Eric Dumazet, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

On Mon, Oct 17, 2005 at 09:16:25AM -0700, Linus Torvalds wrote:
>
> > Absolutely. Keeping a count of (percpu) queued items is basically free if kept
> > in the cache line used by list head, so the 'queue length on this cpu' is a
> > cheap metric.
>

> The only downside to TIF_RCUUPDATE is that those damn TIF-flags are
> per-architecture (probably largely unnecessary, but while most
> architectures don't care at all, others seem to have optimized their
> layout so that they can test the work bits more efficiently). So it's a
> matter of each architecture being updated with its TIF_xyz flag and their
> work function.
>
> Anybody willing to try? Dipankar apparently has a lot on his plate, this
> _should_ be fairly straightforward. Eric?

I *had*, when this hit me :) It was one those spurt things. I am going to
look at this, but I think we will need to do this with some careful
benchmarking.

At the moment however I do have another concern - open/close taking too
much time as I mentioned in an earlier email. It is nearly 4 times
slower than 2.6.13. So, that is first up in my list of things to
do at the moment.

Thanks
Dipankar

Eric Dumazet

unread,

Oct 17, 2005, 2:02:30 PM10/17/05

to dipa...@in.ibm.com, Linus Torvalds, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

Dipankar Sarma a écrit :

> On Mon, Oct 17, 2005 at 09:16:25AM -0700, Linus Torvalds wrote:
>
>>>Absolutely. Keeping a count of (percpu) queued items is basically free if kept
>>>in the cache line used by list head, so the 'queue length on this cpu' is a
>>>cheap metric.
>>
>>The only downside to TIF_RCUUPDATE is that those damn TIF-flags are
>>per-architecture (probably largely unnecessary, but while most
>>architectures don't care at all, others seem to have optimized their
>>layout so that they can test the work bits more efficiently). So it's a
>>matter of each architecture being updated with its TIF_xyz flag and their
>>work function.
>>
>>Anybody willing to try? Dipankar apparently has a lot on his plate, this
>>_should_ be fairly straightforward. Eric?
>
>
> I *had*, when this hit me :) It was one those spurt things. I am going to
> look at this, but I think we will need to do this with some careful
> benchmarking.
>
> At the moment however I do have another concern - open/close taking too
> much time as I mentioned in an earlier email. It is nearly 4 times
> slower than 2.6.13. So, that is first up in my list of things to
> do at the moment.
>

<lazy_mode=ON>
Do we really need a TIF_RCUUPDATE flag, or could we just ask for a resched ?
</lazy_mode>

This patch only take care of call_rcu(), I'm unsure of what can be done inside
call_rcu_bh()

The two stress program dont hit OOM anymore with this patch applied (even with
maxbatch=10)

Eric

rcu_patch.1

Dipankar Sarma

unread,

Oct 17, 2005, 2:22:40 PM10/17/05

to Linus Torvalds, Eric Dumazet, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

On Mon, Oct 17, 2005 at 09:59:30PM +0530, Dipankar Sarma wrote:
> On Mon, Oct 17, 2005 at 09:16:25AM -0700, Linus Torvalds wrote:
>
> At the moment however I do have another concern - open/close taking too
> much time as I mentioned in an earlier email. It is nearly 4 times
> slower than 2.6.13. So, that is first up in my list of things to
> do at the moment.

Please ignore this. This is a big Doh! slab debugging snuck into
my config file because I was trying to track down the
"bad page state" problem again. Without it, open/close in 2.6.14-rc1
is just as fast as 2.6.13 - ~3 microseconds per pair.

Dipankar Sarma

unread,

Oct 17, 2005, 2:37:55 PM10/17/05

to Eric Dumazet, Linus Torvalds, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

On Mon, Oct 17, 2005 at 08:01:21PM +0200, Eric Dumazet wrote:
> Dipankar Sarma a écrit :
> >On Mon, Oct 17, 2005 at 09:16:25AM -0700, Linus Torvalds wrote:
> >
>
> <lazy_mode=ON>
> Do we really need a TIF_RCUUPDATE flag, or could we just ask for a re
sched ?
> </lazy_mode>

I think the theory was that we have to process the callbacks,
not just force the grace period by setting need_resched.
That is what TIF_RCUUPDATE indicates - rcus to process.

> This patch only take care of call_rcu(), I'm unsure of what can be do
ne
> inside call_rcu_bh()
>
> The two stress program dont hit OOM anymore with this patch applied (
even
> with maxbatch=10)

Hmm.. I am supprised that maxbatch=10 still allowed you keep up
with a continuously queueing cpu. OK, I will look at this.

Linus Torvalds

unread,

Oct 17, 2005, 2:38:47 PM10/17/05

to Eric Dumazet, dipa...@in.ibm.com, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

On Mon, 17 Oct 2005, Eric Dumazet wrote:
>
> <lazy_mode=ON>
> Do we really need a TIF_RCUUPDATE flag, or could we just ask for a resched ?
> </lazy_mode>

Hmm.. Your patch looks very much like one I tried already, but the big
difference being that I just cleared the count when doing the rcu
callback. That was because I hadn't realized the importance of the
maxbatch thing (so it didn't work for me, like it did for you).

Still - the actual RCU callback will only be called at the next timer tick
or whatever as far as I can tell, so the first time you'll still have a
_long_ RCU queue (and thus bad latency).

I guess that's inevitable - and TIF_RCUUPDATE wouldn't even help, because
we still need to wait for the _other_ CPU's to get to their RCU quiescent
event.

However, that leaves us with the nasty situation that we'll ve very
inefficient: we'll do "maxbatch" RCU entries, and then return, and then
force a whole re-schedule. That just can't be good.

How about instead of depending on "maxbatch", we'd depend on
"need_resched()"? Mabe the "maxbatch" be a _minbatch_ thing, and then once
we've done the minimum amount we _need_ to do (or emptied the RCU queue)
we start honoring need_resched(), and return early if we do?

That, together with your patch, should work, without causing ludicrous
"reschedule every ten system calls" behaviour..

Hmm?

Linus

Linus Torvalds

unread,

Oct 17, 2005, 2:41:34 PM10/17/05

to Dipankar Sarma, Eric Dumazet, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

On Mon, 17 Oct 2005, Dipankar Sarma wrote:
>
> At the moment however I do have another concern - open/close taking too
> much time as I mentioned in an earlier email. It is nearly 4 times
> slower than 2.6.13. So, that is first up in my list of things to
> do at the moment.

It's not slower for me. For me, lmbench shows open/close as being pretty
stable at least since 2.6.12.

Are you sure that your dentry cache tests haven't just filled up the
dentry lists so much that when you compare open/close performance after
the dentry tests, they seem much slower than your numbers from before?

If you run something that fills up the dentry cache, open/close will be
slower just because the open part will have to traverse longer hash
chains.

Linus

Linus Torvalds

unread,

Oct 17, 2005, 3:01:27 PM10/17/05

to Dipankar Sarma, Eric Dumazet, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

On Tue, 18 Oct 2005, Dipankar Sarma wrote:

> On Mon, Oct 17, 2005 at 08:01:21PM +0200, Eric Dumazet wrote:
> > Dipankar Sarma a écrit :
> > >On Mon, Oct 17, 2005 at 09:16:25AM -0700, Linus Torvalds wrote:
> > >
> >
> > <lazy_mode=ON>

> > Do we really need a TIF_RCUUPDATE flag, or could we just ask for a resched ?

> > </lazy_mode>
>
> I think the theory was that we have to process the callbacks,
> not just force the grace period by setting need_resched.
> That is what TIF_RCUUPDATE indicates - rcus to process.

I'm having second thoughts about that, since the problem (in SMP) is that
even if the currently active process tries to more proactively handle RCU
events rather than just setting the grace period, in order to do that
you'd still need to wait for the other CPU's to have their quiescent
phase.

So the RCU queues can grow long, if only because the other CPU's won't
necessarily do the same.

So we probably cannot throttle RCU queues down, and they will inevitably
have to be able to grow pretty long.

> Hmm.. I am supprised that maxbatch=10 still allowed you keep up
> with a continuously queueing cpu. OK, I will look at this.

I think it's just because it ends up rescheduling a lot, and thus waking
up softirqd.

The RCU thing is done as a tasklet, which means that
- it starts out as a "synchronous" softirq event, at which point it gets
called at most X times (MAX_SOFTIRQ_RESTART, defaults to 10)
- after that, we end up saying "uhhuh, this is using too much softirq
time" and instead just run the softirq as a kernel thread.
- setting TIF_NEEDRESCHED whenever the rcu lists are long will keep on
rescheduling to the softirq thread much more aggressively.

See __do_softirq() for some of this softirq (and thus tasklet) handling.

I suspect it's _very_ inefficient, but maybe the bad case triggers so
seldom that we don't really need to care.

Linus

Eric Dumazet

unread,

Oct 17, 2005, 3:13:30 PM10/17/05

to Linus Torvalds, dipa...@in.ibm.com, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

Linus Torvalds a écrit :

>
> On Mon, 17 Oct 2005, Eric Dumazet wrote:
>
>><lazy_mode=ON>
>>Do we really need a TIF_RCUUPDATE flag, or could we just ask for a re
sched ?
>></lazy_mode>
>
>
> Hmm.. Your patch looks very much like one I tried already, but the bi
g
> difference being that I just cleared the count when doing the rcu
> callback. That was because I hadn't realized the importance of the
> maxbatch thing (so it didn't work for me, like it did for you).
>
> Still - the actual RCU callback will only be called at the next timer
tick
> or whatever as far as I can tell, so the first time you'll still have
a
> _long_ RCU queue (and thus bad latency).
>
> I guess that's inevitable - and TIF_RCUUPDATE wouldn't even help, bec
ause
> we still need to wait for the _other_ CPU's to get to their RCU quies
cent
> event.
>
> However, that leaves us with the nasty situation that we'll ve very
> inefficient: we'll do "maxbatch" RCU entries, and then return, and th
en
> force a whole re-schedule. That just can't be good.
>

Thats strange, because on my tests it seems that I dont have one resche
dule
for 'maxbatch' items. Doing 'grep filp /proc/slabinfo' it seems I have
one
'schedule' then filp count goes back to 1000.

vmstat shows about 150 context switches per second.

(This machines does 1.000.000 pair of open/close in 4.88 seconds)

oprofile data shows verly little schedule overhead :

CPU: P4 / Xeon with 2 hyper-threads, speed 1993.83 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 100000
samples % symbol name
132578 11.3301 path_lookup
104788 8.9551 __d_lookup
85220 7.2829 link_path_walk
63013 5.3851 sysenter_past_esp
53287 4.5539 _atomic_dec_and_lock
45825 3.9162 chrdev_open
43105 3.6837 get_unused_fd
39948 3.4139 kmem_cache_alloc
38308 3.2738 strncpy_from_user
35738 3.0542 rcu_do_batch
31850 2.7219 __link_path_walk
31355 2.6796 get_empty_filp
25941 2.2169 kmem_cache_free
24455 2.0899 __fput
24422 2.0871 sys_close
19814 1.6933 filp_dtor
19616 1.6764 free_block
19000 1.6237 open_namei
18214 1.5566 fput
15991 1.3666 fd_install
14394 1.2301 file_kill
14365 1.2276 call_rcu
14338 1.2253 kref_put
13679 1.1690 file_move
13646 1.1662 schedule
13456 1.1499 getname
13019 1.1126 kref_get

Linus Torvalds

unread,

Oct 17, 2005, 3:31:31 PM10/17/05

to Eric Dumazet, dipa...@in.ibm.com, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

On Mon, 17 Oct 2005, Eric Dumazet wrote:
>
> Thats strange, because on my tests it seems that I dont have one reschedule
> for 'maxbatch' items. Doing 'grep filp /proc/slabinfo' it seems I have one
> 'schedule' then filp count goes back to 1000.

Hmm.

I think you're right, but for all the wrong reasons.

"maxbatch" ends up not actually having any real effect in the end: after
the tasklet ends up running in softirqd, softirqd will actually keep on
calling the tasklet code until it doesn't get rescheduled any more ;)

So it will do "maxbatch" RCU entries, reschedule itself, return, and
immediately get called again.

Heh.

The _good_ news is that since it ends up running in softirqd (after the
first ten times - the softirq code in kernel/softirq.c will start off
calling it ten times _first_), it can be scheduled away, so it actually
ends up helping latency.

Which means that we actually end up doing exactly the right thing,
although for what appears to be the wrong reasons (or very lucky ones).

The _bad_ news is that softirqd is running at nice +19, so I suspect that
with some unlucky patterns it's probably pretty easy to make sure that
ksoftirqd doesn't actually run very often at all!

Gaah. So close, yet so far. I'm _almost_ willing to just undo my "make
maxbatch huge" patch, and apply your patch, because now that I see how it
all happens to work together I'm convinced that it _almost_ works. Even if
it seems to be mostly by luck(*) rather than anything else.

Linus

(*) Not strictly true. It may not be by design of the RCU code itself, but
it's definitely by design of the softirq's being designed to be robust and
have good latency behaviour. So it does work by design, but it works by
softirq design rather than RCU design ;)

Eric Dumazet

unread,

Oct 17, 2005, 3:40:10 PM10/17/05

to Linus Torvalds, dipa...@in.ibm.com, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

Linus Torvalds a écrit :

:)

What about call_rcu_bh() which I left unchanged ? At least one of my
production machine cannot live very long unless I have maxbatch = 300
, because
of an insane large tcp route cache (and one of its CPU almost filled by

softirq NIC processing)

Linus Torvalds

unread,

Oct 17, 2005, 4:15:25 PM10/17/05

to Eric Dumazet, dipa...@in.ibm.com, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

On Mon, 17 Oct 2005, Eric Dumazet wrote:
>
> What about call_rcu_bh() which I left unchanged ? At least one of my

> production machine cannot live very long unless I have maxbatch = 300, because

> of an insane large tcp route cache (and one of its CPU almost filled by
> softirq NIC processing)

I think we'll have to release 2.6.14 with maxbatch at the high value
(10000).

Yes, it may screw up some latency stuff, but quite frankly, even with your
patch and even ignoring the call_rcu_bh case, I'm convinced you can easily
get into the situation where softirqd just doesn't run soon enough.

But at least I think I understand _why_ rcu processing was delayed.

I think a real fix might have to involve more explicit knowledge of
tasklet behaviour and softirq interaction.

Linus

Christopher Friesen

unread,

Oct 17, 2005, 4:26:28 PM10/17/05

to Linus Torvalds, Eric Dumazet, dipa...@in.ibm.com, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

Linus Torvalds wrote:

> Yes, it may screw up some latency stuff, but quite frankly, even with your
> patch and even ignoring the call_rcu_bh case, I'm convinced you can easily
> get into the situation where softirqd just doesn't run soon enough.
>
> But at least I think I understand _why_ rcu processing was delayed.

Could this be related to the "rename14 LTP test with /tmp as tmpfs and
HIGHMEM causes OOM-killer invocation due to zone normal exhaustion" issue?

Chris

Dipankar Sarma

unread,

Oct 17, 2005, 4:30:44 PM10/17/05

to Christopher Friesen, Linus Torvalds, Eric Dumazet, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

On Mon, Oct 17, 2005 at 02:25:17PM -0600, Christopher Friesen wrote:
> Linus Torvalds wrote:
>
> >Yes, it may screw up some latency stuff, but quite frankly, even with your
> >patch and even ignoring the call_rcu_bh case, I'm convinced you can easily
> >get into the situation where softirqd just doesn't run soon enough.
> >
> >But at least I think I understand _why_ rcu processing was delayed.
>
> Could this be related to the "rename14 LTP test with /tmp as tmpfs and
> HIGHMEM causes OOM-killer invocation due to zone normal exhaustion" issue?

Could very well be. Chris, could you please try booting
with rcupdate.maxbatch=10000 and see if the problem goes away ?

Thanks
Dipankar

Dipankar Sarma

unread,

Oct 17, 2005, 4:40:52 PM10/17/05

to Linus Torvalds, Eric Dumazet, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

On Mon, Oct 17, 2005 at 01:14:20PM -0700, Linus Torvalds wrote:
>
>
> On Mon, 17 Oct 2005, Eric Dumazet wrote:
> >
> > What about call_rcu_bh() which I left unchanged ? At least one of my
> > production machine cannot live very long unless I have maxbatch = 300, because
> > of an insane large tcp route cache (and one of its CPU almost filled by
> > softirq NIC processing)
>
> I think we'll have to release 2.6.14 with maxbatch at the high value
> (10000).

Is 10000 enough ? Eric seemed to find a problem even with this
after 90 minutes ?

> Yes, it may screw up some latency stuff, but quite frankly, even with your
> patch and even ignoring the call_rcu_bh case, I'm convinced you can easily
> get into the situation where softirqd just doesn't run soon enough.
>
> But at least I think I understand _why_ rcu processing was delayed.
>
> I think a real fix might have to involve more explicit knowledge of
> tasklet behaviour and softirq interaction.

Agreed. I am now looking at characterizing the corner cases that
can get us into trouble and checking what pattern of processing
is appropriate to cover them all. It will take some time to
sort this out making sure that it satisfies most requirements
reasonably.

Thanks
Dipankar

Linus Torvalds

unread,

Oct 17, 2005, 4:40:52 PM10/17/05

to Christopher Friesen, Eric Dumazet, dipa...@in.ibm.com, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

On Mon, 17 Oct 2005, Christopher Friesen wrote:
>
> Could this be related to the "rename14 LTP test with /tmp as tmpfs and HIGHMEM
> causes OOM-killer invocation due to zone normal exhaustion" issue?

Yes.

You can try the current git tree, or just change "maxbatch" from 10 to
10000 in your own tree, and see if it makes a difference.

I would not be surprised at all if this turns out to be the exact same
issue, for the exact same reason.

Eric's patch is also likely to fix it (if the "maxbatch" change fixes it),
since I suspect that under _practical_ load Eric's patch works fine.

The advantage of Eric's patch is that it shouldn't have any latency
downsides, so Eric's is in many ways preferable to just increasing
maxbatch. I just can't convince myself that it's really always going to
fix the problem.

If somebody else can, holler.

Linus

Linus Torvalds

unread,

Oct 17, 2005, 6:42:53 PM10/17/05

to Eric Dumazet, dipa...@in.ibm.com, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

On Mon, 17 Oct 2005, Linus Torvalds wrote:
> On Mon, 17 Oct 2005, Eric Dumazet wrote:
> >
> > What about call_rcu_bh() which I left unchanged ? At least one of my
> > production machine cannot live very long unless I have maxbatch = 300, because
> > of an insane large tcp route cache (and one of its CPU almost filled by
> > softirq NIC processing)
>
> I think we'll have to release 2.6.14 with maxbatch at the high value
> (10000).

Btw, I'm going to apply your patch in _addition_ to the bigger maxbatch
value.

It might help latency a bit, but more importantly, on one of my machines
(but only one - it probably depends on how much memory you have etc), I
can re-create the out-of-file-descriptors thing even with a maxbatch of a
million.

Probably what happens is that the rcu callbacks just grow fast enough
without any quiescent period that the maxbatch thing just never matters:
we simply run out of file descriptors because we haven't even gotten
around to trying to free them yet.

I'm compiling with your patch on that machine to verify that it does
actually help keep the queues down. Just doing a

while : ; do cat /proc/slabinfo | grep filp; sleep 1; done

while running the test programs gives some alarming numbers as-is.

Your patch keeps the numbers _much_ more stable.

Regardless, keeping track of the number of rcu callback events we have
will almost inevitably be part of whatever future strategy we take, so
your patch is definitely a step in the right direction, even if we have to
tweak it later.

Paul E. McKenney

unread,

Oct 17, 2005, 7:00:00 PM10/17/05

to Eric Dumazet, dipa...@in.ibm.com, Linus Torvalds, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

On Mon, Oct 17, 2005 at 08:01:21PM +0200, Eric Dumazet wrote:

Keeping the per-CPU count of queued callbacks seems eminently reasonabl
e
to me, as does the set_need_resched(). But the proposed (but fortunate
ly
commented out) call of rcu_do_batch() from call_rcu() does have deadloc
k
issues.

> Eric
>

> --- linux-2.6.14-rc4/kernel/rcupdate.c 2005-10-11 03:19:19.000000000
+0200
> +++ linux-2.6.14-rc4-ed/kernel/rcupdate.c 2005-10-17 21:52:18.0000000
00 +0200
> @@ -109,6 +109,10 @@
> rdp = &__get_cpu_var(rcu_data);
> *rdp->nxttail = head;
> rdp->nxttail = &head->next;
> +
> + if (unlikely(++rdp->count > 10000))
> + set_need_resched();
> +
> local_irq_restore(flags);
> }
>
> @@ -140,6 +144,12 @@
> rdp = &__get_cpu_var(rcu_bh_data);
> *rdp->nxttail = head;
> rdp->nxttail = &head->next;
> + rdp->count++;

Really need an "rdp->count++" in call_rcu_bh() as well, otherwise
the _bh struct rcu_data will have a steadily decreasing count field.
Strictly speaking, this is harmless, since call_rcu_bh() cheerfully
ignores this field, but this situation is bound to cause severe confusi
on
at some point.

> +/*
> + * Should we directly call rcu_do_batch() here ?
> + * if (unlikely(rdp->count > 10000))
> + * rcu_do_batch(rdp);
> + */

Good thing that the above is commented out! ;-)

Doing this can result in self-deadlock, for example with the following:

spin_lock(&mylock);
/* do some stuff. */
call_rcu(&p->rcu_head, my_rcu_callback);
/* do some more stuff. */
spin_unlock(&mylock);

void my_rcu_callback(struct rcu_head *p)
{
spin_lock(&mylock);
/* self-deadlock via call_rcu() via rcu_do_batch()!!! */
spin_unlock(&mylock);
}

Thanx, Paul

> }
>
> @@ -157,6 +167,7 @@
> next = rdp->donelist = list->next;
> list->func(list);
> list = next;
> + rdp->count--;
> if (++count >= maxbatch)
> break;
> }
> --- linux-2.6.14-rc4/include/linux/rcupdate.h 2005-10-11 03:19:19.000
000000 +0200
> +++ linux-2.6.14-rc4-ed/include/linux/rcupdate.h 2005-10-17 21:02:25.
000000000 +0200
> @@ -94,6 +94,7 @@
> long batch; /* Batch # for current RCU batch */
> struct rcu_head *nxtlist;
> struct rcu_head **nxttail;
> + long count; /* # of queued items */
> struct rcu_head *curlist;
> struct rcu_head **curtail;
> struct rcu_head *donelist;

Eric Dumazet

unread,

Oct 18, 2005, 5:53:07 AM10/18/05

to pau...@us.ibm.com, dipa...@in.ibm.com, Linus Torvalds, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

Paul E. McKenney a écrit :

>
>
>>+/*
>>+ * Should we directly call rcu_do_batch() here ?
>>+ * if (unlikely(rdp->count > 10000))
>>+ * rcu_do_batch(rdp);
>>+ */
>
>
> Good thing that the above is commented out! ;-)
>
> Doing this can result in self-deadlock, for example with the followin
g:
>
> spin_lock(&mylock);
> /* do some stuff. */
> call_rcu(&p->rcu_head, my_rcu_callback);
> /* do some more stuff. */
> spin_unlock(&mylock);
>
> void my_rcu_callback(struct rcu_head *p)
> {
> spin_lock(&mylock);
> /* self-deadlock via call_rcu() via rcu_do_batch()!!! */
> spin_unlock(&mylock);
> }
>
>
> Thanx, Paul

Thanks Paul for reminding us that call_rcu() should not ever call the c
allback
function, as very well documented in Documentation/RCU/UP.txt
(Example 3: Death by Deadlock)

But is the same true for call_rcu_bh() ?

I intentionally wrote the comment to remind readers that a low maxbatch
can
trigger OOM in case a CPU is filled by some kind of DOS (network IRQ fl
ood for
example, targeting the IP dst cache)

To solve this problem, may be we could add a requirement to
call_rcu_bh/callback functions : If they have to lock a spinlock, only
use a
spin_trylock() and make them returns a status (0 : sucessfull callback,
1:
please requeue me)

As most callback functions just kfree() some memory, most of OOM would
be cleared.

int my_rcu_callback(struct rcu_head *p)
{
if (!spin_trylock(&mylock))
return 1; /* please call me later */
/* do something here */
...
spin_unlock(&mylock);
return 0;
}

(Changes to rcu_do_batch() are left as an exercice :) )

Eric

Christopher Friesen

unread,

Oct 18, 2005, 11:57:38 AM10/18/05

to dipa...@in.ibm.com, Linus Torvalds, Eric Dumazet, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

Dipankar Sarma wrote:
> On Mon, Oct 17, 2005 at 02:25:17PM -0600, Christopher Friesen wrote:

>>Could this be related to the "rename14 LTP test with /tmp as tmpfs and
>>HIGHMEM causes OOM-killer invocation due to zone normal exhaustion" issue?

> Could very well be. Chris, could you please try booting
> with rcupdate.maxbatch=10000 and see if the problem goes away ?

And sure enough, that fixes it. The dcache slab usage maxes out at
around 11MB rather than consuming all of zone normal.

Is there any downside to this option?

Chris

Paul E. McKenney

unread,

Oct 18, 2005, 12:22:27 PM10/18/05

to Eric Dumazet, dipa...@in.ibm.com, Linus Torvalds, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

On Tue, Oct 18, 2005 at 11:46:30AM +0200, Eric Dumazet wrote:
> Paul E. McKenney a écrit :
> >
> >
> >>+/*
> >>+ * Should we directly call rcu_do_batch() here ?
> >>+ * if (unlikely(rdp->count > 10000))
> >>+ * rcu_do_batch(rdp);
> >>+ */
> >
> >
> >Good thing that the above is commented out! ;-)
> >
> >Doing this can result in self-deadlock, for example with the followi
ng:
> >
> > spin_lock(&mylock);
> > /* do some stuff. */
> > call_rcu(&p->rcu_head, my_rcu_callback);
> > /* do some more stuff. */
> > spin_unlock(&mylock);
> >
> >void my_rcu_callback(struct rcu_head *p)
> >{
> > spin_lock(&mylock);
> > /* self-deadlock via call_rcu() via rcu_do_batch()!!! */
> > spin_unlock(&mylock);
> >}
> >
> >
> > Thanx, Paul
>
> Thanks Paul for reminding us that call_rcu() should not ever call the

> callback function, as very well documented in Documentation/RCU/UP.tx

t
> (Example 3: Death by Deadlock)
>
> But is the same true for call_rcu_bh() ?

Yes, same rules for this aspect of call_rcu_bh() and call_rcu().

> I intentionally wrote the comment to remind readers that a low maxbat

ch can

> trigger OOM in case a CPU is filled by some kind of DOS (network IRQ
flood
> for example, targeting the IP dst cache)
>
> To solve this problem, may be we could add a requirement to
> call_rcu_bh/callback functions : If they have to lock a spinlock, on

ly use

> a spin_trylock() and make them returns a status (0 : sucessfull callb
ack,
> 1: please requeue me)
>
> As most callback functions just kfree() some memory, most of OOM woul
d be
> cleared.
>
> int my_rcu_callback(struct rcu_head *p)
> {
> if (!spin_trylock(&mylock))
> return 1; /* please call me later */
> /* do something here */
> ...
> spin_unlock(&mylock);
> return 0;
> }
>
> (Changes to rcu_do_batch() are left as an exercice :) )

Another approach that would keep the current easier-to-use semantics
would be to schedule a tasklet or workqueue to process the callbacks
in a safe context.

Thanx, Paul