
AIM7 40% regression with 2.6.26-rc1


Zhang, Yanmin

May 6, 2008, 1:49:58 AM
to Matthew Wilcox, LKML
Comparing with kernel 2.6.25, AIM7 (use tmpfs) has a more than 40% regression with 2.6.26-rc1
on my 8-core stoakley, 16-core tigerton, and Itanium Montecito. Bisect located
the patch below.

64ac24e738823161693bf791f87adc802cf529ff is first bad commit
commit 64ac24e738823161693bf791f87adc802cf529ff
Author: Matthew Wilcox <mat...@wil.cx>
Date: Fri Mar 7 21:55:58 2008 -0500

Generic semaphore implementation


After I manually reverted the patch against 2.6.26-rc1, while fixing lots of
conflicts/errors, the aim7 regression became less than 2%.

-yanmin



Matthew Wilcox

May 6, 2008, 7:19:05 AM
to Zhang, Yanmin, LKML
On Tue, May 06, 2008 at 01:48:24PM +0800, Zhang, Yanmin wrote:
> Comparing with kernel 2.6.25, AIM7 (use tmpfs) has a more than 40% regression with 2.6.26-rc1
> on my 8-core stoakley, 16-core tigerton, and Itanium Montecito. Bisect located
> below patch.
>
> 64ac24e738823161693bf791f87adc802cf529ff is first bad commit
> commit 64ac24e738823161693bf791f87adc802cf529ff
> Author: Matthew Wilcox <mat...@wil.cx>
> Date: Fri Mar 7 21:55:58 2008 -0500
>
> Generic semaphore implementation
>
>
> After I manually reverted the patch against 2.6.26-rc1 while fixing lots of
> conflictions/errors, aim7 regression became less than 2%.

40%?! That's shocking. Can you tell which semaphore was heavily
contended? I have a horrible feeling that it's the BKL.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

Ingo Molnar

May 6, 2008, 7:45:47 AM
to Zhang, Yanmin, Matthew Wilcox, LKML, Alexander Viro, Linus Torvalds, Andrew Morton

* Zhang, Yanmin <yanmin...@linux.intel.com> wrote:

> Comparing with kernel 2.6.25, AIM7 (use tmpfs) has more than 40% with
> 2.6.26-rc1 on my 8-core stoakley, 16-core tigerton, and Itanium
> Montecito. Bisect located below patch.
>
> 64ac24e738823161693bf791f87adc802cf529ff is first bad commit
> commit 64ac24e738823161693bf791f87adc802cf529ff
> Author: Matthew Wilcox <mat...@wil.cx>
> Date: Fri Mar 7 21:55:58 2008 -0500
>
> Generic semaphore implementation
>
> After I manually reverted the patch against 2.6.26-rc1 while fixing
> lots of conflictions/errors, aim7 regression became less than 2%.

hm, which exact semaphore would that be due to?

My first blind guess would be the BKL - there's not much other semaphore
use left in the core kernel otherwise that would affect AIM7 normally.
The VFS still makes frequent use of the BKL and AIM7 is very VFS
intense. Getting rid of that BKL use from the VFS might be useful to
performance anyway.

Could you try to check that it's indeed the BKL?

Easiest way to check it would be to run AIM7 on
sched-devel.git/latest and do scheduler tracing via:

http://people.redhat.com/mingo/sched-devel.git/readme-tracer.txt

by doing:

echo stacktrace > /debug/tracing/iter_ctl

you could get exact backtraces of all scheduling points in the trace. If
the BKL's down() shows up in those traces then it's definitely the BKL
that causes this. The backtraces will also tell us exactly which BKL use
is the most frequent one.

To keep tracing overhead low on SMP i'd also suggest to only trace a
single CPU, via:

echo 1 > /debug/tracing/tracing_cpumask

Ingo

Matthew Wilcox

May 6, 2008, 8:10:17 AM
to Ingo Molnar, Zhang, Yanmin, LKML, Alexander Viro, Linus Torvalds, Andrew Morton, linux-...@vger.kernel.org
On Tue, May 06, 2008 at 01:44:49PM +0200, Ingo Molnar wrote:
> * Zhang, Yanmin <yanmin...@linux.intel.com> wrote:
> > After I manually reverted the patch against 2.6.26-rc1 while fixing
> > lots of conflictions/errors, aim7 regression became less than 2%.
>
> hm, which exact semaphore would that be due to?
>
> My first blind guess would be the BKL - there's not much other semaphore
> use left in the core kernel otherwise that would affect AIM7 normally.
> The VFS still makes frequent use of the BKL and AIM7 is very VFS
> intense. Getting rid of that BKL use from the VFS might be useful to
> performance anyway.

That's slightly slanderous to the VFS ;-) The BKL really isn't used
that much any more. So little that I've gone through and produced a
list of places it's used:

fs/block_dev.c opening and closing a block device. Unlikely to be
provoked by AIM7.
fs/char_dev.c chrdev_open. Unlikely to be provoked by AIM7.
fs/compat.c mount. Unlikely to be provoked by AIM7.
fs/compat_ioctl.c held around calls to ioctl translator.
fs/exec.c coredump. If this is a contention problem ...
fs/fcntl.c held around call to ->fasync.
fs/ioctl.c held around f_op ->ioctl call (tmpfs doesn't have
ioctl). ditto bmap. there's fasync, as previously
mentioned.
fs/locks.c hellhole. I hope AIM7 doesn't use locks.
fs/namespace.c mount, umount. Unlikely to be provoked by AIM7.
fs/read_write.c llseek. tmpfs uses the unlocked version.
fs/super.c shutdown, remount. Unlikely to be provoked by AIM7.

So the only likely things I can see are:

- file locks
- fasync
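
For illustration, the "held around call to ->fasync" entry above boils down to a
pattern like the following. This is a paraphrased sketch rather than verbatim
2.6.25 code, and the helper name is invented:

	#include <linux/fs.h>
	#include <linux/smp_lock.h>

	/* Sketch of the BKL held around the ->fasync call in fs/fcntl.c (paraphrased). */
	static int fasync_under_bkl(int fd, struct file *filp, int on)
	{
		int error = 0;

		if (filp->f_op && filp->f_op->fasync) {
			lock_kernel();		/* the BKL site listed above */
			error = filp->f_op->fasync(fd, filp, on);
			unlock_kernel();
		}
		return error;
	}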

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

Matthew Wilcox

May 6, 2008, 12:24:04 PM
to Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, Linus Torvalds, Andrew Morton, linux-...@vger.kernel.org
On Tue, May 06, 2008 at 06:09:34AM -0600, Matthew Wilcox wrote:
> So the only likely things I can see are:
>
> - file locks
> - fasync

I've wanted to fix file locks for a while. Here's a first attempt.
It was done quickly, so I concede that it may well have bugs in it.
I found (and fixed) one with LTP.

It takes *no account* of nfsd, nor remote filesystems. We need to have
a serious discussion about their requirements.

diff --git a/fs/locks.c b/fs/locks.c
index 663c069..cb09765 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -140,6 +140,8 @@ int lease_break_time = 45;
#define for_each_lock(inode, lockp) \
for (lockp = &inode->i_flock; *lockp != NULL; lockp = &(*lockp)->fl_next)

+static DEFINE_SPINLOCK(file_lock_lock);
+
static LIST_HEAD(file_lock_list);
static LIST_HEAD(blocked_list);

@@ -510,9 +512,9 @@ static void __locks_delete_block(struct file_lock *waiter)
*/
static void locks_delete_block(struct file_lock *waiter)
{
- lock_kernel();
+ spin_lock(&file_lock_lock);
__locks_delete_block(waiter);
- unlock_kernel();
+ spin_unlock(&file_lock_lock);
}

/* Insert waiter into blocker's block list.
@@ -649,7 +651,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl)
{
struct file_lock *cfl;

- lock_kernel();
+ spin_lock(&file_lock_lock);
for (cfl = filp->f_path.dentry->d_inode->i_flock; cfl; cfl = cfl->fl_next) {
if (!IS_POSIX(cfl))
continue;
@@ -662,7 +664,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl)
fl->fl_pid = pid_vnr(cfl->fl_nspid);
} else
fl->fl_type = F_UNLCK;
- unlock_kernel();
+ spin_unlock(&file_lock_lock);
return;
}
EXPORT_SYMBOL(posix_test_lock);
@@ -735,18 +737,21 @@ static int flock_lock_file(struct file *filp, struct file_lock *request)
int error = 0;
int found = 0;

- lock_kernel();
- if (request->fl_flags & FL_ACCESS)
+ if (request->fl_flags & FL_ACCESS) {
+ spin_lock(&file_lock_lock);
goto find_conflict;
+ }

if (request->fl_type != F_UNLCK) {
error = -ENOMEM;
+
new_fl = locks_alloc_lock();
if (new_fl == NULL)
- goto out;
+ goto out_unlocked;
error = 0;
}

+ spin_lock(&file_lock_lock);
for_each_lock(inode, before) {
struct file_lock *fl = *before;
if (IS_POSIX(fl))
@@ -772,10 +777,13 @@ static int flock_lock_file(struct file *filp, struct file_lock *request)
* If a higher-priority process was blocked on the old file lock,
* give it the opportunity to lock the file.
*/
- if (found)
+ if (found) {
+ spin_unlock(&file_lock_lock);
cond_resched();
+ spin_lock(&file_lock_lock);
+ }

-find_conflict:
+ find_conflict:
for_each_lock(inode, before) {
struct file_lock *fl = *before;
if (IS_POSIX(fl))
@@ -796,8 +804,9 @@ find_conflict:
new_fl = NULL;
error = 0;

-out:
- unlock_kernel();
+ out:
+ spin_unlock(&file_lock_lock);
+ out_unlocked:
if (new_fl)
locks_free_lock(new_fl);
return error;
@@ -826,7 +835,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
new_fl2 = locks_alloc_lock();
}

- lock_kernel();
+ spin_lock(&file_lock_lock);
if (request->fl_type != F_UNLCK) {
for_each_lock(inode, before) {
fl = *before;
@@ -994,7 +1003,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
locks_wake_up_blocks(left);
}
out:
- unlock_kernel();
+ spin_unlock(&file_lock_lock);
/*
* Free any unused locks.
*/
@@ -1069,14 +1078,14 @@ int locks_mandatory_locked(struct inode *inode)
/*
* Search the lock list for this inode for any POSIX locks.
*/
- lock_kernel();
+ spin_lock(&file_lock_lock);
for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
if (!IS_POSIX(fl))
continue;
if (fl->fl_owner != owner)
break;
}
- unlock_kernel();
+ spin_unlock(&file_lock_lock);
return fl ? -EAGAIN : 0;
}

@@ -1190,7 +1199,7 @@ int __break_lease(struct inode *inode, unsigned int mode)

new_fl = lease_alloc(NULL, mode & FMODE_WRITE ? F_WRLCK : F_RDLCK);

- lock_kernel();
+ spin_lock(&file_lock_lock);

time_out_leases(inode);

@@ -1251,8 +1260,10 @@ restart:
break_time++;
}
locks_insert_block(flock, new_fl);
+ spin_unlock(&file_lock_lock);
error = wait_event_interruptible_timeout(new_fl->fl_wait,
!new_fl->fl_next, break_time);
+ spin_lock(&file_lock_lock);
__locks_delete_block(new_fl);
if (error >= 0) {
if (error == 0)
@@ -1266,8 +1277,8 @@ restart:
error = 0;
}

-out:
- unlock_kernel();
+ out:
+ spin_unlock(&file_lock_lock);
if (!IS_ERR(new_fl))
locks_free_lock(new_fl);
return error;
@@ -1323,7 +1334,7 @@ int fcntl_getlease(struct file *filp)
struct file_lock *fl;
int type = F_UNLCK;

- lock_kernel();
+ spin_lock(&file_lock_lock);
time_out_leases(filp->f_path.dentry->d_inode);
for (fl = filp->f_path.dentry->d_inode->i_flock; fl && IS_LEASE(fl);
fl = fl->fl_next) {
@@ -1332,7 +1343,7 @@ int fcntl_getlease(struct file *filp)
break;
}
}
- unlock_kernel();
+ spin_unlock(&file_lock_lock);
return type;
}

@@ -1363,6 +1374,7 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
if (error)
return error;

+ spin_lock(&file_lock_lock);
time_out_leases(inode);

BUG_ON(!(*flp)->fl_lmops->fl_break);
@@ -1370,10 +1382,11 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
lease = *flp;

if (arg != F_UNLCK) {
+ spin_unlock(&file_lock_lock);
error = -ENOMEM;
new_fl = locks_alloc_lock();
if (new_fl == NULL)
- goto out;
+ goto out_unlocked;

error = -EAGAIN;
if ((arg == F_RDLCK) && (atomic_read(&inode->i_writecount) > 0))
@@ -1382,6 +1395,7 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
&& ((atomic_read(&dentry->d_count) > 1)
|| (atomic_read(&inode->i_count) > 1)))
goto out;
+ spin_lock(&file_lock_lock);
}

/*
@@ -1429,11 +1443,14 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)

locks_copy_lock(new_fl, lease);
locks_insert_lock(before, new_fl);
+ spin_unlock(&file_lock_lock);

*flp = new_fl;
return 0;

-out:
+ out:
+ spin_unlock(&file_lock_lock);
+ out_unlocked:
if (new_fl != NULL)
locks_free_lock(new_fl);
return error;
@@ -1471,12 +1488,10 @@ int vfs_setlease(struct file *filp, long arg, struct file_lock **lease)
{
int error;

- lock_kernel();
if (filp->f_op && filp->f_op->setlease)
error = filp->f_op->setlease(filp, arg, lease);
else
error = generic_setlease(filp, arg, lease);
- unlock_kernel();

return error;
}
@@ -1503,12 +1518,11 @@ int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
if (error)
return error;

- lock_kernel();
-
error = vfs_setlease(filp, arg, &flp);
if (error || arg == F_UNLCK)
- goto out_unlock;
+ return error;

+ lock_kernel();
error = fasync_helper(fd, filp, 1, &flp->fl_fasync);
if (error < 0) {
/* remove lease just inserted by setlease */
@@ -1519,7 +1533,7 @@ int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
}

error = __f_setown(filp, task_pid(current), PIDTYPE_PID, 0);
-out_unlock:
+ out_unlock:
unlock_kernel();
return error;
}
@@ -2024,7 +2038,7 @@ void locks_remove_flock(struct file *filp)
fl.fl_ops->fl_release_private(&fl);
}

- lock_kernel();
+ spin_lock(&file_lock_lock);
before = &inode->i_flock;

while ((fl = *before) != NULL) {
@@ -2042,7 +2056,7 @@ void locks_remove_flock(struct file *filp)
}
before = &fl->fl_next;
}
- unlock_kernel();
+ spin_unlock(&file_lock_lock);
}

/**
@@ -2057,12 +2071,12 @@ posix_unblock_lock(struct file *filp, struct file_lock *waiter)
{
int status = 0;

- lock_kernel();
+ spin_lock(&file_lock_lock);
if (waiter->fl_next)
__locks_delete_block(waiter);
else
status = -ENOENT;
- unlock_kernel();
+ spin_unlock(&file_lock_lock);
return status;
}

@@ -2175,7 +2189,7 @@ static int locks_show(struct seq_file *f, void *v)

static void *locks_start(struct seq_file *f, loff_t *pos)
{
- lock_kernel();
+ spin_lock(&file_lock_lock);
f->private = (void *)1;
return seq_list_start(&file_lock_list, *pos);
}
@@ -2187,7 +2201,7 @@ static void *locks_next(struct seq_file *f, void *v, loff_t *pos)

static void locks_stop(struct seq_file *f, void *v)
{
- unlock_kernel();
+ spin_unlock(&file_lock_lock);
}

struct seq_operations locks_seq_operations = {
@@ -2215,7 +2229,7 @@ int lock_may_read(struct inode *inode, loff_t start, unsigned long len)
{
struct file_lock *fl;
int result = 1;
- lock_kernel();
+ spin_lock(&file_lock_lock);
for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
if (IS_POSIX(fl)) {
if (fl->fl_type == F_RDLCK)
@@ -2232,7 +2246,7 @@ int lock_may_read(struct inode *inode, loff_t start, unsigned long len)
result = 0;
break;
}
- unlock_kernel();
+ spin_unlock(&file_lock_lock);
return result;
}

@@ -2255,7 +2269,7 @@ int lock_may_write(struct inode *inode, loff_t start, unsigned long len)
{
struct file_lock *fl;
int result = 1;
- lock_kernel();
+ spin_lock(&file_lock_lock);
for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
if (IS_POSIX(fl)) {
if ((fl->fl_end < start) || (fl->fl_start > (start + len)))
@@ -2270,7 +2284,7 @@ int lock_may_write(struct inode *inode, loff_t start, unsigned long len)
result = 0;
break;
}
- unlock_kernel();
+ spin_unlock(&file_lock_lock);
return result;

Linus Torvalds

May 6, 2008, 12:37:22 PM
to Matthew Wilcox, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, Andrew Morton, linux-...@vger.kernel.org

On Tue, 6 May 2008, Matthew Wilcox wrote:
>
> I've wanted to fix file locks for a while. Here's a first attempt.
> It was done quickly, so I concede that it may well have bugs in it.
> I found (and fixed) one with LTP.

Hmm. Wouldn't it be nicer to make the lock be a per-inode thing? Or is
there some user that doesn't have the inode info, or does anything that
might cross inode boundaries?

This does seem to drop all locking around the "setlease()" calls down to
the filesystem, which worries me. That said, we clearly do need to do
this. Probably should have done it a long time ago.

Also, why do people do this:

> -find_conflict:
> + find_conflict:

Hmm?

Linus

Matthew Wilcox

May 6, 2008, 12:42:57 PM
to Linus Torvalds, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, Andrew Morton, linux-...@vger.kernel.org
On Tue, May 06, 2008 at 09:36:06AM -0700, Linus Torvalds wrote:
> On Tue, 6 May 2008, Matthew Wilcox wrote:
> > I've wanted to fix file locks for a while. Here's a first attempt.
> > It was done quickly, so I concede that it may well have bugs in it.
> > I found (and fixed) one with LTP.
>
> Hmm. Wouldn't it be nicer to make the lock be a per-inode thing? Or is
> there some user that doesn't have the inode info, or does anything that
> might cross inode boundaries?

/proc/locks and deadlock detection both cross inode boundaries (and even
filesystem boundaries). The BKL-removal brigade tried this back in 2.4
and the locking ended up scaling worse than just plonking a single
spinlock around the whole thing.
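
For illustration of why a per-inode lock is awkward here: POSIX deadlock detection
has to chase chains of blocked lockers through the global blocked_list, hopping
across inodes. A simplified sketch follows (the function name is invented; the
real fs/locks.c function is posix_locks_deadlock()):

	/*
	 * Simplified: would caller_fl, if it blocked on block_fl, close a cycle?
	 * Walks the system-wide blocked_list; a blocked waiter's fl_next points
	 * at the lock it is waiting for.  Runs under the file-lock lock.
	 */
	static int would_deadlock(struct file_lock *caller_fl,
				  struct file_lock *block_fl)
	{
		struct file_lock *fl;

	next_owner:
		if (posix_same_owner(caller_fl, block_fl))
			return 1;
		list_for_each_entry(fl, &blocked_list, fl_link) {
			if (posix_same_owner(fl, block_fl)) {
				block_fl = fl->fl_next;
				goto next_owner;
			}
		}
		return 0;
	}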

> This does seem to drop all locking around the "setlease()" calls down to
> the filesystem, which worries me. That said, we clearly do need to do
> this. Probably should have done it a long time ago.

The only filesystems that are going to have their own setlease methods
will be remote ones (nfs, smbfs, etc). They're going to need to sleep
while the server responds to them. So holding a spinlock while we call
them is impolite at best.

> Also, why do people do this:
>
> > -find_conflict:
> > + find_conflict:
>
> Hmm?

So that find_conflict doesn't end up in the first column, which causes
diff to treat it as a function name for the purposes of the @@ lines.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

J. Bruce Fields

May 6, 2008, 12:46:02 PM
to Linus Torvalds, Matthew Wilcox, Ingo Molnar, Zhang, Yanmin, LKML, Alexander Viro, Andrew Morton, linux-...@vger.kernel.org, rich...@citi.umich.edu
On Tue, May 06, 2008 at 09:36:06AM -0700, Linus Torvalds wrote:
>
>
> On Tue, 6 May 2008, Matthew Wilcox wrote:
> >
> > I've wanted to fix file locks for a while. Here's a first attempt.
> > It was done quickly, so I concede that it may well have bugs in it.
> > I found (and fixed) one with LTP.
>
> Hmm. Wouldn't it be nicer to make the lock be a per-inode thing? Or is
> there some user that doesn't have the inode info, or does anything that
> might cross inode boundaries?

The deadlock detection crosses inode boundaries.

--b.

Alan Cox

May 6, 2008, 12:50:46 PM
to Matthew Wilcox, Linus Torvalds, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, Andrew Morton, linux-...@vger.kernel.org
> > Hmm?
>
> So that find_conflict doesn't end up in the first column, which causes
> diff to treat it as a function name for the purposes of the @@ lines.

Please can we just fix the tools not mangle the kernel to work around
silly bugs ?

Matthew Wilcox

May 6, 2008, 12:51:38 PM
to Alan Cox, Linus Torvalds, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, Andrew Morton, linux-...@vger.kernel.org
On Tue, May 06, 2008 at 05:39:48PM +0100, Alan Cox wrote:
> > > Hmm?
> >
> > So that find_conflict doesn't end up in the first column, which causes
> > diff to treat it as a function name for the purposes of the @@ lines.
>
> Please can we just fix the tools not mangle the kernel to work around
> silly bugs ?

The people who control the tools refuse to fix them.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

Alan Cox

May 6, 2008, 12:55:32 PM
to Matthew Wilcox, Linus Torvalds, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, Andrew Morton, linux-...@vger.kernel.org
On Tue, 6 May 2008 10:51:12 -0600
Matthew Wilcox <mat...@wil.cx> wrote:

> On Tue, May 06, 2008 at 05:39:48PM +0100, Alan Cox wrote:
> > > > Hmm?
> > >
> > > So that find_conflict doesn't end up in the first column, which causes
> > > diff to treat it as a function name for the purposes of the @@ lines.
> >
> > Please can we just fix the tools not mangle the kernel to work around
> > silly bugs ?
>
> The people who control the tools refuse to fix them.

That would be their problem. We should refuse to mash the kernel up
because they aren't doing their job. If need be someone can fork a
private git-patch.

Andrew Morton

May 6, 2008, 1:23:03 PM
to Matthew Wilcox, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, Linus Torvalds, linux-...@vger.kernel.org
On Tue, 6 May 2008 10:23:32 -0600 Matthew Wilcox <mat...@wil.cx> wrote:

> On Tue, May 06, 2008 at 06:09:34AM -0600, Matthew Wilcox wrote:
> > So the only likely things I can see are:
> >
> > - file locks
> > - fasync
>
> I've wanted to fix file locks for a while. Here's a first attempt.

Do we actually know that the locks code is implicated in this regression?

I'd initially thought "lseek" but afaict tmpfs doesn't hit default_llseek()
or remote_llseek().

tmpfs tends to do weird stuff - it would be interesting to know if the
regression is also present on ramfs or ext2/ext3/xfs/etc.

It would be interesting to see if the context switch rate has increased.

Finally: how come we regressed by swapping the semaphore implementation
anyway? We went from one sleeping lock implementation to another - I'd
have expected performance to be pretty much the same.

<looks at the implementation>

down(), down_interruptible() and down_try() should use spin_lock_irq(), not
irqsave.

up() seems to be doing wake-one, FIFO which is nice. Did the
implementation which we just removed also do that? Was it perhaps
accidentally doing LIFO or something like that?

Matthew Wilcox

May 6, 2008, 1:33:33 PM
to Andrew Morton, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, Linus Torvalds, linux-...@vger.kernel.org
On Tue, May 06, 2008 at 10:21:53AM -0700, Andrew Morton wrote:
> Do we actually know that the locks code is implicated in this regression?

Not yet. We don't even know it's the BKL. It's just my best guess.
We're waiting for the original reporter to run some tests Ingo pointed
him at.

> I'd initially thought "lseek" but afaict tmpfs doesn't hit default_llseek()
> or remote_llseek().

Correct.

> Finally: how come we regressed by swapping the semaphore implementation
> anyway? We went from one sleeping lock implementation to another - I'd
> have expected performance to be pretty much the same.
>
> <looks at the implementation>
>
> down(), down_interruptible() and down_try() should use spin_lock_irq(), not
> irqsave.

We talked about this ... the BKL actually requires that you be able to
acquire it with interrupts disabled. Maybe we should make lock_kernel
do this:

	if (likely(!depth)) {
		unsigned long flags;

		local_save_flags(flags);
		down();
		local_irq_restore(flags);
	}

But tweaking down() is not worth it -- we should be eliminating users of
both the BKL and semaphores instead.

> up() seems to be doing wake-one, FIFO which is nice. Did the
> implementation which we just removed also do that? Was it perhaps
> accidentally doing LIFO or something like that?

That's a question for someone who knows x86 assembler, I think.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

Ingo Molnar

May 6, 2008, 1:44:04 PM
to Andrew Morton, Matthew Wilcox, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, Linus Torvalds, linux-...@vger.kernel.org

* Andrew Morton <ak...@linux-foundation.org> wrote:

> Finally: how come we regressed by swapping the semaphore
> implementation anyway? We went from one sleeping lock implementation
> to another - I'd have expected performance to be pretty much the same.
>
> <looks at the implementation>
>
> down(), down_interruptible() and down_try() should use
> spin_lock_irq(), not irqsave.
>
> up() seems to be doing wake-one, FIFO which is nice. Did the
> implementation which we just removed also do that? Was it perhaps
> accidentally doing LIFO or something like that?

i just checked the old implementation on x86. It used
lib/semaphore-sleepers.c which does one weird thing:

- __down() when it returns wakes up yet another task via
wake_up_locked().

i.e. we'll always keep yet another task in flight. This can mask wakeup
latencies especially when it takes time.
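
For reference, the old __down() Ingo is describing looked roughly like this. It is
paraphrased from memory from lib/semaphore-sleepers.c, so treat it as a sketch
rather than a verbatim copy:

	void __sched __down(struct semaphore *sem)
	{
		struct task_struct *tsk = current;
		DECLARE_WAITQUEUE(wait, tsk);
		unsigned long flags;

		tsk->state = TASK_UNINTERRUPTIBLE;
		spin_lock_irqsave(&sem->wait.lock, flags);
		add_wait_queue_exclusive_locked(&sem->wait, &wait);

		sem->sleepers++;
		for (;;) {
			int sleepers = sem->sleepers;

			/*
			 * Add "everybody else" into it. They aren't playing,
			 * because we own the spinlock in the wait_queue_head.
			 */
			if (!atomic_add_negative(sleepers - 1, &sem->count)) {
				sem->sleepers = 0;
				break;
			}
			sem->sleepers = 1;	/* us - see -1 above */
			spin_unlock_irqrestore(&sem->wait.lock, flags);

			schedule();

			spin_lock_irqsave(&sem->wait.lock, flags);
			tsk->state = TASK_UNINTERRUPTIBLE;
		}
		remove_wait_queue_locked(&sem->wait, &wait);
		wake_up_locked(&sem->wait);	/* the extra "kick" Ingo refers to */
		spin_unlock_irqrestore(&sem->wait.lock, flags);
		tsk->state = TASK_RUNNING;
	}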

The patch (hack) below tries to emulate this weirdness - it 'kicks'
another task as well and keeps it busy. Most of the time this just
causes extra scheduling, but if AIM7 is _just_ saturating the number of
CPUs, it might make a difference. Yanmin, does the patch below make any
difference to the AIM7 results?

( it would be useful data to get a meaningful context switch trace from
the whole regressed workload, and compare it to a context switch trace
with the revert added. )

Ingo

---
kernel/semaphore.c | 10 ++++++++++
1 file changed, 10 insertions(+)

Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c
@@ -261,4 +261,14 @@ static noinline void __sched __up(struct
list_del(&waiter->list);
waiter->up = 1;
wake_up_process(waiter->task);
+
+ if (likely(list_empty(&sem->wait_list)))
+ return;
+ /*
+ * Opportunistically wake up another task as well but do not
+ * remove it from the list:
+ */
+ waiter = list_first_entry(&sem->wait_list,
+ struct semaphore_waiter, list);
+ wake_up_process(waiter->task);

Linus Torvalds

May 6, 2008, 1:46:18 PM
to Matthew Wilcox, Alan Cox, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, Andrew Morton, linux-...@vger.kernel.org

On Tue, 6 May 2008, Matthew Wilcox wrote:

> On Tue, May 06, 2008 at 05:39:48PM +0100, Alan Cox wrote:
> > > > Hmm?
> > >
> > > So that find_conflict doesn't end up in the first column, which causes
> > > diff to treat it as a function name for the purposes of the @@ lines.
> >
> > Please can we just fix the tools not mangle the kernel to work around
> > silly bugs ?
>
> The people who control the tools refuse to fix them.

That's just plain bullocks.

First off, that "@@" line thing in diffs is not important enough to screw
up the source code for. Ever. It's just a small hint to make it somewhat
easier to see more context for humans.

Second, it's not even that bad to show the last label there, rather than
the function name.

Third, you seem to be a git user, so if you actually really care that much
about the @@ line, then git actually lets you set your very own pattern
for those things.

In fact, you can even do it on a per-file basis based on things like
filename rules (ie you can have different patterns for what to trigger on
for a *.c file and for a *.S file, since in a *.S file the 'name:' thing
_is_ the right pattern).

So not only are you making idiotic changes just for irrelevant tool usage,
you're also apparently lying about people "refusing to fix" things as an
excuse.

You can play with it. It's documented in gitattributes (see "Defining a
custom hunk header"), and the default one is just the same one that GNU
diff uses for "-p". I think.

You can add something like this to your ~/.gitconfig:

	[diff "default"]
		funcname=^[a-zA-Z_$].*(.*$

to only trigger the funcname pattern on a line that starts with a valid C
identifier thing, and contains a '('.

And you can just override the default like the above (that way you don't
have to specify attributes), but if you want to do things differently for
*.c files than from *.S files, you can edit your .git/info/attributes file
and make it contain something like

	*.S	diff=assembler
	*.c	diff=C

and now you can make your ~/.gitconfig actually show them differently, ie
something like

	[diff "C"]
		funcname=^[a-zA-Z_$].*(.*$

	[diff "assembler"]
		funcname=^[a-zA-Z_$].*:

etc.

Of course, there is a real cost to this, but it's cheap enough in practice
that you'll never notice.

Linus

Linus Torvalds

May 6, 2008, 1:46:53 PM
to Andrew Morton, Matthew Wilcox, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-...@vger.kernel.org

On Tue, 6 May 2008, Andrew Morton wrote:
>
> down(), down_interruptible() and down_try() should use spin_lock_irq(), not
> irqsave.

down_trylock() is used in atomic code. See for example kernel/printk.c. So
no, that one needs to be irqsafe.
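
For illustration, the kind of use Linus means, sketched rather than quoted from
kernel/printk.c (the function name and the local semaphore definition here are
made up): a path that can run in interrupt context may only poll the semaphore,
and the trylock must not change the caller's interrupt state.

	#include <linux/semaphore.h>

	static struct semaphore console_lock_sketch =
			__SEMAPHORE_INITIALIZER(console_lock_sketch, 1);

	/* May be called from IRQ context, so only a trylock is possible. */
	static void try_to_flush_console(void)
	{
		if (down_trylock(&console_lock_sketch))
			return;		/* somebody else holds it and will flush */

		/* ... push the log buffer out to the console drivers ... */

		up(&console_lock_sketch);
	}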

Linus

Ingo Molnar

May 6, 2008, 1:50:54 PM
to Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, Linus Torvalds, linux-...@vger.kernel.org

* Matthew Wilcox <mat...@wil.cx> wrote:

> > down(), down_interruptible() and down_try() should use
> > spin_lock_irq(), not irqsave.
>
> We talked about this ... the BKL actually requires that you be able to
> acquire it with interrupts disabled. [...]

hm, where does it require it, besides the early bootup code? (which
should just be fixed)

down_trylock() is OK as irqsave/irqrestore for legacy reasons, but that
is fundamentally atomic anyway.

> > up() seems to be doing wake-one, FIFO which is nice. Did the
> > implementation which we just removed also do that? Was it perhaps
> > accidentally doing LIFO or something like that?
>
> That's a question for someone who knows x86 assembler, I think.

the assembly is mostly just for the fastpath - and a 40% regression
cannot be about fastpath differences. In the old code the scheduling
happens in lib/semaphore-sleeper.c, and from the looks of it it appears
to be a proper FIFO as well. (plus this small wakeup weirdness it has)

i reviewed the new code in kernel/semaphore.c as well and can see
nothing bad in it - it does proper wake-up, FIFO queueing, like the
mutex code.

Ingo

Andrew Morton

May 6, 2008, 2:09:28 PM
to Ingo Molnar, mat...@wil.cx, bfi...@citi.umich.edu, yanmin...@linux.intel.com, linux-...@vger.kernel.org, vi...@ftp.linux.org.uk, torv...@linux-foundation.org, linux-...@vger.kernel.org
On Tue, 6 May 2008 19:49:54 +0200
Ingo Molnar <mi...@elte.hu> wrote:

>
> * Matthew Wilcox <mat...@wil.cx> wrote:
>
> > > down(), down_interruptible() and down_try() should use
> > > spin_lock_irq(), not irqsave.
> >
> > We talked about this ... the BKL actually requires that you be able to
> > acquire it with interrupts disabled. [...]
>
> hm, where does it require it, besides the early bootup code? (which
> should just be fixed)

Yeah, the early bootup code. The kernel does accidental lock_kernel()s in
various places and if that re-enables interrupts then powerpc goeth crunch.

Matthew, that seemingly-unneeded irqsave in lib/semaphore.c is a prime site
for /* one of these things */, no?
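
Something like the following is presumably what Andrew has in mind: the
2.6.26-rc1 down() (quoted from memory, so treat it as approximate) with the
missing comment added.

	void down(struct semaphore *sem)
	{
		unsigned long flags;

		/*
		 * irqsave rather than spin_lock_irq(): lock_kernel() can be
		 * reached with interrupts disabled (e.g. during early boot),
		 * and down() must not re-enable them behind the caller's back.
		 */
		spin_lock_irqsave(&sem->lock, flags);
		if (likely(sem->count > 0))
			sem->count--;
		else
			__down(sem);
		spin_unlock_irqrestore(&sem->lock, flags);
	}
	EXPORT_SYMBOL(down);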

> down_trylock() is OK as irqsave/irqrestore for legacy reasons, but that
> is fundamentally atomic anyway.

yes, trylock should be made irq-safe.

> > > up() seems to be doing wake-one, FIFO which is nice. Did the
> > > implementation which we just removed also do that? Was it perhaps
> > > accidentally doing LIFO or something like that?
> >
> > That's a question for someone who knows x86 assembler, I think.
>
> the assembly is mostly just for the fastpath - and a 40% regression
> cannot be about fastpath differences. In the old code the scheduling
> happens in lib/semaphore-sleeper.c, and from the looks of it it appears
> to be a proper FIFO as well. (plus this small wakeup weirdness it has)
>
> i reviewed the new code in kernel/semaphore.c as well and can see
> nothing bad in it - it does proper wake-up, FIFO queueing, like the
> mutex code.
>

There's the weird wakeup in down() which I understood for about five
minutes five years ago. Perhaps that accidentally sped something up.
Oh well, more investigation needed..

Linus Torvalds

May 6, 2008, 4:29:52 PM
to Matthew Wilcox, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, Andrew Morton, linux-...@vger.kernel.org

On Tue, 6 May 2008, Matthew Wilcox wrote:
> On Tue, May 06, 2008 at 09:36:06AM -0700, Linus Torvalds wrote:
> >
> > Hmm. Wouldn't it be nicer to make the lock be a per-inode thing? Or is
> > there some user that doesn't have the inode info, or does anything that
> > might cross inode boundaries?
>
> /proc/locks and deadlock detection both cross inode boundaries (and even
> filesystem boundaries). The BKL-removal brigade tried this back in 2.4
> and the locking ended up scaling worse than just plonking a single
> spinlock around the whole thing.

Ok, no worries. Just as long as I know why it's a single lock. Looks ok to
me, apart from the need for testing (and talking to NFS etc people).

Linus

Zhang, Yanmin

May 6, 2008, 10:12:40 PM
to Ingo Molnar, Matthew Wilcox, LKML, Alexander Viro, Linus Torvalds, Andrew Morton

On Tue, 2008-05-06 at 13:44 +0200, Ingo Molnar wrote:
> * Zhang, Yanmin <yanmin...@linux.intel.com> wrote:
>
> > Comparing with kernel 2.6.25, AIM7 (use tmpfs) has more than 40% with
> > 2.6.26-rc1 on my 8-core stoakley, 16-core tigerton, and Itanium
> > Montecito. Bisect located below patch.
> >
> > 64ac24e738823161693bf791f87adc802cf529ff is first bad commit
> > commit 64ac24e738823161693bf791f87adc802cf529ff
> > Author: Matthew Wilcox <mat...@wil.cx>
> > Date: Fri Mar 7 21:55:58 2008 -0500
> >
> > Generic semaphore implementation
> >
> > After I manually reverted the patch against 2.6.26-rc1 while fixing
> > lots of conflictions/errors, aim7 regression became less than 2%.
>
> hm, which exact semaphore would that be due to?
>
> My first blind guess would be the BKL - there's not much other semaphore
> use left in the core kernel otherwise that would affect AIM7 normally.
> The VFS still makes frequent use of the BKL and AIM7 is very VFS
> intense. Getting rid of that BKL use from the VFS might be useful to
> performance anyway.
>
> Could you try to check that it's indeed the BKL?
>
> Easiest way to check it would be to run AIM7 it on
> sched-devel.git/latest and do scheduler tracing via:
>
> http://people.redhat.com/mingo/sched-devel.git/readme-tracer.txt
Thank you guys for the quick response. I ran into many regressions with 2.6.26-rc1, but
just reported 2 of them because I located the offending patches. My machine is now locating the
root cause of a 30% regression of sysbench+mysql (oltp readonly). Bisect is not so quick because
either the kernel hangs during testing or the compilation fails.

Another regression, specjbb2005 on Montvale, is also under investigation.

Let me figure out how to clone your tree quickly, as the network speed is very slow.

One clearly weird behavior of aim7 is that cpu idle is 0% with 2.6.25, but more than 50% with
2.6.26-rc1. I have a patch to collect scheduling info.

Zhang, Yanmin

May 6, 2008, 11:43:29 PM
to Ingo Molnar, Matthew Wilcox, LKML, Alexander Viro, Linus Torvalds, Andrew Morton

On Wed, 2008-05-07 at 10:11 +0800, Zhang, Yanmin wrote:
> On Tue, 2008-05-06 at 13:44 +0200, Ingo Molnar wrote:
> > * Zhang, Yanmin <yanmin...@linux.intel.com> wrote:
> >
> > > Comparing with kernel 2.6.25, AIM7 (use tmpfs) has more than 40% with
> > > 2.6.26-rc1 on my 8-core stoakley, 16-core tigerton, and Itanium
> > > Montecito. Bisect located below patch.
> > >
> > > 64ac24e738823161693bf791f87adc802cf529ff is first bad commit
> > > commit 64ac24e738823161693bf791f87adc802cf529ff
> > > Author: Matthew Wilcox <mat...@wil.cx>
> > > Date: Fri Mar 7 21:55:58 2008 -0500
> > >
> > > Generic semaphore implementation
> > >
> > > After I manually reverted the patch against 2.6.26-rc1 while fixing
> > > lots of conflictions/errors, aim7 regression became less than 2%.
> >
> > hm, which exact semaphore would that be due to?
> >
> > My first blind guess would be the BKL - there's not much other semaphore
> > use left in the core kernel otherwise that would affect AIM7 normally.
> > The VFS still makes frequent use of the BKL and AIM7 is very VFS
> > intense. Getting rid of that BKL use from the VFS might be useful to
> > performance anyway.
> >
> > Could you try to check that it's indeed the BKL?
> >
> > Easiest way to check it would be to run AIM7 it on
> > sched-devel.git/latest and do scheduler tracing via:
> >
> > http://people.redhat.com/mingo/sched-devel.git/readme-tracer.txt
> One clear weird behavior of aim7 is cpu idle is 0% with 2.6.25, but is more than 50% with
> 2.6.26-rc1. I have a patch to collect schedule info.
With my patch+gprof, I collected some data. Below is the output from gprof.

index % time self children called name
0.00 0.00 2/223305376 __down_write_nested [22749]
0.00 0.00 3/223305376 journal_commit_transaction [10526]
0.00 0.00 6/223305376 __down_read [22745]
0.00 0.00 8/223305376 start_this_handle [19167]
0.00 0.00 15/223305376 sys_pause [19808]
0.00 0.00 17/223305376 log_wait_commit [11047]
0.00 0.00 20/223305376 futex_wait [8122]
0.00 0.00 64/223305376 pdflush [14335]
0.00 0.00 71/223305376 do_get_write_access [5367]
0.00 0.00 84/223305376 pipe_wait [14460]
0.00 0.00 111/223305376 kjournald [10726]
0.00 0.00 116/223305376 int_careful [9634]
0.00 0.00 224/223305376 do_nanosleep [5418]
0.00 0.00 1152/223305376 watchdog [22065]
0.00 0.00 4087/223305376 worker_thread [22076]
0.00 0.00 5003/223305376 __mutex_lock_killable_slowpath [23305]
0.00 0.00 7810/223305376 ksoftirqd [10831]
0.00 0.00 9389/223305376 __mutex_lock_slowpath [23306]
0.00 0.00 10642/223305376 io_schedule [9813]
0.00 0.00 23544/223305376 migration_thread [11495]
0.00 0.00 35319/223305376 __cond_resched [22673]
0.00 0.00 49065/223305376 retint_careful [16146]
0.00 0.00 119757/223305376 sysret_careful [20074]
0.00 0.00 151717/223305376 do_wait [5545]
0.00 0.00 250221/223305376 do_exit [5356]
0.00 0.00 303836/223305376 cpu_idle [4350]
0.00 0.00 222333093/223305376 schedule_timeout [2]
[1] 0.0 0.00 0.00 223305376 schedule [1]
-----------------------------------------------
0.00 0.00 2/222333093 io_schedule_timeout [9814]
0.00 0.00 4/222333093 journal_stop [10588]
0.00 0.00 8/222333093 cifs_oplock_thread [3760]
0.00 0.00 14/222333093 do_sys_poll [5513]
0.00 0.00 20/222333093 cifs_dnotify_thread [3733]
0.00 0.00 32/222333093 read_chan [15648]
0.00 0.00 47/222333093 wait_for_common [22017]
0.00 0.00 658/222333093 do_select [5479]
0.00 0.00 2000/222333093 inet_stream_connect [9324]
0.00 0.00 222330308/222333093 __down [22577]
[2] 0.0 0.00 0.00 222333093 schedule_timeout [2]
0.00 0.00 222333093/223305376 schedule [1]
-----------------------------------------------
0.00 0.00 1/165565 flock_lock_file_wait [7735]
0.00 0.00 7/165565 __posix_lock_file [23371]
0.00 0.00 203/165565 de_put [4665]
0.00 0.00 243/165565 opost [13633]
0.00 0.00 333/165565 proc_root_readdir [14982]
0.00 0.00 358/165565 write_chan [22090]
0.00 0.00 6222/165565 proc_lookup_de [14908]
0.00 0.00 32081/165565 sys_fcntl [19687]
0.00 0.00 36045/165565 vfs_ioctl [21822]
0.00 0.00 42025/165565 tty_release [20818]
0.00 0.00 48047/165565 chrdev_open [3702]
[3] 0.0 0.00 0.00 165565 lock_kernel [3]
0.00 0.00 152987/153190 down [4]
-----------------------------------------------
0.00 0.00 203/153190 __reacquire_kernel_lock [23420]
0.00 0.00 152987/153190 lock_kernel [3]
[4] 0.0 0.00 0.00 153190 down [4]
0.00 0.00 153190/153190 __down [22577]
-----------------------------------------------
0.00 0.00 153190/153190 down [4]
[22577] 0.0 0.00 0.00 153190 __down [22577]
0.00 0.00 222330308/222333093 schedule_timeout [2]


As system idle is more than 50%, the schedule/schedule_timeout callers are important
information.
1) lock_kernel causes most of the schedule/schedule_timeout calls;
2) When lock_kernel calls down, and down then calls __down, __down calls schedule_timeout
lots of times in a loop (see the sketch below);
3) Callers of lock_kernel are sys_fcntl/vfs_ioctl/tty_release/chrdev_open.
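
The loop referred to in 2) is, roughly, the following: a simplified paraphrase of
the 2.6.26-rc1 __down()/__down_common() slowpath from kernel/semaphore.c, with the
timeout and signal handling trimmed.

	/*
	 * Simplified: a sleeper queues itself on sem->wait_list and then loops
	 * around schedule_timeout() until __up() sets its waiter.up flag.
	 * Called with sem->lock held and interrupts disabled.
	 */
	static noinline void __sched __down(struct semaphore *sem)
	{
		struct task_struct *task = current;
		struct semaphore_waiter waiter;

		list_add_tail(&waiter.list, &sem->wait_list);
		waiter.task = task;
		waiter.up = 0;

		for (;;) {
			__set_task_state(task, TASK_UNINTERRUPTIBLE);
			spin_unlock_irq(&sem->lock);
			schedule_timeout(MAX_SCHEDULE_TIMEOUT);
			spin_lock_irq(&sem->lock);
			if (waiter.up)
				return;
		}
	}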

-yanmin

Andrew Morton

May 7, 2008, 12:00:10 AM
to Zhang, Yanmin, Ingo Molnar, Matthew Wilcox, LKML, Alexander Viro, Linus Torvalds
On Wed, 07 May 2008 11:41:52 +0800 "Zhang, Yanmin" <yanmin...@linux.intel.com> wrote:

> As system idle is more than 50%, so the schedule/schedule_timeout caller is important
> information.
> 1) lock_kernel causes most schedule/schedule_timeout;
> 2) When lock_kernel calls down, then __down, __down calls schedule_timeout for
> lots of times in a loop;

Really? Are you sure? That would imply that we keep on waking up tasks
which then fail to acquire the lock. But the code pretty plainly doesn't
do that.

Odd.

> 3) Caller of lcok_kernel are sys_fcntl/vfs_ioctl/tty_release/chrdev_open.

Still :(

Zhang, Yanmin

May 7, 2008, 12:47:31 AM
to Andrew Morton, Ingo Molnar, Matthew Wilcox, LKML, Alexander Viro, Linus Torvalds

On Tue, 2008-05-06 at 20:59 -0700, Andrew Morton wrote:
> On Wed, 07 May 2008 11:41:52 +0800 "Zhang, Yanmin" <yanmin...@linux.intel.com> wrote:
>
> > As system idle is more than 50%, so the schedule/schedule_timeout caller is important
> > information.
> > 1) lock_kernel causes most schedule/schedule_timeout;
> > 2) When lock_kernel calls down, then __down, __down calls ___schedule_timeout for
> > lots of times in a loop;
>
> Really? Are you sure? That would imply that we keep on waking up tasks
> which then fail to acquire the lock. But the code pretty plainly doesn't
> do that.
Yes, totally based on the data.
The data shows the calling counts among functions. Initially, I just collected the callers
of schedule and schedule_timeout. Then I found most schedule/schedule_timeout calls are made by
__down, which is called by down. Then I changed the kernel to collect more functions' calling info.

Comparing the calling counts of down, __down and schedule_timeout, we can see that
schedule_timeout is called by __down 222330308 times, but __down itself is called only 153190 times.

>
> Odd.
>
> > 3) Caller of lcok_kernel are sys_fcntl/vfs_ioctl/tty_release/chrdev_open.
> Still :(

Yes. The data has some margin of error, but the error is small. My patch doesn't
use a lock to protect the data, in case that introduces too much overhead.

Ingo Molnar

May 7, 2008, 2:27:25 AM
to Zhang, Yanmin, Matthew Wilcox, LKML, Alexander Viro, Linus Torvalds, Andrew Morton

* Zhang, Yanmin <yanmin...@linux.intel.com> wrote:

> 3) Caller of lock_kernel are sys_fcntl/vfs_ioctl/tty_release/chrdev_open.

that's one often-forgotten BKL site: about 1000 ioctls are still running
under the BKL. The TTY one is hurting the most. To make sure it's only
that BKL acquire/release that hurts, could you try the hack patch below,
does it make any difference to performance?

but even if taking the BKL does hurt, it's quite unexpected to cause a
40% drop. Perhaps AIM7 has tons of threads that exit at once and all try
to release their controlling terminal or something like that?

Ingo

------------------------>
Subject: DANGEROUS tty hack: no BKL
From: Ingo Molnar <mi...@elte.hu>
Date: Wed May 07 08:21:22 CEST 2008

NOT-Signed-off-by: Ingo Molnar <mi...@elte.hu>
---
drivers/char/tty_io.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

Index: linux/drivers/char/tty_io.c
===================================================================
--- linux.orig/drivers/char/tty_io.c
+++ linux/drivers/char/tty_io.c
@@ -2844,9 +2844,10 @@ out:

static int tty_release(struct inode *inode, struct file *filp)
{
- lock_kernel();
+ /* DANGEROUS - can crash your kernel! */
+// lock_kernel();
release_dev(filp);
- unlock_kernel();
+// unlock_kernel();
return 0;

Ingo Molnar

May 7, 2008, 2:29:15 AM
to Zhang, Yanmin, Matthew Wilcox, LKML, Alexander Viro, Linus Torvalds, Andrew Morton

* Ingo Molnar <mi...@elte.hu> wrote:

> > 3) Caller of lock_kernel are sys_fcntl/vfs_ioctl/tty_release/chrdev_open.
>
> that's one often-forgotten BKL site: about 1000 ioctls are still
> running under the BKL. The TTY one is hurting the most. [...]

although it's an unlocked_ioctl() now in 2.6.26, so all the BKL locking
has been nicely pushed down to deep inside the tty code.

> [...] To make sure it's only that BKL acquire/release that hurts,
> could you try the hack patch below, does it make any difference to
> performance?

if you use a serial console you will need the updated patch below.

Ingo

---------------------->
Subject: no: tty bkl


From: Ingo Molnar <mi...@elte.hu>
Date: Wed May 07 08:21:22 CEST 2008

Signed-off-by: Ingo Molnar <mi...@elte.hu>
---
drivers/char/tty_io.c | 5 +++--
drivers/serial/serial_core.c | 2 +-
2 files changed, 4 insertions(+), 3 deletions(-)

Index: linux/drivers/char/tty_io.c
===================================================================
--- linux.orig/drivers/char/tty_io.c
+++ linux/drivers/char/tty_io.c
@@ -2844,9 +2844,10 @@ out:

static int tty_release(struct inode *inode, struct file *filp)
{
- lock_kernel();
+ /* DANGEROUS - can crash your kernel! */
+// lock_kernel();
release_dev(filp);
- unlock_kernel();
+// unlock_kernel();
return 0;
}

Index: linux/drivers/serial/serial_core.c
===================================================================
--- linux.orig/drivers/serial/serial_core.c
+++ linux/drivers/serial/serial_core.c
@@ -1241,7 +1241,7 @@ static void uart_close(struct tty_struct
struct uart_state *state = tty->driver_data;
struct uart_port *port;

- BUG_ON(!kernel_locked());
+// BUG_ON(!kernel_locked());

if (!state || !state->port)
return;

Zhang, Yanmin

May 7, 2008, 2:51:17 AM
to Ingo Molnar, Andrew Morton, Matthew Wilcox, J. Bruce Fields, LKML, Alexander Viro, Linus Torvalds, linux-...@vger.kernel.org

On Tue, 2008-05-06 at 19:39 +0200, Ingo Molnar wrote:
> * Andrew Morton <ak...@linux-foundation.org> wrote:
>
> > Finally: how come we regressed by swapping the semaphore
> > implementation anyway? We went from one sleeping lock implementation
> > to another - I'd have expected performance to be pretty much the same.
> i.e. we'll always keep yet another task in flight. This can mask wakeup
> latencies especially when it takes time.
>
> The patch (hack) below tries to emulate this weirdness - it 'kicks'
> another task as well and keeps it busy. Most of the time this just
> causes extra scheduling, but if AIM7 is _just_ saturating the number of
> CPUs, it might make a difference. Yanmin, does the patch below make any
> difference to the AIM7 results?
I tested it on my 8-core stoakley and the result is 12% worse than that of
pure 2.6.26-rc1.

-yanmin

Zhang, Yanmin

May 7, 2008, 3:06:32 AM
to Ingo Molnar, Matthew Wilcox, LKML, Alexander Viro, Linus Torvalds, Andrew Morton

On Wed, 2008-05-07 at 08:28 +0200, Ingo Molnar wrote:
> * Ingo Molnar <mi...@elte.hu> wrote:
>
> > > 3) Caller of lock_kernel are sys_fcntl/vfs_ioctl/tty_release/chrdev_open.
> >
> > that's one often-forgotten BKL site: about 1000 ioctls are still
> > running under the BKL. The TTY one is hurting the most. [...]
>
> although it's an unlocked_ioctl() now in 2.6.26, so all the BKL locking
> has been nicely pushed down to deep inside the tty code.
>
> > [...] To make sure it's only that BKL acquire/release that hurts,
> > could you try the hack patch below, does it make any difference to
> > performance?
>
> if you use a serial console you will need the updated patch below.
I tested it on my 8-core stoakley. The result is 4% worse than that of pure
2.6.26-rc1. Still not good.

Andi Kleen

May 7, 2008, 7:00:38 AM
to Zhang, Yanmin, Ingo Molnar, Matthew Wilcox, LKML, Alexander Viro, Linus Torvalds, Andrew Morton
"Zhang, Yanmin" <yanmin...@linux.intel.com> writes:
> 3) Caller of lcok_kernel are sys_fcntl/vfs_ioctl/tty_release/chrdev_open.

I have an older patchkit that introduced unlocked_fcntl for some cases. It was
briefly in mm but then dropped. Sounds like it is worth resurrecting?

tty_* is being taken care of by Alan.

chrdev_open is more work.

-Andi (who BTW never quite understood why BKL is a semaphore now and not
a spinlock?)

Matthew Wilcox

May 7, 2008, 7:47:27 AM
to Andi Kleen, Zhang, Yanmin, Ingo Molnar, LKML, Alexander Viro, Linus Torvalds, Andrew Morton
On Wed, May 07, 2008 at 01:00:14PM +0200, Andi Kleen wrote:
> "Zhang, Yanmin" <yanmin...@linux.intel.com> writes:
> > 3) Caller of lcok_kernel are sys_fcntl/vfs_ioctl/tty_release/chrdev_open.
>
> I have an older patchkit that introduced unlocked_fnctl for some cases. It was
> briefly in mm but then dropped. Sounds like it is worth resurrecting?

Not sure what you're talking about here, Andi. The only lock_kernel in
fcntl.c is around the call to ->fasync. And Yanmin's traces don't show
fasync as being a culprit, just the paths in locks.c

> tty_* is being taken care of by Alan.
>
> chrdev_open is more work.
>
> -Andi (who BTW never quite understood why BKL is a semaphore now and not
> a spinlock?)

See git commit 6478d8800b75253b2a934ddcb734e13ade023ad0

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

Andi Kleen

May 7, 2008, 8:21:40 AM
to Matthew Wilcox, Zhang, Yanmin, Ingo Molnar, LKML, Alexander Viro, Linus Torvalds, Andrew Morton
Matthew Wilcox <mat...@wil.cx> writes:

> On Wed, May 07, 2008 at 01:00:14PM +0200, Andi Kleen wrote:
>> "Zhang, Yanmin" <yanmin...@linux.intel.com> writes:
>> > 3) Caller of lcok_kernel are sys_fcntl/vfs_ioctl/tty_release/chrdev_open.
>>
>> I have an older patchkit that introduced unlocked_fnctl for some cases. It was
>> briefly in mm but then dropped. Sounds like it is worth resurrecting?
>
> Not sure what you're talking about here, Andi. The only lock_kernel in
> fcntl.c is around the call to ->fasync. And Yanmin's traces don't show
> fasync as being a culprit, just the paths in locks.c

I was talking about fasync.

>> -Andi (who BTW never quite understood why BKL is a semaphore now and not
>> a spinlock?)
>
> See git commit 6478d8800b75253b2a934ddcb734e13ade023ad0

I am aware of that commit, thank you, but my comment was referring to the fact that it
came with about zero justification for why it was done. For the left-over BKL
regions, which are relatively short, surely a spinlock would be better than a
semaphore? So PREEMPT_BKL should have been removed, not !PREEMPT_BKL.

If that had been done, I bet all these regressions would disappear. That said,
of course it is still better to actually fix the lock_kernel()s, but in the shorter
term just fixing lock_kernel again would be easier.

-Andi

Alan Cox

May 7, 2008, 10:09:41 AM
to Andi Kleen, Zhang, Yanmin, Ingo Molnar, Matthew Wilcox, LKML, Alexander Viro, Linus Torvalds, Andrew Morton
On Wed, 07 May 2008 13:00:14 +0200
Andi Kleen <an...@firstfloor.org> wrote:

> "Zhang, Yanmin" <yanmin...@linux.intel.com> writes:
> > 3) Caller of lcok_kernel are sys_fcntl/vfs_ioctl/tty_release/chrdev_open.
>
> I have an older patchkit that introduced unlocked_fnctl for some cases. It was
> briefly in mm but then dropped. Sounds like it is worth resurrecting?
>
> tty_* is being taken care of by Alan.

The tty open/close paths are probably a few months away from dropping the
BKL.

Alan

Linus Torvalds

May 7, 2008, 10:37:40 AM
to Andi Kleen, Matthew Wilcox, Zhang, Yanmin, Ingo Molnar, LKML, Alexander Viro, Andrew Morton

On Wed, 7 May 2008, Andi Kleen wrote:
>
> I am aware of that commit, thank you, but the comment was refering to that it
> came with about zero justification why it was done. For the left over BKL
> regions which are relatively short surely a spinlock would be better than a
> semaphore? So PREEMPT_BKL should have been removed, not !PREEMPT_BKL.

I do agree.

I think turning the BKL into a semaphore was fine per se, but that was
when semaphores were fast.

Considering the apparent AIM regressions, we really either need to revert
the semaphore consolidation, or we need to fix the kernel lock. And by
"fixing", I don't mean removing it - it will happen, but it almost
certainly won't happen for 2.6.26.

The easiest approach would seem to just turn the BKL into a mutex instead,
which should hopefully be about as optimized as the old semaphores.

But my preferred option would indeed be just turning it back into a
spinlock - and screw latency and BKL preemption - and having the RT people
who care deeply just work on removing the BKL in the long run.
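
For reference, "turning it back into a spinlock" means going back to something
like the classic !PREEMPT_BKL lib/kernel_lock.c, which is essentially a
recursion-counted spinlock. A rough paraphrase from memory (not the trial-balloon
patch later in the thread, just the general shape):

	static __cacheline_aligned_in_smp DEFINE_SPINLOCK(kernel_flag);

	void __lockfunc lock_kernel(void)
	{
		int depth = current->lock_depth + 1;

		if (likely(!depth))		/* outermost acquisition only */
			spin_lock(&kernel_flag);
		current->lock_depth = depth;
	}

	void __lockfunc unlock_kernel(void)
	{
		BUG_ON(current->lock_depth < 0);
		if (likely(--current->lock_depth < 0))
			spin_unlock(&kernel_flag);
	}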

Is BKL preemption worth it? Sounds very dubious. Sounds even more dubious
when we now apparently have even more reason to aim for removing the BKL
rather than trying to mess around with it.

Linus

Alan Cox

May 7, 2008, 10:45:55 AM
to Linus Torvalds, Andi Kleen, Matthew Wilcox, Zhang, Yanmin, Ingo Molnar, LKML, Alexander Viro, Andrew Morton
> But my preferred option would indeed be just turning it back into a
> spinlock - and screw latency and BKL preemption - and having the RT people
> who care deeply just work on removing the BKL in the long run.

It isn't as if the RT build can't use a different lock type to the
default build.

> Is BKL preemption worth it? Sounds very dubious. Sounds even more dubious
> when we now apparently have even more reason to aim for removing the BKL
> rather than trying to mess around with it.

We have some horrible long lasting BKL users left unfortunately.

Alan

Andi Kleen

May 7, 2008, 10:58:26 AM
to Linus Torvalds, Matthew Wilcox, Zhang, Yanmin, Ingo Molnar, LKML, Alexander Viro, Andrew Morton

> I think turning the BKL into a semaphore was fine per se,

Are there really that many long-lived BKL holders? I have some doubts.
A semaphore only makes sense when the critical region is at least
many thousands of cycles, to amortize the scheduling overhead.

Ok, perhaps this needs some numbers to decide.

> but that was
> when semaphores were fast.

The semaphores should still be nearly as fast in theory, especially
for the contended case.

> Considering the apparent AIM regressions, we really either need to revert
> the semaphore consolidation,

Or figure out what made the semaphore consolidation slower? As Ingo
pointed out earlier 40% is unlikely to be a fast path problem, but some
algorithmic problem. Surely that is fixable (even for .26)?

Perhaps we were lucky it showed so easily, not in something tricky.

-Andi

Linus Torvalds

May 7, 2008, 11:01:36 AM
to Alan Cox, Andi Kleen, Matthew Wilcox, Zhang, Yanmin, Ingo Molnar, LKML, Alexander Viro, Andrew Morton

On Wed, 7 May 2008, Alan Cox wrote:
>
> > But my preferred option would indeed be just turning it back into a
> > spinlock - and screw latency and BKL preemption - and having the RT people
> > who care deeply just work on removing the BKL in the long run.
>
> It isn't as if the RT build can't use a different lock type to the
> default build.

Well, considering just *how* bad the new BKL apparently is, I think that's
a separate issue. The semaphore implementation is simply not worth it. At
a minimum, it should be a mutex.

> > Is BKL preemption worth it? Sounds very dubious. Sounds even more dubious
> > when we now apparently have even more reason to aim for removing the BKL
> > rather than trying to mess around with it.
>
> We have some horrible long lasting BKL users left unfortunately.

Quite frankly, maybe we _need_ to have a bad BKL for those to ever get
fixed. As it was, people worked on trying to make the BKL behave better,
and it was a failure. Rather than spend the effort on trying to make it
work better (at a horrible cost), why not just say "Hell no - if you have
issues with it, you need to work with people to get rid of the BKL
rather than cluge around it".

Linus

Linus Torvalds

May 7, 2008, 11:03:46 AM
to Alan Cox, Andi Kleen, Matthew Wilcox, Zhang, Yanmin, Ingo Molnar, LKML, Alexander Viro, Andrew Morton

On Wed, 7 May 2008, Linus Torvalds wrote:
>
> Quite frankly, maybe we _need_ to have a bad BKL for those to ever get
> fixed. As it was, people worked on trying to make the BKL behave better,
> and it was a failure. Rather than spend the effort on trying to make it
> work better (at a horrible cost), why not just say "Hell no - if you have
> issues with it, you need to work with people to get rid of the BKL
> rather than cluge around it".

Put another way: if we had introduced the BKL-as-semaphore with a known
40% performance drop in AIM7, I would simply never ever have accepted the
patch in the first place, regardless of _any_ excuses.

Performance is a feature too.

Now, just because the code is already merged should not be an excuse for
it then being shown to be bad. It's not a valid excuse to say "but we
already merged it, so we can't unmerge it". We sure as hell _can_ unmerge
it.

Linus Torvalds

May 7, 2008, 11:20:32 AM
to Andi Kleen, Matthew Wilcox, Zhang, Yanmin, Ingo Molnar, LKML, Alexander Viro, Andrew Morton

On Wed, 7 May 2008, Linus Torvalds wrote:
>
> But my preferred option would indeed be just turning it back into a
> spinlock - and screw latency and BKL preemption - and having the RT people
> who care deeply just work on removing the BKL in the long run.

Here's a trial balloon patch to do that.

Yanmin - this is not well tested, but the code is fairly obvious, and it
would be interesting to hear if this fixes the performance regression.
Because if it doesn't, then it's not the BKL, or something totally
different is going on.

Now, we should probably also test just converting the thing to a mutex,
to see if that perhaps also fixes it.

Linus

---
arch/mn10300/Kconfig | 11 ----
include/linux/hardirq.h | 18 ++++---
kernel/sched.c | 27 ++---------
lib/kernel_lock.c | 120 +++++++++++++++++++++++++++++++---------------
4 files changed, 95 insertions(+), 81 deletions(-)

diff --git a/arch/mn10300/Kconfig b/arch/mn10300/Kconfig
index 6a6409a..e856218 100644
--- a/arch/mn10300/Kconfig
+++ b/arch/mn10300/Kconfig
@@ -186,17 +186,6 @@ config PREEMPT
Say Y here if you are building a kernel for a desktop, embedded
or real-time system. Say N if you are unsure.

-config PREEMPT_BKL
- bool "Preempt The Big Kernel Lock"
- depends on PREEMPT
- default y
- help
- This option reduces the latency of the kernel by making the
- big kernel lock preemptible.
-
- Say Y here if you are building a kernel for a desktop system.
- Say N if you are unsure.
-
config MN10300_CURRENT_IN_E2
bool "Hold current task address in E2 register"
default y
diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index 897f723..181006c 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -72,6 +72,14 @@
#define in_softirq() (softirq_count())
#define in_interrupt() (irq_count())

+#if defined(CONFIG_PREEMPT)
+# define PREEMPT_INATOMIC_BASE kernel_locked()
+# define PREEMPT_CHECK_OFFSET 1
+#else
+# define PREEMPT_INATOMIC_BASE 0
+# define PREEMPT_CHECK_OFFSET 0
+#endif
+
/*
* Are we running in atomic context? WARNING: this macro cannot
* always detect atomic context; in particular, it cannot know about
@@ -79,17 +87,11 @@
* used in the general case to determine whether sleeping is possible.
* Do not use in_atomic() in driver code.
*/
-#define in_atomic() ((preempt_count() & ~PREEMPT_ACTIVE) != 0)
-
-#ifdef CONFIG_PREEMPT
-# define PREEMPT_CHECK_OFFSET 1
-#else
-# define PREEMPT_CHECK_OFFSET 0
-#endif
+#define in_atomic() ((preempt_count() & ~PREEMPT_ACTIVE) != PREEMPT_INATOMIC_BASE)

/*
* Check whether we were atomic before we did preempt_disable():
- * (used by the scheduler)
+ * (used by the scheduler, *after* releasing the kernel lock)
*/
#define in_atomic_preempt_off() \
((preempt_count() & ~PREEMPT_ACTIVE) != PREEMPT_CHECK_OFFSET)
diff --git a/kernel/sched.c b/kernel/sched.c
index 58fb8af..c51b656 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4567,8 +4567,6 @@ EXPORT_SYMBOL(schedule);
asmlinkage void __sched preempt_schedule(void)
{
struct thread_info *ti = current_thread_info();
- struct task_struct *task = current;
- int saved_lock_depth;

/*
* If there is a non-zero preempt_count or interrupts are disabled,
@@ -4579,16 +4577,7 @@ asmlinkage void __sched preempt_schedule(void)

do {
add_preempt_count(PREEMPT_ACTIVE);
-
- /*
- * We keep the big kernel semaphore locked, but we
- * clear ->lock_depth so that schedule() doesnt
- * auto-release the semaphore:
- */
- saved_lock_depth = task->lock_depth;
- task->lock_depth = -1;
schedule();
- task->lock_depth = saved_lock_depth;
sub_preempt_count(PREEMPT_ACTIVE);

/*
@@ -4609,26 +4598,15 @@ EXPORT_SYMBOL(preempt_schedule);
asmlinkage void __sched preempt_schedule_irq(void)
{
struct thread_info *ti = current_thread_info();
- struct task_struct *task = current;
- int saved_lock_depth;

/* Catch callers which need to be fixed */
BUG_ON(ti->preempt_count || !irqs_disabled());

do {
add_preempt_count(PREEMPT_ACTIVE);
-
- /*
- * We keep the big kernel semaphore locked, but we
- * clear ->lock_depth so that schedule() doesnt
- * auto-release the semaphore:
- */
- saved_lock_depth = task->lock_depth;
- task->lock_depth = -1;
local_irq_enable();
schedule();
local_irq_disable();
- task->lock_depth = saved_lock_depth;
sub_preempt_count(PREEMPT_ACTIVE);

/*
@@ -5853,8 +5831,11 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu)
spin_unlock_irqrestore(&rq->lock, flags);

/* Set the preempt count _outside_ the spinlocks! */
+#if defined(CONFIG_PREEMPT)
+ task_thread_info(idle)->preempt_count = (idle->lock_depth >= 0);
+#else
task_thread_info(idle)->preempt_count = 0;
-
+#endif
/*
* The idle tasks have their own, simple scheduling class:
*/
diff --git a/lib/kernel_lock.c b/lib/kernel_lock.c
index cd3e825..06722aa 100644
--- a/lib/kernel_lock.c
+++ b/lib/kernel_lock.c
@@ -11,79 +11,121 @@
#include <linux/semaphore.h>

/*
- * The 'big kernel semaphore'
+ * The 'big kernel lock'
*
- * This mutex is taken and released recursively by lock_kernel()
+ * This spinlock is taken and released recursively by lock_kernel()
* and unlock_kernel(). It is transparently dropped and reacquired
* over schedule(). It is used to protect legacy code that hasn't
* been migrated to a proper locking design yet.
*
- * Note: code locked by this semaphore will only be serialized against
- * other code using the same locking facility. The code guarantees that
- * the task remains on the same CPU.
- *
* Don't use in new code.
*/
-static DECLARE_MUTEX(kernel_sem);
+static __cacheline_aligned_in_smp DEFINE_SPINLOCK(kernel_flag);
+

/*
- * Re-acquire the kernel semaphore.
+ * Acquire/release the underlying lock from the scheduler.
*
- * This function is called with preemption off.
+ * This is called with preemption disabled, and should
+ * return an error value if it cannot get the lock and
+ * TIF_NEED_RESCHED gets set.
*
- * We are executing in schedule() so the code must be extremely careful
- * about recursion, both due to the down() and due to the enabling of
- * preemption. schedule() will re-check the preemption flag after
- * reacquiring the semaphore.
+ * If it successfully gets the lock, it should increment
+ * the preemption count like any spinlock does.
+ *
+ * (This works on UP too - _raw_spin_trylock will never
+ * return false in that case)
*/
int __lockfunc __reacquire_kernel_lock(void)
{
- struct task_struct *task = current;
- int saved_lock_depth = task->lock_depth;
-
- BUG_ON(saved_lock_depth < 0);
-
- task->lock_depth = -1;
- preempt_enable_no_resched();
-
- down(&kernel_sem);
-
+ while (!_raw_spin_trylock(&kernel_flag)) {
+ if (test_thread_flag(TIF_NEED_RESCHED))
+ return -EAGAIN;
+ cpu_relax();
+ }
preempt_disable();
- task->lock_depth = saved_lock_depth;
-
return 0;
}

void __lockfunc __release_kernel_lock(void)
{
- up(&kernel_sem);
+ _raw_spin_unlock(&kernel_flag);
+ preempt_enable_no_resched();
}

/*
- * Getting the big kernel semaphore.
+ * These are the BKL spinlocks - we try to be polite about preemption.
+ * If SMP is not on (ie UP preemption), this all goes away because the
+ * _raw_spin_trylock() will always succeed.
*/
-void __lockfunc lock_kernel(void)
+#ifdef CONFIG_PREEMPT
+static inline void __lock_kernel(void)
{
- struct task_struct *task = current;
- int depth = task->lock_depth + 1;
+ preempt_disable();
+ if (unlikely(!_raw_spin_trylock(&kernel_flag))) {
+ /*
+ * If preemption was disabled even before this
+ * was called, there's nothing we can be polite
+ * about - just spin.
+ */
+ if (preempt_count() > 1) {
+ _raw_spin_lock(&kernel_flag);
+ return;
+ }

- if (likely(!depth))
/*
- * No recursion worries - we set up lock_depth _after_
+ * Otherwise, let's wait for the kernel lock
+ * with preemption enabled..
*/
- down(&kernel_sem);
+ do {
+ preempt_enable();
+ while (spin_is_locked(&kernel_flag))
+ cpu_relax();
+ preempt_disable();
+ } while (!_raw_spin_trylock(&kernel_flag));
+ }
+}

- task->lock_depth = depth;
+#else
+
+/*
+ * Non-preemption case - just get the spinlock
+ */
+static inline void __lock_kernel(void)
+{
+ _raw_spin_lock(&kernel_flag);
}
+#endif

-void __lockfunc unlock_kernel(void)
+static inline void __unlock_kernel(void)
{
- struct task_struct *task = current;
+ /*
+ * the BKL is not covered by lockdep, so we open-code the
+ * unlocking sequence (and thus avoid the dep-chain ops):
+ */
+ _raw_spin_unlock(&kernel_flag);
+ preempt_enable();
+}

- BUG_ON(task->lock_depth < 0);
+/*
+ * Getting the big kernel lock.
+ *
+ * This cannot happen asynchronously, so we only need to
+ * worry about other CPU's.
+ */
+void __lockfunc lock_kernel(void)
+{
+ int depth = current->lock_depth+1;
+ if (likely(!depth))
+ __lock_kernel();
+ current->lock_depth = depth;
+}

- if (likely(--task->lock_depth < 0))
- up(&kernel_sem);
+void __lockfunc unlock_kernel(void)
+{
+ BUG_ON(current->lock_depth < 0);
+ if (likely(--current->lock_depth < 0))
+ __unlock_kernel();
}

EXPORT_SYMBOL(lock_kernel);

Andrew Morton

unread,
May 7, 2008, 11:32:18 AM5/7/08
to Andi Kleen, Linus Torvalds, Matthew Wilcox, Zhang, Yanmin, Ingo Molnar, LKML, Alexander Viro
On Wed, 07 May 2008 16:57:52 +0200 Andi Kleen <an...@firstfloor.org> wrote:

> Or figure out what made the semaphore consolidation slower? As Ingo
> pointed out earlier 40% is unlikely to be a fast path problem, but some
> algorithmic problem. Surely that is fixable (even for .26)?

Absolutely. Yanmin is apparently showing that each call to __down()
results in 1,451 calls to schedule(). wtf?

Ingo Molnar

unread,
May 7, 2008, 12:21:08 PM5/7/08
to Linus Torvalds, Andi Kleen, Matthew Wilcox, Zhang, Yanmin, LKML, Alexander Viro, Andrew Morton

* Linus Torvalds <torv...@linux-foundation.org> wrote:

> I think turning the BKL into a semaphore was fine per se, but that was
> when semaphores were fast.

hm, do we know it for a fact that the 40% AIM regression is due to the
fastpath overhead of the BKL? It would be extraordinary if so.

I think it is far more likely that it's due to the different scheduling
and wakeup behavior of the new kernel/semaphore.c code. So the fix would
be to restore the old scheduling behavior - that's what Yanmin's manual
revert did and that's what got him back the previous AIM7 performance.

Ingo

Matthew Wilcox

unread,
May 7, 2008, 12:23:46 PM5/7/08
to Andrew Morton, Andi Kleen, Linus Torvalds, Zhang, Yanmin, Ingo Molnar, LKML, Alexander Viro
On Wed, May 07, 2008 at 08:31:05AM -0700, Andrew Morton wrote:
> On Wed, 07 May 2008 16:57:52 +0200 Andi Kleen <an...@firstfloor.org> wrote:
>
> > Or figure out what made the semaphore consolidation slower? As Ingo
> > pointed out earlier 40% is unlikely to be a fast path problem, but some
> > algorithmic problem. Surely that is fixable (even for .26)?
>
> Absolutely. Yanmin is apparently showing that each call to __down()
> results in 1,451 calls to schedule(). wtf?

I can't figure it out either. Unless schedule() is broken somehow ...
but that should have shown up with semaphore-sleepers.c, shouldn't it?

One other difference between semaphore-sleepers and the new generic code
is that in effect, semaphore-sleepers does a little bit of spinning
before it sleeps. That is, if up() and down() are called more-or-less
simultaneously, the increment of sem->count will happen before __down
calls schedule(). How about something like this:

diff --git a/kernel/semaphore.c b/kernel/semaphore.c
index 5c2942e..ef83f5a 100644
--- a/kernel/semaphore.c
+++ b/kernel/semaphore.c
@@ -211,6 +211,7 @@ static inline int __sched __down_common(struct semaphore *sem, long state,
waiter.up = 0;

for (;;) {
+ int i;
if (state == TASK_INTERRUPTIBLE && signal_pending(task))
goto interrupted;
if (state == TASK_KILLABLE && fatal_signal_pending(task))
@@ -219,7 +220,15 @@ static inline int __sched __down_common(struct semaphore *sem, long state,
goto timed_out;
__set_task_state(task, state);
spin_unlock_irq(&sem->lock);
+
+ for (i = 0; i < 10; i++) {
+ if (waiter.up)
+ goto skip_schedule;
+ cpu_relax();
+ }
+
timeout = schedule_timeout(timeout);
+ skip_schedule:
spin_lock_irq(&sem->lock);
if (waiter.up)
return 0;

Maybe it'd be enough to test it once ... or maybe we should use
spin_is_locked() ... Ingo?

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

Linus Torvalds

unread,
May 7, 2008, 12:36:58 PM5/7/08
to Ingo Molnar, Andi Kleen, Matthew Wilcox, Zhang, Yanmin, LKML, Alexander Viro, Andrew Morton

On Wed, 7 May 2008, Ingo Molnar wrote:
>
> I think it is far more likely that it's due to the different scheduling
> and wakeup behavior of the new kernel/semaphore.c code. So the fix would
> be to restore the old scheduling behavior - that's what Yanmin's manual
> revert did and that's what got him back the previous AIM7 performance.

Yes, Yanmin's manual revert got rid of the new semaphores entirely. Which
was, what, a 7500-line code removal that got reverted.

And the *WHOLE* and *ONLY* excuse for dropping the spinlock lock_kernel
was this (and I quote your message):

remove the !PREEMPT_BKL code.

this removes 160 lines of legacy code.

in other words, your only stated valid reason for getting rid of the
spinlock was 160 lines, and the comment didn't even match what it did (it
removed the spinlocks entirely, not just the preemptible version).

In contrast, the revert adds 7500 lines. If you go by the only documented
reason for the crap that is the current BKL, then I know which one I'll
take. I'll take the spinlock back, and I'd rather put preemption back
than ever take those semaphores.

And even that's ignoring another issue: did anybody ever even do that AIM7
benchmark comparing spinlocks to the semaphore-BKL? It's quite possible
that the semaphores (even the well-behaved ones) behaved worse than the
spinlocks.

Linus

Matthew Wilcox

unread,
May 7, 2008, 12:38:50 PM5/7/08
to Andrew Morton, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, Linus Torvalds, linux-...@vger.kernel.org
On Tue, May 06, 2008 at 10:21:53AM -0700, Andrew Morton wrote:
> up() seems to be doing wake-one, FIFO which is nice. Did the
> implementation which we just removed also do that? Was it perhaps
> accidentally doing LIFO or something like that?

If heavily contended, it could do this.

up() would increment sem->count and call __up(), which would call wake_up().
down() would decrement sem->count.
The unlucky task woken by __up() would lose the race and go back to sleep.
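
As a self-contained illustration of that behaviour (a model only, not the
real arch-specific semaphore code of the time): a fresh down() caller can
consume the count that up() just released before the woken waiter gets to
run, so the waiter re-checks, loses, and goes back to sleep.

#include <linux/wait.h>
#include <linux/atomic.h>

/* Model of the old behaviour only; initialise with ATOMIC_INIT(1) and
 * init_waitqueue_head() before use. */
struct old_sem_model {
        atomic_t count;                 /* > 0: free, <= 0: taken */
        wait_queue_head_t wait;
};

static void model_up(struct old_sem_model *sem)
{
        atomic_inc(&sem->count);        /* give the count back first... */
        wake_up(&sem->wait);            /* ...then wake a sleeper, if any */
}

static void model_down(struct old_sem_model *sem)
{
        /*
         * The decrement is the whole fastpath.  A caller arriving here just
         * after model_up() can grab the count before the task woken by
         * wake_up() is scheduled; the woken task re-checks the count, loses,
         * and goes back to sleep.  Unfair, but the hot path never enters the
         * scheduler.
         */
        while (atomic_dec_return(&sem->count) < 0) {
                atomic_inc(&sem->count);        /* lost: undo and wait */
                wait_event(sem->wait, atomic_read(&sem->count) > 0);
        }
}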

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

Linus Torvalds

unread,
May 7, 2008, 12:56:17 PM5/7/08
to Matthew Wilcox, Andrew Morton, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-...@vger.kernel.org

On Wed, 7 May 2008, Matthew Wilcox wrote:
>
> If heavily contended, it could do this.

It doesn't have to be heavily contended - if it's just hot and a bit lucky,
it would potentially never schedule at all, because it would never take
the spinlock and serialize the callers.

It doesn't even need "unfairness" to work that way. The old semaphore
implementation was very much designed to be lock-free, and if you had one
CPU doing a lock while another did an unlock, the *common* situation was
that the unlock would succeed first, because the unlocker was also the
person who had the spinlock exclusively in its cache!

The above may count as "lucky", but the hot-cache-line thing is a big
deal. It turns "lucky" into something that isn't a 50:50 chance, but
something that is quite possible to trigger consistently if you just have
mostly short holders of the lock.

Which, btw, is probably true. The BKL is normally held for short times,
and released (by that thread) for relatively much longer times. Which
is when spinlocks tend to work the best, even when they are fair (because
it's not so much a fairness issue, it's simply a cost-of-taking-the-lock
issue!)

Linus

Ingo Molnar

unread,
May 7, 2008, 1:06:19 PM5/7/08
to Linus Torvalds, Andi Kleen, Matthew Wilcox, Zhang, Yanmin, LKML, Alexander Viro, Andrew Morton

* Linus Torvalds <torv...@linux-foundation.org> wrote:

> > I think it is far more likely that it's due to the different
> > scheduling and wakeup behavior of the new kernel/semaphore.c code.
> > So the fix would be to restore the old scheduling behavior - that's
> > what Yanmin's manual revert did and that's what got him back the
> > previous AIM7 performance.
>
> Yes, Yanmin's manual revert got rid of the new semaphores entirely.
> Which was what, 7500 lines of code removed that got reverted.

i wouldnt advocate a 7500-line revert instead of a 160-line change.

my suggestion was that the scheduling behavior of the new
kernel/semaphore.c code is causing the problem - i.e. making it match
the old semaphore code's behavior would give us back performance.

> And the *WHOLE* and *ONLY* excuse for dropping the spinlock
> lock_kernel was this (and I quote your message):
>
> remove the !PREEMPT_BKL code.
>
> this removes 160 lines of legacy code.
>
> in other words, your only stated valid reason for getting rid of the
> spinlock was 160 lines, and the comment didn't even match what it did
> (it removed the spinlocks entirely, not just the preemptible version).

it was removed by me in the course of this discussion:

http://lkml.org/lkml/2008/1/2/58

the whole discussion started IIRC because !CONFIG_PREEMPT_BKL [the
spinlock version] was broken for a longer period of time (it crashed
trivially), because nobody apparently used it. People (Nick) asked why
it was still there and i agreed and removed it. CONFIG_PREEMPT_BKL=y was
the default, that was what all distros used. I.e. the spinlock code was
in essence dead code at that point in time.

the spinlock code might in fact perform _better_, but nobody came up
with such a workload before.

> In contrast, the revert adds 7500 lines. If you go by the only
> documented reason for the crap that is the current BKL, then I know
> which one I'll take. I'll take the spinlock back, and I'd rather put
> preemption back than ever take those semaphores.
>
> And even that's ignoring another issue: did anybody ever even do that
> AIM7 benchmark comparing spinlocks to the semaphore-BKL? It's quite
> possible that the semaphores (even the well-behaved ones) behaved
> worse than the spinlocks.

that's a good question...

Ingo

Linus Torvalds

unread,
May 7, 2008, 1:09:32 PM5/7/08
to Matthew Wilcox, Andrew Morton, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-...@vger.kernel.org

On Wed, 7 May 2008, Linus Torvalds wrote:
>
> Which, btw, is probably true. The BKL is normally held for short times,
> and released (by that thread) for relatively much longer times. Which
> is when spinlocks tend to work the best, even when they are fair (because
> it's not so much a fairness issue, it's simply a cost-of-taking-the-lock
> issue!)

.. and don't get me wrong: the old semaphores (and the new mutexes) should
also have this property when lucky: taking the lock is often a hot-path
case.

And the spinlock+generic semaphore thing probably makes that "lucky"
behavior be exponentially less likely, because now to hit the lucky case,
rather than the hot path having just *one* access to the interesting cache
line, it has basically something like 4 accesses (spinlock, count test,
count decrement, spinunlock), in addition to various serializing
instructions, so I suspect it quite often gets serialized simply because
even the "fast path" is actually about ten times as long!

As a result, a slow "fast path" means that the thing gets saturated much
more easily, and that in turn means that the "fast path" turns into a
"slow path" more easily, which is how you end up in the scheduler rather
than just taking the fast path.

This is why sleeping locks are more expensive in general: they have a
*huge* cost from when they get contended. Hundreds of times higher than a
spinlock. And the faster they are, the longer it takes for them to get
contended under load. So slowing them down in the fast path is a double
whammy, in that it shows their bad behaviour much earlier.

And the generic semaphores really are slower than the old optimized ones
in that fast path. By a *big* amount.

Which is why I'm 100% convinced it's not even worth saving the old code.
It needs to use mutexes, or spinlocks. I bet it has *nothing* to do with
"slow path" other than the fact that it gets to that slow path much more
these days.

Ingo Molnar

unread,
May 7, 2008, 1:15:31 PM5/7/08
to Linus Torvalds, Andi Kleen, Matthew Wilcox, Zhang, Yanmin, LKML, Alexander Viro, Andrew Morton

* Linus Torvalds <torv...@linux-foundation.org> wrote:

> > But my preferred option would indeed be just turning it back into a
> > spinlock - and screw latency and BKL preemption - and having the RT
> > people who care deeply just work on removing the BKL in the long
> > run.
>
> Here's a trial balloon patch to do that.

here's a simpler trial balloon test-patch (well, hack) that is also
reasonably well tested. It turns the BKL into a "spin-semaphore". If
this resolves the performance problem then it's all due to the BKL's
scheduling/preemption properties.

this approach is ugly (it's just a more expensive spinlock), but has an
advantage: the code logic is obviously correct, and it would also make
it much easier later on to turn the BKL back into a sleeping lock again
- once the TTY code's BKL use is fixed. (i think Alan said it might
happen in the next few months) The BKL is more expensive than a simple
spinlock anyway.

Ingo

------------->
Subject: BKL: spin on acquire
From: Ingo Molnar <mi...@elte.hu>
Date: Wed May 07 19:05:40 CEST 2008

NOT-Signed-off-by: Ingo Molnar <mi...@elte.hu>
---

lib/kernel_lock.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

Index: linux/lib/kernel_lock.c
===================================================================
--- linux.orig/lib/kernel_lock.c
+++ linux/lib/kernel_lock.c
@@ -46,7 +46,8 @@ int __lockfunc __reacquire_kernel_lock(v
task->lock_depth = -1;
preempt_enable_no_resched();

- down(&kernel_sem);
+ while (down_trylock(&kernel_sem))
+ cpu_relax();

preempt_disable();
task->lock_depth = saved_lock_depth;
@@ -67,11 +68,13 @@ void __lockfunc lock_kernel(void)

struct task_struct *task = current;
int depth = task->lock_depth + 1;

- if (likely(!depth))
+ if (likely(!depth)) {
/*
* No recursion worries - we set up lock_depth _after_
*/
- down(&kernel_sem);
+ while (down_trylock(&kernel_sem))
+ cpu_relax();
+ }

task->lock_depth = depth;

Andrew Morton

unread,
May 7, 2008, 1:17:05 PM5/7/08
to Linus Torvalds, Matthew Wilcox, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-...@vger.kernel.org
On Wed, 7 May 2008 10:08:18 -0700 (PDT) Linus Torvalds <torv...@linux-foundation.org> wrote:

> Which is why I'm 100% convinced it's not even worth saving the old code.
> It needs to use mutexes, or spinlocks. I bet it has *nothing* to do with
> "slow path" other than the fact that it gets to that slow path much more
> these days.

Stupid question: why doesn't lock_kernel() use a mutex?

(stupid answer: it'll trigger might_sleep() checks when we do it early in
boot with irqs disabled, but we can fix that)

(And __might_sleep()'s system_state check might even save us from that)

Of course, we shouldn't change anything until we've worked out why the new
semaphores got slower.

Ingo Molnar

unread,
May 7, 2008, 1:23:34 PM5/7/08
to Linus Torvalds, Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-...@vger.kernel.org

* Linus Torvalds <torv...@linux-foundation.org> wrote:

> Which is why I'm 100% convinced it's not even worth saving the old
> code. It needs to use mutexes, or spinlocks. I bet it has *nothing* to
> do with "slow path" other than the fact that it gets to that slow path
> much more these days.

i think your theory should be easy to test: Yanmin, could you turn on
CONFIG_MUTEX_DEBUG=y and check by how much AIM7 regresses?

Because in the CONFIG_MUTEX_DEBUG=y case the mutex debug code does
exactly that: it doesnt use the single-instruction fastpath [it uses
asm-generic/mutex-null.h] but always drops into the slowpath (to be able
to access debug state). That debug code is about as expensive as the
generic semaphore code's current fastpath. (perhaps even more
expensive.)
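
(For reference, asm-generic/mutex-null.h is essentially just the following -
quoted from memory, so the exact set of macros may differ - every "fastpath"
helper expands to a direct call of the supplied slowpath function:)

#define __mutex_fastpath_lock(count, fail_fn)           fail_fn(count)
#define __mutex_fastpath_lock_retval(count, fail_fn)    fail_fn(count)
#define __mutex_fastpath_unlock(count, fail_fn)         fail_fn(count)
#define __mutex_fastpath_trylock(count, fail_fn)        fail_fn(count)
#define __mutex_slowpath_needs_to_unlock()              1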

There's far more normal mutex fastpath use during an AIM7 run than any
BKL use. So if it's due to any direct fastpath overhead and the
resulting widening of the window for the real slowdown, we should see a
severe slowdown on AIM7 with CONFIG_MUTEX_DEBUG=y. Agreed?

Ingo

Linus Torvalds

unread,
May 7, 2008, 1:25:49 PM5/7/08
to Ingo Molnar, Andi Kleen, Matthew Wilcox, Zhang, Yanmin, LKML, Alexander Viro, Andrew Morton

On Wed, 7 May 2008, Ingo Molnar wrote:
>

> it was removed by me in the course of this discussion:
>
> http://lkml.org/lkml/2008/1/2/58
>
> the whole discussion started IIRC because !CONFIG_PREEMPT_BKL [the
> spinlock version] was broken for a longer period of time (it crashed
> trivially), because nobody apparently used it.

Hmm. I've generally used PREEMPT_NONE, and always thought PREEMPT_BKL was
the known-flaky one.

The thread you point to also says that it's PREEMPT_BKL=y that was the
problem (ie "I've seen 1s+ desktop latencies due to PREEMPT_BKL when I was
still using reiserfs."), not the plain spinlock approach.

But it would definitely be interesting to see the crash reports. And the
help message always said "Say N if you are unsure." even if it ended up
being marked 'y' by default at some point (and then in January was made
first unconditional, and then removed entirely)

Because in many ways, the non-preempt BKL is the *much* simpler case. I
don't see why it would crash - it just turns the BKL into a trivial
counting spinlock that can sleep.

Linus

Ingo Molnar

unread,
May 7, 2008, 1:26:38 PM5/7/08
to Linus Torvalds, Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-...@vger.kernel.org

* Ingo Molnar <mi...@elte.hu> wrote:

> There's far more normal mutex fastpath use during an AIM7 run than any
> BKL use. So if it's due to any direct fastpath overhead and the
> resulting widening of the window for the real slowdown, we should see
> a severe slowdown on AIM7 with CONFIG_MUTEX_DEBUG=y. Agreed?

my own guesstimate about the AIM7 performance impact resulting from
CONFIG_MUTEX_DEBUG=y: performance overhead will not be measurable, or
will at most be in the sub-1% range. But i've been badly wrong before :)

Linus Torvalds

unread,
May 7, 2008, 1:28:43 PM5/7/08
to Andrew Morton, Matthew Wilcox, Ingo Molnar, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-...@vger.kernel.org

On Wed, 7 May 2008, Andrew Morton wrote:

> On Wed, 7 May 2008 10:08:18 -0700 (PDT) Linus Torvalds <torv...@linux-foundation.org> wrote:
>
> > Which is why I'm 100% convinced it's not even worth saving the old code.
> > It needs to use mutexes, or spinlocks. I bet it has *nothing* to do with
> > "slow path" other than the fact that it gets to that slow path much more
> > these days.
>
> Stupid question: why doesn't lock_kernel() use a mutex?

Not stupid.

The only reason some code didn't get turned over to mutexes was literally
that they didn't want the debugging because they were doing intentionally
bad things.

I think the BKL is one of them (the console semaphore was another, iirc).

Linus

Linus Torvalds

unread,
May 7, 2008, 1:33:44 PM5/7/08
to Ingo Molnar, Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-...@vger.kernel.org

On Wed, 7 May 2008, Ingo Molnar wrote:
>
> There's far more normal mutex fastpath use during an AIM7 run than any
> BKL use. So if it's due to any direct fastpath overhead and the
> resulting widening of the window for the real slowdown, we should see a
> severe slowdown on AIM7 with CONFIG_MUTEX_DEBUG=y. Agreed?

Not agreed.

The BKL is special because it is a *single* lock.

All the "normal" mutex code use fine-grained locking, so even if you slow
down the fast path, that won't cause the same kind of fastpath->slowpath
increase.

In order to see the fastpath->slowpath thing, you do need to have many
threads hitting the same lock: ie the slowdown has to result in real
contention.

Almost no mutexes have any potential for contention whatsoever, except
for things that very consciously try to hit it (multiple threads doing
readdir and/or file creation on the *same* directory etc).

The BKL really is special.

Linus

Ingo Molnar

unread,
May 7, 2008, 1:36:51 PM5/7/08
to Linus Torvalds, Andi Kleen, Matthew Wilcox, Zhang, Yanmin, LKML, Alexander Viro, Andrew Morton

* Linus Torvalds <torv...@linux-foundation.org> wrote:

>
>
> On Wed, 7 May 2008, Ingo Molnar wrote:
> >
> > it was removed by me in the course of this discussion:
> >
> > http://lkml.org/lkml/2008/1/2/58
> >
> > the whole discussion started IIRC because !CONFIG_PREEMPT_BKL [the
> > spinlock version] was broken for a longer period of time (it crashed
> > trivially), because nobody apparently used it.
>
> Hmm. I've generally used PREEMPT_NONE, and always thought PREEMPT_BKL
> was the known-flaky one.
>
> The thread you point to also says that it's PREEMPT_BKL=y that was the
> problem (ie "I've seen 1s+ desktop latencies due to PREEMPT_BKL when I
> was still using reiserfs."), not the plain spinlock approach.

no, there was another problem (which i couldnt immediately find because
lkml.org only indexes part of the threads, i'll research it some more),
which was some cond_resched() thing in the !PREEMPT_BKL case.

> But it would definitely be interesting to see the crash reports. And
> the help message always said "Say N if you are unsure." even if it
> ended up being marked 'y' by default at some point (and then in
> January was made first unconditional, and then removed entirely)
>
> Because in many ways, the non-preempt BKL is the *much* simpler case.
> I don't see why it would crash - it just turns the BKL into a trivial
> counting spinlock that can sleep.

yeah. The latencies are a different problem, and indeed were reported
against PREEMPT_BKL, and believed to be due to reiser3 and the tty code.
(reiser3 runs almost all of its code under the BKL)

The !PREEMPT_BKL crash was some simple screwup on my part of getting
atomicity checks wrong in cond_resched() - and it went unnoticed for a
long time - or something like that. I'll try to find that discussion.

Ingo

Linus Torvalds

unread,
May 7, 2008, 1:48:04 PM5/7/08
to Ingo Molnar, Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-...@vger.kernel.org

On Wed, 7 May 2008, Linus Torvalds wrote:
>
> All the "normal" mutex code use fine-grained locking, so even if you slow
> down the fast path, that won't cause the same kind of fastpath->slowpath
> increase.

Put another way: let's say that the "good fastpath" is basically a single
locked instruction - ~12 cycles on AMD, ~35 on Core 2. That's the
no-bouncing, no-contention case.

Doing it with debugging (call overhead, spinlocks, local irq saving etc)
will probably easily triple it or more, but we're not changing anything
else. There's no "downstream" effect: the behaviour itself doesn't change.
It doesn't get more bouncing, it doesn't start sleeping.

But what happens if the lock has the *potential* for conflicts is
different.

There, a "longish pause + fast lock + short average code sequece + fast
unlock" is quite likely to stay uncontended for a fair amount of time, and
while it will be much slower than the no-contention-at-all case (because
you do get a pretty likely cacheline event at the "fast lock" part), with
a fairly low number of CPU's and a long enough pause, you *also* easily
get into a pattern where the thing that got the lock will likely also get
to unlock without dropping the cacheline.

So far so good.

But that basically depends on the fact that "lock + work + unlock" is
_much_ shorter than the "longish pause" in between, so that even if you
have <n> CPU's all doing the same thing, their pauses between the locked
section are still bigger than <n> times that short time.

Once that is no longer true, you now start to bounce both at the lock
*and* the unlock, and now that whole sequence got likely ten times slower.
*AND* because it now actually has real contention, it actually got even
worse: if the lock is a sleeping one, you get *another* order of magnitude
just because you now started doing scheduling overhead too!

So the thing is, it just breaks down very badly. A spinlock that gets
contention probably gets ten times slower due to bouncing the cacheline. A
semaphore that gets contention probably gets a *hundred* times slower, or
more.

And so my bet is that both the old and the new semaphores had the same bad
break-down situation, but the new semaphores just are a lot easier to
trigger it because they are at least three times costlier than the old
ones, so you just hit the bad behaviour with much lower loads (or fewer
number of CPU's).

But spinlocks really do behave much better when contended, because at
least they don't get the even bigger hit of also hitting the scheduler. So
the old semaphores would have behaved badly too *eventually*, they just
needed a more extreme case to show that bad behavior.

Ingo Molnar

unread,
May 7, 2008, 1:49:52 PM5/7/08
to Linus Torvalds, Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-...@vger.kernel.org

* Linus Torvalds <torv...@linux-foundation.org> wrote:

> > There's far more normal mutex fastpath use during an AIM7 run than
> > any BKL use. So if it's due to any direct fastpath overhead and the
> > resulting widening of the window for the real slowdown, we should
> > see a severe slowdown on AIM7 with CONFIG_MUTEX_DEBUG=y. Agreed?
>
> Not agreed.
>
> The BKL is special because it is a *single* lock.

ok, indeed my suggestion is wrong and this would not be a good
comparison.

another idea: my trial-balloon patch should test your theory too, because
the generic down_trylock() is still the 'fat' version, it does:

spin_lock_irqsave(&sem->lock, flags);
count = sem->count - 1;
if (likely(count >= 0))
sem->count = count;
spin_unlock_irqrestore(&sem->lock, flags);

if there is a noticeable performance difference between your
trial-balloon patch and mine, then the micro-cost of the BKL very much
matters to this workload. Agreed about that?

but i'd be _hugely_ surprised about it. The tty code's BKL use should i
think only happen when a task exits and releases the tty - and a task
exit - even if this is a threaded test (which AIM7 can be - not sure
which exact parameters Yanmin used) - the costs of thread creation and
thread exit are just not in the same ballpark as any BKL micro-costs.
Dunno, maybe i overlooked some high-freq BKL user. (but any such site
would have shown up before) Even assuming a widening of the critical
path and some catastrophic domino effect (that does show up as increased
scheduling) i've never seen a 40% drop like this.

this regression, to me, has "different scheduling behavior" written all
over it - but that's just an impression. I'm not going to bet against
you though ;-)

Ingo

Linus Torvalds

unread,
May 7, 2008, 1:56:08 PM5/7/08
to Ingo Molnar, Andi Kleen, Matthew Wilcox, Zhang, Yanmin, LKML, Alexander Viro, Andrew Morton

On Wed, 7 May 2008, Ingo Molnar wrote:
>
> no, there was another problem (which i couldnt immediately find because
> lkml.org only indexes part of the threads, i'll research it some more),
> which was some cond_resched() thing in the !PREEMPT_BKL case.

Hmm. I do agree that _cond_resched() looks a bit iffy, although in a safe
way. It uses just

!(preempt_count() & PREEMPT_ACTIVE)

to see whether it can schedule, and it should probably use in_atomic()
which ignores the kernel lock.

But right now, that whole thing is disabled if PREEMPT is on anyway, so in
effect (with my test patch, at least) cond_resched() would just be a no-op
if PREEMPT is on, even if BKL isn't preemptable.

So it doesn't look buggy, but it looks like it might cause longer
latencies than strictly necessary. And if somebody depends on
cond_resched() to avoid some bad livelock situation, that would obviously
not work (but that sounds like a fundamental bug anyway, I really hope
nobody has ever written their code that way).

> The !PREEMPT_BKL crash was some simple screwup on my part of getting
> atomicity checks wrong in cond_resched() - and it went unnoticed for a
> long time - or something like that. I'll try to find that discussion.

Yes, some silly bug sounds more likely. Especially considering how many
different cases there were (semaphores vs spinlocks vs preemptable
spinlocks).

Linus

Matthew Wilcox

unread,
May 7, 2008, 2:01:24 PM5/7/08
to Linus Torvalds, Ingo Molnar, Andi Kleen, Zhang, Yanmin, LKML, Alexander Viro, Andrew Morton
On Wed, May 07, 2008 at 10:55:26AM -0700, Linus Torvalds wrote:
> Hmm. I do agree that _cond_resched() looks a bit iffy, although in a safe
> way. It uses just
>
> !(preempt_count() & PREEMPT_ACTIVE)
>
> to see whether it can schedule, and it should probably use in_atomic()
> which ignores the kernel lock.
>
> But right now, that whole thing is disabled if PREEMPT is on anyway, so in
> effect (with my test patch, at least) cond_resched() would just be a no-op
> if PREEMPT is on, even if BKL isn't preemptable.
>
> So it doesn't look buggy, but it looks like it might cause longer
> latencies than strictly necessary. And if somebody depends on
> cond_resched() to avoid some bad livelock situation, that would obviously
> not work (but that sounds like a fundamental bug anyway, I really hope
> nobody has ever written their code that way).

Funny you should mention it; locks.c uses cond_resched() assuming that
it ignores the BKL. Not through needing to avoid livelock, but it does
presume that other higher priority tasks contending for the lock will
get a chance to take it. You'll notice the patch I posted yesterday
drops the file_lock_lock around the call to cond_resched().

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

Linus Torvalds

unread,
May 7, 2008, 2:03:51 PM5/7/08
to Ingo Molnar, Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-...@vger.kernel.org

On Wed, 7 May 2008, Ingo Molnar wrote:
>

> another idea: my trial-balloon patch should test your theory too, because
> the generic down_trylock() is still the 'fat' version, it does:

I agree that your trial-balloon should likely get rid of the big
regression, since it avoids the scheduler.

So with your patch, lock_kernel() ends up being just a rather expensive
spinlock. And yes, I'd expect that it should get rid of the 40% cost,
because while it makes lock_kernel() more expensive than a spinlock and
you might end up having a few more cacheline bounces on the lock due to
that, that's still the "small" expense compared to going through the whole
scheduler on conflicts.

So I'd expect that realistically the performance difference between your
version and just plain spinlocks shouldn't be *that* big. I'd expect it to
be visible, but in the (low) single-digit percentage range rather than in
any 40% range. That's just a guess.

Linus

Ingo Molnar

unread,
May 7, 2008, 2:17:52 PM5/7/08
to Linus Torvalds, Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-...@vger.kernel.org

* Linus Torvalds <torv...@linux-foundation.org> wrote:

> On Wed, 7 May 2008, Ingo Molnar wrote:
> >
> > another idea: my trial-balloon patch should test your theory too,
> > because the generic down_trylock() is still the 'fat' version, it
> > does:
>
> I agree that your trial-balloon should likely get rid of the big
> regression, since it avoids the scheduler.
>
> So with your patch, lock_kernel() ends up being just a rather
> expensive spinlock. And yes, I'd expect that it should get rid of the
> 40% cost, because while it makes lock_kernel() more expensive than a
> spinlock and you might end up having a few more cacheline bounces on
> the lock due to that, that's still the "small" expense compared to
> going through the whole scheduler on conflicts.
>
> So I'd expect that realistically the performance difference between
> your version and just plain spinlocks shouldn't be *that* big. I'd
> expect it to be visible, but in the (low) single-digit percentage
> range rather than in any 40% range. That's just a guess.

third attempt - the patch below on top of v2.6.25 should have fastpath
atomic overhead quite similar to what the generic semaphores add. So if
Yanmin tests this patch on top of v2.6.25, we should see the direct
fastpath overhead - without otherwise changing the semaphore
wakeup/scheduling logic.

[ this patch should in fact be a bit worse, because there's two more
atomics in the fastpath - the fastpath atomics of the old semaphore
code. ]

Ingo

------------------>
Subject: v2.6.25 BKL: add atomic overhead
From: Ingo Molnar <mi...@elte.hu>
Date: Wed May 07 20:09:13 CEST 2008

---
lib/kernel_lock.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)

Index: linux-2.6.25/lib/kernel_lock.c
===================================================================
--- linux-2.6.25.orig/lib/kernel_lock.c
+++ linux-2.6.25/lib/kernel_lock.c
@@ -24,6 +24,7 @@
* Don't use in new code.
*/
static DECLARE_MUTEX(kernel_sem);
+static DEFINE_SPINLOCK(global_lock);

/*
* Re-acquire the kernel semaphore.

@@ -47,6 +48,9 @@ int __lockfunc __reacquire_kernel_lock(v

down(&kernel_sem);

+ spin_lock(&global_lock);
+ spin_unlock(&global_lock);
+
preempt_disable();
task->lock_depth = saved_lock_depth;

@@ -55,6 +59,9 @@ int __lockfunc __reacquire_kernel_lock(v

void __lockfunc __release_kernel_lock(void)
{
+ spin_lock(&global_lock);
+ spin_unlock(&global_lock);
+
up(&kernel_sem);
}

@@ -66,12 +73,16 @@ void __lockfunc lock_kernel(void)

struct task_struct *task = current;
int depth = task->lock_depth + 1;

- if (likely(!depth))
+ if (likely(!depth)) {
/*
* No recursion worries - we set up lock_depth _after_
*/
down(&kernel_sem);

+ spin_lock(&global_lock);
+ spin_unlock(&global_lock);
+ }
+
task->lock_depth = depth;
}

Linus Torvalds

unread,
May 7, 2008, 2:19:10 PM5/7/08
to Matthew Wilcox, Ingo Molnar, Andi Kleen, Zhang, Yanmin, LKML, Alexander Viro, Andrew Morton

On Wed, 7 May 2008, Matthew Wilcox wrote:
> >
> > So it doesn't look buggy, but it looks like it might cause longer
> > latencies than strictly necessary. And if somebody depends on
> > cond_resched() to avoid some bad livelock situation, that would obviously
> > not work (but that sounds like a fundamental bug anyway, I really hope
> > nobody has ever written their code that way).
>
> Funny you should mention it; locks.c uses cond_resched() assuming that
> it ignores the BKL. Not through needing to avoid livelock, but it does
> presume that other higher priority tasks contending for the lock will
> get a chance to take it. You'll notice the patch I posted yesterday
> drops the file_lock_lock around the call to cond_resched().

Well, this would only be noticeable with CONFIG_PREEMPT.

If you don't have preempt enabled, it looks like everything should work
ok: the kernel lock wouldn't increase the preempt count, and
_cond_resched() works fine.

If you're PREEMPT, then the kernel lock would increase the preempt count,
and _cond_resched() would refuse to re-schedule it, *but* with PREEMPT
you'd never see it *anyway*, because PREEMPT will disable cond_resched()
entirely (because preemption takes care of normal scheduling latencies
without it).

And I'm also sure that this all worked fine at some point, and it's
largely a result just of the multiple different variations of BKL
preemption coupled with some of them getting removed entirely, so the code
that used to handle it just got corrupt over time. See commit 02b67cc3b,
for example.

.. Hmm ... Time passes. Linus looks at git history.

It does look like "cond_resched()" has not worked with the BKL since 2005,
and hasn't taken the BKL into account. Commit 5bbcfd9000:

[PATCH] cond_resched(): fix bogus might_sleep() warning

+ if (unlikely(preempt_count()))
+ return;

which talks about the BKS, ie it only took the *semaphore* implementation
into account. Never the spinlock-with-preemption-count one.

Or am I blind?

Linus

Linus Torvalds

unread,
May 7, 2008, 2:28:26 PM5/7/08
to Ingo Molnar, Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-...@vger.kernel.org

On Wed, 7 May 2008, Ingo Molnar wrote:
>
> [ this patch should in fact be a bit worse, because there's two more
> atomics in the fastpath - the fastpath atomics of the old semaphore
> code. ]

Well, it doesn't have the irq stuff, which is also pretty costly. Also, it
doesn't nest the accesses the same way (with the counts being *inside* the
spinlock and serialized against each other), so I'm not 100% sure you'd
get the same behaviour.

But yes, it certainly has the potential to show the same slowdown. But
it's not a very good patch, since not showing it doesn't really prove
much.

Linus

Ingo Molnar

unread,
May 7, 2008, 2:43:41 PM5/7/08
to Linus Torvalds, Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-...@vger.kernel.org

* Linus Torvalds <torv...@linux-foundation.org> wrote:

> > [ this patch should in fact be a bit worse, because there's two more
> > atomics in the fastpath - the fastpath atomics of the old
> > semaphore code. ]
>
> Well, it doesn't have the irq stuff, which is also pretty costly.
> Also, it doesn't nest the accesses the same way (with the counts being
> *inside* the spinlock and serialized against each other), so I'm not
> 100% sure you'd get the same behaviour.
>
> But yes, it certainly has the potential to show the same slowdown. But
> it's not a very good patch, since not showing it doesn't really prove
> much.

ok, the one below does irq ops and the counter behavior - and because
the critical section also has the old-semaphore atomics i think this
should definitely be a more expensive fastpath than what the new generic
code introduces. So if this patch produces a 40% AIM7 slowdown on
v2.6.25 it's the fastpath overhead (and its effects on slowpath
probability) that makes the difference.

Ingo

------------------->
Subject: add BKL atomic overhead


From: Ingo Molnar <mi...@elte.hu>
Date: Wed May 07 20:09:13 CEST 2008

NOT-Signed-off-by: Ingo Molnar <mi...@elte.hu>
---
lib/kernel_lock.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)

Index: linux-2.6.25/lib/kernel_lock.c
===================================================================
--- linux-2.6.25.orig/lib/kernel_lock.c
+++ linux-2.6.25/lib/kernel_lock.c

@@ -24,6 +24,8 @@
* Don't use in new code.
*/
static DECLARE_MUTEX(kernel_sem);
+static int global_count;
+static DEFINE_SPINLOCK(global_lock);

/*
* Re-acquire the kernel semaphore.

@@ -39,6 +41,7 @@ int __lockfunc __reacquire_kernel_lock(v
{
struct task_struct *task = current;
int saved_lock_depth = task->lock_depth;
+ unsigned long flags;

BUG_ON(saved_lock_depth < 0);

@@ -47,6 +50,10 @@ int __lockfunc __reacquire_kernel_lock(v

down(&kernel_sem);

+ spin_lock_irqsave(&global_lock, flags);
+ global_count++;
+ spin_unlock_irqrestore(&global_lock, flags);
+
preempt_disable();
task->lock_depth = saved_lock_depth;

@@ -55,6 +62,10 @@ int __lockfunc __reacquire_kernel_lock(v

void __lockfunc __release_kernel_lock(void)
{
+ spin_lock_irqsave(&global_lock, flags);
+ global_count--;
+ spin_unlock_irqrestore(&global_lock, flags);
+
up(&kernel_sem);
}

@@ -66,12 +77,17 @@ void __lockfunc lock_kernel(void)

struct task_struct *task = current;
int depth = task->lock_depth + 1;

- if (likely(!depth))
+ if (likely(!depth)) {
/*
* No recursion worries - we set up lock_depth _after_
*/
down(&kernel_sem);

+ spin_lock_irqsave(&global_lock, flags);
+ global_count++;
+ spin_unlock_irqrestore(&global_lock, flags);
+ }
+
task->lock_depth = depth;
}

Ingo Molnar

unread,
May 7, 2008, 2:50:02 PM5/7/08
to Linus Torvalds, Matthew Wilcox, Andi Kleen, Zhang, Yanmin, LKML, Alexander Viro, Andrew Morton

* Linus Torvalds <torv...@linux-foundation.org> wrote:

> .. Hmm ... Time passes. Linus looks at git history.


>
> It does look like "cond_resched()" has not worked with the BKL since
> 2005, and hasn't taken the BKL into account. Commit 5bbcfd9000:
>
> [PATCH] cond_resched(): fix bogus might_sleep() warning
>
> + if (unlikely(preempt_count()))
> + return;
>
> which talks about the BKS, ie it only took the *semaphore*
> implementation into account. Never the spinlock-with-preemption-count
> one.
>
> Or am I blind?

hm, i think you are right.

most latency reduction was concentrated on the PREEMPT+PREEMPT_BKL case,
and not getting proper cond_resched() behavior in case of !PREEMPT_BKL
would certainly not be noticed by distros or users.

We made CONFIG_PREEMPT_BKL=y the default on SMP in v2.6.8, in this
post-2.6.7 commit that introduced the feature:

| commit fb8f6499abc6a847109d9602b797aa6afd2d5a3d
| Author: Ingo Molnar <mi...@elte.hu>
| Date: Fri Jan 7 21:59:57 2005 -0800
|
| [PATCH] remove the BKL by turning it into a semaphore

There was constant trouble around all these variations of preemptability
and their combination with debugging helpers. (So i was rather happy to
get rid of !PREEMPT_BKL - in the (apparently wrong) assumption that no
tears will be shed.)

Ingo

Linus Torvalds

unread,
May 7, 2008, 3:02:33 PM5/7/08
to Ingo Molnar, Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-...@vger.kernel.org

On Wed, 7 May 2008, Ingo Molnar wrote:
>

> ok, the one below does irq ops and the counter behavior

No it doesn't. The counter isn't used for any actual *testing*, so the
locking around it and the serialization of it has absolutely no impact on
the scheduling behaviour!

Since the big slowdown was clearly accompanied by sleeping behaviour (the
processes who didn't get the lock end up sleeping!), that is a *big* part
of the slowdown.

Is it possible that your patch gets similar behaviour? Absolutely. But
you're missing the whole point here. Anybody can make code behave badly
and perform worse. But if you want to just verify that it's about the
sleeping behaviour and timings of the BKL, then you need to do exactly
that: emulate the sleeping behavior, not just the timings _outside_ of the
sleeping behavior.

The thing is, we definitely are interested to see whether it's the BKL or
some other semaphore that is the problem. But the best way to test that is
to just try my patch that *guarantees* that the BKL doesn't have any
semaphore behaviour AT ALL.

Could it be something else entirely? Yes. We know it's semaphore-related.
We don't know for a fact that it's the BKL itself. There could be other
semaphores that are that hot. It sounds unlikely, but quite frankly,
regardless, I don't really see the point of your patches.

If Yanmin tries my patch, it is *guaranteed* to show something. It either
shows that it's about the BKL (and that we absolutely have to do the BKL
as something _else_ than a generic semaphore), or it shows that it's not
about the BKL (and that _all_ the patches in this discussion are likely
pointless).

In contrast, these "try to emulate bad behavior with the old known-ok
semaphores" don't show anything AT ALL. We already know it's related to
semaphores. And your patches aren't even guaranteed to show the same
issue.

Linus

Ingo Molnar

unread,
May 7, 2008, 3:11:27 PM5/7/08
to Linus Torvalds, Matthew Wilcox, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-...@vger.kernel.org

* Linus Torvalds <torv...@linux-foundation.org> wrote:

> In contrast, these "try to emulate bad behavior with the old known-ok
> semaphores" don't show anything AT ALL. We already know it's related
> to semaphores. And your patches aren't even guaranteed to show the
> same issue.

yeah, i was just trying to come up with patches to probe which one of
the following two possibilities is actually the case:

- if the regression is due to the difference in scheduling behavior of
new semaphores (different wakeup patterns, etc.), that's fixable in
the new semaphore code => then the BKL code need not change.

- if the regression is due to the difference in the fastpath cost, then
the new semaphores can probably not be improved (much of their appeal
comes from them not being complex and not being in assembly) => then
the BKL code needs to change to become cheaper [i.e. then we want
your patch].

Ingo

Matthew Wilcox

unread,
May 7, 2008, 3:25:54 PM5/7/08
to Linus Torvalds, Ingo Molnar, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-...@vger.kernel.org
On Wed, May 07, 2008 at 12:01:28PM -0700, Linus Torvalds wrote:
> The thing is, we definitely are interested to see whether it's the BKL or
> some other semaphore that is the problem. But the best way to test that is
> to just try my patch that *guarantees* that the BKL doesn't have any
> semaphore behaviour AT ALL.
>
> Could it be something else entirely? Yes. We know it's semaphore- related.
> We don't know for a fact that it's the BKL itself. There could be other
> semaphores that are that hot. It sounds unlikely, but quite frankly,
> regardless, I don't really see the point of your patches.
>
> If Yanmin tries my patch, it is *guaranteed* to show something. It either
> shows that it's about the BKL (and that we absolutely have to do the BKL
> as something _else_ than a generic semaphore), or it shows that it's not
> about the BKL (and that _all_ the patches in this discussion are likely
> pointless).

One patch I'd still like Yanmin to test is my one from yesterday which
removes the BKL from fs/locks.c.

http://marc.info/?l=linux-fsdevel&m=121009123427437&w=2

Obviously, it won't help if the problem isn't the BKL.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

Linus Torvalds

unread,
May 7, 2008, 3:45:51 PM5/7/08
to Matthew Wilcox, Ingo Molnar, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-...@vger.kernel.org

On Wed, 7 May 2008, Matthew Wilcox wrote:
>
> One patch I'd still like Yanmin to test is my one from yesterday which
> removes the BKL from fs/locks.c.

And I'd personally rather have the network-fs people test and comment on
that one ;)

I think that patch is worth looking at regardless, but the problems with
that one aren't about performance, but about what the implications are for
the filesystems (if any)...

Linus

Matthew Wilcox

unread,
May 7, 2008, 4:01:31 PM5/7/08
to Linus Torvalds, Ingo Molnar, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-...@vger.kernel.org
On Wed, May 07, 2008 at 12:44:48PM -0700, Linus Torvalds wrote:
> On Wed, 7 May 2008, Matthew Wilcox wrote:
> >
> > One patch I'd still like Yanmin to test is my one from yesterday which
> > removes the BKL from fs/locks.c.
>
> And I'd personally rather have the network-fs people test and comment on
> that one ;)
>
> I think that patch is worth looking at regardless, but the problems with
> that one aren't about performance, but about what the implications are for
> the filesystems (if any)...

Oh, well, they don't seem interested.

I can comment on some of the problems though.

fs/lockd/svcsubs.c, fs/nfs/delegation.c, fs/nfs/nfs4state.c,
fs/nfsd/nfs4state.c all walk the i_flock list under the BKL. That won't
protect them against locks.c any more. That's probably OK for fs/nfs/*
since they'll be protected by their own data structures (Someone please
check me on that?), but it's a bad idea for lockd/nfsd which are walking
the lists for filesystems.

Are we going to have to export the file_lock_lock? I'd rather not. But
we need to keep nfsd/lockd from tripping over locks.c.

Maybe we could come up with a decent API that lockd could use? It all
seems a bit complex at the moment ... maybe lockd should be keeping
track of the locks it owns anyway (since surely the posix deadlock
detection code can't work properly if it's just passing all the locks
through).
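
As a purely hypothetical sketch of such an API (none of these names exist in
the kernel; lock_flocks()/unlock_flocks() stand in for whatever internal
locking fs/locks.c would use, and @dest is assumed to have been initialised
with locks_init_lock()):

/*
 * Hypothetical helper, for illustration only: find a lock on @inode
 * belonging to @owner and copy it into @dest, with the i_flock traversal
 * kept inside fs/locks.c so that lockd/nfsd never walk the list under the
 * BKL themselves.
 */
int vfs_copy_owned_lock(struct inode *inode, fl_owner_t owner,
                        struct file_lock *dest)
{
        struct file_lock *fl;
        int found = 0;

        lock_flocks();                  /* hypothetical: locks.c's own lock */
        for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
                if (fl->fl_owner == owner) {
                        locks_copy_lock(dest, fl);
                        found = 1;
                        break;
                }
        }
        unlock_flocks();
        return found;
}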

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

Trond Myklebust

unread,
May 7, 2008, 6:11:49 PM5/7/08
to Matthew Wilcox, Linus Torvalds, Ingo Molnar, Andrew Morton, J. Bruce Fields, Zhang, Yanmin, LKML, Alexander Viro, linux-...@vger.kernel.org
On Wed, 2008-05-07 at 14:00 -0600, Matthew Wilcox wrote:
> On Wed, May 07, 2008 at 12:44:48PM -0700, Linus Torvalds wrote:
> > On Wed, 7 May 2008, Matthew Wilcox wrote:
> > >
> > > One patch I'd still like Yanmin to test is my one from yesterday which
> > > removes the BKL from fs/locks.c.
> >
> > And I'd personally rather have the network-fs people test and comment on
> > that one ;)
> >
> > I think that patch is worth looking at regardless, but the problems with
> > that one aren't about performance, but about what the implications are for
> > the filesystems (if any)...
>
> Oh, well, they don't seem interested.

Poor timing: we're all preparing for and travelling to the annual
Connectathon interoperability testing conference which starts tomorrow.

> I can comment on some of the problems though.
>
> fs/lockd/svcsubs.c, fs/nfs/delegation.c, fs/nfs/nfs4state.c,
> fs/nfsd/nfs4state.c all walk the i_flock list under the BKL. That won't
> protect them against locks.c any more. That's probably OK for fs/nfs/*
> since they'll be protected by their own data structures (Someone please
> check me on that?), but it's a bad idea for lockd/nfsd which are walking
> the lists for filesystems.

Yes. fs/nfs is just reusing the code in fs/locks.c in order to track the
locks it holds on the server. We could alternatively have coded a
private lock implementation, but this seemed easier.

> Are we going to have to export the file_lock_lock? I'd rather not. But
> we need to keep nfsd/lockd from tripping over locks.c.
>
> Maybe we could come up with a decent API that lockd could use? It all
> seems a bit complex at the moment ... maybe lockd should be keeping
> track of the locks it owns anyway (since surely the posix deadlock
> detection code can't work properly if it's just passing all the locks
> through).

I'm not sure what you mean when you talk about lockd keeping track of
the locks it owns. It has to keep those locks on inode->i_flock in order
to make them visible to the host filesystem...

All lockd really needs is the ability to find a lock it owns, and then
obtain a copy. As for the nfs client, I suspect we can make do with
something similar...
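
As a sketch of the kind of interface that would cover this (the name
and signature below are invented purely for illustration - this is not
an existing kernel API):

	/*
	 * Hypothetical helper for lockd/nfs: find the lock on
	 * inode->i_flock belonging to 'owner' and copy it into 'copy',
	 * under fs/locks.c's own locking.  Returns 0 on success or
	 * -ENOENT if no matching lock exists.
	 */
	int vfs_lock_find_copy(struct inode *inode, fl_owner_t owner,
			       struct file_lock *copy);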

Cheers
Trond

Zhang, Yanmin

unread,
May 7, 2008, 10:46:56 PM5/7/08
to Linus Torvalds, Andi Kleen, Matthew Wilcox, Ingo Molnar, LKML, Alexander Viro, Andrew Morton

On Wed, 2008-05-07 at 08:19 -0700, Linus Torvalds wrote:

>
> On Wed, 7 May 2008, Linus Torvalds wrote:
> >
> > But my preferred option would indeed be just turning it back into a
> > spinlock - and screw latency and BKL preemption - and having the RT people
> > who care deeply just work on removing the BKL in the long run.
>
> Here's a trial balloon patch to do that.
>
> Yanmin - this is not well tested, but the code is fairly obvious, and it
> would be interesting to hear if this fixes the performance regression.
> Because if it doesn't, then it's not the BKL, or something totally
> different is going on.
Congratulations! The patch really fixes the regression completely!
vmstat showed cpu idle is 0%, just like 2.6.25's.

Some config options in my .config file:

CONFIG_NR_CPUS=32
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y

yanmin

>
> Now, we should probably also test just converting the thing to a mutex,
> to see if that perhaps also fixes it.
>
> Linus
>
> ---
> arch/mn10300/Kconfig | 11 ----
> include/linux/hardirq.h | 18 ++++---
> kernel/sched.c | 27 ++---------
> lib/kernel_lock.c | 120 +++++++++++++++++++++++++++++++---------------
> 4 files changed, 95 insertions(+), 81 deletions(-)
>
> diff --git a/arch/mn10300/Kconfig b/arch/mn10300/Kconfig
> index 6a6409a..e856218 100644
> --- a/arch/mn10300/Kconfig
> +++ b/arch/mn10300/Kconfig
> @@ -186,17 +186,6 @@ config PREEMPT
> Say Y here if you are building a kernel for a desktop, embedded
> or real-time system. Say N if you are unsure.
>
> -config PREEMPT_BKL
> - bool "Preempt The Big Kernel Lock"
> - depends on PREEMPT
> - default y
> - help
> - This option reduces the latency of the kernel by making the
> - big kernel lock preemptible.
> -
> - Say Y here if you are building a kernel for a desktop system.
> - Say N if you are unsure.
> -
> config MN10300_CURRENT_IN_E2
> bool "Hold current task address in E2 register"
> default y
> diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
> index 897f723..181006c 100644
> --- a/include/linux/hardirq.h
> +++ b/include/linux/hardirq.h
> @@ -72,6 +72,14 @@
> #define in_softirq() (softirq_count())
> #define in_interrupt() (irq_count())
>
> +#if defined(CONFIG_PREEMPT)
> +# define PREEMPT_INATOMIC_BASE kernel_locked()
> +# define PREEMPT_CHECK_OFFSET 1
> +#else
> +# define PREEMPT_INATOMIC_BASE 0
> +# define PREEMPT_CHECK_OFFSET 0
> +#endif
> +
> /*
> * Are we running in atomic context? WARNING: this macro cannot
> * always detect atomic context; in particular, it cannot know about
> @@ -79,17 +87,11 @@
> * used in the general case to determine whether sleeping is possible.
> * Do not use in_atomic() in driver code.
> */
> -#define in_atomic() ((preempt_count() & ~PREEMPT_ACTIVE) != 0)
> -
> -#ifdef CONFIG_PREEMPT
> -# define PREEMPT_CHECK_OFFSET 1
> -#else
> -# define PREEMPT_CHECK_OFFSET 0
> -#endif
> +#define in_atomic() ((preempt_count() & ~PREEMPT_ACTIVE) != PREEMPT_INATOMIC_BASE)
>
> /*
> * Check whether we were atomic before we did preempt_disable():
> - * (used by the scheduler)
> + * (used by the scheduler, *after* releasing the kernel lock)
> */
> #define in_atomic_preempt_off() \
> ((preempt_count() & ~PREEMPT_ACTIVE) != PREEMPT_CHECK_OFFSET)
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 58fb8af..c51b656 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -4567,8 +4567,6 @@ EXPORT_SYMBOL(schedule);
> asmlinkage void __sched preempt_schedule(void)
> {
> struct thread_info *ti = current_thread_info();
> - struct task_struct *task = current;
> - int saved_lock_depth;
>
> /*
> * If there is a non-zero preempt_count or interrupts are disabled,
> @@ -4579,16 +4577,7 @@ asmlinkage void __sched preempt_schedule(void)
>
> do {
> add_preempt_count(PREEMPT_ACTIVE);
> -
> - /*
> - * We keep the big kernel semaphore locked, but we
> - * clear ->lock_depth so that schedule() doesnt
> - * auto-release the semaphore:
> - */
> - saved_lock_depth = task->lock_depth;
> - task->lock_depth = -1;
> schedule();
> - task->lock_depth = saved_lock_depth;
> sub_preempt_count(PREEMPT_ACTIVE);
>
> /*
> @@ -4609,26 +4598,15 @@ EXPORT_SYMBOL(preempt_schedule);
> asmlinkage void __sched preempt_schedule_irq(void)
> {
> struct thread_info *ti = current_thread_info();
> - struct task_struct *task = current;
> - int saved_lock_depth;
>
> /* Catch callers which need to be fixed */
> BUG_ON(ti->preempt_count || !irqs_disabled());
>
> do {
> add_preempt_count(PREEMPT_ACTIVE);
> -
> - /*
> - * We keep the big kernel semaphore locked, but we
> - * clear ->lock_depth so that schedule() doesnt
> - * auto-release the semaphore:
> - */
> - saved_lock_depth = task->lock_depth;
> - task->lock_depth = -1;
> local_irq_enable();
> schedule();
> local_irq_disable();
> - task->lock_depth = saved_lock_depth;
> sub_preempt_count(PREEMPT_ACTIVE);
>
> /*
> @@ -5853,8 +5831,11 @@ void __cpuinit init_idle(struct task_struct *idle, int cpu)
> spin_unlock_irqrestore(&rq->lock, flags);
>
> /* Set the preempt count _outside_ the spinlocks! */
> +#if defined(CONFIG_PREEMPT)
> + task_thread_info(idle)->preempt_count = (idle->lock_depth >= 0);
> +#else
> task_thread_info(idle)->preempt_count = 0;
> -
> +#endif
> /*
> * The idle tasks have their own, simple scheduling class:
> */
> diff --git a/lib/kernel_lock.c b/lib/kernel_lock.c
> index cd3e825..06722aa 100644
> --- a/lib/kernel_lock.c
> +++ b/lib/kernel_lock.c
> @@ -11,79 +11,121 @@
> #include <linux/semaphore.h>
>
> /*
> - * The 'big kernel semaphore'
> + * The 'big kernel lock'
> *
> - * This mutex is taken and released recursively by lock_kernel()
> + * This spinlock is taken and released recursively by lock_kernel()
> * and unlock_kernel(). It is transparently dropped and reacquired
> * over schedule(). It is used to protect legacy code that hasn't
> * been migrated to a proper locking design yet.
> *
> - * Note: code locked by this semaphore will only be serialized against
> - * other code using the same locking facility. The code guarantees that
> - * the task remains on the same CPU.
> - *


> * Don't use in new code.
> */

> -static DECLARE_MUTEX(kernel_sem);
> +static __cacheline_aligned_in_smp DEFINE_SPINLOCK(kernel_flag);
> +
>
> /*
> - * Re-acquire the kernel semaphore.
> + * Acquire/release the underlying lock from the scheduler.
> *
> - * This function is called with preemption off.
> + * This is called with preemption disabled, and should
> + * return an error value if it cannot get the lock and
> + * TIF_NEED_RESCHED gets set.
> *
> - * We are executing in schedule() so the code must be extremely careful
> - * about recursion, both due to the down() and due to the enabling of
> - * preemption. schedule() will re-check the preemption flag after
> - * reacquiring the semaphore.
> + * If it successfully gets the lock, it should increment
> + * the preemption count like any spinlock does.
> + *
> + * (This works on UP too - _raw_spin_trylock will never
> + * return false in that case)
> */
> int __lockfunc __reacquire_kernel_lock(void)
> {
> - struct task_struct *task = current;
> - int saved_lock_depth = task->lock_depth;
> -
> - BUG_ON(saved_lock_depth < 0);
> -
> - task->lock_depth = -1;
> - preempt_enable_no_resched();
> -
> - down(&kernel_sem);
> -
> + while (!_raw_spin_trylock(&kernel_flag)) {
> + if (test_thread_flag(TIF_NEED_RESCHED))
> + return -EAGAIN;
> + cpu_relax();
> + }
> preempt_disable();
> - task->lock_depth = saved_lock_depth;
> -
> return 0;
> }
>
> void __lockfunc __release_kernel_lock(void)
> {
> - up(&kernel_sem);
> + _raw_spin_unlock(&kernel_flag);
> + preempt_enable_no_resched();
> }
>
> /*
> - * Getting the big kernel semaphore.
> + * These are the BKL spinlocks - we try to be polite about preemption.
> + * If SMP is not on (ie UP preemption), this all goes away because the
> + * _raw_spin_trylock() will always succeed.
> */
> -void __lockfunc lock_kernel(void)
> +#ifdef CONFIG_PREEMPT
> +static inline void __lock_kernel(void)
> {
> - struct task_struct *task = current;
> - int depth = task->lock_depth + 1;
> + preempt_disable();
> + if (unlikely(!_raw_spin_trylock(&kernel_flag))) {
> + /*
> + * If preemption was disabled even before this
> + * was called, there's nothing we can be polite
> + * about - just spin.
> + */
> + if (preempt_count() > 1) {
> + _raw_spin_lock(&kernel_flag);
> + return;
> + }
>
> - if (likely(!depth))
> /*
> - * No recursion worries - we set up lock_depth _after_
> + * Otherwise, let's wait for the kernel lock
> + * with preemption enabled..
> */
> - down(&kernel_sem);
> + do {
> + preempt_enable();
> + while (spin_is_locked(&kernel_flag))
> + cpu_relax();
> + preempt_disable();
> + } while (!_raw_spin_trylock(&kernel_flag));
> + }
> +}
>
> - task->lock_depth = depth;
> +#else
> +
> +/*
> + * Non-preemption case - just get the spinlock
> + */
> +static inline void __lock_kernel(void)
> +{
> + _raw_spin_lock(&kernel_flag);
> }
> +#endif
>
> -void __lockfunc unlock_kernel(void)
> +static inline void __unlock_kernel(void)
> {
> - struct task_struct *task = current;
> + /*
> + * the BKL is not covered by lockdep, so we open-code the
> + * unlocking sequence (and thus avoid the dep-chain ops):
> + */
> + _raw_spin_unlock(&kernel_flag);
> + preempt_enable();
> +}
>
> - BUG_ON(task->lock_depth < 0);
> +/*
> + * Getting the big kernel lock.
> + *
> + * This cannot happen asynchronously, so we only need to
> + * worry about other CPU's.
> + */
> +void __lockfunc lock_kernel(void)
> +{
> + int depth = current->lock_depth+1;
> + if (likely(!depth))
> + __lock_kernel();
> + current->lock_depth = depth;
> +}
>
> - if (likely(--task->lock_depth < 0))
> - up(&kernel_sem);
> +void __lockfunc unlock_kernel(void)
> +{
> + BUG_ON(current->lock_depth < 0);
> + if (likely(--current->lock_depth < 0))
> + __unlock_kernel();
> }
>
> EXPORT_SYMBOL(lock_kernel);

Zhang, Yanmin

unread,
May 7, 2008, 11:26:43 PM5/7/08
to Matthew Wilcox, Ingo Molnar, J. Bruce Fields, LKML, Alexander Viro, Linus Torvalds, Andrew Morton, linux-...@vger.kernel.org

On Tue, 2008-05-06 at 10:23 -0600, Matthew Wilcox wrote:
> On Tue, May 06, 2008 at 06:09:34AM -0600, Matthew Wilcox wrote:
> > So the only likely things I can see are:
> >
> > - file locks
> > - fasync
>
> I've wanted to fix file locks for a while. Here's a first attempt.
> It was done quickly, so I concede that it may well have bugs in it.
> I found (and fixed) one with LTP.
>
> It takes *no account* of nfsd, nor remote filesystems. We need to have
> a serious discussion about their requirements.
I tested it on my 8-core stoakley. The aim7 result becomes 23% worse than
that of plain 2.6.26-rc1.

I replied to this email in case you have many patches and I might otherwise test
one you didn't expect.

-yanmin

>
> diff --git a/fs/locks.c b/fs/locks.c
> index 663c069..cb09765 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -140,6 +140,8 @@ int lease_break_time = 45;
> #define for_each_lock(inode, lockp) \
> for (lockp = &inode->i_flock; *lockp != NULL; lockp = &(*lockp)->fl_next)
>
> +static DEFINE_SPINLOCK(file_lock_lock);
> +
> static LIST_HEAD(file_lock_list);
> static LIST_HEAD(blocked_list);
>
> @@ -510,9 +512,9 @@ static void __locks_delete_block(struct file_lock *waiter)
> */
> static void locks_delete_block(struct file_lock *waiter)
> {
> - lock_kernel();
> + spin_lock(&file_lock_lock);
> __locks_delete_block(waiter);
> - unlock_kernel();
> + spin_unlock(&file_lock_lock);
> }
>
> /* Insert waiter into blocker's block list.
> @@ -649,7 +651,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl)
> {
> struct file_lock *cfl;
>
> - lock_kernel();
> + spin_lock(&file_lock_lock);
> for (cfl = filp->f_path.dentry->d_inode->i_flock; cfl; cfl = cfl->fl_next) {
> if (!IS_POSIX(cfl))
> continue;
> @@ -662,7 +664,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl)
> fl->fl_pid = pid_vnr(cfl->fl_nspid);
> } else
> fl->fl_type = F_UNLCK;
> - unlock_kernel();
> + spin_unlock(&file_lock_lock);
> return;
> }
> EXPORT_SYMBOL(posix_test_lock);
> @@ -735,18 +737,21 @@ static int flock_lock_file(struct file *filp, struct file_lock *request)
> int error = 0;
> int found = 0;
>
> - lock_kernel();
> - if (request->fl_flags & FL_ACCESS)
> + if (request->fl_flags & FL_ACCESS) {
> + spin_lock(&file_lock_lock);
> goto find_conflict;
> + }
>
> if (request->fl_type != F_UNLCK) {
> error = -ENOMEM;
> +
> new_fl = locks_alloc_lock();
> if (new_fl == NULL)
> - goto out;
> + goto out_unlocked;
> error = 0;
> }
>
> + spin_lock(&file_lock_lock);
> for_each_lock(inode, before) {
> struct file_lock *fl = *before;
> if (IS_POSIX(fl))
> @@ -772,10 +777,13 @@ static int flock_lock_file(struct file *filp, struct file_lock *request)
> * If a higher-priority process was blocked on the old file lock,
> * give it the opportunity to lock the file.
> */
> - if (found)
> + if (found) {
> + spin_unlock(&file_lock_lock);
> cond_resched();
> + spin_lock(&file_lock_lock);
> + }
>
> -find_conflict:
> + find_conflict:
> for_each_lock(inode, before) {
> struct file_lock *fl = *before;
> if (IS_POSIX(fl))
> @@ -796,8 +804,9 @@ find_conflict:
> new_fl = NULL;
> error = 0;
>
> -out:
> - unlock_kernel();
> + out:
> + spin_unlock(&file_lock_lock);
> + out_unlocked:
> if (new_fl)
> locks_free_lock(new_fl);
> return error;
> @@ -826,7 +835,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> new_fl2 = locks_alloc_lock();
> }
>
> - lock_kernel();
> + spin_lock(&file_lock_lock);
> if (request->fl_type != F_UNLCK) {
> for_each_lock(inode, before) {
> fl = *before;
> @@ -994,7 +1003,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> locks_wake_up_blocks(left);
> }
> out:
> - unlock_kernel();
> + spin_unlock(&file_lock_lock);
> /*
> * Free any unused locks.
> */
> @@ -1069,14 +1078,14 @@ int locks_mandatory_locked(struct inode *inode)
> /*
> * Search the lock list for this inode for any POSIX locks.
> */
> - lock_kernel();
> + spin_lock(&file_lock_lock);
> for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
> if (!IS_POSIX(fl))
> continue;
> if (fl->fl_owner != owner)
> break;
> }
> - unlock_kernel();
> + spin_unlock(&file_lock_lock);
> return fl ? -EAGAIN : 0;
> }
>
> @@ -1190,7 +1199,7 @@ int __break_lease(struct inode *inode, unsigned int mode)
>
> new_fl = lease_alloc(NULL, mode & FMODE_WRITE ? F_WRLCK : F_RDLCK);
>
> - lock_kernel();
> + spin_lock(&file_lock_lock);
>
> time_out_leases(inode);
>
> @@ -1251,8 +1260,10 @@ restart:
> break_time++;
> }
> locks_insert_block(flock, new_fl);
> + spin_unlock(&file_lock_lock);
> error = wait_event_interruptible_timeout(new_fl->fl_wait,
> !new_fl->fl_next, break_time);
> + spin_lock(&file_lock_lock);
> __locks_delete_block(new_fl);
> if (error >= 0) {
> if (error == 0)
> @@ -1266,8 +1277,8 @@ restart:
> error = 0;
> }
>
> -out:
> - unlock_kernel();
> + out:
> + spin_unlock(&file_lock_lock);
> if (!IS_ERR(new_fl))
> locks_free_lock(new_fl);
> return error;
> @@ -1323,7 +1334,7 @@ int fcntl_getlease(struct file *filp)
> struct file_lock *fl;
> int type = F_UNLCK;
>
> - lock_kernel();
> + spin_lock(&file_lock_lock);
> time_out_leases(filp->f_path.dentry->d_inode);
> for (fl = filp->f_path.dentry->d_inode->i_flock; fl && IS_LEASE(fl);
> fl = fl->fl_next) {
> @@ -1332,7 +1343,7 @@ int fcntl_getlease(struct file *filp)
> break;
> }
> }
> - unlock_kernel();
> + spin_unlock(&file_lock_lock);
> return type;
> }
>
> @@ -1363,6 +1374,7 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
> if (error)
> return error;
>
> + spin_lock(&file_lock_lock);
> time_out_leases(inode);
>
> BUG_ON(!(*flp)->fl_lmops->fl_break);
> @@ -1370,10 +1382,11 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
> lease = *flp;
>
> if (arg != F_UNLCK) {
> + spin_unlock(&file_lock_lock);
> error = -ENOMEM;
> new_fl = locks_alloc_lock();
> if (new_fl == NULL)
> - goto out;
> + goto out_unlocked;
>
> error = -EAGAIN;
> if ((arg == F_RDLCK) && (atomic_read(&inode->i_writecount) > 0))
> @@ -1382,6 +1395,7 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
> && ((atomic_read(&dentry->d_count) > 1)
> || (atomic_read(&inode->i_count) > 1)))
> goto out;
> + spin_lock(&file_lock_lock);
> }
>
> /*
> @@ -1429,11 +1443,14 @@ int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
>
> locks_copy_lock(new_fl, lease);
> locks_insert_lock(before, new_fl);
> + spin_unlock(&file_lock_lock);
>
> *flp = new_fl;
> return 0;
>
> -out:
> + out:
> + spin_unlock(&file_lock_lock);
> + out_unlocked:
> if (new_fl != NULL)
> locks_free_lock(new_fl);
> return error;
> @@ -1471,12 +1488,10 @@ int vfs_setlease(struct file *filp, long arg, struct file_lock **lease)
> {
> int error;
>
> - lock_kernel();
> if (filp->f_op && filp->f_op->setlease)
> error = filp->f_op->setlease(filp, arg, lease);
> else
> error = generic_setlease(filp, arg, lease);
> - unlock_kernel();
>
> return error;
> }
> @@ -1503,12 +1518,11 @@ int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
> if (error)
> return error;
>
> - lock_kernel();
> -
> error = vfs_setlease(filp, arg, &flp);
> if (error || arg == F_UNLCK)
> - goto out_unlock;
> + return error;
>
> + lock_kernel();
> error = fasync_helper(fd, filp, 1, &flp->fl_fasync);
> if (error < 0) {
> /* remove lease just inserted by setlease */
> @@ -1519,7 +1533,7 @@ int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
> }
>
> error = __f_setown(filp, task_pid(current), PIDTYPE_PID, 0);
> -out_unlock:
> + out_unlock:
> unlock_kernel();
> return error;
> }
> @@ -2024,7 +2038,7 @@ void locks_remove_flock(struct file *filp)
> fl.fl_ops->fl_release_private(&fl);
> }
>
> - lock_kernel();
> + spin_lock(&file_lock_lock);
> before = &inode->i_flock;
>
> while ((fl = *before) != NULL) {
> @@ -2042,7 +2056,7 @@ void locks_remove_flock(struct file *filp)
> }
> before = &fl->fl_next;
> }
> - unlock_kernel();
> + spin_unlock(&file_lock_lock);
> }
>
> /**
> @@ -2057,12 +2071,12 @@ posix_unblock_lock(struct file *filp, struct file_lock *waiter)
> {
> int status = 0;
>
> - lock_kernel();
> + spin_lock(&file_lock_lock);
> if (waiter->fl_next)
> __locks_delete_block(waiter);
> else
> status = -ENOENT;
> - unlock_kernel();
> + spin_unlock(&file_lock_lock);
> return status;
> }
>
> @@ -2175,7 +2189,7 @@ static int locks_show(struct seq_file *f, void *v)
>
> static void *locks_start(struct seq_file *f, loff_t *pos)
> {
> - lock_kernel();
> + spin_lock(&file_lock_lock);
> f->private = (void *)1;
> return seq_list_start(&file_lock_list, *pos);
> }
> @@ -2187,7 +2201,7 @@ static void *locks_next(struct seq_file *f, void *v, loff_t *pos)
>
> static void locks_stop(struct seq_file *f, void *v)
> {
> - unlock_kernel();
> + spin_unlock(&file_lock_lock);
> }
>
> struct seq_operations locks_seq_operations = {
> @@ -2215,7 +2229,7 @@ int lock_may_read(struct inode *inode, loff_t start, unsigned long len)
> {
> struct file_lock *fl;
> int result = 1;
> - lock_kernel();
> + spin_lock(&file_lock_lock);
> for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
> if (IS_POSIX(fl)) {
> if (fl->fl_type == F_RDLCK)
> @@ -2232,7 +2246,7 @@ int lock_may_read(struct inode *inode, loff_t start, unsigned long len)
> result = 0;
> break;
> }
> - unlock_kernel();
> + spin_unlock(&file_lock_lock);
> return result;
> }
>
> @@ -2255,7 +2269,7 @@ int lock_may_write(struct inode *inode, loff_t start, unsigned long len)
> {
> struct file_lock *fl;
> int result = 1;
> - lock_kernel();
> + spin_lock(&file_lock_lock);
> for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
> if (IS_POSIX(fl)) {
> if ((fl->fl_end < start) || (fl->fl_start > (start + len)))
> @@ -2270,7 +2284,7 @@ int lock_may_write(struct inode *inode, loff_t start, unsigned long len)
> result = 0;
> break;
> }
> - unlock_kernel();
> + spin_unlock(&file_lock_lock);
> return result;

Linus Torvalds

unread,
May 7, 2008, 11:30:23 PM5/7/08
to Zhang, Yanmin, Andi Kleen, Matthew Wilcox, Ingo Molnar, LKML, Alexander Viro, Andrew Morton

On Thu, 8 May 2008, Zhang, Yanmin wrote:
>
> Congratulations! The patch really fixes the regression completely!
> vmstat showed cpu idle is 0%, just like 2.6.25's.

Well, that shows that it was the BKL.

That said, "idle 0%" is easy when you spin. Do you also have actual
performance numbers? I'd hope that not only do we use full CPU time, it's
also at least as fast as the old semaphores were?

While I've been dissing sleeping locks (because their overhead is so
high), at least in _theory_ they can get better behavior when not
spinning. Now, that's not going to happen with the BKL, I'm 99.99% sure,
but I'd still like to hear actual performance numbers too, just to be
sure.

Anyway, at least the "was it the BKL or some other semaphore user"
question is entirely off the table.

So we need to

- fix the BKL. My patch may be a good starting point, but there are
alternatives:

(a) reinstate the old BKL code entirely

Quite frankly, I'd prefer not to. Not only did it have three
totally different cases, some of them were apparently broken (ie
BKL+regular preempt didn't cond_resched() right), and I just don't
think it's worth maintaining three different versions, when
distro's are going to pick one anyway. *We* should pick one, and
maintain it.

(b) screw the special BKL preemption - it's a spinlock, we don't
preempt other spinlocks, but at least fix BKL+preempt+cond_resched
thing.

This would be "my patch + fixes" where at least one of the fixes
is the known (apparently old) cond_preempt() bug.

(c) Try to keep the 2.6.25 code as closely as possible, but just
switch over to mutexes instead.

I dunno. I was never all that enamoured with the BKL as a sleeping
lock, so I'm biased against this one, but hey, it's just a
personal bias.

- get rid of the BKL anyway, at least in anything that is noticeable.

Matthew's patch to file locking is probably worth doing as-is,
simply because I haven't heard any better solutions. The BKL
certainly can't be it, and whatever comes out of the NFSD
discussion will almost certainly involve just making sure that
those leases just use the new fs/locks.c lock.

This is also why I'd actually prefer the simplest possible
(non-preempting) spinlock BKL. Because it means that we can get
rid of all that "saved_lock_depth" crud (like my patch already
did). We shouldn't aim for a clever BKL, we should aim for a BKL
that nobody uses.

I'm certainly open to anything. Regardless, we should decide fairly soon,
so that we have the choice made before -rc2 is out, and not drag this out,
since regardless of the choice it needs to be tested and people comfy with
it for the 2.6.26 release.

Linus

Linus Torvalds

unread,
May 7, 2008, 11:35:35 PM5/7/08
to Zhang, Yanmin, Matthew Wilcox, Ingo Molnar, J. Bruce Fields, LKML, Alexander Viro, Andrew Morton, linux-...@vger.kernel.org

On Thu, 8 May 2008, Zhang, Yanmin wrote:
>

> On Tue, 2008-05-06 at 10:23 -0600, Matthew Wilcox wrote:
> > On Tue, May 06, 2008 at 06:09:34AM -0600, Matthew Wilcox wrote:
> > > So the only likely things I can see are:
> > >
> > > - file locks
> > > - fasync
> >
> > I've wanted to fix file locks for a while. Here's a first attempt.
> > It was done quickly, so I concede that it may well have bugs in it.
> > I found (and fixed) one with LTP.
> >
> > It takes *no account* of nfsd, nor remote filesystems. We need to have
> > a serious discussion about their requirements.
>
> I tested it on my 8-core stoakley. The aim7 result becomes 23% worse than
> that of plain 2.6.26-rc1.

Ouch. That's really odd. The BKL->spinlock conversion looks really
obvious, so it shouldn't be that noticeably slower.

The *one* difference is that the BKL has the whole "you can take it
recursively and you can sleep without dropping it because the scheduler
will drop it for you" thing. The spinlock conversion changed all of that
into explicit "drop and retake" locks, and maybe that causes some issues.

But 23% worse? That sounds really odd/extreme.

Can you do a oprofile run or something?

Linus

Zhang, Yanmin

unread,
May 8, 2008, 12:10:29 AM5/8/08
to Linus Torvalds, Andi Kleen, Matthew Wilcox, Ingo Molnar, LKML, Alexander Viro, Andrew Morton

On Wed, 2008-05-07 at 20:29 -0700, Linus Torvalds wrote:
>
> On Thu, 8 May 2008, Zhang, Yanmin wrote:
> >
> > Congratulations! The patch really fixes the regression completely!
> > vmstat showed cpu idle is 0%, just like 2.6.25's.
>
> Well, that shows that it was the BKL.
>
> That said, "idle 0%" is easy when you spin. Do you also have actual
> performance numbers?
Yes. My conclusion is based on the actual numbers; 0% cpu idle is just
the behavior it should show.

> I'd hope that not only do we use full CPU time, it's
> also at least as fast as the old semaphores were?

Yes.

>
> While I've been dissing sleeping locks (because their overhead is so
> high), at least in _theory_ they can get better behavior when not
> spinning. Now, that's not going to happen with the BKL, I'm 99.99% sure,
> but I'd still like to hear actual performance numbers too, just to be
> sure.

For sure.

Linus Torvalds

unread,
May 8, 2008, 12:17:55 AM5/8/08
to Zhang, Yanmin, Andi Kleen, Matthew Wilcox, Ingo Molnar, LKML, Alexander Viro, Andrew Morton

On Thu, 8 May 2008, Zhang, Yanmin wrote:
>
> On Wed, 2008-05-07 at 20:29 -0700, Linus Torvalds wrote:
> >
> > That said, "idle 0%" is easy when you spin. Do you also have actual
> > performance numbers?
>
> Yes. My conclusion is based on the actual numbers; 0% cpu idle is just
> the behavior it should show.

Thanks, that's all I wanted to verify.

I'll leave this overnight, and see if somebody has come up with some smart
and wonderful patch. And if not, I think I'll apply mine as "known to fix
a regression", and we can perhaps then improve on things further from
there.

Zhang, Yanmin

unread,
May 8, 2008, 12:38:50 AM5/8/08
to Linus Torvalds, Matthew Wilcox, Ingo Molnar, J. Bruce Fields, LKML, Alexander Viro, Andrew Morton, linux-...@vger.kernel.org

On Wed, 2008-05-07 at 20:34 -0700, Linus Torvalds wrote:
>
> On Thu, 8 May 2008, Zhang, Yanmin wrote:
> >
> > On Tue, 2008-05-06 at 10:23 -0600, Matthew Wilcox wrote:
> > > On Tue, May 06, 2008 at 06:09:34AM -0600, Matthew Wilcox wrote:
> > > > So the only likely things I can see are:
> > > >
> > > > - file locks
> > > > - fasync
> > >
> > > I've wanted to fix file locks for a while. Here's a first attempt.
> > > It was done quickly, so I concede that it may well have bugs in it.
> > > I found (and fixed) one with LTP.
> > >
> > > It takes *no account* of nfsd, nor remote filesystems. We need to have
> > > a serious discussion about their requirements.
> >
> > I tested it on my 8-core stoakley. The aim7 result becomes 23% worse than
> > that of plain 2.6.26-rc1.
>
> Ouch. That's really odd. The BKL->spinlock conversion looks really
> obvious, so it shouldn't be that noticeably slower.
>
> The *one* difference is that the BKL has the whole "you can take it
> recursively and you can sleep without dropping it because the scheduler
> will drop it for you" thing. The spinlock conversion changed all of that
> into explicit "drop and retake" locks, and maybe that causes some issues.
>
> But 23% worse? That sounds really odd/extreme.
>
> Can you do a oprofile run or something?
I collected oprofile data. It doesn't look useful, as cpu idle is more than 50%.

samples % app name symbol name
270157 9.4450 multitask add_long
266419 9.3143 multitask add_int
238934 8.3534 multitask add_double
187184 6.5442 multitask mul_double
159448 5.5745 multitask add_float
156312 5.4649 multitask sieve
148081 5.1771 multitask mul_float
127192 4.4468 multitask add_short
80480 2.8137 multitask string_rtns_1
57520 2.0110 vmlinux clear_page_c
53935 1.8856 multitask div_long
48753 1.7045 libc-2.6.90.so strncat
40825 1.4273 multitask array_rtns
32807 1.1470 vmlinux __copy_user_nocache
31995 1.1186 multitask div_int
31143 1.0888 multitask div_float
28821 1.0076 multitask div_double
26400 0.9230 vmlinux find_lock_page
26159 0.9146 vmlinux unmap_vmas
25249 0.8827 multitask div_short
21509 0.7520 vmlinux native_read_tsc
18865 0.6595 vmlinux copy_user_generic_string
17993 0.6291 vmlinux copy_page_c
16367 0.5722 vmlinux system_call
14616 0.5110 libc-2.6.90.so msort_with_tmp
13630 0.4765 vmlinux native_sched_clock
12952 0.4528 vmlinux copy_page_range
12817 0.4481 libc-2.6.90.so strcat
12708 0.4443 vmlinux calc_delta_mine
12611 0.4409 libc-2.6.90.so memset
11631 0.4066 bash (no symbols)
9991 0.3493 vmlinux update_curr
9328 0.3261 vmlinux unlock_page

Ingo Molnar

unread,
May 8, 2008, 2:44:49 AM5/8/08
to Zhang, Yanmin, Linus Torvalds, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton

* Zhang, Yanmin <yanmin...@linux.intel.com> wrote:

> > Here's a trial balloon patch to do that.
> >
> > Yanmin - this is not well tested, but the code is fairly obvious,
> > and it would be interesting to hear if this fixes the performance
> > regression. Because if it doesn't, then it's not the BKL, or
> > something totally different is going on.
>
> Congratulations! The patch really fixes the regression completely!
> vmstat showed cpu idle is 0%, just like 2.6.25's.

great! Yanmin, could you please also check the other patch i sent (also
attached below), does it solve the regression similarly?

Ingo

---
lib/kernel_lock.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

Index: linux/lib/kernel_lock.c
===================================================================
--- linux.orig/lib/kernel_lock.c
+++ linux/lib/kernel_lock.c
@@ -46,7 +46,8 @@ int __lockfunc __reacquire_kernel_lock(v
task->lock_depth = -1;
preempt_enable_no_resched();

- down(&kernel_sem);
+ while (down_trylock(&kernel_sem))
+ cpu_relax();



preempt_disable();
task->lock_depth = saved_lock_depth;

@@ -67,11 +68,13 @@ void __lockfunc lock_kernel(void)


struct task_struct *task = current;

int depth = task->lock_depth + 1;

- if (likely(!depth))
+ if (likely(!depth)) {
/*

* No recursion worries - we set up lock_depth _after_

*/
- down(&kernel_sem);
+ while (down_trylock(&kernel_sem))
+ cpu_relax();


+ }

task->lock_depth = depth;
}

Andrew Morton

unread,
May 8, 2008, 2:49:14 AM5/8/08
to Ingo Molnar, Zhang, Yanmin, Linus Torvalds, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro
On Thu, 8 May 2008 08:43:40 +0200 Ingo Molnar <mi...@elte.hu> wrote:

> great! Yanmin, could you please also check the other patch i sent (also
> attached below), does it solve the regression similarly?
>
> Ingo
>
> ---
> lib/kernel_lock.c | 9 ++++++---

but but but. Some other users of down() have presumably also regressed. We just
have't found the workload to demonstrate that yet.

Zhang, Yanmin

unread,
May 8, 2008, 3:16:37 AM5/8/08
to Ingo Molnar, Linus Torvalds, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton

On Thu, 2008-05-08 at 08:43 +0200, Ingo Molnar wrote:
> * Zhang, Yanmin <yanmin...@linux.intel.com> wrote:
>
> > > Here's a trial balloon patch to do that.
> > >
> > > Yanmin - this is not well tested, but the code is fairly obvious,
> > > and it would be interesting to hear if this fixes the performance
> > > regression. Because if it doesn't, then it's not the BKL, or
> > > something totally different is going on.
> >
> > Congratulations! The patch really fixes the regression completely!
> > vmstat showed cpu idle is 0%, just like 2.6.25's.
>
> great! Yanmin, could you please also check the other patch i sent (also
> attached below), does it solve the regression similarly?
With your patch, the aim7 regression becomes less than 2%. I ran the test twice.

Linus' patch recovers it completely. As the aim7 result is quite stable (usually
fluctuating less than 1%), 1.5%~2% is still a bit big.

Ingo Molnar

unread,
May 8, 2008, 3:40:01 AM5/8/08
to Zhang, Yanmin, Linus Torvalds, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton

* Zhang, Yanmin <yanmin...@linux.intel.com> wrote:

> > great! Yanmin, could you please also check the other patch i sent
> > (also attached below), does it solve the regression similarly?
>
> With your patch, the aim7 regression becomes less than 2%. I ran the
> test twice.
>
> Linus' patch recovers it completely. As the aim7 result is quite
> stable (usually fluctuating less than 1%), 1.5%~2% is still a bit big.

is this the old original aim7 you are running, or osdl-aim-7 or
re-aim-7?

if it's aim7 then this is a workload that starts+stops 2000 parallel
tasks that each start and exit at the same time. That might explain its
sensitivity on the BKL - this is all about tty-controlled task startup
and exit.

i could not get it to produce anywhere close to stable results though. I
also frequently get into this problem:

AIM Multiuser Benchmark - Suite VII Run Beginning
Tasks jobs/min jti jobs/min/task real cpu
2000
Failed to execute
new_raph 200
Unable to solve equation in 100 tries. P = 1.5708, P0 = 1.5708, delta = 6.12574e-17

Failed to execute
disk_cp /mnt/shm
disk_cp (1): cannot open /mnt/shm/tmpa.common
disk1.c: No such file or directory

[.. etc. a large stream of them .. ]

system has 2GB of RAM and tmpfs mounted to the place where aim7 puts its
work files.

Ingo

Zhang, Yanmin

unread,
May 8, 2008, 4:46:19 AM5/8/08
to Ingo Molnar, Linus Torvalds, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton

On Thu, 2008-05-08 at 09:39 +0200, Ingo Molnar wrote:
> * Zhang, Yanmin <yanmin...@linux.intel.com> wrote:
>
> > > great! Yanmin, could you please also check the other patch i sent
> > > (also attached below), does it solve the regression similarly?
> >
> > With your patch, aim7 regression becomes less than 2%. I ran the
> > testing twice.
> >
> > Linus' patch could recover it completely. As aim7 result is quite
> > stable(usually fluctuating less than 1%), 1.5%~2% is a little big.
>
> is this the old original aim7 you are running,
I use the old AIM7 plus a small patch which just changes a couple of data types to match
64bit.

> or osdl-aim-7 or
> re-aim-7?
>
> if it's aim7 then this is a workload that starts+stops 2000 parallel
> tasks that each start and exit at the same time.

Yes.

> That might explain its
> sensitivity on the BKL - this is all about tty-controlled task startup
> and exit.
>
> i could not get it to produce anywhere close to stable results though. I
> also frequently get into this problem:
>
> AIM Multiuser Benchmark - Suite VII Run Beginning
> Tasks jobs/min jti jobs/min/task real cpu
> 2000
> Failed to execute
> new_raph 200
> Unable to solve equation in 100 tries. P = 1.5708, P0 = 1.5708, delta = 6.12574e-17
>
> Failed to execute
> disk_cp /mnt/shm
> disk_cp (1): cannot open /mnt/shm/tmpa.common
> disk1.c: No such file or directory
>
> [.. etc. a large stream of them .. ]
>
> system has 2GB of RAM and tmpfs mounted to the place where aim7 puts its
> work files.

My machine has 8GB. To simulate your environment, I reserved 6GB for hugetlb, then reran the testing
and didn't see any failure except:


AIM Multiuser Benchmark - Suite VII Run Beginning

Tasks jobs/min jti jobs/min/task real cpu

2000create_shared_memory(): can't create semaphore, pausing...
create_shared_memory(): can't create semaphore, pausing...


The info above doesn't indicate errors.

Perhaps you could:
1) Apply the attached aim9 patch;
2) check that you have write permission under /mnt/shm;
3) echo "/mnt/shm">aim7_path/config;

aim9.diff

Ingo Molnar

unread,
May 8, 2008, 5:21:34 AM5/8/08
to Zhang, Yanmin, Linus Torvalds, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton

* Zhang, Yanmin <yanmin...@linux.intel.com> wrote:

> > disk_cp /mnt/shm
> > disk_cp (1): cannot open /mnt/shm/tmpa.common
> > disk1.c: No such file or directory
> >
> > [.. etc. a large stream of them .. ]
> >
> > system has 2GB of RAM and tmpfs mounted to the place where aim7 puts its
> > work files.

> My machine has 8GB. To simulate your environment, I reserved 6GB for
> hugetlb, then reran the testing and didn't see any failure except: AIM
> Multiuser Benchmark - Suite VII Run Beginning
>
> Tasks jobs/min jti jobs/min/task real cpu
> 2000create_shared_memory(): can't create semaphore, pausing...
> create_shared_memory(): can't create semaphore, pausing...

that failure message you got worries me - it indicates that your test
ran out of IPC semaphores. You can fix it via upping the semaphore
limits via:

echo "500 32000 128 512" > /proc/sys/kernel/sem

could you check that you still get similar results with this limit
fixed?

note that once i've fixed the semaphore limits it started running fine
here. And i see zero idle time during the run on a quad core box.

here are my numbers:

# on v2.6.26-rc1-166-gc0a1811

Tasks Jobs/Min JTI Real CPU Jobs/sec/task
2000 55851.4 93 208.4 793.6 0.4654 # BKL: sleep
2000 55402.2 79 210.1 800.1 0.4617

2000 55728.4 93 208.9 795.5 0.4644 # BKL: spin
2000 55787.2 93 208.7 794.5 0.4649 #

so the results are the same within noise.

I'll also check this workload on an 8-way box to make sure it's OK on
larger CPU counts too.

could you double-check your test?

plus a tty tidbit as well, during the test i saw a few of these:

Warning: dev (tty1) tty->count(639) != #fd's(638) in release_dev
Warning: dev (tty1) tty->count(462) != #fd's(463) in release_dev
Warning: dev (tty1) tty->count(274) != #fd's(275) in release_dev
Warning: dev (tty1) tty->count(4) != #fd's(3) in release_dev
Warning: dev (tty1) tty->count(164) != #fd's(163) in release_dev

Ingo Molnar

unread,
May 8, 2008, 5:29:57 AM5/8/08
to Zhang, Yanmin, Linus Torvalds, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton

* Ingo Molnar <mi...@elte.hu> wrote:

> plus a tty tidbit as well, during the test i saw a few of these:
>
> Warning: dev (tty1) tty->count(639) != #fd's(638) in release_dev

false alarm there - these were due to the breakage in the hack-patch i
used ...

Zhang, Yanmin

unread,
May 8, 2008, 5:33:24 AM5/8/08
to Ingo Molnar, Linus Torvalds, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton

On Thu, 2008-05-08 at 11:21 +0200, Ingo Molnar wrote:
> * Zhang, Yanmin <yanmin...@linux.intel.com> wrote:
>
> > > disk_cp /mnt/shm
> > > disk_cp (1): cannot open /mnt/shm/tmpa.common
> > > disk1.c: No such file or directory
> > >
> > > [.. etc. a large stream of them .. ]
> > >
> > > system has 2GB of RAM and tmpfs mounted to the place where aim7 puts its
> > > work files.
>
> > My machine has 8GB. To simulate your environment, I reserve 6GB for
> > hugetlb, then reran the testing and didn't see any failure except: AIM
> > Multiuser Benchmark - Suite VII Run Beginning
> >
> > Tasks jobs/min jti jobs/min/task real cpu
> > 2000create_shared_memory(): can't create semaphore, pausing...
> > create_shared_memory(): can't create semaphore, pausing...
>
> that failure message you got worries me - it indicates that your test
> ran out of IPC semaphores. You can fix it via upping the semaphore
> limits via:
>
> echo "500 32000 128 512" > /proc/sys/kernel/sem
A quick test showed it does work.

Thanks. I need to take the shuttle bus, or I'd have to walk home for 2 hours if I miss it. :)

Ingo Molnar

unread,
May 8, 2008, 8:02:18 AM5/8/08
to Linus Torvalds, Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin

* Linus Torvalds <torv...@linux-foundation.org> wrote:

> > > That said, "idle 0%" is easy when you spin. Do you also have
> > > actual performance numbers?
> >
> > Yes. My conclusion is based on the actual number. cpu idle 0% is
> > just a behavior it should be.
>
> Thanks, that's all I wanted to verify.
>
> I'll leave this overnight, and see if somebody has come up with some
> smart and wonderful patch. And if not, I think I'll apply mine as
> "known to fix a regression", and we can perhaps then improve on things
> further from there.

hey, i happen to have such a smart and wonderful patch =B-)

i reproduced the AIM7 workload and can confirm Yanmin's findings that
2.6.26-rc1 regresses over 2.6.25 - by over 67% here.

Looking at the workload i found and fixed what i believe to be the real
bug causing the AIM7 regression: it was inefficient wakeup / scheduling
/ locking behavior of the new generic semaphore code, causing suboptimal
performance.

The problem comes from the following code. The new semaphore code does
this on down():

spin_lock_irqsave(&sem->lock, flags);
if (likely(sem->count > 0))
sem->count--;
else
__down(sem);
spin_unlock_irqrestore(&sem->lock, flags);

and this on up():

spin_lock_irqsave(&sem->lock, flags);
if (likely(list_empty(&sem->wait_list)))
sem->count++;
else
__up(sem);
spin_unlock_irqrestore(&sem->lock, flags);

where __up() does:

list_del(&waiter->list);
waiter->up = 1;
wake_up_process(waiter->task);

and where __down() does this in essence:

list_add_tail(&waiter.list, &sem->wait_list);
waiter.task = task;
waiter.up = 0;
for (;;) {
[...]
spin_unlock_irq(&sem->lock);
timeout = schedule_timeout(timeout);
spin_lock_irq(&sem->lock);
if (waiter.up)
return 0;
}

the fastpath looks good and obvious, but note the following property of
the contended path: if there's a task on the ->wait_list, the up() of
the current owner will "pass over" ownership to that waiting task, in a
wake-one manner, via the waiter->up flag and by removing the waiter from
the wait list.

That is all and fine in principle, but as implemented in
kernel/semaphore.c it also creates a nasty, hidden source of contention!

The contention comes from the following property of the new semaphore
code: the new owner owns the semaphore exclusively, even if it is not
running yet.

So if the old owner, even if just a few instructions later, does a
down() [lock_kernel()] again, it will be blocked and will have to wait
on the new owner to eventually be scheduled (possibly on another CPU)!
Or if another task gets to lock_kernel() sooner than the "new
owner" is scheduled, it will be blocked unnecessarily and for a very long
time when there are 2000 tasks running.
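
To make that concrete, the interleaving being described looks roughly
like this (a hand-drawn sketch of the scenario, assuming two CPUs - not
a trace from an actual run):

	/*
	 *  CPU0 (old owner)                CPU1 (waiting task W)
	 *  ----------------                ---------------------
	 *  up():
	 *    __up():
	 *      list_del(&W->list);
	 *      W->up = 1;
	 *      wake_up_process(W);         W is runnable, but not
	 *                                  running yet
	 *  ...a few instructions later...
	 *  down():
	 *    count is still 0, so __down()
	 *    puts CPU0's task to sleep     everybody now waits for the
	 *                                  scheduler to actually run W
	 */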

I.e. the implementation of the new semaphores code does wake-one and
lock ownership in a very restrictive way - it does not allow
opportunistic re-locking of the lock at all and keeps the scheduler from
picking task order intelligently.

This kind of scheduling, with 2000 AIM7 processes running, creates awful
cross-scheduling between those 2000 tasks, causes reduced parallelism, a
throttled runqueue length and a lot of idle time. With increasing number
of CPUs it causes an exponentially worse behavior in AIM7, as the chance
for a newly woken new-owner task to actually run anytime soon is less
and less likely.

Note that it takes just a tiny bit of contention for the 'new-semaphore
catastrophe' to happen: the wakeup latencies get added to whatever small
contention there is, and quickly snowball out of control!

I believe Yanmin's findings and numbers support this analysis too.

The best fix for this problem is to use the same scheduling logic that
the kernel/mutex.c code uses: keep the wake-one behavior (that is OK and
wanted because we do not want to over-schedule), but also allow
opportunistic locking of the lock even if a wakee is already "in
flight".

The patch below implements this new logic. With this patch applied the
AIM7 regression is largely fixed on my quad testbox:

# v2.6.25 vanilla:
..................


Tasks Jobs/Min JTI Real CPU Jobs/sec/task

2000 56096.4 91 207.5 789.7 0.4675
2000 55894.4 94 208.2 792.7 0.4658

# v2.6.26-rc1-166-gc0a1811 vanilla:
...................................


Tasks Jobs/Min JTI Real CPU Jobs/sec/task

2000 33230.6 83 350.3 784.5 0.2769
2000 31778.1 86 366.3 783.6 0.2648

# v2.6.26-rc1-166-gc0a1811 + semaphore-speedup:
...............................................


Tasks Jobs/Min JTI Real CPU Jobs/sec/task

2000 55707.1 92 209.0 795.6 0.4642
2000 55704.4 96 209.0 796.0 0.4642

i.e. a 67% speedup. We are now back to within 1% of the v2.6.25
performance levels and have zero idle time during the test, as expected.

Btw., interactivity also improved dramatically with the fix - for
example console-switching became almost instantaneous during this
workload (which after all is running 2000 tasks at once!), without the
patch it was stuck for a minute at times.

I also ran Linus's spinlock-BKL patch as well:

# v2.6.26-rc1-166-gc0a1811 + Linus-BKL-spinlock-patch:
......................................................


Tasks Jobs/Min JTI Real CPU Jobs/sec/task

2000 55889.0 92 208.3 793.3 0.4657
2000 55891.7 96 208.3 793.3 0.4658

it is about 0.3% faster - but note that that is within the general noise
levels of this test. I'd expect Linus's spinlock-BKL patch to give some
small speedup because the BKL acquire times are short and 2000 tasks
running all at once really increases the context-switching cost and most
BKL contentions are within the cost of context-switching.

But i believe the better solution for that is to remove BKL use from all
hotpaths, not to hide some of its costs by reintroducing it as a
spinlock. Reintroducing the spinlock based BKL would have other
disadvantages as well: it could reintroduce per-CPU-ness assumptions in
BKL-using code and other complications as well. It's also not a very
realistic workload - with 2000 tasks running the system was barely
serviceable.

I'd much rather make BKL costs more apparent and more visible - but 50%
regression was of course too much. But 0.3% for a 2000-tasks workload,
which is near the noise level ... is acceptable i think - especially as
this discussion has now reinvigorated the remove-the-BKL discussions and
patches.

Linus, we can do your spinlock-BKL patch too if you feel strongly about
it, but i'd rather not - we fought so hard for the preemptible BKL :-)

The spinlock-based-BKL patch only worked around the real problem i
believe, because it eliminated the use of the suboptimal new semaphore
code: with spinlocks there's no scheduling at all, so the wakeup/locking
bug of the new semaphore code did not apply. It was not about any
fastpath overhead AFAICS. [we'd have seen that with the
CONFIG_PREEMPT_BKL=y code as well, which has been the default setting
since v2.6.8.]

There's another nice side-effect of this speedup patch, the new generic
semaphore code got even smaller:

text data bss dec hex filename
1241 0 0 1241 4d9 semaphore.o.before
1207 0 0 1207 4b7 semaphore.o.after

(because the waiter.up complication got removed.)

Longer-term we should look into using the mutex code for the generic
semaphore code as well - but i's not easy due to legacies and it's
outside of the scope of v2.6.26 and outside the scope of this patch as
well.

Hm?

Ingo

Signed-off-by: Ingo Molnar <mi...@elte.hu>
---
kernel/semaphore.c | 36 ++++++++++++++++--------------------
1 file changed, 16 insertions(+), 20 deletions(-)

Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c
@@ -54,10 +54,9 @@ void down(struct semaphore *sem)
unsigned long flags;

spin_lock_irqsave(&sem->lock, flags);
- if (likely(sem->count > 0))
- sem->count--;
- else
+ if (unlikely(sem->count <= 0))
__down(sem);
+ sem->count--;
spin_unlock_irqrestore(&sem->lock, flags);
}
EXPORT_SYMBOL(down);
@@ -77,10 +76,10 @@ int down_interruptible(struct semaphore
int result = 0;

spin_lock_irqsave(&sem->lock, flags);
- if (likely(sem->count > 0))
- sem->count--;
- else
+ if (unlikely(sem->count <= 0))
result = __down_interruptible(sem);
+ if (!result)
+ sem->count--;
spin_unlock_irqrestore(&sem->lock, flags);

return result;
@@ -103,10 +102,10 @@ int down_killable(struct semaphore *sem)
int result = 0;

spin_lock_irqsave(&sem->lock, flags);
- if (likely(sem->count > 0))
- sem->count--;
- else
+ if (unlikely(sem->count <= 0))
result = __down_killable(sem);
+ if (!result)
+ sem->count--;
spin_unlock_irqrestore(&sem->lock, flags);

return result;
@@ -157,10 +156,10 @@ int down_timeout(struct semaphore *sem,
int result = 0;

spin_lock_irqsave(&sem->lock, flags);
- if (likely(sem->count > 0))
- sem->count--;
- else
+ if (unlikely(sem->count <= 0))
result = __down_timeout(sem, jiffies);
+ if (!result)
+ sem->count--;
spin_unlock_irqrestore(&sem->lock, flags);

return result;
@@ -179,9 +178,8 @@ void up(struct semaphore *sem)
unsigned long flags;

spin_lock_irqsave(&sem->lock, flags);
- if (likely(list_empty(&sem->wait_list)))
- sem->count++;
- else
+ sem->count++;
+ if (unlikely(!list_empty(&sem->wait_list)))
__up(sem);
spin_unlock_irqrestore(&sem->lock, flags);
}
@@ -192,7 +190,6 @@ EXPORT_SYMBOL(up);
struct semaphore_waiter {
struct list_head list;
struct task_struct *task;
- int up;
};

/*
@@ -206,11 +203,11 @@ static inline int __sched __down_common(


struct task_struct *task = current;

struct semaphore_waiter waiter;

- list_add_tail(&waiter.list, &sem->wait_list);
waiter.task = task;
- waiter.up = 0;

for (;;) {
+ list_add_tail(&waiter.list, &sem->wait_list);
+
if (state == TASK_INTERRUPTIBLE && signal_pending(task))
goto interrupted;
if (state == TASK_KILLABLE && fatal_signal_pending(task))
@@ -221,7 +218,7 @@ static inline int __sched __down_common(
spin_unlock_irq(&sem->lock);
timeout = schedule_timeout(timeout);
spin_lock_irq(&sem->lock);
- if (waiter.up)
+ if (sem->count > 0)
return 0;
}

@@ -259,6 +256,5 @@ static noinline void __sched __up(struct
struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
struct semaphore_waiter, list);
list_del(&waiter->list);
- waiter->up = 1;
wake_up_process(waiter->task);
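
For reference, the down()/up() fast paths with this patch applied end up
looking roughly like this (reconstructed by applying the hunks above to
the code quoted earlier in this mail - not copied verbatim from a tree):

	void down(struct semaphore *sem)
	{
		unsigned long flags;

		spin_lock_irqsave(&sem->lock, flags);
		if (unlikely(sem->count <= 0))
			__down(sem);
		sem->count--;
		spin_unlock_irqrestore(&sem->lock, flags);
	}

	void up(struct semaphore *sem)
	{
		unsigned long flags;

		spin_lock_irqsave(&sem->lock, flags);
		sem->count++;
		if (unlikely(!list_empty(&sem->wait_list)))
			__up(sem);
		spin_unlock_irqrestore(&sem->lock, flags);
	}

I.e. up() now always increments the count, and a woken waiter merely
re-checks the count - so a task that calls down() before the wakee gets
to run can take the semaphore opportunistically instead of blocking.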

Ingo Molnar

unread,
May 8, 2008, 8:28:44 AM5/8/08
to Linus Torvalds, Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin

* Ingo Molnar <mi...@elte.hu> wrote:

> + if (unlikely(sem->count <= 0))
> __down(sem);
> + sem->count--;

Peter pointed out that because sem->count is u32, the <= 0 is in fact
a "== 0" condition - the patch below does that. As expected gcc figured
out the same thing too, so the resulting code output did not change. (so
this is just a cleanup)

i've got this lined up in sched.git and it's undergoing testing right
now. If that testing goes fine and if there are no objections i'll send
a pull request for it later today.

Ingo

---------------->
Subject: semaphores: improve code
From: Ingo Molnar <mi...@elte.hu>
Date: Thu May 08 14:19:23 CEST 2008

No code changed:

kernel/semaphore.o:

text data bss dec hex filename
1207 0 0 1207 4b7 semaphore.o.before
1207 0 0 1207 4b7 semaphore.o.after

md5:
c10198c2952bd345a1edaac6db891548 semaphore.o.before.asm
c10198c2952bd345a1edaac6db891548 semaphore.o.after.asm

Signed-off-by: Ingo Molnar <mi...@elte.hu>
---

kernel/semaphore.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c

@@ -54,7 +54,7 @@ void down(struct semaphore *sem)


unsigned long flags;

spin_lock_irqsave(&sem->lock, flags);

- if (unlikely(sem->count <= 0))
+ if (unlikely(!sem->count))
__down(sem);


sem->count--;
spin_unlock_irqrestore(&sem->lock, flags);

@@ -76,7 +76,7 @@ int down_interruptible(struct semaphore

int result = 0;

spin_lock_irqsave(&sem->lock, flags);

- if (unlikely(sem->count <= 0))
+ if (unlikely(!sem->count))
result = __down_interruptible(sem);
if (!result)
sem->count--;
@@ -102,7 +102,7 @@ int down_killable(struct semaphore *sem)


int result = 0;

spin_lock_irqsave(&sem->lock, flags);

- if (unlikely(sem->count <= 0))
+ if (unlikely(!sem->count))
result = __down_killable(sem);
if (!result)
sem->count--;
@@ -156,7 +156,7 @@ int down_timeout(struct semaphore *sem,

int result = 0;

spin_lock_irqsave(&sem->lock, flags);

- if (unlikely(sem->count <= 0))
+ if (unlikely(!sem->count))
result = __down_timeout(sem, jiffies);
if (!result)
sem->count--;

Matthew Wilcox

unread,
May 8, 2008, 9:21:25 AM5/8/08
to Ingo Molnar, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin
On Thu, May 08, 2008 at 02:01:30PM +0200, Ingo Molnar wrote:
> Looking at the workload i found and fixed what i believe to be the real
> bug causing the AIM7 regression: it was inefficient wakeup / scheduling
> / locking behavior of the new generic semaphore code, causing suboptimal
> performance.

I did note that earlier downthread ... although to be fair, I thought of
it in terms of three tasks with the third task coming in and stealing
the second task's wakeup rather than the first task starving the second
by repeatedly locking/unlocking the semaphore.

> So if the old owner, even if just a few instructions later, does a
> down() [lock_kernel()] again, it will be blocked and will have to wait
> on the new owner to eventually be scheduled (possibly on another CPU)!
> Or if another task gets to lock_kernel() sooner than the "new
> owner" is scheduled, it will be blocked unnecessarily and for a very long
> time when there are 2000 tasks running.
>
> I.e. the implementation of the new semaphores code does wake-one and
> lock ownership in a very restrictive way - it does not allow
> opportunistic re-locking of the lock at all and keeps the scheduler from
> picking task order intelligently.

Fair is certainly the enemy of throughput (see also dbench arguments
passim). It may be that some semaphore users really do want fairness --
it seems pretty clear that we don't want fairness for the BKL.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

Arjan van de Ven

unread,
May 8, 2008, 9:59:05 AM5/8/08
to Ingo Molnar, Linus Torvalds, Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin
On Thu, 8 May 2008 14:01:30 +0200
Ingo Molnar <mi...@elte.hu> wrote:
>
> The contention comes from the following property of the new semaphore
> code: the new owner owns the semaphore exclusively, even if it is not
> running yet.
>
> So if the old owner, even if just a few instructions later, does a
> down() [lock_kernel()] again, it will be blocked and will have to
> wait on the new owner to eventually be scheduled (possibly on another
> CPU)! Or if another task gets to lock_kernel() sooner than the
> "new owner" is scheduled, it will be blocked unnecessarily and for a
> very long time when there are 2000 tasks running.

OK, sounds like I like the fairness part of the new semaphores (but
obviously not the 67% performance downside; I'd expect to sacrifice a
little performance... but this much??).

It sucks though; if this were a mutex, we could wake up the owner of
the bugger in the contended acquire path synchronously... but these
are semaphores, and don't have an owner ;( bah bah bah
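
To make the ownership handoff concrete, here is a minimal sketch of the
two wakeup policies under discussion, distilled from the kernel/semaphore.c
hunks quoted later in this thread (the _handoff/_opportunistic suffixes are
illustrative labels, not real function names):

/*
 * Strict hand-off (original generic semaphores): __up() hands the
 * semaphore directly to the first waiter via the waiter->up flag.
 * Nobody else may take it, even if that waiter does not get to run
 * for a long time (e.g. with 2000 runnable AIM7 tasks).
 */
static noinline void __up_handoff(struct semaphore *sem)
{
	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
					struct semaphore_waiter, list);

	list_del(&waiter->list);
	waiter->up = 1;			/* ownership passes to this waiter */
	wake_up_process(waiter->task);
}

/*
 * Opportunistic wakeup (the fix): up() has already incremented
 * sem->count, so __up() only wakes the first waiter.  Whoever reaches
 * the spinlock first and sees count > 0 wins - including the old owner
 * re-locking a few instructions later.
 */
static noinline void __up_opportunistic(struct semaphore *sem)
{
	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
					struct semaphore_waiter, list);

	wake_up_process(waiter->task);	/* waiter re-checks sem->count itself */
}

In the first variant up() only increments sem->count when the wait list is
empty; in the second it always increments, which is what lets a contending
task (or the old owner) take the semaphore before the woken waiter runs.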

Ingo Molnar

May 8, 2008, 10:44:13 AM
to Linus Torvalds, Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin

* Ingo Molnar <mi...@elte.hu> wrote:

> Peter pointed out that because sem->count is u32, the <= 0 is in
> fact a "== 0" condition - the patch below does that. As expected, gcc
> figured out the same thing too, so the resulting code output did not
> change. (so this is just a cleanup)

a second update patch, i've further simplified the semaphore wakeup
logic: there's no need for the wakeup to remove the task from the wait
list. This will make them slightly more fair, but more importantly,
this closes a race in my first patch for the unlikely case of a signal
(or a timeout) and an unlock coming in at the same time and the task not
getting removed from the wait-list.

( my performance testing with 2000 AIM7 tasks on a quad never hit that
race but x86.git QA actually triggered it after about 30 random kernel
bootups and it caused a nasty crash and lockup. )

Ingo

---------------->
Subject: sem: simplify queue management
From: Ingo Molnar <mi...@elte.hu>
Date: Tue May 06 19:32:42 CEST 2008

kernel/semaphore.o:

   text    data     bss     dec     hex filename
   1040       0       0    1040     410 semaphore.o.before
    975       0       0     975     3cf semaphore.o.after

Signed-off-by: Ingo Molnar <mi...@elte.hu>
---

kernel/semaphore.c | 32 ++++++++++++++++----------------
1 file changed, 16 insertions(+), 16 deletions(-)

Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c

@@ -202,33 +202,34 @@ static inline int __sched __down_common(


{
struct task_struct *task = current;
struct semaphore_waiter waiter;

+ int ret = 0;

waiter.task = task;
+ list_add_tail(&waiter.list, &sem->wait_list);

for (;;) {
- list_add_tail(&waiter.list, &sem->wait_list);
-
- if (state == TASK_INTERRUPTIBLE && signal_pending(task))
- goto interrupted;
- if (state == TASK_KILLABLE && fatal_signal_pending(task))
- goto interrupted;
- if (timeout <= 0)
- goto timed_out;
+ if (state == TASK_INTERRUPTIBLE && signal_pending(task)) {
+ ret = -EINTR;
+ break;
+ }
+ if (state == TASK_KILLABLE && fatal_signal_pending(task)) {
+ ret = -EINTR;
+ break;
+ }
+ if (timeout <= 0) {
+ ret = -ETIME;
+ break;
+ }
__set_task_state(task, state);


spin_unlock_irq(&sem->lock);
timeout = schedule_timeout(timeout);
spin_lock_irq(&sem->lock);

if (sem->count > 0)
- return 0;
+ break;
}

- timed_out:
- list_del(&waiter.list);
- return -ETIME;
-
- interrupted:
list_del(&waiter.list);
- return -EINTR;
+ return ret;
}

static noinline void __sched __down(struct semaphore *sem)
@@ -255,6 +256,5 @@ static noinline void __sched __up(struct


{
struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
struct semaphore_waiter, list);

- list_del(&waiter->list);
wake_up_process(waiter->task);

Linus Torvalds

May 8, 2008, 11:00:26 AM
to Zhang, Yanmin, Matthew Wilcox, Ingo Molnar, J. Bruce Fields, LKML, Alexander Viro, Andrew Morton, linux-...@vger.kernel.org

On Thu, 8 May 2008, Zhang, Yanmin wrote:
>
> I collected oprofile data. It doesn't look useful, as CPU idle is more than 50%.

Ahh, so it's probably still the BKL that is the problem; it's just not in
the file locking code. The changes to fs/locks.c probably didn't matter
all that much, and the additional regression was likely just some
perturbation.

So it's probably fasync that AIM7 tests. Quite possibly coupled with
/dev/tty etc. No file locking.

Linus

Ingo Molnar

May 8, 2008, 11:02:05 AM
to Matthew Wilcox, Linus Torvalds, Zhang, Yanmin, Andi Kleen, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin, Peter Zijlstra

* Matthew Wilcox <mat...@wil.cx> wrote:

> Fair is certainly the enemy of throughput (see also dbench arguments
> passim). It may be that some semaphore users really do want fairness
> -- it seems pretty clear that we don't want fairness for the BKL.

i don't think we need to consider any theoretical arguments about
fairness here as there's a fundamental down-to-earth maintenance issue
that governs: old semaphores were similarly unfair too, so it is just a
bad idea (and a bug) to change behavior when implementing new, generic
semaphores that are supposed to be a seamless replacement! This is about
legacy code that is intended to be phased out anyway.

This is already a killer argument and we wouldn't have to look any
further.

but even on the more theoretical level i disagree: fairness of CPU time
is something that is implemented by the scheduler in a natural way
already. Putting extra ad-hoc synchronization and scheduling into the
locking primitives around data structures only gives mathematical
fairness and artificial micro-scheduling; it does not actually make the
end result more useful! This is especially true for the BKL, which is
auto-dropped by the scheduler anyway (so descheduling a task
automatically relieves it of its BKL ownership).
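
As a rough illustration of that auto-drop (a simplified sketch of the
helpers used around schedule(), in the spirit of that era's
include/linux/smp_lock.h; details and exact signatures elided, not code
from this thread):

/*
 * Sketch: lock_depth < 0 means "this task does not hold the BKL".
 * schedule() calls release_kernel_lock(prev) before switching away and
 * reacquire_kernel_lock(current) after switching back, so a sleeping
 * BKL holder does not keep other tasks out of lock_kernel().
 */
static inline void release_kernel_lock(struct task_struct *task)
{
	if (unlikely(task->lock_depth >= 0))
		__release_kernel_lock();	/* drop the BKL, keep lock_depth */
}

static inline int reacquire_kernel_lock(struct task_struct *task)
{
	if (unlikely(task->lock_depth >= 0))
		return __reacquire_kernel_lock();	/* take it again, may block */
	return 0;
}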

For example we've invested a _lot_ of time and effort into adding lock
stealing (i.e. intentional "unfairness") to kernel/rtmutex.c, which is a
_lot_ harder to do atomically with PI constraints but still possible and
makes sense in the grand scheme of things. kernel/mutex.c is also
"unfair" - and that's correct IMO.

For the BKL in particular it makes little sense to talk about any
underlying resource, and there's almost no expectation from users for
that imaginary resource to be shared fairly.

Ingo

Ingo Molnar

May 8, 2008, 11:11:32 AM
to Linus Torvalds, Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin

* Ingo Molnar <mi...@elte.hu> wrote:

> > Peter pointed out that because sem->count is u32, the <= 0 is in
> > fact a "== 0" condition - the patch below does that. As expected, gcc
> > figured out the same thing too, so the resulting code output did not
> > change. (so this is just a cleanup)
>
> a second update patch, i've further simplified the semaphore wakeup
> logic: there's no need for the wakeup to remove the task from the wait
> list. This will make them slightly more fair, but more
> importantly, this closes a race in my first patch for the unlikely
> case of a signal (or a timeout) and an unlock coming in at the same
> time and the task not getting removed from the wait-list.
>
> ( my performance testing with 2000 AIM7 tasks on a quad never hit that
> race but x86.git QA actually triggered it after about 30 random
> kernel bootups and it caused a nasty crash and lockup. )

ok, it's looking good here so far, so here's the scheduler fixes tree
that you can pull if my semaphore fix looks good to you too:

git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-fixes.git for-linus

also includes a scheduler arithmetic fix from Mike. Find the shortlog
and diff below.

Ingo

------------------>
Ingo Molnar (1):
semaphore: fix

Mike Galbraith (1):
sched: fix weight calculations

kernel/sched_fair.c | 11 ++++++--
kernel/semaphore.c | 64 ++++++++++++++++++++++++---------------------------
2 files changed, 38 insertions(+), 37 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index c863663..e24ecd3 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -662,10 +662,15 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
if (!initial) {
/* sleeps upto a single latency don't count. */
if (sched_feat(NEW_FAIR_SLEEPERS)) {
+ unsigned long thresh = sysctl_sched_latency;
+
+ /*
+ * convert the sleeper threshold into virtual time
+ */
if (sched_feat(NORMALIZED_SLEEPER))
- vruntime -= calc_delta_weight(sysctl_sched_latency, se);
- else
- vruntime -= sysctl_sched_latency;
+ thresh = calc_delta_fair(thresh, se);
+
+ vruntime -= thresh;
}

/* ensure we never gain time by being placed backwards. */
diff --git a/kernel/semaphore.c b/kernel/semaphore.c
index 5c2942e..5e41217 100644
--- a/kernel/semaphore.c
+++ b/kernel/semaphore.c
@@ -54,10 +54,9 @@ void down(struct semaphore *sem)


unsigned long flags;

spin_lock_irqsave(&sem->lock, flags);

- if (likely(sem->count > 0))
- sem->count--;
- else

+ if (unlikely(!sem->count))
__down(sem);

+ sem->count--;
spin_unlock_irqrestore(&sem->lock, flags);
}
EXPORT_SYMBOL(down);
@@ -77,10 +76,10 @@ int down_interruptible(struct semaphore *sem)


int result = 0;

spin_lock_irqsave(&sem->lock, flags);

- if (likely(sem->count > 0))
- sem->count--;
- else

+ if (unlikely(!sem->count))
result = __down_interruptible(sem);

+ if (!result)
+ sem->count--;
spin_unlock_irqrestore(&sem->lock, flags);

return result;
@@ -103,10 +102,10 @@ int down_killable(struct semaphore *sem)


int result = 0;

spin_lock_irqsave(&sem->lock, flags);

- if (likely(sem->count > 0))
- sem->count--;
- else

+ if (unlikely(!sem->count))
result = __down_killable(sem);

+ if (!result)
+ sem->count--;
spin_unlock_irqrestore(&sem->lock, flags);

return result;
@@ -157,10 +156,10 @@ int down_timeout(struct semaphore *sem, long jiffies)


int result = 0;

spin_lock_irqsave(&sem->lock, flags);

- if (likely(sem->count > 0))
- sem->count--;
- else

+ if (unlikely(!sem->count))
result = __down_timeout(sem, jiffies);

+ if (!result)
+ sem->count--;
spin_unlock_irqrestore(&sem->lock, flags);

return result;
@@ -179,9 +178,8 @@ void up(struct semaphore *sem)


unsigned long flags;

spin_lock_irqsave(&sem->lock, flags);

- if (likely(list_empty(&sem->wait_list)))
- sem->count++;
- else
+ sem->count++;
+ if (unlikely(!list_empty(&sem->wait_list)))
__up(sem);
spin_unlock_irqrestore(&sem->lock, flags);
}
@@ -192,7 +190,6 @@ EXPORT_SYMBOL(up);
struct semaphore_waiter {
struct list_head list;
struct task_struct *task;
- int up;
};

/*

@@ -205,33 +202,34 @@ static inline int __sched __down_common(struct semaphore *sem, long state,


{
struct task_struct *task = current;
struct semaphore_waiter waiter;
+ int ret = 0;

- list_add_tail(&waiter.list, &sem->wait_list);
waiter.task = task;
- waiter.up = 0;

+ list_add_tail(&waiter.list, &sem->wait_list);

for (;;) {

- if (state == TASK_INTERRUPTIBLE && signal_pending(task))
- goto interrupted;
- if (state == TASK_KILLABLE && fatal_signal_pending(task))
- goto interrupted;
- if (timeout <= 0)
- goto timed_out;
+ if (state == TASK_INTERRUPTIBLE && signal_pending(task)) {
+ ret = -EINTR;
+ break;
+ }
+ if (state == TASK_KILLABLE && fatal_signal_pending(task)) {
+ ret = -EINTR;
+ break;
+ }
+ if (timeout <= 0) {
+ ret = -ETIME;
+ break;
+ }
__set_task_state(task, state);
spin_unlock_irq(&sem->lock);
timeout = schedule_timeout(timeout);
spin_lock_irq(&sem->lock);

- if (waiter.up)
- return 0;


+ if (sem->count > 0)

+ break;
}

- timed_out:
- list_del(&waiter.list);
- return -ETIME;
-
- interrupted:
list_del(&waiter.list);
- return -EINTR;
+ return ret;
}

static noinline void __sched __down(struct semaphore *sem)

@@ -258,7 +256,5 @@ static noinline void __sched __up(struct semaphore *sem)


{
struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
struct semaphore_waiter, list);
- list_del(&waiter->list);

- waiter->up = 1;

Adrian Bunk

May 8, 2008, 11:35:34 AM
to Ingo Molnar, Linus Torvalds, Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin, Mike Galbraith
On Thu, May 08, 2008 at 05:10:28PM +0200, Ingo Molnar wrote:
>...

> also includes a scheduler arithmetics fix from Mike. Find the shortlog
> and diff below.
>
> Ingo
>
> ------------------>
>...

> Mike Galbraith (1):
> sched: fix weight calculations
>...

The commit description says:

<-- snip -->

..

This bug could be related to the regression reported by Yanmin Zhang:

| Comparing with kernel 2.6.25, sysbench+mysql(oltp, readonly) has lots
| of regressions with 2.6.26-rc1:
|
| 1) 8-core stoakley: 28%;
| 2) 16-core tigerton: 20%;
| 3) Itanium Montvale: 50%.

Reported-by: "Zhang, Yanmin" <yanmin...@linux.intel.com>
..

<-- snip -->

Can we get that verified and the description updated before it hits
Linus' tree?

Otherwise this "could be related" will become unchangeable metadata that
will stay forever - no matter whether there's any relation at all.

Thanks
Adrian

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

Ingo Molnar

May 8, 2008, 11:42:00 AM
to Adrian Bunk, Linus Torvalds, Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin, Mike Galbraith

* Adrian Bunk <bu...@kernel.org> wrote:

> Can we get that verified and the description updated before it hits
> Linus' tree?

that's not needed. Mike's fix is correct, regardless of whether it fixes
the other regression or not.

> Otherwise this "could be related" will become unchangable metadata
> that will stay forever - no matter whether there's any relation at
> all.

.. and the problem with that is exactly what?

Ingo

Linus Torvalds

May 8, 2008, 12:03:48 PM
to Ingo Molnar, Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin

On Thu, 8 May 2008, Ingo Molnar wrote:
>
> Peter pointed it out that because sem->count is u32, the <= 0 is in fact
> a "== 0" condition - the patch below does that. As expected gcc figured
> out the same thing too so the resulting code output did not change. (so
> this is just a cleanup)

Why don't we just make it do the same thing that the x86 semaphores used
to do: make it signed, and decrement unconditionally. And call the
slow-path if it became negative.

IOW, make the fast-path be

spin_lock_irqsave(&sem->lock, flags);
if (--sem->count < 0)
__down();
spin_unlock_irqrestore(&sem->lock, flags);

and now we have an existing known-good implementation to look at?

Rather than making up a totally new and untested thing.

Linus
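
For comparison, the complete fast paths under that signed-count scheme
would look roughly like the sketch below (an illustration of the semantics
Linus describes, not a patch from this thread; it assumes sem->count is
made signed, where a negative count means |count| tasks are waiting):

void down(struct semaphore *sem)
{
	unsigned long flags;

	spin_lock_irqsave(&sem->lock, flags);
	if (unlikely(--sem->count < 0))		/* count is signed here */
		__down(sem);			/* slow path: sleep until an up() hands us the slot */
	spin_unlock_irqrestore(&sem->lock, flags);
}

void up(struct semaphore *sem)
{
	unsigned long flags;

	spin_lock_irqsave(&sem->lock, flags);
	if (unlikely(++sem->count <= 0))	/* still non-positive: someone is waiting */
		__up(sem);			/* slow path: wake one waiter */
	spin_unlock_irqrestore(&sem->lock, flags);
}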

Linus Torvalds

May 8, 2008, 2:31:37 PM
to Ingo Molnar, Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin

On Thu, 8 May 2008, Linus Torvalds wrote:
>
> Why don't we just make it do the same thing that the x86 semaphores used
> > to do: make it signed, and decrement unconditionally. And call the
> slow-path if it became negative.

> ...

> and now we have an existing known-good implementation to look at?

Ok, after having thought that over, and looked at the code, I think I like
your version after all. The old implementation was pretty complex due to
the need to be extra careful about the count that could change outside
of the lock; so, everything considered, a new implementation that is
simpler is probably the better choice.

Ergo, I will just pull your scheduler tree.

Adrian Bunk

May 8, 2008, 3:44:03 PM
to Ingo Molnar, Linus Torvalds, Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin, Mike Galbraith
On Thu, May 08, 2008 at 05:41:16PM +0200, Ingo Molnar wrote:
>
> * Adrian Bunk <bu...@kernel.org> wrote:
>
> > Can we get that verified and the description updated before it hits
> > Linus' tree?
>
> that's not needed. Mike's fix is correct, regardless of whether it fixes
> the other regression or not.

Then scrap the part about it possibly fixing a regression and the
Reported-by: line.

> > Otherwise this "could be related" will become unchangable metadata
> > that will stay forever - no matter whether there's any relation at
> > all.
>

> ... and the problem with that is exactly what?

It is important that our metadata is as complete and correct as
reasonably possible. Our code is not as well documented as it should be,
and in my experience often the only way to understand what happens and
why it happens is to ask git for the metadata (and I'm actually doing
this even for most of my "trivial" patches).

In 3 hours or 3 years someone might look at this commit trying to
understand what it does and why it does this.

And there's a big difference between "we do it because it's correct from
a theoretical point of view" and "it is supposed to fix a huge
performance regression".

> Ingo

cu
Adrian

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed


Ingo Molnar

May 8, 2008, 4:20:45 PM
to Linus Torvalds, Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin

* Linus Torvalds <torv...@linux-foundation.org> wrote:

> On Thu, 8 May 2008, Linus Torvalds wrote:
> >
> > Why don't we just make it do the same thing that the x86 semaphores used
> > to do: make it signed, and decrement unconditionally. And call the
> > slow-path if it became negative.
> > ...
> > and now we have an existing known-good implementation to look at?
>
> Ok, after having thought that over, and looked at the code, I think I
> like your version after all. The old implementation was pretty complex
> due to the need to be so extra careful about the count that could
> change outside of the lock, so everything considered, a new
> implementation that is simpler is probably the better choice.

yeah, I thought about that too; the problem i found is this thing in the
old lib/semaphore-sleepers.c code's __down() path:

remove_wait_queue_locked(&sem->wait, &wait);
wake_up_locked(&sem->wait);
spin_unlock_irqrestore(&sem->wait.lock, flags);
tsk->state = TASK_RUNNING;

i once understood that mystery wakeup to be necessary for some weird
ordering reason, but it would probably be hard to justify in the new
code, because it's done unconditionally, regardless of whether there are
sleepers around.

And once we deviate from the old code, we might as well go for the
simplest approach - which also happens to be rather close to the mutex
code's current slowpath - just with the counting property added, legacy
semantics, and no lockdep coverage.

> Ergo, I will just pull your scheduler tree.

great! Meanwhile 100 randconfigs booted fine with that tree, so i'd say
the implementation is robust.

i also did a quick re-test of AIM7 because the wakeup logic changed a
bit from what i tested initially (from round-robin to strict FIFO), and
as expected not much changed in the AIM7 results on the quad:

 Tasks    Jobs/Min    JTI    Real     CPU    Jobs/sec/task
  2000     55019.9     96    211.6   806.5      0.4585
  2000     55116.2     90    211.2   804.7      0.4593
  2000     55082.3     82    211.3   805.5      0.4590

this is slightly lower, but the test was not fully apples-to-apples
because this run also had some tracing active, among other small
differences. It's
still very close to the v2.6.25 numbers. I suspect some more performance
could be won in this particular workload by getting rid of the BKL
dependency altogether.

Ingo

Linus Torvalds

May 8, 2008, 4:28:54 PM
to Ingo Molnar, Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin

On Thu, 8 May 2008, Ingo Molnar wrote:
>
> I suspect some more performance could be won in this particular workload
> by getting rid of the BKL dependency altogether.

Did somebody have any trace on which BKL taker it is? It apparently wasn't
file locking. Was it the tty layer?

Linus

Ingo Molnar

May 8, 2008, 5:46:36 PM
to Linus Torvalds, Zhang, Yanmin, Andi Kleen, Matthew Wilcox, LKML, Alexander Viro, Andrew Morton, Thomas Gleixner, H. Peter Anvin

* Linus Torvalds <torv...@linux-foundation.org> wrote:

> > I suspect some more performance could be won in this particular
> > workload by getting rid of the BKL dependency altogether.
>
> Did somebody have any trace on which BKL taker it is? It apparently
> wasn't file locking. Was it the tty layer?

yeah, i captured a trace of all the down()s that happen in the workload,
using ftrace/sched_switch + stacktrace + a tracepoint in down(). Here
are the semaphore activities in the trace:

# grep lock_kernel /debug/tracing/trace | cut -d: -f2- | sort | \
uniq -c | sort -n | cut -d= -f1-4

9 down <= lock_kernel <= proc_lookup_de <= proc_lookup <
12 down <= lock_kernel <= de_put <= proc_delete_inode <
14 down <= lock_kernel <= proc_lookup_de <= proc_lookup <
19 down <= lock_kernel <= vfs_ioctl <= do_vfs_ioctl <
58 down <= lock_kernel <= tty_release <= __fput <
62 down <= lock_kernel <= chrdev_open <= __dentry_open <
70 down <= lock_kernel <= sys_fcntl <= system_call_after_swapgs <
2512 down <= lock_kernel <= opost <= write_chan <
2574 down <= lock_kernel <= write_chan <= tty_write <

note that this is running the fixed semaphore code, so contended
semaphores are really rare in the trace. The histogram above includes
all calls to down().

here's the full trace file (from a single CPU, on a dual-core box
running aim7):

http://redhat.com/~mingo/misc/aim7-ftrace.txt

some other interesting stats. Top 5 wakeup sources:

# grep wake_up aim7-ftrace.txt | cut -d: -f2- | sort | uniq -c |
sort -n | cut -d= -f1-6 | tail -5
[...]
340 default_wake_function <= __wake_up_common <= __wake_up_sync <= unix_write_space <= sock_wfree <= skb_release_all <
411 default_wake_function <= autoremove_wake_function <= __wake_up_common <= __wake_up_sync <= pipe_read <= do_sync_read <
924 default_wake_function <= autoremove_wake_function <= __wake_up_common <= __wake_up_sync <= pipe_write <= do_sync_write <
1301 default_wake_function <= __wake_up_common <= __wake_up <= n_tty_receive_buf <= pty_write <= write_chan <
2065 wake_up_state <= prepare_signal <= send_signal <= __group_send_sig_info <= group_send_sig_info <= __kill_pgrp_info <

top 10 scheduling points:

# grep -w schedule aim7-ftrace.txt | cut -d: -f2- | sort |
uniq -c | sort -n | cut -d= -f1-4 | tail -10
[...]
465 schedule <= __cond_resched <= _cond_resched <= shrink_active_list <
582 schedule <= cpu_idle <= start_secondary <= <
929 schedule <= pipe_wait <= pipe_read <= do_sync_read <
990 schedule <= __cond_resched <= _cond_resched <= shrink_page_list <
1034 schedule <= io_schedule <= get_request_wait <= __make_request <
1140 schedule <= worker_thread <= kthread <= child_rip <
1512 schedule <= retint_careful <= <= 0 <
1571 schedule <= __cond_resched <= _cond_resched <= shrink_active_list <
2034 schedule <= schedule_timeout <= do_select <= core_sys_select <
2355 schedule <= sysret_careful <= <= 0 <

as visible from the trace, this is a CONFIG_PREEMPT_NONE kernel, so most
of the preemptions get triggered by wakeups and get executed from the
return-from-syscall need_resched check. But there's a fair amount of
select()-related sleeping, and a fair amount of IRQ-hits-userspace
driven preemption as well.

a rather hectic workload, with a surprisingly large amount of TTY
related activity.

and here's a few seconds' worth of NMI-driven readprofile output:

1131 page_fault 70.6875
1181 find_lock_page 8.2014
1302 __isolate_lru_page 12.8911
1344 copy_page_c 84.0000
1588 page_lock_anon_vma 34.5217
1616 ext3fs_dirhash 3.4753
1683 ext3_htree_store_dirent 6.8975
1976 str2hashbuf 12.7484
1992 copy_user_generic_string 31.1250
2362 do_unlinkat 6.5069
2969 try_to_unmap 2.0791
3031 will_become_orphaned_pgrp 24.4435
4009 __copy_user_nocache 12.3735
4627 congestion_wait 34.0221
6624 clear_page_c 414.0000
18711 try_to_unmap_one 30.8254
34447 page_referenced 140.0285
38669 do_filp_open 17.7789
166569 page_referenced_one 886.0053
213361 *unknown*
216021 sync_page 3375.3281
391888 page_check_address 1414.7581
962212 total 0.3039

system overhead is consistently 20% during this test.

the page_check_address() overhead is surprising - tons of rmap
contention? about 10% wall-clock overhead in that function alone - and
this is just on a dual-core box!

Below is the instruction level profile of that function. The second
column is the # of profile hits. The spin_lock() overhead is clearly
visible.

Ingo

___----- # of profile hits
v
ffffffff80286244: 3478 <page_check_address>:
ffffffff80286244: 3478 55 push %rbp
ffffffff80286245: 1092 48 89 d0 mov %rdx,%rax
ffffffff80286248: 0 48 c1 e8 27 shr $0x27,%rax
ffffffff8028624c: 0 48 89 e5 mov %rsp,%rbp
ffffffff8028624f: 717 41 57 push %r15
ffffffff80286251: 1 25 ff 01 00 00 and $0x1ff,%eax
ffffffff80286256: 0 49 89 cf mov %rcx,%r15
ffffffff80286259: 1354 41 56 push %r14
ffffffff8028625b: 0 49 89 fe mov %rdi,%r14
ffffffff8028625e: 0 48 89 d7 mov %rdx,%rdi
ffffffff80286261: 1507 41 55 push %r13
ffffffff80286263: 0 41 54 push %r12
ffffffff80286265: 0 53 push %rbx
ffffffff80286266: 1763 48 83 ec 08 sub $0x8,%rsp
ffffffff8028626a: 0 48 8b 56 48 mov 0x48(%rsi),%rdx
ffffffff8028626e: 4 48 8b 14 c2 mov (%rdx,%rax,8),%rdx
ffffffff80286272: 3174 f6 c2 01 test $0x1,%dl
ffffffff80286275: 0 0f 84 cc 00 00 00 je ffffffff80286347 <page_check_address+0x103>
ffffffff8028627b: 0 48 89 f8 mov %rdi,%rax
ffffffff8028627e: 64468 48 be 00 f0 ff ff ff mov $0x3ffffffff000,%rsi
ffffffff80286285: 0 3f 00 00
ffffffff80286288: 1 48 b9 00 00 00 00 00 mov $0xffff810000000000,%rcx
ffffffff8028628f: 0 81 ff ff
ffffffff80286292: 4686 48 c1 e8 1b shr $0x1b,%rax
ffffffff80286296: 0 48 21 f2 and %rsi,%rdx
ffffffff80286299: 0 25 f8 0f 00 00 and $0xff8,%eax
ffffffff8028629e: 7468 48 01 d0 add %rdx,%rax
ffffffff802862a1: 0 48 8b 14 08 mov (%rax,%rcx,1),%rdx
ffffffff802862a5: 11 f6 c2 01 test $0x1,%dl
ffffffff802862a8: 4409 0f 84 99 00 00 00 je ffffffff80286347 <page_check_address+0x103>
ffffffff802862ae: 0 48 89 f8 mov %rdi,%rax
ffffffff802862b1: 0 48 21 f2 and %rsi,%rdx
ffffffff802862b4: 1467 48 c1 e8 12 shr $0x12,%rax
ffffffff802862b8: 0 25 f8 0f 00 00 and $0xff8,%eax
ffffffff802862bd: 0 48 01 d0 add %rdx,%rax
ffffffff802862c0: 944 48 8b 14 08 mov (%rax,%rcx,1),%rdx
ffffffff802862c4: 17 f6 c2 01 test $0x1,%dl
ffffffff802862c7: 1 74 7e je ffffffff80286347 <page_check_address+0x103>
ffffffff802862c9: 927 48 89 d0 mov %rdx,%rax
ffffffff802862cc: 77 48 c1 ef 09 shr $0x9,%rdi
ffffffff802862d0: 3 48 21 f0 and %rsi,%rax
ffffffff802862d3: 1238 81 e7 f8 0f 00 00 and $0xff8,%edi
ffffffff802862d9: 1 48 01 c8 add %rcx,%rax
ffffffff802862dc: 0 4c 8d 24 38 lea (%rax,%rdi,1),%r12
ffffffff802862e0: 792 41 f6 04 24 81 testb $0x81,(%r12)
ffffffff802862e5: 25 74 60 je ffffffff80286347 <page_check_address+0x103>
ffffffff802862e7: 118074 48 c1 ea 0c shr $0xc,%rdx
ffffffff802862eb: 41187 48 b8 00 00 00 00 00 mov $0xffffe20000000000,%rax
ffffffff802862f2: 0 e2 ff ff
ffffffff802862f5: 182 48 6b d2 38 imul $0x38,%rdx,%rdx
ffffffff802862f9: 25998 48 8d 1c 02 lea (%rdx,%rax,1),%rbx
ffffffff802862fd: 0 4c 8d 6b 10 lea 0x10(%rbx),%r13
ffffffff80286301: 0 4c 89 ef mov %r13,%rdi
ffffffff80286304: 80598 e8 4b 17 28 00 callq ffffffff80507a54 <_spin_lock>
ffffffff80286309: 36022 49 8b 0c 24 mov (%r12),%rcx
ffffffff8028630d: 1623 f6 c1 81 test $0x81,%cl
ffffffff80286310: 5 74 32 je ffffffff80286344 <page_check_address+0x100>
ffffffff80286312: 16 48 b8 00 00 00 00 00 mov $0x1e0000000000,%rax
ffffffff80286319: 0 1e 00 00
ffffffff8028631c: 359 48 ba b7 6d db b6 6d mov $0x6db6db6db6db6db7,%rdx
ffffffff80286323: 0 db b6 6d
ffffffff80286326: 12 48 c1 e1 12 shl $0x12,%rcx
ffffffff8028632a: 0 49 8d 04 06 lea (%r14,%rax,1),%rax
ffffffff8028632e: 492 48 c1 e9 1e shr $0x1e,%rcx
ffffffff80286332: 23 48 c1 f8 03 sar $0x3,%rax
ffffffff80286336: 0 48 0f af c2 imul %rdx,%rax
ffffffff8028633a: 1390 48 39 c8 cmp %rcx,%rax
ffffffff8028633d: 0 75 05 jne ffffffff80286344 <page_check_address+0x100>
ffffffff8028633f: 0 4d 89 2f mov %r13,(%r15)
ffffffff80286342: 165 eb 06 jmp ffffffff8028634a <page_check_address+0x106>
ffffffff80286344: 0 fe 43 10 incb 0x10(%rbx)
ffffffff80286347: 11886 45 31 e4 xor %r12d,%r12d
ffffffff8028634a: 17451 5a pop %rdx
ffffffff8028634b: 14507 5b pop %rbx
ffffffff8028634c: 42 4c 89 e0 mov %r12,%rax
ffffffff8028634f: 1736 41 5c pop %r12
ffffffff80286351: 40 41 5d pop %r13
ffffffff80286353: 1727 41 5e pop %r14
ffffffff80286355: 1420 41 5f pop %r15
ffffffff80286357: 44 c9 leaveq
ffffffff80286358: 1685 c3 retq

gcc-4.2.3, the config is at:

http://redhat.com/~mingo/misc/config-Thu_May__8_22_23_21_CEST_2008
