Re: Quad core CPUs loaded at only 50% when running a CPU and mmap intensive multi-threaded task

Peter Zijlstra

unread,

Aug 25, 2008, 1:52:49 AM8/25/08

to Török Edwin, Ingo Molnar, r...@tech9.net, Linux Kernel, Thomas Gleixner mingo@redhat.com, H. Peter Anvin

On Mon, 2008-08-25 at 00:01 +0300, Török Edwin wrote:
> Hi Ingo,
>
> When I run clamd (www.clamav.net), I can only get to load my CPU 50%
> (according to top), and disks at 30% (according to iostat -x 3),
> regardless how many threads I set (I tried 4, 8, 16, 32).

Can you share your .config, and prehaps tell what kernel version did
work for you?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Peter Zijlstra

unread,

Aug 25, 2008, 5:42:25 AM8/25/08

to edwin, Ingo Molnar, r...@tech9.net, Linux Kernel, Thomas Gleixner mingo@redhat.com, H. Peter Anvin

On Mon, 2008-08-25 at 10:04 +0300, edwin wrote:

> Peter Zijlstra wrote:
> > On Mon, 2008-08-25 at 00:01 +0300, Török Edwin wrote:
> >
> >> Hi Ingo,
> >>
> >> When I run clamd (www.clamav.net), I can only get to load my CPU 50%
> >> (according to top), and disks at 30% (according to iostat -x 3),
> >> regardless how many threads I set (I tried 4, 8, 16, 32).
> >>
> >
> >
> > Can you share your .config, and prehaps tell what kernel version did
> > work for you?
>

> Sorry, I forgot to include the .config, its at the end of this mail (the
> cfs debug info output included the .config though)
>
> Well, I just bought this new box, so there isn't a kernel version that I
> know that worked on this hardware (but I am trying to boot some older
> versions now).
> However on my previous box (Athlon64, non-SMP) I have never seen such a
> problem (that the CPU is loaded only 50% with clamd) and I've been
> running 2.6.26 and 2.6.27-rc4 there too.
>
> Details below, short summary here:
> 2.6.24: WORKS, clamd 400% CPU, testprogram runs in 27.4 seconds, 67% CPU
> load; and 28.5 seconds w/o setting affinity
> 2.6.25+: DOES NOT WORK, clamd 200%-300% CPU, testprogram runs in 38-40
> seconds, 48-48% CPU load, and 47-56 seconds w/o setting affinity
>
> Debian has 2.6.18, 2.6.22, 2.6.24, 2.6.25, 2.6.26.
> 2.6.22 won't work with my lvm, so I can't boot that, so I tried 2.6.24:
>
> 2.6.24 doesn't have sched_debug enabled in the stock kernel
> unfortunately, but the output of cfs-debug-info.sh is available here,
> maybe it contains some useful info:
> http://edwintorok.googlepages.com/testrun-1219645937.tar.gz
>
> Is this enough info for you to reproduce the problem, or do you want me
> to try and bisect?

No, I think I know what's going on..

mmap() and munmap() need to take the mmap_sem for writing (since they
modify the memory map) and you let each thread (one for each cpu) take
that process wide lock, twice, for a million times.

Guess what happens ;-)

Török Edwin

unread,

Aug 25, 2008, 5:49:39 AM8/25/08

to Peter Zijlstra, Ingo Molnar, r...@tech9.net, Linux Kernel, Thomas Gleixner mingo@redhat.com, H. Peter Anvin

Are you referring to the mmap_sem lock, or my mutex lock around
all_thread_time?
> Guess what happens ;-)

So the problem is that doing mmap() doesn't scale well with multiple
threads, because there is contention on mmap_sem?
Why did 2.6.24 seem to work better?

Best regards,
--Edwin

Peter Zijlstra

unread,

Aug 25, 2008, 6:02:31 AM8/25/08

to Török Edwin, Ingo Molnar, r...@tech9.net, Linux Kernel, Thomas Gleixner mingo@redhat.com, H. Peter Anvin

mmap_sem, its process wide, and your test prog bangs on it like there's
no tomorrow.

> > Guess what happens ;-)
>
> So the problem is that doing mmap() doesn't scale well with multiple
> threads, because there is contention on mmap_sem?

Indeed.

> Why did 2.6.24 seem to work better?

Perhaps the scheduler overhead did increase, can you try:

echo NO_HRTICK > /debug/sched_features

(after mounting debugfs on /debug, or adjusting the path to where you do
have it mounted)

That might cause some overhead on very high context switch rates.

Török Edwin

unread,

Aug 25, 2008, 6:22:43 AM8/25/08

to Peter Zijlstra, Ingo Molnar, r...@tech9.net, Linux Kernel, Thomas Gleixner mingo@redhat.com, H. Peter Anvin

Well, the real program (clamd) that this testprogram tries to simulate
does an mmap for almost every file, and I have lots of small files.
6.5G, 114122 files, average size 57k.

I'll run latencytop again, last time it has showed 100ms - 500ms latency
for clamd, and it was about mmap, I'll provide you with the exact output.

>>> Guess what happens ;-)
>>>
>> So the problem is that doing mmap() doesn't scale well with multiple
>> threads, because there is contention on mmap_sem?
>>
>
> Indeed.
>
>
>> Why did 2.6.24 seem to work better?
>>
>
> Perhaps the scheduler overhead did increase, can you try:
>
> echo NO_HRTICK > /debug/sched_features
>
> (after mounting debugfs on /debug, or adjusting the path to where you do
> have it mounted)
>
> That might cause some overhead on very high context switch rates.

No difference, and turning off the other features from sched_features
doesn't seem to help either.

Best regards,
--Edwin

Peter Zijlstra

unread,

Aug 25, 2008, 6:36:40 AM8/25/08

to Török Edwin, Ingo Molnar, r...@tech9.net, Linux Kernel, Thomas Gleixner mingo@redhat.com, H. Peter Anvin

On Mon, 2008-08-25 at 13:22 +0300, Török Edwin wrote:

> Well, the real program (clamd) that this testprogram tries to simulate
> does an mmap for almost every file, and I have lots of small files.
> 6.5G, 114122 files, average size 57k.
>
> I'll run latencytop again, last time it has showed 100ms - 500ms latency
> for clamd, and it was about mmap, I'll provide you with the exact output.

Right - does it make sense to teach clamav about pread() ?

Peter Zijlstra

unread,

Aug 25, 2008, 6:37:48 AM8/25/08

to Török Edwin, Ingo Molnar, r...@tech9.net, Linux Kernel, Thomas Gleixner mingo@redhat.com, H. Peter Anvin

On Mon, 2008-08-25 at 13:22 +0300, Török Edwin wrote:
>

> >> Why did 2.6.24 seem to work better?
> >>

> > That might cause some overhead on very high context switch rates.

>
> No difference, and turning off the other features from sched_features
> doesn't seem to help either.

OK, I'll poke a little more at is later today to see if I can spot
something

edwin

unread,

Aug 25, 2008, 7:03:55 AM8/25/08

to Peter Zijlstra, Ingo Molnar, r...@tech9.net, Linux Kernel, Thomas Gleixner mingo@redhat.com, H. Peter Anvin

Peter Zijlstra wrote:
> On Mon, 2008-08-25 at 13:22 +0300, Török Edwin wrote:
>
>
>> Well, the real program (clamd) that this testprogram tries to simulate
>> does an mmap for almost every file, and I have lots of small files.
>> 6.5G, 114122 files, average size 57k.
>>
>> I'll run latencytop again, last time it has showed 100ms - 500ms latency
>>

Latencytop output attached.
There is 4 - 60 ms latency for mmap/munmap, and the more threads there
are the total latency gets higher (latencytop says sum was ~480ms).

Running with MaxThreads 4 gets me 300-400% CPU usage, but with
MaxThreads 8 CPU usage drops to around 120-250%.
Now, maxthreads 4 looks like a good choice from a CPU usage point of
view, but is actually bad because it means that threads gets stuck in
iowait, and the CPU won't have anything to do. MaxThreads 8 looked like
a good alternative to fill the iowait gaps, but we run into the mmap_sem
issue.
In a real world environment MaxThreads influences how many mails you can
process in parallel with your MTA, so generally it should be as high as
possible.

On 2.6.27-rc4:

MaxThreads 4 time, empty database (all cached, almost no I/O):
1m9s

MaxThreads 4 time, after echo 3>/proc/sys/vm/drop_caches:
1m29s

MaxThreads 8 time, empty database (all cached, almost no I/O):
2m16s

MaxThreads 8 time, after echo 3>/proc/sys/vm/drop_caches:
2m15s

Of course running with a full database will give different results, so
I'll do some timing with that too (will take a little longer though).

>> for clamd, and it was about mmap, I'll provide you with the exact output.
>>
>
> Right - does it make sense to teach clamav about pread() ?

If it is preferred over mmap, then maybe yes.

Peter Zijlstra wrote:
> OK, I'll poke a little more at is later today to see if I can spot
> something

Thanks!

Best regards,
--Edwin

latgather-1219660865

latgather-1219660915

edwin

unread,

Aug 25, 2008, 7:30:48 AM8/25/08

to Peter Zijlstra, Ingo Molnar, r...@tech9.net, Linux Kernel, Thomas Gleixner mingo@redhat.com, H. Peter Anvin

MaxThreads 8, full DB (13 % slower than 2.6.24)
4m42s

MaxThreads 4, full DB (8% faster than 2.6.24)
2m35s

MaxThreads 8, full DB, 2.6.24:
4m3s

MaxThreads 4, full DB, 2.6.24:
2m50s

I have run an echo 3>/proc/sys/vm/drop_caches before each, I hope that
clears all caches,
I have xfs on top of lvm, on top of raid10, and iostat shows only 0 -
20% activity (%util).
That could also mean of course that the disks can provide data fast
enough for clamd.

>
> Of course running with a full database will give different results, so
> I'll do some timing with that too (will take a little longer though).
>
>>> for clamd, and it was about mmap, I'll provide you with the exact
>>> output.
>>>
>>
>> Right - does it make sense to teach clamav about pread() ?
>
> If it is preferred over mmap, then maybe yes.
>
> Peter Zijlstra wrote:

> Best regards,
> --Edwin

>> OK, I'll poke a little more at is later today to see if I can spot
>> something
>
> Thanks!

Still, if I have more threads, performance *decreases* almost linearly
with 2.6.27 (and probably with 2.6.25+ if clamd behaves the same as my
test proggie), however with 2.6.24

With Debian etch having 2.6.24 (etchnhalf actually), and lenny shipping
with 2.6.25 or 2.6.26, users upgrading from etch to lenny could see a
performance decrease.

Best regards,
--Edwin

Peter Zijlstra

unread,

Aug 25, 2008, 7:41:41 AM8/25/08

to edwin, Ingo Molnar, r...@tech9.net, Linux Kernel, Thomas Gleixner mingo@redhat.com, H. Peter Anvin

On Mon, 2008-08-25 at 14:00 +0300, edwin wrote:
> Peter Zijlstra wrote:

> > Right - does it make sense to teach clamav about pread() ?
>
> If it is preferred over mmap, then maybe yes.

I would certainly consider this for small (< 1M?) files. With mmap the
faults and pte overhead aren't free either, and the extra memcpy from
pread() isn't that much.

Theodore Tso

unread,

Aug 25, 2008, 9:48:28 AM8/25/08

to Peter Zijlstra, edwin, Ingo Molnar, r...@tech9.net, Linux Kernel, Thomas Gleixner mingo@redhat.com, H. Peter Anvin

On Mon, Aug 25, 2008 at 01:41:17PM +0200, Peter Zijlstra wrote:
>
> I would certainly consider this for small (< 1M?) files. With mmap the
> faults and pte overhead aren't free either, and the extra memcpy from
> pread() isn't that much.
>

Even for very big files, if you're only doing a single sequential pass
over a very large file (for example when converting a Canon raw image
file to TIFF format --- I know because I was trying to optimize dcraw
a while aback), you take the page fault for each 4k page, and so
simply using read/pread is faster. And that's on a single-threded
program. With a multithreaded program, the locking issues come on top
of that.

Maybe if I had used hugepages it would have been a win, I suppose, but
I never tried the experiment. And this was several years ago, on much
older hardware, so maybe the relative times of doing the memory copy
versus the page fault, but I wouldn't be surprised if it's even more
expensive to do the mmap, relatively speaking.

- Ted

Török Edwin

unread,

Sep 8, 2008, 3:10:52 PM9/8/08

to Andi Kleen, Theodore Tso, Peter Zijlstra, Ingo Molnar, r...@tech9.net, Linux Kernel, Thomas Gleixner mingo@redhat.com, H. Peter Anvin

On 2008-08-26 11:12, Andi Kleen wrote:
> Theodore Tso <ty...@mit.edu> writes:
>
> As a general comment it still sounds there is a regression here?
> If the workload was faster in an earlier kernel and is now slow
> clearly something got slower? And that might be fixable.
> Perhaps something for Rafael's list?

After some more careful testing with the real program (clamd) I can say
that there is no regression.
If I scan the exact same files as the box running 2.6.18 I get similar
results, the difference is within 10% [1].

There is however a problem with mmap [mmap with N threads is as slow as
mmap with 1 thread, i.e. it is sequential :(], pagefaults and disk I/O,
I think I am hitting the problem described in this thread (2 years ago!)
http://lwn.net/Articles/200215/
http://lkml.org/lkml/2006/9/19/260

It looks like such a patch is still not part of 2.6.27, what happened to it?
I will see if that patch applies to 2.6.27, and will rerun my test with
that patch applied too.

While running clamd I noticed in latencytop, that besides mmap/munmap
latencies (around 20 ms), I also get page fault latencies (again around
20 ms).

So I wrote another test program [2] that walks a directory tree, and
reads each file once using read for each, and once using mmap for each.
It clears the cache (using echo 3 >/proc/sys/vm/drop_caches) before each
test (and I run the read test first, so if the cache wouldn't be cleared,
then mmap would be faster not slower).
The results show that reading files using mmap() takes about the same
time, regardless of how many threads I use (1,2,4,8,16), but
using read has a near linear speedup with the number of threads.

First lets see some numbers [3], time to run the program on /usr/bin in
seconds [4]
Number of CPUs, 4
Number of threads ->, 1,, 2,, 4,, 8,, 16
Kernel version, read, mmap, read, mmap, read, mmap, read, mmap, read, mmap
2.6.27-rc5, 16.70, 17.01, 12.86, 16.26, 7.31, 15.16, 4.01, 14.93, 3.79,
15.40
2.6.26-1-amd64, 17.90, 16.95, 13.30, 16.18, 7.31, 15.34, 3.87, 14.96,
3.86, 15.89
2.6.22-3-amd64, 15.12, 15.41, 11.98, 15.17, 6.36, 14.29, 3.15, 14.61,
3.08, 15.44

The kernels are standard Debian kernels, except for 2.6.27-rc5 which
I've built myself (posted .config earlier in this thread).
mmap and read are about the same speed with nthreads=1, so lets see
speedups relative to nthreads=1

Kernel version, read, mmap, read, mmap, read, mmap, read, mmap, read, mmap
"2.6.27-rc5",1.00,1.00,1.30,1.05,2.28,1.12,4.16,1.14,4.41,1.10
"2.6.26-1-amd64",1.00,1.00,1.35,1.05,2.45,1.10,4.63,1.13,4.64,1.07
"2.6.22-3-amd64",1.00,1.00,1.26,1.02,2.38,1.08,4.80,1.05,4.91,1.00

I was running this on a usr/bin/ directory that has 372M, average file
size 160K.

So mmap performance stays about the same (14% change at most) regardless
number of threads, while
read performance *improves* with the number of threads, it is 4.8 times
*faster* than with single threaded case.

I think what happens is the following:
- thread A open a file with mmap, and starts reading it, this generates
page faults (which is normal for reading from an mmaped region)
- thread B opens another file with mmap, and starts reading it. It
happened to find mmap_sem untaken, so it locks it for writing, makes the
change, and unlocks
- thread A reads from a page that is not present triggering a page
fault, mmap_sem is taken, and thread A is waiting for the page to be
read from the disk
- thread B does the same, and takes mmap_sem for reading
- thread C creates a new mapping, and tries to take mmap_sem for
writing, it cannot because there are readers, so it blocks waiting
- thread A finishes the pagefault, releases the mmap_sem
- thread B hasn't finished the pagefault, C is still blocked
- A encounters another pagefault and takes the mmap_sem for reading
- B finishes, and releases, C still blocked because mmap_sem is taken
for reading
...
C eventually takes the mmap_sem for writing, blocking A and B who want
to read from a file
...

Even if C gets the semaphore as soon as one pagefault is done, it still
has to wait for the disk I/O for that pagefault to be completed.

Why do you need to hold the process-wide mmap_sem while waiting for the
page to be read from disk?
As I understand (I coudl be wrong, please correct me!) we need to make
sure that the page we are reading into exists and doesn't change mapping
details
during the disk I/O read, meaning it must not be unmapped, flags
changed, etc.
Can't we have a per-vma lock that would ensure this?

If a process would want to munmap something, it would take the mmap_sem,
then the per-vma lock, remove the mapping, release locks
If you want to mmap something, you take mmap_sem, create the mapping
with the per-vma lock held, release locks
When a page fault reads, it takes the mmap_sem for reading, finds the
vma, locks the per-vma lock, releases mmap_sem; does disk I/O;
makes any changes needed to the vma, acquires the mmap_sem, releases the
per-vma lock, releases mmap_sem
Anybody else wishing to modify a vma, needs to take mmap_sem, the
per-vma lock, release mmap_sem, make modification, release per-vma lock
Anybody wishing to add/remove/reorder vmas needs to take the mmap_sem,
the locks of all affected vmas, make modifications, and release all locks

Is there something wrong with the above approach?
From a lock contention perspective it can't be worse than holding a
single mmap_sem during the entire disk I/O.
mmap_sem would now mean: take me if you add/remove vmas, and take me
before taking a vma lock
vma lock: take me if you modify this vma, or if you need to be sure that
the vma doesn't go away

Thoughts?

[1] 2.6.18 won't boot on my box, because it won't recognize an ICH10
SATA controller
[2] the test program is available here:
http://edwintorok.googlepages.com/scalability.tar.gz
You just build it using 'make' (has to be GNU make), and the run
$ sh ./runtest.sh /usr/bin/ | tee log
$ sh ./postproc.sh log

[3] the raw logs are available here:
http://edwintorok.googlepages.com/logs.tar.gz

[4] The test program was run on this system:
Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz
4 GB DDR3 RAM
Motherboard GA-EP45T-DS3, chipset ICH10, SATA controller in AHCI mode
HDD 6x750GB WD Caviar Black, RAID 1 for /, and RAID10 for the rest
/ has ext3
I've run the test on /mnt/bak/usr/bin which is on /

You need to run the testprogram on a system with fast disks (for eg it
doesn't really make a difference on my laptop with 5k4 rpm disks).
This system has these timings:
/dev/md3:
Timing cached reads: 11112 MB in 2.00 seconds = 5561.88 MB/sec
Timing buffered disk reads: 296 MB in 3.00 seconds = 98.57 MB/sec

If you need more info, please ask.

Best regards,
--Edwin

Török Edwin

unread,

Sep 12, 2008, 3:17:29 PM9/12/08

to Mike Waychison, Andi Kleen, Theodore Tso, Peter Zijlstra, Ingo Molnar, Linux Kernel, Thomas Gleixner mingo@redhat.com, H. Peter Anvin, Benjamin Herrenschmidt, linu...@kvack.org

On 2008-09-08 22:10, Török Edwin wrote:
> [snip]

> There is however a problem with mmap [mmap with N threads is as slow as
> mmap with 1 thread, i.e. it is sequential :(], pagefaults and disk I/O,
> I think I am hitting the problem described in this thread (2 years ago!)
> http://lwn.net/Articles/200215/
> http://lkml.org/lkml/2006/9/19/260
>
> It looks like such a patch is still not part of 2.6.27, what happened to it?
> I will see if that patch applies to 2.6.27, and will rerun my test with
> that patch applied too.
>

The patch doesn't apply to 2.6.27-rc6, I tried manually applying the patch.
There have been many changes since 2.6.18 (like replacing find_get_page
with find_lock_page, filemap returning VM_FAULT codes, etc.).
I have probably done something wrong, because the resulting kernel won't
boot: I get abnormal exits and random sigbus during boot.

Can you please help porting the patch to 2.6.27-rc6? I have attached my
2 attempts at the end of this mail.

Also it looks like the original patch just releases the mmap_sem if
there is lock contention on the page, but keeps mmap_sem during read?
I would like mmap_sem be released during disk I/O too.

I also tried changing i_mmap_lock into a semaphore, however I that won't
work since some users of i_mmap_lock can't sleep.
Taking the i_mmap_lock spinlock in filemap fault is also not possible,
since we would sleep while holding a spinlock.

Just to confirm that the problem is with pagefaults and mmap, I dropped
the mmap_sem in filemap_fault, and then
I got same performance in my testprogram for mmap and read. Of course
this is totally unsafe, because the mapping could change at any time.

> [2] the test program is available here:
> http://edwintorok.googlepages.com/scalability.tar.gz
> You just build it using 'make' (has to be GNU make), and the run
> $ sh ./runtest.sh /usr/bin/ | tee log
> $ sh ./postproc.sh log

I've written a latency tracer (using ptrace), and I identified the
mutex/mmap related latencies (total runtime 23m):
- mmap-ed files (created by libclamav) ~6680 ms total
- creating/removing anonymous mappings, created by glibc, when I use
functions like fopen/fclose:

With 8 threads:
=====> Total: 3227.732 ms, average: 3.590 ms, times: 899
=== /lib/libc.so.6 (mmap)
=== /lib/libc.so.6 (_IO_file_doallocate)
=== /lib/libc.so.6 (_IO_doallocbuf)
=== /lib/libc.so.6 (_IO_file_seekoff)
=== /lib/libc.so.6 (_IO_file_attach)
=== /lib/libc.so.6 (_IO_fdopen)
=== /usr/lib/libz.so.1 (gzflush)
=== /usr/lib/libz.so.1 (gzdopen)
=== /usr/local/lib/libclamav.so.5 libclamav/scanners.c:470 (cli_scangzip)

=====> Total: 2069.519 ms, average: 3.624 ms, times: 571
=== /lib/libc.so.6 (munmap)
=== /lib/libc.so.6 (_IO_setb)
=== /lib/libc.so.6 (_IO_file_close_it)
=== /lib/libc.so.6 (_IO_fclose)
=== /usr/lib/libz.so.1 (gzerror)
=== /usr/local/lib/libclamav.so.5 libclamav/scanners.c:529 (cli_scangzip)

with 4 threads:
=====> Total: 578.607 ms, average: 4.743 ms, times: 122
=== /lib/libc.so.6 (munmap)
=== /lib/libc.so.6 (_IO_setb)
=== /lib/libc.so.6 (_IO_file_close_it)
=== /lib/libc.so.6 (_IO_fclose)
=== /usr/lib/libz.so.1 (gzerror)
=== /usr/local/lib/libclamav.so.5 libclamav/scanners.c:529 (cli_scangzip)

=====> Total: 148.083 ms, average: 2.278 ms, times: 65
=== /lib/libc.so.6 (mmap)
=== /lib/libc.so.6 (_IO_file_doallocate)
=== /lib/libc.so.6 (_IO_doallocbuf)
=== /lib/libc.so.6 (_IO_file_seekoff)
=== /lib/libc.so.6 (_IO_file_attach)
=== /lib/libc.so.6 (_IO_fdopen)
=== /usr/lib/libz.so.1 (gzflush)
=== /usr/lib/libz.so.1 (gzdopen)
=== /usr/local/lib/libclamav.so.5 libclamav/scanners.c:470 (cli_scangzip)

With 8 threads situation is much worse than with 4 threads even for
functions using anonymous mappings.

Of course the latency tracer has its own overhead (1.2 ms average, 67 ms
max for 8 threads, and 0.1 ms average, 31 ms max for 4 threads),but
these latencies are above that value.

Best regards,
--Edwin

---
First attempt:
arch/x86/mm/fault.c | 33 +++++++++++++++++++++++++++------
include/linux/mm.h | 1 +
include/linux/mm_types.h | 1 -
include/linux/sched.h | 1 +
mm/filemap.c | 18 ++++++++++++------
5 files changed, 41 insertions(+), 13 deletions(-)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 455f3fe..38bea4b 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -578,10 +578,7 @@ int show_unhandled_signals = 1;
* and the problem, and then passes it off to one of the appropriate
* routines.
*/
-#ifdef CONFIG_X86_64
-asmlinkage
-#endif
-void __kprobes do_page_fault(struct pt_regs *regs, unsigned long
error_code)
+static inline void __do_page_fault(struct pt_regs *regs, unsigned long
error_code)
{
struct task_struct *tsk;
struct mm_struct *mm;
@@ -702,6 +699,7 @@ again:
down_read(&mm->mmap_sem);
}

+retry:
vma = find_vma(mm, address);
if (!vma)
goto bad_area;
@@ -761,8 +759,21 @@ survive:
}
if (fault & VM_FAULT_MAJOR)
tsk->maj_flt++;
- else
- tsk->min_flt++;
+ else {
+ if ((fault & VM_FAULT_RETRY) && (current->flags &
PF_FAULT_MAYRETRY)) {
+ current->flags &= ~PF_FAULT_MAYRETRY;
+ goto retry;
+ }
+ /*
+ * If we had to retry (PF_FAULT_MAYRETRY cleared), then
+ * the page originally wasn't up to date before the
+ * retry, but now it is.
+ */
+ if (!(current->flags & PF_FAULT_MAYRETRY))
+ tsk->maj_flt++;
+ else
+ tsk->min_flt++;
+ }

#ifdef CONFIG_X86_32
/*
@@ -909,6 +920,16 @@ do_sigbus:
tsk->thread.trap_no = 14;
force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk);
}
+#ifdef CONFIG_X86_64
+asmlinkage
+#endif
+void __kprobes do_page_fault(struct pt_regs *regs,
+ unsigned long error_code)
+{
+ current->flags |= PF_FAULT_MAYRETRY;
+ __do_page_fault(regs, error_code);
+ current->flags &= ~PF_FAULT_MAYRETRY;
+}

DEFINE_SPINLOCK(pgd_lock);
LIST_HEAD(pgd_list);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 72a15dc..4511f68 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -697,6 +697,7 @@ static inline int page_mapped(struct page *page)

#define VM_FAULT_NOPAGE 0x0100 /* ->fault installed the pte, not
return page */
#define VM_FAULT_LOCKED 0x0200 /* ->fault locked the returned page */
+#define VM_FAULT_RETRY 0x0400

#define VM_FAULT_ERROR (VM_FAULT_OOM | VM_FAULT_SIGBUS)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index bf33413..9d065a4 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -108,7 +108,6 @@ struct vm_area_struct {
unsigned long vm_start; /* Our start address within vm_mm. */
unsigned long vm_end; /* The first byte after our end address
within vm_mm. */
-
/* linked list of VM areas per task, sorted by address */
struct vm_area_struct *vm_next;

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3d9120c..bc39432 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1507,6 +1507,7 @@ extern cputime_t task_gtime(struct task_struct *p);
#define PF_SPREAD_PAGE 0x01000000 /* Spread page cache over cpuset */
#define PF_SPREAD_SLAB 0x02000000 /* Spread some slab caches over
cpuset */
#define PF_THREAD_BOUND 0x04000000 /* Thread bound to specific cpu */
+#define PF_FAULT_MAYRETRY 0x08000000 /* I may drop mmap_sem during
fault */
#define PF_MEMPOLICY 0x10000000 /* Non-default NUMA mempolicy */
#define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt
mutex tester */
#define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it
as freezeable */
diff --git a/mm/filemap.c b/mm/filemap.c
index 876bc59..f9f11bd 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1443,18 +1443,17 @@ int filemap_fault(struct vm_area_struct *vma,
struct vm_fault *vmf)
/*
* Do we have something in the page cache already?
*/
-retry_find:
page = find_lock_page(mapping, vmf->pgoff);
/*
* For sequential accesses, we use the generic readahead logic.
*/
if (VM_SequentialReadHint(vma)) {
if (!page) {
+ up_read(&vma->vm_mm->mmap_sem);
page_cache_sync_readahead(mapping, ra, file,
vmf->pgoff, 1);
- page = find_lock_page(mapping, vmf->pgoff);
- if (!page)
- goto no_cached_page;
+ down_read(&vma->vm_mm->mmap_sem);
+ return VM_FAULT_RETRY;
}
if (PageReadahead(page)) {
page_cache_async_readahead(mapping, ra, file, page,
@@ -1489,7 +1488,10 @@ retry_find:

if (vmf->pgoff > ra_pages / 2)
start = vmf->pgoff - ra_pages / 2;
+ up_read(&vma->vm_mm->mmap_sem);
do_page_cache_readahead(mapping, file, start, ra_pages);
+ down_read(&vma->vm_mm->mmap_sem);
+ return VM_FAULT_RETRY;
}
page = find_lock_page(mapping, vmf->pgoff);
if (!page)
@@ -1527,7 +1529,9 @@ no_cached_page:
* We're only likely to ever get here if MADV_RANDOM is in
* effect.
*/
+ up_read(&vma->vm_mm->mmap_sem);
error = page_cache_read(file, vmf->pgoff);
+ down_read(&vma->vm_mm->mmap_sem);

/*
* The page we want has now been added to the page cache.
@@ -1535,7 +1539,7 @@ no_cached_page:
* meantime, we'll just come back here and read it again.
*/
if (error >= 0)
- goto retry_find;
+ return VM_FAULT_RETRY;

/*
* An error return from page_cache_read can result if the
@@ -1560,16 +1564,18 @@ page_not_uptodate:
* and we need to check for errors.
*/
ClearPageError(page);
+ up_read(&vma->vm_mm->mmap_sem);
error = mapping->a_ops->readpage(file, page);
if (!error) {
wait_on_page_locked(page);
if (!PageUptodate(page))
error = -EIO;
}
+ down_read(&vma->vm_mm->mmap_sem);
page_cache_release(page);

if (!error || error == AOP_TRUNCATED_PAGE)
- goto retry_find;
+ return VM_FAULT_RETRY;

/* Things didn't work out. Return zero to tell the mm layer so. */
shrink_readahead_size_eio(file, ra);

---
Second attempt
arch/x86/mm/fault.c | 33 ++++++++++++++++++++++++-----
include/linux/mm.h | 1
include/linux/sched.h | 1
mm/filemap.c | 56
+++++++++++++++++++++++++++++++++++++++++++++-----
4 files changed, 80 insertions(+), 11 deletions(-)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 455f3fe..38bea4b 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -578,10 +578,7 @@ int show_unhandled_signals = 1;
* and the problem, and then passes it off to one of the appropriate
* routines.
*/
-#ifdef CONFIG_X86_64
-asmlinkage
-#endif
-void __kprobes do_page_fault(struct pt_regs *regs, unsigned long
error_code)
+static inline void __do_page_fault(struct pt_regs *regs, unsigned long
error_code)
{
struct task_struct *tsk;
struct mm_struct *mm;
@@ -702,6 +699,7 @@ again:
down_read(&mm->mmap_sem);
}

+retry:
vma = find_vma(mm, address);
if (!vma)
goto bad_area;
@@ -761,8 +759,21 @@ survive:
}
if (fault & VM_FAULT_MAJOR)
tsk->maj_flt++;
- else
- tsk->min_flt++;
+ else {
+ if ((fault & VM_FAULT_RETRY) && (current->flags &
PF_FAULT_MAYRETRY)) {
+ current->flags &= ~PF_FAULT_MAYRETRY;
+ goto retry;
+ }
+ /*
+ * If we had to retry (PF_FAULT_MAYRETRY cleared), then
+ * the page originally wasn't up to date before the
+ * retry, but now it is.
+ */
+ if (!(current->flags & PF_FAULT_MAYRETRY))
+ tsk->maj_flt++;
+ else
+ tsk->min_flt++;
+ }

#ifdef CONFIG_X86_32
/*
@@ -909,6 +920,16 @@ do_sigbus:
tsk->thread.trap_no = 14;
force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk);
}
+#ifdef CONFIG_X86_64
+asmlinkage
+#endif
+void __kprobes do_page_fault(struct pt_regs *regs,
+ unsigned long error_code)
+{
+ current->flags |= PF_FAULT_MAYRETRY;
+ __do_page_fault(regs, error_code);
+ current->flags &= ~PF_FAULT_MAYRETRY;
+}

DEFINE_SPINLOCK(pgd_lock);
LIST_HEAD(pgd_list);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 72a15dc..e150c80 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -694,6 +694,7 @@ static inline int page_mapped(struct page *page)
#define VM_FAULT_SIGBUS 0x0002
#define VM_FAULT_MAJOR 0x0004
#define VM_FAULT_WRITE 0x0008 /* Special case for get_user_pages */
+#define VM_FAULT_RETRY 0x0016

#define VM_FAULT_NOPAGE 0x0100 /* ->fault installed the pte, not
return page */
#define VM_FAULT_LOCKED 0x0200 /* ->fault locked the returned page */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3d9120c..bc39432 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1507,6 +1507,7 @@ extern cputime_t task_gtime(struct task_struct *p);
#define PF_SPREAD_PAGE 0x01000000 /* Spread page cache over cpuset */
#define PF_SPREAD_SLAB 0x02000000 /* Spread some slab caches over
cpuset */
#define PF_THREAD_BOUND 0x04000000 /* Thread bound to specific cpu */
+#define PF_FAULT_MAYRETRY 0x08000000 /* I may drop mmap_sem during
fault */
#define PF_MEMPOLICY 0x10000000 /* Non-default NUMA mempolicy */
#define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt
mutex tester */
#define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it
as freezeable */
diff --git a/mm/filemap.c b/mm/filemap.c
index 876bc59..212ea0f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -670,6 +670,7 @@ repeat:
}
EXPORT_SYMBOL(find_get_page);

+#define NOPAGE_RETRY ((struct page*)-1)
/**
* find_lock_page - locate, pin and lock a pagecache page
* @mapping: the address_space to search
@@ -680,14 +681,31 @@ EXPORT_SYMBOL(find_get_page);
*
* Returns zero if the page was not present. find_lock_page() may sleep.
*/
-struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)
+static struct page *__find_lock_page(struct address_space *mapping,
+ pgoff_t offset, struct rw_semaphore *mmap_sem)
{
struct page *page;

repeat:
page = find_get_page(mapping, offset);
if (page) {
- lock_page(page);
+ if(!mmap_sem) {
+ lock_page(page);
+ } else if(!trylock_page(page)) {
+ /*
+ * Page is already locked by someone else.
+ * We don't want to be holding down_read(mmap_sem)
+ * inside lock_page(), so use wait_on_page_locked()
here.
+ */
+ up_read(mmap_sem);
+ wait_on_page_locked(page);
+ down_read(mmap_sem);
+ /*
+ * The VMA tree may have changed at this point.
+ */
+ page_cache_release(page);
+ goto repeat;
+ }
/* Has the page been truncated? */
if (unlikely(page->mapping != mapping)) {
unlock_page(page);
@@ -698,6 +716,10 @@ repeat:
}
return page;
}
+struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)
+{
+ return __find_lock_page(mapping, offset, NULL);
+}
EXPORT_SYMBOL(find_lock_page);

/**
@@ -1427,6 +1449,8 @@ int filemap_fault(struct vm_area_struct *vma,
struct vm_fault *vmf)
struct address_space *mapping = file->f_mapping;
struct file_ra_state *ra = &file->f_ra;
struct inode *inode = mapping->host;
+ struct rw_semaphore *mmap_sem;
+ struct rw_semaphore *mmap_sem_mayretry;
struct page *page;
pgoff_t size;
int did_readaround = 0;
@@ -1435,6 +1459,7 @@ int filemap_fault(struct vm_area_struct *vma,
struct vm_fault *vmf)
size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
if (vmf->pgoff >= size)
return VM_FAULT_SIGBUS;
+ up_read(&vma->vm_mm->mmap_sem);

/* If we don't want any read-ahead, don't bother */
if (VM_RandomReadHint(vma))
@@ -1443,16 +1468,25 @@ int filemap_fault(struct vm_area_struct *vma,
struct vm_fault *vmf)
/*
* Do we have something in the page cache already?
*/
+ mmap_sem = &vma->vm_mm->mmap_sem;
+ mmap_sem_mayretry = current->flags & PF_FAULT_MAYRETRY ? mmap_sem :
NULL;
retry_find:
- page = find_lock_page(mapping, vmf->pgoff);
+ page = __find_lock_page(mapping, vmf->pgoff, mmap_sem_mayretry);
+ if(page == NOPAGE_RETRY)
+ goto nopage_retry;
/*
* For sequential accesses, we use the generic readahead logic.
*/
if (VM_SequentialReadHint(vma)) {
if (!page) {
+ up_read(mmap_sem);
page_cache_sync_readahead(mapping, ra, file,
vmf->pgoff, 1);
- page = find_lock_page(mapping, vmf->pgoff);
+ down_read(mmap_sem);
+ page = __find_lock_page(mapping, vmf->pgoff,
+ mmap_sem_mayretry);
+ if(page == NOPAGE_RETRY)
+ goto nopage_retry;
if (!page)
goto no_cached_page;
}
@@ -1489,9 +1523,15 @@ retry_find:

if (vmf->pgoff > ra_pages / 2)
start = vmf->pgoff - ra_pages / 2;
+ up_read(mmap_sem);
do_page_cache_readahead(mapping, file, start, ra_pages);
+ down_read(mmap_sem);
}
- page = find_lock_page(mapping, vmf->pgoff);
+ page = __find_lock_page(mapping, vmf->pgoff,
+ (current->flags & PF_FAULT_MAYRETRY) ?
+ &vma->vm_mm->mmap_sem : NULL);
+ if(page == NOPAGE_RETRY)
+ goto nopage_retry;
if (!page)
goto no_cached_page;
}
@@ -1527,7 +1567,9 @@ no_cached_page:
* We're only likely to ever get here if MADV_RANDOM is in
* effect.
*/
+ up_read(mmap_sem);
error = page_cache_read(file, vmf->pgoff);
+ down_read(mmap_sem);

/*
* The page we want has now been added to the page cache.
@@ -1560,12 +1602,14 @@ page_not_uptodate:
* and we need to check for errors.
*/
ClearPageError(page);
+ up_read(mmap_sem);
error = mapping->a_ops->readpage(file, page);
if (!error) {
wait_on_page_locked(page);
if (!PageUptodate(page))
error = -EIO;
}
+ down_read(mmap_sem);
page_cache_release(page);

if (!error || error == AOP_TRUNCATED_PAGE)
@@ -1574,6 +1618,8 @@ page_not_uptodate:
/* Things didn't work out. Return zero to tell the mm layer so. */
shrink_readahead_size_eio(file, ra);
return VM_FAULT_SIGBUS;
+nopage_retry:
+ return VM_FAULT_RETRY;
}
EXPORT_SYMBOL(filemap_fault);

Mike Waychison

unread,

Sep 12, 2008, 3:37:05 PM9/12/08

to Török Edwin, Andi Kleen, Theodore Tso, Peter Zijlstra, Ingo Molnar, Linux Kernel, Thomas Gleixner mingo@redhat.com, H. Peter Anvin, Benjamin Herrenschmidt, linu...@kvack.org

Török Edwin wrote:
> On 2008-09-08 22:10, Török Edwin wrote:
>> [snip]
>> There is however a problem with mmap [mmap with N threads is as slow as
>> mmap with 1 thread, i.e. it is sequential :(], pagefaults and disk I/O,
>> I think I am hitting the problem described in this thread (2 years ago!)
>> http://lwn.net/Articles/200215/
>> http://lkml.org/lkml/2006/9/19/260
>>
>> It looks like such a patch is still not part of 2.6.27, what happened to it?
>> I will see if that patch applies to 2.6.27, and will rerun my test with
>> that patch applied too.
>>
>
> The patch doesn't apply to 2.6.27-rc6, I tried manually applying the patch.
> There have been many changes since 2.6.18 (like replacing find_get_page
> with find_lock_page, filemap returning VM_FAULT codes, etc.).
> I have probably done something wrong, because the resulting kernel won't
> boot: I get abnormal exits and random sigbus during boot.
>
> Can you please help porting the patch to 2.6.27-rc6? I have attached my
> 2 attempts at the end of this mail.

I actually have to forward port this and a bunch of other mm speed-ups
in the coming two weeks, though they'll be ports from 2.6.18 to 2.6.26.
I'll be sending them out to linux-mm once I've done so and completed
some testing.

>
> Also it looks like the original patch just releases the mmap_sem if
> there is lock contention on the page, but keeps mmap_sem during read?
> I would like mmap_sem be released during disk I/O too.

The 'lock'ing of the page is the part that waits for the read to finish,
and is the part that is contentious.

>
> I also tried changing i_mmap_lock into a semaphore, however I that won't
> work since some users of i_mmap_lock can't sleep.
> Taking the i_mmap_lock spinlock in filemap fault is also not possible,
> since we would sleep while holding a spinlock.

You shouldn't be seeing much contention on the i_mmap_lock, at least not
in the fault path (it's mostly just painful when you have a lot of
threads in direct reclaim and you have a large file mmaped).

Török Edwin

unread,

Sep 12, 2008, 4:11:22 PM9/12/08

to Mike Waychison, Andi Kleen, Theodore Tso, Peter Zijlstra, Ingo Molnar, Linux Kernel, Thomas Gleixner mingo@redhat.com, H. Peter Anvin, Benjamin Herrenschmidt, linu...@kvack.org

That would be great, thanks!

>>
>> Also it looks like the original patch just releases the mmap_sem if
>> there is lock contention on the page, but keeps mmap_sem during read?
>> I would like mmap_sem be released during disk I/O too.
>
> The 'lock'ing of the page is the part that waits for the read to
> finish, and is the part that is contentious.

Didn't know that, thanks for explaining.

>
>>
>> I also tried changing i_mmap_lock into a semaphore, however I that won't
>> work since some users of i_mmap_lock can't sleep.
>> Taking the i_mmap_lock spinlock in filemap fault is also not possible,
>> since we would sleep while holding a spinlock.
>
> You shouldn't be seeing much contention on the i_mmap_lock, at least
> not in the fault path (it's mostly just painful when you have a lot of
> threads in direct reclaim and you have a large file mmaped).

I was thinking of using i_mmap_lock as an alternative to holding
mmap_sem, but it didn't seem like a good idea after all.

Best regards,
--Edwin