i just reproduced the test to validate the data. i'm using 8kbyte blocks here.
kernel is 2.4.18; O_DIRECT is still the slowest.
this machine has 2GB RAM, so it has 1.1GB RAM in HighMem.
booting a kernel with 'profile=2' set, the numbers were as follows:
- Base performance, /dev/md0 raid-0 8-disk array:
[root@mel-stglab-host1 src]# readprofile -r;
./test_disk_performance bs=8k blocks=4M /dev/md0
Completed reading 31250 mbytes in 214.94761 seconds (153.05
Mbytes/sec), 53usec mean latency
- using /dev/md0 raid-0 8-disk array with O_DIRECT:
[root@mel-stglab-host1 src]# readprofile -r;
./test_disk_performance bs=8k blocks=4M direct /dev/md0
Completed reading 31250 mbytes in 1229.830726 seconds (26.64
Mbytes/sec), 306usec mean latency
- using /dev/md0 raid-0 8-disk array with O_NOCOPY hack:
[root@mel-stglab-host1 src]# readprofile -r;
./test_disk_performance bs=8k blocks=4M nocopy /dev/md0
Completed reading 31250 mbytes in 163.602116 seconds (200.29
Mbytes/sec), 39usec mean latency
so O_DIRECT in 2.4.18 still shows up as a 55% performance hit versus no
O_DIRECT. anyone have any clues?
from the profile of the O_DIRECT kernel, we have:
[root@mel-stglab-host1 src]# cat /tmp/profile2.txt | sort -n -k3 | tail -20
c01ceb90 submit_bh 270 2.4107
c01fc8c0 scsi_init_io_vc 286 0.7772
c0136ec0 create_bounce 323 0.9908
c0139d80 unlock_buffer 353 4.4125
c012f7d0 kmem_cache_alloc 465 1.6146
c0115a40 __wake_up 470 2.4479
c01fa720 __scsi_end_request 509 1.7674
c01fae00 scsi_request_fn 605 0.7002
c013cab0 end_buffer_io_kiobuf 675 10.5469
c01154e0 schedule 849 0.6170
c0131a40 rmqueue 868 1.5069
c025ede0 raid0_make_request 871 2.5923
c0225ee0 qla2x00_done 973 1.6436
c013cb60 brw_kiovec 1053 1.0446
c01ce400 __make_request 1831 1.1110
c01f30e0 scsi_dispatch_cmd 1854 2.0692
c011d010 do_softirq 2183 9.7455
c0136c30 bounce_end_io_read 13947 39.6222
c0105230 default_idle 231472 3616.7500
00000000 total 266665 0.1425
contrast this to the profile where we're not using O_DIRECT:
[root@mel-stglab-host1 src]# cat /tmp/profile3_base.txt | sort -n
-k3 | tail -20
c012fdc0 kmem_cache_reap 369 0.4707
c013b330 set_bh_page 397 4.9625
c011d010 do_softirq 419 1.8705
c0131a40 rmqueue 466 0.8090
c01fa720 __scsi_end_request 484 1.6806
c012fa60 kmem_cache_free 496 3.8750
c013bd00 block_read_full_page 523 0.7783
c012f7d0 kmem_cache_alloc 571 1.9826
c013db39 _text_lock_buffer 729 0.9812
c0130ca0 shrink_cache 747 0.7781
c01cea70 generic_make_request 833 2.8924
c025ede0 raid0_make_request 930 2.7679
c013b280 get_unused_buffer_head 975 5.5398
c01fc8c0 scsi_init_io_vc 1003 2.7255
c013d490 try_to_free_buffers 1757 4.7745
c013a9d0 end_buffer_io_async 2482 14.1023
c01ce400 __make_request 2687 1.6305
c012a6e0 file_read_actor 6951 27.1523
c0105230 default_idle 15227 237.9219
00000000 total 45048 0.0241
the biggest difference here is bounce_end_io_read in O_DIRECT.
given there's still lots of idle time, i'll fire up lockmeter on here and
see if there's any gremlins there.
On Fri, 10 May 2002, Lincoln Dale wrote:
> so O_DIRECT in 2.4.18 still shows up as a 55% performance hit versus no
> O_DIRECT. anyone have any clues?
O_DIRECT isn't doing any read-ahead.
For O_DIRECT to be a win, you need to make it asynchronous.
O_DIRECT is especially useful for applications which maintain their
own cache, e.g. a database. And adding Async to it is an even bigger
bonus (another Oracleism we did in PTX). No read ahead, no attempt
to keep the buffer in memory until memory pressure kicks in. Just
a good tool for doing random IO (like an OLTP database would do).
No, the I/O scheduler can't even tell whether it's being handed
O_DIRECT buffers or not.
You're only halfway right. You want to avoid the mmap altogether. To see
why, postulate that you have infinitely fast I/O devices (I know that's
not true but it's close enough if you get enough DMA channels going at
once, it doesn't take very many to saturate memory). For any server
application, now all your time is in the mmap(). And there is no need
for it in general, it's just there because the upper layer of the system
is too lame to handle real page frames.
Go read the splice notes, ftp://bitmover.com/pub/splice.ps because those
were written after we had tuned things enough in IRIX that it was the
VM manipulations that became the bottleneck.
Another way to think of it is this: figure out how fast the hardware could
move the data. Now make it go that fast. Unless you can hide all the
VM crud somehow, you won't achieve 100% of the hardware's capability.
I know I've done a bad job explaining the splice crud, but there is
some pretty cool stuff in there, if you really got it, you'd see how
the server stuff, the database stuff, the aio stuff, all I/O of any
kind can be done in terms of the splice:pull() and splice:push()
interfaces and that it is the absolute lowest cost way to have a
generic I/O layer.
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
On Fri, 10 May 2002, Gerrit Huizenga wrote:
> In message <Pine.LNX.4.44.02051...@home.transmeta.com>, Linus Torvalds
> writes:
> > For O_DIRECT to be a win, you need to make it asynchronous.
> O_DIRECT is especially useful for applications which maintain their
> own cache, e.g. a database. And adding Async to it is an even bigger
> bonus (another Oracleism we did in PTX).
The thing that has always disturbed me about O_DIRECT is that the whole
interface is just stupid, and was probably designed by a deranged monkey
on some serious mind-controlling substances [*].
It's simply not very pretty, and it doesn't perform very well either
because of the bad interfaces (where synchronicity of read/write is part
of it, but the inherent page-table-walking is another issue).
I bet you could get _better_ performance more cleanly by splitting up the
actual IO generation and the "user-space mapping" thing sanely. For
example, if you want to do an O_DIRECT read into a buffer, there is no
reason why it shouldn't be done in two phases:
(1) readahead: allocate pages, and start the IO asynchronously
(2) mmap the file with a MAP_UNCACHED flag, which causes read-faults to
"steal" the page from the page cache and make it private to the
mapping on page faults.
If you split it up like that, you can do much more interesting things than
O_DIRECT can do (ie the above is inherently asynchronous - we'll wait only
for IO to complete when the page is actually faulted in).
For O_DIRECT writes, you split it the other way around:
(1) mwrite() takes the pages in the memory area, and moves them into the
page cache, removing the page from the page table (and only copies
if existing pages already exist)
(2) fdatasync_area(fd, offset, len)
Again, the above is likely to be a lot more efficient _and_ can do things
that O_DIRECT only dreams of.
With my suggested _sane_ interface, I can do a noncached file copy that
should be "perfect" even in the face of memory pressure by simply doing
addr = mmap( .. MAP_UNCACHED .. src .. )
mwrite(dst, addr, len);
which does true zero-copy (and, since mwrite removes it from the page
table anyway, you can actually avoid even the TLB overhead trivially: if
mwrite notices that the page isn't mapped, it will just take it directly
from the page cache).
Sadly, database people don't seem to have any understanding of good taste,
and various OS people end up usually just saying "Yes, Mr Oracle, I'll
open up any orifice I have for your pleasure".
[*] In other words, it's an Oracleism.
We tried disabling the elevator while doing Raw IO with DB2
a couple of weeks ago. The database performance degraded much
more than expected. Disks were FC connected Tritons or SCSI
connected ServerRaid (or both?). Oracle often asks for a patch
to disable the elevator since they believe they can schedule IO
better. We didn't try with Oracle in this case, but DB2 and RAW
IO without an elevator was not a good choice.
On Sat, 11 May 2002, Linus Torvalds wrote:
... Large snip ...
> And I personally believe that "generate the data yourself" is actually a
> very common case. A pure pipe between two places is not what a computer is
> good at, or what a computer should be used for.
Hmmm, (this may not apply here, but...)
What about linux as a router (ip/ipx/...) or a bridge device?
Tia, JimL
| James W. Laferriere | System Techniques | Give me VMS |
| Network Engineer | P.O. Box 854 | Give me Linux |
| bab...@baby-dragons.com | Coudersport PA 16915 | only on AXP |
Huh, I must have missed something, does the mmap() not create any page
tables at all?
I've never liked mmap although that may just be my advanced age
("we never had mmap, we copied files by cutting cuneiform in fresh
clay tablets, the way the gods intended ")
struct kio k;
k.count = RECORDSIZE;
fd1 = open("inputfile", KIO_READ);
fd1a = dup(fd1); // dup creates a non-KIO descriptor for the same file
fd2 = open("outputfile", KIO_WRITE);
while ((n = read(fd1, &k, sizeof(struct kio))) > 0)
    write(fd2, &k, sizeof(struct kio));
write(fd1a, "Another record sent, Mr E.\n", GROVELSIZE);
> Sadly, database people don't seem to have any understanding of good taste,
> and various OS people end up usually just saying "Yes, Mr Oracle, I'll
> open up any orifice I have for your pleasure".
When you drive by that campus in redwood city you start to understand how
insignificant you are.
On Sat, 11 May 2002, Larry McVoy wrote:
> You're only halfway right. You want to avoid the mmap altogether.
See my details on doing the perfect zero-copy copy thing.
The mmap doesn't actually touch the page tables - it ends up being nothing
but a "placeholder".
So if you do
addr = mmap( .. MAP_UNCACHED .. src .. )
mwrite(dst, addr, len);
then you can think of the mmap as just a "cookie" or the "hose" between
the source and the destination.
Does it have to be an mmap? No. But the advantage of the mmap is that you
can use the mmap to modify the stream if you want to, quite transparently.
And it gives the whole thing a whole lot more flexibility, in that if you
generate the data yourself, you'd just do the mwrite() - again with zero copy.
And I personally believe that "generate the data yourself" is actually a
very common case. A pure pipe between two places is not what a computer is
good at, or what a computer should be used for.
On Sat, 11 May 2002, Larry McVoy wrote:
> On Sat, May 11, 2002 at 11:35:21AM -0700, Linus Torvalds wrote:
> > See my details on doing the perfect zero-copy copy thing.
> > The mmap doesn't actually touch the page tables - it ends up being nothing
> > but a "placeholder".
> Huh, I must have missed something, does the mmap() not create any page
> tables at all?
It can. But go down to the end in my first explanation to see why it
doesn't have to.
I'll write up the implementation notes and you'll see what I'm talking about:
- readahead(fd, offset, size)
Obvious (except the readahead is free to ignore the size; it's just a hint).
- mmap( MAP_UNCACHED )
This only sets up the "vma" descriptor (like all other MMAP's). It's
exactly like a regular private mapping, except instead of just
incrementing the page count on a page-in, it will look at whether the
page can just be removed from the page cache and inserted as a private
page into the mapping ("stealing" the page).
- fdatasync_area( fd, offset, len)
Obvious. It's fdatasync, except it only guarantees the specific range.
- mwrite(fd, addr, len)
This really does the "reverse" of mmap(MAP_UNCACHED) (and like a
mapping, addr/len have to be page-aligned).
This walks the page tables, and does the _smart_ thing:
- if no mapping exists, it looks at the backing store of the vma,
and gets the page directly from the backing store instead of
bothering to populate the page tables.
- if the mapped page exists, it removes it from the page table
- in either case, it moves the page it got into the page cache of the
destination file descriptor.
NOTE on zero-copy / no-page-fault behaviour:
- mwrite has to walk the page tables _anyway_ (the same as O_DIRECT),
since that's the only way to do zero-copy.
- since mwrite has to do that part, it's trivial to notice that the page
tables don't exist. In fact, it's a very natural result of the whole approach.
- if user space doesn't touch the mapping itself in any way (other than
point mwrite() at it), you never build up any page tables at all, and
you never even need to touch the TLB (ie no flushes, no nothing).
- note how even "mmap( MAP_UNCACHED )" doesn't actually touch the TLB or
the page tables (unless it uses MAP_FIXED and you use it to unmap a
previous area, of course - that's all in the normal mmap code already)
I will _guarantee_ that this is more efficient than any O_DIRECT ever was,
and it will get very close to your "optimal" thing (it does need to look
at some page tables, but since the page tables haven't ever really needed
to be built up for the pure copy case, it will be able to decide that the
page isn't there from the top-level page table if you align the virtual
area properly - ie at 4MB boundaries on an x86).
I suspect that this is about a few hundred lines of code (and a lot of
testing). And you can emulate O_DIRECT behaviour with it, along with
splice (only for page-cache entities, though), and a lot of other things.
I'm curious how you did this -- did you disable sorting and merging, or
just sorting? Merging is pretty essential to getting decent I/O speeds
in current kernels.
> more than expected. Disks were FC connected Tritons or SCSI
> connected ServerRaid (or both?). Oracle often asks for a patch
> to disable the elevator since they believe they can schedule IO
> better. We didn't try with Oracle in this case, but DB2 and RAW
> IO without and elevator was not a good choice.
Due to excessive queue scan times, lock contention, or just slight waste
of cycles?
[...snip... lots of good ideas...]
I'm not sure this is quite the same problem that Oracle (and others)
typically used O_DIRECT for (not trying to be an apologist here, just
making sure the right problem gets solved)...
Most of what Oracle was managing with O_DIRECT was its "Shared Global
Area", which is usually a region of all possible memory that the OS
and other applications aren't using. It uses that space like a giant
buffer cache. Most of the IO's for OLTP applications were little
bitty random 2K IOs. So, their ideal goal was to have the ability to
say here's a list of 10,000 random 2K IOs I want you to do really
quickly and spread them out at these spots within the SGA. Those IOs
can be read asynchronously, but there needs to be some way to know when
the bits make it from disk to memory. Think of it as something like
a big async readv, ideally with the buffer cache and as much of the OS
out of the way as possible.
When the SGA is "full" (memory pressure) they do big async, no buffer
cache, non-deferred writev's (by non deferred, I mean that the write
is actually scheduled for disk, not buffered in memory indefinitely -
they really believe they are done with those buffers).
Now the mmap( MAP_UNCACHED ) thing might work, except that this isn't
really a private mapping - it's a shared mapping. So something like
tmpfs might be the answer, where the tmpfs had a property of being
uncached (in fact, Oracle would love it if that space were pinned into
memory/non-pageable/non-swappable). That way the clients don't block
taking page faults and the server schedules activities to get the
greatest throughput (e.g. schedule clients who wouldn't block).
Unfortunately, tmpfs takes away the niceness of the VM optimizations.
Oh, and Database DSS workloads (Decision Support: scan all disks looking
for needles in a big haystack) have different tradeoffs, mostly needing
to focus on lots of sequential IO, where pre-fetching, reading, and
discarding buffers immediately after use are the primary focus and write
performance is not critical.
> > more than expected. Disks were FC connected Tritons or SCSI
> > connected ServerRaid (or both?). Oracle often asks for a patch
> > to disable the elevator since they believe they can schedule IO
> > better. We didn't try with Oracle in this case, but DB2 and RAW
> > IO without and elevator was not a good choice.
> Due to excessive queue scan times, lock contention, or just slight waste
> of cycles?
A lot more interrupts on the RAID device, indicating a lot more
IOs, probably a direct result of disabling merging. Overall IO throughput
dropped pretty dramatically, reducing database throughput.
A good indication to gen a patch with just sorting turned off and
see where that gets us...
i believe the elevator is part of the 'block' layer and applies to anything
that goes thru it. so the answer is that the requests would use the elevator.
for the test in question, i was doing sequential reads from the first block
of each disk until some block later on in the disk (ie. a 2gbyte read or so).
given that was the case and the only i/o ops were 'read' operations, the
elevator would make no difference here.
At 11:35 AM 11/05/2002 -0700, Linus Torvalds wrote:
>And I personally believe that "generate the data yourself" is actually a
>very common case. A pure pipe between two places is not what a computer is
>good at, or what a computer should be used for.
i think you'd be surprised. if we include "pipe from disk to network" then
a large number of 'server' applications do exactly this.
webservers do. fileservers do. http caches do. streaming-media servers do.
sure, they may add additional headers on the front and still generate
dynamic content in some cases, but the "common case" is 'pipe from disk to
network' or 'pipe from network to disk'.
'network' is typically TCP but can be UDP (with rate-limiting) in some cases.
it's very good to see this being discussed. that's a large step forward from
many people believing the problem was nonexistent.
i'm skeptical that continuing to use the page-cache is the correct way to
go -- many of these kinds of applications are doing their own form of
memory-management and hot-content 'caching' so are happy to manage a
few-to-several hundred megabytes of "page cache equivalent" data themselves.
at least on many of the 2.3.xx linux releases, that was one of the big
attractions of 'raw' devices -- they didn't get the box into an OOM situation.
if 2.5.xx and recent 2.4.xx has the issues of
page-cache-doesn't-shrink-fast-enough solved, then it's foreseeable it will fly.
We did some i/o profiling about 6 years ago on a big scientific
app that had started in fortran and had been rewritten in c++
the fortran code used r/w on files and used temp files
the c++ did mmaps and had big data structures - taking advantage of the VM.
one thing I thought was interesting is that it was easy to see how a smart
algorithm, not even such a smart one, could adapt i/o to the patterns of
i/o in the fortran code, but the c++ i/o patterns were really complex.
when everything goes into the page cache, it seems like you will lose that
information.
That is certainly the case. If the application is seekily writing
to a file then we currently lay the file out on-disk in the order
in which the application seeked. So reading the file back
linearly is very slow.
Now this is not necessarily a bad thing - if the file was created
seekily then it will probably be _used_ seekily, so no big deal.
This problem is pretty unsolvable for filesystems which map blocks
to their disk address at write(2) time. It can be solved for
allocate-on-flush filesystems via a sort of the dirty page list,
or by maintaining ->dirty_pages in a tree or whatever.
There is one "file" where this problem really does matter - the
blockdev mapping "/dev/hda1". It is both highly fragmented and
poorly sorted on the dirty_pages list.
It's pretty trivial to perform a sillysort at writeout time:
if we just wrote page N and the next page isn't N+1 then do a
pagecache probe for "N+1". That's probably sufficient. If
not, there's a simple little sort routine over at
which is appropriate to our lists.
I'll be taking a look at the sillysort option once I've cleared away
some other I/O scheduling glitches.