On Tuesday, 21 March 2006 at 16:23, Matthew Dillon wrote:
> You might be doing just writes to the mmap()'d memory, but the system
> doesn't know that.
Actually, it does. The program tells it that I don't care to read what's
currently there, by specifying only the PROT_WRITE flag.
> The moment you touch any mmap()'d page, reading or writing, the system
> has to fault it in, which means it has to read it and load valid data
> into the page.
Sounds like a missed optimization opportunity :-(
> :When I mount with large read and write sizes:
> :
> :	mount_nfs -r 65536 -w 65536 -U -ointr pandora:/backup /backup
> :
> :it changes -- for the worse. Short time into it -- the file stops growing
> :according to the `ls -sl' run on the NFS server (pandora) at exactly 3200
> :FS blocks (the FS was created with `-b 65536 -f 8129').
> :
> :At the same time, according to `systat -if' on both client and server, the
> :client continues to send (and the server continues to receive) about
> :30Mb of some (?) data per second.
> It kinda sounds like the buffer cache is getting blown out, but not
> having seen the program I can't really analyze it.
See http://aldan.algebra.com/~mi/mzip.c
> It will always be more efficient to write to a file using write() than
> using mmap()
I understand that write() is much better optimized at the moment, but the
mmap interface carries some advantages which may allow future OSes to
optimize their ways. The application can hint at its planned usage of the
data via madvise, for example.
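For example (a minimal illustrative sketch, not the code from mzip.c; error
handling is trimmed), the intended access pattern can be declared right after
the mapping is established:

    /* Sketch: map a file read-only and tell the VM the scan will be
     * sequential, so it may read ahead and drop pages behind the scan. */
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    void *map_sequential(const char *path, size_t *lenp)
    {
        int fd = open(path, O_RDONLY);
        if (fd == -1)
            return NULL;
        struct stat st;
        if (fstat(fd, &st) == -1) {
            close(fd);
            return NULL;
        }
        void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);                      /* the mapping outlives the descriptor */
        if (p == MAP_FAILED)
            return NULL;
        /* The hint in question: declare a front-to-back scan. */
        madvise(p, (size_t)st.st_size, MADV_SEQUENTIAL);
        *lenp = (size_t)st.st_size;
        return p;
    }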
Unfortunately, my problem, so far, is with it not writing _at all_...
> and it will always be far, far more efficient to write to an NFS file in
> NFS block-sized chunks rather than in smaller chunks
> due to the way the buffer cache works.
Yes, this is an example of how a well-implemented mmap can be better than
write. Without explicit writes by the application and without doubling the
memory requirements, the data can be written in the most efficient way.
Thanks for your help. Yours,
-mi
That's an architectural flag. Very few architectures actually support
write-only memory maps. IA32 does not. It does not change the
fact that the operating system must validate the memory underlying
the page, nor does it imply that the system shouldn't.
:Sounds like a missed optimization opportunity :-(
Even on architectures that did support write-only memory maps, the
system would still have to fault in the rest of the data on the page,
because the system would have no way of knowing which bytes in the
page you wrote to (that is, whether you wrote to all the bytes in the
page or whether you left gaps). The system does not take a fault for
every write you issue to the page, only for the first one. So, no
matter how you twist it, the system *MUST* validate the entire page
when it takes the page fault.
:> It kinda sounds like the buffer cache is getting blown out, but not
:> having seen the program I can't really analyze it.
:
:See http://aldan.algebra.com/~mi/mzip.c
I can't access this URL, it says 'not found'.
:> It will always be more efficient to write to a file using write() than
:> using mmap()
:
:I understand, that write() is much better optimized at the moment, but the
:mmap interface carries some advantages, which may allow future OSes to
:optimize their ways. The application can hint at its planned usage of the
:data via madvise, for example.
Yes, but those advantages are limited by the way memory mapping hardware
works. There are some things that simply cannot be optimized through
lack of sufficient information.
Reading via mmap() is very well optimized. Making modifications via
mmap() is optimized insofar as the expectation that the data is intended
to be read, modified, and written back. It is not possible to
optimize with the expectation that the data would only be written to
the mmap, for the reasons described above. The hardware simply does not
provide sufficient information to the operating system to optimize
the write-only case.
:Unfortunately, my problem, so far, is with it not writing _at all_...
Not sure what is going on since I can't access the program yet, but
I'd be happy to take a look at the code.
The most common mistake people make when trying to write to a file via
mmap() is that they forget to ftruncate() the file to the proper length
first. Mapped memory beyond the file's EOF is ignored within the last
page, and the program will take a page fault if it tries to write to
mapped pages that are entirely beyond the file's current EOF. Writing
to mapped memory does *not* extend the size of a file. Only
ftruncate() or write() can extend the size of a file.
The second most common mistake is to forget to specify MAP_SHARED
in the mmap() call.
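In other words, the canonical sequence looks something like this (an
illustrative sketch only, not Mikhail's actual code):

    /* Sketch: write a buffer to a file through a mapping.  The two
     * easy-to-forget steps are ftruncate() before mmap() and MAP_SHARED
     * (with MAP_PRIVATE the stores would never reach the file). */
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int write_file_via_mmap(const char *path, const void *buf, size_t len)
    {
        int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd == -1)
            return -1;
        if (ftruncate(fd, (off_t)len) == -1) {   /* extend the file first */
            close(fd);
            return -1;
        }
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
            close(fd);
            return -1;
        }
        memcpy(p, buf, len);            /* dirty the pages */
        msync(p, len, MS_SYNC);         /* optional: push the dirty pages out now */
        munmap(p, len);
        return close(fd);
    }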
:Yes, this is an example of how a well-implemented mmap can be better than
:write. Without explicit writes by the application and without doubling the
:memory requirements, the data can be written in the most efficient way.
:...
:Thanks for your help. Yours,
:
: -mi
I don't think mmap()-based writing will EVER be more efficient than
write() except in the case where the entire data set fits into memory
and has been entirely cached by the system. In that one case writing via
mmap will be faster. In all other cases the system will be taking as
many VM faults on the pages as it would be taking system call faults
to execute the write()'s.
You are making a classic mistake by assuming that the copying overhead
of a write() into the file's backing store, versus directly mmap()ing
the file's backing store, represents a large chunk of the overhead for
the operation. In fact, the copying overhead represents only a small
chunk of the related overhead. The vast majority of the overhead is
always going to be the disk I/O itself.
I/O must occur even in the cached/delayed-write case so on a busy system
it still represents the greatest overhead from the point of view of
system load. On a lightly loaded system nobody is going to care about
a few milliseconds of improved performance here and there since, by
definition, the system is lightly loaded and thus has plenty of idle
cpu and I/O cycles to spare.
-Matt
Matthew Dillon
<dil...@backplane.com>
Why does the flag being architectural matter? The application tells the OS
that it only plans to write...
> It does not change the fact that the operating system must validate the
> memory underlying the page, nor does it imply that the system shouldn't.
> :Sounds like a missed optimization opportunity :-(
> Even on architectures that did support write-only memory maps, the
> system would still have to fault in the rest of the data on the page,
> because the system would have no way of knowing which bytes in the
> page you wrote to (that is, whether you wrote to all the bytes in the
> page or whether you left gaps).
Indeed, but in my case there is no data in the target file to begin with -- it
is "created" by ftruncate() prior to mmap-ing.
> :See http://aldan.algebra.com/~mi/mzip.c
> I can't access this URL, it says 'not found'.
Uh, sorry, the newest Apache is quite restrictive. I just tweaked it; please
try again.
> :The application can hint at its planned usage of the
> :data via madvise, for example.
> Yes, but those advantages are limited by the way memory mapping hardware
> works. There are some things that simply cannot be optimized through
> lack of sufficient information.
There is no need for additional information from hardware when deciding --
based on the information supplied by madvise -- which parts of the file (if
any) to keep in cache.
> I don't think mmap()-based writing will EVER be more efficient than
> write() except in the case where the entire data set fits into memory
> and has been entirely cached by the system.
My custom compressor is intended to operate on the database and filesystem
dumps as they arrive (uncompressed) from the computers being backed up via
NFS. It is intended to pick up most of the input data while it is still in
the RAM cache.
It was more convenient for me to implement outputting via mmap as well. The
bulk of the time is spent reading and compressing anyway -- the output is
many times smaller than the input. So the write performance never bothered me,
until I tried to do it via NFS and encountered all of these bugs :-( ...
	-mi
Actually, I cannot agree here -- quite the opposite seems true. When running
locally (no NFS involved) with the `-1' flag (fast, least effective
compression), my compressor easily compresses faster than it can read.
The Opteron CPU is about 50% idle, *and so is the disk*, producing only
15Mb/s.
I guess, despite the noise I raised on this subject a year ago, reading via
mmap continues to ignore MADV_SEQUENTIAL and has no other adaptability.
Unlike read, which uses buffering, mmap-reading still does not pre-fault the
file's pieces in efficiently :-(
Although the program was written to compress files that are _likely_ still in
memory, when used with regular files it exposes the lack of mmap
optimization.
This should be even more obvious if you time searching for a string in a
large file using grep vs. 'grep --mmap'.
Yours,
-mi
http://aldan.algebra.com/~mi/mzip.c
Well, I don't know about FreeBSD, but both grep cases work just fine on
DragonFly. I can't test mzip.c because I don't see the compression
library you are calling (maybe that's a FreeBSD thing). The results
of the grep test ought to be similar for FreeBSD since the heuristic
used by both OS's is the same. If they aren't, something might have
gotten nerfed accidentally in the FreeBSD tree.
Here is the cache case test. mmap is clearly faster (though I would
again caution that this should not be an implicit assumption since
VM fault overheads can rival read() overheads, depending on the
situation).
The 'x1' file in all tests below is simply /usr/share/dict/words
concatenated over and over again to produce a large file.
crater# ls -la x1
-rw-r--r-- 1 root wheel 638228992 Mar 23 11:36 x1
[ machine has 1GB of ram ]
crater# time grep --mmap asdfasf x1
1.000u 0.117s 0:01.11 100.0% 10+40k 0+0io 0pf+0w
crater# time grep --mmap asdfasf x1
0.976u 0.132s 0:01.13 97.3% 10+40k 0+0io 0pf+0w
crater# time grep --mmap asdfasf x1
0.984u 0.140s 0:01.11 100.9% 10+41k 0+0io 0pf+0w
crater# time grep asdfasf x1
0.601u 0.781s 0:01.40 98.5% 10+42k 0+0io 0pf+0w
crater# time grep asdfasf x1
0.507u 0.867s 0:01.39 97.8% 10+40k 0+0io 0pf+0w
crater# time grep asdfasf x1
0.562u 0.812s 0:01.43 95.8% 10+41k 0+0io 0pf+0w
crater# iostat 1
[ while grep is running, in order to test the cache case and verify that
no I/O is occurring once the data has been cached ]
The disk I/O case, which I can test by unmounting and remounting the
partition containing the file in question, then running grep, seems
to be well optimized on DragonFly. It should be similarly optimized
on FreeBSD since the code that does this optimization is nearly the
same. In my test, it is clear that the page-fault overhead in the
uncached case is considerably greater than the copying overhead of
a read(), though not by much. And I would expect that, too.
test28# umount /home
test28# mount /home
test28# time grep asdfasdf /home/x1
0.382u 0.351s 0:10.23 7.1% 55+141k 42+0io 4pf+0w
test28# umount /home
test28# mount /home
test28# time grep asdfasdf /home/x1
0.390u 0.367s 0:10.16 7.3% 48+123k 42+0io 0pf+0w
test28# umount /home
test28# mount /home
test28# time grep --mmap asdfasdf /home/x1
0.539u 0.265s 0:10.53 7.5% 36+93k 42+0io 19518pf+0w
test28# umount /home
test28# mount /home
test28# time grep --mmap asdfasdf /home/x1
0.617u 0.289s 0:10.47 8.5% 41+105k 42+0io 19518pf+0w
test28#
test28# iostat 1 during the test showed ~60MBytes/sec for all four tests
Perhaps you should post specifics of the test you are running, as well
as specifics of the results you are getting, such as the actual timing
output instead of a human interpretation of the results. For that
matter, being an opteron system, were you running the tests on a UP
system or an SMP system? grep is single-threaded, so on a 2-cpu
system it will show 50% cpu utilization since one cpu will be
saturated and the other idle. With specifics, a FreeBSD person can
try to reproduce your test results.
A grep vs grep --mmap test is pretty straightforward and should be
a good test of the VM read-ahead code, but there might always be some
unknown circumstance specific to a machine configuration that is
the cause of the problem. Repeatability and reproducibility by
third parties is important when diagnosing any problem.
Insofar as MADV_SEQUENTIAL goes... you shouldn't need it on FreeBSD.
Unless someone ripped it out since I committed it many years ago, which
I doubt, FreeBSD's VM heuristic will figure out that the accesses
are sequential and start issuing read-aheads. It should pre-fault, and
it should do read-ahead. That isn't to say that there isn't a bug, just
that everyone interested in the problem has to be able to reproduce it
and help each other track down the source. Just making an assumption
and accusation with regards to the cause of the problem doesn't solve
it.
The VM system is rather fragile when it comes to read-ahead because
the only way to do read-ahead on mapped memory is to issue the
read-ahead and then mark some prior (already cached) page as
inaccessible in order to be able to take a VM fault and issue the
NEXT read-ahead before the program exhausts the current cached data.
It is, in fact, rather complex code, not straightforward as you
might expect.
But I can only caution you, again, on making the assumption that the
operating system should optimize your particular test case intuitively,
like a human would. Operating systems generally optimize the most
common cases, but it would be pretty dumb to actually try to make
them optimize every conceivable case. You would wind up with hundreds
of thousands of lines of barely exercised and likely buggy code.
-Matt
Yes, they both do work fine, but time gives very different stats for each. In
my experiments, the total CPU time is noticeably less with mmap, but the
elapsed time is (much) greater. Here are results from FreeBSD-6.1/amd64 --
notice the large number of page faults, because the system does not try to
preload the file in the mmap case as it does in the read case:
time fgrep meowmeowmeow /home/oh.0.dump
2.167u 7.739s 1:25.21 11.6% 70+3701k 23663+0io 6pf+0w
time fgrep --mmap meowmeowmeow /home/oh.0.dump
1.552u 7.109s 2:46.03 5.2% 18+1031k 156+0io 106327pf+0w
Use a big enough file to bust the memory caching (oh.0.dump above is 2.9Gb);
I'm sure you will have no problems reproducing this result.
> I can't test mzip.c because I don't see the compression
> library you are calling (maybe that's a FreeBSD thing).
The program uses -lz and -lbz2 -- both have been part of FreeBSD since before
the unfortunate fork of DF. The following should work for you:
make -f bsd.prog.mk LDADD="-lz -lbz2" PROG=mzip mzip
Yours,
-mi
106,000 page faults. How many pages is a 2.9GB file? If this is running
in 64-bit mode those would be 8K pages, right? So that would come to
around 380,000 pages. About 1:4. So, clearly the operating system
*IS* pre-faulting multiple pages.
Since I don't believe that a memory fault would be so inefficient as
to account for 80 seconds of run time, it seems more likely to me that
the problem is that the VM system is not issuing read-aheads. Not
issuing read-aheads would easily account for the 80 seconds.
It is possible that the kernel believes the VM system to be too loaded
to issue read-aheads, as a consequence of your blowing out of the system
caches. It is also possible that the read-ahead code is broken in
FreeBSD. To determine which of the two is more likely, you have to
run a smaller data set (like 600MB of data on a system with 1GB of ram),
and use the unmount/mount trick to clear the cache before each grep test.
If the time differential is still huge using the unmount/mount data set
test as described above, then the VM system's read-ahead code is broken.
If the time differential is tiny, however, then it's probably nothing
more than the kernel interpreting your massive 2.9GB mmap as being
too stressful on the VM system and disabling read-aheads for that
reason.
In any case, this sort of test is not really a good poster child for how
to use mmap(). Nobody in their right mind uses mmap() on datasets that
they expect to be uncacheable and which are accessed sequentially. It's
just plain silly to use mmap() in that sort of circumstance. This is
a truism on ANY operating system, not just FreeBSD. The uncached
data set test (using unmount/mount and a dataset which fits into memory)
is a far more realistic test because it simulates the most common case
encountered by a system under load... the accessing of a reasonably sized
data set which happens to not be in the cache.
-Matt
I thought one serious advantage to this situation for sequential read
mmap() is to madvise(MADV_DONTNEED) so that the pages don't have to
wait for the clock hands to reap them. On a large Solaris box I used
to have the non-pleasure of running the VM page scan rate was high, and
I suggested to the app vendor that proper use of mmap might reduce that
overhead. Admittedly the files in question were much smaller than the
available memory, but they were also not likely to be referenced again
before the memory had to be reclaimed forcibly by the VM system.
Is that not the case? Is it better to let the VM system reclaim pages
as needed?
Thanks,
Gary
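In code, the pattern I have in mind looks roughly like this (a hypothetical
sketch; the window size is arbitrary): release each chunk as soon as it has
been processed, so the pages can be reclaimed without waiting for the page
daemon to find them.

    /* Sketch: sequential scan over a mapping, discarding pages behind the
     * scan with MADV_DONTNEED so they can be reclaimed immediately instead
     * of waiting for the clock hands.  WINDOW is an arbitrary choice. */
    #include <sys/mman.h>
    #include <stddef.h>

    #define WINDOW (1024 * 1024)

    void scan_and_release(unsigned char *map, size_t len,
                          void (*process)(const unsigned char *, size_t))
    {
        for (size_t off = 0; off < len; off += WINDOW) {
            size_t n = len - off < WINDOW ? len - off : WINDOW;
            process(map + off, n);
            madvise(map + off, n, MADV_DONTNEED);  /* done with this window */
        }
    }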
Maybe the OS needs "reclaim-behind" for the sequential case?
This way you can mmap many many pages and use a much smaller
pool of physical pages to back them. The idea is for the VM
to reclaim pages N-k..N-1 when page N is accessed and allow
the same process to reuse this page. Similar to read ahead,
where the OS schedules read of page N+k, N+k+1.. when page N
is accessed. Maybe even use TCP algorithms to adjust the
backing buffer (window) size:-)
madvise() should theoretically have that effect, but it isn't quite
so simple a solution.
Let's say you have, oh, your workstation, with 1GB of ram, and you
run a program which runs several passes on a 900MB data set.
Your X session, xterms, gnome, kde, etc etc etc all take around 300MB
of working memory.
Now that data set could fit into memory if portions of your UI were
pushed out of memory. The question is not only how much of that data
set should the kernel fit into memory, but which portions of that data
set should the kernel fit into memory and whether the kernel should
bump out other data (pieces of your UI) to make it fit.
Scenario #1: If the kernel fits the whole 900MB data set into memory,
the entire rest of the system would have to compete for the remaining
100MB of memory. Your UI would suck rocks.
Scenario #2: If the kernel fits 700MB of the data set into memory, and
the rest of the system (your UI, etc) is only using 300MB, and the kernel
is using MADV_DONTNEED on pages it has already scanned, now your UI
works fine but your data set processing program is continuously
accessing the disk for all 900MB of data, on every pass, because the
kernel is always only keeping the most recently accessed 700MB of
the 900MB data set in memory.
Scenario #3: Now let's say the kernel decides to keep just the first
700MB of the data set in memory, and not try to cache the last 200MB
of the data set. Now your UI works fine, and your processing program
runs FOUR TIMES FASTER because it only has to access the disk for
the last 200MB of the 900MB data set.
--
Now, which of these scenarios does madvise() cover? Does it cover
scenario #1? Well, no. the madvise() call that the program makes has
no clue whether you intend to play around with your UI every few minutes,
or whether you intend to leave the room for 40 minutes. If the kernel
guesses wrong, we wind up with one unhappy user.
What about scenario #2? There the program decided to call madvise(),
and the system dutifully reuses the pages, and you come back an hour
later and your data processing program has only done 10 passes out
of the 50 passes it needs to do on the data and you are PISSED.
Ok. What about scenario #3? Oops. The program has no way of knowing
how much memory you need for your UI to be 'happy'. No madvise() call
of any sort will make you happy. Not only that, but the KERNEL has no
way of knowing that your data processing program intends to make
multiple passes on the data set, whether the working set is represented
by one file or several files, and even the data processing program itself
might not know (you might be running a script which runs a separate
program for each pass on the same data set).
So much for madvise().
So, no matter what, there will ALWAYS be an unhappy user somewhere. Let's
take Mikhail's grep test as an example. If he runs it over and over
again, should the kernel be 'optimized' to realize that the same data
set is being scanned sequentially, over and over again, ignore the
localized sequential nature of the data accesses, and just keep a
dedicated portion of that data set in memory to reduce long term
disk access? Should it keep the first 1.5GB, or the last 1.5GB,
or perhaps it should slice the data set up and keep every other 256MB
block? How does it figure out what to cache and when? What if the
program suddenly starts accessing the data in a cacheable way?
Maybe it should randomly throw some of the data away slowly in the hopes
of 'adapting' to the access pattern, which would also require that it
throw away most of the 'recently read' data far more quickly to make
up for the data it isn't throwing away. Believe it or not, that
actually works for certain types of problems, except then you get hung
up in a situation where two subsystems are competing with each other
for memory resources (like mail server versus web server), and the
system is unable to cope as the relative load factors for the competing
subsystems change. The problem becomes really complex really fast.
This sort of problem is easy to consider in human terms, but virtually
impossible to program into a computer with a heuristic or even with
specific madvise() calls. The computer simply does not know what the
human operator expects from one moment to the next.
The problem Mikhail is facing is one where his human assumptions do not
match the assumptions the kernel is making on data retention, assumed
system load, and the many other factors that the kernel uses to decide
what to keep and what to throw away, and when.
--
Now, aside from the potential read-ahead issue, which could be a real
issue for FreeBSD (but not one really worthy of insulting someone over),
there is literally no way for a kernel programmer to engineer the
'perfect' set of optimizations for a system. There are a huge
number of pits you can fall into if you try to over-optimize
a system. Each optimization adds that much more complexity to an already
complex system, and has that much greater a chance to introduce yet
another hard-to-find bug.
Nearly all operating systems that I know of tend to presume a certain
degree of locality of reference for mmap()'d pages. It just so happens
that Mikhail's test has no locality of reference. But 99.9% of the
programs ever run on a BSD system WILL, so which should the kernel
programmer spend all his time coding optimizations for? The 99.9% of
the time or the 0.1% of the time?
-Matt
On an amd64 system running about 6-week old -stable, both behave
pretty much identically. In both cases, systat reports that the disk
is about 96% busy whilst loading the cache. In the cache case, mmap
is significantly faster.
The test data is 2 copies of OOo_2.0.2rc2_src.tar.gz concatenated.
turion% ls -l /6_i386/var/tmp/test
-rw-r--r-- 1 peter wheel 586333684 Mar 24 19:24 /6_i386/var/tmp/test
turion% /usr/bin/time -l grep dfhfhdsfhjdsfl /6_i386/var/tmp/test
21.69 real 0.16 user 0.68 sys
1064 maximum resident set size
82 average shared memory size
95 average unshared data size
138 average unshared stack size
119 page reclaims
0 page faults
0 swaps
4499 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
4497 voluntary context switches
3962 involuntary context switches
[umount/remount /6_i386/var]
turion% /usr/bin/time -l grep --mmap dfhfhdsfhjdsfl /6_i386/var/tmp/test
21.68 real 0.41 user 0.51 sys
1068 maximum resident set size
80 average shared memory size
93 average unshared data size
136 average unshared stack size
17836 page reclaims
18081 page faults
0 swaps
23 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
18105 voluntary context switches
169 involuntary context switches
The speed gain with mmap is clearly evident when the data is cached and
the CPU clock wound right down (99MHz instead of 2200MHz):
turion% /usr/bin/time grep --mmap dfhfhdsfhjdsfl /6_i386/var/tmp/test
12.15 real 7.98 user 2.95 sys
turion% /usr/bin/time grep --mmap dfhfhdsfhjdsfl /6_i386/var/tmp/test
12.28 real 7.92 user 2.94 sys
turion% /usr/bin/time grep --mmap dfhfhdsfhjdsfl /6_i386/var/tmp/test
13.16 real 8.03 user 2.89 sys
turion% /usr/bin/time grep dfhfhdsfhjdsfl /6_i386/var/tmp/test
17.09 real 6.37 user 8.92 sys
turion% /usr/bin/time grep dfhfhdsfhjdsfl /6_i386/var/tmp/test
17.36 real 6.35 user 9.37 sys
turion% /usr/bin/time grep dfhfhdsfhjdsfl /6_i386/var/tmp/test
17.54 real 6.37 user 9.39 sys
--
Peter Jeremy
That pretty much means that the read-ahead algorithm is working.
If it weren't, the disk would not be running at near 100%.
Ok. The next test is to NOT do umount/remount and then use a data set
that is ~2x system memory (but can still be mmap'd by grep). Rerun
the data set multiple times using grep and grep --mmap.
If the times for the mmap case blow up relative to the non-mmap case,
then the vm_page_alloc() calls and/or vm_page_count_severe() (and other
tests) in the vm_fault case are causing the read-ahead to drop out.
If this is the case the problem is not in the read-ahead path, but
probably in the pageout code not maintaining a sufficient number of
free and cache pages. The system would only be allocating ~60MB/s
(or whatever your disk can do), so the pageout thread ought to be able
to keep up.
If the times for the mmap case do not blow up, we are back to square
one and I would start investigating the disk driver that Mikhail is
using.
-Matt
Matthew Dillon
<dil...@backplane.com>
See attachment for the snapshot of `systat 1 -vm' -- it stays like that for
most of the compression run time, with only occasional flushes to the
amrd0 device (the destination for the compressed output).
Bakul Shah followed up:
> May be the OS needs "reclaim-behind" for the sequential case?
> This way you can mmap many many pages and use a much smaller
> pool of physical pages to back them. The idea is for the VM
> to reclaim pages N-k..N-1 when page N is accessed and allow
> the same process to reuse this page.
Although it may be hard for the kernel to guess which pages it can reclaim
efficiently in the general case, my issuing of madvise with MADV_SEQUENTIAL
should've given it a strong hint.
It is for this reason that I very much prefer the mmap API to read/write
(against Matt's repeated advice) -- there is a way to advise the kernel,
which there is not with read. Read also requires fairly large buffers in
the user space to be efficient -- *in addition* to the buffers in the kernel.
Managing such buffers properly makes the program far messier _and_ more
OS-dependent than using the mmap interface has to be.
I totally agree with Matt that FreeBSD's (and probably DragonFly's too) mmap
interface is better than others', but, it seems to me, there is plenty of
room for improvement. Reading via mmap should never be slower than via read
-- it should be just a notch faster, in fact...
I'm also quite certain that fulfilling my "demands" would add quite a bit of
complexity to the mmap support in the kernel, but hey, that's what the kernel
is there for :-)
Unlike grep, which seems to use only 32k buffers anyway (and does not use
madvise -- see attachment), my program mmaps gigabytes of the input file at
once, trusting the kernel to do a better job at reading the data in the most
efficient manner :-)
Peter Jeremy wrote:
> On an amd64 system running about 6-week old -stable, both ['grep' and 'grep
> --mmap' -mi] behave pretty much identically.
Peter, I read grep's source -- it is not using madvise (because it hurts
performance on SunOS-4.1!) and reads in chunks of 32k anyway. Would you care
to look at my program instead? Thanks:
http://aldan.algebra.com/mzip.c
(link with -lz and -lbz2).
Matthew Dillon wrote:
[...]
> If the times for the mmap case do not blow up, we are back to square
> one and I would start investigating the disk driver that Mikhail is
> using.
On the machine, where both mzip and the disk run at only 50%, the disk is a
plain SATA drive (mzip's state goes from "RUN" to "vnread" and back).
Thanks, everyone!
-mi
Yes, that is what I was saying. If mmap read can be made as
efficient as the use of read() for this most common case,
there are benefits. In effect we set up a fifo that rolls
along the mapped address range and the kernel processing and
the user processing are somewhat decoupled.
> Reading via mmap should never be slower than via read
> -- it should be just a notch faster, in fact...
Depends on the cost of mostly redundant processing of N
read() syscalls versus the cost of setting up and tearing
down multiple v2p mappings -- presumably page faults
can be avoided if the kernel fills in pages ahead of when
they are first accessed. The cost of tlbmiss is likely
minor. Probably the breakeven point is just a few read()
calls.
> I'm also quite certain that fulfilling my "demands" would add quite a bit of
> complexity to the mmap support in the kernel, but hey, that's what the kernel
> is there for :-)
An interesting thought experiment is to assume the system has
*no* read and write calls and see how far you can get with
the present mmap scheme and what extensions are needed to get
back the same functionality. Yes, assume mmap & friends even
for serial IO! I am betting that mmap can be simplified.
[Proof by handwaving elided; this screen is too small to fit
my hands :-)]
The results here are weird. With 1GB RAM and a 2GB dataset, the
timings seem to depend on the sequence of operations: reading is
significantly faster, but only when the data was mmap'd previously.
There's one outlier that I can't easily explain.
hw.physmem: 932249600
hw.usermem: 815050752
+ ls -l /6_i386/var/tmp/test
-rw-r--r-- 1 peter wheel 2052167894 Mar 25 05:44 /6_i386/var/tmp/test
+ /usr/bin/time -l grep dfhfhdsfhjdsfl /6_i386/var/tmp/test
+ /usr/bin/time -l grep --mmap dfhfhdsfhjdsfl /6_i386/var/tmp/test
This was done in multi-user on a VTY using a script. X was running
(and I forgot to kill an xclock) but there shouldn't have been anything
else happening.
grep --mmap followed by grep --mmap:
mm 77.94 real 1.65 user 2.08 sys
mm 78.22 real 1.53 user 2.21 sys
mm 78.34 real 1.55 user 2.21 sys
mm 79.33 real 1.48 user 2.37 sys
grep --mmap followed by grep/read
mr 56.64 real 0.77 user 2.45 sys
mr 56.73 real 0.67 user 2.53 sys
mr 56.86 real 0.68 user 2.60 sys
mr 57.64 real 0.64 user 2.63 sys
mr 57.71 real 0.62 user 2.68 sys
mr 58.04 real 0.63 user 2.59 sys
mr 58.83 real 0.78 user 2.50 sys
mr 59.15 real 0.74 user 2.50 sys
grep/read followed by grep --mmap
rm 75.98 real 1.56 user 2.19 sys
rm 76.06 real 1.50 user 2.29 sys
rm 76.50 real 1.40 user 2.38 sys
rm 77.35 real 1.47 user 2.30 sys
rm 77.49 real 1.39 user 2.44 sys
rm 79.14 real 1.56 user 2.19 sys
rm 88.88 real 1.57 user 2.27 sys
grep/read followed by grep/read
rr 78.00 real 0.69 user 2.74 sys
rr 78.34 real 0.67 user 2.74 sys
rr 79.64 real 0.69 user 2.71 sys
rr 79.69 real 0.73 user 2.75 sys
> free and cache pages. The system would only be allocating ~60MB/s
> (or whatever your disk can do), so the pageout thread ought to be able
> to keep up.
This is a laptop so the disk can only manage a bit over 25 MB/sec.
--
Peter Jeremy
I disagree. With a filesystem read, the kernel is solely responsible
for handling physical I/O with an efficient buffer size. The userland
buffers simply amortise the cost of the system call and copyout
overheads.
>I'm also quite certain that fulfilling my "demands" would add quite a bit of
>complexity to the mmap support in the kernel, but hey, that's what the kernel
>is there for :-)
Unfortunately, your patches to implement this seem to have become detached
from your e-mail. :-)
>Unlike grep, which seems to use only 32k buffers anyway (and does not use
>madvise -- see attachment), my program mmaps gigabytes of the input file at
>once, trusting the kernel to do a better job at reading the data in the most
>efficient manner :-)
mmap can lend itself to cleaner implementations because there's no
need to have a nested loop to read buffers and then process them. You
can mmap the entire file and process it. One downside is that on a
32-bit architecture, this limits you to processing files that are
somewhat less than 2GB. Another downside is that touching an uncached
page triggers a trap which may not be as efficient as reading a block
of data through the filesystem interface, and I/O errors are delivered
via signals (which may not be as easy to handle).
>Peter Jeremy wrote:
>> On an amd64 system running about 6-week old -stable, both ['grep' and 'grep
>> --mmap' -mi] behave pretty much identically.
>
>Peter, I read grep's source -- it is not using madvise (because it hurts
>performance on SunOS-4.1!) and reads in chunks of 32k anyway. Would you care
>to look at my program instead? Thanks:
>
> http://aldan.algebra.com/mzip.c
fetch: http://aldan.algebra.com/mzip.c: Not Found
I tried writing a program that just mmap'd my entire (2GB) test file
and summed all the longwords in it. This gave me similar results to
grep. Setting MADV_SEQUENTIAL and/or MADV_WILLNEED made no noticeable
difference. I suspect something about your code or system is disabling
the mmap read-ahead functionality.
What happens if you simulate read-ahead yourself? Have your main
program fork and the child access pages slightly ahead of the parent
but do nothing else.
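Something along these lines (a rough, untested sketch; the page-size constant
is illustrative): the child simply touches one byte per page ahead of the
parent's real scan, so the pages are already resident when the parent reaches
them.

    /* Sketch: poor-man's read-ahead via a forked helper.  The child faults
     * pages in (one byte per page) while the parent does the real work;
     * the child naturally stays ahead because touching a page is cheaper
     * than processing it. */
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <stddef.h>

    #define PAGE_GUESS 4096       /* illustrative; use getpagesize() in real code */

    static void touch_pages(const volatile unsigned char *map, size_t len)
    {
        unsigned char sink = 0;
        for (size_t off = 0; off < len; off += PAGE_GUESS)
            sink ^= map[off];     /* fault the page in; the value is discarded */
        (void)sink;
    }

    void scan_with_prefetch_child(const unsigned char *map, size_t len,
                                  void (*process)(const unsigned char *, size_t))
    {
        pid_t pid = fork();
        if (pid == 0) {           /* child: run ahead, faulting pages in */
            touch_pages(map, len);
            _exit(0);
        }
        process(map, len);        /* parent: the real scan */
        if (pid > 0)
            waitpid(pid, NULL, 0);
    }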
I don't see a disagreement in the above :-) The mmap API can be slightly faster
than read -- the kernel is still "responsible for handling physical I/O with an
efficient buffer size". But instead of copying the data out after reading, it
can read it directly into the process' memory.
= >I'm also quite certain that fulfilling my "demands" would add quite a
= >bit of complexity to the mmap support in the kernel, but hey, that's what
= >the kernel is there for :-)
=
= Unfortunately, your patches to implement this seem to have become detached
= from your e-mail. :-)
If I manage to *convince* someone that there is a problem to solve, I'll
consider it a good contribution to the project...
= mmap can lend itself to cleaner implementations because there's no
= need to have a nested loop to read buffers and then process them. You
= can mmap the entire file and process it. One downside is that on a
= 32-bit architecture, this limits you to processing files that are
= somewhat less than 2GB.
First, only one of our architectures is 32-bit :-) On 64-bit systems, the
addressable memory (kind of) matches the maximum file size. Second, even with
the loop reading/processing chunks at a time, the implementation is cleaner,
because it does not need to allocate any memory nor try to guess which
buffer size to pick for optimal performance, nor align the buffers on pages
(which grep is doing, for example, rather hairily).
= Another downside is that touching an uncached page triggers a trap which may
= not be as efficient as reading a block of data through the filesystem
= interface, and I/O errors are delivered via signals (which may not be as
= easy to handle).
My point exactly. It does seem to be less efficient *at the moment* and I
am trying to have the kernel support for this cleaner method of reading
*improved*. By convincing someone with a clue to do it, that is... :-)
= >Would you care to look at my program instead? Thanks:
= >
= > http://aldan.algebra.com/mzip.c
I'm sorry, that should be http://aldan.algebra.com/~mi/mzip.c -- I checked
this time :-(
= I tried writing a program that just mmap'd my entire (2GB) test file
= and summed all the longwords in it.
The files I'm dealing with are database dumps -- 10-80Gb :-) Maybe that's
what triggers some pessimal case?..
Thanks! Yours,
-mi
Really odd. Note that if your disk can only do 25 MBytes/sec, the
calculation is: 2052167894 / 25MB = ~80 seconds, not ~60 seconds
as you would expect from your numbers.
So that would imply that the 80 second numbers represent read-ahead,
and the 60 second numbers indicate that some of the data was retained
from a prior run (and not blown out by the sequential reading in the
later run).
This type of situation *IS* possible as a side effect of other
heuristics. It is particularly possible when you combine read() with
mmap because read() uses a different heuristic than mmap() to
implement the read-ahead. There is also code in there which depresses
the page priority of 'old' already-read pages in the sequential case.
So, for example, if you do a linear grep of 2GB you might end up with
a cache state that looks like this:
l = low priority page
m = medium priority page
h = high priority page
FILE: [---------------------------mmmmmmmmmmmmm]
Then when you rescan using mmap,
FILE: [lllllllll------------------mmmmmmmmmmmmm]
[------lllllllll------------mmmmmmmmmmmmm]
[---------lllllllll---------mmmmmmmmmmmmm]
[------------lllllllll------mmmmmmmmmmmmm]
[---------------lllllllll---mmmmmmmmmmmmm]
[------------------lllllllllmmmmmmmmmmmmm]
[---------------------llllllHHHmmmmmmmmmm]
[------------------------lllHHHHHHmmmmmmm]
[---------------------------HHHHHHHHHmmmm]
[---------------------------mmmHHHHHHHHHm]
The low priority pages don't bump out the medium priority pages
from the previous scan, so the grep winds up doing read-ahead
until it hits the large swath of pages already cached from the
previous scan, without bumping out those pages.
There is also a heuristic in the system (FreeBSD and DragonFly)
which tries to randomly retain pages. It clearly isn't working :-)
I need to change it to randomly retain swaths of pages, the
idea being that it should take repeated runs to rebalance the VM cache
rather than allowing a single run to blow it out or allowing a
static set of pages to be retained indefinitely, which is what your
tests seem to show is occurring.
-Matt
Matthew Dillon
<dil...@backplane.com>
I think the thing is that there isn't an easy way to speed up the
faulting of the page, and that is why you are getting such trouble
making people believe that there is a problem...
To convince people that there is a problem, you need to run benchmarks,
and make code modifications to show that yes, something can be done to
improve the performance...
The other useful/interesting number would be to compare system time
between the mmap case and the read case to see how much work the
kernel is doing in each case...
--
John-Mark Gurney Voice: +1 415 225 5579
"All that I will do, has been done, All that I have, has not."
systat was reporting 25-26 MB/sec. dd'ing the underlying partition gives
27MB/sec (with 24 and 28 for adjacent partitions).
> This type of situation *IS* possible as a side effect of other
> heuristics. It is particularly possible when you combine read() with
> mmap because read() uses a different heuristic than mmap() to
> implement the read-ahead. There is also code in there which depresses
> the page priority of 'old' already-read pages in the sequential case.
> So, for example, if you do a linear grep of 2GB you might end up with
> a cache state that looks like this:
If I've understood you correctly, this also implies that the timing
depends on the previous two scans, not just the previous scan. I
didn't test all combinations of this but would have expected to see
two distinct sets of mmap/read timings - one for read/mmap/read and
one for mmap/mmap/read.
> I need to change it to randomly retain swaths of pages, the
> idea being that it should take repeated runs to rebalance the VM cache
> rather then allowing a single run to blow it out or allowing a
> static set of pages to be retained indefinitely, which is what your
> tests seem to show is occuring.
I don't think this sort of test is a clear indication that something is
wrong. There's only one active process at any time and it's performing
a sequential read of a large dataset. In this case, evicting already
cached data to read new data is not necessarily productive (a simple-
minded algorithm will be evicting data that is going to be accessed in
the near future).
Based on the timings, the mmap/read case manages to retain ~15% of the file
in cache. Given the amount of RAM available, the theoretical limit is
about 40% so this isn't too bad. It would be nicer if both read and
mmap managed this gain, irrespective of how the data had been previously
accessed.
--
Peter Jeremy
It doesn't look like it's doing anything especially weird. As Matt
pointed out, creating files with mmap() is not a good idea because the
syncer can cause massive fragmentation when allocating space. I can't
test it as-is because it insists on mmap'ing its output and I only
have one disk and you can't mmap /dev/null.
Since your program is already written to mmap the input and output in
pieces, it would be trivial to convert it to use read/write.
>= I tried writing a program that just mmap'd my entire (2GB) test file
>= and summed all the longwords in it.
>
>The files I'm dealing with are database dumps -- 10-80Gb :-) Maybe, that's,
>what triggers some pessimal case?..
I tried generating an 11GB test file and got results consistent with my
previous tests: grep using read or mmap, as well as mmap'ing the entire
file give similar times with the disk mostly saturated.
I suggest you try converting mzip.c to use read/write and see if the
problem is still present.
--
Peter Jeremy
OK. I _can_ see something like this when I try to compress a big file using
either your program or gzip. In my case, both the disk % busy and system idle
vary widely but there's typically 50-60% disk utilisation and 30-40% CPU idle.
However, systat is reporting 23-25MB/sec (whereas dd peaks at ~30MB/sec) so the
time to gzip the datafile isn't that much different to the time to just read it.
My guess is that the read-ahead algorithms are working but aren't doing enough
read-ahead to cope with "read a bit, do some cpu-intensive processing and repeat"
at 25MB/sec so you're winding up with a degree of serialisation where the I/O
and compressing aren't overlapped. I'm not sure how tunable the read-ahead is.
Well, is the MADV_SEQUENTIAL advice, given over the entire mmap-ed region,
taken into account anywhere in the kernel? The kernel could read ahead more
aggressively if it freed the just-accessed pages faster than it does in the
default case...
Matt wrote in the same thread:
= It is particularly possible when you combine read() with
= mmap because read() uses a different heuristic than mmap() to
= implement the read-ahead. There is also code in there which depresses
= the page priority of 'old' already-read pages in the sequential case.
Well, thanks for the theoretical confirmation of what I was trying to prove by
experiments :-) Can this depressing of the "old" pages in the sequential
case, which read's implementation already has, also be implemented in mmap's
case? It may not *always* be what the mmap-ing program wants, but when the
said program uses MADV_SEQUENTIAL, it should not be ignored... (Bakul
understood this point of mine 3 days ago :-)
Peter Jeremy also wrote, in another message:
= I can't test it as-is because it insists on mmap'ing its output and I only
= have one disk and you can't mmap /dev/null.
If you use a well-compressible (redundant) file, such as a web-server log, and
a high enough compression ratio, you can use the same disk for output -- the
writes will be very infrequent.
Thanks! Yours,
-mi
I suspect something like this may be the best approach for your application.
My suggestion would be to split the backup into 3 processes that share
memory. I wrote a program that is designed to buffer data in what looks
like a big FIFO and "dump | myfifo | gzip > file.gz" is significantly
faster than "dump | gzip > file.gz" so I suspect it will help you as well.
Process 1 reads the input file into mmap A.
Process 2 {b,gz}ips's mmap A into mmap B.
Process 3 writes mmap B into the output file.
Process 3 and mmap B may be optional, depending on your target's write
performance.
mmap A could be the real file with process 1 just accessing pages to
force them into RAM.
I'd suggest that each mmap be capable of storing several hundred msec of
data as a minimum (maybe 10MB input and 5MB output, preferably more).
Synchronisation can be done by writing tokens into pipes shared with the
mmap's, optimised by sharing read/write pointers (so you only really need
the tokens when the shared buffer is full/empty).
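A very reduced sketch of that idea (two stages instead of three; CHUNK, SLOTS
and the compress_chunk() callback are placeholders, nothing here comes from my
actual program): a reader and a compressor share an anonymous mapping split
into slots and exchange tokens through pipes.

    /* Sketch: reader and compressor decoupled through a shared anonymous
     * mapping.  Slots are used round-robin; the pipes carry "slot is full"
     * and "slot is free" tokens.  A real version would add error handling
     * and probably more slots. */
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <stddef.h>

    #define CHUNK (8 * 1024 * 1024)
    #define SLOTS 2

    struct token { int slot; ssize_t len; };

    void pipeline(int in_fd, void (*compress_chunk)(const unsigned char *, size_t))
    {
        unsigned char *buf = mmap(NULL, SLOTS * CHUNK, PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_ANON, -1, 0);
        int full[2], freed[2];
        pipe(full);                      /* reader -> compressor: slot is full */
        pipe(freed);                     /* compressor -> reader: slot is free */

        if (fork() == 0) {               /* compressor process */
            struct token t;
            while (read(full[0], &t, sizeof t) == sizeof t && t.len > 0) {
                compress_chunk(buf + t.slot * CHUNK, (size_t)t.len);
                write(freed[1], &t.slot, sizeof t.slot);
            }
            _exit(0);
        }

        int busy = 0, slot = 0, done;
        for (;;) {                       /* reader (parent) */
            if (busy == SLOTS) {         /* every slot in flight: wait for one */
                read(freed[0], &done, sizeof done);
                busy--;
            }
            struct token t;
            t.slot = slot;
            t.len = read(in_fd, buf + slot * CHUNK, CHUNK);
            write(full[1], &t, sizeof t);
            if (t.len <= 0)              /* the EOF/error token also stops the child */
                break;
            busy++;
            slot = (slot + 1) % SLOTS;
        }
        wait(NULL);
        munmap(buf, SLOTS * CHUNK);
    }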
Thank you very much, Peter, for your suggestions. Unfortunately, I have no
control whatsoever over the dump-ing part of the process. The dump is done by
Sybase database servers -- old, clunky, and closed-source software, running
on slow-CPU (but good-I/O) Sun hardware.
You are right, of course, that my application (mzip being only part of it)
needs to keep the dumper and the compressor in sync. Without any cooperation
from the former, however, I see no other way but to temporarily throttle the
NFS bandwidth via the firewall when the compressor falls behind (as can be
detected by the increased proportion of sys-time, I guess).
Much as I appreciate the (past and future) help and suggestions, I'm not asking
you, nor the mailing list, to solve my particular problem here :-) I only gave
the details of my need and application to illustrate a missed general
optimization opportunity in FreeBSD -- reading large files via mmap need not
be slower than via read. If anything, it should be (slightly) faster.
After many days Matt has finally stated (admitted? ;-):
read() uses a different heuristic than mmap() to implement the
read-ahead. There is also code in there which depresses the page
priority of 'old' already-read pages in the sequential case.
There is no reason not to implement similar smarts in the mmap-handling code
to similarly depress the priority of the in-memory pages in the
MADV_SEQUENTIAL case, thus freeing more RAM for aggressive read-ahead.
As I admitted before, actually implementing this far exceeds my own
capabilities, so all I can do is pester whoever cares to do it instead :-)
C'mon, guys...
-mi
After adding begin- and end-offset options to md5(1) -- implemented
using mmap (see bin/142814) -- I, once again, am upset over the slowness
of pagefaulting-in compared to the reading-in.
(To reproduce my results, patch your /usr/src/sbin/md5/ with
http://aldan.algebra.com/~mi/tmp/md5-offsets.patch
Then use plain ``md5 LARGE_FILE'' to use read and ``md5 -b 0
LARGE_FILE'' to use the mmap-method.)
The times for processing an 8Gb file residing on a reasonable IDE drive
(on a recent FreeBSD-7.2-STABLE/i386) are thus:
mmap: 43.400u 9.439s 2:35.19 34.0% 16+184k 0+0io 106994pf+0w
read: 41.358u 23.799s 2:12.04 49.3% 16+177k 67677+0io 0pf+0w
Observe that even though read-ing is quite taxing on the kernel (high
sys-time), the mmap-ing loses overall -- at least, on an otherwise idle
system -- because read gets the full throughput of the drive (systat -vm
shows 100% disk utilization), while pagefaulting gets only about 69%.
When I last brought this up in 2006, it was "revealed", that read(2)
uses heuristics to perform a read-ahead. Why can't the pagefaulting-in
implementation use the same or similar "trickery" was never explained...
Now, without a clue on how these things are implemented, I'll concede
that it may /sometimes/ be difficult for the VM to predict where
the next pagefault will strike, but in the cases when the process:
	a) mmaps up to 1Gb at a time;
	b) issues an madvise MADV_SEQUENTIAL over the entire mmap-ed
	   region
mmap-ing ought to offer the same -- or better -- performance than read.
For example, a hit on a page inside a region marked as SEQUENTIAL ought
to bring in the next page or two. The VM has all the information and the
hints, it just does not use them... A shame, is it not?
-mi
P.S. If it is any consolation, on Linux things seem to be even worse.
Processing a 9Gb file on kernel 2.6.18/i386:
mmap: 26.222u 6.336s 6:01.75 8.9% 0+0k 0+0io 61032pf+0w
read: 25.991u 7.686s 3:43.70 15.0% 0+0k 0+0io 23pf+0w
although the absolute times can't be compared with ours due to hardware
differences, mmap being nearly twice as slow is a shame...
I suspect it would be harder for me to set up ZFS than for you to apply
my patch to md5.c :-)
-mi
Well, the VM system does do read-ahead, but clearly the pipelining
is not working properly because if it were then either the cpu or
the disk would be pegged, and neither is.
It's broken in DFly too. Both FreeBSD and DragonFly use
vnode_pager_generic_getpages() (UFS's ffs_getpages() just calls
the generic) which means (typically) the whole thing devolves into
a UIO_NOCOPY VOP_READ(). The VOP_READ should be doing read-ahead
based on the sequential access heuristic but I already see issues
in both implementations of vnode_pager_generic_getpages() where it
finds a valid page from an earlier read-ahead and stops (failing to
issue any new read-aheads because it fails to issue a new UIO_NOCOPY
VOP_READ... doh!).
This would explain why the performance is not as bad as linux but
is not as good as a properly pipelined case. I'll play with it
some in DFly and I'm sure the FreeBSD folks can fix it in FreeBSD.
-Matt
I suspect it would be noticeably worse :) AFAIK ZFS integration with mmap
does at least one extra in-memory data copy.
* Solaris 8, native, 32-bit binary (using -lcrypto instead of -lmd):
mmap: 103.54u 27.18s 2:56.46 74.0%
read: 99.12u 40.37s 2:53.06 80.6%
* Solaris 10, native, 32-bit binary (using -lcrypto instead of -lmd):
mmap: 159.36u 83.23s 5:28.25 73.9%
read: 173.50u 104.16s 4:48.30 96.3%
* Solaris 10, using the 32-bit binary built on Solaris-8:
mmap: 217.74u 101.20s 5:58.89 88.8%
All of the "read" results on Solaris (and earlier on Linux) were
obtained from using ``openssl md5 < largefile''.
Seems like BSD is not the only OS where mmap's theoretical promise
lags behind the actual offering -- read wins on Solaris overall too,
despite being quite a bit more expensive in sys-time. Would still be
nice to be the first to deliver...
-mi