
large page patch (fwd) (fwd)


David Mosberger

Aug 2, 2002, 11:19:59 PM
>>>>> On Fri, 2 Aug 2002 12:34:08 -0700 (PDT), Linus Torvalds <torv...@transmeta.com> said:

Linus> We may well expand the FS layer to bigger pages, but "bigger"
Linus> is almost certainly not going to include things like 256MB
Linus> pages - if for no other reason than the fact that memory
Linus> fragmentation really means that the limit on page sizes in
Linus> practice is somewhere around 128kB for any reasonable usage
Linus> patterns even with gigabytes of RAM.

Linus> And _maybe_ we might get to the single-digit megabytes. I
Linus> doubt it, simply because even with a good buddy allocator and
Linus> a memory manager that actively frees pages to get large
Linus> contiguous chunks of RAM, it's basically impossible to have
Linus> something that can reliably give you that big chunks without
Linus> making normal performance go totally down the toilet.

The Rice people avoided some of the fragmentation problems by
pro-actively allocating a max-order physical page, even when only a
(small) virtual page was being mapped. This should work very well as
long as the total memory usage (including memory lost due to internal
fragmentation of max-order physical pages) doesn't exceed available
memory. That's not a condition which will hold for every system in
the world, but I suspect it is true for lots of systems for large
periods of time. And since superpages quickly become
counter-productive in tight-memory situations anyhow, this seems like
a very reasonable approach.

--david
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

David Mosberger

Aug 3, 2002, 12:17:45 AM
>>>>> On Fri, 2 Aug 2002 20:32:10 -0700 (PDT), Linus Torvalds <torv...@transmeta.com> said:

>> And since superpages quickly become counter-productive in
>> tight-memory situations anyhow, this seems like a very reasonable
>> approach.

Linus> Ehh.. The only people who are _really_ asking for the
Linus> superpages want almost nothing _but_ superpages. They are
Linus> willing to use 80% of all memory for just superpages.

Linus> Yes, it's Oracle etc, and the whole point for these users is
Linus> to avoid having any OS memory allocation for these areas.

My terminology is perhaps a bit too subtle: I use "superpage"
exclusively for the case where multiple pages get coalesced into a
larger page. The "large page" ("huge page") case that you were
talking about is different, since pages never get demoted or promoted.

I wasn't disagreeing with your case for separate large page syscalls.
Those syscalls certainly simplify implementation and, as you point
out, it well may be the case that a transparent superpage scheme never
will be able to replace the former.

David Mosberger

Aug 3, 2002, 12:39:36 AM
>>>>> On Fri, 2 Aug 2002 21:26:52 -0700 (PDT), Linus Torvalds <torv...@transmeta.com> said:

>> I wasn't disagreeing with your case for separate large page
>> syscalls. Those syscalls certainly simplify implementation and,
>> as you point out, it well may be the case that a transparent
>> superpage scheme never will be able to replace the former.

Linus> Somebody already had patches for the transparent superpage
Linus> thing for alpha, which supports it. I remember seeing numbers
Linus> implying that helped noticeably.

Yes, I saw those. I still like the Rice work a _lot_ better. It's
just a thing of beauty, from a design point of view (disclaimer: I
haven't seen the implementation, so there may be ugly things
lurking...).

Linus> But yes, that definitely doesn't work for humongous pages (or
Linus> whatever we should call the multi-megabyte-special-case-thing
Linus> ;).

Yes, you're probably right. 2MB was reported to be fine in the Rice
experiments, but I doubt 256MB (and much less 4GB, as supported by
some CPUs) would fly.

Linus Torvalds

Aug 3, 2002, 12:26:52 AM

On Fri, 2 Aug 2002, David Mosberger wrote:
>
> My terminology is perhaps a bit too subtle: I use "superpage"
> exclusively for the case where multiple pages get coalesced into a
> larger page. The "large page" ("huge page") case that you were
> talking about is different, since pages never get demoted or promoted.

Ahh, ok.

> I wasn't disagreeing with your case for separate large page syscalls.
> Those syscalls certainly simplify implementation and, as you point
> out, it well may be the case that a transparent superpage scheme never
> will be able to replace the former.

Somebody already had patches for the transparent superpage thing for
alpha, which supports it. I remember seeing numbers implying that helped
noticeably.

But yes, that definitely doesn't work for humongous pages (or whatever we
should call the multi-megabyte-special-case-thing ;).

Linus

David S. Miller

Aug 3, 2002, 1:20:24 AM
From: David Mosberger <dav...@napali.hpl.hp.com>
Date: Fri, 2 Aug 2002 21:39:36 -0700

>>>>> On Fri, 2 Aug 2002 21:26:52 -0700 (PDT), Linus Torvalds <torv...@transmeta.com> said:

>> I wasn't disagreeing with your case for separate large page
>> syscalls. Those syscalls certainly simplify implementation and,
>> as you point out, it well may be the case that a transparent
>> superpage scheme never will be able to replace the former.

Linus> Somebody already had patches for the transparent superpage
Linus> thing for alpha, which supports it. I remember seeing numbers
Linus> implying that helped noticeably.

Yes, I saw those. I still like the Rice work a _lot_ better.

Now here's the thing. To me, we should be adding these superpage
syscalls to things like the implementation of malloc() :-) If you
allocate enough anonymous pages together, you should get a superpage
in the TLB if that is easy to do. Once any hint of memory pressure
occurs, you just break up the large page clusters as you hit such
ptes. This is what one of the Linux large-page implementations did
and I personally find it the most elegant way to handle the so called
"paging complexity" of transparent superpages.

At that point it's like "why the system call". If it would rather be
more of a large-page reservation system than an "optimization hint"
then these syscalls would sit better with me. Currently I think they
are superfluous. To me the hint to use large-pages is a given :-)

Stated another way, if these syscalls said "gimme large pages for this
area and lock them into memory", this would be fine. If the syscalls
say "use large pages if you can", that's crap. And in fact we could
use mmap() attribute flags if we really thought that stating this was
necessary.
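
(For reference, the "gimme large pages for this area and lock them into
memory" flavor of request described here is essentially what Linux later
exposed through mmap() flags. A minimal user-space sketch, assuming a
kernel and libc that provide MAP_HUGETLB and MAP_LOCKED and an
administrator-reserved hugepage pool; these flags postdate this thread.)

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 2UL << 20;         /* one 2MB huge page, for example */

        /* "Give me large pages for this area and lock them into memory":
         * fail outright if the reserved pool cannot satisfy the request,
         * rather than silently falling back to small pages. */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_LOCKED,
                       -1, 0);
        if (p == MAP_FAILED) {
                perror("mmap(MAP_HUGETLB | MAP_LOCKED)");
                return 1;
        }

        memset(p, 0, len);              /* touch the mapping */
        munmap(p, len);
        return 0;
}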

Linus Torvalds

Aug 3, 2002, 1:35:00 PM

On Fri, 2 Aug 2002, David S. Miller wrote:
>
> Now here's the thing. To me, we should be adding these superpage
> syscalls to things like the implementation of malloc() :-) If you
> allocate enough anonymous pages together, you should get a superpage
> in the TLB if that is easy to do.

For architectures that have these "small" superpages, we can just do it
transparently. That's what the alpha patches did.

The problem space is roughly the same as just page coloring.

> At that point it's like "why the system call". If it would rather be
> more of a large-page reservation system than a "optimization hint"
> then these syscalls would sit better with me. Currently I think they
> are superfluous. To me the hint to use large-pages is a given :-)

Yup.

David, you did page coloring once.

I bet your patches worked reasonably well to color into 4 or 8 colors.

How well do you think something like your old patches would work if

- you _require_ 1024 colors in order to get the TLB speedup on some
hypothetical machine (the same hypothetical machine that might
hypothetically run on 95% of all hardware ;)

- the machine is under heavy load, and heavy load is exactly when you
want this optimization to trigger.

Can you explain this difficulty to people?

> Stated another way, if these syscalls said "gimme large pages for this
> area and lock them into memory", this would be fine. If the syscalls
> say "use large pages if you can", that's crap. And in fact we could
> use mmap() attribute flags if we really thought that stating this was
> necessary.

I agree 100%.

I think we can at some point do the small cases completely transparently,
with no need for a new system call, and not even any new hint flags. We'll
just silently do 4/8-page superpages and be done with it. Programs don't
need to know about it to take advantage of better TLB usage.

Linus

David Mosberger

Aug 3, 2002, 3:30:05 PM
>>>>> On Sat, 3 Aug 2002 10:35:00 -0700 (PDT), Linus Torvalds <torv...@transmeta.com> said:

Linus> How well do you think something like your old patches would
Linus> work if

Linus> - you _require_ 1024 colors in order to get the TLB speedup
Linus> on some hypothetical machine (the same hypothetical machine
Linus> that might hypothetically run on 95% of all hardware ;)

Linus> - the machine is under heavy load, and heavy load is exactly
Linus> when you want this optimization to trigger.

Your point about wanting databases to have access to giant pages even
under memory pressure is a good one. I had not considered that
before. However, what we really are talking about then is a security
or resource policy as to who gets to allocate from a reserved and
pinned pool of giant physical pages. You don't need separate system
calls for that: with a transparent superpage framework and a
privileged & reserved giant-page pool, it's trivial to set up things
such that your favorite data base will always be able to get the giant
pages (and hence the giant TLB mappings) it wants. The only thing you
lose in the transparent case is control over _which_ pages need to use
the pinned giant pages. I can certainly imagine cases where this
would be an issue, but I kind of doubt it would be an issue for
databases.

As Dave Miller justly pointed out, it's stupid for a task not to ask
for giant pages for anonymous memory. The only reason this is not a
smart thing overall is that globally it's not optimal (it is optimal
only locally, from the task's point of view). So if the only barrier
to getting the giant pinned pages is needing to know about the new
system calls, I'll predict that very soon we'll have EVERY task in the
system allocating such pages (and LD_PRELOAD tricks make that pretty
much trivial). Then we're back to square one, because the favorite
database may not even be able to start up, because all the "reserved"
memory is already used up by the other tasks.

Clearly there need to be some additional policies in effect, no
matter what the implementation is (the normal VM policies don't work,
because, by definition, the pinned giant pages are not pageable).

In my opinion, the primary benefit of the separate syscalls is still
ease-of-implementation (which isn't unimportant, of course).

--david

David Mosberger

Aug 3, 2002, 3:41:33 PM
>>>>> On Sat, 3 Aug 2002 14:41:29 -0400, Hubertus Franke <fra...@watson.ibm.com> said:

Hubertus> But I'd like to point out that superpages are there to
Hubertus> reduce the number of TLB misses by providing larger
Hubertus> coverage. Simply providing page coloring will not get you
Hubertus> there.

Yes, I agree.

It appears that Juan Navarro, the primary author behind the Rice
project, is working on breaking down the superpage benefits they
observed. That would tell us how much benefit is due to page-coloring
and how much is due to TLB effects. Here in our lab, we do have some
(weak) empirical evidence that some of the SPECint benchmarks benefit
primarily from page-coloring, but clearly there are others that are
TLB limited.

--david

Linus Torvalds

Aug 3, 2002, 3:43:47 PM

On Sat, 3 Aug 2002, David Mosberger wrote:
>
> Your point about wanting databases to have access to giant pages even
> under memory pressure is a good one. I had not considered that
> before. However, what we really are talking about then is a security
> or resource policy as to who gets to allocate from a reserved and
> pinned pool of giant physical pages.

Absolutely. We can't allow just anybody to allocate giant pages, since
they are a scarce resource (set up at boot time in both Ingo's and Intels
patches - with the potential to move things around later with additional
interfaces).

> You don't need separate system
> calls for that: with a transparent superpage framework and a
> privileged & reserved giant-page pool, it's trivial to set up things
> such that your favorite data base will always be able to get the giant
> pages (and hence the giant TLB mappings) it wants. The only thing you
> lose in the transparent case is control over _which_ pages need to use
> the pinned giant pages. I can certainly imagine cases where this
> would be an issue, but I kind of doubt it would be an issue for
> databases.

That's _probably_ true. There aren't that many allocations that ask for
megabytes of consecutive memory that wouldn't want to do it. However,
there might certainly be non-critical maintenance programs (with the same
privileges as the database program proper) that _do_ do large allocations,
and that we don't want to give large pages to.

Guessing is always bad, especially since the application certainly does
know what it wants.

Linus

Linus Torvalds

Aug 3, 2002, 3:39:40 PM

On Sat, 3 Aug 2002, Hubertus Franke wrote:
>
> But I'd like to point out that superpages are there to reduce the number of
> TLB misses by providing larger coverage. Simply providing page coloring
> will not get you there.

Superpages can from a memory allocation angle be seen as a very strict
form of page coloring - the problems are fairly closely related, I think
(superpages are just a lot stricter, in that it's not enough to get "any
page of color X", you have to get just the _right_ page).

Doing superpages will automatically do coloring (while the reverse is
obviously not true). And the way David did coloring a long time ago (if
I remember his implementation correctly) was the same way you'd do
superpages: just do higher order allocations.

David Mosberger

Aug 3, 2002, 5:18:15 PM
>>>>> On Sat, 3 Aug 2002 12:43:47 -0700 (PDT), Linus Torvalds <torv...@transmeta.com> said:

>> You don't need separate system calls for that: with a transparent
>> superpage framework and a privileged & reserved giant-page pool,
>> it's trivial to set up things such that your favorite data base
>> will always be able to get the giant pages (and hence the giant
>> TLB mappings) it wants. The only thing you lose in the
>> transparent case is control over _which_ pages need to use the
>> pinned giant pages. I can certainly imagine cases where this
>> would be an issue, but I kind of doubt it would be an issue for
>> databases.

Linus> That's _probably_ true. There aren't that many allocations
Linus> that ask for megabytes of consecutive memory that wouldn't
Linus> want to do it. However, there might certainly be non-critical
Linus> maintenance programs (with the same privileges as the
Linus> database program proper) that _do_ do large allocations, and
Linus> that we don't want to give large pages to.

Linus> Guessing is always bad, especially since the application
Linus> certainly does know what it wants.

Yes, but that applies even to a transparent superpage scheme: in those
instances where an application knows what page size is optimal, it's
better if the application can express that (saves time
promoting/demoting pages needlessly). It's not unlike madvise() or
the readahead() syscall: use reasonable policies for the ordinary
apps, and provide the means to let the smart apps tell the kernel
exactly what they need.

--david
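
(The madvise()-style hint described above is, in hindsight, roughly what
MADV_HUGEPAGE later became for transparent hugepages. A minimal sketch,
assuming a kernel with transparent hugepage support; the hint is purely
advisory and the kernel may ignore it or demote the pages later.)

#include <stddef.h>
#include <sys/mman.h>

/* Allocate anonymous memory and hint, madvise()-style, that the region
 * would benefit from the largest page size available.  The kernel stays
 * in charge of whether and when superpages are actually used. */
static void *alloc_with_hugepage_hint(size_t len)
{
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
                return NULL;
#ifdef MADV_HUGEPAGE
        madvise(p, len, MADV_HUGEPAGE); /* advisory only; errors ignorable */
#endif
        return p;
}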

David Mosberger

Aug 3, 2002, 5:26:27 PM
>>>>> On Sat, 3 Aug 2002 16:53:39 -0400, Hubertus Franke <fra...@watson.ibm.com> said:

Hubertus> Cool. Does that mean that BSD already has page coloring
Hubertus> implemented ?

FreeBSD (at least on Alpha) makes some attempts at page-coloring, but
it's said to be far from perfect.

Hubertus> The agony is: Page Coloring helps to reduce cache
Hubertus> conflicts in low associative caches while large pages may
Hubertus> reduce TLB overhead.

Why agony? The latter helps the TLB _and_ solves the page coloring
problem (assuming the largest page size is bigger than the largest
cache; yeah, I see that could be a problem on some Power 4
machines... ;-)

Hubertus> One shouldn't rule out one for the other, there is a place
Hubertus> for both.

Hubertus> How did you arrive at the (weak) empirical evidence? You
Hubertus> checked TLB misses and cache misses and turned page
Hubertus> coloring on and off and large pages on and off?

Yes, that's basically what we did (there is a patch implementing a
page coloring kernel module floating around).

David S. Miller

Aug 3, 2002, 8:32:04 PM
From: Linus Torvalds <torv...@transmeta.com>
Date: Sat, 3 Aug 2002 12:39:40 -0700 (PDT)


> And the way David did coloring a long time ago (if
> I remember his implementation correctly) was the same way you'd do
> superpages: just do higher order allocations.

Although it wasn't my implementation which did this,
one of them did do it this way. I agree that it is
the nicest way to do coloring.

David S. Miller

Aug 3, 2002, 8:34:02 PM
From: Hubertus Franke <fra...@watson.ibm.com>
Date: Sat, 3 Aug 2002 16:53:39 -0400

> Does that mean that BSD already has page coloring implemented?

FreeBSD has had page coloring for quite some time.

Because they don't use buddy lists and don't allow higher-order
allocations fundamentally in the page allocator, they don't have
to deal with all the buddy fragmentation issues we do.

On the other hand, since higher-order page allocations are not
a fundamental operation it might be more difficult for FreeBSD
to implement superpage support efficiently like we can with
the buddy lists.

David S. Miller

Aug 3, 2002, 8:31:11 PM
From: David Mosberger <dav...@napali.hpl.hp.com>
Date: Sat, 3 Aug 2002 12:41:33 -0700

> It appears that Juan Navarro, the primary author behind the Rice
> project, is working on breaking down the superpage benefits they
> observed. That would tell us how much benefit is due to page-coloring
> and how much is due to TLB effects. Here in our lab, we do have some
> (weak) empirical evidence that some of the SPECint benchmarks benefit
> primarily from page-coloring, but clearly there are others that are
> TLB limited.

There was some comparison done between large-page vs. plain
page coloring for a bunch of scientific number crunchers.

Only one benefitted from page coloring and not from TLB
superpage use.

The ones that benefitted from both coloring and superpages, the
superpage gain was about equal to the coloring gain. Basically,
superpages ended up giving the necessary coloring :-)

Search for the topic "Areas for superpage discussion" in the
sparc...@vger.kernel.org list archives, it has pointers to
all the patches and test programs involved.

David S. Miller

Aug 3, 2002, 8:28:36 PM
From: Linus Torvalds <torv...@transmeta.com>
Date: Sat, 3 Aug 2002 10:35:00 -0700 (PDT)

> David, you did page coloring once.
>
> I bet your patches worked reasonably well to color into 4 or 8 colors.
>
> How well do you think something like your old patches would work if
>
>  - you _require_ 1024 colors in order to get the TLB speedup on some
>    hypothetical machine (the same hypothetical machine that might
>    hypothetically run on 95% of all hardware ;)
>
>  - the machine is under heavy load, and heavy load is exactly when you
>    want this optimization to trigger.
>
> Can you explain this difficulty to people?

Actually, we need some clarification here. I tried coloring several
times; the problem with my diffs is that I tried to do the coloring
all the time no matter what.

I wanted strict coloring on the 2-color level for broken L1 caches
that have aliasing problems. If I could make this work, all of the
dumb cache flushing I have to do on Sparcs could be deleted. Because
of this, I couldn't legitimately change the cache flushing rules
unless I had absolutely strict coloring done on all pages where it
mattered (basically anything that could end up in the user's address
space).

So I kept track of color existence precisely in the page lists. The
implementation was fast, but things got really bad fragmentation wise.

No matter how I tweaked things, just running a kernel build 40 or 50
times would fragment the free page lists to shreds such that 2-order
and up pages simply did not exist.

Another person did an implementation of coloring which basically
worked by allocating a big-order chunk and slicing that up. It's not
strictly done and that is why his version works better. In fact I
like that patch a lot and it worked quite well for L2 coloring on
sparc64. Any time there is page pressure, he tosses away all of the
color carving big-order pages.

> I think we can at some point do the small cases completely transparently,
> with no need for a new system call, and not even any new hint flags. We'll
> just silently do 4/8-page superpages and be done with it. Programs don't
> need to know about it to take advantage of better TLB usage.

Ok. I think even 64-page ones are viable to attempt but we'll see.
Most TLB's that do superpages seem to have a range from the base
page size to the largest supported superpage, with a couple of powers
of two between each supported size.

For example on Sparc64 this is:

8K PAGE_SIZE
64K PAGE_SIZE * 8
512K PAGE_SIZE * 64
4M PAGE_SIZE * 512

One of the transparent large page implementations just defined a
small array that the core code used to try and see "hey how big
a superpage can we try" and if the largest for the area failed
(because page orders that large weren't available) it would simply
fall back to the next smallest superpage size.
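
(A rough sketch of that fall-back loop; PAGE_SHIFT, page_alloc_order()
and the function name are hypothetical stand-ins, not code from any of
the actual patches.)

#define PAGE_SHIFT      13                      /* 8K base pages, as on sparc64 */
#define PAGE_SIZE       (1UL << PAGE_SHIFT)

struct page;
extern struct page *page_alloc_order(int order); /* hypothetical allocator hook */

/* Superpage orders to try, largest first: 4M, 512K, 64K, 8K on sparc64. */
static const int sp_orders[] = { 9, 6, 3, 0 };

/* Back a fault with the largest superpage the allocator can give us
 * right now, falling back one supported size at a time. */
struct page *alloc_superpage(unsigned long vaddr, int *order_out)
{
        unsigned int i;

        for (i = 0; i < sizeof(sp_orders) / sizeof(sp_orders[0]); i++) {
                int order = sp_orders[i];
                struct page *page;

                /* The virtual address must be aligned to the superpage
                 * size, or the TLB entry could never be used. */
                if (vaddr & ((PAGE_SIZE << order) - 1))
                        continue;

                page = page_alloc_order(order);
                if (page) {
                        *order_out = order;
                        return page;
                }
                /* Too fragmented for this order: try the next size down. */
        }
        *order_out = 0;
        return NULL;    /* even an order-0 page was unavailable */
}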

David S. Miller

Aug 3, 2002, 8:35:30 PM
From: Hubertus Franke <fra...@watson.ibm.com>
Date: Sat, 3 Aug 2002 17:54:30 -0400

> The Rice paper solved this reasonably elegantly: reservation, and a check
> after a while. If you didn't use the reserved memory, you lose it; this is
> the auto promotion/demotion.

I keep seeing this Rice stuff being mentioned over and over,
can someone post a URL pointer to this work?

David Mosberger

Aug 3, 2002, 10:25:30 PM
>>>>> On Sat, 03 Aug 2002 17:35:30 -0700 (PDT), "David S. Miller" <da...@redhat.com> said:

DaveM> From: Hubertus Franke <fra...@watson.ibm.com> Date: Sat,
DaveM> 3 Aug 2002 17:54:30 -0400

DaveM> The Rice paper solved this reasonably elegantly: reservation
DaveM> and a check after a while. If you didn't use the reserved memory,
DaveM> you lose it; this is the auto promotion/demotion.

DaveM> I keep seeing this Rice stuff being mentioned over and over,
DaveM> can someone post a URL pointer to this work?

Sure thing. It's the first link under "Publications" at this URL:

http://www.cs.rice.edu/~jnavarro/

--david

Andrew Morton

Aug 4, 2002, 3:23:16 PM
Linus Torvalds wrote:
>
> On Sun, 4 Aug 2002, Hubertus Franke wrote:
> >
> > As of the page coloring !
> > Can we tweak the buddy allocator to give us this additional functionality?
>
> I would really prefer to avoid this, and get "95% coloring" by just doing
> read-ahead with higher-order allocations instead of the current "loop
> allocation of one block".
>
> I bet that you will get _practically_ perfect coloring with just two small
> changes:
>
> - do_anonymous_page() looks to see if the page tables are empty around
> the faulting address (and check vma ranges too, of course), and
> optimistically does a non-blocking order-X allocation.
>
> If the order-X allocation fails, we're likely low on memory (this is
> _especially_ true since the very fact that we do lots of order-X
> allocations will probably actually help keep fragmentation down
> normally), and we just allocate one page (with a regular GFP_USER this
> time).
>
> Map in all pages.

This would be a problem for short-lived processes. Because "map in
all pages" also means "zero them out". And I think that performing
a 4k clear_user_highpage() immediately before returning to userspace
is optimal. It's essentially a cache preload for userspace.

If we instead clear out 4 or 8 pages, we trash a ton of cache and
the chances of userspace _using_ pages 1-7 in the short-term are
lower. We could clear the pages with 7,6,5,4,3,2,1,0 ordering,
but the cache implications of faultahead are still there.

Could we establish the eight pte's but still arrange for pages 1-7
to trap, so the kernel can zero them out at the latest possible time?


> - do the same for page_cache_readahead() (this, btw, is where radix trees
> will kick some serious ass - we'd have had a hard time doing the "is
> this range of order-X pages populated" efficiently with the old hashes.
>

On the nopage path, yes. That memory is cache-cold anyway.

Linus Torvalds

Aug 4, 2002, 2:38:12 PM

On Sun, 4 Aug 2002, Hubertus Franke wrote:
>
> As of the page coloring !
> Can we tweak the buddy allocator to give us this additional functionality?

I would really prefer to avoid this, and get "95% coloring" by just doing
read-ahead with higher-order allocations instead of the current "loop
allocation of one block".

I bet that you will get _practically_ perfect coloring with just two small
changes:

- do_anonymous_page() looks to see if the page tables are empty around
the faulting address (and check vma ranges too, of course), and
optimistically does a non-blocking order-X allocation.

If the order-X allocation fails, we're likely low on memory (this is
_especially_ true since the very fact that we do lots of order-X
allocations will probably actually help keep fragementation down
normally), and we just allocate one page (with a regular GFP_USER this
time).

Map in all pages.

 - do the same for page_cache_readahead() (this, btw, is where radix trees
   will kick some serious ass - we'd have had a hard time doing the "is
   this range of order-X pages populated" efficiently with the old hashes.

I bet just those fairly small changes will give you effective coloring,
_and_ they are also what you want for doing small superpages.

And no, I do not want separate coloring support in the allocator. I think
coloring without superpage support is stupid and worthless (and
complicates the code for no good reason).

Linus
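
(A condensed sketch of the opportunistic order-X fault described above;
pte_range_empty(), vma_covers_range() and map_pages() are hypothetical
stand-ins rather than functions from any actual patch, and the GFP
arithmetic only means "don't block".)

#define FAULT_ORDER 3   /* try 8-page (order-3) chunks first */

static int do_anonymous_fault(struct vm_area_struct *vma, unsigned long addr)
{
        struct page *page = NULL;
        int order = 0;

        /* Only try the big allocation when the neighbouring page-table
         * slots are empty and the vma covers the whole aligned range. */
        if (pte_range_empty(vma, addr, FAULT_ORDER) &&
            vma_covers_range(vma, addr, FAULT_ORDER)) {
                /* Non-blocking on purpose: if this fails we are probably
                 * low on memory, and reclaiming just for fault-ahead would
                 * defeat the point. */
                page = alloc_pages(GFP_USER & ~__GFP_WAIT, FAULT_ORDER);
                if (page)
                        order = FAULT_ORDER;
        }

        if (!page)
                page = alloc_pages(GFP_USER, 0);   /* ordinary single page */
        if (!page)
                return -ENOMEM;

        /* Zero and map all 2^order pages at once. */
        return map_pages(vma, addr & ~((PAGE_SIZE << order) - 1), page, order);
}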

Andi Kleen

Aug 4, 2002, 4:20:16 PM
Andrew Morton <ak...@zip.com.au> writes:

> If we instead clear out 4 or 8 pages, we trash a ton of cache and
> the chances of userspace _using_ pages 1-7 in the short-term are
> lower. We could clear the pages with 7,6,5,4,3,2,1,0 ordering,
> but the cache implications of faultahead are still there.

What you could do on modern x86 and probably most other architectures as
well is to clear the faulted page in cache and clear the other pages
with a non temporal write. The non temporal write will go straight
to main memory and not pollute any caches.

When the process accesses it later it has to fetch the zeroes from
main memory. This is probably still faster than a page fault at least
for the first few accesses. It could be more costly when walking the full
page (then the added up cache miss costs could exceed the page fault cost),
but then hopefully the CPU will help by doing hardware prefetch.

It could help or not help, may be worth a try at least :-)

-Andi
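
(A user-space sketch of that suggestion, using SSE2 streaming stores;
the page size and 16-byte alignment are assumed, and a kernel version
would of course live next to clear_user_highpage() rather than look
like this.)

#include <emmintrin.h>          /* SSE2: _mm_stream_si128, _mm_setzero_si128 */
#include <stddef.h>

#define PAGE_SIZE 4096

/* Clear a (16-byte aligned) page with non-temporal stores: the zeroes
 * go straight to main memory instead of displacing whatever the
 * faulting process currently has in its caches. */
static void clear_page_nocache(void *page)
{
        __m128i zero = _mm_setzero_si128();
        __m128i *p = page;
        size_t i;

        for (i = 0; i < PAGE_SIZE / sizeof(__m128i); i++)
                _mm_stream_si128(&p[i], zero);

        _mm_sfence();   /* order the streaming stores before the page is mapped */
}

The page actually being faulted on would still be cleared the ordinary,
cache-warm way.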

William Lee Irwin III

Aug 4, 2002, 4:23:22 PM
On Sun, Aug 04, 2002 at 03:30:24PM -0400, Hubertus Franke wrote:
> As long as the alignments are observed, which I guess you imply by the range.

On Sunday 04 August 2002 02:38 pm, Linus Torvalds wrote:
>> If the order-X allocation fails, we're likely low on memory (this is
>> _especially_ true since the very fact that we do lots of order-X
>> allocations will probably actually help keep fragementation down
>> normally), and we just allocate one page (with a regular GFP_USER this
>> time).

Later on I can redo one of the various online defragmentation things
that went around last October or so if it would help with this.


On Sunday 04 August 2002 02:38 pm, Linus Torvalds wrote:
>> Map in all pages.
>> - do the same for page_cache_readahead() (this, btw, is where radix trees
>> will kick some serious ass - we'd have had a hard time doing the "is
>> this range of order-X pages populated" efficiently with the old hashes.

On Sun, Aug 04, 2002 at 03:30:24PM -0400, Hubertus Franke wrote:
> Hey, we use the radix tree to track page cache mappings for large pages
> particularly for this reason...

Proportion of radix tree populated beneath a given node can be computed
by means of traversals adding up ->count or by incrementally maintaining
a secondary counter for ancestors within the radix tree node. I can look
into this when I go over the path compression heuristics, which would
help the space consumption for access patterns fooling the current one.
Getting physical contiguity out of that is another matter, but the code
can be used for other things (e.g. exec()-time prefaulting) until that's
worked out, and it's not a focus or requirement of this code anyway.


On Sunday 04 August 2002 02:38 pm, Linus Torvalds wrote:
>> I bet just those fairly small changes will give you effective coloring,
>> _and_ they are also what you want for doing small superpages.

On Sun, Aug 04, 2002 at 03:30:24PM -0400, Hubertus Franke wrote:
> The HW TLB case can be extended to not store the same PA in all the PTEs,
> but conceptually carry the superpage concept for the purpose described above.

Pagetable walking gets a tiny hook, not much interesting goes on there.
A specialized wrapper for extracting physical pfn's from the pmd's like
the one for testing whether they're terminal nodes might look more
polished, but that's mostly cosmetic.

Hmm, from looking at the "small" vs. "large" page bits, I have an
inkling this may be relative to the machine size. 256GB boxen will
probably think of 4MB pages as small.


On Sun, Aug 04, 2002 at 03:30:24PM -0400, Hubertus Franke wrote:
> But to go down this route we need the concept of a superpage in the VM,
> not just at TLB time or a hack that throws these things over the fence.

The bit throwing it over the fence is probably still useful, as Oracle
knows what it's doing and I suspect it's largely to dodge pagetable
space consumption OOM'ing machines as opposed to optimizing anything.
It pretty much wants the kernel out of the way aside from as a big bag
of device drivers, so I'm not surprised they're more than happy to have
the MMU in their hands too. The more I think about it, the less related
to superpages it seems. The motive for superpages is 100% TLB, not a
workaround for pagetable OOM.


Cheers,
Bill

Eric W. Biederman

Aug 4, 2002, 7:51:51 PM
Andi Kleen <a...@suse.de> writes:

> Andrew Morton <ak...@zip.com.au> writes:
>
> > If we instead clear out 4 or 8 pages, we trash a ton of cache and
> > the chances of userspace _using_ pages 1-7 in the short-term are
> > lower. We could clear the pages with 7,6,5,4,3,2,1,0 ordering,
> > but the cache implications of faultahead are still there.
>
> What you could do on modern x86 and probably most other architectures as
> well is to clear the faulted page in cache and clear the other pages
> with a non temporal write. The non temporal write will go straight
> to main memory and not pollute any caches.

Plus a non temporal write is 3x faster than a write that lands in
the cache on x86 (tested on Athlons, P4, & P3).



> When the process accesses it later it has to fetch the zeroes from
> main memory. This is probably still faster than a page fault at least
> for the first few accesses. It could be more costly when walking the full
> page (then the added up cache miss costs could exceed the page fault cost),
> but then hopefully the CPU will help by doing hardware prefetch.
>
> It could help or not help, may be worth a try at least :-)

Certainly.

Eric

David S. Miller

Aug 5, 2002, 1:42:20 AM
From: Linus Torvalds <torv...@transmeta.com>
Date: Sun, 4 Aug 2002 12:28:54 -0700 (PDT)

> I suspect that there is some non-zero order-X (probably 2 or 3), where you
> just win more than you lose. Even for small programs.

Furthermore it would obviously help to enhance the clear_user_page()
interface to handle multiple pages because that would nullify the
startup/finish overhead of the copy loop. (read as: things like TLB
loads and FPU save/restore on some platforms)

David S. Miller

Aug 5, 2002, 1:40:43 AM
From: Hubertus Franke <fra...@watson.ibm.com>
Date: Sun, 4 Aug 2002 13:31:24 -0400


> Can we tweak the buddy allocator to give us this additional functionality?

Absolutely not, it's a total lose.

I have tried at least 5 times to make it work without fragmenting the
buddy lists to shit. I challenge you to code one up that works without
fragmenting things to shreds. Just run an endless kernel build over
and over in a loop for a few hours to a day. If the buddy lists are
not fragmented after these runs, then you have succeeded in my
challenge.

Do not even reply to this email without meeting the challenge as it
will fall on deaf ears. I've been there and I've done that, and at
this point code talks bullshit walks when it comes to trying to
colorize the buddy allocator in a way that actually works and isn't
disgusting.

David Mosberger

Aug 5, 2002, 12:59:19 PM
>>>>> On Sun, 4 Aug 2002 15:30:24 -0400, Hubertus Franke <fra...@watson.ibm.com> said:

Hubertus> Yes, if we (correctly) assume that page coloring only buys
Hubertus> you significant benefits for small associative caches
Hubertus> (e.g. <4 or <= 8).

This seems to be a popular misconception. Yes, page-coloring
obviously plays no role as long as your cache is no bigger than
PAGE_SIZE*ASSOCIATIVITY. IIRC, Xeon can have up to 1MB of cache and I
bet that it doesn't have a 1MB/4KB=256-way associative cache. Thus,
I'm quite confident that it's possible to observe significant
page-coloring effects even on a Xeon.

--david
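
(The arithmetic behind that remark, as a tiny sketch; the cache geometry
numbers are only examples.)

#include <stdio.h>

/* Number of page colors = cache_size / (associativity * page_size);
 * coloring only matters when this comes out greater than 1. */
static unsigned long page_colors(unsigned long cache_size,
                                 unsigned long assoc,
                                 unsigned long page_size)
{
        return cache_size / (assoc * page_size);
}

int main(void)
{
        /* A 1MB, 8-way L2 with 4KB pages has 32 colors; it would need to
         * be 256-way before the choice of physical page stopped mattering. */
        printf("%lu colors\n", page_colors(1UL << 20, 8, 4096));
        return 0;
}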

Jamie Lokier

Aug 5, 2002, 5:10:39 PM
Hubertus Franke wrote:
> The wording was "significant" benefits. The point is/was that as your
> associativity goes up, the likelihood of full cache occupancy
> increases, with cache thrashing in each class decreasing.
> Would have to dig through the literature to figure out at what point
> the benefits are insignificant (<1 %) wrt page coloring.

One of the benefits of page colouring may be that a program's run time
may be expected to vary less from run to run?

In the old days (6 years ago), I found that a video game I was working
on would vary in its peak frame rate by about 3-5% (I don't recall
exactly). Once the program was started, it would remain operating at
the peak frame rate it had selected, and killing and restarting the
program didn't often make a difference either. In DOS, the same program
always ran at a consistent frame rate (higher than Linux as it happens).
The actual number of objects executing in the program, and the amount of
memory allocated, were deterministic in these tests.

This is pointing at a cache colouring issue to me -- although quite
which cache I am not sure. I suppose it could have been something to do
with Linux' VM page scanner access patterns into the page array instead.

-- Jamie

Seth, Rohit

Aug 5, 2002, 7:30:54 PM

> -----Original Message-----
> From: Hubertus Franke [mailto:fra...@watson.ibm.com]
> Sent: Sunday, August 04, 2002 12:30 PM
> To: Linus Torvalds
> Cc: David S. Miller; dav...@hpl.hp.com; dav...@napali.hpl.hp.com;
> g...@us.ibm.com; Martin...@us.ibm.com; w...@holomorphy.com;
> linux-...@vger.kernel.org
> Subject: Re: large page patch (fwd) (fwd)
>
> Well, in what you described above there is no concept of superpages
> the way it is defined for the purpose of <tracking> and <TLB overhead
> reduction>.
> If you don't know about super pages at the VM level, then you need to
> deal with them at TLB fault level to actually create the <large TLB>
> entry. That's what the INTC patch will do, namely throwing all the
> complexity over the fence to the page fault.
Our patch does the preallocation of large pages at the time of request.
Nothing special like replicating PTEs (which you mention below in your
design) happens there. In any case, even for IA-64, where the TLBs are
also sw controlled (we also have a Hardware Page Walker that can walk
any 3rd-level pt and insert the PTE in the TLB), there are almost no
changes (to be precise, one additional asm instruction in the beginning
of the handler for shifting extra bits) in our implementation that
pollute the low-level TLB fault handlers with knowledge of the large
page size when traversing the 3-level page table. (Though there are a
couple of other asm instructions added in this low-level routine to set
a helping register with the proper page_size while inserting bigger
TLBs.) On IA-32 obviously things fall into place automagically as the
page tables are set up as per arch.

> In your case not keeping track of the super pages in the
> VM layer and PT layer requires to discover the large page at soft TLB
> time by scanning PT proximity for contigous pages if we are
> talking now
> about the read_ahead ....
> In our case, we store the same physical address of the super page
> in the PTEs spanning the superpage together with the page order.
> At software TLB time we simply extract the single PTE from the PT based
> on the faulting address and move it into the TLB. This
> of course works only
> for software TLBs (PwrPC, MIPS, IA64). For HW TLB (x86) the
> PT structure
> by definition overlaps the large page size support.


> The HW TLB case can be extended to not store the same PA in
> all the PTEs,
> but conceptually carry the superpage concept for the purpose
> described above.
>

I'm afraid you may be wasting a lot of extra memory by replicating these
PTEs (take the example of one 4G large TLB size entry and assume there are a
few hundred processes using that same physical page).

> We have that concept exactly the way you want it, but the dress code
> seems to be wrong. That can be worked on.
> Our goal was in the long run 2.7 to explore the Rice approach to see
> whether it yields benefits or whether we are going down the road of
> fragmentation reduction overhead that will kill all the
> benefits we get
> from reduced TLB overhead. Time would tell.


>
> But to go down this route we need the concept of a superpage
> in the VM,
> not just at TLB time or a hack that throws these things over
> the fence.
>

As others have already said, you may want to have support for smaller
superpages in this way, where the VM is embedded with some knowledge of the
different page sizes that it can support. Demoting and promoting pages from
one size to another (efficiently) will be very critical in the design. In my
opinion, supporting the largest TLB sizes on archs (like 256M or 4G) will
need a more direct approach, and less intrusion from the kernel VM will be
preferred. Of course, the kernel will need to put in extra checks etc. to
maintain some sanity for allowed users.

There has already been a lot of discussion on this mailing list about what
the right approach is: whether new APIs are needed or something like
madvise would do it, whether the kernel needs to allocate large pages
transparently for the user or we should expose the underlying HW feature to
user land. There are issues that favor one approach over another. But the
bottom line is: 1) we should not break anything semantically for regular
system calls that happen to be using large TLBs, and 2) the performance
advantage of this HW feature (on most of the archs, I hope) is just too much
to let go without notice. I hope we get to a consensus for getting this
support into the kernel ASAP. This will benefit a lot of Linux users. (And
yes, I understand that we need to do things right in the kernel so that we
don't make unforeseen errors.)


>
> > And no, I do not want separate coloring support in the
> allocator. I think
> > coloring without superpage support is stupid and worthless (and
> > complicates the code for no good reason).
> >
> > Linus
>

> That <stupid> seems premature. You are mixing the concept of
> superpage from a TLB miss reduction perspective
> with the concept of superpage for page coloring.
>
>
I have seen a couple of HPC apps that try to fit (configure) their data
sets to the L3 cache size (like 4M on IA-64). I think these are the apps
that really get hit hardest by the lack of proper page coloring support in
the Linux kernel. The performance variation of these workloads from run to
run could be as much as 60%. And with the page coloring patch, these apps
seem to give consistently higher throughput (the really bad part is that
once the throughput of these workloads drops, it stays down thereafter :( ).
But it seems like DavidM has enough real-world data that prohibits the use
of this approach in the kernel for real-world scenarios. The good part of
large TLBs is that TLB pages larger than the CPU cache size will
automatically get you perfect page coloring... for free.

rohit

Eric W. Biederman

Aug 10, 2002, 2:20:06 PM
Andrew Morton <ak...@zip.com.au> writes:
>
> The other worry is the ZONE_NORMAL space consumption of pte_chains.
> We've halved that, but it will still make high sharing levels
> unfeasible on the big ia32 machines. We are dependant upon large
> pages to solve that problem. (Resurrection of pte_highmem is in
> progress, but it doesn't work yet).

There is a second method to address this. Pages can be swapped out
of the page tables and still remain in the page cache; the virtual
scan does this all of the time. This should allow for arbitrary
amounts of sharing. There is some overhead in faulting the pages
back in, but it is much better than cases that do not work. A simple
implementation would have a maximum pte_chain length.

For any page that is not backed by anonymous memory we do not need to
keep the pte entries after the page has been swapped out of the page
table, which should show a reduction in page table size. In a highly
shared setting with anonymous pages it is likely worth it to promote
those pages to being POSIX shared memory.

All of the above should allow us to keep a limit on the amount of
resources that go towards sharing, reducing the need for something
like pte_highmem, and keeping memory pressure down in general.

For the cases you describe I have trouble seeing pte_highmem as
anything other than a performance optimization. Only placing shmem
direct and indirect entries in high memory or in swap do I see as
a limit to feasibility.

Eric

Eric W. Biederman

Aug 10, 2002, 3:54:47 PM
Rik van Riel <ri...@conectiva.com.br> writes:

> On 10 Aug 2002, Eric W. Biederman wrote:
> > Andrew Morton <ak...@zip.com.au> writes:
> > >
> > > The other worry is the ZONE_NORMAL space consumption of pte_chains.
> > > We've halved that, but it will still make high sharing levels
> > > unfeasible on the big ia32 machines.
>

> > There is a second method to address this. Pages can be swapped out
> > of the page tables and still remain in the page cache, the virtual
> > scan does this all of the time. This should allow for arbitrary
> > amounts of sharing. There is some overhead, in faulting the pages
> > back in but it is much better than cases that do not work. A simple
> > implementation would have a maximum pte_chain length.
>

> Indeed. We need this same thing for page tables too, otherwise
> a high sharing situation can easily "require" more page table
> memory than the total amount of physical memory in the system ;)

It's exactly the same situation. To remove a pte from the chain you must
remove it from the page table as well. Then we just need to free
pages with no interesting pte entries.

Rik van Riel

Aug 10, 2002, 3:55:45 PM
On 10 Aug 2002, Eric W. Biederman wrote:
> Andrew Morton <ak...@zip.com.au> writes:
> >
> > The other worry is the ZONE_NORMAL space consumption of pte_chains.
> > We've halved that, but it will still make high sharing levels
> > unfeasible on the big ia32 machines.

> There is a second method to address this. Pages can be swapped out
> of the page tables and still remain in the page cache, the virtual
> scan does this all of the time. This should allow for arbitrary
> amounts of sharing. There is some overhead, in faulting the pages
> back in but it is much better than cases that do not work. A simple
> implementation would have a maximum pte_chain length.

Indeed. We need this same thing for page tables too, otherwise
a high sharing situation can easily "require" more page table
memory than the total amount of physical memory in the system ;)

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

Alan Cox

Aug 11, 2002, 4:30:17 PM
On Fri, 2002-08-09 at 16:20, Daniel Phillips wrote:
> On Sunday 04 August 2002 19:19, Hubertus Franke wrote:
> > "General Purpose Operating System Support for Multiple Page Sizes"
> > htpp://www.usenix.org/publications/library/proceedings/usenix98/full_papers/ganapathy/ganapathy.pdf
>
> This reference describes roughly what I had in mind for active
> defragmentation, which depends on reverse mapping. The main additional
> wrinkle I'd contemplated is introducing a new ZONE_LARGE, and GPF_LARGE,
> which means the caller promises not to pin the allocation unit for long
> periods and does not mind if the underlying physical page changes
> spontaneously. Defragmenting in this zone is straightforward.

Slight problem. This paper is about a patented SGI method for handling
defragmentation into large pages (6,182,089). They patented it before
the presentation.

They also hold patents on the other stuff that you've recently been
discussing about not keeping separate rmap structures until there are
more than some value 'n', when they switch from direct to indirect lists
of reverse mappings (6,112,286).

If you are going to read and propose things you find on Usenix, at least
check what the authors' policies on patents are.

Perhaps someone should first of all ask SGI to give the Linux community
permission to use it in a GPL'd operating system ?

Linus Torvalds

Aug 11, 2002, 6:55:08 PM

On Mon, 12 Aug 2002, Daniel Phillips wrote:
>
> It goes on in this vein. I suggest all vm hackers have a close look at
> this. Yes, it's stupid, but we can't just ignore it.

Actually, we can, and I will.

I do not look up any patents on _principle_, because (a) it's a horrible
waste of time and (b) I don't want to know.

The fact is, technical people are better off not looking at patents. If
you don't know what they cover and where they are, you won't be knowingly
infringing on them. If somebody sues you, you change the algorithm or you
just hire a hit-man to whack the stupid git.

Linus

Larry McVoy

Aug 11, 2002, 7:15:01 PM
On Sun, Aug 11, 2002 at 03:55:08PM -0700, Linus Torvalds wrote:
>
> On Mon, 12 Aug 2002, Daniel Phillips wrote:
> >
> > It goes on in this vein. I suggest all vm hackers have a close look at
> > this. Yes, it's stupid, but we can't just ignore it.
>
> Actually, we can, and I will.
>
> I do not look up any patents on _principle_, because (a) it's a horrible
> waste of time and (b) I don't want to know.
>
> The fact is, technical people are better off not looking at patents. If
> you don't know what they cover and where they are, you won't be knowingly
> infringing on them. If somebody sues you, you change the algorithm or you
> just hire a hit-man to whack the stupid git.

This issue is more complicated than you might think. Big companies with
big pockets are very nervous about being too closely associated with
Linux because of this problem. Imagine that IBM, for example, starts
shipping IBM Linux. Somewhere in the code there is something that
infringes on a patent. Given that it is IBM Linux, people can make
the case that IBM should have known and should have fixed it and
since they didn't, they get sued. Notice that IBM doesn't ship
their own version of Linux, they ship / support Red Hat or Suse
(maybe others, doesn't matter). So if they ever get hassled, they'll
vector the problem to those little guys and the issue will likely
get dropped because the little guys have no money to speak of.

Maybe this is all good, I dunno, but be aware that the patents
have long arms and effects.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

Linus Torvalds

Aug 11, 2002, 6:56:10 PM

On Sun, 11 Aug 2002, Linus Torvalds wrote:
>
> If somebody sues you, you change the algorithm or you just hire a
> hit-man to whack the stupid git.

Btw, I'm not a lawyer, and I suspect this may not be legally tenable
advice. Whatever. I refuse to bother with the crap.

Linus

Rik van Riel

Aug 11, 2002, 7:42:16 PM
On 12 Aug 2002, Alan Cox wrote:

> Unfortunately the USA forces people to deal with this crap. I'd hope SGI
> would be decent enough to explicitly state they will license this stuff
> freely for GPL use

I seem to remember Apple having a clause for this in
their Darwin sources, forbidding people who contribute
code from suing them about patent violations due to
the code they themselves contributed.

kind regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/


Alan Cox

Aug 11, 2002, 8:46:19 PM
On Sun, 2002-08-11 at 23:56, Linus Torvalds wrote:
>
> On Sun, 11 Aug 2002, Linus Torvalds wrote:
> >
> > If somebody sues you, you change the algorithm or you just hire a
> > hit-man to whack the stupid git.
>
> Btw, I'm not a lawyer, and I suspect this may not be legally tenable
> advice. Whatever. I refuse to bother with the crap.

In which case you might as well do the rest of the world a favour and
restrict US usage of Linux in the license file while you are at it.


Unfortunately the USA forces people to deal with this crap. I'd hope SGI
would be decent enough to explicitly state they will license this stuff
freely for GPL use (although having shipped Linux themselves the
question is partly moot as the GPL says they can't impose additional
restrictions)

Alan

William Lee Irwin III

Aug 11, 2002, 7:36:10 PM
On Sun, 11 Aug 2002, Linus Torvalds wrote:
>> If somebody sues you, you change the algorithm or you just hire a
>> hit-man to whack the stupid git.

On Sun, Aug 11, 2002 at 03:56:10PM -0700, Linus Torvalds wrote:
> Btw, I'm not a lawyer, and I suspect this may not be legally tenable
> advice. Whatever. I refuse to bother with the crap.

I'm not really sure what to think of all this patent stuff myself, but
I may need to get some directions from lawyerish types before moving on
here. OTOH I certainly like the suggested approach more than my
conservative one, even though I'm still too chicken to follow it. =)

On a more practical note, though, someone left out an essential 'h'
from my email address. Please adjust the cc: list. =)


Thanks,
Bill

Larry McVoy

Aug 11, 2002, 7:50:03 PM
On Sun, Aug 11, 2002 at 08:42:16PM -0300, Rik van Riel wrote:
> On 12 Aug 2002, Alan Cox wrote:
>
> > Unfortunately the USA forces people to deal with this crap. I'd hope SGI
> > would be decent enough to explicitly state they will license this stuff
> > freely for GPL use
>
> I seem to remember Apple having a clause for this in
> their Darwin sources, forbidding people who contribute
> code from suing them about patent violations due to
> the code they themselves contributed.

IBM has a fantastic clause in their open source license. The license grants
you various rights to use, etc., and then goes on to say something in
the termination section (I think) along the lines of

In the event that You or your affiliates instigate patent, trademark,
and/or any other intellectual property suits, this license terminates
as of the filing date of said suit[s].

You get the idea. It's basically "screw me, OK, then screw you too" language.


--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

Linus Torvalds

Aug 11, 2002, 9:26:00 PM

On Sun, 11 Aug 2002, Larry McVoy wrote:
>
> This issue is more complicated than you might think.

No, it's not. You miss the point.

> Big companies with
> big pockets are very nervous about being too closely associated with
> Linux because of this problem.

The point being that that is _their_ problem, and at a level that has
nothing to do with technology.

I'm saying that technical people shouldn't care. I certainly don't. The
people who _should_ care are patent attorneys etc, since they actually
get paid for it, and can better judge the matter anyway.

Everybody in the whole software industry knows that any non-trivial
program (and probably most trivial programs too, for that matter) will
infringe on _some_ patent. Ask anybody. It's apparently an accepted fact,
or at least a saying that I've heard too many times.

I just don't care. Clearly, if all significant programs infringe on
something, the issue is no longer "do we infringe", but "is it an issue"?

And that's _exactly_ why technical people shouldn't care. The "is it an
issue" is not something a technical guy can answer, since the answer
depends on totally non-technical things.

Ask your legal counsel, and I strongly suspect that if he is any good, he
will tell you the same thing. Namely that it's _his_ problem, and that
your engineers should not waste their time trying to find existing
patents.

Linus

Larry McVoy

Aug 12, 2002, 1:05:45 AM
> Ask your legal counsel, and I strongly suspect that if he is any good, he
> will tell you the same thing. Namely that it's _his_ problem, and that
> your engineers should not waste their time trying to find existing
> patents.

Partially true for us. We do do patent searches to make sure we aren't
doing anything blatantly stupid.

I do agree with you 100% that it is impossible to ship any software that
does not infringe on some patent. It's a big point of contention in
contract negotiations because everyone wants you to warrant that your
software doesn't infringe and indemnify them if it does.

--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

Alan Cox

Aug 12, 2002, 6:31:48 AM
On Mon, 2002-08-12 at 02:26, Linus Torvalds wrote:
> Ask your legal counsel, and I strongly suspect that if he is any good, he
> will tell you the same thing. Namely that it's _his_ problem, and that
> your engineers should not waste their time trying to find existing
> patents.

Wasn't a case of wasting time. That one is extremely well known because
there were upset people when SGI patented it and then submitted a usenix
paper on it.

Helge Hafting

Aug 12, 2002, 5:23:50 AM
Rik van Riel wrote:

> One problem we're running into here is that there are absolutely
> no tools to measure some of the things rmap is supposed to fix,
> like page replacement.
>
There are things like running vmstat while running tests or production.

My office desktop machine (256M RAM) rarely swaps more than 10M
during work with 2.5.30. It used to go some 70M into swap
after a few days of writing, browsing, and those updatedb runs.

Helge Hafting
