large page patch (fwd) (fwd)

Linus Torvalds

unread,

Aug 2, 2002, 7:35:08 PM8/2/02

to Gerrit Huizenga, Hubertus Franke, Martin...@us.ibm.com, w...@holomorpy.com, Kernel Mailing List

[ linux-kernel cc'd, simply because I don't want to write the same thing
over and over again ]

[ Executive summary: the current large-page-patch is nothing but a magic
interface to the TLB. Don't view it as anything else, or you'll just be
confused by all the smoke and mirrors. ]

On Fri, 2 Aug 2002, Gerrit Huizenga wrote:
> > Because _none_ of the large-page codepaths are shared with _any_ of the
> > normal cases.
>
> Isn't that currently an implementation detail?

Yes and no.

We may well expand the FS layer to bigger pages, but "bigger" is almost
certainly not going to include things like 256MB pages - if for no other
reason than the fact that memory fragmentation really means that the limit
on page sizes in practice is somewhere around 128kB for any reasonable
usage patterns even with gigabytes of RAM.

And _maybe_ we might get to the single-digit megabytes. I doubt it, simply
because even with a good buddy allocator and a memory manager that
actively frees pages to get large contiguous chunks of RAM, it's basically
impossible to have something that can reliably give you that big chunks
without making normal performance go totally down the toiled.

(Yeah, once you have terabytes of memory, that worry probably ends up
largely going away. I don't think that is going to be a common enough
platform for Linux to care about in the next ten years, though).

So there are implementation issues, yes. In particular, there _is_ a push
for larger pages in the FS and generic MM layers too, but the issues there
are very different and have no basically no generality with the TLB and
page table mapping issues of the current push.

What this VM/VFS push means is that we may actually have a _different_
"large page" support on that level, where the most likely implementation
is that the "struct address_space" will at some point have a new member
that specifies the "page allocation order" for that address space. This
will allow us to do per-file allocations, so that some files (or some
filesystems) migth want to do all IO in 64kB chunks, and they'd just make
the address_space specify a page allocation order that matches that.

This is in fact one of the reasons I explicitly _want_ to keep the
interfaces separate - because there are two totally different issues at
play, and I suspect that we'll end up implementing _both_ of them, but
that they will _still_ have no commonalities.

The current largepage patch is really nothing but an interface to the TLB.
Please view it as that - a direct TLB interface that has zero impact on
the VFS or VM layers, and that is meant _purely_ as a way to expose hw
capabilities to the few applications that really really want them.

The important thing to take away from this is that _even_ if we could
change the FS and VM layers to know about a per-address_space variable-
sized PAGE_CACHE_SIZE (which I think it the long-term goal), that doesn't
impact the fact that we _also_ want to have the TLB interface.

Maybe the largepage patch could be improved upon by just renaming it, and
making clear that it's a "TLB_hugepage" thing. That's what a CPU designer
thinks of when you say "largepage" to him. Some of the confusion is
probably because a VM/FS person in an OS group does _not_ necessarily
think the same way, but thinks about doing big-granularity IO.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

David Mosberger

unread,

Aug 3, 2002, 3:21:43 AM8/3/02

to Linus Torvalds, Gerrit Huizenga, Hubertus Franke, Martin...@us.ibm.com, w...@holomorpy.com, Kernel Mailing List

>>>>> On Fri, 2 Aug 2002 12:34:08 -0700 (PDT), Linus Torvalds <torv...@transmeta.com> said:

Linus> We may well expand the FS layer to bigger pages, but "bigger"
Linus> is almost certainly not going to include things like 256MB
Linus> pages - if for no other reason than the fact that memory
Linus> fragmentation really means that the limit on page sizes in
Linus> practice is somewhere around 128kB for any reasonable usage
Linus> patterns even with gigabytes of RAM.

Linus> And _maybe_ we might get to the single-digit megabytes. I
Linus> doubt it, simply because even with a good buddy allocator and
Linus> a memory manager that actively frees pages to get large
Linus> contiguous chunks of RAM, it's basically impossible to have
Linus> something that can reliably give you that big chunks without
Linus> making normal performance go totally down the toiled.

The Rice people avoided some of the fragmentation problems by
pro-actively allocating a max-order physical page, even when only a
(small) virtual page was being mapped. This should work very well as
long as the total memory usage (including memory lost due to internal
fragmentation of max-order physical pages) doesn't exceed available
memory. That's not a condition which will hold for every system in
the world, but I suspect it is true for lots of systems for large
periods of time. And since superpages quickly become
counter-productive in tight-memory situations anyhow, this seems like
a very reasonable approach.

--david

Linus Torvalds

unread,

Aug 3, 2002, 3:32:48 AM8/3/02

to dav...@hpl.hp.com, Gerrit Huizenga, Hubertus Franke, Martin...@us.ibm.com, w...@holomorpy.com, Kernel Mailing List

On Fri, 2 Aug 2002, David Mosberger wrote:
>
> The Rice people avoided some of the fragmentation problems by
> pro-actively allocating a max-order physical page, even when only a
> (small) virtual page was being mapped.

This probably works ok if
- the superpages are only slightly smaller than the smaller page
- superpages are a nice optimization.

> And since superpages quickly become
> counter-productive in tight-memory situations anyhow, this seems like
> a very reasonable approach.

Ehh.. The only people who are _really_ asking for the superpages want
almost nothing _but_ superpages. They are willing to use 80% of all memory
for just superpages.

Yes, it's Oracle etc, and the whole point for these users is to avoid
having any OS memory allocation for these areas.

Linus

David Mosberger

unread,

Aug 3, 2002, 4:19:00 AM8/3/02

to Linus Torvalds, dav...@hpl.hp.com, Gerrit Huizenga, Hubertus Franke, Martin...@us.ibm.com, w...@holomorpy.com, Kernel Mailing List

>>>>> On Fri, 2 Aug 2002 20:32:10 -0700 (PDT), Linus Torvalds <torv...@transmeta.com> said:

>> And since superpages quickly become counter-productive in
>> tight-memory situations anyhow, this seems like a very reasonable
>> approach.

Linus> Ehh.. The only people who are _really_ asking for the
Linus> superpages want almost nothing _but_ superpages. They are
Linus> willing to use 80% of all memory for just superpages.

Linus> Yes, it's Oracle etc, and the whole point for these users is
Linus> to avoid having any OS memory allocation for these areas.

My terminology is perhaps a bit too subtle: I user "superpage"
exclusively for the case where multiple pages get coalesced into a
larger page. The "large page" ("huge page") case that you were
talking about is different, since pages never get demoted or promoted.

I wasn't disagreeing with your case for separate large page syscalls.
Those syscalls certainly simplify implementation and, as you point
out, it well may be the case that a transparent superpage scheme never
will be able to replace the former.

--david

Linus Torvalds

unread,

Aug 3, 2002, 4:26:35 AM8/3/02

to dav...@hpl.hp.com, Gerrit Huizenga, Hubertus Franke, Martin...@us.ibm.com, w...@holomorpy.com, Kernel Mailing List

On Fri, 2 Aug 2002, David Mosberger wrote:
>

> My terminology is perhaps a bit too subtle: I user "superpage"
> exclusively for the case where multiple pages get coalesced into a
> larger page. The "large page" ("huge page") case that you were
> talking about is different, since pages never get demoted or promoted.

Ahh, ok.

> I wasn't disagreeing with your case for separate large page syscalls.
> Those syscalls certainly simplify implementation and, as you point
> out, it well may be the case that a transparent superpage scheme never
> will be able to replace the former.

Somebody already had patches for the transparent superpage thing for
alpha, which supports it. I remember seeing numbers implying that helped
noticeably.

But yes, that definitely doesn't work for humongous pages (or whatever we
should call the multi-megabyte-special-case-thing ;).

Linus

David Mosberger

unread,

Aug 3, 2002, 4:40:33 AM8/3/02

to Linus Torvalds, dav...@hpl.hp.com, Gerrit Huizenga, Hubertus Franke, Martin...@us.ibm.com, w...@holomorpy.com, Kernel Mailing List

>>>>> On Fri, 2 Aug 2002 21:26:52 -0700 (PDT), Linus Torvalds <torv...@transmeta.com> said:

>> I wasn't disagreeing with your case for separate large page
>> syscalls. Those syscalls certainly simplify implementation and,
>> as you point out, it well may be the case that a transparent
>> superpage scheme never will be able to replace the former.

Linus> Somebody already had patches for the transparent superpage
Linus> thing for alpha, which supports it. I remember seeing numbers
Linus> implying that helped noticeably.

Yes, I saw those. I still like the Rice work a _lot_ better. It's
just a thing of beauty, from a design point of view (disclaimer: I
haven't seen the implementation, so there may be ugly things
lurking...).

Linus> But yes, that definitely doesn't work for humongous pages (or
Linus> whatever we should call the multi-megabyte-special-case-thing
Linus> ;).

Yes, you're probably right. 2MB was reported to be fine in the Rice
experiments, but I doubt 256MB (and much less 4GB, as supported by
some CPUs) would fly.

--david

David S. Miller

unread,

Aug 3, 2002, 5:35:13 AM8/3/02

to dav...@hpl.hp.com, dav...@napali.hpl.hp.com, torv...@transmeta.com, g...@us.ibm.com, fra...@watson.ibm.com, Martin...@us.ibm.com, w...@holomorpy.com, linux-...@vger.kernel.org

From: David Mosberger <dav...@napali.hpl.hp.com>
Date: Fri, 2 Aug 2002 21:39:36 -0700

>>>>> On Fri, 2 Aug 2002 21:26:52 -0700 (PDT), Linus Torvalds <torv...@transmeta.com> said:

>> I wasn't disagreeing with your case for separate large page
>> syscalls. Those syscalls certainly simplify implementation and,
>> as you point out, it well may be the case that a transparent
>> superpage scheme never will be able to replace the former.

Linus> Somebody already had patches for the transparent superpage
Linus> thing for alpha, which supports it. I remember seeing numbers
Linus> implying that helped noticeably.

Yes, I saw those. I still like the Rice work a _lot_ better.

Now here's the thing. To me, we should be adding these superpage
syscalls to things like the implementation of malloc() :-) If you
allocate enough anonymous pages together, you should get a superpage
in the TLB if that is easy to do. Once any hint of memory pressure
occurs, you just break up the large page clusters as you hit such
ptes. This is what one of the Linux large-page implementations did
and I personally find it the most elegant way to handle the so called
"paging complexity" of transparent superpages.

At that point it's like "why the system call". If it would rather be
more of a large-page reservation system than a "optimization hint"
then these syscalls would sit better with me. Currently I think they
are superfluous. To me the hint to use large-pages is a given :-)

Stated another way, if these syscalls said "gimme large pages for this
area and lock them into memory", this would be fine. If the syscalls
say "use large pages if you can", that's crap. And in fact we could
use mmap() attribute flags if we really thought that stating this was
necessary.

Linus Torvalds

unread,

Aug 3, 2002, 5:48:27 PM8/3/02

to David S. Miller, dav...@hpl.hp.com, dav...@napali.hpl.hp.com, g...@us.ibm.com, fra...@watson.ibm.com, Martin...@us.ibm.com, w...@holomorpy.com, linux-...@vger.kernel.org

On Fri, 2 Aug 2002, David S. Miller wrote:
>
> Now here's the thing. To me, we should be adding these superpage
> syscalls to things like the implementation of malloc() :-) If you
> allocate enough anonymous pages together, you should get a superpage
> in the TLB if that is easy to do.

For architectures that have these "small" superpages, we can just do it
transparently. That's what the alpha patches did.

The problem space is roughly the same as just page coloring.

> At that point it's like "why the system call". If it would rather be
> more of a large-page reservation system than a "optimization hint"
> then these syscalls would sit better with me. Currently I think they
> are superfluous. To me the hint to use large-pages is a given :-)

Yup.

David, you did page coloring once.

I bet your patches worked reasonably well to color into 4 or 8 colors.

How well do you think something like your old patches would work if

- you _require_ 1024 colors in order to get the TLB speedup on some
hypothetical machine (the same hypothetical machine that might
hypothetically run on 95% of all hardware ;)

- the machine is under heavy load, and heavy load is exactly when you
want this optimization to trigger.

Can you explain this difficulty to people?

> Stated another way, if these syscalls said "gimme large pages for this
> area and lock them into memory", this would be fine. If the syscalls
> say "use large pages if you can", that's crap. And in fact we could
> use mmap() attribute flags if we really thought that stating this was
> necessary.

I agree 100%.

I think we can at some point do the small cases completely transparently,
with no need for a new system call, and not even any new hint flags. We'll
just silently do 4/8-page superpages and be done with it. Programs don't
need to know about it to take advantage of better TLB usage.

Linus

David Mosberger

unread,

Aug 3, 2002, 7:32:13 PM8/3/02

to Linus Torvalds, David S. Miller, dav...@hpl.hp.com, dav...@napali.hpl.hp.com, g...@us.ibm.com, fra...@watson.ibm.com, Martin...@us.ibm.com, w...@holomorpy.com, linux-...@vger.kernel.org

>>>>> On Sat, 3 Aug 2002 10:35:00 -0700 (PDT), Linus Torvalds <torv...@transmeta.com> said:

Linus> How well do you think something like your old patches would
Linus> work if

Linus> - you _require_ 1024 colors in order to get the TLB speedup
Linus> on some hypothetical machine (the same hypothetical machine
Linus> that might hypothetically run on 95% of all hardware ;)

Linus> - the machine is under heavy load, and heavy load is exactly
Linus> when you want this optimization to trigger.

Your point about wanting databases have access to giant pages even
under memory pressure is a good one. I had not considered that
before. However, what we really are talking about then is a security
or resource policy as to who gets to allocate from a reserved and
pinned pool of giant physical pages. You don't need separate system
calls for that: with a transparent superpage framework and a
privileged & reserved giant-page pool, it's trivial to set up things
such that your favorite data base will always be able to get the giant
pages (and hence the giant TLB mappings) it wants. The only thing you
lose in the transparent case is control over _which_ pages need to use
the pinned giant pages. I can certainly imagine cases where this
would be an issue, but I kind of doubt it would be an issue for
databases.

As Dave Miller justly pointed out, it's stupid for a task not to ask
for giant pages for anonymous memory. The only reason this is not a
smart thing overall is that globally it's not optimal (it is optimal
only locally, from the task's point of view). So if the only barrier
to getting the giant pinned pages is needing to know about the new
system calls, I'll predict that very soon we'll have EVERY task in the
system allocating such pages (and LD_PRELOAD tricks make that pretty
much trivial). Then we're back to square one, because the favorite
database may not even be able to start up, because all the "reserved"
memory is already used up by the other tasks.

Clearly there needs to be some additional policies in effect, no
matter what the implementation is (the normal VM policies don't work,
because, by definition, the pinned giant pages are not pageable).

In my opinion, the primary benefit of the separate syscalls is still
ease-of-implementation (which isn't unimportant, of course).

--david

David Mosberger

unread,

Aug 3, 2002, 7:42:23 PM8/3/02

to fra...@watson.ibm.com, dav...@hpl.hp.com, David Mosberger, Linus Torvalds, Gerrit Huizenga, Martin...@us.ibm.com, w...@holomorpy.com, Kernel Mailing List

>>>>> On Sat, 3 Aug 2002 14:41:29 -0400, Hubertus Franke <fra...@watson.ibm.com> said:

Hubertus> But I'd like to point out that superpages are there to
Hubertus> reduce the number of TLB misses by providing larger
Hubertus> coverage. Simply providing page coloring will not get you
Hubertus> there.

Yes, I agree.

It appears that Juan Navarro, the primary author behind the Rice
project, is working on breaking down the superpage benefits they
observed. That would tell us how much benefit is due to page-coloring
and how much is due to TLB effects. Here in our lab, we do have some
(weak) empirical evidence that some of the SPECint benchmarks benefit
primarily from page-coloring, but clearly there are others that are
TLB limited.

--daivd

Linus Torvalds

unread,

Aug 3, 2002, 7:52:35 PM8/3/02

to Hubertus Franke, dav...@hpl.hp.com, David Mosberger, Gerrit Huizenga, Martin...@us.ibm.com, w...@holomorpy.com, Kernel Mailing List

On Sat, 3 Aug 2002, Hubertus Franke wrote:
>
> But I'd like to point out that superpages are there to reduce the number of
> TLB misses by providing larger coverage. Simply providing page coloring
> will not get you there.

Superpages can from a memory allocation angle be seen as a very strict
form of page coloring - the problems are fairly closely related, I think
(superpages are just a lot stricter, in that it's not enough to get "any
page of color X", you have to get just the _right_ page).

Doing superpages will automatically do coloring (while the reverse is
obviously not true). And the way David did coloring a long time ago (if
I remember his implementation correctly) was the same way you'd do
superpages: just do higher order allocations.

Linus

Linus Torvalds

unread,

Aug 3, 2002, 7:58:12 PM8/3/02

to dav...@hpl.hp.com, David S. Miller, dav...@napali.hpl.hp.com, g...@us.ibm.com, fra...@watson.ibm.com, Martin...@us.ibm.com, w...@holomorpy.com, linux-...@vger.kernel.org

On Sat, 3 Aug 2002, David Mosberger wrote:
>
> Your point about wanting databases have access to giant pages even
> under memory pressure is a good one. I had not considered that
> before. However, what we really are talking about then is a security
> or resource policy as to who gets to allocate from a reserved and
> pinned pool of giant physical pages.

Absolutely. We can't allow just anybody to allocate giant pages, since
they are a scarce resource (set up at boot time in both Ingo's and Intels
patches - with the potential to move things around later with additional
interfaces).

> You don't need separate system
> calls for that: with a transparent superpage framework and a
> privileged & reserved giant-page pool, it's trivial to set up things
> such that your favorite data base will always be able to get the giant
> pages (and hence the giant TLB mappings) it wants. The only thing you
> lose in the transparent case is control over _which_ pages need to use
> the pinned giant pages. I can certainly imagine cases where this
> would be an issue, but I kind of doubt it would be an issue for
> databases.

That's _probably_ true. There aren't that many allocations that ask for
megabytes of consecutive memory that wouldn't want to do it. However,
there might certainly be non-critical maintenance programs (with the same
privileges as the database program proper) that _do_ do large allocations,
and that we don't want to give large pages to.

Guessing is always bad, especially since the application certainly does
know what it wants.

Linus

David Mosberger

unread,

Aug 3, 2002, 9:19:33 PM8/3/02

to Linus Torvalds, dav...@hpl.hp.com, David S. Miller, dav...@napali.hpl.hp.com, g...@us.ibm.com, fra...@watson.ibm.com, Martin...@us.ibm.com, w...@holomorpy.com, linux-...@vger.kernel.org

>>>>> On Sat, 3 Aug 2002 12:43:47 -0700 (PDT), Linus Torvalds <torv...@transmeta.com> said:

>> You don't need separate system calls for that: with a transparent
>> superpage framework and a privileged & reserved giant-page pool,
>> it's trivial to set up things such that your favorite data base
>> will always be able to get the giant pages (and hence the giant
>> TLB mappings) it wants. The only thing you lose in the
>> transparent case is control over _which_ pages need to use the
>> pinned giant pages. I can certainly imagine cases where this
>> would be an issue, but I kind of doubt it would be an issue for
>> databases.

Linus> That's _probably_ true. There aren't that many allocations
Linus> that ask for megabytes of consecutive memory that wouldn't
Linus> want to do it. However, there might certainly be non-critical
Linus> maintenance programs (with the same privileges as the
Linus> database program proper) that _do_ do large allocations, and
Linus> that we don't want to give large pages to.

Linus> Guessing is always bad, especially since the application
Linus> certainly does know what it wants.

Yes, but that applies even to a transparent superpage scheme: in those
instances where an application knows what page size is optimal, it's
better if the application can express that (saves time
promoting/demoting pages needlessly). It's not unlike madvise() or
the readahead() syscall: use reasonable policies for the ordinary
apps, and provide the means to let the smart apps tell the kernel
exactly what they need.

--david

David Mosberger

unread,

Aug 3, 2002, 9:27:14 PM8/3/02

to fra...@watson.ibm.com, dav...@hpl.hp.com, David Mosberger, Linus Torvalds, Gerrit Huizenga, Martin...@us.ibm.com, w...@holomorpy.com, Kernel Mailing List

>>>>> On Sat, 3 Aug 2002 16:53:39 -0400, Hubertus Franke <fra...@watson.ibm.com> said:

Hubertus> Cool. Does that mean that BSD already has page coloring
Hubertus> implemented ?

FreeBSD (at least on Alpha) makes some attempts at page-coloring, but
it's said to be far from perfect.

Hubertus> The agony is: Page Coloring helps to reduce cache
Hubertus> conflicts in low associative caches while large pages may
Hubertus> reduce TLB overhead.

Why agony? The latter helps the TLB _and_ solves the page coloring
problem (assuming the largest page size is bigger than the largest
cache; yeah, I see that could be a problem on some Power 4
machines... ;-)

Hubertus> One shouldn't rule out one for the other, there is a place
Hubertus> for both.

Hubertus> How did you arrive to the (weak) empirical evidence? You
Hubertus> checked TLB misses and cache misses and turned page
Hubertus> coloring on and off and large pages on and off?

Yes, that's basically what we did (there is a patch implementing a
page coloring kernel module floating around).

David S. Miller

unread,

Aug 4, 2002, 12:42:47 AM8/4/02

to torv...@transmeta.com, dav...@hpl.hp.com, dav...@napali.hpl.hp.com, g...@us.ibm.com, fra...@watson.ibm.com, Martin...@us.ibm.com, w...@holomorpy.com, linux-...@vger.kernel.org

From: Linus Torvalds <torv...@transmeta.com>
Date: Sat, 3 Aug 2002 10:35:00 -0700 (PDT)

David, you did page coloring once.

I bet your patches worked reasonably well to color into 4 or 8 colors.

How well do you think something like your old patches would work if

- you _require_ 1024 colors in order to get the TLB speedup on some
hypothetical machine (the same hypothetical machine that might
hypothetically run on 95% of all hardware ;)

- the machine is under heavy load, and heavy load is exactly when you
want this optimization to trigger.

Can you explain this difficulty to people?

Actually, we need some clarification here. I tried coloring several
times, the problem with my diffs is that I tried to do the coloring
all the time no matter what.

I wanted strict coloring on the 2-color level for broken L1 caches
that have aliasing problems. If I could make this work, all of the
dumb cache flushing I have to do on Sparcs could be deleted. Because
of this, I couldn't legitimately change the cache flushing rules
unless I had absolutely strict coloring done on all pages where it
mattered (basically anything that could end up in the user's address
space).

So I kept track of color existence precisely in the page lists. The
implementation was fast, but things got really bad fragmentation wise.

No matter how I tweaked things, just running a kernel build 40 or 50
times would fragment the free page lists to shreds such that 2-order
and up pages simply did not exist.

Another person did an implementation of coloring which basically
worked by allocating a big-order chunk and slicing that up. It's not
strictly done and that is why his version works better. In fact I
like that patch a lot and it worked quite well for L2 coloring on
sparc64. Any time there is page pressure, he tosses away all of the
color carving big-order pages.

I think we can at some point do the small cases completely transparently,
with no need for a new system call, and not even any new hint flags. We'll
just silently do 4/8-page superpages and be done with it. Programs don't
need to know about it to take advantage of better TLB usage.

Ok. I think even 64-page ones are viable to attempt but we'll see.
Most TLB's that do superpages seem to have a range from the base
page size to the largest supported superpage with 2-powers of two
being incrememnted between each supported size.

For example on Sparc64 this is:

8K PAGE_SIZE
64K PAGE_SIZE * 8
512K PAGE_SIZE * 64
4M PAGE_SIZE * 512

One of the transparent large page implementations just defined a
small array that the core code used to try and see "hey how big
a superpage can we try" and if the largest for the area failed
(because page orders that large weren't available) it would simply
fall back to the next smallest superpage size.

David S. Miller

unread,

Aug 4, 2002, 12:46:31 AM8/4/02

to torv...@transmeta.com, fra...@watson.ibm.com, dav...@hpl.hp.com, dav...@napali.hpl.hp.com, g...@us.ibm.com, Martin...@us.ibm.com, w...@holomorpy.com, linux-...@vger.kernel.org

From: Linus Torvalds <torv...@transmeta.com>
Date: Sat, 3 Aug 2002 12:39:40 -0700 (PDT)

And the way David did coloring a long time ago (if
I remember his implementation correctly) was the same way you'd do
superpages: just do higher order allocations.

Although it wasn't my implementation which did this,
one of them did do it this way. I agree that it is
the nicest way to do coloring.

David S. Miller

unread,

Aug 4, 2002, 12:47:00 AM8/4/02

to dav...@hpl.hp.com, dav...@napali.hpl.hp.com, fra...@watson.ibm.com, torv...@transmeta.com, g...@us.ibm.com, Martin...@us.ibm.com, w...@holomorpy.com, linux-...@vger.kernel.org

From: David Mosberger <dav...@napali.hpl.hp.com>
Date: Sat, 3 Aug 2002 12:41:33 -0700

It appears that Juan Navarro, the primary author behind the Rice
project, is working on breaking down the superpage benefits they
observed. That would tell us how much benefit is due to page-coloring
and how much is due to TLB effects. Here in our lab, we do have some
(weak) empirical evidence that some of the SPECint benchmarks benefit
primarily from page-coloring, but clearly there are others that are
TLB limited.

There was some comparison done between large-page vs. plain
page coloring for a bunch of scientific number crunchers.

Only one benefitted from page coloring and not from TLB
superpage use.

The ones that benefitted from both coloring and superpages, the
superpage gain was about equal to the coloring gain. Basically,
superpages ended up giving the necessary coloring :-)

Search for the topic "Areas for superpage discussion" in the
sparc...@vger.kernel.org list archives, it has pointers to
all the patches and test programs involved.

David S. Miller

unread,

Aug 4, 2002, 12:48:34 AM8/4/02

to fra...@watson.ibm.com, dav...@hpl.hp.com, dav...@napali.hpl.hp.com, torv...@transmeta.com, g...@us.ibm.com, Martin...@us.ibm.com, w...@holomorpy.com, linux-...@vger.kernel.org

From: Hubertus Franke <fra...@watson.ibm.com>
Date: Sat, 3 Aug 2002 16:53:39 -0400

Does that mean that BSD already has page coloring implemented ?

FreeBSD has had page coloring for quite some time.

Because they don't use buddy lists and don't allow higher-order
allocations fundamentally in the page allocator, they don't have
to deal with all the buddy fragmentation issues we do.

On the other hand, since higher-order page allocations are not
a fundamental operation it might be more difficult for FreeBSD
to implement superpage support efficiently like we can with
the buddy lists.

David S. Miller

unread,

Aug 4, 2002, 12:50:49 AM8/4/02

to fra...@watson.ibm.com, dav...@hpl.hp.com, dav...@napali.hpl.hp.com, torv...@transmeta.com, g...@us.ibm.com, Martin...@us.ibm.com, w...@holomorpy.com, linux-...@vger.kernel.org

From: Hubertus Franke <fra...@watson.ibm.com>
Date: Sat, 3 Aug 2002 17:54:30 -0400

The Rice paper solved this reasonably elegant. Reservation and check
after a while. If you didn't use reserved memory, you loose it, this is the
auto promotion/demotion.

I keep seeing this Rice stuff being mentioned over and over,
can someone post a URL pointer to this work?

David Mosberger

unread,

Aug 4, 2002, 2:27:18 AM8/4/02

to David S. Miller, fra...@watson.ibm.com, dav...@hpl.hp.com, dav...@napali.hpl.hp.com, torv...@transmeta.com, g...@us.ibm.com, Martin...@us.ibm.com, w...@holomorpy.com, linux-...@vger.kernel.org

>>>>> On Sat, 03 Aug 2002 17:35:30 -0700 (PDT), "David S. Miller" <da...@redhat.com> said:

DaveM> From: Hubertus Franke <fra...@watson.ibm.com> Date: Sat,
DaveM> 3 Aug 2002 17:54:30 -0400

DaveM> The Rice paper solved this reasonably elegant. Reservation
DaveM> and check after a while. If you didn't use reserved memory,
DaveM> you loose it, this is the auto promotion/demotion.

DaveM> I keep seeing this Rice stuff being mentioned over and over,
DaveM> can someone post a URL pointer to this work?

Sure thing. It's the first link under "Publications" at this URL:

http://www.cs.rice.edu/~jnavarro/

--david

Linus Torvalds

unread,

Aug 4, 2002, 6:52:58 PM8/4/02

to Hubertus Franke, David S. Miller, dav...@hpl.hp.com, dav...@napali.hpl.hp.com, g...@us.ibm.com, Martin...@us.ibm.com, w...@holomorpy.com, linux-...@vger.kernel.org

On Sun, 4 Aug 2002, Hubertus Franke wrote:
>
> As of the page coloring !
> Can we tweak the buddy allocator to give us this additional functionality?

I would really prefer to avoid this, and get "95% coloring" by just doing
read-ahead with higher-order allocations instead of the current "loop
allocation of one block".

I bet that you will get _practically_ perfect coloring with just two small
changes:

- do_anonymous_page() looks to see if the page tables are empty around
the faulting address (and check vma ranges too, of course), and
optimistically does a non-blocking order-X allocation.

If the order-X allocation fails, we're likely low on memory (this is
_especially_ true since the very fact that we do lots of order-X
allocations will probably actually help keep fragementation down
normally), and we just allocate one page (with a regular GFP_USER this
time).

Map in all pages.

- do the same for page_cache_readahead() (this, btw, is where radix trees
will kick some serious ass - we'd have had a hard time doing the "is
this range of order-X pages populated" efficiently with the old hashes.

I bet just those fairly small changes will give you effective coloring,
_and_ they are also what you want for doing small superpages.

And no, I do not want separate coloring support in the allocator. I think
coloring without superpage support is stupid and worthless (and
complicates the code for no good reason).

Linus

Andrew Morton

unread,

Aug 4, 2002, 7:14:35 PM8/4/02

to Linus Torvalds, Hubertus Franke, David S. Miller, dav...@hpl.hp.com, dav...@napali.hpl.hp.com, g...@us.ibm.com, Martin...@us.ibm.com, w...@holomorpy.com, linux-...@vger.kernel.org

Linus Torvalds wrote:
>
> On Sun, 4 Aug 2002, Hubertus Franke wrote:
> >
> > As of the page coloring !
> > Can we tweak the buddy allocator to give us this additional functionality?
>
> I would really prefer to avoid this, and get "95% coloring" by just doing
> read-ahead with higher-order allocations instead of the current "loop
> allocation of one block".
>
> I bet that you will get _practically_ perfect coloring with just two small
> changes:
>
> - do_anonymous_page() looks to see if the page tables are empty around
> the faulting address (and check vma ranges too, of course), and
> optimistically does a non-blocking order-X allocation.
>
> If the order-X allocation fails, we're likely low on memory (this is
> _especially_ true since the very fact that we do lots of order-X
> allocations will probably actually help keep fragementation down
> normally), and we just allocate one page (with a regular GFP_USER this
> time).
>
> Map in all pages.

This would be a problem for short-lived processes. Because "map in
all pages" also means "zero them out". And I think that performing
a 4k clear_user_highpage() immediately before returning to userspace
is optimal. It's essentialy a cache preload for userspace.

If we instead clear out 4 or 8 pages, we trash a ton of cache and
the chances of userspace _using_ pages 1-7 in the short-term are
lower. We could clear the pages with 7,6,5,4,3,2,1,0 ordering,
but the cache implications of faultahead are still there.

Could we establish the eight pte's but still arrange for pages 1-7
to trap, so the kernel can zero the out at the latest possible time?

> - do the same for page_cache_readahead() (this, btw, is where radix trees
> will kick some serious ass - we'd have had a hard time doing the "is
> this range of order-X pages populated" efficiently with the old hashes.
>

On the nopage path, yes. That memory is cache-cold anyway.

Linus Torvalds

unread,

Aug 4, 2002, 7:43:34 PM8/4/02

to Andrew Morton, Hubertus Franke, David S. Miller, dav...@hpl.hp.com, dav...@napali.hpl.hp.com, g...@us.ibm.com, Martin...@us.ibm.com, w...@holomorpy.com, linux-...@vger.kernel.org

On Sun, 4 Aug 2002, Andrew Morton wrote:
>
> Could we establish the eight pte's but still arrange for pages 1-7
> to trap, so the kernel can zero the out at the latest possible time?

You could do that by marking the pages as being there, but PROT_NONE.

On the other hand, cutting down the number of initial pagefaults (by _not_
doing what you suggest) migth be a bigger speedup for process startup than
the slowdown from occasionally doing unnecessary work.

I suspect that there is some non-zero order-X (probably 2 or 3), where you
just win more than you lose. Even for small programs.

Linus

Rik van Riel

unread,

Aug 4, 2002, 7:44:07 PM8/4/02

to Linus Torvalds, Hubertus Franke, David S. Miller, dav...@hpl.hp.com, dav...@napali.hpl.hp.com, g...@us.ibm.com, Martin...@us.ibm.com, William Lee Irwin III, linux-...@vger.kernel.org

On Sun, 4 Aug 2002, Linus Torvalds wrote:
> On Sun, 4 Aug 2002, Hubertus Franke wrote:
> >
> > As of the page coloring !
> > Can we tweak the buddy allocator to give us this additional functionality?
>
> I would really prefer to avoid this, and get "95% coloring" by just doing
> read-ahead with higher-order allocations instead of the current "loop
> allocation of one block".

OK, now I'm really going to start on some code to try and free
physically contiguous pages when a higher-order allocation comes
in ;)

(well, after this hamradio rpm I started)

cheers,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

Andi Kleen

unread,

Aug 4, 2002, 8:21:17 PM8/4/02

to Andrew Morton, Hubertus Franke, David S. Miller, dav...@hpl.hp.com, dav...@napali.hpl.hp.com, g...@us.ibm.com, Martin...@us.ibm.com, w...@holomorpy.com, linux-...@vger.kernel.org, torv...@transmeta.com

Andrew Morton <ak...@zip.com.au> writes:

> If we instead clear out 4 or 8 pages, we trash a ton of cache and
> the chances of userspace _using_ pages 1-7 in the short-term are
> lower. We could clear the pages with 7,6,5,4,3,2,1,0 ordering,
> but the cache implications of faultahead are still there.

What you could do on modern x86 and probably most other architectures as
well is to clear the faulted page in cache and clear the other pages
with a non temporal write. The non temporal write will go straight
to main memory and not pollute any caches.

When the process accesses it later it has to fetch the zeroes from
main memory. This is probably still faster than a page fault at least
for the first few accesses. It could be more costly when walking the full
page (then the added up cache miss costs could exceed the page fault cost),
but then hopefully the CPU will help by doing hardware prefetch.

It could help or not help, may be worth a try at least :-)

-Andi

William Lee Irwin III

unread,

Aug 4, 2002, 8:27:45 PM8/4/02

to Hubertus Franke, Linus Torvalds, David S. Miller, dav...@hpl.hp.com, dav...@napali.hpl.hp.com, g...@us.ibm.com, Martin...@us.ibm.com, linux-...@vger.kernel.org

On Sun, Aug 04, 2002 at 03:30:24PM -0400, Hubertus Franke wrote:
> As long as the alignments are observed, which you I guess imply by the range.

On Sunday 04 August 2002 02:38 pm, Linus Torvalds wrote:
>> If the order-X allocation fails, we're likely low on memory (this is
>> _especially_ true since the very fact that we do lots of order-X
>> allocations will probably actually help keep fragementation down
>> normally), and we just allocate one page (with a regular GFP_USER this
>> time).

Later on I can redo one of the various online defragmentation things
that went around last October or so if it would help with this.

On Sunday 04 August 2002 02:38 pm, Linus Torvalds wrote:
>> Map in all pages.

>> - do the same for page_cache_readahead() (this, btw, is where radix trees
>> will kick some serious ass - we'd have had a hard time doing the "is
>> this range of order-X pages populated" efficiently with the old hashes.

On Sun, Aug 04, 2002 at 03:30:24PM -0400, Hubertus Franke wrote:
> Hey, we use the radix tree to track page cache mappings for large pages
> particularly for this reason...

Proportion of radix tree populated beneath a given node can be computed
by means of traversals adding up ->count or by incrementally maintaining
a secondary counter for ancestors within the radix tree node. I can look
into this when I go over the path compression heuristics, which would
help the space consumption for access patterns fooling the current one.
Getting physical contiguity out of that is another matter, but the code
can be used for other things (e.g. exec()-time prefaulting) until that's
worked out, and it's not a focus or requirement of this code anyway.

On Sunday 04 August 2002 02:38 pm, Linus Torvalds wrote:
>> I bet just those fairly small changes will give you effective coloring,
>> _and_ they are also what you want for doing small superpages.

On Sun, Aug 04, 2002 at 03:30:24PM -0400, Hubertus Franke wrote:
> The HW TLB case can be extended to not store the same PA in all the PTEs,
> but conceptually carry the superpage concept for the purpose described above.

Pagetable walking gets a tiny hook, not much interesting goes on there.
A specialized wrapper for extracting physical pfn's from the pmd's like
the one for testing whether they're terminal nodes might look more
polished, but that's mostly cosmetic.

Hmm, from looking at the "small" vs. "large" page bits, I have an
inkling this may be relative to the machine size. 256GB boxen will
probably think of 4MB pages as small.

On Sun, Aug 04, 2002 at 03:30:24PM -0400, Hubertus Franke wrote:
> But to go down this route we need the concept of a superpage in the VM,
> not just at TLB time or a hack that throws these things over the fence.

The bit throwing it over the fence is probably still useful, as Oracle
knows what it's doing and I suspect it's largely to dodge pagetable
space consumption OOM'ing machines as opposed to optimizing anything.
It pretty much wants the kernel out of the way aside from as a big bag
of device drivers, so I'm not surprised they're more than happy to have
the MMU in their hands too. The more I think about it, the less related
to superpages it seems. The motive for superpages is 100% TLB, not a
workaround for pagetable OOM.

Cheers,
Bill

Eric W. Biederman

unread,

Aug 5, 2002, 12:05:18 AM8/5/02

to Andi Kleen, Andrew Morton, Hubertus Franke, David S. Miller, dav...@hpl.hp.com, dav...@napali.hpl.hp.com, g...@us.ibm.com, Martin...@us.ibm.com, w...@holomorpy.com, linux-...@vger.kernel.org, torv...@transmeta.com

Andi Kleen <a...@suse.de> writes:

> Andrew Morton <ak...@zip.com.au> writes:
>
> > If we instead clear out 4 or 8 pages, we trash a ton of cache and
> > the chances of userspace _using_ pages 1-7 in the short-term are
> > lower. We could clear the pages with 7,6,5,4,3,2,1,0 ordering,
> > but the cache implications of faultahead are still there.
>
> What you could do on modern x86 and probably most other architectures as
> well is to clear the faulted page in cache and clear the other pages
> with a non temporal write. The non temporal write will go straight
> to main memory and not pollute any caches.

Plus a non temporal write is 3x faster than a write that lands in
the cache on x86 (tested on Athlons, P4, & P3).

> When the process accesses it later it has to fetch the zeroes from
> main memory. This is probably still faster than a page fault at least
> for the first few accesses. It could be more costly when walking the full
> page (then the added up cache miss costs could exceed the page fault cost),
> but then hopefully the CPU will help by doing hardware prefetch.
>
> It could help or not help, may be worth a try at least :-)

Certainly.

Eric

David S. Miller

unread,

Aug 5, 2002, 5:55:22 AM8/5/02

to fra...@watson.ibm.com, torv...@transmeta.com, dav...@hpl.hp.com, dav...@napali.hpl.hp.com, g...@us.ibm.com, Martin...@us.ibm.com, w...@holomorpy.com, linux-...@vger.kernel.org

From: Hubertus Franke <fra...@watson.ibm.com>
Date: Sun, 4 Aug 2002 13:31:24 -0400

Can we tweak the buddy allocator to give us this additional functionality?

Absolutely not, it's a total lose.

I have tried at least 5 times to make it work without fragmenting the
buddy lists to shit. I channege you to code one up that works without
fragmenting things to shreds. Just run an endless kernel build over
and over in a loop for a few hours to a day. If the buddy lists are
not fragmented after these runs, then you have succeeded in my
challenge.

Do not even reply to this email without meeting the challenge as it
will fall on deaf ears. I've been there and I've done that, and at
this point code talks bullshit walks when it comes to trying to
colorize the buddy allocator in a way that actually works and isn't
disgusting.

David S. Miller

unread,

Aug 5, 2002, 5:56:54 AM8/5/02

to torv...@transmeta.com, ak...@zip.com.au, fra...@watson.ibm.com, dav...@hpl.hp.com, dav...@napali.hpl.hp.com, g...@us.ibm.com, Martin...@us.ibm.com, w...@holomorpy.com, linux-...@vger.kernel.org

From: Linus Torvalds <torv...@transmeta.com>
Date: Sun, 4 Aug 2002 12:28:54 -0700 (PDT)

I suspect that there is some non-zero order-X (probably 2 or 3), where you
just win more than you lose. Even for small programs.

Furthermore it would obviously help to enhance the clear_user_page()
interface to handle multiple pages because that would nullify the
startup/finish overhead of the copy loop. (read as: things like TLB
loads and FPU save/restore on some platforms)

David Mosberger

unread,

Aug 5, 2002, 5:00:32 PM8/5/02

to fra...@watson.ibm.com, Linus Torvalds, David S. Miller, dav...@hpl.hp.com, g...@us.ibm.com, Martin...@us.ibm.com, w...@holomorphy.com, linux-...@vger.kernel.org

>>>>> On Sun, 4 Aug 2002 15:30:24 -0400, Hubertus Franke <fra...@watson.ibm.com> said:

Hubertus> Yes, if we (correctly) assume that page coloring only buys
Hubertus> you significant benefits for small associative caches
Hubertus> (e.g. <4 or <= 8).

This seems to be a popular misconception. Yes, page-coloring
obviously plays no role as long as your cache no bigger than
PAGE_SIZE*ASSOCIATIVITY. IIRC, Xeon can have up to 1MB of cache and I
bet that it doesn't have a 1MB/4KB=256-way associative cache. Thus,
I'm quite confident that it's possible to observe significant
page-coloring effects even on a Xeon.

--david

Jamie Lokier

unread,

Aug 5, 2002, 9:13:50 PM8/5/02

to Hubertus Franke, dav...@hpl.hp.com, David Mosberger, Linus Torvalds, David S. Miller, g...@us.ibm.com, Martin...@us.ibm.com, w...@holomorphy.com, linux-...@vger.kernel.org

Hubertus Franke wrote:
> The wording was "significant" benefits. The point is/was that as your
> associativity goes up, the likelihood of full cache occupancy
> increases, with cache thrashing in each class decreasing.
> Would have to dig through the literature to figure out at what point
> the benefits are insignificant (<1 %) wrt page coloring.

One of the benefits of page colouring may be that a program's run time
may be expected to vary less from run to run?

In the old days (6 years ago), I found that a video game I was working
on would vary in its peak frame rate by about 3-5% (I don't recall
exactly). Once the program was started, it would remain operating at
the peak frame rate it had selected, and killing and restarting the
program didn't often make a difference either. In DOS, the same program
always ran at a consistent frame rate (higher than Linux as it happens).
The actual number of objects executing in the program, and the amount of
memory allocated, were deterministic in these tests.

This is pointing at a cache colouring issue to me -- although quite
which cache I am not sure. I suppose it could have been something to do
with Linux' VM page scanner access patterns into the page array instead.

-- Jamie

Seth, Rohit

unread,

Aug 5, 2002, 11:33:15 PM8/5/02

to fra...@watson.ibm.com, Linus Torvalds, David S. Miller, dav...@hpl.hp.com, dav...@napali.hpl.hp.com, g...@us.ibm.com, Martin...@us.ibm.com, w...@holomorphy.com, linux-...@vger.kernel.org

> -----Original Message-----
> From: Hubertus Franke [mailto:fra...@watson.ibm.com]
> Sent: Sunday, August 04, 2002 12:30 PM
> To: Linus Torvalds
> Cc: David S. Miller; dav...@hpl.hp.com; dav...@napali.hpl.hp.com;
> g...@us.ibm.com; Martin...@us.ibm.com; w...@holomorphy.com;
> linux-...@vger.kernel.org
> Subject: Re: large page patch (fwd) (fwd)
>
> Well, in what you described above there is no concept of superpages
> the way it is defined for the purpose of <tracking> and <TLB overhead
> reduction>.
> If you don't know about super pages at the VM level, then you need to
> deal with them at TLB fault level to actually create the <large TLB>
> entry. That what the INTC patch will do, namely throughing all the
> complexity over the fence for the page fault.
Our patch does the preallocation of large pages at the time of request.
There is really nothing special like replicating PTEs (that you mentioned
below in your design) happens there. In any case, even for IA-64 where the
TLBs are also sw controlled (we also have Hardware Page Walker that can walk
any 3rd level pt and insert the PTE in TLB.) there are almost no changes (to
be precise one additional asm instructionin the begining of handler for
shifting extra bits) in our implementation that pollute the low level TLB
fault handlers to have the knowledge of large page size in traversing the
3-level page table. (Though there are couple of other asm instructions that
are added in this low-level routine to set helping register with proper
page_size while inserting bigger TLBs). On IA-32 obviously things fall in
place automagically as the page tables are setup as per arch.

> In your case not keeping track of the super pages in the
> VM layer and PT layer requires to discover the large page at soft TLB
> time by scanning PT proximity for contigous pages if we are
> talking now
> about the read_ahead ....
> In our case, we store the same physical address of the super page
> in the PTEs spanning the superpage together with the page order.
> At software TLB time we simply extra the single PTE from the PT based
> on the faulting address and move it into the TLB. This
> ofcourse works only
> for software TLBs (PwrPC, MIPS, IA64). For HW TLB (x86) the
> PT structure
> by definition overlaps the large page size support.

> The HW TLB case can be extended to not store the same PA in
> all the PTEs,
> but conceptually carry the superpage concept for the purpose
> described above.
>

I'm afraid you may be wasting a lot of extra memory by replicaitng these
PTEs(Take an example of one 4G large TLB size entry and assume there are few
hunderd processes using that same physical page.)

> We have that concept exactly the way you want it, but the dress code
> seems to be wrong. That can be worked on.
> Our goal was in the long run 2.7 to explore the Rice approach to see
> whether it yields benefits or whether we getting down the road of
> fragmentation reduction overhead that will kill all the
> benefits we get
> from reduced TLB overhead. Time would tell.

>
> But to go down this route we need the concept of a superpage
> in the VM,
> not just at TLB time or a hack that throws these things over
> the fence.
>

As others have already said that you may want to have the support of smaller
superpages in this way. Where VM is embeded with some knowledge of
different page sizes that it can support. Demoting and permoting pages from
one size to another (efficiently)will be very critical in the design. In my
opinion supporting the largest TLB on archs (like 256M or 4G) will need more
direct appraoch and less intrusion from kernel VM will be prefered.
Ofcourse, kernel will need to put extra checks etc. to maintain some sanity
for allowed users.

There has already been lot of discussion on this mailing list about what is
the right approach. Whether the new APIs are needed or something like
madvise would do it, whether kernel needs to allocate large_pages
transparently to the user or we should expose the underlying HW feature to
user land. There are issues that favor one approach over another. But the
bottom line is: 1) We should not break anything semantically for regular
system calls that happen to be using large TLBs and 2) The performance
advantage of this HW feature (on most of the archs I hope) is just too much
to let go without notice. I hope we get to consensus for getting this
support in kernel ASAP. This will benefit lot of Linux users. (And yes I
understand that we need to do things right in kernel so that we don't make
unforeseen errors.)

>
> > And no, I do not want separate coloring support in the
> allocator. I think
> > coloring without superpage support is stupid and worthless (and
> > complicates the code for no good reason).
> >
> > Linus
>

> That <stupid> seems premature. You are mixing the concept of
> superpage from a TLB miss reduction perspective
> with the concept of superpage for page coloring.
>
>
I have seen couple of HPC apps that try to fit (configure) in their data
sets on the L3 caches size (Like on IA-64 4M). I think these are the apps
that really get hit hardest by lack of proper page coloring support in Linux
kernel. The performance variation of these workloads from run to run could
be as much as 60% And with the page coloring patch, these apps seems to be
giving consistent higher throuput (The real bad part is that once the
throughput of these workloads drop, it stays down thereafter :( ) But seems
like DavidM has enough real world data that prohibits the use of this
approach in kernel for real world scenarios. The good part of large TLBs is
that, TLBs larger than CPU cache size will automatically get you perfect
page coloring .........for free.

rohit

Eric W. Biederman

unread,

Aug 10, 2002, 6:34:25 PM8/10/02

to Andrew Morton, Linus Torvalds, Daniel Phillips, fra...@watson.ibm.com, dav...@hpl.hp.com, David Mosberger, David S. Miller, g...@us.ibm.com, Martin...@us.ibm.com, w...@holomorpy.com, linux-...@vger.kernel.org

Andrew Morton <ak...@zip.com.au> writes:
>
> The other worry is the ZONE_NORMAL space consumption of pte_chains.
> We've halved that, but it will still make high sharing levels
> unfeasible on the big ia32 machines. We are dependant upon large
> pages to solve that problem. (Resurrection of pte_highmem is in
> progress, but it doesn't work yet).

There is a second method to address this. Pages can be swapped out
of the page tables and still remain in the page cache, the virtual
scan does this all of the time. This should allow for arbitrary
amounts of sharing. There is some overhead, in faulting the pages
back in but it is much better than cases that do not work. A simple
implementation would have a maximum pte_chain length.

For any page that is not backed by anonymous memory we do not need to
keep the pte entries after the page has been swapped of the page
table. Which should show a reduction in page table size. In a highly
shared setting with anonymous pages it is likely worth it to promote
those pages to being posix shared memory.

All of the above should allow us to keep a limit on the amount of
resources that go towards sharing, reducing the need for something
like pte_highmem, and keeping memory pressure down in general.

For the cases you describe I have trouble seeing pte_highmem as
anything other than a performance optimization. Only placing shmem
direct and indirect entries in high memory or in swap can I see as
limit to feasibility.

Eric

Rik van Riel

unread,

Aug 10, 2002, 7:57:46 PM8/10/02

to Eric W. Biederman, Andrew Morton, Linus Torvalds, Daniel Phillips, fra...@watson.ibm.com, dav...@hpl.hp.com, David Mosberger, David S. Miller, g...@us.ibm.com, Martin...@us.ibm.com, William Lee Irwin III, linux-...@vger.kernel.org

On 10 Aug 2002, Eric W. Biederman wrote:
> Andrew Morton <ak...@zip.com.au> writes:
> >
> > The other worry is the ZONE_NORMAL space consumption of pte_chains.
> > We've halved that, but it will still make high sharing levels
> > unfeasible on the big ia32 machines.

> There is a second method to address this. Pages can be swapped out

> of the page tables and still remain in the page cache, the virtual
> scan does this all of the time. This should allow for arbitrary
> amounts of sharing. There is some overhead, in faulting the pages
> back in but it is much better than cases that do not work. A simple
> implementation would have a maximum pte_chain length.

Indeed. We need this same thing for page tables too, otherwise
a high sharing situation can easily "require" more page table
memory than the total amount of physical memory in the system ;)

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

-

Eric W. Biederman

unread,

Aug 10, 2002, 8:08:40 PM8/10/02

to Rik van Riel, Andrew Morton, Linus Torvalds, Daniel Phillips, fra...@watson.ibm.com, dav...@hpl.hp.com, David Mosberger, David S. Miller, g...@us.ibm.com, Martin...@us.ibm.com, William Lee Irwin III, linux-...@vger.kernel.org

Rik van Riel <ri...@conectiva.com.br> writes:

> On 10 Aug 2002, Eric W. Biederman wrote:
> > Andrew Morton <ak...@zip.com.au> writes:
> > >
> > > The other worry is the ZONE_NORMAL space consumption of pte_chains.
> > > We've halved that, but it will still make high sharing levels
> > > unfeasible on the big ia32 machines.
>
> > There is a second method to address this. Pages can be swapped out
> > of the page tables and still remain in the page cache, the virtual
> > scan does this all of the time. This should allow for arbitrary
> > amounts of sharing. There is some overhead, in faulting the pages
> > back in but it is much better than cases that do not work. A simple
> > implementation would have a maximum pte_chain length.
>
> Indeed. We need this same thing for page tables too, otherwise
> a high sharing situation can easily "require" more page table
> memory than the total amount of physical memory in the system ;)

It's exactly the same situation. To remove a pte from the chain you must
remove it from the page table as well. Then we just need to free
pages with no interesting pte entries.
Eric

Alan Cox

unread,

Aug 11, 2002, 7:11:36 PM8/11/02

to Daniel Phillips, fra...@watson.ibm.com, dav...@hpl.hp.com, David Mosberger, David S. Miller, Linus Torvalds, g...@us.ibm.com, Martin...@us.ibm.com, w...@holomorpy.com, linux-...@vger.kernel.org

On Fri, 2002-08-09 at 16:20, Daniel Phillips wrote:
> On Sunday 04 August 2002 19:19, Hubertus Franke wrote:
> > "General Purpose Operating System Support for Multiple Page Sizes"
> > htpp://www.usenix.org/publications/library/proceedings/usenix98/full_papers/ganapathy/ganapathy.pdf
>
> This reference describes roughly what I had in mind for active
> defragmentation, which depends on reverse mapping. The main additional
> wrinkle I'd contemplated is introducing a new ZONE_LARGE, and GPF_LARGE,
> which means the caller promises not to pin the allocation unit for long
> periods and does not mind if the underlying physical page changes
> spontaneously. Defragmenting in this zone is straightforward.

Slight problem. This paper is about a patented SGI method for handling
defragmentation into large pages (6,182,089). They patented it before
the presentation.

They also hold patents on the other stuff that you've recently been
discussing about not keeping seperate rmap structures until there are
more than some value 'n' when they switch from direct to indirect lists
of reverse mappings (6,112,286)

If you are going read and propose things you find on Usenix at least
check what the authors policies on patents are.

Perhaps someone should first of all ask SGI to give the Linux community
permission to use it in a GPL'd operating system ?

Linus Torvalds

unread,

Aug 11, 2002, 11:08:00 PM8/11/02

to Daniel Phillips, Alan Cox, fra...@watson.ibm.com, dav...@hpl.hp.com, David Mosberger, David S. Miller, g...@us.ibm.com, Martin...@us.ibm.com, w...@holomorpy.com, linux-...@vger.kernel.org

On Mon, 12 Aug 2002, Daniel Phillips wrote:
>
> It goes on in this vein. I suggest all vm hackers have a close look at
> this. Yes, it's stupid, but we can't just ignore it.

Actually, we can, and I will.

I do not look up any patents on _principle_, because (a) it's a horrible
waste of time and (b) I don't want to know.

The fact is, technical people are better off not looking at patents. If
you don't know what they cover and where they are, you won't be knowingly
infringing on them. If somebody sues you, you change the algorithm or you
just hire a hit-man to whack the stupid git.

Linus

Linus Torvalds

unread,

Aug 11, 2002, 11:08:54 PM8/11/02

to Daniel Phillips, Alan Cox, fra...@watson.ibm.com, dav...@hpl.hp.com, David Mosberger, David S. Miller, g...@us.ibm.com, Martin...@us.ibm.com, w...@holomorpy.com, linux-...@vger.kernel.org

On Sun, 11 Aug 2002, Linus Torvalds wrote:
>
> If somebody sues you, you change the algorithm or you just hire a
> hit-man to whack the stupid git.

Btw, I'm not a lawyer, and I suspect this may not be legally tenable
advice. Whatever. I refuse to bother with the crap.

Larry McVoy

unread,

Aug 11, 2002, 11:17:20 PM8/11/02

to Linus Torvalds, Daniel Phillips, Alan Cox, fra...@watson.ibm.com, dav...@hpl.hp.com, David Mosberger, David S. Miller, g...@us.ibm.com, Martin...@us.ibm.com, w...@holomorpy.com, linux-...@vger.kernel.org

On Sun, Aug 11, 2002 at 03:55:08PM -0700, Linus Torvalds wrote:
>
> On Mon, 12 Aug 2002, Daniel Phillips wrote:
> >
> > It goes on in this vein. I suggest all vm hackers have a close look at
> > this. Yes, it's stupid, but we can't just ignore it.
>
> Actually, we can, and I will.
>
> I do not look up any patents on _principle_, because (a) it's a horrible
> waste of time and (b) I don't want to know.
>
> The fact is, technical people are better off not looking at patents. If
> you don't know what they cover and where they are, you won't be knowingly
> infringing on them. If somebody sues you, you change the algorithm or you
> just hire a hit-man to whack the stupid git.

This issue is more complicated than you might think. Big companies with
big pockets are very nervous about being too closely associated with
Linux because of this problem. Imagine that IBM, for example, starts
shipping IBM Linux. Somewhere in the code there is something that
infringes on a patent. Given that it is IBM Linux, people can make
the case that IBM should have known and should have fixed it and
since they didn't, they get sued. Notice that IBM doesn't ship
their own version of Linux, they ship / support Red Hat or Suse
(maybe others, doesn't matter). So if they ever get hassled, they'll
vector the problem to those little guys and the issue will likely
get dropped because the little guys have no money to speak of.

Maybe this is all good, I dunno, but be aware that the patents
have long arms and effects.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

Alan Cox

unread,

Aug 11, 2002, 11:27:16 PM8/11/02

to Linus Torvalds, Daniel Phillips, fra...@watson.ibm.com, dav...@hpl.hp.com, David Mosberger, David S. Miller, g...@us.ibm.com, Martin...@us.ibm.com, w...@holomorpy.com, linux-...@vger.kernel.org

On Sun, 2002-08-11 at 23:56, Linus Torvalds wrote:
>
> On Sun, 11 Aug 2002, Linus Torvalds wrote:
> >
> > If somebody sues you, you change the algorithm or you just hire a
> > hit-man to whack the stupid git.
>
> Btw, I'm not a lawyer, and I suspect this may not be legally tenable
> advice. Whatever. I refuse to bother with the crap.

In which case you might as well do the rest of the world a favour and
restrict US usage of Linux in the license file while you are at it.
Unfortunately the USA forces people to deal with this crap. I'd hope SGI
would be decent enough to explicitly state they will license this stuff
freely for GPL use (although having shipping Linux themselves the
question is partly moot as the GPL says they can't impose additional
restrictions)

Alan

William Lee Irwin III

unread,

Aug 11, 2002, 11:43:17 PM8/11/02

to Linus Torvalds, Daniel Phillips, Alan Cox, fra...@watson.ibm.com, dav...@hpl.hp.com, David Mosberger, David S. Miller, g...@us.ibm.com, Martin...@us.ibm.com, linux-...@vger.kernel.org

On Sun, 11 Aug 2002, Linus Torvalds wrote:
>> If somebody sues you, you change the algorithm or you just hire a
>> hit-man to whack the stupid git.

On Sun, Aug 11, 2002 at 03:56:10PM -0700, Linus Torvalds wrote:
> Btw, I'm not a lawyer, and I suspect this may not be legally tenable
> advice. Whatever. I refuse to bother with the crap.

I'm not really sure what to think of all this patent stuff myself, but
I may need to get some directions from lawyerish types before moving on
here. OTOH I certainly like the suggested approach more than my
conservative one, even though I'm still too chicken to follow it. =)

On a more practical note, though, someone left out an essential 'h'
from my email address. Please adjust the cc: list. =)

Thanks,
Bill

Rik van Riel

unread,

Aug 11, 2002, 11:45:24 PM8/11/02

to Alan Cox, Linus Torvalds, Daniel Phillips, fra...@watson.ibm.com, dav...@hpl.hp.com, David Mosberger, David S. Miller, g...@us.ibm.com, Martin...@us.ibm.com, William Lee Irwin III, linux-...@vger.kernel.org

On 12 Aug 2002, Alan Cox wrote:

> Unfortunately the USA forces people to deal with this crap. I'd hope SGI
> would be decent enough to explicitly state they will license this stuff
> freely for GPL use

I seem to remember Apple having a clause for this in
their Darwin sources, forbidding people who contribute
code from suing them about patent violations due to
the code they themselves contributed.

kind regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

-

Larry McVoy

unread,

Aug 11, 2002, 11:52:19 PM8/11/02

to Rik van Riel, Alan Cox, Linus Torvalds, Daniel Phillips, fra...@watson.ibm.com, dav...@hpl.hp.com, David Mosberger, David S. Miller, g...@us.ibm.com, Martin...@us.ibm.com, William Lee Irwin III, linux-...@vger.kernel.org

On Sun, Aug 11, 2002 at 08:42:16PM -0300, Rik van Riel wrote:
> On 12 Aug 2002, Alan Cox wrote:
>
> > Unfortunately the USA forces people to deal with this crap. I'd hope SGI
> > would be decent enough to explicitly state they will license this stuff
> > freely for GPL use
>
> I seem to remember Apple having a clause for this in
> their Darwin sources, forbidding people who contribute
> code from suing them about patent violations due to
> the code they themselves contributed.

IBM has a fantastic clause in their open source license. The license grants
you various rights to use, etc., and then goes on to say something in
the termination section (I think) along the lines of

In the event that You or your affiliates instigate patent, trademark,
and/or any other intellectual property suits, this license terminates
as of the filing date of said suit[s].

You get the idea. It's basically "screw me, OK, then screw you too" language.

--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

Linus Torvalds

unread,

Aug 12, 2002, 1:39:40 AM8/12/02

to Larry McVoy, Daniel Phillips, Alan Cox, fra...@watson.ibm.com, dav...@hpl.hp.com, David Mosberger, David S. Miller, g...@us.ibm.com, Martin...@us.ibm.com, w...@holomorpy.com, linux-...@vger.kernel.org

On Sun, 11 Aug 2002, Larry McVoy wrote:
>
> This issue is more complicated than you might think.

No, it's not. You miss the point.

> Big companies with
> big pockets are very nervous about being too closely associated with
> Linux because of this problem.

The point being that that is _their_ problem, and at a level that has
nothing to do with technology.

I'm saying that technical people shouldn't care. I certainly don't. The
people who _should_ care are patent attourneys etc, since they actually
get paid for it, and can better judge the matter anyway.

Everybody in the whole software industry knows that any non-trivial
program (and probably most trivial programs too, for that matter) will
infringe on _some_ patent. Ask anybody. It's apparently an accepted fact,
or at least a saying that I've heard too many times.

I just don't care. Clearly, if all significant programs infringe on
something, the issue is no longer "do we infringe", but "is it an issue"?

And that's _exactly_ why technical people shouldn't care. The "is it an
issue" is not something a technical guy can answer, since the answer
depends on totally non-technical things.

Ask your legal counsel, and I strongly suspect that if he is any good, he
will tell you the same thing. Namely that it's _his_ problem, and that
your engineers should not waste their time trying to find existing
patents.

Linus

Larry McVoy

unread,

Aug 12, 2002, 5:08:32 AM8/12/02

to Linus Torvalds, Larry McVoy, Daniel Phillips, Alan Cox, fra...@watson.ibm.com, dav...@hpl.hp.com, David Mosberger, David S. Miller, g...@us.ibm.com, Martin...@us.ibm.com, w...@holomorpy.com, linux-...@vger.kernel.org

> Ask your legal counsel, and I strongly suspect that if he is any good, he
> will tell you the same thing. Namely that it's _his_ problem, and that
> your engineers should not waste their time trying to find existing
> patents.

Partially true for us. We do do patent searches to make sure we aren't
doing anything blatently stupid.

I do agree with you 100% that it is impossible to ship any software that
does not infringe on some patent. It's a big point of contention in
contract negotiations because everyone wants you to warrant that your
software doesn't infringe and indemnify them if it does.

--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

Alan Cox

unread,

Aug 12, 2002, 9:13:16 AM8/12/02

to Linus Torvalds, Larry McVoy, Daniel Phillips, fra...@watson.ibm.com, dav...@hpl.hp.com, David Mosberger, David S. Miller, g...@us.ibm.com, Martin...@us.ibm.com, w...@holomorpy.com, linux-...@vger.kernel.org

On Mon, 2002-08-12 at 02:26, Linus Torvalds wrote:
> Ask your legal counsel, and I strongly suspect that if he is any good, he
> will tell you the same thing. Namely that it's _his_ problem, and that
> your engineers should not waste their time trying to find existing
> patents.

Wasn't a case of wasting time. That one is extremely well known because
there were upset people when SGI patented it and then submitted a usenix
paper on it.

Helge Hafting

unread,

Aug 12, 2002, 9:23:21 AM8/12/02

to Rik van Riel, linux-...@vger.kernel.org

Rik van Riel wrote:

> One problem we're running into here is that there are absolutely
> no tools to measure some of the things rmap is supposed to fix,
> like page replacement.
>
There are things like running vmstat while running tests or production.

My office desktop machine (256M RAM) rarely swaps more than 10M
during work with 2.5.30. It used to go some 70M into swap
after a few days of writing, browsing, and those updatedb runs.

Helge Hafting

Bill Davidsen

unread,

Aug 13, 2002, 3:23:34 AM8/13/02

to Helge Hafting, Rik van Riel, linux-...@vger.kernel.org

On Mon, 12 Aug 2002, Helge Hafting wrote:

> Rik van Riel wrote:
>
> > One problem we're running into here is that there are absolutely
> > no tools to measure some of the things rmap is supposed to fix,
> > like page replacement.
> >
> There are things like running vmstat while running tests or production.
>
> My office desktop machine (256M RAM) rarely swaps more than 10M
> during work with 2.5.30. It used to go some 70M into swap
> after a few days of writing, browsing, and those updatedb runs.

Now tell us how someone who isn't a VM developer can tell if that's bad or
good. Is it good because it didn't swap more than it needed to, or bad
because there were more things it could have swapped to make more buffer
room?

Serious question, tuning the -aa VM sometimes makes the swap use higher,
even as the response to starting small jobs while doing kernel compiles or
mkisofs gets better. I don't normally tune -ac kernels much, so I can't
comment there.

--
bill davidsen <davi...@tmr.com>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

Rik van Riel

unread,

Aug 13, 2002, 3:34:29 AM8/13/02

to Bill Davidsen, Helge Hafting, linux-...@vger.kernel.org

On Mon, 12 Aug 2002, Bill Davidsen wrote:

> Now tell us how someone who isn't a VM developer can tell if that's bad
> or good. Is it good because it didn't swap more than it needed to, or
> bad because there were more things it could have swapped to make more
> buffer room?

Good point, just looking at the swap usage doesn't mean
much because we're interested in the _consequences_ of
that number and not in the number itself.

> Serious question, tuning the -aa VM sometimes makes the swap use higher,
> even as the response to starting small jobs while doing kernel compiles
> or mkisofs gets better. I don't normally tune -ac kernels much, so I
> can't comment there.

The key word here is "response", benchmarks really need
to be able to measure responsiveness.

Some benchmarks (eg. irman by Bob Matthews) do this
already, but we're still focussing too much on throughput.

In 1990 Margo Selzer wrote an excellent paper on disk IO
sorting and its effects on throughput and latency. The
end result was that in order to get decent throughput by
doing just disk IO sorting you would need queues so deep
that IO latency would grow to about 30 seconds. ;)

Of course, if databases or online shops would increase
their throughput by going to deep queueing and every
request would get 30 second latencies ... they would
immediately lose their users (or customers) !!!

I'm pretty convinced that sysadmins aren't interested
in throughput, at least not until throughput is so low
that it starts affecting system response latency.

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

-

Helge Hafting

unread,

Aug 13, 2002, 7:27:33 AM8/13/02

to Bill Davidsen, Rik van Riel, linux-...@vger.kernel.org

Bill Davidsen wrote:
>
> On Mon, 12 Aug 2002, Helge Hafting wrote:

> > My office desktop machine (256M RAM) rarely swaps more than 10M
> > during work with 2.5.30. It used to go some 70M into swap
> > after a few days of writing, browsing, and those updatedb runs.
>
> Now tell us how someone who isn't a VM developer can tell if that's bad or
> good. Is it good because it didn't swap more than it needed to, or bad
> because there were more things it could have swapped to make more buffer
> room?

It feels more responsive too - which is no surprise. Like most users,
I don't _expect_ to wait for swapin when pressing a key or something.
Waiting for file io seems to be less of a problem, that stuff
_is_ on disk after all. I guess many people who knows a little about
computers feel this way. People that don't know what a "disk" is
may be different and more interested in total waiting.

On the serious side: vmstat provides more than swap info. It also
lists block io, where one might see if the block io goes up or down.
I suggest to find some repeatable workload with lots of file & swap
io, and see how much we get of each. My guess is that rmap
results in less io to to the same job. Not only swap io, but
swap+file io too. The design is more capable of selecting
the _right_ page to evict. (Assuming that page usage may
tell us something useful.) So the only questions left is
if the current implementation is good, and if the
improved efficiency makes up for the memory overhead.

>
> Serious question, tuning the -aa VM sometimes makes the swap use higher,
> even as the response to starting small jobs while doing kernel compiles or
> mkisofs gets better. I don't normally tune -ac kernels much, so I can't
> comment there.

Swap is good if there's lots of file io and
lots of unused apps sitting around. And bad if there's a large working
set and little _repeating_ file io. Such as the user switching between
a bunch of big apps working on few files. And perhaps some
non-repeating
io like updatedb or mail processing...

Helge Hafting

Alan Cox

unread,

Aug 13, 2002, 3:14:17 PM8/13/02

to Rob Landley, Daniel Phillips, Larry McVoy, Rik van Riel, Linus Torvalds, fra...@watson.ibm.com, dav...@hpl.hp.com, David Mosberger, David S. Miller, g...@us.ibm.com, Martin...@us.ibm.com, William Lee Irwin III, linux-...@vger.kernel.org

On Tue, 2002-08-13 at 09:40, Rob Landley wrote:
> Unfortunately, the maintainer of the GPL is Stallman, so he's the logical guy
> to spearhead a "GPL patent pool" project, but any time anybody mentions the
> phrase "intellectual property" to him he goes off on a tangent about how you
> shouldn't call anything "intellectual property", so how can you have a
> discussion about it, and nothing ever gets done. It's FRUSTRATING to see
> somebody with such brilliant ideas hamstrung not just by idealism, but
> PEDANTIC idealism.
>

Richard isnt daft on this one. The FSF does not have the 30 million
dollars needed to fight a *single* US patent lawsuit. The problem also
reflects back on things like Debian, because Debian certainly cannot
afford to play the patent game either.

Linus Torvalds

unread,

Aug 13, 2002, 5:05:01 PM8/13/02

to Rob Landley, Alan Cox, Daniel Phillips, Larry McVoy, Rik van Riel, fra...@watson.ibm.com, dav...@hpl.hp.com, David Mosberger, David S. Miller, g...@us.ibm.com, Martin...@us.ibm.com, William Lee Irwin III, linux-...@vger.kernel.org

On Tue, 13 Aug 2002, Rob Landley wrote:
>
> Last time I really looked into all this, Stallman was trying to do an
> enormous new GPL 3.0, addressing application service providers. That seems
> to have fallen though (as has the ASP business model), but the patent issue
> remains unresolved.

At least one problem is exactly the politics played by the FSF, which
means that a lot of people (not just me), do not trust such new versions
of the GPL. Especially since the last time this happened, it all happened
in dark back-rooms, and I got to hear about it not off any of the lists,
but because I had an insider snitch on it.

I lost all respect I had for the FSF due to its sneakiness.

The kernel explicitly states that it is under the _one_ particular version
of the "GPL v2" that is included with the kernel. Exactly because I do not
want to have politics dragged into the picture by an external party (and
I'm anal enough that I made sure that "version 2" cannot be misconstrued
to include "version 2.1".

Also, a license is a two-way street. I do not think it is morally right to
change an _existing_ license for any other reason than the fact that it
has some technical legal problem. I intensely dislike the fact that many
people seem to want to extend the current GPL as a way to take advantage
of people who used the old GPL and agreed with _that_ - but not
necessarily the new one.

As a result, every time this comes up, I ask for any potential new
"patent-GPL" to be a _new_ license, and not try to feed off existing
works. Please dopn't make it "GPL". Make it the GPPL for "General Public
Patent License" or something. And let people buy into it on its own
merits, not on some "the FSF decided unilaterally to make this decision
for us".

I don't like patents. But I absolutely _hate_ people who play politics
with other peoples code. Be up-front, not sneaky after-the-fact.

Linus

Ruth Ivimey-Cook

unread,

Aug 13, 2002, 5:19:13 PM8/13/02

to Linus Torvalds, Rob Landley, Alan Cox, Daniel Phillips, Larry McVoy, Rik van Riel, fra...@watson.ibm.com, dav...@hpl.hp.com, David Mosberger, David S. Miller, g...@us.ibm.com, Martin...@us.ibm.com, William Lee Irwin III, linux-...@vger.kernel.org

On Tue, 13 Aug 2002, Linus Torvalds wrote:
>I don't like patents. But I absolutely _hate_ people who play politics
>with other peoples code. Be up-front, not sneaky after-the-fact.

Well said :-)

Ruth

--
Ruth Ivimey-Cook
Software engineer and technical writer.

Rik van Riel

unread,

Aug 13, 2002, 5:33:41 PM8/13/02

to Linus Torvalds, Rob Landley, Alan Cox, Daniel Phillips, Larry McVoy, fra...@watson.ibm.com, dav...@hpl.hp.com, David Mosberger, David S. Miller, g...@us.ibm.com, Martin...@us.ibm.com, William Lee Irwin III, linux-...@vger.kernel.org

On Tue, 13 Aug 2002, Linus Torvalds wrote:

> Also, a license is a two-way street. I do not think it is morally right
> to change an _existing_ license for any other reason than the fact that
> it has some technical legal problem.

Agreed, but we might be running into one of these.

> I don't like patents. But I absolutely _hate_ people who play politics
> with other peoples code. Be up-front, not sneaky after-the-fact.

Suppose somebody sends you a patch which implements a nice
algorithm that just happens to be patented by that same
somebody. You don't know about the patent.

You integrate the patch into the kernel and distribute it,
one year later you get sued by the original contributor of
that patch because you distribute code that is patented by
that person.

Not having some protection in the license could open you
up to sneaky after-the-fact problems.

Having a license that explicitly states that people who
contribute and use Linux shouldn't sue you over it might
prevent some problems.

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

-

Alexander Viro

unread,

Aug 13, 2002, 5:49:18 PM8/13/02

to Rik van Riel, Linus Torvalds, Rob Landley, Alan Cox, Daniel Phillips, Larry McVoy, fra...@watson.ibm.com, dav...@hpl.hp.com, David Mosberger, David S. Miller, g...@us.ibm.com, Martin...@us.ibm.com, William Lee Irwin III, linux-...@vger.kernel.org

On Tue, 13 Aug 2002, Rik van Riel wrote:

> Suppose somebody sends you a patch which implements a nice
> algorithm that just happens to be patented by that same
> somebody. You don't know about the patent.
>
> You integrate the patch into the kernel and distribute it,
> one year later you get sued by the original contributor of
> that patch because you distribute code that is patented by
> that person.
>
> Not having some protection in the license could open you
> up to sneaky after-the-fact problems.

Accepting non-trivial patches from malicious source means running code
from malicious source on your boxen. In kernel mode. And in that case
patents are the least of your troubles...

Rik van Riel

unread,

Aug 13, 2002, 6:03:21 PM8/13/02

to Linus Torvalds, Rob Landley, Alan Cox, Daniel Phillips, Larry McVoy, fra...@watson.ibm.com, dav...@hpl.hp.com, David Mosberger, David S. Miller, g...@us.ibm.com, Martin...@us.ibm.com, William Lee Irwin III, linux-...@vger.kernel.org

On Tue, 13 Aug 2002, Linus Torvalds wrote:
> On Tue, 13 Aug 2002, Rik van Riel wrote:
> >

> > Having a license that explicitly states that people who
> > contribute and use Linux shouldn't sue you over it might
> > prevent some problems.
>

> The thing is, if you own the patent, and you sneaked the code into the
> kernel, you will almost certainly be laughed out of court for trying to
> enforce it.

Apparently not everybody agrees on this:

http://zdnet.com.com/2100-1106-884681.html

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

-

Linus Torvalds

unread,

Aug 13, 2002, 6:10:06 PM8/13/02

to Rik van Riel, Rob Landley, Alan Cox, Daniel Phillips, Larry McVoy, fra...@watson.ibm.com, dav...@hpl.hp.com, David Mosberger, David S. Miller, g...@us.ibm.com, Martin...@us.ibm.com, William Lee Irwin III, linux-...@vger.kernel.org

On Tue, 13 Aug 2002, Rik van Riel wrote:
>
> Having a license that explicitly states that people who
> contribute and use Linux shouldn't sue you over it might
> prevent some problems.

The thing is, if you own the patent, and you sneaked the code into the

kernel, you will almost certainly be laughed out of court for trying to
enforce it.

And if somebody else owns the patent, no amount of copyright license makes
any difference.

Linus

Linus Torvalds

unread,

Aug 13, 2002, 6:46:42 PM8/13/02

to Rob Landley, Rik van Riel, Alan Cox, Daniel Phillips, Larry McVoy, fra...@watson.ibm.com, dav...@hpl.hp.com, David Mosberger, David S. Miller, g...@us.ibm.com, Martin...@us.ibm.com, William Lee Irwin III, linux-...@vger.kernel.org

On Tue, 13 Aug 2002, Rob Landley wrote:
>
> > Having a license that explicitly states that people who
> > contribute and use Linux shouldn't sue you over it might
> > prevent some problems.
>

> Such a clause is what IBM insisted on having in ITS open source license. You
> sue, your rights under this license terminate, which is basically automatic
> grounds for a countersuit for infringement.

Note that I personally think the "you screw with me, I screw with you"
approach is a fine one. After all, the GPL is based on "you help me, I'll
help you", so it fits fine.

However, it doesn't work due to the distributed nature of the GPL. The FSF
tried to do something like it in the GPL 3.0 discussions, and the end
result was a total disaster. The GPL 3.0 suggestion was something along
the lines of "you sue any GPL project, you lose all GPL rights". Which to
me makes no sense at all - I could imagine that there might be some GPL
project out there that _deserves_ getting sued(*) and it has nothing to do
with Linux.

Linus

(*) "GNU Emacs, the defendent, did inefariously conspire to play
towers-of-hanoy, while under the guise of a harmless editor".

bill davidsen

unread,

Aug 22, 2002, 12:11:18 PM8/22/02

to linux-...@vger.kernel.org

In article <Pine.LNX.4.44L.02081...@imladris.surriel.com>,

Rik van Riel <ri...@conectiva.com.br> wrote:

| Suppose somebody sends you a patch which implements a nice
| algorithm that just happens to be patented by that same
| somebody. You don't know about the patent.
|
| You integrate the patch into the kernel and distribute it,
| one year later you get sued by the original contributor of
| that patch because you distribute code that is patented by
| that person.
|
| Not having some protection in the license could open you
| up to sneaky after-the-fact problems.
|
| Having a license that explicitly states that people who
| contribute and use Linux shouldn't sue you over it might
| prevent some problems.

Unlikely as this is, since offering the patch would probably be
(eventually) interpreted as giving you the right to use it under GPL, I
think this is a valid concern.

Maybe some lawyer could add the required words and it could become the
LFSL v1.0 (Linux Free Software License). Although FSF would probably buy
into a change if the alternative was creation of a Linux license. There
are people there who are in touch with reality.

--
bill davidsen <davi...@tmr.com>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.