Note describing poor dcache utilization under high memory pressure

Josh MacDonald

Jan 28, 2002, 12:20:12 PM

When memory pressure becomes high, the Linux kswapd begins calling
shrink_caches() from try_to_free_pages() with an integer priority from
6 (the default, lowest priority) to 1 (high priority). Looking
specifically at the dcache, this results in calls to
shrink_dcache_memory() that attempt to free a fraction (1/priority) of
the inactive dcache entries. This ultimately leads to prune_dcache()
scanning the dcache in least-recently-used order attempting to call
kmem_cache_free() on some number of dcache entries.

Dcache entries are allocated from the kmem_slab_cache, which manages
objects in page-size "slabs", but the kmem_slab_cache cannot free a
page until every object in a slab becomes unused. The problem is that
freeing dcache entries in LRU-order is effectively freeing entries
from randomly-selected slabs, and therefore applying shrink_caches()
pressure to the dcache has an undesired result. In the attempt to
reduce its size, the dcache must free objects from random slabs in
order to actually release full pages. The result is that under high
memory pressure the dcache utilization drops dramatically. The
prune_dcache() mechanism doesn't just reduce the page utilization as
desired, it reduces the intra-page utilization, which is bad.
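
The effect described above can be reproduced with a toy simulation. The
sketch below is not kernel code. It assumes 30 dentries per one-page slab
(3150 objects in 105 pages, as in the /proc/slabinfo output quoted further
down), models LRU order as a random permutation of the objects, and frees
1/priority of the remaining entries per pass. The name simulate() and its
parameters are invented for illustration.

```python
import random

OBJS_PER_SLAB = 30  # assumption: 3150 objects / 105 one-page slabs


def simulate(num_slabs=339, priority=2, rounds=1, seed=0):
    """Free dentries in an order unrelated to slab placement.

    Returns (live objects, pages still in use, utilization of those pages).
    """
    rng = random.Random(seed)
    # live[s] counts the in-use objects in slab s; start fully packed.
    live = [OBJS_PER_SLAB] * num_slabs
    # One list entry per object, holding the slab it lives in. Shuffling
    # models an LRU order that has nothing to do with slab membership.
    lru = [s for s in range(num_slabs) for _ in range(OBJS_PER_SLAB)]
    rng.shuffle(lru)
    for _ in range(rounds):
        count = len(lru) // priority  # free 1/priority of the entries
        for slab in lru[:count]:
            live[slab] -= 1
        lru = lru[count:]
    pages_in_use = sum(1 for n in live if n > 0)
    objects = sum(live)
    capacity = pages_in_use * OBJS_PER_SLAB
    utilization = objects / capacity if capacity else 1.0
    return objects, pages_in_use, utilization


objects, pages, utilization = simulate(priority=2)
print(objects, pages, round(utilization, 2))
```

Under these assumptions, freeing half of the entries releases almost no
pages, because a page is only released once all 30 of its objects happen
to be chosen, so utilization falls to roughly 50%.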

In order to measure this effect (via /proc/slabinfo) I first populated
a large dcache and then ran a memory-hog to force swapping to occur.
The dcache utilization drops to between 20% and 35%. For example, before
running the memory-hog my dcache reports:

dentry_cache 10170 10170 128 339 339 1 : 252 126

(i.e., 10170 active dentry objects, 10170 available dentry objects @
128 bytes each, 339 pages with at least one object, and 339 allocated
pages, an approximately 1.4MB dcache)

While running the memory-hog program to initiate swapping, the dcache
stands at:

dentry_cache 693 3150 128 105 105 1 : 252 126

Meaning, the randomly-applied cache pressure was successful at freeing
234 (= 339-105) pages, leaving a 430KB dcache, but at the same time it
reduced the cache utilization to 22%, meaning that although it was
able to free nearly 1MB of space, 335KB are now wasted as a result of
the high memory-pressure condition.
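
The arithmetic can be checked directly from the two /proc/slabinfo lines.
This is a small sketch that assumes 4096-byte pages and reads the first
six numeric columns the way the note above describes them; the sixth
column is taken to be pages per slab, which is an assumption. The helper
names parse() and summarize() are made up.

```python
from collections import namedtuple

PAGE_SIZE = 4096  # assumption: i386 page size

Slab = namedtuple(
    "Slab",
    "name active_objs total_objs obj_size active_slabs total_slabs pages_per_slab",
)


def parse(line):
    """Parse the columns to the left of the ':' in a slabinfo line."""
    fields = line.split(":")[0].split()
    return Slab(fields[0], *map(int, fields[1:7]))


def summarize(slab):
    pages = slab.total_slabs * slab.pages_per_slab
    return {
        "pages": pages,
        "cache_bytes": pages * PAGE_SIZE,
        "utilization": slab.active_objs / slab.total_objs,
        # Bytes held by allocated-but-unused object slots.
        "idle_bytes": (slab.total_objs - slab.active_objs) * slab.obj_size,
    }


before = summarize(parse("dentry_cache 10170 10170 128 339 339 1 : 252 126"))
after = summarize(parse("dentry_cache 693 3150 128 105 105 1 : 252 126"))

pages_freed = before["pages"] - after["pages"]   # 234
print(pages_freed, after["cache_bytes"], after["utilization"], after["idle_bytes"])
# 234 pages freed, 430080 bytes remaining, utilization 0.22, 314496 idle bytes
```

Counting only unused object slots gives about 314KB idle; the 335KB figure
above corresponds to 78% of the full 430KB, which also includes per-slab
overhead.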

So, it would seem that the dcache and kmem_slab_cache memory allocator
could benefit from a way to shrink the dcache in a less random way.
Any thoughts?

-josh

--
PRCS version control system http://sourceforge.net/projects/prcs
Xdelta storage & transport http://sourceforge.net/projects/xdelta
Need a concurrent skip list? http://sourceforge.net/projects/skiplist

Linus Torvalds

Jan 28, 2002, 1:30:23 PM

On Mon, 28 Jan 2002, Rik van Riel wrote:
>
> I'd be interested to know exactly how much overhead -rmap is
> causing for both page faults and fork (but I'm sure one of
> the regular benchmarkers can figure that one out while I fix
> the RSS limit stuff ;))

I doubt it is noticeable on page faults (the cost of maintaining the list
at COW should be basically zero compared to all the other costs), but I've
seen several people reporting fork() overheads of ~300% or so.

Which is not that surprising, considering that most of the fork overhead
by _far_ is the work to copy the page tables, and rmap makes them three
times larger or so.

And I agree that COW'ing the page tables may not actually help. But it
might be worth it even _without_ rmap, so it's worth a look.

(Also, I'd like to understand why some people report so much better times
on dbench, and some people report so much _worse_ times with dbench.
Admittedly dbench is a horrible benchmark, but still.. Is it just the
elevator breakage, or is it rmap itself?)

Linus

Rik van Riel

Jan 28, 2002, 1:40:18 PM

On Mon, 28 Jan 2002, Linus Torvalds wrote:
> On Mon, 28 Jan 2002, Rik van Riel wrote:
> >
> > I'd be interested to know exactly how much overhead -rmap is
> > causing for both page faults and fork (but I'm sure one of
> > the regular benchmarkers can figure that one out while I fix
> > the RSS limit stuff ;))
>
> I doubt it is noticeable on page faults (the cost of maintaining the list
> at COW should be basically zero compared to all the other costs), but I've
> seen several people reporting fork() overheads of ~300% or so.

Dave McCracken has tested with applications of different
sizes and has found fork() speed differences of 10% for
small applications up to 400% for a 10 MB (IIRC) program.

This was with some debugging code enabled, however...
(some of the debugging code I've only disabled now)

> Which is not that surprising, considering that most of the fork overhead
> by _far_ is the work to copy the page tables, and rmap makes them three
> times larger or so.

For dense page tables they'll be 3 times larger, but for a page
table which is only 10% occupied (eg. bash with 1.5 MB spread
over executable+data, libraries and stack) the space overhead is
much smaller.

The amount of RAM touched in fork() is mostly tripled though, if
the program is completely resident, because fork() follows VMA
boundaries.

> And I agree that COW'ing the page tables may not actually help. But it
> might be worth it even _without_ rmap, so it's worth a look.

Absolutely, this is something to try...

> (Also, I'd like to understand why some people report so much better
> times on dbench, and some people report so much _worse_ times with
> dbench. Admittedly dbench is a horrible benchmark, but still.. Is it
> just the elevator breakage, or is it rmap itself?)

We're still looking into this. William Irwin is running a
nice script to see if the settings in /proc/sys/vm/bdflush
have an observable influence on dbench.

Another thing which could have to do with decreased dbench
and increased tiobench performance is drop behind vs. use-once.
It turns out drop behind is better able to sustain IO streams
of different speeds and can fit more IO streams in the same
amount of cache (people running very heavily loaded ftp or
web download servers can find a difference here).

For the interested parties, I've put some text and pictures of
this phenomenon online at:

http://linux-mm.org/wiki/moin.cgi/StreamingIo

It basically comes down to the fact that use-once degrades into
FIFO, which isn't too efficient when different programs do IO
at different speeds. I'm not sure how this is supposed to
affect dbench, but it could have an influence...

regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

William Lee Irwin III

Jan 28, 2002, 2:30:17 PM

> On Mon, 28 Jan 2002, Linus Torvalds wrote:
>> (Also, I'd like to understand why some people report so much better
>> times on dbench, and some people report so much _worse_ times with
>> dbench. Admittedly dbench is a horrible benchmark, but still.. Is it
>> just the elevator breakage, or is it rmap itself?)

On Mon, Jan 28, 2002 at 04:37:02PM -0200, Rik van Riel wrote:
> We're still looking into this. William Irwin is running a
> nice script to see if the settings in /proc/sys/vm/bdflush
> have an observable influence on dbench.

They have observable effects, but the preliminary results (the
runs are so time-intensive the repetitions needed to average out
dbench's wild fluctuations are going to take a while) seem to
give me three intuitions:

(1) the bdflush logic is brittle and/or not sufficiently adaptive
(2) dbench results fluctuate so wildly it's difficult to
reproduce results accurately
(3) dbench results fluctuate so wildly it obscures the true
performance curve

The winner of the first round seems to be 7 0 0 0 500 3000 28 0 0

Cheers,
Bill

Rick Stevens

Jan 28, 2002, 4:40:17 PM

Daniel Phillips wrote:
[snip]

> I've been a little slow to 'publish' this on lkml because I wanted a working
> prototype first, as proof of concept. My efforts to dragoon one or more of
> the notably capable kernelnewbies crowd into coding it haven't been
> particularly successful, perhaps due to the opacity of the code in question
> (pgtable.h et al). So I've begun coding it myself, and it's rather slow
> going, again because of the opacity of the code. Oh, and the difficult
> nature of the problem itself, since it requires understanding pretty much all
> of Unix memory management semantics first, including the bizarre (and useful)
> properties of process forking.
>
> The good news is, the code changes required do fit very cleanly into the
> current design. Changes are required in three places I've identified so far:
>
> copy_page_range:
> Instead of copying the page tables, just increment their use counts
>
> zap_page_range:
> If a page table is shared according to its use count, just decrement
> the use count and otherwise leave it alone.
>
> handle_mm_fault:
> If a page table is shared according to its use count and the faulting
> instruction is a write, allocate a new page table and do the work that
> would have normally been done by copy_page_range at fork time.
> Decrement the use count of the (perhaps formerly) shared page table.

Perhaps I'm missing this, but I read that as the child gets a reference
to the parent's memory. If the child attempts a write, then new memory
is allocated, data copied and the write occurs to this new memory. As
I read this, it's only invoked on a child write.

Would this not leave a hole where the parent could write and, since the
child shares that memory, the new data would be read by the child? Sort
of a hidden shm segment? If so, I think we've got problems brewing.
Now, if a parent write causes the same behaviour as a child write, then
my point is moot.

Could you clarify this for me? I'm probably way off base here.

----------------------------------------------------------------------
- Rick Stevens, SSE, VitalStream, Inc. rste...@vitalstream.com -
- 949-743-2010 (Voice) http://www.vitalstream.com -
- -
- grep me no patterns and I'll tell you no lines -
----------------------------------------------------------------------

Rik van Riel

Jan 28, 2002, 4:50:12 PM

On Mon, 28 Jan 2002, Rick Stevens wrote:
> Daniel Phillips wrote:
> [snip]
[page table COW description]

> Perhaps I'm missing this, but I read that as the child gets a reference
> to the parent's memory. If the child attempts a write, then new memory
> is allocated, data copied and the write occurs to this new memory. As
> I read this, it's only invoked on a child write.
>
> Would this not leave a hole where the parent could write and, since the
> child shares that memory, the new data would be read by the child? Sort
> of a hidden shm segment? If so, I think we've got problems brewing.
> Now, if a parent write causes the same behaviour as a child write, then
> my point is moot.

Daniel and I discussed this issue when Daniel first came up with
the idea of doing page table COW. He seemed a bit confused by
fork semantics when we first discussed this idea, too ;)

You're right though, both parent and child need to react in the
same way, preferably _without_ having to walk all of the parent's
page tables and mark them read-only ...

kind regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/


Daniel Phillips

Jan 28, 2002, 5:20:14 PM

On January 28, 2002 11:01 pm, Momchil Velikov wrote:
> >>>>> "Daniel" == Daniel Phillips <phil...@bonn-fries.net> writes:
>
> Daniel> I'd cheerfully hand this coding effort off to someone more familiar with this
> Daniel> particular neck of the kernel woods - you, Davem and Marcelo come to mind,
> Daniel> but if nobody bites I'll just continue working on it at my own pace. I
>
> BTW, I'm doing just this, working on it at my own pace.

Right, well in a couple of days we can compare notes. I'm a little
embarrassed at the state of the code as of today, I think I'm interpreting
some of those ulongs as things they shouldn't be.

This would be a whole lot easier if those ugly macros in pgtable.h were inlines
with pagetable_t etc. parameters instead.

--
Daniel

Daniel Phillips

Jan 28, 2002, 5:40:18 PM

Notice that the word 'child' is only used in reference to the initial page
directory copy. After that things are symmetric with respect to parent and
child, a fundamental simplification that allows the algorithm to work without
explicit knowledge of the structure of the mm tree (and also simplifies the
locking considerably).

> Would this not leave a hole where the parent could write and, since the
> child shares that memory, the new data would be read by the child? Sort
> of a hidden shm segment? If so, I think we've got problems brewing.
> Now, if a parent write causes the same behaviour as a child write, then
> my point is moot.
>
> Could you clarify this for me? I'm probably way off base here.

Since the page was copied to the child, the child's page table must be
altered, and since it is shared, it must first be instantiated by the child.
So after all the dust settles, the parent and child have their own copies of
a page table page, which differ only at a single location: the child's page
table points at its freshly made CoW copy, and the parent's page table points
at the original page.
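
A toy model may make the symmetry easier to see. The sketch below is
plain Python, not kernel code, and every class and method name in it is
invented. It follows the three steps as described in this thread (share
page tables at fork by bumping a use count, instantiate a private table
on a write fault, then do the ordinary page-level CoW) and also assumes
the detail spelled out later in the thread: the ptes are write-protected
when a table's use count goes from 1 to 2.

```python
class Page:
    def __init__(self, data):
        self.data = data
        self.refs = 1            # number of page tables pointing at this page


class PageTable:
    def __init__(self):
        self.ptes = {}           # index -> [Page, writable flag]
        self.use_count = 1       # number of address spaces sharing this table


class MM:
    def __init__(self):
        self.tables = {}         # page directory slot -> PageTable

    def map(self, slot, idx, data):
        """Install a fresh writable page (only used before any fork here)."""
        self.tables.setdefault(slot, PageTable()).ptes[idx] = [Page(data), True]

    def fork(self):
        child = MM()
        for slot, table in self.tables.items():
            if table.use_count == 1:
                # 1 -> 2 transition: write-protect every pte exactly once.
                for pte in table.ptes.values():
                    pte[1] = False
            table.use_count += 1
            child.tables[slot] = table       # share the table, do not copy it
        return child

    def read(self, slot, idx):
        return self.tables[slot].ptes[idx][0].data

    def write(self, slot, idx, data):
        table = self.tables[slot]
        if table.use_count > 1:
            # Shared table: build a private copy, i.e. the work an eager
            # fork would have done up front, including page ref counts.
            private = PageTable()
            for i, (page, _) in table.ptes.items():
                page.refs += 1
                private.ptes[i] = [page, False]
            table.use_count -= 1
            self.tables[slot] = table = private
        pte = table.ptes[idx]
        if not pte[1]:
            if pte[0].refs > 1:
                # Ordinary page-level CoW: give this table its own page.
                pte[0].refs -= 1
                pte[0] = Page(pte[0].data)
            pte[1] = True
        pte[0].data = data


parent = MM()
parent.map(0, 0, "a")
parent.map(0, 1, "b")
child = parent.fork()
parent.write(0, 1, "B")          # the parent faults exactly as a child would
print(parent.read(0, 1), child.read(0, 1))
```

In this model a write by the parent takes the same path as a write by
the child, and afterwards the two page tables differ at a single entry.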

The beauty of this is, the page table could just as easily have been shared
by a sibling of the child, not the parent at all, in the case that the parent
had already instantiated its own copy of the page table page because of an
earlier CoW.

Confused yet? Welcome to the club ;-)

--
Daniel

Brian Gerst

Jan 28, 2002, 5:40:21 PM

Fortunately, the kernel's page mappings are shared by all processes
(except the top level), so if you mark the page containing the user page
table as read-only from the child, it will also be read-only in the
parent.

--

Brian Gerst

Alex Bligh - linux-kernel

Jan 28, 2002, 6:00:18 PM

--On Monday, 28 January, 2002 9:39 AM -0800 Linus Torvalds
<torv...@transmeta.com> wrote:

> Thus any slab user that wants to, could just register their own per-page
> memory pressure logic.

It might be useful to use a similar type interface to aid
in defragmentation - i.e. 'relocate the stuff on this
(physical) page please, I want it back'. If nothing is
registered, the default mechanism could be just to free it
a la writepage().
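
As a rough illustration of the shape such an interface might take, here
is a sketch in plain Python. All names are hypothetical and nothing here
corresponds to an existing kernel interface: a cache registers a
per-page handler, and a page whose cache registered nothing is simply
emptied.

```python
# Hypothetical names throughout; this is not an existing kernel interface.
_handlers = {}   # cache name -> callable(page) -> True if the page was emptied


def register_page_handler(cache_name, handler):
    _handlers[cache_name] = handler


def default_free(page):
    """Fallback when nothing is registered: drop every object on the page."""
    page["objects"].clear()
    return True


def reclaim_page(page):
    """Ask the owning cache to give this page back."""
    handler = _handlers.get(page["cache"], default_free)
    return handler(page)


def make_relocator(spare_page):
    """Build a handler that moves objects to spare_page instead of freeing."""
    def relocate(page):
        free_slots = spare_page["capacity"] - len(spare_page["objects"])
        if len(page["objects"]) > free_slots:
            return False             # not enough room; the page stays in use
        spare_page["objects"].extend(page["objects"])
        page["objects"].clear()
        return True
    return relocate


spare = {"cache": "dentry_cache", "capacity": 30, "objects": ["d1"]}
victim = {"cache": "dentry_cache", "capacity": 30, "objects": ["d2", "d3"]}
register_page_handler("dentry_cache", make_relocator(spare))
reclaim_page(victim)                 # d2 and d3 move onto the spare page

other = {"cache": "inode_cache", "capacity": 8, "objects": ["i1"]}
reclaim_page(other)                  # nothing registered: objects are freed
```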

--
Alex Bligh

Rick Stevens

Jan 28, 2002, 6:10:18 PM

Daniel Phillips wrote:

> On January 28, 2002 11:00 pm, Rick Stevens wrote:
>
>>I've gotta read up on the kernel's VM system. I used to write them
>>for a certain three-letter-acronymed company--many, many moons ago.
>>Maybe I'd have some ideas. Then again, perhaps not.
>>
>
> Well, you really want to take a trip over to irc.openprojects.net,
> #kernelnewbies, and there you'll find a number of still-current IBM people
> happily helping cook up plots to take over Linu^H^H^H^H the world ;-)

Uh, I never said IBM ;-) I said "a three-letter-acronym" company.
There were several. The one I dealt with was in Massachusetts, had
a real penchant for three-letter acronyms and used a programming
dialect which was the only single word oxymoron in the English
language (enough hints yet?).

And, no, I wasn't an employee.

----------------------------------------------------------------------
- Rick Stevens, SSE, VitalStream, Inc. rste...@vitalstream.com -
- 949-743-2010 (Voice) http://www.vitalstream.com -
- -
- "More hay, Trigger?" "No thanks, Roy, I'm stuffed!" -
----------------------------------------------------------------------

Daniel Phillips

Jan 28, 2002, 6:30:19 PM

> Ok. Still seems like a bit more copying than necessary.

I think I can show that it's exactly as much copying as necessary, and no more.
Oh, the page directory could also be shared and, on a 3 level page table, so
could the mid level tables, but that's not a really big win because of the 1K/1
fanout. I.e., we already took care of 99.9% of the problem just by sharing the
bottom-level page tables.
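
The 99.9% figure follows from the fanout. A quick check, assuming
two-level ia32 paging with 1024 entries per table page (4-byte entries
in a 4 KB page) and a densely mapped address space; table_pages() is a
made-up helper:

```python
FANOUT = 1024  # assumption: entries per table page on two-level ia32


def table_pages(mapped_pages, levels=2):
    """Table pages needed at each level, bottom level first (dense mapping)."""
    counts = []
    n = mapped_pages
    for _ in range(levels):
        n = -(-n // FANOUT)      # ceiling division
        counts.append(n)
    return counts


bottom, top = table_pages(1024 * 1024)   # a fully mapped 4 GB address space
share = bottom / (bottom + top)
print(bottom, top, round(share * 100, 1))   # 1024 1 99.9
```

So with 1024 bottom-level tables per directory page, the bottom level
accounts for 1024 of the 1025 table pages.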

> I'd have to look at it a bit more and do some noodling.
>
> > Confused yet? Welcome to the club ;-)
>
> Does my head exploding qualify for "confused"? If so, then I'm not
> yet "confused". I'm "concerned", since my ears are bleeding (a
> precursor to an explosion) ;-p

:-)

Daniel Phillips

Jan 28, 2002, 8:50:15 PM

On January 29, 2002 02:29 am, Oliver Xymoron wrote:

> On Mon, 28 Jan 2002, Daniel Phillips wrote:
>
> > copy_page_range:
> > Instead of copying the page tables, just increment their use counts
> >
> > zap_page_range:
> > If a page table is shared according to its use count, just decrement
> > the use count and otherwise leave it alone.
> >
> > handle_mm_fault:
> > If a page table is shared according to its use count and the faulting
> > instruction is a write, allocate a new page table and do the work that
> > would have normally been done by copy_page_range at fork time.
> > Decrement the use count of the (perhaps formerly) shared page table.
>
> Somewhere in here, the pages have got to all be marked read-only or
> something.

Yes, that's an essential detail I omitted: when a page table's use count
transitions from 1 to 2, mark all the CoW pages on the page table RO.

> If they're not, then either parent or child writing to
> non-faulting addresses will be writing to shared memory.

Yes, and after all, the whole point is to generalize CoW of pages to include
instantiation of page tables.

> I think something more is needed, such as creating a minimal page table
> for the child process with read-only mappings to the current %EIP and %EBP
> pages in it. This gets us past the fork/exec hurdle. Without the exec, we
> copy over chunks when they're accessed as above in handle_mm_fault. But
> you can't actually _share_ the page tables without marking the pages
> themselves readonly.

Oh yes, it's what I intended, thanks. Um, and I think you just told me what
one of my bugs is.

IPmonger

Jan 28, 2002, 9:40:13 PM

jep...@unpythonic.dhs.org writes:

> On Mon, Jan 28, 2002 at 03:06:24PM -0800, Rick Stevens wrote:
>> Uh, I never said IBM ;-) I said "a three-letter-acronym"
>> company. There were several. The one I dealt with was in
>> Massachusetts, had a real penchant for three-letter acronyms and
>> used a programming dialect which was the only single word oxymoron
>> in the English language (enough hints yet?).

I'm guessing DEC, but I must admit that the oxymoronic (scripting?)
language escapes me...

-IPmonger
--
------------------
IPmonger
ipmo...@delamancha.org
CCIE #8338

Momchil Velikov

Jan 29, 2002, 3:40:12 AM

>>>>> "Oliver" == Oliver Xymoron <oxym...@waste.org> writes:
Oliver> you can't actually _share_ the page tables without marking the pages
Oliver> themselves readonly.

Of course, ptes are made COW, just like now. Which brings up the
question of how much speedup we'll gain with code that touches every
single pte anyway?

Daniel Phillips

Jan 29, 2002, 5:00:08 AM

On January 29, 2002 10:20 am, William Lee Irwin III wrote:
> On Tue, Jan 29, 2002 at 09:55:02AM +0100, Daniel Phillips wrote:
> > It's only touching the ptes on tables that are actually used, so if a parent
> > with a massive amount of mapped memory forks a child that only instantiates
> > a small portion of it (common situation) then the saving is pretty big.
>
> Please correct my attempt at clarifying this:

Sorry, it's my fault, my explanation above is ambiguous, or even incorrect.

> The COW markings are done at the next higher level of hierarchy above
> the pte's themselves, and so experience the radix tree branch factor
> reduction in the amount of work done at fork-time in comparison to a
> full pagetable copy on fork.

The CoW markings are done at the same level they always have been - directly
in the ptes (the 4 byte thingies, please don't get confused by the unfortunate
overloading of 'pte' to mean 'page table' in some contexts, e.g.,
zap_pte_range). But the CoW marking only has to be done for page tables
that have use count == 1 at the time of fork. So if the parent inherited
some page tables then these already have use count > 1, i.e., are shared,
and they don't have to be set up for CoW again when the child forks. Only
page tables that the parent instantiated for itself have to be re-marked
for CoW.
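
To put rough numbers on this, here is a small sketch comparing the
number of ptes touched at fork time. It assumes fully populated page
tables of 1024 ptes each, and the function names are invented for
illustration:

```python
PTES_PER_TABLE = 1024  # assumption: fully populated ia32 page tables


def shared_fork_cost(use_counts):
    """ptes write-protected at fork: only tables with use count == 1."""
    return sum(PTES_PER_TABLE for count in use_counts if count == 1)


def eager_fork_cost(use_counts):
    """ptes touched when every page table is copied at fork."""
    return PTES_PER_TABLE * len(use_counts)


# A parent that inherited 95 still-shared tables and instantiated 5 itself.
use_counts = [2] * 95 + [1] * 5
print(shared_fork_cost(use_counts), eager_fork_cost(use_counts))   # 5120 102400
```

The saving depends entirely on how many of the parent's tables are
already shared when it forks; a parent with no shared tables marks all
of them.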

Wow, this is getting subtle, isn't it?

> On Tue, Jan 29, 2002 at 09:55:02AM +0100, Daniel Phillips wrote:
> > Note that I'm not counting on this to be a huge performance win, except in
> > the specific case that is bothering rmap. This is already worth the
> > price of admission.
>
> It is an overall throughput loss in the cases where the majority of the
> page table entries are in fact referenced by the child, and this is
> more than acceptable because it is more incremental, reference-all is
> an uncommon case, and once all the page table entries are referenced,
> there are no longer any penalties. Defeating this scheme would truly
> require a contrived application, and penalizes only that application.

True. Also, since the child is doing all that work anyway, the cost of
instantiating one page table (and extending the pte_chains) per 1K
referenced pages will get lost in the noise.

--
Daniel

Momchil Velikov

Jan 29, 2002, 5:20:12 AM

>>>>> "Daniel" == Daniel Phillips <phil...@bonn-fries.net> writes:

Daniel> On January 29, 2002 09:39 am, Momchil Velikov wrote:
>> >>>>> "Oliver" == Oliver Xymoron <oxym...@waste.org> writes:
Oliver> you can't actually _share_ the page tables without marking the pages
Oliver> themselves readonly.
>>
>> Of course, ptes are made COW, just like now. Which brings up the
>> question how much speedup we'll gain with a code that touches every
>> single pte anyway ?

Daniel> It's only touching the ptes on tables that are actually used, so if a parent
Daniel> with a massive amount of mapped memory forks a child that only instantiates
Daniel> a small portion of it (common situation) then the saving is pretty big.

Umm, all the ptes of the parent ought to be made COW, no?

Daniel Phillips

Jan 29, 2002, 6:40:07 AM

On January 29, 2002 11:59 am, Rik van Riel wrote:

> On Mon, 28 Jan 2002, Oliver Xymoron wrote:
>
> > Somewhere in here, the pages have got to all be marked read-only or
> > something. If they're not, then either parent or child writing to
> > non-faulting addresses will be writing to shared memory.
>
> Either that, or we don't populate the page tables of the
> parent and the child at all and have the page tables
> filled in at fault time.

Yes, you could go that route but you'd have to do some weird and wonderful
bookkeeping to figure out how to populate those page tables.

--
Daniel

Helge Hafting

Jan 29, 2002, 7:10:11 AM

Momchil Velikov wrote:
>
> >>>>> "Daniel" == Daniel Phillips <phil...@bonn-fries.net> writes:

[...]

> Daniel> It's only touching the ptes on tables that are actually used, so if a parent
> Daniel> with a massive amount of mapped memory forks a child that only instantiates
> Daniel> a small portion of it (common situation) then the saving is pretty big.
>
> Umm, all the ptes of the parent ought to be made COW, no?

Sure. But quite a few of them may be COW already, if the parent
itself is a result of some earlier fork.

Helge Hafting

Daniel Phillips

Jan 29, 2002, 7:20:14 AM

On January 29, 2002 12:38 pm, Rik van Riel wrote:

> On Tue, 29 Jan 2002, Daniel Phillips wrote:
>
> > > Either that, or we don't populate the page tables of the
> > > parent and the child at all and have the page tables
> > > filled in at fault time.
> >
> > Yes, you could go that route but you'd have to do some weird and wonderful
> > bookkeeping to figure out how to populate those page tables.
>
> Not really, if the page table isn't present you just check whether
> you need to allocate a new one or whether you need to instantiate
> one.

Since you didn't store it in the parent and you didn't store it in the child,
how are you going to find it? This is my point about the weird and wonderful
bookkeeping, which I managed to avoid entirely.

> That can all be done from within pte_alloc, which is always called
> by handle_mm_fault()...

--
Daniel

Karl & Betty Schendel

Jan 29, 2002, 7:20:11 AM

At 9:30 PM -0500 1/28/02, IPmonger wrote:
>jep...@unpythonic.dhs.org writes:
>
>> On Mon, Jan 28, 2002 at 03:06:24PM -0800, Rick Stevens wrote:
>>> Uh, I never said IBM ;-) I said "a three-letter-acronym"
>>> company. There were several. The one I dealt with was in
>>> Massachusetts, had a real penchant for three-letter acronyms and
>>> used a programming dialect which was the only single word oxymoron
>>> in the English language (enough hints yet?).
>
> I'm guessing DEC, but I must admit that the oxymoronic (scripting?)
> language escapes me...
>

Bliss, surely?

Which actually wasn't too bad once you got used to putting the
goddamned dots in front of all your variable (contents) accesses.

--
Karl R. Schendel, Jr. sche...@kbcomputer.com
K/B Computer Associates www.kbcomputer.com
Ingres, Unix, VMS Consulting and Training

Daniel Phillips

Jan 29, 2002, 7:40:18 AM

On January 29, 2002 12:54 pm, Helge Hafting wrote:
> Momchil Velikov wrote:
> >
> > >>>>> "Daniel" == Daniel Phillips <phil...@bonn-fries.net> writes:
> [...]
> > Daniel> It's only touching the ptes on tables that are actually used, so if a parent
> > Daniel> with a massive amount of mapped memory forks a child that only instantiates
> > Daniel> a small portion of it (common situation) then the saving is pretty big.
> >
> > Umm, all the ptes of the parent ought to be made COW, no?
>
> Sure. But quite a few of them may be COW already, if the parent
> itself is a result of some earlier fork.

Right, or if the parent has already forked at least one child.

--
Daniel

Oliver Xymoron

Jan 29, 2002, 12:00:22 PM

On Tue, 29 Jan 2002, Rik van Riel wrote:

> On Mon, 28 Jan 2002, Oliver Xymoron wrote:
>
> > Somewhere in here, the pages have got to all be marked read-only or
> > something. If they're not, then either parent or child writing to
> > non-faulting addresses will be writing to shared memory.
>
> Either that, or we don't populate the page tables of the
> parent and the child at all and have the page tables
> filled in at fault time.

That's very nearly what I proposed in the second half of my message (with
the exception that we ought to pre-fault the current stack and code page
tables as we're sure to need these immediately).

Daniel's approach seems to be workable (once he's spelled out all the
details) but it misses the big performance win for fork/exec, which is
surely the common case. Given that exec will be throwing away all these
mappings, we can safely assume that we will not be inheriting many shared
mappings from parents of parents, so Daniel's approach still ends up
marking most of the pages RO.

--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."

Rik van Riel

Jan 29, 2002, 12:30:22 PM

On Tue, 29 Jan 2002, Oliver Xymoron wrote:

> Daniel's approach seems to be workable (once he's spelled out all the
> details) but it misses the big performance win for fork/exec, which is
> surely the common case. Given that exec will be throwing away all these
> mappings, we can safely assume that we will not be inheriting many shared
> mappings from parents of parents so Daniel's approach also still ends up
> marking most of the pages RO still.

It gets worse. His approach also needs to adjust the reference
counts on all pages (and swap pages).

kind regards,

Rik
--
DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/

http://www.surriel.com/ http://distro.conectiva.com/

Josh MacDonald

Jan 29, 2002, 12:30:25 PM

Quoting Linus Torvalds (torv...@transmeta.com):

>
> On Mon, 28 Jan 2002, Josh MacDonald wrote:
> >
> > So, it would seem that the dcache and kmem_slab_cache memory allocator
> > could benefit from a way to shrink the dcache in a less random way.
> > Any thoughts?
>
> The way I want to solve this problem generically is to basically get rid
> of the special-purpose memory shrinkers, and have everything done with one
> unified interface, namely the physical-page-based "writeout()" routine. We
> do that for the page cache, and there's nothing that says that we couldn't
> do the same for all other caches, including very much the slab allocator.
>
> Thus any slab user that wants to, could just register their own per-page
> memory pressure logic. The dcache "reference" bit would go away, to be
> replaced by a per-page reference bit (that part could be done already, of
> course, and might help a bit on its own).
>
> Basically, the more different "pools" of memory we have, the harder it
> gets to balance them. Clearly, the optimal number of pools from a
> balancing standpoint is just a single, direct physical pool.
>
> Right now we have several pools - we have the pure physical LRU, we have
> the virtual mapping (where scanning is directly tied to the physical LRU,
> but where the separate pool still _does_ pose some problems), and we have
> separate balancing for inodes, dentries and quota. And there's no question
> that it hurts us under memory pressure.
>
> (There's a related question, which is whether other caches might also
> benefit from being able to grow more - right now there are some caches
> that are of a limited size partly because they have no good way of
> shrinking back on demand).

Using a physical-page-based "writeout()" routine seems like a nice way
to unify the application of memory pressure to various caches, but it
does not address the issue of fragmentation within a cache slab. You
could have a situation in which a number of hot dcache entries are
occupying some number of pages, such that dcache pages are always more
recently used than other pages in the system. Would the VM ever tell
the dcache to writeout() in that case?

It seems that the current special-purpose memory "shrinkers" approach
has some advantages in this regard: when memory pressure is applied
every cache attempts to free some resources. Do you envision the
unified interface approach applying pressure to pages of every kind of
cache under memory pressure?

Even so, the physical-page writeout() approach results in a less
effective cache under memory pressure. Suppose the VM chooses some
number of least-recently-used physical pages belonging to the dcache
and tells the slab allocator to release those pages. Assume that the
dcache entries are not currently in use and that the dcache is in fact
able to release them. Some of the dcache entries being tossed from
memory could instead replace less-recently-used objects on more
recently-used physical pages. In other words, the dcache would
benefit from relocating its more frequently used entries onto the same
physical pages under memory pressure.

Unless the cache ejects entries based on the object access and not
physical page access, the situation will never improve. Pages with
hot dcache entries will never clean-out the inactive entries on the
same page. For this reason, I don't think it makes sense to eliminate
the object-based aging of cache entries altogether.

Perhaps a combination of the approaches would work best. When the VM
system begins forcing the dcache to writeout(), the dcache could both
release some of its pages by ejecting all the entries (as above) and
in addition it could run something like prune_dcache(), thus creating
free space in the hotter set of physical pages so that over a period
of prolonged memory pressure, the hotter dcache entries would
eventually become located on the same pages.

A solution that relocates dcache entries to reduce total page
consumption, however, makes the most effective use of cache space.
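As a rough illustration of what relocation could buy, the arithmetic below uses the /proc/slabinfo figures quoted at the start of this thread (693 active dentries spread over 105 pages, with 3150 / 105 = 30 objects per page). The helper names are hypothetical, and the sketch deliberately ignores the hard part of relocation, which is updating every reference to a moved entry.

```python
def min_pages(active_objects, objects_per_page):
    """Fewest pages that can hold the active objects if packed densely."""
    return -(-active_objects // objects_per_page)  # ceiling division


def relocation_savings(active_objects, pages_in_use, objects_per_page):
    """Pages that could be released if active objects were packed together."""
    return pages_in_use - min_pages(active_objects, objects_per_page)


print(min_pages(693, 30))                # 24
print(relocation_savings(693, 105, 30))  # 81
```

Under these assumptions the same 693 entries would fit in 24 pages rather than 105.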

William Lee Irwin III

Jan 29, 2002, 3:00:14 PM

"William" == William Lee Irwin <w...@holomorphy.com> writes:
William> Please correct my attempt at clarifying this:
William> The COW markings are done at the next higher level of hierarchy above
William> the pte's themselves, and so experience the radix tree branch factor
William> reduction in the amount of work done at fork-time in comparison to a
William> full pagetable copy on fork.

On Tue, Jan 29, 2002 at 12:18:42PM +0200, Momchil Velikov wrote:
> COW at pgd/pmd level is ia32-ism, unlike COW at pte level.

Pain! Well, at least the pte markings dodge the page_add_rmap() bullet.

On Tue, Jan 29, 2002 at 12:18:42PM +0200, Momchil Velikov wrote:
> PS. Well, the whole pgd/pmd/ptb stuff is ia32-ism, but that's another
> story.

Perhaps something can be done about that.

Cheers,
Bill

Daniel Phillips

Jan 29, 2002, 3:50:17 PM

On January 29, 2002 06:25 pm, Rik van Riel wrote:
> On Tue, 29 Jan 2002, Oliver Xymoron wrote:
>
> > Daniel's approach seems to be workable (once he's spelled out all the
> > details) but it misses the big performance win for fork/exec, which is
> > surely the common case. Given that exec will be throwing away all these
> > mappings, we can safely assume that we will not be inheriting many shared
> > mappings from parents of parents so Daniel's approach also still ends up
> > marking most of the pages RO.
>
> It gets worse. His approach also needs to adjust the reference
> counts on all pages (and swap pages).

Well, Rik, time to present your algorithm. I assume it won't adjust reference
counts on pages, and will do some kind of traversal of the mm tree. Note
however, that I did investigate the class of algorithm you are interested in,
and found only nasty, complex solutions there, with challenging locking
problems. (I also looked at a number of possible improvements to virtual
scanning, as you know, and likewise only found ugly or inadequate solutions.)

Before you sink a lot of time into it though, you might add up the actual
overhead you're worried about above, and see if it moves the needle in a real
system.

--
Daniel

Oliver Xymoron

Jan 29, 2002, 4:20:19 PM

On Tue, 29 Jan 2002, Linus Torvalds wrote:

>
> On Tue, 29 Jan 2002, Oliver Xymoron wrote:
> >

> > fork:
> > detach page tables from parent
>
> - leave the option to just mark them read-only on architectures that
> support it (ie x86, I think alpha does this too).

I don't think read-only for the tables is sufficient if the pages
themselves are writable.

--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."


Oliver Xymoron

Jan 29, 2002, 5:10:13 PM

On Tue, 29 Jan 2002, Linus Torvalds wrote:

>
> On Tue, 29 Jan 2002, Oliver Xymoron wrote:
> >
> > I don't think read-only for the tables is sufficient if the pages
> > themselves are writable.
>

> At least on x86, the WRITE bit in the page directory entries will override
> any bits in the PTE. In other words, it doesn't make the page directory
> entries themselves unwritable - it makes the final pages unwritable.
>
> Which are exactly the semantics we want.

Oh. Cool. I knew I must have been missing some detail.
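The override described above amounts to AND-ing the write bits of both levels. A minimal sketch, assuming classic two-level ia32 paging with 4KB pages and a user-mode access (supervisor accesses with CR0.WP clear behave differently); the function name is made up for illustration:

```python
PAGE_PRESENT = 0x001  # bit 0 of an ia32 PDE or PTE
PAGE_RW = 0x002       # bit 1 of an ia32 PDE or PTE


def effective_writable(pde, pte):
    """Return True if a user-mode write through this PDE/PTE pair succeeds."""
    present = (pde & PAGE_PRESENT) and (pte & PAGE_PRESENT)
    return bool(present and (pde & PAGE_RW) and (pte & PAGE_RW))


# Clearing RW in the directory entry write-protects every page below it,
# even though the PTE itself still says "writable".
print(effective_writable(PAGE_PRESENT, PAGE_PRESENT | PAGE_RW))            # False
print(effective_writable(PAGE_PRESENT | PAGE_RW, PAGE_PRESENT | PAGE_RW))  # True
```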

> I have this strong feeling (but am lazy enough to not try to find the
> documentation) that on alpha the access bits in the upper page tables are
> just ignored (ie you have to actually turn off the present bit), which is
> a bit sad as it shouldn't matter from a PAL-code standpoint (just two more
> "and" instructions to and all the levels access bits together).

The "detached mm" approach should be sufficiently parallel to the
read-only page directory entries that the two can use almost the same
framework. The downside is faults on reads in the detached case, but that
shouldn't be significantly worse than the original copy, thanks to the
large fanout.

Linus Torvalds

Jan 29, 2002, 5:20:15 PM

On Tue, 29 Jan 2002, Oliver Xymoron wrote:
>
> The "detached mm" approach should be sufficiently parallel to the
> read-only page directory entries that the two can use almost the same
> framework.

Yes. I suspect that it can be trivially hidden in just two
architecture-specific functions, ie something like "detach_pgd(pgd)" and
"attach_pgd_entry(mm, address)".

> The downside is faults on reads in the detached case, but that
> shouldn't be significantly worse than the original copy, thanks to the
> large fanout.

Right. We'd get a few "unnecessary" page faults, but they should be on the
order of 0.1% of the necessary ones. In fact, with pre-faulting in
sys_fork(), I wouldn't be surprised if the common case is to not have any
directory-related page faults at all.

Linus
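The 0.1% figure above follows from the fanout. As a back-of-the-envelope sketch, assume an ia32-style page table holding 1024 ptes, that every page under a table is eventually faulted in, and that each detached table costs exactly one extra fault. Those assumptions are an idealisation; a sparsely used table would make the ratio worse.

```python
def directory_fault_fraction(ptes_per_table=1024):
    """Extra directory-level faults per necessary pte-level fault."""
    return 1 / ptes_per_table


print(round(directory_fault_fraction() * 100, 3))  # 0.098 (percent)
```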

Daniel Phillips

Jan 29, 2002, 6:00:13 PM

On January 29, 2002 10:00 pm, Oliver Xymoron wrote:

> On Tue, 29 Jan 2002, Daniel Phillips wrote:
>
> > On January 29, 2002 06:25 pm, Rik van Riel wrote:

> > > On Tue, 29 Jan 2002, Oliver Xymoron wrote:
> > >

> > > > Daniel's approach seems to be workable (once he's spelled out all the
> > > > details) but it misses the big performance win for fork/exec, which is
> > > > surely the common case. Given that exec will be throwing away all these
> > > > mappings, we can safely assume that we will not be inheriting many shared
> > > > mappings from parents of parents so Daniel's approach also still ends up
> > > > marking most of the pages RO.
> > >
> > > It gets worse. His approach also needs to adjust the reference
> > > counts on all pages (and swap pages).
> >
> > Well, Rik, time to present your algorithm. I assume it won't adjust reference
> > counts on pages, and will do some kind of traversal of the mm tree. Note
> > however, that I did investigate the class of algorithm you are interested in,
> > and found only nasty, complex solutions there, with challenging locking
> > problems. (I also looked at a number of possible improvements to virtual
> > scanning, as you know, and likewise only found ugly or inadequate solutions.)
>

> I think it goes something like this:

>
> fork:
> detach page tables from parent

> retain pointer to "backing page tables" in parent and child
> update use count in page tables
> "prefault" tables for current stack and instruction pages in both parent
> and child
>
> page fault:
> if faulted on page table:
> look up backing page tables
> if use count > 1: copy, dec use count
> else: take ownership

>
> > Before you sink a lot of time into it though, you might add up the actual
> > overhead you're worried about above, and see if it moves the needle in a real
> > system.
>

> I'm pretty sure something like the above does significantly less work in
> the fork/exec case, which is the important one.
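The fork and page-fault outline quoted above can be sketched as a toy model. This is only an illustration of the use-count bookkeeping: the `PageTable` and `MM` classes, the slot indices, and the function names are hypothetical, locking is ignored, and nothing here corresponds to real kernel structures.

```python
class PageTable:
    def __init__(self, ptes):
        self.ptes = list(ptes)
        self.use_count = 1


class MM:
    def __init__(self, tables=None):
        # slot -> PageTable this mm has attached and may use directly
        self.tables = dict(tables or {})
        # slot -> detached "backing" PageTable, possibly shared
        self.backing = {}


def fault_on_table(mm, slot):
    """Fault on a detached table: copy it if shared, else take ownership."""
    table = mm.backing.pop(slot)
    if table.use_count > 1:
        table.use_count -= 1
        table = PageTable(table.ptes)  # private copy
    mm.tables[slot] = table
    return table


def fork(parent, prefault=()):
    """Detach the parent's tables and share them with a new child mm."""
    child = MM()
    # Tables the parent was still sharing gain one more user.
    for slot, table in parent.backing.items():
        table.use_count += 1
        child.backing[slot] = table
    # Tables the parent owned become backing tables for both.
    for slot, table in parent.tables.items():
        table.use_count += 1
        parent.backing[slot] = table
        child.backing[slot] = table
    parent.tables = {}
    # "Prefault" the slots assumed to cover current stack and instruction pages.
    for slot in prefault:
        fault_on_table(parent, slot)
        fault_on_table(child, slot)
    return child
```

Note that fork itself touches one entry per page table rather than one per page, which is where the claimed saving for fork/exec would come from.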

With fork/exec, for each page table there are two cases:

- The parent instantiated the page table. In this case the extra work to
set the ptes RO (only for CoW pages) is insignificant.

- The parent is still sharing the page table with its parent and so the
ptes are still set RO.

I don't see how there is a whole lot of fat to cut here.

--
Daniel

Daniel Phillips

Jan 29, 2002, 6:30:21 PM

On January 30, 2002 12:02 am, Oliver Xymoron wrote:
> On Tue, 29 Jan 2002, Daniel Phillips wrote:
> > With fork/exec, for each page table there are two cases:
> >
> > - The parent instantiated the page table. In this case the extra work
> > to set the ptes RO (only for CoW pages) is insignificant.
>

> Marking the page table entries rather than the page directory entries
> read-only is a lot of work on a large process.

I'm still missing your point. When the parent's page table was instantiated
we took a fault. Later, we walk through up to 1024 ptes setting them RO, if
they are not already (which they probably are). Don't you think the cost of
the former dwarfs the latter? In fact, if we are worried about this, we can
keep a flag on the page table telling us all the ptes are still set RO so we
don't have to do it again.
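The flag idea can be sketched as follows. This is a simplified model, assuming 1024 ptes per table and an ia32-style RW bit; the class, the `all_ro` field, and the method names are invented for illustration, and for brevity it write-protects every pte rather than only the CoW ones.

```python
PTES_PER_TABLE = 1024
PTE_RW = 0x002


class PageTable:
    def __init__(self):
        self.ptes = [0] * PTES_PER_TABLE
        self.all_ro = True  # True while no pte has RW set

    def make_writable(self, index):
        self.ptes[index] |= PTE_RW
        self.all_ro = False

    def write_protect(self):
        """Clear RW on every pte; return how many ptes were visited."""
        if self.all_ro:
            return 0  # nothing changed since the last pass, skip the walk
        for i in range(len(self.ptes)):
            self.ptes[i] &= ~PTE_RW
        self.all_ro = True
        return len(self.ptes)


table = PageTable()
table.make_writable(5)
print(table.write_protect())  # 1024
print(table.write_protect())  # 0
```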

> And it doesn't make a lot of sense for a large process that wants to
> fork/exec something tiny. In fact, I'm slightly worried about the
> possible growth of the page directories on really big boxes. Detaching
> the entire mm is comparatively cheap and doesn't grow with process size.

>
> > - The parent is still sharing the page table with its parent and so the
> > ptes are still set RO.
>

> Fork/exec is far and away the most common case, and the fork/fork case is
> rare enough that it's not even worth thinking about.

I'm not sure I agree with this. It matters a lot if that rare case happens to
be the application you're using all the time, and then it becomes the common
case.

Rik van Riel

Jan 30, 2002, 9:50:16 AM

On Wed, 30 Jan 2002, Daniel Phillips wrote:
> On January 30, 2002 10:07 am, Horst von Brand wrote:

> > But most of this will be lost on exec(2).

> > Also, it is my impression that
> > the tree of _running_ processes isn't usually very deep (Say init --> X -->
> > [Random processes] --> [compilations &c], this would make 5 or 6 deep, no
> > more).

> Here's my tree - on a non-very-busy laptop. Why is my X tree so much deeper?
> I suppose if I was running java this would look considerably more interesting.

> |-bash---bash---xinit-+-XFree86
> | `-xfwm-+-xfce---gnome-terminal-+-bash---pstree

It doesn't matter how deep the tree is, on exec() all
previously shared page tables will be blown away.

In this part of the tree, I see exactly 2 processes
which could be sharing page tables (the two bash
processes).

regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

Daniel Phillips

Jan 30, 2002, 10:00:14 AM

On January 30, 2002 03:46 pm, Rik van Riel wrote:
> On Wed, 30 Jan 2002, Daniel Phillips wrote:
> > On January 30, 2002 10:07 am, Horst von Brand wrote:
>
> > > But most of this will be lost on exec(2).
>
> > > Also, it is my impression that
> > > the tree of _running_ processes isn't usually very deep (Say init --> X -->
> > > [Random processes] --> [compilations &c], this would make 5 or 6 deep, no
> > > more).
>
> > Here's my tree - on a non-very-busy laptop. Why is my X tree so much deeper?
> > I suppose if I was running java this would look considerably more interesting.
>
> > |-bash---bash---xinit-+-XFree86
> > | `-xfwm-+-xfce---gnome-terminal-+-bash---pstree
>
> It doesn't matter how deep the tree is, on exec() all
> previously shared page tables will be blown away.
>
> In this part of the tree, I see exactly 2 processes
> which could be sharing page tables (the two bash
> processes).

Sure, your point is that there is no problem and the speed of rmap on fork
is not something to worry about?

--
Daniel

Rik van Riel

Jan 30, 2002, 11:00:19 AM

On Wed, 30 Jan 2002, Daniel Phillips wrote:
> On January 30, 2002 03:46 pm, Rik van Riel wrote:
> > On Wed, 30 Jan 2002, Daniel Phillips wrote:

> > > |-bash---bash---xinit-+-XFree86
> > > | `-xfwm-+-xfce---gnome-terminal-+-bash---pstree
> >
> > It doesn't matter how deep the tree is, on exec() all
> > previously shared page tables will be blown away.
> >
> > In this part of the tree, I see exactly 2 processes
> > which could be sharing page tables (the two bash
> > processes).
>
> Sure, your point is that there is no problem and the speed of rmap on
> fork is not something to worry about?

No. The point is that we should optimise for fork()+exec(),
not for a long series of consecutive fork()s all sharing the
same page tables.

regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/


Daniel Phillips

Jan 30, 2002, 11:40:12 AM

On January 30, 2002 04:54 pm, Rik van Riel wrote:
> On Wed, 30 Jan 2002, Daniel Phillips wrote:
> > On January 30, 2002 03:46 pm, Rik van Riel wrote:
> > > On Wed, 30 Jan 2002, Daniel Phillips wrote:
>
> > > > |-bash---bash---xinit-+-XFree86
> > > > | `-xfwm-+-xfce---gnome-terminal-+-bash---pstree
> > >
> > > It doesn't matter how deep the tree is, on exec() all
> > > previously shared page tables will be blown away.
> > >
> > > In this part of the tree, I see exactly 2 processes
> > > which could be sharing page tables (the two bash
> > > processes).
> >
> > Sure, your point is that there is no problem and the speed of rmap on
> > fork is not something to worry about?
>
> No. The point is that we should optimise for fork()+exec(),
> not for a long series of consecutive fork()s all sharing the
> same page tables.

Fork+exec is adequately optimized for. Fork+100 execs is supremely well
optimized for. I'm entirely satisfied with the way the performance looks
at this point; it will outdo anything we've seen to date. With Linus's
write-protect-in-page-directory optimization there's not a lot more fat
to be squeezed out, if any, and even without it, it will be a screamer.
I think we've done this one; it's time to move on from here.

--
Daniel