
process creation time increases linearly with shmem


Ray Fucillo

Aug 24, 2005, 2:50:08 PM
I am seeing process creation time increase linearly with the size of the
shared memory segment that the parent touches. The attached forktest.c
is a very simple user program that illustrates this behavior, which I
have tested on various kernel versions from 2.4 through 2.6. Is this a
known issue, and is it solvable?

TIA,
Ray

forktest.c

Nick Piggin

Aug 24, 2005, 8:20:08 PM

fork() can be changed so as not to set up page tables for
MAP_SHARED mappings. I think that has other tradeoffs like
initially causing several unavoidable faults reading
libraries and program text.

What kind of application are you using?

--
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Ray Fucillo

Aug 25, 2005, 9:10:04 AM
Nick Piggin wrote:
> fork() can be changed so as not to set up page tables for
> MAP_SHARED mappings. I think that has other tradeoffs like
> initially causing several unavoidable faults reading
> libraries and program text.
>
> What kind of application are you using?

The application is a database system called Caché. We allocate a large
shared memory segment for database cache, which in a large production
environment may realistically be 1+GB on 32-bit platforms and much
larger on 64-bit. At these sizes fork() is taking hundreds of
milliseconds, which can become a noticeable bottleneck for us. This
performance characteristic seems to be unique to Linux vs other Unix
implementations.

Andi Kleen

Aug 25, 2005, 9:20:10 AM
Ray Fucillo <fuc...@intersystems.com> writes:
>
> The application is a database system called Caché. We allocate a
> large shared memory segment for database cache, which in a large
> production environment may realistically be 1+GB on 32-bit platforms
> and much larger on 64-bit. At these sizes fork() is taking hundreds
> of milliseconds, which can become a noticeable bottleneck for us. This
> performance characteristic seems to be unique to Linux vs other Unix
> implementations.

You could set up hugetlbfs and use large pages for the SHM (with SHM_HUGETLB);
then the overhead of walking the pages of it at fork would be much lower.

-Andi

Parag Warudkar

Aug 25, 2005, 10:10:06 AM
> Ray Fucillo <fuc...@intersystems.com> writes:
> >
> > The application is a database system called Caché. We allocate a
> > large shared memory segment for database cache, which in a large
> > production environment may realistically be 1+GB on 32-bit platforms
> > and much larger on 64-bit. At these sizes fork() is taking hundreds
> > of milliseconds, which can become a noticeable bottleneck for us. This
> > performance characteristic seems to be unique to Linux vs other Unix
> > implementations.
>
> You could set up hugetlbfs and use large pages for the SHM (with SHM_HUGETLB);
> then the overhead of walking the pages of it at fork would be much lower.
>
> -Andi
> -

Why isn't the page table setup for the shared memory done lazily, though? Applications most likely may not want to page in all of the shared memory at once; program logic/requirements should dictate this, instead of fork() making it compulsory. I think this is because we don't distinguish between shared libraries, program text and explicitly shared memory like the above application's segment - everything is MAP_SHARED.

As someone mentioned, this causes unavoidable faults for reading in shared libraries and program text. But if there were a MAP_SHARED|MAP_LAZY, could fork() be set up not to copy page tables for such mappings, while still mapping the plain MAP_SHARED ones so program text and libraries don't cause faults? Applications could then specify MAP_SHARED|MAP_LAZY and not incur the overhead of the page table copy for the shared memory all at once.

Would it be worth trying to do something like this?

Parag

Andi Kleen

Aug 25, 2005, 10:30:12 AM

> Would it be worth trying to do something like this?

Maybe. Shouldn't be very hard though - you just need to check if the VMA is
backed by an object and, if so, not call copy_page_range for it.

I think it just needs (untested)

Index: linux-2.6.13-rc5-misc/kernel/fork.c
===================================================================
--- linux-2.6.13-rc5-misc.orig/kernel/fork.c
+++ linux-2.6.13-rc5-misc/kernel/fork.c
@@ -265,7 +265,8 @@ static inline int dup_mmap(struct mm_str
 		rb_parent = &tmp->vm_rb;
 
 		mm->map_count++;
-		retval = copy_page_range(mm, current->mm, tmp);
+		if (!file && !is_vm_hugetlb_page(vma))
+			retval = copy_page_range(mm, current->mm, tmp);
 		spin_unlock(&mm->page_table_lock);
 
 		if (tmp->vm_ops && tmp->vm_ops->open)

But I'm not sure it's a good idea in all cases. Would need a lot of
benchmarking at least.

-Andi

Nick Piggin

Aug 25, 2005, 10:30:16 AM
Ray Fucillo wrote:
> Nick Piggin wrote:
>
>> fork() can be changed so as not to set up page tables for
>> MAP_SHARED mappings. I think that has other tradeoffs like
>> initially causing several unavoidable faults reading
>> libraries and program text.
>>
>> What kind of application are you using?
>
>
> The application is a database system called Caché. We allocate a large
> shared memory segment for database cache, which in a large production
> environment may realistically be 1+GB on 32-bit platforms and much
> larger on 64-bit. At these sizes fork() is taking hundreds of
> milliseconds, which can become a noticeable bottleneck for us. This
> performance characteristic seems to be unique to Linux vs other Unix
> implementations.
>
>

As Andi said, hugepages might be a very nice feature for you guys
to look into and might potentially give a performance increase with
reduced TLB pressure, not only your immediate fork problem.

Anyway, the attached patch is something you could try testing. If
you do so, then I would be very interested to see performance results.

Thanks,
Nick

vm-dontcopy-shared.patch

Nick Piggin

Aug 25, 2005, 10:40:08 AM
Andi Kleen wrote:
>>Would it be worth trying to do something like this?
>
>
> Maybe. Shouldn't be very hard though - you just need to check if the VMA is
> backed by an object and if yes don't call copy_page_range for it.
>
> I think it just needs (untested)
>

I think you need to check for MAP_SHARED as well, because
MAP_PRIVATE mapping of a file could be modified in parent.

See patch I posted just now.

Also, do you need any special case for hugetlb?


> Index: linux-2.6.13-rc5-misc/kernel/fork.c
> ===================================================================
> --- linux-2.6.13-rc5-misc.orig/kernel/fork.c
> +++ linux-2.6.13-rc5-misc/kernel/fork.c
> @@ -265,7 +265,8 @@ static inline int dup_mmap(struct mm_str
>  		rb_parent = &tmp->vm_rb;
>  
>  		mm->map_count++;
> -		retval = copy_page_range(mm, current->mm, tmp);
> +		if (!file && !is_vm_hugetlb_page(vma))
> +			retval = copy_page_range(mm, current->mm, tmp);
>  		spin_unlock(&mm->page_table_lock);
>  
>  		if (tmp->vm_ops && tmp->vm_ops->open)
>
> But I'm not sure it's a good idea in all cases. Would need a lot of
> benchmarking at least.
>

Yep. I'm sure it must have come up in the past, and Linus
must have said something about best-for-most.

--
SUSE Labs, Novell Inc.


Parag Warudkar

Aug 25, 2005, 10:50:11 AM
On Thu, 2005-08-25 at 16:22 +0200, Andi Kleen wrote:
> But I'm not sure it's a good idea in all cases. Would need a lot of
> benchmarking at least.
>
> -Andi
>

Exactly - one problem is that this forces all of the hugetlb users to go
the lazy faulting way. This is more or less similar to the original
problem: fork() forces everything to be mapped, and some apps don't
like it. In the same way, some apps may not want hugetlb pages to be
all pre-mapped.

That's why I was alluding towards having the user specify MAP_SHARED|
MAP_LAZY or something to that tune and then have fork() honor it. So
people who want all things pre-mapped will not specify MAP_LAZY, just
MAP_SHARED.

Now I don't even know if the above is possible and workable for all
scenarios, but that's why I was asking.. :)

Parag

Andi Kleen

Aug 25, 2005, 12:00:11 PM
On Thursday 25 August 2005 16:47, Parag Warudkar wrote:

> Exactly - one problem is that this forces all of the hugetlb users to go
> the lazy faulting way.

Actually I disabled it for hugetlbfs (... !is_huge...vma). The reason
is that lazy faulting for huge pages is still not in mainline.

-Andi

Rik van Riel

Aug 25, 2005, 4:10:06 PM
On Thu, 25 Aug 2005, Nick Piggin wrote:

> fork() can be changed so as not to set up page tables for
> MAP_SHARED mappings. I think that has other tradeoffs like
> initially causing several unavoidable faults reading
> libraries and program text.

Actually, libraries and program text are usually mapped
MAP_PRIVATE, so those would still be copied.

Skipping MAP_SHARED in fork() sounds like a good idea to me...

--
All Rights Reversed

Nick Piggin

Aug 25, 2005, 9:30:06 PM
Rik van Riel wrote:
> On Thu, 25 Aug 2005, Nick Piggin wrote:
>
>
>>fork() can be changed so as not to set up page tables for
>>MAP_SHARED mappings. I think that has other tradeoffs like
>>initially causing several unavoidable faults reading
>>libraries and program text.
>
>
> Actually, libraries and program text are usually mapped
> MAP_PRIVATE, so those would still be copied.
>

Yep, that seems to be the case here.

> Skipping MAP_SHARED in fork() sounds like a good idea to me...
>

Indeed. Linus, can you remember why we haven't done this before?

--
SUSE Labs, Novell Inc.


Rik van Riel

Aug 25, 2005, 10:00:11 PM
On Fri, 26 Aug 2005, Nick Piggin wrote:

> > Skipping MAP_SHARED in fork() sounds like a good idea to me...
>
> Indeed. Linus, can you remember why we haven't done this before?

Where "this" looks something like the patch below, shamelessly
merging Nick's and Andi's patches and adding the initialization
of retval.

I suspect this may be a measurable win on database servers with
a web frontend, where the connections to the database server are
set up basically for each individual query, and don't stick around
for a long time.

No, I haven't actually tested this patch - but feel free to go
wild while I sign off for the night.

Signed-off-by: Rik van Riel <ri...@redhat.com>

--- linux-2.6.12/kernel/fork.c.mapshared	2005-08-25 18:40:44.000000000 -0400
+++ linux-2.6.12/kernel/fork.c	2005-08-25 18:47:16.000000000 -0400
@@ -184,7 +184,7 @@
 {
 	struct vm_area_struct * mpnt, *tmp, **pprev;
 	struct rb_node **rb_link, *rb_parent;
-	int retval;
+	int retval = 0;
 	unsigned long charge;
 	struct mempolicy *pol;
 
@@ -265,7 +265,10 @@
 		rb_parent = &tmp->vm_rb;
 
 		mm->map_count++;
-		retval = copy_page_range(mm, current->mm, tmp);
+		/* Skip pte copying if page faults can take care of things. */
+		if (!file || !(tmp->vm_flags & VM_SHARED) ||
+		    is_vm_hugetlb_page(vma))
+			retval = copy_page_range(mm, current->mm, tmp);
 		spin_unlock(&mm->page_table_lock);
 
 		if (tmp->vm_ops && tmp->vm_ops->open)

Linus Torvalds

Aug 26, 2005, 12:00:12 AM

On Fri, 26 Aug 2005, Nick Piggin wrote:
>
> > Skipping MAP_SHARED in fork() sounds like a good idea to me...
> >
>
> Indeed. Linus, can you remember why we haven't done this before?

Hmm. Historical reasons. Also, if the child ends up needing it, it will
now have to fault them in.

That said, I think it's a valid optimization. Especially as the child
_probably_ doesn't need it (ie there's at least some likelihood of an
execve() or similar).

Linus

Hugh Dickins

Aug 26, 2005, 7:50:11 AM
On Thu, 25 Aug 2005, Linus Torvalds wrote:
> On Fri, 26 Aug 2005, Nick Piggin wrote:
> >
> > > Skipping MAP_SHARED in fork() sounds like a good idea to me...
> >
> > Indeed. Linus, can you remember why we haven't done this before?
>
> Hmm. Historical reasons. Also, if the child ends up needing it, it will
> now have to fault them in.
>
> That said, I think it's a valid optimization. Especially as the child
> _probably_ doesn't need it (ie there's at least some likelihood of an
> execve() or similar).

I agree, seems a great idea to me (sulking because I was too dumb
to get it, even when Nick and Andi first posted their patches).

It won't just save on the copying at fork time, it'll save on
undoing it all again when the child mm is torn down for exec.

The refaulting will hurt the performance of something: let's
just hope that something doesn't turn out to be a show-stopper.

I see some flaws in the various patches posted, including Rik's.
Here's another version - doing it inside copy_page_range, so this
kind of vma special-casing is over in mm/ rather than kernel/.

No point in testing vm_file, the vm_flags cover the cases.
Test VM_MAYSHARE rather than VM_SHARED to include the never-can-be-
written MAP_SHARED cases too. Must exclude VM_NONLINEAR, their ptes
are essential for defining the file offsets. Must exclude VM_RESERVED,
faults on remap_pfn_range areas would usually put in anon zeroed pages
instead of the driver pages - or perhaps would be better as a test
against VM_IO, or vma->vm_ops->nopage?

Having to exclude the VM_NONLINEAR seems rather a shame, since those
are always shared and likely enormous. The InfiniBand people's idea
of a way for the app to set VM_DONTCOPY (to avoid rdma get_user_pages
problems) becomes attractive as a way for apps to speed their forks.

Hugh

--- 2.6.13-rc7/mm/memory.c	2005-08-24 11:13:41.000000000 +0100
+++ linux/mm/memory.c	2005-08-26 10:09:50.000000000 +0100
@@ -498,6 +498,14 @@ int copy_page_range(struct mm_struct *ds
 	unsigned long addr = vma->vm_start;
 	unsigned long end = vma->vm_end;
 
+	/*
+	 * Assume the fork will probably exec: don't waste time copying
+	 * ptes where a page fault will fill them correctly afterwards.
+	 */
+	if ((vma->vm_flags & (VM_MAYSHARE|VM_HUGETLB|VM_NONLINEAR|VM_RESERVED))
+	    == VM_MAYSHARE)
+		return 0;
+
 	if (is_vm_hugetlb_page(vma))
 		return copy_hugetlb_page_range(dst_mm, src_mm, vma);

Nick Piggin

Aug 26, 2005, 11:20:13 AM
Hugh Dickins wrote:
> On Thu, 25 Aug 2005, Linus Torvalds wrote:

>>That said, I think it's a valid optimization. Especially as the child
>>_probably_ doesn't need it (ie there's at least some likelihood of an
>>execve() or similar).
>
>
> I agree, seems a great idea to me (sulking because I was too dumb
> to get it, even when Nick and Andi first posted their patches).
>
> It won't just save on the copying at fork time, it'll save on
> undoing it all again when the child mm is torn down for exec.
>
> The refaulting will hurt the performance of something: let's
> just hope that something doesn't turn out to be a show-stopper.
>

OK let's see how Ray goes, and try it when 2.6.14 opens...

> I see some flaws in the various patches posted, including Rik's.
> Here's another version - doing it inside copy_page_range, so this
> kind of vma special-casing is over in mm/ rather than kernel/.
>

Yeah I guess that's a good idea. Patch looks pretty good.
Just a minor issue with the comment: it is not strictly
just assuming the child will exec... IMO it is worthwhile
in Ray's case even if his forked process _eventually_ ends
up touching all the shared memory pages; it is better to
avoid many ms of fork overhead.

Also, on NUMA systems this will help get page tables allocated
on the right nodes, which is not an insignificant problem for
big HPC jobs.

--
SUSE Labs, Novell Inc.


Hugh Dickins

Aug 26, 2005, 12:40:14 PM
On Fri, 26 Aug 2005, Ross Biro wrote:
> On 8/26/05, Hugh Dickins <hu...@veritas.com> wrote:
> >
> > The refaulting will hurt the performance of something: let's
> > just hope that something doesn't turn out to be a show-stopper.
>
> Why not just fault in all the pages on the first fault. Then the performance
> loss is a single page fault (the page table copy that would have happened a
> fork time now happens at fault time) and you get the big win for processes
> that do fork/exec.

"all" might be very many more pages than were ever mapped in the parent,
and not be a win. Some faultahead might work better. Might, might, ...

Hugh

Ross Biro

Aug 26, 2005, 12:50:41 PM
On 8/26/05, Hugh Dickins <hu...@veritas.com> wrote:
> On Fri, 26 Aug 2005, Ross Biro wrote:
> > On 8/26/05, Hugh Dickins <hu...@veritas.com> wrote:
> > >
> > > The refaulting will hurt the performance of something: let's
> > > just hope that something doesn't turn out to be a show-stopper.
> >
> > Why not just fault in all the pages on the first fault. Then the performance
> > loss is a single page fault (the page table copy that would have happened a
> > fork time now happens at fault time) and you get the big win for processes
> > that do fork/exec.
>
> "all" might be very many more pages than were ever mapped in the parent,
> and not be a win. Some faultahead might work better. Might, might, ...

If you reduce "all" to whatever would have been done in fork
originally, then you've got a big win in some cases and a minimal
loss in others, and it's easy to argue you've got something better.

Now changing "all" to something even less might be an even bigger win,
but that requires a lot of benchmarking to justify.

Ross

Ray Fucillo

Aug 26, 2005, 1:10:07 PM
Nick Piggin wrote:
> OK let's see how Ray goes, and try it when 2.6.14 opens...

Working on that now - I'll let you know.

> Yeah I guess that's a good idea. Patch looks pretty good.
> Just a minor issue with the comment: it is not strictly
> just assuming the child will exec... IMO it is worthwhile
> in Ray's case even if his forked process _eventually_ ends
> up touching all the shared memory pages; it is better to
> avoid many ms of fork overhead.

Yes, in our database system the child will immediately touch some shmem
pages, and may eventually touch most of them (and would almost never
exec()). Fork performance is critical in usage scenarios where an
end-user database request forks a new server process from one master
server process.

However, there is still a need that the child, once successfully forked,
is operational reasonably quickly. I suspect that Ross's idea of paging
in everything after the first fault would not be optimal for us, because
we'd still be talking about hundreds of ms of work done before the child
does anything useful. It would still be far better than the behavior we
have today because that time would no longer be synchronous with the
fork(). Of course, it sounds like our app might be able to make use of
the hugetlb stuff to mitigate this problem in the future...

Rik van Riel

Aug 26, 2005, 2:00:19 PM
On Fri, 26 Aug 2005, Ray Fucillo wrote:

> However, there is still a need that the child, once successfully forked, is
> operational reasonably quickly. I suspect that Ross's idea of paging in
> everything after the first fault would not be optimal for us, because we'd
> still be talking about hundreds of ms of work done before the child does
> anything useful.

Simply skipping the page table setup of MAP_SHARED regions
should be enough to fix this issue.

> It would still be far better than the behavior we have today because
> that time would no longer be synchronous with the fork().

Filling in all the page table entries at the first fault to
a VMA doesn't make much sense, IMHO.

The reason I think this is that people have experimented
with prefaulting already resident pages at page fault time,
and those experiments have never shown a conclusive benefit.

Now, if doing such prefaulting for normal processes does not
show a benefit - why would it be beneficial to recently forked
processes with a huge SHM area ?

I suspect we would be better off without that extra complexity,
unless there is a demonstrated benefit to it.

--
All Rights Reversed

Linus Torvalds

Aug 26, 2005, 2:10:17 PM

On Fri, 26 Aug 2005, Hugh Dickins wrote:
>
> I see some flaws in the various patches posted, including Rik's.
> Here's another version - doing it inside copy_page_range, so this
> kind of vma special-casing is over in mm/ rather than kernel/.

I like this approach better, but I don't understand your particular
choice of bits.

> +	 * Assume the fork will probably exec: don't waste time copying
> +	 * ptes where a page fault will fill them correctly afterwards.
> +	 */
> +	if ((vma->vm_flags & (VM_MAYSHARE|VM_HUGETLB|VM_NONLINEAR|VM_RESERVED))
> +	    == VM_MAYSHARE)
> +		return 0;
> +
>  	if (is_vm_hugetlb_page(vma))
>  		return copy_hugetlb_page_range(dst_mm, src_mm, vma);

First off, if you just did it below the hugetlb check, you'd not need to
check hugetlb again. And while I understand VM_NONLINEAR and VM_RESERVED,
can you please comment on why VM_MAYSHARE is so important, and why no
other information matters.

Now, VM_MAYSHARE is a sign of the mapping being a shared mapping. Fair
enough. But afaik, a shared anonymous mapping absolutely needs its page
tables copied, because those page tables contains either the pointers to
the shared pages, or the swap entries.

So I really think you need to verify that it's a file mapping too.

Also, arguably, there are other cases that may or may not be worth
worrying about. What about non-shared non-writable file mappings? What
about private mappings that haven't been COW'ed?

So I think that in addition to your tests, you should test for
"vma->vm_file", and you could toy with testing for "vma->anon_vma" being
NULL (the latter will cause a _lot_ of hits, because any read-only private
mapping will trigger, but it's a good stress-test and conceptually
interesting, even if I suspect it will kill any performance gain through
extra minor faults in the child).

Linus

Ross Biro

Aug 26, 2005, 2:30:15 PM
On 8/26/05, Rik van Riel <ri...@redhat.com> wrote:
>
> Filling in all the page table entries at the first fault to
> a VMA doesn't make much sense, IMHO.
>
>
> I suspect we would be better off without that extra complexity,
> unless there is a demonstrated benefit to it.

You are probably right, but do you want to put in a patch that might
have a big performance impact in either direction without verifying
it?

My suggestion is safe, but most likely sub-optimal. What everyone
else is suggesting may be far better, but needs to be verified first.

I'm suggesting that we change the code to do the same work fork would
have done on the first page fault immediately, since it's easy to
argue that it's not much worse than we have now and much better in
many cases, and then try to experiment and figure out what the
correct solution is.

Ross

Hugh Dickins

Aug 26, 2005, 2:50:07 PM
On Fri, 26 Aug 2005, Linus Torvalds wrote:
> On Fri, 26 Aug 2005, Hugh Dickins wrote:
> >
> > I see some flaws in the various patches posted, including Rik's.
> > Here's another version - doing it inside copy_page_range, so this
> > kind of vma special-casing is over in mm/ rather than kernel/.
>
> I like this approach better, but I don't understand your particular
> choice of bits.
>
> > +	 * Assume the fork will probably exec: don't waste time copying
> > +	 * ptes where a page fault will fill them correctly afterwards.
> > +	 */
> > +	if ((vma->vm_flags & (VM_MAYSHARE|VM_HUGETLB|VM_NONLINEAR|VM_RESERVED))
> > +	    == VM_MAYSHARE)
> > +		return 0;
> > +
> >  	if (is_vm_hugetlb_page(vma))
> >  		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
>
> First off, if you just did it below the hugetlb check, you'd not need to
> check hugetlb again.

Yes: I wanted to include VM_HUGETLB in the list as documentation really;
and it costs nothing to test it along with the other flags - or are there
architectures where the more bits you test, the costlier?

> And while I understand VM_NONLINEAR and VM_RESERVED,
> can you please comment on why VM_MAYSHARE is so important, and why no
> other information matters.

The VM_MAYSHARE one isn't terribly important, there's no correctness
reason to replace VM_SHARED there. It's just that do_mmap_pgoff takes
VM_SHARED and VM_MAYWRITE off a MAP_SHARED mapping of a file which was
not opened for writing. We can safely avoid copying the ptes of such a
vma, just as with the writable ones, but the VM_MAYSHARE test catches
them where the VM_SHARED test does not.

> Now, VM_MAYSHARE is a sign of the mapping being a shared mapping. Fair
> enough. But afaik, a shared anonymous mapping absolutely needs its page
> tables copied, because those page tables contains either the pointers to
> the shared pages, or the swap entries.
>
> So I really think you need to verify that it's a file mapping too.

Either I'm misunderstanding, or you're remembering back to how shared
anonymous was done in 2.2 (perhaps). In 2.4 and 2.6, shared anonymous
is "backed" by a shared memory object, created by shmem_zero_setup:
which sets vm_file even though we came into do_mmap_pgoff with no file.

> Also, arguably, there are other cases that may or may not be worth
> worrying about. What about non-shared non-writable file mappings? What
> about private mappings that haven't been COW'ed?

Non-shared non-currently-writable file mappings might have been writable
and modified in the past, so we cannot necessarily skip those.

We could, and I did, consider testing whether the vma has an anon_vma:
we always allocate a vma's anon_vma just before first allocating it a
private page, and it's a good test which swapoff uses to narrow its
search.

But partly I thought that a little too tricksy, and hard to explain;
and partly I thought it was liable to catch the executable text,
some of which is most likely to be needed in between fork and exec.

> So I think that in addition to your tests, you should test for
> "vma->vm_file", and you could toy with testing for "vma->anon_vma" being
> NULL (the latter will cause a _lot_ of hits, because any read-only private
> mapping will trigger, but it's a good stress-test and conceptually
> interesting, even if I suspect it will kill any performance gain through
> extra minor faults in the child).

Ah yes, I wrote the paragraph above before reading this one, honest!

Well, I still don't think we need to test vm_file. We can add an
anon_vma test if you like, if we really want to minimize the fork
overhead, in favour of later faults. Do we?

Hugh

Hugh Dickins

Aug 26, 2005, 3:00:24 PM
On Fri, 26 Aug 2005, Ross Biro wrote:
> On 8/26/05, Rik van Riel <ri...@redhat.com> wrote:
> >
> > Filling in all the page table entries at the first fault to
> > a VMA doesn't make much sense, IMHO.
> >
> > I suspect we would be better off without that extra complexity,
> > unless there is a demonstrated benefit to it.
>
> You are probably right, but do you want to put in a patch that might
> > have a big performance impact in either direction without verifying
> it?
>
> My suggestion is safe, but most likely sub-optimal. What everyone
> else is suggesting may be far better, but needs to be verified first.

It all has to be verified, and the problem will be that some things
fare well and others badly: how to reach a balanced decision?
Following your suggestion is no more safe than not following it.

> I'm suggesting that we change the code to do the same work fork would
> have done on the first page fault immediately, since it's easy to
> argue that it's not much worse than we have now and much better in
> many cases, and then try to experiment and figure out what the
> correct solution is.

We don't know what work fork would have done, that information was in
the ptes we decided not to bother to copy. Perhaps every pte of the
vma was set, perhaps none, perhaps only one.

Also, doing it at fault time has significantly more work to do than
just zipping along the ptes incrementing page counts and clearing bits.
I think; but probably much less extra work than I originally imagined,
since Andrew gave us the gang lookup of the page cache.

All the same, I'm with Rik: one of the great virtues of the original
idea was its simplicity; I'd prefer not to add complexity.

Hugh

Linus Torvalds

Aug 26, 2005, 7:00:12 PM

I think we might want to do it in -mm for testing. Because quite frankly,
otherwise the new fork() logic won't get a lot of testing. Shared memory
isn't that common.

Linus

Rik van Riel

Aug 26, 2005, 7:20:05 PM

When you consider NUMA placement (the child process may
end up running elsewhere), allocating things like page
tables lazily may well end up being a performance win.

--
All Rights Reversed

Linus Torvalds

Aug 26, 2005, 7:30:14 PM

On Fri, 26 Aug 2005, Rik van Riel wrote:
> On Fri, 26 Aug 2005, Hugh Dickins wrote:
>
> > Well, I still don't think we need to test vm_file. We can add an
> > anon_vma test if you like, if we really want to minimize the fork
> > overhead, in favour of later faults. Do we?
>
> When you consider NUMA placement (the child process may
> end up running elsewhere), allocating things like page
> tables lazily may well end up being a performance win.

It should be easy enough to benchmark something like kernel compiles etc,
which are reasonably fork-rich and should show a good mix for something
like this. Or even just something like "time to restart an X session" after
you've brought it into memory once.

Linus

Nick Piggin

Aug 27, 2005, 11:10:12 AM
Linus Torvalds wrote:
>
> On Fri, 26 Aug 2005, Rik van Riel wrote:
>
>>On Fri, 26 Aug 2005, Hugh Dickins wrote:
>>
>>
>>>Well, I still don't think we need to test vm_file. We can add an
>>>anon_vma test if you like, if we really want to minimize the fork
>>>overhead, in favour of later faults. Do we?
>>
>>When you consider NUMA placement (the child process may
>>end up running elsewhere), allocating things like page
>>tables lazily may well end up being a performance win.
>
>
> It should be easy enough to benchmark something like kernel compiles etc,
> which are reasonably fork-rich and should show a good mix for something
> like this. Or even just something like "time to restart an X session" after
> you've brought it into memory once.
>

2.6.13-rc7-git2
kbuild (make -j4) on dual G5.

plain
228.85user 19.90system 2:06.50elapsed 196%CPU (3725666minor)
228.91user 19.90system 2:06.07elapsed 197%CPU (3721353minor)
229.00user 19.78system 2:06.20elapsed 197%CPU (3721345minor)
228.81user 19.94system 2:06.05elapsed 197%CPU (3723791minor)

nocopy shared
229.28user 19.76system 2:06.24elapsed 197%CPU (3725661minor)
229.04user 19.91system 2:06.92elapsed 196%CPU (3718904minor)
228.97user 20.06system 2:06.46elapsed 196%CPU (3723807minor)
229.24user 19.84system 2:06.13elapsed 197%CPU (3723793minor)

nocopy all
228.74user 19.87system 2:06.27elapsed 196%CPU (3819927minor)
228.89user 19.81system 2:05.89elapsed 197%CPU (3822943minor)
228.77user 19.73system 2:06.23elapsed 196%CPU (3820517minor)
228.93user 19.70system 2:05.84elapsed 197%CPU (3822935minor)

I'd say the full test (including anon_vma) is maybe slightly
faster on this test though maybe it isn't significant.

It is doing around 2.5% more minor faults, though the profiles
say copy_page_range time is reduced as one would expect.

I think that if all else (ie. final performance) is equal, then
faulting is better than copying because the work is being
deferred until it is needed, and we dodge some pathological
cases like Ray's database taking 100s of ms to fork (we hope!)

However it will always depend on workload.

This is the condition I ended up with. Any good?

if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_RESERVED))) {
	if (vma->vm_flags & VM_MAYSHARE)
		return 0;
	if (vma->vm_file && !vma->anon_vma)
		return 0;
}

--
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com

Hugh Dickins

unread,
Aug 28, 2005, 12:30:06 AM8/28/05
to
On Sun, 28 Aug 2005, Nick Piggin wrote:
>
> This is the condition I ended up with. Any good?
>
> if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_RESERVED))) {
> if (vma->vm_flags & VM_MAYSHARE)
> return 0;
> if (vma->vm_file && !vma->anon_vma)
> return 0;
> }

It's not bad, and practical timings are unlikely to differ, but your
VM_MAYSHARE test is redundant (VM_MAYSHARE areas don't have anon_vmas *),
and your vm_file test is unnecessary, excluding pure anonymous areas
which haven't yet taken a fault.

Please do send Andrew the patch for -mm, Nick: you were one of the
creators of this (don't omit credit to Ray, Parag, Andi, Rik, Linus),
much better that it go in your name (heh, heh, heh, can you trust me?)

Hugh

* That's ignoring, as we do everywhere else, the case which came up
a couple of weeks back in discussions with Linus, ptrace writing to
an area the process does not have write access to, creating an anon
page within a shared vma: that's an awkward case currently mishandled,
but the patch below does it no harm.

--- 2.6.13-rc7/mm/memory.c	2005-08-24 11:13:41.000000000 +0100
+++ linux/mm/memory.c	2005-08-28 04:48:34.000000000 +0100
@@ -498,6 +498,15 @@ int copy_page_range(struct mm_struct *ds
 	unsigned long addr = vma->vm_start;
 	unsigned long end = vma->vm_end;
 
+	/*
+	 * Assume the fork will probably exec: don't waste time copying
+	 * ptes where a page fault will fill them correctly afterwards.
+	 */
+	if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_RESERVED))) {
+		if (!vma->anon_vma)
+			return 0;
+	}
+
 	if (is_vm_hugetlb_page(vma))
 		return copy_hugetlb_page_range(dst_mm, src_mm, vma);

Nick Piggin

unread,
Aug 28, 2005, 3:00:11 AM8/28/05
to
Hugh Dickins wrote:
> On Sun, 28 Aug 2005, Nick Piggin wrote:
>
>>This is the condition I ended up with. Any good?
>>
>> if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_RESERVED))) {
>> 	if (vma->vm_flags & VM_MAYSHARE)
>> 		return 0;
>> 	if (vma->vm_file && !vma->anon_vma)
>> 		return 0;
>> }
>
>
> It's not bad, and practical timings are unlikely to differ, but your
> VM_MAYSHARE test is redundant (VM_MAYSHARE areas don't have anon_vmas *),
> and your vm_file test is unnecessary, excluding pure anonymous areas
> which haven't yet taken a fault.
>

Haven't taken a _write_ fault? Hmm, OK that would seem to be a good
optimisation as well: we don't need to copy anon memory with only
ZERO_PAGE mappings... well, good as in "nice and logical" if not so
much "will make a difference"!

> Please do send Andrew the patch for -mm, Nick: you were one of the
> creators of this (don't omit credit to Ray, Parag, Andi, Rik, Linus),
> much better that it go in your name (heh, heh, heh, can you trust me?)
>

Well Andi and I seemed to have the idea independently, Linus thought
private would be a good idea (I agree), you came up with the complete
patch with others contributing bits and pieces, and most importantly
Ray brought our attention to the possible deficiency in our mm.

> Hugh
>
> * That's ignoring, as we do everywhere else, the case which came up
> a couple of weeks back in discussions with Linus, ptrace writing to
> an area the process does not have write access to, creating an anon
> page within a shared vma: that's an awkward case currently mishandled,
> but the patch below does it no harm.
>

And in that case maybe your patch works better anyway, because the child
will inherit that page from parent.

How does the following look? (I changed the comment a bit). Andrew, please
apply if nobody objects.

vm-lazy-fork.patch

Ray Fucillo

unread,
Aug 29, 2005, 7:40:11 PM8/29/05
to
Nick Piggin wrote:
> How does the following look? (I changed the comment a bit). Andrew, please
> apply if nobody objects.

Nick, I applied this latest patch to a 2.6.12 kernel and found that it
does resolve the problem. Prior to the patch on this machine, I was
seeing about 23ms spent in fork for every 100MB of shared memory segment.
After applying the patch, fork is taking about 1ms regardless of the
shared memory size.

Many thanks to everyone for your help on this.

FWIW, an interesting side effect of this occurs when I run the database
with this patch internally on a Linux server that uses NIS. It's an
unrelated problem and not a kernel problem. It's due to the children
calling initgroups()... apparently when many processes make
simultaneous initgroups() calls, something starts imposing very long
waits in increments of 3 seconds, so some processes return from
initgroups() in a few ms and other processes complete in 3, 6, 9, up to
21 seconds (plus a few ms). I'm not sure what the story is with that,
though it's clearly not a kernel issue. If someone happens to have the
answer or a suggestion, great; otherwise I'll pursue that elsewhere as
necessary. (I can reproduce this by simply adding an initgroups() call
in the child of the forktest program that I sent earlier)

Linus Torvalds

unread,
Aug 29, 2005, 8:40:11 PM8/29/05
to

On Mon, 29 Aug 2005, Ray Fucillo wrote:
>
> FWIW, an interesting side effect of this occurs when I run the database
> with this patch internally on a Linux server that uses NIS. Its an
> unrelated problem and not a kernel problem. Its due to the children
> calling initgroups()... apparently when you have many processes making
> simultaneous initgroups() calls something starts imposing very long
> waits in increments of 3 seconds

Sounds like something is backing off by waiting for three seconds whenever
some lock failure occurs. I don't see what locking the code might want to
do (it should just do the NIS equivalent of reading /etc/groups and do a
"setgroups()" system call), but I assume that the NIS server ends up
having some strange locking.

You might do an "ltrace testcase" (and, probably, the nis server) to see
if you can see where it happens, and bug the appropriate maintainers.
Especially if you have a repeatable test-case (where "repeatable" isn't
just for your particular machine: it's probably timing-related), somebody
might even fix it ;)

Linus

Nick Piggin

unread,
Aug 29, 2005, 9:00:13 PM8/29/05
to
Ray Fucillo wrote:

> Nick Piggin wrote:
>
>> How does the following look? (I changed the comment a bit). Andrew,
>> please apply if nobody objects.
>
>
> Nick, I applied this latest patch to a 2.6.12 kernel and found that it
> does resolve the problem. Prior to the patch on this machine, I was
> seeing about 23ms spent in fork for ever 100MB of shared memory
> segment. After applying the patch, fork is taking about 1ms
> regardless of the shared memory size.
>

Hi Ray,
That's good news. I think we should probably consider putting the patch in
2.6.14 or if not, then definitely 2.6.15.

Andrew, did you pick up the patch or should I resend to someone?

I think the fork latency alone is enough to justify inclusion... however,
did you actually see increased aggregate throughput of your database (or
at least not a _decreased_ throughput)?

> Many thanks to everyone for your help on this.
>

Well thank you very much for breaking the kernel and telling us about it! :)

Nick


Send instant messages to your online friends http://au.messenger.yahoo.com

Linus Torvalds

unread,
Aug 29, 2005, 9:10:11 PM8/29/05
to

On Tue, 30 Aug 2005, Nick Piggin wrote:
>
> Andrew, did you pick up the patch or should I resend to someone?

I picked it up. If it causes performance regressions, we can fix them, and
if it causes other problems then that will be interesting in itself.

Linus
