[RFC][PATCHSET] mremap/mmap mess

99 views
Skip to first unread message

Al Viro

unread,
Dec 5, 2009, 3:18:59 PM12/5/09
to linux...@vger.kernel.org, torv...@linux-foundation.org, linux-...@vger.kernel.org

[NOTE: the patch series below is not for merge until ACKed by arch maintainers]

We have a bunch of interesting problems with mmap/mremap.

1) MREMAP_FIXED allows remap to any location, regardless of what
the architecture has to say about it. The only check is TASK_SIZE.
That's not enough - e.g. there are architectures where some ranges
are simply absent (itanic, sparc), there are some that have cache
coherency requirements on alignments of shared mapping (a lot -
anything with VIPT cache, itanic where it's not a coherency but
a performance issue). There are architectures where specific ranges
are reserved for hugetlb and they either simply do not allow normal
mappings in there or need to do something to make them possible (as
ppc64 does). sparc tried to deal with that problem, but it hadn't
been complete (alignment issues) and it had been actually wrong for
non-MREMAP_FIXED calls of mremap().

2) without MREMAP_FIXED we happily allowed extension into a hole in
address space - the only check for being able to extend without
move had been for TASK_SIZE (and for non-overlap with other vmas).
Victims: sparc, itanic due to extending into holes, powerpc due to
extending into hugetlb range.

3) in case of relocation without MREMAP_FIXED we ended up doing
get_unmapped_area() with wrong pgoff if the starting address had
been in the middle of a mapping. New vma gets the right pgoff,
the checks are done for the wrong one. Cache coherency issues
on all VIPT architectures.

4) mmap() with MAP_HUGETLB leaks struct file if it bails out anywhere
past the allocation of struct file (by do_mmap_pgoff())

5) brk() into a hugetlb range failed without trying to do anything;
known thing, ppc folks had been unhappy about that.

Series below should deal with those, mostly by switching to consistent
use of get_unmapped_area() and sanitizing mmap/mremap code in general.

There is one case where we still have a serious PITA and I'm not sure
how to deal with that; it's expand_stack(). We can trigger that by
creating a VM_GROWS{UP,DOWN} mapping and either hitting a pagefault
on address {below,above} it or doing PTRACE_POKEDATA on such address.
As it is, we only check that range we are expanding into is not a
hugetlb-only one. The thing is, we *can't* just do the normal checks
as-is there.

For cases when we do expand_stack() for our own mm that would work just
fine and do the right thing. Unfortunately, we have places that hit
it from get_user_pages() for another process. And checks (starting with
"what's the maximal address we allow") are process-dependent on biarch
architectures. Worse yet, execve() does that when we have no other
process - it creates new mm, puts an anonymous mapping as high as
possible in it and copies argv/envp in there. And that's done with
get_user_pages() on new mm. If we have a 32bit task on e.g. amd64,
we'll have that mapping at addresses far above the TASK_SIZE of caller.
Later, when ->load_binary() figures out what personality we'll get,
it turns that mapping into a valid vma for stack, possibly relocating
the entire thing to address suitable for resulting process.

Breaking execve() from 32bit processes on biarch architectures would
be a bad thing, so we can't just add the normal set of checks to
expand_stack() (acct_stack_growth(), actually). The problem is quite
real, though, since e.g. on itanic PTRACE_POKEDATA can be used to get
a vma hanging down into a gap in address space quite easily. Results
are not pretty...

One way to deal with that would be to put enough information into mm_struct
so that all these checks wouldn't have to look at the caller's personality.
I'm not sure how much PITA would that be, though; I've been digging through
the arch/* VM code for several weeks by now, but I certainly don't pretend
to be able to spot e.g. performance implications of such change.

Comments (both on that issue and on following patch series) would be very
welcome.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Al Viro

unread,
Dec 5, 2009, 6:02:07 PM12/5/09
to mi...@redhat.com, linux...@vger.kernel.org, torv...@linux-foundation.org, linux-...@vger.kernel.org
BTW, another fun question in the same area: what enforces the virtual
aliasing rules for fremap.c stuff? AFAICS, on VIPT architectures
we are asking for trouble there. No?

Is there anything that would prevent an arbitrary page from file's
page cache from being mapped at any cache colours?

Russell King

unread,
Dec 5, 2009, 6:59:02 PM12/5/09
to Al Viro, linux...@vger.kernel.org, torv...@linux-foundation.org, linux-...@vger.kernel.org
On Sat, Dec 05, 2009 at 07:08:50PM +0000, Al Viro wrote:
> [NOTE: the patch series below is not for merge until ACKed by arch maintainers]

Having spent a while looking at these changes, I don't forsee any
problems with patches 1,2,3,4,5,9. So that range of patches can
have my:

Acked-by: Russell King <rmk+k...@arm.linux.org.uk>

If I get some spare time tomorrow, I'll try to test them on hardware,
and check that the page mappings below the minimum address are still
prevented on ARM, but I don't forsee a problem.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

Hugh Dickins

unread,
Dec 6, 2009, 12:23:13 PM12/6/09
to Al Viro, linux...@vger.kernel.org, torv...@linux-foundation.org, linux-...@vger.kernel.org
On Sat, 5 Dec 2009, Al Viro wrote:
>
> Series below should deal with those, mostly by switching to consistent
> use of get_unmapped_area() and sanitizing mmap/mremap code in general.

Sorry, I've not got through them all yet.

>
> There is one case where we still have a serious PITA and I'm not sure
> how to deal with that; it's expand_stack(). We can trigger that by
> creating a VM_GROWS{UP,DOWN} mapping and either hitting a pagefault
> on address {below,above} it or doing PTRACE_POKEDATA on such address.
> As it is, we only check that range we are expanding into is not a
> hugetlb-only one. The thing is, we *can't* just do the normal checks
> as-is there.

It's been noticed in the past that the wrong rlimits are checked too.
I toyed with this patch nearly five years ago: does anyone know why
we need access_process_vm() to be able to expand another's stack?


[PATCH] mm: forbid expand_stack on another mm

In a very few cases, get_user_pages() is used on another process's address
space: by access_process_vm() (ptrace, /proc/<pid>/mem and a few others),
perhaps arch/arm/mach-bcmring/dma.c, perhaps drivers/gpu/drm/ttm/ttm_tt.c.

But expand_stack() can get into trouble when used on another process:
the wrong rlimits are applied, and the stack could be expanded into an
area forbidden to that kind of process - mm doesn't contain all the info.

I've never found out why we need access_process_vm() to expand a stack:
shall we see what breaks if we just forbid expand_stack() on another mm?

Signed-off-by: Hugh Dickins <hugh.d...@tiscali.co.uk>
---

mm/mmap.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

--- 2.6.32/mm/mmap.c 2009-12-03 03:51:21.000000000 +0000
+++ linux/mm/mmap.c 2009-12-06 16:49:01.000000000 +0000
@@ -1717,7 +1717,7 @@ find_extend_vma(struct mm_struct *mm, un
vma = find_vma_prev(mm, addr, &prev);
if (vma && (vma->vm_start <= addr))
return vma;
- if (!prev || expand_stack(prev, addr))
+ if (!prev || mm != current->mm || expand_stack(prev, addr))
return NULL;
if (prev->vm_flags & VM_LOCKED) {
if (mlock_vma_pages_range(prev, addr, prev->vm_end) < 0)
@@ -1745,6 +1745,8 @@ find_extend_vma(struct mm_struct * mm, u
return vma;
if (!(vma->vm_flags & VM_GROWSDOWN))
return NULL;
+ if (mm != current->mm)
+ return NULL;
start = vma->vm_start;
if (expand_stack(vma, addr))
return NULL;

Linus Torvalds

unread,
Dec 6, 2009, 1:00:42 PM12/6/09
to Hugh Dickins, Al Viro, linux...@vger.kernel.org, linux-...@vger.kernel.org

On Sun, 6 Dec 2009, Hugh Dickins wrote:
>
> I've never found out why we need access_process_vm() to expand a stack:
> shall we see what breaks if we just forbid expand_stack() on another mm?

Hmm. If you want to actually emulate some stack instruction using ptrace,
you'd need to be able to extend the stack.

The classic example of this might be to do some crazy user-space emulation
of 'pushf' for virtualization.

And emulating pushf is not theory: at least KVM does actually do exactly
that (although KVM obviously does it from kernel space and from within the
process that faulted). Same goes for vm86 mode (again, we do that
emulation in kernel) and for ptrace single-stepping (which we actually
don't bother emulating).

The point being that at least 'pushf' really _is_ an instruction that (a)
might want to extend the stack and (b) does tend to need emulation or
fixup in some virtualized/emulated environments. I just don't know if you
actually ever have user space doing so.

But I could imagine that Wine does some pushf emulation using ptrace, for
example.

Linus

Al Viro

unread,
Dec 6, 2009, 10:59:09 PM12/6/09
to Al Viro, linux...@vger.kernel.org, torv...@linux-foundation.org, linux-...@vger.kernel.org
Updated series posted, with fixes merged. It should also cover
the shm bug caught by dwg - in case of hugepage shm we ended up with
oopsen due to b0rken ->get_unmapped_area().

Hugh Dickins

unread,
Dec 7, 2009, 1:58:46 PM12/7/09
to Al Viro, Al Viro, linux...@vger.kernel.org, torv...@linux-foundation.org, ak...@linux-foundation.org, linux-...@vger.kernel.org
On Mon, 7 Dec 2009, Al Viro wrote:
> Updated series posted, with fixes merged. It should also cover
> the shm bug caught by dwg - in case of hugepage shm we ended up with
> oopsen due to b0rken ->get_unmapped_area().

I've now gone over these, is it okay for me to comment all in one?
I do like the way complex conditions have evaporated, though sometimes
I had to spend a long time looking to see how it was justified.
I've given no more thought to that nasty PITA you raise, since
Linus reminded me of the pushf emulation issue (and what I proposed
would also have needed more to handle the exec args case).

[PATCH 1/19] untangling do_mremap(), part 1
Acked-by: Hugh Dickins <hugh.d...@tiscali.co.uk>

[PATCH 2/19] do_mremap() untangling, part 2
Acked-by: Hugh Dickins <hugh.d...@tiscali.co.uk>

[PATCH 3/19] do_mremap() untangling, part 3
Acked-by: Hugh Dickins <hugh.d...@tiscali.co.uk>

[PATCH 4/19] fix checks for expand-in-place mremap

A couple of points on vma_expandable() in 4/19:
+ if (arch_mmap_check(vma->vm_start, end - vma->vm_start, MAP_FIXED))
+ return 0;
+ if (get_unmapped_area(NULL, vma->vm_start, end - vma->vm_start,
+ 0, MAP_FIXED) & ~PAGE_MASK)
return 0;

I wondered why you don't pass the appropriate filp and pgoff here?
Maybe it doesn't matter for anything at present (given that you're
expanding an existing mmap), but even if that's the case, I do think
it would be more robust to pass the correct filp and pgoff.

And that arch_mmap_check(,, MAP_FIXED) there: no problem with that,
but it does become a problem later on (mainly in 9 and 11 and 16):
one idiocy you haven't yet noticed, is that (a) nothing has been
using the flags arg to arch_mmap_check(), and (b) do_mmap_pgoff()
and do_brk() disagree on whether they're MAP_ flags or VM_ flags
(and MAP_FIXED == VM_MAYREAD).

You're going for MAP_ flags, fine, but then do_brk() will need
changing once you take notice of those flags (in 9 and 11).

[PATCH 5/19] fix the arch checks in MREMAP_FIXED case
Acked-by: Hugh Dickins <hugh.d...@tiscali.co.uk>

[PATCH 6/19] fix pgoff in "have to relocate" case of mremap()
Acked-by: Hugh Dickins <hugh.d...@tiscali.co.uk>

but that get_unmapped_area() ought to be accompanied by an
arch_mmap_check(), oughtn't it? Doesn't really matter because
in the end arch_mmap_check() moves into get_unmapped_area()
anyway, so maybe you decided to leave that issue until then,
and just stick to the pgoff fix here for this patch.

[PATCH 7/19] kill useless checks in sparc mremap variants
Acked-by: Hugh Dickins <hugh.d...@tiscali.co.uk>

[PATCH 8/19] file ->get_unmapped_area() shouldn't duplicate work of get_unmapped_area()

I kept on finding more interesting things to do than arrive at a
full understanding of file ->get_unmapped_area()s: they confuse me,
and obviously that's not your fault. But, while I agree with your
principle of apportioning the work appropriately between the different
helpers, I did wonder (a) what actual benefit this patch brings? you
don't mention it, and it looks like a rearrangement which is very
easy to get wrong (which you admit you did), and (b) whether the patch
is complete, since there are lots of driver ->get_unmapped_area()s
which are not doing the current->mm->get_unmapped_area thing.
I think. Maybe they're all NOMMU, and it doesn't matter there:
I gave up on trying to work it all out and moved on.

[PATCH 9/19] arm: add arch_mmap_check(), get rid of sys_arm_mremap()

You give arm an arch_mmap_check() which tests MAP_FIXED,
so now do_brk() needs fixing. Or can arm's get_unmapped_area()
handle this, without any arch_mmap_check()? In the end you move
the arch_mmap_check() call into get_unmapped_area(), but could it be
eliminated completely, in favour of code in arch's get_unmapped_area()?

[PATCH 10/19] Kill ancient crap in s390 compat mmap
Acked-by: Hugh Dickins <hugh.d...@tiscali.co.uk>

[PATCH 11/19] arch_mmap_check() on mn10300
Like 9/19, another arch_mmap_check() which tests MAP_FIXED.

[PATCH 12/19] Cut hugetlb case early for 32bit on ia64
Could you explain this one a bit more, please? I worry because
MAP_HUGETLB is a recently added special, there's plenty of hugetlb
without MAP_HUGETLB, so I wonder if you're really catching early
all that you want to be catching, whatever that is.

[PATCH 13/19] Unify sys_mmap*
Couple of points on this one.
(a) I didn't understand the arch/score part of it, why you said it
should almost certainly shift pgoff too, nor why you left in the
(pgoff & (~PAGE_MASK >> 12)) part, which looked redundant to me.

And (b) I thought you were being perverse in putting sys_mmap_pgoff()
in mm/util.c, that's never where I'd look for it, we have a file
mm/mmap.c which is just the place for it, after do_mmap_pgoff().
Ah, you're trying to avoid duplicating it in mm/nommu.c? Hmm,
well, I'd much rather you do duplicate it. Particularly once
14/19 complicates it with the MAP_HUGETLB fix, which we should
keep in in mm/mmap.c, and shouldn't be needed in mm/nommu.c.

[PATCH 14/19] fix a struct file leak in do_mmap_pgoff()
Good catch, but please transfer to mm/mmap.c.

[PATCH 15/19] Take arch_mmap_check() into get_unmapped_area()
Acked-by: Hugh Dickins <hugh.d...@tiscali.co.uk>

[PATCH 16/19] switch do_brk() to get_unmapped_area()
Here we see do_brk()'s confusion over arch_mmap_check()'s flags.

[PATCH 17/19] sparc_brk() is not needed anymore
Acked-by: Hugh Dickins <hugh.d...@tiscali.co.uk>

[PATCH 18/19] Get rid of open-coding in ia64_brk()
Acked-by: Hugh Dickins <hugh.d...@tiscali.co.uk>

[PATCH 19/19] fix broken aliasing checks for MAP_FIXED on sparc32, mips, arm and sh
Acked-by: Hugh Dickins <hugh.d...@tiscali.co.uk>

Thanks,
Hugh

Al Viro

unread,
Dec 7, 2009, 2:31:01 PM12/7/09
to Hugh Dickins, Al Viro, linux...@vger.kernel.org, torv...@linux-foundation.org, ak...@linux-foundation.org, linux-...@vger.kernel.org
On Mon, Dec 07, 2009 at 06:58:25PM +0000, Hugh Dickins wrote:

> [PATCH 4/19] fix checks for expand-in-place mremap
>
> A couple of points on vma_expandable() in 4/19:
> + if (arch_mmap_check(vma->vm_start, end - vma->vm_start, MAP_FIXED))
> + return 0;
> + if (get_unmapped_area(NULL, vma->vm_start, end - vma->vm_start,
> + 0, MAP_FIXED) & ~PAGE_MASK)
> return 0;
>
> I wondered why you don't pass the appropriate filp and pgoff here?
> Maybe it doesn't matter for anything at present (given that you're
> expanding an existing mmap), but even if that's the case, I do think
> it would be more robust to pass the correct filp and pgoff.

Umm... Can do, but that's a bit of an extra work - I'd need to check
if we want MAP_SHARED passed if we bother to pass file.

> And that arch_mmap_check(,, MAP_FIXED) there: no problem with that,
> but it does become a problem later on (mainly in 9 and 11 and 16):
> one idiocy you haven't yet noticed, is that (a) nothing has been
> using the flags arg to arch_mmap_check(), and (b) do_mmap_pgoff()
> and do_brk() disagree on whether they're MAP_ flags or VM_ flags
> (and MAP_FIXED == VM_MAYREAD).

*ow*

> You're going for MAP_ flags, fine, but then do_brk() will need
> changing once you take notice of those flags (in 9 and 11).

Point taken, but see below.

> [PATCH 8/19] file ->get_unmapped_area() shouldn't duplicate work of get_unmapped_area()
>
> I kept on finding more interesting things to do than arrive at a
> full understanding of file ->get_unmapped_area()s: they confuse me,
> and obviously that's not your fault. But, while I agree with your
> principle of apportioning the work appropriately between the different
> helpers, I did wonder (a) what actual benefit this patch brings? you
> don't mention it, and it looks like a rearrangement which is very
> easy to get wrong (which you admit you did), and (b) whether the patch
> is complete, since there are lots of driver ->get_unmapped_area()s
> which are not doing the current->mm->get_unmapped_area thing.
> I think. Maybe they're all NOMMU, and it doesn't matter there:
> I gave up on trying to work it all out and moved on.

_That_ is one hell of a mess; most of those suckers are NOMMU (and
!MAP_FIXED, while we are at it). There are very few exceptions:
* hugetlb
* shm on top of hugetlb
* fb (== pci)
* spufs
That's *all*. And frankly, hugetlb/shm/spufs look a lot like candidates
for a single mm method; "is this a hugepage mapping" matters in a lot more
places and I'm not at all sure that spufs is correct. Oh, and there's
pure cargo-cult thing in bad_inode.c - we are not going to get that
file_operations instance as anything->f_op, so *all* methods except
->open() and ->fsync() (the latter due to nfsd playing silly buggers with
sync of directories) are never going to be called.

The benefit of patch... it's a preparation to the next one - we want to
push arch_mmap_check() down into get_unmapped_area() and the less extra
calls we grow and have to analyse, the better...

> [PATCH 9/19] arm: add arch_mmap_check(), get rid of sys_arm_mremap()
>
> You give arm an arch_mmap_check() which tests MAP_FIXED,
> so now do_brk() needs fixing. Or can arm's get_unmapped_area()
> handle this, without any arch_mmap_check()? In the end you move
> the arch_mmap_check() call into get_unmapped_area(), but could it be
> eliminated completely, in favour of code in arch's get_unmapped_area()?

Point, but take a look at actual check there. do_brk() won't run afoul
of it anyway with existing callers on arm. But yes, I agree that flags
need to be fixed there.

> [PATCH 12/19] Cut hugetlb case early for 32bit on ia64
> Could you explain this one a bit more, please? I worry because
> MAP_HUGETLB is a recently added special, there's plenty of hugetlb
> without MAP_HUGETLB, so I wonder if you're really catching early
> all that you want to be catching, whatever that is.

What I want to catch is "do_mmap_pgoff() would create struct file" case.
And MAP_HUGETLB is *exactly* what I want to catch - existing file will
get through to do_mmap_pgoff() and we'll fail due to unsuitable address
(with MAP_FIXED) or address beyond TASK_SIZE (without MAP_FIXED). That's
fine.

> [PATCH 13/19] Unify sys_mmap*
> Couple of points on this one.
> (a) I didn't understand the arch/score part of it, why you said it
> should almost certainly shift pgoff too, nor why you left in the
> (pgoff & (~PAGE_MASK >> 12)) part, which looked redundant to me.

Exactly because it's redundant ;-)

If they ever grow PAGE_SHIFT > 12, they'll need to shift; until they do
that the check is simply if (0) and gets compiled away.

That check looks like either a pure cargo-cult thing, or a preparation to
different page sizes. I'd rather give a benefit of doubt and put a warning
in comment...



> And (b) I thought you were being perverse in putting sys_mmap_pgoff()
> in mm/util.c, that's never where I'd look for it, we have a file
> mm/mmap.c which is just the place for it, after do_mmap_pgoff().
> Ah, you're trying to avoid duplicating it in mm/nommu.c? Hmm,
> well, I'd much rather you do duplicate it. Particularly once
> 14/19 complicates it with the MAP_HUGETLB fix, which we should
> keep in in mm/mmap.c, and shouldn't be needed in mm/nommu.c.

I'm not too happy with mm/util.c, but I really don't like the mmap vs nommu
duplications. Hell knows; we can always split and move later on.

Hugh Dickins

unread,
Dec 7, 2009, 3:05:25 PM12/7/09
to Al Viro, Al Viro, linux...@vger.kernel.org, torv...@linux-foundation.org, ak...@linux-foundation.org, linux-...@vger.kernel.org
Thanks for the explanations,
it sounds as if you and Linus would like to push forward.

On Mon, 7 Dec 2009, Al Viro wrote:

> On Mon, Dec 07, 2009 at 06:58:25PM +0000, Hugh Dickins wrote:
>
> > [PATCH 9/19] arm: add arch_mmap_check(), get rid of sys_arm_mremap()
> >
> > You give arm an arch_mmap_check() which tests MAP_FIXED,
> > so now do_brk() needs fixing. Or can arm's get_unmapped_area()
> > handle this, without any arch_mmap_check()? In the end you move
> > the arch_mmap_check() call into get_unmapped_area(), but could it be
> > eliminated completely, in favour of code in arch's get_unmapped_area()?
>
> Point, but take a look at actual check there. do_brk() won't run afoul
> of it anyway with existing callers on arm. But yes, I agree that flags
> need to be fixed there.

Right, all that matters is whether MAP_FIXED is set here, and the
fact that it is set through both an explicit MAP_FIXED and a confused
VM_MAYREAD does not really matter - something to clean up, but not
something that will break anybody before it's cleaned up.

> > And (b) I thought you were being perverse in putting sys_mmap_pgoff()
> > in mm/util.c, that's never where I'd look for it, we have a file
> > mm/mmap.c which is just the place for it, after do_mmap_pgoff().
> > Ah, you're trying to avoid duplicating it in mm/nommu.c? Hmm,
> > well, I'd much rather you do duplicate it. Particularly once
> > 14/19 complicates it with the MAP_HUGETLB fix, which we should
> > keep in in mm/mmap.c, and shouldn't be needed in mm/nommu.c.
>
> I'm not too happy with mm/util.c, but I really don't like the mmap vs nommu
> duplications. Hell knows; we can always split and move later on.

mm/nommu.c is all about duplicating stuff with variations:
unsatisfactory, but no reason to go do it differently here.
Yes, I'll want to revert the util.c mods, but you don't need
to do so now.

Hugh

Al Viro

unread,
Dec 8, 2009, 1:07:13 AM12/8/09
to Hugh Dickins, Al Viro, linux...@vger.kernel.org, torv...@linux-foundation.org, ak...@linux-foundation.org, linux-...@vger.kernel.org
On Mon, Dec 07, 2009 at 08:05:05PM +0000, Hugh Dickins wrote:

> mm/nommu.c is all about duplicating stuff with variations:
> unsatisfactory, but no reason to go do it differently here.
> Yes, I'll want to revert the util.c mods, but you don't need
> to do so now.

OK... BTW, I think I see how to get rid of the worst of expand_stack()
mess. Note that conceptually the nastiest part is execve() - there
we have no task_struct matching the mm we are accessing. But let's
take a look at what execve() is doing:
* we create a new mm
* we create a kinda-sorta vma at STACK_TOP_MAX
* we push argv/envp into it via get_user_pages(), populating
page tables for new mm as we go
* we set personality
* we possibly relocate it down
And all of that - to avoid the limit on number of pages caused by fixed-sized
array in bprm.

First of all, that implictly assumes that this relocation downwards is
rare. And so it is on amd64 and alpha. However, sparc64 and ppc64
have nearly 100% 32bit userland. That got to hurt and if the situation
with s390 is anywhere near that, they *really* hurt - we have variable
depth of page table tree there and forcing it up is Not Nice(tm).

Why do we want user_get_pages(), anyway? It's not that we lacked an
easy way to do large arrays, especially since the use is purely sequential.
Even a linked list of vmalloc'ed pages would do just fine (i.e. start with
static array in bprm, keep the pointer to last filled entry + number of
entries left before the next allocation; use the last pointer in array
for finding the next page-sized chunk).

What do we lose if we go that way? Inserting all these pages into mm
at once shouldn't be slower. Memory overhead is not really an issue
(one page per 511 or 1023 pages of argv). Am I missing something?

The benefit, AFAICS, is that we get rid of the mess with forced high
address use, get *sane* get_user_pages() (we always have matching
task_struct with the right personality, so we can avoid massive PITA
for doing checks right) and we get unified mmu/nommu code in fs/exec.c
out of that.

If you see serious problems I've missed, please tell. Otherwise I'm
going to hack up a prototype and post it here...

Hugh Dickins

unread,
Dec 8, 2009, 6:43:29 AM12/8/09
to Al Viro, Al Viro, linux...@vger.kernel.org, torv...@linux-foundation.org, ak...@linux-foundation.org, linux-...@vger.kernel.org
On Tue, 8 Dec 2009, Al Viro wrote:
>
> Why do we want user_get_pages(), anyway? It's not that we lacked an
> easy way to do large arrays, especially since the use is purely sequential.
> Even a linked list of vmalloc'ed pages would do just fine (i.e. start with
> static array in bprm, keep the pointer to last filled entry + number of
> entries left before the next allocation; use the last pointer in array
> for finding the next page-sized chunk).
>
> What do we lose if we go that way? Inserting all these pages into mm
> at once shouldn't be slower. Memory overhead is not really an issue
> (one page per 511 or 1023 pages of argv). Am I missing something?

I think what you lose that way is swappability.

Since we're supporting unlimited args and env here, it is important
that those pages can belong to an mm, be discoverable by rmap, and
be swapped out if really necessary. Whereas I think you're proposing
an internal list of those pages, unknown to rmap, unswappable.

Of course, a page is pinned in core between get_user_pages() and
put_page(), but unless I've got it wrong, get_user_pages() is being
applied one by one to these pages, each unpinned as the next is pinned.

It is conceivable that it actually doesn't work in the way I'm
expecting: that although it's all designed to leave those pages
swappable, some mod here or there has interfered with that. But if
so, that would be a bug: the intention, and I believe it's important,
is that those pages are swappable.

I have a different reason for wanting to change how it's done:
it's the major user of non-atomic kmap() and its global kmap_lock,
and rather swamps other uses of kmap() (which have better use for
the cached virtual address). So I'd be happy with a better way
of doing it, but not at the cost of losing swappability.

Hugh

Hugh Dickins

unread,
Dec 8, 2009, 8:03:53 AM12/8/09
to Al Viro, Al Viro, linux...@vger.kernel.org, torv...@linux-foundation.org, ak...@linux-foundation.org, linux-...@vger.kernel.org
On Tue, 8 Dec 2009, Hugh Dickins wrote:
> On Tue, 8 Dec 2009, Al Viro wrote:
> >
> > Why do we want user_get_pages(), anyway? It's not that we lacked an
> > easy way to do large arrays, especially since the use is purely sequential.
> > Even a linked list of vmalloc'ed pages would do just fine (i.e. start with
> > static array in bprm, keep the pointer to last filled entry + number of
> > entries left before the next allocation; use the last pointer in array
> > for finding the next page-sized chunk).
> >
> > What do we lose if we go that way? Inserting all these pages into mm
> > at once shouldn't be slower. Memory overhead is not really an issue
> > (one page per 511 or 1023 pages of argv). Am I missing something?
>
> I think what you lose that way is swappability.
..

> I have a different reason for wanting to change how it's done:
> it's the major user of non-atomic kmap() and its global kmap_lock,
> and rather swamps other uses of kmap() (which have better use for
> the cached virtual address). So I'd be happy with a better way
> of doing it, but not at the cost of losing swappability.

Would it make sense to build up argv and env of execee on the execer's
user stack (below user's sp ("below" assuming topdown stack))?

That would impose some (unacceptable?) limits, and require some funny
code to migrate the pages over to the new mm later (instead of
relocating within the new mm as we do now).

But it would avoid get_user_pages and kmap altogether:
just straight copies in current's mm.

Hugh

David Miller

unread,
Dec 8, 2009, 4:08:19 PM12/8/09
to hugh.d...@tiscali.co.uk, vi...@zeniv.linux.org.uk, vi...@ftp.linux.org.uk, linux...@vger.kernel.org, torv...@linux-foundation.org, ak...@linux-foundation.org, linux-...@vger.kernel.org
From: Hugh Dickins <hugh.d...@tiscali.co.uk>
Date: Tue, 8 Dec 2009 13:03:30 +0000 (GMT)

> That would impose some (unacceptable?) limits, and require some funny
> code to migrate the pages over to the new mm later (instead of
> relocating within the new mm as we do now).

I think this approach would create new failure cases that don't exist
now. Whether that's acceptable or not is another issue.

The forced page table move, and TLB+cache flush that goes along with
that, for every single compat task we get now on the other hand is not
acceptable :-)

I also think this page table move overhead is worse than the
non-swapability added by Al's approach.

Al Viro

unread,
Dec 8, 2009, 5:06:57 PM12/8/09
to David Miller, hugh.d...@tiscali.co.uk, vi...@ftp.linux.org.uk, linux...@vger.kernel.org, torv...@linux-foundation.org, ak...@linux-foundation.org, linux-...@vger.kernel.org
On Tue, Dec 08, 2009 at 01:08:02PM -0800, David Miller wrote:
> From: Hugh Dickins <hugh.d...@tiscali.co.uk>
> Date: Tue, 8 Dec 2009 13:03:30 +0000 (GMT)
>
> > That would impose some (unacceptable?) limits, and require some funny
> > code to migrate the pages over to the new mm later (instead of
> > relocating within the new mm as we do now).
>
> I think this approach would create new failure cases that don't exist
> now. Whether that's acceptable or not is another issue.
>
> The forced page table move, and TLB+cache flush that goes along with
> that, for every single compat task we get now on the other hand is not
> acceptable :-)
>
> I also think this page table move overhead is worse than the
> non-swapability added by Al's approach.

We should be able to make them swappable - embed an inode into bprm, use
a _very_ trimmed-down analog of shmem.c to handle it, then, after switch
to new VM, swap what's needed in, steal it from that inode and shove resulting
anon pages into freshly created stack vma. At least assuming that I haven't
completely misunderstood Rik's answers to my questions, which is admittedly
quite possible ;-)

I'll try to do it that way and see what falls out...

Hugh Dickins

unread,
Dec 9, 2009, 6:44:21 AM12/9/09
to Al Viro, David Miller, Ollie Wild, Peter Zijlstra, Rik van Riel, vi...@ftp.linux.org.uk, linux...@vger.kernel.org, torv...@linux-foundation.org, ak...@linux-foundation.org, linux-...@vger.kernel.org
[ Peter, Ollie:
What started out with some nasty problems in mremap(), not checking
as mmap() does that it was expanding or moving into forbidden areas
of the address space, has now mostly morphed into a discussion of how
to enforce such checks when get_user_pages() has mm != current->mm: and
in particular, how to eliminate get_user_pages() on bprm->mm in exec().
Hence I've added you, with Rik, to the Cc list.]

On Tue, 8 Dec 2009, Al Viro wrote:

> On Tue, Dec 08, 2009 at 01:08:02PM -0800, David Miller wrote:
> > From: Hugh Dickins <hugh.d...@tiscali.co.uk>
> > Date: Tue, 8 Dec 2009 13:03:30 +0000 (GMT)
> >
> > > That would impose some (unacceptable?) limits, and require some funny
> > > code to migrate the pages over to the new mm later (instead of
> > > relocating within the new mm as we do now).
> >
> > I think this approach would create new failure cases that don't exist
> > now. Whether that's acceptable or not is another issue.

David: Yes, that's one of my fears too - I don't think
rlimits would pose any new problem, but building up the argv+env below
sp on the execer's userstack would be in danger of colliding with the
vma below if the space allowed to that userstack is too small. We can
say "sorry, you left too little space for your userstack", but it's
still a regression. My other big fear is this: that it's such a simple
and obvious way to do it, that it has probably been ruled out for very
good reasons in the past.

> >
> > The forced page table move, and TLB+cache flush that goes along with
> > that, for every single compat task we get now on the other hand is not
> > acceptable :-)

David: This seems a valid concern, but this is the first
time I've heard such a complaint. Perhaps I've just not noticed them;
but I do wonder if it's been noticed as a regression in practice, or
just causing alarm now that Al has drawn attention to how it works.

> >
> > I also think this page table move overhead is worse than the
> > non-swapability added by Al's approach.

David: I see your point, though it may be an issue on which the "main"
architectures win the day. My execer's userstack approach would have
the same overhead as at present, I think; no, worse, it would involve
that overhead in all cases. Hmm.

>
> We should be able to make them swappable - embed an inode into bprm, use
> a _very_ trimmed-down analog of shmem.c to handle it, then, after switch
> to new VM, swap what's needed in, steal it from that inode and shove resulting
> anon pages into freshly created stack vma. At least assuming that I haven't
> completely misunderstood Rik's answers to my questions, which is admittedly
> quite possible ;-)
>
> I'll try to do it that way and see what falls out...

.. my hair ;-)
I have to say, Dr Frankenstein, that this idea fills me with dread.

I'm not saying it's impossible, but the resulting creature sounds like
it's going to be special in several easily-buggy hard-to-maintain ways.

I think you already realize that shmem file pages (shared) live by
different rules from anonymous pages (COWed): they're both swappable,
but switching a group of pages from one to the other is going to be
weird new territory. (In fairness, my suggestion involves some
weird new territory too, but considerably less scary to me.)

I think you'd do better to drop the idea of swappability for the moment.
I don't like to do so at all, but I'd rather you came up with a clean
design without it first, and swappability be added a release later
if it can be got to work.

However, if you do drop swappability for the moment, what are you left
with? A reversion of commit b6a2fea39318e43fee84fa7b0b90d68bed92d2ba
"mm: variable length argument support", but putting the pages into a
linked list instead of a MAX_ARG_PAGES array. Well, that should be
very easy, but would it be adequate?

Hugh

Peter Zijlstra

unread,
Dec 9, 2009, 7:22:18 AM12/9/09
to Hugh Dickins, Al Viro, David Miller, Ollie Wild, Rik van Riel, vi...@ftp.linux.org.uk, linux...@vger.kernel.org, torv...@linux-foundation.org, ak...@linux-foundation.org, linux-...@vger.kernel.org
On Wed, 2009-12-09 at 11:43 +0000, Hugh Dickins wrote:

> On Tue, 8 Dec 2009, Al Viro wrote:
> > On Tue, Dec 08, 2009 at 01:08:02PM -0800, David Miller wrote:
> > > From: Hugh Dickins <hugh.d...@tiscali.co.uk>
> > > Date: Tue, 8 Dec 2009 13:03:30 +0000 (GMT)

> Would it make sense to build up argv and env of execee on the execer's


> user stack (below user's sp ("below" assuming topdown stack))?

> > >

> > > > That would impose some (unacceptable?) limits, and require some funny
> > > > code to migrate the pages over to the new mm later (instead of
> > > > relocating within the new mm as we do now).
> > >
> > > I think this approach would create new failure cases that don't exist
> > > now. Whether that's acceptable or not is another issue.
>
> David: Yes, that's one of my fears too - I don't think
> rlimits would pose any new problem, but building up the argv+env below
> sp on the execer's userstack would be in danger of colliding with the
> vma below if the space allowed to that userstack is too small. We can
> say "sorry, you left too little space for your userstack", but it's
> still a regression. My other big fear is this: that it's such a simple
> and obvious way to do it, that it has probably been ruled out for very
> good reasons in the past.

Vague memories, but here goes..

/me ponders.. doesn't the binfmt engine cruft need the args in place in
order to execute?

That is, IIRC the problem is that you need to have the argc/env in place
for the binfmt engine thing, and need to have ran the binfmt engine
thing before you know the personality.

As to your idea, if that were feasible we could do without the copy and
simply steal the pages directly from the old mm.

Hugh Dickins

unread,
Dec 9, 2009, 8:12:57 AM12/9/09
to Peter Zijlstra, Al Viro, David Miller, Ollie Wild, Rik van Riel, vi...@ftp.linux.org.uk, linux...@vger.kernel.org, torv...@linux-foundation.org, ak...@linux-foundation.org, linux-...@vger.kernel.org
On Wed, 9 Dec 2009, Peter Zijlstra wrote:
> On Wed, 2009-12-09 at 11:43 +0000, Hugh Dickins wrote:
> >
> > David: Yes, that's one of my fears too - I don't think
> > rlimits would pose any new problem, but building up the argv+env below
> > sp on the execer's userstack would be in danger of colliding with the
> > vma below if the space allowed to that userstack is too small. We can
> > say "sorry, you left too little space for your userstack", but it's
> > still a regression. My other big fear is this: that it's such a simple
> > and obvious way to do it, that it has probably been ruled out for very
> > good reasons in the past.
>
> Vague memories, but here goes..

Thanks!

>
> /me ponders.. doesn't the binfmt engine cruft need the args in place in
> order to execute?

Hardly looked, Al will be more up to date with all the grisly details.

The "binfmt engine cruft" being search_binary_handler()? I think the
args have to be "ready to go" before that, but that's different from
the new mm actually being used as an mm before that. It used not to
be used early, but from 2.6.23 on it is used early, via get_user_pages.

>
> That is, IIRC the problem is that you need to have the argc/env in place
> for the binfmt engine thing, and need to have ran the binfmt engine
> thing before you know the personality.

It is a problem that personality is discovered late in the sequence,
and that is a considerable part of what Al is up against.

>
> As to your idea, if that were feasible we could do without the copy and
> simply steal the pages directly from the old mm.

Perhaps, but I think that would lead to a gradual accumulation
of more and more pages pinned in memory by scattered references.

I Cc'ed you really because I wasn't much involved in the variable
length arg discussions, and don't remember how important swappability
was viewed at the time. It is a significant feature of what you and
Ollie ended up with, so I'm guessing it was then viewed as essential.
That would be my view.

But now it's suggested that the TLB+cache effects of using an mm there
are counter-productive, and better to forget swappability: well, I want
to keep it, and Al is making a brave effort to hold on to it, but I'm
wary of the weirdness involved.

Hugh

Al Viro

unread,
Dec 9, 2009, 8:24:59 AM12/9/09
to Peter Zijlstra, Hugh Dickins, David Miller, Ollie Wild, Rik van Riel, vi...@ftp.linux.org.uk, linux...@vger.kernel.org, torv...@linux-foundation.org, ak...@linux-foundation.org, linux-...@vger.kernel.org
On Wed, Dec 09, 2009 at 01:21:42PM +0100, Peter Zijlstra wrote:

> /me ponders.. doesn't the binfmt engine cruft need the args in place in
> order to execute?

Needs to modify them in case #! and a few other. But that's generally
a matter of "drop argv[0], push several arguments in its place"

> That is, IIRC the problem is that you need to have the argc/env in place
> for the binfmt engine thing, and need to have ran the binfmt engine
> thing before you know the personality.
>
> As to your idea, if that were feasible we could do without the copy and
> simply steal the pages directly from the old mm.

*raised brows*
Old mm may bloody well be still shared, so I'd be vary of trying that -
there's a lot of nasty surprises we could get on that way.

Peter Zijlstra

unread,
Dec 9, 2009, 8:38:01 AM12/9/09
to Hugh Dickins, Al Viro, David Miller, Ollie Wild, Rik van Riel, vi...@ftp.linux.org.uk, linux...@vger.kernel.org, torv...@linux-foundation.org, ak...@linux-foundation.org, linux-...@vger.kernel.org
On Wed, 2009-12-09 at 13:12 +0000, Hugh Dickins wrote:
> On Wed, 9 Dec 2009, Peter Zijlstra wrote:

> > /me ponders.. doesn't the binfmt engine cruft need the args in place in
> > order to execute?
>
> Hardly looked, Al will be more up to date with all the grisly details.
>
> The "binfmt engine cruft" being search_binary_handler()? I think the
> args have to be "ready to go" before that, but that's different from
> the new mm actually being used as an mm before that. It used not to
> be used early, but from 2.6.23 on it is used early, via get_user_pages.

Yeah, explicitly the fn() call in there which will mostly land you
load_elf_binary(). After that I loose track.

> > That is, IIRC the problem is that you need to have the argc/env in place
> > for the binfmt engine thing, and need to have ran the binfmt engine
> > thing before you know the personality.
>
> It is a problem that personality is discovered late in the sequence,
> and that is a considerable part of what Al is up against.
>
> >
> > As to your idea, if that were feasible we could do without the copy and
> > simply steal the pages directly from the old mm.
>
> Perhaps, but I think that would lead to a gradual accumulation
> of more and more pages pinned in memory by scattered references.

Well, IF the binfmt stuff can deal with the arrays being in the old mm
then it doesn't need to pin them I think, but I really don't know how
all this binfmt stuff works.

Reading fs/binfmt_elf.c:load_elf_binary() it looks like there might be a
spot where the personality is know and we still have the old mm around,
maybe we can hook in there -- we'd need to visit all binfmt though..

If we can make the binfmt stuff pass the correct location to
flush_old_exec() we could do the copy there.

> I Cc'ed you really because I wasn't much involved in the variable
> length arg discussions, and don't remember how important swappability
> was viewed at the time. It is a significant feature of what you and
> Ollie ended up with, so I'm guessing it was then viewed as essential.
> That would be my view.
>
> But now it's suggested that the TLB+cache effects of using an mm there
> are counter-productive, and better to forget swappability: well, I want
> to keep it, and Al is making a brave effort to hold on to it, but I'm
> wary of the weirdness involved.

Right, the swappability is key, without that you can easily run the
kernel into the ground if you don't have a limit on the argv/env arrays.
And not having a limit was the whole point.

Peter Zijlstra

unread,
Dec 9, 2009, 8:39:50 AM12/9/09
to Al Viro, Hugh Dickins, David Miller, Ollie Wild, Rik van Riel, vi...@ftp.linux.org.uk, linux...@vger.kernel.org, torv...@linux-foundation.org, ak...@linux-foundation.org, linux-...@vger.kernel.org
On Wed, 2009-12-09 at 13:24 +0000, Al Viro wrote:

> > That is, IIRC the problem is that you need to have the argc/env in place
> > for the binfmt engine thing, and need to have ran the binfmt engine
> > thing before you know the personality.
> >
> > As to your idea, if that were feasible we could do without the copy and
> > simply steal the pages directly from the old mm.
>
> *raised brows*
> Old mm may bloody well be still shared, so I'd be vary of trying that -
> there's a lot of nasty surprises we could get on that way.

Good point, but I think we might be able to do the copy from old to new
in flush_old_exec() if we can get the binfmt things to pass the right
address along (or set it in the brpm struct). It looks like it has clue
about personality at that point.

Al Viro

unread,
Dec 9, 2009, 8:46:52 AM12/9/09
to Hugh Dickins, David Miller, Ollie Wild, Peter Zijlstra, Rik van Riel, vi...@ftp.linux.org.uk, linux...@vger.kernel.org, torv...@linux-foundation.org, ak...@linux-foundation.org, linux-...@vger.kernel.org
On Wed, Dec 09, 2009 at 11:43:57AM +0000, Hugh Dickins wrote:
> ... my hair ;-)

> I have to say, Dr Frankenstein, that this idea fills me with dread.
>
> I'm not saying it's impossible, but the resulting creature sounds like
> it's going to be special in several easily-buggy hard-to-maintain ways.
>
> I think you already realize that shmem file pages (shared) live by
> different rules from anonymous pages (COWed): they're both swappable,
> but switching a group of pages from one to the other is going to be
> weird new territory. (In fairness, my suggestion involves some
> weird new territory too, but considerably less scary to me.)

Umm... Note that these guys are considerably simpler than shmem; _nothing_
will have them mapped anywhere until after that eviction and nothing will
modify their address_space in any way (no truncation, etc.).

And no, I'm not particulary happy about the picture - the sight of, ahem,
locking hierarcy rules in mm/filemap.c alone is a cause for dread. It had
been a while since I'd done any serious RTFS on mm/*, so...

Anyway, it's obviously *not* -rc1 fodder at the moment. So what I'm going
to do is to leave that one as it is in mainline, push the rest of mmap/mremap
stuff to Linus later today and get to VFS and audit queues. After the -rc1,
OTOH...

Hugh Dickins

unread,
Dec 9, 2009, 9:37:11 AM12/9/09
to Al Viro, David Miller, Ollie Wild, Peter Zijlstra, Rik van Riel, vi...@ftp.linux.org.uk, linux...@vger.kernel.org, torv...@linux-foundation.org, ak...@linux-foundation.org, linux-...@vger.kernel.org
On Wed, 9 Dec 2009, Al Viro wrote:
>
> Anyway, it's obviously *not* -rc1 fodder at the moment. So what I'm going
> to do is to leave that one as it is in mainline, push the rest of mmap/mremap
> stuff to Linus later today

Phew, yes, agreed!

Hugh

Linus Torvalds

unread,
Dec 9, 2009, 10:13:31 AM12/9/09
to Hugh Dickins, Al Viro, David Miller, Ollie Wild, Peter Zijlstra, Rik van Riel, vi...@ftp.linux.org.uk, linux...@vger.kernel.org, ak...@linux-foundation.org, linux-...@vger.kernel.org

On Wed, 9 Dec 2009, Hugh Dickins wrote:
>
> David: Yes, that's one of my fears too - I don't think
> rlimits would pose any new problem, but building up the argv+env below
> sp on the execer's userstack would be in danger of colliding with the
> vma below if the space allowed to that userstack is too small.

For threads, the stack is usually pretty small, so yeah, it would severely
limit threads doing execve.

That's supposed to be unusual and is frowned upon, but we've spent a fair
amount of effort on making sure it works, so I think it would be
unacceptable.

And I really don't see what the problem with our current approach is. We
have a nice source array - the originating VM, and a nice destination
array - the resulting VM, and trying to exchange that nice conceptual
setup with another setup with _different_ problems that isn't even as
clean sounds like a major mistake to me.

Linus

Reply all
Reply to author
Forward
0 new messages