[PATCH v5] add MAP_UNLOCKED mmap flag

Gleb Natapov

Jan 13, 2010, 4:40:02 AM

If an application calls mlockall(MCL_FUTURE), it is no longer possible to
mmap a file bigger than main memory or to allocate a big area of anonymous
memory in a thread-safe manner. Sometimes it is desirable to lock
everything related to program execution into memory, but still be able to
mmap a big file or allocate a huge amount of memory and allow the OS to
swap them on demand. MAP_UNLOCKED allows that.

Signed-off-by: Gleb Natapov <gl...@redhat.com>
---

I get reports that people find this useful, so resending.

v1->v2
- adding new flag to all archs
- fixing typo
v2->v3
- one more typo fix
v3->v4
- return error if MAP_LOCKED | MAP_UNLOCKED is specified
v4->v5
- rebase to latest head

diff --git a/arch/alpha/include/asm/mman.h b/arch/alpha/include/asm/mman.h
index 99c56d4..cfc51ac 100644
--- a/arch/alpha/include/asm/mman.h
+++ b/arch/alpha/include/asm/mman.h
@@ -30,6 +30,7 @@
#define MAP_NONBLOCK 0x40000 /* do not block on IO */
#define MAP_STACK 0x80000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x100000 /* create a huge page mapping */
+#define MAP_UNLOCKED 0x200000 /* force page unlocking */

#define MS_ASYNC 1 /* sync memory asynchronously */
#define MS_SYNC 2 /* synchronous memory sync */
diff --git a/arch/mips/include/asm/mman.h b/arch/mips/include/asm/mman.h
index c892bfb..3e4d108 100644
--- a/arch/mips/include/asm/mman.h
+++ b/arch/mips/include/asm/mman.h
@@ -48,6 +48,7 @@
#define MAP_NONBLOCK 0x20000 /* do not block on IO */
#define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x80000 /* create a huge page mapping */
+#define MAP_UNLOCKED 0x100000 /* force page unlocking */

/*
* Flags for msync
diff --git a/arch/parisc/include/asm/mman.h b/arch/parisc/include/asm/mman.h
index 9749c8a..4e8b9bf 100644
--- a/arch/parisc/include/asm/mman.h
+++ b/arch/parisc/include/asm/mman.h
@@ -24,6 +24,7 @@
#define MAP_NONBLOCK 0x20000 /* do not block on IO */
#define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x80000 /* create a huge page mapping */
+#define MAP_UNLOCKED 0x100000 /* force page unlocking */

#define MS_SYNC 1 /* synchronous memory sync */
#define MS_ASYNC 2 /* sync memory asynchronously */
diff --git a/arch/powerpc/include/asm/mman.h b/arch/powerpc/include/asm/mman.h
index d4a7f64..7d33f01 100644
--- a/arch/powerpc/include/asm/mman.h
+++ b/arch/powerpc/include/asm/mman.h
@@ -27,6 +27,7 @@
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
#define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x40000 /* create a huge page mapping */
+#define MAP_UNLOCKED 0x80000 /* force page unlocking */

#ifdef __KERNEL__
#ifdef CONFIG_PPC64
diff --git a/arch/sparc/include/asm/mman.h b/arch/sparc/include/asm/mman.h
index c3029ad..f80d203 100644
--- a/arch/sparc/include/asm/mman.h
+++ b/arch/sparc/include/asm/mman.h
@@ -22,6 +22,7 @@
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
#define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x40000 /* create a huge page mapping */
+#define MAP_UNLOCKED 0x80000 /* force page unlocking */

#ifdef __KERNEL__
#ifndef __ASSEMBLY__
diff --git a/arch/xtensa/include/asm/mman.h b/arch/xtensa/include/asm/mman.h
index fca4db4..c62bcd8 100644
--- a/arch/xtensa/include/asm/mman.h
+++ b/arch/xtensa/include/asm/mman.h
@@ -55,6 +55,7 @@
#define MAP_NONBLOCK 0x20000 /* do not block on IO */
#define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x80000 /* create a huge page mapping */
+#define MAP_UNLOCKED 0x100000 /* force page unlocking */

/*
* Flags for msync
diff --git a/include/asm-generic/mman.h b/include/asm-generic/mman.h
index 32c8bd6..59e0f29 100644
--- a/include/asm-generic/mman.h
+++ b/include/asm-generic/mman.h
@@ -12,6 +12,7 @@
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
#define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x40000 /* create a huge page mapping */
+#define MAP_UNLOCKED 0x80000 /* force page unlocking */

#define MCL_CURRENT 1 /* lock all current mappings */
#define MCL_FUTURE 2 /* lock all future mappings */
diff --git a/mm/mmap.c b/mm/mmap.c
index ee22989..4bda220 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -962,6 +962,9 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
 		if (!can_do_mlock())
 			return -EPERM;

+	if (flags & MAP_UNLOCKED)
+		vm_flags &= ~VM_LOCKED;
+
 	/* mlock MCL_FUTURE? */
 	if (vm_flags & VM_LOCKED) {
 		unsigned long locked, lock_limit;
@@ -1050,7 +1053,10 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
 	struct file *file = NULL;
 	unsigned long retval = -EBADF;

-	if (!(flags & MAP_ANONYMOUS)) {
+	if (unlikely((flags & (MAP_LOCKED | MAP_UNLOCKED)) ==
+				(MAP_LOCKED | MAP_UNLOCKED))) {
+		return -EINVAL;
+	} else if (!(flags & MAP_ANONYMOUS)) {
 		if (unlikely(flags & MAP_HUGETLB))
 			return -EINVAL;
 		file = fget(fd);
--
Gleb.
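
The flag handling in the diff above can be modelled in a few lines of
userspace C. This is only a sketch: the SK_* names and their bit values are
made up for illustration and are not the kernel's MAP_* or VM_* constants.
It mirrors the two checks the patch adds, namely rejecting MAP_LOCKED
combined with MAP_UNLOCKED, and clearing the locked state when MAP_UNLOCKED
is given.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the flag bits; illustrative values only,
 * not the real MAP_* constants from any architecture. */
enum {
	SK_MAP_LOCKED   = 1 << 0,
	SK_MAP_UNLOCKED = 1 << 1,
};

/* Mirrors the check added to mmap_pgoff: asking for both is an error. */
static int sk_check_flags(unsigned int flags)
{
	unsigned int both = SK_MAP_LOCKED | SK_MAP_UNLOCKED;

	return ((flags & both) == both) ? -EINVAL : 0;
}

/* Decides whether a new mapping ends up locked. mcl_future models the
 * process-wide default set by mlockall(MCL_FUTURE). */
static bool sk_mapping_locked(bool mcl_future, unsigned int flags)
{
	bool locked = mcl_future || (flags & SK_MAP_LOCKED);

	if (flags & SK_MAP_UNLOCKED)
		locked = false;	/* the per-call override from the patch */
	return locked;
}
```

With the process-wide default on, sk_mapping_locked(true, 0) is true while
sk_mapping_locked(true, SK_MAP_UNLOCKED) is false, which is the override
the patch description asks for.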

Américo Wang

Jan 13, 2010, 9:40:02 AM

On Wed, Jan 13, 2010 at 11:31:19AM +0200, Gleb Natapov wrote:
>If application does mlockall(MCL_FUTURE) it is no longer possible to mmap
>file bigger than main memory or allocate big area of anonymous memory
>in a thread safe manner. Sometimes it is desirable to lock everything
>related to program execution into memory, but still be able to mmap
>big file or allocate huge amount of memory and allow OS to swap them on
>demand. MAP_UNLOCKED allows to do that.
>
>Signed-off-by: Gleb Natapov <gl...@redhat.com>

Thanks for continuing to work on it.

This version looks fine to me.

Acked-by: WANG Cong <xiyou.w...@gmail.com>

--
Live like a child, think like the god.

Chris Wright

Jan 13, 2010, 12:40:02 PM

* Gleb Natapov (gl...@redhat.com) wrote:
> If application does mlockall(MCL_FUTURE) it is no longer possible to mmap
> file bigger than main memory or allocate big area of anonymous memory
> in a thread safe manner. Sometimes it is desirable to lock everything
> related to program execution into memory, but still be able to mmap
> big file or allocate huge amount of memory and allow OS to swap them on
> demand. MAP_UNLOCKED allows to do that.
>
> Signed-off-by: Gleb Natapov <gl...@redhat.com>

Looks good to me.

Acked-by: Chris Wright <chr...@sous-sol.org>

thanks,
-chris

KOSAKI Motohiro

Jan 13, 2010, 7:40:02 PM

> If application does mlockall(MCL_FUTURE) it is no longer possible to mmap
> file bigger than main memory or allocate big area of anonymous memory
> in a thread safe manner. Sometimes it is desirable to lock everything
> related to program execution into memory, but still be able to mmap
> big file or allocate huge amount of memory and allow OS to swap them on
> demand. MAP_UNLOCKED allows to do that.
>
> Signed-off-by: Gleb Natapov <gl...@redhat.com>
> ---
>
> I get reports that people find this useful, so resending.

This description is still wrong. It doesn't describe why this patch is useful.

Gleb Natapov

Jan 14, 2010, 2:00:01 AM

On Thu, Jan 14, 2010 at 09:31:03AM +0900, KOSAKI Motohiro wrote:
> > If application does mlockall(MCL_FUTURE) it is no longer possible to mmap
> > file bigger than main memory or allocate big area of anonymous memory
> > in a thread safe manner. Sometimes it is desirable to lock everything
> > related to program execution into memory, but still be able to mmap
> > big file or allocate huge amount of memory and allow OS to swap them on
> > demand. MAP_UNLOCKED allows to do that.
> >
> > Signed-off-by: Gleb Natapov <gl...@redhat.com>
> > ---
> >
> > I get reports that people find this useful, so resending.
>
> This description is still wrong. It doesn't describe why this patch is useful.
>

I think the text above describes the feature it adds and its use
case quite well. Can you elaborate on what is missing in your opinion,
or suggest alternative text please?

--
Gleb.

KOSAKI Motohiro

Jan 14, 2010, 2:10:01 AM

> On Thu, Jan 14, 2010 at 09:31:03AM +0900, KOSAKI Motohiro wrote:
> > > If application does mlockall(MCL_FUTURE) it is no longer possible to mmap
> > > file bigger than main memory or allocate big area of anonymous memory
> > > in a thread safe manner. Sometimes it is desirable to lock everything
> > > related to program execution into memory, but still be able to mmap
> > > big file or allocate huge amount of memory and allow OS to swap them on
> > > demand. MAP_UNLOCKED allows to do that.
> > >
> > > Signed-off-by: Gleb Natapov <gl...@redhat.com>
> > > ---
> > >
> > > I get reports that people find this useful, so resending.
> >
> > This description is still wrong. It doesn't describe why this patch is useful.
> >
> I think the text above describes the feature it adds and its use
> case quite well. Can you elaborate what is missing in your opinion,
> or suggest alternative text please?

My point is that introducing new mmap flags needs a strong and clear use case.
Every patch should have a good benefit/cost balance. The code can describe the
cost, but the benefit can only be explained by the patch description.

I don't think this poor description shows a benefit bigger than the cost.
You should explain why this patch is useful and not just a pretty toy.

Gleb Natapov

Jan 14, 2010, 2:30:02 AM

On Thu, Jan 14, 2010 at 04:02:42PM +0900, KOSAKI Motohiro wrote:
> > On Thu, Jan 14, 2010 at 09:31:03AM +0900, KOSAKI Motohiro wrote:
> > > > If application does mlockall(MCL_FUTURE) it is no longer possible to mmap
> > > > file bigger than main memory or allocate big area of anonymous memory
> > > > in a thread safe manner. Sometimes it is desirable to lock everything
> > > > related to program execution into memory, but still be able to mmap
> > > > big file or allocate huge amount of memory and allow OS to swap them on
> > > > demand. MAP_UNLOCKED allows to do that.
> > > >
> > > > Signed-off-by: Gleb Natapov <gl...@redhat.com>
> > > > ---
> > > >
> > > > I get reports that people find this useful, so resending.
> > >
> > > This description is still wrong. It doesn't describe why this patch is useful.
> > >
> > I think the text above describes the feature it adds and its use
> > case quite well. Can you elaborate what is missing in your opinion,
> > or suggest alternative text please?
>
> My point is, introducing mmap new flags need strong and clearly use-case.
> All patch should have good benefit/cost balance. the code can describe the cost,
> but the benefit can be only explained by the patch description.
>
> I don't think this poor description explained bit benefit rather than cost.
> you should explain why this patch is useful and not just pretty toy.
>

The benefit is that with this patch I can lock all of my application in
memory except some very big memory areas. My use case is that I want to
run a virtual machine in such a way that everything related to the machine
emulator is locked into memory, but the guest address space can be
swapped out at will. The guest address space is so huge that it is not
possible to allocate it locked and then unlock it. I was very surprised
that the current Linux API has no way to do this, hence this patch. It may
look like a pretty toy to you until some day you need this and have no way
to do it.

--
Gleb.

KOSAKI Motohiro

Jan 14, 2010, 2:40:02 AM

Hmm..
Your answer didn't match what I wanted.
A few additional questions.

- Why don't you change your application? That seems a more natural way than a
kernel change.
- Why do you want your virtual machine to use mlockall? AFAIK, the majority of
current virtual machines don't.
- If this feature is added, can the average distro user get any benefit?

I mean, many application developers want to add their specific feature
to the kernel, but if we allow that without limit, the major syscalls will
soon become a trash box of pretty toy features.

Gleb Natapov

Jan 14, 2010, 3:10:02 AM

Then I don't get what you want.

> few additional questions.
>
> - Why don't you change your application? It seems natural way than kernel change.

There is no way to change my application and achieve what I've described
in a multithreaded app.

> - Why do you want your virtual machine have mlockall? AFAIK, current majority
> virtual machine doesn't.

It is absolutely irrelevant to this patch, but since you ask: I
want to measure the cost of swapping out guest memory.

> - If this feature added, average distro user can get any benefit?
>

?! Is this some kind of new measure? There are plenty of much more
invasive features that don't bring benefits to an average distro user.
This feature can bring benefit to embedded/RT developers.

> I mean, many application developrs want to add their specific feature
> into kernel. but if we allow it unlimitedly, major syscall become
> the trushbox of pretty toy feature soon.
>

And if an application developer wants to extend the kernel so that it
becomes possible to do something that was not possible before, why is
this a bad thing? I would agree with you if there were a userspace
solution for my problem, but there is none. The mmap interface is
currently asymmetric with regard to mlock: there is MAP_LOCKED, but no
MAP_UNLOCKED. Why is MAP_LOCKED useful then?

--
Gleb.

KOSAKI Motohiro

Jan 14, 2010, 3:20:02 AM

> > Hmm..
> > Your answer didn't match I wanted.
> Then I don't get what you want.

I want to know the benefit of the patch for patch reviewing.

> > few additional questions.
> >
> > - Why don't you change your application? It seems natural way than kernel change.
> There is no way to change my application and achieve what I've described
> in a multithreaded app.

Then we don't recommend using mlockall(). I am not asking for your conclusion,
which is not objective; I am asking why you reached that conclusion.

> > - Why do you want your virtual machine have mlockall? AFAIK, current majority
> > virtual machine doesn't.
> It is absolutely irrelevant for that patch, but just because you ask I
> want to measure the cost of swapping out of a guest memory.

No. If you stop using mlockall, the issue vanishes.

> > - If this feature added, average distro user can get any benefit?
> >
> ?! Is this some kind of new measure? There are plenty of much more
> invasive features that don't bring benefits to an average distro user.
> This feature can bring benefit to embedded/RT developers.

I mean, who gets the benefit?

> > I mean, many application developrs want to add their specific feature
> > into kernel. but if we allow it unlimitedly, major syscall become
> > the trushbox of pretty toy feature soon.
> >
> And if application developer wants to extend kernel in a way that it
> will be possible to do something that was not possible before why is
> this a bad thing? I would agree with you if for my problem was userspace
> solution, but there is none. The mmap interface is asymmetric in regards
> to mlock currently. There is MAP_LOCKED, but no MAP_UNLOCKED. Why
> MAP_LOCKED is useful then?

Why? Because this is the formal LKML review process. I'm reviewing your
patch for YOU.

If there is no objective reason, I don't want to continue reviewing.

Gleb Natapov

Jan 14, 2010, 5:30:01 AM

On Thu, Jan 14, 2010 at 05:17:36PM +0900, KOSAKI Motohiro wrote:
> > > Hmm..
> > > Your answer didn't match I wanted.
> > Then I don't get what you want.
>
> I want to know the benefit of the patch for patch reviewing.
>
>
> > > few additional questions.
> > >
> > > - Why don't you change your application? It seems natural way than kernel change.
> > There is no way to change my application and achieve what I've described
> > in a multithreaded app.
>
> Then, we don't recommend to use mlockall(). I don't hope to hear your conclusion,
> it is not objectivization. I hope to hear why you reached such conclusion.

So what do you recommend? Don't just wave your hand at me saying "These
aren't the droids you're looking for". I explained what I need to
achieve; you seem to be trying to convince me I don't really need it
without proposing any alternatives. This is not a constructive discussion.

>
>
> > > - Why do you want your virtual machine have mlockall? AFAIK, current majority
> > > virtual machine doesn't.
> > It is absolutely irrelevant for that patch, but just because you ask I
> > want to measure the cost of swapping out of a guest memory.
>
> No. if you stop to use mlockall, the issue is vanished.
>

And then the emulator parts will be swapped out too, which is not what I want.

>
> > > - If this feature added, average distro user can get any benefit?
> > >
> > ?! Is this some kind of new measure? There are plenty of much more
> > invasive features that don't bring benefits to an average distro user.
> > This feature can bring benefit to embedded/RT developers.
>
> I mean who get benifit?

Someone who wants to mlock all application memory, but wants to be able
to mmap a big file for reading and understands that access to that file
can cause a major fault.

>
>
> > > I mean, many application developrs want to add their specific feature
> > > into kernel. but if we allow it unlimitedly, major syscall become
> > > the trushbox of pretty toy feature soon.
> > >
> > And if application developer wants to extend kernel in a way that it
> > will be possible to do something that was not possible before why is
> > this a bad thing? I would agree with you if for my problem was userspace
> > solution, but there is none. The mmap interface is asymmetric in regards
> > to mlock currently. There is MAP_LOCKED, but no MAP_UNLOCKED. Why
> > MAP_LOCKED is useful then?
>
> Why? Because this is formal LKML reviewing process. I'm reviewing your
> patch for YOU.
>

I appreciate that, but unfortunately it seems that you are trying to dismiss
my arguments on the basis that _you_ don't find that useful.

> If there is no objective reason, I don't want to continue reviewing.
>
--

Gleb.

Andrew C. Morrow

Jan 14, 2010, 2:40:02 PM

On Thu, Jan 14, 2010 at 3:17 AM, KOSAKI Motohiro
<kosaki....@jp.fujitsu.com> wrote:
>> > Hmm..
>> > Your answer didn't match I wanted.
>> Then I don't get what you want.
>
> I want to know the benefit of the patch for patch reviewing.
>

The benefit of the patch is that it makes it possible for an
application which has previously called mlockall(MCL_FUTURE) to
selectively exempt new memory mappings from memory locking, on a
per-mmap-call basis. As was pointed out earlier, there is currently no
thread-safe way for an application to do this. The earlier proposed
workaround of toggling MCL_FUTURE around calls to mmap is racy in a
multi-threaded context. Other threads may manipulate the address space
during the window where MCL_FUTURE is off, subverting the programmer's
intended memory locking semantics.

The ability to exempt specific memory mappings from memory locking is
necessary when the region to be mapped is larger than physical memory.
In such cases a call to mmap the region cannot succeed, unless
MAP_UNLOCKED is available.
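
The race described above can be illustrated with a toy model. This is a
sketch only, not kernel code and not real syscalls: struct mm_model, the
function names, and the single boolean standing in for the process-wide
MCL_FUTURE state are all assumptions made for the example. It replays one
bad interleaving by hand, with thread B mapping memory while thread A has
the switch turned off.

```c
#include <stdbool.h>

/* Toy model: one process-wide switch standing in for the state that
 * mlockall(MCL_FUTURE) sets. Hypothetical, for illustration only. */
struct mm_model {
	bool lock_future;
};

/* Model of mmap(): a new mapping is locked if the switch is on, unless
 * the caller passes the per-call "unlocked" request. */
static bool model_mmap_locked(const struct mm_model *mm, bool want_unlocked)
{
	return mm->lock_future && !want_unlocked;
}

/* Thread A toggles the switch around its big mapping; thread B maps
 * memory inside that window. Returns whether B's mapping got locked. */
static bool b_locked_with_toggle(void)
{
	struct mm_model mm = { .lock_future = true };
	bool b_locked;

	mm.lock_future = false;				/* A: switch off */
	(void)model_mmap_locked(&mm, false);		/* A: big mapping */
	b_locked = model_mmap_locked(&mm, false);	/* B: inside the window */
	mm.lock_future = true;				/* A: switch back on */
	return b_locked;
}

/* Same interleaving, but A uses the per-call request and never touches
 * the process-wide switch. */
static bool b_locked_with_flag(void)
{
	struct mm_model mm = { .lock_future = true };

	(void)model_mmap_locked(&mm, true);		/* A: big mapping */
	return model_mmap_locked(&mm, false);		/* B: still locked */
}
```

In the toggle variant B's mapping comes out unlocked even though B never
asked for that; with a per-call request there is no window for B to fall
into.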

>
>> > few additional questions.
>> >
>> > - Why don't you change your application? It seems natural way than kernel change.
>> There is no way to change my application and achieve what I've described
>> in a multithreaded app.
>
> Then, we don't recommend to use mlockall(). I don't hope to hear your conclusion,
> it is not objectivization. I hope to hear why you reached such conclusion.
>

I agree that mlockall is a big hammer and should be avoided in most
cases, but there are situations where it is exactly what is needed. In
Gleb's instance, it sounds like he is doing some finicky performance
measurement and major page faults skew his results. In my case, I have
a realtime process where the measured latency impact of major page
faults is unacceptable. In both of these cases, mlockall is a
reasonable approach to eliminating major faults.

However, Gleb and I have independently found ourselves unable to use
mlockall because we also need to create a very large memory mapping
(for which we don't care about major faults). The proposed
MAP_UNLOCKED flag would allow us to override MCL_FUTURE for that one
mapping.

>
>> > - Why do you want your virtual machine have mlockall? AFAIK, current majority
>> > virtual machine doesn't.
>> It is absolutely irrelevant for that patch, but just because you ask I
>> want to measure the cost of swapping out of a guest memory.
>
> No. if you stop to use mlockall, the issue is vanished.
>

And other issues arise. Gleb described a situation where the use of
mlockall is justified, identified an issue which prevents its use, and
provided a patch which resolves that issue. Why are you focusing on
the validity of using mlockall?

>
>> > - If this feature added, average distro user can get any benefit?
>> >
>> ?! Is this some kind of new measure? There are plenty of much more
>> invasive features that don't bring benefits to an average distro user.
>> This feature can bring benefit to embedded/RT developers.
>
> I mean who get benifit?
>
>
>> > I mean, many application developrs want to add their specific feature
>> > into kernel. but if we allow it unlimitedly, major syscall become
>> > the trushbox of pretty toy feature soon.
>> >
>> And if application developer wants to extend kernel in a way that it
>> will be possible to do something that was not possible before why is
>> this a bad thing? I would agree with you if for my problem was userspace
>> solution, but there is none. The mmap interface is asymmetric in regards
>> to mlock currently. There is MAP_LOCKED, but no MAP_UNLOCKED. Why
>> MAP_LOCKED is useful then?
>
> Why? Because this is formal LKML reviewing process. I'm reviewing your
> patch for YOU.
>
> If there is no objective reason, I don't want to continue reviewing.
>

There is an objective reason: the current interaction between
mlockall(MCL_FUTURE) and mmap has a deficiency. In 'normal' mode,
without MCL_FUTURE in force, the default is that new memory mappings
are not locked, but mmap provides MAP_LOCKED specifically to override
that default. However, with MCL_FUTURE toggled to on, there is no
analogous way to tell mmap to override the default. The proposed
MAP_UNLOCKED flag would resolve this deficiency.

Andrew

KOSAKI Motohiro

Jan 17, 2010, 10:30:02 PM

Thank you very much, Andrew!

Your explanation helped me much more than the original patch description. OK,
MAP_UNLOCKED has at least two users (you and Gleb), and your explanation seems
to make sense.

So, if Gleb resends this patch with a rewritten description, I will probably
add my Reviewed-by tag to it.

Thanks.

Gleb Natapov

Jan 18, 2010, 8:40:02 AM

The current interaction between mlockall(MCL_FUTURE) and mmap has a
deficiency. In 'normal' mode, without MCL_FUTURE in force, the default
is that new memory mappings are not locked, but mmap provides MAP_LOCKED
specifically to override that default. However, with MCL_FUTURE toggled
to on, there is no analogous way to tell mmap to override the default. The
proposed MAP_UNLOCKED flag would resolve this deficiency.

The benefit of the patch is that it makes it possible for an application
which has previously called mlockall(MCL_FUTURE) to selectively exempt
new memory mappings from memory locking, on a per-mmap-call basis. There
is currently no thread-safe way for an application to do this as
toggling MCL_FUTURE around calls to mmap is racy in a multi-threaded
context. Other threads may manipulate the address space during the
window where MCL_FUTURE is off, subverting the programmer's intended
memory locking semantics.

The ability to exempt specific memory mappings from memory locking is
necessary when the region to be mapped is larger than physical memory.
In such cases a call to mmap the region cannot succeed, unless
MAP_UNLOCKED is available.

Acked-by: WANG Cong <xiyou.w...@gmail.com>
Acked-by: Chris Wright <chr...@sous-sol.org>

Signed-off-by: Gleb Natapov <gl...@redhat.com>
---

I kept the acks since the patch is exactly the same; only the commit
message has changed.
The commit message is mostly copied from Andrew C. Morrow's email. I hope
it is OK now. Thank you Andrew :)

v1->v2
- adding new flag to all archs
- fixing typo
v2->v3
- one more typo fix
v3->v4
- return error if MAP_LOCKED | MAP_UNLOCKED is specified
v4->v5
- rebase to latest head

v5->v6
- commit message rewritten

Gleb Natapov

Jan 18, 2010, 8:50:02 AM

> > >> > virtual machine doesn't.

Just did it. I hope the commit message is OK with you now. Its text is
taken from Andrew's mail. Thanks.

--
Gleb.

Pekka Enberg

Jan 18, 2010, 9:10:03 AM

Hi Gleb,

On Mon, Jan 18, 2010 at 3:37 PM, Gleb Natapov <gl...@redhat.com> wrote:
> The current interaction between mlockall(MCL_FUTURE) and mmap has a
> deficiency. In 'normal' mode, without MCL_FUTURE in force, the default
> is that new memory mappings are not locked, but mmap provides MAP_LOCKED
> specifically to override that default. However, with MCL_FUTURE toggled
> to on, there is no analogous way to tell mmap to override the default. The
> proposed MAP_UNLOCKED flag would resolve this deficiency.
>
> The benefit of the patch is that it makes it possible for an application
> which has previously called mlockall(MCL_FUTURE) to selectively exempt
> new memory mappings from memory locking, on a per-mmap-call basis. There
> is currently no thread-safe way for an application to do this as
> toggling MCL_FUTURE around calls to mmap is racy in a multi-threaded
> context. Other threads may manipulate the address space during the
> window where MCL_FUTURE is off, subverting the programmers intended
> memory locking semantics.
>
> The ability to exempt specific memory mappings from memory locking is
> necessary when the region to be mapped is larger than physical memory.
> In such cases a call to mmap the region cannot succeed, unless
> MAP_UNLOCKED is available.

The changelog doesn't mention what kind of applications would want to
use this. Are there some? Using mlockall(MCL_FUTURE) but then having
some memory regions MAP_UNLOCKED sounds like a strange combination to
me.

Gleb Natapov

Jan 18, 2010, 9:30:02 AM

The specific use cases were discussed in the thread following the previous
version of the patch. I can describe my specific use case in the changelog
and I can copy what Andrew said about his case, but is it really needed in
the commit message itself? It boils down to greater control over when and
where an application can take a major fault. There are applications that
need this kind of control. As for the use of mlockall(MCL_FUTURE): how else
can I make sure that all memory allocated behind my application's back (by
the dynamic linker, libraries, stack) will be locked?

--
Gleb.

Alan Cox

Jan 18, 2010, 9:40:02 AM

> this kind of control. As of use of mlockall(MCL_FUTURE) how can I make
> sure that all memory allocated behind my application's back (by dynamic
> linker, libraries, stack) will be locked otherwise?

If you add this flag you can't do that anyway - some library will
helpfully start up using it and then you are completely stuffed or will
be back in two or three years adding MLOCKALL_ALWAYS.

Alan

Gleb Natapov

Jan 18, 2010, 9:40:01 AM

On Mon, Jan 18, 2010 at 02:32:32PM +0000, Alan Cox wrote:
> > this kind of control. As of use of mlockall(MCL_FUTURE) how can I make
> > sure that all memory allocated behind my application's back (by dynamic
> > linker, libraries, stack) will be locked otherwise?
>
> If you add this flag you can't do that anyway - some library will
> helpfully start up using it and then you are completely stuffed or will
> be back in two or three years adding MLOCKALL_ALWAYS.
>

Libraries can do many other bad things. They can do mlockall(0) today
too, and that is not a reason to ditch mlockall(). I don't expect libc to
do that, though.

--
Gleb.

Peter Zijlstra

Jan 18, 2010, 10:00:02 AM

On Mon, 2010-01-18 at 14:32 +0000, Alan Cox wrote:
> > this kind of control. As of use of mlockall(MCL_FUTURE) how can I make
> > sure that all memory allocated behind my application's back (by dynamic
> > linker, libraries, stack) will be locked otherwise?
>
> If you add this flag you can't do that anyway - some library will
> helpfully start up using it and then you are completely stuffed or will
> be back in two or three years adding MLOCKALL_ALWAYS.

Agreed, mlockall() is a very bad interface and should not be used for a
plethora of reasons, this being one of them.

The thing is, if you can't trust your library to do sane things, then
don't use it.

Peter Zijlstra

Jan 18, 2010, 10:10:02 AM

On Mon, 2010-01-18 at 17:01 +0200, Gleb Natapov wrote:
> There are valid uses for mlockall()

That's debatable.

Gleb Natapov

Jan 18, 2010, 10:10:02 AM

On Mon, Jan 18, 2010 at 03:49:58PM +0100, Peter Zijlstra wrote:
> On Mon, 2010-01-18 at 14:32 +0000, Alan Cox wrote:
> > > this kind of control. As of use of mlockall(MCL_FUTURE) how can I make
> > > sure that all memory allocated behind my application's back (by dynamic
> > > linker, libraries, stack) will be locked otherwise?
> >
> > If you add this flag you can't do that anyway - some library will
> > helpfully start up using it and then you are completely stuffed or will
> > be back in two or three years adding MLOCKALL_ALWAYS.
>
> Agreed, mlockall() is a very bad interface and should not be used for a
> plethora of reasons, this being one of them.
>

There are valid uses for mlockall(), and even if the interface is bad there
is no alternative right now, so why not fix one of its problems?

> The thing is, if you cant trust your library to do sane things, then
> don't use it.
>

Agreed, there are things that a sane library should never do: call exit(),
output debug info to stdio, or meddle with mlock/munlock behind the
application's back.

--
Gleb.

Gleb Natapov

Jan 18, 2010, 10:20:01 AM

On Mon, Jan 18, 2010 at 04:06:34PM +0100, Peter Zijlstra wrote:
> On Mon, 2010-01-18 at 17:01 +0200, Gleb Natapov wrote:
> > There are valid uses for mlockall()
>
> That's debatable.
>

Well, I have a use for it. You can look at the previous thread where I
described it and suggest alternatives.

--
Gleb.

Peter Zijlstra

Jan 18, 2010, 10:20:02 AM

On Mon, 2010-01-18 at 17:11 +0200, Avi Kivity wrote:

> On 01/18/2010 05:06 PM, Peter Zijlstra wrote:
> > On Mon, 2010-01-18 at 17:01 +0200, Gleb Natapov wrote:
> >
> >> There are valid uses for mlockall()
> >>
> > That's debatable.
> >
> >
>

> Real-time?

I would not advise that; just mlock() the text and data you need for the
real-time thread. mlockall() is a really blunt instrument.
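
A minimal sketch of that per-region approach, assuming the real-time data
lives in one anonymous mapping: alloc_pinned() and free_pinned() are
hypothetical helper names, while mmap(), mlock(), munlock() and munmap() are
the standard calls. Locking can fail when RLIMIT_MEMLOCK is small or the
process lacks the privilege, so the helper reports failure instead of
assuming success.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Maps len bytes of anonymous memory and pins only that region with
 * mlock(). Returns the mapping, or NULL with errno set if either step
 * fails (for example when RLIMIT_MEMLOCK is too small). */
static void *alloc_pinned(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;
	if (mlock(p, len) != 0) {
		int saved = errno;	/* munmap() may overwrite errno */

		munmap(p, len);
		errno = saved;
		return NULL;
	}
	return p;
}

/* Unpins and unmaps a region obtained from alloc_pinned(). */
static void free_pinned(void *p, size_t len)
{
	munlock(p, len);
	munmap(p, len);
}
```

This covers only memory the application allocates itself. It does nothing
for pages mapped by libraries or the dynamic linker, which is the objection
raised elsewhere in the thread.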

Avi Kivity

Jan 18, 2010, 10:20:02 AM

On 01/18/2010 05:06 PM, Peter Zijlstra wrote:

> On Mon, 2010-01-18 at 17:01 +0200, Gleb Natapov wrote:
>
>> There are valid uses for mlockall()
>>
> That's debatable.
>
>

Real-time?

--
error compiling committee.c: too many arguments to function

Peter Zijlstra

Jan 18, 2010, 10:30:02 AM

On Mon, 2010-01-18 at 17:19 +0200, Avi Kivity wrote:
> > I would not advice that, just mlock() the text and data you need for the
> > real-time thread. mlockall() is a really blunt instrument.
> >
>

> May not be feasible due to libraries.

Especially for the real-time case I would advise not using those
libraries then, since they're clearly not designed for that use case.

Avi Kivity

Jan 18, 2010, 10:30:02 AM

On 01/18/2010 05:14 PM, Peter Zijlstra wrote:
> On Mon, 2010-01-18 at 17:11 +0200, Avi Kivity wrote:
>
>> On 01/18/2010 05:06 PM, Peter Zijlstra wrote:
>>
>>> On Mon, 2010-01-18 at 17:01 +0200, Gleb Natapov wrote:
>>>
>>>
>>>> There are valid uses for mlockall()
>>>>
>>>>
>>> That's debatable.
>>>
>>>
>>>
>> Real-time?
>>
> I would not advice that, just mlock() the text and data you need for the
> real-time thread. mlockall() is a really blunt instrument.
>

May not be feasible due to libraries.

--

error compiling committee.c: too many arguments to function

Alan Cox

Jan 18, 2010, 10:50:02 AM

On Mon, 18 Jan 2010 16:24:07 +0100
Peter Zijlstra <pet...@infradead.org> wrote:

> On Mon, 2010-01-18 at 17:19 +0200, Avi Kivity wrote:
> > > I would not advice that, just mlock() the text and data you need for the
> > > real-time thread. mlockall() is a really blunt instrument.
> > >
> >
> > May not be feasible due to libraries.
>
> Esp for the real-time case I could advise not to use those libraries
> then, since they're clearly not designed for that use case.

In "hard" real time cases an awful lot of libraries have things like
memory allocations in them and don't care about stack growth which can
cause faults and sleeps. The memory allocator if you are running threaded
was not real time priority aware either last time I checked so the
standard libraries are not going to give the behaviour you want unless
you have a proper RT environment, and even then it may be a bit iffy here
and there.

Peter Zijlstra

Jan 18, 2010, 10:50:01 AM

On Mon, 2010-01-18 at 15:41 +0000, Alan Cox wrote:
> On Mon, 18 Jan 2010 16:24:07 +0100
> Peter Zijlstra <pet...@infradead.org> wrote:
>
> > On Mon, 2010-01-18 at 17:19 +0200, Avi Kivity wrote:
> > > > I would not advice that, just mlock() the text and data you need for the
> > > > real-time thread. mlockall() is a really blunt instrument.
> > > >
> > >
> > > May not be feasible due to libraries.
> >
> > Esp for the real-time case I could advise not to use those libraries
> > then, since they're clearly not designed for that use case.
>
> In "hard" real time cases an awful lot of libraries have things like
> memory allocations in them and don't care about stack growth which can
> cause faults and sleeps. The memory allocator if you are running threaded
> was not real time priority aware either last time I checked so the
> standard libraries are not going to give the behaviour you want unless
> you have a proper RT environment, and even then it may be a bit iffy here
> and there.

I'm quite aware of that, which is why we recommend people to
pre-allocate, mlock() and pre-fault everything in advance and make sure
the RT thread doesn't touch any data/text outside of that and uses a
limited set of system calls.

You can also do that for stacks using pthread_attr_setstack().
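
The recipe above (pre-allocate, mlock(), pre-fault, and hand the thread a
prepared stack with pthread_attr_setstack()) can be sketched roughly as
follows. This is only an illustration under assumptions: the helper names
rt_prealloc(), rt_thread_start() and rt_worker() are hypothetical, and
mlock() may be refused under RLIMIT_MEMLOCK, which the sketch reports to the
caller instead of treating as fatal.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: map `len` bytes, try to lock them, then touch every
 * page so no fault is taken on first use. Returns NULL if mmap() fails.
 * *locked is set to 1 if mlock() succeeded and 0 if it was refused.
 */
static void *rt_prealloc(size_t len, int *locked)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    *locked = (mlock(p, len) == 0);
    memset(p, 0, len);              /* pre-fault: write to every page */
    return p;
}

/* Hypothetical helper: start a thread on a caller-provided stack. */
static int rt_thread_start(pthread_t *tid, void *stack, size_t stack_len,
                           void *(*fn)(void *), void *arg)
{
    pthread_attr_t attr;
    int err = pthread_attr_init(&attr);
    if (err != 0)
        return err;
    err = pthread_attr_setstack(&attr, stack, stack_len);
    if (err == 0)
        err = pthread_create(tid, &attr, fn, arg);
    pthread_attr_destroy(&attr);
    return err;
}

/* Example thread body: stores 42 through its argument. */
static void *rt_worker(void *arg)
{
    *(int *)arg = 42;
    return NULL;
}
```

A caller would allocate the stack with rt_prealloc(), pass it to
rt_thread_start(), and keep the thread away from any memory that was not
prepared this way.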

Pekka Enberg

Jan 18, 2010, 11:10:02 AM

On Mon, Jan 18, 2010 at 4:19 PM, Gleb Natapov <gl...@redhat.com> wrote:
> The specific use cases were discussed in the thread following previous
> version of the patch. I can describe my specific use case in a change log
> and I can copy what Andrew said about his case, but is it really needed in
> a commit message itself? It boils down to greater control over when and
> where application can get major fault. There are applications that need
> this kind of control. As of use of mlockall(MCL_FUTURE) how can I make
> sure that all memory allocated behind my application's back (by dynamic
> linker, libraries, stack) will be locked otherwise?

Again, why do you want to MCL_FUTURE but then go and use MAP_UNLOCKED?
"Greater control" is not an argument for adding a new API that needs
to be maintained forever, a real world use case is.

And yes, this stuff needs to be in the changelog. Whether you want to
spell it out or post a URL to some previous discussion is up to you.

Gleb Natapov

Jan 18, 2010, 12:10:02 PM

On Mon, Jan 18, 2010 at 06:05:38PM +0200, Pekka Enberg wrote:
> On Mon, Jan 18, 2010 at 4:19 PM, Gleb Natapov <gl...@redhat.com> wrote:
> > The specific use cases were discussed in the thread following previous
> > version of the patch. I can describe my specific use case in a change log
> > and I can copy what Andrew said about his case, but is it really needed in
> > a commit message itself? It boils down to greater control over when and
> > where application can get major fault. There are applications that need
> > this kind of control. As of use of mlockall(MCL_FUTURE) how can I make
> > sure that all memory allocated behind my application's back (by dynamic
> > linker, libraries, stack) will be locked otherwise?
>
> Again, why do you want to MCL_FUTURE but then go and use MAP_UNLOCKED?

I need to have all my memory locked except one big (bigger than main
memory) chunk. I either need to rewrite my application and all libraries
to use a memory allocator that returns locked memory, maybe even rewrite the
dynamic loader to use this allocator to lock executable code too, or run
mlockall(MCL_FUTURE|MCL_CURRENT) at startup and exempt one allocation
from this rule. Note that I can't allocate it locked and then unlock it,
since the allocation will fail. Actually, for me it hung the kernel last I
checked.

> "Greater control" is not an argument for adding a new API that needs
> to be maintained forever, a real world use case is.
>

If there is a real-world use case for mlockall() there is a real use case for
this too. People seem to be trying to convince me that I don't need
mlockall() without proposing alternatives. The only alternative I see is to
lock everything from userspace.

> And yes, this stuff needs to be in the changelog. Whether you want to
> spell it out or post an URL to some previous discussion is up to you.

The discussion was here just a couple of days ago. Here is the link
where I describe my use case: http://marc.info/?l=linux-mm&m=126345374125942&w=2
If you think it needs to be spelled out in the commit log I'll do it.

--
Gleb.

Gleb Natapov

Jan 18, 2010, 12:20:04 PM

On Mon, Jan 18, 2010 at 04:14:43PM +0100, Peter Zijlstra wrote:
> On Mon, 2010-01-18 at 17:11 +0200, Avi Kivity wrote:
> > On 01/18/2010 05:06 PM, Peter Zijlstra wrote:
> > > On Mon, 2010-01-18 at 17:01 +0200, Gleb Natapov wrote:
> > >
> > >> There are valid uses for mlockall()
> > >>
> > > That's debatable.
> > >
> > >
> >
> > Real-time?
>
> I would not advice that, just mlock() the text and data you need for the
> real-time thread. mlockall() is a really blunt instrument.
>

Yes it is blunt, but the patch makes it less so.

--
Gleb.

Pekka Enberg

Jan 18, 2010, 1:10:03 PM

Hi Gleb,

On Mon, Jan 18, 2010 at 7:08 PM, Gleb Natapov <gl...@redhat.com> wrote:
>> "Greater control" is not an argument for adding a new API that needs
>> to be maintained forever, a real world use case is.
>>
> If there is real world use case for mlockall() there is real use case for
> this too. People seems to be trying to convince me that I don't need
> mlockall() without proposing alternatives. The only alternative I see
> lock everything from userspace.
>
>> And yes, this stuff needs to be in the changelog. Whether you want to
>> spell it out or post an URL to some previous discussion is up to you.
> The discussion was here just a couple of days ago. Here is the link
> were I describe my use case: http://marc.info/?l=linux-mm&m=126345374125942&w=2
> If you think it needs to be spelled out in commit log I'll do it.

So this is a performance thing? Btw, is there a reason you can't use
plain mlock() for it as suggested by Peter earlier?

Pekka

Gleb Natapov

Jan 18, 2010, 1:30:01 PM

On Mon, Jan 18, 2010 at 08:09:26PM +0200, Pekka Enberg wrote:
> Hi Gleb,
>
> On Mon, Jan 18, 2010 at 7:08 PM, Gleb Natapov <gl...@redhat.com> wrote:
> >> "Greater control" is not an argument for adding a new API that needs
> >> to be maintained forever, a real world use case is.
> >>
> > If there is real world use case for mlockall() there is real use case for
> > this too. People seems to be trying to convince me that I don't need
> > mlockall() without proposing alternatives. The only alternative I see
> > lock everything from userspace.
> >
> >> And yes, this stuff needs to be in the changelog. Whether you want to
> >> spell it out or post an URL to some previous discussion is up to you.
> > The discussion was here just a couple of days ago. Here is the link
> > were I describe my use case: http://marc.info/?l=linux-mm&m=126345374125942&w=2
> > If you think it needs to be spelled out in commit log I'll do it.
>
> So this is a performance thing? Btw, is there are reason you can't use
> plain mlock() for it as suggested by Peter earlier?
>

I can't realistically chase every address space mapping change and mlock
new areas. The only way other than mlockall() is to use a custom memory
allocator that allocates mlocked memory.

--
Gleb.

Alan Cox

Jan 18, 2010, 2:10:02 PM

> I can't realistically chase every address space mapping changes and mlock
> new areas. The only way other then mlockall() is to use custom memory
> allocator that allocates mlocked memory.

Which keeps all the special cases in your app rather than in every single
user's kernel. That seems to be the right way up, especially as you can
make a library of it!

Alan

KOSAKI Motohiro

Jan 18, 2010, 7:20:02 PM

> On Mon, Jan 18, 2010 at 04:06:34PM +0100, Peter Zijlstra wrote:
> > On Mon, 2010-01-18 at 17:01 +0200, Gleb Natapov wrote:
> > > There are valid uses for mlockall()
> >
> > That's debatable.
> >
> Well, I have use for it. You can look at previous thread were I described
> it and suggest alternatives.

Please stop that.
This is a review. Reviewers shouldn't have to read the entire
previous thread. If they do, it means your description isn't sufficient.

Gleb Natapov

Jan 19, 2010, 2:20:01 AM

On Mon, Jan 18, 2010 at 07:10:31PM +0000, Alan Cox wrote:
> > I can't realistically chase every address space mapping changes and mlock
> > new areas. The only way other then mlockall() is to use custom memory
> > allocator that allocates mlocked memory.
>
> Which keeps all the special cases in your app rather than in every single
> users kernel. That seems to be the right way up, especially as you can
> make a library of it !
>

Are you advocating rewriting mlockall() in userspace? It may be possible,
but it will require rewriting half of libc. Everything that changes the process
address space would have to support mlocking (memory allocation functions, dynamic
loading, strdup). Allocations can be done only with mmap(), since brk()
can't allocate mlocked memory atomically. And of course if a third-party library
uses the mmap syscall directly instead of the libc one, you are SOL. Been
there already: I worked on a project that replaced the libc memory allocation
functions because it had to track when memory is returned to the OS, not just to
the internal libc pool (MPI). This is a pain in the arse, and on top of that it
doesn't work reliably. Some things are better done at the OS level.

The thread took a direction of bashing mlockall(). This is especially
strange since the proposed patch actually makes mlockall() more
fine-grained and thus more useful.

--
Gleb.
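
For readers wondering what the "custom memory allocator that allocates
mlocked memory" alternative might look like, here is a minimal sketch. It
assumes a single arena obtained with mmap(MAP_LOCKED) and handed out in
16-byte-aligned pieces; the names locked_arena_init(), locked_arena_alloc()
and locked_arena_destroy() are hypothetical, and nothing here covers the
hard part described above (the dynamic loader, libc internals, or
third-party code that calls mmap directly).

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical arena: one MAP_LOCKED mapping carved up by a bump pointer. */
struct locked_arena {
    unsigned char *base;
    size_t size;
    size_t used;
};

/*
 * Returns 0 on success, -1 if the kernel refuses the locked mapping
 * (for example because of RLIMIT_MEMLOCK).
 */
static int locked_arena_init(struct locked_arena *a, size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    a->base = p;
    a->size = size;
    a->used = 0;
    return 0;
}

/* Returns a 16-byte-aligned chunk, or NULL when the arena is exhausted. */
static void *locked_arena_alloc(struct locked_arena *a, size_t n)
{
    size_t aligned = (n + 15) & ~(size_t)15;
    if (aligned < n || aligned > a->size - a->used)
        return NULL;
    void *p = a->base + a->used;
    a->used += aligned;
    return p;
}

/* Unmaps the arena; every chunk handed out becomes invalid. */
static void locked_arena_destroy(struct locked_arena *a)
{
    if (a->base != NULL)
        munmap(a->base, a->size);
    a->base = NULL;
    a->size = 0;
    a->used = 0;
}
```

Even this toy version only helps code that is written to call it, which is
the limitation the message above is pointing at.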

Pekka Enberg

Jan 19, 2010, 2:40:01 AM

Hi Gleb,

On Tue, Jan 19, 2010 at 9:17 AM, Gleb Natapov <gl...@redhat.com> wrote:
> The thread took a direction of bashing mlockall(). This is especially
> strange since proposed patch actually makes mlockall() more fine
> grained and thus more useful.

No, the thread took a direction of you not being able to properly
explain why we want MAP_UNLOCKED in the kernel. It seems useless for
real-time and I've yet to figure out why you need _mlockall()_ if it's
a performance thing.

It would probably be useful if you could point us to the application
source code that actually wants this feature.

Pekka

Gleb Natapov

Jan 19, 2010, 3:00:02 AM

On Tue, Jan 19, 2010 at 09:37:05AM +0200, Pekka Enberg wrote:
> Hi Gleb,
>
> On Tue, Jan 19, 2010 at 9:17 AM, Gleb Natapov <gl...@redhat.com> wrote:
> > The thread took a direction of bashing mlockall(). This is especially
> > strange since proposed patch actually makes mlockall() more fine
> > grained and thus more useful.
>
> No, the thread took a direction of you not being able to properly
> explain why we want MAP_UNLOCKED in the kernel. It seems useless for

It is needed in the kernel because this is the only proper (i.e. thread-safe)
way to mmap an area bigger than main memory after mlockall(MCL_FUTURE).
Do you agree with that? Now you can ask why this is needed, and that is a
valid question.

> real-time and I've yet to figure out why you need _mlockall()_ if it's
> a performance thing.

I don't do real-time so I will not argue how useful it is for that,
but it seems to me that the people who argue that it is not useful for
real-time don't do it either, and the only person in this thread who does
real-time uses mlockall(). Hmm, strange.

In my case (virtualization) I want to test/profile a guest under heavy swapping
of the guest's memory, so I intentionally create a memory shortage by creating
a guest much larger than host memory, but I want the system to swap out only
the guest's memory.

>
> It would be probably useful if you could point us to the application
> source code that actually wants this feature.
>

This is a two-line patch to qemu that calls mlockall(MCL_CURRENT|MCL_FUTURE)
at the beginning of main() and changes the guest memory allocation to
use the MAP_UNLOCKED flag. All alternative solutions in this thread suggest
that I should rewrite qemu plus all the libraries it uses. You see why I can't
take them seriously?

--
Gleb.
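
The usage pattern described above can be sketched as follows. This is not
the actual qemu change, only an illustration under assumptions: MAP_UNLOCKED
is the flag proposed by this patch and is not assumed to exist in installed
headers, so the sketch falls back to 0 (in which case the mapping is not
exempt from MCL_FUTURE), and the helper names lock_everything() and
map_swappable() are hypothetical.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * ASSUMPTION: MAP_UNLOCKED is the flag proposed in this thread. Where the
 * headers do not define it, fall back to 0 so the sketch still compiles;
 * with that fallback the mapping is NOT exempt from MCL_FUTURE.
 */
#ifndef MAP_UNLOCKED
#define MAP_UNLOCKED 0
#endif

/* Lock all current and future mappings. Returns 0, or -1 with errno set. */
static int lock_everything(void)
{
    return mlockall(MCL_CURRENT | MCL_FUTURE);
}

/*
 * Hypothetical helper: an anonymous mapping that is meant to stay swappable
 * even after lock_everything(). Returns NULL if mmap() fails.
 */
static void *map_swappable(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_UNLOCKED, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

In the scenario from the message, lock_everything() would be called at the
start of main() and map_swappable() would replace the guest RAM allocation.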

Pekka Enberg

Jan 19, 2010, 3:10:02 AM

Hi Gleb,

On Tue, Jan 19, 2010 at 9:52 AM, Gleb Natapov <gl...@redhat.com> wrote:
>> It would be probably useful if you could point us to the application
>> source code that actually wants this feature.
>>
> This is two line patch to qemu that calls mlockall(MCL_CURRENT|MCL_FUTURE)
> at the beginning of the main() and changes guest memory allocation to
> use MAP_UNLOCKED flag. All alternative solutions in this thread suggest
> that I should rewrite qemu + all library it uses. You see why I can't
> take them seriously?

Well, that's not going to be portable, is it, so the application
design would still be broken, no? Did you try using (or extending)
posix_madvise(MADV_DONTNEED) for the guest address space? It seems to
me that you're trying to use a big hammer (mlock) when a polite hint
for the VM would probably be sufficient for it to do its job.

Pekka

Gleb Natapov

Jan 19, 2010, 3:30:03 AM

On Tue, Jan 19, 2010 at 10:07:07AM +0200, Pekka Enberg wrote:
> Hi Gleb,
>
> On Tue, Jan 19, 2010 at 9:52 AM, Gleb Natapov <gl...@redhat.com> wrote:
> >> It would be probably useful if you could point us to the application
> >> source code that actually wants this feature.
> >>
> > This is two line patch to qemu that calls mlockall(MCL_CURRENT|MCL_FUTURE)
> > at the beginning of the main() and changes guest memory allocation to
> > use MAP_UNLOCKED flag. All alternative solutions in this thread suggest
> > that I should rewrite qemu + all library it uses. You see why I can't
> > take them seriously?
>
> Well, that's not going to be portable, is it, so the application

KVM is not portable ;) and that is what my main interest is.

> design would still be broken, no? Did you try using (or extending)
> posix_madvise(MADV_DONTNEED) for the guest address space? It seems to

After mlockall() I can't even allocate the guest address space. Or do you mean
instead of mlockall()? Then how will MADV_DONTNEED help? It just drops the
page table entries for the address range (which is not what I need) and does
not have any long-term effect.

> me that you're trying to use a big hammer (mlock) when a polite hint
> for the VM would probably be sufficient for it do its job.
>

I want to tell the VM "swap this, don't swap that" and as far as I can see
there is no other way to do it currently.

--
Gleb.

Pekka Enberg

Jan 19, 2010, 3:50:02 AM

Hi Gleb,

On Tue, Jan 19, 2010 at 10:26 AM, Gleb Natapov <gl...@redhat.com> wrote:
>> design would still be broken, no? Did you try using (or extending)
>> posix_madvise(MADV_DONTNEED) for the guest address space? It seems to
> After mlockall() I can't even allocate guest address space. Or do you mean
> instead of mlockall()? Then how MADV_DONTNEED will help? It just drops
> page table for the address range (which is not what I need) and does not
> have any long time effect.

Oh right, MADV_DONTNEED is no good.

On Tue, Jan 19, 2010 at 10:26 AM, Gleb Natapov <gl...@redhat.com> wrote:
>> me that you're trying to use a big hammer (mlock) when a polite hint
>> for the VM would probably be sufficient for it do its job.
>>
> I what to tell to VM "swap this, don't swap that" and as far as I see
> there is no other way to do it currently.

Yeah, which is why I was suggesting that maybe posix_madvise() needs
to be extended to have a MADV_NEED_BUT_LESS_IMPORTANT flag that can be
used as a hint by mm/vmscan.c to first swap the guest address spaces.

Pekka

Gleb Natapov

Jan 19, 2010, 5:50:01 AM

On Tue, Jan 19, 2010 at 10:44:23AM +0200, Pekka Enberg wrote:
> On Tue, Jan 19, 2010 at 10:26 AM, Gleb Natapov <gl...@redhat.com> wrote:
> >> me that you're trying to use a big hammer (mlock) when a polite hint
> >> for the VM would probably be sufficient for it do its job.
> >>
> > I what to tell to VM "swap this, don't swap that" and as far as I see
> > there is no other way to do it currently.
>
> Yeah, which is why I was suggesting that maybe posix_madvise() needs
> to be extended to have a MADV_NEED_BUT_LESS_IMPORTANT flag that can be
> used as a hint by mm/vmscan.c to first swap the guest address spaces.
>

If such a thing existed maybe I would have used it, since swapping out
the wrong page is not a life-or-death matter in my case, but mlockall()
provides me with exactly what I need, without swapping out the wrong
pages. Speaking of adding such a madvise call, wouldn't it be even
harder to justify? It is obviously not good enough for real-time use, and my
case, I admit, is unusual. Also, if we start prioritising memory, why stop
at binary? Why not set a value like "this memory is more important than
that memory by a factor of 5"?

--
Gleb.

Alan Cox

Jan 19, 2010, 7:00:01 AM

> In my case (virtualization) I want to test/profile guest under heavy swapping
> of a guests memory, so I intentionally create memory shortage by creating
> guest much large then host memory, but I want system to swap out only
> guest's memory.

So this isn't an API question, this is an obscure corner-case testing
question.

>
> >
> > It would be probably useful if you could point us to the application
> > source code that actually wants this feature.
> >
> This is two line patch to qemu that calls mlockall(MCL_CURRENT|MCL_FUTURE)
> at the beginning of the main() and changes guest memory allocation to
> use MAP_UNLOCKED flag. All alternative solutions in this thread suggest
> that I should rewrite qemu + all library it uses. You see why I can't
> take them seriously?

And you want millions of users to have kernels with weird extra functions
whose sole value is one test environment you wish to run.

See why we can't take you seriously either ?

Gleb Natapov

Jan 19, 2010, 7:10:02 AM

On Tue, Jan 19, 2010 at 11:54:42AM +0000, Alan Cox wrote:
> > In my case (virtualization) I want to test/profile guest under heavy swapping
> > of a guests memory, so I intentionally create memory shortage by creating
> > guest much large then host memory, but I want system to swap out only
> > guest's memory.
>
> So this isn't an API question this is an obscure corner case testing
> question.
>

It is a real use case where the kernel doesn't provide me with
enough rope. You, of course, can dismiss it as an "obscure corner case". You
can't expect issues with mlockall(), which is a corner case by itself, to be
mainstream, can you?

> > >
> > > It would be probably useful if you could point us to the application
> > > source code that actually wants this feature.
> > >
> > This is two line patch to qemu that calls mlockall(MCL_CURRENT|MCL_FUTURE)
> > at the beginning of the main() and changes guest memory allocation to
> > use MAP_UNLOCKED flag. All alternative solutions in this thread suggest
> > that I should rewrite qemu + all library it uses. You see why I can't
> > take them seriously?
>
> And you want millions of users to have kernels with weird extra functions
> whole sole value is one test environment you wish to run
>

We are talking about 4 lines of code that other people find useful too,
and they said so in this thread. This wouldn't be the first kernel
feature not used by millions of people.

> See why we can't take you seriously either ?
>

I was talking about solutions. Thank you.

--
Gleb.

Minchan Kim

Jan 19, 2010, 7:50:01 AM

Hi, Pekka.

On Tue, 2010-01-19 at 10:44 +0200, Pekka Enberg wrote:
> Hi Gleb,
>
> On Tue, Jan 19, 2010 at 10:26 AM, Gleb Natapov <gl...@redhat.com> wrote:
> >> design would still be broken, no? Did you try using (or extending)
> >> posix_madvise(MADV_DONTNEED) for the guest address space? It seems to
> > After mlockall() I can't even allocate guest address space. Or do you mean
> > instead of mlockall()? Then how MADV_DONTNEED will help? It just drops
> > page table for the address range (which is not what I need) and does not
> > have any long time effect.
>
> Oh right, MADV_DONTNEED is no good.
>
> On Tue, Jan 19, 2010 at 10:26 AM, Gleb Natapov <gl...@redhat.com> wrote:
> >> me that you're trying to use a big hammer (mlock) when a polite hint
> >> for the VM would probably be sufficient for it do its job.
> >>
> > I what to tell to VM "swap this, don't swap that" and as far as I see
> > there is no other way to do it currently.
>
> Yeah, which is why I was suggesting that maybe posix_madvise() needs
> to be extended to have a MADV_NEED_BUT_LESS_IMPORTANT flag that can be
> used as a hint by mm/vmscan.c to first swap the guest address spaces.
>
> Pekka

Gleb, how about using MADV_SEQUENTIAL on the guest memory?
It makes the guest's pages move to the inactive reclaim list
faster. That means they are likely to be swapped out sooner than other pages
if they aren't touched while on the inactive list.

--
Kind regards,
Minchan Kim

Pekka Enberg

Jan 19, 2010, 8:20:02 AM

On Tue, Jan 19, 2010 at 2:48 PM, Minchan Kim <minch...@gmail.com> wrote:
> Gleb. How about using MADV_SEQUENTIAL on guest memory?
> It makes that pages of guest are moved into inactive reclaim list more
> fast. It means it is likely to swap out faster than other pages if it
> isn't hit during inactive list.

Yeah, something like that but we don't want the readahead. OTOH, it's
not clear what Gleb's real problem is. Are the guest address spaces
anonymous or file backed? Which parts of the emulator are swapped out
that are causing the problem? Maybe it's a VM balancing issue that
mlock papers over?

Pekka

Alan Cox

Jan 19, 2010, 8:30:02 AM

> > And you want millions of users to have kernels with weird extra functions
> > whole sole value is one test environment you wish to run
> >
> We are talking about 4 lines of code that other people find useful too
> and they commented in this thread. This wouldn't be the first kernel
> feature not used by millions of people.

It wouldn't be the first completely dumb mistake in the kernel either,
but one dumb mistake doesn't argue for including others

Gleb Natapov

Jan 19, 2010, 8:30:02 AM

On Tue, Jan 19, 2010 at 03:18:11PM +0200, Pekka Enberg wrote:
> On Tue, Jan 19, 2010 at 2:48 PM, Minchan Kim <minch...@gmail.com> wrote:
> > Gleb. How about using MADV_SEQUENTIAL on guest memory?
> > It makes that pages of guest are moved into inactive reclaim list more
> > fast. It means it is likely to swap out faster than other pages if it
> > isn't hit during inactive list.
>
> Yeah, something like that but we don't want the readahead. OTOH, it's
> not clear what Gleb's real problem is. Are the guest address spaces
> anonymous or file backed?

Anonymous.

> Which parts of the emulator are swapped out
> that are causing the problem?

I don't want anything that can be used during guest runtime to be
swapped out. And I run a 2G guest in a 512M container, so eventually
everything is swapped out :)

> Maybe it's a VM balancing issue that
> mlock papers over?
>

There is no problem. I am measuring how host swapping affects the
guest and I don't want qemu code to be swapped out.

--
Gleb.

Minchan Kim

Jan 19, 2010, 9:10:02 AM

On Tue, 2010-01-19 at 09:52 +0200, Gleb Natapov wrote:

> In my case (virtualization) I want to test/profile guest under heavy swapping
> of a guests memory, so I intentionally create memory shortage by creating

You mean the "guest memory" that is the area emulating DRAM in qemu?
It is an anonymous vma.

> guest much large then host memory, but I want system to swap out only
> guest's memory.

Couldn't you use MADV_SEQUENTIAL on only the guest memory area?
It has no readahead side effect since it's an anon area.
And it would make a best effort to swap out the guest's memory.

--
Kind regards,
Minchan Kim
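
A minimal sketch of the suggestion above, assuming the guest RAM is one
private anonymous mapping. The helper names guest_ram_alloc() and
guest_ram_hint_sequential() are hypothetical stand-ins, not qemu functions,
and the hint is purely advisory: whether it changes reclaim order in the way
described is up to the kernel's VM, not something this code can guarantee.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical stand-in for the emulator's guest RAM allocation. */
static void *guest_ram_alloc(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/*
 * Tell the kernel the range is expected to be accessed sequentially.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int guest_ram_hint_sequential(void *addr, size_t len)
{
    return madvise(addr, len, MADV_SEQUENTIAL);
}
```

Only the guest RAM mapping receives the hint; the emulator's own text and
data are left alone, so nothing here keeps them from being swapped out.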

Gleb Natapov

Jan 19, 2010, 9:20:01 AM

On Tue, Jan 19, 2010 at 11:07:23PM +0900, Minchan Kim wrote:
> On Tue, 2010-01-19 at 09:52 +0200, Gleb Natapov wrote:
>
> > In my case (virtualization) I want to test/profile guest under heavy swapping
> > of a guests memory, so I intentionally create memory shortage by creating
>
> You mean "guest memory" that is area emulated DRAM in qemu?
> It is anonymous vma.
>
> > guest much large then host memory, but I want system to swap out only
> > guest's memory.
>
> Couldn't you use MADV_SEQUENTIAL on only guest memory area?
> It doesn't make side effect about readahead since it's anon area.
> And it would make do best effort to swap out guest's memory.
>

I can try; it should be better than nothing, I guess. It doesn't guarantee that
emulator memory will not be swapped out, though.

--
Gleb.

KOSAKI Motohiro

Jan 19, 2010, 7:30:01 PM

> Hi Gleb,
>
> On Tue, Jan 19, 2010 at 10:26 AM, Gleb Natapov <gl...@redhat.com> wrote:
> >> design would still be broken, no? Did you try using (or extending)
> >> posix_madvise(MADV_DONTNEED) for the guest address space? It seems to
> > After mlockall() I can't even allocate guest address space. Or do you mean
> > instead of mlockall()? Then how MADV_DONTNEED will help? It just drops
> > page table for the address range (which is not what I need) and does not
> > have any long time effect.
>
> Oh right, MADV_DONTNEED is no good.

Off topic:

posix_madvise(MADV_DONTNEED) is a nop. glibc's posix_madvise(MADV_DONTNEED)
doesn't call Linux's madvise(MADV_DONTNEED).
That's because Linux's madvise(MADV_DONTNEED) is not POSIX compliant.

The behavior of Linux madvise(MADV_DONTNEED) is similar to Solaris (or *BSD)
madvise(MADV_FREE).
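
The difference can be observed with a small experiment, sketched here under
the assumption of Linux with glibc. The helper name
first_byte_after_dontneed() is hypothetical. It fills one private anonymous
page with 0x5a, applies either madvise(MADV_DONTNEED) or
posix_madvise(POSIX_MADV_DONTNEED), and returns the first byte read back. On
such a system the madvise() variant is expected to read back 0 (the page was
discarded and replaced by a zero page), while the posix_madvise() variant is
expected to leave the data intact.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Fill one anonymous page with 0x5a, apply the chosen DONTNEED advice and
 * return the first byte afterwards. Returns -1 if setup or the call fails.
 */
static int first_byte_after_dontneed(int use_posix)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;
    size_t len = (size_t)page;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    memset(p, 0x5a, len);

    /* Both calls return 0 on success, so one check covers either path. */
    int rc = use_posix ? posix_madvise(p, len, POSIX_MADV_DONTNEED)
                       : madvise(p, len, MADV_DONTNEED);
    int value = (rc == 0) ? p[0] : -1;

    munmap(p, len);
    return value;
}
```

Comparing first_byte_after_dontneed(0) with first_byte_after_dontneed(1)
shows why the two calls cannot be treated as interchangeable.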