
process 'stuck' at exit.


Dave Jones
Dec 10, 2013, 3:50:03 PM
I woke up to find my fuzzer in a curious state.

1121 pts/5 SN+ 0:00 | \_ ../trinity -q -l off -N 999999 -C 42
1130 pts/5 SN+ 0:01 | \_ ../trinity -q -l off -N 999999 -C 42
1131 pts/5 SN+ 0:17 | \_ ../trinity -q -l off -N 999999 -C 42
10818 ? RNs 21115152:53 | \_ ../trinity -q -l off -N 999999 -C 42

(ignore the first 3 pids, they're waiting on 10818, which is the interesting one)

It's completed its run of 999999 syscalls, and looking at tmux, it tried to exit.

/proc/10818/stack just shows

[<ffffffffffffffff>] 0xffffffffffffffff

Top reports a core spinning in the kernel at 100%, so I ran perf top -a
and saw..

8.63% [kernel] [k] trace_hardirqs_off_caller
7.68% [kernel] [k] __lock_acquire
5.35% [kernel] [k] gup_huge_pmd
5.31% [kernel] [k] put_compound_page
4.93% [kernel] [k] gup_pud_range
4.76% [kernel] [k] trace_hardirqs_on_caller
4.58% [kernel] [k] get_user_pages_fast
4.00% [kernel] [k] debug_smp_processor_id
4.00% [kernel] [k] lock_is_held
3.73% [kernel] [k] lock_acquired
3.67% [kernel] [k] lock_release


sysrq-t shows..

trinity-child27 R running task 5520 10818 1131 0x00080004
0000000000000000 ffff8801b0ef4170 000000000000032c ffff8801b609e108
0000000000000000 ffff880160d21c30 ffffffff810ad895 ffffffff817587a0
ffff8801b0ef4170 ffff8801b609e0a8 ffff8801b609e000 ffff880160d21d50
Call Trace:
[<ffffffff817587a0>] ? retint_restore_args+0xe/0xe
[<ffffffff8132af0e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff8100b184>] ? native_sched_clock+0x24/0x80
[<ffffffff8109624f>] ? local_clock+0xf/0x50
[<ffffffff810aa27e>] ? put_lock_stats.isra.28+0xe/0x30
[<ffffffff8103edd0>] ? gup_pud_range+0x170/0x190
[<ffffffff8103f0d5>] ? get_user_pages_fast+0x1a5/0x1c0
[<ffffffff810ad1f5>] ? trace_hardirqs_on_caller+0x115/0x1e0
[<ffffffff810a8a2f>] ? up_read+0x1f/0x40
[<ffffffff8103f0d5>] ? get_user_pages_fast+0x1a5/0x1c0
[<ffffffff8115f76c>] ? put_page+0x3c/0x50
[<ffffffff810dd525>] ? get_futex_key+0xd5/0x2c0
[<ffffffff810df18a>] ? futex_requeue+0xfa/0x9c0
[<ffffffff810e019e>] ? do_futex+0xae/0xc80
[<ffffffff810aa27e>] ? put_lock_stats.isra.28+0xe/0x30
[<ffffffff810aa7de>] ? lock_release_holdtime.part.29+0xee/0x170
[<ffffffff8114f16e>] ? context_tracking_user_exit+0x4e/0x190
[<ffffffff810ad1f5>] ? trace_hardirqs_on_caller+0x115/0x1e0
[<ffffffff810e0de1>] ? SyS_futex+0x71/0x150
[<ffffffff81010a45>] ? syscall_trace_enter+0x145/0x2a0
[<ffffffff81760be4>] ? tracesys+0xdd/0xe2


any ideas ?

Dave

--

Linus Torvalds
Dec 10, 2013, 6:50:02 PM
Hmm. Looks like the futex code is somehow stuck in a loop, calling
get_user_pages_fast().

The futex code itself is apparently so low-overhead that it doesn't
show up in your 'perf top' report (which is dominated by all the
expensive debug things that get_user_pages_fast() etc ends up doing),
but that's the only looping I can see. Perhaps the "goto again" case
for transparent huge pages in get_futex_key()? Or the
"retry[_private]" cases in futex_requeue()? Some error condition that
causes us to retry forever, rather than returning the error code?

Added a few more people to the cc.. Ideas?

Linus

Thomas Gleixner
Dec 10, 2013, 7:20:01 PM
On Tue, 10 Dec 2013, Linus Torvalds wrote:

> Hmm. Looks like the futex code is somehow stuck in a loop, calling
> get_user_pages_fast().
>
> The futex code itself is apparently so low-overhead that it doesn't
> show up in your 'perf top' report (which is dominated by all the
> expensive debug things that get_user_pages_fast() etc ends up doing),
> but that's the only looping I can see. Perhaps the "goto again" case
> for transparent huge pages in get_futex_key()? Or the

Cc'ng more folks on that.

> "retry[_private]" cases in futex_requeue()? Some error condition that
> causes us to retry forever, rather than returning the error code?

So this is pretty unlikely. The retry requires:

get_futex_value_locked() == EFAULT;

Now we drop the hash bucket locks and do:

get_user();

And if that get_user() faults again, we bail out.

If it succeeds we try again. So between the get_user() and the next
get_futex_value_locked() the effect of get_user() must have been
undone. I guess this can happen, but a gazillion times in a row?
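
For reference, that retry looks roughly like this in futex_requeue()
(condensed from the kernel source of that era, not verbatim):

    retry_private:
        ...
        ret = get_futex_value_locked(&curval, uaddr1);
        if (unlikely(ret)) {                        /* -EFAULT */
            double_unlock_hb(hb1, hb2);             /* drop the hash bucket locks */

            ret = get_user(curval, uaddr1);         /* fault the word in */
            if (ret)
                goto out_put_keys;                  /* faults again -> bail out */

            if (!(flags & FLAGS_SHARED))
                goto retry_private;                 /* succeeded -> try again */

            put_futex_key(&key2);
            put_futex_key(&key1);
            goto retry;
        }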

>
> Added a few more people to the cc.. Ideas?

...

Linus Torvalds
Dec 10, 2013, 8:00:03 PM
On Tue, Dec 10, 2013 at 11:18 AM, Thomas Gleixner <tg...@linutronix.de> wrote:
>
> So this is pretty unlikely. The retry requires:
>
> get_futex_value_locked() == EFAULT;
>
> Now we drop the hash bucket locks and do:
>
> get_user();
>
> And if that get_user() faults again, we bail out.

I think you need to look closer.

We have at least also that "futex_proxy_trylock_atomic() returns
-EAGAIN" case. Which triggers at some exit condition. Another thread
in the same group, perhaps never completing the exit because it's
waiting for this one? I dunno, I didn't look any closer (but this does
make me think "Hey, we should add Oleg to the Cc too", since
PF_EXITING is involved).. So maybe there is some situation where that
EAGAIN will keep happening, forever..

Now, I'm *not* saying that that is it. It's quite possible/likely some
other loop, but I do have to say that it sure isn't _obvious_. And
that whole EAGAIN return case is quite deep and special, so ...

Linus

PS: Oleg - the whole thread is on lkml. Ping me if you need more context.

Dave Jones
Dec 10, 2013, 8:30:01 PM
On Tue, Dec 10, 2013 at 11:55:06AM -0800, Linus Torvalds wrote:
> On Tue, Dec 10, 2013 at 11:18 AM, Thomas Gleixner <tg...@linutronix.de> wrote:
> >
> > So this is pretty unlikely. The retry requires:
> >
> > get_futex_value_locked() == EFAULT;
> >
> > Now we drop the hash bucket locks and do:
> >
> > get_user();
> >
> > And if that get_user() faults again, we bail out.
>
> I think you need to look closer.
>
> We have at least also that "futex_proxy_trylock_atomic() returns
> -EAGAIN" case. Which triggers at some exit condition. Another thread
> in the same group, perhaps never completing the exit because it's
> waiting for this one? I dunno, I didn't look any closer (but this does
> make me think "Hey, we should add Oleg to the Cc too", since
> PF_EXITING is involved).. So maybe there is some situation where that
> EAGAIN will keep happening, forever..
>
> Now, I'm *not* saying that that is it. It's quite possible/likely some
> other loop, but I do have to say that it sure isn't _obvious_. And
> that whole EAGAIN return case is quite deep and special, so ...
>
> Linus
>
> PS: Oleg - the whole thread is on lkml. Ping me if you need more context.

btw, I've left the machine in that state, and will for as long as necessary
in case someone has any ideas for further tracing experiments.

Dave

Thomas Gleixner
Dec 10, 2013, 8:40:02 PM
On Tue, 10 Dec 2013, Linus Torvalds wrote:

> On Tue, Dec 10, 2013 at 11:18 AM, Thomas Gleixner <tg...@linutronix.de> wrote:
> >
> > So this is pretty unlikely. The retry requires:
> >
> > get_futex_value_locked() == EFAULT;
> >
> > Now we drop the hash bucket locks and do:
> >
> > get_user();
> >
> > And if that get_user() faults again, we bail out.
>
> I think you need to look closer.
>
> We have at least also that "futex_proxy_trylock_atomic() returns
> -EAGAIN" case. Which triggers at some exit condition. Another thread
> in the same group, perhaps never completing the exit because it's
> waiting for this one? I dunno, I didn't look any closer (but this does
> make me think "Hey, we should add Oleg to the Cc too", since
> PF_EXITING is involved).. So maybe there is some situation where that
> EAGAIN will keep happening, forever..
>
> Now, I'm *not* saying that that is it. It's quite possible/likely some
> other loop, but I do have to say that it sure isn't _obvious_. And
> that whole EAGAIN return case is quite deep and special, so ...

The -EAGAIN is when the user value changed, simplified:

        oldval = *uval;
        sys_futex(....., oldval)
                do_futex(...) {
                        if (oldval != *uval)
                                return -EAGAIN;
                }
        sys_exit();

And user space loops on that.
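
A hypothetical userspace sketch of that pattern (made-up helper; only
the retry-on-EAGAIN shape matters):

    #include <errno.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* wait until *uaddr stops being 'val' */
    static void wait_on_word(int *uaddr, int val)
    {
        while (*uaddr == val) {
            if (syscall(SYS_futex, uaddr, FUTEX_WAIT, val, NULL, NULL, 0) == -1 &&
                errno != EAGAIN && errno != EINTR)
                break;  /* real error, give up */
            /* EAGAIN: the value changed under us; the loop re-reads it */
        }
    }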

Thanks,

tglx

Thomas Gleixner
Dec 10, 2013, 8:40:01 PM
On Tue, 10 Dec 2013, Dave Jones wrote:
> On Tue, Dec 10, 2013 at 11:55:06AM -0800, Linus Torvalds wrote:
> > On Tue, Dec 10, 2013 at 11:18 AM, Thomas Gleixner <tg...@linutronix.de> wrote:
> > >
> > > So this is pretty unlikely. The retry requires:
> > >
> > > get_futex_value_locked() == EFAULT;
> > >
> > > Now we drop the hash bucket locks and do:
> > >
> > > get_user();
> > >
> > > And if that get_user() faults again, we bail out.
> >
> > I think you need to look closer.
> >
> > We have at least also that "futex_proxy_trylock_atomic() returns
> > -EAGAIN" case. Which triggers at some exit condition. Another thread
> > in the same group, perhaps never completing the exit because it's
> > waiting for this one? I dunno, I didn't look any closer (but this does
> > make me think "Hey, we should add Oleg to the Cc too", since
> > PF_EXITING is involved).. So maybe there is some situation where that
> > EAGAIN will keep happening, forever..
> >
> > Now, I'm *not* saying that that is it. It's quite possible/likely some
> > other loop, but I do have to say that it sure isn't _obvious_. And
> > that whole EAGAIN return case is quite deep and special, so ...
> >
> > Linus
> >
> > PS: Oleg - the whole thread is on lkml. Ping me if you need more context.
>
> btw, I've left the machine in that state, and will for as long as necessary
> in case someone has any ideas for further tracing experiments.

Can you gather a trace with the function tracer? That will tell us
what the thing is actually doing.

Thanks,

tglx

Oleg Nesterov
Dec 10, 2013, 8:40:02 PM
Dave, I must have missed something, help.

I am looking at the first message and I can't understand who stuck
"at exit".

The trace shows that the task with pid=10818 called sys_futex() ?

Perhaps "exit" means the userspace paths?

Oleg.

Linus Torvalds
Dec 10, 2013, 8:50:02 PM
On Tue, Dec 10, 2013 at 12:33 PM, Thomas Gleixner <tg...@linutronix.de> wrote:
>
> The -EAGAIN is when the user value changed, simplified:

No it's not.

Thomas, stop this crap already. Look at the f*cking code carefully
instead of just dismissing cases.

The worrisome EAGAIN case is

    futex_requeue
      futex_proxy_trylock_atomic
        futex_lock_pi_atomic
          lookup_pi_state:
            ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN;

and now futex_requeue() will do "goto repeat" for that EAGAIN case.
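
In code, that's roughly this part of futex_requeue() (condensed from
the kernel source of that era, not verbatim):

    ret = futex_proxy_trylock_atomic(uaddr2, hb1, hb2, &key1,
                                     &key2, &pi_state, nr_requeue);
    switch (ret) {
    ...
    case -EAGAIN:
        /* The owner was exiting, try again. */
        double_unlock_hb(hb1, hb2);
        put_futex_key(&key2);
        put_futex_key(&key1);
        cond_resched();
        goto retry;         /* nothing here bounds how often we come back */
    ...
    }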

So, Christ, Thomas, you have now *twice* dismissed a real concern with
totally bogus "that can never happen" by explaining some totally
unrelated *simple* case rather than the much more complex case.

So please. Really. Truly look at the code and think about it, or shut
the f*ck up. No more of this shit where you glance at the code, find
some simple case, and say "that can't happen", and dismiss the
bug-report.

So far Dave's bug-reports have generally pretty much universally shown
real bugs. Being dismissive about it is not helpful, quite the
reverse.

Maybe the loop I'm pointing at cannot happen, but *your* explanation
for why it couldn't happen was pure and utter garbage, and was clearly
because you hadn't even bothered to look at all the cases.

Linus

Dave Jones
Dec 10, 2013, 9:00:01 PM
On Tue, Dec 10, 2013 at 09:35:59PM +0100, Oleg Nesterov wrote:
> Dave, I must have missed something, help.
>
> I am looking at the first message and I can't understand who stuck
> "at exit".
>
> The trace shows that the task with pid=10818 called sys_futex() ?
>
> Perhaps "exit" means the userspace paths?

pid 1131 is wait()'ing for 10818 to exit

pid 1130 is periodically sending SIGKILL to 10818 because it's gotten
tired of waiting. 10818 is ignoring these because it's stuck in a loop
somewhere in the kernel.

I tried attaching to 10818 with gdb, and it just hangs.
(possibly because its weird stack situation [see 1st post])

by inspecting the shared mapping that all processes have (by gdb'ing 1130)
I can see that 10818 did its full run without incident, and the
"exit child" flag in the fuzzer had been set.

The last 'random syscall' the fuzzer did was to sys_accept4, so the futex call
must come from somewhere in libc maybe ?

Dave

Dave Jones
Dec 10, 2013, 9:00:01 PM
On Tue, Dec 10, 2013 at 09:34:24PM +0100, Thomas Gleixner wrote:

> > > PS: Oleg - the whole thread is on lkml. Ping me if you need more context.
> >
> > btw, I've left the machine in that state, and will for as long as necessary
> > in case someone has any ideas for further tracing experiments.
>
> Can you gather a trace with the function tracer? That will tell us
> what the thing is actually doing.

Feed me command lines, and I'll see what I can do.

Dave

Darren Hart
Dec 10, 2013, 9:00:01 PM
Hi Dave,

Can you get us an idea of the arguments trinity is tossing into
SYS_futex?

Op code? Would help to know if this was requeue_pi for example.
Type of memory being used for the uaddr?

I see futex_requeue in the stack, which means the opcode is one of:

FUTEX_REQUEUE
FUTEX_CMP_REQUEUE
FUTEX_CMP_REQUEUE_PI

FUTEX_REQUEUE has a known issue and was replaced with FUTEX_CMP_REQUEUE,
for details, test cases, and an analysis see the historic tree:

commit 9b91d73bde9d68800f9e5c338c0cf9d0fe3bc862
Author: Andrew Morton <ak...@osdl.org>
Date: 2004-05-31

[PATCH] Add FUTEX_CMP_REQUEUE futex op

Specifically:
http://listman.redhat.com/archives/phil-list/2004-May/msg00023.html


Trinity is going to trigger hangs in futexes just by its very nature,
but I believe you have watchdogs in place to kill such malformed tests
after a timeout?

I'll keep digging.

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel

Darren Hart
Dec 10, 2013, 9:10:02 PM
On Tue, 2013-12-10 at 15:49 -0500, Dave Jones wrote:
> On Tue, Dec 10, 2013 at 09:35:59PM +0100, Oleg Nesterov wrote:
> > Dave, I must have missed something, help.
> >
> > I am looking at the first message and I can't understand who stuck
> > "at exit".
> >
> > The trace shows that the task with pid=10818 called sys_futex() ?
> >
> > Perhaps "exit" means the userspace paths?
>
> pid 1131 is wait()'ing for 10818 to exit
>
> pid 1130 is periodically sending SIGKILL to 10818 because it's gotten
> tired of waiting. 10818 is ignoring these because it's stuck in a loop
> somewhere in the kernel.
>
> I tried attaching to 10818 with gdb, and it just hangs.
> (possibly because its weird stack situation [see 1st post])
>
> by inspecting the shared mapping that all processes have (by gdb'ing 1130)
> > I can see that 10818 did its full run without incident, and the
> > "exit child" flag in the fuzzer had been set.
>
> The last 'random syscall' the fuzzer did was to sys_accept4, so the futex call
> must come from somewhere in libc maybe ?

If that is the case, then Linus' requeue_pi path is highly unlikely as
FUTEX_CMP_REQUEUE_PI is not used by glibc (yet). That gives me hope as
that way there be dragons. Knowing exactly what syscall was made would
be very useful, but I don't know if that information is even available
anymore.

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel


Dave Jones
Dec 10, 2013, 9:20:01 PM
On Tue, Dec 10, 2013 at 12:57:57PM -0800, Darren Hart wrote:


> > > Call Trace:
> > > [<ffffffff817587a0>] ? retint_restore_args+0xe/0xe
> > > [<ffffffff8132af0e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> > > [<ffffffff8100b184>] ? native_sched_clock+0x24/0x80
> > > [<ffffffff8109624f>] ? local_clock+0xf/0x50
> > > [<ffffffff810aa27e>] ? put_lock_stats.isra.28+0xe/0x30
> > > [<ffffffff8103edd0>] ? gup_pud_range+0x170/0x190
> > > [<ffffffff8103f0d5>] ? get_user_pages_fast+0x1a5/0x1c0
> > > [<ffffffff810ad1f5>] ? trace_hardirqs_on_caller+0x115/0x1e0
> > > [<ffffffff810a8a2f>] ? up_read+0x1f/0x40
> > > [<ffffffff8103f0d5>] ? get_user_pages_fast+0x1a5/0x1c0
> > > [<ffffffff8115f76c>] ? put_page+0x3c/0x50
> > > [<ffffffff810dd525>] ? get_futex_key+0xd5/0x2c0
> > > [<ffffffff810df18a>] ? futex_requeue+0xfa/0x9c0
> > > [<ffffffff810e019e>] ? do_futex+0xae/0xc80
> > > [<ffffffff810aa27e>] ? put_lock_stats.isra.28+0xe/0x30
> > > [<ffffffff810aa7de>] ? lock_release_holdtime.part.29+0xee/0x170
> > > [<ffffffff8114f16e>] ? context_tracking_user_exit+0x4e/0x190
> > > [<ffffffff810ad1f5>] ? trace_hardirqs_on_caller+0x115/0x1e0
> > > [<ffffffff810e0de1>] ? SyS_futex+0x71/0x150
> > > [<ffffffff81010a45>] ? syscall_trace_enter+0x145/0x2a0
> > > [<ffffffff81760be4>] ? tracesys+0xdd/0xe2
> > >
>
> Can you get us an idea of the arguments trinity is tossing into
> SYS_futex?
>
> Op code? Would help to know if this was requeue_pi for example.
> Type of memory being used for the uaddr?

As is always the case, the interesting bugs only seem to happen
when I have logging disabled. So other than what I can glean from what's
left in the shm, no idea.

One of the other child processes (which exited already) did do a sys_futex.
The params it passed were..

1cb5000, -1, c57, 1cb5004, ffffffffffd8f420, 90000000091a6311

The result of this syscall was -1

> I see futex_requeue in the stack, which means the opcode is one of:
>
> FUTEX_REQUEUE
> FUTEX_CMP_REQUEUE
> FUTEX_CMP_REQUEUE_PI
>
> FUTEX_REQUEUE has a known issue and was replaced with FUTEX_CMP_REQUEUE,
> for details, test cases, and an analysis see the historic tree:
>
> commit 9b91d73bde9d68800f9e5c338c0cf9d0fe3bc862
> Author: Andrew Morton <ak...@osdl.org>
> Date: 2004-05-31
>
> [PATCH] Add FUTEX_CMP_REQUEUE futex op
>
> Specifically:
> http://listman.redhat.com/archives/phil-list/2004-May/msg00023.html
>
>
> Trinity is going to trigger hangs in futexes just by its very nature,
> but I believe you have watchdogs in place to kill such malformed tests
> after a timeout?

It should. Though that pid is happily ignoring the SIGKILL's the watchdog
is continuing to send, because it's never getting around to processing the
signals apparently.

Dave

Thomas Gleixner
Dec 10, 2013, 9:20:01 PM
On Tue, 10 Dec 2013, Linus Torvalds wrote:

> On Tue, Dec 10, 2013 at 12:33 PM, Thomas Gleixner <tg...@linutronix.de> wrote:
> >
> > The -EAGAIN is when the user value changed, simplified:
>
> No it's not.
>
> Thomas, stop this crap already. Look at the f*cking code carefully
> instead of just dismissing cases.
>
> The worrisome EAGAIN case is
>
>     futex_requeue
>       futex_proxy_trylock_atomic
>         futex_lock_pi_atomic
>           lookup_pi_state:
>             ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN;
>
> and now futex_requeue() will do "goto repeat" for that EAGAIN case.
>
> So, Christ, Thomas, you have now *twice* dismissed a real concern with
> totally bogus "that can never happen" by explaining some totally
> unrelated *simple* case rather than the much more complex case.
>
> So please. Really. Truly look at the code and think about it, or shut
> the f*ck up. No more of this shit where you glance at the code, find
> some simple case, and say "that can't happen", and dismiss the
> bug-report.

Well, I spent a fricking long time working on that code and finding the
absurdest bugs in it, and I'm well aware of the exit issue.

> So far Dave's bug-reports have generally pretty much universally shown
> real bugs. Being dismissive about it is not helpful, quite the
> reverse.

I know and I used that information more than once to carefully track
down the real reason. I never dismissed a single report.

> Maybe the loop I'm pointing at cannot happen, but *your* explanation
> for why it couldn't happen was pure and utter garbage, and was clearly
> because you hadn't even bothered to look at all the cases.

I might have been sloppy and not really explained why I think that
the requeue_pi exit case is unlikely.

To make that loop happen it requires the following:

1) An actual user of requeue_pi, which can only be the fuzzer as glibc
does not support it. But Dave's last fuzzed syscall was something
else.

2) The report said that the last fuzzing test was already done and
the fuzzer app was about to exit.

That involves futex_requeue from deep inside the glibc thread
library. And that is NOT using REQUEUE_PI.

3) So now even IF it used requeue_pi, this would require that
the outer PI lock is already held by some other task which is
already in the process of exiting. IOW, this lock would be held by
something which did not release it before exit and already set
PF_EXITING in exit_signals() and then got stuck forever before
reaching: tsk->flags |= PF_EXITPIDONE;
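
For reference, the check in lookup_pi_state() that hands back that
-EAGAIN, condensed from the kernel source of that era (not verbatim):

    raw_spin_lock_irq(&p->pi_lock);
    if (unlikely(p->flags & PF_EXITING)) {
        /*
         * The owner is on its way out. Only once PF_EXITPIDONE is set
         * do we know its PI state cleanup has finished.
         */
        int ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN;

        raw_spin_unlock_irq(&p->pi_lock);
        put_task_struct(p);
        return ret;
    }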

I still think that it is highly unlikely, but to make sure I had already
asked Dave, before reading your rant, to fire up the tracer, so we know
for sure where the hell the thing is looping.

Thanks,

tglx

Dave Jones
Dec 10, 2013, 9:20:02 PM
On Tue, Dec 10, 2013 at 01:06:23PM -0800, Darren Hart wrote:
> On Tue, 2013-12-10 at 15:49 -0500, Dave Jones wrote:
> > On Tue, Dec 10, 2013 at 09:35:59PM +0100, Oleg Nesterov wrote:
> > > Dave, I must have missed something, help.
> > >
> > > I am looking at the first message and I can't understand who stuck
> > > "at exit".
> > >
> > > The trace shows that the task with pid=10818 called sys_futex() ?
> > >
> > > Perhaps "exit" means the userspace paths?
> >
> > pid 1131 is wait()'ing for 10818 to exit
> >
> > pid 1130 is periodically sending SIGKILL to 10818 because it's gotten
> > tired of waiting. 10818 is ignoring these because it's stuck in a loop
> > somewhere in the kernel.
> >
> > I tried attaching to 10818 with gdb, and it just hangs.
> > (possibly because its weird stack situation [see 1st post])
> >
> > by inspecting the shared mapping that all processes have (by gdb'ing 1130)
> > > I can see that 10818 did its full run without incident, and the
> > > "exit child" flag in the fuzzer had been set.
> >
> > The last 'random syscall' the fuzzer did was to sys_accept4, so the futex call
> > must come from somewhere in libc maybe ?
>
> If that is the case, then Linus' requeue_pi path is highly unlikely as
> FUTEX_CMP_REQUEUE_PI is not used by glibc (yet). That gives me hope as
> that way there be dragons. Knowing exactly what syscall was made would
> be very useful, but I don't know if that information is even available
> anymore.

So the last syscall _that_ 'stuck' thread did was accept4, but I found
another pid that had done a futex call just before it exited.
(see other mail)

What I don't understand is how the running child has futex as part of
its stack trace, when the internal log that trinity keeps has no record
of that particular pid having called it.

Dave

Linus Torvalds
Dec 10, 2013, 9:20:01 PM
On Tue, Dec 10, 2013 at 1:06 PM, Darren Hart <dvh...@linux.intel.com> wrote:
>
> . Knowing exactly what syscall was made would
> be very useful, but I don't know if that information is even available
> anymore.

Well, the loop should be visible in the profile, since it's still active.

So doing something like

perf record -e cycles:pp -g -a sleep 60

to get a good profile for a minute, and then looking at the
instruction-level profiles in futex_requeue() should be possible.

Of course, inlining etc often makes those rather hard to see, and the
bulk of the profile is clearly all about the lock debugging overhead,
but if the loop is in futex_requeue() then that *should* be visible in
the profile, even if it may not be anywhere near the top..

Dave?

Linus

Darren Hart
Dec 10, 2013, 9:30:02 PM
On Tue, 2013-12-10 at 15:55 -0500, Dave Jones wrote:
> On Tue, Dec 10, 2013 at 09:34:24PM +0100, Thomas Gleixner wrote:
>
> > > > PS: Oleg - the whole thread is on lkml. Ping me if you need more context.
> > >
> > > btw, I've left the machine in that state, and will for as long as necessary
> > > in case someone has any ideas for further tracing experiments.
> >
> > Can you gather a trace with the function tracer? That will tell us
> > what the thing is actually doing.
>
> Feed me command lines, and I'll see what I can do.
>
> Dave
>

So let's see if we can determine the call chain inside futex_requeue, if
there is one. We want to see if we are calling the following functions:

futex.c: free_pi_state
futex.c: get_futex_value_locked
futex.c: futex_proxy_trylock_atomic
futex.c: lookup_pi_state
futex.c: fault_in_user_writeable

I'm hoping we can make use of ftrace here?

Consider:

$ trace-cmd record -P <PID> -p function -l '*futex*'

And to try outside of futex:

$ trace-cmd record -P <PID> -p function

Then ^C after a few seconds. The trace will be in trace.dat



--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel


Linus Torvalds
Dec 10, 2013, 9:30:03 PM
On Tue, Dec 10, 2013 at 1:18 PM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
>
> to get a good profile for a minute, and then looking at the
> instruction-level profiles in futex_requeue() should be possible.

Ugh. Looking at kernel/futex.s even *without* debugging enabled is
pretty messy. Although much of it seems to be because the hash is
actually fairly complex. jhash2() really ends up expanding to a lot of
instructions.

But especially if that kernel was compiled with debug information,
then 'perf report' is pretty good about matching it up with the source
code, so it might not be *too* hard to try to figure out where the
loop is.

And if it's not in futex_requeue() (there are loops at other levels,
like get_futex_key()), then having the callchain information for the
costly operations will hopefully give a hint where the top loop is. So
profiling data can be very powerful for things like this.

Thomas Gleixner
Dec 10, 2013, 9:30:04 PM
On Tue, 10 Dec 2013, Darren Hart wrote:
> On Tue, 2013-12-10 at 15:55 -0500, Dave Jones wrote:
> > On Tue, Dec 10, 2013 at 09:34:24PM +0100, Thomas Gleixner wrote:
> >
> > > > > PS: Oleg - the whole thread is on lkml. Ping me if you need more context.
> > > >
> > > > btw, I've left the machine in that state, and will for as long as necessary
> > > > in case someone has any ideas for further tracing experiments.
> > >
> > > Can you gather a trace with the function tracer? That will tell us
> > > what the thing is actually doing.
> >
> > Feed me command lines, and I'll see what I can do.
> >
> > Dave
> >
>
> So let's see if we can determine the call chain inside futex_requeue, if
> there is one. We want to see if we are calling the following functions:
>
> futex.c: free_pi_state
> futex.c: get_futex_value_locked
> futex.c: futex_proxy_trylock_atomic
> futex.c: lookup_pi_state
> futex.c: fault_in_user_writeable
>
> I'm hoping we can make use of ftrace here?
>
> Consider:
>
> $ trace-cmd record -P <PID> -p function -l '*futex*'
>
> And to try outside of futex:
>
> $ trace-cmd record -P <PID> -p function
>
> Then ^C after a few seconds. The trace will be in trace.dat

We definitely want the latter. The futex code gets compiled into some
badly traceable mess.

Thanks,

tglx

Dave Jones
Dec 10, 2013, 9:40:01 PM
On Tue, Dec 10, 2013 at 01:18:20PM -0800, Linus Torvalds wrote:
> On Tue, Dec 10, 2013 at 1:06 PM, Darren Hart <dvh...@linux.intel.com> wrote:
> >
> > . Knowing exactly what syscall was made would
> > be very useful, but I don't know if that information is even available
> > anymore.
>
> Well, the loop should be visible in the profile, since it's still active.
>
> So doing something like
>
> perf record -e cycles:pp -g -a sleep 60
>
> to get a good profile for a minute, and then looking at the
> instruction-level profiles in futex_requeue() should be possible.
>
> Of course, inlining etc often makes those rather hard to see, and the
> bulk of the profile is clearly all about the lock debugging overhead,
> but if the loop is in futex_requeue() then that *should* be visible in
> the profile, even if it may not be anywhere near the top..
>
> Dave?

http://www.codemonkey.org.uk/junk/perf.data.xz

Dave

Steven Rostedt
Dec 10, 2013, 9:40:02 PM
On Tue, 10 Dec 2013 22:28:01 +0100 (CET)
Thomas Gleixner <tg...@linutronix.de> wrote:

> On Tue, 10 Dec 2013, Darren Hart wrote:
> > On Tue, 2013-12-10 at 15:55 -0500, Dave Jones wrote:
> > > On Tue, Dec 10, 2013 at 09:34:24PM +0100, Thomas Gleixner wrote:
> > >
> > > > > > PS: Oleg - the whole thread is on lkml. Ping me if you need more context.
> > > > >
> > > > > btw, I've left the machine in that state, and will for as long as necessary
> > > > > in case someone has any ideas for further tracing experiments.
> > > >
> > > > Can you gather a trace with the function tracer? That will tell us
> > > > what the thing is actually doing.
> > >
> > > Feed me command lines, and I'll see what I can do.
> > >
> > > Dave
> > >
> >
> > So let's see if we can determine the call chain inside futex_requeue, if
> > there is one. We want to see if we are calling the following functions:
> >
> > futex.c: free_pi_state
> > futex.c: get_futex_value_locked
> > futex.c: futex_proxy_trylock_atomic
> > futex.c: lookup_pi_state
> > futex.c: fault_in_user_writeable
> >
> > I'm hoping we can make use of ftrace here?
> >
> > Consider:
> >
> > $ trace-cmd record -P <PID> -p function -l '*futex*'

Note the above just picks functions that have "futex" in their names, and
will not include all the functions you listed above.

> >
> > And to try outside of futex:
> >
> > $ trace-cmd record -P <PID> -p function
> >
> > Then ^C after a few seconds. The trace will be in trace.dat
>
> We definitely want the latter. The futex code gets compiled into some
> badly traceable mess.

Yeah, let's use all functions. If it's in a loop, we should see exactly
what is happening, with full function tracing.

You may be able to simplify it to just:

trace-cmd record -p function sleep 10

and then read the trace. Nasty loopers usually dominate these types of
traces so no need to filter on them in the recording.

-- Steve

Oleg Nesterov
Dec 10, 2013, 9:40:03 PM
On 12/10, Dave Jones wrote:
>
> On Tue, Dec 10, 2013 at 09:35:59PM +0100, Oleg Nesterov wrote:
> >
> > I am looking at the first message and I can't understand who stuck
> > "at exit".
> >
> > The trace shows that the task with pid=10818 called sys_futex() ?
> >
> > Perhaps "exit" means the userspace paths?
>
> pid 1131 is wait()'ing for 10818 to exit
>
> pid 1130 is periodically sending SIGKILL to 10818 because it's gotten
> tired of waiting. 10818 is ignoring these because it's stuck in a loop
> somewhere in the kernel.

OK, thanks. So it doesn't return to user-space.

could you do

cd /sys/kernel/debug/tracing/
echo 10818 >> set_ftrace_pid
echo function_graph >> current_tracer
echo 1 >> tracing_on

and look into "trace" file to find out how exactly it loops?

> I tried attaching to 10818 with gdb, and it just hangs.

This is clear, 10818 obviously can't stop.

Oleg.

Dave Jones
Dec 10, 2013, 9:50:02 PM
On Tue, Dec 10, 2013 at 10:34:31PM +0100, Oleg Nesterov wrote:
> On 12/10, Dave Jones wrote:
> >
> > On Tue, Dec 10, 2013 at 09:35:59PM +0100, Oleg Nesterov wrote:
> > >
> > > I am looking at the first message and I can't understand who stuck
> > > "at exit".
> > >
> > > The trace shows that the task with pid=10818 called sys_futex() ?
> > >
> > > Perhaps "exit" means the userspace paths?
> >
> > pid 1131 is wait()'ing for 10818 to exit
> >
> > pid 1130 is periodically sending SIGKILL to 10818 because it's gotten
> > tired of waiting. 10818 is ignoring these because it's stuck in a loop
> > somewhere in the kernel.
>
> OK, thanks. So it doesn't return to user-space.
>
> could you do
>
> cd /sys/kernel/debug/tracing/
> echo 10818 >> set_ftrace_pid
> echo function_graph >> current_tracer
> echo 1 >> tracing_on
>
> and look into "trace" file to find out how exactly it loops?

http://codemonkey.org.uk/junk/trace

Dave

Linus Torvalds
Dec 10, 2013, 9:50:02 PM
On Tue, Dec 10, 2013 at 1:32 PM, Dave Jones <da...@redhat.com> wrote:
>
> http://www.codemonkey.org.uk/junk/perf.data.xz

"Forbidden

You don't have permission to access /junk/perf.data.xz on this server."

also, we'd need the vmlinux file to actually decode the data, I think.

Linus

Dave Jones
Dec 10, 2013, 10:00:02 PM
On Tue, Dec 10, 2013 at 01:49:19PM -0800, Linus Torvalds wrote:
> On Tue, Dec 10, 2013 at 1:32 PM, Dave Jones <da...@redhat.com> wrote:
> >
> > http://www.codemonkey.org.uk/junk/perf.data.xz
>
> "Forbidden
>
> You don't have permission to access /junk/perf.data.xz on this server."

fixed.

> also, we'd need the vmlinux file to actually decode the data, I think.

Hmm, the only vmlinux I have still around is newer than the running kernel,
so that's not going to be much help.

Dave

Linus Torvalds
Dec 10, 2013, 10:00:02 PM
On Tue, Dec 10, 2013 at 1:41 PM, Dave Jones <da...@redhat.com> wrote:
>
> http://codemonkey.org.uk/junk/trace

Hmm. Ok, so something is calling [__]get_user_pages_fast() and
put_page() in a loop, but the trace doesn't show what that "something"
is, because it is itself not ever called.

However, that pattern does seem to imply that the loop is in
get_futex_key(), because all the other loops I see seem to be calling
other things as well.

And the __get_user_pages_fast() call implies that it's the THP case
that triggers the "unlikely(PageTail(page))" case. And anyway,
otherwise we'd see lock_page()/unlock_page() too.

So it looks like __get_user_pages_fast() fails, and keeps failing.
Andrea, this is your code, any ideas? Commit a5b338f2b0b1f ("thp:
update futex compound knowledge") to be exact.
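
For anyone following along without the tree handy, the suspect loop
looks roughly like this (condensed from that era's get_futex_key(),
not verbatim):

    again:
        err = get_user_pages_fast(address, 1, 1, &page);
        /*
         * If write access is not required (eg. FUTEX_WAIT), try
         * and get read-only access.
         */
        if (err == -EFAULT && rw == VERIFY_READ) {
            err = get_user_pages_fast(address, 1, 0, &page);
            ro = 1;
        }
        if (err < 0)
            return err;

        if (unlikely(PageTail(page))) {
            put_page(page);
            /* serialize against __split_huge_page_splitting() */
            local_irq_disable();
            if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) {
                page_head = compound_head(page);
                ...
            } else {
                local_irq_enable();
                goto again;     /* never terminates if this keeps failing */
            }
        }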

Linus

Linus Torvalds
Dec 10, 2013, 10:00:02 PM
On Tue, Dec 10, 2013 at 1:56 PM, Dave Jones <da...@redhat.com> wrote:
>
> Hmm, the only vmlinux I have still around is newer than the running kernel,
> so that's not going to be much help.

Ok, /proc/kallsyms would do it, but never mind. I think you already
pinpointed where the loop is with the trace file, so no need for
trying to decode the profile.

Linus

Dave Jones
Dec 10, 2013, 10:10:02 PM
On Tue, Dec 10, 2013 at 01:57:49PM -0800, Linus Torvalds wrote:
> On Tue, Dec 10, 2013 at 1:41 PM, Dave Jones <da...@redhat.com> wrote:
> >
> > http://codemonkey.org.uk/junk/trace
>
> Hmm. Ok, so something is calling [__]get_user_pages_fast() and
> put_page() in a loop, but the trace doesn't show what that "something"
> is, because it is itself not ever called.
>
> However, that pattern does seem to imply that the loop is in
> get_futex_key(), because all the other loops I see seem to be calling
> other things as well.
>
> And the __get_user_pages_fast() call implies that it's the THP case
> that triggers the "unlikely(PageTail(page))" case. And anyway,
> otherwise we'd see lock_page()/unlock_page() too.
>
> So it looks like __get_user_pages_fast() fails, and keeps failing.
> Andrea, this is your code, any ideas? Commit a5b338f2b0b1f ("thp:
> update futex compound knowledge") to be exact.

So, a reason that this might only be showing up now, is that in the last
week I added support to trinity to explicitly do huge page mmaps in the children,
whereas before it only ever did that with MAP_SHARED in the main pid, and then
every child inherited them on fork().

Dave

Steven Rostedt
Dec 10, 2013, 10:10:02 PM
On Tue, Dec 10, 2013 at 04:41:43PM -0500, Dave Jones wrote:
> >
> > OK, thanks. So it doesn't return to user-space.
> >
> > could you do
> >
> > cd /sys/kernel/debug/tracing/
> > echo 10818 >> set_ftrace_pid
> > echo function_graph >> current_tracer
> > echo 1 >> tracing_on
> >
> > and look into "trace" file to find out how exactly it loops?
>
> http://codemonkey.org.uk/junk/trace

Because we are already in the function that is looping, we don't see what
that function is (it's never called).

So you can do either:

trace-cmd record -p function -l get_user_pages_fast --func-stack sleep 5

Which will trace the get_user_pages_fast and spit out a full call trace.

Or if you don't want to use trace-cmd, you can do it by hand.
But be warned! If you don't do this right, you can live lock the system.
Or make it extremely slow. That is, you must have a filter on the
functions you trace before you set the function stack trace flag.
(/me needs to prevent that from happening)

cd /sys/kernel/debug/tracing
echo get_user_pages_fast > set_ftrace_filter
cat set_ftrace_filter
# make sure get_user_pages_fast is there
echo function > current_tracer
echo 1 > options/func_stack_trace

read your trace. Either by:

cat trace

or

trace-cmd show

And then after you recorded that.

echo 0 > options/func_stack_trace

to make sure you don't accidentally enable stack tracing on *all*
functions. I haven't had that really live lock the system, but
it took about two minutes to disable it again, as each key stroke
took several seconds to complete.

-- Steve

Dave Jones
Dec 10, 2013, 10:10:03 PM
On Tue, Dec 10, 2013 at 01:59:30PM -0800, Linus Torvalds wrote:
> On Tue, Dec 10, 2013 at 1:56 PM, Dave Jones <da...@redhat.com> wrote:
> >
> > Hmm, the only vmlinux I have still around is newer than the running kernel,
> > so that's not going to be much help.
>
> Ok, /proc/kallsyms would do it, but never mind. I think you already
> pinpointed where the loop is with the trace file, so no need for
> trying to decode the profile.

just for completeness in case someone else is following along and prefers to try that:
http://www.codemonkey.org.uk/junk/syms

Dave

Oleg Nesterov
Dec 10, 2013, 10:10:04 PM
On 12/10, Linus Torvalds wrote:
>
> So it looks like __get_user_pages_fast() fails, and keeps failing.

And "again:" does get_user_pages_fast().

And according to this trace get_user_pages_fast() takes mmap_sem
and calls __get_user_pages().

But __get_user_pages() should fail even before follow_page_mask()
due to fatal_signal_pending(); in that case get_futex_key() should
return the error. Unless is_vm_hugetlb_page(vma).

So it seems that this is not THP but VM_HUGETLB and
follow_hugetlb_page() succeeds every time?
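
That reasoning maps to roughly this part of __get_user_pages()
(simplified from the mm/memory.c of that era, not verbatim):

    do {
        struct vm_area_struct *vma = find_extend_vma(mm, start);
        ...
        if (is_vm_hugetlb_page(vma)) {
            /* hugetlbfs VMAs take this branch and never reach
             * the signal check below */
            i = follow_hugetlb_page(mm, vma, pages, vmas,
                                    &start, &nr_pages, i, gup_flags);
            continue;
        }
        do {
            ...
            if (unlikely(fatal_signal_pending(current)))
                return i ? i : -ERESTARTSYS;
            page = follow_page_mask(vma, start, foll_flags, &page_mask);
            ...
        } while (nr_pages && start < vma->vm_end);
    } while (nr_pages);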

Oleg.

Linus Torvalds
Dec 10, 2013, 10:20:02 PM
On Tue, Dec 10, 2013 at 1:57 PM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
>
> So it looks like __get_user_pages_fast() fails, and keeps failing.

Hmm.. Is any of the addresses unchecked, perhaps?
__get_user_pages_fast() does an access_ok() check, while
get_user_pages_fast() does *not* seem to do one.

That looks a bit dangerous. Yeah, users should have checked the
address range, but there really is no reason not to do it in
get_user_pages_fast().

And it looks like the futex code is actually seriously buggered. It
only does the access_ok() check for the non-shared case.

Why?

Shouldn't we do something like the attached?

Linus
[Attachment: patch.diff]

Dave Jones
Dec 10, 2013, 10:20:02 PM
On Tue, Dec 10, 2013 at 05:09:46PM -0500, Steven Rostedt wrote:

> So you can do either:
>
> trace-cmd record -p function -l get_user_pages_fast --func-stack sleep 5
>
> Which will trace the get_user_pages_fast and spit out a full call trace.
This gives me a 100M trace.dat, but trace-cmd show shows..

# tracer: nop
#
# entries-in-buffer/entries-written: 0/0 #P:4
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |

and that's it.

> Or if you don't want to use trace-cmd, you can do it by hand.

Given the state the system is in, I don't want to perturb it too much.

Dave

Steven Rostedt
Dec 10, 2013, 10:30:02 PM
On Tue, Dec 10, 2013 at 05:16:21PM -0500, Dave Jones wrote:
> On Tue, Dec 10, 2013 at 05:09:46PM -0500, Steven Rostedt wrote:
>
> > So you can do either:
> >
> > trace-cmd record -p function -l get_user_pages_fast --func-stack sleep 5
> >
> > Which will trace the get_user_pages_fast and spit out a full call trace.
> This gives me a 100M trace.dat, but trace show shows..
>
> # tracer: nop
> #
> # entries-in-buffer/entries-written: 0/0 #P:4
> #
> # _-----=> irqs-off
> # / _----=> need-resched
> # | / _---=> hardirq/softirq
> # || / _--=> preempt-depth
> # ||| / delay
> # TASK-PID CPU# |||| TIMESTAMP FUNCTION
> # | | | |||| | |
>
> and that's it.

Oh, if you use trace-cmd record, you need to do trace-cmd report.

trace-cmd show shows you what's in the trace file, which was for the
"manual" version.

Sorry for the confusion.

-- Steve


Dave Jones
Dec 10, 2013, 10:30:03 PM
On Tue, Dec 10, 2013 at 05:21:50PM -0500, Steven Rostedt wrote:

> The trace-cmd show, shows you what's in the trace file, which was for the
> "manual" version.
>
> Sorry for the confusion.

ah, thanks.

That shows..

version = 6
CPU 0 is empty
CPU 2 is empty
CPU 3 is empty
cpus=4
trinity-child27-10818 [001] 89790.703544: kernel_stack: <stack trace>
=> futex_requeue (ffffffff810df18a)
=> do_futex (ffffffff810e019e)
=> SyS_futex (ffffffff810e0de1)
=> tracesys (ffffffff81760be4)
trinity-child27-10818 [001] 89790.703545: function: get_user_pages_fast
trinity-child27-10818 [001] 89790.703547: kernel_stack: <stack trace>
=> futex_requeue (ffffffff810df18a)
=> do_futex (ffffffff810e019e)
=> SyS_futex (ffffffff810e0de1)
=> tracesys (ffffffff81760be4)
trinity-child27-10818 [001] 89790.703547: function: get_user_pages_fast
trinity-child27-10818 [001] 89790.703549: kernel_stack: <stack trace>
=> futex_requeue (ffffffff810df18a)
=> do_futex (ffffffff810e019e)
=> SyS_futex (ffffffff810e0de1)
=> tracesys (ffffffff81760be4)
trinity-child27-10818 [001] 89790.703550: function: get_user_pages_fast
trinity-child27-10818 [001] 89790.703552: kernel_stack: <stack trace>
=> futex_requeue (ffffffff810df18a)
=> do_futex (ffffffff810e019e)
=> SyS_futex (ffffffff810e0de1)
=> tracesys (ffffffff81760be4)


over and over and..

Dave

Darren Hart
Dec 10, 2013, 10:40:01 PM
On Tue, 2013-12-10 at 14:33 -0800, Linus Torvalds wrote:
> On Tue, Dec 10, 2013 at 2:19 PM, Linus Torvalds
> <torv...@linux-foundation.org> wrote:
> >
> > Shouldn't we do something like the attached?
>
> So I think that kernel/futex.c part of the patch might be a good idea,
> but on x86-64 (which is what Dave is running), the
>
> if (end >> __VIRTUAL_MASK_SHIFT)
>
> test in get_user_pages_fast() should have been equivalent. And even on
> 32-bit, we do check the _PAGE_USER bits in the page tables, so I guess
> it's all good on a get_user_pages_fast() side.
>
> So never mind. It's not the address checking.
>
> And I think I see what's up.
>
> I think what happens is:
> - get_user_pages_fast(address, 1, 1, &page) fails (because it's read-only)
> - get_user_pages_fast(address, 1, 0, &page) succeeds and gets a large-page
> - __get_user_pages_fast(address, 1, 1, &page) fails (because it's read-only).
>
> so what triggers this is likely that Dave now does large-pages, and
> one of them is a read-only mapping.
>
> So I would suggest replacing the second "1" in the
> __get_user_pages_fast() call with a "!ro" instead. So how about this
> second patch instead (the access_ok() move remains).
>
> Comments?
>

You're too fast for me, but I'm trying to keep up.

It was my understanding that access_ok() was an optimization for private
futexes, but something more heavy weight was required for shared. I
believe that was find_vma() in the past, but Peter Z changed that with
fast gup. Trying to page that all in now and get the exact details....

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel


Linus Torvalds
Dec 10, 2013, 10:40:03 PM
On Tue, Dec 10, 2013 at 2:19 PM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
>
> Shouldn't we do something like the attached?

So I think that kernel/futex.c part of the patch might be a good idea,
but on x86-64 (which is what Dave is running), the

if (end >> __VIRTUAL_MASK_SHIFT)

test in get_user_pages_fast() should have been equivalent. And even on
32-bit, we do check the _PAGE_USER bits in the page tables, so I guess
it's all good on a get_user_pages_fast() side.

So never mind. It's not the address checking.

And I think I see what's up.

I think what happens is:
- get_user_pages_fast(address, 1, 1, &page) fails (because it's read-only)
- get_user_pages_fast(address, 1, 0, &page) succeeds and gets a large-page
- __get_user_pages_fast(address, 1, 1, &page) fails (because it's read-only).

so what triggers this is likely that Dave now does large-pages, and
one of them is a read-only mapping.

So I would suggest replacing the second "1" in the
__get_user_pages_fast() call with a "!ro" instead. So how about this
second patch instead (the access_ok() move remains).

Comments?

Linus
[Attachment: patch.diff]

Thomas Gleixner
Dec 10, 2013, 10:50:02 PM
The !fshared case is the fast path which does not even reach
get_user_pages_fast().

We had this discussion some time ago already, where the access_ok()
check was missing in the !fshared case or the check was buggered for
some reason. Need to dig up the gory details.

And yes, I remember that we do not do an extra check for the fshared
case, because get_user_pages_fast() should do it for us already. If
not we are fubared not only in the futex code.

But there is a subtle detail:

err = get_user_pages_fast(address, 1, 1, &page);

So we ask for write access as the write argument is 1. In case that
fails we have that fallback path:

        /*
         * If write access is not required (eg. FUTEX_WAIT), try
         * and get read-only access.
         */
        if (err == -EFAULT && rw == VERIFY_READ) {
                err = get_user_pages_fast(address, 1, 0, &page);

That's a legitimate use case. And futex_requeue only requests
VERIFY_READ for the !requeue_pi case.

Now, if that map is RO, i.e. we took the fallback path, then the THP
one will fail as it has write=1 unconditionally.

if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1))

Thanks,

tglx

Linus Torvalds
Dec 10, 2013, 10:50:02 PM
On Tue, Dec 10, 2013 at 2:42 PM, Thomas Gleixner <tg...@linutronix.de> wrote:
>
> And yes, I remember that we do not do an extra check for the fshared
> case, because get_user_pages_fast() should do it for us already. If
> not we are fubared not only in the futex code.

Yeah. It turns out we do do the access check indirectly - by looking
at the PAGE_USER bit, even if we don't necessarily check the actual
limits. So get_user_pages_fast() is fine.
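
(For the curious: condensed, the relevant check in arch/x86/mm/gup.c
looks something like the following, which is also why a write=1 walk
fails on a read-only page.)

    /* gup_huge_pmd()/gup_pte_range(), condensed */
    mask = _PAGE_PRESENT | _PAGE_USER;
    if (write)
        mask |= _PAGE_RW;
    if ((pmd_flags(pmd) & mask) != mask)
        return 0;   /* not user-accessible, or RO while write was requested */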

> But there is a subtle detail:

Yup, see my email from ten minutes ago, we found the same thing. And
that would seem to explain the endless loop, and also the timing
(since Dave mentions he started doing large-pages lately).

So I think the "__get_user_pages_fast(address, 1, !ro, &page)" thing
should work.

Dave, can you re-create that trinity run and test that patch? I think
we've got this, but it might be nice to leave the hung machine up and
running until it's verified.. Although I don't really see what else we
could need or get out of it, so..

Linus

Darren Hart
Dec 10, 2013, 11:00:01 PM
Is there a reason THP requires unconditional rw? Andrea?

Or is the following actually the answer here?

diff --git a/kernel/futex.c b/kernel/futex.c
index 80ba086..02febad 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -288,7 +288,7 @@ again:
                 put_page(page);
                 /* serialize against __split_huge_page_splitting() */
                 local_irq_disable();
-                if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) {
+                if (likely(__get_user_pages_fast(address, 1, !ro, &page) == 1)) {
                         page_head = compound_head(page);
                         /*
                          * page_head is valid pointer but we must pin




--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel


Darren Hart
Dec 10, 2013, 11:00:02 PM
On Tue, 2013-12-10 at 14:48 -0800, Linus Torvalds wrote:
> On Tue, Dec 10, 2013 at 2:42 PM, Thomas Gleixner <tg...@linutronix.de> wrote:
> >
> > And yes, I remember that we do not do an extra check for the fshared
> > case, because get_user_pages_fast() should do it for us already. If
> > not we are fubared not only in the futex code.
>
> Yeah. It turns out we do do the access check indirectly - by looking
> at the PAGE_USER bit, even if we don't necessarily check the actual
> limits. So get_user_pages_fast() is fine.
>
> > But there is a subtle detail:
>
> Yup, see my email from ten minutes ago, we found the same thing. And
> that would seem to explain the endless loop, and also the timing
> (since Dave mentions he started doing large-pages lately).
>
> So I think the "__get_user_pages_fast(address, 1, !ro, &page)" thing
> should work.
>
> Dave, can you re-create that trinity run and test that patch? I think
> we've got this, but it might be nice to leave the hung machine up and
> running until it's verified.. Although I don't really see what else we
> could need or get out of it, so..

Would it be possible to limit the options to only pass FUTEX_CMP_REQUEUE
and a read-only uaddr? That should improve confidence when it doesn't
fail :-)

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel


Thomas Gleixner
Dec 10, 2013, 11:00:03 PM
On Tue, 10 Dec 2013, Linus Torvalds wrote:

Yes, that's what I decoded as well.

But how does the access_ok() move do anything helpful here?

We really need it for the fastpath !fshared case, but for the fshared
case you actively break working code, because you force a VERIFY_WRITE
check into it. The VERIFY_WRITE is necessary for !fshared, because
there is no way that one thread can map the futex RO and the other RW,
right?

But for fshared it's legitimate to have a RO mapping if you just wait
for the futex. Note that futexes are (ab)used as user space
waitqueues, so RO makes sense. And your move would break them.
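
A hypothetical example of that usage (made-up helper; the point is just
FUTEX_WAIT on a word in a PROT_READ, MAP_SHARED mapping):

    #include <linux/futex.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* waiter side: map the shared word read-only and sleep on it;
     * some other process maps it read-write and does FUTEX_WAKE */
    static int wait_ro(int fd)
    {
        int *word = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);

        if (word == MAP_FAILED)
            return -1;
        /* FUTEX_WAIT only ever reads *word, so an RO mapping is fine */
        return syscall(SYS_futex, word, FUTEX_WAIT, 0, NULL, NULL, 0);
    }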

If [__]get_user_pages_fast() does not do the right checks, then we
need to fix that and not add random access_ok() checks into the call
sites.

Thanks,

tglx

Linus Torvalds
Dec 10, 2013, 11:10:02 PM
On Tue, Dec 10, 2013 at 2:57 PM, Thomas Gleixner <tg...@linutronix.de> wrote:
>
> But how does the access_ok() move do anything helpful here?

Just making it all more obvious.

> We really need it for the fastpath !fshared case, but for the fshared
> case you actively break working code, because you force a VERIFY_WRITE
> check into it. The VERIFY_WRITE is necessary for !fshared, because
> there is no way that one thread can map the futex RO and the other RW,
> right?

Nobody actually uses that argument any more (it goes back to the old
i386 "let's manually verify that we have write permissions, because
the CPU doesn't do it for us in the trap handling"), and it should
probably be removed.

But you're right that it's at least misleading. I'd love to remove it
entirely, because it's not even syntax-checked, and it's confusing.
But that would be a humongous patch.

So these days, "access_ok()" literally just checks that the address is
in the user address space range. And that would seem to always be
appropriate for futexes, so why not just do it in the generic code?
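
(For reference, the x86 definition at the time was, from memory,
essentially just that range check; the 'type' argument is ignored:)

    #define access_ok(type, addr, size) \
        (likely(__range_not_ok(addr, size, user_addr_max()) == 0))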

Linus

Dave Jones
Dec 10, 2013, 11:10:02 PM
On Tue, Dec 10, 2013 at 02:58:19PM -0800, Darren Hart wrote:
> On Tue, 2013-12-10 at 14:48 -0800, Linus Torvalds wrote:
> > On Tue, Dec 10, 2013 at 2:42 PM, Thomas Gleixner <tg...@linutronix.de> wrote:
> > >
> > > And yes, I remember that we do not do an extra check for the fshared
> > > case, because get_user_pages_fast() should do it for us already. If
> > > not we are fubared not only in the futex code.
> >
> > Yeah. It turns out we do do the access check indirectly - by looking
> > at the PAGE_USER bit, even if we don't necessarily check the actual
> > limits. So get_user_pages_fast() is fine.
> >
> > > But there is a subtle detail:
> >
> > Yup, see my email from ten minutes ago, we found the same thing. And
> > that would seem to explain the endless loop, and also the timing
> > (since Dave mentions he started doing large-pages lately).
> >
> > So I think the "__get_user_pages_fast(address, 1, !ro, &page)" thing
> > should work.
> >
> > Dave, can you re-create that trinity run and test that patch? I think
> > we've got this, but it might be nice to leave the hung machine up and
> > running until it's verified.. Although I don't really see what else we
> > could need or get out of it, so..
>
> Would it be possible to limit the options to only pass FUTEX_CMP_REQUEUE
> and a read-only uaddr? That should improve confidence when it doesn't
> fail :-)

Easy enough to hack into the code, yeah. A bit complicated to come up with
a sensible grammar for a command-line parser for such cases, sadly.

I'll give the patch a try after dinner.

Dave

Al Viro
Dec 10, 2013, 11:30:02 PM
On Tue, Dec 10, 2013 at 11:42:15PM +0100, Thomas Gleixner wrote:
>         /*
>          * If write access is not required (eg. FUTEX_WAIT), try
>          * and get read-only access.
>          */
>         if (err == -EFAULT && rw == VERIFY_READ) {
>                 err = get_user_pages_fast(address, 1, 0, &page);
>
> That's a legitimate use case. And futex_requeue only requests
> VERIFY_READ for the !requeue_pi case.
>
> Now, if that map is RO, i.e. we took the fallback path then the THP
> one will fail as it has write=1 unconditionally.

access_ok() has nothing whatsoever to do with RO vs. RW mappings.
It checks whether the address is OK for userland on architectures
with userland and kernel sharing the same address space (e.g. x86).
On something like e.g. sparc64 or s390 it's constant 1.

Note that there's nothing to stop another thread from remapping an RW
area RO just as you've returned from access_ok(), so checking for
writability in access_ok() would've been racy as hell. Ditto for
address being mapped at all...

Moreover, there are exactly two architectures that do not ignore the
first argument of access_ok() - microblaze and um. The former uses
it in debugging printk in failure case. The latter... AFAICS, it's
pointless - it's a special dispensation for read access to host
vsyscall page from guest process. The thing is, writes there are
going to fail anyway - host kernel won't let the guest kernel to
modify that page, period. IOW, it looks like um might as well drop
the (type == VERIFY_READ) part in __access_ok_vsyscall().

Why do we have the 'type' argument of access_ok(), anyway?

Thomas Gleixner
Dec 10, 2013, 11:30:02 PM
On Tue, 10 Dec 2013, Linus Torvalds wrote:

> On Tue, Dec 10, 2013 at 2:57 PM, Thomas Gleixner <tg...@linutronix.de> wrote:
> >
> > But how does the access_ok() move do anything helpful here?
>
> Just making it all more obvious.
>
> > We really need it for the fastpath !fshared case, but for the fshared
> > case you actively break working code, because you force a VERIFY_WRITE
> > check into it. The VERIFY_WRITE is necessary for !fshared, because
> > there is no way that one thread can map the futex RO and the other RW,
> > right?
>
> Nobody actually uses that argument any more (it goes back to the old
> i386 "let's manually verify that we have write permissions, because
> the CPU doesn't do it for us in the trap handling"), and it should
> probably be removed.

Fair enough.

> But you're right that it's at least misleading. I'd love to remove it
> entirely, because it's not even syntax-checked, and it's confusing.
> But that would be a humongous patch.

Well, we should ask Julia for a coccinelle patch to limit the
wreckage. :)

Seriously, if that VERIFY_WRITE is completely useless we really want
to get rid of it.

> So these days, "access_ok()" literally just checks that the address is
> in the user address space range. And that would seem to always be
> appropriate for futexes, so why not just do it in the generic code?

Agreed, but as long as the VERIFY_WRITE argument is there it needs at
least a big fat comment :)

Thanks,

tglx

Al Viro
Dec 10, 2013, 11:40:02 PM
On Tue, Dec 10, 2013 at 03:05:46PM -0800, Linus Torvalds wrote:

> Nobody actually uses that argument any more (it goes back to the old
> i386 "let's manually verify that we have write permissions, because
> the CPU doesn't do it for us in the trap handling"), and it should
> probably be removed.

Ah...

> But you're right that it's at least misleading. I'd love to remove it
> entirely, because it's not even syntax-checked, and it's confusing.
> But that would be a humongous patch.
>
> So these days, "access_ok()" literally just checks that the address is
> in the user address space range. And that would seem to always be
> appropriate for futexes, so why not just do it in the generic code?

So let's turn it into
#define access_ok(type, addr, size) address_ok(addr, size)

and then users can be converted at leisure. Eventually we'll just remove
the unused macros...

Dave Jones

unread,
Dec 10, 2013, 11:50:02 PM12/10/13
to
On Tue, Dec 10, 2013 at 02:48:52PM -0800, Linus Torvalds wrote:
> On Tue, Dec 10, 2013 at 2:42 PM, Thomas Gleixner <tg...@linutronix.de> wrote:
> >
> > And yes, I remember that we do not do an extra check for the fshared
> > case, because get_user_pages_fast() should do it for us already. If
> > not we are fubared not only in the futex code.
>
> Yeah. It turns out we do do the access check indirectly - by looking
> at the PAGE_USER bit, even if we don't necessarily check the actual
> limits. So get_user_pages_fast() is fine.
>
> > But there is a subtle detail:
>
> Yup, see my email from ten minutes ago, we found the same thing. And
> that would seem to explain the endless loop, and also the timing
> (since Dave mentions he started doing large-pages lately).
>
> So I think the "__get_user_pages_fast(address, 1, !ro, &page)" thing
> should work.
>
> Dave, can you re-create that trinity run and test that patch? I think
> we've got this, but it might be nice to leave the hung machine up and
> running until it's verified.. Although I don't really see what else we
> could need or get out of it, so..

The only thing I'm still unclear on is how that pid allegedly wasn't doing
a futex call as part of its run. The only thing I can think of is that
the other pid that _did_ do a futex call did it on a page that was MAP_SHARED
between all the other children, and this 'spin forever' thing only
happens when the last process with a reference on that page exits?

does that make sense?

Dave

Steven Rostedt

unread,
Dec 11, 2013, 12:10:01 AM12/11/13
to
On Tue, Dec 10, 2013 at 06:00:09PM -0500, Dave Jones wrote:
>
> The only thing I'm still unclear on, is how that pid allegedly wasn't doing
> a futex call as part of its run. The only thing I can think of is that
> the other pid that _did_ do a futex call did it on a page that was MAP_SHARED
> between all the other children, and this 'spin forever' thing only
> happens when the last process with a reference on that page exits ?

Which thread did not do the futex call? The one that was spinning? No, that one
most definitely was, at least according to the stack trace you posted:

trinity-child27-10818 [001] 89790.703547: kernel_stack: <stack trace>
=> futex_requeue (ffffffff810df18a)
=> do_futex (ffffffff810e019e)
=> SyS_futex (ffffffff810e0de1)
=> tracesys (ffffffff81760be4)

It did a futex() system call.

Or are you talking about another thread?

-- Steve

Dave Jones

unread,
Dec 11, 2013, 12:30:02 AM12/11/13
to
On Tue, Dec 10, 2013 at 07:05:04PM -0500, Steven Rostedt wrote:
> On Tue, Dec 10, 2013 at 06:00:09PM -0500, Dave Jones wrote:
> >
> > The only thing I'm still unclear on, is how that pid allegedly wasn't doing
> > a futex call as part of its run. The only thing I can think of is that
> > the other pid that _did_ do a futex call did it on a page that was MAP_SHARED
> > between all the other children, and this 'spin forever' thing only
> > happens when the last process with a reference on that page exits ?
>
> Which thread did not do the futex call? The one that was spinning? No, that one
> most definitely was, at least according to the stack trace trace you posted:
>
> trinity-child27-10818 [001] 89790.703547: kernel_stack: <stack trace>
> => futex_requeue (ffffffff810df18a)
> => do_futex (ffffffff810e019e)
> => SyS_futex (ffffffff810e0de1)
> => tracesys (ffffffff81760be4)
>
> It did a futex() system call.
>
> Or are you talking about another thread?

It's the same thread, but here's what it says the last thing it did was...

(gdb) print shm->previous_syscallno[27]
$1 = 288

accept4.

Just to verify I'm looking at the right array member..

(gdb) print shm->pids[27]
$2 = 10818

Oh, hmm. Wait, I'm an idiot.
I only update ->previous when we come back from the syscall.
It's _still_ doing this syscall.

(gdb) print shm->syscallno[27]
$4 = 202

(202 is __NR_futex on x86-64, so it really is still inside that futex call.)

I was distracted by seeing all the other threads exiting, so I was only looking at
what this one had already done.

ok, mystery solved. derp.

Dave

Dave Jones

unread,
Dec 11, 2013, 1:00:04 AM12/11/13
to
On Tue, Dec 10, 2013 at 07:23:30PM -0500, Dave Jones wrote:

> I was distracted by seeing all the other threads exiting, so I was only looking at
> what this one had already done.

another thing that distracted me was that /proc/10818/stack was just showing that

[<ffffffffffffffff>] 0xffffffffffffffff

output.

For my own education, what causes that ?
How come it didn't show the same trace I saw when I sysrq-t'd ?

Mel Gorman

unread,
Dec 11, 2013, 1:10:01 AM12/11/13
to
On Tue, Dec 10, 2013 at 08:18:29PM +0100, Thomas Gleixner wrote:
> On Tue, 10 Dec 2013, Linus Torvalds wrote:
>
> > Hmm. Looks like the futex code is somehow stuck in a loop, calling
> > get_user_pages_fast().
> >
> > The futex code itself is apparently so low-overhead that it doesn't
> > show up in your 'perf top' report (which is dominated by all the
> > expensive debug things that get_user_pages_fast() etc ends up doing),
> > but that's the only looping I can see. Perhaps the "goto again" case
> > for transparent huge pages in get_futex_key()? Or the
>
> Cc'ng more folks on that.
>

I just saw this before heading to bed and have not read the thread. I'll
read it in the morning, but in the meantime the following might ring a bell
for someone else's investigation, or for someone more familiar with how
futexes work from end to end.

Was NUMA balancing enabled and was this a NUMA machine?

I ask because of these two patches that are currently in flight

mm: numa: Serialise parallel get_user_page against THP migration
mm: fix TLB flush race between migration, and change_protection_range

There are related patches but these two are the most important for what
I have in mind. The two in combination address a problem whereby a write
from one thread can be lost due to a THP migration but it's specific to
automatic NUMA balancing. If the lost update was for a page containing a
futex then the lost write could confuse waiters. The downside is that this
is a bad fit for the problem description in the first mail. A lost update
might result in processes waiting forever on a value that never changes
but offhand it's less clear why it might result in a loop. Unless of
course there is a combination of events that allows for a busy wait on a
value that will never change due to the lost write.

--
Mel Gorman
SUSE Labs

Dave Jones

unread,
Dec 11, 2013, 4:20:01 AM12/11/13
to
On Tue, Dec 10, 2013 at 02:48:52PM -0800, Linus Torvalds wrote:

> Dave, can you re-create that trinity run and test that patch?

Looks ok so far, but I'll leave it run overnight to be sure.

Dave

Ingo Molnar

unread,
Dec 11, 2013, 12:50:02 PM12/11/13
to

* Linus Torvalds <torv...@linux-foundation.org> wrote:

> On Tue, Dec 10, 2013 at 1:32 PM, Dave Jones <da...@redhat.com> wrote:
> >
> > http://www.codemonkey.org.uk/junk/perf.data.xz
>
> "Forbidden
>
> You don't have permission to access /junk/perf.data.xz on this server."
>
> also, we'd need the vmlinux file to actually decode the data, I think.

So my point is somewhat moot in light of a fix patch, but the best way
to export all interesting data is via running 'perf archive' after the
perf.data got created:

comet:~/tip> perf archive
Now please run:

$ tar xvf perf.data.tar.bz2 -C ~/.debug

wherever you need to run 'perf report' on.
comet:~/tip>

and copying over both the perf.data and the perf.data.tar.bz2.

Arnaldo, I think we should make it even easier and more obvious to
export/import perf data, via something like:

perf clean # cleans out ~/.debug
perf record ...
perf export # creates perf.data.tar.bz2 which includes perf.data as well, not just .debug

and then whoever gets a perf.data.tar.bz2 can do:

perf import # does 'tar xjvf perf.data.tar.bz2'
perf report # analyze the data as if it was captured on your own box

which extracts it into the local directory. This makes the whole thing
rather easy and makes the result complete and trustable.

What do you think?

Thanks,

Ingo

Mel Gorman

unread,
Dec 11, 2013, 4:40:02 PM12/11/13
to
On Tue, Dec 10, 2013 at 11:42:15PM +0100, Thomas Gleixner wrote:
Not that it really matters, but the naming and comments around that
particular __get_user_pages_fast call are a little misleading. The ifdef
CONFIG_TRANSPARENT_HUGEPAGE in futex.c is there because greater care has
to be taken against THP splits, not because it is dealing exclusively with
THP. The PageTail check applies to either hugetlbfs or THPs, and similarly
gup_huge_pmd handles both. The whole path should be very rare for THPs,
considering that THPs exist on MAP_PRIVATE anonymous mappings; how many
shared futexes are backed by such mappings? An RO mapping makes that seem
even stranger, because what thread is updating the value the caller is
waiting on? It seems more likely that this was a shared futex on a
hugetlbfs-backed mapping, which might explain why the bug was undiscovered
for so long.

--
Mel Gorman
SUSE Labs

Thomas Gleixner

unread,
Dec 11, 2013, 4:40:02 PM12/11/13
to
This is the fshared path. The MAP_PRIVATE path does not do that dance
at all.

Thanks,

tglx

Oleg Nesterov

unread,
Dec 11, 2013, 5:10:01 PM12/11/13
to
On 12/10, Linus Torvalds wrote:
>
> I think what happens is:
> - get_user_pages_fast(address, 1, 1, &page) fails (because it's read-only)
> - get_user_pages_fast(address, 1, 0, &page) succeeds and gets a large-page
> - __get_user_pages_fast(address, 1, 1, &page) fails (because it's read-only).
>
> so what triggers this is likely that Dave now does large-pages, and
> one of them is a read-only mapping.
>
> So I would suggest replacing the second "1" in the
> __get_user_pages_fast() call with a "!ro" instead. So how about this
> second patch instead (the access_ok() move remains).

I know almost nothing about THP, but why we may need write == true in
this case?

IOW,

> - if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) {
> + if (likely(__get_user_pages_fast(address, 1, !ro, &page) == 1)) {

can't
if (likely(__get_user_pages_fast(address, 1, 0, &page) == 1)) {

work?


I have to admit, I do not understand why we can't avoid this altogether.

__get_page_tail() can find the stable ->first_page, why get_futex_key()
can't ?

Oleg.

Thomas Gleixner

unread,
Dec 11, 2013, 5:20:02 PM12/11/13
to
On Wed, 11 Dec 2013, Oleg Nesterov wrote:
> On 12/10, Linus Torvalds wrote:
> >
> > I think what happens is:
> > - get_user_pages_fast(address, 1, 1, &page) fails (because it's read-only)
> > - get_user_pages_fast(address, 1, 0, &page) succeeds and gets a large-page
> > - __get_user_pages_fast(address, 1, 1, &page) fails (because it's read-only).
> >
> > so what triggers this is likely that Dave now does large-pages, and
> > one of them is a read-only mapping.
> >
> > So I would suggest replacing the second "1" in the
> > __get_user_pages_fast() call with a "!ro" instead. So how about this
> > second patch instead (the access_ok() move remains).
>
> I know almost nothing about THP, but why we may need write == true in
> this case?
>
> IOW,
>
> > - if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) {
> > + if (likely(__get_user_pages_fast(address, 1, !ro, &page) == 1)) {
>
> can't
> if (likely(__get_user_pages_fast(address, 1, 0, &page) == 1)) {
>
> work?

If the futex call needs to write to that address, we need a writeable
mapping, right? And get_user_pages_fast() will verify that for us.

> I have to admit, I do not understand why we can't avoid this altogether.
>
> __get_page_tail() can find the stable ->first_page, why get_futex_key()
> can't ?

Because it can be split under us.

Thanks,

tglx

Mel Gorman

unread,
Dec 11, 2013, 6:00:02 PM12/11/13
to
On Wed, Dec 11, 2013 at 05:38:55PM +0100, Thomas Gleixner wrote:
> On Wed, 11 Dec 2013, Mel Gorman wrote:
> > On Tue, Dec 10, 2013 at 11:42:15PM +0100, Thomas Gleixner wrote:
> > > Now, if that map is RO, i.e. we took the fallback path then the THP
> > > one will fail as it has write=1 unconditionally.
> > >
> > > if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1))
> > >
> >
> > Not that it really matters but the naming and comments around that
> > particular __get_user_pages_fast call are a little misleading. The ifdef
> > CONFIG_TRANSPARENT_HUGEPAGE in futex.c is there because greater care has
> > to be taken against THP splits, not because it is dealing exclusively with
> > THP. The PageTail check applies to either hugetlbfs or THPs and similarly
> > gup_huge_pmd handles both. The whole path should be very rare for THPs
> > considering that THPs exist on MAP_PRIVATE anonymous mappings and how many
> > shared futexes exist backed by such mappings? A RO mapping makes that seem
> > even more strange because what thread is updating the value the caller is
> > waiting on? It seems more like that was a shared futex on a hugetlbfs-backed
> > mappings which might explain why the bug was undiscovered for so long.
>
> This is the fshared path. The MAP_PRIVATE path does not do that dance
> at all.
>

do_futex takes an op parameter and sets FLAGS_SHARED based on that op
parameter (i.e. on the absence of FUTEX_PRIVATE_FLAG), not on whether the
VMA backing that address was created with the MAP_PRIVATE or MAP_SHARED
flag. For example:

do_futex(addr, op) where !(op & FUTEX_PRIVATE_FLAG) and cmd == FUTEX_WAIT
-> futex_wait
-> futex_wait_setup
-> get_futex_key(fshared == true)

THP pages encountered in the fshared path should be rare, because why create
a shared futex on a private mapping? That does not make it impossible, which
is why splits are guarded against. If it were genuinely impossible to be in
this path for MAP_PRIVATE then there would be no point worrying about THP,
at least as THP is currently supported. Overall, it still seems more
likely this was a hugetlbfs page.
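
For what it's worth, a hypothetical user-space sketch of that scenario
(untested here, and not something that was posted in the thread): a shared
FUTEX_WAIT on a read-only tail sub-page of a hugetlbfs mapping. It assumes
2MB default huge pages and at least one page reserved via vm.nr_hugepages;
on a kernel without the !ro fix this should, if I read the loop right, end
up spinning inside get_futex_key(), while on a fixed kernel it simply times
out after a second.

#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#define HUGE_SZ (2UL * 1024 * 1024)	/* assumes 2MB default huge pages */

int main(void)
{
	/* needs at least one page reserved in vm.nr_hugepages */
	char *map = mmap(NULL, HUGE_SZ, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (map == MAP_FAILED) {
		perror("mmap(MAP_HUGETLB)");
		return 1;
	}

	/* put the futex word on a *tail* sub-page, i.e. not at offset 0 */
	uint32_t *uaddr = (uint32_t *)(map + 4096);
	*uaddr = 1;				/* also faults the huge page in */

	/* make the mapping read-only, as in the failure scenario above */
	mprotect(map, HUGE_SZ, PROT_READ);

	struct timespec ts = { .tv_sec = 1, .tv_nsec = 0 };
	/* no FUTEX_PRIVATE_FLAG, so get_futex_key() runs with fshared */
	long ret = syscall(SYS_futex, uaddr, FUTEX_WAIT, 1, &ts, NULL, 0);
	printf("futex returned %ld\n", ret);

	munmap(map, HUGE_SZ);
	return 0;
}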

--
Mel Gorman
SUSE Labs

Oleg Nesterov

unread,
Dec 11, 2013, 6:00:03 PM12/11/13
to
On 12/11, Thomas Gleixner wrote:
>
> On Wed, 11 Dec 2013, Oleg Nesterov wrote:
> >
> > I know almost nothing about THP, but why we may need write == true in
> > this case?
> >
> > IOW,
> >
> > > - if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) {
> > > + if (likely(__get_user_pages_fast(address, 1, !ro, &page) == 1)) {
> >
> > can't
> > if (likely(__get_user_pages_fast(address, 1, 0, &page) == 1)) {
> >
> > work?
>
> If the futex call needs to write to that address, we need a writeable
> mapping, right? And get_user_pages_fast() will verify that for us.

Yes sure. I meant __get_user_pages_fast() which is called to find
page_head.

> > I have to admit, I do not understand why we can't avoid this altogether.
> >
> > __get_page_tail() can find the stable ->first_page, why get_futex_key()
> > can't ?
>
> Because it can be split under us.

I understand, but why we can't rely on compound_lock() like
__get_page_tail() does?

OK, could you please explain why something like the pseudo-code
below can't work?

Oleg.


--- x/mm/swap.c
+++ x/mm/swap.c
@@ -210,7 +210,7 @@ EXPORT_SYMBOL(put_page);
* This function is exported but must not be called by anything other
* than get_page(). It implements the slow path of get_page().
*/
-bool __get_page_tail(struct page *page)
+struct page *__get_page_tail(struct page *page)
{
/*
* This takes care of get_page() if run on a tail page
@@ -235,7 +235,7 @@ bool __get_page_tail(struct page *page)
*/
VM_BUG_ON(!PageHead(page_head));
__get_page_tail_foll(page, false);
- return true;
+ return page_head;
} else {
/*
* __split_huge_page_refcount run
@@ -247,7 +247,7 @@ bool __get_page_tail(struct page *page)
* slab on x86).
*/
put_page(page_head);
- return false;
+ return NULL;
}
}

@@ -264,10 +264,12 @@ bool __get_page_tail(struct page *page)
got = true;
}
compound_unlock_irqrestore(page_head, flags);
- if (unlikely(!got))
+ if (likely(got))
+ return page_head;
+ else
put_page(page_head);
}
- return got;
+ return NULL;
}
EXPORT_SYMBOL(__get_page_tail);

--- x/kernel/futex.c
+++ x/kernel/futex.c
@@ -282,41 +282,11 @@ again:
else
err = 0;

-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- page_head = page;
- if (unlikely(PageTail(page))) {
+ page_head = __get_page_tail(page);
+ if (page_head) {
put_page(page);
- /* serialize against __split_huge_page_splitting() */
- local_irq_disable();
- if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) {
- page_head = compound_head(page);
- /*
- * page_head is valid pointer but we must pin
- * it before taking the PG_lock and/or
- * PG_compound_lock. The moment we re-enable
- * irqs __split_huge_page_splitting() can
- * return and the head page can be freed from
- * under us. We can't take the PG_lock and/or
- * PG_compound_lock on a page that could be
- * freed from under us.
- */
- if (page != page_head) {
- get_page(page_head);
- put_page(page);
- }
- local_irq_enable();
- } else {
- local_irq_enable();
- goto again;
- }
- }
-#else
- page_head = compound_head(page);
- if (page != page_head) {
- get_page(page_head);
- put_page(page);
- }
-#endif
+ else
+ page_head = page;

lock_page(page_head);

Oleg Nesterov

unread,
Dec 11, 2013, 7:20:01 PM12/11/13
to
On 12/11, Oleg Nesterov wrote:
>
> On 12/11, Thomas Gleixner wrote:
> >
> > On Wed, 11 Dec 2013, Oleg Nesterov wrote:
> > >
> > > I have to admit, I do not understand why we can't avoid this altogether.
> > >
> > > __get_page_tail() can find the stable ->first_page, why get_futex_key()
> > > can't ?
> >
> > Because it can be split under us.
>
> I understand, but why we can't rely on compound_lock() like
> __get_page_tail() does?
>
> OK, could you please explain why something like the pseudo-code
> below can't work?

Seriously. Can the patch below work?

With this change get_futex_key() can simply do

page_head = get_compound_head(page);
if (page_head)
put_page(page);
else
page_head = page;

instead of CONFIG_TRANSPARENT_HUGEPAGE/else code.

(and we can also remove "bool get_page_head" from __get_page_tail_foll).

I am sure I missed something. But I am curious and I'd like to understand
what I have missed.

Thanks,

Oleg.

--- a/mm/swap.c
+++ b/mm/swap.c
@@ -210,7 +210,7 @@ EXPORT_SYMBOL(put_page);
* This function is exported but must not be called by anything other
* than get_page(). It implements the slow path of get_page().
*/
-bool __get_page_tail(struct page *page)
+static struct page *__get_compound_head(struct page *page, bool and_tail)
{
/*
* This takes care of get_page() if run on a tail page
@@ -234,8 +234,9 @@ bool __get_page_tail(struct page *page)
* cannot race here.
*/
VM_BUG_ON(!PageHead(page_head));
- __get_page_tail_foll(page, false);
- return true;
+ if (and_tail)
+ get_huge_page_tail(page);
+ return page_head;
} else {
/*
* __split_huge_page_refcount run
@@ -260,17 +261,34 @@ bool __get_page_tail(struct page *page)
flags = compound_lock_irqsave(page_head);
/* here __split_huge_page_refcount won't run anymore */
if (likely(PageTail(page))) {
- __get_page_tail_foll(page, false);
+ if (and_tail)
+ get_huge_page_tail(page);
got = true;
}
compound_unlock_irqrestore(page_head, flags);
- if (unlikely(!got))
+
+ if (likely(got))
+ return page_head;
+ else
put_page(page_head);
}
- return got;
+ return NULL;
+}
+
+bool __get_page_tail(struct page *page)
+{
+ return !!__get_compound_head(page, true);
}
EXPORT_SYMBOL(__get_page_tail);

+struct page *get_compound_head(struct page *page)
+{
+ if (PageTail(page))
+ return __get_compound_head(page, false);
+ else
+ return NULL;
+}
+
/**
* put_pages_list() - release a list of pages
* @pages: list of pages threaded on page->lru

Dave Jones

unread,
Dec 12, 2013, 4:30:01 AM12/12/13
to
On Tue, Dec 10, 2013 at 02:48:52PM -0800, Linus Torvalds wrote:

> Dave, can you re-create that trinity run and test that patch? I think
> we've got this

24 hours later, all is well. I think we can call this one done.

Tested-by: Dave Jones <da...@fedoraproject.org>

Dave

Darren Hart

unread,
Dec 12, 2013, 5:30:02 AM12/12/13
to
On Wed, 2013-12-11 at 23:26 -0500, Dave Jones wrote:
> On Tue, Dec 10, 2013 at 02:48:52PM -0800, Linus Torvalds wrote:
>
> > Dave, can you re-create that trinity run and test that patch? I think
> > we've got this
>
> 24 hours later, all is well. I think we can call this one done.
>
> Tested-by: Dave Jones <da...@fedoraproject.org>

Thank you again for a fine preemptive bug catch Dave!

--
Darren Hart
Intel Open Source Technology Center
Yocto Project - Linux Kernel


Andrea Arcangeli

unread,
Dec 12, 2013, 7:10:01 PM12/12/13
to
Hello,

On Tue, Dec 10, 2013 at 01:57:49PM -0800, Linus Torvalds wrote:
> On Tue, Dec 10, 2013 at 1:41 PM, Dave Jones <da...@redhat.com> wrote:
> >
> > http://codemonkey.org.uk/junk/trace
>
> Hmm. Ok, so something is calling [__]get_user_pages_fast() and
> put_page() in a loop, but the trace doesn't show what that "something"
> is, because it is itself not ever called.
>
> However, that pattern does seem to imply that the loop is in
> get_futex_key(), because all the other loops I see seem to be calling
> other things as well.
>
> And the __get_user_pages_fast() call implies that it's the THP case
> that triggers the "unlikely(PageTail(page))" case. And anyway,
> otherwise we'd see lock_page()/unlock_page() too.
>
> So it looks like __get_user_pages_fast() fails, and keeps failing.
> Andrea, this is your code, any ideas? Commit a5b338f2b0b1f ("thp:
> update futex compound knowledge") to be exact.

I can only agree that the __get_user_pages_fast "write" parameter
should have been !ro instead of 1.

However it wasn't me who introduced the bug; my code as patched in
early 2011 worked fine. The bug was introduced half a year later
in commit 9ea71503a8ed9184d2d0b8ccc4d269d05f7940ae.

@@ -262,8 +264,18 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key)

again:
err = get_user_pages_fast(address, 1, 1, &page);
+ /*
+ * If write access is not required (eg. FUTEX_WAIT), try
+ * and get read-only access.
+ */
+ if (err == -EFAULT && rw == VERIFY_READ) {
+ err = get_user_pages_fast(address, 1, 0, &page);
+ ro = 1;
+ }
if (err < 0)
return err;
+ else
+ err = 0;

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
page_head = page;
@@ -305,6 +317,13 @@ again:
if (!page_head->mapping) {
unlock_page(page_head);
put_page(page_head);
+ /*
+ * ZERO_PAGE pages don't have a mapping. Avoid a busy
loop
+ * trying to find one. RW mapping would have COW'd (and
thus
+ * have a mapping) so this page is RO and won't ever
change.
+ */
+ if ((page_head == ZERO_PAGE(address)))
+ return -EFAULT;
goto again;
}

[..]

This commit didn't update the __get_user_pages_fast call to use !ro instead
of 1, as would have been required after the above change.

I'll now review Oleg proposed cleanup.

Thanks for tracking this down!

Andrea

Linus Torvalds

unread,
Dec 12, 2013, 7:10:02 PM12/12/13
to
On Thu, Dec 12, 2013 at 11:00 AM, Andrea Arcangeli <aarc...@redhat.com> wrote:
>
> However it wasn't me introducing the bug, my code when patched in
> early 2011 would work fine. The bug was introduced half a year later
> in commit 9ea71503a8ed9184d2d0b8ccc4d269d05f7940ae .

I'd argue that half a year later the bug was *fixed*, but it was only
fixed for the regular pages. Leaving largepages buggy.

Linus

Andrea Arcangeli

unread,
Dec 13, 2013, 3:20:02 PM12/13/13
to
On Wed, Dec 11, 2013 at 08:18:55PM +0100, Oleg Nesterov wrote:
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -210,7 +210,7 @@ EXPORT_SYMBOL(put_page);
> * This function is exported but must not be called by anything other
> * than get_page(). It implements the slow path of get_page().
> */
> -bool __get_page_tail(struct page *page)
> +static struct page *__get_compound_head(struct page *page, bool and_tail)
> {

This is not necessarily inline (unlike __get_page_tail_foll which is
inline and optimizes away the build time known "false"/"true" params).

> /*
> * This takes care of get_page() if run on a tail page
> @@ -234,8 +234,9 @@ bool __get_page_tail(struct page *page)
> * cannot race here.
> */
> VM_BUG_ON(!PageHead(page_head));
> - __get_page_tail_foll(page, false);
> - return true;
> + if (and_tail)
> + get_huge_page_tail(page);

So you may be introducing a real branch here unless gcc decides to
self-inline the static func.

> + return page_head;
> } else {
> /*
> * __split_huge_page_refcount run
> @@ -260,17 +261,34 @@ bool __get_page_tail(struct page *page)
> flags = compound_lock_irqsave(page_head);
> /* here __split_huge_page_refcount won't run anymore */
> if (likely(PageTail(page))) {
> - __get_page_tail_foll(page, false);
> + if (and_tail)
> + get_huge_page_tail(page);

Same here.

> +bool __get_page_tail(struct page *page)
> +{
> + return !!__get_compound_head(page, true);
> }

I hope gcc would optimize away the !! overhead vs the old got =
true/false, once __get_compound_head is inline.

So I think you should add inline to __get_compound_head; there's no
benefit in keeping get_compound_head in the CPU icache.

get_huge_page_tail checks different invariants in the VM_BUG_ON and is
only used by gup.c; I'm not sure why to call that here.

But __get_page_tail has changed in -mm, so the above will reject. In
-mm you will find __get_page_tail_foll called with a "true" parameter
too in __get_page_tail, for example (not equivalent to
get_huge_page_tail even ignoring the invariant checking in the VM_BUG_ONs),
and it defines a fast path for hugetlbfs and slab (faster than the one
upstream).

But here we have a "tail page" that has reference on both head and
tail, and you're taking one more reference on the head, just to
release it later by releasing the tail.

Your objective is to do:

page_head = get_compound_head(page);
if (page_head)
put_page(page);
else
page_head = page;

If we've to mess with the compound lock for this, considering
get_compound_head would have only this user anyway, we could do:

page_head = put_compound_tail(page);
if (!page_head)
page_head = page;

Implemented as:

put_compound_tail(page) {
page_head = compound_trans_head(page);
if (!__compound_tail_refcounted(page_head)) {
smp_rmb();
if (PageTail(page)) {
VM_BUG_ON(!PageHead(page_head));
VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
VM_BUG_ON(page_mapcount(page) != 0);
return page_head;
} else
return NULL;

}
flags = compound_lock_irqsave(page_head);
if (!PageTail(page)) {
unlock
return NULL
}
VM_BUG_ON(page_head != page->first_page);
/* __split_huge_page_refcount will wait now */
VM_BUG_ON(page_mapcount(page) <= 0);
atomic_dec(&page->_mapcount);
VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
VM_BUG_ON(atomic_read(&page->_count) != 0);
compound_unlock_irqrestore(page_head, flags);
return page_head;
}

Considering this is going incremental with -mm, and that the -mm
previous patches the above would depend on are not upstream yet, I'd
suggest to fix the bug in futex with Linus patch now (s/1/!ro/).

The strict obviously safe fix Linus posted, is not going to interfere
with these optimization later, so there's no downside to apply it now
and defer the optimizations to the -mm queue.

Thanks!
Andrea

Oleg Nesterov

unread,
Dec 13, 2013, 4:30:01 PM12/13/13
to
Andrea. Thanks a lot for the detailed reply.

On 12/13, Andrea Arcangeli wrote:
>
> On Wed, Dec 11, 2013 at 08:18:55PM +0100, Oleg Nesterov wrote:
>
> get_huge_page_tail checks different invariants in the VM_BUG_ON and is
> only used by gup.c not sure why to call that here.

Compared to __get_page_tail_foll(get_page_head => F), get_huge_page_tail()
lacks the ->first_page->_count check, but it is called right after
get_page_unless_zero(page_head), so to me get_huge_page_tail() looks a bit
better.

Personally, I think that even this change

--- x/kernel/events/../../mm/internal.h
+++ x/kernel/events/../../mm/internal.h
@@ -47,11 +47,9 @@ static inline void __get_page_tail_foll(
* page_cache_get_speculative()) on tail pages.
*/
VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
- VM_BUG_ON(atomic_read(&page->_count) != 0);
- VM_BUG_ON(page_mapcount(page) < 0);
if (get_page_head)
atomic_inc(&page->first_page->_count);
- atomic_inc(&page->_mapcount);
+ get_huge_page_tail(page);
}

/*

makes sense. But this is minor and subjective, I am not going to argue.

And I agree with other comments, the patch I sent only tried to explain
what I meant.

> If we've to mess with the compound lock for this, considering
> get_compound_head would have only this user anyway, we could do:
>
> page_head = put_compound_tail(page);
> if (!page_head)
> page_head = page;
>
> Implemented as:

OK, this looks better, I agree.

I'll try to make v2 based on -mm and your suggestions.

> Considering this is going incremental with -mm, and that the -mm
> previous patches the above would depend on are not upstream yet, I'd
> suggest to fix the bug in futex with Linus patch now (s/1/!ro/).

Yes, yes, sure. I didn't mean that this (potential) cleanup can replace
the simple fix from Linus.

Thanks!

Oleg.

Andrea Arcangeli

unread,
Dec 13, 2013, 5:40:03 PM12/13/13
to
The above diff looks like a straightforward cleanup; you could submit it as
a separate patch in a v2 series. This more clearly shows the
difference between the two functions, so there is less overlap
too.

Feel free to change it further if you want, but the current model was
to have get_huge_page_tail only for arch/*/mm/gup.c (in mm.h) and
__get_page_tail_foll in internal.h for mm/*.c and with the ability to
obtain the head page too (needed in some of the mm/*.c cases, and
never needed for arch/*/mm/gup.c).

> I'll try to make v2 based on -mm and your suggestions.

Ok great!

Thanks,
Andrea

Oleg Nesterov

unread,
Dec 14, 2013, 8:20:01 PM12/14/13
to
On 12/10, Dave Jones wrote:
>
> On Tue, Dec 10, 2013 at 07:23:30PM -0500, Dave Jones wrote:
>
> > I was distracted by seeing all the other threads exiting, so I was only looking at
> > what this one had already done.
>
> another thing that distracted me was that /proc/10818/stack was just showing that
>
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> output.
>
> For my own education, what causes that ?

save_stack_trace_tsk() adds ULONG_MAX as the "last" entry, and
dump_trace() fails if the task is running and != current (note that
cat /proc/self/stack works).
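
For reference, this is roughly why an "empty" trace still prints that one
value: the x86 save_stack_trace_tsk() path terminates the trace with a
ULONG_MAX sentinel, and proc_pid_stack() prints whatever was saved (sketch
paraphrased from memory, not quoted verbatim):

	/* tail of save_stack_trace_tsk(): the 0xffffffffffffffff you saw */
	if (trace->nr_entries < trace->max_entries)
		trace->entries[trace->nr_entries++] = ULONG_MAX;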

> How come it didn't show the same trace I saw when I sysrq-t'd ?

Because print_trace_address() does not skip !reliable entries,
unlike __save_stack_address(). This (afaics) makes the difference.

I'll try to make a patch but I am not sure... I am not even sure
it makes sense, but in any case this all doesn't look right to me.

First of all, stack = task->thread.sp is not really right if this
task is running. Worse, bp = *stack returned by stack_frame() is
random in this case. This equally applies to sysrq-t's output.
Not that bad, but still wrong and confusing, imho.

And lets look at dump_trace(),

const unsigned cpu = get_cpu();
unsigned long *irq_stack_end =
(unsigned long *)per_cpu(irq_stack_ptr, cpu);

This (in general) has nothing to do with task_cpu(task).

And why does dump_trace() check irq_stack_end != NULL? It is always true.

I think that these paths should not even try to guess what bp is
if the task is not running/current. But it is not clear that we can simply
"disable" the reliable check in __save_stack_address(); we should report
this fact in proc_pid_stack()->seq_printf() somehow.

And proc_pid_stack() should drop ->cred_guard_mutex right after
save_stack_trace_tsk(), although this is off-topic.

Oleg.

Oleg Nesterov

unread,
Dec 16, 2013, 6:40:01 PM12/16/13
to
On 12/13, Andrea Arcangeli wrote:
>
> On Fri, Dec 13, 2013 at 05:22:40PM +0100, Oleg Nesterov wrote:
> >
> > I'll try to make v2 based on -mm and your suggestions.
>
> Ok great!

Yes, it would be great, but I need your help again ;)


Let me quote the pseudo-code you sent me:

put_compound_tail(page) {
page_head = compound_trans_head(page);
if (!__compound_tail_refcounted(page_head)) {
...
return ...;
}

flags = compound_lock_irqsave(page_head);
...

Sure, put_compound_tail() should be the simplified version of
put_compound_page() which doesn't dec page_head->_count; this is clear.

But afaics, compound_lock_irqsave() above looks unsafe without
get_page_unless_zero(page_head)? If we race with _split, page_head
can be freed and compound_lock() can race with, say, free_pages_check(),
which plays with page->flags?

So it seems that put_compound_tail() should also do get/put(head) like
put_compound_page() does, and this probably means we should factor out
the common code somehow.

Or I missed something?



OTOH, I can't really understand

if (likely(page != page_head && get_page_unless_zero(page_head)))

in __get_page_tail() and put_compound_page().

First of all, is it really possible that page == compound_trans_head(page)?
We already verified that PG_tail was set. Of course this bit can be cleared
and ->first_page can be a dangling pointer, but it can never be changed to
point to this page? (in fact, afaics it can't be changed at all as long as we
have a reference, but this doesn't matter).



And compound_lock_irqsave() looks racy even after get_page_unless_zero().

For example, suppose that page_head was already freed and then re-allocated
as (say) alloc_pages(__GFP_COMP, 1). get_page_unless_zero() can succeed right
after prep_new_page() does set_page_refcounted(). Now, can't compound_lock()
race with the non-atomic prep_compound_page()->__SetPageHead() ?



Finally. basepage_index(page) after put_page(page) in get_futex_key() looks
confusing imho. I think this is correct, we already checked PageAnon() so it
can't be a thp page. But probably this needs a comment and __basepage_index()
should do BUG_ON(!PageHuge()).

Oleg.

Oleg Nesterov

unread,
Dec 16, 2013, 8:20:01 PM12/16/13
to
Cleanup. Change __get_page_tail_foll() to use get_huge_page_tail()
to avoid the code duplication.

Signed-off-by: Oleg Nesterov <ol...@redhat.com>
---
mm/internal.h | 5 +----
1 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index a85a3ab..a346ba1 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -47,12 +47,9 @@ static inline void __get_page_tail_foll(struct page *page,
* page_cache_get_speculative()) on tail pages.
*/
VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
- VM_BUG_ON(atomic_read(&page->_count) != 0);
- VM_BUG_ON(page_mapcount(page) < 0);
if (get_page_head)
atomic_inc(&page->first_page->_count);
- if (compound_tail_refcounted(page->first_page))
- atomic_inc(&page->_mapcount);
+ get_huge_page_tail(page);
}

/*
--
1.5.5.1

Oleg Nesterov

unread,
Dec 16, 2013, 8:20:01 PM12/16/13
to
On top of

mm-tail-page-refcounting-optimization-for-slab-and-hugetlbfs.patch

should not be applied without the ack from Andrea.

On 12/13, Andrea Arcangeli wrote:
>
> The above diff looks a straightforward cleanup you could submit it as
> a separate patch in a v2 series.

OK, let me send this separately, because (afaics) put_compound_tail()
needs more thinking.

See also 2/2. Again, I won't argue if you dislike this change even if
it is correct, so please review and ack/nack. To me compound_head() in
get_huge_page_tail() looks confusing, as if get_huge_page_tail() can
accept a !PageTail page. But perhaps this is only because I am new to
this code.

Oleg.

Oleg Nesterov

unread,
Dec 16, 2013, 8:20:02 PM12/16/13
to
get_huge_page_tail()->compound_head() looks confusing. Every caller
must check PageTail(page), otherwise atomic_inc(&page->_mapcount) is
simply wrong if this page is compound-trans-head.

Signed-off-by: Oleg Nesterov <ol...@redhat.com>
---
include/linux/mm.h | 7 +++----
1 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 13bae9e..13495bd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -431,13 +431,12 @@ static inline bool compound_tail_refcounted(struct page *page)
static inline void get_huge_page_tail(struct page *page)
{
/*
- * __split_huge_page_refcount() cannot run
- * from under us.
- * In turn no need of compound_trans_head here.
+ * __split_huge_page_refcount() cannot run from under us.
*/
+ VM_BUG_ON(!PageTail(page));
VM_BUG_ON(page_mapcount(page) < 0);
VM_BUG_ON(atomic_read(&page->_count) != 0);
- if (compound_tail_refcounted(compound_head(page)))
+ if (compound_tail_refcounted(page->first_page))
atomic_inc(&page->_mapcount);
}

--
1.5.5.1

Andrea Arcangeli

unread,
Dec 16, 2013, 8:30:03 PM12/16/13
to
On Mon, Dec 16, 2013 at 07:36:18PM +0100, Oleg Nesterov wrote:
> On 12/13, Andrea Arcangeli wrote:
> >
> > On Fri, Dec 13, 2013 at 05:22:40PM +0100, Oleg Nesterov wrote:
> > >
> > > I'll try to make v2 based on -mm and your suggestions.
> >
> > Ok great!
>
> Yes, it would be great, but I need your help again ;)
>
>
> Let me quote the pseudo-code you sent me:
>
> put_compound_tail(page) {
> page_head = compound_trans_head(page);
> if (!__compound_tail_refcounted(page_head)) {
> ...
> return ...;
> }
>
> flags = compound_lock_irqsave(page_head);
> ...
>
> Sure, put_compound_tail() should be the simplified version of
> put_compound_page() which doesn't dec page_head->_count, this is clear.
>
> But afaics, compound_lock_irqsave() above looks unsafe without
> get_page_unless_zero(page_head) ? If we race with _split, page_head
> can be freed and compound_lock() can race with, say, free_pages_check()
> which plays with page->flags ?
>
> So it seems that put_compound_tail() should also do get/put(head) like
> put_compound_page() does, and this probably means we should factor out
> the common code somehow.
>

Yes it was supposed to do the get_page_unless_zero before the
compound_lock.

> Or I missed something?
>
>
>
> OTOH, I can't really understand
>
> if (likely(page != page_head && get_page_unless_zero(page_head)))
>
> in __get_page_tail() and put_compound_page().
>
> First of all, is it really possible that page == compound_trans_head(page)?

It is possible if it became a regular page because split_huge_page
ran in between.

compound_trans_head guarantees that it got us to a safe head page. And
the idea was to do get_page_unless_zero only on head pages and never
on tails, to be safer/simpler. The same goes for the compound_lock that
follows, still on the page_head where "page != page_head".

If you like, you can try to optimize away the check and reuse the
PageTail branch inside the compound_lock but I'm not sure if it's
worth it.

> We already verified that PG_tail was set. Of course this bit can be cleared
> and ->first_page can be a dangling pointer but it can never be changed to
> point to this page? (in fact, afaics it can be changed at all as long as we
> have a reference, but this doesn't matter).

If PG_tail gets cleared compound_trans_head returns "page".

> And compound_lock_irqsave() looks racy even after get_page_unless_zero().
>
> For example, suppose that page_head was already freed and then re-allocated
> as (say) alloc_pages(__GFP_COMP, 1). get_page_unless_zero() can succeed right
> after prep_new_page() does set_page_refcounted(). Now, can't compound_lock()
> race with the non-atomic prep_compound_page()->__SetPageHead() ?

Yes. We need to change to:

if (order && (gfp_flags & __GFP_COMP))
prep_compound_page(page, order);
smp_wmb();
/* as the compound_lock can be taken after it's refcounted */
set_page_refcounted(page);

__SetPageHead uses the bts asm insn, so literally only a "lock" prefix is
missing in an assembly instruction. So the race window is incredibly
small, but it must be fixed indeed. This also puts set_page_refcounted
as the last action of buffered_rmqueue so there shouldn't be any other
issues like this left in the page allocation code.

Can you reorder set_page_refcounted in your v2?

I wonder if arch_alloc_page needs refcount 1; it sets the page as
stable on s390. The other way around is to move prep_compound_page
before set_page_refcounted (though I think, if we can, keeping
set_page_refcounted at the very end with a comment is preferable).

> Finally. basepage_index(page) after put_page(page) in get_futex_key() looks
> confusing imho. I think this is correct, we already checked PageAnon() so it
> can't be a thp page. But probably this needs a comment and __basepage_index()
> should do BUG_ON(!PageHuge()).

This looks like a bug if you apply the patches to add THP in pagecache, and
BUG_ON(!PageHuge) would also break THP in pagecache later (though that is
safer than being silently broken like now).

It'd be safer to read the basepage_index into a local variable just before
doing the put_compound_tail, but I agree it's not going to make a
difference right now.

But the whole idea of working with the head of the THP even though the
address points to a tail looks flaky if there is a split (THP in
pagecache). I mean, chances are reading basepage_index(page) before
releasing the tail pin is not enough.

The Anon path looks sane for all cases, as it doesn't pretend to
return any information on the actual mapping and it limits itself to
doing the PageAnon check on the actual head that has PageAnon set, which
is still a valid thing to do even after a split, and in fact required if
a split didn't take place yet, as the tail "page" isn't anon but just a
tail.

Andrea Arcangeli

unread,
Dec 16, 2013, 8:30:03 PM12/16/13
to
On Mon, Dec 16, 2013 at 09:19:00PM +0100, Oleg Nesterov wrote:
> On top of
>
> mm-tail-page-refcounting-optimization-for-slab-and-hugetlbfs.patch
>
> should not be applied without the ack from Andrea.
>
> On 12/13, Andrea Arcangeli wrote:
> >
> > The above diff looks a straightforward cleanup you could submit it as
> > a separate patch in a v2 series.
>
> OK, let me send this separately, because (afaics) put_compound_tail()
> needs more thinking.
>
> See also 2/2. Again, I won't argue if you dislike this change even if
> it is correct, so please review and ack/nack. To me compound_head() in
> get_huge_page_tail() looks confusing, as if get_huge_page_tail() can
> accept a !PageTail page. But perhaps this is only because I am new to
> this code.

compound_head in get_huge_page_tail was just a more readable version
of page->first_page in __get_page_tail_foll. But page->first_page is
faster so it's better.

I reviewed all callers and there's no risk of the VM_BUG_ON triggering
but I prefer it too.

Both patches Acked.

Acked-by: Andrea Arcangeli <aarc...@redhat.com>

Oleg Nesterov

unread,
Dec 16, 2013, 8:50:02 PM12/16/13
to
On 12/16, Andrea Arcangeli wrote:
>
> On Mon, Dec 16, 2013 at 07:36:18PM +0100, Oleg Nesterov wrote:
> >
> > So it seems that put_compound_tail() should also do get/put(head) like
> > put_compound_page() does, and this probably means we should factor out
> > the common code somehow.
> >
>
> Yes it was supposed to do the get_page_unless_zero before the
> compound_lock.

OK,

> > OTOH, I can't really understand
> >
> > if (likely(page != page_head && get_page_unless_zero(page_head)))
> >
> > in __get_page_tail() and put_compound_page().
> >
> > First of all, is it really possible that page == compound_trans_head(page)?
>
> ...
>
> If PG_tail gets cleared compound_trans_head returns "page".

Damn, indeed, thanks.

> > And compound_lock_irqsave() looks racy even after get_page_unless_zero().
> >
> > For example, suppose that page_head was already freed and then re-allocated
> > as (say) alloc_pages(__GFP_COMP, 1). get_page_unless_zero() can succeed right
> > after prep_new_page() does set_page_refcounted(). Now, can't compound_lock()
> > race with the non-atomic prep_compound_page()->__SetPageHead() ?
>
> Yes. We need to change to:
>
> if (order && (gfp_flags & __GFP_COMP))
> prep_compound_page(page, order);
> smp_wmb();
> /* as the compound_lock can be taken after it's refcounted */
> set_page_refcounted(page);
>
> __SetPageHead uses bts asm insn so literally only a "lock" prefix is
> missing in a assembly instruction. So the race window is incredibly
> small, but it must be fixed indeed. This also puts set_page_refcounted
> as the last action of buffered_rmqueue so there shouldn't be any other
> issues like this left in the page allocation code.
>
> Can you reorder set_page_refcount in your v2?

OK. I'll try to make something on Wednesday.


> > Finally. basepage_index(page) after put_page(page) in get_futex_key() looks
> > confusing imho. I think this is correct, we already checked PageAnon() so it
> > can't be a thp page. But probably this needs a comment and __basepage_index()
> > should do BUG_ON(!PageHuge()).
>
> This looks a bug if you apply the patches to add THP in pagecache, and
> BUG_ON(!PageHuge) would also break THP in pagecache later (though
> safer than silently broken like now).
>
> It'd safer to read the basepage_index in a local variable just before
> doing the put_compund_tail but I agree it's not going to make a
> difference right now.

Yes, so lets discuss (and perhaps change) this later/separately.

Andrea, thanks again for your help.

Oleg.

Oleg Nesterov

unread,
Dec 17, 2013, 5:00:02 PM12/17/13
to
On 12/16, Oleg Nesterov wrote:
>
> On 12/16, Andrea Arcangeli wrote:
> >
> > On Mon, Dec 16, 2013 at 07:36:18PM +0100, Oleg Nesterov wrote:
> > >
> > > And compound_lock_irqsave() looks racy even after get_page_unless_zero().
> > >
> > > For example, suppose that page_head was already freed and then re-allocated
> > > as (say) alloc_pages(__GFP_COMP, 1). get_page_unless_zero() can succeed right
> > > after prep_new_page() does set_page_refcounted(). Now, can't compound_lock()
> > > race with the non-atomic prep_compound_page()->__SetPageHead() ?
> >
> > Yes. We need to change to:
> >
> > if (order && (gfp_flags & __GFP_COMP))
> > prep_compound_page(page, order);
> > smp_wmb();
> > /* as the compound_lock can be taken after it's refcounted */
> > set_page_refcounted(page);
> >
> > __SetPageHead uses bts asm insn so literally only a "lock" prefix is
> > missing in a assembly instruction. So the race window is incredibly
> > small, but it must be fixed indeed. This also puts set_page_refcounted
> > as the last action of buffered_rmqueue so there shouldn't be any other
> > issues like this left in the page allocation code.
> >
> > Can you reorder set_page_refcount in your v2?
>
> OK. I'll try to make something on Wednesday.

Yes, I will, but...

I can't stop thinking about another change. What if we simply change
__split_huge_page_refcount() to also do compound_lock/unlock(page_tail)
in a main loop?

This way we can greatly simplify get/put_page paths, we can rely on
compound_lock(sub-page) and avoid get_page_unless_zero(page_head).
Yes, this will make _split a bit slower, but PG_compound_lock should
not be contended? And we should change page_tail->flags carefully, but
this looks simple.

Or this is not possible/desirable?

Andrea Arcangeli

unread,
Dec 17, 2013, 6:10:01 PM12/17/13
to
That would be 512 nested spinlocks instead of 1; last time I did
something like that in mm_take_all_locks people weren't too pleased, as
it started firing lockdep complaints too. Generally I try to avoid
taking too many nested locks if I can.

mm_take_all_locks is fine because it only runs when you register an mm
with a device driver, so it is a very rare event and not performance
critical at all; it is a slow path by all means (it only runs when you
start a virtual machine or start X with nvidia, etc.). So it is not a
concern. split_huge_page, to the contrary, could run in a flood if
you're unlucky. split_huge_page is needed not just to handle non-THP-aware
paths that have mostly disappeared by now, but also when you truncate
a vma so that a THP doesn't fit in it anymore. So it's up to userland
how frequently it needs to run.

I think it's reasonable to consider it though, but then it's not
guaranteed that a put_page on a THP tail is more frequent than
split_huge_page. Keep in mind we do the get_page_unless_zero to
stabilize the head to take the compound_lock on it, only for the
tails, never for the heads. So this restricts it to _only_ the
put_page following a gup_fast. Only gup_fast can ever take a reference
on the tail pages of a THP. Nothing else can.

I intend to add a foll_flags parameter to gup_fast so we can remove
FOLL_GET from it in the KVM page fault, to avoid doing an atomic_inc
immediately followed by an atomic_dec when establishing sptes. So only
the transient mappings that cannot be converted to pin-less
mmu_notifier will have to run put_page on gup_fast pages.

In short you're only going to help O_DIRECT with that change, and
removing 1 locked op for every 4k written to disk may not offset the
cost of 511 locked ops in split_huge_page.

Still worth thinking about, but not an obvious win in my view (plus
the lockdep trouble with taking too many nested locks). Comments welcome.

Thanks!
Andrea

Oleg Nesterov

unread,
Dec 18, 2013, 7:20:02 PM12/18/13
to
On 12/17, Andrea Arcangeli wrote:
>
> On Tue, Dec 17, 2013 at 05:53:35PM +0100, Oleg Nesterov wrote:
> >
> > I can't stop thinking about another change. What if we simply change
> > __split_huge_page_refcount() to also do compound_lock/unlock(page_tail)
> > in a main loop?
> >
> > This way we can greatly simplify get/put_page paths, we can rely on
> > compound_lock(sub-page) and avoid get_page_unless_zero(page_head).
> > Yes, this will make _split a bit slower, but PG_compound_lock should
> > not be contended? And we should change page_tail->flags carefully, but
> > this looks simple.
> >
> > Or this is not possible/desirable?
>
> That would be 512 nested spinlocks instead of 1,

Yes. But, just in case, we do not need to take them all at once. We
only need to take/release the additional lock per page_tail in a loop.
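
Roughly this shape, as a sketch of what I mean (not a posted patch, and
the per-tail fixups are elided):

	/*
	 * sketch: bracket each tail's fixups in __split_huge_page_refcount()
	 * with that tail's own compound lock, one tail at a time
	 */
	for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
		struct page *page_tail = page + i;

		compound_lock(page_tail);
		/* ... existing _count / _mapcount / flags fixups ... */
		compound_unlock(page_tail);
	}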

> last time I did
> something like that in mm_take_all_locks people weren't too pleased as
> it started firing lockdeps complains too.

bit_spin_lock() doesn't have the lockdep annotations and it would be
trivial to add _nested if necessary.

> split_huge_page to the contrary could run in a flood if
> you're unlucky. split_huge_page is needed not just to handle non-THP
> aware paths that mostly disappeared by now, but also when you truncate
> a vma so that a THP doesn't fit in it anymore. So it's up to userland
> how frequently it needs to run.

Yes, I understand. I _think_ this should be fine performance-wise;
compared to the other things the split_huge_page() paths have to do, these 511
test_and_set_bit + clear_bit operations should not be noticeable.

But of course I can be wrong.

> I think it's reasonable to consider it though, but then it's not
> guaranteed that a put_page on a THP tail is more frequent than
> split_huge_page. Keep in mind we do the get_page_unless_zero to
> stabilize the head to take the compound_lock on it, only for the
> tails, never for the heads. So this restricts it to _only_ the
> put_page following a gup_fast. Only gup_fast can ever take a reference
> on the tail pages of a THP. Nothing else can.

I see. But my main point was simplification.

And I'm afraid this compound_lock() in get/put can race with someone
else playing with page->flags "because I own this freshly allocated
page". Perhaps.


However, I do understand your concerns; let's forget about this for now.


---------------------------------------------------------------------------
Please review the get_futex_key() changes we discussed before. As for the race
with prep_new_page(), I'll send another patch; it is completely orthogonal
to this series.

Testing: I simply have no idea how I can test this series, so I didn't
even try ;) I'll try to do something tomorrow, but in any case I have
to rely on your review.

Most of the cleanups in this series are not needed to change get_futex_key(),
but imho they make sense. And we need more cleanups after this series; I'll do
that later if this series makes any sense.

Oleg.

include/linux/mm.h | 2 +
kernel/futex.c | 37 +--------
mm/swap.c | 251 ++++++++++++++++++++++++++--------------------------
3 files changed, 128 insertions(+), 162 deletions(-)

Oleg Nesterov

unread,
Dec 18, 2013, 7:20:03 PM12/18/13
to
__get_page_tail() can use __put_nontail_page() instead of put_page()
if it races with split_huge_page(). This is what put_compound_page()
does, so this is equally safe. And this allows us to factor out this
code later.

Signed-off-by: Oleg Nesterov <ol...@redhat.com>
---
mm/swap.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index f83de1f..aae24fe 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -289,7 +289,7 @@ bool __get_page_tail(struct page *page)
}
compound_unlock_irqrestore(page_head, flags);
if (unlikely(!got))
- put_page(page_head);
+ __put_nontail_page(page_head);
}
return got;
}
--
1.5.5.1

Oleg Nesterov

unread,
Dec 18, 2013, 7:20:03 PM12/18/13
to
Both __put_page_tail() and __get_page_tail() need to carefully
take a reference on page_head, take compound_lock() and recheck
PageTail(page) under this lock.

This patch extracts this code into the new helper, it will have
another user. This also means that __get_page_tail() gets the
same VM_BUG_ON() checks.

Note: this change can also help if we decide to change the locking,
perhaps it makes sense to change __split_huge_page_refcount() to
also do compound_lock/unlock(page_tail) in a loop. In this case it
would be simple to adapt this helper and its usage.

Signed-off-by: Oleg Nesterov <ol...@redhat.com>
---
mm/swap.c | 99 +++++++++++++++++++++++++++---------------------------------
1 files changed, 45 insertions(+), 54 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 5f3dda6..972923d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -93,12 +93,45 @@ static void __put_nontail_page(struct page *page)
}
}

-static void __put_page_tail(struct page *page)
+static bool get_lock_thp_head(struct page *head, struct page *tail,
+ unsigned long *flags)
{
- struct page *page_head;
+ if (unlikely(head == tail || !get_page_unless_zero(head)))
+ return false;
+
+ /*
+ * page_head wasn't a dangling pointer but it may not be a head
+ * page anymore by the time we obtain the lock. That is ok as
+ * long as it can't be freed from under us.
+ */
+ *flags = compound_lock_irqsave(head);
+ if (likely(PageTail(tail))) {
+ VM_BUG_ON(head != tail->first_page);
+ VM_BUG_ON(atomic_read(&head->_count) <= 1);
+ VM_BUG_ON(atomic_read(&tail->_count) != 0);
+ VM_BUG_ON(page_mapcount(tail) <= 1);
+ return true;
+ }
+
+ compound_unlock_irqrestore(head, *flags);
+ /*
+ * The head page may have been freed and reallocated as a compound
+ * page of smaller order and then freed again. All we know is that
+ * it cannot have become: a THP page, a compound page of higher
+ * order, a tail page. That is because we still hold the refcount
+ * of the split THP tail and page_head was the THP head before the
+ * split.
+ */
+ __put_nontail_page(head);
+ return false;
+}

+
+static void __put_page_tail(struct page *page)
+{
/* __split_huge_page_refcount can run under us */
- page_head = compound_trans_head(page);
+ struct page *page_head = compound_trans_head(page);
+ unsigned long flags;

/*
* THP can not break up slab pages so avoid taking
@@ -150,45 +183,16 @@ static void __put_page_tail(struct page *page)
return;
}

- if (likely(page != page_head && get_page_unless_zero(page_head))) {
- unsigned long flags;
-
- /*
- * page_head wasn't a dangling pointer but it may not
- * be a head page anymore by the time we obtain the
- * lock. That is ok as long as it can't be freed from
- * under us.
- */
- flags = compound_lock_irqsave(page_head);
- if (unlikely(!PageTail(page))) {
- /* __split_huge_page_refcount run before us */
- compound_unlock_irqrestore(page_head, flags);
- /*
- * The head page may have been freed and reallocated
- * as a compound page of smaller order and then freed
- * again. All we know is that it cannot have become:
- * a THP page, a compound page of higher order, a tail
- * page. That is because we still hold the refcount of
- * the split THP tail and page_head was the THP head
- * before the split.
- */
- __put_nontail_page(page_head);
- goto out_put_single;
- }
- VM_BUG_ON(page_head != page->first_page);
+ if (likely(get_lock_thp_head(page_head, page, &flags))) {
/*
- * We can release the refcount taken by
- * get_page_unless_zero() now that
- * __split_huge_page_refcount() is blocked on the
+ * We can release the refcount taken by get_lock_thp_head()
+ * now that __split_huge_page_refcount() is blocked on the
* compound_lock.
*/
if (put_page_testzero(page_head))
VM_BUG_ON(1);
/* __split_huge_page_refcount will wait now */
- VM_BUG_ON(page_mapcount(page) <= 0);
atomic_dec(&page->_mapcount);
- VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
- VM_BUG_ON(atomic_read(&page->_count) != 0);
compound_unlock_irqrestore(page_head, flags);

__put_nontail_page(page_head);
@@ -224,9 +228,8 @@ bool __get_page_tail(struct page *page)
* proper PT lock that already serializes against
* split_huge_page().
*/
- unsigned long flags;
- bool got;
struct page *page_head = compound_trans_head(page);
+ unsigned long flags;

/* Ref to __put_page_tail() comment. */
if (!__compound_tail_refcounted(page_head)) {
@@ -254,25 +257,13 @@ bool __get_page_tail(struct page *page)
}
}

- got = false;
- if (likely(page != page_head && get_page_unless_zero(page_head))) {
- /*
- * page_head wasn't a dangling pointer but it
- * may not be a head page anymore by the time
- * we obtain the lock. That is ok as long as it
- * can't be freed from under us.
- */
- flags = compound_lock_irqsave(page_head);
- /* here __split_huge_page_refcount won't run anymore */
- if (likely(PageTail(page))) {
- __get_page_tail_foll(page, false);
- got = true;
- }
+ if (likely(get_lock_thp_head(page_head, page, &flags))) {
+ __get_page_tail_foll(page, false);
compound_unlock_irqrestore(page_head, flags);
- if (unlikely(!got))
- __put_nontail_page(page_head);
+ return true;
}
- return got;
+
+ return false;
}
EXPORT_SYMBOL(__get_page_tail);

--
1.5.5.1

Oleg Nesterov

unread,
Dec 18, 2013, 7:30:02 PM12/18/13
to
1. Add the new helper, compound_head_put_tail(page), which returns
the stable compound_head() and drops the reference on this page
if it is a sub-page of a __compound_tail_refcounted(head) compound page.

Note: this patch tries to be as simple as possible. But we need
to unify the new helper with __put_page_tail() later, or at least
factor out the common code.

2. Remove the nasty __get_user_pages_fast() code in get_futex_key(),
it can use the new helper to get page_head.

Note: with or without this change, basepage_index(page) after
put_page(page) looks confusing, to say the least; this will be
addressed later.

Suggested-by: Andrea Arcangeli <aarc...@redhat.com>
Signed-off-by: Oleg Nesterov <ol...@redhat.com>
---
include/linux/mm.h | 2 ++
kernel/futex.c | 37 +------------------------------------
mm/swap.c | 35 +++++++++++++++++++++++++++++++++++
3 files changed, 38 insertions(+), 36 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 13495bd..545df45 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -501,6 +501,8 @@ static inline void __ClearPageBuddy(struct page *page)
void put_page(struct page *page);
void put_pages_list(struct list_head *pages);

+struct page *compound_head_put_tail(struct page *page);
+
void split_page(struct page *page, unsigned int order);
int split_free_page(struct page *page);

diff --git a/kernel/futex.c b/kernel/futex.c
index c3a1a55..1fd7031 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -282,42 +282,7 @@ again:
else
err = 0;

-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- page_head = page;
- if (unlikely(PageTail(page))) {
- put_page(page);
- /* serialize against __split_huge_page_splitting() */
- local_irq_disable();
- if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) {
- page_head = compound_head(page);
- /*
- * page_head is valid pointer but we must pin
- * it before taking the PG_lock and/or
- * PG_compound_lock. The moment we re-enable
- * irqs __split_huge_page_splitting() can
- * return and the head page can be freed from
- * under us. We can't take the PG_lock and/or
- * PG_compound_lock on a page that could be
- * freed from under us.
- */
- if (page != page_head) {
- get_page(page_head);
- put_page(page);
- }
- local_irq_enable();
- } else {
- local_irq_enable();
- goto again;
- }
- }
-#else
- page_head = compound_head(page);
- if (page != page_head) {
- get_page(page_head);
- put_page(page);
- }
-#endif
-
+ page_head = compound_head_put_tail(page);
lock_page(page_head);

/*
diff --git a/mm/swap.c b/mm/swap.c
index 972923d..6742c85 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -215,6 +215,41 @@ void put_page(struct page *page)
EXPORT_SYMBOL(put_page);

/*
+ * Like compound_head() but also drops the additional reference
+ * this page can have. Unlike compound_head() it returns the page
+ * which has a reference, and can't race with split_huge_page().
+ */
+struct page *compound_head_put_tail(struct page *page)
+{
+ struct page *page_head;
+ unsigned long flags;
+
+ if (!PageTail(page))
+ return page;
+
+ page_head = compound_trans_head(page);
+
+ if (!__compound_tail_refcounted(page_head)) {
+ smp_rmb();
+ if (likely(PageTail(page)))
+ return page_head;
+ else
+ return page;
+ }
+
+ if (likely(get_lock_thp_head(page_head, page, &flags))) {
+ if (put_page_testzero(page_head))
+ VM_BUG_ON(1);
+ atomic_dec(&page->_mapcount);
+ compound_unlock_irqrestore(page_head, flags);
+
+ return page_head;
+ }
+
+ return page;
+}
+
+/*
* This function is exported but must not be called by anything other
* than get_page(). It implements the slow path of get_page().
*/
--
1.5.5.1

Oleg Nesterov

unread,
Dec 18, 2013, 7:30:02 PM12/18/13
to
Change release_pages() to avoid put_compound_page() and simply
call put_page(). This adds the additional PageCompound() check
into the unlikely path but allows us to change the semantics of
put_compound_page().

Signed-off-by: Oleg Nesterov <ol...@redhat.com>
---
mm/swap.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index aae24fe..b0e65b6 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -819,7 +819,7 @@ void release_pages(struct page **pages, int nr, int cold)
spin_unlock_irqrestore(&zone->lru_lock, flags);
zone = NULL;
}
- put_compound_page(page);
+ put_page(page);
continue;
}

--
1.5.5.1

Peter Zijlstra

unread,
Dec 18, 2013, 7:30:01 PM12/18/13
to
On Wed, Dec 18, 2013 at 08:20:08PM +0100, Oleg Nesterov wrote:
> +struct page *compound_head_put_tail(struct page *page)
> +{
> + struct page *page_head;
> + unsigned long flags;
> +
> + if (!PageTail(page))
> + return page;
> +
> + page_head = compound_trans_head(page);
> +
> + if (!__compound_tail_refcounted(page_head)) {

This barrier is missing a comment describing the ordering and the
pairing.

> + smp_rmb();
> + if (likely(PageTail(page)))
> + return page_head;
> + else
> + return page;
> + }
> +
> + if (likely(get_lock_thp_head(page_head, page, &flags))) {
> + if (put_page_testzero(page_head))
> + VM_BUG_ON(1);
> + atomic_dec(&page->_mapcount);
> + compound_unlock_irqrestore(page_head, flags);
> +
> + return page_head;
> + }
> +
> + return page;
> +}

Oleg Nesterov

unread,
Dec 18, 2013, 7:50:02 PM12/18/13
to
On 12/18, Peter Zijlstra wrote:
>
> On Wed, Dec 18, 2013 at 08:20:08PM +0100, Oleg Nesterov wrote:
> > +struct page *compound_head_put_tail(struct page *page)
> > +{
> > + struct page *page_head;
> > + unsigned long flags;
> > +
> > + if (!PageTail(page))
> > + return page;
> > +
> > + page_head = compound_trans_head(page);
> > +
> > + if (!__compound_tail_refcounted(page_head)) {
>
> This barrier is missing a comment describing the ordering and the
> pairing.

Yes, and this function lacks other comments it needs; they should be
copy-and-pasted from the very similar __put_page_tail().

But I am going to unify them later if this series passes review, so
can't the comments wait till then?

OTOH, I won't mind sending a v2 with at least the additional comment
about this rmb(). Or perhaps I should simply add /* See the comments in
__put_page_tail() */ to the new helper.

OK. But I'll wait for other reviews first.

Oleg.

Linus Torvalds

unread,
Dec 18, 2013, 9:40:03 PM12/18/13
to
On Wed, Dec 18, 2013 at 11:20 AM, Oleg Nesterov <ol...@redhat.com> wrote:
> Both __put_page_tail() and __get_page_tail() need to carefully
> take a reference on page_head, take compound_lock() and recheck
> PageTail(page) under this lock.

Btw I suspect this is just disgustingly expensive, and I don't think
there's a really good reason for it.

May I suggest:

- getting rid of the PG_compound_lock bit-lock

bitlocks are expensive and unfair, and don't even get lockdep checking

- replace it with a (small, say 32-256 entries) array of hashed sequence locks

- just hash based on the "struct page" pointer, and teach this code
to do a read_seqcount_begin/read_seqcount_retry sequence instead for
the page lookup.

I think you can get rid of all the irq disables too, and the sequence
lock should be pure memory reads for the read-case that we care about.

Hmm? This is obviously orthogonal to your series, I just reacted to
seeing that bitlock thing that needs atomics for both locking and
unlocking and the irq disable, and just generally looks like the worst
possible way to do these things.

Linus
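
To make the suggestion a bit more concrete, here is a minimal sketch of the
data structure being proposed. Everything below is illustrative only and not
code from this thread: the names (compound_seq, compound_seqcount,
head_still_valid), the table size, and hashing on the head page are all
assumptions, and the refcounting that __get_page_tail()/__put_page_tail()
would still need around it is deliberately left out.

#include <linux/mm.h>
#include <linux/hash.h>
#include <linux/seqlock.h>

/* Hypothetical: a small hashed array of seqcounts replacing the
 * per-page PG_compound_lock bit-lock. */
#define COMPOUND_SEQ_BITS	6	/* 64 entries */
static seqcount_t compound_seq[1 << COMPOUND_SEQ_BITS];

static inline seqcount_t *compound_seqcount(struct page *head)
{
	return &compound_seq[hash_ptr(head, COMPOUND_SEQ_BITS)];
}

/* Write side: __split_huge_page_refcount() would bracket the split with
 * these (writers must still be serialized among themselves). */
static inline void compound_seq_begin(struct page *head)
{
	write_seqcount_begin(compound_seqcount(head));
}

static inline void compound_seq_end(struct page *head)
{
	write_seqcount_end(compound_seqcount(head));
}

/* Read side: re-validate a speculative tail->head lookup with pure memory
 * reads, retrying if a split ran concurrently, instead of spinning on the
 * compound lock with irqs disabled. */
static inline bool head_still_valid(struct page *head, struct page *tail)
{
	unsigned int seq;
	bool ok;

	do {
		seq = read_seqcount_begin(compound_seqcount(head));
		ok = PageTail(tail) && tail->first_page == head;
	} while (read_seqcount_retry(compound_seqcount(head), seq));

	return ok;
}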

Oleg Nesterov

unread,
Dec 19, 2013, 4:30:03 PM12/19/13
to

On 12/18, Linus Torvalds wrote:
>
> On Wed, Dec 18, 2013 at 11:20 AM, Oleg Nesterov <ol...@redhat.com> wrote:
> > Both __put_page_tail() and __get_page_tail() need to carefully
> > take a reference on page_head, take compound_lock() and recheck
> > PageTail(page) under this lock.
>
> Btw I suspect this is just disgustingly expensive, and I don't think
> there's a really good reason for it.
>
> May I suggest:
>
> - getting rid of the PG_compound_lock bit-lock
>
> bitlocks are expensive and unfair, and don't even get lockdep checking
>
> - replace it with a (small, say 32-256 entries) array of hashed sequence locks
>
> - just hash based on the "struct page" pointer, and teach this code
> to do a read_seqcount_begin/read_seqcount_retry sequence instead for
> the page lookup.

Yes, I thought about this too but didn't dare to suggest it. I am not sure
about the seqlock/irq interaction and other details; this needs more
discussion anyway. And of course I am not sure this will actually be better.

> This is obviously orthogonal to your series,

Yes.

But please note that one of the reasons for the new helper is to simplify
the potential locking changes. The changelog only mentions the "even more
bitlocks" change, but this doesn't matter.

And in fact I think this allows more cleanups even if we do not change
the locking: get_lock_thp_head() could return page_head or NULL, and
tail != head becomes just another PageTail() check. Perhaps.

Oleg.
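
Spelled out, the cleanup Oleg hints at could look roughly like the sketch
below. This is not a patch from the series, just an illustration that would
sit in mm/swap.c next to the helpers the series already introduces
(compound_trans_head(), __put_nontail_page(), the compound lock).

/* Sketch only: return the pinned and compound-locked head page, or NULL
 * if the page is not (or is no longer) a THP tail.  The head == tail
 * case then falls into the same NULL outcome as losing the race with
 * __split_huge_page_refcount(). */
static struct page *get_lock_thp_head(struct page *tail, unsigned long *flags)
{
	struct page *head = compound_trans_head(tail);

	if (unlikely(head == tail || !get_page_unless_zero(head)))
		return NULL;

	/*
	 * head wasn't a dangling pointer, but it may not be a head
	 * page anymore by the time we obtain the lock; recheck.
	 */
	*flags = compound_lock_irqsave(head);
	if (likely(PageTail(tail)))
		return head;

	compound_unlock_irqrestore(head, *flags);
	__put_nontail_page(head);	/* split ran before we got the lock */
	return NULL;
}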

Oleg Nesterov

unread,
Dec 19, 2013, 7:10:02 PM12/19/13
to
On 12/16, Andrea Arcangeli wrote:
>
> Can you reorder set_page_refcount in your v2?

Please see the patch.

> I wonder if arch_alloc_page needs refcount 1, it sets the page as
> stable on s390.

Obviously I have no idea what set_page_stable() does, but it works
with page_to_phys(), so it is unlikely that the content of "struct page"
can matter. And only s390 defines HAVE_ARCH_ALLOC_PAGE, so I added
Martin and Heiko.

> the other way around is to move prep_compound_page
> before set_page_refcounted (though I think if we can, keeping the
> refcounted at the very last with a comment is preferable).

Yes, yes, this looks much more natural.

Oleg.

Oleg Nesterov

unread,
Dec 19, 2013, 7:10:02 PM12/19/13
to
get/put_page(thp_tail) paths do get_page_unless_zero(page_head) +
compound_lock(). In theory this page_head can be already freed and
reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case
get_page_unless_zero() can succeed right after set_page_refcounted(),
and compound_lock() can race with the non-atomic __SetPageHead().

Perhaps we should rework the thp locking (under discussion), but
until then this patch moves set_page_refcounted() and adds wmb()
to ensure that page->_count != 0 comes as a last change.

I am not sure about other callers of set_page_refcounted(), but at
first glance they look fine to me.

Signed-off-by: Oleg Nesterov <ol...@redhat.com>
---
mm/page_alloc.c | 10 ++++++++--
1 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 115b23b..9402337 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -865,8 +865,6 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
}

set_page_private(page, 0);
- set_page_refcounted(page);
-
arch_alloc_page(page, order);
kernel_map_pages(page, 1 << order, 1);

@@ -876,6 +874,14 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
if (order && (gfp_flags & __GFP_COMP))
prep_compound_page(page, order);

+ /*
+ * Make sure the caller of get_page_unless_zero() will see the
+ * fully initialized page. Say, to ensure that compound_lock()
+ * can't race with the non-atomic __SetPage*() above.
+ */
+ smp_wmb();
+ set_page_refcounted(page);
+
return 0;
}

--
1.5.5.1

Martin Schwidefsky

unread,
Dec 20, 2013, 2:20:01 PM12/20/13
to
On Thu, 19 Dec 2013 20:08:46 +0100
Oleg Nesterov <ol...@redhat.com> wrote:

> On 12/16, Andrea Arcangeli wrote:
> >
> > Can you reorder set_page_refcount in your v2?
>
> Please see the patch.
>
> > I wonder if arch_alloc_page needs refcount 1, it sets the page as
> > stable on s390.
>
> Obviously I have no idea what set_page_stable() does, but it works
> with page_to_phys(), so it is unlikely that the content of "struct page"
> can matter. And only s390 defines HAVE_ARCH_ALLOC_PAGE, so I added
> Martin and Heiko.

On s390 the arch_alloc_page primitive is used to tell the hypervisor
that a page is going to be used. While the page is free it is marked
as "unused", which allows the hypervisor to throw away the page content
if the page is selected to be swapped. We do have a patch to add the
host support for KVM somewhere in our queue.

The content of the "struct page" does not matter at all for the
set-stable/set-unused state transition, s390 does not care about
the refcount in its arch_alloc_page function.

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

Andrea Arcangeli

unread,
Dec 23, 2013, 11:50:02 AM12/23/13
to
On Thu, Dec 19, 2013 at 08:09:20PM +0100, Oleg Nesterov wrote:
> get/put_page(thp_tail) paths do get_page_unless_zero(page_head) +
> compound_lock(). In theory this page_head can be already freed and
> reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case
> get_page_unless_zero() can succeed right after set_page_refcounted(),
> and compound_lock() can race with the non-atomic __SetPageHead().
>
> Perhaps we should rework the thp locking (under discussion), but
> until then this patch moves set_page_refcounted() and adds wmb()
> to ensure that page->_count != 0 comes as a last change.
>
> I am not sure about other callers of set_page_refcounted(), but at
> first glance they look fine to me.
>
> Signed-off-by: Oleg Nesterov <ol...@redhat.com>

Acked-by: Andrea Arcangeli <aarc...@redhat.com>

Only one improvement possible, the smp_wmb() could have been put under
CONFIG_TRANSPARENT_HUGEPAGE somehow. No difference for x86-64 though.

Thanks,
Andrea

Oleg Nesterov

unread,
Jan 3, 2014, 8:00:03 PM1/3/14
to
get/put_page(thp_tail) paths do get_page_unless_zero(page_head) +
compound_lock(). In theory this page_head can be already freed and
reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case
get_page_unless_zero() can succeed right after set_page_refcounted(),
and compound_lock() can race with the non-atomic __SetPageHead().

Perhaps we should rework the thp locking (under discussion), but
until then this patch moves set_page_refcounted() and adds wmb()
to ensure that page->_count != 0 comes as a last change.

I am not sure about other callers of set_page_refcounted(), but at
first glance they look fine to me.

Signed-off-by: Oleg Nesterov <ol...@redhat.com>
Acked-by: Andrea Arcangeli <aarc...@redhat.com>
---
mm/page_alloc.c | 12 ++++++++++--
1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 115b23b..d323102 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -865,8 +865,6 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
}

set_page_private(page, 0);
- set_page_refcounted(page);
-
arch_alloc_page(page, order);
kernel_map_pages(page, 1 << order, 1);

@@ -876,6 +874,16 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
if (order && (gfp_flags & __GFP_COMP))
prep_compound_page(page, order);

+ /*
+ * Make sure the caller of get_page_unless_zero() will see the
+ * fully initialized page. Say, to ensure that compound_lock()
+ * can't race with the non-atomic __SetPage*() above.
+ */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ smp_wmb();
+#endif
+ set_page_refcounted(page);
+
return 0;
}

--
1.5.5.1


Oleg Nesterov

unread,
Jan 3, 2014, 8:00:03 PM1/3/14
to
On 12/23, Andrea Arcangeli wrote:
>
> Acked-by: Andrea Arcangeli <aarc...@redhat.com>
>
> Only one improvement possible, the smp_wmb() could have been put under
> CONFIG_TRANSPARENT_HUGEPAGE somehow. No difference for x86-64 though.

I thought it would be better to serialize it with get_page_unless_zero()
in general, but OK. As the changelog says, I didn't find another problematic
caller, so please see v2. I preserved your ack, thanks.

And I would like to know your opinion about other changes I sent ;)

Oleg.

Andrew Morton

unread,
Jan 3, 2014, 9:10:01 PM1/3/14
to
On Fri, 3 Jan 2014 20:55:47 +0100 Oleg Nesterov <ol...@redhat.com> wrote:

> get/put_page(thp_tail) paths do get_page_unless_zero(page_head) +
> compound_lock(). In theory this page_head can be already freed and
> reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case
> get_page_unless_zero() can succeed right after set_page_refcounted(),
> and compound_lock() can race with the non-atomic __SetPageHead().

Would be useful to mention that these things are happening inside
prep_compound_page() (yes?).

> Perhaps we should rework the thp locking (under discussion), but
> until then this patch moves set_page_refcounted() and adds wmb()
> to ensure that page->_count != 0 comes as a last change.
>
> I am not sure about other callers of set_page_refcounted(), but at
> first glance they look fine to me.

I don't get it. We're in prep_new_page() - this page is freshly
allocated and no other thread yet has any means by which to look it up
and start fiddling with it?

Oleg Nesterov

unread,
Jan 4, 2014, 4:50:02 PM1/4/14
to
On 01/03, Andrew Morton wrote:
>
> On Fri, 3 Jan 2014 20:55:47 +0100 Oleg Nesterov <ol...@redhat.com> wrote:
>
> > get/put_page(thp_tail) paths do get_page_unless_zero(page_head) +
> > compound_lock(). In theory this page_head can be already freed and
> > reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case
> > get_page_unless_zero() can succeed right after set_page_refcounted(),
> > and compound_lock() can race with the non-atomic __SetPageHead().
>
> Would be useful to mention that these things are happening inside
> prep_compound_page() (yes?).

Agreed. Added "in prep_compound_opage()" into the changelog:

get/put_page(thp_tail) paths do get_page_unless_zero(page_head) +
compound_lock(). In theory this page_head can be already freed and
reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case
get_page_unless_zero() can succeed right after set_page_refcounted(),
and compound_lock() can race with the non-atomic __SetPageHead() in
prep_compound_page().

Perhaps we should rework the thp locking (under discussion), but
until then this patch moves set_page_refcounted() and adds wmb()
to ensure that page->_count != 0 comes as a last change.

I am not sure about other callers of set_page_refcounted(), but at
first glance they look fine to me.

or should I send v3?

> > Perhaps we should rework the thp locking (under discussion), but
> > until then this patch moves set_page_refcounted() and adds wmb()
> > to ensure that page->_count != 0 comes as a last change.
> >
> > I am not sure about other callers of set_page_refcounted(), but at
> > first glance they look fine to me.
>
> I don't get it. We're in prep_new_page() - this page is freshly
> allocated and no other thread yet has any means by which to look it up
> and start fiddling with it?

Yes, but thp can access this page_head via stale pointer, tail->first_page,
if it races with split_huge_page_refcount(). In this case we rely on
compound_lock() to detect this race, the problem is that compound_lock()
itself can race with head_page->flags manipulations.

For example, __get_page_tail() roughly does:

// PageTail(page) was already checked

page_head = page->first_page;

/* WINDOW */

get_page_unless_zero(page_head);

compound_lock(page_head);

recheck PageTail(page) to ensure page_head is still valid

However, in the WINDOW above, split_huge_page() can split this huge page.
After that its head can be freed and reallocated. Of course, I don't think
it is possible to hit this race in practice, but still this looks wrong.

Oleg.

Mel Gorman

unread,
Jan 8, 2014, 12:00:03 PM1/8/14
to
On Sat, Jan 04, 2014 at 05:43:47PM +0100, Oleg Nesterov wrote:
> On 01/03, Andrew Morton wrote:
> >
> > On Fri, 3 Jan 2014 20:55:47 +0100 Oleg Nesterov <ol...@redhat.com> wrote:
> >
> > > get/put_page(thp_tail) paths do get_page_unless_zero(page_head) +
> > > compound_lock(). In theory this page_head can be already freed and
> > > reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case
> > > get_page_unless_zero() can succeed right after set_page_refcounted(),
> > > and compound_lock() can race with the non-atomic __SetPageHead().
> >
> > Would be useful to mention that these things are happening inside
> > prep_compound_page() (yes?).
>
> Agreed. Added "in prep_compound_opage()" into the changelog:
>
> get/put_page(thp_tail) paths do get_page_unless_zero(page_head) +
> compound_lock(). In theory this page_head can be already freed and
> reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case
> get_page_unless_zero() can succeed right after set_page_refcounted(),
> and compound_lock() can race with the non-atomic __SetPageHead() in
> prep_compound_page().
>
> Perhaps we should rework the thp locking (under discussion), but
> until then this patch moves set_page_refcounted() and adds wmb()
> to ensure that page->_count != 0 comes as a last change.
>
> I am not sure about other callers of set_page_refcounted(), but at
> first glance they look fine to me.
>
> or should I send v3?
>

This patch is putting a write barrier in the page allocator fast path and
that is going to be a leading cause of Sad Face. We already have seen
large regressions before when write barriers were introduced to the page
allocator paths for cpusets. Sticking it under CONFIG_TRANSPARENT_HUGEPAGE
does not really address the issue.

> > > Perhaps we should rework the thp locking (under discussion), but
> > > until then this patch moves set_page_refcounted() and adds wmb()
> > > to ensure that page->_count != 0 comes as a last change.
> > >
> > > I am not sure about other callers of set_page_refcounted(), but at
> > > first glance they look fine to me.
> >
> > I don't get it. We're in prep_new_page() - this page is freshly
> > allocated and no other thread yet has any means by which to look it up
> > and start fiddling with it?
>
> Yes, but thp can access this page_head via stale pointer, tail->first_page,
> if it races with split_huge_page_refcount().

To justify the introduction of a performance regression we need to be 100%
sure this race actually exists and not just theoretical. I had lost track
of this thread but did not see a description of how this bug can actually
happen. At the risk of sounding stupid -- what are the actual circumstances
when this race can occur?

For example, in the reclaim paths, we are going to be dealing with the head
pages as they are on the LRU. They get split into base pages and then the
compound handling becomes irrelevant. I cannot see problems there.

For futex, the THP page (and the tail) must have been discovered via
the page tables in which case the page tables are temporarily preventing
the page being freed to the allocator. GUP fast paths are protected from
parallel teardowns and the slow paths are protected from parallel unmaps
and frees by mmap_sem.

Compaction and other walkers mostly deal with the head and/or deal with
pages on the LRU where there will be an elevated reference count.

The changelog needs a specific example where the problem can occur and
even then we should consider if there is an option other than smacking
the page allocator.

> In this case we rely on
> compound_lock() to detect this race; the problem is that compound_lock()
> itself can race with head_page->flags manipulations.
>
> For example, __get_page_tail() roughly does:
>
> // PageTail(page) was already checked
>
> page_head = page->first_page;
>
> /* WINDOW */
>
> get_page_unless_zero(page_head);
>
> compound_lock(page_head);
>
> recheck PageTail(page) to ensure page_head is still valid
>
> However, in the WINDOW above, split_huge_page() can split this huge page.
> After that its head can be freed and reallocated. Of course, I don't think
> it is possible to hit this race in practice, but still this looks wrong.
>

I can't think of a reason why we would actually hit that race in practice
but I might have tunnel vision due to disliking that barrier so much. Unless
there is an example with no other possible solution, I do not think we
should stick a write barrier into the page allocator fast path.

--
Mel Gorman
SUSE Labs

Mel Gorman

unread,
Jan 8, 2014, 1:20:02 PM1/8/14
to
Peter Zijlstra correctly pointed out to me that on x86 we generally
would not care about or notice a write barrier, as it is almost always a
no-op. x86 (which is all I test any more) can execute an sfence for an
smp_wmb, but not in any configuration that matters. The previous barrier
damage in page_alloc.c was due to full barriers, but I generally assume
barriers have a cost in core code when I see them, regardless of the
underlying architecture details. So 99% of the time we will not care and
I won't be making Sad Face, but eventually someone using an affected
architecture will whinge -- ppc64 probably, as write barriers on sparc
are compile barriers.

Oleg Nesterov

unread,
Jan 8, 2014, 4:20:02 PM1/8/14
to
On 01/08, Mel Gorman wrote:
>
> On Sat, Jan 04, 2014 at 05:43:47PM +0100, Oleg Nesterov wrote:
> >
> > get/put_page(thp_tail) paths do get_page_unless_zero(page_head) +
> > compound_lock(). In theory this page_head can be already freed and
> > reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case
> > get_page_unless_zero() can succeed right after set_page_refcounted(),
> > and compound_lock() can race with the non-atomic __SetPageHead() in
> > prep_compound_page().
> >
> This patch is putting a write barrier in the page allocator fast path and
> that is going to be a leading cause of Sad Face. We already have seen
> large regressions before when write barriers were introduced to the page
> allocator paths for cpusets. Sticking it under CONFIG_TRANSPARENT_HUGEPAGE
> does not really address the issue.

As you already mentioned in another email, smp_wmb() is mostly a nop,
on x86_64 at least. Although perhaps it would be nice to have

static inline void atomic_store_release(atomic_t *v, int i)
{
smp_store_release(&v->counter, i);
}
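
Assuming such a helper existed (it is only being floated here, not part of
any posted patch), the tail of prep_new_page() could then publish the
refcount with release semantics instead of the open-coded smp_wmb();
roughly:

	/* ... end of prep_new_page(), sketch only ... */
	if (order && (gfp_flags & __GFP_COMP))
		prep_compound_page(page, order);

	/*
	 * Publish _count last, with release semantics: whoever succeeds
	 * in get_page_unless_zero() is then guaranteed to observe the
	 * fully initialized (compound) page.
	 */
	atomic_store_release(&page->_count, 1);

	return 0;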

> > Yes, but thp can access this page_head via stale pointer, tail->first_page,
> > if it races with split_huge_page_refcount().
>
> To justify the introduction of a performance regression we need to be 100%
> sure this race actually exists

See below. But let me remind that I never looked at this code before,
I can be easily wrong.

> and not just theoretical.

It is theoretical anyway, I guess.

> For futex, the THP page (and the tail) must have been discovered via
> the page tables in which case the page tables are temporarily preventing
> the page being freed to the allocator.

Yes. But, for example, get_futex_key() does

if (unlikely(PageTail(page))) {
put_page(page);

why can't this put_page() race with _split? If nothing else, another thread
can unmap part of this vma.

> > For example, __get_page_tail() roughly does:
> >
> > // PageTail(page) was already checked
> >
> > page_head = page->first_page;
> >
> > /* WINDOW */
> >
> > get_page_unless_zero(page_head);
> >
> > compound_lock(page_head);
> >
> > recheck PageTail(page) to ensure page_head is still valid
> >
> > However, in the WINDOW above, split_huge_page() can split this huge page.
> > After that its head can be freed and reallocated. Of course, I don't think
> > it is possible to hit this race in practice, but still this looks wrong.
> >
>
> I can't think of a reason why we would actually hit that race in practice

Agreed, the window is tiny; it is unlikely this is possible.

> I do not think we
> should stick a write barrier into the page allocator fast path.

OK, I won't argue, I leave this to you and Andrea.


But I still think this code needs other cleanups/simplifications. In
particular get_futex_key()->__get_user_pages_fast() should die imho.

Oleg.

Mel Gorman

unread,
Jan 8, 2014, 6:10:02 PM1/8/14
to
On Wed, Jan 08, 2014 at 05:13:38PM +0100, Oleg Nesterov wrote:
> On 01/08, Mel Gorman wrote:
> >
> > On Sat, Jan 04, 2014 at 05:43:47PM +0100, Oleg Nesterov wrote:
> > >
> > > get/put_page(thp_tail) paths do get_page_unless_zero(page_head) +
> > > compound_lock(). In theory this page_head can be already freed and
> > > reallocated as alloc_pages(__GFP_COMP, smaller_order). In this case
> > > get_page_unless_zero() can succeed right after set_page_refcounted(),
> > > and compound_lock() can race with the non-atomic __SetPageHead() in
> > > prep_compound_page().
> > >
> > This patch is putting a write barrier in the page allocator fast path and
> > that is going to be a leading cause of Sad Face. We already have seen
> > large regressions before when write barriers were introduced to the page
> > allocator paths for cpusets. Sticking it under CONFIG_TRANSPARENT_HUGEPAGE
> > does not really address the issue.
>
> As you already mentioned in another email, smp_wmb() is mostly a nop,
> on x86_64 at least.

Which sometimes means that it'll just take longer for someone to find it
and bitch about it.

> Although perhaps it would be nice to have
>
> static inline void atomic_store_release(atomic_t *v, int i)
> {
> smp_store_release(&v->counter, i);
> }
>
> > > Yes, but thp can access this page_head via stale pointer, tail->first_page,
> > > if it races with split_huge_page_refcount().
> >
> > To justify the introduction of a performance regression we need to be 100%
> > sure this race actually exists
>
> See below. But let me remind that I never looked at this code before,
> I can be easily wrong.
>
> > and not just theoretical.
>
> It is theoretical anyway, I guess.
>
> > For futex, the THP page (and the tail) must have been discovered via
> > the page tables in which case the page tables are temporarily preventing
> > the page being freed to the allocator.
>
> Yes. But, for example, get_futex_key() does
>
> if (unlikely(PageTail(page))) {
> put_page(page);
>
> why can't this put_page() race with _split? If nothing else, another thread
> can unmap part of this vma.
>

The race is not prevented but that does not mean it matters. The basic
scenario is a split that starts after the PageTail check but before the
put_page in get_futex_key:

CPU A                                           CPU B
get_futex_key
  -> fast gup; page table removal prevents
     parallel unmap and free
  -> gup_huge_pmd (arch/x86/mm/gup.c at least)
     -> get_huge_page_tail (increment tail page _mapcount)
     -> get_huge_page_multiple (increment ref on head page)
  -> Check PageTail
                                                split_huge_page_to_list
                                                  -> split_huge_page_refcount
                                                     spin_lock_irq(lru_lock)
                                                     compound_lock
  -> put_page(tail_page)
     -> put_compound_page
        looks up head page
        takes reference unless zero
        compound_lock (blocks)
                                                     complete split
                                                     compound_unlock
        check PageTail

This put_page blocks on the compound lock, finds the page is no longer a
PageTail (the compound lock correctly serialises it against the split) and
backs out gracefully. So sure, splits can race, but that case is handled.

A parallel unmap cannot free the page because of the reference that
get_huge_page_multiple takes in the gup path; the page is held in place
until put_compound_page releases that reference later.

I still don't see the case where a page gets freed to the page allocator
that causes weird problems here. Unfortunately, I also recognise I have
tunnel vision because subconsciously I don't *want* to see a problem here
that justifies adding a write barrier. Andrea may stomp all over this
reasoning in which case we'll get a good comment for the smp_wmb :/

--
Mel Gorman
SUSE Labs