
[PATCH 00/13] mm: preemptibility -v2


Peter Zijlstra

Apr 8, 2010, 3:40:02 PM
Hi,

This (still incomplete) patch-set makes part of the mm a lot more preemptible.
It converts i_mmap_lock and anon_vma->lock to mutexes. On the way there it
also makes mmu_gather preemptible.

The main motivation was making mm_take_all_locks() preemptible, since it
appears people are nesting hundreds of spinlocks there.

The side-effects are that we can finally make mmu_gather preemptible, something
which lots of people have wanted to do for a long time.

It also gets us anon_vma refcounting which seems to be wanted by KSM as well as
Mel's compaction work.

This patch-set seems to build and boot on my x86_64 machines and even builds a
kernel. I've also attempted powerpc and sparc, which I've compile tested with
their respective defconfigs, remaining are (afaikt the rest uses the generic
tlb bits):

- s390
- ia64
- arm
- superh
- um

From those, s390 and ia64 look 'interesting', arm and superh seem very similar
and should be relatively easy (-rt has a patchlet for arm iirc).

What kind of performance tests would people have me run on this to satisfy
their need for numbers? I've done a kernel build on x86_64 and if anything that
was slightly faster with these patches, but it was well within the noise
levels so it might be heat noise I'm looking at ;-)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Peter Zijlstra

Apr 8, 2010, 3:40:02 PM
mutex-is-contended.patch

Peter Zijlstra

Apr 8, 2010, 3:40:01 PM
x86-remove-quicklist-include.patch

Peter Zijlstra

Apr 8, 2010, 3:40:02 PM
mutex_lock_nest_lock.patch

Peter Zijlstra

Apr 8, 2010, 3:40:03 PM
mm-anon_vma-ref.patch

Peter Zijlstra

Apr 8, 2010, 3:40:03 PM
mm-mutex.patch

Peter Zijlstra

Apr 8, 2010, 3:40:03 PM
powerpc-gup_fast-rcu.patch

Peter Zijlstra

Apr 8, 2010, 3:40:01 PM
mm-page_lock_anon_vma.patch

Peter Zijlstra

Apr 8, 2010, 3:40:03 PM
mm-opt-rmap.patch

Peter Zijlstra

Apr 8, 2010, 3:40:02 PM
mm-preempt-tlb-gather-sparc.patch

Peter Zijlstra

Apr 8, 2010, 3:40:04 PM
mm-preempt-tlb-gather-power.patch

Peter Zijlstra

Apr 8, 2010, 3:40:03 PM
mm-use-anon_vma-ref.patch

Peter Zijlstra

Apr 8, 2010, 4:40:03 PM
On Thu, 2010-04-08 at 13:29 -0700, David Miller wrote:
> From: Peter Zijlstra <a.p.zi...@chello.nl>
> Date: Thu, 08 Apr 2010 21:17:37 +0200

>
> > This patch-set seems to build and boot on my x86_64 machines and even builds a
> > kernel. I've also attempted powerpc and sparc, which I've compile tested with
> > their respective defconfigs, remaining are (afaikt the rest uses the generic
> > tlb bits):
>
> Did a build and test boot of this on my 128-cpu Niagara2 box, seems to
> work basically fine.

Wheee, thanks Dave!

David Miller

Apr 8, 2010, 4:40:03 PM
From: Peter Zijlstra <a.p.zi...@chello.nl>
Date: Thu, 08 Apr 2010 21:17:37 +0200

> This patch-set seems to build and boot on my x86_64 machines and even builds a


> kernel. I've also attempted powerpc and sparc, which I've compile tested with
> their respective defconfigs, remaining are (afaikt the rest uses the generic
> tlb bits):

Did a build and test boot of this on my 128-cpu Niagara2 box, seems to
work basically fine.

Rik van Riel

Apr 8, 2010, 4:40:04 PM
On 04/08/2010 03:17 PM, Peter Zijlstra wrote:
> The powerpc page table freeing relies on the fact that IRQs hold off
> an RCU grace period, this is currently true for all existing RCU
> implementations but is not an assumption Paul wants to support.
>
> Therefore, also take the RCU read lock along with disabling IRQs to
> ensure the RCU grace period does at least cover these lookups.
>
> Requested-by: Paul E. McKenney<pau...@linux.vnet.ibm.com>
> Signed-off-by: Peter Zijlstra<a.p.zi...@chello.nl>
> Cc: Nick Piggin<npi...@suse.de>
> Cc: Benjamin Herrenschmidt<be...@kernel.crashing.org>

Reviewed-by: Rik van Riel <ri...@redhat.com>

Rik van Riel

Apr 8, 2010, 5:00:04 PM
On 04/08/2010 03:17 PM, Peter Zijlstra wrote:
> There is nothing preventing the anon_vma from being detached while we
> are spinning to acquire the lock. Most (all?) current users end up
> calling something like vma_address(page, vma) on it, which has a
> fairly good chance of weeding out wonky vmas.
>
> However suppose the anon_vma got freed and re-used while we were
> waiting to acquire the lock, and the new anon_vma fits with the
> page->index (because that is the only thing vma_address() uses to
> determine if the page fits in a particular vma, we could end up
> traversing faulty anon_vma chains.
>
> Close this hole for good by re-validating that page->mapping still
> holds the very same anon_vma pointer after we acquire the lock, if not
> be utterly paranoid and retry the whole operation (which will very
> likely bail, because it's unlikely the page got attached to a different
> anon_vma in the meantime).
>
> Signed-off-by: Peter Zijlstra<a.p.zi...@chello.nl>
> Cc: Hugh Dickins<hugh.d...@tiscali.co.uk>
> Cc: Linus Torvalds<torv...@linux-foundation.org>

Reviewed-by: Rik van Riel <ri...@redhat.com>

Rik van Riel

Apr 8, 2010, 5:00:03 PM
On 04/08/2010 03:17 PM, Peter Zijlstra wrote:
> We still have a stray quicklist header included even though we axed
> quicklist usage quite a while back.
>
> Signed-off-by: Peter Zijlstra<a.p.zi...@chello.nl>

Reviewed-by: Rik van Riel <ri...@redhat.com>

--

Andrew Morton

Apr 8, 2010, 5:30:03 PM
On Thu, 08 Apr 2010 21:17:39 +0200
Peter Zijlstra <a.p.zi...@chello.nl> wrote:

> There is nothing preventing the anon_vma from being detached while we
> are spinning to acquire the lock.

Well. The comment there clearly implies (or states) that RCU
protection is used to "guard against races". If that's inaccurate
or incomplete, can we please get it fixed?


The whole function makes me a bit queasy.

- Fails to explain why it pulls all these party tricks to read
page->mapping a single time. What code path are we defending against
here?

- Then checks page_mapped() without having any apparent defence
against page_mapped() becoming untrue one nanosecond later.

- Checks page_mapped() inside the rcu_read_locked() section for
inscrutable reasons.

> Most (all?) current users end up
> calling something like vma_address(page, vma) on it, which has a
> fairly good chance of weeding out wonky vmas.
>
> However suppose the anon_vma got freed and re-used while we were
> waiting to acquire the lock, and the new anon_vma fits with the
> page->index (because that is the only thing vma_address() uses to
> determine if the page fits in a particular vma, we could end up
> traversing faulty anon_vma chains.
>
> Close this hole for good by re-validating that page->mapping still
> holds the very same anon_vma pointer after we acquire the lock, if not
> be utterly paranoid and retry the whole operation (which will very
> likely bail, because it's unlikely the page got attached to a different
> anon_vma in the meantime).
>
> Signed-off-by: Peter Zijlstra <a.p.zi...@chello.nl>
> Cc: Hugh Dickins <hugh.d...@tiscali.co.uk>
> Cc: Linus Torvalds <torv...@linux-foundation.org>

> ---
> mm/rmap.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> Index: linux-2.6/mm/rmap.c
> ===================================================================
> --- linux-2.6.orig/mm/rmap.c
> +++ linux-2.6/mm/rmap.c
> @@ -294,6 +294,7 @@ struct anon_vma *page_lock_anon_vma(stru
> unsigned long anon_mapping;
>
> rcu_read_lock();
> +again:
> anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
> if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> goto out;
> @@ -302,6 +303,12 @@ struct anon_vma *page_lock_anon_vma(stru
>
> anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
> spin_lock(&anon_vma->lock);
> +
> + if (page_rmapping(page) != anon_vma) {
> + spin_unlock(&anon_vma->lock);
> + goto again;
> + }
> +
> return anon_vma;
> out:
> rcu_read_unlock();
>

A comment here explaining how this situation could come about would
be helpful.

Peter Zijlstra

Apr 8, 2010, 6:00:03 PM
On Thu, 2010-04-08 at 14:20 -0700, Andrew Morton wrote:
> On Thu, 08 Apr 2010 21:17:39 +0200
> Peter Zijlstra <a.p.zi...@chello.nl> wrote:
>
> > There is nothing preventing the anon_vma from being detached while we
> > are spinning to acquire the lock.
>
> Well. The comment there clearly implies (or states) that RCU
> protection is used to "guard against races". If that's inaccurate
> or incomplete, can we please get it fixed?

Good point, goes together with that last comment you made.

> The whole function makes be a bit queasy.
>
> - Fails to explain why it pulls all these party tricks to read
> page->mapping a single time. What code path are we defending against
> here?

From what I understand we race with tear-down; anon_vma_unlink() takes
anon_vma->lock, so holding the lock pins the anon_vma.

So what we do to acquire a stable anon_vma from a page * is to, while
holding RCU read lock, very carefully read page->mapping, extract the
anon_vma and acquire the lock.

Now, the RCU usage is a tad tricky here: anon_vma uses
SLAB_DESTROY_BY_RCU, which means that the slab will be RCU freed,
however not the objects allocated from it. This means that an anon_vma
can be re-used directly after it gets freed, but the storage will
remain valid for at least a grace period after the free.

So once we do have the lock we need to revalidate that we indeed got the
anon_vma we thought we got.

So it's:

	page->mapping = NULL;
	anon_vma_unlink();
	  spin_lock()
	  spin_unlock()
	kmem_cache_free(anon_vma);

vs.

page_lock_anon_vma()'s trickery.
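
For illustration, the SLAB_DESTROY_BY_RCU lookup idiom described here boils
down to something like the following sketch. This is generic, made-up code
(obj, slot and the helper name are not from mm/rmap.c), and it assumes the
free side does "*slot = NULL; lock; unlock; kmem_cache_free()" exactly as in
the sequence above:

	struct obj {
		spinlock_t lock;
		/* ... */
	};

	/* cache created with SLAB_DESTROY_BY_RCU: the storage stays valid for
	 * a grace period after a free, but the object itself may be re-used */
	struct obj *obj_lookup_and_lock(struct obj **slot)
	{
		struct obj *obj;

		rcu_read_lock();
	again:
		obj = rcu_dereference(*slot);
		if (!obj) {
			rcu_read_unlock();
			return NULL;
		}
		spin_lock(&obj->lock);		/* storage cannot go away under RCU */
		if (rcu_dereference(*slot) != obj) {
			/* freed and re-used while we spun; retry */
			spin_unlock(&obj->lock);
			goto again;
		}
		rcu_read_unlock();
		return obj;	/* pinned: the free side serializes on obj->lock */
	}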

> - Then checks page_mapped() without having any apparent defence
> against page_mapped() becoming untrue one nanosecond later.
>
> - Checks page_mapped() inside the rcu_read_locked() section for
> inscrutable reasons.

Right, I think the page_mapped() stuff is just an early bail out.

Paul E. McKenney

Apr 8, 2010, 6:20:01 PM
On Thu, Apr 08, 2010 at 09:17:50PM +0200, Peter Zijlstra wrote:
> Optimize page_lock_anon_vma() by removing the atomic ref count
> ops from the fast path.
>
> Rather complicates the code a lot, but might be worth it.

Some questions and a disclaimer below.

> Signed-off-by: Peter Zijlstra <a.p.zi...@chello.nl>
> ---
> mm/rmap.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 67 insertions(+), 4 deletions(-)


>
> Index: linux-2.6/mm/rmap.c
> ===================================================================
> --- linux-2.6.orig/mm/rmap.c
> +++ linux-2.6/mm/rmap.c

> @@ -78,6 +78,12 @@ static inline struct anon_vma *anon_vma_
> void anon_vma_free(struct anon_vma *anon_vma)
> {
> VM_BUG_ON(atomic_read(&anon_vma->ref));
> + /*
> + * Sync against the anon_vma->lock, so that we can hold the
> + * lock without requiring a reference. See page_lock_anon_vma().
> + */
> + mutex_lock(&anon_vma->lock);

On some systems, the CPU is permitted to pull references into the critical
section from either side. So, do we also need an smp_mb() here?

> + mutex_unlock(&anon_vma->lock);

So, a question...

Can the above mutex be contended? If yes, what happens when the
competing mutex_lock() acquires the lock at this point? Or, worse yet,
after the kmem_cache_free()?

If no, what do we accomplish by acquiring the lock?

If the above mutex can be contended, can we fix by substituting
synchronize_rcu_expedited()? Which will soon require some scalability
attention if it gets used here, but what else is new? ;-)

> kmem_cache_free(anon_vma_cachep, anon_vma);
> }
>
> @@ -291,7 +297,7 @@ void __init anon_vma_init(void)
>
> /*
> * Getting a lock on a stable anon_vma from a page off the LRU is
> - * tricky: page_lock_anon_vma relies on RCU to guard against the races.
> + * tricky: anon_vma_get relies on RCU to guard against the races.
> */
> struct anon_vma *anon_vma_get(struct page *page)
> {
> @@ -320,12 +326,70 @@ out:
> return anon_vma;
> }
>
> +/*
> + * Similar to anon_vma_get(), however it relies on the anon_vma->lock
> + * to pin the object. However since we cannot wait for the mutex
> + * acquisition inside the RCU read lock, we use the ref count
> + * in the slow path.
> + */
> struct anon_vma *page_lock_anon_vma(struct page *page)
> {
> - struct anon_vma *anon_vma = anon_vma_get(page);
> + struct anon_vma *anon_vma = NULL;
> + unsigned long anon_mapping;
> +
> +again:
> + rcu_read_lock();

This is interesting. You have an RCU read-side critical section with
no rcu_dereference().

This strange state of affairs is actually legal (assuming that
anon_mapping is the RCU-protected structure) because all dereferences
of the anon_vma variable are atomic operations that guarantee ordering
(the mutex_trylock() and the atomic_inc_not_zero().

The other dereferences (the atomic_read()s) are under the lock, so
are also OK assuming that the lock is held when initializing and
updating these fields, and even more OK due to the smp_rmb() below.

But see below.

> + anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
> + if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> + goto unlock;
> + if (!page_mapped(page))
> + goto unlock;
> +
> + anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
> + if (!mutex_trylock(&anon_vma->lock)) {
> + /*
> + * We failed to acquire the lock, take a ref so we can
> + * drop the RCU read lock and sleep on it.
> + */
> + if (!atomic_inc_not_zero(&anon_vma->ref)) {
> + /*
> + * Failed to get a ref, we're dead, bail.
> + */
> + anon_vma = NULL;
> + goto unlock;
> + }
> + rcu_read_unlock();
>
> - if (anon_vma)
> mutex_lock(&anon_vma->lock);
> + /*
> + * We got the lock, drop the temp. ref, if it was the last
> + * one free it and bail.
> + */
> + if (atomic_dec_and_test(&anon_vma->ref)) {
> + mutex_unlock(&anon_vma->lock);
> + anon_vma_free(anon_vma);
> + anon_vma = NULL;
> + }
> + goto out;
> + }
> + /*
> + * Got the lock, check we're still alive. Seeing a ref
> + * here guarantees the object will stay alive due to
> + * anon_vma_free() syncing against the lock we now hold.
> + */
> + smp_rmb(); /* Order against anon_vma_put() */

This is ordering the fetch into anon_vma against the atomic_read() below?
If so, smp_read_barrier_depends() will cover it more cheaply. Alternatively,
use rcu_dereference() when fetching into anon_vma.

Or am I misunderstanding the purpose of this barrier?

(Disclaimer: I have not yet found anon_vma_put(), so I am assuming that
anon_vma_free() plays the role of a grace period.)

> + if (!atomic_read(&anon_vma->ref)) {
> + mutex_unlock(&anon_vma->lock);
> + anon_vma = NULL;
> + }
> +
> +unlock:
> + rcu_read_unlock();
> +out:
> + if (anon_vma && page_rmapping(page) != anon_vma) {
> + mutex_unlock(&anon_vma->lock);
> + goto again;
> + }
>
> return anon_vma;
> }
> @@ -333,7 +397,6 @@ struct anon_vma *page_lock_anon_vma(stru
> void page_unlock_anon_vma(struct anon_vma *anon_vma)
> {
> mutex_unlock(&anon_vma->lock);
> - anon_vma_put(anon_vma);
> }
>
> /*

David Miller

Apr 8, 2010, 9:10:02 PM
From: David Miller <da...@davemloft.net>
Date: Thu, 08 Apr 2010 13:29:35 -0700 (PDT)

> From: Peter Zijlstra <a.p.zi...@chello.nl>
> Date: Thu, 08 Apr 2010 21:17:37 +0200
>
>> This patch-set seems to build and boot on my x86_64 machines and even builds a
>> kernel. I've also attempted powerpc and sparc, which I've compile tested with
>> their respective defconfigs, remaining are (afaikt the rest uses the generic
>> tlb bits):
>
> Did a build and test boot of this on my 128-cpu Niagara2 box, seems to
> work basically fine.

Here comes a set of 4 patches which build on top of your
work by:

1) Killing quicklists on sparc64
2) Using the generic RCU page table liberation code on sparc64
3) Implement pte_special() et al. on sparc64
4) Implement get_user_pages_fast() on sparc64

Please add them to your patch set. If you change the RCU generic code
enabler CPP define to be controlled via Kconfig (as we discussed on
IRC) it should be easy to propagate that change into patch #2 here.

Thanks!

Minchan Kim

Apr 8, 2010, 10:20:02 PM
Hi, Peter.

On Fri, Apr 9, 2010 at 4:17 AM, Peter Zijlstra <a.p.zi...@chello.nl> wrote:
> There is nothing preventing the anon_vma from being detached while we

> are spinning to acquire the lock. Most (all?) current users end up


> calling something like vma_address(page, vma) on it, which has a
> fairly good chance of weeding out wonky vmas.
>
> However suppose the anon_vma got freed and re-used while we were
> waiting to acquire the lock, and the new anon_vma fits with the
> page->index (because that is the only thing vma_address() uses to
> determine if the page fits in a particular vma, we could end up
> traversing faulty anon_vma chains.

We have a second line of defense in page_check_address().
Before the anon_vma is detached, the ptes of pages on the anon_vma should be zeroed.
So can't page_check_address() close the race?

Thanks for the good work on a good feature.

--
Kind regards,
Minchan Kim

KOSAKI Motohiro

Apr 8, 2010, 10:30:01 PM
Hi

> > - Then checks page_mapped() without having any apparent defence
> > against page_mapped() becoming untrue one nanosecond later.
> >
> > - Checks page_mapped() inside the rcu_read_locked() section for
> > inscrutable reasons.
>
> Right, I think the page_mapped() stuff is just an early bail out.

FWIW, it's not only an early bail out.
page_remove_rmap() doesn't do "page->mapping = NULL"; page_remove_rmap's
comment says

	/*
	 * It would be tidy to reset the PageAnon mapping here,
	 * but that might overwrite a racing page_add_anon_rmap
	 * which increments mapcount after us but sets mapping
	 * before us: so leave the reset to free_hot_cold_page,
	 * and remember that it's only reliable while mapped.
	 * Leaving it set also helps swapoff to reinstate ptes
	 * faster for those pages still in swapcache.
	 */

So, if the following scenario happens, we can't dereference page->mapping
(iow, we can't call spin_lock(&anon_vma->lock)); it might point to an
invalid address.


        CPU0                            CPU1
===============================================================
page->_mapcount becomes -1
anon_vma_unlink()
              -- grace period --
                                        page_lock_anon_vma()
                                        page_mapped()
                                        spin_lock(&anon_vma->lock);

Of course, this statement doesn't mean I'm against your patch at all.
I like it.

Nick Piggin

Apr 8, 2010, 11:20:02 PM
It's not wrong per se, but the entire powerpc memory management code
does the IRQ disabling for its pagetable RCU code. So I think it would be
better to do the whole thing in one go.

I don't think Paul will surprise-break powerpc :)

It's up to Ben really, though.


On Thu, Apr 08, 2010 at 09:17:38PM +0200, Peter Zijlstra wrote:
> The powerpc page table freeing relies on the fact that IRQs hold off
> an RCU grace period, this is currently true for all existing RCU
> implementations but is not an assumption Paul wants to support.
>
> Therefore, also take the RCU read lock along with disabling IRQs to
> ensure the RCU grace period does at least cover these lookups.
>
> Requested-by: Paul E. McKenney <pau...@linux.vnet.ibm.com>

> Signed-off-by: Peter Zijlstra <a.p.zi...@chello.nl>


> Cc: Nick Piggin <npi...@suse.de>
> Cc: Benjamin Herrenschmidt <be...@kernel.crashing.org>

> ---
> arch/powerpc/mm/gup.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> Index: linux-2.6/arch/powerpc/mm/gup.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/mm/gup.c
> +++ linux-2.6/arch/powerpc/mm/gup.c
> @@ -142,6 +142,7 @@ int get_user_pages_fast(unsigned long st
> * So long as we atomically load page table pointers versus teardown,
> * we can follow the address down to the the page and take a ref on it.
> */
> + rcu_read_lock();
> local_irq_disable();
>
> pgdp = pgd_offset(mm, addr);
> @@ -162,6 +163,7 @@ int get_user_pages_fast(unsigned long st
> } while (pgdp++, addr = next, addr != end);
>
> local_irq_enable();
> + rcu_read_unlock();
>
> VM_BUG_ON(nr != (end - start) >> PAGE_SHIFT);
> return nr;
> @@ -171,6 +173,7 @@ int get_user_pages_fast(unsigned long st
>
> slow:
> local_irq_enable();
> + rcu_read_unlock();
> slow_irqon:
> pr_devel(" slow path ! nr = %d\n", nr);

Nick Piggin

Apr 8, 2010, 11:20:02 PM
On Thu, Apr 08, 2010 at 09:17:39PM +0200, Peter Zijlstra wrote:
> There is nothing preventing the anon_vma from being detached while we
> are spinning to acquire the lock. Most (all?) current users end up
> calling something like vma_address(page, vma) on it, which has a
> fairly good chance of weeding out wonky vmas.
>
> However suppose the anon_vma got freed and re-used while we were
> waiting to acquire the lock, and the new anon_vma fits with the
> page->index (because that is the only thing vma_address() uses to
> determine if the page fits in a particular vma, we could end up
> traversing faulty anon_vma chains.
>
> Close this hole for good by re-validating that page->mapping still
> holds the very same anon_vma pointer after we acquire the lock, if not
> be utterly paranoid and retry the whole operation (which will very
> likely bail, because it's unlikely the page got attached to a different
> anon_vma in the meantime).

Hm, looks like a bugfix? How was this supposed to be safe?


> Signed-off-by: Peter Zijlstra <a.p.zi...@chello.nl>


> Cc: Hugh Dickins <hugh.d...@tiscali.co.uk>
> Cc: Linus Torvalds <torv...@linux-foundation.org>
> ---
> mm/rmap.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>

> Index: linux-2.6/mm/rmap.c
> ===================================================================
> --- linux-2.6.orig/mm/rmap.c
> +++ linux-2.6/mm/rmap.c

> @@ -294,6 +294,7 @@ struct anon_vma *page_lock_anon_vma(stru
> unsigned long anon_mapping;
>
> rcu_read_lock();
> +again:

> anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);

> if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)

> goto out;
> @@ -302,6 +303,12 @@ struct anon_vma *page_lock_anon_vma(stru
>

> anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);

> spin_lock(&anon_vma->lock);
> +
> + if (page_rmapping(page) != anon_vma) {

very unlikely()?

> + spin_unlock(&anon_vma->lock);
> + goto again;
> + }


> +
> return anon_vma;
> out:
> rcu_read_unlock();
>

Nick Piggin

Apr 9, 2010, 12:10:01 AM
On Thu, Apr 08, 2010 at 09:17:44PM +0200, Peter Zijlstra wrote:
> Fix up powerpc to the new mmu_gather stuffs.
>
> PPC has an extra batching queue to RCU free the actual pagetable
> allocations, use the ARCH extensions for that for now.
>
> For the ppc64_tlb_batch, which tracks the vaddrs to unhash from the
> hardware hash-table, keep using per-cpu arrays but flush on context
> switch and use a TIF bit to track the lazy_mmu state.

Hm. Pity powerpc can't just use tlb flush gathering for this batching,
(which is what it was designed for). Then it could avoid these tricks.
What's preventing this? Adding a tlb gather for COW case in
copy_page_range?

Nick Piggin

Apr 9, 2010, 12:20:01 AM
On Thu, Apr 08, 2010 at 09:17:37PM +0200, Peter Zijlstra wrote:
> Hi,
>
> This (still incomplete) patch-set makes part of the mm a lot more preemptible.
> It converts i_mmap_lock and anon_vma->lock to mutexes. On the way there it
> also makes mmu_gather preemptible.
>
> The main motivation was making mm_take_all_locks() preemptible, since it
> appears people are nesting hundreds of spinlocks there.
>
> The side-effects are that we can finally make mmu_gather preemptible, something
> which lots of people have wanted to do for a long time.

What's the straight-line performance impact of all this? And how about
concurrency, I wonder. mutexes of course are double the atomics, and
you've added a refcount which is two more again for those paths using
it.

Page faults are very important. We unfortunately have some databases
doing a significant amount of mmap/munmap activity too. I'd like to
see microbenchmark numbers for each of those (both anon and file backed
for page faults).

kbuild does quite a few page faults, that would be an easy thing to
test. Not sure what reasonable kinds of cases exercise parallelism.


> What kind of performance tests would people have me run on this to satisfy
> their need for numbers? I've done a kernel build on x86_64 and if anything that
> was slightly faster with these patches, but it was well within the noise
> levels so it might be heat noise I'm looking at ;-)

Is it because you're reducing the number of TLB flushes, or what
(kbuild isn't multi threaded so on x86 TLB flushes should be really
fast anyway).

KAMEZAWA Hiroyuki

Apr 9, 2010, 1:10:01 AM
On Fri, 9 Apr 2010 13:16:41 +1000
Nick Piggin <npi...@suse.de> wrote:

> On Thu, Apr 08, 2010 at 09:17:39PM +0200, Peter Zijlstra wrote:
> > There is nothing preventing the anon_vma from being detached while we
> > are spinning to acquire the lock. Most (all?) current users end up
> > calling something like vma_address(page, vma) on it, which has a
> > fairly good chance of weeding out wonky vmas.
> >
> > However suppose the anon_vma got freed and re-used while we were
> > waiting to acquire the lock, and the new anon_vma fits with the
> > page->index (because that is the only thing vma_address() uses to
> > determine if the page fits in a particular vma, we could end up
> > traversing faulty anon_vma chains.
> >
> > Close this hole for good by re-validating that page->mapping still
> > holds the very same anon_vma pointer after we acquire the lock, if not
> > be utterly paranoid and retry the whole operation (which will very
> > likely bail, because it's unlikely the page got attached to a different
> > anon_vma in the meantime).
>
> Hm, looks like a bugfix? How was this supposed to be safe?
>

IIUC.

Before Rik's change to anon_vma, once page->mapping is set to anon_vma | 0x1,
it's not modified until the page is freed.
After the patch, do_wp_page() overwrites page->mapping when it reuses the
existing page.

==
static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
		unsigned long address, pte_t *page_table, pmd_t *pmd,
		spinlock_t *ptl, pte_t orig_pte)
{
	....
	if (PageAnon(old_page) && !PageKsm(old_page)) {
		if (!trylock_page(old_page)) {
			page_cache_get(old_page);
	....
		reuse = reuse_swap_page(old_page);
		if (reuse)
			/*
			 * The page is all ours. Move it to our anon_vma so
			 * the rmap code will not search our parent or siblings.
			 * Protected against the rmap code by the page lock.
			 */
			page_move_anon_rmap(old_page, vma, address); ----(*)
	}
==
(*) is new.

Then, this new check makes sense in the current kernel.

>
> > Signed-off-by: Peter Zijlstra <a.p.zi...@chello.nl>
> > Cc: Hugh Dickins <hugh.d...@tiscali.co.uk>
> > Cc: Linus Torvalds <torv...@linux-foundation.org>
> > ---
> > mm/rmap.c | 7 +++++++
> > 1 file changed, 7 insertions(+)
> >
> > Index: linux-2.6/mm/rmap.c
> > ===================================================================
> > --- linux-2.6.orig/mm/rmap.c
> > +++ linux-2.6/mm/rmap.c
> > @@ -294,6 +294,7 @@ struct anon_vma *page_lock_anon_vma(stru
> > unsigned long anon_mapping;
> >
> > rcu_read_lock();
> > +again:
> > anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
> > if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> > goto out;
> > @@ -302,6 +303,12 @@ struct anon_vma *page_lock_anon_vma(stru
> >
> > anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
> > spin_lock(&anon_vma->lock);
> > +
> > + if (page_rmapping(page) != anon_vma) {
>
> very unlikely()?
>

I think so.

Thanks,
-Kame

KOSAKI Motohiro

Apr 9, 2010, 2:40:02 AM
> On Fri, 9 Apr 2010 13:16:41 +1000
> Nick Piggin <npi...@suse.de> wrote:
>
> > On Thu, Apr 08, 2010 at 09:17:39PM +0200, Peter Zijlstra wrote:
> > > There is nothing preventing the anon_vma from being detached while we
> > > are spinning to acquire the lock. Most (all?) current users end up
> > > calling something like vma_address(page, vma) on it, which has a
> > > fairly good chance of weeding out wonky vmas.
> > >
> > > However suppose the anon_vma got freed and re-used while we were
> > > waiting to acquire the lock, and the new anon_vma fits with the
> > > page->index (because that is the only thing vma_address() uses to
> > > determine if the page fits in a particular vma, we could end up
> > > traversing faulty anon_vma chains.
> > >
> > > Close this hole for good by re-validating that page->mapping still
> > > holds the very same anon_vma pointer after we acquire the lock, if not
> > > be utterly paranoid and retry the whole operation (which will very
> > > likely bail, because it's unlikely the page got attached to a different
> > > anon_vma in the meantime).
> >
> > Hm, looks like a bugfix? How was this supposed to be safe?
> >
> IIUC.
>
> Before Rik's change to anon_vma, once page->mapping is set as anon_vma | 0x1,
> it's not modified until the page is freed.
> After the patch, do_wp_page() overwrite page->mapping when it reuse existing
> page.

Why?
IIUC, the page->mapping dereference in page_lock_anon_vma() gives four scenarios.

1. the anon_vma is valid
   -> do page_referenced_one().
2. the anon_vma is invalid and freed to the buddy allocator
   -> bail out via page_mapped(), don't touch the anon_vma
3. the anon_vma is kfreed, and not reused
   -> bail out via page_mapped()
4. the anon_vma is kfreed, but reused as another anon_vma
   -> bail out via page_check_address()

Now we have to consider a fifth scenario.

5. the anon_vma is exchanged for another anon_vma by do_wp_page()
   -> bail out via the bailing-out stuff above.

I agree Peter's patch makes sense, but I don't think Rik's patch changes
the locking rule.


>
> ==
> static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long address, pte_t *page_table, pmd_t *pmd,
> spinlock_t *ptl, pte_t orig_pte)
> {
> ....
> if (PageAnon(old_page) && !PageKsm(old_page)) {
> if (!trylock_page(old_page)) {
> page_cache_get(old_page);
> ....
> reuse = reuse_swap_page(old_page);
> if (reuse)
> /*
> * The page is all ours. Move it to our anon_vma so
> * the rmap code will not search our parent or siblings.
> * Protected against the rmap code by the page lock.
> */
> page_move_anon_rmap(old_page, vma, address); ----(*)
> }
> ===
> (*) is new.
>
> Then, this new check makes sense in the current kernel.

--

KAMEZAWA Hiroyuki

Apr 9, 2010, 3:00:02 AM

Hmm, I think the following.

Assume a page is ANON and SwapCache, and it has only one reference.
Consider it's read-only mapped and causes do_wp_page().
page_mapcount(page) == 1 here.

        CPU0                            CPU1

1. do_wp_page()
2. .....
3. replace anon_vma.                    anon_vma = lock_page_anon_vma()

So, lock_page_anon_vma() may have taken the lock on the wrong anon_vma here (mapcount=1).

4. modify pte to writable.              do something...

After the lock, on CPU1, the pte of the address estimated by vma_address(vma, page)
contains the pfn of the page and page_check_address() will succeed.

I'm not sure how dangerous this is.
But it's possible that CPU1 cannot notice there was an anon_vma replacement,
and modifies the pte without holding the anon_vma's lock which the code believes
it holds.

Thanks,
-Kame

Christian Ehrhardt

Apr 9, 2010, 3:10:01 AM

Hi,

On Thu, Apr 08, 2010 at 09:17:42PM +0200, Peter Zijlstra wrote:
> @@ -302,23 +307,33 @@ again:
> goto out;


>
> anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);

> - spin_lock(&anon_vma->lock);
> + if (!atomic_inc_not_zero(&anon_vma->ref))
> + anon_vma = NULL;
>
> if (page_rmapping(page) != anon_vma) {
> - spin_unlock(&anon_vma->lock);
> + anon_vma_put(anon_vma);
> goto again;
> }

AFAICS anon_vma_put might be called with anon_vma == NULL here which
will oops on the ref count. Not sure if

page_rmapping(page) == anon_vma == NULL

is possible, too.

regards Christian

KOSAKI Motohiro

Apr 9, 2010, 3:40:01 AM


Hehe, page_referenced() can already see an unstable VM_LOCKED value. So,
in the worst case we make a false-positive pageout, but it's not a disaster,
I think. Anyway, "use after free" doesn't happen with this brutal code.

However, I think you pointed out a good thing: before Rik's patch we didn't have
page->mapping reassignment, so we didn't need rcu_dereference();
but now it can happen, so I think rcu_dereference() is better.

Perhaps I'm missing something.


diff --git a/mm/rmap.c b/mm/rmap.c
index 8b088f0..b4a0b5b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -295,7 +295,7 @@ struct anon_vma *page_lock_anon_vma(struct page *page)
 	unsigned long anon_mapping;
 
 	rcu_read_lock();
-	anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
+	anon_mapping = (unsigned long) rcu_dereference(page->mapping);
 	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
 		goto out;
 	if (!page_mapped(page))

Minchan Kim

Apr 9, 2010, 4:10:02 AM
Hi, Kosaki.

On Fri, Apr 9, 2010 at 4:29 PM, KOSAKI Motohiro
<kosaki....@jp.fujitsu.com> wrote:
>> Hmm, I think following.
>>
>> Assume a page is ANON and SwapCache, and it has only one reference.
>> Consider it's read-only mapped and cause do_wp_page().
>> page_mapcount(page) == 1 here.
>>
>>     CPU0                          CPU1
>>
>> 1. do_wp_page()
>> 2. .....
>> 3. replace anon_vma.     anon_vma = lock_page_anon_vma()
>>
>> So, lock_page_anon_vma() may have lock on wrong anon_vma, here.(mapcount=1)
>>
>> 4. modify pte to writable.        do something...
>>
>> After lock, in CPU1, a pte of estimated address by vma_address(vma, page)
>> containes pfn of the page and page_check_address() will success.
>>
>> I'm not sure how this is dangerouns.
>> But it's possible that CPU1 cannot notice there was anon_vma replacement.
>> And modifies pte withoug holding anon vma's lock which the code believes
>> it's holded.
>
>
> Hehe, page_referenced() already can take unstable VM_LOCKED value. So,
> In worst case we make false positive pageout, but it's not disaster.

OFF-TOPIC:

I think you pointed out a good thing, too. :)

You mean that although an application calls mlock on a vma, a few pages of the vma can
be swapped out due to a race between mlock and reclaim?

Although it's not a disaster, it apparently breaks the API.
Man page
" mlock() and munlock()
mlock() locks pages in the address range starting at addr and
continuing for len bytes. All pages that contain a part of the
specified address range are guaranteed to be resident in RAM when the
call returns successfully; the pages are guaranteed to stay in RAM
until later unlocked."

Do you have a plan to solve such a problem?

And how about adding a simple comment about that race in page_referenced_one()?
Could you send a patch?


--
Kind regards,
Minchan Kim

KAMEZAWA Hiroyuki

Apr 9, 2010, 4:10:02 AM
On Fri, 9 Apr 2010 16:29:59 +0900 (JST)
KOSAKI Motohiro <kosaki....@jp.fujitsu.com> wrote:

Hmm. I wonder whether we can check "did we lock a valid anon_vma or not" only under
pte_lock or lock_page().
==
	anon_vma = page_anon_vma();
	lock(anon_vma->lock);
	....
	page_check_address(page)
	....
		pte_lock();
		if (page_anon_vma(page) == anon_vma)
			# anon_vma replacement happens!
	unlock(anon_vma->lock);
==
So, rather than page_lock_anon_vma(), page_check_address() may have to check for anon_vma
replacement... but I cannot think of a dangerous case which can cause a panic for now.
I may be missing something...


Thanks,
-Kame

KAMEZAWA Hiroyuki

Apr 9, 2010, 4:10:02 AM

Ah... anon_vma replacement occurs under lock_page() and pte_lock.
Almost all callers of page_lock_anon_vma() hold lock_page(). So I think
this anon_vma replacement is not very serious.
Hmm...

Thanks,
-Kame

KOSAKI Motohiro

Apr 9, 2010, 4:20:02 AM
Hi Minchan,

> OFF-TOPIC:
>
> I think you pointed out good thing, too. :)
>
> You mean although application call mlock of any vma, few pages on the vma can
> be swapout by race between mlock and reclaim?
>
> Although it's not disaster, apparently it breaks API.
> Man page
> " mlock() and munlock()
> mlock() locks pages in the address range starting at addr and
> continuing for len bytes. All pages that contain a part of the
> specified address range are guaranteed to be resident in RAM when the
> call returns successfully; the pages are guaranteed to stay in RAM
> until later unlocked."
>
> Do you have a plan to solve such problem?
>
> And how about adding simple comment about that race in page_referenced_one?
> Could you send the patch?

I'm surprised by this mail. You were pushing many patches in this area;
I believed you knew all this stuff ;)

My answer is, it doesn't need to be fixed, because it's not a bug. The point is
that this is a race issue, not a "pageout after mlock" issue.
If the pageout and the mlock occur at exactly the same time, a human can't
observe which event occurred first; it's not an API violation.

Thanks.

Peter Zijlstra

Apr 9, 2010, 4:20:02 AM
On Fri, 2010-04-09 at 14:07 +1000, Nick Piggin wrote:
> On Thu, Apr 08, 2010 at 09:17:44PM +0200, Peter Zijlstra wrote:
> > Fix up powerpc to the new mmu_gather stuffs.
> >
> > PPC has an extra batching queue to RCU free the actual pagetable
> > allocations, use the ARCH extensions for that for now.
> >
> > For the ppc64_tlb_batch, which tracks the vaddrs to unhash from the
> > hardware hash-table, keep using per-cpu arrays but flush on context
> > switch and use a TIF bit to track the lazy_mmu state.
>
> Hm. Pity powerpc can't just use tlb flush gathering for this batching,
> (which is what it was designed for). Then it could avoid these tricks.
> What's preventing this? Adding a tlb gather for COW case in
> copy_page_range?

I'm not quite sure about that, I didn't fully investigate it, I just
wanted to get something working for now.

One of the things is that both power and sparc need more than the struct
page we normally gather.

I did think of making the mmu_gather have something like

struct mmu_page {
	struct page *page;
#ifdef HAVE_ARCH_TLB_VADDR
	unsigned long vaddr;
#endif
};

struct mmu_gather {
	...
	unsigned int nr;
	struct mmu_page *pages;
};


and doing that vaddr collection right along with it in the same batch.

I think that that would work, Ben, Dave?
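
(A rough sketch of what gathering into such a combined entry could look like;
the helper name and the tlb_flush_mmu() call are illustrative assumptions, not
code from the posted series.)

	static inline void tlb_gather_page(struct mmu_gather *tlb,
					   struct page *page, unsigned long vaddr)
	{
		struct mmu_page *mp = &tlb->pages[tlb->nr++];

		mp->page = page;
	#ifdef HAVE_ARCH_TLB_VADDR
		mp->vaddr = vaddr;
	#endif
		if (tlb->nr == tlb->max)
			tlb_flush_mmu(tlb, 0, 0);	/* flush and reset the batch */
	}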

KAMEZAWA Hiroyuki

Apr 9, 2010, 4:30:02 AM
On Fri, 9 Apr 2010 17:03:49 +0900
KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com> wrote:

Sorry for short mails ;(


Note:vmscan.c::shrink_active_list()
-> page_referenced()
doesn't take lock_page() and may see wrong anon_vma by replacement.

Don't we need lock_page() around ?

Peter Zijlstra

Apr 9, 2010, 4:40:01 AM
On Fri, 2010-04-09 at 14:14 +1000, Nick Piggin wrote:
> On Thu, Apr 08, 2010 at 09:17:37PM +0200, Peter Zijlstra wrote:
> > Hi,
> >
> > This (still incomplete) patch-set makes part of the mm a lot more preemptible.
> > It converts i_mmap_lock and anon_vma->lock to mutexes. On the way there it
> > also makes mmu_gather preemptible.
> >
> > The main motivation was making mm_take_all_locks() preemptible, since it
> > appears people are nesting hundreds of spinlocks there.
> >
> > The side-effects are that we can finally make mmu_gather preemptible, something
> > which lots of people have wanted to do for a long time.
>
> What's the straight-line performance impact of all this? And how about
> concurrency, I wonder. mutexes of course are double the atomics, and
> you've added a refcount which is two more again for those paths using
> it.
>
> Page faults are very important. We unfortunately have some databases
> doing a significant amount of mmap/munmap activity too.

You think this would affect the mmap/munmap times in any significant
way? It seems to me those are relatively heavy ops to begin with.

> I'd like to
> see microbenchmark numbers for each of those (both anon and file backed
> for page faults).

OK, I'll dig out that fault test used in the whole mmap_sem/rwsem thread
a while back and modify it to also do file backed faults.
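
(For concreteness, the kind of anonymous page-fault microbenchmark being
discussed can be as simple as the sketch below. This is not the actual test
from the mmap_sem/rwsem thread, just a minimal userspace illustration that
assumes 4 KiB pages.)

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <sys/mman.h>
	#include <time.h>

	int main(void)
	{
		const size_t psize = 4096;	/* assumed page size */
		const size_t pages = 1 << 18;	/* 1 GiB total */
		struct timespec t0, t1;
		char *p = mmap(NULL, pages * psize, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return 1;

		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (size_t i = 0; i < pages; i++)
			p[i * psize] = 1;	/* one minor fault per page */
		clock_gettime(CLOCK_MONOTONIC, &t1);

		printf("%.1f ns/fault\n",
		       ((t1.tv_sec - t0.tv_sec) * 1e9 +
			(t1.tv_nsec - t0.tv_nsec)) / pages);
		return 0;
	}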

> kbuild does quite a few page faults, that would be an easy thing to
> test. Not sure what reasonable kinds of cases exercise parallelism.
>
>
> > What kind of performance tests would people have me run on this to satisfy
> > their need for numbers? I've done a kernel build on x86_64 and if anything that
> > was slightly faster with these patches, but it was well within the noise
> > levels so it might be heat noise I'm looking at ;-)
>
> Is it because you're reducing the number of TLB flushes, or what
> (kbuild isn't multi threaded so on x86 TLB flushes should be really
> fast anyway).

I'll try and get some perf stat runs to get some insight into this. But
the numbers were:

time make O=defconfig -j48 bzImage (5x, cache hot)

without: avg: 39.2018s +- 0.3407
with: avg: 38.9886s +- 0.1814

Peter Zijlstra

Apr 9, 2010, 4:40:01 AM

The thing we gain is that when the holder of the lock finds a !0
refcount it knows it can't go away because any free will first wait to
acquire the lock.

> If the above mutex can be contended, can we fix by substituting
> synchronize_rcu_expedited()? Which will soon require some scalability
> attention if it gets used here, but what else is new? ;-)

No, synchronize_rcu_expedited() will not work here, there is no RCU read
side that covers the full usage of the anon_vma (there can't be, it
needs to sleep).


Right so the only thing rcu_read_lock() does here is create the
guarantee that anon_vma is safe to dereference (it lives on a
SLAB_DESTROY_BY_RCU slab).

But yes, I suppose that the page->mapping read that now uses ACCESS_ONCE()
would actually want to be an rcu_dereference(), since that provides both
the ACCESS_ONCE() and the read-depend barrier that I think would be
needed.

Yes, it is:

	atomic_dec_and_test(&anon_vma->ref) /* implies mb */

						smp_rmb();
						atomic_read(&anon_vma->ref);

> (Disclaimer: I have not yet found anon_vma_put(), so I am assuming that
> anon_vma_free() plays the role of a grace period.)

Yes, that lives in one of the other patches (does not exist in
mainline).
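
(For reference, the put side that Paul could not find amounts to something
like the following sketch, based on the description in this exchange; the
real helper lives in the mm-anon_vma-ref patch of this series, not in
mainline.)

	void anon_vma_put(struct anon_vma *anon_vma)
	{
		if (atomic_dec_and_test(&anon_vma->ref))	/* implies a full barrier */
			anon_vma_free(anon_vma);
	}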

Nick Piggin

Apr 9, 2010, 4:50:01 AM
On Fri, Apr 09, 2010 at 10:14:07AM +0200, Peter Zijlstra wrote:
> On Fri, 2010-04-09 at 14:07 +1000, Nick Piggin wrote:
> > On Thu, Apr 08, 2010 at 09:17:44PM +0200, Peter Zijlstra wrote:
> > > Fix up powerpc to the new mmu_gather stuffs.
> > >
> > > PPC has an extra batching queue to RCU free the actual pagetable
> > > allocations, use the ARCH extentions for that for now.
> > >
> > > For the ppc64_tlb_batch, which tracks the vaddrs to unhash from the
> > > hardware hash-table, keep using per-cpu arrays but flush on context
> > > switch and use a TIF bit to track the laxy_mmu state.
> >
> > Hm. Pity powerpc can't just use tlb flush gathering for this batching,
> > (which is what it was designed for). Then it could avoid these tricks.
> > What's preventing this? Adding a tlb gather for COW case in
> > copy_page_range?
>
> I'm not quite sure what about that, didn't fully investigate it, just
> wanted to get something working for now.

No it's not your problem, but just perhaps a good add-on to your
patchset. Thanks for thinking about it, though.

>
> Of of the things is that both power and sparc need more than the struct
> page we normally gather.
>
> I did think of making the mmu_gather have something like
>
> struct mmu_page {
> struct page *page;
> #ifdef HAVE_ARCH_TLB_VADDR
> unsigned long vaddr;
> #endif
> };

Well you could also have a per-arch struct for this, which they can
fill in their own info with (I think powerpc takes the pte as well)

Peter Zijlstra

Apr 9, 2010, 4:50:01 AM
On Fri, 2010-04-09 at 16:29 +0900, KOSAKI Motohiro wrote:
>
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 8b088f0..b4a0b5b 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -295,7 +295,7 @@ struct anon_vma *page_lock_anon_vma(struct page
> *page)
> unsigned long anon_mapping;
>
> rcu_read_lock();
> - anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
> + anon_mapping = (unsigned long) rcu_dereference(page->mapping);
> if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> goto out;
> if (!page_mapped(page))

Yes, I think this is indeed required.

I'll do a new version of the patch that includes the comment updates
requested by Andrew.

Martin Schwidefsky

Apr 9, 2010, 5:00:01 AM
Hi Peter,

On Thu, 08 Apr 2010 21:17:37 +0200
Peter Zijlstra <a.p.zi...@chello.nl> wrote:

> The side-effects are that we can finally make mmu_gather preemptible, something
> which lots of people have wanted to do for a long time.

Yes, that is a good thing. With the preemptible mmu_gather s390 will
use fewer IDTEs (that is the instruction that flushes all TLBs for a
given address space) on a full flush. Good :-)

> This patch-set seems to build and boot on my x86_64 machines and even builds a
> kernel. I've also attempted powerpc and sparc, which I've compile tested with
> their respective defconfigs, remaining are (afaikt the rest uses the generic
> tlb bits):
>

> - s390
> - ia64
> - arm
> - superh
> - um
>
> From those, s390 and ia64 look 'interesting', arm and superh seem very similar
> and should be relatively easy (-rt has a patchlet for arm iirc).

To get the 'interesting' TLB flushing on s390 working again you need
this patch:

--
[PATCH] s390: preemptible mmu_gather

From: Martin Schwidefsky <schwi...@de.ibm.com>

Adapt the stand-alone s390 mmu_gather implementation to the new
preemptible mmu_gather interface.

Signed-off-by: Martin Schwidefsky <schwi...@de.ibm.com>
---
arch/s390/include/asm/tlb.h | 43 +++++++++++++++++++++++++------------------
1 file changed, 25 insertions(+), 18 deletions(-)

--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -28,45 +28,51 @@
#include <asm/smp.h>
#include <asm/tlbflush.h>

-#ifndef CONFIG_SMP
-#define TLB_NR_PTRS 1
-#else
-#define TLB_NR_PTRS 508
-#endif
-
struct mmu_gather {
struct mm_struct *mm;
unsigned int fullmm;
unsigned int nr_ptes;
unsigned int nr_pxds;
- void *array[TLB_NR_PTRS];
+ unsigned int max;
+ void **array;
+ void *local[8];
};

-DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
-
-static inline struct mmu_gather *tlb_gather_mmu(struct mm_struct *mm,
- unsigned int full_mm_flush)
+static inline void __tlb_alloc_pages(struct mmu_gather *tlb)
{
- struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);
+ unsigned long addr = __get_free_pages(GFP_ATOMIC, 0);
+
+ if (addr) {
+ tlb->array = (void *) addr;
+ tlb->max = PAGE_SIZE / sizeof(void *);
+ }
+}

+static inline void tlb_gather_mmu(struct mmu_gather *tlb,
+ struct mm_struct *mm,
+ unsigned int full_mm_flush)
+{
tlb->mm = mm;
+ tlb->max = ARRAY_SIZE(tlb->local);
+ tlb->array = tlb->local;
tlb->fullmm = full_mm_flush || (num_online_cpus() == 1) ||
(atomic_read(&mm->mm_users) <= 1 && mm == current->active_mm);
- tlb->nr_ptes = 0;
- tlb->nr_pxds = TLB_NR_PTRS;
if (tlb->fullmm)
__tlb_flush_mm(mm);
- return tlb;
+ else
+ __tlb_alloc_pages(tlb);
+ tlb->nr_ptes = 0;
+ tlb->nr_pxds = tlb->max;
}

static inline void tlb_flush_mmu(struct mmu_gather *tlb,
unsigned long start, unsigned long end)
{
- if (!tlb->fullmm && (tlb->nr_ptes > 0 || tlb->nr_pxds < TLB_NR_PTRS))
+ if (!tlb->fullmm && (tlb->nr_ptes > 0 || tlb->nr_pxds < tlb->max))
__tlb_flush_mm(tlb->mm);
while (tlb->nr_ptes > 0)
pte_free(tlb->mm, tlb->array[--tlb->nr_ptes]);
- while (tlb->nr_pxds < TLB_NR_PTRS)
+ while (tlb->nr_pxds < tlb->max)
/* pgd_free frees the pointer as region or segment table */
pgd_free(tlb->mm, tlb->array[tlb->nr_pxds++]);
}
@@ -79,7 +85,8 @@ static inline void tlb_finish_mmu(struct
/* keep the page table cache within bounds */
check_pgt_cache();

- put_cpu_var(mmu_gathers);
+ if (tlb->array != tlb->local)
+ free_pages((unsigned long) tlb->array, 0);
}

/*

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

Nick Piggin

Apr 9, 2010, 5:00:02 AM
On Fri, Apr 09, 2010 at 10:35:31AM +0200, Peter Zijlstra wrote:
> On Fri, 2010-04-09 at 14:14 +1000, Nick Piggin wrote:
> > On Thu, Apr 08, 2010 at 09:17:37PM +0200, Peter Zijlstra wrote:
> > > Hi,
> > >
> > > This (still incomplete) patch-set makes part of the mm a lot more preemptible.
> > > It converts i_mmap_lock and anon_vma->lock to mutexes. On the way there it
> > > also makes mmu_gather preemptible.
> > >
> > > The main motivation was making mm_take_all_locks() preemptible, since it
> > > appears people are nesting hundreds of spinlocks there.
> > >
> > > The side-effects are that we can finally make mmu_gather preemptible, something
> > > which lots of people have wanted to do for a long time.
> >
> > What's the straight-line performance impact of all this? And how about
> > concurrency, I wonder. mutexes of course are double the atomics, and
> > you've added a refcount which is two more again for those paths using
> > it.
> >
> > Page faults are very important. We unfortunately have some databases
> > doing a significant amount of mmap/munmap activity too.
>
> You think this would affect the mmap/munmap times in any significant
> way? It seems to me those are relatively heavy ops to begin with.

They're actually not _too_ heavy because they just set up and tear
down vmas. No flushing or faulting required (well, in order to make
any _use_ of them you need flushing and faulting of course).

I have some microbenchmarks like this and the page fault test I
could try.


> > I'd like to
> > see microbenchmark numbers for each of those (both anon and file backed
> > for page faults).
>
> OK, I'll dig out that fault test used in the whole mmap_sem/rwsem thread
> a while back and modify it to also do file backed faults.

That'd be good. Anonymous as well of course, for non-databases :)


> > kbuild does quite a few page faults, that would be an easy thing to
> > test. Not sure what reasonable kinds of cases exercise parallelism.
> >
> >
> > > What kind of performance tests would people have me run on this to satisfy
> > > their need for numbers? I've done a kernel build on x86_64 and if anything that
> > > was slightly faster with these patches, but it was well within the noise
> > > levels so it might be heat noise I'm looking at ;-)
> >
> > Is it because you're reducing the number of TLB flushes, or what
> > (kbuild isn't multi threaded so on x86 TLB flushes should be really
> > fast anyway).
>
> I'll try and get some perf stat runs to get some insight into this. But
> the numbers were:
>
> time make O=defconfig -j48 bzImage (5x, cache hot)
>
> without: avg: 39.2018s +- 0.3407
> with: avg: 38.9886s +- 0.1814

Well that's interesting. Nice if it is an improvement and not just some
anomaly. I'd start by looking at TLB flushes maybe? For testing, it
would be nice to make the flush sizes equal so you get more of a
comparison of the straight line code.

Other than this, I don't have a good suggestion of what to test. I mean,
how far can you go? :) Some threaded workloads would probably be a good
idea, though. Java?

Peter Zijlstra

Apr 9, 2010, 5:00:03 AM
On Fri, 2010-04-09 at 18:50 +1000, Nick Piggin wrote:
> I have some microbenchmarks like this and the page fault test I
> could try.
>
If you don't mind, please. If you send them my way I'll run them on my
machines too.

David Howells

Apr 9, 2010, 5:10:01 AM

Have you tried compiling this for NOMMU?

David

Peter Zijlstra

Apr 9, 2010, 5:30:03 AM
On Fri, 2010-04-09 at 10:03 +0100, David Howells wrote:
> Have you tried compiling this for NOMMU?

No, I will eventually, I'm sure there's something to fix.

Peter Zijlstra

Apr 9, 2010, 5:30:03 AM
On Fri, 2010-04-09 at 18:46 +1000, Nick Piggin wrote:
> > struct mmu_page {
> > struct page *page;
> > #ifdef HAVE_ARCH_TLB_VADDR
> > unsigned long vaddr;
> > #endif
> > };
>
> Well you could also have a per-arch struct for this, which they can
> fill in their own info with (I think powerpc takes the pte as well)

Ah, right you are.

Maybe something like:

#ifndef tlb_fill_page
struct mmu_page {
	struct page *page;
};

#define tlb_fill_page(tlb, pte, addr) do { } while (0)
#endif

Which can be used from the tlb_remove_tlb_entry() implementation to
gather both the pte and addr. The only complication seems to be that
tlb_remove_tlb_entry() and tlb_remove_page() aren't always matched.
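
(Purely as illustration, a per-arch override along the lines Nick suggests
below might look like the following sketch; the field names and macro body
are hypothetical and nothing like this exists in the posted series.)

	struct mmu_page {
		struct page	*page;
		unsigned long	vaddr;
		pte_t		pte;
	};

	#define tlb_fill_page(tlb, _pte, _addr)				\
	do {								\
		struct mmu_page *__mp = &(tlb)->pages[(tlb)->nr];	\
		__mp->vaddr = (_addr);					\
		__mp->pte = (_pte);					\
	} while (0)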

Also, it looks like both power and sparc drive the pte/vaddr flush from
the pte_update/set_pte_at-like functions, which means it's not synced to
the mmu_gather at all.

So I'm not at all sure its feasible to link these, but I'll keep it in
mind.

Peter Zijlstra

Apr 9, 2010, 6:00:03 AM
On Fri, 2010-04-09 at 10:58 +0200, Martin Schwidefsky wrote:
> [PATCH] s390: preemptible mmu_gather
>
> From: Martin Schwidefsky <schwi...@de.ibm.com>
>
> Adapt the stand-alone s390 mmu_gather implementation to the new
> preemptible mmu_gather interface.
>
> Signed-off-by: Martin Schwidefsky <schwi...@de.ibm.com>


Thanks a lot Martin!

Peter Zijlstra

Apr 9, 2010, 6:00:03 AM
On Fri, 2010-04-09 at 09:04 +0200, Christian Ehrhardt wrote:
> Hi,
>
> On Thu, Apr 08, 2010 at 09:17:42PM +0200, Peter Zijlstra wrote:
> > @@ -302,23 +307,33 @@ again:
> > goto out;
> >
> > anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
> > - spin_lock(&anon_vma->lock);
> > + if (!atomic_inc_not_zero(&anon_vma->ref))
> > + anon_vma = NULL;
> >
> > if (page_rmapping(page) != anon_vma) {
> > - spin_unlock(&anon_vma->lock);
> > + anon_vma_put(anon_vma);
> > goto again;
> > }
>
> AFAICS anon_vma_put might be called with anon_vma == NULL here which
> will oops on the ref count. Not sure if
>
> page_rmapping(page) == anon_vma == NULL
>
> is possible, too.

Gah, you're right, thanks!

Rik van Riel

Apr 9, 2010, 8:40:02 AM
On 04/08/2010 03:17 PM, Peter Zijlstra wrote:
> We need an anon_vma refcount for preemptible anon_vma->lock as well
> as memory compaction, so move it out into generic code.
>
> Signed-off-by: Peter Zijlstra<a.p.zi...@chello.nl>

Reviewed-by: Rik van Riel <ri...@redhat.com>

Peter Zijlstra

Apr 9, 2010, 9:00:01 AM
On Thu, 2010-04-08 at 21:17 +0200, Peter Zijlstra wrote:

> Index: linux-2.6/mm/rmap.c
> ===================================================================
> --- linux-2.6.orig/mm/rmap.c
> +++ linux-2.6/mm/rmap.c

> @@ -294,6 +294,7 @@ struct anon_vma *page_lock_anon_vma(stru
> unsigned long anon_mapping;
>
> rcu_read_lock();
> +again:

> anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);

> if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> goto out;

> @@ -302,6 +303,12 @@ struct anon_vma *page_lock_anon_vma(stru
>

> anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);

> spin_lock(&anon_vma->lock);
> +
> + if (page_rmapping(page) != anon_vma) {

> + spin_unlock(&anon_vma->lock);
> + goto again;
> + }
> +
> return anon_vma;
> out:
> rcu_read_unlock();

OK, so I'm not quite sure about this anymore... this locking is quite...
umh... interesting.

We have:

- page_remove_rmap() - which decrements page->_mapcount but does not
clear page->mapping. Even though both it and page_add_anon_rmap()
require pte_lock, there is no guarantee these are the same locks, so
these can indeed race.

So there is indeed a possibility for page->mapping to change, but
holding anon_vma->lock won't make a difference.

- anon_vma->lock - this does protect vma_anon_link/unlink like things
and is held during rmap walks, it only protects the list.

- SLAB_DESTROY_BY_RCU - only guarantees we can deref anon_vma, not that
we'll find the one we were looking for.

- we generally unmap/zap before unlink/free - so this should not see
pages race with anon_vma freeing (except when someone has a page ref),
however on anon_vma merges we seem to simply unlink the next one,
without updating potential pages referencing it?

- KSM seems to not do anything that swap doesn't also do (but I didn't
look too closely).

- page migration seems terribly broken..


As it stands I don't see a reliable way to obtain an anon_vma,
vma_address() + page_check_address() things are indeed enough, but not
everybody seems to do that.

Minchan Kim

Apr 9, 2010, 10:50:02 AM
Hi, Kosaki.

I don't want to make noise due to off-topic.
So I open new thread.

On Fri, 2010-04-09 at 17:17 +0900, KOSAKI Motohiro wrote:
> Hi Minchan,
>
> > OFF-TOPIC:
> >
> > I think you pointed out good thing, too. :)
> >
> > You mean although application call mlock of any vma, few pages on the vma can
> > be swapout by race between mlock and reclaim?
> >
> > Although it's not disaster, apparently it breaks API.
> > Man page
> > " mlock() and munlock()
> > mlock() locks pages in the address range starting at addr and
> > continuing for len bytes. All pages that contain a part of the
> > specified address range are guaranteed to be resident in RAM when the
> > call returns successfully; the pages are guaranteed to stay in RAM
> > until later unlocked."
> >
> > Do you have a plan to solve such problem?
> >
> > And how about adding simple comment about that race in page_referenced_one?
> > Could you send the patch?
>
> I'm surprised by this mail. You were pushing many patches in this area.
> I believed you knew all this stuff ;)

If I disappointed you, sorry about that.
Still, there are many things for me to study. :)

>
> My answer is, it doesn't need a fix, because it's not a bug. The point is
> that this is a race issue, not a "pageout after mlock" issue.
> If pageout and mlock occur at exactly the same time, a human can't
> observe which event occurred first. It's not an API violation.


If it can happen, it's obviously an API violation, I think.

int main()
{
        mlock(any vma, CURRENT|FUTURE);
        system("cat /proc/self/smaps | grep 'any vma'");
        ..
}
result :

08884000-088a5000 rw-p 00000000 00:00 0 [any vma]
Size: 4 kB
Rss: 4 kB
...
Swap: 4 kB
...

Apparently, the user expected that "if I call mlock, all pages of the
vma are in DRAM". But the result makes him embarrassed :(

side note :
Of course, mlock's semantics are rather different from smaps's Swap field.
mlock just makes sure the pages are in DRAM after a successful mlock
call; it is not related to smaps's swap entries.
Actually, smaps's Swap field cannot be compared directly to mlock's semantics.
A page may not be on the swap device yet but only in the swap cache, while
all PTEs of the page already hold swap entries (ie, fully unmapped). In such
a case smaps's Swap field counts it as swapped out. But by mlock's semantics
it is still in RAM, so that is okay.

I looked at the code in more detail.
Fortunately, I think the situation you described ("page_referenced()
already can take an unstable VM_LOCKED value, so in the worst case we
make a false positive pageout, but it's not a disaster") cannot happen.

1)
mlock_fixup                     shrink_page_list

                                lock_page
                                try_to_unmap
vma->vm_flags = VM_LOCKED
pte_lock
pte_present test
get_page
pte_unlock
                                pte_lock
                                VM_LOCKED test fail
                                pte_unlock
                                never pageout

So, no problem.

2)
mlock_fixup                     shrink_page_list

                                lock_page
                                try_to_unmap
                                pte_lock
                                VM_LOCKED test pass
vma->vm_flags = VM_LOCKED       make pte to swap entry
pte_lock                        pte_unlock
pte_present test fail
pte_unlock
                                pageout
swapin by handle_mm_fault

So, no problem.

3)
mlock_fixup                     shrink_page_list

                                lock_page
                                try_to_unmap
                                pte_lock
                                VM_LOCKED test pass
vma->vm_flags = VM_LOCKED       make pte to swap entry
pte_lock                        pte_unlock
pte_present test fail
pte_unlock
cachehit in swapcache by handle_mm_fault
                                pageout
                                is_page_cache_freeable fail

So, no problem, too.

I can't see the race situation you mentioned.
When does the 'false positive pageout' happen?
Could you elaborate on it?


--
Kind regards,
Minchan Kim

Rik van Riel

unread,
Apr 9, 2010, 11:40:02 AM4/9/10
to
On 04/08/2010 03:17 PM, Peter Zijlstra wrote:
> Usable for lock-breaks and such.
>
> Signed-off-by: Peter Zijlstra<a.p.zi...@chello.nl>

Reviewed-by: Rik van Riel <ri...@redhat.com>

--

Rik van Riel

unread,
Apr 9, 2010, 11:40:02 AM4/9/10
to
On 04/08/2010 03:17 PM, Peter Zijlstra wrote:
> Provide the mutex_lock_nest_lock() annotation.

Paul E. McKenney

unread,
Apr 9, 2010, 3:30:02 PM4/9/10
to

OK. Here is the sequence of events that I am concerned about:

1. CPU 0 invokes page_lock_anon_vma() [13/13], and executes the
assignment to anon_vma. It has not yet attempted to
acquire the anon_vma->lock mutex.

2. CPU 1 invokes page_unlock_anon_vma() [13/13], which in turn
calls anon_vma_put() [5/13], which atomically decrements
->ref, finds it zero, invokes anon_vma_free() [13/13], which
finds ->ref still zero, so acquires ->lock and immediately
releases it, and then calls kmem_cache_free().

3. This kmem_cache does have SLAB_DESTROY_BY_RCU, so this
anon_vma structure will remain an anon_vma for as long as
CPU 0 remains in its RCU read-side critical section.

4. CPU 2 allocates an anon_vma, and gets the one that
CPU 0 just freed. It initializes it and makes ->ref
non-zero.

5. CPU 0 continues executing page_lock_anon_vma(), and therefore
invokes mutex_trylock() on a now-reused struct anon_vma.
It finds ->ref nonzero, so increments it and continues using
it, despite its having been reallocated, possibly to some
other process.

Or am I missing a step? (Extremely possible, as I am not as familiar
with this code as I might be.)

> > If the above mutex can be contended, can we fix by substituting
> > synchronize_rcu_expedited()? Which will soon require some scalability
> > attention if it gets used here, but what else is new? ;-)
>
> No, synchronize_rcu_expedited() will not work here, there is no RCU read
> side that covers the full usage of the anon_vma (there can't be, it
> needs to sleep).

Got it, apologies for my confusion.

Ah, I was getting the wrong access. Now that I see it, yes, this is
tied to the access of page->mapping that is assigned to anon_mapping.

Thank you -- and yes, I should have thought to search the patch set.

Thanx, Paul

Benjamin Herrenschmidt

unread,
Apr 12, 2010, 9:10:02 PM4/12/10
to
On Thu, 2010-04-08 at 21:17 +0200, Peter Zijlstra wrote:
> plain text document attachment (powerpc-gup_fast-rcu.patch)
> The powerpc page table freeing relies on the fact that IRQs hold off
> an RCU grace period, this is currently true for all existing RCU
> implementations but is not an assumption Paul wants to support.
>
> Therefore, also take the RCU read lock along with disabling IRQs to
> ensure the RCU grace period does at least cover these lookups.

There's a few other places that need a similar fix then. The hash page
code for example. All the C cases should end up calling the
find_linux_pte() helper afaik, so we should be able to stick the lock in
there (and the hugetlbfs variant, find_linux_pte_or_hugepte()).

However, we also have cases of tight asm code walking the page tables,
such as the tlb miss handler on embedded processors. I don't see how I
could do that there. IE. I only have a handful of registers to play
with, no stack, etc...

So we might have to support the interrupt assumption, at least in some
form, with those guys...

Cheers,
Ben.

> Requested-by: Paul E. McKenney <pau...@linux.vnet.ibm.com>
> Signed-off-by: Peter Zijlstra <a.p.zi...@chello.nl>
> Cc: Nick Piggin <npi...@suse.de>
> Cc: Benjamin Herrenschmidt <be...@kernel.crashing.org>
> ---
> arch/powerpc/mm/gup.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> Index: linux-2.6/arch/powerpc/mm/gup.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/mm/gup.c
> +++ linux-2.6/arch/powerpc/mm/gup.c
> @@ -142,6 +142,7 @@ int get_user_pages_fast(unsigned long st
> * So long as we atomically load page table pointers versus teardown,
> * we can follow the address down to the the page and take a ref on it.
> */
> + rcu_read_lock();
> local_irq_disable();
>
> pgdp = pgd_offset(mm, addr);
> @@ -162,6 +163,7 @@ int get_user_pages_fast(unsigned long st
> } while (pgdp++, addr = next, addr != end);
>
> local_irq_enable();
> + rcu_read_unlock();
>
> VM_BUG_ON(nr != (end - start) >> PAGE_SHIFT);
> return nr;
> @@ -171,6 +173,7 @@ int get_user_pages_fast(unsigned long st
>
> slow:
> local_irq_enable();
> + rcu_read_unlock();
> slow_irqon:
> pr_devel(" slow path ! nr = %d\n", nr);

Benjamin Herrenschmidt

unread,
Apr 12, 2010, 9:30:02 PM4/12/10
to
On Thu, 2010-04-08 at 21:17 +0200, Peter Zijlstra wrote:

.../...

> static inline void arch_leave_lazy_mmu_mode(void)
> {
> - struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch);
> + struct ppc64_tlb_batch *batch = &get_cpu_var(ppc64_tlb_batch);
> +
> + if (batch->active) {
> + if (batch->index)
> + __flush_tlb_pending(batch);
> + batch->active = 0;
> + }

Can index be > 0 if active == 0 ? I thought not, which means you don't
need to add a test on active, do you ?

I'm also pondering whether we should just stick something in the
task/thread struct and avoid that whole per-cpu manipulation including
the stuff in process.c in fact.

Heh, maybe it's time to introduce thread vars ? :-)

> - if (batch->index)
> - __flush_tlb_pending(batch);
> - batch->active = 0;
> + put_cpu_var(ppc64_tlb_batch);
> }
>
> #define arch_flush_lazy_mmu_mode() do {} while (0)
> Index: linux-2.6/arch/powerpc/kernel/process.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/kernel/process.c
> +++ linux-2.6/arch/powerpc/kernel/process.c
> @@ -389,6 +389,9 @@ struct task_struct *__switch_to(struct t
> struct thread_struct *new_thread, *old_thread;
> unsigned long flags;
> struct task_struct *last;
> +#ifdef CONFIG_PPC64
> + struct ppc64_tlb_batch *batch;
> +#endif

> #ifdef CONFIG_SMP
> /* avoid complexity of lazy save/restore of fpu
> @@ -479,6 +482,14 @@ struct task_struct *__switch_to(struct t
> old_thread->accum_tb += (current_tb - start_tb);
> new_thread->start_tb = current_tb;
> }
> +
> + batch = &__get_cpu_var(ppc64_tlb_batch);
> + if (batch->active) {
> + set_ti_thread_flag(task_thread_info(prev), TIF_LAZY_MMU);
> + if (batch->index)
> + __flush_tlb_pending(batch);
> + batch->active = 0;
> + }
> #endif

Use ti->local_flags so you can do it non-atomically, which is a lot
cheaper.
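
Roughly like the sketch below, say (only a sketch: _TLF_LAZY_MMU is an
assumed new local flag, and plain loads/stores on ti->local_flags are
enough since only the owning task ever touches it):

        struct thread_info *ti = task_thread_info(prev);

        if (batch->active) {
                ti->local_flags |= _TLF_LAZY_MMU;       /* non-atomic */
                if (batch->index)
                        __flush_tlb_pending(batch);
                batch->active = 0;
        }

        /* ... and after _switch(), on the way back in: */
        ti = task_thread_info(new);
        if (ti->local_flags & _TLF_LAZY_MMU) {
                ti->local_flags &= ~_TLF_LAZY_MMU;
                batch->active = 1;
        }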

> local_irq_save(flags);
> @@ -495,6 +506,13 @@ struct task_struct *__switch_to(struct t
> hard_irq_disable();
> last = _switch(old_thread, new_thread);
>
> +#ifdef CONFIG_PPC64
> + if (test_and_clear_ti_thread_flag(task_thread_info(new), TIF_LAZY_MMU)) {
> + batch = &__get_cpu_var(ppc64_tlb_batch);
> + batch->active = 1;
> + }
> +#endif
> +
> local_irq_restore(flags);
>
> return last;
> Index: linux-2.6/arch/powerpc/mm/pgtable.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/mm/pgtable.c
> +++ linux-2.6/arch/powerpc/mm/pgtable.c
> @@ -33,8 +33,6 @@
>
> #include "mmu_decl.h"
>
> -DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
> -
> #ifdef CONFIG_SMP
>
> /*
> @@ -43,7 +41,6 @@ DEFINE_PER_CPU(struct mmu_gather, mmu_ga
> * freeing a page table page that is being walked without locks
> */
>
> -static DEFINE_PER_CPU(struct pte_freelist_batch *, pte_freelist_cur);
> static unsigned long pte_freelist_forced_free;
>
> struct pte_freelist_batch
> @@ -98,12 +95,30 @@ static void pte_free_submit(struct pte_f
>
> void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
> {
> - /* This is safe since tlb_gather_mmu has disabled preemption */
> - struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur);
> + struct pte_freelist_batch **batchp = &tlb->arch.batch;
> unsigned long pgf;
>
> - if (atomic_read(&tlb->mm->mm_users) < 2 ||
> - cpumask_equal(mm_cpumask(tlb->mm), cpumask_of(smp_processor_id()))){
> + /*
> + * A comment here about on why we have RCU freed page tables might be
> + * interesting, also explaining why we don't need any sort of grace
> + * period for mm_users == 1, and have some home brewn smp_call_func()
> + * for single frees.

iirc, we are synchronizing with CPUs walking page tables in their hash
or TLB miss code, which is lockless. The mm_users test is a -little- bit
dubious indeed. It may have to be mm_users < 2 && mm ==
current->active_mm, ie, we know for sure nobody else is currently
walking those page tables ...
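
In code that would be something like the following, against the
pgtable_free_tlb() hunk quoted above (sketch only, untested):

        if (atomic_read(&tlb->mm->mm_users) < 2 &&
            tlb->mm == current->active_mm) {
                /* nobody else can be walking these page tables */
                pgtable_free(table, shift);
                return;
        }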

Though even that is fishy nowadays. We -can- walk page tables on behalf of
another process. In fact, we do it in the Cell SPU code for faulting in
page table entries as a result of SPEs taking faults, for example. So I'm
starting to suspect that this mm_users optimisation is bogus.

We -do- want to optimize out the case where there is no user left
though, ie, the MM is dead. IE. The typical exit case.

> + *
> + * The only lockless page table walker I know of is gup_fast() which
> + * relies on irq_disable(). So my guess is that mm_users == 1 means
> + * that there cannot be another thread and so precludes gup_fast()
> + * concurrency.

Which is fishy as I said above.

> + * If there are, but we fail to batch, we need to IPI (all?) CPUs so as
> + * to serialize against the IRQ disable. In case we do batch, the RCU
> + * grace period is at least long enough to cover IRQ disabled sections
> + * (XXX assumption, not strictly true).

Yeah well ... I'm not that familiar with RCU anymore. Back when I wrote
that, iirc, I would more or less be safe against a CPU that doesn't
schedule, but things may have well changed.

We are trying to be safe against another CPU walking page tables in the
asm lockless hash miss or TLB miss code. Note that sparc64 has a similar
issue. This is highly optimized asm code that -cannot- call into things
like rcu_read_lock().

> + * All this results in us doing our own free batching and not using
> + * the generic mmu_gather batches (XXX fix that somehow?).
> + */

When this was written, generic batch only dealt with the actual pages,
not the page tables, and as you may have noticed, our page tables on
powerpc aren't struct page*, they come from dedicated slab caches and
are of variable sizes.

Cheers,
Ben.

> + if (atomic_read(&tlb->mm->mm_users) < 2) {
> pgtable_free(table, shift);
> return;
> }
> @@ -125,10 +140,9 @@ void pgtable_free_tlb(struct mmu_gather
> }
> }
>
> -void pte_free_finish(void)
> +void pte_free_finish(struct mmu_gather *tlb)
> {
> - /* This is safe since tlb_gather_mmu has disabled preemption */
> - struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur);
> + struct pte_freelist_batch **batchp = &tlb->arch.batch;
>
> if (*batchp == NULL)
> return;
> Index: linux-2.6/arch/powerpc/mm/tlb_hash64.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/mm/tlb_hash64.c
> +++ linux-2.6/arch/powerpc/mm/tlb_hash64.c
> @@ -38,13 +38,11 @@ DEFINE_PER_CPU(struct ppc64_tlb_batch, p
> * neesd to be flushed. This function will either perform the flush
> * immediately or will batch it up if the current CPU has an active
> * batch on it.
> - *
> - * Must be called from within some kind of spinlock/non-preempt region...
> */
> void hpte_need_flush(struct mm_struct *mm, unsigned long addr,
> pte_t *ptep, unsigned long pte, int huge)
> {
> - struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch);
> + struct ppc64_tlb_batch *batch = &get_cpu_var(ppc64_tlb_batch);
> unsigned long vsid, vaddr;
> unsigned int psize;
> int ssize;
> @@ -99,6 +97,7 @@ void hpte_need_flush(struct mm_struct *m
> */
> if (!batch->active) {
> flush_hash_page(vaddr, rpte, psize, ssize, 0);
> + put_cpu_var(ppc64_tlb_batch);
> return;
> }
>
> @@ -127,6 +126,7 @@ void hpte_need_flush(struct mm_struct *m
> batch->index = ++i;
> if (i >= PPC64_TLB_BATCH_NR)
> __flush_tlb_pending(batch);
> + put_cpu_var(ppc64_tlb_batch);
> }
>
> /*
> @@ -155,7 +155,7 @@ void __flush_tlb_pending(struct ppc64_tl
>
> void tlb_flush(struct mmu_gather *tlb)
> {
> - struct ppc64_tlb_batch *tlbbatch = &__get_cpu_var(ppc64_tlb_batch);
> + struct ppc64_tlb_batch *tlbbatch = &get_cpu_var(ppc64_tlb_batch);
>
> /* If there's a TLB batch pending, then we must flush it because the
> * pages are going to be freed and we really don't want to have a CPU
> @@ -164,8 +164,10 @@ void tlb_flush(struct mmu_gather *tlb)
> if (tlbbatch->index)
> __flush_tlb_pending(tlbbatch);
>
> + put_cpu_var(ppc64_tlb_batch);
> +
> /* Push out batch of freed page tables */
> - pte_free_finish();
> + pte_free_finish(tlb);
> }
>
> /**
> Index: linux-2.6/arch/powerpc/include/asm/thread_info.h
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/include/asm/thread_info.h
> +++ linux-2.6/arch/powerpc/include/asm/thread_info.h
> @@ -111,6 +111,7 @@ static inline struct thread_info *curren
> #define TIF_NOTIFY_RESUME 13 /* callback before returning to user */
> #define TIF_FREEZE 14 /* Freezing for suspend */
> #define TIF_RUNLATCH 15 /* Is the runlatch enabled? */
> +#define TIF_LAZY_MMU 16 /* tlb_batch is active */
>
> /* as above, but as bit values */
> #define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE)
> @@ -128,6 +129,7 @@ static inline struct thread_info *curren
> #define _TIF_NOTIFY_RESUME (1<<TIF_NOTIFY_RESUME)
> #define _TIF_FREEZE (1<<TIF_FREEZE)
> #define _TIF_RUNLATCH (1<<TIF_RUNLATCH)
> +#define _TIF_LAZY_MMU (1<<TIF_LAZY_MMU)
> #define _TIF_SYSCALL_T_OR_A (_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SECCOMP)
>
> #define _TIF_USER_WORK_MASK (_TIF_SIGPENDING | _TIF_NEED_RESCHED | \
> Index: linux-2.6/arch/powerpc/include/asm/pgalloc.h
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/include/asm/pgalloc.h
> +++ linux-2.6/arch/powerpc/include/asm/pgalloc.h
> @@ -32,13 +32,13 @@ static inline void pte_free(struct mm_st
>
> #ifdef CONFIG_SMP
> extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift);
> -extern void pte_free_finish(void);
> +extern void pte_free_finish(struct mmu_gather *tlb);
> #else /* CONFIG_SMP */
> static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
> {
> pgtable_free(table, shift);
> }
> -static inline void pte_free_finish(void) { }
> +static inline void pte_free_finish(struct mmu_gather *tlb) { }
> #endif /* !CONFIG_SMP */
>
> static inline void __pte_free_tlb(struct mmu_gather *tlb, struct page *ptepage,
> Index: linux-2.6/arch/powerpc/mm/tlb_hash32.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/mm/tlb_hash32.c
> +++ linux-2.6/arch/powerpc/mm/tlb_hash32.c
> @@ -73,7 +73,7 @@ void tlb_flush(struct mmu_gather *tlb)
> }
>
> /* Push out batch of freed page tables */
> - pte_free_finish();
> + pte_free_finish(tlb);
> }
>
> /*
> Index: linux-2.6/arch/powerpc/mm/tlb_nohash.c
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/mm/tlb_nohash.c
> +++ linux-2.6/arch/powerpc/mm/tlb_nohash.c
> @@ -298,7 +298,7 @@ void tlb_flush(struct mmu_gather *tlb)
> flush_tlb_mm(tlb->mm);
>
> /* Push out batch of freed page tables */
> - pte_free_finish();
> + pte_free_finish(tlb);
> }
>
> /*

Benjamin Herrenschmidt

unread,
Apr 12, 2010, 10:00:02 PM4/12/10
to
On Fri, 2010-04-09 at 14:07 +1000, Nick Piggin wrote:
> > PPC has an extra batching queue to RCU free the actual pagetable
> > allocations, use the ARCH extentions for that for now.
> >
> > For the ppc64_tlb_batch, which tracks the vaddrs to unhash from the
> > hardware hash-table, keep using per-cpu arrays but flush on context
> > switch and use a TIF bit to track the laxy_mmu state.
>
> Hm. Pity powerpc can't just use tlb flush gathering for this batching,
> (which is what it was designed for). Then it could avoid these tricks.
> What's preventing this? Adding a tlb gather for COW case in
> copy_page_range?

We must flush before the pte_lock is released. If not, we end up with
this funny situation:

- PTE is read-only, hash contains a translation for it
- PTE gets cleared & added to the batch, hash not flushed yet
- PTE lock released, maybe even VMA fully removed
- Other CPU takes a write fault, puts in a new PTE
- Hash ends up with duplicates of the vaddr -> arch violation
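
(For reference, this is the ordering the current code relies on, sketched
against the generic unmap path; the exact call sites vary:)

        spin_lock(ptl);                         /* pte lock */
        arch_enter_lazy_mmu_mode();
        ptep_get_and_clear(mm, addr, ptep);     /* PTE cleared, hash flush queued */
        arch_leave_lazy_mmu_mode();             /* __flush_tlb_pending() runs here */
        spin_unlock(ptl);                       /* only now may another CPU put in a new PTE */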

Now we could get out of that one, I suppose, if we had some kind of way
to force flush any batch pertaining to a given mm before a new valid PTE
can be written, but that doesn't sound like such a trivial thing to do.

Any better idea ?

Cheers,
Ben.

Benjamin Herrenschmidt

unread,
Apr 12, 2010, 10:20:01 PM4/12/10
to
On Fri, 2010-04-09 at 10:14 +0200, Peter Zijlstra wrote:
>
> and doing that vaddr collection right along with it in the same batch.
>
> I think that that would work, Ben, Dave?

Well, ours aren't struct pages.

IE. There's fundamentally 3 things that we are trying to batch here :-)

1- The original mmu_gather: batching the freeing of the actual user
pages, so that the TLB flush can be delayed/gathered, plus there might
be some micro-improvement in passing the page list to the allocator for
freeing all at omce. This is thus purely a batch of struct pages.

2- The batching of the TLB flushes (or hash invalidates in the ppc
case) proper, which needs the addition of the vaddr for things like
sparc and powerpc since we don't just invalidate the whole bloody thing
unlike x86 :-) On powerpc, we actually need more, we need the actual PTE
content since it also contains tracking information relative to where
things have been put in the hash table.

3- The batching of the freeing of the page table structure, which we
want to delay more than batch, ie, the goal here is to delay that
freeing using RCU until everybody has stopped walking them. This does
rely on RCU grace period being "interrupt safe", ie, there's no
rcu_read_lock() in the low level TLB or hash miss code, but that code
runs with interrupts off.

Now, 2. has a problem I described earlier, which is that we must not
have the possibility of introducing a duplicate in the hash table, thus
it must not be possible to put a new PTE in until the previous one has
been flushed or bad things would happen. This is why powerpc doesn't use
the mmu_gather the way it was originally intended to do both 1. and 2.
but really only for 1., while for 2. we use a small batch that only
exists between lazy_mmu_enter/exit, since those are always fully enclosed
by a pte lock section.

3. As you have noticed, this one relies on the irq stuff. Plus there seems
to be a dubious optimization here with mm_users. Might be worth sorting that
out. However, it's a very different goal than 1. and 2. in the sense
that batching proper is a minor issue; what we want is synchronization
with walkers, and batching is a way to lower the cost of that
synchronization (allocation of the RCU struct etc...).

Cheers,
Ben.

Paul E. McKenney

unread,
Apr 12, 2010, 11:50:01 PM4/12/10
to
On Tue, Apr 13, 2010 at 11:05:31AM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2010-04-08 at 21:17 +0200, Peter Zijlstra wrote:
> > plain text document attachment (powerpc-gup_fast-rcu.patch)
> > The powerpc page table freeing relies on the fact that IRQs hold off
> > an RCU grace period, this is currently true for all existing RCU
> > implementations but is not an assumption Paul wants to support.
> >
> > Therefore, also take the RCU read lock along with disabling IRQs to
> > ensure the RCU grace period does at least cover these lookups.
>
> There's a few other places that need a similar fix then. The hash page
> code for example. All the C cases should end up calling the
> find_linux_pte() helper afaik, so we should be able to stick the lock in
> there (and the hugetlbfs variant, find_linux_pte_or_hugepte()).
>
> However, we also have cases of tight asm code walking the page tables,
> such as the tlb miss handler on embedded processors. I don't see how I
> could do that there. IE. I only have a handful of registers to play
> with, no stack, etc...
>
> So we might have to support the interrupt assumption, at least in some
> form, with those guys...

One way to make the interrupt assumption official is to use
synchronize_sched() rather than synchronize_rcu().

Thanx, Paul

Peter Zijlstra

unread,
Apr 13, 2010, 6:30:01 AM4/13/10
to
On Tue, 2010-04-13 at 11:23 +1000, Benjamin Herrenschmidt wrote:
> > static inline void arch_leave_lazy_mmu_mode(void)
> > {
> > - struct ppc64_tlb_batch *batch = &__get_cpu_var(ppc64_tlb_batch);
> > + struct ppc64_tlb_batch *batch = &get_cpu_var(ppc64_tlb_batch);
> > +
> > + if (batch->active) {
> > + if (batch->index)
> > + __flush_tlb_pending(batch);
> > + batch->active = 0;
> > + }
>
> Can index be > 0 if active == 0 ? I though not, which means you don't
> need to add a test on active, do you ?

True I guess, but like this we avoid a write, doesn't really matter I
suspect.

> I'm also pondering whether we should just stick something in the
> task/thread struct and avoid that whole per-cpu manipulation including
> the stuff in process.c in fact.

Can do, I can add batching similar to the generic code to the
thread_info thingy.

> Heh, maybe it's time to introduce thread vars ? :-)

Heh, that seems like a real good way to waste a lot of memory fast ;-)

Peter Zijlstra

unread,
Apr 15, 2010, 3:30:01 AM4/15/10
to
On Mon, 2010-04-12 at 20:43 -0700, Paul E. McKenney wrote:
> > So we might have to support the interrupt assumption, at least in some
> > form, with those guys...
>
> One way to make the interrupt assumption official is to use
> synchronize_sched() rather than synchronize_rcu().

Well, call_rcu_sched() then, because the current usage is to use
call_rcu() to free the page directories.

Paul, there is a call_rcu_sched() available in kernel/rcutree.c, but am I
right in reading that code that it would not be available for
preemptible RCU?

Peter Zijlstra

unread,
Apr 15, 2010, 3:30:01 AM4/15/10
to
On Tue, 2010-04-13 at 11:23 +1000, Benjamin Herrenschmidt wrote:
> > + * If there are, but we fail to batch, we need to IPI (all?) CPUs so as
> > + * to serialize against the IRQ disable. In case we do batch, the RCU
> > + * grace period is at least long enough to cover IRQ disabled sections
> > + * (XXX assumption, not strictly true).
>
> Yeah well ... I'm not that familiar with RCU anymore. Back when I wrote
> that, iirc, I would more or less be safe against a CPU that doesn't
> schedule, but things may have well changed.
>
> We are trying to be safe against another CPU walking page tables in the
> asm lockless hash miss or TLB miss code. Note that sparc64 has a similar
> issue. This is highly optimized asm code that -cannot- call into things
> like rcu_read_lock().

Right, so Paul has been working hard to remove certain implementation
artifacts from RCU, such as the preempt-disable == rcu_read_lock thing.

Now, even Preemptible RCU has IRQ-disabled == rcu_read_lock, simply
because the RCU grace period state machine is driven from an interrupt.

But there is no such requirement on RCU at all, so in the interest of
removing assumptions and making the code easier to validate we're trying
to remove such things.

Peter Zijlstra

unread,
Apr 15, 2010, 3:40:02 AM4/15/10
to
On Tue, 2010-04-13 at 11:23 +1000, Benjamin Herrenschmidt wrote:
> > + * A comment here about on why we have RCU freed page tables might be
> > + * interesting, also explaining why we don't need any sort of grace
> > + * period for mm_users == 1, and have some home brewn smp_call_func()
> > + * for single frees.
>
> iirc, we are synchronizing with CPUs walking page tables in their hash
> or TLB miss code, which is lockless. The mm_users test is a -little- bit
> dubious indeed. It may have to be mm_users < 2 && mm ==
> current->active_mm, ie, we know for sure nobody else is currently
> walking those page tables ...
>
> Tho even than is fishy nowadays. We -can- walk page tables on behave of
> another process. In fact, we do it in the Cell SPU code for faulting
> page table entries as a result of SPEs taking faults for example. So I'm
> starting to suspect that this mm_users optimisation is bogus.
>
> We -do- want to optimize out the case where there is no user left
> though, ie, the MM is dead. IE. The typical exit case.

Can't you fix that by having the SPE code take a reference on these
mm_structs they're playing with?

Poking at one without a ref seems fishy anyway.

Paul E. McKenney

unread,
Apr 15, 2010, 10:30:02 AM4/15/10
to
On Wed, Apr 14, 2010 at 03:51:50PM +0200, Peter Zijlstra wrote:
> On Mon, 2010-04-12 at 20:43 -0700, Paul E. McKenney wrote:
> > > So we might have to support the interrupt assumption, at least in some
> > > form, with those guys...
> >
> > One way to make the interrupt assumption official is to use
> > synchronize_sched() rather than synchronize_rcu().
>
> Well, call_rcu_sched() then, because the current usage is to use
> call_rcu() to free the page directories.
>
> Paul, here is a call_rcu_sched() available in kernel/rcutree.c, but am I
> right in reading that code that that would not be available for
> preemptible RCU?

Both call_rcu_sched() and call_rcu() are always there for you. ;-)

o If CONFIG_TREE_RCU (or CONFIG_TINY_RCU), they both have the same
implementation.

o If CONFIG_TREE_PREEMPT_RCU, call_rcu_sched() is preemptible and
call_rcu() is not.

Of course, with call_rcu_sched(), the corresponding RCU read-side critical
sections are non-preemptible. Therefore, in CONFIG_PREEMPT_RT, these
read-side critical sections must use raw spinlocks.

Can the code in question accommodate these restrictions?

Thanx, Paul

Benjamin Herrenschmidt

unread,
Apr 16, 2010, 3:00:02 AM4/16/10
to
On Mon, 2010-04-12 at 20:43 -0700, Paul E. McKenney wrote:
> > So we might have to support the interrupt assumption, at least in
> some
> > form, with those guys...
>
> One way to make the interrupt assumption official is to use
> synchronize_sched() rather than synchronize_rcu().

Ok, so I'm a bit of a RCU newbie as you may know :-) Right now, we use
neither, we use call_rcu and we free the pages from the callback.

Ben.

Benjamin Herrenschmidt

unread,
Apr 16, 2010, 3:30:02 AM4/16/10
to
On Thu, 2010-04-15 at 07:28 -0700, Paul E. McKenney wrote:
>
> Of course, with call_rcu_sched(), the corresponding RCU read-side
> critical
> sections are non-preemptible. Therefore, in CONFIG_PREEMPT_RT, these
> read-side critical sections must use raw spinlocks.
>
> Can the code in question accommodate these restrictions?

What we protect against is always code that hard-disables IRQs (though
there seems to be a bug in the hugepages code there...). Would that
work ?

Cheers,
Ben.

Nick Piggin

unread,
Apr 16, 2010, 4:20:03 AM4/16/10
to
On Fri, Apr 16, 2010 at 04:51:34PM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2010-04-12 at 20:43 -0700, Paul E. McKenney wrote:
> > > So we might have to support the interrupt assumption, at least in
> > some
> > > form, with those guys...
> >
> > One way to make the interrupt assumption official is to use
> > synchronize_sched() rather than synchronize_rcu().
>
> Ok, so I'm a bit of a RCU newbie as you may know :-) Right now, we use
> neither, we use call_rcu and we free the pages from the callback.

BTW. you currently have an interesting page table freeing path where
you usually free by RCU, but (occasionally) free by IPI. This means
you need to disable both RCU and interrupts to walk page tables.

If you change it to always use RCU, then you wouldn't need to disable
interrupts. Whether this actually matters anywhere in your mm code, I
don't know (it's probably not terribly important for gup_fast). But
plain RCU is always preferable to irq disable for latency and performance.

Benjamin Herrenschmidt

unread,
Apr 16, 2010, 4:40:01 AM4/16/10
to
On Fri, 2010-04-16 at 18:18 +1000, Nick Piggin wrote:
> On Fri, Apr 16, 2010 at 04:51:34PM +1000, Benjamin Herrenschmidt wrote:
> > On Mon, 2010-04-12 at 20:43 -0700, Paul E. McKenney wrote:
> > > > So we might have to support the interrupt assumption, at least in
> > > some
> > > > form, with those guys...
> > >
> > > One way to make the interrupt assumption official is to use
> > > synchronize_sched() rather than synchronize_rcu().
> >
> > Ok, so I'm a bit of a RCU newbie as you may know :-) Right now, we use
> > neither, we use call_rcu and we free the pages from the callback.
>
> BTW. you currently have an interesting page table freeing path where
> you usually free by RCU, but (occasionally) free by IPI. This means
> you need to disable both RCU and interrupts to walk page tables.

Well, the point is we use interrupts to synchronize. The fact that RCU
used to do the job was an added benefit. I may need to switch to rcu
_sched variants tho to keep that. The IPI case is a slow path in case we
are out of memory and cannot allocate our page of RCU batch.
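
(For context, that slow path looks roughly like the below in
arch/powerpc/mm/pgtable.c; quoted from memory, so treat it as a sketch:)

        static void pte_free_smp_sync(void *arg)
        {
                /* Do nothing, just ensure we sync with all CPUs */
        }

        static void pgtable_free_now(void *table, unsigned shift)
        {
                pte_freelist_forced_free++;

                /* IPI everybody so IRQ-disabled walkers are done with it */
                smp_call_function(pte_free_smp_sync, NULL, 1);

                pgtable_free(table, shift);
        }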

> If you change it to always use RCU, then you wouldn't need to disable
> interrupts. Whether this actually matters anywhere in your mm code, I
> don't know (it's probably not terribly important for gup_fast). But
> rcu disable is always preferable for latency and performance.

Well, the main case is the hash miss and that always runs with IRQs off.

Cheers,
Ben.


> --
> To unsubscribe from this list: send the line "unsubscribe linux-arch" in

Nick Piggin

unread,
Apr 16, 2010, 5:30:02 AM4/16/10
to
On Fri, Apr 16, 2010 at 06:29:02PM +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2010-04-16 at 18:18 +1000, Nick Piggin wrote:
> > On Fri, Apr 16, 2010 at 04:51:34PM +1000, Benjamin Herrenschmidt wrote:
> > > On Mon, 2010-04-12 at 20:43 -0700, Paul E. McKenney wrote:
> > > > > So we might have to support the interrupt assumption, at least in
> > > > some
> > > > > form, with those guys...
> > > >
> > > > One way to make the interrupt assumption official is to use
> > > > synchronize_sched() rather than synchronize_rcu().
> > >
> > > Ok, so I'm a bit of a RCU newbie as you may know :-) Right now, we use
> > > neither, we use call_rcu and we free the pages from the callback.
> >
> > BTW. you currently have an interesting page table freeing path where
> > you usually free by RCU, but (occasionally) free by IPI. This means
> > you need to disable both RCU and interrupts to walk page tables.
>
> Well, the point is we use interrupts to synchronize. The fact that RCU
> used to do the job was an added benefit. I may need to switch to rcu
> _sched variants tho to keep that. The IPI case is a slow path in case we
> are out of memory and cannot allocate our page of RCU batch.

It is the slowpath but it forces all lookup paths to do irq disable
too.


> > If you change it to always use RCU, then you wouldn't need to disable
> > interrupts. Whether this actually matters anywhere in your mm code, I
> > don't know (it's probably not terribly important for gup_fast). But
> > rcu disable is always preferable for latency and performance.
>
> Well, the main case is the hash miss and that always runs with IRQs off.

Probably not a big deal then.

Paul E. McKenney

unread,
Apr 16, 2010, 9:50:02 AM4/16/10
to
On Fri, Apr 16, 2010 at 04:54:51PM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2010-04-15 at 07:28 -0700, Paul E. McKenney wrote:
> >
> > Of course, with call_rcu_sched(), the corresponding RCU read-side
> > critical
> > sections are non-preemptible. Therefore, in CONFIG_PREEMPT_RT, these
> > read-side critical sections must use raw spinlocks.
> >
> > Can the code in question accommodate these restrictions?
>
> What we protect against is always code that hard-disable IRQs (though
> there seem to be a bug in the hugepages code there...). Would that
> work ?

From the perspective of call_rcu_sched() and synchronize_sched(),
the following things mark RCU-sched read-side critical sections:

1. rcu_read_lock_sched() and rcu_read_unlock_sched().

2. preempt_disable() and preempt_enable(), along with anything
else that disables preemption.

3. local_bh_disable() and local_bh_enable(), along with anything
else that disables bottom halves.

4. local_irq_disable() and local_irq_enable(), along with anything
else that disables hardirqs.

5. Handlers for NMIs.

So I believe that in this case call_rcu_sched() is your friend. ;-)

Thanx, Paul

Peter Zijlstra

unread,
Apr 16, 2010, 10:00:02 AM4/16/10
to
On Thu, 2010-04-15 at 07:28 -0700, Paul E. McKenney wrote:
> On Wed, Apr 14, 2010 at 03:51:50PM +0200, Peter Zijlstra wrote:
> > On Mon, 2010-04-12 at 20:43 -0700, Paul E. McKenney wrote:
> > > > So we might have to support the interrupt assumption, at least in some
> > > > form, with those guys...
> > >
> > > One way to make the interrupt assumption official is to use
> > > synchronize_sched() rather than synchronize_rcu().
> >
> > Well, call_rcu_sched() then, because the current usage is to use
> > call_rcu() to free the page directories.
> >
> > Paul, here is a call_rcu_sched() available in kernel/rcutree.c, but am I
> > right in reading that code that that would not be available for
> > preemptible RCU?
>
> Both call_rcu_sched() and call_rcu() are always there for you. ;-)
>
> o If CONFIG_TREE_RCU (or CONFIG_TINY_RCU), they both have the same
> implementation.
>
> o If CONFIG_TREE_PREEMPT_RCU, call_rcu_sched() is preemptible and
> call_rcu() is not.

(The reverse I suspect?)

> Of course, with call_rcu_sched(), the corresponding RCU read-side critical
> sections are non-preemptible. Therefore, in CONFIG_PREEMPT_RT, these
> read-side critical sections must use raw spinlocks.

OK, so if we fully remove CONFIG_TREE_PREEMPT_RCU (defaulting to y),
rename all the {call_rcu, rcu_read_lock, rcu_read_unlock,
synchronize_rcu} functions to {*}_preempt and then add a new
CONFIG_PREEMPT_RCU that simply maps {*} to either {*}_sched or
{*}_preempt, we've basically got what I've been asking for for a while,
no?

> Can the code in question accommodate these restrictions?

Yes, that should do just fine I think.

Paul E. McKenney

unread,
Apr 16, 2010, 10:20:01 AM4/16/10
to
On Fri, Apr 16, 2010 at 03:51:21PM +0200, Peter Zijlstra wrote:
> On Thu, 2010-04-15 at 07:28 -0700, Paul E. McKenney wrote:
> > On Wed, Apr 14, 2010 at 03:51:50PM +0200, Peter Zijlstra wrote:
> > > On Mon, 2010-04-12 at 20:43 -0700, Paul E. McKenney wrote:
> > > > > So we might have to support the interrupt assumption, at least in some
> > > > > form, with those guys...
> > > >
> > > > One way to make the interrupt assumption official is to use
> > > > synchronize_sched() rather than synchronize_rcu().
> > >
> > > Well, call_rcu_sched() then, because the current usage is to use
> > > call_rcu() to free the page directories.
> > >
> > > Paul, here is a call_rcu_sched() available in kernel/rcutree.c, but am I
> > > right in reading that code that that would not be available for
> > > preemptible RCU?
> >
> > Both call_rcu_sched() and call_rcu() are always there for you. ;-)
> >
> > o If CONFIG_TREE_RCU (or CONFIG_TINY_RCU), they both have the same
> > implementation.
> >
> > o If CONFIG_TREE_PREEMPT_RCU, call_rcu_sched() is preemptible and
> > call_rcu() is not.
>
> (The reverse I suspect?)

Indeed: If CONFIG_TREE_PREEMPT_RCU, call_rcu() is preemptible and
call_rcu_sched() is not.

> > Of course, with call_rcu_sched(), the corresponding RCU read-side critical
> > sections are non-preemptible. Therefore, in CONFIG_PREEMPT_RT, these
> > read-side critical sections must use raw spinlocks.
>
> OK, so if we fully remove CONFIG_TREE_PREEMPT_RCU (defaulting to y),
> rename all the {call_rcu, rcu_read_lock, rcu_read_unlock,
> synchronize_rcu} functions to {*}_preempt and then add a new
> CONFIG_PREEMPT_RCU that simply maps {*} to either {*}_sched or
> {*}_preempt, we've basically got what I've been asking for for a while,
> no?

What would rcu_read_lock_preempt() do in a !CONFIG_PREEMPT kernel?

> > Can the code in question accommodate these restrictions?
>
> Yes, that should do just fine I think.

Cool!!!

Thanx, Paul

Peter Zijlstra

unread,
Apr 16, 2010, 10:30:02 AM4/16/10
to
On Fri, 2010-04-16 at 07:17 -0700, Paul E. McKenney wrote:

> > > Of course, with call_rcu_sched(), the corresponding RCU read-side critical
> > > sections are non-preemptible. Therefore, in CONFIG_PREEMPT_RT, these
> > > read-side critical sections must use raw spinlocks.
> >
> > OK, so if we fully remove CONFIG_TREE_PREEMPT_RCU (defaulting to y),
> > rename all the {call_rcu, rcu_read_lock, rcu_read_unlock,
> > synchronize_rcu} functions to {*}_preempt and then add a new
> > CONFIG_PREEMPT_RCU that simply maps {*} to either {*}_sched or
> > {*}_preempt, we've basically got what I've been asking for for a while,
> > no?
>
> What would rcu_read_lock_preempt() do in a !CONFIG_PREEMPT kernel?

Same as for a preempt one, since you'd have to be able to schedule()
while holding it to be able to do things like mutex_lock().

Paul E. McKenney

unread,
Apr 16, 2010, 10:40:02 AM4/16/10
to
On Fri, Apr 16, 2010 at 04:23:39PM +0200, Peter Zijlstra wrote:
> On Fri, 2010-04-16 at 07:17 -0700, Paul E. McKenney wrote:
>
> > > > Of course, with call_rcu_sched(), the corresponding RCU read-side critical
> > > > sections are non-preemptible. Therefore, in CONFIG_PREEMPT_RT, these
> > > > read-side critical sections must use raw spinlocks.
> > >
> > > OK, so if we fully remove CONFIG_TREE_PREEMPT_RCU (defaulting to y),
> > > rename all the {call_rcu, rcu_read_lock, rcu_read_unlock,
> > > synchronize_rcu} functions to {*}_preempt and then add a new
> > > CONFIG_PREEMPT_RCU that simply maps {*} to either {*}_sched or
> > > {*}_preempt, we've basically got what I've been asking for for a while,
> > > no?
> >
> > What would rcu_read_lock_preempt() do in a !CONFIG_PREEMPT kernel?
>
> Same as for a preempt one, since you'd have to be able to schedule()
> while holding it to be able to do things like mutex_lock().

So what you really want is something like rcu_read_lock_sleep() rather
than rcu_read_lock_preempt(), right? The point is that you want to do
more than merely preempt, given that it is legal to do general blocking
while holding a mutex, correct?

Thanx, Paul

Peter Zijlstra

unread,
Apr 16, 2010, 11:00:02 AM4/16/10
to
On Fri, 2010-04-16 at 07:32 -0700, Paul E. McKenney wrote:
> On Fri, Apr 16, 2010 at 04:23:39PM +0200, Peter Zijlstra wrote:
> > On Fri, 2010-04-16 at 07:17 -0700, Paul E. McKenney wrote:
> >
> > > > > Of course, with call_rcu_sched(), the corresponding RCU read-side critical
> > > > > sections are non-preemptible. Therefore, in CONFIG_PREEMPT_RT, these
> > > > > read-side critical sections must use raw spinlocks.
> > > >
> > > > OK, so if we fully remove CONFIG_TREE_PREEMPT_RCU (defaulting to y),
> > > > rename all the {call_rcu, rcu_read_lock, rcu_read_unlock,
> > > > synchronize_rcu} functions to {*}_preempt and then add a new
> > > > CONFIG_PREEMPT_RCU that simply maps {*} to either {*}_sched or
> > > > {*}_preempt, we've basically got what I've been asking for for a while,
> > > > no?
> > >
> > > What would rcu_read_lock_preempt() do in a !CONFIG_PREEMPT kernel?
> >
> > Same as for a preempt one, since you'd have to be able to schedule()
> > while holding it to be able to do things like mutex_lock().
>
> So what you really want is something like rcu_read_lock_sleep() rather
> than rcu_read_lock_preempt(), right? The point is that you want to do
> more than merely preempt, given that it is legal to do general blocking
> while holding a mutex, correct?

Right, but CONFIG_TREE_PREEMPT_RCU=y ends up being that. We could change
the name to _sleep, but we've been calling it preemptible-rcu for a long
while now.

Paul E. McKenney

unread,
Apr 16, 2010, 11:10:02 AM4/16/10
to
On Fri, Apr 16, 2010 at 04:56:50PM +0200, Peter Zijlstra wrote:
> On Fri, 2010-04-16 at 07:32 -0700, Paul E. McKenney wrote:
> > On Fri, Apr 16, 2010 at 04:23:39PM +0200, Peter Zijlstra wrote:
> > > On Fri, 2010-04-16 at 07:17 -0700, Paul E. McKenney wrote:
> > >
> > > > > > Of course, with call_rcu_sched(), the corresponding RCU read-side critical
> > > > > > sections are non-preemptible. Therefore, in CONFIG_PREEMPT_RT, these
> > > > > > read-side critical sections must use raw spinlocks.
> > > > >
> > > > > OK, so if we fully remove CONFIG_TREE_PREEMPT_RCU (defaulting to y),
> > > > > rename all the {call_rcu, rcu_read_lock, rcu_read_unlock,
> > > > > synchronize_rcu} functions to {*}_preempt and then add a new
> > > > > CONFIG_PREEMPT_RCU that simply maps {*} to either {*}_sched or
> > > > > {*}_preempt, we've basically got what I've been asking for for a while,
> > > > > no?
> > > >
> > > > What would rcu_read_lock_preempt() do in a !CONFIG_PREEMPT kernel?
> > >
> > > Same as for a preempt one, since you'd have to be able to schedule()
> > > while holding it to be able to do things like mutex_lock().
> >
> > So what you really want is something like rcu_read_lock_sleep() rather
> > than rcu_read_lock_preempt(), right? The point is that you want to do
> > more than merely preempt, given that it is legal to do general blocking
> > while holding a mutex, correct?
>
> Right, but CONFIG_TREE_PREEMPT_RCU=y ends up being that. We could change
> the name to _sleep, but we've been calling it preemptible-rcu for a long
> while now.

It is actually not permitted to do general blocking in a preemptible RCU
read-side critical section. Otherwise, someone is going to block waiting
for a network packet that never comes, thus OOMing the system.

Thanx, Paul

Peter Zijlstra

unread,
Apr 16, 2010, 11:20:02 AM4/16/10
to

Sure, something that guarantees progress seems like a sensible
restriction for any lock, and in particular RCU :-)

Paul E. McKenney

unread,
Apr 16, 2010, 12:50:02 PM4/16/10
to

Excellent point -- much of the issue really does center around
forward-progress guarantees. In fact the Linux kernel has a number of
locking primitives that require different degrees of forward-progress
guarantee from the code in their respective critical sections:

o spin_lock_irqsave(): Critical sections must guarantee forward
progress against everything except NMI handlers.

o raw_spin_lock(): Critical sections must guarantee forward
progress against everything except IRQ (including softirq)
and NMI handlers.

o spin_lock(): Critical sections must guarantee forward
progress against everything except IRQ (again including softirq)
and NMI handlers and (given CONFIG_PREEMPT_RT) higher-priority
realtime tasks.

o mutex_lock(): Critical sections need not guarantee
forward progress, as general blocking is permitted.

The other issue is the scope of the lock. The Linux kernel has
the following:

o BKL: global scope.

o Everything else: scope defined by the use of the underlying
lock variable.

One of the many reasons that we are trying to get rid of BKL is because
it combines global scope with relatively weak forward-progress guarantees.

So here is how the various RCU flavors stack up:

o rcu_read_lock_bh(): critical sections must guarantee forward
progress against everything except NMI handlers and IRQ handlers,
but not against softirq handlers. Global in scope, so that
violating the forward-progress guarantee risks OOMing the system.

o rcu_read_lock_sched(): critical sections must guarantee
forward progress against everything except NMI and IRQ handlers,
including softirq handlers. Global in scope, so that violating
the forward-progress guarantee risks OOMing the system.

o rcu_read_lock(): critical sections must guarantee forward
progress against everything except NMI handlers, IRQ handlers,
softirq handlers, and (in CONFIG_PREEMPT_RT) higher-priority
realtime tasks. Global in scope, so that violating the
forward-progress guarantee risks OOMing the system.

o srcu_read_lock(): critical sections need not guarantee forward
progress, as general blocking is permitted. Scope is controlled
by the use of the underlying srcu_struct structure.

As you say, one can block in rcu_read_lock() critical sections, but
the only blocking that is really safe is blocking that is subject to
priority inheritance. This prohibits mutexes, because although the
mutexes themselves are subject to priority inheritance, the mutexes'
critical sections might well not be.

So the easy response is "just use SRCU." Of course, SRCU has some
disadvantages at the moment:

o The return value from srcu_read_lock() must be passed to
srcu_read_unlock(). I believe that I can fix this.

o There is no call_srcu(). I believe that I can fix this.

o SRCU uses a flat per-CPU counter scheme that is not particularly
scalable. I believe that I can fix this.

o SRCU's current implementation makes it almost impossible to
implement priority boosting. I believe that I can fix this.

o SRCU requires explicit initialization of the underlying
srcu_struct. Unfortunately, I don't see a reasonable way
around this. Not yet, anyway.
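
For the record, the calling convention the first and last points refer
to looks like this (a sketch with a made-up srcu_struct):

        static struct srcu_struct my_srcu;      /* made-up example domain */

        init_srcu_struct(&my_srcu);             /* the explicit run-time init */

        int idx = srcu_read_lock(&my_srcu);     /* the return value ... */
        /* sleepable read-side critical section */
        srcu_read_unlock(&my_srcu, idx);        /* ... must be handed back */

        synchronize_srcu(&my_srcu);             /* and there is no call_srcu() yet */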

So, is there anything else that you don't like about SRCU?

Thanx, Paul

Peter Zijlstra

unread,
Apr 16, 2010, 3:40:02 PM4/16/10
to
On Fri, 2010-04-16 at 09:45 -0700, Paul E. McKenney wrote:
> o mutex_lock(): Critical sections need not guarantee
> forward progress, as general blocking is permitted.
>
Right, I would argue that they should guarantee fwd progress, but due to
being able to schedule while holding them, it's harder to enforce.

Anything that is waiting for uncertainty should do so without any locks
held and simply re-acquire them once such an event does occur.

> So the easy response is "just use SRCU." Of course, SRCU has some
> disadvantages at the moment:
>
> o The return value from srcu_read_lock() must be passed to
> srcu_read_unlock(). I believe that I can fix this.
>
> o There is no call_srcu(). I believe that I can fix this.
>
> o SRCU uses a flat per-CPU counter scheme that is not particularly
> scalable. I believe that I can fix this.
>
> o SRCU's current implementation makes it almost impossible to
> implement priority boosting. I believe that I can fix this.
>
> o SRCU requires explicit initialization of the underlying
> srcu_struct. Unfortunately, I don't see a reasonable way
> around this. Not yet, anyway.
>
> So, is there anything else that you don't like about SRCU?

No, I quite like SRCU when implemented as preemptible tree RCU, and I
don't at all mind that last point, all dynamic things need some sort of
init. All locks certainly have.

Paul E. McKenney

unread,
Apr 16, 2010, 4:30:01 PM4/16/10
to
On Fri, Apr 16, 2010 at 09:37:02PM +0200, Peter Zijlstra wrote:
> On Fri, 2010-04-16 at 09:45 -0700, Paul E. McKenney wrote:
> > o mutex_lock(): Critical sections need not guarantee
> > forward progress, as general blocking is permitted.
> >
> Right, I would argue that they should guarantee fwd progress, but due to
> being able to schedule while holding them, its harder to enforce.
>
> Anything that is waiting for uncertainty should do so without any locks
> held and simply re-acquire them once such an event does occur.

Agreed. But holding a small-scope mutex for (say) 60 seconds would not be
a problem (at 120 seconds, you might start seeing softlockup messages).
In contrast, holding off an RCU grace period for 60 seconds might well
OOM the machine, especially a small embedded system with limited memory.

> > So the easy response is "just use SRCU." Of course, SRCU has some
> > disadvantages at the moment:
> >
> > o The return value from srcu_read_lock() must be passed to
> > srcu_read_unlock(). I believe that I can fix this.
> >
> > o There is no call_srcu(). I believe that I can fix this.
> >
> > o SRCU uses a flat per-CPU counter scheme that is not particularly
> > scalable. I believe that I can fix this.
> >
> > o SRCU's current implementation makes it almost impossible to
> > implement priority boosting. I believe that I can fix this.
> >
> > o SRCU requires explicit initialization of the underlying
> > srcu_struct. Unfortunately, I don't see a reasonable way
> > around this. Not yet, anyway.
> >
> > So, is there anything else that you don't like about SRCU?
>
> No, I quite like SRCU when implemented as preemptible tree RCU, and I
> don't at all mind that last point, all dynamic things need some sort of
> init. All locks certainly have.

Very good!!! I should clarify, though -- by "explicit initialization",
I mean that there needs to be a run-time call to init_srcu_struct().
Unless there is some clever way to initialize an array of pointers to
per-CPU structures at compile time. And, conversely, a way to initialize
pointers in a per-CPU structure to point to possibly-different rcu_node
structures.

Thanx, Paul

Benjamin Herrenschmidt

unread,
Apr 16, 2010, 7:30:02 PM4/16/10
to
On Fri, 2010-04-16 at 06:43 -0700, Paul E. McKenney wrote:
> So I believe that in this case call_rcu_sched() is your friend. ;-)

Looks like it :-)

I'll cook up a patch changing my current call_rcu() to call_rcu_sched().
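
Something along these lines, presumably (a one-liner sketch against
pte_free_submit(), assuming the rcu_head in pte_freelist_batch and the
existing pte_free_rcu_callback stay as they are):

        static void pte_free_submit(struct pte_freelist_batch *batch)
        {
                /* was call_rcu(); IRQ-disabled walkers now count as readers */
                call_rcu_sched(&batch->rcu, pte_free_rcu_callback);
        }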

Cheers,
Ben.

James Bottomley

unread,
Apr 17, 2010, 11:10:02 PM4/17/10
to
On Fri, 2010-04-16 at 09:45 -0700, Paul E. McKenney wrote:
> o mutex_lock(): Critical sections need not guarantee
> forward progress, as general blocking is permitted.

This isn't quite right. mutex critical sections must guarantee eventual
forward progress against the class of other potential acquirers of the
mutex otherwise the system will become either deadlocked or livelocked.

James

Paul E. McKenney

unread,
Apr 18, 2010, 10:00:01 AM4/18/10
to
On Sat, Apr 17, 2010 at 10:06:36PM -0500, James Bottomley wrote:
> On Fri, 2010-04-16 at 09:45 -0700, Paul E. McKenney wrote:
> > o mutex_lock(): Critical sections need not guarantee
> > forward progress, as general blocking is permitted.
>
> This isn't quite right. mutex critical sections must guarantee eventual
> forward progress against the class of other potential acquirers of the
> mutex otherwise the system will become either deadlocked or livelocked.

If I understand you correctly, you are saying that it is OK for a given
critical section for a given mutex to fail to make forward progress if
nothing else happens to acquire that mutex during that time. I would
agree, at least I would if you were to further add that the soft-lockup
checks permit an additional 120 seconds of failure to make forward progress
even if something -is- attempting to acquire that mutex.

By my standards, 120 seconds is a reasonable approximation to infinity,
hence my statement above.

So, would you agree with the following as a more precise statement?

o mutex_lock(): Critical sections need not guarantee
forward progress unless some other task is waiting
on the mutex in question, in which case critical sections
should complete in 120 seconds.

Thanx, Paul

James Bottomley

unread,
Apr 18, 2010, 3:00:01 PM4/18/10
to
On Sun, 2010-04-18 at 06:55 -0700, Paul E. McKenney wrote:
> On Sat, Apr 17, 2010 at 10:06:36PM -0500, James Bottomley wrote:
> > On Fri, 2010-04-16 at 09:45 -0700, Paul E. McKenney wrote:
> > > o mutex_lock(): Critical sections need not guarantee
> > > forward progress, as general blocking is permitted.
> >
> > This isn't quite right. mutex critical sections must guarantee eventual
> > forward progress against the class of other potential acquirers of the
> > mutex otherwise the system will become either deadlocked or livelocked.
>
> If I understand you correctly, you are saying that it is OK for a given
> critical section for a given mutex to fail to make forward progress if
> nothing else happens to acquire that mutex during that time. I would
> agree, at least I would if you were to further add that the soft-lockup
> checks permit an additional 120 seconds of failure to make forward progress
> even if something -is- attempting to acquire that mutex.

Yes ... I was thinking of two specific cases: one is wrong programming
of lock acquisition where the system deadlocks; the other is doing silly
things like taking a mutex around an event loop instead of inside it so
incoming events prevent forward progress and the system livelocks, but
there are many other ways of producing deadlocks and livelocks. I just
couldn't think of a concise way of saying all of that but I didn't want
a statement about mutexes to give the impression that anything goes.

I've got to say that I also dislike seeing any form of sleep within a
critical section because it's just asking for a nasty entangled deadlock
which can be very hard to sort out. So I also didn't like the statement
"general blocking is permitted"

> By my standards, 120 seconds is a reasonable approximation to infinity,
> hence my statement above.
>
> So, would you agree with the following as a more precise statement?
>
> o mutex_lock(): Critical sections need not guarantee
> forward progress unless some other task is waiting
> on the mutex in question, in which case critical sections
> should complete in 120 seconds.

Sounds fair.

James

Andrew Morton

unread,
May 24, 2010, 3:40:03 PM5/24/10
to
On Fri, 09 Apr 2010 10:44:53 +0200
Peter Zijlstra <pet...@infradead.org> wrote:

> On Fri, 2010-04-09 at 16:29 +0900, KOSAKI Motohiro wrote:
> >
> >
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 8b088f0..b4a0b5b 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -295,7 +295,7 @@ struct anon_vma *page_lock_anon_vma(struct page
> > *page)
> > unsigned long anon_mapping;
> >
> > rcu_read_lock();
> > - anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
> > + anon_mapping = (unsigned long) rcu_dereference(page->mapping);
> > if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> > goto out;
> > if (!page_mapped(page))
>
> Yes, I think this is indeed required.
>
> I'll do a new version of the patch that includes the comment updates
> requested by Andrew.

Either this didn't happen or I lost the patch.

I parked mm-revalidate-anon_vma-in-page_lock_anon_vma.patch for now.
Hopefully everything still works without it..

Peter Zijlstra

unread,
May 25, 2010, 5:10:02 AM5/25/10
to
On Mon, 2010-05-24 at 12:32 -0700, Andrew Morton wrote:
> On Fri, 09 Apr 2010 10:44:53 +0200
> Peter Zijlstra <pet...@infradead.org> wrote:
>
> > On Fri, 2010-04-09 at 16:29 +0900, KOSAKI Motohiro wrote:
> > >
> > >
> > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > index 8b088f0..b4a0b5b 100644
> > > --- a/mm/rmap.c
> > > +++ b/mm/rmap.c
> > > @@ -295,7 +295,7 @@ struct anon_vma *page_lock_anon_vma(struct page
> > > *page)
> > > unsigned long anon_mapping;
> > >
> > > rcu_read_lock();
> > > - anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
> > > + anon_mapping = (unsigned long) rcu_dereference(page->mapping);
> > > if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> > > goto out;
> > > if (!page_mapped(page))
> >
> > Yes, I think this is indeed required.
> >
> > I'll do a new version of the patch that includes the comment updates
> > requested by Andrew.
>
> Either this didn't happen or I lost the patch.

It didn't happen, I got distracted.

> I parked mm-revalidate-anon_vma-in-page_lock_anon_vma.patch for now.
> Hopefully everything still works without it..

Yes, I think it actually does due to a number of really non-obvious
things ;-)

I'll try and get back to my make mmu_gather preemptible stuff shortly
and write comments there.
