
[RFC][PATCH 0/7] some page_referenced() improvement


KOSAKI Motohiro
Dec 4, 2009, 3:50:02 AM
Hi,

Here is my refactoring and improvement patchset for page_referenced().
I think it solves Larry's AIM7 scalability issue.

I'll test these patches under a stress workload this weekend, but I hope to
receive reviews concurrently.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

KOSAKI Motohiro
Dec 4, 2009, 3:50:03 AM
From 381108e1ff6309f45f45a67acf2a1dd66e41df4f Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
Date: Thu, 3 Dec 2009 15:01:42 +0900
Subject: [PATCH 2/7] Introduce __page_check_address

page_check_address() needs to take the pte lock, but the lock might be
contended. We therefore need a trylock version, so this patch introduces a
new helper function.

It will be used by a later patch.

Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---
mm/rmap.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++++------------
1 files changed, 49 insertions(+), 13 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 278cd27..1b50425 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -268,44 +268,80 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
* the page table lock when the pte is not present (helpful when reclaiming
* highly shared pages).
*
- * On success returns with pte mapped and locked.
+ * if @noblock is true, page_check_address may return -EAGAIN if lock is
+ * contended.
+ *
+ * Returns valid pte pointer if success.
+ * Returns -EFAULT if address seems invalid.
+ * Returns -EAGAIN if trylock failed.
*/
-pte_t *page_check_address(struct page *page, struct mm_struct *mm,
- unsigned long address, spinlock_t **ptlp, int sync)
+static pte_t *__page_check_address(struct page *page, struct mm_struct *mm,
+ unsigned long address, spinlock_t **ptlp,
+ int sync, int noblock)
{
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
spinlock_t *ptl;
+ int err = -EFAULT;

pgd = pgd_offset(mm, address);
if (!pgd_present(*pgd))
- return NULL;
+ goto out;

pud = pud_offset(pgd, address);
if (!pud_present(*pud))
- return NULL;
+ goto out;

pmd = pmd_offset(pud, address);
if (!pmd_present(*pmd))
- return NULL;
+ goto out;

pte = pte_offset_map(pmd, address);
/* Make a quick check before getting the lock */
- if (!sync && !pte_present(*pte)) {
- pte_unmap(pte);
- return NULL;
- }
+ if (!sync && !pte_present(*pte))
+ goto out_unmap;

ptl = pte_lockptr(mm, pmd);
- spin_lock(ptl);
+ if (noblock) {
+ if (!spin_trylock(ptl)) {
+ err = -EAGAIN;
+ goto out_unmap;
+ }
+ } else
+ spin_lock(ptl);
+
if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
*ptlp = ptl;
return pte;
}
- pte_unmap_unlock(pte, ptl);
- return NULL;
+
+ spin_unlock(ptl);
+ out_unmap:
+ pte_unmap(pte);
+ out:
+ return ERR_PTR(err);
+}
+
+/*
+ * Check that @page is mapped at @address into @mm.
+ *
+ * If @sync is false, page_check_address may perform a racy check to avoid
+ * the page table lock when the pte is not present (helpful when reclaiming
+ * highly shared pages).
+ *
+ * On success returns with pte mapped and locked.
+ */
+pte_t *page_check_address(struct page *page, struct mm_struct *mm,
+ unsigned long address, spinlock_t **ptlp, int sync)
+{
+ pte_t *pte;
+
+ pte = __page_check_address(page, mm, address, ptlp, sync, 0);
+ if (IS_ERR(pte))
+ return NULL;
+ return pte;
}

/**
--
1.6.5.2
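
A minimal usage sketch (not part of the patch; the caller name and the
soft_try flag here are hypothetical) of how a later caller in mm/rmap.c
could consume the new ERR_PTR-based return convention:

static int check_page_mapping(struct page *page, struct mm_struct *mm,
			      unsigned long address, int soft_try)
{
	spinlock_t *ptl;
	pte_t *pte;

	/* Use the trylock variant only when we are allowed to give up. */
	pte = __page_check_address(page, mm, address, &ptl, 0, soft_try);
	if (IS_ERR(pte)) {
		if (PTR_ERR(pte) == -EAGAIN)
			return SWAP_AGAIN;	/* pte lock contended, try again later */
		return SWAP_SUCCESS;		/* page is not mapped at this address */
	}

	/* ... examine or modify the pte while holding ptl ... */

	pte_unmap_unlock(pte, ptl);
	return SWAP_SUCCESS;
}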

Rik van Riel
Dec 6, 2009, 10:00:02 AM
On 12/04/2009 03:42 AM, KOSAKI Motohiro wrote:
> From 381108e1ff6309f45f45a67acf2a1dd66e41df4f Mon Sep 17 00:00:00 2001
> From: KOSAKI Motohiro<kosaki....@jp.fujitsu.com>
> Date: Thu, 3 Dec 2009 15:01:42 +0900
> Subject: [PATCH 2/7] Introduce __page_check_address
>
> page_check_address() needs to take the pte lock, but the lock might be
> contended. We therefore need a trylock version, so this patch introduces a
> new helper function.
>
> It will be used by a later patch.
>
> Signed-off-by: KOSAKI Motohiro<kosaki....@jp.fujitsu.com>

Reviewed-by: Rik van Riel <ri...@redhat.com>

--
All rights reversed.

KOSAKI Motohiro
Dec 7, 2009, 4:30:03 AM
> Hi,
>
> Here is my refactoring and improvement patchset for page_referenced().
> I think it solves Larry's AIM7 scalability issue.
>
> I'll test these patches under a stress workload this weekend, but I hope to
> receive reviews concurrently.

OK, these patches survived my stress workload over the two days of last weekend.

KOSAKI Motohiro
Dec 7, 2009, 6:40:01 AM

Andrea, can you please try the following patch on your workload?


From a7758c66d36a136d5fbbcf0b042839445f0ca522 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
Date: Mon, 7 Dec 2009 18:37:20 +0900
Subject: [PATCH] [RFC] vmscan: Don't deactivate many touched page

Changelog
o Changes from Andrea's original patch:
  - Rebase on top of my patches.
  - Use the list_cut_position/list_splice_tail pair instead of
    list_del/list_add to keep the pte scan fair.
  - Only apply the max-young threshold when soft_try is true.
    This avoids a wrong OOM side effect.
  - Return SWAP_AGAIN instead of a successful result when the max-young
    threshold is exceeded. This prevents pages whose pte young bits were
    not cleared from being deactivated wrongly.
  - Add logic to handle KSM pages.

Many heavily shared and frequently used pages do not need to be deactivated
and passed to try_to_unmap(). While VM pressure is low this is pointless:
such a page would likely be reactivated soon, so the work only wastes CPU.

This patch therefore stops the pte scan once wipe_page_reference() has
found a large number of set young pte bits.

Originally-Signed-off-by: Andrea Arcangeli <aarc...@redhat.com>


Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---

include/linux/rmap.h | 17 +++++++++++++++++
mm/ksm.c | 4 ++++
mm/rmap.c | 19 +++++++++++++++++++
3 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 499972e..9ad69b5 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -128,6 +128,23 @@ int wipe_page_reference_one(struct page *page,
struct page_reference_context *refctx,
struct vm_area_struct *vma, unsigned long address);

+#define MAX_YOUNG_BIT_CLEARED 64
+/*
+ * if VM pressure is low and the page have too many active mappings, there isn't
+ * any reason to continue clear young bit of other ptes. Otherwise,
+ * - Makes meaningless cpu wasting, many touched page sholdn't be reclaimed.
+ * - Makes lots IPI for pte change and it might cause another sadly lock
+ * contention.
+ */
+static inline
+int too_many_young_bit_found(struct page_reference_context *refctx)
+{
+ if (refctx->soft_try &&
+ refctx->referenced >= MAX_YOUNG_BIT_CLEARED)
+ return 1;
+ return 0;
+}
+
enum ttu_flags {
TTU_UNMAP = 0, /* unmap mode */
TTU_MIGRATION = 1, /* migration mode */
diff --git a/mm/ksm.c b/mm/ksm.c
index 3c121c8..46ea519 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1586,6 +1586,10 @@ again:
rmap_item->address);
if (ret != SWAP_SUCCESS)
goto out;
+ if (too_many_young_bit_found(refctx)) {
+ ret = SWAP_AGAIN;
+ goto out;
+ }
mapcount--;
if (!search_new_forks || !mapcount)
break;
diff --git a/mm/rmap.c b/mm/rmap.c
index cfda0a0..f4517f3 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -473,6 +473,21 @@ static int wipe_page_reference_anon(struct page *page,
ret = wipe_page_reference_one(page, refctx, vma, address);
if (ret != SWAP_SUCCESS)
break;
+ if (too_many_young_bit_found(refctx)) {
+ LIST_HEAD(tmp_list);
+
+ /*
+ * The scanned ptes move to list tail. it help every ptes
+ * on this page will be tested by ptep_clear_young().
+ * Otherwise, this shortcut makes unfair thing.
+ */
+ list_cut_position(&tmp_list,
+ &vma->anon_vma_node,
+ &anon_vma->head);
+ list_splice_tail(&tmp_list, &vma->anon_vma_node);
+ ret = SWAP_AGAIN;
+ break;
+ }
mapcount--;
if (!mapcount || refctx->maybe_mlocked)
break;
@@ -543,6 +558,10 @@ static int wipe_page_reference_file(struct page *page,
ret = wipe_page_reference_one(page, refctx, vma, address);
if (ret != SWAP_SUCCESS)
break;
+ if (too_many_young_bit_found(refctx)) {
+ ret = SWAP_AGAIN;
+ break;
+ }
mapcount--;
if (!mapcount || refctx->maybe_mlocked)
break;
--
1.6.5.2
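
The cut-off idea can be shown with a small self-contained sketch in plain C
(illustrative only; MAX_YOUNG_BIT_CLEARED mirrors the constant above, all
other names are made up for the example): once enough referenced mappings
have been seen under low pressure, stop clearing accessed bits and report
the page as still in use.

#include <stdbool.h>
#include <stddef.h>

#define MAX_YOUNG_BIT_CLEARED 64

struct mapping {
	bool young;		/* stand-in for the pte accessed/young bit */
};

/* Returns true if the page should stay active: either the early cut-off
 * fired under soft_try, or the full scan found at least one young mapping. */
static bool page_recently_used(struct mapping *maps, size_t nr, bool soft_try)
{
	size_t referenced = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (maps[i].young) {
			maps[i].young = false;	/* "clear the young bit" */
			referenced++;
		}
		/* Under low pressure, stop early: the page is clearly hot and
		 * clearing more bits would only burn CPU and TLB flushes. */
		if (soft_try && referenced >= MAX_YOUNG_BIT_CLEARED)
			return true;
	}
	return referenced > 0;
}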

Rik van Riel
Dec 7, 2009, 1:20:03 PM
On 12/07/2009 06:36 AM, KOSAKI Motohiro wrote:
>
> Andrea, can you please try the following patch on your workload?
>
>
> From a7758c66d36a136d5fbbcf0b042839445f0ca522 Mon Sep 17 00:00:00 2001
> From: KOSAKI Motohiro<kosaki....@jp.fujitsu.com>
> Date: Mon, 7 Dec 2009 18:37:20 +0900
> Subject: [PATCH] [RFC] vmscan: Don't deactivate many touched page
>
> Changelog
> o Changes from Andrea's original patch:
>   - Rebase on top of my patches.
>   - Use the list_cut_position/list_splice_tail pair instead of
>     list_del/list_add to keep the pte scan fair.
>   - Only apply the max-young threshold when soft_try is true.
>     This avoids a wrong OOM side effect.
>   - Return SWAP_AGAIN instead of a successful result when the max-young
>     threshold is exceeded. This prevents pages whose pte young bits were
>     not cleared from being deactivated wrongly.
>   - Add logic to handle KSM pages.

I like the concept and your changes, and really only
have a few small nitpicks :)

First, the VM uses a mix of "referenced", "accessed" and
"young". We should probably avoid adding "active" to that
mix, and may even want to think about moving to just one
or two terms :)

> +#define MAX_YOUNG_BIT_CLEARED 64
> +/*
> + * if VM pressure is low and the page have too many active mappings, there isn't
> + * any reason to continue clear young bit of other ptes. Otherwise,
> + * - Makes meaningless cpu wasting, many touched page sholdn't be reclaimed.
> + * - Makes lots IPI for pte change and it might cause another sadly lock
> + * contention.
> + */

If VM pressure is low and the page has lots of active users, we only
clear up to MAX_YOUNG_BIT_CLEARED accessed bits at a time. Clearing
accessed bits takes CPU time, needs TLB invalidate IPIs and could
cause lock contention. Since a heavily shared page is very likely
to be used again soon, the cost outweighs the benefit of making such
a heavily shared page a candidate for eviction.

> diff --git a/mm/rmap.c b/mm/rmap.c
> index cfda0a0..f4517f3 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -473,6 +473,21 @@ static int wipe_page_reference_anon(struct page *page,
> ret = wipe_page_reference_one(page, refctx, vma, address);
> if (ret != SWAP_SUCCESS)
> break;
> + if (too_many_young_bit_found(refctx)) {
> + LIST_HEAD(tmp_list);
> +
> + /*
> + * The scanned ptes move to list tail. it help every ptes
> + * on this page will be tested by ptep_clear_young().
> + * Otherwise, this shortcut makes unfair thing.
> + */
> + list_cut_position(&tmp_list,
> + &vma->anon_vma_node,
> + &anon_vma->head);
> + list_splice_tail(&tmp_list, &vma->anon_vma_node);
> + ret = SWAP_AGAIN;
> + break;
> + }

I do not understand the unfairness here, since all a page needs
to stay on the active list is >64 referenced PTEs. It does not
matter which of the PTEs mapping the page were recently referenced.

However, rotating the anon vmas around may help spread out lock
pressure in the VM and help things that way, so the code looks
useful to me.
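
For reference, the rotation itself boils down to the following list.h idiom
(a sketch, not code from the patch; the helper name is hypothetical): cut
everything from the list head up to and including the last scanned entry,
then splice that prefix back at the tail so the next walk starts with the
entries that were skipped this time.

#include <linux/list.h>

/*
 * Sketch only: move the already-scanned prefix of @head, i.e. everything up
 * to and including @last_scanned, to the tail of @head, so that the next
 * walk starts with the entries that were not reached this time.
 */
static void rotate_scanned_to_tail(struct list_head *head,
				   struct list_head *last_scanned)
{
	LIST_HEAD(tmp);

	list_cut_position(&tmp, head, last_scanned);	/* detach scanned prefix */
	list_splice_tail(&tmp, head);			/* re-attach it at the tail */
}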

In short, you can give the next version of this patch my

Reviewed-by: Rik van Riel <ri...@redhat.com>

All I have are comment nitpicks :)

--
All rights reversed.

KOSAKI Motohiro
Dec 8, 2009, 1:30:02 AM
> On 12/07/2009 06:36 AM, KOSAKI Motohiro wrote:
> >
> > Andrea, can you please try the following patch on your workload?
> >
> >
> > From a7758c66d36a136d5fbbcf0b042839445f0ca522 Mon Sep 17 00:00:00 2001
> > From: KOSAKI Motohiro<kosaki....@jp.fujitsu.com>
> > Date: Mon, 7 Dec 2009 18:37:20 +0900
> > Subject: [PATCH] [RFC] vmscan: Don't deactivate many touched page
> >
> > Changelog
> > o Changes from Andrea's original patch:
> >   - Rebase on top of my patches.
> >   - Use the list_cut_position/list_splice_tail pair instead of
> >     list_del/list_add to keep the pte scan fair.
> >   - Only apply the max-young threshold when soft_try is true.
> >     This avoids a wrong OOM side effect.
> >   - Return SWAP_AGAIN instead of a successful result when the max-young
> >     threshold is exceeded. This prevents pages whose pte young bits were
> >     not cleared from being deactivated wrongly.
> >   - Add logic to handle KSM pages.
>
> I like the concept and your changes, and really only
> have a few small nitpicks :)
>
> First, the VM uses a mix of "referenced", "accessed" and
> "young". We should probably avoid adding "active" to that
> mix, and may even want to think about moving to just one
> or two terms :)

Ah yes, certainly.


> > +#define MAX_YOUNG_BIT_CLEARED 64
> > +/*
> > + * if VM pressure is low and the page have too many active mappings, there isn't
> > + * any reason to continue clear young bit of other ptes. Otherwise,
> > + * - Makes meaningless cpu wasting, many touched page sholdn't be reclaimed.
> > + * - Makes lots IPI for pte change and it might cause another sadly lock
> > + * contention.
> > + */
>
> If VM pressure is low and the page has lots of active users, we only
> clear up to MAX_YOUNG_BIT_CLEARED accessed bits at a time. Clearing
> accessed bits takes CPU time, needs TLB invalidate IPIs and could
> cause lock contention. Since a heavily shared page is very likely
> to be used again soon, the cost outweighs the benefit of making such
> a heavily shared page a candidate for eviction.

Thanks. Will fix.

Agreed. I have to rewrite the comment.


> In short, you can give the next version of this patch my
>
> Reviewed-by: Rik van Riel <ri...@redhat.com>
>
> All I have are comment nitpicks :)

No, they're really worthwhile.

Thank you.
