Account Options

  1. Sign in
The old Google Groups will be going away soon, but your browser is incompatible with the new version.
Google Groups Home
« Groups Home
thp: implement refcounting for huge zero page
There are currently too many topics in this group that display first. To make this topic appear first, remove this option from another topic.
There was an error processing your request. Please try again.
flag
  Messages 1 - 25 of 70 - Collapse all  -  Translate all to Translated (View all originals)   Newer >
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
Kirill A. Shutemov  
View profile  
 More options Nov 7 2012, 10:10 am
Newsgroups: linux.kernel
From: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>
Date: Wed, 07 Nov 2012 16:10:01 +0100
Local: Wed, Nov 7 2012 10:10 am
Subject: [PATCH v5 10/11] thp: implement refcounting for huge zero page
From: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>

H. Peter Anvin doesn't like huge zero page which sticks in memory forever
after the first allocation. Here's implementation of lockless refcounting
for huge zero page.

We have two basic primitives: {get,put}_huge_zero_page(). They
manipulate reference counter.

If counter is 0, get_huge_zero_page() allocates a new huge page and
takes two references: one for caller and one for shrinker. We free the
page only in shrinker callback if counter is 1 (only shrinker has the
reference).

put_huge_zero_page() only decrements counter. Counter is never zero
in put_huge_zero_page() since shrinker holds on reference.

Freeing huge zero page in shrinker callback helps to avoid frequent
allocate-free.

Refcounting has cost. On 4 socket machine I observe ~1% slowdown on
parallel (40 processes) read page faulting comparing to lazy huge page
allocation.  I think it's pretty reasonable for synthetic benchmark.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 mm/huge_memory.c |  111 ++++++++++++++++++++++++++++++++++++++++++------------
 1 files changed, 87 insertions(+), 24 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 52073c2..92a1b66 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -18,6 +18,7 @@
 #include <linux/freezer.h>
 #include <linux/mman.h>
 #include <linux/pagemap.h>
+#include <linux/shrinker.h>
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
 #include "internal.h"
@@ -47,7 +48,6 @@ static unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 10000;
 /* during fragmentation poll the hugepage allocator once every minute */
 static unsigned int khugepaged_alloc_sleep_millisecs __read_mostly = 60000;
 static struct task_struct *khugepaged_thread __read_mostly;
-static unsigned long huge_zero_pfn __read_mostly;
 static DEFINE_MUTEX(khugepaged_mutex);
 static DEFINE_SPINLOCK(khugepaged_mm_lock);
 static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
@@ -160,31 +160,74 @@ static int start_khugepaged(void)
        return err;
 }

-static int init_huge_zero_pfn(void)
+static atomic_t huge_zero_refcount;
+static unsigned long huge_zero_pfn __read_mostly;
+
+static inline bool is_huge_zero_pfn(unsigned long pfn)
 {
-       struct page *hpage;
-       unsigned long pfn;
+       unsigned long zero_pfn = ACCESS_ONCE(huge_zero_pfn);
+       return zero_pfn && pfn == zero_pfn;
+}
+
+static inline bool is_huge_zero_pmd(pmd_t pmd)
+{
+       return is_huge_zero_pfn(pmd_pfn(pmd));
+}
+
+static unsigned long get_huge_zero_page(void)
+{
+       struct page *zero_page;
+retry:
+       if (likely(atomic_inc_not_zero(&huge_zero_refcount)))
+               return ACCESS_ONCE(huge_zero_pfn);

-       hpage = alloc_pages((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,
+       zero_page = alloc_pages((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,
                        HPAGE_PMD_ORDER);
-       if (!hpage)
-               return -ENOMEM;
-       pfn = page_to_pfn(hpage);
-       if (cmpxchg(&huge_zero_pfn, 0, pfn))
-               __free_page(hpage);
-       return 0;
+       if (!zero_page)
+               return 0;
+       preempt_disable();
+       if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(zero_page))) {
+               preempt_enable();
+               __free_page(zero_page);
+               goto retry;
+       }
+
+       /* We take additional reference here. It will be put back by shrinker */
+       atomic_set(&huge_zero_refcount, 2);
+       preempt_enable();
+       return ACCESS_ONCE(huge_zero_pfn);
 }

-static inline bool is_huge_zero_pfn(unsigned long pfn)
+static void put_huge_zero_page(void)
 {
-       return huge_zero_pfn && pfn == huge_zero_pfn;
+       /*
+        * Counter should never go to zero here. Only shrinker can put
+        * last reference.
+        */
+       BUG_ON(atomic_dec_and_test(&huge_zero_refcount));
 }

-static inline bool is_huge_zero_pmd(pmd_t pmd)
+static int shrink_huge_zero_page(struct shrinker *shrink,
+               struct shrink_control *sc)
 {
-       return is_huge_zero_pfn(pmd_pfn(pmd));
+       if (!sc->nr_to_scan)
+               /* we can free zero page only if last reference remains */
+               return atomic_read(&huge_zero_refcount) == 1 ? HPAGE_PMD_NR : 0;
+
+       if (atomic_cmpxchg(&huge_zero_refcount, 1, 0) == 1) {
+               unsigned long zero_pfn = xchg(&huge_zero_pfn, 0);
+               BUG_ON(zero_pfn == 0);
+               __free_page(__pfn_to_page(zero_pfn));
+       }
+
+       return 0;
 }

+static struct shrinker huge_zero_page_shrinker = {
+       .shrink = shrink_huge_zero_page,
+       .seeks = DEFAULT_SEEKS,
+};
+
 #ifdef CONFIG_SYSFS

 static ssize_t double_flag_show(struct kobject *kobj,
@@ -576,6 +619,8 @@ static int __init hugepage_init(void)
                goto out;
        }

+       register_shrinker(&huge_zero_page_shrinker);
+
        /*
         * By default disable transparent hugepages on smaller systems,
         * where the extra memory used could hurt more than TLB overhead
@@ -698,10 +743,11 @@ static inline struct page *alloc_hugepage(int defrag)
 #endif

 static void set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
-               struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd)
+               struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd,
+               unsigned long zero_pfn)
 {
        pmd_t entry;
-       entry = pfn_pmd(huge_zero_pfn, vma->vm_page_prot);
+       entry = pfn_pmd(zero_pfn, vma->vm_page_prot);
        entry = pmd_wrprotect(entry);
        entry = pmd_mkhuge(entry);
        set_pmd_at(mm, haddr, pmd, entry);
@@ -724,15 +770,19 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
                        return VM_FAULT_OOM;
                if (!(flags & FAULT_FLAG_WRITE)) {
                        pgtable_t pgtable;
-                       if (unlikely(!huge_zero_pfn && init_huge_zero_pfn())) {
-                               count_vm_event(THP_FAULT_FALLBACK);
-                               goto out;
-                       }
+                       unsigned long zero_pfn;
                        pgtable = pte_alloc_one(mm, haddr);
                        if (unlikely(!pgtable))
                                goto out;
+                       zero_pfn = get_huge_zero_page();
+                       if (unlikely(!zero_pfn)) {
+                               pte_free(mm, pgtable);
+                               count_vm_event(THP_FAULT_FALLBACK);
+                               goto out;
+                       }
                        spin_lock(&mm->page_table_lock);
-                       set_huge_zero_page(pgtable, mm, vma, haddr, pmd);
+                       set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
+                                       zero_pfn);
                        spin_unlock(&mm->page_table_lock);
                        return 0;
                }
@@ -801,7 +851,15 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
                goto out_unlock;
        }
        if (is_huge_zero_pmd(pmd)) {
-               set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd);
+               unsigned long zero_pfn;
+               /*
+                * get_huge_zero_page() will never allocate a new page here,
+                * since we already have a zero page to copy. It just takes a
+                * reference.
+                */
+               zero_pfn = get_huge_zero_page();
+               set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd,
+                               zero_pfn);
                ret = 0;
                goto out_unlock;
        }
@@ -908,6 +966,7 @@ static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
        smp_wmb(); /* make pte visible before pmd */
        pmd_populate(mm, pmd, pgtable);
        spin_unlock(&mm->page_table_lock);
+       put_huge_zero_page();

        mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);

@@ -1109,8 +1168,10 @@ alloc:
                page_add_new_anon_rmap(new_page, vma, haddr);
                set_pmd_at(mm, haddr, pmd, entry);
                update_mmu_cache_pmd(vma, address, pmd);
-               if (is_huge_zero_pmd(orig_pmd))
+               if (is_huge_zero_pmd(orig_pmd)) {
                        add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+                       put_huge_zero_page();
+               }
                if (page) {
                        VM_BUG_ON(!PageHead(page));
                        page_remove_rmap(page);
@@ -1188,6 +1249,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
                if (is_huge_zero_pmd(orig_pmd)) {
                        tlb->mm->nr_ptes--;
                        spin_unlock(&tlb->mm->page_table_lock);
+                       put_huge_zero_page();
                } else {
                        page = pmd_page(orig_pmd);
                        page_remove_rmap(page);
@@ -2544,6 +2606,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
        }
        smp_wmb(); /* make pte visible before pmd */
        pmd_populate(vma->vm_mm, pmd, pgtable);
+       put_huge_zero_page();
 }

 void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
--
1.7.7.6

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Discussion subject changed to "thp: zap_huge_pmd(): zap huge zero pmd" by Kirill A. Shutemov
Kirill A. Shutemov  
View profile  
 More options Nov 7 2012, 10:10 am
Newsgroups: linux.kernel
From: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>
Date: Wed, 07 Nov 2012 16:10:01 +0100
Local: Wed, Nov 7 2012 10:10 am
Subject: [PATCH v5 02/11] thp: zap_huge_pmd(): zap huge zero pmd
From: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>

We don't have a real page to zap in huge zero page case. Let's just
clear pmd and remove it from tlb.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 mm/huge_memory.c |   21 +++++++++++++--------
 1 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e5ce979..ff834ea 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1058,15 +1058,20 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
                pmd_t orig_pmd;
                pgtable = pgtable_trans_huge_withdraw(tlb->mm);
                orig_pmd = pmdp_get_and_clear(tlb->mm, addr, pmd);
-               page = pmd_page(orig_pmd);
                tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
-               page_remove_rmap(page);
-               VM_BUG_ON(page_mapcount(page) < 0);
-               add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
-               VM_BUG_ON(!PageHead(page));
-               tlb->mm->nr_ptes--;
-               spin_unlock(&tlb->mm->page_table_lock);
-               tlb_remove_page(tlb, page);
+               if (is_huge_zero_pmd(orig_pmd)) {
+                       tlb->mm->nr_ptes--;
+                       spin_unlock(&tlb->mm->page_table_lock);
+               } else {
+                       page = pmd_page(orig_pmd);
+                       page_remove_rmap(page);
+                       VM_BUG_ON(page_mapcount(page) < 0);
+                       add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+                       VM_BUG_ON(!PageHead(page));
+                       tlb->mm->nr_ptes--;
+                       spin_unlock(&tlb->mm->page_table_lock);
+                       tlb_remove_page(tlb, page);
+               }
                pte_free(tlb->mm, pgtable);
                ret = 1;
        }
--
1.7.7.6

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Discussion subject changed to "thp: change_huge_pmd(): keep huge zero page write-protected" by Kirill A. Shutemov
Kirill A. Shutemov  
View profile  
 More options Nov 7 2012, 10:10 am
Newsgroups: linux.kernel
From: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>
Date: Wed, 07 Nov 2012 16:10:02 +0100
Local: Wed, Nov 7 2012 10:10 am
Subject: [PATCH v5 05/11] thp: change_huge_pmd(): keep huge zero page write-protected
From: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>

We want to get page fault on write attempt to huge zero page, so let's
keep it write-protected.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 mm/huge_memory.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d767a7c..05490b3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1259,6 +1259,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
                pmd_t entry;
                entry = pmdp_get_and_clear(mm, addr, pmd);
                entry = pmd_modify(entry, newprot);
+               if (is_huge_zero_pmd(entry))
+                       entry = pmd_wrprotect(entry);
                set_pmd_at(mm, addr, pmd, entry);
                spin_unlock(&vma->vm_mm->page_table_lock);
                ret = 1;
--
1.7.7.6

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Discussion subject changed to "thp, vmstat: implement HZP_ALLOC and HZP_ALLOC_FAILED events" by Kirill A. Shutemov
Kirill A. Shutemov  
View profile  
 More options Nov 7 2012, 10:10 am
Newsgroups: linux.kernel
From: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>
Date: Wed, 07 Nov 2012 16:10:02 +0100
Local: Wed, Nov 7 2012 10:10 am
Subject: [PATCH v5 11/11] thp, vmstat: implement HZP_ALLOC and HZP_ALLOC_FAILED events
From: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>

hzp_alloc is incremented every time a huge zero page is successfully
        allocated. It includes allocations which where dropped due
        race with other allocation. Note, it doesn't count every map
        of the huge zero page, only its allocation.

hzp_alloc_failed is incremented if kernel fails to allocate huge zero
        page and falls back to using small pages.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 Documentation/vm/transhuge.txt |    8 ++++++++
 include/linux/vm_event_item.h  |    2 ++
 mm/huge_memory.c               |    5 ++++-
 mm/vmstat.c                    |    2 ++
 4 files changed, 16 insertions(+), 1 deletions(-)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 677a599..ec4e84e 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -197,6 +197,14 @@ thp_split is incremented every time a huge page is split into base
        pages. This can happen for a variety of reasons but a common
        reason is that a huge page is old and is being reclaimed.

+hzp_alloc is incremented every time a huge zero page is successfully
+       allocated. It includes allocations which where dropped due
+       race with other allocation. Note, it doesn't count every map
+       of the huge zero page, only its allocation.
+
+hzp_alloc_failed is incremented if kernel fails to allocate huge zero
+       page and falls back to using small pages.
+
 As the system ages, allocating huge pages may be expensive as the
 system uses memory compaction to copy data around memory to free a
 huge page for use. There are some counters in /proc/vmstat to help
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 3d31145..d7156fb 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -58,6 +58,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
                THP_COLLAPSE_ALLOC,
                THP_COLLAPSE_ALLOC_FAILED,
                THP_SPLIT,
+               HZP_ALLOC,
+               HZP_ALLOC_FAILED,
 #endif
                NR_VM_EVENT_ITEMS
 };
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 92a1b66..492658a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -183,8 +183,11 @@ retry:

        zero_page = alloc_pages((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,
                        HPAGE_PMD_ORDER);
-       if (!zero_page)
+       if (!zero_page) {
+               count_vm_event(HZP_ALLOC_FAILED);
                return 0;
+       }
+       count_vm_event(HZP_ALLOC);
        preempt_disable();
        if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(zero_page))) {
                preempt_enable();
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c737057..cb8901c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -801,6 +801,8 @@ const char * const vmstat_text[] = {
        "thp_collapse_alloc",
        "thp_collapse_alloc_failed",
        "thp_split",
+       "hzp_alloc",
+       "hzp_alloc_failed",
 #endif

 #endif /* CONFIG_VM_EVENTS_COUNTERS */
--
1.7.7.6

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Discussion subject changed to "thp: setup huge zero page on non-write page fault" by Kirill A. Shutemov
Kirill A. Shutemov  
View profile  
 More options Nov 7 2012, 10:10 am
Newsgroups: linux.kernel
From: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>
Date: Wed, 07 Nov 2012 16:10:02 +0100
Local: Wed, Nov 7 2012 10:10 am
Subject: [PATCH v5 08/11] thp: setup huge zero page on non-write page fault
From: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>

All code paths seems covered. Now we can map huge zero page on read page
fault.

We setup it in do_huge_pmd_anonymous_page() if area around fault address
is suitable for THP and we've got read page fault.

If we fail to setup huge zero page (ENOMEM) we fallback to
handle_pte_fault() as we normally do in THP.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 mm/huge_memory.c |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f36bc7d..41f05f1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -726,6 +726,16 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
                        return VM_FAULT_OOM;
                if (unlikely(khugepaged_enter(vma)))
                        return VM_FAULT_OOM;
+               if (!(flags & FAULT_FLAG_WRITE)) {
+                       pgtable_t pgtable;
+                       pgtable = pte_alloc_one(mm, haddr);
+                       if (unlikely(!pgtable))
+                               goto out;
+                       spin_lock(&mm->page_table_lock);
+                       set_huge_zero_page(pgtable, mm, vma, haddr, pmd);
+                       spin_unlock(&mm->page_table_lock);
+                       return 0;
+               }
                page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
                                          vma, haddr, numa_node_id(), 0);
                if (unlikely(!page)) {
--
1.7.7.6

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Discussion subject changed to "thp: implement splitting pmd for huge zero page" by Kirill A. Shutemov
Kirill A. Shutemov  
View profile  
 More options Nov 7 2012, 10:10 am
Newsgroups: linux.kernel
From: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>
Date: Wed, 07 Nov 2012 16:10:02 +0100
Local: Wed, Nov 7 2012 10:10 am
Subject: [PATCH v5 07/11] thp: implement splitting pmd for huge zero page
From: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>

We can't split huge zero page itself (and it's bug if we try), but we
can split the pmd which points to it.

On splitting the pmd we create a table with all ptes set to normal zero
page.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 mm/huge_memory.c |   47 ++++++++++++++++++++++++++++++++++++++++++++---
 1 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 90e651c..f36bc7d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1611,6 +1611,7 @@ int split_huge_page(struct page *page)
        struct anon_vma *anon_vma;
        int ret = 1;

+       BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
        BUG_ON(!PageAnon(page));
        anon_vma = page_lock_anon_vma(page);
        if (!anon_vma)
@@ -2509,23 +2510,63 @@ static int khugepaged(void *none)
        return 0;
 }

+static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
+               unsigned long haddr, pmd_t *pmd)
+{
+       pgtable_t pgtable;
+       pmd_t _pmd;
+       int i;
+
+       pmdp_clear_flush(vma, haddr, pmd);
+       /* leave pmd empty until pte is filled */
+
+       pgtable = get_pmd_huge_pte(vma->vm_mm);
+       pmd_populate(vma->vm_mm, &_pmd, pgtable);
+
+       for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+               pte_t *pte, entry;
+               entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
+               entry = pte_mkspecial(entry);
+               pte = pte_offset_map(&_pmd, haddr);
+               VM_BUG_ON(!pte_none(*pte));
+               set_pte_at(vma->vm_mm, haddr, pte, entry);
+               pte_unmap(pte);
+       }
+       smp_wmb(); /* make pte visible before pmd */
+       pmd_populate(vma->vm_mm, pmd, pgtable);
+}
+
 void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
                pmd_t *pmd)
 {
        struct page *page;
+       struct mm_struct *mm = vma->vm_mm;
        unsigned long haddr = address & HPAGE_PMD_MASK;
+       unsigned long mmun_start;       /* For mmu_notifiers */
+       unsigned long mmun_end;         /* For mmu_notifiers */

        BUG_ON(vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE);

-       spin_lock(&vma->vm_mm->page_table_lock);
+       mmun_start = haddr;
+       mmun_end   = address + HPAGE_PMD_SIZE;
+       mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+       spin_lock(&mm->page_table_lock);
        if (unlikely(!pmd_trans_huge(*pmd))) {
-               spin_unlock(&vma->vm_mm->page_table_lock);
+               spin_unlock(&mm->page_table_lock);
+               mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+               return;
+       }
+       if (is_huge_zero_pmd(*pmd)) {
+               __split_huge_zero_page_pmd(vma, haddr, pmd);
+               spin_unlock(&mm->page_table_lock);
+               mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
                return;
        }
        page = pmd_page(*pmd);
        VM_BUG_ON(!page_count(page));
        get_page(page);
-       spin_unlock(&vma->vm_mm->page_table_lock);
+       spin_unlock(&mm->page_table_lock);
+       mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);

        split_huge_page(page);

--
1.7.7.6

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Discussion subject changed to "thp: copy_huge_pmd(): copy huge zero page" by Kirill A. Shutemov
Kirill A. Shutemov  
View profile  
 More options Nov 7 2012, 10:10 am
Newsgroups: linux.kernel
From: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>
Date: Wed, 07 Nov 2012 16:10:02 +0100
Local: Wed, Nov 7 2012 10:10 am
Subject: [PATCH v5 03/11] thp: copy_huge_pmd(): copy huge zero page
From: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>

It's easy to copy huge zero page. Just set destination pmd to huge zero
page.

It's safe to copy huge zero page since we have none yet :-p

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 mm/huge_memory.c |   17 +++++++++++++++++
 1 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ff834ea..0d903bf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -701,6 +701,18 @@ static inline struct page *alloc_hugepage(int defrag)
 }
 #endif

+static void set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
+               struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd)
+{
+       pmd_t entry;
+       entry = pfn_pmd(huge_zero_pfn, vma->vm_page_prot);
+       entry = pmd_wrprotect(entry);
+       entry = pmd_mkhuge(entry);
+       set_pmd_at(mm, haddr, pmd, entry);
+       pgtable_trans_huge_deposit(mm, pgtable);
+       mm->nr_ptes++;
+}
+
 int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
                               unsigned long address, pmd_t *pmd,
                               unsigned int flags)
@@ -778,6 +790,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
                pte_free(dst_mm, pgtable);
                goto out_unlock;
        }
+       if (is_huge_zero_pmd(pmd)) {
+               set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd);
+               ret = 0;
+               goto out_unlock;
+       }
        if (unlikely(pmd_trans_splitting(pmd))) {
                /* split huge page running from under us */
                spin_unlock(&src_mm->page_table_lock);
--
1.7.7.6

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Discussion subject changed to "thp: lazy huge zero page allocation" by Kirill A. Shutemov
Kirill A. Shutemov  
View profile  
 More options Nov 7 2012, 10:10 am
Newsgroups: linux.kernel
From: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>
Date: Wed, 07 Nov 2012 16:10:03 +0100
Local: Wed, Nov 7 2012 10:10 am
Subject: [PATCH v5 09/11] thp: lazy huge zero page allocation
From: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>

Instead of allocating huge zero page on hugepage_init() we can postpone it
until first huge zero page map. It saves memory if THP is not in use.

cmpxchg() is used to avoid race on huge_zero_pfn initialization.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 mm/huge_memory.c |   20 ++++++++++----------
 1 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 41f05f1..52073c2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -160,22 +160,24 @@ static int start_khugepaged(void)
        return err;
 }

-static int init_huge_zero_page(void)
+static int init_huge_zero_pfn(void)
 {
        struct page *hpage;
+       unsigned long pfn;

        hpage = alloc_pages((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,
                        HPAGE_PMD_ORDER);
        if (!hpage)
                return -ENOMEM;
-
-       huge_zero_pfn = page_to_pfn(hpage);
+       pfn = page_to_pfn(hpage);
+       if (cmpxchg(&huge_zero_pfn, 0, pfn))
+               __free_page(hpage);
        return 0;
 }

 static inline bool is_huge_zero_pfn(unsigned long pfn)
 {
-       return pfn == huge_zero_pfn;
+       return huge_zero_pfn && pfn == huge_zero_pfn;
 }

 static inline bool is_huge_zero_pmd(pmd_t pmd)
@@ -564,10 +566,6 @@ static int __init hugepage_init(void)
        if (err)
                return err;

-       err = init_huge_zero_page();
-       if (err)
-               goto out;
-
        err = khugepaged_slab_init();
        if (err)
                goto out;
@@ -590,8 +588,6 @@ static int __init hugepage_init(void)

        return 0;
 out:
-       if (huge_zero_pfn)
-               __free_page(pfn_to_page(huge_zero_pfn));
        hugepage_exit_sysfs(hugepage_kobj);
        return err;
 }
@@ -728,6 +724,10 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
                        return VM_FAULT_OOM;
                if (!(flags & FAULT_FLAG_WRITE)) {
                        pgtable_t pgtable;
+                       if (unlikely(!huge_zero_pfn && init_huge_zero_pfn())) {
+                               count_vm_event(THP_FAULT_FALLBACK);
+                               goto out;
+                       }
                        pgtable = pte_alloc_one(mm, haddr);
                        if (unlikely(!pgtable))
                                goto out;
--
1.7.7.6

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Discussion subject changed to "thp: do_huge_pmd_wp_page(): handle huge zero page" by Kirill A. Shutemov
Kirill A. Shutemov  
View profile  
 More options Nov 7 2012, 10:10 am
Newsgroups: linux.kernel
From: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>
Date: Wed, 07 Nov 2012 16:10:03 +0100
Local: Wed, Nov 7 2012 10:10 am
Subject: [PATCH v5 04/11] thp: do_huge_pmd_wp_page(): handle huge zero page
From: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>

On write access to huge zero page we alloc a new huge page and clear it.

If ENOMEM, graceful fallback: we create a new pmd table and set pte
around fault address to newly allocated normal (4k) page. All other ptes
in the pmd set to normal zero page.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 include/linux/mm.h |    8 +++
 mm/huge_memory.c   |  129 ++++++++++++++++++++++++++++++++++++++++++++++------
 mm/memory.c        |    7 ---
 3 files changed, 122 insertions(+), 22 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index fa06804..fe329da 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -516,6 +516,14 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 }
 #endif

+#ifndef my_zero_pfn
+static inline unsigned long my_zero_pfn(unsigned long addr)
+{
+       extern unsigned long zero_pfn;
+       return zero_pfn;
+}
+#endif
+
 /*
  * Multiple processes may "see" the same page. E.g. for untouched
  * mappings of /dev/null, all processes see the same page full of
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0d903bf..d767a7c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -824,6 +824,88 @@ out:
        return ret;
 }

+/* no "address" argument so destroys page coloring of some arch */
+pgtable_t get_pmd_huge_pte(struct mm_struct *mm)
+{
+       pgtable_t pgtable;
+
+       assert_spin_locked(&mm->page_table_lock);
+
+       /* FIFO */
+       pgtable = mm->pmd_huge_pte;
+       if (list_empty(&pgtable->lru))
+               mm->pmd_huge_pte = NULL;
+       else {
+               mm->pmd_huge_pte = list_entry(pgtable->lru.next,
+                                             struct page, lru);
+               list_del(&pgtable->lru);
+       }
+       return pgtable;
+}
+
+static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
+               struct vm_area_struct *vma, unsigned long address,
+               pmd_t *pmd, unsigned long haddr)
+{
+       pgtable_t pgtable;
+       pmd_t _pmd;
+       struct page *page;
+       int i, ret = 0;
+       unsigned long mmun_start;       /* For mmu_notifiers */
+       unsigned long mmun_end;         /* For mmu_notifiers */
+
+       page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+       if (!page) {
+               ret |= VM_FAULT_OOM;
+               goto out;
+       }
+
+       if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL)) {
+               put_page(page);
+               ret |= VM_FAULT_OOM;
+               goto out;
+       }
+
+       clear_user_highpage(page, address);
+       __SetPageUptodate(page);
+
+       mmun_start = haddr;
+       mmun_end   = haddr + HPAGE_PMD_SIZE;
+       mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+
+       spin_lock(&mm->page_table_lock);
+       pmdp_clear_flush(vma, haddr, pmd);
+       /* leave pmd empty until pte is filled */
+
+       pgtable = get_pmd_huge_pte(mm);
+       pmd_populate(mm, &_pmd, pgtable);
+
+       for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+               pte_t *pte, entry;
+               if (haddr == (address & PAGE_MASK)) {
+                       entry = mk_pte(page, vma->vm_page_prot);
+                       entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+                       page_add_new_anon_rmap(page, vma, haddr);
+               } else {
+                       entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
+                       entry = pte_mkspecial(entry);
+               }
+               pte = pte_offset_map(&_pmd, haddr);
+               VM_BUG_ON(!pte_none(*pte));
+               set_pte_at(mm, haddr, pte, entry);
+               pte_unmap(pte);
+       }
+       smp_wmb(); /* make pte visible before pmd */
+       pmd_populate(mm, pmd, pgtable);
+       spin_unlock(&mm->page_table_lock);
+
+       mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+
+       ret |= VM_FAULT_WRITE;
+out:
+       return ret;
+}
+
 static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
                                        struct vm_area_struct *vma,
                                        unsigned long address,
@@ -930,19 +1012,21 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
                        unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
 {
        int ret = 0;
-       struct page *page, *new_page;
+       struct page *page = NULL, *new_page;
        unsigned long haddr;
        unsigned long mmun_start;       /* For mmu_notifiers */
        unsigned long mmun_end;         /* For mmu_notifiers */

        VM_BUG_ON(!vma->anon_vma);
+       haddr = address & HPAGE_PMD_MASK;
+       if (is_huge_zero_pmd(orig_pmd))
+               goto alloc;
        spin_lock(&mm->page_table_lock);
        if (unlikely(!pmd_same(*pmd, orig_pmd)))
                goto out_unlock;

        page = pmd_page(orig_pmd);
        VM_BUG_ON(!PageCompound(page) || !PageHead(page));
-       haddr = address & HPAGE_PMD_MASK;
        if (page_mapcount(page) == 1) {
                pmd_t entry;
                entry = pmd_mkyoung(orig_pmd);
@@ -954,7 +1038,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
        }
        get_page(page);
        spin_unlock(&mm->page_table_lock);
-
+alloc:
        if (transparent_hugepage_enabled(vma) &&
            !transparent_hugepage_debug_cow())
                new_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
@@ -964,24 +1048,34 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,

        if (unlikely(!new_page)) {
                count_vm_event(THP_FAULT_FALLBACK);
-               ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
-                                                  pmd, orig_pmd, page, haddr);
-               if (ret & VM_FAULT_OOM)
-                       split_huge_page(page);
-               put_page(page);
+               if (is_huge_zero_pmd(orig_pmd)) {
+                       ret = do_huge_pmd_wp_zero_page_fallback(mm, vma,
+                                       address, pmd, haddr);
+               } else {
+                       ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
+                                       pmd, orig_pmd, page, haddr);
+                       if (ret & VM_FAULT_OOM)
+                               split_huge_page(page);
+                       put_page(page);
+               }
                goto out;
        }
        count_vm_event(THP_FAULT_ALLOC);

        if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
                put_page(new_page);
-               split_huge_page(page);
-               put_page(page);
+               if (page) {
+                       split_huge_page(page);
+                       put_page(page);
+               }
                ret |= VM_FAULT_OOM;
                goto out;
        }

-       copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
+       if (is_huge_zero_pmd(orig_pmd))
+               clear_huge_page(new_page, haddr, HPAGE_PMD_NR);
+       else
+               copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
        __SetPageUptodate(new_page);

        mmun_start = haddr;
@@ -989,7 +1083,8 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
        mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);

        spin_lock(&mm->page_table_lock);
-       put_page(page);
+       if (page)
+               put_page(page);
        if (unlikely(!pmd_same(*pmd, orig_pmd))) {
                spin_unlock(&mm->page_table_lock);
                mem_cgroup_uncharge_page(new_page);
@@ -997,7 +1092,6 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
                goto out_mn;
        } else {
                pmd_t entry;
-               VM_BUG_ON(!PageHead(page));
                entry = mk_pmd(new_page, vma->vm_page_prot);
                entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
                entry = pmd_mkhuge(entry);
@@ -1005,8 +1099,13 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
                page_add_new_anon_rmap(new_page, vma, haddr);
                set_pmd_at(mm, haddr, pmd, entry);
                update_mmu_cache_pmd(vma, address, pmd);
-               page_remove_rmap(page);
-               put_page(page);
+               if (is_huge_zero_pmd(orig_pmd))
+                       add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+               if (page) {
+                       VM_BUG_ON(!PageHead(page));
+                       page_remove_rmap(page);
+                       put_page(page);
+               }
                ret |= VM_FAULT_WRITE;
        }
        spin_unlock(&mm->page_table_lock);
diff --git a/mm/memory.c b/mm/memory.c
index fb135ba..6edc030 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -724,13 +724,6 @@ static inline int is_zero_pfn(unsigned long pfn)
 }
 #endif

-#ifndef my_zero_pfn
-static inline unsigned long my_zero_pfn(unsigned long addr)
-{
-       return zero_pfn;
-}
-#endif
-
 /*
  * vm_normal_page -- This function gets the "struct page" associated with a pte.
  *
--
1.7.7.6

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Discussion subject changed to "thp: huge zero page: basic preparation" by Kirill A. Shutemov
Kirill A. Shutemov  
View profile  
 More options Nov 7 2012, 10:10 am
Newsgroups: linux.kernel
From: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>
Date: Wed, 07 Nov 2012 16:10:02 +0100
Local: Wed, Nov 7 2012 10:10 am
Subject: [PATCH v5 01/11] thp: huge zero page: basic preparation
From: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>

Huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled
with zeros.

For now let's allocate the page on hugepage_init(). We'll switch to lazy
allocation later.

We are not going to map the huge zero page until we can handle it
properly on all code paths.

is_huge_zero_{pfn,pmd}() functions will be used by following patches to
check whether the pfn/pmd is huge zero page.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 mm/huge_memory.c |   30 ++++++++++++++++++++++++++++++
 1 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 40f17c3..e5ce979 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -47,6 +47,7 @@ static unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 10000;
 /* during fragmentation poll the hugepage allocator once every minute */
 static unsigned int khugepaged_alloc_sleep_millisecs __read_mostly = 60000;
 static struct task_struct *khugepaged_thread __read_mostly;
+static unsigned long huge_zero_pfn __read_mostly;
 static DEFINE_MUTEX(khugepaged_mutex);
 static DEFINE_SPINLOCK(khugepaged_mm_lock);
 static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
@@ -159,6 +160,29 @@ static int start_khugepaged(void)
        return err;
 }

+static int init_huge_zero_page(void)
+{
+       struct page *hpage;
+
+       hpage = alloc_pages((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,
+                       HPAGE_PMD_ORDER);
+       if (!hpage)
+               return -ENOMEM;
+
+       huge_zero_pfn = page_to_pfn(hpage);
+       return 0;
+}
+
+static inline bool is_huge_zero_pfn(unsigned long pfn)
+{
+       return pfn == huge_zero_pfn;
+}
+
+static inline bool is_huge_zero_pmd(pmd_t pmd)
+{
+       return is_huge_zero_pfn(pmd_pfn(pmd));
+}
+
 #ifdef CONFIG_SYSFS

 static ssize_t double_flag_show(struct kobject *kobj,
@@ -540,6 +564,10 @@ static int __init hugepage_init(void)
        if (err)
                return err;

+       err = init_huge_zero_page();
+       if (err)
+               goto out;
+
        err = khugepaged_slab_init();
        if (err)
                goto out;
@@ -562,6 +590,8 @@ static int __init hugepage_init(void)

        return 0;
 out:
+       if (huge_zero_pfn)
+               __free_page(pfn_to_page(huge_zero_pfn));
        hugepage_exit_sysfs(hugepage_kobj);
        return err;
 }
--
1.7.7.6

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Discussion subject changed to "thp: change split_huge_page_pmd() interface" by Kirill A. Shutemov
Kirill A. Shutemov  
View profile  
 More options Nov 7 2012, 10:10 am
Newsgroups: linux.kernel
From: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>
Date: Wed, 07 Nov 2012 16:10:03 +0100
Local: Wed, Nov 7 2012 10:10 am
Subject: [PATCH v5 06/11] thp: change split_huge_page_pmd() interface
From: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>

Pass vma instead of mm and add address parameter.

In most cases we already have vma on the stack. We provides
split_huge_page_pmd_mm() for few cases when we have mm, but not vma.

This change is preparation to huge zero pmd splitting implementation.

Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
---
 Documentation/vm/transhuge.txt |    4 ++--
 arch/x86/kernel/vm86_32.c      |    2 +-
 fs/proc/task_mmu.c             |    2 +-
 include/linux/huge_mm.h        |   14 ++++++++++----
 mm/huge_memory.c               |   24 +++++++++++++++++++-----
 mm/memory.c                    |    4 ++--
 mm/mempolicy.c                 |    2 +-
 mm/mprotect.c                  |    2 +-
 mm/mremap.c                    |    2 +-
 mm/pagewalk.c                  |    2 +-
 10 files changed, 39 insertions(+), 19 deletions(-)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index f734bb2..677a599 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -276,7 +276,7 @@ unaffected. libhugetlbfs will also work fine as usual.
 == Graceful fallback ==

 Code walking pagetables but unware about huge pmds can simply call
-split_huge_page_pmd(mm, pmd) where the pmd is the one returned by
+split_huge_page_pmd(vma, pmd, addr) where the pmd is the one returned by
 pmd_offset. It's trivial to make the code transparent hugepage aware
 by just grepping for "pmd_offset" and adding split_huge_page_pmd where
 missing after pmd_offset returns the pmd. Thanks to the graceful
@@ -299,7 +299,7 @@ diff --git a/mm/mremap.c b/mm/mremap.c
                return NULL;

        pmd = pmd_offset(pud, addr);
-+      split_huge_page_pmd(mm, pmd);
++      split_huge_page_pmd(vma, pmd, addr);
        if (pmd_none_or_clear_bad(pmd))
                return NULL;

diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c
index 5c9687b..1dfe69c 100644
--- a/arch/x86/kernel/vm86_32.c
+++ b/arch/x86/kernel/vm86_32.c
@@ -182,7 +182,7 @@ static void mark_screen_rdonly(struct mm_struct *mm)
        if (pud_none_or_clear_bad(pud))
                goto out;
        pmd = pmd_offset(pud, 0xA0000);
-       split_huge_page_pmd(mm, pmd);
+       split_huge_page_pmd_mm(mm, 0xA0000, pmd);
        if (pmd_none_or_clear_bad(pmd))
                goto out;
        pte = pte_offset_map_lock(mm, pmd, 0xA0000, &ptl);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 90c63f9..291a0d1 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -643,7 +643,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
        spinlock_t *ptl;
        struct page *page;

-       split_huge_page_pmd(walk->mm, pmd);
+       split_huge_page_pmd(vma, addr, pmd);
        if (pmd_trans_unstable(pmd))
                return 0;

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b31cb7d..856f080 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -91,12 +91,14 @@ extern int handle_pte_fault(struct mm_struct *mm,
                            struct vm_area_struct *vma, unsigned long address,
                            pte_t *pte, pmd_t *pmd, unsigned int flags);
 extern int split_huge_page(struct page *page);
-extern void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd);
-#define split_huge_page_pmd(__mm, __pmd)                               \
+extern void __split_huge_page_pmd(struct vm_area_struct *vma,
+               unsigned long address, pmd_t *pmd);
+#define split_huge_page_pmd(__vma, __address, __pmd)                   \
        do {                                                            \
                pmd_t *____pmd = (__pmd);                               \
                if (unlikely(pmd_trans_huge(*____pmd)))                 \
-                       __split_huge_page_pmd(__mm, ____pmd);           \
+                       __split_huge_page_pmd(__vma, __address,         \
+                                       ____pmd);                       \
        }  while (0)
 #define wait_split_huge_page(__anon_vma, __pmd)                                \
        do {                                                            \
@@ -106,6 +108,8 @@ extern void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd);
                BUG_ON(pmd_trans_splitting(*____pmd) ||                 \
                       pmd_trans_huge(*____pmd));                       \
        } while (0)
+extern void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
+               pmd_t *pmd);
 #if HPAGE_PMD_ORDER > MAX_ORDER
 #error "hugepages can't be allocated by the buddy allocator"
 #endif
@@ -173,10 +177,12 @@ static inline int split_huge_page(struct page *page)
 {
        return 0;
 }
-#define split_huge_page_pmd(__mm, __pmd)       \
+#define split_huge_page_pmd(__vma, __address, __pmd)   \
        do { } while (0)
 #define wait_split_huge_page(__anon_vma, __pmd)        \
        do { } while (0)
+#define split_huge_page_pmd_mm(__mm, __address, __pmd) \
+       do { } while (0)
 #define compound_trans_head(page) compound_head(page)
 static inline int hugepage_madvise(struct vm_area_struct *vma,
                                   unsigned long *vm_flags, int advice)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 05490b3..90e651c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2509,19 +2509,23 @@ static int khugepaged(void *none)
        return 0;
 }

-void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd)
+void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
+               pmd_t *pmd)
 {
        struct page *page;
+       unsigned long haddr = address & HPAGE_PMD_MASK;

-       spin_lock(&mm->page_table_lock);
+       BUG_ON(vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE);
+
+       spin_lock(&vma->vm_mm->page_table_lock);
        if (unlikely(!pmd_trans_huge(*pmd))) {
-               spin_unlock(&mm->page_table_lock);
+               spin_unlock(&vma->vm_mm->page_table_lock);
                return;
        }
        page = pmd_page(*pmd);
        VM_BUG_ON(!page_count(page));
        get_page(page);
-       spin_unlock(&mm->page_table_lock);
+       spin_unlock(&vma->vm_mm->page_table_lock);

        split_huge_page(page);

@@ -2529,6 +2533,16 @@ void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd)
        BUG_ON(pmd_trans_huge(*pmd));
 }

+void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
+               pmd_t *pmd)
+{
+       struct vm_area_struct *vma;
+
+       vma = find_vma(mm, address);
+       BUG_ON(vma == NULL);
+       split_huge_page_pmd(vma, address, pmd);
+}
+
 static void split_huge_page_address(struct mm_struct *mm,
                                    unsigned long address)
 {
@@ -2553,7 +2567,7 @@ static void split_huge_page_address(struct mm_struct *mm,
         * Caller holds the mmap_sem write mode, so a huge pmd cannot
         * materialize from under us.
         */
-       split_huge_page_pmd(mm, pmd);
+       split_huge_page_pmd_mm(mm, address, pmd);
 }

 void __vma_adjust_trans_huge(struct vm_area_struct *vma,
diff --git a/mm/memory.c b/mm/memory.c
index 6edc030..6017e23 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1243,7 +1243,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
                                        BUG();
                                }
 #endif
-                               split_huge_page_pmd(vma->vm_mm, pmd);
+                               split_huge_page_pmd(vma, addr, pmd);
                        } else if (zap_huge_pmd(tlb, vma, pmd, addr))
                                goto next;
                        /* fall through */
@@ -1512,7 +1512,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
        }
        if (pmd_trans_huge(*pmd)) {
                if (flags & FOLL_SPLIT) {
-                       split_huge_page_pmd(mm, pmd);
+                       split_huge_page_pmd(vma, address, pmd);
                        goto split_fallthrough;
                }
                spin_lock(&mm->page_table_lock);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d04a8a5..b68061e 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -511,7 +511,7 @@ static inline int check_pmd_range(struct vm_area_struct *vma, pud_t *pud,
        pmd = pmd_offset(pud, addr);
        do {
                next = pmd_addr_end(addr, end);
-               split_huge_page_pmd(vma->vm_mm, pmd);
+               split_huge_page_pmd(vma, addr, pmd);
                if (pmd_none_or_trans_huge_or_clear_bad(pmd))
                        continue;
                if (check_pte_range(vma, pmd, addr, next, nodes,
diff --git a/mm/mprotect.c b/mm/mprotect.c
index a409926..e8c3938 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -90,7 +90,7 @@ static inline void change_pmd_range(struct vm_area_struct *vma, pud_t *pud,
                next = pmd_addr_end(addr, end);
                if (pmd_trans_huge(*pmd)) {
                        if (next - addr != HPAGE_PMD_SIZE)
-                               split_huge_page_pmd(vma->vm_mm, pmd);
+                               split_huge_page_pmd(vma, addr, pmd);
                        else if (change_huge_pmd(vma, pmd, addr, newprot))
                                continue;
                        /* fall through */
diff --git a/mm/mremap.c b/mm/mremap.c
index 1b61c2d..eabb24d 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -182,7 +182,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
                                need_flush = true;
                                continue;
                        } else if (!err) {
-                               split_huge_page_pmd(vma->vm_mm, old_pmd);
+                               split_huge_page_pmd(vma, old_addr, old_pmd);
                        }
                        VM_BUG_ON(pmd_trans_huge(*old_pmd));
                }
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 6c118d0..35aa294 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -58,7 +58,7 @@ again:
                if (!walk->pte_entry)
                        continue;

-               split_huge_page_pmd(walk->mm, pmd);
+               split_huge_page_pmd_mm(walk->mm, addr, pmd);
                if (pmd_none_or_trans_huge_or_clear_bad(pmd))
                        goto again;
                err = walk_pte_range(pmd, addr, next, walk);
--
1.7.7.6

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Discussion subject changed to "Introduce huge zero page" by Kirill A. Shutemov
Kirill A. Shutemov  
View profile  
 More options Nov 7 2012, 10:10 am
Newsgroups: linux.kernel
From: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>
Date: Wed, 07 Nov 2012 16:10:03 +0100
Local: Wed, Nov 7 2012 10:10 am
Subject: [PATCH v5 00/11] Introduce huge zero page
From: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>

Hi,

Andrew, here's updated huge zero page patchset.

Please consider applying.

=================

During testing I noticed big (up to 2.5 times) memory consumption overhead
on some workloads (e.g. ft.A from NPB) if THP is enabled.

The main reason for that big difference is lacking zero page in THP case.
We have to allocate a real page on read page fault.

A program to demonstrate the issue:
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

#define MB 1024*1024

int main(int argc, char **argv)
{
        char *p;
        int i;

        posix_memalign((void **)&p, 2 * MB, 200 * MB);
        for (i = 0; i < 200 * MB; i+= 4096)
                assert(p[i] == 0);
        pause();
        return 0;

}

With thp-never RSS is about 400k, but with thp-always it's 200M.
After the patcheset thp-always RSS is 400k too.

Design overview.

Huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled with
zeros.  The way how we allocate it changes in the patchset:

- [01/10] simplest way: hzp allocated on boot time in hugepage_init();
- [09/10] lazy allocation on first use;
- [10/10] lockless refcounting + shrinker-reclaimable hzp;

We setup it in do_huge_pmd_anonymous_page() if area around fault address
is suitable for THP and we've got read page fault.
If we fail to setup hzp (ENOMEM) we fallback to handle_pte_fault() as we
normally do in THP.

On wp fault to hzp we allocate real memory for the huge page and clear it.
If ENOMEM, graceful fallback: we create a new pmd table and set pte around
fault address to newly allocated normal (4k) page. All other ptes in the
pmd set to normal zero page.

We cannot split hzp (and it's bug if we try), but we can split the pmd
which points to it. On splitting the pmd we create a table with all ptes
set to normal zero page.

Patchset organized in bisect-friendly way:
 Patches 01-07: prepare all code paths for hzp
 Patch 08: all code paths are covered: safe to setup hzp
 Patch 09: lazy allocation
 Patch 10: lockless refcounting for hzp

v5:
 - implement HZP_ALLOC and HZP_ALLOC_FAILED events;
v4:
 - Rebase to v3.7-rc1;
 - Update commit message;
v3:
 - fix potential deadlock in refcounting code on preemptive kernel.
 - do not mark huge zero page as movable.
 - fix typo in comment.
 - Reviewed-by tag from Andrea Arcangeli.
v2:
 - Avoid find_vma() if we've already had vma on stack.
   Suggested by Andrea Arcangeli.
 - Implement refcounting for huge zero page.

--------------------------------------------------------------------------

By hpa request I've tried alternative approach for hzp implementation (see
Virtual huge zero page patchset): pmd table with all entries set to zero
page. This way should be more cache friendly, but it increases TLB
pressure.

The problem with virtual huge zero page: it requires per-arch enabling.
We need a way to mark that pmd table has all ptes set to zero page.

Some numbers to compare two implementations (on 4s Westmere-EX):

Mirobenchmark1
==============

test:
        posix_memalign((void **)&p, 2 * MB, 8 * GB);
        for (i = 0; i < 100; i++) {
                assert(memcmp(p, p + 4*GB, 4*GB) == 0);
                asm volatile ("": : :"memory");
        }

hzp:
 Performance counter stats for './test_memcmp' (5 runs):

      32356.272845 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
                40 context-switches          #    0.001 K/sec                    ( +-  0.94% )
                 0 CPU-migrations            #    0.000 K/sec
             4,218 page-faults               #    0.130 K/sec                    ( +-  0.00% )
    76,712,481,765 cycles                    #    2.371 GHz                      ( +-  0.13% ) [83.31%]
    36,279,577,636 stalled-cycles-frontend   #   47.29% frontend cycles idle     ( +-  0.28% ) [83.35%]
     1,684,049,110 stalled-cycles-backend    #    2.20% backend  cycles idle     ( +-  2.96% ) [66.67%]
   134,355,715,816 instructions              #    1.75  insns per cycle
                                             #    0.27  stalled cycles per insn  ( +-  0.10% ) [83.35%]
    13,526,169,702 branches                  #  418.039 M/sec                    ( +-  0.10% ) [83.31%]
         1,058,230 branch-misses             #    0.01% of all branches          ( +-  0.91% ) [83.36%]

      32.413866442 seconds time elapsed                                          ( +-  0.13% )

vhzp:
 Performance counter stats for './test_memcmp' (5 runs):

      30327.183829 task-clock                #    0.998 CPUs utilized            ( +-  0.13% )
                38 context-switches          #    0.001 K/sec                    ( +-  1.53% )
                 0 CPU-migrations            #    0.000 K/sec
             4,218 page-faults               #    0.139 K/sec                    ( +-  0.01% )
    71,964,773,660 cycles                    #    2.373 GHz                      ( +-  0.13% ) [83.35%]
    31,191,284,231 stalled-cycles-frontend   #   43.34% frontend cycles idle     ( +-  0.40% ) [83.32%]
       773,484,474 stalled-cycles-backend    #    1.07% backend  cycles idle     ( +-  6.61% ) [66.67%]
   134,982,215,437 instructions              #    1.88  insns per cycle
                                             #    0.23  stalled cycles per insn  ( +-  0.11% ) [83.32%]
    13,509,150,683 branches                  #  445.447 M/sec                    ( +-  0.11% ) [83.34%]
         1,017,667 branch-misses             #    0.01% of all branches          ( +-  1.07% ) [83.32%]

      30.381324695 seconds time elapsed                                          ( +-  0.13% )

Mirobenchmark2
==============

test:
        posix_memalign((void **)&p, 2 * MB, 8 * GB);
        for (i = 0; i < 1000; i++) {
                char *_p = p;
                while (_p < p+4*GB) {
                        assert(*_p == *(_p+4*GB));
                        _p += 4096;
                        asm volatile ("": : :"memory");
                }
        }

hzp:
 Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):

       3505.727639 task-clock                #    0.998 CPUs utilized            ( +-  0.26% )
                 9 context-switches          #    0.003 K/sec                    ( +-  4.97% )
             4,384 page-faults               #    0.001 M/sec                    ( +-  0.00% )
     8,318,482,466 cycles                    #    2.373 GHz                      ( +-  0.26% ) [33.31%]
     5,134,318,786 stalled-cycles-frontend   #   61.72% frontend cycles idle     ( +-  0.42% ) [33.32%]
     2,193,266,208 stalled-cycles-backend    #   26.37% backend  cycles idle     ( +-  5.51% ) [33.33%]
     9,494,670,537 instructions              #    1.14  insns per cycle
                                             #    0.54  stalled cycles per insn  ( +-  0.13% ) [41.68%]
     2,108,522,738 branches                  #  601.451 M/sec                    ( +-  0.09% ) [41.68%]
           158,746 branch-misses             #    0.01% of all branches          ( +-  1.60% ) [41.71%]
     3,168,102,115 L1-dcache-loads
          #  903.693 M/sec                    ( +-  0.11% ) [41.70%]
     1,048,710,998 L1-dcache-misses
         #   33.10% of all L1-dcache hits    ( +-  0.11% ) [41.72%]
     1,047,699,685 LLC-load
                 #  298.854 M/sec                    ( +-  0.03% ) [33.38%]
             2,287 LLC-misses
               #    0.00% of all LL-cache hits     ( +-  8.27% ) [33.37%]
     3,166,187,367 dTLB-loads
               #  903.147 M/sec                    ( +-  0.02% ) [33.35%]
         4,266,538 dTLB-misses
              #    0.13% of all dTLB cache hits   ( +-  0.03% ) [33.33%]

       3.513339813 seconds time elapsed                                          ( +-  0.26% )

vhzp:
 Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):

      27313.891128 task-clock                #    0.998 CPUs utilized            ( +-  0.24% )
                62 context-switches          #    0.002 K/sec                    ( +-  0.61% )
             4,384 page-faults               #    0.160 K/sec                    ( +-  0.01% )
    64,747,374,606 cycles                    #    2.370 GHz                      ( +-  0.24% ) [33.33%]
    61,341,580,278 stalled-cycles-frontend   #   94.74% frontend cycles idle     ( +-  0.26% ) [33.33%]
    56,702,237,511 stalled-cycles-backend    #   87.57% backend  cycles idle     ( +-  0.07% ) [33.33%]
    10,033,724,846 instructions              #    0.15  insns per cycle
                                             #    6.11  stalled cycles per insn  ( +-  0.09% ) [41.65%]
     2,190,424,932 branches                  #   80.195 M/sec                    ( +-  0.12% ) [41.66%]
         1,028,630 branch-misses             #    0.05% of all branches          ( +-  1.50% ) [41.66%]
     3,302,006,540 L1-dcache-loads
          #  120.891 M/sec                    ( +-  0.11% ) [41.68%]
       271,374,358 L1-dcache-misses
         #    8.22% of all L1-dcache hits    ( +-  0.04% ) [41.66%]
        20,385,476 LLC-load
                 #    0.746 M/sec                    ( +-  1.64% ) [33.34%]
            76,754 LLC-misses
               #    0.38% of all LL-cache hits     ( +-  2.35% ) [33.34%]
     3,309,927,290 dTLB-loads
               #  121.181 M/sec                    ( +-  0.03% ) [33.34%]
     2,098,967,427 dTLB-misses
              #   63.41% of all dTLB cache hits   ( +-  0.03% ) [33.34%]

      27.364448741 seconds time elapsed                                          ( +-  0.24% )

--------------------------------------------------------------------------

I personally prefer implementation present in this patchset. It doesn't
touch arch-specific code.

Kirill A. Shutemov (11): ...

read more »


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Andrew Morton  
View profile  
 More options Nov 14 2012, 4:40 pm
Newsgroups: linux.kernel
From: Andrew Morton <a...@linux-foundation.org>
Date: Wed, 14 Nov 2012 22:40:02 +0100
Local: Wed, Nov 14 2012 4:40 pm
Subject: Re: [PATCH v5 00/11] Introduce huge zero page
On Wed,  7 Nov 2012 17:00:52 +0200
"Kirill A. Shutemov" <kirill.shute...@linux.intel.com> wrote:

> Andrew, here's updated huge zero page patchset.

There is still a distinct lack of reviewed-by's and acked-by's on this
patchset.

On 13 Sep, Andrea did indicate that he "reviewed the whole patchset and
it looks fine to me".  But that information failed to make it into the
changelogs, which is bad.

I grabbed the patchset.  I might hold it over until 3.9 depending on
additional review/test feedback and upon whether Andrea can be
persuaded to take another look at it all.

I'm still a bit concerned over the possibility that some workloads will
cause a high-frequency free/alloc/memset cycle on that huge zero page.
We'll see how it goes...

For this reason and for general ease-of-testing: can and should we add
a knob which will enable users to disable the feature at runtime?  That
way if it causes problems or if we suspect it's causing problems, we
can easily verify the theory and offer users a temporary fix.

Such a knob could be a boot-time option, but a post-boot /proc thing
would be much nicer.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Discussion subject changed to "thp: huge zero page: basic preparation" by David Rientjes
David Rientjes  
View profile  
 More options Nov 14 2012, 5:10 pm
Newsgroups: linux.kernel
From: David Rientjes <rient...@google.com>
Date: Wed, 14 Nov 2012 23:10:03 +0100
Local: Wed, Nov 14 2012 5:10 pm
Subject: Re: [PATCH v5 01/11] thp: huge zero page: basic preparation

Acked-by: David Rientjes <rient...@google.com>

Could be __init, but this gets switched over to lazy allocation later in
the series so probably not worth it.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Discussion subject changed to "thp: zap_huge_pmd(): zap huge zero pmd" by David Rientjes
David Rientjes  
View profile  
 More options Nov 14 2012, 5:20 pm
Newsgroups: linux.kernel
From: David Rientjes <rient...@google.com>
Date: Wed, 14 Nov 2012 23:20:02 +0100
Local: Wed, Nov 14 2012 5:20 pm
Subject: Re: [PATCH v5 02/11] thp: zap_huge_pmd(): zap huge zero pmd

On Wed, 7 Nov 2012, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>

> We don't have a real page to zap in huge zero page case. Let's just
> clear pmd and remove it from tlb.

s/real/mapped/

> Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>

Acked-by: David Rientjes <rient...@google.com>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Discussion subject changed to "thp: copy_huge_pmd(): copy huge zero page" by David Rientjes
David Rientjes  
View profile  
 More options Nov 14 2012, 5:40 pm
Newsgroups: linux.kernel
From: David Rientjes <rient...@google.com>
Date: Wed, 14 Nov 2012 23:40:02 +0100
Local: Wed, Nov 14 2012 5:40 pm
Subject: Re: [PATCH v5 03/11] thp: copy_huge_pmd(): copy huge zero page

You said in the introduction message in this series that you still allow
splitting of the pmd, so why no check for pmd_trans_splitting() before
this?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Discussion subject changed to "thp: do_huge_pmd_wp_page(): handle huge zero page" by David Rientjes
David Rientjes  
View profile  
 More options Nov 14 2012, 6:10 pm
Newsgroups: linux.kernel
From: David Rientjes <rient...@google.com>
Date: Thu, 15 Nov 2012 00:10:03 +0100
Local: Wed, Nov 14 2012 6:10 pm
Subject: Re: [PATCH v5 04/11] thp: do_huge_pmd_wp_page(): handle huge zero page

On Wed, 7 Nov 2012, Kirill A. Shutemov wrote:
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index fa06804..fe329da 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -516,6 +516,14 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
>  }
>  #endif

> +#ifndef my_zero_pfn
> +static inline unsigned long my_zero_pfn(unsigned long addr)
> +{
> +  extern unsigned long zero_pfn;

I don't think you should be declaring this inside an inlined function, you
probably should be protecting the declarations of the variable and the
function instead.  Perhaps by CONFIG_MMU?

Umm, this is a copy and paste of pgtable_trans_huge_withdraw() from the
generic page table handling.  Why can't you reuse that and support (and/or
modify) the s390 and sparc code?

This whole function is extremely similar to the implementation of
do_huge_pmd_wp_page_fallback(), there really is no way to fold the two?  
Typically in cases like this it's helpful to split out different logical
segments of a function into smaller functions that would handle both  
page and !page accordingly.

This could all use a minor restructuring to make it much more cleaner,
perhaps by extracting the page_mapcount(page) == 1 case to be a separate
function that deals with non-copying writes?

Couldn't this be an "else" instead?

> +                  VM_BUG_ON(!PageHead(page));
> +                  page_remove_rmap(page);
> +                  put_page(page);
> +          }
>            ret |= VM_FAULT_WRITE;
>    }
>    spin_unlock(&mm->page_table_lock);

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Discussion subject changed to "thp: change_huge_pmd(): keep huge zero page write-protected" by David Rientjes
David Rientjes  
View profile  
 More options Nov 14 2012, 6:20 pm
Newsgroups: linux.kernel
From: David Rientjes <rient...@google.com>
Date: Thu, 15 Nov 2012 00:20:01 +0100
Local: Wed, Nov 14 2012 6:20 pm
Subject: Re: [PATCH v5 05/11] thp: change_huge_pmd(): keep huge zero page write-protected

On Wed, 7 Nov 2012, Kirill A. Shutemov wrote:
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d767a7c..05490b3 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1259,6 +1259,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>            pmd_t entry;
>            entry = pmdp_get_and_clear(mm, addr, pmd);
>            entry = pmd_modify(entry, newprot);
> +          if (is_huge_zero_pmd(entry))
> +                  entry = pmd_wrprotect(entry);
>            set_pmd_at(mm, addr, pmd, entry);
>            spin_unlock(&vma->vm_mm->page_table_lock);
>            ret = 1;

Nack, this should be handled in pmd_modify().
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Discussion subject changed to "Introduce huge zero page" by Alan Cox
Alan Cox  
View profile  
 More options Nov 14 2012, 6:20 pm
Newsgroups: linux.kernel
From: Alan Cox <a...@lxorguk.ukuu.org.uk>
Date: Thu, 15 Nov 2012 00:20:01 +0100
Local: Wed, Nov 14 2012 6:20 pm
Subject: Re: [PATCH v5 00/11] Introduce huge zero page

> I'm still a bit concerned over the possibility that some workloads will
> cause a high-frequency free/alloc/memset cycle on that huge zero page.
> We'll see how it goes...

That is easy enough to fix - we can delay the freeing by a random time or
until memory pressure is applied.

Alan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Discussion subject changed to "thp: implement splitting pmd for huge zero page" by David Rientjes
David Rientjes  
View profile  
 More options Nov 14 2012, 6:30 pm
Newsgroups: linux.kernel
From: David Rientjes <rient...@google.com>
Date: Thu, 15 Nov 2012 00:30:02 +0100
Local: Wed, Nov 14 2012 6:30 pm
Subject: Re: [PATCH v5 07/11] thp: implement splitting pmd for huge zero page

This entire function duplicates other code in mm/huge_memory.c which gives
even more incentive into breaking do_huge_pmd_wp_zero_page_fallback() into
logical helper functions and reusing them for both page and !page.  
Duplicating all this code throughout the thp code just becomes a
maintenance nightmare down the road.

address or haddr?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Discussion subject changed to "thp: change split_huge_page_pmd() interface" by David Rientjes
David Rientjes  
View profile  
 More options Nov 14 2012, 6:30 pm
Newsgroups: linux.kernel
From: David Rientjes <rient...@google.com>
Date: Thu, 15 Nov 2012 00:30:01 +0100
Local: Wed, Nov 14 2012 6:30 pm
Subject: Re: [PATCH v5 06/11] thp: change split_huge_page_pmd() interface

Why not be consistent and make this split_huge_page_pmd_mm(mm, pmd, addr)?

> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 90c63f9..291a0d1 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -643,7 +643,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
>    spinlock_t *ptl;
>    struct page *page;

> -  split_huge_page_pmd(walk->mm, pmd);
> +  split_huge_page_pmd(vma, addr, pmd);

Ah, it's because the change to the documentation is wrong: the format is
actually split_huge_page_pmd(vma, addr, pmd).

Just do

        struct mm_struct *mm = vma->vm_mm;

here and it makes everything else simpler.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Discussion subject changed to "thp: setup huge zero page on non-write page fault" by David Rientjes
David Rientjes  
View profile  
 More options Nov 14 2012, 6:40 pm
Newsgroups: linux.kernel
From: David Rientjes <rient...@google.com>
Date: Thu, 15 Nov 2012 00:40:01 +0100
Local: Wed, Nov 14 2012 6:40 pm
Subject: Re: [PATCH v5 08/11] thp: setup huge zero page on non-write page fault

On Wed, 7 Nov 2012, Kirill A. Shutemov wrote:
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index f36bc7d..41f05f1 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -726,6 +726,16 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
>                    return VM_FAULT_OOM;
>            if (unlikely(khugepaged_enter(vma)))
>                    return VM_FAULT_OOM;
> +          if (!(flags & FAULT_FLAG_WRITE)) {
> +                  pgtable_t pgtable;
> +                  pgtable = pte_alloc_one(mm, haddr);
> +                  if (unlikely(!pgtable))
> +                          goto out;

No use in retrying, just return VM_FAULT_OOM.

> +                  spin_lock(&mm->page_table_lock);
> +                  set_huge_zero_page(pgtable, mm, vma, haddr, pmd);
> +                  spin_unlock(&mm->page_table_lock);
> +                  return 0;
> +          }
>            page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
>                                      vma, haddr, numa_node_id(), 0);
>            if (unlikely(!page)) {

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Discussion subject changed to "thp: lazy huge zero page allocation" by David Rientjes
David Rientjes  
View profile  
 More options Nov 14 2012, 6:40 pm
Newsgroups: linux.kernel
From: David Rientjes <rient...@google.com>
Date: Thu, 15 Nov 2012 00:40:02 +0100
Local: Wed, Nov 14 2012 6:40 pm
Subject: Re: [PATCH v5 09/11] thp: lazy huge zero page allocation

On Wed, 7 Nov 2012, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>

> Instead of allocating huge zero page on hugepage_init() we can postpone it
> until first huge zero page map. It saves memory if THP is not in use.

Is it worth the branch on every non-write pagefault after that?  The
unlikely() is not going to help on x86.  If thp is enabled in your
.config (which isn't the default), then I think it's better to just
allocate the zero huge page once and avoid any branches after that to
lazily allocate it.  (Or do it only when thp is set to "madvise" or
"always" if booting with transparent_hugepage=never.)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Discussion subject changed to "Introduce huge zero page" by Andrew Morton
Andrew Morton  
View profile  
 More options Nov 14 2012, 6:40 pm
Newsgroups: linux.kernel
From: Andrew Morton <a...@linux-foundation.org>
Date: Thu, 15 Nov 2012 00:40:02 +0100
Local: Wed, Nov 14 2012 6:40 pm
Subject: Re: [PATCH v5 00/11] Introduce huge zero page
On Wed, 14 Nov 2012 23:20:13 +0000

Alan Cox <a...@lxorguk.ukuu.org.uk> wrote:
> > I'm still a bit concerned over the possibility that some workloads will
> > cause a high-frequency free/alloc/memset cycle on that huge zero page.
> > We'll see how it goes...

> That is easy enough to fix - we can delay the freeing by a random time or
> until memory pressure is applied.

The current code does the latter, by freeing the page via a
"slab"-shrinker callback.

But I do suspect that with the right combination of use/unuse and
memory pressure, we could still get into the high-frequency scenario.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Discussion subject changed to "thp: implement refcounting for huge zero page" by David Rientjes
David Rientjes  
View profile  
 More options Nov 14 2012, 6:50 pm
Newsgroups: linux.kernel
From: David Rientjes <rient...@google.com>
Date: Thu, 15 Nov 2012 00:50:01 +0100
Local: Wed, Nov 14 2012 6:50 pm
Subject: Re: [PATCH v5 10/11] thp: implement refcounting for huge zero page

Eek, this is disappointing that we need to check a refcount before
referencing the zero huge page and it obviously shows in your benchmark
(which I consider 1% to be significant given the alternative is 2MB of
memory for a system where thp was enabled to be on).  I think it would be
much better to simply allocate and reference the zero huge page locklessly
when thp is enabled to be either "madvise" or "always", i.e. allocate it
when enabled.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Messages 1 - 25 of 70   Newer >
« Back to Discussions « Newer topic     Older topic »