[PATCH 0/12] Memory Compaction v2r12

Mel Gorman

Feb 12, 2010, 7:10:02 AM

Changelog since V1
o Update help blurb on CONFIG_MIGRATION
o Max unusable free space index is 100, not 1000
o Move blockpfn forward properly during compaction
o Cleanup CONFIG_COMPACTION vs CONFIG_MIGRATION confusion
o Permissions on /proc and /sys files should be 0200
o Reduce verbosity
o Compact all nodes when triggered via /proc
o Add per-node compaction via sysfs
o Move defer_compaction out-of-line
o Fix lock oddities in rmap_walk_anon
o Add documentation

===== CUT HERE =====

This patchset is a memory compaction mechanism that reduces external
fragmentation by moving GFP_MOVABLE pages into a smaller number of
pageblocks. The term "compaction" was chosen as there are a number of
mechanisms that are not mutually exclusive that can be used to defragment
memory. For example, lumpy reclaim is a form of defragmentation as was slub
"defragmentation" (really a form of targeted reclaim). Hence, this is called
"compaction" to distinguish it from other forms of defragmentation.

In this implementation, a full compaction run involves two scanners operating
within a zone - a migration and a free scanner. The migration scanner
starts at the beginning of a zone and finds all movable pages within one
pageblock_nr_pages-sized area and isolates them on a migratepages list. The
free scanner begins at the end of the zone and searches on a per-area
basis for enough free pages to migrate all the pages on the migratepages
list. As each area is respectively migrated or exhausted of free pages,
the scanners are advanced one area. A compaction run completes within a
zone when the two scanners meet.
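
The two-scanner walk described above can be sketched as a small userspace
model. This is only an illustration and not code from the series: struct
toy_block and toy_compact_zone() are hypothetical names, each pageblock is
reduced to a count of movable and free pages, and unmovable pages, watermarks
and migration failures are all ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model: one entry per pageblock. */
struct toy_block {
	unsigned int movable;	/* in-use pages that could be migrated */
	unsigned int free;	/* free pages in this block */
};

/*
 * Run one compaction pass over a toy zone and return the number of pages
 * moved. The migration scanner walks up from block 0, the free scanner
 * walks down from the last block, and the pass ends when they meet.
 */
unsigned long toy_compact_zone(struct toy_block *blk, size_t nr)
{
	size_t migrate = 0;	/* migration scanner position */
	size_t free_scan;	/* free scanner position */
	unsigned long moved = 0;

	assert(blk != NULL || nr == 0);
	if (nr < 2)
		return 0;
	free_scan = nr - 1;

	while (migrate < free_scan) {
		unsigned int n;

		if (blk[migrate].movable == 0) {
			migrate++;	/* source block fully migrated */
			continue;
		}
		if (blk[free_scan].free == 0) {
			free_scan--;	/* target block has no free pages left */
			continue;
		}

		/* Move as many pages as the target block can take. */
		n = blk[migrate].movable;
		if (n > blk[free_scan].free)
			n = blk[free_scan].free;

		blk[migrate].movable -= n;
		blk[migrate].free += n;
		blk[free_scan].free -= n;
		blk[free_scan].movable += n;
		moved += n;
	}
	return moved;
}
```

In this model the low blocks end up entirely free, which is the property a
high-order allocation needs.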

This method is a bit primitive but is easy to understand. Greater
sophistication would require maintenance of counters on a per-pageblock
basis, which would have a big impact on allocator fast-paths. Slowing the
fast-paths to improve compaction is a poor trade-off.

It also does not try to relocate virtually contiguous pages to be physically
contiguous. However, assuming transparent hugepages were in use, a
hypothetical khugepaged might reuse compaction code to isolate free pages,
split them and relocate userspace pages for promotion.

Memory compaction can be triggered in one of three ways. It may be triggered
explicitly by writing any value to /proc/sys/vm/compact_memory, which compacts
all of memory. It can be triggered on a per-node basis by writing any
value to /sys/devices/system/node/nodeN/compact where N is the node ID to
be compacted. Finally, when a process fails to allocate a high-order page, it
may compact memory in an attempt to satisfy the allocation instead of entering
direct reclaim. Explicit compaction does not finish until the two scanners
meet, while direct compaction ends as soon as a suitable page becomes
available that would meet watermarks.
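
A minimal sketch of the two explicit triggers from userspace, assuming a
kernel with this series applied and root privileges. The two paths are the
ones named above; write_trigger(), node_compact_path(), compact_all_memory()
and compact_one_node() are hypothetical helper names.

```c
#include <assert.h>
#include <stdio.h>

/* Write "1" to a trigger file. Returns 0 on success, -1 on any error. */
int write_trigger(const char *path)
{
	FILE *f;
	int ok;

	assert(path != NULL);
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = (fputs("1\n", f) >= 0);
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/* Build the per-node sysfs path; returns what snprintf() returns. */
int node_compact_path(int nid, char *buf, size_t len)
{
	return snprintf(buf, len,
			"/sys/devices/system/node/node%d/compact", nid);
}

/* Compact all of memory via the proc trigger. */
int compact_all_memory(void)
{
	return write_trigger("/proc/sys/vm/compact_memory");
}

/* Compact a single node via its sysfs trigger. */
int compact_one_node(int nid)
{
	char path[64];
	int n = node_compact_path(nid, path, sizeof(path));

	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	return write_trigger(path);
}
```

Any value works as the written data; "1" is used here only by convention.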

The series is in 12 patches

Patch 1 adds documentation on /proc/pagetypeinfo which is extended later
in the series
Patch 2 allows CONFIG_MIGRATION to be set without CONFIG_NUMA
Patch 3 exports a "unusable free space index" via /proc/pagetypeinfo. It's
a measure of external fragmentation that takes the size of the
allocation request into account. It can also be calculated from
userspace so can be dropped if requested
Patch 4 exports a "fragmentation index" which only has meaning when an
allocation request fails. It determines if an allocation failure
would be due to a lack of memory or external fragmentation.
Patch 5 is the compaction mechanism although it's unreachable at this point
Patch 6 adds a means of compacting all of memory with a proc trigger
Patch 7 adds a means of compacting a specific node with a sysfs trigger
Patch 8 adds "direct compaction" before "direct reclaim" if it is
determined there is a good chance of success.
Patch 9 temporarily disables compaction if an allocation failure occurs
after compaction.
Patches 10 and 11 address two race conditions within rmap_walk_anon where the
VMAs or anon_vma can disappear unexpectedly due to the way locks
are acquired. It's not clear why it was ever safe although the
strongest possibility is that currently processes migrated only
their own pages where the anon_vma and VMAs would be guaranteed to
exist during migration.
Patch 12 is disturbing. It only occurred on ppc64 but it looks like a
use-after-free race. It's probably something to do with locking
around page migration but a few more eyes looking at it before
I start really digging would be helpful.
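
For Patch 3, the remark that the unusable free space index can also be
calculated from userspace can be sketched as follows. The formula is an
assumption because Patch 3 itself is not quoted in this thread: it takes the
share of free pages that sit in blocks too small for the request and scales
it to 0..1000. The per-order free counts would come from /proc/buddyinfo.
toy_unusable_free_index() and TOY_MAX_ORDER are hypothetical names.

```c
#include <assert.h>

#define TOY_MAX_ORDER 11	/* orders 0..10, as in the tables in this thread */

/*
 * nr_free[o] is the number of free blocks of order o in one zone.
 * Returns 0 when every free page is usable for the request and 1000
 * when none of them are (assumed scale, see the text above).
 */
int toy_unusable_free_index(const unsigned long nr_free[TOY_MAX_ORDER],
			    unsigned int order)
{
	unsigned long free_pages = 0, suitable_pages = 0;
	unsigned int o;

	assert(order < TOY_MAX_ORDER);
	for (o = 0; o < TOY_MAX_ORDER; o++) {
		unsigned long pages = nr_free[o] << o;

		free_pages += pages;
		if (o >= order)
			suitable_pages += pages;
	}

	/* Assumption: with no free memory at all, everything is unusable. */
	if (free_pages == 0)
		return 1000;

	return (int)((free_pages - suitable_pages) * 1000 / free_pages);
}
```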

Testing of compaction was in three stages. For the test, debugging, preempt,
the sleep watchdog and lockdep were all enabled but nothing nasty popped
out. min_free_kbytes was tuned as recommended by hugeadm to help fragmentation
avoidance and high-order allocations. It was only tested on X86-64 due to
the lack of availability of an X86 and PPC64 test machine for the moment.

The first test represents one of the easiest cases that can be faced for
lumpy reclaim or memory compaction.

1. Machine freshly booted and configured for hugepage usage with
a) hugeadm --create-global-mounts
b) hugeadm --pool-pages-max DEFAULT:8G
c) hugeadm --set-recommended-min_free_kbytes
d) hugeadm --set-recommended-shmmax

The min_free_kbytes here is important. Anti-fragmentation works best
when pageblocks don't mix. hugeadm knows how to calculate a value that
will significantly reduce the worst of external-fragmentation-related
events as reported by the mm_page_alloc_extfrag tracepoint.

2. Load up memory
a) Start updatedb
b) Create in parallel X files of pagesize*128 in size. Wait
until files are created. By parallel, I mean that 4096 instances
of dd were launched, one after the other using &. The crude
objective is to mix filesystem metadata allocations with
the buffer cache.
c) Delete every second file so that pageblocks are likely to
have holes
d) kill updatedb if it's still running

At this point, the system is quiet, memory is full but it's full with
clean filesystem metadata and clean buffer cache that is unmapped.
This is readily migrated or discarded so you'd expect lumpy reclaim
to have no significant advantage over compaction but this is at
the POC stage.

3. In increments, attempt to allocate 5% of memory as hugepages.
Measure how long it took, how successful it was, how many
direct reclaims took place and how many compactions. Note
the compaction figures might not fully add up as compactions
can take place for orders other than the hugepage size

X86-64
                               vanilla    compaction
Final page count:                  896           898 (attempted 1002)
Total pages reclaimed:          131419         42851
Total blocks compacted:              0          1474
Total compact pages alloced:         0           265

Compaction allocated slightly more pages but reclaimed a lot less - 88568
fewer pages or approximately 346MB worth of IO.
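
The arithmetic behind those figures, as a quick check. The 4096-byte page
size is an assumption (the usual x86-64 base page size) and pages_to_mib()
is a hypothetical helper.

```c
#include <assert.h>

/* Convert a page count to MiB, rounded to the nearest whole MiB. */
unsigned long pages_to_mib(unsigned long pages, unsigned long page_size)
{
	unsigned long long bytes;

	assert(page_size != 0);
	bytes = (unsigned long long)pages * page_size;

	/* Add half a MiB before shifting so the result rounds to nearest. */
	return (unsigned long)((bytes + (1ULL << 19)) >> 20);
}
```

With 131419 - 42851 = 88568 fewer pages reclaimed, pages_to_mib(88568, 4096)
gives 346.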

PPC64
                               vanilla    compaction
Final page count:                   95            95 (attempted 110)
Total pages reclaimed:          131419         42851
Total blocks compacted:              0          1474
Total compact pages alloced:         0           265

Similar to X86-64. No more huge pages were allocated but a lot less was
reclaimed - about 345MB in this case.

The second tests were all performance related - kernbench, netperf, iozone
and sysbench. None showed anything too remarkable.

The last test was a high-order allocation stress test. Many kernel compiles
are started to fill memory with a pressured mix of kernel and movable
allocations. During this, an attempt is made to allocate 90% of memory
as huge pages - one at a time with small delays between attempts to avoid
flooding the IO queue. Funnily, previous tests would have attempted 100%
of memory but compaction pushed up the allocation success rates just enough
that the machine would really go OOM.

                                           vanilla    compaction
Percentage of request allocated X86-64       94.00         97.00
Percentage of request allocated PPC64        67.00         84.00

Compaction had slightly higher success rates on X86-64 but helped
significantly on PPC64 with the much larger huge pages and greater opportunity
for races between direct reclaimers and page allocators. The main impact
is expected to be in latencies.

This link shows the mean latency between allocation attempts as time goes
by. The Y axis is the average latency and the X axis is the allocation
attempt (whether it succeeded or failed). Three kernels are shown: vanilla
is the 2.6.33-rc6 kernel, compaction-v2r12 is this series of patches and
compaction-disabled is this series of patches with CONFIG_COMPACTION not
set. In the graphs, hydra is the x86-64 machine and powyah is the ppc64
machine.

http://www.csn.ul.ie/~mel/postings/compaction-20100212/highalloc-interlatency-hydra-compaction-v2r12-mean.ps

The vanilla and compaction-disabled kernels were roughly similar. The
fact that compaction-disabled started with lower latencies is just a
coincidence. The nature of the test means that luck is a factor. While
the overall success rates between test runs are repeatable, the timings
generally are not. With compaction enabled though, the latencies remain
very low until almost 50% of the allocation requests are made. This lower
latency when memory is available is consistent. At that point, lumpy reclaim
presumably starts being used and latencies increase.

http://www.csn.ul.ie/~mel/postings/compaction-20100212/highalloc-interlatency-powyah-compaction-v2r12-mean.ps

Again, the vanilla and compaction-disabled kernels are roughly similar. With
compaction, latencies remain low and more successful allocations are made.

While the average latencies are good, the standard deviation is also
interesting;

http://www.csn.ul.ie/~mel/postings/compaction-20100212/highalloc-interlatency-hydra-compaction-v2r12-stddev.ps
http://www.csn.ul.ie/~mel/postings/compaction-20100212/highalloc-interlatency-poaysh-compaction-v2r12-stddev.ps

Without compaction, there are very large variances between allocation
attempts. With compaction, they are all steadily low variances until lumpy
reclaim starts being used.

Overall, functional testing did not show up any problems and the performance
is as expected. However, the three patches related to the page migration
core need careful review to determine why they are necessary at all.

The next stage is figuring out what to do about the rmap_walk_anon VMA
problem and whether the set is a merge candidate. If it is not, what
additional work is required, or is the concept itself unacceptable? Any
comments?


Mel Gorman

Feb 12, 2010, 7:10:03 AM

Fragmentation index is a value that makes sense when an allocation of a
given size would fail. The index indicates whether an allocation failure is
due to a lack of memory (values towards 0) or due to external fragmentation
(value towards 1). For the most part, the huge page size will be the size
of interest but not necessarily so it is exported on a per-order and per-zone
basis via /proc/pagetypeinfo.

The index is normally calculated as a value between 0 and 1 which is
obviously unsuitable within the kernel. Instead, the first three decimal
places are used as a value between 0 and 1000 for an integer approximation.

Signed-off-by: Mel Gorman <m...@csn.ul.ie>
---
Documentation/filesystems/proc.txt | 11 ++++++
mm/vmstat.c | 63 ++++++++++++++++++++++++++++++++++++
2 files changed, 74 insertions(+), 0 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 0968a81..06bf53c 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -618,6 +618,10 @@ Unusable free space index at order
Node 0, zone DMA 0 0 0 2 6 18 34 67 99 227 485
Node 0, zone DMA32 0 0 1 2 4 7 10 17 23 31 34

+Fragmentation index at order
+Node 0, zone DMA -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
+Node 0, zone DMA32 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
+
Number of blocks type Unmovable Reclaimable Movable Reserve Isolate
Node 0, zone DMA 2 0 5 1 0
Node 0, zone DMA32 41 6 967 2 0
@@ -639,6 +643,13 @@ value between 0 and 1000. The higher the value, the more of free memory is
unusable and by implication, the worse the external fragmentation is. The
percentage of unusable free memory can be found by dividing this value by 10.

+The fragmentation index, is only meaningful if an allocation would fail and
+indicates what the failure is due to. A value of -1 such as in the example
+states that the allocation would succeed. If it would fail, the value is
+between 0 and 1000. A value tending towards 0 implies the allocation failed
+due to a lack of memory. A value tending towards 1000 implies it failed
+due to external fragmentation.
+
If min_free_kbytes has been tuned correctly (recommendations made by hugeadm
from libhugetlbfs http://sourceforge.net/projects/libhugetlbfs/), one can
make an estimate of the likely number of huge pages that can be allocated
diff --git a/mm/vmstat.c b/mm/vmstat.c
index d05d610..e2d0cc1 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -494,6 +494,35 @@ static void fill_contig_page_info(struct zone *zone,
}

/*
+ * A fragmentation index only makes sense if an allocation of a requested
+ * size would fail. If that is true, the fragmentation index indicates
+ * whether external fragmentation or a lack of memory was the problem.
+ * The value can be used to determine if page reclaim or compaction
+ * should be used
+ */
+int fragmentation_index(struct zone *zone,
+ unsigned int order,
+ struct contig_page_info *info)
+{
+ unsigned long requested = 1UL << order;
+
+ if (!info->free_blocks_total)
+ return 0;
+
+ /* Fragmentation index only makes sense when a request would fail */
+ if (info->free_blocks_suitable)
+ return -1;
+
+ /*
+ * Index is between 0 and 1 so return within 3 decimal places
+ *
+ * 0 => allocation would fail due to lack of memory
+ * 1 => allocation would fail due to fragmentation
+ */
+ return 1000 - ( (1000+(info->free_pages * 1000 / requested)) / info->free_blocks_total);
+}
+
+/*
* Return an index indicating how much of the available free memory is
* unusable for an allocation of the requested size.
*/
@@ -516,6 +545,39 @@ static int unusable_free_index(struct zone *zone,

}

+static void pagetypeinfo_showfragmentation_print(struct seq_file *m,
+ pg_data_t *pgdat, struct zone *zone)
+{
+ unsigned int order;
+
+ /* Alloc on stack as interrupts are disabled for zone walk */
+ struct contig_page_info info;
+
+ seq_printf(m, "Node %4d, zone %8s %19s",
+ pgdat->node_id,
+ zone->name, " ");
+ for (order = 0; order < MAX_ORDER; ++order) {
+ fill_contig_page_info(zone, order, &info);
+ seq_printf(m, "%6d ", fragmentation_index(zone, order, &info));
+ }
+
+ seq_putc(m, '\n');
+}
+
+/*
+ * Display fragmentation index for orders that allocations would fail for
+ * XXX: Could be a lot more efficient, but it's not a critical path
+ */
+static int pagetypeinfo_showfragmentation(struct seq_file *m, void *arg)
+{
+ pg_data_t *pgdat = (pg_data_t *)arg;
+
+ seq_printf(m, "\nFragmentation index at order\n");
+ walk_zones_in_node(m, pgdat, pagetypeinfo_showfragmentation_print);
+
+ return 0;
+}
+
static void pagetypeinfo_showunusable_print(struct seq_file *m,
pg_data_t *pgdat, struct zone *zone)
{
@@ -657,6 +719,7 @@ static int pagetypeinfo_show(struct seq_file *m, void *arg)
seq_putc(m, '\n');
pagetypeinfo_showfree(m, pgdat);
pagetypeinfo_showunusable(m, pgdat);
+ pagetypeinfo_showfragmentation(m, pgdat);
pagetypeinfo_showblockcount(m, pgdat);

return 0;
--
1.6.5
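
A userspace sketch of the calculation in the patch above, useful for trying
out values by hand. The return expression mirrors the patch. How free_pages,
free_blocks_total and free_blocks_suitable are derived from per-order free
counts is an assumption about fill_contig_page_info(), which appears only as
diff context here. toy_fragmentation_index() and TOY_MAX_ORDER are
hypothetical names.

```c
#include <assert.h>

#define TOY_MAX_ORDER 11	/* orders 0..10, as in the tables in this thread */

/*
 * nr_free[o] is the number of free blocks of order o in one zone.
 * Returns -1 if a suitable block exists, 0 if there is no free memory,
 * otherwise the index scaled by 1000 as in the patch.
 */
int toy_fragmentation_index(const unsigned long nr_free[TOY_MAX_ORDER],
			    unsigned int order)
{
	unsigned long requested;
	unsigned long free_pages = 0;
	unsigned long free_blocks_total = 0;
	unsigned long free_blocks_suitable = 0;
	unsigned int o;

	assert(order < TOY_MAX_ORDER);
	requested = 1UL << order;

	/* Assumed equivalent of fill_contig_page_info() */
	for (o = 0; o < TOY_MAX_ORDER; o++) {
		free_blocks_total += nr_free[o];
		free_pages += nr_free[o] << o;
		if (o >= order)
			free_blocks_suitable += nr_free[o] << (o - order);
	}

	if (!free_blocks_total)
		return 0;

	/* Fragmentation index only makes sense when a request would fail */
	if (free_blocks_suitable)
		return -1;

	return 1000 - (int)((1000 + free_pages * 1000 / requested) /
			    free_blocks_total);
}
```

For example, 64 free order-0 pages and an order-3 request give
1000 - (1000 + 8000) / 64 = 860, which leans towards external fragmentation
rather than a lack of memory.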

Mel Gorman

Feb 12, 2010, 7:10:03 AM

The memory compaction patches add details to pagetypeinfo that are not
obvious and need to be documented. In preparation for this, document
what is already in /proc/pagetypeinfo.

Signed-off-by: Mel Gorman <m...@csn.ul.ie>
---

Documentation/filesystems/proc.txt | 45 +++++++++++++++++++++++++++++++++++-
1 files changed, 44 insertions(+), 1 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 0d07513..1829dfb 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -430,6 +430,7 @@ Table 1-5: Kernel info in /proc
modules List of loaded modules
mounts Mounted filesystems
net Networking info (see text)
+ pagetypeinfo Additional page allocator information (see text) (2.5)
partitions Table of partitions known to the system
pci Deprecated info of PCI bus (new way -> /proc/bus/pci/,
decoupled by lspci (2.4)
@@ -584,7 +585,7 @@ Node 0, zone DMA 0 4 5 4 4 3 ...
Node 0, zone Normal 1 0 0 1 101 8 ...
Node 0, zone HighMem 2 0 0 1 1 0 ...

-Memory fragmentation is a problem under some workloads, and buddyinfo is a
+External fragmentation is a problem under some workloads, and buddyinfo is a
useful tool for helping diagnose these problems. Buddyinfo will give you a
clue as to how big an area you can safely allocate, or why a previous
allocation failed.
@@ -594,6 +595,48 @@ available. In this case, there are 0 chunks of 2^0*PAGE_SIZE available in
ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
available in ZONE_NORMAL, etc...

+More information relevant to external fragmentation can be found in
+pagetypeinfo.
+
+> cat /proc/pagetypeinfo
+Page block order: 9
+Pages per block: 512
+
+Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
+Node 0, zone DMA, type Unmovable 0 0 0 1 1 1 1 1 1 1 0
+Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
+Node 0, zone DMA, type Movable 1 1 2 1 2 1 1 0 1 0 2
+Node 0, zone DMA, type Reserve 0 0 0 0 0 0 0 0 0 1 0
+Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0
+Node 0, zone DMA32, type Unmovable 103 54 77 1 1 1 11 8 7 1 9
+Node 0, zone DMA32, type Reclaimable 0 0 2 1 0 0 0 0 1 0 0
+Node 0, zone DMA32, type Movable 169 152 113 91 77 54 39 13 6 1 452
+Node 0, zone DMA32, type Reserve 1 2 2 2 2 0 1 1 1 1 0
+Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0
+
+Number of blocks type Unmovable Reclaimable Movable Reserve Isolate
+Node 0, zone DMA 2 0 5 1 0
+Node 0, zone DMA32 41 6 967 2 0
+
+Fragmentation avoidance in the kernel works by grouping pages of different
+migrate types into the same contiguous regions of memory called page blocks.
+A page block is typically the size of the default hugepage size e.g. 2MB on
+X86-64. By keeping pages grouped based on their ability to move, the kernel
+can reclaim pages within a page block to satisfy a high-order allocation.
+
+The pagetypinfo begins with information on the size of a page block. It
+then gives the same type of information as buddyinfo except broken down
+by migrate-type and finishes with details on how many page blocks of each
+type exist.
+
+If min_free_kbytes has been tuned correctly (recommendations made by hugeadm
+from libhugetlbfs http://sourceforge.net/projects/libhugetlbfs/), one can
+make an estimate of the likely number of huge pages that can be allocated
+at a given point in time. All the "Movable" blocks should be allocatable
+unless memory has been mlock()'d. Some of the Reclaimable blocks should
+also be allocatable although a lot of filesystem metadata may have to be
+reclaimed to achieve this.
+
..............................................................................

meminfo:
--
1.6.5
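
As a rough illustration of the estimate described in the documentation above,
the "Number of blocks type" rows can be parsed and the Movable column read
off as an optimistic upper bound on allocatable huge pages. This is a sketch
only: struct block_counts and parse_block_counts() are hypothetical names,
the row layout is taken from the sample output in the patch, and it assumes
a page block is the size of a huge page.

```c
#include <assert.h>
#include <stdio.h>

/* One row of the "Number of blocks type" table. */
struct block_counts {
	int node;
	char zone[16];
	unsigned long unmovable, reclaimable, movable, reserve, isolate;
};

/*
 * Parse a row such as "Node 0, zone DMA32 41 6 967 2 0".
 * Returns 0 on success and -1 if the line has a different layout,
 * for example the per-migrate-type free page rows.
 */
int parse_block_counts(const char *line, struct block_counts *out)
{
	int n;

	assert(line != NULL && out != NULL);
	n = sscanf(line, "Node %d, zone %15s %lu %lu %lu %lu %lu",
		   &out->node, out->zone, &out->unmovable,
		   &out->reclaimable, &out->movable, &out->reserve,
		   &out->isolate);
	return n == 7 ? 0 : -1;
}
```

For the DMA32 row in the sample, movable is 967, so at most 967 huge pages
could come from that zone before any Reclaimable blocks are considered.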

Mel Gorman

Feb 12, 2010, 7:10:04 AM

Ordinarily when a high-order allocation fails, direct reclaim is entered to
free pages to satisfy the allocation. With this patch, it is determined if
an allocation failed due to external fragmentation instead of low memory
and if so, the calling process will compact until a suitable page is
freed. Compaction by moving pages in memory is considerably cheaper than
paging out to disk and works where there are locked pages or no swap. If
compaction fails to free a page of a suitable size, then reclaim will
still occur.

Direct compaction returns as soon as possible. As each block is compacted,
it is checked if a suitable page has been freed and if so, it returns.

Signed-off-by: Mel Gorman <m...@csn.ul.ie>
---

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 6a2eefd..1cf95e2 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -1,13 +1,25 @@
#ifndef _LINUX_COMPACTION_H
#define _LINUX_COMPACTION_H

-/* Return values for compact_zone() */
+/* Return values for compact_zone() and try_to_compact_pages() */
#define COMPACT_INCOMPLETE 0
-#define COMPACT_COMPLETE 1
+#define COMPACT_PARTIAL 1
+#define COMPACT_COMPLETE 2

#ifdef CONFIG_COMPACTION
extern int sysctl_compaction_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length, loff_t *ppos);
+
+extern int fragmentation_index(struct zone *zone, unsigned int order);
+extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
+ int order, gfp_t gfp_mask, nodemask_t *mask);
+#else
+static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
+ int order, gfp_t gfp_mask, nodemask_t *nodemask)
+{
+ return COMPACT_INCOMPLETE;
+}
+
#endif /* CONFIG_COMPACTION */

#if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index d7f7236..0ea7a38 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -44,6 +44,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
KSWAPD_SKIP_CONGESTION_WAIT,
PAGEOUTRUN, ALLOCSTALL, PGROTATED,
COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
+ COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
#ifdef CONFIG_HUGETLB_PAGE
HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
#endif
diff --git a/mm/compaction.c b/mm/compaction.c
index f5bd5ed..2c88ca9 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -29,6 +29,9 @@ struct compact_control {
unsigned long nr_migratepages; /* Number of pages to migrate */
unsigned long free_pfn; /* isolate_freepages search base */
unsigned long migrate_pfn; /* isolate_migratepages search base */
+
+ unsigned int order; /* order a direct compactor needs */
+ int migratetype; /* MOVABLE, RECLAIMABLE etc */
struct zone *zone;
};

@@ -282,10 +285,31 @@ static void update_nr_listpages(struct compact_control *cc)
static inline int compact_finished(struct zone *zone,
struct compact_control *cc)
{
+ unsigned int order;
+ unsigned long watermark = low_wmark_pages(zone) + (1 << cc->order);
+
/* Compaction run completes if the migrate and free scanner meet */
if (cc->free_pfn <= cc->migrate_pfn)
return COMPACT_COMPLETE;

+ /* Compaction run is not finished if the watermark is not met */
+ if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0))
+ return COMPACT_INCOMPLETE;
+
+ if (cc->order == -1)
+ return COMPACT_INCOMPLETE;
+
+ /* Direct compactor: Is a suitable page free? */
+ for (order = cc->order; order < MAX_ORDER; order++) {
+ /* Job done if page is free of the right migratetype */
+ if (!list_empty(&zone->free_area[order].free_list[cc->migratetype]))
+ return COMPACT_PARTIAL;
+
+ /* Job done if allocation would set block type */
+ if (order >= pageblock_order && zone->free_area[order].nr_free)
+ return COMPACT_PARTIAL;
+ }
+
return COMPACT_INCOMPLETE;
}

@@ -341,6 +365,101 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
return ret;
}

+static inline unsigned long compact_zone_order(struct zone *zone,
+ int order, gfp_t gfp_mask)
+{
+ struct compact_control cc = {
+ .nr_freepages = 0,
+ .nr_migratepages = 0,
+ .order = order,
+ .migratetype = allocflags_to_migratetype(gfp_mask),
+ .zone = zone,
+ };
+ INIT_LIST_HEAD(&cc.freepages);
+ INIT_LIST_HEAD(&cc.migratepages);
+
+ return compact_zone(zone, &cc);
+}
+
+/**
+ * try_to_compact_pages - Direct compact to satisfy a high-order allocation
+ * @zonelist: The zonelist used for the current allocation
+ * @order: The order of the current allocation
+ * @gfp_mask: The GFP mask of the current allocation
+ * @nodemask: The allowed nodes to allocate from
+ *
+ * This is the main entry point for direct page compaction.
+ */
+unsigned long try_to_compact_pages(struct zonelist *zonelist,
+ int order, gfp_t gfp_mask, nodemask_t *nodemask)
+{
+ enum zone_type high_zoneidx = gfp_zone(gfp_mask);
+ int may_enter_fs = gfp_mask & __GFP_FS;
+ int may_perform_io = gfp_mask & __GFP_IO;
+ unsigned long watermark;
+ struct zoneref *z;
+ struct zone *zone;
+ int rc = COMPACT_INCOMPLETE;
+
+ /* Check whether it is worth even starting compaction */
+ if (order == 0 || !may_enter_fs || !may_perform_io)
+ return rc;
+
+ /*
+ * We will not stall if the necessary conditions are not met for
+ * migration but direct reclaim seems to account stalls similarly
+ */
+ count_vm_event(COMPACTSTALL);
+
+ /* Compact each zone in the list */
+ for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
+ nodemask) {
+ int fragindex;
+ int status;
+
+ /*
+ * Watermarks for order-0 must be met for compaction. Note
+ * the 2UL. This is because during migration, copies of
+ * pages need to be allocated and for a short time, the
+ * footprint is higher
+ */
+ watermark = low_wmark_pages(zone) + (2UL << order);
+ if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
+ continue;
+
+ /*
+ * fragmentation index determines if allocation failures are
+ * due to low memory or external fragmentation
+ *
+ * index of -1 implies allocations might succeed depending
+ * on watermarks
+ * index < 500 implies alloc failure is due to lack of memory
+ *
+ * XXX: The choice of 500 is arbitrary. Reinvestigate
+ * appropriately to determine a sensible default.
+ * and what it means when watermarks are also taken
+ * into account. Consider making it a sysctl
+ */
+ fragindex = fragmentation_index(zone, order);
+ if (fragindex >= 0 && fragindex <= 500)
+ continue;
+
+ if (fragindex == -1 && zone_watermark_ok(zone, order, watermark, 0, 0)) {
+ rc = COMPACT_PARTIAL;
+ break;
+ }
+
+ status = compact_zone_order(zone, order, gfp_mask);
+ rc = max(status, rc);
+
+ if (zone_watermark_ok(zone, order, watermark, 0, 0))
+ break;
+ }
+
+ return rc;
+}
+
+
/* Compact all zones within a node */
static int compact_node(int nid)
{
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6d57154..1910b8b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -49,6 +49,7 @@
#include <linux/debugobjects.h>
#include <linux/kmemleak.h>
#include <linux/memory.h>
+#include <linux/compaction.h>
#include <trace/events/kmem.h>

#include <asm/tlbflush.h>
@@ -1728,6 +1729,31 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,

cond_resched();

+ /* Try memory compaction for high-order allocations before reclaim */
+ if (order) {
+ *did_some_progress = try_to_compact_pages(zonelist,
+ order, gfp_mask, nodemask);
+ if (*did_some_progress != COMPACT_INCOMPLETE) {
+ page = get_page_from_freelist(gfp_mask, nodemask,
+ order, zonelist, high_zoneidx,
+ alloc_flags, preferred_zone,
+ migratetype);
+ if (page) {
+ __count_vm_event(COMPACTSUCCESS);
+ return page;
+ }
+
+ /*
+ * It's bad if compaction run occurs and fails.
+ * The most likely reason is that pages exist,
+ * but not enough to satisfy watermarks.
+ */
+ count_vm_event(COMPACTFAIL);
+
+ cond_resched();
+ }
+ }
+
/* We now go into synchronous reclaim */
cpuset_memory_pressure_bump();
p->flags |= PF_MEMALLOC;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f0930ae..8edbe38 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -500,7 +500,7 @@ static void fill_contig_page_info(struct zone *zone,
* The value can be used to determine if page reclaim or compaction
* should be used
*/
-int fragmentation_index(struct zone *zone,
+static int __fragmentation_index(struct zone *zone,
unsigned int order,
struct contig_page_info *info)
{
@@ -522,6 +522,15 @@ int fragmentation_index(struct zone *zone,
return 1000 - ( (1000+(info->free_pages * 1000 / requested)) / info->free_blocks_total);
}

+/* Same as __fragmentation index but allocs contig_page_info on stack */
+int fragmentation_index(struct zone *zone, unsigned int order)
+{
+ struct contig_page_info info;
+
+ fill_contig_page_info(zone, order, &info);
+ return __fragmentation_index(zone, order, &info);
+}
+
/*
* Return an index indicating how much of the available free memory is
* unusable for an allocation of the requested size.
@@ -558,7 +567,7 @@ static void pagetypeinfo_showfragmentation_print(struct seq_file *m,
zone->name, " ");
for (order = 0; order < MAX_ORDER; ++order) {
fill_contig_page_info(zone, order, &info);
- seq_printf(m, "%6d ", fragmentation_index(zone, order, &info));
+ seq_printf(m, "%6d ", __fragmentation_index(zone, order, &info));
}

seq_putc(m, '\n');
@@ -856,6 +865,9 @@ static const char * const vmstat_text[] = {
"compact_blocks_moved",
"compact_pages_moved",
"compact_pagemigrate_failed",
+ "compact_stall",
+ "compact_fail",
+ "compact_success",

#ifdef CONFIG_HUGETLB_PAGE
"htlb_buddy_alloc_success",
--
1.6.5
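
The per-zone decision that try_to_compact_pages() makes in the patch above
can be condensed into a pure function for experimenting with the threshold.
This is a sketch under assumptions, not the kernel code: the watermark checks
are reduced to booleans supplied by the caller, the order-0 early return is
folded into the same function, and enum toy_action, toy_zone_action() and
TOY_FRAGINDEX_THRESHOLD are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_action {
	TOY_SKIP_ZONE,		/* leave this zone to direct reclaim */
	TOY_ALLOC_LIKELY,	/* a suitable page already appears to exist */
	TOY_COMPACT		/* run compaction on this zone */
};

/* The patch marks 500 as arbitrary and a candidate for a sysctl. */
#define TOY_FRAGINDEX_THRESHOLD 500

enum toy_action toy_zone_action(unsigned int order, bool order0_wmark_ok,
				bool order_wmark_ok, int fragindex)
{
	assert(fragindex <= 1000);

	/* Order-0 requests never compact; migration needs spare order-0 pages */
	if (order == 0 || !order0_wmark_ok)
		return TOY_SKIP_ZONE;

	/* A low index means the failure is due to a lack of memory */
	if (fragindex >= 0 && fragindex <= TOY_FRAGINDEX_THRESHOLD)
		return TOY_SKIP_ZONE;

	/* -1 means a suitable block exists; the watermark decides the rest */
	if (fragindex == -1 && order_wmark_ok)
		return TOY_ALLOC_LIKELY;

	return TOY_COMPACT;
}
```

TOY_ALLOC_LIKELY corresponds to the COMPACT_PARTIAL early exit in the patch,
where the allocation is simply retried without compacting.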

Christoph Lameter

Feb 12, 2010, 11:00:02 AM

Reviewed-by: Christoph Lameter <c...@linux-foundation.org>

KOSAKI Motohiro

Feb 16, 2010, 2:10:02 AM

> The memory compaction patches add details to pagetypeinfo that are not
> obvious and need to be documented. In preparation for this, document
> what is already in /proc/pagetypeinfo.
>
> Signed-off-by: Mel Gorman <m...@csn.ul.ie>

Looks nicer.
Reviewed-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>

KOSAKI Motohiro

Feb 16, 2010, 3:00:02 AM

> Fragmentation index is a value that makes sense when an allocation of a
> given size would fail. The index indicates whether an allocation failure is
> due to a lack of memory (values towards 0) or due to external fragmentation
> (value towards 1). For the most part, the huge page size will be the size
> of interest but not necessarily so it is exported on a per-order and per-zone
> basis via /proc/pagetypeinfo.
>
> The index is normally calculated as a value between 0 and 1 which is
> obviously unsuitable within the kernel. Instead, the first three decimal
> places are used as a value between 0 and 1000 for an integer approximation.

Hmmm..

I don't understand why an admin needs to know two metrics (unusable-index
and fragmentation-index). They have very similar meanings and are easily
confused, imho.

Can we make just one user-friendly metric?

Dumb question. I don't understand why this calculation represents the
fragmentation index. Does this have a theoretical background? If yes, can you
please give me a pointer?

KOSAKI Motohiro

Feb 16, 2010, 3:50:03 AM

> > Dumb question. I don't understand why this calculation represents the
> > fragmentation index. Does this have a theoretical background? If yes, can you
> > please give me a pointer?
> >
>
> Yes, there is a theoretical background. It's mostly described in
>
> http://portal.acm.org/citation.cfm?id=1375634.1375641
>
> I have a more updated version but it's not published unfortunately.

OK, thanks. I'll stop rushing in with dumb questions and read it first. I'll
resume reviewing the rest in a few days.

thanks.

Mel Gorman

Feb 16, 2010, 3:50:02 AM

On Tue, Feb 16, 2010 at 04:59:05PM +0900, KOSAKI Motohiro wrote:
> > Fragmentation index is a value that makes sense when an allocation of a
> > given size would fail. The index indicates whether an allocation failure is
> > due to a lack of memory (values towards 0) or due to external fragmentation
> > (value towards 1). For the most part, the huge page size will be the size
> > of interest but not necessarily so it is exported on a per-order and per-zone
> > basis via /proc/pagetypeinfo.
> >
> > The index is normally calculated as a value between 0 and 1 which is
> > obviously unsuitable within the kernel. Instead, the first three decimal
> > places are used as a value between 0 and 1000 for an integer approximation.
>
> Hmmm..
>
> I haven't understand why admin need to know two metrics (unusable-index
> and fragmentation-index). they have very similar meanings and easy confusable
> imho.
>

Because they have different meanings and are used for different things. Unusable
index describes the current system state and is the one most likely to be of
interest to an administrator monitoring this. Fragmentation index tells you
"why" an allocation failed because, arguably, external fragmentation does not
exist until the time of allocation failure.

Fragmentation index is used, for example, to determine in advance whether
compaction is likely to work.
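
To make the distinction concrete, here is a minimal userspace sketch of both
indices computed from per-order free-block counts. It is an illustration, not
the code from this patchset: the function names and the free_counts list are
hypothetical, the formulas are assumed to match the calculation that later
appeared in mm/vmstat.c, and values are shown as fractions rather than as the
scaled integers the kernel would export.

```python
def unusable_free_index(free_counts, order):
    """Fraction of free memory that cannot satisfy a request of `order`.

    free_counts[i] is the number of free blocks of order i (2**i pages each).
    Describes current state; no allocation has to fail for it to be defined.
    """
    free_pages = sum(n << i for i, n in enumerate(free_counts))
    if free_pages == 0:
        return 1.0  # nothing is free, so nothing is usable
    # Pages sitting in blocks that are already large enough for the request.
    usable_pages = sum(n << i for i, n in enumerate(free_counts) if i >= order)
    return (free_pages - usable_pages) / free_pages


def fragmentation_index(free_counts, order):
    """Why would a request of `order` fail? Near 0: low memory. Near 1: fragmentation.

    Returns None when a suitable free block exists, because the index only
    makes sense for a request that would fail (assumed convention for this sketch).
    """
    requested = 1 << order
    free_pages = sum(n << i for i, n in enumerate(free_counts))
    free_blocks_total = sum(free_counts)
    if free_blocks_total == 0:
        return 0.0
    if any(n for i, n in enumerate(free_counts) if i >= order):
        return None
    return 1.0 - (1.0 + free_pages / requested) / free_blocks_total


if __name__ == "__main__":
    # 512 free pages, all of them isolated order-0 pages; request is order 9.
    scattered = [512] + [0] * 10
    print(unusable_free_index(scattered, 9))
    print(fragmentation_index(scattered, 9))
```

In the scattered case there is enough free memory in total for the order-9
request, but none of it is contiguous, so the unusable index is 1.0 and the
fragmentation index is close to 1. That is the situation where compaction,
rather than reclaim, would be expected to help.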

> Can we provide just one user-friendly metric?
>

What do you suggest?

Unusable free space index is easier to understand and can be expressed
as a percentage but fragmentation index is what the kernel is using. I
could hide the fragmentation index altogether if you prefer? I intend to
use it myself but I can always use a debugging patch.

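> Dumb question: I don't understand why this calculation represents the
> fragmentation index. Does it have a theoretical background? If so, can you
> please give me a pointer?
>

Yes, there is a theoretical background. It's mostly described in

http://portal.acm.org/citation.cfm?id=1375634.1375641

I have a more up-to-date version, but unfortunately it is not published.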

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

Mel Gorman
Feb 16, 2010, 10:50:03 AM

This patch adds documentation for /proc/pagetypeinfo.

Signed-off-by: Mel Gorman <m...@csn.ul.ie>
Reviewed-by: Christoph Lameter <c...@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---
 Documentation/filesystems/proc.txt |   45 +++++++++++++++++++++++++++++++++++-
 1 files changed, 44 insertions(+), 1 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 0d07513..1829dfb 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt

Rik van Riel
Feb 16, 2010, 8:50:01 PM

On 02/12/2010 07:00 AM, Mel Gorman wrote:
> Fragmentation index is a value that makes sense when an allocation of a
> given size would fail. The index indicates whether an allocation failure is
> due to a lack of memory (values towards 0) or due to external fragmentation
> (value towards 1). For the most part, the huge page size will be the size
> of interest but not necessarily so it is exported on a per-order and per-zone
> basis via /proc/pagetypeinfo.
>
> The index is normally calculated as a value between 0 and 1 which is
> obviously unsuitable within the kernel. Instead, the first three decimal
> places are used as a value between 0 and 1000 for an integer approximation.
>
> Signed-off-by: Mel Gorman<m...@csn.ul.ie>

Acked-by: Rik van Riel <ri...@redhat.com>
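
The integer approximation described in the quoted changelog can be sketched as
follows. This is only an illustration under stated assumptions: the helper name
and its arguments are hypothetical, and the arithmetic is assumed to mirror the
0 to 1000 scaling mentioned above, using integer division only because kernel
code avoids floating point.

```python
def fragmentation_index_scaled(free_pages, free_blocks_total, requested_pages):
    """Fragmentation index scaled to 0..1000 using integer arithmetic only.

    Assumes the caller has already established that no free block is large
    enough for the request, i.e. that the allocation would fail.
    """
    if free_blocks_total == 0:
        return 0
    # Scale each term by 1000 before dividing so three decimal places survive.
    scaled_ratio = free_pages * 1000 // requested_pages
    return 1000 - (1000 + scaled_ratio) // free_blocks_total


if __name__ == "__main__":
    # 512 isolated free pages, request of 512 contiguous pages.
    print(fragmentation_index_scaled(512, 512, 512))
```

Because both divisions truncate, the scaled result can differ from the exact
fraction by one unit in the last place; for a threshold check that level of
precision is normally sufficient.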

--
All rights reversed.

Rik van Riel
Feb 17, 2010, 11:00:01 PM

On 02/12/2010 07:00 AM, Mel Gorman wrote:

> Ordinarily when a high-order allocation fails, direct reclaim is entered to
> free pages to satisfy the allocation. With this patch, it is determined if
> an allocation failed due to external fragmentation instead of low memory
> and if so, the calling process will compact until a suitable page is
> freed. Compaction by moving pages in memory is considerably cheaper than
> paging out to disk and works where there are locked pages or no swap. If
> compaction fails to free a page of a suitable size, then reclaim will
> still occur.
>
> Direct compaction returns as soon as possible. As each block is compacted,
> it is checked if a suitable page has been freed and if so, it returns.
>
> Signed-off-by: Mel Gorman<m...@csn.ul.ie>

Acked-by: Rik van Riel <ri...@redhat.com>
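
The control flow described in the quoted changelog can be sketched roughly as
below. This is a simplified model, not the allocator code: every callback
(try_alloc, failed_due_to_fragmentation, compact_next_block, direct_reclaim)
is a hypothetical stand-in for kernel internals, and the returned step list
exists only to make the ordering visible.

```python
def allocate_high_order(order, try_alloc, failed_due_to_fragmentation,
                        compact_next_block, direct_reclaim):
    """Return the steps taken, ending in "allocated" or "failed"."""
    steps = []
    if try_alloc(order):
        return ["allocated"]

    if failed_due_to_fragmentation(order):
        # Direct compaction: compact one block at a time and return as soon
        # as a suitably sized page becomes available.
        while compact_next_block():
            steps.append("compact")
            if try_alloc(order):
                steps.append("allocated")
                return steps

    # Either memory is genuinely low or compaction did not free a suitable
    # page, so direct reclaim still occurs.
    direct_reclaim(order)
    steps.append("reclaim")
    steps.append("allocated" if try_alloc(order) else "failed")
    return steps


if __name__ == "__main__":
    compacted = {"blocks": 0}

    def try_alloc(order):
        # Pretend the request succeeds once two blocks have been compacted.
        return compacted["blocks"] >= 2

    def compact_next_block():
        compacted["blocks"] += 1
        return True

    print(allocate_high_order(9, try_alloc, lambda o: True,
                              compact_next_block, lambda o: None))
```

The point of the early return inside the loop is latency: the calling process
pays only for as much compaction as it needs, not for a full zone pass.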

--
All rights reversed.

Minchan Kim
Feb 18, 2010, 10:40:01 AM

On Fri, 2010-02-12 at 12:00 +0000, Mel Gorman wrote:
> Fragmentation index is a value that makes sense when an allocation of a
> given size would fail. The index indicates whether an allocation failure is
> due to a lack of memory (values towards 0) or due to external fragmentation
> (value towards 1). For the most part, the huge page size will be the size
> of interest but not necessarily so it is exported on a per-order and per-zone
> basis via /proc/pagetypeinfo.
>
> The index is normally calculated as a value between 0 and 1 which is
> obviously unsuitable within the kernel. Instead, the first three decimal
> places are used as a value between 0 and 1000 for an integer approximation.
>
> Signed-off-by: Mel Gorman <m...@csn.ul.ie>

Reviewed-by: Minchan Kim <minch...@gmail.com>

As with the previous patch [3/12], why do you keep the "zone" argument?
If you plan to use it in the future, I don't mind. It's just a trivial point.

--
Kind regards,
Minchan Kim