[PATCH 3/6] mm: kswapd: Continue reclaiming for reclaim/compaction if the minimum number of pages have not been reclaimed

Mel Gorman

Aug 7, 2012, 8:40:02 AM

When direct reclaim is running reclaim/compaction, there is a minimum
number of pages it reclaims. As it must be under the low watermark to be
in direct reclaim it has also woken kswapd to do some work. This patch
has kswapd use the same logic as direct reclaim to reclaim a minimum
number of pages so compaction can run later.

Signed-off-by: Mel Gorman <mgo...@suse.de>
---
mm/vmscan.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0cb2593..afdec93 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1701,7 +1701,7 @@ static bool in_reclaim_compaction(struct scan_control *sc)
* calls try_to_compact_zone() that it will have enough free pages to succeed.
* It will give up earlier than that if there is difficulty reclaiming pages.
*/
-static inline bool should_continue_reclaim(struct lruvec *lruvec,
+static bool should_continue_reclaim(struct lruvec *lruvec,
unsigned long nr_reclaimed,
unsigned long nr_scanned,
struct scan_control *sc)
@@ -1768,6 +1768,17 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
}
}

+static inline bool should_continue_reclaim_zone(struct zone *zone,
+ unsigned long nr_reclaimed,
+ unsigned long nr_scanned,
+ struct scan_control *sc)
+{
+ struct mem_cgroup *memcg = mem_cgroup_iter(NULL, NULL, NULL);
+ struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+
+ return should_continue_reclaim(lruvec, nr_reclaimed, nr_scanned, sc);
+}
+
/*
* This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
*/
@@ -2496,8 +2507,10 @@ loop_again:
*/
testorder = order;
if (COMPACTION_BUILD && order &&
- compaction_suitable(zone, order) !=
- COMPACT_SKIPPED)
+ !should_continue_reclaim_zone(zone,
+ nr_soft_reclaimed,
+ sc.nr_scanned - nr_soft_scanned,
+ &sc))
testorder = 0;

if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
--
1.7.9.2


Mel Gorman

Aug 7, 2012, 8:40:02 AM

commit [7db8889a: mm: have order > 0 compaction start off where it left]
introduced a caching mechanism to reduce the amount of work the free page
scanner does in compaction. However, it has a problem. Consider two processes
simultaneously scanning free pages

C
Process A M S F
|---------------------------------------|
Process B M FS

C is zone->compact_cached_free_pfn
S is cc->start_free_pfn
M is cc->migrate_pfn
F is cc->free_pfn

In this diagram, Process A has just reached its migrate scanner, wrapped
around and updated compact_cached_free_pfn accordingly.

Simultaneously, Process B finishes isolating in a block and updates
compact_cached_free_pfn again to the location of its free scanner.

Process A moves to "end_of_zone - one_pageblock" and runs this check

if (cc->order > 0 && (!cc->wrapped ||
zone->compact_cached_free_pfn >
cc->start_free_pfn))
pfn = min(pfn, zone->compact_cached_free_pfn);

compact_cached_free_pfn is above where it started so the free scanner skips
almost the entire space it should have scanned. When there are multiple
processes compacting it can end in a situation where the entire zone is
not being scanned at all. Further, it is possible for two processes to
ping-pong updates to compact_cached_free_pfn, which leaves it at an
essentially random value.
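
To make the failure concrete, here is a minimal userspace sketch of the check
above applied to made-up numbers. The struct names, the helper names and every
pfn value are hypothetical; only the condition itself is copied from the
snippet. Process A started its free scanner at 0x4000 and has wrapped, Process
B has since moved the cached value to 0x6000, and A is about to scan the
pageblock at 0xFE00.

```c
#include <stdbool.h>

/* Hypothetical, cut-down stand-ins for struct zone and struct compact_control. */
struct zone_model {
	unsigned long compact_cached_free_pfn;
};

struct cc_model {
	int order;
	bool wrapped;
	unsigned long start_free_pfn;
};

static unsigned long min_pfn(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* The skip check from isolate_freepages(), applied to a single pfn. */
static unsigned long apply_skip_check(const struct zone_model *zone,
				      const struct cc_model *cc,
				      unsigned long pfn)
{
	if (cc->order > 0 && (!cc->wrapped ||
			      zone->compact_cached_free_pfn >
						cc->start_free_pfn))
		pfn = min_pfn(pfn, zone->compact_cached_free_pfn);
	return pfn;
}

/* Made-up scenario: returns the pfn Process A actually scans next. */
static unsigned long race_example_pfn(void)
{
	/* Process B just moved the cached pfn to its own free scanner. */
	struct zone_model zone = { .compact_cached_free_pfn = 0x6000UL };
	/* Process A started at 0x4000 and has since wrapped around. */
	struct cc_model a = { .order = 3, .wrapped = true,
			      .start_free_pfn = 0x4000UL };

	/* A wants to scan the last pageblock of the zone. */
	return apply_skip_check(&zone, &a, 0xFE00UL);
}
```

With these values the check moves A from 0xFE00 down to 0x6000, so A never
looks at the pageblocks in between.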

Overall, the end result wrecks allocation success rates.

There is not an obvious way around this problem without introducing new
locking and state so this patch takes a different approach.

First, it gets rid of the skip logic. It is not clear that it matters
if two free scanners happen to be in the same block, but with racing updates
it is too easy for a scanner to skip over blocks it should not.

Second, it updates compact_cached_free_pfn in a more limited set of
circumstances.

If a scanner has wrapped, it updates compact_cached_free_pfn to the end
of the zone. Each time a wrapped scanner isolates a page, it
updates compact_cached_free_pfn. The intention is that after
wrapping, the compact_cached_free_pfn will be at the highest
pageblock with free pages when compaction completes.

If a scanner has not wrapped when compaction completes and
compact_cached_free_pfn is set to the end of the zone, initialise
it once.

This is not optimal and it can still race but the compact_cached_free_pfn
will be pointing to or very near a pageblock with free pages.
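
As a rough illustration of the two update conditions this patch adds, the
following userspace sketch replays a made-up sequence of updates. The struct,
the function names and the pfn values are hypothetical; the conditions follow
the hunks in the diff below. It does not model the reset to the end of the
zone that happens when a scanner wraps.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the one struct zone field that matters here. */
struct zone_model {
	unsigned long compact_cached_free_pfn;
};

/* Rule 1: while isolating, only a wrapped scanner moves the cached pfn. */
static void note_isolated(struct zone_model *zone, int order, bool wrapped,
			  unsigned long high_pfn)
{
	if (order > 0 && wrapped)
		zone->compact_cached_free_pfn = high_pfn;
}

/*
 * Rule 2: when a scanner that has not wrapped finishes, it sets the cached
 * pfn only if it still holds the reset value (reset_pfn, the end of the zone).
 */
static void finish_isolation(struct zone_model *zone, int order, bool wrapped,
			     unsigned long high_pfn, unsigned long reset_pfn)
{
	if (order > 0 && !wrapped &&
	    zone->compact_cached_free_pfn == reset_pfn)
		zone->compact_cached_free_pfn = high_pfn;
}

/* Made-up sequence of updates; returns the cached pfn after all of them. */
static unsigned long update_rules_example(void)
{
	const unsigned long reset_pfn = 0x10000UL;
	struct zone_model zone = { .compact_cached_free_pfn = reset_pfn };

	/* First unwrapped scanner to finish initialises the cache. */
	finish_isolation(&zone, 3, false, 0x6000UL, reset_pfn);
	/* A second unwrapped scanner can no longer overwrite it. */
	finish_isolation(&zone, 3, false, 0x3000UL, reset_pfn);
	/* Isolating pages without having wrapped does not touch it either. */
	note_isolated(&zone, 3, false, 0x2000UL);
	return zone.compact_cached_free_pfn;
}
```

In this sequence the cached pfn ends at 0x6000: once it has been initialised,
only a wrapped scanner can move it again.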

Signed-off-by: Mel Gorman <mgo...@suse.de>
---

mm/compaction.c | 54 ++++++++++++++++++++++++++++--------------------------
1 file changed, 28 insertions(+), 26 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index be310f1..df50f73 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -419,6 +419,20 @@ static bool suitable_migration_target(struct page *page)
}

/*
+ * Returns the start pfn of the last page block in a zone. This is the starting
+ * point for full compaction of a zone. Compaction searches for free pages from
+ * the end of each zone, while isolate_freepages_block scans forward inside each
+ * page block.
+ */
+static unsigned long start_free_pfn(struct zone *zone)
+{
+ unsigned long free_pfn;
+ free_pfn = zone->zone_start_pfn + zone->spanned_pages;
+ free_pfn &= ~(pageblock_nr_pages-1);
+ return free_pfn;
+}
+
+/*
* Based on information in the current compact_control, find blocks
* suitable for isolating free pages from and then isolate them.
*/
@@ -457,17 +471,6 @@ static void isolate_freepages(struct zone *zone,
pfn -= pageblock_nr_pages) {
unsigned long isolated;

- /*
- * Skip ahead if another thread is compacting in the area
- * simultaneously. If we wrapped around, we can only skip
- * ahead if zone->compact_cached_free_pfn also wrapped to
- * above our starting point.
- */
- if (cc->order > 0 && (!cc->wrapped ||
- zone->compact_cached_free_pfn >
- cc->start_free_pfn))
- pfn = min(pfn, zone->compact_cached_free_pfn);
-
if (!pfn_valid(pfn))
continue;

@@ -510,7 +513,15 @@ static void isolate_freepages(struct zone *zone,
*/
if (isolated) {
high_pfn = max(high_pfn, pfn);
- if (cc->order > 0)
+
+ /*
+ * If the free scanner has wrapped, update
+ * compact_cached_free_pfn to point to the highest
+ * pageblock with free pages. This reduces excessive
+ * scanning of full pageblocks near the end of the
+ * zone
+ */
+ if (cc->order > 0 && cc->wrapped)
zone->compact_cached_free_pfn = high_pfn;
}
}
@@ -520,6 +531,11 @@ static void isolate_freepages(struct zone *zone,

cc->free_pfn = high_pfn;
cc->nr_freepages = nr_freepages;
+
+ /* If compact_cached_free_pfn is reset then set it now */
+ if (cc->order > 0 && !cc->wrapped &&
+ zone->compact_cached_free_pfn == start_free_pfn(zone))
+ zone->compact_cached_free_pfn = high_pfn;
}

/*
@@ -607,20 +623,6 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
return ISOLATE_SUCCESS;
}

-/*
- * Returns the start pfn of the last page block in a zone. This is the starting
- * point for full compaction of a zone. Compaction searches for free pages from
- * the end of each zone, while isolate_freepages_block scans forward inside each
- * page block.
- */
-static unsigned long start_free_pfn(struct zone *zone)
-{
- unsigned long free_pfn;
- free_pfn = zone->zone_start_pfn + zone->spanned_pages;
- free_pfn &= ~(pageblock_nr_pages-1);
- return free_pfn;
-}
-
static int compact_finished(struct zone *zone,
struct compact_control *cc)
{

Mel Gorman

Aug 7, 2012, 8:40:02 AM

If allocation fails after compaction then compaction may be deferred for
a number of allocation attempts. If there are subsequent failures,
compact_defer_shift is increased to defer for longer periods. This patch
uses that information to scale the number of pages reclaimed with
compact_defer_shift until allocations succeed again.
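
A small worked example of the scaling, as a userspace sketch. The function
name is hypothetical and the plain parameters stand in for sc->order,
zone->compact_order_failed and zone->compact_defer_shift; the arithmetic and
the condition are the ones in the hunk below.

```c
/*
 * Mirrors the arithmetic in the hunk below. The parameters stand in for
 * sc->order, zone->compact_order_failed and zone->compact_defer_shift.
 */
static unsigned long pages_for_compaction(int order, int compact_order_failed,
					  unsigned int compact_defer_shift)
{
	unsigned long pages = 2UL << order;

	/* Same condition as the patch: scale only if the check passes. */
	if (compact_order_failed >= order)
		pages <<= compact_defer_shift;
	return pages;
}
```

For an order-9 request the baseline is 2 << 9 = 1024 pages. If the check
passes and compact_defer_shift has reached 3, the target becomes 8192 pages.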

Signed-off-by: Mel Gorman <mgo...@suse.de>
---

mm/vmscan.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 66e4310..0cb2593 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1708,6 +1708,7 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
{
unsigned long pages_for_compaction;
unsigned long inactive_lru_pages;
+ struct zone *zone;

/* If not in reclaim/compaction mode, stop */
if (!in_reclaim_compaction(sc))
@@ -1741,6 +1742,15 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
* inactive lists are large enough, continue reclaiming
*/
pages_for_compaction = (2UL << sc->order);
+
+ /*
+ * If compaction is deferred for this order then scale the number of
+ * pages reclaimed based on the number of consecutive allocation
+ * failures
+ */
+ zone = lruvec_zone(lruvec);
+ if (zone->compact_order_failed >= sc->order)
+ pages_for_compaction <<= zone->compact_defer_shift;
inactive_lru_pages = get_lru_size(lruvec, LRU_INACTIVE_FILE);
if (nr_swap_pages > 0)
inactive_lru_pages += get_lru_size(lruvec, LRU_INACTIVE_ANON);

Mel Gorman

Aug 7, 2012, 8:40:02 AM

While compaction is moving pages to free up large contiguous blocks for
allocation it races with other allocation requests that may steal these
blocks or break them up. This patch alters direct compaction to capture a
suitable free page as soon as it becomes available to reduce this race. It
uses similar logic to split_free_page() to ensure that watermarks are
still obeyed.

Signed-off-by: Mel Gorman <mgo...@suse.de>
---

include/linux/compaction.h | 4 +--
include/linux/mm.h | 1 +
mm/compaction.c | 71 +++++++++++++++++++++++++++++++++++++-------
mm/internal.h | 1 +
mm/page_alloc.c | 63 +++++++++++++++++++++++++++++----------
5 files changed, 111 insertions(+), 29 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 51a90b7..5673459 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -22,7 +22,7 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
extern int fragmentation_index(struct zone *zone, unsigned int order);
extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *mask,
- bool sync);
+ bool sync, struct page **page);
extern int compact_pgdat(pg_data_t *pgdat, int order);
extern unsigned long compaction_suitable(struct zone *zone, int order);

@@ -64,7 +64,7 @@ static inline bool compaction_deferred(struct zone *zone, int order)
#else
static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *nodemask,
- bool sync)
+ bool sync, struct page **page)
{
return COMPACT_CONTINUE;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b36d08c..0812e86 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -454,6 +454,7 @@ void put_pages_list(struct list_head *pages);

void split_page(struct page *page, unsigned int order);
int split_free_page(struct page *page);
+int capture_free_page(struct page *page, int alloc_order, int migratetype);

/*
* Compound pages have a destructor function. Provide a
diff --git a/mm/compaction.c b/mm/compaction.c
index 95ca967..63af8d2 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -50,6 +50,41 @@ static inline bool migrate_async_suitable(int migratetype)
return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE;
}

+static void compact_capture_page(struct compact_control *cc)
+{
+ unsigned long flags;
+ int mtype;
+
+ if (!cc->page || *cc->page)
+ return;
+
+ /* Speculatively examine the free lists without zone lock */
+ for (mtype = 0; mtype < MIGRATE_PCPTYPES; mtype++) {
+ int order;
+ for (order = cc->order; order < MAX_ORDER; order++) {
+ struct page *page;
+ struct free_area *area;
+ area = &(cc->zone->free_area[order]);
+ if (list_empty(&area->free_list[mtype]))
+ continue;
+
+ /* Take the lock and attempt capture of the page */
+ spin_lock_irqsave(&cc->zone->lock, flags);
+ if (!list_empty(&area->free_list[mtype])) {
+ page = list_entry(area->free_list[mtype].next,
+ struct page, lru);
+ if (capture_free_page(page, cc->order, mtype)) {
+ spin_unlock_irqrestore(&cc->zone->lock,
+ flags);
+ *cc->page = page;
+ return;
+ }
+ }
+ spin_unlock_irqrestore(&cc->zone->lock, flags);
+ }
+ }
+}
+
/*
* Isolate free pages onto a private freelist. Caller must hold zone->lock.
* If @strict is true, will abort returning 0 on any invalid PFNs or non-free
@@ -561,7 +596,6 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,

static int compact_finished(struct zone *zone,
struct compact_control *cc)
{

- unsigned int order;
unsigned long watermark;

if (fatal_signal_pending(current))
@@ -586,14 +620,22 @@ static int compact_finished(struct zone *zone,
return COMPACT_CONTINUE;

/* Direct compactor: Is a suitable page free? */
- for (order = cc->order; order < MAX_ORDER; order++) {
- /* Job done if page is free of the right migratetype */
- if (!list_empty(&zone->free_area[order].free_list[cc->migratetype]))
- return COMPACT_PARTIAL;
-
- /* Job done if allocation would set block type */
- if (order >= pageblock_order && zone->free_area[order].nr_free)
+ if (cc->page) {
+ /* Was a suitable page captured? */
+ if (*cc->page)
return COMPACT_PARTIAL;
+ } else {
+ unsigned int order;
+ for (order = cc->order; order < MAX_ORDER; order++) {
+ struct free_area *area = &zone->free_area[cc->order];
+ /* Job done if page is free of the right migratetype */
+ if (!list_empty(&area->free_list[cc->migratetype]))
+ return COMPACT_PARTIAL;
+
+ /* Job done if allocation would set block type */
+ if (cc->order >= pageblock_order && area->nr_free)
+ return COMPACT_PARTIAL;
+ }
}

return COMPACT_CONTINUE;
@@ -708,6 +750,10 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
goto out;
}
}
+
+ /* Capture a page now if it is a suitable size */
+ if (cc->migratetype == MIGRATE_MOVABLE)
+ compact_capture_page(cc);
}

out:
@@ -720,7 +766,7 @@ out:

static unsigned long compact_zone_order(struct zone *zone,
int order, gfp_t gfp_mask,
- bool sync)
+ bool sync, struct page **page)
{
struct compact_control cc = {
.nr_freepages = 0,
@@ -729,6 +775,7 @@ static unsigned long compact_zone_order(struct zone *zone,
.migratetype = allocflags_to_migratetype(gfp_mask),
.zone = zone,
.sync = sync,
+ .page = page,
};
INIT_LIST_HEAD(&cc.freepages);
INIT_LIST_HEAD(&cc.migratepages);
@@ -750,7 +797,7 @@ int sysctl_extfrag_threshold = 500;
*/
unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *nodemask,
- bool sync)
+ bool sync, struct page **page)
{
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
int may_enter_fs = gfp_mask & __GFP_FS;
@@ -770,7 +817,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
nodemask) {
int status;

- status = compact_zone_order(zone, order, gfp_mask, sync);
+ status = compact_zone_order(zone, order, gfp_mask, sync, page);
rc = max(status, rc);

/* If a normal allocation would succeed, stop compacting */
@@ -825,6 +872,7 @@ int compact_pgdat(pg_data_t *pgdat, int order)
struct compact_control cc = {
.order = order,
.sync = false,
+ .page = NULL,
};

return __compact_pgdat(pgdat, &cc);
@@ -835,6 +883,7 @@ static int compact_node(int nid)
struct compact_control cc = {
.order = -1,
.sync = true,
+ .page = NULL,
};

return __compact_pgdat(NODE_DATA(nid), &cc);
diff --git a/mm/internal.h b/mm/internal.h
index 2ba87fb..9156714 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -124,6 +124,7 @@ struct compact_control {
int order; /* order a direct compactor needs */
int migratetype; /* MOVABLE, RECLAIMABLE etc */
struct zone *zone;
+ struct page **page; /* Page captured of requested size */
};

unsigned long
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4a4f921..adc3aa8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1374,16 +1374,11 @@ void split_page(struct page *page, unsigned int order)
}

/*
- * Similar to split_page except the page is already free. As this is only
- * being used for migration, the migratetype of the block also changes.
- * As this is called with interrupts disabled, the caller is responsible
- * for calling arch_alloc_page() and kernel_map_page() after interrupts
- * are enabled.
- *
- * Note: this is probably too low level an operation for use in drivers.
- * Please consult with lkml before using this in your driver.
+ * Similar to the split_page family of functions except that the page
+ * required at the given order and being isolated now to prevent races
+ * with parallel allocators
*/
-int split_free_page(struct page *page)
+int capture_free_page(struct page *page, int alloc_order, int migratetype)
{
unsigned int order;
unsigned long watermark;
@@ -1405,10 +1400,11 @@ int split_free_page(struct page *page)
rmv_page_order(page);
__mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));

- /* Split into individual pages */
- set_page_refcounted(page);
- split_page(page, order);
+ if (alloc_order != order)
+ expand(zone, page, alloc_order, order,
+ &zone->free_area[order], migratetype);

+ /* Set the pageblock if the captured page is at least a pageblock */
if (order >= pageblock_order - 1) {
struct page *endpage = page + (1 << order) - 1;
for (; page < endpage; page += pageblock_nr_pages) {
@@ -1419,7 +1415,35 @@ int split_free_page(struct page *page)
}
}

- return 1 << order;
+ return 1UL << order;
+}
+
+/*
+ * Similar to split_page except the page is already free. As this is only
+ * being used for migration, the migratetype of the block also changes.
+ * As this is called with interrupts disabled, the caller is responsible
+ * for calling arch_alloc_page() and kernel_map_page() after interrupts
+ * are enabled.
+ *
+ * Note: this is probably too low level an operation for use in drivers.
+ * Please consult with lkml before using this in your driver.
+ */
+int split_free_page(struct page *page)
+{
+ unsigned int order;
+ int nr_pages;
+
+ BUG_ON(!PageBuddy(page));
+ order = page_order(page);
+
+ nr_pages = capture_free_page(page, order, 0);
+ if (!nr_pages)
+ return 0;
+
+ /* Split into individual pages */
+ set_page_refcounted(page);
+ split_page(page, order);
+ return nr_pages;
}

/*
@@ -2065,7 +2089,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
bool *deferred_compaction,
unsigned long *did_some_progress)
{
- struct page *page;
+ struct page *page = NULL;

if (!order)
return NULL;
@@ -2077,10 +2101,16 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,

current->flags |= PF_MEMALLOC;
*did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
- nodemask, sync_migration);
+ nodemask, sync_migration, &page);
current->flags &= ~PF_MEMALLOC;
- if (*did_some_progress != COMPACT_SKIPPED) {

+ /* If compaction captured a page, prep and use it */
+ if (page) {
+ prep_new_page(page, order, gfp_mask);
+ goto got_page;
+ }
+
+ if (*did_some_progress != COMPACT_SKIPPED) {
/* Page migration frees to the PCP lists but we want merging */
drain_pages(get_cpu());
put_cpu();
@@ -2090,6 +2120,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
alloc_flags, preferred_zone,
migratetype);
if (page) {
+got_page:
preferred_zone->compact_considered = 0;
preferred_zone->compact_defer_shift = 0;
if (order >= preferred_zone->compact_order_failed)

Mel Gorman

Aug 7, 2012, 8:40:03 AM

The comment about order applied when the check was
order > PAGE_ALLOC_COSTLY_ORDER which has not been the case since
[c5a73c3d: thp: use compaction for all allocation orders]. Fixing
the comment while I'm in the general area.

Signed-off-by: Mel Gorman <mgo...@suse.de>
---

mm/compaction.c | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index b39ede1..95ca967 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -759,11 +759,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
struct zone *zone;
int rc = COMPACT_SKIPPED;

- /*
- * Check whether it is worth even starting compaction. The order check is
- * made because an assumption is made that the page allocator can satisfy
- * the "cheaper" orders without taking special steps
- */
+ /* Check if the GFP flags allow compaction */
if (!order || !may_enter_fs || !may_perform_io)
return rc;

Mel Gorman

Aug 7, 2012, 8:40:03 AM

From: Rik van Riel <ri...@redhat.com>

This commit is already upstream as [7db8889a: mm: have order > 0 compaction
start off where it left]. It's included in this series to provide context
to the next patch as the series is based on 3.5.

Order > 0 compaction stops when enough free pages of the correct page
order have been coalesced. When doing subsequent higher order
allocations, it is possible for compaction to be invoked many times.

However, the compaction code always starts out looking for things to
compact at the start of the zone, and for free pages to compact things to
at the end of the zone.

This can cause quadratic behaviour, with isolate_freepages starting at the
end of the zone each time, even though previous invocations of the
compaction code already filled up all free memory on that end of the zone.

This can cause isolate_freepages to take enormous amounts of CPU with
certain workloads on larger memory systems.

The obvious solution is to have isolate_freepages remember where it left
off last time, and continue at that point the next time it gets invoked
for an order > 0 compaction. This could cause compaction to fail if
cc->free_pfn and cc->migrate_pfn are close together initially; in that
case we restart from the end of the zone and try once more.

Forced full (order == -1) compactions are left alone.
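
The restart point at the end of the zone is computed by start_free_pfn() in
the diff below, which rounds the end of the zone down to a pageblock
boundary. The following userspace sketch repeats that arithmetic with the
zone fields passed as plain arguments. The function name is hypothetical, and
the pfn values and the pageblock size of 512 pages are made up for
illustration.

```c
/*
 * Same arithmetic as start_free_pfn() in the diff below, with the zone fields
 * passed in as plain arguments. pageblock_nr_pages must be a power of two.
 */
static unsigned long start_free_pfn_model(unsigned long zone_start_pfn,
					  unsigned long spanned_pages,
					  unsigned long pageblock_nr_pages)
{
	unsigned long free_pfn = zone_start_pfn + spanned_pages;

	/* Clear the low bits to round down to a pageblock boundary. */
	free_pfn &= ~(pageblock_nr_pages - 1);
	return free_pfn;
}
```

A zone starting at pfn 0x1000 and spanning 0x7F234 pages ends at 0x80234, so
with 512-page pageblocks the free scanner restarts at 0x80200. If the end of
the zone is already aligned, the result is the end pfn itself.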

[ak...@linux-foundation.org: checkpatch fixes]
[ak...@linux-foundation.org: s/laste/last/, use 80 cols]
Signed-off-by: Rik van Riel <ri...@redhat.com>
Reported-by: Jim Schutt <jas...@sandia.gov>
Tested-by: Jim Schutt <jas...@sandia.gov>
Cc: Minchan Kim <minch...@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezaw...@jp.fujitsu.com>
Acked-by: Mel Gorman <m...@csn.ul.ie>
Signed-off-by: Andrew Morton <ak...@linux-foundation.org>
Signed-off-by: Linus Torvalds <torv...@linux-foundation.org>
Signed-off-by: Mel Gorman <m...@csn.ul.ie>
---
include/linux/mmzone.h | 4 +++
mm/compaction.c | 63 ++++++++++++++++++++++++++++++++++++++++++++----
mm/internal.h | 6 +++++
mm/page_alloc.c | 5 ++++
4 files changed, 73 insertions(+), 5 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 68c569f..6340f38 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -369,6 +369,10 @@ struct zone {
*/
spinlock_t lock;
int all_unreclaimable; /* All pages pinned */
+#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+ /* pfn where the last incremental compaction isolated free pages */
+ unsigned long compact_cached_free_pfn;
+#endif
#ifdef CONFIG_MEMORY_HOTPLUG
/* see spanned/present_pages for more description */
seqlock_t span_seqlock;
diff --git a/mm/compaction.c b/mm/compaction.c
index 63af8d2..be310f1 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -457,6 +457,17 @@ static void isolate_freepages(struct zone *zone,

pfn -= pageblock_nr_pages) {
unsigned long isolated;

+ /*
+ * Skip ahead if another thread is compacting in the area
+ * simultaneously. If we wrapped around, we can only skip
+ * ahead if zone->compact_cached_free_pfn also wrapped to
+ * above our starting point.
+ */
+ if (cc->order > 0 && (!cc->wrapped ||
+ zone->compact_cached_free_pfn >
+ cc->start_free_pfn))
+ pfn = min(pfn, zone->compact_cached_free_pfn);
+
if (!pfn_valid(pfn))
continue;

@@ -497,8 +508,11 @@ static void isolate_freepages(struct zone *zone,
* looking for free pages, the search will restart here as
* page migration may have returned some pages to the allocator
*/
- if (isolated)
+ if (isolated) {
high_pfn = max(high_pfn, pfn);
+ if (cc->order > 0)
+ zone->compact_cached_free_pfn = high_pfn;
+ }
}

/* split_free_page does not map the pages */
@@ -593,6 +607,20 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
return ISOLATE_SUCCESS;
}

+/*

+ * Returns the start pfn of the last page block in a zone. This is the starting
+ * point for full compaction of a zone. Compaction searches for free pages from
+ * the end of each zone, while isolate_freepages_block scans forward inside each
+ * page block.
+ */
+static unsigned long start_free_pfn(struct zone *zone)
+{
+ unsigned long free_pfn;
+ free_pfn = zone->zone_start_pfn + zone->spanned_pages;
+ free_pfn &= ~(pageblock_nr_pages-1);
+ return free_pfn;
+}
+

static int compact_finished(struct zone *zone,
struct compact_control *cc)
{

@@ -601,8 +629,26 @@ static int compact_finished(struct zone *zone,
if (fatal_signal_pending(current))
return COMPACT_PARTIAL;

- /* Compaction run completes if the migrate and free scanner meet */
- if (cc->free_pfn <= cc->migrate_pfn)
+ /*
+ * A full (order == -1) compaction run starts at the beginning and
+ * end of a zone; it completes when the migrate and free scanner meet.
+ * A partial (order > 0) compaction can start with the free scanner
+ * at a random point in the zone, and may have to restart.
+ */
+ if (cc->free_pfn <= cc->migrate_pfn) {
+ if (cc->order > 0 && !cc->wrapped) {
+ /* We started partway through; restart at the end. */
+ unsigned long free_pfn = start_free_pfn(zone);
+ zone->compact_cached_free_pfn = free_pfn;
+ cc->free_pfn = free_pfn;
+ cc->wrapped = 1;
+ return COMPACT_CONTINUE;
+ }
+ return COMPACT_COMPLETE;
+ }
+
+ /* We wrapped around and ended up where we started. */
+ if (cc->wrapped && cc->free_pfn <= cc->start_free_pfn)
return COMPACT_COMPLETE;

/*
@@ -708,8 +754,15 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)

/* Setup to move all movable pages to the end of the zone */
cc->migrate_pfn = zone->zone_start_pfn;
- cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
- cc->free_pfn &= ~(pageblock_nr_pages-1);
+
+ if (cc->order > 0) {
+ /* Incremental compaction. Start where the last one stopped. */
+ cc->free_pfn = zone->compact_cached_free_pfn;
+ cc->start_free_pfn = cc->free_pfn;
+ } else {
+ /* Order == -1 starts at the end of the zone. */
+ cc->free_pfn = start_free_pfn(zone);
+ }

migrate_prep_local();

diff --git a/mm/internal.h b/mm/internal.h
index 9156714..064f6ef 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -118,8 +118,14 @@ struct compact_control {
unsigned long nr_freepages; /* Number of isolated free pages */
unsigned long nr_migratepages; /* Number of pages to migrate */
unsigned long free_pfn; /* isolate_freepages search base */
+ unsigned long start_free_pfn; /* where we started the search */
unsigned long migrate_pfn; /* isolate_migratepages search base */
bool sync; /* Synchronous migration */
+ bool wrapped; /* Order > 0 compactions are
+ incremental, once free_pfn
+ and migrate_pfn meet, we restart
+ from the top of the zone;
+ remember we wrapped around. */

int order; /* order a direct compactor needs */
int migratetype; /* MOVABLE, RECLAIMABLE etc */

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index adc3aa8..781d6e4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4425,6 +4425,11 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,

zone->spanned_pages = size;
zone->present_pages = realsize;
+#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+ zone->compact_cached_free_pfn = zone->zone_start_pfn +
+ zone->spanned_pages;
+ zone->compact_cached_free_pfn &= ~(pageblock_nr_pages-1);
+#endif
#ifdef CONFIG_NUMA
zone->node = nid;
zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio)

Mel Gorman

Aug 7, 2012, 8:40:03 AM

Allocation success rates have been far lower since 3.4 due to commit
[fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled]. This
commit was introduced for good reasons and it was known in advance that
the success rates would suffer but it was justified on the grounds that
the high allocation success rates were achieved by aggressive reclaim.
Success rates are expected to suffer even more in 3.6 due to commit
[7db8889a: mm: have order > 0 compaction start off where it left] which
testing has shown to severely reduce allocation success rates under load -
to 0% in one case. There is a proposed change to that patch in this series
and it would be ideal if Jim Schutt could retest the workload that led to
commit [7db8889a: mm: have order > 0 compaction start off where it left].

This series aims to improve the allocation success rates without regressing
the benefits of commit fe2c2a10. The series is based on 3.5 and includes
the commit 7db8889a to illustrate what impact it has on success rates.

Patch 1 updates a stale comment seeing as I was in the general area.

Patch 2 updates reclaim/compaction to reclaim pages scaled on the number
of recent failures.

Patch 3 has kswapd use similar logic to direct reclaim when deciding whether
to continue reclaiming for reclaim/compaction or not.

Patch 4 captures suitable high-order pages freed by compaction to reduce
races with parallel allocation requests.

Patch 5 is an upstream commit that has compaction restart free page scanning
from an old position instead of always starting from the end of the
zone.

Patch 6 adjusts patch 5 to restore allocation success rates.

STRESS-HIGHALLOC
              3.5.0-vanilla    patches:1-2      patches:1-3      patches:1-4      patches:1-5      patches:1-6
Pass 1        36.00 ( 0.00%)   61.00 (25.00%)   49.00 (13.00%)   57.00 (21.00%)    0.00 (-36.00%)  62.00 (26.00%)
Pass 2        46.00 ( 0.00%)   61.00 (15.00%)   55.00 ( 9.00%)   62.00 (16.00%)    0.00 (-46.00%)  63.00 (17.00%)
while Rested  84.00 ( 0.00%)   85.00 ( 1.00%)   84.00 ( 0.00%)   86.00 ( 2.00%)   86.00 ( 2.00%)   86.00 ( 2.00%)

From
http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__stress-highalloc-performance-ext3/hydra/comparison.html
I know that the allocation success rate in 3.3.6 was 78% in comparison
to 36% in 3.5. With the full series applied, the success rate is up
to 62%, which is still much lower, but it does not reclaim excessively.
Note what patch 5, which is the upstream commit 7db8889a, did to allocation
success rates.

MMTests Statistics: vmstat
Page Ins      3037580   3167260   3002720   3120080   2885540   3159024
Page Outs     8026888   8028472   8023292   8031056   8025324   8026676
Swap Ins            0         0         0         0         0         0
Swap Outs           0         0         0         0         0         8

Note that swap in/out rates remain at 0. In 3.3.6 with 78% success rates
there were 71881 pages swapped out.

Direct pages scanned         97106      59600      43926     108327       2109     171530
Kswapd pages scanned       1231288    1419472    1388888    1443504    1180916    1377362
Kswapd pages reclaimed     1231221    1419248    1358130    1427561    1164936    1372875
Direct pages reclaimed       97100      59486      24233      88990       2109     171235
Kswapd efficiency              99%        99%        97%        98%        98%        99%
Kswapd velocity           1001.153   1129.622   1098.647   1080.758    955.967   1084.657
Direct efficiency              99%        99%        55%        82%       100%        99%
Direct velocity             78.956     47.430     34.747     81.105      1.707    135.078

kswapd velocity stays at around 1000 pages/second which is reasonable. In
kernel 3.3.6, it was 8140 pages/second.
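
The efficiency rows look like pages reclaimed as a whole percentage of pages
scanned, rounded down. That formula is an inference from the figures rather
than something stated in the report, but it reproduces the efficiency values
above. A minimal sketch, with a hypothetical function name:

```c
/*
 * Pages reclaimed as a whole percentage of pages scanned, rounded down.
 * Returning 0 when nothing was scanned is an arbitrary choice for this sketch.
 */
static unsigned int reclaim_efficiency_pct(unsigned long reclaimed,
					   unsigned long scanned)
{
	if (scanned == 0)
		return 0;
	return (unsigned int)(reclaimed * 100UL / scanned);
}
```

For example, direct reclaim with patches 1-3 reclaimed 24233 of 43926 scanned
pages, which gives 55%, and kswapd on the vanilla kernel reclaimed 1231221 of
1231288, which gives 99%.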

include/linux/compaction.h | 4 +-
include/linux/mm.h | 1 +
include/linux/mmzone.h | 4 ++
mm/compaction.c | 142 +++++++++++++++++++++++++++++++++++++-------
mm/internal.h | 7 +++
mm/page_alloc.c | 68 ++++++++++++++++-----
mm/vmscan.c | 29 ++++++++-
7 files changed, 213 insertions(+), 42 deletions(-)

Rik van Riel

Aug 7, 2012, 9:20:03 AM

On 08/07/2012 08:31 AM, Mel Gorman wrote:
> The comment about order applied when the check was
> order> PAGE_ALLOC_COSTLY_ORDER which has not been the case since
> [c5a73c3d: thp: use compaction for all allocation orders]. Fixing
> the comment while I'm in the general area.
>
> Signed-off-by: Mel Gorman<mgo...@suse.de>

Reviewed-by: Rik van Riel <ri...@redhat.com>

--
All rights reversed

Rik van Riel

Aug 7, 2012, 9:30:02 AM

On 08/07/2012 08:31 AM, Mel Gorman wrote:

> If allocation fails after compaction then compaction may be deferred for
> a number of allocation attempts. If there are subsequent failures,
> compact_defer_shift is increased to defer for longer periods. This patch
> uses that information to scale the number of pages reclaimed with
> compact_defer_shift until allocations succeed again.
>
> Signed-off-by: Mel Gorman<mgo...@suse.de>

Acked-by: Rik van Riel <ri...@redhat.com>

--
All rights reversed

Rik van Riel

Aug 7, 2012, 9:30:03 AM

On 08/07/2012 08:31 AM, Mel Gorman wrote:

> When direct reclaim is running reclaim/compaction, there is a minimum
> number of pages it reclaims. As it must be under the low watermark to be
> in direct reclaim it has also woken kswapd to do some work. This patch
> has kswapd use the same logic as direct reclaim to reclaim a minimum
> number of pages so compaction can run later.
>
> Signed-off-by: Mel Gorman<mgo...@suse.de>

Acked-by: Rik van Riel <ri...@redhat.com>

--
All rights reversed

Rik van Riel

Aug 7, 2012, 9:40:01 AM

On 08/07/2012 08:31 AM, Mel Gorman wrote:

> While compaction is moving pages to free up large contiguous blocks for
> allocation it races with other allocation requests that may steal these
> blocks or break them up. This patch alters direct compaction to capture a
> suitable free page as soon as it becomes available to reduce this race. It
> uses similar logic to split_free_page() to ensure that watermarks are
> still obeyed.
>
> Signed-off-by: Mel Gorman<mgo...@suse.de>

Reviewed-by: Rik van Riel <ri...@redhat.com>

--
All rights reversed

Rik van Riel

Aug 7, 2012, 10:50:02 AM

On 08/07/2012 08:31 AM, Mel Gorman wrote:

> commit [7db8889a: mm: have order > 0 compaction start off where it left]
> introduced a caching mechanism to reduce the amount of work the free page
> scanner does in compaction. However, it has a problem. Consider two processes
> simultaneously scanning free pages
>
> C
> Process A M S F
> |---------------------------------------|
> Process B M FS

Argh. Good spotting.

> This is not optimal and it can still race but the compact_cached_free_pfn
> will be pointing to or very near a pageblock with free pages.

Agreed on the "not optimal", but I also cannot think of a better
idea right now. Getting this fixed for 3.6 is important, we can
think of future optimizations in San Diego.

> Signed-off-by: Mel Gorman<mgo...@suse.de>

Reviewed-by: Rik van Riel <ri...@redhat.com>

--
All rights reversed

Mel Gorman

Aug 7, 2012, 11:00:02 AM

On Tue, Aug 07, 2012 at 10:45:25AM -0400, Rik van Riel wrote:
> On 08/07/2012 08:31 AM, Mel Gorman wrote:
> >commit [7db8889a: mm: have order > 0 compaction start off where it left]
> >introduced a caching mechanism to reduce the amount of work the free page
> >scanner does in compaction. However, it has a problem. Consider two processes
> >simultaneously scanning free pages
> >
> > C
> >Process A M S F
> > |---------------------------------------|
> >Process B M FS
>
> Argh. Good spotting.
>
> >This is not optimal and it can still race but the compact_cached_free_pfn
> >will be pointing to or very near a pageblock with free pages.
>
> Agreed on the "not optimal", but I also cannot think of a better
> idea right now. Getting this fixed for 3.6 is important, we can
> think of future optimizations in San Diego.
>

Sounds like a plan.

> >Signed-off-by: Mel Gorman<mgo...@suse.de>
>
> Reviewed-by: Rik van Riel <ri...@redhat.com>
>

Thanks very much.

Jim, what are the chances of getting this series tested with your large
data workload? As it's on top of 3.5, it should be less scary than
testing 3.6-rc1 but if you are comfortable testing 3.6-rc1 then please
test with just this patch on top.

--
Mel Gorman
SUSE Labs

Jim Schutt

Aug 7, 2012, 11:40:02 AM

As it turns out I'm already testing 3.6-rc1, as I'm on
the trail of a Ceph client messaging bug. I think I've
about got that figured out, and am working on a patch, but
I need it fixed in order to generate enough load to trigger
the problem that your patch addresses.

Which is a long-winded way of saying: no problem, I'll
roll this into my current testing, but I'll need another
day or two before I'm likely to be able to generate a
high enough load to test effectively. OK?

Also FWIW, it occurs to me that you might be interested
to know that my load also involves lots of network load
where I'm using jumbo frames. I suspect that puts even
more stress on higher page order allocations, right?

-- Jim

Mel Gorman

Aug 7, 2012, 11:50:03 AM

Grand, good luck with the Ceph bug.

> Which is a long-winded way of saying: no problem, I'll
> roll this into my current testing, but I'll need another
> day or two before I'm likely to be able to generate a
> high enough load to test effectively. OK?
>

That is perfectly reasonable, thanks.

> Also FWIW, it occurs to me that you might be interested
> to know that my load also involves lots of network load
> where I'm using jumbo frames. I suspect that puts even
> more stress on higher page order allocations, right?
>

It might. It depends on whether the underlying driver needs contiguous
pages to handle jumbo frames, whether it can do scatter/gather IO, or some
combination such as trying for a contiguous page but using scatter/gather as
a fallback. Certainly it is interesting and I will keep it in mind.
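
To make the concern concrete: if a driver did need one physically contiguous buffer per jumbo frame, the buffer size would map to an allocation order roughly as sketched below. This is a user-space illustration that assumes 4 KiB pages. order_for_size is a hypothetical stand-in for what the kernel's get_order() computes, and real drivers add headroom or avoid the problem entirely with scatter/gather.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/* Smallest order such that (page size << order) is at least size bytes */
static unsigned int order_for_size(size_t size)
{
	unsigned int order = 0;

	while ((SKETCH_PAGE_SIZE << order) < size)
		order++;
	return order;
}
```

Under these assumptions a 9000-byte buffer needs order 2 (four contiguous pages), whereas a standard 1500-byte frame fits in order 0.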

--
Mel Gorman
SUSE Labs

Minchan Kim

Aug 7, 2012, 7:30:01 PM

On Tue, Aug 07, 2012 at 01:31:12PM +0100, Mel Gorman wrote:
> The comment about order applied when the check was
> order > PAGE_ALLOC_COSTLY_ORDER which has not been the case since
> [c5a73c3d: thp: use compaction for all allocation orders]. Fixing
> the comment while I'm in the general area.
>
> Signed-off-by: Mel Gorman <mgo...@suse.de>

Reviewed-by: Minchan Kim <min...@kernel.org>

--
Kind regards,
Minchan Kim

Minchan Kim

Aug 7, 2012, 9:50:02 PM

Hi Mel,

Just out of curiosity.
What problem did you see? (i.e., what problem does this patch solve?)
AFAIUC, it seems to improve the success ratio of consecutive allocations by
getting several free pageblocks all at once in a process/kswapd
reclaim context. Right?

> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majo...@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"do...@kvack.org"> em...@kvack.org </a>

--
Kind regards,
Minchan Kim

Minchan Kim

Aug 7, 2012, 10:10:02 PM

On Tue, Aug 07, 2012 at 01:31:14PM +0100, Mel Gorman wrote:
> When direct reclaim is running reclaim/compaction, there is a minimum
> number of pages it reclaims. As it must be under the low watermark to be
> in direct reclaim it has also woken kswapd to do some work. This patch
> has kswapd use the same logic as direct reclaim to reclaim a minimum
> number of pages so compaction can run later.

-ENOPARSE by my stupid brain.
Could you elaborate a bit more?

nr_soft_reclaimed is always zero with !CONFIG_MEMCG.
So should_continue_reclaim_zone would return normally true in case of
non-__GFP_REPEAT allocation. Is it intentional?

> + sc.nr_scanned - nr_soft_scanned,
> + &sc))
> testorder = 0;
>
> if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
> --
> 1.7.9.2
>

--
Kind regards,
Minchan Kim

Minchan Kim

Aug 8, 2012, 12:40:02 AM

Okay.

>
> Second, it updates compact_cached_free_pfn in a more limited set of
> circumstances.
>
> If a scanner has wrapped, it updates compact_cached_free_pfn to the end
> of the zone. Each time a wrapped scanner isolates a page, it
> updates compact_cached_free_pfn. The intention is that after
> wrapping, the compact_cached_free_pfn will be at the highest
> pageblock with free pages when compaction completes.

Okay.

>
> If a scanner has not wrapped when compaction completes and

Compaction completes?
Your code seems to do this in isolate_freepages.
Is that really compaction completing?

> compact_cached_free_pfn is set to the end of the zone, initialise
> it once.

I can't understand this part.
Could you elaborate a bit more?

>

--
Kind regards,
Minchan Kim

Mel Gorman

Aug 8, 2012, 4:00:01 AM

On Wed, Aug 08, 2012 at 10:48:24AM +0900, Minchan Kim wrote:
> Hi Mel,
>
> Just out of curiosity.
> What problem did you see? (i.e., what problem does this patch solve?)

Everything in this series is related to the problem in the leader - high
order allocation success rates are lower. This patch increases the success
rates when allocating under load.

> AFAIUC, it seems to improve the success ratio of consecutive allocations by
> getting several free pageblocks all at once in a process/kswapd
> reclaim context. Right?

They are only pageblocks if the request is order-9 on x86; it reclaims an
amount that depends on the allocation size. This only happens in
reclaim/compaction context when we know that a high-order allocation has
recently failed. The objective is to reclaim enough order-0 pages so that
compaction can succeed again.

--
Mel Gorman
SUSE Labs

Minchan Kim

Aug 8, 2012, 4:30:01 AM

On Wed, Aug 08, 2012 at 08:55:26AM +0100, Mel Gorman wrote:
> On Wed, Aug 08, 2012 at 10:48:24AM +0900, Minchan Kim wrote:
> > Hi Mel,
> >
> > Just out of curiosity.
> > What problem did you see? (i.e., what problem does this patch solve?)
>
> Everything in this series is related to the problem in the leader - high
> order allocation success rates are lower. This patch increases the success
> rates when allocating under load.
>
> > AFAIUC, it seems to improve the success ratio of consecutive allocations by
> > getting several free pageblocks all at once in a process/kswapd
> > reclaim context. Right?
>
> They are only pageblocks if the request is order-9 on x86; it reclaims an
> amount that depends on the allocation size. This only happens in
> reclaim/compaction context when we know that a high-order allocation has
> recently failed. The objective is to reclaim enough order-0 pages so that
> compaction can succeed again.

Your patch increases the number of pages to be reclaimed by taking into
account the number of failures during the deferral period, and your test
showed that it works really well. Without your patch, why can't the VM
reclaim enough pages? Do other processes steal the reclaimed pages?
I ask because I want to know what the problem is in the current VM.

>
> --
> Mel Gorman
> SUSE Labs
>

--
Kind regards,
Minchan Kim

Mel Gorman

Aug 8, 2012, 5:00:02 AM

On Wed, Aug 08, 2012 at 05:27:38PM +0900, Minchan Kim wrote:
> On Wed, Aug 08, 2012 at 08:55:26AM +0100, Mel Gorman wrote:
> > On Wed, Aug 08, 2012 at 10:48:24AM +0900, Minchan Kim wrote:
> > > Hi Mel,
> > >
> > > Just out of curiosity.
> > > What problem did you see? (i.e., what problem does this patch solve?)
> >
> > Everything in this series is related to the problem in the leader - high
> > order allocation success rates are lower. This patch increases the success
> > rates when allocating under load.
> >
> > > AFAIUC, it seems to improve the success ratio of consecutive allocations
> > > by getting several free pageblocks all at once in a process/kswapd
> > > reclaim context. Right?
> >
> > They are only pageblocks if the request is order-9 on x86; it reclaims an
> > amount that depends on the allocation size. This only happens in
> > reclaim/compaction context when we know that a high-order allocation has
> > recently failed. The objective is to reclaim enough order-0 pages so that
> > compaction can succeed again.
>
> Your patch increases the number of pages to be reclaimed by taking into
> account the number of failures during the deferral period, and your test
> showed that it works really well. Without your patch, why can't the VM
> reclaim enough pages?

It could reclaim enough pages but it doesn't. nr_to_reclaim is
SWAP_CLUSTER_MAX and that gets short-circuited in direct reclaim at least
by

	if (sc->nr_reclaimed >= sc->nr_to_reclaim)
		goto out;

I could set nr_to_reclaim in try_to_free_pages() of course and drive
it from there but that's just different, not better. If driven from
do_try_to_free_pages(), it is also possible that priorities will rise.
When they reach DEF_PRIORITY-2, it will also start stalling and setting
pages for immediate reclaim which is more disruptive than is desirable
in this case. That is a more wide-reaching change than I would expect for
this problem and could cause another regression related to THP requests
causing interactive jitter.

> Do other processes steal the reclaimed pages?

Or the pages it reclaimed were in pageblocks that could not be used.

> I ask because I want to know what the problem is in the current VM.
>

We cannot reliably tell in advance whether compaction is going to succeed
in the future without doing a full scan of the zone which would be both
very heavy and race with any allocation requests. Compaction needs free
pages to succeed so the intention is to scale the number of pages reclaimed
with the number of recent compaction failures.
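
The scaling idea can be sketched in user space as follows. This is an illustration under stated assumptions, not the patch itself: the base target of 2UL << order and the left shift by compact_defer_shift are assumptions about how the reclaim target is derived, and reclaim_target is a hypothetical name.

```c
/*
 * Hypothetical sketch: number of order-0 pages reclaim aims to free
 * before compaction is retried. Every increment of defer_shift, which
 * stands in for zone->compact_defer_shift, doubles the target.
 */
static unsigned long reclaim_target(unsigned int order,
				    unsigned int defer_shift)
{
	unsigned long pages = 2UL << order;	/* assumed base target */

	return pages << defer_shift;
}
```

For an order-9 request this gives 1024 pages with no recent failures and 4096 pages once the shift reaches 2.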

--
Mel Gorman
SUSE Labs
--

Mel Gorman

Aug 8, 2012, 5:10:02 AM

On Wed, Aug 08, 2012 at 11:07:49AM +0900, Minchan Kim wrote:
> On Tue, Aug 07, 2012 at 01:31:14PM +0100, Mel Gorman wrote:
> > When direct reclaim is running reclaim/compaction, there is a minimum
> > number of pages it reclaims. As it must be under the low watermark to be
> > in direct reclaim it has also woken kswapd to do some work. This patch
> > has kswapd use the same logic as direct reclaim to reclaim a minimum
> > number of pages so compaction can run later.
>
> -ENOPARSE by my stupid brain.
> Could you elaborate a bit more?
>

Which part did not make sense so I know which part to elaborate on? Let's
try again with this:

When direct reclaim is running reclaim/compaction for high-order allocations,
it aims to reclaim a minimum number of pages for compaction as controlled
by should_continue_reclaim. Before it entered direct reclaim, kswapd was
woken to reclaim pages at the same order. This patch forces kswapd to use
the same logic as direct reclaim to reclaim a minimum number of pages so
that subsequent allocation requests are less likely to enter direct reclaim.

It was intentional at the time but asking me about it made me reconsider,
thanks. In too many cases, this is a no-op and any apparent increase of
kswapd activity is likely a co-incidence. This is untested but is what I
intended.

---8<---
mm: kswapd: Continue reclaiming for reclaim/compaction if the minimum number of pages have not been reclaimed

When direct reclaim is running reclaim/compaction for high-order allocations,
it aims to reclaim a minimum number of pages for compaction as controlled
by should_continue_reclaim. Before it entered direct reclaim, kswapd was
woken to reclaim pages at the same order. This patch forces kswapd to use
the same logic as direct reclaim to reclaim a minimum number of pages so
that subsequent allocation requests are less likely to enter direct reclaim.

Signed-off-by: Mel Gorman <mgo...@suse.de>
---

mm/vmscan.c | 81 ++++++++++++++++++++++++++++++++++++-----------------------
1 file changed, 50 insertions(+), 31 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0cb2593..6840218 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1696,14 +1696,11 @@ static bool in_reclaim_compaction(struct scan_control *sc)

/*
* Reclaim/compaction is used for high-order allocation requests. It reclaims
- * order-0 pages before compacting the zone. should_continue_reclaim() returns
+ * order-0 pages before compacting the zone. __should_continue_reclaim() returns
* true if more pages should be reclaimed such that when the page allocator

* calls try_to_compact_zone() that it will have enough free pages to succeed.

- * It will give up earlier than that if there is difficulty reclaiming pages.

*/
-static inline bool should_continue_reclaim(struct lruvec *lruvec,

- unsigned long nr_reclaimed,
- unsigned long nr_scanned,
+static bool __should_continue_reclaim(struct lruvec *lruvec,
struct scan_control *sc)
{
unsigned long pages_for_compaction;
@@ -1714,29 +1711,6 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
if (!in_reclaim_compaction(sc))
return false;

- /* Consider stopping depending on scan and reclaim activity */
- if (sc->gfp_mask & __GFP_REPEAT) {
- /*
- * For __GFP_REPEAT allocations, stop reclaiming if the
- * full LRU list has been scanned and we are still failing
- * to reclaim pages. This full LRU scan is potentially
- * expensive but a __GFP_REPEAT caller really wants to succeed
- */
- if (!nr_reclaimed && !nr_scanned)
- return false;
- } else {
- /*
- * For non-__GFP_REPEAT allocations which can presumably
- * fail without consequence, stop if we failed to reclaim
- * any pages from the last SWAP_CLUSTER_MAX number of
- * pages that were scanned. This will return to the
- * caller faster at the risk reclaim/compaction and
- * the resulting allocation attempt fails
- */
- if (!nr_reclaimed)
- return false;
- }
-
/*
* If we have not reclaimed enough pages for compaction and the

* inactive lists are large enough, continue reclaiming

@@ -1768,6 +1742,51 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
}
}

+/* Looks up the lruvec before calling __should_continue_reclaim */
+static inline bool should_kswapd_continue_reclaim(struct zone *zone,
+					struct scan_control *sc)
+{
+	struct mem_cgroup *memcg = mem_cgroup_iter(NULL, NULL, NULL);
+	struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+
+	return __should_continue_reclaim(lruvec, sc);
+}
+
+/*
+ * This uses __should_continue_reclaim at its core but will also give up
+ * earlier than that if there is difficulty reclaiming pages.
+ */
+static inline bool should_direct_continue_reclaim(struct lruvec *lruvec,
+					unsigned long nr_reclaimed,
+					unsigned long nr_scanned,
+					struct scan_control *sc)
+{
+	/* Consider stopping depending on scan and reclaim activity */
+	if (sc->gfp_mask & __GFP_REPEAT) {
+		/*
+		 * For __GFP_REPEAT allocations, stop reclaiming if the
+		 * full LRU list has been scanned and we are still failing
+		 * to reclaim pages. This full LRU scan is potentially
+		 * expensive but a __GFP_REPEAT caller really wants to succeed
+		 */
+		if (!nr_reclaimed && !nr_scanned)
+			return false;
+	} else {
+		/*
+		 * For non-__GFP_REPEAT allocations which can presumably
+		 * fail without consequence, stop if we failed to reclaim
+		 * any pages from the last SWAP_CLUSTER_MAX number of
+		 * pages that were scanned. This will return to the
+		 * caller faster at the risk reclaim/compaction and
+		 * the resulting allocation attempt fails
+		 */
+		if (!nr_reclaimed)
+			return false;
+	}
+
+	return __should_continue_reclaim(lruvec, sc);
+}
+
/*
* This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
*/

@@ -1822,7 +1841,7 @@ restart:
sc, LRU_ACTIVE_ANON);

/* reclaim/compaction might need reclaim to continue */
- if (should_continue_reclaim(lruvec, nr_reclaimed,
+ if (should_direct_continue_reclaim(lruvec, nr_reclaimed,
sc->nr_scanned - nr_scanned, sc))
goto restart;

@@ -2496,8 +2515,8 @@ loop_again:

*/
testorder = order;
if (COMPACTION_BUILD && order &&
- compaction_suitable(zone, order) !=
- COMPACT_SKIPPED)

+ !should_kswapd_continue_reclaim(zone,

+ &sc))
testorder = 0;

if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
--
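
For reference, the early-exit checks that the patch above keeps in should_direct_continue_reclaim() can be modelled in user space like this. It is only a sketch: SKETCH_GFP_REPEAT is a hypothetical flag value rather than the kernel's __GFP_REPEAT bit, passes_early_exit is a made-up name, and the later "enough pages for compaction" test is left out.

```c
#include <stdbool.h>

#define SKETCH_GFP_REPEAT 0x1u	/* hypothetical bit, not the kernel's value */

/*
 * Returns false when reclaim should stop early, true when the caller
 * would go on to the pages-for-compaction test.
 */
static bool passes_early_exit(unsigned int gfp_mask,
			      unsigned long nr_reclaimed,
			      unsigned long nr_scanned)
{
	if (gfp_mask & SKETCH_GFP_REPEAT) {
		/* Give up only when nothing was reclaimed and nothing scanned */
		if (!nr_reclaimed && !nr_scanned)
			return false;
	} else {
		/* Callers that may fail give up as soon as nothing is reclaimed */
		if (!nr_reclaimed)
			return false;
	}
	return true;
}
```

With nr_reclaimed equal to zero and no repeat flag set, these checks stop reclaim regardless of how many pages were scanned.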

Mel Gorman

Aug 8, 2012, 6:00:02 AM

On Wed, Aug 08, 2012 at 10:07:57AM +0100, Mel Gorman wrote:
> > <SNIP>

>
> It was intentional at the time but asking me about it made me reconsider,
> thanks. In too many cases, this is a no-op and any apparent increase of
> kswapd activity is likely a co-incidence. This is untested but is what I
> intended.
>
> ---8<---
> mm: kswapd: Continue reclaiming for reclaim/compaction if the minimum number of pages have not been reclaimed
>

And considering this further again, it would partially regress fe2c2a10
and be too aggressive. I'm dropping this patch completely for now and will
revisit it in the future.

Thanks Minchan.

--
Mel Gorman
SUSE Labs

Mel Gorman

Aug 8, 2012, 6:20:02 AM

On Wed, Aug 08, 2012 at 01:36:00PM +0900, Minchan Kim wrote:
> >
> > Second, it updates compact_cached_free_pfn in a more limited set of
> > circumstances.
> >
> > If a scanner has wrapped, it updates compact_cached_free_pfn to the end
> > of the zone. Each time a wrapped scanner isolates a page, it
> > updates compact_cached_free_pfn. The intention is that after
> > wrapping, the compact_cached_free_pfn will be at the highest
> > pageblock with free pages when compaction completes.
>
> Okay.
>
> >
> > If a scanner has not wrapped when compaction completes and
>
> Compaction completes?
> Your code seems to do this in isolate_freepages.
> Is that really compaction completing?
>

s/compaction/free page isolation/

> > compact_cached_free_pfn is set to the end of the zone, initialise
> > it once.
>
> I can't understand this part.
> Could you elaborate a bit more?
>

Is this better?

If a scanner has wrapped, it updates compact_cached_free_pfn to the end
of the zone. When a wrapped scanner isolates a page, it updates
compact_cached_free_pfn to point to the highest pageblock it
can isolate pages from.

If a scanner has not wrapped when it has finished isolating pages, it
checks if compact_cached_free_pfn is pointing to the end of the
zone. If so, the value is updated to point to the highest
pageblock that pages were isolated from. This value will not
be updated again until a free page scanner wraps and resets
compact_cached_free_pfn.

This is not optimal and it can still race but the compact_cached_free_pfn
will be pointing to or very near a pageblock with free pages.
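
The update rules described above can be modelled in user space roughly as follows. This is a simplified sketch and not the kernel code: struct sketch_zone and both function names are hypothetical, and the per-page updates made by a wrapped scanner are collapsed into a single update when isolation finishes.

```c
#include <stdbool.h>

struct sketch_zone {
	unsigned long end_pfn;		/* start of the last pageblock */
	unsigned long cached_free_pfn;	/* models compact_cached_free_pfn */
};

/* A free scanner met the migrate scanner and restarts from the zone end */
static void scanner_wrapped(struct sketch_zone *z)
{
	z->cached_free_pfn = z->end_pfn;
}

/*
 * A scanner has finished isolating free pages; high_pfn is the highest
 * pageblock it isolated pages from.
 */
static void isolation_finished(struct sketch_zone *z, bool wrapped,
			       unsigned long high_pfn)
{
	if (wrapped)
		z->cached_free_pfn = high_pfn;
	else if (z->cached_free_pfn == z->end_pfn)
		z->cached_free_pfn = high_pfn;	/* initialise once */
}
```

In this model a scanner that has not wrapped sets the cached value only while it still points at the end of the zone, so later scanners that have not wrapped leave it alone until a wrap resets it.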

--
Mel Gorman
SUSE Labs

Mel Gorman

Aug 8, 2012, 3:10:01 PM

Second, it updates compact_cached_free_pfn in a more limited set of
circumstances.

If a scanner has wrapped, it updates compact_cached_free_pfn to the end
of the zone. When a wrapped scanner isolates a page, it updates
compact_cached_free_pfn to point to the highest pageblock it
can isolate pages from.

If a scanner has not wrapped when it has finished isolating pages, it
checks if compact_cached_free_pfn is pointing to the end of the
zone. If so, the value is updated to point to the highest
pageblock that pages were isolated from. This value will not
be updated again until a free page scanner wraps and resets
compact_cached_free_pfn.

This is not optimal and it can still race but the compact_cached_free_pfn
will be pointing to or very near a pageblock with free pages.

Signed-off-by: Mel Gorman <mgo...@suse.de>
Reviewed-by: Rik van Riel <ri...@redhat.com>

Mel Gorman

Aug 8, 2012, 3:10:02 PM

While compaction is migrating pages to free up large contiguous blocks for
allocation it races with other allocation requests that may steal these
blocks or break them up. This patch alters direct compaction to capture a
suitable free page as soon as it becomes available to reduce this race. It
uses similar logic to split_free_page() to ensure that watermarks are
still obeyed.

Signed-off-by: Mel Gorman <mgo...@suse.de>
Reviewed-by: Rik van Riel <ri...@redhat.com>
---

diff --git a/mm/compaction.c b/mm/compaction.c
index 95ca967..63af8d2 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c

@@ -561,7 +596,6 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,

static int compact_finished(struct zone *zone,
struct compact_control *cc)
{

unsigned long try_to_compact_pages(struct zonelist *zonelist,

int order, gfp_t gfp_mask, nodemask_t *nodemask,
- bool sync)
+ bool sync, struct page **page)
{
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
int may_enter_fs = gfp_mask & __GFP_FS;
@@ -770,7 +817,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
nodemask) {
int status;

- status = compact_zone_order(zone, order, gfp_mask, sync);
+ status = compact_zone_order(zone, order, gfp_mask, sync, page);
rc = max(status, rc);

/* If a normal allocation would succeed, stop compacting */
@@ -825,6 +872,7 @@ int compact_pgdat(pg_data_t *pgdat, int order)
struct compact_control cc = {
.order = order,
.sync = false,
+ .page = NULL,
};

return __compact_pgdat(pgdat, &cc);
@@ -835,6 +883,7 @@ static int compact_node(int nid)
struct compact_control cc = {
.order = -1,
.sync = true,
+ .page = NULL,
};

return __compact_pgdat(NODE_DATA(nid), &cc);

diff --git a/mm/internal.h b/mm/internal.h
index 2ba87fb..9156714 100644
--- a/mm/internal.h
+++ b/mm/internal.h

@@ -124,6 +124,7 @@ struct compact_control {

int order; /* order a direct compactor needs */
int migratetype; /* MOVABLE, RECLAIMABLE etc */

struct zone *zone;
+ struct page **page; /* Page captured of requested size */
};

unsigned long

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4a4f921..adc3aa8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c

+}
+
+/*

Mel Gorman

Aug 8, 2012, 3:10:02 PM

diff --git a/mm/compaction.c b/mm/compaction.c
index 63af8d2..be310f1 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -457,6 +457,17 @@ static void isolate_freepages(struct zone *zone,

pfn -= pageblock_nr_pages) {
unsigned long isolated;

+ /*
+ * Skip ahead if another thread is compacting in the area
+ * simultaneously. If we wrapped around, we can only skip
+ * ahead if zone->compact_cached_free_pfn also wrapped to
+ * above our starting point.
+ */
+ if (cc->order > 0 && (!cc->wrapped ||
+ zone->compact_cached_free_pfn >
+ cc->start_free_pfn))
+ pfn = min(pfn, zone->compact_cached_free_pfn);
+
if (!pfn_valid(pfn))
continue;

@@ -497,8 +508,11 @@ static void isolate_freepages(struct zone *zone,
* looking for free pages, the search will restart here as
* page migration may have returned some pages to the allocator
*/
- if (isolated)

+ if (isolated) {
high_pfn = max(high_pfn, pfn);

+ if (cc->order > 0)
+ zone->compact_cached_free_pfn = high_pfn;
+ }
}

/* split_free_page does not map the pages */

@@ -593,6 +607,20 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
return ISOLATE_SUCCESS;
}

+/*
+ * Returns the start pfn of the last page block in a zone. This is the starting
+ * point for full compaction of a zone. Compaction searches for free pages from
+ * the end of each zone, while isolate_freepages_block scans forward inside each
+ * page block.
+ */
+static unsigned long start_free_pfn(struct zone *zone)
+{
+	unsigned long free_pfn;
+	free_pfn = zone->zone_start_pfn + zone->spanned_pages;
+	free_pfn &= ~(pageblock_nr_pages-1);
+	return free_pfn;
+}
+

static int compact_finished(struct zone *zone,
struct compact_control *cc)
{

@@ -601,8 +629,26 @@ static int compact_finished(struct zone *zone,
if (fatal_signal_pending(current))
return COMPACT_PARTIAL;

- /* Compaction run completes if the migrate and free scanner meet */
- if (cc->free_pfn <= cc->migrate_pfn)
+ /*
+ * A full (order == -1) compaction run starts at the beginning and
+ * end of a zone; it completes when the migrate and free scanner meet.
+ * A partial (order > 0) compaction can start with the free scanner
+ * at a random point in the zone, and may have to restart.

+ */

diff --git a/mm/internal.h b/mm/internal.h
index 9156714..064f6ef 100644
--- a/mm/internal.h
+++ b/mm/internal.h

@@ -118,8 +118,14 @@ struct compact_control {
unsigned long nr_freepages; /* Number of isolated free pages */
unsigned long nr_migratepages; /* Number of pages to migrate */
unsigned long free_pfn; /* isolate_freepages search base */
+ unsigned long start_free_pfn; /* where we started the search */
unsigned long migrate_pfn; /* isolate_migratepages search base */
bool sync; /* Synchronous migration */
+ bool wrapped; /* Order > 0 compactions are
+ incremental, once free_pfn
+ and migrate_pfn meet, we restart
+ from the top of the zone;
+ remember we wrapped around. */

int order; /* order a direct compactor needs */
int migratetype; /* MOVABLE, RECLAIMABLE etc */

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index adc3aa8..781d6e4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c

@@ -4425,6 +4425,11 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,

zone->spanned_pages = size;
zone->present_pages = realsize;
+#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+ zone->compact_cached_free_pfn = zone->zone_start_pfn +
+ zone->spanned_pages;
+ zone->compact_cached_free_pfn &= ~(pageblock_nr_pages-1);
+#endif
#ifdef CONFIG_NUMA
zone->node = nid;
zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio)
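
The masking used by start_free_pfn() and by the free_area_init_core() hunk above rounds the end of the zone down to a pageblock boundary. A small user-space sketch of that arithmetic follows, assuming 512 pages per pageblock (order-9, as on x86 with 4 KiB pages). The function name and the zone bounds in the note are made up for illustration.

```c
#define SKETCH_PAGEBLOCK_NR_PAGES 512UL	/* assumption: order-9 pageblocks */

/* Round the end-of-zone pfn down to the start of a pageblock */
static unsigned long sketch_start_free_pfn(unsigned long zone_start_pfn,
					   unsigned long spanned_pages)
{
	unsigned long free_pfn = zone_start_pfn + spanned_pages;

	free_pfn &= ~(SKETCH_PAGEBLOCK_NR_PAGES - 1);
	return free_pfn;
}
```

For a zone starting at pfn 4096 and spanning 1000 pages, the end pfn 5096 rounds down to 4608. The mask only works because the pageblock size is a power of two.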

Mel Gorman

Aug 8, 2012, 3:10:02 PM

Changelog since V1
o Dropped kswapd related patch, basically a no-op and regresses if fixed (minchan)
o Expanded changelogs a little

Allocation success rates have been far lower since 3.4 due to commit
[fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled]. This
commit was introduced for good reasons and it was known in advance that
the success rates would suffer but it was justified on the grounds that
the high allocation success rates were achieved by aggressive reclaim.
Success rates are expected to suffer even more in 3.6 due to commit
[7db8889a: mm: have order > 0 compaction start off where it left] which
testing has shown to severely reduce allocation success rates under load -
to 0% in one case. There is a proposed change to that patch in this series
and it would be ideal if Jim Schutt could retest the workload that led to
commit [7db8889a: mm: have order > 0 compaction start off where it left].

This series aims to improve the allocation success rates without regressing
the benefits of commit fe2c2a10. The series is based on 3.5 and includes
the commit 7db8889a to illustrate what impact it has on success rates.

Patch 1 updates a stale comment seeing as I was in the general area.

Patch 2 updates reclaim/compaction to reclaim pages scaled on the number
of recent failures.

Patch 3 captures suitable high-order pages freed by compaction to reduce
races with parallel allocation requests.

Patch 4 is an upstream commit that has compaction restart free page scanning
from an old position instead of always starting from the end of the
zone.

Patch 5 adjusts patch 4 to restore allocation success rates.

STRESS-HIGHALLOC
                3.5.0-vanilla     patches:1-2       patches:1-3       patches:1-5
Pass 1          36.00 ( 0.00%)    61.00 (25.00%)    62.00 (26.00%)    56.00 (20.00%)
Pass 2          46.00 ( 0.00%)    61.00 (15.00%)    60.00 (14.00%)    56.00 (10.00%)
while Rested    84.00 ( 0.00%)    85.00 ( 1.00%)    86.00 ( 2.00%)    85.00 ( 1.00%)

From
http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__stress-highalloc-performance-ext3/hydra/comparison.html
I know that the allocation success rate in 3.3.6 was 78% in comparison
to 36% in 3.5. With the full series applied, the success rates are up to
around 60% with some variability in the results. This is not as high
a success rate but it does not reclaim excessively which is a key point.

Previous tests on V1 of this series showed that patch 4 on its own adversely
affected high-order allocation success rates.

MMTests Statistics: vmstat
Page Ins     3037580  3167260  3121588  2939576
Page Outs    8026888  8028472  8026444  8033852
Swap Ins           0        0        0        0
Swap Outs          0        0        0        0

Note that swap in/out rates remain at 0. In 3.3.6 with 78% success rates
there were 71881 pages swapped out.

Direct pages scanned         97106     59600    118792     84142
Kswapd pages scanned       1231288   1419472   1406569   1406642
Kswapd pages reclaimed     1231221   1419248   1390694   1357820
Direct pages reclaimed       97100     59486    107873     82067
Kswapd efficiency              99%       99%       98%       96%
Kswapd velocity           1001.153  1129.622  1082.592  1077.474
Direct efficiency              99%       99%       90%       97%
Direct velocity             78.956    47.430    91.431    64.452

kswapd velocity stays at around 1000 pages/second which is reasonable. In
kernel 3.3.6, it was 8140 pages/second.
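
The efficiency rows above look like pages reclaimed as a percentage of pages
scanned, truncated to a whole number. That reading is an assumption rather
than something the report states, but it reproduces every efficiency value
in the table. A minimal sketch, where reclaim_efficiency() is a hypothetical
helper and not part of MMTests or the kernel:

```c
#include <assert.h>

/*
 * Reclaim efficiency as a truncated integer percentage: pages reclaimed
 * per 100 pages scanned. Hypothetical helper for illustration only.
 */
static unsigned long reclaim_efficiency(unsigned long reclaimed,
					unsigned long scanned)
{
	assert(scanned > 0);
	return reclaimed * 100 / scanned;
}
```

For example, the patches:1-3 direct reclaim column gives
reclaim_efficiency(107873, 118792), which truncates to 90, and the
patches:1-5 kswapd column gives reclaim_efficiency(1357820, 1406642),
which truncates to 96.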

include/linux/compaction.h | 4 +-
include/linux/mm.h | 1 +
include/linux/mmzone.h | 4 ++
mm/compaction.c | 142 +++++++++++++++++++++++++++++++++++++-------
mm/internal.h | 7 +++
mm/page_alloc.c | 68 ++++++++++++++++-----
mm/vmscan.c | 10 ++++
7 files changed, 197 insertions(+), 39 deletions(-)

Mel Gorman

unread,

Aug 8, 2012, 3:10:02 PM8/8/12

to

The comment about order applied when the check was
order > PAGE_ALLOC_COSTLY_ORDER which has not been the case since
[c5a73c3d: thp: use compaction for all allocation orders]. Fixing
the comment while I'm in the general area.

Signed-off-by: Mel Gorman <mgo...@suse.de>
Reviewed-by: Rik van Riel <ri...@redhat.com>
Reviewed-by: Minchan Kim <min...@kernel.org>

---
mm/compaction.c | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index b39ede1..95ca967 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c

@@ -759,11 +759,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
struct zone *zone;
int rc = COMPACT_SKIPPED;

- /*
- * Check whether it is worth even starting compaction. The order check is
- * made because an assumption is made that the page allocator can satisfy
- * the "cheaper" orders without taking special steps
- */
+ /* Check if the GFP flags allow compaction */
if (!order || !may_enter_fs || !may_perform_io)
return rc;

Mel Gorman

unread,

Aug 8, 2012, 3:20:01 PM8/8/12

to

If allocation fails after compaction then compaction may be deferred for
a number of allocation attempts. If there are subsequent failures,
compact_defer_shift is increased to defer for longer periods. This patch
uses that information to scale the number of pages reclaimed with
compact_defer_shift until allocations succeed again.

Signed-off-by: Mel Gorman <mgo...@suse.de>

Acked-by: Rik van Riel <ri...@redhat.com>
---

mm/vmscan.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 66e4310..0cb2593 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c

@@ -1708,6 +1708,7 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
{
unsigned long pages_for_compaction;
unsigned long inactive_lru_pages;
+ struct zone *zone;

/* If not in reclaim/compaction mode, stop */
if (!in_reclaim_compaction(sc))

@@ -1741,6 +1742,15 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
* inactive lists are large enough, continue reclaiming
*/
pages_for_compaction = (2UL << sc->order);
+
+ /*
+ * If compaction is deferred for this order then scale the number of
+ * pages reclaimed based on the number of consecutive allocation
+ * failures
+ */
+ zone = lruvec_zone(lruvec);
+ if (zone->compact_order_failed >= sc->order)
+ pages_for_compaction <<= zone->compact_defer_shift;
inactive_lru_pages = get_lru_size(lruvec, LRU_INACTIVE_FILE);
if (nr_swap_pages > 0)
inactive_lru_pages += get_lru_size(lruvec, LRU_INACTIVE_ANON);

Minchan Kim

unread,

Aug 8, 2012, 8:00:02 PM8/8/12

to

Agreed.
I hope it will be added to the changelog.

>
> > Other processes steal the pages reclaimed?
>
> Or the page it reclaimed were in pageblocks that could not be used.
>
> > Why I ask a question is that I want to know what's the problem at current
> > VM.
> >
>
> We cannot reliably tell in advance whether compaction is going to succeed
> in the future without doing a full scan of the zone which would be both
> very heavy and race with any allocation requests. Compaction needs free
> pages to succeed so the intention is to scale the number of pages reclaimed
> with the number of recent compaction failures.

this order? sc->order?

> + * pages reclaimed based on the number of consecutive allocation
> + * failures
> + */
> + zone = lruvec_zone(lruvec);
> + if (zone->compact_order_failed >= sc->order)

I can't understand this part.
We don't defer orders lower than compact_order_failed since aff62249.
Do you mean a lower order compaction context should be a lamb for the
success of a deferred higher order allocation request? I think it's not fair
and I can't even understand the rationale for scaling the number of pages
reclaimed with the number of recent compaction failures.
Your changelog just says "What we have to do, NOT Why we have to do".

> + pages_for_compaction <<= zone->compact_defer_shift;

> inactive_lru_pages = get_lru_size(lruvec, LRU_INACTIVE_FILE);
> if (nr_swap_pages > 0)
> inactive_lru_pages += get_lru_size(lruvec, LRU_INACTIVE_ANON);
> --
> 1.7.9.2
>

>

> --
> Mel Gorman
> SUSE Labs
>
> --

> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majo...@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"do...@kvack.org"> em...@kvack.org </a>

--
Kind regards,
Minchan Kim

Minchan Kim

unread,

Aug 8, 2012, 8:20:01 PM8/8/12

to

Hi Mel,

Okay until here.

>
> If a scanner has not wrapped when it has finished isolated pages it
> checks if compact_cached_free_pfn is pointing to the end of the
> zone. If so, the value is updated to point to the highest
> pageblock that pages were isolated from. This value will not
> be updated again until a free page scanner wraps and resets
> compact_cached_free_pfn.

I tried to understand your intention of this part but unfortunately failed.
By this part, the problem you mentioned could happen again?

                                                        C
Process A       M     S                                 F
                |---------------------------------------|
Process B       M       FS

C is zone->compact_cached_free_pfn
S is cc->start_free_pfn
M is cc->migrate_pfn
F is cc->free_pfn

In this diagram, Process A has just reached its migrate scanner, wrapped
around and updated compact_cached_free_pfn to end of the zone accordingly.

Simultaneously, Process B finishes isolating in a block, peeks at the
compact_cached_free_pfn position, sees that it is the end of the zone and so
updates compact_cached_free_pfn to the highest pageblock that pages were
isolated from.

Process A updates compact_cached_free_pfn to the highest pageblock which
was set by process B because process A has wrapped. It ends up as a big jump
without any scanning in process A.

No?


--
Kind regards,
Minchan Kim

Minchan Kim

unread,

Aug 8, 2012, 9:40:01 PM8/8/12

to

Hi Mel,

Just one question below.

Why do we capture only when we migrate MIGRATE_MOVABLE type?
If you have a reason, it should have been added as a comment.


--
Kind regards,
Minchan Kim

Mel Gorman

unread,

Aug 9, 2012, 4:00:03 AM8/9/12

to

I guess but it's not really part of this patch is it? The decision on
where to drive should_continue_reclaim from was made in commit [3e7d3449:
mm: vmscan: reclaim order-0 and use compaction instead of lumpy reclaim].

Anyway changelog now reads as

If allocation fails after compaction then compaction may be deferred
for a number of allocation attempts. If there are subsequent failures,
compact_defer_shift is increased to defer for longer periods. This
patch uses that information to scale the number of pages reclaimed with
compact_defer_shift until allocations succeed again. The rationale is
that reclaiming the normal number of pages still allowed compaction to
fail and its success depends on the number of pages. If it's failing,
reclaim more pages until it succeeds again.

Note that this is not implying that VM reclaim is not reclaiming enough
pages or that its logic is broken. try_to_free_pages() always asks for
SWAP_CLUSTER_MAX pages to be reclaimed regardless of order and that is
what it does. Direct reclaim stops normally with this check.

if (sc->nr_reclaimed >= sc->nr_to_reclaim)
goto out;

should_continue_reclaim delays when that check is made until a minimum number
of pages for reclaim/compaction are reclaimed. It is possible that this patch
could instead set nr_to_reclaim in try_to_free_pages() and drive it from
there but that behaves differently and not necessarily for the better.

If driven from do_try_to_free_pages(), it is also possible that priorities
will rise. When they reach DEF_PRIORITY-2, it will also start stalling
and setting pages for immediate reclaim which is more disruptive than is
desirable in this case. That is a more wide-reaching change that could
cause another regression related to THP requests causing interactive jitter.

> >

yes. Comment changed to clarify.

> > + * pages reclaimed based on the number of consecutive allocation
> > + * failures
> > + */
> > + zone = lruvec_zone(lruvec);
> > + if (zone->compact_order_failed >= sc->order)
>
> I can't understand this part.
> We don't defer lower order than compact_order_failed by aff62249.
> Do you mean lower order compaction context should be a lamb for
> deferred higher order allocation request success? I think it's not fair
> and even I can't understand rationale why it has to scale the number of pages
> reclaimed with the number of recent compaction failture.
> Your changelog just says "What we have to do, NOT Why we have to do".
>

I'm a moron, that should be <=, not >=. All my tests were based on order==9
and that was the only order using reclaim/compaction so it happened to
work as expected. Thanks! I fixed that and added the following
clarification to the changelog

The rationale is that reclaiming the normal number of pages still allowed
compaction to fail and its success depends on the number of pages. If it's
failing, reclaim more pages until it succeeds again.

Does that make more sense?
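
With the comparison corrected to <= as described, the scaled target could be
modelled as follows. This is a stand-alone sketch rather than the kernel
code: pages_for_compaction() here is a hypothetical free function that takes
the zone fields as plain arguments.

```c
#include <assert.h>

/*
 * Hypothetical model of the reclaim target computed in
 * should_continue_reclaim(), using the corrected "<=" comparison.
 */
static unsigned long pages_for_compaction(int order,
					  int compact_order_failed,
					  unsigned int compact_defer_shift)
{
	unsigned long pages;

	assert(order >= 0);
	pages = 2UL << order;

	/*
	 * Scale only when the order that failed is at or below the order
	 * being requested, i.e. compaction deferral applies to this request.
	 */
	if (compact_order_failed <= order)
		pages <<= compact_defer_shift;

	return pages;
}
```

For an order-9 request with no deferral the target is 1024 pages; with
compact_defer_shift at 3 it becomes 8192. An order-3 request is left at 16
pages while only order-9 compaction is failing.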

>
> > + pages_for_compaction <<= zone->compact_defer_shift;
>
>
> > inactive_lru_pages = get_lru_size(lruvec, LRU_INACTIVE_FILE);
> > if (nr_swap_pages > 0)
> > inactive_lru_pages += get_lru_size(lruvec, LRU_INACTIVE_ANON);

--

Mel Gorman
SUSE Labs
--

Mel Gorman

unread,

Aug 9, 2012, 4:20:02 AM8/9/12

to

On Thu, Aug 09, 2012 at 10:33:58AM +0900, Minchan Kim wrote:
> Hi Mel,
>
> Just one questoin below.
>

Sure! Your questions usually get me thinking about the right part of the
series, this series in particular :)

> > <SNIP>

> > @@ -708,6 +750,10 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> > goto out;
> > }
> > }
> > +
> > + /* Capture a page now if it is a suitable size */
>
> Why do we capture only when we migrate MIGRATE_MOVABLE type?
> If you have a reasone, it should have been added as comment.
>

Good question and there is an answer. However, I also spotted a problem when
thinking about this more where !MIGRATE_MOVABLE allocations are forced to
do a full compaction. The simple solution would be to only set cc->page for
MIGRATE_MOVABLE but there is a better approach that I've implemented in the
patch below. It includes a comment that should answer your question. Does
this make sense to you?

diff --git a/mm/compaction.c b/mm/compaction.c
index 63af8d2..384164e 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -53,13 +53,31 @@ static inline bool migrate_async_suitable(int migratetype)
static void compact_capture_page(struct compact_control *cc)
{
unsigned long flags;
- int mtype;
+ int mtype, mtype_low, mtype_high;

if (!cc->page || *cc->page)
return;

+ /*
+ * For MIGRATE_MOVABLE allocations we capture a suitable page ASAP
+ * regardless of the migratetype of the freelist is is captured from.
+ * This is fine because the order for a high-order MIGRATE_MOVABLE
+ * allocation is typically at least a pageblock size and overall
+ * fragmentation is not impaired. Other allocation types must
+ * capture pages from their own migratelist because otherwise they
+ * could pollute other pageblocks like MIGRATE_MOVABLE with
+ * difficult to move pages and making fragmentation worse overall.
+ */
+ if (cc->migratetype == MIGRATE_MOVABLE) {
+ mtype_low = 0;
+ mtype_high = MIGRATE_PCPTYPES;
+ } else {
+ mtype_low = cc->migratetype;
+ mtype_high = cc->migratetype + 1;
+ }
+
/* Speculatively examine the free lists without zone lock */
- for (mtype = 0; mtype < MIGRATE_PCPTYPES; mtype++) {
+ for (mtype = mtype_low; mtype < mtype_high; mtype++) {
int order;
for (order = cc->order; order < MAX_ORDER; order++) {
struct page *page;
@@ -752,8 +770,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
}

/* Capture a page now if it is a suitable size */
- if (cc->migratetype == MIGRATE_MOVABLE)
- compact_capture_page(cc);
+ compact_capture_page(cc);
}

out:
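
The free-list range chosen by the patch above can be sketched in isolation.
The enum values are an assumption meant to mirror the 3.5-era migratetype
enum, and capture_range() and struct mtype_range are hypothetical names used
only for this illustration.

```c
#include <assert.h>

/* Assumed to match the first entries of the 3.5-era migratetype enum. */
enum {
	MIGRATE_UNMOVABLE,
	MIGRATE_RECLAIMABLE,
	MIGRATE_MOVABLE,
	MIGRATE_PCPTYPES,
};

/* Half-open range [low, high) of free lists that may be searched. */
struct mtype_range {
	int low;
	int high;
};

static struct mtype_range capture_range(int migratetype)
{
	struct mtype_range r;

	assert(migratetype >= 0 && migratetype < MIGRATE_PCPTYPES);

	if (migratetype == MIGRATE_MOVABLE) {
		/* Movable requests may capture from any per-cpu type list. */
		r.low = 0;
		r.high = MIGRATE_PCPTYPES;
	} else {
		/* Other requests only look at their own free list. */
		r.low = migratetype;
		r.high = migratetype + 1;
	}
	return r;
}
```

Under these assumptions a MIGRATE_MOVABLE request searches lists 0 to 2,
while a MIGRATE_RECLAIMABLE request searches list 1 only.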

Minchan Kim

unread,

Aug 9, 2012, 4:30:01 AM8/9/12

to

If compaction is deferred, requestors fail to get a high-order page and
they normally fall back to order-0 or something.
In this context, if they don't depend on fallback and retrying higher order
allocation, your patch makes sense to me because your algorithm is based on
past allocation request fail rate.
Do I miss something?

>
> >
> > > + pages_for_compaction <<= zone->compact_defer_shift;
> >
> >
> > > inactive_lru_pages = get_lru_size(lruvec, LRU_INACTIVE_FILE);
> > > if (nr_swap_pages > 0)
> > > inactive_lru_pages += get_lru_size(lruvec, LRU_INACTIVE_ANON);
>
> --
> Mel Gorman
> SUSE Labs
>
> --


--
Kind regards,
Minchan Kim

Mel Gorman

unread,

Aug 9, 2012, 4:30:01 AM8/9/12

to

On Thu, Aug 09, 2012 at 09:12:12AM +0900, Minchan Kim wrote:
> > <SNIP>

> >
> > Second, it updates compact_cached_free_pfn in a more limited set of
> > circumstances.
> >
> > If a scanner has wrapped, it updates compact_cached_free_pfn to the end
> > of the zone. When a wrapped scanner isolates a page, it updates
> > compact_cached_free_pfn to point to the highest pageblock it
> > can isolate pages from.
>
> Okay until here.
>

Great.

> >
> > If a scanner has not wrapped when it has finished isolated pages it
> > checks if compact_cached_free_pfn is pointing to the end of the
> > zone. If so, the value is updated to point to the highest
> > pageblock that pages were isolated from. This value will not
> > be updated again until a free page scanner wraps and resets
> > compact_cached_free_pfn.
>
> I tried to understand your intention of this part but unfortunately failed.
> By this part, the problem you mentioned could happen again?
>

Potentially yes, I did say it still races in the changelog.

>                                                         C
> Process A       M     S                                 F
>                 |---------------------------------------|
> Process B       M       FS
>
> C is zone->compact_cached_free_pfn
> S is cc->start_free_pfn
> M is cc->migrate_pfn
> F is cc->free_pfn
>
> In this diagram, Process A has just reached its migrate scanner, wrapped
> around and updated compact_cached_free_pfn to end of the zone accordingly.
>

Yes. Now that it has wrapped it updates the compact_cached_free_pfn
every loop of isolate_freepages here.

	if (isolated) {
		high_pfn = max(high_pfn, pfn);

		/*
		 * If the free scanner has wrapped, update
		 * compact_cached_free_pfn to point to the highest
		 * pageblock with free pages. This reduces excessive
		 * scanning of full pageblocks near the end of the
		 * zone
		 */
		if (cc->order > 0 && cc->wrapped)
			zone->compact_cached_free_pfn = high_pfn;
	}

> Simultaneously, Process B finishes isolating in a block and peek
> compact_cached_free_pfn position and know it's end of the zone so
> update compact_cached_free_pfn to highest pageblock that pages were
> isolated from.
>

Yes, they race at this point. One of two things happens here and I agree
that this is racy

1. Process A does another iteration of its loop and sets it back
2. Process A does not do another iteration of the loop, the cached_pfn
is further along than it should be. The next compacting process will
wrap early and reset cached_pfn again but continue to scan the zone.

Either option is relatively harmless because in both cases the zone gets
scanned. In patch 4 it was possible that large portions of the zone were
frequently missed.

> Process A updates compact_cached_free_pfn to the highest pageblock which
> was set by process B because process A has wrapped. It ends up big jump
> without any scanning in process A.
>

It recovers quickly and is nowhere near as severe as what patch 4
suffers from.
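
The update rules discussed in this reply can be modelled with a few lines of
C. This is only a sketch: struct cached_zone and the three helpers are
hypothetical names, end_pfn stands in for what start_free_pfn(zone) returns,
and no locking or real scanning is modelled.

```c
#include <assert.h>

/* Hypothetical model of one zone's cached free-scanner position. */
struct cached_zone {
	unsigned long end_pfn;		/* stands in for start_free_pfn(zone) */
	unsigned long cached_free_pfn;	/* models zone->compact_cached_free_pfn */
};

/* A scanner wraps: the cached position is reset to the end of the zone. */
static void scanner_wraps(struct cached_zone *z)
{
	z->cached_free_pfn = z->end_pfn;
}

/* A wrapped scanner isolated pages: always record its highest pageblock. */
static void wrapped_scanner_isolated(struct cached_zone *z,
				     unsigned long high_pfn)
{
	assert(high_pfn <= z->end_pfn);
	z->cached_free_pfn = high_pfn;
}

/* A non-wrapped scanner finished: only fill in a freshly reset value. */
static void unwrapped_scanner_finished(struct cached_zone *z,
				       unsigned long high_pfn)
{
	assert(high_pfn <= z->end_pfn);
	if (z->cached_free_pfn == z->end_pfn)
		z->cached_free_pfn = high_pfn;
}
```

In this model the race reads as: Process A calls scanner_wraps(), Process B
then calls unwrapped_scanner_finished() and moves the cached value, and if
Process A goes around its loop again, wrapped_scanner_isolated() sets it
back (outcome 1 above).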

--
Mel Gorman
SUSE Labs

Minchan Kim

unread,

Aug 9, 2012, 4:40:02 AM8/9/12

to

On Thu, Aug 09, 2012 at 09:11:20AM +0100, Mel Gorman wrote:
> On Thu, Aug 09, 2012 at 10:33:58AM +0900, Minchan Kim wrote:
> > Hi Mel,
> >
> > Just one questoin below.
> >
>
> Sure! Your questions usually get me thinking about the right part of the
> series, this series in particular :)
>
> > > <SNIP>
> > > @@ -708,6 +750,10 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> > > goto out;
> > > }
> > > }
> > > +
> > > + /* Capture a page now if it is a suitable size */
> >
> > Why do we capture only when we migrate MIGRATE_MOVABLE type?
> > If you have a reasone, it should have been added as comment.
> >
>
> Good question and there is an answer. However, I also spotted a problem when
> thinking about this more where !MIGRATE_MOVABLE allocations are forced to
> do a full compaction. The simple solution would be to only set cc->page for
> MIGRATE_MOVABLE but there is a better approach that I've implemented in the
> patch below. It includes a comment that should answer your question. Does
> this make sense to you?

It does make sense.
I will add my Reviewed-by in your next spin which includes below patch.

Thanks, Mel.

>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 63af8d2..384164e 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -53,13 +53,31 @@ static inline bool migrate_async_suitable(int migratetype)
> static void compact_capture_page(struct compact_control *cc)
> {
> unsigned long flags;
> - int mtype;
> + int mtype, mtype_low, mtype_high;
>
> if (!cc->page || *cc->page)
> return;
>
> + /*
> + * For MIGRATE_MOVABLE allocations we capture a suitable page ASAP
> + * regardless of the migratetype of the freelist is is captured from.

                                                    ^  ^
typo? ("is is")

--
Kind regards,
Minchan Kim

Minchan Kim

unread,

Aug 9, 2012, 4:50:02 AM8/9/12

to

On Wed, Aug 08, 2012 at 08:08:44PM +0100, Mel Gorman wrote:

Reviewed-by: Minchan Kim <min...@kernel.org>

--
Kind regards,
Minchan Kim

Minchan Kim

unread,

Aug 9, 2012, 4:50:02 AM8/9/12

to

Agreed.
Thanks, Mel.

--
Kind regards,
Minchan Kim

Mel Gorman

unread,

Aug 9, 2012, 5:30:02 AM8/9/12

to

On Thu, Aug 09, 2012 at 05:27:15PM +0900, Minchan Kim wrote:
> > > > + * pages reclaimed based on the number of consecutive allocation
> > > > + * failures
> > > > + */
> > > > + zone = lruvec_zone(lruvec);
> > > > + if (zone->compact_order_failed >= sc->order)
> > >
> > > I can't understand this part.
> > > We don't defer lower order than compact_order_failed by aff62249.
> > > Do you mean lower order compaction context should be a lamb for
> > > deferred higher order allocation request success? I think it's not fair
> > > and even I can't understand rationale why it has to scale the number of pages
> > > reclaimed with the number of recent compaction failture.
> > > Your changelog just says "What we have to do, NOT Why we have to do".
> > >
> >
> > I'm a moron, that should be <=, not >=. All my tests were based on order==9
> > and that was the only order using reclaim/compaction so it happened to
> > work as expected. Thanks! I fixed that and added the following
> > clarification to the changelog
> >
> > The rationale is that reclaiming the normal number of pages still allowed
> > compaction to fail and its success depends on the number of pages. If it's
> > failing, reclaim more pages until it succeeds again.
> >
> > Does that make more sense?
>
> If compaction is defered, requestors fails to get high-order page and
> they normally do fallback by order-0 or something.

Yes. At least, one hopes they fell back to order-0.

> In this context, if they don't depends on fallback and retrying higher order
> allocation, your patch makes sense to me because your algorithm is based on
> past allocation request fail rate.
> Do I miss something?

Your question is difficult to parse but I think you are making an implicit
assumption that it's the same caller retrying the high order allocation.
That is not the case, nor do I want it to be because that would be similar
to the caller using __GFP_REPEAT. Retrying with more reclaim until the
allocation succeeds would both stall and reclaim excessively.

The intention is that an allocation can fail but each subsequent attempt will
try harder until there is success. Each allocation request does a portion
of the necessary work to spread the cost between multiple requests. Take
THP for example where there is a constant request for THP allocations
for whatever reason (heavy fork workload, large buffer allocation being
populated etc.). Some of those allocations fail but if they do, future
THP requests will reclaim more pages. When compaction resumes again, it
will be more likely to succeed and compact_defer_shift gets reset. In the
specific case of THP there will be allocations that fail but khugepaged
will promote them later if the process is long-lived.
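
The escalation described here can be illustrated with a small model. It is a
sketch under stated assumptions: struct defer_model and its helpers are
hypothetical, and the cap of 6 is assumed to mirror COMPACT_MAX_DEFER_SHIFT
rather than taken from this thread.

```c
#include <assert.h>

/* Assumed cap on the shift, meant to mirror COMPACT_MAX_DEFER_SHIFT. */
#define MODEL_MAX_DEFER_SHIFT 6

/* Hypothetical stand-in for the per-zone deferral state. */
struct defer_model {
	unsigned int compact_defer_shift;
};

/* A high-order allocation failed after compaction: defer for longer. */
static void model_alloc_failed(struct defer_model *z)
{
	if (z->compact_defer_shift < MODEL_MAX_DEFER_SHIFT)
		z->compact_defer_shift++;
}

/* A high-order allocation succeeded: return to the normal target. */
static void model_alloc_succeeded(struct defer_model *z)
{
	z->compact_defer_shift = 0;
}

/* Reclaim target that the next request of this order would use. */
static unsigned long model_reclaim_target(const struct defer_model *z,
					  int order)
{
	assert(order >= 0);
	return (2UL << order) << z->compact_defer_shift;
}
```

For order 9 the target starts at 1024 pages and doubles with each recorded
failure (2048, 4096, ...) until the assumed cap is reached at 65536 pages. A
success drops it back to 1024, so no single caller pays the whole cost.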

--
Mel Gorman
SUSE Labs
--

Mel Gorman

unread,

Aug 9, 2012, 9:50:01 AM8/9/12

to

commit [7db8889a: mm: have order > 0 compaction start off where it left]
introduced a caching mechanism to reduce the amount of work the free page
scanner does in compaction. However, it has a problem. Consider two processes
simultaneously scanning free pages

                                                        C
Process A       M     S                                 F
                |---------------------------------------|
Process B       M       FS

C is zone->compact_cached_free_pfn
S is cc->start_free_pfn
M is cc->migrate_pfn
F is cc->free_pfn

In this diagram, Process A has just reached its migrate scanner, wrapped
around and updated compact_cached_free_pfn accordingly.

Simultaneously, Process B finishes isolating in a block and updates
compact_cached_free_pfn again to the location of its free scanner.

Process A moves to "end_of_zone - one_pageblock" and runs this check

if (cc->order > 0 && (!cc->wrapped ||
zone->compact_cached_free_pfn >
cc->start_free_pfn))
pfn = min(pfn, zone->compact_cached_free_pfn);

compact_cached_free_pfn is above where it started so the free scanner skips
almost the entire space it should have scanned. When there are multiple
processes compacting it can end in a situation where the entire zone is
not being scanned at all. Further, it is possible for two processes to
ping-pong updates to compact_cached_free_pfn, which leaves its value
effectively random.

Overall, the end result wrecks allocation success rates.
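
The effect of the check quoted above can be reproduced with a small model.
skip_ahead() is a hypothetical helper that applies the same condition for an
order > 0 compaction; the pfn values in the example are made up.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the skip check for an order > 0 compaction.
 * Returns the pfn the free scanner would continue from.
 */
static unsigned long skip_ahead(unsigned long pfn, bool wrapped,
				unsigned long cached_free_pfn,
				unsigned long start_free_pfn)
{
	/* A wrapped scanner is still working its way down to its start. */
	assert(!wrapped || pfn >= start_free_pfn);

	if (!wrapped || cached_free_pfn > start_free_pfn)
		return pfn < cached_free_pfn ? pfn : cached_free_pfn;
	return pfn;
}
```

Suppose Process A started at pfn 10240, wrapped, and is now at pfn 1048064
near the end of the zone, while Process B has just set the cached value to
11264. skip_ahead(1048064, true, 11264, 10240) returns 11264, so A jumps to
just above its starting point and almost nothing between there and the end
of the zone is scanned.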

There is not an obvious way around this problem without introducing new
locking and state so this patch takes a different approach.

First, it gets rid of the skip logic. It is not clear that it matters
if two free scanners happen to be in the same block, but with racing updates
it is too easy for a scanner to skip over blocks it should not.

Second, it updates compact_cached_free_pfn in a more limited set of
circumstances.

If a scanner has wrapped, it updates compact_cached_free_pfn to the end
of the zone. When a wrapped scanner isolates a page, it updates
compact_cached_free_pfn to point to the highest pageblock it
can isolate pages from.

If a scanner has not wrapped when it has finished isolating pages it
checks if compact_cached_free_pfn is pointing to the end of the
zone. If so, the value is updated to point to the highest
pageblock that pages were isolated from. This value will not
be updated again until a free page scanner wraps and resets
compact_cached_free_pfn.

This is not optimal and it can still race but the compact_cached_free_pfn
will be pointing to or very near a pageblock with free pages.

Signed-off-by: Mel Gorman <mgo...@suse.de>
Reviewed-by: Rik van Riel <ri...@redhat.com>
Reviewed-by: Minchan Kim <min...@kernel.org>

---
mm/compaction.c | 54 ++++++++++++++++++++++++++++--------------------------
1 file changed, 28 insertions(+), 26 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index a806a9c..c2d0958 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -437,6 +437,20 @@ static bool suitable_migration_target(struct page *page)

}

/*
+ * Returns the start pfn of the last page block in a zone. This is the starting
+ * point for full compaction of a zone. Compaction searches for free pages from
+ * the end of each zone, while isolate_freepages_block scans forward inside each
+ * page block.
+ */
+static unsigned long start_free_pfn(struct zone *zone)
+{
+ unsigned long free_pfn;
+ free_pfn = zone->zone_start_pfn + zone->spanned_pages;
+ free_pfn &= ~(pageblock_nr_pages-1);
+ return free_pfn;
+}
+
+/*
* Based on information in the current compact_control, find blocks
* suitable for isolating free pages from and then isolate them.
*/

@@ -475,17 +489,6 @@ static void isolate_freepages(struct zone *zone,

pfn -= pageblock_nr_pages) {
unsigned long isolated;

- /*
- * Skip ahead if another thread is compacting in the area
- * simultaneously. If we wrapped around, we can only skip
- * ahead if zone->compact_cached_free_pfn also wrapped to
- * above our starting point.
- */
- if (cc->order > 0 && (!cc->wrapped ||
- zone->compact_cached_free_pfn >
- cc->start_free_pfn))
- pfn = min(pfn, zone->compact_cached_free_pfn);
-
if (!pfn_valid(pfn))
continue;

@@ -528,7 +531,15 @@ static void isolate_freepages(struct zone *zone,
*/
if (isolated) {
high_pfn = max(high_pfn, pfn);
- if (cc->order > 0)
+
+ /*
+ * If the free scanner has wrapped, update
+ * compact_cached_free_pfn to point to the highest
+ * pageblock with free pages. This reduces excessive
+ * scanning of full pageblocks near the end of the
+ * zone
+ */
+ if (cc->order > 0 && cc->wrapped)
zone->compact_cached_free_pfn = high_pfn;
}
}
@@ -538,6 +549,11 @@ static void isolate_freepages(struct zone *zone,

cc->free_pfn = high_pfn;
cc->nr_freepages = nr_freepages;
+
+ /* If compact_cached_free_pfn is reset then set it now */
+ if (cc->order > 0 && !cc->wrapped &&
+ zone->compact_cached_free_pfn == start_free_pfn(zone))
+ zone->compact_cached_free_pfn = high_pfn;
}

/*

@@ -625,20 +641,6 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,

return ISOLATE_SUCCESS;
}

-/*
- * Returns the start pfn of the last page block in a zone. This is the starting
- * point for full compaction of a zone. Compaction searches for free pages from
- * the end of each zone, while isolate_freepages_block scans forward inside each
- * page block.
- */
-static unsigned long start_free_pfn(struct zone *zone)
-{
- unsigned long free_pfn;
- free_pfn = zone->zone_start_pfn + zone->spanned_pages;
- free_pfn &= ~(pageblock_nr_pages-1);
- return free_pfn;
-}
-

static int compact_finished(struct zone *zone,
struct compact_control *cc)
{
--
1.7.9.2
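
The arithmetic in start_free_pfn(), which this patch only moves, can be
checked with a stand-alone version. round_down_to_pageblock_end() is a
hypothetical name, the zone fields are passed as plain arguments, and the
pageblock size of 512 pages in the example is an assumption.

```c
#include <assert.h>

/*
 * Same arithmetic as start_free_pfn(): take the pfn one past the end of
 * the zone and round it down to a pageblock boundary.
 */
static unsigned long round_down_to_pageblock_end(unsigned long zone_start_pfn,
						 unsigned long spanned_pages,
						 unsigned long pageblock_nr_pages)
{
	unsigned long free_pfn;

	/* The mask below only works for a power-of-two pageblock size. */
	assert(pageblock_nr_pages != 0 &&
	       (pageblock_nr_pages & (pageblock_nr_pages - 1)) == 0);

	free_pfn = zone_start_pfn + spanned_pages;
	free_pfn &= ~(pageblock_nr_pages - 1);
	return free_pfn;
}
```

A zone starting at pfn 4096 and spanning 1000000 pages ends at pfn 1004096,
which rounds down to 1004032. If the zone end is already aligned, as with a
zone starting at 0 and spanning 1048576 pages, the result is the end pfn
itself.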

Mel Gorman

unread,

Aug 9, 2012, 9:50:01 AM8/9/12

to

diff --git a/mm/compaction.c b/mm/compaction.c
index 384164e..a806a9c 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -475,6 +475,17 @@ static void isolate_freepages(struct zone *zone,

pfn -= pageblock_nr_pages) {
unsigned long isolated;

+ /*
+ * Skip ahead if another thread is compacting in the area
+ * simultaneously. If we wrapped around, we can only skip
+ * ahead if zone->compact_cached_free_pfn also wrapped to
+ * above our starting point.
+ */

+ if (cc->order > 0 && (!cc->wrapped ||
+ zone->compact_cached_free_pfn >
+ cc->start_free_pfn))
+ pfn = min(pfn, zone->compact_cached_free_pfn);
+
if (!pfn_valid(pfn))
continue;

@@ -515,8 +526,11 @@ static void isolate_freepages(struct zone *zone,

* looking for free pages, the search will restart here as
* page migration may have returned some pages to the allocator
*/
- if (isolated)

+ if (isolated) {
high_pfn = max(high_pfn, pfn);

+ if (cc->order > 0)
+ zone->compact_cached_free_pfn = high_pfn;
+ }
}

/* split_free_page does not map the pages */

@@ -611,6 +625,20 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
return ISOLATE_SUCCESS;
}

+/*

+ * Returns the start pfn of the last page block in a zone. This is the starting
+ * point for full compaction of a zone. Compaction searches for free pages from
+ * the end of each zone, while isolate_freepages_block scans forward inside each
+ * page block.
+ */
+static unsigned long start_free_pfn(struct zone *zone)
+{
+ unsigned long free_pfn;
+ free_pfn = zone->zone_start_pfn + zone->spanned_pages;
+ free_pfn &= ~(pageblock_nr_pages-1);
+ return free_pfn;
+}
+

static int compact_finished(struct zone *zone,
struct compact_control *cc)
{

@@ -619,8 +647,26 @@ static int compact_finished(struct zone *zone,

if (fatal_signal_pending(current))
return COMPACT_PARTIAL;

- /* Compaction run completes if the migrate and free scanner meet */
- if (cc->free_pfn <= cc->migrate_pfn)
+ /*
+ * A full (order == -1) compaction run starts at the beginning and
+ * end of a zone; it completes when the migrate and free scanner meet.
+ * A partial (order > 0) compaction can start with the free scanner
+ * at a random point in the zone, and may have to restart.
+ */
+ if (cc->free_pfn <= cc->migrate_pfn) {
+ if (cc->order > 0 && !cc->wrapped) {
+ /* We started partway through; restart at the end. */
+ unsigned long free_pfn = start_free_pfn(zone);
+ zone->compact_cached_free_pfn = free_pfn;
+ cc->free_pfn = free_pfn;
+ cc->wrapped = 1;
+ return COMPACT_CONTINUE;
+ }
+ return COMPACT_COMPLETE;
+ }
+
+ /* We wrapped around and ended up where we started. */
+ if (cc->wrapped && cc->free_pfn <= cc->start_free_pfn)
return COMPACT_COMPLETE;

/*

@@ -726,8 +772,15 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)

/* Setup to move all movable pages to the end of the zone */
cc->migrate_pfn = zone->zone_start_pfn;
- cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
- cc->free_pfn &= ~(pageblock_nr_pages-1);
+
+ if (cc->order > 0) {
+ /* Incremental compaction. Start where the last one stopped. */
+ cc->free_pfn = zone->compact_cached_free_pfn;
+ cc->start_free_pfn = cc->free_pfn;
+ } else {
+ /* Order == -1 starts at the end of the zone. */
+ cc->free_pfn = start_free_pfn(zone);
+ }

migrate_prep_local();

diff --git a/mm/internal.h b/mm/internal.h
index 9156714..064f6ef 100644
--- a/mm/internal.h
+++ b/mm/internal.h

@@ -118,8 +118,14 @@ struct compact_control {
unsigned long nr_freepages; /* Number of isolated free pages */
unsigned long nr_migratepages; /* Number of pages to migrate */
unsigned long free_pfn; /* isolate_freepages search base */
+ unsigned long start_free_pfn; /* where we started the search */
unsigned long migrate_pfn; /* isolate_migratepages search base */
bool sync; /* Synchronous migration */
+ bool wrapped; /* Order > 0 compactions are
+ incremental, once free_pfn
+ and migrate_pfn meet, we restart
+ from the top of the zone;
+ remember we wrapped around. */

int order; /* order a direct compactor needs */
int migratetype; /* MOVABLE, RECLAIMABLE etc */

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index adc3aa8..781d6e4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c

@@ -4425,6 +4425,11 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,

zone->spanned_pages = size;
zone->present_pages = realsize;
+#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+ zone->compact_cached_free_pfn = zone->zone_start_pfn +
+ zone->spanned_pages;
+ zone->compact_cached_free_pfn &= ~(pageblock_nr_pages-1);
+#endif
#ifdef CONFIG_NUMA
zone->node = nid;
zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio)
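
For readers following the pfn arithmetic, the initialisation of compact_cached_free_pfn above can be modelled in user space. This is only a minimal sketch, not kernel code: toy_end_pageblock_pfn() and its parameters are hypothetical names, and it assumes pageblock_nr_pages is a power of two, which is what the mask relies on.

```c
/* Toy model of the compact_cached_free_pfn initialisation (hypothetical names). */
static unsigned long toy_end_pageblock_pfn(unsigned long zone_start_pfn,
					   unsigned long spanned_pages,
					   unsigned long pageblock_nr_pages)
{
	/* One past the last pfn the zone spans. */
	unsigned long pfn = zone_start_pfn + spanned_pages;

	/* Round down to a pageblock boundary; valid only for power-of-two sizes. */
	pfn &= ~(pageblock_nr_pages - 1);
	return pfn;
}
```

A zone starting at pfn 4096 that spans 1500 pages with 512-page pageblocks would start its free scanner at pfn 5120 under this model.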

Mel Gorman

Aug 9, 2012, 9:50:02 AM

The comment about order applied when the check was
order > PAGE_ALLOC_COSTLY_ORDER which has not been the case since
[c5a73c3d: thp: use compaction for all allocation orders]. Fixing
the comment while I'm in the general area.

Signed-off-by: Mel Gorman <mgo...@suse.de>
Reviewed-by: Rik van Riel <ri...@redhat.com>
Reviewed-by: Minchan Kim <min...@kernel.org>
---

mm/compaction.c | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index b39ede1..95ca967 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c

@@ -759,11 +759,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
struct zone *zone;
int rc = COMPACT_SKIPPED;

- /*
- * Check whether it is worth even starting compaction. The order check is
- * made because an assumption is made that the page allocator can satisfy
- * the "cheaper" orders without taking special steps
- */
+ /* Check if the GFP flags allow compaction */
if (!order || !may_enter_fs || !may_perform_io)
return rc;

Mel Gorman

Aug 9, 2012, 9:50:02 AM

While compaction is migrating pages to free up large contiguous blocks for
allocation it races with other allocation requests that may steal these
blocks or break them up. This patch alters direct compaction to capture a
suitable free page as soon as it becomes available to reduce this race. It
uses similar logic to split_free_page() to ensure that watermarks are
still obeyed.

Signed-off-by: Mel Gorman <mgo...@suse.de>
Reviewed-by: Rik van Riel <ri...@redhat.com>

---

include/linux/compaction.h | 4 +-
include/linux/mm.h | 1 +

mm/compaction.c | 88 ++++++++++++++++++++++++++++++++++++++------
mm/internal.h | 1 +
mm/page_alloc.c | 63 +++++++++++++++++++++++--------
5 files changed, 128 insertions(+), 29 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 95ca967..384164e 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -50,6 +50,59 @@ static inline bool migrate_async_suitable(int migratetype)

return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE;
}

+static void compact_capture_page(struct compact_control *cc)
+{
+ unsigned long flags;

+ int mtype, mtype_low, mtype_high;

+
+ if (!cc->page || *cc->page)
+ return;
+
+ /*

+ * For MIGRATE_MOVABLE allocations we capture a suitable page ASAP
+ * regardless of the migratetype of the freelist it is captured from.
+ * This is fine because the order for a high-order MIGRATE_MOVABLE
+ * allocation is typically at least a pageblock size and overall
+ * fragmentation is not impaired. Other allocation types must
+ * capture pages from their own migratelist because otherwise they
+ * could pollute other pageblocks like MIGRATE_MOVABLE with
+ * difficult to move pages, making fragmentation worse overall.

+ */

+ if (cc->migratetype == MIGRATE_MOVABLE) {
+ mtype_low = 0;
+ mtype_high = MIGRATE_PCPTYPES;
+ } else {
+ mtype_low = cc->migratetype;
+ mtype_high = cc->migratetype + 1;
+ }
+
+ /* Speculatively examine the free lists without zone lock */

+ for (mtype = mtype_low; mtype < mtype_high; mtype++) {

@@ -561,7 +614,6 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,

static int compact_finished(struct zone *zone,
struct compact_control *cc)
{

- unsigned int order;
unsigned long watermark;

if (fatal_signal_pending(current))

@@ -586,14 +638,22 @@ static int compact_finished(struct zone *zone,

@@ -708,6 +768,9 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)

goto out;
}
}
+
+ /* Capture a page now if it is a suitable size */

+ compact_capture_page(cc);
}

out:
@@ -720,7 +783,7 @@ out:

static unsigned long compact_zone_order(struct zone *zone,
int order, gfp_t gfp_mask,
- bool sync)
+ bool sync, struct page **page)
{
struct compact_control cc = {
.nr_freepages = 0,

@@ -729,6 +792,7 @@ static unsigned long compact_zone_order(struct zone *zone,

.migratetype = allocflags_to_migratetype(gfp_mask),
.zone = zone,
.sync = sync,
+ .page = page,
};
INIT_LIST_HEAD(&cc.freepages);
INIT_LIST_HEAD(&cc.migratepages);

@@ -750,7 +814,7 @@ int sysctl_extfrag_threshold = 500;
*/

unsigned long try_to_compact_pages(struct zonelist *zonelist,

int order, gfp_t gfp_mask, nodemask_t *nodemask,
- bool sync)
+ bool sync, struct page **page)
{
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
int may_enter_fs = gfp_mask & __GFP_FS;

@@ -770,7 +834,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,

nodemask) {
int status;

- status = compact_zone_order(zone, order, gfp_mask, sync);
+ status = compact_zone_order(zone, order, gfp_mask, sync, page);
rc = max(status, rc);

/* If a normal allocation would succeed, stop compacting */

@@ -825,6 +889,7 @@ int compact_pgdat(pg_data_t *pgdat, int order)

struct compact_control cc = {
.order = order,
.sync = false,
+ .page = NULL,
};

return __compact_pgdat(pgdat, &cc);

@@ -835,6 +900,7 @@ static int compact_node(int nid)

struct compact_control cc = {
.order = -1,
.sync = true,
+ .page = NULL,
};

return __compact_pgdat(NODE_DATA(nid), &cc);

diff --git a/mm/internal.h b/mm/internal.h
index 2ba87fb..9156714 100644
--- a/mm/internal.h
+++ b/mm/internal.h

@@ -124,6 +124,7 @@ struct compact_control {

int order; /* order a direct compactor needs */
int migratetype; /* MOVABLE, RECLAIMABLE etc */

struct zone *zone;
+ struct page **page; /* Page captured of requested size */
};

unsigned long

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4a4f921..adc3aa8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
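
The free-list selection in compact_capture_page() above can be illustrated outside the kernel. The sketch below is a toy model under stated assumptions: the TOY_* enum values and toy_capture_range() are hypothetical stand-ins for the kernel's migratetype constants, and only the range selection is modelled, not the locking or the capture itself.

```c
/* Hypothetical stand-ins for the kernel migratetypes that have pcp lists. */
enum toy_migratetype {
	TOY_UNMOVABLE,
	TOY_RECLAIMABLE,
	TOY_MOVABLE,
	TOY_PCPTYPES	/* number of types above, used as an exclusive bound */
};

struct toy_range {
	int low;	/* first free list to examine */
	int high;	/* one past the last free list to examine */
};

/*
 * Movable requests may capture from any free list; every other type is
 * restricted to its own list so it cannot pollute other pageblocks.
 */
static struct toy_range toy_capture_range(int migratetype)
{
	struct toy_range r;

	if (migratetype == TOY_MOVABLE) {
		r.low = 0;
		r.high = TOY_PCPTYPES;
	} else {
		r.low = migratetype;
		r.high = migratetype + 1;
	}
	return r;
}
```

A caller would then walk the lists with for (mtype = r.low; mtype < r.high; mtype++), as the patch does.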

Mel Gorman

Aug 9, 2012, 10:00:03 AM

If allocation fails after compaction then compaction may be deferred for
a number of allocation attempts. If there are subsequent failures,
compact_defer_shift is increased to defer for longer periods. This patch
uses that information to scale the number of pages reclaimed with
compact_defer_shift until allocations succeed again. The rationale is
that reclaiming the normal number of pages still allowed compaction to
fail and its success depends on the number of pages. If it's failing,
reclaim more pages until it succeeds again.

Note that this is not implying that VM reclaim is not reclaiming enough
pages or that its logic is broken. try_to_free_pages() always asks for
SWAP_CLUSTER_MAX pages to be reclaimed regardless of order and that is
what it does. Direct reclaim stops normally with this check.

if (sc->nr_reclaimed >= sc->nr_to_reclaim)
goto out;

should_continue_reclaim delays when that check is made until a minimum number
of pages for reclaim/compaction are reclaimed. It is possible that this patch
could instead set nr_to_reclaim in try_to_free_pages() and drive it from
there but that behaves differently and not necessarily for the better. If
driven from do_try_to_free_pages(), it is also possible that priorities
will rise. When they reach DEF_PRIORITY-2, it will also start stalling
and setting pages for immediate reclaim which is more disruptive than is
desirable in this case. That is a more wide-reaching change that could
cause another regression related to THP requests causing interactive jitter.

Signed-off-by: Mel Gorman <mgo...@suse.de>
Acked-by: Rik van Riel <ri...@redhat.com>
---

mm/vmscan.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c

index 66e4310..7a43fd8 100644

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1708,6 +1708,7 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
{
unsigned long pages_for_compaction;
unsigned long inactive_lru_pages;
+ struct zone *zone;

/* If not in reclaim/compaction mode, stop */
if (!in_reclaim_compaction(sc))
@@ -1741,6 +1742,15 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
* inactive lists are large enough, continue reclaiming
*/
pages_for_compaction = (2UL << sc->order);

+
+ /*
+ * If compaction is deferred for sc->order then scale the number of

+ * pages reclaimed based on the number of consecutive allocation
+ * failures
+ */
+ zone = lruvec_zone(lruvec);
+ if (zone->compact_order_failed <= sc->order)

+ pages_for_compaction <<= zone->compact_defer_shift;
inactive_lru_pages = get_lru_size(lruvec, LRU_INACTIVE_FILE);
if (nr_swap_pages > 0)
inactive_lru_pages += get_lru_size(lruvec, LRU_INACTIVE_ANON);
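
The scaling added to should_continue_reclaim() above is easy to check with plain arithmetic. This is a user-space sketch under stated assumptions: toy_pages_for_compaction() is a hypothetical helper that takes the zone fields as plain parameters, and it models only the target calculation, not the LRU size checks that follow it.

```c
/*
 * Toy model of the reclaim target for reclaim/compaction.
 * Parameter names mirror the zone and scan_control fields in the patch.
 */
static unsigned long toy_pages_for_compaction(int order,
					      int compact_order_failed,
					      unsigned int compact_defer_shift)
{
	/* Baseline: twice the size of the requested allocation. */
	unsigned long pages = 2UL << order;

	/* Compaction has recently failed at this order or below: reclaim more. */
	if (compact_order_failed <= order)
		pages <<= compact_defer_shift;

	return pages;
}
```

For an order-9 request the baseline is 1024 pages; with compact_order_failed at 9 and compact_defer_shift at 2 the target grows to 4096 pages, while a failure recorded only at order 10 leaves it at 1024.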

Mel Gorman

Aug 9, 2012, 10:00:03 AM

Changelog since V2
o Capture !MIGRATE_MOVABLE pages where possible
o Document the treatment of MIGRATE_MOVABLE pages while capturing
o Expand changelogs

Changelog since V1
o Dropped kswapd related patch, basically a no-op and regresses if fixed (minchan)
o Expanded changelogs a little

Allocation success rates have been far lower since 3.4 due to commit
[fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled]. This
commit was introduced for good reasons and it was known in advance that
the success rates would suffer but it was justified on the grounds that
the high allocation success rates were achieved by aggressive reclaim.
Success rates are expected to suffer even more in 3.6 due to commit
[7db8889a: mm: have order > 0 compaction start off where it left] which
testing has shown to severely reduce allocation success rates under load -
to 0% in one case. There is a proposed change to that patch in this series
and it would be ideal if Jim Schutt could retest the workload that led to
commit [7db8889a: mm: have order > 0 compaction start off where it left].

This series aims to improve the allocation success rates without regressing
the benefits of commit fe2c2a10. The series is based on 3.5 and includes
the commit 7db8889a to illustrate what impact it has to success rates.

Patch 1 updates a stale comment seeing as I was in the general area.

Patch 2 updates reclaim/compaction to reclaim pages scaled on the number
of recent failures.

Patch 3 captures suitable high-order pages freed by compaction to reduce
races with parallel allocation requests.

Patch 4 is an upstream commit that has compaction restart free page scanning
from an old position instead of always starting from the end of the
zone.

Patch 5 adjusts patch 4 to restore allocation success rates.

STRESS-HIGHALLOC
                3.5.0-vanilla    patches:1-2      patches:1-3      patches:1-5
Pass 1          36.00 ( 0.00%)   56.00 (20.00%)   63.00 (27.00%)   58.00 (22.00%)
Pass 2          46.00 ( 0.00%)   64.00 (18.00%)   63.00 (17.00%)   58.00 (12.00%)
while Rested    84.00 ( 0.00%)   86.00 ( 2.00%)   85.00 ( 1.00%)   84.00 ( 0.00%)

From
http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__stress-highalloc-performance-ext3/hydra/comparison.html
I know that the allocation success rate in 3.3.6 was 78% in comparison
to 36% in 3.5. With the full series applied, the success rates are up to
around 60% with some variability in the results. This is not as high
a success rate as 3.3.6 but it does not reclaim excessively, which is a key point.

Previous tests on V1 of this series showed that patch 4 on its own adversely
affected high-order allocation success rates.

MMTests Statistics: vmstat
Page Ins                     3037580     2979316     2988160     2957716
Page Outs                    8026888     8027300     8031232     8041696
Swap Ins                           0           0           0           0
Swap Outs                          0           0           0           0

Note that swap in/out rates remain at 0. In 3.3.6 with 78% success rates
there were 71881 pages swapped out.

Direct pages scanned           97106      110003       80319      130947
Kswapd pages scanned         1231288     1372523     1498003     1392390
Kswapd pages reclaimed       1231221     1321591     1439185     1342106
Direct pages reclaimed         97100      102174       56267      125401
Kswapd efficiency                99%         96%         96%         96%
Kswapd velocity             1001.153    1060.896    1131.567    1103.189
Direct efficiency                99%         92%         70%         95%
Direct velocity               78.956      85.027      60.672     103.749

The direct reclaim and kswapd velocities change very little. kswapd velocity
is around the 1000 pages/sec mark whereas in kernel 3.3.6 with the high
allocation success rates it was 8140 pages/second.
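
The efficiency rows in the table above are consistent with pages reclaimed divided by pages scanned, truncated to a whole percentage. The sketch below assumes that is how the figure is derived; toy_reclaim_efficiency() is a hypothetical helper and not part of MMTests.

```c
/* Reclaim efficiency as a truncated integer percentage (assumed derivation). */
static unsigned int toy_reclaim_efficiency(unsigned long long reclaimed,
					   unsigned long long scanned)
{
	if (scanned == 0)
		return 0;	/* nothing scanned; avoid dividing by zero */

	/* Integer division truncates, so 92.88% is reported as 92. */
	return (unsigned int)(reclaimed * 100 / scanned);
}
```

With the 3.5.0-vanilla kswapd figures (1231221 reclaimed, 1231288 scanned) this gives 99, and with the patches:1-3 direct reclaim figures (56267 reclaimed, 80319 scanned) it gives 70, matching the table.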

include/linux/compaction.h | 4 +-
include/linux/mm.h | 1 +

Jim Schutt

Aug 9, 2012, 10:40:02 AM

Hi Mel,

On 08/09/2012 07:49 AM, Mel Gorman wrote:
> Changelog since V2
> o Capture !MIGRATE_MOVABLE pages where possible
> o Document the treatment of MIGRATE_MOVABLE pages while capturing
> o Expand changelogs
>
> Changelog since V1
> o Dropped kswapd related patch, basically a no-op and regresses if fixed (minchan)
> o Expanded changelogs a little
>
> Allocation success rates have been far lower since 3.4 due to commit
> [fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled]. This
> commit was introduced for good reasons and it was known in advance that
> the success rates would suffer but it was justified on the grounds that
> the high allocation success rates were achieved by aggressive reclaim.
> Success rates are expected to suffer even more in 3.6 due to commit
> [7db8889a: mm: have order> 0 compaction start off where it left] which
> testing has shown to severely reduce allocation success rates under load -
> to 0% in one case. There is a proposed change to that patch in this series
> and it would be ideal if Jim Schutt could retest the workload that led to
> commit [7db8889a: mm: have order> 0 compaction start off where it left].

I was successful at resolving my Ceph issue on 3.6-rc1, but ran
into some other issue that isn't immediately obvious, and prevents
me from testing your patch with 3.6-rc1. Today I will apply your
patch series to 3.5 and test that way.

Sorry for the delay.

-- Jim

Mel Gorman

Aug 9, 2012, 11:00:01 AM

No need to be sorry at all. I appreciate you taking the time and as
there were revisions since V1 you were better off waiting even if you
did not have the Ceph issue!

Thanks.

--
Mel Gorman
SUSE Labs

Jim Schutt

Aug 9, 2012, 2:20:01 PM

On 08/09/2012 07:49 AM, Mel Gorman wrote:

> Changelog since V2
> o Capture !MIGRATE_MOVABLE pages where possible
> o Document the treatment of MIGRATE_MOVABLE pages while capturing
> o Expand changelogs
>
> Changelog since V1
> o Dropped kswapd related patch, basically a no-op and regresses if fixed (minchan)
> o Expanded changelogs a little
>
> Allocation success rates have been far lower since 3.4 due to commit
> [fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled]. This
> commit was introduced for good reasons and it was known in advance that
> the success rates would suffer but it was justified on the grounds that
> the high allocation success rates were achieved by aggressive reclaim.
> Success rates are expected to suffer even more in 3.6 due to commit
> [7db8889a: mm: have order> 0 compaction start off where it left] which
> testing has shown to severely reduce allocation success rates under load -
> to 0% in one case. There is a proposed change to that patch in this series
> and it would be ideal if Jim Schutt could retest the workload that led to
> commit [7db8889a: mm: have order> 0 compaction start off where it left].

On my first test of this patch series on top of 3.5, I ran into an
instance of what I think is the sort of thing that patch 4/5 was
fixing. Here's what vmstat had to say during that period:

----------

2012-08-09 11:58:04.107-06:00
vmstat -w 4 16
procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
r b swpd free buff cache si so bi bo in cs us sy id wa st
20 14 0 235884 576 38916072 0 0 12 17047 171 133 3 8 85 4 0
18 17 0 220272 576 38955912 0 0 86 2131838 200142 162956 12 38 31 19 0
17 9 0 244284 576 38955328 0 0 19 2179562 213775 167901 13 43 26 18 0
27 15 0 223036 576 38952640 0 0 24 2202816 217996 158390 14 47 25 15 0
17 16 0 233124 576 38959908 0 0 5 2268815 224647 165728 14 50 21 15 0
16 13 0 225840 576 38995740 0 0 52 2253829 216797 160551 14 47 23 16 0
22 13 0 260584 576 38982908 0 0 92 2196737 211694 140924 14 53 19 15 0
16 10 0 235784 576 38917128 0 0 22 2157466 210022 137630 14 54 19 14 0
12 13 0 214300 576 38923848 0 0 31 2187735 213862 142711 14 52 20 14 0
25 12 0 219528 576 38919540 0 0 11 2066523 205256 142080 13 49 23 15 0
26 14 0 229460 576 38913704 0 0 49 2108654 200692 135447 13 51 21 15 0
11 11 0 220376 576 38862456 0 0 45 2136419 207493 146813 13 49 22 16 0
36 12 0 229860 576 38869784 0 0 7 2163463 212223 151812 14 47 25 14 0
16 13 0 238356 576 38891496 0 0 67 2251650 221728 154429 14 52 20 14 0
65 15 0 211536 576 38922108 0 0 59 2237925 224237 156587 14 53 19 14 0
24 13 0 585024 576 38634024 0 0 37 2240929 229040 148192 15 61 14 10 0

2012-08-09 11:59:04.714-06:00
vmstat -w 4 16
procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
r b swpd free buff cache si so bi bo in cs us sy id wa st
43 8 0 794392 576 38382316 0 0 11 20491 576 420 3 10 82 4 0
127 6 0 579328 576 38422156 0 0 21 2006775 205582 119660 12 70 11 7 0
44 5 0 492860 576 38512360 0 0 46 1536525 173377 85320 10 78 7 4 0
218 9 0 585668 576 38271320 0 0 39 1257266 152869 64023 8 83 7 3 0
101 6 0 600168 576 38128104 0 0 10 1438705 160769 68374 9 84 5 3 0
62 5 0 597004 576 38098972 0 0 93 1376841 154012 63912 8 82 7 4 0
61 11 0 850396 576 37808772 0 0 46 1186816 145731 70453 7 78 9 6 0
124 7 0 437388 576 38126320 0 0 15 1208434 149736 57142 7 86 4 3 0
204 11 0 1105816 576 37309532 0 0 20 1327833 145979 52718 7 87 4 2 0
29 8 0 751020 576 37360332 0 0 8 1405474 169916 61982 9 85 4 2 0
38 7 0 626448 576 37333244 0 0 14 1328415 174665 74214 8 84 5 3 0
23 5 0 650040 576 37134280 0 0 28 1351209 179220 71631 8 85 5 2 0
40 10 0 610988 576 37054292 0 0 104 1272527 167530 73527 7 85 5 3 0
79 22 0 2076836 576 35487340 0 0 750 1249934 175420 70124 7 88 3 2 0
58 6 0 431068 576 36934140 0 0 1000 1366234 169675 72524 8 84 5 3 0
134 9 0 574692 576 36784980 0 0 1049 1305543 152507 62639 8 84 4 4 0

2012-08-09 12:00:09.137-06:00
vmstat -w 4 16
procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
r b swpd free buff cache si so bi bo in cs us sy id wa st
163 8 0 464308 576 36791368 0 0 11 22210 866 536 3 13 79 4 0
207 14 0 917752 576 36181928 0 0 712 1345376 134598 47367 7 90 1 2 0
123 12 0 685516 576 36296148 0 0 429 1386615 158494 60077 8 84 5 3 0
123 12 0 598572 576 36333728 0 0 1107 1233281 147542 62351 7 84 5 4 0
622 7 0 660768 576 36118264 0 0 557 1345548 151394 59353 7 85 4 3 0
223 11 0 283960 576 36463868 0 0 46 1107160 121846 33006 6 93 1 1 0
104 14 0 3140508 576 33522616 0 0 299 1414709 160879 51422 9 89 1 1 0
100 11 0 1323036 576 35337740 0 0 429 1637733 175817 94471 9 73 10 8 0
91 11 0 673320 576 35918084 0 0 562 1477100 157069 67951 8 83 5 4 0
35 15 0 3486592 576 32983244 0 0 384 1574186 189023 82135 9 81 5 5 0
51 16 0 1428108 576 34962112 0 0 394 1573231 160575 76632 9 76 9 7 0
55 6 0 719548 576 35621284 0 0 425 1483962 160335 79991 8 74 10 7 0
96 7 0 1226852 576 35062608 0 0 803 1531041 164923 70820 9 78 7 6 0
97 8 0 862500 576 35332496 0 0 536 1177949 155969 80769 7 74 13 7 0
23 5 0 6096372 576 30115776 0 0 367 919949 124993 81755 6 62 24 8 0
13 5 0 7427860 576 28368292 0 0 399 915331 153895 102186 6 53 32 9 0

----------

And here's a perf report, captured/displayed with
perf record -g -a sleep 10
perf report --sort symbol --call-graph fractal,5
sometime during that period just after 12:00:09, when
the run queue was > 100.

----------

Processed 0 events and LOST 1175296!

Check IO/CPU overload!

# Events: 208K cycles
#
# Overhead  Symbol
# ........  ......
#
34.63% [k] _raw_spin_lock_irqsave
|
|--97.30%-- isolate_freepages
| compaction_alloc
| unmap_and_move
| migrate_pages
| compact_zone
| compact_zone_order
| try_to_compact_pages
| __alloc_pages_direct_compact
| __alloc_pages_slowpath
| __alloc_pages_nodemask
| alloc_pages_vma
| do_huge_pmd_anonymous_page
| handle_mm_fault
| do_page_fault
| page_fault
| |
| |--87.39%-- skb_copy_datagram_iovec
| | tcp_recvmsg
| | inet_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call
| | __recv
| | |
| | --100.00%-- (nil)
| |
| --12.61%-- memcpy
--2.70%-- [...]

14.31% [k] _raw_spin_lock_irq
|
|--98.08%-- isolate_migratepages_range
| compact_zone
| compact_zone_order
| try_to_compact_pages
| __alloc_pages_direct_compact
| __alloc_pages_slowpath
| __alloc_pages_nodemask
| alloc_pages_vma
| do_huge_pmd_anonymous_page
| handle_mm_fault
| do_page_fault
| page_fault
| |
| |--83.93%-- skb_copy_datagram_iovec
| | tcp_recvmsg
| | inet_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call
| | __recv
| | |
| | --100.00%-- (nil)
| |
| --16.07%-- memcpy
--1.92%-- [...]

5.48% [k] isolate_freepages_block
|
|--99.96%-- isolate_freepages
| compaction_alloc
| unmap_and_move
| migrate_pages
| compact_zone
| compact_zone_order
| try_to_compact_pages
| __alloc_pages_direct_compact
| __alloc_pages_slowpath
| __alloc_pages_nodemask
| alloc_pages_vma
| do_huge_pmd_anonymous_page
| handle_mm_fault
| do_page_fault
| page_fault
| |
| |--86.01%-- skb_copy_datagram_iovec
| | tcp_recvmsg
| | inet_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call
| | __recv
| | |
| | --100.00%-- (nil)
| |
| --13.99%-- memcpy
--0.04%-- [...]

5.34% [.] ceph_crc32c_le
|
|--99.95%-- 0xb8057558d0065990
--0.05%-- [...]

----------

If I understand what this is telling me, skb_copy_datagram_iovec
is responsible for triggering the calls to isolate_freepages_block,
isolate_migratepages_range, and isolate_freepages?

FWIW, I'm using a Chelsio T4 NIC in these hosts, with jumbo frames
and the Linux TCP stack (i.e., no stateful TCP offload).

-- Jim

Rik van Riel

Aug 9, 2012, 4:40:01 PM

On 08/09/2012 05:20 AM, Mel Gorman wrote:

> The intention is that an allocation can fail but each subsequent attempt will
> try harder until there is success. Each allocation request does a portion
> of the necessary work to spread the cost between multiple requests.

At some point we need to stop doing that work, though.

Otherwise we could end up back at the problem where
way too much memory gets evicted, and we get swap
storms.

Mel Gorman

Aug 9, 2012, 4:50:01 PM

On Thu, Aug 09, 2012 at 12:16:35PM -0600, Jim Schutt wrote:
> On 08/09/2012 07:49 AM, Mel Gorman wrote:
> >Changelog since V2
> >o Capture !MIGRATE_MOVABLE pages where possible
> >o Document the treatment of MIGRATE_MOVABLE pages while capturing
> >o Expand changelogs
> >
> >Changelog since V1
> >o Dropped kswapd related patch, basically a no-op and regresses if fixed (minchan)
> >o Expanded changelogs a little
> >
> >Allocation success rates have been far lower since 3.4 due to commit
> >[fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled]. This
> >commit was introduced for good reasons and it was known in advance that
> >the success rates would suffer but it was justified on the grounds that
> >the high allocation success rates were achieved by aggressive reclaim.
> >Success rates are expected to suffer even more in 3.6 due to commit
> >[7db8889a: mm: have order> 0 compaction start off where it left] which
> >testing has shown to severely reduce allocation success rates under load -
> >to 0% in one case. There is a proposed change to that patch in this series
> >and it would be ideal if Jim Schutt could retest the workload that led to
> >commit [7db8889a: mm: have order> 0 compaction start off where it left].
>
> On my first test of this patch series on top of 3.5, I ran into an
> instance of what I think is the sort of thing that patch 4/5 was
> fixing. Here's what vmstat had to say during that period:
>

> <SNIP>

My conclusion looking at the vmstat data is that everything is looking ok
until system CPU usage goes through the roof. I'm assuming that's what we
are all still looking at.

I am still concerned that what patch 4/5 was actually doing was bypassing
compaction almost entirely in the contended case, which "works" but is not
exactly what was expected.

> And here's a perf report, captured/displayed with
> perf record -g -a sleep 10
> perf report --sort symbol --call-graph fractal,5
> sometime during that period just after 12:00:09, when
> the run queue was > 100.
>
> ----------
>
> Processed 0 events and LOST 1175296!
>

> <SNIP>

So let's just consider this. My interpretation of that is that we are
receiving data from the network and copying it into a buffer that is
faulted for the first time and backed by THP.

All good so far *BUT* we are contending like crazy on the zone lock and
probably blocking normal page allocations in the meantime.

>
> 14.31% [k] _raw_spin_lock_irq
> |
> |--98.08%-- isolate_migratepages_range

This is a variation of the same problem but on the LRU lock this time.

> <SNIP>

>
> ----------
>
> If I understand what this is telling me, skb_copy_datagram_iovec
> is responsible for triggering the calls to isolate_freepages_block,
> isolate_migratepages_range, and isolate_freepages?
>

Sort of. I do not think it's the jumbo frames that are doing it; it's the
faulting of the buffer it copies to.

> FWIW, I'm using a Chelsio T4 NIC in these hosts, with jumbo frames
> and the Linux TCP stack (i.e., no stateful TCP offload).
>

Ok, this is an untested hack and I expect it would drop allocation success
rates again under load (but not as much). Can you test again and see what
effect, if any, it has please?

---8<---
mm: compaction: back out if contended

---
include/linux/compaction.h | 4 ++--
mm/compaction.c | 45 ++++++++++++++++++++++++++++++++++++++------
mm/internal.h | 1 +
mm/page_alloc.c | 13 +++++++++----
4 files changed, 51 insertions(+), 12 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 5673459..9c94cba 100644

--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -22,7 +22,7 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
extern int fragmentation_index(struct zone *zone, unsigned int order);
extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *mask,

- bool sync, struct page **page);
+ bool sync, bool *contended, struct page **page);

extern int compact_pgdat(pg_data_t *pgdat, int order);
extern unsigned long compaction_suitable(struct zone *zone, int order);

@@ -64,7 +64,7 @@ static inline bool compaction_deferred(struct zone *zone, int order)
#else
static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *nodemask,

- bool sync, struct page **page)
+ bool sync, bool *contended, struct page **page)
{
return COMPACT_CONTINUE;
}
diff --git a/mm/compaction.c b/mm/compaction.c
index c2d0958..8e290d2 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -50,6 +50,27 @@ static inline bool migrate_async_suitable(int migratetype)

return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE;
}

+/*
+ * Compaction requires the taking of some coarse locks that are potentially
+ * very heavily contended. For async compaction, back out in the event there
+ * is contention.
+ */
+static bool compact_trylock_irqsave(spinlock_t *lock, unsigned long *flags,
+ struct compact_control *cc)
+{
+ if (cc->sync) {
+ spin_lock_irqsave(lock, *flags);
+ } else {
+ if (!spin_trylock_irqsave(lock, *flags)) {
+ if (cc->contended)
+ *cc->contended = true;
+ return false;
+ }
+ }
+
+ return true;
+}
+

static void compact_capture_page(struct compact_control *cc)
{
unsigned long flags;

@@ -87,7 +108,8 @@ static void compact_capture_page(struct compact_control *cc)
continue;

/* Take the lock and attempt capture of the page */

- spin_lock_irqsave(&cc->zone->lock, flags);
+ if (!compact_trylock_irqsave(&cc->zone->lock, &flags, cc))
+ return;

if (!list_empty(&area->free_list[mtype])) {

page = list_entry(area->free_list[mtype].next,

struct page, lru);
@@ -514,7 +536,16 @@ static void isolate_freepages(struct zone *zone,
* are disabled
*/
isolated = 0;
- spin_lock_irqsave(&zone->lock, flags);
+
+ /*
+ * The zone lock must be held to isolate freepages.
+ * Unfortunately this is a very coarse lock and can be
+ * heavily contended if there are parallel allocations
+ * or parallel compactions. For async compaction do not
+ * spin on the lock
+ */
+ if (!compact_trylock_irqsave(&zone->lock, &flags, cc))
+ break;
if (suitable_migration_target(page)) {
end_pfn = min(pfn + pageblock_nr_pages, zone_end_pfn);
trace_mm_compaction_freepage_scanpfn(pfn);
@@ -837,8 +868,8 @@ out:

}

static unsigned long compact_zone_order(struct zone *zone,

- int order, gfp_t gfp_mask,
- bool sync, struct page **page)
+ int order, gfp_t gfp_mask, bool sync,
+ bool *contended, struct page **page)

{
struct compact_control cc = {
.nr_freepages = 0,

@@ -848,6 +879,7 @@ static unsigned long compact_zone_order(struct zone *zone,

.zone = zone,
.sync = sync,

.page = page,
+ .contended = contended,
};
INIT_LIST_HEAD(&cc.freepages);
INIT_LIST_HEAD(&cc.migratepages);
@@ -869,7 +901,7 @@ int sysctl_extfrag_threshold = 500;

*/
unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *nodemask,

- bool sync, struct page **page)
+ bool sync, bool *contended, struct page **page)

{
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
int may_enter_fs = gfp_mask & __GFP_FS;

@@ -889,7 +921,8 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
nodemask) {
int status;

- status = compact_zone_order(zone, order, gfp_mask, sync, page);

+ status = compact_zone_order(zone, order, gfp_mask, sync,

+ contended, page);

rc = max(status, rc);

/* If a normal allocation would succeed, stop compacting */

diff --git a/mm/internal.h b/mm/internal.h
index 064f6ef..344b555 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -130,6 +130,7 @@ struct compact_control {

int order; /* order a direct compactor needs */
int migratetype; /* MOVABLE, RECLAIMABLE etc */
struct zone *zone;

+ bool *contended; /* True if a lock was contended */

struct page **page; /* Page captured of requested size */
};

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 781d6e4..75b30ea 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2086,7 +2086,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
int migratetype, bool sync_migration,
- bool *deferred_compaction,
+ bool *contended_compaction, bool *deferred_compaction,
unsigned long *did_some_progress)
{

struct page *page = NULL;

@@ -2101,7 +2101,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,

current->flags |= PF_MEMALLOC;
*did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,

- nodemask, sync_migration, &page);
+ nodemask, sync_migration,
+ contended_compaction, &page);
current->flags &= ~PF_MEMALLOC;

/* If compaction captured a page, prep and use it */

@@ -2154,7 +2155,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
int migratetype, bool sync_migration,
- bool *deferred_compaction,
+ bool *contended_compaction, bool *deferred_compaction,
unsigned long *did_some_progress)
{
return NULL;
@@ -2318,6 +2319,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
unsigned long did_some_progress;
bool sync_migration = false;
bool deferred_compaction = false;
+ bool contended_compaction = false;

/*
* In the slowpath, we sanity check order to avoid ever trying to
@@ -2399,6 +2401,7 @@ rebalance:
nodemask,
alloc_flags, preferred_zone,
migratetype, sync_migration,
+ &contended_compaction,
&deferred_compaction,
&did_some_progress);
if (page)
@@ -2411,7 +2414,8 @@ rebalance:
* has requested the system not be heavily disrupted, fail the
* allocation now instead of entering direct reclaim
*/
- if (deferred_compaction && (gfp_mask & __GFP_NO_KSWAPD))
+ if ((deferred_compaction || contended_compaction) &&
+ (gfp_mask & __GFP_NO_KSWAPD))
goto nopage;

/* Try direct reclaim and then allocating */
@@ -2482,6 +2486,7 @@ rebalance:
nodemask,
alloc_flags, preferred_zone,
migratetype, sync_migration,
+ &contended_compaction,
&deferred_compaction,
&did_some_progress);
if (page)

Jim Schutt

Aug 9, 2012, 6:40:01 PM

I'm concerned about both the high CPU usage as well as the
reduction in write-out rate, but I've been assuming the latter
is caused by the former.

<snip>

>
> Ok, this is an untested hack and I expect it would drop allocation success
> rates again under load (but not as much). Can you test again and see what
> effect, if any, it has please?
>
> ---8<---
> mm: compaction: back out if contended
>
> ---

<snip>

Initial testing with this patch looks very good from
my perspective; CPU utilization stays reasonable,
write-out rate stays high, no signs of stress.
Here's an example after ~10 minutes under my test load:

2012-08-09 16:26:07.550-06:00

vmstat -w 4 16
procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
r b swpd free buff cache si so bi bo in cs us sy id wa st

21 19 0 351628 576 37835440 0 0 17 44394 1241 653 6 20 64 9 0
11 11 0 365520 576 37893060 0 0 124 2121508 203450 170957 12 46 25 17 0
13 16 0 359888 576 37954456 0 0 98 2185033 209473 171571 13 44 25 18 0
17 15 0 353728 576 38010536 0 0 89 2170971 208052 167988 13 43 26 18 0
17 16 0 349732 576 38048284 0 0 135 2217752 218754 174170 13 49 21 16 0
43 13 0 343280 576 38046500 0 0 153 2207135 217872 179519 13 47 23 18 0
26 13 0 350968 576 37937184 0 0 147 2189822 214276 176697 13 47 23 17 0
4 12 0 350080 576 37958364 0 0 226 2145212 207077 172163 12 44 24 20 0
15 13 0 353124 576 37921040 0 0 145 2078422 197231 166381 12 41 30 17 0
14 15 0 348964 576 37949588 0 0 107 2020853 188192 164064 12 39 30 20 0
21 9 0 354784 576 37951228 0 0 117 2148090 204307 165609 13 48 22 18 0
36 16 0 347368 576 37989824 0 0 166 2208681 216392 178114 13 47 24 16 0
28 15 0 300656 576 38060912 0 0 164 2181681 214618 175132 13 45 24 18 0
9 16 0 295484 576 38092184 0 0 153 2156909 218993 180289 13 43 27 17 0
17 16 0 346760 576 37979008 0 0 165 2124168 198730 173455 12 44 27 18 0
14 17 0 360988 576 37957136 0 0 142 2092248 197430 168199 12 42 29 17 0

I'll continue testing tomorrow to be sure nothing
shows up after continued testing.

If this passes your allocation success rate testing,
I'm happy with this performance for 3.6 - if not, I'll
be happy to test any further patches.

I really appreciate getting the chance to test out
your patchset.

Thanks -- Jim

Minchan Kim

Aug 9, 2012, 7:30:02 PM

You assume high-order allocations are *constant* and I guess your test environment
is optimal for that. I agree with your patch if we can make sure such high-order
allocations are always constant. But is that true? Otherwise, your patch could reclaim
too many pages unnecessarily and it could reduce system performance by evicting
page cache and swapping out part of the working set. That's a concern to me.
In summary, I think your patch is rather aggressive, so how about this?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 66e4310..0cb2593 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1708,6 +1708,7 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
{
unsigned long pages_for_compaction;
unsigned long inactive_lru_pages;
+ struct zone *zone;

/* If not in reclaim/compaction mode, stop */
if (!in_reclaim_compaction(sc))
@@ -1741,6 +1742,15 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
* inactive lists are large enough, continue reclaiming
*/
pages_for_compaction = (2UL << sc->order);
+
+ /*
+ * If compaction is deferred for this order then scale the number of
+ * pages reclaimed based on the number of consecutive allocation
+ * failures
+ */
+ zone = lruvec_zone(lruvec);
+ if (zone->compact_order_failed <= sc->order) {
+ if (zone->compact_defer_shift)
+ /*
+ * We can't make sure deferred requests will come again
+ * The probability is 50:50.
+ */
+ pages_for_compaction <<= (zone->compact_defer_shift - 1);
}
inactive_lru_pages = get_lru_size(lruvec, LRU_INACTIVE_FILE);
if (nr_swap_pages > 0)
inactive_lru_pages += get_lru_size(lruvec, LRU_INACTIVE_ANON);

>

> --
> Mel Gorman
> SUSE Labs
>

--
Kind regards,
Minchan Kim

Minchan Kim

Aug 9, 2012, 7:40:02 PM

On Thu, Aug 09, 2012 at 02:49:23PM +0100, Mel Gorman wrote:
> While compaction is migrating pages to free up large contiguous blocks for
> allocation it races with other allocation requests that may steal these
> blocks or break them up. This patch alters direct compaction to capture a
> suitable free page as soon as it becomes available to reduce this race. It
> uses similar logic to split_free_page() to ensure that watermarks are
> still obeyed.
>
> Signed-off-by: Mel Gorman <mgo...@suse.de>
> Reviewed-by: Rik van Riel <ri...@redhat.com>

Reviewed-by: Minchan Kim <min...@kernel.org>

--
Kind regards,
Minchan Kim

Mel Gorman

Aug 10, 2012, 4:20:02 AM

On Thu, Aug 09, 2012 at 04:29:57PM -0400, Rik van Riel wrote:
> On 08/09/2012 05:20 AM, Mel Gorman wrote:
>
> >The intention is that an allocation can fail but each subsequent attempt will
> >try harder until there is success. Each allocation request does a portion
> >of the necessary work to spread the cost between multiple requests.
>
> At some point we need to stop doing that work, though.
>
> Otherwise we could end up back at the problem where
> way too much memory gets evicted, and we get swap
> storms.

That's the case without this patch as it'll still be running
reclaim/compaction, just less aggressively. For it to continually try like
that, the system must be either continually under load, preventing compaction
from ever working (which may be undesirable for order-3 and the like), or so badly
fragmented that it cannot recover (I am not aware of a situation where this happened).

You could add a separate patch that checks whether
defer_shift == COMPACT_MAX_DEFER_SHIFT and disables reclaim/compaction in
that case, but that will require enough SWAP_CLUSTER_MAX pages to be reclaimed
over time or a large process to exit before compaction succeeds again.

I would expect rates under load to be very low with such a patch
applied.

--
Mel Gorman
SUSE Labs

Mel Gorman

Aug 10, 2012, 4:40:02 AM

On Fri, Aug 10, 2012 at 08:27:33AM +0900, Minchan Kim wrote:
> > <SNIP>
> >

> > The intention is that an allocation can fail but each subsequent attempt will
> > try harder until there is success. Each allocation request does a portion
> > of the necessary work to spread the cost between multiple requests. Take
> > THP for example where there is a constant request for THP allocations
> > for whatever reason (heavy fork workload, large buffer allocation being
> > populated etc.). Some of those allocations fail but if they do, future
> > THP requests will reclaim more pages. When compaction resumes again, it
> > will be more likely to succeed and compact_defer_shift gets reset. In the
> > specific case of THP there will be allocations that fail but khugepaged
> > will promote them later if the process is long-lived.
>
> You assume high-order allocations are *constant* and I guess your test environment
> is optimal for that.

Ok, my example stated they were constant because it was the easiest to
illustrate but it does not necessarily have to be the case. The high-order
allocation requests can be separated by any length of time with a read or
write stream running in the background applying a small amount of memory
pressure and the same scenario applies.

> I agree with your patch if we can make sure such high-order
> allocations are always constant. But is that true? Otherwise, your patch could reclaim
> too many pages unnecessarily and it could reduce system performance by evicting

The "too many pages unnecessarily" is unlikely. For compact_defer_shift to be
elevated there have to have been recent failures by try_to_compact_pages(). If
compact_defer_shift is elevated and a large process exits then
try_to_compact_pages() may succeed and reset compact_defer_shift without
calling direct reclaim and entering this path at all.

This patch is not doing anything radically different to my own patch.
compact_defer_shift == 0 if allocations succeeded recently using
reclaim/compaction at its normal level. Functionally the only difference
is that you delay when more pages get reclaimed by one failure.

Was that what you intended? If so, it's not clear why you think this patch
is better or how you concluded that the probability of another failure was
"50:50".

> }
> inactive_lru_pages = get_lru_size(lruvec, LRU_INACTIVE_FILE);
> if (nr_swap_pages > 0)
> inactive_lru_pages += get_lru_size(lruvec, LRU_INACTIVE_ANON);
>

--
Mel Gorman
SUSE Labs
--

Minchan Kim

Aug 10, 2012, 4:50:02 AM

Hi Mel,

Please ignore my comment about this patch.
I got confused between compact_considered and compact_defer_shift.
compact_defer_shift is an indication of high-order page allocations
constantly failing, so I have no objection any more.
Sorry for the noise. :(

--
Kind regards,
Minchan Kim

Minchan Kim

Aug 10, 2012, 4:50:02 AM

Reviewed-by: Minchan Kim <min...@kernel.org>

--
Kind regards,
Minchan Kim

Mel Gorman

Aug 10, 2012, 7:10:01 AM

On Thu, Aug 09, 2012 at 04:38:24PM -0600, Jim Schutt wrote:
> >><SNIP>
> >
> >My conclusion looking at the vmstat data is that everything is looking ok
> >until system CPU usage goes through the roof. I'm assuming that's what we
> >are all still looking at.
>
> I'm concerned about both the high CPU usage as well as the
> reduction in write-out rate, but I've been assuming the latter
> is caused by the former.
>

Almost certainly.

> <snip>
>
> >
> >Ok, this is an untested hack and I expect it would drop allocation success
> >rates again under load (but not as much). Can you test again and see what
> >effect, if any, it has please?
> >
> >---8<---
> >mm: compaction: back out if contended
> >
> >---
>
> <snip>
>
> Initial testing with this patch looks very good from
> my perspective; CPU utilization stays reasonable,
> write-out rate stays high, no signs of stress.
> Here's an example after ~10 minutes under my test load:
>

Excellent, so it is contention that is the problem.

> <SNIP>

> I'll continue testing tomorrow to be sure nothing
> shows up after continued testing.
>
> If this passes your allocation success rate testing,
> I'm happy with this performance for 3.6 - if not, I'll
> be happy to test any further patches.
>

It does impair allocation success rates as I expected (they're still ok
but not as high as I'd like) so I implemented the following instead. It
attempts to backoff when contention is detected or compaction is taking
too long. It does not backoff as quickly as the first prototype did so
I'd like to see if it addresses your problem or not.

> I really appreciate getting the chance to test out
> your patchset.
>

I appreciate that you have a workload that demonstrates the problem and
will test patches. I will not abuse this and hope to keep the revisions
to a minimum.

Thanks.

---8<---
mm: compaction: Abort async compaction if locks are contended or taking too long

Jim Schutt reported a problem that pointed at compaction contending
heavily on locks. The workload is straightforward and, in his own words:

The systems in question have 24 SAS drives spread across 3 HBAs,
running 24 Ceph OSD instances, one per drive. FWIW these servers
are dual-socket Intel 5675 Xeons w/48 GB memory. I've got ~160
Ceph Linux clients doing dd simultaneously to a Ceph file system
backed by 12 of these servers.

Early in the test everything looks fine

procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
r b swpd free buff cache si so bi bo in cs us sy id wa st

31 15 0 287216 576 38606628 0 0 2 1158 2 14 1 3 95 0 0
27 15 0 225288 576 38583384 0 0 18 2222016 203357 134876 11 56 17 15 0
28 17 0 219256 576 38544736 0 0 11 2305932 203141 146296 11 49 23 17 0
6 18 0 215596 576 38552872 0 0 7 2363207 215264 166502 12 45 22 20 0
22 18 0 226984 576 38596404 0 0 3 2445741 223114 179527 12 43 23 22 0

and then it goes to pot

procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
r b swpd free buff cache si so bi bo in cs us sy id wa st

163 8 0 464308 576 36791368 0 0 11 22210 866 536 3 13 79 4 0
207 14 0 917752 576 36181928 0 0 712 1345376 134598 47367 7 90 1 2 0
123 12 0 685516 576 36296148 0 0 429 1386615 158494 60077 8 84 5 3 0
123 12 0 598572 576 36333728 0 0 1107 1233281 147542 62351 7 84 5 4 0
622 7 0 660768 576 36118264 0 0 557 1345548 151394 59353 7 85 4 3 0
223 11 0 283960 576 36463868 0 0 46 1107160 121846 33006 6 93 1 1 0

Note that system CPU usage is very high and the rate of blocks being written
out has dropped by 42%. He analysed this with perf and found

perf record -g -a sleep 10
perf report --sort symbol --call-graph fractal,5

There was other data but primarily it is all showing that compaction is
contended heavily on the zone->lock and zone->lru_lock.

commit [b2eef8c0: mm: compaction: minimise the time IRQs are disabled
while isolating pages for migration] noted that it was possible for
migration to hold the lru_lock for an excessive amount of time. Very
broadly speaking this patch expands the concept.

This patch introduces compact_checklock_irqsave() to check if a lock
is contended or the process needs to be scheduled. If either condition
is true then async compaction is aborted and the caller is informed.
The page allocator will fail a THP allocation if compaction failed due
to contention. This patch also introduces compact_trylock_irqsave()
which will acquire the lock only if it is not contended and the process
does not need to schedule.

Reported-by: Jim Schutt <jas...@sandia.gov>

Signed-off-by: Mel Gorman <mgo...@suse.de>
---

include/linux/compaction.h | 4 +-
mm/compaction.c | 91 +++++++++++++++++++++++++++++++++++---------
mm/internal.h | 1 +
mm/page_alloc.c | 13 +++++--
4 files changed, 84 insertions(+), 25 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 5673459..9c94cba 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -22,7 +22,7 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
extern int fragmentation_index(struct zone *zone, unsigned int order);
extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *mask,
- bool sync, struct page **page);
+ bool sync, bool *contended, struct page **page);
extern int compact_pgdat(pg_data_t *pgdat, int order);
extern unsigned long compaction_suitable(struct zone *zone, int order);

@@ -64,7 +64,7 @@ static inline bool compaction_deferred(struct zone *zone, int order)
#else
static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *nodemask,
- bool sync, struct page **page)
+ bool sync, bool *contended, struct page **page)
{
return COMPACT_CONTINUE;
}
diff --git a/mm/compaction.c b/mm/compaction.c

index c2d0958..1827d9a 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -50,6 +50,47 @@ static inline bool migrate_async_suitable(int migratetype)

return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE;
}

+/*
+ * Compaction requires the taking of some coarse locks that are potentially
+ * very heavily contended. Check if the process needs to be scheduled or
+ * if the lock is contended. For async compaction, back out in the event
+ * if contention is severe. For sync compaction, schedule.
+ *
+ * Returns true if the lock is held.
+ * Returns false if the lock is released and compaction should abort
+ */
+static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
+ bool locked, struct compact_control *cc)
+{
+ if (need_resched() || spin_is_contended(lock)) {
+ if (locked) {
+ spin_unlock_irq(lock);
+ locked = false;
+ }
+
+ /* async aborts if taking too long or contended */
+ if (!cc->sync) {
+ if (cc->contended)
+ *cc->contended = true;
+ return false;
+ }
+
+ cond_resched();
+ if (fatal_signal_pending(current))
+ return false;
+ }
+
+ if (!locked)
+ spin_lock_irqsave(lock, *flags);
+ return true;
+}
+
+static inline bool compact_trylock_irqsave(spinlock_t *lock, unsigned long *flags,
+ struct compact_control *cc)
+{
+ return compact_checklock_irqsave(lock, flags, false, cc);
+}
+
static void compact_capture_page(struct compact_control *cc)
{
unsigned long flags;

@@ -87,7 +128,8 @@ static void compact_capture_page(struct compact_control *cc)

continue;

/* Take the lock and attempt capture of the page */
- spin_lock_irqsave(&cc->zone->lock, flags);
+ if (!compact_trylock_irqsave(&cc->zone->lock, &flags, cc))
+ return;
if (!list_empty(&area->free_list[mtype])) {
page = list_entry(area->free_list[mtype].next,
struct page, lru);

@@ -281,6 +323,8 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
struct list_head *migratelist = &cc->migratepages;
isolate_mode_t mode = 0;
struct lruvec *lruvec;
+ unsigned long flags;
+ bool locked;

/*
* Ensure that there are not too many pages isolated from the LRU
@@ -300,25 +344,22 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,

/* Time to isolate some pages for migration */
cond_resched();
- spin_lock_irq(&zone->lru_lock);
+ spin_lock_irqsave(&zone->lru_lock, flags);
+ locked = true;
for (; low_pfn < end_pfn; low_pfn++) {
struct page *page;
- bool locked = true;

/* give a chance to irqs before checking need_resched() */
if (!((low_pfn+1) % SWAP_CLUSTER_MAX)) {
- spin_unlock_irq(&zone->lru_lock);
+ spin_unlock_irqrestore(&zone->lru_lock, flags);
locked = false;
}
- if (need_resched() || spin_is_contended(&zone->lru_lock)) {
- if (locked)
- spin_unlock_irq(&zone->lru_lock);
- cond_resched();
- spin_lock_irq(&zone->lru_lock);
- if (fatal_signal_pending(current))
- break;
- } else if (!locked)
- spin_lock_irq(&zone->lru_lock);
+
+ /* Check if it is ok to still hold the lock */
+ locked = compact_checklock_irqsave(&zone->lru_lock, &flags,
+ locked, cc);
+ if (!locked)
+ break;

/*
* migrate_pfn does not necessarily start aligned to a
@@ -404,7 +445,8 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,

acct_isolated(zone, cc);

- spin_unlock_irq(&zone->lru_lock);
+ if (locked)
+ spin_unlock_irqrestore(&zone->lru_lock, flags);

trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);

@@ -514,7 +556,16 @@ static void isolate_freepages(struct zone *zone,

* are disabled
*/
isolated = 0;
- spin_lock_irqsave(&zone->lock, flags);
+
+ /*
+ * The zone lock must be held to isolate freepages. This
+ * unfortunately this is a very coarse lock and can be
+ * heavily contended if there are parallel allocations
+ * or parallel compactions. For async compaction do not
+ * spin on the lock
+ */
+ if (!compact_trylock_irqsave(&zone->lock, &flags, cc))
+ break;
if (suitable_migration_target(page)) {
end_pfn = min(pfn + pageblock_nr_pages, zone_end_pfn);
trace_mm_compaction_freepage_scanpfn(pfn);

@@ -837,8 +888,8 @@ out:

}

static unsigned long compact_zone_order(struct zone *zone,
- int order, gfp_t gfp_mask,
- bool sync, struct page **page)
+ int order, gfp_t gfp_mask, bool sync,
+ bool *contended, struct page **page)
{
struct compact_control cc = {
.nr_freepages = 0,

@@ -848,6 +899,7 @@ static unsigned long compact_zone_order(struct zone *zone,

.zone = zone,
.sync = sync,
.page = page,
+ .contended = contended,
};
INIT_LIST_HEAD(&cc.freepages);
INIT_LIST_HEAD(&cc.migratepages);

@@ -869,7 +921,7 @@ int sysctl_extfrag_threshold = 500;

*/
unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *nodemask,
- bool sync, struct page **page)
+ bool sync, bool *contended, struct page **page)
{
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
int may_enter_fs = gfp_mask & __GFP_FS;

@@ -889,7 +941,8 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,

Jim Schutt

Aug 10, 2012, 1:30:03 PM

On 08/10/2012 05:02 AM, Mel Gorman wrote:
> On Thu, Aug 09, 2012 at 04:38:24PM -0600, Jim Schutt wrote:

>>>
>>> Ok, this is an untested hack and I expect it would drop allocation success
>>> rates again under load (but not as much). Can you test again and see what
>>> effect, if any, it has please?
>>>
>>> ---8<---
>>> mm: compaction: back out if contended
>>>
>>> ---
>>
>> <snip>
>>
>> Initial testing with this patch looks very good from
>> my perspective; CPU utilization stays reasonable,
>> write-out rate stays high, no signs of stress.
>> Here's an example after ~10 minutes under my test load:
>>

Hmmm, I wonder if I should have tested this patch longer,
in view of the trouble I ran into testing the new patch?
See below.

>
> Excellent, so it is contention that is the problem.
>
>> <SNIP>
>> I'll continue testing tomorrow to be sure nothing
>> shows up after continued testing.
>>
>> If this passes your allocation success rate testing,
>> I'm happy with this performance for 3.6 - if not, I'll
>> be happy to test any further patches.
>>
>
> It does impair allocation success rates as I expected (they're still ok
> but not as high as I'd like) so I implemented the following instead. It
> attempts to backoff when contention is detected or compaction is taking
> too long. It does not backoff as quickly as the first prototype did so
> I'd like to see if it addresses your problem or not.
>
>> I really appreciate getting the chance to test out
>> your patchset.
>>
>
> I appreciate that you have a workload that demonstrates the problem and
> will test patches. I will not abuse this and hope to keep the revisions
> to a minimum.
>
> Thanks.
>
> ---8<---
> mm: compaction: Abort async compaction if locks are contended or taking too long

Hmmm, while testing this patch, a couple of my servers got
stuck after ~30 minutes or so, like this:

[ 2515.869936] INFO: task ceph-osd:30375 blocked for more than 120 seconds.
[ 2515.876630] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2515.884447] ceph-osd D 0000000000000000 0 30375 1 0x00000000
[ 2515.891531] ffff8802e1a99e38 0000000000000082 ffff88056b38e298 ffff8802e1a99fd8
[ 2515.899013] ffff8802e1a98010 ffff8802e1a98000 ffff8802e1a98000 ffff8802e1a98000
[ 2515.906482] ffff8802e1a99fd8 ffff8802e1a98000 ffff880697d31700 ffff8802e1a84500
[ 2515.913968] Call Trace:
[ 2515.916433] [<ffffffff8147fded>] schedule+0x5d/0x60
[ 2515.921417] [<ffffffff81480b25>] rwsem_down_failed_common+0x105/0x140
[ 2515.927938] [<ffffffff81480b73>] rwsem_down_write_failed+0x13/0x20
[ 2515.934195] [<ffffffff8124bcd3>] call_rwsem_down_write_failed+0x13/0x20
[ 2515.940934] [<ffffffff8147edc5>] ? down_write+0x45/0x50
[ 2515.946244] [<ffffffff81127b62>] sys_mprotect+0xd2/0x240
[ 2515.951640] [<ffffffff81489412>] system_call_fastpath+0x16/0x1b
[ 2515.957646] INFO: task ceph-osd:95698 blocked for more than 120 seconds.
[ 2515.964330] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2515.972141] ceph-osd D 0000000000000000 0 95698 1 0x00000000
[ 2515.979223] ffff8802b049fe38 0000000000000082 ffff88056b38e2a0 ffff8802b049ffd8
[ 2515.986700] ffff8802b049e010 ffff8802b049e000 ffff8802b049e000 ffff8802b049e000
[ 2515.994176] ffff8802b049ffd8 ffff8802b049e000 ffff8809832ddc00 ffff880611592e00
[ 2516.001653] Call Trace:
[ 2516.004111] [<ffffffff8147fded>] schedule+0x5d/0x60
[ 2516.009072] [<ffffffff81480b25>] rwsem_down_failed_common+0x105/0x140
[ 2516.015589] [<ffffffff81480b73>] rwsem_down_write_failed+0x13/0x20
[ 2516.021861] [<ffffffff8124bcd3>] call_rwsem_down_write_failed+0x13/0x20
[ 2516.028555] [<ffffffff8147edc5>] ? down_write+0x45/0x50
[ 2516.033859] [<ffffffff81127b62>] sys_mprotect+0xd2/0x240
[ 2516.039248] [<ffffffff81489412>] system_call_fastpath+0x16/0x1b
[ 2516.045248] INFO: task ceph-osd:95699 blocked for more than 120 seconds.
[ 2516.051934] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2516.059753] ceph-osd D 0000000000000000 0 95699 1 0x00000000
[ 2516.066832] ffff880c022d3dc8 0000000000000082 ffff880c022d2000 ffff880c022d3fd8
[ 2516.074302] ffff880c022d2010 ffff880c022d2000 ffff880c022d2000 ffff880c022d2000
[ 2516.081784] ffff880c022d3fd8 ffff880c022d2000 ffff8806224cc500 ffff88096b64dc00
[ 2516.089254] Call Trace:
[ 2516.091702] [<ffffffff8147fded>] schedule+0x5d/0x60
[ 2516.096656] [<ffffffff81480b25>] rwsem_down_failed_common+0x105/0x140
[ 2516.103176] [<ffffffff81480b73>] rwsem_down_write_failed+0x13/0x20
[ 2516.109443] [<ffffffff8124bcd3>] call_rwsem_down_write_failed+0x13/0x20
[ 2516.116134] [<ffffffff8147edc5>] ? down_write+0x45/0x50
[ 2516.121442] [<ffffffff8111362e>] vm_mmap_pgoff+0x6e/0xb0
[ 2516.126861] [<ffffffff8112486a>] sys_mmap_pgoff+0x18a/0x190
[ 2516.132552] [<ffffffff8124bd6e>] ? trace_hardirqs_on_thunk+0x3a/0x3c
[ 2516.138985] [<ffffffff81006b22>] sys_mmap+0x22/0x30
[ 2516.143945] [<ffffffff81489412>] system_call_fastpath+0x16/0x1b
[ 2516.149949] INFO: task ceph-osd:95816 blocked for more than 120 seconds.
[ 2516.156632] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2516.164444] ceph-osd D 0000000000000000 0 95816 1 0x00000000
[ 2516.171521] ffff880332991e38 0000000000000082 ffff880332991de8 ffff880332991fd8
[ 2516.178992] ffff880332990010 ffff880332990000 ffff880332990000 ffff880332990000
[ 2516.186466] ffff880332991fd8 ffff880332990000 ffff880697d31700 ffff880a92c32e00
[ 2516.193937] Call Trace:
[ 2516.196396] [<ffffffff8147fded>] schedule+0x5d/0x60
[ 2516.201354] [<ffffffff81480b25>] rwsem_down_failed_common+0x105/0x140
[ 2516.207886] [<ffffffff81480b73>] rwsem_down_write_failed+0x13/0x20
[ 2516.214138] [<ffffffff8124bcd3>] call_rwsem_down_write_failed+0x13/0x20
[ 2516.220843] [<ffffffff8147edc5>] ? down_write+0x45/0x50
[ 2516.226145] [<ffffffff81127b62>] sys_mprotect+0xd2/0x240
[ 2516.231548] [<ffffffff81489412>] system_call_fastpath+0x16/0x1b
[ 2516.237545] INFO: task ceph-osd:95838 blocked for more than 120 seconds.
[ 2516.244248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2516.252067] ceph-osd D 0000000000000000 0 95838 1 0x00000000
[ 2516.259159] ffff8803f8281e38 0000000000000082 ffff88056b38e2a8 ffff8803f8281fd8
[ 2516.266627] ffff8803f8280010 ffff8803f8280000 ffff8803f8280000 ffff8803f8280000
[ 2516.274094] ffff8803f8281fd8 ffff8803f8280000 ffff8809a45f8000 ffff880691d41700
[ 2516.281573] Call Trace:
[ 2516.284028] [<ffffffff8147fded>] schedule+0x5d/0x60
[ 2516.289000] [<ffffffff81480b25>] rwsem_down_failed_common+0x105/0x140
[ 2516.295513] [<ffffffff81480b73>] rwsem_down_write_failed+0x13/0x20
[ 2516.301764] [<ffffffff8124bcd3>] call_rwsem_down_write_failed+0x13/0x20
[ 2516.308450] [<ffffffff8147edc5>] ? down_write+0x45/0x50
[ 2516.313753] [<ffffffff81127b62>] sys_mprotect+0xd2/0x240
[ 2516.319157] [<ffffffff81489412>] system_call_fastpath+0x16/0x1b
[ 2516.325154] INFO: task ceph-osd:95861 blocked for more than 120 seconds.
[ 2516.331844] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2516.339665] ceph-osd D 0000000000000000 0 95861 1 0x00000000
[ 2516.346742] ffff8805026e9e38 0000000000000082 ffff88056b38e2a0 ffff8805026e9fd8
[ 2516.354221] ffff8805026e8010 ffff8805026e8000 ffff8805026e8000 ffff8805026e8000
[ 2516.361698] ffff8805026e9fd8 ffff8805026e8000 ffff880611592e00 ffff880948df0000
[ 2516.369174] Call Trace:
[ 2516.371623] [<ffffffff8147fded>] schedule+0x5d/0x60
[ 2516.376582] [<ffffffff81480b25>] rwsem_down_failed_common+0x105/0x140
[ 2516.383149] [<ffffffff81480b73>] rwsem_down_write_failed+0x13/0x20
[ 2516.389404] [<ffffffff8124bcd3>] call_rwsem_down_write_failed+0x13/0x20
[ 2516.396091] [<ffffffff8147edc5>] ? down_write+0x45/0x50
[ 2516.401397] [<ffffffff81127b62>] sys_mprotect+0xd2/0x240
[ 2516.406818] [<ffffffff81489412>] system_call_fastpath+0x16/0x1b
[ 2516.412868] INFO: task ceph-osd:95899 blocked for more than 120 seconds.
[ 2516.419557] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2516.427371] ceph-osd D 0000000000000000 0 95899 1 0x00000000
[ 2516.434466] ffff8801eaa9dd50 0000000000000082 0000000000000000 ffff8801eaa9dfd8
[ 2516.442020] ffff8801eaa9c010 ffff8801eaa9c000 ffff8801eaa9c000 ffff8801eaa9c000
[ 2516.449594] ffff8801eaa9dfd8 ffff8801eaa9c000 ffff8800865e5c00 ffff8802b356c500
[ 2516.457079] Call Trace:
[ 2516.459534] [<ffffffff8147fded>] schedule+0x5d/0x60
[ 2516.464519] [<ffffffff81480b25>] rwsem_down_failed_common+0x105/0x140
[ 2516.471044] [<ffffffff81480b95>] rwsem_down_read_failed+0x15/0x17
[ 2516.477222] [<ffffffff8124bca4>] call_rwsem_down_read_failed+0x14/0x30
[ 2516.483830] [<ffffffff8147ee07>] ? down_read+0x37/0x40
[ 2516.489050] [<ffffffff81484c49>] do_page_fault+0x239/0x4a0
[ 2516.494627] [<ffffffff8124bdaa>] ? trace_hardirqs_off_thunk+0x3a/0x6c
[ 2516.501143] [<ffffffff8148154f>] page_fault+0x1f/0x30

I tried to capture a perf trace while this was going on, but it
never completed. "ps" on this system reports lots of kernel threads
and some user-space stuff, but hangs part way through - no ceph
executables in the output, oddly.

I can retest your earlier patch for a longer period, to
see if it does the same thing, or I can do some other thing
if you tell me what it is.

Also, FWIW I sorted a little through SysRq-T output from such
a system; these bits looked interesting:

[ 3663.685097] INFO: rcu_sched self-detected stall on CPU { 17} (t=60000 jiffies)
[ 3663.685099] sending NMI to all CPUs:
[ 3663.685101] NMI backtrace for cpu 0
[ 3663.685102] CPU 0 Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa iw_cxgb4 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macvlan tun uinput sg joydev sd_mod hid_generic coretemp hwmon kvm crc32c_intel ghash_clmulni_intel aesni_intel cryptd aes_x86_64 microcode serio_raw pcspkr ata_piix libata button mlx4_ib ib_mad ib_core mlx4_en mlx4_core mpt2sas scsi_transport_sas raid_class scsi_mod cxgb4 i2c_i801 i2c_core lpc_ich mfd_core ehci_hcd uhci_hcd i7core_edac edac_core dm_mod ioatdma nfs nfs_acl auth_rpcgss fscache lockd sunrpc broadcom tg3 bnx2 igb dca e1000 [last unloaded: scsi_wait_scan]
[ 3663.685138]
[ 3663.685140] Pid: 100027, comm: ceph-osd Not tainted 3.5.0-00019-g472719a #221 Supermicro X8DTH-i/6/iF/6F/X8DTH
[ 3663.685142] RIP: 0010:[<ffffffff81480ed5>] [<ffffffff81480ed5>] _raw_spin_lock_irqsave+0x45/0x60
[ 3663.685148] RSP: 0018:ffff880a08191898 EFLAGS: 00000012
[ 3663.685149] RAX: ffff88063fffcb00 RBX: ffff88063fffcb00 RCX: 00000000000000c5
[ 3663.685149] RDX: 00000000000000bf RSI: 000000000000015a RDI: ffff88063fffcb00
[ 3663.685150] RBP: ffff880a081918a8 R08: 0000000000000000 R09: 0000000000000000
[ 3663.685151] R10: ffff88063fffcb98 R11: ffff88063fffcc38 R12: 0000000000000246
[ 3663.685152] R13: ffff88063fffcba8 R14: ffff88063fffcb90 R15: ffff88063fffc680
[ 3663.685153] FS: 00007fff90ae0700(0000) GS:ffff880627c00000(0000) knlGS:0000000000000000
[ 3663.685154] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 3663.685155] CR2: ffffffffff600400 CR3: 00000002b8fbe000 CR4: 00000000000007f0
[ 3663.685156] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3663.685157] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 3663.685158] Process ceph-osd (pid: 100027, threadinfo ffff880a08190000, task ffff880a9a29ae00)
[ 3663.685158] Stack:
[ 3663.685159] 000000000000130a 0000000000000000 ffff880a08191948 ffffffff8111a760
[ 3663.685162] ffffffff81a13420 0000000000000009 ffffea000004c240 0000000000000000
[ 3663.685165] ffff88063fffcba0 000000003fffcb98 ffff880a08191a18 0000000000001600
[ 3663.685168] Call Trace:
[ 3663.685169] [<ffffffff8111a760>] isolate_migratepages_range+0x150/0x4e0
[ 3663.685173] [<ffffffff8111a5b0>] ? isolate_freepages+0x330/0x330
[ 3663.685175] [<ffffffff8111af5b>] compact_zone+0x46b/0x4f0
[ 3663.685178] [<ffffffff8111b3f8>] compact_zone_order+0xe8/0x100
[ 3663.685180] [<ffffffff8111b4b6>] try_to_compact_pages+0xa6/0x110
[ 3663.685182] [<ffffffff81100339>] __alloc_pages_direct_compact+0xd9/0x250
[ 3663.685187] [<ffffffff81100883>] __alloc_pages_slowpath+0x3d3/0x750
[ 3663.685190] [<ffffffff81100d3e>] __alloc_pages_nodemask+0x13e/0x1d0
[ 3663.685192] [<ffffffff8113c894>] alloc_pages_vma+0x124/0x150
[ 3663.685195] [<ffffffff8114e065>] do_huge_pmd_anonymous_page+0xf5/0x1e0
[ 3663.685199] [<ffffffff81121bcd>] handle_mm_fault+0x21d/0x320
[ 3663.685202] [<ffffffff8124bca4>] ? call_rwsem_down_read_failed+0x14/0x30
[ 3663.685205] [<ffffffff81484e49>] do_page_fault+0x439/0x4a0
[ 3663.685208] [<ffffffff8124bdaa>] ? trace_hardirqs_off_thunk+0x3a/0x6c
[ 3663.685211] [<ffffffff8148154f>] page_fault+0x1f/0x30
[ 3663.685213] Code: 8b 14 25 48 b7 00 00 83 82 44 e0 ff ff 01 ba 00 01 00 00 f0 66 0f c1 13 89 d1 66 c1 e9 08 38 d1 74 0d 0f 1f 40 00 f3 90 0f b6 13 <38> d1 75 f7 5b 4c 89 e0 41 5c c9 c3 66 66 66 66 66 66 2e 0f 1f
[ 3663.685238] NMI backtrace for cpu 3
[ 3663.685239] CPU 3 Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa iw_cxgb4 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macvlan tun uinput sg joydev sd_mod hid_generic coretemp hwmon kvm crc32c_intel ghash_clmulni_intel aesni_intel cryptd aes_x86_64 microcode serio_raw pcspkr ata_piix libata button mlx4_ib ib_mad ib_core mlx4_en mlx4_core mpt2sas scsi_transport_sas raid_class scsi_mod cxgb4 i2c_i801 i2c_core lpc_ich mfd_core ehci_hcd uhci_hcd i7core_edac edac_core dm_mod ioatdma nfs nfs_acl auth_rpcgss fscache lockd sunrpc broadcom tg3 bnx2 igb dca e1000 [last unloaded: scsi_wait_scan]
[ 3663.685273]
[ 3663.685274] Pid: 101503, comm: ceph-osd Not tainted 3.5.0-00019-g472719a #221 Supermicro X8DTH-i/6/iF/6F/X8DTH
[ 3663.685276] RIP: 0010:[<ffffffff81480ed2>] [<ffffffff81480ed2>] _raw_spin_lock_irqsave+0x42/0x60
[ 3663.685280] RSP: 0018:ffff8806bce17898 EFLAGS: 00000006
[ 3663.685280] RAX: ffff88063fffcb00 RBX: ffff88063fffcb00 RCX: 00000000000000cb
[ 3663.685281] RDX: 00000000000000c5 RSI: 000000000000015a RDI: ffff88063fffcb00
[ 3663.685282] RBP: ffff8806bce178a8 R08: 0000000000000000 R09: 0000000000000000
[ 3663.685283] R10: ffff88063fffcb98 R11: ffff88063fffcc38 R12: 0000000000000246
[ 3663.685284] R13: ffff88063fffcba8 R14: ffff88063fffcb90 R15: ffff88063fffc680
[ 3663.685285] FS: 00007fffc8e60700(0000) GS:ffff880627c60000(0000) knlGS:0000000000000000
[ 3663.685286] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 3663.685287] CR2: ffffffffff600400 CR3: 00000002cbd8c000 CR4: 00000000000007e0
[ 3663.685287] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3663.685288] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 3663.685289] Process ceph-osd (pid: 101503, threadinfo ffff8806bce16000, task ffff880c06580000)
[ 3663.685290] Stack:
[ 3663.685290] 0000000000001212 0000000000000000 ffff8806bce17948 ffffffff8111a760
[ 3663.685294] ffff8806244d5c00 0000000000000009 ffffea0000048440 0000000000000000
[ 3663.685297] ffff88063fffcba0 000000003fffcb98 ffff8806bce17a18 0000000000001600
[ 3663.685300] Call Trace:
[ 3663.685301] [<ffffffff8111a760>] isolate_migratepages_range+0x150/0x4e0
[ 3663.685304] [<ffffffff8111a5b0>] ? isolate_freepages+0x330/0x330
[ 3663.685306] [<ffffffff8111af5b>] compact_zone+0x46b/0x4f0
[ 3663.685308] [<ffffffff814018c4>] ? ip_finish_output+0x274/0x300
[ 3663.685311] [<ffffffff8111b3f8>] compact_zone_order+0xe8/0x100
[ 3663.685314] [<ffffffff8111b4b6>] try_to_compact_pages+0xa6/0x110
[ 3663.685316] [<ffffffff81100339>] __alloc_pages_direct_compact+0xd9/0x250
[ 3663.685319] [<ffffffff813b655b>] ? release_sock+0x6b/0x80
[ 3663.685322] [<ffffffff81100883>] __alloc_pages_slowpath+0x3d3/0x750
[ 3663.685325] [<ffffffff81100d3e>] __alloc_pages_nodemask+0x13e/0x1d0
[ 3663.685327] [<ffffffff8113c894>] alloc_pages_vma+0x124/0x150
[ 3663.685330] [<ffffffff8114e065>] do_huge_pmd_anonymous_page+0xf5/0x1e0
[ 3663.685332] [<ffffffff81121bcd>] handle_mm_fault+0x21d/0x320
[ 3663.685335] [<ffffffff8124bca4>] ? call_rwsem_down_read_failed+0x14/0x30
[ 3663.685337] [<ffffffff81484e49>] do_page_fault+0x439/0x4a0
[ 3663.685340] [<ffffffff8106707d>] ? up_write+0x1d/0x20
[ 3663.685343] [<ffffffff81113656>] ? vm_mmap_pgoff+0x96/0xb0
[ 3663.685347] [<ffffffff8124bdaa>] ? trace_hardirqs_off_thunk+0x3a/0x6c
[ 3663.685349] [<ffffffff8148154f>] page_fault+0x1f/0x30
[ 3663.685352] Code: ff 65 48 8b 14 25 48 b7 00 00 83 82 44 e0 ff ff 01 ba 00 01 00 00 f0 66 0f c1 13 89 d1 66 c1 e9 08 38 d1 74 0d 0f 1f 40 00 f3 90 <0f> b6 13 38 d1 75 f7 5b 4c 89 e0 41 5c c9 c3 66 66 66 66 66 66
[ 3663.685378] NMI backtrace for cpu 6
[ 3663.685379] CPU 6 Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa iw_cxgb4 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macvlan tun uinput sg joydev sd_mod hid_generic coretemp hwmon kvm crc32c_intel ghash_clmulni_intel aesni_intel cryptd aes_x86_64 microcode serio_raw pcspkr ata_piix libata button mlx4_ib ib_mad ib_core mlx4_en mlx4_core mpt2sas scsi_transport_sas raid_class scsi_mod cxgb4 i2c_i801 i2c_core lpc_ich mfd_core ehci_hcd uhci_hcd i7core_edac edac_core dm_mod ioatdma nfs nfs_acl auth_rpcgss fscache lockd sunrpc broadcom tg3 bnx2 igb dca e1000 [last unloaded: scsi_wait_scan]
[ 3663.685402] Uhhuh. NMI received for unknown reason 3d on CPU 3.
[ 3663.685404] Do you have a strange power saving mode enabled?
[ 3663.685406] Dazed and confused, but trying to continue
[ 3663.685420]
[ 3663.685422] Pid: 102943, comm: ceph-osd Not tainted 3.5.0-00019-g472719a #221 Supermicro X8DTH-i/6/iF/6F/X8DTH
[ 3663.685424] RIP: 0010:[<ffffffff81480ed2>] [<ffffffff81480ed2>] _raw_spin_lock_irqsave+0x42/0x60
[ 3663.685430] RSP: 0018:ffff88065c111898 EFLAGS: 00000006
[ 3663.685430] RAX: ffff88063fffcb00 RBX: ffff88063fffcb00 RCX: 00000000000000d9
[ 3663.685431] RDX: 00000000000000c5 RSI: 000000000000015a RDI: ffff88063fffcb00
[ 3663.685432] RBP: ffff88065c1118a8 R08: 0000000000000000 R09: 0000000000000000
[ 3663.685433] R10: ffff88063fffcb98 R11: ffff88063fffcc38 R12: 0000000000000246
[ 3663.685433] R13: ffff88063fffcba8 R14: ffff88063fffcb90 R15: ffff88063fffc680
[ 3663.685434] FS: 00007fffc693b700(0000) GS:ffff880c3fc00000(0000) knlGS:0000000000000000
[ 3663.685435] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 3663.685436] CR2: ffffffffff600400 CR3: 000000048d1b1000 CR4: 00000000000007e0
[ 3663.685437] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3663.685438] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 3663.685439] Process ceph-osd (pid: 102943, threadinfo ffff88065c110000, task ffff880737b9ae00)
[ 3663.685439] Stack:
[ 3663.685440] 0000000000001d31 0000000000000000 ffff88065c111948 ffffffff8111a760
[ 3663.685444] ffff8806245b2e00 ffff88065c1118c8 0000000000000006 0000000000000000
[ 3663.685447] ffff88063fffcba0 000000003fffcb98 ffff88065c111a18 0000000000002000
[ 3663.685450] Call Trace:
[ 3663.685451] [<ffffffff8111a760>] isolate_migratepages_range+0x150/0x4e0
[ 3663.685455] [<ffffffff8111a5b0>] ? isolate_freepages+0x330/0x330
[ 3663.685458] [<ffffffff8111af5b>] compact_zone+0x46b/0x4f0
[ 3663.685460] [<ffffffff8111b3f8>] compact_zone_order+0xe8/0x100
[ 3663.685462] [<ffffffff8111b4b6>] try_to_compact_pages+0xa6/0x110
[ 3663.685464] [<ffffffff81100339>] __alloc_pages_direct_compact+0xd9/0x250
[ 3663.685469] [<ffffffff81100883>] __alloc_pages_slowpath+0x3d3/0x750
[ 3663.685471] [<ffffffff81100d3e>] __alloc_pages_nodemask+0x13e/0x1d0
[ 3663.685474] [<ffffffff8113c894>] alloc_pages_vma+0x124/0x150
[ 3663.685477] [<ffffffff8114e065>] do_huge_pmd_anonymous_page+0xf5/0x1e0
[ 3663.685481] [<ffffffff81121bcd>] handle_mm_fault+0x21d/0x320
[ 3663.685483] [<ffffffff8124bca4>] ? call_rwsem_down_read_failed+0x14/0x30
[ 3663.685487] [<ffffffff81484e49>] do_page_fault+0x439/0x4a0
[ 3663.685490] [<ffffffff8106707d>] ? up_write+0x1d/0x20
[ 3663.685493] [<ffffffff81113656>] ? vm_mmap_pgoff+0x96/0xb0
[ 3663.685497] [<ffffffff8124bdaa>] ? trace_hardirqs_off_thunk+0x3a/0x6c
[ 3663.685500] [<ffffffff8148154f>] page_fault+0x1f/0x30
[ 3663.685502] Code: ff 65 48 8b 14 25 48 b7 00 00 83 82 44 e0 ff ff 01 ba 00 01 00 00 f0 66 0f c1 13 89 d1 66 c1 e9 08 38 d1 74 0d 0f 1f 40 00 f3 90 <0f> b6 13 38 d1 75 f7 5b 4c 89 e0 41 5c c9 c3 66 66 66 66 66 66
[ 3663.685527] NMI backtrace for cpu 1
[ 3663.685528] CPU 1 Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa iw_cxgb4 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macvlan tun uinput sg joydev sd_mod hid_generic coretemp hwmon kvm crc32c_intel ghash_clmulni_intel aesni_intel cryptd aes_x86_64 microcode serio_raw pcspkr ata_piix libata button mlx4_ib ib_mad ib_core mlx4_en mlx4_core mpt2sas scsi_transport_sas raid_class scsi_mod cxgb4 i2c_i801 i2c_core lpc_ich mfd_core ehci_hcd uhci_hcd i7core_edac edac_core dm_mod ioatdma nfs nfs_acl auth_rpcgss fscache lockd sunrpc broadcom tg3 bnx2 igb dca e1000 [last unloaded: scsi_wait_scan]
[ 3663.685562]
[ 3663.685563] Pid: 30029, comm: ceph-osd Not tainted 3.5.0-00019-g472719a #221 Supermicro X8DTH-i/6/iF/6F/X8DTH
[ 3663.685565] RIP: 0010:[<ffffffff81480ed2>] [<ffffffff81480ed2>] _raw_spin_lock_irqsave+0x42/0x60
[ 3663.685569] RSP: 0018:ffff880563ae1898 EFLAGS: 00000006
[ 3663.685569] RAX: ffff88063fffcb00 RBX: ffff88063fffcb00 RCX: 00000000000000d6
[ 3663.685570] RDX: 00000000000000c5 RSI: 000000000000015a RDI: ffff88063fffcb00
[ 3663.685571] RBP: ffff880563ae18a8 R08: 0000000000000000 R09: 0000000000000000
[ 3663.685572] R10: ffff88063fffcb98 R11: ffff88063fffcc38 R12: 0000000000000246
[ 3663.685573] R13: ffff88063fffcba8 R14: ffff88063fffcb90 R15: ffff88063fffc680
[ 3663.685574] FS: 00007fffe86c9700(0000) GS:ffff880627c20000(0000) knlGS:0000000000000000
[ 3663.685575] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 3663.685576] CR2: ffffffffff600400 CR3: 00000002cc584000 CR4: 00000000000007e0
[ 3663.685577] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3663.685577] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 3663.685578] Process ceph-osd (pid: 30029, threadinfo ffff880563ae0000, task ffff880563adc500)
[ 3663.685579] Stack:
[ 3663.685579] 000000000000167f 0000000000000000 ffff880563ae1948 ffffffff8111a760
[ 3663.685583] ffff88063fffcc38 ffff88063fffcb98 000000000000256b 0000000000000000
[ 3663.685586] ffff88063fffcba0 0000000000000004 ffff880563ae1a18 0000000000001a00
[ 3663.685589] Call Trace:
[ 3663.685590] [<ffffffff8111a760>] isolate_migratepages_range+0x150/0x4e0
[ 3663.685593] [<ffffffff8111a5b0>] ? isolate_freepages+0x330/0x330
[ 3663.685595] [<ffffffff8111af5b>] compact_zone+0x46b/0x4f0
[ 3663.685597] [<ffffffff8111b3f8>] compact_zone_order+0xe8/0x100
[ 3663.685599] [<ffffffff8111b4b6>] try_to_compact_pages+0xa6/0x110
[ 3663.685601] [<ffffffff81100339>] __alloc_pages_direct_compact+0xd9/0x250
[ 3663.685604] [<ffffffff81100883>] __alloc_pages_slowpath+0x3d3/0x750
[ 3663.685607] [<ffffffff81100d3e>] __alloc_pages_nodemask+0x13e/0x1d0
[ 3663.685609] [<ffffffff8113c894>] alloc_pages_vma+0x124/0x150
[ 3663.685612] [<ffffffff8114e065>] do_huge_pmd_anonymous_page+0xf5/0x1e0
[ 3663.685614] [<ffffffff81121bcd>] handle_mm_fault+0x21d/0x320
[ 3663.685616] [<ffffffff8124bca4>] ? call_rwsem_down_read_failed+0x14/0x30
[ 3663.685619] [<ffffffff81484e49>] do_page_fault+0x439/0x4a0
[ 3663.685621] [<ffffffff8106707d>] ? up_write+0x1d/0x20
[ 3663.685623] [<ffffffff81113656>] ? vm_mmap_pgoff+0x96/0xb0
[ 3663.685626] [<ffffffff8124bdaa>] ? trace_hardirqs_off_thunk+0x3a/0x6c
[ 3663.685628] [<ffffffff8148154f>] page_fault+0x1f/0x30
[ 3663.685630] Code: ff 65 48 8b 14 25 48 b7 00 00 83 82 44 e0 ff ff 01 ba 00 01 00 00 f0 66 0f c1 13 89 d1 66 c1 e9 08 38 d1 74 0d 0f 1f 40 00 f3 90 <0f> b6 13 38 d1 75 f7 5b 4c 89 e0 41 5c c9 c3 66 66 66 66 66 66
[ 3663.685656] NMI backtrace for cpu 12
[ 3663.685656] CPU 12 Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa iw_cxgb4 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macvlan tun uinput sg joydev sd_mod hid_generic coretemp hwmon kvm crc32c_intel ghash_clmulni_intel aesni_intel cryptd aes_x86_64 microcode serio_raw pcspkr ata_piix libata button mlx4_ib ib_mad ib_core mlx4_en mlx4_core mpt2sas scsi_transport_sas raid_class scsi_mod cxgb4 i2c_i801 i2c_core lpc_ich mfd_core ehci_hcd uhci_hcd i7core_edac edac_core dm_mod ioatdma nfs nfs_acl auth_rpcgss fscache lockd sunrpc broadcom tg3 bnx2 igb dca e1000 [last unloaded: scsi_wait_scan]
[ 3663.685687]
[ 3663.685688] Pid: 97037, comm: ceph-osd Not tainted 3.5.0-00019-g472719a #221 Supermicro X8DTH-i/6/iF/6F/X8DTH
[ 3663.685690] RIP: 0010:[<ffffffff81480ed2>] [<ffffffff81480ed2>] _raw_spin_lock_irqsave+0x42/0x60
[ 3663.685693] RSP: 0018:ffff880092839898 EFLAGS: 00000016
[ 3663.685694] RAX: ffff88063fffcb00 RBX: ffff88063fffcb00 RCX: 00000000000000d4
[ 3663.685694] RDX: 00000000000000c5 RSI: 000000000000015a RDI: ffff88063fffcb00
[ 3663.685695] RBP: ffff8800928398a8 R08: 0000000000000000 R09: 0000000000000000
[ 3663.685696] R10: ffff88063fffcb98 R11: ffff88063fffcc38 R12: 0000000000000246
[ 3663.685697] R13: ffff88063fffcba8 R14: ffff88063fffcb90 R15: ffff88063fffc680
[ 3663.685698] FS: 00007fffcb183700(0000) GS:ffff880627cc0000(0000) knlGS:0000000000000000
[ 3663.685699] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 3663.685700] CR2: ffffffffff600400 CR3: 0000000411741000 CR4: 00000000000007e0
[ 3663.685701] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3663.685702] Uhhuh. NMI received for unknown reason 3d on CPU 6.
[ 3663.685703] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 3663.685704] Do you have a strange power saving mode enabled?
[ 3663.685705] Process ceph-osd (pid: 97037, threadinfo ffff880092838000, task ffff8805d127dc00)
[ 3663.685706] Dazed and confused, but trying to continue
[ 3663.685707] Stack:
[ 3663.685707] 000000000000358a 0000000000000000 ffff880092839948 ffffffff8111a760
[ 3663.685711] ffff8806245c4500 ffff8800928398c8 000000000000000c 0000000000000000
[ 3663.685714] ffff88063fffcba0 000000003fffcb98 ffff880092839a18 0000000000003800
[ 3663.685717] Call Trace:
[ 3663.685717] [<ffffffff8111a760>] isolate_migratepages_range+0x150/0x4e0
[ 3663.685720] [<ffffffff8111a5b0>] ? isolate_freepages+0x330/0x330
[ 3663.685722] [<ffffffff8111af5b>] compact_zone+0x46b/0x4f0
[ 3663.685724] [<ffffffff8111b3f8>] compact_zone_order+0xe8/0x100
[ 3663.685727] [<ffffffff8111b4b6>] try_to_compact_pages+0xa6/0x110
[ 3663.685729] [<ffffffff81100339>] __alloc_pages_direct_compact+0xd9/0x250
[ 3663.685731] [<ffffffff81100883>] __alloc_pages_slowpath+0x3d3/0x750
[ 3663.685734] [<ffffffff81100d3e>] __alloc_pages_nodemask+0x13e/0x1d0
[ 3663.685736] [<ffffffff8113c894>] alloc_pages_vma+0x124/0x150
[ 3663.685738] [<ffffffff8114e065>] do_huge_pmd_anonymous_page+0xf5/0x1e0
[ 3663.685740] [<ffffffff81121bcd>] handle_mm_fault+0x21d/0x320
[ 3663.685743] [<ffffffff8124bca4>] ? call_rwsem_down_read_failed+0x14/0x30
[ 3663.685745] [<ffffffff81484e49>] do_page_fault+0x439/0x4a0
[ 3663.685747] [<ffffffff8106707d>] ? up_write+0x1d/0x20
[ 3663.685749] [<ffffffff81113656>] ? vm_mmap_pgoff+0x96/0xb0
[ 3663.685752] [<ffffffff8124bdaa>] ? trace_hardirqs_off_thunk+0x3a/0x6c
[ 3663.685754] [<ffffffff8148154f>] page_fault+0x1f/0x30
[ 3663.685756] Code: ff 65 48 8b 14 25 48 b7 00 00 83 82 44 e0 ff ff 01 ba 00 01 00 00 f0 66 0f c1 13 89 d1 66 c1 e9 08 38 d1 74 0d 0f 1f 40 00 f3 90 <0f> b6 13 38 d1 75 f7 5b 4c 89 e0 41 5c c9 c3 66 66 66 66 66 66
[ 3663.685781] NMI backtrace for cpu 14
[ 3663.685782] CPU 14 Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa iw_cxgb4 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macvlan tun uinput sg joydev sd_mod hid_generic coretemp hwmon kvm crc32c_intel ghash_clmulni_intel aesni_intel cryptd aes_x86_64 microcode serio_raw pcspkr ata_piix libata button mlx4_ib ib_mad ib_core mlx4_en mlx4_core mpt2sas scsi_transport_sas raid_class scsi_mod cxgb4 i2c_i801 i2c_core lpc_ich mfd_core ehci_hcd uhci_hcd i7core_edac edac_core dm_mod ioatdma nfs nfs_acl auth_rpcgss fscache lockd sunrpc broadcom tg3 bnx2 igb dca e1000 [last unloaded: scsi_wait_scan]
[ 3663.685815]
[ 3663.685816] Pid: 97590, comm: ceph-osd Not tainted 3.5.0-00019-g472719a #221 Supermicro X8DTH-i/6/iF/6F/X8DTH
[ 3663.685818] RIP: 0010:[<ffffffff81480ed2>] [<ffffffff81480ed2>] _raw_spin_lock_irqsave+0x42/0x60
[ 3663.685821] RSP: 0018:ffff8803f97a9898 EFLAGS: 00000002
[ 3663.685822] RAX: ffff88063fffcb00 RBX: ffff88063fffcb00 RCX: 00000000000000c6
[ 3663.685823] RDX: 00000000000000c5 RSI: 000000000000015a RDI: ffff88063fffcb00
[ 3663.685823] RBP: ffff8803f97a98a8 R08: 0000000000000000 R09: 0000000000000000
[ 3663.685824] R10: ffff88063fffcb98 R11: ffff88063fffcc38 R12: 0000000000000246
[ 3663.685825] R13: ffff88063fffcba8 R14: ffff88063fffcb90 R15: ffff88063fffc680
[ 3663.685826] FS: 00007fffca577700(0000) GS:ffff880627d00000(0000) knlGS:0000000000000000
[ 3663.685827] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 3663.685828] CR2: ffffffffff600400 CR3: 00000002e0986000 CR4: 00000000000007e0
[ 3663.685828] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3663.685829] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 3663.685830] Process ceph-osd (pid: 97590, threadinfo ffff8803f97a8000, task ffff88045554c500)
[ 3663.685831] Stack:
[ 3663.685831] 0000000000001cc3 0000000000000000 ffff8803f97a9948 ffffffff8111a760
[ 3663.685834] ffff8806245d8000 ffff8803f97a98c8 000000000000000e 0000000000000000
[ 3663.685838] ffff88063fffcba0 000000003fffcb98 ffff8803f97a9a18 0000000000002000
[ 3663.685841] Call Trace:
[ 3663.685842] [<ffffffff8111a760>] isolate_migratepages_range+0x150/0x4e0
[ 3663.685844] [<ffffffff8111a5b0>] ? isolate_freepages+0x330/0x330
[ 3663.685847] [<ffffffff8111af5b>] compact_zone+0x46b/0x4f0
[ 3663.685849] [<ffffffff8111b3f8>] compact_zone_order+0xe8/0x100
[ 3663.685851] [<ffffffff8111b4b6>] try_to_compact_pages+0xa6/0x110
[ 3663.685853] [<ffffffff81100339>] __alloc_pages_direct_compact+0xd9/0x250
[ 3663.685856] [<ffffffff81100883>] __alloc_pages_slowpath+0x3d3/0x750
[ 3663.685859] [<ffffffff81100d3e>] __alloc_pages_nodemask+0x13e/0x1d0
[ 3663.685861] [<ffffffff8113c894>] alloc_pages_vma+0x124/0x150
[ 3663.685864] [<ffffffff8114e065>] do_huge_pmd_anonymous_page+0xf5/0x1e0
[ 3663.685866] [<ffffffff81121bcd>] handle_mm_fault+0x21d/0x320
[ 3663.685868] [<ffffffff8124bca4>] ? call_rwsem_down_read_failed+0x14/0x30
[ 3663.685871] [<ffffffff81484e49>] do_page_fault+0x439/0x4a0
[ 3663.685873] [<ffffffff8106707d>] ? up_write+0x1d/0x20
[ 3663.685875] [<ffffffff81113656>] ? vm_mmap_pgoff+0x96/0xb0
[ 3663.685878] [<ffffffff8124bdaa>] ? trace_hardirqs_off_thunk+0x3a/0x6c
[ 3663.685880] [<ffffffff8148154f>] page_fault+0x1f/0x30
[ 3663.685882] Code: ff 65 48 8b 14 25 48 b7 00 00 83 82 44 e0 ff ff 01 ba 00 01 00 00 f0 66 0f c1 13 89 d1 66 c1 e9 08 38 d1 74 0d 0f 1f 40 00 f3 90 <0f> b6 13 38 d1 75 f7 5b 4c 89 e0 41 5c c9 c3 66 66 66 66 66 66
[ 3663.685907] NMI backtrace for cpu 2
[ 3663.685908] CPU 2 Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa iw_cxgb4 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macvlan tun uinput sg joydev sd_mod hid_generic coretemp hwmon kvm crc32c_intel ghash_clmulni_intel aesni_intel cryptd aes_x86_64 microcode serio_raw pcspkr ata_piix libata button mlx4_ib ib_mad ib_core mlx4_en mlx4_core mpt2sas scsi_transport_sas raid_class scsi_mod cxgb4 i2c_i801 i2c_core lpc_ich mfd_core ehci_hcd uhci_hcd i7core_edac edac_core dm_mod ioatdma nfs nfs_acl auth_rpcgss fscache lockd sunrpc broadcom tg3 bnx2 igb dca e1000 [last unloaded: scsi_wait_scan]
[ 3663.685939]
[ 3663.685941] Pid: 100053, comm: ceph-osd Not tainted 3.5.0-00019-g472719a #221 Supermicro X8DTH-i/6/iF/6F/X8DTH
[ 3663.685943] RIP: 0010:[<ffffffff81480ed2>] [<ffffffff81480ed2>] _raw_spin_lock_irqsave+0x42/0x60
[ 3663.685946] RSP: 0018:ffff8808da685898 EFLAGS: 00000012
[ 3663.685947] RAX: ffff88063fffcb00 RBX: ffff88063fffcb00 RCX: 00000000000000d3
[ 3663.685948] RDX: 00000000000000c6 RSI: 000000000000015a RDI: ffff88063fffcb00
[ 3663.685948] RBP: ffff8808da6858a8 R08: 0000000000000000 R09: 0000000000000000
[ 3663.685949] R10: ffff88063fffcb98 R11: ffff88063fffcc38 R12: 0000000000000246
[ 3663.685950] R13: ffff88063fffcba8 R14: ffff88063fffcb90 R15: ffff88063fffc680
[ 3663.685951] FS: 00007fff92c01700(0000) GS:ffff880627c40000(0000) knlGS:0000000000000000
[ 3663.685952] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 3663.685953] CR2: ffffffffff600400 CR3: 00000002b8fbe000 CR4: 00000000000007e0
[ 3663.685954] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3663.685954] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 3663.685955] Process ceph-osd (pid: 100053, threadinfo ffff8808da684000, task ffff880a05a92e00)
[ 3663.685956] Stack:
[ 3663.685956] 000000000000119b 0000000000000000 ffff8808da685948 ffffffff8111a760
[ 3663.685959] ffff8806244d4500 ffff8808da6858c8 0000000000000002 0000000000000000
[ 3663.685962] ffff88063fffcba0 000000003fffcb98 ffff8808da685a18 0000000000001400
[ 3663.685966] Call Trace:
[ 3663.685966] [<ffffffff8111a760>] isolate_migratepages_range+0x150/0x4e0
[ 3663.685969] [<ffffffff8111a5b0>] ? isolate_freepages+0x330/0x330
[ 3663.685971] [<ffffffff8111af5b>] compact_zone+0x46b/0x4f0
[ 3663.685973] [<ffffffff8111b3f8>] compact_zone_order+0xe8/0x100
[ 3663.685976] [<ffffffff8111b4b6>] try_to_compact_pages+0xa6/0x110
[ 3663.685978] [<ffffffff81100339>] __alloc_pages_direct_compact+0xd9/0x250
[ 3663.685981] [<ffffffff81100883>] __alloc_pages_slowpath+0x3d3/0x750
[ 3663.685983] [<ffffffff81100d3e>] __alloc_pages_nodemask+0x13e/0x1d0
[ 3663.685986] [<ffffffff8113c894>] alloc_pages_vma+0x124/0x150
[ 3663.685988] [<ffffffff8114e065>] do_huge_pmd_anonymous_page+0xf5/0x1e0
[ 3663.685990] [<ffffffff81121bcd>] handle_mm_fault+0x21d/0x320
[ 3663.685992] [<ffffffff8124bca4>] ? call_rwsem_down_read_failed+0x14/0x30
[ 3663.685995] [<ffffffff81484e49>] do_page_fault+0x439/0x4a0
[ 3663.685997] [<ffffffff8124bdaa>] ? trace_hardirqs_off_thunk+0x3a/0x6c
[ 3663.685999] [<ffffffff8148154f>] page_fault+0x1f/0x30
[ 3663.686001] Code: ff 65 48 8b 14 25 48 b7 00 00 83 82 44 e0 ff ff 01 ba 00 01 00 00 f0 66 0f c1 13 89 d1 66 c1 e9 08 38 d1 74 0d 0f 1f 40 00 f3 90 <0f> b6 13 38 d1 75 f7 5b 4c 89 e0 41 5c c9 c3 66 66 66 66 66 66
[ 3663.686028] NMI backtrace for cpu 11
[ 3663.686028] CPU 11 Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa iw_cxgb4 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macvlan tun uinput sg joydev sd_mod hid_generic coretemp hwmon kvm crc32c_intel ghash_clmulni_intel aesni_intel cryptd aes_x86_64 microcode serio_raw pcspkr ata_piix libata button mlx4_ib ib_mad ib_core mlx4_en mlx4_core mpt2sas scsi_transport_sas raid_class scsi_mod cxgb4 i2c_i801 i2c_core lpc_ich mfd_core ehci_hcd uhci_hcd i7core_edac edac_core dm_mod ioatdma nfs nfs_acl auth_rpcgss fscache lockd sunrpc broadcom tg3 bnx2 igb dca e1000 [last unloaded: scsi_wait_scan]
[ 3663.686062]
[ 3663.686064] Pid: 97756, comm: ceph-osd Not tainted 3.5.0-00019-g472719a #221 Supermicro X8DTH-i/6/iF/6F/X8DTH
[ 3663.686066] RIP: 0010:[<ffffffff81480ed5>] [<ffffffff81480ed5>] _raw_spin_lock_irqsave+0x45/0x60
[ 3663.686069] RSP: 0018:ffff880b11ecd898 EFLAGS: 00000006
[ 3663.686070] RAX: ffff88063fffcb00 RBX: ffff88063fffcb00 RCX: 00000000000000d8
[ 3663.686070] RDX: 00000000000000c6 RSI: 000000000000015a RDI: ffff88063fffcb00
[ 3663.686071] RBP: ffff880b11ecd8a8 R08: 0000000000000000 R09: 0000000000000000
[ 3663.686072] R10: ffff88063fffcb98 R11: ffff88063fffcc38 R12: 0000000000000246
[ 3663.686073] R13: ffff88063fffcba8 R14: ffff88063fffcb90 R15: ffff88063fffc680
[ 3663.686074] FS: 00007ffff36df700(0000) GS:ffff880c3fca0000(0000) knlGS:0000000000000000
[ 3663.686075] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 3663.686076] CR2: ffffffffff600400 CR3: 00000002cae55000 CR4: 00000000000007e0
[ 3663.686077] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3663.686078] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 3663.686079] Process ceph-osd (pid: 97756, threadinfo ffff880b11ecc000, task ffff880a79a51700)
[ 3663.686079] Stack:
[ 3663.686080] 0000000000001b3e 0000000000000000 ffff880b11ecd948 ffffffff8111a760
[ 3663.686083] ffff8806245c2e00 ffff880b11ecd8c8 000000000000000b 0000000000000000
[ 3663.686086] ffff88063fffcba0 000000003fffcb98 ffff880b11ecda18 0000000000001e00
[ 3663.686089] Call Trace:
[ 3663.686090] [<ffffffff8111a760>] isolate_migratepages_range+0x150/0x4e0
[ 3663.686093] [<ffffffff8111a5b0>] ? isolate_freepages+0x330/0x330
[ 3663.686095] [<ffffffff8111af5b>] compact_zone+0x46b/0x4f0
[ 3663.686097] [<ffffffff8111b3f8>] compact_zone_order+0xe8/0x100
[ 3663.686099] [<ffffffff8111b4b6>] try_to_compact_pages+0xa6/0x110
[ 3663.686102] [<ffffffff81100339>] __alloc_pages_direct_compact+0xd9/0x250
[ 3663.686105] [<ffffffff81100883>] __alloc_pages_slowpath+0x3d3/0x750
[ 3663.686107] [<ffffffff81100d3e>] __alloc_pages_nodemask+0x13e/0x1d0
[ 3663.686110] [<ffffffff8113c894>] alloc_pages_vma+0x124/0x150
[ 3663.686112] [<ffffffff8114e065>] do_huge_pmd_anonymous_page+0xf5/0x1e0
[ 3663.686114] [<ffffffff81121bcd>] handle_mm_fault+0x21d/0x320
[ 3663.686117] [<ffffffff81484e49>] do_page_fault+0x439/0x4a0
[ 3663.686119] [<ffffffff8106707d>] ? up_write+0x1d/0x20
[ 3663.686121] [<ffffffff81113656>] ? vm_mmap_pgoff+0x96/0xb0
[ 3663.686124] [<ffffffff8124bdaa>] ? trace_hardirqs_off_thunk+0x3a/0x6c
[ 3663.686126] [<ffffffff8148154f>] page_fault+0x1f/0x30
[ 3663.686129] Code: 8b 14 25 48 b7 00 00 83 82 44 e0 ff ff 01 ba 00 01 00 00 f0 66 0f c1 13 89 d1 66 c1 e9 08 38 d1 74 0d 0f 1f 40 00 f3 90 0f b6 13 <38> d1 75 f7 5b 4c 89 e0 41 5c c9 c3 66 66 66 66 66 66 2e 0f 1f
[ 3663.686155] NMI backtrace for cpu 20
[ 3663.686155] CPU 20 Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa iw_cxgb4 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macvlan tun uinput sg joydev sd_mod hid_generic coretemp hwmon kvm crc32c_intel ghash_clmulni_intel aesni_intel cryptd aes_x86_64 microcode serio_raw pcspkr ata_piix libata button mlx4_ib ib_mad ib_core mlx4_en mlx4_core mpt2sas scsi_transport_sas raid_class scsi_mod cxgb4 i2c_i801 i2c_core lpc_ich mfd_core ehci_hcd uhci_hcd i7core_edac edac_core dm_mod ioatdma nfs nfs_acl auth_rpcgss fscache lockd sunrpc broadcom tg3 bnx2 igb dca e1000 [last unloaded: scsi_wait_scan]
[ 3663.686189]
[ 3663.686190] Pid: 97755, comm: ceph-osd Not tainted 3.5.0-00019-g472719a #221 Supermicro X8DTH-i/6/iF/6F/X8DTH
[ 3663.686193] RIP: 0010:[<ffffffff81480ed5>] [<ffffffff81480ed5>] _raw_spin_lock_irqsave+0x45/0x60
[ 3663.686196] RSP: 0018:ffff88066d5af898 EFLAGS: 00000002
[ 3663.686196] RAX: ffff88063fffcb00 RBX: ffff88063fffcb00 RCX: 00000000000000cd
[ 3663.686197] RDX: 00000000000000c6 RSI: 000000000000015a RDI: ffff88063fffcb00
[ 3663.686198] RBP: ffff88066d5af8a8 R08: 0000000000000000 R09: 0000000000000000
[ 3663.686199] R10: ffff88063fffcb98 R11: ffff88063fffcc38 R12: 0000000000000246
[ 3663.686199] R13: ffff88063fffcba8 R14: ffff88063fffcb90 R15: ffff88063fffc680
[ 3663.686200] Uhhuh. NMI received for unknown reason 2d on CPU 11.
[ 3663.686201] FS: 00007ffff3ee0700(0000) GS:ffff880c3fd00000(0000) knlGS:0000000000000000
[ 3663.686202] Do you have a strange power saving mode enabled?
[ 3663.686203] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 3663.686203] Dazed and confused, but trying to continue
[ 3663.686204] CR2: ffffffffff600400 CR3: 00000002cae55000 CR4: 00000000000007e0
[ 3663.686205] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3663.686206] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 3663.686207] Process ceph-osd (pid: 97755, threadinfo ffff88066d5ae000, task ffff880a79a52e00)
[ 3663.686207] Stack:
[ 3663.686208] 0000000000001cbf 0000000000000000 ffff88066d5af948 ffffffff8111a760
[ 3663.686211] ffff8806245e9700 ffff88066d5af8c8 0000000000000014 0000000000000000
[ 3663.686214] ffff88063fffcba0 000000003fffcb98 ffff88066d5afa18 0000000000002000
[ 3663.686217] Call Trace:
[ 3663.686218] [<ffffffff8111a760>] isolate_migratepages_range+0x150/0x4e0
[ 3663.686221] [<ffffffff8111a5b0>] ? isolate_freepages+0x330/0x330
[ 3663.686223] [<ffffffff8111af5b>] compact_zone+0x46b/0x4f0
[ 3663.686225] [<ffffffff8111b3f8>] compact_zone_order+0xe8/0x100
[ 3663.686228] [<ffffffff8111b4b6>] try_to_compact_pages+0xa6/0x110
[ 3663.686230] [<ffffffff81100339>] __alloc_pages_direct_compact+0xd9/0x250
[ 3663.686233] [<ffffffff81100883>] __alloc_pages_slowpath+0x3d3/0x750
[ 3663.686236] [<ffffffff81100d3e>] __alloc_pages_nodemask+0x13e/0x1d0
[ 3663.686238] [<ffffffff8113c894>] alloc_pages_vma+0x124/0x150
[ 3663.686240] [<ffffffff8114e065>] do_huge_pmd_anonymous_page+0xf5/0x1e0
[ 3663.686243] [<ffffffff81121bcd>] handle_mm_fault+0x21d/0x320
[ 3663.686245] [<ffffffff8124bca4>] ? call_rwsem_down_read_failed+0x14/0x30
[ 3663.686247] [<ffffffff81484e49>] do_page_fault+0x439/0x4a0
[ 3663.686250] [<ffffffff8106707d>] ? up_write+0x1d/0x20
[ 3663.686252] [<ffffffff81113656>] ? vm_mmap_pgoff+0x96/0xb0
[ 3663.686254] [<ffffffff8124bdaa>] ? trace_hardirqs_off_thunk+0x3a/0x6c
[ 3663.686257] [<ffffffff8148154f>] page_fault+0x1f/0x30
[ 3663.686259] Code: 8b 14 25 48 b7 00 00 83 82 44 e0 ff ff 01 ba 00 01 00 00 f0 66 0f c1 13 89 d1 66 c1 e9 08 38 d1 74 0d 0f 1f 40 00 f3 90 0f b6 13 <38> d1 75 f7 5b 4c 89 e0 41 5c c9 c3 66 66 66 66 66 66 2e 0f 1f
[ 3663.686284] NMI backtrace for cpu 13
[ 3663.686285] CPU 13 Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa iw_cxgb4 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macvlan tun uinput sg joydev sd_mod hid_generic coretemp hwmon kvm crc32c_intel ghash_clmulni_intel aesni_intel cryptd aes_x86_64 microcode serio_raw pcspkr ata_piix libata button mlx4_ib ib_mad ib_core mlx4_en mlx4_core mpt2sas scsi_transport_sas raid_class scsi_mod cxgb4 i2c_i801 i2c_core lpc_ich mfd_core ehci_hcd uhci_hcd i7core_edac edac_core dm_mod ioatdma nfs nfs_acl auth_rpcgss fscache lockd sunrpc broadcom tg3 bnx2 igb dca e1000 [last unloaded: scsi_wait_scan]
[ 3663.686300] Uhhuh. NMI received for unknown reason 2d on CPU 12.
[ 3663.686301] Do you have a strange power saving mode enabled?
[ 3663.686302] Dazed and confused, but trying to continue
[ 3663.686318]
[ 3663.686319] Pid: 98427, comm: ceph-osd Not tainted 3.5.0-00019-g472719a #221 Supermicro X8DTH-i/6/iF/6F/X8DTH
[ 3663.686321] RIP: 0010:[<ffffffff81480ed0>] [<ffffffff81480ed0>] _raw_spin_lock_irqsave+0x40/0x60
[ 3663.686324] RSP: 0018:ffff880356409898 EFLAGS: 00000016
[ 3663.686324] RAX: ffff88063fffcb00 RBX: ffff88063fffcb00 RCX: 00000000000000d2
[ 3663.686325] RDX: 00000000000000c6 RSI: 000000000000015a RDI: ffff88063fffcb00
[ 3663.686326] RBP: ffff8803564098a8 R08: 0000000000000000 R09: 0000000000000000
[ 3663.686327] R10: ffff88063fffcb98 R11: ffff88063fffcc38 R12: 0000000000000246
[ 3663.686327] R13: ffff88063fffcba8 R14: ffff88063fffcb90 R15: ffff88063fffc680
[ 3663.686328] FS: 00007fffc794b700(0000) GS:ffff880627ce0000(0000) knlGS:0000000000000000
[ 3663.686329] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 3663.686330] CR2: ffffffffff600400 CR3: 00000002bc512000 CR4: 00000000000007e0
[ 3663.686331] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3663.686332] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 3663.686333] Process ceph-osd (pid: 98427, threadinfo ffff880356408000, task ffff880027de5c00)
[ 3663.686333] Stack:
[ 3663.686333] 0000000000001061 0000000000000000 ffff880356409948 ffffffff8111a760
[ 3663.686337] ffff8806245c5c00 ffff8803564098c8 000000000000000d 0000000000000000
[ 3663.686340] ffff88063fffcba0 000000003fffcb98 ffff880356409a18 0000000000001400
[ 3663.686343] Call Trace:
[ 3663.686343] [<ffffffff8111a760>] isolate_migratepages_range+0x150/0x4e0
[ 3663.686346] [<ffffffff8111a5b0>] ? isolate_freepages+0x330/0x330
[ 3663.686348] [<ffffffff8111af5b>] compact_zone+0x46b/0x4f0
[ 3663.686350] [<ffffffff8111b3f8>] compact_zone_order+0xe8/0x100
[ 3663.686352] [<ffffffff8111b4b6>] try_to_compact_pages+0xa6/0x110
[ 3663.686354] [<ffffffff81100339>] __alloc_pages_direct_compact+0xd9/0x250
[ 3663.686357] [<ffffffff81100883>] __alloc_pages_slowpath+0x3d3/0x750
[ 3663.686360] [<ffffffff81100d3e>] __alloc_pages_nodemask+0x13e/0x1d0
[ 3663.686362] [<ffffffff8113c894>] alloc_pages_vma+0x124/0x150
[ 3663.686364] [<ffffffff8114e065>] do_huge_pmd_anonymous_page+0xf5/0x1e0
[ 3663.686366] [<ffffffff81121bcd>] handle_mm_fault+0x21d/0x320
[ 3663.686368] [<ffffffff8124bca4>] ? call_rwsem_down_read_failed+0x14/0x30
[ 3663.686370] [<ffffffff81484e49>] do_page_fault+0x439/0x4a0
[ 3663.686373] [<ffffffff8106707d>] ? up_write+0x1d/0x20
[ 3663.686375] [<ffffffff81113656>] ? vm_mmap_pgoff+0x96/0xb0
[ 3663.686377] [<ffffffff8124bdaa>] ? trace_hardirqs_off_thunk+0x3a/0x6c
[ 3663.686379] [<ffffffff8148154f>] page_fault+0x1f/0x30
[ 3663.686381] Code: 6a c5 ff 65 48 8b 14 25 48 b7 00 00 83 82 44 e0 ff ff 01 ba 00 01 00 00 f0 66 0f c1 13 89 d1 66 c1 e9 08 38 d1 74 0d 0f 1f 40 00 <f3> 90 0f b6 13 38 d1 75 f7 5b 4c 89 e0 41 5c c9 c3 66 66 66 66

Please let me know what I can do next to help sort this out.

Thanks -- Jim

Mel Gorman

Aug 12, 2012, 4:30:01 PM

On Fri, Aug 10, 2012 at 11:20:07AM -0600, Jim Schutt wrote:
> On 08/10/2012 05:02 AM, Mel Gorman wrote:
> >On Thu, Aug 09, 2012 at 04:38:24PM -0600, Jim Schutt wrote:
>
> >>>
> >>>Ok, this is an untested hack and I expect it would drop allocation success
> >>>rates again under load (but not as much). Can you test again and see what
> >>>effect, if any, it has please?
> >>>
> >>>---8<---
> >>>mm: compaction: back out if contended
> >>>
> >>>---
> >>
> >><snip>
> >>
> >>Initial testing with this patch looks very good from
> >>my perspective; CPU utilization stays reasonable,
> >>write-out rate stays high, no signs of stress.
> >>Here's an example after ~10 minutes under my test load:
> >>
>
> Hmmm, I wonder if I should have tested this patch longer,
> in view of the trouble I ran into testing the new patch?
> See below.
>

The two patches are quite different in what they do. I think it's
unlikely they would share a common bug.

> > <SNIP>

> >---8<---
> >mm: compaction: Abort async compaction if locks are contended or taking too long
>
>
> Hmmm, while testing this patch, a couple of my servers got
> stuck after ~30 minutes or so, like this:
>
> [ 2515.869936] INFO: task ceph-osd:30375 blocked for more than 120 seconds.
> [ 2515.876630] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 2515.884447] ceph-osd D 0000000000000000 0 30375 1 0x00000000
> [ 2515.891531] ffff8802e1a99e38 0000000000000082 ffff88056b38e298 ffff8802e1a99fd8
> [ 2515.899013] ffff8802e1a98010 ffff8802e1a98000 ffff8802e1a98000 ffff8802e1a98000
> [ 2515.906482] ffff8802e1a99fd8 ffff8802e1a98000 ffff880697d31700 ffff8802e1a84500
> [ 2515.913968] Call Trace:
> [ 2515.916433] [<ffffffff8147fded>] schedule+0x5d/0x60
> [ 2515.921417] [<ffffffff81480b25>] rwsem_down_failed_common+0x105/0x140
> [ 2515.927938] [<ffffffff81480b73>] rwsem_down_write_failed+0x13/0x20
> [ 2515.934195] [<ffffffff8124bcd3>] call_rwsem_down_write_failed+0x13/0x20
> [ 2515.940934] [<ffffffff8147edc5>] ? down_write+0x45/0x50
> [ 2515.946244] [<ffffffff81127b62>] sys_mprotect+0xd2/0x240
> [ 2515.951640] [<ffffffff81489412>] system_call_fastpath+0x16/0x1b

> <SNIP>

>
> I tried to capture a perf trace while this was going on, but it
> never completed. "ps" on this system reports lots of kernel threads
> and some user-space stuff, but hangs part way through - no ceph
> executables in the output, oddly.
>

ps is probably locking up because it's trying to access a proc file for
a process that is not releasing the mmap_sem.

I went through the patch again but only found the following which is a
weak candidate. Still, can you retest with the following patch on top and
CONFIG_PROVE_LOCKING set please?

---8<---
diff --git a/mm/compaction.c b/mm/compaction.c
index 1827d9a..d4a51c6 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -64,7 +64,7 @@ static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
 {
 	if (need_resched() || spin_is_contended(lock)) {
 		if (locked) {
-			spin_unlock_irq(lock);
+			spin_unlock_irqrestore(lock, *flags);
 			locked = false;
 		}

@@ -276,8 +276,8 @@ static void acct_isolated(struct zone *zone, struct compact_control *cc)
 	list_for_each_entry(page, &cc->migratepages, lru)
 		count[!!page_is_file_cache(page)]++;

-	__mod_zone_page_state(zone, NR_ISOLATED_ANON, count[0]);
-	__mod_zone_page_state(zone, NR_ISOLATED_FILE, count[1]);
+	mod_zone_page_state(zone, NR_ISOLATED_ANON, count[0]);
+	mod_zone_page_state(zone, NR_ISOLATED_FILE, count[1]);
 }

/* Similar to reclaim, but different enough that they don't share logic */

Jim Schutt

Aug 13, 2012, 4:40:04 PM

Hi Mel,

On 08/12/2012 02:22 PM, Mel Gorman wrote:

>
> I went through the patch again but only found the following which is a
> weak candidate. Still, can you retest with the following patch on top and
> CONFIG_PROVE_LOCKING set please?
>

I've gotten in several hours of testing on this patch with
no issues at all, and no output from CONFIG_PROVE_LOCKING
(I'm assuming it would show up on a serial console). So,
it seems to me this patch has done the trick.

CPU utilization is staying under control, and write-out rate
is good.

You can add my Tested-by: as you see fit. If you work
up any refinements and would like me to test, please
let me know.

Thanks -- Jim

Mel Gorman

Aug 14, 2012, 5:30:02 AM

On Mon, Aug 13, 2012 at 02:35:46PM -0600, Jim Schutt wrote:
> Hi Mel,
>
> On 08/12/2012 02:22 PM, Mel Gorman wrote:
>
> >
> >I went through the patch again but only found the following which is a
> >weak candidate. Still, can you retest with the following patch on top and
> >CONFIG_PROVE_LOCKING set please?
> >
>
> I've gotten in several hours of testing on this patch with
> no issues at all, and no output from CONFIG_PROVE_LOCKING
> (I'm assuming it would show up on a serial console). So,
> it seems to me this patch has done the trick.
>

Super.

> CPU utilization is staying under control, and write-out rate
> is good.
>

Even better.

> You can add my Tested-by: as you see fit. If you work
> up any refinements and would like me to test, please
> let me know.
>

I'll be adding your Tested-by and I'll keep you cc'd on the series. It'll
look a little different because I expect to adjust it slightly to match
Andrew's tree but there should be no major surprises and my expectation is
that testing a -rc kernel after it gets merged is all that is necessary. I'm
planning to backport this to -stable but it'll remain to be seen if I can
convince the relevant maintainers that it should be merged.

Thanks.

--
Mel Gorman
SUSE Labs