When direct reclaim is running reclaim/compaction, there is a minimum
number of pages it reclaims. As it must be under the low watermark to be
in direct reclaim it has also woken kswapd to do some work. This patch
has kswapd use the same logic as direct reclaim to reclaim a minimum
number of pages so compaction can run later.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0cb2593..afdec93 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1701,7 +1701,7 @@ static bool in_reclaim_compaction(struct scan_control *sc)
* calls try_to_compact_zone() that it will have enough free pages to succeed.
* It will give up earlier than that if there is difficulty reclaiming pages.
*/
-static inline bool should_continue_reclaim(struct lruvec *lruvec,
+static bool should_continue_reclaim(struct lruvec *lruvec,
unsigned long nr_reclaimed,
unsigned long nr_scanned,
struct scan_control *sc)
@@ -1768,6 +1768,17 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
}
}
+static inline bool should_continue_reclaim_zone(struct zone *zone,
+ unsigned long nr_reclaimed,
+ unsigned long nr_scanned,
+ struct scan_control *sc)
+{
+ struct mem_cgroup *memcg = mem_cgroup_iter(NULL, NULL, NULL);
+ struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+
+ return should_continue_reclaim(lruvec, nr_reclaimed, nr_scanned, sc);
+}
+
/*
* This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
*/
@@ -2496,8 +2507,10 @@ loop_again:
*/
testorder = order;
if (COMPACTION_BUILD && order &&
- compaction_suitable(zone, order) !=
- COMPACT_SKIPPED)
+ !should_continue_reclaim_zone(zone,
+ nr_soft_reclaimed,
+ sc.nr_scanned - nr_soft_scanned,
+ &sc))
testorder = 0;
if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
-- 1.7.9.2
commit [7db8889a: mm: have order > 0 compaction start off where it left]
introduced a caching mechanism to reduce the amount work the free page
scanner does in compaction. However, it has a problem. Consider two process
simultaneously scanning free pages
C
Process A M S F
|---------------------------------------|
Process B M FS
C is zone->compact_cached_free_pfn
S is cc->start_pfree_pfn
M is cc->migrate_pfn
F is cc->free_pfn
In this diagram, Process A has just reached its migrate scanner, wrapped
around and updated compact_cached_free_pfn accordingly.
Simultaneously, Process B finishes isolating in a block and updates
compact_cached_free_pfn again to the location of its free scanner.
Process A moves to "end_of_zone - one_pageblock" and runs this check
compact_cached_free_pfn is above where it started so the free scanner skips
almost the entire space it should have scanned. When there are multiple
processes compacting it can end in a situation where the entire zone is
not being scanned at all. Further, it is possible for two processes to
ping-pong update to compact_cached_free_pfn which is just random.
Overall, the end result wrecks allocation success rates.
There is not an obvious way around this problem without introducing new
locking and state so this patch takes a different approach.
First, it gets rid of the skip logic because it's not clear that it matters
if two free scanners happen to be in the same block but with racing updates
it's too easy for it to skip over blocks it should not.
Second, it updates compact_cached_free_pfn in a more limited set of
circumstances.
If a scanner has wrapped, it updates compact_cached_free_pfn to the end
of the zone. Each time a wrapped scanner isoaltes a page, it
updates compact_cached_free_pfn. The intention is that after
wrapping, the compact_cached_free_pfn will be at the highest
pageblock with free pages when compaction completes.
If a scanner has not wrapped when compaction completes and
compact_cached_free_pfn is set the end of the the zone, initialise
it once.
This is not optimal and it can still race but the compact_cached_free_pfn
will be pointing to or very near a pageblock with free pages.
/*
+ * Returns the start pfn of the last page block in a zone. This is the starting
+ * point for full compaction of a zone. Compaction searches for free pages from
+ * the end of each zone, while isolate_freepages_block scans forward inside each
+ * page block.
+ */
+static unsigned long start_free_pfn(struct zone *zone)
+{
+ unsigned long free_pfn;
+ free_pfn = zone->zone_start_pfn + zone->spanned_pages;
+ free_pfn &= ~(pageblock_nr_pages-1);
+ return free_pfn;
+}
+
+/*
* Based on information in the current compact_control, find blocks
* suitable for isolating free pages from and then isolate them.
*/
@@ -457,17 +471,6 @@ static void isolate_freepages(struct zone *zone,
pfn -= pageblock_nr_pages) {
unsigned long isolated;
- /*
- * Skip ahead if another thread is compacting in the area
- * simultaneously. If we wrapped around, we can only skip
- * ahead if zone->compact_cached_free_pfn also wrapped to
- * above our starting point.
- */
- if (cc->order > 0 && (!cc->wrapped ||
- zone->compact_cached_free_pfn >
- cc->start_free_pfn))
- pfn = min(pfn, zone->compact_cached_free_pfn);
-
if (!pfn_valid(pfn))
continue;
@@ -510,7 +513,15 @@ static void isolate_freepages(struct zone *zone,
*/
if (isolated) {
high_pfn = max(high_pfn, pfn);
- if (cc->order > 0)
+
+ /*
+ * If the free scanner has wrapped, update
+ * compact_cached_free_pfn to point to the highest
+ * pageblock with free pages. This reduces excessive
+ * scanning of full pageblocks near the end of the
+ * zone
+ */
+ if (cc->order > 0 && cc->wrapped)
zone->compact_cached_free_pfn = high_pfn;
}
}
@@ -520,6 +531,11 @@ static void isolate_freepages(struct zone *zone,
cc->free_pfn = high_pfn;
cc->nr_freepages = nr_freepages;
+
+ /* If compact_cached_free_pfn is reset then set it now */
+ if (cc->order > 0 && !cc->wrapped &&
+ zone->compact_cached_free_pfn == start_free_pfn(zone))
+ zone->compact_cached_free_pfn = high_pfn;
}
-/*
- * Returns the start pfn of the last page block in a zone. This is the starting
- * point for full compaction of a zone. Compaction searches for free pages from
- * the end of each zone, while isolate_freepages_block scans forward inside each
- * page block.
- */
-static unsigned long start_free_pfn(struct zone *zone)
-{
- unsigned long free_pfn;
- free_pfn = zone->zone_start_pfn + zone->spanned_pages;
- free_pfn &= ~(pageblock_nr_pages-1);
- return free_pfn;
-}
-
static int compact_finished(struct zone *zone,
struct compact_control *cc)
{
-- 1.7.9.2
If allocation fails after compaction then compaction may be deferred for
a number of allocation attempts. If there are subsequent failures,
compact_defer_shift is increased to defer for longer periods. This patch
uses that information to scale the number of pages reclaimed with
compact_defer_shift until allocations succeed again.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 66e4310..0cb2593 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1708,6 +1708,7 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
{
unsigned long pages_for_compaction;
unsigned long inactive_lru_pages;
+ struct zone *zone;
/* If not in reclaim/compaction mode, stop */
if (!in_reclaim_compaction(sc))
@@ -1741,6 +1742,15 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
* inactive lists are large enough, continue reclaiming
*/
pages_for_compaction = (2UL << sc->order);
+
+ /*
+ * If compaction is deferred for this order then scale the number of
+ * pages reclaimed based on the number of consecutive allocation
+ * failures
+ */
+ zone = lruvec_zone(lruvec);
+ if (zone->compact_order_failed >= sc->order)
+ pages_for_compaction <<= zone->compact_defer_shift;
inactive_lru_pages = get_lru_size(lruvec, LRU_INACTIVE_FILE);
if (nr_swap_pages > 0)
inactive_lru_pages += get_lru_size(lruvec, LRU_INACTIVE_ANON);
-- 1.7.9.2
While compaction is moving pages to free up large contiguous blocks for
allocation it races with other allocation requests that may steal these
blocks or break them up. This patch alters direct compaction to capture a
suitable free page as soon as it becomes available to reduce this race. It
uses similar logic to split_free_page() to ensure that watermarks are
still obeyed.
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 51a90b7..5673459 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -22,7 +22,7 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
extern int fragmentation_index(struct zone *zone, unsigned int order);
extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *mask,
- bool sync);
+ bool sync, struct page **page);
extern int compact_pgdat(pg_data_t *pgdat, int order);
extern unsigned long compaction_suitable(struct zone *zone, int order);
void split_page(struct page *page, unsigned int order);
int split_free_page(struct page *page);
+int capture_free_page(struct page *page, int alloc_order, int migratetype);
/*
* Compound pages have a destructor function. Provide a
diff --git a/mm/compaction.c b/mm/compaction.c
index 95ca967..63af8d2 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -50,6 +50,41 @@ static inline bool migrate_async_suitable(int migratetype)
return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE;
}
+static void compact_capture_page(struct compact_control *cc)
+{
+ unsigned long flags;
+ int mtype;
+
+ if (!cc->page || *cc->page)
+ return;
+
+ /* Speculatively examine the free lists without zone lock */
+ for (mtype = 0; mtype < MIGRATE_PCPTYPES; mtype++) {
+ int order;
+ for (order = cc->order; order < MAX_ORDER; order++) {
+ struct page *page;
+ struct free_area *area;
+ area = &(cc->zone->free_area[order]);
+ if (list_empty(&area->free_list[mtype]))
+ continue;
+
+ /* Take the lock and attempt capture of the page */
+ spin_lock_irqsave(&cc->zone->lock, flags);
+ if (!list_empty(&area->free_list[mtype])) {
+ page = list_entry(area->free_list[mtype].next,
+ struct page, lru);
+ if (capture_free_page(page, cc->order, mtype)) {
+ spin_unlock_irqrestore(&cc->zone->lock,
+ flags);
+ *cc->page = page;
+ return;
+ }
+ }
+ spin_unlock_irqrestore(&cc->zone->lock, flags);
+ }
+ }
+}
+
/*
* Isolate free pages onto a private freelist. Caller must hold zone->lock.
* If @strict is true, will abort returning 0 on any invalid PFNs or non-free
@@ -561,7 +596,6 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
static int compact_finished(struct zone *zone,
struct compact_control *cc)
{
- unsigned int order;
unsigned long watermark;
if (fatal_signal_pending(current))
@@ -586,14 +620,22 @@ static int compact_finished(struct zone *zone,
return COMPACT_CONTINUE;
/* Direct compactor: Is a suitable page free? */
- for (order = cc->order; order < MAX_ORDER; order++) {
- /* Job done if page is free of the right migratetype */
- if (!list_empty(&zone->free_area[order].free_list[cc->migratetype]))
- return COMPACT_PARTIAL;
-
- /* Job done if allocation would set block type */
- if (order >= pageblock_order && zone->free_area[order].nr_free)
+ if (cc->page) {
+ /* Was a suitable page captured? */
+ if (*cc->page)
return COMPACT_PARTIAL;
+ } else {
+ unsigned int order;
+ for (order = cc->order; order < MAX_ORDER; order++) {
+ struct free_area *area = &zone->free_area[cc->order];
+ /* Job done if page is free of the right migratetype */
+ if (!list_empty(&area->free_list[cc->migratetype]))
+ return COMPACT_PARTIAL;
+
+ /* Job done if allocation would set block type */
+ if (cc->order >= pageblock_order && area->nr_free)
+ return COMPACT_PARTIAL;
+ }
}
return COMPACT_CONTINUE;
@@ -708,6 +750,10 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
goto out;
}
}
+
+ /* Capture a page now if it is a suitable size */
+ if (cc->migratetype == MIGRATE_MOVABLE)
+ compact_capture_page(cc);
}
out:
@@ -720,7 +766,7 @@ out:
static unsigned long compact_zone_order(struct zone *zone,
int order, gfp_t gfp_mask,
- bool sync)
+ bool sync, struct page **page)
{
struct compact_control cc = {
.nr_freepages = 0,
@@ -729,6 +775,7 @@ static unsigned long compact_zone_order(struct zone *zone,
.migratetype = allocflags_to_migratetype(gfp_mask),
.zone = zone,
.sync = sync,
+ .page = page,
};
INIT_LIST_HEAD(&cc.freepages);
INIT_LIST_HEAD(&cc.migratepages);
@@ -750,7 +797,7 @@ int sysctl_extfrag_threshold = 500;
*/
unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *nodemask,
- bool sync)
+ bool sync, struct page **page)
{
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
int may_enter_fs = gfp_mask & __GFP_FS;
@@ -770,7 +817,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
nodemask) {
int status;
- status = compact_zone_order(zone, order, gfp_mask, sync);
+ status = compact_zone_order(zone, order, gfp_mask, sync, page);
rc = max(status, rc);
/* If a normal allocation would succeed, stop compacting */
@@ -825,6 +872,7 @@ int compact_pgdat(pg_data_t *pgdat, int order)
struct compact_control cc = {
.order = order,
.sync = false,
+ .page = NULL,
};
return __compact_pgdat(NODE_DATA(nid), &cc);
diff --git a/mm/internal.h b/mm/internal.h
index 2ba87fb..9156714 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -124,6 +124,7 @@ struct compact_control {
int order; /* order a direct compactor needs */
int migratetype; /* MOVABLE, RECLAIMABLE etc */
struct zone *zone;
+ struct page **page; /* Page captured of requested size */
};
unsigned long
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4a4f921..adc3aa8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1374,16 +1374,11 @@ void split_page(struct page *page, unsigned int order)
}
/*
- * Similar to split_page except the page is already free. As this is only
- * being used for migration, the migratetype of the block also changes.
- * As this is called with interrupts disabled, the caller is responsible
- * for calling arch_alloc_page() and kernel_map_page() after interrupts
- * are enabled.
- *
- * Note: this is probably too low level an operation for use in drivers.
- * Please consult with lkml before using this in your driver.
+ * Similar to the split_page family of functions except that the page
+ * required at the given order and being isolated now to prevent races
+ * with parallel allocators
*/
-int split_free_page(struct page *page)
+int capture_free_page(struct page *page, int alloc_order, int migratetype)
{
unsigned int order;
unsigned long watermark;
@@ -1405,10 +1400,11 @@ int split_free_page(struct page *page)
rmv_page_order(page);
__mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
+ /* Set the pageblock if the captured page is at least a pageblock */
if (order >= pageblock_order - 1) {
struct page *endpage = page + (1 << order) - 1;
for (; page < endpage; page += pageblock_nr_pages) {
@@ -1419,7 +1415,35 @@ int split_free_page(struct page *page)
}
}
- return 1 << order;
+ return 1UL << order;
+}
+
+/*
+ * Similar to split_page except the page is already free. As this is only
+ * being used for migration, the migratetype of the block also changes.
+ * As this is called with interrupts disabled, the caller is responsible
+ * for calling arch_alloc_page() and kernel_map_page() after interrupts
+ * are enabled.
+ *
+ * Note: this is probably too low level an operation for use in drivers.
+ * Please consult with lkml before using this in your driver.
+ */
+int split_free_page(struct page *page)
+{
+ unsigned int order;
+ int nr_pages;
+
+ BUG_ON(!PageBuddy(page));
+ order = page_order(page);
+
+ nr_pages = capture_free_page(page, order, 0);
+ if (!nr_pages)
+ return 0;
+
+ /* Split into individual pages */
+ set_page_refcounted(page);
+ split_page(page, order);
+ return nr_pages;
}
The comment about order applied when the check was
order > PAGE_ALLOC_COSTLY_ORDER which has not been the case since
[c5a73c3d: thp: use compaction for all allocation orders]. Fixing
the comment while I'm in the general area.
diff --git a/mm/compaction.c b/mm/compaction.c
index b39ede1..95ca967 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -759,11 +759,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
struct zone *zone;
int rc = COMPACT_SKIPPED;
- /*
- * Check whether it is worth even starting compaction. The order check is
- * made because an assumption is made that the page allocator can satisfy
- * the "cheaper" orders without taking special steps
- */
+ /* Check if the GFP flags allow compaction */
if (!order || !may_enter_fs || !may_perform_io)
return rc;
This commit is already upstream as [7db8889a: mm: have order > 0 compaction
start off where it left]. It's included in this series to provide context
to the next patch as the series is based on 3.5.
Order > 0 compaction stops when enough free pages of the correct page
order have been coalesced. When doing subsequent higher order
allocations, it is possible for compaction to be invoked many times.
However, the compaction code always starts out looking for things to
compact at the start of the zone, and for free pages to compact things to
at the end of the zone.
This can cause quadratic behaviour, with isolate_freepages starting at the
end of the zone each time, even though previous invocations of the
compaction code already filled up all free memory on that end of the zone.
This can cause isolate_freepages to take enormous amounts of CPU with
certain workloads on larger memory systems.
The obvious solution is to have isolate_freepages remember where it left
off last time, and continue at that point the next time it gets invoked
for an order > 0 compaction. This could cause compaction to fail if
cc->free_pfn and cc->migrate_pfn are close together initially, in that
case we restart from the end of the zone and try once more.
Forced full (order == -1) compactions are left alone.
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 68c569f..6340f38 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -369,6 +369,10 @@ struct zone {
*/
spinlock_t lock;
int all_unreclaimable; /* All pages pinned */
+#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+ /* pfn where the last incremental compaction isolated free pages */
+ unsigned long compact_cached_free_pfn;
+#endif
#ifdef CONFIG_MEMORY_HOTPLUG
/* see spanned/present_pages for more description */
seqlock_t span_seqlock;
diff --git a/mm/compaction.c b/mm/compaction.c
index 63af8d2..be310f1 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -457,6 +457,17 @@ static void isolate_freepages(struct zone *zone,
pfn -= pageblock_nr_pages) {
unsigned long isolated;
+ /*
+ * Skip ahead if another thread is compacting in the area
+ * simultaneously. If we wrapped around, we can only skip
+ * ahead if zone->compact_cached_free_pfn also wrapped to
+ * above our starting point.
+ */
+ if (cc->order > 0 && (!cc->wrapped ||
+ zone->compact_cached_free_pfn >
+ cc->start_free_pfn))
+ pfn = min(pfn, zone->compact_cached_free_pfn);
+
if (!pfn_valid(pfn))
continue;
@@ -497,8 +508,11 @@ static void isolate_freepages(struct zone *zone,
* looking for free pages, the search will restart here as
* page migration may have returned some pages to the allocator
*/
- if (isolated)
+ if (isolated) {
high_pfn = max(high_pfn, pfn);
+ if (cc->order > 0)
+ zone->compact_cached_free_pfn = high_pfn;
+ }
}
/* split_free_page does not map the pages */
@@ -593,6 +607,20 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
return ISOLATE_SUCCESS;
}
+/*
+ * Returns the start pfn of the last page block in a zone. This is the starting
+ * point for full compaction of a zone. Compaction searches for free pages from
+ * the end of each zone, while isolate_freepages_block scans forward inside each
+ * page block.
+ */
+static unsigned long start_free_pfn(struct zone *zone)
+{
+ unsigned long free_pfn;
+ free_pfn = zone->zone_start_pfn + zone->spanned_pages;
+ free_pfn &= ~(pageblock_nr_pages-1);
+ return free_pfn;
+}
+
static int compact_finished(struct zone *zone,
struct compact_control *cc)
{
@@ -601,8 +629,26 @@ static int compact_finished(struct zone *zone,
if (fatal_signal_pending(current))
return COMPACT_PARTIAL;
- /* Compaction run completes if the migrate and free scanner meet */
- if (cc->free_pfn <= cc->migrate_pfn)
+ /*
+ * A full (order == -1) compaction run starts at the beginning and
+ * end of a zone; it completes when the migrate and free scanner meet.
+ * A partial (order > 0) compaction can start with the free scanner
+ * at a random point in the zone, and may have to restart.
+ */
+ if (cc->free_pfn <= cc->migrate_pfn) {
+ if (cc->order > 0 && !cc->wrapped) {
+ /* We started partway through; restart at the end. */
+ unsigned long free_pfn = start_free_pfn(zone);
+ zone->compact_cached_free_pfn = free_pfn;
+ cc->free_pfn = free_pfn;
+ cc->wrapped = 1;
+ return COMPACT_CONTINUE;
+ }
+ return COMPACT_COMPLETE;
+ }
+
+ /* We wrapped around and ended up where we started. */
+ if (cc->wrapped && cc->free_pfn <= cc->start_free_pfn)
return COMPACT_COMPLETE;
/*
@@ -708,8 +754,15 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
/* Setup to move all movable pages to the end of the zone */
cc->migrate_pfn = zone->zone_start_pfn;
- cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
- cc->free_pfn &= ~(pageblock_nr_pages-1);
+
+ if (cc->order > 0) {
+ /* Incremental compaction. Start where the last one stopped. */
+ cc->free_pfn = zone->compact_cached_free_pfn;
+ cc->start_free_pfn = cc->free_pfn;
+ } else {
+ /* Order == -1 starts at the end of the zone. */
+ cc->free_pfn = start_free_pfn(zone);
+ }
migrate_prep_local();
diff --git a/mm/internal.h b/mm/internal.h
index 9156714..064f6ef 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -118,8 +118,14 @@ struct compact_control {
unsigned long nr_freepages; /* Number of isolated free pages */
unsigned long nr_migratepages; /* Number of pages to migrate */
unsigned long free_pfn; /* isolate_freepages search base */
+ unsigned long start_free_pfn; /* where we started the search */
unsigned long migrate_pfn; /* isolate_migratepages search base */
bool sync; /* Synchronous migration */
+ bool wrapped; /* Order > 0 compactions are
+ incremental, once free_pfn
+ and migrate_pfn meet, we restart
+ from the top of the zone;
+ remember we wrapped around. */
int order; /* order a direct compactor needs */
int migratetype; /* MOVABLE, RECLAIMABLE etc */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index adc3aa8..781d6e4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4425,6 +4425,11 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
Allocation success rates have been far lower since 3.4 due to commit
[fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled]. This
commit was introduced for good reasons and it was known in advance that
the success rates would suffer but it was justified on the grounds that
the high allocation success rates were achieved by aggressive reclaim.
Success rates are expected to suffer even more in 3.6 due to commit
[7db8889a: mm: have order > 0 compaction start off where it left] which
testing has shown to severely reduce allocation success rates under load -
to 0% in one case. There is a proposed change to that patch in this series
and it would be ideal if Jim Schutt could retest the workload that led to
commit [7db8889a: mm: have order > 0 compaction start off where it left].
This series aims to improve the allocation success rates without regressing
the benefits of commit fe2c2a10. The series is based on 3.5 and includes
the commit 7db8889a to illustrate what impact it has to success rates.
Patch 1 updates a stale comment seeing as I was in the general area.
Patch 2 updates reclaim/compaction to reclaim pages scaled on the number
of recent failures.
Patch 3 has kswapd use similar logic to direct reclaim when deciding whether
to continue reclaiming for reclaim/compaction or not.
Patch 4 captures suitable high-order pages freed by compaction to reduce
races with parallel allocation requests.
Patch 5 is an upstream commit that has compaction restart free page scanning
from an old position instead of always starting from the end of the
zone
Patch 6 adjusts patch 5 to restores allocation success rates.
From
http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__stres... I know that the allocation success rates in 3.3.6 was 78% in comparison
to 36% in 3.5. With the full series applied, the success rates are up
to 62% which is still much less but it does not reclaim excessively.
Note what patch 5 which is the upstream commit fe2c2a10 did to allocation
success rates.
> The comment about order applied when the check was
> order> PAGE_ALLOC_COSTLY_ORDER which has not been the case since
> [c5a73c3d: thp: use compaction for all allocation orders]. Fixing
> the comment while I'm in the general area.
> If allocation fails after compaction then compaction may be deferred for
> a number of allocation attempts. If there are subsequent failures,
> compact_defer_shift is increased to defer for longer periods. This patch
> uses that information to scale the number of pages reclaimed with
> compact_defer_shift until allocations succeed again.
Discussion subject changed to "mm: kswapd: Continue reclaiming for reclaim/compaction if the minimum number of pages have not been reclaimed" by Rik van Riel
> When direct reclaim is running reclaim/compaction, there is a minimum
> number of pages it reclaims. As it must be under the low watermark to be
> in direct reclaim it has also woken kswapd to do some work. This patch
> has kswapd use the same logic as direct reclaim to reclaim a minimum
> number of pages so compaction can run later.
> While compaction is moving pages to free up large contiguous blocks for
> allocation it races with other allocation requests that may steal these
> blocks or break them up. This patch alters direct compaction to capture a
> suitable free page as soon as it becomes available to reduce this race. It
> uses similar logic to split_free_page() to ensure that watermarks are
> still obeyed.
> commit [7db8889a: mm: have order> 0 compaction start off where it left]
> introduced a caching mechanism to reduce the amount work the free page
> scanner does in compaction. However, it has a problem. Consider two process
> simultaneously scanning free pages
> C
> Process A M S F
> |---------------------------------------|
> Process B M FS
Argh. Good spotting.
> This is not optimal and it can still race but the compact_cached_free_pfn
> will be pointing to or very near a pageblock with free pages.
Agreed on the "not optimal", but I also cannot think of a better
idea right now. Getting this fixed for 3.6 is important, we can
think of future optimizations in San Diego.
On Tue, Aug 07, 2012 at 10:45:25AM -0400, Rik van Riel wrote:
> On 08/07/2012 08:31 AM, Mel Gorman wrote:
> >commit [7db8889a: mm: have order> 0 compaction start off where it left]
> >introduced a caching mechanism to reduce the amount work the free page
> >scanner does in compaction. However, it has a problem. Consider two process
> >simultaneously scanning free pages
> > C
> >Process A M S F
> > |---------------------------------------|
> >Process B M FS
> Argh. Good spotting.
> >This is not optimal and it can still race but the compact_cached_free_pfn
> >will be pointing to or very near a pageblock with free pages.
> Agreed on the "not optimal", but I also cannot think of a better
> idea right now. Getting this fixed for 3.6 is important, we can
> think of future optimizations in San Diego.
Jim, what are the chances of getting this series tested with your large
data workload? As it's on top of 3.5, it should be less scary than
testing 3.6-rc1 but if you are comfortable testing 3.6-rc1 then please
test with just this patch on top.
> On Tue, Aug 07, 2012 at 10:45:25AM -0400, Rik van Riel wrote:
>> On 08/07/2012 08:31 AM, Mel Gorman wrote:
>>> commit [7db8889a: mm: have order> 0 compaction start off where it left]
>>> introduced a caching mechanism to reduce the amount work the free page
>>> scanner does in compaction. However, it has a problem. Consider two process
>>> simultaneously scanning free pages
>>> C
>>> Process A M S F
>>> |---------------------------------------|
>>> Process B M FS
>> Argh. Good spotting.
>>> This is not optimal and it can still race but the compact_cached_free_pfn
>>> will be pointing to or very near a pageblock with free pages.
>> Agreed on the "not optimal", but I also cannot think of a better
>> idea right now. Getting this fixed for 3.6 is important, we can
>> think of future optimizations in San Diego.
> Jim, what are the chances of getting this series tested with your large
> data workload? As it's on top of 3.5, it should be less scary than
> testing 3.6-rc1 but if you are comfortable testing 3.6-rc1 then please
> test with just this patch on top.
As it turns out I'm already testing 3.6-rc1, as I'm on
the trail of a Ceph client messaging bug. I think I've
about got that figured out, and am working on a patch, but
I need it fixed in order to generate enough load to trigger
the problem that your patch addresses.
Which is a long-winded way of saying: no problem, I'll
roll this into my current testing, but I'll need another
day or two before I'm likely to be able to generate a
high enough load to test effectively. OK?
Also FWIW, it occurs to me that you might be interested
to know that my load also involves lots of network load
where I'm using jumbo frames. I suspect that puts even
more stress on higher page order allocations, right?
On Tue, Aug 07, 2012 at 09:20:08AM -0600, Jim Schutt wrote:
> On 08/07/2012 08:52 AM, Mel Gorman wrote:
> >On Tue, Aug 07, 2012 at 10:45:25AM -0400, Rik van Riel wrote:
> >>On 08/07/2012 08:31 AM, Mel Gorman wrote:
> >>>commit [7db8889a: mm: have order> 0 compaction start off where it left]
> >>>introduced a caching mechanism to reduce the amount work the free page
> >>>scanner does in compaction. However, it has a problem. Consider two process
> >>>simultaneously scanning free pages
> >>> C
> >>>Process A M S F
> >>> |---------------------------------------|
> >>>Process B M FS
> >>Argh. Good spotting.
> >>>This is not optimal and it can still race but the compact_cached_free_pfn
> >>>will be pointing to or very near a pageblock with free pages.
> >>Agreed on the "not optimal", but I also cannot think of a better
> >>idea right now. Getting this fixed for 3.6 is important, we can
> >>think of future optimizations in San Diego.
> >Jim, what are the chances of getting this series tested with your large
> >data workload? As it's on top of 3.5, it should be less scary than
> >testing 3.6-rc1 but if you are comfortable testing 3.6-rc1 then please
> >test with just this patch on top.
> As it turns out I'm already testing 3.6-rc1, as I'm on
> the trail of a Ceph client messaging bug. I think I've
> about got that figured out, and am working on a patch, but
> I need it fixed in order to generate enough load to trigger
> the problem that your patch addresses.
Grand, good luck with the Ceph bug.
> Which is a long-winded way of saying: no problem, I'll
> roll this into my current testing, but I'll need another
> day or two before I'm likely to be able to generate a
> high enough load to test effectively. OK?
That is perfectly reasonable, thanks.
> Also FWIW, it occurs to me that you might be interested
> to know that my load also involves lots of network load
> where I'm using jumbo frames. I suspect that puts even
> more stress on higher page order allocations, right?
It might. It depends on whether the underlying driver needs contiguous
pages to handle jumbo frame, if it can do scatter/gather IO or some
combination like trying for a contiguous page but using scatter/gather as
a fallback. Certainly it is interesting and I will keep it in mind.
On Tue, Aug 07, 2012 at 01:31:12PM +0100, Mel Gorman wrote:
> The comment about order applied when the check was
> order > PAGE_ALLOC_COSTLY_ORDER which has not been the case since
> [c5a73c3d: thp: use compaction for all allocation orders]. Fixing
> the comment while I'm in the general area.
Just out of curiosity.
What's the problem did you see? (ie, What's the problem do this patch solve?)
AFAIUC, it seem to solve consecutive allocation success ratio through
getting several free pageblocks all at once in a process/kswapd
reclaim context. Right?
On Tue, Aug 07, 2012 at 01:31:13PM +0100, Mel Gorman wrote:
> If allocation fails after compaction then compaction may be deferred for
> a number of allocation attempts. If there are subsequent failures,
> compact_defer_shift is increased to defer for longer periods. This patch
> uses that information to scale the number of pages reclaimed with
> compact_defer_shift until allocations succeed again.
> /* If not in reclaim/compaction mode, stop */
> if (!in_reclaim_compaction(sc))
> @@ -1741,6 +1742,15 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
> * inactive lists are large enough, continue reclaiming
> */
> pages_for_compaction = (2UL << sc->order);
> +
> + /*
> + * If compaction is deferred for this order then scale the number of
> + * pages reclaimed based on the number of consecutive allocation
> + * failures
> + */
> + zone = lruvec_zone(lruvec);
> + if (zone->compact_order_failed >= sc->order)
> + pages_for_compaction <<= zone->compact_defer_shift;
> inactive_lru_pages = get_lru_size(lruvec, LRU_INACTIVE_FILE);
> if (nr_swap_pages > 0)
> inactive_lru_pages += get_lru_size(lruvec, LRU_INACTIVE_ANON);
> -- > 1.7.9.2
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"d...@kvack.org"> em...@kvack.org </a>
Discussion subject changed to "mm: kswapd: Continue reclaiming for reclaim/compaction if the minimum number of pages have not been reclaimed" by Minchan Kim
On Tue, Aug 07, 2012 at 01:31:14PM +0100, Mel Gorman wrote:
> When direct reclaim is running reclaim/compaction, there is a minimum
> number of pages it reclaims. As it must be under the low watermark to be
> in direct reclaim it has also woken kswapd to do some work. This patch
> has kswapd use the same logic as direct reclaim to reclaim a minimum
> number of pages so compaction can run later.
-ENOPARSE by my stupid brain.
Could you elaborate a bit more?
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 0cb2593..afdec93 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1701,7 +1701,7 @@ static bool in_reclaim_compaction(struct scan_control *sc)
> * calls try_to_compact_zone() that it will have enough free pages to succeed.
> * It will give up earlier than that if there is difficulty reclaiming pages.
> */
> -static inline bool should_continue_reclaim(struct lruvec *lruvec,
> +static bool should_continue_reclaim(struct lruvec *lruvec,
> unsigned long nr_reclaimed,
> unsigned long nr_scanned,
> struct scan_control *sc)
> @@ -1768,6 +1768,17 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
> }
> }
> +static inline bool should_continue_reclaim_zone(struct zone *zone,
> + unsigned long nr_reclaimed,
> + unsigned long nr_scanned,
> + struct scan_control *sc)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_iter(NULL, NULL, NULL);
> + struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
> +
> + return should_continue_reclaim(lruvec, nr_reclaimed, nr_scanned, sc);
> +}
> +
> /*
> * This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
> */
> @@ -2496,8 +2507,10 @@ loop_again:
> */
> testorder = order;
> if (COMPACTION_BUILD && order &&
> - compaction_suitable(zone, order) !=
> - COMPACT_SKIPPED)
> + !should_continue_reclaim_zone(zone,
> + nr_soft_reclaimed,
nr_soft_reclaimed is always zero with !CONFIG_MEMCG.
So should_continue_reclaim_zone would return normally true in case of
non-__GFP_REPEAT allocation. Is it intentional?
> if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
> -- > 1.7.9.2
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"d...@kvack.org"> em...@kvack.org </a>
On Tue, Aug 07, 2012 at 01:31:17PM +0100, Mel Gorman wrote:
> commit [7db8889a: mm: have order > 0 compaction start off where it left]
> introduced a caching mechanism to reduce the amount work the free page
> scanner does in compaction. However, it has a problem. Consider two process
> simultaneously scanning free pages
> C
> Process A M S F
> |---------------------------------------|
> Process B M FS
> C is zone->compact_cached_free_pfn
> S is cc->start_pfree_pfn
> M is cc->migrate_pfn
> F is cc->free_pfn
> In this diagram, Process A has just reached its migrate scanner, wrapped
> around and updated compact_cached_free_pfn accordingly.
> Simultaneously, Process B finishes isolating in a block and updates
> compact_cached_free_pfn again to the location of its free scanner.
> Process A moves to "end_of_zone - one_pageblock" and runs this check
> compact_cached_free_pfn is above where it started so the free scanner skips
> almost the entire space it should have scanned. When there are multiple
> processes compacting it can end in a situation where the entire zone is
> not being scanned at all. Further, it is possible for two processes to
> ping-pong update to compact_cached_free_pfn which is just random.
> Overall, the end result wrecks allocation success rates.
> There is not an obvious way around this problem without introducing new
> locking and state so this patch takes a different approach.
> First, it gets rid of the skip logic because it's not clear that it matters
> if two free scanners happen to be in the same block but with racing updates
> it's too easy for it to skip over blocks it should not.
Okay.
> Second, it updates compact_cached_free_pfn in a more limited set of
> circumstances.
> If a scanner has wrapped, it updates compact_cached_free_pfn to the end
> of the zone. Each time a wrapped scanner isoaltes a page, it
> updates compact_cached_free_pfn. The intention is that after
> wrapping, the compact_cached_free_pfn will be at the highest
> pageblock with free pages when compaction completes.
Okay.
> If a scanner has not wrapped when compaction completes and
Compaction complete?
Your code seem to do it in isolate_freepages.
Isn't it compaction complete?
> compact_cached_free_pfn is set the end of the the zone, initialise
> it once.
I can't understad this part.
Could you elaborate a bit more?
> /*
> + * Returns the start pfn of the last page block in a zone. This is the starting
> + * point for full compaction of a zone. Compaction searches for free pages from
> + * the end of each zone, while isolate_freepages_block scans forward inside each
> + * page block.
> + */
> +static unsigned long start_free_pfn(struct zone *zone)
> +{
> + unsigned long free_pfn;
> + free_pfn = zone->zone_start_pfn + zone->spanned_pages;
> + free_pfn &= ~(pageblock_nr_pages-1);
> + return free_pfn;
> +}
> +
> +/*
> * Based on information in the current compact_control, find blocks
> * suitable for isolating free pages from and then isolate them.
> */
> @@ -457,17 +471,6 @@ static void isolate_freepages(struct zone *zone,
> pfn -= pageblock_nr_pages) {
> unsigned long isolated;
> - /*
> - * Skip ahead if another thread is compacting in the area
> - * simultaneously. If we wrapped around, we can only skip
> - * ahead if zone->compact_cached_free_pfn also wrapped to
> - * above our starting point.
> - */
> - if (cc->order > 0 && (!cc->wrapped ||
> - zone->compact_cached_free_pfn >
> - cc->start_free_pfn))
> - pfn = min(pfn, zone->compact_cached_free_pfn);
> -
> if (!pfn_valid(pfn))
> continue;
> @@ -510,7 +513,15 @@ static void isolate_freepages(struct zone *zone,
> */
> if (isolated) {
> high_pfn = max(high_pfn, pfn);
> - if (cc->order > 0)
> +
> + /*
> + * If the free scanner has wrapped, update
> + * compact_cached_free_pfn to point to the highest
> + * pageblock with free pages. This reduces excessive
> + * scanning of full pageblocks near the end of the
> + * zone
> + */
> + if (cc->order > 0 && cc->wrapped)
> zone->compact_cached_free_pfn = high_pfn;
> }
> }
> @@ -520,6 +531,11 @@ static void isolate_freepages(struct zone *zone,
> cc->free_pfn = high_pfn;
> cc->nr_freepages = nr_freepages;
> +
> + /* If compact_cached_free_pfn is reset then set it now */
> + if (cc->order > 0 && !cc->wrapped &&
> + zone->compact_cached_free_pfn == start_free_pfn(zone))
> + zone->compact_cached_free_pfn = high_pfn;
> }
> -/*
> - * Returns the start pfn of the last page block in a zone. This is the starting
> - * point for full compaction of a zone. Compaction searches for free pages from
> - * the end of each zone, while isolate_freepages_block scans forward inside each
> - * page block.
> - */
> -static unsigned long start_free_pfn(struct zone *zone)
> -{
> - unsigned long free_pfn;
> - free_pfn = zone->zone_start_pfn + zone->spanned_pages;
> - free_pfn &= ~(pageblock_nr_pages-1);
> - return free_pfn;
> -}
> -
> static int compact_finished(struct zone *zone,
> struct compact_control *cc)
> {
> -- > 1.7.9.2
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"d...@kvack.org"> em...@kvack.org </a>
On Wed, Aug 08, 2012 at 10:48:24AM +0900, Minchan Kim wrote:
> Hi Mel,
> Just out of curiosity.
> What's the problem did you see? (ie, What's the problem do this patch solve?)
Everythign in this series is related to the problem in the leader - high
order allocation success rates are lower. This patch increases the success
rates when allocating under load.
> AFAIUC, it seem to solve consecutive allocation success ratio through
> getting several free pageblocks all at once in a process/kswapd
> reclaim context. Right?
Only pageblocks if it is order-9 on x86, it reclaims an amount that depends
on an allocation size. This only happens during reclaim/compaction context
when we know that a high-order allocation has recently failed. The objective
is to reclaim enough order-0 pages so that compaction can succeed again.
On Wed, Aug 08, 2012 at 08:55:26AM +0100, Mel Gorman wrote:
> On Wed, Aug 08, 2012 at 10:48:24AM +0900, Minchan Kim wrote:
> > Hi Mel,
> > Just out of curiosity.
> > What's the problem did you see? (ie, What's the problem do this patch solve?)
> Everythign in this series is related to the problem in the leader - high
> order allocation success rates are lower. This patch increases the success
> rates when allocating under load.
> > AFAIUC, it seem to solve consecutive allocation success ratio through
> > getting several free pageblocks all at once in a process/kswapd
> > reclaim context. Right?
> Only pageblocks if it is order-9 on x86, it reclaims an amount that depends
> on an allocation size. This only happens during reclaim/compaction context
> when we know that a high-order allocation has recently failed. The objective
> is to reclaim enough order-0 pages so that compaction can succeed again.
Your patch increases the number of pages to be reclaimed with considering
the number of fail case during deferring period and your test proved it's
really good. Without your patch, why can't VM reclaim enough pages?
Other processes steal the pages reclaimed?
Why I ask a question is that I want to know what's the problem at current
VM.
> -- > Mel Gorman
> SUSE Labs
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"d...@kvack.org"> em...@kvack.org </a>
On Wed, Aug 08, 2012 at 05:27:38PM +0900, Minchan Kim wrote:
> On Wed, Aug 08, 2012 at 08:55:26AM +0100, Mel Gorman wrote:
> > On Wed, Aug 08, 2012 at 10:48:24AM +0900, Minchan Kim wrote:
> > > Hi Mel,
> > > Just out of curiosity.
> > > What's the problem did you see? (ie, What's the problem do this patch solve?)
> > Everythign in this series is related to the problem in the leader - high
> > order allocation success rates are lower. This patch increases the success
> > rates when allocating under load.
> > > AFAIUC, it seem to solve consecutive allocation success ratio through
> > > getting several free pageblocks all at once in a process/kswapd
> > > reclaim context. Right?
> > Only pageblocks if it is order-9 on x86, it reclaims an amount that depends
> > on an allocation size. This only happens during reclaim/compaction context
> > when we know that a high-order allocation has recently failed. The objective
> > is to reclaim enough order-0 pages so that compaction can succeed again.
> Your patch increases the number of pages to be reclaimed with considering
> the number of fail case during deferring period and your test proved it's
> really good. Without your patch, why can't VM reclaim enough pages?
It could reclaim enough pages but it doesn't. nr_to_reclaim is
SWAP_CLUSTER_MAX and that gets short-cutted in direct reclaim at least
by
if (sc->nr_reclaimed >= sc->nr_to_reclaim)
goto out;
I could set nr_to_reclaim in try_to_free_pages() of course and drive
it from there but that's just different, not better. If driven from
do_try_to_free_pages(), it is also possible that priorities will rise.
When they reach DEF_PRIORITY-2, it will also start stalling and setting
pages for immediate reclaim which is more disruptive than not desirable
in this case. That is a more wide-reaching change than I would expect for
this problem and could cause another regression related to THP requests
causing interactive jitter.
> Other processes steal the pages reclaimed?
Or the page it reclaimed were in pageblocks that could not be used.
> Why I ask a question is that I want to know what's the problem at current
> VM.
We cannot reliably tell in advance whether compaction is going to succeed
in the future without doing a full scan of the zone which would be both
very heavy and race with any allocation requests. Compaction needs free
pages to succeed so the intention is to scale the number of pages reclaimed
with the number of recent compaction failures.
Discussion subject changed to "mm: kswapd: Continue reclaiming for reclaim/compaction if the minimum number of pages have not been reclaimed" by Mel Gorman
On Wed, Aug 08, 2012 at 11:07:49AM +0900, Minchan Kim wrote:
> On Tue, Aug 07, 2012 at 01:31:14PM +0100, Mel Gorman wrote:
> > When direct reclaim is running reclaim/compaction, there is a minimum
> > number of pages it reclaims. As it must be under the low watermark to be
> > in direct reclaim it has also woken kswapd to do some work. This patch
> > has kswapd use the same logic as direct reclaim to reclaim a minimum
> > number of pages so compaction can run later.
> -ENOPARSE by my stupid brain.
> Could you elaborate a bit more?
Which part did not make sense so I know which part to elaborate on? Lets
try again randomly with this;
When direct reclaim is running reclaim/compaction for high-order allocations,
it aims to reclaim a minimum number of pages for compaction as controlled
by should_continue_reclaim. Before it entered direct reclaim, kswapd was
woken to reclaim pages at the same order. This patch forces kswapd to use
the same logic as direct reclaim to reclaim a minimum number of pages so
that subsequent allocation requests are less likely to enter direct reclaim.
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 0cb2593..afdec93 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1701,7 +1701,7 @@ static bool in_reclaim_compaction(struct scan_control *sc)
> > * calls try_to_compact_zone() that it will have enough free pages to succeed.
> > * It will give up earlier than that if there is difficulty reclaiming pages.
> > */
> > -static inline bool should_continue_reclaim(struct lruvec *lruvec,
> > +static bool should_continue_reclaim(struct lruvec *lruvec,
> > unsigned long nr_reclaimed,
> > unsigned long nr_scanned,
> > struct scan_control *sc)
> > @@ -1768,6 +1768,17 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
> > }
> > }
> > +static inline bool should_continue_reclaim_zone(struct zone *zone,
> > + unsigned long nr_reclaimed,
> > + unsigned long nr_scanned,
> > + struct scan_control *sc)
> > +{
> > + struct mem_cgroup *memcg = mem_cgroup_iter(NULL, NULL, NULL);
> > + struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
> > +
> > + return should_continue_reclaim(lruvec, nr_reclaimed, nr_scanned, sc);
> > +}
> > +
> > /*
> > * This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
> > */
> > @@ -2496,8 +2507,10 @@ loop_again:
> > */
> > testorder = order;
> > if (COMPACTION_BUILD && order &&
> > - compaction_suitable(zone, order) !=
> > - COMPACT_SKIPPED)
> > + !should_continue_reclaim_zone(zone,
> > + nr_soft_reclaimed,
> nr_soft_reclaimed is always zero with !CONFIG_MEMCG.
> So should_continue_reclaim_zone would return normally true in case of
> non-__GFP_REPEAT allocation. Is it intentional?
It was intentional at the time but asking me about it made me reconsider,
thanks. In too many cases, this is a no-op and any apparent increase of
kswapd activity is likely a co-incidence. This is untested but is what I
intended.
---8<---
mm: kswapd: Continue reclaiming for reclaim/compaction if the minimum number of pages have not been reclaimed
When direct reclaim is running reclaim/compaction for high-order allocations,
it aims to reclaim a minimum number of pages for compaction as controlled
by should_continue_reclaim. Before it entered direct reclaim, kswapd was
woken to reclaim pages at the same order. This patch forces kswapd to use
the same logic as direct reclaim to reclaim a minimum number of pages so
that subsequent allocation requests are less likely to enter direct reclaim.
/*
* Reclaim/compaction is used for high-order allocation requests. It reclaims
- * order-0 pages before compacting the zone. should_continue_reclaim() returns
+ * order-0 pages before compacting the zone. __should_continue_reclaim() returns
* true if more pages should be reclaimed such that when the page allocator
* calls try_to_compact_zone() that it will have enough free pages to succeed.
- * It will give up earlier than that if there is difficulty reclaiming pages.
*/
-static inline bool should_continue_reclaim(struct lruvec *lruvec,
- unsigned long nr_reclaimed,
- unsigned long nr_scanned,
+static bool __should_continue_reclaim(struct lruvec *lruvec,
struct scan_control *sc)
{
unsigned long pages_for_compaction;
@@ -1714,29 +1711,6 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
if (!in_reclaim_compaction(sc))
return false;
- /* Consider stopping depending on scan and reclaim activity */
- if (sc->gfp_mask & __GFP_REPEAT) {
- /*
- * For __GFP_REPEAT allocations, stop reclaiming if the
- * full LRU list has been scanned and we are still failing
- * to reclaim pages. This full LRU scan is potentially
- * expensive but a __GFP_REPEAT caller really wants to succeed
- */
- if (!nr_reclaimed && !nr_scanned)
- return false;
- } else {
- /*
- * For non-__GFP_REPEAT allocations which can presumably
- * fail without consequence, stop if we failed to reclaim
- * any pages from the last SWAP_CLUSTER_MAX number of
- * pages that were scanned. This will return to the
- * caller faster at the risk reclaim/compaction and
- * the resulting allocation attempt fails
- */
- if (!nr_reclaimed)
- return false;
- }
-
/*
* If we have not reclaimed enough pages for compaction and the
* inactive lists are large enough, continue reclaiming
@@ -1768,6 +1742,51 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
}
}
+/* Looks up the lruvec before calling __should_continue_reclaim */
+static inline bool should_kswapd_continue_reclaim(struct zone *zone,
+ struct scan_control *sc)
+{
+ struct mem_cgroup *memcg = mem_cgroup_iter(NULL, NULL, NULL);
+ struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
+
+ return __should_continue_reclaim(lruvec, sc);
+}
+
+/*
+ * This uses __should_continue_reclaim at its core but will also give up
+ * earlier than that if there is difficulty reclaiming pages.
+ */
+static inline bool should_direct_continue_reclaim(struct lruvec *lruvec,
+ unsigned long nr_reclaimed,
+ unsigned long nr_scanned,
+ struct scan_control *sc)
+{
+ /* Consider stopping depending on scan and reclaim activity */
+ if (sc->gfp_mask & __GFP_REPEAT) {
+ /*
+ * For __GFP_REPEAT allocations, stop reclaiming if the
+ * full LRU list has been scanned and we are still failing
+ * to reclaim pages. This full LRU scan is potentially
+ * expensive but a __GFP_REPEAT caller really wants to succeed
+ */
+ if (!nr_reclaimed && !nr_scanned)
+ return false;
+ } else {
+ /*
+ * For non-__GFP_REPEAT allocations which can presumably
+ * fail without consequence, stop if we failed to reclaim
+ * any pages from the last SWAP_CLUSTER_MAX number of
+ * pages that were scanned. This will return to the
+ * caller faster at the risk reclaim/compaction and
+ * the resulting allocation attempt fails
+ */
+ if (!nr_reclaimed)
+ return false;
+ }
+
+ return __should_continue_reclaim(lruvec, sc);
+}
+
/*
* This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
*/
@@ -1822,7 +1841,7 @@ restart:
sc, LRU_ACTIVE_ANON);
/* reclaim/compaction might need reclaim to continue */
- if (should_continue_reclaim(lruvec, nr_reclaimed,
+ if (should_direct_continue_reclaim(lruvec, nr_reclaimed,
sc->nr_scanned - nr_scanned, sc))
goto restart;
if ((buffer_heads_over_limit && is_highmem_idx(i)) ||
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On Wed, Aug 08, 2012 at 10:07:57AM +0100, Mel Gorman wrote:
> > <SNIP>
> It was intentional at the time but asking me about it made me reconsider,
> thanks. In too many cases, this is a no-op and any apparent increase of
> kswapd activity is likely a co-incidence. This is untested but is what I
> intended.
> ---8<---
> mm: kswapd: Continue reclaiming for reclaim/compaction if the minimum number of pages have not been reclaimed
And considering this further again, it would partially regress fe2c2a10
and be too aggressive. I'm dropping this patch completely for now and will
revisit it in the future.
On Wed, Aug 08, 2012 at 01:36:00PM +0900, Minchan Kim wrote:
> > Second, it updates compact_cached_free_pfn in a more limited set of
> > circumstances.
> > If a scanner has wrapped, it updates compact_cached_free_pfn to the end
> > of the zone. Each time a wrapped scanner isoaltes a page, it
> > updates compact_cached_free_pfn. The intention is that after
> > wrapping, the compact_cached_free_pfn will be at the highest
> > pageblock with free pages when compaction completes.
> Okay.
> > If a scanner has not wrapped when compaction completes and
> Compaction complete?
> Your code seem to do it in isolate_freepages.
> Isn't it compaction complete?
s/compaction/free page isolation/
> > compact_cached_free_pfn is set the end of the the zone, initialise
> > it once.
> I can't understad this part.
> Could you elaborate a bit more?
Is this better?
If a scanner has wrapped, it updates compact_cached_free_pfn to the end
of the zone. When a wrapped scanner isolates a page, it updates
compact_cached_free_pfn to point to the highest pageblock it
can isolate pages from.
If a scanner has not wrapped when it has finished isolated pages it checks if compact_cached_free_pfn is pointing to the end of the
zone. If so, the value is updated to point to the highest pageblock that pages were isolated from. This value will not
be updated again until a free page scanner wraps and resets
compact_cached_free_pfn.
This is not optimal and it can still race but the compact_cached_free_pfn
will be pointing to or very near a pageblock with free pages.