When we enter direct reclaim we may have used an arbitrary amount of stack
space, and hence entering the filesystem to do writeback can then lead to
stack overruns. This problem was recently encountered on x86_64 systems with
8k stacks running XFS with simple storage configurations.
Writeback from direct reclaim also adversely affects background writeback. The
background flusher threads should already be taking care of cleaning dirty
pages, and direct reclaim will kick them if they aren't already doing work. If
direct reclaim is also calling ->writepage, it will cause the IO patterns from
the background flusher threads to be upset by LRU-order writeback from
pageout() which can be effectively random IO. Having competing sources of IO
trying to clean pages on the same backing device reduces throughput by
increasing the amount of seeks that the backing device has to do to write back
the pages.
Hence for direct reclaim we should not allow ->writepages to be entered at all.
Set up the relevant scan_control structures to enforce this, and prevent
sc->may_writepage from being set in other places in the direct reclaim path in
response to other events.
Reported-by: John Berthels <jo...@humyo.com>
Signed-off-by: Dave Chinner <dchi...@redhat.com>
---
mm/vmscan.c | 13 ++++++-------
1 files changed, 6 insertions(+), 7 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e0e5f15..5321ac4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1826,10 +1826,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
* writeout. So in laptop mode, write out the whole world.
*/
writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
- if (total_scanned > writeback_threshold) {
+ if (total_scanned > writeback_threshold)
wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
- sc->may_writepage = 1;
- }
/* Take a nap, wait for some writeback to complete */
if (!sc->hibernation_mode && sc->nr_scanned &&
@@ -1871,7 +1869,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
{
struct scan_control sc = {
.gfp_mask = gfp_mask,
- .may_writepage = !laptop_mode,
+ .may_writepage = 0,
.nr_to_reclaim = SWAP_CLUSTER_MAX,
.may_unmap = 1,
.may_swap = 1,
@@ -1893,7 +1891,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
struct zone *zone, int nid)
{
struct scan_control sc = {
- .may_writepage = !laptop_mode,
+ .may_writepage = 0,
.may_unmap = 1,
.may_swap = !noswap,
.swappiness = swappiness,
@@ -1926,7 +1924,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
{
struct zonelist *zonelist;
struct scan_control sc = {
- .may_writepage = !laptop_mode,
+ .may_writepage = 0,
.may_unmap = 1,
.may_swap = !noswap,
.nr_to_reclaim = SWAP_CLUSTER_MAX,
@@ -2567,7 +2565,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
struct reclaim_state reclaim_state;
int priority;
struct scan_control sc = {
- .may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
+ .may_writepage = (current_is_kswapd() &&
+ (zone_reclaim_mode & RECLAIM_WRITE)),
.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
.may_swap = 1,
.nr_to_reclaim = max_t(unsigned long, nr_pages,
--
1.6.5
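For reference, the mechanism the patch relies on is the dirty-page gate in
shrink_page_list(). A simplified sketch (not part of the patch, and details
vary by kernel version) of how sc->may_writepage controls pageout():

	/*
	 * Simplified sketch of shrink_page_list()'s dirty-page handling
	 * (mm/vmscan.c, kernels of roughly this era).  With may_writepage
	 * forced to 0 for direct reclaim, dirty pages take the keep_locked
	 * path and stay on the LRU for the flusher threads and kswapd to
	 * clean, instead of being written via pageout() -> ->writepage
	 * from deep in the reclaim call chain.
	 */
	if (PageDirty(page)) {
		if (!may_enter_fs)
			goto keep_locked;
		if (!sc->may_writepage)
			goto keep_locked;

		/* Page is dirty, try to write it out here */
		switch (pageout(page, mapping, sync_writeback)) {
		case PAGE_KEEP:
			goto keep_locked;
		case PAGE_ACTIVATE:
			goto activate_locked;
		case PAGE_SUCCESS:
			/* IO submitted; pick the page up on a later pass */
			goto keep;
		case PAGE_CLEAN:
			break;	/* page is now clean, try to free it below */
		}
	}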
> From: Dave Chinner <dchi...@redhat.com>
>
> When we enter direct reclaim we may have used an arbitrary amount of stack
> space, and hence enterring the filesystem to do writeback can then lead to
> stack overruns. This problem was recently encountered x86_64 systems with
> 8k stacks running XFS with simple storage configurations.
>
> Writeback from direct reclaim also adversely affects background writeback. The
> background flusher threads should already be taking care of cleaning dirty
> pages, and direct reclaim will kick them if they aren't already doing work. If
> direct reclaim is also calling ->writepage, it will cause the IO patterns from
> the background flusher threads to be upset by LRU-order writeback from
> pageout() which can be effectively random IO. Having competing sources of IO
> trying to clean pages on the same backing device reduces throughput by
> increasing the amount of seeks that the backing device has to do to write back
> the pages.
>
> Hence for direct reclaim we should not allow ->writepages to be entered at all.
> Set up the relevant scan_control structures to enforce this, and prevent
> sc->may_writepage from being set in other places in the direct reclaim path in
> response to other events.
Ummm..
This patch is harder to ack. This patch's pros and cons seem to be:
Pros:
1) prevents XFS stack overflow
2) improves IO workload performance
Cons:
3) TOTALLY kills lumpy reclaim (i.e. high order allocation)
So, if we only had to consider IO workloads there would be no downside, but
we can't.
I think (1) is an XFS issue; XFS should take care of it itself. But (2) is
really a VM issue. Right now our VM does overly aggressive pageout() and
decreases IO throughput. I've heard about this issue from Chris (cc to him).
I'd like to fix it, but we can never kill pageout() completely because we
can't assume users don't run high-order allocation workloads.
(Perhaps Mel's memory compaction code will improve things enough that we can
kill lumpy reclaim in the future, but that's another story.)
Thanks.
It's already known that the VM requesting specific pages be cleaned and
reclaimed is a bad IO pattern but unfortunately it is still required by
lumpy reclaim. This change would appear to break that although I haven't
tested it to be 100% sure.
Even without high-order considerations, this patch would appear to make
fairly large changes to how direct reclaim behaves. It would no longer
wait on page writeback, for example, so direct reclaim will return sooner
than it did, potentially going OOM if there were a lot of dirty pages and
it made no progress during direct reclaim.
> Hence for direct reclaim we should not allow ->writepages to be entered at all.
> Set up the relevant scan_control structures to enforce this, and prevent
> sc->may_writepage from being set in other places in the direct reclaim path in
> response to other events.
>
If an FS caller cannot re-enter the FS, it should be using GFP_NOFS
instead of GFP_KERNEL.
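To illustrate the distinction (this snippet is mine, not from the thread; the
function and allocation site are hypothetical):

#include <linux/gfp.h>
#include <linux/mm_types.h>

/*
 * Hypothetical allocation made while filesystem locks are held.  With
 * GFP_KERNEL, reclaim triggered by this allocation could re-enter
 * filesystem writeback; GFP_NOFS clears __GFP_FS, so reclaim will not
 * call back into the filesystem from here.
 */
static struct page *fs_alloc_scratch_page(void)
{
	return alloc_page(GFP_NOFS);
}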
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
The filesystem is irrelevant, IMO.
The traces from the reporter showed that we've got close to a 2k
stack footprint for memory allocation to direct reclaim and then we
can put the entire writeback path on top of that. This is roughly
3.5k for XFS, and then depending on the storage subsystem
configuration and transport can be another 2k of stack needed below
XFS.
IOWs, if we completely ignore the filesystem stack usage, there's
still up to 4k of stack needed in the direct reclaim path. Given
that one of the stack traces supplied shows direct reclaim being
entered with over 3k of stack already used, pretty much any
filesystem is capable of blowing an 8k stack.
So, this is not an XFS issue, even though XFS is the first to
uncover it. Don't shoot the messenger....
> but (2) is really
> VM issue. Now our VM makes too agressive pageout() and decrease io
> throughput. I've heard this issue from Chris (cc to him). I'd like to
> fix this.
I didn't expect this to be easy. ;)
I had a good look at what the code was doing before I wrote the
patch, and IMO, there is no good reason for issuing IO from direct
reclaim.
My reasoning is as follows - consider a system with a typical
SATA disk where the machine is low on memory and in direct reclaim.
Direct reclaim is taking pages off the end of the LRU and writing
them one at a time from there. It is scanning thousands of pages
and it triggers IO on the dirty ones it comes across.
This is done with no regard to the IO patterns it generates - it can
(and frequently does) result in completely random single page IO
patterns hitting the disk, and as a result cleaning pages happens
really, really slowly. If we are in an OOM situation, the machine
will grind to a halt as it struggles to clean maybe 1MB of RAM per
second.
On the other hand, if the IO is well formed then the disk might be
capable of 100MB/s. The background flusher threads and filesystems
try very hard to issue well formed IOs, so the difference in the
rate that memory can be cleaned may be a couple of orders of
magnitude.
(Of course, the difference will typically be somewhere in between
these two extremes, but I'm simply trying to illustrate how big
the difference in performance can be.)
IOWs, the background flusher threads are there to clean memory by
issuing IO as efficiently as possible. Direct reclaim is very
efficient at reclaiming clean memory, but it really, really sucks at
cleaning dirty memory in a predictable and deterministic manner. It
is also much more likely to hit worst case IO patterns than the
background flusher threads.
Hence I think that direct reclaim should be deferring to the
background flusher threads for cleaning memory and not trying to be
doing it itself.
> but we never kill pageout() completely because we can't
> assume users don't run high order allocation workload.
I think that lumpy reclaim will still work just fine.
Lumpy reclaim appears to be using IO as a method of slowing
down the reclaim cycle - the congestion_wait() call will still
function as it does now if the background flusher threads are active
and causing congestion. I don't see why lumpy reclaim specifically
needs to be issuing IO to make it work - if the congestion_wait() is
not waiting long enough then wait longer - don't issue IO to extend
the wait time.
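(For context, a rough sketch of the lumpy reclaim throttling being referred to,
loosely based on shrink_inactive_list() of kernels around this time - simplified,
not the literal source:)

	/*
	 * If the first, asynchronous pass over the isolated batch did not
	 * free enough pages, stall on backing-device congestion and then
	 * retry the batch synchronously, waiting on writeback this time.
	 */
	if (lumpy_reclaim && nr_freed < nr_taken && !current_is_kswapd()) {
		congestion_wait(BLK_RW_ASYNC, HZ / 10);

		nr_freed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
	}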
Also, there doesn't appear to be anything special about the chunks of
pages it's issuing IO on and waiting for, either. They are simply
the last N pages on the LRU that could be grabbed so they have no
guarantee of contiguity, so the IO it issues does nothing specific
to help higher order allocations to succeed.
Hence it really seems to me that the effectiveness of lumpy reclaim
is determined mostly by the effectiveness of the IO subsystem - the
faster the IO subsystem cleans pages, the less time lumpy reclaim
will block and the faster it will free pages. From this observation
and the fact that issuing IO only from the bdi flusher threads will
have the same effect (improves IO subsystem effectiveness), it seems
to me that lumpy reclaim should not be adversely affected by this
change.
Of course, the code is a maze of twisty passages, so I probably
missed something important. Hopefully someone can tell me what. ;)
FWIW, the biggest problem here is that I have absolutely no clue on
how to test what the impact on lumpy reclaim really is. Does anyone
have a relatively simple test that can be run to determine what the
impact is?
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
How do you test it? I'd really like to be able to test this myself....
> Even without high-order considerations, this patch would appear to make
> fairly large changes to how direct reclaim behaves. It would no longer
> wait on page writeback for example so direct reclaim will return sooner
AFAICT it still waits for pages under writeback in exactly the same manner
it does now. shrink_page_list() does the following completely
separately to the sc->may_writepage flag:
666 may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
667 (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
668
669 if (PageWriteback(page)) {
670 /*
671 * Synchronous reclaim is performed in two passes,
672 * first an asynchronous pass over the list to
673 * start parallel writeback, and a second synchronous
674 * pass to wait for the IO to complete. Wait here
675 * for any page for which writeback has already
676 * started.
677 */
678 if (sync_writeback == PAGEOUT_IO_SYNC && may_enter_fs)
679 wait_on_page_writeback(page);
680 else
681 goto keep_locked;
682 }
So if the page is under writeback, PAGEOUT_IO_SYNC is set and
we can enter the fs, it will still wait for writeback to complete
just like it does now.
However, the current code only uses PAGEOUT_IO_SYNC in lumpy
reclaim, so for most typical workloads direct reclaim does not wait
on page writeback, either. Hence, this patch doesn't appear to
change the actions taken on a page under writeback in direct
reclaim....
> than it did potentially going OOM if there were a lot of dirty pages and
> it made no progress during direct reclaim.
I did a fair bit of low/small memory testing. This is a subjective
observation, but I definitely seemed to get less severe OOM
situations and better overall responsiveness with this patch
compared to when direct reclaim was doing writeback.
> > Hence for direct reclaim we should not allow ->writepages to be entered at all.
> > Set up the relevant scan_control structures to enforce this, and prevent
> > sc->may_writepage from being set in other places in the direct reclaim path in
> > response to other events.
> >
>
> If an FS caller cannot re-enter the FS, it should be using GFP_NOFS
> instead of GFP_KERNEL.
This problem is not a filesystem recursion problem which is, as I
understand it, what GFP_NOFS is used to prevent. It's _any_ kernel
code that uses significant stack before trying to allocate memory
that is the problem. e.g. a select() system call:
Depth Size Location (47 entries)
----- ---- --------
0) 7568 16 mempool_alloc_slab+0x16/0x20
1) 7552 144 mempool_alloc+0x65/0x140
2) 7408 96 get_request+0x124/0x370
3) 7312 144 get_request_wait+0x29/0x1b0
4) 7168 96 __make_request+0x9b/0x490
5) 7072 208 generic_make_request+0x3df/0x4d0
6) 6864 80 submit_bio+0x7c/0x100
7) 6784 96 _xfs_buf_ioapply+0x128/0x2c0 [xfs]
....
32) 3184 64 xfs_vm_writepage+0xab/0x160 [xfs]
33) 3120 384 shrink_page_list+0x65e/0x840
34) 2736 528 shrink_zone+0x63f/0xe10
35) 2208 112 do_try_to_free_pages+0xc2/0x3c0
36) 2096 128 try_to_free_pages+0x77/0x80
37) 1968 240 __alloc_pages_nodemask+0x3e4/0x710
38) 1728 48 alloc_pages_current+0x8c/0xe0
39) 1680 16 __get_free_pages+0xe/0x50
40) 1664 48 __pollwait+0xca/0x110
41) 1616 32 unix_poll+0x28/0xc0
42) 1584 16 sock_poll+0x1d/0x20
43) 1568 912 do_select+0x3d6/0x700
44) 656 416 core_sys_select+0x18c/0x2c0
45) 240 112 sys_select+0x4f/0x110
46) 128 128 system_call_fastpath+0x16/0x1b
There's 1.6k of stack used before memory allocation is called, 3.1k
used there before ->writepage is entered, XFS used 3.5k, and
if the mempool needed to allocate a page it would have blown the
stack. If there was any significant storage subsystem (add dm, md
and/or scsi of some kind), it would have blown the stack.
Basically, there is not enough stack space available to allow direct
reclaim to enter ->writepage _anywhere_ according to the stack usage
profiles we are seeing here....
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
> > Pros:
> > 1) prevent XFS stack overflow
> > 2) improve io workload performance
> >
> > Cons:
> > 3) TOTALLY kill lumpy reclaim (i.e. high order allocation)
> >
> > So, If we only need to consider io workload this is no downside. but
> > it can't.
> >
> > I think (1) is XFS issue. XFS should care it itself.
>
> The filesystem is irrelevant, IMO.
>
> The traces from the reporter showed that we've got close to a 2k
> stack footprint for memory allocation to direct reclaim and then we
> can put the entire writeback path on top of that. This is roughly
> 3.5k for XFS, and then depending on the storage subsystem
> configuration and transport can be another 2k of stack needed below
> XFS.
>
> IOWs, if we completely ignore the filesystem stack usage, there's
> still up to 4k of stack needed in the direct reclaim path. Given
> that one of the stack traces supplied show direct reclaim being
> entered with over 3k of stack already used, pretty much any
> filesystem is capable of blowing an 8k stack.
>
> So, this is not an XFS issue, even though XFS is the first to
> uncover it. Don't shoot the messenger....
Thanks for the explanation. I hadn't noticed that direct reclaim consumes
2k of stack. I'll investigate it and try to put it on a diet.
But XFS's 3.5k stack consumption is too large too; please put it on a diet as well.
Well, you seem to keep discussing IO workloads. I don't disagree on
that point.
For example, if only order-0 reclaim skipped pageout(), we would get the above
benefit too.
> > but we never kill pageout() completely because we can't
> > assume users don't run high order allocation workload.
>
> I think that lumpy reclaim will still work just fine.
>
> Lumpy reclaim appears to be using IO as a method of slowing
> down the reclaim cycle - the congestion_wait() call will still
> function as it does now if the background flusher threads are active
> and causing congestion. I don't see why lumpy reclaim specifically
> needs to be issuing IO to make it work - if the congestion_wait() is
> not waiting long enough then wait longer - don't issue IO to extend
> the wait time.
Lumpy reclaim is for allocating high-order pages. It not only reclaims the
page at the head of the LRU, but also its PFN neighborhood. The PFN
neighborhood often consists of newly allocated, still-dirty pages, so we
enforce pageout() cleaning and discard them.
When a high-order allocation occurs, we don't just need to free enough
memory, we also need to free a large enough contiguous memory block.
If we had to consider _only_ IO throughput, waiting for the flusher thread
might perhaps be faster, but we also need to consider reclaim latency. I
worry about that point too.
> Also, there doesn't appear to be anything special about the chunks of
> pages it's issuing IO on and waiting for, either. They are simply
> the last N pages on the LRU that could be grabbed so they have no
> guarantee of contiguity, so the IO it issues does nothing specific
> to help higher order allocations to succeed.
It does. Lumpy reclaim doesn't just grab the last N pages; instead it grabs a
contiguous memory chunk. Please see isolate_lru_pages().
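(For readers who don't know that function, a rough sketch of the PFN-neighborhood
scan being described - heavily simplified from isolate_lru_pages() of this era,
with zone-boundary and failure handling omitted:)

	/*
	 * After taking a "tag" page off the LRU, lumpy reclaim also tries to
	 * isolate every page in the order-aligned PFN block around it, so
	 * that freeing the batch can yield a contiguous high-order block.
	 */
	unsigned long page_pfn = page_to_pfn(page);
	unsigned long pfn = page_pfn & ~((1UL << order) - 1);
	unsigned long end_pfn = pfn + (1UL << order);

	for (; pfn < end_pfn; pfn++) {
		struct page *cursor_page;

		if (pfn == page_pfn || !pfn_valid(pfn))
			continue;

		cursor_page = pfn_to_page(pfn);
		if (__isolate_lru_page(cursor_page, mode, file) == 0) {
			list_move(&cursor_page->lru, dst);
			nr_taken++;
		}
	}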
>
> Hence it really seems to me that the effectiveness of lumpy reclaim
> is determined mostly by the effectiveness of the IO subsystem - the
> faster the IO subsystem cleans pages, the less time lumpy reclaim
> will block and the faster it will free pages. From this observation
> and the fact that issuing IO only from the bdi flusher threads will
> have the same effect (improves IO subsystem effectiveness), it seems
> to me that lumpy reclaim should not be adversely affected by this
> change.
>
> Of course, the code is a maze of twisty passages, so I probably
> missed something important. Hopefully someone can tell me what. ;)
>
> FWIW, the biggest problem here is that I have absolutely no clue on
> how to test what the impact on lumpy reclaim really is. Does anyone
> have a relatively simple test that can be run to determine what the
> impact is?
So, can you please run two workloads concurrently?
- Normal IO workload (fio, iozone, etc..)
- echo $NUM > /proc/sys/vm/nr_hugepages
The most typical high-order allocations are caused by brutal wireless LAN drivers
(or some cheap LAN cards).
But sadly, if the test depends on specific hardware, our discussion could
easily turn into a mess, so I'd prefer to use the hugepage feature instead.
Thanks.
It hasn't grown in the last 2 years after the last major diet where
all the fat was trimmed from it in the last round of the i386 4k
stack vs XFS saga. It seems that everything else around XFS has
grown in that time, and now we are blowing stacks again....
> > Hence I think that direct reclaim should be deferring to the
> > background flusher threads for cleaning memory and not trying to be
> > doing it itself.
>
> Well, you seems continue to discuss io workload. I don't disagree
> such point.
>
> example, If only order-0 reclaim skip pageout(), we will get the above
> benefit too.
But it won't prevent start blowups...
> > > but we never kill pageout() completely because we can't
> > > assume users don't run high order allocation workload.
> >
> > I think that lumpy reclaim will still work just fine.
> >
> > Lumpy reclaim appears to be using IO as a method of slowing
> > down the reclaim cycle - the congestion_wait() call will still
> > function as it does now if the background flusher threads are active
> > and causing congestion. I don't see why lumpy reclaim specifically
> > needs to be issuing IO to make it work - if the congestion_wait() is
> > not waiting long enough then wait longer - don't issue IO to extend
> > the wait time.
>
> lumpy reclaim is for allocation high order page. then, it not only
> reclaim LRU head page, but also its PFN neighborhood. PFN neighborhood
> is often newly page and still dirty. then we enfoce pageout cleaning
> and discard it.
Ok, I see that now - I missed the second call to __isolate_lru_page()
in isolate_lru_pages().
> When high order allocation occur, we don't only need free enough amount
> memory, but also need free enough contenious memory block.
Agreed, that was why I was kind of surprised not to find it was
doing that. But, as you have pointed out, that was my mistake.
> If we need to consider _only_ io throughput, waiting flusher thread
> might faster perhaps, but actually we also need to consider reclaim
> latency. I'm worry about such point too.
True, but without knowing how to test and measure such things I can't
really comment...
> > Of course, the code is a maze of twisty passages, so I probably
> > missed something important. Hopefully someone can tell me what. ;)
> >
> > FWIW, the biggest problem here is that I have absolutely no clue on
> > how to test what the impact on lumpy reclaim really is. Does anyone
> > have a relatively simple test that can be run to determine what the
> > impact is?
>
> So, can you please run two workloads concurrently?
> - Normal IO workload (fio, iozone, etc..)
> - echo $NUM > /proc/sys/vm/nr_hugepages
What do I measure/observe/record that is meaningful?
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
Depends. For raw effectiveness, I run a series of performance-related
benchmarks with a final test that
o Starts a number of parallel compiles that in combination are 1.25 times
of physical memory in total size
o Sleeps three minutes
o Starts allocating huge pages, recording the latency required for each one
o Records overall success rate and graphs latency over time
Lumpy reclaim both increases the success rate and reduces the latency.
> > Even without high-order considerations, this patch would appear to make
> > fairly large changes to how direct reclaim behaves. It would no longer
> > wait on page writeback for example so direct reclaim will return sooner
>
> AFAICT it still waits for pages under writeback in exactly the same manner
> it does now. shrink_page_list() does the following completely
> separately to the sc->may_writepage flag:
>
> 666 may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
> 667 (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
> 668
> 669 if (PageWriteback(page)) {
> 670 /*
> 671 * Synchronous reclaim is performed in two passes,
> 672 * first an asynchronous pass over the list to
> 673 * start parallel writeback, and a second synchronous
> 674 * pass to wait for the IO to complete. Wait here
> 675 * for any page for which writeback has already
> 676 * started.
> 677 */
> 678 if (sync_writeback == PAGEOUT_IO_SYNC && may_enter_fs)
> 679 wait_on_page_writeback(page);
> 680 else
> 681 goto keep_locked;
> 682 }
>
Right, so it'll still wait on writeback but won't kick it off. That
would still be a fairly significant change in behaviour though. Think of
synchronous lumpy reclaim for example, where it queues up a contiguous
batch of pages and then waits for them to write back.
> So if the page is under writeback, PAGEOUT_IO_SYNC is set and
> we can enter the fs, it will still wait for writeback to complete
> just like it does now.
>
But it would no longer be queueing them for writeback, so it'd be
depending heavily on kswapd or a background cleaning daemon to clean
them.
> However, the current code only uses PAGEOUT_IO_SYNC in lumpy
> reclaim, so for most typical workloads direct reclaim does not wait
> on page writeback, either.
No, but it does queue them back on the LRU where they might be clean the
next time they are found on the list. How significant a problem this is
I couldn't tell you but it could show a corner case where a large number
of direct reclaimers are encountering dirty pages frequently and
recycling them around the LRU list instead of cleaning them.
> Hence, this patch doesn't appear to
> change the actions taken on a page under writeback in direct
> reclaim....
>
It does, but indirectly. The impact is very direct for lumpy reclaim
obviously. For other direct reclaim, pages that were at the end of the
LRU list are no longer getting cleaned before doing another lap through
the LRU list.
The consequences of the latter are harder to predict.
> > than it did potentially going OOM if there were a lot of dirty pages and
> > it made no progress during direct reclaim.
>
> I did a fair bit of low/small memory testing. This is a subjective
> observation, but I definitely seemed to get less severe OOM
> situations and better overall responisveness with this patch than
> compared to when direct reclaim was doing writeback.
>
And it is possible that it is best overall if only kswapd and the
background cleaner are queueing pages for IO. All I can say for sure is
that this does appear to hurt lumpy reclaim and does affect normal
direct reclaim where I have no predictions.
I'm not denying the evidence but how has it been gotten away with for years
then? Prevention of writeback isn't the answer without figuring out how
direct reclaimers can queue pages for IO and in the case of lumpy reclaim
doing sync IO, then waiting on those pages.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
So, I've been reading along, nodding my head to Dave's side of things
because seeks are evil and direct reclaim makes seeks. I'd really love
for direct reclaim to somehow trigger writepages on large chunks instead
of doing page by page spatters of IO to the drive.
But, somewhere along the line I overlooked the part of Dave's stack trace
that said:
43) 1568 912 do_select+0x3d6/0x700
Huh, 912 bytes...for select, really? From poll.h:
/* ~832 bytes of stack space used max in sys_select/sys_poll before allocating
additional memory. */
#define MAX_STACK_ALLOC 832
#define FRONTEND_STACK_ALLOC 256
#define SELECT_STACK_ALLOC FRONTEND_STACK_ALLOC
#define POLL_STACK_ALLOC FRONTEND_STACK_ALLOC
#define WQUEUES_STACK_ALLOC (MAX_STACK_ALLOC - FRONTEND_STACK_ALLOC)
#define N_INLINE_POLL_ENTRIES (WQUEUES_STACK_ALLOC / sizeof(struct poll_table_entry))
So, select is intentionally trying to use that much stack. It should be using
GFP_NOFS if it really wants to suck down that much stack...if only the
kernel had some sort of way to dynamically allocate ram, it could try
that too.
-chris
On Tue, Apr 13, 2010 at 9:17 AM, Dave Chinner <da...@fromorbit.com> wrote:
> From: Dave Chinner <dchi...@redhat.com>
>
> When we enter direct reclaim we may have used an arbitrary amount of stack
> space, and hence enterring the filesystem to do writeback can then lead to
> stack overruns. This problem was recently encountered x86_64 systems with
> 8k stacks running XFS with simple storage configurations.
>
> Writeback from direct reclaim also adversely affects background writeback. The
> background flusher threads should already be taking care of cleaning dirty
> pages, and direct reclaim will kick them if they aren't already doing work. If
> direct reclaim is also calling ->writepage, it will cause the IO patterns from
> the background flusher threads to be upset by LRU-order writeback from
> pageout() which can be effectively random IO. Having competing sources of IO
> trying to clean pages on the same backing device reduces throughput by
> increasing the amount of seeks that the backing device has to do to write back
> the pages.
>
> Hence for direct reclaim we should not allow ->writepages to be entered at all.
> Set up the relevant scan_control structures to enforce this, and prevent
> sc->may_writepage from being set in other places in the direct reclaim path in
> response to other events.
I think your solution is a rather aggressive change, as Mel and Kosaki
already pointed out.
Is the flusher thread aware of the system-level LRU recency of dirty pages,
rather than just the recency with which pages were dirtied?
Of course the flusher thread can clean dirty pages faster than a direct reclaimer,
but if it isn't aware of LRU ordering, hot-page thrashing can happen in
corner cases.
It could also lose write merging.
And for non-rotational storage, seek cost might not be that big a deal.
I think we have to consider that case if we decide to change direct reclaim I/O.
How about separating the problem?
1. stack hogging problem.
2. direct reclaim random write.
And then try to solve them one by one instead of all at once.
--
Kind regards,
Minchan Kim
Perhaps drop the lock on the page if it is held and call one of the
helpers that filesystems use to do this, like:
filemap_write_and_wait(page->mapping);
> But, somewhere along the line I overlooked the part of Dave's stack trace
> that said:
>
> 43) 1568 912 do_select+0x3d6/0x700
>
> Huh, 912 bytes...for select, really? From poll.h:
Sure, it's bad, but focussing on the specific case misses the
point that even code that is using minimal stack can enter direct
reclaim after consuming 1.5k of stack. e.g.:
50) 3168 64 xfs_vm_writepage+0xab/0x160 [xfs]
51) 3104 384 shrink_page_list+0x65e/0x840
52) 2720 528 shrink_zone+0x63f/0xe10
53) 2192 112 do_try_to_free_pages+0xc2/0x3c0
54) 2080 128 try_to_free_pages+0x77/0x80
55) 1952 240 __alloc_pages_nodemask+0x3e4/0x710
56) 1712 48 alloc_pages_current+0x8c/0xe0
57) 1664 32 __page_cache_alloc+0x67/0x70
58) 1632 144 __do_page_cache_readahead+0xd3/0x220
59) 1488 16 ra_submit+0x21/0x30
60) 1472 80 ondemand_readahead+0x11d/0x250
61) 1392 64 page_cache_async_readahead+0xa9/0xe0
62) 1328 592 __generic_file_splice_read+0x48a/0x530
63) 736 48 generic_file_splice_read+0x4f/0x90
64) 688 96 xfs_splice_read+0xf2/0x130 [xfs]
65) 592 32 xfs_file_splice_read+0x4b/0x50 [xfs]
66) 560 64 do_splice_to+0x77/0xb0
67) 496 112 splice_direct_to_actor+0xcc/0x1c0
68) 384 80 do_splice_direct+0x57/0x80
69) 304 96 do_sendfile+0x16c/0x1e0
70) 208 80 sys_sendfile64+0x8d/0xb0
71) 128 128 system_call_fastpath+0x16/0x1b
Yes, __generic_file_splice_read() is a hog, but they seem to be
_everywhere_ today...
> So, select is intentionally trying to use that much stack. It should be using
> GFP_NOFS if it really wants to suck down that much stack...
The code that did the allocation is called from multiple different
contexts - how is it supposed to know that in some of those contexts
it is supposed to treat memory allocation differently?
This is my point - if you introduce a new semantic to memory allocation
that is "use GFP_NOFS when you are using too much stack" and too much
stack is more than 15% of the stack, then pretty much every code path
will need to set that flag...
> if only the
> kernel had some sort of way to dynamically allocate ram, it could try
> that too.
Sure, but to play the devil's advocate: if memory allocation blows
the stack, then surely avoiding allocation by using stack variables
is safer? ;)
FWIW, even if we use GFP_NOFS, allocation+reclaim can still use 2k
of stack; stuff like the radix tree code appears to be a significant
user of stack now:
Depth Size Location (56 entries)
----- ---- --------
0) 7904 48 __call_rcu+0x67/0x190
1) 7856 16 call_rcu_sched+0x15/0x20
2) 7840 16 call_rcu+0xe/0x10
3) 7824 272 radix_tree_delete+0x159/0x2e0
4) 7552 32 __remove_from_page_cache+0x21/0x110
5) 7520 64 __remove_mapping+0xe8/0x130
6) 7456 384 shrink_page_list+0x400/0x860
7) 7072 528 shrink_zone+0x636/0xdc0
8) 6544 112 do_try_to_free_pages+0xc2/0x3c0
9) 6432 112 try_to_free_pages+0x64/0x70
10) 6320 256 __alloc_pages_nodemask+0x3d2/0x710
11) 6064 48 alloc_pages_current+0x8c/0xe0
12) 6016 32 __page_cache_alloc+0x67/0x70
13) 5984 80 find_or_create_page+0x50/0xb0
14) 5904 160 _xfs_buf_lookup_pages+0x145/0x350 [xfs]
or even just calling ->releasepage and freeing bufferheads:
Depth Size Location (55 entries)
----- ---- --------
0) 7440 48 add_partial+0x26/0x90
1) 7392 64 __slab_free+0x1a9/0x380
2) 7328 64 kmem_cache_free+0xb9/0x160
3) 7264 16 free_buffer_head+0x25/0x50
4) 7248 64 try_to_free_buffers+0x79/0xc0
5) 7184 160 xfs_vm_releasepage+0xda/0x130 [xfs]
6) 7024 16 try_to_release_page+0x33/0x60
7) 7008 384 shrink_page_list+0x585/0x860
8) 6624 528 shrink_zone+0x636/0xdc0
9) 6096 112 do_try_to_free_pages+0xc2/0x3c0
10) 5984 112 try_to_free_pages+0x64/0x70
11) 5872 256 __alloc_pages_nodemask+0x3d2/0x710
12) 5616 48 alloc_pages_current+0x8c/0xe0
13) 5568 32 __page_cache_alloc+0x67/0x70
14) 5536 80 find_or_create_page+0x50/0xb0
15) 5456 160 _xfs_buf_lookup_pages+0x145/0x350 [xfs]
And another eye-opening example, this time deep in the sata driver
layer:
Depth Size Location (72 entries)
----- ---- --------
0) 8336 304 select_task_rq_fair+0x235/0xad0
1) 8032 96 try_to_wake_up+0x189/0x3f0
2) 7936 16 default_wake_function+0x12/0x20
3) 7920 32 autoremove_wake_function+0x16/0x40
4) 7888 64 __wake_up_common+0x5a/0x90
5) 7824 64 __wake_up+0x48/0x70
6) 7760 64 insert_work+0x9f/0xb0
7) 7696 48 __queue_work+0x36/0x50
8) 7648 16 queue_work_on+0x4d/0x60
9) 7632 16 queue_work+0x1f/0x30
10) 7616 16 queue_delayed_work+0x2d/0x40
11) 7600 32 ata_pio_queue_task+0x35/0x40
12) 7568 48 ata_sff_qc_issue+0x146/0x2f0
13) 7520 96 mv_qc_issue+0x12d/0x540 [sata_mv]
14) 7424 96 ata_qc_issue+0x1fe/0x320
15) 7328 64 ata_scsi_translate+0xae/0x1a0
16) 7264 64 ata_scsi_queuecmd+0xbf/0x2f0
17) 7200 48 scsi_dispatch_cmd+0x114/0x2b0
18) 7152 96 scsi_request_fn+0x419/0x590
19) 7056 32 __blk_run_queue+0x82/0x150
20) 7024 48 elv_insert+0x1aa/0x2d0
21) 6976 48 __elv_add_request+0x83/0xd0
22) 6928 96 __make_request+0x139/0x490
23) 6832 208 generic_make_request+0x3df/0x4d0
24) 6624 80 submit_bio+0x7c/0x100
25) 6544 96 _xfs_buf_ioapply+0x128/0x2c0 [xfs]
We need at least _700_ bytes of stack free just to call queue_work(),
and that now happens deep in the guts of the driver subsystem below XFS.
This trace shows 1.8k of stack usage on a simple, single sata disk
storage subsystem, so my estimate of 2k of stack for the storage system
below XFS is too small - a worst case of 2.5-3k of stack space is probably
closer to the mark.
This is the sort of thing I'm pointing at when I say that stack
usage outside XFS has grown significantly over the
past couple of years. Given XFS has remained pretty much the same or
even reduced slightly over the same time period, blaming XFS or
saying "callers should use GFP_NOFS" seems like a cop-out to me.
Regardless of the IO pattern performance issues, writeback via
direct reclaim just uses too much stack to be safe these days...
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
So, a rough as guts first pass - just run a large dd (8 times the
size of memory - 8GB file vs 1GB RAM) and repeatedly try to allocate
the entirety of memory in huge pages (500) every 5 seconds. The IO
rate is roughly 100MB/s, so it takes 75-85s to complete the dd.
The script:
$ cat t.sh
#!/bin/bash
echo 0 > /proc/sys/vm/nr_hugepages
echo 3 > /proc/sys/vm/drop_caches
dd if=/dev/zero of=/mnt/scratch/test bs=1024k count=8000 > /dev/null 2>&1 &
(
for i in `seq 1 1 20`; do
sleep 5
/usr/bin/time --format="wall %e" sh -c "echo 500 > /proc/sys/vm/nr_hugepages" 2>&1
grep HugePages_Total /proc/meminfo
done
) | awk '
/wall/ { wall += $2; cnt += 1 }
/Pages/ { pages[cnt] = $2 }
END { printf "average wall time %f\nPages step: ", wall / cnt ;
for (i = 1; i <= cnt; i++) {
printf "%d ", pages[i];
}
}'
----
And the output looks like:
$ sudo ./t.sh
average wall time 0.954500
Pages step: 97 101 101 121 173 173 173 173 173 173 175 194 195 195 202 220 226 419 423 426
$
Run 50 times in a loop, and the outputs averaged, the existing lumpy
reclaim resulted in:
dave@test-1:~$ cat current.txt | awk -f av.awk
av. wall = 0.519385 secs
av Pages step: 192 228 242 255 265 272 279 284 289 294 298 303 307 322 342 366 383 401 412 420
And with my patch that disables ->writepage:
dave@test-1:~$ cat no-direct.txt | awk -f av.awk
av. wall = 0.554163 secs
av Pages step: 231 283 310 316 323 328 336 340 345 351 356 359 364 377 388 397 413 423 432 439
Basically, with my patch lumpy reclaim was *substantially* more
effective with only a slight increase in average allocation latency
with this test case.
I need to add a marker to the output that records when the dd
completes, but from monitoring the writeback rates via PCP, they
were in the ballpark of 85-100MB/s for the existing code, and
95-110MB/s with my patch. Hence it improved both IO throughput and
the effectiveness of lumpy reclaim.
On the down side, I did have an OOM killer invocation with my patch
after about 150 iterations - dd failed an order zero allocation
because there were 455 huge pages allocated and there were only
_320_ available pages for IO, all of which were under IO. i.e. lumpy
reclaim worked so well that the machine got into order-0 page
starvation.
I know this is a simple test case, but it shows much better results
than I think anyone (even me) is expecting...
It may be aggressive, but writeback from direct reclaim is, IMO, one
of the worst aspects of the current VM design because of its
adverse effect on the IO subsystem.
I'd prefer to remove it completely than continue to try and patch
around it, especially given that everyone seems to agree that it
does have an adverse effect on IO...
> Do flush thread aware LRU of dirty pages in system level recency not
> dirty pages recency?
It writes back in the order inodes were dirtied. i.e. the LRU is a
coarser measure, but it is still definitely there. It also takes
into account fairness of IO between dirty inodes, so no one dirty
inode prevents IO being issued on other dirty inodes on the
LRU...
> Of course flush thread can clean dirty pages faster than direct reclaimer.
> But if it don't aware LRUness, hot page thrashing can be happened by
> corner case.
> It could lost write merge.
>
> And non-rotation storage might be not big of seek cost.
Non-rotational storage still goes faster when it is fed large, well
formed IOs.
> I think we have to consider that case if we decide to change direct reclaim I/O.
>
> How do we separate the problem?
>
> 1. stack hogging problem.
> 2. direct reclaim random write.
AFAICT, the only way to _reliably_ avoid the stack usage problem is
to avoid writeback in direct reclaim. That has the side effect of
fixing #2 as well, so do they really need separating?
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
> 50) 3168 64 xfs_vm_writepage+0xab/0x160 [xfs]
> 51) 3104 384 shrink_page_list+0x65e/0x840
> 52) 2720 528 shrink_zone+0x63f/0xe10
A bit OFF TOPIC.
Could you share the disassembly of shrink_zone()?
In my environment:
00000000000115a0 <shrink_zone>:
115a0: 55 push %rbp
115a1: 48 89 e5 mov %rsp,%rbp
115a4: 41 57 push %r15
115a6: 41 56 push %r14
115a8: 41 55 push %r13
115aa: 41 54 push %r12
115ac: 53 push %rbx
115ad: 48 83 ec 78 sub $0x78,%rsp
115b1: e8 00 00 00 00 callq 115b6 <shrink_zone+0x16>
115b6: 48 89 75 80 mov %rsi,-0x80(%rbp)
The disassembly seems to show 0x78 bytes for the stack, and no changes to %rsp
until return.
I may be misunderstanding something...
Thanks,
-Kame
I see the same. I didn't compile those kernels, though. IIUC,
they were built through the Ubuntu build infrastructure, so there is
something different in terms of compiler, compiler options or config
to what we are both using. Most likely it is the compiler inlining,
though Chris's patches to prevent that didn't seem to change the
stack usage.
I'm trying to get a stack trace from the kernel that has shrink_zone
in it, but I haven't succeeded yet....
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
I also got 0x78 bytes of stack usage. Umm.. are we discussing the real issue now?
Yeah, of course, very much. I would propose to revert 70674f95c0,
but I doubt GFP_NOFS solves our issue.
I have a dumb question: if XFS hasn't bloated its stack usage, why did 3.5k
of stack usage work fine on a 4k stack kernel? It seems impossible.
Please don't think I'm blaming you. I don't know what the "4k stack vs XFS saga" is.
I merely want to understand what you said.
> > > Hence I think that direct reclaim should be deferring to the
> > > background flusher threads for cleaning memory and not trying to be
> > > doing it itself.
> >
> > Well, you seems continue to discuss io workload. I don't disagree
> > such point.
> >
> > example, If only order-0 reclaim skip pageout(), we will get the above
> > benefit too.
>
> But it won't prevent start blowups...
>
> > > > but we never kill pageout() completely because we can't
> > > > assume users don't run high order allocation workload.
> > >
> > > I think that lumpy reclaim will still work just fine.
> > >
> > > Lumpy reclaim appears to be using IO as a method of slowing
> > > down the reclaim cycle - the congestion_wait() call will still
> > > function as it does now if the background flusher threads are active
> > > and causing congestion. I don't see why lumpy reclaim specifically
> > > needs to be issuing IO to make it work - if the congestion_wait() is
> > > not waiting long enough then wait longer - don't issue IO to extend
> > > the wait time.
> >
> > lumpy reclaim is for allocation high order page. then, it not only
> > reclaim LRU head page, but also its PFN neighborhood. PFN neighborhood
> > is often newly page and still dirty. then we enfoce pageout cleaning
> > and discard it.
>
> Ok, I see that now - I missed the second call to __isolate_lru_pages()
> in isolate_lru_pages().
No problem. It's one of the VM's messes. Most developers don't know about it :-)
> > When high order allocation occur, we don't only need free enough amount
> > memory, but also need free enough contenious memory block.
>
> Agreed, that was why I was kind of surprised not to find it was
> doing that. But, as you have pointed out, that was my mistake.
>
> > If we need to consider _only_ io throughput, waiting flusher thread
> > might faster perhaps, but actually we also need to consider reclaim
> > latency. I'm worry about such point too.
>
> True, but without know how to test and measure such things I can't
> really comment...
Agreed. I know making a VM measurement benchmark is very difficult, but
it is probably necessary....
I'm sorry, right now I can't give you a good, convenient benchmark.
>
> > > Of course, the code is a maze of twisty passages, so I probably
> > > missed something important. Hopefully someone can tell me what. ;)
> > >
> > > FWIW, the biggest problem here is that I have absolutely no clue on
> > > how to test what the impact on lumpy reclaim really is. Does anyone
> > > have a relatively simple test that can be run to determine what the
> > > impact is?
> >
> > So, can you please run two workloads concurrently?
> > - Normal IO workload (fio, iozone, etc..)
> > - echo $NUM > /proc/sys/vm/nr_hugepages
>
> What do I measure/observe/record that is meaningful?
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> da...@fromorbit.com
>
Ummm...
Probably I have to say I'm sorry. I guess my last mail gave you
a misunderstanding.
To be honest, I'm not interested in this artificial non-fragmentation case.
The above test case 1) discards all cache and 2) fills pages by streaming
IO. That creates an artificial "file offset neighbor == block neighbor == PFN neighbor"
situation, so file-offset-order writeout by the flusher thread can produce
PFN-contiguous pages very effectively.
Why am I not interested in it? Because lumpy reclaim is a technique for
avoiding the external fragmentation mess. IOW, it is for avoiding the worst
case, but your test case seems to measure the best one.
I agree that "seeks are evil and direct reclaim makes seeks". Actually,
making 4k io is not must for pageout. So, probably we can improve it.
> Perhaps drop the lock on the page if it is held and call one of the
> helpers that filesystems use to do this, like:
>
> filemap_write_and_wait(page->mapping);
Sorry, I'm lost as to what you're talking about. Why do we need per-file
waiting? If the file is a 1GB file, do we need to wait for 1GB of writeout?
>
> > But, somewhere along the line I overlooked the part of Dave's stack trace
> > that said:
> >
> > 43) 1568 912 do_select+0x3d6/0x700
> >
> > Huh, 912 bytes...for select, really? From poll.h:
>
> Sure, it's bad, but we focussing on the specific case misses the
> point that even code that is using minimal stack can enter direct
> reclaim after consuming 1.5k of stack. e.g.:
checkstack.pl says do_select() and __generic_file_splice_read() are among
the worst stack consumers. Both should be fixed.
Also, checkstack.pl says there aren't that many such stack eaters.
Nodding my head to Dave's side. Changing the caller's GFP argument doesn't
seem like a good solution. I mean:
- do_select() should use a GFP_KERNEL allocation instead of the stack (i.e. revert 70674f95c0)
- reclaim and XFS (and some other things) need to go on a diet.
Also, I believe stack-eater functions should generate a warning. Patch attached.
Your explanation is very interesting. I have a (probably dumb) question:
why did nobody face stack overflow issues in the past? Now I think every user
would easily get a stack overflow if your explanation is correct.
>
> This is the sort of thing I'm pointing at when I say that stack
> usage outside XFS has grown significantly significantly over the
> past couple of years. Given XFS has remained pretty much the same or
> even reduced slightly over the same time period, blaming XFS or
> saying "callers should use GFP_NOFS" seems like a cop-out to me.
> Regardless of the IO pattern performance issues, writeback via
> direct reclaim just uses too much stack to be safe these days...
Yeah, my answer is simple: all stack eaters should be fixed.
But XFS doesn't seem innocent either. 3.5k is plenty big, even though
XFS has used that much for a very long time.
===========================================================
Subject: [PATCH] kconfig: reduce FRAME_WARN default value to 512
Surprisingly, several odd functions now use a great deal of stack.
% objdump -d vmlinux | ./scripts/checkstack.pl
0xffffffff81e3db07 get_next_block [vmlinux]: 1976
0xffffffff8130b9bd node_read_meminfo [vmlinux]: 1240
0xffffffff811553fd do_sys_poll [vmlinux]: 1000
0xffffffff8122b49d test_aead [vmlinux]: 904
0xffffffff81154c9d do_select [vmlinux]: 888
0xffffffff81168d9d default_file_splice_read [vmlinux]: 760
Oh well, every developer has to pay attention to stack usage!
Thus, this patch reduces the FRAME_WARN default value to 512.
Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---
lib/Kconfig.debug | 3 +--
1 files changed, 1 insertions(+), 2 deletions(-)
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index ff01710..44ebba6 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -28,8 +28,7 @@ config ENABLE_MUST_CHECK
config FRAME_WARN
int "Warn for stack frames larger than (needs gcc 4.4)"
range 0 8192
- default 1024 if !64BIT
- default 2048 if 64BIT
+ default 512
help
Tell gcc to warn at build time for stack frames larger than this.
Setting this too low will cause a lot of warnings.
--
1.6.5.2
Ok, so here's a trace at the top of the stack from a kernel with
the above shrink_zone disassembly:
$ cat /sys/kernel/debug/tracing/stack_trace
Depth Size Location (49 entries)
----- ---- --------
0) 6152 112 force_qs_rnp+0x58/0x150
1) 6040 48 force_quiescent_state+0x1a7/0x1f0
2) 5992 48 __call_rcu+0x13d/0x190
3) 5944 16 call_rcu_sched+0x15/0x20
4) 5928 16 call_rcu+0xe/0x10
5) 5912 240 radix_tree_delete+0x14a/0x2d0
6) 5672 32 __remove_from_page_cache+0x21/0x110
7) 5640 64 __remove_mapping+0x86/0x100
8) 5576 272 shrink_page_list+0x2fd/0x5a0
9) 5304 400 shrink_inactive_list+0x313/0x730
10) 4904 176 shrink_zone+0x3d1/0x490
11) 4728 128 do_try_to_free_pages+0x2b6/0x380
12) 4600 112 try_to_free_pages+0x5e/0x60
13) 4488 272 __alloc_pages_nodemask+0x3fb/0x730
14) 4216 48 alloc_pages_current+0x87/0xd0
15) 4168 32 __page_cache_alloc+0x67/0x70
16) 4136 80 find_or_create_page+0x4f/0xb0
17) 4056 160 _xfs_buf_lookup_pages+0x150/0x390
.....
So the differences are most likely from the compiler doing
automatic inlining of static functions...
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
I changed shrink_list() to be noinline_for_stack.
The result is as follows.
00001fe0 <shrink_zone>:
1fe0: 55 push %ebp
1fe1: 89 e5 mov %esp,%ebp
1fe3: 57 push %edi
1fe4: 56 push %esi
1fe5: 53 push %ebx
1fe6: 83 ec 4c sub $0x4c,%esp
1fe9: 89 45 c0 mov %eax,-0x40(%ebp)
1fec: 89 55 bc mov %edx,-0x44(%ebp)
1fef: 89 4d b8 mov %ecx,-0x48(%ebp)
0x110 -> 0x4c.
Should we add noinline_for_stack to shrink_list()?
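A minimal sketch of what that would look like (the signature and body are
approximated from the vmscan code of this era; illustrative only, not a
tested patch):

	/*
	 * noinline_for_stack keeps shrink_list()'s locals off shrink_zone()'s
	 * frame, so the large temporaries only occupy stack while the call is
	 * actually in progress.
	 */
	static noinline_for_stack unsigned long shrink_list(enum lru_list lru,
				unsigned long nr_to_scan, struct zone *zone,
				struct scan_control *sc, int priority)
	{
		int file = is_file_lru(lru);

		if (is_active_lru(lru)) {
			shrink_active_list(nr_to_scan, zone, sc, priority, file);
			return 0;
		}
		return shrink_inactive_list(nr_to_scan, zone, sc, priority, file);
	}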
--
Kind regards,
Minchan Kim
So use filemap_fdatawrite(page->mapping), or if it's better only
to start IO on a segment of the file, use
filemap_fdatawrite_range(page->mapping, start, end)....
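Purely as an illustration (the function, window size and clustering policy
here are my own assumptions, not something proposed in the thread), starting
ranged, asynchronous writeback around a dirty page might look like:

#include <linux/fs.h>
#include <linux/pagemap.h>

/*
 * Illustrative only: start async writeback on a window of the file
 * around a dirty page instead of calling ->writepage on that one page,
 * letting the filesystem build better-formed IO.  The 1MB window is an
 * arbitrary choice.  As noted later in the thread, the caller still
 * ends up in ->writepages, so this does not address the stack depth
 * problem by itself.
 */
static int writeback_window_around(struct page *page)
{
	const loff_t window = 1024 * 1024;
	loff_t start = page_offset(page) & ~(window - 1);

	return filemap_fdatawrite_range(page->mapping, start,
					start + window - 1);
}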
> > > But, somewhere along the line I overlooked the part of Dave's stack trace
> > > that said:
> > >
> > > 43) 1568 912 do_select+0x3d6/0x700
> > >
> > > Huh, 912 bytes...for select, really? From poll.h:
> >
> > Sure, it's bad, but we focussing on the specific case misses the
> > point that even code that is using minimal stack can enter direct
> > reclaim after consuming 1.5k of stack. e.g.:
>
> checkstack.pl says do_select() and __generic_file_splice_read() are one
> of worstest stack consumer. both sould be fixed.
the deepest call chain in queue_work() needs 700 bytes of stack
to complete, wait_for_completion() requires almost 2k of stack space
at its deepest, the scheduler has some heavy stack users, etc,
and these are all functions that appear at the top of the stack.
> also, checkstack.pl says such stack eater aren't so much.
Yeah, but when we have a callchain 70 or more functions deep,
even 100 bytes of stack is a lot....
> > > So, select is intentionally trying to use that much stack. It should be using
> > > GFP_NOFS if it really wants to suck down that much stack...
> >
> > The code that did the allocation is called from multiple different
> > contexts - how is it supposed to know that in some of those contexts
> > it is supposed to treat memory allocation differently?
> >
> > This is my point - if you introduce a new semantic to memory allocation
> > that is "use GFP_NOFS when you are using too much stack" and too much
> > stack is more than 15% of the stack, then pretty much every code path
> > will need to set that flag...
>
> Nodding my head to Dave's side. changing caller argument seems not good
> solution. I mean
> - do_select() should use GFP_KERNEL instead stack (as revert 70674f95c0)
> - reclaim and xfs (and other something else) need to diet.
The list I'm seeing so far includes:
- scheduler
- completion interfaces
- radix tree
- memory allocation, memory reclaim
- anything that implements ->writepage
- select
- splice read
> Also, I believe stack eater function should be created waring. patch attached.
Good start, but 512 bytes will only catch select and splice read,
and there are 300-400 byte functions in the above list that sit near
the top of the stack....
> > We need at least _700_ bytes of stack free just to call queue_work(),
> > and that now happens deep in the guts of the driver subsystem below XFS.
> > This trace shows 1.8k of stack usage on a simple, single sata disk
> > storage subsystem, so my estimate of 2k of stack for the storage system
> > below XFS is too small - a worst case of 2.5-3k of stack space is probably
> > closer to the mark.
>
> your explanation is very interesting. I have a (probably dumb) question.
> Why nobody faced stack overflow issue in past? now I think every users
> easily get stack overflow if your explanation is correct.
It's always a problem, but the focus on minimising stack usage has
gone away since i386 has mostly disappeared from server rooms.
XFS has always been the thing that triggered stack usage problems
first - the first reports of problems on x86_64 with 8k stacks in low
memory situations have only just come in, and this is the first time
in a couple of years I've paid close attention to stack usage
outside XFS. What I'm seeing is not pretty....
> > This is the sort of thing I'm pointing at when I say that stack
> > usage outside XFS has grown significantly significantly over the
> > past couple of years. Given XFS has remained pretty much the same or
> > even reduced slightly over the same time period, blaming XFS or
> > saying "callers should use GFP_NOFS" seems like a cop-out to me.
> > Regardless of the IO pattern performance issues, writeback via
> > direct reclaim just uses too much stack to be safe these days...
>
> Yeah, My answer is simple, All stack eater should be fixed.
> but XFS seems not innocence too. 3.5K is enough big although
> xfs have use such amount since very ago.
XFS used to use much more than that - significant effort has been
put into reducing the stack footprint over many years. There's not
much left to trim without rewriting half the filesystem...
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
Because on a 32 bit kernel it's somewhere between 2-2.5k of stack
space. That being said, XFS _will_ blow a 4k stack on anything other
than the most basic storage configurations, and if you run out of
memory it is almost guaranteed to do so.
> Please don't think I blame you. I don't know what is "4k stack vs XFS saga".
> I merely want to understand what you said.
Over a period of years there were repeated attempts to make the
default stack size on i386 4k, despite it being known to cause
problems on relatively common configurations. Every time it was
brought up it was rejected, but every few months somebody else made
an attempt to make it the default. There was a lot of flamage
directed at XFS because it was seen as the reason that 4k stacks
were not made the default....
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
--
Tend to agree. But do we need it as a last resort if the flusher thread
can't keep up with the write stream?
Or, in my opinion, could the I/O layer have better throttling logic than it
has now?
>
> I'd prefer to remove it completely that continue to try and patch
> around it, especially given that everyone seems to agree that it
> does have an adverse affect on IO...
Of course, if everybody agrees, we can do it.
For that, we need many benchmark results, which is very hard.
Maybe I can help with that on embedded systems.
>
>> Do flush thread aware LRU of dirty pages in system level recency not
>> dirty pages recency?
>
> It writes back in the order inodes were dirtied. i.e. the LRU is a
> coarser measure, but it it still definitely there. It also takes
> into account fairness of IO between dirty inodes, so no one dirty
> inode prevents IO beining issued on a other dirty inodes on the
> LRU...
Thanks.
It seems recency is lost, then.
I am not sure how much it affects system performance.
>
>> Of course flush thread can clean dirty pages faster than direct reclaimer.
>> But if it don't aware LRUness, hot page thrashing can be happened by
>> corner case.
>> It could lost write merge.
>>
>> And non-rotation storage might be not big of seek cost.
>
> Non-rotational storage still goes faster when it is fed large, well
> formed IOs.
Agreed, I missed that. NAND devices are stronger than HDDs at random reads,
but random writes are very weak in both performance and wear-leveling.
>
>> I think we have to consider that case if we decide to change direct reclaim I/O.
>>
>> How do we separate the problem?
>>
>> 1. stack hogging problem.
>> 2. direct reclaim random write.
>
> AFAICT, the only way to _reliably_ avoid the stack usage problem is
> to avoid writeback in direct reclaim. That has the side effect of
> fixing #2 as well, so do they really need separating?
If we can do it, that's good,
but problem 2 is not easy to fix, I think.
Compared to 2, 1 is rather easy.
So I thought we could solve 1 first and then focus on 2.
If your suggestion is right, then we can apply your idea.
Then we don't need to revert the fix for 1, since small stack usage is
always good
as long as we don't lose much performance.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> da...@fromorbit.com
>
--
Kind regards,
Minchan Kim
That does not help the stack usage issue; the caller still ends up in
->writepages. From an IO perspective, it'll be better from a seek point of
view but from a VM perspective, it may or may not be cleaning the right pages.
So I think this is a red herring.
> > > > But, somewhere along the line I overlooked the part of Dave's stack trace
> > > > that said:
> > > >
> > > > 43) 1568 912 do_select+0x3d6/0x700
> > > >
> > > > Huh, 912 bytes...for select, really? From poll.h:
> > >
> > > Sure, it's bad, but we focussing on the specific case misses the
> > > point that even code that is using minimal stack can enter direct
> > > reclaim after consuming 1.5k of stack. e.g.:
> >
> > checkstack.pl says do_select() and __generic_file_splice_read() are among
> > the worst stack consumers. Both should be fixed.
>
> the deepest call chain in queue_work() needs 700 bytes of stack
> to complete, wait_for_completion() requires almost 2k of stack space
> at it's deepest, the scheduler has some heavy stack users, etc,
> and these are all functions that appear at the top of the stack.
>
The real issue here then is that stack usage has gone out of control.
Disabling ->writepage in direct reclaim does not guarantee that stack
usage will not be a problem again. From your traces, page reclaim itself
seems to be a big dirty hog.
Differences in what people see on their machines may be down to architecture
or compiler, but most likely inlining. Changing inlining will not fix the problem,
it'll just move the stack usage around.
They will need to be tackled in turn then but obviously there should be
a focus on the common paths. The reclaim paths do seem particularly
heavy and it's down to a lot of temporary variables. I might not get the
time today but what I'm going to try to do some time this week is
o Look at what temporary variables are copies of other pieces of information
o See what variables live for the duration of reclaim but are not needed
for all of it (i.e. uninline parts of it so variables do not persist)
o See if it's possible to dynamically allocate scan_control
The last one is the trickiest. Basically, the idea would be to move as much
into scan_control as possible. Then, instead of allocating it on the stack,
allocate a fixed number of them at boot-time (NR_CPU probably) protected by
a semaphore. Limit the number of direct reclaimers that can be active at a
time to the number of scan_control variables. kswapd could still allocate
its own on the stack or with kmalloc.
If it works out, it would have two main benefits. Limits the number of
processes in direct reclaim - if there is NR_CPU-worth of processes in direct
reclaim, there is too much going on. It would also shrink the stack usage
particularly if some of the stack variables are moved into scan_control.
Maybe someone will beat me to looking at the feasibility of this.
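(Purely to illustrate that idea, here is a rough sketch with invented names:
a boot-time pool of scan_control structures plus a semaphore that bounds the
number of concurrent direct reclaimers. Nothing below exists in the kernel.)

static struct scan_control sc_pool[NR_CPUS];
static DECLARE_BITMAP(sc_pool_used, NR_CPUS);
static DEFINE_SPINLOCK(sc_pool_lock);
static struct semaphore sc_pool_sem;  /* sema_init(&sc_pool_sem, num_possible_cpus()) at boot */

static struct scan_control *get_scan_control(void)
{
        int i;

        /* throttle: at most NR_CPU direct reclaimers at once */
        down(&sc_pool_sem);
        spin_lock(&sc_pool_lock);
        i = find_first_zero_bit(sc_pool_used, NR_CPUS);
        __set_bit(i, sc_pool_used);
        spin_unlock(&sc_pool_lock);

        memset(&sc_pool[i], 0, sizeof(sc_pool[i]));
        return &sc_pool[i];
}

static void put_scan_control(struct scan_control *sc)
{
        spin_lock(&sc_pool_lock);
        __clear_bit(sc - sc_pool, sc_pool_used);
        spin_unlock(&sc_pool_lock);
        up(&sc_pool_sem);
}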
I don't think he is levelling a complaint at XFS in particular - just pointing
out that it's heavy too. Still, we should be grateful that XFS is sort of
a "Stack Canary". If it dies, everyone else could be in trouble soon :)
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
Hmm, about shrink_zone(): I don't think uninlining functions directly called
by shrink_zone() can be much help.
The total stack size of the call chain will still be big.
Thanks,
-Kame
Absolutely.
But the above 500-byte usage is one of the hogs, and uninlining is not
critical for reclaim performance, so I think we lose nothing compared to
what we gain.
But I'm not in a hurry; an ad-hoc approach is not good.
I hope that when Mel tackles the stack consumption in the reclaim path, he
modifies this part, too.
Thanks.
> Thanks,
> -Kame
>
>
>
--
Kind regards,
Minchan Kim
Bear in mind that uninlining can slightly increase the stack usage in some
cases because arguments, return addresses and the like have to be pushed
onto the stack. Inlining or uninlining is only the answer when it reduces the
number of stack variables that exist at any given time.
> But I don't get in a hurry. adhoc approach is not good.
> I hope when Mel tackles down consumption of stack in reclaim path, he
> modifies this part, too.
>
It'll be at least two days before I get the chance to try. A lot of the
temporary variables used in the reclaim path have existed for some time so
it will take a while.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
There are lots of other call chains which use multiple KB by themselves,
so why not give select() that measly 832 bytes?
You think only file systems are allowed to use stack? :)
Basically if you cannot tolerate 1K (or more likely more) of stack
used before your fs is called you're toast in lots of other situations
anyways.
> kernel had some sort of way to dynamically allocate ram, it could try
> that too.
It does this for large inputs, but the whole point of the stack fast
path is to avoid it for the common cases when only a small number of fds
is needed.
It's significantly slower to go to any external allocator.
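(As a generic illustration of that fast path - the pattern only, not the
actual fs/select.c code:)

        long stack_buf[256 / sizeof(long)];     /* common case stays on the stack */
        void *buf = stack_buf;

        if (size > sizeof(stack_buf)) {
                /* large input: pay for an external allocation */
                buf = kmalloc(size, GFP_KERNEL);
                if (!buf)
                        return -ENOMEM;
        }

        /* ... use buf ... */

        if (buf != stack_buf)
                kfree(buf);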
-Andi
--
a...@linux.intel.com -- Speaking for myself only.
Yes. I totally missed it.
Thanks, Mel.
--
Kind regards,
Minchan Kim
Grin, most definitely.
>
> Basically if you cannot tolerate 1K (or more likely more) of stack
> used before your fs is called you're toast in lots of other situations
> anyways.
Well, on a 4K stack kernel, 832 bytes is a very large percentage for
just one function.
Direct reclaim is a problem because it splices parts of the kernel that
normally aren't connected together. The people that code in select see
832 bytes and say that's teeny, I should have taken 3832 bytes.
But they don't realize their function can dive down into ecryptfs then
the filesystem then maybe loop and then perhaps raid6 on top of a
network block device.
>
> > kernel had some sort of way to dynamically allocate ram, it could try
> > that too.
>
> It does this for large inputs, but the whole point of the stack fast
> path is to avoid it for common cases when a small number of fds is
> only needed.
>
> It's significantly slower to go to any external allocator.
Yeah, but since the call chain does eventually go into the allocator,
this function needs to be more stack friendly.
I do agree that we can't really solve this with noinline_for_stack pixie
dust, the long call chains are going to be a problem no matter what.
Reading through all the comments so far, I think the short summary is:
Cleaning pages in direct reclaim helps the VM because it is able to make
sure that lumpy reclaim finds adjacent pages. This isn't a fast
operation, it has to wait for IO (infinitely slow compared to the CPU).
Will it be good enough for the VM if we add a hint to the bdi writeback
threads to work on a general area of the file? The filesystem will get
writepages(), the VM will get the IO it needs started.
I know Mel mentioned before he wasn't interested in waiting for helper
threads, but I don't see how we can work without it.
-chris
To be honest I think 4K stack simply has to go. I tend to call
it "russian roulette" mode.
It was just an old workaround for a very old buggy VM that couldn't free 8K
pages and the VM is a lot better at that now. And the general trend is
to more complex code everywhere, so 4K stacks become more and more hazardous.
It was a bad idea back then and is still a bad idea, getting
worse and worse with each MLOC being added to the kernel each year.
We don't have any good ways to verify that obscure paths through
more and more subsystems won't exceed it (in fact I'm pretty
sure there are plenty of problems in exotic configurations)
And even if you can make a specific load work there's basically
no safety net.
The only part of the 4K stack code that's good is the separate
interrupt stack, but that one should be just combined with a sane 8K
process stack.
But yes on a 4K kernel you probably don't want to do any direct reclaim.
Maybe for GFP_NOFS everywhere except user allocations when it's set?
Or simply drop it?
> But they don't realize their function can dive down into ecryptfs then
> the filesystem then maybe loop and then perhaps raid6 on top of a
> network block device.
Those stackings need to use separate threads anyways. A lot of them
do in fact. Block avoided this problem by iterating instead of
recursing. Those that still recurse on the same stack simply
need to be fixed.
> Yeah, but since the call chain does eventually go into the allocator,
> this function needs to be more stack friendly.
For common fast paths it doesn't go into the allocator.
-Andi
--
a...@linux.intel.com -- Speaking for myself only.
The reality is that if you are blowing a 4K process stack you are
probably playing russian roulette on the current 8K x86-32 stack as well
because of the non-IRQ split. So it needs fixing either way.
Yes I think the 8K stack on 32bit should be combined with an interrupt
stack too. There's no reason not to have an interrupt stack ever.
Again the problem with fixing it is that you won't have any safety net
for a slightly different stacking etc. path that you didn't cover.
That said extreme examples (like some of those Chris listed) definitely
need fixing by moving them to different threads. But even after that
you still want a safety net. 4K is just too near the edge.
Maybe it would work if we never used any indirect calls, but that's
clearly not the case.
-Andi
--
a...@linux.intel.com -- Speaking for myself only.
Even without direct reclaim, I doubt stack usage is often at the top of
people's minds except for truly criminally large usages of it. Direct
reclaim splicing is somewhat of a problem but it's separate to stack
consumption overall.
> But they don't realize their function can dive down into ecryptfs then
> the filesystem then maybe loop and then perhaps raid6 on top of a
> network block device.
>
> >
> > > kernel had some sort of way to dynamically allocate ram, it could try
> > > that too.
> >
> > It does this for large inputs, but the whole point of the stack fast
> > path is to avoid it for common cases when a small number of fds is
> > only needed.
> >
> > It's significantly slower to go to any external allocator.
>
> Yeah, but since the call chain does eventually go into the allocator,
> this function needs to be more stack friendly.
>
> I do agree that we can't really solve this with noinline_for_stack pixie
> dust, the long call chains are going to be a problem no matter what.
>
> Reading through all the comments so far, I think the short summary is:
>
> Cleaning pages in direct reclaim helps the VM because it is able to make
> sure that lumpy reclaim finds adjacent pages. This isn't a fast
> operation, it has to wait for IO (infinitely slow compared to the CPU).
>
> Will it be good enough for the VM if we add a hint to the bdi writeback
> threads to work on a general area of the file? The filesystem will get
> writepages(), the VM will get the IO it needs started.
>
Bear in mind that in the context of lumpy reclaim the VM doesn't care
about where the data is in the file or filesystem. It's only concerned
about where the data is located in memory. There *may* be a correlation
between location-of-data-in-file and location-of-data-in-memory but only
if readahead was a factor and readahead happened to hit at a time the page
allocator broke up a contiguous block of memory.
> I know Mel mentioned before he wasn't interested in waiting for helper
> threads, but I don't see how we can work without it.
>
I'm not against the idea as such. It would have advantages in that the
thread could reorder the IO for better seeks for example and lumpy
reclaim is already potentially waiting a long time so another delay
won't hurt. I would worry that it's just hiding the stack usage by
moving it to another thread and that there would be communication cost
between a direct reclaimer and this writeback thread. The main gain
would be in hiding the "splicing" effect between subsystems that direct
reclaim can have.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
[ nods ]
>
> Bear in mind that the context of lumpy reclaim that the VM doesn't care
> about where the data is on the file or filesystem. It's only concerned
> about where the data is located in memory. There *may* be a correlation
> between location-of-data-in-file and location-of-data-in-memory but only
> if readahead was a factor and readahead happened to hit at a time the page
> allocator broke up a contiguous block of memory.
>
> > I know Mel mentioned before he wasn't interested in waiting for helper
> > threads, but I don't see how we can work without it.
> >
>
> I'm not against the idea as such. It would have advantages in that the
> thread could reorder the IO for better seeks for example and lumpy
> reclaim is already potentially waiting a long time so another delay
> won't hurt. I would worry that it's just hiding the stack usage by
> moving it to another thread and that there would be communication cost
> between a direct reclaimer and this writeback thread. The main gain
> would be in hiding the "splicing" effect between subsystems that direct
> reclaim can have.
The big gain from the helper threads is that storage operates at a
roughly fixed iop rate. This is true for ssd as well, it's just a much
higher rate. So the threads can send down 4K IOs and recover clean pages at
exactly the same rate as they would sending down 64KB IOs.
I know that for lumpy purposes it might not be the best 64KB, but the
other side of it is that we have to write those pages eventually anyway.
We might as well write them when it is more or less free.
The per-bdi writeback threads are a pretty good base for changing the
ordering for writeback, it seems like a good place to integrate requests
from the VM about which files (and which offsets in those files) to
write back first.
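(A hypothetical shape for such a request, purely for illustration - no
interface like this exists today:)

/*
 * Invented for illustration: direct reclaim asks the per-bdi flusher
 * thread to clean a range of a file instead of calling ->writepage itself.
 */
void bdi_writeback_hint(struct backing_dev_info *bdi, struct inode *inode,
                        pgoff_t index, unsigned long nr_pages);

/* e.g. from reclaim, instead of pageout(page, mapping, ...): */
bdi_writeback_hint(mapping->backing_dev_info, mapping->host,
                   page->index, SWAP_CLUSTER_MAX);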
-chris
If you ask it to clean a bunch of pages around the one you want to
reclaim on the LRU, there is a good chance it will also be cleaning
pages that are near the end of the LRU or physically close by as
well. It's not a guarantee, but for the additional IO cost of about
10% wall time on that IO to clean the page you need, you also get
1-2 orders of magnitude other pages cleaned. That sounds like a
win any way you look at it...
I agree that it doesn't solve the stack problem (Chris' suggestion
that we enable the bdi flusher interface would fix this); what I'm
pointing out is that the arguments that it is too hard or there are
no interfaces available to issue larger IO from reclaim are not at
all valid.
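(For example, something along these lines can be issued from reclaim with
the existing interfaces - a sketch only, and the cluster size is arbitrary:)

        /* write a neighbourhood of the target page via ->writepages */
        struct address_space *mapping = page->mapping;
        struct writeback_control wbc = {
                .sync_mode      = WB_SYNC_NONE,
                .nr_to_write    = 1024,
                .range_start    = (loff_t)page->index << PAGE_CACHE_SHIFT,
                .range_end      = ((loff_t)(page->index + 1024)
                                        << PAGE_CACHE_SHIFT) - 1,
                .for_reclaim    = 1,
        };

        do_writepages(mapping, &wbc);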
> > the deepest call chain in queue_work() needs 700 bytes of stack
> > to complete, wait_for_completion() requires almost 2k of stack space
> > at it's deepest, the scheduler has some heavy stack users, etc,
> > and these are all functions that appear at the top of the stack.
> >
>
> The real issue here then is that stack usage has gone out of control.
That's definitely true, but it shouldn't cloud the fact that most
people want to kill writeback from direct reclaim, too, so killing two
birds with one stone seems like a good idea.
How about this? For now, we stop direct reclaim from doing writeback
only on order zero allocations, but allow it for higher order
allocations. That will prevent the majority of situations where
direct reclaim blows the stack and interferes with background
writeout, but won't cause lumpy reclaim to change behaviour.
This reduces the scope of impact and hence the testing and validation
that needs to be done.
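(In scan_control terms the compromise might look roughly like this for the
direct reclaim path, leaving kswapd's setup untouched - a sketch, not a
tested patch:)

        struct scan_control sc = {
                .order          = order,
                /* only high-order (lumpy) direct reclaim may call ->writepage */
                .may_writepage  = (order > 0),
                /* other fields as before */
        };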
Then we can work towards allowing lumpy reclaim to use background
threads as Chris suggested for doing specific writeback operations
to solve the remaining problems being seen. Does this seem like a
reasonable compromise and approach to dealing with the problem?
> Disabling ->writepage in direct reclaim does not guarantee that stack
> usage will not be a problem again. From your traces, page reclaim itself
> seems to be a big dirty hog.
I couldn't agree more - the kernel still needs to be put on a stack
usage diet, but the above would give us some breathing space to attack the
problem before more people start to hit these problems.
> > Good start, but 512 bytes will only catch select and splice read,
> > and there are 300-400 byte functions in the above list that sit near
> > the top of the stack....
> >
>
> They will need to be tackled in turn then but obviously there should be
> a focus on the common paths. The reclaim paths do seem particularly
> heavy and it's down to a lot of temporary variables. I might not get the
> time today but what I'm going to try do some time this week is
>
> o Look at what temporary variables are copies of other pieces of information
> o See what variables live for the duration of reclaim but are not needed
> for all of it (i.e. uninline parts of it so variables do not persist)
> o See if it's possible to dynamically allocate scan_control
Welcome to my world ;)
> The last one is the trickiest. Basically, the idea would be to move as much
> into scan_control as possible. Then, instead of allocating it on the stack,
> allocate a fixed number of them at boot-time (NR_CPU probably) protected by
> a semaphore. Limit the number of direct reclaimers that can be active at a
> time to the number of scan_control variables. kswapd could still allocate
> its on the stack or with kmalloc.
>
> If it works out, it would have two main benefits. Limits the number of
> processes in direct reclaim - if there is NR_CPU-worth of proceses in direct
> reclaim, there is too much going on. It would also shrink the stack usage
> particularly if some of the stack variables are moved into scan_control.
>
> Maybe someone will beat me to looking at the feasibility of this.
I like the idea - it really sounds like you want a fixed size,
preallocated mempool that can't be enlarged. In fact, I can probably
use something like this in XFS to save a couple of hundred bytes of
stack space in the worst hogs....
> > > > This is the sort of thing I'm pointing at when I say that stack
> > > > usage outside XFS has grown significantly significantly over the
> > > > past couple of years. Given XFS has remained pretty much the same or
> > > > even reduced slightly over the same time period, blaming XFS or
> > > > saying "callers should use GFP_NOFS" seems like a cop-out to me.
> > > > Regardless of the IO pattern performance issues, writeback via
> > > > direct reclaim just uses too much stack to be safe these days...
> > >
> > > Yeah, My answer is simple, All stack eater should be fixed.
> > > but XFS seems not innocence too. 3.5K is enough big although
> > > xfs have use such amount since very ago.
> >
> > XFS used to use much more than that - significant effort has been
> > put into reduce the stack footprint over many years. There's not
> > much left to trim without rewriting half the filesystem...
>
> I don't think he is levelling a complain at XFS in particular - just pointing
> out that it's heavy too. Still, we should be gratful that XFS is sort of
> a "Stack Canary". If it dies, everyone else could be in trouble soon :)
Yeah, true. Sorry if I'm being a bit too defensive here - the scars
from previous discussions like this are showing through....
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
And to be brutally honest, I'm not interested in wasting my time
trying to come up with a test case that you are interested in.
Instead, can you please provide me with your test cases
(scripts, preferably) that you use to measure the effectiveness of
reclaim changes and I'll run them.
> The above test-case does 1) discard all cache 2) fill pages by streaming
> io. then, it makes artificial "file offset neighbor == block neighbor == PFN neighbor"
> situation. then, file offset order writeout by flusher thread can make
> PFN contenious pages effectively.
Yes, that's true, but it does indicate that in that situation, it is
more effective than the current code. FWIW, in the case of HPC
applications (which often use huge pages and clear the cache before
starting a new job), large streaming IO is a pretty common IO
pattern, so I don't think this situation is as artificial as you are
indicating.
> Why I dont interest it? because lumpy reclaim is a technique for
> avoiding external fragmentation mess. IOW, it is for avoiding
> worst case. but your test case seems to mesure best one.
Then please provide test cases that you consider valid.
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
I already have some patches to remove trivial parts of struct scan_control,
namely may_unmap, may_swap, all_unreclaimable and isolate_pages. The rest
needs a deeper look.
A rather big offender in there is the combination of shrink_active_list (360
bytes here) and shrink_page_list (200 bytes). I am currently looking at
breaking out all the accounting stuff from shrink_active_list into a separate
leaf function so that the stack footprint does not add up.
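(Roughly the kind of split meant here - an illustrative helper, not an
actual patch:)

/*
 * Illustration only: pull the bulky per-LRU counting out of the main scan
 * loop into a noinline leaf, so its counters are live only while it runs.
 */
static noinline void count_isolated_pages(struct list_head *pages,
                                          unsigned long *nr_anon,
                                          unsigned long *nr_file)
{
        unsigned int count[NR_LRU_LISTS] = { 0, };
        struct page *page;

        list_for_each_entry(page, pages, lru)
                count[page_lru(page)]++;

        *nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
        *nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
}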
Your idea of per-cpu allocated scan controls reminds me of an idea I have
had for some time now: moving reclaim into its own threads (per cpu?).
Not only would it separate the allocator's stack from the writeback stack,
we could also get rid of that too_many_isolated() workaround and coordinate
reclaim work better to prevent overreclaim.
But that is not a quick fix either...
> On Wed, Apr 14, 2010 at 09:51:33AM +0100, Mel Gorman wrote:
> > They will need to be tackled in turn then but obviously there should be
> > a focus on the common paths. The reclaim paths do seem particularly
> > heavy and it's down to a lot of temporary variables. I might not get the
> > time today but what I'm going to try do some time this week is
> >
> > o Look at what temporary variables are copies of other pieces of information
> > o See what variables live for the duration of reclaim but are not needed
> > for all of it (i.e. uninline parts of it so variables do not persist)
> > o See if it's possible to dynamically allocate scan_control
> >
> > The last one is the trickiest. Basically, the idea would be to move as much
> > into scan_control as possible. Then, instead of allocating it on the stack,
> > allocate a fixed number of them at boot-time (NR_CPU probably) protected by
> > a semaphore. Limit the number of direct reclaimers that can be active at a
> > time to the number of scan_control variables. kswapd could still allocate
> > its on the stack or with kmalloc.
> >
> > If it works out, it would have two main benefits. Limits the number of
> > processes in direct reclaim - if there is NR_CPU-worth of proceses in direct
> > reclaim, there is too much going on. It would also shrink the stack usage
> > particularly if some of the stack variables are moved into scan_control.
> >
> > Maybe someone will beat me to looking at the feasibility of this.
>
> I already have some patches to remove trivial parts of struct scan_control,
> namely may_unmap, may_swap, all_unreclaimable and isolate_pages. The rest
> needs a deeper look.
Seems interesting, but the scan_control diet is not so effective. How many
bytes can we save with it?
> A rather big offender in there is the combination of shrink_active_list (360
> bytes here) and shrink_page_list (200 bytes). I am currently looking at
> breaking out all the accounting stuff from shrink_active_list into a separate
> leaf function so that the stack footprint does not add up.
pagevec: it consumes 128 bytes per struct. I have a patch removing it.
> Your idea of per-cpu allocated scan controls reminds me of an idea I have
> had for some time now: moving reclaim into its own threads (per cpu?).
>
> Not only would it separate the allocator's stack from the writeback stack,
> we could also get rid of that too_many_isolated() workaround and coordinate
> reclaim work better to prevent overreclaim.
>
> But that is not a quick fix either...
So, I hadn't thought of it this way. It probably seems good, but I'd like to
do the simple diet first.
> How about this? For now, we stop direct reclaim from doing writeback
> only on order zero allocations, but allow it for higher order
> allocations. That will prevent the majority of situations where
> direct reclaim blows the stack and interferes with background
> writeout, but won't cause lumpy reclaim to change behaviour.
> This reduces the scope of impact and hence testing and validation
> the needs to be done.
Tend to agree, but I would propose a slightly different algorithm to
avoid an incorrect OOM.
for high order allocation
  - allow lumpy reclaim and pageout() for both kswapd and direct reclaim
for low order allocation
  - kswapd: always delegate IO to the flusher thread
  - direct reclaim: delegate IO to the flusher thread only if VM pressure is low
This seems safer. I mean, who wants to see an incorrect OOM regression?
I've made some patches for this. I'll post them in another mail.
> Then we can work towards allowing lumpy reclaim to use background
> threads as Chris suggested for doing specific writeback operations
> to solve the remaining problems being seen. Does this seem like a
> reasonable compromise and approach to dealing with the problem?
Tend to agree. We are probably now discussing the right approach, but
this definitely needs deep thinking, so I can't give an exact
answer yet.
Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---
mm/vmscan.c | 7 +++++++
1 files changed, 7 insertions(+), 0 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8b78b49..eab6028 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -623,6 +623,13 @@ static enum page_references page_check_references(struct page *page,
if (current_is_kswapd())
return PAGEREF_RECLAIM_CLEAN;
+ /*
+ * Now VM pressure is not so high. then we can delegate
+ * page cleaning to flusher thread safely.
+ */
+ if (!sc->order && sc->priority > DEF_PRIORITY/2)
+ return PAGEREF_RECLAIM_CLEAN;
+
return PAGEREF_RECLAIM;
}
--
1.6.5.2
=============================================
Since 2.6.28 zone->prev_priority has been unused, so it can be removed
safely. It reduces stack usage slightly.
Now I have to say that I'm sorry. Two years ago, I thought prev_priority
could be integrated again and would be useful, but four (or more) attempts
haven't produced good performance numbers, thus I give up on that approach.
Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---
include/linux/mmzone.h | 15 -------------
mm/page_alloc.c | 2 -
mm/vmscan.c | 54 ++---------------------------------------------
mm/vmstat.c | 2 -
4 files changed, 3 insertions(+), 70 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index cf9e458..ad76962 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -339,21 +339,6 @@ struct zone {
atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
/*
- * prev_priority holds the scanning priority for this zone. It is
- * defined as the scanning priority at which we achieved our reclaim
- * target at the previous try_to_free_pages() or balance_pgdat()
- * invocation.
- *
- * We use prev_priority as a measure of how much stress page reclaim is
- * under - it drives the swappiness decision: whether to unmap mapped
- * pages.
- *
- * Access to both this field is quite racy even on uniprocessor. But
- * it is expected to average out OK.
- */
- int prev_priority;
-
- /*
* The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
* this zone's LRU. Maintained by the pageout code.
*/
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d03c946..88513c0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3862,8 +3862,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
zone_seqlock_init(zone);
zone->zone_pgdat = pgdat;
- zone->prev_priority = DEF_PRIORITY;
-
zone_pcp_init(zone);
for_each_lru(l) {
INIT_LIST_HEAD(&zone->lru[l].list);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d392a50..dadb461 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1284,20 +1284,6 @@ done:
}
/*
- * We are about to scan this zone at a certain priority level. If that priority
- * level is smaller (ie: more urgent) than the previous priority, then note
- * that priority level within the zone. This is done so that when the next
- * process comes in to scan this zone, it will immediately start out at this
- * priority level rather than having to build up its own scanning priority.
- * Here, this priority affects only the reclaim-mapped threshold.
- */
-static inline void note_zone_scanning_priority(struct zone *zone, int priority)
-{
- if (priority < zone->prev_priority)
- zone->prev_priority = priority;
-}
-
-/*
* This moves pages from the active list to the inactive list.
*
* We move them the other way if the page is referenced by one or more
@@ -1733,20 +1719,15 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
if (scanning_global_lru(sc)) {
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;
- note_zone_scanning_priority(zone, priority);
-
if (zone->all_unreclaimable && priority != DEF_PRIORITY)
continue; /* Let kswapd poll it */
sc->all_unreclaimable = 0;
- } else {
+ } else
/*
* Ignore cpuset limitation here. We just want to reduce
* # of used pages by us regardless of memory shortage.
*/
sc->all_unreclaimable = 0;
- mem_cgroup_note_reclaim_priority(sc->mem_cgroup,
- priority);
- }
shrink_zone(priority, zone, sc);
}
@@ -1852,17 +1833,11 @@ out:
if (priority < 0)
priority = 0;
- if (scanning_global_lru(sc)) {
- for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
-
+ if (scanning_global_lru(sc))
+ for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;
- zone->prev_priority = priority;
- }
- } else
- mem_cgroup_record_reclaim_priority(sc->mem_cgroup, priority);
-
delayacct_freepages_end();
return ret;
@@ -2015,22 +1990,12 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
.mem_cgroup = NULL,
.isolate_pages = isolate_pages_global,
};
- /*
- * temp_priority is used to remember the scanning priority at which
- * this zone was successfully refilled to
- * free_pages == high_wmark_pages(zone).
- */
- int temp_priority[MAX_NR_ZONES];
-
loop_again:
total_scanned = 0;
sc.nr_reclaimed = 0;
sc.may_writepage = !laptop_mode;
count_vm_event(PAGEOUTRUN);
- for (i = 0; i < pgdat->nr_zones; i++)
- temp_priority[i] = DEF_PRIORITY;
-
for (priority = DEF_PRIORITY; priority >= 0; priority--) {
int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
unsigned long lru_pages = 0;
@@ -2098,9 +2063,7 @@ loop_again:
if (zone->all_unreclaimable && priority != DEF_PRIORITY)
continue;
- temp_priority[i] = priority;
sc.nr_scanned = 0;
- note_zone_scanning_priority(zone, priority);
nid = pgdat->node_id;
zid = zone_idx(zone);
@@ -2173,16 +2136,6 @@ loop_again:
break;
}
out:
- /*
- * Note within each zone the priority level at which this zone was
- * brought into a happy state. So that the next thread which scans this
- * zone will start out at that priority level.
- */
- for (i = 0; i < pgdat->nr_zones; i++) {
- struct zone *zone = pgdat->node_zones + i;
-
- zone->prev_priority = temp_priority[i];
- }
if (!all_zones_ok) {
cond_resched();
@@ -2600,7 +2553,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
*/
priority = ZONE_RECLAIM_PRIORITY;
do {
- note_zone_scanning_priority(zone, priority);
shrink_zone(priority, zone, &sc);
priority--;
} while (priority >= 0 && sc.nr_reclaimed < nr_pages);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index fa12ea3..2db0a0f 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -761,11 +761,9 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
}
seq_printf(m,
"\n all_unreclaimable: %u"
- "\n prev_priority: %i"
"\n start_pfn: %lu"
"\n inactive_ratio: %u",
zone->all_unreclaimable,
- zone->prev_priority,
zone->zone_start_pfn,
zone->inactive_ratio);
seq_putc(m, '\n');
--
1.6.5.2
At least kswapd can avoid such pageout() because kswapd doesn't
need to consider the OOM-killer situation; there's no risk there.
Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---
mm/vmscan.c | 7 +++++++
1 files changed, 7 insertions(+), 0 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3ff3311..d392a50 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -614,6 +614,13 @@ static enum page_references page_check_references(struct page *page,
if (referenced_page)
return PAGEREF_RECLAIM_CLEAN;
+ /*
+ * Delegate pageout IO to flusher thread. They can make more
+ * effective IO pattern.
+ */
+ if (current_is_kswapd())
+ return PAGEREF_RECLAIM_CLEAN;
+
return PAGEREF_RECLAIM;
}
--
1.6.5.2
--
This patch is not directly related to the patch series,
but [4/4] depends on scan_control having a `priority' member,
so I'm including it here.
=========================================
Now a lot of functions in vmscan have a `priority' argument. It consumes
stack slightly. Moving it into struct scan_control reduces stack usage.
Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---
mm/vmscan.c | 83 ++++++++++++++++++++++++++--------------------------------
1 files changed, 37 insertions(+), 46 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index dadb461..8b78b49 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -77,6 +77,8 @@ struct scan_control {
int order;
+ int priority;
+
/* Which cgroup do we reclaim from */
struct mem_cgroup *mem_cgroup;
@@ -1130,7 +1132,7 @@ static int too_many_isolated(struct zone *zone, int file,
*/
static unsigned long shrink_inactive_list(unsigned long max_scan,
struct zone *zone, struct scan_control *sc,
- int priority, int file)
+ int file)
{
LIST_HEAD(page_list);
struct pagevec pvec;
@@ -1156,7 +1158,7 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
*/
if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
lumpy_reclaim = 1;
- else if (sc->order && priority < DEF_PRIORITY - 2)
+ else if (sc->order && sc->priority < DEF_PRIORITY - 2)
lumpy_reclaim = 1;
pagevec_init(&pvec, 1);
@@ -1335,7 +1337,7 @@ static void move_active_pages_to_lru(struct zone *zone,
}
static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
- struct scan_control *sc, int priority, int file)
+ struct scan_control *sc, int file)
{
unsigned long nr_taken;
unsigned long pgscanned;
@@ -1498,17 +1500,17 @@ static int inactive_list_is_low(struct zone *zone, struct scan_control *sc,
}
static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
- struct zone *zone, struct scan_control *sc, int priority)
+ struct zone *zone, struct scan_control *sc)
{
int file = is_file_lru(lru);
if (is_active_lru(lru)) {
if (inactive_list_is_low(zone, sc, file))
- shrink_active_list(nr_to_scan, zone, sc, priority, file);
+ shrink_active_list(nr_to_scan, zone, sc, file);
return 0;
}
- return shrink_inactive_list(nr_to_scan, zone, sc, priority, file);
+ return shrink_inactive_list(nr_to_scan, zone, sc, file);
}
/*
@@ -1615,8 +1617,7 @@ static unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
/*
* This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
*/
-static void shrink_zone(int priority, struct zone *zone,
- struct scan_control *sc)
+static void shrink_zone(struct zone *zone, struct scan_control *sc)
{
unsigned long nr[NR_LRU_LISTS];
unsigned long nr_to_scan;
@@ -1640,8 +1641,8 @@ static void shrink_zone(int priority, struct zone *zone,
unsigned long scan;
scan = zone_nr_lru_pages(zone, sc, l);
- if (priority || noswap) {
- scan >>= priority;
+ if (sc->priority || noswap) {
+ scan >>= sc->priority;
scan = (scan * percent[file]) / 100;
}
nr[l] = nr_scan_try_batch(scan,
@@ -1657,7 +1658,7 @@ static void shrink_zone(int priority, struct zone *zone,
nr[l] -= nr_to_scan;
nr_reclaimed += shrink_list(l, nr_to_scan,
- zone, sc, priority);
+ zone, sc);
}
}
/*
@@ -1668,7 +1669,8 @@ static void shrink_zone(int priority, struct zone *zone,
* with multiple processes reclaiming pages, the total
* freeing target can get unreasonably large.
*/
- if (nr_reclaimed >= nr_to_reclaim && priority < DEF_PRIORITY)
+ if (nr_reclaimed >= nr_to_reclaim &&
+ sc->priority < DEF_PRIORITY)
break;
}
@@ -1679,7 +1681,7 @@ static void shrink_zone(int priority, struct zone *zone,
* rebalance the anon lru active/inactive ratio.
*/
if (inactive_anon_is_low(zone, sc) && nr_swap_pages > 0)
- shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
+ shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, 0);
throttle_vm_writeout(sc->gfp_mask);
}
@@ -1700,8 +1702,7 @@ static void shrink_zone(int priority, struct zone *zone,
* If a zone is deemed to be full of pinned pages then just give it a light
* scan then give up on it.
*/
-static void shrink_zones(int priority, struct zonelist *zonelist,
- struct scan_control *sc)
+static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
{
enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
struct zoneref *z;
@@ -1719,7 +1720,8 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
if (scanning_global_lru(sc)) {
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;
- if (zone->all_unreclaimable && priority != DEF_PRIORITY)
+ if (zone->all_unreclaimable &&
+ sc->priority != DEF_PRIORITY)
continue; /* Let kswapd poll it */
sc->all_unreclaimable = 0;
} else
@@ -1729,7 +1731,7 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
*/
sc->all_unreclaimable = 0;
- shrink_zone(priority, zone, sc);
+ shrink_zone(zone, sc);
}
}
@@ -1752,7 +1754,6 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
struct scan_control *sc)
{
- int priority;
unsigned long ret = 0;
unsigned long total_scanned = 0;
struct reclaim_state *reclaim_state = current->reclaim_state;
@@ -1779,11 +1780,11 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
}
}
- for (priority = DEF_PRIORITY; priority >= 0; priority--) {
+ for (sc->priority = DEF_PRIORITY; sc->priority >= 0; sc->priority--) {
sc->nr_scanned = 0;
- if (!priority)
+ if (!sc->priority)
disable_swap_token();
- shrink_zones(priority, zonelist, sc);
+ shrink_zones(zonelist, sc);
/*
* Don't shrink slabs when reclaiming memory from
* over limit cgroups
@@ -1816,23 +1817,14 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
/* Take a nap, wait for some writeback to complete */
if (!sc->hibernation_mode && sc->nr_scanned &&
- priority < DEF_PRIORITY - 2)
+ sc->priority < DEF_PRIORITY - 2)
congestion_wait(BLK_RW_ASYNC, HZ/10);
}
/* top priority shrink_zones still had more to do? don't OOM, then */
if (!sc->all_unreclaimable && scanning_global_lru(sc))
ret = sc->nr_reclaimed;
-out:
- /*
- * Now that we've scanned all the zones at this priority level, note
- * that level within the zone so that the next thread which performs
- * scanning of this zone will immediately start out at this priority
- * level. This affects only the decision whether or not to bring
- * mapped pages onto the inactive list.
- */
- if (priority < 0)
- priority = 0;
+out:
if (scanning_global_lru(sc))
for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
@@ -1892,7 +1884,8 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
* will pick up pages from other mem cgroup's as well. We hack
* the priority and make it zero.
*/
- shrink_zone(0, zone, &sc);
+ sc.priority = 0;
+ shrink_zone(zone, &sc);
return sc.nr_reclaimed;
}
@@ -1972,7 +1965,6 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
{
int all_zones_ok;
- int priority;
int i;
unsigned long total_scanned;
struct reclaim_state *reclaim_state = current->reclaim_state;
@@ -1996,13 +1988,13 @@ loop_again:
sc.may_writepage = !laptop_mode;
count_vm_event(PAGEOUTRUN);
- for (priority = DEF_PRIORITY; priority >= 0; priority--) {
+ for (sc.priority = DEF_PRIORITY; sc.priority >= 0; sc.priority--) {
int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
unsigned long lru_pages = 0;
int has_under_min_watermark_zone = 0;
/* The swap token gets in the way of swapout... */
- if (!priority)
+ if (!sc.priority)
disable_swap_token();
all_zones_ok = 1;
@@ -2017,7 +2009,7 @@ loop_again:
if (!populated_zone(zone))
continue;
- if (zone->all_unreclaimable && priority != DEF_PRIORITY)
+ if (zone->all_unreclaimable && sc.priority != DEF_PRIORITY)
continue;
/*
@@ -2026,7 +2018,7 @@ loop_again:
*/
if (inactive_anon_is_low(zone, &sc))
shrink_active_list(SWAP_CLUSTER_MAX, zone,
- &sc, priority, 0);
+ &sc, 0);
if (!zone_watermark_ok(zone, order,
high_wmark_pages(zone), 0, 0)) {
@@ -2060,7 +2052,7 @@ loop_again:
if (!populated_zone(zone))
continue;
- if (zone->all_unreclaimable && priority != DEF_PRIORITY)
+ if (zone->all_unreclaimable && sc.priority != DEF_PRIORITY)
continue;
sc.nr_scanned = 0;
@@ -2079,7 +2071,7 @@ loop_again:
*/
if (!zone_watermark_ok(zone, order,
8*high_wmark_pages(zone), end_zone, 0))
- shrink_zone(priority, zone, &sc);
+ shrink_zone(zone, &sc);
reclaim_state->reclaimed_slab = 0;
nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
lru_pages);
@@ -2119,7 +2111,7 @@ loop_again:
* OK, kswapd is getting into trouble. Take a nap, then take
* another pass across the zones.
*/
- if (total_scanned && (priority < DEF_PRIORITY - 2)) {
+ if (total_scanned && (sc.priority < DEF_PRIORITY - 2)) {
if (has_under_min_watermark_zone)
count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT);
else
@@ -2520,7 +2512,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
const unsigned long nr_pages = 1 << order;
struct task_struct *p = current;
struct reclaim_state reclaim_state;
- int priority;
struct scan_control sc = {
.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
@@ -2551,11 +2542,11 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
* Free memory by calling shrink zone with increasing
* priorities until we have enough memory freed.
*/
- priority = ZONE_RECLAIM_PRIORITY;
+ sc.priority = ZONE_RECLAIM_PRIORITY;
do {
- shrink_zone(priority, zone, &sc);
- priority--;
- } while (priority >= 0 && sc.nr_reclaimed < nr_pages);
+ shrink_zone(zone, &sc);
+ sc.priority--;
+ } while (sc.priority >= 0 && sc.nr_reclaimed < nr_pages);
}
slab_reclaimable = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
--
1.6.5.2
Now, kernel compile and/or backup operations seem to keep nr_vmscan_write==0.
Dave, can you please try running your pageout-annoying workload?
So, the same as currently.
> for low order allocation
> - kswapd: always delegate io to flusher thread
> - direct reclaim: delegate io to flusher thread only if vm pressure is low
IMO, this really doesn't fix either of the problems - the bad IO
patterns or the stack usage. All it will take is a bit more memory
pressure to trigger stack and IO problems, and the user reporting the
problems is generating an awful lot of memory pressure...
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
It's just as easy for you to run and observe the effects. Start with a VM
with 1GB RAM and a 10GB scratch block device:
# mkfs.xfs -f /dev/<blah>
# mount -o logbsize=262144,nobarrier /dev/<blah> /mnt/scratch
in one shell:
# while [ 1 ]; do dd if=/dev/zero of=/mnt/scratch/foo bs=1024k ; done
in another shell, if you have fs_mark installed, run:
# ./fs_mark -S0 -n 100000 -F -s 0 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/3 -d /mnt/scratch/2 &
otherwise run a couple of these in parallel on different directories:
# for i in `seq 1 1 100000`; do echo > /mnt/scratch/0/foo.$i ; done
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
Yes, the same as you proposed.
>
> > for low order allocation
> > - kswapd: always delegate io to flusher thread
> > - direct reclaim: delegate io to flusher thread only if vm pressure is low
>
> IMO, this really doesn't fix either of the problems - the bad IO
> patterns nor the stack usage. All it will take is a bit more memory
> pressure to trigger stack and IO problems, and the user reporting the
> problems is generating an awful lot of memory pressure...
This patch doesn't address stack usage, because
- again, I think all stack eaters should be put on a diet.
- in a world that allows lumpy reclaim, only denying low-order reclaim
doesn't solve anything.
Please don't forget that a priority=0 reclaim failure invokes the OOM-killer.
I don't imagine anyone wants that.
And which IO workload triggers <6 priority vmscan?
Thanks.
Unfortunately, I don't have unused disks, so I'll (probably) try it
next week.
A filesystem on a loopback device will work just as well ;)
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
> Now, vmscan pageout() is one of IO throuput degression source.
> Some IO workload makes very much order-0 allocation and reclaim
> and pageout's 4K IOs are making annoying lots seeks.
>
> At least, kswapd can avoid such pageout() because kswapd don't
> need to consider OOM-Killer situation. that's no risk.
>
> Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
What's your opinion on trying to cluster the writes done by pageout,
instead of not doing any paging out in kswapd?
Something along these lines:
Cluster writes to disk due to memory pressure.
Write out logically adjacent pages to the one we're paging out
so that we may get better IOs in these situations:
These pages are likely to be contiguous on disk to the one we're
writing out, so they should get merged into a single disk IO.
Signed-off-by: Suleiman Souhlal <sule...@google.com>
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c26986c..4e5a613 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -48,6 +48,8 @@
#include "internal.h"
+#define PAGEOUT_CLUSTER_PAGES 16
+
struct scan_control {
/* Incremented by the number of inactive pages that were scanned */
unsigned long nr_scanned;
@@ -350,6 +352,8 @@ typedef enum {
static pageout_t pageout(struct page *page, struct address_space
*mapping,
enum pageout_io sync_writeback)
{
+ int i;
+
/*
* If the page is dirty, only perform writeback if that write
* will be non-blocking. To prevent this allocation from being
@@ -408,6 +412,37 @@ static pageout_t pageout(struct page *page,
struct address_space *mapping,
}
/*
+ * Try to write out logically adjacent dirty pages too, if
+ * possible, to get better IOs, as the IO scheduler should
+ * merge them with the original one, if the file is not too
+ * fragmented.
+ */
+ for (i = 1; i < PAGEOUT_CLUSTER_PAGES; i++) {
+ struct page *p2;
+ int err;
+
+ p2 = find_get_page(mapping, page->index + i);
+ if (p2) {
+ if (trylock_page(p2) == 0) {
+ page_cache_release(p2);
+ break;
+ }
+ if (page_mapped(p2))
+ try_to_unmap(p2, 0);
+ if (PageDirty(p2)) {
+ err = write_one_page(p2, 0);
+ page_cache_release(p2);
+ if (err)
+ break;
+ } else {
+ unlock_page(p2);
+ page_cache_release(p2);
+ break;
+ }
+ }
+ }
+
+ /*
* Wait on writeback if requested to. This happens when
* direct reclaiming a large contiguous area and the
* first attempt to free a range of pages fails.
I've found one bug in this patch myself: the flusher thread doesn't
page out anon pages, so we need a PageAnon() check ;)
Interesting.
So, I'd like to review your patch carefully. can you please give me one
day? :)
>
> Cluster writes to disk due to memory pressure.
>
> Write out logically adjacent pages to the one we're paging out
> so that we may get better IOs in these situations:
> These pages are likely to be contiguous on disk to the one we're
> writing out, so they should get merged into a single disk IO.
>
> Signed-off-by: Suleiman Souhlal <sule...@google.com>
> >
> > On Apr 14, 2010, at 9:11 PM, KOSAKI Motohiro wrote:
> >
> > > Now, vmscan pageout() is one of IO throuput degression source.
> > > Some IO workload makes very much order-0 allocation and reclaim
> > > and pageout's 4K IOs are making annoying lots seeks.
> > >
> > > At least, kswapd can avoid such pageout() because kswapd don't
> > > need to consider OOM-Killer situation. that's no risk.
> > >
> > > Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
> >
> > What's your opinion on trying to cluster the writes done by pageout,
> > instead of not doing any paging out in kswapd?
> > Something along these lines:
>
> Interesting.
> So, I'd like to review your patch carefully. can you please give me one
> day? :)
Hannes, if I remember correctly, you tried similar swap-cluster IO
a long time ago. Now I can't remember why we didn't merge such a patch.
Do you remember anything?
Agreed (again), but we've already come to the conclusion that a
stack diet is not enough.
> - under allowing lumpy reclaim world, only deny low order reclaim
> doesn't solve anything.
Yes, I suggested it *as a first step*, not as the end goal. Your
patches don't reach the first step which is fixing the reported
stack problem for order-0 allocations...
> Please don't forget priority=0 recliam failure incvoke OOM-killer.
> I don't imagine anyone want it.
Given that I haven't been able to trigger OOM without writeback from
direct reclaim so far (*) I'm not finding any evidence that it is a
problem or that there are regressions. I want to be able to say
that this change has no known regressions. I want to find the
regression and work to fix them, but without test cases there's no
way I can do this.
This is what I'm getting frustrated about - I want to fix this
problem once and for all, but I can't find out what I need to do to
robustly test such a change so we can have a high degree of
confidence that it doesn't introduce major regressions. Can anyone
help here?
(*) except in one case I've already described where it managed to
allocate enough huge pages to starve the system of order zero pages,
which is what I asked it to do.
> And, Which IO workload trigger <6 priority vmscan?
You're asking me? I've been asking you for workloads that wind up
reclaim priority.... :/
All I can say is that the most common trigger I see for OOM is
copying a large file on a busy system that is running off a single
spindle. When that happens on my laptop I walk away and get a cup
of coffee, and when I come back I pick up all the
broken bits the OOM killer left behind.....
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
XFS already does this in ->writepage to try to minimise the impact
of the way pageout issues IO. It helps, but it is still not as good
as having all the writeback come from the flusher threads because
it's still pretty much random IO.
And, FWIW, it doesn't solve the stack usage problems, either. In
fact, it will make them worse as write_one_page() puts another
struct writeback_control on the stack...
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
I haven't reviewed such a patch yet, so I'm talking about the generic case.
pageout() doesn't only write out file-backed pages, it also writes
swap-backed pages, so neither filesystem optimization nor the flusher thread
removes the worth of pageout clustering.
> And, FWIW, it doesn't solve the stack usage problems, either. In
> fact, it will make them worse as write_one_page() puts another
> struct writeback_control on the stack...
Correct, we need to avoid having two writeback_controls on the stack.
Probably we need to split pageout() into pieces.
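(For example, a split like this would keep only one writeback_control live
at a time - a sketch, not the real pageout() code:)

/* keep the writeback_control local to a small helper so the caller's frame
 * does not also hold one while it clusters adjacent pages */
static int pageout_write_one(struct address_space *mapping, struct page *page)
{
        struct writeback_control wbc = {
                .sync_mode      = WB_SYNC_NONE,
                .nr_to_write    = SWAP_CLUSTER_MAX,
                .range_start    = 0,
                .range_end      = LLONG_MAX,
                .for_reclaim    = 1,
        };

        SetPageReclaim(page);
        return mapping->a_ops->writepage(page, &wbc);
}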
detail
- remove "while (nr_scanned < max_scan)" loop
- remove nr_freed (now, we use nr_reclaimed directly)
- remove nr_scan (now, we use nr_scanned directly)
- rename max_scan to nr_to_scan
- pass nr_to_scan into isolate_pages() directly instead of
using SWAP_CLUSTER_MAX
Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---
mm/vmscan.c | 190 ++++++++++++++++++++++++++++-------------------------------
1 files changed, 89 insertions(+), 101 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eab6028..4de4029 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1137,16 +1137,22 @@ static int too_many_isolated(struct zone *zone, int file,
* shrink_inactive_list() is a helper for shrink_zone(). It returns the number
* of reclaimed pages
*/
-static unsigned long shrink_inactive_list(unsigned long max_scan,
+static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
struct zone *zone, struct scan_control *sc,
int file)
{
LIST_HEAD(page_list);
struct pagevec pvec;
- unsigned long nr_scanned = 0;
+ unsigned long nr_scanned;
unsigned long nr_reclaimed = 0;
struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
int lumpy_reclaim = 0;
+ struct page *page;
+ unsigned long nr_taken;
+ unsigned long nr_active;
+ unsigned int count[NR_LRU_LISTS] = { 0, };
+ unsigned long nr_anon;
+ unsigned long nr_file;
while (unlikely(too_many_isolated(zone, file, sc))) {
congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -1172,119 +1178,101 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
lru_add_drain();
spin_lock_irq(&zone->lru_lock);
- do {
- struct page *page;
- unsigned long nr_taken;
- unsigned long nr_scan;
- unsigned long nr_freed;
- unsigned long nr_active;
- unsigned int count[NR_LRU_LISTS] = { 0, };
- int mode = lumpy_reclaim ? ISOLATE_BOTH : ISOLATE_INACTIVE;
- unsigned long nr_anon;
- unsigned long nr_file;
-
- nr_taken = sc->isolate_pages(SWAP_CLUSTER_MAX,
- &page_list, &nr_scan, sc->order, mode,
- zone, sc->mem_cgroup, 0, file);
+ nr_taken = sc->isolate_pages(nr_to_scan,
+ &page_list, &nr_scanned, sc->order,
+ lumpy_reclaim ? ISOLATE_BOTH : ISOLATE_INACTIVE,
+ zone, sc->mem_cgroup, 0, file);
- if (scanning_global_lru(sc)) {
- zone->pages_scanned += nr_scan;
- if (current_is_kswapd())
- __count_zone_vm_events(PGSCAN_KSWAPD, zone,
- nr_scan);
- else
- __count_zone_vm_events(PGSCAN_DIRECT, zone,
- nr_scan);
- }
+ if (scanning_global_lru(sc)) {
+ zone->pages_scanned += nr_scanned;
+ if (current_is_kswapd())
+ __count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scanned);
+ else
+ __count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scanned);
+ }
- if (nr_taken == 0)
- goto done;
+ if (nr_taken == 0)
+ goto done;
- nr_active = clear_active_flags(&page_list, count);
- __count_vm_events(PGDEACTIVATE, nr_active);
+ nr_active = clear_active_flags(&page_list, count);
+ __count_vm_events(PGDEACTIVATE, nr_active);
- __mod_zone_page_state(zone, NR_ACTIVE_FILE,
- -count[LRU_ACTIVE_FILE]);
- __mod_zone_page_state(zone, NR_INACTIVE_FILE,
- -count[LRU_INACTIVE_FILE]);
- __mod_zone_page_state(zone, NR_ACTIVE_ANON,
- -count[LRU_ACTIVE_ANON]);
- __mod_zone_page_state(zone, NR_INACTIVE_ANON,
- -count[LRU_INACTIVE_ANON]);
+ __mod_zone_page_state(zone, NR_ACTIVE_FILE,
+ -count[LRU_ACTIVE_FILE]);
+ __mod_zone_page_state(zone, NR_INACTIVE_FILE,
+ -count[LRU_INACTIVE_FILE]);
+ __mod_zone_page_state(zone, NR_ACTIVE_ANON,
+ -count[LRU_ACTIVE_ANON]);
+ __mod_zone_page_state(zone, NR_INACTIVE_ANON,
+ -count[LRU_INACTIVE_ANON]);
- nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
- nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
- __mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
- __mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
+ nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
+ nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
+ __mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
+ __mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
- reclaim_stat->recent_scanned[0] += nr_anon;
- reclaim_stat->recent_scanned[1] += nr_file;
+ reclaim_stat->recent_scanned[0] += nr_anon;
+ reclaim_stat->recent_scanned[1] += nr_file;
- spin_unlock_irq(&zone->lru_lock);
+ spin_unlock_irq(&zone->lru_lock);
- nr_scanned += nr_scan;
- nr_freed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
+ nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
+
+ /*
+ * If we are direct reclaiming for contiguous pages and we do
+ * not reclaim everything in the list, try again and wait
+ * for IO to complete. This will stall high-order allocations
+ * but that should be acceptable to the caller
+ */
+ if (nr_reclaimed < nr_taken && !current_is_kswapd() && lumpy_reclaim) {
+ congestion_wait(BLK_RW_ASYNC, HZ/10);
/*
- * If we are direct reclaiming for contiguous pages and we do
- * not reclaim everything in the list, try again and wait
- * for IO to complete. This will stall high-order allocations
- * but that should be acceptable to the caller
+ * The attempt at page out may have made some
+ * of the pages active, mark them inactive again.
*/
- if (nr_freed < nr_taken && !current_is_kswapd() &&
- lumpy_reclaim) {
- congestion_wait(BLK_RW_ASYNC, HZ/10);
-
- /*
- * The attempt at page out may have made some
- * of the pages active, mark them inactive again.
- */
- nr_active = clear_active_flags(&page_list, count);
- count_vm_events(PGDEACTIVATE, nr_active);
-
- nr_freed += shrink_page_list(&page_list, sc,
- PAGEOUT_IO_SYNC);
- }
+ nr_active = clear_active_flags(&page_list, count);
+ count_vm_events(PGDEACTIVATE, nr_active);
- nr_reclaimed += nr_freed;
+ nr_reclaimed += shrink_page_list(&page_list, sc,
+ PAGEOUT_IO_SYNC);
+ }
- local_irq_disable();
- if (current_is_kswapd())
- __count_vm_events(KSWAPD_STEAL, nr_freed);
- __count_zone_vm_events(PGSTEAL, zone, nr_freed);
+ local_irq_disable();
+ if (current_is_kswapd())
+ __count_vm_events(KSWAPD_STEAL, nr_reclaimed);
+ __count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
- spin_lock(&zone->lru_lock);
- /*
- * Put back any unfreeable pages.
- */
- while (!list_empty(&page_list)) {
- int lru;
- page = lru_to_page(&page_list);
- VM_BUG_ON(PageLRU(page));
- list_del(&page->lru);
- if (unlikely(!page_evictable(page, NULL))) {
- spin_unlock_irq(&zone->lru_lock);
- putback_lru_page(page);
- spin_lock_irq(&zone->lru_lock);
- continue;
- }
- SetPageLRU(page);
- lru = page_lru(page);
- add_page_to_lru_list(zone, page, lru);
- if (is_active_lru(lru)) {
- int file = is_file_lru(lru);
- reclaim_stat->recent_rotated[file]++;
- }
- if (!pagevec_add(&pvec, page)) {
- spin_unlock_irq(&zone->lru_lock);
- __pagevec_release(&pvec);
- spin_lock_irq(&zone->lru_lock);
- }
+ spin_lock(&zone->lru_lock);
+ /*
+ * Put back any unfreeable pages.
+ */
+ while (!list_empty(&page_list)) {
+ int lru;
+ page = lru_to_page(&page_list);
+ VM_BUG_ON(PageLRU(page));
+ list_del(&page->lru);
+ if (unlikely(!page_evictable(page, NULL))) {
+ spin_unlock_irq(&zone->lru_lock);
+ putback_lru_page(page);
+ spin_lock_irq(&zone->lru_lock);
+ continue;
}
- __mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
- __mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
-
- } while (nr_scanned < max_scan);
+ SetPageLRU(page);
+ lru = page_lru(page);
+ add_page_to_lru_list(zone, page, lru);
+ if (is_active_lru(lru)) {
+ int file = is_file_lru(lru);
+ reclaim_stat->recent_rotated[file]++;
+ }
+ if (!pagevec_add(&pvec, page)) {
+ spin_unlock_irq(&zone->lru_lock);
+ __pagevec_release(&pvec);
+ spin_lock_irq(&zone->lru_lock);
+ }
+ }
+ __mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
+ __mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
done:
spin_unlock_irq(&zone->lru_lock);
--
1.6.5.2
===================================
Free_hot_cold_page() and __free_pages_ok() have very similar
freeing preparation. This patch consolidates them.
Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---
mm/page_alloc.c | 40 +++++++++++++++++++++-------------------
1 files changed, 21 insertions(+), 19 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 88513c0..ba9aea7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -599,20 +599,23 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
spin_unlock(&zone->lock);
}
-static void __free_pages_ok(struct page *page, unsigned int order)
+static int free_pages_prepare(struct page *page, unsigned int order)
{
- unsigned long flags;
int i;
int bad = 0;
- int wasMlocked = __TestClearPageMlocked(page);
trace_mm_page_free_direct(page, order);
kmemcheck_free_shadow(page, order);
- for (i = 0 ; i < (1 << order) ; ++i)
- bad += free_pages_check(page + i);
+ for (i = 0 ; i < (1 << order) ; ++i) {
+ struct page *pg = page + i;
+
+ if (PageAnon(pg))
+ pg->mapping = NULL;
+ bad += free_pages_check(pg);
+ }
if (bad)
- return;
+ return -EINVAL;
if (!PageHighMem(page)) {
debug_check_no_locks_freed(page_address(page),PAGE_SIZE<<order);
@@ -622,6 +625,17 @@ static void __free_pages_ok(struct page *page, unsigned int order)
arch_free_page(page, order);
kernel_map_pages(page, 1 << order, 0);
+ return 0;
+}
+
+static void __free_pages_ok(struct page *page, unsigned int order)
+{
+ unsigned long flags;
+ int wasMlocked = __TestClearPageMlocked(page);
+
+ if (free_pages_prepare(page, order))
+ return;
+
local_irq_save(flags);
if (unlikely(wasMlocked))
free_page_mlock(page);
@@ -1107,21 +1121,9 @@ void free_hot_cold_page(struct page *page, int cold)
int migratetype;
int wasMlocked = __TestClearPageMlocked(page);
- trace_mm_page_free_direct(page, 0);
- kmemcheck_free_shadow(page, 0);
-
- if (PageAnon(page))
- page->mapping = NULL;
- if (free_pages_check(page))
+ if (free_pages_prepare(page, 0))
return;
- if (!PageHighMem(page)) {
- debug_check_no_locks_freed(page_address(page), PAGE_SIZE);
- debug_check_no_obj_freed(page_address(page), PAGE_SIZE);
- }
- arch_free_page(page, 0);
- kernel_map_pages(page, 1, 0);
-
migratetype = get_pageblock_migratetype(page);
set_page_private(page, migratetype);
local_irq_save(flags);
--
1.6.5.2
Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---
mm/vmscan.c | 22 ++++++++++++++--------
1 files changed, 14 insertions(+), 8 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4de4029..fbc26d8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -93,6 +93,8 @@ struct scan_control {
unsigned long *scanned, int order, int mode,
struct zone *z, struct mem_cgroup *mem_cont,
int active, int file);
+
+ struct list_head free_batch_list;
};
#define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
@@ -641,13 +643,11 @@ static unsigned long shrink_page_list(struct list_head *page_list,
enum pageout_io sync_writeback)
{
LIST_HEAD(ret_pages);
- struct pagevec freed_pvec;
int pgactivate = 0;
unsigned long nr_reclaimed = 0;
cond_resched();
- pagevec_init(&freed_pvec, 1);
while (!list_empty(page_list)) {
enum page_references references;
struct address_space *mapping;
@@ -822,10 +822,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
__clear_page_locked(page);
free_it:
nr_reclaimed++;
- if (!pagevec_add(&freed_pvec, page)) {
- __pagevec_free(&freed_pvec);
- pagevec_reinit(&freed_pvec);
- }
+ list_add(&page->lru, &sc->free_batch_list);
continue;
cull_mlocked:
@@ -849,8 +846,6 @@ keep:
VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
}
list_splice(&ret_pages, page_list);
- if (pagevec_count(&freed_pvec))
- __pagevec_free(&freed_pvec);
count_vm_events(PGACTIVATE, pgactivate);
return nr_reclaimed;
}
@@ -1238,6 +1233,11 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
PAGEOUT_IO_SYNC);
}
+ /*
+ * Free unused pages.
+ */
+ free_pages_bulk(zone, &sc->free_batch_list);
+
local_irq_disable();
if (current_is_kswapd())
__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
@@ -1844,6 +1844,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
.mem_cgroup = NULL,
.isolate_pages = isolate_pages_global,
.nodemask = nodemask,
+ .free_batch_list = LIST_HEAD_INIT(sc.free_batch_list),
};
return do_try_to_free_pages(zonelist, &sc);
@@ -1864,6 +1865,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
.order = 0,
.mem_cgroup = mem,
.isolate_pages = mem_cgroup_isolate_pages,
+ .free_batch_list = LIST_HEAD_INIT(sc.free_batch_list),
};
nodemask_t nm = nodemask_of_node(nid);
@@ -1900,6 +1902,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
.mem_cgroup = mem_cont,
.isolate_pages = mem_cgroup_isolate_pages,
.nodemask = NULL, /* we don't care the placement */
+ .free_batch_list = LIST_HEAD_INIT(sc.free_batch_list),
};
sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
@@ -1976,6 +1979,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
.order = order,
.mem_cgroup = NULL,
.isolate_pages = isolate_pages_global,
+ .free_batch_list = LIST_HEAD_INIT(sc.free_batch_list),
};
loop_again:
total_scanned = 0;
@@ -2333,6 +2337,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
.swappiness = vm_swappiness,
.order = 0,
.isolate_pages = isolate_pages_global,
+ .free_batch_list = LIST_HEAD_INIT(sc.free_batch_list),
};
struct zonelist * zonelist = node_zonelist(numa_node_id(), sc.gfp_mask);
struct task_struct *p = current;
@@ -2517,6 +2522,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
.swappiness = vm_swappiness,
.order = order,
.isolate_pages = isolate_pages_global,
+ .free_batch_list = LIST_HEAD_INIT(sc.free_batch_list),
};
unsigned long slab_reclaimable;
--
1.6.5.2
At worst, it'll distort the LRU ordering slightly. Let's say the
file-adjacent page you clean was near the end of the LRU. Before such a
patch, it may have gotten cleaned and done another lap of the LRU.
After, it would be reclaimed sooner. I don't know if we depend on such
behaviour (very doubtful) but it's a subtle enough change. I can't
predict what it'll do for IO congestion. Simplistically, there is more
IO so it's bad but if the write pattern is less seeky and we needed to
write the pages anyway, it might be improved.
> I agree that it doesn't solve the stack problem (Chris' suggestion
> that we enable the bdi flusher interface would fix this);
I'm afraid I'm not familiar with this interface. Can you point me at
some previous discussion so that I am sure I am looking at the right
thing?
> what I'm
> pointing out is that the arguments that it is too hard or there are
> no interfaces available to issue larger IO from reclaim are not at
> all valid.
>
Sure, I'm not resisting fixing this, just your first patch :) There are four
goals here
1. Reduce stack usage
2. Avoid the splicing of subsystem stack usage with direct reclaim
3. Preserve lumpy reclaims cleaning of contiguous pages
4. Try and not drastically alter LRU aging
1 and 2 are important for you, 3 is important for me and 4 will have to
be dealt with on a case-by-case basis.
Your patch fixes 2, avoids 1, breaks 3 and haven't thought about 4 but I
guess dirty pages can cycle around more so it'd need to be cared for.
> > > the deepest call chain in queue_work() needs 700 bytes of stack
> > > to complete, wait_for_completion() requires almost 2k of stack space
> > > at it's deepest, the scheduler has some heavy stack users, etc,
> > > and these are all functions that appear at the top of the stack.
> > >
> >
> > The real issue here then is that stack usage has gone out of control.
>
> That's definitely true, but it shouldn't cloud the fact that most
> ppl want to kill writeback from direct reclaim, too, so killing two
> birds with one stone seems like a good idea.
>
Ah yes, but I at least will resist killing of writeback from direct
reclaim because of lumpy reclaim. Again, I recognise the seek pattern
sucks but sometimes there are specific pages we need cleaned.
> How about this? For now, we stop direct reclaim from doing writeback
> only on order zero allocations, but allow it for higher order
> allocations. That will prevent the majority of situations where
> direct reclaim blows the stack and interferes with background
> writeout, but won't cause lumpy reclaim to change behaviour.
> This reduces the scope of impact and hence testing and validation
> the needs to be done.
>
> Then we can work towards allowing lumpy reclaim to use background
> threads as Chris suggested for doing specific writeback operations
> to solve the remaining problems being seen. Does this seem like a
> reasonable compromise and approach to dealing with the problem?
>
I'd like this to be plan b (or maybe c or d) if we cannot reduce stack usage
enough or come up with an alternative fix. From the goals above it mitigates
1, mitigates 2, addresses 3 but potentially allows dirty pages to remain on
the LRU with 4 until the background cleaner or kswapd comes along.
One reason why I am edgy about this is that lumpy reclaim can kick in
for low-enough orders too like order-1 pages for stacks in some cases or
order-2 pages for network cards using jumbo frames or some wireless
cards. The network cards in particular could still cause the stack
overflow but be much harder to reproduce and detect.
> > Disabling ->writepage in direct reclaim does not guarantee that stack
> > usage will not be a problem again. From your traces, page reclaim itself
> > seems to be a big dirty hog.
>
> I couldn't agree more - the kernel still needs to be put on a stack
> usage diet, but the above would give use some breathing space to attack the
> problem before more people start to hit these problems.
>
I'd like stack reduction to be plan A because it buys time without
making the problem exclusive to lumpy reclaim, where it can still hit
but is harder to reproduce.
> > > Good start, but 512 bytes will only catch select and splice read,
> > > and there are 300-400 byte functions in the above list that sit near
> > > the top of the stack....
> > >
> >
> > They will need to be tackled in turn then but obviously there should be
> > a focus on the common paths. The reclaim paths do seem particularly
> > heavy and it's down to a lot of temporary variables. I might not get the
> > time today but what I'm going to try do some time this week is
> >
> > o Look at what temporary variables are copies of other pieces of information
> > o See what variables live for the duration of reclaim but are not needed
> > for all of it (i.e. uninline parts of it so variables do not persist)
> > o See if it's possible to dynamically allocate scan_control
>
> Welcome to my world ;)
>
It's not like the brochure at all :)
> > The last one is the trickiest. Basically, the idea would be to move as much
> > into scan_control as possible. Then, instead of allocating it on the stack,
> > allocate a fixed number of them at boot-time (NR_CPU probably) protected by
> > a semaphore. Limit the number of direct reclaimers that can be active at a
> > time to the number of scan_control variables. kswapd could still allocate
> > its on the stack or with kmalloc.
> >
> > If it works out, it would have two main benefits. Limits the number of
> > processes in direct reclaim - if there is NR_CPU-worth of proceses in direct
> > reclaim, there is too much going on. It would also shrink the stack usage
> > particularly if some of the stack variables are moved into scan_control.
> >
> > Maybe someone will beat me to looking at the feasibility of this.
>
> I like the idea - it really sounds like you want a fixed size,
> preallocated mempool that can't be enlarged.
Yep. It would cut down around 1K of stack usage when direct reclaim gets
involved. The "downside" would be a limitation of the number of direct
reclaimers that exist at any given time but that could be a positive in
some cases.
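A rough sketch of that boot-time pool, purely for illustration (the pool
size, the bitmap-plus-semaphore scheme and the helper names are all
assumptions, and it presumes struct scan_control stays local to
mm/vmscan.c; kswapd would keep allocating its scan_control on its own
stack as noted above):

	#include <linux/semaphore.h>
	#include <linux/spinlock.h>
	#include <linux/bitops.h>
	#include <linux/string.h>

	#define NR_SCAN_CONTROLS 16	/* "NR_CPU probably" above */

	static struct scan_control sc_pool[NR_SCAN_CONTROLS];
	static unsigned long sc_pool_used;	/* bitmap of in-use slots */
	static DEFINE_SPINLOCK(sc_pool_lock);
	static struct semaphore sc_pool_sem =
		__SEMAPHORE_INITIALIZER(sc_pool_sem, NR_SCAN_CONTROLS);

	/* Throttle direct reclaimers and keep ~1K of scan_control off each stack. */
	static struct scan_control *get_scan_control(void)
	{
		int slot;

		down(&sc_pool_sem);		/* sleep until a slot is free */
		spin_lock(&sc_pool_lock);
		slot = find_first_zero_bit(&sc_pool_used, NR_SCAN_CONTROLS);
		__set_bit(slot, &sc_pool_used);
		spin_unlock(&sc_pool_lock);

		memset(&sc_pool[slot], 0, sizeof(sc_pool[slot]));
		return &sc_pool[slot];
	}

	static void put_scan_control(struct scan_control *sc)
	{
		spin_lock(&sc_pool_lock);
		__clear_bit(sc - sc_pool, &sc_pool_used);
		spin_unlock(&sc_pool_lock);
		up(&sc_pool_sem);
	}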
> In fact, I can probably
> use something like this in XFS to save a couple of hundred bytes of
> stack space in the worst hogs....
>
> > > > > This is the sort of thing I'm pointing at when I say that stack
> > > > > usage outside XFS has grown significantly significantly over the
> > > > > past couple of years. Given XFS has remained pretty much the same or
> > > > > even reduced slightly over the same time period, blaming XFS or
> > > > > saying "callers should use GFP_NOFS" seems like a cop-out to me.
> > > > > Regardless of the IO pattern performance issues, writeback via
> > > > > direct reclaim just uses too much stack to be safe these days...
> > > >
> > > > Yeah, My answer is simple, All stack eater should be fixed.
> > > > but XFS seems not innocence too. 3.5K is enough big although
> > > > xfs have use such amount since very ago.
> > >
> > > XFS used to use much more than that - significant effort has been
> > > put into reduce the stack footprint over many years. There's not
> > > much left to trim without rewriting half the filesystem...
> >
> > I don't think he is levelling a complain at XFS in particular - just pointing
> > out that it's heavy too. Still, we should be gratful that XFS is sort of
> > a "Stack Canary". If it dies, everyone else could be in trouble soon :)
>
> Yeah, true. Sorry if I'm being a bit too defensive here - the scars
> from previous discussions like this are showing through....
>
I guessed :)
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
ok.
> > - under allowing lumpy reclaim world, only deny low order reclaim
> > doesn't solve anything.
>
> Yes, I suggested it *as a first step*, not as the end goal. Your
> patches don't reach the first step which is fixing the reported
> stack problem for order-0 allocations...
I have some stack-diet patches as a separate series. I'll post today's diet
patches in another mail; I didn't want to mix in completely unrelated patches.
> > Please don't forget priority=0 recliam failure incvoke OOM-killer.
> > I don't imagine anyone want it.
>
> Given that I haven't been able to trigger OOM without writeback from
> direct reclaim so far (*) I'm not finding any evidence that it is a
> problem or that there are regressions. I want to be able to say
> that this change has no known regressions. I want to find the
> regression and work to fix them, but without test cases there's no
> way I can do this.
>
> This is what I'm getting frustrated about - I want to fix this
> problem once and for all, but I can't find out what I need to do to
> robustly test such a change so we can have a high degree of
> confidence that it doesn't introduce major regressions. Can anyone
> help here?
>
> (*) except in one case I've already described where it mananged to
> allocate enough huge pages to starve the system of order zero pages,
> which is what I asked it to do.
Agreed, and I'm sorry about that. Probably nobody in the world has
enough VM test cases, Linux people included. Modern general-purpose
OSes are used for really, really varied purposes on varied machines,
so I haven't seen a VM change with perfectly zero regressions. I feel
the same frustration every time.
Much of the VM mess is there to avoid extreme starvation cases; if a
case can be reproduced easily, it's a VM bug ;)
> > And, Which IO workload trigger <6 priority vmscan?
>
> You're asking me? I've been asking you for workloads that wind up
> reclaim priority.... :/
??? Did I misunderstand your last mail?
You wrote
> IMO, this really doesn't fix either of the problems - the bad IO
> patterns nor the stack usage. All it will take is a bit more memory
> pressure to trigger stack and IO problems, and the user reporting the
> problems is generating an awful lot of memory pressure...
and I asked which one is "the bad IO pattern". If that's not what you meant,
which IO pattern were you talking about?
If my understanding is correct, you asked me for a case where vmscan hurts,
and I asked you about your bad IO pattern.
Guessing now, was your intention "bad IO patterns" in general, not "the IO patterns" of a specific workload?
> All I can say is that the most common trigger I see for OOM is
> copying a large file on a busy system that is running off a single
> spindle. When that happens on my laptop I walk away and get a cup
> of coffee when that happens and when I come back I pick up all the
> broken bits the OOM killer left behind.....
As far as I understand, you are not talking about anything specific, so
I'll talk in general terms too. In general, I think a slowdown is
better than the OOM-killer. So, even though we need more and more
improvement, we should always take care to avoid incorrect OOMs. IOW,
I'd prefer step-by-step development.
So now we are planning to use a page->lru list instead of a pagevec
to reduce stack usage, and to introduce a new helper function.
It is similar to __pagevec_free(), but it receives a list instead of a
pagevec, and it does not use the pcp cache, which is a good
characteristic for vmscan.
Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
---
include/linux/gfp.h | 1 +
mm/page_alloc.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 45 insertions(+), 0 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 4c6d413..dbcac56 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -332,6 +332,7 @@ extern void free_hot_cold_page(struct page *page, int cold);
#define __free_page(page) __free_pages((page), 0)
#define free_page(addr) free_pages((addr),0)
+void free_pages_bulk(struct zone *zone, struct list_head *list);
void page_alloc_init(void);
void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
void drain_all_pages(void);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ba9aea7..1f68832 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2049,6 +2049,50 @@ void free_pages(unsigned long addr, unsigned int order)
EXPORT_SYMBOL(free_pages);
+/*
+ * Frees a number of pages from the list
+ * Assumes all pages on list are in same zone and order==0.
+ *
+ * This is similar to __pagevec_free(), but it receives a list instead of a
+ * pagevec and does not use the pcp cache, which suits vmscan well.
+ */
+void free_pages_bulk(struct zone *zone, struct list_head *list)
+{
+ unsigned long flags;
+ struct page *page;
+ struct page *page2;
+ int nr_pages = 0;
+
+ list_for_each_entry_safe(page, page2, list, lru) {
+ int wasMlocked = __TestClearPageMlocked(page);
+
+ if (free_pages_prepare(page, 0)) {
+ /* Make orphan the corrupted page. */
+ list_del(&page->lru);
+ continue;
+ }
+ if (unlikely(wasMlocked)) {
+ local_irq_save(flags);
+ free_page_mlock(page);
+ local_irq_restore(flags);
+ }
+ nr_pages++;
+ }
+
+ spin_lock_irqsave(&zone->lock, flags);
+ __count_vm_events(PGFREE, nr_pages);
+ zone->all_unreclaimable = 0;
+ zone->pages_scanned = 0;
+ __mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);
+
+ list_for_each_entry_safe(page, page2, list, lru) {
+ /* have to delete it as __free_one_page list manipulates */
+ list_del(&page->lru);
+ __free_one_page(page, zone, 0, page_private(page));
+ }
+ spin_unlock_irqrestore(&zone->lock, flags);
+}
+
/**
* alloc_pages_exact - allocate an exact number physically-contiguous pages.
* @size: the number of bytes to allocate
--
1.6.5.2
Oh, quite vividly in fact :) For a lot of swap loads the LRU order
diverged heavily from swap slot order and readaround was a waste of
time.
Of course, the patch looked good, too, but it did not match reality
that well.
I guess 'how about this patch?' won't get us as far as 'how about
those numbers/graphs of several real-life workloads? oh and here
is the patch...'.
> > > Cluster writes to disk due to memory pressure.
> > >
> > > Write out logically adjacent pages to the one we're paging out
> > > so that we may get better IOs in these situations:
> > > These pages are likely to be contiguous on disk to the one we're
> > > writing out, so they should get merged into a single disk IO.
> > >
> > > Signed-off-by: Suleiman Souhlal <sule...@google.com>
For random IO, LRU order will have nothing to do with mapping/disk order.
Well, there is some risk here. Direct reclaimers may not end up cleaning
more pages than they had to previously, except that it splices subsystems
together, increasing stack usage and causing further problems.
It might not cause OOM-killer issues but it could increase the time
dirty pages spend on the LRU.
Am I missing something?
> Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
> ---
> mm/vmscan.c | 7 +++++++
> 1 files changed, 7 insertions(+), 0 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 3ff3311..d392a50 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -614,6 +614,13 @@ static enum page_references page_check_references(struct page *page,
> if (referenced_page)
> return PAGEREF_RECLAIM_CLEAN;
>
> + /*
> + * Delegate pageout IO to flusher thread. They can make more
> + * effective IO pattern.
> + */
> + if (current_is_kswapd())
> + return PAGEREF_RECLAIM_CLEAN;
> +
> return PAGEREF_RECLAIM;
> }
>
> --
> 1.6.5.2
>
>
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
No, you are right. I fully agree with your previous mail, so I need to cool down a bit ;)
Yep. I modified bloat-o-meter to work with stacks (imaginatively calling it
stack-o-meter) and got the following. The prereq patches are from
earlier in the thread with the subjects
vmscan: kill prev_priority completely
vmscan: move priority variable into scan_control
It gets
$ stack-o-meter vmlinux-vanilla vmlinux-1-2patchprereq
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-72 (-72)
function old new delta
kswapd 748 676 -72
and with this patch on top
$ stack-o-meter vmlinux-vanilla vmlinux-2-simplfy-shrink
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-144 (-144)
function old new delta
shrink_zone 1232 1160 -72
kswapd 748 676 -72
X86-32 based config.
> detail
> - remove "while (nr_scanned < max_scan)" loop
> - remove nr_freed (now, we use nr_reclaimed directly)
> - remove nr_scan (now, we use nr_scanned directly)
> - rename max_scan to nr_to_scan
> - pass nr_to_scan into isolate_pages() directly instead
> using SWAP_CLUSTER_MAX
>
> Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
I couldn't spot any problems. I'd consider throwing in a
WARN_ON(nr_to_scan > SWAP_CLUSTER_MAX) in case some future change breaks
the assumptions, but otherwise it looks fine.
Acked-by: Mel Gorman <m...@csn.ul.ie>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
You don't appear to do anything with the return value. bool? Otherwise I
see no problems
Acked-by: Mel Gorman <m...@csn.ul.ie>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
vi fs/direct-reclaim-helper.c, it has a few placeholders for where the
real code needs to go....just look for the ~ marks.
I mostly meant that the bdi helper threads were the best place to add
knowledge about which pages we want to write for reclaim. We might need
to add a thread dedicated to just doing the VM's dirty work, but that's
where I would start discussing fancy new interfaces.
>
> > what I'm
> > pointing out is that the arguments that it is too hard or there are
> > no interfaces available to issue larger IO from reclaim are not at
> > all valid.
> >
>
> Sure, I'm not resisting fixing this, just your first patch :) There are four
> goals here
>
> 1. Reduce stack usage
> 2. Avoid the splicing of subsystem stack usage with direct reclaim
> 3. Preserve lumpy reclaims cleaning of contiguous pages
> 4. Try and not drastically alter LRU aging
>
> 1 and 2 are important for you, 3 is important for me and 4 will have to
> be dealt with on a case-by-case basis.
>
> Your patch fixes 2, avoids 1, breaks 3 and haven't thought about 4 but I
> guess dirty pages can cycle around more so it'd need to be cared for.
I'd like to add one more:
5. Don't dive into filesystem locks during reclaim.
This is different from splicing code paths together, but
the filesystem writepage code has become the center of our attempts at
doing big fat contiguous writes on disk. We push off work as late as we
can until just before the pages go down to disk.
I'll pick on ext4 and btrfs for a minute, just to broaden the scope
outside of XFS. Writepage comes along and the filesystem needs to
actually find blocks on disk for all the dirty pages it has promised to
write.
So, we start a transaction, we take various allocator locks, modify
different metadata, log changed blocks, take a break (logging is hard
work you know, need_resched() triggered by now), stuff it
all into the file's metadata, log that, and finally return.
Each of the steps above can block for a long time. Ext4 solves
this by not doing them. ext4_writepage only writes pages that
are already fully allocated on disk.
Btrfs is much more efficient at not doing them, it just returns right
away for PF_MEMALLOC.
This is a long way of saying the filesystem writepage code is the
opposite of what direct reclaim wants. Direct reclaim wants to
find free ram now, and if it does end up in the mess described above,
it'll just get stuck for a long time on work entirely unrelated to
finding free pages.
-chris
You could clear this under the zone->lock below before calling
__free_one_page. It'd avoid a large number of IRQ enables and disables which
are a problem on some CPUs (P4 and Itanium both blow in this regard according
to PeterZ).
> + nr_pages++;
> + }
> +
> + spin_lock_irqsave(&zone->lock, flags);
> + __count_vm_events(PGFREE, nr_pages);
> + zone->all_unreclaimable = 0;
> + zone->pages_scanned = 0;
> + __mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);
> +
> + list_for_each_entry_safe(page, page2, list, lru) {
> + /* have to delete it as __free_one_page list manipulates */
> + list_del(&page->lru);
> + __free_one_page(page, zone, 0, page_private(page));
> + }
This has the effect of bypassing the per-cpu lists as well as making the
zone lock hotter. The cache hotness of the data within the page is
probably not a factor but the cache hotness of the struct page is.
The zone lock getting hotter is a greater problem. Large amounts of page
reclaim or dumping of page cache will now contend on the zone lock where
as previously it would have dumped into the per-cpu lists (potentially
but not necessarily avoiding the zone lock).
While there might be a stack saving in the next patch, there would appear
to be definite performance implications in taking this patch.
Functionally, I see no problem but I'd put this sort of patch on the
very long finger until the performance aspects of it could be examined.
> + spin_unlock_irqrestore(&zone->lock, flags);
> +}
> +
> /**
> * alloc_pages_exact - allocate an exact number physically-contiguous pages.
> * @size: the number of bytes to allocate
> --
> 1.6.5.2
>
>
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
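For illustration, free_pages_bulk() could be reworked along the lines of
the comments above so that the Mlocked accounting happens once inside the
existing zone->lock/irq-off section rather than toggling interrupts per
page (this is an assumed rework, not a patch from this thread, and it does
nothing about the bigger concern of bypassing the per-cpu lists):

	void free_pages_bulk(struct zone *zone, struct list_head *list)
	{
		unsigned long flags;
		struct page *page, *page2;
		int nr_pages = 0;
		int nr_mlocked = 0;

		list_for_each_entry_safe(page, page2, list, lru) {
			/* must clear PG_mlocked before free_pages_check() sees it */
			int wasMlocked = __TestClearPageMlocked(page);

			if (free_pages_prepare(page, 0)) {
				/* Make orphan the corrupted page. */
				list_del(&page->lru);
				continue;
			}
			if (wasMlocked)
				nr_mlocked++;
			nr_pages++;
		}

		spin_lock_irqsave(&zone->lock, flags);
		__count_vm_events(PGFREE, nr_pages);
		if (nr_mlocked) {
			/* batched equivalent of calling free_page_mlock() per page */
			__mod_zone_page_state(zone, NR_MLOCK, -nr_mlocked);
			__count_vm_events(UNEVICTABLE_MLOCKFREED, nr_mlocked);
		}
		zone->all_unreclaimable = 0;
		zone->pages_scanned = 0;
		__mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);

		list_for_each_entry_safe(page, page2, list, lru) {
			/* have to delete it as __free_one_page list manipulates */
			list_del(&page->lru);
			__free_one_page(page, zone, 0, page_private(page));
		}
		spin_unlock_irqrestore(&zone->lock, flags);
	}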
And also stop it always with 4K stacks.
> direct reclaim blows the stack and interferes with background
> writeout, but won't cause lumpy reclaim to change behaviour.
> This reduces the scope of impact and hence testing and validation
> the needs to be done.
-Andi
--
a...@linux.intel.com -- Speaking for myself only.
And the next time someone adds a new feature to these code paths or
the compiler inlines differently these 72 bytes are easily there
again. It's not really a long term solution. Code is tending to get
more complicated all the time. I consider it unlikely this trend will
stop any time soon.
So just doing some stack micro optimizations doesn't really help
all that much.
-Andi
--
a...@linux.intel.com -- Speaking for myself only.
The same logic applies when/if page writeback is split so that it is
handled by a separate thread.
> So just doing some stack micro optimizations doesn't really help
> all that much.
>
It's a buying-time venture, I'll agree, but as both approaches are only
about reducing stack usage they wouldn't be long-term solutions by your
criteria. What do you suggest?
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
(from easy to more complicated):
- Disable direct reclaim with 4K stacks
- Do direct reclaim only on separate stacks
- Add interrupt stacks to any 8K stack architectures.
- Get rid of 4K stacks completely
- Think about any other stackings that could give large scale recursion
and find ways to run them on separate stacks too.
- Long term: maybe we need 16K stacks at some point, depending on how
good the VM gets. Alternative would be to stop making Linux more complicated,
but that's unlikely to happen.
-Andi
--
a...@linux.intel.com -- Speaking for myself only.
> On Thu, Apr 15, 2010 at 01:05:57AM -0700, Suleiman Souhlal wrote:
>>
>> On Apr 14, 2010, at 9:11 PM, KOSAKI Motohiro wrote:
>>
>>> Now, vmscan pageout() is one of IO throuput degression source.
>>> Some IO workload makes very much order-0 allocation and reclaim
>>> and pageout's 4K IOs are making annoying lots seeks.
>>>
>>> At least, kswapd can avoid such pageout() because kswapd don't
>>> need to consider OOM-Killer situation. that's no risk.
>>>
>>> Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
>>
>> What's your opinion on trying to cluster the writes done by pageout,
>> instead of not doing any paging out in kswapd?
>
> XFS already does this in ->writepage to try to minimise the impact
> of the way pageout issues IO. It helps, but it is still not as good
> as having all the writeback come from the flusher threads because
> it's still pretty much random IO.
Doesn't the randomness become irrelevant if you can cluster enough
pages?
> And, FWIW, it doesn't solve the stack usage problems, either. In
> fact, it will make them worse as write_one_page() puts another
> struct writeback_control on the stack...
Sorry, this patch was not meant to solve the stack usage problems.
-- Suleiman
> On Thu, Apr 15, 2010 at 05:26:27PM +0900, KOSAKI Motohiro wrote:
>>
>> Hannes, if my remember is correct, you tried similar swap-cluster IO
>> long time ago. now I can't remember why we didn't merged such patch.
>> Do you remember anything?
>
> Oh, quite vividly in fact :) For a lot of swap loads the LRU order
> diverged heavily from swap slot order and readaround was a waste of
> time.
>
> Of course, the patch looked good, too, but it did not match reality
> that well.
>
> I guess 'how about this patch?' won't get us as far as 'how about
> those numbers/graphs of several real-life workloads? oh and here
> is the patch...'.
>
>>>> Cluster writes to disk due to memory pressure.
>>>>
>>>> Write out logically adjacent pages to the one we're paging out
>>>> so that we may get better IOs in these situations:
>>>> These pages are likely to be contiguous on disk to the one
>>>> we're
>>>> writing out, so they should get merged into a single disk IO.
>>>>
>>>> Signed-off-by: Suleiman Souhlal <sule...@google.com>
>
> For random IO, LRU order will have nothing to do with mapping/disk
> order.
Right, that's why the patch writes out contiguous pages in mapping order.
If they are contiguous on disk with the original page, then writing them
out as well should be essentially free (when it comes to disk time). There
is almost no waste of memory regardless of the access patterns, as far as
I can tell.
This patch is just a proof of concept and could be improved by getting
help from the filesystem/swap code to ensure that the additional pages
we're writing out really are contiguous with the original one.
-- Suleiman
This is a real problem, BTW. One of the problems we've been fighting
inside Google is because ext4_writepage() refuses to write pages that
are subject to delayed allocation, it can cause the OOM killer to get
invoked.
I had thought this was because of some evil games we're playing for
container support that makes zones small, but just last night at the
LF Collaboration Summit reception, I ran into a technologist from a
major financial industry customer reported to me that when they tried
using ext4, they ran into the exact same problem because they were
running Oracle which was pinning down 3 gigs of memory, and then when
they tried writing a very big file using ext4, they had the same
problem of writepage() not being able to reclaim enough pages, so the
kernel fell back to invoking the OOM killer, and things got ugly in a
hurry...
One of the things I was proposing internally to try as a long-term
we-gotta-fix writeback is that we need some kind of signal so that we
can do the lumpy reclaim (a) in a separate process, to avoid a lock
inversion problem and the gee-its-going-to-take-a-long-time problem
which Chris Mentioned, and (b) to try to cluster I/O so that we're not
dribbling out writes to the disk in small, seeky, 4k writes, which is
really a disaster from a performance standpoint. Maybe the VM guys
don't care about this, but this sort of thing tends to get us
filesystem guys all up in a lather not just because of the really
sucky performance, but also because it tends to mean that the system
can thrash itself to death in low memory situations.
- Ted
> Yep. I modified bloat-o-meter to work with stacks (imaginatively calling it
> stack-o-meter) and got the following. The prereq patches are from
> earlier in the thread with the subjects
Think that's a script worth having in-tree?
No. If you are doing full disk seeks between random chunks, then you
still lose a large amount of throughput. e.g. if the seek time is
10ms and your IO time is 10ms for each 4k page, then increasing the
size to 64k makes it 10ms seek and 12ms for the IO. We might increase
throughput but we are still limited to 100 IOs per second. We've
gone from 400kB/s to 6MB/s, but that's still an order of magnitude
short of the 100MB/s that full-size IOs with little in the way of seeks
between them will achieve on the same spindle...
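(Spelling out the arithmetic above under the same rough assumption of
~100 random IOs per second: 100 x 4kB is about 0.4MB/s and 100 x 64kB is
about 6.4MB/s, still well over an order of magnitude behind the ~100MB/s
the same spindle manages with large sequential IO.)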
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
What I meant was that, theoretically speaking, you could increase the
maximum amount of pages that get clustered so that you could get
100MB/s, although it most likely wouldn't be a good idea with the
current patch.
-- Suleiman
Just to re-iterate: we're blowing the stack with direct reclaim on
x86_64 w/ 8k stacks. The old i386/4k stack problem is a red
herring.
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
> From: Dave Chinner <dchi...@redhat.com>
>
> When we enter direct reclaim we may have used an arbitrary amount of stack
> space, and hence enterring the filesystem to do writeback can then lead to
> stack overruns. This problem was recently encountered x86_64 systems with
> 8k stacks running XFS with simple storage configurations.
>
> Writeback from direct reclaim also adversely affects background writeback. The
> background flusher threads should already be taking care of cleaning dirty
> pages, and direct reclaim will kick them if they aren't already doing work. If
> direct reclaim is also calling ->writepage, it will cause the IO patterns from
> the background flusher threads to be upset by LRU-order writeback from
> pageout() which can be effectively random IO. Having competing sources of IO
> trying to clean pages on the same backing device reduces throughput by
> increasing the amount of seeks that the backing device has to do to write back
> the pages.
>
> Hence for direct reclaim we should not allow ->writepages to be entered at all.
> Set up the relevant scan_control structures to enforce this, and prevent
> sc->may_writepage from being set in other places in the direct reclaim path in
> response to other events.
>
> Reported-by: John Berthels <jo...@humyo.com>
> Signed-off-by: Dave Chinner <dchi...@redhat.com>
Hmm. Then, if a memory cgroup is filled with dirty pages, it can't kick writeback
and has to wait for someone else's writeback?
How long will this take?
# mount -t cgroup none /cgroup -o memory
# mkdir /cgroup/A
# echo 20M > /cgroup/A/memory.limit_in_bytes
# echo $$ > /cgroup/A/tasks
# dd if=/dev/zero of=./tmpfile bs=4096 count=1000000
Can memcg ask the writeback thread to "Wake Up Now! and Write this out!" effectively?
Thanks,
-Kame
> ---
> mm/vmscan.c | 13 ++++++-------
> 1 files changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e0e5f15..5321ac4 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1826,10 +1826,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> * writeout. So in laptop mode, write out the whole world.
> */
> writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
> - if (total_scanned > writeback_threshold) {
> + if (total_scanned > writeback_threshold)
> wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
> - sc->may_writepage = 1;
> - }
>
> /* Take a nap, wait for some writeback to complete */
> if (!sc->hibernation_mode && sc->nr_scanned &&
> @@ -1871,7 +1869,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> {
> struct scan_control sc = {
> .gfp_mask = gfp_mask,
> - .may_writepage = !laptop_mode,
> + .may_writepage = 0,
> .nr_to_reclaim = SWAP_CLUSTER_MAX,
> .may_unmap = 1,
> .may_swap = 1,
> @@ -1893,7 +1891,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> struct zone *zone, int nid)
> {
> struct scan_control sc = {
> - .may_writepage = !laptop_mode,
> + .may_writepage = 0,
> .may_unmap = 1,
> .may_swap = !noswap,
> .swappiness = swappiness,
> @@ -1926,7 +1924,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> {
> struct zonelist *zonelist;
> struct scan_control sc = {
> - .may_writepage = !laptop_mode,
> + .may_writepage = 0,
> .may_unmap = 1,
> .may_swap = !noswap,
> .nr_to_reclaim = SWAP_CLUSTER_MAX,
> @@ -2567,7 +2565,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> struct reclaim_state reclaim_state;
> int priority;
> struct scan_control sc = {
> - .may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
> + .may_writepage = (current_is_kswapd() &&
> + (zone_reclaim_mode & RECLAIM_WRITE)),
> .may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
> .may_swap = 1,
> .nr_to_reclaim = max_t(unsigned long, nr_pages,
> --
> 1.6.5
Fundamentally, we have so many pages on the LRU, getting a few out
of order at the back end of it is going to be in the noise. If we
trade off "perfect" LRU behaviour for cleaning pages an order of
magnitude faster, reclaim will find candidate pages a whole lot
faster. And if we have more clean pages available, faster, overall
system throughput is going to improve and be much less likely to
fall into deep, dark holes where the OOM-killer is the light at the
end.....
[ snip questions Chris answered ]
> > what I'm
> > pointing out is that the arguments that it is too hard or there are
> > no interfaces available to issue larger IO from reclaim are not at
> > all valid.
> >
>
> Sure, I'm not resisting fixing this, just your first patch :) There are four
> goals here
>
> 1. Reduce stack usage
> 2. Avoid the splicing of subsystem stack usage with direct reclaim
> 3. Preserve lumpy reclaims cleaning of contiguous pages
> 4. Try and not drastically alter LRU aging
>
> 1 and 2 are important for you, 3 is important for me and 4 will have to
> be dealt with on a case-by-case basis.
#4 is important to me, too, because that has direct impact on large
file IO workloads. however, it is gross changes in behaviour that
concern me, not subtle, probably-in-the-noise changes that you're
concerned about. :)
> Your patch fixes 2, avoids 1, breaks 3 and haven't thought about 4 but I
> guess dirty pages can cycle around more so it'd need to be cared for.
Well, you keep saying that they break #3, but I haven't seen any
test cases or results showing that. I've been unable to confirm that
lumpy reclaim is broken by disallowing writeback in my testing, so
I'm interested to know what tests you are running that show it is
broken...
> > How about this? For now, we stop direct reclaim from doing writeback
> > only on order zero allocations, but allow it for higher order
> > allocations. That will prevent the majority of situations where
> > direct reclaim blows the stack and interferes with background
> > writeout, but won't cause lumpy reclaim to change behaviour.
> > This reduces the scope of impact and hence testing and validation
> > the needs to be done.
> >
> > Then we can work towards allowing lumpy reclaim to use background
> > threads as Chris suggested for doing specific writeback operations
> > to solve the remaining problems being seen. Does this seem like a
> > reasonable compromise and approach to dealing with the problem?
> >
>
> I'd like this to be plan b (or maybe c or d) if we cannot reduce stack usage
> enough or come up with an alternative fix. From the goals above it mitigates
> 1, mitigates 2, addresses 3 but potentially allows dirty pages to remain on
> the LRU with 4 until the background cleaner or kswapd comes along.
We've been through this already, but I'll repeat it again in the
hope it sinks in: reducing stack usage is not sufficient to stay
within an 8k stack if we can enter writeback with an arbitrary
amount of stack already consumed.
We've already got a report of 9k of stack usage (7200 bytes left on
an order-2 stack) and this is without a complex storage stack - it's
just a partition on a SATA drive. We can easily add another 1k,
possibly 2k to that stack depth with a complex storage subsystem.
Trimming this much (3-4k) is simply not feasible in a callchain that
is 50-70 functions deep...
> One reason why I am edgy about this is that lumpy reclaim can kick in
> for low-enough orders too like order-1 pages for stacks in some cases or
> order-2 pages for network cards using jumbo frames or some wireless
> cards. The network cards in particular could still cause the stack
> overflow but be much harder to reproduce and detect.
So push lumpy reclaim into a separate thread. It already blocks, so
waiting for some other thread to do the work won't change anything.
Separating high-order reclaim from LRU reclaim is probably a good
idea, anyway - they use different algorithms and while the two are
intertwined it's hard to optimise/improve either....
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
Hmm.. I saw an oom-kill while testing several cases, but performance itself
doesn't seem to be far different with or without the patch.
But I'm unhappy with the oom-kill, so some tweak for memcg will be necessary
if we go with this.
Thanks,
-Kame
Yes that's known, but on 4K it will definitely not work at all.
-Andi
--
a...@linux.intel.com -- Speaking for myself only.
Ahh, it's a hatchet-job at the moment. I copied bloat-o-meter and
altered one function. I made a TODO note to extend bloat-o-meter
properly and that would be worth merging.
The usual armwaving numbers for ops/sec for an ATA disk are in the 200
ops/sec range so that seems horribly credible.
But then I've never quite understood why our anonymous paging isn't
sorting stuff as best it can and then using the drive as a log structure
with in memory metadata so it can stream the pages onto disk. Read
performance is going to be similar (maybe better if you have a log tidy
when idle), write ought to be far better.
Alan
Signed-off-by: KOSAKI Motohiro <kosaki....@jp.fujitsu.com>
Acked-by: Mel Gorman <m...@csn.ul.ie>
---
mm/page_alloc.c | 40 +++++++++++++++++++++-------------------
1 files changed, 21 insertions(+), 19 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d03c946..6a7d0d0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -599,20 +599,23 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
spin_unlock(&zone->lock);
}
-static void __free_pages_ok(struct page *page, unsigned int order)
+static bool free_pages_prepare(struct page *page, unsigned int order)
{
- unsigned long flags;
int i;
int bad = 0;
- int wasMlocked = __TestClearPageMlocked(page);
trace_mm_page_free_direct(page, order);
kmemcheck_free_shadow(page, order);
- for (i = 0 ; i < (1 << order) ; ++i)
- bad += free_pages_check(page + i);
+ for (i = 0 ; i < (1 << order) ; ++i) {
+ struct page *pg = page + i;
+
+ if (PageAnon(pg))
+ pg->mapping = NULL;
+ bad += free_pages_check(pg);
+ }
if (bad)
- return;
+ return false;
if (!PageHighMem(page)) {
debug_check_no_locks_freed(page_address(page),PAGE_SIZE<<order);
@@ -622,6 +625,17 @@ static void __free_pages_ok(struct page *page, unsigned int order)
arch_free_page(page, order);
kernel_map_pages(page, 1 << order, 0);
+ return true;
+}
+
+static void __free_pages_ok(struct page *page, unsigned int order)
+{
+ unsigned long flags;
+ int wasMlocked = __TestClearPageMlocked(page);
+
+ if (!free_pages_prepare(page, order))
+ return;
+
local_irq_save(flags);
if (unlikely(wasMlocked))
free_page_mlock(page);
@@ -1107,21 +1121,9 @@ void free_hot_cold_page(struct page *page, int cold)
int migratetype;
int wasMlocked = __TestClearPageMlocked(page);
- trace_mm_page_free_direct(page, 0);
- kmemcheck_free_shadow(page, 0);
-
- if (PageAnon(page))
- page->mapping = NULL;
- if (free_pages_check(page))
+ if (!free_pages_prepare(page, 0))
return;
- if (!PageHighMem(page)) {
- debug_check_no_locks_freed(page_address(page), PAGE_SIZE);
- debug_check_no_obj_freed(page_address(page), PAGE_SIZE);
- }
- arch_free_page(page, 0);
- kernel_map_pages(page, 1, 0);
-
migratetype = get_pageblock_migratetype(page);
set_page_private(page, migratetype);
local_irq_save(flags);
--
1.6.5.2
--
Do not like. While I can see why 4K stacks are a serious problem, I'd
sooner see 4K stacks disabled than have the kernel behave so differently
for direct reclaim. It'd be tricky to spot regressions in reclaim that
were due to this .config option.
> - Do direct reclaim only on separate stacks
This is looking more and more attractive.
> - Add interrupt stacks to any 8K stack architectures.
This is a similar but separate problem. It's similar in that interrupt
stacks can splice subsystems together in terms of stack usage.
> - Get rid of 4K stacks completely
Why would we *not* do this? I can't remember the original reasoning
behind 4K stacks but am guessing it helped fork-orientated workloads in
startup times in the days before lumpy reclaim and better fragmentation
control.
Who typically enables this option?
> - Think about any other stackings that could give large scale recursion
> and find ways to run them on separate stacks too.
The patch series I threw up about reducing stack was a cut-down
approach. Instead of using separate stacks, keep the stack usage out of
the main caller path where possible.
> - Long term: maybe we need 16K stacks at some point, depending on how
> good the VM gets. Alternative would be to stop making Linux more complicated,
> but that's unlikely to happen.
>
Make this Plan D if nothing else works out and we still hit a wall?
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
Yep, that is not being disputed. By the way, what did you use to
generate your report? Was it CONFIG_DEBUG_STACK_USAGE or something else?
I used a modified bloat-o-meter to gather my data but it'd be nice to
be sure I'm seeing the same things as you (minus XFS unless I
specifically set it up).
> The old i386/4k stack problem is a red
> herring.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> da...@fromorbit.com
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
I must be blind. What tree is this in? I can't see it in v2.6.34-rc4,
mmotm or google.
> I mostly meant that the bdi helper threads were the best place to add
> knowledge about which pages we want to write for reclaim. We might need
> to add a thread dedicated to just doing the VM's dirty work, but that's
> where I would start discussing fancy new interfaces.
>
> >
> > > what I'm
> > > pointing out is that the arguments that it is too hard or there are
> > > no interfaces available to issue larger IO from reclaim are not at
> > > all valid.
> > >
> >
> > Sure, I'm not resisting fixing this, just your first patch :) There are four
> > goals here
> >
> > 1. Reduce stack usage
> > 2. Avoid the splicing of subsystem stack usage with direct reclaim
> > 3. Preserve lumpy reclaims cleaning of contiguous pages
> > 4. Try and not drastically alter LRU aging
> >
> > 1 and 2 are important for you, 3 is important for me and 4 will have to
> > be dealt with on a case-by-case basis.
> >
> > Your patch fixes 2, avoids 1, breaks 3 and haven't thought about 4 but I
> > guess dirty pages can cycle around more so it'd need to be cared for.
>
> I'd like to add one more:
>
> 5. Don't dive into filesystem locks during reclaim.
>
Good add. It's not a new problem either. This came up at least two years
ago at around the first VM/FS summit and the response was along the lines
of shuffling uncomfortably :/
> This is different from splicing code paths together, but
> the filesystem writepage code has become the center of our attempts at
> doing big fat contiguous writes on disk. We push off work as late as we
> can until just before the pages go down to disk.
>
> I'll pick on ext4 and btrfs for a minute, just to broaden the scope
> outside of XFS. Writepage comes along and the filesystem needs to
> actually find blocks on disk for all the dirty pages it has promised to
> write.
>
> So, we start a transaction, we take various allocator locks, modify
> different metadata, log changed blocks, take a break (logging is hard
> work you know, need_resched() triggered a by now), stuff it
> all into the file's metadata, log that, and finally return.
>
> Each of the steps above can block for a long time. Ext4 solves
> this by not doing them. ext4_writepage only writes pages that
> are already fully allocated on disk.
>
> Btrfs is much more efficient at not doing them, it just returns right
> away for PF_MEMALLOC.
>
> This is a long way of saying the filesystem writepage code is the
> opposite of what direct reclaim wants. Direct reclaim wants to
> find free ram now, and if it does end up in the mess describe above,
> it'll just get stuck for a long time on work entirely unrelated to
> finding free pages.
>
Ok, good summary, thanks. I was only partially aware of some of these.
i.e. I knew it was a problem but was not sensitive to how bad it was.
Your last point is interesting because lumpy reclaim for large orders under
heavy pressure can make the system stutter badly (e.g. during a huge
page pool resize). I had blamed just plain IO but messing around with
locks and transactions could have been a large factor and I didn't go
looking for it.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
haha, I don't think anyone pretends the LRU behaviour is perfect.
Altering its existing behaviour tends to be done with great care but
from what I gather that is often a case of "better the devil you know".
> magnitude faster, reclaim will find candidate pages for a whole lot
> faster. And if we have more clean pages available, faster, overall
> system throughput is going to improve and be much less likely to
> fall into deep, dark holes where the OOM-killer is the light at the
> end.....
>
> [ snip questions Chris answered ]
>
> > > what I'm
> > > pointing out is that the arguments that it is too hard or there are
> > > no interfaces available to issue larger IO from reclaim are not at
> > > all valid.
> > >
> >
> > Sure, I'm not resisting fixing this, just your first patch :) There are four
> > goals here
> >
> > 1. Reduce stack usage
> > 2. Avoid the splicing of subsystem stack usage with direct reclaim
> > 3. Preserve lumpy reclaims cleaning of contiguous pages
> > 4. Try and not drastically alter LRU aging
> >
> > 1 and 2 are important for you, 3 is important for me and 4 will have to
> > be dealt with on a case-by-case basis.
>
> #4 is important to me, too, because that has direct impact on large
> file IO workloads. however, it is gross changes in behaviour that
> concern me, not subtle, probably-in-the-noise changes that you're
> concerned about. :)
>
I'm also less concerned with this aspect. I brought it up because it was
a factor. I don't think it'll cause us problems but if problems do
arise, it's nice to have a few potential candidates to examine in
advance.
> > Your patch fixes 2, avoids 1, breaks 3 and haven't thought about 4 but I
> > guess dirty pages can cycle around more so it'd need to be cared for.
>
> Well, you keep saying that they break #3, but I haven't seen any
> test cases or results showing that. I've been unable to confirm that
> lumpy reclaim is broken by disallowing writeback in my testing, so
> I'm interested to know what tests you are running that show it is
> broken...
>
Ok, I haven't actually tested this. The machines I use are tied up
retesting the compaction patches at the moment. The reason why I reckon
it'll be a problem is that when these sync-writeback changes were
introduced, it significantly helped lumpy reclaim for huge pages. I am
making an assumption that backing out those changes will hurt it.
I'll test for real on Monday and see what falls out.
Ok, based on this, I'll stop working on the stack-reduction patches.
I'll test what I have and push it but I won't bring it further for the
moment and instead look at putting writeback into its own thread. If
someone else works on it in the meantime, I'll review and test from the
perspective of lumpy reclaim.
> > One reason why I am edgy about this is that lumpy reclaim can kick in
> > for low-enough orders too like order-1 pages for stacks in some cases or
> > order-2 pages for network cards using jumbo frames or some wireless
> > cards. The network cards in particular could still cause the stack
> > overflow but be much harder to reproduce and detect.
>
> So push lumpy reclaim into a separate thread. It already blocks, so
> waiting for some other thread to do the work won't change anything.
No, it wouldn't. As long as it can wait on the right pages, it doesn't
really matter who does the work.
> Separating high-order reclaim from LRU reclaim is probably a good
> idea, anyway - they use different algorithms and while the two are
> intertwined it's hard to optimise/improve either....
>
They are not a million miles apart either. Lumpy reclaim uses the LRU to
select a cursor page and then reclaims around it. Improvements on LRU tend
to help lumpy reclaim as well. It's why during the tests I run I can often
allocate 80-95% of memory as huge pages on x86-64 as opposed to when anti-frag
was being developed first where getting 30% was a cause for celebration :)
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
Not much, it cuts 16 bytes on x86 32 bit. The bigger gain is the code
clarification it comes with. There is too much state to keep track of
in reclaim.
I'm using the tracing subsystem to get them. Doesn't everyone use
that now? ;)
$ grep STACK .config
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
# CONFIG_CC_STACKPROTECTOR is not set
CONFIG_STACKTRACE=y
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_STACK_TRACER=y
# CONFIG_DEBUG_STACKOVERFLOW is not set
# CONFIG_DEBUG_STACK_USAGE is not set
Then:
# echo 1 > /proc/sys/kernel/stack_tracer_enabled
<run workloads>
Monitor the worst recorded stack usage as it changes via:
# cat /sys/kernel/debug/tracing/stack_trace
Depth Size Location (44 entries)
----- ---- --------
0) 5584 288 get_page_from_freelist+0x5c0/0x830
1) 5296 272 __alloc_pages_nodemask+0x102/0x730
2) 5024 48 kmem_getpages+0x62/0x160
3) 4976 96 cache_grow+0x308/0x330
4) 4880 96 cache_alloc_refill+0x27f/0x2c0
5) 4784 96 __kmalloc+0x241/0x250
6) 4688 112 vring_add_buf+0x233/0x420
......
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
--
Yeah, in my experience 7200rpm SATA will get you about 200 ops/s when you
are doing really small seeks, as the typical minimum seek time is
around 4-5ms. Average seek time, however, is usually in the range of
10ms, because full head sweep + spindle rotation seeks take on the
order of 15ms.
Hence small random IO tends to result in seek times nearer the
average seek time than the minimum, so that's what I tend to use for
determining the number of ops/s a disk will sustain.
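The arithmetic is just the reciprocal of the per-operation seek time; a
trivial back-of-the-envelope calculation using the figures quoted above
(not measurements):

#include <stdio.h>

int main(void)
{
	double min_seek_ms = 5.0;	/* really small seeks on 7200rpm SATA */
	double avg_seek_ms = 10.0;	/* typical quoted average seek time */

	/* ops/s is simply 1000ms divided by the per-op seek time. */
	printf("small seeks : ~%.0f ops/s\n", 1000.0 / min_seek_ms);	/* ~200 */
	printf("random IO   : ~%.0f ops/s\n", 1000.0 / avg_seek_ms);	/* ~100 */
	return 0;
}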
> But then I've never quite understood why our anonymous paging isn't
> sorting stuff as best it can and then using the drive as a log structure
> with in memory metadata so it can stream the pages onto disk. Read
> performance is going to be similar (maybe better if you have a log tidy
> when idle), write ought to be far better.
Sounds like a worthy project for someone to sink their teeth into.
Lots of people would like to have a system that can page out at
hundreds of megabytes a second....
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
The poor IO patterns thing is a regression. Some time several years
ago (around 2.6.16, perhaps), page reclaim started to do a LOT more
dirty-page writeback than it used to. AFAIK nobody attempted to work
out why, nor tried to fix it.
Doing writearound in pageout() might help. The kernel was in fact
doing that around 2.5.10, but I took it out again because it wasn't
obviously beneficial.
Writearound is hard to do, because direct-reclaim doesn't have an easy
way of pinning the address_space: it can disappear and get freed under
your feet. I was able to make this happen under intense MM loads. The
current page-at-a-time pageout code pins the address_space by taking a
lock on one of its pages. Once that lock is released, we cannot touch
*mapping.
And lo, the pageout() code is presently buggy:
	res = mapping->a_ops->writepage(page, &wbc);
	if (res < 0)
		handle_write_error(mapping, page, res);
The ->writepage can/will unlock the page, and we're passing a hand
grenade into handle_write_error().
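To spell out the safe pattern: once ->writepage has dropped the page
lock, the only way to touch *mapping again is to re-take the lock and
re-check that the page still belongs to that mapping. A sketch of such
a helper (illustrative only, not necessarily how mm/vmscan.c resolves
this):

#include <linux/pagemap.h>

/*
 * Sketch: record a writepage error without touching a possibly-freed
 * address_space.  lock_page(), page_mapping() and mapping_set_error()
 * are real interfaces; the helper itself is illustrative.
 */
static void record_write_error(struct address_space *mapping,
			       struct page *page, int error)
{
	lock_page(page);
	/*
	 * ->writepage unlocked the page, so it may have been truncated
	 * in the meantime.  The compare only looks at the pointer value,
	 * so nothing is dereferenced if the mapping has been torn down.
	 */
	if (page_mapping(page) == mapping)
		mapping_set_error(mapping, error);
	unlock_page(page);
}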
Any attempt to implement writearound in pageout will need to find a way
to safely pin that address_space. One way is to take a temporary ref
on mapping->host, but IIRC that introduced nasties with inode_lock.
Certainly it'll put more load on that worrisomely-singleton lock.
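The inode-reference approach would look roughly like this; igrab(),
iput() and page_mapping() are real interfaces, while do_writearound()
is a hypothetical placeholder for the clustered writeback itself (and
the final iput() is where the extra inode_lock traffic comes from):

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/writeback.h>

/* Hypothetical clustered-writeback helper. */
static void do_writearound(struct address_space *mapping, struct page *page,
			   struct writeback_control *wbc);

/*
 * Sketch only: pin the address_space across a writearound pass by
 * taking a temporary inode reference while the page lock still keeps
 * the mapping stable.
 */
static void writearound_one(struct page *page, struct writeback_control *wbc)
{
	struct address_space *mapping = page_mapping(page);
	struct inode *inode;

	if (!mapping)
		return;

	inode = igrab(mapping->host);	/* fails if the inode is being freed */
	if (!inode)
		return;

	do_writearound(mapping, page, wbc);

	iput(inode);			/* the extra inode_lock traffic */
}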
Regarding simply not doing any writeout in direct reclaim (Dave's
initial proposal): the problem is that pageout() will clean a page in
the target zone. Normal writeout won't do that, so we could get into a
situation where vast amounts of writeout are happening, but none of it
is cleaning pages in the zone which we're trying to allocate from.
It's quite possibly livelockable, too.
Doing writearound (if we can get it going) will solve that adequately
(assuming that the target page gets reliably written), but it won't
help the stack usage problem.
To solve the IO-pattern thing I really do think we should first work
out ytf we started doing much more IO off the LRU. What caused it? Is
it really unavoidable?
To solve the stack-usage thing: dunno, really. One could envisage code
which skips pageout() if we're using more than X amount of stack, but
that sucks. Another possibility might be to hand the target page over
to another thread (I suppose kswapd will do) and then synchronise with
that thread - get_page()+wait_on_page_locked() is one way. The helper
thread could of course do writearound.
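Concretely, the get_page()+wait_on_page_locked() hand-off could look
something like the sketch below; queue_page_for_helper() is a
hypothetical queueing primitive for passing the page to kswapd (or
whatever helper thread ends up doing the work):

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/errno.h>

/* Hypothetical: queue a locked page for a helper thread to write out. */
static bool queue_page_for_helper(struct page *page);

/*
 * Sketch only: direct reclaim pins the page, hands it over and waits
 * for the helper to issue the writeout and unlock it.  Called with the
 * page locked; lock ownership passes to the helper on success.
 */
static int hand_off_pageout(struct page *page)
{
	get_page(page);			/* keep the page alive across the wait */

	if (!queue_page_for_helper(page)) {
		put_page(page);
		return -EBUSY;		/* caller just skips the page */
	}

	wait_on_page_locked(page);	/* helper unlocks after ->writepage */
	put_page(page);
	return 0;
}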
I just know that we XFS guys have been complaining about it a lot.
But that was mostly a tuning issue - before, writeout mostly happened
from pdflush. If we got into kswapd or direct reclaim we already
saw horrible I/O patterns - it just happened far less often.
> Regarding simply not doing any writeout in direct reclaim (Dave's
> initial proposal): the problem is that pageout() will clean a page in
> the target zone. Normal writeout won't do that, so we could get into a
> situation where vast amounts of writeout is happening, but none of it
> is cleaning pages in the zone which we're trying to allocate from.
> It's quite possibly livelockable, too.
As Chris mentioned currently btrfs and ext4 do not actually do delalloc
conversions from this path, so for typical workloads the amount of
writeout that can happen from this path is extremely limited. And unless
we get things fixed we will have to do the same for XFS. I'd be much
happier if we could just sort it out at the VM level, because that
means we have one sane place for this kind of policy instead of three
or more hacks down inside the filesystems. It's rather interesting
that everyone on the modern fs side agrees completely on what the
problem is here, but it seems rather hard to convince the VM side to do
anything about it.
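The guard those filesystems use amounts to something like the following
in their ->writepage; PF_MEMALLOC, redirty_page_for_writepage() and
unlock_page() are real interfaces, while needs_delalloc_conversion()
and do_real_writepage() are hypothetical placeholders:

#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/writeback.h>

/* Hypothetical: does writing this page require block allocation? */
static bool needs_delalloc_conversion(struct page *page);
/* Hypothetical: the filesystem's normal writeout path. */
static int do_real_writepage(struct page *page, struct writeback_control *wbc);

/*
 * Sketch of the ->writepage guard that refuses to do delayed-allocation
 * conversion from reclaim context (PF_MEMALLOC is set for both direct
 * reclaim and kswapd): redirty the page and leave it to the flushers.
 */
static int example_writepage(struct page *page, struct writeback_control *wbc)
{
	if ((current->flags & PF_MEMALLOC) && needs_delalloc_conversion(page)) {
		redirty_page_for_writepage(wbc, page);
		unlock_page(page);
		return 0;
	}

	return do_real_writepage(page, wbc);
}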
> To solve the stack-usage thing: dunno, really. One could envisage code
> which skips pageout() if we're using more than X amount of stack, but
> that sucks.
And it doesn't solve other issues, like the whole lock taking problem.
> Another possibility might be to hand the target page over
> to another thread (I suppose kswapd will do) and then synchronise with
> that thread - get_page()+wait_on_page_locked() is one way. The helper
> thread could of course do writearound.
Allowing the flusher threads to do targeted writeout would be the
best from the FS POV. We'll still have one source of the I/O, just
with another knob for selecting the exact region to write out.
We can still synchronously wait for the I/O for lumpy reclaim if really
necessary.
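For what it's worth, struct writeback_control already carries such a
range knob: range_start/range_end restrict writeback to a byte range of
the file. A request from reclaim could be expressed roughly like this
sketch; how it reaches the flusher thread (queue_targeted_writeback()
here) is the hypothetical part:

#include <linux/fs.h>
#include <linux/writeback.h>
#include <linux/pagemap.h>

/* Hypothetical: hand a ranged writeback request to the flusher thread. */
static void queue_targeted_writeback(struct address_space *mapping,
				     struct writeback_control *wbc);

/*
 * Sketch only: instead of calling ->writepage from reclaim, describe
 * the region around the target page and ask the flusher to clean it.
 */
static void ask_flusher_for_page(struct address_space *mapping,
				 struct page *page)
{
	loff_t start = (loff_t)page->index << PAGE_SHIFT;
	struct writeback_control wbc = {
		.sync_mode	= WB_SYNC_NONE,
		.nr_to_write	= 1024,		/* clean a cluster, not one page */
		.range_start	= start,
		.range_end	= start + (1024 * PAGE_SIZE) - 1,
	};

	queue_targeted_writeback(mapping, &wbc);
}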
> On Sat, Apr 17, 2010 at 08:32:39PM -0400, Andrew Morton wrote:
>> The poor IO patterns thing is a regression. Some time several years
>> ago (around 2.6.16, perhaps), page reclaim started to do a LOT more
>> dirty-page writeback than it used to. AFAIK nobody attempted to work
>> out why, nor attempted to try to fix it.
>
> I just know that we XFS guys have been complaining about it a lot..
The ext3 and reiserfs guys complained about this issue as well.
--
Best Regards
Sorin Faibish
Corporate Distinguished Engineer
Network Storage Group
EMC²
where information lives
Phone: 508-435-1000 x 48545
Cellphone: 617-510-0422
Email : sfai...@emc.com