Oops. Make that "only 128 MB".
All of the failed allocations seem to be GFP_ATOMIC so it's not _that_
strange. Dunno if anything changed recently. What's the last known
good kernel for you?
It's still very ugly though. And I would say it should be unnecessary.
> Dunno if anything changed recently. What's the last known good kernel for
> you?
I've not used that box very intensively in the past, but I first saw the
allocation failure with aptitude with either .31 or .32. I would be
extremely surprised if I could reproduce the problem with .30.
And I have done large rsyncs to the box without any problems in the past,
but that must have been with .24 or so kernels.
It seems likely to me that it's related to all the other swap and
allocation issues we've been seeing after .30.
Thanks,
FJP
> On Friday 26 February 2010, Pekka Enberg wrote:
> > > Isn't it a bit strange that cache claims so much memory that real
> > > processes get into allocation failures?
> >
> > All of the failed allocations seem to be GFP_ATOMIC so it's not _that_
> > strange.
>
> It's still very ugly though. And I would say it should be unnecessary.
>
> > Dunno if anything changed recently. What's the last known good kernel for
> > you?
>
> I've not used that box very intensively in the past, but I first saw the
> allocation failure with aptitude with either .31 or .32. I would be
> extremely surprised if I could reproduce the problem with .30.
> And I have done large rsyncs to the box without any problems in the past,
> but that must have been with .24 or so kernels.
>
> It seems likely to me that it's related to all the other swap and
> allocation issues we've been seeing after .30.
Hmm... How large is the allocation that fails? SLUB can always fall back to
order-0 allocations if the object is < PAGE_SIZE. SLAB cannot do so once it has
decided to use a higher-order page allocation for a kmalloc cache.
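For illustration, this is roughly what the two allocators do here (paraphrased
from the 2.6.3x sources, not verbatim):

	/* SLUB, allocate_slab(): the first attempt at the preferred order is
	 * opportunistic; on failure it retries at the cache's minimum order,
	 * so a sub-page object can still be served from an order-0 page. */
	page = alloc_slab_page(flags | __GFP_NOWARN | __GFP_NORETRY, node, oo);
	if (unlikely(!page)) {
		oo = s->min;
		page = alloc_slab_page(flags, node, oo);
	}

	/* SLAB, cache_grow()/kmem_getpages(): always allocates at the order
	 * chosen when the cache was created; there is no lower-order fallback,
	 * so an order-1 request either succeeds or the allocation fails. */
	page = alloc_pages_node(nodeid, flags, cachep->gfporder);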
This is CONFIG_SLAB=y, actually. There are two different call sites. The first
one is tty_buffer_request_room():
> aptitude: page allocation failure. order:1, mode:0x20
> [<c0029bcc>] (unwind_backtrace+0x0/0xd4) from [<c007ca18>] (__alloc_pages_nodemask+0x4ac/0x510)
> [<c007ca18>] (__alloc_pages_nodemask+0x4ac/0x510) from [<c0099f84>] (cache_alloc_refill+0x260/0x52c)
> [<c0099f84>] (cache_alloc_refill+0x260/0x52c) from [<c009a2e0>] (__kmalloc+0x90/0xd4)
> [<c009a2e0>] (__kmalloc+0x90/0xd4) from [<c0165640>] (tty_buffer_request_room+0x88/0x128)
> [<c0165640>] (tty_buffer_request_room+0x88/0x128) from [<c0165838>] (tty_insert_flip_string+0x24/0x84)
> [<c0165838>] (tty_insert_flip_string+0x24/0x84) from [<c016652c>] (pty_write+0x30/0x50)
> [<c016652c>] (pty_write+0x30/0x50) from [<c0161d84>] (n_tty_write+0x234/0x394)
> [<c0161d84>] (n_tty_write+0x234/0x394) from [<c015f594>] (tty_write+0x190/0x234)
> [<c015f594>] (tty_write+0x190/0x234) from [<c009d9e0>] (vfs_write+0xb0/0x1a4)
> [<c009d9e0>] (vfs_write+0xb0/0x1a4) from [<c009dfa8>] (sys_write+0x3c/0x68)
> [<c009dfa8>] (sys_write+0x3c/0x68) from [<c0023e00>] (ret_fast_syscall+0x0/0x28)
> Mem-info:
> Normal per-cpu:
> CPU 0: hi: 42, btch: 7 usd: 29
> active_anon:2455 inactive_anon:2471 isolated_anon:0
> active_file:16088 inactive_file:7021 isolated_file:0
> unevictable:0 dirty:14 writeback:0 unstable:0
> free:555 slab_reclaimable:1371 slab_unreclaimable:746
> mapped:4960 shmem:40 pagetables:102 bounce:0
> Normal free:2220kB min:1440kB low:1800kB high:2160kB active_anon:9820kB inactive_anon:9884kB active_file:64352kB inactive_file:28084kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:130048kB mlocked:0kB dirty:56kB writeback:0kB mapped:19840kB shmem:160kB slab_reclaimable:5484kB slab_unreclaimable:2984kB kernel_stack:520kB pagetables:408kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> lowmem_reserve[]: 0 0
> Normal: 493*4kB 25*8kB 3*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 2220kB
> 23343 total pagecache pages
> 192 pages in swap cache
> Total swap = 979924kB
> 32768 pages of RAM
> 709 free pages
> 1173 reserved pages
> 2117 slab pages
> 13703 pages shared
> 192 pages swap cached
and the second one is sk_prot_alloc():
> sshd: page allocation failure. order:1, mode:0x20
> [<c0029bcc>] (unwind_backtrace+0x0/0xd4) from [<c007ca18>] (__alloc_pages_nodemask+0x4ac/0x510)
> [<c007ca18>] (__alloc_pages_nodemask+0x4ac/0x510) from [<c0099f84>] (cache_alloc_refill+0x260/0x52c)
> [<c0099f84>] (cache_alloc_refill+0x260/0x52c) from [<c009a378>] (kmem_cache_alloc+0x54/0x94)
> [<c009a378>] (kmem_cache_alloc+0x54/0x94) from [<c01db500>] (sk_prot_alloc+0x28/0xfc)
> [<c01db500>] (sk_prot_alloc+0x28/0xfc) from [<c01dbb88>] (sk_clone+0x18/0x1e0)
> [<c01dbb88>] (sk_clone+0x18/0x1e0) from [<c0211608>] (inet_csk_clone+0x14/0x9c)
> [<c0211608>] (inet_csk_clone+0x14/0x9c) from [<c02258b4>] (tcp_create_openreq_child+0x1c/0x3b0)
> [<c02258b4>] (tcp_create_openreq_child+0x1c/0x3b0) from [<c0223df0>] (tcp_v4_syn_recv_sock+0x4c/0x17c)
> [<c0223df0>] (tcp_v4_syn_recv_sock+0x4c/0x17c) from [<c0225738>] (tcp_check_req+0x288/0x3e8)
> [<c0225738>] (tcp_check_req+0x288/0x3e8) from [<c02232a0>] (tcp_v4_do_rcv+0xa4/0x1c4)
> [<c02232a0>] (tcp_v4_do_rcv+0xa4/0x1c4) from [<c0225138>] (tcp_v4_rcv+0x4cc/0x788)
> [<c0225138>] (tcp_v4_rcv+0x4cc/0x788) from [<c0208308>] (ip_local_deliver_finish+0x158/0x220)
> [<c0208308>] (ip_local_deliver_finish+0x158/0x220) from [<c020818c>] (ip_rcv_finish+0x380/0x3a4)
> [<c020818c>] (ip_rcv_finish+0x380/0x3a4) from [<c01e6914>] (netif_receive_skb+0x494/0x4e4)
> [<c01e6914>] (netif_receive_skb+0x494/0x4e4) from [<bf022e78>] (mv643xx_eth_poll+0x458/0x5d0 [mv643xx_eth])
> [<bf022e78>] (mv643xx_eth_poll+0x458/0x5d0 [mv643xx_eth]) from [<c01e963c>] (net_rx_action+0x78/0x184)
> [<c01e963c>] (net_rx_action+0x78/0x184) from [<c0044258>] (__do_softirq+0x78/0x10c)
> [<c0044258>] (__do_softirq+0x78/0x10c) from [<c0023074>] (asm_do_IRQ+0x74/0x94)
> [<c0023074>] (asm_do_IRQ+0x74/0x94) from [<c0023c20>] (__irq_usr+0x40/0x80)
> Exception stack(0xc22ebfb0 to 0xc22ebff8)
> bfa0: 0b08609e 2a07f1b8 3ea285e7 4016a094
> bfc0: f141ed11 4016a30c d81533a7 4016a30c 4016a258 4016a430 00000011 6dc729a1
> bfe0: 71a5db23 bee60c88 400cc1dc 400cbe38 20000010 ffffffff
> Mem-info:
> Normal per-cpu:
> CPU 0: hi: 42, btch: 7 usd: 18
> active_anon:2646 inactive_anon:3510 isolated_anon:0
> active_file:4422 inactive_file:17658 isolated_file:0
> unevictable:0 dirty:700 writeback:0 unstable:0
> free:496 slab_reclaimable:962 slab_unreclaimable:895
> mapped:1512 shmem:11 pagetables:138 bounce:0
> Normal free:1984kB min:1440kB low:1800kB high:2160kB active_anon:10584kB inactive_anon:14040kB active_file:17688kB inactive_file:70632kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:130048kB mlocked:0kB dirty:2800kB writeback:0kB mapped:6048kB shmem:44kB slab_reclaimable:3848kB slab_unreclaimable:3580kB kernel_stack:552kB pagetables:552kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> lowmem_reserve[]: 0 0
> Normal: 462*4kB 3*8kB 5*16kB 1*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1984kB
> 23048 total pagecache pages
> 956 pages in swap cache
> Swap cache stats: add 6902, delete 5946, find 190630/191220
> Free swap = 974116kB
> Total swap = 979924kB
> 32768 pages of RAM
> 660 free pages
> 1173 reserved pages
> 1857 slab pages
> 23999 pages shared
> 956 pages swap cached
AFAICT, even in the worst case, the latter call site is well below 4K.
I have no idea about the tty one.
AFAIK, tty_buffer_request_room() tries to expand its buffer size for efficiency,
but a failure there doesn't cause any user-visible failure, so we could probably
mark it __GFP_NOWARN.
In the worst case the maximum tty buffer size is 64K, so it can trigger
allocation failures easily.
Alan, can you please give us your opinion?
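If that is the way to go, the change would presumably be tiny. An untested
sketch against the 2.6.33 tty_buffer_alloc(), which currently allocates its
buffer with plain GFP_ATOMIC:

	/* Untested sketch: suppress the allocation-failure warning for this
	 * opportunistic buffer allocation in tty_buffer_alloc(). */
	p = kmalloc(sizeof(struct tty_buffer) + 2 * size,
		    GFP_ATOMIC | __GFP_NOWARN);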
(Added Greg as current tty maintainer)
For reasons that are not particularly clear to me, tty_buffer_alloc() is
called far more frequently in 2.6.33 than in 2.6.24. I instrumented the
function to print out the size of the buffers allocated, booted under
qemu, and simply ran "cat /bin/ls" to see what buffers were allocated.
2.6.33 allocates loads of them, including high-order allocations; 2.6.24
appeared to allocate once and then keep silent.
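Something along the following lines at the top of tty_buffer_alloc() reproduces
the experiment (a hypothetical reconstruction, not the actual debug patch):

	/* Hypothetical debug instrumentation: log every tty buffer allocation
	 * and the resulting kmalloc size. */
	printk(KERN_DEBUG "tty_buffer_alloc: size=%zu kmalloc=%zu\n",
	       size, sizeof(struct tty_buffer) + 2 * size);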
While there have been snags recently with respect to high-order
allocation failures in recent kernels, this might be one of the cases
where it's due to subsystems requesting high-order allocations more.
Anyone familiar with tty that might make a guess as to why it allocates
more aggressively?
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
The pty layer is using them now and didn't before. That will massively
distort your numbers.
> While there have been snags recently with respect to high-order
> allocation failures in recent kernels, this might be one of the cases
> where it's due to subsystems requesting high-order allocations more.
The pty code certainly triggered more such allocations. I've sent Greg
patches to make the tty buffering layer allocate sensible sizes as it
doesn't need multiple page allocations in the first place.
Alan
That makes perfect sense. It explains why only one allocation showed up
because it must belong to the tty attached to the serial console.
Thanks Alan.
> > While there have been snags recently with respect to high-order
> > allocation failures in recent kernels, this might be one of the cases
> > where it's due to subsystems requesting high-order allocations more.
>
> The pty code certainly triggered more such allocations. I've sent Greg
> patches to make the tty buffering layer allocate sensible sizes as it
> doesn't need multiple page allocations in the first place.
>
Greg, what's the story with these patches?
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
They are in -next and will go to Linus later on today for .34.
thanks,
greg k-h
So, Greg pointed me at the patch in question in linux-next
[c9cf55b: tty: Keep the default buffering to sub-page units]
It's attached for convenience.
However, this patch on its own does not appear to be enough. When rebased to
.33, it's still possible for the TTY layer to require order-1 allocations, so
I doubt it would fix Frans's problem on its own. The problem is that
TTY_BUFFER_PAGE takes struct tty_buffer into account but not the additional
padding added by tty_buffer_find().
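For concreteness, on a 4K-page machine and assuming sizeof(struct tty_buffer)
is about 64 bytes (the exact figure is architecture-dependent), the numbers
work out as follows; tty_buffer_alloc() allocates the header plus 2 * size for
the data and flag bytes:

  patch alone:  TTY_BUFFER_PAGE = (4096 - 256) / 2 = 1920
                tty_buffer_find() rounds 1920 up to 2048
                kmalloc(64 + 2 * 2048) = 4160 > 4096   -> order-1

  adjusted:     TTY_BUFFER_PAGE = ((4096 - 64) / 2) & ~0xFF = 1792
                already a multiple of 256, so no rounding
                kmalloc(64 + 2 * 1792) = 3648 <= 4096  -> order-0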
As it's not clear why the "Round the buffer size out" step is required, I took
a simple approach and adjusted TTY_BUFFER_PAGE rather than being clever in
tty_buffer.c. This keeps the allocation sizes below a page, but could it be done
better, or did I miss another patch in linux-next that makes this unnecessary?
==== CUT HERE ===
tty: Take a 256 byte padding into account when buffering below sub-page units
The TTY layer takes some care to ensure that only sub-page allocations
are made with interrupts disabled. It does this by setting a goal of
"TTY_BUFFER_PAGE" to allocate. Unfortunately, while TTY_BUFFER_PAGE takes the
size of tty_buffer into account, it fails to account that tty_buffer_find()
rounds the buffer size out to the next 256 byte boundary before adding on
the size of the tty_buffer.
This patch adjusts the TTY_BUFFER_PAGE calculation to take into account the
size of the tty_buffer and the padding. Once applied, tty_buffer_alloc()
should not require high-order allocations.
Signed-off-by: Mel Gorman <m...@csn.ul.ie>
---
include/linux/tty.h | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/include/linux/tty.h b/include/linux/tty.h
index d96e588..8fe018b 100644
--- a/include/linux/tty.h
+++ b/include/linux/tty.h
@@ -70,12 +70,13 @@ struct tty_buffer {
/*
* We default to dicing tty buffer allocations to this many characters
- * in order to avoid multiple page allocations. We assume tty_buffer itself
- * is under 256 bytes. See tty_buffer_find for the allocation logic this
- * must match
+ * in order to avoid multiple page allocations. We know the size of
+ * tty_buffer itself but it must also be taken into account that the
+ * the buffer is 256 byte aligned. See tty_buffer_find for the allocation
+ * logic this must match
*/
-#define TTY_BUFFER_PAGE ((PAGE_SIZE - 256) / 2)
+#define TTY_BUFFER_PAGE (((PAGE_SIZE - sizeof(struct tty_buffer)) / 2) & ~0xFF)
struct tty_bufhead {
Yes, agreed. I missed a '-1'.
Alan
Thanks.
Frans, would you mind testing your NAS box with the following patch applied
please? It should apply cleanly on top of 2.6.33-rc7. Thanks
==== CUT HERE ====
tty: Keep the default buffering to sub-page units
We allocate during interrupts so while our buffering is normally diced up
small anyway on some hardware at speed we can pressure the VM excessively
for page pairs. We don't really need big buffers to be linear so don't try
so hard.
In order to make this work well we will tidy up excess callers to request_room,
which cannot itself enforce this break up.
[m...@csn.ul.ie: Adjust TTY_BUFFER_PAGE to take padding into account]
Signed-off-by: Alan Cox <al...@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gre...@suse.de>
---
drivers/char/tty_buffer.c | 6 ++++--
include/linux/tty.h | 11 +++++++++++
2 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/drivers/char/tty_buffer.c b/drivers/char/tty_buffer.c
index 66fa4e1..f27c4d6 100644
--- a/drivers/char/tty_buffer.c
+++ b/drivers/char/tty_buffer.c
@@ -247,7 +247,8 @@ int tty_insert_flip_string(struct tty_struct *tty, const unsigned char *chars,
{
int copied = 0;
do {
- int space = tty_buffer_request_room(tty, size - copied);
+ int goal = min(size - copied, TTY_BUFFER_PAGE);
+ int space = tty_buffer_request_room(tty, goal);
struct tty_buffer *tb = tty->buf.tail;
/* If there is no space then tb may be NULL */
if (unlikely(space == 0))
@@ -283,7 +284,8 @@ int tty_insert_flip_string_flags(struct tty_struct *tty,
{
int copied = 0;
do {
- int space = tty_buffer_request_room(tty, size - copied);
+ int goal = min(size - copied, TTY_BUFFER_PAGE);
+ int space = tty_buffer_request_room(tty, goal);
struct tty_buffer *tb = tty->buf.tail;
/* If there is no space then tb may be NULL */
if (unlikely(space == 0))
diff --git a/include/linux/tty.h b/include/linux/tty.h
index 6abfcf5..42f2076 100644
--- a/include/linux/tty.h
+++ b/include/linux/tty.h
@@ -68,6 +68,17 @@ struct tty_buffer {
unsigned long data[0];
};
+/*
+ * We default to dicing tty buffer allocations to this many characters
+ * in order to avoid multiple page allocations. We know the size of
+ * tty_buffer itself but it must also be taken into account that the
+ * the buffer is 256 byte aligned. See tty_buffer_find for the allocation
+ * logic this must match
+ */
+
+#define TTY_BUFFER_PAGE (((PAGE_SIZE - sizeof(struct tty_buffer)) / 2) & ~0xFF)
+
+
struct tty_bufhead {
struct delayed_work work;
spinlock_t lock;
Wow, great! :)
Thanks Mel.
I've been running with this patch for about a week now and have so far not
seen any more allocation failures. I've tried doing large rsyncs a few
times.
It's not 100% conclusive, but I would say it improves things and I've
certainly not noticed any issues with the patch.
Before I got the patch I noticed that the default value for
vm.min_free_kbytes was only 1442 for this machine. Isn't that on the low
side? Could that have been a factor?
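(For reference, the default is derived at boot in mm/page_alloc.c from the
amount of low memory; roughly, in paraphrase:

	/* Paraphrase of how mm/page_alloc.c picks the boot-time default: */
	min_free_kbytes = int_sqrt(lowmem_kbytes * 16);
	if (min_free_kbytes < 128)
		min_free_kbytes = 128;
	if (min_free_kbytes > 65536)
		min_free_kbytes = 65536;
	/* With ~128 MB of lowmem: int_sqrt(130048 * 16) = 1442, matching the
	 * value observed here. */
)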
My concern is that, although fixing bugs in GFP_ATOMIC allocations is
certainly very good, I can't help wondering why the system does not keep a
bit more memory in reserve instead of using everything up for relatively
silly things like cache and buffers.
What if during an rsync I plug in some USB device whose driver has some
valid GFP_ATOMIC allocations? Shouldn't the memory manager allow for such
situations?
Cheers,
FJP
> tty: Keep the default buffering to sub-page units
>
> We allocate during interrupts so while our buffering is normally diced
> up small anyway on some hardware at speed we can pressure the VM
> excessively for page pairs. We don't really need big buffers to be
> linear so don't try so hard.
>
> In order to make this work well we will tidy up excess callers to
> request_room, which cannot itself enforce this break up.
>
> [m...@csn.ul.ie: Adjust TTY_BUFFER_PAGE to take padding into account]
> Signed-off-by: Alan Cox <al...@linux.intel.com>
> Signed-off-by: Greg Kroah-Hartman <gre...@suse.de>
Tested-by: Frans Pop <f...@planet.nl>