A virtual compound allocation first attempts to satisfy the request
with physically contiguous memory. If that is not possible then a
virtually contiguous mapping is created instead.
This has two advantages:
1. Current uses of vmalloc can be converted to allocate virtual
compounds instead. In most cases physically contiguous
memory can be used, which avoids the vmalloc performance
penalty. See f.e. the e1000 driver patch.
2. Uses of higher order allocations (stacks, buffers etc) can be
converted to use virtual compounds instead. Physically contiguous
memory will still be used for those higher order allocs in general
but the system can degrade to the use of vmalloc should memory
become heavily fragmented.
There is a compile time option to switch on fallback for testing
purposes. Virtually mapped memory may behave differently and the
CONFIG_VFALLBACK_ALWAYS option will ensure that the code is tested
to deal with virtual memory.
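To make the strategy concrete, here is a minimal sketch of the fallback
logic described above (illustrative only, not the code from this series;
the real implementation lives in mm/vmalloc.c):

	/*
	 * Hedged sketch of the vcompound fallback strategy: try to get a
	 * physically contiguous compound page without retrying hard, and
	 * fall back to a vmalloc mapping if memory is too fragmented.
	 */
	static void *vcompound_alloc_sketch(gfp_t gfp, int order)
	{
		struct page *page;

		page = alloc_pages(gfp | __GFP_COMP | __GFP_NORETRY |
					__GFP_NOWARN, order);
		if (page)
			return page_address(page);

		/* Fragmented: fall back to a virtually contiguous area. */
		return __vmalloc(PAGE_SIZE << order, gfp, PAGE_KERNEL);
	}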
V2->V3:
- Put the code into mm/vmalloc.c and leave the page allocator alone.
- Add a series of examples where virtual compound pages can be used.
- Diffed on top of the page flags and the vmalloc info patches
already in mm.
- Simplify things by omitting some of the more complex code
that used to be in there.
V1->V2:
- Remove some cleanup patches and the SLUB patches from this set.
- Transparent vcompound support through page_address() and
virt_to_head_page().
- Additional use cases.
- Factor the code better for an easier read
- Add configurable stack size.
- Follow up on various suggestions made for V1
RFC->V1:
- Complete support for all compound functions for virtual compound pages
(including the compound_nth_page() necessary for LBS mmap support)
- Fix various bugs
- Fix i386 build
But isn't it defeating the purpose of this *particular* vmalloc() use?
CONFIG_NUMA and vmalloc() at boot time means:
Try to distribute the pages on several nodes.
Memory pressure on ehash_locks[] is so high we definitely want to spread it.
(for similar uses of vmalloc(), see also hashdist=1)
Also, please CC netdev for network patches :)
Thank you
> But isn't it defeating the purpose of this *particular* vmalloc() use?
I thought that was controlled by hashdist? I did not see it used here and
so I assumed that round-robin placement was not intended here.
> This allows fallback for order 1 stack allocations. In the fallback
> scenario the stacks will be virtually mapped.
>
> Signed-off-by: Christoph Lameter <clam...@sgi.com>
I would be very careful with this especially on IA64.
If the TLB miss or other low-level trap handler depends upon being
able to dereference thread info, task struct, or kernel stack stuff
without causing a fault outside of the linear PAGE_OFFSET area, this
patch will cause problems.
It will be difficult to debug the kinds of crashes this will cause
too.
> But isn't it defeating the purpose of this *particular* vmalloc() use?
>
> CONFIG_NUMA and vmalloc() at boot time means :
>
> Try to distribute the pages on several nodes.
>
> Memory pressure on ehash_locks[] is so high we definitely want to spread it.
>
> (for similar uses of vmalloc(), see also hashdist=1 )
>
> Also, please CC netdev for network patches :)
I agree with Eric, converting any of the networking hash
allocations to this new facility is not the right thing
to do.
The other networking hash tables use alloc_large_system_hash(), which handles
the hashdist setting. But this helper is __init only, so we cannot use it for
ehash_locks (which can be allocated by the DCCP module).
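For context, this is roughly how such a table is set up via that helper
(parameters representative of tcp_init() in this era, quoted from memory --
check the source before relying on them):

	tcp_hashinfo.ehash =
		alloc_large_system_hash("TCP established",
					sizeof(struct inet_ehash_bucket),
					thash_entries,
					(num_physpages >= 128 * 1024) ? 13 : 15,
					0,	/* flags */
					&tcp_hashinfo.ehash_size,
					NULL,
					thash_entries ? 0 : 512 * 1024);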
I would love to see NUMA information as well on vmallocinfo, but have
currently no available time to prepare a patch.
Counters with numbers of pages per node would be great.
(like in /proc/pid/numa_maps)
N0=2 N1=2 N2=2 N3=2
This way we could check whether hashdist is working or not, since it depends on
various NUMA policies :)
> From: Christoph Lameter <clam...@sgi.com>
> Date: Thu, 20 Mar 2008 23:17:14 -0700
>
> > This allows fallback for order 1 stack allocations. In the fallback
> > scenario the stacks will be virtually mapped.
> >
> > Signed-off-by: Christoph Lameter <clam...@sgi.com>
>
> I would be very careful with this especially on IA64.
>
> If the TLB miss or other low-level trap handler depends upon being
> able to dereference thread info, task struct, or kernel stack stuff
> without causing a fault outside of the linear PAGE_OFFSET area, this
> patch will cause problems.
>
> It will be difficult to debug the kinds of crashes this will cause
> too. [...]
another thing is that this patchset includes KERNEL_STACK_SIZE_ORDER
which has been NACK-ed before on x86 by several people and i'm nacking
this "configurable stack size" aspect of it again.
although it's not being spelled out in the changelog, i believe the
fundamental problem comes from a cpumask_t taking 512 bytes with
nr_cpus=4096, and if a few of them are on the kernel stack it can be a
problem. The correct answer is to not put them on the stack and we've
been taking patches to that end. Every other object allocator in the
kernel is able to not put stuff on the kernel stack. We _dont_ want
higher-order kernel stacks and we dont want to make a special exception
for cpumask_t either.
i believe time might be better spent increasing PAGE_SIZE on these
ridiculously large systems and making that work well with our binary
formats - instead of complicating our kernel VM with virtually mapped
buffers. That will also solve the kernel stack problem, in a very
natural way.
Ingo
hey, cool patch for sure!
I'll see if I can transpose this to e1000e and all the other drivers I maintain
which use vmalloc as well.
This one goes on my queue and I'll merge through Jeff.
Thanks Christoph!
Auke
> Signed-off-by: Christoph Lameter <clam...@sgi.com>
>
> ---
> drivers/net/e1000/e1000_main.c | 23 +++++++++++------------
> drivers/net/e1000e/netdev.c | 12 ++++++------
> 2 files changed, 17 insertions(+), 18 deletions(-)
>
> Index: linux-2.6.25-rc5-mm1/drivers/net/e1000e/netdev.c
> ===================================================================
> --- linux-2.6.25-rc5-mm1.orig/drivers/net/e1000e/netdev.c 2008-03-20 21:52:45.962733927 -0700
> +++ linux-2.6.25-rc5-mm1/drivers/net/e1000e/netdev.c 2008-03-20 21:57:27.212078371 -0700
> @@ -1083,7 +1083,7 @@ int e1000e_setup_tx_resources(struct e10
> int err = -ENOMEM, size;
>
> size = sizeof(struct e1000_buffer) * tx_ring->count;
> - tx_ring->buffer_info = vmalloc(size);
> + tx_ring->buffer_info = __alloc_vcompound(GFP_KERNEL, get_order(size));
> if (!tx_ring->buffer_info)
> goto err;
> memset(tx_ring->buffer_info, 0, size);
> @@ -1102,7 +1102,7 @@ int e1000e_setup_tx_resources(struct e10
>
> return 0;
> err:
> - vfree(tx_ring->buffer_info);
> + __free_vcompound(tx_ring->buffer_info);
> ndev_err(adapter->netdev,
> "Unable to allocate memory for the transmit descriptor ring\n");
> return err;
> @@ -1121,7 +1121,7 @@ int e1000e_setup_rx_resources(struct e10
> int i, size, desc_len, err = -ENOMEM;
>
> size = sizeof(struct e1000_buffer) * rx_ring->count;
> - rx_ring->buffer_info = vmalloc(size);
> + rx_ring->buffer_info = __alloc_vcompound(GFP_KERNEL, get_order(size));
> if (!rx_ring->buffer_info)
> goto err;
> memset(rx_ring->buffer_info, 0, size);
> @@ -1157,7 +1157,7 @@ err_pages:
> kfree(buffer_info->ps_pages);
> }
> err:
> - vfree(rx_ring->buffer_info);
> + __free_vcompound(rx_ring->buffer_info);
> ndev_err(adapter->netdev,
> "Unable to allocate memory for the transmit descriptor ring\n");
> return err;
> @@ -1204,7 +1204,7 @@ void e1000e_free_tx_resources(struct e10
>
> e1000_clean_tx_ring(adapter);
>
> - vfree(tx_ring->buffer_info);
> + __free_vcompound(tx_ring->buffer_info);
> tx_ring->buffer_info = NULL;
>
> dma_free_coherent(&pdev->dev, tx_ring->size, tx_ring->desc,
> @@ -1231,7 +1231,7 @@ void e1000e_free_rx_resources(struct e10
> kfree(rx_ring->buffer_info[i].ps_pages);
> }
>
> - vfree(rx_ring->buffer_info);
> + __free_vcompound(rx_ring->buffer_info);
> rx_ring->buffer_info = NULL;
>
> dma_free_coherent(&pdev->dev, rx_ring->size, rx_ring->desc,
> Index: linux-2.6.25-rc5-mm1/drivers/net/e1000/e1000_main.c
> ===================================================================
> --- linux-2.6.25-rc5-mm1.orig/drivers/net/e1000/e1000_main.c 2008-03-20 22:06:14.462252441 -0700
> +++ linux-2.6.25-rc5-mm1/drivers/net/e1000/e1000_main.c 2008-03-20 22:08:46.582009872 -0700
> @@ -1609,14 +1609,13 @@ e1000_setup_tx_resources(struct e1000_ad
> int size;
>
> size = sizeof(struct e1000_buffer) * txdr->count;
> - txdr->buffer_info = vmalloc(size);
> + txdr->buffer_info = __alloc_vcompound(GFP_KERNEL | __GFP_ZERO,
> + get_order(size));
> if (!txdr->buffer_info) {
> DPRINTK(PROBE, ERR,
> "Unable to allocate memory for the transmit descriptor ring\n");
> return -ENOMEM;
> }
> - memset(txdr->buffer_info, 0, size);
> -
> /* round up to nearest 4K */
>
> txdr->size = txdr->count * sizeof(struct e1000_tx_desc);
> @@ -1625,7 +1624,7 @@ e1000_setup_tx_resources(struct e1000_ad
> txdr->desc = pci_alloc_consistent(pdev, txdr->size, &txdr->dma);
> if (!txdr->desc) {
> setup_tx_desc_die:
> - vfree(txdr->buffer_info);
> + __free_vcompound(txdr->buffer_info);
> DPRINTK(PROBE, ERR,
> "Unable to allocate memory for the transmit descriptor ring\n");
> return -ENOMEM;
> @@ -1653,7 +1652,7 @@ setup_tx_desc_die:
> DPRINTK(PROBE, ERR,
> "Unable to allocate aligned memory "
> "for the transmit descriptor ring\n");
> - vfree(txdr->buffer_info);
> + __free_vcompound(txdr->buffer_info);
> return -ENOMEM;
> } else {
> /* Free old allocation, new allocation was successful */
> @@ -1826,7 +1825,7 @@ e1000_setup_rx_resources(struct e1000_ad
> int size, desc_len;
>
> size = sizeof(struct e1000_buffer) * rxdr->count;
> - rxdr->buffer_info = vmalloc(size);
> + rxdr->buffer_info = __alloc_vcompound(GFP_KERNEL, get_order(size));
> if (!rxdr->buffer_info) {
> DPRINTK(PROBE, ERR,
> "Unable to allocate memory for the receive descriptor ring\n");
> @@ -1837,7 +1836,7 @@ e1000_setup_rx_resources(struct e1000_ad
> rxdr->ps_page = kcalloc(rxdr->count, sizeof(struct e1000_ps_page),
> GFP_KERNEL);
> if (!rxdr->ps_page) {
> - vfree(rxdr->buffer_info);
> + __free_vcompound(rxdr->buffer_info);
> DPRINTK(PROBE, ERR,
> "Unable to allocate memory for the receive descriptor ring\n");
> return -ENOMEM;
> @@ -1847,7 +1846,7 @@ e1000_setup_rx_resources(struct e1000_ad
> sizeof(struct e1000_ps_page_dma),
> GFP_KERNEL);
> if (!rxdr->ps_page_dma) {
> - vfree(rxdr->buffer_info);
> + __free_vcompound(rxdr->buffer_info);
> kfree(rxdr->ps_page);
> DPRINTK(PROBE, ERR,
> "Unable to allocate memory for the receive descriptor ring\n");
> @@ -1870,7 +1869,7 @@ e1000_setup_rx_resources(struct e1000_ad
> DPRINTK(PROBE, ERR,
> "Unable to allocate memory for the receive descriptor ring\n");
> setup_rx_desc_die:
> - vfree(rxdr->buffer_info);
> + __free_vcompound(rxdr->buffer_info);
> kfree(rxdr->ps_page);
> kfree(rxdr->ps_page_dma);
> return -ENOMEM;
> @@ -2175,7 +2174,7 @@ e1000_free_tx_resources(struct e1000_ada
>
> e1000_clean_tx_ring(adapter, tx_ring);
>
> - vfree(tx_ring->buffer_info);
> + __free_vcompound(tx_ring->buffer_info);
> tx_ring->buffer_info = NULL;
>
> pci_free_consistent(pdev, tx_ring->size, tx_ring->desc, tx_ring->dma);
> @@ -2283,9 +2282,9 @@ e1000_free_rx_resources(struct e1000_ada
>
> e1000_clean_rx_ring(adapter, rx_ring);
>
> - vfree(rx_ring->buffer_info);
> + __free_vcompound(rx_ring->buffer_info);
> rx_ring->buffer_info = NULL;
> kfree(rx_ring->ps_page);
> rx_ring->ps_page = NULL;
> kfree(rx_ring->ps_page_dma);
> rx_ring->ps_page_dma = NULL;
> I agree with Eric, converting any of the networking hash
> allocations to this new facility is not the right thing
> to do.
Ok. Going to drop it.
> I would love to see NUMA information as well on vmallocinfo, but have
> currently no available time to prepare a patch.
Ok. Easy to add.
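For illustration, a minimal sketch of such per-node counters (emitting
numa_maps-style N<node>=<count> fields; names are illustrative, this is
not the patch that was eventually posted):

	static void show_numa_info(struct seq_file *m, struct vm_struct *v)
	{
		/* A real version would avoid this on-stack array on
		 * large-NUMA builds. */
		unsigned int counters[MAX_NUMNODES];
		unsigned int nr, node;

		memset(counters, 0, sizeof(counters));
		for (nr = 0; nr < v->nr_pages; nr++)
			counters[page_to_nid(v->pages[nr])]++;

		for_each_node_state(node, N_HIGH_MEMORY)
			if (counters[node])
				seq_printf(m, " N%u=%u", node, counters[node]);
	}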
> another thing is that this patchset includes KERNEL_STACK_SIZE_ORDER
> which has been NACK-ed before on x86 by several people and i'm nacking
> this "configurable stack size" aspect of it again.
Huh? Nothing of that nature is in this patchset.
> On Fri, 21 Mar 2008, Ingo Molnar wrote:
>
> > another thing is that this patchset includes KERNEL_STACK_SIZE_ORDER
> > which has been NACK-ed before on x86 by several people and i'm
> > nacking this "configurable stack size" aspect of it again.
>
> Huh? Nothing of that nature is in this patchset.
your patch indeed does not introduce it here, but
KERNEL_STACK_SIZE_ORDER shows up in the x86 portion of your patch and
you refer to multi-order stack allocations in your 0/14 mail :-)
> -#define alloc_task_struct() ((struct task_struct *)__get_free_pages(GFP_KERNEL | __GFP_COMP, KERNEL_STACK_SIZE_ORDER))
> -#define free_task_struct(tsk) free_pages((unsigned long) (tsk), KERNEL_STACK_SIZE_ORDER)
> +#define alloc_task_struct() ((struct task_struct *)__alloc_vcompound( \
> + GFP_KERNEL, KERNEL_STACK_SIZE_ORDER))
Ingo
> your patch indeed does not introduce it here, but
> KERNEL_STACK_SIZE_ORDER shows up in the x86 portion of your patch and
> you refer to multi-order stack allocations in your 0/14 mail :-)
Ahh. I see. Remnants from V2 in IA64 code. That portion has to be removed
because of the software TLB issues on IA64 as pointed out by Dave.
> Use virtual compound pages for the large swap maps. This only works for
> swap maps that are smaller than a MAX_ORDER block though. If the swap map
> is larger then there is no way around the use of vmalloc.
Have you considered the potential memory wastage from rounding up
to the next page order now? (similar in all the other patches
that change vmalloc). E.g. if the old size was 64k + 1 byte it will
suddenly get 128k now. That is actually not an uncommon situation
in my experience; there are often power-of-two buffers with
some small headers.
A long time ago (in 2.4-aa) I did something similar for module loading
as an experiment to avoid too many TLB misses. The module loader
would first try to get a contiguous range in the direct mapping and
only then fall back to vmalloc.
But I used a simple trick to avoid the waste problem: it allocated a
contiguous range rounded up to the next page-size order and then freed
the excess pages back into the page allocator. That was called
alloc_exact(). If you replace vmalloc with alloc_pages you should
use something like that too, I think.
-Andi
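For reference, a minimal sketch of the alloc_exact() idea described above
(allocate the covering order, split, free the tail; mainline later grew
alloc_pages_exact() along these lines -- this is illustrative, not the
2.4-aa code):

	static void *alloc_exact_sketch(size_t size, gfp_t gfp)
	{
		unsigned int order = get_order(size);
		unsigned long addr = __get_free_pages(gfp, order);

		if (addr) {
			unsigned long used = addr + PAGE_ALIGN(size);
			unsigned long end = addr + (PAGE_SIZE << order);

			/* Make the tail pages individually freeable ... */
			split_page(virt_to_page((void *)addr), order);
			/* ... and give the excess back to the allocator. */
			while (used < end) {
				free_page(used);
				used += PAGE_SIZE;
			}
		}
		return (void *)addr;
	}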
> > is larger then there is no way around the use of vmalloc.
>
> Have you considered the potential memory wastage from rounding up
> to the next page order now? (similar in all the other patches
> that change vmalloc). E.g. if the old size was 64k + 1 byte it will
> suddenly get 128k now. That is actually not an uncommon situation
> in my experience; there are often power-of-two buffers with
> some small headers.
Yes the larger the order the more significant the problem becomes.
> A long time ago (in 2.4-aa) I did something similar for module loading
> as an experiment to avoid too many TLB misses. The module loader
> would first try to get a contiguous range in the direct mapping and
> only then fall back to vmalloc.
>
> But I used a simple trick to avoid the waste problem: it allocated a
> contiguous range rounded up to the next page-size order and then freed
> the excess pages back into the page allocator. That was called
> alloc_exact(). If you replace vmalloc with alloc_pages you should
> use something like that too, I think.
That trick is still in use for alloc_large_system_hash....
But cutting off the tail of compound pages would make treating them as
order-N pages difficult. The vmalloc fallback situation is easy to deal
with.
Maybe we can think about making compound pages N consecutive pages
of PAGE_SIZE rather than an order-O page? The API would be a bit
different then and it would require changes to the page allocator. More
fragmentation if pages like that are freed.
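A hypothetical API along those lines (purely illustrative; nothing like
this exists in the tree):

	/* Allocate nr consecutive PAGE_SIZE pages as one compound unit;
	 * nr need not be a power of two. Hypothetical, for discussion. */
	struct page *alloc_page_run(gfp_t gfp, unsigned long nr);
	void free_page_run(struct page *page);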
> On Fri, 21 Mar 2008, David Miller wrote:
>
> > I would be very careful with this especially on IA64.
> >
> > If the TLB miss or other low-level trap handler depends upon being
> > able to dereference thread info, task struct, or kernel stack stuff
> > without causing a fault outside of the linear PAGE_OFFSET area, this
> > patch will cause problems.
>
> Hmmm. Does not sound good for arches that cannot handle TLB misses in
> hardware. I wonder how arch specific this is? Last time around I was told
> that some arches already virtually map their stacks.
I'm not saying there is a problem, I'm saying "tread lightly"
because there might be one.
The thing to do is to first validate the way that IA64
handles recursive TLB misses occurring during an initial
TLB miss, and if there are any limitations therein.
That's the kind of thing I'm talking about.
> This allows fallback for order 1 stack allocations. In the fallback
> scenario the stacks will be virtually mapped.
The traditional reason this was discouraged (people seem to reinvent
variants of this patch all the time) was that there used
to be drivers that did __pa() (or equivalent) on stack addresses
and that doesn't work with vmalloc pages.
I don't know if such drivers still exist, but such a change
is certainly not a no-brainer.
-Andi
In general, I like this patch and I found no bugs :)
> Index: linux-2.6.25-rc5-mm1/include/linux/vmalloc.h
> ===================================================================
> --- linux-2.6.25-rc5-mm1.orig/include/linux/vmalloc.h 2008-03-20 23:03:14.600588151 -0700
> +++ linux-2.6.25-rc5-mm1/include/linux/vmalloc.h 2008-03-20 23:03:14.612588010 -0700
> @@ -86,6 +86,20 @@ extern struct vm_struct *alloc_vm_area(s
> extern void free_vm_area(struct vm_struct *area);
>
> /*
> + * Support for virtual compound pages.
> + *
> + * Calls to vcompound alloc will result in the allocation of normal compound
> + * pages unless memory is fragmented. If insufficient physical linear memory
> + * is available then a virtually contiguous area of memory will be created
> + * using the vmalloc functionality.
> + */
> +struct page *alloc_vcompound_alloc(gfp_t flags, int order);
Where is alloc_vcompound_alloc() defined? I cannot find it.
> +/*
> + * Virtual Compound Page support.
> + *
> + * Virtual Compound Pages are used to fall back to order 0 allocations if large
> + * linear mappings are not available. They are formatted according to compound
> + * page conventions. I.e. following page->first_page if PageTail(page) is set
> + * can be used to determine the head page.
> + */
> +
Hmm,
IMHO we need vcompound documentation for beginners in the Documentation/
directory. Otherwise nobody will understand the meaning of the vcompound
flag in /proc/vmallocinfo.
> +void __free_vcompound(void *addr)
> +void free_vcompound(struct page *page)
> +struct page *alloc_vcompound(gfp_t flags, int order)
> +void *__alloc_vcompound(gfp_t flags, int order)
Maybe we need DocBook-style comments at the head of these 4 functions.
> Allocations of larger pages are not reliable in Linux. If larger
> pages have to be allocated then one faces various choices of allowing
> graceful fallback or using vmalloc with a performance penalty due
> to the use of a page table. Virtual Compound pages are
> a simple solution out of this dilemma.
can you document the drawback of large, frequent vmalloc() allocations at least?
On 32 bit x86, the effective vmalloc space is 64MB or so (after various PCI BARs are ioremapped),
so if this type of allocation is used for a "scales with nr of ABC" where "ABC" is workload dependent,
there's a rather abrupt upper limit to this.
Not saying that that is a flaw of your patch, just pointing out that we should discourage usage of
the "scales with nr of ABC" (for example "one for each thread") kind of things.
> The thing to do is to first validate the way that IA64
> handles recursive TLB misses occurring during an initial
> TLB miss, and if there are any limitations therein.
I am familiar with that area and I am reasonably sure that this
is an issue on IA64 under some conditions (the processor decides to spill
some registers either onto the stack or into the register backing store
during TLB processing). Recursion (in the kernel context) still expects
the stack and register backing store to be available. CCing linux-ia64 for
any thoughts to the contrary.
The move to 64k page size on IA64 is another way that this issue can be
addressed though. So I think it's best to drop the IA64 portion.
> > +struct page *alloc_vcompound_alloc(gfp_t flags, int order);
>
> where exist alloc_vcompound_alloc?
Duh... alloc_vcompound is not used at this point. That is a typo; the
trailing _alloc needs to be cut off.
> Hmm,
> IMHO we need vcompound documentation for beginners in the Documentation/
> directory. Otherwise nobody will understand the meaning of the vcompound
> flag in /proc/vmallocinfo.
Ok.
> can you document the drawback of large, frequent vmalloc() allocations at least?
Ok. Let's add some documentation about this issue and some other
things. A similar suggestion was made by Kosaki-san.
> On 32 bit x86, the effective vmalloc space is 64MB or so (after various PCI BARs are ioremapped),
> so if this type of allocation is used for a "scales with nr of ABC" where "ABC" is workload dependent,
> there's a rather abrupt upper limit to this.
> Not saying that that is a flaw of your patch, just pointing out that we should discourage usage of
> the "scales with nr of ABC" (for example "one for each thread") kind of things.
I better take out any patches that do large scale allocs then.
> The traditional reason this was discouraged (people seem to reinvent
> variants of this patch all the time) was that there used
> to be drivers that did __pa() (or equivalent) on stack addresses
> and that doesn't work with vmalloc pages.
>
> I don't know if such drivers still exist, but such a change
> is certainly not a no-brainer
I thought that had been cleaned up because some arches already have
virtually mapped stacks? This could be debugged by testing with
CONFIG_VFALLBACK_ALWAYS set, which results in a stack that is always
vmalloc'ed and thus the driver should fail.
> But I used a simple trick to avoid the waste problem: it allocated a
> contiguous range rounded up to the next page-size order and then freed
> the excess pages back into the page allocator. That was called
> alloc_exact(). If you replace vmalloc with alloc_pages you should
> use something like that too, I think.
One way of dealing with it would be to define an additional allocation
variant that allows limiting the loss. I noted that both the swap
and the wait tables vary significantly between allocations. So we could
specify an upper boundary on the loss that is acceptable. If too much memory
would be lost then use vmalloc unconditionally.
---
include/linux/vmalloc.h | 12 ++++++++----
mm/page_alloc.c | 4 ++--
mm/swapfile.c | 4 ++--
mm/vmalloc.c | 34 ++++++++++++++++++++++++++++++++++
4 files changed, 46 insertions(+), 8 deletions(-)
Index: linux-2.6.25-rc5-mm1/include/linux/vmalloc.h
===================================================================
--- linux-2.6.25-rc5-mm1.orig/include/linux/vmalloc.h 2008-03-24 12:51:47.457231129 -0700
+++ linux-2.6.25-rc5-mm1/include/linux/vmalloc.h 2008-03-24 12:52:05.449313572 -0700
@@ -88,14 +88,18 @@ extern void free_vm_area(struct vm_struc
/*
* Support for virtual compound pages.
*
- * Calls to vcompound alloc will result in the allocation of normal compound
- * pages unless memory is fragmented. If insufficient physical linear memory
- * is available then a virtually contiguous area of memory will be created
- * using the vmalloc functionality.
+ * Calls to vcompound_alloc and friends will result in the allocation of
+ * a normal physically contiguous compound page unless memory is fragmented.
+ * If insufficient physical linear memory is available then a virtually
+ * contiguous area of memory will be created using vmalloc.
*/
struct page *alloc_vcompound(gfp_t flags, int order);
+struct page *alloc_vcompound_maxloss(gfp_t flags, unsigned long size,
+ unsigned long maxloss);
void free_vcompound(struct page *);
void *__alloc_vcompound(gfp_t flags, int order);
+void *__alloc_vcompound_maxloss(gfp_t flags, unsigned long size,
+ unsigned long maxloss);
void __free_vcompound(void *addr);
struct page *vcompound_head_page(const void *x);
Index: linux-2.6.25-rc5-mm1/mm/vmalloc.c
===================================================================
--- linux-2.6.25-rc5-mm1.orig/mm/vmalloc.c 2008-03-24 12:51:47.485231279 -0700
+++ linux-2.6.25-rc5-mm1/mm/vmalloc.c 2008-03-24 12:52:05.453313419 -0700
@@ -1198,3 +1198,37 @@ void *__alloc_vcompound(gfp_t flags, int
return NULL;
}
+
+/*
+ * Functions to avoid losing memory because of the rounding up to
+ * power of two sizes for compound page allocation. If the loss would
+ * be too great then use vmalloc regardless of the fragmentation
+ * situation.
+ */
+struct page *alloc_vcompound_maxloss(gfp_t flags, unsigned long size,
+ unsigned long maxloss)
+{
+ int order = get_order(size);
+ unsigned long loss = (PAGE_SIZE << order) - size;
+ void *addr;
+
+ if (loss < maxloss)
+ return alloc_vcompound(flags, order);
+
+ addr = __vmalloc(size, flags, PAGE_KERNEL);
+ if (!addr)
+ return NULL;
+ return vmalloc_to_page(addr);
+}
+
+void *__alloc_vcompound_maxloss(gfp_t flags, unsigned long size,
+ unsigned long maxloss)
+{
+ int order = get_order(size);
+ unsigned long loss = (PAGE_SIZE << order) - size;
+
+ if (loss < maxloss)
+ return __alloc_vcompound(flags, order);
+
+ return __vmalloc(size, flags, PAGE_KERNEL);
+}
Index: linux-2.6.25-rc5-mm1/mm/swapfile.c
===================================================================
--- linux-2.6.25-rc5-mm1.orig/mm/swapfile.c 2008-03-24 12:52:05.441314302 -0700
+++ linux-2.6.25-rc5-mm1/mm/swapfile.c 2008-03-24 12:52:05.453313419 -0700
@@ -1636,8 +1636,8 @@ asmlinkage long sys_swapon(const char __
goto bad_swap;
/* OK, set up the swap map and apply the bad block list */
- if (!(p->swap_map = __alloc_vcompound(GFP_KERNEL | __GFP_ZERO,
- get_order(maxpages * sizeof(short))))) {
+ if (!(p->swap_map = __alloc_vcompound_maxloss(GFP_KERNEL | __GFP_ZERO,
+ maxpages * sizeof(short), 16 * PAGE_SIZE))) {
error = -ENOMEM;
goto bad_swap;
}
Index: linux-2.6.25-rc5-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.25-rc5-mm1.orig/mm/page_alloc.c 2008-03-24 12:52:05.389313168 -0700
+++ linux-2.6.25-rc5-mm1/mm/page_alloc.c 2008-03-24 12:52:07.493322559 -0700
@@ -2866,8 +2866,8 @@ int zone_wait_table_init(struct zone *zo
* To use this new node's memory, further consideration will be
* necessary.
*/
- zone->wait_table = __alloc_vcompound(GFP_KERNEL,
- get_order(alloc_size));
+ zone->wait_table = __alloc_vcompound_maxloss(GFP_KERNEL,
+ alloc_size, 32 * PAGE_SIZE);
}
if (!zone->wait_table)
return -ENOMEM;
> The move to 64k page size on IA64 is another way that this issue can
> be addressed though.
This is such a huge mistake I wish platforms such as powerpc and IA64
would not make such decisions so lightly.
The memory wastage is just ridiculous.
I already see several distributions moving to 64K pages for powerpc,
so I want to nip this in the bud before this monkey-see-monkey-do
thing gets any more out of hand.
> From: Christoph Lameter <clam...@sgi.com>
> Date: Mon, 24 Mar 2008 11:27:06 -0700 (PDT)
>
> > The move to 64k page size on IA64 is another way that this issue can
> > be addressed though.
>
> This is such a huge mistake I wish platforms such as powerpc and IA64
> would not make such decisions so lightly.
It's certainly not a light decision if your customer tells you that the box
is almost unusable with 16k page size. For our new 2k and 4k processor
systems this seems to be a requirement. Customers start hacking SLES10 to
run with 64k pages....
> The memory wastage is just ridiculous.
Well, yes, if you would use such a box for kernel compiles and small files
then it's a bad move. However, if you have to process terabytes of data
then this significantly reduces the VM and I/O overhead.
> I already see several distributions moving to 64K pages for powerpc,
> so I want to nip this in the bud before this monkey-see-monkey-do
> thing gets any more out of hand.
powerpc also runs HPC codes. They certainly see the same results that we
see.
Christoph is correct ... IA64 pins the TLB entry for the kernel stack
(which covers both the normal C stack and the register backing store)
so that it won't have to deal with a TLB miss on the stack while handling
another TLB miss.
-Tony
In an ideal world we'd have variable sized pages ... but
since most architectures have no h/w support for these
it may be a long time before that comes to Linux.
In a fixed page size world the right page size to use
depends on the workload and the capacity of the system.
When memory capacity is measured in hundreds of GB, then
a larger page size doesn't look so ridiculous.
-Tony
> On Mon, 24 Mar 2008, David Miller wrote:
>
> > From: Christoph Lameter <clam...@sgi.com>
> > Date: Mon, 24 Mar 2008 11:27:06 -0700 (PDT)
> >
> > > The move to 64k page size on IA64 is another way that this issue can
> > > be addressed though.
> >
> > This is such a huge mistake I wish platforms such as powerpc and IA64
> > would not make such decisions so lightly.
>
> > It's certainly not a light decision if your customer tells you that the box
> is almost unusable with 16k page size. For our new 2k and 4k processor
> systems this seems to be a requirement. Customers start hacking SLES10 to
> run with 64k pages....
We should fix the underlying problems.
I'm hitting issues on 128 cpu Niagara2 boxes, and it's all fundamental
stuff like contention on the per-zone page allocator locks.
Which is very fixable, without going to larger pages.
> powerpc also runs HPC codes. They certainly see the same results
> that we see.
There are ways to get large pages into the process address space for
compute bound tasks, without suffering the well known negative side
effects of using larger pages for everything.
> When memory capacity is measured in hundreds of GB, then
> a larger page size doesn't look so ridiculous.
We have hugepages and such for a reason. And this can be
made more dynamic and flexible, as needed.
Increasing the page size is a "stick your head in the sand"
type solution by my book.
Especially when you can make the hugepage facility stronger
and thus get what you want without the memory wastage side
effects.
> On Fri, 21 Mar 2008, Eric Dumazet wrote:
>
> > But isn't it defeating the purpose of this *particular* vmalloc() use?
>
> I thought that was controlled by hashdist? I did not see it used here and
> so I assumed that round-robin placement was not intended here.
It's intended for all of the major networking hash tables.
> From: Christoph Lameter <clam...@sgi.com>
> Date: Mon, 24 Mar 2008 11:27:06 -0700 (PDT)
>
> > The move to 64k page size on IA64 is another way that this issue can
> > be addressed though.
>
> This is such a huge mistake I wish platforms such as powerpc and IA64
> would not make such decisions so lightly.
The performance advantage of using hardware 64k pages is pretty
compelling, on a wide range of programs, and particularly on HPC apps.
> The memory wastage is just rediculious.
Depends on the distribution of file sizes you have.
> I already see several distributions moving to 64K pages for powerpc,
> so I want to nip this in the bud before this monkey-see-monkey-do
> thing gets any more out of hand.
I just tried a kernel compile on a 4.2GHz POWER6 partition with 4
threads (2 cores) and 2GB of RAM, with two kernels. One was
configured with 4kB pages and the other with 64kB pages, but they
were otherwise identically configured. Here are the times for the
same kernel compile (total time across all threads, for a fairly
full-featured config):
4kB pages: 444.051s user + 34.406s system time
64kB pages: 419.963s user + 16.869s system time
That's nearly 10% faster with 64kB pages -- on a kernel compile.
Yes, the fragmentation in the page cache can be a pain in some
circumstances, but on the whole I think the performance advantage is
worth that pain, particularly for the sort of applications that people
will tend to be running on RHEL on Power boxes.
Regards,
Paul.
> The performance advantage of using hardware 64k pages is pretty
> compelling, on a wide range of programs, and particularly on HPC apps.
Please read the rest of my responses in this thread, you
can have your HPC cake and eat it too.
Someone posted a patch recently that showed that the cdrom layer
does it. Might be more. It is hard to audit a few million lines
of driver code.
> virtually mapped stacks? This could be debugged by testing with
> CONFIG_VFALLBACK_ALWAYS set, which results in a stack that is always
> vmalloc'ed and thus the driver should fail.
It might be a subtle failure.
Maybe sparse could be taught to check for this if it happens
in a single function? (cc'ing Al who might have some thoughts
on this). Of course if it happens spread out over multiple
functions sparse wouldn't help either.
-Andi
I liked your idea of fixing compound pages to not rely on order
better. Ok, it is likely more work to implement :)
Also, if anything, preserving memory should be the default, but maybe
skippable with a __GFP_GO_FAST flag.
-Andi
Do you have some idea where the improvement mainly comes from?
Is it TLB misses or reduced in kernel overhead? Ok I assume both
play together but which part of the equation is more important?
-Andi
> > I am familiar with that area and I am reasonably sure that this
> > is an issue on IA64 under some conditions (the processor decides to spill
> > some registers either onto the stack or into the register backing store
> > during TLB processing). Recursion (in the kernel context) still expects
> > the stack and register backing store to be available. CCing linux-ia64 for
> > any thoughts to the contrary.
>
> Christoph is correct ... IA64 pins the TLB entry for the kernel stack
> (which covers both the normal C stack and the register backing store)
> so that it won't have to deal with a TLB miss on the stack while handling
> another TLB miss.
I thought the only pinned TLB entry was for the per-cpu area? How does it
pin the TLB? Is the expectation that a single TLB entry covers the complete
stack area? Is that a feature of fault handling?
> I liked your idea of fixing compound pages to not rely on order
> better. Ok, it is likely more work to implement :)
Right. It just requires a page allocator rewrite. Which is overdue
anyways given the fastpath issues. Volunteers?
> Also, if anything, preserving memory should be the default, but maybe
> skippable with a __GFP_GO_FAST flag.
Well. Guess we need a definition of preserving memory. All allocations
typically have some kind of overhead.
> We should fix the underlying problems.
>
> I'm hitting issues on 128 cpu Niagara2 boxes, and it's all fundamental
> stuff like contention on the per-zone page allocator locks.
>
> Which is very fixable, without going to larger pages.
No, it's not fixable. You are applying linear optimizations to a slowdown that
grows exponentially. Going just one order up in page size reduces the
necessary locking and handling in the kernel by 50%.
> > powerpc also runs HPC codes. They certainly see the same results
> > that we see.
>
> There are ways to get large pages into the process address space for
> compute bound tasks, without suffering the well known negative side
> effects of using larger pages for everything.
These hacks have limitations. F.e. they do not deal with I/O and
require application changes.
Not when the trick of getting a high order and returning the leftover pages
is used. I meant just updating the GFP_COMPOUND code to always
use the number of pages instead of the order so that it could deal with a compound
where the excess pages are already returned. That is not actually that
much work (I reimplemented this recently for dma alloc and it's < 20 LOC).
Of course the full rewrite would also be great, agreed :)
-Andi
> Not when the trick of getting a high order and returning the leftover pages
> is used. I meant just updating the GFP_COMPOUND code to always
> use the number of pages instead of the order so that it could deal with a compound
> where the excess pages are already returned. That is not actually that
> much work (I reimplemented this recently for dma alloc and it's < 20 LOC).
Would you post the patch here?
> Maybe sparse could be taught to check for this if it happens
> in a single function? (cc'ing Al who might have some thoughts
> on this). Of course if it happens spread out over multiple
> functions sparse wouldn't help either.
We could add debugging code to virt_to_page (or __pa) to catch these uses.
Hard to test all cases. Static checking would be better.
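A sketch of such a check (illustrative; x86 later grew something similar
under CONFIG_DEBUG_VIRTUAL):

	/*
	 * Hedged sketch of a debug check for __pa()/virt_to_page() misuse:
	 * warn when the address lies outside the linear mapping, e.g. on a
	 * vmalloc'ed stack. Names and bounds are illustrative.
	 */
	static inline unsigned long debug_phys_addr(unsigned long x)
	{
		WARN_ON(x < PAGE_OFFSET ||
			(x >= VMALLOC_START && x < VMALLOC_END));
		return x - PAGE_OFFSET;
	}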
Or just not do it? I didn't think order 1 failures were that big a problem.
-Andi
Can you do the same thing with the 4k MMU pages and 64k PAGE_SIZE?
Wouldn't that easily break out whether the advantage is from the TLB or
from less kernel overhead?
-- Dave
Pinning TLB entries on ia64 is done using TR registers with the "itr"
instruction. Currently we have the following pinned mappings:
itr[0] : maps kernel code. 64MB page at virtual 0xA000000100000000
dtr[0] : maps kernel data. 64MB page at virtual 0xA000000100000000
itr[1] : maps PAL code as required by architecture
dtr[1] : maps an area of region 7 that spans kernel stack
page size is kernel granule size (default 16M).
This mapping needs to be reset on a context switch
where we move to a stack in a different granule.
We used to use dtr[2] to map the 64K per-cpu area at 0xFFFFFFFFFFFF0000
but Ken Chen found that performance was better using a dynamically
inserted DTC entry from the Alt-TLB miss handler, which allows this
entry in the TLB to be available for generic use (on most processor
models).
-Tony
> dtr[1] : maps an area of region 7 that spans kernel stack
> page size is kernel granule size (default 16M).
> This mapping needs to be reset on a context switch
> where we move to a stack in a different granule.
Interesting.... Never realized we were doing these tricks with DTR.
> From: Paul Mackerras <pau...@samba.org>
> Date: Tue, 25 Mar 2008 14:29:55 +1100
>
> > The performance advantage of using hardware 64k pages is pretty
> > compelling, on a wide range of programs, and particularly on HPC apps.
>
> Please read the rest of my responses in this thread, you
> can have your HPC cake and eat it too.
It's not just HPC, as I pointed out, it's pretty much everything,
including kernel compiles. And "use hugepages" is a pretty inadequate
answer given the restrictions of hugepages and the difficulty of using
them. How do I get gcc to use hugepages, for instance? Using 64k
pages gives us a performance boost for almost everything without the
user having to do anything.
If the hugepage stuff was in a state where it enabled large pages to
be used for mapping an existing program, where possible, without any
changes to the executable, then I would agree with you. But it isn't,
it's a long way from that, and (as I understand it) Linus has in the
past opposed the suggestion that we should move in that direction.
Paul.
> Paul Mackerras <pau...@samba.org> writes:
> >
> > 4kB pages: 444.051s user + 34.406s system time
> > 64kB pages: 419.963s user + 16.869s system time
> >
> > That's nearly 10% faster with 64kB pages -- on a kernel compile.
>
> Do you have some idea where the improvement mainly comes from?
> Is it TLB misses or reduced in kernel overhead? Ok I assume both
> play together but which part of the equation is more important?
I think that to a first approximation, the improvement in user time
(24 seconds) is due to the increased TLB reach and reduced TLB misses,
and the improvement in system time (18 seconds) is due to the reduced
number of page faults and reductions in other kernel overheads.
As Dave Hansen points out, I can separate the two effects by having
the kernel use 64k pages at the VM level but 4k pages in the hardware
page table, which is easy since we have support for 64k base page size
on machines that don't have hardware 64k page support. I'll do that
today.
Paul.
> On Mon, 24 Mar 2008, David Miller wrote:
>
> > There are ways to get large pages into the process address space for
> > compute bound tasks, without suffering the well known negative side
> > effects of using larger pages for everything.
>
> These hacks have limitations. F.e. they do not deal with I/O and
> require application changes.
Transparent automatic hugepages are definitely doable, I don't know
why you think this requires application changes.
People want these larger pages for HPC apps.
> How do I get gcc to use hugepages, for instance?
Implementing transparent automatic usage of hugepages has been
discussed many times, it's definitely doable and other OSs have
implemented this for years.
This is what I was implying.
But there is a general problem with larger pages on systems that
don't support them natively (in hardware), depending on how the
kernel's memory manager implements them:
doubling the soft page size halves the number of effective TLB
entries on the old hardware; quadrupling the soft page size leaves
1/4 of the effective TLB entries, and so on.
Since one soft double-sized page is backed by 2 hardware-sized pages,
replacing one soft double-sized page means replacing the 2 TLB
entries that map those 2 hardware pages.
The TLB is very small; some processors have only around 24 entries!
A soft 64 KiB page backed by hardware 4 KiB pages leaves 1/16 of the
effective TLB entries: with a 24-entry TLB that is 24/16 = 1.5, so
the TLB effectively covers only a single soft 64 KiB page. Weird!
The usual soft sizes on processors without native support are 8 KiB
or 16 KiB, not more; a 24-entry TLB of hardware 4 KiB pages then
covers 12 or 6 soft pages respectively.
> page table walker for hugepages if you want to mix superpages and
> standard pages in the same region. (The long format VHPT isn't the
> panacea we'd like it to be because the hash function it uses depends
> on the page size). This means that although you have fewer TLB misses
> with larger pages, the cost of those TLB misses is three to four times
> higher than with the standard pages.
If the hugepage is more than 3 to 4 times larger than the base
page size, which it almost certainly is, it's still an enormous
win.
> Other architectures (where the page size isn't tied into the hash
> function, so the hardware walker can be used for superpages) will have
> different tradeoffs.
Right, admittedly this is just a (one of many) strange IA64 quirk.
"large" pages, or "super" pages perhaps ... but Linux "huge" pages
seem pretty hard to adapt for generic use by applications. They
are generally somewhere between a bit too big (2MB on x86) and
way too big (64MB, 256MB, 1GB or 4GB on ia64) for general use.
Right now they also suffer from making the sysadmin pick at
boot time how much memory to allocate as huge pages (while it
is possible to break huge pages into normal pages, going in
the reverse direction requires a memory defragmenter that
doesn't exist).
Making an application use huge pages as heap may be simple
(just link with a different library to provide with a different
version of malloc()) ... code, stack, mmap'd files are all
a lot harder to do transparently.
-Tony
> Making an application use huge pages as heap may be simple
> (just link with a different library to provide with a different
> version of malloc()) ... code, stack, mmap'd files are all
> a lot harder to do transparently.
The kernel should be able to do this transparently, at the
very least for the anonymous page case. It should also
be able to handle just fine chips that provide multiple
page size support, as many do.
David> From: Peter Chubb <pet...@gelato.unsw.edu.au> Date: Wed, 26 Mar
David> 2008 10:41:32 +1100
>> It's actually harder than it looks. Ian Wienand just finished his
>> Master's project in this area, so we have *lots* of data. The main
>> issue is that, at least on Itanium, you have to turn off the
>> hardware page table walker for hugepages if you want to mix
>> superpages and standard pages in the same region. (The long format
>> VHPT isn't the panacea we'd like it to be because the hash function
>> it uses depends on the page size). This means that although you
>> have fewer TLB misses with larger pages, the cost of those TLB
>> misses is three to four times higher than with the standard pages.
David> If the hugepage is more than 3 to 4 times larger than the base
David> page size, which it almost certainly is, it's still an enormous
David> win.
That depends on the access pattern. We measured a small win for some
workloads, and a small loss for others, using 4k base pages, and
allowing up to 4G superpages (the actual sizes used depended on the
size of the objects being allocated, and the amount of contiguous
memory available).
--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au ERTOS within National ICT Australia
David> From: Christoph Lameter <clam...@sgi.com> Date: Tue, 25 Mar
David> 2008 10:48:19 -0700 (PDT)
>> On Mon, 24 Mar 2008, David Miller wrote:
>>
>> > There are ways to get large pages into the process address space
>> > for compute bound tasks, without suffering the well known
>> > negative side effects of using larger pages for everything.
>>
>> These hacks have limitations. F.e. they do not deal with I/O and
>> require application changes.
David> Transparent automatic hugepages are definitely doable, I don't
David> know why you think this requires application changes.
It's actually harder than it looks. Ian Wienand just finished his
Master's project in this area, so we have *lots* of data. The main
issue is that, at least on Itanium, you have to turn off the hardware
page table walker for hugepages if you want to mix superpages and
standard pages in the same region. (The long format VHPT isn't the
panacea we'd like it to be because the hash function it uses depends
on the page size). This means that although you have fewer TLB misses
with larger pages, the cost of those TLB misses is three to four times
higher than with the standard pages. In addition, to set up a large
page takes more effort... and it turns out there are few applications
where the cost is amortised enough, so on SpecCPU for example, some
tests improved performance slightly, some got slightly worse.
What we saw was essentially that we could almost eliminate DTLB misses,
other than the first, for a huge page. For most applications, though,
the extra cost of that first miss, plus the cost of setting up the
huge page, was greater than the few hundred DTLB misses we avoided.
I'm expecting Ian to publish the full results soon.
Other architectures (where the page size isn't tied into the hash
function, so the hardware walker can be used for superpages) will have
different tradeoffs.
--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au ERTOS within National ICT Australia
> That depends on the access pattern.
Absolutely.
FWIW, I bet it helps enormously for gcc which, even for
small compiles, swims around chaotically in an 8MB pool
of GC'd memory.
Why not just repeat the PTEs for super-pages? That won't work for
huge pages, but for superpages that are a reasonable multiple (e.g.,
16-times) the base-page size, it should work nicely.
--david
--
Mosberger Consulting LLC, http://www.mosberger-consulting.com/
> Why not just repeat the PTEs for super-pages?
This is basically how we implement hugepages in the page
tables on sparc64.
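For illustration, the "repeat the PTEs" scheme looks roughly like this
(hypothetical helper; a real implementation such as sparc64's also encodes
a page-size hint in each PTE that the TLB miss handler uses to install one
large TLB entry):

	static void set_superpage_ptes(pte_t *ptep, unsigned long pfn,
				       pgprot_t prot, unsigned int order)
	{
		unsigned long i;

		/* Enter the superpage as 2^order consecutive PTEs that
		 * cover one physically contiguous block. */
		for (i = 0; i < (1UL << order); i++)
			set_pte(ptep + i, pfn_pte(pfn + i, prot));
	}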
David> On Tue, Mar 25, 2008 at 5:41 PM, Peter Chubb
David> <pet...@gelato.unsw.edu.au> wrote:
>> The main issue is that, at least on Itanium, you have to turn off
>> the hardware page table walker for hugepages if you want to mix
>> superpages and standard pages in the same region. (The long format
>> VHPT isn't the panacea we'd like it to be because the hash function
>> it uses depends on the page size).
David> Why not just repeat the PTEs for super-pages? That won't work
David> for huge pages, but for superpages that are a reasonable
David> multiple (e.g., 16-times) the base-page size, it should work
David> nicely.
You end up having to repeat PTEs to fit into Linux's page table
structure *anyway* (unless we can change Linux's page table). But
there's no place in the short format hardware-walked page table (that
reuses the leaf entries in Linux's table) for a page size. And if you
use some of the holes in the format, the hardware walker doesn't
understand it --- so you have to turn off the hardware walker for
*any* regions where there might be a superpage.
If you use the long format VHPT, you have a choice: load the
hash table with just the translation that caused the miss, load all
possible hash entries that could have caused the miss for the page, or
preload the hash table when the page is instantiated, with all
possible entries that could hash to the huge page. I don't remember
the details, but I seem to remember all these being bad choices for
one reason or other ... Ian, can you elaborate?
--
Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au
http://www.ertos.nicta.com.au ERTOS within National ICT Australia
>
> You end up having to repeat PTEs to fit into Linux's page table
> structure *anyway* (unless we can change Linux's page table). But
> there's no place in the short format hardware-walked page table (that
> reuses the leaf entries in Linux's table) for a page size. And if you
> use some of the holes in the format, the hardware walker doesn't
> understand it --- so you have to turn off the hardware walker for
> *any* regions where there might be a superpage.
No, you can set an illegal memory attribute in the pte for any superpage entry,
and leave the hardware walker enabled for the base page size. The software tlb
miss handler can then install the superpage tlb entry. I posted a working
prototype of Shimizu superpages working on ia64 using short format vhpt's to the
linux kernel list a while back.
>
> If you use the long format VHPT, you have a choice: load the
> hash table with just the translation that caused the miss, load all
> possible hash entries that could have caused the miss for the page, or
> preload the hash table when the page is instantiated, with all
> possible entries that could hash to the huge page. I don't remember
> the details, but I seem to remember all these being bad choices for
> one reason or other ... Ian, can you elaborate?
When I was doing measurements of long format vs. short format, the two main
problems with long format (and why I eventually chose to stick with short
format) were:
1) There was no easy way of determining what size the long format vhpt cache
should be automatically, and changing it dynamically would be too painful.
Different workloads performed better with different size vhpt caches.
2) Regardless of the size, the vhpt cache is duplicated information. Using long
format vhpt's significantly increased the number of cache misses for some
workloads. Theoretically there should have been some cases where the long format
solution would have performed better than the short format solution, but I was
never able to create such a case. In many cases the performance difference
between the long format solution and the short format solution was essentially
the same. In other cases the short format vhpt solution outperformed the long
format solution, and in those cases there was a significant difference in cache
misses that I believe explained the performance difference.
John
> 1) There was no easy way of determining what size the long format vhpt cache
> should be automatically, and changing it dynamically would be too painful.
> Different workloads performed better with different size vhpt caches.
This is exactly what sparc64 does btw, dynamic TLB miss hash table
sizing based upon task RSS.
> Paul Mackerras <pau...@samba.org> writes:
> >
> > 4kB pages: 444.051s user + 34.406s system time
> > 64kB pages: 419.963s user + 16.869s system time
> >
> > That's nearly 10% faster with 64kB pages -- on a kernel compile.
>
> Do you have some idea where the improvement mainly comes from?
> Is it TLB misses or reduced in kernel overhead? Ok I assume both
> play together but which part of the equation is more important?
With the kernel configured for a 64k page size, but using 4k pages in
the hardware page table, I get:
64k/4k: 441.723s user + 27.258s system time
So the improvement in the user time is almost all due to the reduced
TLB misses (as one would expect). For the system time, using 64k
pages in the VM reduces it by about 21%, and using 64k hardware pages
reduces it by another 30%. So the reduction in kernel overhead is
significant but not as large as the impact of reducing TLB misses.
Paul.
That's not entirely true. We have a dynamic pool now, thanks to Adam
Litke [added to Cc], which can be treated as a high watermark for the
hugetlb pool (and the static pool value serves as a low watermark).
Unless by hugepages you mean something other than what I think (but
referring to a 2M size on x86 imples you are not). And with the
antifragmentation improvements, hugepage pool changes at run-time are
more likely to succeed [added Mel to Cc].
> Making an application use huge pages as heap may be simple
> (just link with a different library to provide with a different
> version of malloc()) ... code, stack, mmap'd files are all
> a lot harder to do transparently.
I feel like I should promote libhugetlbfs here. We're trying to make
things easier for applications to use. You can back the heap by
hugepages via LD_PRELOAD. But even that isn't always simple (what
happens when something is already allocated on the heap?, which we've
seen happen even in our constructor in the library, for instance).
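(For illustration, heap backing is typically enabled through the dynamic
loader, along the lines of

	LD_PRELOAD=libhugetlbfs.so HUGETLB_MORECORE=yes ./app

-- exact environment variable names depend on the libhugetlbfs version.)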
We're working on hugepage stack support. Text/BSS/Data segment
remapping exists now, too, but does require relinking to be more
successful. We have a mode that allows libhugetlbfs to try to fit the
segments into hugepages, or even just those parts that might fit --
but we have limitations on power and IA64, for instance, where
hugepages are restricted in their placement (either depending on the
process' existing mappings or generally). libhugetlbfs has, at least,
been tested a bit on IA64 to validate the heap backing (IIRC) and the
various kernel tests. We also have basic sparc support -- however, I
don't have any boxes handy to test on (working on getting them added
to our testing grid and then will revisit them), and then one box I
used before gave me semi-spurious soft-lockups (old bug, unclear if it
is software or just buggy hardware).
In any case, my point is people are trying to work on this from
various angles. Both making hugepages more available at run-time (in a
dynamic fashion, based upon need) and making them easier to use for
applications. Is it easy? Not necessarily. Is it guaranteed to work? I
like to think we make a best effort. But as others have pointed out,
it doesn't seem like we're going to get mainline transparent hugepage
support anytime soon.
Thanks,
Nish
That's not a problem, actually, since the TLB entries can get shuffled
like any other (for software TLBs it's a little different, but it can be
dealt with there too.)
The *real* problem is ABI breakage.
-hpa
On Wed, 26 Mar 2008, Paul Mackerras wrote:
>
> So the improvement in the user time is almost all due to the reduced
> TLB misses (as one would expect). For the system time, using 64k
> pages in the VM reduces it by about 21%, and using 64k hardware pages
> reduces it by another 30%. So the reduction in kernel overhead is
> significant but not as large as the impact of reducing TLB misses.
I realize that getting the POWER people to accept that they have been
total morons when it comes to VM for the last three decades is hard, but
somebody in the POWER hardware design camp should (a) be told and (b) be
really ashamed of themselves.
Is this a POWER6 or what? Because 21% overhead from TLB handling on
something like gcc shows that some piece of hardware is absolute crap.
May I suggest people inside IBM try to fix this some day, and in the
meantime people outside should probably continue to buy Intel/AMD CPU's
until the others can get their act together.
Linus
Things are better than I thought ... though the phrase "more likely
to succeed" doesn't fill me with confidence. Instead I imagine a
system where an occasional spike in memory load causes some memory
fragmentation that can't be handled, and so from that point many of
the applications that relied on huge pages take a 10% performance
hit. This results in sysadmins scheduling regular reboots to unjam
things. [Reminds me of the instructions that came with my first
flatbed scanner that recommended rebooting the system before and
after each use :-( ]
> I feel like I should promote libhugetlbfs here.
This is also better than I thought ... sounds like some really
good things have already happened here.
-Tony
> So the improvement in the user time is almost all due to the reduced
> TLB misses (as one would expect). For the system time, using 64k
> pages in the VM reduces it by about 21%, and using 64k hardware pages
> reduces it by another 30%. So the reduction in kernel overhead is
> significant but not as large as the impact of reducing TLB misses.
One should emphasize that this test was a kernel compile, which is not
a load that gains much from larger pages. 4k pages are mostly okay for
loads that use large numbers of small files.
It's a lot more likely to succeed since 2.6.24 than it has been in the past. On
workloads where it is mainly user data that is occupying memory, the chances
are even better. If min_free_kbytes is hugepage_size*num_online_nodes(),
it becomes harder again to fragment memory.
> Instead I imagine a
> system where an occasional spike in memory load causes some memory
> fragmentation that can't be handled, and so from that point many of
> the applications that relied on huge pages take a 10% performance
> hit.
If it was found to be a problem and normal anti-frag is not coping with hugepage
pool resizes, then specify movablecore=MAX_POSSIBLE_POOL_SIZE_YOU_WOULD_NEED
on the command-line and the hugepage pool will be able to expand to that
size independent of workload. This would avoid the need to schedule regular
reboots.
> This results in sysadmins scheduling regular reboots to unjam
> things. [Reminds me of the instructions that came with my first
> flatbed scanner that recommended rebooting the system before and
> after each use :-( ]
>
> > I feel like I should promote libhugetlbfs here.
>
> This is also better than I thought ... sounds like some really
> good things have already happened here.
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
> One should emphasize that this test was a kernel compile, which is not
> a load that gains much from larger pages.
Actually, ever since gcc went to a garbage collecting allocator, I've
found it to be a TLB thrasher.
It will repeatedly and randomly walk over a GC pool of at least 8MB in
size, which to fit fully in the TLB with 4K pages requires a TLB with
2048 entries, assuming gcc touches no other data, which is of course a
false assumption.
For some compiles this GC pool is more than 100MB in size.
GCC does not fit into any modern TLB using its base page size.
> On Wed, 26 Mar 2008, Paul Mackerras wrote:
> >
> > So the improvement in the user time is almost all due to the reduced
> > TLB misses (as one would expect). For the system time, using 64k
> > pages in the VM reduces it by about 21%, and using 64k hardware pages
> > reduces it by another 30%. So the reduction in kernel overhead is
> > significant but not as large as the impact of reducing TLB misses.
>
> I realize that getting the POWER people to accept that they have been
> total morons when it comes to VM for the last three decades is hard, but
> somebody in the POWER hardware design camp should (a) be told and (b) be
> really ashamed of themselves.
>
> Is this a POWER6 or what? Becasue 21% overhead from TLB handling on
> something like gcc shows that some piece of hardware is absolute crap.
You have misunderstood the 21% number. That number has *nothing* to
do with hardware TLB miss handling, and everything to do with how long
the generic Linux virtual memory code spends doing its thing (page
faults, setting up and tearing down Linux page tables, etc.). It
doesn't even have anything to do with the hash table (hardware page
table), because both cases are using 4k hardware pages. Thus in both
cases the TLB misses and hash-table misses would have been the same.
The *only* difference between the cases is the page size that the
generic Linux virtual memory code is using. With the 64k page size
our architecture-independent kernel code runs 21% faster.
Thus the 21% is not about the TLB or any hardware thing at all, it's
about the larger per-byte overhead of our kernel code when using the
smaller page size.
The thing you were ranting about -- hardware TLB handling overhead --
comes in at 5%, comparing 4k hardware pages to 64k hardware pages (444
seconds vs. 420 seconds user time for the kernel compile). And yes,
it's a POWER6.
Paul.
> One should emphasize that this test was a kernel compile, which is not
> a load that gains much from larger pages. 4k pages are mostly okay for
> loads that use large numbers of small files.
It's also worth emphasizing that 1.5% of the total time, or 21% of the
system time, is pure software overhead in the Linux kernel that has
nothing to do with the TLB or with gcc's memory access patterns.
That's the cost of handling memory in small (i.e. 4kB) chunks inside
the generic Linux VM code, rather than bigger chunks.
Paul.