What happens in fs/proc/meminfo.c is this calculation:
allowed = ((totalram_pages - hugetlb_total_pages())
* sysctl_overcommit_ratio / 100) + total_swap_pages;
The problem is that hugetlb_total_pages() can be larger than
totalram_pages, making the subtraction go negative. Since
allowed is an unsigned long, the negative value wraps around
to a huge number.
A similar calculation occurs in __vm_enough_memory() in mm/mmap.c.
A symptom of this problem is that /proc/meminfo prints a
very large CommitLimit number.
CommitLimit: 737869762947802600 kB
To reproduce the problem, reserve over half of memory as hugepages,
for example "default_hugepagesz=1G hugepagesz=1G hugepages=64".
Then look at /proc/meminfo "CommitLimit:" to see if it is too big.
The fix is to not subtract hugetlb_total_pages(). When hugepages
are allocated, totalram_pages is already decremented, so there is no
need to subtract out hugetlb_total_pages() a second time.
Reported-by: Russ Anderson <r...@sgi.com>
Signed-off-by: Russ Anderson <r...@sgi.com>
---
Example of "CommitLimit:" being too big.
uv1-sys:~ # cat /proc/meminfo
MemTotal: 32395508 kB
MemFree: 32029276 kB
Buffers: 8656 kB
Cached: 89548 kB
SwapCached: 0 kB
Active: 55336 kB
Inactive: 73916 kB
Active(anon): 31220 kB
Inactive(anon): 36 kB
Active(file): 24116 kB
Inactive(file): 73880 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 1692 kB
Writeback: 0 kB
AnonPages: 31132 kB
Mapped: 15668 kB
Shmem: 152 kB
Slab: 70256 kB
SReclaimable: 17148 kB
SUnreclaim: 53108 kB
KernelStack: 6536 kB
PageTables: 3704 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 737869762947802600 kB
Committed_AS: 394044 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 713960 kB
VmallocChunk: 34325764204 kB
HardwareCorrupted: 0 kB
HugePages_Total: 32
HugePages_Free: 32
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 1048576 kB
DirectMap4k: 16384 kB
DirectMap2M: 2064384 kB
DirectMap1G: 65011712 kB
fs/proc/meminfo.c | 2 +-
mm/mmap.c | 3 +--
2 files changed, 2 insertions(+), 3 deletions(-)
Index: linux/fs/proc/meminfo.c
===================================================================
--- linux.orig/fs/proc/meminfo.c 2011-05-17 16:03:50.935658801 -0500
+++ linux/fs/proc/meminfo.c 2011-05-18 08:53:00.568784147 -0500
@@ -36,7 +36,7 @@ static int meminfo_proc_show(struct seq_
si_meminfo(&i);
si_swapinfo(&i);
committed = percpu_counter_read_positive(&vm_committed_as);
- allowed = ((totalram_pages - hugetlb_total_pages())
+ allowed = (totalram_pages
* sysctl_overcommit_ratio / 100) + total_swap_pages;
cached = global_page_state(NR_FILE_PAGES) -
Index: linux/mm/mmap.c
===================================================================
--- linux.orig/mm/mmap.c 2011-05-17 16:03:51.727658828 -0500
+++ linux/mm/mmap.c 2011-05-18 08:54:34.912222405 -0500
@@ -167,8 +167,7 @@ int __vm_enough_memory(struct mm_struct
goto error;
}
- allowed = (totalram_pages - hugetlb_total_pages())
- * sysctl_overcommit_ratio / 100;
+ allowed = totalram_pages * sysctl_overcommit_ratio / 100;
/*
* Leave the last 3% for root
*/
--
Russ Anderson, OS RAS/Partitioning Project Lead
SGI - Silicon Graphics Inc r...@sgi.com
Are you running on x86? It decrements totalram_pages on an x86_64
test system. Perhaps different architectures allocate hugepages
differently.
It was verified by adding a printk to print totalram_pages and
hugetlb_total_pages. First the system was booted without any huge
pages. On the next boot one huge page was allocated, and on subsequent
boots more hugepages were allocated. Each time, totalram_pages was
reduced by the number of huge pages allocated, with totalram_pages +
hugetlb_total_pages equaling the original number of pages.
That behavior is also consistent with allocating over half of memory
resulting in CommitLimit going negative (as is shown in the above
output).
Here is some data. Each represents a boot using 1G hugepages.
0 hugepages : totalram_pages 16519867 hugetlb_total_pages 0
1 hugepages : totalram_pages 16257723 hugetlb_total_pages 262144
2 hugepages : totalram_pages 15995578 hugetlb_total_pages 524288
31 hugepages : totalram_pages 8393403 hugetlb_total_pages 8126464
32 hugepages : totalram_pages 8131258 hugetlb_total_pages 8388608
> hugepages are reserved, hugetlb_total_pages() has to be accounted and
> subtracted from totalram_pages in order to render an accurate number of
> remaining pages available to the general memory workload commitment.
>
> I've tried to reproduce your findings on my boxes, without
> success, unfortunately.
Put a printk in meminfo_proc_show() to print totalram_pages and
hugetlb_total_pages(). Add "default_hugepagesz=1G hugepagesz=1G hugepages=64"
to the boot line (varying the number of hugepages).
> I'll keep chasing to hit this behaviour, though.
>
> Cheers!
> --aquini
OK, I see your point. The root problem is hugepages allocated at boot are
subtracted from totalram_pages but hugepages allocated at run time are not.
Correct me if I've misstated it or if there are other conditions.
By "allocated at run time" I mean "echo 1 > /proc/sys/vm/nr_hugepages".
That allocation will not change totalram_pages but will change
hugetlb_total_pages().
How best to fix this inconsistency? Should totalram_pages include or exclude
hugepages? What are the implications?
I have no strong preference as to which way to go as long as it is consistent.
> OK, I see your point. The root problem is hugepages allocated at boot are
> subtracted from totalram_pages but hugepages allocated at run time are not.
> Correct me if I've misstated it or if there are other conditions.
>
> By "allocated at run time" I mean "echo 1 > /proc/sys/vm/nr_hugepages".
> That allocation will not change totalram_pages but will change
> hugetlb_total_pages().
>
> How best to fix this inconsistency? Should totalram_pages include or exclude
> hugepages? What are the implications?
The problem is that hugetlb_total_pages() is trying to account for two
different things, while totalram_pages accounts for only one of those
things, yes?
One fix would be to stop accounting for huge pages in totalram_pages
altogether. That might break other things so careful checking would be
needed.
Or we stop accounting for the boot-time allocated huge pages in
hugetlb_total_pages(). Split the two things apart altogether and
account for boot-time allocated and runtime-allocated pages separately. This
sounds saner to me - it reflects what's actually happening in the kernel.
Perhaps we can just reinstate the number of pages "stolen" by the early boot
allocation later, when hugetlb_init() calls gather_bootmem_prealloc():
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8ee3bd8..d606c9c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1111,6 +1111,7 @@ static void __init gather_bootmem_prealloc(void)
WARN_ON(page_count(page) != 1);
prep_compound_huge_page(page, h->order);
prep_new_huge_page(h, page, page_to_nid(page));
+ totalram_pages += 1 << h->order;
}
}
--
Rafael Aquini <aqu...@linux.com>
Howdy Russ,
Were you able to confirm whether that proposed change fixes the issue you reported?
Although I've tested it with usual-size hugepages and it did not mess things up,
I'm not able to test it with GB hugepages, as I do not have any processor with the "pdpe1gb" flag available.
Thanks in advance!
Cheers!
Sorry, I have been distracted. I will get to it shortly.
> Although I've tested it with usual-size hugepages and it did not mess things up,
> I'm not able to test it with GB hugepages, as I do not have any processor with the "pdpe1gb" flag available.
--
Russ Anderson, OS RAS/Partitioning Project Lead
SGI - Silicon Graphics Inc r...@sgi.com
Yes, it fixes the inconsistency in reporting totalram_pages.
> Although I've tested it with usual-size hugepages and it did not mess things up,
> I'm not able to test it with GB hugepages, as I do not have any processor with the "pdpe1gb" flag available.
There seems to be another issue. 1G hugepages can be allocated at boot time, but
cannot be allocated at run time. "default_hugepagesz=1G hugepagesz=1G hugepages=1" on
the boot line works. With "default_hugepagesz=1G hugepagesz=1G" the command
"echo 1 > /proc/sys/vm/nr_hugepages" fails.
uv4-sys:~ # echo 1 > /proc/sys/vm/nr_hugepages
-bash: echo: write error: Invalid argument
--
Russ Anderson, OS RAS/Partitioning Project Lead
SGI - Silicon Graphics Inc r...@sgi.com
The problem is that gigantic hugepages (order > MAX_ORDER)
can only be allocated at boot with bootmem, thus their frames
are not accounted in 'totalram_pages'. However, they are
accounted in hugetlb_total_pages().
What happens to turn CommitLimit into a negative number
is this calculation, in fs/proc/meminfo.c:
allowed = ((totalram_pages - hugetlb_total_pages())
* sysctl_overcommit_ratio / 100) + total_swap_pages;
A similar calculation occurs in __vm_enough_memory() in mm/mmap.c.
Also, every vm statistic which depends on 'totalram_pages' will render
confusing values, as if the system were 'missing' part of its memory.
Reported-by: Russ Anderson <r...@sgi.com>
Signed-off-by: Rafael Aquini <aqu...@linux.com>
---
mm/hugetlb.c | 8 ++++++++
1 files changed, 8 insertions(+), 0 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f33bb31..c67dd0f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1111,6 +1111,14 @@ static void __init gather_bootmem_prealloc(void)
WARN_ON(page_count(page) != 1);
prep_compound_huge_page(page, h->order);
prep_new_huge_page(h, page, page_to_nid(page));
+
+ /* if we had gigantic hugepages allocated at boot time,
+ * we need to reinstate the 'stolen' pages to totalram_pages,
+ * in order to fix confusing memory reports from free(1)
+ * and other side effects, like CommitLimit going negative.
+ */
+ if (h->order > (MAX_ORDER - 1))
+ totalram_pages += 1 << h->order;
}
}
--
1.7.4.4
On Wed, Jun 01, 2011 at 11:08:31PM -0500, Russ Anderson wrote:
>
> Yes, it fixes the inconsistency in reporting totalram_pages.
Thanks a lot for the feedback.
> There seems to be another issue. 1G hugepages can be allocated at boot time, but
> cannot be allocated at run time. "default_hugepagesz=1G hugepagesz=1G hugepages=1" on
> the boot line works. With "default_hugepagesz=1G hugepagesz=1G" the command
> "echo 1 > /proc/sys/vm/nr_hugepages" fails.
>
> uv4-sys:~ # echo 1 > /proc/sys/vm/nr_hugepages
> -bash: echo: write error: Invalid argument
That's not an issue, actually. It seems to be, unfortunately,
an implementation characteristic due to an imposed arch constraint.
Further reference: http://lwn.net/Articles/273661/
Cheers!
--
Rafael Aquini <aqu...@linux.com>
--
Russ Anderson, OS RAS/Partitioning Project Lead
SGI - Silicon Graphics Inc r...@sgi.com
> When 1GB hugepages are allocated on a system, free(1) reports
> less available memory than what really is installed in the box.
> Also, if the total size of hugepages allocated on a system is
> over half of the total memory size, CommitLimit becomes
> a negative number.
>
> The problem is that gigantic hugepages (order > MAX_ORDER)
> can only be allocated at boot with bootmem, thus their frames
> are not accounted in 'totalram_pages'. However, they are
> accounted in hugetlb_total_pages().
>
> What happens to turn CommitLimit into a negative number
> is this calculation, in fs/proc/meminfo.c:
>
> allowed = ((totalram_pages - hugetlb_total_pages())
> * sysctl_overcommit_ratio / 100) + total_swap_pages;
>
> A similar calculation occurs in __vm_enough_memory() in mm/mmap.c.
>
> Also, every vm statistic which depends on 'totalram_pages' will render
> confusing values, as if the system were 'missing' part of its memory.
Is this bug serious enough to justify backporting the fix into -stable
kernels?
Sorry for this late reply.
On Thu, Jun 09, 2011 at 04:44:08PM -0700, Andrew Morton wrote:
> On Thu, 2 Jun 2011 23:55:57 -0300
> Rafael Aquini <aqu...@linux.com> wrote:
>
> > When 1GB hugepages are allocated on a system, free(1) reports
> > less available memory than what really is installed in the box.
> > Also, if the total size of hugepages allocated on a system is
> > over half of the total memory size, CommitLimit becomes
> > a negative number.
> >
> > The problem is that gigantic hugepages (order > MAX_ORDER)
> > can only be allocated at boot with bootmem, thus their frames
> > are not accounted in 'totalram_pages'. However, they are
> > accounted in hugetlb_total_pages().
> >
> > What happens to turn CommitLimit into a negative number
> > is this calculation, in fs/proc/meminfo.c:
> >
> > allowed = ((totalram_pages - hugetlb_total_pages())
> > * sysctl_overcommit_ratio / 100) + total_swap_pages;
> >
> > A similar calculation occurs in __vm_enough_memory() in mm/mmap.c.
> >
> > Also, every vm statistic which depends on 'totalram_pages' will render
> > confusing values, as if the system were 'missing' part of its memory.
>
> Is this bug serious enough to justify backporting the fix into -stable
> kernels?
Despite not having tested it, I can see the following scenario as
troublesome: gigantic hugepages are allocated and
sysctl_overcommit_memory == OVERCOMMIT_NEVER. In such a situation,
__vm_enough_memory() goes through the mentioned 'allowed' calculation
and might end up mistakenly returning -ENOMEM, thus forcing the system
to start reclaiming pages earlier than usual, and this could have a
detrimental impact on overall system performance, depending on the
workload.
Besides the aforementioned scenario, I can only think of this causing annoyances
with memory reports from /proc/meminfo and free(1).
Thanks for your attention!
Cheers!
--
Rafael Aquini <aqu...@linux.com>
hm, OK, thanks. That sounds a bit thin, but the patch is really simple
so I stuck the cc:stable onto its changelog.