failing to understand the issues with transparent huge paging


Peter Veentjer

Aug 7, 2017, 11:42:21 AM
to mechanica...@googlegroups.com
Hi Everyone,

I'm failing to understand the problem with transparent huge pages.

I 'understand' how normal pages work. A page is typically 4kb in a virtual address space; each process has its own.

I understand how the TLB fits in; a cache providing a mapping of virtual to real addresses to speed up address conversion.

I understand that using a large page e.g. 2mb instead of a 4kb page can reduce pressure on the TLB.

So far it looks like huge pages make a lot of sense; of course at the expense of wasting memory if only a small section of a page is being used.

The first part I don't understand is: why is it called transparent huge pages? So what is transparent about it?

The second part I'm failing to understand is: why can it cause problems? There are quite a few applications that recommend disabling THP, and I recently helped a customer that was helped by disabling it. It seems there is more going on behind the scenes than having an increased page size. Is it caused by fragmentation? So if a new huge page is needed and memory is fragmented (due to smaller pages), the small pages need to be compacted before a new huge page can be allocated? But if that were the only thing, it shouldn't be a problem once all pages for the application have been touched and retained.

So I'm probably missing something simple.

Marshall Pierce

Aug 7, 2017, 11:55:05 AM
to mechanica...@googlegroups.com, Peter Veentjer
AFAIK, the issue is not huge pages that applications opt in to and
control; it's when they are "transparent", as in when the kernel
"helpfully" decides when to use them. This can lead to significant
pauses as the kernel decides that now is the time to rearrange your
process's memory to be more or less huge-page-y. I have experienced this
even with desktop java apps with relatively small heaps.

You can look for CONFIG_TRANSPARENT_HUGEPAGE in your kernel config.
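For example, something like this (paths are the usual ones on recent distros; adjust for yours):

$ grep CONFIG_TRANSPARENT_HUGEPAGE /boot/config-$(uname -r)
$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never    <- the bracketed entry is the current mode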

-Marshall


Peter Veentjer

Aug 7, 2017, 12:02:28 PM
to mechanical-sympathy, alarm...@gmail.com
Your answer makes a lot of sense. It makes the 'transparent' part clear.

What doesn't make sense is why the kernel would decide to switch between large/small pages. So what is the benefit of letting a process go through the pain of this conversion?

Will it resize in both directions?

Gil Tene

Aug 7, 2017, 1:25:50 PM
to mechanica...@googlegroups.com
THP certainly sits in my "just don't do it" list of tuning things due to its fundamental, dramatic latency disruption in current implementations, seen as occasional 10s to 100s of msec (and sometimes even 1sec+) stalls on something as simple and common as a 32 byte malloc. THP is a form of in-kernel GC, and the current THP implementation involves potential and occasional synchronous, stop-the-world compaction done at allocation time, on or by any application thread that does an mmap or a malloc.

I dug up an e-mail I wrote on the subject (to a recipient on this list) back in Jan 2013 [see below]. While it has some specific links (including a stack trace showing the kernel de-fragging the whole system on a single mmap call), note that this material is now 4.5 years old, and things *might* have changed or improved to some degree. While I've seen no recent first-hand evidence of efforts to improve things on the don't-dramatically-stall-malloc (or other mappings) front, I haven't been following it very closely (I just wrote it off as "let's check again in 5 years"). If someone else here knows of some actual improvements to this picture in recent years, or of efforts or discussions in the Linux Kernel community on this subject, please point to them.

IMO, the notion of THP is not flawed. The implementation is. And I believe that the implementation of THP *can* be improved to be much more robust and to avoid forcing occasional huge latency artifacts on memory-allocating threads:

1. The first (huge) step in improving things would be to never-ever-ever-ever have a mapping thread spend any time performing any kind of defragmentation, and to simply accept 4KB mappings when no 2MB physical pages are available. Let background defragmentation do all the work (including converting 4KB-allocated-but-2MB-contiguous ranges to 2MB mappings).

2. The second level (much needed, but at an order of magnitude of 10s of milliseconds rather than the current 100s of msec or more) would be to make background defragmentation work without stalling foreground access to a currently-being-defragmented 2MB region. I.e. don't stall access for the duration of a 2MB defrag operation (which can take several msec).

While both of these are needed for a "don't worry about it" mode of use (which something called "transparent" really should aim for), #1 is a much easier step than #2. Without it, THP can cause application pauses (to any linux app) that are often worse than e.g. HotSpot Java GC pauses. Which is ironic.

-- Gil. 

-------------------------------

The problem is not the background defrag operation. The problem is synchronous defragging done on allocation, where THP on means a 2MB allocation will attempt to allocate a 2MB contiguous page, and if it can't find one, it may end up defragging an entire zone before the allocation completes. The /sys/kernel/mm/transparent_hugepage/defrag setting only controls the background...

Here is something I wrote up on it internally after much investigation:

Transparent huge pages (THP) is a feature Red Hat championed and introduced in RHEL 6.x, and it got into the upstream kernel around the ~2.6.38 time; it generally exists in all Linux 3.x kernels and beyond (so it exists in both SLES 11 SP2 and in Ubuntu 12.04 LTS). With transparent huge pages, the kernel *attempts* to use 2MB page mappings to map contiguous and aligned memory ranges of that size (which are quite common for many program scenarios), but will break those into 4KB mappings when needed (e.g. when it cannot satisfy them with 2MB pages, or when it needs to swap or page out the memory, since paging is done 4KB at a time). With such a mixed approach, some sort of a "defragmenter" or "compactor" is required to exist, because without it simple fragmentation will (over time) make 2MB contiguous physical pages a rare thing, and performance will tend to degrade over time. As a result, and in order to support THP, Linux kernels will attempt to defragment (or "compact") memory and memory zones. This can be done either by unmapping pages, copying their contents to a new compacted space, and mapping them in the new location, or by potentially forcing individually mapped 4KB pages in a 2MB physical page out (via swapping, or by paging them out if they are file system pages), and reclaiming the 2MB contiguous page when that is done. 4KB pages that were forced out will come back in as needed (swapped back in on demand, or paged back in on demand).

Defragmentation/compaction with THP can happen in two places:

1. First, there is a background defragmenter (a process called "khugepaged") that goes around and compacts 2MB physical pages by pushing their 4KB pages out when possible. This background defragger could potentially cause pages to be swapped out if swapping is enabled, even with no swapping pressure in place.

2. "Synchronous Compaction": In some cases, an on-demand page fault (e.g. when first accessing a newly allocated 4KB page created via mmap() or malloc()) could end up trying to compact memory in order to fault into a 2MB physical page instead of a 4KB page (this can be seen in the stack trace discussed in this posting, for example: https://access.redhat.com/solutions/1560893). When this happens, a single 4KB allocation could end up waiting for an attempt to compact an entire "zone" of pages, even if those are compacted purely thru in-memory moves with no I/O. It can also be blocked waiting for disk I/O as seen on some stack traces in related discussions.

More details can be found in places like this:

And examples of cases of avoiding thrashing by disabling THP in RHEL 6.2 are around:



BOTTOM LINE: Transparent Huge Pages is a well-intended idea that helps compact physical memory and use more optimal mappings in the kernel, but it can come with some significant (and often surprising) latency impacts. I recommend we turn it off by default in Zing installations, and it appears that many other software packages (including most DBs, and many Java based apps) recommend the same.
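For reference, turning it off at runtime is just a pair of sysfs writes (a sketch assuming the usual sysfs layout; takes effect immediately, but does not persist across reboots):

$ echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
$ echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag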

Gil Tene

Aug 7, 2017, 2:14:31 PM
to mechanical-sympathy
To highlight the problematic path in current THP kernel implementations, here is an example call trace that can happen (pulled from the discussion linked to below). It shows that a simple on-demand page fault in regular anonymous memory (e.g. when a normal malloc call is made and manipulates a malloc-managed 2MB area, or when the resulting malloc'ed struct is written to) can end up compacting an entire zone (which can be the vast majority of system memory) in a single call, using the faulting thread. The specific example stack trace is taken from a situation where that fault took so long (on a NUMA system) that a soft lockup was triggered, showing the call took longer than 22 seconds (!!!). But even without the NUMA or migrate_pages aspects, compaction of a single zone can take 100s of msec or more.

Browsing through the current kernel code (e.g. http://elixir.free-electrons.com/linux/latest/source/mm/compaction.c#L1722) seems to show that this is still the likely path that would be taken when no free 2MB pages are found in current kernels :-(

And this situation will naturally occur under all sorts of common timing conditions (i/o fragmenting free memory into 4KB pages (but no free 2MB ones), background compaction/defrag falling behind during some heavy kernel-driven i/o spike, and some unlucky thread doing a malloc when the 2MB physical free list is exhausted).


kernel: Call Trace:
kernel: [<ffffffff81179d8f>] compaction_alloc+0x1cf/0x240
kernel: [<ffffffff811b15ce>] migrate_pages+0xce/0x610
kernel: [<ffffffff81179bc0>] ? isolate_freepages_block+0x380/0x380
kernel: [<ffffffff8117abb9>] compact_zone+0x299/0x400
kernel: [<ffffffff8117adbc>] compact_zone_order+0x9c/0xf0
kernel: [<ffffffff8117b171>] try_to_compact_pages+0x121/0x1a0
kernel: [<ffffffff815ff336>] __alloc_pages_direct_compact+0xac/0x196
kernel: [<ffffffff81160758>] __alloc_pages_nodemask+0x788/0xb90
kernel: [<ffffffff810b11c0>] ? task_numa_fault+0x8d0/0xbb0
kernel: [<ffffffff811a24aa>] alloc_pages_vma+0x9a/0x140
kernel: [<ffffffff811b674b>] do_huge_pmd_anonymous_page+0x10b/0x410
kernel: [<ffffffff81182334>] handle_mm_fault+0x184/0xd60
kernel: [<ffffffff8160f1e6>] __do_page_fault+0x156/0x520
kernel: [<ffffffff8118a945>] ? change_protection+0x65/0xa0
kernel: [<ffffffff811a0dbb>] ? change_prot_numa+0x1b/0x40
kernel: [<ffffffff810adb86>] ? task_numa_work+0x266/0x300
kernel: [<ffffffff8160f5ca>] do_page_fault+0x1a/0x70
kernel: [<ffffffff81013b0c>] ? do_notify_resume+0x9c/0xb0
kernel: [<ffffffff8160b808>] page_fault+0x28/0x30

Alen Vrečko

Aug 7, 2017, 2:50:27 PM
to mechanica...@googlegroups.com
Saw this a while back.

https://shipilev.net/jvm-anatomy-park/2-transparent-huge-pages/

Basically using THP/defrag with madvise and using
-XX:+UseTransparentHugePages -XX:+AlwaysPreTouch JVM opts.

Looks like the defrag cost should be paid in full at startup due to
AlwaysPreTouch. Never got around to testing this in production. Just have
THP disabled. Thoughts?
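Concretely, that combination would look something like this (a sketch; MyServer is a placeholder main class):

$ echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
$ echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
$ java -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch -Xms4g -Xmx4g MyServer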

- Alen

Gil Tene

Aug 8, 2017, 12:44:25 PM
to mechanical-sympathy


On Monday, August 7, 2017 at 11:50:27 AM UTC-7, Alen Vrečko wrote:
> Saw this a while back.
>
> https://shipilev.net/jvm-anatomy-park/2-transparent-huge-pages/
>
> Basically using THP/defrag with madvise and using
> -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch JVM opts.
>
> Looks like the defrag cost should be paid in full at startup due to
> AlwaysPreTouch. Never got around to testing this in production. Just have
> THP disabled. Thoughts?

The above flags would only cover the Java heap, in a Java application. So obviously THP for non-Java things doesn't get helped by that.

And for Java stuff, unfortunately, there are lots of non-Java-heap things that are exposed to THP's potentially huge on-demand faulting latencies. The JVM manages lots of memory outside of the Java heap for various things (GC stuff, stacks, Metaspace, Code cache, JIT compiler things, and a whole bunch of runtime stuff), and the application itself will often be using off-heap stuff intentionally (e.g. via DirectByteBuffers) or inadvertently (e.g. when libraries make either temporary or lasting use of off-heap memory). E.g. even simple socket I/O involves some use of off-heap memory as an intermediate storage location.

As a simple demonstration of why THP artifacts for non-Java-heap memory are a key problem for Java apps: I first ran into these THP issues by experience, with Zing, right around the time that RHEL 6 turned it on. We found out the hard way that we have to turn it off to maintain reasonable latency profiles. And since Zing has always *ensured* 2MB pages are used for everything in the heap, the code cache, and for virtually all GC support structures, it is clearly the THP impact on all the rest of the stuff that has caused us to deal with it and recommend against its use. The way we see THP manifest regularly (when left on) is with occasional huge TTSPs (time to safepoint) [huge in Zing terms, meaning anything from a few msec to 100s of msec], which we know are there because we specifically log and chart TTSPs. But high TTSPs are just a symptom: since we only measure TTSP when we actually try to bring threads to safepoints, and doing that is a relatively (dynamically) rare event, it means that whenever we see actual high TTSPs in our logs, it is likely that similar-sized disruptions are occurring at the application level, but at a much higher frequency than that of the high TTSPs we observe.

Peter Veentjer

Aug 9, 2017, 4:51:57 AM
to mechanical-sympathy
Thanks for your very useful replies Gil.

Question:

Using huge pages can give a big performance boost:

https://shipilev.net/jvm-anatomy-park/2-transparent-huge-pages/

$ time java -Xms4T -Xmx4T -XX:-UseTransparentHugePages -XX:+AlwaysPreTouch
real    13m58.167s
user    43m37.519s
sys     1011m25.740s

$ time java -Xms4T -Xmx4T -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
real    2m14.758s
user    1m56.488s
sys     73m59.046s


But THP seems to be unusable. Does this effectively mean that we can't benefit from THP under Linux?

So far it looks like a damned-if-you-do, damned-if-you-don't situation.

Or should we move to non-transparent huge pages?

Henri Tremblay

Aug 9, 2017, 9:07:01 AM
to mechanica...@googlegroups.com
From what I see here I would deduce:
1- THP can give a huge performance gain (when using PreTouch, in some cases, possibly when not playing with offheap too much)
2- But it will increase hiccups

A bit like the throughput collector.

So my current take away is:
  • Use THP if you care about throughput only
  • If you care about latency, just don't
  • If you really care about throughput, use non-transparent huge pages
Is that accurate?


Benedict Elliott Smith

Aug 9, 2017, 9:35:36 AM
to mechanical-sympathy
You should also consider the CPU you are using. Westmere CPUs have far fewer (but disjoint/complementary) TLB slots for huge pages, so large sparse working sets (with good temporal locality but poor spatial locality) could incur a significant performance penalty on these CPUs with huge pages.

Aleksey Shipilev

Aug 9, 2017, 9:50:38 AM
to mechanica...@googlegroups.com, Gil Tene
On 08/08/2017 06:44 PM, Gil Tene wrote:
> The above flags would only cover the Java heap. In a Java application. So obviously THP for non-Java
> things doesn't get helped by that.

So, this is the reason to use THP in "madvise" mode? Then the JVM madvise-s the Java heap to THP,
upfront defrags it with AlwaysPreTouch, but native allocations stay outside of the THP path, and thus do
not incur defrag latencies. If there is a native structure that allocates with the madvise hint and does
heavy churn causing defrag, I'd say it should not have used madvise in the first place.

-Aleksey


Gil Tene

Aug 9, 2017, 10:01:53 PM
to mechanical-sympathy, g...@azul.com
I agree in principle. Keeping the system-wide THP config setting to "madvise" (as opposed to "always" or "never") should allow careful-explicitly-thp-designated-via-madvise regions to benefit from THP without carrying the blame for widespread latency spikes. It would certainly be less exposed than applying THP by default to all anonymous memory. But it is unfortunately still very widely exposed to latency spikes. The reason for this wide exposure is that not everyone is as careful and smart about madvise use of THP and its latency implications, and unfortunately the barn door settings available are either "open for everyone" or "closed to everyone". There is no "open just for me" setting.

The weakness with the madvise mode is that it is enough for one piece of code that your application ends up using to be less-than-perfect in how it uses the thp madvise for the entire application to experience huge latencies. You can only turn the "madvise" behavior itself on/off globally, and can't just use it for the part that you know and understand well. Basically, choosing to let your well understood, responsible code "run with [huge latency] scissors" means that all other code that may have used this madvise as an "optimization" is also running with the same scissors. Unfortunately, lots of code, including common libraries you may not be thinking of, may have chosen to use a thp madvise purely for its throughput benefit without considering the latency implications and protecting against them (by e.g. retouching or otherwise pre-populating the physical memory allocation), and as a result may have used it on regions for which physical memory might end up being on-demand allocated.
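(One way to at least observe how much of a given process's memory actually ended up THP-backed is the AnonHugePages field in /proc/<pid>/smaps; a diagnostic sketch, with 1234 as a placeholder pid:

$ awk '/AnonHugePages/ { sum += $2 } END { print sum " kB of THP-backed anonymous memory" }' /proc/1234/smaps

This tells you who is running with the scissors, though not when the defrag stalls will hit.)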

And before you (the reader) start down the obvious "who would do that?" and "what are the chances..." thought chains, take a look at the following discussion about potentially changing all of malloc in Go to do exactly that: https://github.com/golang/go/issues/14264 titled "runtime: consider backing malloc structures with large pages". The issue discusses the possibility of using madvise(...,..., MADV_HUGEPAGE) (doesn't mean that's what they'll end up doing, but it's a great example of how that might end up happening). It includes some measurements that show throughput benefits when huge pages are used, but contains no discussion of the potential latency outlier downsides. A similar discussion can easily end up with all malloc'ed objects being on-demand-page-fault-allocated in THP-advised pages in some key allocator in some system.

On the other end of the spectrum is some cool jemalloc discussion that ends up with the opposite: https://www.nuodb.com/techblog/linux-transparent-huge-pages-jemalloc-and-nuodb describes how nuodb internally patched their jemalloc to madvise(...,..., MADV_NOHUGEPAGE) on all jemalloc pages to get around surprising interplay issues between THP and madvise(...,..., MADV_DONTNEED).

So madvise may come with some sharp edges. And if those edges don't hurt you right now, they might change to hurt you soon (e.g. in an upcoming next version of go or of yourFavoriteCoolMalloc). It's why my knee-jerk recommendation is to turn THP completely off as a first step whenever someone asks me about unexplained latency glitches. 

For reference, in Zing we separately guarantee locked-down 2MB mappings for all pages for the Java heap, code cache, permgen, and various GC support structures regardless of THP or hugetlb settings (the equivalent of hugetlbfs, but without those settings). And we probably wouldn't use a thp madvise on any of the other memory regions. So for us either "madvise" or "none" would work just as well.

Alexandr Nikitin

Aug 12, 2017, 6:01:31 AM
to mechanical-sympathy
I played with Transparent Hugepages some time ago and I want to share some numbers based on real world high-load applications.
We have a JVM application: high-load tcp server based on netty. No clear bottleneck, CPU, memory and network are equally highly loaded. The amount of work depends on request content.
The following numbers are based on normal server load ~40% of maximum number of requests one server can handle.

When THP is off:
End-to-end application latency in microseconds:
"p50" : 718.891,
"p95" : 4110.26,
"p99" : 7503.938,
"p999" : 15564.827,

perf stat -e dTLB-load-misses,iTLB-load-misses -p PID -I 1000
...
...         25,164,369      iTLB-load-misses
...         81,154,170      dTLB-load-misses
...

When THP is always on:
End-to-end application latency in microseconds:
"p50" : 601.196,
"p95" : 3260.494,
"p99" : 7104.526,
"p999" : 11872.642,

perf stat -e dTLB-load-misses,iTLB-load-misses -p PID -I 1000
...
...    21,400,513      dTLB-load-misses
...      4,633,644      iTLB-load-misses
...

As you can see the THP performance impact is measurable and too significant to ignore: 4.1 ms vs 3.2 ms at p95, and ~100M vs ~25M TLB misses.
I also used SystemTap to measure a few kernel functions like collapse_huge_page, clear_huge_page, split_huge_page. There were no significant spikes using THP.
AFAIR that was a 3.10 kernel, which is 4 years old now. I can repeat the experiments with newer kernels if there's interest. (I don't know what was changed there though.)

Andriy Plokhotnyuk

Aug 13, 2017, 1:23:35 AM
to mechanical-sympathy
Alexandr, you shared too few details to judge the impact of the THP option. It is easy to fool yourself by measuring latency without taking into account coordinated omission, or by just not looking at max values.

Gil Tene

Aug 13, 2017, 3:10:01 AM
to mechanical-sympathy
Unfortunately, just because you didn't run into a huge spike during your test doesn't mean it won't hit you in the future... The stack trace example I posted earlier represents the path that will be taken if an on-demand allocation page fault on a THP-allocated region happens when no free 2MB page is available in the system. Inducing that behavior is not that hard, e.g. just do a bunch of high volume journaling or logging, and you'll probably trigger it eventually. And when it does take that path, that will be your thread de-fragging the entire system's physical memory, one 2MB page at a time.

And when that happens, you're probably not talking 10-20 msec. More like several hundreds of msec (growing with the system physical memory size; the specific stack trace is taken from a RHEL issue that reported >22 seconds). If that occasional outlier is something you are fine with, then turning THP on for the speed benefits you may be seeing makes sense. But if you can't accept the occasional ~0.5+ sec freezes, turn it off.

Alexandr Nikitin

Aug 13, 2017, 7:29:50 AM
to mechanical-sympathy
Regarding measurements: I understand that it's hard. In my case, the measurements were done on production servers and production load. Servers were not overloaded, they got ~40% of their capacity. Latencies were gathered for a few dozen minutes. Kernel (khugepaged) functions probing was done for a few hours (I think).
What I didn't measure is the maximum throughput, the slow allocation and compaction path mentioned by Gil, page table size and page walking time. If anyone knows how to probe the kernel page walking time, it would be interesting to compare whether the page and table sizes affect it or not.
It could be a good time to repeat the experiments. Please advise what and how to measure.

> The stack trace example I posted earlier represents the path that will be taken if an on-demand allocation page fault on a THP-allocated region happens when no free 2MB page is available in the system.

To be honest I thought that if THP fails to allocate a hugepage it falls back to regular pages. I thought that khugepaged does the compaction logic (if the setting is not "always", it turns out). I see it in the docs https://www.kernel.org/doc/Documentation/vm/transhuge.txt

"- if a hugepage allocation fails because of memory fragmentation,
  regular pages should be gracefully allocated instead and mixed in
  the same vma without any failure or significant delay and without
  userland noticing
"

The compaction/defrag phase can be addressed with its own flags:
/sys/kernel/mm/transparent_hugepage/defrag
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs

I'm not a kernel expert though and I may be wrong. I'm really interested if those flags could solve or mitigate the freezes people mentioned here.

> If that occasional outlier is something you are fine with, then turning THP on for the speed benefits you may be seeing makes sense. But if you can't accept the occasional ~0.5+ sec freezes, turn it off.

I just wanted to show, for people who blindly follow advice on the Internet (and there are many such suggestions), that there's an impact. It can be noticeable and depends on setup and load.

Gil Tene

Aug 13, 2017, 12:17:31 PM
to mechanical-sympathy


On Sunday, August 13, 2017 at 4:29:50 AM UTC-7, Alexandr Nikitin wrote:
> Regarding measurements: I understand that it's hard. In my case, the measurements were done on production servers and production load. Servers were not overloaded, they got ~40% of their capacity. Latencies were gathered for a few dozen minutes. Kernel (khugepaged) functions probing was done for a few hours (I think).
> What I didn't measure is the maximum throughput, the slow allocation and compaction path mentioned by Gil, page table size and page walking time. If anyone knows how to probe the kernel page walking time, it would be interesting to compare whether the page and table sizes affect it or not.
> It could be a good time to repeat the experiments. Please advise what and how to measure.

Since the question being asked is "Does THP cause any long stalls on your system?", I'd just run your measurement over several days, and focus on the maximum. If you can trace stuff, trace try_to_compact_pages, since that's what the allocation path would be calling when the bad situations happen.
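For example, a minimal SystemTap sketch for timing it (assuming kernel debuginfo is installed and the function isn't inlined away on your kernel):

$ sudo stap -e 'global t
  probe kernel.function("try_to_compact_pages") { t[tid()] = gettimeofday_us() }
  probe kernel.function("try_to_compact_pages").return {
    if (tid() in t) {
      printf("%s (pid %d): direct compaction took %d us\n",
             execname(), pid(), gettimeofday_us() - t[tid()])
      delete t[tid()]
    }
  }'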
 
> > The stack trace example I posted earlier represents the path that will be taken if an on-demand allocation page fault on a THP-allocated region happens when no free 2MB page is available in the system.
> To be honest I thought that if THP fails to allocate a hugepage it falls back to regular pages. I thought that khugepaged does the compaction logic (if the setting is not "always", it turns out). I see it in the docs https://www.kernel.org/doc/Documentation/vm/transhuge.txt "- if a hugepage allocation fails because of memory fragmentation, regular pages should be gracefully allocated instead and mixed in the same vma without any failure or significant delay and without userland noticing "

Yeah, that's what I thought it meant when I first read that stuff a few years ago too. Unfortunately, "fails because of memory fragmentation" [currently, in implementation] still means "fails if fragmentation is bad enough that it prevents defragmentation after trying to actually defragment" rather than "fails because fragmentation has left no currently free 2MB pages". The "regular pages should be gracefully allocated instead" part [unfortunately] only happens after an attempt to compact pages... It is "graceful" in the sense that it doesn't crash, but not in the sense that it doesn't [potentially] stall for a very long time. The stack trace in my earlier post shows the path of the allocation first trying to compact (with try_to_compact_pages) before falling "gracefully" back to satisfying the fault with 4KB page allocations.

It looks like newer kernels do support a "defer" option, see separate posting (following this one) for that.
 
> The compaction/defrag phase can be addressed with its own flags:
> /sys/kernel/mm/transparent_hugepage/defrag
> /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
> /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs

The settings above only control the background khugepaged behavior. khugepaged runs in the background, and you can use these settings to control its behavior and e.g. make it more aggressive. But your process's on-demand-allocation page faults (hitting on an already mmapped, but not-yet-modified contiguous 2MB region that THP aims to satisfy by creating 2MB mappings) can always occur in between the sleeps that this background process does, and enough such faults can occur to outrun whatever khugepaged has produced.

khugepaged does not evict filesystem cache and buffer pages, it only defragments them (they are movable, so it will shift them around to try to make the free memory show up in contiguous 2MB ranges), so the amount of free memory it gets to play with on a long running system is often only a few hundreds of MB (/proc/sys/vm/min_free_kbytes can/should be increased, but usually folks won't set it to more than 1-2GB). Vigorous i/o paging activity [which generally happens in 4KB page units] will often fragment stuff quickly.
 
> I'm not a kernel expert though and I may be wrong. I'm really interested if those flags could solve or mitigate the freezes people mentioned here.

It looks like newer kernels do support a "defer" setting option for THP. That setting seems to avoid trying to compact memory in the allocation path. It may replace my general recommendation to set things to "never" [if you care about avoiding huge outliers] once I get to verify it in a few places.
 


> > If that occasional outlier is something you are fine with, then turning THP on for the speed benefits you may be seeing makes sense. But if you can't accept the occasional ~0.5+ sec freezes, turn it off.
> I just wanted to show, for people who blindly follow advice on the Internet (and there are many such suggestions), that there's an impact. It can be noticeable and depends on setup and load.

And please keep doing that. Concrete results postings are useful. And the speed benefit you show in your application is quite compelling. My motivation is quite similar, but my focus here was on highlighting the "if you want to avoid terrible outliers" thing in answering the original questions at the top of this thread. I see way too many "recommendations on the internet" based purely on speed, which ignore the outliers and other degenerate thrashing behaviors that may occur (infrequently, but far too often for some)...

Gil Tene

Aug 13, 2017, 1:20:14 PM
to mechanica...@googlegroups.com
It looks like newer kernels (4.6 and above) support a "defer" option for THP behavior (4.11 and above also support "defer+madvise"). This behavior is set via /sys/kernel/mm/transparent_hugepage/defrag (separate from /sys/kernel/mm/transparent_hugepage/enabled). With the defer option on, allocations avoid [synchronously] attempting to compact memory, and should avoid the huge outliers discussed earlier.

See Andrea Arcangeli's FOSDEM slides from Feb, 2017 for some details. 

cat /sys/kernel/mm/transparent_hugepage/defrag will show you the available options on your system. Looking up the kernel function defrag_show() is probably the easiest way to check for this on a given kernel's sources:
Starting with 4.11, it shows "[always] defer defer+madvise madvise never" (http://elixir.free-electrons.com/linux/v4.11/source/mm/huge_memory.c#L219)
Starting with 4.6 , it shows "[always] defer madvise never" (http://elixir.free-electrons.com/linux/v4.6/source/mm/huge_memory.c#L380)
Prior versions (4.5 and earlier) did not have a defer option... And RHEL/CentOS don't either (not up to RHEL 7.3 anyway).

Personally, I'll need to kick this around a bit to see if it really works before starting to recommend using it (in place of "never") in cases where people care about avoiding huge outliers. Unfortunately, the latest RHEL, CentOS, and Oracle Linux versions don't yet have the "defer" option, and only the latest Ubuntu LTS update (16.04.02) does. So many of the production environments run by people I get to interact with don't yet have the feature. It may take a while to get real production experience with "defer". But assuming it delivers on the promised behavior, it does present a compelling argument for upgrading to the latest releases when they start supporting the feature. E.g. for Ubuntu users, moving to Ubuntu LTS 16.04.02 (or some later release), switching THP enabled to "always" and THP defrag to "defer" may provide a very real performance boost.
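i.e. something like this on a 4.6+ kernel (a sketch):

$ echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
$ echo defer  | sudo tee /sys/kernel/mm/transparent_hugepage/defrag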







Alexandr Nikitin

Aug 18, 2017, 6:00:37 AM
to mechanical-sympathy
I decided to write a post about measuring the performance impact (otherwise it stays in my messy notes forever).
Any feedback is appreciated.

Gil Tene

Aug 18, 2017, 11:32:40 AM
to mechanica...@googlegroups.com
This is very well written and quite detailed. It has all the makings of a great post I'd point people to. However, as currently stated, I'd worry that it would (mis)lead readers into using THP with "always" /sys/kernel/mm/transparent_hugepage/defrag settings (instead of "defer"), and/or on older (pre-4.6) kernels with a false sense that the many-msec slow path allocation latency problems many people warn about don't actually exist. You do link to the discussions on the subject, but the measurements and summary conclusion of the posting alone would not end up warning people who don't actually follow those links.

I assume your intention is not to have the reader conclude that "there is lots of advice out there telling you to turn off THP, and it is wrong. Turning it on is perfectly safe, and may significantly speed up your application", but instead you are aiming for something like "THP used to be problematic enough to cause wide ranging recommendations to simply turn it off, but this has changed with recent Linux kernels. It is now safe to use in widely applicable ways (with the right settings) and can really help application performance without risking huge stalls". Unfortunately, I think that many readers would understand the current text as the former, not the latter.

Here is what I'd change to improve on the current text:

1. Highlight the risk of high slow-path allocation latencies with the "always" (and even "madvise") setting in /sys/kernel/mm/transparent_hugepage/defrag, the fact that the "defer" option is intended to address those risks, and that this defer option is available with Linux kernel versions 4.6 or later.

2. Create an environment that would actually demonstrate these very high (many msec or worse) latencies in the allocation slow path with defrag set to "always". This is the part that will probably take some extra work, but it will also be a very valuable contribution. The issues are so widely reported (into the 100s of msec or more, and with a wide variety of workloads as your links show) that intentional reproduction *should* be possible. And being able to demonstrate it actually happening will also allow you to demonstrate how newer kernels address it with the defer setting.

3. Show how changing the defrag setting to "defer" removes the high latencies seen by the allocation slow path under the same conditions.

For (2) above, I'd look to induce a situation where the allocation slow path can't find a free 2MB page without having to defragment one directly. E.g.
- I'd start by significantly slowing down the background defragmentation in khugepaged (e.g set /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs to 3600000). I'd avoid turning it off completely in order to make sure you are still measuring the system in a configuration that believes it does background defragmentation.
- I'd add some static physical memory pressure (e.g. allocate and touch a bunch of anonymous memory in a process that would just sit on it) such that the system would only have 2-3GB free for buffers and your netty workload's heap. A sleeping jvm launched with an empirically sized and big enough -Xmx and -Xms and with AlwaysPreTouch on is an easy way to do that.
- I'd then create an intentional and spiky fragmentation load (e.g. perform spikes of scanning through a 20GB file every minute or so).
- With all that in place, I'd then repeatedly launch and run your Netty workload without the PreTouch flag, in order to try to induce situations where an on-demand allocated 2MB heap page hits the slow path, and the effect shows up in your netty latency measurements. (A rough sketch of this setup follows below.)
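A rough shell sketch of that setup (all sizes, paths, and the SleepForever class are illustrative placeholders, not a tested recipe):

$ # 1. Slow background defragmentation way down (but leave it on)
$ echo 3600000 | sudo tee /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs

$ # 2. Static memory pressure: a JVM that pre-touches a big heap and just sleeps
$ java -Xms48g -Xmx48g -XX:+AlwaysPreTouch SleepForever &

$ # 3. Spiky fragmentation load: stream a large file through the page cache every minute
$ while true; do dd if=/data/20g.bin of=/dev/null bs=1M; sleep 60; done &

$ # 4. Then repeatedly run the Netty workload *without* -XX:+AlwaysPreTouch and watch its latency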

All the above are obviously experimentation starting points, and may take some iteration to actually induce the demonstrated high latencies we are looking for. But once you are able to demonstrate the impact of on-demand allocation doing direct (synchronous) compaction both in your application latency measurement and in your kernel tracing data, you would then be able to try the same experiment with the defrag setting set to "defer" to show how newer kernels and this new setting now make it safe (or at least much more safe) to use THP. And with that actually demonstrated, everything about THP recommendations for freeze-averse applications can change, making for a really great posting.

Sent from my iPad

Alexandr Nikitin

Aug 20, 2017, 10:32:45 AM
to mechanical-sympathy
Thank you for the feedback! Appreciate it. Yes, you are right. The intention was not to show that THP is an awesome feature but to share techniques to measure and control risks. I made the changes to highlight the purpose and risks.

The experiment is indeed interesting. I believe the "defer" option should help in that environment. I'm really keen to try the latest kernel (related not only to THP).

Frankly, I still don't have a strong opinion about huge latency spikes in the allocation path in general. I'm not sure whether it's a THP issue or the application/environment itself. Likely it's high memory pressure in general that causes spikes. Or the root of the issue is in something else, e.g. the jemalloc case.

Peter Booth

Aug 23, 2017, 6:52:59 AM
to mechanical-sympathy

Some points:

Those of us working in large corporate settings are likely to be running close to vanilla RHEL 7.3 or 6.9 with kernel versions 3.10.0-514 or 2.6.32-696 respectively.

 I have seen the THP issue first hand in a dramatic fashion. One Java trading application I supported ran with heaps that ranged from 32GB to 64GB, 
running on Azul Zing, with no appreciable GC pauses. It was migrated from Westmere hardware on RHEL 5.6 to (faster) Ivy Bridge hardware on RHEL 6.4. 
In non-production environments only, the application suddenly began showing occasional pauses of up to a few seconds. Occasional meaning only 
four or five out of 30 instances showed a pause, and they might only have one or two or three pauses in a day. These instances ran a workload that 
replicated a production workload.  I noticed that the only difference between these hosts and the healthy production hosts was that, due to human error,
THP was disabled on the production hosts but not the non-prod hosts. As soon as we disabled THP on the non-prod hosts the pauses disappeared.

This was a reactive discovery - I haven't done any proactive investigation of the effects of THP. This was sufficient for me to rule it out for today.

Tom Lee

Aug 23, 2017, 1:51:45 PM
to mechanica...@googlegroups.com
Peter, just want to say I've also seen very similar behavior with JVM heap sizes ~16GB. I feel like I've seen multiple "failure" modes with THP, but most alarmingly we observed brief system-wide lockups in some cases, similar to those described in: https://access.redhat.com/solutions/1560893. (Don't quite recall if we saw that exact "soft lockup" message, but do recall something similar -- and around the time we saw that message we also observed gaps in the output of a separate shell script that was periodically writing a message to a file every 5 seconds.)

I'm probably just scarred from the experience, but to me the question of whether to leave THP=always in such environments feels more like "do I want to gamble on this pathological behavior occurring?" than some dial for fine tuning performance. Maybe it's better in more recent RHEL kernels, but never really had a reason to roll the dice on it.

(This shouldn't scare folks off [non-transparent] hugepages entirely though -- had much better results with those.)

--
Tom Lee http://tomlee.co / @tglee

Peter Booth

Aug 24, 2017, 7:16:09 AM
to mechanica...@googlegroups.com
I agree completely. For me it's a no-brainer. I have missed countless nights' sleep, 
Thanksgiving dinners, weekends, vacations because of buggy code. I don't complain -
it comes with the territory. I've taken short-term consulting jobs where I worked close
to 24hrs a day helping resolve critical outages whilst my vacationing family were in
a swimming pool and I sat inside with a laptop. So I appreciate stability. THP is a great
idea with a broken implementation.

Life is too short to deploy known broken configurations. 

transparent_hugepage=never has worked well for me so far.
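For example, to make it stick across reboots via the boot command line (a sketch for a grub2-based distro; the mkconfig command and file paths vary by distro):

# in /etc/default/grub, append to the kernel command line:
GRUB_CMDLINE_LINUX="... transparent_hugepage=never"

$ # then regenerate the grub config and reboot:
$ sudo grub2-mkconfig -o /boot/grub2/grub.cfg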
