This is very well written and quite detailed. It has all the makings of a great post I'd point people to. However, as currently stated, I'd worry that it would (mis)lead readers into using THP with the "always" setting in /sys/kernel/mm/transparent_hugepage/defrag (instead of "defer"), and/or on older (pre-4.6) kernels, with a false sense that the many-msec slow-path allocation latency problems that many people warn about don't actually exist. You do link to the discussions on the subject, but the measurements and summary conclusion of the posting alone would not end up warning people who don't actually follow those links.
I assume your intention is not to have the reader conclude that "there is lots of advice out there telling you to turn off THP, and it is wrong. Turning it on is perfectly safe, and may significantly speed up your application", and that you are instead aiming for something like "THP used to be problematic enough to cause wide-ranging recommendations to simply turn it off, but this has changed with recent Linux kernels. It is now safe to use in widely applicable ways (with the right settings) and can really help application performance without risking huge stalls". Unfortunately, I think that many readers would understand the current text as the former, not the latter.
Here is what I'd change to improve on the current text:
1. Highlight the risk of high slow-path allocation latencies with the "always" (and even "madvise") setting in /sys/kernel/mm/transparent_hugepage/defrag, the fact that the "defer" option is intended to address those risks, and that this "defer" option is available in Linux kernel versions 4.6 and later (a quick setup sketch follows right after this list).
2. Create an environment that would actually demonstrate these very high (many msec or worse) latencies in the allocation slow path with defrag set to "always". This is the part that will probably take some extra work, but it will also be a very valuable contribution. The issues are so widely reported (into the 100s of msec or more, and with a wide variety of workloads, as your links show) that intentional reproduction *should* be possible. And being able to demonstrate it actually happening will also allow you to demonstrate how newer kernels address it with the "defer" setting.
3. Show how changing the defrag setting to "defer" removes the high latencies seen by the allocation slow path under the same conditions.
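For (1) above, assuming root access on a 4.6+ kernel, checking and switching the setting is just a couple of sysfs operations. A minimal sketch (note that the setting does not persist across reboots):

    # Confirm the kernel is new enough (4.6+) to offer the "defer" option
    uname -r

    # The currently active option is shown in [brackets], e.g.:
    #   always defer defer+madvise madvise [never]
    cat /sys/kernel/mm/transparent_hugepage/defrag

    # Defer defragmentation to the background instead of stalling the faulting thread
    echo defer | sudo tee /sys/kernel/mm/transparent_hugepage/defrag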
For (2) above, I'd look to induce a situation where the allocation slow path can't find a free 2MB page without having to defragment one directly; e.g. (a rough combined sketch of these steps follows after the bullets):
- I'd start by significantly slowing down the background defragmentation in khugepaged (e.g. set /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs to 3600000). I'd avoid turning it off completely in order to make sure you are still measuring the system in a configuration that believes it does background defragmentation.
- I'd add some static physical memory pressure (e.g. allocate and touch a bunch of anonymous memory in a process that would just sit on it) such that the system would only have 2-3GB free for buffers and your Netty workload's heap. A sleeping JVM launched with an empirically sized and big enough -Xmx and -Xms, and with AlwaysPreTouch on, is an easy way to do that.
- I'd then create an intentional and spiky fragmentation load (e.g. perform spikes of scanning through a 20GB file every minute or so).
- With all that in place, I'd then repeatedly launch and run your Netty workload without the PreTouch flag, in order to try to induce situations where an on-demand-allocated 2MB heap page hits the slow path, and the effect shows up in your Netty latency measurements.
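Putting those bullets together, here is roughly what I'd imagine the harness looking like. It is just a starting point: the 'Sleep' class, the file path, the heap sizes, and the 'run-netty-benchmark.sh' invocation are placeholders for your own equivalents, and the sizes will need empirical tuning on your machine.

    #!/usr/bin/env bash
    # Sketch: try to force on-demand 2MB THP allocations into the direct-compaction
    # slow path while defrag is set to "always". All sizes and paths are placeholders.

    # Baseline configuration for the "bad case" run
    echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
    echo always | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

    # Slow background defragmentation way down (one khugepaged scan per hour),
    # but don't disable it, so the system still "believes" it defragments in the background
    echo 3600000 | sudo tee /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs

    # Static memory pressure: a JVM that pre-touches its whole heap and then sleeps.
    # 'Sleep' is a hypothetical one-liner class whose main() just does Thread.sleep(Long.MAX_VALUE).
    # Size -Xms/-Xmx empirically so only ~2-3GB stay free for buffers and the Netty heap.
    java -Xms24g -Xmx24g -XX:+AlwaysPreTouch Sleep &

    # Spiky fragmentation load: scan a large file once a minute to churn the page cache
    ( while true; do
        dd if=/path/to/20GB.file of=/dev/null bs=1M status=none
        sleep 60
      done ) &

    # Repeatedly run the Netty workload *without* -XX:+AlwaysPreTouch, so heap pages are
    # faulted in on demand and a 2MB allocation can hit the slow path mid-run.
    # Snapshot the relevant /proc/vmstat counters around each run: growth in compact_stall
    # means some allocation did direct (synchronous) compaction during that run.
    for i in $(seq 1 20); do
        grep -E 'compact_stall|thp_fault_(alloc|fallback)' /proc/vmstat > vmstat.before.$i
        ./run-netty-benchmark.sh      # placeholder for your existing benchmark invocation
        grep -E 'compact_stall|thp_fault_(alloc|fallback)' /proc/vmstat > vmstat.after.$i
        diff vmstat.before.$i vmstat.after.$i || true
    done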
All the above are obviously experimentation starting points, and may take some iteration to actually induce the demonstrated high latencies we are looking for. But once you are able to demonstrate the impact of on-demand allocation doing direct (synchronous) compaction, both in your application latency measurements and in your kernel tracing data, you would then be able to try the same experiment with the defrag setting set to "defer", to show how newer kernels and this new setting now make it safe (or at least much safer) to use THP. And with that actually demonstrated, everything about THP recommendations for freeze-averse applications can change, making for a really great posting.
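For the "defer" leg of that comparison, the exact same harness can be re-run after flipping just the one setting, and the kernel-side counters give you the tracing-level evidence to pair with the latency histograms. Again only a sketch, and the perf event is only usable if your kernel/perf build exposes the compaction tracepoints:

    # Re-run the identical experiment with deferred defragmentation
    echo defer | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

    # Expectation to verify: thp_fault_alloc keeps growing (THP is still being used),
    # while compact_stall stays (nearly) flat because page faults no longer perform
    # synchronous compaction, and the multi-msec outliers drop out of the Netty histogram.
    grep -E 'compact_stall|thp_fault_alloc' /proc/vmstat

    # Optionally, confirm with kernel tracing that direct compaction no longer fires
    # while the workload runs (event names can vary slightly by kernel version)
    sudo perf stat -e compaction:mm_compaction_begin -a -- sleep 60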