Why are page faults used for safe point polling instead of other signals?


Steven Stewart-Gallus

unread,
May 18, 2020, 12:27:36 AM5/18/20
to mechanical-sympathy
Hi
As I understand it, most VMs poll for safepoints by using memory-management tricks to page-fault the polling threads: a page is set aside to read from, and whenever a safepoint is desired the page is made unreadable.

But couldn't a number of other hardware traps be used instead (https://wiki.osdev.org/Exceptions)?

I'm not sure whether a conditional jump to a trapping debug instruction would be slow or not.

Also, why not read through a pointer to a page instead of reading the page directly? That way only an atomic write to the pointer is needed instead of a whole memory-protection syscall.

Also, an atomically written boolean flag is only one of many possible kinds of branches.
You could have an indirect call (or possibly a jump) and just overwrite a function pointer to install your handler, for example.

The standard way of doing things seems pretty sensible, but I've never actually seen a concrete comparison.

Basically, in pseudocode, why

void poll() {
    page[0];
}
void trap() {
    mprotect(page, PAGE_SIZE, PROT_NONE);
}

over

void poll() {
    page[0];
}
void trap() {
    page = 0;  /* NULL: the next poll's read faults */
}

or

void poll() {
    1 / bit;
}
void trap() {
    bit = 0;  /* next poll divides by zero and traps */
}

And why

void poll() {
    if (flag) dostuff();
}

over

void poll() {
    f();
}
void trap() {
    f = handler;  /* swap in the safepoint handler */
}

Or

void poll() {
    if (flag) __builtin_trap();
}

It probably makes sense to do safepoint polls the standard way, but I've never seen a detailed comparison of exactly why.

Alex Blewitt

unread,
May 18, 2020, 4:47:20 AM5/18/20
to mechanica...@googlegroups.com
There are a couple of overlapping questions here. I hope I can answer them, if not necessarily in the right order.

Reads are used rather than writes because reads don’t incur cross-CPU cache invalidations. If threads were writing to a page, then cache invalidation traffic would be sent between CPUs, and it would also generate outgoing memory-bus traffic to no effect. With reads, you’re essentially just validating the access against a permissions bit in the TLB, and when you pull in the cache line it’s in a shared state, so it doesn’t affect other CPUs.

The second reason an ‘if’ conditional is not used is to avoid the jump. A speculating CPU can of course continue executing past the ‘if’ condition in order to make further progress, but the branch may limit how far the CPU can speculate ahead. In fact, if you had the ‘if’ condition in place, you wouldn’t need a trap at all – you could just jump to the place where the trap handler does its work.

Protecting a single page works because in the normal flow the read is effectively a free no-op past which the CPU can keep executing; it doesn’t pollute branch prediction, consume outgoing memory bandwidth, or cause cross-CPU cache invalidations.

Note that the single-page approach is the one chosen by OpenJDK/Hotspot; other JVMs have the ability to stop on a per-thread basis rather than globally bringing everything to a halt.

Alex

Sent from my iPhone 📱


Ben Evans

unread,
May 18, 2020, 5:54:38 AM5/18/20
to mechanica...@googlegroups.com
In Hotspot there's also JEP 312 (http://openjdk.java.net/jeps/312) as of Java 10, which might be useful to you, depending on your use case.

Is anyone familiar enough with J9 to comment on how that works? I seem to remember from talking to Gil that the approach Zing takes is also interesting, but that was a few years ago...

Thanks,

Ben

Gil Tene

unread,
May 18, 2020, 12:04:25 PM5/18/20
to mechanica...@googlegroups.com
This is an ever-evolving and ever-explored field...

The "current" (and in "typically used in production" 8 and 11) versions of OpenJDK and HotSpot perform global safepoints as an all-or-nothing, Stop-The-World (STW) event. Since the frequency of STW pauses will generally tend to be low (for obvious reasons, if it were high, you'd be facing much bigger complaints), the likelihood of a safepoint poll actually triggering will be VERY low (think "1 in a billion" or less for a practical order of magnitude feel). As such, code that accepts STW pauses has historically been optimized for the NOT triggering case, and a "blind load" from a potentially protected location has bee (empirically chosen as) the fastest way to perform the poll.

A way to look at a read from a potentially protected page is as:

1) An implicit "predicted not taken" form of a check. CPUs will never predict that a fault will be taken (that would be silly), so no branch prediction state is needed for this. In contrast, conditional branches use the branch prediction logic and state, both contaminating it and being sensitive to it...
2) A cheaper (fewer instructions and u-ops, smaller i-cache and u-op cache footprint) way of doing the test.

 
As JVMs transition to doing better-than-STW coordination (e.g. the Checkpoints used in Zing [a technique first published in 2005 [1]], since mimicked in other JVMs with e.g. "Thread-Local Handshakes" [2]), the techniques used for the poll will vary. E.g. the latest HotSpot Thread-Local Handshakes mechanism [when used] currently uses an additional dereference through a thread-local pointer in order to provide per-thread control (adding latency and an additional load, but avoiding a conditional branch, and still accepting the hit of a trap on trigger), while the current Zing checkpoints use a conditional branch on a thread-local flag (a single load, and a much cheaper trigger when one does happen, but an added conditional branch). One should not project too much from the current schemes though, as they tend to shift based on empirical observations that vary both as CPUs evolve (getting better at making highly predictable conditional branches invisible from a performance point of view) and as needs evolve (e.g. the wish for less disruptive triggering as triggering reasons become more frequent).

It is also worth noting that the actual dynamic frequency of safepoint polling in generated code has been significantly reduced over time, as code optimizers (e.g. C2, Falcon, Graal) seek to minimize the dynamic occurrence of safepoint polls for reasons that have nothing to do with the polling instructions used: crossing a safepoint poll [a point at which the full and consistent VM state may need to be observable] is often problematic for various cool optimizations (e.g. escape analysis, redundant write elimination, and even hoisting the loads of certain loop invariants out of short-lived loops, to name a few). As optimizers have gotten better at (and found more motivations for) reducing the dynamic occurrence of such poll points, the importance of the instructions used in the polling sequence diminishes.

-- Gil.

  
 [1] "The Pauseless GC Algorithm" https://www.usenix.net/legacy/events/vee05/full_papers/p46-click.pdf
 [2] "JEP 312: Thread-Local Handshakes" http://openjdk.java.net/jeps/312

Steven Stewart-Gallus

unread,
May 18, 2020, 12:51:04 PM5/18/20
to mechanica...@googlegroups.com
Hi

What I guess I mean is that I'm looking for more papers like this one: https://dl.acm.org/doi/10.1145/2887746.2754187. I guess I did not explain myself well.

I'm well aware of the theory behind choosing a simple read, but the key phrase is "empirically chosen". Are there public benchmarks on real-world software? Has anything changed on recent hardware? It seems the obviously fastest way to cause a fault, but what if it is not?

Thanks
Steven Stewart-Gallus



Steven Stewart-Gallus

unread,
May 29, 2020, 12:46:38 PM5/29/20
to mechanical-sympathy
Okay, I have an idea.
I can't shake the thought that you could do fun tricks with thread-local executable pages.

The theoretically fastest way of safepoint polling is inserting a trap instruction, but icache overheads dominate. If the icache is based on physical addresses and not virtual ones, then it should be possible to remap the page without doing icache synchronization.
You should be able to get very fast safepoints by remapping the page.

But I'm not sure there's a fast way to do a call/return from thread-local storage. And a call/return to a constant page still might not be faster than just a load.

TL;DR:
Limited self-modifying code without icache syncing could be possible with memory-management tricks, as long as the icache and related structures are based on physical addresses.

Steven Stewart-Gallus

unread,
May 29, 2020, 2:25:02 PM5/29/20
to mechanical-sympathy
Behold!

I don't think this actually works safely, though, on x86 at least.
I'm pretty sure x86 uses the virtual address for instruction caching, and debuggers have to do synchronisation when modifying code from a different address space.

Oh well.
[Attachment: self-modifying-code.c]

Alex Blewitt

unread,
May 29, 2020, 6:16:46 PM5/29/20
to mechanica...@googlegroups.com
On 29 May 2020, at 17:46, Steven Stewart-Gallus <stevensele...@gmail.com> wrote:

> Okay I have an idea.
> I can't shake the idea you could do fun tricks with thread local executable pages.
>
> The theoretically fastest way of safepoint polling is inserting a trap instruction.

On what basis are you making that assumption? Moreover, how do you know where to insert that instruction at a safe point in the code?

The point of a safepoint is to bring code – wherever it is executing at the time – to the next safe place to stop. There are thousands upon thousands of these in native code generated by the JVM. How would you propose putting the trap instruction in all of them at one time?

If you’re jumping to/from a memory location, just so you can jump back 99% of the time, then you’re going to pollute the branch prediction caches and likely introduce stalls in the pipeline. And to do this, the OS will need to verify whether the page (thread local or not) has execute permissions; in effect, doing a similar permissions check as the current ’test a page’ but much more.

> But icache overheads dominate. If the icache is based on physical addresses and not virtual ones then it should be possible to remap the page without doing icache synchronization.
> You should be able to have very fast safepoints by remapping the page.

You can get even faster ones by flipping a bit in the page’s protection – no remapping needed then. And if you’re not changing the mapping itself, then you probably don’t even flush the cache line for that part of the page.


> But I'm not sure there's a fast way to do a call/return from thread local storage. And a call/return from a constant page still might not be faster than just a load.
>
> TLDR:
> Limited self modifying code without icache syncing stuff could be possible with memory management tricks as long as the icache and other stuff is based on physical addresses.

Should be possible, but won’t be faster.

Alex

Gil Tene

unread,
May 30, 2020, 10:58:58 AM5/30/20
to mechanical-sympathy
HotSpot used to actually safepoint by patching the running code of threads, at some point ahead of where you suspended them. The notion was that this lets you completely avoid any polling instructions and keeps the fast path as fast as possible. HotSpot gave up on doing this almost 20 years ago because of a myriad of annoying tail bugs, including ones that had to do with edge cases around how OSs deal with suspension, signal delivery, etc., and around the delicate challenges of patching running code safely and atomically. It’s not that such a scheme couldn’t be made to work (it actually ran in production versions for a while); it’s that it had too much hair on it to keep going, and ultimately was not worth it given the very low cost of actual polling instructions on modern OOO CPUs.



Gil Tene

unread,
May 30, 2020, 11:07:00 AM5/30/20
to mechanical-sympathy
The fast-path overhead in this polling scheme (calling code at a specific memory location and returning from it, relying on remapping of the page the code is in to change its behavior from do-nothing to take-safepoint) is much higher than the currently-popular polling schemes of loading from a protectable page or testing a bit in memory. It also has the downside of only working globally (no thread-specific safepointing capability like the one we use in Zing, or the one HotSpot introduced in recent versions).