Avoid stepping on page faults while writing to MappedByteBuffer

Ruslan Cheremin

Jul 11, 2018, 6:29:58 AM
to mechanical-sympathy
Hello

I have a question about working with memory-mapped buffers and the page faults that come with them.

Scenario: I have a main thread which writes data to a memory-mapped ByteBuffer. The thread is supposed to be soft real-time, so I don't want it to stumble over page faults.

So the obvious idea is to start another thread, a prefetcher, whose job is to periodically touch the buffer 1-2 pages ahead of the current writing position and trigger the page faults before the main thread runs into them.

The question is: how do I trigger page loading without interfering with the main thread's writes?

AFAIK, reading a value from a not-yet-loaded page is not guaranteed to trigger page mapping (virtual memory managers are tricky beasts and could satisfy reads from unmapped pages without actually mapping them).
A write to a page will trigger page loading and mapping, but such a write could overwrite data written by the main thread (i.e. with unlucky timing the main thread could outrun the prefetcher and have already written something at that position).
And I'd prefer to avoid introducing heavy coordination between the main thread and the prefetcher, since it greatly complicates both of them.

Is there any way to force page loading/mapping without using JNI?

Personally, the best idea I've come up with so far is to use CAS(futureBufferPosition, 0, 0) [with the help of Unsafe], sketched below.
I tend to think it could work because:
 1. any plain write actually executed concurrently with the CAS will make the CAS fail, so no data corruption will happen
 2. any plain write executed _after_ the CAS will overwrite its value -- which is fine, since I don't care about the value written by the CAS, only about the write operation itself
 3. the CAS itself changes nothing in memory (it writes 0 over 0), but AFAIK it cannot be optimized out by the hardware or by the JIT because of its additional memory effects -- so the write _will_ be executed and will trigger the page fault.
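
Roughly what I have in mind (just a sketch, assuming Unsafe is obtainable and the absolute base address of the mapped region is known, e.g. taken reflectively from the DirectByteBuffer; class and method names are placeholders):

import sun.misc.Unsafe;
import java.lang.reflect.Field;

// Sketch of the CAS(0, 0) pretouch. baseAddress is the absolute address of the
// mapped region; the offset should be int-aligned so the CAS is atomic everywhere.
final class CasPretoucher {
    private static final Unsafe UNSAFE = loadUnsafe();

    static void touch(long baseAddress, long alignedOffset) {
        // If the word is still 0, this stores 0 back -- no visible change.
        // If the main thread already wrote there, the CAS simply fails and
        // nothing is overwritten. Either way the store attempt forces the
        // page to be faulted in and mapped.
        UNSAFE.compareAndSwapInt(null, baseAddress + alignedOffset, 0, 0);
    }

    private static Unsafe loadUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new AssertionError(e);
        }
    }
}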

Does it seem reasonable? 


----
Ruslan

Roman Leventov

Jul 11, 2018, 10:28:16 AM
to mechanica...@googlegroups.com
I was looking for an elegant solution to exactly this problem some time ago and didn't find one before putting that work on hold.

Without coordination between the threads, how do you know that the prefetcher thread is ahead of the main thread? If you accept that the CAS in the prefetcher thread may fail due to a concurrent write in the main thread, that means the page fault will happen in the main thread.

I suspect that a JNI call to madvise with something like MADV_WILLNEED | MADV_SEQUENTIAL (and no prefetcher thread) is the most practical solution, but I haven't tested this theory.

Francesco Nigro

Jul 11, 2018, 10:43:50 AM
to mechanical-sympathy
I suppose that's why reusing already-mapped pages in order to avoid major page faults is the way to go, subject to both OS configuration and page cache limitations.
AFAIK heavily memory-mapped solutions involve B-tree-like free pools that allow reusing already-mapped pages/chunks, with some madvise magic (as Roman suggested) to help the OS, e.g. LMDB.
The only Java solution I'm aware of (Roman is too, I suppose :P) that uses a mapped approach efficiently for an endless log is Chronicle Queue from Peter L.: some time ago I took a look and it was using a pretoucher.
Neo4j was using a pretoucher as well (I don't know the details), but that's probably just to fight the randomness of accesses to the node representation it has to manage, not for sequential use like a log.

Ruslan Cheremin

Jul 11, 2018, 11:39:47 AM
to mechanical-sympathy
I was looking for an elegant solution to exactly this problem some time ago and didn't find one before putting that work on hold.

I've seen your contribution to log4j2 -- it was merged to master. Are you not satisfied with the approach used there?

BTW: why do you align the pretoucher CAS on a 4-byte boundary? Is it about CAS atomicity on non-x86 platforms?
 

Without coordination between the threads, how do you know that the prefetcher thread is ahead of the main thread? If you accept that the CAS in the prefetcher thread may fail due to a concurrent write in the main thread, that means the page fault will happen in the main thread.

Yes, but I expect such cases to be extremely rare, so I don't care much about latency then -- while I still do care about possible data corruption.
Right now I'm considering some simple wait-free coordination mechanisms, though.
 
I suspect that a JNI call to madvise with something like MADV_WILLNEED | MADV_SEQUENTIAL (and no prefetcher thread) is the most practical solution, but I haven't tested this theory.

Yes, thank you, maybe this will end up being the solution. But I'd like to play with JNI-free approaches for a while.

 

Ruslan Cheremin

Jul 11, 2018, 11:48:17 AM
to mechanical-sympathy
 
I suppose that's why reusing already-mapped pages in order to avoid major page faults is the way to go, subject to both OS configuration and page cache limitations.
AFAIK heavily memory-mapped solutions involve B-tree-like free pools that allow reusing already-mapped pages/chunks, with some madvise magic (as Roman suggested) to help the OS, e.g. LMDB.

Generally, the main advantage of using a memory-mapped file in my scenario is simplicity.
If I need to implement complex machinery to be able to use it efficiently -- then writing to a plain memory buffer and flushing it to disk in a background thread becomes the solution of choice.
 
The only Java solution I'm aware of (Roman is too, I suppose :P) that uses a mapped approach efficiently for an endless log is Chronicle Queue from Peter L.: some time ago I took a look and it was using a pretoucher.

Thank you for the pointer. It seems Peter also uses CAS(0, 0) as a way to pretouch -- that makes me a bit more confident in the approach.
 

Gil Tene

Jul 11, 2018, 12:00:01 PM
to mechanical-sympathy
While a Java-code CAS mechanism (e.g. using VarHandles on Java 9+ or Unsafe on Java 8 and earlier) will seemingly do the trick, there is a VERY strong reason for using a CAS in a custom JNI call instead: safepoints. And Time-To-Safepoint (TTSP).

The subtle difference between doing the prefetching CAS in Java code and in JNI code is that the JNI code is executed AT a safepoint, while the Java code (for both VarHandles and Unsafe) will be executed BETWEEN safepoints.

The reason this matters *a lot* for avoiding page-fault glitches is that whenever the JVM takes a jvm-wide safepoint (for GC, deoptimization, meeting GuaranteedSafepointInterval, deadlock detection, periodic thread-stack walking, profiling, de-biasing, etc. etc. etc.), all application threads experience a pause that is at least as long as the longest TTSP of any one thread (the time before it reaches a safepoint). If a prefetcher thread is in the middle of a page fault BETWEEN safepoint opportunities, it will make all other threads wait at the safepoint, exhibiting an application-wide pause that is as long as the page fault completion time. On the other hand, if the prefetcher thread takes its page fault AT a safepoint (e.g. inside a JNI call), the page fault will not stall the jvm-wide safepoint from happening "quickly", and will not make other threads experience its page-fault stall times.

You may think that's not that big a deal, or not that frequent, but page faults can and do get unlucky, as in "I'm waiting behind 173 other page faults or other i/o operations that are caused by the kernel flushing stuff" unlucky, and can take 100s of msec to complete. Getting double-unlucky by having such a long page fault coincide with a JVM doing a safepoint is actually very common, especially since GuaranteedSafepointInterval on Hotspot defaults to 1 second.


And no, using an already built-in JNI call in the JDK (like Unsafe) won't cut it, because the JVM will intrinsify those and avoid the actual JNI call boundary (and the safepoint). You need the JNI call to be one that is opaque to the JVM, and the safest way to do that is to write the C code yourself, or use an example that the JVM doesn't know about by name (today's JVMs are still not smart enough to look into the .so object code and inline it).
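
For illustration, the Java side of such an opaque call could be as simple as this (a sketch only; the C implementation, which would do the actual CAS and/or madvise on the given address, is not shown, and the library name is a placeholder):

// Java side of an opaque JNI pretouch call. The point is that HotSpot does not
// know this method by name, so the call is a real JNI transition and the thread
// counts as being at a safepoint for the duration of any page fault taken inside.
public final class NativePretoucher {
    static {
        System.loadLibrary("pretoucher"); // your own .so with the C implementation
    }

    // The C side would e.g. do: __sync_val_compare_and_swap((int *) addr, 0, 0);
    public static native void touch(long address);
}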

Ruslan Cheremin

Jul 11, 2018, 2:31:11 PM
to mechanical-sympathy
Thank you, Gil.

Indeed, even though I'm aware of safepoint mechanics, I hadn't looked at my problem from this point of view. Thank you for the clear explanation.

So, as far as I can see now, without JNI -- with only pure Java (+Unsafe/VarHandles) tools -- a mmapped buffer can hardly be used for low/predictable-latency IO. (And with JNI it is better to use madvise, since there are no competing writes and the intention is much clearer.)

----
Ruslan

Gil Tene

Jul 12, 2018, 12:02:16 AM
to mechanical-sympathy
I don't quite trust madvise to actually bring the page in. It's a "hint", and e.g. completion of the madvise() call doesn't mean that the page is actually paged in. A CAS is "for sure" going to have the page in memory (no guarantee that it won't be paged out soon, but that's true for madvise too).

You could do both in the JNI call if you want.

Also, while MADV_WILLNEED is probably safe to use in any case, I'd be careful with MADV_SEQUENTIAL. If you know that only access to the leading edge of the file/log is needed, it's fine. But if you don't know that for sure (e.g. in cases where you have a clear wavefront moving through the file, but additional read access can happen semi-randomly in a range that is many MB behind the leading edge), don't use MADV_SEQUENTIAL, as it tells the kernel that pages read in can soon be evicted (they're probably more likely to become victims than without the flag).

Andrei Pangin

Jul 12, 2018, 8:02:48 PM
to mechanical-sympathy
From our (Odnoklassniki) experience it's really hard, if not impossible, to keep soft real-time guarantees while using MappedByteBuffers. The main reason is exactly the one Gil mentioned: any access to that buffer, even when done in a background thread, may potentially cause a stop-the-world-like pause if a page fault coincides with a safepoint request. That's why we prefer explicit file I/O over MappedByteBuffers in latency-critical applications.

What about allocating a regular direct ByteBuffer and flushing it to disk manually using the FileChannel API? This approach retains nearly the same level of simplicity and performance while being a pure Java solution -- no JNI or Unsafe required.
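
A rough sketch of what I mean (single buffer only; the handoff between the writer and the flusher thread, buffer rotation, and error handling are all omitted, and the names are illustrative):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// The latency-sensitive thread appends into anonymous (non-file-backed) direct
// memory; a background thread drains the buffer to disk through the FileChannel.
final class DirectBufferLog {
    private final FileChannel channel;
    private final ByteBuffer buffer = ByteBuffer.allocateDirect(1 << 20); // 1 MB

    DirectBufferLog(Path path) throws IOException {
        this.channel = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
    }

    // Writer thread: a plain memory write, no file-backed page faults involved.
    void append(byte[] record) {
        buffer.put(record);
    }

    // Background flusher thread: write out whatever has been accumulated.
    void flush() throws IOException {
        buffer.flip();
        while (buffer.hasRemaining()) {
            channel.write(buffer);
        }
        buffer.clear();
    }
}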
--
Andrei

Avi Kivity

Jul 14, 2018, 12:21:28 PM
to mechanica...@googlegroups.com, Ruslan Cheremin

If your requirements are really important, use aio + O_DIRECT. You then know exactly when your memory has the data you want. You don't even need the extra thread.


It's more work for sure, but you are in complete control, instead of playing whack-a-mole with the JVM.

Ruslan Cheremin

Jul 15, 2018, 3:18:07 PM
to mechanical-sympathy
any access to that buffer, even when done in a background thread, may potentially cause a stop-the-world-like pause if a page fault coincides with a safepoint request.

But if it is a madvise JNI call -- the safepoint issue shouldn't happen? It seems the private MappedByteBuffer.load0(start, length) maps exactly to madvise(MADV_WILLNEED), so one doesn't even need to write one's own JNI wrapper...
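I.e. something like this from the prefetcher thread (just a sketch; it assumes the file is mapped chunk by chunk and that the writer later uses the very same buffer, and whether the pages then stay resident is another question):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// The prefetcher maps the next chunk of the file and load()s it before the
// writer rolls onto it: load0() is the native part that issues the madvise,
// and load() then touches one byte per page from Java code.
final class ChunkPrefetcher {
    static MappedByteBuffer prepareNextChunk(FileChannel channel,
                                             long chunkStart,
                                             int chunkSize) throws IOException {
        MappedByteBuffer chunk =
                channel.map(FileChannel.MapMode.READ_WRITE, chunkStart, chunkSize);
        chunk.load();  // executed on the prefetcher thread, not the writer
        return chunk;  // hand this same buffer over to the writer later
    }
}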
 
What about allocating a regular direct ByteBuffer and flushing it to disk manually using the FileChannel API? This approach retains nearly the same level of simplicity and performance while being a pure Java solution -- no JNI or Unsafe required.

Well, yes, this is my backup option. Actually, I already have two implementations: one with a mmapped file and one with a byte buffer flushed through a FileChannel. I was dreaming of one impl to rule them all, and started with the mmapped one because it seems more promising (less copying, almost no data loss on process crash, etc.). Now I'm in doubt.

Thank you for sharing your experience, Andrei

 


Gil Tene

Jul 16, 2018, 12:23:39 PM
to mechanical-sympathy


On Sunday, July 15, 2018 at 12:18:07 PM UTC-7, Ruslan Cheremin wrote:
any access to that buffer, even when done in a background thread, may potentially cause a stop-the-world-like pause if a page fault coincides with a safepoint request.

But if it is a madvise JNI call -- the safepoint issue shouldn't happen? It seems the private MappedByteBuffer.load0(start, length) maps exactly to madvise(MADV_WILLNEED), so one doesn't even need to write one's own JNI wrapper...

The truly-reliable implementations I've seen for avoiding page-fault glitches on mapped-file access from Java don't rely on madvise (which is potentially asynchronous, so you don't actually know when the fetch was done), or even purely on an actual memory touch (which is synchronous, and you know the fetch was actually done, but the page could be evicted shortly after). They use mlock(). E.g. mlock 128MB ahead of all expected access, and munlock 2GB behind the wavefront. The "downside" to such implementations is that they require more knowledge about the application's use pattern [you don't need to know how long things need to be kept around to just touch-fetch, but with mlock you'll run out of physical memory if you don't release the mlock at some point].
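
In outline it looks something like this (a sketch only; nativeMlock/nativeMunlock stand for your own JNI bindings to mlock(2)/munlock(2), the window sizes just follow the example above, and bounds checks against the start and end of the mapping are omitted):

// Maintain a locked window around the writer's leading edge: lock ahead of it,
// unlock far behind it, as the edge advances.
final class MlockWindow {
    static final long AHEAD  = 128L << 20; // keep 128 MB ahead of the wavefront locked
    static final long BEHIND = 2L << 30;   // release anything more than 2 GB behind it

    static native int nativeMlock(long address, long length);   // hypothetical JNI binding
    static native int nativeMunlock(long address, long length); // hypothetical JNI binding

    // Called as the leading edge advances from oldEdge to newEdge
    // (both offsets from "base", the mapping's start address).
    static void advance(long base, long oldEdge, long newEdge) {
        long delta = newEdge - oldEdge;
        nativeMlock(base + oldEdge + AHEAD, delta);    // lock the newly needed range
        nativeMunlock(base + oldEdge - BEHIND, delta); // unlock the range left far behind
    }
}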

Steven Stewart-Gallus

Jan 17, 2019, 2:41:32 AM
to mechanical-sympathy
I think what you want is something like the mincore system call on Linux so your thread can write directly if the page is mapped but offload the work to another thread if it is not mapped. I don't have any experience with the system call though.

Gil Tene

Jan 17, 2019, 3:07:19 AM
to mechanica...@googlegroups.com
mincore is no more useful than pretouch. Unless you lock, a page that was resident at the time of the mincore/pretouch operation can be evicted a few microseconds later. A “this seems to be resident right now” quality does not prevent page faults or long TTSPs and STW pauses on mapped buffer access, it just improves the stats a bit.

Steven Stewart-Gallus

Mar 30, 2019, 1:17:15 PM
to mechanica...@googlegroups.com
I feel like this is just a bug in the JDK that should be patched.

Couldn't this all be solved by replacing UNSAFE.copyMemory with a call to a different method that isn't a HotSpot intrinsic?


Interestingly, I don't believe copySwapMemory is an intrinsic, so an ugly kludge might be to deliberately use a non-native byte order.

Gil Tene

Mar 30, 2019, 2:26:56 PM
to mechanical-sympathy


On Saturday, March 30, 2019 at 10:17:15 AM UTC-7, Steven Stewart-Gallus wrote:
I feel like this is just a bug in the JDK that should be patched.

And how would you "patch" it? Without the result being sucky performance, that is?

The tension is between the performance of mapped byte buffer access, and the wish to avoid being caught in a page fault while not at a safepoint. You can do one or the other "easily": either be at a safepoint on all buffer accesses, or don't. Being at a safepoint in e.g. every call to ByteBuffer.get() [in the actual memory accessing code that is susceptible to page faults] would certainly prevent the TTSP-due-to-page-fault problems. But it would also dramatically reduce the performance of loops with such access in them. Not only because of the need to poll for safepoint conditions on every access but [mostly] because many compiler optimizations are "hard" to do across safepoint opportunities. 
 
Couldn't this all be solved by replacing UNSAFE.copyMemory with a call to a different method that isn't a HotSpot intrinsic?


Interestingly, I don't believe copySwapMemory is an intrinsic, so an ugly kludge might be to deliberately use a non-native byte order.

Counting on things not being intrinsified (or treated as leaf functions that don't need safepoints) is a dangerous practice, since anything in the JDK may become intrinsified or otherwise optimized tomorrow.

But specifically to the above, copySwapMemory (via copySwapMemory0) is already treated as a known leaf method (http://hg.openjdk.java.net/jdk/jdk/file/235883996bc7/src/hotspot/share/prims/unsafe.cpp#l1095), which (among other things) means that [unlike generic JNI calls] no safepoints will be taken on calling it.

