How to achieve stable write latencies with memory mapped files


Petr Postulka

May 3, 2015, 11:45:55 AM
to mechanica...@googlegroups.com
Hi All,

We are using memory-mapped files for the persistence layer in our application, and we are not able to achieve stable write latencies. I'm attaching a simple test case which measures the write latency and contains a few optimizations we have already tried: preallocation plus regular flushing from a different thread. Those techniques helped quite a lot, but stability is still not perfect; peaks are still 20 times, sometimes even 100 times, higher than the mean.
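Roughly, the pattern under test looks like this (a stripped-down sketch, not the attached MemoryMapFileTest.java; the sizes, class name, and the 10 ms flush interval are illustrative):

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class MappedWriteSketch {

    /** Writes fixed-size records into a preallocated mapped file and
     *  returns the worst observed single-write latency in nanoseconds. */
    public static long measure(Path file, int fileSize, int recordSize) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw");
             FileChannel ch = raf.getChannel()) {
            raf.setLength(fileSize); // preallocate so writes never extend the file
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, fileSize);

            // Background flusher: periodic force() so dirty pages are written
            // back continuously instead of in one large burst.
            Thread flusher = new Thread(() -> {
                try {
                    while (!Thread.currentThread().isInterrupted()) {
                        buf.force();
                        Thread.sleep(10);
                    }
                } catch (InterruptedException ignored) { /* exit */ }
            });
            flusher.setDaemon(true);
            flusher.start();

            byte[] record = new byte[recordSize];
            long worst = 0;
            for (int pos = 0; pos + recordSize <= fileSize; pos += recordSize) {
                long t0 = System.nanoTime();
                buf.put(record); // relative put: position advances by recordSize
                worst = Math.max(worst, System.nanoTime() - t0);
            }
            flusher.interrupt();
            return worst;
        }
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("mmap-test", ".dat");
        try {
            System.out.println("worst write ns: " + measure(file, 16 * 1024 * 1024, 128));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```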

Does anyone know a technique to make it more stable? We would be happy with a slightly higher mean if the results were more stable.

Or would you recommend a completely different approach for an application persistence layer, one that is more suitable for stability?

Thank you.

Petr
MemoryMapFileTest.java

Jan van Oort

May 3, 2015, 12:27:05 PM
to mechanical-sympathy
Looks like you are relying upon the combined behaviour of Thread.sleep() and System.nanoTime(). 

From the Java Language Specification for Java 7, Chapter 17.3: 

"Thread.sleep causes the currently executing thread to sleep (temporarily cease execution) for the specified duration, subject to the precision and accuracy of system timers and schedulers"  

The JLS does not say anything about System.nanoTime(), of course, but nanoTime() does have a reputation for being subject to a high degree of jitter. The Java 7 documentation for System.nanoTime() reads as follows:

"This method provides nanosecond precision, but not necessarily nanosecond resolution (that is, how frequently the value changes) - no guarantees are made except that the resolution is at least as good as that of currentTimeMillis()."

In other words: aren't you looking at a combination of OS scheduling limits and an unreliable method of measurement?

Best, 

Jan 



Fortuna audaces adiuvat - hos solos ? 

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Wojciech Kudla

May 3, 2015, 12:27:42 PM
to mechanica...@googlegroups.com
There are many factors impacting the high nines when dealing with mmapped IO. Transparent huge pages and TLB shootdowns were usually some of the worst offenders in my experience.


Petr Postulka

May 3, 2015, 12:56:34 PM
to mechanical-sympathy
Any tricks to mitigate all those factors and make mmapped IO more stable? I know how to disable transparent huge pages ...

Thanks.

Greg Young

May 3, 2015, 1:01:04 PM
to mechanica...@googlegroups.com
TBF those things you can make a career of.

Can you tell more about your expected workload?
Studying for the Turing test

Jan van Oort

May 3, 2015, 1:02:14 PM
to mechanical-sympathy
Don't put your threads to sleep(). Never. Ever. 



Fortuna audaces adiuvat - hos solos ? 

Petr Postulka

May 3, 2015, 1:05:16 PM
to mechanical-sympathy
"TBF those things you can make a career of." - can you explain a little bit more, please?

Expected workload: usually hundreds to thousands of messages per second, each message approximately 100-200 bytes. Peaks in throughput not higher than 100K messages per second.

Wojciech Kudla

May 3, 2015, 1:05:19 PM
to mechanica...@googlegroups.com
You can get rid of TLB shootdowns by moving to /dev/shm.
Write/commit rates are also important, especially with respect to os-level buffer sizes and other parameters controlling the behaviour of vm subsystem.
That would be numa balancing, swappiness, dirty_expire_centisecs, dirty_ratio, dirty_writeback_centisecs and many more.
Also things like scheduling pressure during your tests, RCU activity or even vmstat polling interval.
I suggest familiarizing yourself with ftrace to get a detailed view of kernel-induced latency.

Petr Postulka

May 3, 2015, 1:06:15 PM
to mechanical-sympathy
Hi Jan,

I never use Thread.sleep() in the real application ... it was only for testing purposes.

Thanks.

P.

Greg Young

May 3, 2015, 1:07:10 PM
to mechanica...@googlegroups.com
And persistence? What is needed from it?

Wojciech Kudla

May 3, 2015, 1:07:29 PM
to mechanica...@googlegroups.com

On 3 May 2015 at 18:05, Petr Postulka <ppos...@gmail.com> wrote:
"TBF those things you can make a career of." - can you explain a little bit more, please?

That just means reducing jitter on Linux (or any other OS) takes years of studying. And even then it's a bit like black magic and more about educated guess than established knowledge.

Greg Young

May 3, 2015, 1:07:54 PM
to mechanica...@googlegroups.com
"TBF those things you can make a career of." - can you explain a
little bit more, please?

"Any tricks how to prevent all those factors to make mmapped IO more
stable? I know how to disable transparent huge pages"

along with the other questions you have been asking :)

Greg Young

May 3, 2015, 1:09:24 PM
to mechanica...@googlegroups.com
"And even then it's a bit like black magic and more about educated
guess than established knowledge."

"And even then it's a bit like black magic and more about measuring than established knowledge."

Studying just gives you an educated guess what to measure.

Wojciech Kudla

May 3, 2015, 1:10:50 PM
to mechanica...@googlegroups.com

On 3 May 2015 at 18:09, Greg Young <gregor...@gmail.com> wrote:
And even then it's a bit like black magic and more about measuring
than established knowledge."

Studying just gives you an educated guess what to measure.

I stand corrected :)

Petr Postulka

May 3, 2015, 1:11:55 PM
to mechanical-sympathy
Well, the application is using FIX communication and we need to implement a FIX message store to persist messages in case the counterparty requests a resend.

Petr Postulka

May 3, 2015, 1:15:47 PM
to mechanical-sympathy
Thank you, guys; we will play with the OS a little bit more then and do all sorts of measurements and tests. I just wanted to know whether there are some common things we should take care of.

P.


Andy Smith

May 3, 2015, 1:20:06 PM
to mechanica...@googlegroups.com
Not sure of your use case, but one atrocious way around this problem (that I know is used in industry) is simply to keep a high watermark of your outbound sequence, then gap-fill any requests asking for retransmissions lower than your high watermark.

You can probably only get away with this on the prop-trading side, but it provides a convenient "if you didn't know about this message, it didn't happen". You need to do clean-up on your side for messages that you think you sent but have subsequently gap-filled. But for high message loads plus the probability of a disconnect, it might be the lesser of two evils rather than succumbing to the persistence burden.

I wouldn't recommend this, but it's one way people get around it.
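For illustration only, the bookkeeping amounts to something like this (class and method names are hypothetical; real FIX handling answers with a SequenceReset message carrying GapFillFlag=Y, and an EndSeqNo of 0 on a resend request means "to the latest"):

```java
// Hypothetical sketch of the high-watermark trick: persist nothing, remember
// only the highest outbound sequence number sent, and answer any resend
// request at or below it with a gap fill instead of a retransmission.
final class GapFillStore {
    private long highWatermark; // highest outbound seq number sent

    void recordSent(long seq) {
        highWatermark = Math.max(highWatermark, seq);
    }

    /** Handle a resend request for the inclusive range [begin, end]; end == 0 means "to the latest". */
    String onResendRequest(long begin, long end) {
        long upTo = (end == 0) ? highWatermark : Math.min(end, highWatermark);
        if (begin > upTo) {
            return "nothing to replay";
        }
        // Instead of retransmitting content we never stored, gap-fill the
        // whole requested range in one SequenceReset.
        return "SequenceReset GapFillFlag=Y NewSeqNo=" + (upTo + 1);
    }
}
```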

Cheers,

A.



Petr Postulka

May 3, 2015, 1:24:28 PM
to mechanical-sympathy
Thank you, Andy, for the proposed solution with gap fills, but unfortunately our client requires full persistence.

P.

Gil Tene

May 3, 2015, 9:52:54 PM
to mechanica...@googlegroups.com
Petr, can you provide the histogram output you see on your actual system? [And do it on a system with THP off, obviously.]

As to ways to make the mapped file write latency stable: the "strongest" way to ensure this is to mlock() the file, or the portion of it you are working on. This assures that the pages are actually in physical memory, and won't leave until you unlock them (or unmap them). Doing this in a long-running journaling system often means running a "pre-fetcher / post-flush" thread that both pre-allocates and pre-mlocks the portions of the file that are about to be accessed, and unlocks and flushes portions that have been written but will no longer be written to in the future.
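Plain Java cannot call mlock() directly (that needs JNI or a native-binding library; MappedByteBuffer.load() is only a weaker, best-effort hint), but the pre-fetcher half of the idea can at least be sketched. This is a hypothetical helper, with an assumed 4 KB page size:

```java
import java.nio.MappedByteBuffer;

// Hypothetical pre-toucher: fault pages in ahead of the writer, on a helper
// thread, so the latency-critical thread never takes the fault itself.
public final class PreToucher {
    static final int PAGE_SIZE = 4096; // assumption; query the OS in real code

    /** Touch every page in [from, to) with a write fault. */
    public static void preTouch(MappedByteBuffer buf, int from, int to) {
        for (int pos = from; pos < to; pos += PAGE_SIZE) {
            // Read-modify-write one byte per page: forces the page into
            // memory now, on this thread, instead of on the writer's path.
            buf.put(pos, buf.get(pos));
        }
    }
}
```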

You may still see interesting interactions with the kernel's flushing logic, which in some kernel versions holds locks that are way too wide (and may interfere with the logic of dirtying pages?). I only have indirect knowledge of this, but I think that @mikeb01 has some actual experience with it (want to chime in Mike?).

Obviously, another way to deal with mapped file write latency (or any file write latency) is to punt and do the writing in a separate, off-the-latency-critical-path thread. You'd hand that thread the material to write over a queue of some sort, and make sure the queue is deep enough to work around the sort of write latency glitches you see. Of course, this may not work for environments that require actual persistence in the latency-critical path. But then again, the example code doesn't provide such persistence either, as any not-yet-flushed contents may be lost in a crash. I.e. if you are willing to accept the potential for losing the data in the Linux file buffers for a few tens of msec or more [which is easily possible in the example code], losing the data in a cross-thread queue is not really much different...
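That off-critical-path variant can be sketched as a queue plus a writer thread (illustrative names; a production version would want a bounded lock-free queue, batching, and explicit flushing):

```java
import java.io.FileOutputStream;
import java.nio.file.Path;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch: the latency-critical thread only enqueues; a dedicated thread does
// the actual writing, so file-system hiccups are absorbed by the queue depth
// instead of stalling the hot path.
public class AsyncJournal implements AutoCloseable {
    private static final byte[] POISON = new byte[0]; // shutdown marker
    private final BlockingQueue<byte[]> queue;
    private final Thread writer;

    public AsyncJournal(Path file, int depth) throws Exception {
        queue = new ArrayBlockingQueue<>(depth); // deep enough to ride out write glitches
        FileOutputStream out = new FileOutputStream(file.toFile());
        writer = new Thread(() -> {
            try (out) {
                for (byte[] msg = queue.take(); msg != POISON; msg = queue.take()) {
                    out.write(msg); // blocking I/O, but off the critical path
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
        writer.start();
    }

    /** Critical path: hand the bytes over and return immediately (unless the queue is full). */
    public void append(byte[] msg) throws InterruptedException {
        queue.put(msg);
    }

    @Override
    public void close() throws Exception {
        queue.put(POISON); // let the writer drain, then stop
        writer.join();
    }
}
```

As noted above, this has the same crash-loss window as unflushed mapped writes: anything still sitting in the queue at crash time is gone.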

Michael Barker

May 3, 2015, 11:01:55 PM
to mechanica...@googlegroups.com
> You may still see interesting interactions with the kernel's flushing logic, which in some kernel versions holds locks that are way too wide (and may interfere with the logic of dirtying pages?). I only have indirect knowledge of this, but I think that @mikeb01 has some actual experience with it (want to chime in Mike?).

If you are after consistent latency, there are two things you desperately want to avoid.

1) Page faults on the memory that you are writing to.
2) Inducing structural changes in the page cache (adding/removing pages), though that generally happens as a result of #1 (on Linux anyway).

Preallocation is useful, but only if you are allocating an amount of memory smaller than the memory available for the page cache; otherwise there is a good chance that the memory will be paged out again later. We found that background flushing helps, as it prevents the kernel's background flusher threads from kicking in, but it is not a panacea: there is still a possibility of a faulted memory access running into a page-cache lock held by the thread doing the background syncing.

Personally I would do something similar to what Gil mentions: grab an mmapped file, mlock it into memory, and use it like a ring buffer. Have an archiving thread hanging off the back of that. Have the archiving thread use standard I/O rather than mmapped I/O (and possibly consider something like O_DIRECT for file writing so that it doesn't touch the page cache; I haven't actually tried this, YMMV).
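A bare-bones version of that ring buffer might look like this (single writer, single archiver, illustrative names; no mlock, no O_DIRECT, and no concurrency control, all of which a real implementation needs):

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;

// Minimal sketch of a mapped-file ring buffer with an archiver hanging off
// the back. Capacity must be a power of two so the index math is a mask.
public class MappedRing {
    private final MappedByteBuffer buf;
    private final int capacity;
    private long writePos; // total bytes ever written
    private long readPos;  // total bytes ever archived

    public MappedRing(Path file, int capacity) throws Exception {
        if (Integer.bitCount(capacity) != 1)
            throw new IllegalArgumentException("capacity must be a power of two");
        this.capacity = capacity;
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw");
             FileChannel ch = raf.getChannel()) {
            raf.setLength(capacity);
            // The mapping stays valid after the channel is closed.
            buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, capacity);
        }
    }

    /** Critical path: copy one message into the ring. */
    public void write(byte[] msg) {
        if (writePos + msg.length - readPos > capacity)
            throw new IllegalStateException("ring full - archiver fell behind");
        for (byte b : msg) buf.put((int) (writePos++ & (capacity - 1)), b);
    }

    /** Archiver side: drain everything written so far (would go to standard I/O). */
    public byte[] drain() {
        byte[] out = new byte[(int) (writePos - readPos)];
        for (int i = 0; i < out.length; i++)
            out[i] = buf.get((int) (readPos++ & (capacity - 1)));
        return out;
    }
}
```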

If you don't need to store an arbitrary amount of data in the FIX message store, and can instead limit it to some fixed window that (for all connections) will be less than the physical memory on the box, then you could mlock the whole lot and not bother with the archiving thread. Beware: the more data you mlock, the more you constrain the kernel's ability to make smart decisions about what to page in/out. Also, mlock requires root-level configuration (ulimits) to set up.

You will also need to look out for what happens when a connection requests a replay. If they have been disconnected for a while, there is a strong possibility that their data has been paged out and bringing it back in will cause a page fault. I would consider standard file I/O for this too, given that it is not likely to be on the critical path.

The specific kernel version makes a big difference.

Petr Postulka

May 4, 2015, 3:34:36 AM
to mechanical-sympathy
Hi Gil/Mike

thanks a lot for your excellent input ... your ideas sound very interesting and promising. I'm on a business trip this week, but once I'm back I will give it a try and post the results here ... in case someone has the same requirements in the future.

Mike - what kernel version would you recommend, please?

Thanks.

Petr


Wojciech Kudla

May 4, 2015, 3:35:41 AM
to mechanica...@googlegroups.com

As Mike suggests, specific kernel versions can make a dramatic difference. Have a look at 3.18.8 vs earlier, for example.


Michael Barker

May 4, 2015, 3:40:48 AM
to mechanica...@googlegroups.com
We're looking at 3.19 at the moment.

Wojciech Kudla

May 4, 2015, 3:44:29 AM
to mechanica...@googlegroups.com

I saw a difference between 3.18 and 3.19, but found the one between 3.18.8 and earlier versions to have a much more severe impact. That's for scenarios with no tricks on the native side, just pure Java. I guess it still mostly depends on your usage patterns.

Martin Thompson

May 4, 2015, 9:07:18 AM
to mechanica...@googlegroups.com

On Monday, 4 May 2015 04:01:55 UTC+1, mikeb01 wrote:

Personally I would do something similar to what Gil mentions: grab an mmapped file, mlock it into memory, and use it like a ring buffer. Have an archiving thread hanging off the back of that. Have the archiving thread use standard I/O rather than mmapped I/O (and possibly consider something like O_DIRECT for file writing so that it doesn't touch the page cache; I haven't actually tried this, YMMV).

I've used this type of approach often myself and can attest that it works well. Your ring buffer can be in /dev/shm if you want to avoid having to mlock it. The thread that is extending files needs to be off the critical path and not at risk of taking a page fault where it is not at a safepoint, i.e. it needs to be doing its IO on the other side of a JNI boundary.

 
You will also need to look out for what happens when a connection requests a replay. If they have been disconnected for a while, there is a strong possibility that their data has been paged out and bringing it back in will cause a page fault. I would consider standard file I/O for this too, given that it is not likely to be on the critical path.

+1. In a replay scenario you need to read from the ring buffer in memory for fast replay. For anything older you need to go to the archive, using standard IO so that you are again at a safepoint.

Martin...
 

Vitaly Davidovich

May 4, 2015, 9:22:40 AM
to mechanica...@googlegroups.com

Why does /dev/shm allow avoiding mlock'ing the pages? AFAIK those pages are swappable like any other.

sent from my phone


Martin Thompson

May 4, 2015, 9:25:00 AM
to mechanica...@googlegroups.com
Ah. I should have mentioned I turn off swap on such systems.

Vitaly Davidovich

May 4, 2015, 9:26:16 AM
to mechanica...@googlegroups.com

The other source of kernel latency with disk-backed writes is the stable writeback "feature" that was introduced a few years ago. This is particularly problematic when doing writes over the same set of pages, e.g. a ring buffer. There's a patch to remediate it that I saw, but I haven't tracked whether it made it into the kernel or not.

sent from my phone

Vitaly Davidovich

May 4, 2015, 9:29:46 AM
to mechanica...@googlegroups.com

Ok that makes sense then :).  Swap + overcommit off should be the default/preferred configuration for these types of systems.

sent from my phone

Gil Tene

May 4, 2015, 12:05:00 PM
to mechanica...@googlegroups.com


On Monday, May 4, 2015 at 6:07:18 AM UTC-7, Martin Thompson wrote:

On Monday, 4 May 2015 04:01:55 UTC+1, mikeb01 wrote:

Personally I would do something similar to what Gil mentions: grab an mmapped file, mlock it into memory, and use it like a ring buffer. Have an archiving thread hanging off the back of that. Have the archiving thread use standard I/O rather than mmapped I/O (and possibly consider something like O_DIRECT for file writing so that it doesn't touch the page cache; I haven't actually tried this, YMMV).

I've used this type of approach often myself and can attest that it works well. Your ring buffer can be in /dev/shm if you want to avoid having to mlock it. The thread that is extending files needs to be off the critical path and not at risk of taking a page fault where it is not at a safepoint, i.e. it needs to be doing its IO on the other side of a JNI boundary.

To elaborate on what Martin says above about the extending thread needing to do its extending ops (or any ops that take page faults) at safepoints: extending mapped files by "touching" the mapped buffers far enough ahead is a simple memory operation, but it does not occur at a safepoint on most [all?] current JVMs. However, non-mapped I/O calls (like write()) all have their I/O happen at a safepoint (effectively using JNI within the JDK). So any potentially page-faulting operations (like extending or prefetching file contents into OS buffers) should be done using those. [There is no need to write your own JNI code for this or anything...]

The reason for making those calls at safepoints even on the non-critical-path extending thread is that *if* the JVM needs to bring all threads to a safepoint, a non-safepointing extension operation on the non-critical extending thread could cause the critical thread to stall for 100s of msec (as that one page fault could be stuck behind 100 other I/O operations that are scheduled ahead of it for the same device). The reasons for bringing all threads to a safepoint vary: they obviously include GC pauses, but they also include deoptimization, deadlock detection, and several other reasons that e.g. John Cuthbertson covered in his "what else makes a JVM pause?" talk.

ymo

May 4, 2015, 1:20:27 PM
to mechanica...@googlegroups.com
I have not seen it done for the incoming traffic, but for the replay scenario (assuming you just copied the data to disk in the way it needs to be replayed) you could use something like https://www.ibm.com/developerworks/linux/library/j-zerocopy/

j.rob....@gmail.com

May 4, 2015, 9:19:39 PM
to mechanica...@googlegroups.com

> Well the application is using FIX communication and we need to implement FIX message store to persist messages in case the counterparty requests a resend.

We had the same problem when we were trying to persist FIX messages to honor resend requests. The solution we chose was an asynchronous store that, in addition to using a memory-mapped file, also did the writing on another thread, keeping latencies to a minimum through pipelining between the logger thread and the I/O thread. The FIX engine we are currently using that supports this is called CoralFIX.
