Note: I would like to profusely thank @aplokhotnyuk. His tweet originally alerted me to the bug's existence, and started us down the path of figuring out the what/why/where/when behind it. Why this is not being shouted in the streets is a mystery to me, and scary in its own right. We were lucky enough that I had a "that looks suspiciously familiar" moment when I read that tweet, and that I put 3.14 and 1.618 together and thought enough to ask "Umm... have we only been seeing this bug on Haswell servers?".
Without @aplokhotnyuk's tweet we'd probably still be searching for the nonexistent bugs in our own locking code... And since the tweet originated from another discussion on this group, it presents a rare "posting and reading twitter actually helps us solve bugs sometimes" example.
That's nasty! Thanks for sharing.
sent from my phone
We had this one bite us hard and scare the %$^! out of us, so I figured I'd share the fear...

The Linux futex_wait call has been broken for about a year (in upstream since 3.14, around Jan 2014), and has just recently been fixed (in upstream 3.18, around October 2014). More importantly, this breakage seems to have been backported into major distros (e.g. into RHEL 6.6 and its cousins, released in October 2014), and the fix for it has only recently been backported (e.g. RHEL 6.6.z and cousins have the fix).
The impact of this kernel bug is very simple: user processes can deadlock and hang in seemingly impossible situations. A futex wait call (and anything using a futex wait) can stay blocked forever, even though it had been properly woken up by someone. Thread.park() in Java may stay parked. Etc. If you are lucky you may also find soft lockup messages in your dmesg logs. If you are not that lucky (like us, for example), you'll spend a couple of months of someone's time trying to find the fault in your code, when there is nothing there to find.

This behavior seems to regularly appear in the wild on Haswell servers (all the machines where we have had customers hit it, in the field and in labs, have been Haswells), and since Haswell servers are basically what you get if you buy a new machine now, or run on the cool new Amazon EC2/GCE/Azure stuff, you are bound to experience some interesting behavior. I don't know of anyone who would see this as a good thing for production systems. Except for maybe Netflix (maybe we should call this the Linux fumonkey).
The commit for the fix is here: https://github.com/torvalds/linux/commit/76835b0ebf8a7fe85beb03c75121419a7dec52f0
The commit explanation says that it fixes https://github.com/torvalds/linux/commit/b0c29f79ecea0b6fbcefc999e70f2843ae8306db (presumably the bug introduced with that change), which was made in Jan of 2014 into 3.14. That 3.14 code added logic to avoid taking a lock if the code knows that there are no waiters. It documents (pretty elaborately) how "…thus preventing tasks sleeping forever if wakers don't acknowledge all possible waiters", with logic that explains how memory barriers guarantee the correct order (see the paragraph at line 141), which includes the statement "this is done by the barriers in get_futex_key_refs(), through either ihold or atomic_inc, depending on the futex type." (this assumption is the actual bug). The assumption is further reinforced by the fact that the change added a comment to every call to get_futex_key_refs() in the code that says "/* implies MB (B) */".

The problem was that get_futex_key_refs() did NOT imply a memory barrier. It only included a memory barrier for two explicit cases in a switch statement that checks the futex type, but did not have a default case handler, and therefore did not apply a memory barrier for other futex types. Like private futexes. Which are a very commonly used type of futex.
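For anyone who wants to see the shape of it, here is a paraphrased sketch of get_futex_key_refs() with the fix applied (simplified from the commits linked above, not verbatim kernel source; treat it as illustrative):

static void get_futex_key_refs(union futex_key *key)
{
        if (!key->both.ptr)
                return;

        switch (key->both.offset & (FUT_OFF_INODE | FUT_OFF_MMSHARED)) {
        case FUT_OFF_INODE:
                ihold(key->shared.inode);  /* implies MB (B) */
                break;
        case FUT_OFF_MMSHARED:
                futex_get_mm(key);         /* implies MB (B) */
                break;
        default:
                /* The fix is this default case. Before it, private futexes
                 * (the common case) fell through with no barrier at all,
                 * breaking the ordering that the lockless "are there any
                 * waiters?" check depends on: a waker could see a stale
                 * "no waiters" and skip the wakeup, leaving a waiter that
                 * had already gone to sleep on the old value asleep forever. */
                smp_mb();                  /* explicit MB (B) */
        }
}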
So let's be clear: RHEL 6.6 (and CentOS 6.6, and Scientific Linux 6.6) are certainly broken on Haswell servers. It is likely that recent versions of other distros are too (SLES, Ubuntu, Debian, Oracle Linux, etc.). The good news is that fixes are out there (including 6.6.z). But the bad news is that there is not much chatter saying "if you have a Haswell system, get to version X now". For some reason, people seem not to have noticed this or raised the alarm. We certainly haven't seen much "INSTALL PATCHES NOW" fear mongering. And we really need it, so I'm hoping this posting will start a panic.

Bottom line: the bug is very real, but it probably only appeared in the 3.14 upstream version (and distro versions that had backported https://github.com/torvalds/linux/commit/b0c29f79ecea0b6fbcefc999e70f2843ae8306db , presumably after Jan 2014). The bug was fixed in 3.18 in October 2014, but backports probably took a while (and some may still be pending). I know for a fact that RHEL 6.6.z has the fix. I don't know about other distro families and versions (yet), but if someone else does, please post (including when it was broken and when it was fixed).
(gdb) bt
#0 0x0000003593e0e264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003593e09508 in _L_lock_854 () from /lib64/libpthread.so.0
#2 0x0000003593e093d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007fe87a42a50d in os::PlatformEvent::park() () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#4 0x00007fe87a3f10e8 in Monitor::ILock(Thread*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#5 0x00007fe87a3f132f in Monitor::lock_without_safepoint_check() () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#6 0x00007fe87a15a7bf in G1HotCardCache::insert(signed char*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#7 0x00007fe87a15db03 in G1RemSet::refine_card(signed char*, int, bool) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#8 0x00007fe87a143dc8 in RefineCardTableEntryClosure::do_card_ptr(signed char*, int) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#9 0x00007fe87a0feb9f in DirtyCardQueueSet::apply_closure_to_completed_buffer_helper(CardTableEntryClosure*, int, BufferNode*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#10 0x00007fe87a0fed8d in DirtyCardQueueSet::apply_closure_to_completed_buffer(int, int, bool) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#11 0x00007fe87a0683a4 in ConcurrentG1RefineThread::run() () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#12 0x00007fe87a430ca8 in java_start(Thread*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#13 0x0000003593e079d1 in start_thread () from /lib64/libpthread.so.0
#14 0x0000003593ae88fd in clone () from /lib64/libc.so.6
There has been quite a bit of benchmarking done on Haswells comparing Java 7 and 8 (quite impressive, actually); I wonder how they haven't stumbled into this.
Either way, debugging this has been quite fun and perhaps we'll write about the adventure in more detail :)
Kernel futex_wait() calls end up being at the core of almost any user-land synchronization primitive these days, whether it's POSIX stuff (like mutexes and semaphores) or direct use of futexes. And all JVM synchronization (synchronized, Lock, park/unpark), as well as all internal JVM threads (GC, compiler, etc.), ends up with a waiting futex at some point.
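To make "direct use of futexes" concrete, here is a minimal sketch of the raw wait/wake handshake that all of those primitives ultimately reduce to (names and error handling are mine, purely illustrative): the waiter re-checks a shared word and calls FUTEX_WAIT on it, the waker updates the word and calls FUTEX_WAKE. The bug above is exactly the case where the WAIT side can stay asleep even though the WAKE side did everything right.

/* Illustrative only: the bare futex wait/wake handshake on Linux.
 * Build with: gcc -pthread futex_sketch.c -o futex_sketch */
#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int futex_word = 0;

/* glibc has no futex() wrapper; it is always a raw syscall. */
static long futex(atomic_int *uaddr, int op, int val)
{
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void *waiter(void *arg)
{
        (void)arg;
        /* Sleep only while the word is still 0. FUTEX_WAIT re-checks the
         * value inside the kernel, so on a correct kernel a wake that races
         * with this call cannot be lost. */
        while (atomic_load(&futex_word) == 0)
                futex(&futex_word, FUTEX_WAIT_PRIVATE, 0);
        printf("waiter: woken up\n");
        return NULL;
}

int main(void)
{
        pthread_t t;
        pthread_create(&t, NULL, waiter, NULL);
        sleep(1);

        atomic_store(&futex_word, 1);              /* publish the state change */
        futex(&futex_word, FUTEX_WAKE_PRIVATE, 1); /* wake one waiter */

        pthread_join(t, NULL);
        return 0;
}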
Have you moved to 6.6.z? (Or, if you're not on RHEL or a RHEL-like, to a recent kernel of some sort?)
Private futures are exactly one of the types affected, according to that changelog.
sent from my phone
Cosmin
Private *futexes*, damn autocorrect.
sent from my phone
Thread 10 (Thread 0x7fa458ade700 (LWP 4482)):
#0 0x00000037016f805e in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x000000370167d16b in _L_lock_9503 () from /lib64/libc.so.6
#2 0x000000370167a6a6 in malloc () from /lib64/libc.so.6
#3 0x00007fa45a52ed29 in os::malloc(unsigned long, unsigned short, unsigned char*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#4 0x00007fa459fb66b3 in ChunkPool::allocate(unsigned long, AllocFailStrategy::AllocFailEnum) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#5 0x00007fa459fb62d1 in Arena::grow(unsigned long, AllocFailStrategy::AllocFailEnum) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#6 0x00007fa45a145cc0 in CompactibleFreeListSpace::new_dcto_cl(OopClosure*, CardTableModRefBS::PrecisionStyle, HeapWord*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#7 0x00007fa45a54ce6d in CardTableModRefBS::process_stride(Space*, MemRegion, int, int, OopsInGenClosure*, CardTableRS*, signed char**, unsigned long, unsigned long) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#8 0x00007fa45a54d040 in CardTableModRefBS::non_clean_card_iterate_parallel_work(Space*, MemRegion, OopsInGenClosure*, CardTableRS*, int) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#9 0x00007fa45a0d4e08 in CardTableModRefBS::non_clean_card_iterate_possibly_parallel(Space*, MemRegion, OopsInGenClosure*, CardTableRS*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#10 0x00007fa45a0d6a0e in CardTableRS::younger_refs_in_space_iterate(Space*, OopsInGenClosure*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#11 0x00007fa45a1823fe in ConcurrentMarkSweepGeneration::younger_refs_iterate(OopsInGenClosure*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#12 0x00007fa45a5c98aa in SharedHeap::process_strong_roots(bool, bool, SharedHeap::ScanningOption, OopClosure*, CodeBlobClosure*, OopsInGenClosure*, bool) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#13 0x00007fa45a27ef8c in GenCollectedHeap::gen_process_strong_roots(int, bool, bool, bool, SharedHeap::ScanningOption, OopsInGenClosure*, bool, OopsInGenClosure*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#14 0x00007fa45a551e4f in ParNewGenTask::work(unsigned int) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#15 0x00007fa45a6cf0cf in GangWorker::loop() () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#16 0x00007fa45a537ca8 in java_start(Thread*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#17 0x0000003701a079d1 in start_thread () from /lib64/libpthread.so.0
#18 0x00000037016e88fd in clone () from /lib64/libc.so.6
Cosmin
Those two traces look like they're coming from different code paths (malloc vs. pthread_mutex), so I'm not sure whether __lll_lock_wait means it's not private. Looking at the kernel change, only private futexes weren't covered by a barrier in the broken version.
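For what it's worth, whether glibc uses the private futex ops is governed by the primitive's process-shared attribute, and process-private is the default; glibc's own internal locks (like the malloc arena lock behind __lll_lock_wait_private) are private as well. A small sketch of the distinction (illustrative only; the mapping to FUTEX_*_PRIVATE vs. the shared ops is how current glibc behaves, not something this snippet proves by itself):

/* Illustrative only: process-private vs. process-shared pthread mutexes.
 * Default mutexes are process-private, and glibc uses the private futex ops
 * (FUTEX_WAIT_PRIVATE / FUTEX_WAKE_PRIVATE) for them under contention;
 * PTHREAD_PROCESS_SHARED mutexes use the non-private ops.
 * Build with: gcc -pthread pshared_sketch.c -o pshared_sketch */
#include <pthread.h>

int main(void)
{
        /* Process-private (the default). */
        pthread_mutex_t private_mtx = PTHREAD_MUTEX_INITIALIZER;

        /* Process-shared: would normally live in memory mapped into
         * several processes. */
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);

        pthread_mutex_t shared_mtx;
        pthread_mutex_init(&shared_mtx, &attr);

        pthread_mutex_lock(&private_mtx);
        pthread_mutex_unlock(&private_mtx);

        pthread_mutex_lock(&shared_mtx);
        pthread_mutex_unlock(&shared_mtx);

        pthread_mutexattr_destroy(&attr);
        pthread_mutex_destroy(&shared_mtx);
        pthread_mutex_destroy(&private_mtx);
        return 0;
}

Assuming the JVM's internal mutexes are plain default pthread mutexes (they appear to be), both traces above would be sitting on private futexes, which is consistent with the broken code path.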
sent from my phone
Cosmin
How do jstack and the like work around the problem? Do they cause the thread to be woken up (from its bogus sleep) and observe consistent state at that point?
sent from my phone
Cosmin
RHEL 7 (and CentOS 7, and SL 7): 7.1 is BAD
Cosmin