Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW].


Gil Tene

unread,
May 13, 2015, 6:37:32 PM5/13/15
to mechanica...@googlegroups.com
We had this one bite us hard and scare the %$^! out of us, so I figured I'd share the fear...

The Linux futex_wait call has been broken for about a year (in upstream since 3.14, around Jan 2014), and has just recently been fixed (in upstream 3.18, around October 2014). More importantly, this breakage seems to have been backported into major distros (e.g. into RHEL 6.6 and its cousins, released in October 2014), and the fix for it has only recently been backported (e.g. RHEL 6.6.z and cousins have the fix).

The impact of this kernel bug is very simple: user processes can deadlock and hang in seemingly impossible situations. A futex wait call (and anything using a futex wait) can stay blocked forever, even though it had been properly woken up by someone. Thread.park() in Java may stay parked. Etc. If you are lucky you may also find soft lockup messages in your dmesg logs. If you are not that lucky (like us, for example), you'll spend a couple of months of someone's time trying to find the fault in your code, when there is nothing there to find. 
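For those who haven't stared at futexes directly, here is a minimal sketch of the wait/wake protocol involved (my own toy code, not kernel or glibc source; the flag variable and the futex() wrapper are mine). On a buggy kernel, the wake side can fail to notice a just-queued waiter, so the wait side sleeps forever:

#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>

static atomic_int flag;                    /* 0 = not ready, 1 = ready */

static long futex(atomic_int *uaddr, int op, int val)
{
    /* raw futex(2) syscall; glibc provides no wrapper for it */
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

void waiter(void)
{
    while (atomic_load(&flag) == 0)
        futex(&flag, FUTEX_WAIT_PRIVATE, 0);  /* sleep only while flag is still 0 */
}

void waker(void)
{
    atomic_store(&flag, 1);
    futex(&flag, FUTEX_WAKE_PRIVATE, 1);      /* must observe the queued waiter;
                                                 the buggy kernels could miss it
                                                 and skip the wakeup entirely */
}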

This behavior seems to regularly appear in the wild on Haswell servers (all the machines where we have had customers hit it, in the field and in labs, have been Haswells), and since Haswell servers are basically what you get if you buy a new machine now, or run on the cool new Amazon EC2/GCE/Azure stuff, you are bound to experience some interesting behavior. I don't know of anyone that will see this as a good thing for production systems. Except for maybe Netflix (maybe we should call this the linux fumonkey).


The commit explanation says that it fixes https://github.com/torvalds/linux/commit/b0c29f79ecea0b6fbcefc999e70f2843ae8306db (presumably the bug introduced with that change), which was made in Jan of 2014 into 3.14. That 3.14 code added logic to avoid taking a lock if the code knows that there are no waiters. It documents (pretty elaborately) how "…thus preventing tasks sleeping forever if wakers don't acknowledge all possible waiters" with logic that explains how memory barriers guarantee the correct order (see paragraph at line 141), which includes the statement "this is done by the barriers in get_futex_key_refs(), through either ihold or atomic_inc, depending on the futex type." (this assumption is the actual bug). The assumption is further reinforced by the fact that the change added a comment to every call to get_futex_key_refs() in the code that says "/* implies MB (B) */".

The problem was that get_futex_key_refs() did NOT imply a memory barrier. It only included a memory barrier for two explicit cases in a switch statement that checks the futex type, but did not have a default case handler, and therefore did not apply a memory barrier for other futex types. Like private futexes. Which are a very commonly used type of futex.

The fix is simple: an added default case for the switch that just has an explicit smp_mb() in it. There was a missing memory barrier in the wakeup path, and now (hopefully) it's not missing any more...
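In kernel/futex.c terms, the repaired function looks roughly like this (paraphrased from the upstream fix commit; consult the real futex.c for the authoritative code):

static void get_futex_key_refs(union futex_key *key)
{
        if (!key->both.ptr)
                return;

        switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
        case FUT_OFF_INODE:
                ihold(key->shared.inode);   /* implies MB (B) */
                break;
        case FUT_OFF_MMSHARED:
                futex_get_mm(key);          /* implies MB (B) */
                break;
        default:
                /* private futexes used to fall through here with NO barrier */
                smp_mb();                   /* explicit MB (B) */
        }
}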

So let's be clear: RHEL 6.6 (and CentOS 6.6, and Scientific Linux 6.6) are certainly broken on Haswell servers. It is likely that recent versions of other distros are too (SLES, Ubuntu, Debian, Oracle Linux, etc.). The good news is that fixes are out there (including 6.6.z). But the bad news is that there is not much chatter saying "if you have a Haswell system, get to version X now". For some reason, people seem to not have noticed this or raised the alarm. We certainly haven't seen much "INSTALL PATCHES NOW" fear mongering. And we really need it, so I'm hoping this posting will start a panic.

Bottom line: the bug is very real, but it probably only appeared in the 3.14 upstream version (and distro versions that had backported https://github.com/torvalds/linux/commit/b0c29f79ecea0b6fbcefc999e70f2843ae8306db , presumably after Jan 2014). The bug was fixed in 3.18 in October 2014, but backports probably took a while (and some may still be pending). I know for a fact that RHEL 6.6.z has the fix. I don't know about other distro families and versions (yet), but if someone else does, please post (including when it was broken, and when it was fixed).

Note: I would like to profusely thank @aplokhotnyuk. His tweet originally alerted me to the bug's existence, and started us down the path of figuring out the what/why/where/when behind it. Why this is not being shouted in the streets is a mystery to me, and scary in its own right. We were lucky enough that I had a "that looks suspiciously familiar" moment when I read that tweet, and that I put 3.14 and 1.618 together and thought enough to ask "Umm... have we only been seeing this bug on Haswell servers?".
 
Without @aplokhotnyuk's tweet we'd probably still be searching for the nonexistent bugs in our own locking code... And since the tweet originated from another discussion on this group, it presents a rare "posting and reading twitter actually helps us solve bugs sometimes" example.



Vitaly Davidovich

unread,
May 13, 2015, 6:48:41 PM5/13/15
to mechanica...@googlegroups.com

That's nasty! Thanks for sharing.

sent from my phone





Cosmin Lehene

unread,
May 13, 2015, 9:33:45 PM5/13/15
to mechanica...@googlegroups.com
We've been hunting for this for a while now. 
In our case it's reproducing only on 10-core Haswells, which are different from the 8-core parts (dual vs. single ring bus and more cache-coherency options). It's probably a matter of probability.

Attaching (and detaching) fixes the state, so using jstack and other tools makes it disappear, which made this tricky (and gave us a good workaround); we ended up using core dumps to figure out what was going on, which led us to suspect glibc or the kernel.
We weren't able to reproduce while running systemtap probes either.

Pinning the JVM to a single CPU reduces the probability of occurrence drastically (from a few times a day to once in weeks), so I'm guessing latency distributions may have an effect.

We're seeing it mostly during GC pauses in multiple places, but most times when parking.

For example:

(gdb) bt
#0  0x0000003593e0e264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003593e09508 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x0000003593e093d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007fe87a42a50d in os::PlatformEvent::park() () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#4  0x00007fe87a3f10e8 in Monitor::ILock(Thread*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#5  0x00007fe87a3f132f in Monitor::lock_without_safepoint_check() () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#6  0x00007fe87a15a7bf in G1HotCardCache::insert(signed char*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#7  0x00007fe87a15db03 in G1RemSet::refine_card(signed char*, int, bool) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#8  0x00007fe87a143dc8 in RefineCardTableEntryClosure::do_card_ptr(signed char*, int) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#9  0x00007fe87a0feb9f in DirtyCardQueueSet::apply_closure_to_completed_buffer_helper(CardTableEntryClosure*, int, BufferNode*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#10 0x00007fe87a0fed8d in DirtyCardQueueSet::apply_closure_to_completed_buffer(int, int, bool) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#11 0x00007fe87a0683a4 in ConcurrentG1RefineThread::run() () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#12 0x00007fe87a430ca8 in java_start(Thread*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#13 0x0000003593e079d1 in start_thread () from /lib64/libpthread.so.0
#14 0x0000003593ae88fd in clone () from /lib64/libc.so.6


It's still unclear to me why a futex would end up being used here, and there are open questions about what we see in the registers (who's on the other side), but I didn't get time to investigate every detail.
There has been quite a bit of benchmarking done on Haswells comparing Java 7 and 8 (quite impressive, actually); I wonder how they haven't stumbled onto this.
 
Either way, debugging this has been quite fun and perhaps we'll write about the adventure in more detail :)

Cosmin

Gil Tene

unread,
May 13, 2015, 9:44:35 PM5/13/15
to mechanica...@googlegroups.com
Kernel futex_wait() calls end up being at the core of almost any user-land synchronization primitive these days. Whether it's posix stuff (like mutexes and semaphores) or direct use of futexes. And all JVM synchronization (synchronized, Lock, park/unpark), as well as all internal JVM threads like GC and compiler work, ends up with a waiting futex at some point.
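To make that concrete, here is roughly the shape of a futex-based lock's fast and slow paths (a sketch in the spirit of glibc's lowlevellock and Drepper's "Futexes Are Tricky" paper, not glibc's actual code; it reuses the futex() wrapper from my first post):

void lock(atomic_int *f)                    /* 0 = free, 1 = held, 2 = contended */
{
    int expected = 0;
    if (atomic_compare_exchange_strong(f, &expected, 1))
        return;                             /* fast path: never enters the kernel */
    while (atomic_exchange(f, 2) != 0)
        futex(f, FUTEX_WAIT_PRIVATE, 2);    /* slow path: sleeps in futex_wait() */
}

void unlock(atomic_int *f)
{
    if (atomic_exchange(f, 0) == 2)
        futex(f, FUTEX_WAKE_PRIVATE, 1);    /* wake one sleeper; a lost wakeup
                                               here strands the thread forever */
}

The uncontended case never makes a syscall; only contention reaches futex_wait(), which is why the hang shows up under load and looks like a deadlock in perfectly correct user code.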
 
There has been quite a bit of benchmarking done on Haswells comparing Java 7 and 8 (quite impressive, actually); I wonder how they haven't stumbled onto this.

I think this benchmarking was done before the bug appeared in most distros. E.g. RHEL 6.6 (and CentOS and SL) only got the bug last October...
 
 
Either way, debugging this has been quite fun and perhaps we'll write about the adventure in more detail :)

Have you moved to 6.6.z? (or if not on a RHEL or RHEL-like, a latest kernel of some sort?). 

Cosmin Lehene

unread,
May 13, 2015, 10:19:45 PM5/13/15
to mechanica...@googlegroups.com
Kernel futex_wait() calls end up being at the core of almost any user-land synchronization primitive these days. Whether it's posix stuff (like mutexes and semaphores) or direct use of futexes. And all JVM synchronization (synchronized, Lock, park/unpark), as well as all internal JVM threads like GC and compiler work, ends up with a waiting futex at some point.

Yes, but why not a private one, as long as it's only the JVM process involved? I'm likely missing something; I was going to look more closely at the Java mutex path along with glibc to understand why.

Have you moved to 6.6.z? (or if not on a RHEL or RHEL-like, a latest kernel of some sort?). 

Not yet. Like I said, we managed with workarounds (e.g. by pinning, and by automatically unblocking it by running something that attaches and detaches when the behavior is detected).
I didn't know about the kernel bug until I saw this (I was suspecting glibc/kernel but only looked through glibc code and bugs...)
We'll likely update the kernel soon.

Vitaly Davidovich

unread,
May 13, 2015, 10:42:29 PM5/13/15
to mechanica...@googlegroups.com

Private futures are exactly one of the types affected, according to that changelog.

sent from my phone

Vitaly Davidovich

unread,
May 13, 2015, 10:44:09 PM5/13/15
to mechanica...@googlegroups.com

Private *futexes*, damn autocorrect.

sent from my phone

Cosmin Lehene

unread,
May 13, 2015, 10:54:34 PM5/13/15
to mechanica...@googlegroups.com
I guess the question should be: why isn't it always a private futex?

And yes, we saw it with private as well...

Thread 10 (Thread 0x7fa458ade700 (LWP 4482)):
#0  0x00000037016f805e in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x000000370167d16b in _L_lock_9503 () from /lib64/libc.so.6
#2  0x000000370167a6a6 in malloc () from /lib64/libc.so.6
#3  0x00007fa45a52ed29 in os::malloc(unsigned long, unsigned short, unsigned char*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#4  0x00007fa459fb66b3 in ChunkPool::allocate(unsigned long, AllocFailStrategy::AllocFailEnum) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#5  0x00007fa459fb62d1 in Arena::grow(unsigned long, AllocFailStrategy::AllocFailEnum) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#6  0x00007fa45a145cc0 in CompactibleFreeListSpace::new_dcto_cl(OopClosure*, CardTableModRefBS::PrecisionStyle, HeapWord*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#7  0x00007fa45a54ce6d in CardTableModRefBS::process_stride(Space*, MemRegion, int, int, OopsInGenClosure*, CardTableRS*, signed char**, unsigned long, unsigned long) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#8  0x00007fa45a54d040 in CardTableModRefBS::non_clean_card_iterate_parallel_work(Space*, MemRegion, OopsInGenClosure*, CardTableRS*, int) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#9  0x00007fa45a0d4e08 in CardTableModRefBS::non_clean_card_iterate_possibly_parallel(Space*, MemRegion, OopsInGenClosure*, CardTableRS*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#10 0x00007fa45a0d6a0e in CardTableRS::younger_refs_in_space_iterate(Space*, OopsInGenClosure*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#11 0x00007fa45a1823fe in ConcurrentMarkSweepGeneration::younger_refs_iterate(OopsInGenClosure*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#12 0x00007fa45a5c98aa in SharedHeap::process_strong_roots(bool, bool, SharedHeap::ScanningOption, OopClosure*, CodeBlobClosure*, OopsInGenClosure*, bool) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#13 0x00007fa45a27ef8c in GenCollectedHeap::gen_process_strong_roots(int, bool, bool, bool, SharedHeap::ScanningOption, OopsInGenClosure*, bool, OopsInGenClosure*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#14 0x00007fa45a551e4f in ParNewGenTask::work(unsigned int) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#15 0x00007fa45a6cf0cf in GangWorker::loop() () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#16 0x00007fa45a537ca8 in java_start(Thread*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#17 0x0000003701a079d1 in start_thread () from /lib64/libpthread.so.0
#18 0x00000037016e88fd in clone () from /lib64/libc.so.6
Cosmin

Vitaly Davidovich

unread,
May 14, 2015, 12:05:28 AM5/14/15
to mechanica...@googlegroups.com

Those two traces look like they're coming from different code paths (malloc vs pthread_mutex) so I'm not sure if lll_lock_wait means it's not private.  Looking at the kernel change, only private futexes weren't covered by a barrier in the broken version.

sent from my phone


Gil Tene

unread,
May 14, 2015, 1:18:46 AM5/14/15
to mechanica...@googlegroups.com
Regarding the two different traces you included in the two different posts: The difference between __lll_lock_wait_private(int *futex) and __lll_lock_wait(int *futex, int private) is that the first is always private, while the second is optionally private depending on what the caller says.

[The following is an educated guess about the difference, not based on tracing the actual code all the way...:]

- malloc() is always private to the process involved, so it is using a private futex that it owns for its own purposes. So you end up with a direct __lll_lock_wait_private() call.

- In contrast, pthread_mutex_lock() is performing a lock operation on behalf of the caller. The caller may very well want to do a locking or unlocking operation on a shared (across processes, via a shared word in a shared mmap'ed region) futex. pthread_mutex_lock() doesn't know which it is (private or not). It is told by its caller. Normally (and probably by default) the caller would indicate a private futex, but from the API point of view, it *could* be different... So you end up with a call to __lll_lock_wait(int *futex, int private).
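For reference, the knob the caller turns is the standard pshared mutex attribute; the default is PTHREAD_PROCESS_PRIVATE, which is why the private-futex path is the common one (a sketch using the standard pthreads API):

#include <pthread.h>

pthread_mutex_t m;

void make_shared_mutex(void)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    /* PTHREAD_PROCESS_SHARED makes the mutex usable across processes
       (e.g. placed in a shared mmap'ed region), so glibc must use a
       shared futex. The default, PTHREAD_PROCESS_PRIVATE, allows the
       private-futex fast path. */
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&m, &attr);
    pthread_mutexattr_destroy(&attr);
}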

-- Gil.

Marcin Sobieszczanski

unread,
May 14, 2015, 6:24:33 AM5/14/15
to mechanica...@googlegroups.com
> More importantly this breakage seems to have
> been back ported into major distros (e.g. into RHEL 6.6 and its cousins,
> released in October 2014), and the fix for it has only recently been back
> ported (e.g. RHEL 6.6.z and cousins have the fix).

According to the ChangeLogs attached to the rpms, it looks like the kernel series used in RHEL 6.6 (kernel-2.6.32-504) was affected from the start of the 6.6 release. It has been fixed only recently, in the kernel-2.6.32-504.16.2.el6 update (21 April, https://rhn.redhat.com/errata/RHSA-2015-0864.html):

rpm -qp --changelog kernel-2.6.32-504.16.2.el6.x86_64.rpm | grep 'Ensure get_futex_key_refs() always implies a barrier'

Alex Bagehot

unread,
May 14, 2015, 9:01:21 AM5/14/15
to mechanica...@googlegroups.com


Vitaly Davidovich

unread,
May 14, 2015, 9:45:15 AM5/14/15
to mechanica...@googlegroups.com

How do jstack and the like subvert the problem? Do they cause the thread to be woken up (from bogus sleep) and observe consistent state at that point?

sent from my phone


Gil Tene

unread,
May 14, 2015, 1:50:37 PM5/14/15
to <mechanical-sympathy@googlegroups.com>
The bug manifests as a lost wakeup. Things that happen to do redundant wakeups (e.g. notifyAll for condition variables or Java synchronized objects, as well as some other ways) might accidentally release some threads from their sleeping-beauty state. But only if they are lucky.

So I don't think jstack subverts the bug in a general sense, or even in most cases. I think it just helps things that are lucky enough to involve the use of redundant (as in more than required) wakeups. Things that are more precise (like semaphores built on top of futex, or wait/notify without notifyAll) probably won't be helped.

Sent from Gil's iPhone

Vitaly Davidovich

unread,
May 14, 2015, 2:01:55 PM5/14/15
to mechanica...@googlegroups.com
Right, hence I was curious about Cosmin's statement that jstack seemed to help, which AFAIK won't issue any wakeups, so doesn't quite make sense to me.

So reading a bit more about this bug, it seems like some folks experienced this on Power and ARM machines as well, both with weaker memory models than x86. Would be interesting to know what exactly about the flavor of Haswell chips helps exacerbate this; Haswell has a larger out-of-order window than its predecessors, and also revamped coherence for TSX support, but that's all very high level and imprecise.

Cosmin Lehene

unread,
May 14, 2015, 2:25:28 PM5/14/15
to mechanica...@googlegroups.com
jstack (with -F), jhat, strace, gdb will attach and then detach from the process. Upon detach the process wakes up, invariably.
Cosmin

Adrian Muraru

unread,
May 14, 2015, 3:05:23 PM5/14/15
to mechanica...@googlegroups.com

Gil,


RHEL 7 (and CentOS 7, and SL 7): 7.1 is BAD

AFAIK CentOS 7.1 is shipping kernel 3.10, which, based on your initial message, is not impacted. Right?

-adrian


Vitaly Davidovich

unread,
May 14, 2015, 3:18:03 PM5/14/15
to mechanica...@googlegroups.com
Ah, jstack with -F, ok, makes sense.


Michael Barker

unread,
May 14, 2015, 4:41:32 PM5/14/15
to mechanica...@googlegroups.com
BTW, this bug is also fixed in kernel 3.14.23.

Gil Tene

unread,
May 14, 2015, 6:51:19 PM5/14/15
to <mechanical-sympathy@googlegroups.com>
The upstream 3.10 didn't have the bug. But RHEL 7's version is different from the pure upstream version.

Unfortunately, RHEL 7.1 (much like RHEL 6.6) backported the change that included the bug. But unlike 6.6.z, there is no backport of the fix to that bug yet for RHEL 7... (This is according to one of our engineers who checked the actual futex.c code.)

I expect that some other distros may have also done the same...

Sent from Gil's iPhone

Gil Tene

unread,
May 14, 2015, 6:52:38 PM5/14/15
to <mechanical-sympathy@googlegroups.com>
Maybe the attach (or the detach) forces the thread to re-evaluate the sleep, and it wakes up as a result.


Sent from Gil's iPhone

Todd Lipcon

unread,
May 14, 2015, 6:54:23 PM5/14/15
to mechanica...@googlegroups.com
My guess is that these tools are causing a signal to get sent to the affected threads (perhaps as part of the JVM coming to a safepoint or something?). If a thread is blocked in a syscall, and gets a signal, it will usually make that syscall return with EINTR. This is the case with futex(). So, if it missed a wakeup, but you send a signal, you'll recover.
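In sketch form (a toy wait loop of my own, reusing the futex() wrapper from earlier in the thread, not glibc's actual code), the rescue looks like this:

#include <errno.h>

void wait_for_flag(atomic_int *flag)
{
    while (atomic_load(flag) == 0) {
        long rc = futex(flag, FUTEX_WAIT_PRIVATE, 0);
        if (rc == -1 && errno == EINTR)
            continue;   /* a signal (e.g. from a ptrace attach or kill -3)
                           interrupted the sleep; the loop re-checks the flag,
                           and the previously missed state is finally seen */
    }
}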

-Todd

Vitaly Davidovich

unread,
May 14, 2015, 6:54:35 PM5/14/15
to mechanica...@googlegroups.com
My guess is it's the actual signals sent (as part of attach/detach) to the process that cause it to become runnable again in the kernel.

Paul Blair

unread,
May 15, 2015, 10:00:16 AM5/15/15
to mechanica...@googlegroups.com
I'm seeing a similar problem on kernel 3.19.0-16-generic after having upgraded Ubuntu to 15.04 (Vivid Vervet). However, checking the github link above indicates that the fix should be in 3.19, so I'm not sure if this is a regression or something else.

I've documented what I'm seeing here: http://ubuntuforums.org/showthread.php?t=2278238&p=13285416#post13285416 (I'm not really familiar with Linux debugging, though, so I'm not sure if I'm reporting the right information.)

Basically, certain Java applications are regularly hanging, and when I run strace I see them in FUTEX_WAIT. So far it's affecting only Java, but this includes previous versions of Java that have been working for a long time.


On Thursday, May 14, 2015 at 8:35:49 AM UTC-4, Gil Tene wrote:
Thanks for identifying the date and change for the fix in RHEL 6 (April 21, 2015). Strange that the errata makes no mention of it, or of its impact (it's a security advisory).

So far, here is what I know when it comes to distro release numbers for RHEL and its main cousins (at least in that world, most admins view things in terms of distro version rather than kernel version; you can look up the associated kernel versions). I'm using good/BAD to mean ("does not have the bug" / "HAS THE MISSING BARRIER BUG"):

RHEL 6 (and CentOS 6, and SL 6): 6.0-6.5 are good. 6.6 is BAD. 6.6.z is good.
RHEL 7 (and CentOS 7, and SL 7): 7.1 is BAD. As of yesterday, there does not yet appear to be a 7.x fix.
RHEL 5 (and CentOS 5, and SL 5): All versions are good (including 5.11).

Sadly, 6.6 seems to be a very (the most?) popular version we run into. And very few have moved to 6.6.z.

I/we are looking for more info to identify the specific versions affected in other distros (Ubuntu 12.04 LTS and 14.04 LTS, SLES 12 and 11, Amazon Linux, Oracle Linux, etc.). So far we've verified that SLES 12 kernel versions 3.12.32-33.1 & above have the fix (but not what versions have the bug), and that Amazon Linux kernel versions 3.14.35-28.38 & above have the fix (but not which versions have the bug).

I will post here as I have more; if you find more info useful for identifying releases that have the bug and ones that have fixed it, please do the same.

Alex Bagehot

unread,
May 15, 2015, 12:21:07 PM5/15/15
to mechanica...@googlegroups.com
Hi Paul, 
Did you run strace -F to get the child pids? Otherwise it'll only print the parent process waiting.


# strace -p 14603

Process 14603 attached - interrupt to quit

futex(0x7f5c8e6019d0, FUTEX_WAIT, 14604, NULL

^C <unfinished ...>

Process 14603 detached

-> no more output


# strace -F -p 14603 

Process 14603 attached with 8 threads - interrupt to quit

-> + output from all the threads in the process.


As far as I can tell, your bug appears to be 100% CPU on one core. This bug is characterised more by less CPU burn, as threads sleep when they should have been woken ("…thus preventing tasks sleeping forever if wakers don't acknowledge all possible waiters"), so I would assume (as a starting point) that your issue is different, and try to gather more evidence to prove it either way.

run 

sudo perf record -F 99 -ag -p <pid> -- sleep 10

sudo perf script 


To get some more info on the stacks associated with the CPU burn. The stacks may be corrupted by the JVM, preventing perf from reporting something useful with -g, so YMMV; fixing that requires a custom build of OpenJDK...

Thanks,

Alex




Paul Blair

unread,
May 15, 2015, 2:51:57 PM5/15/15
to mechanica...@googlegroups.com
Thanks for the pointers, Alex. Unfortunately neither one of these techniques worked for me. strace with -F unjiggles the JVM so that the process does not hang. On the system monitor I see activity on multiple CPUs and nothing hits 100%.  The perf record command never returns, probably for the reasons you say.

I do have a Java crash dump from when I left the process alone for a few hours and it finally crashed. I'm going to submit that to JetBrains and see if they can shed more light.

If I find I need more help on a Linux system level, is this an appropriate forum? 

Erik Gorset

unread,
May 19, 2015, 10:23:24 AM5/19/15
to mechanica...@googlegroups.com
I’ve reproduced similar symptoms on haswell running Ubuntu 3.18.13-031813-generic #201505062159 SMP Wed May 6 22:00:44 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

One java process was locked up with one core burning 100% of kernel time, with the following strace result:

[pid 423743] futex(0x7f80f0002354, FUTEX_WAIT_PRIVATE, 449, NULL <unfinished ...>
[pid 422607] futex(0x7f842c002454, FUTEX_WAIT_PRIVATE, 925, NULL <unfinished ...>
[pid 422602] futex(0x7ff6b058a854, FUTEX_WAIT_PRIVATE, 391, NULL <unfinished ...>

The symptoms went away after setting kernel.numa_balancing to 0, which also gave us considerably better performance overall. Looks like running a big process with NUMA interleaving on a big server with numa_balancing on is a very bad idea (and it's unfortunately the default).

We’re still investigating another (possibly related) matter, where we see inconsistent results when using Java 8 on Haswell. We don’t see the problem using Java 7 on Haswell, or when using Java 8 on Ivy Bridge (which is what we are running on most of our servers). Most likely it’s a bug in our software, but given that we don’t see this problem on 80+ servers under high load unless we use Java 8 on Haswell, I’m starting to wonder..

— 
Erik Cysneiros Gorset


manis...@gmail.com

unread,
May 19, 2015, 8:16:12 PM5/19/15
to mechanica...@googlegroups.com
I bumped into this error a couple of months back when using CentOS 6.6 on a 32-core Dell server. After many days of debugging, I realized it was a CentOS 6.6 bug and moved back to 6.5, and since then no such issues have been seen.
I am able to reproduce this issue within 15 minutes of heavy load on my multi-threaded C code.



Joe R

unread,
May 20, 2015, 11:09:32 AM5/20/15
to mechanica...@googlegroups.com
Did anyone come up with a test case for showing the bug?

Greg Senia

unread,
May 20, 2015, 10:44:49 PM5/20/15
to mechanica...@googlegroups.com
I spent a solid week debugging this with the help of some IBM Java/kernel performance folks, going through all the kernel patches that came in with RHEL 6. It took one solid week back in mid-March to get Red Hat to acknowledge the issue, which could affect PPC, x86, anyone using futexes... It occurred regularly with Hadoop YARN NodeManager and Tez jobs. It also regularly occurred with WebSphere JVMs on RHEL 6.6 PPC64. We never came up with a definitive way to repro the problem at will. jstack, strace, gdb, gcore, and kill -3 with IBM JDKs would unblock the stuck threads. And Red Hat also confirmed this could only happen on a machine with more than one CPU, as it had to do with the thread being put to sleep on one CPU but missing the notify to wake it up...

Greg Senia

unread,
May 20, 2015, 11:00:52 PM5/20/15
to mechanica...@googlegroups.com
And to truly verify the problem we had to kernel-dump the systems and work with Red Hat to analyze the vmcore... If anyone wants further info, feel free to let me know.

Both of these situations applied in my case:

Kevin Burton

unread,
May 21, 2015, 10:06:44 PM5/21/15
to mechanica...@googlegroups.com
The kernel (and glibc) needs more continuous integration 

Bugs like these are hard to find, but we consistently find race conditions in our code because, months later, an integration test will fail.

There's a bug in glibc that's been around for 2-5 years now.  It supports a 'rotate' option in the resolver so you can load balance DNS requests.

But it's completely broken - and has been for YEARS now.  

A continuous integration system would fix this easily.

Of course, one problem, is the tight coupling of the kernel.  I'm sure this code isn't very testable... :-/

bhu...@gmail.com

unread,
May 27, 2015, 2:05:04 AM5/27/15
to mechanica...@googlegroups.com
I have come across a similar situation with kernel 3.0.101, SLES 11. Based on the details seen so far, this kernel version doesn't seem to be affected.

The server is non-Haswell (Intel(R) Xeon(R) CPU L5638 @ 2.00GHz, 12 cores). The application hangs with 100% CPU. The same code on an older kernel runs fine.


Thread 3 (Thread 0x7f22ab35c700 (LWP 19575)):
#0  0x00007f22b3df52d4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f22b3df0659 in _L_lock_1008 () from /lib64/libpthread.so.0
#2  0x00007f22b3df046e in pthread_mutex_lock () from /lib64/libpthread.so.0

I observe this with 85 threads out of 102

Could someone suggest test code to simulate this error?

Regards
Bhupinder




Gil Tene

unread,
May 27, 2015, 11:19:39 AM5/27/15
to mechanica...@googlegroups.com
To be clear: the bug is not limited to kernels labeled "3.14" thru "3.18". It appears in several production kernels with "earlier" labels that are parts of various distros, and got there through backporting efforts. E.g. RHEL 6.6 uses 2.6.32.xxxxx kernels, and RHEL 7.1 uses 3.10.0.xxx kernels, and both of those have had the bug introduced through backporting of changes from 3.14. "3.0.101 SLES 11" is not the same as 3.0.101. My first suspicion would be that this SLES kernel has had the bug backported to it. E.g. we originally ran into the bug in RHEL 6.6 with a 2.6.32.504 kernel that included a backport of the buggy update. The fix backport appeared in the 2.6.32.504.16.2 kernel in RHEL 6.6.z.

You should get the specific sources for your 3.0.101 SLES 11 kernel and look at kernel/futex.c. Specifically, you want to see if the 3.14 update (in https://github.com/torvalds/linux/commit/b0c29f79ecea0b6fbcefc999e70f2843ae8306db ) which introduced the bug was backported to it, but the 3.18 fix to that (in https://github.com/torvalds/linux/commit/76835b0ebf8a7fe85beb03c75121419a7dec52f0 ) hadn't been applied on top of that. [See more detailed discussion in the original post on this thread].

bhu...@gmail.com

unread,
May 27, 2015, 1:40:41 PM5/27/15
to mechanica...@googlegroups.com
Thanks Gil for the details.

However in this case the bug turned out to be in our code.

Apart from other factors, there was considerably higher load on this new target system, so I guess the deadlocks were happening as a result of that.

We have a patch deployed and would be monitoring to see if we still encounter it.

I will still go ahead and review the futex code as you suggested to see if it is affected or not. Perhaps it could help others.

Adrian Muraru

unread,
May 28, 2015, 1:04:05 PM5/28/15
to mechanica...@googlegroups.com
Thanks Gil,
I can confirm kernel-3.10.0-229.4.2.el7 shipped with Centos7.1 is impacted.

I filed a bug with Centos:
http://bugs.centos.org/view.php?id=8803 

-adrian

Jack Bradach

unread,
Jun 3, 2015, 5:30:31 PM6/3/15
to mechanica...@googlegroups.com
I'm seeing the same thing on the same version of Ubuntu with a Haswell when running JetBrains CLion (which runs in Java). It hangs on a futex and I have to kill the program to unlock it. Binding it to a single core seems to avoid the hang, but kills performance. I checked Ubuntu's version of the 3.19 kernel source and the patch is in there.

It does seem to be somehow related to Java 8 vs. Java 7. When I installed Oracle JDK 7 and forced CLion to run on it, I no longer see the hang.

Adrian Muraru

unread,
Jun 25, 2015, 1:48:39 AM6/25/15
to mechanica...@googlegroups.com
RHEL/Centos: 3.10.0-229.7.2.el7 kernel is now including a fix for this bug.

-adrian

Alen Vrečko

unread,
Jun 25, 2015, 5:12:48 AM6/25/15
to mechanica...@googlegroups.com
While on the subject, meant to post a while back:

The bug occurred frequently on HP Gen9 (Haswell) servers running openSUSE 13.1. On HP Gen8 servers running the "same" software, the bug never happened.

o) Upgrading the kernel to 4.0.4 fixed the problem. As expected.

o) Upgrading just Java from 7u25 to 7u79 (without upgrading the kernel) also "fixed" the problem. This is very surprising. Didn't have the time to investigate further.

Best regards,
Alen


Bill Kelso

unread,
Jul 9, 2015, 9:23:58 PM7/9/15
to mechanica...@googlegroups.com
We are getting killed by this right now. We are running Oracle Linux, Red Hat rel. 6.6, kernel version 2.6.32-504.16.2.el6.x86_64. Supposedly the bug in this version was fixed, but it just happened again tonight (after not happening for two nights in a row).

Does it matter what Oracle client version you use with a particular kernel? I mean, I know it matters. But is there a known good combo of kernel and client? We are going nuts trying to track this down, not to even mention actually fixing it. The two 'good' nights we had were on the Linux kernel version above and an Oracle 12c client (12.1.0.1). But then we started getting other errors (random "TNS connection closed" errors).

We've been working on this for more than a month. I've actually started working nights in order to prevent holdups to our data warehouse load. I just sit and watch for jobs that start to hang mysteriously. It's really getting old.

Any advice as to where I can find info on this bug in Oracle Linux implementations?

Thanks. bk

Gil Tene

unread,
Jul 9, 2015, 10:56:43 PM7/9/15
to mechanica...@googlegroups.com


On Thursday, July 9, 2015 at 6:23:58 PM UTC-7, Bill Kelso wrote:
We are getting killed by this right now. We are running Oracle Linux, Redhat rel. 6.6, kernel version 2.6.32-504.16.2.el6.x86_64. Supposedly the bug in this version was fixed, but it just happened again tonight (after not happening for two nights in a row).

If the kernel is really that RHEL version (2.6.32-504.16.2.el6.x86_64), I'm pretty sure the specific bug discussed here is fixed in that one. Maybe what you are running into is some other bug? Are you seeing processes hang in futex wait?

Bill Kelso

unread,
Jul 10, 2015, 12:02:37 AM7/10/15
to mechanica...@googlegroups.com
I'm not sure. When I run a stack trace, the PID always refers to futex_. But the hang happens on an Oracle OCI call (I think that's what it is). All the threads look like this:

Thread 1 (Thread 0x2b8807f2d420 (LWP 9885)):
#0  0x000000347540b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00002b87fb24983c in conditionWait(pthread_cond_t*, SMutex*, unsigned int) () from /opt/informatica/9.6.1/server/bin/libpmcef.so
#2  0x00002b87fb24b2a7 in SThread::Sleep(unsigned int) () from /opt/informatica/9.6.1/server/bin/libpmcef.so
#3  0x00002b87fb249b3b in SEvent::putThreadToSleep(SThread*, unsigned int) () from /opt/informatica/9.6.1/server/bin/libpmcef.so
#4  0x00002b87fb24a4cc in msgque::get(int, TObject**, unsigned int) () from /opt/informatica/9.6.1/server/bin/libpmcef.so
#5  0x00002b87fb24a657 in SThread::readMessage(int, unsigned int, TObject**) () from /opt/informatica/9.6.1/server/bin/libpmcef.so
#6  0x00000000005f2798 in SDirectorImpl::getNextMessage() ()
#7  0x00000000005f4e3d in SDirectorImpl::doPETLOrchestration() ()
#8  0x00000000005f679e in SDirectorImpl::orchestrate() ()
#9  0x00000000005f6d21 in SDirectorImpl::run() ()
#10 0x00000000005faa6c in SDirectorRunnable::start() ()
#11 0x00000000005c01c3 in SExecutorDTM::start() ()
#12 0x00000000005dff57 in SPreparerDTMImpl::start() ()
#13 0x00000000005d774f in DTMMain(int, char const**) ()
#14 0x000000347501ed5d in __libc_start_main () from /lib64/libc.so.6
#15 0x00000000005b0a89 in _start ()

And then when I run the stack trace, the session 'wakes up' and starts sourcing data from Oracle.

The symptoms we are encountering sure sound like the futex_wait bug. And I agree the kernel version is identified elsewhere as a 'good' version. I suppose it could be something else. But how the heck do I figure that out?

Thanks for replying, by the way.

Gil Tene

unread,
Jul 10, 2015, 12:23:40 AM7/10/15
to mechanica...@googlegroups.com


On Thursday, July 9, 2015 at 9:02:37 PM UTC-7, Bill Kelso wrote:
I'm not sure. When I run a stack trace, the PID always refers to futex_. But the hang happens on an Oracle OCI call (I think that's what it is).

And then when I run the stack trace, the session 'wakes up' and starts sourcing data from Oracle.

The symptoms we are encountering sure sound like the futex_wait bug.

This sounds like a similar symptom. But other missed wakeups (including user-mode logic bugs that cause a missed wakeup via a real race, not just kernel issues) could cause the same behavior... And sometimes those "normal" user-mode missed-wakeup bugs can also be kicked back to life by something that interrupts the thread (like a stack trace).
 
And I agree the kernel version is identified elsewhere as a 'good' version. I suppose it could be something else. But how the heck do I figure that out?

The way to figure out if *this* bug (the one this thread was started on) exists in a specific kernel is to get the kernel sources for that specific kernel and look through kernel/futex.c to see if the bug had been backported into it (e.g. like https://github.com/torvalds/linux/commit/b0c29f79ecea0b6fbcefc999e70f2843ae8306db) but the fix (e.g. like https://github.com/torvalds/linux/commit/76835b0ebf8a7fe85beb03c75121419a7dec52f0) has not been ported in to fix it. However, if the kernel version really is the one mentioned here, that check had already been done for you by others...

My inclination would be to suspect a "regular" missed wakeup in some user-mode code. A possible way to eliminate this specific kernel bug as a cause is to downgrade to a RHEL 6.5 kernel. The bug did not exist in RHEL 6.x versions before RHEL 6.6. If you downgrade and the behavior problems persist, you are looking at another bug...
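For what it's worth, the classic shape of such a user-mode missed-wakeup bug is a predicate tested without the lock, or a cond_wait without a re-check loop. A sketch of the correct pattern (my own illustration, with the usual bugs noted in the comments):

#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
static int ready;

void *waiter(void *arg)
{
    pthread_mutex_lock(&m);
    while (!ready)                  /* bug #1 in the wild: an 'if' here misses
                                       spurious wakeups and stale predicates */
        pthread_cond_wait(&c, &m);
    pthread_mutex_unlock(&m);
    return NULL;
}

void *waker(void *arg)
{
    pthread_mutex_lock(&m);         /* bug #2 in the wild: setting 'ready'
                                       without the mutex races the check above
                                       and the signal can be lost forever */
    ready = 1;
    pthread_cond_signal(&c);
    pthread_mutex_unlock(&m);
    return NULL;
}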

Serguei Kolos

unread,
Aug 17, 2015, 5:45:32 AM8/17/15
to mechanical-sympathy
Hi

Fantastic. Many thanks for sharing that info, which saved me several weeks of working time. I went as far as getting nasty GDB stack traces showing threads waiting on a non-locked mutex, but I didn't know how to dig further down.

Cheers,

Todd Lipcon

unread,
Oct 30, 2015, 2:18:32 PM10/30/15
to mechanica...@googlegroups.com
Just to tag onto this old thread (because we ran into it on a new Haswell cluster last night)...

I did some digging in the CentOS/RHEL kernel changelog, and the fix shows up in version 2.6.32-504.14.1.el6. Hope that's useful for other folks determining if they're vulnerable.

-Todd

Daniel Worthington-Bodart

unread,
Feb 26, 2016, 4:43:02 PM2/26/16
to mechanical-sympathy
Sorry to be a necromancer, but I thought it was worth letting you all know that there is still what appears to be a related freeze for Java applications on recent Ubuntu versions when run on Haswell-E platforms.


I had this problem on a 5960X running Ubuntu 15.10, stock kernel 4.2.0-18, latest JDK (jdk1.8.0_74). I can confirm that the cold-boot fix works with the stock kernel.

The problem is also resolved by using the very latest kernel, 4.5.0-rc5, from the mainline PPA.

Andriy Plokhotnyuk

unread,
Feb 27, 2016, 1:36:54 PM2/27/16
to mechanical-sympathy
Who will do git bisect?



Craig Yoshioka

unread,
Apr 7, 2016, 3:58:21 PM4/7/16
to mechanical-sympathy
I believe I am also seeing this issue, or a related one. In my case it occurs when running an MPI C++ program over 400+ cores/processes. The program occasionally seems to get stuck at certain steps, especially when RAM use goes up. CPU use stays pegged at 100%, but most of it becomes system time. Running strace on a process shows a lot of sched_yield and futex calls. If I run strace on every process, on every node, it seems to kick the troublesome process out of its rut, and things resume as normal. I am running CentOS 6.7 with Linux kernel 2.6.32-504.30.3.
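In case it's useful, this is roughly what I mean by stracing every process (a sketch; assumes the pids of the MPI ranks on a node are in $PIDS, and that briefly attaching is enough to kick a stuck rank, as described above):

# Briefly attach strace to every thread of each suspect process;
# timeout detaches strace again after two seconds.
for pid in $PIDS; do
  for tid in /proc/"$pid"/task/*; do
    timeout 2 strace -e trace=futex -p "$(basename "$tid")" -o /dev/null
  done
done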



Trent Nelson

unread,
Apr 8, 2016, 10:53:00 AM4/8/16
to mechanical-sympathy
Craig: sounds like you've got transparent huge pages enabled.  I'd highly recommend disabling it and seeing if the problem persists.  (See https://access.redhat.com/solutions/46111.)

    Trent.
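For anyone checking, a quick sketch of verifying and flipping the setting (standard sysfs path on RHEL-family kernels; run as root):

# The bracketed value is the active THP mode
cat /sys/kernel/mm/transparent_hugepage/enabled
# Disable at runtime; to persist, add transparent_hugepage=never
# to the kernel command line in grub
echo never > /sys/kernel/mm/transparent_hugepage/enabled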

Craig Yoshioka

unread,
Apr 8, 2016, 10:59:13 AM4/8/16
to mechanica...@googlegroups.com
Hi Trent,

Thanks for the suggestion.  That was a problem I ran into a while back, but all the nodes now have THP disabled.  This problem does have similar performance symptoms, but appears to have a different cause.


Tom Lee

unread,
Apr 8, 2016, 11:20:34 AM4/8/16
to mechanica...@googlegroups.com

Hey Craig,

"perf top" would be my first port of call here to get an idea where all that system time is going.

Cheers,
Tom
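For example, a minimal sketch (assumes the perf package matching the running kernel is installed):

# Sample on-CPU time system-wide, with call graphs, to see which
# kernel/user symbols the system time is going to
perf top -g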


Longchao Dong

unread,
Feb 13, 2017, 5:01:39 AM2/13/17
to mechanical-sympathy
How do you reproduce this issue? Is it possible to show us the method? I am also working on a strange pthread_cond_wait issue, but I'm not sure if it is related to this one.

Allen Reese

unread,
Feb 14, 2017, 11:01:52 AM2/14/17
to mechanica...@googlegroups.com
This bug report seems to have a way to reproduce it:

Hope that helps.

--Allen Reese




On Wednesday, May 20, 2015 at 8:16:12 AM UTC+8, manis...@gmail.com wrote:
I bumped into this error a couple of months back when using CentOS 6.6 on a 32-core Dell server. After many days of debugging, I realized it was a CentOS 6.6 bug and moved back to 6.5, and since then no such issues have been seen.
I am able to reproduce this issue within 15 minutes of heavy load on my multi-threaded C code.

On Wednesday, May 13, 2015 at 3:37:32 PM UTC-7, Gil Tene wrote:

Bottom line: the bug is very real, but it probably only appeared in the 3.14 upstream version (and distro versions that had backported https://github.com/torvalds/linux/commit/b0c29f79ecea0b6fbcefc999e70f2843ae8306db, presumably after Jan 2014). The bug was fixed in 3.18 in October 2014, but backports probably took a while (and some may still be pending). I know for a fact that RHEL 6.6.z has the fix. I don't know about other distro families and versions (yet), but if someone else does, please post (including when it was broken and when it was fixed).

Note: I would like to profusely thank @aplokhotnyuk. His tweet originally alerted me to the bug's existence, and started us down the path of figuring out the what/why/where/when behind it. Why this is not being shouted in the streets is a mystery to me, and scary in its own right. We were lucky enough that I had a "that looks suspiciously familiar" moment when I read that tweet, and that I put 3.14 and 1.618 together and thought enough to ask "Umm... have we only been seeing this bug on Haswell servers?".
 
Without @aplokhotnyuk's tweet we'd probably still be searching for the nonexistent bugs in our own locking code... And since the tweet originated from another discussion on this group, it presents a rare "posting and reading twitter actually helps us solve bugs sometimes" example.



Will Foster

unread,
Feb 15, 2017, 9:33:45 AM2/15/17
to mechanical-sympathy, are...@yahoo-inc.com





I also see this on the latest CentOS 7.3 with Logstash. I've disabled huge pages via transparent_hugepage=never in grub.

Here's what I get from strace against Logstash (it never fully comes up to listen on TCP/5044):

[root@host-01 ~]# strace -p 1292
Process 1292 attached
futex(0x7f80eff8a9d0, FUTEX_WAIT, 1312, NULL


I am hitting this issue on Logstash 5.2.1-1 while trying to upgrade my Ansible playbooks to the latest ES versions.

 


Gil Tene

unread,
Feb 15, 2017, 10:45:35 AM2/15/17
to mechanica...@googlegroups.com, are...@yahoo-inc.com
Don't know if this is the same bug. RHEL 7 kernels have included fixes for this since some time in 2015.

While one of my first courses of action when I see a suspicious FUTEX_WAIT hang situation is still to check kernel versions to rule this out (since this bug has wasted a bunch of our time in the past), keep in mind that not all things stuck in FUTEX_WAIT are futex_wait kernel bugs. The most likely explanation is usually an actual application logic bug involving deadlock or starvation.

Does attaching and detaching from the process with gdb move it forward? [the original bug was missing the wakeup, and an attach/detach would "kick" the futex out of its slumber once]
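A minimal sketch of that kick, assuming $PID is the stuck process:

# Attach and immediately detach; if the process moves forward afterwards,
# a missed wakeup (possibly this kernel bug) becomes a plausible suspect.
gdb -batch -ex detach -p "$PID"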

Wojciech Kudla

unread,
Feb 15, 2017, 2:33:13 PM2/15/17
to mechanical-sympathy, are...@yahoo-inc.com

Just trying to eliminate the obvious: you should be stracing the JVM threads by referring to their tids rather than the parent process pid. The parent will pretty much always show as blocked on a futex.
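For example (a sketch; $PID is the JVM pid and $TID one of its thread ids):

# List the JVM's thread ids, then trace a suspect thread directly
ls /proc/"$PID"/task
strace -e trace=futex -p "$TID"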



Longchao Dong

unread,
Feb 15, 2017, 8:25:05 PM2/15/17
to mechanica...@googlegroups.com
In fact, I have met a very strange problem.
My C++ program calls HDFS's interfaces via JNI, but the threads are all blocked on the same Java object lock. I obtained the state of the process with jstack. All threads are waiting to lock the object (0x00000006b30b3be8), but no thread is holding it. Does anybody have clues?
The attachment is the output of jstack at that time.



--
Best Regards,
董隆超
jstack.log

Dan Beaulieu

unread,
Mar 16, 2017, 6:29:08 PM3/16/17
to mechanical-sympathy

Hi Daniel, do you happen to know what the commit was that fixed this? I'd like to learn more about the fix.



Allen Reese

unread,
Mar 16, 2017, 6:38:57 PM3/16/17
to mechanica...@googlegroups.com
You're asking about the futex_wait fix, right?

That would be this commit as far as I can tell from looking around:


--Allen Reese
Yahoo! Inc.



Dan Beaulieu

unread,
Mar 17, 2017, 10:27:39 AM3/17/17
to mechanical-sympathy, are...@yahoo-inc.com
Possibly, but I don't think so. I was replying directly to the post that mentioned there was a fix in the 4.5 kernel. Since we were seeing an issue we think is this one with 4.4.z, but don't see it after upgrading to 4.10.z, we think whatever Daniel is referring to is related.

The patch you linked is from 2014, so I'd imagine it'd also be in the 4.4.z kernel we were using and having issues with.

Allen Reese

unread,
Mar 17, 2017, 11:57:53 AM3/17/17
to Dan Beaulieu, mechanical-sympathy
That's kinda what I thought.
I read the reports for a bit and it wasn't clear to me.

I've only seen what might be a microcode issue on some Haswell boxes, but I don't have much more than that. The microcode issue I'm aware of is fixed by installing a newer microcode package.

However, I've only been supporting Java on RHEL. :)

Daniel Worthington-Bodart

unread,
Mar 18, 2017, 2:36:14 PM3/18/17
to mechanica...@googlegroups.com, Dan Beaulieu
I never found the exact fix, but I do know that 4.4.0-65 (Ubuntu 16.04 LTS) is also fixed, so the problem lies somewhere between at least 4.2.0-18 and that version.

D


SITARAM SAKTHI

unread,
Jul 20, 2017, 8:19:21 AM7/20/17
to mechanical-sympathy, are...@yahoo-inc.com
I see a similar issue: it moves forward after attaching to and detaching from the process with gdb, but only on CentOS 6.7.
Could this be an issue with the kernel?

Allen Reese

unread,
Jul 20, 2017, 10:34:32 AM7/20/17
to mechanica...@googlegroups.com
It's fixed for me in RHEL 6.7, with kernel-2.6.32-504.16.2.el6 or later.
For RHEL7, it's fixed with 3.10.0-229.7.2.el7 or later.

--Allen Reese




zhengzh...@gmail.com

unread,
Sep 1, 2017, 2:49:47 AM9/1/17
to mechanical-sympathy
Hi Tene,
When I read your note, I understood that this bug appeared in kernel versions 3.14 - 3.17. But why does it occur on RHEL 6.6, which still uses kernel 2.6.32? I can't understand this and need help. Can you explain, or point me to some reference? Thanks!!



Peter Booth

unread,
Sep 1, 2017, 2:55:34 PM9/1/17
to mechanical-sympathy
Zheng,

The issue is how Red Hat Enterprise Linux uses security backports. The RHEL distro tries to be as stable and secure as possible by using well-tested (old) versions of components. But when testing discovers security vulnerabilities in a newer version of a component, Red Hat checks whether the bug exists in the old version of the component. If it does, they patch the old version with the code change from the newer version to address the issue. It's a great idea that works well most of the time. This is called backporting and is described on the Red Hat site.

Occasionally however, the fix to a security issue also introduces an unrelated bug. This is what occurred here.

Peter

zhengzh...@gmail.com

unread,
Sep 4, 2017, 6:24:09 AM9/4/17
to mechanical-sympathy
Thank you very much! I get it now!

On Saturday, September 2, 2017 at 2:55:34 AM UTC+8, Peter Booth wrote:

Vishal Sharma

unread,
Mar 25, 2020, 3:44:26 AM3/25/20
to mechanical-sympathy
Hi,
I seem to be hitting this issue on RHEL 6.5. Is that possible?
I've attached strace and the pstack output that I'm getting. Given below are my OS details:

[root@localhost ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.5 (Santiago)

[root@localhost ~]# uname -r
2.6.32-431.el6.x86_64


On Thursday, May 14, 2015 at 6:05:49 PM UTC+5:30, Gil Tene wrote:
Thanks for identifying the date and change for the fix in RHEL 6 (April 21, 2015). Strange that the errata makes no mention of it or its impact (it's a security advisory).

So far, here is what I know when it comes to distro release numbers for RHEL and its main cousins (at least in that world, most admins view things in terms of distro version rather than kernel version; you can look up the associated kernel versions). I'm using good/BAD to mean "does not have the bug" / "HAS THE MISSING BARRIER BUG":

RHEL 6 (and CentOS 6, and SL 6): 6.0-6.5 are good. 6.6 is BAD. 6.6.z is good.
RHEL 7 (and CentOS 7, and SL 7): 7.1 is BAD. As of yesterday, there does not yet appear to be a 7.x fix.
RHEL 5 (and CentOS 5, and SL 5): All versions are good (including 5.11).

Sadly, 6.6 seems to be a very (the most?) popular version we run into. And very few have moved to 6.6.z.

I/we are looking for more info to identify the specific versions affected in other distros (Ubuntu 12.04 LTS and 14.04 LTS, SLES 12 and 11, Amazon Linux, Oracle Linux, etc.). So far we've verified that SLES 12 kernel versions 3.12.32-33.1 & above have the fix (but not which versions have the bug), and that Amazon Linux kernel versions 3.14.35-28.38 & above have the fix (but not which versions have the bug).

I will post here as I have more; if you find more info useful for identifying releases that have the bug and ones that have fixed it, please do the same.

On Thursday, May 14, 2015 at 5:24:33 AM UTC-5, Marcin Sobieszczanski wrote:

According to the ChangeLogs attached to the rpms, it looks like the kernel series used in RHEL 6.6 (kernel-2.6.32-504) was affected from the start of the 6.6 release. It has been fixed only recently, in the kernel-2.6.32-504.16.2.el6 update (21 April, https://rhn.redhat.com/errata/RHSA-2015-0864.html):

rpm -qp --changelog kernel-2.6.32-504.16.2.el6.x86_64.rpm | grep 'Ensure get_futex_key_refs() always implies a barrier'
strace output.txt

Todd Lipcon

unread,
Mar 25, 2020, 3:51:55 AM3/25/20
to mechanica...@googlegroups.com
Yes, per my earlier email in this thread, the fix went into RHEL in kernel 2.6.32-504.14.1.el6, which is newer than the one you're reporting.

-Todd
