Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW].


Gil Tene

unread,
May 13, 2015, 6:37:32 PM5/13/15
to mechanica...@googlegroups.com
We had this one bite us hard and scare the %$^! out of us, so I figured I'd share the fear...

The linux futex_wait call has been broken for about a year (in upstream since 3.14, around Jan 2014), and has just recently been fixed (in upstream 3.18, around October 2014). More importantly this breakage seems to have been back ported into major distros (e.g. into RHEL 6.6 and its cousins, released in October 2014), and the fix for it has only recently been back ported (e.g. RHEL 6.6.z and cousins have the fix).

The impact of this kernel bug is very simple: user processes can deadlock and hang in seemingly impossible situations. A futex wait call (and anything using a futex wait) can stay blocked forever, even though it had been properly woken up by someone. Thread.park() in Java may stay parked. Etc. If you are lucky you may also find soft lockup messages in your dmesg logs. If you are not that lucky (like us, for example), you'll spend a couple of months of someone's time trying to find the fault in your code, when there is nothing there to find. 

This behavior seems to regularly appear in the wild on Haswell servers (all the machines where we have had customers hit it, in the field and in labs, have been Haswells), and since Haswell servers are basically what you get if you buy a new machine now, or run on the cool new Amazon EC2/GCE/Azure stuff, you are bound to experience some interesting behavior. I don't know of anyone that will see this as a good thing for production systems. Except for maybe Netflix (maybe we should call this the linux fumonkey).


The commit explanation says that it fixes https://github.com/torvalds/linux/commit/b0c29f79ecea0b6fbcefc999e70f2843ae8306db (presumably the bug introduced with that change), which was made in Jan of 2014 and went into 3.14. That 3.14 code added logic to avoid taking a lock if the code knows that there are no waiters. It documents (pretty elaborately) how "…thus preventing tasks sleeping forever if wakers don't acknowledge all possible waiters" with logic that explains how memory barriers guarantee the correct order (see the paragraph at line 141), including the statement "this is done by the barriers in get_futex_key_refs(), through either ihold or atomic_inc, depending on the futex type." (this assumption is the actual bug). The assumption is further reinforced by the fact that the change added a comment to every call to get_futex_key_refs() in the code that says "/* implies MB (B) */".

The problem was that get_futex_key_refs() did NOT imply a memory barrier. It only included a memory barrier for two explicit cases in a switch statement that checks the futex type, but it had no default case handler, and therefore did not apply a memory barrier for other futex types. Like private futexes. Which are a very commonly used type of futex.

The fix is simple: an added default case for the switch that just has an explicit smp_mb() in it. There was a missing memory barrier in the wakeup path, and now (hopefully) it's not missing any more...

So let's be clear: RHEL 6.6 (and CentOS 6.6, and Scientific Linux 6.6) are certainly broken on Haswell servers. It is likely that recent versions of other distros are too (SLES, Ubuntu, Debian, Oracle Linux, etc.). The good news is that fixes are out there (including 6.6.z). But the bad news is that there is not much chatter saying "if you have a Haswell system, get to version X now". For some reason, people seem not to have noticed this or raised the alarm. We certainly haven't seen much "INSTALL PATCHES NOW" fear mongering. And we really need it, so I'm hoping this posting will start a panic.

Bottom line: the bug is very real, but it probably only appeared in the 3.14 upstream version (and in distro versions that had backported https://github.com/torvalds/linux/commit/b0c29f79ecea0b6fbcefc999e70f2843ae8306db , presumably after Jan 2014). The bug was fixed in 3.18 in October 2014, but backports probably took a while (and some may still be pending). I know for a fact that RHEL 6.6.z has the fix. I don't know about other distro families and versions (yet), but if someone else does, please post (including when it was broken and when it was fixed).

Note: I would like to profusely thank @aplokhotnyuk. His tweet originally alerted me to the bug's existence, and started us down the path of figuring out the what/why/where/when behind it. Why this is not being shouted in the streets is a mystery to me, and scary in its own right. We were lucky enough that I had a "that looks suspiciously familiar" moment when I read that tweet, and that I put 3.14 and 1.618 together and thought enough to ask "Umm... have we only been seeing this bug on Haswell servers?".
 
Without @aplokhotnyuk's tweet we'd probably still be searching for the nonexistent bugs in our own locking code... And since the tweet originated from another discussion on this group, it presents a rare "posting and reading twitter actually helps us solve bugs sometimes" example.



Vitaly Davidovich

unread,
May 13, 2015, 6:48:41 PM5/13/15
to mechanica...@googlegroups.com

That's nasty! Thanks for sharing.

sent from my phone


--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Cosmin Lehene

unread,
May 13, 2015, 9:33:45 PM5/13/15
to mechanica...@googlegroups.com
We've been hunting for this for a while now. 
In our case it's reproducing only on 10-core Haswells, which are different from the 8-core parts (dual vs. single ring bus and more cache-coherency options). It's probably a matter of probability.

Attaching (and detaching) fixes the state, so using jstack and other tools makes the problem disappear; that made it tricky to diagnose (though it's also a good workaround). We ended up using core dumps to figure out what was going on, which led us to suspect glibc or the kernel.
We weren't able to reproduce it while running systemtap probes either.

Pinning the JVM to a single CPU reduces the probability of occurrence drastically (from a few times a day to weeks), so I'm guessing latency distributions may have an effect.

We're seeing it mostly during GC pauses in multiple places, but most times when parking.

For example:

(gdb) bt
#0  0x0000003593e0e264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003593e09508 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x0000003593e093d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007fe87a42a50d in os::PlatformEvent::park() () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#4  0x00007fe87a3f10e8 in Monitor::ILock(Thread*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#5  0x00007fe87a3f132f in Monitor::lock_without_safepoint_check() () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#6  0x00007fe87a15a7bf in G1HotCardCache::insert(signed char*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#7  0x00007fe87a15db03 in G1RemSet::refine_card(signed char*, int, bool) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#8  0x00007fe87a143dc8 in RefineCardTableEntryClosure::do_card_ptr(signed char*, int) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#9  0x00007fe87a0feb9f in DirtyCardQueueSet::apply_closure_to_completed_buffer_helper(CardTableEntryClosure*, int, BufferNode*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#10 0x00007fe87a0fed8d in DirtyCardQueueSet::apply_closure_to_completed_buffer(int, int, bool) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#11 0x00007fe87a0683a4 in ConcurrentG1RefineThread::run() () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#12 0x00007fe87a430ca8 in java_start(Thread*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#13 0x0000003593e079d1 in start_thread () from /lib64/libpthread.so.0
#14 0x0000003593ae88fd in clone () from /lib64/libc.so.6


It's still unclear to me why a futex would end up being used, and there are open questions with what we see in the registers (who's on the other side), but we didn't get time to investigate every detail.
There has been quite a bit of benchmarking done on Haswells comparing Java 7 and 8 (quite impressive, actually); I wonder how they haven't stumbled into this.
 
Either way, debugging this has been quite fun and perhaps we'll write about the adventure in more detail :)

Cosmin

Gil Tene

unread,
May 13, 2015, 9:44:35 PM5/13/15
to mechanica...@googlegroups.com
Kernel futex_wait() calls end up being at the core of almost any user-land synchronization primitive these days. Whether it's POSIX stuff (like mutexes and semaphores) or direct use of futexes. And all JVM synchronization (synchronized, Lock, park/unpark), as well as all internal JVM threads (GC, compiler, etc.), ends up with a waiting futex at some point.
 
There has been quite a bit of benchmarking done on Haswells comparing Java 7 and 8 (quite impressive, actually); I wonder how they haven't stumbled into this.

I think this benchmarking was done before the bug appeared in most distros. E.g. RHEL 6.6 (and CentOS and SL) only got the bug last October...
 
 
Either way, debugging this has been quite fun and perhaps we'll write about the adventure in more detail :)

Have you moved to 6.6.z? (or if not on a RHEL or RHEL-like, a latest kernel of some sort?). 

Cosmin Lehene

unread,
May 13, 2015, 10:19:45 PM5/13/15
to mechanica...@googlegroups.com
Kernel futex_wait() calls end up being at the core of almost any user-land synchronization primitive these days. Whether it's POSIX stuff (like mutexes and semaphores) or direct use of futexes. And all JVM synchronization (synchronized, Lock, park/unpark), as well as all internal JVM threads (GC, compiler, etc.), ends up with a waiting futex at some point.

Yes, but why not a private one, as long as it's only the JVM process involved? I'm likely missing something, and I was going to look closer at the Java mutex path along with glibc to understand why.

Have you moved to 6.6.z? (or if not on a RHEL or RHEL-like, a latest kernel of some sort?). 

Not yet. Like I said, we managed with workarounds (e.g. pinning, and automatically unblocking it by running something that attaches and detaches when the behavior is detected).
I didn't know about the kernel bug until I saw this (I was suspecting glibc/kernel but only looked through glibc code and bugs...)
We'll likely update the kernel soon.

Vitaly Davidovich

unread,
May 13, 2015, 10:42:29 PM5/13/15
to mechanica...@googlegroups.com

Private futures are exactly one of the types affected, according to that changelog.

sent from my phone


Vitaly Davidovich

unread,
May 13, 2015, 10:44:09 PM5/13/15
to mechanica...@googlegroups.com

Private *futexes*, damn autocorrect.

sent from my phone

Cosmin Lehene

unread,
May 13, 2015, 10:54:34 PM5/13/15
to mechanica...@googlegroups.com
I guess the question should be: why isn't it always a private futex?

And yes, we saw it with private as well...

Thread 10 (Thread 0x7fa458ade700 (LWP 4482)):
#0  0x00000037016f805e in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x000000370167d16b in _L_lock_9503 () from /lib64/libc.so.6
#2  0x000000370167a6a6 in malloc () from /lib64/libc.so.6
#3  0x00007fa45a52ed29 in os::malloc(unsigned long, unsigned short, unsigned char*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#4  0x00007fa459fb66b3 in ChunkPool::allocate(unsigned long, AllocFailStrategy::AllocFailEnum) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#5  0x00007fa459fb62d1 in Arena::grow(unsigned long, AllocFailStrategy::AllocFailEnum) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#6  0x00007fa45a145cc0 in CompactibleFreeListSpace::new_dcto_cl(OopClosure*, CardTableModRefBS::PrecisionStyle, HeapWord*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#7  0x00007fa45a54ce6d in CardTableModRefBS::process_stride(Space*, MemRegion, int, int, OopsInGenClosure*, CardTableRS*, signed char**, unsigned long, unsigned long) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#8  0x00007fa45a54d040 in CardTableModRefBS::non_clean_card_iterate_parallel_work(Space*, MemRegion, OopsInGenClosure*, CardTableRS*, int) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#9  0x00007fa45a0d4e08 in CardTableModRefBS::non_clean_card_iterate_possibly_parallel(Space*, MemRegion, OopsInGenClosure*, CardTableRS*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#10 0x00007fa45a0d6a0e in CardTableRS::younger_refs_in_space_iterate(Space*, OopsInGenClosure*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#11 0x00007fa45a1823fe in ConcurrentMarkSweepGeneration::younger_refs_iterate(OopsInGenClosure*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#12 0x00007fa45a5c98aa in SharedHeap::process_strong_roots(bool, bool, SharedHeap::ScanningOption, OopClosure*, CodeBlobClosure*, OopsInGenClosure*, bool) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#13 0x00007fa45a27ef8c in GenCollectedHeap::gen_process_strong_roots(int, bool, bool, bool, SharedHeap::ScanningOption, OopsInGenClosure*, bool, OopsInGenClosure*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#14 0x00007fa45a551e4f in ParNewGenTask::work(unsigned int) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#15 0x00007fa45a6cf0cf in GangWorker::loop() () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#16 0x00007fa45a537ca8 in java_start(Thread*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#17 0x0000003701a079d1 in start_thread () from /lib64/libpthread.so.0
#18 0x00000037016e88fd in clone () from /lib64/libc.so.6
Cosmin

Vitaly Davidovich

unread,
May 14, 2015, 12:05:28 AM5/14/15
to mechanica...@googlegroups.com

Those two traces look like they're coming from different code paths (malloc vs pthread_mutex) so I'm not sure if lll_lock_wait means it's not private.  Looking at the kernel change, only private futexes weren't covered by a barrier in the broken version.

sent from my phone


Gil Tene

unread,
May 14, 2015, 1:18:46 AM5/14/15
to mechanica...@googlegroups.com
Regarding the two different traces you included in the two different posts: The difference between __lll_lock_wait_private(int *futex) and __lll_lock_wait(int *futex, int private) is that the first is always private, while the second is optionally private depending on what the caller says.

[The following is an educated guess about the difference, not based on tracing the actual code all the way...:]

- malloc() is always private to the process involved, so it is using a private futex that it owns for its own purposes. So you end up with a direct __lll_lock_wait_private() call.

- In contrast, pthread_mutex_lock() is performing a lock operation on behalf of the caller. The caller may very well want to do a locking or unlocking operation on a shared (across processes, via a shared word in a shared mmap'ed region) futex. pthread_mutex_lock() doesn't know which it is (private or not); it is told by its caller. Normally (and probably by default) the caller would indicate a private futex, but from the API point of view, it *could* be different... So you end up with a call to __lll_lock_wait(int *futex, int private).

-- Gil.

Marcin Sobieszczanski

unread,
May 14, 2015, 6:24:33 AM5/14/15
to mechanica...@googlegroups.com
> More importantly this breakage seems to have
> been back ported into major distros (e.g. into RHEL 6.6 and its cousins,
> released in October 2014), and the fix for it has only recently been back
> ported (e.g. RHEL 6.6.z and cousins have the fix).

According to the ChangeLogs attached to the rpms, it looks like the kernel series used in RHEL 6.6 (kernel-2.6.32-504) was affected from the start of the 6.6 release. It has been fixed only recently, in the kernel-2.6.32-504.16.2.el6 update (21 April, https://rhn.redhat.com/errata/RHSA-2015-0864.html):

rpm -qp --changelog kernel-2.6.32-504.16.2.el6.x86_64.rpm | grep 'Ensure get_futex_key_refs() always implies a barrier'

Alex Bagehot

unread,
May 14, 2015, 9:01:21 AM5/14/15
to mechanica...@googlegroups.com


Vitaly Davidovich

unread,
May 14, 2015, 9:45:15 AM5/14/15
to mechanica...@googlegroups.com

How do jstack and the like subvert the problem? Do they cause the thread to be woken up (from bogus sleep) and observe consistent state at that point?

sent from my phone


Gil Tene

unread,
May 14, 2015, 1:50:37 PM5/14/15
to <mechanical-sympathy@googlegroups.com>
The bug manifests in a lost wake up. Things that happen to do redundant wake ups (e.g. notifyAll for condition variables or Java synchronized objects, as well as some other ways) might accidentally release some threads from their sleeping beauty state. But only if they are lucky.

So I don't think jstack subverts the bug in a general sense, or even in most cases. I think it just helps things that are lucky enough to involve the use of redundant (as in more than required) wakeups. Things that are more precise (like semaphores built on top of futex, or wait/notify without notifyAll) probably won't be helped.

Sent from Gil's iPhone

Vitaly Davidovich

unread,
May 14, 2015, 2:01:55 PM5/14/15
to mechanica...@googlegroups.com
Right, hence I was curious about Cosmin's statement that jstack seemed to help; AFAIK it won't issue any wakeups, so that didn't quite make sense to me.

So reading a bit more about this bug, it seems like some folks experienced this on Power and Arm machines as well, both with weaker memory models than x86.  Would be interesting to know what exactly about the flavor of Haswell chips helps exacerbate this; Haswell has a larger out of order window than predecessors, and also revamped coherence for TSX support, but that's all very high level and imprecise.

Cosmin Lehene

unread,
May 14, 2015, 2:25:28 PM5/14/15
to mechanica...@googlegroups.com
jstack (with -F), jhat, strace, gdb will attach and then detach from the process. Upon detach the process wakes up, invariably.
Cosmin

Adrian Muraru

unread,
May 14, 2015, 3:05:23 PM5/14/15
to mechanica...@googlegroups.com

Gil,


RHEL 7 (and CentOS 7, and SL 7): 7.1 is BAD

AFAIK CentOS 7.1 ships kernel 3.10, which, based on your initial message, is not impacted. Right?

-adrian


Vitaly Davidovich

unread,
May 14, 2015, 3:18:03 PM5/14/15
to mechanica...@googlegroups.com
Ah, jstack with -F, ok, makes sense.


Michael Barker

unread,
May 14, 2015, 4:41:32 PM5/14/15
to mechanica...@googlegroups.com
BTW, this bug is also fixed in kernel 3.14.23.

Gil Tene

unread,
May 14, 2015, 6:51:19 PM5/14/15
to <mechanical-sympathy@googlegroups.com>
The upstream 3.10 didn't have the bug. But RHEL 7's version is different from the pure upstream version.

Unfortunately, RHEL 7.1 (much like RHEL 6.6) backported the change that included the bug. But unlike 6.6.z, there is no backport of the fix to that bug yet for RHEL 7... (This is according to one of our engineers who checked the actual futex.c code.)

I expect that some other distros may have also done the same...

Sent from Gil's iPhone

Gil Tene

unread,
May 14, 2015, 6:52:38 PM5/14/15