| Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Gil Tene |
13/05/15 15:37 |
We had this one bite us hard and scare the %$^! out of us, so I figured I'd share the fear...
The linux futex_wait call has been broken for about a year (in upstream since 3.14, around Jan 2014), and has just recently been fixed (in upstream 3.18, around October 2014). More importantly, this breakage seems to have been backported into major distros (e.g. into RHEL 6.6 and its cousins, released in October 2014), and the fix for it has only recently been backported (e.g. RHEL 6.6.z and cousins have the fix).
The impact of this kernel bug is very simple: user processes can deadlock and hang in seemingly impossible situations. A futex wait call (and anything using a futex wait) can stay blocked forever, even though it had been properly woken up by someone. Thread.park() in Java may stay parked. Etc. If you are lucky you may also find soft lockup messages in your dmesg logs. If you are not that lucky (like us, for example), you'll spend a couple of months of someone's time trying to find the fault in your code, when there is nothing there to find.
This behavior seems to regularly appear in the wild on Haswell servers (all the machines where we have had customers hit it, in the field and in labs, have been Haswells), and since Haswell servers are basically what you get if you buy a new machine now, or run on the cool new Amazon EC2/GCE/Azure stuff, you are bound to experience some interesting behavior. I don't know of anyone that will see this as a good thing for production systems. Except for maybe Netflix (maybe we should call this the linux fumonkey).
The commit explanation says that it fixes https://github.com/torvalds/linux/commit/b0c29f79ecea0b6fbcefc999e70f2843ae8306db (presumably the bug introduced with that change), which was made in Jan of 2014 into 3.14. That 3.14 code added logic to avoid taking a lock if the code knows that there are no waiters. It documents (pretty elaborately) how "…thus preventing tasks sleeping forever if wakers don't acknowledge all possible waiters" with logic that explains how memory barriers guarantee the correct order (see paragraph at line 141), which includes the statement "this is done by the barriers in get_futex_key_refs(), through either ihold or atomic_inc, depending on the futex type." (This assumption is the actual bug.) The assumption is further reinforced by the fact that the change added a comment to every call to get_futex_key_refs() in the code that says "/* implies MB (B) */".
The problem was that get_futex_key_refs() did NOT imply a memory barrier. It only included a memory barrier for two explicit cases in a switch statement that checks the futex type, but did not have a default case handler, and therefore did not apply a memory barrier for other futex types. Like private futexes. Which are a very commonly used type of futex.
The fix is simple: an added default case for the switch that just has an explicit smp_mb() in it. There was a missing memory barrier in the wakeup path, and now (hopefully) it's not missing any more...
So let's be clear: RHEL 6.6 (and CentOS 6.6, and Scientific Linux 6.6) are certainly broken on Haswell servers. It is likely that recent versions of other distros are too (SLES, Ubuntu, Debian, Oracle Linux, etc.). The good news is that fixes are out there (including 6.6.z). But the bad news is that there is not much chatter saying "if you have a Haswell system, get to version X now". For some reason, people seem to not have noticed this or raised the alarm. We certainly haven't seen much "INSTALL PATCHES NOW" fear mongering. And we really need it, so I'm hoping this posting will start a panic.
Bottom line: the bug is very real, but it probably only appeared in the 3.14 upstream version (and distro versions that had backported https://github.com/torvalds/linux/commit/b0c29f79ecea0b6fbcefc999e70f2843ae8306db , presumably after Jan 2014). The bug was fixed in 3.18 in October 2014, but backports probably took a while (and some may still be pending). I know for a fact that RHEL 6.6.z has the fix. I don't know about other distro families and versions (yet), but if someone else does, please post (including when it was broken and when it was fixed).
Note: I would like to profusely thank @aplokhotnyuk. His tweet originally alerted me to the bug's existence, and started us down the path of figuring out the what/why/where/when behind it. Why this is not being shouted in the streets is a mystery to me, and scary in its own right. We were lucky enough that I had a "that looks suspiciously familiar" moment when I read that tweet, and that I put 3.14 and 1.618 together and thought enough to ask "Umm... have we only been seeing this bug on Haswell servers?".
Without @aplokhotnyuk's tweet we'd probably still be searching for the nonexistent bugs in our own locking code... And since the tweet originated from another discussion on this group, it presents a rare "posting and reading twitter actually helps us solve bugs sometimes" example.
|
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Vitaly Davidovich |
13/05/15 15:48 |
That's nasty! Thanks for sharing.
sent from my phone
--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
|
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Cosmin Lehene |
13/05/15 18:33 |
We've been hunting this for a while now. In our case it reproduces only on 10-core Haswells, which differ from the 8-core parts (dual vs. single ring bus, and more cache coherency options); it's probably a matter of probability.
Attaching (and then detaching) a debugger fixes the stuck state, so using jstack and similar tools makes the hang disappear. That made it tricky to diagnose (though it's a good workaround), so we used core dumps to figure out what was going on, which led us to suspect glibc or the kernel. We also weren't able to reproduce while running systemtap probes.
Pinning the JVM to a single CPU drastically reduces the probability of occurrence (from a few times a day to once in weeks), so I'm guessing latency distributions may have an effect.
We're seeing it mostly during GC pauses, in multiple places, but most often when parking.
For example:
(gdb) bt
#0  0x0000003593e0e264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003593e09508 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x0000003593e093d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007fe87a42a50d in os::PlatformEvent::park() () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#4  0x00007fe87a3f10e8 in Monitor::ILock(Thread*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#5  0x00007fe87a3f132f in Monitor::lock_without_safepoint_check() () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#6  0x00007fe87a15a7bf in G1HotCardCache::insert(signed char*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#7  0x00007fe87a15db03 in G1RemSet::refine_card(signed char*, int, bool) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#8  0x00007fe87a143dc8 in RefineCardTableEntryClosure::do_card_ptr(signed char*, int) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#9  0x00007fe87a0feb9f in DirtyCardQueueSet::apply_closure_to_completed_buffer_helper(CardTableEntryClosure*, int, BufferNode*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#10 0x00007fe87a0fed8d in DirtyCardQueueSet::apply_closure_to_completed_buffer(int, int, bool) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#11 0x00007fe87a0683a4 in ConcurrentG1RefineThread::run() () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#12 0x00007fe87a430ca8 in java_start(Thread*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#13 0x0000003593e079d1 in start_thread () from /lib64/libpthread.so.0
#14 0x0000003593ae88fd in clone () from /lib64/libc.so.6
It's still unclear to me why a futex would end up being used here, and there are open questions about what we see in the registers (who's on the other side), but we didn't get time to investigate every detail. There has been quite some benchmarking done on Haswells comparing Java 7 and 8 (quite impressive, actually); I wonder how they haven't stumbled into this. Either way, debugging this has been quite fun, and perhaps we'll write about the adventure in more detail :)
Cosmin |
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Gil Tene |
13/05/15 18:44 |
Kernel futex_wait() calls end up being at the core of almost any user-land synchronization primitive these days. Whether it's posix stuff (like mutexes and semaphores) or direct use of futexes. And all JVM synchronization (synchronized, Lock, park/unpark), as well as all internal JVM threads like GC and compiler work, ends up with a waiting futex at some point.
> There has been quite some benchmarking done on haswells comparing Java 7 and 8 (quite impressive actually). I wonder how they haven't stumbled into this.
I think this benchmarking was done before the bug appeared in most distros. E.g. RHEL 6.6 (and CentOS and SL) only got the bug last October...
> Either way, debugging this has been quite fun and perhaps we'll write about the adventure in more detail :)
Have you moved to 6.6.z? (Or, if not on RHEL or a RHEL-like, to a latest kernel of some sort?)
|
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Cosmin Lehene |
13/05/15 19:19 |
Kernel futex_wait() calls end up being at the core of almost any user-land synchronization primitive these days. Whether it's posix stuff (like mutexes and semaphores) or direct use of futures. And all JVM synchronization including synchronized, Lock, park/unpark, as well as all internal JVM threads, like GC and compiler stuff all end up with a waiting futex at some point.
Yes, but why not a private one, as long as only the JVM process is involved? I'm likely missing something; I was going to look closer at the Java mutex path along with glibc to understand why.
Have you moved to 6.6.z? (or if not on a RHEL or RHEL-like, a latest kernel of some sort?).
Not yet. Like I said, we managed with workarounds (e.g. by pinning, and by automatically unsticking it: running something that attaches and detaches when the behavior is detected). I didn't know about the kernel bug until I saw this (I was suspecting glibc/kernel but only looked through glibc code and bugs...). We'll likely update the kernel soon. |
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Vitaly Davidovich |
13/05/15 19:42 |
Private futures are exactly one of the types affected, according to that changelog.
sent from my phone |
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Vitaly Davidovich |
13/05/15 19:44 |
Private *futexes*, damn autocorrect.
sent from my phone |
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Cosmin Lehene |
13/05/15 19:54 |
I guess the question should be: why isn't it always a private futex?
And yes, we saw it with private as well...
Thread 10 (Thread 0x7fa458ade700 (LWP 4482)):
#0  0x00000037016f805e in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x000000370167d16b in _L_lock_9503 () from /lib64/libc.so.6
#2  0x000000370167a6a6 in malloc () from /lib64/libc.so.6
#3  0x00007fa45a52ed29 in os::malloc(unsigned long, unsigned short, unsigned char*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#4  0x00007fa459fb66b3 in ChunkPool::allocate(unsigned long, AllocFailStrategy::AllocFailEnum) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#5  0x00007fa459fb62d1 in Arena::grow(unsigned long, AllocFailStrategy::AllocFailEnum) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#6  0x00007fa45a145cc0 in CompactibleFreeListSpace::new_dcto_cl(OopClosure*, CardTableModRefBS::PrecisionStyle, HeapWord*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#7  0x00007fa45a54ce6d in CardTableModRefBS::process_stride(Space*, MemRegion, int, int, OopsInGenClosure*, CardTableRS*, signed char**, unsigned long, unsigned long) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#8  0x00007fa45a54d040 in CardTableModRefBS::non_clean_card_iterate_parallel_work(Space*, MemRegion, OopsInGenClosure*, CardTableRS*, int) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#9  0x00007fa45a0d4e08 in CardTableModRefBS::non_clean_card_iterate_possibly_parallel(Space*, MemRegion, OopsInGenClosure*, CardTableRS*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#10 0x00007fa45a0d6a0e in CardTableRS::younger_refs_in_space_iterate(Space*, OopsInGenClosure*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#11 0x00007fa45a1823fe in ConcurrentMarkSweepGeneration::younger_refs_iterate(OopsInGenClosure*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#12 0x00007fa45a5c98aa in SharedHeap::process_strong_roots(bool, bool, SharedHeap::ScanningOption, OopClosure*, CodeBlobClosure*, OopsInGenClosure*, bool) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#13 0x00007fa45a27ef8c in GenCollectedHeap::gen_process_strong_roots(int, bool, bool, bool, SharedHeap::ScanningOption, OopsInGenClosure*, bool, OopsInGenClosure*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#14 0x00007fa45a551e4f in ParNewGenTask::work(unsigned int) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#15 0x00007fa45a6cf0cf in GangWorker::loop() () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#16 0x00007fa45a537ca8 in java_start(Thread*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#17 0x0000003701a079d1 in start_thread () from /lib64/libpthread.so.0
#18 0x00000037016e88fd in clone () from /lib64/libc.so.6
|
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Vitaly Davidovich |
13/05/15 21:05 |
Those two traces look like they're coming from different code paths (malloc vs pthread_mutex) so I'm not sure if lll_lock_wait means it's not private. Looking at the kernel change, only private futexes weren't covered by a barrier in the broken version.
sent from my phone |
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Gil Tene |
13/05/15 22:18 |
Regarding the two different traces you included in the two different posts: The difference between __lll_lock_wait_private(int *futex) and __lll_lock_wait(int *futex, int private) is that the first is always private, while the second is optionally private depending on what the caller says.
[The following is an educated guess about the difference, not based on tracing the actual code all the way...:]
- malloc() is always private to the process involved, so it is using its own private futex for its own purposes. You end up with a direct __lll_lock_wait_private() call.
- In contrast, pthread_mutex_lock() is performing a lock operation on behalf of the caller. The caller may very well want to do a locking or unlocking operation on a futex shared across processes (via a shared word in a shared mmap'ed region). pthread_mutex_lock() doesn't know which it is (private or not); it is told by its caller. Normally (and probably by default) the caller would indicate a private futex, but from the API point of view it *could* be different... So you end up with a call to __lll_lock_wait(int *futex, int private)
-- Gil. |
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Marcin Sobieszczanski |
14/05/15 03:24 |
> More importantly this breakage seems to have
> been back ported into major distros (e.g. into RHEL 6.6 and its cousins,
> released in October 2014), and the fix for it has only recently been back
> ported (e.g. RHEL 6.6.z and cousins have the fix).
According to the ChangeLogs attached to rpms, it looks like the kernel
series used in RHEL 6.6 (kernel-2.6.32-504) were affected from the
start of 6.6 release. It has been fixed only recently in
kernel-2.6.32-504.16.2.el6 update (21 April,
https://rhn.redhat.com/errata/RHSA-2015-0864.html)
rpm -qp --changelog kernel-2.6.32-504.16.2.el6.x86_64.rpm | grep
'Ensure get_futex_key_refs() always implies a barrier'
|
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Gil Tene |
14/05/15 05:35 |
Thanks for identifying the date and change for the fix in RHEL 6 (April 21, 2015). Strange that the errata makes no mention of it, or its impact (it's a security advisory).
So far, here is what I know when it comes to distro release numbers for RHEL and its main cousins. (At least in that world, most admins view things in terms of distro version rather than kernel version; you can look up the associated kernel versions.) I'm using good/BAD to mean ("does not have the bug" / "HAS THE MISSING BARRIER BUG"):
RHEL 6 (and CentOS 6, and SL 6): 6.0-6.5 are good. 6.6 is BAD. 6.6.z is good.
RHEL 7 (and CentOS 7, and SL 7): 7.1 is BAD. As of yesterday, there does not yet appear to be a 7.x fix.
RHEL 5 (and CentOS 5, and SL 5): All versions are good (including 5.11).
Sadly, 6.6 seems to be a very (the most?) popular version we run into. And very few have moved to 6.6.z.
I/we are looking for more info to identify the specific versions affected in other distros (Ubuntu 12.04 LTS and 14.04 LTS, SLES 12 and 11, Amazon Linux, Oracle Linux, etc.). So far we've verified that SLES12 kernel versions 3.12.32-33.1 & above have the fix (but not what versions have the bug), and that Amazon Linux kernel versions 3.14.35-28.38 & above have the fix (but not which versions have the bug).
I will post here as I have more. If you find more info useful for identifying releases that have the bug, and ones that have fixed it, please do the same. |
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Alex Bagehot |
14/05/15 06:01 |
|
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Vitaly Davidovich |
14/05/15 06:45 |
How do jstack and the like subvert the problem? Do they cause the thread to be woken up (from bogus sleep) and observe consistent state at that point?
sent from my phone |
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Gil Tene |
14/05/15 10:50 |
The bug manifests in a lost wake up. Things that happen to do redundant wake ups (e.g. notifyAll for condition variables or Java synchronized objects, as well as some other ways) might accidentally release some threads from their sleeping beauty state.
But only if they are lucky.
So I don't think jstack subverts the bug in a general sense, or even in most cases. I think it just helps things that are lucky enough to involve the use of redundant (as in more than required) wakeups. Things that are more precise (like semaphores built on top of futex, or wait/notify without notifyAll) probably won't be helped.
Sent from Gil's iPhone
|
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Vitaly Davidovich |
14/05/15 11:01 |
Right, hence I was curious about Cosmin's statement that jstack seemed to help; AFAIK it won't issue any wakeups, so that didn't quite make sense to me.
So reading a bit more about this bug, it seems like some folks experienced this on Power and ARM machines as well, both with weaker memory models than x86. It would be interesting to know what exactly about the flavor of Haswell chips helps exacerbate this; Haswell has a larger out-of-order window than its predecessors, and also revamped coherence for TSX support, but that's all very high level and imprecise. |
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Cosmin Lehene |
14/05/15 11:25 |
jstack (with -F), jhat, strace, gdb will attach and then detach from the process. Upon detach the process wakes up, invariably. |
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Adrian Muraru |
14/05/15 12:05 |
Gil,
RHEL 7 (and CentOS 7, and SL 7): 7.1 is BAD
AFAIK CentOS 7.1 is shipping kernel 3.10, which based on your initial message is not impacted. Right?
-adrian
|
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Vitaly Davidovich |
14/05/15 12:18 |
Ah, jstack with -F, ok, makes sense. |
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
mikeb01 |
14/05/15 13:41 |
BTW, this bug is also fixed in kernel 3.14.23. |
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Gil Tene |
14/05/15 15:51 |
The upstream 3.10 didn't have the bug. But RHEL 7's version is different from the pure upstream version.
Unfortunately, RHEL 7.1 (much like RHEL 6.6) backported the change that included the bug. But unlike 6.6.z, there is no backport of the fix to that bug yet for RHEL 7... (This is according to one of our engineers that checked the actual futex.c code).
I expect that some other distros may have also done the same...
Sent from Gil's iPhone
|
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Gil Tene |
14/05/15 15:52 |
Maybe the attach (or the detach) forces the thread to re-evaluate the sleep, and it wakes up as a result.
Sent from Gil's iPhone
|
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Todd Lipcon |
14/05/15 15:54 |
My guess is that these tools are causing a signal to get sent to the affected threads (perhaps as part of the JVM coming to a safepoint or something?). If a thread is blocked in a syscall, and gets a signal, it will usually make that syscall return with EINTR. This is the case with futex(). So, if it missed a wakeup, but you send a signal, you'll recover.
-Todd |
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Vitaly Davidovich |
14/05/15 15:54 |
My guess is it's the actual signals sent (as part of attach/detach) to the process that cause it to become runnable again in the kernel. |
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Paul Blair |
15/05/15 07:00 |
I'm seeing a similar problem on kernel 3.19.0-16-generic after having upgraded Ubuntu to 15.04 (Vivid Vervet). However, checking the github link above indicates that the fix should be in 3.19, so I'm not sure if this is a regression or something else.
I've documented what I'm seeing here: http://ubuntuforums.org/showthread.php?t=2278238&p=13285416#post13285416 (I'm not really familiar with Linux debugging, though, so I'm not sure if I'm reporting the right information.)
Basically, certain Java applications are regularly hanging, and when I run strace I see them in FUTEX_WAIT. So far it's affecting only Java, but this includes previous versions of Java that have been working for a long time.
|
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Alex Bagehot |
15/05/15 09:21 |
Hi Paul, Did you run strace -F to get the child pids? Otherwise it'll only print the parent process waiting.
# strace -p 14603
Process 14603 attached - interrupt to quit
futex(0x7f5c8e6019d0, FUTEX_WAIT, 14604, NULL ^C <unfinished ...>
Process 14603 detached
-> no more output
# strace -F -p 14603
Process 14603 attached with 8 threads - interrupt to quit
-> + output from all the threads in the process.
As far as I can tell, your bug appears as 100% CPU on one core. This bug is more characterised by less CPU burn, as threads sleep when they should be woken ("…thus preventing tasks sleeping forever if wakers don't acknowledge all possible waiters"), so I would assume that your issue is different (as a starting point) and try to gather more evidence to prove it either way. Run:
sudo perf record -F 99 -ag -p <pid> -- sleep 10
sudo perf script
to get some more info on the stacks associated with the CPU burn. The stacks may be corrupted by the JVM, preventing perf from reporting something useful with -g, so YMMV; fixing that requires a custom build of OpenJDK... Thanks, Alex
On Fri, May 15, 2015 at 3:00 PM, Paul Blair <psfb...@gmail.com> wrote:
On Thursday, May 14, 2015 at 8:35:49 AM UTC-4, Gil Tene wrote: Thanks for identifying the date and change for the fix in RHEL 6 (April 21, 2015). Strange that the errata makes no mention of it, or its impact (it's a security advisory).
So far, here is what I know when it comes to distro release numbers for RHEL and its main cousins (at least in that world, most admins view things in terms of distro version rather than kernel version; you can look up associated kernel versions). I'm using good/BAD to mean ("does not have the bug" / "HAS THE MISSING BARRIER BUG"):
RHEL 6 (and CentOS 6, and SL 6): 6.0-6.5 are good. 6.6 is BAD. 6.6.z is good.
RHEL 7 (and CentOS 7, and SL 7): 7.1 is BAD. As of yesterday, there does not yet appear to be a 7.x fix.
RHEL 5 (and CentOS 5, and SL 5): All versions are good (including 5.11).
Sadly, 6.6 seems to be a very (the most?) popular version we run into. And very few have moved to 6.6.z.
I/we are looking for more info to identify the specific versions affected in other distros (Ubuntu 12.04 LTS and 14.04 LTS, SLES 12 and 11, Amazon Linux, Oracle Linux, etc.). So far we've verified that SLES12 kernel versions 3.12.32-33.1 & above have the fix (but not what versions have the bug), and that Amazon Linux kernel versions 3.14.35-28.38 & above have the fix (but not which versions have the bug).
I will post here as I have more, if you find more info useful to identifying releases that have the bug and ones that have fixed it, please do the same. On Thursday, May 14, 2015 at 5:24:33 AM UTC-5, Marcin Sobieszczanski wrote: > More importantly this breakage seems to have
> been back ported into major distros (e.g. into RHEL 6.6 and its cousins,
> released in October 2014), and the fix for it has only recently been back
> ported (e.g. RHEL 6.6.z and cousins have the fix).
According to the ChangeLogs attached to rpms, it looks like the kernel
series used in RHEL 6.6 (kernel-2.6.32-504) were affected from the
start of 6.6 release. It has been fixed only recently in
kernel-2.6.32-504.16.2.el6 update (21 April,
https://rhn.redhat.com/errata/RHSA-2015-0864.html)
rpm -qp --changelog kernel-2.6.32-504.16.2.el6.x86_64.rpm | grep 'Ensure get_futex_key_refs() always implies a barrier'
|
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Paul Blair |
15/05/15 11:51 |
Thanks for the pointers, Alex. Unfortunately neither one of these techniques worked for me. strace with -F perturbs the JVM enough that the process does not hang. On the system monitor I see activity on multiple CPUs and nothing hits 100%. The perf record command never returns, probably for the reasons you say.
I do have a Java crash dump from when I left the process alone for a few hours and it finally crashed. I'm going to submit that to JetBrains and see if they can shed more light.
If I find I need more help on a Linux system level, is this an appropriate forum? |
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Erik Gorset |
19/05/15 07:23 |
I’ve reproduced similar symptoms on haswell running Ubuntu 3.18.13-031813-generic #201505062159 SMP Wed May 6 22:00:44 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
One java process was locked up with one core burning 100% of kernel time, with the following strace result:
[pid 423743] futex(0x7f80f0002354, FUTEX_WAIT_PRIVATE, 449, NULL <unfinished ...>
[pid 422607] futex(0x7f842c002454, FUTEX_WAIT_PRIVATE, 925, NULL <unfinished ...>
[pid 422602] futex(0x7ff6b058a854, FUTEX_WAIT_PRIVATE, 391, NULL <unfinished ...>
…
The symptoms went away after setting kernel.numa_balancing to 0, which also gave us considerably better performance overall. Looks like running a big process with NUMA interleaving on a big server with numa_balancing enabled is a very bad idea (and it is unfortunately the default).
We’re still investigating another (possibly related) matter, where we see inconsistent results when using Java8 on haswell. We don’t see the problem using Java7 on haswell, or when using Java8 on ivy bridge (which is what we are running on most of our servers). Most likely it’s a bug in our software, but given that we don’t see this problem on 80+ servers under high load unless we use Java8 on haswell, I’m starting to wonder..
— Erik Cysneiros Gorset |
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
manis...@gmail.com |
19/05/15 17:16 |
I bumped into this error a couple of months back when using CentOS 6.6 on a 32-core Dell server. After many days of debugging, I realized it to be a CentOS 6.6 bug and moved back to 6.5, and since then no such issues have been seen. I am able to reproduce this issue within 15 minutes of heavy load on my multi-threaded C code. On Wednesday, May 13, 2015 at 3:37:32 PM UTC-7, Gil Tene wrote: We had this one bite us hard and scare the %$^! out of us, so I figured I'd share the fear...
The linux futex_wait call has been broken for about a year (in upstream since 3.14, around Jan 2014), and has just recently been fixed (in upstream 3.18, around October 2014). More importantly this breakage seems to have been back ported into major distros (e.g. into RHEL 6.6 and its cousins, released in October 2014), and the fix for it has only recently been back ported (e.g. RHEL 6.6.z and cousins have the fix).
The impact of this kernel bug is very simple: user processes can deadlock and hang in seemingly impossible situations. A futex wait call (and anything using a futex wait) can stay blocked forever, even though it had been properly woken up by someone. Thread.park() in Java may stay parked. Etc. If you are lucky you may also find soft lockup messages in your dmesg logs. If you are not that lucky (like us, for example), you'll spend a couple of months of someone's time trying to find the fault in your code, when there is nothing there to find.
This behavior seems to regularly appear in the wild on Haswell servers (all the machines where we have had customers hit it in the field and in labs been Haswells), and since Haswell servers are basically what you get if you buy a new machine now, or run on the cool new amazon EC2/GCE/Azure stuff, you are bound to experience some interesting behavior. I don't know of anyone that will see this as a good thing for production systems. Except for maybe Netflix (maybe we should call this the linux fumonkey).
The commit explanation says that it fixes https://github.com/torvalds/linux/commit/b0c29f79ecea0b6fbcefc999e70f2843ae8306db (presumably the bug introduced with that change), which was made in Jan of 2014 into 3.14. That 3.14 code added logic to avoid taking a lock if the code knows that there are no waiters. It documents (pretty elaborately) how "…thus preventing tasks sleeping forever if wakers don't acknowledge all possible waiters" with logic that explains how memory barriers guarantee the correct order (see paragraph at line 141), which includes the statement "this is done by the barriers in get_futex_key_refs(), through either ihold or atomic_inc, depending on the futex type." (this assumption is the actual bug). The assumption is further reinforced by the fact that the change added a comment to every call to get_futex_key_refs() in the code that says "/* implies MB (B) */".
The problem was that get_futex_key_refs() did NOT imply a memory barrier. It only included a memory barrier for two explicit cases in a switch statement that checks the futex type, but did not have a default case handler, and therefore did not apply a memory barrier for other futex types. Like private futexes. Which are a very commonly used type of futex.
The fix is simple, an added default case for the switch that just has an explicit smp_mb() in it. There was a missing memory barrier in the wakeup path, and now (hopefully) it's not missing any more...
So let's be clear: RHEL 6.6 (and CentOS 6.6, and Scientific Linux 6.6) are certainly broken on Haswell servers. It is likely that recent versions of other distros are too (SLES, Ubuntu, Debian, Oracle Linux, etc.). The good news is that fixes are out there (including 6.6.z). But the bad news is that there is not much chatter saying "if you have a Haswell system, get to version X now". For some reason, people seem to not have noticed this or raised the alarm. We certainly haven't seen much "INSTALL PATCHES NOW" fear mongering. And we really need it, so I'm hoping this posting will start a panic.
Bottom line: the bug is very real, but it probably only appeared in the 3.14 upstream version (and distro versions that had backported https://github.com/torvalds/linux/commit/b0c29f79ecea0b6fbcefc999e70f2843ae8306db , presumably after Jan 2014). The bug was fixed in 3.18 in October 2014, but backports probably took a while (and some may still be pending). I know for a fact that RHEL 6.6.z has the fix. I don't know about other distro families and versions (yet), but if someone else does, please post (including when it was broken, and when it was fixed).
Note: I would like to profusely thank @aplokhotnyuk. His tweet originally alerted me to the bug's existence, and started us down the path of figuring out the what/why/where/when behind it. Why this is not being shouted in the streets is a mystery to me, and scary in its own right. We were lucky enough that I had a "that looks suspiciously familiar" moment when I read that tweet, and that I put 3.14 and 1.618 together and thought enough to ask "Umm... have we only been seeing this bug on Haswell servers?".
Without @aplokhotnyuk's tweet we'd probably still be searching for the nonexistent bugs in our own locking code... And since the tweet originated from another discussion on this group, it presents a rare "posting and reading twitter actually helps us solve bugs sometimes" example.
|
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Joe R |
20/05/15 08:09 |
Did anyone come up with a test case showing the bug? |
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Greg Senia |
20/05/15 19:44 |
I spent a solid week debugging this with the help of some IBM Java/kernel performance folks, going through all the kernel patches that came in with RHEL 6. It took one solid week back in mid-March to get Red Hat to acknowledge the issue, which could affect PPC, x86, anyone using futexes. It occurred regularly with Hadoop YARN NodeManager and Tez jobs. It also regularly occurred with WebSphere JVMs on RHEL 6.6 PPC64. We never came up with a definitive way to repro the problem at will. Jstack, strace, gdb, gcore, and kill -3 with IBM JDKs would unblock the stuck threads. Red Hat also confirmed this could only happen on a machine with more than one CPU, as it had to do with the thread being put to sleep on one CPU but missing the notify to wake it up... |
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Greg Senia |
20/05/15 20:00 |
And to truly verify the problem we had to kernel dump the systems and work with Redhat to analyze the vmcore.... If anyone wants further info feel free to let me know..
Both of these situations applied in my case: |
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Kevin Burton |
21/05/15 19:06 |
The kernel (and glibc) needs more continuous integration. Bugs like these are hard to find, but we consistently find race conditions in our code because, months later, an integration run will fail.
There's a bug in glibc that's been around for 2-5 years now. It supports a 'rotate' option in the resolver so you can load balance DNS requests.
But it's completely broken - and has been for YEARS now.
A continuous integration system would fix this easily.
Of course, one problem is the tight coupling of the kernel. I'm sure this code isn't very testable... :-/
|
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
bhu...@gmail.com |
26/05/15 23:05 |
I have come across a similar situation with kernel 3.0.101, SLES 11. Based on the details seen so far, this kernel version doesn't seem to be affected.
The server is non-Haswell (Intel(R) Xeon(R) CPU L5638 @ 2.00GHz, 12 cores). The application hangs with 100% CPU. The same code on an older kernel runs fine.
Thread 3 (Thread 0x7f22ab35c700 (LWP 19575)):
#0 0x00007f22b3df52d4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007f22b3df0659 in _L_lock_1008 () from /lib64/libpthread.so.0
#2 0x00007f22b3df046e in pthread_mutex_lock () from /lib64/libpthread.so.0
I observe this with 85 threads out of 102.
Could someone suggest test code to simulate this error?
Regards Bhupinder
|
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Gil Tene |
27/05/15 08:19 |
To be clear: the bug is not limited to kernels labeled "3.14" through "3.18". It appears in several production kernels with "earlier" labels that are parts of various distros, and got there through backporting efforts. E.g. RHEL 6.6 uses 2.6.32.xxxxx kernels, and RHEL 7.1 uses 3.10.0.xxx kernels, and both of those have had the bug introduced through backporting of changes from 3.14. "3.0.101 SLES 11" is not the same as 3.0.101. My first suspicion would be that this SLES kernel has had the bug backported to it. E.g. we originally ran into the bug in RHEL 6.6 with a 2.6.32.504 kernel that included a backport of the buggy update. The fix backport appeared in the 2.6.32.504.16.2 kernel in RHEL 6.6.z.
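For reference, this is roughly the shape of what to look for in kernel/futex.c (paraphrased from the upstream commits; a kernel-internal fragment, not compilable on its own). The buggy version had barriers only on the two shared-futex cases of the switch; the fix adds a default case with an explicit smp_mb() so private futexes get a barrier too:

```c
static void get_futex_key_refs(union futex_key *key)
{
	if (!key->both.ptr)
		return;

	switch (key->both.offset & (FUT_OFF_INODE | FUT_OFF_MMSHARED)) {
	case FUT_OFF_INODE:
		ihold(key->shared.inode);	/* implies MB (B) */
		break;
	case FUT_OFF_MMSHARED:
		futex_get_mm(key);		/* implies MB (B) */
		break;
	default:
		/* Before the fix, private futexes fell through here
		 * with no barrier at all. */
		smp_mb();			/* explicit MB (B) */
	}
}
```

If your distro kernel's get_futex_key_refs() has the "/* implies MB (B) */" comments but no default: smp_mb() case, the buggy change was backported without the fix.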
You should get the specific sources for your 3.0.101 SLES 11 kernel and look at kernel/futex.c. Specifically, you want to see if the 3.14 update (in https://github.com/torvalds/linux/commit/b0c29f79ecea0b6fbcefc999e70f2843ae8306db ) which introduced the bug was backported to it, but the 3.18 fix to that (in https://github.com/torvalds/linux/commit/76835b0ebf8a7fe85beb03c75121419a7dec52f0 ) hadn't been applied on top of that. [See more detailed discussion in the original post on this thread]. |
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
bhu...@gmail.com |
27/05/15 10:40 |
Thanks Gil for the details.
However in this case the bug turned out to be in our code.
Apart from other factors there was considerably higher load on this new target system. So I guess as a result of it the deadlocks were happening.
We have a patch deployed and would be monitoring to see if we still encounter it.
I will still go ahead and review the futex code as you suggested to see if it is affected or not. Perhaps it could help others.
|
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Adrian Muraru |
28/05/15 10:04 |
|
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Jack Bradach |
03/06/15 14:30 |
I'm seeing the same thing on the same version of Ubuntu with a Haswell when running Jetbrains CLion (which runs in Java). It hangs on a futex and I have to kill the program to unlock it. Binding it to a single core seems to avoid the hang but kills performance. I checked Ubuntu's version of the 3.19 kernel source and the patch is in there.
It does seem to be somehow related to Java 8 vs Java 7. When I installed Oracle JDK 7 and forced CLion to run it, I no longer see the hang.
|
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Adrian Muraru |
24/06/15 22:48 |
RHEL/Centos: 3.10.0-229.7.2.el7 kernel is now including a fix for this bug. |
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Alen Vrečko |
25/06/15 02:12 |
While on the subject, meant to post a while back:
The bug occurred frequently on HP Gen 9 (Haswell) servers running OpenSuse 13.1. On HP Gen 8 running the "same" software the bug never happened.
o) Upgrading the kernel to 4.0.4 fixed the problem. As expected.
o) Upgrading just Java from 7u25 to 7u79 (without upgrading the kernel) also "fixed" the problem. This is very surprising. Didn't have the time to investigate further.
2015-06-25 7:48 GMT+02:00 Adrian Muraru <adi.m...@gmail.com>:
RHEL/Centos: 3.10.0-229.7.2.el7 kernel is now including a fix for this bug.
|
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Bill Kelso |
09/07/15 18:23 |
We are getting killed by this right now. We are running Oracle Linux, Red Hat release 6.6, kernel version 2.6.32-504.16.2.el6.x86_64. Supposedly the bug in this version was fixed, but it just happened again tonight (after not happening for two nights in a row).
Does it matter what Oracle client version you use with a particular kernel? I mean, I know it matters. But is there a known good combo of kernel and client? We are going nuts trying to track this down, not to mention actually fixing it. The two 'good' nights we had were on the Linux kernel version above and an Oracle 12c client (12.1.0.1). But then we started getting other errors (random "TNS connection closed" errors).
We've been working on this for more than a month. I've actually started working nights in order to prevent holdups to our data warehouse load. I just sit and watch for jobs that start to hang mysteriously. It's really getting old. Any advice as to where I can find info on this bug in Oracle Linux implementations? Thanks. bk
On Wednesday, May 13, 2015 at 4:37:32 PM UTC-6, Gil Tene wrote:We had this one bite us hard and scare the %$^! out of us, so I figured I'd share the fear...
The linux futex_wait call has been broken for about a year (in upstream since 3.14, around Jan 2014), and has just recently been fixed (in upstream 3.18, around October 2014). More importantly this breakage seems to have been back ported into major distros (e.g. into RHEL 6.6 and its cousins, released in October 2014), and the fix for it has only recently been back ported (e.g. RHEL 6.6.z and cousins have the fix).
The impact of this kernel bug is very simple: user processes can deadlock and hang in seemingly impossible situations. A futex wait call (and anything using a futex wait) can stay blocked forever, even though it had been properly woken up by someone. Thread.park() in Java may stay parked. Etc. If you are lucky you may also find soft lockup messages in your dmesg logs. If you are not that lucky (like us, for example), you'll spend a couple of months of someone's time trying to find the fault in your code, when there is nothing there to find.
This behavior seems to regularly appear in the wild on Haswell servers (all the machines where we have had customers hit it in the field and in labs been Haswells), and since Haswell servers are basically what you get if you buy a new machine now, or run on the cool new amazon EC2/GCE/Azure stuff, you are bound to experience some interesting behavior. I don't know of anyone that will see this as a good thing for production systems. Except for maybe Netflix (maybe we should call this the linux fumonkey).
The commit explanation says that it fixes https://github.com/torvalds/linux/commit/b0c29f79ecea0b6fbcefc999e70f2843ae8306db (presumably the bug introduced with that change), which was made in Jan of 2014into 3.14. That 3.14 code added logic to avoid taking a lock if the code knows that there are no waiters. It documents (pretty elaborately) how "…thus preventing tasks sleeping forever if wakers don't acknowledge all possible waiters" with logic that explains how memory barriers guarantee the correct order (see paragraph at line 141), which includes the statement "this is done by the barriers in get_futex_key_refs(), through either ihold or atomic_inc, depending on the futex type." (this assumption is the actual bug). The assumption is further reinforced in the fact that the change added a comment to every calls to get_futex_key_refs() in the code that says "/* implies MB (B) */".
The problem was that get_futex_key_refs() did NOT imply a memory barrier. It only included a memory barrier for two explicit cases in a switch statement that checks the futex type, but did not have a default case handler, and therefor did not apply a memory barrier for other fuxtex types. Like private futexes. Which are a very commonly used type of futex.
The fix is simple, an added default case for the switch that just has an explicit smp_mb() in it. There was a missing memory barrier in the wakeup path, and now (hopefully) it's not missing any more...
So let's be clear: RHEL 6.6 (and CentOS 6.6, and Scientific Linux 6.6) are certainly broken on Haswell servers. It is likely that recent versions of other distros are too (SLES, Ubuntu, Debian, Oracle Linux, etc.). The good news is that fixes are out there (including 6.6.z). But the bad news is that there is not much chatter saying "if you have a Haswell system, get to version X now". For some reason, people seem to not have noticed this or raised the alarm. We certainly haven't seen much "INSTALL PATCHES NOW" fear mongering. And we really need it, so I'm hoping this posting will start a panic.
Bottom line: the bug is very real, but it probably only appeared in the 3.14 upstream version (and distro versions that had backported https://github.com/torvalds/linux/commit/b0c29f79ecea0b6fbcefc999e70f2843ae8306db , presumably after Jan 2014). The bug was fixed in 3.18 in October 2014, but backports probably took a while (and some may still be pending). I know for a fact that RHEL 6.6.z has the fix. I don't know about other distro families and versions (yet), but if someone else does, please post (including when it was broken, and when it was fixed).
Note: I would like to profusely thank @aplokhotnyuk. His tweet originally alerted me to the bug's existence, and started us down the path of figuring out the what/why/where/when behind it. Why this is not being shouted in the streets is a mystery to me, and scary in its own right. We were lucky enough that I had a "that looks suspiciously familiar" moment when I read that tweet, and that I put 3.14 and 1.618 together and thought enough to ask "Umm... have we only been seeing this bug on Haswell servers?".
Without @aplokhotnyuk's tweet we'd probably still be searching for the nonexistent bugs in our own locking code... And since the tweet originated from another discussion on this group, it presents a rare "posting and reading twitter actually helps us solve bugs sometimes" example.
|
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Gil Tene |
09/07/15 19:56 |
On Thursday, July 9, 2015 at 6:23:58 PM UTC-7, Bill Kelso wrote: We are getting killed by this right now. We are running Oracle Linux, Red Hat rel. 6.6, kernel version 2.6.32-504.16.2.el6.x86_64. Supposedly the bug in this version was fixed, but it just happened again tonight (after not happening for two nights in a row).
If the kernel is really that RHEL version (2.6.32-504.16.2.el6.x86_64), I'm pretty sure the specific bug discussed here is fixed in that one. Maybe what you are running into is some other bug? Are you seeing processes hang in futex wait? |
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Bill Kelso |
09/07/15 21:02 |
I'm not sure. When I run a stack trace, the PID always refers to futex_. But the hang happens on an Oracle OCI call (I think that's what it is). All the threads look like this:
Thread 1 (Thread 0x2b8807f2d420 (LWP 9885)):
#0  0x000000347540b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00002b87fb24983c in conditionWait(pthread_cond_t*, SMutex*, unsigned int) () from /opt/informatica/9.6.1/server/bin/libpmcef.so
#2  0x00002b87fb24b2a7 in SThread::Sleep(unsigned int) () from /opt/informatica/9.6.1/server/bin/libpmcef.so
#3  0x00002b87fb249b3b in SEvent::putThreadToSleep(SThread*, unsigned int) () from /opt/informatica/9.6.1/server/bin/libpmcef.so
#4  0x00002b87fb24a4cc in msgque::get(int, TObject**, unsigned int) () from /opt/informatica/9.6.1/server/bin/libpmcef.so
#5  0x00002b87fb24a657 in SThread::readMessage(int, unsigned int, TObject**) () from /opt/informatica/9.6.1/server/bin/libpmcef.so
#6  0x00000000005f2798 in SDirectorImpl::getNextMessage() ()
#7  0x00000000005f4e3d in SDirectorImpl::doPETLOrchestration() ()
#8  0x00000000005f679e in SDirectorImpl::orchestrate() ()
#9  0x00000000005f6d21 in SDirectorImpl::run() ()
#10 0x00000000005faa6c in SDirectorRunnable::start() ()
#11 0x00000000005c01c3 in SExecutorDTM::start() ()
#12 0x00000000005dff57 in SPreparerDTMImpl::start() ()
#13 0x00000000005d774f in DTMMain(int, char const**) ()
#14 0x000000347501ed5d in __libc_start_main () from /lib64/libc.so.6
#15 0x00000000005b0a89 in _start ()
And then when I run the stack trace, the session 'wakes up' and starts sourcing data from Oracle.
The symptoms we are encountering sure sound like the futex_wait bug. And I agree the kernel version is identified elsewhere as a 'good' version. I suppose it could be something else. But how the heck do I figure that out?
Thanks for replying, by the way.
|
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Gil Tene |
09/07/15 21:23 |
On Thursday, July 9, 2015 at 9:02:37 PM UTC-7, Bill Kelso wrote: I'm not sure. When I run a stack trace, the PID always refers to futex_. But the hang happens on an Oracle OCI call (I think that's what it is). All the threads look like the trace quoted in my previous message. And then when I run the stack trace, the session 'wakes up' and starts sourcing data from Oracle.
The symptoms we are encountering sure sound like the futex_wait bug.
This sounds like a similar symptom. But other missed wakeups (including user-mode logic bugs that cause a missed wakeup via a real race, not just kernel issues) could cause the same behavior... And sometimes those "normal" user-mode missed-wakeup bugs can also be kicked back to life by something that interrupts the thread (like a stack trace).
Bill Kelso wrote: And I agree the kernel version is identified elsewhere as a 'good' version. I suppose it could be something else. But how the heck do I figure that out?
My inclination would be to suspect a "regular" missed wakeup in some user-mode code. A possible way to eliminate this specific kernel bug as a possible cause is to downgrade to a RHEL 6.5 kernel. The bug did not exist in RHEL 6.x versions before RHEL 6.6. If you downgrade and the behavior problems persist, you are looking at another bug... |
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Serguei Kolos |
17/08/15 02:45 |
Hi
Fantastic. Many thanks for sharing that info, which saved me several weeks of working time. I went as far as getting nasty GDB stack traces showing threads waiting on a non-locked mutex, but I didn't know how to dig further down.
Cheers, On Thursday, May 14, 2015 at 12:37:32 AM UTC+2, Gil Tene wrote: We had this one bite us hard and scare the %$^! out of us, so I figured I'd share the fear...
|
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Todd Lipcon |
30/10/15 11:18 |
Just to tag onto this old thread (because we ran into it on a new Haswell cluster last night)...
I did some digging in the CentOS/RHEL kernel changelog, and the fix shows up in version 2.6.32-504.14.1.el6. Hope that's useful for other folks determining if they're vulnerable.
-Todd
|
| Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Daniel Worthington-Bodart |
26/02/16 13:43 |
Sorry to be a necromancer, but I thought it was worth letting you all know that there is still what appears to be a related freeze for Java applications on recent Ubuntu versions when run on Haswell-E platforms.
I had this problem on a 5960X running Ubuntu 15.10, stock kernel 4.2.0-18, latest JDK jdk1.8.0_74. I can confirm that the cold boot fix works with the stock kernel.
The problem is also resolved using the very latest kernel 4.5.0-rc5 from mainline PPA
|
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Andriy Plokhotnyuk |
27/02/16 10:36 |
|
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Craig Yoshioka |
07/04/16 12:58 |
I believe I am also seeing this issue, or a related one. In my case it occurs when running an MPI C++ program over 400+ cores/processes. The program occasionally seems to get stuck at certain steps, especially when RAM use goes up. CPU use stays pegged at 100%, but most of it becomes system time. Running strace on a process shows a lot of sched_yield and futex calls. If I run strace on every process, on every node, it seems to kick the troublesome process out of its rut, and things resume like normal. I am running CentOS 6.7 with Linux kernel 2.6.32-504.30.3 |
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Trent Nelson |
08/04/16 07:53 |
|
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Craig Yoshioka |
08/04/16 07:59 |
Hi Trent,
Thanks for the suggestion. That was a problem I ran into a while back, but all the nodes now have THP disabled. This problem does have similar performance symptoms, but appears to have a different cause. |
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Tom Lee |
08/04/16 08:20 |
Hey Craig,
"perf top" would be my first port of call here to get an idea where all that system time is going.
Cheers,
Tom |
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Longchao Dong |
13/02/17 02:01 |
How do I reproduce this issue? Is it possible to show us the method? I am also working on one strange pthread_cond_wait issue, but I'm not sure if it's related to this one. |
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Allen Reese |
14/02/17 08:01 |
This bug report seems to have a way to reproduce it:
Hope that helps.
--Allen Reese
How to reproduce this issue ? Is it possible to show us the method ? I am also working on one strange pthread_cond_wait issue, but not sure if that one is related with this issue.
On Wednesday, May 20, 2015 at 8:16:12 AM UTC+8, manis...@gmail.com wrote: I bumped into this error a couple of months back when using CentOS 6.6 on a 32-core Dell server. After many days of debugging, I realized it was a CentOS 6.6 bug and moved back to 6.5, and since then no such issues have been seen. I am able to reproduce this issue within 15 minutes of heavy load on my multi-threaded C code.
On Wednesday, May 13, 2015 at 3:37:32 PM UTC-7, Gil Tene wrote: We had this one bite us hard and scare the %$^! out of us, so I figured I'd share the fear...
--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
|
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Will Foster |
15/02/17 06:33 |
On Tuesday, February 14, 2017 at 4:01:52 PM UTC, Allen Reese wrote: This bug report seems to have a way to reproduce it:
Hope that helps.
--Allen Reese
I also see this on the latest CentOS 7.3 with Logstash; I've disabled huge pages via transparent_hugepage=never in grub. Here's what I get from strace against logstash (it never fully comes up to listen on TCP/5044):
[root@host-01 ~]# strace -p 1292
Process 1292 attached
futex(0x7f80eff8a9d0, FUTEX_WAIT, 1312, NULL
I am hitting this issue on Logstash 5.2.1-1 while trying to upgrade my Ansible playbooks to the latest ES versions. |
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Gil Tene |
15/02/17 07:45 |
Don't know if this is the same bug. RHEL 7 kernels included fixes for this since some time in 2015.
While one of my first courses of action when I see a suspicious FUTEX_WAIT hang situation is still to check kernel versions to rule this out (since this bug has wasted a bunch of our time in the past), keep in mind that not all things stuck in FUTEX_WAIT are futex_wait kernel bugs. The most likely explanations are usually actual application logic bugs involving actual deadlock or starvation.
Does attaching and detaching from the process with gdb move it forward? [the original bug was missing the wakeup, and an attach/detach would "kick" the futex out of its slumber once] |
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Wojciech Kudla |
15/02/17 11:33 |
Just trying to eliminate the obvious: you should be stracing JVM threads by referring to their TIDs rather than the parent process PID. The parent process will pretty much always show as blocked on a futex.
Don't know if this is the same bug. RHEL 7 kernel included fixes for this since some time in 2015.
|
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Longchao Dong |
15/02/17 17:25 |
In fact, I met a very strange problem. My C++ program calls HDFS's interfaces via JNI, but the threads are all blocked on the same Java object lock. I obtained the state of the process via jstack. All threads are waiting to lock the object (0x00000006b30b3be8), but no thread is holding it. Does anybody have clues? The attachment is the output of jstack at that time. On Wed, Feb 15, 2017 at 11:45 PM, Gil Tene <g...@azul.com> wrote: Don't know if this is the same bug. RHEL 7 kernel included fixes for this since some time in 2015.
--
|
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Dan Beaulieu |
16/03/17 15:29 |
Hi Daniel, do you happen to know what the commit was that fixed this? I'd like to learn more about the fix.
|
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Allen Reese |
16/03/17 15:38 |
You're asking about the futex_wait fix right?
That would be this commit as far as I can tell from looking around:
--Allen Reese Yahoo! Inc.
|
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Dan Beaulieu |
17/03/17 07:27 |
Possibly, but I don't think so. I was replying directly to the post that mentioned there was a fix in the 4.5 kernel. Since we were seeing an issue we think is this issue with 4.4.z, but don't see it after upgrading to 4.10.z, we think whatever Daniel is referring to is related.
The patch you linked is from 2014, so I'd imagine it'd also be in the 4.4.z kernel we were using and having issues with. |
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Allen Reese |
17/03/17 08:57 |
that's kinda what I thought. I read the reports for a bit and it wasn't clear to me.
I've only seen what might be a microcode issue on some haswell boxes, but I don't have much more than that. the microcode issue I'm aware of is fixed by installing a newer microcode package.
however I've only been supporting java on RHEL. :) |
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Daniel Worthington-Bodart |
18/03/17 11:36 |
I never found the exact fix, but I do know that 4.4.0-65 (Ubuntu 16.04 LTS) is also fixed, so the problem lies somewhere between at least 4.2.0-18 and that version.
D
|
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
SITARAM SAKTHI |
20/07/17 05:19 |
I see a similar issue; it moves forward after attaching and detaching from the process with gdb, but only on CentOS 6.7. Could this be an issue with the kernel? |
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Allen Reese |
20/07/17 07:34 |
It's fixed for me in RHEL 6.7, with kernel-2.6.32-504.16.2.el6 or later. For RHEL7, it's fixed with 3.10.0-229.7.2.el7 or later.
--Allen Reese
|
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
zhengzh...@gmail.com |
31/08/17 23:49 |
Hi Tene, when I read your note I understood that this bug appears in kernel versions 3.14 - 3.17. But why can it occur on RHEL 6.6, which still uses kernel 2.6.32? I can't understand this and need help. Can you explain it for me or point me to some reference? Thanks!! On Thursday, May 14, 2015 at 6:37:32 AM UTC+8, Gil Tene wrote: We had this one bite us hard and scare the %$^! out of us, so I figured I'd share the fear...
|
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
Peter Booth |
01/09/17 11:55 |
Zheng, the issue is how Red Hat Enterprise Linux uses security backports. The RHEL distro tries to be as stable and secure as possible by using well-tested (old) versions of components. But when testing discovers security vulnerabilities in a newer version of a component, Red Hat checks whether the bug exists in the old version. If it does, they patch the old version with the code change from the newer version to address the issue. It's a great idea that works well most of the time. This is called backporting and is described on the Red Hat site. Occasionally, however, the fix to a security issue also introduces an unrelated bug. This is what occurred here. Peter |
| Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW]. |
zhengzh...@gmail.com |
04/09/17 03:24 |
Thank you very much ! I get it now !
On Saturday, September 2, 2017 at 2:55:34 AM UTC+8, Peter Booth wrote: |