Note: I would like to profusely thank @aplokhotnyuk. His tweet originally alerted me to the bug's existence, and started us down the path of figuring out the what/why/where/when behind it. Why this is not being shouted in the streets is a mystery to me, and scary in its own right. We were lucky enough that I had a "that looks suspiciously familiar" moment when I read that tweet, and that I put 3.14 and 1.618 together and thought enough to ask "Umm... have we only been seeing this bug on Haswell servers?".
Without @aplokhotnyuk's tweet we'd probably still be searching for the nonexistent bugs in our own locking code... And since the tweet originated from another discussion on this group, it presents a rare "posting and reading twitter actually helps us solve bugs sometimes" example.
That's nasty! Thanks for sharing.
sent from my phone
We had this one bite us hard and scare the %$^! out of us, so I figured I'd share the fear... The Linux futex_wait call has been broken for about a year (in upstream since 3.14, around Jan 2014), and has just recently been fixed (in upstream 3.18, around October 2014). More importantly, this breakage seems to have been backported into major distros (e.g. into RHEL 6.6 and its cousins, released in October 2014), and the fix for it has only recently been backported (e.g. RHEL 6.6.z and cousins have the fix).
The impact of this kernel bug is very simple: user processes can deadlock and hang in seemingly impossible situations. A futex wait call (and anything using a futex wait) can stay blocked forever, even though it had been properly woken up by someone. Thread.park() in Java may stay parked. Etc. If you are lucky you may also find soft lockup messages in your dmesg logs. If you are not that lucky (like us, for example), you'll spend a couple of months of someone's time trying to find the fault in your code, when there is nothing there to find.

This behavior seems to regularly appear in the wild on Haswell servers (all the machines where we have had customers hit it in the field and in labs have been Haswells), and since Haswell servers are basically what you get if you buy a new machine now, or run on the cool new Amazon EC2/GCE/Azure stuff, you are bound to experience some interesting behavior. I don't know of anyone that will see this as a good thing for production systems. Except for maybe Netflix (maybe we should call this the linux fumonkey).
The commit for the fix is here: https://github.com/torvalds/linux/commit/76835b0ebf8a7fe85beb03c75121419a7dec52f0
The commit explanation says that it fixes https://github.com/torvalds/linux/commit/b0c29f79ecea0b6fbcefc999e70f2843ae8306db (presumably the bug introduced with that change), which was made in Jan of 2014 into 3.14. That 3.14 code added logic to avoid taking a lock if the code knows that there are no waiters. It documents (pretty elaborately) how "…thus preventing tasks sleeping forever if wakers don't acknowledge all possible waiters" with logic that explains how memory barriers guarantee the correct order (see the paragraph at line 141), which includes the statement "this is done by the barriers in get_futex_key_refs(), through either ihold or atomic_inc, depending on the futex type." (this assumption is the actual bug). The assumption is further reinforced by the fact that the change added a comment to every call to get_futex_key_refs() in the code that says "/* implies MB (B) */".

The problem was that get_futex_key_refs() did NOT imply a memory barrier. It only included a memory barrier for two explicit cases in a switch statement that checks the futex type, but did not have a default case handler, and therefore did not apply a memory barrier for other futex types. Like private futexes. Which are a very commonly used type of futex.

The fix is simple: an added default case for the switch that just has an explicit smp_mb() in it. There was a missing memory barrier in the wakeup path, and now (hopefully) it's not missing any more...
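For reference, here is roughly what the fixed get_futex_key_refs() looks like. This is a paraphrase of the upstream code around the fix commit rather than a verbatim copy, so treat it as a sketch; the point is that private futexes fall through to the default case, which simply did not exist before the fix:

static void get_futex_key_refs(union futex_key *key)
{
        if (!key->both.ptr)
                return;

        switch (key->both.offset & (FUT_OFF_INODE | FUT_OFF_MMSHARED)) {
        case FUT_OFF_INODE:
                ihold(key->shared.inode);       /* implies MB (B) */
                break;
        case FUT_OFF_MMSHARED:
                futex_get_mm(key);              /* implies MB (B) */
                break;
        default:
                /*
                 * Private futexes land here. Before the fix there was no
                 * default case at all, so no barrier was issued and the
                 * "implies MB (B)" assumption at the call sites was false.
                 */
                smp_mb();                       /* explicit MB (B) */
        }
}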
So let's be clear: RHEL 6.6 (and CentOS 6.6, and Scientific Linux 6.6) are certainly broken on Haswell servers. It is likely that recent versions of other distros are too (SLES, Ubuntu, Debian, Oracle Linux, etc.). The good news is that fixes are out there (including 6.6.z). But the bad news is that there is not much chatter saying "if you have a Haswell system, get to version X now". For some reason, people seem to not have noticed this or raised the alarm. We certainly haven't seen much "INSTALL PATCHES NOW" fear mongering. And we really need it, so I'm hoping this posting will start a panic.

Bottom line: the bug is very real, but it probably only appeared in the 3.14 upstream version (and distro versions that had backported https://github.com/torvalds/linux/commit/b0c29f79ecea0b6fbcefc999e70f2843ae8306db , presumably after Jan 2014). The bug was fixed in 3.18 in October 2014, but backports probably took a while (and some may still be pending). I know for a fact that RHEL 6.6.z has the fix. I don't know about other distro families and versions (yet), but if someone else does, please post (including when it was broken and when it was fixed).
(gdb) bt
#0 0x0000003593e0e264 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003593e09508 in _L_lock_854 () from /lib64/libpthread.so.0
#2 0x0000003593e093d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007fe87a42a50d in os::PlatformEvent::park() () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#4 0x00007fe87a3f10e8 in Monitor::ILock(Thread*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#5 0x00007fe87a3f132f in Monitor::lock_without_safepoint_check() () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#6 0x00007fe87a15a7bf in G1HotCardCache::insert(signed char*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#7 0x00007fe87a15db03 in G1RemSet::refine_card(signed char*, int, bool) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#8 0x00007fe87a143dc8 in RefineCardTableEntryClosure::do_card_ptr(signed char*, int) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#9 0x00007fe87a0feb9f in DirtyCardQueueSet::apply_closure_to_completed_buffer_helper(CardTableEntryClosure*, int, BufferNode*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#10 0x00007fe87a0fed8d in DirtyCardQueueSet::apply_closure_to_completed_buffer(int, int, bool) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#11 0x00007fe87a0683a4 in ConcurrentG1RefineThread::run() () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#12 0x00007fe87a430ca8 in java_start(Thread*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#13 0x0000003593e079d1 in start_thread () from /lib64/libpthread.so.0
#14 0x0000003593ae88fd in clone () from /lib64/libc.so.6
There has been quite some benchmarking done on Haswells comparing Java 7 and 8 (quite impressive actually). I wonder how they haven't stumbled into this.
Either way, debugging this has been quite fun and perhaps we'll write about the adventure in more detail :)
Kernel futex_wait() calls end up being at the core of almost any user-land synchronization primitive these days. Whether it's POSIX stuff (like mutexes and semaphores) or direct use of futexes. And all JVM synchronization (including synchronized, Lock, park/unpark), as well as all internal JVM threads like GC and compiler stuff, ends up with a waiting futex at some point.
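To make that concrete, here is a minimal userspace sketch of my own (not from the thread; the names and setup are purely illustrative) of the wait/wake protocol that those primitives ultimately reduce to. The FUTEX_WAIT_PRIVATE / FUTEX_WAKE_PRIVATE pair is exactly the private-futex path that lost its barrier:

#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_int flag;   /* the 32-bit futex word */

static long futex(atomic_int *uaddr, int op, int val)
{
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void *waiter(void *arg)
{
        (void)arg;
        /* Sleep only while the word still holds 0. The kernel bug discussed
         * above meant the FUTEX_WAKE from the other thread could be skipped,
         * because the waker, missing a barrier, did not see this thread
         * queued as a waiter. */
        while (atomic_load(&flag) == 0)
                futex(&flag, FUTEX_WAIT_PRIVATE, 0);
        printf("woken\n");
        return NULL;
}

int main(void)
{
        pthread_t t;
        pthread_create(&t, NULL, waiter, NULL);
        sleep(1);
        atomic_store(&flag, 1);                 /* publish the state change */
        futex(&flag, FUTEX_WAKE_PRIVATE, 1);    /* wake one waiter */
        pthread_join(t, NULL);
        return 0;
}

(Build with gcc -pthread; glibc's pthread_mutex/pthread_cond and the JVM's park/unpark do essentially this under the hood.)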
Have you moved to 6.6.z? (or if not on a RHEL or RHEL-like, a latest kernel of some sort?).
Private futures are exactly one of the types affected, according to that changelog.
sent from my phone
Cosmin
Private *futexes*, damn autocorrect.
sent from my phone
Thread 10 (Thread 0x7fa458ade700 (LWP 4482)):
#0 0x00000037016f805e in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x000000370167d16b in _L_lock_9503 () from /lib64/libc.so.6
#2 0x000000370167a6a6 in malloc () from /lib64/libc.so.6
#3 0x00007fa45a52ed29 in os::malloc(unsigned long, unsigned short, unsigned char*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#4 0x00007fa459fb66b3 in ChunkPool::allocate(unsigned long, AllocFailStrategy::AllocFailEnum) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#5 0x00007fa459fb62d1 in Arena::grow(unsigned long, AllocFailStrategy::AllocFailEnum) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#6 0x00007fa45a145cc0 in CompactibleFreeListSpace::new_dcto_cl(OopClosure*, CardTableModRefBS::PrecisionStyle, HeapWord*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#7 0x00007fa45a54ce6d in CardTableModRefBS::process_stride(Space*, MemRegion, int, int, OopsInGenClosure*, CardTableRS*, signed char**, unsigned long, unsigned long) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#8 0x00007fa45a54d040 in CardTableModRefBS::non_clean_card_iterate_parallel_work(Space*, MemRegion, OopsInGenClosure*, CardTableRS*, int) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#9 0x00007fa45a0d4e08 in CardTableModRefBS::non_clean_card_iterate_possibly_parallel(Space*, MemRegion, OopsInGenClosure*, CardTableRS*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#10 0x00007fa45a0d6a0e in CardTableRS::younger_refs_in_space_iterate(Space*, OopsInGenClosure*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#11 0x00007fa45a1823fe in ConcurrentMarkSweepGeneration::younger_refs_iterate(OopsInGenClosure*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#12 0x00007fa45a5c98aa in SharedHeap::process_strong_roots(bool, bool, SharedHeap::ScanningOption, OopClosure*, CodeBlobClosure*, OopsInGenClosure*, bool) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#13 0x00007fa45a27ef8c in GenCollectedHeap::gen_process_strong_roots(int, bool, bool, bool, SharedHeap::ScanningOption, OopsInGenClosure*, bool, OopsInGenClosure*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#14 0x00007fa45a551e4f in ParNewGenTask::work(unsigned int) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#15 0x00007fa45a6cf0cf in GangWorker::loop() () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#16 0x00007fa45a537ca8 in java_start(Thread*) () from /usr/java/jdk1.7.0_76/jre/lib/amd64/server/libjvm.so
#17 0x0000003701a079d1 in start_thread () from /lib64/libpthread.so.0
#18 0x00000037016e88fd in clone () from /lib64/libc.so.6
Cosmin
Those two traces look like they're coming from different code paths (malloc vs pthread_mutex) so I'm not sure if lll_lock_wait means it's not private. Looking at the kernel change, only private futexes weren't covered by a barrier in the broken version.
sent from my phone
Cosmin
How do jstack and the like subvert the problem? Do they cause the thread to be woken up (from bogus sleep) and observe consistent state at that point?
sent from my phone
Cosmin
Thanks for identifying the date and change for the fix in RHEL 6 (April 21, 2015). Strange that the errata makes no mention of it or its impact (it's a security advisory).

So far, here is what I know when it comes to distro release numbers for RHEL and its main cousins (at least in that world, most admins view things in terms of distro version rather than kernel version; you can look up the associated kernel versions). I'm using good/BAD to mean "does not have the bug" / "HAS THE MISSING BARRIER BUG":

RHEL 6 (and CentOS 6, and SL 6): 6.0-6.5 are good. 6.6 is BAD. 6.6.z is good.
RHEL 7 (and CentOS 7, and SL 7): 7.1 is BAD. As of yesterday, there does not yet appear to be a 7.x fix.
RHEL 5 (and CentOS 5, and SL 5): All versions are good (including 5.11).

Sadly, 6.6 seems to be a very (the most?) popular version we run into. And very few have moved to 6.6.z.

I/we are looking for more info to identify the specific versions affected in other distros (Ubuntu 12.04 LTS and 14.04 LTS, SLES 12 and 11, Amazon Linux, Oracle Linux, etc.). So far we've verified that SLES 12 kernel versions 3.12.32-33.1 & above have the fix (but not which versions have the bug), and that Amazon Linux kernel versions 3.14.35-28.38 & above have the fix (but not which versions have the bug).

I will post here as I have more. If you find more info useful for identifying releases that have the bug and ones that have fixed it, please do the same.
# strace -p 14603
Process 14603 attached - interrupt to quit
futex(0x7f5c8e6019d0, FUTEX_WAIT, 14604, NULL
^C <unfinished ...>
Process 14603 detached
-> no more output
# strace -F -p 14603
Process 14603 attached with 8 threads - interrupt to quit
-> + output from all the threads in the process.
As far as I can tell, your bug appears to be 100% cpu on 1 cpu core. This bug is more characterised by less cpu burn as threads sleep when they should be woken ("…thus preventing tasks sleeping forever if wakers don't acknowledge all possible waiters" ), so I would assume that your issue is different (as a starting point) and try to gather more evidence to prove it either way.
run
sudo perf record -F 99 -ag -p <pid> -- sleep 10
sudo perf script
to get some more info on the stacks associated with the CPU burn. The stacks may be corrupted by the JVM, preventing perf from reporting something useful with -g, so YMMV; fixing that requires a custom build of OpenJDK...
Thanks,
Alex
RHEL/CentOS: the 3.10.0-229.7.2.el7 kernel now includes a fix for this bug.
-adrian
We are getting killed by this right now. We are running Oracle Linux, Redhat rel. 6.6, kernel version 2.6.32-504.16.2.el6.x86_64. Supposedly the bug in this version was fixed, but it just happened again tonight (after not happening for two nights in a row).
I'm not sure. When I run a stack trace, the PID always refers to futex_. But the hang happens on an Oracle OCI call (I think that's what it is). All the threads look like this:
Thread 1 (Thread 0x2b8807f2d420 (LWP 9885)):
#0 0x000000347540b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00002b87fb24983c in conditionWait(pthread_cond_t*, SMutex*, unsigned int) () from /opt/informatica/9.6.1/server/bin/libpmcef.so
#2 0x00002b87fb24b2a7 in SThread::Sleep(unsigned int) () from /opt/informatica/9.6.1/server/bin/libpmcef.so
#3 0x00002b87fb249b3b in SEvent::putThreadToSleep(SThread*, unsigned int) () from /opt/informatica/9.6.1/server/bin/libpmcef.so
#4 0x00002b87fb24a4cc in msgque::get(int, TObject**, unsigned int) () from /opt/informatica/9.6.1/server/bin/libpmcef.so
#5 0x00002b87fb24a657 in SThread::readMessage(int, unsigned int, TObject**) () from /opt/informatica/9.6.1/server/bin/libpmcef.so
#6 0x00000000005f2798 in SDirectorImpl::getNextMessage() ()
#7 0x00000000005f4e3d in SDirectorImpl::doPETLOrchestration() ()
#8 0x00000000005f679e in SDirectorImpl::orchestrate() ()
#9 0x00000000005f6d21 in SDirectorImpl::run() ()
#10 0x00000000005faa6c in SDirectorRunnable::start() ()
#11 0x00000000005c01c3 in SExecutorDTM::start() ()
#12 0x00000000005dff57 in SPreparerDTMImpl::start() ()
#13 0x00000000005d774f in DTMMain(int, char const**) ()
#14 0x000000347501ed5d in __libc_start_main () from /lib64/libc.so.6
#15 0x00000000005b0a89 in _start ()
And then when I run the stack trace, the session 'wakes up' and starts sourcing data from Oracle.
The symptoms we are encountering sure sound like the futex_wait bug.
And I agree the kernel version is identified elsewhere as a 'good' version. I suppose it could be something else. But how the heck do I figure that out?
Sorry to be a necromancer, but I thought it was worth letting you all know that there still appears to be a related freeze for Java applications on recent Ubuntu versions when run on Haswell-E platforms. I had this problem on a 5960X running Ubuntu 15.10, stock kernel 4.2.0-18, latest JDK jdk1.8.0_74. I can confirm that the cold boot fix works with the stock kernel. The problem is also resolved using the very latest kernel 4.5.0-rc5 from the mainline PPA.
On Fri, 30 Oct 2015 at 18:18 Todd Lipcon <to...@lipcon.org> wrote:
Just to tag onto this old thread (because we ran into it on a new Haswell cluster last night)... I did some digging in the CentOS/RHEL kernel changelog, and the fix shows up in version 2.6.32-504.14.1.el6. Hope that's useful for other folks determining if they're vulnerable.
-Todd
On Mon, Aug 17, 2015 at 2:45 AM, Serguei Kolos <sergue...@gmail.com> wrote:
Hi,
Fantastic. Many thanks for sharing that info, which saved me several weeks of working time. I went as far as getting nasty GDB stack traces showing threads waiting on a non-locked mutex, but I didn't know how to dig this further down.
Cheers,
On Thursday, May 14, 2015 at 12:37:32 AM UTC+2, Gil Tene wrote:
We had this one bite us hard and scare the %$^! out of us, so I figured I'd share the fear...
Hey Craig,
"perf top" would be my first port of call here to get an idea where all that system time is going.
Cheers,
Tom
I bumped into this error a couple of months back when using CentOS 6.6 on a 32-core Dell server. After many days of debugging, I realized it to be a CentOS 6.6 bug and moved back to 6.5, and since then no such issues have been seen.
I am able to reproduce this issue within 15 minutes of heavy load on my multi-threaded C code.
This bug report seems to have a way to reproduce it:
Hope that helps.
--Allen Reese
transparent_hugepage=never
[root@host-01 ~]# strace -p 1292
Process 1292 attached
futex(0x7f80eff8a9d0, FUTEX_WAIT, 1312, NULL
Just trying to eliminate the obvious. You should be stracing the JVM threads by referring to their tids rather than the parent process pid. That guy will pretty much always show up as blocked on a futex.
Don't know if this is the same bug. The RHEL 7 kernel has included fixes for this since some time in 2015.
The issue is how Red Hat Enterprise Linux uses security backports. The RHEL distro tries to be as stable and secure as possible by using well-tested (old) versions of components. But when testing discovers security vulnerabilities in a newer version of a component, Red Hat checks to see if the bug exists in the old version of the component. If it does, they patch the old version with the code change from the newer version to address the issue. It's a great idea that works well most of the time. This is called backporting and is described on the Red Hat site.
Occasionally, however, the fix to a security issue also introduces an unrelated bug. This is what occurred here.
Peter
On Thursday, May 14, 2015 at 5:24:33 AM UTC-5, Marcin Sobieszczanski wrote:
> More importantly this breakage seems to have
> been back ported into major distros (e.g. into RHEL 6.6 and its cousins,
> released in October 2014), and the fix for it has only recently been back
> ported (e.g. RHEL 6.6.z and cousins have the fix).
According to the ChangeLogs attached to the rpms, it looks like the kernel series used in RHEL 6.6 (kernel-2.6.32-504) was affected from the start of the 6.6 release. It has been fixed only recently in the kernel-2.6.32-504.16.2.el6 update (21 April, https://rhn.redhat.com/errata/RHSA-2015-0864.html):
rpm -qp --changelog kernel-2.6.32-504.16.2.el6.x86_64.rpm | grep 'Ensure get_futex_key_refs() always implies a barrier'