Re: JVM BUG(s) - Hadoop's threads hanging

Ronald Klop

unread,

Jun 16, 2014, 10:45:04 AM6/16/14

to

On Mon, 16 Jun 2014 14:44:10 +0200, Dmitry Sivachenko
<trtr...@gmail.com> wrote:

> Hello!
>
> I set up hadoop-1.2.1 (from ports) on FreeBSD-10/stable with openjdk
> version 1.7.0_60.
>
> On the first glance it is doing well except one annoying thing: after
> executing some tasks, tasktracker process starts to eat CPU when idle.
> Sometimes it is 10-20% (numbers from top(1) output), sometimes it is
> 100-150%.
>
> In tasktrackers's log I see numerious records like this:
>
> 2014-06-09 13:08:29,858 INFO org.mortbay.log:
> org.mortbay.io.nio.SelectorManager$SelectSet@abdcc1c JVM BUG(s) -
> injecting delay59 times
> 2014-06-09 13:08:29,859 INFO org.mortbay.log:
> org.mortbay.io.nio.SelectorManager$SelectSet@abdcc1c JVM BUG(s) -
> recreating selector 59 times, canceled keys 944 times
> 2014-06-09 13:09:29,862 INFO org.mortbay.log:
> org.mortbay.io.nio.SelectorManager$SelectSet@abdcc1c JVM BUG(s) -
> injecting delay58 times
> 2014-06-09 13:09:29,862 INFO org.mortbay.log:
> org.mortbay.io.nio.SelectorManager$SelectSet@abdcc1c JVM BUG(s) -
> recreating selector 58 times, canceled keys 928 times
> 2014-06-09 13:10:29,901 INFO org.mortbay.log:
> org.mortbay.io.nio.SelectorManager$SelectSet@abdcc1c JVM BUG(s) -
> injecting delay58 times
> 2014-06-09 13:10:29,901 INFO org.mortbay.log:
> org.mortbay.io.nio.SelectorManager$SelectSet@abdcc1c JVM BUG(s) -
> recreating selector 58 times, canceled keys 928 times
> <...>
>
>
> The more jobs I run, more java threads start to consume CPU after all
> tasks finished. After several job execution, top(1) output looks like
> this (splitted by thread, the same PID):
>
> PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND
> 79045 hadoop 47 0 1948M 867M uwait 2 20:49 37.50%
> java{java}
> 79045 hadoop 31 0 1948M 867M uwait 31 1:45 19.29%
> java{java}
> 79045 hadoop 33 0 1948M 867M uwait 21 2:51 19.19%
> java{java}
> 79045 hadoop 30 0 1948M 867M uwait 17 2:51 18.65%
> java{java}
> 79045 hadoop 30 0 1948M 867M uwait 11 1:52 18.36%
> java{java}
> 79045 hadoop 30 0 1948M 867M uwait 22 1:45 18.36%
> java{java}
> 79045 hadoop 31 0 1948M 867M uwait 29 2:50 18.26%
> java{java}
> 79045 hadoop 31 0 1948M 867M uwait 6 1:57 18.16%
> java{java}
> 79045 hadoop 31 0 1948M 867M uwait 13 4:55 17.97%
> java{java}
> 79045 hadoop 31 0 1948M 867M uwait 26 3:39 17.77%
> java{java}
> 79045 hadoop 33 0 1948M 867M uwait 8 1:21 17.48%
> java{java}
> 79045 hadoop 30 0 1948M 867M uwait 1 3:32 16.70%
> java{java}
> 79045 hadoop 32 0 1948M 867M uwait 24 3:12 16.70%
> java{java}
> 79045 hadoop 26 0 1948M 867M uwait 4 1:27 10.35%
> java{java}
> 72417 root 20 0 19828K 3252K CPU21 21 0:00 0.29% top
> 836 root 20 0 36104K 1952K select 14 6:51 0.00% snmpd
> 79045 hadoop 20 0 1948M 867M uwait 20 6:51 0.00%
> java{java}
> 79045 hadoop 20 0 1948M 867M uwait 27 3:45 0.00%
> java{java}
> 79045 hadoop 20 0 1948M 867M uwait 30 2:37 0.00%
> java{java}
> 79045 hadoop 20 0 1948M 867M uwait 15 0:54 0.00%
> java{java}
> 79045 hadoop 20 0 1948M 867M uwait 2 0:48 0.00%
> java{java}
> 79045 hadoop 20 0 1948M 867M uwait 14 0:48 0.00%
> java{java}
> 79045 hadoop 20 0 1948M 867M uwait 2 0:48 0.00%
> java{java}
> <....>
>
> This is on absolutely idle cluster, no single task is running.
> I am attaching truss(1) output for that java process.
>
> I asked that on us...@hadoop.apache.org without any luck.
>
> Any clues? (I can provide any additional information if needed)

Hi,

From your information it is hard to say something about it. The bug can be
in FreeBSD, OpenJDK (the Oracle part or in the BSD port part), in Hadoop
or in your own code running on top of Hadoop.

My first idea would be to eliminate some of the possibilities.
- Run a Linux machine with the same versions of the software.
- Try FreeBSD 9-stable.
- Try an older version of OpenJDK on FreeBSD.
- Try a very simple 'Hello world' style application on Hadoop which mimics
the thread usage.

Did you ever run your Hadoop application on FreeBSD before without this
symptom? If so, what are the differences between then and now?

Another possibly valuable piece of information is what those threads are
doing. Using jstack you can very easily get a dump of all stacks of all
threads in Java.

Regards,
Ronald.
_______________________________________________
freebs...@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-java
To unsubscribe, send any mail to "freebsd-java...@freebsd.org"

Dmitry Sivachenko

unread,

Jun 17, 2014, 6:04:52 AM6/17/14

to

On 16 июня 2014 г., at 18:45, Ronald Klop <ronald...@klop.ws> wrote:

>
> Hi,
>
> From your information it is hard to say something about it. The bug can be in FreeBSD, OpenJDK (the Oracle part or in the BSD port part), in Hadoop or in your own code running on top of Hadoop.
>
> My first idea would be to eliminate some of the possibilities.
> - Run a Linux machine with the same versions of the software.
> - Try FreeBSD 9-stable.

I will try at least FreeBSD-9 soon (never used Linux so it will take more time and not so relevant because I want to continue to use FreeBSD, not just move to Linux)

> - Try an older version of OpenJDK on FreeBSD.

I already tried latest versions of openjdk-6/7/8 from ports.

7 and 8 behaves the same way (as I described in my original e-mail). Below is the output of jstack for openjdk7 (java process running taskttacker):

46897 hadoop 147 21 0 1927M 625M uwait 22 14:31 7.86% java

/tmp# jstack -l 46897
46897: Unable to open socket file: target process not responding or HotSpot VM not loaded
The -F option can be used when the target process is not responding
/tmp# jstack -F -l 46897>/tmp/jstack.out
Attaching to process ID 46897, please wait...
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at sun.tools.jstack.JStack.runJStackTool(JStack.java:136)
at sun.tools.jstack.JStack.main(JStack.java:102)
Caused by: sun.jvm.hotspot.debugger.UnalignedAddressException: 746f705b762f4867
at sun.jvm.hotspot.debugger.bsd.BsdDebuggerLocal$1.checkAlignment(BsdDebuggerLocal.java:183)
at sun.jvm.hotspot.debugger.bsd.BsdDebuggerLocal.readCInteger(BsdDebuggerLocal.java:485)
at sun.jvm.hotspot.debugger.DebuggerBase.readAddressValue(DebuggerBase.java:454)
at sun.jvm.hotspot.debugger.bsd.BsdDebuggerLocal.readAddress(BsdDebuggerLocal.java:430)
at sun.jvm.hotspot.debugger.bsd.BsdAddress.getAddressAt(BsdAddress.java:74)
at sun.jvm.hotspot.HotSpotTypeDataBase.readVMTypes(HotSpotTypeDataBase.java:154)
at sun.jvm.hotspot.HotSpotTypeDataBase.<init>(HotSpotTypeDataBase.java:85)
at sun.jvm.hotspot.bugspot.BugSpotAgent.setupVM(BugSpotAgent.java:573)
at sun.jvm.hotspot.bugspot.BugSpotAgent.go(BugSpotAgent.java:494)
at sun.jvm.hotspot.bugspot.BugSpotAgent.attach(BugSpotAgent.java:332)
at sun.jvm.hotspot.tools.Tool.start(Tool.java:163)
at sun.jvm.hotspot.tools.JStack.main(JStack.java:86)
... 6 more
/tmp#

(jstack.out file is empty)

openjdk-6 is different: during shuffle phase (when portions of intermediate data are copied between data nodes), java process running tasktracker consumes a lot of CPU (300-400%), and it is often in "vm map" state. Data transfer is very-very slow (1MB/sec and less on 1GB network). With openjdk7/8 network is utilized for about 40% (~40MB/sec), it is acceptable though the question why isn't it 100MB/sec still stands. So shuffle phase is almost stuck with openjdk6. But if you wait long enough to finish this, tasktrackers in idle state behave as expected (do not consume CPU). Below is the output of top(1) and jstack:

35291 hadoop 209 22 0 1922M 461M vm map 17 46.5H 336.08% java

/tmp# jstack -l 35291
35291: Unable to open socket file: target process not responding or HotSpot VM not loaded
The -F option can be used when the target process is not responding

/tmp# jstack -F -l 35291>/tmp/jstack.out
Attaching to process ID 35291, please wait...
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces at java.lang.reflect.Method.invoke(Method.java:622)
at sun.tools.jstack.JStack.runJStackTool(JStack.java:136)
at sun.tools.jstack.JStack.main(JStack.java:102)
Caused by: sun.jvm.hotspot.debugger.UnalignedAddressException: 746f705b762f4867
at sun.jvm.hotspot.debugger.bsd.BsdDebuggerLocal$1.checkAlignment(BsdDebuggerLocal.java:183)
at sun.jvm.hotspot.debugger.bsd.BsdDebuggerLocal.readCInteger(BsdDebuggerLocal.java:480)
at sun.jvm.hotspot.debugger.DebuggerBase.readAddressValue(DebuggerBase.java:454)
at sun.jvm.hotspot.debugger.bsd.BsdDebuggerLocal.readAddress(BsdDebuggerLocal.java:425)
at sun.jvm.hotspot.debugger.bsd.BsdAddress.getAddressAt(BsdAddress.java:74)
at sun.jvm.hotspot.HotSpotTypeDataBase.readVMTypes(HotSpotTypeDataBase.java:154)
at sun.jvm.hotspot.HotSpotTypeDataBase.<init>(HotSpotTypeDataBase.java:85)
at sun.jvm.hotspot.bugspot.BugSpotAgent.setupVM(BugSpotAgent.java:572)
at sun.jvm.hotspot.bugspot.BugSpotAgent.go(BugSpotAgent.java:493)
at sun.jvm.hotspot.bugspot.BugSpotAgent.attach(BugSpotAgent.java:331)
at sun.jvm.hotspot.tools.Tool.start(Tool.java:163)
at sun.jvm.hotspot.tools.JStack.main(JStack.java:86)
... 6 more
/tmp#

(/tmp/jstack.out file is empty)

> - Try a very simple 'Hello world' style application on Hadoop which mimics the thread usage.
>
> Did you ever run your Hadoop application on FreeBSD before without this symptom? If so, what are the differences between then and now?

No, it is just my first install of hadoop and I use bundled terasort test suite (hadoop jar /usr/local/share/examples/hadoop/hadoop-examples-1.2.1.jar terasort <...>)
Since it is the problem with tasktracker (it does not run user-supplied code, it just schedules tasks and performs cleanups), so it is hardly relevant which particular task I execute.

Dmitry Sivachenko

unread,

Jun 18, 2014, 11:10:08 AM6/18/14

to

On 17 июня 2014 г., at 14:04, Dmitry Sivachenko <trtr...@gmail.com> wrote:

>
> On 16 июня 2014 г., at 18:45, Ronald Klop <ronald...@klop.ws> wrote:
>
>>
>> Hi,
>>
>> From your information it is hard to say something about it. The bug can be in FreeBSD, OpenJDK (the Oracle part or in the BSD port part), in Hadoop or in your own code running on top of Hadoop.
>>
>> My first idea would be to eliminate some of the possibilities.
>> - Run a Linux machine with the same versions of the software.
>> - Try FreeBSD 9-stable.
>
> I will try at least FreeBSD-9 soon (never used Linux so it will take more time and not so relevant because I want to continue to use FreeBSD, not just move to Linux)
>

Okay, I setup a similar configuration based on FreeBSD-9/stable.

I see the same problems with both openjdk6 and openjdk7 described in my previous e-mail.