05:41:56.017 # Session INFO HTTPD: GET /Inspect.html key=arrhythmia.data.hex job=$03010aea161732d4ffffffff$_b6558ec55a40860eec41f32e9b0d450b
05:45:01.570 # Session INFO HTTPD: GET /Cloud.html
05:46:43.173 # Session INFO HTTPD: GET /2/GBM.html
05:46:58.204 # Session INFO HTTPD: GET /2/TypeaheadHexKeyRequest limit=1024 filter=Generalized Boosted Models
05:47:00.521 # Session INFO HTTPD: GET /2/TypeaheadHexKeyRequest limit=1024 filter=a
05:47:03.751 # Session INFO HTTPD: GET /2/GBM.query source=arrhythmia.data.hex grid_parallelism=1 score_each_iteration=0 max_depth=5 nbins=20 learn_rate=0.1 ntrees=50 min_rows=10
05:47:03.753 # Session INFO WATER: Converting ValueArray to Frame: node(/10.234.22.23:54321) convNum(0) key(arrhythmia.data.hex.autoframe)...
05:47:04.552 # Session INFO WATER: Conversion 0 complete.
05:47:34.709 # Session INFO HTTPD: GET /2/GBM.query source=arrhythmia.data.hex grid_parallelism=1 score_each_iteration=0 max_depth=5 response=C1 nbins=20 learn_rate=0.1 ntrees=50 min_rows=10
[WARN] MetricsSystemImpl - Source name ugi already exists!
Exception in thread "Thread for syncLogs" java.lang.NullPointerException
    at org.apache.hadoop.mapred.TaskLogAppender.flush(TaskLogAppender.java:67)
    at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:357)
    at org.apache.hadoop.mapred.Child$3.run(Child.java:141)
05:49:59.982 Thread-13 INFO WATER: [WARN] DFSClient - DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /user/hadoop/output100/_temporary/_attempt_201402111036_0005_m_000001_0/part-m-00001 File does not exist. Holder DFSClient_attempt_201402111036_0005_m_000001_0 does not have any open files.
+ at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1642)
+ at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1633)
+ at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1549)
+ at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:698)
+ at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
+ at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
+ at java.lang.reflect.Method.invoke(Method.java:606)
+ at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:573)
+ at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
+ at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
+ at java.security.AccessController.doPrivileged(Native Method)
+ at javax.security.auth.Subject.doAs(Subject.java:415)
+ at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
+ at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)
+
+ at org.apache.hadoop.ipc.Client.call(Client.java:1067)
+ at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
+ at com.sun.proxy.$Proxy2.addBlock(Unknown Source)
+ at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
+ at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
+ at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
+ at java.lang.reflect.Method.invoke(Method.java:606)
+ at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:83)
+ at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
+ at com.sun.proxy.$Proxy2.addBlock(Unknown Source)
+ at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3633)
+ at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3493)
+ at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2700(DFSClient.java:2691)
+ at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2932)
05:49:59.985 Thread-13 INFO WATER: [WARN] DFSClient - Error Recovery for block null bad datanode[0] nodes == null
05:49:59.985 Thread-13 INFO WATER: [WARN] DFSClient - Could not get block locations. Source file "/user/hadoop/output100/_temporary/_attempt_201402111036_0005_m_000001_0/part-m-00001" - Aborting...
05:49:59.986 Thread-0 INFO WATER: [ERROR] DFSClient - Exception closing file /user/hadoop/output100/_temporary/_attempt_201402111036_0005_m_000001_0/part-m-00001 : org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /user/hadoop/output100/_temporary/_attempt_201402111036_0005_m_000001_0/part-m-00001 File does not exist. Holder DFSClient_attempt_201402111036_0005_m_000001_0 does not have any open files.
+ at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1642)
+ at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1633)
+ at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1549)
+ at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:698)
+ at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
+ at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
+ at java.lang.reflect.Method.invoke(Method.java:606)
+ at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:573)
+ at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
+ at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
+ at java.security.AccessController.doPrivileged(Native Method)
+ at javax.security.auth.Subject.doAs(Subject.java:415)
+ at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
+ at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)
+ <org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /user/hadoop/output100/_temporary/_attempt_201402111036_0005_m_000001_0/part-m-00001 File does not exist. Holder DFSClient_attempt_201402111036_0005_m_000001_0 does not have any open files.
+ at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1642)
+ at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1633)
+ at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1549)
+ at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:698)
+ at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
+ at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
+ at java.lang.reflect.Method.invoke(Method.java:606)
+ at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:573)
+ at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
+ at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
+ at java.security.AccessController.doPrivileged(Native Method)
+ at javax.security.auth.Subject.doAs(Subject.java:415)
+ at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
+ at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)
+ >org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /user/hadoop/output100/_temporary/_attempt_201402111036_0005_m_000001_0/part-m-00001 File does not exist. Holder DFSClient_attempt_201402111036_0005_m_000001_0 does not have any open files.
+ at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1642)
+ at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1633)
+ at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1549)
+ at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:698)
+ at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
+ at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
+ at java.lang.reflect.Method.invoke(Method.java:606)
+ at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:573)
+ at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
+ at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
+ at java.security.AccessController.doPrivileged(Native Method)
+ at javax.security.auth.Subject.doAs(Subject.java:415)
+ at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
+ at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)
+
+ at org.apache.hadoop.ipc.Client.call(Client.java:1067)
+ at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
+ at com.sun.proxy.$Proxy2.addBlock(Unknown Source)
+ at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
+ at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
+ at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
+ at java.lang.reflect.Method.invoke(Method.java:606)
+ at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:83)
+ at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
+ at com.sun.proxy.$Proxy2.addBlock(Unknown Source)
+ at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3633)
+ at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3493)
+ at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2700(DFSClient.java:2691)
+ at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2932)
20th Feb Logs
=============
stdout logs
01:16:11.046 main INFO WATER: ----- H2O started -----
01:16:11.049 main INFO WATER: Build git branch: master
01:16:11.049 main INFO WATER: Build git hash: 737f488bb2471b411818002e4d31591cf9a2c83b
01:16:11.049 main INFO WATER: Build git describe: nn-2-3383-g737f488-dirty
01:16:11.050 main INFO WATER: Build project version: 2.3.0.99999
01:16:11.050 main INFO WATER: Built by: 'root'
01:16:11.050 main INFO WATER: Built on: 'Wed Feb 19 05:30:54 UTC 2014'
01:16:11.050 main INFO WATER: Java availableProcessors: 2
01:16:11.052 main INFO WATER: Java heap totalMemory: 0.96 gb
01:16:11.052 main INFO WATER: Java heap maxMemory: 0.96 gb
01:16:11.052 main INFO WATER: Java version: Java 1.7.0_40 (from Oracle Corporation)
01:16:11.053 main INFO WATER: OS version: Linux 3.2.30-49.59.amzn1.x86_64 (amd64)
01:16:11.053 main INFO WATER: ICE root: '/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201402111036_0004/attempt_201402111036_0004_m_000001_0,/mnt1/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201402111036_0004/attempt_201402111036_0004_m_000001_0'
01:16:11.058 main INFO WATER: Possible IP Address: eth0 (eth0), 10.234.22.23
01:16:11.058 main INFO WATER: Possible IP Address: lo (lo), 127.0.0.1
01:16:11.095 main INFO WATER: Internal communication uses port: 54322
+ Listening for HTTP and REST traffic on http://10.234.22.23:54321/
EmbeddedH2OConfig: notifyAboutEmbeddedWebServerIpPort called (10.234.22.23, 54321)
EmbeddedH2OConfig: fetchFlatfile called
EmbeddedH2OConfig: fetchFlatfile returned
------------------------------------------------------------
10.254.108.58:54321
10.234.22.23:54321
------------------------------------------------------------
01:16:11.122 main INFO WATER: H2O cloud name: 'H2O_44859'
01:16:11.122 main INFO WATER: (v2.3.0.99999) 'H2O_44859' on /10.234.22.23:54321, discovery address /237.87.234.66:60759
01:16:11.123 main INFO WATER: If you have trouble connecting, try SSH tunneling from your local machine (e.g., via port 55555):
+ 1. Open a terminal and run 'ssh -L 55555:localhost:54321 had...@10.234.22.23'
+ 2. Point your browser to http://localhost:55555
01:16:11.128 main INFO WATER: Cloud of size 1 formed [/10.234.22.23:54321]
EmbeddedH2OConfig: notifyAboutCloudSize called (10.234.22.23, 54321, 1)
01:16:11.129 main INFO WATER: Log dir: '/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201402111036_0004/attempt_201402111036_0004_m_000001_0,/mnt1/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201402111036_0004/attempt_201402111036_0004_m_000001_0/h2ologs'
01:16:11.131 main INFO WATER: H2O was started by Hadoop; inheriting HDFS library from mapper task.
01:16:14.375 FJ-10-1 INFO WATER: Cloud of size 2 formed [/10.234.22.23:54321, /10.254.108.58:54321]
EmbeddedH2OConfig: notifyAboutCloudSize called (10.234.22.23, 54321, 2)
02:14:16.969 Thread-0 INFO WATER: Network congestion: TCPcall Connection reset by peer, AB=[AB write 2nd /10.254.108.58:54321 TCP 0 <= 0 <= 65536 <= 65536], for task#4, waiting and retrying...
02:14:16.969 #P-Accept ERRR WATER: IO error on TCP port 54322:
+ water.AutoBuffer$TCPIsUnreliableException
+ at water.AutoBuffer.getImpl(AutoBuffer.java:462)
+ at water.AutoBuffer.getSz(AutoBuffer.java:443)
+ at water.AutoBuffer.getPort(AutoBuffer.java:784)
+ at water.AutoBuffer.<init>(AutoBuffer.java:108)
+ at water.TCPReceiverThread.run(TCPReceiverThread.java:50)
02:14:17.471 Thread-0 INFO WATER: Network congestion: TCPcall Connection refused, AB=[AB write first /10.254.108.58:54321 TCP 0 <= 0 <= 5312 <= 65536], for task#4, waiting and retrying...
02:14:17.972 Thread-0 INFO WATER: Network congestion: TCPcall Connection refused, AB=[AB write first /10.254.108.58:54321 TCP 0 <= 0 <= 5312 <= 65536], for task#4, waiting and retrying...
stderr logs
[WARN] MetricsSystemImpl - Source name ugi already exists!
Exception in thread "Thread for syncLogs" java.lang.NullPointerException
    at org.apache.hadoop.mapred.TaskLogAppender.flush(TaskLogAppender.java:67)
    at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:357)
    at org.apache.hadoop.mapred.Child$3.run(Child.java:141)
02:14:16.578 Thread-13 INFO WATER: [WARN] DFSClient - DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /user/hadoop/h2oout/_temporary/_attempt_201402111036_0004_m_000001_0/part-m-00001 File does not exist. Holder DFSClient_attempt_201402111036_0004_m_000001_0 does not have any open files.
+ at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1642)
+ at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1633)
+ at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1549)
+ at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:698)
+ at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
+ at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
+ at java.lang.reflect.Method.invoke(Method.java:606)
+ at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:573)
+ at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
+ at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
+ at java.security.AccessController.doPrivileged(Native Method)
+ at javax.security.auth.Subject.doAs(Subject.java:415)
+ at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
+ at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)
+
+ at org.apache.hadoop.ipc.Client.call(Client.java:1067)
+ at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
+ at com.sun.proxy.$Proxy2.addBlock(Unknown Source)
+ at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
+ at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
+ at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
+ at java.lang.reflect.Method.invoke(Method.java:606)
+ at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:83)
+ at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
+ at com.sun.proxy.$Proxy2.addBlock(Unknown Source)
+ at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3633)
+ at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3493)
+ at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2700(DFSClient.java:2691)
+ at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2932)
02:14:16.615 Thread-13 INFO WATER: [WARN] DFSClient - Error Recovery for block null bad datanode[0] nodes == null
02:14:16.615 Thread-13 INFO WATER: [WARN] DFSClient - Could not get block locations. Source file "/user/hadoop/h2oout/_temporary/_attempt_201402111036_0004_m_000001_0/part-m-00001" - Aborting...
02:14:17.760 #n thread INFO WATER: [WARN] Task - Parent died. Exiting attempt_201402111036_0004_m_000001_0
I see you are using Hadoop version 1.0.3.
One thing you may notice is that we may stress HDFS harder than other MapReduce workloads do.
I noticed a warning about TCP in your logs, and that got me thinking about this issue.
Hadoop 1.0.3 is from May 2012, if I remember right.
I would recommend the most recent Hadoop version if you're experimenting; there may be underlying issues that were resolved in later releases.
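You can double-check what the cluster is actually running with the stock Hadoop CLI (nothing H2O-specific here):

    # Print the Hadoop version, build hash, and build date for this node
    hadoop version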
Let us know if you have more info.
Oh, another thing: I see your failure occurred while H2O was doing a GBM (was this a long-running GBM?).
During the algorithms, there shouldn't be any HDFS traffic; all HDFS traffic should occur only while parsing datasets.
It got me thinking that maybe H2O is consuming all the resources on your system (perhaps due to thrashing with the small heaps), and that causes performance problems for the NameNode and so on, even though H2O is not doing any HDFS work.
I would recommend checking where your TaskTrackers are (assuming this is a MapReduce job), and making sure they are not on the same machine as the NameNode.
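One quick way to check that (just a sketch; the hostnames are placeholders for your cluster, and it assumes you can SSH to each node and have a JDK on them) is to list the Hadoop daemons on every machine with jps:

    # Show which Hadoop daemons run on which host (hostnames are examples)
    for host in master worker1 worker2; do
      echo "== $host =="
      ssh "$host" jps | egrep 'NameNode|JobTracker|TaskTracker|DataNode'
    done

If NameNode and TaskTracker turn up on the same host, the H2O mapper there will be competing with the NameNode for CPU and memory.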
And the heaps should be sized relative to your dataset.
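For example, when launching through the h2odriver, the per-mapper heap is set with -mapperXmx. This is only a sketch of the command shape; the jar names, output directory, node count, and heap size are placeholders for your setup:

    # Launch a 2-node H2O cloud with a 4 GB heap per mapper (values are examples;
    # a heap of a few times your on-disk data size is a reasonable starting point)
    hadoop jar h2odriver.jar water.hadoop.h2odriver \
        -libjars h2o.jar \
        -mapperXmx 4g \
        -nodes 2 \
        -output hdfsOutputDirName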
Are you using the latest h2o.jar from master? I'm wondering about the memory usage of the GBM algo in the h2o.jar you're using.
thanks
-kevin
I'm always worried about resource contention (which YARN is helping address), because I tend to test with large datasets and my machines are a bit oversubscribed. But yes, your setup sounds okay: well spread out, with JVM heaps that are fine for the small datasets.
There are so many Hadoop versions from the last two years that people are using (for various good reasons) that it becomes difficult to be expert in, and test on, all of them.
So all feedback is useful. I'm really interested in Hadoop install and operation feedback, since everyone is trying to make the idea of H2O-on-Hadoop-as-a-service simpler to install and manage. But I notice we sometimes don't have enough feedback on the installations people want to set up, so all feedback and experiences are good.
I'd be interested in your experiences with repeated parsing of files from HDFS.
We have many interesting parse features: we can parse .gz files, we can do multi-file pattern matching so that multiple files are parsed into a single key, and we can use a separate file for the header names (vs. the data).
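To sketch what that looks like through the same 2.x REST API your session log is hitting (the host, HDFS paths, and key names are placeholders, and I'm going from memory on the parameter names, so double-check them against your build):

    # Import files from HDFS, then parse everything matching a pattern into one key
    curl 'http://10.234.22.23:54321/2/ImportFiles2?path=hdfs://namenode/user/hadoop/data/'
    curl 'http://10.234.22.23:54321/2/Parse2?source_key=hdfs://namenode/user/hadoop/data/part-*&destination_key=mydata.hex'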
-kevin