05:41:56.017 # Session INFO HTTPD: GET /Inspect.html key=arrhythmia.data.hex job=$03010aea161732d4ffffffff$_b6558ec55a40860eec41f32e9b0d450b
05:45:01.570 # Session INFO HTTPD: GET /Cloud.html
05:46:43.173 # Session INFO HTTPD: GET /2/GBM.html
05:46:58.204 # Session INFO HTTPD: GET /2/TypeaheadHexKeyRequest limit=1024 filter=Generalized Boosted Models
05:47:00.521 # Session INFO HTTPD: GET /2/TypeaheadHexKeyRequest limit=1024 filter=a
05:47:03.751 # Session INFO HTTPD: GET /2/GBM.query source=arrhythmia.data.hex grid_parallelism=1 score_each_iteration=0 max_depth=5 nbins=20 learn_rate=0.1 ntrees=50 min_rows=10
05:47:03.753 # Session INFO WATER: Converting ValueArray to Frame: node(/10.234.22.23:54321) convNum(0) key(arrhythmia.data.hex.autoframe)...
05:47:04.552 # Session INFO WATER: Conversion 0 complete.
05:47:34.709 # Session INFO HTTPD: GET /2/GBM.query source=arrhythmia.data.hex grid_parallelism=1 score_each_iteration=0 max_depth=5 response=C1 nbins=20 learn_rate=0.1 ntrees=50 min_rows=10
[WARN] MetricsSystemImpl - Source name ugi already exists!
Exception in thread "Thread for syncLogs" java.lang.NullPointerException
    at org.apache.hadoop.mapred.TaskLogAppender.flush(TaskLogAppender.java:67)
    at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:357)
    at org.apache.hadoop.mapred.Child$3.run(Child.java:141)
05:49:59.982 Thread-13 INFO WATER: [WARN] DFSClient - DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /user/hadoop/output100/_temporary/_attempt_201402111036_0005_m_000001_0/part-m-00001 File does not exist. Holder DFSClient_attempt_201402111036_0005_m_000001_0 does not have any open files.
+ at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1642)
+ at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1633)
+ at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1549)
+ at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:698)
+ at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
+ at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
+ at java.lang.reflect.Method.invoke(Method.java:606)
+ at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:573)
+ at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
+ at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
+ at java.security.AccessController.doPrivileged(Native Method)
+ at javax.security.auth.Subject.doAs(Subject.java:415)
+ at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
+ at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)
+
+ at org.apache.hadoop.ipc.Client.call(Client.java:1067)
+ at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
+ at com.sun.proxy.$Proxy2.addBlock(Unknown Source)
+ at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
+ at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
+ at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
+ at java.lang.reflect.Method.invoke(Method.java:606)
+ at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:83)
+ at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
+ at com.sun.proxy.$Proxy2.addBlock(Unknown Source)
+ at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3633)
+ at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3493)
+ at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2700(DFSClient.java:2691)
+ at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2932)
05:49:59.985 Thread-13 INFO WATER: [WARN] DFSClient - Error Recovery for block null bad datanode[0] nodes == null
05:49:59.985 Thread-13 INFO WATER: [WARN] DFSClient - Could not get block locations. Source file "/user/hadoop/output100/_temporary/_attempt_201402111036_0005_m_000001_0/part-m-00001" - Aborting...
05:49:59.986 Thread-0 INFO WATER: [ERROR] DFSClient - Exception closing file /user/hadoop/output100/_temporary/_attempt_201402111036_0005_m_000001_0/part-m-00001 : org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /user/hadoop/output100/_temporary/_attempt_201402111036_0005_m_000001_0/part-m-00001 File does not exist. Holder DFSClient_attempt_201402111036_0005_m_000001_0 does not have any open files.
+ at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1642)
+ at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1633)
+ at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1549)
+ at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:698)
+ at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
+ at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
+ at java.lang.reflect.Method.invoke(Method.java:606)
+ at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:573)
+ at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
+ at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
+ at java.security.AccessController.doPrivileged(Native Method)
+ at javax.security.auth.Subject.doAs(Subject.java:415)
+ at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
+ at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)
+ <org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /user/hadoop/output100/_temporary/_attempt_201402111036_0005_m_000001_0/part-m-00001 File does not exist. Holder DFSClient_attempt_201402111036_0005_m_000001_0 does not have any open files.
+ at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1642)
+ at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1633)
+ at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1549)
+ at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:698)
+ at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
+ at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
+ at java.lang.reflect.Method.invoke(Method.java:606)
+ at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:573)
+ at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
+ at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
+ at java.security.AccessController.doPrivileged(Native Method)
+ at javax.security.auth.Subject.doAs(Subject.java:415)
+ at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
+ at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)
+ >org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /user/hadoop/output100/_temporary/_attempt_201402111036_0005_m_000001_0/part-m-00001 File does not exist. Holder DFSClient_attempt_201402111036_0005_m_000001_0 does not have any open files.
+ at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1642)
+ at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1633)
+ at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1549)
+ at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:698)
+ at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
+ at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
+ at java.lang.reflect.Method.invoke(Method.java:606)
+ at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:573)
+ at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
+ at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
+ at java.security.AccessController.doPrivileged(Native Method)
+ at javax.security.auth.Subject.doAs(Subject.java:415)
+ at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
+ at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)
+
+ at org.apache.hadoop.ipc.Client.call(Client.java:1067)
+ at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
+ at com.sun.proxy.$Proxy2.addBlock(Unknown Source)
+ at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
+ at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
+ at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
+ at java.lang.reflect.Method.invoke(Method.java:606)
+ at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:83)
+ at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
+ at com.sun.proxy.$Proxy2.addBlock(Unknown Source)
+ at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3633)
+ at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3493)
+ at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2700(DFSClient.java:2691)
+ at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2932)
20th Feb Logs
=============
stdout logs
01:16:11.046 main INFO WATER: ----- H2O started -----
01:16:11.049 main INFO WATER: Build git branch: master
01:16:11.049 main INFO WATER: Build git hash: 737f488bb2471b411818002e4d31591cf9a2c83b
01:16:11.049 main INFO WATER: Build git describe: nn-2-3383-g737f488-dirty
01:16:11.050 main INFO WATER: Build project version: 2.3.0.99999
01:16:11.050 main INFO WATER: Built by: 'root'
01:16:11.050 main INFO WATER: Built on: 'Wed Feb 19 05:30:54 UTC 2014'
01:16:11.050 main INFO WATER: Java availableProcessors: 2
01:16:11.052 main INFO WATER: Java heap totalMemory: 0.96 gb
01:16:11.052 main INFO WATER: Java heap maxMemory: 0.96 gb
01:16:11.052 main INFO WATER: Java version: Java 1.7.0_40 (from Oracle Corporation)
01:16:11.053 main INFO WATER: OS version: Linux 3.2.30-49.59.amzn1.x86_64 (amd64)
01:16:11.053 main INFO WATER: ICE root: '/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201402111036_0004/attempt_201402111036_0004_m_000001_0,/mnt1/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201402111036_0004/attempt_201402111036_0004_m_000001_0'
01:16:11.058 main INFO WATER: Possible IP Address: eth0 (eth0), 10.234.22.23
01:16:11.058 main INFO WATER: Possible IP Address: lo (lo), 127.0.0.1
01:16:11.095 main INFO WATER: Internal communication uses port: 54322
+ Listening for HTTP and REST traffic on http://10.234.22.23:54321/
EmbeddedH2OConfig: notifyAboutEmbeddedWebServerIpPort called (10.234.22.23, 54321)
EmbeddedH2OConfig: fetchFlatfile called
EmbeddedH2OConfig: fetchFlatfile returned
------------------------------------------------------------
10.254.108.58:54321
10.234.22.23:54321
------------------------------------------------------------
01:16:11.122 main INFO WATER: H2O cloud name: 'H2O_44859'
01:16:11.122 main INFO WATER: (v2.3.0.99999) 'H2O_44859' on /10.234.22.23:54321, discovery address /237.87.234.66:60759
01:16:11.123 main INFO WATER: If you have trouble connecting, try SSH tunneling from your local machine (e.g., via port 55555):
+ 1. Open a terminal and run 'ssh -L 55555:localhost:54321 had...@10.234.22.23'
+ 2. Point your browser to http://localhost:55555
01:16:11.128 main INFO WATER: Cloud of size 1 formed [/10.234.22.23:54321]
EmbeddedH2OConfig: notifyAboutCloudSize called (10.234.22.23, 54321, 1)
01:16:11.129 main INFO WATER: Log dir: '/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201402111036_0004/attempt_201402111036_0004_m_000001_0,/mnt1/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201402111036_0004/attempt_201402111036_0004_m_000001_0/h2ologs'
01:16:11.131 main INFO WATER: H2O was started by Hadoop; inheriting HDFS library from mapper task.
01:16:14.375 FJ-10-1 INFO WATER: Cloud of size 2 formed [/10.234.22.23:54321, /10.254.108.58:54321]
EmbeddedH2OConfig: notifyAboutCloudSize called (10.234.22.23, 54321, 2)
02:14:16.969 Thread-0 INFO WATER: Network congestion: TCPcall Connection reset by peer, AB=[AB write 2nd /10.254.108.58:54321 TCP 0 <= 0 <= 65536 <= 65536], for task#4, waiting and retrying...
02:14:16.969 #P-Accept ERRR WATER: IO error on TCP port 54322:
+ water.AutoBuffer$TCPIsUnreliableException
+ at water.AutoBuffer.getImpl(AutoBuffer.java:462)
+ at water.AutoBuffer.getSz(AutoBuffer.java:443)
+ at water.AutoBuffer.getPort(AutoBuffer.java:784)
+ at water.AutoBuffer.<init>(AutoBuffer.java:108)
+ at water.TCPReceiverThread.run(TCPReceiverThread.java:50)
02:14:17.471 Thread-0 INFO WATER: Network congestion: TCPcall Connection refused, AB=[AB write first /10.254.108.58:54321 TCP 0 <= 0 <= 5312 <= 65536], for task#4, waiting and retrying...
02:14:17.972 Thread-0 INFO WATER: Network congestion: TCPcall Connection refused, AB=[AB write first /10.254.108.58:54321 TCP 0 <= 0 <= 5312 <= 65536], for task#4, waiting and retrying...
stderr logs
[WARN] MetricsSystemImpl - Source name ugi already exists!
Exception in thread "Thread for syncLogs" java.lang.NullPointerException
    at org.apache.hadoop.mapred.TaskLogAppender.flush(TaskLogAppender.java:67)
    at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:357)
    at org.apache.hadoop.mapred.Child$3.run(Child.java:141)
02:14:16.578 Thread-13 INFO WATER: [WARN] DFSClient - DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /user/hadoop/h2oout/_temporary/_attempt_201402111036_0004_m_000001_0/part-m-00001 File does not exist. Holder DFSClient_attempt_201402111036_0004_m_000001_0 does not have any open files.
+ at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1642)
+ at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1633)
+ at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1549)
+ at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:698)
+ at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
+ at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
+ at java.lang.reflect.Method.invoke(Method.java:606)
+ at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:573)
+ at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
+ at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
+ at java.security.AccessController.doPrivileged(Native Method)
+ at javax.security.auth.Subject.doAs(Subject.java:415)
+ at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
+ at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)
+
+ at org.apache.hadoop.ipc.Client.call(Client.java:1067)
+ at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
+ at com.sun.proxy.$Proxy2.addBlock(Unknown Source)
+ at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
+ at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
+ at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
+ at java.lang.reflect.Method.invoke(Method.java:606)
+ at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:83)
+ at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
+ at com.sun.proxy.$Proxy2.addBlock(Unknown Source)
+ at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3633)
+ at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3493)
+ at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2700(DFSClient.java:2691)
+ at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2932)
02:14:16.615 Thread-13 INFO WATER: [WARN] DFSClient - Error Recovery for block null bad datanode[0] nodes == null
02:14:16.615 Thread-13 INFO WATER: [WARN] DFSClient - Could not get block locations. Source file "/user/hadoop/h2oout/_temporary/_attempt_201402111036_0004_m_000001_0/part-m-00001" - Aborting...
02:14:17.760 #n thread INFO WATER: [WARN] Task - Parent died. Exiting attempt_201402111036_0004_m_000001_0
I see you are using Hadoop version 1.0.3.
One thing you may notice is that we may stress HDFS harder than other MapReduce workloads do.
I noticed a warning about TCP in your logs, and that got me thinking about this issue.
Hadoop 1.0.3 is from May 2012, if I remember right.
I would recommend the most recent Hadoop version if you're experimenting; there may be underlying issues that were resolved in later releases.
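You can double-check what the cluster is actually running with the stock Hadoop CLI (nothing H2O-specific here):

    # Print the Hadoop version, build hash, and build date for this node
    hadoop version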
Let us know if you have more info.
Oh, another thing: I see your failure occurred while H2O was doing a GBM (was this a long-running GBM?).
During the algorithms, there shouldn't be any HDFS traffic; all HDFS traffic should occur only while parsing datasets.
It got me thinking that maybe H2O is consuming all the resources on your system (perhaps due to thrashing with the small heaps), and that causes performance problems for the NameNode and so on, even though H2O is not doing any HDFS work.
I would recommend checking where your TaskTrackers are (assuming this is a MapReduce job), and making sure they are not on the same machine as the NameNode.
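One quick way to check that (just a sketch; the hostnames are placeholders for your cluster, and it assumes you can SSH to each node and have a JDK on them) is to list the Hadoop daemons on every machine with jps:

    # Show which Hadoop daemons run on which host (hostnames are examples)
    for host in master worker1 worker2; do
      echo "== $host =="
      ssh "$host" jps | egrep 'NameNode|JobTracker|TaskTracker|DataNode'
    done

If NameNode and TaskTracker turn up on the same host, the H2O mapper there will be competing with the NameNode for CPU and memory.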
And the heaps should be sized relative to your dataset.
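For example, when launching through the h2odriver, the per-mapper heap is set with -mapperXmx. This is only a sketch of the command shape; the jar names, output directory, node count, and heap size are placeholders for your setup:

    # Launch a 2-node H2O cloud with a 4 GB heap per mapper (values are examples;
    # a heap of a few times your on-disk data size is a reasonable starting point)
    hadoop jar h2odriver.jar water.hadoop.h2odriver \
        -libjars h2o.jar \
        -mapperXmx 4g \
        -nodes 2 \
        -output hdfsOutputDirName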
Are you using the latest h2o.jar from master? I'm wondering about the memory usage of the GBM algo in the h2o.jar you're using.
thanks
-kevin
I'm always worried about resource contention (which YARN is helping address), because I tend to test with large datasets and my machines are a bit oversubscribed. But yes, your setup sounds okay: well spread out, with JVM heaps that are fine for the small datasets.
There are so many Hadoop versions from the last two years that people are using (for various good reasons) that it becomes difficult to be expert in, and test on, all of them.
So all feedback is useful. I'm really interested in Hadoop install and operation feedback, since everyone is trying to make the idea of H2O-on-Hadoop-as-a-service simpler to install and manage. But I notice we sometimes don't have enough feedback on the installations people want to set up, so all feedback and experiences are good.
I'd be interested in your experiences with repeated parsing of files from HDFS.
We have many interesting parse features: we can parse .gz files, we can do multi-file pattern matching so that multiple files are parsed into a single key, and we can use a separate file for the header names (vs. the data).
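To sketch what that looks like through the same 2.x REST API your session log is hitting (the host, HDFS paths, and key names are placeholders, and I'm going from memory on the parameter names, so double-check them against your build):

    # Import files from HDFS, then parse everything matching a pattern into one key
    curl 'http://10.234.22.23:54321/2/ImportFiles2?path=hdfs://namenode/user/hadoop/data/'
    curl 'http://10.234.22.23:54321/2/Parse2?source_key=hdfs://namenode/user/hadoop/data/part-*&destination_key=mydata.hex'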
-kevin