H2O Support - YARN and H2O


Max Schloemer

Aug 10, 2015, 5:18:37 PM
to H2O Open Source Scalable Machine Learning - h2ostream, nachike...@linaro.org

Support,


Please review this note from Nachiket Bhoyar and reply to all with troubleshooting or resolution.

 

Email id: nachike...@linaro.org

First name: Nachiket

Last name: Bhoyar

 

Thanks,

Max

 

 

Begin Chat Paste:

 

Hello, I am trying out H2O with HDP 2.2 on a 2-node cluster. Every time I try to start H2O on both nodes, I get an error saying 'ERROR: Unable to start any H2O nodes; please contact your YARN administrator. A common cause for this is the requested container size (5.5 GB) exceeds the following YARN settings'. How do I tackle this problem? I have configured 45 GB of memory for the cluster.

 

Hi Max, Here is the command line I used for running H2O:

../bin/hadoop jar h2odriver.jar -nodes 2 -mapperXmx 5g -timeout 1800 -network '192.101.0.0/16' -output hdfsOutput_h2o_14

 

 

It works on a single node with -mapperXmx of 9 GB. It worked once at 10 GB, but it has been failing ever since that one attempt.

 

End Chat Paste:
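
For context on the 5.5 GB figure in that error: h2odriver asks YARN for a container of -mapperXmx plus a safety overhead controlled by -extramempercent, which defaults to 10%, so 5g x 1.1 = 5.5 GB. A minimal sketch of the same launch with the overhead made explicit (flag values are illustrative; either raise the YARN container limits or shrink this request):

    # Container request = mapperXmx * (1 + extramempercent/100).
    # With the values below, YARN must allow 5.5 GB containers on each node.
    ../bin/hadoop jar h2odriver.jar \
        -nodes 2 \
        -mapperXmx 5g \
        -extramempercent 10 \
        -timeout 1800 \
        -network '192.101.0.0/16' \
        -output hdfsOutput_h2o_14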

 

Max Schloemer

Customer Engagement Manager

H2O.ai

O:  619.467.7016

C:  619.850.0578

m...@h2o.ai

www.h2o.ai

 

Save the date for H2O World 2015!

 

Parag Sanghavi

Aug 10, 2015, 6:05:37 PM
to Max Schloemer, H2O Open Source Scalable Machine Learning - h2ostream, nachike...@linaro.org
Hi Nachiket,


If this still fails, can you post the YARN application logs so we can troubleshoot further?

Thanks

Parag




--
Parag Sanghavi
Head of Customer Success
H2O.ai
(650) 303-4069

Nachiket Bhoyar

Aug 10, 2015, 6:23:26 PM
to Parag Sanghavi, Max Schloemer, H2O Open Source Scalable Machine Learning - h2ostream
Hey Parag,

I have set the following configuration in yarn-site.xml:
        <property>
                <name>yarn.scheduler.maximum-allocation-mb</name>
                <value>25600</value>
        </property>
        <property>
                <name>yarn.nodemanager.resource.memory-mb</name>
                <value>25600</value>
        </property>

I have attached the logs from the H2O run. No logs were created for the YARN application.

Thanks,
Nachiket

h2o_0011_6GB_2nodes_failure.txt
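
For what it's worth, one way to confirm the limits above are actually in effect is to ask YARN what each NodeManager is advertising; a quick sketch, assuming the commands run on a node with the cluster's client configuration:

    # List the NodeManagers known to the ResourceManager.
    yarn node -list -all
    # Show one node's advertised memory capacity (node ID comes from the list above).
    yarn node -status <node-id>

If the reported memory capacity comes back lower than 25600 MB, the NodeManagers have not picked up the new yarn-site.xml and need a restart.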

Parag Sanghavi

Aug 10, 2015, 6:26:39 PM
to Nachiket Bhoyar, Max Schloemer, H2O Open Source Scalable Machine Learning - h2ostream
Hi Nachiket,

If you type this on the node: yarn logs -applicationId application_1439230332572_0001

What do you get back?

Parag

Nachiket Bhoyar

Aug 10, 2015, 6:28:55 PM
to Parag Sanghavi, Max Schloemer, H2O Open Source Scalable Machine Learning - h2ostream
Hi Parag,

It says it does not exist:

[root@master hadoop]# bin/yarn logs -applicationId application_1439230332572_0001
15/08/10 13:29:05 INFO client.RMProxy: Connecting to ResourceManager at master/192.101.9.71:8032
/tmp/logs/root/logs/application_1439230332572_0001does not exist.
Log aggregation has not completed or is not enabled.

Thanks,
Nachiket
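
As an aside, the "Log aggregation has not completed or is not enabled." line usually means yarn.log-aggregation-enable is false, so yarn logs has nothing to fetch and the container logs stay on each NodeManager's local disk. A sketch of both routes (the path shown is a typical HDP default; check yarn.nodemanager.log-dirs for the real location):

    # Route 1: enable aggregation in yarn-site.xml and restart YARN:
    #   <property>
    #     <name>yarn.log-aggregation-enable</name>
    #     <value>true</value>
    #   </property>
    # Route 2: until then, read the container logs directly on each node:
    ls /hadoop/yarn/log/application_1439230332572_0001/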

Nachiket Bhoyar

Aug 11, 2015, 2:04:13 PM
to Parag Sanghavi, Max Schloemer, H2O Open Source Scalable Machine Learning - h2ostream
Hi Parag,

I tried the following and H2O worked:
Reboot the systems > wipe out all Hadoop logs > take down the eth1 network > restart Hadoop > start H2O on 2 nodes.

I suspect it has something to do with the networking on my cluster. I have two networks with different bandwidths (eth0 and eth1). It works fine on eth0; I am still investigating whether eth1 is the culprit. As of now, H2O is running fine on the 2-node cluster, and I have tried multiple times with different heap sizes.

Thanks,
Nachiket


Parag Sanghavi

Aug 11, 2015, 3:03:07 PM
to Nachiket Bhoyar, Max Schloemer, H2O Open Source Scalable Machine Learning - h2ostream
Hi Nachiket,

If you have multiple network interfaces, then you need to tell the H2O node which interface to bind to by specifying the network mask. See: http://h2o-release.s3.amazonaws.com/h2o/rel-simons/4/docs-website/h2o-docs/index.html

  • -network ###.##.##.#/##: Specify a network (where ###.##.##.#/## represents the IP address and subnet mask in CIDR notation). The IP-address discovery code binds to the first interface that matches one of the networks in the comma-separated list. To specify multiple networks, separate them with commas: -network 123.45.67.0/22,123.45.68.0/24. For example, 10.1.2.0/24 covers 256 possible addresses.
Parag
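
If it helps, you can check which subnet each interface actually carries before choosing the -network value; a quick sketch using the interface names from this thread:

    # Show the address and CIDR bound to each interface; the -network value
    # passed to h2odriver should match the subnet of the interface you want.
    ip addr show eth0
    ip addr show eth1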

Nachiket Bhoyar

Aug 11, 2015, 3:19:23 PM
to Parag Sanghavi, Max Schloemer, H2O Open Source Scalable Machine Learning - h2ostream
Hi Parag,

Yes, I do specify the network mask in all my attempts. 'eth1' on my cluster seems to be problematic. It had worked once last week. I will look into it.

Thanks,
Nachiket

Nachiket Bhoyar

Aug 19, 2015, 5:45:42 PM
to Parag Sanghavi, Max Schloemer, H2O Open Source Scalable Machine Learning - h2ostream
Hi Parag,

Here are the two problems identified during the enablement process:
1. The systems' clocks were out of sync, which caused an exception on the ResourceManager (the exception did not kill the ResourceManager process).
2. The root partition (where Hadoop is installed) was full, which corrupted the Hadoop logs directory and in turn caused the NodeManager to throw an exception (though it did not fail). The root partition filled up because /mnt held large files totaling about 46 GB; /mnt is where I mount extra data disks on my systems, and the large files were only uncovered after unmounting the disks.

These problems caused H2O not to work, but the error H2O threw was about YARN configuration, saying the configured memory was not enough. I think these messages should be changed, as they create confusion rather than pointing to the actual cause; maybe a line about also checking the Hadoop logs could be added.
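
For anyone hitting similar symptoms, a disk check along these lines would have surfaced the second problem quickly (standard commands; the paths are just the ones from this thread):

    # Is the root partition full?
    df -h /
    # What is eating the space? Note that files written under /mnt while the
    # data disks were unmounted are hidden once a disk is mounted on top of
    # them, so re-check after unmounting.
    du -sh /mnt/*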

Also, I noticed that the disk capacity shown in the cluster status in H2O Flow is not what I have set in Hadoop. I have set up 2.2 TB for each node, yet the cluster status shows a maximum of 406 GB per node. Why is that? Please let me know.

Thanks,
Nachiket

Tom Kraljevic

Aug 19, 2015, 6:12:04 PM
to Nachiket Bhoyar, Parag Sanghavi, Max Schloemer, H2O Open Source Scalable Machine Learning - h2ostream

Hi Nachiket,


When launched on Hadoop, I would expect H2O to report the disk size of the partition that mapred.local.dir lives on.

    // h2o-3/h2o-hadoop/h2o-mapreduce-generic/src/main/java/water/hadoop/h2omapper.java
    // ice_root is taken from the first entry of mapred.local.dir.
    String mapredLocalDir = conf.get("mapred.local.dir");
    String ice_root;
    if (mapredLocalDir.contains(",")) {
      ice_root = mapredLocalDir.split(",")[0];
    } else {
      ice_root = mapredLocalDir;
    }

ice_root is where H2O would write any temporary files if it wanted to (which, as of h2o-3.0.1.7, it doesn't do at all).
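
So the 406 GB shown in Flow is most likely the size of the partition holding the first mapred.local.dir entry, not the 2.2 TB data disks. A quick way to check on one of the nodes (the config path is a typical HDP default; the df argument is whatever directory the grep turns up):

    # Find the configured local dir(s); H2O reports the partition of the first one.
    grep -A1 'local.dir\|local-dirs' /etc/hadoop/conf/*-site.xml
    # Then check that partition's size.
    df -h /hadoop/yarn/local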


Thanks,
Tom

Nachiket Bhoyar

Sep 17, 2015, 5:38:28 PM
to Tom Kraljevic, Parag Sanghavi, Max Schloemer, H2O Open Source Scalable Machine Learning - h2ostream
Hello Tom,

Thanks for your reply. 

I have another question: when I try to parse input files larger than the available memory, H2O either throws an exception saying there is not enough memory or crashes during the parsing phase. Why is there a limit on input size, given that this is supposed to work with Big Data?

Please let me know.

Thanks,
Nachiket
--
Thanks,
Nachiket

Tom Kraljevic

Sep 17, 2015, 6:11:45 PM
to Nachiket Bhoyar, Tom Kraljevic, Parag Sanghavi, Max Schloemer, H2O Open Source Scalable Machine Learning - h2ostream

Hi,

Think of H2O as an in-memory processing engine for implementing fast iterative machine learning algorithms.

In-memory means it won't handle an unlimited amount of data at one time for model training.

Making the cluster behave better under these conditions is definitely something we are working on, but for now expect graceful failure in such cases, rather than the ability to read in an unlimited amount of data at once to train on.
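
For sizing, a commonly cited rule of thumb for H2O is to give the cluster roughly four times the memory of the raw data being parsed. A back-of-the-envelope sketch (node count and heap values here are illustrative):

    # ~20 GB of input data -> aim for ~80 GB of total mapper heap,
    # e.g. 4 nodes at 20 GB each:
    hadoop jar h2odriver.jar -nodes 4 -mapperXmx 20g -output hdfsOutput_h2o_sized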

Thanks
Tom

Sent from my iPhone

Nachiket Bhoyar

Sep 17, 2015, 6:12:45 PM
to Tom Kraljevic, Tom Kraljevic, Parag Sanghavi, Max Schloemer, H2O Open Source Scalable Machine Learning - h2ostream
Okay, that makes sense. Thank you!
--
Thanks,
Nachiket