H2O stability on AWS


deni...@gmail.com

Nov 23, 2015, 4:23:23 PM
to H2O Open Source Scalable Machine Learning - h2ostream
Hi,

We're running tests on AWS using H2O. The job is training a GBM model on our fairly large dataset. The job runs on an AWS cluster of 100 or 200 nodes. In this configuration a single training run takes 2 or 3 hours.

What we found is that H2O is unstable on AWS. More often than not the job would fail at around 80 or 90% done. I was able to finish the job only once. The other 6 or 7 times it would fail with different messages: "Connection refused" or "Connection reset by peer". In the WebUI it would say: "Error fetching job", "Error calling GET", and "Could not connect to H2O. Your cloud is currently unresponsive".

Our guess is that the connection to one of the nodes is lost and the entire H2O cluster then fails; after that, one cannot establish a connection to it at all. That unfortunately makes H2O unusable for us.
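When the cluster becomes unresponsive like this, one quick way to check its state is to probe the cluster status endpoint directly (a sketch; `/3/Cloud` is the H2O-3 REST API status path, and the host below is a placeholder):

```shell
# Probe any H2O node's status endpoint (host is a placeholder).
# A healthy cloud reports "cloud_healthy": true and the expected
# "cloud_size"; a dropped or hung node shows up here before jobs fail.
curl -s http://<node-private-ip>:54321/3/Cloud
```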

Denis

deni...@gmail.com

Nov 23, 2015, 4:39:00 PM
to H2O Open Source Scalable Machine Learning - h2ostream, deni...@gmail.com
More details on the job. We were running H2O 3.2.0.3 on the Amazon EMR releases emr-4.0.0 and emr-4.1.0.

H2O was launched using a Hadoop Step:
using the appropriate jar, h2odriver.jar
with the arguments: -nodes 250 -mapperXmx 2g -timeout 6000 -disown -flow_dir /s3/flows
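For reference, the equivalent launch command for that Hadoop Step would look roughly like this (a sketch; the jar path is a placeholder, and only the arguments are taken from our configuration):

```shell
# Launch H2O on the EMR cluster via the Hadoop driver jar.
# The path to h2odriver.jar is a placeholder; arguments as in our setup:
# 250 mapper nodes, 2 GB heap each, 6000 s cluster-formation timeout,
# detach the driver after launch, and save Flows to S3.
hadoop jar h2odriver.jar \
    -nodes 250 \
    -mapperXmx 2g \
    -timeout 6000 \
    -disown \
    -flow_dir /s3/flows
```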

Tom Kraljevic

Nov 23, 2015, 6:34:21 PM
to deni...@gmail.com, H2O Open Source Scalable Machine Learning - h2ostream

Hi,

A huge number of tiny nodes is not a recommended configuration for H2O.
250 x 2 GB nodes will likely take up all the memory just for communication overhead.

If you are just getting started and trying to learn about your problem, I suggest trying a
cluster of 10 x 50 GB for starters and experimenting from there (assuming you wish to
keep the same overall 500 GB of memory).
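As a rough sketch of why that trade-off matters (the per-node overhead figure below is an illustrative assumption, not a measured number):

```python
# Same 500 GB of total cluster memory, but each node pays a fixed
# per-node slice for JVM and communication overhead. The 1 GB figure
# is an illustrative assumption, not an H2O benchmark.
def usable_memory_gb(nodes, mem_per_node_gb, overhead_per_node_gb=1.0):
    """Memory left for data after per-node overhead is subtracted."""
    return nodes * max(mem_per_node_gb - overhead_per_node_gb, 0)

print(usable_memory_gb(250, 2))   # 250.0 -- half the cluster lost to overhead
print(usable_memory_gb(10, 50))   # 490.0 -- overhead is almost negligible
```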

Thanks,
Tom

deni...@gmail.com

Nov 23, 2015, 7:02:50 PM
to H2O Open Source Scalable Machine Learning - h2ostream, deni...@gmail.com
Hi Tom,

Thanks for your answer. The reason for that many nodes is not the total amount of RAM for the cluster; the reason is to have a large number of cores. Sorry that I didn't specify that.


My data size is only about 10 GB. I'd like to be able to use lots of cores to speed up training time. So, with 250 nodes, each with 4 cores, I get a total of 1,000 cores. With this many cores I can get the job done in about 1.5 hours. I used m3.xlarge nodes for this test.

For EMR there is also the m3.2xlarge node type available, which has 8 cores. I tried running the job with 125 of those and still had the same stability problems.

Nicholas Sharkey

Nov 24, 2015, 8:40:35 AM
to H2O Open Source Scalable Machine Learning - h2ostream, deni...@gmail.com
Hi Denis, 

In my experience what Tom is saying is correct: when possible, try fewer, more powerful nodes vs. more, less powerful nodes. The time spent in communication will take away from H2O's performance. Also, from a billing standpoint (at least in my use cases) it's cheaper to spin up more powerful machines to finish in an hour, since 1 hour and 1 min = 2 billable hours (which gets expensive if you have hundreds of nodes).

All that said, are you testing hundreds of nodes just to test hundreds of nodes, or are you trying to speed up training time? 

Nick 

deni...@gmail.com

Nov 24, 2015, 9:16:25 AM
to H2O Open Source Scalable Machine Learning - h2ostream, deni...@gmail.com
Hi Nick,

Thanks for the suggestion. I'll try to use EC2 types with more cores.

Amazon is pretty good at scaling up the price for the instances, though. For example, m3.xlarge has 4 CPUs and costs $0.266 per hour, m3.2xlarge has 8 CPUs and costs $0.532 per hour, and so on. So, I'm expecting a similar overall cost.
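The arithmetic behind that expectation, using the prices quoted above:

```python
# On-demand prices from the post: cost scales linearly with vCPU count,
# so the price per core-hour -- and hence the cost of a fixed amount of
# compute -- is the same regardless of instance size.
instances = {
    "m3.xlarge":  {"cores": 4, "price_per_hour": 0.266},
    "m3.2xlarge": {"cores": 8, "price_per_hour": 0.532},
}
for name, spec in instances.items():
    per_core = spec["price_per_hour"] / spec["cores"]
    print(f"{name}: ${per_core:.4f} per core-hour")
# Both print $0.0665 per core-hour.
```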

I see that there is a c3.8xlarge available in EMR which has 32 cores, so maybe I'll try running 32 of those. I've requested a limit increase for those on our AWS account. I will test it and can report the results here if anyone is interested.

Regarding your question, I'm trying to speed up training time. Our data is pretty large now and the number of parameters is substantial too. Both of these dimensions will be growing. We'd like to be able to build models within a few hours, not much more. So, like I said, currently we need a cluster with about 1,000 cores to finish the task within 2 hours.

It would be nice if 0xdata spent more time on H2O core development targeting stability. Their models are OK, and they claim that their platform is scalable, but my tests show that this is true only to some extent: large clusters will have failures. From my experience, hundreds of nodes is not a lot, and the cloud technologies available make it possible to use thousands or more, whatever the task needs, and we were planning for that.

H2O has a competitive advantage in terms of speed, and the interface is very nice. I tried building a similar GBM model on our dataset using Spark MLlib; the latter is about 6 times slower, but at least it's resilient to cluster failures.

Tom Kraljevic

Nov 24, 2015, 1:18:12 PM
to deni...@gmail.com, H2O Open Source Scalable Machine Learning - h2ostream, Jeff Gambera

I’ll make another suggestion, which is to try using dedicated nodes rather than using EMR.

EMR is not an environment we encounter much or test against.
(People generally have their data in S3.)

Jeff, can you please send a link to our EC2 helper scripts?

(You are correct that H2O needs the underlying hosts to be up and functioning to work properly.)


Thanks,
Tom

Jeff Gambera

Nov 24, 2015, 1:33:56 PM
to Tom Kraljevic, deni...@gmail.com, H2O Open Source Scalable Machine Learning - h2ostream
Greetings,

The EC2 scripts are in the public repo on GitHub.


The direct link to all the scripts is


If you have any issues, please contact me and we will work through getting your cluster started.

We also publish an AMI base image publicly.

It is: ami-34fd845e

Thanks

Denis Perevalov

Nov 24, 2015, 5:43:35 PM
to H2O Open Source Scalable Machine Learning - h2ostream, to...@h2o.ai, deni...@gmail.com
Still trying to figure out how to run those EC2 scripts on my data. The AMI is in the us-east-1 region, but my data is in us-west-2. I also need some security.

Meanwhile, I encountered a different error when launching H2O with lots of nodes:

Exception in thread "Thread-912" Exception in thread "Thread-921" Exception in thread "Thread-918" Exception in thread "Thread-915" Exception in thread "Thread-919" Exception in thread "Thread-904" Exception in thread "Thread-895" Exception in thread "Thread-903" Exception in thread "Thread-899" MapperToDriverMessage: Read invalid type ( ) from socket, ignoring...
MapperToDriverMessage: Read invalid type ( ) from socket, ignoring...
MapperToDriverMessage: Read invalid type ( ) from socket, ignoring...
MapperToDriverMessage: Read invalid type ( ) from socket, ignoring...
MapperToDriverMessage: Read invalid type ( ) from socket, ignoring...
Exception in thread "Thread-898" 


I'm running the latest stable version of H2O now.

Tom Kraljevic

Nov 24, 2015, 6:40:51 PM
to Denis Perevalov, H2O Open Source Scalable Machine Learning - h2ostream, Jeff

Those scripts are oriented toward the “standalone” h2o.jar, not the “hadoop” h2odriver.jar.


So when starting H2O on a bare EC2 instance, you should not see any “MapperToDriverMessage” output.  That’s for Hadoop.

It’s also important to note that if your filesystem is S3, you really want to use the standalone h2o.jar.
It has the right version of hadoop (CDH4, actually) that we test for optimized S3 access.
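A minimal standalone launch, for anyone following along (a sketch, not from this thread; heap size, cluster name, and flatfile path are placeholders):

```shell
# Run on each EC2 instance. flatfile.txt lists the private IP:port of
# every node, one per line; all nodes must use the same -name to form
# one cloud. The heap size, name, and paths here are placeholders.
java -Xmx50g -jar h2o.jar \
    -name my-h2o-cluster \
    -port 54321 \
    -flatfile flatfile.txt
```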


Tom

Denis Perevalov

Nov 30, 2015, 11:58:06 AM
to H2O Open Source Scalable Machine Learning - h2ostream, deni...@gmail.com, je...@0xdata.com
After observing the AWS cluster, the problem appears to be that some nodes are shut down by AWS. We use the "Spot Instances" option for EC2 to reduce the overall cost, which makes each EC2 instance non-persistent. I do see that occasionally AWS will shut down a node and then start a new one after some time.

This kind of behavior is fine in Spark, since it is resilient to these kinds of problems; moreover, it's pretty easy to dynamically expand or shrink the cluster. For H2O it is fatal, as the entire H2O cluster goes down after losing even a single node.

I think you guys need to work on H2O resiliency, especially since you promote H2O for working with big data.

Cheers,
Denis

Denis Perevalov

Nov 30, 2015, 3:18:15 PM
to H2O Open Source Scalable Machine Learning - h2ostream, deni...@gmail.com, je...@0xdata.com
Also, in standalone mode I have a problem reading a CSV file: it quits midway through reading. I'm using the latest H2O version, 3.6.0.8. The CSV file is not too big, maybe 5 GB. I don't know why the cluster name is "hadoop"; Hadoop was not installed in this run. Here is the output that I get:

> cluster <- h2o.init(port = 54321)
Successfully connected to http://127.0.0.1:54321/ 

R is connected to the H2O cluster: 
    H2O cluster uptime:         2 minutes 43 seconds 
    H2O cluster version:        3.6.0.8 
    H2O cluster name:           hadoop 
    H2O cluster total nodes:    9 
    H2O cluster total memory:   117.62 GB 
    H2O cluster total cores:    288 
    H2O cluster allowed cores:  288 
    H2O cluster healthy:        TRUE 

> model_matrix_test.hex = h2o.importFile(path="/s3-ml/Data/modelmatrix_test2.csv", sep=",")
  |===================================================================================                                             |  65%

Got exception 'class java.lang.RuntimeException', with msg 'water.DException$DistributedException: from /172.31.47.21:54321; by class water.parser.ParseDataset$MultiFileParseTask; class water.DException$DistributedException: from ec2-52-35-40-21.us-west-2.compute.amazonaws.com/172.31.35.50:54321; by class water.parser.ParseDataset$MultiFileParseTask; class water.DException$DistributedException: from ec2-52-34-231-25.us-west-2.compute.amazonaws.com/172.31.41.61:54321; by class water.parser.ParseDataset$MultiFileParseTask; class water.DException$DistributedException: from ec2-52-35-6-123.us-west-2.compute.amazonaws.com/172.31.42.64:54321; by class water.parser.ParseDataset$MultiFileParseTask; class water.DException$DistributedException: from ec2-52-35-40-42.us-west-2.compute.amazonaws.com/172.31.35.121:54321; by class water.parser.ParseDataset$MultiFileParseTask$DistributedParse; class water.DException$DistributedException: from ec2-52-35-29-77.us-west-2.compute.amazonaws.com/172.31.47.21:54321; by class water.parser.ParseDataset$MultiFileParseTask$DistributedParse; class water.DException$DistributedException: from ec2-52-35-40-21.us-west-2.compute.amazonaws.com/172.31.35.50:54321; by class water.parser.ParseDataset$MultiFileParseTask$DistributedParse; class water.DException$DistributedException: from ec2-52-34-231-25.us-west-2.compute.amazonaws.com/172.31.41.61:54321; by class water.parser.ParseDataset$MultiFileParseTask$DistributedParse; class java.lang.NullPointerException: null'
java.lang.RuntimeException: water.DException$DistributedException: from /172.31.47.21:54321; by class water.parser.ParseDataset$MultiFileParseTask; class water.DException$DistributedException: from ec2-52-35-40-21.us-west-2.compute.amazonaws.com/172.31.35.50:54321; by class water.parser.ParseDataset$MultiFileParseTask; class water.DException$DistributedException: from ec2-52-34-231-25.us-west-2.compute.amazonaws.com/172.31.41.61:54321; by class water.parser.ParseDataset$MultiFileParseTask; class water.DException$DistributedException: from ec2-52-35-6-123.us-west-2.compute.amazonaws.com/172.31.42.64:54321; by class water.parser.ParseDataset$MultiFileParseTask; class water.DException$DistributedException: from ec2-52-35-40-42.us-west-2.compute.amazonaws.com/172.31.35.121:54321; by class water.parser.ParseDataset$MultiFileParseTask$DistributedParse; class water.DException$DistributedException: from ec2-52-35-29-77.us-west-2.compute.amazonaws.com/172.31.47.21:54321; by class water.parser.ParseDataset$MultiFileParseTask$DistributedParse; class water.DException$DistributedException: from ec2-52-35-40-21.us-west-2.compute.amazonaws.com/172.31.35.50:54321; by class water.parser.ParseDataset$MultiFileParseTask$DistributedParse; class water.DException$DistributedException: from ec2-52-34-231-25.us-west-2.compute.amazonaws.com/172.31.41.61:54321; by class water.parser.ParseDataset$MultiFileParseTask$DistributedParse; class java.lang.NullPointerException: null
at water.MRTask.getResult(MRTask.java:505)
at water.MRTask.doAll(MRTask.java:399)
at water.parser.ParseDataset.parseAllKeys(ParseDataset.java:207)
at water.parser.ParseDataset.access$000(ParseDataset.java:30)
at water.parser.ParseDataset$ParserFJTask.compute2(ParseDataset.java:149)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1069)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
Caused by: water.DException$DistributedException: from /172.31.47.21:54321; by class water.parser.ParseDataset$MultiFileParseTask; class water.DException$DistributedException: from ec2-52-35-40-21.us-west-2.compute.amazonaws.com/172.31.35.50:54321; by class water.parser.ParseDataset$MultiFileParseTask; class water.DException$DistributedException: from ec2-52-34-231-25.us-west-2.compute.amazonaws.com/172.31.41.61:54321; by class water.parser.ParseDataset$MultiFileParseTask; class water.DException$DistributedException: from ec2-52-35-6-123.us-west-2.compute.amazonaws.com/172.31.42.64:54321; by class water.parser.ParseDataset$MultiFileParseTask; class water.DException$DistributedException: from ec2-52-35-40-42.us-west-2.compute.amazonaws.com/172.31.35.121:54321; by class water.parser.ParseDataset$MultiFileParseTask$DistributedParse; class water.DException$DistributedException: from ec2-52-35-29-77.us-west-2.compute.amazonaws.com/172.31.47.21:54321; by class water.parser.ParseDataset$MultiFileParseTask$DistributedParse; class water.DException$DistributedException: from ec2-52-35-40-21.us-west-2.compute.amazonaws.com/172.31.35.50:54321; by class water.parser.ParseDataset$MultiFileParseTask$DistributedParse; class water.DException$DistributedException: from ec2-52-34-231-25.us-west-2.compute.amazonaws.com/172.31.41.61:54321; by class water.parser.ParseDataset$MultiFileParseTask$DistributedParse; class java.lang.NullPointerException: null
at water.persist.PersistManager.load(PersistManager.java:144)
at water.Value.loadPersist(Value.java:226)
at water.Value.memOrLoad(Value.java:123)
at water.Value.get(Value.java:137)
at water.fvec.Vec.chunkForChunkIdx(Vec.java:835)
at water.fvec.ByteVec.chunkForChunkIdx(ByteVec.java:20)
at water.fvec.ByteVec.chunkForChunkIdx(ByteVec.java:16)
at water.MRTask.compute2(MRTask.java:639)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1069)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:914)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:979)
... 2 more

Denis Perevalov

Dec 1, 2015, 3:36:46 PM
to H2O Open Source Scalable Machine Learning - h2ostream
Just so you know, my name is Denis Perevalov. I represent a company called Milliman. We were performing a POC with Amazon where we tried to use AWS on our dataset, using both Spark and H2O, to build some predictive models.

Like I mentioned, we found that H2O currently has an advantage over Spark's MLlib in terms of speed and overall interface, but it lacks resiliency to node failures. So our current recommendation to Amazon would be to wait for H2O improvements.

If you guys were to fix this, I believe Amazon would be willing to work with you on adding it to their Elastic Map Reduce service.
That would significantly increase the interest and overall usage of H2O.  They already have Spark in EMR.

Let me know whether you'd be interested in working with Amazon, or whether you'd be interested in fixing the problem.