We're running tests on AWS using H2O. The job trains a GBM model on our fairly large dataset, running on an AWS cluster of 100 to 200 nodes. In this configuration a single training run takes 2 to 3 hours.
What we found is that H2O is unstable on AWS. More often than not the job would fail at around 80 or 90% done. I was able to finish the job only once; the other 6 or 7 times it failed with different messages: "Connection refused" or "Connection reset by peer". In the Web UI it would say: "Error fetching job. Error calling GET" and "Could not connect to H2O. Your cloud is currently unresponsive".
Our guess is that the connection to one of the nodes is lost, and then the entire H2O cluster fails; one cannot establish a connection to it at all. That unfortunately makes H2O unusable for us.
Denis
H2O was launched using a Hadoop Step, with the appropriate h2odriver.jar and the arguments: -nodes 250 -mapperXmx 2g -timeout 6000 -disown -flow_dir /s3/flows
Thanks for your answer. The reason for that many nodes is not the total amount of RAM for the cluster; it is to get a large number of cores. Sorry I didn't specify that.
My data size is only about 10GB. I'd like to use lots of cores to speed up training time. With 250 nodes, each with 4 cores, I get 1,000 cores in total. With this many cores I can get the job done in about 1.5 hours. I used m3.xlarge nodes for this test.
For EMR there is also the m3.2xlarge node type, which has 8 cores. I tried running the job with 125 of those and still had the same stability problems.
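A quick back-of-the-envelope check that both setups give the same cluster-wide core count (a minimal sketch; the per-instance core counts are the figures mentioned above):

```python
# Cluster-wide core count for the two EMR setups discussed in the thread.
def total_cores(nodes: int, cores_per_node: int) -> int:
    return nodes * cores_per_node

print(total_cores(250, 4))  # 250 x m3.xlarge  -> 1000
print(total_cores(125, 8))  # 125 x m3.2xlarge -> 1000
```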
Thanks for the suggestion. I'll try EC2 types with more cores.
Although, Amazon is pretty good at scaling the price up with instance size. For example, m3.xlarge has 4 vCPUs and costs $0.266 per hour, m3.2xlarge has 8 vCPUs and costs $0.532 per hour, and so on. So I'm expecting a similar overall cost.
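The per-core price works out the same for both types, which is why the overall cost shouldn't change much. A quick check using the on-demand prices quoted above (prices are from this thread and may be outdated):

```python
# Price per core-hour for the two EMR instance types mentioned above.
prices = {
    "m3.xlarge":  (4, 0.266),   # (vCPUs, $/hour)
    "m3.2xlarge": (8, 0.532),
}
for name, (cores, price) in prices.items():
    print(f"{name}: ${price / cores:.4f} per core-hour")  # both come to $0.0665

# At that rate, a 1,000-core cluster running the ~1.5-hour job costs:
print(f"${1000 * (0.266 / 4) * 1.5:.2f}")  # $99.75
```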
I see that c3.8xlarge is available in EMR, which has 32 cores. So maybe I'll try running 32 of those. I requested a limit increase for those on our AWS account. I'll test it and can report results here if anyone is interested.
Regarding your question, I'm trying to speed up training time. Our data is pretty large now and the number of parameters is substantial too, and both of these dimensions will keep growing. We'd like to be able to build models within a few hours, not much more. So, like I said, currently we need a cluster with about 1,000 cores to finish the task within 2 hours.
It would be nice if 0xdata spent more time on H2O core development to improve stability. Their models are OK, and they claim that their platform is scalable, but my tests show that this is true only to some extent: large clusters will have failures. From my experience, hundreds of nodes is not a lot, and the cloud technologies available make it possible to use thousands or more, whatever the task needs; we were planning for that.
H2O has a competitive advantage in terms of speed, and the interface is very nice. I tried building a similar GBM model with Spark MLlib on our dataset; the latter is about 6 times slower, but at least it's resilient to such failures.