The message from the R client is similar, reporting a maximum of 96 allowed cores:
> h2oClient = h2o.init(ip="192.10.10.80", port=54321, strict_version_check = FALSE) # Ethernet IP here
Connection successful!

R is connected to the H2O cluster (in client mode):
    H2O cluster uptime:         12 minutes 53 seconds
    H2O cluster version:        3.8.2.3
    H2O cluster name:           sparkling-water-ctsats_-856008650
    H2O cluster total nodes:    4
    H2O cluster total memory:   15.33 GB
    H2O cluster total cores:    96
    H2O cluster allowed cores:  96
    H2O cluster healthy:        TRUE
    H2O Connection ip:          192.10.10.80
    H2O Connection port:        54321
    H2O Connection proxy:       NA
    R Version:                  R version 3.2.5 (2016-04-14)
Any ideas?
Many thanks in advance.
Hi Mateusz,
Thanks for the fast response & the info provided.
Firstly, the behavior you describe does not account for the 48 (i.e. 24 + 24) cores reported as "available" on host 192.168.1.5: that host has a total of only 24 cores. However, once H2O has established 2 executors on it, it goes on to assume that all of the host's cores are available to each of its 2 executors, hence 2 x 24 = 48 for host 192.168.1.5 (which has only 24!). This clearly seems wrong.
In other words, the reported number 96 is reached as:
- 24 cores in host 192.168.1.3 (with 24 cores totally available)
- 24 cores in host 192.168.1.4 (with 24 cores totally available)
- 48 cores in host 192.168.1.5 (with only 24 cores totally available)
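The double counting above can be sketched in a few lines of Python (a hypothetical reconstruction of the accounting, using the executor/host layout from this thread): each executor appears to report its host's *total* core count, so a host running two executors gets counted twice.

```python
# Hypothetical sketch of how the "allowed cores" total seems to be computed:
# each executor contributes the total core count of the host it landed on.

# executor -> host it landed on (layout from this thread: 2 executors on .5)
executors = ["192.168.1.3", "192.168.1.4", "192.168.1.5", "192.168.1.5"]

# physical cores per host
host_cores = {"192.168.1.3": 24, "192.168.1.4": 24, "192.168.1.5": 24}

reported_total = sum(host_cores[h] for h in executors)
physical_total = sum(host_cores.values())

print(reported_total)  # 96 -- what H2O reports
print(physical_total)  # 72 -- the actual number of physical cores
```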
To stress the error, here is what happens if I ask for 12 executors, i.e.:
$ ./sparkling-shell --num-executors 12 --executor-cores 2 --executor-memory 2g
I can see where the confusion starts -> "cores" is just the maximum degree of parallelism we will launch on each node; it's not the actual, physical number of cores in the cloud.
This behaviour is the same as with Spark: try running more executors than you have nodes, on machines with a lot of cores per node.
Here's a screenshot where I started 6 workers locally with 16 CPUs each, and Spark thinks I have 96 cores:
http://oi64.tinypic.com/14o3u54.jpg
Regards,
Mateusz
As Tom mentioned, by default we take the number of cores from the underlying machine (as defined by the OS). Don't worry, we're still running inside the container, so you should be OK :-)
The problem is -> we might spin up way too many threads for a container. I will talk with the others about changing the default value for H2O's nthreads when running on YARN. Maybe we can set it to 1, like Spark does?
As for setting nthreads to the same value as num-executors, this might be even more confusing to the user, since we do not share the same threadpool with Spark. In such a case, if you had 8 cores and set num-executors to 4, both Spark and H2O would say they are using 4 cores, but if you checked the actual CPU usage, all 8 cores might be in use (4 for Spark's threadpools and 4 for H2O). That's why I'd simply keep both --num-executors and spark.ext.h2o.nthreads and ask the user to set them, as the right values will vary on a case-by-case basis.
Separating those two values also has another benefit: you can assign a small number of cores to Spark (e.g. when you only want to do some simple ETL with Spark) and assign more to H2O (e.g. because you want to do some CPU-heavy computations).
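As a sketch of that split (the property name spark.ext.h2o.nthreads comes from the discussion above; the specific numbers are hypothetical, and the exact syntax may vary by Sparkling Water version), the two settings could be passed independently like this:

```shell
# Hypothetical example: Spark executors get only 2 cores each for light
# ETL work, while H2O is allowed 8 threads per node for heavy computation.
./sparkling-shell \
  --num-executors 3 \
  --executor-cores 2 \
  --executor-memory 2g \
  --conf spark.ext.h2o.nthreads=8
```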
I'll also ask the other devs what they think about renaming "cores" to "threads" in the UI, as it is really confusing.
Mateusz