Hi Michal,
I am trying to get Sparkling Water working using the newest code on GitHub. When I run it locally, I can (almost) get through the example, but it fails on training the model, apparently because the executors are running out of memory. I'm not very familiar with Spark's local execution mode, but neither environment variables nor the --executor-memory option seem to let me increase the executor memory size, which is set to 512 MB. I can increase the value in my MASTER configuration (e.g. local-cluster[3, 2, 2048]), but according to the log, this only increases the RAM of the workers, not of the executors themselves.
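To spell out how I read that master string, here is a minimal Python sketch. The helper name parse_local_cluster is made up for illustration and is not part of Spark. As far as I understand it, the three fields are the number of workers, the cores per worker, and the memory per worker in MB, so the last field sizes the workers and says nothing about the executors:

```python
import re

def parse_local_cluster(master):
    """Split a local-cluster master string into its three numeric fields.

    Hypothetical helper for illustration only; Spark does its own parsing.
    """
    pattern = r"local-cluster\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]"
    match = re.fullmatch(pattern, master)
    if match is None:
        raise ValueError(f"not a local-cluster master string: {master!r}")
    workers, cores, memory_mb = (int(group) for group in match.groups())
    return {
        "workers": workers,
        "cores_per_worker": cores,
        "memory_per_worker_mb": memory_mb,
    }

settings = parse_local_cluster("local-cluster[3, 2, 2048]")

# The executor size I see in the log (512 MB) is a separate setting,
# so raising memory_per_worker_mb alone leaves it unchanged.
executor_memory_mb = 512
unused_per_worker_mb = settings["memory_per_worker_mb"] - executor_memory_mb
print(settings, unused_per_worker_mb)
```

If that reading is right, each 2048 MB worker still hosts a 512 MB executor, which matches what the log shows.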
When I run on a real cluster, the --executor-memory parameter works as expected again; however, H2OContext.start fails. It runs through all the initialization on the cluster and even displays the H2O startup message block (----- H2O started (client) ----- plus info on ports/logging), but then it stalls in water.H2O.waitForCloudSize until the timeout hits, and I get a "Cloud size under 12" exception. I logged on to one of the workers and tailed the application stderr log, which shows lots of these:
java.lang.NullPointerException
    at water.TypeMap.NewFreezable(TypeMap.java:176)
    at water.AutoBuffer.get(AutoBuffer.java:665)
    at water.RPC.remote_exec(RPC.java:45)
    at water.FJPacket.compute2(FJPacket.java:23)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:200)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:429)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
Any ideas?
Thanks!
-ben