"ERROR: Too many retries starting cloud", when trying to build h2o-dev

JH

Nov 22, 2014, 10:44:11 AM
to h2os...@googlegroups.com
Team,

I like h2o and I'm trying to introduce it at my workplace. I tried building h2o-dev on my enterprise cluster but kept running into the following error (after about 80% of the build is done):

Starting clouds...
+ CMD: java -Xmx4g -ea -jar /path/to/h2o-dev/build/h2o.jar -name H2O_runit_id_3348133 -baseport 40000
Waiting for H2O nodes to come up...
ERROR: Too many retries starting cloud.


Any thoughts on how I could fix this issue?

Thanks,
JH

Tom Kraljevic

Nov 22, 2014, 12:09:50 PM
to JH, h2os...@googlegroups.com

The test is failing to run in your environment for some reason.

You will need to look at the log files immediately above in the test output to figure out why

or

./gradlew build -x test

to skip tests and just do a build.

Tom


jay...@gmail.com

Nov 22, 2014, 1:19:28 PM
to h2os...@googlegroups.com, jay...@gmail.com
Thanks Tom. I was able to build it without running tests (./gradlew build -x test).

Here's the error log I got when I tried to build it with the tests:


12:43:44.187 [DEBUG] [org.gradle.process.internal.DefaultExecHandle] Changing state to: STARTING
12:43:44.187 [DEBUG] [org.gradle.process.internal.DefaultExecHandle] Waiting until process started: command 'python'.
12:43:44.190 [DEBUG] [org.gradle.process.internal.DefaultExecHandle] Changing state to: STARTED
12:43:44.191 [DEBUG] [org.gradle.process.internal.ExecHandleRunner] waiting until streams are handled...
12:43:44.191 [INFO] [org.gradle.process.internal.DefaultExecHandle] Successfully started process 'command 'python''
12:44:14.591 [QUIET] [system.out]
12:44:14.591 [QUIET] [system.out] Wiping output directory...
12:44:14.592 [QUIET] [system.out]
12:44:14.592 [QUIET] [system.out] Wiping test state (including random seeds)...
12:44:14.592 [QUIET] [system.out]
12:44:14.592 [QUIET] [system.out] Starting clouds...
12:44:14.592 [QUIET] [system.out]
12:44:14.593 [QUIET] [system.out] + CMD: java -Xmx4g -ea -jar /<path/to>/h2o-dev/build/h2o.jar -name H2O_runit_id_4197961 -baseport 40000
12:44:14.593 [QUIET] [system.out]
12:44:14.593 [QUIET] [system.out] Waiting for H2O nodes to come up...
12:44:14.593 [QUIET] [system.out]
12:44:14.593 [QUIET] [system.out]
12:44:14.594 [QUIET] [system.out] ERROR: Too many retries starting cloud.
12:44:14.594 [QUIET] [system.out]
12:44:14.594 [QUIET] [system.out]
12:44:14.594 [QUIET] [system.out] All tests completed; tearing down clouds...
12:44:14.594 [QUIET] [system.out]
12:44:14.595 [QUIET] [system.out] Killing JVM with PID 62640
12:44:14.595 [QUIET] [system.out]
12:44:14.595 [QUIET] [system.out] ----------------------------------------------------------------------
12:44:14.595 [QUIET] [system.out]
12:44:14.595 [QUIET] [system.out] SUMMARY OF RESULTS
12:44:14.596 [QUIET] [system.out]
12:44:14.596 [QUIET] [system.out] ----------------------------------------------------------------------
12:44:14.596 [QUIET] [system.out]
12:44:14.596 [QUIET] [system.out] Total tests: 1
12:44:14.596 [QUIET] [system.out] Passed: 0
12:44:14.597 [QUIET] [system.out] Did not pass: 0
12:44:14.597 [QUIET] [system.out] Did not complete: 1
12:44:14.597 [QUIET] [system.out] Tolerated NOPASS: 0
12:44:14.597 [QUIET] [system.out]
12:44:14.597 [QUIET] [system.out] Total time: 30.36 sec
12:44:14.598 [QUIET] [system.out] Time/completed test: N/A
12:44:14.598 [QUIET] [system.out]
12:44:14.598 [QUIET] [system.out] True fail list:
12:44:14.598 [QUIET] [system.out]
12:44:14.605 [DEBUG] [org.gradle.process.internal.DefaultExecHandle] Changing state to: FAILED
12:44:14.605 [DEBUG] [org.gradle.process.internal.DefaultExecHandle] Process 'command 'python'' finished with exit value 1 (state: FAILED)
12:44:14.607 [DEBUG] [org.gradle.api.internal.tasks.execution.ExecuteAtMostOnceTaskExecuter] Finished executing task ':h2o-test-integ:runGenerateRESTAPIDocs'
12:44:14.607 [LIFECYCLE] [class org.gradle.TaskExecutionLogger] :h2o-test-integ:runGenerateRESTAPIDocs FAILED
12:44:14.614 [INFO] [org.gradle.execution.taskgraph.AbstractTaskPlanExecutor] :h2o-test-integ:runGenerateRESTAPIDocs (Thread[main,5,main]) completed. Took 30.428 secs.
12:44:14.614 [DEBUG] [org.gradle.execution.taskgraph.AbstractTaskPlanExecutor] Task worker [Thread[main,5,main]] finished, busy: 11 mins 8.708 secs, idle: 0.051 secs
12:44:14.622 [ERROR] [org.gradle.BuildExceptionReporter]
12:44:14.622 [ERROR] [org.gradle.BuildExceptionReporter] FAILURE: Build failed with an exception.
12:44:14.622 [ERROR] [org.gradle.BuildExceptionReporter]
12:44:14.622 [ERROR] [org.gradle.BuildExceptionReporter] * What went wrong:
12:44:14.622 [ERROR] [org.gradle.BuildExceptionReporter] Execution failed for task ':h2o-test-integ:runGenerateRESTAPIDocs'.
12:44:14.622 [ERROR] [org.gradle.BuildExceptionReporter] > Process 'command 'python'' finished with non-zero exit value 1
12:44:14.622 [ERROR] [org.gradle.BuildExceptionReporter]
12:44:14.622 [ERROR] [org.gradle.BuildExceptionReporter] * Try:
12:44:14.623 [ERROR] [org.gradle.BuildExceptionReporter] Run with --stacktrace option to get the stack trace.
12:44:14.624 [LIFECYCLE] [org.gradle.BuildResultLogger]
12:44:14.624 [LIFECYCLE] [org.gradle.BuildResultLogger] BUILD FAILED
12:44:14.624 [LIFECYCLE] [org.gradle.BuildResultLogger]
12:44:14.624 [LIFECYCLE] [org.gradle.BuildResultLogger] Total time: 11 mins 33.28 secs




Also, when I tried to build h2o-dev on my mac, I got the following error:

12:40:12.071 [QUIET] [system.out] 11-22 12:40:12.071 127.0.0.1:44008 6617 FJ-0-3 INFO: Internal FluidVec compression/distribution summary:
12:40:12.071 [QUIET] [system.out] 11-22 12:40:12.071 127.0.0.1:44008 6617 FJ-0-3 INFO: Chunk type count fraction size rel. size
12:40:12.071 [QUIET] [system.out] 11-22 12:40:12.071 127.0.0.1:44008 6617 FJ-0-3 INFO: C1N 1 10.000 % 168 B 2.105 %
12:40:12.071 [QUIET] [system.out] 11-22 12:40:12.071 127.0.0.1:44008 6617 FJ-0-3 INFO: C8D 9 90.000 % 7.6 KB 97.895 %
12:40:12.071 [QUIET] [system.out] 11-22 12:40:12.071 127.0.0.1:44008 6617 FJ-0-3 INFO: Total memory usage : 7.8 KB
12:40:12.071 [QUIET] [system.out] 11-22 12:40:12.071 127.0.0.1:44008 6617 FJ-0-3 INFO: Dropping ignored columns: [, b, c, d, e, f, g, o]
12:40:12.072 [QUIET] [system.out] 11-22 12:40:12.072 127.0.0.1:44008 6617 FJ-0-3 INFO: ==============================================================
12:40:12.072 [QUIET] [system.out] 11-22 12:40:12.072 127.0.0.1:44008 6617 FJ-0-3 INFO: r2 is 0.01000001971159592, with 0x1 trees (average of 0.0 nodes)
12:40:12.072 [QUIET] [system.out] 11-22 12:40:12.072 127.0.0.1:44008 6617 FJ-0-3 INFO: Reported on 100 rows.
12:40:12.072 [QUIET] [system.out] 11-22 12:40:12.072 127.0.0.1:44008 6617 FJ-0-3 INFO: 1. tree was built in 00:00:00.000 (Wall: 22-Nov 12:40:12.072)
12:40:12.072 [QUIET] [system.out] 11-22 12:40:12.072 127.0.0.1:44008 6617 FJ-0-3 INFO: ==============================================================
12:40:12.072 [QUIET] [system.out] 11-22 12:40:12.072 127.0.0.1:44008 6617 FJ-0-3 INFO: r2 is 0.4505043520454033, with 1x1 trees (average of 0.0 nodes)
12:40:12.073 [QUIET] [system.out] 11-22 12:40:12.072 127.0.0.1:44008 6617 FJ-0-3 INFO: Reported on 100 rows.
12:40:12.073 [QUIET] [system.out] 11-22 12:40:12.073 127.0.0.1:44008 6617 main INFO: #### TEST hex.tree.gbm.GBMTest#testGBMRegression EXECUTION TIME: 00:00:00.005 (Wall: 22-Nov 12:40:12.073)
12:40:12.077 [QUIET] [system.out]
12:40:12.077 [QUIET] [system.out] Time: 298.207
12:40:12.077 [QUIET] [system.out] There were 5 failures:
12:40:12.078 [QUIET] [system.out] 1) hex.AAA_PreCloudLock
12:40:12.078 [QUIET] [system.out] java.lang.RuntimeException: Cloud size under 5
12:40:12.078 [QUIET] [system.out] at water.H2O.waitForCloudSize(H2O.java:851)
12:40:12.078 [QUIET] [system.out] at water.TestUtil.stall_till_cloudsize(TestUtil.java:34)
12:40:12.078 [QUIET] [system.out] at hex.AAA_PreCloudLock.setup(AAA_PreCloudLock.java:13)
12:40:12.078 [QUIET] [system.out] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
12:40:12.078 [QUIET] [system.out] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
12:40:12.078 [QUIET] [system.out] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
12:40:12.078 [QUIET] [system.out] at java.lang.reflect.Method.invoke(Method.java:606)
12:40:12.078 [QUIET] [system.out] at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
12:40:12.078 [QUIET] [system.out] at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
12:40:12.078 [QUIET] [system.out] at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
12:40:12.078 [QUIET] [system.out] at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
12:40:12.078 [QUIET] [system.out] at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
12:40:12.078 [QUIET] [system.out] at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
12:40:12.078 [QUIET] [system.out] at org.junit.runners.Suite.runChild(Suite.java:127)
12:40:12.079 [QUIET] [system.out] at org.junit.runners.Suite.runChild(Suite.java:26)
12:40:12.079 [QUIET] [system.out] at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
12:40:12.079 [QUIET] [system.out] at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
12:40:12.079 [QUIET] [system.out] at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
12:40:12.079 [QUIET] [system.out] at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
12:40:12.079 [QUIET] [system.out] at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
12:40:12.079 [QUIET] [system.out] at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
12:40:12.079 [QUIET] [system.out] at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
12:40:12.079 [QUIET] [system.out] at org.junit.runner.JUnitCore.run(JUnitCore.java:138)
12:40:12.079 [QUIET] [system.out] at org.junit.runner.JUnitCore.run(JUnitCore.java:117)
12:40:12.079 [QUIET] [system.out] at org.junit.runner.JUnitCore.runMain(JUnitCore.java:96)
12:40:12.079 [QUIET] [system.out] at org.junit.runner.JUnitCore.runMainAndExit(JUnitCore.java:47)
12:40:12.079 [QUIET] [system.out] at org.junit.runner.JUnitCore.main(JUnitCore.java:40)
12:40:12.079 [QUIET] [system.out] 2) hex.deeplearning.DeepLearningIrisTest
12:40:12.080 [QUIET] [system.out] java.lang.RuntimeException: Cloud size under 5
12:40:12.080 [QUIET] [system.out] at water.H2O.waitForCloudSize(H2O.java:851)
12:40:12.080 [QUIET] [system.out] at water.TestUtil.stall_till_cloudsize(TestUtil.java:34)
12:40:12.080 [QUIET] [system.out] at hex.deeplearning.DeepLearningIrisTest.setup(DeepLearningIrisTest.java:25)
12:40:12.080 [QUIET] [system.out] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
12:40:12.080 [QUIET] [system.out] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
12:40:12.080 [QUIET] [system.out] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
12:40:12.080 [QUIET] [system.out] at java.lang.reflect.Method.invoke(Method.java:606)
12:40:12.080 [QUIET] [system.out] at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
12:40:12.080 [QUIET] [system.out] at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
12:40:12.080 [QUIET] [system.out] at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
12:40:12.080 [QUIET] [system.out] at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
12:40:12.081 [QUIET] [system.out] at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
12:40:12.081 [QUIET] [system.out] at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
12:40:12.081 [QUIET] [system.out] at org.junit.runners.Suite.runChild(Suite.java:127)
12:40:12.081 [QUIET] [system.out] at org.junit.runners.Suite.runChild(Suite.java:26)
12:40:12.081 [QUIET] [system.out] at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
12:40:12.081 [QUIET] [system.out] at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
12:40:12.081 [QUIET] [system.out] at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
12:40:12.081 [QUIET] [system.out] at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
12:40:12.081 [QUIET] [system.out] at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
12:40:12.081 [QUIET] [system.out] at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
12:40:12.081 [QUIET] [system.out] at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
12:40:12.081 [QUIET] [system.out] at org.junit.runner.JUnitCore.run(JUnitCore.java:138)
12:40:12.081 [QUIET] [system.out] at org.junit.runner.JUnitCore.run(JUnitCore.java:117)
12:40:12.081 [QUIET] [system.out] at org.junit.runner.JUnitCore.runMain(JUnitCore.java:96)
12:40:12.081 [QUIET] [system.out] at org.junit.runner.JUnitCore.runMainAndExit(JUnitCore.java:47)
12:40:12.081 [QUIET] [system.out] at org.junit.runner.JUnitCore.main(JUnitCore.java:40)
12:40:12.082 [QUIET] [system.out] 3) hex.deeplearning.DropoutTest
12:40:12.082 [QUIET] [system.out] java.lang.RuntimeException: Cloud size under 5
12:40:12.082 [QUIET] [system.out] at water.H2O.waitForCloudSize(H2O.java:851)
12:40:12.082 [QUIET] [system.out] at water.TestUtil.stall_till_cloudsize(TestUtil.java:34)
12:40:12.082 [QUIET] [system.out] at hex.deeplearning.DropoutTest.setup(DropoutTest.java:14)
12:40:12.082 [QUIET] [system.out] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
12:40:12.082 [QUIET] [system.out] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
12:40:12.082 [QUIET] [system.out] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
12:40:12.082 [QUIET] [system.out] at java.lang.reflect.Method.invoke(Method.java:606)
12:40:12.082 [QUIET] [system.out] at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
12:40:12.082 [QUIET] [system.out] at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
12:40:12.082 [QUIET] [system.out] at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
12:40:12.082 [QUIET] [system.out] at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
12:40:12.082 [QUIET] [system.out] at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
12:40:12.082 [QUIET] [system.out] at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
12:40:12.082 [QUIET] [system.out] at org.junit.runners.Suite.runChild(Suite.java:127)
12:40:12.082 [QUIET] [system.out] at org.junit.runners.Suite.runChild(Suite.java:26)
12:40:12.083 [QUIET] [system.out] at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
12:40:12.083 [QUIET] [system.out] at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
12:40:12.083 [QUIET] [system.out] at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
12:40:12.083 [QUIET] [system.out] at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
12:40:12.083 [QUIET] [system.out] at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
12:40:12.083 [QUIET] [system.out] at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
12:40:12.083 [QUIET] [system.out] at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
12:40:12.083 [QUIET] [system.out] at org.junit.runner.JUnitCore.run(JUnitCore.java:138)
12:40:12.083 [QUIET] [system.out] at org.junit.runner.JUnitCore.run(JUnitCore.java:117)
12:40:12.083 [QUIET] [system.out] at org.junit.runner.JUnitCore.runMain(JUnitCore.java:96)
12:40:12.083 [QUIET] [system.out] at org.junit.runner.JUnitCore.runMainAndExit(JUnitCore.java:47)
12:40:12.083 [QUIET] [system.out] at org.junit.runner.JUnitCore.main(JUnitCore.java:40)
12:40:12.083 [QUIET] [system.out] 4) hex.deeplearning.NeuronsTest
12:40:12.083 [QUIET] [system.out] java.lang.RuntimeException: Cloud size under 5
12:40:12.083 [QUIET] [system.out] at water.H2O.waitForCloudSize(H2O.java:851)
12:40:12.083 [QUIET] [system.out] at water.TestUtil.stall_till_cloudsize(TestUtil.java:34)
12:40:12.083 [QUIET] [system.out] at hex.deeplearning.NeuronsTest.setup(NeuronsTest.java:13)
12:40:12.083 [QUIET] [system.out] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
12:40:12.084 [QUIET] [system.out] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
12:40:12.084 [QUIET] [system.out] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
12:40:12.084 [QUIET] [system.out] at java.lang.reflect.Method.invoke(Method.java:606)
12:40:12.084 [QUIET] [system.out] at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
12:40:12.084 [QUIET] [system.out] at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
12:40:12.084 [QUIET] [system.out] at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
12:40:12.084 [QUIET] [system.out] at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
12:40:12.084 [QUIET] [system.out] at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
12:40:12.084 [QUIET] [system.out] at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
12:40:12.084 [QUIET] [system.out] at org.junit.runners.Suite.runChild(Suite.java:127)
12:40:12.084 [QUIET] [system.out] at org.junit.runners.Suite.runChild(Suite.java:26)
12:40:12.084 [QUIET] [system.out] at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
12:40:12.084 [QUIET] [system.out] at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
12:40:12.084 [QUIET] [system.out] at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
12:40:12.084 [QUIET] [system.out] at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
12:40:12.084 [QUIET] [system.out] at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
12:40:12.084 [QUIET] [system.out] at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
12:40:12.085 [QUIET] [system.out] at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
12:40:12.085 [QUIET] [system.out] at org.junit.runner.JUnitCore.run(JUnitCore.java:138)
12:40:12.085 [QUIET] [system.out] at org.junit.runner.JUnitCore.run(JUnitCore.java:117)
12:40:12.085 [QUIET] [system.out] at org.junit.runner.JUnitCore.runMain(JUnitCore.java:96)
12:40:12.085 [QUIET] [system.out] at org.junit.runner.JUnitCore.runMainAndExit(JUnitCore.java:47)
12:40:12.085 [QUIET] [system.out] at org.junit.runner.JUnitCore.main(JUnitCore.java:40)
12:40:12.085 [QUIET] [system.out] 5) hex.optimization.L_BFGS_Test
12:40:12.085 [QUIET] [system.out] java.lang.RuntimeException: Cloud size under 5
12:40:12.085 [QUIET] [system.out] at water.H2O.waitForCloudSize(H2O.java:851)
12:40:12.085 [QUIET] [system.out] at water.TestUtil.stall_till_cloudsize(TestUtil.java:34)
12:40:12.085 [QUIET] [system.out] at hex.optimization.L_BFGS_Test.setup(L_BFGS_Test.java:28)
12:40:12.085 [QUIET] [system.out] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
12:40:12.085 [QUIET] [system.out] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
12:40:12.085 [QUIET] [system.out] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
12:40:12.085 [QUIET] [system.out] at java.lang.reflect.Method.invoke(Method.java:606)
12:40:12.085 [QUIET] [system.out] at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
12:40:12.086 [QUIET] [system.out] at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
12:40:12.086 [QUIET] [system.out] at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
12:40:12.086 [QUIET] [system.out] at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
12:40:12.086 [QUIET] [system.out] at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
12:40:12.086 [QUIET] [system.out] at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
12:40:12.086 [QUIET] [system.out] at org.junit.runners.Suite.runChild(Suite.java:127)
12:40:12.086 [QUIET] [system.out] at org.junit.runners.Suite.runChild(Suite.java:26)
12:40:12.086 [QUIET] [system.out] at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
12:40:12.086 [QUIET] [system.out] at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
12:40:12.086 [QUIET] [system.out] at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
12:40:12.086 [QUIET] [system.out] at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
12:40:12.086 [QUIET] [system.out] at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
12:40:12.086 [QUIET] [system.out] at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
12:40:12.086 [QUIET] [system.out] at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
12:40:12.086 [QUIET] [system.out] at org.junit.runner.JUnitCore.run(JUnitCore.java:138)
12:40:12.086 [QUIET] [system.out] at org.junit.runner.JUnitCore.run(JUnitCore.java:117)
12:40:12.086 [QUIET] [system.out] at org.junit.runner.JUnitCore.runMain(JUnitCore.java:96)
12:40:12.086 [QUIET] [system.out] at org.junit.runner.JUnitCore.runMainAndExit(JUnitCore.java:47)
12:40:12.087 [QUIET] [system.out] at org.junit.runner.JUnitCore.main(JUnitCore.java:40)
12:40:12.087 [QUIET] [system.out]
12:40:12.087 [QUIET] [system.out] FAILURES!!!
12:40:12.087 [QUIET] [system.out] Tests run: 18, Failures: 5
12:40:12.087 [QUIET] [system.out]
12:40:12.551 [DEBUG] [org.gradle.process.internal.DefaultExecHandle] Changing state to: FAILED
12:40:12.551 [DEBUG] [org.gradle.process.internal.DefaultExecHandle] Process 'command 'bash'' finished with exit value 1 (state: FAILED)
12:40:12.552 [DEBUG] [org.gradle.api.internal.tasks.execution.ExecuteAtMostOnceTaskExecuter] Finished executing task ':h2o-algos:testMultiNode'
12:40:12.552 [LIFECYCLE] [class org.gradle.TaskExecutionLogger] :h2o-algos:testMultiNode FAILED
12:40:12.555 [INFO] [org.gradle.execution.taskgraph.AbstractTaskPlanExecutor] :h2o-algos:testMultiNode (Thread[main,5,main]) completed. Took 4 mins 58.948 secs.
12:40:12.555 [DEBUG] [org.gradle.execution.taskgraph.AbstractTaskPlanExecutor] Task worker [Thread[main,5,main]] finished, busy: 5 mins 0.982 secs, idle: 0.016 secs
12:40:12.560 [ERROR] [org.gradle.BuildExceptionReporter]
12:40:12.561 [ERROR] [org.gradle.BuildExceptionReporter] FAILURE: Build failed with an exception.
12:40:12.561 [ERROR] [org.gradle.BuildExceptionReporter]
12:40:12.561 [ERROR] [org.gradle.BuildExceptionReporter] * What went wrong:
12:40:12.561 [ERROR] [org.gradle.BuildExceptionReporter] Execution failed for task ':h2o-algos:testMultiNode'.
12:40:12.561 [ERROR] [org.gradle.BuildExceptionReporter] > Process 'command 'bash'' finished with non-zero exit value 1
12:40:12.561 [ERROR] [org.gradle.BuildExceptionReporter]
12:40:12.561 [ERROR] [org.gradle.BuildExceptionReporter] * Try:
12:40:12.561 [ERROR] [org.gradle.BuildExceptionReporter] Run with --stacktrace option to get the stack trace.
12:40:12.562 [LIFECYCLE] [org.gradle.BuildResultLogger]
12:40:12.562 [LIFECYCLE] [org.gradle.BuildResultLogger] BUILD FAILED
12:40:12.562 [LIFECYCLE] [org.gradle.BuildResultLogger]
12:40:12.562 [LIFECYCLE] [org.gradle.BuildResultLogger] Total time: 5 mins 9.55 secs
12:40:12.564 [DEBUG] [org.gradle.cache.internal.btree.BTreePersistentIndexedCache] Closing cache module-metadata.bin (/user/abc1/.gradle/caches/modules-2/metadata-2.12/module-metadata.bin)
12:40:12.564 [DEBUG] [org.gradle.cache.internal.btree.BTreePersistentIndexedCache] Closing cache module-artifacts.bin (/user/abc1/.gradle/caches/modules-2/metadata-2.12/module-artifacts.bin)
12:40:12.564 [DEBUG] [org.gradle.cache.internal.btree.BTreePersistentIndexedCache] Closing cache artifact-at-repository.bin (/user/abc1/.gradle/caches/modules-2/metadata-2.12/artifact-at-repository.bin)


Any thoughts on potential causes of these issues?

Thanks for your time.

JH

knor...@gmail.com

Nov 23, 2014, 12:09:39 AM
to h2os...@googlegroups.com
Hi JH
you mentioned
"I tried building h2o-dev on my enterprise cluster"

Which triggered a random thought

Does your enterprise cluster support multicast?
Can you do
sudo iptables -L
and
sudo ifconfig

on your enterprise machine, and report? (or the equivalent commands)


Even though the multi-jvm test is done on a single machine, it still requires UDP multicast, I believe. (no flatfile is used for the multi-jvm cloud building)

To test this theory:

I used iptables in an Ubuntu system to block multicast.

I kicked off two java -jar h2o-dev/build/h2o.jar
command lines, to see if they would cloud correctly. They didn't.


If curious, I blocked multicast with the following (as root):

echo "Disabling Multicast (only), send and receive"
iptables -A OUTPUT -m pkttype --pkt-type multicast -j DROP
iptables -A OUTPUT --protocol igmp -j DROP
iptables -A OUTPUT --dst '224.0.0.0/4' -j DROP


I re-enabled everything with the following (as root):

echo "Stopping firewall and allowing everyone..."
iptables -F
iptables -X
iptables -t nat -F
iptables -t nat -X
iptables -t mangle -F
iptables -t mangle -X
iptables -P INPUT ACCEPT
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT
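
For anyone without root access, a rough user-level check is possible too. The following is only a sketch, assuming Python 3 is available on the machine: it joins a multicast group, sends one datagram to it, and reports whether the datagram comes back on the same host. The group address 239.255.42.99 and port 50000 are arbitrary placeholders chosen for illustration, not the values H2O uses, and the function name is hypothetical.

```python
import socket
import struct


def multicast_loopback_works(group="239.255.42.99", port=50000, timeout=2.0):
    """Return True if a UDP multicast datagram sent on this host is received on this host.

    group and port are placeholder values for this sketch, not H2O's own settings.
    """
    recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        # Receiver: bind to the port and join the group on the default interface.
        recv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        recv.bind(("", port))
        mreq = struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton("0.0.0.0"))
        recv.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
        recv.settimeout(timeout)

        # Sender: keep the packet on the local segment and ask for a local copy.
        send.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
        send.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_LOOP, 1)
        send.sendto(b"ping", (group, port))

        data, _ = recv.recvfrom(64)
        return data == b"ping"
    except OSError:
        # Covers timeouts, "network unreachable", and permission errors alike.
        return False
    finally:
        recv.close()
        send.close()
```

Calling `multicast_loopback_works()` and getting False would be consistent with multicast being blocked or unrouted on that host. A True result only says that generic multicast loopback works; it does not prove that two h2o.jar processes will find each other.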


-kevin



jay...@gmail.com

Nov 23, 2014, 2:38:30 PM
to h2os...@googlegroups.com, knor...@gmail.com
Kevin,

Thanks. I do not have sudo privileges as I'm an end user. Here's the output I get when I run the commands without sudo (I masked the IP addresses in the output):

$ iptables -L
WARNING: Failed to open config file bonding.conf: Permission denied
iptables v1.4.7: can't initialize iptables table `filter': Permission denied (you must be root)
Perhaps iptables or your kernel needs to be upgraded.

$ ifconfig
eth0      Link encap:Ethernet  HWaddr xx:xx:xx:xx:xx:xx 
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:9000  Metric:1
          RX packets:1006793 errors:0 dropped:0 overruns:0 frame:0
          TX packets:533472 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 

eth1      Link encap:Ethernet  HWaddr xx:xx:xx:xx:xx:xx 
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:9000  Metric:1
          RX packets:327005862 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1823195376 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 

eth2      Link encap:Ethernet  HWaddr xx:xx:xx:xx:xx:xx  
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:9000  Metric:1
          RX packets:659931064 errors:0 dropped:0 overruns:0 frame:0
          TX packets:823036544 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 

eth3      Link encap:Ethernet  HWaddr xx:xx:xx:xx:xx:xx  
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:9000  Metric:1
          RX packets:2603082088 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1300702868 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 

lanext    Link encap:Ethernet  HWaddr xx:xx:xx:xx:xx:xx  
          inet addr:xx.xx.xx.xx  Bcast:xx.xx.xx.xx  Mask:xx.xx.xx.xx
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:9000  Metric:1
          RX packets:2604088881 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1301236340 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 

lanint    Link encap:Ethernet  HWaddr xx:xx:xx:xx:xx:xx  
          inet addr:xx.xx.xx.xx  Bcast:xx.xx.xx.xx  Mask:xx.xx.xx.xx
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:9000  Metric:1
          RX packets:986936926 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2646231920 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
 
lo        Link encap:Local Loopback  
          inet addr:xx.xx.xx.xx  Mask:xx.xx.xx.xx
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:47076120 errors:0 dropped:0 overruns:0 frame:0
          TX packets:47076120 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 

ke...@0xdata.com

Nov 25, 2014, 11:46:43 PM
to h2os...@googlegroups.com, knor...@gmail.com
ah okay, jay (sorry for delay in responding)

Interesting that it has a lot of NICs; I guess some are bonded.

Maybe you can do a simple test for me that would show whether two h2o jars can cloud on that machine when a flatfile is used.

when you start one jar on the server, you'll see an ip address reported to stdout
i.e.
java -jar h2o.jar
Cloud of size 1 formed [/192.168.0.34:54321]

In my case: 192.168.0.34
remember that IP

now
Create a file called flatfile.txt like this (two lines, just these chars)
Assuming these ports are free: 54321 through 54324

192.168.0.34:54321
192.168.0.34:54323

Substitute the IP from above for 192.168.0.34

then do this in two different terminal windows (on that machine)
java -jar ./h2o.jar -flatfile ./flatfile.txt -port 54321

java -jar ./h2o.jar -flatfile ./flatfile.txt -port 54323

They should cloud up to size 2

when I did it, one stdout looks like this at the end:
11-25 20:44:08.181 192.168.0.34:54321 25455 main INFO: Cloud of size 1 formed [/192.168.0.34:54321]
11-25 20:44:32.318 192.168.0.34:54321 25455 FJ-126-15 INFO: Cloud of size 2 formed [/192.168.0.34:54321, /192.168.0.34:54323]




if they don't, that's interesting and you can post the stdout from each here.

While a little cumbersome, using the flatfile means multicast isn't used. If this works, then we know your h2o-dev build/test issue is a dependence on multicast, and it's possible multicast is disabled on your machine by iptables.

You can ask an admin to run iptables -L for you and report the results.

But the simplest thing to try is what I show above.
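
The flatfile steps above can also be scripted. This is a minimal sketch, assuming Python 3; the helper names `write_flatfile` and `launch_commands` are made up for illustration, and the IP and ports are the example values from this message, to be replaced with whatever your own h2o.jar reports.

```python
from pathlib import Path


def write_flatfile(ip, ports, path="flatfile.txt"):
    """Write one ip:port line per node and return the lines written."""
    lines = [f"{ip}:{port}" for port in ports]
    Path(path).write_text("\n".join(lines) + "\n")
    return lines


def launch_commands(ports, jar="./h2o.jar", flatfile="./flatfile.txt"):
    """Build one launch command per node, using the -flatfile and -port options shown above."""
    return [f"java -jar {jar} -flatfile {flatfile} -port {port}" for port in ports]
```

For the two-node case, `write_flatfile("192.168.0.34", [54321, 54323])` produces the two-line file, and `launch_commands([54321, 54323])` returns the two command lines to run in separate terminals.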

Let me know, and thanks for trying H2O!

-kevin



jay...@gmail.com

Nov 26, 2014, 9:25:00 AM
to h2os...@googlegroups.com, knor...@gmail.com, ke...@0xdata.com
Thanks Kevin. I followed the steps you suggested and got an output that's similar to what you showed:

11-26 08:56:38.435 127.0.0.1:54321       24039  main      INFO: Cloud of size 1 formed [/127.0.0.1:54321]
11-26 08:57:23.487 127.0.0.1:54321       24039  FJ-126-15 INFO: Cloud of size 2 formed [/127.0.0.1:54321, /127.0.0.1:54323]

Additionally, when I try this on my mac, I get a similar output but I noticed that I'm always getting the following message: 

11-26 09:20:24.178 127.0.0.1:54321       41262  #P-Accept ERRR: IO error on TCP port 54322: java.net.BindException: Address already in use

11-26 09:20:24.279 127.0.0.1:54321       41262  #P-Accept ERRR: IO error on TCP port 54322: java.net.BindException: Address already in use  

Shutting down H2O and restarting doesn't seem to help.

Thanks
JH
 

SriSatish Ambati

Nov 26, 2014, 11:06:28 AM
to jay...@gmail.com, h2os...@googlegroups.com, Kevin Normoyle, Kevin
Jay,
You might need a network option in the launch command.

$ java -jar h2o.jar -ip 10.20.205.218

(replace 10.20.205.218 with the eth0 or eth1 ip address in your output from ifconfig)

thanks, Sri


[for h2o: on Hadoop launches this would mean adding -network a.b.c.d/24]
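
To illustrate what an a.b.c.d/24 value covers, here is a small sketch using Python's standard `ipaddress` module. It only shows CIDR membership; how H2O itself uses the -network value to pick an interface is not spelled out in this thread, so treat the connection as an assumption. The helper name `in_network` is made up for the example.

```python
import ipaddress


def in_network(ip, cidr):
    """Return True if ip falls inside the CIDR block (host bits in cidr are tolerated)."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr, strict=False)
```

With the example address above, `in_network("10.20.205.218", "10.20.205.0/24")` is True, while the loopback address 127.0.0.1 is outside that block.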


--
ceo & co-founder, 0xdata Inc

knor...@gmail.com

Nov 26, 2014, 2:07:15 PM11/26/14
to h2os...@googlegroups.com, jay...@gmail.com, knor...@gmail.com, ke...@0xdata.com
Sri
Jay is only doing the java -jar startup manually, to help debug why h2o-dev gradle build fails multi-node test on his server (his laptop is fine, I think)


My thinking is that the flatfile test Jay did showed the following:
Jay used the IP address that a single java -jar reported when it built a cloud of 1.

That IP was used for the flatfile for both startups in the 2 node case. That worked. That suggests to me that it's not a question of finding the right/consistent NIC, because we're not telling the h2o.jar what ip to use (which is what the network param helps do). We're saying "use whatever ip you think is right...but send out information to the ip:port in the flatfile rather than multicast".

I showed that if multicast is disabled on a machine (with iptables) that the current 0xdata h2o-dev gradle build fails the multi-jvm test (the two h2o.jar's don't form a cloud of 2). Which is what Jay is seeing.

My opinion is 0xdata should fix the gradle multijvm test to use a flatfile because multicast support can't be guaranteed.

It's problematic, because I don't think the flatfile can be localhost:<port>. I think it needs to be the ip that the h2o.jars will use. I suppose they could be told to use 127.0.0.1 also.

Jay: on your mac
This is detailed but maybe useful for everyone to know.
The default port h2o.jar uses is 54321, and it also uses the next one, 54322
if you say -port <port> it will use <port> instead of 54321

Normally h2o looks for another port (sequentially) if 54321 and 54322 are busy.

It doesn't if you use -port to say "use this port"

You can also change the place it starts looking with -baseport <port>

These options are described when you do java -jar h2o.jar --help

Oh: I just noticed -baseport is not in the help! It looks like -baseport is used in the current testMultiNode.sh


So: other applications on your machine might be using ports. If you don't use -port, you never see that h2o actually searches for an open port, except for it telling you what port to use with a browser.
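
As a rough illustration of the sequential search described above, here is a minimal Python sketch. It is not H2O's actual code: the function names, the step of one port at a time, and the retry limit are assumptions made for the example. It only shows the idea of starting at a base port and moving upward until two consecutive ports can be bound.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except (OSError, OverflowError):
            # OSError: port busy or not permitted; OverflowError: port > 65535.
            return False
        return True


def find_port_pair(base_port=54321, max_tries=100, host="127.0.0.1"):
    """Search upward from base_port for two consecutive free ports.

    Hypothetical helper: returns the first port of the pair, or None
    if no pair is found within max_tries candidates.
    """
    for port in range(base_port, base_port + max_tries):
        if port_is_free(port, host) and port_is_free(port + 1, host):
            return port
    return None


if __name__ == "__main__":
    print("first free pair starts at:", find_port_pair())
```

Passing a fixed port and not searching at all corresponds to the -port behavior; passing a different starting point corresponds to -baseport.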

Now: ports can be busy due to a bad shutdown of h2o.jar or a running h2o.jar you forgot about (ps aux | grep h2o.jar will tell you if you need to kill -9 something)

When h2o.jar dies, it leaves ports in TIME_WAIT state sometimes, which is an OS issue. This means that 54321 and 54323 may be busy for up to 2 minutes after a h2o.jar shutdown. If you use -port to say "use this port only" ...you will see that effect.

We have an outstanding jira for fixing the issue with h2o reusing a port in TIME_WAIT state. We have that fixed in h2o, but not in h2o-dev yet.

Sorry for all the detail, but I think the information all rolls back to 0xdata needing to fix the multi node jvm test in the gradle build to not be dependent on multicast.

-kevin

knor...@gmail.com

Nov 26, 2014, 2:12:28 PM11/26/14
to h2os...@googlegroups.com, jay...@gmail.com, knor...@gmail.com, ke...@0xdata.com
oh jay,
I don't know if you're conversant in linux commands, but you can see what applications are using ports with netstat

like this, if I want to know about uses of port 54321 or 54322 (the default h2o ports)

netstat -anp | egrep '(54321|54322)'

$ netstat -anp | egrep '(54321|54322)'
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp6 0 0 :::54321 :::* LISTEN 4025/java
tcp6 0 0 172.16.2.222:54322 :::* LISTEN 4025/java
udp6 0 0 172.16.2.222:54322 :::* 4025/java


You'll see TIME_WAIT if the ports are in that state after h2o.jar shutdown. They will stay in that state for up to 2 minutes.
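
If you'd rather filter that output in a script, here is a small Python sketch. It assumes the Linux `netstat -anp` column layout shown above (protocol, two queue counters, local address, foreign address, then a state column for TCP rows); the function name is made up for the example.

```python
def parse_netstat(text, ports):
    """Return (proto, local_address, state) for rows whose local port is in ports.

    Assumes the column layout of Linux `netstat -anp` output.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip headers, warnings, and anything that is not a tcp/udp row.
        if len(fields) < 5 or not fields[0].startswith(("tcp", "udp")):
            continue
        proto, local = fields[0], fields[3]
        port = local.rsplit(":", 1)[-1]
        if not port.isdigit() or int(port) not in ports:
            continue
        # TCP rows carry a state (LISTEN, TIME_WAIT, ...); UDP rows usually do not.
        state = fields[5] if proto.startswith("tcp") and len(fields) > 5 else ""
        rows.append((proto, local, state))
    return rows


if __name__ == "__main__":
    sample = """\
tcp6 0 0 :::54321 :::* LISTEN 4025/java
tcp6 0 0 172.16.2.222:54322 :::* LISTEN 4025/java
udp6 0 0 172.16.2.222:54322 :::* 4025/java
"""
    for row in parse_netstat(sample, {54321, 54322}):
        print(row)
```

Rows whose state is TIME_WAIT are the ones left over from a recent shutdown.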

knor...@gmail.com

Nov 26, 2014, 2:18:22 PM11/26/14
to h2os...@googlegroups.com, jay...@gmail.com, knor...@gmail.com, ke...@0xdata.com
just to complete this with respect to what you saw on your mac with my instructions
(I'm accused sometimes of being too verbose, but I like capturing all the detail :)

here's the jira on the port reuse I mentioned:

https://0xdata.atlassian.net/browse/PUBDEV-75

again, for normal use you don't need to be specifying -port, so this shouldn't affect you.

-kevin

jay...@gmail.com

Nov 26, 2014, 10:31:20 PM11/26/14
to h2os...@googlegroups.com, jay...@gmail.com, knor...@gmail.com, ke...@0xdata.com
Kevin,

Thank you for the detailed explanation. This is very helpful


knor...@gmail.com

Nov 29, 2014, 3:52:53 AM11/29/14
to h2os...@googlegroups.com, jay...@gmail.com, knor...@gmail.com, ke...@0xdata.com
Just noticed another interesting tidbit, that might apply if you were trying to build a cloud across multiple machines, not just multiple jvms on the one machine like the gradle build does for multi-jvm

you reported this from our experiment:

11-26 08:56:38.435 127.0.0.1:54321 24039 main INFO: Cloud of size 1 formed [/127.0.0.1:54321

11-26 08:57:23.487 127.0.0.1:54321 24039 FJ-126-15 INFO: Cloud of size 2 formed [/127.0.0.1:54321, /127.0.0.1:54323]

The IPs seem to imply that h2o thought 127.0.0.1 (localhost) was the best ip to use. (unless you forced that ip with -ip, but I think you didn't)

So Sri's comment about maybe needing -network might apply, if you were clouding across multiple machines. You might have a situation where with all your nics, we can't figure out which one has the IP we should use, and we fall back to 127.0.0.1

One of the things we use in our algorithm is seeing which nic can get to 8.8.8.8. So in a corporate setting with firewalls, that may not work.
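
To make the idea concrete, here is a Python sketch of a common related trick. It is not H2O's algorithm: it asks the OS which local address it would use to route toward 8.8.8.8, without sending any packet, and falls back to 127.0.0.1 if no route exists. The function name and the choice of port 53 are assumptions for the example.

```python
import socket


def guess_outbound_ip(probe_host="8.8.8.8", probe_port=53, fallback="127.0.0.1"):
    """Return the local IP the OS would use to reach probe_host.

    connect() on a UDP socket sends no traffic; it only makes the OS
    choose a route and a source address. If no route exists, return
    the fallback address instead.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect((probe_host, probe_port))
        return s.getsockname()[0]
    except OSError:
        return fallback
    finally:
        s.close()


if __name__ == "__main__":
    print("would use local ip:", guess_outbound_ip())
```

Because this only consults the routing table, it can succeed even where a firewall blocks real traffic to 8.8.8.8, so it is a weaker check than actually reaching that address.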

The -network param helps tell h2o which ip range to look for, (which allows the ip to vary, for instance if h2o was dispatched on hadoop)

But in a fixed case, just telling h2o which ip to use would work.
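
The difference between the two options can be sketched in Python. This is only an illustration of range matching with the standard `ipaddress` module, not H2O's code; the function name and the candidate list are made up for the example.

```python
import ipaddress


def pick_ip_in_network(candidate_ips, network):
    """Return the first candidate IP that falls inside the given CIDR range.

    strict=False lets the range be written with host bits set,
    for example "10.20.205.218/24".
    """
    net = ipaddress.ip_network(network, strict=False)
    for ip in candidate_ips:
        if ipaddress.ip_address(ip) in net:
            return ip
    return None


if __name__ == "__main__":
    # Hypothetical addresses found on one machine's NICs.
    nics = ["127.0.0.1", "192.168.1.5", "10.20.205.218"]
    print(pick_ip_in_network(nics, "10.20.205.0/24"))
```

A range lets the same launch command work on every machine, while a fixed ip has to be set per machine.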

Again, this might only be an issue if you were clouding h2o on multiple machines, and it didn't cloud correctly.

-kevin

jay...@gmail.com

Dec 2, 2014, 11:14:26 PM12/2/14
to h2os...@googlegroups.com, jay...@gmail.com, knor...@gmail.com, ke...@0xdata.com

Thanks Kevin. I was trying to build a cloud across multiple machines in a corporate environment. This actually made me think about another scenario. Would I be able to pick specific nodes/ips that I want to use for the cloud? If I just let H2O pick available nodes randomly for the cloud, is there any scenario where other users in the corporate environment can access the same cloud that I'm working on? How does H2O go about making sure the same resources aren't being used/accessed by multiple users? (Is it all up to yarn/the relevant resource manager to control this?)

ke...@0xdata.com

Dec 2, 2014, 11:31:10 PM12/2/14
to h2os...@googlegroups.com, jay...@gmail.com, knor...@gmail.com, ke...@0xdata.com
Hi Jay.

Okay there are a couple issues, so I'll tease them out separately

-You had trouble with a gradle build/test of h2o-dev on your corporate cluster. That testing only does at most, multiple jvms on a single machine. It doesn't do multi-machine testing.

-Yes, if you start h2o.jar on multiple machines, the individual machines shouldn't be saying they are using 127.0.0.1. If they are, they should be told which -ip to use (out of the ones on their NICs), or -network should be used to say which network they should use. If you use -flatfile, you can give each node the list of ip:port entries for all the machines in the cloud.

-But, if you're dispatching h2o through yarn, you don't have to worry about any of this. h2o will get dispatched by yarn, and all the h2o nodes will be able to cloud up. There is a special h2odriver jar to use to dispatch to yarn, and you have to tell it what hadoop vendor/version you're using...


If you don't use yarn, and distribute h2o.jar's yourself, they will cloud up with h2o.jars that have the same cloud name. You can give a unique cloud name with the -name argument to h2o.jar, so multiple clouds can exist on shared machines and not interfere with each other.

yarn will always cause a unique cloud name for its h2o.jar's that are part of the single yarn "job" created by the h2odriver jar
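
For a manual multi-machine launch, the pieces above (-flatfile, -name, -port, -ip) can be put together as in this Python sketch. The helper names, the example addresses, and the cloud name are hypothetical, and the one-ip:port-per-line flatfile layout is an assumption; check it against the flatfile you already used.

```python
def flatfile_text(nodes):
    """Build flatfile contents: one ip:port entry per line (assumed layout)."""
    return "".join(f"{ip}:{port}\n" for ip, port in nodes)


def launch_command(jar, cloud_name, flatfile_path, port, ip=None):
    """Build the argument list for starting one node.

    Only flags mentioned in this thread are used: -name, -flatfile,
    -port and, optionally, -ip.
    """
    cmd = ["java", "-jar", jar,
           "-name", cloud_name,
           "-flatfile", flatfile_path,
           "-port", str(port)]
    if ip is not None:
        cmd += ["-ip", ip]
    return cmd


if __name__ == "__main__":
    # Hypothetical two-machine cloud.
    nodes = [("10.20.205.218", 54321), ("10.20.205.219", 54321)]
    print(flatfile_text(nodes), end="")
    for ip, port in nodes:
        print(" ".join(launch_command("h2o.jar", "jay_cloud", "flatfile.txt", port, ip)))
```

Every node gets the same flatfile and the same cloud name; a different -name keeps a second cloud on the same machines separate.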

If you build a cloud, then anyone with access to the http port for h2o (the one you browse to, like 54321) can interact with your cloud. We currently don't have a privacy layer.

In terms of shared resources: Yarn is responsible for allocating memory and cores in a way that doesn't overburden the underlying system (we ask for a certain amount of java heap with params to the h2odriver dispatch to yarn)

We don't do any resource allocation within a single h2o cloud.

-kevin