Hello,
I am currently having an issue when training a job on an engine that has about 50 million events. My Spark driver has 10 GB assigned to it and my executor has 32 GB; the executor and driver run on separate servers, and I followed the Advanced Tuning section of the documentation on the ActionML website.
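For reference, the sparkConf section of my engine's JSON config looks roughly like this (values simplified for this post, master URL redacted, other engine settings omitted):

"sparkConf": {
  "master": "spark://<<spark_master>>:7077",
  "spark.driver.memory": "10g",
  "spark.executor.memory": "32g",
  "spark.serializer": "org.apache.spark.serializer.KryoSerializer"
}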
When I run harness-cli train <<engine_name>>, everything works fine for engines with around 1 million events; however, anything above that seems to crash Harness itself on the driver server, within the first two minutes of the job.
Is this an indication that my driver memory is too low, or maybe my executor memory? I am not sure whether Harness itself crashing points one way or the other.
I have attached the output of harness-cli status engines <<engine_name>> and also the Spark executor output showing a failed heartbeat. If there is any other logging I can check, please let me know; harness.log didn't seem to have much detail.
Output of harness-cli status engines <<engine_name>>
Harness CLI v settings
==================================================================
HARNESS_CLI_HOME .................... /opt/harness-cli/harness-cli
HARNESS_CLI_SSL_ENABLED .................................... false
HARNESS_CLI_AUTH_ENABLED ................................... false
HARNESS_SERVER_ADDRESS ................................. localhost
HARNESS_SERVER_PORT ......................................... 9090
==================================================================
Harness Server status: FAILED
SPARK EXECUTOR LOG
2022-03-02 14:01:09 INFO MemoryStore:54 - Block taskresult_92 stored as bytes in memory (estimated size 25.4 MB, free 16.9 GB)
2022-03-02 14:01:09 INFO MemoryStore:54 - Block taskresult_91 stored as bytes in memory (estimated size 25.4 MB, free 16.8 GB)
2022-03-02 14:01:09 INFO MemoryStore:54 - Block taskresult_93 stored as bytes in memory (estimated size 25.5 MB, free 16.8 GB)
2022-03-02 14:01:09 INFO MemoryStore:54 - Block taskresult_96 stored as bytes in memory (estimated size 25.5 MB, free 16.8 GB)
2022-03-02 14:01:09 INFO MemoryStore:54 - Block taskresult_95 stored as bytes in memory (estimated size 25.5 MB, free 16.8 GB)
2022-03-02 14:01:09 INFO MemoryStore:54 - Block taskresult_89 stored as bytes in memory (estimated size 25.5 MB, free 16.7 GB)
2022-03-02 14:01:09 INFO MemoryStore:54 - Block taskresult_94 stored as bytes in memory (estimated size 25.5 MB, free 16.7 GB)
2022-03-02 14:01:09 INFO MemoryStore:54 - Block taskresult_90 stored as bytes in memory (estimated size 25.5 MB, free 16.7 GB)
2022-03-02 14:01:09 INFO Executor:54 - Finished task 3.0 in stage 4.0 (TID 92). 26655718 bytes result sent via BlockManager)
2022-03-02 14:01:09 INFO Executor:54 - Finished task 2.0 in stage 4.0 (TID 91). 26672430 bytes result sent via BlockManager)
2022-03-02 14:01:09 INFO Executor:54 - Finished task 4.0 in stage 4.0 (TID 93). 26687985 bytes result sent via BlockManager)
2022-03-02 14:01:09 INFO Executor:54 - Finished task 7.0 in stage 4.0 (TID 96). 26722931 bytes result sent via BlockManager)
2022-03-02 14:01:09 INFO Executor:54 - Finished task 6.0 in stage 4.0 (TID 95). 26714539 bytes result sent via BlockManager)
2022-03-02 14:01:09 INFO Executor:54 - Finished task 0.0 in stage 4.0 (TID 89). 26687080 bytes result sent via BlockManager)
2022-03-02 14:01:09 INFO Executor:54 - Finished task 5.0 in stage 4.0 (TID 94). 26732444 bytes result sent via BlockManager)
2022-03-02 14:01:09 INFO Executor:54 - Finished task 1.0 in stage 4.0 (TID 90). 26736352 bytes result sent via BlockManager)
2022-03-02 14:02:21 WARN Executor:87 - Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:785)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply$mcV$sp(Executor.scala:814)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1992)
at org.apache.spark.executor.Executor$$anon$2.run(Executor.scala:814)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 14 more
2022-03-02 14:02:21 WARN TransportResponseHandler:189 - Ignoring response for RPC 4753128745304104244 from /10.21.0.79:42917 (81 bytes) since it is not outstanding
2022-03-02 14:02:41 WARN Executor:87 - Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:785)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply$mcV$sp(Executor.scala:814)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1992)
at org.apache.spark.executor.Executor$$anon$2.run(Executor.scala:814)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 14 more
2022-03-02 14:02:45 WARN TransportResponseHandler:189 - Ignoring response for RPC 8769019784604203050 from /10.21.0.79:42917 (81 bytes) since it is not outstanding
2022-03-02 14:02:46 INFO CoarseGrainedExecutorBackend:54 - Driver commanded a shutdown
2022-03-02 14:02:46 ERROR CoarseGrainedExecutorBackend:43 - RECEIVED SIGNAL TERM