Issue Training Job


Mitchell Kresca

Mar 2, 2022, 9:15:12 AM
to actionml-user
Hello,

I am currently having an issue training an engine that has about 50 million events. My Spark driver has 10 GB assigned to it and my executor has 32 GB. The executor and driver run on separate servers, and I followed the Advanced Tuning section of the documentation on the ActionML website.
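For reference, the Spark resources are set in the sparkConf section of the engine's JSON config, roughly like this (trimmed to the memory keys; the master URL is a placeholder for my actual Spark master):

  "sparkConf": {
    "master": "spark://<spark-master-host>:7077",
    "spark.driver.memory": "10g",
    "spark.executor.memory": "32g"
  }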

When initiating harness-cli train <<engine_name>>, everything works fine for engines with around 1 million events; however, anything above that seems to crash Harness itself on the driver server, within the first two minutes of the job.

Is this an indication that my driver memory is too low? Or maybe my executor memory? I'm not sure whether Harness itself crashing points one way or the other.

I have attached the output of harness-cli status engines <<engine_name>> and also the Spark executor output showing a failed heartbeat. If there is any more logging that can be checked, please let me know. Harness.log didn't really seem to have much.

Output of harness-cli status engines <<engine_name>>

Harness CLI v settings
==================================================================
HARNESS_CLI_HOME .................... /opt/harness-cli/harness-cli
HARNESS_CLI_SSL_ENABLED .................................... false
HARNESS_CLI_AUTH_ENABLED ................................... false
HARNESS_SERVER_ADDRESS ................................. localhost
HARNESS_SERVER_PORT ......................................... 9090
==================================================================
Harness Server status: FAILED


SPARK LOG

2022-03-02 14:01:09 INFO MemoryStore:54 - Block taskresult_92 stored as bytes in memory (estimated size 25.4 MB, free 16.9 GB)
2022-03-02 14:01:09 INFO MemoryStore:54 - Block taskresult_91 stored as bytes in memory (estimated size 25.4 MB, free 16.8 GB)
2022-03-02 14:01:09 INFO MemoryStore:54 - Block taskresult_93 stored as bytes in memory (estimated size 25.5 MB, free 16.8 GB)
2022-03-02 14:01:09 INFO MemoryStore:54 - Block taskresult_96 stored as bytes in memory (estimated size 25.5 MB, free 16.8 GB)
2022-03-02 14:01:09 INFO MemoryStore:54 - Block taskresult_95 stored as bytes in memory (estimated size 25.5 MB, free 16.8 GB)
2022-03-02 14:01:09 INFO MemoryStore:54 - Block taskresult_89 stored as bytes in memory (estimated size 25.5 MB, free 16.7 GB)
2022-03-02 14:01:09 INFO MemoryStore:54 - Block taskresult_94 stored as bytes in memory (estimated size 25.5 MB, free 16.7 GB)
2022-03-02 14:01:09 INFO MemoryStore:54 - Block taskresult_90 stored as bytes in memory (estimated size 25.5 MB, free 16.7 GB)
2022-03-02 14:01:09 INFO Executor:54 - Finished task 3.0 in stage 4.0 (TID 92). 26655718 bytes result sent via BlockManager)
2022-03-02 14:01:09 INFO Executor:54 - Finished task 2.0 in stage 4.0 (TID 91). 26672430 bytes result sent via BlockManager)
2022-03-02 14:01:09 INFO Executor:54 - Finished task 4.0 in stage 4.0 (TID 93). 26687985 bytes result sent via BlockManager)
2022-03-02 14:01:09 INFO Executor:54 - Finished task 7.0 in stage 4.0 (TID 96). 26722931 bytes result sent via BlockManager)
2022-03-02 14:01:09 INFO Executor:54 - Finished task 6.0 in stage 4.0 (TID 95). 26714539 bytes result sent via BlockManager)
2022-03-02 14:01:09 INFO Executor:54 - Finished task 0.0 in stage 4.0 (TID 89). 26687080 bytes result sent via BlockManager)
2022-03-02 14:01:09 INFO Executor:54 - Finished task 5.0 in stage 4.0 (TID 94). 26732444 bytes result sent via BlockManager)
2022-03-02 14:01:09 INFO Executor:54 - Finished task 1.0 in stage 4.0 (TID 90). 26736352 bytes result sent via BlockManager)
2022-03-02 14:02:21 WARN Executor:87 - Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
    at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
    at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
    at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:785)
    at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply$mcV$sp(Executor.scala:814)
    at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
    at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1992)
    at org.apache.spark.executor.Executor$$anon$2.run(Executor.scala:814)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    ... 14 more
2022-03-02 14:02:21 WARN TransportResponseHandler:189 - Ignoring response for RPC 4753128745304104244 from /10.21.0.79:42917 (81 bytes) since it is not outstanding
2022-03-02 14:02:41 WARN Executor:87 - Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
    at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
    at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
    at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:785)
    at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply$mcV$sp(Executor.scala:814)
    at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
    at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1992)
    at org.apache.spark.executor.Executor$$anon$2.run(Executor.scala:814)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    ... 14 more
2022-03-02 14:02:45 WARN TransportResponseHandler:189 - Ignoring response for RPC 8769019784604203050 from /10.21.0.79:42917 (81 bytes) since it is not outstanding
2022-03-02 14:02:46 INFO CoarseGrainedExecutorBackend:54 - Driver commanded a shutdown
2022-03-02 14:02:46 ERROR CoarseGrainedExecutorBackend:43 - RECEIVED SIGNAL TERM
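Side note for anyone hitting the same trace: the timeout above is the executor-to-driver heartbeat. If the driver JVM is only pausing under memory pressure rather than dying outright, the standard Spark properties below can be raised in the same sparkConf to ride out short stalls; this is only a sketch of the relevant keys, not settings I have validated for this job:

  "spark.executor.heartbeatInterval": "60s",
  "spark.network.timeout": "600s"

(spark.network.timeout should stay larger than the heartbeat interval. None of this helps if the driver process is actually being killed, which is what seems to be happening here.)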

Pat Ferrel

Mar 2, 2022, 1:35:23 PM
to Mitchell Kresca, actionml-user
Harness is not running: Harness Server status: FAILED

Fix this first.



Mitchell Kresca

Mar 2, 2022, 1:42:28 PM
to actionml-user
Hi Pat,

Yes, Harness Server status: FAILED is a result of training with 50 million records. Small training jobs don't kill Harness. The issue is that when Harness is "OK" and I run a large training job, the driver server dies. Harness on that server then reports Harness Server status: FAILED, and the Spark job gives the error above a couple of minutes into the training job.

Pat Ferrel

Mar 2, 2022, 2:58:04 PM
to Mitchell Kresca, actionml-user
How are the microservices hosted? Kubernetes, Docker Compose, native install, cluster?

It sounds like all services are drawing from the same memory pool and Spark is asking for so much that Harness can no longer run, but that is just a wild guess.
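A quick way to sanity-check that on the driver host is to watch per-process memory while a big training starts, and to see whether the kernel OOM killer is what actually takes Harness down. Something like the following (Linux assumed; the process-name filters are illustrative):

# overall headroom on the driver host
free -h

# resident memory of the Harness JVM vs. the Spark driver JVM, largest first
watch -n 5 'ps -eo pid,rss,comm,args --sort=-rss | grep -Ei "harness|spark" | grep -v grep'

# after a crash, check whether the kernel OOM killer picked off the Harness process
dmesg -T | grep -Ei "killed process|out of memory"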

Mitchell Kresca

Mar 3, 2022, 8:23:56 AM
to actionml-user
We have a native installation. I ended up spinning up another Harness node, increased the memory on that node, and used it to initiate training. That seems to have solved my issue.

One other question: when training an engine that has 100 million events, it takes a very long time, which for my use case is not a problem. I see I can update the harness-env config to extend the job-expiration timeout. My question is: when a job expires, does that actually stop the job from running? I still see memory and CPU being used on the executor and driver machines, and I see the job "executing" in the Spark web UI. Is it a fair assumption that the job is still running even though its status changed to expired?

Pat Ferrel

Mar 3, 2022, 11:02:39 AM
to Mitchell Kresca, actionml-user
1) Long training can usually be fixed by increasing the power (memory + cores) allocated to Spark. It often means you are about to run out of resources, which will require getting more.
2) Harness “jobs” are asynchronous and so generally expire only when Spark has an internal timeout. As long as Spark is making progress, you shouldn’t see a job flagged as “expired”. It is very difficult to make the job status completely reliable due to Spark’s architecture. If a Spark job is extremely slow it should time out internally in Spark, which usually means the Harness job will expire. If Spark is still working, you can use the Spark GUI to monitor progress. I suppose it is possible for Spark to hang, but it shouldn’t in normal cases.
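If you want to confirm the Spark side is still actually working after Harness flags the job expired, the live driver UI also exposes a small REST API. For example (4040 is the default UI port for a running driver; the host and application id are placeholders):

# applications known to the running driver UI
curl -s http://<spark-driver-host>:4040/api/v1/applications

# stages still active for the application id returned above
curl -s "http://<spark-driver-host>:4040/api/v1/applications/<app-id>/stages?status=active"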

It sounds like you don’t have enough memory or cores for Spark, and possibly for the Harness Spark “driver”. Slow training may be acceptable, but at some point it means you are about to fail training. The total memory across all Spark nodes scales with the size of your data; it must all fit in memory.
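As a purely back-of-envelope illustration (the bytes-per-event and overhead numbers are made-up assumptions; measure your own data):

# 100M events at an assumed ~300 bytes each, with a ~3x factor for JVM object overhead
echo "$(( 100000000 * 300 * 3 / 1024 / 1024 / 1024 )) GB, rough order of magnitude needed across Spark executors"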

The worker nodes can be shut down when not training, so you have a way to reduce the cost of large AWS instances.
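For a Spark 2.x standalone cluster that usually just means running the stock scripts on the Spark master between trainings (and, for the AWS cost part, stopping the underlying instances once the workers are down); paths assume SPARK_HOME points at your install:

# stop every worker listed in conf/slaves when no training is scheduled
$SPARK_HOME/sbin/stop-slaves.sh

# bring the workers back before the next harness-cli train
$SPARK_HOME/sbin/start-slaves.sh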

