I'm having trouble running some notebooks from the deep learning tutorial, namely the distributed training ones.
In the application setup I'm setting:
- Distributed Training
AppMaster memory: 1024
DistributionStrategy: MirroredStrategy
Executor memory: 1024
Number of GPUs: 4
For the other strategies the setup is similar: I just change the DistributionStrategy to the corresponding one and set Number of GPUs per Worker to 1 and Workers to 4.
But when running the notebooks I always get this error:
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, gpu2, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 2.1 GB of 1.4 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
Driver stacktrace:
Am I setting anything wrong on the application setup?
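The error message itself points at the likely fix: YARN killed the container because the process used 2.1 GB against a 1.4 GB limit, and it suggests raising the executor memory overhead. A minimal sketch of the settings worth increasing (the exact values here are illustrative, not recommendations; these are standard Spark-on-YARN properties, which Hopsworks applications use under the hood):

```properties
# Raise the executor heap above the 1024 MB set in the application UI
spark.executor.memory                    2048m
# Off-heap headroom YARN accounts for; the error explicitly suggests raising this
spark.yarn.executor.memoryOverhead       1024
```

Note that GPU/TensorFlow workloads allocate significant off-heap memory, which counts toward YARN's physical-memory check, so the overhead setting often matters more than the heap itself.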
Also, I seem to be unable to install libraries through pip. The installation always fails.
Thank you,
Tiago
Increasing the memory solved the issue.
Thank you,
Tiago