Problems running notebooks

tiago.nu...@gmail.com

Jun 2, 2019, 5:16:53 AM
to Hops
Hi,

I'm having trouble running some notebooks from the deep learning tutorial, namely the distributed training ones.
In the application setup I'm setting:
- Distributed Training
  AppMaster memory: 1024 MB
  DistributionStrategy: MirroredStrategy
  Executor memory: 1024 MB
  Number of GPUs: 4

For the other strategies the setup is similar: I just change the DistributionStrategy to the corresponding one and set Number of GPUs per Worker to 1 and Workers to 4.

But when running the notebooks I always get this error:

An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, gpu2, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 2.1 GB of 1.4 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
Driver stacktrace:

Am I setting anything wrong on the application setup?


Also, I seem to be unable to install libraries through pip. The installation always fails.

Thank you,
Tiago

Robin

Jun 2, 2019, 5:25:03 AM
to tiago.nu...@gmail.com, Hops
Hello Tiago! 

The Spark executor is using too much memory and is being killed by YARN. You need to increase the executor memory to at least 4096 MB, assuming you are running one of the provided notebooks. As for the pip problems, which library are you trying to install? Is this on hops.site?
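
To show where the "1.4 GB" limit in the error likely comes from, here is a rough sketch of the arithmetic. It assumes stock Spark-on-YARN defaults, where the container limit is the executor memory plus an overhead of max(384 MB, 10% of executor memory). The function name and constants are only illustrative, and the cluster's actual overhead settings may differ.

```python
# Sketch only: assumes the default Spark-on-YARN overhead rule,
# overhead = max(384 MB, 10% of executor memory). Names are hypothetical.
MIN_OVERHEAD_MB = 384
OVERHEAD_FACTOR = 0.10


def yarn_container_limit_mb(executor_memory_mb, overhead_mb=None):
    """Return the approximate physical memory limit YARN enforces on an executor."""
    if overhead_mb is None:
        overhead_mb = max(MIN_OVERHEAD_MB, int(executor_memory_mb * OVERHEAD_FACTOR))
    return executor_memory_mb + overhead_mb


# Memory the failed executor was using, taken from the error message (2.1 GB).
used_mb = 2.1 * 1024

for executor_memory_mb in (1024, 4096):
    limit_mb = yarn_container_limit_mb(executor_memory_mb)
    verdict = "fits" if used_mb <= limit_mb else "exceeds the limit"
    print(f"executor memory {executor_memory_mb} MB -> "
          f"limit {limit_mb} MB ({limit_mb / 1024:.1f} GB): {verdict}")
```

Under these assumptions, 1024 MB of executor memory gives a limit of 1408 MB, which is consistent with the 1.4 GB reported by YARN, while 4096 MB leaves plenty of room for the 2.1 GB the task was using.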

Best regards,
Robin

tiago.nu...@gmail.com

Jun 2, 2019, 5:37:43 AM
to Hops
Yes, it's on hops.site. I was trying to install the time library.

Increasing the memory solved the issue.

Thank you,
Tiago
