Hi All,
I am running a Twister2 job on 8 nodes of victor (each node configured with 16 slots). With 8 or 16 workers I have no issues, but when I try to run 32 workers I get the exception below. It looks like there is contention when creating the log directory when multiple workers run on the same node. Has anyone noticed this before?
[2020-09-17 10:42:12 -0400] [SEVERE] [-] [Twister2MPIWorker-2] edu.iu.dsc.tws.rsched.schedulers.standalone.MPIWorkerStarter: Uncaught exception in thread Thread[Twister2MPIWorker-2,5,main]. Finalizing this worker...
java.lang.RuntimeException: Failed to create log directory: /tmp/twister222/volatile2/pulasthii-python-job-a1b1b5ec-ae72-4102--ou457h3/logs
at edu.iu.dsc.tws.rsched.schedulers.standalone.MPIWorkerStarter.initWorkerLogger(MPIWorkerStarter.java:639)
at edu.iu.dsc.tws.rsched.schedulers.standalone.MPIWorkerStarter.startWorker(MPIWorkerStarter.java:297)
at edu.iu.dsc.tws.rsched.schedulers.standalone.MPIWorkerStarter.startWorkerWithoutJM(MPIWorkerStarter.java:243)
at edu.iu.dsc.tws.rsched.schedulers.standalone.MPIWorkerStarter.<init>(MPIWorkerStarter.java:161)
at edu.iu.dsc.tws.rsched.schedulers.standalone.MPIWorkerStarter.main(MPIWorkerStarter.java:118)
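To illustrate the kind of contention suspected here, below is a minimal sketch. It assumes, without confirmation from the Twister2 source, that the log directory is created with a check-then-mkdirs() pattern, where File.mkdirs() returns false for a worker that loses the race to another worker on the same node. The class LogDirSketch and the method ensureLogDir are hypothetical names for illustration only and are not part of Twister2. The sketch uses Files.createDirectories, which does not fail when the directory already exists.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class LogDirSketch {

  // Hypothetical helper, not Twister2 code: create the log directory in a way
  // that tolerates another worker on the same node creating it concurrently.
  static Path ensureLogDir(Path logDir) {
    try {
      // createDirectories succeeds if the directory already exists,
      // so a worker that loses the race does not report a failure.
      return Files.createDirectories(logDir);
    } catch (IOException e) {
      throw new UncheckedIOException("Failed to create log directory: " + logDir, e);
    }
  }

  public static void main(String[] args) throws Exception {
    Path base = Files.createTempDirectory("logdir-sketch");
    Path logDir = base.resolve("job").resolve("logs");

    // Simulate several workers on one node racing to create the same directory.
    Thread[] workers = new Thread[4];
    for (int i = 0; i < workers.length; i++) {
      workers[i] = new Thread(() -> ensureLogDir(logDir));
      workers[i].start();
    }
    for (Thread t : workers) {
      t.join();
    }
    System.out.println(Files.isDirectory(logDir));
  }
}
```

If the actual code at MPIWorkerStarter.java:639 treats a false return from mkdirs() as a failure, a change along these lines might avoid the exception, but that would need to be checked against the source.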
Best Regards
Pulasthi
--
Pulasthi S. Wickramasinghe
PhD Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington