Issue when running higher parallelism


Pulasthi Supun Wickramasinghe

Sep 17, 2020, 10:58:11 AM
to Twister2, Ahmet Uyar
Hi All,

I am running a Twister2 job on 8 nodes of victor (each node set with 16 slots). When I run with 8 or 16 workers I have no issues, but when I try to run 32 workers I get the following exception. It seems that when multiple workers run on the same node, there is some contention when creating the log directory. Has this been noticed before?



[2020-09-17 10:42:12 -0400] [SEVERE] [-] [Twister2MPIWorker-2] edu.iu.dsc.tws.rsched.schedulers.standalone.MPIWorkerStarter: Uncaught exception in thread Thread[Twister2MPIWorker-2,5,main]. Finalizing this worker...
java.lang.RuntimeException: Failed to create log directory: /tmp/twister222/volatile2/pulasthii-python-job-a1b1b5ec-ae72-4102--ou457h3/logs
at edu.iu.dsc.tws.rsched.schedulers.standalone.MPIWorkerStarter.initWorkerLogger(MPIWorkerStarter.java:639)
at edu.iu.dsc.tws.rsched.schedulers.standalone.MPIWorkerStarter.startWorker(MPIWorkerStarter.java:297)
at edu.iu.dsc.tws.rsched.schedulers.standalone.MPIWorkerStarter.startWorkerWithoutJM(MPIWorkerStarter.java:243)
at edu.iu.dsc.tws.rsched.schedulers.standalone.MPIWorkerStarter.<init>(MPIWorkerStarter.java:161)
at edu.iu.dsc.tws.rsched.schedulers.standalone.MPIWorkerStarter.main(MPIWorkerStarter.java:118)

Best Regards
Pulasthi
--
Pulasthi S. Wickramasinghe
PhD Candidate  | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington

Ahmet Uyar

Sep 17, 2020, 11:08:45 AM
to Pulasthi Supun Wickramasinghe, Twister2
Hi Pulasthi,

This bug was introduced by one of the recent pull requests. I have seen it too.
I added a fix in the following commit, but it has not been merged into the master branch yet:

For now, could you modify the relevant part in MPIWorkerStarter.java?

Ahmet
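
For readers hitting the same exception: the commit referenced above is not reproduced in this thread, so the following is only a minimal sketch of one way to make log directory creation tolerate several workers on the same node. It assumes the failure comes from a check-then-create pattern, where `File.mkdirs()` returns false because another worker created the directory first. The class and method names (`LogDirs`, `ensureLogDir`) are hypothetical and are not part of Twister2.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not Twister2 code: a sketch of race-tolerant
// log directory creation for workers that share a node.
public final class LogDirs {
  private LogDirs() { }

  /**
   * Creates the log directory (and any missing parents) if needed.
   * Files.createDirectories does not throw when the directory already
   * exists, so it does not matter which worker created it first.
   */
  public static Path ensureLogDir(String dir) {
    Path path = Paths.get(dir);
    try {
      return Files.createDirectories(path);
    } catch (IOException e) {
      // Reached for real failures, e.g. missing permissions or a
      // regular file already occupying the path.
      throw new RuntimeException("Failed to create log directory: " + dir, e);
    }
  }
}
```

With this approach every worker can call `LogDirs.ensureLogDir(...)` unconditionally at startup. Only a genuine I/O problem raises the exception, not the case where a sibling worker won the race.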

Pulasthi Supun Wickramasinghe

Sep 17, 2020, 11:12:28 AM
to Ahmet Uyar, Twister2
Hi Ahmet

Thanks, Ahmet. I can review it and merge it. This seems like a small fix, so there is no reason to delay merging it.

Best Regards