UCX bind error


Ahmet Uyar

Sep 14, 2020, 5:31:56 AM
to Twister2
Hi Chathura,

When I run UCX with a large number of workers, it throws a port bind error, as shown below.
I think multiple workers are trying to use the same port on the same node.

Ahmet


[1600075627.538112] [v-012:210693:0]           sock.c:376  UCX  ERROR bind(fd=203 addr=172.29.200.212:36367) failed: Address already in use
[1600075627.538172] [v-012:210693:0]      listener.cc:53   UCX  ERROR JUCX: Input/output error
[2020-09-14 05:27:07 -0400] [SEVERE] [worker-130] [Twister2MPIWorker-130] edu.iu.dsc.tws.rsched.schedulers.standalone.MPIWorkerStarter: Uncaught exception in thread Thread[Twister2MPIWorker-130,5,main]. Finalizing this worker...
edu.iu.dsc.tws.api.exceptions.Twister2RuntimeException: Couldn't initialize TWSChannel
        at edu.iu.dsc.tws.api.resource.Network.initializeChannel(Network.java:73)
        at edu.iu.dsc.tws.api.resource.WorkerEnvironment.<init>(WorkerEnvironment.java:136)
        at edu.iu.dsc.tws.api.resource.WorkerEnvironment.init(WorkerEnvironment.java:251)
        at edu.iu.dsc.tws.rsched.worker.Twister2WorkerStarter.execute(Twister2WorkerStarter.java:54)
        at edu.iu.dsc.tws.rsched.worker.MPIWorkerManager.execute(MPIWorkerManager.java:66)
        at edu.iu.dsc.tws.rsched.schedulers.standalone.MPIWorkerStarter.startWorker(MPIWorkerStarter.java:310)
        at edu.iu.dsc.tws.rsched.schedulers.standalone.MPIWorkerStarter.startWorkerWithJM(MPIWorkerStarter.java:253)
        at edu.iu.dsc.tws.rsched.schedulers.standalone.MPIWorkerStarter.<init>(MPIWorkerStarter.java:161)
        at edu.iu.dsc.tws.rsched.schedulers.standalone.MPIWorkerStarter.main(MPIWorkerStarter.java:120)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at edu.iu.dsc.tws.api.resource.Network.initializeChannel(Network.java:67)
        ... 8 more
Caused by: org.openucx.jucx.UcxException: Input/output error
        at org.openucx.jucx.ucp.UcpListener.createUcpListener(Native Method)
        at org.openucx.jucx.ucp.UcpListener.<init>(UcpListener.java:25)
        at org.openucx.jucx.ucp.UcpWorker.newListener(UcpWorker.java:49)
        at edu.iu.dsc.tws.comms.ucx.TWSUCXChannel.createUXCWorker(TWSUCXChannel.java:100)
        at edu.iu.dsc.tws.comms.ucx.TWSUCXChannel.<init>(TWSUCXChannel.java:86)
        ... 13 more
 
[2020-09-14 05:27:07 -0400] [WARNING] [-] [JobMaster] edu.iu.dsc.tws.master.server.JMWorkerHandler: Worker [130] Failed.  
[2020-09-14 05:27:07 -0400] [SEVERE] [-] [JobMaster] edu.iu.dsc.tws.master.server.WorkerMonitor: Worker: 130 FULLY_FAILED.  


Chathura Widanage

Sep 14, 2020, 10:45:45 AM
to Ahmet Uyar, Twister2
Hi,

Currently, we take the port from the worker controller. Could ports collide in that case?


Regards,
Chathura


--
You received this message because you are subscribed to the Google Groups "Twister2" group.
To unsubscribe from this group and stop receiving emails from it, send an email to twister2+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/twister2/CAPBRfYdtLnZQPeUG%3DnXCBf4p2mkbeRzcqpm_b%2BxPcoyNinxp1w%40mail.gmail.com.

Ahmet Uyar

Sep 15, 2020, 3:32:22 PM
to Chathura Widanage, Twister2
Hi guys,

I am now releasing the ServerSocket, but some workers still get the UCX bind error: "UCX  ERROR bind(fd=205 addr=172.29.200.202:34761) failed: Address already in use".

However, that worker has clearly closed the socket before the UCX channel is initialized. Do you have any suggestions?

thanks,

Ahmet
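
For readers following along, here is a minimal sketch of the release-then-bind pattern described above. It is plain Java, not taken from the Twister2 code, and the names `ReleaseThenBind`, `reserveAndRelease`, and `bindWithFallback` are hypothetical. It illustrates that once the reserving `ServerSocket` is closed, nothing holds the port, so any other process on the node can take it before the next bind. The ports in the errors (36367, 34761) fall in the default Linux ephemeral port range, so an outgoing connection grabbing the port is one possible, unconfirmed, explanation.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

class ReleaseThenBind {

  /** Asks the OS for a free port, then releases it and returns its number. */
  static int reserveAndRelease() throws IOException {
    try (ServerSocket probe = new ServerSocket(0)) {
      return probe.getLocalPort();
    } // the port is free again from here on; nothing reserves it
  }

  /**
   * Tries to bind the previously reserved port. If someone else took it in
   * the meantime, falls back to a fresh OS-chosen port instead of failing.
   */
  static ServerSocket bindWithFallback(int preferredPort) throws IOException {
    ServerSocket socket = new ServerSocket(); // unbound
    try {
      socket.bind(new InetSocketAddress(preferredPort));
      return socket;
    } catch (IOException e) { // BindException is a subclass of IOException
      socket.close();
      return new ServerSocket(0);
    }
  }

  public static void main(String[] args) throws IOException {
    int port = reserveAndRelease();
    // Race window: between the release above and the bind below,
    // any process on the node may bind or connect from this port.
    try (ServerSocket server = bindWithFallback(port)) {
      System.out.println("reserved " + port + ", bound " + server.getLocalPort());
    }
  }
}
```

The fallback only helps if the final port number can still be advertised to the other workers after the bind, which is an assumption about the surrounding system.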
  

Chathura Widanage

Sep 15, 2020, 3:37:31 PM
to Ahmet Uyar, Twister2
Ahmet,

Set the following line to true. This will probably resolve the issue. I added it to prevent some other complications, but since we reach a barrier before closing the sockets, we will hopefully not see any issues.



Regards,
Chathura

Ahmet Uyar

Sep 15, 2020, 3:40:13 PM
to Chathura Widanage, Twister2
Actually I tested it by setting it to true but it did not help. 

Ahmet

Chathura Widanage

Sep 15, 2020, 3:44:54 PM
to Ahmet Uyar, Twister2
Ahmet,

Have you verified whether the workers on the same node are trying to bind to different ports? Are you sure that line 56 is still the cause?

Regards,
Chathura



Ahmet Uyar

Sep 15, 2020, 4:20:51 PM
to Chathura Widanage, Twister2
Hi Chathura,

I attached the logs. I print a log message when each worker releases its port.
In this run, worker 252 complains that port 34761 is in use: "UCX  ERROR bind(fd=205 addr=172.29.200.202:34761) failed: Address already in use".
However, the logs show that worker 252 had already released this port. It logged the following message:
[2020-09-15 15:23:05 -0400] [WARNING] [worker-252] [Twister2MPIWorker-252] edu.iu.dsc.tws.common.util.NetworkUtils: port released: 34761 

Also, none of the 400 workers is actually bound to port 34761, neither on that machine nor on any other machine.
I suspect that the OS may not be releasing the port immediately.


thanks,

Ahmet
auyar-terasort-orja2m1.log
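
One way to probe the suspicion that the OS does not release the port immediately is a small standalone check like the sketch below. It is not part of Twister2, and the names `ImmediateRebindCheck` and `rebindImmediately` are hypothetical. It binds an OS-chosen port, closes the socket, and at once tries to bind the same port again. Note that it goes through Java sockets, whose initial SO_REUSEADDR setting is platform dependent, while UCX binds in native code, so a result here may not carry over directly.

```java
import java.io.IOException;
import java.net.ServerSocket;

class ImmediateRebindCheck {

  /**
   * Binds an OS-chosen port, closes the socket, and immediately tries to
   * bind the same port again. Returns true if the second bind succeeds.
   */
  static boolean rebindImmediately() throws IOException {
    int port;
    try (ServerSocket first = new ServerSocket(0)) {
      port = first.getLocalPort();
    } // close() has returned by this point
    try (ServerSocket second = new ServerSocket(port)) {
      return second.getLocalPort() == port;
    } catch (IOException e) {
      // Typically a BindException such as "Address already in use"
      return false;
    }
  }

  public static void main(String[] args) throws IOException {
    int attempts = 100;
    int failures = 0;
    for (int i = 0; i < attempts; i++) {
      if (!rebindImmediately()) {
        failures++;
      }
    }
    System.out.println(failures + " of " + attempts + " immediate rebinds failed");
  }
}
```

Running this on one of the affected nodes while a job is active would show whether immediate rebinds fail there at all, and how often.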

Supun Kamburugamuve

Sep 15, 2020, 7:39:47 PM
to Ahmet Uyar, Chathura Widanage, Twister2
The OS keeps the socket (port) around for some time even after the close method has completed.


Best,
Supun..



--
Supun Kamburugamuve, PhD
Digital Science Center, Indiana University
Member, Apache Software Foundation; http://www.apache.org
E-mail: supun@apache.org;  Mobile: +1 812 219 2563


Chathura Widanage

Sep 15, 2020, 7:42:35 PM
to Supun Kamburugamuve, Ahmet Uyar, Twister2
Hi Supun,

But setting socket.setReuseAddress should make a difference, right?

Enabling SO_REUSEADDR prior to binding the socket using bind(SocketAddress) allows the socket to be bound even though a previous connection is in a timeout state.

Regards,
Chathura
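
As a minimal sketch of the Java side of this suggestion, the snippet below enables SO_REUSEADDR before binding, which is the order the quoted documentation requires. The class and method names (`ReuseAddressBind`, `bindReusable`) are hypothetical, and this only affects sockets created through `java.net`; it does not change how the UCX listener binds in native code.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

class ReuseAddressBind {

  /** Creates a ServerSocket with SO_REUSEADDR enabled before it is bound. */
  static ServerSocket bindReusable(int port) throws IOException {
    ServerSocket socket = new ServerSocket(); // created unbound on purpose
    try {
      socket.setReuseAddress(true); // must happen before bind()
      socket.bind(new InetSocketAddress(port));
      return socket;
    } catch (IOException e) {
      socket.close(); // do not leak the descriptor if bind fails
      throw e;
    }
  }

  public static void main(String[] args) throws IOException {
    // Port 0 lets the OS choose; pass a fixed port to test a specific one.
    try (ServerSocket server = bindReusable(0)) {
      System.out.println("bound port " + server.getLocalPort()
          + ", reuseAddress=" + server.getReuseAddress());
    }
  }
}
```

The no-argument constructor is used because `new ServerSocket(port)` binds immediately, leaving no chance to set the option first.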

Supun Kamburugamuve

Sep 15, 2020, 7:48:08 PM
to Chathura Widanage, Ahmet Uyar, Twister2
That article describes an issue with setting the reuseaddress as well. 

Best,
Supun..

Chathura Widanage

Sep 15, 2020, 7:50:13 PM
to Supun Kamburugamuve, Ahmet Uyar, Twister2
I am thinking that setting SO_REUSEADDR would resolve the problem. By default, UCX might not be using this.

Regards,
Chathura

Chathura Widanage

Sep 15, 2020, 7:50:41 PM
to Supun Kamburugamuve, Ahmet Uyar, Twister2
** setting SO_REUSEADDR on the UCX end.

Regards,
Chathura

Chathura Widanage

Sep 15, 2020, 8:02:19 PM
to Supun Kamburugamuve, Ahmet Uyar, Twister2