Debugging Apache Ignite - workers not added to topology

Robert Syme

Sep 17, 2021, 12:00:49 PM

to Nextflow

Hi all

I'm helping a colleague provision temporary clusters on an Openstack infra setup. It would be nice to use the Apache Ignite scheduler to avoid having to set up SLURM for each of these small temporary projects.

When I start the main node (10.10.1.210) and then the worker node (10.10.3.217), it looks like the worker node's presence is detected by the main node, as the .nextflow.log in the main node shows:

Sep-17 15:31:57.029 [tcp-disco-srvr-#3%nextflow%] INFO o.a.i.s.d.tcp.TcpDiscoverySpi - TCP discovery accepted incoming connection [rmtAddr=/10.10.3.217, rmtPort=34421]

Sep-17 15:31:57.037 [tcp-disco-srvr-#3%nextflow%] INFO o.a.i.s.d.tcp.TcpDiscoverySpi - TCP discovery spawning a new thread for connection [rmtAddr=/10.10.3.217, rmtPort=34421]

Sep-17 15:31:57.038 [tcp-disco-sock-reader-#5%nextflow%] INFO o.a.i.s.d.tcp.TcpDiscoverySpi - Started serving remote node connection [rmtAddr=/10.10.3.217:34421, rmtPort=34421]

Soon after, the .node-nextflow.log on the worker node reports that the message times out:

Sep-17 15:32:27.085 [main] WARN o.a.i.s.d.tcp.TcpDiscoverySpi - Timed out waiting for message delivery receipt (most probably, the reason is in long GC pauses on remote node; consider tuning GC and increasing 'ackTimeout' configuration property). Will retry to send message with increased timeout [currentTimeout=30000, rmtAddr=/10.10.1.210:47500, rmtPort=47500]

I've already increased the cluster.ackTimeout to 30000 in the ~/.nextflow/config in both the main and the worker node.

Does anybody have any hints on how I might go about getting the worker node to join the cluster?

Thanks!

-Rob

Paolo Di Tommaso

Sep 21, 2021, 1:00:28 AM

to nextflow

Could it be a networking issue? I mean, are the required ports open between the two nodes? In any case, for a 2-node cluster I'm not sure it makes sense to use Ignite. It may be better to scale vertically, i.e. use a single big VM instance instead.
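
One quick way to test the open-ports theory is a small TCP probe run from each node against the other. The sketch below is only an illustration: the helper names `port_open` and `report` are made up here, and the port list is an assumption. 47500 is the discovery port shown in the log above, while 47100 is Ignite's usual default communication port and may differ in your setup.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def report(host: str, ports, timeout: float = 3.0) -> list:
    """Build one human-readable status line per probed port (hypothetical helper)."""
    lines = []
    for port in ports:
        state = "open" if port_open(host, port, timeout) else "closed or filtered"
        lines.append(f"{host}:{port} {state}")
    return lines
```

For example, on the worker node you could run `print("\n".join(report("10.10.1.210", [47500, 47100])))`, and then the same from the main node against 10.10.3.217. The log above suggests the worker can already reach 47500 on the main node, so the reverse direction and the communication port are probably the more interesting checks. A "closed or filtered" result would point towards OpenStack security group or firewall rules rather than GC pauses.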
