One node not connected to rest in my experiment


Ertza Warraich

Jul 14, 2024, 6:21:04 PM
to cloudlab-users
Hi, I created an experiment on the Clemson cluster with 8 nodes. Node 3 does not seem to be connected to the rest: I cannot ping between it and the other nodes, while all the other nodes can reach each other.
I am using the NIC eno33 because I want to use the 25G link, and I set that in the profile.

This is my experiment link: 

Can anyone take a look and suggest something, please?
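
For anyone hitting the same symptom, here is a minimal sketch of how the unreachable node might be narrowed down from one of the working nodes. It is not taken from the thread: the `node0` to `node7` hostnames are hypothetical placeholders, the helper names `ping_once` and `unreachable_peers` are made up for illustration, and it assumes the Linux `ping` command is available on the node.

```python
import subprocess
from typing import Callable, Iterable, List


def ping_once(host: str, timeout_s: int = 1) -> bool:
    """Return True if a single ICMP echo to `host` succeeds (Linux `ping`)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def unreachable_peers(
    peers: Iterable[str],
    probe: Callable[[str], bool] = ping_once,
) -> List[str]:
    """Return the peers for which `probe` reports failure, in input order."""
    return [peer for peer in peers if not probe(peer)]


if __name__ == "__main__":
    # Hypothetical hostnames: substitute the names or experiment-network
    # IP addresses defined in your own profile.
    peers = [f"node{i}" for i in range(8)]
    print("unreachable:", unreachable_peers(peers))
```

Running this from two different nodes helps tell a single bad link (the same peer fails from everywhere) apart from a problem on the node doing the probing (every peer fails).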

Ertza Warraich

Jul 14, 2024, 8:24:54 PM
to cloudlab-users
I re-created the experiment and similarly this time node 7 is not connected.

New experiment's link:
https://www.cloudlab.us/status.php?uuid=a9025d61-423c-11ef-9f39-e4434b2381fc

Sam Pyankov

Jul 14, 2024, 8:29:10 PM
to cloudla...@googlegroups.com
Why do I receive those emails?

Sincerely,
Sam Pyankov


--
You received this message because you are subscribed to the Google Groups "cloudlab-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cloudlab-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/cloudlab-users/a184eea1-ea70-46fa-a237-dda0ae454a66n%40googlegroups.com.

Mike Hibler

Jul 14, 2024, 9:54:00 PM
to cloudla...@googlegroups.com
Maybe something wrong with the interface or cable. I will have the Clemson
folks check it out. FYI, at least a couple of your nodes are constantly
spitting out:
-----
[ 5138.736663] NVRM: GPU 0000:21:00.0 is already bound to nouveau.
[ 5138.736671] NVRM: GPU 0000:e2:00.0 is already bound to nouveau.
[ 5138.736692] NVRM: The NVIDIA probe routine was not called for 2 device(s).
[ 5138.736693] NVRM: This can occur when another driver was loaded and
NVRM: obtained ownership of the NVIDIA device(s).
[ 5138.736693] NVRM: Try unloading the conflicting kernel module (and/or
NVRM: reconfigure your kernel without the conflicting
NVRM: driver(s)), then try loading the NVIDIA kernel module
NVRM: again.
[ 5138.736694] NVRM: No NVIDIA devices probed.
[ 5138.736894] nvidia-nvlink: Unregistered Nvlink Core, major device number 509
----
to "dmesg" and the console. Shouldn't have anything to do with the network
problem.

Ricardo Guimarães

Jul 15, 2024, 8:19:10 AM
to cloudlab-users
I sometimes have the same issue, but with an image I have on the Utah cluster.

I reported it in this conversation: https://groups.google.com/g/cloudlab-users/c/7H7ImNsvZyc/m/gS-OWGgXBAAJ.

I have not made any changes to the Mellanox driver, nor have I explicitly done anything with "bpfilter". I had been using this image for a long time, and suddenly I started having this issue.

If possible, I would appreciate it if the solution could be shared so I can run my large experiments.

Thankfully,
Ricardo.

Leigh Stoller

Jul 15, 2024, 1:11:01 PM
to cloudla...@googlegroups.com

> I re-created the experiment and similarly this time node 7 is not connected.
>
> New experiment's link:
> https://www.cloudlab.us/status.php?uuid=a9025d61-423c-11ef-9f39-e4434b2381fc

Hi. Take a look and let us know if clgpu014 has its link now.

Thanks!
Leigh

Ertza Warraich

Jul 15, 2024, 1:12:32 PM
to cloudla...@googlegroups.com
Perfect, working now!

Thanks so much. 
