Experiment Network Issues with c6525-25g Nodes


Yacqub Mohamed

Mar 23, 2026, 12:49:24 PM
to cloudlab-users
Hello,

I currently have an experiment with a few c6525-25g nodes in the Utah cluster, and am running into two main issues regarding the experiment network:

1. Some pairs of nodes cannot communicate with each other (`ping` fails). 

2. For the ones that can communicate with each other, I'm able to achieve 23 Gbps for some of them, and only 14-15 Gbps for others.

For example, with the nodes 10.10.1.2 and 10.10.1.3, I can get 23 Gbps when running these two commands:
```
10.10.1.3: iperf3 -s
10.10.1.2: iperf3 -c 10.10.1.3 -l 100K
```

But when running the same test with nodes 10.10.1.2 and 10.10.1.4, I can only get 15 Gbps, and increasing the number of parallel client streams only raises this to 19 Gbps.
```
10.10.1.4: iperf3 -s
10.10.1.2: iperf3 -c 10.10.1.4 -l 100K
```

---

Is there a way to reliably fix these issues through CloudLab or my OS image, or do I just need to keep rebooting the nodes until they hopefully work?

Thank you for your time.

Sincerely,
Yacqub Mohamed

Mike Hibler

Mar 23, 2026, 3:27:51 PM
to 'Yacqub Mohamed' via cloudlab-users
Send a link to the experiment status page please.

ajma...@gmail.com

Mar 23, 2026, 3:29:03 PM
to cloudlab-users
Hi, as Mike said, in the future please include a link to the experiment status page so we can more easily find your experiment.

I tracked down which nodes you were using, and it looks like cr0 (amd238) isn't recognizing its experiment-net NIC. This issue is not uncommon on the c6525 nodes, and can be fixed with a power cycle. On the experiment status page, you can either go to the topology view, click the node, and click "Power Cycle", or go to the list view, click the gear button for the node in question on the far right, and click "Power Cycle" there. I have already done this for amd238 and it brought the NIC back. FYI, a plain reboot does not typically fix this issue; it has to be a power cycle.

As far as the difference in bandwidth goes, I have not personally seen this before. If you're confident that nothing else is running on these nodes that could interfere, and this behavior is consistent, then I would try power cycling those nodes as well and see whether the behavior persists.

Best,
 - Aleks

Yacqub Mohamed

Mar 23, 2026, 10:03:24 PM
to cloudlab-users
Apologies about not including the URL of the experiment status page, and thank you for your help so far.

After power-cycling the nodes, the first problem went away, but the second problem persisted.

Even after starting another experiment with c6525-25g nodes and no inter-switch links (URL: https://www.cloudlab.us/status.php?uuid=d0a5e6ee-4ff8-4228-a0f7-7becf98a9259#), and power-cycling the nodes, some pairs of nodes still get only 14-15 Gbps (such as amd238 and amd227). With these nodes, I'm noticing a large number of retries:
```
[  5] local 10.10.1.4 port 33180 connected to 10.10.1.5 port 5201
[ ID] Interval                 Transfer        Bitrate                Retr   Cwnd
[  5]   0.00-1.00   sec  1.91 GBytes  16.4 Gbits/sec  575   1.29 MBytes      
[  5]   1.00-2.00   sec  1.60 GBytes  13.8 Gbits/sec   23   1.59 MBytes      
[  5]   2.00-3.00   sec  1.59 GBytes  13.6 Gbits/sec  787   1.14 MBytes      
[  5]   3.00-4.00   sec  1.58 GBytes  13.6 Gbits/sec  464   1.09 MBytes      
[  5]   4.00-5.00   sec  1.59 GBytes  13.6 Gbits/sec  398   1.15 MBytes      
[  5]   5.00-6.00   sec  1.61 GBytes  13.8 Gbits/sec  462   1.35 MBytes      
[  5]   6.00-7.00   sec  1.56 GBytes  13.4 Gbits/sec  632   1.11 MBytes      
[  5]   7.00-8.00   sec  1.65 GBytes  14.2 Gbits/sec  120   1.63 MBytes      
[  5]   8.00-9.00   sec  1.59 GBytes  13.7 Gbits/sec  609   1.19 MBytes      
[  5]   9.00-10.00  sec  1.67 GBytes  14.4 Gbits/sec  581   1.14 MBytes      
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  16.3 GBytes  14.0 Gbits/sec  4651    sender
[  5]   0.00-10.04  sec  16.3 GBytes  14.0 Gbits/sec               receiver
```

Is there any advice on how to fix this? Thank you.

Sincerely,
Yacqub Mohamed

ajma...@gmail.com

Mar 24, 2026, 5:05:18 PM
to cloudlab-users
Hi Yacqub,

I ran some iperf3 tests of my own on your nodes.  First I had amd226, amd227, and amd238 all talking to each other, each acting as server and client in turn, then I had amd218, amd226, and amd241 all talking to each other in the same manner.  Unfortunately, while I saw this same behavior with various permutations of pairs of nodes, I didn't see anything that stood out as a specific smoking gun (such as one single node being a problem).  I will mention that overall, the tests involving amd227 and amd238 had the lowest throughput of all of them.  I re-ran iperf3 tests between those two nodes specifically while looking at traffic statistics on the switch, and I didn't see any appreciable signs of congestion or throttling on the switch even while the tests showed numerous retries and low throughput.

Have you experimented with kernel parameter tuning at all? Specifically, the kernel socket buffer limits (net.core.rmem_max and net.core.wmem_max) and the TCP socket buffer sizes (net.ipv4.tcp_rmem and net.ipv4.tcp_wmem). You should also look into the various tuning options within iperf3 that can help with high-bandwidth testing. While all of these retries could be a sign of NIC or cable problems, it would be worthwhile to eliminate or mitigate OS-level interference first and then work from there. You should also look into core pinning on both the client and server, either using iperf3's -A flag or with something like numactl. Even though these nodes are all the same hwtype, loaded with the same OS, and connected to the same switch, the OSes on each node could be doing considerably different things.
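A rough sketch of that tuning, in case it's useful. The specific buffer values and core numbers below are illustrative assumptions, not CloudLab recommendations; pick sizes appropriate to your bandwidth-delay product and cores local to the NIC's NUMA node:

```shell
# Raise the maximum socket buffer sizes (values are examples, in bytes)
sudo sysctl -w net.core.rmem_max=268435456
sudo sysctl -w net.core.wmem_max=268435456

# Let TCP autotuning grow its buffers up to the same ceiling
# (format is "min default max", in bytes)
sudo sysctl -w net.ipv4.tcp_rmem="4096 131072 268435456"
sudo sysctl -w net.ipv4.tcp_wmem="4096 131072 268435456"

# Re-run the test with iperf3 pinned to a core on each side
# (core 4 here is an assumption -- choose a core on the NIC's NUMA node)
iperf3 -s -A 4                            # on the server node
iperf3 -c 10.10.1.4 -A 4 -l 100K -t 30   # on the client node
```

These sysctl changes don't persist across reboots; add them to /etc/sysctl.conf (or a file under /etc/sysctl.d/) if you want them to survive a power cycle.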

Sorry I don't have a more concrete answer for you, but hopefully my suggestions above can help you track this down.

Best,
 - Aleks

John Ousterhout

Mar 24, 2026, 5:37:00 PM
to cloudla...@googlegroups.com
Hi Yacqub,

I don't know whether this will help, but I've noticed considerable variation in networking performance in my Homa experiments, both when running Homa and when running TCP, and it happens on all clusters. One factor I've measured is where the Linux scheduler happens to place your threads, relative to the cores that are receiving network packets. This varies from machine to machine, and from run to run on the same machine. You might try pinning your threads to cores, then vary the pinned cores to see what happens to performance. I think there's a good chance that you'll see consistent "best case" performance across machines, though the exact core configuration required will vary from machine to machine (and it even varies across reboots, since Linux randomizes the hash function used to assign incoming packets to cores each time it reboots).
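One way to sketch that experiment, assuming taskset is available and 10.10.1.4 is the server (both assumptions for illustration): pin the client to different cores in turn and compare the summary lines, then check which cores are actually servicing the NIC's receive interrupts.

```shell
# On the server node, pin iperf3 to a fixed core:
#   iperf3 -s -A 2

# On the client node, sweep the pinned core and compare throughput.
# The core list is an example -- adjust it to your machine's topology.
for core in 0 2 4 8 16; do
    echo "=== client pinned to core $core ==="
    taskset -c "$core" iperf3 -c 10.10.1.4 -l 100K -t 10 | tail -n 3
done

# See which cores are taking the NIC's receive interrupts
# ("mlx5" is an assumption -- substitute your driver/interface name):
grep mlx5 /proc/interrupts
```

If John's observation applies here, some pinned cores should consistently reach the "good" ~23 Gbps number while others land in the 14-15 Gbps range, and the winning cores may change after a reboot since the packet-steering hash is re-randomized.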

-John-
