High latency on c6620 nodes

john.ou...@gmail.com

Sep 30, 2025, 6:48:54 PM
to cloudlab-users
I have recently started doing Homa performance measurements on the c6620 cluster and I'm seeing very high latency: 50 usecs unloaded RTT for short messages (both with Homa and with TCP). For comparison, xl170 RTTs are about 15 usecs for Homa and 25 usecs for TCP.

The extra delay occurs between the moment the driver rings the NIC's doorbell on the sender and the start of the interrupt handler on the receiver, which places it somewhere in the NICs or the switch.

What kind of switch is used for the c6620 cluster (I couldn't find a description of it in the CloudLab Manual)? And what is the interconnect between the nodes and the switch: copper or fiber? What kind of PHYs are used on each end (I believe some PHYs add significant latency because they perform expensive forward error correction)? Is there latency information available for any of these components?
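
Some of this can also be cross-checked from the node side with ethtool, assuming the NIC driver exposes the cable's module EEPROM; the interface name below is just a placeholder, so substitute whatever the c6620 experimental interface is actually called:

    # Link speed, port type, and negotiated settings on the node side
    ethtool ens1f0np0

    # Cable/module identification: vendor, part number, copper vs. fiber, length
    sudo ethtool -m ens1f0np0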

Thanks in advance for any information you can provide.

-John-

Mike Hibler

Sep 30, 2025, 7:12:57 PM
to cloudla...@googlegroups.com
Aleks can give you more info, but briefly...

It is a Dell Z9664F switch, OS10 version 10.6.0.2. We use 4x100 -> 400Gb breakout cables, 2-3m.
E.g. "QSFP56-DD type is QSFP56-DD 4x(100GBASE-CR2)-2.0M"

ajma...@gmail.com

Oct 1, 2025, 2:33:34 AM
to cloudlab-users
Hi John,

To add to what Mike said, these are passive (rather than active) copper 400Gb <-> 4x100Gb breakout cables. As such, the node end is 50GbE-based using PAM4 modulation (the switch end is 8x50GbE PAM4), so they don't require the gearboxes used for 25GbE <-> 50GbE lane conversion and PAM4 <-> NRZ remodulation found in some other 400GbE <-> 4x100GbE breakout cables.

The switch interfaces claim they've negotiated rs544 FEC, and there are some decoder latency values published here: https://www.signalintegrityjournal.com/articles/3405-200-gbps-ethernet-forward-error-correction-fec-analysis . I've seen a number of discussions of rs544 FEC latency from a quick search, but the values discussed (30-80ns) are nowhere near the 50us you're seeing.

Info on the cables themselves is rather scant online. The best I've been able to find is a datasheet from fs.com, but I don't think it provides what you're looking for: https://resource.fs.com/mall/doc/20240428122423cdk25i.pdf .

As for the switch, it claims "sub-850ns" latency, and I don't think it is passing enough traffic for queueing delays to start showing up. Might it be an issue with the NIC? We haven't had a whole lot of experience with the Intel E810 NICs before.
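
For what it's worth, the FEC mode the NIC side believes it has negotiated can be checked with a recent ethtool (--show-fec), which would confirm whether it matches the switch's rs544 claim; the interface name here is again just a placeholder:

    # Configured and active FEC on the node-side interface
    ethtool --show-fec ens1f0np0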

Best,
 - Aleks

Mike Hibler

Oct 1, 2025, 10:34:41 AM
to cloudla...@googlegroups.com
I can find a 100Gb DAC if you want us to connect two of your machines directly; that way we could possibly rule out the switch.

John Ousterhout

Oct 1, 2025, 2:55:25 PM
to cloudla...@googlegroups.com
Thanks for all the information. I found the problem and it was my fault: the NIC was delaying interrupts (rx-usecs in ethtool was set to 50). I could have sworn my config scripts turned this off, but apparently not. I'm now seeing latencies of 18-25 usecs, depending on which cores Linux schedules relevant threads on. Oddly, TCP RTTs are still in the 50 usec range; I haven't tracked that down yet.
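
For reference, a minimal sketch of checking and disabling the interrupt coalescing with ethtool, assuming a placeholder interface name and that the driver honors each option (support varies by driver):

    # Show the current interrupt coalescing settings (rx-usecs, adaptive-rx, ...)
    ethtool -c ens1f0np0

    # Turn off adaptive moderation and the interrupt delay for latency runs
    sudo ethtool -C ens1f0np0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0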

-John-

ajma...@gmail.com

Oct 1, 2025, 3:04:25 PM
to cloudlab-users
Thanks for the update, John. Keep us posted on what you find out regarding the TCP RTTs, and let us know if you need us to temporarily rewire anything.