Severe packet delays in Dell Z9264 switch?

john.ou...@gmail.com

May 1, 2024, 1:03:21 AM
to cloudlab-users
I am observing what appear to be very large packet delays in the Dell Z9264 switch used by the c6525-100g cluster. Specifically, individual packets occasionally seem to take 2 ms or more to arrive at their destination, while other packets between the same host pair, sent both before and after, arrive within a few microseconds. Furthermore, during these intervals I see gaps during which the destination receives no packets at all, so its link is not overloaded.
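
To make this concrete, here is a minimal stand-alone sketch of the kind of probe that exposes the outliers (this is a stand-in, not my actual measurement harness; the port number is arbitrary). Run it with "echo" on one node and "probe <echo-host>" on another; it flags any round trip that takes 2 ms or more against the microsecond-scale baseline:

    import socket, struct, sys, time

    PORT = 9999              # arbitrary port for this sketch
    COUNT = 100_000          # number of probes to send
    THRESH_NS = 2_000_000    # flag round trips of 2 ms or more

    def echo():
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.bind(("", PORT))
        while True:
            data, addr = s.recvfrom(64)
            s.sendto(data, addr)            # reflect each probe unchanged

    def probe(host):
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.settimeout(1.0)
        rtts = []
        for seq in range(COUNT):
            t0 = time.monotonic_ns()
            s.sendto(struct.pack("!Q", seq), (host, PORT))
            try:
                s.recvfrom(64)
            except socket.timeout:
                print(f"seq {seq}: lost")
                continue
            rtt = time.monotonic_ns() - t0
            rtts.append(rtt)
            if rtt >= THRESH_NS:
                print(f"seq {seq}: rtt {rtt / 1e6:.3f} ms")
        rtts.sort()
        print(f"median {rtts[len(rtts) // 2] / 1e3:.1f} us, "
              f"max {rtts[-1] / 1e6:.3f} ms over {len(rtts)} replies")

    if __name__ == "__main__":
        echo() if sys.argv[1] == "echo" else probe(sys.argv[2])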

It's possible that I am somehow observing things incorrectly, but I've been probing more and more and the data seems to point very consistently in the direction of the switch.

Has anyone else reported behavior like this? Is there anything known about the switch that could cause such behavior? The OS manual for the switch mentions "deep buffers"; does this switch have that option? Is it possible that during a momentary overload some packets get shunted off to deep buffers (which are presumably in slower memory) and it takes them a very long time to find their way back into the switch again?

-John-

Leigh Stoller

May 1, 2024, 10:04:24 AM
to cloudla...@googlegroups.com

> Has anyone else reported behavior like this? Is there anything known about the switch that could cause such behavior? The OS manual for the switch mentions "deep buffers"; does this switch have that option? Is it possible that during a momentary overload some packets get shunted off to deep buffers (which are presumably in slower memory) and it takes them a very long time to find their way back into the switch again?

Hi. I do not think anyone else has reported it. If it happens again, we
need a link to a running experiment, the specific nodes, and hopefully
a simple way to demonstrate the problem.

Thanks
Leigh

John Ousterhout

May 3, 2024, 10:53:12 AM
to cloudla...@googlegroups.com
I can easily reproduce the problem, but not in a very deterministic fashion. Every time I run an experiment, a significant number of packets (hundreds, at least) experience multi-ms delays, but I can't predict which ones, and there are a lot of packets overall. I'm guessing this sort of "reproducibility" isn't very useful? One thing I could do is demonstrate the problem statistically rather than packet-by-packet; see the sketch below.
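
As a minimal sketch (assuming a log file with one per-packet delay in microseconds per line, which is not the exact format my harness produces), the tail percentiles show the problem even though individual outliers aren't repeatable:

    import sys

    # Read one delay (in microseconds) per line from a probe log.
    with open(sys.argv[1]) as f:
        delays = sorted(float(line) for line in f)

    n = len(delays)
    for p in (50, 90, 99, 99.9, 99.99):
        idx = min(n - 1, int(n * p / 100))
        print(f"p{p}: {delays[idx]:.1f} us")
    print(f"delays >= 2000 us: {sum(d >= 2000 for d in delays)} of {n}")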

-John-

Robert Ricci

May 3, 2024, 1:36:10 PM
to cloudla...@googlegroups.com, ama...@cs.utah.edu
Aleks, has anything like this shown up in your testing?

[For context, we run a bunch of tests with very high-precision timestamps
as part of the microbenchmarks we run on CloudLab nodes while they are free.]

ajma...@gmail.com

May 3, 2024, 6:05:43 PM
to cloudlab-users
Hi John,

As far as my automated benchmark collection goes, testing the 100 Gbps links was still future work when I paused benchmarking. I also don't remember noticing anything in my own independent testing when I first deployed one of the Z9264s, but I probably wasn't stressing the switch in nearly the same way you are.

Sorry to say I don't really have a good guess as to why this is happening. I don't know what your rates are, average or peak, but you make it sound like the switch generally isn't being overloaded. The Z9264F-ON is built around Broadcom's StrataXGS Tomahawk II ASIC; there may be more information in its data sheet (which looks to be available by request) that could help here. Regarding packet buffers, the listed packet buffer size is 42 MB, and it looks like "deep buffer" mode is only available on the S4200-series switches.

Have you seen this behavior in any other workloads besides your Homa testing? This could be an OS bug, and I could update the switch software to see if that helps anything. Depending on the nature of the problem, we may need to open an issue with Dell. Let me know how I can help.
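
One back-of-envelope that might be relevant (my own arithmetic, not anything from Dell or Broadcom): even without a deep-buffer mode, if most of that 42 MB shared buffer backed up behind a single 100 Gbps egress port, a packet at the back of the queue would wait over 3 ms, which is in the range you're seeing.

    BUFFER_BYTES = 42e6      # listed shared packet buffer size, 42 MB
    LINK_BPS = 100e9         # one 100 Gbps egress port

    # Worst case: the entire shared buffer queued behind one egress port.
    drain_s = BUFFER_BYTES * 8 / LINK_BPS
    print(f"worst-case on-chip queueing delay: {drain_s * 1e3:.2f} ms")  # ~3.36 ms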

Thanks,
 - Aleks

John Ousterhout

May 9, 2024, 12:17:33 PM
to cloudla...@googlegroups.com
I think opening an issue with Dell may be the best approach, but I'd like to do some more measurements first to make sure it's not actually something I'm doing. This will take a while... I'll get back in touch then.

-John-

John Ousterhout

May 9, 2024, 12:21:53 PM
to cloudla...@googlegroups.com
Do you have a way of obtaining detailed information on the Tomahawk II ASIC? If so, it would be useful for me to take a look at it, to see if it might explain the behavior I'm seeing.

Also, my benchmark is trying to load the switch pretty heavily. The goal is for each host uplink to be running at 80% utilization in both directions during the entire experiment; so far, I'm only getting about 65% utilization, but that's still pretty high.
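
In case the numbers are useful, this is roughly how I sanity-check uplink utilization on each host (a minimal sketch that samples the Linux sysfs byte counters; ens1f0 is a placeholder for whatever interface name the c6525-100g nodes actually use):

    import time

    IFACE = "ens1f0"        # placeholder interface name
    LINK_BPS = 100e9
    WINDOW_S = 1.0

    def read_bytes(direction):
        with open(f"/sys/class/net/{IFACE}/statistics/{direction}_bytes") as f:
            return int(f.read())

    rx0, tx0 = read_bytes("rx"), read_bytes("tx")
    time.sleep(WINDOW_S)
    rx1, tx1 = read_bytes("rx"), read_bytes("tx")
    for name, delta in (("rx", rx1 - rx0), ("tx", tx1 - tx0)):
        print(f"{name}: {delta * 8 / WINDOW_S / LINK_BPS * 100:.1f}% of line rate")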

-John-

ajma...@gmail.com

May 9, 2024, 7:08:38 PM
to cloudlab-users
Besides requesting a datasheet from Broadcom, I will note that we have a spare Z9264 switch that I've been using to play around with the SONiC network OS. One of the features it has over Dell OS10 is a command called bcmsh that opens a command prompt on the ASIC itself. While that doesn't help you gather metrics on the switch you're running on, it could let you probe for some lower-level information about the ASIC. If there's a way to access it through OS10, it certainly isn't advertised publicly...