Mellanox NIC issue


john.ou...@gmail.com

May 24, 2024, 1:21:30 AM
to cloudlab-users
I'm currently experiencing a strange problem with my xl170 experiment:


The Mellanox NIC on node3 of this cluster (hp044) is taking unusually long to notify the host of tx completion after it transmits a packet. Normally this should take only a few microseconds, but I'm seeing delays of 100 usec or more. I don't see such delays on any of the other nodes in the experiment.
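For illustration only, a per-node comparison of this kind could be sketched as below. This is not the instrumentation used in the experiment; the node names, the sample values, and the `slow_nodes` helper are all hypothetical, and the 100 usec threshold is simply the figure mentioned above.

```python
from statistics import median

# Hypothetical per-node samples: delay in microseconds between handing a
# packet to the NIC and seeing its tx completion. All values are made up.
samples_usec = {
    "node1": [3.1, 2.8, 4.0, 3.5],
    "node2": [2.9, 3.3, 3.0, 3.7],
    "node3": [120.4, 98.7, 143.2, 110.9],
}


def slow_nodes(samples, threshold_usec=100.0):
    """Return sorted names of nodes whose median delay is at or above the threshold."""
    return sorted(
        node
        for node, delays in samples.items()
        if delays and median(delays) >= threshold_usec
    )


print(slow_nodes(samples_usec))
```

Comparing medians rather than single samples keeps one stray measurement from flagging a node. A check like this is only as trustworthy as the measurements fed into it, so it is worth confirming that every node actually reports samples.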

I tried power-cycling node3 to see if that helped; it didn't.

Do you know of any BIOS configuration options that might explain this behavior? Or is there a "harder" form of reset than power-cycling that I can try?

-John-

Mike Hibler

May 24, 2024, 9:30:36 AM
to cloudla...@googlegroups.com
I did a quick comparison of the BIOS of hp044 and hp050 (another node in
your experiment that I assume does not show this behavior) and did not see
anything that would cause this. Since these are chassis-based nodes with
four nodes per chassis, the only harder power cycle we can do is to cycle
the entire chassis. Unfortunately, the other three nodes in that chassis are
currently allocated in other experiments.

In the meantime I reset the BIOS settings of hp044 to our standard settings
on the off chance that making a change will clear out any bogus bits that
I might not be seeing. So when you get a chance, reboot that node to apply
the settings and we will see if that makes any difference.

john.ou...@gmail.com

May 24, 2024, 12:42:48 PM
to cloudlab-users
After further experimentation it turns out that the issue is in fact happening on all of the nodes, not just node3; a bug in my instrumentation caused me to miss this. So it's probably not a hardware issue. Sorry for the false alarm.

-John-