Are there any internal issues with the Utah cluster?


Shixiong Qi

Jan 14, 2022, 3:55:25 PM
to cloudlab-users
Hello,

We are currently using xl170 nodes for some network measurements, for example, using ApacheBench (ab) to test NGINX servers and iperf to test network bandwidth.

However, our performance measurements from before and after the unexpected shutdown of the Utah cluster a few days ago are inconsistent.

Before the shutdown, when we tested the NGINX server (configured with 2 dedicated CPU cores) using ApacheBench from another node, we got ~26K RPS. However, when we tried to reproduce these results after the shutdown, we could only get ~22K RPS, even though all configurations were the same (same CPU, memory, 25 Gbps NIC, ...). The same drop shows up with iperf and every other benchmark we ran. It seems that the Utah cluster has some internal switching problem.

Is anyone else facing the same problem?

Thanks,
Shixiong

Mike Hibler

Jan 14, 2022, 4:35:03 PM
to cloudla...@googlegroups.com
Was it between just xl170 nodes that you were seeing this? Or an xl170 and
some other node?

The disconnect of the Utah Cloudlab cluster a day or so ago was due to a
switch interface dropping out for some undetermined reason. We ultimately
had to reboot the switch, so it is possible that some unsaved setting got
lost. But that would only have affected connections between xl170s and
other node types in the Utah cluster.

Shixiong Qi

Jan 14, 2022, 5:42:40 PM
to cloudlab-users
Hi Mike, 

We have tried several xl170 nodes and found the same issue. If there is a potential issue within the Utah CloudLab cluster, we had better run our experiments on other clusters.

Thanks,
Shixiong

Mike Hibler

Jan 14, 2022, 6:53:13 PM
to cloudla...@googlegroups.com
Let me make sure I understand. You are running an nginx server on one machine
and a benchmark app on another. Are both of those machines xl170s?

Can you set up an experiment that shows the reduced throughput? There have
been no changes to the xl170 switches recently. A couple of possibilities
would be: running between a pair of nodes on different TOR switches instead
of two nodes on the same TOR switch or using the shared control network
instead of using a dedicated experiment network.
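
As a rough way to check the second possibility (traffic going over the shared
control network), here is a minimal Python sketch. It is not CloudLab tooling.
It assumes you already know each local interface's address and which interface
is the control one. The interface names and addresses in the example are
hypothetical placeholders. It cannot tell you whether two nodes share a TOR
switch.

```python
import ipaddress

def classify_path(target_ip, local_interfaces, control_iface):
    """Report which directly connected interface would reach target_ip.

    local_interfaces maps an interface name to its address in CIDR form.
    control_iface is the name of the control-network interface.
    Returns (interface_name, kind), or (None, "routed") when no local
    subnet contains the target.
    """
    target = ipaddress.ip_address(target_ip)
    for name, cidr in local_interfaces.items():
        # ip_interface accepts a host address with a prefix length
        network = ipaddress.ip_interface(cidr).network
        if target in network:
            kind = "control" if name == control_iface else "experiment"
            return name, kind
    return None, "routed"

# Hypothetical example values; substitute the real ones from your nodes.
interfaces = {
    "ctrl0": "192.0.2.10/24",   # assumed control-network interface
    "exp0": "10.10.1.1/24",     # assumed experiment-network interface
}
print(classify_path("10.10.1.2", interfaces, control_iface="ctrl0"))
```

If the benchmark client targets the server by a hostname or address that lands
on the control interface, the measurement is sharing that network with other
traffic, which could explain run-to-run differences.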
