Memory Bandwidth Dropped 10x with c6320

Bijan Tabatabai

Oct 1, 2025, 4:25:01 PM
to cloudlab-users
Hello,

I have an experiment with a c6320 machine (https://www.cloudlab.us/status.php?uuid=6a32554a-9d4b-11f0-bc80-e4434b2381fc) that I'm using to run bandwidth-intensive applications.

When I first got access to the machine, the max memory bandwidth on each node, as measured by Intel's MLC tool, was about 60GB/s. However, at some point the max bandwidth dropped to about 6.5GB/s. I've tried both rebooting and reloading the machine, but the low bandwidth remains.

I created a new experiment with a c6320 (https://www.cloudlab.us/status.php?uuid=2625216d-9ee2-11f0-bc80-e4434b2381fc) to verify that my original bandwidth measurement was correct, and MLC is reporting about 60GB/s of memory bandwidth there as well.

Is there anything I can do to return the memory bandwidth to its original value? If not, is there anything I can do to prevent this from happening in the future?

Thanks,
Bijan

ajma...@gmail.com

Oct 1, 2025, 4:45:25 PM
to cloudlab-users
Hi Bijan,

My first suspicion would have been a failing or failed DIMM, as that would have knocked the node into an unbalanced memory configuration, which is known to tank memory bandwidth performance. However, I checked the iDRAC of that node and didn't see anything noteworthy in the logs or anything else that would suggest a hardware issue. Nothing really out of the ordinary in the OS either, from a cursory glance. You say that you tried reloading the machine and the poor bandwidth performance persisted past that?
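
As an OS-side cross-check of DIMM population (in addition to the iDRAC logs), one option is to count the populated slots that `dmidecode` reports. This is a minimal sketch and was not run in this thread: `count_dimms` is a hypothetical helper name, and it assumes that type 17 (Memory Device) records print `Size: No Module Installed` for empty slots.

```shell
#!/bin/sh
# Hypothetical helper (not a CloudLab or Dell tool): read `dmidecode -t 17`
# output on stdin and report how many memory slots are populated.
count_dimms() {
    awk '
        # Match only lines that start with "Size:" after indentation, so
        # fields such as "Volatile Size:" are not counted as extra slots.
        /^[[:space:]]*Size:/ {
            total++
            if ($0 !~ /No Module Installed/) populated++
        }
        END { printf "populated=%d total=%d\n", populated, total }
    '
}

# Assumed usage on the node (needs root):
#   sudo dmidecode -t 17 | count_dimms
```

If the populated count differs between a slow node and a known-good c6320, that would point toward the unbalanced-configuration explanation.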

Best,
 - Aleks

Bijan Tabatabai

Oct 1, 2025, 5:12:14 PM
to cloudlab-users
Hi Aleks,

Thanks for the prompt response.

> You say that you tried reloading the machine and the poor bandwidth performance persisted past that?
Yes. For clarity, I just reloaded it again and ran the following commands:

$ sudo apt update
$ sudo apt upgrade
$ sudo apt install numactl
$ wget https://downloadmirror.intel.com/834254/mlc_v3.11b.tgz
$ tar xvf mlc_v3.11b.tgz
$ numactl -m 0 ./Linux/mlc --max_bandwidth

MLC again reported 6.5GB/s of bandwidth.
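
To make the comparison against the known-good figure explicit, a small check like the one below could be used. It is only a sketch: `bw_status` is a hypothetical helper, the 80% threshold is an assumption, and the bandwidth values are read off the MLC report by hand rather than parsed from it.

```shell
#!/bin/sh
# Hypothetical helper (not part of MLC): compare a measured bandwidth with a
# known-good baseline. Both values are in GB/s.
bw_status() {
    measured="$1"
    baseline="$2"
    min_fraction="${3:-0.8}"   # assumed threshold: below 80% of baseline is flagged
    awk -v m="$measured" -v b="$baseline" -v f="$min_fraction" 'BEGIN {
        if (m < b * f) print "DEGRADED"; else print "OK"
    }'
}

bw_status 6.5 60   # the affected node, using the figures from this thread
bw_status 60 60    # the freshly allocated c6320
```

With the numbers reported here, the first call prints `DEGRADED` and the second prints `OK`.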

For the record, I experienced a similar problem with c6320 machines last spring. I think then I just released the offending machines, assuming whatever cleanup routines CloudLab has would fix the issue.

Bijan

Bijan Tabatabai

Oct 7, 2025, 10:14:15 AM
to cloudlab-users
Hi Aleks,

I'm just checking in since the CloudLab experiment that I am seeing this issue in expires today. Would it be helpful to extend the experiment to allow CloudLab staff to look at the machine more? Or should I just let the experiment expire?

Thanks,
Bijan

Mike Hibler

Oct 7, 2025, 10:29:00 AM
to cloudla...@googlegroups.com
Go ahead and leave the experiment until it expires. I have scheduled it to go
out of service in case we don't get a chance to look at it before then.

Thanks for pointing this out and sorry we have not looked at it yet!

Mike Hibler

Oct 7, 2025, 11:56:57 AM
to cloudla...@googlegroups.com
I checked the BIOS to make sure nothing was reconfigured. Then I powered it
off and back on, and...it seems to be getting full BW again. So I have
canceled taking it out of service in case you want to and can extend the
experiment.

Bijan Tabatabai

Oct 7, 2025, 12:13:18 PM
to cloudlab-users
Thanks Mike,

I can confirm that I am also seeing the full memory bandwidth again.
I had not tried power cycling the machine when I was troubleshooting before. If this pops up again, hopefully a power cycle will resolve it as well.

Thanks,
Bijan

ajma...@gmail.com

Oct 7, 2025, 1:08:01 PM
to cloudlab-users
Hi Bijan,

Apologies for not getting back to you right away. Glad that a power cycle ended up sorting it out. If or when it pops back up again, could you please let us know what you were running immediately prior to that? That might help track this issue down. If needed, we also have somebody at Clemson who can physically look at the machine.
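
One way to have that information on hand is to keep a timestamped log of bandwidth readings with a short note about the workload. The following is a sketch under assumptions, not an existing CloudLab facility: `record_bw` and the default log path are hypothetical, and the bandwidth value is entered by hand from the MLC report.

```shell
#!/bin/sh
# Hypothetical helper: append one timestamped bandwidth sample (GB/s) and an
# optional free-form note to a log file, so a later drop can be matched to
# whatever was running around that time.
record_bw() {
    gbps="$1"
    logfile="${2:-$HOME/mlc_bw.log}"   # assumed default location
    note="${3:-}"
    # UTC timestamp in ISO 8601 form, then the value, then the note (if any).
    printf '%s %s%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$gbps" "${note:+ $note}" >> "$logfile"
}

# Assumed usage after each MLC run:
#   record_bw 60 "$HOME/mlc_bw.log" "fresh node, nothing running yet"
#   record_bw 6.5 "$HOME/mlc_bw.log" "after a long benchmark run"
```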

Thanks,
 - Aleks
