Re: [cloudlab-users] rdma resource issue on r320s


Robert Ricci

Mar 24, 2026, 3:25:30 PM
to Amanda Baran, cloudlab-users
Yes, it does look like there's something wrong with the IB subnet
manager. We're looking into it, stay tuned...

On Sat, 21 Mar 2026 06:45, Amanda Baran <amanda...@gmail.com> wrote:
> Hi,
>
> We have been running into this error since Thursday on multiple different
> experiments within the r320 apt cluster:
>
> rdma_create_ep():No space left on device for node1:33335
>
> QPs are not being created successfully. Google searches indicate that this
> could be due to zombie processes or RDMA resources still being held, but the
> error persists even after a killall. I also confirmed that the ulimit, although
> not unlimited, should be plenty high, as I expected, since this code worked
> on 3/14 when it was last tested on the r320s.
> And 'rdma res show' confirms that there are not a bunch of open QPs or
> CM IDs. We also tried changing the port number and doing a full reboot, both
> with no luck.
>
> We think something may be wrong with the cluster itself: we reverted to
> code that we tested and confirmed working on 3/14 on the r320 cluster, and
> that code now hits the same error. We were hoping to get some help to see
> if there are any other possibilities.
>
> Here is the link to my
> experiment: https://www.cloudlab.us/status.php?uuid=455fe49e-c76a-4161-a4b7-58d63c813bf5
>
> Thanks!
>

Kirk Webb

Mar 24, 2026, 4:35:47 PM
to cloudla...@googlegroups.com, Amanda Baran
Hi Amanda,

Another user had started up several of their own subnet managers,
presumably by mistake. We have killed these off and have requested the
user get in contact with us. The fabric should be running properly
once more.

-Kirk
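
For anyone who hits the same error later, the 'rdma res show' check mentioned earlier in the thread can be scripted. The following is only a minimal sketch. It assumes the summary output has the form "<index>: <device>: <name> <count> ...", and the function names and the sample line are hypothetical, not taken from this thread or from a real node.

```python
import subprocess


def parse_rdma_res_summary(text):
    """Parse `rdma res show` summary text into {device: {resource: count}}.

    Assumed line shape (not confirmed in this thread):
        "<index>: <device>: <name> <count> <name> <count> ..."
    Lines that do not match this shape are skipped.
    """
    devices = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        if not parts[0].endswith(":") or not parts[1].endswith(":"):
            continue
        device = parts[1].rstrip(":")
        fields = parts[2:]
        counts = {}
        # Walk the remaining tokens as (name, count) pairs.
        for name, value in zip(fields[0::2], fields[1::2]):
            if value.isdigit():
                counts[name] = int(value)
        devices[device] = counts
    return devices


def current_rdma_resources():
    """Run `rdma res show` on the local node and parse its output."""
    result = subprocess.run(
        ["rdma", "res", "show"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_rdma_res_summary(result.stdout)


# Hypothetical sample line, for illustration only.
SAMPLE = "0: mlx4_0: pd 1 cq 2 qp 3 cm_id 0 mr 0 ctx 0\n"
```

Calling parse_rdma_res_summary(SAMPLE)["mlx4_0"]["qp"] gives the QP count for that device. As this thread shows, low counts here do not rule out a fabric-level cause such as a misbehaving subnet manager.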