Yes, it does look like there's something wrong with the IB subnet
manager. We're looking into it, stay tuned...
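
For anyone seeing the same error, here is a rough sketch of read-only checks
that may help tell a fabric-side problem from a local one. It assumes the
iproute2 'rdma' tool, 'ibv_devinfo' (libibverbs utilities) and 'sminfo'
(infiniband-diags) may be installed on the node; the helper name
run_if_present is made up for this example. A port that stays in INIT
instead of ACTIVE usually means no subnet manager has configured it.

```shell
#!/bin/bash
# Sketch only: read-only diagnostics, nothing here changes node state.
# run_if_present is a hypothetical helper, not part of any RDMA package.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "== $*"
        # Report a failure but keep going to the next check.
        "$@" 2>&1 || echo "(command failed: $*)"
    else
        echo "== $1: not installed, skipping"
    fi
}

# Per-process resource limits (the original report mentions checking ulimit).
echo "== ulimit -a"
ulimit -a

# Summary of RDMA resources currently held (QPs, CM IDs, and so on).
run_if_present rdma res show

# Port state as seen by the kernel; look for ACTIVE versus INIT.
run_if_present rdma link show

# Port state as seen by libibverbs; look for PORT_ACTIVE versus PORT_INIT.
run_if_present ibv_devinfo

# Ask the fabric which subnet manager is answering (may need root).
run_if_present sminfo
```

Tools that are missing are skipped, so the script can be run as-is and the
output pasted into a reply.
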
On Sat, 21 Mar 2026 06:45, Amanda Baran <amanda...@gmail.com> wrote:
> Hi,
>
> We have been running into this error since Thursday on multiple different
> experiments within the r320 apt cluster:
>
> rdma_create_ep():No space left on device for node1:33335
>
> QPs are not being created successfully. Google searches indicate that this
> could be due to zombie processes or RDMA resources still being held, but the
> error persists even after a killall. I also checked the ulimit: it is not
> unlimited, but it should be plenty high, as I expected, since this code
> worked on 3/14 when it was last tested on the r320s.
> 'rdma res show' also confirms that there are not a bunch of open QPs or
> CM IDs. We also tried changing the port number and doing a full reboot,
> both with no luck.
>
> We reverted to code that we tested and confirmed working on 3/14 on the
> r320 cluster, and it now hits the same error, so we think there might be
> something wrong with the cluster itself. We were hoping to get some help
> to see if there are any other possibilities.
>
> Here is the link to my experiment:
> https://www.cloudlab.us/status.php?uuid=455fe49e-c76a-4161-a4b7-58d63c813bf5
>
> Thanks!
>