Urgent: Complete loss of access to time-critical shared Lustre storage in experiment mrashid2-296178

Md Hasanur Rashid

Apr 4, 2026, 12:24:58 PM
to cloudlab-users

Dear CloudLab Team,

I need urgent help with a likely experiment-network/fabric issue in my Utah CloudLab experiment mrashid2-296178 (profile hpc_lustre_2_15_5, project DIRR).

I am running a 27-node Lustre deployment. The shared filesystem is hasanfs, mounted at /mnt/hasanfs, and it contains time-critical data. I have now completely lost access to this shared storage.

The apparent failure point is er114.utah.cloudlab.us, which is the Lustre MGS/MDT host for this filesystem. Its experiment-network interface ens1f0 (10.10.1.1/24) is down with NO-CARRIER, and lnetctl shows 10.10.1.1@tcp as down. Normal SSH to the node still works over the management network, but the storage network is not functioning.
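For reference, the checks that surface the failure look roughly like this (a sketch; the interface name and NID are the ones from my setup above, and exact tool availability may vary by distro):

```shell
# Link state on the Lustre-facing interface (shows NO-CARRIER)
ip -br link show ens1f0

# Driver-level link status ("Link detected: no")
sudo ethtool ens1f0 | grep 'Link detected'

# LNet's view of the local NID (10.10.1.1@tcp reported down)
sudo lnetctl net show --net tcp
```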

Impact:

  • All 10 client nodes have lost access to /mnt/hasanfs
  • The remaining Lustre server targets have remounted read-only
  • The shared storage is effectively unavailable cluster-wide, and commands touching the mount can hang

This appears to be an issue outside the guest OS. I attempted only safe, non-destructive host-side checks and interface recovery steps, but the interface still has no carrier. I have intentionally avoided destructive recovery actions because preserving the data is my highest priority.
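Concretely, the recovery steps I limited myself to were along these lines (a sketch; nothing here touches the Lustre targets or the data):

```shell
# Bounce the interface administratively
sudo ip link set ens1f0 down
sudo ip link set ens1f0 up

# Restart autonegotiation on the port
sudo ethtool -r ens1f0

# Carrier is still absent afterwards (0 = no carrier)
cat /sys/class/net/ens1f0/carrier
```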

Could you please investigate the experiment-network path for er114.utah.cloudlab.us ens1f0 as soon as possible? This is a severe and time-sensitive outage affecting access to critical data.

I can provide exact command outputs if needed.

Best regards,
Hasan

ajma...@gmail.com

Apr 4, 2026, 1:39:18 PM
to cloudlab-users
Hi Hasan,

This doesn't appear to be a problem on the switch side.  The config is as we would expect, and the other legs on the breakout cable from that switch port (including a different node in your experiment) are up and passing traffic.  More than likely it's a cable issue on the node side; it could be that the cable isn't fully inserted, but even that isn't manifesting the way I would usually see it.  In any case, we would need to get somebody to the datacenter to try replugging it.  It might also be good to power cycle the node itself in case the NIC is in some weird state; would that cause any issues for your setup?  If your experiment can tolerate a power cycle on er114, you can initiate one from the experiment status page by clicking the gear button at the far right of the server0/er114 row and choosing "Power Cycle".

Regarding the urgency of your request, it's a Saturday morning here, which is outside of our normal operating hours.  Additionally, your experiment doesn't expire until Thursday, so while I don't know the nature of this time-critical data, there's at least no imminent risk of the experiment swapping out and your data being outright lost.  We will do what we can to help you get your experiment working again as soon as possible, but please keep in mind that it will likely take longer than it would during the work week.

Best,
 - Aleks

Md Hasanur Rashid

Apr 4, 2026, 3:04:08 PM
to cloudlab-users
Hi Aleks,

Thank you for checking and for the explanation.

I went ahead and power cycled `er114.utah.cloudlab.us` as suggested, but the issue persists. After the reboot, the Lustre-facing interface `ens1f0` is still down with no carrier (`Link detected: no`, `carrier=0`), and the clients still cannot reach `10.10.1.1`.
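In terms of commands, the post-reboot checks look like this (a sketch; the outputs I saw are summarized in the comments):

```shell
# Physical link is still down after the power cycle
sudo ethtool ens1f0 | grep 'Link detected'   # Link detected: no
cat /sys/class/net/ens1f0/carrier            # 0

# The MGS/MDT NID is still unreachable from the clients
ping -c 3 -W 2 10.10.1.1
```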

To make this safe, I have already quiesced the Lustre setup on my side:
- all clients have been unmounted from `/mnt/hasanfs`
- all Lustre server targets have been cleanly unmounted
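The shutdown followed the usual client-first ordering; a sketch (the server-side mount points below are placeholders, not my exact ones):

```shell
# 1. On every client node: unmount the filesystem first
sudo umount /mnt/hasanfs

# 2. On each OSS node: unmount its OST target(s)
sudo umount /mnt/ost0        # repeat for each OST mount point

# 3. On er114 last: unmount the combined MGS/MDT target
sudo umount /mnt/mdt
```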

So from the filesystem side, it should now be safe for you to proceed with the next physical step on `er114`, including a cable replug / node-side check in the datacenter.

I understand this is outside normal operating hours, but I want to emphasize that I still have no access at all to shared storage containing time-critical data. If possible, I would be very grateful if this could be handled at the earliest opportunity.

Please let me know if you would like any additional command output from my side.

Best regards,
Hasan

ajma...@gmail.com

Apr 4, 2026, 3:15:24 PM
to cloudlab-users
Hi Hasan,

Thanks for power cycling the node; I just wanted to rule out the NIC itself.  I'll try to get to the datacenter this afternoon while I'm out running errands, and hopefully I can get it fixed up quickly.

Best,
 - Aleks

Md Hasanur Rashid

Apr 4, 2026, 3:24:00 PM
to cloudla...@googlegroups.com
Thank you so much for your prompt response. I really appreciate your effort in looking into this even though it's the weekend.

- Hasan


Aleksander Maricq

Apr 4, 2026, 4:59:33 PM
to cloudla...@googlegroups.com
Hi Hasan,

You should be good to go now.  Replugging the cable on the node side did nothing, but once I replugged the cable on the switch side the link came back up and is passing traffic again.  Let us know if anything else comes up.

Best,
 - Aleks


Md Hasanur Rashid

Apr 4, 2026, 6:45:05 PM
to cloudla...@googlegroups.com
Hi Aleks,

Yes, I have my cluster up and running again. Thank you so much for taking the time over the weekend to promptly fix this issue.
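For anyone finding this thread later, bring-up was simply the reverse of the shutdown order; a sketch (the device paths and server-side mount points are placeholders, not my exact ones):

```shell
# 1. On er114: mount the combined MGS/MDT first
sudo mount -t lustre /dev/sdb /mnt/mdt

# 2. On each OSS node: mount its OST target(s)
sudo mount -t lustre /dev/sdb /mnt/ost0

# 3. On every client: mount the filesystem itself
sudo mount -t lustre 10.10.1.1@tcp:/hasanfs /mnt/hasanfs
```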

Best regards,
Hasan
