Slow network, startup/provisioning and intermittent FPGA availability on pc178 and pc179

Sandeep Bal

Apr 11, 2026, 3:14:52 PM
to cloudlab-users

Hi, 

I am seeing unusually slow startup/provisioning behavior on OCT VCK5000 nodes pc178 and pc179 that is currently blocking my experiments. The main issue is that the experiment startup remains in “Running” on the portal for several hours, whereas the same workflow previously completed in about 20 minutes. I can sometimes SSH into the nodes after 15–20 minutes, but the startup process still appears to be running in the background. The output of tail -50 /local/logs/output_log.txt changes over time, so the startup is not fully stuck, but it is progressing very slowly.

A second issue is very low network throughput on both nodes. In repeated tests, download speeds have been only about 500–517 KiB/s, including while downloading the ImageNet-1K validation dataset and during large-file download tests. My current suspicion is therefore that the startup/provisioning scripts are being delayed significantly by a network or connectivity issue rather than failing immediately due to a logic error. This would also explain why tools and dependencies sometimes appear only after the experiment has been left running for a long time.
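
As a rough sanity check on the throughput figures above, the sketch below converts a transfer rate in KiB/s into an expected download time. The 6 GiB file size is an illustrative assumption only, not the measured size of any dataset or package mentioned in this thread, and the function name is hypothetical.

```python
def transfer_time_hours(size_bytes: int, rate_kib_per_s: float) -> float:
    """Return the hours needed to move size_bytes at rate_kib_per_s KiB/s."""
    if rate_kib_per_s <= 0:
        raise ValueError("rate must be positive")
    seconds = size_bytes / (rate_kib_per_s * 1024)  # 1 KiB = 1024 bytes
    return seconds / 3600


if __name__ == "__main__":
    # Assumption: a 6 GiB download, chosen only for illustration.
    size = 6 * 1024**3
    for rate in (500, 517):
        print(f"{rate} KiB/s -> {transfer_time_hours(size, rate):.1f} h")
```

At these rates a single multi-gigabyte download would take hours on its own, which would be consistent with a startup that used to finish in about 20 minutes now running for several hours.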

The runtime behavior has been erratic in a way that suggests delayed or incomplete provisioning. On both pc178 and pc179, /opt/xilinx / XRT was not available for a long period and then appeared later, after the experiment had remained active for hours. During that time, FPGA tooling was not reliably usable: xbutil discovery could fail, and overlay programming attempts could fail as well. Later, without any changes to the commands or scripts, the same steps would begin working once the node had been left running longer. On pc179, for example, I was eventually able to program the 6PE_misc_dwc overlay, but only after waiting a long time for the startup to make progress.

I also observed that after a power cycle, FPGA access becomes unavailable again even if it had been working correctly just before reboot. In that state, the same recovery/programming flow that had worked earlier can fail until the node has again been left running for a significant amount of time. One representative failure after reboot was:
[XRT] ERROR: See dmesg log for details. err = -1
[xbutil] ERROR: Could not program device 0000:0d:00.1: Operation not permitted

Could someone please check whether there may be an issue with network performance, provisioning, or site resources affecting pc178 and pc179? At the moment, the combination of very slow network speed, multi-hour startup times, and loss of FPGA access after reboot is preventing me from running experiments.

Experiment URLs:
pc179: https://www.cloudlab.us/status.php?uuid=2d35d359-5190-4a43-8013-dea22fe5bb35
pc178: https://www.cloudlab.us/status.php?uuid=f4e8eac5-c536-4185-8420-7011d444c946

Thank you in advance. Hoping to hear from you soon.

Regards,

Sandeep Bal
PRATE

Sandeep Bal

Apr 13, 2026, 1:19:58 PM
to cloudlab-users
The experiments have now expired. Please let me know when you plan to look into this issue, and I will send URLs for new experiments right away. I am not sending any now because they might expire before anyone gets to them.

Mike Hibler

Apr 13, 2026, 1:33:58 PM
to cloudla...@googlegroups.com
I have forwarded your message to people who are in a better position to
diagnose this.

Mike Hibler

Apr 13, 2026, 6:23:46 PM
to cloudla...@googlegroups.com
Something I did notice looking at this. The file /var/tmp/startup-1.txt has
the output of your startup script and it shows a lot of apt permission
denied errors. And I see that your startup script:
/local/repository/post-boot-vitis-ai.sh
that runs as you and not root, has a number of apt calls that do not do
"sudo apt..." to run as root. This does not seem like it would have anything
to do with your slow network however.

I see nothing wrong with the disk or the host side NIC. I don't have access
to the switch to see if anything is going on there.
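
One way to locate the apt calls described above is to scan the startup script for apt or apt-get invocations that are not prefixed with sudo. The following is a minimal Python sketch, not a definitive tool: the function name and the sample script contents are hypothetical, and the parsing is deliberately crude (it does not understand quoting, here-documents, or scripts that are themselves run under sudo).

```python
import re

APT_COMMANDS = {"apt", "apt-get"}
# Shell separators that start a new command on the same line.
SEPARATORS = re.compile(r"&&|\|\||;|\|")
# Leading VAR=value assignments, e.g. DEBIAN_FRONTEND=noninteractive.
ENV_ASSIGNMENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*=")


def find_unprivileged_apt_calls(script_text):
    """Return (line_number, line) pairs where apt/apt-get runs without sudo."""
    hits = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        code = line.split("#", 1)[0]  # crude comment stripping
        for segment in SEPARATORS.split(code):
            tokens = segment.split()
            # Skip environment assignments placed before the command name.
            while tokens and ENV_ASSIGNMENT.match(tokens[0]):
                tokens.pop(0)
            # A command that starts with "sudo" is not flagged.
            if tokens and tokens[0] in APT_COMMANDS:
                hits.append((number, line.strip()))
                break
    return hits


if __name__ == "__main__":
    # Hypothetical script contents, for illustration only.
    sample = """#!/bin/bash
apt update
sudo apt-get install -y git
cd /tmp && apt-get install -y cmake
"""
    for number, line in find_unprivileged_apt_calls(sample):
        print(f"line {number}: {line}")
```

To check the real script, one could read /local/repository/post-boot-vitis-ai.sh and pass its text to the function. Each reported line is a candidate for a "sudo" prefix.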
