Hi,
I am seeing unusually slow startup/provisioning behavior on OCT VCK5000 nodes pc178 and pc179 that is currently blocking my experiments. The main issue is that the experiment startup remains in “Running” on the portal for several hours, whereas the same workflow previously completed in about 20 minutes. I can sometimes SSH into the nodes after 15–20 minutes, but the startup process still appears to be continuing in the background. The output of tail -50 /local/logs/output_log.txt changes over time, so the startup does not appear to be fully stuck, but it is progressing very slowly.
A second issue is very low network throughput on both nodes. In repeated tests, download speeds have been only about 500–517 KiB/s, including while downloading the ImageNet-1K validation dataset and during large-file download tests. Because of this, I suspect that the startup/provisioning scripts are being slowed significantly by a network or connectivity issue rather than failing immediately due to a logic error. This would also explain why tools and dependencies sometimes appear only after the experiment has been left running for a long time.
The runtime behavior has been inconsistent in a way that points to delayed or incomplete provisioning. On both pc178 and pc179, /opt/xilinx / XRT was not available for a long period and then appeared later, after the experiment had remained active for hours. During that time, FPGA tooling was not reliably usable, xbutil discovery could fail, and overlay programming attempts could fail as well. Later, without any changes to the commands or scripts, the same steps would begin working once the node had been left running longer. On pc179, for example, I was eventually able to program the 6PE_misc_dwc overlay, but only after waiting a long time for the startup to continue progressing.
I also observed that after a power cycle, FPGA access becomes unavailable again even if it had been working correctly just before reboot. In that state, the same recovery/programming flow that had worked earlier can fail until the node has again been left running for a significant amount of time. One representative failure after reboot was:
[XRT] ERROR: See dmesg log for details. err = -1
[xbutil] ERROR: Could not program device 0000:0d:00.1: Operation not permitted
Could someone please check whether there may be an issue with network performance, provisioning, or site resources affecting pc178 and pc179? At the moment, the combination of very slow network speed, multi-hour startup times, and loss of FPGA access after reboot is preventing me from running experiments.
Experiment URLs:
pc179: https://www.cloudlab.us/status.php?uuid=2d35d359-5190-4a43-8013-dea22fe5bb35
pc178: https://www.cloudlab.us/status.php?uuid=f4e8eac5-c536-4185-8420-7011d444c946
Thank you in advance. Hoping to hear from you soon.
Regards,
Sandeep Bal
PRATE