CPU Stuck for Xms until I reboot

Ertza Warraich

Jun 30, 2024, 10:36:56 PM
to cloudlab-users
Hi, I am doing ML training, and whenever a run ends I get soft lockup messages ("CPU stuck for Xs") that continue until I reboot. Only after rebooting can I start the next run, so I have to reboot before every run.

I am using a d7525 node and am attaching the logs. Could this be caused by the grow-root script, which turned swap off in order to resize root, after which I used all 150G for root?


Here are the logs, the CPU lockup messages start at the end:
"""
Ubuntu 20.04 LTS node0.ertza-210631.ultima-pg0.wisc.cloudlab.us ttyS0

[ 136.420788] sh[57545]: Checking Testbed user accounts configuration ...
node0 login: [ 136.460175] sh[57545]: Checking Testbed route configuration ...
[ 136.467425] sh[57545]: net.ipv4.conf.all.forwarding = 1
[ 136.512177] sh[57545]: Checking Testbed tunnel configuration ...
[ 136.547111] sh[57545]: Checking Testbed interface configuration ...
[ 136.548579] sh[57545]: *** Bad Speed 200000 in ifconfig, default to autoconfig
[ 138.246429] sh[57545]: Checking Testbed hostnames configuration ...
[ 138.301766] sh[57545]: Checking Testbed remote storage configuration ...
[ 138.346894] sh[57545]: Checking Testbed trace configuration ...
[ 138.427265] sh[57545]: Checking Testbed trafgen configuration ...
[ 138.461692] sh[57545]: Checking Testbed Tarball configuration ...
[ 138.496980] sh[57545]: Checking Testbed RPM configuration ...
[ 140.614962] sh[57545]: Checking Testbed tiptunnel configuration ...
[ 140.899475] sh[57545]: Starting linktest daemon
[ 140.903197] sh[57545]: Informing Emulab Control that we are up and running
[ 140.941982] sh[57545]: Checking Testbed Experiment Startup Command ...
[ 140.942719] sh[57545]: Booting up vnodes
[ 140.990309] sh[57545]: Booting up subnodes
[ 141.032182] sh[57545]: No subnodes. Exiting gracefully ...
[35168.468837] mlx5_core 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0031 address=0x6fd55000 flags=0x0000]
[35168.479536] mlx5_core 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0031 address=0x6fd55fc0 flags=0x0000]
[35168.565021] mlx5_core 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0031 address=0xff15a0c0 flags=0x0000]
[35168.575716] mlx5_core 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0031 address=0xff080000 flags=0x0020]
[35168.615041] mlx5_core 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0031 address=0xff15a000 flags=0x0000]
[35168.625734] mlx5_core 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0031 address=0xff080040 flags=0x0020]
[35169.656870] mlx5_core 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0031 address=0xfea23000 flags=0x0000]
[35169.667566] mlx5_core 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0031 address=0xff080080 flags=0x0020]
[35169.678261] mlx5_core 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0031 address=0xfe94d4c0 flags=0x0020]
[35171.378513] mlx5_core 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0031 address=0xf6a7b280 flags=0x0020]
[35171.565080] AMD-Vi: Event logged [IO_PAGE_FAULT device=41:00.0 domain=0x0031 address=0xff15a0c0 flags=0x0000]
[35171.574992] AMD-Vi: Event logged [IO_PAGE_FAULT device=41:00.0 domain=0x0031 address=0xff0800c0 flags=0x0020]
[35171.593932] AMD-Vi: Event logged [IO_PAGE_FAULT device=41:00.0 domain=0x0031 address=0xf6a71480 flags=0x0020]
[35171.603844] AMD-Vi: Event logged [IO_PAGE_FAULT device=41:00.0 domain=0x0031 address=0xff4c2a40 flags=0x0020]
[35171.613761] AMD-Vi: Event logged [IO_PAGE_FAULT device=41:00.0 domain=0x0031 address=0xff56a000 flags=0x0000]
[35171.623675] AMD-Vi: Event logged [IO_PAGE_FAULT device=41:00.0 domain=0x0031 address=0xff080100 flags=0x0020]
[35171.688105] AMD-Vi: Event logged [IO_PAGE_FAULT device=41:00.0 domain=0x0031 address=0xecc5d240 flags=0x0020]
[35172.378486] AMD-Vi: Event logged [IO_PAGE_FAULT device=41:00.0 domain=0x0031 address=0xff15a3c0 flags=0x0000]
[35172.388401] AMD-Vi: Event logged [IO_PAGE_FAULT device=41:00.0 domain=0x0031 address=0xff080140 flags=0x0020]
[35172.398317] AMD-Vi: Event logged [IO_PAGE_FAULT device=41:00.0 domain=0x0031 address=0xea657900 flags=0x0020]
[35173.565121] mlx5_core 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0031 address=0xff15a0c0 flags=0x0000]
[35173.575827] mlx5_core 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0031 address=0xff0801c0 flags=0x0020]
[35175.565159] mlx5_core 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0031 address=0xff15a0c0 flags=0x0000]
[35175.575851] mlx5_core 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0031 address=0xff080200 flags=0x0020]
[35177.649203] mlx5_core 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0031 address=0xff080240 flags=0x0020]
[35180.614529] mlx5_core 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0031 address=0xff080280 flags=0x0020]
[35180.766833] mlx5_core 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0031 address=0xff204000 flags=0x0000]
[35180.777525] mlx5_core 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0031 address=0xff0802c0 flags=0x0020]
[35180.965255] mlx5_core 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0031 address=0xff15a0c0 flags=0x0000]
[35180.975945] mlx5_core 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0031 address=0xff080300 flags=0x0020]
[35182.565285] mlx5_core 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0031 address=0xff15a0c0 flags=0x0000]
[35182.575987] mlx5_core 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0031 address=0xff080340 flags=0x0020]
[35183.565304] mlx5_core 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0031 address=0xff15a0c0 flags=0x0000]
[35183.575994] mlx5_core 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0031 address=0xff080380 flags=0x0020]
[35184.565323] mlx5_core 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0031 address=0xff15a0c0 flags=0x0000]
[35184.576017] AMD-Vi: Event logged [IO_PAGE_FAULT device=41:00.0 domain=0x0031 address=0xff0803c0 flags=0x0020]
[35193.339180] watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [python:68717]
"""
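
For anyone triaging a similar console capture, here is a minimal sketch of a script that counts the IO_PAGE_FAULT events and extracts the soft lockup lines. The function name `summarize` and both regular expressions are assumptions written against the two message formats visible in the log above; other kernel versions may print these messages differently, so treat it as a starting point rather than a general parser.

```python
import re
from collections import Counter

# Assumed patterns, derived only from the message formats seen in the log above.
# Covers both the "mlx5_core 0000:41:00.0: AMD-Vi: ..." form and the
# "AMD-Vi: ... device=41:00.0 ..." form of the IO_PAGE_FAULT message.
FAULT_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+(?:mlx5_core\s+\S+\s+)?AMD-Vi:\s+Event\s+logged\s+"
    r"\[IO_PAGE_FAULT\s+(?:device=\S+\s+)?domain=(0x[0-9a-f]+)\s+"
    r"address=(0x[0-9a-f]+)\s+flags=(0x[0-9a-f]+)\]"
)
LOCKUP_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+watchdog: BUG: soft lockup - "
    r"CPU#(\d+) stuck for (\d+)s! \[([^\]:]+):(\d+)\]"
)

def summarize(log_text):
    """Summarize IOMMU page faults and soft lockups in a console log."""
    # Serial captures often contain literal "^M" or raw carriage returns.
    text = log_text.replace("^M", "\n").replace("\r", "\n")
    faults = FAULT_RE.findall(text)    # tuples: (ts, domain, address, flags)
    lockups = LOCKUP_RE.findall(text)  # tuples: (ts, cpu, secs, comm, pid)
    return {
        "fault_count": len(faults),
        "first_fault": float(faults[0][0]) if faults else None,
        "flags": Counter(f[3] for f in faults),
        "lockups": [
            (float(ts), int(cpu), int(secs), comm, int(pid))
            for ts, cpu, secs, comm, pid in lockups
        ],
    }
```

Calling `summarize(open("console.log").read())` (the file name is hypothetical) returns the fault count, the timestamp of the first fault, a tally of the `flags` values, and one tuple per soft lockup, which makes it easy to see how long the page faults had been occurring before the first lockup was reported.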

Mike Hibler

Jul 1, 2024, 10:32:20 AM
to cloudla...@googlegroups.com
It is unlikely to be the disk resizing, as that would manifest as filesystem
corruption. It is more likely the combination of Ubuntu and Nvidia software
you are using, and not a CloudLab-specific thing. It is something you could
search the web for. Unless you are following a recipe which specifies Ubuntu 20,
you might try the Ubuntu 22 image instead. That image also has a 64GB root
filesystem, which might make it unnecessary for you to expand the root
partition (if you are only doing so to allow the CUDA software to install).
