Random reboots of r6615 nodes in Clemson cluster


Leonid Kondrashov

Mar 23, 2026, 12:08:17 AM
to cloudlab-users
Hello,

During my experiment on the r6615 nodes at Clemson, I observed that some nodes rebooted at random times (experiment: https://www.cloudlab.us/status.php?uuid=029e88bc-1ed0-4e86-93a2-5ba77374a1da#). Our experimental setup doesn't support graceful recovery after a reboot, so the reboots themselves are problematic for me.

Can you help identify the reason for the reboots so I can avoid them in the future? Here are the journalctl logs from node-007 right before one of the reboots:
Mar 22 08:22:09 node-007.snap.ntu-cloud-pg0.clemson.cloudlab.us sshd[3798611]: Disconnected from authenticating user root 130.127.132.51 port 20633 [preauth]
Mar 22 08:22:09 node-007.snap.ntu-cloud-pg0.clemson.cloudlab.us sshd[3798609]: Disconnected from authenticating user root 130.127.132.51 port 20632 [preauth]
Mar 22 08:23:15 node-007.snap.ntu-cloud-pg0.clemson.cloudlab.us sshd[3799279]: banner exchange: Connection from 20.55.24.39 port 33558: invalid format
Mar 22 08:23:25 node-007.snap.ntu-cloud-pg0.clemson.cloudlab.us sshd[3799277]: Connection closed by 20.55.24.39 port 33542 [preauth]
Mar 22 08:24:01 node-007.snap.ntu-cloud-pg0.clemson.cloudlab.us sshd[3799806]: Invalid user solv from 92.118.39.76 port 33538
Mar 22 08:24:01 node-007.snap.ntu-cloud-pg0.clemson.cloudlab.us sshd[3799806]: Connection closed by invalid user solv 92.118.39.76 port 33538 [preauth]
Mar 22 08:24:29 node-007.snap.ntu-cloud-pg0.clemson.cloudlab.us sshd[3800048]: Connection reset by authenticating user root 176.120.22.13 port 23594 [preauth]
Mar 22 08:24:30 node-007.snap.ntu-cloud-pg0.clemson.cloudlab.us sshd[3800060]: Invalid user admin from 176.120.22.13 port 23598
Mar 22 08:24:31 node-007.snap.ntu-cloud-pg0.clemson.cloudlab.us sshd[3800060]: Connection reset by invalid user admin 176.120.22.13 port 23598 [preauth]
Mar 22 08:24:41 node-007.snap.ntu-cloud-pg0.clemson.cloudlab.us sshd[3800118]: Connection reset by authenticating user root 176.120.22.13 port 23606 [preauth]
Mar 22 08:24:51 node-007.snap.ntu-cloud-pg0.clemson.cloudlab.us sshd[3800222]: Connection reset by authenticating user root 176.120.22.13 port 51510 [preauth]
Mar 22 08:24:53 node-007.snap.ntu-cloud-pg0.clemson.cloudlab.us sshd[3800320]: Invalid user admin from 176.120.22.13 port 49764
Mar 22 08:24:53 node-007.snap.ntu-cloud-pg0.clemson.cloudlab.us sshd[3800320]: Connection reset by invalid user admin 176.120.22.13 port 49764 [preauth]
Mar 22 08:25:01 node-007.snap.ntu-cloud-pg0.clemson.cloudlab.us CRON[3800423]: pam_unix(cron:session): session opened for user root(uid=0) by root(uid=0)
Mar 22 08:25:01 node-007.snap.ntu-cloud-pg0.clemson.cloudlab.us CRON[3800424]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Mar 22 08:25:01 node-007.snap.ntu-cloud-pg0.clemson.cloudlab.us CRON[3800423]: pam_unix(cron:session): session closed for user root
Mar 22 08:26:44 node-007.snap.ntu-cloud-pg0.clemson.cloudlab.us usermod[3801495]: change user 'root' password
Mar 22 08:27:50 node-007.snap.ntu-cloud-pg0.clemson.cloudlab.us sshd[3802152]: Invalid user javier from 128.1.44.162 port 37924
Mar 22 08:27:50 node-007.snap.ntu-cloud-pg0.clemson.cloudlab.us sshd[3802152]: Received disconnect from 128.1.44.162 port 37924:11: Bye Bye [preauth]
Mar 22 08:27:50 node-007.snap.ntu-cloud-pg0.clemson.cloudlab.us sshd[3802152]: Disconnected from invalid user javier 128.1.44.162 port 37924 [preauth]
Mar 22 08:30:00 node-007.snap.ntu-cloud-pg0.clemson.cloudlab.us systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Mar 22 08:30:00 node-007.snap.ntu-cloud-pg0.clemson.cloudlab.us systemd[1]: sysstat-collect.service: Deactivated successfully.
Mar 22 08:30:00 node-007.snap.ntu-cloud-pg0.clemson.cloudlab.us systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Mar 22 08:30:57 node-007.snap.ntu-cloud-pg0.clemson.cloudlab.us sshd[3804076]: Invalid user marek from 128.1.44.162 port 61486
Mar 22 08:30:57 node-007.snap.ntu-cloud-pg0.clemson.cloudlab.us sshd[3804076]: Received disconnect from 128.1.44.162 port 61486:11: Bye Bye [preauth]
Mar 22 08:30:57 node-007.snap.ntu-cloud-pg0.clemson.cloudlab.us sshd[3804076]: Disconnected from invalid user marek 128.1.44.162 port 61486 [preauth]
Mar 22 08:32:10 node-007.snap.ntu-cloud-pg0.clemson.cloudlab.us sshd[3804904]: Connection closed by 64.89.160.135 port 58000
Mar 22 08:32:40 node-007.snap.ntu-cloud-pg0.clemson.cloudlab.us sshd[3805144]: Invalid user admin from 2.57.121.112 port 30715
Mar 22 08:32:41 node-007.snap.ntu-cloud-pg0.clemson.cloudlab.us sshd[3805144]: Received disconnect from 2.57.121.112 port 30715:11: Bye [preauth]
Mar 22 08:32:41 node-007.snap.ntu-cloud-pg0.clemson.cloudlab.us sshd[3805144]: Disconnected from invalid user admin 2.57.121.112 port 30715 [preauth]
Mar 22 08:32:58 node-007.snap.ntu-cloud-pg0.clemson.cloudlab.us sshd[3805375]: Invalid user vpn from 80.94.95.116 port 50342
Mar 22 08:32:58 node-007.snap.ntu-cloud-pg0.clemson.cloudlab.us sshd[3805375]: Connection closed by invalid user vpn 80.94.95.116 port 50342 [preauth]
-- Boot 8ea115fd56224d44b89ad7299b2acc2d --
Mar 22 08:35:47 localhost kernel: Linux version 6.8.0-101-generic (buildd@lcy02-amd64-051) (x86_64-linux-gnu-gcc-13 (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0, GNU ld (GNU Binutils for Ubuntu) 2.42) #101-Ubuntu SMP PREEMPT_DYNAMIC Mon Feb  9 10:15:05 UTC 2026 (Ubuntu 6.8.0-101.101-generic 6.8.12)
...

Regards,
Leonid

David M Johnson

Mar 23, 2026, 9:17:36 AM
to cloudla...@googlegroups.com
On 3/22/26 22:08, Leonid Kondrashov wrote:
> Hello,
>
> During my experiment on the r6615 nodes at Clemson, I observed that some
> nodes rebooted at random times (experiment: https://www.cloudlab.us/
> status.php?uuid=029e88bc-1ed0-4e86-93a2-5ba77374a1da#). Our experimental
> setup doesn't support graceful recovery on reboot, so the reboots
> themself are problematic for me.
>
> Can you help identify the reason for the reboots so I can avoid them in
> the future? I have some logs from journalctl right before the boot:

Hi. Unfortunately the experiment has terminated, and the logs below
don't point to any issues. Please let us know if it happens again. I
would check the affected node's serial console logs from the experiment
status page to see if there are messages from the kernel; those may not
have made it to persistent storage on disk if there was a kernel panic
and emergency restart.

David
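
Editor's note: the journalctl excerpt in the original post is dominated by sshd login-attempt noise, which makes it hard to see what else was logged before the reboot. The following is a minimal sketch, not a CloudLab tool, of how saved journalctl text could be filtered to show the last few non-sshd entries before each `-- Boot <id> --` marker. The function name, the regular expressions, and the default of ignoring only `sshd` are assumptions made for illustration. It only reads text that reached the journal, so it cannot recover kernel messages that were lost in a panic; the serial console log remains the place to look for those.

```python
import re

# Boot separator as printed by journalctl, e.g. "-- Boot 8ea115fd... --".
BOOT_MARKER = re.compile(r"^-- Boot ([0-9a-f]+) --$")
# Short journalctl format: "Mon DD HH:MM:SS host unit[pid]: message".
# The "[pid]" part is optional (kernel lines do not have one).
ENTRY = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]{8}\s+\S+\s+([^\s\[:]+)(?:\[\d+\])?:\s"
)


def entries_before_boots(lines, ignore_units=("sshd",), keep=5):
    """Map each boot ID to the last `keep` non-ignored entries logged before it."""
    result = {}
    recent = []
    for raw in lines:
        line = raw.rstrip("\n")
        marker = BOOT_MARKER.match(line)
        if marker:
            # Everything collected so far belongs to the previous boot.
            result[marker.group(1)] = recent[-keep:]
            recent = []
            continue
        entry = ENTRY.match(line)
        if entry and entry.group(1) not in ignore_units:
            recent.append(line)
    return result
```

One possible use is to save the journal to a file with `journalctl > journal.txt`, then call `entries_before_boots(open("journal.txt"))` and inspect the entries listed for each boot ID.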

Mike Hibler

Mar 23, 2026, 10:34:14 AM
to cloudla...@googlegroups.com
Can you say more about what you were doing in your experiment? (You can email
porta...@cloudlab.us if you want to take this offline.)

Did you have a custom kernel or custom kernel modules? Are you using any
specific CPU features? I ask because I did a quick check of two of the nodes
and they both showed a Machine Check error while they were in your experiment.
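
Editor's note: the thread does not say how the Machine Check errors were found. As a rough sketch only, a saved kernel or serial console log could be scanned for lines that look like machine check reports. The two patterns below are assumptions: Linux kernels commonly tag such messages with `[Hardware Error]` or mention "Machine check", but the exact wording varies by kernel version and platform, so an empty result does not prove that no error occurred.

```python
import re

# Assumed markers for machine check reports in Linux kernel output.
# Wording differs between kernels, so treat matches as hints only.
MCE_PATTERN = re.compile(r"\[Hardware Error\]|machine check", re.IGNORECASE)


def find_machine_check_lines(log_text):
    """Return (line_number, line) pairs for lines that look like MCE reports."""
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if MCE_PATTERN.search(line)
    ]
```

For example, the text of a serial console log downloaded from the experiment status page could be passed in as `log_text`, and any returned lines included in a follow-up report to the list.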

On Sun, Mar 22, 2026 at 09:08:16PM -0700, Leonid Kondrashov wrote:
> Hello,
>
> During my experiment on the r6615 nodes at Clemson, I observed that some nodes
> rebooted at random times (experiment: https://www.cloudlab.us/status.php?uuid=
> 029e88bc-1ed0-4e86-93a2-5ba77374a1da#). Our experimental setup doesn't support
> graceful recovery on reboot, so the reboots themself are problematic for me.
>
> ...
>
> Regards,
> Leonid
