Hi all,
We have been running BeeGFS 7.4.3 in our university cluster for about three months now. In that time we have been troubled by a problem that has happened twice: all the worker tasks on one of the storage nodes suddenly hang, and the node reboots by itself roughly 30 minutes later. We are not sure what is actually causing this, as there is no useful information in the BeeGFS storage log when the tasks hang; the only clue we get is the hung-task warning in the syslog:
Aug 12 16:08:17 beegfsoss01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 12 16:08:17 beegfsoss01 kernel: task:Worker1-1 state:D stack: 0 pid:99059 ppid: 1 flags:0x00004002
Aug 12 16:08:17 beegfsoss01 kernel: Call Trace:
Aug 12 16:08:17 beegfsoss01 kernel: <TASK>
Aug 12 16:08:17 beegfsoss01 kernel: __schedule+0x248/0x620
Aug 12 16:08:17 beegfsoss01 kernel: schedule+0x5a/0xc0
Aug 12 16:08:17 beegfsoss01 kernel: raid5_get_active_stripe+0x25a/0x2f0 [raid456]
Aug 12 16:08:17 beegfsoss01 kernel: ? cpuacct_percpu_seq_show+0x10/0x10
Aug 12 16:08:17 beegfsoss01 kernel: make_stripe_request+0x9b/0x490 [raid456]
Aug 12 16:08:17 beegfsoss01 kernel: ? bdev_start_io_acct+0x47/0x100
Aug 12 16:08:17 beegfsoss01 kernel: raid5_make_request+0x16f/0x3e0 [raid456]
Aug 12 16:08:17 beegfsoss01 kernel: ? sched_show_numa+0xf0/0xf0
Aug 12 16:08:17 beegfsoss01 kernel: md_handle_request+0x135/0x1e0
Aug 12 16:08:17 beegfsoss01 kernel: __submit_bio+0x89/0x130
Aug 12 16:08:17 beegfsoss01 kernel: __submit_bio_noacct+0x81/0x1f0
Aug 12 16:08:17 beegfsoss01 kernel: iomap_submit_ioend+0x4e/0x80
Aug 12 16:08:17 beegfsoss01 kernel: xfs_vm_writepages+0x7a/0xb0 [xfs]
Aug 12 16:08:17 beegfsoss01 kernel: do_writepages+0xcf/0x1d0
Aug 12 16:08:17 beegfsoss01 kernel: ? selinux_file_open+0xad/0xd0
Aug 12 16:08:17 beegfsoss01 kernel: filemap_fdatawrite_wbc+0x66/0x90
Aug 12 16:08:17 beegfsoss01 kernel: filemap_write_and_wait_range+0x6f/0xf0
Aug 12 16:08:17 beegfsoss01 kernel: xfs_setattr_size+0xb5/0x390 [xfs]
Aug 12 16:08:17 beegfsoss01 kernel: xfs_vn_setattr+0x78/0x180 [xfs]
Aug 12 16:08:17 beegfsoss01 kernel: notify_change+0x34d/0x4e0
Aug 12 16:08:17 beegfsoss01 kernel: ? do_truncate+0x7d/0xd0
Aug 12 16:08:17 beegfsoss01 kernel: do_truncate+0x7d/0xd0
Aug 12 16:08:17 beegfsoss01 kernel: do_sys_ftruncate+0x17d/0x1b0
Aug 12 16:08:17 beegfsoss01 kernel: do_syscall_64+0x5c/0x90
Aug 12 16:08:17 beegfsoss01 kernel: ? syscall_exit_to_user_mode+0x12/0x30
Aug 12 16:08:17 beegfsoss01 kernel: ? do_syscall_64+0x69/0x90
Aug 12 16:08:17 beegfsoss01 kernel: ? do_syscall_64+0x69/0x90
Aug 12 16:08:17 beegfsoss01 kernel: ? do_syscall_64+0x69/0x90
Aug 12 16:08:17 beegfsoss01 kernel: ? common_interrupt+0x43/0xa0
Aug 12 16:08:17 beegfsoss01 kernel: entry_SYSCALL_64_after_hwframe+0x63/0xcd
Aug 12 16:08:17 beegfsoss01 kernel: RIP: 0033:0x7fa737d466eb
Aug 12 16:08:17 beegfsoss01 kernel: RSP: 002b:00007fa734bf8398 EFLAGS: 00000213 ORIG_RAX: 000000000000004d
Aug 12 16:08:17 beegfsoss01 kernel: RAX: ffffffffffffffda RBX: 00007fa6fc0047e0 RCX: 00007fa737d466eb
Aug 12 16:08:17 beegfsoss01 kernel: RDX: 0000000000000000 RSI: 0000000000000018 RDI: 0000000000000884
Aug 12 16:08:17 beegfsoss01 kernel: RBP: 00007fa734bf85b0 R08: 0000000000000000 R09: 00007fa734bf8670
Aug 12 16:08:17 beegfsoss01 kernel: R10: 0000000000000000 R11: 0000000000000213 R12: 0000000000000884
Aug 12 16:08:17 beegfsoss01 kernel: R13: 0000000000000018 R14: 0000000001beef50 R15: 00007fa734bf8650
Aug 12 16:08:17 beegfsoss01 kernel: </TASK>
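The blocked task is sitting in raid5_get_active_stripe() in the md raid456 driver, so the next time this happens we will try to capture the md state and a full blocked-task dump before the node reboots itself. Roughly something like the following (md0 is an illustrative device name, we would repeat this for each array, and sysrq has to be enabled):

# dump all tasks in uninterruptible (D) state to the kernel log
echo w > /proc/sysrq-trigger

# md array status and raid456 stripe-cache state
cat /proc/mdstat
cat /sys/block/md0/md/stripe_cache_size
cat /sys/block/md0/md/stripe_cache_active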
We are currently running 2 storage controllers on Rocky 9.2 with kernel 5.14.0-284.11.1. Each controller is connected to a JBOD with 40 SAS disks, and we create 4 RAID 6 arrays per JBOD, so 10 disks per array. Each storage node has an AMD EPYC 9124 processor and 256 GB of memory. We know the hardware is not well optimised for performance, but we cannot do much about it at the moment due to the long procurement process at the university.
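In case the array layout matters: each array is a 10-disk md RAID 6 with XFS on top of the md device (as the trace above shows), created roughly like the following (device names are illustrative, and chunk size and mkfs options are omitted here):

# one of the four arrays per JBOD: RAID 6 across 10 SAS disks
mdadm --create /dev/md0 --level=6 --raid-devices=10 /dev/sd[b-k]
mkfs.xfs /dev/md0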
The worker task hangs seem to occur about six weeks after a node reboot, and have by now happened on all of the BeeGFS storage controllers. We have other storage controllers running CephFS that have not shown any similar behaviour so far, so we are fairly clueless about what is actually causing the issue.
We would appreciate any help here, and would also like to know whether anyone else is seeing similar issues.
Thanks.