
Storage Worker Hang Suddenly


Jit Kang CHANG

Aug 13, 2024, 1:44:29 AM
to beegfs-user
Hi all,

We have been running BeeGFS 7.4.3 in our University cluster for about 3 months now. However, we have been troubled by a problem that has happened twice during that period: all the worker tasks on one of the storage nodes suddenly hang, and the node reboots by itself about 30 minutes later. We are not sure what is actually causing the issue, as there is no useful information in the BeeGFS storage log when the tasks hang. We only get a hung task warning in the syslog.

Aug 12 16:08:17 beegfsoss01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 12 16:08:17 beegfsoss01 kernel: task:Worker1-1       state:D stack:    0 pid:99059 ppid:     1 flags:0x00004002
Aug 12 16:08:17 beegfsoss01 kernel: Call Trace:
Aug 12 16:08:17 beegfsoss01 kernel: <TASK>
Aug 12 16:08:17 beegfsoss01 kernel: __schedule+0x248/0x620
Aug 12 16:08:17 beegfsoss01 kernel: schedule+0x5a/0xc0
Aug 12 16:08:17 beegfsoss01 kernel: raid5_get_active_stripe+0x25a/0x2f0 [raid456]
Aug 12 16:08:17 beegfsoss01 kernel: ? cpuacct_percpu_seq_show+0x10/0x10
Aug 12 16:08:17 beegfsoss01 kernel: make_stripe_request+0x9b/0x490 [raid456]
Aug 12 16:08:17 beegfsoss01 kernel: ? bdev_start_io_acct+0x47/0x100
Aug 12 16:08:17 beegfsoss01 kernel: raid5_make_request+0x16f/0x3e0 [raid456]
Aug 12 16:08:17 beegfsoss01 kernel: ? sched_show_numa+0xf0/0xf0
Aug 12 16:08:17 beegfsoss01 kernel: md_handle_request+0x135/0x1e0
Aug 12 16:08:17 beegfsoss01 kernel: __submit_bio+0x89/0x130
Aug 12 16:08:17 beegfsoss01 kernel: __submit_bio_noacct+0x81/0x1f0
Aug 12 16:08:17 beegfsoss01 kernel: iomap_submit_ioend+0x4e/0x80
Aug 12 16:08:17 beegfsoss01 kernel: xfs_vm_writepages+0x7a/0xb0 [xfs]
Aug 12 16:08:17 beegfsoss01 kernel: do_writepages+0xcf/0x1d0
Aug 12 16:08:17 beegfsoss01 kernel: ? selinux_file_open+0xad/0xd0
Aug 12 16:08:17 beegfsoss01 kernel: filemap_fdatawrite_wbc+0x66/0x90
Aug 12 16:08:17 beegfsoss01 kernel: filemap_write_and_wait_range+0x6f/0xf0
Aug 12 16:08:17 beegfsoss01 kernel: xfs_setattr_size+0xb5/0x390 [xfs]
Aug 12 16:08:17 beegfsoss01 kernel: xfs_vn_setattr+0x78/0x180 [xfs]
Aug 12 16:08:17 beegfsoss01 kernel: notify_change+0x34d/0x4e0
Aug 12 16:08:17 beegfsoss01 kernel: ? do_truncate+0x7d/0xd0
Aug 12 16:08:17 beegfsoss01 kernel: do_truncate+0x7d/0xd0
Aug 12 16:08:17 beegfsoss01 kernel: do_sys_ftruncate+0x17d/0x1b0
Aug 12 16:08:17 beegfsoss01 kernel: do_syscall_64+0x5c/0x90
Aug 12 16:08:17 beegfsoss01 kernel: ? syscall_exit_to_user_mode+0x12/0x30
Aug 12 16:08:17 beegfsoss01 kernel: ? do_syscall_64+0x69/0x90
Aug 12 16:08:17 beegfsoss01 kernel: ? do_syscall_64+0x69/0x90
Aug 12 16:08:17 beegfsoss01 kernel: ? do_syscall_64+0x69/0x90
Aug 12 16:08:17 beegfsoss01 kernel: ? common_interrupt+0x43/0xa0
Aug 12 16:08:17 beegfsoss01 kernel: entry_SYSCALL_64_after_hwframe+0x63/0xcd
Aug 12 16:08:17 beegfsoss01 kernel: RIP: 0033:0x7fa737d466eb
Aug 12 16:08:17 beegfsoss01 kernel: RSP: 002b:00007fa734bf8398 EFLAGS: 00000213 ORIG_RAX: 000000000000004d
Aug 12 16:08:17 beegfsoss01 kernel: RAX: ffffffffffffffda RBX: 00007fa6fc0047e0 RCX: 00007fa737d466eb
Aug 12 16:08:17 beegfsoss01 kernel: RDX: 0000000000000000 RSI: 0000000000000018 RDI: 0000000000000884
Aug 12 16:08:17 beegfsoss01 kernel: RBP: 00007fa734bf85b0 R08: 0000000000000000 R09: 00007fa734bf8670
Aug 12 16:08:17 beegfsoss01 kernel: R10: 0000000000000000 R11: 0000000000000213 R12: 0000000000000884
Aug 12 16:08:17 beegfsoss01 kernel: R13: 0000000000000018 R14: 0000000001beef50 R15: 00007fa734bf8650
Aug 12 16:08:17 beegfsoss01 kernel: </TASK>

We are currently running 2 storage controllers on Rocky 9.2 with kernel version 5.14.0-284.11.1. Each controller is connected to a JBOD with 40 SAS disks. There are a total of 4 RAID 6 arrays per JBOD, so 10 disks per array. Each storage node is equipped with an AMD EPYC 9124 processor and 256 GB of memory. We know the hardware is not very optimised for performance, but we currently can't do much about it because of the long procurement process at the University.

The worker task hang seems to occur about 6 weeks after a node reboot, and it has happened on all of the BeeGFS storage controllers. We have other storage controllers running CephFS that have not shown any similar problem or behaviour so far, so we are pretty clueless about what is actually causing the issue.

We would like to get some help here and see if anyone else is having similar issues.

Thanks.

Waltar

Aug 13, 2024, 2:21:00 AM
to beegfs-user
Looks like a disk problem affecting first mdadm, then xfs and finally the beegfs workers. I would run smartctl -t long (~24h?) on all disks and check the results afterwards (smartctl -a). If you can't find a bad disk with that, run "for d in /dev/sd*; do time dd if=$d of=/dev/null bs=1M; done" and look for unusual speed differences between the disks.
If all disks look evenly good, there might be a problem with the SATA/SAS (?) controller (perhaps related to temperature/cooling?) or a cable.
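As a rough sketch (assuming the JBOD disks show up as plain /dev/sdX whole-disk devices; adjust the device list to your setup):

# start a long self-test on every whole disk, come back ~24h later
for d in $(lsblk -dno NAME | grep '^sd'); do smartctl -t long /dev/$d; done
# afterwards check the results and error counters of each disk
for d in $(lsblk -dno NAME | grep '^sd'); do echo "== $d =="; smartctl -a /dev/$d; done
# read-only sequential speed per disk, look for one that is clearly slower than the rest
for d in $(lsblk -dno NAME | grep '^sd'); do echo "== $d =="; time dd if=/dev/$d of=/dev/null bs=1M count=10240; done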

Jit Kang CHANG

Aug 14, 2024, 3:00:34 AM
to beegfs-user
Hi, thanks for the suggestion.

I tried running a long smartctl test on all the disks. While the test results look OK, the hang did happen again, very quickly this time. I have yet to run the dd test as I was worried it might put the current production data at risk.

So far, the temperatures of all the disks look good to me (around 30-33 degrees Celsius), and there were no errors reported by the SAS JBOD. All the nodes are located in a data centre with proper cooling, so temperature should not be an issue.

This time I do notice something different about the hung task in the log, which might indicate some sort of XFS and RAID problem?

Aug 14 11:14:55 beegfsoss01 kernel: INFO: task xfsaild/md0:969079 blocked for more than 622 seconds.
Aug 14 11:14:55 beegfsoss01 kernel:      Tainted: G           OE    --------  ---  5.14.0-284.11.1.el9_2.x86_64 #1
Aug 14 11:14:55 beegfsoss01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 14 11:14:55 beegfsoss01 kernel: task:xfsaild/md0     state:D stack:    0 pid:969079 ppid:     2 flags:0x00004000
Aug 14 11:14:55 beegfsoss01 kernel: Call Trace:
Aug 14 11:14:55 beegfsoss01 kernel: <TASK>
Aug 14 11:14:55 beegfsoss01 kernel: __schedule+0x248/0x620
Aug 14 11:14:55 beegfsoss01 kernel: ? __blk_flush_plug+0xdb/0x160
Aug 14 11:14:55 beegfsoss01 kernel: schedule+0x5a/0xc0
Aug 14 11:14:55 beegfsoss01 kernel: md_bitmap_startwrite+0x156/0x1c0
Aug 14 11:14:55 beegfsoss01 kernel: ? cpuacct_percpu_seq_show+0x10/0x10
Aug 14 11:14:55 beegfsoss01 kernel: __add_stripe_bio+0x202/0x240 [raid456]
Aug 14 11:14:55 beegfsoss01 kernel: make_stripe_request+0x1bb/0x490 [raid456]
Aug 14 11:14:55 beegfsoss01 kernel: raid5_make_request+0x16f/0x3e0 [raid456]
Aug 14 11:14:55 beegfsoss01 kernel: ? sched_show_numa+0xf0/0xf0
Aug 14 11:14:55 beegfsoss01 kernel: md_handle_request+0x135/0x1e0
Aug 14 11:14:55 beegfsoss01 kernel: ? bio_split_to_limits+0x51/0x90
Aug 14 11:14:55 beegfsoss01 kernel: __submit_bio+0x89/0x130
Aug 14 11:14:55 beegfsoss01 kernel: __submit_bio_noacct+0x81/0x1f0
Aug 14 11:14:55 beegfsoss01 kernel: xfs_buf_ioapply_map+0x1cb/0x280 [xfs]
Aug 14 11:14:55 beegfsoss01 kernel: _xfs_buf_ioapply+0xcf/0x1b0 [xfs]
Aug 14 11:14:55 beegfsoss01 kernel: ? wake_up_q+0x90/0x90
Aug 14 11:14:55 beegfsoss01 kernel: __xfs_buf_submit+0x6e/0x1e0 [xfs]
Aug 14 11:14:55 beegfsoss01 kernel: xfs_buf_delwri_submit_buffers+0xe9/0x230 [xfs]
Aug 14 11:14:55 beegfsoss01 kernel: xfsaild_push+0x176/0x6f0 [xfs]
Aug 14 11:14:55 beegfsoss01 kernel: ? del_timer_sync+0x67/0xb0
Aug 14 11:14:55 beegfsoss01 kernel: xfsaild+0xa4/0x1e0 [xfs]
Aug 14 11:14:55 beegfsoss01 kernel: ? xfsaild_push+0x6f0/0x6f0 [xfs]
Aug 14 11:14:55 beegfsoss01 kernel: kthread+0xd9/0x100
Aug 14 11:14:55 beegfsoss01 kernel: ? kthread_complete_and_exit+0x20/0x20
Aug 14 11:14:55 beegfsoss01 kernel: ret_from_fork+0x22/0x30
Aug 14 11:14:55 beegfsoss01 kernel: </TASK>

Waltar

Aug 14, 2024, 3:26:23 AM
to beegfs-user
What should xfs do if there is a disk timeout? It cannot overcome that problem by itself (xfsaild/md0:969079 blocked for more than 622 seconds).
Find the disk-related problem; the dd command as written before only reads from the disks and doesn't destroy anything, and hopefully the problem shows up in read I/O as well and not just in write I/O ... :-)

Waltar

Aug 14, 2024, 3:32:31 AM
to beegfs-user
For a write test you could make a test dir on each of your 4 raid6 volumes and do "dd if=/dev/null of=/<beegfs-store-1..4>/test-dir/testfile bs=1M count=40960 oflag=sync" and compare whether all 4 show the same performance (don't forget to delete the file afterwards) :-)

Waltar

Aug 14, 2024, 4:25:24 AM
to beegfs-user
Oops, I mean "dd if=/dev/zero of=/<beegfs-store-1..4>/test-dir/testfile bs=1M count=40960 oflag=sync" (not /dev/null, that is for reading ...).
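Put together, a rough sketch of that write test (the /beegfs-store-N mount points and the test-dir/testfile names are just placeholders for your four store directories):

# write ~40 GiB of zeroes to each raid6 volume and compare the throughput dd reports
for n in 1 2 3 4; do
    echo "== /beegfs-store-$n =="
    dd if=/dev/zero of=/beegfs-store-$n/test-dir/testfile bs=1M count=40960 oflag=sync
    rm /beegfs-store-$n/test-dir/testfile
done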

Jit Kang CHANG

Aug 15, 2024, 7:56:48 AM
to beegfs-user
Well, I have tried running the dd command and there does not seem to be much difference in performance across my raid6 volumes. However, running xfs_repair on two of the volumes is significantly slower than on the rest, so I suspect there might be disk failures within those raid volumes. I am trying to clear out the beegfs targets on those raid volumes before I destroy them and do a proper disk check.

Is there any better way to do this? What would happen if I removed the target before migrating the data out?

Waltar

Aug 15, 2024, 9:18:01 AM
to beegfs-user
Normally xfs_repair should never be needed, and if it is ... then you should rethink your storage layout completely and perhaps consider something else.
--> Did you run xfs_repair as an "active" or as a "dry" run, and what did it tell you?
Mdadm is fine for small-budget servers with fairly normal usage (I last used it myself nearly 15 years ago).
If you build a server for high-performance I/O with xfs, even if you only have 2 of them, the better choice would be a RAID controller in the server when using JBODs, as it changes the server price only minimally.
And everything "external" around xfs should be set up properly too (for highly parallel I/O the limit is always the hardware), as the Linux defaults are mostly tuned for a desktop workstation.
(I assume the initial setup was planned as a zfs config, hence this procurement.)

Waltar

Aug 15, 2024, 9:37:49 AM
to beegfs-user
And don't forget - before every use of xfs_repair the journal should be replayed first: umount, mount, umount, then xfs_repair full-path-to-xfs-device !!
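In command form, roughly (the md device matches the md0 from your log, but the mount point is a placeholder for one of your raid6 volumes, and the beegfs-storage service should presumably be stopped first):

umount /beegfs-store-1           # make sure nothing is using the filesystem
mount /dev/md0 /beegfs-store-1   # mounting replays the xfs journal
umount /beegfs-store-1
xfs_repair -n /dev/md0           # -n = dry run (check only); drop -n to actually repair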

Jit Kang CHANG

Aug 18, 2024, 11:36:00 PM
to beegfs-user
I was only running xfs_repair to clear up some doubts earlier; running it either active or dry does not show any problem with my XFS, so I think it is not an XFS issue.

Our initial plan was to use a Pacemaker + mdadm + xfs setup to ensure one controller can take over both JBODs if the other fails. It worked fine for years with our previous Lustre setup, so we thought we could do something similar with BeeGFS. Since I am not a storage system expert, I may have missed whether hardware RAID can do something similar, like taking over a JBOD from another controller.

So far, after updating the OS kernel, the storage seems to live longer without crashing, so I am not sure if there is a bug in the kernel that gets triggered by I/O. I will need to monitor for longer before I can pinpoint the exact issue.

Waltar

Aug 19, 2024, 3:19:20 AM
to beegfs-user
Aah, ok. You should take a look with an ongoing "iostat -xm 1" (while perhaps writing to a file) to see if one disk stands out in response time or utilization.
I'm personally not a fan of HA, as I've experienced a handful of crashes where the "HA system" then generated further trouble automatically, so for myself I prefer a system that stops on error over one that keeps running and produces garbage data, but that is just my own preference.
As minimal host config changes I would set these after each beegfs host boot (e.g. by a small script, see the sketch below):
echo 1000 > /proc/sys/fs/xfs/xfssyncd_centisecs
echo 1 > /proc/sys/vm/dirty_background_ratio
echo 30 > /proc/sys/vm/vfs_cache_pressure
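A minimal sketch of such a boot-time script (same values as above; the sysctl names are just the /proc/sys paths with dots, and the sysctl.d file name is only an example):

#!/bin/sh
# apply the suggested XFS/VM tunables on a beegfs storage host after boot
sysctl -w fs.xfs.xfssyncd_centisecs=1000
sysctl -w vm.dirty_background_ratio=1
sysctl -w vm.vfs_cache_pressure=30
# alternatively, put the same three key=value lines into /etc/sysctl.d/90-beegfs.conf to make them persistent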