Kernel crash for BeeGFS client 7.3.0 with kernel 5.10.x and TrueScale adapter

George Knolf

May 16, 2022, 12:13:26 PM
to beegfs-user
Hi,

We run BeeGFS clients with kernel 5.10.x on several different
platforms (amd64 with various types of InfiniBand).

This has worked fine so far, except on one type of older compute node that
has Intel True Scale host adapters (IBA7322). On these nodes the kernel
crashes with a NULL pointer dereference on the first mount attempt:

[  623.099933] beegfs: enabling unsafe global rkey
[  623.104798] BUG: kernel NULL pointer dereference, address: 0000000000000250
[  623.111775] #PF: supervisor read access in kernel mode
[  623.116925] #PF: error_code(0x0000) - not-present page
[  623.122081] PGD 0 P4D 0
[  623.124628] Oops: 0000 [#1] SMP PTI
[  623.128125] CPU: 14 PID: 16301 Comm: mount Tainted: G           OE     5.10.0-14-amd64 #1 Debian 5.10.113-1
[  623.137885] Hardware name: Supermicro SYS-6028TR-HTR/X10DRT-H, BIOS 3.2 11/20/2019
[  623.145475] RIP: 0010:dma_alloc_attrs+0x5/0x50
[  623.149933] Code: 74 16 e9 7e 2e af 00 48 8b 05 b7 7b d6 01 48 85 c0 75 e3 e9 6d 0d 00 00 b8 ff ff ff ff c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 <48> 8b 87 50 02 00 00 4c 8b 8f 60 02 00 00 83 e1 f8 48 85 c0 74 09
[  623.168728] RSP: 0018:ffff99c4ca77f798 EFLAGS: 00010282
[  623.173968] RAX: ffff8be2385e8000 RBX: ffff8be984f1b000 RCX: 0000000000000cc0
[  623.181115] RDX: ffff8be9878814e0 RSI: 0000000000001000 RDI: 0000000000000000
[  623.188265] RBP: 0000000000001000 R08: 0000000000000000 R09: ffff8be9878814e0
[  623.195413] R10: 0000000000000000 R11: ffff8be9887b5800 R12: ffff8be98561c710
[  623.202564] R13: 0000000000001000 R14: ffff8be984a87a00 R15: 0000000000000000
[  623.209716] FS:  00007f4f948b8840(0000) GS:ffff8bf15fd80000(0000) knlGS:0000000000000000
[  623.217820] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  623.223576] CR2: 0000000000000250 CR3: 0000000885e2a004 CR4: 00000000001706e0
[  623.230724] Call Trace:
[  623.233202]  IBVBuffer_init+0xf0/0x180 [beegfs]
[  623.237752]  __IBVSocket_createCommContext+0x10b/0x3a0 [beegfs]
[  623.243688]  ? __switch_to_asm+0x42/0x70
[  623.247624]  ? __switch_to+0x114/0x450
[  623.251385]  ? __schedule+0x28a/0x870
[  623.255065]  __IBVSocket_routeResolvedHandler+0x2b/0xa0 [beegfs]
[  623.261091]  ? finish_wait+0x42/0x80
[  623.264686]  IBVSocket_connectByIP+0xf9/0x2a0 [beegfs]
[  623.269843]  ? add_wait_queue_exclusive+0x70/0x70
[  623.275546]  _RDMASocket_connectByIP+0x28/0x60 [beegfs]
[  623.281751]  NodeConnPool_acquireStreamSocketEx+0x68a/0xf00 [beegfs]
[  623.289095]  __MessagingTk_requestResponseWithRRArgsComm+0x5f/0x780 [beegfs]
[  623.297067]  __MessagingTk_requestResponseNodeRetry+0x166/0x560 [beegfs]
[  623.304704]  FhgfsOpsRemoting_statAndGetParentInfo+0x130/0x260 [beegfs]
[  623.312266]  FhgfsOpsRemoting_statRoot+0x96/0x110 [beegfs]
[  623.318701]  __App_mountServerCheck+0xa8/0x170 [beegfs]
[  623.324877]  ? InternodeSyncer_waitForMgmtInit+0xdd/0x1b0 [beegfs]
[  623.331953]  ? wake_up_q+0xa0/0xa0
[  623.336291]  App_run+0x63/0x70 [beegfs]
[  623.341039]  FhgfsOps_fillSuper+0xb7/0x240 [beegfs]
[  623.346813]  ? ida_alloc_range+0x379/0x3d0
[  623.351783]  ? idr_replace+0x99/0xa0
[  623.356201]  ? sget+0x1d3/0x220
[  623.360166]  ? FhgfsOps_unregisterFilesystem+0x20/0x20 [beegfs]
[  623.366883]  mount_nodev+0x44/0x90
[  623.371069]  legacy_get_tree+0x27/0x40
[  623.375567]  vfs_get_tree+0x25/0xb0
[  623.379798]  path_mount+0x454/0xa70
[  623.384013]  __x64_sys_mount+0x103/0x140
[  623.388639]  do_syscall_64+0x33/0x80
[  623.392938]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  623.398713] RIP: 0033:0x7f4f94af89ea
[  623.403003] Code: 48 8b 0d a9 f4 0b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 76 f4 0b 00 f7 d8 64 89 01 48
[  623.423254] RSP: 002b:00007ffc6483afd8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[  623.431557] RAX: ffffffffffffffda RBX: 00007f4f94c1b264 RCX: 00007f4f94af89ea
[  623.439408] RDX: 000055f3ce666cf0 RSI: 000055f3ce666d30 RDI: 000055f3ce666d10
[  623.447259] RBP: 000055f3ce666a30 R08: 000055f3ce666c90 R09: 00007f4f94bb8be0
[  623.455055] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  623.462836] R13: 000055f3ce666d10 R14: 000055f3ce666cf0 R15: 000055f3ce666a30
[  623.470667] Modules linked in: beegfs(OE) rdma_cm iw_cm intel_rapl_msr intel_rapl_common ipmi_ssif sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass ghash_clmulni_intel aesni_intel joydev ast libaes drm_vram_helper crypto_simd drm_ttm_helper cryptd glue_helper hid_generic evdev ttm rapl iTCO_wdt intel_pmc_bxt intel_cstate mei_me drm_kms_helper usbhid iTCO_vendor_support acpi_ipmi intel_uncore pcspkr hid sg cec watchdog ipmi_si mei ioatdma ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad button ib_ipoib ib_cm ib_qib rdmavt ib_uverbs ib_core fuse drm configfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic sd_mod t10_pi crc_t10dif crct10dif_generic xhci_pci ehci_pci igb xhci_hcd ahci ehci_hcd libahci i2c_algo_bit libata usbcore dca crct10dif_pclmul i2c_i801 crct10dif_common ptp crc32_pclmul crc32c_intel scsi_mod lpc_ich i2c_smbus usb_common pps_core wmi
[  623.557093] CR2: 0000000000000250
[  623.561273] ---[ end trace 496152789dc81a07 ]---
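
As a sanity check on the trace, the bytes marked <...> in the Code line at the faulting RIP (dma_alloc_attrs+0x5) can be decoded by hand. The snippet below is only a minimal sketch, not a disassembler: it handles just this one instruction encoding, and the helper name decode_mov_load is made up for illustration. Assuming the usual x86-64 calling convention, where RDI holds the first argument, it suggests that dma_alloc_attrs() was entered with a NULL device pointer. The trace does not show where inside IBVBuffer_init that NULL comes from.

```python
import struct

# Values copied from the oops above: the 7 bytes starting at the <48> marker
# in the "Code:" line, plus the RDI and CR2 register values.
code_at_rip = bytes.fromhex("48 8b 87 50 02 00 00")
rdi = 0x0000000000000000
cr2 = 0x0000000000000250

# Register names for the low 3 bits of ModRM.reg / ModRM.rm (no REX.R/REX.B set).
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def decode_mov_load(insn: bytes):
    """Decode only 'REX.W 8B /r' with mod=10 (disp32) and no SIB byte.

    Hypothetical helper for this one encoding; anything else is rejected.
    """
    rex, opcode, modrm = insn[0], insn[1], insn[2]
    assert rex == 0x48 and opcode == 0x8B, "not a 64-bit MOV r64, r/m64"
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    assert mod == 0b10 and rm != 0b100, "expected [reg + disp32] without SIB"
    (disp,) = struct.unpack("<i", insn[3:7])  # little-endian signed 32-bit
    return REG64[reg], REG64[rm], disp


dst, base, disp = decode_mov_load(code_at_rip)
fault_addr = rdi + disp  # base register is rdi, whose value is 0 in the oops

print(f"mov {dst}, [{base}+{disp:#x}] -> faults at {fault_addr:#x}")
print("matches CR2:", fault_addr == cr2)
```

The computed address equals CR2 (0x250), which is consistent with a load from a field at offset 0x250 of a NULL pointer passed in RDI. RSI = 0x1000 and RCX = 0xcc0 in the register dump also look like a plausible size and GFP flag argument, so the other arguments appear to be intact.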

The above is from Debian 11, kernel 5.10.0-14-amd64.
The same behaviour was seen under AlmaLinux 8.5 and vanilla kernel 5.10.112.

The server is a Supermicro dual Xeon E5 board (X10DRT-H).

The same beegfs module works fine with every type of Mellanox HCA we were
able to test (ConnectX and above).

Does anyone else have similar hardware around who can confirm this?

Are any of the developers reading this, and can they suggest a fix?

Regards,
 George
