Kernel oops if wrong IB interface is in config

Lukas Hejtmanek

Nov 18, 2024, 1:23:06 PM
to beegfs-user
Hello,

I encountered a kernel oops followed by a node reset when a client config names an invalid interface and the node tries to mount a BeeGFS volume. I set ibs225s0 instead of ibs6 in the config (unfortunately, not every node in the cluster uses the same interface name) and got the following oops when trying to mount. It is clearly a configuration mistake on my side, but I suppose it should not oops that hard ;) Client/server version is 7.4.5, on Ubuntu 24.04 with kernel 6.8.0.

[ 1010.800006] BUG: kernel NULL pointer dereference, address: 0000000000000008
[ 1010.807089] #PF: supervisor read access in kernel mode
[ 1010.812306] #PF: error_code(0x0000) - not-present page
[ 1010.817518] PGD 1242f466067 P4D 0
[ 1010.820985] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 1010.825413] CPU: 177 PID: 72212 Comm: mount Tainted: P        W  OE      6.8.0-48-generic #48-Ubuntu
[ 1010.834656] Hardware name: HPE ProLiant DL385 Gen11/ProLiant DL385 Gen11, BIOS 1.70 09/05/2024
[ 1010.843369] RIP: 0010:__NodeConnPool_applySocketOptionsConnected+0x20f/0x890 [beegfs]
[ 1010.851344] Code: e8 96 eb fd e8 31 c0 b9 06 00 00 00 48 89 df f3 48 ab 48 8b 85 50 ff ff ff 48 c7 c7 40 74 08 ae 48 8b 00 48 8b 80 d0 00 00 00 <44> 8b 78 08 b8 ab 0f 00 00 48 c7 45 a8 00 00 00 00 48 c7 45 a0 00
[ 1010.870305] RSP: 0018:ff4bf04ed970f398 EFLAGS: 00010246
[ 1010.875611] RAX: 0000000000000000 RBX: ff4bf04ed970f3f0 RCX: 0000000000000000
[ 1010.882842] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffae087440
[ 1010.890066] RBP: ff4bf04ed970f450 R08: 0000000000000000 R09: 0000000000000000
[ 1010.897294] R10: 0000000000000000 R11: 0000000000000000 R12: ff30311d0c9dd140
[ 1010.904522] R13: ff4bf04ed970f3c8 R14: ff4bf04ed970f3b8 R15: ff30311df0262980
[ 1010.911747] FS:  0000753d01420800(0000) GS:ff30317b5c880000(0000) knlGS:0000000000000000
[ 1010.919938] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1010.925760] CR2: 0000000000000008 CR3: 000001243c936004 CR4: 0000000000f71ef0
[ 1010.932985] PKRU: 55555554
[ 1010.935744] Call Trace:
[ 1010.938246]  <TASK>
[ 1010.940407]  ? show_regs+0x6d/0x80
[ 1010.943875]  ? __die+0x24/0x80
[ 1010.946994]  ? page_fault_oops+0x99/0x1b0
[ 1010.951080]  ? do_user_addr_fault+0x2e2/0x670
[ 1010.955517]  ? exc_page_fault+0x83/0x1b0
[ 1010.959516]  ? asm_exc_page_fault+0x27/0x30
[ 1010.963781]  ? __NodeConnPool_applySocketOptionsConnected+0x20f/0x890 [beegfs]
[ 1010.971125]  ? __NodeConnPool_applySocketOptionsConnected+0x1ea/0x890 [beegfs]
[ 1010.978467]  NodeConnPool_acquireStreamSocketEx+0xe7f/0x1070 [beegfs]
[ 1010.985017]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 1010.989885]  ? crng_fast_key_erasure+0xd5/0x120
[ 1010.994499]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 1010.999363]  ? __ip_dev_find+0x8a/0x130
[ 1011.003277]  NodeConnPool_acquireStreamSocket+0x15/0x30 [beegfs]
[ 1011.009385]  __MessagingTk_requestResponseWithRRArgsComm+0x56f/0x950 [beegfs]
[ 1011.016636]  ? wait_for_completion+0x114/0x150
[ 1011.021150]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 1011.026017]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 1011.030888]  ? _destroy_id+0x22a/0x380 [rdma_cm]
[ 1011.035590]  MessagingTk_requestResponseWithRRArgsSock+0x53/0x220 [beegfs]
[ 1011.043076]  ? destroy_id_handler_unlock+0x55/0xb0 [rdma_cm]
[ 1011.049252]  MessagingTk_requestResponseKMalloc+0x6b/0xc0 [beegfs]
[ 1011.055943]  __Logger_logTopGrantedUnlocked+0x190/0x300 [beegfs]
[ 1011.062457]  __Logger_logTopFormattedGranted+0xd6/0x140 [beegfs]
[ 1011.068961]  Logger_logErrFormatted+0x79/0xb0 [beegfs]
[ 1011.074565]  App_findAllowedRDMAInterfaces+0xe2/0x140 [beegfs]
[ 1011.080863]  __App_initLocalNodeInfo+0x75/0x1a0 [beegfs]
[ 1011.086643]  __App_initDataObjects+0x267/0x7f0 [beegfs]
[ 1011.092330]  App_run+0x45/0x150 [beegfs]
[ 1011.096700]  Fcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 mlx5_ib(OE) ib_uverbs(OE) macsec ib_core(OE) nvme(OE) crct10dif_pclmul mlx5_core(OE) crc32_pclmul mlxfw(OE) polyval_clmulni psample polyval_generic nvme_core(OE) mlxdevm(OE) ghash_clmulni_intel tls nvme_auth ice xhci_pci ahci sha256_ssse3 tg3 libahci sha1_ssse3 gnss pci_hyperv_intf xhci_pci_renesas mlx_compat(OE) aesni_intel crypto_simd cryptd [last unloaded: ecc]
[ 1011.515212] CR2: 0000000000000008
[ 1011.519045] ---[ end trace 0000000000000000 ]---
[ 1011.600199] RIP: 0010:__NodeConnPool_applySocketOptionsConnected+0x20f/0x890 [beegfs]
[ 1011.608728] Code: e8 96 eb fd e8 31 c0 b9 06 00 00 00 48 89 df f3 48 ab 48 8b 85 50 ff ff ff 48 c7 c7 40 74 08 ae 48 8b 00 48 8b 80 d0 00 00 00 <44> 8b 78 08 b8 ab 0f 00 00 48 c7 45 a8 00 00 00 00 48 c7 45 a0 00
[ 1011.628668] RSP: 0018:ff4bf04ed970f398 EFLAGS: 00010246
[ 1011.634436] RAX: 0000000000000000 RBX: ff4bf04ed970f3f0 RCX: 0000000000000000

-- 
Lukas Hejtmanek

Steffen Grunewald

Nov 19, 2024, 2:00:52 PM
to fhgfs...@googlegroups.com
Hi,

while this doesn't address the kernel oops: using connNetFilter instead of
connInterface may avoid the interface name confusion. (Another option would
be to assign alias names during install, e.g. via systemd's network link
mechanism, see `man systemd.link`.)
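
To illustrate why a subnet-based filter sidesteps the naming problem, here is a minimal Python sketch. It is not BeeGFS code; it only shows the idea of selecting an interface by address range rather than by name. The IP addresses and the 10.10.0.0/16 subnet are made up for the example; only the interface names ibs6 and ibs225s0 come from this thread. (In the client config the file-based options are, to my knowledge, `connNetFilterFile` and `connInterfacesFile`; please check the comments in your beegfs-client.conf.)

```python
import ipaddress


def interfaces_in_subnet(addresses, cidr):
    """Return the sorted names of interfaces whose address lies inside cidr."""
    net = ipaddress.ip_network(cidr)
    return sorted(
        name for name, addr in addresses.items()
        if ipaddress.ip_address(addr) in net
    )


# Hypothetical address assignments for two nodes with different IB names.
node_a = {"eno1": "192.168.1.10", "ibs6": "10.10.0.10"}
node_b = {"eno1": "192.168.1.11", "ibs225s0": "10.10.0.11"}

# The same subnet (an assumption for this example) picks the IB port on both.
for node in (node_a, node_b):
    print(interfaces_in_subnet(node, "10.10.0.0/16"))
```

The same filter line works on both nodes, so a single config can be shared across the cluster even though the interface names differ.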

OTOH I agree that a flaw in the config shouldn't throw a NULL pointer deref.
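
As a stopgap on the admin side, a pre-mount check could catch a misnamed interface before the client module ever sees it. The following is only a sketch: the helper names are hypothetical, and it assumes the interface list is a plain text file with one name per line (with `#` comments allowed), which you should verify against your own setup.

```python
import socket


def read_interface_list(path):
    """Read interface names, one per line (assumed format), skipping blanks and # comments."""
    names = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.split("#", 1)[0].strip()
            if line:
                names.append(line)
    return names


def local_interface_names():
    """Names of the network interfaces the kernel currently knows about."""
    return [name for _index, name in socket.if_nameindex()]


def missing_interfaces(configured, present):
    """Return the configured names that are not in present, keeping their order."""
    present = set(present)
    return [name for name in configured if name not in present]
```

Calling `missing_interfaces(read_interface_list(path), local_interface_names())` from a mount wrapper and refusing to mount when the result is non-empty would have flagged ibs225s0 on a node that only has ibs6.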

Best,
Steffen

--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~