metadata node fails (Received a SIGSEGV)


Juan A. Cordero Varela

Apr 23, 2024, 4:37:59 AM
to beegfs-user
Hi,

I have the following infrastructure:
  • dedicated metadata node: a RAID1 (SSDs) with 670 GB mounted on /mnt/beegfs-meta
  • storage server 1 with 2 targets: 
    • a RAID6 (HDDs) with 7 TB mounted on /mnt/beegfs-home (target 1)
    • a RAID6 (HDDs) with 145 TB mounted on /mnt/beegfs-storage (target 2)
  • storage server 2 with 1 target:
    • a RAID6 (HDDs) with 145 TB mounted on /mnt/beegfs-storage (target 3)
  • a proxmox VM as a management node
  • several proxmox VMs as client nodes
All nodes use RDMA over InfiniBand (the VMs via SR-IOV), except the metadata node, whose InfiniBand card does not work well, so it uses TCP.

On the clients, I've set the directory /mnt/beegfs/projects to store data on targets 2 and 3, while /mnt/beegfs/home stores data on target 1.

I am using NIS to allow logins on the client nodes, so I also installed it and started ypbind on the metadata and storage nodes (if this is not correct, please tell me), so that getent passwd lists the users on all nodes, as indicated in the documentation.

Everything works well except when I try to set quotas.

As indicated in the documentation, I set quotaEnableEnforcement to true in the conf files of the metadata, storage and management nodes, and quotaEnabled to true in the conf files of the clients.
Then I ran beegfs-fsck --enablequota, set some default quotas and created some users. However, when users try to log in, the metadata server fails with the following message:

(4) Apr22 16:10:14 XNodeSync [IBVSocket.cpp:421] >> Bind RDMASocket socket: 0x7f7088003190; addr: 192.168.2.7:0
(4) Apr22 16:10:35 XNodeSync [InternodeSyncer.cpp:412] >> Starting state update.
(4) Apr22 16:10:35 XNodeSync [InternodeSyncer.cpp:437] >> Beginning target state update...
(4) Apr22 16:10:35 XNodeSync [InternodeSyncer.cpp:792] >> Downloading target states and buddy groups
(4) Apr22 16:11:05 XNodeSync [InternodeSyncer.cpp:412] >> Starting state update.
(4) Apr22 16:11:05 XNodeSync [InternodeSyncer.cpp:437] >> Beginning target state update...
(4) Apr22 16:11:05 XNodeSync [InternodeSyncer.cpp:792] >> Downloading target states and buddy groups
(4) Apr22 16:11:17 XNodeSync [IBVSocket.cpp:421] >> Bind RDMASocket socket: 0x7f7088003190; addr: 192.168.2.7:0
(4) Apr22 16:11:35 XNodeSync [InternodeSyncer.cpp:331] >> Downloading capacity pools. Pool type: Meta
(4) Apr22 16:11:35 XNodeSync [InternodeSyncer.cpp:331] >> Downloading capacity pools. Pool type: Meta buddies
(4) Apr22 16:11:35 XNodeSync [InternodeSyncer.cpp:412] >> Starting state update.
(4) Apr22 16:11:35 XNodeSync [InternodeSyncer.cpp:437] >> Beginning target state update...
(4) Apr22 16:11:35 XNodeSync [InternodeSyncer.cpp:792] >> Downloading target states and buddy groups
(4) Apr22 16:11:51 ConnAccept [ConnAccept] >> Accepted new connection from 192.168.2.10:37029 [SockFD: 18]
(4) Apr22 16:11:51 Worker17 [LocalNodeConn (acquire stream)] >> Establishing new stream connection to: internal
(4) Apr22 16:11:51 Worker17 [LocalNodeConn (acquire stream)] >> Connected: internal
(4) Apr22 16:11:51 LocalConnWorker1 [LocalConnWorker1] >> Ready (TID: 6069)
(0) Apr22 16:11:51 Worker19 [PThread.cpp:99] >> Received a SIGSEGV. Trying to shut down...
(1) Apr22 16:11:51 Worker19 [PThread::signalHandler] >> Backtrace:
1: /opt/beegfs/sbin/beegfs-meta(_ZN7PThread13signalHandlerEi+0x65) [0x560a04f926c5]
2: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f70ac1ee520]
3: /opt/beegfs/sbin/beegfs-meta(_ZN17LookupIntentMsgEx6createEP9EntryInfoRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES1_P18FileInodeStoreDatab+0x445) [0x560a04eace35]
4: /opt/beegfs/sbin/beegfs-meta(_ZN17LookupIntentMsgEx14executeLocallyERN10NetMessage15ResponseContextEb+0x60b) [0x560a04eadbcb]
5: /opt/beegfs/sbin/beegfs-meta(_ZN15MirroredMessageI15LookupIntentMsgSt5tupleIJ10FileIDLock14ParentNameLockS2_EEE15processIncomingERN10NetMessage15ResponseContextE+0x119) [0x560a04eb1cc9]
6: /opt/beegfs/sbin/beegfs-meta(_ZN17LookupIntentMsgEx15processIncomingERN10NetMessage15ResponseContextE+0x89) [0x560a04eae619]
7: /opt/beegfs/sbin/beegfs-meta(_ZN27IncomingPreprocessedMsgWork7processEPcjS0_j+0x198) [0x560a04fa3568]
8: /opt/beegfs/sbin/beegfs-meta(_ZN6Worker8workLoopE13QueueWorkType+0x176) [0x560a04f9cfd6]
9: /opt/beegfs/sbin/beegfs-meta(_ZN6Worker3runEv+0x6c) [0x560a04f9d72c]
10: /opt/beegfs/sbin/beegfs-meta(_ZN7PThread9runStaticEPv+0x13e) [0x560a04d28e2e]
11: /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f70ac240ac3]
12: /lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f70ac2d2850]
(4) Apr22 16:11:51 Worker1 [Worker1] >> Component stopped.
(4) Apr22 16:11:51 Worker6 [Worker6] >> Component stopped.
(4) Apr22 16:11:51 Worker7 [Worker7] >> Component stopped.
(4) Apr22 16:11:51 Worker4 [Worker4] >> Component stopped.
(4) Apr22 16:11:51 Worker2 [Worker2] >> Component stopped.
(4) Apr22 16:11:51 Worker5 [Worker5] >> Component stopped.
(4) Apr22 16:11:51 Worker11 [Worker11] >> Component stopped.
(4) Apr22 16:11:51 Worker18 [Worker18] >> Component stopped.
(4) Apr22 16:11:51 Worker10 [Worker10] >> Component stopped.
(4) Apr22 16:11:51 Worker8 [Worker8] >> Component stopped.
(4) Apr22 16:11:51 Worker15 [Worker15] >> Component stopped.
(4) Apr22 16:11:51 Worker16 [Worker16] >> Component stopped.
(4) Apr22 16:11:51 XNodeSync [XNodeSync] >> Component stopped.
(4) Apr22 16:11:51 Worker3 [Worker3] >> Component stopped.
(4) Apr22 16:11:51 Worker12 [Worker12] >> Component stopped.
(4) Apr22 16:11:51 Worker9 [Worker9] >> Component stopped.
(4) Apr22 16:11:51 Worker24 [Worker24] >> Component stopped.
(4) Apr22 16:11:51 Worker23 [Worker23] >> Component stopped.
(4) Apr22 16:11:51 Worker21 [Worker21] >> Component stopped.
(4) Apr22 16:11:51 Worker22 [Worker22] >> Component stopped.
(4) Apr22 16:11:51 Worker14 [Worker14] >> Component stopped.
(4) Apr22 16:11:51 Worker13 [Worker13] >> Component stopped.
(4) Apr22 16:11:51 DirectWorker1 [DirectWorker1] >> Component stopped.
(4) Apr22 16:11:51 Worker17 [Worker17] >> Component stopped.
(4) Apr22 16:11:51 Stats [Stats] >> Component stopped.
(4) Apr22 16:11:51 Worker20 [Worker20] >> Component stopped.
(4) Apr22 16:11:51 DGramLis [DGramLis] >> Component stopped.
(0) Apr22 16:11:51 Worker19 [App (component exception handler)] >> The component [Worker19] encountered an unrecoverable error. [SysErr: Success] Exception message: Segmentation fault
(2) Apr22 16:11:51 Worker19 [App (component exception handler)] >> Shutting down...
(4) Apr22 16:11:52 ModificationEventFlusher [ModificationEventFlusher] >> Component stopped.
(4) Apr22 16:11:54 ConnAccept [ConnAccept] >> Component stopped.
(4) Apr22 16:11:54 StreamLis1 [StreamLisV2] >> Component stopped.
(4) Apr22 16:11:54 CommSlave4 [CommSlave4] >> Component stopped.
(4) Apr22 16:11:54 CommSlave6 [CommSlave6] >> Component stopped.
(4) Apr22 16:11:54 CommSlave2 [CommSlave2] >> Component stopped.
(4) Apr22 16:11:54 CommSlave1 [CommSlave1] >> Component stopped.
(4) Apr22 16:11:54 CommSlave5 [CommSlave5] >> Component stopped.
(4) Apr22 16:11:54 CommSlave17 [CommSlave17] >> Component stopped.
(4) Apr22 16:11:54 CommSlave10 [CommSlave10] >> Component stopped.
(4) Apr22 16:11:54 CommSlave8 [CommSlave8] >> Component stopped.
(4) Apr22 16:11:54 CommSlave11 [CommSlave11] >> Component stopped.
(4) Apr22 16:11:54 CommSlave15 [CommSlave15] >> Component stopped.
(4) Apr22 16:11:54 CommSlave13 [CommSlave13] >> Component stopped.
(4) Apr22 16:11:54 CommSlave26 [CommSlave26] >> Component stopped.
(4) Apr22 16:11:54 CommSlave25 [CommSlave25] >> Component stopped.
(4) Apr22 16:11:54 CommSlave28 [CommSlave28] >> Component stopped.
(4) Apr22 16:11:54 CommSlave29 [CommSlave29] >> Component stopped.
(4) Apr22 16:11:54 CommSlave20 [CommSlave20] >> Component stopped.
(4) Apr22 16:11:54 CommSlave21 [CommSlave21] >> Component stopped.
(4) Apr22 16:11:54 CommSlave3 [CommSlave3] >> Component stopped.
(4) Apr22 16:11:54 CommSlave19 [CommSlave19] >> Component stopped.
(4) Apr22 16:11:54 CommSlave37 [CommSlave37] >> Component stopped.
(4) Apr22 16:11:54 CommSlave38 [CommSlave38] >> Component stopped.
(4) Apr22 16:11:54 CommSlave44 [CommSlave44] >> Component stopped.
(4) Apr22 16:11:54 CommSlave39 [CommSlave39] >> Component stopped.
(4) Apr22 16:11:54 CommSlave23 [CommSlave23] >> Component stopped.
(4) Apr22 16:11:54 CommSlave48 [CommSlave48] >> Component stopped.
(4) Apr22 16:11:54 CommSlave30 [CommSlave30] >> Component stopped.
(4) Apr22 16:11:54 CommSlave12 [CommSlave12] >> Component stopped.
(4) Apr22 16:11:54 CommSlave27 [CommSlave27] >> Component stopped.
(4) Apr22 16:11:54 CommSlave33 [CommSlave33] >> Component stopped.
(4) Apr22 16:11:54 CommSlave32 [CommSlave32] >> Component stopped.
(4) Apr22 16:11:54 CommSlave16 [CommSlave16] >> Component stopped.
(4) Apr22 16:11:54 CommSlave31 [CommSlave31] >> Component stopped.
(4) Apr22 16:11:54 CommSlave34 [CommSlave34] >> Component stopped.
(4) Apr22 16:11:54 CommSlave36 [CommSlave36] >> Component stopped.
(4) Apr22 16:11:54 CommSlave9 [CommSlave9] >> Component stopped.
(4) Apr22 16:11:54 CommSlave35 [CommSlave35] >> Component stopped.
(4) Apr22 16:11:54 CommSlave7 [CommSlave7] >> Component stopped.
(4) Apr22 16:11:54 CommSlave43 [CommSlave43] >> Component stopped.
(4) Apr22 16:11:54 CommSlave45 [CommSlave45] >> Component stopped.
(4) Apr22 16:11:54 CommSlave42 [CommSlave42] >> Component stopped.
(4) Apr22 16:11:54 CommSlave14 [CommSlave14] >> Component stopped.
(4) Apr22 16:11:54 CommSlave46 [CommSlave46] >> Component stopped.
(4) Apr22 16:11:54 CommSlave40 [CommSlave40] >> Component stopped.
(4) Apr22 16:11:54 CommSlave47 [CommSlave47] >> Component stopped.
(4) Apr22 16:11:54 CommSlave41 [CommSlave41] >> Component stopped.
(4) Apr22 16:11:54 CommSlave22 [CommSlave22] >> Component stopped.
(4) Apr22 16:11:54 CommSlave24 [CommSlave24] >> Component stopped.
(4) Apr22 16:11:54 CommSlave18 [CommSlave18] >> Component stopped.
(4) Apr22 16:11:54 Main [SessionStore (save)] >> save sessions to file: /mnt/beegfs-meta/sessions
(4) Apr22 16:11:54 Main [SessionStore (save)] >> save sessions to file: /mnt/beegfs-meta/mirroredSessions
(3) Apr22 16:11:54 Main [App] >> Stored 0 sessions and 0 mirrored sessions
(1) Apr22 16:11:54 Main [App] >> All components stopped. Exiting now!
(4) Apr22 16:11:54 Main [NodeConn (destruct)] >> Closing 1 connections...
(4) Apr22 16:11:54 Main [NodeConn (invalidate stream)] >> Disconnected: beegfs...@192.168.2.8:18463
(4) Apr22 16:11:54 Main [LocalNodeConnPool::~LocalNodeConnPool] >> Closing 1 connections...
(3) Apr22 16:11:54 LocalConnWorker1 [LocalConnWorker::processIncomingData] >> Soft disconnect from Localhost:PeerFD#28
(4) Apr22 16:11:54 LocalConnWorker1 [LocalConnWorker1] >> Component stopped.
(3) Apr22 16:11:54 Main [LocalNodeConn (invalidate stream: Localhost:PeerFD#29)] >> Disconnected: Localhost:PeerFD#29
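
In case it helps with reading the trace above, here is a minimal Python sketch that pulls the crashing thread and the backtrace frames out of a beegfs-meta log in the format shown. The function names (parse_backtrace, crashing_thread, first_app_frame) are hypothetical helpers written for this post, not part of BeeGFS, and the regular expressions are only fitted to the lines pasted here. Applied to this log, the first frame inside beegfs-meta after the signal handler is frame 3, whose mangled name starts with _ZN17LookupIntentMsgEx6create, i.e. LookupIntentMsgEx::create.

```python
import re

# Fitted to frames as they appear in the log above, e.g.
#   2: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f70ac1ee520]
# The symbol part may be empty when no symbol name is available.
FRAME_RE = re.compile(
    r"^(?P<idx>\d+): (?P<binary>\S+?)\((?P<symbol>[^+)]*)"
    r"\+(?P<offset>0x[0-9a-f]+)\) \[(?P<addr>0x[0-9a-f]+)\]$"
)

# Fitted to the level-0 line that reports the signal.
SIGSEGV_RE = re.compile(
    r"^\(0\) \S+ \S+ (\S+) \[.*?\] >> Received a SIGSEGV", re.MULTILINE
)


def crashing_thread(log_text):
    """Return the name of the thread that logged the SIGSEGV, or None."""
    match = SIGSEGV_RE.search(log_text)
    return match.group(1) if match else None


def parse_backtrace(log_text):
    """Return the backtrace frames found in log_text as a list of dicts."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append({
                "index": int(match.group("idx")),
                "binary": match.group("binary"),
                "symbol": match.group("symbol") or None,
                "offset": int(match.group("offset"), 16),
            })
    return frames


def first_app_frame(frames, binary_suffix="beegfs-meta"):
    """Return the first frame in the given binary that is not the signal handler."""
    for frame in frames:
        symbol = frame["symbol"] or ""
        if frame["binary"].endswith(binary_suffix) and "signalHandler" not in symbol:
            return frame
    return None

# Usage (the log path is an assumption, adjust it to your setup):
#   text = open("/var/log/beegfs-meta.log").read()
#   print(crashing_thread(text), first_app_frame(parse_backtrace(text)))
```

The mangled symbol returned by first_app_frame can then be passed to a demangler such as c++filt to get the full C++ signature.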



Juan A. Cordero Varela

Apr 23, 2024, 9:15:52 AM
to beegfs-user
As soon as I comment out quotaEnableEnforcement in the metadata, storage and management conf files and quotaEnabled in the client conf files, everything works again after restarting the services.

My OS is Ubuntu Server 22.04 and I'm using BeeGFS 7.4.3.
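
To double-check which of the quota settings mentioned above are actually active (not commented out) on a given node, something like the following sketch can be used. It assumes the conf files use plain "key = value" lines with "#" comments; the helper name active_settings is made up for this post, and the path in the usage comment is an assumption about the install location.

```python
def active_settings(conf_text, keys):
    """Return {key: value} for the given keys that are set on uncommented lines."""
    found = {}
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and lines that are commented out.
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and name.strip() in keys:
            found[name.strip()] = value.strip()
    return found

# Usage (path is an assumption, adjust it to your setup):
#   text = open("/etc/beegfs/beegfs-meta.conf").read()
#   print(active_settings(text, {"quotaEnableEnforcement", "quotaEnabled"}))
```

An empty result means neither setting is active in that file, which is the state in which everything works for me.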