Unable to mount clients - unable to root cause issue from logs

Vinayak Kamath

Dec 15, 2021, 17:58:27
to beegfs-user
Problem: I'm unable to add a client to my BeeGFS cluster. 

Background: I had a working setup that I used for several months. Recently I did a fresh install of the same version (7.2.4), and I've had this issue ever since.

Details:
beegfs-check-servers shows all server nodes as reachable when run from any node.

[vinayak@7558zc3 ~]$ beegfs-check-servers
Management
==========
7558zc3.maas [ID: 1]: reachable at 10.89.69.4:8008 (protocol: TCP)

Metadata
==========
7558zc3.maas [ID: 1]: reachable at 10.89.69.4:8005 (protocol: TCP)
7557zc3.maas [ID: 15]: reachable at 10.89.69.5:8005 (protocol: TCP)

Storage
==========
7558zc3.maas [ID: 10]: reachable at 10.89.69.4:8003 (protocol: TCP)
7557zc3.maas [ID: 20]: reachable at 10.89.69.5:8003 (protocol: TCP)

When I try to start a client, I get the following error:

[vinayak@7563zc3 ~]$ sudo systemctl restart beegfs-client
Job for beegfs-client.service failed because the control process exited with error code. See "systemctl status beegfs-client.service" and "journalctl -xe" for details.

[vinayak@7563zc3 ~]$ journalctl -xe
Dec 15 14:46:09 7563zc3.maas beegfs-client[193241]: Starting BeeGFS Client:
Dec 15 14:46:09 7563zc3.maas beegfs-client[193241]: - Loading BeeGFS modules
Dec 15 14:46:09 7563zc3.maas beegfs-client[193241]: - Mounting directories from /etc/beegfs/beegfs-mounts.conf
Dec 15 14:46:09 7563zc3.maas kernel: beegfs: mount(193264): __IBVSocket_createCommContext:538: Alloc CommContext @ ffff8bc24217d400
Dec 15 14:46:09 7563zc3.maas kernel: beegfs: enabling unsafe global rkey
Dec 15 14:46:09 7563zc3.maas kernel: beegfs: mount(193264): __IBVSocket_cleanupCommContext:788: Free CommContext @ ffff8bc24217d400
Dec 15 14:46:09 7563zc3.maas kernel: beegfs: mount(193264): Mount sanity check failed. Canceling mount. (Log file may provide additional information. Check can be disabled with sysMountSanityCheckMS=0 in the config file.)
Dec 15 14:46:11 7563zc3.maas beegfs-client[193241]: mount: mount beegfs_nodev on /mnt/beegfs failed: Operation canceled
Dec 15 14:46:11 7563zc3.maas systemd[1]: beegfs-client.service: main process exited, code=exited, status=32/n/a
Dec 15 14:46:11 7563zc3.maas systemd[1]: Failed to start Start BeeGFS Client.

I checked the client log for more information:

(1) Dec15 14:46:09 *mount(193264) [App_logInfos] >> BeeGFS Client Version: 7.2.4
(2) Dec15 14:46:09 *mount(193264) [App_logInfos] >> ClientID: 2F2F0-61BA7031-7563zc3.maas
(2) Dec15 14:46:09 *mount(193264) [App_logInfos] >> Usable NICs: eth_bonded(RDMA) eth_bonded(TCP)
(4) Dec15 14:46:09 *mount(193264) [App_logInfos] >> Extended list of usable NICs:
+ eth_bonded[ip addr: 10.89.68.136; type: RDMA]
+ eth_bonded[ip addr: 10.89.68.136; type: TCP]
(4) Dec15 14:46:09 *mount(193264) [App (start components)] >> Starting up components...
(4) Dec15 14:46:09 *beegfs_XNodeSyn(193266) [InternodeSyncer (run)] >> Searching for nodes...
(4) Dec15 14:46:09 *beegfs_XNodeSyn(193266) [NodeConn (acquire stream)] >> Establishing new TCP connection to: beegfs-...@127.0.0.1:8006
(4) Dec15 14:46:09 *mount(193264) [App (start components)] >> Components running.
(3) Dec15 14:46:09 *beegfs_XNodeSyn(193266) [NodeConn (acquire stream)] >> Connected: beegfs-...@127.0.0.1:8006 (protocol: TCP)
(2) Dec15 14:46:09 *beegfs_DGramLis(193265) [Heartbeat incoming] >> New node: beegfs-mgmtd 7558zc3.maas [ID: 1];
(3) Dec15 14:46:09 *beegfs_XNodeSyn(193266) [Init] >> Management node found. Downloading node groups...
(4) Dec15 14:46:09 *beegfs_XNodeSyn(193266) [NodeConn (acquire stream)] >> Establishing new TCP connection to: beegfs...@10.89.69.4:8008
(3) Dec15 14:46:09 *beegfs_XNodeSyn(193266) [NodeConn (acquire stream)] >> Connected: beegfs...@10.89.69.4:8008 (protocol: TCP)
(2) Dec15 14:46:09 *beegfs_XNodeSyn(193266) [Sync] >> Nodes added (sync results): 2 (Type: beegfs-meta)
(2) Dec15 14:46:09 *beegfs_XNodeSyn(193266) [Sync] >> Nodes added (sync results): 2 (Type: beegfs-storage)
(4) Dec15 14:46:09 *beegfs_XNodeSyn(193266) [Update states and mirror groups] >> Storage target states synced.
(4) Dec15 14:46:09 *beegfs_XNodeSyn(193266) [Update states and mirror groups] >> Metadata node states synced.
(3) Dec15 14:46:09 *beegfs_XNodeSyn(193266) [Init] >> Node registration...
(2) Dec15 14:46:09 *beegfs_XNodeSyn(193266) [Registration] >> Node registration successful.
(4) Dec15 14:46:09 *beegfs_XNodeSyn(193266) [Update states and mirror groups] >> Storage target states synced.
(4) Dec15 14:46:09 *beegfs_XNodeSyn(193266) [Update states and mirror groups] >> Metadata node states synced.
(3) Dec15 14:46:09 *beegfs_XNodeSyn(193266) [Init] >> Init complete.
(4) Dec15 14:46:09 *mount(193264) [NodeConn (acquire stream)] >> Establishing new RDMA connection to: beegf...@10.89.69.4:8005
(3) Dec15 14:46:09 *mount(193264) [NodeConn (acquire stream)] >> Connect failed: beegf...@10.89.69.4:8005 (protocol: RDMA)
(4) Dec15 14:46:09 *mount(193264) [NodeConn (acquire stream)] >> Establishing new TCP connection to: beegf...@10.89.69.4:8005
(3) Dec15 14:46:09 *mount(193264) [NodeConn (acquire stream)] >> Connect failed: beegf...@10.89.69.4:8005 (protocol: TCP)
(1) Dec15 14:46:09 *mount(193264) [NodeConn (acquire stream)] >> Connect failed on all available routes: beegfs-meta 7558zc3.maas [ID: 1]
(2) Dec15 14:46:09 *mount(193264) [Messaging (RPC)] >> Unable to connect to: beegfs-meta 7558zc3.maas [ID: 1]
(4) Dec15 14:46:09 *mount(193264) [Messaging (RPC)] >> Message type: 2015
(0) Dec15 14:46:09 *mount(193264) [Mount sanity check] >> Retrieval of root directory entry failed. Are all metadata servers running and registered at the management daemon? (Error: Communication error)

(2) Dec15 14:46:09 *mount(193264) [App (stop components)] >> Stopping components...
(4) Dec15 14:46:09 *mount(193264) [App (join components)] >> Waiting for components to self-terminate...
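The failed routes in the client log above can be pulled out with a small shell filter. This is only a sketch: `failed_routes` is a hypothetical helper name, not a BeeGFS tool, and it assumes the log lines keep the "Connect failed: <target> (protocol: <PROTO>)" shape shown above.

```shell
# Hypothetical helper (not part of BeeGFS): list failed connection attempts.
# Reads client log text on stdin and prints "<protocol> <target>" for every
# "Connect failed: <target> (protocol: <PROTO>)" line. The summary line
# "Connect failed on all available routes: ..." is intentionally not matched.
failed_routes() {
  sed -n 's/.*>> Connect failed: \([^ ]*\) (protocol: \([A-Z]*\)).*/\2 \1/p'
}
```

It could be run as `failed_routes < /var/log/beegfs-client.log`, where the log path is an assumption about the local configuration. For the log above it would show that both the RDMA and the TCP attempt to port 8005 on 10.89.69.4 failed.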

So the client is unable to establish a connection with the metadata server on 7558zc3.maas, over either RDMA or TCP. I also noticed that, soon after the failed mount, that metadata node becomes unreachable:

[vinayak@7563zc3 ~]$ beegfs-check-servers
Management
==========
7558zc3.maas [ID: 1]: reachable at 10.89.69.4:8008 (protocol: TCP)

Metadata
==========
7558zc3.maas [ID: 1]: UNREACHABLE
7557zc3.maas [ID: 15]: reachable at 10.89.69.5:8005 (protocol: TCP)

Storage
==========
7558zc3.maas [ID: 10]: reachable at 10.89.69.4:8003 (protocol: TCP)
7557zc3.maas [ID: 20]: reachable at 10.89.69.5:8003 (protocol: TCP)
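To watch for this state change while retrying the mount, the output of beegfs-check-servers can be filtered down to the unreachable nodes. A minimal sketch, assuming the output format shown above; `unreachable_nodes` is a hypothetical helper name:

```shell
# Hypothetical helper (not part of BeeGFS): report unreachable nodes.
# Reads beegfs-check-servers output on stdin and prints "<section> <node>"
# for every line marked UNREACHABLE.
unreachable_nodes() {
  awk '
    # A section header is a single word: Management, Metadata or Storage.
    NF == 1 && $1 ~ /^(Management|Metadata|Storage)$/ { section = $1; next }
    # Node lines start with the node name; report only the unreachable ones.
    /UNREACHABLE/ { print section, $1 }
  '
}
```

Usage would be `beegfs-check-servers | unreachable_nodes`, which prints nothing when every node is reachable.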

I don't see anything helpful in the metadata, storage or management logs.

/var/log/mgmtd.log
(2) Dec15 14:46:09 Worker3 [Node registration] >> New node: beegfs-client 2F2F0-61BA7031-7563zc3.maas [ID: 1]; RDMA; Source: 10.89.68.136:45598
(4) Dec15 14:46:09 Worker3 [Node registration] >> Number of nodes: Meta: 2; Storage: 2; Client: 1; Mgmt: 1
(4) Dec15 14:46:09 DirectWorker1 [Work (process incoming data)] >> Soft disconnect from 10.89.69.4:38778
(2) Dec15 14:46:09 Worker1 [RemoveNodeMsgEx.cpp:66] >> Node removed. node: beegfs-client 2F2F0-61BA7031-7563zc3.maas [ID: 1]
(4) Dec15 14:46:10 XNodeSync [Auto-offline] >> Checking for offline nodes. NodeType: Storage target
(4) Dec15 14:46:10 XNodeSync [Auto-offline] >> Checking for offline nodes. NodeType: Metadata node
(4) Dec15 14:46:10 XNodeSync [Resolve primary resync] >> Checking for primary that needs resync. nodeType: beegfs-storage
(4) Dec15 14:46:10 XNodeSync [Resolve primary resync] >> Checking for primary that needs resync. nodeType: beegfs-meta
(4) Dec15 14:46:10 XNodeSync [Update Node CapPools] >> Starting node capacity pools update.
(4) Dec15 14:46:10 XNodeSync [Update Target CapPools] >> Starting target capacity pools update.
(4) Dec15 14:46:11 Worker3 [Work (process incoming data)] >> Soft disconnect from 10.89.68.136:45598

/var/log/beegfs-meta.log stops logging after this connection attempt; these are its last entries:
(4) Dec15 14:46:05 XNodeSync [InternodeSyncer.cpp:376] >> Starting state update.
(4) Dec15 14:46:05 XNodeSync [InternodeSyncer.cpp:401] >> Beginning target state update...
(4) Dec15 14:46:05 XNodeSync [InternodeSyncer.cpp:756] >> Downloading target states and buddy groups

And there are no new messages in the storage log.

I'd appreciate any suggestions on how I can root cause this issue. Thank you!

Conan Huang

May 31, 2022, 8:22:12
to beegfs-user
Were you able to figure out what the issue was? 