All targets show offline after BeeGFS upgrade (on a test server) from 7.4.6 to 8.2


Imam Toufique

Nov 6, 2025, 12:16:01 PM
to beegfs-user
Hello everyone, 
I am seeing the following after upgrading from 7.4.6 to 8.2.1 on Rocky Linux 8.10:

[root@client-1 beegfs]# beegfs health check --mgmtd-addr 172.16.16.18:8010 --tls-disable  --force-connections=false
###################################################
Running Health Check for beegfs://172.16.16.18:8010
###################################################
###################################
>>>>> Checking for Busy Nodes <<<<<
###################################

✅ Busy Metadata Nodes -> Number of queued requests does not exceed the degraded (16) or critical (512) thresholds.
✅ Busy Storage Nodes -> Number of queued requests does not exceed the degraded (16) or critical (512) thresholds.

############################
>>>>> Checking Targets <<<<<
############################

🛑 Reachability -> Not all targets are responding.
✅ Consistency -> All mirrors are synchronized.
✅ Available Capacity -> All targets have sufficient free space based on the thresholds defined by the management service's configuration.

----------------
Metadata Targets
----------------
ID      TYPE  ALIAS             NODE    STORAGE_POOL  REACHABILITY  LAST_CONTACT     CONSISTENCY  SYNC_STATE  CAP_POOL  SPACE  SPACE_USED  SPACE_FREE  INODES  INODES_USED  INODES_FREE  
m:9111  meta  target_meta_9111  m:9111  (n/a)         Offline       1762444522s ago  Good         Healthy               -      -           -           -       -            -            
m:9211  meta  target_meta_9211  m:9211  (n/a)         Offline       1762444522s ago  Good         Healthy               -      -           -           -       -            -            

---------------
Storage Targets
---------------
ID     TYPE     ALIAS               NODE  STORAGE_POOL  REACHABILITY  LAST_CONTACT     CONSISTENCY  SYNC_STATE  CAP_POOL  SPACE  SPACE_USED  SPACE_FREE  INODES  INODES_USED  INODES_FREE  
s:911  storage  target_storage_911  s:91  s:1           Offline       1762444522s ago  Good         Healthy               -      -           -           -       -            -            
s:912  storage  target_storage_912  s:91  s:1           Offline       1762444522s ago  Good         Healthy               -      -           -           -       -            -            
s:913  storage  target_storage_913  s:91  s:1           Offline       1762444522s ago  Good         Healthy               -      -           -           -       -            -            
s:921  storage  target_storage_921  s:92  s:1           Offline       1762444522s ago  Good         Healthy               -      -           -           -       -            -            
s:922  storage  target_storage_922  s:92  s:1           Offline       1762444522s ago  Good         Healthy               -      -           -           -       -            -            
s:923  storage  target_storage_923  s:92  s:1           Offline       1762444522s ago  Good         Healthy               -      -           -           -       -            -            

HINT: This mode does not check file system consistency. To check for file system inconsistencies,
      you can run 'beegfs-fsck --checkfs --readOnly' and consult with ThinkParQ support.

################################################
>>>>> Checking Connections to Server Nodes <<<<<
################################################
No client mounts found, skipping connection checks


Error: one or more checks failed
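
One detail in the target tables above may be worth a second look: every target reports the same LAST_CONTACT of 1762444522s. Read as a Unix timestamp, that number is 2025-11-06 15:55:22 UTC, the day of this post. That would be consistent with the management service holding a zero "last contact" time for these targets, meaning it has not heard from them at all, but it does not prove it. The sketch below only does the arithmetic; the interpretation is an assumption, not documented BeeGFS behaviour, and the check time used is a guess.

```python
from datetime import datetime, timezone

# Value copied from the LAST_CONTACT column above (seconds "ago").
LAST_CONTACT_SECONDS = 1762444522


def as_utc_timestamp(seconds: int) -> datetime:
    """Interpret a seconds count as a Unix timestamp in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def looks_like_epoch_age(age_seconds: int, now: datetime,
                         slack_seconds: int = 86400) -> bool:
    """True if 'now - age' lands within slack of 1970-01-01,
    i.e. the stored last-contact time would be roughly zero."""
    implied_contact = now.timestamp() - age_seconds
    return abs(implied_contact) <= slack_seconds


if __name__ == "__main__":
    print(as_utc_timestamp(LAST_CONTACT_SECONDS).isoformat())
    # Assumed check time: some point on the day the thread was posted.
    check_time = datetime(2025, 11, 6, 16, 0, tzinfo=timezone.utc)
    print(looks_like_epoch_age(LAST_CONTACT_SECONDS, check_time))
```

If the second value is True for the real time of the check, the "Offline" state more likely means "never seen since the upgrade" than "seen and then lost".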
 

I followed this guide, https://doc.beegfs.io/8.0/advanced_topics/upgrade.html, to the best of my ability and did not skip anything :-)

Here is the beegfs-mgmtd startup log:

Nov  6 07:49:47 roce-beegfs-00 systemd[1]: Stopping BeeGFS Management Server...
Nov  6 07:49:47 roce-beegfs-00 systemd[1]: beegfs-mgmtd.service: Succeeded.
Nov  6 07:49:47 roce-beegfs-00 systemd[1]: Stopped BeeGFS Management Server.
Nov  6 07:49:47 roce-beegfs-00 systemd[1]: Starting BeeGFS Management Server...
Nov  6 07:49:47 roce-beegfs-00 beegfs-mgmtd[1216153]: Loaded config file from "/etc/beegfs/beegfs-mgmtd.toml"
Nov  6 07:49:47 roce-beegfs-00 beegfs-mgmtd[1216153]: Successfully initialized certificate verification library.
Nov  6 07:49:47 roce-beegfs-00 beegfs-mgmtd[1216153]: Successfully loaded license certificate: TMP-1827008610
Nov  6 07:49:47 roce-beegfs-00 beegfs-mgmtd[1216153]: Opened database at "/var/lib/beegfs/mgmtd.sqlite"
Nov  6 07:49:47 roce-beegfs-00 beegfs-mgmtd[1216153]: Listening for BeeGFS connections on [::]:8008
Nov  6 07:49:47 roce-beegfs-00 beegfs-mgmtd[1216153]: Receiving BeeGFS datagrams on [::]:8008
Nov  6 07:49:47 roce-beegfs-00 beegfs-mgmtd[1216153]: gRPC server running with TLS disabled
Nov  6 07:49:47 roce-beegfs-00 beegfs-mgmtd[1216153]: Serving gRPC requests on [::]:8010
Nov  6 07:49:47 roce-beegfs-00 beegfs-mgmtd[1216153]: Waiting for shutdown signal ...
Nov  6 07:49:47 roce-beegfs-00 systemd[1]: Started BeeGFS Management Server.


I wanted to learn about BeeGFS 8.x, so I decided to try this out. If anyone in the forum can give me some hints on what to look for, or what I might be doing wrong here, that would be very much appreciated.

Here is the beegfs-mgmtd.toml file (which is not much):
log-target = "stderr"
log-level = "debug"
beemsg-port = 8008
grpc-port = 8010
tls-disable = true
connection-limit = 36
auth-disable = true
registration-disable = false

Initially I had tls-disable = false and auth-disable = false; since that did not work, I disabled both TLS and authentication.

Also, I could not run:
beegfs health check --mgmtd-addr 172.16.16.18:8010 --tls-disable  --force-connections=false
without --force-connections=false, which tells me something is still not right at the mgmtd level, and I can't seem to figure it out.

Help, please!
And thank you in advance.

Joe McCormick

Nov 6, 2025, 2:17:05 PM
to beegfs-user
Hello,

It's hard to say exactly what is happening without client logs (i.e., `journalctl -k` or `dmesg`), but 8.2.2 was released today and fixes a potential network issue when RDMA is in use: https://github.com/ThinkParQ/beegfs/releases/tag/8.2.2

As a first step I would suggest upgrading to 8.2.2 and if the issue persists, provide client logs, output from `cat /proc/fs/beegfs/*/client_info` (provided you are able to mount the file system at all) and output from `beegfs node list --with-nics`.

Thank you,

~Joe

Imam Toufique

Nov 6, 2025, 3:31:11 PM
to beegfs-user
Hi Joe, 

Upgraded to 8.2.2, thanks for the tip!

But the problem is still there. Here is part of the client log:

Nov  6 12:26:19 roce-cn-00 kernel: beegfs: rmmod(203326): BeeGFS client unloaded.
Nov  6 12:26:19 roce-cn-00 kernel: beegfs: modprobe(203332): File system registered. Type: beegfs. Version: 8.2.2
Nov  6 12:26:19 roce-cn-00 kernel: beegfs: mount(203379): Built without NVFS RDMA support.
Nov  6 12:26:19 roce-cn-00 kernel: beegfs: mount(203379): DatagramListener (init sock): Listening for UDP datagrams: Port 8004
Nov  6 12:26:19 roce-cn-00 kernel: beegfs: mount(203379): App_logInfos: BeeGFS Client Version: 8.2.2
Nov  6 12:26:19 roce-cn-00 kernel: beegfs: mount(203379): App_logInfos: ClientID: c31A73-690D046B-roce-cn-00
Nov  6 12:26:19 roce-cn-00 kernel: beegfs: mount(203379): App_logInfos: Usable NICs: #012+ enp175s0[ip addr: 172.16.16.9; type: RDMA]#012+ enp175s0[ip addr: 172.16.16.9; type: TCP]#012+ eno1[ip addr: 172.16.10.9; type: TCP]
Nov  6 12:26:19 roce-cn-00 kernel: beegfs: beegfs_XNodeSyn(203381): Init: Waiting for beegfs...@172.16.16.18:8008...
Nov  6 12:26:19 roce-cn-00 kernel: beegfs: beegfs_DGramLis(203380): Heartbeat incoming: New node: beegfs-mgmtd management [ID: 1];
Nov  6 12:26:19 roce-cn-00 kernel: beegfs: beegfs_XNodeSyn(203381): Init: Management node found. Downloading node groups...
Nov  6 12:26:19 roce-cn-00 kernel: beegfs: beegfs_XNodeSyn(203381): NodeConn (acquire stream): Connected: beegfs...@172.16.16.18:8008 (protocol: TCP)
Nov  6 12:26:19 roce-cn-00 kernel: beegfs: beegfs_XNodeSyn(203381): Sync: Nodes added (sync results): 2 (Type: beegfs-meta)
Nov  6 12:26:19 roce-cn-00 kernel: beegfs: beegfs_XNodeSyn(203381): Sync: Nodes added (sync results): 2 (Type: beegfs-storage)
Nov  6 12:26:19 roce-cn-00 kernel: beegfs: beegfs_XNodeSyn(203381): Init: Node registration...
Nov  6 12:26:19 roce-cn-00 kernel: beegfs: beegfs_XNodeSyn(203381): Registration: Node registration successful.
Nov  6 12:26:19 roce-cn-00 kernel: beegfs: beegfs_XNodeSyn(203381): Init: Init complete.
Nov  6 12:26:19 roce-cn-00 kernel: beegfs: mount(203379): Mount sanity check: Retrieval of root directory entry failed. Are all metadata servers running and registered at the management daemon? (Error: Communication error)
Nov  6 12:26:19 roce-cn-00 kernel: beegfs: mount(203379): Mount sanity check failed. Canceling mount. (Log file may provide additional information. Check can be disabled with sysMountSanityCheckMS=0 in the config file.)
Nov  6 12:26:19 roce-cn-00 kernel: beegfs: mount(203379): App (stop components): Stopping components...
Nov  6 12:26:19 roce-cn-00 kernel: beegfs: beegfs_XNodeSyn(203381): Deregistration: Node deregistration successful.
Nov  6 12:26:21 roce-cn-00 kernel: beegfs: mount(203379): App (wait for component termination): Still waiting for this component to stop: beegfs_AckMgr
Nov  6 12:26:22 roce-cn-00 kernel: beegfs: mount(203379): App (wait for component termination): Component stopped: beegfs_AckMgr
Nov  6 12:26:22 roce-cn-00 kernel: beegfs: mount(203379): App (stop): All components stopped.
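
The `#012` sequences in the "Usable NICs" line above are how syslog commonly escapes an embedded newline (octal 012). A small sketch, assuming that escaping convention, that expands such a line and pulls out the interface entries. The function names are made up for illustration and the pattern is written only against the line quoted here.

```python
import re

# "Usable NICs" payload copied from the client log above.
RAW = ("Usable NICs: #012+ enp175s0[ip addr: 172.16.16.9; type: RDMA]"
       "#012+ enp175s0[ip addr: 172.16.16.9; type: TCP]"
       "#012+ eno1[ip addr: 172.16.10.9; type: TCP]")


def unescape_syslog(text: str) -> str:
    """Turn '#ooo' octal escapes (e.g. '#012') back into real characters."""
    return re.sub(r"#([0-7]{3})", lambda m: chr(int(m.group(1), 8)), text)


# One entry looks like: "+ eno1[ip addr: 172.16.10.9; type: TCP]"
NIC_RE = re.compile(r"\+ (\S+?)\[ip addr: ([\d.]+); type: (\w+)\]")


def parse_nics(text: str) -> list[tuple[str, str, str]]:
    """Return (interface, ip, type) tuples in the order they are listed."""
    return NIC_RE.findall(unescape_syslog(text))


if __name__ == "__main__":
    for name, ip, kind in parse_nics(RAW):
        print(f"{name:10} {ip:14} {kind}")
```

Here the RDMA entry for enp175s0 is listed first, which may be relevant to the RDMA note earlier in the thread.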




Here is the mgmtd log, showing it accepting the client connection:

Nov  6 12:25:56 roce-beegfs-00 beegfs-mgmtd[68750]: Running switchover check
Nov  6 12:25:59 roce-beegfs-00 beegfs-mgmtd[68750]: Accepted incoming stream from 172.16.16.9:35894
Nov  6 12:25:59 roce-beegfs-00 beegfs-mgmtd[68750]: Marking stream from 172.16.16.9:35894 as authenticated
Nov  6 12:25:59 roce-beegfs-00 beegfs-mgmtd[68750]: Registered new node c31A21-690D0457-roce-cn-00[client:12, uid:26] (Requested Numeric Id: 0)
Nov  6 12:25:59 roce-beegfs-00 beegfs-mgmtd[68750]: Node deleted: c31A21-690D0457-roce-cn-00[client:12, uid:26]
Nov  6 12:26:02 roce-beegfs-00 beegfs-mgmtd[68750]: Closed stream from 172.16.16.9:35894: early eof
Nov  6 12:26:19 roce-beegfs-00 beegfs-mgmtd[68750]: Accepted incoming stream from 172.16.16.9:48120
Nov  6 12:26:19 roce-beegfs-00 beegfs-mgmtd[68750]: Marking stream from 172.16.16.9:48120 as authenticated
Nov  6 12:26:19 roce-beegfs-00 beegfs-mgmtd[68750]: Registered new node c31A73-690D046B-roce-cn-00[client:13, uid:27] (Requested Numeric Id: 0)
Nov  6 12:26:19 roce-beegfs-00 beegfs-mgmtd[68750]: Node deleted: c31A73-690D046B-roce-cn-00[client:13, uid:27]
Nov  6 12:26:22 roce-beegfs-00 beegfs-mgmtd[68750]: Closed stream from 172.16.16.9:48120: early eof
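
In the mgmtd excerpt above, each client registration is followed by a deletion within the same second, which lines up with the client log's sequence of registration, failed sanity check, and deregistration. A minimal sketch that pairs those events per node. The regular expression is written against the lines quoted here and is not an official log format.

```python
import re
from collections import defaultdict

# Registration/deletion lines copied from the mgmtd log above.
LOG = """\
Nov  6 12:25:59 roce-beegfs-00 beegfs-mgmtd[68750]: Registered new node c31A21-690D0457-roce-cn-00[client:12, uid:26] (Requested Numeric Id: 0)
Nov  6 12:25:59 roce-beegfs-00 beegfs-mgmtd[68750]: Node deleted: c31A21-690D0457-roce-cn-00[client:12, uid:26]
Nov  6 12:26:19 roce-beegfs-00 beegfs-mgmtd[68750]: Registered new node c31A73-690D046B-roce-cn-00[client:13, uid:27] (Requested Numeric Id: 0)
Nov  6 12:26:19 roce-beegfs-00 beegfs-mgmtd[68750]: Node deleted: c31A73-690D046B-roce-cn-00[client:13, uid:27]
"""

# Captures the time of day, the action, and the node ID before "[client:...]".
EVENT_RE = re.compile(
    r"^\w+\s+\d+\s+(\d\d:\d\d:\d\d) \S+ beegfs-mgmtd\[\d+\]: "
    r"(Registered new node|Node deleted:) (\S+?)\["
)


def node_events(log: str) -> dict[str, list[tuple[str, str]]]:
    """Map node ID -> ordered list of (time, 'registered' | 'deleted')."""
    events = defaultdict(list)
    for line in log.splitlines():
        m = EVENT_RE.match(line)
        if m:
            time, action, node = m.groups()
            kind = "registered" if action.startswith("Registered") else "deleted"
            events[node].append((time, kind))
    return dict(events)


if __name__ == "__main__":
    for node, evs in node_events(LOG).items():
        print(node, evs)
```

A node whose list is just a "registered" entry immediately followed by "deleted" never stayed mounted, so the mgmtd side looks like a symptom of the failed mount rather than its cause.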


I am a bit confused here: it looks like the client is looking for the metadata root, but neither metadata service reports an error at startup.
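
Since the client reaches the management service but then reports a communication error when fetching the root directory entry, one cheap thing to rule out is plain TCP reachability of the metadata and storage services from the client. The sketch below assumes the BeeGFS default ports (8005 for beegfs-meta, 8003 for beegfs-storage) and uses placeholder documentation addresses; substitute the addresses and ports that `beegfs node list --with-nics` reports. It only tests TCP, so it says nothing about the RDMA path.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


# Hypothetical targets: replace with the real server addresses.
# Ports are the BeeGFS defaults and may differ in your config files.
CANDIDATES = [
    ("meta", "192.0.2.10", 8005),
    ("storage", "192.0.2.11", 8003),
]

if __name__ == "__main__":
    for kind, host, port in CANDIDATES:
        state = "reachable" if tcp_reachable(host, port) else "NOT reachable"
        print(f"{kind:8} {host}:{port} {state}")
```

If the services are reachable over TCP but the mount still fails, the RDMA path or the addresses the servers registered with the management service become the more likely suspects.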

Thoughts? Ideas?

Thanks again!

Imam Toufique

Nov 6, 2025, 3:33:15 PM
to beegfs-user
And here is the full relevant 'dmesg' output:

[165859.396022] beegfs: beegfs_XNodeSyn(203299): Deregistration: Node deregistration successful.
[165861.442202] beegfs: mount(203297): App (wait for component termination): Still waiting for this component to stop: beegfs_AckMgr
[165861.954219] beegfs: mount(203297): App (wait for component termination): Component stopped: beegfs_AckMgr
[165861.954223] beegfs: mount(203297): App (stop): All components stopped.
[165879.187643] beegfs: rmmod(203326): BeeGFS client unloaded.
[165879.238010] beegfs: modprobe(203332): File system registered. Type: beegfs. Version: 8.2.2
[165879.278311] beegfs: mount(203379): Built without NVFS RDMA support.
[165879.289685] beegfs: mount(203379): DatagramListener (init sock): Listening for UDP datagrams: Port 8004
[165879.289692] beegfs: mount(203379): App_logInfos: BeeGFS Client Version: 8.2.2
[165879.289695] beegfs: mount(203379): App_logInfos: ClientID: c31A73-690D046B-roce-cn-00
[165879.289701] beegfs: mount(203379): App_logInfos: Usable NICs:
                + enp175s0[ip addr: 172.16.16.9; type: RDMA]
                + enp175s0[ip addr: 172.16.16.9; type: TCP]
                + eno1[ip addr: 172.16.10.9; type: TCP]
[165879.289768] beegfs: beegfs_XNodeSyn(203381): Init: Waiting for beegfs...@172.16.16.18:8008...
[165879.290083] beegfs: beegfs_DGramLis(203380): Heartbeat incoming: New node: beegfs-mgmtd management [ID: 1];
[165879.290088] beegfs: beegfs_XNodeSyn(203381): Init: Management node found. Downloading node groups...
[165879.290157] beegfs: beegfs_XNodeSyn(203381): NodeConn (acquire stream): Connected: beegfs...@172.16.16.18:8008 (protocol: TCP)
[165879.290332] beegfs: beegfs_XNodeSyn(203381): Sync: Nodes added (sync results): 2 (Type: beegfs-meta)
[165879.290441] beegfs: beegfs_XNodeSyn(203381): Sync: Nodes added (sync results): 2 (Type: beegfs-storage)
[165879.290728] beegfs: beegfs_XNodeSyn(203381): Init: Node registration...
[165879.291243] beegfs: beegfs_XNodeSyn(203381): Registration: Node registration successful.
[165879.291715] beegfs: beegfs_XNodeSyn(203381): Init: Init complete.
[165879.291719] beegfs: mount(203379): Mount sanity check: Retrieval of root directory entry failed. Are all metadata servers running and registered at the management daemon? (Error: Communication error)
[165879.291721] beegfs: mount(203379): Mount sanity check failed. Canceling mount. (Log file may provide additional information. Check can be disabled with sysMountSanityCheckMS=0 in the config file.)
[165879.291723] beegfs: mount(203379): App (stop components): Stopping components...
[165879.292649] beegfs: beegfs_XNodeSyn(203381): Deregistration: Node deregistration successful.
[165881.345378] beegfs: mount(203379): App (wait for component termination): Still waiting for this component to stop: beegfs_AckMgr
[165881.793356] beegfs: mount(203379): App (wait for component termination): Component stopped: beegfs_AckMgr
[165881.793360] beegfs: mount(203379): App (stop): All components stopped.