New storage node registration failing


Doug Cloud

Apr 27, 2018, 2:22:57 PM
to beegfs-user
Hello,

First time trying to add a couple of new storage nodes to our production system. Getting "...Registration failed for target: 0-blahblah; numID:..." in the beegfs-mgmtd.log file when trying to start beegfs-storage on the new nodes (CentOS 7.4.1708, kernel 3.10.0-693.2.2.el7). BeeGFS version on the production system is 6.16; version on the new storage nodes is 6.18. Can't shut down/upgrade production at the moment, so I'm hoping this isn't a versioning issue.

Currently there are 2 storage nodes with 2 targets each:

Storage
==========
oss01 [ID: 1]: reachable at 10.0.7.83:8003 (protocol: RDMA)
oss02 [ID: 2]: reachable at 10.0.7.84:8003 (protocol: RDMA)

The new nodes are, of course, oss03 and oss04. I've set them up exactly as the first two, with appropriate numbering for each, i.e. for the targets on storage node oss03:

/opt/beegfs/sbin/beegfs-setup-storage -p /beegfs/OST0000 -s 3 -i 300 -m <mgmtd.server>
/opt/beegfs/sbin/beegfs-setup-storage -p /beegfs/OST0001 -s 3 -i 301 -m <mgmtd.server>

Here's the kicker: I initially made a mistake and put -s 1 in the setup command, triggering a collision error with the first storage node. Did this mess things up for good, so that wiping and re-creating the storage setup with the correct number will forever cause a problem? If so, where/how do I fix it? Is there a way to manually add the new storage nodes rather than relying on auto-registration? I've tried repeating the entire BeeGFS install on the new nodes without luck. Thanks—

Harry Mangalam

Apr 27, 2018, 6:08:18 PM
to beegfs-user
3 things:

1: when you report log output, give the exact log output.  If it's long, pipe it into a termbin paste:

tail -333 /var/log/beegfs-blahblah | nc termbin.com 9999
and then provide the returned link.

Your interpretation of the log may very well confuse the ppl who might help you.

2: why in the world would you mix version numbers (even minor versions) in the same filesystem?

3: Did you set

sysAllowNewServers       = true

in the /etc/beegfs/beegfs-mgmtd.conf

to allow new servers to join the storage pool?

My 3 cents.

hjm

face...@icloud.com

Apr 28, 2018, 12:59:01 PM
to beegfs-user
On Friday, April 27, 2018 at 6:08:18 PM UTC-4, Harry Mangalam wrote:

3 things:

1: when you report log output, give the exact log output.  If it's long, pipe it into a termbin paste:

from oss3 beegfs-storage.log:
(3) Apr26 16:38:00 Main [RegDGramLis] >> Listening for UDP datagrams: Port 8003
(1) Apr26 16:38:00 Main [App] >> Waiting for beegfs-mgmtd@<mgmtd.server>:8008...
(2) Apr26 16:38:00 RegDGramLis [Heartbeat incoming] >> New node: beegfs-mgmtd <mgmtd.server> [ID: 1]; 
(3) Apr26 16:38:00 Main [NodeConn (acquire stream)] >> Connected: beegfs-mgmtd@<mgmtd.server>:8008 (protocol: TCP)
(0) Apr26 16:38:00 Main [App] >> Target ID reservation request was rejected by this mgmt node: <mgmtd.server> [ID: 1]
(0) Apr26 16:38:00 Main [App] >> Target pre-registration at management node failed

from mgmtd beegfs-mgmtd.log:
(1) Apr26 16:38:00 DirectWorker1 [RegisterTargetMsg incoming] >> Registration failed for target: 0-5AE23430-3; numID: 300

log entries for other node (oss4) are the same.

2: why in the world would you mix version numbers (even minor versions) in the same filesystem?

I did mention we can't shut down/upgrade the production systems right now (they're being used for some long-term jobs), so we're stuck on 6.16 for the time being. BeeGFS's yum repo only has 6.18, and I can't find an archive of previous versions anywhere. Unless you're suggesting, and have proof, that updating/restarting an operational production system node-by-node has no dire consequences, I'll wait. But if this same issue occurs when we can shut down and update, I'll have wasted nearly a month's time. That's why.

I have updated clients to newer versions than meta/storage/mgmtd/admon without issue, and was hoping (based on the changelogs) that minor versions weren't so functionally different that new storage nodes couldn't be added. Otherwise, if each minor version requires the entire system to be shut down and updated just to add new storage/meta nodes (of which there's no indication in the docs), that would be rather onerous and not exactly a great thing for administering large installations...

3. Did you set

sysAllowNewServers       = true

in the /etc/beegfs/beegfs-mgmtd.conf

to allow new servers to join storage pool?
 
Of course. I followed directions, with the exception of my one noted mistake.
 
My 3 cents.
 
It appears this may require two bits' worth. :)

Doug Cloud

Apr 30, 2018, 11:57:05 AM
to beegfs-user
All is well now; I just had to use different/new target ID numbers. Likely this is because of my initial "mistake" of setting the node ID to an existing one (-s 1 instead of -s 3), which caused a collision when I first tried to add the new nodes. mgmtd must "remember" the first target numbers used (300 & 301) somewhere, but there's no "undoing" of this that I can find. Perhaps a future feature/fix...

Also, the new storage version is noted in the log. Obviously, there's no problem with mixing sub-versions (as expected!).

From beegfs-mgmtd.log:
(2) Apr30 11:35:28 DGramLis [Node registration] >> New node: beegfs-storage oss03 [ID: 3]; RDMA; Ver: 6.18-0; Source: <ip.address.oss03>
(2) Apr30 11:35:30 XNodeSync [Assign target to capacity pool] >> Storage target capacity pool assignment updated. NodeID: 3; TargetID: 330; Pool: Emergency;  Reason: No capacity report received.
(2) Apr30 11:35:30 XNodeSync [Assign target to capacity pool] >> Storage target capacity pool assignment updated. NodeID: 3; TargetID: 331; Pool: Emergency;  Reason: No capacity report received.
(2) Apr30 11:35:31 DirectWorker1 [Change consistency states] >> Storage target is coming online. ID: 330
(2) Apr30 11:35:31 DirectWorker1 [Change consistency states] >> Storage target is coming online. ID: 331
(2) Apr30 11:35:35 XNodeSync [Assign target to capacity pool] >> Storage target capacity pool assignment updated. NodeID: 1; TargetID: 100; Pool: Low;  Reason: Free capacity threshold
(2) Apr30 11:35:35 XNodeSync [Assign target to capacity pool] >> Storage target capacity pool assignment updated. NodeID: 1; TargetID: 101; Pool: Low;  Reason: Free capacity threshold
(2) Apr30 11:35:35 XNodeSync [Assign target to capacity pool] >> Storage target capacity pool assignment updated. NodeID: 2; TargetID: 200; Pool: Low;  Reason: Free capacity threshold
(2) Apr30 11:35:35 XNodeSync [Assign target to capacity pool] >> Storage target capacity pool assignment updated. NodeID: 2; TargetID: 201; Pool: Low;  Reason: Free capacity threshold
(2) Apr30 11:35:35 XNodeSync [Assign target to capacity pool] >> Storage target capacity pool assignment updated. NodeID: 3; TargetID: 330; Pool: Normal.
(2) Apr30 11:35:35 XNodeSync [Assign target to capacity pool] >> Storage target capacity pool assignment updated. NodeID: 3; TargetID: 331; Pool: Normal.
(2) Apr30 11:37:51 DGramLis [Node registration] >> New node: beegfs-storage oss04 [ID: 4]; RDMA; Ver: 6.18-0; Source: <ip.address.oss04>
(2) Apr30 11:37:54 DirectWorker1 [Change consistency states] >> Storage target is coming online. ID: 440
(2) Apr30 11:37:54 DirectWorker1 [Change consistency states] >> Storage target is coming online. ID: 441
(2) Apr30 11:37:55 XNodeSync [Assign target to capacity pool] >> Storage target capacity pool assignment updated. NodeID: 4; TargetID: 440; Pool: Normal.
(2) Apr30 11:37:55 XNodeSync [Assign target to capacity pool] >> Storage target capacity pool assignment updated. NodeID: 4; TargetID: 441; Pool: Normal.

# beegfs-df
METADATA SERVERS:
TargetID        Pool        Total         Free    %      ITotal       IFree    %
========        ====        =====         ====    =      ======       =====    =
       1      normal     837.0GiB     834.6GiB 100%      558.4M      554.7M  99%
       2      normal     837.0GiB     834.6GiB 100%      558.4M      554.8M  99%

STORAGE TARGETS:
TargetID        Pool        Total         Free    %      ITotal       IFree    %
========        ====        =====         ====    =      ======       =====    =
     100         low    8936.0GiB    1136.1GiB  13%      893.8M      891.7M 100%
     101         low    8936.0GiB    1136.4GiB  13%      893.8M      891.7M 100%
     200         low    8936.0GiB    1136.7GiB  13%      893.8M      891.7M 100%
     201         low    8936.0GiB    1136.7GiB  13%      893.8M      891.7M 100%
     330      normal   16759.3GiB   16758.2GiB 100%     1676.1M     1676.1M 100%
     331      normal   16759.3GiB   16758.3GiB 100%     1676.1M     1676.1M 100%
     440      normal   16759.3GiB   16758.7GiB 100%     1676.1M     1676.1M 100%
     441      normal   16759.3GiB   16758.7GiB 100%     1676.1M     1676.1M 100%


Our first two storage servers have obviously become quite full!

James Burton

Aug 23, 2019, 1:53:06 PM
to beegfs-user
To undo this, manually remove the "remembered" targets from the /mgmtd/targetNumIDs file on the beegfs-mgmtd server. It's messy, but effective.
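A minimal sketch of that edit, demonstrated on a sample copy rather than the live file (the real file's location and its exact line format are assumptions based on this thread; on a live system, stop beegfs-mgmtd first and restart it afterwards):

```shell
# Assumption: each targetNumIDs line maps "<stringID>=<hexNumID>",
# e.g. "0-5AE23430-3=12C" (0x12C = numID 300), matching the IDs from
# the failed registration earlier in this thread.
f=targetNumIDs.edit
printf '0-5AE23430-3=12C\n0-5AE23431-3=12D\n1-AAAA0001-1=64\n' > "$f"
cp "$f" "$f.bak"                    # always back up before editing
sed -i '/=12C$/d; /=12D$/d' "$f"    # drop the stale numIDs 300 and 301
cat "$f"                            # only the surviving entry remains
```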

Jim Burton

Mher

Feb 7, 2020, 6:19:56 AM
to beegfs-user
Hi,

I am having a similar problem. The BeeGFS version I am testing with is 7.1.3.

On the storage node I execute

/opt/beegfs/sbin/beegfs-setup-storage -p /mnt/disk1 -s 1004 -i 1400 -m <mgmt.serve>

and start the beegfs-storage service. The target shows up in beegfs-df

Then, for testing, I remove the target and the host:

beegfs-ctl --removenode --nodetype=storage 1004
beegfs-ctl --removetarget 1400

When I try to add the host back by starting the service (or by wiping the storage host and doing the procedure again), the storage host is not registered.

How can I permanently remove a storage host and target from the configuration? I tried to delete the target from the targetNumIDs file, but the entries showed up again! I think BeeGFS keeps the info somewhere else too.

Best

desa...@mail.usf.edu

Mar 2, 2020, 9:51:58 AM
to beegfs-user
Hello,

How can I permanently remove a storage host and target from the configuration? I tried to delete the target from the targetNumIDs file, but the entries showed up again! I think BeeGFS keeps the info somewhere else too.

I just did this process last week, as I didn't enjoy (OCD, maybe?) having storage targets jumping from 4 to 7.

Basically, if you review https://www.beegfs.io/wiki/StorageSynchronizationConstruction#hn_59ca4f8bbb_6, you'll see it stated that the contents of the targetNumIDs file within the management directory are encoded in hex, which is perfect.

If you loop over the file (awk, bash, etc.) and convert the last part of the target ID string from hex to decimal, you'll get a list of your "misbehaving" target IDs.  Once you have identified the targets that should be removed, I'd recommend the following steps:

1.)  Make a backup of the file.
2.)  Stop the beegfs-mgmtd service.
3.)  Edit the file and remove the offending entries;  I had to remove 16, so I used `sed`.
4.)  Start the beegfs-mgmtd service.

Of course, the targets I removed _did not_ have any live data on them, so I didn't have anything to lose.
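The hex-to-decimal pass described above might be sketched like this, using sample data (the "<stringID>=<hexNumID>" line format is an assumption based on the IDs quoted earlier in the thread):

```shell
# Assumption: each targetNumIDs line maps "<stringID>=<hexNumID>";
# 0x12C decodes to 300, the numID from the failed registration above.
printf '0-5AE23430-3=12C\n0-5AE23431-3=12D\n' > targetNumIDs.list
while IFS='=' read -r stringID hexID; do
    # $((16#...)) interprets the value as base-16
    printf '%s -> numID %d\n' "$stringID" "$((16#$hexID))"
done < targetNumIDs.list
```

With the decoded list in hand, matching the stale numIDs against the file entries makes the sed step above unambiguous.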

HTH,
John DeSantis