BeeGFS client shows wrong IP

Jean-François Courteau

Apr 3, 2017, 11:37:09 AM
to beegfs-user
Hello there,

I just set up a brand new BeeGFS cluster on 2 CentOS 7 nodes. Mgmtd and admon run only on the primary node, while all other services (meta, storage and client) run on both nodes. I successfully set up metadata mirroring and storage mirroring using buddy groups. Both servers are multihomed and communicate over a network dedicated to storage.

I configured a NIC Priority file for all the nodes and put it in the config files as follows (this entry is in beegfs-mgmtd.conf, beegfs-storage.conf, beegfs-meta.conf and beegfs-client.conf):
[...]
connInterfacesFile                     = /beegfs/netpriority.conf
[...]


Content of the file is:
[root@srvhc01 proc]# cat /beegfs/netpriority.conf
nm-team
[EOF]
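
A minimal sketch of a sanity check for such a file: it reports any listed interface name that does not exist on the host. The function name check_ifaces and the SYSFS_NET override are hypothetical helpers, not part of BeeGFS, and the sketch assumes one interface name per line, with blank lines and lines starting with '#' skipped.

```shell
#!/bin/sh
# Hypothetical helper (not part of BeeGFS): check that every interface named
# in a connInterfacesFile-style file exists on this host.
# Assumption: one interface name per line; blank lines and '#' lines skipped.
# SYSFS_NET can be overridden to point the check at a test directory.
check_ifaces() {
    file=$1
    sysfs=${SYSFS_NET:-/sys/class/net}
    missing=0
    # read trims surrounding whitespace; the "|| [ -n ... ]" part handles a
    # last line without a trailing newline
    while read -r iface || [ -n "$iface" ]; do
        case $iface in
            ''|'#'*) continue ;;
        esac
        if [ -e "$sysfs/$iface" ]; then
            echo "ok: $iface"
        else
            echo "missing: $iface"
            missing=1
        fi
    done < "$file"
    # Non-zero exit status if at least one interface was not found
    return "$missing"
}

# Usage (path taken from this thread):
#   check_ifaces /beegfs/netpriority.conf
```

A "missing" result would point at a typo in the file or at an interface that is not up yet when the check runs.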

All inter-node communication goes through this nm-team adapter, which is a bonded (teamed) NIC. The server hosts file is:
[root@srvhc01 proc]# cat /etc/hosts
192.168.255.11  srvhc01.nexcess.int
192.168.255.16  srvhc02.nexcess.int


Storage network is 192.168.255.0/24

Everything seems to be working fine, but when I check the NIC configuration using beegfs-ctl, here is what I get:

[root@srvhc01 proc]# beegfs-ctl --listnodes --nodetype=client --nicdetails
641B-58DDBABD-srvhc01.nexcess.int [ID: 1]
   Ports: UDP: 8004; TCP: 0
   Interfaces:
   + enp4s0f1[ip addr: 192.168.1.10; type: TCP] <-- THIS IP NO LONGER EXISTS ON THE SERVER! This NIC now has 192.168.1.11. Moreover, this interface is not in the netpriority.conf file.
   + enp4s0f0[ip addr: 192.168.0.11; type: TCP] <-- THIS IS THE LAN IP, I did not specify this interface in the netpriority.conf file
3271-58E106CC-srvhc02.nexcess.int [ID: 5]
   Ports: UDP: 8004; TCP: 0
   Interfaces:
   + nm-team[ip addr: 192.168.255.16; type: TCP]
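
To compare what the management daemon has registered against what the host really has, the interface/IP pairs can be pulled out of that listing. This is only a sketch: registered_nics is a hypothetical helper name, and it assumes the output layout shown above (a node line containing "[ID: n]", followed by "+ iface[ip addr: a.b.c.d; type: ...]" lines).

```shell
#!/bin/sh
# Hypothetical helper: turn "beegfs-ctl --listnodes --nicdetails" style text
# (read from stdin) into "node interface ip" triples, one per line.
registered_nics() {
    awk '
        /\[ID: [0-9]+\]/ { node = $1; next }        # node header line
        /^[ \t]*\+ / {                              # "+ iface[ip addr: ...]"
            line = $0
            sub(/^[ \t]*\+ /, "", line)             # drop the "+ " bullet
            iface = line
            sub(/\[.*/, "", iface)                  # keep text before "["
            ip = line
            sub(/.*ip addr: /, "", ip)              # drop text before the IP
            sub(/;.*/, "", ip)                      # drop text after the IP
            print node, iface, ip
        }
    '
}

# Usage (command taken from this thread):
#   beegfs-ctl --listnodes --nodetype=client --nicdetails | registered_nics
```

The resulting addresses can then be checked by hand against the output of "ip addr" on each client.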



Here is the output of beegfs-net on my primary server:

[root@srvhc01 proc]# beegfs-net
mgmt_nodes
=============
srvhc01.nexcess.int [ID: 1]
   Connections: TCP: 1 (192.168.255.11:8008);
meta_nodes
=============
srvhc01.nexcess.int [ID: 1]
   Connections: TCP: 1 (192.168.255.11:8005);
srvhc02.nexcess.int [ID: 2]
   Connections: <none> <-- IS THIS NORMAL?
storage_nodes
=============
srvhc01.nexcess.int [ID: 1]
   Connections: TCP: 3 (192.168.255.11:8003);
srvhc02.nexcess.int [ID: 2]
   Connections: TCP: 1 (192.168.255.16:8003);



Looking at the entries flagged with "<--" above (highlighted in red in my original message), does this look normal? I want to start in the right direction. I am pretty sure that there is a misconfiguration, but even after restarting the client service once the netpriority.conf file was in place, the client IPs displayed for srvhc01 are incorrect. I am also worried about the second meta node showing no connection...

Thanks in advance for your help!

Jean-Francois


Jens Dreger

Apr 3, 2017, 12:07:54 PM
to fhgfs...@googlegroups.com
Hi Jean-Francois!

I think you need to use the connNetFilterFile option to specify the
interfaces that the client may use for outgoing connections. From the
comments in beegfs-client.conf:

# [connInterfacesFile]
[...]
# Note: This has no influence on outgoing connections. The information is sent
# to other hosts to inform them about possible communication paths.

vs.

# [connNetFilterFile]
# The path to a text file that specifies allowed IP subnets, which may
# be used for outgoing communication.
[...]
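
As a rough illustration of what a subnet entry such as 192.168.255.0/24 covers, here is a small sketch of CIDR matching in plain shell. It only demonstrates the arithmetic; it is not BeeGFS's own filter logic, and the function names (ip_to_int, ip_in_cidr, allowed_by_filter) are made up for this example. It assumes IPv4 addresses without leading zeros and one subnet per line in the file.

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed if address $1 lies inside subnet $2 (written as a.b.c.d/len).
ip_in_cidr() {
    addr=$(ip_to_int "$1")
    net=$(ip_to_int "${2%/*}")
    len=${2#*/}
    # Netmask with the top $len bits set (assumes 64-bit shell arithmetic)
    mask=$(( (4294967295 << (32 - $len)) & 4294967295 ))
    [ $(( addr & mask )) -eq $(( net & mask )) ]
}

# Succeed if address $1 matches any subnet listed in file $2.
allowed_by_filter() {
    while read -r subnet || [ -n "$subnet" ]; do
        [ -z "$subnet" ] && continue
        ip_in_cidr "$1" "$subnet" && return 0
    done < "$2"
    return 1
}

# Usage with addresses from this thread:
#   ip_in_cidr 192.168.255.11 192.168.255.0/24 && echo inside
#   ip_in_cidr 192.168.1.10   192.168.255.0/24 || echo outside
```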

Also, beegfs-net will only show active connections. If a server has
not been connected for a while, beegfs will drop that connection. Try
the beegfs-check-servers command instead, or refresh connections by
accessing your beegfs filesystem:

Example:
[zboot01:zhpc01]root@z001:~> echo 1 > /proc/fs/beegfs/141C-58DACE51-z001/drop_conns
[zboot01:zhpc01]root@z001:~> beegfs-net
mgmt_nodes
=============
zmeta01 [ID: 1]
Connections: <none>

meta_nodes
=============
zmeta01 [ID: 1]
Connections: RDMA: 1 (10.0.29.69:8105);
zmeta02 [ID: 2]
Connections: <none>
zmeta03 [ID: 3]
Connections: <none>
zmeta04 [ID: 4]
Connections: <none>

storage_nodes
=============
zstor01 [ID: 1]
Connections: <none>

[zboot01:zhpc01]root@z001:~> timeout 10 find /scratch > /dev/null
[zboot01:zhpc01]root@z001:~> beegfs-net

mgmt_nodes
=============
zmeta01 [ID: 1]
Connections: TCP: 1 (10.0.29.69:8108);

meta_nodes
=============
zmeta01 [ID: 1]
Connections: RDMA: 1 (10.0.29.69:8105);
zmeta02 [ID: 2]
Connections: <none>
zmeta03 [ID: 3]
Connections: RDMA: 1 (10.0.29.71:8105);
zmeta04 [ID: 4]
Connections: RDMA: 1 (10.0.29.72:8105);

storage_nodes
=============
zstor01 [ID: 1]
Connections: RDMA: 1 (10.0.29.67:8103);

zmeta02 is down, so no connection is actually right ;)
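
If you want to spot such nodes quickly, something along these lines lists every node that beegfs-net currently reports without a connection. It is just a sketch: nodes_without_conns is a hypothetical helper, and it assumes the output layout shown above. A node that shows up here may simply be idle rather than down, for the reason given earlier.

```shell
#!/bin/sh
# Hypothetical helper: read beegfs-net style text on stdin and print
# "<section> <node>" for each node that has no active connection.
nodes_without_conns() {
    awk '
        /_nodes[ \t]*$/       { section = $1; next }  # e.g. "meta_nodes"
        /\[ID: [0-9]+\]/      { node = $1; next }     # e.g. "zmeta02 [ID: 2]"
        /Connections: <none>/ { print section, node }
    '
}

# Usage:
#   beegfs-net | nodes_without_conns
```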

Regards,

Jens.


--
Jens Dreger Freie Universitaet Berlin
dre...@physik.fu-berlin.de Fachbereich Physik - ZEDV
Tel: +49 30 83854774 Arnimallee 14
Fax: +49 30 838454774 14195 Berlin

Jean-François Courteau

Apr 3, 2017, 12:31:33 PM
to beegfs-user
Hello Jens,

Thanks for your quick response!

I just tried the following setting:
connNetFilterFile             = /beegfs/netfilter.conf

Created the file /beegfs/netfilter.conf
[root@srvhc01 beegfs]# cat /beegfs/netfilter.conf
192.168.255.0/24
[EOF]

Then I restarted the beegfs-client and beegfs-helperd services. The output of the beegfs-ctl command is still the same...

[root@srvhc01 beegfs]# beegfs-ctl --listnodes --nodetype=client --nicdetails

641B-58DDBABD-srvhc01.nexcess.int [ID: 1]
   Ports: UDP: 8004; TCP: 0
   Interfaces:
   + enp4s0f1[ip addr: 192.168.1.10; type: TCP]
   + enp4s0f0[ip addr: 192.168.0.11; type: TCP]
3271-58E106CC-srvhc02.nexcess.int [ID: 5]
   Ports: UDP: 8004; TCP: 0
   Interfaces:
   + nm-team[ip addr: 192.168.255.16; type: TCP]

Are there other services I need to restart for the setting to take effect?

Many thanks!

Jean-François

Jens Dreger

Apr 3, 2017, 1:42:47 PM
to fhgfs...@googlegroups.com
Hi Jean-François!

On Mon, Apr 03, 2017 at 09:31:33AM -0700, Jean-François Courteau wrote:
> I just tried the following setting:
> connNetFilterFile             = /beegfs/netfilter.conf
>
> Created the file /beegfs/netfilter.conf
> [root@srvhc01 beegfs]# cat /beegfs/netfilter.conf
> 192.168.255.0/24
> [EOF]
>
> Then I restarted the beegfs-client and beegfs-helperd services. The output of
> the beegfs-ctl command is still the same...
>
> [root@srvhc01 beegfs]# beegfs-ctl --listnodes --nodetype=client --nicdetails
> 641B-58DDBABD-srvhc01.nexcess.int [ID: 1]
>    Ports: UDP: 8004; TCP: 0
>    Interfaces:
>    + enp4s0f1[ip addr: 192.168.1.10; type: TCP]
>    + enp4s0f0[ip addr: 192.168.0.11; type: TCP]
> 3271-58E106CC-srvhc02.nexcess.int [ID: 5]
>    Ports: UDP: 8004; TCP: 0
>    Interfaces:
>    + nm-team[ip addr: 192.168.255.16; type: TCP]
>
> Are there other services I need to restart for the setting to take effect?

I just tried that on my test system. All my nodes have ethernet and
infiniband. beegfs-ctl --listnodes always shows all interfaces:

3FD6-58E27EE5-z001 [ID: 46]
Ports: UDP: 8104; TCP: 0
Interfaces:
+ ib0[ip addr: 10.0.29.1; type: RDMA]
+ eth0[ip addr: 130.133.29.1; type: TCP]
+ ib0[ip addr: 10.0.29.1; type: TCP]

But once I configure a connNetFilterFile, the client only ever uses
the paths listed in that file.

For example, if I put 10.0.29.0/24 into that file and then disable
the infiniband port, the client no longer falls back to the
ethernet interface, as it usually does. The client logfile is not
very specific about the effect of the filter file:

mount(2026) [App_logInfos] >> Usable NICs: ib0(RDMA) eth0(TCP) ib0(TCP)
mount(2026) [App_logInfos] >> Net filters: 1

beegfs-check-servers even shows the ethernet interfaces as being usable:

Metadata
==========
zmeta01 [ID: 1]: reachable at 130.133.29.69:8105 (protocol: TCP)
[...]

but the client logfile complains "Connect failed on all available
routes". So this information does not seem to be consistent everywhere
once you activate the connNetFilterFile option. Maybe I'm missing
something here.

The part that I don't understand in your case: why does the nm-team
interface not show up on srvhc01.nexcess.int at all? What do you get
for "Usable NICs" in the client logfile when the service starts up?
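
A quick way to answer that from a log file, as a sketch: the helpers below pull the most recent "Usable NICs" line out of log text and test whether a given interface is in it. The names last_usable_nics and nic_is_usable are hypothetical, and the sketch assumes the log line format shown above ("... >> Usable NICs: ib0(RDMA) eth0(TCP) ...").

```shell
#!/bin/sh
# Hypothetical helper: print the NIC list from the most recent
# "Usable NICs" line found on stdin.
last_usable_nics() {
    sed -n 's/.*Usable NICs: *//p' | tail -n 1
}

# Hypothetical helper: succeed if interface $1 appears in that list.
# Entries look like "eth0(TCP)", so match on the name followed by "(".
nic_is_usable() {
    for nic in $(last_usable_nics); do
        case $nic in
            "$1("*) return 0 ;;
        esac
    done
    return 1
}

# Usage (feed it whichever file holds the client messages):
#   last_usable_nics < client.log
#   nic_is_usable nm-team < client.log || echo "nm-team not listed"
```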

Regards,

Jens.

Jean-François Courteau

Apr 3, 2017, 4:02:27 PM
to beegfs-user
Hello Jens,

Now this is weird. Here is the complete client log file after I set the log level to 5 in the config file and restarted beegfs-client and beegfs-helperd:

[root@srvhc01 beegfs]# cat /var/log/beegfs-client.log
(1) Apr03 14:44:03 Main [App] >> BeeGFS Helper Daemon Version: 6.7
(1) Apr03 14:44:03 Main [App] >> Client log messages will be prefixed with an asterisk (*) symbol.

And here is the old log that was saved right before I restarted the service with the new settings:
[root@srvhc01 beegfs]# cat /var/log/beegfs-client.log.old-1
(1) Apr03 14:40:17 Main [App] >> BeeGFS Helper Daemon Version: 6.7
(1) Apr03 14:40:17 Main [App] >> Client log messages will be prefixed with an asterisk (*) symbol.
(1) Apr03 14:44:01 Main [App::signalHandler] >> Received a SIGTERM. Shutting down...
(1) Apr03 14:44:02 Main [App] >> All components stopped. Exiting now!

It's as if the interesting part of the client log is being sent to /dev/null...

Any idea on this one?

Jean-Francois

Jean-François Courteau

Apr 3, 2017, 4:12:49 PM
to beegfs-user
I also see a strange message in the meta log, which seems to happen a few times per day. It looks as though new data is not being sent to the secondary:

Apr03 16:03:01 Worker7 [OpenFileMsgEx/forward] >> Different return codes from primary and secondary buddy. Setting secondary to needs-reync
(1) Apr03 16:03:01 Worker7 [createBuddyNeedsResyncFile] >> Marked secondary buddy for needed resync.
(3) Apr03 16:03:01 Worker10 [SetAttrMsgEx/forward] >> Different return codes from primary and secondary buddy. Setting secondary to needs-reync
(3) Apr03 16:03:01 Worker12 [CloseFileMsgEx/forward] >> Different return codes from primary and secondary buddy. Setting secondary to needs-reync
(2) Apr03 16:03:05 XNodeSync [checkBuddyNeedsResync] >> Starting buddy resync job.
(2) Apr03 16:03:08 BuddyResyncJob [BuddyResyncJob.cpp:230] >> Resync finished. interrupted: no; syncErrors: no
(3) Apr03 16:03:08 Worker7 [OpenFileMsgEx/forward] >> Different return codes from primary and secondary buddy. Setting secondary to needs-reync
(1) Apr03 16:03:08 Worker7 [createBuddyNeedsResyncFile] >> Marked secondary buddy for needed resync.
(3) Apr03 16:03:08 Worker14 [OpenFileMsgEx/forward] >> Different return codes from primary and secondary buddy. Setting secondary to needs-reync
(3) Apr03 16:03:08 Worker3 [CloseFileMsgEx/forward] >> Different return codes from primary and secondary buddy. Setting secondary to needs-reync
(3) Apr03 16:03:08 Worker9 [CloseFileMsgEx/forward] >> Different return codes from primary and secondary buddy. Setting secondary to needs-reync
(3) Apr03 16:03:08 Worker6 [OpenFileMsgEx/forward] >> Different return codes from primary and secondary buddy. Setting secondary to needs-reync
(3) Apr03 16:03:08 Worker5 [SetAttrMsgEx/forward] >> Different return codes from primary and secondary buddy. Setting secondary to needs-reync
(2) Apr03 16:03:11 XNodeSync [checkBuddyNeedsResync] >> Starting buddy resync job.
(2) Apr03 16:03:14 BuddyResyncJob [BuddyResyncJob.cpp:230] >> Resync finished. interrupted: no; syncErrors: no
(3) Apr03 16:03:15 Worker15 [OpenFileMsgEx/forward] >> Different return codes from primary and secondary buddy. Setting secondary to needs-reync
(1) Apr03 16:03:15 Worker15 [createBuddyNeedsResyncFile] >> Marked secondary buddy for needed resync.
(3) Apr03 16:03:15 Worker3 [SetAttrMsgEx/forward] >> Different return codes from primary and secondary buddy. Setting secondary to needs-reync
(3) Apr03 16:03:15 Worker9 [CloseFileMsgEx/forward] >> Different return codes from primary and secondary buddy. Setting secondary to needs-reync
(3) Apr03 16:03:15 Worker13 [OpenFileMsgEx/forward] >> Different return codes from primary and secondary buddy. Setting secondary to needs-reync
(3) Apr03 16:03:15 Worker16 [CloseFileMsgEx/forward] >> Different return codes from primary and secondary buddy. Setting secondary to needs-reync
(3) Apr03 16:03:15 Worker12 [SetAttrMsgEx/forward] >> Different return codes from primary and secondary buddy. Setting secondary to needs-reync
(2) Apr03 16:03:20 XNodeSync [checkBuddyNeedsResync] >> Starting buddy resync job.
(2) Apr03 16:03:23 BuddyResyncJob [BuddyResyncJob.cpp:230] >> Resync finished. interrupted: no; syncErrors: no



This does not seem to be related, but it's the only error I see in the logs...
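
To see which operations trigger the resync marker most often, the log lines can be tallied by message type. This is a minimal sketch: resync_triggers is a hypothetical helper, it assumes the line format shown above (a bracketed "[<Name>/forward]" field), and the log path in the usage comment is an assumption that may differ on your system.

```shell
#!/bin/sh
# Hypothetical helper: read metadata log text on stdin and print
# "<count> <message type>" for every "Different return codes" line,
# most frequent first.
resync_triggers() {
    awk '
        /Different return codes from primary and secondary buddy/ {
            # The bracketed field looks like "[OpenFileMsgEx/forward]"
            if (match($0, /\[[A-Za-z]+\/forward\]/)) {
                name = substr($0, RSTART + 1, RLENGTH - 2)
                sub(/\/forward$/, "", name)
                count[name]++
            }
        }
        END { for (n in count) print count[n], n }
    ' | sort -k1,1nr -k2
}

# Usage (log path is an assumption):
#   resync_triggers < /var/log/beegfs-meta.log
```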

Jean-François Courteau

Apr 4, 2017, 3:12:34 PM
to beegfs-user
I have finally been able to work around the problem with Jens's help.

I realized that when I tried to restart the client to pick up new settings (like a new IP or interface filter in the config file), some files were still held open on the mount, so the client service could not shut down because it could not unmount.

The fix was to check, using
lsof | grep /mnt/beegfs
that all open files had been released, verify that the settings in the client config file were what I wanted, and only then restart the client and helperd services.
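
For anyone who wants to script that check, here is a minimal sketch. The function names and the LSOF override are hypothetical, and restarting through systemctl is an assumption about how the services are managed on your system; the mount point and service names are the ones from this thread.

```shell
#!/bin/sh
# Hypothetical helper: succeed if lsof reports nothing under mount point $1.
# LSOF can be overridden (for example with a stub) to try the logic out.
mount_is_idle() {
    # Any lsof line mentioning the mount point means something is still open
    [ -z "$(${LSOF:-lsof} 2>/dev/null | grep -F -- "$1")" ]
}

# Hypothetical helper: restart the client services only when the mount is idle.
# Assumption: the services are managed through systemctl.
restart_client_if_idle() {
    if mount_is_idle "$1"; then
        systemctl restart beegfs-helperd beegfs-client
    else
        echo "files still open under $1; not restarting" >&2
        return 1
    fi
}

# Usage:
#   restart_client_if_idle /mnt/beegfs
```

Like the plain grep, this matches the path as a substring, so a mount point such as /mnt/beegfs2 would also count as busy.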

As for the metadata Buddy mirroring problem, an update of BeeGFS from version 6.7 to version 6.8 did the job. Now all my logs are clean.

Thanks again!

Jean-Francois