Metadata Server High Availability

Jit Kang CHANG

Jun 20, 2023, 11:19:02 PM
to beegfs-user
Hello,

We are planning to deploy BeeGFS as scratch storage for our university's HPC cluster. However, we could not find enough information on how to deploy the metadata servers in high availability without mirroring.

We understand that metadata mirroring is a feature of the enterprise edition, but is there any way to make a primary metadata server fail over to a secondary metadata server without buddy mirroring in case a failure happens? We tried configuring multiple metadata servers during testing, but the system does not fail over to the secondary metadata server automatically when the primary server fails.

Ticonderoga

Jun 20, 2023, 11:29:05 PM
to fhgfs...@googlegroups.com
I do not think failover could work without buddy mirroring. Without mirroring, the metadata of each folder only resides on one BeeGFS metadata server, so a folder will not be accessible if its associated metadata server goes down.   
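
For example (a rough sketch, the path is just illustrative), you can see which metadata server owns a directory with beegfs-ctl:

# Show the owning metadata node of a directory inside a BeeGFS mount:
beegfs-ctl --getentryinfo /mnt/beegfs/projects/foo
# The output names the owning metadata node; without mirroring, that
# directory is unreachable while this node is down.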

" PENAFIAN: E-mel ini dan apa-apa fail yang dikepilkan bersamanya ("Mesej") adalah ditujukan hanya untuk kegunaan penerima(-penerima) yang termaklum di atas dan mungkin mengandungi maklumat sulit. Anda dengan ini dimaklumkan bahawa mengambil apa jua tindakan bersandarkan kepada, membuat penilaian, mengulang hantar, menghebah, mengedar, mencetak, atau menyalin Mesej ini atau sebahagian daripadanya oleh sesiapa selain daripada penerima(-penerima) yang termaklum di atas adalah dilarang. Jika anda telah menerima Mesej ini kerana kesilapan, anda mesti menghapuskan Mesej ini dengan segera dan memaklumkan kepada penghantar Mesej ini menerusi balasan e-mel. Pendapat-pendapat, rumusan-rumusan, dan sebarang maklumat lain di dalam Mesej ini yang tidak berkait dengan urusan rasmi Universiti Malaya adalah difahami sebagai bukan dikeluar atau diperakui oleh mana-mana pihak yang disebut.


DISCLAIMER: This e-mail and any files transmitted with it ("Message") is intended only for the use of the recipient(s) named above and may contain confidential information. You are hereby notified that the taking of any action in reliance upon, or any review, retransmission, dissemination, distribution, printing or copying of this Message or any part thereof by anyone other than the intended recipient(s) is strictly prohibited. If you have received this Message in error, you should delete this Message immediately and advise the sender by return e-mail. Opinions, conclusions and other information in this Message that do not relate to the official business of University of Malaya shall be understood as neither given nor endorsed by any of the forementioned. "


Jit Kang CHANG

Jun 21, 2023, 12:53:41 AM
to beegfs-user
Thanks. Just to be sure, does metadata mirroring with buddy groups require an enterprise support contract? I understand the management node requires Pacemaker and DRBD for high availability; is it usual to apply the same configuration to the metadata nodes to achieve high availability?

Also, I am wondering what the purpose of having multiple metadata nodes without mirroring enabled is, since there is no automatic failover when one of the primary nodes fails.

Lehmann, Greg (IM&T, Pullenvale)

Jun 21, 2023, 1:15:54 AM
to fhgfs...@googlegroups.com

https://doc.beegfs.io/latest/license.html  - section 3.4

 

Multiple metadata nodes are for scale out metadata performance. There are many workloads out there that thrash metadata services. The more metadata processes you have the more chance you have of handling those bad workloads. You will find only so many metadata processes can run on a node before it stops scaling well. Adding nodes scales better.
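
As a quick illustration (assuming a standard beegfs-utils install), beegfs-ctl shows the metadata nodes that directories get spread over:

# List the registered metadata nodes; each new directory is placed on one of them:
beegfs-ctl --listnodes --nodetype=meta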

 

You could probably do metadata failover without buddy mirroring if you have dual-ported drives with two servers connected to them.

 

Cheers,

 

Greg

John Hearns

Jun 21, 2023, 6:40:20 AM
to fhgfs...@googlegroups.com
If it is truly scratch storage, why do you require HA?

Another point: if you are implementing HA nodes using Pacemaker, DRBD is not the only option. You can use storage arrays with connections to two servers.


John Hearns

Jun 21, 2023, 6:41:35 AM
to fhgfs...@googlegroups.com
It looks like you are not after a commercial solution.
If you are looking for a commercial solution using Pacemaker, drop me an email.



Kapetanakis Giannis

Jun 21, 2023, 6:50:22 AM
to fhgfs...@googlegroups.com

I'm also running meta and storage in HA mode, basically because we only have one meta device and one storage device, each shared between 2 meta servers and 2 storage servers respectively via FC.

Each storage/meta device is divided into 2 pools.
Each server is active for one pool, so both (meta) servers are being utilized, and each is able to take over the second pool in case of maintenance or a problem.

# pcs status
Cluster name: meta_cluster
Cluster Summary:
  * Stack: corosync
  * Current DC: mgs-1 (version 2.1.2-4.el8_6.2-ada5c3b36e2) - partition with quorum
  * Last updated: Wed Jun 21 13:41:45 2023
  * Last change:  Wed Jun 21 13:41:37 2023 by hacluster via crmd on mgs-1
  * 2 nodes configured
  * 8 resource instances configured

Node List:
  * Online: [ mgs-1 mgs-2 ]

Full List of Resources:
  * Resource Group: beegfs_metadata1:
    * VIP-metadata1     (ocf::heartbeat:IPaddr2):        Started mgs-1
    * disk-metadata1    (ocf::heartbeat:Filesystem):     Started mgs-1
    * beegfs-metadata1  (systemd:beegfs-meta@meta1):     Started mgs-1
  * Resource Group: beegfs_metadata2:
    * VIP-metadata2     (ocf::heartbeat:IPaddr2):        Started mgs-2
    * disk-metadata2    (ocf::heartbeat:Filesystem):     Started mgs-2
    * beegfs-metadata2  (systemd:beegfs-meta@meta2):     Started mgs-2
  * ipmi-mgs-1  (stonith:fence_ipmilanplus):     Started mgs-1
  * ipmi-mgs-2  (stonith:fence_ipmilanplus):     Started mgs-2

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Same for storage.

No commercial solution, just open source. It has been running fine for the last 2 years.
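
For anyone wanting to reproduce this, a group like beegfs_metadata1 above can be built with pcs commands roughly like these (the IP, netmask, device and mount point are placeholders, not our real values):

pcs resource create VIP-metadata1 ocf:heartbeat:IPaddr2 \
    ip=192.168.10.11 cidr_netmask=24 --group beegfs_metadata1
pcs resource create disk-metadata1 ocf:heartbeat:Filesystem \
    device=/dev/mapper/meta1 directory=/data/meta1 fstype=xfs --group beegfs_metadata1
pcs resource create beegfs-metadata1 systemd:beegfs-meta@meta1 --group beegfs_metadata1
# plus fencing devices created with "pcs stonith create ... fence_ipmilanplus ..."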

G

Lehmann, Greg (IM&T, Pullenvale)

Jun 21, 2023, 5:44:48 PM
to fhgfs...@googlegroups.com

Hi John,

                That is a bit of a naïve question I hear all the time from management. Yes, it is scratch storage. Now consider the impact of an outage to a scratch storage system, HA or otherwise. Say you have a modest compute cluster of 500 nodes. What do those 500 nodes do while the scratch FS is down? Sure, some workloads might be OK if they are not IO intensive, but effectively HPC production stops. Now add in data loss on the scratch FS, perhaps from jobs that have just completed, so the results are still on the scratch FS. There is now a cost of rerunning all those jobs to recompute the lost results. Finally, I don't believe there is any such thing as "true scratch", as the effort and I/O bandwidth involved in transferring data on and off scratch means that in reality data lives there for a while. I certainly would like to minimise the copy in/out process, as it takes IOPS that workloads could be using. I really don't want those 500 nodes spinning waiting for data.

 

Cheers,

 

Greg

 


Denis Anjos

Jun 21, 2023, 5:55:08 PM
to fhgfs...@googlegroups.com

Failover does indeed work without buddy mirroring. You will need the data available to both servers, though, and it will be mounted on only one server at a time through Pacemaker.


The data could be made available over FC, SAS or DRBD. As soon as one node fails, Pacemaker brings up the IP, mount point and BeeGFS service on the standby node.
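
For the DRBD case, the Pacemaker side could look roughly like this (resource and group names are placeholders, and the exact pcs/role syntax varies a little between versions, e.g. Promoted vs. Master):

# DRBD device for the metadata target, promoted on exactly one node:
pcs resource create meta_drbd ocf:linbit:drbd drbd_resource=beegfs_meta \
    promotable promoted-max=1 promoted-node-max=1 clone-max=2 clone-node-max=1 notify=true
# Keep the VIP/mount/beegfs-meta group on the promoted node, started after promotion:
pcs constraint colocation add meta_group with meta_drbd-clone INFINITY with-rsc-role=Promoted
pcs constraint order promote meta_drbd-clone then start meta_group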

D.

 

 


Jit Kang CHANG

Jun 21, 2023, 9:07:57 PM
to beegfs-user
Thank you everyone for sharing your thoughts.

To be honest, our university's cluster runs entirely on free and open-source software for infrastructure and middleware due to budget constraints. We currently have Lustre scratch storage deployed with Pacemaker for HA. As Greg mentioned, we want HA for scratch storage to minimise compute downtime caused by offline scratch. Because of the complexity of implementing Lustre previously, we would love to look at an alternative open-source file system, which is how we came across BeeGFS, but we would miss the HA feature that is only available through an enterprise support contract.

We are thinking of using Pacemaker for the HA, since we can already make the metadata disk visible to different metadata nodes thanks to Proxmox. However, we do not yet have a solid idea of the implementation and will probably need to do a test setup first. The only concern I have so far is that BeeGFS metadata requires a floating IP for Pacemaker to work properly, which is not required in Lustre. Unless I understand incorrectly, this might cause some issues on our side, as we have FreeIPA handling all the internal DNS mappings to those IP addresses.

Denis Anjos

Jun 21, 2023, 9:17:09 PM
to fhgfs...@googlegroups.com

Do you use IPoIB on your cluster? You only need the IP address for the initial handshake; after that, clients and servers talk RDMA. Are the clusters in different subnets from the storage?

 

I have deployed clusters in multi-mode running up to 11 metadata targets shared across two metadata servers, using Pacemaker for HA, and the failover works great.

 

D.

 

Jit Kang CHANG

Jun 21, 2023, 10:16:19 PM
to beegfs-user
Hi Denis,

We are getting new infrastructure with an IB interconnect in 6 months' time. We will very likely use IPoIB for the Ceph setup. Some other parts of the new infrastructure will very likely use IPoIB as well, but we are not sure yet.

For the network design, all the new compute nodes are connected to the storage within the same storage subnet, but part of the older cluster will be connected via a different subnet through routing.

Would you mind sharing some details of your setup?

Lukas Hejtmanek

Jun 22, 2023, 6:40:44 AM
to fhgfs...@googlegroups.com
Hello,

How do you deal with HA for storage nodes? My approach is the following:

I have a Kubernetes cluster.
Mgmtd runs in a pod as a Deployment with a PVC over NFS, so in case of node failure the pod is respawned on any cluster node; it has a floating IP, so the IP moves as well.
Metadata runs as two instances with mirroring; each has its own local storage for metadata, and I am hoping that mirroring handles the case of a metadata instance rebooting. Those instances are StatefulSets with host networking.
Storage is running on each storage node without any mirroring.

I extended the offline timeout to 10 days to mitigate problems when a node reboots or needs to be repaired. However, I ran into other issues, discussed in a different thread.
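
For reference, a sketch of what I mean by the extended timeout, assuming it is the sysTargetOfflineTimeoutSecs option in beegfs-mgmtd.conf (10 days = 864000 seconds):

# /etc/beegfs/beegfs-mgmtd.conf
sysTargetOfflineTimeoutSecs = 864000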

I have two spare nodes that can be swapped in (mainly for the disks) if a catastrophic failure of any node happens.

--
Lukáš Hejtmánek

Linux Administrator only because
Full Time Multitasking Ninja
is not an official job title

Denis Anjos

Jun 22, 2023, 4:09:39 PM
to fhgfs...@googlegroups.com

Hello, Chang,

 

For metadata we have two Dell R750s connected via Fibre Channel to one Unity storage array. It was designed that way, but I would go with a SAS connection between the servers and the storage instead; it has lower latency.

 

We have 11 RAID1 pools across 22 SSDs on the Unity storage. Through Pacemaker we run 5 targets on the first server and 6 on the second one.
The management service is tied to the same group as the first beegfs-metadata service, and its data lives inside the volume of the first mount point, so wherever meta1 is running, the mgmtd is also running.


We use one virtual IP managed by Pacemaker for each service (that is, 12 VIPs: one per meta plus one for the mgmtd).
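
Roughly, the mgmtd placement looks like this in pcs (the group and resource names here are placeholders, not our real ones):

# Add mgmtd to the same group as the first metadata service, right after it:
pcs resource create beegfs-mgmtd systemd:beegfs-mgmtd --group meta1_group --after beegfs-meta1
# storeMgmtdDirectory in beegfs-mgmtd.conf points inside the meta1 mount point,
# so the mgmtd data follows the meta1 volume on failover.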

 

On the storage target nodes we have a very similar setup: there are 32 LUNs running on 4 servers, 8 LUNs per server. Each pair of servers has access to the same storage through SAS, and each server is the failover partner of the other.

 

Let me know if I can be of any further help.

 

D.

 

 

 

Jit Kang CHANG

Jul 2, 2023, 10:23:46 PM
to beegfs-user
Hi Denis,

Sorry for not coming back to you sooner. I was on leave last week while also running some Pacemaker test setups. While I managed to get an HA configuration for the beegfs-mgmtd and beegfs-meta server nodes, I can't yet think of a proper way to implement an HA setup for beegfs-storage.

Let's say we have two storage controllers that both have simultaneous access to 2 JBODs (with 4 LUNs each) through SAS, and we want each controller to handle one set of the LUNs for load-balancing purposes. Similar to the implementation on the management and metadata server nodes, I could implement 2 floating IPs for two sets of beegfs-storage services (one set of LUNs each) and run them in multi-mode using a Pacemaker setup. However, this seems to cause more issues the more we scale, as the number of network interfaces is limited unless we use sub-interfaces for the floating IPs.

I'm not sure if I missed anything, but I believe there are much better implementations out there. I would appreciate any input or suggestions.

Denis Anjos

Jul 4, 2023, 9:35:25 AM
to fhgfs...@googlegroups.com

Hello, Chang,

 

I would create a set of services for each LUN, like:

  • One LUN / mount point
  • One VIP
  • One beegfs-storage@lunX service

 

Then create a group for each set of services (you will end up with 8 different groups)
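
For one LUN that could look roughly like this (device, mount point, IP and netmask are placeholders):

pcs resource create disk-lun1 ocf:heartbeat:Filesystem \
    device=/dev/mapper/lun1 directory=/data/lun1 fstype=xfs --group group_1
pcs resource create VIP-lun1 ocf:heartbeat:IPaddr2 \
    ip=192.168.10.21 cidr_netmask=24 --group group_1
pcs resource create beegfs-storage-lun1 systemd:beegfs-storage@lun1 --group group_1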


Then, in Pacemaker, I would set a location preference constraint for each group, like:

 

for LUNS in 1 2 3 4; do pcs constraint location group_${LUNS} prefers server01=50 server02=100; done

 

for LUNS in 5 6 7 8; do pcs constraint location group_${LUNS} prefers server01=100 server02=50; done

 

 

What do you think?

 

D.

 

 

Jit Kang CHANG

Jul 6, 2023, 10:33:10 PM
to beegfs-user
Hi Denis,

That's exactly what I would do for storage HA for now, and it seems there is no other choice. Nonetheless, it worked pretty well during my testing.
