
Metadata hardware recommendation


Steve Eppert

Jun 28, 2024, 4:16:19 AM
to beegfs-user
Hi
We installed a BeeGFS test system with metadata and storage targets on a single server. We are experiencing slow metadata performance whenever the storage processes saturate the CPU, so we want to split these services onto separate servers.
I am planning to build a dedicated metadata server with
1x AMD EPYC 9174F, 16 cores with high single-core performance
192 GB RAM
12 slots for NVMe drives.
To start, I want to set up 2 NVMe drives (RAID 1) for metadata, and when we notice a bottleneck we will simply add another pair of NVMe drives as a second (third, ...) metadata target, which should increase performance.
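For reference, a quick way to see which metadata services and targets already exist in the test system (a sketch; exact flags may differ between BeeGFS versions and the output will differ per setup):
beegfs-ctl --listnodes --nodetype=meta --details
beegfs-ctl --listtargets --nodetype=meta --state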

Is this a reasonable idea? I did not find any useful hardware recommendations regarding system architecture for metadata targets, which is why I'm hoping to get a few best-practice experiences here.

Thanks!

Jure Pečar

Jun 28, 2024, 4:22:31 AM
to fhgfs...@googlegroups.com
On Fri, 28 Jun 2024 01:16:19 -0700 (PDT)
Steve Eppert <telest...@gmail.com> wrote:

> I did not find any useful hardware recommendation regarding system
> architecture for Metadata targets, that's why I'm hoping to get a few best
> practice experiences here

Metadata is stored in ext4 inodes, which are 4k in size, so look for NVMe drives with the lowest latency and the highest 4k random I/O throughput. We use Optane NVMe drives; not sure if you can still find those, but I hear there are now modern SLC- and TLC-based NVMe drives available that come close to Optane performance.
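A quick way to compare candidate drives on exactly this access pattern is a 4k random-read run with fio (a sketch; the device name and runtime are placeholders):
# 4k random reads, direct I/O, queue depth 32 -- reports IOPS and latency percentiles
fio --name=meta-4k-randread --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 \
    --rw=randread --bs=4k --iodepth=32 --numjobs=4 --runtime=60 --time_based --group_reporting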


--

Jure Pečar

Guan Xin

Jun 30, 2024, 10:17:30 PM
to beegfs-user
Hi,

192 GB does not seem to saturate the memory channels.
Is it configured as 6x 32 GB?

Although BeeGFS is known to be slow for metadata-heavy workloads,
high single-core speed and high memory bandwidth do help.
The large L3 cache of the AMD CPU also seems helpful;
just check that (sub-)NUMA locality has been taken into account.
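A quick sketch for checking the DIMM population and (sub-)NUMA layout, assuming numactl and dmidecode are installed:
# NUMA nodes with their CPUs and memory
numactl --hardware
# populated DIMM slots and speeds
dmidecode -t memory | grep -E 'Size|Speed|Locator'
# example of binding a process to NUMA node 0 and its local memory
numactl --cpunodebind=0 --membind=0 <command>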

That said, we use Intel CPUs+Optane DCPMMs here for beegfs-meta.

You can test whether a 2-SSD RAID 0 improves over a single SSD.
If not, then 4 SSDs probably won't improve over 2 SSDs either.
For our workload, SSD access from the beegfs-meta server is rare,
on the order of 10^3 IOPS,
which is far below the capability of modern NVMe SSDs (10^5 IOPS).
Most metadata access is served from memory.
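A throwaway test of that comparison could stripe two NVMe drives with mdadm and format them like the single-drive target (a sketch; device names are placeholders and the drives are wiped):
# create a 2-drive RAID 0 test array (destroys the contents of both drives)
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nvme1n1 /dev/nvme2n1
# ext4 with a small bytes-per-inode ratio is a common choice for metadata targets
mkfs.ext4 -i 2048 /dev/md0
mount /dev/md0 /mnt/meta-test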

Guan

Steve Eppert

Jul 1, 2024, 3:19:12 AM
to beegfs-user
Hi

These AMD Genoa CPUs have 12 memory channels, so in this configuration it is 12x 16 GB.
Thanks for the information regarding single-core speed and memory bandwidth; that should be covered by this setup.
Did you choose the Optanes based on test results? We are currently using SSDs with ~300k IOPS and are not even close to achieving this performance on BeeGFS, so I assumed there was no room for optimization in choosing something else.

Steve

Waltar

Jul 1, 2024, 2:52:29 PM
to beegfs-user
Hello Steve,
does your "beegfs system" currently consist of only one server?
If so, it is of little use in production; you would be better off with an NFS server over IPoIB, which gives you >11 GB/s to the clients with a 100 Gb IB card and IB switch.
A one-server "beegfs system" is only good for taking a look at installation and configuration; a "beegfs system" for real use starts with a minimum of 2 servers.

When you configure your system properly (e.g. vfs_cache_pressure=30) there will be hardly any disk read access at all, because metadata reads are answered entirely from the (e.g. XFS) filesystem cache
(this does not work with ZFS, because ZFS looks to the kernel like a database application with its own ARC cache, which is 2-3x slower for metadata read requests).
As for metadata writes: an application that creates millions of such requests per second would be terribly written, so reads are the metadata priority, and reads are served from RAM.
If your metadata operations slow down, you have probably run out of IB network capacity, and you should add a further BeeGFS node with 1 meta and 1 storage service instead of going to a dedicated metadata server.
Also, 1 metadata service (MDS) can have exactly 1 metadata target (MDT), which destroys your plan of scaling targets without also scaling metadata services.

As with any distributed system, the MTBF goes down drastically as the number of involved hardware parts and services goes up (think roughly 1000x worse than a single NFS file server).
So as a rule of thumb you should design a distributed system to be as scalable and simple as possible, and be prepared for unexpected downtimes.
You could begin with 2 servers: on the first run 1 mgmt, 1 meta and 1 storage service; on the second run 1 meta and 1 storage service.
When you scale up, think about a separate mgmt server with exactly the same hardware as the meta/storage servers, so you can quickly swap a failing node for the mgmt one
(or, if the mgmt node fails, move that service to the first meta/storage node). If you have many TB to PB inside a BeeGFS, you won't want a long downtime in case of any failure.
That again can be improved with external RAID storage systems connected to at least 2 servers (so volumes can be moved between them online).

Be careful if you look at a ZFS backend: metadata storage usage is really bad, metadata ARC access is slow, and for data streaming you need about double the disks for the same bandwidth as HW RAID with XFS.
Last but not least, ZFS easily runs into import errors after a kernel crash or power outage, and it is a lottery whether a full pool restore is required afterwards (where is your backup then?) ...
So weigh the dreamy features of ZFS against reality carefully. The easiest approach is to first build two single-server setups without BeeGFS, compare metadata and streaming performance, pull the power plug while writing, and see which filesystem is still there afterwards.
That's enough without going into config details while no hardware details are available.

Guan Xin

Jul 2, 2024, 12:27:37 AM
to beegfs-user
Hi,

We chose Optane DCPMM because its access latency is comparable to the network round-trip time,
while SSD access latency is about 10^2 higher.
Running find (GNU findutils) through our BeeGFS instance shows that accessing cold beegfs-meta data in DCPMMs is almost as fast as accessing cached data.
It might not make much difference to use SSDs when metadata is mostly cached in memory.
In practice there might be no loss if DCPMM is not an option for you.

Guan

Steve Eppert

Jul 15, 2024, 9:30:37 AM
to beegfs-user
Hi
Thanks for your thoughts!
We are using 3 servers, each with a metadata and a storage target (which matches your hardware recommendation).
I don't understand why having multiple servers, each with one meta service (combined with a storage service), is better than having one server with multiple meta services, each with its own meta target. Especially at times of high streaming load that occupies the CPU and the network, I would expect a metadata performance benefit from a dedicated meta server.
Also, in terms of extensibility the dedicated meta server looks more flexible to me. If metadata becomes a problem, simply buying two NVMe drives for 1-2000 bucks looks more convenient than purchasing and installing a new server (which would also decrease my MTBF).

Best
Steve

Waltar

Jul 16, 2024, 10:59:38 PM
to beegfs-user
Hello Steve,
if you look into the metadata config file you will see that the default for worker cores/threads is "0", which means use all available cores or threads (with HT enabled); it should not be limited to less.
Nevertheless, if you define more than 1 meta service on a server:
1) you cannot get more cores/threads out of your server than are physically installed,
2) you cannot get more read IOPS out of further NVMe drives, because when configured properly the reads come 100% from the XFS filesystem cache with no disk IOPS at all,
3) with a further combined meta+storage file server you spread your meta and storage throughput across more nodes, giving higher performance per client, and
4) you can still add meta-only servers, but it is uncommon to have more meta than storage servers; but yeah, BeeGFS can do it if you want to ... :-)
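The setting referred to above is presumably tuneNumWorkers in /etc/beegfs/beegfs-meta.conf; a sketch of the relevant line:
# beegfs-meta.conf -- 0 = auto-detect and use all available cores/threads
tuneNumWorkers = 0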
In the case of ZFS it is a little different, because the ARC is not as fast as the Linux page cache, only about a quarter. Test it yourself on one server with just the installed OS data, e.g.:
cd /usr ; tar cf /pool-name/dataset-name/test.tar bin etc games include lib lib64 libexec local sbin share src /etc ; ls -lh /pool-name/dataset-name/test.tar ; dd if=/pool-name/dataset-name/test.tar of=/dev/null bs=32k
cd /usr ; tar cf /xfs-mount/test.tar bin etc games include lib lib64 libexec local sbin share src /etc ; ls -lh /xfs-mount/test.tar ; dd if=/xfs-mount/test.tar of=/dev/null bs=32k
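To be sure the second read really comes from cache and not from disk, one can repeat the dd and watch the disk in parallel; a sketch:
# repeat the cached read; r/s in iostat (run in a second terminal) should stay near zero
dd if=/xfs-mount/test.tar of=/dev/null bs=32k
iostat -x 1 5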

Steve Eppert

Jul 29, 2024, 6:30:15 AM
to beegfs-user
Hey
Assuming that 1) is correct, wouldn't it be a dramatic problem to have meta and storage on one node? In that case the XFS filesystem cache would be flooded with storage data instead of the "good" metadata, wouldn't it?
All the benchmarks I found on the internet used dedicated meta and storage servers in their installation. Assuming they wanted maximum performance, why didn't they choose a combined installation?
I'm not planning to use ZFS at all because of its known lower performance compared with ext4 and XFS.
Steve

Waltar

Jul 29, 2024, 9:40:36 AM
to beegfs-user
Hello Steve, you should set "echo 30 > /proc/sys/vm/vfs_cache_pressure" and "echo 1000 > /proc/sys/fs/xfs/xfssyncd_centisecs".
The description of vfs_cache_pressure (value 0...100, default 100) is quite hard to understand; 100 means the cache behaves 100% like a FIFO.
Setting it to about 30 means the "learned" metadata is not overwritten as in "100% FIFO" mode and is only updated for new and changed files until reboot.
To see how much metadata you have on a BeeGFS node, reboot it, check your current memory usage (e.g. with top or /proc/meminfo) and then run these 3 commands:
"find /mountp/<beegfs-storage> -ls >/dev/null ; find /mountp/<beegfs-meta> -ls >/dev/null ; du -sh /mountp/<beegfs-storage> /mountp/<beegfs-meta> >/dev/null"
Then check your memory usage again; perhaps about 5 GB more is used now.
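To make these settings survive a reboot, the same values can go into a sysctl drop-in; a sketch, the file name is arbitrary:
# /etc/sysctl.d/90-beegfs-meta.conf
vm.vfs_cache_pressure = 30
fs.xfs.xfssyncd_centisecs = 1000
# apply without a reboot
sysctl --system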
Assume you have 11 BeeGFS servers, each with e.g. 36 cores / 72 threads, and consider these configurations:
A) 2x meta-only servers, 8x storage-only servers, 1x mgmt:
Your maximum BeeGFS bandwidth could be 8x your IB/OPA interface. Your metadata, e.g. 10 GB, is split across 2 meta servers, each holding about 5 GB with 2x 72 threads serving it.
B) 10x combined meta+storage servers, 1x mgmt:
Your maximum BeeGFS bandwidth could be 10x your IB/OPA interface. Your metadata, e.g. 10 GB, is split across 10 meta servers, each holding only about 1 GB, with 720 threads serving it in total.
With vfs_cache_pressure set as above, the metadata stays in the cache, and client metadata requests can be answered immediately by the threads out of cache RAM.
Client data requests normally go to disk storage, since they cannot all be served from the filesystem cache, and that gives the metadata requests the "higher" priority on a combined meta+storage BeeGFS node.
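Whether the metadata really sits in RAM can be spot-checked on a node by looking at the page cache and the inode/dentry slabs; a sketch:
# page cache and reclaimable slab memory
grep -E 'Cached|SReclaimable' /proc/meminfo
# largest slab caches; xfs_inode and dentry should grow after the find/du warm-up above
slabtop -o | head -n 15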
#
In a non-uniform BeeGFS server config you can choose different server models, but you need more servers than in a uniform environment with combined services, which gives better throughput.
And I would never expect a small number of dedicated meta servers to be faster than a fully distributed setup; the more BeeGFS servers exist, the more data is inside, and the requirement for fast metadata performance only increases.
But nevertheless, BeeGFS can be configured however anyone wishes and feels comfortable with, and that's pretty cool anyway :-)

Steve Eppert

Jul 29, 2024, 2:45:22 PM
to beegfs-user
I checked the BeeGFS meta NVMe drives and can see 40 TB read after ~5 months. The server rebooted twice within these 5 months. I can hardly believe that most of the metadata came from system memory when 40 TB were read from the SSDs (the meta NVMe is only ~30 GB full).
Running meta and storage on one node also means the threads are shared between meta and storage. I could not see the beegfs-meta process using more than 200% CPU, while I could see the beegfs-storage processes using 1600% CPU and more in higher-workload scenarios.
I can follow your theory, but it does not match my measurements.
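For reference, this kind of lifetime read counter can be pulled from the NVMe SMART data; a sketch, the device name is a placeholder and "Data Units Read" is counted in 512,000-byte units:
# lifetime host reads reported by the drive
smartctl -a /dev/nvme0n1 | grep -i 'data units read'
# or the full SMART log via the nvme CLI
nvme smart-log /dev/nvme0n1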

Steve

Waltar

Jul 30, 2024, 3:07:18 AM
to beegfs-user
Without setting vfs_cache_pressure (default 100) the cache is permanently flooded in FIFO mode, so most metadata requests must be read again from the metadata target, which is what you see in your installation.
It is also expected not to see more than 200% CPU during metadata operations, as the data is small and most of the time is simply lost in communication.
Here is a small BeeGFS cluster with a combined filesystem for the BeeGFS daemons on "sda", while doing only metadata read operations:
Total results for 2 nodes:          # beegfs-ctl --nodetype=meta --serverstats
 time_index   reqs   qlen bsy
 1722318428   9807      0   0
 1722318429   9743      0   2
 1722318430   9457      0   1
 1722318431   9944      2   0
 1722318432   9431      0   2
 1722318433   9553      1   0
 1722318434   9724      0   0
 1722318435   9224      1   1
 1722318436   9805      0   0
 1722318437  10668      0   2
 1722318438  10234      0   1
 1722318439   9922      1   0
 1722318440   9497      0   0
 1722318441   9888      0   0
 1722318442   9061      0   1
Total results for 2 nodes:          # beegfs-ctl --nodetype=storage --serverstats
 time_index write_KiB  read_KiB   reqs   qlen bsy
 1722318428         1         0     24      0   0
 1722318429         1         0     27      0   0
 1722318430         1         0     24      0   0
 1722318431         1         0     23      0   0
 1722318432         1         0     15      0   0
 1722318433         1         0     20      0   0
 1722318434         1         0     24      0   0
 1722318435         1         0     21      0   0
 1722318436         1       128     69      0   0
 1722318437         1       132    109      0   0
 1722318438         1         0     27      0   0
 1722318439         1         0     20      0   0
 1722318440         1         0     18      0   0
 1722318441         1         0     16      0   0
 1722318442         1         0     26      0   0
#    iostat running in parallel
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.69    0.00    0.63    0.00    0.00   98.69
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00     0.00    0.00    2.00     0.00     0.01     8.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    2.00     0.00     0.01     8.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.81    0.00    0.63    0.00    0.00   98.56
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    4.00     0.00     0.11    54.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00     0.00    0.00    1.00     0.00     0.00     8.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    1.00     0.00     0.00     8.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.75    0.00    0.63    0.00    0.00   98.62
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.75    0.00    0.63    0.00    0.00   98.62
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.85    0.00    0.78    0.00    0.00   98.37
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00


Waltar

Jul 30, 2024, 3:07:18 AM
to beegfs-user
And as you can see, if you have a dedicated meta server with only 30 GB used and an OPA/IB interface, it is basically a wasted hardware resource.
On the dedicated storage servers you have high CPU usage. So why not use the "meta" server for storage as well and bring down the load on the others?
If you bring that load down from e.g. >1600% to about 1400-1500% by integrating the meta-only server as a meta+storage server, you free up resources there ...
and with those freed resources you can turn the single metadata service into more metadata services, which parallelizes the requests and again reduces the time lost in metadata communication ...
So in the end you arrive at what I said at the beginning ... :-) But since BeeGFS is so flexible, everybody can configure it however they think best and feel comfortable with; that is exactly what I like about BeeGFS !! :-)

Waltar

Jul 30, 2024, 3:21:28 AM
to beegfs-user
And don't forget that the "vfs_cache_pressure" setting does NOT have any effect on ZFS, because ZFS looks like an application to the kernel and has its own memory, cache and I/O scheduler.