
Metadata hardware recommendation


Steve Eppert

Jun 28, 2024, 4:16:19 AM
to beegfs-user
Hi
We installed a BeeGFS test system with metadata and storage targets on a single server. We are experiencing slow metadata performance whenever the storage processes saturate the CPU, so we want to split these services onto separate servers.
I am planning to build a dedicated metadata server with
1x AMD EPYC 9174F, 16 cores with high single-core performance
192 GB RAM
12 slots for NVMe drives.
To start, I want to set up 2 NVMe drives (RAID 1) for metadata, and when we notice a bottleneck we will simply add another pair of NVMe drives as a second (third, ...) metadata target, which should increase performance.
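For reference, a quick way to see which metadata services and targets already exist in the test system (a sketch; exact flags may differ between BeeGFS versions and the output will differ per setup):
beegfs-ctl --listnodes --nodetype=meta --details
beegfs-ctl --listtargets --nodetype=meta --state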

Is this a reasonable idea? I did not find any useful hardware recommendations regarding system architecture for metadata targets, which is why I'm hoping to get a few best-practice experiences here.

Thanks!

Jure Pečar

Jun 28, 2024, 4:22:31 AM
to fhgfs...@googlegroups.com
On Fri, 28 Jun 2024 01:16:19 -0700 (PDT)
Steve Eppert <telest...@gmail.com> wrote:

> I did not find any useful hardware recommendation regarding system
> architecture for Metadata targets, that's why I'm hoping to get a few best
> practice experiences here

Metadata is stored in ext4 inodes, which are 4k in size, so look for NVMe drives with the lowest latency and the highest 4k random I/O throughput. We use Optane NVMe drives; not sure if you can still find those, but I hear there are now modern SLC- and TLC-based NVMe drives available that come close to Optane performance.
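A quick way to compare candidate drives on exactly this access pattern is a 4k random-read run with fio (a sketch; the device name and runtime are placeholders):
# 4k random reads, direct I/O, queue depth 32 -- reports IOPS and latency percentiles
fio --name=meta-4k-randread --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 \
    --rw=randread --bs=4k --iodepth=32 --numjobs=4 --runtime=60 --time_based --group_reporting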


--

Jure Pečar

Guan Xin

Jun 30, 2024, 10:17:30 PM
to beegfs-user
Hi,

192 GB does not seem to saturate the memory channels.
Is it configured as 6x 32 GB?

Although BeeGFS is known to be slow for metadata-heavy workloads,
high single-core speed and high memory bandwidth do help.
The large L3 cache of the AMD CPU also seems helpful;
just check that (sub-)NUMA locality has been taken into account.
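A quick sketch for checking the DIMM population and (sub-)NUMA layout, assuming numactl and dmidecode are installed:
# NUMA nodes with their CPUs and memory
numactl --hardware
# populated DIMM slots and speeds
dmidecode -t memory | grep -E 'Size|Speed|Locator'
# example of binding a process to NUMA node 0 and its local memory
numactl --cpunodebind=0 --membind=0 <command>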

That said, we use Intel CPUs+Optane DCPMMs here for beegfs-meta.

You can test whether a 2-SSD RAID 0 improves over a single SSD.
If not, then 4 SSDs probably won't improve over 2 SSDs either.
For our workload, SSD access from the beegfs-meta server is rare,
on the order of 10^3 IOPS,
which is far below the capability of modern NVMe SSDs (10^5 IOPS).
Most metadata access is served from memory.
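A throwaway test of that comparison could stripe two NVMe drives with mdadm and format them like the single-drive target (a sketch; device names are placeholders and the drives are wiped):
# create a 2-drive RAID 0 test array (destroys the contents of both drives)
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nvme1n1 /dev/nvme2n1
# ext4 with a small bytes-per-inode ratio is a common choice for metadata targets
mkfs.ext4 -i 2048 /dev/md0
mount /dev/md0 /mnt/meta-test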

Guan

Steve Eppert

Jul 1, 2024, 3:19:12 AM
to beegfs-user
Hi

These AMD Genoa CPUs have 12 memory channels, so in this configuration it is 12x 16 GB.
Thanks for the information regarding single-core speed and memory bandwidth; that should be covered by this setup.
Did you choose the Optanes based on test results? We are currently using SSDs with ~300k IOPS and are not even close to achieving this performance on BeeGFS, so I assumed there was no room for optimization in choosing something else.

Steve

Waltar

Jul 1, 2024, 2:52:29 PM
to beegfs-user
Hello Steve,
does your "beegfs system" currently consist of only one server?
If so, it is of little use in production; you would be better off with an NFS server over IPoIB, which gives you >11 GB/s to the clients with a 100 Gb IB card and IB switch.
A one-server "beegfs system" is only good for taking a look at installation and configuration; a "beegfs system" for real use starts with a minimum of 2 servers.

When you configure your system properly (e.g. vfs_cache_pressure=30) there will be hardly any disk read access at all, because metadata reads are answered entirely from the (e.g. XFS) filesystem cache
(this does not work with ZFS, because ZFS looks to the kernel like a database application with its own ARC cache, which is 2-3x slower for metadata read requests).
As for metadata writes: an application that creates millions of such requests per second would be terribly written, so reads are the metadata priority, and reads are served from RAM.
If your metadata operations slow down, you have probably run out of IB network capacity, and you should add a further BeeGFS node with 1 meta and 1 storage service instead of going to a dedicated metadata server.
Also, 1 metadata service (MDS) can have exactly 1 metadata target (MDT), which destroys your plan of scaling targets without also scaling metadata services.

As with any distributed system, the MTBF goes down drastically as the number of involved hardware parts and services goes up (think roughly 1000x worse than a single NFS file server).
So as a rule of thumb you should design a distributed system to be as scalable and simple as possible, and be prepared for unexpected downtimes.
You could begin with 2 servers: on the first run 1 mgmt, 1 meta and 1 storage service; on the second run 1 meta and 1 storage service.
When you scale up, think about a separate mgmt server with exactly the same hardware as the meta/storage servers, so you can quickly swap a failing node for the mgmt one
(or, if the mgmt node fails, move that service to the first meta/storage node). If you have many TB to PB inside a BeeGFS, you won't want a long downtime in case of any failure.
That again can be improved with external RAID storage systems connected to at least 2 servers (so volumes can be moved between them online).

Be careful if you look at a ZFS backend: metadata storage usage is really bad, metadata ARC access is slow, and for data streaming you need about double the disks for the same bandwidth as HW RAID with XFS.
Last but not least, ZFS easily runs into import errors after a kernel crash or power outage, and it is a lottery whether a full pool restore is required afterwards (where is your backup then?) ...
So weigh the dreamy features of ZFS against reality carefully. The easiest approach is to first build two single-server setups without BeeGFS, compare metadata and streaming performance, pull the power plug while writing, and see which filesystem is still there afterwards.
That's enough without going into config details while no hardware details are available.

Guan Xin

Jul 2, 2024, 12:27:37 AM
to beegfs-user
Hi,

We chose Optane DCPMM because its access latency is comparable to the network round-trip time,
while SSD access latency is about 10^2 higher.
Running find (GNU findutils) through our BeeGFS instance shows that accessing cold beegfs-meta data in DCPMMs is almost as fast as accessing cached data.
It might not make much difference to use SSDs when metadata is mostly cached in memory.
In practice there might be no loss if DCPMM is not an option for you.

Guan

Steve Eppert

Jul 15, 2024, 9:30:37 AM
to beegfs-user
Hi
Thanks for your thoughts!
We are using 3 servers, each with a metadata and a storage target (which matches your hardware recommendation).
I don't understand why having multiple servers, each with one meta service (combined with a storage service), is better than having one server with multiple meta services, each with its own meta target. Especially at times of high streaming load that occupies the CPU and the network, I would expect a metadata performance benefit from a dedicated meta server.
Also, in terms of extensibility the dedicated meta server looks more flexible to me. If metadata becomes a problem, simply buying two NVMe drives for 1-2000 bucks looks more convenient than purchasing and installing a new server (which would also decrease my MTBF).

Best
Steve

Waltar

Jul 16, 2024, 10:59:38 PM
to beegfs-user
Hello Steve,
if you look into the metadata config file you will see that the default for worker cores/threads is "0", which means use all available cores or threads (with HT enabled); it should not be limited to less.
Nevertheless, if you define more than 1 meta service on a server:
1) you cannot get more cores/threads out of your server than are physically installed,
2) you cannot get more read IOPS out of further NVMe drives, because when configured properly the reads come 100% from the XFS filesystem cache with no disk IOPS at all,
3) with a further combined meta+storage file server you spread your meta and storage throughput across more nodes, giving higher performance per client, and
4) you can still add meta-only servers, but it is uncommon to have more meta than storage servers; but yeah, BeeGFS can do it if you want to ... :-)
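The setting referred to above is presumably tuneNumWorkers in /etc/beegfs/beegfs-meta.conf; a sketch of the relevant line:
# beegfs-meta.conf -- 0 = auto-detect and use all available cores/threads
tuneNumWorkers = 0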
In the case of ZFS it is a little different, because the ARC is not as fast as the Linux page cache, only about a quarter. Test it yourself on one server with just the installed OS data, e.g.:
cd /usr ; tar cf /pool-name/dataset-name/test.tar bin etc games include lib lib64 libexec local sbin share src /etc ; ls -lh /pool-name/dataset-name/test.tar ; dd if=/pool-name/dataset-name/test.tar of=/dev/null bs=32k
cd /usr ; tar cf /xfs-mount/test.tar bin etc games include lib lib64 libexec local sbin share src /etc ; ls -lh /xfs-mount/test.tar ; dd if=/xfs-mount/test.tar of=/dev/null bs=32k
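To be sure the second read really comes from cache and not from disk, one can repeat the dd and watch the disk in parallel; a sketch:
# repeat the cached read; r/s in iostat (run in a second terminal) should stay near zero
dd if=/xfs-mount/test.tar of=/dev/null bs=32k
iostat -x 1 5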

Steve Eppert

Jul 29, 2024, 6:30:15 AM
to beegfs-user
Hey
Assuming that 1) is correct, wouldn't it be a dramatic problem to have meta and storage on one node? In that case the XFS filesystem cache would be flooded with storage data instead of the "good" metadata, wouldn't it?
All the benchmarks I found on the internet used dedicated meta and storage servers in their installation. Assuming they wanted maximum performance, why didn't they choose a combined installation?
I'm not planning to use ZFS at all because of its known lower performance compared with ext4 and XFS.
Steve

Waltar

Jul 29, 2024, 9:40:36 AM
to beegfs-user
Hello Steve, you should set "echo 30 > /proc/sys/vm/vfs_cache_pressure" and "echo 1000 > /proc/sys/fs/xfs/xfssyncd_centisecs".
The description of vfs_cache_pressure (value 0...100, default 100) is quite hard to understand; 100 means the cache behaves 100% like a FIFO.
Setting it to about 30 means the "learned" metadata is not overwritten as in "100% FIFO" mode and is only updated for new and changed files until reboot.
To see how much metadata you have on a BeeGFS node, reboot it, check your current memory usage (e.g. with top or /proc/meminfo) and then run these 3 commands:
"find /mountp/<beegfs-storage> -ls >/dev/null ; find /mountp/<beegfs-meta> -ls >/dev/null ; du -sh /mountp/<beegfs-storage> /mountp/<beegfs-meta> >/dev/null"
Then check your memory usage again; perhaps about 5 GB more is used now.
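To make these settings survive a reboot, the same values can go into a sysctl drop-in; a sketch, the file name is arbitrary:
# /etc/sysctl.d/90-beegfs-meta.conf
vm.vfs_cache_pressure = 30
fs.xfs.xfssyncd_centisecs = 1000
# apply without a reboot
sysctl --system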
Assume you have 11 BeeGFS servers, each with e.g. 36 cores / 72 threads, and consider these configurations:
A) 2x meta-only servers, 8x storage-only servers, 1x mgmt:
Your maximum BeeGFS bandwidth could be 8x your IB/OPA interface. Your metadata, e.g. 10 GB, is split across 2 meta servers, each holding about 5 GB with 2x 72 threads serving it.
B) 10x combined meta+storage servers, 1x mgmt:
Your maximum BeeGFS bandwidth could be 10x your IB/OPA interface. Your metadata, e.g. 10 GB, is split across 10 meta servers, each holding only about 1 GB, with 720 threads serving it in total.
With vfs_cache_pressure set as above, the metadata stays in the cache, and client metadata requests can be answered immediately by the threads out of cache RAM.
Client data requests normally go to disk storage, since they cannot all be served from the filesystem cache, and that gives the metadata requests the "higher" priority on a combined meta+storage BeeGFS node.
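Whether the metadata really sits in RAM can be spot-checked on a node by looking at the page cache and the inode/dentry slabs; a sketch:
# page cache and reclaimable slab memory
grep -E 'Cached|SReclaimable' /proc/meminfo
# largest slab caches; xfs_inode and dentry should grow after the find/du warm-up above
slabtop -o | head -n 15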
#
In a non-uniform BeeGFS server config you can choose different server models, but you need more servers than in a uniform environment with combined services, which gives better throughput.
And I would never expect a small number of dedicated meta servers to be faster than a fully distributed setup; the more BeeGFS servers exist, the more data is inside, and the requirement for fast metadata performance only increases.
But nevertheless, BeeGFS can be configured however anyone wishes and feels comfortable with, and that's pretty cool anyway :-)

Steve Eppert

Jul 29, 2024, 2:45:22 PM
to beegfs-user
I checked the BeeGFS meta NVMe drives and can see 40 TB read after ~5 months. The server rebooted twice within these 5 months. I can hardly believe that most of the metadata came from system memory when 40 TB were read from the SSDs (the meta NVMe is only ~30 GB full).
Running meta and storage on one node also means the threads are shared between meta and storage. I could not see the beegfs-meta process using more than 200% CPU, while I could see the beegfs-storage processes using 1600% CPU and more in higher-workload scenarios.
I can follow your theory, but it does not match my measurements.
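For reference, this kind of lifetime read counter can be pulled from the NVMe SMART data; a sketch, the device name is a placeholder and "Data Units Read" is counted in 512,000-byte units:
# lifetime host reads reported by the drive
smartctl -a /dev/nvme0n1 | grep -i 'data units read'
# or the full SMART log via the nvme CLI
nvme smart-log /dev/nvme0n1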

Steve

Waltar

Jul 30, 2024, 3:07:18 AM
to beegfs-user
Without setting vfs_cache_pressure (default 100) the cache is permanently flooded in FIFO mode, so most metadata requests must be read again from the metadata target, which is what you see in your installation.
It is also expected not to see more than 200% CPU during metadata operations, as the data is small and most of the time is simply lost in communication.
Here is a small BeeGFS cluster with a combined filesystem for the BeeGFS daemons on "sda", while doing only metadata read operations:
Total results for 2 nodes:          # beegfs-ctl --nodetype=meta --serverstats
 time_index   reqs   qlen bsy
 1722318428   9807      0   0
 1722318429   9743      0   2
 1722318430   9457      0   1
 1722318431   9944      2   0
 1722318432   9431      0   2
 1722318433   9553      1   0
 1722318434   9724      0   0
 1722318435   9224      1   1
 1722318436   9805      0   0
 1722318437  10668      0   2
 1722318438  10234      0   1
 1722318439   9922      1   0
 1722318440   9497      0   0
 1722318441   9888      0   0
 1722318442   9061      0   1
Total results for 2 nodes:          # beegfs-ctl --nodetype=storage --serverstats
 time_index write_KiB  read_KiB   reqs   qlen bsy
 1722318428         1         0     24      0   0
 1722318429         1         0     27      0   0
 1722318430         1         0     24      0   0
 1722318431         1         0     23      0   0
 1722318432         1         0     15      0   0
 1722318433         1         0     20      0   0
 1722318434         1         0     24      0   0
 1722318435         1         0     21      0   0
 1722318436         1       128     69      0   0
 1722318437         1       132    109      0   0
 1722318438         1         0     27      0   0
 1722318439         1         0     20      0   0
 1722318440         1         0     18      0   0
 1722318441         1         0     16      0   0
 1722318442         1         0     26      0   0
#    iostat running in parallel
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.69    0.00    0.63    0.00    0.00   98.69
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00     0.00    0.00    2.00     0.00     0.01     8.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    2.00     0.00     0.01     8.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.81    0.00    0.63    0.00    0.00   98.56
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    4.00     0.00     0.11    54.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00     0.00    0.00    1.00     0.00     0.00     8.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    1.00     0.00     0.00     8.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.75    0.00    0.63    0.00    0.00   98.62
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.75    0.00    0.63    0.00    0.00   98.62
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.85    0.00    0.78    0.00    0.00   98.37
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00


Waltar

Jul 30, 2024, 3:07:18 AM
to beegfs-user
And as you can see, if you have a dedicated meta server with only 30 GB used and an OPA/IB interface, it is basically a wasted hardware resource.
On the dedicated storage servers you have high CPU usage. So why not use the "meta" server for storage as well and bring down the load on the others?
If you bring that load down from e.g. >1600% to about 1400-1500% by integrating the meta-only server as a meta+storage server, you free up resources there ...
and with those freed resources you can turn the single metadata service into more metadata services, which parallelizes the requests and again reduces the time lost in metadata communication ...
So in the end you arrive at what I said at the beginning ... :-) But since BeeGFS is so flexible, everybody can configure it however they think best and feel comfortable with; that is exactly what I like about BeeGFS !! :-)

Waltar

Jul 30, 2024, 3:21:28 AM
to beegfs-user
And don't forget that the "vfs_cache_pressure" setting does NOT have any effect on ZFS, because ZFS looks like an application to the kernel and has its own memory, cache and I/O scheduler.