BeeGFS slow for small file access


Ben Cowan

Jun 18, 2020, 3:42:25 AM
to fhgfs...@googlegroups.com
Hello,

I have a BeeGFS server set up that is supporting a small development cluster. (I need a parallel filesystem to use MPI-IO applications.) I've found that building code in the BeeGFS space is about 80% slower than in an NFS volume shared from the same server. Tests with fio confirm this behavior, with over 2 orders of magnitude fewer IOPS on BeeGFS vs. NFS.

I've reproduced this on a small test system I've set up to try different tuning options to fix this, but so far I've come up empty. The BeeGFS test system consists of a single server, with the following specs:

CPU: Intel Core i7-6700 (quad core @ 3.4 GHz)
16 GB DDR4-2133 RAM
2x 1 TB SATA HDDs
Intel Optane SSD 900P, 280 GB, PCIe x4
Single GbE network connection
CentOS 7.8

The SSD is split into 60 GB and 220 GB partitions. There's a ZFS pool with the 2 HDDs striped and the 60 GB SSD partition used as a SLOG. I've created 2 datasets from the ZFS pool, one shared over NFS with default parameters, and the other set as the BeeGFS (7.1.5) storage target. The 220 GB SSD partition is used for BeeGFS metadata.
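
In rough commands (the device and dataset names here are placeholders, not the exact ones used), the layout is something like:

    zpool create tank sda sdb log nvme0n1p1    # two HDDs striped, 60 GB SSD partition as SLOG
    zfs create tank/nfs                        # dataset exported over NFS with default options
    zfs create tank/beegfs                     # dataset used as the BeeGFS storage target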

I test both shares from a Linux client, using the fio command line

fio --randrepeat=1 --ioengine=libaio --direct=0 --gtod_reduce=1 --name=test --filename=random_write.fio --bs=4k --iodepth=64 --size=4G --readwrite=randwrite

to produce small, random writes to mimic a build process.
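
A variant that spreads the same kind of I/O over many small files is probably even closer to a build workload; the directory, file count, and sizes here are just illustrative:

fio --name=manyfiles --directory=/mnt/beegfs/fiotest --nrfiles=2000 --size=32M --bs=4k --rw=randwrite --ioengine=libaio --iodepth=16 --direct=0 --gtod_reduce=1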

On NFS, I get the result

iops : min=233364, max=507468, avg=438662.50, stdev=136865.93, samples=4

While on BeeGFS

iops : min= 1818, max= 3250, avg=2128.17, stdev=167.49, samples=984

Monitoring the server with top while the test is underway, there is a single beegfs-storage process using 55–60% of one CPU core; the other cores appear idle. The system isn't running out of memory; there is about 6 GB still available while the test is running.

I'd appreciate any pointers you can give for improving the IOPS/small-file performance of my BeeGFS system.

Thanks,
Ben

Pinkesh Valdria

Jun 18, 2020, 6:32:53 AM
to fhgfs...@googlegroups.com
What options are you using when mounting NFS? Check whether some of them relate to metadata caching; that makes a difference in performance. Also, BeeGFS, like most parallel filesystems, is designed for parallel jobs, which means minimal caching so that all clients see the most up-to-date file data. Optional features were later added to some of them to allow client-side caching. By default, BeeGFS has a setting of 1 sec for file attribute caching in beegfs-meta.conf. Try raising that a little and test, but deviating a lot from the default is not recommended.
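
On the NFS side, the relevant knobs are the attribute-cache and lookup-cache mount options described in nfs(5). As a quick check, you can remount with caching disabled and repeat the fio run to see how much of the NFS number comes from client-side caching (the export path below is just a placeholder):

    mount -t nfs -o vers=4.2,noac,lookupcache=none server:/tank/nfs /mnt/nfs_nocache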

What kind of workload is this? Small files (typical file size: bytes, X-XX KB, XXX KB, X MB...?), and how many files (X million?)?

  



--
Thanks,
Pinkesh Valdria
Singapore: +65 8932-3639
USA: +1 206-234-4314 (cancelled)

Schweiss, Chip

Jun 18, 2020, 7:49:09 AM
to fhgfs...@googlegroups.com
I have also found that not using RDMA imposes a huge metadata performance penalty. It's probably the same for small files. Latency differences stack up significantly.



Ben Cowan

Jun 19, 2020, 3:03:19 AM
to beegfs-user
Thanks. It sounds like BeeGFS incurs the overhead of a larger number of network transactions for each small-file access in order to support its parallel capabilities. Is this correct? And is RDMA even possible with commodity GbE hardware?

b...@bencowan.org

Jun 19, 2020, 3:03:20 AM
to beegfs-user
Thanks for your response. I don’t think there are any caching options with the NFS share—I’m not using fscache, for instance. The options reported by mount (omitting addresses) are:

    rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,local_lock=none
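
Even without explicit ac*/actimeo options in that line, I assume the default NFS attribute-cache timers still apply; the opts: line in /proc/self/mountstats on the client should show the effective values:

    grep -A2 'fstype nfs' /proc/self/mountstats    # the opts: line lists the acregmin/acregmax/acdirmin/acdirmax timers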

I don’t see a time setting in beegfs-meta.conf. Is storeClientXAttrs a relevant setting?

For the actual workload of a code build, there are 5000–10000 files generated, with typical sizes ranging from 10 kB to 1 MB; there are also a large number of small-file reads. The fio test is just 1M random writes of 4 kB each.

bingbing zhu

Sep 10, 2021, 12:12:28 PM
to beegfs-user
@Ben
Hi, I'm running into the same problem. Have you found a good solution? ;-)

Guan Xin

Sep 11, 2021, 4:27:24 AM
to beegfs-user
The compiler needs massive access to header files if your source is C or similar.
Try putting most headers on a local filesystem, or on a remote filesystem with better caching, like NFS.

Working within a filesystem image on BeeGFS also improves performance.
Just create a filesystem image on BeeGFS, put your source code inside, and compile.
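
A rough sketch of that approach, assuming an ext4 image mounted through a loop device (the path and size are placeholders, and mounting needs root):

    truncate -s 20G /mnt/beegfs/build.img      # sparse image file on BeeGFS
    mkfs.ext4 -F /mnt/beegfs/build.img         # create a filesystem inside the image
    mount -o loop /mnt/beegfs/build.img /mnt/build
    # copy the source tree into /mnt/build and compile there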

Guan

Guan Xin

Sep 11, 2021, 4:42:01 AM
to beegfs-user
> While on BeeGFS
>
> iops : min= 1818, max= 3250, avg=2128.17, stdev=167.49, samples=984
>
> Monitoring the server with top while the test is underway, there is a single beegfs-storage process using 55–60% of one CPU core; the other cores appear idle.

So beegfs-meta is doing nothing, while on a many-core system it can use up to 600% CPU (so on this system it should use 300+%)?
Then the problem might be something other than the metadata workload.

Guan

Dr. Thomas Orgis

Sep 11, 2021, 9:07:44 AM
to fhgfs...@googlegroups.com
On Sat, 11 Sep 2021 01:27:24 -0700 (PDT), Guan Xin <guan...@gmail.com> wrote:

> The compiler needs massive access to header files if your source is C or
> similar.
> Try putting most headers on a local filesystem, or on a remote filesystem
> with better caching, like NFS.

… or you do that trick of re-exporting the BeeGFS to the client from
the client itself via NFS. Repeated reads, which is the case with
builds and compiler invocations, get a massive boost from the page
cache that way and the BeeGFS server is not touched for the most part.
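
In rough terms, and assuming the kernel NFS server can actually re-export the BeeGFS mount on your kernel/BeeGFS combination (worth verifying), the loopback re-export looks something like this:

    echo '/mnt/beegfs localhost(rw,no_subtree_check,fsid=1)' >> /etc/exports
    exportfs -ra
    mount -t nfs localhost:/mnt/beegfs /mnt/beegfs_cached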

> Working within a filesystem image on BeeGFS also improves performance.
> Just create a filesystem image on BeeGFS, put your source code inside, and
> compile.

I presume that is because you're talking to a fixed set of storage
targets that this one big image is distributed on, each possibly also
benefitting from its own cache (and even prefetch?) instead of asking
the metadata server for the storage targets for each one of the little
files again and again, then contact those targets anew … with small
files directly stored with the metadata this could be a lot faster, but
for a system created for concurrent continuous streaming, it is still
the worst case of operation.

Anyhow — proper client-side caching seems to be something that probably
has to be considered to keep BeeGFS a good option also for the
entry-level HPC systems it has its biggest market share on
(just guessing). That or a separate NFS for the non-parallel
ignorant users to avoid them killing the BeeGFS performance for
everyone else.

A working cache and directory-based quota (_instead_ of user-based, no
need for both ways concurrently) are the features I miss, also
considering our replacement system on the horizon. Apart from that, I
do support the idea that the file system should not adopt every feature
under the sun. So far I didn't have people complaining that file
locking does not work, for example, just noticed myself and am using a
rename() approach instead.


Alrighty then,

Thomas

--
Dr. Thomas Orgis
HPC @ Universität Hamburg

Guan Xin

Sep 11, 2021, 11:00:25 AM
to beegfs-user
Hi,

Please see comments below ...

On Saturday, September 11, 2021 at 9:07:44 PM UTC+8 Dr. Thomas Orgis wrote:
> On Sat, 11 Sep 2021 01:27:24 -0700 (PDT), Guan Xin <...> wrote:
>
> > The compiler needs massive access to header files if your source is C or
> > similar.
> > Try putting most headers on a local filesystem, or on a remote filesystem
> > with better caching, like NFS.
>
> … or you do that trick of re-exporting the BeeGFS to the client from
> the client itself via NFS. Repeated reads, which is the case with
> builds and compiler invocations, get a massive boost from the page
> cache that way and the BeeGFS server is not touched for the most part.

Definitely yes, as suggested in another thread.
I didn't adopt that because NFS re-export of BeeGFS cuts large-file performance by about 70%.
 

> > Working within a filesystem image on BeeGFS also improves performance.
> > Just create a filesystem image on BeeGFS, put your source code inside, and
> > compile.
>
> I presume that is because you're talking to a fixed set of storage
> targets that this one big image is distributed on, each possibly also
> benefitting from its own cache (and even prefetch?) instead of asking
> the metadata server for the storage targets for each one of the little
> files again and again, then contact those targets anew … with small
> files directly stored with the metadata this could be a lot faster, but
> for a system created for concurrent continuous streaming, it is still
> the worst case of operation.

Possibly yes, but note that a virtual machine running with a disk image on BeeGFS does its own proper page caching.
Also, if full virtualization is not being used, a loop device could definitely be cached properly on a pre-4.4 Linux kernel.
Something changed in Linux 4.4, and I haven't checked its effect on caching a loop device on BeeGFS.
 

> Anyhow — proper client-side caching seems to be something that probably
> has to be considered to keep BeeGFS a good option also for the

Agreed && Confused when I saw native caching changed from "experimental" to "obsolete".
 
> entry-level HPC systems it has its biggest market share on
> (just guessing). That or a separate NFS for the non-parallel
> ignorant users to avoid them killing the BeeGFS performance for
> everyone else.


Experienced users also need high IOPS.

Also note that overall performance and QoS are different things.
For QoS, try tuneUsePerUserMsgQueues.
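
That option sits in the server-side configs, along these lines (followed by a restart of the services; paths assume a default install):

    # /etc/beegfs/beegfs-meta.conf and /etc/beegfs/beegfs-storage.conf
    tuneUsePerUserMsgQueues = true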
 
> A working cache and directory-based quota (_instead_ of user-based, no
> need for both ways concurrently) are the features I miss, also

BeeGFS uses the quota function of the underlying filesystem,
which might be why it doesn't support project quotas.
A directory on BeeGFS is not a directory on the underlying filesystem.
 
> considering our replacement system on the horizon. Apart from that, I
> do support the idea that the file system should not adopt every feature
> under the sun. So far I didn't have people complaining that file
> locking does not work, for example, just noticed myself and am using a
> rename() approach instead.

 
lockf(3)? Rarely considered that.
open(2) with O_CREAT | O_EXCL is ok for me.