BeeGFS caching vs. NFS-self-re-export


Dr. Thomas Orgis

Sep 7, 2021, 10:22:05 AM
to fhgfs...@googlegroups.com
Hi,

I'm not joining the Stammtisch today, so I thought I'd give my two
cents here. The topic I usually bring up is the lack of caching, which
hits our aging, small-ish cluster with its diverse user base really
hard. We
increasingly have users that don't use multiple nodes at all and also
do rather careless I/O to our BeeGFS. Things like reading the same
dataset over and over, random access … small reads and writes that
amount to a rather small overall data rate but kill performance for the
whole cluster.

I had folks complaining that extraction of a moderately sized
Singularity container takes over an hour … compared to the "normal"
case of up to 30 minutes! The pains users endure … the truly normal
time is more like 10 seconds, when our system is not overloaded by jobs
that could use some caching (either explicitly via staging to local
disk, or via what I am discussing here, which is arguably more
user-friendly).

I tried enabling the different caching modes in BeeGFS, but sadly, they
just are not good. The most recent one I tried incurs a huge
performance hit while providing only a small caching benefit.

I figured I'd try the preposterous hack of interposing NFS. First, I
used a single server, but then realized that such a bottleneck is not
necessary.

Now, each compute node runs its own NFS and mounts the BeeGFS from
localhost. It's important to mount with local_lock=all to avoid funny
behaviour, because the underlying BeeGFS doesn't do locking at all. I
also used
vers=3 to keep things simple (maybe v4 would do better).
Synchronization to other clients is of course hampered, but this hack
is for people working on a single node.
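
To make that concrete, the setup on each node is roughly the following
(the fsid value is an arbitrary placeholder; an explicit fsid is needed
because there is no local block device UUID to export by):

  # /etc/exports: re-export the locally mounted BeeGFS (mounted at /work here)
  /work  localhost(rw,no_subtree_check,no_root_squash,fsid=100)

  # reload the export table and mount the loopback NFS view next to it
  exportfs -ra
  mount -t nfs -o vers=3,local_lock=all localhost:/work /work_cached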

This approach incurs some heavy overhead from doing the NFS dance back
and forth on the same box, and it also lowers the initial read rate by
30% or so. Also, write caching doesn't seem to be that effective; maybe
that could be tuned. But the important part is the read caching. Once I
have read the data a first time at around 500 MB/s, subsequent reads
can reach 5 GB/s or more without hitting the BeeGFS at all. Obviously,
native caching without a trip through NFS would be a lot faster, but
this is already worlds apart from what the integrated caching of BeeGFS
does.
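
If someone wants to play with the write side, the knobs I'd look at
first are along these lines (the values are just guesses to start
experimenting with, not recommendations):

  # export side: async lets the server acknowledge writes before they reach BeeGFS
  /work  localhost(rw,async,no_subtree_check,no_root_squash,fsid=100)

  # client side: larger transfer sizes and longer attribute caching
  mount -t nfs -o vers=3,local_lock=all,rsize=1048576,wsize=1048576,actimeo=60 \
      localhost:/work /work_cached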

I'll test drive this and will try to get some people to use this cached
mount. The open question now is how the internal cache of BeeGFS could
be worse than this contraption. A pragmatic approach could be to decide
that caching is something that complicates a parallel file system and
should be left out of the BeeGFS code entirely (as the use cases
benefitting from it are already somewhat pathological, and it obviously
seems to be tricky to get right) … and rather recommend a tuned setup
that adds a cache on top, like this NFS thing. But maybe even some
fscache/bcache stuff could be done, emphasizing that this is for the
use case of many trivial parallel computations that don't need
synchronous state in the parallel filesystem.
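
Just to sketch what the fscache variant on top of the NFS mount might
look like (untested here; the cache directory and cull thresholds are
the stock cachefilesd examples, not a tuned configuration):

  # /etc/cachefilesd.conf: put the persistent cache on local SSD space
  dir /var/cache/fscache
  tag beegfs-nfs-cache
  brun 10%
  bcull 7%
  bstop 3%

  # start the cache daemon, then mount with the fsc option
  systemctl start cachefilesd
  mount -t nfs -o vers=3,local_lock=all,fsc localhost:/work /work_cached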

Maybe some custom application of overlayfs could do the trick, too. But
it should be something entirely in kernel space, like bcache.


Alrighty then,

Thomas

--
Dr. Thomas Orgis
HPC @ Universität Hamburg

Dr. Thomas Orgis

Sep 8, 2021, 10:39:09 AM
to fhgfs...@googlegroups.com
On Tue, 7 Sep 2021 16:22:02 +0200,
"Dr. Thomas Orgis" <thomas...@uni-hamburg.de> wrote:

> This approach incurs some heavy overhead from doing the NFS dance back
> and forth on the same box, and it also lowers the initial read rate by
> 30% or so.

That statement was imprecise: the hit to the first read is bigger than
that. A more recent test in a batch job shows this pattern for dd to
/dev/null with bs=1M on the plain BeeGFS mount:

Streaming to/from /work/scratch/rrztest/.
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 23.3634 s, 460 MB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 10.9139 s, 984 MB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 8.45676 s, 1.3 GB/s

This could be server-side disk caching giving a boost. But it tops out
at 1.3 GB/s, which might be fine for one stream. The NFS re-export:

Streaming to/from /work_cached/scratch/rrztest/.
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 73.6809 s, 146 MB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 1.8983 s, 5.7 GB/s
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 1.93952 s, 5.5 GB/s
10240+0 records in
10240+0 records out

It goes up to 6.4 GB/s on further repetitions, so there is clearly some
benefit from the caching; the BeeGFS caching did not reach that. But the
first read is indeed rather slow here, far worse than a 30% hit. Then
again, the cached BeeGFS was even slower, as I remember.
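
For reference, the access pattern behind those numbers is simply a
streaming write followed by repeated reads of the same file with dd,
roughly like this (the file name and the write step are my
reconstruction, not copied from the job script; the same is done on
/work for the plain mount):

  # create an 11 GB file once, then re-read it repeatedly
  dd if=/dev/zero of=/work_cached/scratch/rrztest/streamtest bs=1M count=10240
  for i in 1 2 3; do
      dd if=/work_cached/scratch/rrztest/streamtest of=/dev/null bs=1M
  done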

Also, having just written the file, I'd hope for the NFS client cache
to be filled right away, but the writes seem to bypass it, too. I'll
run some checks with more realistic random I/O patterns. But even now,
can anyone relate to these numbers?
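
For those checks, I'm thinking of something along the lines of fio on
both mounts (block size, file size and job count below are
placeholders, not final choices):

  # hypothetical random-read comparison: plain BeeGFS vs. the cached loopback mount
  fio --name=randread --ioengine=psync --rw=randread --bs=4k --size=2G \
      --numjobs=4 --group_reporting --directory=/work/scratch/rrztest
  fio --name=randread --ioengine=psync --rw=randread --bs=4k --size=2G \
      --numjobs=4 --group_reporting --directory=/work_cached/scratch/rrztest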

Lehmann, Greg (IM&T, Pullenvale)

Sep 8, 2021, 7:34:46 PM
to fhgfs...@googlegroups.com
Hi Thomas,
We may be a bit different, but we try and find those users doing poor IO and work with them to improve things so that they don't ruin the HPC experience for all. After all, we have millions of dollars in HW sitting there and more jobs than we can handle. The idea is to spend a bit of money on staff to iron out these issues so that our large HW investment is utilised as efficiently as possible.

When people talk about HPC, they mention three things: compute, fabric and storage, and think it ends there. Hell no. The other, silent component is just as important: applications (and tuning/building them as well as running them in a sensible way). This component means specialised staff to look at why applications are not maximising CPU throughput, plus a monitoring component to go with it that covers compute, fabric and storage.

So while some caching would be nice and would benefit some workloads, it won't solve all your problems. For us, we have solved some BeeGFS problems that you also have by focusing on getting the inefficient workloads performing properly. We have had some very big wins with that approach.

Cheers,

Greg

Dr. Thomas Orgis

Sep 9, 2021, 3:48:56 AM
to fhgfs...@googlegroups.com
Hi Greg,

On Wed, 8 Sep 2021 23:34:37 +0000,
"Lehmann, Greg (IM&T, Pullenvale)" <Greg.L...@csiro.au> wrote:

> Hi Thomas,
> We may be a bit different, but we try and find those users
> doing poor IO and work with them to improve things

I fully agree, also with the sentiment that one should spend $CURRENCY
on personnel, application support, and training users. Sadly, that is
a) not happening or b) not feasible. I do try to help pathological
cases where I can, but there is also a certain mismatch between our
aging system and the workloads people come up with. We're running a
university cluster which was purchased in 2015. At that time, we
planned for a workload mix which did fit back then, with, for example,
only small SSD space on most nodes (around 90G free) and a dedicated
set of nodes with local hard disks and more RAM for chemistry (the hard
disks not being that bad because of the cache in RAM).

Now we have lots of users on the nodes with small local storage,
working on things like gene sequencing, machine learning, or just
general data analysis. Usually they use some toolkit or proprietary
software, generally something they haven't written themselves. Tuning
the software is sometimes not feasible, and even if it were, users are
not programmers (anymore) …

Often the solution would be to stage the working data on the local
disk. We work on this where possible, but are limited by the small
local space. The next system will have some reasonable local NVMe. And
we're considering just providing separate fast NFS storage, as others
have pointed out as a solution. Depending on how the workload evolves,
maybe a beefy NFS setup would generally outperform a parallel
filesystem. Strange to be having such thoughts :-/

Maybe we just need a seriously beefed-up BeeGFS, with more metadata
servers, for example. But the mere 24 disk groups we have do pose
limits on concurrency. Compare that to the previous system (2009-2015),
where there were only two NFS filesystems, each on a single small RAID
5. How could we survive! I think one should train people on such an old
system first, so that they know the pain of hardware constraints and
thus make better use of it;-)

The idea of a global parallel filesystem just doesn't
work when each node (or even multiple different jobs on one node, as
workloads don't scale) does its own thing in an ignorant manner,
assuming all resources behave like those on a personal laptop, only
faster (they wish).

If there's some easy way to hook into the page cache for asynchronous
I/O, users can benefit without having to explicitly move their data
around. Sadly, this is a lot more effective than trying to
educate people (at a university, that is) in the current environment of
science and education. We _do_ try, but that's a full time job that we
don't have the full time for.

Applications properly taking care of their I/O would remove the need
for (heuristic) caching. One can dream.

Lehmann, Greg (IM&T, Pullenvale)

Sep 13, 2021, 9:28:23 PM
to fhgfs...@googlegroups.com
Hi Thomas,
I like the idea of a cluster with trainer wheels for newbies so they can earn the right to play on the big cluster. We have dreamt about that recently too.

I think we do need to lobby for application specialists, so that those who have the decision-making power are aware of the need and the benefits. I do wonder if there is a bit of a competition mentality there, where the desire to have the biggest possible cluster gazumps the rational choice of balance across all components of an HPC system. We have seen it with storage being the loser in the past, but I think we are over that hurdle. The next hurdle is the application specialists.

Cheers,

Greg

