Hi Greg,
On Wed, 8 Sep 2021 23:34:37 +0000,
"Lehmann, Greg (IM&T, Pullenvale)" <Greg.L...@csiro.au> wrote:
> Hi Thomas,
> We may be a bit different, but we try and find those users
> doing poor IO and work with them to improve things
I fully agree, also with the sentiment that one should spend $CURRENCY on
personnel, application support, and training users. Sadly, that is either
a) not happening or b) not feasible. I do try to help pathological cases
where I can, but there is also a certain mismatch between our aging system
and the workloads people come up with. We're running a university cluster
that was purchased in 2015. At that time we planned for a workload mix
that did fit: for example, only a small SSD on most nodes (around 90 GB
free) and a dedicated set of nodes with local hard disks and more RAM for
chemistry (the hard disks not being that bad thanks to the page cache in
RAM).
Now we have lots of users on the nodes with small local storage, working
on things like gene sequencing, machine learning, or just general data
analysis. Usually they use some toolkit or proprietary software,
generally something they have not written themselves. Tuning the software
is sometimes not feasible, and even if it were, users are not programmers
(anymore) …
Often the solution would be to stage the working data on the local disk;
a rough sketch of what I mean is below. We work on this where possible,
but are limited by the small local space. The next system will have some
reasonable local NVMe. And we're considering just providing separate fast
NFS storage, as others have pointed out as a solution. Depending on how
the workload evolves, maybe a beefy NFS setup would generally outperform
a parallel filesystem. Strange to be having such thoughts :-/
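
For illustration, staging boils down to something like this minimal
Python sketch. It assumes the batch system exports a node-local scratch
directory via $TMPDIR (as e.g. Slurm can be configured to do); all paths
here are made up:

    import os
    import shutil

    # Node-local scratch; $TMPDIR from the batch system is an assumption.
    scratch = os.environ.get("TMPDIR", "/tmp")

    def stage_in(src_dir):
        """Copy the working data set once, in bulk, to node-local disk."""
        dst = os.path.join(scratch, os.path.basename(src_dir))
        shutil.copytree(src_dir, dst)
        return dst

    local_data = stage_in("/beegfs/project/dataset")  # hypothetical path
    # ... run the job against local_data, then copy results back in one go:
    # shutil.copytree(os.path.join(local_data, "results"),
    #                 "/beegfs/project/results")

The whole point is the single sequential copy in and out of the shared
filesystem instead of millions of small random accesses during the run.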
Maybe we just need a seriously beefed-up BeeGFS, with more metadata
servers, for example. But the mere 24 disk groups we have do pose limits
on concurrency. Compare that to the previous system (2009-2015), where
there were only two NFS filesystems, each on a single small RAID 5. How
did we ever survive! I think one should train people on such an old
system first, so that they know the pain of hardware constraints and thus
make better use of it ;-)
The idea of a global parallel filesystem just doesn't
work when each node (or even multiple different jobs on one node, as
workloads don't scale) does its own thing in an ignorant manner,
assuming all resources behave like on a personal laptop, only faster
(they wish).
If there were some easy way to hook into the page cache for async I/O,
users could benefit without having to explicitly move their data around.
Sadly, that would be a lot more effective than trying to educate people
(at a university, that is) in the current environment of science and
education. We _do_ try, but that's a full-time job that we don't have the
full time for.
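
The closest existing hook I know of is posix_fadvise(2), which lets a
process ask the kernel to pull a file into the page cache asynchronously.
A minimal sketch in Python (the path is made up, and whether the hint
actually triggers readahead on a parallel filesystem client is another
question):

    import os

    def prefetch(path):
        """Hint the kernel to start readahead into the page cache.

        POSIX_FADV_WILLNEED returns immediately; the kernel fetches the
        pages in the background, so later read()s mostly hit the cache.
        """
        fd = os.open(path, os.O_RDONLY)
        try:
            # Length 0 means "from offset to end of file".
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
        finally:
            os.close(fd)

    prefetch("/beegfs/project/dataset/input.dat")  # hypothetical path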
Applications properly taking care of their I/O would remove the need
for (heuristic) caching. One can dream.