FhGFS Tuning


Summers, James B. II

Sep 19, 2014, 9:18:41 AM
to <fhgfs-user@googlegroups.com>

Hello All,

Well, I have been charged by the users here, "to make this thing run faster"!

What I have is two fhgfs filesystems, handled by two groups of storage servers and two groups of metadata servers. Unfortunately, the interconnect between it all is only 1 Gb Ethernet, and there are no SSDs for metadata.

I have been raising the tuneNumWorkers value, but that does not seem to be having enough impact to satisfy them. So I am beginning to look at the tuneFileRead/Write parameters.

Do any of those tend to have more impact than the others?

Also, I am still trying to isolate where the bottleneck is occurring. I have been monitoring bandwidth usage at the individual storage servers and rarely see it go over 32 Mb/s, so I don't think that is the issue, but I need to look at the switch itself to confirm it is not the bottleneck. Any tips on useful tools, or baseline numbers that are considered good, would be extremely helpful.



Thanks
Jim Summers
University of Oklahoma
jsum...@ou.edu



Adam Brenner

Sep 21, 2014, 11:00:34 PM
to fhgfs...@googlegroups.com, Jim Summers
On Fri, Sep 19, 2014 at 6:18 AM, Summers, James B. II <jsum...@ou.edu> wrote:
> Well, I have been charged by the users here, "to make this thing run faster"!

What is "slow?" You type 'ls', go get a cup of coffee, and when you
come back hope it has finished? Or are reads slow? Writes slow?

Can you provide us with some numbers from dd?

taskset --cpu-list 0-7 dd if=/dev/zero of=testfile bs=1M count=32000 oflag=direct

(This assumes you have multiple sockets on the machine; if only one, remove the taskset prefix.)
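For the read side, something like this should work (untested sketch; dropping the page cache needs root, and without it you will just measure RAM speed):

```shell
# Read back the file written above; flush the page cache first so the
# numbers reflect disk/network speed rather than cached data (needs root)
sync && echo 3 > /proc/sys/vm/drop_caches
taskset --cpu-list 0-7 dd if=testfile of=/dev/null bs=1M iflag=direct
```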


Have you followed the information from the FhGFS wiki on metadata /
storage server tuning?

http://www.fhgfs.com/wiki/StorageServerTuning
http://www.fhgfs.com/wiki/MetaServerTuning

Are your stripes aligned on disk? What stripe size did you configure on
the storage / metadata servers?

All the tunings are available here:
http://www.fhgfs.com/wiki/TuningAdvancedConfiguration


> Also I am still attempting to isolate where the bottleneck may be occurring.
> I have been monitoring the bandwidth usage at the individual storage servers and
> rarely see it go over 32Mb/sec., so I don't think that is the issue, but I guess I
> need to look at the the switch itself to confirm that it is not the bottleneck. So
> any tips on useful tools or baseline numbers that are considered good would be
> extremely helpful.

http://www.fhgfs.com/wiki/NetworkTuning

You might also want to look into increasing the MTU size, assuming it's a
private network between the clients and the fhgfs meta/storage servers and
everything is local within one hop. There are a number of other factors too,
such as whether you have a managed switch in the middle.
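A quick way to sanity-check jumbo frames end to end (the interface and host names below are placeholders, adjust for your setup):

```shell
# Current MTU on the data interface (eth1 is a placeholder)
ip link show eth1 | grep -o 'mtu [0-9]*'

# Confirm a 9000-byte MTU actually survives the path to a storage server:
# 8972 payload + 28 bytes of IP/ICMP header = 9000; -M do forbids fragmentation,
# so the ping fails if any hop in between is still at 1500
ping -M do -s 8972 -c 3 storage01
```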


As you can see, _a lot_ of factors to consider for performance.

--
Adam Brenner
Computer Science, Undergraduate Student
Donald Bren School of Information and Computer Sciences
University of California, Irvine
www.ics.uci.edu/~aebrenne/
aebr...@uci.edu

Summers, James B. II

Sep 22, 2014, 2:29:03 PM
to <fhgfs-user@googlegroups.com>

On Sep 21, 2014, at 10:00 PM, Adam Brenner <aebr...@uci.edu> wrote:

> On Fri, Sep 19, 2014 at 6:18 AM, Summers, James B. II <jsum...@ou.edu> wrote:
>> Well, I have been charged by the users here, "to make this thing run faster"!
>
> What is "slow?" You type 'ls', go get a cup of coffee, and when you
> come back hope it has finished?

Not usually that slow, but on some directories with a lot of files I have seen an ls be very slow.

> Or are reads slow? Writes slow?
>

I need to try to benchmark that.

> Can you provide us with some numbers from dd?
>
> taskset --cpu-list 0-7 dd if=/dev/zero of=testfile bs=1M
> count=32000 oflag=direct
> (assumes you have multiple sockets on the machine ... if only one,
> remove taskset)
>

Here are four results:

[jsummers@lily ~]$ taskset --cpu-list 0-7 dd if=/dev/zero of=data/test/testfile.dd bs=1M count=32000 oflag=direct
32000+0 records in
32000+0 records out
33554432000 bytes (34 GB) copied, 336.036 s, 99.9 MB/s
[jsummers@lily ~]$ taskset --cpu-list 0-7 dd if=/dev/zero of=/data/vol18/testfile.dd bs=1M count=32000 oflag=direct
32000+0 records in
32000+0 records out
33554432000 bytes (34 GB) copied, 2438.27 s, 13.8 MB/s
[jsummers@lily ~]$ taskset --cpu-list 0-7 dd if=/dev/zero of=/data/ddn/jimbo/testfile9.dd bs=1M count=32000 oflag=direct
32000+0 records in
32000+0 records out
33554432000 bytes (34 GB) copied, 1340.96 s, 25.0 MB/s
[jsummers@lily ~]$ taskset --cpu-list 0-7 dd if=/dev/zero of=/data/scratch/jsummers/testfile.dd bs=1M count=32000 oflag=direct
32000+0 records in
32000+0 records out
33554432000 bytes (34 GB) copied, 40.4355 s, 830 MB/s

The first one is writing to one of the fhgfs filesystems, the second to an NFS mount, the third to our second fhgfs filesystem, and the last to a local scratch partition that is a RAID5.

These are interesting numbers, because some other benchmarking I did had fhgfs and NFS basically about the same; that test read in a large binary file and wrote one back out. The users are really focused on running an application from NASA named Ledaps ( http://ledapsweb.nascom.nasa.gov/ ). The benchmarking they provided shows totally different results: r/w to fhgfs was a lot slower than NFS and, of course, local disk. So I am starting to lean toward it being how the application itself does its I/O.

For example, here are the times to complete a Ledaps run / scene:

NFS: 824s
fhgfs0: 7415s
fhgfs1: 4980s
local: 703s

The Ledaps software they use is a set of C programs. I wonder if some C I/O calls could be slower than others, or if their parameters are not set correctly for r/w to network filesystems?
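One quick test I could run to see whether small writes are the problem (the paths here are just examples, and oflag=direct may need adjusting depending on the mount):

```shell
# The same 400 MB written twice: once as many small direct writes, once
# as a few large ones. A big gap in the results would point at the
# application's write size rather than the filesystem itself.
dd if=/dev/zero of=/data/test/small.dd bs=4k count=100000 oflag=direct
dd if=/dev/zero of=/data/test/large.dd bs=1M count=400 oflag=direct
```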





>
> Have you followed the information from the FhgFS wiki on metadata /
> storage server tunings?
>
> http://www.fhgfs.com/wiki/StorageServerTuning
> http://www.fhgfs.com/wiki/MetaServerTuning

As much as possible.

>
> Are your stripes aligned on disk? What stripe size did you configure on
> the storage / metadata servers?
>

Not sure on that one, I’ll need to bounce one to check that out.

> All the tunings are available here:
> http://www.fhgfs.com/wiki/TuningAdvancedConfiguration
>
>
>> Also I am still attempting to isolate where the bottleneck may be occurring.
>> I have been monitoring the bandwidth usage at the individual storage servers and
>> rarely see it go over 32Mb/sec., so I don't think that is the issue, but I guess I
>> need to look at the the switch itself to confirm that it is not the bottleneck. So
>> any tips on useful tools or baseline numbers that are considered good would be
>> extremely helpful.
>
> http://www.fhgfs.com/wiki/NetworkTuning
>
> You might also want to look into increasing the MTU size, assuming it's a
> private network between the clients and the fhgfs meta/storage servers and
> everything is local within one hop. There are a number of other factors too,
> such as whether you have a managed switch in the middle.
>
>
> As you can see, _a lot_ of factors to consider for performance.
>
> --
> Adam Brenner
> Computer Science, Undergraduate Student
> Donald Bren School of Information and Computer Sciences
> University of California, Irvine
> www.ics.uci.edu/~aebrenne/
> aebr...@uci.edu

Adam Brenner

Sep 22, 2014, 6:55:55 PM
to fhgfs...@googlegroups.com
On Mon, Sep 22, 2014 at 11:29 AM, Summers, James B. II <jsum...@ou.edu> wrote:
>
> The first one is writing to one of the fhgfs filesystems, second writing to an
> nfs mount, third our second fhgfs filesystem, and lastly to a local scratch
> partition that is a raid5. This provides some interesting numbers in that some
> other benchmarking I did had the fhgfs and nfs basically about the same. That
> one was reading in a large binary file and writing one back out. The users are
> really focused on running an application from nasa named Ledaps (
> http://ledapsweb.nascom.nasa.gov/ ). The benchmarking they provided shows
> totally different results, in that r/w to fhgfs was a lot slower than NFS and of
> course local disk. So I am starting to lean toward it being how the application
> itself is doing its I/O.
>
> For example, here are the times to complete a Ledaps run / scene:
>
> NFS: 824s
> fhgfs0: 7415s
> fhgfs1: 4980s
> local: 703s
>
> The ledaps software they use is a set of C programs. I wonder if there are some
> C I/O statements that could be slower than others or the parameters are not set
> correctly for r/w to network filesystems?

I think what is happening is the way Ledaps writes to disk -- its write
pattern. I would recommend profiling the application to see what it's
doing (or emailing the developers). I have been told oprofile is a good
tool. You can also try strace and count the number of write calls, or
even gdb.
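For example, something like this would show the write sizes (the binary name and arguments are placeholders; I don't know how Ledaps is invoked on your cluster):

```shell
# Syscall summary for one run; -f follows forked children
strace -c -f -o /tmp/ledaps.summary ./ledaps_program args...

# Log only write() calls, then compute the average write size; thousands
# of tiny writes over a network filesystem would explain the slowdown
strace -f -e trace=write -o /tmp/writes.log ./ledaps_program args...
awk -F'= ' '/write\(/ {sum+=$2; n++} END {print n " writes, avg " sum/n " bytes"}' /tmp/writes.log
```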

For example, it was very common, and surprisingly still is, for
software to write very small files to disk as checkpoint or temporary
data rather than use system memory. For some applications you can
specify where temporary data will live; if so, always select a local
hard drive or something like a tmpfs filesystem (/dev/shm). Any
filesystem will suffer from a lot of small writes; the best solution
is to cache the requests. If this is the case, consider caching more
requests on the client side:
http://www.fhgfs.com/wiki/Caching
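If I remember right, the cache mode is a client config setting (double-check the exact option name against that wiki page for your version):

```ini
# /etc/fhgfs/fhgfs-client.conf -- remount the client afterwards
tuneFileCacheType = native   # 'buffered' is the default; 'native' uses the kernel page cache
```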

This is all conjecture, but it would help explain what's happening. I
know that NFS is very good at caching requests on the client side
before writing them out to the remote server. When we used to run
Gluster, this was a trick we used to help solve this issue (an NFS
server / client loopback):
http://gluster.org/pipermail/gluster-users/2012-September/011324.html



The numbers you provided are a bit low in my opinion. I think it would
be worthwhile to run the same dd commands directly on the storage
servers (without going via network mounts or FhGFS / BeeGFS) and
compare them with the dd commands over the network mount. They should
not be _that_ far off. I would recommend starting the debugging there
and working your way out towards a client mounting BeeGFS. Another
thing to check is that your network is not saturated.
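A quick way to check for saturation without extra tools is to sample /proc/net/dev on a storage server while the job runs (iperf between a client and a storage server is the more thorough test; the interface name below is a placeholder):

```shell
# Sample receive bytes on eth0 (adjust the name) 5 seconds apart and
# print the average inbound rate; ~940 Mbit/s means GigE is saturated
IF=eth0
R1=$(awk -v i="$IF:" '$1==i {print $2}' /proc/net/dev); sleep 5
R2=$(awk -v i="$IF:" '$1==i {print $2}' /proc/net/dev)
echo "$(( (R2 - R1) * 8 / 5 / 1000000 )) Mbit/s in on $IF"
```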

Your third dd result, writing to your second BeeGFS instance, is
pretty bad: 25.0 MB/s. Your first result, to BeeGFS at 99.9 MB/s, is
pretty close to GigE write speed -- which makes me wonder if the
request was cached somehow. I would take the average of those as a
better comparison.

Note that the 830 MB/s result was your local RAID5 scratch, not NFS, so wire speed does not come into it; the NFS write at 13.8 MB/s is the one that looks suspect.


Hope that helps. Maybe one of the developers can comment on this as well,
-Adam