surprisingly poor performance of FHGFS


Marisa Sandhoff

Sep 18, 2012, 8:50:18 AM
to fhgfs...@googlegroups.com
Dear user group,

we are looking for some advice, because we have obviously done something
wrong with our setup, leading to very poor FHGFS performance.

We are planning to replace our Lustre instance with something new and
are trying to evaluate which file system works best for us. The goal is
to serve a cluster of 1300 cores, used by a group of about 50 users. The
file system serves as a general-purpose file system, so people also
store their data, programs and development files on it.

Besides FHGFS, we also have a Panasas evaluation system available.

Our FHGFS setup (Version 2011.04-r21) is as follows:

2 servers (running Ubuntu Server 12.04), one for metadata and one for
data; both are DL380 G7 with X5650 @ 2.67GHz and 48GB RAM (data) or
24GB RAM (metadata).

The raids are P2000 G3 Fibre Channel with 24*2TB HDD SAS for data and
12*146GB HDD SAS for meta data.

At the moment we are testing with only one client and one thread,
running CentOS 5.8, with a 1Gb Ethernet interconnect (10Gb systems are
also available).

The FHGFS installation is "default" at the moment, i.e. we did not
change any config file, but we read the FHGFS Server Tuning guide and
tried to follow it:

mds:
/dev/mapper/3600c0ff0001418443aeb495001000000 on /mnt1 type ext4
(rw,noatime,nodiratime,nobarrier)

oss:
/dev/mapper/3600c0ff00013b6b89c084a5001000000 on /mnt1 type xfs
(rw,noatime,nodiratime,largeio,inode64,swalloc,allocsize=131072,nobarrier)


As one of our use cases is the compilation of large programs, one of our
test suites is compiling a NetBSD kernel.

results:

1 thread Panasas:
real 17m48s
user 11m13s

1 Thread FHGFS:
real 690m15s
user 11m56s

(Panasas and FHGFS were mounted one after the other on the same worker
node and tested)

1 Thread directly on /mnt1 on mds:
real 11m29s
user 8m54s

1 Thread directly on /mnt1 on oss:
real 11m26s
user 8m48s

We tried further mount and file system options, but the FHGFS results
stay at the same level. Maybe someone can give us a hint as to what
could have gone wrong with our installation?

Thanks a lot in advance!
Marisa







--
Dr. Marisa Sandhoff
Experimentelle Elementarteilchenphysik
Fachbereich C - Physik
Bergische Universitaet Wuppertal
Gaussstr. 20
D-42097 Wuppertal, Germany
---------
sand...@physik.uni-wuppertal.de
Phone +49 202 439 3521
Fax +49 202 439 2811

Sven Breuner

Sep 19, 2012, 12:32:20 PM
to fhgfs...@googlegroups.com, sand...@physik.uni-wuppertal.de
Hi Marisa,

just to have a reference, I tried a simple netbsd kernel build on the
login node (connected to fhgfs via 1GbE) of our production systems.

$ time ./build.sh -u -m amd64 kernel=MYKERNEL
real 22m5.183s
user 3m40.090s
sys 2m1.828s
(where MYKERNEL is a default/generic amd64 netbsd kernel config)

Of course, that system was busy with all kinds of workloads when I ran
the test, but at least this should confirm that 690 minutes is not the
normal build time on an fhgfs.

Since your local test results on mds/oss looked fine, I don't assume
there's a problem with a bad disk or something like that.

The next usual suspect would then be a network problem (e.g. some
timeout issue).
Did you check that fhgfs is using the interfaces / IP addresses you want
it to use? Here are some ways to do this:
* Use "fhgfs-net" (contained in the fhgfs-utils package) on a client to
show currently established connections.
* Use "fhgfs-ctl mode=listnodes print_nic_details nodetype=storage" (and
nodetype=meta) and confirm that the interfaces listed first are the ones
you want to use.
* Check /var/log/fhgfs-client.log and /var/log/fhgfs-meta.log for
connection errors.
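If scrolling through the logs is tedious, a quick filter for connection-related messages might look like this (the grep patterns are an assumption about typical log wording, not an exact list of fhgfs messages; adjust them to what actually appears in the logs):

```shell
# Filter the FhGFS logs for connection-related messages: first a count,
# then the 20 most recent matching lines per log file.
for f in /var/log/fhgfs-client.log /var/log/fhgfs-meta.log; do
    if [ -r "$f" ]; then
        echo "== $f =="
        grep -icE 'disconnect|refused|timed out|error' "$f" || true
        grep -iE  'disconnect|refused|timed out|error' "$f" | tail -n 20
    fi
done
```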

The next thing I would suggest is some simple test cases to make sure
they run smoothly, e.g.:
$ dd if=/dev/zero of=/mnt/fhgfs/testfile bs=1M count=1000
and
$ dd if=/mnt/fhgfs/testfile of=/dev/null bs=1M count=1000
...to make sure streaming throughput is ok.

If you don't have a metadata benchmark tool like bonnie at hand, you
could also try some simple metadata test cases like:
$ mkdir /mnt/fhgfs/mdtest
$ cd /mnt/fhgfs/mdtest
$ seq 1 1000 | xargs touch
(...to create some files)
$ time ls -l /mnt/fhgfs/mdtest
$ time rm -rf /mnt/fhgfs/mdtest

...just to check if any of these take unusually long (i.e. more than a
few seconds), which will hopefully get you somewhat closer to the
source of the problem.
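The create/stat/remove steps above could also be wrapped into one timed run. A minimal sketch (TESTDIR defaults to a local directory here so the script is self-contained; point it at the fhgfs mount, e.g. /mnt/fhgfs/mdtest, for the real test):

```shell
# Timed metadata micro-benchmark: create, stat and remove 1000 empty
# files. Set TESTDIR to a directory on the file system under test.
TESTDIR=${TESTDIR:-./mdtest}
mkdir -p "$TESTDIR"
echo "create:" ; time sh -c "cd '$TESTDIR' && seq 1 1000 | xargs touch"
echo "stat:"   ; time ls -l "$TESTDIR" > /dev/null
echo "remove:" ; time rm -rf "$TESTDIR"
```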

Best regards,
Sven


Marisa Sandhoff wrote on 09/18/2012 02:50 PM:
> Our FHGFS setup (Version 2011.04-r21) is as follows:
> 2 servers (running Ubuntu server 12.04): One for meta data and one for
> data, both are DL380 G7
> X5650 @ 2.67GHz with 48GB RAM (data) or 24 GB RAM (meta data)
>
> The raids are P2000 G3 Fibre Channel with 24*2TB HDD SAS for data and
> 12*146GB HDD SAS for meta data.
>
> At the moment we are testing with one client and one thread only running
> CENTOS 5.8, with 1Gb Ethernet inter connection
>
> The FHGFS installation is "default" at the moment, hence, we did not
> change any config file, but we read and tried to follow the FHGFS Server
> Tuning guide:
>
> mds:
> /dev/mapper/3600c0ff0001418443aeb495001000000 on /mnt1 type ext4
> (rw,noatime,nodiratime,nobarrier)
> oss:
> /dev/mapper/3600c0ff00013b6b89c084a5001000000 on /mnt1 type xfs
> (rw,noatime,nodiratime,largeio,inode64,swalloc,allocsize=131072,nobarrier)
>
>
> one of our test suites is compiling a NetBSD kernel.
>
> results:
> 1 Thread FHGFS:
> real 690m15s
> user 11m56s
>

Marisa Sandhoff

Sep 20, 2012, 9:42:24 AM
to fhgfs...@googlegroups.com
Hi Sven,

thank you very much for your reply!



> $ time ./build.sh -u -m amd64 kernel=MYKERNEL
> real 22m5.183s
> user 3m40.090s
> sys 2m1.828s
> (where MYKERNEL is a default/generic amd64 netbsd kernel config)

We also used the generic kernel. It is good to hear that our problem is
not a general one of FHGFS.

>
>
> Next usual suspect would be a network problem then (e.g. some timeout
> problem).


We also exported one file system after the other via NFS and compiled
the kernel, which took about 17m. Hence, the network should not be a
problem. There is only a big HP ProCurve 8212zl switch in between.


> Did you check that fhgfs is using the interfaces / IP addresses you want
> it to use? Here are some ways to do this:
> * Use "fhgfs-net" (contained in the fhgfs-utils package) on a client to
> show currently established connections.
> * Use "fhgfs-ctl mode=listnodes print_nic_details nodetype=storage" (and
> nodetype=meta) and confirm that the interfaces listed first are the ones
> you want to use.

We checked the output of these commands, everything looked ok:

[root@wn001 log]# fhgfs-net

mgmt_nodes
=============
fhgfs-mds
Connections: TCP: 1 (132.195.124.232:8008);

meta_nodes
=============
fhgfs-mds
Connections: TCP: 2 (132.195.124.232:8005);
fhgfs-oss
Connections: TCP: 1 (132.195.124.233:8005);

storage_nodes
=============
fhgfs-mds
Connections: TCP: 2 (132.195.124.232:8003);
fhgfs-oss
Connections: TCP: 3 (132.195.124.233:8003);




But after we started bonnie++ (bonnie did not finish even after more
than an hour), we got:


[root@wn001 ~]# fhgfs-net

mgmt_nodes
=============
fhgfs-mds
Connections: TCP: 1 (132.195.124.232:8008);

meta_nodes
=============
fhgfs-mds
Connections: <none>
fhgfs-oss
Connections: <none>

storage_nodes
=============
fhgfs-mds
Connections: TCP: 1 (132.195.124.232:8003);
fhgfs-oss
Connections: TCP: 3 (132.195.124.233:8003);



and saw the following messages in /var/log/fhgfs-client.log:



ed: 132.195.124.233:8003
(3) Sep20 15:17:25 *fhgfs_XNodeSync(18364) [NodeConn (invalidate stream)] >> Disconnected: 132.195.124.232:8005
(3) Sep20 15:17:25 *fhgfs_XNodeSync(18364) [NodeConn (invalidate stream)] >> Disconnected: 132.195.124.232:8005
(3) Sep20 15:17:25 *fhgfs_XNodeSync(18364) [NodeConn (invalidate stream)] >> Disconnected: 132.195.124.233:8005
(3) Sep20 15:17:25 *fhgfs_XNodeSync(18364) [NodeConn (invalidate stream)] >> Disconnected: 132.195.124.233:8005
(3) Sep20 15:17:25 *fhgfs_XNodeSync(18364) [NodeConn (invalidate stream)] >> Disconnected: 132.195.124.233:8005
(3) Sep20 15:17:25 *fhgfs_XNodeSync(18364) [NodeConn (invalidate stream)] >> Disconnected: 132.195.124.233:8005
(3) Sep20 15:17:25 *fhgfs_XNodeSync(18364) [NodeConn (invalidate stream)] >> Disconnected: 132.195.124.232:8003
(3) Sep20 15:17:25 *fhgfs_XNodeSync(18364) [Idle disconnect] >> Dropped idle connections: 7


and after a while the fhgfs-net output looks ok again.



[root@wn001 log]# fhgfs-net

mgmt_nodes
=============
fhgfs-mds
Connections: TCP: 1 (132.195.124.232:8008);

meta_nodes
=============
fhgfs-mds
Connections: TCP: 2 (132.195.124.232:8005);
fhgfs-oss
Connections: TCP: 1 (132.195.124.233:8005);

storage_nodes
=============
fhgfs-mds
Connections: TCP: 2 (132.195.124.232:8003);
fhgfs-oss
Connections: TCP: 3 (132.195.124.233:8003);



However, dd commands with bs=1k and bs=1M work just fine at about
100MB/s.
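Since streaming dd looks fine while the compile is extremely slow, the symptoms point at per-operation latency rather than bandwidth. A crude per-create latency estimate could be made like this (a sketch; D is an assumed scratch path and defaults to a local directory, so set it to a directory on the fhgfs mount, e.g. /mnt/fhgfs/lat-test, for the real run):

```shell
# Crude per-file-create latency estimate: touch 100 files and divide
# the wall time by the file count.
D=${D:-./lat-test}
mkdir -p "$D"
t0=$(date +%s)
i=0
while [ "$i" -lt 100 ]; do
    touch "$D/f$i"
    i=$((i+1))
done
t1=$(date +%s)
echo "avg create latency: $(( (t1 - t0) * 1000 / 100 )) ms"
rm -rf "$D"
```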

Can you think of any reason why this could happen? The node under test
behaves fine with Lustre, Panasas, NFS ... and was used in production
before. We are now testing another node, just to cross-check.

Thanks!

Frank Kautz

Sep 21, 2012, 2:47:10 AM
to fhgfs...@googlegroups.com
Hello Marisa,

if a connection between the client and a metadata server is idle, it
will be disconnected automatically after a while. This happens only for
connections to the metadata server. A disconnect during a bonnie++ run
on fhgfs is not normal.

Could you start this bonnie command: "bonnie++ -s0 -n 1:1:1:1 -r0 -d
/FHGFS_MOUNT -u root"? It should finish in a few seconds. If it does
not finish, strace the execution of bonnie++ and send us the output.
Use the following command: "strace -Ttt bonnie++ -s0 -n 1:1:1:1 -r0 -d
/FHGFS_MOUNT -u root"
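To narrow a long strace log down, the per-call duration that -T appends in angle brackets at the end of each line can be filtered for slow calls. A small helper sketch (the 0.1 s threshold is arbitrary; `strace.out` is a hypothetical file name for the saved trace):

```shell
# slowcalls: read an `strace -Ttt` trace on stdin and print only the
# syscalls that took longer than 0.1 s, slowest first. The duration is
# the trailing <...> field that strace -T appends to each line.
slowcalls() {
    awk -F'<' 'NF > 1 { t = $NF; sub(/>/, "", t); if (t + 0 > 0.1) print t, $0 }' \
        | sort -rn
}
# usage, assuming the trace was saved to strace.out:
#   slowcalls < strace.out | head
```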

In your first email you wrote that your fhgfs environment has 2
servers: one for metadata and one for storage. In the output of
fhgfs-net, however, both servers appear as metadata and storage
servers. Did you reconfigure fhgfs or is there something wrong?

kind regards,
Frank

Marisa Sandhoff

unread,
Sep 21, 2012, 10:14:08 AM9/21/12
to fhgfs...@googlegroups.com
Hi Frank,

thank you very much for your reply!

In the meantime we found some surprising values. It seems that one of
our worker nodes shows very poor performance with fhgfs, while it
behaves normally with Lustre, Panasas and NFS. Another worker node
shows normal fhgfs behaviour. See the results at the end of this email.

>
> Could you start this bonnie command: "bonnie++ -s0 -n 1:1:1:1 -r0 -d
> /FHGFS_MOUNT -u root". This command will finish in a few seconds. If
> this will not finish strace the execution of bonnie and send us the
> output. Use the following command: "strace -Ttt bonnie++ -s0 -n 1:1:1:1
> -r0 -d /FHGFS_MOUNT -u root"
>

wn001 is our "buggy" worker node

[root@wn001 ~]# bonnie++ -s0 -n 1:1:1:1 -r0 -d /mnt/fhgfs/ -u root
Using uid:0, gid:0.
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version  1.96       ------Sequential Create------ --------Random Create--------
wn001               -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
      files:max:min  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
              1:1:1   231   1   320   2   491   2   273   1   202   1   557   2
Latency               205ms     206ms     210ms     205ms     209ms     204ms
1.96,1.96,wn001,1,1348210770,,,,,,,,,,,,,,1,1,1,,,231,1,320,2,491,2,273,1,202,1,557,2,,,,,,,205ms,206ms,210ms,205ms,209ms,204ms
[root@wn001 ~]#



[root@wn002 ~]# bonnie++ -s0 -n 1:1:1:1 -r0 -d /mnt/fhgfs/ -u root
Using uid:0, gid:0.
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version  1.96       ------Sequential Create------ --------Random Create--------
wn002               -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
      files:max:min  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
              1:1:1   446   2   536   4   741   2   496   4   535   4   734   2
Latency              2431us    3340us   28355us    4159us   36860us    1663us
1.96,1.96,wn002,1,1348214493,,,,,,,,,,,,,,1,1,1,,,446,2,536,4,741,2,496,4,535,4,734,2,,,,,,,2431us,3340us,28355us,4159us,36860us,1663us


The results of our kernel (only kernel, no tools) compilation are:

wn001 panasas:
real 7m14s
user 5m17s

wn001 fhgfs:
real 310m26s
user 5m50s

wn002 fhgfs: (we got the same values for a node with a 10Gb connection)
real 34m26s
user 5m45s

wn002 panasas:
real 8m1s
user 5m18s
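Since wn001 misbehaves only with fhgfs while wn002 is fine on the same servers, diffing low-level network settings between the two nodes might turn something up. A generic sketch (these are standard Linux sysctls, not fhgfs-specific settings; run on both nodes and diff the output):

```shell
# Dump a few TCP tuning knobs so the output can be compared between
# wn001 and wn002, e.g.:  diff wn001.txt wn002.txt
for k in net.ipv4.tcp_sack net.ipv4.tcp_timestamps \
         net.ipv4.tcp_rmem net.ipv4.tcp_wmem \
         net.core.rmem_max net.core.wmem_max; do
    printf '%s = %s\n' "$k" "$(sysctl -n "$k" 2>/dev/null)"
done
```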


> In your first email you wrote that your fhgfs environment has 2 servers
> one server for metadata and one server for storage. In the output of
> fhgfs-net both servers are metadata server and storage server. Did you
> reconfigure fhgfs or is there something wrong?
>

Yes, in the meantime I reconfigured our installation in order to find
our bug.

Christian Mohrbacher

Sep 21, 2012, 3:26:27 PM
to fhgfs...@googlegroups.com
Hi Marisa,

>
>
> The results of our kernel (only kernel, no tools) compilation are:
>
> wn001 panasas:
> real 7m14s
> user 5m17s
>
> wn001 fhgfs:
> real 310m26s
> user 5m50s
>
> wn002 fhgfs: (we got the same values for a node with a 10Gb connection)
> real 34m26s
> user 5m45s
>
> wn002 panasas:
> real 8m1s
> user 5m18s
>

so the values (at least for wn002) look a lot better now, although they
are still not really good. But with a bit of tuning it should be no
problem to get at least as much out of FhGFS as you get with Panasas
and Lustre. You mention a 10Gb client. Are your servers connected with
10GbE?

Regards,
Christian

--
=====================================================
| Christian Mohrbacher |
| Competence Center for High Performance Computing |
| Fraunhofer ITWM |
| Fraunhofer-Platz 1 |
| |
| D-67663 Kaiserslautern |
=====================================================
| Tel: (49) 631 31600 4425 |
| Fax: (49) 631 31600 1099 |
| |
| E-Mail: christian....@itwm.fraunhofer.de |
| Internet: http://www.itwm.fraunhofer.de |
=====================================================
