performance testing with fhgfs

dipe

Feb 4, 2012, 2:43:08 PM

to fhgfs-user

We have been testing an fhgfs setup on 3 Linux boxes with cheap SATA
drives, plus a separate metadata server on a fairly new Dell desktop box
that has 4 GB RAM and a Fusion-io SSD card for the metadata partition
(about 1.5 GB in size).

We have a folder with 58000 files on one of our NFS servers, and
running a single ls -l takes about 4 minutes (the NFS server is quite
busy anyway).

With the fhgfs setup we are now down to 1 minute, which is nice, but
secretly I had hoped for an even better result (maybe 10 seconds?). Is
this unrealistic?

I let dstat run on the metadata server and then executed an ls -l.
Network utilization goes up a little, and interrupts and context
switches also go up quite a bit.

How else can I find where there may be a bottleneck in my metadata
server?

Thanks much
dipe

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read writ| recv send| in out | int csw
0 0 100 0 0 0| 0 0 |2294B 354B| 0 0 | 72 77
0 0 100 0 0 0| 0 0 |2362B 354B| 0 0 | 60 51
0 0 100 0 0 0| 0 0 |2354B 354B| 0 0 | 57 48
0 0 100 0 0 0| 0 0 |2106B 354B| 0 0 | 59 60
0 0 100 0 0 0| 0 40k|2106B 354B| 0 0 | 67 54
0 1 99 0 0 0| 0 0 | 230k 249k| 0 0 |3218 9439
1 1 98 0 0 0| 0 0 | 554k 588k| 0 0 |7604 23k
1 2 97 0 0 0| 0 0 | 580k 613k| 0 0 |7958 24k
1 2 98 0 0 0| 0 0 | 512k 547k| 0 0 |7071 21k
1 2 97 0 0 0| 0 0 | 577k 614k| 0 0 |7949 24k
1 2 98 0 0 0| 0 0 | 570k 606k| 0 0 |7863 23k
1 1 98 0 0 0| 0 12k| 337k 374k| 0 0 |4651 14k
1 2 97 0 0 0| 0 0 | 572k 607k| 0 0 |7902 23k
1 2 97 0 0 0| 0 0 | 570k 608k| 0 0 |7890 23k
1 1 97 0 0 0| 0 0 | 528k 565k| 0 0 |7298 22k
1 2 98 0 0 0| 0 0 | 550k 586k| 0 0 |7564 23k
1 2 98 0 0 0| 0 0 | 528k 541k| 0 0 |7230 22k
1 2 98 0 0 0| 0 0 | 346k 384k| 0 0 |7294 19k
1 1 97 0 0 0| 0 0 | 580k 615k| 0 0 |7982 24k
1 1 97 1 0 0| 0 16k| 529k 568k| 0 0 |7307 22k
1 2 97 0 0 0| 0 0 | 547k 582k| 0 0 |7522 22k
1 1 98 0 0 0| 0 0 | 466k 479k| 0 0 |6426 19k
1 2 97 0 0 0| 0 0 | 556k 590k| 0 0 |7634 23k
0 1 99 0 0 0| 0 0 | 208k 227k| 0 0 |2892 8532
1 2 98 0 0 0| 0 0 | 577k 613k| 0 0 |7970 24k
1 2 97 0 0 0| 0 0 | 579k 615k| 0 0 |7975 24k
1 1 98 0 0 0| 0 0 | 578k 615k| 0 0 |7993 24k
0 2 98 0 0 0| 0 0 | 520k 557k| 0 0 |7202 21k
1 2 97 0 0 0| 0 0 | 517k 555k| 0 0 |7134 21k
1 1 99 0 0 0| 0 0 | 390k 405k| 0 0 |5386 16k
1 2 98 0 0 0| 0 0 | 576k 613k| 0 0 |7936 24k
1 2 97 0 0 0| 0 0 | 578k 635k| 0 0 |7937 24k
1 2 98 0 0 0| 0 0 | 581k 617k| 0 0 |8007 24k
1 1 98 0 0 0| 0 0 | 584k 618k| 0 0 |8014 24k
1 2 98 0 0 0| 0 0 | 517k 554k| 0 0 |7119 21k
0 1 99 0 0 0| 0 0 | 289k 308k| 0 0 |3998 12k
1 2 97 0 0 0| 0 0 | 541k 576k| 0 0 |7439 22k
1 1 98 0 0 0| 0 0 | 580k 615k| 0 0 |7959 24k
1 2 98 0 0 0| 0 0 | 511k 527k| 0 0 |7044 21k
1 1 97 0 0 0| 0 0 | 581k 636k| 0 0 |7970 24k
1 2 98 0 0 0| 0 0 | 541k 555k| 0 0 |8700 25k
1 1 98 0 0 0| 0 0 | 370k 407k| 0 0 |5094 15k
1 2 97 0 0 0| 0 0 | 576k 611k| 0 0 |7931 24k
1 1 98 0 0 0| 0 0 | 537k 574k| 0 0 |7400 22k
1 1 97 0 0 0| 0 0 | 539k 576k| 0 0 |7463 22k
1 1 98 0 0 0| 0 8192B| 580k 616k| 0 0 |7978 24k
1 1 98 0 0 0| 0 0 | 483k 496k| 0 0 |6645 20k
1 1 98 0 0 0| 0 0 | 399k 438k| 0 0 |5507 16k
1 2 98 0 0 0| 0 0 | 583k 619k| 0 0 |8024 24k
1 1 97 1 0 0| 0 12k| 527k 563k| 0 0 |7280 22k
1 1 98 0 0 0| 0 0 | 503k 539k| 0 0 |6919 21k
1 1 97 0 0 0| 0 0 | 576k 611k| 0 0 |7948 24k
1 1 97 0 0 0| 0 0 | 494k 510k| 0 0 |6838 20k
1 2 97 0 0 0| 0 0 | 435k 472k| 0 0 |6021 18k
1 2 97 0 0 0| 0 0 | 573k 609k| 0 0 |7855 23k
1 2 97 0 0 0| 0 0 | 582k 618k| 0 0 |8013 24k
1 1 98 0 0 0| 0 0 | 576k 610k| 0 0 |7902 24k
1 2 98 0 0 0| 0 0 | 540k 579k| 0 0 |7467 22k
1 2 98 0 0 0| 0 0 | 506k 542k| 0 0 |6960 21k
0 1 98 0 0 0| 0 0 | 329k 347k| 0 0 |4566 14k
1 2 97 0 0 0| 0 0 | 580k 614k| 0 0 |7982 24k
1 2 97 0 0 0| 0 0 | 580k 614k| 0 0 |8003 24k
1 3 96 0 0 0| 0 0 | 578k 635k| 0 0 | 10k 29k
1 2 97 0 0 0| 0 0 | 581k 611k| 0 0 |7973 24k
0 0 99 0 0 0| 0 0 | 195k 192k| 0 0 |2718 8018
0 0 100 0 0 0| 0 0 |2346B 354B| 0 0 | 93 76
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read writ| recv send| in out | int csw
0 0 100 0 0 0| 0 0 |2660B 354B| 0 0 | 75 83
0 0 100 0 0 0| 0 0 |1926B 834B| 0 0 | 72 64
0 0 100 0 0 0| 0 0 |2294B 354B| 0 0 | 70 61

dipe

Feb 4, 2012, 3:21:05 PM

to fhgfs-user

I should probably add that we are running kernel 2.6.37.6 and ext4 on the
metadata server and have implemented the tuning suggestions from

http://www.fhgfs.com/wiki/wikka.php?wakka=ServerTuning#metatune

and have set

tuneNumWorkers

to 0, 16, 32 and 100, but none of these made any difference.

Bernd Schubert

Feb 4, 2012, 5:07:29 PM

to fhgfs...@googlegroups.com, dipe

Hello,

I guess your bottleneck is network latency. So 'ls -l' translates into

1) opendir()
2) readdir()
3) for each entry from readdir: stat()
4) sort all entries

So it has to call stat 58000 times, and it does so sequentially, so
everything depends on network latency.
On the development side we could try to improve that with workarounds, but
in the end such workarounds tend to destabilize the code, as they are
only workarounds. It would be much better if POSIX defined a syscall
with a read/stat queue, so that multiple entries could be fetched at
once. Unfortunately that does not exist right now :(
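
The call sequence listed above can be reproduced from user space to see where the time goes. The following is a minimal Python sketch, not the actual ls implementation; the function name long_listing and the timing keys are made up for illustration. It reads the directory, issues one lstat() per entry in sequence, sorts the result, and times the readdir and stat phases separately.

```python
import os
import stat
import time


def long_listing(path):
    """Mimic the call pattern of `ls -l`: readdir, one lstat per entry, sort."""
    t0 = time.perf_counter()
    names = os.listdir(path)  # corresponds to opendir() + readdir()
    t1 = time.perf_counter()

    entries = []
    for name in names:
        # One sequential lstat() per entry. On a network file system each
        # of these is a separate request that must finish before the next.
        st = os.lstat(os.path.join(path, name))
        entries.append((name, stat.filemode(st.st_mode), st.st_size))
    t2 = time.perf_counter()

    entries.sort()  # sort by name, as ls does by default
    return entries, {"readdir_s": t1 - t0, "stat_s": t2 - t1}
```

Calling long_listing() on the directory in question and comparing readdir_s with stat_s should show how much of the wall time is spent in the per-entry stat loop.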

But as a side note, does 'ls -l' on a directory with 58000 entries still
make sense? Are you really going to read the output? Or wouldn't it be
better to run it in parallel on directories with far fewer
entries, which is closer to a typical user workload?

Especially for network file systems, admins should usually recommend that
their users use plain 'ls', which does not include the additional and
rather slow stat call for every file.

Cheers,
Bernd

Sven Breuner

Feb 6, 2012, 7:38:30 AM

to fhgfs...@googlegroups.com

Hi,

Bernd Schubert wrote on 02/04/2012 11:07 PM:
> I guess your bottleneck is network latency. So 'ls -l' translates into
>
> 1) opendir()
> 2) readdir()
> 3) for each entry from readdir: stat()
> 4) sort of all entries
>
> So it has to call 58000 times stat and also will do it sequentially and
> so everything depends on network latency.

As Bernd said, network latency is the dominating time factor here.
However, besides the stat()-calls, there is more overhead being
introduced by "ls -l" - see below.

> On 02/04/2012 09:21 PM, dipe wrote:
>> I should probably add that we are running 2.6.37.6 and ext4 on the
>> metadata server and have implemented the tuning suggestions from
>>
>> http://www.fhgfs.com/wiki/wikka.php?wakka=ServerTuning#metatune
>>
>> and have set
>>
>> tuneNumWorkers
>>
>> to 0, 16, 32 and 100 but neither made any difference.

For workloads from a single client process, it's normal that increasing
the number of workers won't have any effect. This is only relevant if
you have multiple client processes (either on the same client or on
multiple clients) doing fs operations.
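
One way to generate the kind of concurrent load described here is to issue the stat calls from several threads at once. This is only a sketch under assumptions: the thread talks about multiple client processes, so whether threads inside a single process benefit in the same way on fhgfs is untested here, and parallel_lstat and its threads parameter are hypothetical names.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def parallel_lstat(path, threads=16):
    """lstat every entry of `path` using a pool of worker threads."""
    names = os.listdir(path)
    full_paths = [os.path.join(path, name) for name in names]
    # os.lstat releases the interpreter lock while waiting on the kernel,
    # so several requests can be in flight at the same time.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = list(pool.map(os.lstat, full_paths))
    return dict(zip(names, results))
```

Timing this against a plain sequential loop with different values of tuneNumWorkers would show whether the worker count matters once requests arrive concurrently.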

>> On Feb 4, 11:43 am, dipe<dip...@gmail.com> wrote:
>>> we have a folder with 58000 files on one of our nfs servers and
>>> executing one ls -l takes about 4 minutes. (the NFS server is quite
>>> busy anyway)
>>>
>>> with the fhgfs setup we are now down to 1 minutes which is nice, but
>>> secretly I had hoped for an even better result (may be 10 seconds?) is
>>> this unrealistic?

I assume you are using Gigabit Ethernet.

Did you also notice that there is a build option FHGFS_INTENT that can
be enabled to increase stat-performance? (This option will be enabled by
default with the next major release.)
Currently, it's only documented in the changelog (search for FHGFS_INTENT):
http://www.fhgfs.com/release/fhgfs_2011.04/Changelog.txt

In general, doing about 60,000 stat() calls in about 10 seconds is not
unrealistic, but doing an "ls -l" in that time frame is difficult. I
just ran some fhgfs bonnie benchmarks on a GigE line with a single
directory with 60,000 files. Bonnie measured >5,000 stat operations per
second from the client. So if "ls -l" only did the stat calls, you
would be done in roughly 10 seconds.
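
The arithmetic behind these figures is easy to check. The sketch below uses only numbers quoted in this thread (60,000 files, 5,000 stat operations per second, and the 58,000 files in about 60 seconds reported earlier); the helper names are illustrative, and the per-call figure assumes, for simplicity, exactly one request per file.

```python
def listing_seconds(num_files, stats_per_second):
    """Time for a purely sequential stat pass at a given rate."""
    return num_files / stats_per_second


def per_file_ms(num_files, elapsed_seconds):
    """Average wall time spent per file, in milliseconds."""
    return elapsed_seconds / num_files * 1000.0


# 60,000 files at 5,000 stats/s: an upper bound, since bonnie measured >5,000/s.
print(listing_seconds(60_000, 5_000))       # 12.0

# 58,000 files listed in about 60 s, as first reported in this thread.
print(round(per_file_ms(58_000, 60), 2))    # 1.03
```

About one millisecond per file is far more than a single stat needs at the measured rate, which fits the observation that "ls -l" does more than one call per file.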

Unfortunately, "ls -l" does not only make stat calls, but also some other
calls. Here is an "ls -l" strace excerpt showing the calls made for each file:
$ strace ls -l
...
> lstat("test", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
> lgetxattr("test", "security.selinux", 0x62d130, 255) = -1 ENODATA (No data available)
> getxattr("test", "system.posix_acl_access", 0x0, 0) = -1 EOPNOTSUPP (Operation not supported)
> lstat("test2", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
> lgetxattr("test2", "security.selinux", 0x62b350, 255) = -1 ENODATA (No data available)
> getxattr("test2", "system.posix_acl_access", 0x0, 0) = -1 EOPNOTSUPP (Operation not supported)
> lstat("test3", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
> lgetxattr("test3", "security.selinux", 0x62d150, 255) = -1 ENODATA (No data available)
> getxattr("test3", "system.posix_acl_access", 0x0, 0) = -1 EOPNOTSUPP (Operation not supported)
...

As you can see, "ls -l" is performing two other operations (for selinux
and ACLs) besides the actual lstat-call for each file in the directory.
In fact, both of those extra operations don't make any sense, because
selinux is not enabled on the system and ACLs are not supported. But
these operations make the Linux kernel call fhgfs's inode revalidation
method and thus significantly increase the "ls -l" time.

I don't know of any way to disable this ls overhead, so I will report
this to the GNU coreutils maintainers.
Maybe they can check whether selinux is enabled and skip those calls if it
isn't enabled. And maybe they can stop doing getxattr calls if the first
call already reports "EOPNOTSUPP".
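
The second suggestion, remembering an "EOPNOTSUPP" answer and not asking again, can be sketched as follows. This is an illustration in Python, not the coreutils code; stat_with_acl_probe and the injectable probe argument are hypothetical, and the default probe relies on os.getxattr, which is only available on Linux.

```python
import errno
import os


def stat_with_acl_probe(paths, probe=None):
    """lstat each path; query the ACL xattr only until EOPNOTSUPP is seen."""
    if probe is None:
        # Default probe: the same attribute that ls asks for in the trace.
        def probe(path):
            return os.getxattr(path, "system.posix_acl_access",
                               follow_symlinks=False)

    acl_supported = True
    calls = {"lstat": 0, "getxattr": 0}
    for path in paths:
        os.lstat(path)
        calls["lstat"] += 1
        if not acl_supported:
            continue  # an earlier file already told us ACLs are unsupported
        calls["getxattr"] += 1
        try:
            probe(path)
        except OSError as exc:
            if exc.errno == errno.EOPNOTSUPP:
                acl_supported = False  # remember it and stop asking
            elif exc.errno != errno.ENODATA:
                raise  # ENODATA just means this file has no ACL set
    return calls
```

With this pattern a directory on a file system without ACL support costs one extra call in total instead of one extra call per file. A real implementation would have to track the answer per file system rather than per run, which this sketch does not attempt.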

However, even if they implement the changes, it will of course take some
time until the modifications find their way into the Linux server
distributions.

Best regards,
Sven

dipe

Feb 7, 2012, 7:57:09 AM

to Bernd Schubert, fhgfs...@googlegroups.com

Thanks Bernd,

I thought that the network must be the main culprit, because on a local
system I can do ls -l on a directory with a million files and it only
takes a few seconds.
You are right, 60000 files in one directory is rather rare, but more
often we are dealing with 5000-10000 files and it would be nice to be able
to keep ls -l for this below 3-5 sec.

As interrupts and context switches seem to contribute quite a bit to
the latency, I am interested in technologies that claim to
address this:

http://www.openonload.org/

http://www.solarflare.com/Content/UserFiles/Documents/Solarflare_OpenOnload_IntroPaper.pdf
http://www.solarflare.com/Content/UserFiles/Documents/Solarflare_10GbE_HPC_Whitepaper.pdf

As far as I can tell, this is inserted transparently into the
network stack and does not require any modifications on your end. Just
wanted to share it.

I have been testing on 1G without any kind of network tuning. It appears
that 10G reduces latency significantly compared to 1G, and it seems that
OpenOnload shaves off another 50% of the latency from 10G.

I wonder if this would compare to technologies like InfiniBand when it
comes to file system latency (using commands such as ls). I believe most
HPC people deploy InfiniBand because of throughput rather than latency.

Thanks
dipe

dipe

Feb 7, 2012, 8:16:20 AM

to fhgfs...@googlegroups.com, Sven Breuner

Thanks Sven,

those are good suggestions:

First we patched ls with your suggestions below and tried it again, but
it made no difference. Then we enabled FHGFS_INTENT and
our ls -l went from 60 sec to 31 sec. Then we tested the patched ls with
FHGFS_INTENT enabled and ls -l went down to around 22 seconds. I don't
understand why the first test case does not make any difference, but this
is really great stuff, thanks a lot.

After all, it seems our metadata server is doing the trick.

dipe

Sven Breuner

Feb 8, 2012, 4:32:04 AM

to dipe, fhgfs...@googlegroups.com

Hi,

dipe wrote on 02/07/2012 02:16 PM:
> First we patched ls with your suggestions below and tried it again but
> it made no difference. Then we changed FHGFS_INTENT and
> our ls -l went from 60 sec to 31 sec. Then we tested the patched ls with
> FHGFS_INTENT enabled and ls -l went down to around 22 seconds. I don't
> understand why the first test case does not make any difference but this
> is really great stuff, thanks a lot.

nice, thanks for the updated results.

Interesting that it made no difference for the first test. Maybe you
want to try something like "strace -c ls -l" with your patched ls to
make sure that the getxattr calls are completely gone.
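
If a full trace has been saved to a file (for example with strace's -o option), a few lines of Python can confirm whether any xattr calls remain. This is a sketch: count_syscalls and xattr_calls are made-up helper names, and the sample text is the first file from the strace excerpt posted earlier in this thread.

```python
import re
from collections import Counter

# A syscall line starts with the call name followed by "(".
# An optional leading "> " quote marker is tolerated.
SYSCALL_RE = re.compile(r"^\s*(?:>\s*)?([a-z_0-9]+)\(")


def count_syscalls(strace_text):
    """Count how often each system call appears in raw strace output."""
    counts = Counter()
    for line in strace_text.splitlines():
        match = SYSCALL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def xattr_calls(counts):
    """Return only the getxattr-style calls (getxattr, lgetxattr, ...)."""
    return {name: n for name, n in counts.items() if name.endswith("getxattr")}


sample = '''lstat("test", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
lgetxattr("test", "security.selinux", 0x62d130, 255) = -1 ENODATA (No data available)
getxattr("test", "system.posix_acl_access", 0x0, 0) = -1 EOPNOTSUPP (Operation not supported)
'''
print(xattr_calls(count_syscalls(sample)))
```

An empty result from xattr_calls() on the trace of the patched ls would mean the extra calls are really gone.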

Did you just remove the unnecessary calls from your special ls or did
you implement clean patches that also do some checks on whether the
calls are necessary (e.g. by checking if selinux is enabled)?

In the latter case, I would suggest you send your patches to the
coreutils discussion or bug mailing list (the list addresses can be found
here: http://www.gnu.org/software/coreutils).

Best regards,
Sven

> On 2/6/2012 4:38 AM, Sven Breuner wrote:
>> [..]
>> I don't know of any way to disable this ls overhead, so I will report
>> this to the GNU coreutils maintainers.
>> Maybe they can check whether selinux is enabled and skip those calls if it
>> isn't enabled. And maybe they can stop doing getxattr calls if the
>> first call already reports "EOPNOTSUPP".
>> [..]

dipe

Feb 13, 2012, 8:52:29 AM

to fhgfs-user

On Feb 8, 1:32 am, Sven Breuner <sven.breu...@itwm.fraunhofer.de>
wrote:

> Hi,
>
> dipe wrote on 02/07/2012 02:16 PM:
>
> > First we patched ls with your suggestions below and tried it again but
> > it made no difference. Then we changed FHGFS_INTENT and
> > our ls -l went from 60 sec to 31 sec. Then we tested the patched ls with
> > FHGFS_INTENT enabled and ls -l went down to around 22 seconds. I don't
> > understand why the first test case does not make any difference but this
> > is really great stuff, thanks a lot.
>
> nice, thanks for the updated results.
>
> Interesting that it made no difference for the first test. Maybe you
> want to try something like "strace -c ls -l" with your patched ls to
> make sure that the getxattr calls are completely gone.
>
> Did you just remove the unnecessary calls from your special ls or did
> you implement clean patches that also do some checks on whether the
> calls are necessary (e.g. by checking if selinux is enabled)?

Actually I thought we patched it but I was told that we just installed
the latest version of coreutils where this was already fixed.

Sven Breuner

Feb 13, 2012, 9:18:05 AM

to fhgfs...@googlegroups.com

Hi,

dipe wrote on 02/13/2012 02:52 PM:
>> Did you just remove the unnecessary calls from your special ls or did
>> you implement clean patches that also do some checks on whether the
>> calls are necessary (e.g. by checking if selinux is enabled)?
>
> Actually I thought we patched it but I was told that we just installed
> the latest version of coreutils where this was already fixed.

unfortunately, this is not yet fixed in more recent versions. If you
don't see these getxattr calls with your recent version and built your
version from source, then I assume the reason is that your version was
not built with acl and selinux support (which is of course also a way to
solve the problem ;-) ).

However, I already brought this up for discussion on the coreutils
mailing list and the coreutils maintainers agreed that this needs to be
improved, so they are working on it.
So far, they already provided a patch that avoids the
getxattr(selinux_stuff) calls.

Here's the link to the discussion, in case you're interested:
http://lists.gnu.org/archive/html/coreutils/2012-02/threads.html#00006

Best regards,
Sven
