I guess your bottleneck is network latency. So 'ls -l' translates into
1) opendir()
2) readdir()
3) for each entry from readdir(): stat()
4) sort all entries
So it has to call stat() 58000 times, and it does so sequentially, so
everything depends on network latency.
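
To make the call pattern above concrete, here is a minimal Python sketch of it. It is not how ls is implemented; the function names are made up for illustration, and the 1 ms round trip used in the estimate is an assumed figure, not a measurement from this setup.

```python
import os
import tempfile


def long_listing(path):
    """Roughly mimic the call pattern of 'ls -l': list, lstat each entry, sort."""
    names = os.listdir(path)              # opendir() + readdir()
    entries = []
    for name in names:                    # one lstat() per entry, one after another
        st = os.lstat(os.path.join(path, name))
        entries.append((name, st.st_size, st.st_mode))
    entries.sort()                        # sort all entries by name
    return entries


def estimated_seconds(n_entries, round_trip_s):
    """Rough time estimate if every stat costs one sequential network round trip."""
    return n_entries * round_trip_s


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        for n in ("b", "a", "c"):
            open(os.path.join(d, n), "w").close()
        print([e[0] for e in long_listing(d)])
    # Assumed 1 ms per round trip (hypothetical value)
    print(estimated_seconds(58000, 0.001))
```

With an assumed round trip of about 1 ms, 58000 sequential calls already add up to roughly a minute, regardless of how fast the server itself is.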
On the development side we could try to improve that with workarounds,
but in the end such workarounds tend to destabilize the code, as they
are only workarounds. It would be much better if POSIX defined a syscall
with a read/stat queue, so that multiple entries could be fetched at
once. Unfortunately, that does not exist right now :(
But as a side note, does 'ls -l' on a directory with 58000 entries still
make sense? Are you really going to read the output? Or wouldn't it be
better to run it in parallel on directories with far fewer entries,
which is closer to a typical user workload?
Especially for network file systems, admins should usually recommend
that their users use plain 'ls', which does not include the additional
and rather slow stat call for every file.
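
For comparison, a names-only listing in the spirit of plain 'ls' can be sketched in Python as follows. The function name is made up for illustration; the point is only that the names come from the directory read itself and no per-file stat is issued.

```python
import os


def names_only(path):
    """Like plain 'ls': return sorted names from the directory read, without stat calls."""
    with os.scandir(path) as it:
        # entry.name comes straight from readdir(); no stat() is needed for it
        return sorted(entry.name for entry in it)
```

The cost of this variant is dominated by reading the directory, not by one round trip per file.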
Cheers,
Bernd
Bernd Schubert wrote on 02/04/2012 11:07 PM:
> I guess your bottleneck is network latency. So 'ls -l' translates into
>
> 1) opendir()
> 2) readdir()
> 3) for each entry from readdir(): stat()
> 4) sort of all entries
>
> So it has to call 58000 times stat and also will do it sequentially and
> so everything depends on network latency.
As Bernd said, network latency is the dominating time factor here.
However, besides the stat()-calls, there is more overhead being
introduced by "ls -l" - see below.
> On 02/04/2012 09:21 PM, dipe wrote:
>> I should probably add that we are running 2.6.37.6 and ext4 on the
>> metadata server and have implemented the tuning suggestions from
>>
>> http://www.fhgfs.com/wiki/wikka.php?wakka=ServerTuning#metatune
>>
>> and have set
>>
>> tuneNumWorkers
>>
>> to 0, 16, 32 and 100 but neither made any difference.
For workloads from a single client process, it's normal that increasing
the number of workers won't have any effect. This is only relevant if
you have multiple client processes (either on the same client or on
multiple clients) doing fs operations.
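
As an illustration of what "multiple operations in flight" means on the client side, here is a small Python sketch that issues the lstat calls from several threads. This is not something ls does, the function name and the default of 16 threads are arbitrary, and whether threads in one process benefit in the same way as separate client processes is an assumption here.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def parallel_lstat(path, max_workers=16):
    """Issue lstat() for every entry from several threads so requests can overlap."""
    names = sorted(os.listdir(path))
    full_paths = [os.path.join(path, name) for name in names]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input paths
        stats = list(pool.map(os.lstat, full_paths))
    return dict(zip(names, stats))
```

Only with a workload like this, where several requests reach the server at the same time, would a larger number of workers have something to do.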
>> On Feb 4, 11:43 am, dipe<dip...@gmail.com> wrote:
>>> we have a folder with 58000 files on one of our nfs servers and
>>> executing one ls -l takes about 4 minutes. (the NFS server is quite
>>> busy anyway)
>>>
>>> with the fhgfs setup we are now down to 1 minutes which is nice, but
>>> secretly I had hoped for an even better result (may be 10 seconds?) is
>>> this unrealistic?
I assume you are using Gigabit Ethernet.
Did you also notice that there is a build option FHGFS_INTENT that can
be enabled to increase stat-performance? (This option will be enabled by
default with the next major release.)
Currently, it's only documented in the changelog (search for FHGFS_INTENT):
http://www.fhgfs.com/release/fhgfs_2011.04/Changelog.txt
In general, doing about 60,000 stat() calls in about 10 seconds is not
unrealistic, but doing an "ls -l" in that time frame is difficult. I
just did some fhgfs bonnie benchmarks on a GigE line with a single
directory with 60,000 files. Bonnie measured >5,000 stat operations per
second from the client. So if "ls -l" only did the stats, you would be
done in roughly 10 seconds.
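
The arithmetic behind that estimate, as a short Python sketch. The helper names are made up; the numbers are the ones from this thread (60,000 files at 5,000 stats per second, and the reported 58000 files in about 60 seconds).

```python
def seconds_for_stats(n_files, stats_per_second):
    """Time needed if the listing consisted of stat calls only."""
    return n_files / stats_per_second


def effective_rate(n_files, elapsed_seconds):
    """Files handled per second, derived from an observed run time."""
    return n_files / elapsed_seconds


print(seconds_for_stats(60000, 5000))   # 12.0
print(effective_rate(58000, 60))        # a bit under 1,000 files per second
```

So the reported one-minute run corresponds to roughly a fifth of the stat rate that bonnie measured.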
Unfortunately, "ls -l" is not only doing stat calls, but also some other
calls. Here is an strace excerpt of "ls -l" showing the calls per file:
$ strace ls -l
...
lstat("test", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
lgetxattr("test", "security.selinux", 0x62d130, 255) = -1 ENODATA (No data available)
getxattr("test", "system.posix_acl_access", 0x0, 0) = -1 EOPNOTSUPP (Operation not supported)
lstat("test2", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
lgetxattr("test2", "security.selinux", 0x62b350, 255) = -1 ENODATA (No data available)
getxattr("test2", "system.posix_acl_access", 0x0, 0) = -1 EOPNOTSUPP (Operation not supported)
lstat("test3", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
lgetxattr("test3", "security.selinux", 0x62d150, 255) = -1 ENODATA (No data available)
getxattr("test3", "system.posix_acl_access", 0x0, 0) = -1 EOPNOTSUPP (Operation not supported)
...
As you can see, "ls -l" is performing two other operations (for selinux
and ACLs) besides the actual lstat call for each file in the directory.
In fact, neither of those extra operations makes any sense here, because
selinux is not enabled on the system and ACLs are not supported. But
these operations make the Linux kernel call fhgfs's inode revalidation
method and thus significantly increase the "ls -l" time.
I don't know of any way to disable this ls overhead, so I will report
this to the GNU coreutils maintainers.
Maybe they can check whether selinux is enabled and skip those calls if
it isn't. And maybe they can stop doing getxattr calls once the first
call has reported "EOPNOTSUPP".
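
The second idea can be sketched in Python like this. It is only an illustration of the suggestion, not coreutils code: make_acl_reader and read_acl are made-up names, remembering the result per device number is an assumption about how one might scope it, and os.getxattr is only available on Linux.

```python
import errno
import os


def make_acl_reader(getxattr=None):
    """Return a reader that stops asking for ACLs once a device reports EOPNOTSUPP."""
    if getxattr is None:
        getxattr = os.getxattr            # Linux only
    unsupported_devices = set()

    def read_acl(path, st_dev):
        if st_dev in unsupported_devices:
            return None                   # skip the call entirely
        try:
            return getxattr(path, "system.posix_acl_access")
        except OSError as exc:
            if exc.errno in (errno.EOPNOTSUPP, errno.ENOTSUP):
                unsupported_devices.add(st_dev)   # remember for all later files
                return None
            if exc.errno == errno.ENODATA:
                return None               # supported, but this file has no ACL
            raise

    return read_acl
```

Here st_dev would come from the lstat result that "ls -l" needs anyway, so for a directory on a file system without ACL support only the first file would cost an extra call.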
However, even if they implement the changes, it will of course take some
time until the modifications find their way into the Linux server
distributions.
Best regards,
Sven
I thought that network must be the main culprit because on a local
system I can do ls -l on a directory with a million files and it only
takes a few seconds.
You are right, 60000 files in one directory is rather rare, but more
often we are dealing with 5,000-10,000 files and it would be nice to be
able to keep ls -l for this below 3-5 sec.
As interrupts and context switches seem to contribute quite a bit to
the latency, I am interested in technologies that claim to address
this:
http://www.solarflare.com/Content/UserFiles/Documents/Solarflare_OpenOnload_IntroPaper.pdf
http://www.solarflare.com/Content/UserFiles/Documents/Solarflare_10GbE_HPC_Whitepaper.pdf
As far as I can tell, this is inserted transparently into the network
stack and does not require any modifications on your end. Just wanted
to share it.
I have been testing on 1G without any kind of network tuning. It appears
that 10G reduces latency significantly compared to 1G, and it seems that
OpenOnload shaves off another 50% of the latency from 10G.
I wonder how this would compare to technologies like InfiniBand when it
comes to file system latency (using commands such as ls). I believe most
HPC people deploy InfiniBand because of throughput rather than latency.
Thanks
dipe
Those are good suggestions:
First we patched ls with your suggestions below and tried it again but
it made no difference. Then we changed FHGFS_INTENT and
our ls -l went from 60 sec to 31 sec. Then we tested the patched ls with
FHGFS_INTENT enabled and ls -l went down to around 22 seconds. I don't
understand why the first test case does not make any difference but this
is really great stuff, thanks a lot.
After all, it seems our metadata server is doing the trick.
dipe
dipe wrote on 02/07/2012 02:16 PM:
> First we patched ls with your suggestions below and tried it again but
> it made no difference. Then we changed FHGFS_INTENT and
> our ls -l went from 60 sec to 31 sec. Then we tested the patched ls with
> FHGFS_INTENT enabled and ls -l went down to around 22 seconds. I don't
> understand why the first test case does not make any difference but this
> is really great stuff, thanks a lot.
Nice, thanks for the updated results.
Interesting that it made no difference for the first test. Maybe you
want to try something like "strace -c ls -l" with your patched ls to
make sure that the getxattr calls are completely gone.
Did you just remove the unnecessary calls from your special ls or did
you implement clean patches that also do some checks on whether the
calls are necessary (e.g. by checking if selinux is enabled)?
In the latter case, I would suggest you send your patches to the
coreutils discussion or bug mailing list (the list addresses can be
found here: http://www.gnu.org/software/coreutils).
Best regards,
Sven
> On 2/6/2012 4:38 AM, Sven Breuner wrote:
>> [..]
>> I don't know of any way to disable this ls overhead, so I will report
>> this to the GNU coreutils maintainers.
>> Maybe they can check whether selinux is enabled and skip those calls
>> if it isn't. And maybe they can stop doing getxattr calls once the
>> first call has reported "EOPNOTSUPP".
>> [..]
dipe wrote on 02/13/2012 02:52 PM:
>> Did you just remove the unnecessary calls from your special ls or did
>> you implement clean patches that also do some checks on whether the
>> calls are necessary (e.g. by checking if selinux is enabled)?
>
> Actually I thought we patched it but I was told that we just installed
> the latest version of coreutils where this was already fixed.
Unfortunately, this is not yet fixed in more recent versions. If you
built your recent version from source and don't see these getxattr
calls with it, then I assume the reason is that it was not built with
acl and selinux support (which is of course also a way to solve the
problem ;-) ).
However, I already brought this up for discussion on the coreutils
mailing list and the coreutils maintainers agreed that this needs to be
improved, so they are working on it.
So far, they already provided a patch that avoids the
getxattr(selinux_stuff) calls.
Here's the link to the discussion, in case you're interested:
http://lists.gnu.org/archive/html/coreutils/2012-02/threads.html#00006
Best regards,
Sven