keep_cache from FUSE


Rudd-O

Jul 2, 2008, 10:19:17 PM
to zfs-fuse
fi->keep_cache = 1;
at line 612 of zfs_operations.c in function zfsfuse_opencreate()

Does this mean that FUSE kernel cache is *enabled* in the ZFS code?
Is this just file data cache or also dentry cache? If no dentry cache
or negative dentry cache exists, how would I go about putting it in?
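
(For context: keep_cache affects only cached file *data* -- it tells the kernel to keep the page cache across open() calls instead of invalidating it. Dentry caching is a separate mechanism, driven by the timeouts in the fuse_entry_param returned from lookup. A minimal, generic sketch of a low-level open handler, not the actual zfs-fuse code:)

#define FUSE_USE_VERSION 26
#include <fuse_lowlevel.h>

/* Generic sketch, not zfs-fuse code: keep_cache = 1 asks the kernel to
 * reuse previously cached file data instead of discarding the page
 * cache on every open(). It says nothing about dentries. */
static void example_open(fuse_req_t req, fuse_ino_t ino,
                         struct fuse_file_info *fi)
{
    (void)ino;            /* a real handler would open the file here */
    fi->keep_cache = 1;   /* keep cached pages across open() */
    fuse_reply_open(req, fi);
}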

Rudd-O

Jul 3, 2008, 5:39:42 AM
to zfs-fuse
According to Miklos:

---------------------

> - Does this endanger data safety?  I mean, if a file was recently
> created and the negative lookup caching is in effect, isn't there a
> window of time where lookups for that file from other apps or the
> creator app would fail?

No.

> - Does this require or assume something about the userspace app (zfs-fuse)?

For filesystems which are not distributed/networked there are no cache
coherence issues, so caching should work fine.

> - Does the standard kernel VFS cache negative lookups?

Yes.

The way to enable caching of negative dentries is to return success
instead of ENOENT, but set up 'fuse_entry_param' like this:

    e.ino = 0;
    e.entry_timeout = BIG_VALUE;

The ".ino = 0" will tell the kernel that the file doesn't exist, but
the negative dentry can be cached for the given number of seconds.

Miklos

---------------
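
To make that recipe concrete, here is a minimal, untested sketch of a low-level lookup handler that caches negative results. backend_lookup and NEG_ENTRY_TIMEOUT are hypothetical stand-ins, not zfs-fuse code:

#define FUSE_USE_VERSION 26
#include <fuse_lowlevel.h>
#include <errno.h>
#include <string.h>

#define NEG_ENTRY_TIMEOUT 30.0   /* seconds -- the "BIG_VALUE" above */

/* Hypothetical stand-in for the real filesystem lookup: fills in *e
 * and returns 0, or returns an errno. Here it pretends nothing exists. */
static int backend_lookup(fuse_ino_t parent, const char *name,
                          struct fuse_entry_param *e)
{
    (void)parent; (void)name; (void)e;
    return ENOENT;
}

static void example_lookup(fuse_req_t req, fuse_ino_t parent,
                           const char *name)
{
    struct fuse_entry_param e;
    memset(&e, 0, sizeof(e));

    int error = backend_lookup(parent, name, &e);

    if (error == ENOENT) {
        /* Per Miklos: reply success, but with ino == 0, so the kernel
         * caches the negative dentry for entry_timeout seconds. */
        e.ino = 0;
        e.entry_timeout = NEG_ENTRY_TIMEOUT;
        fuse_reply_entry(req, &e);
    } else if (error) {
        fuse_reply_err(req, error);   /* other errors stay errors */
    } else {
        fuse_reply_entry(req, &e);    /* positive entry */
    }
}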

I assume that goes into zfsfuse_access or zfsfuse_access_helper.

Why does /* access events always reply_err */ ?

How to do the entry_timeout and ino = 0 part in the context of those
functions?

Rudd-O

Jul 3, 2008, 5:42:11 AM
to zfs-fuse
The reason I'm trying to enable negative dentry caching is that the
BIG BULK of access()es on a Linux system (yes, over 95%) is for
nonexistent files, so if we can avoid the transition from kernelspace
to userspace for those, that is the biggest gain and the lowest-hanging
fruit, so to speak. The only question I have is whose responsibility
it is to evict the kernel cache when a file is actually created where
no file was before.

Trust me, we enable this, and boom we get the biggest performance
increase ever.

Bryan Donlan

Jul 3, 2008, 1:18:42 PM
to zfs-...@googlegroups.com

Since the file creation ops will go through the kernel as well, won't
that clear the negative hit too?

Rudd-O

Jul 4, 2008, 5:47:12 AM
to zfs-fuse
I don't think so. When starting kmail, the bulk of the file
operations is access() and easily tens of thousands of access()es are
issued for nonexistent files. If I could get zfs-fuse to cache those
negative lookups (AND of course clear the cache when a file is
created), trust me, I'd get the LARGEST performance kick. As of now,
it's not caching them because I can clearly see consecutive kmail
startups taking the same time (five to seven minutes).


Ricardo M. Correia

Jul 12, 2008, 12:59:00 PM
to zfs-...@googlegroups.com
Hi Rudd-O,

Please see below.

On Thu, 2008-07-03 at 02:39 -0700, Rudd-O wrote:
> The way to enable caching of negative dentries is to return success
> instead of ENOENT, but set up 'fuse_entry_param' like this:
>
> e.ino = 0;
> e.entry_timeout = BIG_VALUE;
>
> The ".ino = 0" will tell the kernel that the file doesn't exist, but
> the negative dentry can be cached for the given number of seconds.
>

(...)

> I assume that goes into zfsfuse_access or zfsfuse_access_helper
>
> Why does /* access events always reply_err */ ?

See /usr/include/fuse/fuse_lowlevel.h and search for "access".
You will see that the list of valid replies contains only "fuse_reply_err".

> How to do the entry_timeout and ino = 0 part in the context of those
> functions?

To me, it looks like that reply must be sent in the lookup method, not
the access method.
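
That restriction is visible in the prototypes: access can only acknowledge or reject; no reply variant carries the fuse_entry_param timeouts. A minimal sketch, with check_perm as a hypothetical stand-in:

#define FUSE_USE_VERSION 26
#include <fuse_lowlevel.h>

/* Hypothetical permission check: returns 0 if allowed, else an errno. */
static int check_perm(fuse_ino_t ino, int mask)
{
    (void)ino; (void)mask;
    return 0;
}

/* The only reply the low-level API allows for access is fuse_reply_err;
 * replying 0 means "access granted". Nothing here can carry the
 * entry_timeout needed for dentry caching -- that lives in lookup. */
static void example_access(fuse_req_t req, fuse_ino_t ino, int mask)
{
    fuse_reply_err(req, check_perm(ino, mask));
}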

See the patch below; it should do what Miklos said (warning: I've not
tested it at all).

BTW, sorry for being so absent lately, but it's been quite difficult for
me to find time and energy to work on zfs-fuse. I am planning to write a
post this weekend on the zfs-fuse blog about this.

Also, thanks for all the effort you have been putting into this
project; your comments and observations have been very helpful, even if
they haven't translated into concrete improvements yet.

Best regards,
Ricardo

negative-dentries.diff

Rudd-O

Jul 15, 2008, 5:11:13 AM
to zfs-fuse
I am going to test this patch RIGHT NOW. Will come back with
performance measurements and the like. Expect me to be back here in
an hour or two.

On Jul 12, 11:59 am, "Ricardo M. Correia" <Ricardo.M.Corr...@Sun.COM> wrote:
> negative-dentries.diff

Rudd-O

Jul 15, 2008, 10:34:25 AM
to zfs-fuse
Patch tested. No good. Here are my observations, in Python form:

-----------------------

#!/usr/bin/env python

import os, time, sys
files = sys.argv[1:]

# Background: KDE apps issue tens of thousands of stat/lstat/access calls
# when starting up, just to find their icons. Over 90% of those calls return
# ENOENT. Each one of those calls takes a while to finish, and they add up
# fairly predictably. This test synthesises this pathology with the purpose
# of identifying hotspots in the ZFS-FUSE callgraph.

# The following tests test dentry cache in ZFS-FUSE.
# I am testing two exactly congruent dirtrees:
# - /usr/share/icons backed by ZFS
# - /icons backed by ext3 (you should cp -R /usr/share/icons /icons
# to replicate this test accurately)

# The goal of the test is to find out what is the stat/lstat/access
# performance penalty using ZFS, because those calls comprise the large
# majority of time spent when starting KDE applications. We don't care about
# random read performance or streaming disk performance, because those
# performance points do not matter at all -- large files such as libs and
# binaries are already cached in core by the time the app is started.

# For the purpose, we run two tests:
# - positive dentry cache: walk a dirtree and issue the former calls.
# - negative dentry cache: issue these calls on nonexistent files.
# We test two versions of ZFS-FUSE: the baseline and with a patch done by
# Miklos to enable negative dentry cache.
# If negative dentry cache is enabled, there should be a dramatic
# performance improvement in the second test.

# Test invariants: zfsfuse tip, 64 MB ARC cache, debug=1, warm caches,
# 2.6.26-rc6, fuse keep_cache=1 in zfsfuse code, positive dentry cache = 0
# in zfsfuse code, fuse/libfuse cvs, static libz, zfs compression enabled,
# i386 arch (64-bit capable dual-core Xeon).
# Confirmed: disk lights barely blinked during the warm cache test.
# I also tested zfs-fuse built with scons debug=0. No runtime difference.
# I also tested disabling compression in the zvol. No runtime difference.

# Results and observations below.


# POSITIVE DENTRY CACHE TEST

# before patch:
# Positive dentry cache test - existent icon lookup
# File /icons:
# Stat time: 1.09539413452 -- avgstat: 6.09059846829e-05
# File /usr/share/icons:
# Stat time: 44.0628631115 -- avgstat: 0.00244997848827

# after patch:
# Positive dentry cache test - existent icon lookup
# File /icons:
# Stat time: 1.09060502052 -- avgstat: 6.06397008909e-05
# File /usr/share/icons:
# Stat time: 45.0394990444 -- avgstat: 0.00250428129243

# conclusion: no adverse effects of patch on positive cache (expected).
# however, it is ABSOLUTELY CLEAR that VFS is contacting FUSE
# and FUSE is contacting ZFS for every positive cache hit,
# which causes a 50-fold performance decrease for existing files.
# This was expected from a code inspection (cache timeout = 0).

print "Positive dentry cache test - existent icon lookup"
for file in files:
count = 0
start = time.time()
for path,ds,fs in os.walk(file):
for d in ds+fs:
route = path+"/"+d
try: os.stat(route)
except Exception: pass
try: os.lstat(route)
except Exception: pass
try: os.access(route)
except Exception: pass
count = count + 1
if count % 1000 == 0: print count
end = time.time() - start
print "File %s:\nStat time: %s -- avgstat: %s"%(file,end,end/count)


# NEGATIVE DENTRY CACHE TEST
# Feel my pain: this is what KDE apps do on startup to locate their icons.
# For EACH ICON. Kmail even looks up an icon or two when selecting an email.
# Fifteen seconds of hung Kmail to read each email is unacceptable.

# before patch:
# Negative dentry cache test - missing icon lookup
# File /icons:
# Stat time: 0.541098117828 -- avgstat: 0.000207715208379
# File /usr/share/icons:
# Stat time: 14.7939498425 -- avgstat: 0.00567905944048

# after patch:
# Negative dentry cache test - missing icon lookup
# File /icons:
# Stat time: 0.533202171326 -- avgstat: 0.000204684134866
# File /usr/share/icons:
# Stat time: 14.3626229763 -- avgstat: 0.00551348290837

# conclusion:
# The patch is *ABSOLUTELY INEFFECTIVE* in preventing the extra
# transition from kernelspace into userspace. The kernel still
# consults ZFS-FUSE whether the file exists or not.

# The patch should have made the kernel VFS or the FUSE VFS skip the
# second transition from kernel space into user space and into libfuse/ZFS.
# It is obvious this didn't work.

# It stands to reason that no matter what the entry struct says
# about caching, lookups are simply not being cached.
# More evidence for this: the slabtop result for fuse_inode
# is half the size of my ext3 cache, which is utterly absurd
# considering that my ZFS FS is 380 GB, and my ext3 one is 1 GB.
# What am I doing wrong?

# My work points me to the fuse kernel module -- it must have a bug
# preventing the cache from working or caching anything inode-related.
# General usage of the system shows that keep_cache=1 is working.

# Another observation: during a make -j3 kernel compilation, trying tab
# completion at the shell on a directory with 25 entries is noticeably slow.
# Each row of entries is preceded by a one-to-two second delay. There must
# be a lock contention issue somewhere in ZFS, because top says about 33%
# CPU idle during this compile, when it's clear I should be maxing out my
# CPUs. 16% libc time in oprofile dumps of /sbin/zfs-fuse, mostly in
# pthread_* calls, also points to this. This, coupled with atomic_add_64
# being king of the hill in oprofile dumps, pointed me to the definition of
# atomic_add_64 in i386/atomic.S, compared to amd64/atomic.S. It's one
# screenful of instructions vs. a single instruction on the 64-bit arch --
# the 32-bit version emulates the 64-bit atomic using locks. Not good at all.

print "Negative dentry cache test - missing icon lookup"
for file in files:
count = 0
start = time.time()
for path,ds,fs in os.walk(file):
for ext in "png,jpg,xpm,svg,svgz".split(","):
for d in ds:
route = path+"/"+"ktorrentnot."+ext
try: os.stat(route)
except Exception: pass
try: os.lstat(route)
except Exception: pass
try: os.access(route)
except Exception: pass
count = count + 1
if count % 1000 == 0: print count
end = time.time() - start
print "File %s:\nStat time: %s -- avgstat: %s"%(file,end,end/count)

Ricardo M. Correia

Jul 16, 2008, 9:57:45 AM
to zfs-...@googlegroups.com
Hi Rudd-O,

I've tried your negative dentry cache test, but it looks like the vast
majority of lookups are for existing dentries, not negative ones.

You can see this for yourself by adding ",debug" to FUSE_OPTIONS in
zfs-fuse/util.c and starting zfs-fuse with the --no-daemon option. You
can also add the following line to zfs_operations.c right at the start
of function zfsfuse_lookup() in order to see which dentries are being
looked up:

fprintf(stderr, "Looking up %s\n", name);

Anyway, my previous patch was not correct. Please revert it and try
applying the attached patch; it should cache negative dentries correctly
now.

Note that it probably won't have much impact on performance in your test
because most of the lookups seem to be for existing dentries. I suspect
it might be because os.walk() is stat()ing files, but I haven't verified
if that's the case.

HTH,
Ricardo

negative-dentries.diff

Rudd-O

Jul 16, 2008, 10:33:58 AM
to zfs-fuse
My first test (first code block) is for existing dentries. My second
one (second code block) is for negative ones (lookups that return
ENOENT in their own right); sure, it does some stat()s of existing
dirs, but it then uses that statted dir list to generate thousands of
lookups for ktorrentnot.png, which doesn't exist anywhere in a sane
distro.

You can verify this behavior by running the Python script under
strace.

I am going to try your patch right now again. Thank you VERY MUCH for
your efforts!

Rudd-O

Jul 16, 2008, 10:35:13 AM
to zfs-fuse
How did I miss the missing error = 0 earlier? *slaps forehead*

Whoo-aaa! :-)

Ricardo M. Correia

Jul 16, 2008, 10:46:31 AM
to zfs-...@googlegroups.com
On Wed, 2008-07-16 at 07:33 -0700, Rudd-O wrote:
> My first test (first code block) is for existing dentries. My second
> one (second code block) is for negative ones (lookups that return
> ENOENT in their own right); sure, it does some stat()s of existing
> dirs, but it then uses that statted dir list to generate thousands of
> lookups for ktorrentnot.png, which doesn't exist anywhere in a sane
> distro.

Right.

I was referring to the negative lookups (second code block).
Even if you run only that code block, you will see that zfs-fuse
receives lots of LOOKUP requests from FUSE which are answered with a
successful return code (because they are lookups for existing files).



> How did I miss the missing error = 0 earlier? *slaps forehead*

I'm not sure what you mean by this; note that the patches are different.
The error = 0 in this patch is necessary because we cannot return
ENOENT.

Cheers,
Ricardo


Rudd-O

Jul 16, 2008, 2:36:31 PM
to zfs-fuse
> I was referring to the negative lookups (second code block).
> Even if you run only that code block, you will see that zfs-fuse
> receives lots of LOOKUP requests from FUSE which are answered with a
> successful return code (because they are lookups for existing files).

You were right. I am about to post a patch that enables both positive
and negative dentry cache. With it, the perftest with warm cache runs
almost as fast as on ext3, and decent performance is back on my
machine! Turns out positive caching is the key.
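
A guess at the shape of such a patch, not the actual cache-dentries.diff: a successful lookup reply simply gets non-zero timeouts as well, for example:

#define FUSE_USE_VERSION 26
#include <fuse_lowlevel.h>
#include <sys/stat.h>
#include <string.h>

#define POS_CACHE_TIMEOUT 60.0   /* hypothetical value, in seconds */

/* Sketch: reply to a successful lookup with non-zero timeouts so the
 * kernel caches the positive dentry and its attributes, instead of
 * calling back into userspace on every path resolution. */
static void reply_positive(fuse_req_t req, fuse_ino_t ino,
                           const struct stat *attr)
{
    struct fuse_entry_param e;
    memset(&e, 0, sizeof(e));
    e.ino = ino;
    e.attr = *attr;
    e.entry_timeout = POS_CACHE_TIMEOUT;  /* name -> inode binding */
    e.attr_timeout  = POS_CACHE_TIMEOUT;  /* cached stat() data    */
    fuse_reply_entry(req, &e);
}

The trade-off is staleness, but as Miklos noted above, a local non-distributed filesystem has no cache coherence problem.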

Rudd-O

Jul 16, 2008, 2:43:31 PM
to zfs-fuse
Done. Just posted cache-dentries.diff to the group. With this,
perftest (and KDE applications in general, once their icons are warm
in cache) start almost as fast on ZFS as on ext3.

You simply can't believe the two-orders-of-magnitude performance
improvement until you try the patch. BTW, it incorporates your patch
but also enables the positive cache.

No ill effects in userspace noticed yet.

Rudd-O

Jul 16, 2008, 7:23:16 PM
to zfs-fuse
Sorry for the reply to self, but if you are in a memory-constrained
system and you want to take more advantage of this patch, I heartily
suggest you first reduce your ARC cache size to 64 or 32 MB (64 is my
lucky number of choice) and then:

echo 10 > /proc/sys/vm/vfs_cache_pressure  # tilt caching in favor of dentries instead of disk blocks
echo 80 > /proc/sys/vm/swappiness          # release more RAM for dentry caching; helps with large filesystems

Chris Samuel

Jul 17, 2008, 7:31:13 AM
to zfs-...@googlegroups.com
On Thu, 17 Jul 2008, Rudd-O wrote:

> Done.  Just posted cache-dentries.diff to the group.  With this,
> perftest (and KDE applications in general, once their icons are warm
> in cache) start almost as fast on ZFS as on ext3.

I'd be more than happy to test this on my system here, every night (well,
assuming I remember) I run a script called zsnapshot which effectively
does an rsync to a ZFS filesystem and then snapshots it.

Just need the patch to appear here! :-)

cheers,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic.
For more info see: http://en.wikipedia.org/wiki/OpenPGP


Rudd-O

Jul 17, 2008, 2:37:00 PM
to zfs-fuse

Chris Samuel

Jul 17, 2008, 5:44:41 PM
to zfs-...@googlegroups.com
On Fri, 18 Jul 2008, Rudd-O wrote:

> The patch is on the file list here:

Aha - I just use this as a mailing list, I'm too old fashioned to have
looked there, thanks for the pointer!


Ricardo M. Correia

Jul 17, 2008, 5:55:14 PM
to zfs-...@googlegroups.com
On Fri, 2008-07-18 at 07:44 +1000, Chris Samuel wrote:
> On Fri, 18 Jul 2008, Rudd-O wrote:
>
> > The patch is on the file list here:
>
> Aha - I just use this as a mailing list, I'm too old fashioned to have
> looked there, thanks for the pointer!

Yes, I was a bit confused at first too :)

Regarding the patch, please note that (if I'm not mistaken) caching
positive dentries may lead to a security flaw if your files or
directories have ACLs (e.g., if you have set them on Solaris), because
zfs-fuse will no longer check for access permission on lookups.

Best regards,
Ricardo


Chris Samuel

Jul 17, 2008, 6:43:51 PM
to zfs-...@googlegroups.com
On Fri, 18 Jul 2008, Rudd-O wrote:

> The patch is on the file list here:

Wow. Last night the rsync took 39 minutes for /home (no major changes I
think, just email going in and out), this morning it took 11 minutes
(again, just email changes).

Just one data point, but it'll be really interesting to see how that
carries on over the next few days with more major stuff going on.

I'll try a bonnie++ just in case it shows anything new.

cheers!


Chris Samuel

Jul 17, 2008, 6:47:00 PM
to zfs-...@googlegroups.com
On Fri, 18 Jul 2008, Chris Samuel wrote:

> Wow.   Last night the rsync took 39 minutes for /home (no major changes
> I think, just email going in and out), this morning it took 11 minutes
> (again, just email changes).

A better data point would have been my local Subversion repo (372MB).

Yesterday (no changes):

0.01user 0.13system 0:02.48elapsed 5%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+925minor)pagefaults 0swaps

Today (no changes):

0.02user 0.03system 0:00.55elapsed 9%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+925minor)pagefaults 0swaps


Rudd-O

Jul 17, 2008, 9:36:00 PM
to zfs-fuse
Interesting. How do we avoid this? Doesn't setfacl go through the
VFS, then straight into zfs_operations.c, and in the process
invalidate the inode cache after setting the attributes? What can we
do to make sure that happens?

Today ZFS SIGABRTed on me. Unfortunately the core file was too big
for the puny root filesystem that I am using. Any ideas on how to get
that core file next time it SIGABRTs? I would not mind running a
debugger permanently attached to zfs-fuse, but I'm concerned that each
signal in zfs-fuse would interrupt operations until I continue
debugging. How can I avoid that while capturing SIGSEGV and SIGABRT?

Rudd-O

Jul 18, 2008, 6:58:05 PM
to zfs-fuse
Okay. I've just completed my migration from Kubuntu 32-bit to Fedora
9 64-bit. A migration that, might I add, I did live (I did have to
reboot a few times instead of just once, because I nuked several
important files in the process) instead of with an installer or
anything.

The performance difference between 64 and 32 bits is astounding. I'm
also exercising the caching patch I posted on this list, and so far,
so good. I'm not using big_writes because I'm using the vanilla fuse
libs that came with FUSE.

Something VERY important: sys/stropts.h is no longer in glibc on
Fedora 9, so I replaced its occurrences with string.h and I could
compile ZFS again. Heads up, because apparently this is the future,
not just a Fedora thing. I'd post a patch, but heck, it's easier to
grep and sed.

Bad news: libaio is in /usr/lib64, so I had to compile it statically.
Good news: libz is in /lib64, so I don't have to compile it
statically.

Rudd-O

Jul 21, 2008, 2:16:13 AM
to zfs-fuse
Two steps forward, one step back. The most problematic VFS syscall
for KDE apps is access(), and that is precisely the operation that is
not consistently sped up by the cache-dentries.diff patch.

Three sequential runs:

[root@karen /]# /backups/perftest a /usr/share/icons
Positive dentry cache test - existent icon lookup
File /usr/share/icons: speedup is 1.84371723682
Negative dentry cache test - missing icon lookup
File /usr/share/icons: speedup is 1.88145126374
[root@karen /]# /backups/perftest a /usr/share/icons
Positive dentry cache test - existent icon lookup
File /usr/share/icons: speedup is 0.904757233292
Negative dentry cache test - missing icon lookup
File /usr/share/icons: speedup is 0.934242021794
[root@karen /]# /backups/perftest a /usr/share/icons
Positive dentry cache test - existent icon lookup
File /usr/share/icons: speedup is 1.18232543167
Negative dentry cache test - missing icon lookup
File /usr/share/icons: speedup is 0.967109159875

of the following code:

#!/usr/bin/env python

import os, time, sys
files = sys.argv[2:]
test = sys.argv[1]

# test: string containing a for access, l for lstat and s for stat
# cache does not speed up access() at all
# in fact, it makes access() slower across the board

print "Positive dentry cache test - existent icon lookup"
for run in ["cold cache","warm cache"]:

for f in files:

if run == "cold cache":
os.system("sync")
time.sleep(1)
try: file("/proc/sys/vm/drop_caches","w").write("3")
except IOError,e: print "Failed to drop caches"
time.sleep(1)

count = 0
start = time.time()
for path,ds,fs in os.walk(f):
for d in ds+fs:
route = path+"/"+d
if "s" in test:
try: os.stat(route)
except Exception: pass
if "l" in test:
try: os.lstat(route)
except Exception: pass
if "a" in test:
try: os.access(route)
except Exception: pass
count = count + 1
# if count % 1000 == 0: print count
end = time.time() - start
if run == "cold cache": elapsed = end
else: print "File %s: speedup is %s"%(f,elapsed/end)


print "Negative dentry cache test - missing icon lookup"
for run in ["cold cache","warm cache"]:

for f in files:

if run == "cold cache":
os.system("sync")
time.sleep(1)
try: file("/proc/sys/vm/drop_caches","w").write("3")
except IOError,e: print "Failed to drop caches"
time.sleep(1)

count = 0
start = time.time()
for path,ds,fs in os.walk(f):
for ext in "png,jpg,xpm,svg,svgz".split(","):
for d in ds:
route = path+"/"+"ktorrentnot."+ext
if "s" in test:
try: os.stat(route)
except Exception: pass
if "l" in test:
try: os.lstat(route)
except Exception: pass
if "a" in test:
try: os.access(route)
except Exception: pass
count = count + 1
# if count % 1000 == 0: print count
end = time.time() - start
if run == "cold cache": elapsed = end
else: print "File %s: speedup is %s"%(f,elapsed/end)


Hypotheses?

Szabolcs Szakacsits

Jul 21, 2008, 9:36:52 AM
to zfs-fuse

On Sun, 20 Jul 2008, Rudd-O wrote:

> Two steps forward, one step back. The most problematic VFS syscall
> for KDE apps is access(), and that is precisely the operation that is
> not consistently sped up by the cache-dentries.diff patch.
>
> Three sequential runs:
>
> [root@karen /]# /backups/perftest a /usr/share/icons
> Positive dentry cache test - existent icon lookup
> File /usr/share/icons: speedup is 1.84371723682
> Negative dentry cache test - missing icon lookup
> File /usr/share/icons: speedup is 1.88145126374
> [root@karen /]# /backups/perftest a /usr/share/icons
> Positive dentry cache test - existent icon lookup
> File /usr/share/icons: speedup is 0.904757233292
> Negative dentry cache test - missing icon lookup
> File /usr/share/icons: speedup is 0.934242021794
> [root@karen /]# /backups/perftest a /usr/share/icons
> Positive dentry cache test - existent icon lookup
> File /usr/share/icons: speedup is 1.18232543167
> Negative dentry cache test - missing icon lookup
> File /usr/share/icons: speedup is 0.967109159875

For comparison, here are the ext3 and ntfs-3g averaged numbers from
running this test:

                                 ext3   ntfs-3g
Positive dentry cache speedup:     11        33
Negative dentry cache speedup:     12        28
Runtime in seconds (wall-time):    11        29

The ntfs-3g CPU usage was well under 10%, and the reason for the worse
performance is the suboptimal placement/handling of the many files on the
disk, i.e. most of the time is spent waiting for disk seeks to finish (this
is optimized for ext3 but not yet for ntfs-3g). Testing with an SSD could
give a significantly different result, in favour of the FUSE file system.

Szaka

--
NTFS-3G: http://ntfs-3g.org
