
Improving ZFS performance for large directories


Kevin Day

Jan 29, 2013, 6:20:15 PM

I'm trying to improve performance when using ZFS in large (>60000 files) directories. A common activity is to use "getdirentries" to enumerate all the files in the directory, then "lstat" on each one to get information about it. Doing an "ls -l" in a large directory like this can take 10-30 seconds to complete. Trying to figure out why, I did:

ktrace ls -l /path/to/large/directory
kdump -R |sort -rn |more

to see which syscalls were taking the most time. I ended up with:

69247 ls 0.190729 STRU struct stat {dev=846475008, ino=46220085, mode=-rw-r--r-- , nlink=1, uid=0, gid=0, rdev=4294967295, atime=1333196714, stime=1201004393, ctime=1333196714.547566024, birthtime=1333196714.547566024, size=30784, blksize=31232, blocks=62, flags=0x0 }
69247 ls 0.180121 STRU struct stat {dev=846475008, ino=46233417, mode=-rw-r--r-- , nlink=1, uid=0, gid=0, rdev=4294967295, atime=1333197088, stime=1209814737, ctime=1333197088.913571042, birthtime=1333197088.913571042, size=3162220, blksize=131072, blocks=6409, flags=0x0 }
69247 ls 0.152370 RET getdirentries 4088/0xff8
69247 ls 0.139939 CALL stat(0x800d8f598,0x7fffffffcca0)
69247 ls 0.130411 RET __acl_get_link 0
69247 ls 0.121602 RET __acl_get_link 0
69247 ls 0.105799 RET getdirentries 4064/0xfe0
69247 ls 0.105069 RET getdirentries 4068/0xfe4
69247 ls 0.096862 RET getdirentries 4028/0xfbc
69247 ls 0.085012 RET getdirentries 4088/0xff8
69247 ls 0.082722 STRU struct stat {dev=846475008, ino=72941319, mode=-rw-r--r-- , nlink=1, uid=0, gid=0, rdev=4294967295, atime=1348686155, stime=1348347621, ctime=1348686155.768875422, birthtime=1348686155.768875422, size=6686225, blksize=131072, blocks=13325, flags=0x0 }
69247 ls 0.070318 STRU struct stat {dev=846475008, ino=46211679, mode=-rw-r--r-- , nlink=1, uid=0, gid=0, rdev=4294967295, atime=1333196475, stime=1240230314, ctime=1333196475.038567672, birthtime=1333196475.038567672, size=829895, blksize=131072, blocks=1797, flags=0x0 }
69247 ls 0.068060 RET getdirentries 4048/0xfd0
69247 ls 0.065118 RET getdirentries 4088/0xff8
69247 ls 0.062536 RET getdirentries 4096/0x1000
69247 ls 0.061118 RET getdirentries 4020/0xfb4
69247 ls 0.055038 STRU struct stat {dev=846475008, ino=46220358, mode=-rw-r--r-- , nlink=1, uid=0, gid=0, rdev=4294967295, atime=1333196720, stime=1274282669, ctime=1333196720.972567345, birthtime=1333196720.972567345, size=382344, blksize=131072, blocks=773, flags=0x0 }
69247 ls 0.054948 STRU struct stat {dev=846475008, ino=75025952, mode=-rw-r--r-- , nlink=1, uid=0, gid=0, rdev=4294967295, atime=1351071350, stime=1349726805, ctime=1351071350.800873870, birthtime=1351071350.800873870, size=2575559, blksize=131072, blocks=5127, flags=0x0 }
69247 ls 0.054828 STRU struct stat {dev=846475008, ino=65021883, mode=-rw-r--r-- , nlink=1, uid=0, gid=0, rdev=4294967295, atime=1335730367, stime=1332843230, ctime=1335730367.541567371, birthtime=1335730367.541567371, size=226347, blksize=131072, blocks=517, flags=0x0 }
69247 ls 0.053743 STRU struct stat {dev=846475008, ino=46222016, mode=-rw-r--r-- , nlink=1, uid=0, gid=0, rdev=4294967295, atime=1333196765, stime=1257110706, ctime=1333196765.206574132, birthtime=1333196765.206574132, size=62112, blksize=62464, blocks=123, flags=0x0 }
69247 ls 0.052015 RET getdirentries 4060/0xfdc
69247 ls 0.051388 RET getdirentries 4068/0xfe4
69247 ls 0.049875 RET getdirentries 4088/0xff8
69247 ls 0.049156 RET getdirentries 4032/0xfc0
69247 ls 0.048609 RET getdirentries 4040/0xfc8
69247 ls 0.048279 RET getdirentries 4032/0xfc0
69247 ls 0.048062 RET getdirentries 4064/0xfe0
69247 ls 0.047577 RET getdirentries 4076/0xfec
(snip)

The STRU lines are the structures returned by the lstat() calls.

It looks like both getdirentries and lstat are taking quite a while to return. The shortest return for any lstat() call is 0.000004 seconds, the maximum is 0.190729 seconds, and the average is around 0.0004 seconds. Across roughly 60,000 files, lstat() alone accounts for over 20 seconds of the "ls" run.
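
For anyone wanting to reproduce that breakdown, here is a rough sketch of summarizing the lstat() times from the kdump output; the field positions and the path are taken from the sample lines above and may need adjusting:

ktrace -f ls.ktrace ls -l /path/to/large/directory > /dev/null
kdump -R -f ls.ktrace | awk '$4 == "STRU" { sum += $3; n++ }
    END { printf "%d lstat results, %.2fs total, %.6fs avg\n", n, sum, sum/n }'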

I'm prepared to try an L2arc cache device (with secondarycache=metadata), but I'm having trouble determining how big of a device I'd need. We've got >30M inodes now on this filesystem, including some files with extremely long names. Is there some way to determine the amount of metadata on a ZFS filesystem?


Matthew Ahrens

Jan 29, 2013, 6:42:28 PM
On Tue, Jan 29, 2013 at 3:20 PM, Kevin Day <toa...@dragondata.com> wrote:

> I'm prepared to try an L2arc cache device (with secondarycache=metadata),


You might first see how long it takes when everything is cached. E.g. by
doing this in the same directory several times. This will give you a lower
bound on the time it will take (or put another way, an upper bound on the
improvement available from a cache device).
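
A minimal way to do that, with /path/to/large/directory standing in for the real path from the original message:

time ls -l /path/to/large/directory > /dev/null
time ls -l /path/to/large/directory > /dev/null
time ls -l /path/to/large/directory > /dev/null

The first run primes the cache; the later runs show the fully cached case.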


> but I'm having trouble determining how big of a device I'd need. We've got
> >30M inodes now on this filesystem, including some files with extremely
> long names. Is there some way to determine the amount of metadata on a ZFS
> filesystem?


For a specific filesystem, nothing comes to mind, but I'm sure you could
cobble something together with zdb. There are several tools to determine
the amount of metadata in a ZFS storage pool:

- "zdb -bbb <pool>"
but this is unreliable on pools that are in use
- "zpool scrub <pool>; <wait for scrub to complete>; echo '::walk
spa|::zfs_blkstats' | mdb -k"
the scrub is slow, but this can be mitigated by setting the global
variable zfs_no_scrub_io to 1. If you don't have mdb or equivalent
debugging tools on freebsd, you can manually look at
<spa_t>->spa_dsl_pool->dp_blkstats.

In either case, the "LSIZE" is the size that's required for caching (in
memory or on an L2ARC cache device). At a minimum you will need 512 bytes
for each file, to cache the dnode_phys_t.
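
As a rough lower bound using the numbers in this thread: >30 million files at 512 bytes per dnode_phys_t works out to roughly 15 GB of dnode metadata alone, before directory ZAP blocks and indirect blocks are counted, so a cache device (or metadata limit) would need to be at least that large to hold it all.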

--matt

Kevin Day

Jan 29, 2013, 7:06:01 PM

On Jan 29, 2013, at 5:42 PM, Matthew Ahrens <mah...@delphix.com> wrote:

> On Tue, Jan 29, 2013 at 3:20 PM, Kevin Day <toa...@dragondata.com> wrote:
> I'm prepared to try an L2arc cache device (with secondarycache=metadata),
>
> You might first see how long it takes when everything is cached. E.g. by doing this in the same directory several times. This will give you a lower bound on the time it will take (or put another way, an upper bound on the improvement available from a cache device).
>

Doing it twice back-to-back makes a bit of a difference, but it's still slow either way.

After not touching this directory for about 30 minutes:

# time ls -l >/dev/null
0.773u 2.665s 0:18.21 18.8% 35+2749k 3012+0io 0pf+0w

Immediately again:

# time ls -l > /dev/null
0.665u 1.077s 0:08.60 20.1% 35+2719k 556+0io 0pf+0w

18.2 vs 8.6 seconds is an improvement, but even the 8.6 seconds is longer than what I was expecting.

>
> For a specific filesystem, nothing comes to mind, but I'm sure you could cobble something together with zdb. There are several tools to determine the amount of metadata in a ZFS storage pool:
>
> - "zdb -bbb <pool>"
> but this is unreliable on pools that are in use

I tried this and it consumed >16GB of memory after about 5 minutes so I had to kill it. I'll try it again during our next maintenance window where it can be the only thing running.

> - "zpool scrub <pool>; <wait for scrub to complete>; echo '::walk spa|::zfs_blkstats' | mdb -k"
> the scrub is slow, but this can be mitigated by setting the global variable zfs_no_scrub_io to 1. If you don't have mdb or equivalent debugging tools on freebsd, you can manually look at <spa_t>->spa_dsl_pool->dp_blkstats.
>
> In either case, the "LSIZE" is the size that's required for caching (in memory or on a l2arc cache device). At a minimum you will need 512 bytes for each file, to cache the dnode_phys_t.

Okay, thanks a bunch. I'll try this the next chance I get, too.


I think some of the issue is that nothing is being allowed to stay cached long. We have several parallel rsyncs running at once that are basically scanning every directory as fast as they can, combined with a bunch of rsync, http and ftp clients. I'm guessing with all that activity things are getting shoved out pretty quickly.

Steven Hartland

Jan 29, 2013, 8:29:35 PM

----- Original Message -----
From: "Kevin Day" <toa...@dragondata.com>

> I think some of the issue is that nothing is being allowed to stay cached long. We have several parallel rsyncs running at once
> that are basically scanning every directory as fast as they can, combined with a bunch of rsync, http and ftp clients. I'm
> guessing with all that activity things are getting shoved out pretty quickly.

Is zfs send / recv a possible replacement for the rsyncs?

Regards
Steve



Kevin Day

Jan 29, 2013, 9:24:54 PM

On Jan 29, 2013, at 7:29 PM, "Steven Hartland" <kil...@multiplay.co.uk> wrote:

>
> ----- Original Message ----- From: "Kevin Day" <toa...@dragondata.com>
>
>> I think some of the issue is that nothing is being allowed to stay cached long. We have several parallel rsyncs running at once that are basically scanning every directory as fast as they can, combined with a bunch of rsync, http and ftp clients. I'm guessing with all that activity things are getting shoved out pretty quickly.
>
> zfs send / recv a possible replacements for the rsyncs?

Unfortunately not. We're pulling these files from a host that we do not control, and isn't running ZFS. We're also serving these files up via a public rsync daemon, and the vast majority of the clients receiving files from it are not running ZFS either.

Total data size is about 125TB now, growing to ~300TB in the near future. It's just a ton of data that really isn't being stored in the best manner for this kind of system, but we don't control the layout.

-- Kevin

Ronald Klop

Jan 30, 2013, 5:20:02 AM
On Wed, 30 Jan 2013 00:20:15 +0100, Kevin Day <toa...@dragondata.com>
wrote:

>
> I'm trying to improve performance when using ZFS in large (>60000 files)
> directories. A common activity is to use "getdirentries" to enumerate
> all the files in the directory, then "lstat" on each one to get
> information about it. Doing an "ls -l" in a large directory like this
> can take 10-30 seconds to complete. Trying to figure out why, I did:
>
> ktrace ls -l /path/to/large/directory
> kdump -R |sort -rn |more

Does ls -lf /path/to/large/directory make a difference? It tells ls not to
sort the directory entries, so it can use a more efficient way of traversing
the directory.
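
A quick comparison, using the placeholder path from the first message:

time ls -l /path/to/large/directory > /dev/null
time ls -lf /path/to/large/directory > /dev/null

If the second run is much faster, the sort is the bottleneck; if not, the time is going into getdirentries()/lstat().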

Ronald.
> I'm prepared to try an L2arc cache device (with
> secondarycache=metadata), but I'm having trouble determining how big of
> a device I'd need. We've got >30M inodes now on this filesystem,
> including some files with extremely long names. Is there some way to
> determine the amount of metadata on a ZFS filesystem?

Nikolay Denev

Jan 30, 2013, 5:36:35 AM
On Jan 30, 2013, at 1:20 AM, Kevin Day <toa...@dragondata.com> wrote:

>
> I'm trying to improve performance when using ZFS in large (>60000 files) directories. A common activity is to use "getdirentries" to enumerate all the files in the directory, then "lstat" on each one to get information about it. Doing an "ls -l" in a large directory like this can take 10-30 seconds to complete. Trying to figure out why, I did:
>
> ktrace ls -l /path/to/large/directory
> kdump -R |sort -rn |more
>
What are your vfs.zfs.arc_meta_limit and vfs.zfs.arc_meta_used sysctls?
Maybe increasing the limit would help?
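
They can be read together with, for example:

sysctl vfs.zfs.arc_meta_limit vfs.zfs.arc_meta_used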

Regards,

Kevin Day

Jan 30, 2013, 10:15:04 AM

On Jan 30, 2013, at 4:20 AM, "Ronald Klop" <ronald-...@klop.yi.org> wrote:

> On Wed, 30 Jan 2013 00:20:15 +0100, Kevin Day <toa...@dragondata.com> wrote:
>
>>
>> I'm trying to improve performance when using ZFS in large (>60000 files) directories. A common activity is to use "getdirentries" to enumerate all the files in the directory, then "lstat" on each one to get information about it. Doing an "ls -l" in a large directory like this can take 10-30 seconds to complete. Trying to figure out why, I did:
>>
>> ktrace ls -l /path/to/large/directory
>> kdump -R |sort -rn |more
>
> Does ls -lf /pat/to/large/directory make a difference. It makes ls not to sort the directory so it can use a more efficient way of traversing the directory.
>
> Ronald.

Nope, the sort seems to add a trivial amount of extra time to the entire operation. Nearly all the time is spent in lstat() or getdirentries(). Good idea though!

-- Kevin

Kevin Day

Jan 30, 2013, 10:19:52 AM

On Jan 30, 2013, at 4:36 AM, Nikolay Denev <nde...@gmail.com> wrote:
>
>
> What are your : vfs.zfs.arc_meta_limit and vfs.zfs.arc_meta_used sysctls?
> Maybe increasing the limit can help?


vfs.zfs.arc_meta_limit: 8199079936
vfs.zfs.arc_meta_used: 13965744408

Full output of zfs-stats:


------------------------------------------------------------------------
ZFS Subsystem Report Wed Jan 30 15:16:54 2013
------------------------------------------------------------------------

System Information:

Kernel Version: 901000 (osreldate)
Hardware Platform: amd64
Processor Architecture: amd64

ZFS Storage pool Version: 28
ZFS Filesystem Version: 5

FreeBSD 9.1-RC2 #1: Tue Oct 30 20:37:38 UTC 2012 root
3:16PM up 19 days, 19:44, 2 users, load averages: 0.91, 0.80, 0.68

------------------------------------------------------------------------

System Memory:

12.44% 7.72 GiB Active, 6.04% 3.75 GiB Inact
77.33% 48.01 GiB Wired, 2.25% 1.40 GiB Cache
1.94% 1.21 GiB Free, 0.00% 1.21 MiB Gap

Real Installed: 64.00 GiB
Real Available: 99.97% 63.98 GiB
Real Managed: 97.04% 62.08 GiB

Logical Total: 64.00 GiB
Logical Used: 90.07% 57.65 GiB
Logical Free: 9.93% 6.35 GiB

Kernel Memory: 22.62 GiB
Data: 99.91% 22.60 GiB
Text: 0.09% 21.27 MiB

Kernel Memory Map: 54.28 GiB
Size: 34.75% 18.86 GiB
Free: 65.25% 35.42 GiB

------------------------------------------------------------------------

ARC Summary: (HEALTHY)
Memory Throttle Count: 0

ARC Misc:
Deleted: 430.91m
Recycle Misses: 111.27m
Mutex Misses: 2.49m
Evict Skips: 647.25m

ARC Size: 87.63% 26.77 GiB
Target Size: (Adaptive) 87.64% 26.77 GiB
Min Size (Hard Limit): 12.50% 3.82 GiB
Max Size (High Water): 8:1 30.54 GiB

ARC Size Breakdown:
Recently Used Cache Size: 58.64% 15.70 GiB
Frequently Used Cache Size: 41.36% 11.07 GiB

ARC Hash Breakdown:
Elements Max: 2.19m
Elements Current: 86.15% 1.89m
Collisions: 344.47m
Chain Max: 17
Chains: 552.47k

------------------------------------------------------------------------

ARC Efficiency: 21.94b
Cache Hit Ratio: 97.00% 21.28b
Cache Miss Ratio: 3.00% 657.23m
Actual Hit Ratio: 73.15% 16.05b

Data Demand Efficiency: 98.94% 1.32b
Data Prefetch Efficiency: 14.83% 299.44m

CACHE HITS BY CACHE LIST:
Anonymously Used: 23.03% 4.90b
Most Recently Used: 6.12% 1.30b
Most Frequently Used: 69.29% 14.75b
Most Recently Used Ghost: 0.50% 105.94m
Most Frequently Used Ghost: 1.07% 226.92m

CACHE HITS BY DATA TYPE:
Demand Data: 6.11% 1.30b
Prefetch Data: 0.21% 44.42m
Demand Metadata: 69.29% 14.75b
Prefetch Metadata: 24.38% 5.19b

CACHE MISSES BY DATA TYPE:
Demand Data: 2.12% 13.90m
Prefetch Data: 38.80% 255.02m
Demand Metadata: 30.97% 203.56m
Prefetch Metadata: 28.11% 184.75m

------------------------------------------------------------------------

L2ARC is disabled

------------------------------------------------------------------------

File-Level Prefetch: (HEALTHY)

DMU Efficiency: 24.08b
Hit Ratio: 66.02% 15.90b
Miss Ratio: 33.98% 8.18b

Colinear: 8.18b
Hit Ratio: 0.01% 560.82k
Miss Ratio: 99.99% 8.18b

Stride: 15.23b
Hit Ratio: 99.98% 15.23b
Miss Ratio: 0.02% 2.62m

DMU Misc:
Reclaim: 8.18b
Successes: 0.08% 6.31m
Failures: 99.92% 8.17b

Streams: 663.44m
+Resets: 0.06% 397.18k
-Resets: 99.94% 663.04m
Bogus: 0

------------------------------------------------------------------------

VDEV cache is disabled

------------------------------------------------------------------------

ZFS Tunables (sysctl):
kern.maxusers 384
vm.kmem_size 66662760448
vm.kmem_size_scale 1
vm.kmem_size_min 0
vm.kmem_size_max 329853485875
vfs.zfs.l2c_only_size 0
vfs.zfs.mfu_ghost_data_lsize 2121007104
vfs.zfs.mfu_ghost_metadata_lsize 7876605440
vfs.zfs.mfu_ghost_size 9997612544
vfs.zfs.mfu_data_lsize 10160539648
vfs.zfs.mfu_metadata_lsize 17161216
vfs.zfs.mfu_size 11163991040
vfs.zfs.mru_ghost_data_lsize 7235079680
vfs.zfs.mru_ghost_metadata_lsize 11107812352
vfs.zfs.mru_ghost_size 18342892032
vfs.zfs.mru_data_lsize 4406255616
vfs.zfs.mru_metadata_lsize 3924364288
vfs.zfs.mru_size 8893582336
vfs.zfs.anon_data_lsize 0
vfs.zfs.anon_metadata_lsize 0
vfs.zfs.anon_size 999424
vfs.zfs.l2arc_norw 1
vfs.zfs.l2arc_feed_again 1
vfs.zfs.l2arc_noprefetch 1
vfs.zfs.l2arc_feed_min_ms 200
vfs.zfs.l2arc_feed_secs 1
vfs.zfs.l2arc_headroom 2
vfs.zfs.l2arc_write_boost 8388608
vfs.zfs.l2arc_write_max 8388608
vfs.zfs.arc_meta_limit 8199079936
vfs.zfs.arc_meta_used 14161977912
vfs.zfs.arc_min 4099539968
vfs.zfs.arc_max 32796319744
vfs.zfs.dedup.prefetch 1
vfs.zfs.mdcomp_disable 0
vfs.zfs.write_limit_override 0
vfs.zfs.write_limit_inflated 206088929280
vfs.zfs.write_limit_max 8587038720
vfs.zfs.write_limit_min 33554432
vfs.zfs.write_limit_shift 3
vfs.zfs.no_write_throttle 0
vfs.zfs.zfetch.array_rd_sz 1048576
vfs.zfs.zfetch.block_cap 256
vfs.zfs.zfetch.min_sec_reap 2
vfs.zfs.zfetch.max_streams 8
vfs.zfs.prefetch_disable 0
vfs.zfs.mg_alloc_failures 12
vfs.zfs.check_hostid 1
vfs.zfs.recover 0
vfs.zfs.txg.synctime_ms 1000
vfs.zfs.txg.timeout 5
vfs.zfs.vdev.cache.bshift 16
vfs.zfs.vdev.cache.size 0
vfs.zfs.vdev.cache.max 16384
vfs.zfs.vdev.write_gap_limit 4096
vfs.zfs.vdev.read_gap_limit 32768
vfs.zfs.vdev.aggregation_limit 131072
vfs.zfs.vdev.ramp_rate 2
vfs.zfs.vdev.time_shift 6
vfs.zfs.vdev.min_pending 4
vfs.zfs.vdev.max_pending 10
vfs.zfs.vdev.bio_flush_disable 0
vfs.zfs.cache_flush_disable 0
vfs.zfs.zil_replay_disable 0
vfs.zfs.zio.use_uma 0
vfs.zfs.snapshot_list_prefetch 0
vfs.zfs.version.zpl 5
vfs.zfs.version.spa 28
vfs.zfs.version.acl 1
vfs.zfs.debug 0
vfs.zfs.super_owner 0

------------------------------------------------------------------------

Nikolay Denev

Jan 30, 2013, 11:34:41 AM
On Jan 30, 2013, at 5:19 PM, Kevin Day <toa...@dragondata.com> wrote:
>
> On Jan 30, 2013, at 4:36 AM, Nikolay Denev <nde...@gmail.com> wrote:
>>
>>
>> What are your : vfs.zfs.arc_meta_limit and vfs.zfs.arc_meta_used sysctls?
>> Maybe increasing the limit can help?
>
>
> vfs.zfs.arc_meta_limit: 8199079936
> vfs.zfs.arc_meta_used: 13965744408
>
> Full output of zfs-stats: […snipped…]

Looks like you could try increasing arc_meta_limit to, say, half of arc_max (16398159872 in your case).
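
A sketch of how that might be applied, assuming the limit is set as a boot-time tunable in /boot/loader.conf (the value is the half-of-arc_max figure above):

vfs.zfs.arc_meta_limit="16398159872"

followed by a reboot for it to take effect.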

Kevin Day

Jan 30, 2013, 3:59:15 PM

On Jan 30, 2013, at 10:34 AM, Nikolay Denev <nde...@gmail.com> wrote:

>>
>> vfs.zfs.arc_meta_limit: 8199079936
>> vfs.zfs.arc_meta_used: 13965744408
>>
>> Full output of zfs-stats: […snipped…]
>
> Looks like you can try to increase arc_meta_limit to be let's say : half of arc_max. (16398159872 in your case).
>
>

Okay, will give this a shot on the next reboot too.

Does anyone here understand the significance of "used" being higher than "limit"? Is the limit only a suggestion, or are there cases where there's certain metadata that must be in the ARC, and it's particularly large here?

-- Kevin

Artem Belevich

Jan 30, 2013, 4:56:09 PM
On Wed, Jan 30, 2013 at 12:59 PM, Kevin Day <toa...@dragondata.com> wrote:
>
> Does anyone here understand the significance of "used" being higher than "limit"? Is the limit only a suggestion, or are there cases where there'a certain metadata that must be in arc, and it's particularly large here?

arc_meta_limit is a soft limit which basically tells the ARC to attempt to
evict metadata entries and reuse their buffers, as opposed to allocating
new memory and growing the ARC. According to the comment next to the
arc_evict() function, it's a best-effort attempt and eviction is not
guaranteed. That could potentially allow meta_size to remain above
meta_limit.

--Artem

Peter Jeremy

Feb 1, 2013, 2:24:16 PM
On 2013-Jan-29 18:06:01 -0600, Kevin Day <toa...@dragondata.com> wrote:
>On Jan 29, 2013, at 5:42 PM, Matthew Ahrens <mah...@delphix.com> wrote:
>> On Tue, Jan 29, 2013 at 3:20 PM, Kevin Day <toa...@dragondata.com> wrote:
>> I'm prepared to try an L2arc cache device (with secondarycache=metadata),
>>
>> You might first see how long it takes when everything is cached. E.g. by doing this in the same directory several times. This will give you a lower bound on the time it will take (or put another way, an upper bound on the improvement available from a cache device).
>>
>
>Doing it twice back-to-back makes a bit of difference but it's still slow either way.

ZFS can be very conservative about caching data, and twice might not be enough.
I suggest you try 8-10 times, or until the time stops decreasing.
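
For example, something along these lines (sh syntax, with the placeholder path from the first message):

for i in 1 2 3 4 5 6 7 8 9 10; do
        time ls -l /path/to/large/directory > /dev/null
done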

>I think some of the issue is that nothing is being allowed to stay cached long.

Well ZFS doesn't do any time-based eviction so if things aren't
staying in the cache, it's because they are being evicted by things
that ZFS considers more deserving.

Looking at the zfs-stats you posted, it looks like your workload has
very low locality of reference (the data hit rate is very low). If
this is not what you expect then you need more RAM. OTOH, your
vfs.zfs.arc_meta_used being above vfs.zfs.arc_meta_limit suggests that
ZFS really wants to cache more metadata (by default ZFS has a 25%
metadata, 75% data split in the ARC to prevent metadata caching starving
data caching). I would go even further than the 50:50 split suggested
elsewhere in the thread and try 75:25 (i.e., triple the current vfs.zfs.arc_meta_limit).

Note that if there is basically no locality of reference in your
workload (as I suspect), you can even turn off data caching for
specific filesystems with "zfs set primarycache=metadata tank/foo"
(note that you still need to increase vfs.zfs.arc_meta_limit to
allow ZFS to use the ARC to cache metadata).
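
A sketch of that combination, with tank/foo standing in for the real dataset name:

zfs set primarycache=metadata tank/foo
zfs get primarycache tank/foo

together with a larger vfs.zfs.arc_meta_limit as discussed above.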

--
Peter Jeremy