I've been having trouble with ZFS on my server. For the most part it works splendidly, but occasionally I'll experience permanent hangs.
For example, right now on one of my ZFS filesystems (the others are fine), I can read, write, and stat files, but if I run ls in any directory, both ls and the terminal hang. Ctrl-C and kill -9 can't kill it:
In top:
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
5868 nsivo 1 20 0 14456K 1016K zfs 0 0:00 0.00% ls
In ps:
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
nsivo 5868 0.0 0.0 14456 1016 2- D+ 2:35PM 0:00.00 ls
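If I'm reading this right, the D state and the zfs wait channel mean the process is in an uninterruptible sleep inside the kernel, which would explain why the signals do nothing. For reference, ps can show the wait channel directly (5868 is the stuck ls above):

  ps -o pid,state,wchan,command -p 5868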
Eventually the entire system hangs and can't be shut down cleanly.
What are the next steps to debug this? I'm a software developer but am not familiar with kernel debugging. Is there a way to discover which syscall ls is stuck in, ideally without requiring a crash dump?
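The closest I've come up with on my own is procstat. If I understand the man page, something like this should print the kernel stack (and so the sleep/syscall path) of the stuck ls, assuming it can inspect a thread that's wedged in the kernel:

  procstat -kk 5868

but I'm not sure how to interpret the output, or whether there's a better approach.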
Thanks for reading,
-Nick
The pool is less than 40% utilized, the ARC isn't under memory pressure, and as far as I can tell everything should be fine.
I did find https://wiki.freebsd.org/AvgZfsDeadlockDebug, which confirms that I want kernel stack traces. I was hoping to get some guidance on collecting them, especially on a remote system with only SSH access; I'm not sure I can just enter DDB over SSH. Maybe there are tricks with dtrace and stack()?
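To sketch what I mean (assuming the sched provider's off-cpu probe is available on 9.2, and that a freshly started ls gets stuck the same way):

  # collect kernel stacks of ls each time it is taken off CPU; run ls in
  # another session, let it hang, then Ctrl-C dtrace to print the aggregation
  dtrace -n 'sched:::off-cpu /execname == "ls"/ { @[stack()] = count(); }'

Failing that, would something like procstat -kk -a be a reasonable stand-in for DDB's alltrace, to dump kernel stacks for every process over SSH?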
[nsivo@hn3 sysutils]$ zpool get all ssd | grep -v default
NAME PROPERTY VALUE SOURCE
ssd size 182G -
ssd capacity 38% -
ssd health ONLINE -
ssd failmode panic local
ssd dedupratio 1.00x -
ssd free 111G -
ssd allocated 70.7G -
ssd readonly off -
ssd expandsize 0 -
ssd feature@async_destroy enabled local
ssd feature@empty_bpobj active local
ssd feature@lz4_compress enabled local
ssd unsupported@com.joyent:multi_vdev_crash_dump inactive local
[nsivo@hn3 sysutils]$ zfs get all | grep ssd | grep -v default | grep -v 'arc@2'
ssd type filesystem -
ssd creation Thu Aug 28 16:33 2014 -
ssd used 70.7G -
ssd available 108G -
ssd referenced 144K -
ssd compressratio 1.00x -
ssd mounted no -
ssd mountpoint none local
ssd checksum sha256 local
ssd atime off local
ssd canmount off local
ssd version 5 -
ssd utf8only on -
ssd normalization formKC -
ssd casesensitivity sensitive -
ssd usedbysnapshots 0 -
ssd usedbydataset 144K -
ssd usedbychildren 70.7G -
ssd usedbyrefreservation 0 -
ssd mlslabel -
ssd refcompressratio 1.00x -
ssd written 144K -
ssd logicalused 31.1G -
ssd logicalreferenced 43.5K -
ssd/arc type filesystem -
ssd/arc creation Wed Sep 17 17:07 2014 -
ssd/arc used 70.5G -
ssd/arc available 108G -
ssd/arc referenced 47.8G -
ssd/arc compressratio 1.00x -
ssd/arc mounted yes -
ssd/arc mountpoint /usr/arc received
ssd/arc checksum sha256 inherited from ssd
ssd/arc atime off inherited from ssd
ssd/arc setuid off received
ssd/arc snapdir visible received
ssd/arc xattr off temporary
ssd/arc version 5 -
ssd/arc utf8only on -
ssd/arc normalization formKC -
ssd/arc casesensitivity sensitive -
ssd/arc usedbysnapshots 22.7G -
ssd/arc usedbydataset 47.8G -
ssd/arc usedbychildren 0 -
ssd/arc usedbyrefreservation 0 -
ssd/arc mlslabel -
ssd/arc sync always local
ssd/arc refcompressratio 1.00x -
ssd/arc written 262M -
ssd/arc logicalused 31.0G -
ssd/arc logicalreferenced 15.3G -
[nsivo@hn3 ~]$ zfs-stats -a
------------------------------------------------------------------------
ZFS Subsystem Report Tue Oct 7 21:58:37 2014
------------------------------------------------------------------------
System Information:
Kernel Version: 902001 (osreldate)
Hardware Platform: amd64
Processor Architecture: amd64
ZFS Storage pool Version: 5000
ZFS Filesystem Version: 5
FreeBSD 9.2-RELEASE-p12 #0: Mon Sep 15 18:46:46 UTC 2014 root
9:58PM up 16 days, 3:39, 2 users, load averages: 0.24, 0.33, 0.35
------------------------------------------------------------------------
System Memory:
16.80% 10.42 GiB Active, 0.15% 94.19 MiB Inact
72.82% 45.15 GiB Wired, 0.12% 74.04 MiB Cache
10.11% 6.27 GiB Free, 0.00% 2.46 MiB Gap
Real Installed: 64.00 GiB
Real Available: 99.91% 63.94 GiB
Real Managed: 96.97% 62.00 GiB
Logical Total: 64.00 GiB
Logical Used: 89.95% 57.56 GiB
Logical Free: 10.05% 6.44 GiB
Kernel Memory: 15.70 GiB
Data: 99.83% 15.67 GiB
Text: 0.17% 27.37 MiB
Kernel Memory Map: 52.92 GiB
Size: 21.21% 11.22 GiB
Free: 78.79% 41.69 GiB
------------------------------------------------------------------------
ARC Summary: (HEALTHY)
Memory Throttle Count: 0
ARC Misc:
Deleted: 500.41m
Recycle Misses: 270.90m
Mutex Misses: 27.63m
Evict Skips: 5.12b
ARC Size: 25.95% 15.83 GiB
Target Size: (Adaptive) 45.64% 27.84 GiB
Min Size (Hard Limit): 12.50% 7.63 GiB
Max Size (High Water): 8:1 61.00 GiB
ARC Size Breakdown:
Recently Used Cache Size: 10.56% 2.94 GiB
Frequently Used Cache Size: 89.44% 24.91 GiB
ARC Hash Breakdown:
Elements Max: 16.33m
Elements Current: 20.84% 3.40m
Collisions: 571.54m
Chain Max: 41
Chains: 750.10k
------------------------------------------------------------------------
ARC Efficiency: 4.77b
Cache Hit Ratio: 81.44% 3.88b
Cache Miss Ratio: 18.56% 884.45m
Actual Hit Ratio: 80.36% 3.83b
Data Demand Efficiency: 29.50% 966.53m
Data Prefetch Efficiency: 29.08% 21.22m
CACHE HITS BY CACHE LIST:
Most Recently Used: 3.47% 134.64m
Most Frequently Used: 95.20% 3.69b
Most Recently Used Ghost: 5.18% 200.94m
Most Frequently Used Ghost: 5.58% 216.52m
CACHE HITS BY DATA TYPE:
Demand Data: 7.35% 285.16m
Prefetch Data: 0.16% 6.17m
Demand Metadata: 90.91% 3.53b
Prefetch Metadata: 1.58% 61.40m
CACHE MISSES BY DATA TYPE:
Demand Data: 77.04% 681.37m
Prefetch Data: 1.70% 15.05m
Demand Metadata: 15.86% 140.24m
Prefetch Metadata: 5.40% 47.79m
------------------------------------------------------------------------
L2ARC is disabled
------------------------------------------------------------------------
File-Level Prefetch: (HEALTHY)
DMU Efficiency: 11.05b
Hit Ratio: 59.82% 6.61b
Miss Ratio: 40.18% 4.44b
Colinear: 4.44b
Hit Ratio: 0.01% 317.87k
Miss Ratio: 99.99% 4.44b
Stride: 6.62b
Hit Ratio: 99.57% 6.59b
Miss Ratio: 0.43% 28.41m
DMU Misc:
Reclaim: 4.44b
Successes: 0.81% 35.91m
Failures: 99.19% 4.40b
Streams: 16.54m
+Resets: 0.25% 41.20k
-Resets: 99.75% 16.50m
Bogus: 0
------------------------------------------------------------------------
VDEV cache is disabled
------------------------------------------------------------------------
ZFS Tunables (sysctl):
kern.maxusers 384
vm.kmem_size 66575511552
vm.kmem_size_scale 1
vm.kmem_size_min 0
vm.kmem_size_max 329853485875
vfs.zfs.l2c_only_size 0
vfs.zfs.mfu_ghost_data_lsize 483016704
vfs.zfs.mfu_ghost_metadata_lsize 1192360960
vfs.zfs.mfu_ghost_size 1675377664
vfs.zfs.mfu_data_lsize 3821800448
vfs.zfs.mfu_metadata_lsize 1144714240
vfs.zfs.mfu_size 9304926208
vfs.zfs.mru_ghost_data_lsize 7731420672
vfs.zfs.mru_ghost_metadata_lsize 19021883392
vfs.zfs.mru_ghost_size 26753304064
vfs.zfs.mru_data_lsize 1530433536
vfs.zfs.mru_metadata_lsize 390488064
vfs.zfs.mru_size 3144110080
vfs.zfs.anon_data_lsize 0
vfs.zfs.anon_metadata_lsize 0
vfs.zfs.anon_size 12001792
vfs.zfs.l2arc_norw 1
vfs.zfs.l2arc_feed_again 1
vfs.zfs.l2arc_noprefetch 1
vfs.zfs.l2arc_feed_min_ms 200
vfs.zfs.l2arc_feed_secs 1
vfs.zfs.l2arc_headroom 2
vfs.zfs.l2arc_write_boost 8388608
vfs.zfs.l2arc_write_max 8388608
vfs.zfs.arc_meta_limit 16375442432
vfs.zfs.arc_meta_used 11641238168
vfs.zfs.arc_min 8187721216
vfs.zfs.arc_max 65501769728
vfs.zfs.dedup.prefetch 1
vfs.zfs.mdcomp_disable 0
vfs.zfs.nopwrite_enabled 1
vfs.zfs.write_limit_override 0
vfs.zfs.write_limit_inflated 205969035264
vfs.zfs.write_limit_max 8582043136
vfs.zfs.write_limit_min 33554432
vfs.zfs.write_limit_shift 3
vfs.zfs.no_write_throttle 0
vfs.zfs.zfetch.array_rd_sz 1048576
vfs.zfs.zfetch.block_cap 256
vfs.zfs.zfetch.min_sec_reap 2
vfs.zfs.zfetch.max_streams 8
vfs.zfs.prefetch_disable 0
vfs.zfs.no_scrub_prefetch 0
vfs.zfs.no_scrub_io 0
vfs.zfs.resilver_min_time_ms 3000
vfs.zfs.free_min_time_ms 1000
vfs.zfs.scan_min_time_ms 1000
vfs.zfs.scan_idle 50
vfs.zfs.scrub_delay 4
vfs.zfs.resilver_delay 2
vfs.zfs.top_maxinflight 32
vfs.zfs.write_to_degraded 0
vfs.zfs.mg_alloc_failures 8
vfs.zfs.check_hostid 1
vfs.zfs.deadman_enabled 1
vfs.zfs.deadman_synctime 1000
vfs.zfs.recover 0
vfs.zfs.txg.synctime_ms 1000
vfs.zfs.txg.timeout 5
vfs.zfs.vdev.cache.bshift 16
vfs.zfs.vdev.cache.size 0
vfs.zfs.vdev.cache.max 16384
vfs.zfs.vdev.trim_on_init 1
vfs.zfs.vdev.write_gap_limit 4096
vfs.zfs.vdev.read_gap_limit 32768
vfs.zfs.vdev.aggregation_limit 131072
vfs.zfs.vdev.ramp_rate 2
vfs.zfs.vdev.time_shift 29
vfs.zfs.vdev.min_pending 4
vfs.zfs.vdev.max_pending 10
vfs.zfs.vdev.bio_delete_disable 0
vfs.zfs.vdev.bio_flush_disable 0
vfs.zfs.vdev.trim_max_pending 64
vfs.zfs.vdev.trim_max_bytes 2147483648
vfs.zfs.cache_flush_disable 0
vfs.zfs.zil_replay_disable 0
vfs.zfs.sync_pass_rewrite 2
vfs.zfs.sync_pass_dont_compress 5
vfs.zfs.sync_pass_deferred_free 2
vfs.zfs.zio.use_uma 0
vfs.zfs.snapshot_list_prefetch 0
vfs.zfs.version.ioctl 3
vfs.zfs.version.zpl 5
vfs.zfs.version.spa 5000
vfs.zfs.version.acl 1
vfs.zfs.debug 0
vfs.zfs.super_owner 0
vfs.zfs.trim.enabled 1
vfs.zfs.trim.max_interval 1
vfs.zfs.trim.timeout 30
vfs.zfs.trim.txg_delay 32
------------------------------------------------------------------------
-Nick
On Tue, Oct 7, 2014 at 9:15 PM, Chad Leigh Shire.Net LLC <ch...@shire.net>
wrote:
> On Oct 7, 2014, at 7:48 PM, Nick Sivo <ni...@ycombinator.com> wrote:
>> Hello,
>>
>>
>> I've been having trouble with ZFS on my server. For the most part it works splendidly, but occasionally I'll experience permanent hangs.
>>
>>
>> For example, right now on one of my ZFS filesystems (the others are fine), I can read, write, and stat files, but if I run ls in any directory, both ls and the terminal hang. Ctrl-C and kill -9 can't kill it:
>>
> How much free space do you have, percentage-wise? I've found (and have had others confirm) that once you drop below a certain amount of free space you can see these symptoms; the exact percentage varies per system and with how the ZFS config is set up (kernel parameters).
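> For example (using "tank" as a placeholder for your pool name):
>
>   zpool list
>   zpool get capacity tank
>
> The CAP column (the capacity property) is the percentage of the pool that is allocated.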
> Also, depending on what you are doing, the various parameters may need to be tweaked. Look in the archives for similar posts (including some from me).