isi auth mapping flush --all; isi auth users flush; isi auth groups flush; isi auth refresh
sysctl efs.bam.busy_vnodes | egrep -v "find any" | egrep -o "[0-9a-f]:[0-9a-f]{4}:[0-9a-f]{4}" | sort | uniq
isi_for_array -s 'ps auxwww | egrep -vi "user|idle|av-" | sort -k3 -n -r | head -n 5'
isi_for_array -s 'ps auxwww | egrep -vi "user|idle|av-" | sort -k5 -n -r | head -n 5'
isi_for_array -s 'sysctl net.ib.devices.0.ports.1.active_speed net.ib.devices.0.ports.2.active_speed | paste - -'
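The efs.bam.busy_vnodes one-liner above only reports on the node you run it from. To sweep every node and resolve the busy LINs to actual paths in one pass, something like this should work (an untested sketch; it just wraps the same pieces in isi_for_array, with isi get -L translating each LIN to its path):
isi_for_array -s 'sysctl efs.bam.busy_vnodes | egrep -o "[0-9a-f]{1,2}:[0-9a-f]{4}:[0-9a-f]{4}" | sort | uniq | while read lin; do isi get -L "${lin}"; done'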
sysctl efs.bam.busy_vnodes | egrep -v "find any" | egrep -o "[0-9a-f]:[0-9a-f]{4}:[0-9a-f]{4}" | sort | uniq | while read lin; do isi get -L "${lin}"; done
isi statistics heat --nodes=all --orderby=ops --long | egrep "[a-f0-9]{1,2}:[a-f0-9]{4}:[a-f0-9]{4}.*?UNKNOWN"
isi statistics heat --events=getattr

Interesting...
On the performance side, we do quite a bit with the output from the various isi statistics calls (and other things) to gain further insight into what "evil" our workloads do to the underlying storage clusters. We then pipe that into Graphite (and the things that bolt on top of it). I'd be curious to see what you are doing.
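The rough shape of it is something like the sketch below (not our actual tooling; the stats command flags, metric prefix, Graphite host, and awk field numbers are placeholders to adjust for whatever your OneFS version's output actually looks like). Carbon's plaintext listener takes "metric value timestamp" lines on TCP 2003:
# per-node stats -> Graphite plaintext protocol; verify your column layout
# and isi statistics flags first, they vary across OneFS versions
isi statistics system --nodes=all | tail -n +2 | \
  awk -v ts="$(date +%s)" '{ printf "isilon.cluster1.node%s.cpu_pct %s %s\n", $1, $2, ts }' | \
  nc graphite.example.com 2003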
In regards to hangdumps, we see these semi-often, since in our environment (HPC) we really push the distributed file system HARD. In my conversations with support and with engineering, it was stated that without access to a lot of the "crown jewels" of OneFS, self-analysis is a difficult proposition, as you need those to debug properly. If you are also a customer, I'd be curious to see what you are doing :)
-Jason
# The thread that dumps debug state on several conditions:
# 1) the ping thread stops responding indicating a vnode deadlock
# 2) a signal is received indicating the user wants the state dumped
# 3) a signal is received indicating that a dump thread on another node
#    saw event 1 or 2 and sent a signal over isi_for_array.

egrep -i "lock.*?timeout|triggering.*?hangdump" /var/log/messages | egrep -o "^[0-9]{4}-[0-9]{2}-[0-9]{2}" | sort | uniq -c
isi_for_array -s 'egrep -i "lock.*?timeout|triggering.*?hangdump" /var/log/messages | egrep -o "^[0-9]{4}-[0-9]{2}-[0-9]{2}"' | sort | uniq -c
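If you just want a quick per-node total rather than the per-day breakdown, a plain count works too, since isi_for_array -s prefixes each line of output with the node it came from:
isi_for_array -s 'egrep -ic "lock.*?timeout|triggering.*?hangdump" /var/log/messages'
(That only searches the live messages file; rotated/compressed logs would need zegrep and the right filename glob for however your log rotation is set up.)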
If I'm remembering correctly, it's indicative of deadlocks?
At least that's been my experience.
Our "unoptimized" HPC workloads tend to do terrible filesystem I/O, and we've had to engage with support and our DSE to have these analyzed after the fact, using hangdumps and occasionally coredumps, to determine where the badness is occurring.
"Terrible" - Being heavy write metadata with tens or hundreds of millions of files and directories created during the life of a large project. 90+% of files being under 8k in size, being hit by tens of thousands of CPU cores.
We throw a LOT of S200s and SSDs at the problem, but it doesn't fix everything :)