Hi everyone,
I'm using PAPI (the PAPI?) to pull various metrics about our filesystem. In particular, I'm looking for performance statistics that could indicate a particularly heavy workflow, so that we can get an alert and then throttle back the user workflows responsible.
Anyway, I'm looking through the info provided by the API. There's a description field and a unit of measure for each key, and in most cases these are pretty straightforward, but there are some that make no sense to me. Maybe that's just because I'm not a storage admin. If anyone here could shed some light on a couple of these, it would be greatly appreciated.
All of these are under the cluster statistics branch of the API, and I'm always interested in the "current" values. For example, there is a key named node.disk.access.slow.avg. The description for this key is "Average slow accesses per second for all disks" and the unit is "cents".
What does that mean? For starters, what is a "slow access"? Is this something that could represent an under-performing node? And what does it mean to be measured in cents if the description says per second?
The next one that is confusing to me is node.disk.busy.avg, description: "Average disk busy in tenths of a percent", unit: "permil"
I think I understand the concept of a busy disk (I assume this is kind of like a disk wait state?), but I'm not sure that I understand the units here. What is this a percent of?
Another one is node.disk.iosched.queue.avg, description: "Avg iosched queue length for all disks", unit: "cents"
I don't quite get this unit of measure. Is this just a way to say "this is a count of something"?
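For what it's worth, here is how I'm currently reading these units in my script, as a minimal sketch. The permil conversion follows from the "tenths of a percent" description. Treating "cents" as hundredths of the described quantity is purely my guess and not something I've found confirmed anywhere. The function names are my own and not part of the API.

```python
def permil_to_percent(raw):
    """Convert a permil value (parts per thousand) to a percentage.

    Matches the description "tenths of a percent": 250 permil is 25.0%.
    """
    return raw / 10.0


def cents_to_units(raw):
    """ASSUMPTION: "cents" means hundredths of the described quantity.

    Under that guess, a raw value of 150 would be 1.5 (e.g. an average
    queue length of 1.5, or 1.5 slow accesses per second).
    """
    return raw / 100.0


if __name__ == "__main__":
    print(permil_to_percent(250))  # node.disk.busy.avg style value
    print(cents_to_units(150))     # node.disk.iosched.queue.avg style value
```

If the "cents" guess is wrong, only cents_to_units would need to change.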
I also have a broader question for all of the *.avg metrics. Assuming this is an average over time, what is that time frame?
I have many more questions like this, but I'm going to start off small and hope that someone can help get me past this first round.
btw, we're running v7.1.1.8 with a mix of x200s, x400s and n400s
Thanks in advance!
Patty