On 2014 Jul 3, at 02:31, Trey Dockendorf wrote:
>
>
> I'm hoping to find some strong indicators, either via fhgfs-ctl or
> other tools available to Linux, of when a storage server is performing
- probe some fs actions from clients every few minutes (df, mkdir, write/read some MBs)
- log network throughput per physical interface
- log CPU activity, both "load" and "busy" (usr/sys/iowait) values
- most important: log DISK LATENCY
- also: disk queue length, transfers/sec (IOPS), transfer sizes
(plus: gather as much per-client info from fhgfs itself to obtain
a picture of the workload.)
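The probe part of that list can be sketched as a small script, run from cron every few minutes. This is a sketch, not a finished tool: the mount point is an assumption (it defaults to a local tmp dir here so it runs anywhere; point MOUNT at the real FhGFS mount in production), and you would normally log the timings somewhere instead of echoing them.

```shell
#!/bin/sh
# Minimal client-side probe sketch: mkdir, df, write/read a few MBs.
# MOUNT is an assumption -- set it to the real FhGFS mount in production.
MOUNT="${MOUNT:-/tmp/fhgfs-probe}"
PROBEDIR="$MOUNT/probe.$$"

start=$(date +%s)

mkdir -p "$PROBEDIR" || exit 1        # mkdir: is the metadata path alive?
df -P "$MOUNT" > /dev/null || exit 1  # df: does statfs answer promptly?

# write a few MBs, forcing them out to storage, then read them back
dd if=/dev/zero of="$PROBEDIR/testfile" bs=1M count=8 conv=fsync 2>/dev/null
dd if="$PROBEDIR/testfile" of=/dev/null bs=1M 2>/dev/null

end=$(date +%s)
rm -rf "$PROBEDIR"

echo "probe finished in $((end - start))s"
```

If the probe time suddenly jumps, that is your cue to look at the disk latency and queue numbers on the servers for the same time window.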
The disk %busy value is much less useful:
it is just the fraction of time the disk is "not idle".
It doesn't tell you what is going on while the
disk is busy: the drive could be doing many fast non-overlapping
transfers, very few slow overlapping transfers,
or a moderate number of medium-fast transfers.
It's actually the purpose of queuing transfers
at the disk to maximize throughput, at the cost
of some latency. Often the disks reach close to
100% busy way before the latency goes so high
that end user workflow is severely impacted.
So seeing a drive at or close to 100%
can be just "right", or already overloaded
beyond what's useful.
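To make that concrete, here is a toy back-of-the-envelope calculation (numbers invented for illustration), using utilization ~ IOPS x average service time: two disks can both sit at ~100% busy while one answers requests ten times faster than the other.

```shell
# Toy arithmetic, not a measurement: both disks show ~100% busy,
# but per-request latency differs by a factor of ten.
awk 'BEGIN {
    # fast disk: many quick transfers
    iops_fast = 200; svc_fast = 0.005   # 5 ms per request
    # slow disk: few long transfers
    iops_slow = 20;  svc_slow = 0.050   # 50 ms per request
    printf "fast disk: util=%.0f%%, avg latency=%d ms\n", \
        iops_fast * svc_fast * 100, svc_fast * 1000
    printf "slow disk: util=%.0f%%, avg latency=%d ms\n", \
        iops_slow * svc_slow * 100, svc_slow * 1000
}'
# -> fast disk: util=100%, avg latency=5 ms
# -> slow disk: util=100%, avg latency=50 ms
```

This is why logging latency (and queue length) directly tells you far more than the %busy figure alone.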
This clearly is rather generic advice; paraphrased as:
watch the resources that have fixed capacities, namely
disk-IO, CPU, network.
As for "predicting": the exact outcome depends on the
workload, the organisation and interplay of the involved
file systems (here: cluster fs stacked on multiple local fs),
as well as the characteristics of the used disks/arrays.
Therefore testing and long-term observation are needed.
But where it really gets tricky is slowness
due to adding up latencies along the whole
path in the system -- because NONE of the mentioned
resources may become exhausted in those situations.
Such would require a deeper analysis of
workload and storage architecture...
The only good news is that such workloads
tend NOT to affect other users too badly,
they often are just slow by themselves.
From my experience with various storage systems,
what one usually hits first is:
1. disk latency (IO queuing up on disks)
then
2. disk latency (IO queuing up on disks)
then
3. (you name it)
then
4. network or CPU (kind of trivial)
then
5. system-induced latency chains for certain workflows
(like sequentially performed small transactions).
Cheers
Peter
> You received this message because you are subscribed to the Google Groups "fhgfs-user" group.