Poor performance due to small read operations


Trey Dockendorf

Jun 28, 2014, 11:48:33 PM
to fhgfs...@googlegroups.com
We've recently experienced issues on our cluster where our FhGFS filesystem becomes almost unusable.  "Unusable" means users reporting jobs taking 10 times longer than usual, as well as operations like "df" taking many seconds to print the FhGFS mount information.

What I've noticed when this happens is that some compute nodes are performing what appear to be large numbers of small reads.  One sign was that watching 'fhgfs-ctl --clientstats' showed some compute nodes doing 1-4MB/s reads while at the same time reporting many thousands of read operations.  The users on these compute nodes are known to run applications of "poor" design that do small reads (this has been observed by running strace on some of the running jobs).  Other compute nodes during these periods of poor performance showed 50-100MB/s reads and only 20-50 read operations per second.  Those users typically run applications that are more "HPC friendly".  The users who do small reads also run on our GigE nodes, while the users doing large reads run on our IB nodes.
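
For reference, this is roughly what I run to watch that; the option
names are from memory and may differ slightly between releases, so
treat it as a sketch:

# per-client traffic as seen by the storage servers, refreshed every 5s
# (shows both MB/s and the number of read/write operations per client)
$ fhgfs-ctl --clientstats --nodetype=storage --interval=5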

The storage servers are all Scientific Linux 6.4, each with a 10GigE and an IB connection.  Each runs ZFS as the filesystem with a 128K block size.  The cluster is using 2012.10-r15.  These systems have not yet been optimized aside from atime being turned off on the ZFS filesystem.  They have an SSD-based read cache (almost never used according to ZFS arcstat) and an SSD-based intent log.  I plan to optimize these systems after some testing on dev systems (based on the tuning guide on the wiki as well as ZFS-specific tuning), but at this time no changes to the underlying OS have been made to improve throughput.

Our FhGFS instance is currently set to stripe data across all 6 storage servers with a 512K chunksize.  Our initial idea is to disable striping across the entire filesystem and enable it on a per-user basis (on their scratch space) once they request it.  So far, striping has been more problematic in cases where users perform small reads, and most of our users do not seem to use applications that would benefit from striping.  I'm curious if enabling per-user message queues on the storage servers would help prevent a user's small-read activity from affecting others, or at least make the effect less noticeable.
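
For what it's worth, the commands I have in mind for that are roughly
the following; the syntax is taken from fhgfs-ctl's help as I remember
it, and the scratch path is just a placeholder for our layout:

# show the current stripe settings of a directory
$ fhgfs-ctl --getentryinfo /fhgfs/scratch

# newly created files below the directory then go to a single target
# (existing files keep their old stripe pattern)
$ fhgfs-ctl --setpattern --chunksize=512k --numtargets=1 /fhgfs/scratch

# re-enable striping for an individual user on request
$ fhgfs-ctl --setpattern --chunksize=512k --numtargets=6 /fhgfs/scratch/<user>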

I'm hoping there are some additional debugging steps or possible solutions to this problem.

I'd also be interested to know how/if others have handled configuring FhGFS for small reads.

Thanks,
- Trey

Bernd Schubert

Jun 30, 2014, 8:27:38 AM
to fhgfs...@googlegroups.com
Trey,

are the users doing the small reads also opening and closing the files
for each and every read?

Also, what is your current tuneNumWorkers setting for fhgfs-storage and
fhgfs-meta? Did you consider increasing it?


Best regards,
Bernd

Trey Dockendorf

Jul 1, 2014, 1:08:59 AM
to fhgfs...@googlegroups.com
Bernd,

Thank you for the response.


On Mon, Jun 30, 2014 at 7:27 AM, Bernd Schubert
<bernd.s...@itwm.fraunhofer.de> wrote:
> Trey,
>
> are the users doing the small reads also opening and closing the files for
> each and every read?

In some cases, yes. Today I observed one user's jobs resulting in
less than 1MB/s read in clientstats and many hundreds of read
operations. In that case, it was an open ... read ~8KB ... close ...
getrlimit ... <repeat>.

In another case a user was doing open ... read ~512K ... read ~512K ...
<repeat>, and that user's jobs showed a lower number of read operations
but still single-digit MB/s read performance. In this case 2 of the 6
storage servers had high IO wait (>20%) and the disks were at a
constant 100% utilization.
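
(For reference, this is roughly the strace invocation I use to catch
those patterns; the PID is of course just whatever job process looks
suspicious:)

# log open/read/close calls of a running process;
# -T prints the time spent in each syscall, -f follows child processes
$ strace -f -T -tt -e trace=open,read,close -p <pid>

# or just count the calls to get the open/read/close ratio
$ strace -f -c -e trace=open,read,close -p <pid>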

>
> Also, what is your current tuneNumWorkers setting for fhgfs-storage and
> fhgfs-meta? Did you consider to increase it?

Storage is 12, metadata is 128. Storage servers have 12-16 cores and
metadata has 16. I'm concerned about increasing the tuneNumWorkers on
storage because when we see performance issues across our FhGFS
instance, the storage servers' disks are at 100% utilization.

At this time I'm thinking (please correct me if I'm wrong) the best
course of action is to set the striping on FhGFS to numtargets=1. In
the end I hope to improve the overall performance of our storage
systems, but what exactly that entails is still pending some testing.

I also hope to schedule maintenance and update everything to 2014.01
release. At the very least I would like to get clientstats from
fhgfs-admon instead of fhgfs-ctl (for monitoring purposes). Right now
I've noticed long delays in fhgfs-ctl --iostat and --clientstats when
the storage servers become heavily loaded (high IO wait). I have seen
mention that 2014.01 improves the storage layout to benefit caching. Our
storage servers all have a ~200GB SSD-based read cache in ZFS, which at
this time has been very underutilized. The ARC cache (which does read
caching in RAM and spills over into L2ARC when the working set exceeds
the ARC size) has had about a 50-75% hit rate on our storage systems.
The ARC is RAM-based and we limit it to ~30% of RAM (20-40GB on the
storage servers) due to memory fragmentation problems present in the
current ZFS on Linux implementation.
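
(In case it helps anyone else, this is roughly how I pull those hit
rates on the storage servers; just a sketch against the ZFS on Linux
kstats, assuming the L2ARC counters are non-zero, and arcstat gives
similar numbers in a nicer form:)

# ARC / L2ARC hit ratios since boot
$ awk '$1=="hits"||$1=="misses"||$1=="l2_hits"||$1=="l2_misses" {s[$1]=$3}
       END { printf "ARC hit%%: %.1f  L2ARC hit%%: %.1f\n",
             100*s["hits"]/(s["hits"]+s["misses"]),
             100*s["l2_hits"]/(s["l2_hits"]+s["l2_misses"]) }' \
      /proc/spl/kstat/zfs/arcstats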

Thanks,
- Trey

Christian Mohrbacher

Jul 2, 2014, 7:53:45 AM
to fhgfs...@googlegroups.com
Hi Trey,

> Storage is 12, metadata is 128. Storage servers have 12-16 cores and
> metadata has 16. I'm concerned about increasing the tuneNumWorkers on
> storage because when we see performance issues across our FhGFS
> instance the storage server's disks are at 100% utilization.
we would nevertheless recommend that you try increasing tuneNumWorkers
on the storage servers. Although the disks are already at 100%
utilization, it might help: if we can put more requests into the disks'
queues, the scheduler might be able to sort them in a better way and
increase disk performance.

Regards,
Christian

--
=====================================================
| Christian Mohrbacher |
| Competence Center for High Performance Computing |
| Fraunhofer ITWM |
| Fraunhofer-Platz 1 |
| |
| D-67663 Kaiserslautern |
=====================================================
| Tel: (49) 631 31600 4425 |
| Fax: (49) 631 31600 1099 |
| |
| E-Mail: christian....@itwm.fraunhofer.de |
| Internet: http://www.itwm.fraunhofer.de |
=====================================================

Trey Dockendorf

Jul 2, 2014, 11:22:01 AM
to fhgfs...@googlegroups.com
Christian,

Thanks, I'll give that a try. Is there a good way to determine an
optimal number for tuneNumWorkers, or should I just increase it a
little at a time?

Does tuneNumWorkers directly relate to tuneUsePerUserMsgQueues?
I've noticed on a test system that if tuneUsePerUserMsgQueues=true,
then the qlen never goes above the tuneNumWorkers value when I'm the
only person stressing the server.

Thanks,
- Trey

Sven Breuner

Jul 2, 2014, 1:08:18 PM
to fhgfs...@googlegroups.com
Hi Trey,

Trey Dockendorf wrote on 07/02/2014 05:22 PM:
> Thanks, I'll give that a try. Is there a good way to determine
> optimal number for tuneNumWorkers or should I just increase it a
> little at a time?

unfortunately, there's no easy way to determine an optimal number. But
the default of 12 is very conservative, so I would suggest increasing it
to something like 50.
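
The change itself is just one line in the storage server config plus a
daemon restart, roughly like this (config path and init command as used
by the standard fhgfs packages, so adjust them to your installation):

# on each storage server: check the current value
$ grep tuneNumWorkers /etc/fhgfs/fhgfs-storage.conf
tuneNumWorkers = 12

# raise it and restart the daemon
$ sed -i 's/^tuneNumWorkers.*/tuneNumWorkers = 50/' /etc/fhgfs/fhgfs-storage.conf
$ /etc/init.d/fhgfs-storage restart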

> Does the tuneNumWorkers directly relate to the
> tuneUsePerUserMsgQueues? I've noticed on a test system that if
> tuneUsePerUserMsgQueues=true then the qlen never goes above the
> tuneNumWorkers value when I'm the only person stressing the server.

There shouldn't be any direct relation between qlen and
tuneUsePerUserMsgQueues=true.
qlen is the number of requests that are still in the queue (i.e. the
ones that are not currently being processed by a worker thread).
tuneUsePerUserMsgQueues only applies to the case where there are more
requests in the queue than there are free workers available to process
them - because only in that case we can do reordering.
If there are 50 client requests incoming and you have 50 free worker
threads, then each one of them will grab a request to process it
immediately.
So with n worker threads, tuneUsePerUserMsgQueues=true can only be
effective if there are more than n requests coming in at the same time.

Best regards,
Sven Breuner
Fraunhofer

Trey Dockendorf

Jul 2, 2014, 2:31:30 PM
to fhgfs...@googlegroups.com
Sven,

Thanks for the explanation. I think I understand this now. In my
testing I have 4 clients, each with connMaxInternodeNum=6. The single
storage server has tuneNumWorkers=12. The reason I'm seeing a qlen of
12 and a bsy of 12 during high IO activity is that the total client
connections number 24, while the storage server only has 12 workers
to handle those requests, so the other 12 client connections get
queued.

It's very easy to see now how tuneNumWorkers=12 on 6 storage servers
with ~320 clients using connMaxInternodeNum=6 could result in very
high qlen during high IO on the cluster.

So if I increase my single dev server to tuneNumWorkers=50, and
generate IO from 4 clients using connMaxInternodeNum=6, then I should
see a bsy number around 24 and a qlen of 0. Is the qlen an indicator of
poor server performance? For example, if the storage server can't meet
the demand of all 24 requests, will the qlen go up? Or is the qlen on
the storage server just an indicator of there being more IO requests
than the tuneNumWorkers?

I'm hoping to find some strong indicators, either via fhgfs-ctl or
other tools available to Linux, of when a storage server is performing
poorly. I'm also hoping to find a way to take the data provided by
<insert some fhgfs-ctl command or some data from fhgfs-admon> to
determine when our FhGFS filesystem is too heavily loaded. Right now
my only indicator in my monitoring is seeing the IO wait on storage
servers go above ~20%. Other than that, the actual performance issues
we've seen are usually reported by users, which is a failure of my
monitoring if a user has to report the problem before we detect it.

Thanks,
- Trey

Pete Sero

Jul 3, 2014, 6:53:25 AM
to Trey Dockendorf

On 2014 Jul 3, at 02:31, Trey Dockendorf wrote:

>
>
> I'm hoping to find some strong indicators, either via fhgfs-ctl or
> other tools available to Linux, of when a storage server is performing

- probe some fs actions from clients every few minutes (df, mkdir, write/read some MBs; a sketch follows below this list)

- log network throughput per physical interface

- log CPU activity, both "load" and "busy" (usr/sys/iowait) values

- most important: log DISK LATENCY

- also: disk queue length, transfers/sec (IOPS), transfer sizes

(plus: get as much per client info from fhgfs itself to obtain
a picture of the workload.)
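
A minimal client-side probe (the sketch referenced above) could look
roughly like this; mount point, log file and transfer size are just
placeholders, and the read-back may be served partly from the client
cache:

#!/bin/sh
# FhGFS client-side probe, run every few minutes from cron
MNT=/fhgfs                     # placeholder mount point
LOG=/var/log/fhgfs-probe.log   # placeholder log file
TMP=$MNT/.probe.$$

start=$(date +%s.%N)
df $MNT                                            > /dev/null 2>&1
mkdir -p $TMP
dd if=/dev/zero of=$TMP/f bs=1M count=8 conv=fsync > /dev/null 2>&1
dd if=$TMP/f of=/dev/null bs=1M                    > /dev/null 2>&1
rm -rf $TMP
end=$(date +%s.%N)

echo "$(date +%FT%T) probe_seconds=$(echo "$end - $start" | bc)" >> $LOG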

The disk %busy value is much less useful:
it's just the fraction of time when the disk is "not idle".

But it doesn't tell you what is going on while the
disk is busy: the drive could be doing many fast non-overlapping
transfers, or very few slow overlapping transfers,
or a moderate number of medium-fast transfers.

It's actually the purpose of queuing transfers
at the disk to maximize throughput, at the cost
of some latency. Often the disks reach close to
100% busy way before the latency goes so high
that end user workflow is severely impacted.
So seeing a drive at or close to 100% busy
can be just "right", or already overloaded
beyond what's useful.
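
For the disk side, plain iostat already gives the numbers I mean
(sysstat's extended output; column names as on a current RHEL/SL 6 box):

# extended per-device statistics every 5 seconds:
#   await    - avg time (ms) a request spends queued plus being served
#   avgqu-sz - average request queue length
#   r/s, w/s - IOPS;  avgrq-sz - average transfer size (in sectors)
#   %util    - the "busy" value discussed above
$ iostat -x 5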

This clearly is rather generic advice; paraphrased as:
watch the resources that have fixed capacities, namely
disk-IO, CPU, network.

As for "predicting": the exact outcome depends on the
workload, the organisation and interplay of the involved
file systems (here: cluster fs stacked on multiple local fs),
as well as the characteristics of the used disks/arrays.
Therefore testing and long-term observation are needed.

But where it really gets tricky is slowness
due to adding up latencies along the whole
path in the system -- because NONE of the mentioned
resources may become exhausted in those situations.
Such would require a deeper analysis of
workload and storage architecture...
The only good news is that such workloads
tend NOT to affect other users too badly,
they often are just slow by themselves.

From my experience with various storage systems,
what one usually hits first is:
1. disk latency (IO queuing up on disks)
then
2. disk latency (IO queuing up on disks)
then
3. (you name it)
then
4. network or CPU (kind of trivial)
then
5. system-induced latency chains for certain workflows
(like sequentially performed small transactions).


Cheers

Peter

Sven Breuner

Jul 15, 2014, 5:52:58 AM
to fhgfs...@googlegroups.com, Trey Dockendorf
Hi Trey,

Trey Dockendorf wrote on 07/02/2014 08:31 PM:
> Thanks for the explanation. I think I understand this now. In my
> testing I have 4 clients, each with connMaxInternodeNum=6. The single
> storage server has tuneNumWorkers=12. The reason I'm seeing qlen of
> 12 and bsy of 12 during high IO activity is a result of the total
> client connections being 24 and the storage server only has 12 workers
> able to handle those requests so the other 12 client connections get
> queued.

yes, that's right.


> So if I increase my single dev server to tuneNumWorkers=50, and
> generate IO from 4 clients using connMaxInternodeNum=6, then I should
> see a bsy number around 24 and qlen of 0.

Yes, right - unless you catch it in a nanosecond where there is still a
request in the queue and a free worker just hasn't pulled it out of
there yet ;)


> Is the qlen an indicator of
> poor server performance, for example if the storage server can't meet
> the demand of all 24 requests, will the qlen go up?

No, if a request has been grabbed by a free worker, then that worker
will keep it and so it will no longer count for the qlen, independent of
how long it takes to process this request.


> Or is the qlen on
> the storage server just an indicator of there being more IO requests
> than the tuneNumWorkers?

Yes. Compare it to a printer where 10 people are submitting a print job
at the same time: 9 of them will initially end up in a queue,
independent of whether the printer is performing well or poorly.


> I'm hoping to find some strong indicators, either via fhgfs-ctl or
> other tools available to Linux, of when a storage server is performing
> poorly. I'm also hoping to find a way to take the data provided by
> <insert some fhgfs-ctl command or some data from fhgfs-admon> to
> determine when our FhGFS filesystem is too heavily loaded. Right now
> my only indicator in my monitoring is seeing the IO wait on storage
> servers go above ~20%. Other than that the actual performance issues
> we've seen are usually reported by users, which is a failure of my
> monitoring if a user has to report the problem before we detect it.

I like the attitude implied by the last sentence :)
Pete Sero already made some comments on general storage monitoring and I
assume you were already aware that it's unfortunately not a trivial
goal, but I hope you'll share your results if you ever find the one
or two magic key values or the magic formula that helps you identify
such situations.

Related to these queue things, there is one undocumented fhgfs-ctl
option, which might or might not help you with this:
"fhgfs-ctl --listnodes --ping", e.g.:

$ fhgfs-ctl --listnodes --ping --nodetype=storage --pingretries=100

...will generate simple requests from the fhgfs-ctl tool and measure the
time it takes to get a reply from the servers (just as you would expect
from a normal ping-pong). So with this you can measure how long it takes
to process a full client request that does not involve disk IO.
(The first ping result per server includes the time it takes to
establish a connection, thus this value is not included in the final
average result.)
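
If you want to feed that into your monitoring, a trivial wrapper run
from cron is already enough, e.g. (the log file name is just an example):

# append the round-trip times to a log, e.g. every 5 minutes via cron
$ fhgfs-ctl --listnodes --ping --nodetype=storage --pingretries=100 \
    >> /var/log/fhgfs-ping.log 2>&1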

> The storage servers are all Scientific Linux 6.4, with a 10GigE and IB connection. Each runs ZFS as the filesystem using 128K block size. The cluster is using 2012.10-r15. These systems have not yet been optimized aside from atime being turned off on the ZFS filesystem. They have a SSD based read cache (almost never used according to ZFS arcstat) and a SSD based intent log. I plan to optimize these systems after some testing on dev systems (based on tuning guide on wiki as well as ZFS specific tuning), but at this time no changes to the under lying OS have been made to improve throughput.

It's interesting that you mention the almost never used SSD read cache
with ZFS. An SSD cache certainly seems like a nice and simple idea, but
I remember when we did our ZFS tests quite a while ago, we also added an
SSD for read caching and just couldn't get ZFS to actually make good use
of it.
Not sure though if that's a bug or a feature. Wish I had raised that
question when I recently talked to Bill Moore ;)

Best regards,
Sven