In my current understanding of the implementation, QFS will try to read from the six data stripes that are spread across nodes, typically on six different disks. However, the data can be fully reconstructed from any six of the nine chunks (6 data stripes + 3 Reed-Solomon parity stripes). Reconstructing from parity incurs extra CPU work on the client, but CPU is often a more flexible resource to assign than disk I/O.
With that in mind, I propose a getPerfScore(chunk) interface: a client asks all nine chunk servers for a performance score for a chunk, with the score based on the recent performance of the volume the chunk lives on, possibly also factoring in pending reads against that volume. Choosing the six fastest of the nine chunks (data or parity) should yield much higher aggregate disk I/O across the cluster and adapt automatically to hardware of varying quality.
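To make the idea concrete, here is a minimal client-side sketch. getPerfScore itself, the StripeInfo struct, and the lower-is-better score convention are all hypothetical, not existing QFS API; the real client would issue the score requests as RPCs to the nine chunk servers.

    // Hypothetical sketch: rank the nine RS(6,3) chunks of a stripe group by
    // a server-reported performance score and keep the six cheapest to read.
    #include <algorithm>
    #include <cassert>
    #include <vector>

    struct StripeInfo {
        int    serverIndex; // which of the nine chunk servers holds this chunk
        double perfScore;   // lower is better: recent volume latency, queue depth
    };

    // RS(6,3) can reconstruct the data from any six of the nine chunks, so we
    // are free to pick whichever six score best right now.
    std::vector<StripeInfo> PickFastestSix(std::vector<StripeInfo> chunks) {
        assert(chunks.size() == 9);
        std::partial_sort(chunks.begin(), chunks.begin() + 6, chunks.end(),
                          [](const StripeInfo& a, const StripeInfo& b) {
                              return a.perfScore < b.perfScore;
                          });
        chunks.resize(6);
        return chunks;
    }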
As an initial implementation, I feel this would be most valuable on the read side, where it is fairly simple to implement and entirely optional to use. In the longer term, I can envision the meta server also using this data for write-side placement, alongside capacity considerations.
-Jeff
AFAIK, this has already been implemented for QFS writes. In general, QFS wants to 1) fill each hard disk to about the same percentage of its capacity, and 2) prefer faster hard disks for better performance. 1 is the long-term objective; 2 is the short-term objective. A very special case is Quantsort: since we expect the intermediate data to be deleted soon, QFS is configured there to care only about objective 2. But for regular QFS systems, where the data is going to last a long time, objective 1 is more important than 2, since chasing 2 can increase the need to rebalance data later.
If I recall correctly, this is done at two levels: the meta server picks the chunk server, and the chunk server itself picks the hard disk (each chunk server has 10 to 12 hard disks in our typical configuration).
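A toy illustration of how the two objectives might be blended follows; the field names, the weighting scheme, and the saturating throughput term are all made up here, and the real logic lives in the meta server and chunk server:

    // Toy sketch only: blend objective 1 (even space usage) and objective 2
    // (prefer fast disks) into one placement score. Higher is better.
    struct DiskStats {
        double usedFraction; // bytes used / capacity, drives objective 1
        double recentMBps;   // recent observed throughput, drives objective 2
    };

    // speedWeight near 1.0 fits the Quantsort case (short-lived data, chase
    // speed); near 0.0 fits long-lived data, where uneven fill means costly
    // rebalancing later.
    double PlacementScore(const DiskStats& d, double speedWeight) {
        double spaceScore = 1.0 - d.usedFraction;                  // emptier is better
        double speedScore = d.recentMBps / (d.recentMBps + 100.0); // saturating in [0,1)
        return (1.0 - speedWeight) * spaceScore + speedWeight * speedScore;
    }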
There is not yet such an optimization for reads. Disk space usage is not a concern when reading, but there are other things we have to consider:
1) a hard disk that is fast overall can still have slow or bad sectors;
2) the per-disk performance / load numbers can hardly be 100% accurate and up to date;
3) if the choice is made on the client side, thousands of clients may choose the same disk at the same time, causing congestion;
4) if the choice is made on the meta server side, how will multiple meta servers share the info and cooperate?
1) That's largely unavoidable. It's also not accounted for today, so it can be ignored for the sake of improving aggregate cluster performance.
2) True, but even imprecise numbers give us a floor that performs measurably better than today's effectively random reads.
3) This is something we can test and tune adaptively as we go. With large volume reads, queue lengths may be long enough that 1-second score refreshes are perfectly adequate. Thundering-herd effects can also be mitigated by perturbing the scores; see the jitter sketch after this list.
4) Distributed load balancing doesn't need to be strongly consistent. I would naively imagine that each meta server would be fed that information directly by the chunk servers; a sketch of such a report follows below.
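For point 3, one way to perturb the results is simple multiplicative jitter. The function below is a hypothetical sketch, not anything in QFS today:

    #include <random>

    // Randomly perturb a reported score so that thousands of clients ranking
    // the same nine chunks do not all converge on the identical six disks.
    // A jitterFraction around 0.1 (+/-10%) spreads load while still favoring
    // genuinely faster disks.
    double PerturbScore(double score, double jitterFraction, std::mt19937& rng) {
        std::uniform_real_distribution<double> jitter(-jitterFraction, jitterFraction);
        return score * (1.0 + jitter(rng));
    }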
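And for point 4, the per-disk numbers could ride along on the existing chunk-server heartbeats, with each meta server keeping its own eventually consistent view and no cross-metaserver coordination. The structs below are purely illustrative:

    #include <cstdint>
    #include <vector>

    // Illustrative heartbeat payload: each chunk server reports per-disk load
    // to every meta server it already talks to. Views drift slightly between
    // meta servers, which is acceptable for load-balancing hints.
    struct DiskLoadReport {
        int32_t diskId;
        double  avgReadLatencyMs; // averaged over the last heartbeat interval
        int32_t pendingReads;     // queued read requests on this disk
    };

    struct HeartbeatStats {
        int64_t chunkServerId;
        std::vector<DiskLoadReport> disks;
    };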
-Jeff