In my current understanding of the implementation, QFS will try to read from the six data stripes that are spread across nodes, typically on six different disks. However, the data can be fully reconstructed from any six of the nine chunks (6 data stripes + 3 Reed-Solomon parity stripes). Reconstructing from parity incurs extra CPU work on the client, but CPU is often a more flexible resource to assign than disk I/O.
With that in mind, I propose a getPerfScore(chunk) interface: a client asks all nine chunk servers for a performance score for a chunk, with the score based on the recent performance of the volume the chunk lives on, possibly also factoring in pending reads against that volume. Choosing the six fastest of the nine chunks (data or parity) should yield much higher aggregate disk I/O across the cluster and adapt automatically to hardware of varying quality.
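To make the idea concrete, here is a minimal client-side sketch. getPerfScore itself, the StripeInfo struct, and the lower-is-better score convention are all hypothetical, not existing QFS API; the real client would issue the score requests as RPCs to the nine chunk servers.

    // Hypothetical sketch: rank the nine RS(6,3) chunks of a stripe group by
    // a server-reported performance score and keep the six cheapest to read.
    #include <algorithm>
    #include <cassert>
    #include <vector>

    struct StripeInfo {
        int    serverIndex; // which of the nine chunk servers holds this chunk
        double perfScore;   // lower is better: recent volume latency, queue depth
    };

    // RS(6,3) can reconstruct the data from any six of the nine chunks, so we
    // are free to pick whichever six score best right now.
    std::vector<StripeInfo> PickFastestSix(std::vector<StripeInfo> chunks) {
        assert(chunks.size() == 9);
        std::partial_sort(chunks.begin(), chunks.begin() + 6, chunks.end(),
                          [](const StripeInfo& a, const StripeInfo& b) {
                              return a.perfScore < b.perfScore;
                          });
        chunks.resize(6);
        return chunks;
    }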
As an initial implementation, I feel this would be most valuable on the read side, where it is fairly simple to implement and entirely optional to use. In the longer term, I can envision the meta server also using this data for write-side placement, alongside capacity considerations.
-Jeff
AFAIK, this has already been implemented for QFS writes. In general, QFS wants to 1) fill each hard disk to about the same percentage of its capacity, and 2) prefer faster hard disks for better performance. 1 is the long-term objective; 2 is the short-term objective. A very special case is Quantsort: since we expect the intermediate data to be deleted soon, QFS is configured there to care only about objective 2. But for regular QFS systems, where the data is going to last a long time, objective 1 is more important than 2, since chasing 2 can increase the need to rebalance data later.
If I recall correctly, this is done at two levels: the meta server picks the chunk server, and the chunk server itself picks the hard disk (each chunk server has 10 to 12 hard disks in our typical configuration).
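A toy illustration of how the two objectives might be blended follows; the field names, the weighting scheme, and the saturating throughput term are all made up here, and the real logic lives in the meta server and chunk server:

    // Toy sketch only: blend objective 1 (even space usage) and objective 2
    // (prefer fast disks) into one placement score. Higher is better.
    struct DiskStats {
        double usedFraction; // bytes used / capacity, drives objective 1
        double recentMBps;   // recent observed throughput, drives objective 2
    };

    // speedWeight near 1.0 fits the Quantsort case (short-lived data, chase
    // speed); near 0.0 fits long-lived data, where uneven fill means costly
    // rebalancing later.
    double PlacementScore(const DiskStats& d, double speedWeight) {
        double spaceScore = 1.0 - d.usedFraction;                  // emptier is better
        double speedScore = d.recentMBps / (d.recentMBps + 100.0); // saturating in [0,1)
        return (1.0 - speedWeight) * spaceScore + speedWeight * speedScore;
    }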
There is not yet such an optimization for reads. Disk space usage is not a concern when reading, but there are other things we have to consider:
1) a hard disk that is fast overall can still have slow or bad sectors;
2) the per-disk performance / load numbers can hardly be 100% accurate and up to date;
3) if the choice is made on the client side, thousands of clients may choose the same disk at the same time, causing congestion;
4) if the choice is made on the meta server side, how will multiple meta servers share the info and cooperate?
1) That's largely unavoidable. It's also not accounted for today, so it can be ignored for the sake of improving aggregate cluster performance.
2) True, but even imprecise numbers give us a floor that performs measurably better than today's effectively random reads.
3) This is something we can test and tune adaptively as we go. With large volume reads, queue lengths may be long enough that 1-second score refreshes are perfectly adequate. Thundering-herd effects can also be mitigated by perturbing the scores; see the jitter sketch after this list.
4) Distributed load balancing doesn't need to be strongly consistent. I would naively imagine that each meta server would be fed that information directly by the chunk servers; a sketch of such a report follows below.
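For point 3, one way to perturb the results is simple multiplicative jitter. The function below is a hypothetical sketch, not anything in QFS today:

    #include <random>

    // Randomly perturb a reported score so that thousands of clients ranking
    // the same nine chunks do not all converge on the identical six disks.
    // A jitterFraction around 0.1 (+/-10%) spreads load while still favoring
    // genuinely faster disks.
    double PerturbScore(double score, double jitterFraction, std::mt19937& rng) {
        std::uniform_real_distribution<double> jitter(-jitterFraction, jitterFraction);
        return score * (1.0 + jitter(rng));
    }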
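And for point 4, the per-disk numbers could ride along on the existing chunk-server heartbeats, with each meta server keeping its own eventually consistent view and no cross-metaserver coordination. The structs below are purely illustrative:

    #include <cstdint>
    #include <vector>

    // Illustrative heartbeat payload: each chunk server reports per-disk load
    // to every meta server it already talks to. Views drift slightly between
    // meta servers, which is acceptable for load-balancing hints.
    struct DiskLoadReport {
        int32_t diskId;
        double  avgReadLatencyMs; // averaged over the last heartbeat interval
        int32_t pendingReads;     // queued read requests on this disk
    };

    struct HeartbeatStats {
        int64_t chunkServerId;
        std::vector<DiskLoadReport> disks;
    };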
-Jeff