Hello,
My name is Mehmet and I work on the QFS team here at Quantcast. I’ll be helping
Faraaz respond to posts here in QFS-devel. First off, we want to apologize
for taking so long to get back to you on these answers. We really appreciate you reaching
out to us with questions. Please feel free to continue to do so!
> From what we understand, QFS always wants its nodes/racks arranged in
> groups of 9, which we aren't doing and haven't done since we had 9 nodes.
QFS doesn’t necessarily enforce groups of 9. In fact, for a special case, we have
a filesystem instance in which we store all 9 chunks (6 data chunks + 3 parity chunks)
of a file in only three racks, so three chunks per rack. This is assuming that the file size
is less than or equal to 64MB*6 of course (with a larger file size, the extra data will
overflow into another set of 9 chunks).
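The arithmetic above can be sketched quickly. This is my own illustration of the chunk-count math implied by RS 6+3 with 64MB chunks, not QFS code:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64MB per chunk
DATA_STRIPES = 6               # RS 6+3: 6 data stripes...
PARITY_STRIPES = 3             # ...plus 3 parity stripes

def total_chunks(file_size_bytes):
    """Total chunks (data + parity) a file occupies under RS 6+3."""
    block = CHUNK_SIZE * DATA_STRIPES         # 384MB of original data per chunk set
    blocks = -(-file_size_bytes // block)     # ceiling division
    return blocks * (DATA_STRIPES + PARITY_STRIPES)

# A file of at most 64MB*6 = 384MB fits in a single set of 9 chunks:
print(total_chunks(384 * 1024 * 1024))   # 9
# Anything larger overflows into another set of 9:
print(total_chunks(385 * 1024 * 1024))   # 18
```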
Enabling multiple chunks per rack basically saves cross-rack bandwidth compared
to a configuration with a single chunk per rack, but note that it can
also have serious reliability consequences, so its use is not encouraged.
> With 27 nodes, what is the best rack layout? What are the pros/cons of the 3
> racks of 9 nodes vs 9 racks of 3 nodes?
Assuming the chunks are distributed equally among racks, a higher
number of racks generally means a greater number of tolerated failures. This comes
at the expense of less (or no) data locality, but since networks are faster than
disks these days, that is rarely an issue.
It is the opposite for a lower number of racks: fewer tolerated failures,
but more locality. Another downside of a lower number of racks is that
your storage capacity might diminish significantly if a rack failure is not transient.
In the 3x9 scenario you give below, you might lose 1/3 of your storage capacity
for an extended period while an entire rack is down.
I want to point out here that when talking about racks, I mean physically distinct
racks that each have their own failure modes. This usually means they are
backed by their own rack-specific switch, PDU, and backup PDU. You gain no
extra benefit (QFS-wise) from having machines in the same physical rack but
presenting them to QFS as different racks.
> How does QFS handle object placement when rack assignments exceed 9?
> Less specifically, how does QFS handle object/chunk placement in general?
> We are currently using RS 6+3 for all data on the cluster.
We also use RS 6+3 for almost all, if not all, data on our cluster. When placing
a chunk of a file, the metaserver will generally choose the chunkserver with the most
space available on its physical drives; this is the default configuration. Different chunks
of the same file are placed so that fault-tolerant placement is taken into account
as well. That is, if more racks are available, the metaserver will spread the
different chunks of a file across different racks, decreasing the chance of data loss under failures.
Note that the metaserver also actively tries to re-balance available space by moving chunks
from over-utilized chunkservers to under-utilized ones. There are some threshold values
that you can play with to alter the behaviour of chunk placement and the space re-balancer.
Some of these are metaServer.maxSpaceUtilizationThreshold, metaServer.rebalancingEnabled,
metaServer.maxRebalanceSpaceUtilThreshold, and metaServer.minRebalanceSpaceUtilThreshold.
If you want to read more about this, you can check out the following post for more details and a discussion.
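For reference, a metaserver configuration fragment touching these knobs might look like the following. The values here are illustrative only, not recommendations; please check the annotated sample metaserver configuration for the actual defaults and semantics:

```
# Enable the background space re-balancer (illustrative values, not defaults)
metaServer.rebalancingEnabled = 1
# Avoid placing new chunks on chunkservers above this space utilization
metaServer.maxSpaceUtilizationThreshold = 0.95
# Move chunks off chunkservers above this utilization...
metaServer.maxRebalanceSpaceUtilThreshold = 0.82
# ...and onto chunkservers below this one
metaServer.minRebalanceSpaceUtilThreshold = 0.72
```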
The default chunk placement policy can also be altered by assigning static weights to racks.
Please follow this link to see how this setting can be defined in the metaserver configuration file.
When static weights are assigned, the more weight a rack has and the more chunkservers it
contains, the more likely it is to be chosen for chunk placement. Another way to influence chunk
placement is the use of tiers, which I explain at the end of this post.
Also, just to clear up any possible confusion, I want to emphasize that QFS is a distributed file
system rather than an "object store". The main differences are: (a) a QFS file can be appended to / modified,
whereas object stores don't allow those operations; (b) a QFS file can be renamed / moved at almost
no cost, while object stores only let you copy and delete existing objects; and (c) QFS has a strong
directory structure: directories can be renamed / moved, du information is cached at the directory level,
files can be listed by directory, and access permissions can be granted at the directory level.
All of these are typically unavailable in object stores.
> Are chunks for a given object (assume very large > 100GB objects) ever written
> to more than 9 nodes or 9 racks? If so, what is the implication to the number of node/rack failures
> a cluster can sustain before data loss? Is it simply N+3?
Yes, they are. For the next part of my answer, I’m assuming you are familiar with how QFS striping works.
Each chunk can hold up to 64MB of data. If you write a large file, say 576MB,
the first 384MB (64MB*6) of original data, together with 192MB (64MB*3) of parity, will be
stored in nine chunks, ideally on nine separate chunkservers. For the remaining data,
the metaserver will select a different set of nine chunkservers.
You can verify this behaviour by writing a large file and then invoking
the “qfs” tool with the “-dloc” option. For example,
qfs -fs kfs://localhost:20000 -dloc 0 603979776 foo.1
should show the chunkservers containing the first 576MB of
a file called foo.1. If you run this command, you should see two lines. The first line
should show the chunkservers that contain the first 384MB of the file,
whereas the second line should show the chunkservers that contain the remaining
data. Note that the chunkservers that store the parity data do not show up in the output,
so you should see twelve different chunkservers, assuming your filesystem has
enough chunkservers.
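The mapping from a byte offset to the 9-chunk set that holds it can be sketched as follows. This is my own illustration of the layout described above, not QFS code:

```python
CHUNK_SIZE = 64 * 1024 * 1024   # bytes of data per chunk
DATA_STRIPES = 6                # RS 6+3 data stripes

def chunk_set(offset):
    """Index of the 9-chunk set holding the data byte at `offset`.
    Each set holds 64MB*6 = 384MB of original data."""
    return offset // (CHUNK_SIZE * DATA_STRIPES)

# The 576MB file from the example above spans two chunk sets:
print(chunk_set(0))                      # 0 -> first line of -dloc output
print(chunk_set(384 * 1024 * 1024))      # 1 -> second line
```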
> How does all of this change with a 9-rack by 3-node layout or 3-rack by
> 9-node layout? Could the cluster then lose up to 3 racks worth of nodes
> (9 nodes) in a 9x3 layout before losing data? Similarly, could the cluster
> lose up to 3 nodes in each rack (9 nodes) for the 3x9 layout?
The answer depends on the configuration. In the 9x3 scenario with one chunk
per rack, we can lose up to three racks before losing data. That makes 9 nodes,
as you said. However, this doesn’t mean that we can tolerate any arbitrary 9 node
failures, since the failed nodes can span more than three racks. With the multi-chunk-per-rack
configuration mentioned above (three chunks per rack), we can tolerate
the loss of up to a single rack, because two of the remaining racks will hold enough
data and parity chunks for a file to be recovered.
Again assuming chunks are spread evenly, in 3x9, we can lose up to an entire
single rack. Note that this is different from what you’re suggesting, since with the
loss of 3 nodes in each rack you can lose more than three chunks of a file. In all of
the above configurations, we can only tolerate 3 arbitrary node failures.
To sum up,
Configuration             | Rack Failures Tolerated  | Arbitrary Node Failures Tolerated
--------------------------|--------------------------|----------------------------------
9x3, one chunk per rack   | 3 racks (9 nodes total)  | 3 nodes
9x3, multi-chunk per rack | 1 rack (3 nodes total)   | 3 nodes
3x9, multi-chunk per rack | 1 rack (9 nodes total)   | 3 nodes
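The pattern behind these numbers can be sketched in a few lines. This assumes RS 6+3 (3 parity chunks) and an even spread of chunks across racks; it is my own illustration, not QFS code:

```python
PARITY_CHUNKS = 3  # RS 6+3 survives the loss of any 3 of the 9 chunks

def rack_failures_tolerated(chunks_per_rack):
    """Whole racks that can fail before a file may become unrecoverable,
    assuming each file's chunks are spread evenly across racks."""
    return PARITY_CHUNKS // chunks_per_rack

print(rack_failures_tolerated(1))  # 3: e.g. 9x3 with one chunk per rack
print(rack_failures_tolerated(3))  # 1: the multi-chunk (3 per rack) layouts
```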
Considering these, it is probably better to have fewer machines per rack and more
racks in total, provided this is backed by physical hardware failure groups. The idea
is to increase the number of failure groups as much as possible, which correlates
directly with the reliability of the cluster.
I would also like to add that you can use the qfsfsck tool to evaluate the reliability of a filesystem
at any given point in time. In particular, the “Files reachable lost if server down” and
“Files reachable lost if rack down” fields provide valuable insight toward this goal.
> What I'm getting at here is that a cluster with sequential node/rack assignments > 9
> may be sacrificing a ton of redundancy (maybe only 3 simultaneous node failures)
> as opposed to a more proper "by-9" layout (maybe up to 9 simultaneous node failures).
> Is this correct?
The metaserver spreads file chunks across multiple chunkservers, not just the original 9.
For a really large file, it's possible that a majority of the chunkservers in a cluster participate
in hosting that file in some way. The metaserver sees racks as failure groups and uses that
information to spread chunks around in a fault-tolerant manner. After that, however,
the available disk space and utilization of each chunkserver are taken into account. A rack
whose node count is not divisible by 9 shouldn't behave any differently from one whose
node count is, except in terms of which nodes fail together.
> What is the best way to accommodate differing disk counts and disk capacities
> across nodes? In other words, what is the best way to handle non-uniform nodes?
The best way is to organize your failure groups as homogeneously as possible.
For instance, if you have storage units with varying sizes, it is best to split them
horizontally across racks (and across chunkservers in each rack) so that each rack
(and each chunkserver in a rack) has roughly the same amount of available space.
To illustrate the contrary case, suppose you have 5 racks and rack 1 has
significantly more space than the others. Then a disproportionate number of
files would involve rack 1. This not only increases the network traffic on rack 1
and causes greater latencies for file operations, it also means a larger number
of files will be impacted should something happen to rack 1.
It's best to have your racks and nodes as similar as possible. If you add more powerful
nodes later on, it might be best to reshuffle some nodes from a less powerful rack so
that the new powerful nodes are spread across many racks instead of all going into
one together.
Assuming each rack itself is homogeneous, another alternative is to use rack weights to balance write
load between racks, though nowadays rack weights are less important, as the filesystem attempts
to maintain an equal write load across all disks/storage devices.
> Does QFS support any kind of node "pooling" other than by rack ID? Grouping
> sets of uniform nodes into their own racks seems like a logical solution but results
> in unbalanced data distribution between nodes/racks and eventually in wasted space.
As discussed above, you can do this by assigning static weights to racks. Please follow
this link to see how this setting can be defined in the metaserver configuration file. When static
weights are assigned, the more weight a rack has and the more chunkservers it contains, the more
likely it is to be chosen for chunk placement. You can also use the tiers described in the next
question.
> Is this what "tiers" are for? What exactly are tiers in QFS? How can we use them?
Tiers refer to storage units with similar characteristics. For example, a filesystem
might be backed by a variety of storage devices, such as HDDs, SSDs, and RAM disks,
each offering distinct advantages. In such a case, your HDDs, SSDs, and RAM disks can be
grouped into different tiers. Then, when creating a file, you can explicitly tell the metaserver
to place the file in a specific tier or in a range of tiers. One example is the storage of hot
files, i.e. files that are accessed very frequently. If we have a priori information on the
access frequency of such a file, we can tell the metaserver to place it on a RAM disk
by specifying the tier numbers when the file is created.
Tier settings can be specified in the metaserver configuration file via the
“chunkServer.storageTierPrefixes” parameter. For example,
chunkServer.storageTierPrefixes = /mnt/ram 10 /mnt/flash 11
tells the metaserver that the devices mounted at /mnt/ram and
/mnt/flash belong to tiers 10 and 11, respectively. So a hot file that we
want to place in tier 10 will be stored under a path with the prefix /mnt/ram on chunkservers.
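The value of this parameter is a flat list of alternating mount-prefix / tier-number pairs. A sketch of how such a value decomposes (my own illustration of the format shown above, not QFS code):

```python
def parse_tier_prefixes(value):
    """Parse a chunkServer.storageTierPrefixes value into {prefix: tier}."""
    tokens = value.split()
    # Tokens alternate: mount prefix, tier number, mount prefix, tier number, ...
    return {tokens[i]: int(tokens[i + 1]) for i in range(0, len(tokens), 2)}

tiers = parse_tier_prefixes("/mnt/ram 10 /mnt/flash 11")
print(tiers)  # {'/mnt/ram': 10, '/mnt/flash': 11}
```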
Here, I want to emphasize again that rack/failure-group assignment should reflect the physical
hardware configuration. It seems to me that you are instead trying to organize your nodes into
logical racks. In that scenario, regardless of the number of nodes in a logical rack, each failure
domain in reality consists of a single node. Organizing independent nodes into logical racks will
not bring any advantage in terms of fault tolerance, and it is okay to leave the organization
as is: 27 independent nodes.
The previous discussion on rack organization should be interpreted as follows. One gets the most fault-tolerant
configuration with RS 6+3 encoding when the nodes are organized into 9 or more physical racks. Such a configuration
can handle the simultaneous outage of up to 3 racks, regardless of the number of nodes in each rack. That is, should
each rack contain only 1 node, 3 node failures (not arbitrary ones, though) can be tolerated in total; should each rack
contain 9 nodes, 27 node failures can be tolerated in total. Although one might think the same logic applies to logical
racks as well, nodes in a logical rack don't form an actual failure group, because node failures inside such a rack are
not correlated. Hence, the organization wouldn't matter.
Please let me know if I’m misinterpreting your case here.
> What is the best practice for getting from a 1-27 rack ID non-optimal layout to something like 9 racks with 3
> nodes each? Is the process as simple as stopping the metaserver and chunkservers, editing the chunkserver.prp
> files with new rack IDs, and then restarting all services? Would doing this cause a massive IO storm, so to speak?
> Or are we out of luck to the point that a reordering and rebalancing of this magnitude would require a data migration
> and redeployment? The more detail the better!
There are two ways to change the rack assignment. One is to define it in the chunkserver configuration. The other is to
define it in the metaserver configuration (see the metaServer.rackPrefixes parameter) and send a HUP signal to the metaserver process.
If both are used, the latter takes precedence. Regardless of which way is used, a rack assignment takes effect only when
chunkservers reconnect to the metaserver. You can force a chunkserver to reconnect by sending a HUP signal to the corresponding chunkserver process.
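As a rough sketch of the metaserver-side route, the change might look like the following. The value format and the IP prefixes here are my own illustration; please verify the exact metaServer.rackPrefixes syntax against the annotated sample metaserver configuration before relying on it:

```
# In the metaserver configuration: map chunkserver address prefixes to rack IDs
# (hypothetical prefixes and rack IDs, shown for illustration only)
metaServer.rackPrefixes = 10.0.1. 1  10.0.2. 2  10.0.3. 3

# Then reload the metaserver and force chunkservers to reconnect:
#   kill -HUP <metaserver pid>
#   kill -HUP <chunkserver pid>   (on each chunkserver node)
```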
Once the rack assignment is in effect, the rebalancer adjusts the chunk placement without overwhelming the filesystem. Rebalancing
is a background process and has a lower priority than other metaserver operations such as chunk recovery. It uses re-replication
to move existing chunks around according to the new rack organization. The re-replication pace is generally controlled by the two
parameters metaServer.maxConcurrentReadReplicationsPerNode and metaServer.maxConcurrentWriteReplicationsPerNode.
Our experience tells us that the default values for these parameters are typically conservative enough. Moreover, the rebalancer is limited
by the chunk/replica locations, available space on nodes, etc. The fewer choices the rebalancer has, the fewer parallel re-replications it can
schedule, and hence the slower the rebalancing will be. For a rack organization where racks have roughly similar available space, rebalancing
shouldn’t be as aggressive as one might worry.
So, the short answer is that the rebalancer shouldn’t cause a massive IO storm in your case.
> Regarding rebalancing, we've found that rebalances after adding additional nodes don't go as quickly as we'd like. Is there
> a way we can speed up rebalancing? We have increased the value of "metaServer.maxRebalanceScan" to 4096 (up from the
> default of 1024) and it doesn't seem to have had an impact.
The metaServer.maxRebalanceScan parameter (as well as a couple of others, such as metaServer.rebalanceRunInterval) is for
metaserver CPU scheduling. Bumping up this value should cause the rebalancer in the metaserver to spend more time in each run.
However, when there is a large number of chunk replicas that need to be moved around, it will not have the desired effect on
rebalancing speed. Instead, you can try playing with the metaServer.maxConcurrentReadReplicationsPerNode and
metaServer.maxConcurrentWriteReplicationsPerNode parameters that I mentioned previously, but I want to emphasize
that the default values for these parameters usually strike a good balance between speed and overhead for rebalancing.
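Putting that together, a configuration fragment for speeding up re-replication might look like this. The values are illustrative only, not recommendations; check the annotated sample metaserver configuration for the actual defaults:

```
# Per-node re-replication concurrency (the knobs that actually govern
# rebalancing speed; defaults are usually conservative enough)
metaServer.maxConcurrentReadReplicationsPerNode = 10
metaServer.maxConcurrentWriteReplicationsPerNode = 10

# Governs metaserver CPU scheduling of the rebalancer scan, not the
# number of chunk moves in flight
metaServer.maxRebalanceScan = 1024
```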