Hello,
My name is Mehmet and I work on the QFS team here at Quantcast. I’ll be helping
Faraaz respond to posts here in QFS-devel. First off, we want to apologize
for taking so long to get back to you on these answers. We really appreciate you reaching
out to us with questions. Please feel free to continue to do so!
> From what we understand, QFS always wants its nodes/racks arranged in
> groups of 9, which we aren't doing and haven't done since we had 9 nodes.
QFS doesn’t necessarily enforce groups of 9. In fact, for a special case, we have
a filesystem instance in which we store all 9 chunks (6 data chunks + 3 parity chunks)
of a file in only three racks, so three chunks per rack. This is assuming that the file size
is less than or equal to 64MB*6 of course (with a larger file size, the extra data will
overflow into another set of 9 chunks).
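The arithmetic above can be sketched quickly. This is my own illustration of the chunk-count math implied by RS 6+3 with 64MB chunks, not QFS code:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64MB per chunk
DATA_STRIPES = 6               # RS 6+3: 6 data stripes...
PARITY_STRIPES = 3             # ...plus 3 parity stripes

def total_chunks(file_size_bytes):
    """Total chunks (data + parity) a file occupies under RS 6+3."""
    block = CHUNK_SIZE * DATA_STRIPES         # 384MB of original data per chunk set
    blocks = -(-file_size_bytes // block)     # ceiling division
    return blocks * (DATA_STRIPES + PARITY_STRIPES)

# A file of at most 64MB*6 = 384MB fits in a single set of 9 chunks:
print(total_chunks(384 * 1024 * 1024))   # 9
# Anything larger overflows into another set of 9:
print(total_chunks(385 * 1024 * 1024))   # 18
```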
Enabling multiple chunks per rack basically saves cross-rack bandwidth compared
to a configuration with a single chunk per rack, but note that it can
also have serious reliability consequences, so its use is not encouraged.
> With 27 nodes, what is the best rack layout? What are the pros/cons of the 3
> racks of 9 nodes vs 9 racks of 3 nodes?
Assuming the chunks are distributed equally among racks, a higher
number of racks generally means a greater number of tolerated failures. This comes
at the expense of less (or no) data locality, but since networks are faster than
disks these days, that is rarely an issue.
It is the opposite for a lower number of racks: fewer tolerated failures,
but more locality. Another downside of a lower number of racks is that
your storage capacity might diminish significantly if a rack failure is not transient.
In the 3x9 scenario you give below, you might lose 1/3 of your storage capacity
for an extended period while an entire rack is down.
I want to point out here that when talking about racks, I mean physically distinct
racks that each have their own failure modes. This usually means they are
backed by their own rack-specific switch, PDU, and backup PDU. You gain no
extra benefit (QFS-wise) from having machines in the same physical rack but
presenting them to QFS as different racks.
> How does QFS handle object placement when rack assignments exceed 9?
> Less specifically, how does QFS handle object/chunk placement in general?
> We are currently using RS 6+3 for all data on the cluster.
We also use RS 6+3 for almost all, if not all, data on our cluster. When placing
a chunk of a file, the metaserver will generally choose the chunkserver with the most
space available on its physical drives; this is the default configuration. Different chunks
of the same file are placed so that fault-tolerant placement is taken into account
as well. That is, if more racks are available, the metaserver will spread the
different chunks of a file across different racks, decreasing the chance of data loss under failures.
Note that the metaserver also actively tries to re-balance available space by moving chunks
from over-utilized chunkservers to under-utilized ones. There are some threshold values
that you can play with to alter the behaviour of chunk placement and the space re-balancer.
Some of these are metaServer.maxSpaceUtilizationThreshold, metaServer.rebalancingEnabled,
metaServer.maxRebalanceSpaceUtilThreshold, and metaServer.minRebalanceSpaceUtilThreshold.
If you want to read more about this, you can check out the following post for more details and a discussion.
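For reference, a metaserver configuration fragment touching these knobs might look like the following. The values here are illustrative only, not recommendations; please check the annotated sample metaserver configuration for the actual defaults and semantics:

```
# Enable the background space re-balancer (illustrative values, not defaults)
metaServer.rebalancingEnabled = 1
# Avoid placing new chunks on chunkservers above this space utilization
metaServer.maxSpaceUtilizationThreshold = 0.95
# Move chunks off chunkservers above this utilization...
metaServer.maxRebalanceSpaceUtilThreshold = 0.82
# ...and onto chunkservers below this one
metaServer.minRebalanceSpaceUtilThreshold = 0.72
```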
The default chunk placement policy can also be altered by assigning static weights to racks.
Please follow this link to see how this setting can be defined in the metaserver configuration file.
When static weights are assigned, the more weight a rack has and the more chunkservers it
contains, the more likely it is to be chosen for chunk placement. Another way to influence chunk
placement is the use of tiers, which I explain at the end of this post.
Also, just to clear up any possible confusion, I want to emphasize that QFS is a distributed file
system rather than an "object store". The main differences are: (a) a QFS file can be appended to / modified,
whereas object stores don't allow those operations; (b) a QFS file can be renamed / moved at almost
no cost, while object stores only let you copy and delete existing objects; and (c) QFS has a strong
directory structure: directories can be renamed / moved, du information is cached at the directory level,
files can be listed by directory, and access permissions can be granted at the directory level.
All of these are typically unavailable in object stores.
> Are chunks for a given object (assume very large > 100GB objects) ever written
> to more than 9 nodes or 9 racks? If so, what is the implication to the number of node/rack failures
> a cluster can sustain before data loss? Is it simply N+3?
Yes, they are. For the next part of my answer, I’m assuming you are familiar with how QFS striping works.
Each chunk can hold up to 64MB of data. If you write a large file, say 576MB,
the first 384MB (64MB*6) of original data, together with 192MB (64MB*3) of parity, will be
stored in nine chunks, ideally on nine separate chunkservers. For the remaining data,
the metaserver will select a different set of nine chunkservers.
You can verify this behaviour by writing a large file and then invoking
the “qfs” tool with the “-dloc” option. For example,
qfs -fs kfs://localhost:20000 -dloc 0 603979776 foo.1
should show the chunkservers containing the first 576MB of
a file called foo.1. If you run this command, you should see two lines. The first line
should show the chunkservers that contain the first 384MB of the file,
whereas the second line should show the chunkservers that contain the remaining
data. Note that the chunkservers that store the parity data do not show up in the output,
so you should see twelve different chunkservers, assuming your filesystem has
enough chunkservers.
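The mapping from a byte offset to the 9-chunk set that holds it can be sketched as follows. This is my own illustration of the layout described above, not QFS code:

```python
CHUNK_SIZE = 64 * 1024 * 1024   # bytes of data per chunk
DATA_STRIPES = 6                # RS 6+3 data stripes

def chunk_set(offset):
    """Index of the 9-chunk set holding the data byte at `offset`.
    Each set holds 64MB*6 = 384MB of original data."""
    return offset // (CHUNK_SIZE * DATA_STRIPES)

# The 576MB file from the example above spans two chunk sets:
print(chunk_set(0))                      # 0 -> first line of -dloc output
print(chunk_set(384 * 1024 * 1024))      # 1 -> second line
```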
> How does all of this change with a 9-rack by 3-node layout or 3-rack by
> 9-node layout? Could the cluster then lose up to 3 racks worth of nodes
> (9 nodes) in a 9x3 layout before losing data? Similarly, could the cluster
> lose up to 3 nodes in each rack (9 nodes) for the 3x9 layout?
The answer depends on the configuration. In the 9x3 scenario with one chunk
per rack, we can lose up to three racks before losing data. That makes 9 nodes,
as you said. However, this doesn’t mean that we can tolerate any arbitrary 9 node
failures, since the failed nodes can span more than three racks. With the multi-chunk-per-rack
configuration mentioned above (three chunks per rack), we can tolerate
the loss of up to a single rack, because two of the remaining racks will hold enough
data and parity chunks for a file to be recovered.
Again assuming chunks are spread evenly, in 3x9, we can lose up to an entire
single rack. Note that this is different from what you’re suggesting, since with the
loss of 3 nodes in each rack you can lose more than three chunks of a file. In all of
the above configurations, we can only tolerate 3 arbitrary node failures.
To sum up,
Configuration             | Rack Failures Tolerated  | Arbitrary Node Failures Tolerated
--------------------------|--------------------------|----------------------------------
9x3, one chunk per rack   | 3 racks (9 nodes total)  | 3 nodes
9x3, multi-chunk per rack | 1 rack (3 nodes total)   | 3 nodes
3x9, multi-chunk per rack | 1 rack (9 nodes total)   | 3 nodes
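The pattern behind these numbers can be sketched in a few lines. This assumes RS 6+3 (3 parity chunks) and an even spread of chunks across racks; it is my own illustration, not QFS code:

```python
PARITY_CHUNKS = 3  # RS 6+3 survives the loss of any 3 of the 9 chunks

def rack_failures_tolerated(chunks_per_rack):
    """Whole racks that can fail before a file may become unrecoverable,
    assuming each file's chunks are spread evenly across racks."""
    return PARITY_CHUNKS // chunks_per_rack

print(rack_failures_tolerated(1))  # 3: e.g. 9x3 with one chunk per rack
print(rack_failures_tolerated(3))  # 1: the multi-chunk (3 per rack) layouts
```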
Considering these, it is probably better to have fewer machines per rack and more
racks in total, provided this is backed by physical hardware failure groups. The idea
is to increase the number of failure groups as much as possible, which correlates
directly with the reliability of the cluster.
I would also like to add that you can use the qfsfsck tool to evaluate the reliability of a filesystem
at any given point in time. In particular, the “Files reachable lost if server down” and
“Files reachable lost if rack down” fields provide valuable insight toward this goal.
> What I'm getting at here is that a cluster with sequential node/rack assignments > 9
> may be sacrificing a ton of redundancy (maybe only 3 simultaneous node failures)
> as opposed to a more proper "by-9" layout (maybe up to 9 simultaneous node failures).
> Is this correct?
The metaserver spreads file chunks across multiple chunkservers, not just the original 9.
For a really large file, it's possible that a majority of the chunkservers in a cluster participate
in hosting that file in some way. The metaserver sees racks as failure groups and uses that
information to spread chunks around in a fault-tolerant manner. After that, however,
the available disk space and utilization of each chunkserver are taken into account. A rack
whose node count is not divisible by 9 shouldn't behave any differently from one whose
node count is, except in terms of which nodes fail together.
> What is the best way to accommodate differing disk counts and disk capacities
> across nodes? In other words, what is the best way to handle non-uniform nodes?
The best way is to organize your failure groups as homogeneously as possible.
For instance, if you have storage units with varying sizes, it is best to split them
horizontally across racks (and across chunkservers in each rack) so that each rack
(and each chunkserver in a rack) has roughly the same amount of available space.
To illustrate the contrary case, suppose you have 5 racks and rack 1 has
significantly more space than the others. Then a disproportionate number of
files would involve rack 1. This not only increases the network traffic on rack 1
and causes greater latencies for file operations, it also means a larger number
of files will be impacted should something happen to rack 1.
It's best to have your racks and nodes as similar as possible. If you add more powerful
nodes later on, it might be best to reshuffle some nodes from a less powerful rack so
that the new powerful nodes are spread across many racks instead of all going into
one together.
Assuming each rack itself is homogeneous, another alternative is to use rack weights to balance write
load between racks, though nowadays rack weights are less important, as the filesystem attempts
to maintain an equal write load across all disks/storage devices.
> Does QFS support any kind of node "pooling" other than by rack ID? Grouping
> sets of uniform nodes into their own racks seems like a logical solution but results
> in unbalanced data distribution between nodes/racks and eventually in wasted space.
As discussed above, you can do this by assigning static weights to racks. Please follow
this link to see how this setting can be defined in the metaserver configuration file. When static
weights are assigned, the more weight a rack has and the more chunkservers it contains, the more
likely it is to be chosen for chunk placement. You can also use the tiers described in the next
question.
> Is this what "tiers" are for? What exactly are tiers in QFS? How can we use them?
Tiers refer to storage units with similar characteristics. For example, a filesystem
might be backed by a variety of storage devices, such as HDDs, SSDs, and RAM disks,
each offering distinct advantages. In such a case, your HDDs, SSDs, and RAM disks can be
grouped into different tiers. Then, when creating a file, you can explicitly tell the metaserver
to place the file in a specific tier or in a range of tiers. One example is the storage of hot
files, i.e. files that are accessed very frequently. If we have a priori information on the
access frequency of such a file, we can tell the metaserver to place it on a RAM disk
by specifying the tier numbers when the file is created.
Tier settings can be specified in the metaserver configuration file via the
“chunkServer.storageTierPrefixes” parameter. For example,
chunkServer.storageTierPrefixes = /mnt/ram 10 /mnt/flash 11
tells the metaserver that the devices mounted at /mnt/ram and
/mnt/flash belong to tiers 10 and 11, respectively. So a hot file that we
want to place in tier 10 will be stored under a path with the prefix /mnt/ram on chunkservers.
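The value of this parameter is a flat list of alternating mount-prefix / tier-number pairs. A sketch of how such a value decomposes (my own illustration of the format shown above, not QFS code):

```python
def parse_tier_prefixes(value):
    """Parse a chunkServer.storageTierPrefixes value into {prefix: tier}."""
    tokens = value.split()
    # Tokens alternate: mount prefix, tier number, mount prefix, tier number, ...
    return {tokens[i]: int(tokens[i + 1]) for i in range(0, len(tokens), 2)}

tiers = parse_tier_prefixes("/mnt/ram 10 /mnt/flash 11")
print(tiers)  # {'/mnt/ram': 10, '/mnt/flash': 11}
```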
Here, I want to emphasize again that rack/failure-group assignment should reflect the physical
hardware configuration. It seems to me that you are instead trying to organize your nodes into
logical racks. In that scenario, regardless of the number of nodes in a logical rack, each failure
domain in reality consists of a single node. Organizing independent nodes into logical racks will
not bring any advantage in terms of fault tolerance, and it is okay to leave the organization
as is: 27 independent nodes.
The previous discussion on rack organization should be interpreted as follows. One gets the most fault-tolerant
configuration with RS 6+3 encoding when the nodes are organized into 9 or more physical racks. Such a configuration
can handle the simultaneous outage of up to 3 racks, regardless of the number of nodes in each rack. That is, should
each rack contain only 1 node, 3 node failures (not arbitrary ones, though) can be tolerated in total; should each rack
contain 9 nodes, 27 node failures can be tolerated in total. Although one might think the same logic applies to logical
racks as well, nodes in a logical rack don't form an actual failure group, because node failures inside such a rack are
not correlated. Hence, the organization wouldn't matter.
Please let me know if I’m misinterpreting your case here.
> What is the best practice for getting from a 1-27 rack ID non-optimal layout to something like 9 racks with 3
> nodes each? Is the process as simple as stopping the metaserver and chunkservers, editing the chunkserver.prp
> files with new rack IDs, and then restarting all services? Would doing this cause a massive IO storm, so to speak?
> Or are we out of luck to the point that a reordering and rebalancing of this magnitude would require a data migration
> and redeployment? The more detail the better!
There are two ways to change the rack assignment. One is to define it in the chunkserver configuration. The other is to
define it in the metaserver configuration (see the metaServer.rackPrefixes parameter) and send a HUP signal to the metaserver process.
If both are used, the latter takes precedence. Regardless of which way is used, a rack assignment takes effect only when
chunkservers reconnect to the metaserver. You can force a chunkserver to reconnect by sending a HUP signal to the corresponding chunkserver process.
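As a rough sketch of the metaserver-side route, the change might look like the following. The value format and the IP prefixes here are my own illustration; please verify the exact metaServer.rackPrefixes syntax against the annotated sample metaserver configuration before relying on it:

```
# In the metaserver configuration: map chunkserver address prefixes to rack IDs
# (hypothetical prefixes and rack IDs, shown for illustration only)
metaServer.rackPrefixes = 10.0.1. 1  10.0.2. 2  10.0.3. 3

# Then reload the metaserver and force chunkservers to reconnect:
#   kill -HUP <metaserver pid>
#   kill -HUP <chunkserver pid>   (on each chunkserver node)
```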
Once the rack assignment is in effect, the rebalancer adjusts the chunk placement without overwhelming the filesystem. Rebalancing
is a background process and has a lower priority than other metaserver operations such as chunk recovery. It uses re-replication
to move existing chunks around according to the new rack organization. The re-replication pace is generally controlled by the two
parameters metaServer.maxConcurrentReadReplicationsPerNode and metaServer.maxConcurrentWriteReplicationsPerNode.
Our experience tells us that the default values for these parameters are typically conservative enough. Moreover, the rebalancer is limited
by the chunk/replica locations, available space on nodes, etc. The fewer choices the rebalancer has, the fewer parallel re-replications it can
schedule, and hence the slower the rebalancing will be. For a rack organization where racks have roughly similar available space, rebalancing
shouldn’t be as aggressive as one might worry.
So, the short answer is that the rebalancer shouldn’t cause a massive IO storm in your case.
> Regarding rebalancing, we've found that rebalances after adding additional nodes don't go as quickly as we'd like. Is there
> a way we can speed up rebalancing? We have increased the value of "metaServer.maxRebalanceScan" to 4096 (up from the
> default of 1024) and it doesn't seem to have had an impact.
The metaServer.maxRebalanceScan parameter (as well as a couple of others, such as metaServer.rebalanceRunInterval) is for
metaserver CPU scheduling. Bumping up this value should cause the rebalancer in the metaserver to spend more time in each run.
However, when there is a large number of chunk replicas that need to be moved around, it will not have the desired effect on
rebalancing speed. Instead, you can try playing with the metaServer.maxConcurrentReadReplicationsPerNode and
metaServer.maxConcurrentWriteReplicationsPerNode parameters that I mentioned previously, but I want to emphasize
that the default values for these parameters usually strike a good balance between speed and overhead for rebalancing.
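Putting that together, a configuration fragment for speeding up re-replication might look like this. The values are illustrative only, not recommendations; check the annotated sample metaserver configuration for the actual defaults:

```
# Per-node re-replication concurrency (the knobs that actually govern
# rebalancing speed; defaults are usually conservative enough)
metaServer.maxConcurrentReadReplicationsPerNode = 10
metaServer.maxConcurrentWriteReplicationsPerNode = 10

# Governs metaserver CPU scheduling of the rebalancer scan, not the
# number of chunk moves in flight
metaServer.maxRebalanceScan = 1024
```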