Experiences with RAID Chunk Size, Storage Server Striping and Number of Volumes


Michael Ruepp

Jul 23, 2013, 9:33:07 AM
to fhgfs...@googlegroups.com
Hi there,

I am interested in any experience regarding RAID configuration and chunk/stripe size, and how they correlate with the number of storage servers, the number of volumes and workers, and the number of disks.


So basically, we have four storage servers with 30 SAS disks of 4 TB each per server. I prefer RAID 6. Every server is a dual-socket machine, 2x 8-core E5-2650 at 2.0 GHz, with 256 GB RAM. The interface is Infiniband FDR. (first choice RDMA/SDP, second IPoIB, third IP over Ethernet)

So I am thinking about optimal performance regarding IOPS and sequential throughput:
- Does it make sense to use more than one storage target volume vs. using one large RAID 6 array with a small chunk size to improve IOPS?
- How does the fhgfs storage server work internally? Does it automatically load-balance over the workers on a single storage server?
- So would it make more sense to increase the number of fhgfs volumes per server (with fewer spindles per RAID set) vs. increasing the number of disks per RAID set with fewer volumes?

Our requirement would be 40+ clients, MPI jobs and often parallel access to the same files, which are, according to our statistics, between 2 and 130 GB per file. It's quite a mixed environment, and we are actually not really able to predict the exact use case in the future.
- So with 30 disks per server, what could be a reasonable config? Is there any implication (with fhgfs striping) of using an even number of servers but an uneven number of volumes (e.g. 4 servers and 3 volumes per server)?
- So one idea is to create 3 volumes of 10 RAID 6 disks each (8 data chunks, 2 parity chunks) at a chunk size of 64/128 KB, which would reflect a stripe size of 512/1024 KB. Does the size of the fhgfs stripe have any impact? How would data be distributed?
- Is online XFS extension possible when the volume is created without partitioning, as in the performance tuning guide (e.g. mkfs.xfs /dev/sdX)?
- Does it make sense performance-wise to deploy two metadata servers?

Thanks a lot,

Mike




Frank Kautz

Jul 24, 2013, 8:45:16 AM
to fhgfs...@googlegroups.com
Hi Mike,

we can give you some recommendations which work for a lot of use-cases,
but I cannot guarantee that they will work for you.

We recommend raid10(8+2) for the storage targets. You can add several
targets to one storage server. How many disks per raid array and how
many targets per server are useful for your system is hard to say. This
depends on your hardware (RAID controller, network interfaces, ...).

For the metadata server we recommend separate targets with RAID10 or
RAID1 on SSDs. This increases the IOPS.

> - So would it make more sense to extend the amount of fhgfs volumes per
> server (with fewer spindles per raidset) vs. extend the amount of disks
> per raidset and lesser volumes?
It depends on your hardware. If you add more disks in a raid array you
need a RAID-Controller with a higher throughput to have a performance
benefit from the additional disks. How many targets are useful to
increase the performance depends on the performance of your storage
targets, the maximal throughput of your raid controller/s and the
network interface/s of the server.

> Is there any implication (with fhgfs striping) on not using even numbers
> for the volumes but the servers (e.g. 4 Server and 3 volumes per server)?
There are no restrictions on the number of servers.

> - So one idea is to create 3 volumes a 10 raid6 disks (8 data chunks 2
> parity chunks) at chunk size of 64/128Kb which would reflect a stripe
> size of 512/1024Kb. Does the size of the fhgfs stripe would have any
> impact?
A bigger chunk size (128 KB, 256 KB, ...) for the raid increases the
streaming performance of the storage server targets. A smaller chunk
size (64 KB) of the raid increases the IOPS for the metadata targets. The
fhgfs chunk size is independent of the raid chunk size. Yes, the fhgfs
stripe size will have an impact. Normally the default values work fine.
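To make the arithmetic concrete (Mike's proposed 10-disk raid6 layout is used here as an assumption, not a recommendation): with 8 data disks, a full stripe is 8 times the raid chunk size. A minimal sketch:

```python
# Full-stripe width of a parity RAID set: data disks times RAID chunk size.
# 10-disk raid6 = 8 data + 2 parity (assumed layout from Mike's question).
def full_stripe_width_kb(data_disks, raid_chunk_kb):
    """Size of one full-stripe write, in KB."""
    return data_disks * raid_chunk_kb

for chunk_kb in (64, 128):
    print(chunk_kb, "KB raid chunk ->", full_stripe_width_kb(8, chunk_kb), "KB full stripe")
```

Writes that match this full-stripe size avoid the read-modify-write penalty on parity RAID.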

> How would data be distributed?
FhGFS distributes a file across several targets, 4 by default. The
fhgfs chunk size is 512 KB by default. The storage targets are selected
when the file is created. The first chunk of a file is stored on the
first target, the second chunk on the second target, and so on; the
fifth chunk is stored on the first target again.
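The placement described above can be sketched as a simple modulo mapping; this is an illustration of the round-robin scheme with the default values, not FhGFS source code:

```python
CHUNK_SIZE = 512 * 1024   # default fhgfs chunk size: 512 KB
NUM_TARGETS = 4           # default stripe count

def target_for_offset(offset, chunk_size=CHUNK_SIZE, num_targets=NUM_TARGETS):
    """0-based index of the target that holds the byte at this file offset."""
    return (offset // chunk_size) % num_targets

# Chunks 0..4 land on targets 0, 1, 2, 3 and then wrap back to target 0.
print([target_for_offset(i * CHUNK_SIZE) for i in range(5)])  # [0, 1, 2, 3, 0]
```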

> - Does it make sense performancewise to deploy two metadata server?
It depends on the workload. You can start with one or two metadata
servers, and if the performance is not good enough you can add new
metadata servers without downtime.

kind regards,
Frank

Michael Ruepp

Jul 24, 2013, 9:17:46 AM
to fhgfs...@googlegroups.com
Hi,

thank you for your answer. However, some questions remain open.
(Basically, I am speaking of a setup of 4 storage servers and 2 metadata servers.)

>
> We recommend raid10(8+2) for the storage targets. You can add several
> targets to one storage server. How many disk for the raid arrays and how
> many targets per server is useful for your system is hard to say. This
> depends on your hardware (RAID controller, network interfaces, …).
>
> For the metadata server we recommend separate targets with RAID10 or
> RAID1 on SSDs. This is increases the IOPS.
What does "separate targets" mean? Separate from the storage targets, or "the more metadata targets (ext4 volumes) per server, the better"?
>
>> - So would it make more sense to extend the amount of fhgfs volumes per
>> server (with fewer spindles per raidset) vs. extend the amount of disks
>> per raidset and lesser volumes?
> It depends on your hardware. If you add more disks in a raid array you
> need a RAID-Controller with a higher throughput to have a performance
> benefit from the additional disks. How many targets are useful to
> increase the performance depends on the performance of your storage
> targets, the maximal throughput of your raid controller/s and the
> network interface/s of the server.
Assume the controller is fast enough to handle all disks at full speed. If I use large RAID sets, I get mostly no write benefit when the file size (fhgfs chunk per server) is not large enough to stripe over all spindles. That is the sequential single-stream use case. But in the parallel sequential use case, one RAID set per server will not deliver enough IOPS and will add a lot of latency from head repositioning when many users read and write in parallel from the storage servers.
So this is the question: if I create more RAID sets and therefore more storage volumes on the storage servers, I could reduce latency, because I guess that in a parallel use case (e.g. lots of requests per second per server) fhgfs could balance the load by using more than one RAID set at a time.

Does this have any impact, or are the queuing/caching layers of fhgfs and the RAID controller well enough equipped to cope with this RAID-set repositioning?

>
>> Is there any implication (with fhgfs striping) on not using even numbers
>> for the volumes but the servers (e.g. 4 Server and 3 volumes per server)?
> There are no restriction about the number of servers.
It is not about restrictions, but about whether there is a performance impact from choosing an even/odd number of servers. So if I choose an fhgfs chunk size of 512, how will fhgfs distribute that chunk over, for example, 3 storage volumes (targets) on the storage server? Does it write 2x 256 and leave one volume out, or is the whole chunk written to one volume, leaving the other two out, or how is it done?
>
>> - So one idea is to create 3 volumes a 10 raid6 disks (8 data chunks 2
>> parity chunks) at chunk size of 64/128Kb which would reflect a stripe
>> size of 512/1024Kb. Does the size of the fhgfs stripe would have any
>> impact?
> A bigger chunk size (128kb, 256kb, ...) for the raid increases the
> streaming performance of the storage server targets. A smaller chunk
> size (64kb) of the raid increases the IOPS for the metadata targets. The
> fhgfs chunk size is independent from the raid chunk size. Yes the fhgfs
> stripe will have an impact. Normally the default values are working fine.
>
>> How would data be distributed?
> FhGFS distributes a file across some targets, by default 4 targets. The
> fhgfs chunk size is by default 512kb. During the creation of a file the
> storage targets are selected. The first chunk of a file will be stored
> on the first target, the second chunk on the second target ... the fifth
> chunk will be stored on the first target.
I guess we have a terminology misunderstanding here. I am used to Lustre terminology, where a storage server and a storage target are different things. An XFS volume, to my understanding, is a storage target.
So the main question remains: how is performance influenced by configuring more than one XFS volume (target) per storage server, and should the number of volumes be even, like the number of storage servers? If fhgfs re-stripes chunks which go to a storage server evenly over the storage targets on that server, the number of storage targets should be even as well: 512 KB / 4 would then be 128 KB per storage target, as opposed to 512 / 3 = 170.67. But if it is not handled like this, it would make no difference. I just don't know.
>
>> - Does it make sense performancewise to deploy two metadata server?
> It depends on the workload. You can start with one or two metadata
> servers and if the performance is not good enough you can add new
> metadata servers without a downtime.

Best regards and thanks for your help,

Mike
>
> kind regards,
> Frank
>

myt...@gmail.com

Jul 24, 2013, 12:26:04 PM
to fhgfs...@googlegroups.com, frank...@itwm.fraunhofer.de
Hi !

Just to know, you don't recommend RAID6 anymore?

Bernd Schubert

Jul 25, 2013, 8:08:37 AM
to fhgfs...@googlegroups.com, myt...@gmail.com
Depending on your usage pattern, raid10 might have better performance.
Especially if you have lots of small files or you are doing small IO,
raid10 is better. raid6 is usually cheaper (you lose fewer disks to
parity), and depending on your controller it might have better streaming
performance (but that really requires that your raid6 layer (hardware or
software) is not the bottleneck).

Cheers,
Bernd

Bernd Schubert

Jul 25, 2013, 8:28:26 AM
to fhgfs...@googlegroups.com, Michael Ruepp
On 07/24/2013 03:17 PM, Michael Ruepp wrote:
> Hi,
>
> thank you for your answer. However, some questions remain open:
> (Basically, I am speaking of a setup of 4 Storage Servers and 2
> Metadataservers)
>
>>
>> We recommend raid10(8+2) for the storage targets. You can add
>> several targets to one storage server. How many disk for the raid
>> arrays and how many targets per server is useful for your system is
>> hard to say. This depends on your hardware (RAID controller,
>> network interfaces, …).
>>
>> For the metadata server we recommend separate targets with RAID10
>> or RAID1 on SSDs. This is increases the IOPS.
> What does "separate targets" mean? Separated from the Storage targets
> or "the more Metadata Targets (ext4 Volumes) per Server - the
> better"?

For FhGFS, target means "storage directory". Although it usually does not
make sense, you can have several targets on one disk or one raid-set.
However, the usual configuration is:

mount /dev/my-raid-disk1 /data/fhgfs/storage-target1
mount /dev/my-raid-disk2 /data/fhgfs/storage-target2

Given the fhgfs-meta constraints, one usually has only a single raid10
meta-target. fhgfs-storage supports 2^16 targets per daemon; fhgfs-meta
currently supports only one target per daemon (although one might start
several daemon instances).

>>
>>> - So would it make more sense to extend the amount of fhgfs
>>> volumes per server (with fewer spindles per raidset) vs. extend
>>> the amount of disks per raidset and lesser volumes?
>> It depends on your hardware. If you add more disks in a raid array
>> you need a RAID-Controller with a higher throughput to have a
>> performance benefit from the additional disks. How many targets are
>> useful to increase the performance depends on the performance of
>> your storage targets, the maximal throughput of your raid
>> controller/s and the network interface/s of the server.
>
> Assumed, the Controller is fast enough to handle all disks in use to
> the max. If I use large Raidsets, I get mostly no write benefit when
> the filesize (Fhgfs Chunk per server) is not large enough for

Large raid sets increase the probability of failures. If 3 disks fail in
a raid6 you have a problem... And with parity raid levels (>= 3)
read-modify-writes become painful with large raid sets.

> striping over all spindles). This is for sequential single usecase.
> But in sequential parallel usecase, it will get not enough IOPS and
> lot of latency by repositioning when lot of users read write parallel
> from the storage servers with one raidset per server. So this is the
> question: If i create more raidsets and therefore more storage
> volumes on the storage servers, I could reduce latency because I
> guess that in a parallel usecase (e.g. lots of requests per second
> per server), it could balance itself out by using more than one
> raidset at a time by fhgfs.
>
> Does this have any impact or is the queuing/caching complex of
> fhgfs-raidcontroller well enough equipped to get over this raidset
> repositioning stuff?

FhGFS does not have a raid controller... The default configuration is a
stripe over 4 targets, and data are written to these targets in a
round-robin way. If many users are doing lots of IO in parallel, overall
performance might be better if you reduce the default stripe-count to 1.

>
>>
>>> Is there any implication (with fhgfs striping) on not using even
>>> numbers for the volumes but the servers (e.g. 4 Server and 3
>>> volumes per server)?
>> There are no restriction about the number of servers.
>
> It is not about restriction, but about if there is a performance
> impact to not choose even/odd amount of servers. So if i choose a
> fhgfs chunksize of 512, how will fhgfs redistribute that chunk on for
> example 3 Storage Volumes (Targets) on the Storage Server? Does it
> write 2x256 and leaves one volume out or is the whole chunk written
> to one volume and leaves the two others out, or how is it done?

It writes 512 KiB to one chunk on one target, then 512 KiB to the next
chunk, and so on.

>>
>>> - So one idea is to create 3 volumes a 10 raid6 disks (8 data
>>> chunks 2 parity chunks) at chunk size of 64/128Kb which would
>>> reflect a stripe size of 512/1024Kb. Does the size of the fhgfs
>>> stripe would have any impact?
>> A bigger chunk size (128kb, 256kb, ...) for the raid increases the
>> streaming performance of the storage server targets. A smaller
>> chunk size (64kb) of the raid increases the IOPS for the metadata
>> targets. The fhgfs chunk size is independent from the raid chunk
>> size. Yes the fhgfs stripe will have an impact. Normally the
>> default values are working fine.
>>
>>> How would data be distributed?
>> FhGFS distributes a file across some targets, by default 4 targets.
>> The fhgfs chunk size is by default 512kb. During the creation of a
>> file the storage targets are selected. The first chunk of a file
>> will be stored on the first target, the second chunk on the second
>> target ... the fifth chunk will be stored on the first target.
> I guess we have a terminology misunderstanding here, I am into
> lustre terminology where a Storage Server and Storage Target is

There is no difference between FhGFS and Lustre in that sense, except
that other words are used.

FhGFS: storage-server = Lustre: OSS
FhGFS: storage-target = Lustre: OST


> different. A XFS Volume to my understanding is a Storage Target. So
> main question remains: How is the performance influenced by configure
> more than one XFS Volume (Target) per Storage Server and should the
> amount of Volumes be evenly as well as the amount of Storage Servers?

If your storage server can handle 1000 targets, then you can use 1000
targets. If your storage server already maxes out with 1 target, adding
another 999 targets only increases overall storage volume, not
performance.

> If fhgfs restripes chunks which go into a storage server evenly over
> the amount of storage targets on that server, the storage targets
> should be even as well: 512kb/4 would be than 128kb per Storage
> Target in opposite to 512/3=170,666. But if it is not handled like
> this, it would make no difference. But I don´t know.

FhGFS does not re-stripe chunks but, as explained above, writes to chunks
in a round-robin way with the configured chunk-size (which is, by the
way, absolutely identical to what Lustre does).

>>
>>> - Does it make sense performancewise to deploy two metadata
>>> server?
>> It depends on the workload. You can start with one or two metadata
>> servers and if the performance is not good enough you can add new
>> metadata servers without a downtime.

Just one note: you need sufficient memory, and you should also put
storage and meta servers in different cgroups and assign a dedicated
amount of memory to each, to avoid memory-cache 'stealing'.

Best regards,
Bernd

myt...@gmail.com

Jul 26, 2013, 3:47:10 AM
to fhgfs...@googlegroups.com, Michael Ruepp, bernd.s...@itwm.fraunhofer.de
Very interesting post!

But now I'm a bit confused... Can you explain a bit more about this sentence?


Le jeudi 25 juillet 2013 14:28:26 UTC+2, Bernd Schubert a écrit :
 Default configuration is a
stripe over 4 targets and data are written to these targets in
round-robin way. If many users are doing lots of IO in parallel, overall
performance might be better if you reduce the default stripe-count to 1.


I thought having more striping would in fact increase performance (both IOPS and speed). Am I wrong?

In my final setup I will have hundreds of TB with (dozens of) thousands connections.

Each storage server will have at least 12 HDDs: 3 arrays of 4 HDDs in RAID5 (4 TB HDDs).

80% read, 20% write

Thanks!

Bernd Schubert

Jul 26, 2013, 5:05:18 AM
to myt...@gmail.com, fhgfs...@googlegroups.com, Michael Ruepp
On 07/26/2013 09:47 AM, myt...@gmail.com wrote:
> Very interresting post!
>
> But I'm a bit in trouble btw now... Can you explain a bit more about this
> sentence ?
>
> Le jeudi 25 juillet 2013 14:28:26 UTC+2, Bernd Schubert a écrit :
>>
>> Default configuration is a
>> stripe over 4 targets and data are written to these targets in
>> round-robin way. If many users are doing lots of IO in parallel, overall
>> performance might be better if you reduce the default stripe-count to 1.
>>
>>
> I thought having more stripping will in fact increase performance (both IO
> and speed). I'm wrong ?

It really depends. If you only have a moderate number of clients doing
IO in parallel (moderate relative to the number of storage targets and
servers), striping is good for performance.
But now let's assume your storage targets are already saturated with IO
and you add another process striping over a few or even all targets.
When targets are saturated with IO, their response latency to finish an
IO request increases.
As new stripes are only sent when the current stripe is done, increased
latency also reduces performance for a single process. Also, usually not
all targets have *exactly* the same performance; even in a raid-set with
identical disks, each disk has its own latency. The slowest disk then
dominates the maximum performance of this raidset (which is usually an
FhGFS target). With several targets in a stripe, the slowest raidset (or
disk) then dominates the maximum performance of the entire stripe.
It usually does not matter if linux and/or disk IO are not fully
saturated, but if IO queues are already full, fhgfs striping will not
increase performance (at best), and will probably even slightly decrease it.
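The slowest-target effect can be put into a toy calculation (my own simplification: it assumes the chunks of a stripe are issued in lockstep, so each round completes at the pace of the slowest target):

```python
# Aggregate throughput of one striped stream, bounded by the slowest target.
def stripe_throughput_mb_s(target_mb_s):
    """MB/s for a stream striped over targets with the given speeds."""
    return min(target_mb_s) * len(target_mb_s)

print(stripe_throughput_mb_s([500, 500, 500, 500]))  # 2000: all targets equal
print(stripe_throughput_mb_s([500, 500, 500, 100]))  # 400: one saturated target drags down the whole stripe
```

With stripe-count 1 the stream would simply run at the speed of its single target, which is why reducing striping can win under heavy parallel load.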

>
> In my final setup I will have hundreds of TB with (dozens of) thousands
> connections.
>
> Each storage server will have at least 12 HDD. 3 array of 4 HDD in RAID5
> (4TB HDD).
>
> 80% read, 20% write


You need to check whether the queues are full, e.g. with 'iostat -x', or
install a monitoring tool such as collectl and check the graphs.


Cheers,
Bernd

Sven Breuner

Aug 12, 2013, 4:03:36 AM
to fhgfs...@googlegroups.com, myt...@gmail.com, frank...@itwm.fraunhofer.de
hi mythzib,

myt...@gmail.com wrote on 07/24/2013 06:26 PM:
> Just to know, you don't recommand RAID6 anymore ?
>
> Le mercredi 24 juillet 2013 14:45:16 UTC+2, Frank Kautz a écrit :
> We recommend raid10(8+2) for the storage targets.

that was a typo. it should have been "we recommend raid6(8+2) for the
storage targets".

best regards,
sven

Geoffrey Hartz

Aug 12, 2013, 4:21:36 AM
to fhgfs...@googlegroups.com, myt...@gmail.com, frank...@itwm.fraunhofer.de, sven.b...@itwm.fraunhofer.de
I'm unable to retrieve the source, but I read that big arrays (over 12TB, at least in raid5) will almost always fail when it's time to rebuild.

It's due to a read error occurring after 10-12TB of reads.

In a healthy array, the other disks fix the error, but in degraded mode this results in a failed rebuild.

If you have time (or a link), can you explain a bit more what you recommend for big arrays?
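For context, the 10-12TB figure lines up with the unrecoverable read error (URE) rate commonly published for consumer disks, about one per 1e14 bits read; that rate is an assumption here and varies between drive classes:

```python
URE_RATE_BITS = 1e14                   # assumed spec: 1 URE per 1e14 bits read
tb_per_ure = URE_RATE_BITS / 8 / 1e12  # bits -> bytes -> terabytes
print(tb_per_ure)                      # 12.5 TB read per expected URE

def expected_ures(tb_read, rate_bits=URE_RATE_BITS):
    """Expected number of UREs while reading tb_read terabytes."""
    return tb_read * 1e12 * 8 / rate_bits

# Rebuilding a degraded 10x 4TB raid5 means reading ~36 TB from the survivors:
print(round(expected_ures(36), 2))
```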

Le lundi 12 août 2013 10:03:36 UTC+2, Sven Breuner a écrit :
hi mythzib,

myt...@gmail.com wrote on 07/24/2013 06:26 PM:
> Just to know, you don't recommand RAID6 anymore ?
>
> Le mercredi 24 juillet 2013 14:45:16 UTC+2, Frank Kautz a écrit :

Sven Breuner

Aug 14, 2013, 12:13:55 PM
to fhgfs...@googlegroups.com, Geoffrey Hartz
hi geoffrey,

there are so many scary whitepapers about unrecoverable raid failures
around on the internet, but i haven't run into any such problems myself
yet (maybe i'm just lucky ...or there's something wrong with the theory
behind it). anyways, i'll try to avoid going too much into the details.

personally, i rather care about the number of disks in an array (instead
of the raw size of an array) for two simple reasons:

1) the more disks you have in a single raid5 or raid6 array, the harder
it gets to write a full stripe set (needed to avoid the inefficient
read-modify-writes).

2) the more disks you have in a single array, the more likely it is that
multiple disks fail at the same time. (e.g. if you have a single raid5
volume consisting of many, many disks, chances are good that several of
them might fail at the same time.)
[ for the theory behind it, one might just apply the birthday paradox to
disk failures and would then probably end up saying that having 23+
disks in a single array is no good ;-) ]
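The birthday-paradox aside can be computed literally: model each disk failure as landing on a uniformly random day of the year and ask when two failures share a day (a playful 365-day model, of course, not a real failure model):

```python
def prob_same_day_failures(n_disks, days=365):
    """Probability that at least two of n_disks fail on the same day."""
    p_all_distinct = 1.0
    for i in range(n_disks):
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct

# Crosses 50% at 23 disks, just like the classic birthday problem.
print(round(prob_same_day_failures(22), 3), round(prob_same_day_failures(23), 3))
```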

however, independent of the precise stochastics behind it, that's why we
usually don't recommend having more than 12 disks in a single array
[ of course, 12 is not 2^n, which usually aligns best with all kinds of
data structures, so that's why the general recommendation was 8+X in
frank's mail below ].
whether you make that a raid5 (maybe with a hotspare) or raid6 is
subject to the usual trade-off between how important the data really is
to you and how much you care about small-write performance
[ ...e.g. updating two raid6 parities is of course more expensive than
updating a single raid5 parity ...but then there's also the option to
use SSD caches or mirroring or other things depending on the available
budget ].

and the disk manufacturers also have the raid problem in mind and try to
do something about it, e.g. there is the "rebuild assist" feature in
SATA v3.2, which basically comes down to "don't fail a drive completely,
but instead make use of what's still accessible".

and after all, backups (maybe only backups of some special sub-folders
of your file system) are of course another option, if some particular
bits on your array are that important. (i bet you didn't see that one
coming, did you? ;-) )

best regards,
sven


Geoffrey Hartz wrote on 08/12/2013 10:21 AM:
> I'm unable to retrieve the source but I read big array (over 12TB, at
> least in raid5) will almost always failed when it's time to rebuild.
>
> It's due to a error after 10-12TB of read.
>
> In normal array, others disk fix the error but in a degraded mode this
> will result in a failed rebuild.
>
> If you have time (or a link) can you explain a bit more what you
> recommand for big array.
>
> Le lundi 12 août 2013 10:03:36 UTC+2, Sven Breuner a écrit :
>
> hi mythzib,
>