On 07/24/2013 03:17 PM, Michael Ruepp wrote:
> Hi,
>
> thank you for your answer. However, some questions remain open:
> (Basically, I am speaking of a setup of 4 Storage Servers and 2
> Metadataservers)
>
>>
>> We recommend raid10(8+2) for the storage targets. You can add
>> several targets to one storage server. How many disk for the raid
>> arrays and how many targets per server is useful for your system is
>> hard to say. This depends on your hardware (RAID controller,
>> network interfaces, …).
>>
>> For the metadata server we recommend separate targets with RAID10
>> or RAID1 on SSDs. This increases the IOPS.
> What does "separate targets" mean? Separated from the Storage targets
> or "the more Metadata Targets (ext4 Volumes) per Server - the
> better"?
For FhGFS, "target" means "storage directory". Although it usually does not
make sense, you can have several targets on one disk or one raid-set.
However, the usual configuration is one raid-set per target:
mount /dev/my-raid-disk1 /data/fhgfs/storage-target1
mount /dev/my-raid-disk2 /data/fhgfs/storage-target2
Due to fhgfs-meta constraints one usually has only a single raid10
meta-target. fhgfs-storage supports 2^16 targets per daemon; fhgfs-meta
currently supports only one target per daemon (although one might start
several daemon instances).
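In config terms, multiple targets per storage daemon would then look roughly like this (a sketch; the device names, mount points, and the storeStorageDirectory key are examples from a typical fhgfs-storage setup, so check the config file of your version):

```shell
# Two raid-sets become two storage targets on one storage server.
mkfs.xfs /dev/my-raid-disk1          # first raid-set
mkfs.xfs /dev/my-raid-disk2          # second raid-set
mount /dev/my-raid-disk1 /data/fhgfs/storage-target1
mount /dev/my-raid-disk2 /data/fhgfs/storage-target2

# /etc/fhgfs/fhgfs-storage.conf (example key name, verify for your release):
#   storeStorageDirectory = /data/fhgfs/storage-target1,/data/fhgfs/storage-target2
```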
>>
>>> - So would it make more sense to extend the amount of fhgfs
>>> volumes per server (with fewer spindles per raidset) vs. extend
>>> the amount of disks per raidset and lesser volumes?
>> It depends on your hardware. If you add more disks in a raid array
>> you need a RAID-Controller with a higher throughput to have a
>> performance benefit from the additional disks. How many targets are
>> useful to increase the performance depends on the performance of
>> your storage targets, the maximal throughput of your raid
>> controller/s and the network interface/s of the server.
>
> Assumed, the Controller is fast enough to handle all disks in use to
> the max. If I use large Raidsets, I get mostly no write benefit when
> the filesize (Fhgfs Chunk per server) is not large enough for
Large raid sets increase the probability of failures: if 3 disks fail in
a raid6, you have a problem... And with raid levels >= 3,
read-modify-writes become painful with large raid sets.
> striping over all spindles). This is for sequential single usecase.
> But in sequential parallel usecase, it will get not enough IOPS and
> lot of latency by repositioning when lot of users read write parallel
> from the storage servers with one raidset per server. So this is the
> question: If i create more raidsets and therefore more storage
> volumes on the storage servers, I could reduce latency because I
> guess that in a parallel usecase (e.g. lots of requests per second
> per server), it could balance itself out by using more than one
> raidset at a time by fhgfs.
>
> Does this have any impact or is the queuing/caching complex of
> fhgfs-raidcontroller well enough equipped to get over this raidset
> repositioning stuff?
FhGFS does not have a raid controller... The default configuration is a
stripe over 4 targets, and data are written to these targets in a
round-robin way. If many users are doing lots of IO in parallel, overall
performance might be better if you reduce the default stripe count to 1.
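For example, the stripe pattern can be changed per directory with fhgfs-ctl (a sketch; the directory path is made up and option names may differ between releases, so verify with fhgfs-ctl --help):

```shell
# Lower the stripe count to 1 for a directory that sees highly parallel,
# small-file IO; new files created below it inherit this pattern.
fhgfs-ctl --setpattern --chunksize=512k --numtargets=1 /mnt/fhgfs/scratch
fhgfs-ctl --getentryinfo /mnt/fhgfs/scratch   # check the resulting pattern
```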
>
>>
>>> Is there any implication (with fhgfs striping) on not using even
>>> numbers for the volumes but the servers (e.g. 4 Server and 3
>>> volumes per server)?
>> There are no restrictions on the number of servers.
>
> It is not about restriction, but about if there is a performance
> impact to not choose even/odd amount of servers. So if i choose a
> fhgfs chunksize of 512, how will fhgfs redistribute that chunk on for
> example 3 Storage Volumes (Targets) on the Storage Server? Does it
> write 2x256 and leaves one volume out or is the whole chunk written
> to one volume and leaves the two others out, or how is it done?
It writes 512 kiB to a chunk on one target, then 512 kiB to a chunk on
the next target, and so on.
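That round-robin mapping can be sketched with plain shell arithmetic (default fhgfs values assumed: chunk size 512 kiB, 4 targets; this is an illustration, not fhgfs code):

```shell
# Map a byte offset in a file to its 0-based target index.
chunksize=$((512 * 1024))
numtargets=4

offset=$((5 * 512 * 1024))              # first byte of the 6th chunk
chunk_index=$(( offset / chunksize ))
target=$(( chunk_index % numtargets ))
echo "chunk $chunk_index goes to target $target"   # chunk 5 -> target 1
```

So a chunk never gets split across targets: the whole 512 kiB goes to one target, then the next chunk goes to the next target.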
>>
>>> - So one idea is to create 3 volumes a 10 raid6 disks (8 data
>>> chunks 2 parity chunks) at chunk size of 64/128Kb which would
>>> reflect a stripe size of 512/1024Kb. Does the size of the fhgfs
>>> stripe would have any impact?
>> A bigger chunk size (128kb, 256kb, ...) for the raid increases the
>> streaming performance of the storage server targets. A smaller
>> chunk size (64kb) of the raid increases the IOPS for the metadata
>> targets. The fhgfs chunk size is independent from the raid chunk
>> size. Yes the fhgfs stripe will have an impact. Normally the
>> default values are working fine.
>>
>>> How would data be distributed?
>> FhGFS distributes a file across some targets, by default 4 targets.
>> The fhgfs chunk size is by default 512kb. During the creation of a
>> file the storage targets are selected. The first chunk of a file
>> will be stored on the first target, the second chunk on the second
>> target ... the fifth chunk will be stored on the first target.
> I guess we have a terminology misunderstanding here, I am into
> lustre terminology where a Storage Server and Storage Target is
There is no difference between FhGFS and Lustre in that sense, except
that other words are used.
FhGFS: storage-server Lustre: OSS
FhGFS: storage-target Lustre: OST
> different. A XFS Volume to my understanding is a Storage Target. So
> main question remains: How is the performance influenced by configure
> more than one XFS Volume (Target) per Storage Server and should the
> amount of Volumes be evenly as well as the amount of Storage Servers?
If your storage server can handle 1000 targets, then you can use 1000
targets. If your storage server already maxes out with 1 target, adding
another 999 targets only increases the overall storage capacity, not the
performance.
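As a back-of-envelope rule, per-server streaming throughput is roughly min(network bandwidth, sum of the target throughputs). The numbers below are hypothetical examples, not measurements:

```shell
# Effective per-server throughput = min(network, n * one target's rate).
net_mbs=1200        # ~10 GbE, in MB/s (example value)
target_mbs=800      # one RAID6 target, in MB/s (example value)
for n in 1 2 4; do
  sum=$(( n * target_mbs ))
  eff=$(( sum < net_mbs ? sum : net_mbs ))
  echo "$n target(s): ~${eff} MB/s"
done
```

With these example numbers the second target already saturates the network, and targets beyond that add only capacity.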
> If fhgfs restripes chunks which go into a storage server evenly over
> the amount of storage targets on that server, the storage targets
> should be even as well: 512kb/4 would be than 128kb per Storage
> Target in opposite to 512/3=170,666. But if it is not handled like
> this, it would make no difference. But I don't know.
FhGFS does not restripe chunks but, as explained above, writes to chunks
in a round-robin way with the configured chunk size (which is, by the
way, exactly what Lustre does).
>>
>>> - Does it make sense performancewise to deploy two metadata
>>> server?
>> It depends on the workload. You can start with one or two metadata
>> servers and if the performance is not good enough you can add new
>> metadata servers without a downtime.
Just one note: you need sufficient memory, and you should also put the
storage and meta servers into different cgroups and assign a dedicated
amount of memory to each, to avoid memory-cache 'stealing'.
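A rough sketch of that memory separation with the cgroup-v1 memory controller (as found in 2013-era kernels; the limits and daemon names used with pidof are made-up examples):

```shell
# Give storage and meta daemons separate, fixed memory budgets so the
# storage server's cache cannot evict the meta server's cached data.
mkdir -p /sys/fs/cgroup/memory/fhgfs-storage /sys/fs/cgroup/memory/fhgfs-meta
echo $((48 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/memory/fhgfs-storage/memory.limit_in_bytes
echo $((16 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/memory/fhgfs-meta/memory.limit_in_bytes

# The tasks file accepts one PID per write.
for pid in $(pidof fhgfs-storage); do
  echo "$pid" > /sys/fs/cgroup/memory/fhgfs-storage/tasks
done
for pid in $(pidof fhgfs-meta); do
  echo "$pid" > /sys/fs/cgroup/memory/fhgfs-meta/tasks
done
```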
Best regards,
Bernd