FhGFS in a two-node cluster


Ramon Diaz-Uriarte

May 19, 2012, 9:39:44 PM
to fhgfs-user
Dear All,

I have some general questions about whether, in my setup, it makes
sense to use FhGFS and, if it does, the recommended usage patterns.


Context
=======

We have a two-node cluster (a Dell PowerEdge C6145). Each node has four
SAS HDs (600 GB each); each node has a single RAID card (LSI9260-8i) for
the four disks. Each node also has four AMD Opteron 6276 sockets (16
cores per socket) and 256 GB RAM. The two nodes are connected via
InfiniBand (and also 1 Gb Ethernet).


We will be using the cluster for bioinformatics/statistics computing,
including programs that use MPI and OpenMP, as well as giving access
(via web-based applications) to those same bioinfo/stats computing
programs.


The main reason for considering FhGFS is to provide a single shared-disk
file system for users' homes, application code, result storage, scratch
space, tmp files, etc.



Questions:
==========


1. Is it overkill to use FhGFS in my scenario? I do not think so (it
seems it is probably _the_ ideal solution for my scenario, but just
checking ;-).


2. Each node will need to be storage server, metadata server, and
client at the same time. But I am not sure how best to use the disks.

From the documentation, it seems the best would be to use an xfs
partition for data storage. However, for metadata ...


2.1 Can I place the directory for the metadata in the same ext4
partition where the rest of the operating system will be installed? (I
would format this ext4 partition with the
"mkfs.ext4 -i 4096 -I 512 -J size=400 -Odir_index,filetype"
and use extended attributes, as explained in the Server Tuning docs).


2.2 I can configure the four disks per node (and per RAID controller)
as a single virtual disk (e.g., with RAID 0 or RAID 10), or I can
configure to have two virtual disks, each as RAID 0, the first with 1
physical HD (OS and metadata server), the second with the remaining 3
HDs (storage server). In either case, it is the same card which is
controlling the four disks for a node. What would be better? I assume
that using a single virtual disk with RAID 0 is probably best (the RAID
card will do its job better?).


3. If one of the nodes fails, and if I keep a backup of the data, can I
recover by just copying the shared disk backup, and mounting as a
regular, local, file system? (To allow this, I think I need to create
the backup from one of the clients).


4. Network configuration. Would it make sense to try to use at the same
time the ethernet and infiniband connections, one for metadata and the
other for storage transfers? I do not see how to do this.


5. What other options might I consider? I am also thinking about
GlusterFS and PVFS2 (and asking similar questions on their list).
(Lustre definitely seems to discourage client and OSS in same node).



Any other comments or suggestions for this setup are welcome.


Best,

Ramon

Frank Kautz

May 21, 2012, 9:26:15 AM
to fhgfs...@googlegroups.com
Hello Ramon,

I have added my comments below the questions.
For the users' homes, NFS will be better. For the other data, FhGFS is
a good choice.

>
>
> 2. Each node will need to be storage server, metadata server, and
> client at the same time. But I am not sure how best to use the disks.
>
> From the documentation, it seems the best would be to use an xfs
> partition for data storage. However, for metadata ...
>
>
> 2.1 Can I place the directory for the metadata in the same ext4
> partition where the rest of the operating system will be installed? (I
> would format this ext4 partition with the
> "mkfs.ext4 -i 4096 -I 512 -J size=400 -Odir_index,filetype"
> and use extended attributes, as explained in the Server Tuning docs).

Yes, it is possible. Please remember that the metadata needs about 1%
of the whole disk space that is planned for FhGFS.
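
As a rough illustration of the 1% rule of thumb, here is a minimal
sketch using the hardware figures from the original post (two nodes,
four 600 GB disks each). It assumes RAID 0, so the full nominal capacity
counts as planned FhGFS space, and it ignores the space taken by the OS
and by file system overhead; the function name is made up for this
example.

```python
# Rough metadata sizing with the ~1% rule of thumb.
# Assumptions: RAID 0 (all nominal capacity usable), OS space ignored.
NODES = 2
DISKS_PER_NODE = 4
DISK_GB = 600
METADATA_FRACTION = 0.01  # about 1% of the space planned for FhGFS


def metadata_gb(total_gb, fraction=METADATA_FRACTION):
    """Return the space to reserve for metadata, in GB."""
    return total_gb * fraction


per_node_gb = DISKS_PER_NODE * DISK_GB  # nominal capacity of one node
cluster_gb = NODES * per_node_gb        # nominal capacity of both nodes

print(f"per node: reserve about {metadata_gb(per_node_gb):.0f} GB")
print(f"cluster:  reserve about {metadata_gb(cluster_gb):.0f} GB")
```

If each node runs its own metadata server, the per-node figure is
probably the more relevant one.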

If you don't need high metadata performance, it is also possible to
store the metadata on the XFS partition. Does your software do a lot of
file stats, file creates, ...?

>
>
> 2.2 I can configure the four disks per node (and per RAID controller)
> as a single virtual disk (e.g., with RAID 0 or RAID 10), or I can
> configure to have two virtual disks, each as RAID 0, the first with 1
> physical HD (OS and metadata server), the second with the remaining 3
> HDs (storage server). In either case, it is the same card which is
> controlling the four disks for a node. What would be better? I assume
> that using a single virtual disk with RAID 0 is probably best (the RAID
> card will do its job better?).

This depends on the RAID controller. Some RAID controllers will not
deliver optimal performance if several virtual disks with different
RAID levels are configured.

RAID 0 with 4 disks is the best choice in your case. In a configuration
with two virtual disks you will waste disk space.

>
>
> 3. If one of the nodes fails, and if I keep a backup of the data, can I
> recover by just copying the shared disk backup, and mounting as a
> regular, local, file system? (To allow this, I think I need to create
> the backup from one of the clients).

Yes, this will work. You need to start the fhgfs-client and make the
backup from the mounted FhGFS.

>
>
> 4. Network configuration. Would it make sense to try to use at the same
> time the ethernet and infiniband connections, one for metadata and the
> other for storage transfers? I do not see how to do this.

I recommend using the InfiniBand connection for both. But you could
configure this in the configuration files, using the configuration
option "connInterfacesFile". More details are available in the wiki:
http://www.fhgfs.com/wiki/wikka.php?wakka=FAQ#multiple_nics
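
A minimal sketch of what such an interfaces file might contain. Only the
option name "connInterfacesFile" comes from this thread; the interface
names (ib0, eth0) and the one-name-per-line, preferred-interface-first
layout are assumptions, so please check them against the wiki page
before relying on them.

```python
# Sketch: build the content of a file that "connInterfacesFile" could point to.
# Assumptions (not from this thread): interface names ib0/eth0 and a
# one-name-per-line format with the preferred interface listed first.


def render_interfaces(names):
    """Return file content with one interface name per line."""
    return "".join(f"{name}\n" for name in names)


# InfiniBand first, Ethernet as a fallback (hypothetical names).
content = render_interfaces(["ib0", "eth0"])
print(content, end="")
```

The printed text would be saved to a file on each node, and
connInterfacesFile in the service configuration would be set to that
file's path.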

kind regards,
Frank

Ramon Diaz-Uriarte

May 21, 2012, 10:30:52 AM
to fhgfs...@googlegroups.com, Frank Kautz
I was planning on having everything under FhGFS: users sometimes launch
MPI and other parallel jobs that do I/O with files under their homes.




> >
> >
> > 2. Each node will need to be storage server, metadata server, and
> > client at the same time. But I am not sure how best to use the disks.
> >
> > From the documentation, it seems the best would be to use an xfs
> > partition for data storage. However, for metadata ...
> >
> >
> > 2.1 Can I place the directory for the metadata in the same ext4
> > partition where the rest of the operating system will be installed? (I
> > would format this ext4 partition with the
> > "mkfs.ext4 -i 4096 -I 512 -J size=400 -Odir_index,filetype"
> > and use extended attributes, as explained in the Server Tuning docs).

> Yes, it is possible. Please remember that the metadata needs about 1%
> of the whole disk space that is planned for FhGFS.

Yes, thanks.


> If you don't need high metadata performance, it is also possible to
> store the metadata on the XFS partition. Does your software do a lot of
> file stats, file creates, ...?

Generally not. However, reserving about 2% for metadata means reserving
about 40 GB on the ext4 partition, which is not too much, so it seems I
might be better off placing the metadata in the ext4 / partition.
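
To get a feel for what the quoted mkfs.ext4 options imply, here is a
small sketch of the inode arithmetic. The meaning of -i (bytes of file
system space per inode) and -I (bytes per inode) is standard mkfs.ext4
behaviour; the 40 GiB size is only a hypothetical figure loosely based
on the estimate above, and the calculation ignores the journal and other
overhead.

```python
# Sketch of the inode budget implied by "mkfs.ext4 -i 4096 -I 512".
# -i 4096 : create one inode per 4096 bytes of file system space
# -I 512  : each inode occupies 512 bytes (room for inline extended attributes)
GIB = 1024 ** 3
BYTES_PER_INODE = 4096  # value passed to mkfs.ext4 -i
INODE_SIZE = 512        # value passed to mkfs.ext4 -I


def inode_budget(fs_bytes):
    """Return (inode count, bytes used by inode tables), ignoring other overhead."""
    inodes = fs_bytes // BYTES_PER_INODE
    return inodes, inodes * INODE_SIZE


# Hypothetical 40 GiB of file system space.
inodes, table_bytes = inode_budget(40 * GIB)
print(f"inodes: {inodes:,}")
print(f"inode tables: {table_bytes / GIB:.1f} GiB")
```

With these options one eighth of the space (512 / 4096) goes to inode
tables, which is the intended trade-off for a partition that mostly
stores many small metadata entries.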



> >
> >
> > 2.2 I can configure the four disks per node (and per RAID controller)
> > as a single virtual disk (e.g., with RAID 0 or RAID 10), or I can
> > configure to have two virtual disks, each as RAID 0, the first with 1
> > physical HD (OS and metadata server), the second with the remaining 3
> > HDs (storage server). In either case, it is the same card which is
> > controlling the four disks for a node. What would be better? I assume
> > that using a single virtual disk with RAID 0 is probably best (the RAID
> > card will do its job better?).

> This depends on the RAID controller. Some RAID controllers will not
> deliver optimal performance if several virtual disks with different
> RAID levels are configured.

> RAID 0 with 4 disks is the best choice in your case. In a configuration
> with two virtual disks you will waste disk space.


OK, excellent. That seems like the simpler and more flexible
configuration. I assume I could just as well use RAID 10 with the 4 disks
if space is not limited?




> >
> >
> > 3. If one of the nodes fails, and if I keep a backup of the data, can I
> > recover by just copying the shared disk backup, and mounting as a
> > regular, local, file system? (To allow this, I think I need to create
> > the backup from one of the clients).

> Yes, this will work. You need to start the fhgfs-client and make the
> backup from the mounted FhGFS.


OK, understood.


> >
> >
> > 4. Network configuration. Would it make sense to try to use at the same
> > time the ethernet and infiniband connections, one for metadata and the
> > other for storage transfers? I do not see how to do this.

> I recommend using the InfiniBand connection for both. But you could
> configure this in the configuration files, using the configuration
> option "connInterfacesFile". More details are available in the wiki:
> http://www.fhgfs.com/wiki/wikka.php?wakka=FAQ#multiple_nics

OK, thanks. Infiniband for both seems the simplest thing to do.


Thanks for your detailed advice.

Best,

R.



--
Ramon Diaz-Uriarte
Department of Biochemistry, Lab B-25
Facultad de Medicina
Universidad Autónoma de Madrid
Arzobispo Morcillo, 4
28029 Madrid
Spain

Phone: +34-91-497-2412

Email: rdi...@gmail.com
ramon...@iib.uam.es

http://ligarto.org/rdiaz

Sven Breuner

May 21, 2012, 11:35:47 AM
to fhgfs...@googlegroups.com, Ramon Diaz-Uriarte
Hi Ramon,

Ramon Diaz-Uriarte wrote on 05/21/2012 04:30 PM:
> On Mon, 21 May 2012 15:26:15 +0200,Frank Kautz<frank...@itwm.fraunhofer.de> wrote:
>> Am 05/20/2012 03:39 AM, schrieb Ramon Diaz-Uriarte:
>
>> RAID 0 with 4 disks is the best choice in your case. In a configuration
>> with two virtual disk you will waste disk space.
>
> OK, excellent. That seems like the simpler and more flexible
> configuration. I assume I could just as well use RAID 10 with the 4 disks
> if space is not limited?

yes, RAID-10 also wouldn't be a problem (actually, fhgfs won't care
about the RAID level).
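
For comparison, a minimal sketch of the usable capacity per node under
the two RAID levels, using the four 600 GB disks from the original post.
The function name is made up for this example, and controller overhead
is ignored.

```python
# Usable capacity of one node's four disks under RAID 0 and RAID 10.
DISKS = 4
DISK_GB = 600


def usable_gb(level, disks=DISKS, disk_gb=DISK_GB):
    """Return nominal usable capacity in GB for the given RAID level."""
    if level == "raid0":
        return disks * disk_gb         # striping only, no redundancy
    if level == "raid10":
        return (disks // 2) * disk_gb  # mirrored pairs, then striped
    raise ValueError(f"unsupported RAID level: {level}")


for level in ("raid0", "raid10"):
    print(f"{level}: {usable_gb(level)} GB usable per node")
```

RAID 10 halves the usable space but survives the loss of a single disk,
whereas with RAID 0 one failed disk loses the whole virtual disk.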

Just as an additional note, if you're planning to have storage +
metadata on the same partition:

If you want to go for optimal streaming throughput, you should use XFS
here, but make sure to disable extended attributes in the metadata
server config, because XFS can be really slow with extended attributes
(http://oss.sgi.com/archives/xfs/2011-08/msg00233.html).

If you want the best tradeoff between streaming throughput and metadata
performance, you should go with ext4 with enabled extended attributes.


>>> (Lustre
>>> definitely seems to discourage client and OSS in same node).

In general, any network file system that uses the standard kernel page
cache on the client side (including e.g. NFS, just to give another
example) is not suitable for running client and server on the same
machine, because that would lead to memory allocation deadlocks under
high memory pressure - so you might want to watch out for that.
(fhgfs uses a different caching mechanism on the clients to allow
running it in such scenarios.)

Best regards,
Sven Breuner
Fraunhofer

Ramon Diaz-Uriarte

May 21, 2012, 1:22:24 PM
to fhgfs...@googlegroups.com, Ramon Diaz-Uriarte, Sven Breuner



On Mon, 21 May 2012 17:35:47 +0200,Sven Breuner <sven.b...@itwm.fraunhofer.de> wrote:
> Hi Ramon,

> Ramon Diaz-Uriarte wrote on 05/21/2012 04:30 PM:
> > On Mon, 21 May 2012 15:26:15 +0200,Frank Kautz<frank...@itwm.fraunhofer.de> wrote:
> >> Am 05/20/2012 03:39 AM, schrieb Ramon Diaz-Uriarte:
> >
> >> RAID 0 with 4 disks is the best choice in your case. In a configuration
> >> with two virtual disk you will waste disk space.
> >
> > OK, excellent. That seems like the simpler and more flexible
> > configuration. I assume I could just as well use RAID 10 with the 4 disks
> > if space is not limited?

> yes, RAID-10 also wouldn't be a problem (actually, fhgfs won't care
> about the RAID level).

> Just as an additional note, if you're planning to have storage +
> metadata on the same partition:

> If you want to go for optimal streaming throughput, you should use XFS
> here, but make sure to disable extended attributes in the metadata
> server config, because XFS can be really slow with extended attributes
> (http://oss.sgi.com/archives/xfs/2011-08/msg00233.html).

> If you want the best tradeoff between streaming throughput and metadata
> performance, you should go with ext4 with enabled extended attributes.


OK, understood; thanks for the added detail.

However, I understand that much better performance would be achieved by
having the metadata in an ext4 partition (the same for the rest of the OS)
with extended attributes, and the storage in an XFS partition (without
extended attributes), even if both are in the same disk. Is this correct?




> >>> (Lustre
> >>> definitely seems to discourage client and OSS in same node).

> In general, any network file system that uses the standard kernel page
> cache on the client side (including e.g. NFS, just to give another
> example) is not suitable for running client and server on the same
> machine, because that would lead to memory allocation deadlocks under
> high memory pressure - so you might want to watch out for that.
> (fhgfs uses a different caching mechanism on the clients to allow
> running it in such scenarios.)

Aha! Thanks for the details.

Best,

R.



Christian Mohrbacher

May 22, 2012, 7:29:13 AM
to fhgfs...@googlegroups.com
Hi Ramon,

> OK, understood; thanks for the added detail. However, I understand
> that much better performance would be achieved by having the metadata
> in an ext4 partition (the same for the rest of the OS) with extended
> attributes, and the storage in an XFS partition (without extended
> attributes), even if both are in the same disk. Is this correct?

yes, that's correct.
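
The three layouts discussed in this thread can be summarized in a small
lookup table. This is only a sketch: the layout names and field names
are made up for illustration, and the entries simply restate the advice
given above.

```python
# Layouts discussed in this thread, as a lookup table.
# Keys and field names are hypothetical, chosen only for this summary.
LAYOUTS = {
    # Storage + metadata on one partition, optimized for streaming throughput.
    "shared_streaming": {"metadata_fs": "xfs", "storage_fs": "xfs", "metadata_xattrs": False},
    # Storage + metadata on one partition, balanced throughput and metadata speed.
    "shared_balanced": {"metadata_fs": "ext4", "storage_fs": "ext4", "metadata_xattrs": True},
    # Metadata on the ext4 OS partition, storage on a separate XFS partition.
    "separate": {"metadata_fs": "ext4", "storage_fs": "xfs", "metadata_xattrs": True},
}


def describe(name):
    """Return a one-line description of the named layout."""
    layout = LAYOUTS[name]
    xattrs = "on" if layout["metadata_xattrs"] else "off"
    return (f"{name}: metadata on {layout['metadata_fs']} "
            f"(xattrs {xattrs}), storage on {layout['storage_fs']}")


for name in LAYOUTS:
    print(describe(name))
```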


Regards,
Christian Mohrbacher
Fraunhofer

--
=====================================================
| Christian Mohrbacher |
| Competence Center for High Performance Computing |
| Institut fuer Techno- und |
| Wirtschaftsmathematik (ITWM) |
| Fraunhofer-Platz 1 |
| |
| D-67663 Kaiserslautern |
=====================================================
| Tel: (49) 631 31600 4425 |
| Fax: (49) 631 31600 1099 |
| |
| E-Mail: christian....@itwm.fraunhofer.de |
| Internet: http://www.itwm.fraunhofer.de |
=====================================================

Ramon Diaz-Uriarte

May 22, 2012, 4:23:05 PM
to fhgfs...@googlegroups.com, Christian Mohrbacher



On Tue, 22 May 2012 13:29:13 +0200,Christian Mohrbacher <christian....@itwm.fraunhofer.de> wrote:
> Hi Ramon,

> > OK, understood; thanks for the added detail. However, I understand
> > that much better performance would be achieved by having the metadata
> > in an ext4 partition (the same for the rest of the OS) with extended
> > attributes, and the storage in an XFS partition (without extended
> > attributes), even if both are in the same disk. Is this correct?

> yes, that's correct.


Thanks!

Best,

R.

