Distributed storage and computing


Tim van Elteren

Oct 28, 2013, 9:26:11 AM
to fhgfs...@googlegroups.com
Dear Fraunhofer GFS Community,

After researching a number of distributed file systems for deployment in a production environment with the main purpose of performing both batch and real-time distributed computing I've identified FhGFS as a potential solution.

The key properties that our system should exhibit:

- an open-source, liberally licensed, yet production-ready solution, i.e. mature, reliable, and both community- and commercially supported;
- ability to run on commodity hardware, preferably be designed for it;
- provide high availability of the data with the most focus on reads;
- high scalability, including operation over multiple data centres, possibly globally;
- removal of single points of failure with the use of replication and distribution of (meta-)data.

The sensitivity points that were identified, and resulted in the following questions, are:

1) transparency to the processing layer/application with respect to data locality, i.e. knowing where data is physically located at the server level, mainly for resource allocation and high-performance processing. How can this be accomplished using FhGFS?

2) POSIX compliance, or conformance: Hadoop, for example, isn't POSIX-compliant by design. What are the pros and cons? What is FhGFS's approach with respect to support for POSIX operations?

3) mainly with respect to evaluating the production readiness of FhGFS: where is it currently used in production environments, and for which specific use cases does it seem most suitable? Are there any known issues or common pitfalls, and workarounds for them?

4) finally, what would be the most compelling reason to go for FhGFS rather than the alternatives?

I'm looking forward to your replies. Thanks in advance! :)

With kind regards,

Tim van Elteren

harry mangalam

Oct 28, 2013, 8:25:25 PM
to Tim van Elteren, fhgfs...@googlegroups.com

Hi Tim,

Answers inline. This is of course my understanding of how the system works and how it has worked in our implementation, especially wrt Gluster, which we run at the same time:

<http://moo.nac.uci.edu/~hjm/fhgfs_vs_gluster.html>

I'm about to start a test of GPFS, but have no data yet.

The Fhgfs devs can step in and correct my mistakes.

 

> The key properties that our system should exhibit:

 

> - an open source, liberally licensed, yet production ready, e.g. a mature, reliable, community and commercially supported solution;

 

Mostly - the client is OSS, but the server isn't... yet, tho it seems to be moving in that direction. The licensing is extremely liberal - it's free. The paid support is reasonable. (I'm part of U. of California and we get a strangely discounted license for IBM's GPFS - the hook - but if we didn't get that discount, Fhgfs would be MUCH cheaper than GPFS, for example.) Most other distributed filesystems (DFSs) are also OSS: Lustre, PVFS, Ceph, Gluster (tho its commercial spawn RHS is not).

The other proprietary DFSs that might be of interest tend to be quite expensive as well (EMC's Isilon & MPFS, HP's Ibrix, NetApp, etc).

 

> - ability to run on commodity hardware, preferably be designed for it;

 

Fhgfs scores very high - no requirements for any particular hardware at all. We ran it on a combination of hardware RAID, software RAID, new hardware, old hardware - all of it ran very well, tho somewhat constrained by obvious bottlenecks. And it's very distro-neutral: RHEL/CentOS, Debian/Ubuntu - we've run it on all these distros (mixed), and it's the most forgiving DFS that we've tried. I've posted previously that it even runs when it seems like there's no way that it could (despite bad IP configs, running over ethernet despite never being configured for it, etc).

 

> - provide high availability of the data with the most focus on reads;

 

It's robust (see above), but it is not yet 'High Availability' - that has not been designed into it AFAIK, tho I guess you could design your own HA system around it. If HA is a critical part of your requirements, it might be better to use another system, tho the Fhgfs devs can comment.

 

> - high scalability, so operation over multiple data centres, possibly global;

 

Fhgfs has very high scalability locally, but does not support multiple data centers with multiple filesystems or geo-replication the way Gluster does. You could fake it with rsync, but it's not a built-in feature.
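
To illustrate faking it with rsync: a minimal sketch (Python wrapper around standard rsync flags; the hostnames and paths are hypothetical) that a cron job or systemd timer on the secondary site could run on a schedule:

```python
import subprocess

def build_rsync_cmd(src, dest_host, dest_path):
    # -a: archive mode (preserve perms, times, symlinks)
    # -z: compress over the WAN link
    # --delete: propagate removals so the replica tracks the primary
    return ["rsync", "-az", "--delete", src, f"{dest_host}:{dest_path}"]

def replicate(src, dest_host, dest_path):
    # One replication pass; schedule this from cron or a systemd timer.
    subprocess.run(build_rsync_cmd(src, dest_host, dest_path), check=True)

# e.g. replicate("/mnt/fhgfs/data/", "dc2-gw", "/mnt/fhgfs/data/")
```

Keep in mind this is asynchronous point-in-time copying, not real replication - anything written between passes is only as safe as the last completed pass.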

 

> - removal of single points of failure with the use of replication and distribution of (meta-)data.

 

See above. There /are/ single points of failure with Fhgfs. You can have multiple metadata servers, but they will handle different parts of the filesystem, not act as HA backups of each other. You can mirror directories to get effectively pseudo-replication, and you could even run it across different data centers over Long Haul InfiniBand using the Mellanox MetroX switches, but it's not meant as an HA system.

 

> The sensitivity points that were identified, and resulted in the following questions, are:

 

> 1) transparency to the processing layer / application with respect to data locality, e.g. know where data is physically located on a server level, mainly for resource allocation and fast processing, high performance, how can this be accomplished using FhGFS?

 

Unlike Gluster, where files are stored intact (unless enormous or explicitly striped) - so an entire file lives on one brick, which can be identified, usually when a server's load goes very high because an IO hotspot develops around that file - on Fhgfs almost all files (except very small ones) are striped across all storage servers, so we have never observed such hotspots with Fhgfs (one reason we're transitioning our Gluster FS to Fhgfs).
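
The hotspot avoidance comes from chunk-level round-robin placement. A toy sketch of the idea (NOT Fhgfs's actual placement code; the 512 KiB chunk size is just a common default, and both chunk size and target count are configurable):

```python
import math

def chunk_placement(file_size, chunk_size, targets):
    """Assign each chunk of a file to a storage target, round-robin.

    Because consecutive chunks land on different targets, reads of a
    hot file are spread across all servers instead of hammering one.
    """
    num_chunks = math.ceil(file_size / chunk_size)
    return [(i, targets[i % len(targets)]) for i in range(num_chunks)]

# A 1 MiB file with 512 KiB chunks over three targets spans two of them:
# chunk_placement(1024 * 1024, 512 * 1024, ["t1", "t2", "t3"])
#   -> [(0, "t1"), (1, "t2")]
```

With Gluster's whole-file placement, by contrast, every chunk of that file would map to the same brick, which is exactly how a popular file turns one server into a hotspot.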

Fhgfs does have a neat GUI to observe IO and load on individual metadata and storage servers so you can tell where the load is (and it's almost always balanced in Fhgfs).

 

 

> 2) posix compliance, or conformance: hadoop for example isn't posix compliant by design, what are the pro's and con's? What is FhGFSs approach with respect to support for posix operations?

 

Fhgfs is POSIX-compliant. We recently had a problem with distributed locks, but that was resolved with a config option and seems to be working. Otherwise it works like a local filesystem. It is also quite good at communicating among servers - it creates dirs much faster and especially handles recursive ops MUCH better than Gluster.
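
A quick way to sanity-check that distributed locking behaves like a local filesystem: run a small script like this (a sketch using Python's standard fcntl module, nothing Fhgfs-specific) from two different client nodes against the same file on the mount:

```python
import fcntl
import os

def try_exclusive_lock(path):
    """Try to take a non-blocking POSIX (fcntl-style) advisory lock.

    Returns an open fd if the lock was acquired (close it to release),
    or None if another process - possibly on another node - holds it.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd
    except OSError:
        os.close(fd)
        return None
```

If the second node gets None while the first still holds its fd, the distributed locks are doing their job.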

 

> 3) mainly with respect to evaluating the production readiness of FhGFS, where is it currently used in production environments and for what specific usecases it seems most suitable? Are there any known issues / common pitfalls and workarounds available?

 

The main place that it seems to be used is in installations like ours - as part of HPC clusters where performance requirements are quite high and HA is not an issue. We treat our FS as a large scratch space and remind our users of that frequently. We are not required to maintain coherence thru famine, floods, war, earthquakes, etc, tho we've done so thus far. If guaranteed availability is what you're after, maybe Fhgfs is not what you want.

 


 

 

> 4) finally what would be the most compelling reason to go for FhGFS and not for the alternatives?

 

Cheap, very fast, very scalable locally, very easy to install and check up on (as long as you don't rely on the admon GUI for re-installation or changing the config - the CLI is your friend). It is very resilient to problems and runs without issue over RDMA. It has very good recursive and small-file performance.

 

It does not qualify on the HA front tho. It sounds like you might be looking for something more like GPFS, which by its docs and presentations can provide all of what you're looking for, but at an enormously higher price. I don't think any other DFS (OSS or not) can provide everything you're looking for.

 

It's like the saying: Speed, reliability, price. Choose 2.

 

hjm

 

 

---

Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine

[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487

415 South Circle View Dr, Irvine, CA, 92697 [shipping]

MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)

---

 
