I've read through some of the archives but there is less data there
than I had hoped for.
What are the major issues anyone has seen during a large deployment?
It looks like node failure is handled by replication at the file (and
object) level. It looks like Metadata DB failure is handled by the
election mechanism.
On disk (or node) failure, it appears that there is no natural
recovery to policy level unless you attempt to read the file (perhaps
stat) in question. Is that correct?
I'm looking for information.
Thanks,
Justin
On Fri, Dec 18, 2009 at 11:05 PM, Justin Stottlemyer
<justin.h.s...@gmail.com> wrote:
> I'm considering XTREEMFS for an enterprise style deployment. Who
> currently has the largest deployment?
>
> I've read through some of the archives but there is less data there
> than I had hoped for.
>
> What are the major issues anyone has seen during a large deployment?
I don't know any XtreemFS deployment of this size, but here are a few
guidelines. XtreemFS shouldn't have problems handling the data volume
or throughput in a deployment of this size, but depending on the
nature of the data, I can imagine that you might run into two other
issues:
* If you have many files that need to be in one volume, you might
hit limits in the metadata server (MRC), as it does no sharding for a
single namespace. However, you can always put your data in multiple
volumes, which can be handled by separate metadata servers.
* If you have many storage servers (OSDs), because you need high
throughput, you might find managing that many OSDs tedious, because
XtreemFS has no system-level monitoring yet nor does it integrate with
other monitoring systems at this point.
So generally I'd say, if you can spend some developer time and need
some customization anyway (replication policies, etc.), XtreemFS is a
good choice as its servers (and one of the client libraries) are
written in Java and we have tried to keep things light-weight (avoided
integrating large third-party stacks). You might need to integrate
some monitoring or integrate system management tools.
You should consider yourself an early adopter (in Moore's terms), but
I'm sure the XtreemFS developers will actively support you.
> It looks like node failure is handled by replication at the file (and
> object) level. It looks like Metadata DB failure is handled by the
> election mechanism.
AFAIK MRC replication is not enabled in the current release, but
should be out soon. Björn?
> On disk (or node) failure, it appears that there is no natural
> recovery to policy level unless you attempt to read the file (perhaps
> stat) in question. Is that correct?
Conceptually, yes. However, there is (or used to be?) a parallel
"scrubber" that systematically triggered multiple OSDs in parallel to
read objects and verify their checksums at scale. "Used to be" because
it might be that the implementation fell behind, and it might not be
available in version 1.2. Björn should be able to give more details?
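
To make the scrubbing idea concrete, here is a minimal sketch in Python.
It is not the XtreemFS scrubber: the in-memory "OSDs", the object ids,
the function names, and the use of SHA-256 are all assumptions made for
illustration. It walks several stores in parallel, recomputes each
object's checksum, and reports the replicas that do not match the
expected value. It only reports; it does not repair anything.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def checksum(data: bytes) -> str:
    """Checksum used for verification (SHA-256 is an assumption here)."""
    return hashlib.sha256(data).hexdigest()


def scrub_osd(osd_name, objects, expected):
    """Read every object on one (hypothetical) OSD and verify its checksum.

    Returns a list of (osd_name, object_id) pairs whose checksum is wrong
    or for which no expected checksum is known.
    """
    bad = []
    for object_id, data in objects.items():
        if expected.get(object_id) != checksum(data):
            bad.append((osd_name, object_id))
    return bad


def scrub(osds, expected, max_workers=4):
    """Scrub all OSDs in parallel and return the damaged replicas, sorted."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(scrub_osd, name, objects, expected)
            for name, objects in osds.items()
        ]
        damaged = []
        for future in futures:
            damaged.extend(future.result())
    return sorted(damaged)


# Two hypothetical OSDs holding replicas of the same two objects.
good = {"f1:0": b"hello", "f1:1": b"world"}
expected = {object_id: checksum(data) for object_id, data in good.items()}
osds = {
    "osd-a": dict(good),
    "osd-b": {"f1:0": b"hello", "f1:1": b"w0rld"},  # one corrupted replica
}

damaged = scrub(osds, expected)
print(damaged)
```

In a real deployment the per-OSD work would be network reads rather than
dictionary lookups, which is where running the checks in parallel pays off.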
... Felix
http://code.google.com/p/xtreemfs/issues/detail?id=75
On Dec 19, 6:21 pm, Felix Hupfeld <fhupf...@googlemail.com> wrote:
> Hi Justin,
>
> On Fri, Dec 18, 2009 at 11:05 PM, Justin Stottlemyer
>
> should be out soon. Björn?
We'll start with the DIR replication in the next release to make sure it
is well tested before we use the same code for the MRC replication. As
the implementation is already in the trunk, it should be no problem to
experiment with the replication before we put it in a release.
>
>> On disk (or node) failure, it appears that there is no natural
>> recovery to policy level unless you attempt to read the file (perhaps
>> stat) in question. Is that correct?
>
> Conceptually, yes. However, there is (or used to be?) a parallel
> "scrubber" that systematically triggered multiple OSDs in parallel to
> read objects and verify their checksums at scale. "Used to be" because
> it might be that the implementation fell behind, and it might not be
> available in version 1.2. Björn should be able to give more details?
The scrubber is still working and included in the releases but it does
not repair replicas yet. We'll include that asap. We also plan to
implement a monitoring infrastructure and add services that
automatically replace replicas when OSDs fail.
Björn
That's quite cool