However, I'm still quite confused and haven't read the manual fully
because I'm tripping on this: what exactly happens if a piece of
hardware fails?
Perhaps it's because I haven't yet tried to set up Lustre, so the terms
used don't quite translate for me yet. So I'd appreciate some newbie
hand-holding here :)
For example, say I have a simple 5-machine cluster: one MDS/MDT and one
failover MDS/MDT, plus three OSS/OST machines with 4 drives each, for 2
sets of MD RAID 1 block devices per machine and so a total of 6 OSTs,
if I didn't understand the term wrongly.
What happens if one of the OSS/OST dies, say motherboard failure?
Because the manual mentions data striping across multiple OST, it
sounds like either networked RAID 0 or RAID 5.
In the case of network RAID 0, a single machine failure means the
whole cluster is dead. It doesn't seem to make sense for Lustre to
fail in this manner. Whereas if Lustre implements network RAID 5, the
cluster would continue to serve all data despite the dead machine.
Yet the manual warns that Lustre does not have redundancy and relies
entirely on some kind of hardware RAID being used. So it seems to
imply that the network RAID 0 is what's implemented.
This appears to be the case given the example in the manual of a
simple combined MGS/MDT with two OSS/OST which uses the same fsname
"temp" for the OSTs, which then combines the two 16MB OST into a
single 30MB block device mounted as /lustre on the client.
Does this then mean that if I want redundancy on the storage, I would
basically need to have a failover machine for every OSS/OST?
I'm also confused because the manual says an OST is a block device
such as /dev/sda1 but OSS can be configured to provide failover
services. But if the OSS machine which houses the OST dies, how would
another OSS take over anyway since it would not be able to access the
other set of data?
Or does that mean this functionality is only available if the OST in
the cluster are standalone SAN devices?
_______________________________________________
Lustre-discuss mailing list
Lustre-...@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
On 6/26/2010 2:13 PM, Emmanuel Noobadmin wrote:
> I'm looking at using Lustre to implement a centralized storage for
> several virtualized machines. The key consideration being reliability
> and ease of increasing/replacing capacity.
Increasing capacity is easy, replacing it will take some practice and
careful reading of the manual and the mailing list archives.
> However, I'm still quite confused and haven't read the manual fully
> because I'm tripping on this: what exactly happens if a piece of
> hardware fails?
> Perhaps it's because I haven't yet tried to setup Lustre so the terms
> used don't quite translate for me yet. So I'll appreciate some newbie
> hand holding here :)
>
> For example, if I have a simple 5 machine cluster, one MDS/MDT and one
> failover MDS/MDT. Three OSS/OST machines with 4 drives each, for 2
> sets of MD Raid 1 block devices and so total of 6 OST if I didn't
> understand the term wrongly.
Think you understood it correctly there.
> What happens if one of the OSS/OST dies, say motherboard failure?
> Because the manual mentions data striping across multiple OST, it
> sounds like either networked RAID 0 or RAID 5.
Networked RAID 0 is the closest analogy.
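To make the analogy concrete, here is a minimal sketch of how striping is controlled from a client, assuming a filesystem mounted at /lustre (the path and file names are illustrative, and option spellings vary slightly between Lustre versions):

```shell
# Stripe new files in this directory across all available OSTs,
# 1 MiB per stripe -- the "chunk size" of the RAID 0 analogy.
# (-c -1 means "use every OST"; older Lustre versions spell the
# stripe-size option -s rather than -S.)
lfs setstripe -c -1 -S 1M /lustre/stripedir

# Show which OSTs actually hold the objects of a given file:
lfs getstripe /lustre/stripedir/somefile
```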
> In the case of network RAID 0, a single machine failure means the
> whole cluster is dead. It doesn't seem to make sense for Lustre to
> fail in this manner. Whereas if Lustre implements network RAID 5, the
> cluster would continue to serve all data despite the dead machine.
This is why the manual points out that it's important to have reliable
hardware on the back-end. I would strongly suggest a SAN/NAS solution
or at least a well-tested and executed backup strategy.
> Yet the manual warns that Lustre does not have redundancy and relies
> entirely on some kind of hardware RAID being used. So it seems to
> imply that the network RAID 0 is what's implemented.
>
Yup.
> This appears to be the case given the example in the manual of a
> simple combined MGS/MDT with two OSS/OST which uses the same fsname
> "temp" for the OSTs, which then combines the two 16MB OST into a
> single 30MB block device mounted as /lustre on the client.
>
> Does this then mean that if I want redundancy on the storage, I would
> basically need to have a failover machine for every OSS/OST?
>
Correct. However, if you are using a 5-node cluster (2 MGS/MDS and 3
OSS), then the 3 OSS servers could be configured to back each other up
in the event of a failure, assuming you were using a SAN/NAS solution
for the storage. If not, then I would recommend extra drives in each
machine that a backup of the failed OST could be restored to.
> I'm also confused because the manual says an OST is a block device
> such as /dev/sda1 but OSS can be configured to provide failover
> services. But if the OSS machine which houses the OST dies, how would
> another OSS take over anyway since it would not be able to access the
> other set of data?
>
> Or does that mean this functionality is only available if the OST in
> the cluster are standalone SAN devices?
This would be the most advisable hardware configuration from my
experience. If, on the other hand, you have spare hardware for the
production servers (such as replacement mobo, drives, etc.), then you
can be fairly safe as long as you ensure that you have a proper RAID
configuration on your Lustre partitions. You will experience downtime
while you replace failed core components (mobo, proc, RAM, etc.), but
if it's just a RAID member HD, then Lustre can keep on truckin'.
Downtime should only be as long as it takes to replace the part. We
make it a point to always have a hot spare of any core production
machine that we have in the rack. So if you only have 5 machines to
work with (and no NAS/SAN), I would suggest moving to a 4-node Lustre
environment and keeping the 5th server as a hot spare.
Good Luck!
-Billy Olson
What happens depends on which piece of hardware fails. If it's an OSS
configured for failover, the backup OSS takes over serving the OSTs.
Ditto for an MDS.
If it's a disk in a RAID LUN, well, you replace the disk and let RAID
rebuild the LUN.
> For example, if I have a simple 5 machine cluster, one MDS/MDT and one
> failover MDS/MDT.
We should get you started out correctly with nomenclature and concepts.
For any given filesystem there can be only one MDT. The MDT is the
actual device/disk and associated processes that store and serve the
metadata. You can have one or more MDSes configured to provide service
for it. Of course, if you have more than one, then somehow, usually
through shared storage, all of those machines must be able to see the
MDT (the disk).
An MDS is a physical machine that hosts (can provide) MDT services. You
can only have one active MDS at a time -- that is, only one MDS can have
the MDT mounted at a time. This is paramount. No more than one machine
can mount the MDT at a time.
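As a sketch of what that looks like in practice (the fsname, NIDs and device paths below are made-up placeholders, and exact mkfs.lustre flags vary between Lustre versions):

```shell
# Format a combined MGS/MDT once, registering the second MDS as a
# failover server (mds2@tcp0 is a hypothetical NID):
mkfs.lustre --fsname=temp --mgs --mdt --index=0 \
    --failnode=mds2@tcp0 /dev/sdb

# Only one MDS mounts the MDT at any given time:
mount -t lustre /dev/sdb /mnt/mdt    # run on mds1, the active MDS

# If mds1 dies, mds2 -- which sees the same shared disk -- mounts the
# MDT and takes over; the two must never mount it simultaneously.
```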
> Three OSS/OST machines
They are usually just called OSSes.
> with 4 drives each, for 2
> sets of MD Raid 1 block devices and so total of 6 OST if I didn't
> understand the term wrongly.
>
> What happens if one of the OSS/OST dies, say motherboard failure?
In order to survive such a failure, the OST must be visible to another
OSS, which can then mount it and provide service for it.
> Because the manual mentions data striping across multiple OST, it
> sounds like either networked RAID 0 or RAID 5.
Lustre does not provide any form of data redundancy and expects the
storage below it to provide that, so yes, if you value your data, you
put your OSTs on RAID disk.
> In the case of network RAID 0, a single machine failure means the
> whole cluster is dead.
No. Even if you didn't configure failover (so that another machine can
provide service for the OST(s)), the filesystem is still available for
access to any data that is not on the OSTs of a failed,
non-failover-configured OSS. Any access to data on the failed OSS's
OSTs will either just block (i.e. hang) the client's request until the
OSS is brought back into service, or I/O to failed OSTs can return an
EIO to the client. That is configurable by the administrator.
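One way that choice shows up in practice is deactivating the client-side OSC for a dead OST (the target name temp-OST0002 is a hypothetical example, and the exact parameter paths differ between Lustre versions):

```shell
# Suppose OST0002's OSS has died and will be down for a while.
# Deactivating the matching OSC on a client makes I/O against that
# OST fail fast with EIO instead of blocking until recovery:
lctl set_param osc.temp-OST0002-osc-*.active=0

# When the OSS is back in service, re-activate it:
lctl set_param osc.temp-OST0002-osc-*.active=1
```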
> It doesn't seem to make sense for Lustre to
> fail in this manner. Whereas if Lustre implements network RAID 5, the
> cluster would continue to serve all data despite the dead machine.
I think you are missing the point of failover (with shared disk). A
failure of an OSS is survivable in that case.
> Yet the manual warns that Lustre does not have redundancy and relies
> entirely on some kind of hardware RAID being used. So it seems to
> imply that the network RAID 0 is what's implemented.
No. Lustre provides no RAID at all.
> Does this then mean that if I want redundancy on the storage, I would
> basically need to have a failover machine for every OSS/OST?
Yes. Typically people configure active/active failover for OSTs. That
is, if they have enough disk for 12 OSTs, they configure two OSSes and
put 6 OSTs on each, with each OST also configured so that the other OSS
can provide service for it. So normally each OSS actively provides
service for 6 OSTs, but if one of the OSSes fails, the survivor takes
over and provides service for all 12 OSTs.
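A rough sketch of such an active/active pair, assuming shared storage that both OSSes can see (NIDs, indices and device paths are placeholders):

```shell
# OSTs normally served by oss1 name oss2 as their failover node:
mkfs.lustre --fsname=temp --ost --index=0 \
    --mgsnode=mds1@tcp0 --failnode=oss2@tcp0 /dev/mapper/ost0

# ...and OSTs normally served by oss2 name oss1:
mkfs.lustre --fsname=temp --ost --index=6 \
    --mgsnode=mds1@tcp0 --failnode=oss1@tcp0 /dev/mapper/ost6

# In normal operation oss1 mounts ost0..ost5 and oss2 mounts
# ost6..ost11; when oss1 fails, an HA framework (e.g.
# Heartbeat/Pacemaker) mounts ost0..ost5 on oss2, which then
# temporarily serves all 12 OSTs.
```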
> I'm also confused because the manual says an OST is a block device
> such as /dev/sda1 but OSS can be configured to provide failover
> services. But if the OSS machine which houses the OST dies, how would
> another OSS take over anyway since it would not be able to access the
> other set of data?
You need to be using some sort of shared storage where two computers can
both see the same disk. This is typically achieved with FC SCSI type
configurations; however, it can be done at the lower end with FireWire
(which supports shared access, to the extent that various hardware and
software implementations allow). Others here are also using DRBD, but we
(Oracle) don't really have any experience with the robustness of such a
solution, so you will need to test it for yourself to your level of
satisfaction.
> Or does that mean this functionality is only available if the OST in
> the cluster are standalone SAN devices?
Well, not so much actual SAN devices -- which, IIUC, usually implies a
filesystem service rather than a block device -- but yes, you are
typically referring to disks that are physically outside of the OSSes
and connected via some sharable medium such as FC SCSI or InfiniBand,
etc.
b.
Thanks for the clarification, it really helped my understanding :)
Unfortunately, after all the much appreciated responses, it seems that
Lustre is not the solution I'm looking for. I was hoping to use it as
an easily expandable storage cluster with the equivalent of network
RAID 5 across 3 machines with RAID 1 physical disks.
This storage cluster/SAN would then hold VM images for several VM
servers. This way, I thought it would make recovery of any machine
easy, I just have to mount the network storage on a
working/replacement server and boot up the VMs originally hosted on a
failed server.
Somebody else pointed out that I might be looking for OpenFiler instead.
That's such a cliche, and Lustre is very suitable for it if you
don't mind network latency :-), or if you use a very low latency
fabric.
In general I am surprised (or perhaps not :->) by how many
"clever" people choose to provide resource virtualization and
parallelization at the lower levels of abstraction (e.g. block
device) and not at the higher ones (service protocol), thus
enjoying all the "benefits" of centralization. But then probably
they don't care about availability, and in particular about latency
(and sometimes not even about throughput).
> The key consideration being reliability
Data "reliability" is not a Lustre concern as such. Eventually
Lustre on ZFS will gain what ZFS offersq about it. Also Lustre 2.x
will have object level redundancy (sort of like RAID1), somewhat
compromising the purity of its design.
For overall service availability, choose your Lustre version and
patches carefully; that is, do extensive integration testing
before production use.
Lots of sites have reported spending a few months figuring out the
combination of firmware, OS, and Lustre versions that actually
works well together. Lustre setups tend to be demanding and to
exercise corner cases that less ambitious systems don't reach.
> and ease of increasing/replacing capacity.
Add more OSSes with more OSTs. Some sites have hundreds or
thousands. At the same time, avoid having too few MDSes, which
means running more than one Lustre instance (which can be done in
some cool ways, as nothing prevents a node from being a frontend
for more than one instance).
Note that, as in many other cases, in your specific application
there is no significant benefit from having a single storage pool
(Lustre instance).
> However, I'm still quite confused and haven't read the manual
> fully because I'm tripping on this: what exactly happens if a
> piece of hardware fails?
The manual, the Wiki, a number of papers and presentations and
this mailing list have extensive discussions of various schemes.
Keep in mind that Lustre is fundamentally aimed at being an HPC
filesystem, not an HA one. That is, the primary use of multiple
hardware resources is parallelism, not redundancy.
> For example, if I have a simple 5 machine cluster, one
> MDS/MDT and one failover MDS/MDT. Three OSS/OST machines with 4
> drives each, for 2 sets of MD Raid 1 block devices [ ... ]
That's somewhat unusual, as this leaves parallelization entirely
up to the Lustre layer striping, which perhaps is not wise. It is
surely wiser than using parity RAID for OSTs (which is what the
Lustre docs suggest for data, while RAID10 is recommended for
metadata).
> What happens if one of the OSS/OST dies, say motherboard
> failure? Because the manual mentions data striping across
> multiple OST, it sounds like either networked RAID 0 or RAID 5.
Sort of like RAID0 but at the object (file or file section) level
instead of block level.
> In the case of network RAID 0, a single machine failure means the
> whole cluster is dead. It doesn't seem to make sense for Lustre to
> fail in this manner. [ ... ]
Perhaps it does make sense to others. :-)
> Yet the manual warns that Lustre does not have redundancy and
> relies entirely on some kind of hardware RAID being used. So it
> seems to imply that the network RAID 0 is what's implemented.
The manual is pretty clear on that.
> Does this then mean that if I want redundancy on the storage,
> I would basically need to have a failover machine for every
> OSS/OST?
Depending on how much redundancy you want to achieve, you may
need both failover machines and failover drives.
> I'm also confused because the manual says an OST is a block
> device such as /dev/sda1 but OSS can be configured to provide
> failover services. [ ... ] Or does that mean this
> functionality is only available if the OST in the cluster are
> standalone SAN devices?
Whichever storage device can be shared across multiple servers
in a hot/warm setup.
There are detailed discussions of frontend server failover
(various HA schemes) and storage backend replication (DRBD, for
example) setups in the Lustre Wiki and several papers.
Am I correct to understand that you mean the approach I am considering
is stupid, then? Which wouldn't be too surprising, since I'm a newbie
at this, so I'll appreciate any pointers in the right direction :)
What do you mean by higher levels of abstraction and benefits of
centralization? Would it be correct to understand that to mean instead
of trying to provide redundant storage, I should be looking at
providing several servers that would simply fail over to each other?
e.g.
S1 (VM1, VM2, VM3) failover to S2
S2 (VM4, VM5, VM6) failover to S3
S3 (VM7, VM8, VM9) failover to S1
But it is that; it is just a particular type of storage cluster
system with a specific performance profile.
As to "expandable", consider again whether your requirements
involve a single storage pool or you can do multiple instances.
> with the equivalent of network RAID 5 across 3 machines with
> RAID 1 physical disks.
That seems to me quite a peculiar setup with some strong
performance anisotropy and it is difficult for me to imagine the
requirements driving that.
> This storage cluster/SAN would then hold VM images for several
> VM servers.
The images can be relatively small things. What about the
storage for those VMs? Virtual disks (more images) or do you
mount the filesystems from a NAS server (e.g. Lustre) while the
VM is booting?
> This way, I thought it would make recovery of any machine
> easy, I just have to mount the network storage on a
> working/replacement server and boot up the VMs originally
> hosted on a failed server.
Ah, that's an interesting point, as you have implicitly stated
some availability requirements and expected failure modes.
Apparently you don't need continuous VM availability, and
recovery can be manual and take some time. Also, you think that
loss of a compute server is more likely or easier to recover
from than loss of a storage server or a storage device (even if
you want to provide two levels of redundancy). You also seem to
imply that network latency and bandwidth are not a big issue for
VM performance.
> Somebody else pointed out that I might be looking for
> OpenFiler instead.
Or perhaps GlusterFS. Or perhaps check your requirements again
and simplify your design a bit.
The ideal application for Lustre is massively parallel
(many-to-many) I/O of large sequentially accessed datasets, and
down from there. Scalability is bought at the price of network
latency and traffic (in this it is a smaller-scale version of
GoogleFS, where the tradeoff is even more extreme) and careful
design of the underlying storage layer (in this GoogleFS is the
opposite). It can also handle less demanding workloads decently.
>> That's such a cliche, and Lustre is very suitable for it if
>> you don't mind network latency :-), or if you use a very low
>> latency fabric.
>> [ ... ] choose to provide resource virtualization and
>> parallelization at the lower levels of abstraction (e.g. block
>> device) and not at the higher ones (service protocol), thus
>> enjoying all the "benefits" of centralization. But then
>> probably they don't care about availability and in particular
>> about latency (and sometimes not even about throughput).
> Am I correct to understand that you mean the approach I am
> considering is stupid then?
Not quite; there are some legitimate applications in which
availability, latency or throughput matter less than other
goals, and then low-level virtualization and parallelization
are acceptable design choices.
But it is difficult for me to imagine the requirements that
justify a choice of network RAID5 on RAID1 arrays.
> [ ... ] pointers in the right direction :)
It depends on requirements, and what is the priority, and the
budget for the hardware layer.
> What do you mean by higher levels of abstraction and benefits
> of centralization?
Well, consider the case of something like a data repository,
e.g. RDBMS tablespaces or a local mail store.
The alternative could be between virtualizing and sharing the
disks using a low-level block-oriented protocol (e.g. GFS/GFS2),
or having two redundant RDBMS or mail storage systems, each with
its own local storage and application-specific sync -- that is,
whether to virtualize the storage used by the service, or the
service itself. I think the latter is preferable in most cases.
Another popular choice is to have a central SAN server, a
central NAS (NFS, Lustre, ...) server using it, and a central
compute or timesharing server mounting the latter, instead of
three computers each with local storage and filesystem and each
serving a third of the load.
Network latency and throughput limitations usually matter more
than realtime continuous sharing and availability, and unless
one wants to invest in HPC-style fabrics, network latency and
throughput issues are best avoided: local access at low levels
of abstraction/virtualization is vastly preferable.
Note: there are some people who do need massive shared systems
with very high continuous realtime sharing and availability
requirements, and there are very expensive and difficult ways
to address those requirements properly.
> Would it be correct to understand that to mean instead of
> trying to provide redundant storage, I should be looking at
> providing several servers that would simply fail over to each
> other? e.g.
> S1 (VM1, VM2, VM3) failover to S2
> S2 (VM4, VM5, VM6) failover to S3
> S3 (VM7, VM8, VM9) failover to S1
I presume that this means that S1 is running VM1, VM2, VM3 from
local disks.
This might be a good alternative, and you could be using DRBD to
mirror the images in realtime across machines. The advantage
would be a lot less network latency (with the "main" image being
on local storage and only writes, queued, going over the
network) and less network traffic (all reads being local).
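For illustration, a DRBD resource mirroring a VM-image volume between two hosts might look roughly like this (DRBD 8.x syntax; the hostnames, devices and addresses are placeholders):

```shell
# Write the resource definition (both hosts get the same file):
cat > /etc/drbd.d/vmimages.res <<'EOF'
resource vmimages {
  protocol C;              # synchronous: writes complete on both nodes
  device    /dev/drbd0;    # replicated block device the VMs use
  disk      /dev/sdb1;     # local backing disk on each host
  meta-disk internal;
  on s1 { address 192.168.0.1:7789; }
  on s2 { address 192.168.0.2:7789; }
}
EOF

# Initialize metadata and bring the resource up (on both hosts):
drbdadm create-md vmimages
drbdadm up vmimages
```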
Another issue is whether you have different requirements for the
VM images (e.g. the '/' filesystem) and/or the filesystems they
access (e.g. '/home' or '/var/www'), and whether the latter
should be shared across two or more VMs. In which case a network
filesystem could be handy, and Lustre is a good choice even if
one does not need its massively parallel (many-to-many) streaming
performance.
Note: for VMs, regrettably, block-level virtualization over the
network might be better than mounting filesystems over the
network, because in the former case the network traffic is
done by the real system, in the latter by the virtual system,
and many VM implementations don't do network traffic that well.