[Lustre-discuss] SAN, shared storage, iscsi using lustre?


Alex

Aug 12, 2008, 12:20:29 PM
to lustre-...@lists.lustre.org
Hello experts,

I read that software RAID on Linux is not cluster aware, so I tried to find
a solution to join several computers together to form a shared file system and
build a SAN (correct me if I am wrong), avoiding software RAID... but
then I discovered Lustre, which seems to be exactly what I need.

It is well supported on Linux (CentOS 5 / RHEL 5) and has support for
RAID/LVM/iSCSI (as I read in the FAQ), scales well and is easy to extend.

My problem comes below:

Let's say that I have:
- N computers (N>8) sharing their volumes (volX, where X = 1..N). Each volX is
around 120GB.
- M servers (M>3) - which are accessing a clustered shared storage volume
(read/write)
- Other regular computers which are available if required.

Now, I want:
- to build somehow a cluster file system on top of vol1, vol2, ... volN
volumes with high data availability and without a single point of failure.
- the resulting logical volume to be used on SERVER1, SERVER2 and SERVER3
(read/write access at the same time)

Questions:

- Using Lustre, can I join all volX (exported via iSCSI) together into one
bigger volume (using RAID/LVM) and have fault-tolerant SHARED STORAGE
(failure of a single drive (volX) or server (computerX) should not bring down
the storage)?

- I have one doubt regarding Lustre: I saw that it uses ext3 underneath, which
is a LOCAL FILE SYSTEM not suitable for SHARED STORAGE (different
computers accessing the same volume and writing to it at the same time).

- So, using Lustre's patched kernels and tools, does ext3 become suitable for
SHARED STORAGE?

Regards,
Alx
_______________________________________________
Lustre-discuss mailing list
Lustre-...@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss

Brian J. Murrell

Aug 12, 2008, 12:45:44 PM
to lustre-...@lists.lustre.org
On Tue, 2008-08-12 at 19:20 +0300, Alex wrote:
>
> My problem comes below:
>
> Let's say that I have:
> - N computers (N>8) sharing their volumes (volX, where X = 1..N). Each volX is
> around 120GB.

What exactly do you mean "sharing their volumes"?

> - M servers (M>3) - which are accessing a clustered shared storage volume
> (read/write)

Where/what is this clustered shared storage volume these servers are
accessing?

> Now, I want:
> - to build somehow a cluster file system on top of vol1, vol2, ... volN
> volumes

Do you mean "disk" or "partition" when you say "volumes" here and are
these disks/partitions in the "N computers" you refer to above?

> - the resulting logical volume to be used on SERVER1, SERVER2 and SERVER3
> (read/write access at the same time)

Hrm. This is all quite confusing, probably because you are not yet
understanding the Lustre architecture. To try to map what you are
describing to Lustre, I'd say your "N computers" are an MDS and OSSes
and their 120GB "volumes" are an MDT and OSTs (respectively) and your "M
servers" are Lustre clients.

> - Using Lustre, can I join all volX (exported via iSCSI) together into one
> bigger volume (using RAID/LVM) and have fault-tolerant SHARED STORAGE
> (failure of a single drive (volX) or server (computerX) should not bring down
> the storage)?

I don't think this computes within the Lustre architecture. You
probably need to review what Lustre does and how it works again.

> - I have one doubt regarding Lustre: I saw that it uses ext3 underneath, which
> is a LOCAL FILE SYSTEM not suitable for SHARED STORAGE (different
> computers accessing the same volume and writing to it at the same time).

This is moot. Lustre manages the ext3 filesystem as its backing store
and provides shared access.

> - So, using Lustre's patched kernels and tools, does ext3 become suitable for
> SHARED STORAGE?

You probably just need to ignore that Lustre uses ext3 "under the hood"
and trust that Lustre deals with it appropriately.

b.


Alex

Aug 12, 2008, 4:00:49 PM
to lustre-...@lists.lustre.org
On Tuesday 12 August 2008 19:45, Brian J. Murrell wrote:
> On Tue, 2008-08-12 at 19:20 +0300, Alex wrote:
> > My problem comes below:
> >
> > Let's say that I have:
> > - N computers (N>8) sharing their volumes (volX, where X = 1..N). Each volX is
> > around 120GB.
>
> What exactly do you mean "sharing their volumes"?

I mean that I'm exporting a block device via iSCSI (it could be an entire hard
disk or just a slice). In this case it is an entire hard disk, 120GB large,
named volX for simplicity (vol1, vol2, ... vol8), because I have 8 computers
doing this.

>
> > - M servers (M>3) - which are accessing a clustered shared storage volume
> > (read/write)
>
> Where/what is this clustered share storage volume these servers are
> accessing?

For example, it could be a GFS shared volume over volX. Here with Lustre I
don't know ... You tell me ...

>
> > Now, I want:
> > - to build somehow a cluster file system on top of vol1, vol2, ... volN
> > volumes
>
> Do you mean "disk" or "partition" when you say "volumes" here and are
> these disks/partitions in the "N computers" you refer to above?

Yes. There are 8 disks, one exported via iSCSI by each computer in our test
lab. Call it, informally, a SAN. It does not matter whether they are disks or
partitions. They are block devices which can be accessed by each of our
SERVERS (SERVER1, SERVER2 and SERVER3) using iSCSI, and mounted locally as
/dev/sda, /dev/sdb, ... /dev/sdh! I tried GFS before posting here, but because
Red Hat does not support RAID in a GFS cluster, I could not set up failover.
For example, I can't group /dev/sda and /dev/sdb into /dev/md0, and so on up
to /dev/md3, and after that use LVM to unify md0+md1+md2+md3 into one logical
volume (mylv) on TOP of which a clustered file system (GFS) would run!

I can't run mkfs.gfs ... /dev/myvg/mylv and mount -t
gfs /dev/myvg/mylv /var/shared_data on ALL our SERVERS!
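Concretely, the (unsupported) stack I tried to build looked roughly like this; the device names are the ones above, and the cluster name is just a placeholder:

```shell
# On each of SERVER1..3: group the imported iSCSI disks into RAID1 pairs
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda /dev/sdb
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc /dev/sdd
mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sde /dev/sdf
mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sdg /dev/sdh

# Unify md0..md3 into one logical volume with LVM
pvcreate /dev/md0 /dev/md1 /dev/md2 /dev/md3
vgcreate myvg /dev/md0 /dev/md1 /dev/md2 /dev/md3
lvcreate -l 100%FREE -n mylv myvg

# Put GFS on top and mount it on ALL the servers -- this is the part
# Red Hat does not support, because md/LVM state is not cluster aware
gfs_mkfs -p lock_dlm -t mycluster:mygfs -j 3 /dev/myvg/mylv
mount -t gfs /dev/myvg/mylv /var/shared_data
```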

> > - the resulting logical volume to be used on SERVER1, SERVER2 and SERVER3
> > (read/write access at the same time)
>
> Hrm. This is all quite confusing, probably because you are not yet
> understanding the Lustre architecture. To try to map what you are
> describing to Lustre, I'd say your "N computers" are an MDS and OSSes
> and their 120GB "volumes" are an MDT and OSTs (respectively) and your "M
> servers" are Lustre clients.

I don't know Lustre. That's why I asked. I just want to know if it is possible ... If
the answer is yes, my question is: who will be the MDS and who will be the OSSes? How
many MDSes and how many OSSes do I need in order to obtain what I want?

>
> > - Using Lustre, can I join all volX (exported via iSCSI) together into one
> > bigger volume (using RAID/LVM) and have fault-tolerant SHARED STORAGE
> > (failure of a single drive (volX) or server (computerX) should not bring
> > down the storage)?
>
> I don't think this computes within the Lustre architecture. You
> probably need to review what Lustre does and how it works again.

No. My questions refer to the situation described above and also to the Lustre
FAQ.

[snip]
Can you run Lustre on LVM volumes, software RAID, etc?

Yes. You can use any Linux block device as storage for a backend Lustre server
file system, including LVM or software RAID devices.
[end snip]

And more from the FAQ...

[snip]
Which storage interconnects are supported?

Just to be clear: Lustre does not require a SAN, nor does it require a fabric
like iSCSI. It will work just fine over simple IDE block devices. But because
many people already have SANs, or want some amount of shared storage for
failover, this is a common question.

For storage behind server nodes, FibreChannel, InfiniBand, iSCSI, or any other
block storage protocol can be used. Failover functionality requires shared
storage (each partition used active/passive) between a pair of nodes on a
fabric like SCSI, FC or SATA.
[end snip]

So, for me, reading this, it is very clear even without being an expert that
Lustre supports BLOCK DEVICES in any RAID/LVM configuration... Also, Lustre
can work with my iSCSI block devices /dev/sd[a-h] ...

I asked here hoping that someone is using RAID/LVM over BLOCK DEVICES IMPORTED
via iSCSI in production and can confirm that what I want is not a utopia!
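For the record, importing the targets is plain open-iscsi on each server; the address and IQN below are made-up examples:

```shell
# Discover and log in to one exporting computer; open-iscsi then
# exposes the target as a local /dev/sdX block device
iscsiadm -m discovery -t sendtargets -p 192.168.0.101
iscsiadm -m node -T iqn.2008-08.lab.example:vol1 -p 192.168.0.101 --login

# Repeat for the other seven targets, then check the disks appeared
cat /proc/partitions
```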

> > - I have one doubt regarding Lustre: I saw that it uses ext3 underneath,
> > which is a LOCAL FILE SYSTEM not suitable for SHARED STORAGE (different
> > computers accessing the same volume and writing to it at the same time).
>
> This is moot. Lustre manages the ext3 filesystem as its backing store
> and provides shared access.

This is not clear at all... Generally speaking, ext3 is a local file system
(used on one computer). Reading the FAQ, I didn't find an answer, so I asked
here...

> > - So, using Lustre's patched kernels and tools, does ext3 become suitable
> > for SHARED STORAGE?
>
> You probably just need to ignore that Lustre uses ext3 "under the hood"
> and trust that Lustre deals with it appropriately.

No, I can't ignore it... I want to be sure that the ext3 used by Lustre is a
clustered file system. Red Hat NEVER indicated their ext3 as a file system for
clusters. They use GFS for that. I saw a lot of other howtos on the net
which ignore the parallel-write problem in cluster configurations and teach
people, for example, to use xfs to mount the same partition on several
servers and write to it at the same time... So, if Lustre's ext3 file system
is clustered, why does nobody add a note to the FAQ about it: "we are using a
patched ext3 version, which differs from Red Hat's ext3 because it supports
cluster file systems like GFS"...

Kit Westneat

Aug 12, 2008, 4:17:40 PM
to Alex, lustre-...@lists.lustre.org
Lustre is different from something like StorNext or other clustered
filesystems in that the clients never actually touch the storage; instead they
communicate with the servers, which then communicate with the storage.
That's why it really doesn't matter what Lustre runs as its backing
filesystem, as that filesystem will only ever be mounted by the storage server.

This white paper is a good intro to the architecture of Lustre:
www.sun.com/software/products/lustre/docs/lustrefilesystem_wp.pdf

- Kit


--
---
Kit Westneat
kwes...@datadirectnet.com
812-484-8485

Brian J. Murrell

Aug 12, 2008, 4:49:24 PM
to lustre-...@lists.lustre.org
On Tue, 2008-08-12 at 23:00 +0300, Alex wrote:
>
> I mean that i'm exporting via iscsi a block device

Ahhh. Now that's nomenclature I'm grasping. :-)

> In this case it is an entire hard disk, 120GB large,
> named volX for simplicity (vol1, vol2, ... vol8), because I have 8 computers
> doing this.

Right. So you have 8 iscsi disks.

> For example, it could be a GFS shared volume over volX. Here with Lustre I
> don't know ... You tell me ...

These 3 servers are still unclear to me. What do you see their function
as being? Would they be the Lustre filesystem servers to which Lustre
clients go to get access to the shared filesystem composed of the 8
iscsi disks?

If so, then it sounds like your 3 servers are looking to create a Lustre
filesystem with the 8 iscsi disks. This is doable but not a typical
scenario. You would probably dedicate one of the 8 targets to the MDT
and the other 7 as OSTs.

> I don't know Lustre. That's why I asked. I just want to know if it is possible ... If
> the answer is yes, my question is: who will be the MDS and who will be the OSSes? How
> many MDSes and how many OSSes do I need in order to obtain what I want?

Well, from your explanation above I would imagine you would use your
three servers to create 1 active MDS and 2 active OSSes with each OSS
hosting 4 and 3 OSTs respectively. You could pair the machines so that
one would pick up the slack of a failed machine in a failover event.
Something like:

Server 1      Server 2      Server 3
Primary MDS   Primary OSS1  Primary OSS2
Backup OSS1   Backup OSS2   Backup MDS
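Sketched with 1.6-era commands, that layout could be set up roughly like this (fsname, hostnames/NIDs, devices and mount points are illustrative placeholders, not tested commands):

```shell
# server1: format and mount the MDT (MGS co-located), with server3 as
# its failover partner per the table above
mkfs.lustre --fsname=testfs --mgs --mdt --failnode=server3@tcp0 /dev/sda
mount -t lustre /dev/sda /mnt/mdt

# server2: format and mount one of its OSTs; server1 backs up OSS1
mkfs.lustre --fsname=testfs --ost --mgsnode=server1@tcp0 \
    --failnode=server1@tcp0 /dev/sdb
mount -t lustre /dev/sdb /mnt/ost0

# a Lustre client then mounts the assembled filesystem
mount -t lustre server1@tcp0:/testfs /mnt/lustre
```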

> Can you run Lustre on LVM volumes, software RAID, etc?
>
> Yes. You can use any Linux block device as storage for a backend Lustre server
> file system, including LVM or software RAID devices.
> [end snip]

Oh, hrm. You want no SPOF. Given that you have 8 targets all in 8
different machines, I'm not quite sure how you are going to achieve
that. I suppose you could mirror two of the 8 iscsi devices for the MDT
and RAID5/6 the remaining 6 iscsi devices into a single volume. You
could then have 1 of your 3 machines be Primary MDS, the other Primary
OSS and the third backup for both of the first. It does seem a bit
wasteful to have a single machine doing nothing but waiting for failure
but this is frequently the case when you are working on the low end with
so little.
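In mdadm terms (run on whichever server assembles the arrays; the device names are the imported iSCSI disks as they appear locally), that could look something like:

```shell
# Mirror two of the eight iSCSI disks for the MDT
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda /dev/sdb

# RAID6 the remaining six into one OST volume; it survives the loss of
# any two disks (i.e. any two exporting computers)
mdadm --create /dev/md1 --level=6 --raid-devices=6 \
    /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh
```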


> So, for me, reading this, it is very clear even without being an expert that
> Lustre supports BLOCK DEVICES in any RAID/LVM configuration...

Indeed.

> Also, Lustre can work
> with my iSCSI block devices /dev/sd[a-h] ...

Sure. But you will have to build your redundancy on those devices
before giving them to Lustre. Lustre provides no redundancy itself and
relies on the underlying block device to be redundant.

> This is not clear at all... Generally speaking, ext3 is a local file system
> (used on one computer). Reading the FAQ, I didn't find an answer, so I asked
> here...

Right. It is in fact too much information for the Lustre beginner. You
should just be told that Lustre operates on and manages the block
device. That it does so through ext3 only serves to confuse the Lustre
beginner. Later, when you have a better grasp of the architecture, it
might be worthwhile understanding that each Lustre server does its
management of the block device via ext3. So please, don't worry about
the traditional uses of ext3 and don't confuse its limitations with
Lustre's. Lustre simply didn't want to invent a new on-disk management
library and used ext3 for it.

> No, I can't ignore it ... I want to be sure that the ext3 used by Lustre is a
> clustered file system.

Well, you are just going to have to take my word for it or start digging
deeper into the architecture of Lustre to be sure of that. Beyond what
I've already explained I'm not going to get any deeper into the
internals of Lustre.

> Red Hat NEVER indicated their ext3 as a file system for
> clusters.

First of all, ext3 is not Red Hat's. Second, ext3, in and of itself, is
not for clusters; this is true.

> So, if Lustre's ext3 file system
> is clustered, why does nobody add a note to the FAQ about it: "we are using a
> patched ext3 version, which differs from Red Hat's ext3 because it supports
> cluster file systems like GFS"...

Because it's not like that. As I have said, we simply use ext3 as a
storage mechanism.

b.


Alex

Aug 13, 2008, 4:55:46 AM
to lustre-...@lists.lustre.org
Hello Brian,

Thanks for your prompt reply... See my comments inline..

> Right. So you have 8 iscsi disks.

Yes, let me simplify our test environment:
- I have 2 LVS routers (one active and one backup, used for failover) to
balance connections across the servers located behind them.
- Behind LVS, I have a cluster of 3 servers (let's say they are web servers
for simplicity). All web servers serve the same content from a shared
storage volume mounted as the document root on all of them.

Up to here the design is unchangeable.

As I said in one of my past emails, I used GFS on top of a shared storage
logical volume. I gave up because I can't use RAID to group our iSCSI disks
-> I have a single-point-of-failure design (if one or more iSCSI
disks/computers go down, the shared volume becomes unusable).

Goal: replace GFS and create non-SPOF shared storage using another cluster
file system -> let's say Lustre

What we have in addition to the above:
- another N=8 computers (or more). N will be whatever it needs to be and can
be increased as needed. Nothing is imposed. In my example, I said that all N
computers export their block devices via iSCSI (one block device per
computer), so on ALL our webservers we have all 8 iSCSI disks visible and
available to build a shared storage volume (like a SAN). This doesn't mean
that all of them must export disks. Some of them can fulfil other functions,
like MDS, or perform other roles if needed. Also, we can add more computers
to the above schema, as you suggest.

Using GFS, I can't use RAID over block devices (my iSCSI disks) to form some
md devices (md0, md1, etc.), unify them using LVM and run GFS on top of the
resulting logical volume. That's the problem with GFS.

> These 3 servers are still unclear to me. What do you see their function
> as being? Would they be the Lustre filesystem servers to which Lustre
> clients go to get access to the shared filesystem composed of the 8
> iscsi disks?

These 3 servers should definitely be our www servers. I don't know if they
can be considered part of Lustre... They should be able to simultaneously
access a shared storage volume BUILT BY LUSTRE using our iSCSI disks. Reading
the Lustre FAQ, it is still unclear to me who the Lustre clients are. My
feeling tells me that our 3 webservers will be considered clients by Lustre.
Correct? Maybe now, because you have more information, you can tell me the
way to go:
- how many more machines I need
- how to group them and their roles (which will be MDS, which will be OSSes,
which will be clients, etc.)
- what I have to do to unify all the iSCSI disks in order to have no SPOF
- which machine(s) will be responsible for aggregating our iSCSI disks?
- would it be OK to group our 8 iSCSI disks into two 4-disk software RAID
(raid6) arrays (md0, md1), form another raid1 (let's say md2) on top, and use
LVM on top of md2? What is the best way to group/aggregate our iSCSI disks
(which RAID level)?
- how can the resulting logical volume be accessed by our webservers?

> > This is not clear at all... Generally speaking, ext3 is a local file
> > system (used on one computer). Reading the FAQ, I didn't find an answer,
> > so I asked here...
>
> Right. It is in fact too much information for the Lustre beginner. You
> should just be told that Lustre operates on and manages the block
> device. That it does so through ext3 only serves to confuse the Lustre
> beginner. Later, when you have a better grasp of the architecture, it
> might be worthwhile understanding that each Lustre server does its
> management of the block device via ext3. So please, don't worry about
> the traditional uses of ext3 and don't confuse its limitations with
> Lustre's. Lustre simply didn't want to invent a new on-disk management
> library and used ext3 for it.

Ok, sounds good. I believe you.

Brian J. Murrell

Aug 13, 2008, 8:13:04 AM
to lustre-...@lists.lustre.org
On Wed, 2008-08-13 at 11:55 +0300, Alex wrote:
> Hello Brian,

Hi.

> Thanks for your prompt reply... See my comments inline..

NP.

> I have a cluster of 3 servers (let's say they are web servers
> for simplicity). All web servers serve the same content from a shared
> storage volume mounted as the document root on all of them.

Ahhh. Your 3 servers would in fact then be Lustre clients. Given that
you have identified 3 Lustre clients and 8 "disks" you now need some
servers to be your Lustre servers.

> What we have in addition to the above:
> - another N=8 computers (or more). N will be whatever it needs to be and can
> be increased as needed.

Well, given that those are simply disks, you can/need to increase that
count only in so much as your bandwidth and capacity needs demand.

As an aside, it seems rather wasteful to dedicate a whole computer to
being nothing more than an iscsi disk exporter, so it's entirely
possible that I'm misunderstanding this aspect of it. In any case, if
you do indeed have 1 disk in each of these N=8 computers exporting a
disk with iscsi, then so be it and each machine represents a "disk".

> Nothing is imposed. In my example, I said that all N
> computers export their block devices via iSCSI (one block device per
> computer), so on ALL our webservers we have all 8 iSCSI disks visible and
> available to build a shared storage volume (like a SAN).

Right. You need to unravel this. If you want to use Lustre you need to
make those disks/that SAN available to Lustre servers, not your web
servers (which will be Lustre clients).

> This doesn't mean that all of
> them must export disks. Some of them can fulfil other functions, like MDS,
> or perform other roles if needed.

Not if you want to have redundancy. If you want to use RAID to get
redundancy out of those iscsi disks then the machines exporting those
disks need to be dedicated to simply exporting the disks and you need to
introduce additional machines to take those exported block devices, make
RAID volumes out of them and then incorporate those RAID volumes into a
Lustre filesystem. You can see why I think it seems wasteful to be
exporting these disks, 1 per machine as iscsi targets.

> Also, we can add more computers to the above
> schema, as you suggest.

Well, you will need to add an MDS or two and 2 or more OSSes to achieve
redundancy.

> These 3 servers should definitely be our www servers. I don't know if they
> can be considered part of Lustre...

Only Lustre clients then.

> Reading the Lustre FAQ, it is
> still unclear to me who the Lustre clients are.

Your 3 web servers would be the Lustre clients.

> - how many more machines I need

Well, I would say 3 minimum as per my previous plan.

> - how to group them and their roles (which will be MDS, which will be
> OSSes, which will be clients, etc.)

Again, see my previous plan. You could simplify a bit and use 4
machines, two acting as active/passive MDSes and two as active/active
OSSes.

> - what I have to do to unify all the iSCSI disks in order to have no SPOF

RAID them on the MDSes and OSSes.

> - would it be OK to group our 8 iSCSI disks into two 4-disk software RAID
> (raid6) arrays (md0, md1),

No. Please see my previous e-mail about what you could do with 8 disks.

> form another raid1 (let's say md2) on top, and use
> LVM on top of md2?

You certainly could layer LVM between the RAID devices and Lustre, but
it doesn't seem necessary.
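If you did want that layer (e.g. for snapshots or easier resizing later), the stacking order would be RAID, then LVM, then Lustre; the names below are placeholders:

```shell
# Take the RAID6 array as an LVM physical volume, carve out an OST
pvcreate /dev/md1
vgcreate lustrevg /dev/md1
lvcreate -l 100%FREE -n ost0 lustrevg

# Format the logical volume as an OST for Lustre
mkfs.lustre --fsname=testfs --ost --mgsnode=server1@tcp0 /dev/lustrevg/ost0
```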

b.
