SiCortex rebuild Lustre File System

rsvancara

Dec 14, 2011, 7:48:24 PM
to SiCortex Users
I have inherited a SiCortex-based system that has a Promise VTrak
disk controller. A disk failed and now the Lustre file system is
corrupt. Unfortunately I do not know much about these systems
(although I used to be an avid Gentoo user and I manage a much larger
HPC environment), but what I would like to do is blow away the file
system and rebuild it. Is there some quick and easy way to do this?
I have experience with GPFS, GFS, PVFS2, and NFS, but I have never
used Lustre. Hints, documentation, or any other advice would be
greatly appreciated.

Thanks,
Randall.

Lawrence Stewart

Dec 14, 2011, 9:56:58 PM
to sicorte...@googlegroups.com, Lawrence Stewart
The usual way it is set up is something like this:

* Use the GUI or command line tools that come with the disk array to
configure the array into logical units (LUNs)
* Export appropriate LUNs on the various fiberchannel ports
* Reboot the SC system, to see if the disk controllers and LUNs
are recognized by the Linux boot sequence on the nodes with the fiberchannel
controllers (just look at the console logs or ssh to those nodes)
* Follow the directions in the Lustre Guide (version 1.6) and the
SiCortex system admin guide to create filesystem volumes
on the various LUNs (essentially mkfs -t lustre ...; a rough command
sketch follows after this list)
* Study the lustre startup script in ssp:/opt/sicortex/config/local.d
to make sure it makes sense: that it is loading the lustre modules
on the nodes with the fiberchannel ports and mounting the right
filesystems.
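For reference, the raw commands behind that mkfs step look roughly like
this for Lustre 1.6. The filesystem name, device names, mount points,
and the MGS NID below are only examples, and the real layout has to
match what the SiCortex startup script expects, so treat this as a
sketch rather than a recipe:

    # On the node attached to the MDT LUN (example device /dev/sdb):
    mkfs.lustre --fsname=scfs --mdt --mgs /dev/sdb
    mkdir -p /mnt/mdt && mount -t lustre /dev/sdb /mnt/mdt

    # On the node(s) attached to OST LUNs (example device /dev/sdc,
    # example MGS NID sc1-m0n0@tcp):
    mkfs.lustre --fsname=scfs --ost --mgsnode=sc1-m0n0@tcp /dev/sdc
    mkdir -p /mnt/ost0 && mount -t lustre /dev/sdc /mnt/ost0

    # On the client nodes:
    mkdir -p /mnt/scfs
    mount -t lustre sc1-m0n0@tcp:/scfs /mnt/scfs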

The SiCortex way of managing Lustre setups was to use stylized
comments in the lustre filesystem setup (mkfs...) to
control which LUN was interpreted to be which Lustre volume.

Typically there would be one MDT (metadata target) volume and
several OST (object storage target) volumes.

or...

Unless you have some secret wish to become a Lustre wizard and really
need a *parallel* filesystem so you can get improved file
read/write bandwidth, why not just run NFS?

In other words, those Promise arrays are not that fast - you are likely to
get maybe 300 MB/sec out of them in the best case, so Lustre isn't
doing much for you.

You can just configure the LUNs, and then mkfs them as ext2 or ext3
filesystems, mount them on the fiberchannel nodes as direct-attached
storage, and then export them as NFS filesystems to all the other nodes.
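A minimal sketch of that, assuming a single fiberchannel node named
sc1-m0n0, a LUN that shows up as /dev/sdb, and an export point of
/export/data (all of those names are just examples):

    # On the fiberchannel node:
    mkfs -t ext3 /dev/sdb
    mkdir -p /export/data
    mount /dev/sdb /export/data
    echo '/export/data *(rw,no_root_squash,async)' >> /etc/exports
    exportfs -ra

    # On every other node:
    mkdir -p /data
    mount -t nfs sc1-m0n0:/export/data /data

In practice you would put those mounts and exports into a boot-time
script rather than typing them by hand.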

It works fine and is more reliable, just not as fast as Lustre would be.

If NFS performance is good enough, it sure is a lot simpler.

-Larry


rsvancara

Dec 15, 2011, 1:02:47 PM
to SiCortex Users
NFS sounds promising given that it is easier to manage. How
difficult is it to convert to NFS? Would I have to change around many
configuration scripts? Coming from an HPC world comprised of
commodity servers, I understand how everything works because you can
physically observe what each node does just by looking at the rack.
It is very intuitive. With SiCortex, I understand you can designate
"nodes" as storage nodes and the rest as compute nodes, but I am
unsure how it determines this information; perhaps by the hostname
(hostid) of the system? In the SiCortex configuration I am working
with, there are four FC cables that plug into one "blade". So how do
you designate a node as a storage node? I am trying to understand
more about the magic that goes on inside each one of the SiCortex
blades. My other question is about interconnects. Are all the nodes
connected via ethernet, or does the SiCortex architecture use
something completely different? Again, in the HPC environment I work
in, we use both infiniband and ethernet. It is pretty
straightforward. I am reading through the 500-page manual and some of
my questions are partially answered, but I feel like I do not know
enough to sufficiently "wrangle" the SiCortex cluster into a
functional state.

My worst-case scenario is breaking this system more than it is broken
already. Best-case scenario, everything works.

Lawrence Stewart

Dec 15, 2011, 2:59:30 PM
to sicorte...@googlegroups.com, Lawrence Stewart
Pretty straightforward, see below:

On Dec 15, 2011, at 1:02 PM, rsvancara wrote:

> NFS sounds promising given that it is easier to manage. How
> difficult is it to convert to NFS? Would I have to change around many
> configuration scripts?

The only things you should have to change are in ssp:/opt/sicortex/config/local.d

These are scripts that run on each node when the machine is booted.

There is probably a script there now that sets up lustre. Move it out of
the ssp:/opt/sicortex/config/local.d directory.

Next, create a new script to set up NFS. There should be an example
for just this case in ssp:/opt/sicortex/script_examples/nfs_export.sh

The example sets up a tmpfs on one node and exports it to all the others.
In your case, you will want to mount direct attached filesystems from one
or more of your fiberchannel nodes and export them to all the others.

You will have to modify the script appropriately and then copy it into
ssp:/opt/sicortex/config/local.d, then restart.
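On the SSP that amounts to something like the following. The lustre
script's exact filename is a guess here, so check what is actually in
the directory first:

    cd /opt/sicortex/config
    mkdir -p local.d.disabled
    mv local.d/*lustre* local.d.disabled/      # park the old lustre script
    cp /opt/sicortex/script_examples/nfs_export.sh local.d/
    # now edit local.d/nfs_export.sh for your LUNs before rebooting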

The script has a case statement for what node it runs on (it runs on every node).
When run on the specific nodes that export NFS, the script should mount the
filesystem and export it.

When run on other nodes, the script waits for the export, then mounts the nfs filesystem.

There is some magic for the coordination. The scripts use something called "ev1d" which
is an event sharing server that runs on the SSP. The NFS export nodes create
an event there when they are finished, and the nodes which import NFS wait for that
event before they try to mount the filesystem.
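Putting those pieces together, the overall shape of the script is
something like the sketch below. The hostname, device, mount point,
and especially the two event-helper commands are placeholders - I do
not remember the exact ev1d client syntax, so copy that part from
nfs_export.sh:

    #!/bin/sh
    # Sketch only; this runs on every node at boot.
    SERVER=sc1-m0n0               # example: a node with a fiberchannel card
    case "$(hostname)" in
      "$SERVER")
        mount /dev/sdb /export/data            # direct-attached LUN (example)
        exportfs -o rw,no_root_squash '*:/export/data'
        post_event nfs_ready                   # placeholder: tell ev1d the export is up
        ;;
      *)
        wait_event nfs_ready                   # placeholder: wait for the ev1d event
        mkdir -p /data
        mount -t nfs "$SERVER":/export/data /data
        ;;
    esac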

Of course, before you can export NFS filesystems, you have to format the storage array
and mkfs the filesystems; see the previous message.

> Coming from an HPC world comprised of
> commodity servers, I understand how everything works because you can
> physically observe what each node does just by looking at the rack.
> It is very intuitive. With SiCortex, I understand you can designate
> "nodes" as storage nodes and the rest as compute nodes, but I am
> unsure how it determines this information; perhaps by the hostname
> (hostid) of the system?

Each "blade" is really 27 individual 6-way SMP Linux machines, each one independent.
The four blades, from left to right, are modules 0, 1, 2, and 3. Each one has nodes 0-26.
The hostnames of the nodes, then, run from sc1-m0n0 through sc1-m0n26 for the nodes (Linux systems)
on module 0, up to sc1-m3n0 through sc1-m3n26 for the nodes on module 3.

The 27 nodes on the fabric communicate with each other through the "fabric" interconnect, which is
very fast (each node has 3 links at 2 GB/sec).

To get outside the chassis, there are 4 "special" nodes on each module: sc1-m?n6 has a dual gigabit-ethernet
interface. Some of these nodes are generally used for serving the root filesystem and for external
networking. On an SC648, probably only one is used for a minimal setup, and will be cabled to the
SSP directly. You can set up a different ethernet-configured node as a gateway to the local LAN.

Three other nodes on each module (sc1-m?n0, sc1-m?n1, and sc1-m?n3) have PCI ExpressModule slots,
and it sounds like you have fiberchannel cards in two of those slots. I don't remember which one is which,
but you could just grep through ssp:/var/log/sc1/sc1-m?n[0,1,3].console to locate which ones probe the
fiberchannel cards. These files are the console logs from each of the 108 nodes.
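For example, something along these lines on the SSP should narrow it
down; the string to search for depends on which driver the cards use,
so "fibre channel" is only a guess:

    grep -il "fibre channel" /var/log/sc1/sc1-m?n[0,1,3].console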

> In the SiCortex configuration I am working
> with, there are four FC cables that plug into one "blade". So how do
> you designate a node as a storage node? I am trying to understand
> more about the magic that goes on inside each one of the SiCortex
> blades.

Which node should be a "storage server" is controlled by where the fiberchannel cards are plugged in
as above.

> My other question is about interconnects. Are all the nodes
> connected via ethernet, or does the SiCortex architecture use
> something completely different?

Completely different, but it has an "ethernet emulation" so you can use TCP and UDP freely.
If you use MPI or SHMEM or UPC, those things run "native" on the fabric with latencies
like 1-2 microseconds and 1.5 - 2 GB/s bandwidth.


> Again, in the HPC environment I work
> in, we use both infiniband and ethernet. It is pretty
> straightforward. I am reading through the 500-page manual and some of
> my questions are partially answered, but I feel like I do not know
> enough to sufficiently "wrangle" the SiCortex cluster into a
> functional state.

It is pretty much like a cluster of independent linux boxes, except
* they have no local disks (except for the fiberchannel nodes)
* they have a "shared" root filesystem
* the logging is arranged to route to the SSP's filesystem


>
> My worst-case scenario is breaking this system more than it is broken
> already. Best-case scenario, everything works.
>

If it worked once, it can work again :-)

You can get a feel for whether it is working even without the NFS filesystem.
Just remove the lustre script from ssp:/opt/sicortex/config/local.d and
type "scboot -p sc1" on the ssp and see if it comes up.

If so, try

"srun -p sc1 -N 108 hostname"

to run the hostname command on all nodes.

It is possible there are broken parts to configure around, but that is usually pretty easy.

Randall Svancara

Jan 2, 2012, 8:39:01 PM
to sicorte...@googlegroups.com, Lawrence Stewart
I went the NFS route. It was fairly straightforward and I was able to revive the disk volume in under four hours. Thanks for the great explanation.
--
Randall Svancara
Know Your Linux?