
GPFS detection of Disk Array Loss


openstream rob

Nov 27, 2007, 6:24:52 PM
I have a GPFS cluster comprising 2 nodes and a tiebreaker disk. There is
only one filesystem, made from 3 NSDs, each in its own failure group.
One NSD is descOnly and is there for disk quorum. The other two are
replicated, with one NSD at each site. (I think this is a fairly
typical setup.)
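
(For reference, the layout described here should be visible with commands
along these lines - "sfs" is the filesystem name used later in this thread,
adjust for your own names:
$ mmlsconfig | grep -i tiebreaker   # tiebreakerDisks setting, if defined
$ mmlsnsd                           # NSD to filesystem / server mapping
$ mmlsdisk sfs -L                   # per-disk usage, failure group, status and availability
)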

I have a couple of problems though. Firstly I should say that I
physically have only TWO arrays, so my "site C" is in fact on "site B".
Therefore my tiebreaker disk sits in the same array as one of my data
NSDs, and the descOnly quorum disk is on that array too.

I am hoping that the nuts and bolts of maintaining service will not be
affected by the co-location of the tiebreaker disks and the data disks
at one site. Obviously, should that site/array fail, the surviving site
will go down as well. However, the site holding the tiebreaker disks
should survive the loss of the non-tiebreaker site.
Am I correct in this assumption?

As a test I am powering off the array at the site without the
tiebreaker disks. I envisaged that all nodes would survive, as there is
a node and tiebreaker still awake, together with a disk quorum in the
filesystem.
What I am seeing is all nodes freezing on access to the filesystem;
all FS commands hang/freeze. When I run mmlsdisk sfs, it returns that
both the data and the tiebreaker NSDs are available and "up". I know this
is not the case, so what should happen? How does GPFS detect the array
loss and maintain service?

When I power the array up again, it sorts itself out within 10 to 20
seconds: full service is restored and the filesystem is available
again, with no intervention.

As a side point, the GPFS advanced administration guide mentions that
internal, non-shared disks can be used at tiebreaker sites. Is this true?
References on the IBM site seem to contradict this.

Any pointers gratefully received!

Rob

Hajo Ehlers

Nov 27, 2007, 7:17:57 PM
On 28 Nov., 00:24, openstream rob <r...@openstream.co.uk> wrote:
...

> As a test I am powering off the Array at the site without the
> tiebreaker disks. I envisaged that all nodes would survive as there is
> a node and tiebreaker still awake, together with a disk quorum in the
> filesystem.
> What I am seeing is all nodes freezing on access to the filesystem,
> all FS commands hang/freeze. When I run mmlsdisk sfs, it returns that
> both data and the tiebreake NSDs are available and "up". I know this
> is not the case, so what should happen? How does GPFS detect the array
> loss? and maintain service?
Which GPFS version?
Do you use NSD servers?

One cause might be:
It depends on how long the FC/fscsi subsystem tries to access a lost
path before failing over to another path, which will also be dead. Until
then all I/O is blocked, and thus so is GPFS. As soon as the fcs/fscsi
subsystem reports that all paths are dead, GPFS will shut down or switch
over to using an NSD server. So check your error log for errors during
that time.

See http://www-1.ibm.com/support/docview.wss?uid=isg1520readmefb4520desr_lpp_bos
for details on fast failover and so on.
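
To see where the time goes, something like the following might help
(fscsi0 is just an example instance - repeat for each fscsi device):
$ errpt | more                                   # look for FC / disk path errors logged around the test
$ lsattr -El fscsi0 -a fc_err_recov -a dyntrk    # current error recovery / dynamic tracking settings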

...


> As a side point, the Advanced GPFS admin guide mentions that internal,
> non shared disks can be used at tiebreaker sites. is this true?

It's true. You might get an error message from the other cluster nodes
that they cannot access the disk, which of course is true, but the
functionality is there.

hth
Hajo

openstream rob

Nov 28, 2007, 5:29:55 AM
Thanks Hajo

The version of GPFS is 3.1.0.7, AIX is 5.3 ML04
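
(In case it helps anyone reading later, those levels can be checked with
something like:
$ lslpp -l gpfs.base   # installed GPFS fileset level
$ oslevel -r           # AIX release / maintenance level
)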

I'm looking at your link now.

Thanks

Rob

Hajo Ehlers

Nov 28, 2007, 6:32:20 AM
On Nov 28, 11:29 am, openstream rob <r...@openstream.co.uk> wrote:
> Thanks Hajo
>
> The version of GPFS is 3.1.0.7, AIX is 5.3 ML04
GPFS PTF 16 is already out; it is needed for integration with GPFS 3.2
and it fixes problems with
- ML06 and higher
- disks between 1 TB and 2 TB in size not being seen correctly

and many others.

hth
Hajo

openstream rob

Nov 28, 2007, 6:38:06 AM
Hi Hajo

I've just run some "extended" tests and left the array off for 10
minutes. Eventually the failover occurred and everything is looking
good. I'm going to implement fast_fail and see if the timing improves.
(I'm not exactly sure how long the failover took, probably not the
full 10 minutes.)

Rob

openstream rob

Nov 28, 2007, 8:20:20 AM
Thanks again!

I'm downloading the ptf now.

BTW: to switch to "fast_fail" mode, what system state do I need to be in
to allow the chdev to work without complaining about child devices being
active? Do I need to unplug the hardware, or can I detach a child
device? (How do I identify the dependent devices?)

I currently have
2 fcsnet
2 fscsi
2 dar
8 dac
2 fcs

Rob

Hajo Ehlers

Nov 28, 2007, 10:17:27 AM


You can:
use the -P option with chdev and reboot,
or, if supported by all devices:

$ lsdev -S available | egrep "fcs0|fscsi0|dar|dac"  # check state of devices
$ rmdev -p fcs0                                     # put all children into the defined state
$ chdev -l fscsi0 ...                               # reconfigure your fscsi
$ chdev -l fscsi0 ...
$ cfgmgr -l fcs0                                    # make the adapter and all its children active again
$ lsdev -S available | egrep "fcs0|fscsi0|dar|dac"  # recheck state of devices

Repeat it for all required FC adapters. No downtime and no loss of
disk access - of course only if you have more than one FC adapter in
use.
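
The elided chdev step above would set the fscsi error recovery
attributes discussed in this thread; an illustrative form (check
lsattr -Rl fscsi0 -a fc_err_recov for the values your adapter actually
supports) would be:
$ chdev -l fscsi0 -a fc_err_recov=fast_fail -a dyntrk=yes   # fast path failure + dynamic tracking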

hth
Hajo

openstream rob

Nov 29, 2007, 11:26:10 AM
Hi Hajo

Thanks for the hints. I ran the commands for fcs0 and it worked
perfectly. For fcs2 (connected to a second switch and on to a second
disk controller) the sequence failed when attempting to rmdev the dac4
device: it complained that the device was busy.

Does this show that all traffic is passing only through the second
fabric? Hopefully only until there is a failure, at which point it will
be passed to the first?

How do I swap it over to adjust the fscsi2 device?

Rob

Hajo Ehlers

Nov 29, 2007, 12:39:41 PM
On Nov 29, 5:26 pm, openstream rob <r...@openstream.co.uk> wrote:
> Hi Hajo
>
> Thanks for the hints. I ran the commands for fcs0 and it worked
> perfectly. for fcs2 ( connected to a second switch and on to a second
> disk controller) the command failed when attempting to rmdev the dac4
> device. It complained that the device was busy.

I am not familiar with IBM SAN storage, so please check your
documentation.
At least, if everything is working, you can unplug the cable for dac4,
which should lead to a failover to the other one. Then you should be
able to change the adapter and plug the cable back in.
Or use the -P option, as already mentioned, and reboot.

>
> Is this showing that all traffic is passing only through the second
> fabric? Hopefully until there is a failure, then it will be passed to
> the first?
>
> How do I swap it over to adjust the fscsi2 device?

A simple way to see which fcX/fscsiX device is in use:
$ nmon
$ iostat -a 1 | egrep "fc|fscsi|dac"

hth
Hajo

openstream rob

Nov 30, 2007, 5:14:49 AM
Hajo

I used the -P option and it was successful. All adapters are now
running with "fast_fail".

This has improved the overall failover time to 8 minutes, from 14
minutes.

It's curious: the server consoles report the loss via errpt much more
quickly (1 minute max), but GPFS doesn't react for quite a while.
After 5 minutes the GPFS log gives:
"Local access to sfs2 failed with EIO, will attempt to access the disk
remotely."

mmlsdisk sfs doesn't report sfs2 as "down" for a further 3 minutes.

Any ideas?

Rob

Hajo Ehlers

Nov 30, 2007, 9:54:26 AM
On Nov 30, 11:14 am, openstream rob <r...@openstream.co.uk> wrote:
> Hajo
>
> I used the -P option and it was successful. All adapters are now
> running in "fast_fail".

And did you reboot the node where you made this change? Otherwise the
ODM has been changed but not the adapter itself.
BTW: did you also enable dynamic tracking?
$ chdev -l fscsiX -a dyntrk=yes

>
> This has improved the failover times to 8 minutes, from 14 minutes.
> overall.

What do you mean by failover? Depending on your GPFS version and
configuration, the GPFS process will stay alive as long as it can
communicate with the other cluster members. Only if communication
and quorum are lost will the GPFS process shut down.

If local disks cannot be accessed any more, nothing much should
happen except that the filesystems on those disks will be unmounted.
If you have NSD servers, the local GPFS process will stop direct disk
access and use the remote NSD server for accessing the data.
As soon as the local disks are available again, the node switches back
to local data access.
The failover from local disk to network should take only seconds.
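
A way to check whether a node is currently doing local I/O or going
through an NSD server - assuming your GPFS level supports these flags -
would be:
$ mmlsdisk sfs -M   # shows on which node the I/O for each disk is actually performed
$ mmlsnsd -m        # NSD to local device mapping per node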

>
> It's curiuos as the server consoles report the loss via errpt a lot
> quicker ( 1 minute max), GPFS doesn't react for quite a while.
> After 5 minutes GPFS log gives:
> "Local access to sfs2 failed with EIO, will attempt to access the disk
> remotely."

Maybe nothing was using the GPFS filesystem, so there was no I/O and
thus no error reported to GPFS. You should do some testing with load
on the system.


>
> mmlsdisk sfs, doesn't report sfs as "down" for a further 3 minutes.
>

Like I said already, you might have to reboot your GPFS nodes to be
sure that all settings are applied.

hth
Hajo

openstream rob

Nov 30, 2007, 10:51:01 AM
Hi again

Okay, I forgot to do the dynamic tracking change. But I did reboot
after the -P change to the fscsi devices.

When I say failover I'm not being precise. The scenario is below:


Array1:
nsd sfs1: data+meta: failure group 1
nsd fs1: descOnly: failure group 1

Array2:
nsd sfs2: data+meta: failure group 2

Array1 and Array2 have twin controllers, connected to 2 switches each,
to provide resilience to 6 servers, all with 2 fcs adapters.

There is only one filesystem, sfs, comprised of the NSDs sfs1, sfs2 and fs1.

My test is to power off Array2, losing sfs2, and I was hoping that
all the servers would maintain filesystem access via the quorum disk
(fs1) and sfs1 still being available on Array1.

At this point I have to wait 8 minutes before any/all servers are
given access to the filesystem.

Can you spot anything else to look at? Will the fibre network topology
be relevant? Basically, servers 1 to 3 are each linked to 2 switches
(a and b) (1 per fcs) and servers 4 to 6 are similarly linked to 2
switches (c and d).
Switches a and c are joined, and b and d are also joined. Each of these
extended fabrics is connected to each disk array (which have twin
controllers).

It's not too different from the setup shown in the GPFS planning manual
(page 19, figure 11), except that each switch in their diagram represents
a joined pair in our scenario.

Do you think the above test should "failover" instantly? This is what
I had anticipated.

Rob

Hajo Ehlers

Nov 30, 2007, 11:17:39 AM
On Nov 30, 4:51 pm, openstream rob <r...@openstream.co.uk> wrote:
> Hi again
>
> Okay, I forgot to do the dynamic tracking change. But I did reboot
> after the -P change to the fscsi devices.
>
> When I say failover I'm not being precise. The scenario is below:
>
> Array1:
> nsd sfs1:data+meta: failure grp 1
> nsd fs1:descOnly: failuregrp 1
>
> Arrary2:
> nsd sfs2:data+meta: failure grp 2
>
> Array1 and 2 have twin controllers, connected to 2 switches each to
> provide resilience to 6 servers, all with 2 fcs adapters
>
> there is only one filesystem sfs, comprised of nsds: sfs1 sfs2 and fs1
>
> My test is to power off Array2, loosing sfs2, and I was hoping that
> all the servers would maintain filesystem access via the quorum disk
> (fs1) and sfs1 being still available on Array 1.
>
> At this point, I have to wait 8minutes before any/all servers are
> given access to the filesystem.

So you are saying that for 8 minutes no filesystem access is possible
at all, but afterwards it is working?

Please provide the following output:

Since you use 2 failure groups - have you set up replication to 2 for
data and metadata?
$ mmlsfs sfs -r -R -m -M

What is the value of unmountOnDiskFail?
$ mmlsconfig | grep -i unmount

Do you have NSD servers?
$ mmlsnsd -m


openstream rob

Nov 30, 2007, 11:58:53 AM
On 30 Nov, 16:17, Hajo Ehlers <serv...@metamodul.com> wrote:
> On Nov 30, 4:51 pm, openstream rob <r...@openstream.co.uk> wrote:
> > Hi again
>
> > Okay, I forgot to do the dynamic tracking change. But I did reboot
> > after the -P change to the fscsi devices.
>
> > When I say failover I'm not being precise. The scenario is below:
>
> > Array1:
> > nsd sfs1:data+meta: failure grp 1
> > nsd fs1:descOnly: failuregrp 1
>
> > Arrary2:
> > nsd sfs2:data+meta: failure grp 2
>
> > Array1 and 2 have twin controllers, connected to 2 switches each to
> > provide resilience to 6 servers, all with 2 fcs adapters
>
> > there is only one filesystem sfs, comprised of nsds: sfs1 sfs2 and fs1
>
> > My test is to power off Array2, loosing sfs2, and I was hoping that
> > all the servers would maintain filesystem access via the quorum disk
> > (fs1) and sfs1 being still available on Array 1.
>
> > At this point, I have to wait 8minutes before any/all servers are
> > given access to the filesystem.
>
> So you are saying that for 8 minutes no filesystem access is possible
> at all but afterwards it is working ?

Exactly. Pull the plug, wait 8 minutes, and then all I/O returns.

>
> Please provide the following output
>
> Since you use 2 failure groups - have you setup replication to 2 for
> data and metadata ?
> $ mmlsfs sfs -r -R -m -M

Yes this is done.

>
> What the value for unmountOnDiskFail
> $ mmlsconfig | grep -i unmount

Oh. This isn't set.

> Do you have nsd server

fs1 db1 direct
fs1 db2 primary
fs2 db1 direct
fs2 db6 primary
sfs1 db1 direct
sfs1 db2 primary
sfs1 db3 backup
sfs2 db1 direct
sfs2 db5 primary
sfs2 db6 backup
tb1 db1 direct
tb2 db1 direct

db1, db2 and db3 are in one rack;
db4, db5 and db6 are in another.

tb1 and tb2 are tiebreaker disks. Only one is used; the other is kept
for DR recovery.

Thanks Hajo. I'm looking at the unmountOnDiskFail configurable.

Rob



openstream rob

Nov 30, 2007, 12:23:50 PM
Hi Hajo

I thought I'd mention again that although I'm implementing a
"standard" 3-site resilient cluster, my 3rd site is actually located
in the secondary rack. I recognise that this rack is now a point of
failure, but I'm allowing for manual intervention to reconfigure the
tiebreaker and FS quorum over to the other rack. (Hence the tb2 and
fs2 NSDs in my last post.)
All my testing referenced in these posts is only against failure of
disks in the rack WITHOUT the tiebreaker and FS quorum.
HOWEVER, is this setup potentially causing the timing problem?

I looked at unmountOnDiskFail and I'm not sure it should be enabled;
I would say not, particularly as the descOnly FS quorum NSD is
SAN-based in my case.

Really appreciate the help, what do you think?

Hajo Ehlers

Nov 30, 2007, 12:47:48 PM
On Nov 30, 4:51 pm, openstream rob <r...@openstream.co.uk> wrote:
> Hi again
...

>
> Array1:
> nsd sfs1:data+meta: failure grp 1
> nsd fs1:descOnly: failuregrp 1
>
> Arrary2:
> nsd sfs2:data+meta: failure grp 2
>
> Array1 and 2 have twin controllers, connected to 2 switches each to
> provide resilience to 6 servers, all with 2 fcs adapters
>
> there is only one filesystem sfs, comprised of nsds: sfs1 sfs2 and fs1
>
> My test is to power off Array2, loosing sfs2, and I was hoping that
> all the servers would maintain filesystem access via the quorum disk
> (fs1) and sfs1 being still available on Array 1.
>
> At this point, I have to wait 8minutes before any/all servers are
> given access to the filesystem.
>
> Can you spot anything else to look at? Will the fibre network topology
> be relevant. Basically, servers 1 to 3 are each linked to 2 switches(a
> nd b) ( 1 per fcs) and servers 4 to 6 are similarly linked to 2
> switches (c+d)
> switches a and c are joined and b and d are also joined. Each of these
> extended fabrics is connected to each Disk array, (that have twin
> controllers) .

In case I understand your SAN configuration correctly, your
servers [1,2,3] can directly SEE the disks on array2 as well as their
own on array1,
whereas
servers [4,5,6] can directly SEE the disks on array1 as well as their
own on array2.

In case the above is true, a failure on array2 will not lead to the use
of an NSD server - meaning you should not see high network traffic
between the nodes. Of course, only if you have put load on one of the
servers [4,5,6].

So for troubleshooting:
Shut down array2, put some load on one of the servers [4,5,6] and see
if you have any high network traffic on the test node. If not, it simply
means that the node can access the data via the SAN on array1.
If this is correct, I believe that your SAN fabric just needs too long
to stabilize, meaning that for some reason it takes pretty long to find
an active path to the remaining disks.

BTW:
I would really simplify your current configuration.
1) In case a server can see all disks on array1 and array2:
- only one GPFS server in Rack1 - with tiebreaker disk
- only one GPFS server in Rack2
- no disk is configured for NSD usage - all direct attached, no
primary or backup server
- keep your current SAN configuration
I assume that
- server1 sees array1 & array2
- server4 sees array2 & array1

1.1) Now do your tests to see what happens if array2 is not available.
If you still have this 8-minute delay I would rethink your fabric
configuration.
If not, the problem might be triggered by your type of NSD server
configuration.

BTW:
If you are going to have NSD servers, then ONLY one server in each rack
should be an NSD server. Thus server1 in rack1 is the primary server and
server4 in rack2 is the backup server.
This also applies in case a server can only see one array.

Thus a simple configuration would be:
2)
- only one GPFS server in Rack1 - with tiebreaker disk, primary NSD
server
- only one GPFS server in Rack2 - backup NSD server
- all disks are configured for NSD usage
- keep your current SAN configuration
I assume that
- server1 sees only array1
- server4 sees only array2
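
If I remember the GPFS 3.1 colon-separated disk descriptor format
correctly (DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup:DesiredName),
a sketch of configuration 2 might look like this - the hdisk numbers are
made up, and each disk's primary server must be a node that can actually
see it:
hdisk10:server1:server4:dataAndMetadata:1:sfs1
hdisk11:server4:server1:dataAndMetadata:2:sfs2
hdisk12:server1:server4:descOnly:1:fs1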

2.1)
Now do your tests to see what happens if array2 is not available.

You can extend your test by adding a third GPFS server from Rack2
which can only see array2.
Redo the test - turn off array2 - and the new GPFS server should
switch to NSD mode, meaning it accesses the GPFS filesystem through the
primary NSD server.

hth
Hajo

Hajo Ehlers

Nov 30, 2007, 12:58:57 PM
On Nov 30, 6:23 pm, openstream rob <r...@openstream.co.uk> wrote:
> Hi Hajo
>
> I thought I'd mention again that although I'm implementing a
> "standard" 3 site resilient cluster. My 3rd site is actually located
> on the secondary rack. I recognise that this rack is now a point of
> failure, but I'm allowing for manual intervention to reconfigure the
> tiebreaker and fs quorum node to the other stack. (Hence the tb2 and
> fs2 nsds in my last post)
> All my testing referenced in these posts is only against failure of
> disks in the rack WITHOUT the tiebreaker and fs quorum.
That is clear to me, but see my previous post regarding your
configuration.

> HOWEVER, is this setup potentially causing the timing problem?

Depends which part of the setup you mean ;-)
- GPFS setup: NSD servers y/n, network
- SAN setup: how many disks, how many paths
- SAN storage: what type of storage, failover mode

Like I said:
simplify your current configuration and then start to extend it.

hth
Hajo

openstream rob

Nov 30, 2007, 1:12:07 PM

A quick bit of extra information:

Yes, all servers 1 to 6 can see both arrays directly.

The NSD usage is because (sorry, I didn't mention this) there is another
tier of 8 servers fully connected to the 6 already mentioned. These are
only connected over Gb LAN, but they also use the NSD servers for access
to sfs.

I'll drop the server count as you suggest (2) and repeat the test.

When the array is dropped there is NO I/O to the remaining array on any
server (for 8 minutes), e.g. a "touch filename" will hang. After 8
minutes it succeeds.

Thanks Hajo. I'll need to do tests on Monday as the building is
closing.

Have a good weekend

Rob

openstream rob

Nov 30, 2007, 1:31:01 PM
Not so quick!!

I just did one more test run, as I was rebooting the cluster after
setting dyntrk=yes while we were chatting.

I pulled the wire on the array and all servers carried on immediately.
mmlsdisk sfs shows an instant state of down on sfs2 and I/O is normal to
the remaining array.

Problem solved.

I will do a review of the NSD servers anyway, and I still need to
check the manual failover mechanism in case array1/rack1 falls over.

Thanks for your time Hajo. That's about the 10th problem of mine
you've sorted.

Another case to add to your solved file!

Cheers
Rob

Hajo Ehlers

Nov 30, 2007, 1:49:08 PM
On Nov 30, 11:14 am, openstream rob <r...@openstream.co.uk> wrote:

...


> quicker ( 1 minute max), GPFS doesn't react for quite a while.
> After 5 minutes GPFS log gives:
> "Local access to sfs2 failed with EIO, will attempt to access the disk
> remotely."

...
Sometimes it is really good to reread a thread:
The above message says, in my opinion, that your server in rack2 could
not use the disk in array1 for some reason.

From
http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.gpfs.doc/gpfs31/bl1pdg1176.html
...
If a disk is defined to have a local connection, as well as being
connected to primary and secondary NSD servers, and the local
connection fails, GPFS bypasses the broken local connection and uses
the NSD servers to maintain disk access. The following error message
appears in the MMFS log:

6027-361
Local access to disk failed with EIO, switching to access the disk
remotely.

This is the default behavior, and can be changed with the useNSDserver
file system mount option. See General Parallel File System: Concepts,
Planning, and Installation Guide and search for NSD server
considerations.

For a file system using the default mount option
useNSDserver=asneeded, disk access fails over from local access to
remote NSD access. Once local access is restored, GPFS detects this
fact and switches back to local access. The detection and switch over
are not instantaneous, but occur at approximately five minute
intervals.
...

So here you have your 5-minute interval. But this should happen only
for a node which had local access and lost it completely. Since in
your setup the servers see all disks, this should not happen.
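
If you want to rule that behaviour out while testing, the option can -
as far as I know - be passed at mount time, e.g.:
$ mmmount sfs -o useNSDserver=never -a   # fail the I/O rather than falling back to an NSD server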

Have a nice weekend
Hajo

openstream rob

Dec 14, 2007, 5:30:38 PM
Hi Hajo

My optimism was a bit premature!

In fact, the instant failover has not repeated since. I've cut down
the environment and I believe the problem must lie in the fibre network
area. If I yank a disk out of the array, detection by GPFS is instant:
the controller must be signalling back to AIX that the disk has been
removed. However, when the controller is taken down, no "disk loss"
signal is given. I repeated the test by unplugging the fibre switches;
GPFS again took 15 minutes to detect the drop before failing over.
I'm not sure where to look. Something is telling AIX that the router
has gone, as I can see the error log entries immediately. What is the
link to GPFS that tells it to abandon access to the unreachable disk?

Any clues? I know this post has gone cold, but I need any help I can get.

Thanks

Rob

Hajo Ehlers

Dec 17, 2007, 4:37:17 AM
> In fact, the instant failover has not repeated since. I've cut down
> the environment and I belive the problem must lie in the fibre network
> area. If I yank a disk out of the array, the detection is instant by
> GPFS.
I assume that you are using multipath access to your disks. Thus, in my
understanding, you disabled a LUN, since a simple disk failure should
not lead to a GPFS failure.


> The controller must be signalling back to aix, the disk is
> removed. However, when the controller is takendown, no "disk loss"
> signal is given.

What do you mean by controller down? On our CX600 we have 2
controllers; even if one controller fails, the other one will take
over.

I would recreate the GPFS environment as a single-node GPFS cluster -
using only one node, no NSD server or anything of the sort.
Also set unmountOnDiskFail to yes.
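
i.e. something along these lines (check whether your GPFS level wants a
restart for it to take effect):
$ mmchconfig unmountOnDiskFail=yes
$ mmlsconfig | grep -i unmount   # verify the setting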

Then, for testing, you should simply remove one FC connection after
another on that test node and check how the node behaves. As soon as
the last path to your SAN disks is gone, the GPFS filesystem should be
unmounted.

After that, do the same procedure for your disk storage, meaning remove
one FC connection at a time from the storage and see how the node
behaves.

Verify your SAN setup. As an example:
is it allowed that, on a two-controller/two-port SAN array, a given node
connects from any HBA to any port on the storage, or not?

hth
Hajo

openstream rob

Dec 17, 2007, 6:03:54 AM
I'm setting it up for single node. I'll let you know later.

Yes, I have multipath: two HBAs per server, one going to each switch.
The DS4300 does have two controllers.
To simulate disk failure I pulled out the spindle with the GPFS FS on
(the other half of the failure group is on the other array at site B).
GPFS on all nodes at both sites carried on fine, and GPFS took the
removed NSD down. Two controllers don't help in this case, as there's no
data disk.

All I'm trying to do is simulate total site failure, so powering down
the rack or the array takes out both controllers in the downed array.
In this case GPFS on the surviving site doesn't fail over.

The SAN setup is:

Each server has 2 HBAs, each connected to a separate switch. Each switch
connects to a single controller on each of the two arrays.
As this is spread over two sites, each "single switch" is actually two
switches, one at each site, joined in the same zone so as to act as one
switch.

I'll simplify and feed back.

Cheers

Rob

openstream rob

Dec 17, 2007, 11:23:34 AM
Hi

I've cleared GPFS down to 1 node. It still has one filesystem, sfs,
split into 2 failure groups with half on each disk array: sfs1 and sfs2
(plus the FS quorum disk).
The NSDs still have a primary server specified, but it's the main, and
only, node.

It mounts up fine. Access is normal.

I can unplug the first HBA from the node and all is fine: both sfs1
and sfs2 are up and ready, and filesystem access is normal.
I unplug the second and access to the filesystem is blocked; it's
difficult to see the output from mmlsdisk sfs, but I suspect it still
"thinks" the NSDs are up and ready.

After putting the wires back in and resetting the env, I repeated from
the other end.

I have 4 fibres into each array.
I can remove 3 with no effect on GPFS. The fabric resilience still gives
access to both arrays, so no problem.
When I remove the final fibre, access to sfs is blocked and all I/O
is left hanging (after some initial read-only access from cache).
mmlsdisk sfs reports that sfs1 and sfs2 are up and ready!! After 2
minutes sfs2 is finally reported as "unavailable, down" and I/O to the
sfs filesystem is allowed via sfs1.

When I plug the fibres back in, it takes 15 seconds before "mmchdisk
sfs start -d sfs2" allows sfs2 to be brought "up" again; prior to that
an I/O error is returned. After a quick restripe, everything works fine
again.
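
(For reference, the recovery sequence here is presumably something like:
$ mmchdisk sfs start -d sfs2   # bring the failed NSD back up
$ mmrestripefs sfs -r          # restore replication after the outage
)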

FYI, I did get the time down to approximately 2 minutes by setting
the dar0 and dar1 attribute "switch_retries" to 0.

What are aen_frequency and hlthchk_freq? They are both set to 600
seconds by default.
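
(The dar attributes and their current values can be listed with lsattr,
and the switch_retries change mentioned above would have been along the
lines of:
$ lsattr -El dar0                      # list the dar attributes and current values
$ chdev -l dar0 -a switch_retries=0    # as per the change described above
)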

Many thanks for helping

Rob

Hajo Ehlers

Dec 19, 2007, 9:25:29 AM
...
In case you have Gigabit PCI-E single or dual port Fibre Channel
adapters (feature codes 5773 or 5774), please read about and apply the
latest fix.

http://www14.software.ibm.com/webapp/set2/subscriptions/pqvcmjd?mode=18&ID=4012#IY99608

hth
Hajo

openstream rob

Dec 21, 2007, 8:38:34 AM
Hi Hajo

I've checked the devices; we have 5773/5774s.

I've raised a PMR with IBM and they've come back to tell me that "AIX
and the DS4000 are working as designed."

This is a bit disappointing, as the documentation says it provides
transparent failover on site failure. Although failover does occur
eventually, I would say that this is not entirely transparent. Their
case is that GPFS relies on the underlying SAN and OS to respond
to the site loss. They say that LVM suffers the same issues and that
it's not related to GPFS.

I have now got our team reworking our software to work around GPFS;
hopefully this will be successful, although there will be extended
impacts to analyse.

Thanks for your help, again and again.
Have a good holiday time

Rob

Hajo Ehlers

Dec 21, 2007, 9:15:35 AM
Why not use GPFS replication of data and metadata? In that case a
site failure should do no harm at all, since the mirrored copy would be
used.
Hajo

openstream rob

Dec 31, 2007, 5:59:40 AM

Hi Hajo

This scenario is using replication; there is a good copy at each site.
GPFS freezes while the replication fails, as the site has dropped
(mmlsdisk sfs still shows both mirrors as ready and up). When the disk is
finally recognised as no longer available (1-2 minutes), GPFS is happy to
continue on the surviving copy.

I had feedback from IBM, who are happy this is working within design.
I'm not convinced this is "transparent failover" myself, with such a
delay. The view from Blue is that the OS (AIX) and the SAN are
responsible for the delay. This seems odd, as GPFS is mostly going to be
working on top of a SAN.

We've re-integrated the application over the holiday and now don't
rely so heavily on GPFS. Our main problem is now worked around,
although we will have to live with the delay in certain cases.

Thanks for your help

Rob
