
Big problem with LSM and AdvFS


Didier Godefroy

Dec 19, 2010, 2:13:59 PM
to Tru64 Unix managers
Hello all,

I recently posted a message about a lost LSM configuration, following
problems that appeared after swapping out a dead drive.

The drive that went bad was the boot drive, which contains the
root/swap/usr/var filesystems, with the whole thing mirrored by LSM.

I had the bad drive swapped and tried recreating the mirroring.
Root and swap went fine, but usr and var then gave me errors and wouldn't
work at first. I was eventually able to get usr back properly, so I had
root/swap/usr mirrored normally again, but I could not get var back: it
failed with errors about volume sizes. That was odd, because no size
mismatch should have been possible; nothing had changed in size, the
replacement drive was identical to the removed one, and everything was as
before.
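
The replacement itself was the routine procedure, roughly this (from
memory, so the exact invocations may be slightly off):

# voldiskadm                  (menu-driven: "replace a failed or removed disk")
# volrecover -sb              (resynchronize the mirror plexes in the background)

It was during the usr and var resyncs that the errors first appeared.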

After the failed re-mirror of var, there was also a problem with the FPA
(fast plex attach) logging plex on root and the log plex on usr, both of
which live on subdisks located on the same partition as var; var has a log
plex from that same partition as well. So the LSM disk used for var holds
the three small subdisks for those three log plexes, plus the big subdisk
for the var plex.

What I don't understand is why those logs weren't mirrored on the other
var plex. I was left with "REMOVED" subdisks that were supposed to hold
the log plexes, so no logging could happen.
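
My understanding is that with a working configuration, those dead log
plexes could simply be dissociated and re-created on the new disk, along
these lines (syntax from memory):

# volplex -o rm dis varvol-01        (dissociate and remove the dead log plex)
# volassist addlog varvol var02      (re-create a log plex on the new var disk)

But as described below, the configuration commands stopped working before
I could try that.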
I'm not sure whether that was the likely cause, but at that point
corruption appeared in the LSM configuration. Although there were 4 active
copies of the configuration spread over multiple disks, they all went bad
and I lost the ability to use the LSM configuration commands.
The system kept running, but I couldn't make any corrections or get a
working configuration back.
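
To give an idea of the state it was in, even basic status commands (again
from memory) stopped returning a usable configuration:

# voldctl mode         (is vold running and enabled?)
# voldisk list         (are the disks still seen as online?)
# volprint -ht         (full listing of the configuration records)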

I was hoping, by posting to this list, to find hints on how to repair such
a lost, corrupted configuration before something forced the system to
reboot, which I was sure it would not survive.
Unfortunately I couldn't find a way to fix the LSM configuration, what I
feared happened, and a reboot was triggered.
As expected, the machine did not come back up, and I was stuck with a down
system.
The problem is that I'm very remote at the moment: I can't be physically
at the server room, and I have no knowledgeable people there to fix it.

The amount of data and the complexity of the configuration make a tape
backup impossible right now; the size of a backup is far too large for the
tape drives available. So I don't have such a backup, and no one on site
to restore one even if I did.

I do have an old LSM volsave backup, which is almost identical to the lost
configuration; the differences are mostly in the naming of plexes,
subdisks, etc., plus one small size difference in the var volume. I can't
recall exactly what caused that small difference, but I think it had to do
with the logging plexes being added. That was years ago, and since then
hardly anything else has changed.
I was hoping to get some level of repair done on the configuration,
possibly with the help of that volsave backup, but I didn't get to do
anything before the machine rebooted.
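
My understanding is that a saved description can be replayed with
volrestore; volsave writes under /usr/var/lsm/db by default, and the
restore would be something like this (per the volsave(8)/volrestore(8) man
pages as I remember them):

# volrestore -d /usr/var/lsm/db/LSM.<date>.<hostname>

But I'd want to be sure about that size difference in var before replaying
an old description over a live rootdg.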

With LSM having lost its config, I was thinking that if I could at least
get a Unix prompt, I could revert to a non-LSM system and rebuild the
mirroring later, but it won't even boot into single-user mode.
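
For reference, what I'm attempting from the console is a plain single-user
boot (standard SRM syntax; the device name is whatever the console lists
for that drive):

>>> boot -fl s dka0

The plan, had I gotten a prompt, was to un-encapsulate root and swap with
volunroot, though I don't know whether volunroot can even run against a
corrupted configuration.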

The boot doesn't get far at all: some corruption slipped into the AdvFS
root domain, and it prevents the bootstrap from finding the osf_boot file
at the root level.

So it never even gets to the point of starting LSM; it panics and reboots
in a loop.

I had a technician remove the mirror set, in case the first set would boot
on its own and give me a Unix prompt from which to remove LSM. Without the
mirror, nothing changed.

I had the boot drive from the removed mirror set moved over to another
machine of a similar type, so I could try mounting the AdvFS filesystems
and see whether I could make some corrections. I can mount the usr domain
and it looks fine, but the root domain would not mount. Before attempting
any tinkering on that root partition, I figured I'd work only on a
duplicate, which I made with dd.
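
The duplicate is a raw copy of the entire disk via the c partition, along
these lines (dsk9 is just a stand-in for whatever scratch disk gets used):

# dd if=/dev/rdisk/dsk3c of=/dev/rdisk/dsk9c bs=64k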

So I'm working on a duplicate of the boot drive from the half mirror of the
downed system.

That downed system is an AlphaServer 1200 running Tru64 UNIX 5.1B. All 7
bays are full, on a split SCSI bus: the top 4 drives are on one controller
and the other 3 are on another.
The top 3 drives are one half of the LSM mirror, with a hot spare on the
4th drive, and the bottom 3 drives are the other half of the mirror.
The first drive in each set is the root/swap/usr/var boot drive (18 GB).

Here is a volprint showing this configuration, from right before the
corruption caused the LSM configuration to be lost:

volprint
Disk group: rootdg

TY NAME ASSOC KSTATE LENGTH PLOFFS STATE TUTIL0 PUTIL0
dg rootdg rootdg - - - - - -

dm conf02 dsk1e - 0 - - - -
dm extra01 dsk2 - 8381009 - - - -
dm extra02 dsk5 - 8381009 - - - -
dm root01 dsk0a - 307200 - - - -
dm root02 dsk3a - 307200 - - - -
dm spare dsk6 - 71128848 - SPARE - -
dm spool01 dsk1a - 8191984 - - - -
dm spool02 dsk4a - 8191984 - - - -
dm srv01 dsk1d - 59864864 - - - -
dm srv02 dsk4d - 59864863 - - - -
dm sswap01 dsk1b - 3072000 - - - -
dm sswap02 dsk4b - 3072000 - - - -
dm swap01 dsk0b - 3072000 - - - -
dm swap02 dsk3b - 3072000 - - - -
dm usr01 dsk0d - 20480000 - - - -
dm usr02 dsk3d - 20480000 - - - -
dm var01 dsk0f - 11701784 - - - -
dm var02 - - - - REMOVED - -

v eswapvol gen ENABLED 3072000 - ACTIVE - -
pl eswap-pl-02 eswapvol ENABLED 3072000 - ACTIVE - -
sd eswap-sd-02 eswap-pl-02 ENABLED 3072000 0 - - -
pl eswap-pl-01 eswapvol ENABLED 3072000 - ACTIVE - -
sd eswap-sd-01 eswap-pl-01 ENABLED 3072000 0 - - -

v ftpvol fsgen ENABLED 2047984 - ACTIVE - -
pl ftp-pl-02 ftpvol ENABLED 2048000 - ACTIVE - -
sd ftp-sd-02 ftp-pl-02 ENABLED 2048000 0 - - -
pl ftp-pl-01 ftpvol ENABLED 2048000 - ACTIVE - -
sd ftp-sd-01 ftp-pl-01 ENABLED 2048000 0 - - -

v rootvol root ENABLED 307200 - ACTIVE - -
pl root-pl-01 rootvol ENABLED 307200 - ACTIVE - -
sd root-sd-01p root-pl-01 ENABLED 16 0 - - -
sd root-sd-01 root-pl-01 ENABLED 307184 16 - - -
pl rootvol-01 rootvol DISABLED FPAONLY - RECOVER - -
sd var02-01 rootvol-01 DISABLED 65 FPA REMOVED - -
pl root-pl-02 rootvol ENABLED 307200 - ACTIVE - -
sd root-sd-02p root-pl-02 ENABLED 16 0 - - -
sd root-sd-02 root-pl-02 ENABLED 307184 16 - - -

v spoolvol fsgen ENABLED 8191984 - ACTIVE - -
pl spool-pl-01 spoolvol ENABLED 8191984 - ACTIVE - -
sd spool-sd-01 spool-pl-01 ENABLED 8191984 0 - - -
pl spool-pl-02 spoolvol ENABLED 8191984 - ACTIVE - -
sd spool-sd-02 spool-pl-02 ENABLED 8191984 0 - - -

v srvvol fsgen ENABLED 59864863 - ACTIVE - -
pl srv-pl-01 srvvol ENABLED 59864864 - ACTIVE - -
sd srv-sd-01 srv-pl-01 ENABLED 59864864 0 - - -
pl srv-pl-02 srvvol ENABLED 59864863 - ACTIVE - -
sd srv-sd-02 srv-pl-02 ENABLED 59864863 0 - - -

v sswapvol gen ENABLED 3072000 - ACTIVE - -
pl sswap-pl-02 sswapvol ENABLED 3072000 - ACTIVE - -
sd sswap-sd-02 sswap-pl-02 ENABLED 3072000 0 - - -
pl sswap-pl-01 sswapvol ENABLED 3072000 - ACTIVE - -
sd sswap-sd-01 sswap-pl-01 ENABLED 3072000 0 - - -

v swapvol swap ENABLED 3072000 - ACTIVE - -
pl swap-pl-02 swapvol ENABLED 3072000 - ACTIVE - -
sd swap-sd-02 swap-pl-02 ENABLED 3072000 0 - - -
pl swap-pl-01 swapvol ENABLED 3072000 - ACTIVE - -
sd swap-sd-01 swap-pl-01 ENABLED 3072000 0 - - -

v tmpvol fsgen ENABLED 3261009 - ACTIVE - -
pl tmp-pl-01 tmpvol ENABLED 3261009 - ACTIVE - -
sd tmp-sd-01 tmp-pl-01 ENABLED 3261009 0 - - -
pl tmp-pl-02 tmpvol ENABLED 3261009 - ACTIVE - -
sd tmp-sd-02 tmp-pl-02 ENABLED 3261009 0 - - -

v usrvol fsgen ENABLED 20480000 - ACTIVE - -
pl usr-pl-02 usrvol ENABLED 20480000 - ACTIVE - -
sd usr-sd-02 usr-pl-02 ENABLED 20480000 0 - - -
pl usr-pl-01 usrvol ENABLED 20480000 - ACTIVE - -
sd usr-sd-01 usr-pl-01 ENABLED 20480000 0 - - -
pl usrvol-01 usrvol DETACHED LOGONLY - STALE - -
sd var02-02 usrvol-01 DISABLED 325 LOG REMOVED - -

v varvol fsgen ENABLED 11701784 - ACTIVE - -
pl var-pl-02 varvol DISABLED 11701784 - REMOVED - -
sd var-sd-02 var-pl-02 DISABLED 11701784 0 REMOVED - -
pl var-pl-01 varvol ENABLED 11701784 - ACTIVE - -
sd var-sd-01 var-pl-01 ENABLED 11701784 0 - - -
pl varvol-01 varvol DISABLED LOGONLY - RECOVER - -
sd var02-03 varvol-01 DISABLED 195 LOG REMOVED - -

The disklabels on the dsk0 and dsk3 drives have these partitions:

# size offset fstype fsize bsize cpg # ~Cyl values
a: 307200 0 unused 0 0 # 0 - 60*
b: 3072000 307200 unused 0 0 # 60*- 665*
c: 35565080 0 unused 0 0 # 0 - 7000
d: 20480000 3379200 unused 0 0 # 665*- 4696*
e: 4096 23859200 unused 0 0 # 4696*- 4697*
f: 11701784 23863296 unused 0 0 # 4697*- 7000
g: 17585932 393216 unused 0 0 # 77*- 3539*
h: 17585932 17979148 unused 0 0 # 3539*- 7000

The e partition was meant to hold one more configuration copy, but it was
never used.
There are (were) 4 active configuration copies.
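
If I remember right, the per-disk state of those copies can be checked
with voldisk, e.g.:

# voldisk list dsk1e

which on a healthy disk shows the configuration copies and whether each
one is enabled.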

With the drive from the half mirror moved to that other machine (it was
dsk3), I can see there is some corruption in the AdvFS volume; although
it's enough to cause an AdvFS panic and prevent a boot, it's not enough to
prevent recovering all the files from it.
After duplicating that drive so I could work on a copy only, the root
domain is for some reason now mountable, despite the corruption. I don't
understand why it won't mount from the original drive while its copy
does...
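
For the record, mounting the copy by hand on that other machine works like
this (root_copy is just a name I made up for the copied domain):

# mkdir /etc/fdmns/root_copy
# ln -s /dev/disk/dsk9a /etc/fdmns/root_copy/dsk9a
# mkdir /mnt/root_copy
# mount -t advfs root_copy#root /mnt/root_copy

An AdvFS domain is really just a directory of device symlinks under
/etc/fdmns, so a copied partition can be mounted under its own domain
name.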

Anyway, I ran a salvage and got all the contents back intact from that
root partition.
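
That was with the AdvFS salvage utility, roughly as follows (check
salvage(8) for the exact options on 5.1B; -D names the directory the
recovered tree is written into):

# /sbin/advfs/salvage -D /mnt/recover root_copy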


Now, how can I patch up that AdvFS corruption so the domain stops causing
a panic and allows a boot?
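
The only tools I'm aware of for this are the AdvFS verify and fixfdmn
utilities, i.e. something like (exact flags per the man pages; I haven't
dared run fixfdmn yet):

# /sbin/advfs/verify root_copy        (check the domain's on-disk metadata)
# /sbin/advfs/fixfdmn root_copy       (attempt to repair the metadata)

If anyone knows whether fixfdmn can repair whatever is keeping the
bootstrap from finding osf_boot, that's exactly the hint I need.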


I can use that other machine to work on the copy and attempt a fix, then
hopefully get the drive back into the original machine and boot to a Unix
prompt.

I don't have a Tru64 UNIX install CD handy and I'm not certain where it
is; I thought I had brought it with me, but I can't locate it.

I do have remote SRM console access via a serial port, just in case, but
that won't let me work on this, so using the other machine is the only way
right now.

I have the books "Tru64 UNIX System Administrator's Guide" and "Tru64 UNIX
File System Administration Handbook", but I found nothing helpful for this
case in either.

What I was thinking of doing is using the existing working root partition
on this running machine to re-create a working root partition, then
somehow using the salvaged contents from the downed machine's root to get
it to a similar enough configuration, perhaps patching the config so it
does not try to start LSM and boots straight from the AdvFS partitions. I
could doctor the links in /etc/fdmns to point back at the AdvFS partitions
instead of the LSM volumes, as sketched below.
The problem is that there are many symlinks I may not be able to recreate
properly, and there are all those clustering context-dependent symbolic
links (CDSLs) that I don't know how to handle.
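
The /etc/fdmns part at least looks mechanical; for each domain, something
like this (usr_domain as an example name; partition letters per the
disklabel above):

# cd /etc/fdmns/usr_domain
# rm usrvol                      (drop the link to the LSM volume)
# ln -s /dev/disk/dsk0d          (point the domain back at the raw partition)

Plus, if I remember right, turning off the LSM root settings in
/etc/sysconfigtab (the lsm_rootdev_is_volume and lsm_swapdev_is_volume
attributes) and fixing the swap device entry. The CDSLs are the part I
have no recipe for; I gather mkcdsl(8) is the tool for recreating them,
but I've never used it.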

I'm sure someone out there has done something somewhat similar in the past.

Can someone give me some pointers on patching up this root domain?

Bear in mind that I can't use a backup, I'm not at the machine, I don't
have a Unix CD, and there is too much data and complexity in the
configuration to start over from a fresh install. And I don't have anyone
knowledgeable on this platform on location; I can have someone there issue
commands and work with the hardware, nothing more.


Anyone? Please? I'm stuck with a downed machine for almost a whole day
now...


Thanks all,


PS: since my main email is on that machine, please use the address below
to reply, or I won't get it until that machine is back online.


--
Didier Godefroy
mailto:d...@orbhost.net

