My continued investigation and attempts to reconnect the lost LSM
configuration point to likely corruption of the configuration data.
I forgot to mention that I'm on Tru64 5.1B and have only remote
command-line access to the machine, with nobody on site who knows this
platform and could help me fix it.
The vold daemon can start but finds no valid configuration copies:
vold -k -x log
lsm:vold: ERROR: enable failed: Error in disk group configuration copies
Disk group has no valid configuration copies; transactions are disabled.
I tried many things with vold and voldctl, even volprivutil to pull up one
of the configuration copies that I know is there.
The volprivutil dumpconfig that I did on the private region of the hot spare
drive, which I know does contain a configuration copy, does list the
configuration but also gives an error:
lsm:volconfigdump: ERROR: Error (File block 8): Format error in
configuration copy
I take this to mean that the configuration on disk is corrupt.
I also assume all the configuration copies contain the same thing, so
I expect any of them would dump the same way.
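One way to check that assumption, as a minimal sketch: it assumes each
configuration copy has already been dumped to its own text file (the dump
step itself is not shown, and the function name and file names are
hypothetical). It only compares checksums of the saved dumps and touches
nothing on the disks.

```shell
#!/bin/sh
# compare_dumps FILE...  (hypothetical helper, not an LSM command)
# Compares saved configuration dumps against the first file given.
# Returns 0 if all files are identical, 1 if any file differs.
compare_dumps() {
    ref=""
    rc=0
    for f in "$@"; do
        # Reading from stdin makes cksum print only "<crc> <bytes>",
        # so the file name does not affect the comparison.
        sum=$(cksum < "$f")
        if [ -z "$ref" ]; then
            ref=$sum
            echo "reference: $f"
        elif [ "$sum" = "$ref" ]; then
            echo "same:      $f"
        else
            echo "DIFFERS:   $f"
            rc=1
        fi
    done
    return $rc
}
```

It could be called as "compare_dumps dsk0.cfg dsk3.cfg dsk6.cfg", where
those names stand for whatever files the dumps were saved to. If every copy
comes back the same, the corruption is probably in all of them; if one
differs, that copy is worth a closer look.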
Since the vold daemon has no valid configuration loaded, the usual
commands fail, for example:
volprint
lsm:volprint: ERROR: No disk groups loaded
If I start vold in disabled mode:
voldctl mode
mode: disabled
And I try to enable it, I get more hints of corruption in the configuration:
voldctl enable
lsm:vold: ERROR: enable failed: Error in disk group configuration copies
Disk group has no valid configuration copies; transactions are disabled.
lsm:voldctl: ERROR: enable failed: Error in disk group configuration copies
I do have an old volsave configuration from 2006, which should be fairly
close to the present one because hardly anything has changed in all that
time except for drive swaps and perhaps slight naming changes on plexes
and subdisks, small details like that. I had also added log plexes, which
aren't enabled now because of this issue.
The only volume I could not re-enable and re-mirror was the var volume,
which also has the logging plexes and subdisks with it.
With vold unable to load any valid configuration, I can't use any
configuration commands and get information. Using volprivutil and editing
text files manually might be my only options to troubleshoot and fix this.
Having just about all the information on the configuration, I should be able
to patch up whatever is broken in the configuration and attempt to get it
back to a usable state, if I only knew what to patch up exactly and how.
Once vold can load a configuration, I should be able to fix up whatever
small detail isn't quite right in the configuration (maybe).
I won't take the risk of a reboot right now. I think it's highly likely that
something would go badly wrong on restart and the system would probably not
get back to a stable state. With nobody on site who knows the platform, that
would be too difficult to fix remotely (the machine is on the US east coast
and I'm in Europe at present). The only access I have to fix this is the
command line. X11 and ssh are both disabled on that machine right now, so
that doesn't leave much else.
I am digging into the "tru64 filesystem administration handbook" for info on
the structure and hints on how I could patch this up manually, but I haven't
figured this out yet.
Can anyone get me on the right track?
Thanks all,
on 12/13/10 11:35 PM, Didier Godefroy at l...@ulysium.net uttered the
following:
> Hello all who's left on this list,
>
> I haven't had any troubles with my systems for years now so I haven't been
> active on this list, but now I have a problem that's becoming urgent.
>
> What happened is that I had a failed disk that I replaced and I was only
> able to reestablish part of the mirroring that was there before.
>
> The machine is an alpha 1200 and I have drives in all 7 bays, with the scsi
> bus split and two controllers, so I have 4 drives on one controller and 3 on
> the other.
>
> I have my disks named as dsk0..dsk6 with dsk0, dsk1, dsk2 and dsk6 on the
> first scsi controller on the 4 top bays, and then dsk3, dsk4 and dsk5 on the
> other controller.
>
> The dsk0 is the boot disk, with root, swap, usr and var in dsk0a, dsk0b,
> dsk0d and dsk0f respectively. That disk was encapsulated and it is mirrored
> on dsk3.
>
> I was thinking I had several configuration copies, and I'm pretty sure I did,
> but now after replacing the dsk3 that had failed, after trying to
> reestablish all the mirroring, something must've happened and there are no
> longer any configuration copies found.
>
> To replace the failed disk, I swapped it physically and used hwmgr to scan
> the new disk and then I swapped the name dsk3 with the newly created dsk10
> with dsfmgr so I would have that new drive in the same location and with the
> same name as the previous one.
>
> So far, this worked fine. Then I reestablished the mirroring on the root
> volume using voldiskadm with option 4 first to remove the failed disk and
> then re-add it with the option 5 which worked fine.
> I did the same thing with the swap volume and that also went fine.
> However, when I tried doing this with the usr and var volumes, I got
> errors such as these:
>
> lsm:volplex: ERROR: Volume usrvol, plex usr-pl-02, block 0: Plex write:
> Error: Write failure
> lsm:volplex: ERROR: sd usr-sd-02 in plex usr-pl-02 failed during attach
> lsm:volplex: ERROR: I/O error on plex usr-pl-02, not attached to volume
> usrvol
> lsm:volplex: ERROR: DRL write to volume usrvol failed:
> No such device or address
> lsm:volplex: ERROR: Write of active log to volume usrvol failed:
> No such device or address
> lsm:volplex: ERROR: I/O error on plex usrvol-01, not attached to volume
> usrvol
> lsm:volplex: ERROR: Volume varvol, plex var-pl-02, block 0: Plex write:
> Error: Write failure
> lsm:volplex: ERROR: sd var-sd-02 in plex var-pl-02 failed during attach
> lsm:volplex: ERROR: I/O error on plex var-pl-02, not attached to volume
> varvol
> lsm:volplex: ERROR: DRL write to volume varvol failed:
> No such device or address
> lsm:volplex: ERROR: Write of active log to volume varvol failed:
> No such device or address
> lsm:volplex: ERROR: I/O error on plex varvol-01, not attached to volume
> varvol
>
> I think it's mostly because of the configuration copies and the fpa logging.
> I had to revert to the previous state on both, and then I was finally able
> to get the usr volume back to normal as well, but was left with the var
> volume, which still wouldn't go back to mirroring.
>
> At some point I was no longer able to get any disk or other info from
> commands like voldisk, volprint, etc., and I thought the vold daemon
> had hung or something like that, so I tried issuing a "vold -k", which
> didn't change anything. I tried stopping vold using voldctl so I could
> restart it manually, but it starts and finds no configuration copies.
>
> Trying again to use voldisk, I get this:
>
> voldisk list
> lsm:voldisk: ERROR: Cannot get records from vold: Record not in disk group
>
> So vold is running, responding, but has no configuration copies to find.
>
> I only have one disk group, rootdg, and besides the mirror on dsk0/dsk3 for
> the root/swap/usr/var boot disk, I have dsk1/dsk4 and dsk2/dsk5 as mirrors
> for other data volumes, with some space reserved on each pair for an extra
> swap volume (3 swap volumes total).
>
> For now everything still works on the machine, but with a broken lsm
> configuration I could get in serious trouble if something else happened now.
>
> I was thinking I should try to add more configuration copies, but since
> nothing works right now I think it wouldn't work. And I'm not certain of the
> exact commands to use to manually add configuration copies and specify where
> to put them.
>
> Are there lsm gurus left on this list who could give me a hint?
>
> How can I get out of this bad situation?
>
> Thanks all,
--
Didier Godefroy
mailto:d...@ulysium.net