philo wrote:
> Doug Freyburger wrote:
>
>> One of my systems has a crappy HBA. Every time it reboots I have to
>> "vgchange -a n vgEXT", "vgexport vgEXT", "vgimport vgEXT",
>> "vgchange -a y vgEXT" then run fsck against its logical volumes before
>> mounting. Swapping the HBA cards did not help.
>
>> Problem is the mount point has several million files of metadata
>> associated with a large Oracle database. Backups are so slow they
>> haven't completed in a month.
>
> Even a *large* data base could not possibly have taken over a month to
> back up.
Unless the client elects to purchase a backup package that I'd never
heard of instead of one of the ones I recommended.
> You should have fixed the problem when it first became evident.
Unless the client chose to have their backups handled by their Windows
admins who never did so. Unless the client chose to not spend the
billable hours to have us fix the problem.
> Since you have been through this before and apparently had success,
> hopefully you will be able to pull it off again.
An algorithm below might work. Else there are a couple of other
possibilities that are being discussed.
> When you are all done it will be time for you to fix the problem that
> caused it
This client has a long history of making expensive decisions. Sometimes
they are penny wise, pound foolish. Sometimes the decisions sound
buzz-word-y and cool but are really just more expensive and less
reliable. Once again I will suggest moving to a backup product that
actually works. Once again I'll send in the bill for a lot of hours,
one way or the other.
One lesson learned - I will start to produce a weekly inode-to-name
listing of every file on every server I support. And bill for writing
the script. Very profitable, given how many clients will get the script
because of this event at one of them.
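
A rough sketch of what such a listing script might look like, in Python.
This is not the script I will actually deploy. The function names and the
tab-separated output format are assumptions made for illustration. It
records the device number next to each inode because inode numbers are
only unique within one filesystem.

```python
import os

def inode_listing(root):
    """Yield (device, inode, path) for every entry under root.

    Symlinks are not followed, and other filesystems mounted below
    root are skipped, since their inode numbers are unrelated.
    """
    root_dev = os.lstat(root).st_dev
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune directories that live on a different filesystem.
        dirnames[:] = [
            d for d in dirnames
            if os.lstat(os.path.join(dirpath, d)).st_dev == root_dev
        ]
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            yield st.st_dev, st.st_ino, path

def write_listing(root, out_path):
    """Write one 'device<TAB>inode<TAB>path' line per entry (assumed format)."""
    with open(out_path, "w") as out:
        for dev, ino, path in inode_listing(root):
            out.write(f"{dev}\t{ino}\t{path}\n")
```

Run weekly from cron against each mount point, with the output kept on a
different filesystem than the one being listed. After an fsck dumps
entries into lost+found under their inode numbers, the listing gives the
original names back.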
On the HBA problems -
Part of me observes this and continues to recommend professional
systems for my clients. Sun, HP and IBM commercial servers don't do
this. Commodity hardware does do this and many other sorts of failures
that don't happen on Solaris, HP-UX and AIX. Heck, today I'm doing a
replacement of a mirrored boot drive on an AIX box. Damn thing has
continued to keep the mirror intact for a month using up more and more
reassignment blocks and just plain kept running even on a failed drive.
Ah the standard AIX experience. AIX may be unpleasant to work on but
you can't kill it with a plasma torch.
Part of me observes my bill for support hours and figures it's
profitable for me and for my company, and much of the debugging is fun
because it's challenging.
On the other options -
There are 4 restore options possible. Another consultant on my team
has been working all day on calculating recovery using data available
in combined lost+found and Avamar log files. A contractor from another
firm has been working all day on getting a LUN that contains the
original data in HP-UX format mounted to a legacy HP-UX host to use
rsync. A local Windows admin at the client has been working all day on
assembling partial Avamar backups into a full Avamar restore. The
fourth possibility is to reinstall Legato Networker on an HP-UX host
and import all of the expired tapes back in.
The Legato approach would definitely work but it might take 3 weeks so
it has not yet been addressed. To start we'd have to install HP-UX
11.23 on another legacy HP-UX host for it to be able to support the tape
silo, then install Networker and so on. The Legato approach would also
be extremely profitable for me because all of the hours would be
billed separately, not as part of the standard fixed price contract.
All in all if that LUN to HP-UX works out we'll go with it and the
rsync. Run mkfs on the logical volume and start copying. Easy peasy.
1) If the LUN is available then the rsync will take about 2-3 days. Big if.
2) If the Avamar partial backups can be assembled into a full restore
that will take 2-3 days. No guarantee there are enough partial backups
to make a full.
3) Below is an algorithm that I have designed to try to restore directly.
4) If we end up needing to go the Legato Networker route that will take
on the order of 3 weeks. Unfortunately this is the only option that's
certain to work. Because of the length of time and the cost in task pack
hours involved, the three other options will be exhausted before
falling back on it.
Note that there are 535 directories in lost+found. And I counted the
digits wrong for regular files. There are about 650,000 of them in
lost+found, out of 2+ million total files on the mount point. All
regular files on the mount point are *.tif images.
This client needs to purchase Documentum!
The algorithm that is being attempted -
One of our DBAs is now developing the Oracle query needed to calculate
the full path to any one *.tif file. The query is needed for the first
loop in the algorithm:
For each directory under /mnt/lost+found do
    Find the name of one *.tif file under it and write down its exact path under lost+found.
    Find that *.tif file in Oracle and calculate what its complete directory path should be.
    Build a table of inode-numbered directories to calculated full directory paths.
Done
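
A minimal Python sketch of this first loop, under stated assumptions.
The lookup_full_path callable is a hypothetical stand-in for the Oracle
query the DBA is writing. Here it takes a *.tif file name and returns
the full path the database says it belongs at, or None. Looking files up
by bare file name assumes the names are unique, which may not hold.

```python
import os

def find_one_tif(directory):
    """Return the path of one *.tif file under directory, or None."""
    for dirpath, _dirnames, filenames in os.walk(directory):
        for name in sorted(filenames):
            if name.lower().endswith(".tif"):
                return os.path.join(dirpath, name)
    return None

def build_directory_map(lost_found, lookup_full_path):
    """Map each orphaned directory in lost_found to its calculated path.

    Returns (table, unresolved): table maps the lost+found directory to
    the directory path it should have; unresolved lists directories
    with no *.tif sample or no answer from the lookup.
    """
    table = {}
    unresolved = []
    for entry in sorted(os.listdir(lost_found)):
        orphan = os.path.join(lost_found, entry)
        if os.path.islink(orphan) or not os.path.isdir(orphan):
            continue  # regular files are handled in a later step
        sample = find_one_tif(orphan)
        full = lookup_full_path(os.path.basename(sample)) if sample else None
        if full is None:
            unresolved.append(orphan)
            continue
        # Strip the file name, plus one level for every subdirectory
        # between the orphan and the sample file.
        rel = os.path.relpath(sample, orphan)
        target = full
        for _ in range(rel.count(os.sep) + 1):
            target = os.path.dirname(target)
        table[orphan] = target
    return table, unresolved
```

The unresolved list is not part of the written algorithm. It is there so
a directory with no usable sample file gets reported instead of silently
dropped.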
Topologically/alphabetically sort the table of directory mappings so
they are processed shallow first.
For each directory in the table do
    See if the parent directory needs to be created. Note which ones don't exist, because those are suspects for the regular files under /mnt/lost+found.
    Rename the directory from lost+found to its correct place in the tree.
Done
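
The sort and rename loop might be sketched like this in Python. It
assumes the table is a dict from lost+found directory to calculated
target path, which is my choice of representation and not something
already fixed. Sorting by path depth and then by name gives the shallow
first order, so a parent that is itself in the table is moved into place
before its children need it.

```python
import os

def shallow_first(table):
    """Order (orphan, target) pairs by target depth, then alphabetically."""
    return sorted(table.items(),
                  key=lambda item: (item[1].count(os.sep), item[1]))

def relocate(table):
    """Rename each orphaned directory to its target, shallowest first.

    Returns the parent directories that had to be created. Those were
    not recovered as directories, so their files are suspects for the
    loose regular files in lost+found.
    """
    created_parents = []
    for orphan, target in shallow_first(table):
        parent = os.path.dirname(target)
        if not os.path.isdir(parent):
            os.makedirs(parent)  # also creates any missing ancestors
            created_parents.append(parent)
        if os.path.exists(target):
            # Refuse to merge into or clobber something already there.
            raise FileExistsError(target)
        os.rename(orphan, target)  # same filesystem, so no data is copied
    return created_parents
```

On the real mount point I would first run it with os.rename replaced by
a print and read the plan before moving anything.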
For each directory in the table do
    Compare the list of regular files in it to the listing from the backups.
    If there are any missing directories, note them down as incomplete ones.
    Restore the specific files from backup, counting as we go. Should be 535 small restore jobs.
Done
If the restore count equals the number of regular files in /mnt/lost+found we are done.
Else look at restoring from the list marked as incomplete in the previous loop.
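
A sketch of the compare-and-restore loop and the final count check, in
Python. Everything about the backup side is an assumption:
backup_listing is a hypothetical dict from directory path to the set of
file names the backup holds for it, and restore is a hypothetical
callable wrapping whatever the backup product provides. The sketch
treats a directory as incomplete when the backup names files that are
not on disk, which is one reading of the step above.

```python
import os

def reconcile(directories, backup_listing, restore):
    """Restore files the backup knows about but the disk lacks.

    Returns (restored, incomplete): the number of files restored and
    the directories that were missing at least one file.
    """
    restored = 0
    incomplete = []
    for directory in directories:
        present = {
            name for name in os.listdir(directory)
            if os.path.isfile(os.path.join(directory, name))
        }
        missing = sorted(backup_listing.get(directory, set()) - present)
        if missing:
            incomplete.append(directory)
        for name in missing:
            if restore(directory, name):  # True means the file came back
                restored += 1
    return restored, incomplete

def count_orphan_files(lost_found):
    """Count the regular files sitting directly in lost+found."""
    return sum(
        1 for name in os.listdir(lost_found)
        if os.path.isfile(os.path.join(lost_found, name))
    )
```

If restored equals count_orphan_files("/mnt/lost+found"), every loose
file is accounted for. Otherwise the incomplete list is where to look
next.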