The Unofficial Unix Administration Horror Stories Summary (part 3)

A.X. Ivasyuk

Dec 3, 1992, 11:26:31 AM

Part 3 of the Unix Administration Horror Stories Summary

-----------------------------------------------------------------------------

From: jo...@cortex.physiol.su.oz.au (John Dodson)
Organization: Department of Physiology, University of Sydney, NSW, Australia

Some years ago when we went from Version 7 Unix on a PDP11 to a flavour of BSD
on a Vax, I was working on the Vax in my home directory & came across a file
that I had no permission on (I'd created it as root) so the following ensued...

$ /bin/su -
Password:
# chown -R me *

mmmmm this seems to be taking a long time !
kill.
# ls -l

the result was that I was in / after the su !
(good old V7 su used to leave you in the current directory ;-)

It took me quite a while to restore all the right ownerships to /bin /etc & /dev
(especially the suid/sgid files)
I'd managed to kill it before it got off the root filesystem.

not quite rm -fr / but...
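
[Ed. note: a minimal sketch of one way to guard against this, not the
poster's own fix. The helper name safe_recursive is hypothetical. It
refuses to run a command when the real working directory turns out to
be /, which is where "su -" had silently left him.]

```sh
# Hypothetical helper (not from the original post): run a command only
# if the current directory is not the root directory.
safe_recursive() {
    # Ask the shell where we really are instead of assuming.
    here=$(pwd -P) || return 1
    if [ "$here" = "/" ]; then
        echo "refusing to run '$*' in /" >&2
        return 1
    fi
    "$@"
}
```

Something like "safe_recursive chown -R me ." would then fail loudly in /
rather than start walking the root filesystem.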

-----------------------------------------------------------------------------

From: j...@coombs.anu.edu.au (J. McPherson)
Organization: Australian National University

A few months ago in comp.sys.hp, someone posted about their repairs to an
HP 7x0, after a new sysadmin had just started work. They {the new
person} had been looking through the file system to try to make some
space, saw /dev and the mainly 0 length files therein. Next command was "rm
-f /dev/*" and they wondered why they couldn't login ;)

I think the result was that the new person was sent on a sysadmin's
course a.s.a.p.

;)

-----------------------------------------------------------------------------

From: pin...@IRO.UMontreal.CA (Francois Pinard)
Organization: Universite' de Montre'al

Many things happened in those many years I've been with computers.
The most horrorful story I've seen is not UNIX related, but it is
certainly worth a tale. Here it goes.

This big (:-) CDC 6600 system was bootable from tape drive 0, using
these 12 inches wheels containing 1/2" tape. The *whole* system was
reloaded anew from the tape each time we restarted the machine,
because there was no permanent file system yet, the disks were not
meant to retain files through computer restarts (unbelievable today, I
know :-). The deadstart tapes (as they were called) were quite
valuable, and we were keeping at least a dozen backups of those, going
back maybe one or two years in development.

The problem was that the two vacuum capstans which were driving the
tape 0, near the magnetic heads, were not perfectly synchronized, due
to a hardware misadjustment. So they were stretching the tape while
they were reading it, wearing it in a way invisible to the eye, but
nevertheless making the tape irrecoverable. Besides that, everything
was looking normal in the tape physical and electrical operations. Of
course, nobody knew about this problem when it suddenly appeared.

All this happened while all the system administration team went on
vacation at the same time. Not being a traveler, I just stayed
available `on call'. The knowledgeable operators were able to solve many
situations, and being kind guys for me (I was for them :-), they would
not disturb me just for a non-working deadstart tape. Further, they
had a full list of all deadstart backup tapes. So, they first tried
(and destroyed) half a dozen backups before turning the machine to the
hardware guys, who themselves destroyed a few more.

The technicians had their own systems for diagnostics, all bootable
from tape drive 0, of course. They had far fewer backups than we did.
They destroyed almost all of them before calling me in. Once told what
happened, my only suggestion was to alter the deadstart sequence so as
to be able to boot from another tape drive. Strangely enough, nobody
had thought about it yet. In these old times, software guys were always
suspecting hardware, and vice versa :-).

Happily enough, the few tapes left started, both for production and
for the technicians. Tape drive 0 being quite suspectable, the
technicians finally discovered the problem and repaired it. My only
job left was to upgrade the system from almost one year back, before
turning it to operations. This was at the time, now seemingly lost,
when system teams were heavily modifying their operating system
sources. This was also the time when everything not on big tapes was
all on punched Hollerith cards, the only interactive device being the
system console. It took me many days, alone, having the machine in
standalone mode. The crowd of users stopped regularly at the windows
of the computer room, taking bets, as they were used to do, on how
fast I would get the machine back up (I got some of my supporters
losing their money, this time :-).

This was quite hard work for me, done under high pressure. When the
remainder of the staff returned from their trips, and when I told them the
whole tale, we decided to never synchronize our holidays again.

-----------------------------------------------------------------------------

From: gr...@unisys.co.nz (Grant McLean)
Organization: Unisys New Zealand

One of my customers (who shall remain nameless) was having a problem with
insufficient swap space. I recommended that he back up the system, boot
off the OS tape, repartition the disk, remake the filesystems and restore
the data (any idiot could do this, right? :-) ). I also suggested that if
he wasn't confident of achieving all this, we could provide a skilled
person for a modest fee. Of course he was fully confident so I left him
to it.

Next day I get a call from the guy to say he'd been there all night and
he'd had all sorts of funny messages when restoring from tape.

Eventually we tracked his problem down to the backup script he'd been
using. It was a simple one liner:

find / -print | cpio -oc | dd -obs=100k of=/dev/rmt0 2>/dev/null

This was a problem because:

1) His system had two 300MB drives
2) He only had a 150MB tape drive
3) The same script was being run every night by a cron job
4) All his backups were created by this script

(In case you haven't worked it out, the dd is to speed up writes to tape
but it has the unfortunate side effect that CPIO never finds out about
the end of tape. Because the errors were going to the bit bucket, they
never knew their backups were incomplete until they came to restore from
them).
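
[Ed. note: a rough sketch of the checking that one-liner lacked, under
stated assumptions. It swaps in tar writing to an ordinary file in place
of cpio and a tape device, and the function name backup_and_verify and
its three arguments are hypothetical. The idea is only to keep the error
output and to read the archive back before trusting it. File names that
contain newlines would throw the counts off.]

```sh
# Hypothetical helper: archive a directory, keep the error log, and
# compare the number of entries read back with the number intended.
backup_and_verify() {
    src=$1; archive=$2; log=$3
    # Count what we intend to back up (the directory itself included).
    want=$(cd "$src" && find . -print | wc -l | tr -d ' ') || return 1
    # Keep errors in a log instead of sending them to /dev/null.
    if ! (cd "$src" && tar cf - .) > "$archive" 2> "$log"; then
        echo "backup failed, see $log" >&2
        return 1
    fi
    # Read the archive back and count its entries.
    got=$(tar tf "$archive" 2>> "$log" | wc -l | tr -d ' ')
    if [ "$want" != "$got" ]; then
        echo "backup incomplete: wanted $want entries, archive has $got" >&2
        return 1
    fi
    echo "backup ok: $got entries"
}
```

Run from cron, a non-zero exit status or a non-empty log would at least
give someone a reason to look before the restore is needed.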

I would have loved to be a fly on the wall when he explained to his boss
that the data was gone and there was no way of getting it back.

I haven't heard from the guy since then. Hmmm ...

Grant

-----------------------------------------------------------------------------

From: a...@geac.com (Anthony DeBoer)
Organization: Geac Computer Corporation

In article <1992Oct10....@waggen.twuug.com> brob...@waggen.twuug.com (Bill Roberts) writes:
>My most interesting in that regard was when I deleted "/dev/null". Of
>course it was soon recreated as a "regular file", then permission problems
>started to show up.

I was once called in to save a system where most things worked, but the
main application package being used on it hung the moment you entered it
(leaving the system more than a little useless for getting things done).
I poked around for awhile, verified that the application's files were all
present, undamaged, and had the right permissions. The folks who
normally used the machine had also discovered that all was well if root
tried to run it. But nothing was visibly wrong anywhere. So, being a
bit hungry by then, I took a break for supper, and about halfway through,
the little voice at the back of my head that sometimes helps me said,
"/dev/tty". Sure enough, somebody had chmod'ded it to 0644, and the
application directed (or tried to direct, in this case) all its I/O
through it rather than just using stdin/stdout like a sane normal process.
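
[Ed. note: a small sketch of a check that would have flagged this, not
something from the post. The function name check_dev_mode is
hypothetical, and the expected mode string crw-rw-rw- is an assumption
that holds for /dev/null and /dev/tty on the systems I know of. It also
catches the quoted case of /dev/null turning into a regular file.]

```sh
# Hypothetical helper: report a device node whose type or permissions
# differ from the world-readable, world-writable character device
# expected for /dev/null and /dev/tty.
check_dev_mode() {
    dev=$1
    # First ten characters of "ls -ld" are the type and permission bits.
    perms=$(ls -ld "$dev" 2>/dev/null | cut -c1-10)
    case $perms in
        crw-rw-rw-)
            echo "$dev: ok ($perms)"
            ;;
        *)
            echo "$dev: unexpected type or mode '$perms'"
            return 1
            ;;
    esac
}
```

For example, "check_dev_mode /dev/tty" on the broken system would have
reported crw-r--r-- straight away.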

-----------------------------------------------------------------------------

From: nag...@menudo.uh.edu (Chaitanya Nagappa)
Organization: University of Houston

The following article is posted by a friend from my account:
Chai Nagappa
===================================================================
Hi,
This is Ravi. Needed to add just a couple of stories from all the weird stuff
that has happened. So, are these tales for around a campfire on Halloween?

At one time, there were three of us working on a unique SVR3.2 motorola based
machine, on a R&D project. I took care of all the SysAdmin tasks, I had a
back up administrator, and the third person had been stuck into my group
(company politics). The group project files were in /user and the individual
ones in /user2. We had managed to get backup from the operations department
for /user only (not even /; security paranoia?). Anyway, I had another scsi
hard disk that I used for making a disk copy of the primary scsi hard disk
every Friday. This disk was connected, but not mounted, so that I could
do the disk backup from my desk when I wanted to.
This machine used to sometimes get a scsi error such that you could not log
in, but the processes already running on the machine were not affected. If
you were logged in on the console, you just powered off the machine for a
few minutes and rebooted it. Around holiday time the other Admin was off on
a long vacation. I had taken Monday off, and headed off for a four day weekend.
The machine does the same blurp. The third person decides to power off the
machine & turn it back on immediately. It does not come up properly. She
decides to reinstall the machine using the installation tape that I had
unfortunately left in the open. Reformats the hard disk, installs the base
system, and is stuck at that point when I come back in on Tuesday. I almost
blow a blood vessel but try to keep calm 'cause I had made a disk copy about
10 days before (too anxious to get on my holiday the previous week). Try to
mount the disk... hit vacuum. Try using dd to look at the disk... Seemed
to be a large /dev/null :-? When the lady decided to reinstall the system,
it asked her what scsi disks she wanted to reformat, and she said "y" for
both 0 & 1!! All my sample/trial&error work for a year had bitten the dust.
My only (small) consolation was that I was not the only one affected.

Story 2. Live 24 hour online system. Does backup over the ethernet to a
SCSI tape. Unfortunately, no SCSI on this system to recover if root/ethernet
dies. This was a Compaq Systempro running SCO Unix. Slated a downtime of
4-6am. I thought that it would take me only 30 minutes, as I had installed
a similar (Adaptec) SCSI board on similar hardware on SCO. Only difference
was that this machine was running MPX (multiprocess extension) and you had
to deinstall it, install the SCSI, and then reinstall MPX (proper procedure).
I had made all my slot/IRQ charts the previous day, and so got busy removing
MPX. Then said "mkdev tape", go through the IDs, and am almost at home
base. Then... "link kit not installed, use floppy X1" when I tried to remake
the kernel. For some reason, when I removed the multiprocessor extension,
the single processor files were not moved to their right location. And if
I reinstalled the single, all my changes would be lost. Finally, restored the
OS (from backup) on the remote machine, and then rcp-ed them over to bring back
the MPX version. Unfortunately, rcp does not maintain the date/ permissions,
etc. Got a limping version of the machine back on-line about 45 minutes
after its slated time, and spent the rest of the day fixing vagrant files.
The next week, I moved the online programs to another machine (a headache),
and reinstalled this machine from scratch.

Ok, that should be enough horror. Please send any replies to "ra...@usv.com"
instead of this account.

Thanks,
--Ravi Ramachandran

-----------------------------------------------------------------------------

From: gr...@lemis.uucp (Greg Lehey)
Organization: LEMIS, W-6324 Feldatal, Germany

In article <16...@umd5.umd.edu> matt...@oberon.umd.edu (Mike Matthews) writes:
>The moral? *NEVER* move something important. Copy, VERIFY, and THEN delete.

Something like this bit me just yesterday. I'm currently trying to
work out how ISC Unix/386 handles COFF files, and discovered the
/shlib directory, which I suspected wasn't really used (*wrong*). So,
to try it out, I did:

+ root adagio:/ 819 -> mv shlib slob
+ root adagio:/ 820 -> xterm
+ /usr/bin/X11/xterm: Can not access a needed shared library

So far, so good. So, put it back:

+ root adagio:/ 821 -> mv slob shlib
+ /bin/mv: Can not access a needed shared library

Oops! So, tried it from a different system, but didn't have
permission, so:

+ root adagio:/ 822 -> chmod 777 slob
+ /bin/chmod: Can not access a needed shared library

OK, so let's just cp them across.

+ root adagio:/ 823 -> cd slob
+ root adagio:/slob 824 -> mkdir /shlib
+ /bin/mkdir: Can not access a needed shared library
+ root adagio:/slob 825 ->

Then I wrote a program which just did a link(2) of the directories.
Yes, gcc and ld didn't have any problems, but even after the link was
in place, it still didn't work. I had to reboot (but nothing else),
after which it did work. No idea why that made any difference.

-----------------------------------------------------------------------------

From: a...@geac.com (Anthony DeBoer)
Organization: Geac Computer Corporation

In article <Bw40G...@cen.ex.ac.uk> JR...@cen.ex.ac.uk (J.Rowe) writes:
>One thing I would like all vendors to do (I know one or two do) is
>to give root the option of logging in using another shell. Am I the
>only one to have mangled a root shell?

This actually leads me back to a Unix admin horror story. At a former
employer, I once watched our sysadmin reboot from the distribution tape
after making a typing error editing the root line in /etc/passwd. After
munging the colon count in this line, nobody could login or su, and he
hadn't left himself in root in another session while testing his changes
(a rule I've adopted for myself).
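
[Ed. note: a minimal sketch of a sanity check to run on an edited copy
before it replaces /etc/passwd, in addition to keeping a root session
open. The function name check_passwd_fields is hypothetical. It relies
only on the fact that a passwd entry has seven colon-separated fields.]

```sh
# Hypothetical helper: list every line of a passwd-format file that does
# not have exactly seven colon-separated fields; exit non-zero if any.
check_passwd_fields() {
    awk -F: '
        BEGIN { bad = 0 }
        NF != 7 {
            printf "line %d: %d fields: %s\n", NR, NF, $0
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

Editing a scratch copy and installing it only if the check passes would
catch a dropped or doubled colon, though not a wrong shell path or UID.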

My "big break", the moment I became sysadmin, was partly by virtue of
being the only one to ask him for the root password the day he went out
the door for the last time.

What I've found preferable, when wanting to set up an alternative shell
for root (bash, in my case), is to add a second line in /etc/passwd with
a slightly different login name, same password, UID 0, and the other
shell. That way, if /usr/local/bin/bash or /usr/local/bin or the /usr
partition itself ever goes west, I still have a login with good ol'
/bin/sh handy. (I know, installing it as /bin/bash might bypass some
potential problems, but not all of them.)

This might, of course, be harder to do on a security fascist system like
AIX. Simply trying to create a "backup" login with UID 0 there once so
that the operator didn't get a prompt and have to remember what to type
next was a nightmare. (I wound up giving "backup" a normal UID, put it
in a group by itself, and gave it setuid-root copies of find and cpio,
with owner root, group backup, and permissions 4550). BTW, this was to
make things easier for the backup operator, not to make it secure from
that person.

-----------------------------------------------------------------------------

From: will...@nssdcs.gsfc.nasa.gov (Jim Williams)
Organization: NASA Goddard Space Flight Center, Greenbelt, Maryland

Well, I guess I'll throw in a couple of stories too. The first isn't
really a horror story, more of an unexpected failure mode.

Story One is about The Sun 3/260 That Froze Solid. One day a user
reported that the Sun 3/260 he was using was "dead". On inspection, I
found the Sun at the console prompt and the keyboard totally
unresponsive. The L1-A sequence did nothing. So I power cycled it.
Nothing. A blank screen, no activity. I was ready to call service,
then decided to try rebooting with the normal/diag switch set to diag.
On looking at the back of the pedestal, I saw that the ethernet cable
had been pressed up against the reset switch! ARGGGHHHH! The user
had pushed the machine back just enough to press the switch and keep
it pressed. (I don't recall if there was a "watchdog reset" message
on the console when I found it, but I was new enough to Suns that that
would not have been a dead giveaway.)

Story Two involved connecting an HP laserjet to a Sun 3/280. This
sucker just would NOT do flow control correctly. I put a dumb
terminal in place of the HP and manually typed ^S/^Q sequences to
prove that the serial port really was honoring X-ON/X-OFF. But for
some reason the ^Ss from the HP didn't "taste right" to the Sun, which
ignored them. Switching the HP serial port between RS422/RS232 had no
effect. It eventually turned out to be some sort of flakeyness with
the Sun ALM-II board. Everything worked fine after I moved the
printer to one of the built-in Zilog ports. Death to flakey hardware...

Cheers!
Jim

-----------------------------------------------------------------------------

From: ri...@sadtler.com (Rick Morris)
Organization: Sadtler Research Laboratories

Slightly off the subject, but not too far off, is the phenomenon of "Sysadmin
Wannabees." I've been Sys Admin of UNIX at 3 sites now. The phenomenon has
occurred at all three.

You are talking to a fellow programmer, or a programmer is within ear shot.
A new user (or even an old user) comes up to you and asks something like:
"How would I list only directory files within a directory?"

Now it has been my experience that the question is not complete. Is this a
recursive list? Is this a "one-time" thing, or are you going to do it many
times? Is it part of a program? (Sometimes questions like this end up as
an answer to a C question executed as a system(3) call rather than a preferred
library call.) Anyway, as you ponder the question, the many alternatives (in
unix there's always another way), the questioner's experience, whether or not
they want a techie answer or a DOSie answer, the programmer within ear shot
pipes in with an answer of how *THEY* do or would do it.

It is invariable. It happens every time. I don't think I take all that
long to answer. But the Wannabee answer is rapid. Like the kid in class
who raises his hand going "oo" "oo" "oo".

I have seen my predecessors get all bent out of shape when the Sysadmin
Wannabees jump on their toes. I usually let the answer proceed, indeed,
often these Wannabees give a complete answer, even doing it for the
questioner. After a bit I return to the questioner and ask if the question
was properly answered, if they understand the answer, or if they want any
more information. It also shows me how deeply the Wannabee understands
just what is going on inside that pizza box.

Have any other of you sys admins seen this phenomenon, or is it my slow
pondering of potential answers that drives the Wannabee to jump in?

-Rick.

-----------------------------------------------------------------------------

From: rsl...@cue.bc.ca (Rob Slade)
Organization: Computer Using Educators of B.C., Canada

Hope this fits.

I had a job one time teaching Pascal at a "visa school". The machine was a
multi-user micro that ran UNIX. I have enough stories from that one course
to keep a group of computer educators in stitches for at least half an hour.

The finale of the course was on the last day of classes. When I showed up
and powered up the system, it refused to boot. Since all the students' term
projects and papers were in the computer, it was fairly important. After
a few hours of work, and consultation with the other teacher, who did the
sysadmin and maintenance, we were finally informed that the new admin
assistant around the place had decided that the layout of the computer lab
was unsuitable. (I had noticed that all the desks were repositioned: I thought
the other teacher had done it, he thought I had.) The AA had, the night
before, moved all the furniture, including the terminals and the micro. She
did not know anything about parking hard disks.

We knew now, that we were in trouble, but we didn't realize how much until
we started reading up on emergency procedures. For some unknown reason,
booting the micro from the original system disks would automatically reformat
the hard disk.

(The visa school refunded the tuition for all the students in that course.)

-----------------------------------------------------------------------------

From: ke...@ksmith.uucp (Keith Smith)
Organization: Keith's Computer, Hope Mills, NC

My dumbest move ever. Client in Charlotte, NC (3 hours + away) has
Xenix box with like 15 users running single app. They have a tape
backup of course. Anyway they ran slam out of space on the 70MB disk
drive so I upgraded them from an MFM to a SCSI 150MB disk. Restored
their app & data files, and they were off and running. Anyway they did
an application directories backup (tar) on a daily basis and backed the
rest of the system up with tar on Monday morning.

Being a nice guy I built a menu system and installed the backups on the
menu so they could do it with a push of the button. Swell, It's Monday
Call if anything else comes up. 1 week later I get a call. Console is
scrolling messages, App seems to be missing yesterday's orders, etc.
Call in, and cannot log in. 'w' doesn't work. Crazy stuff. Really
strange.

Grab old drive/controller, fly to Charlotte replace drive, install
app backup tape. They re-key missing stuff, etc. Bring new disk back.
Won't boot, won't do anything. Boot emergency floppy set. Looking
around. Can't figure but have backup tape from that morning that
"completed successfully". tar tvf /dev/rct0. Hmm, why all these
files look very OLD. Uh, Where, Uh. Look at menu command for the
"backup" is 'tar xvf /dev/rct0 /'

Anyway, I owned up to the mistake, re-loaded the SCSI drivers and
changed the command to 'tar cvf ..'

Hehehe, Now I DOUBLE check what I put on a menu, and try not to be in a
*HURRY* when I do this stuff.

-----------------------------------------------------------------------------

From: k...@sugra.uucp (Kenneth Ng)
Organization: Private Computer, Totowa, NJ

In article <1992Oct16.1...@nsisrv.gsfc.nasa.gov: will...@nssdcs.gsfc.nasa.gov (Jim Williams) writes:
:Story Two involved connecting an HP laserjet to a Sun 3/280. This
:sucker just would NOT do flow control correctly. I put a dumb
:terminal in place of the HP and manually typed ^S/^Q sequences to
:prove that the serial port really was honoring X-ON/X-OFF. But for
:some reason the ^Ss from the HP didn't "taste right" to the Sun, which
:ignored them. Switching the HP serial port between RS422/RS232 had no
:effect. It eventually turned out to be some sort of flakeyness with
:the Sun ALM-II board. Everything worked fine after I moved the
:printer to one of the built-in Zilog ports. Death to flakey hardware...

ARRRGGGHHH!!!! DEATH TO ALM-II BOARDS! Funny though, I do have an HPLJ-2
hooked up to a SUN 690MP through the ALM-2 boards without problems. However
I also had Sun going up the wall with myself with an Okidata 320 printer
that would hang the port until we reboot the machine (not a nice thing to
do with a dozen stock brokers). Funny thing is, we had ANOTHER Okidata 320
printer attached to the same Sun on another ALM-2 port, no problem with that
one. Hm, switch the printers, no change. Switch the cables, no change.
Switch the ports, no change. Weird. Finally discovered it was the DATA that
was being sent. The printer with problems was a label printer, which was
sending a control-s every 10-20 characters or so to pause the Sun. Apparently
the Sun ALM-2 drivers can not handle control-s'es too frequently. No problem,
Sun said, just switch to hardware flow control. Puzzled me, because my docs
said the ALM boards had no hardware flow control. But his docs said they
were there. Took the printer off line, started the lpd, data scope showed the
data going out. Talked to Sun again, tried RTS-CTS, DTR, 'crtscts' in printcap,
'-crtscts' in printcap. Trying all kinds of combinations. Finally he asked me
which ALM-2 port I was using, 13 I responded. Oh, ALM-2 ports only have the
hardware flow control in the first four ports. Whoops :-). Both docs were
true: my docs said there was no hardware flow control, which was right, on
the last 12 ports. His docs said that there was hw flow control, but he
missed the 'on the first four ports' part. Now it works, and I hope Sun
now has this better documented.

-----------------------------------------------------------------------------

From: cor...@ensta.ensta.fr (Gilles Gravier)
Organization: ENSTA, Paris, France

Well, talk about horror stories... We have a DataGeneral Aviion machine
where I work at. I was doing regular admin tasks on it and decided, logged
in as root, to clean /tmp... (I can already see you laughing there!). So,
as usual, I typed "cd / tmp" then "rm *" as I was placed in / when the
dreaded rm was entered... My root directory was erased...

I realized my error fast enough... So, since I had deleted the kernel, and
the administration kernels (that both reside in /), I had to recreate a
new kernel. Luckily for me, DG/UX allows to recreate one "on the fly", using
parameters of the running kernel (in memory!)... So I did, and then rebooted.

Things started getting bad when I still couldn't work on my machine, logins
didn't work (No Shell messages...)... Until I could access the /etc/passwd
file using a trojan shell through an NFS mounted directory, and create a root
account whose shell was not /sbin/sh...

On a DG, /sbin and /bin are both links to /usr/sbin... The links were killed
when I did my "rm"...

Well, now I do backups!

Gilles.
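
[Ed. note: a minimal sketch, assuming the goal is just "remove the plain
files in one directory". The function name clean_files is hypothetical.
The point is that the directory is named in the delete itself, so the
result does not depend on where a mistyped cd happened to land.]

```sh
# Hypothetical helper: remove the regular files directly inside DIR
# without ever changing the current directory.
clean_files() {
    dir=$1
    [ -n "$dir" ] && [ -d "$dir" ] || {
        echo "usage: clean_files DIR" >&2
        return 1
    }
    for f in "$dir"/*; do
        # Skip subdirectories, and the unexpanded pattern that is left
        # over when the directory is empty.
        if [ -f "$f" ]; then
            rm -f -- "$f"
        fi
    done
}
```

"clean_files /tmp" then does what "cd /tmp; rm *" was meant to do, and a
stray space in the path makes the call fail instead of emptying /.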

-----------------------------------------------------------------------------

From: cor...@ensta.ensta.fr (Gilles Gravier)
Organization: ENSTA, Paris, France

I am sysadmin at my office... I won't name it, because that's not
the subject... Of course, UNIX is my cup of tea... But, at home, I have an
MS DOS machine... As old habits die hard, I have set up MKS toolkit on my home
PC... And, as I have a C:\TMP directory where Windows and other applications
put stuff, that remains, as I sometimes have to reboot fast... (ah, the fun
of developing at home!)... So, in my AUTOEXEC.BAT file, I have the following:
rm -rf /tmp
mkdir c:\tmp
the recursive rm coming from MKS, and mkdir from horrible MSDOS.

At the time, I didn't have a tape streamer on my pc... I was working,
and the mains went down... so did the PC. Windows was running, \TMP full
of stuff... So, when power comes back on, rm -rf /tmp has things to do...
While it's doing those things, power goes down again (there was a storm).
Power comes back up, and this time, it seems that the autoexec takes really
too much time... So, I control C it... And, to my horror, realize that I don't
have anymore C:\DOS C:\BIN C:\USR and that my C:\WINDOWS was quite depleted...

After some investigation, unsuccessful, I did the following: cd \tmp
and then DIR... And there, in C:\TMP, I find my C:\ files! The first power
down had resulted in the cluster number of C:\ being copied to that of C:\TMP,
actually resulting in a LINK! (Now, this isn't supposed to happen under MSDOS!)
I had to patch in the DIRECTORY cluster to change TMP's name replacing the
first T by the letter Sigma, so that DOS thought that TMP wasn't there anymore,
then do a chkdsk /F, and then undelete the files that I could... And rebuild
the rest...

Took me some time!

Gilles.

-----------------------------------------------------------------------------

From: er...@src4src.linet.org (Erik VanRiper)
Organization: The Source for Source

Here's one for ya...

I run on a 386/25. Small system, 4 inbound lines, etc. I was installing a
new SCSI drive to complement my 2 MFM's. Took me forever to get everything
just right. Things finally worked, so I figured I would shutdown and play
with the jumper settings to see what this thing could do. What did I do?
Well, I just turned off the power, that's all.

erk. Just rebuilt the kernel, did not do a haltsys, or a shutdown, or anything.
Just shut the power off. ARGH! Took me 3 weeks to clean up the mess.

You tend to get in this cycle of "try" "haltsys" "power off" "change jumpers"
"power on" "try". Well, once everything worked, I guess I was a wee bit
excited and forgot a step. :-)

Granted, not a very good story, but I will tell you about my "cardboard
teepee" of a computer case sometime. :-)

-----------------------------------------------------------------------------

From: mi...@pacsoft.com (Mike Stefanik)
Organization: Pacific Software Group, Riverside, CA

One of the more interesting problems that I ran into was a customer that
was having problems with their SCSI tape drive on a XENIX box. Around midnight,
every night, the system would automatically backup and verify their data. One
day, the customer needed to restore some data files from the last night's
backup. She called because, although the restore worked just fine, she didn't
see the busy light on the drive come on, and it didn't sound like the tape was
moving. I dialed up the system, had her put a tape in and did a retension --
the drive started winding the tape back and forth, and we both concluded that
she was mistaken. After all, the tape was retensioning, and she wasn't getting
any backup or verify errors at all. I just chalked this one up to user
confusion.

A few days later, she called back saying that there really is something wrong
with the tape. She needed to restore some data from a few days ago, and like
before, the busy light on the drive didn't come on, but files did restore.
However when she started the application program, the data hadn't changed. I
dialed up the system again, and just on a fluke, issued a "df" -- it showed
their rather large root filesystem to be nearly full. Confused, I did a "find",
searching for files over 1MB. Of course, what I found was this huge file named
/dev/rct0. As I later discovered, their system had crashed a few weeks ago,
and she had simply answered "yes" to a bunch of questions that it asked when
she brought it back up. The /dev/rct0 device was removed (but /dev/xct0 was
still there, which allowed me to retension the tape) and the backup script
never checked to make sure that it was actually writing to a character device.

Needless to say, I modified the backup program to make sure that it was really
writing to a device, and I made her promise to call me whenever the system
crashed or asked "funny questions" when it was booting.
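
[Ed. note: the poster does not show his modified backup program. The
following is only a sketch of the kind of check he describes, with a
hypothetical function name, write_backup. It refuses to write unless the
target really is a character device, so a missing /dev/rct0 cannot be
quietly recreated as an ordinary file on the root filesystem.]

```sh
# Hypothetical helper: run a command with its output sent to a tape
# device, but only if that device node is a character device.
write_backup() {
    dev=$1
    shift
    if [ ! -c "$dev" ]; then
        echo "$dev is not a character device; refusing to write" >&2
        return 1
    fi
    "$@" > "$dev"
}
```

A call such as "write_backup /dev/rct0 tar cf - /u" (the path /u is just
an illustration) fails immediately if the device node has gone missing.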

-----------------------------------------------------------------------------

From: ge...@greenie.gold.sub.org (Gert Doering)

russ...@ccu1.aukuni.ac.nz (Russell Street) writes:

>So when I came in this morning a user's session had crashed while
>he was replying to mail and emacs had spent the night quietly
>filling up the root partition (where /tmp was).

Well... sounds familiar...

I was on a 5 days vacation, the first day my machine crashed...

How? Well...

cron started a shell script to extract some files from a ".lzh"-Archive.
LHarc found that the target file already existed, asked

"file <foo> exists, overwrite (y/n)?"

... since it was started from cron, it just read "EOF". Tried again. Read
"EOF". And so on.

All output went to /tmp... which was full after the file reached 90 MB!
What happened next? I'm using a SCO machine, /tmp is in my root filesystem
and when trying to login, the machine said something about not being able
to write login information - and threw me out again.

Switched machine off.

Power on, go to single user mode. Tried to login - immediately thrown out
again.

I finally managed to repair the mess by booting from Floppy disk, mounting
(and fsck-ing) the root filesystem and cleaning /tmp/*

gert
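
[Ed. note: a sketch of one way to keep an unattended job from filling a
filesystem with its own output, not the poster's fix. The function name
run_capped is hypothetical, and "head -c" is assumed to be available (it
is on the GNU, BSD and busybox systems I know, but it is not in every
old Unix). Once head has copied the allowed number of bytes it exits,
and the job normally dies on its next write to the closed pipe.]

```sh
# Hypothetical helper: run a command with no terminal input and keep at
# most LIMIT bytes of its combined output in LOG.
run_capped() {
    limit=$1
    log=$2
    shift 2
    # stdin comes from /dev/null so any prompt sees EOF at once;
    # head stops copying after $limit bytes.
    "$@" < /dev/null 2>&1 | head -c "$limit" > "$log"
}
```

In a crontab this might look like "run_capped 100000 /tmp/unpack.log
unpack-job", where unpack-job stands for whatever script cron was running.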

-----------------------------------------------------------------------------

From: ge...@greenie.gold.sub.org (Gert Doering)
Organization: GreeniE

n...@dale.cts.com (Nancy Milligan) writes:

>About three days later almost every file on this machine had been deleted or
>compressed. Apparently I got distracted by something while I was writing
>the config file, and the entry that was supposed to be for /tmp said /.
>Boy, did I feel like an ijjit.

Ever done

# find / -atime +14 -exec rm -f {} \;

instead of

# find /tmp -atime +14 -exec rm -f {} \;

[corrected from a later post - ed.]

and then wondered why it took so long to clean up 20 files under /tmp?
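
A one-character slip like this can be caught by a small wrapper. This is
a sketch under assumptions: clean_old_files is a made-up name, and the
guard only covers the "started at /" case, not every wrong directory.

```shell
# Delete files not accessed for more than $2 days under directory $1,
# refusing to start from the root directory and staying on one filesystem.
clean_old_files() {
    dir=$1
    days=$2
    case $dir in
        /*) ;;                                   # absolute path required
        *)  echo "clean_old_files: need an absolute path" >&2; return 1 ;;
    esac
    # Resolve symlinks and forms like "/." before comparing with "/".
    real=$(cd "$dir" 2>/dev/null && pwd -P) || return 1
    if [ "$real" = "/" ]; then
        echo "clean_old_files: refusing to clean /" >&2
        return 1
    fi
    find "$real" -xdev -type f -atime +"$days" -exec rm -f {} \;
}

# Hypothetical use:
#   clean_old_files /tmp 14
```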

gert

-----------------------------------------------------------------------------

From: dbr...@zia.aoc.nrao.edu (Daniel Briggs)
Organization: National Radio Astronomy Observatory, Socorro NM

Did anyone by chance archive the post of a year or so ago where someone
described the recovery of a Unix box from a partial "rm -r *" (where root
forgot that he was in /) ? They had lost everything up to (and including?)
/etc before the command was stopped. I seem to recall that they would lose
everything on the disk if they reinstalled the system, so there were very
good reasons to try and restore the barely running system. Of course
almost all of the utilities that they needed to do it had lived in /bin.
There were a few goodies in /usr/5bin that helped them out. The fix
eventually involved writing a bootstrap network utility on another machine,
and assembling it there, typing in the binary in an emacs process that was
still running, and overwriting some other system utility that had the
correct execute permissions, (since they couldn't chmod anything!). It was
a wonderful example of recovery from a near fatal error. If it floats my
way again, I'd love to get a copy of that post.

-----------------------------------------------------------------------------

From: ni...@acm.rpi.edu (Trip Martin)

Yup, I saved a copy because it was such a classic story. It's apparently
been re-posted every so often for a number of years, and it's worth
posting again. So here it is...

-----

From alt.folklore.computers Fri Nov 9 11:16:43 1990
Path: rpi!zaphod.mps.ohio-state.edu!usc!cs.utexas.edu!utgpu!utzoo!sq!msb
From: m...@sq.sq.com (Mark Brader)
Newsgroups: alt.folklore.computers
Subject: rm -rf / (was Hex vs. Octal)
Summary: repost
Message-ID: <1990Nov8.0...@sq.sq.com>
Date: 8 Nov 90 08:25:50 GMT
References: <1990Nov5.1...@hq.demos.su>
Organization: SoftQuad Inc., Toronto, Canada
Lines: 184
Status: OR

> ... if you're trying rm -rf / you'll NEVER get a clear disk - at least
> /bin/rm (and if it reached /bin/rmdir before scanning some directories
> then add a lot of empty directories). I've seen it once...

Then it must be version-dependent. On this Sun, "cp /bin/rm foo"
followed by "./foo foo" does not leave a foo behind, and strings
shows that rm appears not to call rmdir (which makes sense, as it
can just use unlink()).

In any case, I'm reminded of the following article. This is a classic
which, like the story of Mel, has been on the net several times;
it was in this newsgroup in January. It was first posted in 1986.

-----

Have you ever left your terminal logged in, only to find when you came
back to it that a (supposed) friend had typed "rm -rf ~/*" and was
hovering over the keyboard with threats along the lines of "lend me a
fiver 'til Thursday, or I hit return"? Undoubtedly the person in
question would not have had the nerve to inflict such a trauma upon
you, and was doing it in jest. So you've probably never experienced the
worst of such disasters....

It was a quiet Wednesday afternoon. Wednesday, 1st October, 15:15
BST, to be precise, when Peter, an office-mate of mine, leaned away
from his terminal and said to me, "Mario, I'm having a little trouble
sending mail." Knowing that msg was capable of confusing even the
most capable of people, I sauntered over to his terminal to see what
was wrong. A strange error message of the form (I forget the exact
details) "cannot access /foo/bar for userid 147" had been issued by
msg. My first thought was "Who's userid 147?; the sender of the
message, the destination, or what?" So I leant over to another
terminal, already logged in, and typed
        grep 147 /etc/passwd
only to receive the response
        /etc/passwd: No such file or directory.

Instantly, I guessed that something was amiss. This was confirmed
when in response to
        ls /etc
I got
        ls: not found.

I suggested to Peter that it would be a good idea not to try anything
for a while, and went off to find our system manager.

When I arrived at his office, his door was ajar, and within ten
seconds I realised what the problem was. James, our manager, was
sat down, head in hands, hands between knees, as one whose world has
just come to an end. Our newly-appointed system programmer, Neil, was
beside him, gazing listlessly at the screen of his terminal. And at
the top of the screen I spied the following lines:
        # cd
        # rm -rf *

Oh, shit, I thought. That would just about explain it.

I can't remember what happened in the succeeding minutes; my memory is
just a blur. I do remember trying ls (again), ps, who and maybe a few
other commands beside, all to no avail. The next thing I remember was
being at my terminal again (a multi-window graphics terminal), and
typing
        cd /
        echo *
I owe a debt of thanks to David Korn for making echo a built-in of his
shell; needless to say, /bin, together with /bin/echo, had been
deleted. What transpired in the next few minutes was that /dev, /etc
and /lib had also gone in their entirety; fortunately Neil had
interrupted rm while it was somewhere down below /news, and /tmp, /usr
and /users were all untouched.
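
The "echo *" trick works because the shell itself expands the glob, so no
external binary is needed. A sketch of a slightly friendlier listing,
assuming a shell where printf and [ are built in (true of most current
shells, not necessarily of the one in the story); list_dir is a
hypothetical name:

```shell
# List the names in a directory without running an ls binary.
# Prints one name per line; dotfiles are skipped, as with plain "ls".
list_dir() {
    for entry in "${1:-.}"/*; do
        # An unmatched glob stays literal, so skip names that do not exist.
        [ -e "$entry" ] || [ -L "$entry" ] || continue
        printf '%s\n' "${entry##*/}"
    done
}

# Hypothetical use:
#   list_dir /
```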

Meanwhile James had made for our tape cupboard and had retrieved what
claimed to be a dump tape of the root filesystem, taken four weeks
earlier. The pressing question was, "How do we recover the contents
of the tape?". Not only had we lost /etc/restore, but all of the
device entries for the tape deck had vanished. And where does mknod
live? You guessed it, /etc. How about recovery across Ethernet of
any of this from another VAX? Well, /bin/tar had gone, and
thoughtfully the Berkeley people had put rcp in /bin in the 4.3
distribution. What's more, none of the Ether stuff wanted to know
without /etc/hosts at least. We found a version of cpio in
/usr/local, but that was unlikely to do us any good without a tape
deck.

Alternatively, we could get the boot tape out and rebuild the root
filesystem, but neither James nor Neil had done that before, and we
weren't sure that the first thing to happen wouldn't be that the whole
disk would be re-formatted, losing all our user files. (We take dumps
of the user files every Thursday; by Murphy's Law this had to happen
on a Wednesday). Another solution might be to borrow a disk from
another VAX, boot off that, and tidy up later, but that would have
entailed calling the DEC engineer out, at the very least. We had a
number of users in the final throes of writing up PhD theses and the
loss of maybe a week's work (not to mention the machine down time)
was unthinkable.

So, what to do? The next idea was to write a program to make a device
descriptor for the tape deck, but we all know where cc, as and ld
live. Or maybe make skeletal entries for /etc/passwd, /etc/hosts and
so on, so that /usr/bin/ftp would work. By sheer luck, I had a
gnuemacs still running in one of my windows, which we could use to
create passwd, etc., but the first step was to create a directory to
put them in. Of course /bin/mkdir had gone, and so had /bin/mv, so we
couldn't rename /tmp to /etc. However, this looked like a reasonable
line of attack.

By now we had been joined by Alasdair, our resident UNIX guru, and as
luck would have it, someone who knows VAX assembler. So our plan
became this: write a program in assembler which would either rename
/tmp to /etc, or make /etc, assemble it on another VAX, uuencode it,
type in the uuencoded file using my gnu, uudecode it (some bright
spark had thought to put uudecode in /usr/bin), run it, and hey
presto, it would all be plain sailing from there. By yet another
miracle of good fortune, the terminal from which the damage had been
done was still su'd to root (su is in /bin, remember?), so at least we
stood a chance of all this working.

Off we set on our merry way, and within only an hour we had managed to
concoct the dozen or so lines of assembler to create /etc. The
stripped binary was only 76 bytes long, so we converted it to hex
(slightly more readable than the output of uuencode), and typed it in
using my editor. If any of you ever have the same problem, here's the
hex for future reference:
070100002c000000000000000000000000000000000000000000000000000000
0000dd8fff010000dd8f27000000fb02ef07000000fb01ef070000000000bc8f
8800040000bc012f65746300

I had a handy program around (doesn't everybody?) for converting ASCII
hex to binary, and the output of /usr/bin/sum tallied with our
original binary. But hang on---how do you set execute permission
without /bin/chmod? A few seconds' thought (which as usual, lasted a
couple of minutes) suggested that we write the binary on top of an
already existing binary, owned by me...problem solved.
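
For readers without such a handy program, here is one way an ASCII-hex to
binary converter might look in POSIX shell. This is a sketch, not the
program used in the story; hex_to_bin is a made-up name, and it assumes
tr, fold and printf are still available, which was exactly what the
story's victims could not count on.

```shell
# Convert ASCII hex on stdin to raw bytes on stdout.
# Anything that is not a hex digit (spaces, newlines) is ignored.
hex_to_bin() {
    tr -cd '0-9a-fA-F' | fold -w 2 | while read -r byte || [ -n "$byte" ]; do
        # Turn the hex pair into a 3-digit octal escape, then emit that byte.
        printf "\\$(printf '%03o' "0x$byte")"
    done
}

# Hypothetical use, writing over an existing executable so that the
# file keeps its mode bits (the trick described above):
#   hex_to_bin < mkdir.hex > some_existing_binary
```

Redirecting onto an existing file truncates it in place, so the owner
and execute permission stay as they were.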

So along we trotted to the terminal with the root login, carefully
remembered to set the umask to 0 (so that I could create files in it
using my gnu), and ran the binary. So now we had a /etc, writable by
all. From there it was but a few easy steps to creating passwd,
hosts, services, protocols, (etc), and then ftp was willing to play
ball. Then we recovered the contents of /bin across the ether (it's
amazing how much you come to miss ls after just a few, short hours),
and selected files from /etc. The key file was /etc/rrestore, with
which we recovered /dev from the dump tape, and the rest is history.
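
The umask step matters because a new directory gets mode 0777 with the
umask bits removed. A small sketch showing the effect, assuming a POSIX
shell; both function names are hypothetical:

```shell
# Create a directory under a given umask.
# Runs in a subshell so the caller's umask is left alone.
mkdir_with_umask() {
    mask=$1
    path=$2
    ( umask "$mask" && mkdir "$path" )
}

# Print just the permission string (e.g. drwxrwxrwx) of a directory.
mode_of() {
    ls -ld "$1" | cut -c1-10
}

# Hypothetical use:
#   mkdir_with_umask 0 /tmp/open;     mode_of /tmp/open      (writable by all)
#   mkdir_with_umask 022 /tmp/closed; mode_of /tmp/closed    (owner-writable only)
```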

Now, you're asking yourself (as I am), what's the moral of this story?
Well, for one thing, you must always remember the immortal words,
DON'T PANIC. Our initial reaction was to reboot the machine and try
everything as single user, but it's unlikely it would have come up
without /etc/init and /bin/sh. Rational thought saved us from this
one.

The next thing to remember is that UNIX tools really can be put to
unusual purposes. Even without my gnuemacs, we could have survived by
using, say, /usr/bin/grep as a substitute for /bin/cat.
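
The grep-for-cat substitution relies on the empty pattern matching every
line. A minimal sketch; cat_via_grep is an illustrative name:

```shell
# Print a file using grep alone: the empty pattern matches every line,
# including blank ones, so the whole file comes out unchanged.
cat_via_grep() {
    grep '' "$1"
}

# Hypothetical use:
#   cat_via_grep /etc/passwd
```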

And the final thing is, it's amazing how much of the system you can
delete without it falling apart completely. Apart from the fact that
nobody could login (/bin/login?), and most of the useful commands
had gone, everything else seemed normal. Of course, some things can't
stand life without say /etc/termcap, or /dev/kmem, or /etc/utmp, but
by and large it all hangs together.

I shall leave you with this question: if you were placed in the same
situation, and had the presence of mind that always comes with
hindsight, could you have got out of it in a simpler or easier way?
Answers on a postage stamp to:

Mario Wolczko

-----
Trip Martin

-----------------------------------------------------------------------------

From: exu...@exu.ericsson.se (Dave Williams)
Organization: Ericsson Network Systems

A sysadmin was told to change the root passwd on a dozen or so Sun servers
serving 400 diskless sun clients. He changed the passwd string to the wrong
encrypted string (with a sed-like string editor) and locked root out from
everywhere. Took hours to untangle.

You only learn when you make mistakes...

[stuff about dead presidents deleted]

-----------------------------------------------------------------------------

From: almq...@chopin.udel.edu (Squish)
Organization: Human Interface Technology Lab (on vacation)

Two miserable flubs:

1)
/etc/rc cleans tmp but it wasn't cleaning up directories so I changed the lines:

echo clearing /tmp
(cd /tmp; rm -f - *)

to:

echo clearing /tmp
(cd /tmp; rm -f -r - *; rm -f -r - .*)

About 15 minutes later I had wiped out the hard drive.

2)
One of the user discs got filled so I needed to move everyone over to the new
disc partition. So, I used the tar to tar command and flubbed:

cd /user1; tar cf - . | (cd /user1; tar xfBp - )

Next thing I know /user1 is coming up with lots of weird consistency errors and
other such nonsense. I meant to type /user2 not /user1. OOOPS!

My moral of the story is when you are doing something BIG, type the command and
reread what you've typed about 100 times to make sure it's sunk in (:
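
Both flubs have mechanical guards as well. A hedged sketch in POSIX shell;
clear_dir and copy_tree are hypothetical names, and the explanation of the
first flub (the pattern ".*" also matching "..") is the usual one for
shells and rm versions of that era, not something the post itself states.

```shell
# Empty a directory without ever touching "." or "..".
# The pattern ".*" can also match "..", which is presumably how
# "rm -f -r - .*" run in /tmp climbed up into /.
clear_dir() {
    dir=$1
    [ -d "$dir" ] || return 1
    (
        cd "$dir" || exit 1
        # ./*       ordinary names
        # ./.[!.]*  dotfiles, excluding "." and ".."
        # ./..?*    names that start with ".." but are longer
        rm -rf -- ./* ./.[!.]* ./..?*
    )
}

# Copy a tree with the tar-to-tar pipe, refusing identical source and target.
copy_tree() {
    src=$(cd "$1" 2>/dev/null && pwd -P) || return 1
    dst=$(cd "$2" 2>/dev/null && pwd -P) || return 1
    if [ "$src" = "$dst" ]; then
        echo "copy_tree: source and destination are both $src" >&2
        return 1
    fi
    (cd "$src" && tar cf - .) | (cd "$dst" && tar xpf -)
}

# Hypothetical use:
#   clear_dir /tmp
#   copy_tree /user1 /user2
```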

-----------------------------------------------------------------------------

From: an...@maxwell.concordia.ca (Anne Bennett)
Organization: Concordia University, Montreal, Canada

After about four months as a Unix sysadm, and still feeling rather like a
novice, I was asked to "upgrade" a Sun lab (3/280 server and ten 3/50
diskless clients) from SunOS 4.0.3 to 4.1 -- of course, this "upgrade" was
actually a complete re-install.

Well, the server had no tape drive, not even any SCSI controller. There
were no other machines on its subnet other than the clients, so I had no
boothost (at that time, I did not know that the routers could be
reconfigured to pass the appropriate rarp packets, nor do I think our
network people would have taken kindly to such a hack!). The clients did
have SCSI controllers, but I had no portable tape drive. Luckily, I had
a portable disk.

So, with great trepidation (remember, I was still a novice), I set up
one of the clients, with the spare disk, to be a boothost. I booted
the server off the client and read the miniroot from a tape on a remote
machine, and copied it to the server's swap partition. Then I manually
booted the miniroot on the server by booting off the temporary boothost
with the appropriate options, and specified the server's swap partition
as containing the kernel to be loaded. Once in the miniroot, I started
up routed to permit me to reach the tapehost, and finally invoked
suninstall. From then on, it worked like a charm.

Needless to say, I was extremely pleased with myself for figuring all of
this out. I then settled down to do the "easy stuff", and got around to
configuring NIS (Yellow Pages). I decided to get rid of everything I
didn't need, under the assumption that a smaller system is easier to
understand and keep track of. The Sun System and Network Administration
Manual, which is in many ways an admirable tome, had on page 476 a
section on "Preparing Files on NIS Clients", which said:

"Note that the files networks, protocols, ethers, and services need
not be present on any NIS clients. However, if a client will on
occasion not run NIS, make sure that the above mentioned files do
have valid data in them."

So I removed them. Several hours later, when I had finished configuring
the server to my satisfaction, reloading the user files, etc., I finally
got around to booting up the clients. Well, I *tried* to boot up the
clients, but got the strangest errors: the clients loaded their
kernels and mounted /, but failed trying to mount /usr with the message
"server not responding. RPC: Unknown protocol". I was mystified. I tried
putting back the generic kernels on server and clients, several different
ifconfig values for the ethernet interfaces, enabling mountd and rexd on
server's inetd.conf, removing the clients' /etc/hostname.le0 (which I had
added)... all to no avail. 'Twas the last work day before the Christmas
break, and I was flummoxed.

Of course, I finally connected the error message "unknown protocol"
with the removed /etc/protocols (and other) files, restored these
files, after which everything was fine again. I was pretty mad, since
I had wasted a whole day on this problem, but *technically*, the Sun
manual above is correct.

It just neglected to mention that of course, *no* machine is running
NIS at boot time, therefore *every* machine needs valid data in the
networks, services, protocols, and ethers files *at boot time*. Grrr!
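
A quick pre-reboot check for the situation described above might look
like this. It is only a sketch: check_boot_files is a made-up name, and
the file list is simply the four files named in the manual excerpt.

```shell
# Check that the local files needed before NIS is up exist and are non-empty.
# $1 is the directory to check (normally /etc).
check_boot_files() {
    etc_dir=${1:-/etc}
    missing=0
    for name in networks services protocols ethers; do
        if [ ! -s "$etc_dir/$name" ]; then
            echo "missing or empty: $etc_dir/$name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Hypothetical use on a client's root before booting it:
#   check_boot_files /export/root/client1/etc || echo "fix before reboot"
```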

----------------------------------------------------------------------------

From: ri...@sadtler.com (Rick Morris)
Organization: Sadtler Research Laboratories

Okay, I'll bite. We had Zenith Data System's Z-286's, boosted to 386's
via an excellerator (imagine a large boot stomping lots of data through
a small 16 bit funnel...). We were running SCO's Xenix. The user filesystem
crashed in such a way that it couldn't be repaired via fsck. fsck would
try to repair a specific file and then just stop, leaving the filesystem
dirty. The "dirty bit" in the superblock said that it couldn't be mounted
because it was dirty. But it couldn't be cleaned. But there was lots of
data on it and I hadn't been doing backups because the only I/O device to
do backups was the floppy drive and I wasn't about to sit there every night
or even once a week and slam 30 odd floppies into the drive while the backups
ran, even worse try to restore a file from a backup of 30 floppies....

Anyway, to recover the data I used fsdb to edit the superblock and change
the dirty bit to clean, mounted the disk, got off all the good data,
and remade the filesystem. Thanks, Xenix. fsck couldn't clean it,
but you did supply fsdb! *whew*

-Rick.

----------------------------------------------------------------------------

From: ya...@anteros.enst.fr (Nadim Yared)
Organization: Telecom Paris, France

Well,
My story happened on a Sun Sparcstation 2

I once wanted to update the libc.so.1.7 to libc.so.1.8 by myself, so
I got root, and then ftp'd the /lib/libc.so.1.8 to my /lib. Unfortunately
there was not enough room on this partition. So all I got was a file
with zero length.
The problem is that I ran /usr/etc/ldconfig in the directory /lib,
and that was all. No command could be executed, because ld.so
checked for /libc.so.1.8, being the newest one. All I needed was a
statically linked mv, but Sun does not usually provide the source.
Even going single user didn't do anything. So I had to install a
miniroot on the swap partition, cp /bin/mv from the CD-ROM,
and execute it.

It sounds like an American film: a happy ending saved my life.
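
The zero-length library could have been caught before ldconfig ever saw
it. A sketch of a checked install, assuming a POSIX shell; install_checked
is a hypothetical name, and cp stands in for the ftp transfer in the story.

```shell
# Copy $1 to $2 via a temporary name, and only move it into place once the
# copy is non-empty and byte-for-byte identical to the source.
install_checked() {
    src=$1
    dst=$2
    tmp=$dst.new.$$
    if cp "$src" "$tmp" && [ -s "$tmp" ] && cmp -s "$src" "$tmp"; then
        mv "$tmp" "$dst"
    else
        rm -f "$tmp"
        echo "install_checked: copy of $src failed, $dst left alone" >&2
        return 1
    fi
}

# Hypothetical use (run ldconfig only if the install succeeded):
#   install_checked /tmp/libc.so.1.8 /lib/libc.so.1.8 && /usr/etc/ldconfig
```

The temporary name keeps a truncated copy from ever appearing under the
name the dynamic linker looks for.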

Nadim YARED.
Ecole Nationale Superieure des Telecommunications de PARIS.

----------------------------------------------------------------------------

From: col...@gid.co.uk (Colston Sanger)
Summary: Ah, the scratch monkey story....
Organization: GID Ltd, Upper Basildon, Reading, UK

In article <17...@frackit.UUCP>, da...@frackit.UUCP (Dave Ratcliffe) writes:
> In article <1992Oct14....@sci34hub.sci.com>, ga...@sci34hub.sci.com (Gary Heston) writes:
> > In article <1992Oct7.1...@multix.no>, ar...@multix.no (Arne Asplem) writes:
> > With all these stories, I'm surprised nobody has posted the "scratch monkey"
> > story. Has that admin gone onto bigger and better things?
>
> ... If anyone
> has access to the file in question I think now is an excellent time to
> drag it out and regale us with it.

Here it is:

From er...@snark.thyrsus.com Sat Mar 30 23:19:09 1991
Subject: Apologies to all fans of Mabel!
Followup-To: alt.folklore.computers

In responding to several posters' pleas for the reinstatement of Mabel,
I clean forgot that the condensed `Story of Mabel' wasn't added to the
`scratch monkey' entry till 2.8.2, which most of you don't have.

Here is the relevant bit from 2.8.5:

@h{scratch monkey} n. As in, ``Before testing or reconfiguring, always
mount a scratch monkey.'', a proverb used to advise caution when
dealing with irreplaceable data or devices. Used to refer to any
scratch volume hooked to a computer during any risky operation as a
replacement for some precious resource or data that might get
trashed.

This term preserves the memory of Mabel, the Swimming Wonder
Monkey, star of a biological research program at a great American
university. Mabel was not (so the legend goes) your ordinary
monkey; the university had spent years teaching her how to swim,
breathing through a regulator, in order to study the effects of
different gas mixtures on her physiology. Mabel suffered an
untimely demise one day when a computer vendor @e{PM}ed the machine
controlling her regulator (see also @e{provocative maintenance}).
It is recorded that, after calming down an understandably irate
customer sufficiently to ascertain the facts of the matter, the
vendor's troubleshooter called up the @e{field circus} manager
responsible and asked him sweetly ``Can you swim?''. The moral is
clear: when in doubt, always mount a scratch monkey. See
@e{scratch}. @refill

I hope this satisfies Mabel's fans. The volume of the outcry for her
resurrection has been remarkable (which is actually pleasant, because
it vindicates my original idea that the story was worth including).

Art Evans (the gentleman who posted the story to comp.risks) is doubtless an
estimable person with whom I'd enjoy becoming acquainted, but a writer he
is not. In particular, it always bothered me how he muffed the punch line...
oh, heck, I guess I'll include the posting so you can see for yourself.

------------------------------------------------------------------------

The following, modulo a couple of inserted commas and
capitalization changes for readability, is the exact text of a famous
USENET message. The reader may wish to review the definitions of
@e{PM} in the main text before continuing.

Date: Wed 3 Sep 86 16:46:31-EDT
From: "Art Evans" <Evans@@TL-20B.ARPA>
Subject: Always Mount a Scratch Monkey
To: Risks@@CSL.SRI.COM

My friend Bud used to be the intercept man at a computer vendor for
calls when an irate customer called. Seems one day Bud was sitting at
his desk when the phone rang.

Bud: Hello. Voice: YOU KILLED MABEL!!
B: Excuse me? V: YOU KILLED MABEL!!

This went on for a couple of minutes and Bud was getting nowhere, so he
decided to alter his approach to the customer.

B: HOW DID I KILL MABEL? V: YOU PM'ED MY MACHINE!!

Well, to avoid making a long story even longer, I will abbreviate what had
happened. The customer was a Biologist at the University of Blah-de-blah,
and he had one of our computers that controlled gas mixtures that Mabel (the
monkey) breathed. Now, Mabel was not your ordinary monkey. The University
had spent years teaching Mabel to swim, and they were studying the effects
that different gas mixtures had on her physiology. It turns out that the
repair folks had just gotten a new Calibrated Power Supply (used to
calibrate analog equipment), and at their first opportunity decided to
calibrate the D/A converters in that computer. This changed some of the gas
mixtures and poor Mabel was asphyxiated. Well, Bud then called the branch
manager for the repair folks:

Manager: Hello
B: This is Bud, I heard you did a PM at the University of
Blah-de-blah.
M: Yes, we really performed a complete PM. What can I do
for you?
B: Can you swim?

The moral is, of course, that you should always mount a scratch monkey.

~~~~~~~~~~~~~~~~~~~~~~~~~~~

There are several morals here related to risks in use of computers.
Examples include, ``If it ain't broken, don't fix it.'' However, the
cautious philosophical approach implied by ``always mount a scratch
monkey'' says a lot that we should keep in mind.

Art Evans
Tartan Labs

------------------------------------------------------------------------

Let's face it, people, that ending just does not work as well as it ought.
The moral isn't ``always mount a scratch monkey''; sometimes you gotta use
real monkeys, or you don't get any work done. The moral is properly
``*when in doubt* (that is, when you're going to do something that might
crash the system)'' always mount a scratch monkey.

I'm sure this is what Art meant, but it's not what he said. This and other
infelicities in the writing (rambling prose, shaky punctuation, awkward
anti-climactic appendix after the tildes etc.) made the scratch monkey
appendix target #1 when it came to trim time.

As much as possible, I tried to capture the flavor of the anecdote in my
condensation without reproducing the bugs. Is that satisfactory?
--
Eric S. Raymond = er...@snark.thyrsus.com (mad mastermind of TMN-Netnews)

----------------------------------------------------------------------------

From: val...@vttcf.cc.vt.edu (Valdis Kletnieks)
Organization: Virginia Tech, Blacksburg, VA

Well, here's a few contributions of mine, over 10 years of hacking
Unixoid systems:

1) yesterday's panic: Applying a patch tape to an AIX 3.2 system
to bring it to 3.2.3. Having had reasonable success at this before,
I used an xterm window from my workstation. Well, at some point,
a shared library got updated.. I'd seen this before on other machines -
what happens is that 'more', 'su', and a few other things start failing
mysteriously. Unfortunately, I then managed to nuke ANOTHER window
on my workstation - and the SIGHUP semantics took out all windows I
spawned from the command line of that window.

So - we got a system that I can login to, but can't 'su' to root.
And since I'm not root, I can't continue the update install, or clean
things up. I was in no mood to pull the plug on the machine when
I didn't know what state it was in - was kind of in no mood to reboot
and find out it wasn't rebootable.

I finally ended up using FTP to coerce all the files in /etc/security
so that I could login as root and finish cleaning up....

Ended up having to reboot *anyhow* - just too much confusion with the
updated shared library...

2) Another time, our AIX/370 cluster managed to trash the /etc/passwd
file. All 4 machines in the cluster lost their copies within
milliseconds. In the next few minutes, I discovered that (a) the
nightly script that stashed an archive copy hadn't run the night before
and (b) that our backups were pure zorkumblattum as well. (The joys
of running very beta-test software).

I finally got saved when I realized the cluster had *5* machines in it -
a lone PS/2 had crashed the night before, and failed to reboot. So
it had a propagated copy of /etc/passwd as of the previous night.

Go to that PS/2, unplug its Ethernet.. reboot it. Copy /etc/passwd
to floppy, carry to a working (?) PS/2 in the cluster, tar it off,
let it propagate to other cluster sites. Go back, hook up the
crashed PS/2's Ethernet.. All done.

Only time in my career that having beta-test software crash a machine
saved me from bugs in beta-test software. ;)

3) Once I was in the position of upgrading a Gould PN/9080. I was
a good sysadmin, took a backup before I started, since the README said
that they had changed the I-node format slightly. I do the upgrade,
and it goes with unprecedented (for Gould) smoothness. mkfs all
the user partitions, start restoring files. Blam.

I/O error on the tape. All 12 tapes. Both Sets of backups.

However, 'dd' could read the tape just fine.

36 straight hours later, I finally track it down to a bad chip on the
tape controller board - the chip was involved in the buffer/convert
from a 32-bit backplane to an 8-bit I/O cable. Every 4 bytes, the
5th bit would reverse sense. 20 mins later, I had a program
written, and 'dd | my_twiddle | restore -f -' running.

Moral: Always *verify* the backups - the tape drive didn't report a
write error, because what it *received* and what went on the tape
were the same....
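
One way to act on that moral is to read every backup back and compare it
with what was sent. The following is a sketch under assumptions:
backup_and_verify is a made-up name, tar stands in for dump, and a real
tape would also need block padding taken into account.

```shell
# Archive directory $1 into $2 (a file or device), then read $2 back and
# compare a checksum of what was sent with a checksum of what can be read.
backup_and_verify() {
    src_dir=$1
    dest=$2
    # tee writes the archive to $dest while cksum sums the same byte stream.
    sent=$( (cd "$src_dir" && tar cf - .) | tee "$dest" | cksum ) || return 1
    readback=$(cksum < "$dest") || return 1
    if [ "$sent" != "$readback" ]; then
        echo "backup_and_verify: read-back differs from what was written" >&2
        return 1
    fi
}

# Hypothetical use:
#   backup_and_verify /users /backup/users.tar || echo "DO NOT mkfs yet"
```

Because the comparison uses data read back from the medium, it would
notice corruption introduced between the host and the tape, which the
drive's own write check could not.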

I'm sure I have other sagas, but those are some of the more memorable
ones I've had...

Valdis Kletnieks
Computer Systems Engineer
Virginia Tech

---------------------------------------------------------------------------
--
Anatoly Ivasyuk @ Computer Science House @ Rochester Institute of Technology
ana...@nick.csh.rit.edu || axi...@ultb.rit.edu || axi...@ritvax.rit.edu

You say you haven't heard of CSH? You will...
