The Unofficial Unix Administration Horror Stories Summary (part 1)

A.X. Ivasyuk

Dec 3, 1992, 11:24:44 AM

Unix Admin. Horror Story Summary, version 1.0
-----------------------------------------
compiled by: Anatoly Ivasyuk (ana...@nick.csh.rit.edu)

This is version 1.0 of the unofficial "Unix Administration Horror Story
Summary". This is a summary of the "Unix Administration Horror Stories"
thread which was seen in comp.unix.admin in October '92. I put this
together for two reasons:
1) Some of these stories are damn amusing.
2) Many people can learn many things about what *not* to do when
they're in charge of a system.

This summary contains quite a few different types of stories. There are
success stories, and... well... other stories. But the most important thing
that can be learned from this is not that you have to make backups (we all
know that, right? ;-) ). More important than making backups is to make sure
your backups are complete and verified. For more on this, see the story
about trying to back up 300MB drives onto 150MB tapes.

If there are additional stories that anyone wants to submit, I'll be glad
to add them to this FAQ. Send them to me at: ana...@nick.csh.rit.edu.
Please send any general comments my way, also.

Please consider this a "beta test" release. I have not had the time to
go over this as many times as I wanted to, so there may be mistakes in
my editing. I have not edited the content of the stories except where
noted, and may have excluded stories or bits where I felt it was appropriate.

-Anatoly

-----------------------------------------------------------------------------

The posting that started it all:
--------------------------------

On 7 Oct 92 12:02:46 GMT, ar...@multix.no (Arne Asplem) said:

> I'm the program chair for a one day conference on Unix system
> administration in Oslo in 3 weeks, including topics like network
> management, system administration tools, integration, print/file-servers,
> security, etc.

> I'm looking for actual horror stories of what has gone wrong because
> of bad system administration, as an early morning wakeup.

> I'll summarise to the net if there is any interest.

> -- Arne

-----------------------------------------------------------------------------

From: jd...@maggie.mit.edu (John Ellithorpe)
Organization: Massachusetts Institute of Technology

Here's a pretty bad story. I wanted to have root use tcsh instead of the
Bourne shell. So I decided to copy tcsh to /usr/local/bin. I created the
file, /etc/shells, and put in /usr/local/bin/tcsh, along with /bin/sh and
/bin/csh.

All seemed fine, so I used the chsh command and changed root's shell to
/usr/local/bin/tcsh. So I logged out and tried to log back in. Only to find
out that I couldn't get back in. Every time I tried to log in, I only got
the statement: /usr/local/bin/tcsh: permission denied!

I instantly realized what I had done. I forgot to check that tcsh had
execute privileges and I couldn't get in as root!

After about 30 minutes of getting mad at myself, I finally figured out to just
bring the system down to single-user mode, which ONLY uses the /bin/sh,
thankfully, and edited the password file back to /bin/sh.

I'll never do that again. This wasn't that much of a horror story, but good
enough if you aren't that familiar with the system.

John
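
The check that was skipped here can be sketched in a few lines of shell. This
is only an illustration, not part of the original post: the function name
shell_is_usable is made up, and it assumes a shells file with one path per
line (by default /etc/shells). It confirms that the candidate is a regular
file, that it is executable, and that it is listed in the shells file.

```shell
# Hypothetical helper: succeed only if $1 looks safe to use as a login shell.
# $2 is the shells file to consult (assumed default: /etc/shells).
shell_is_usable() {
    candidate=$1
    shells_file=${2:-/etc/shells}

    # Must be a regular file that the caller is allowed to execute.
    [ -f "$candidate" ] || { echo "$candidate: not a regular file" >&2; return 1; }
    [ -x "$candidate" ] || { echo "$candidate: not executable" >&2; return 1; }

    # Must appear verbatim on a line of its own in the shells file.
    grep -Fqx -- "$candidate" "$shells_file" 2>/dev/null || {
        echo "$candidate: not listed in $shells_file" >&2
        return 1
    }
}
```

Running something like this before changing root's shell, and keeping a second
root session open until a fresh login has been tested, would probably have
avoided the trip to single-user mode.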

-----------------------------------------------------------------------------

From: dbri...@dave.mis.semi.harris.com (Dave Brillhart)
Organization: Harris Semiconductor

We can laugh (almost) about it now, but...

Our operations group, a VMS group but trying to learn UNIX, was assigned
account administration. They were cleaning up a few non-used accounts
like they do on VMS - backup and purge. When they came across the
account "sccs", which had never been accessed, away it went. The
"deleteuser" utility from DEC asks if you would like to delete all
the files in the account. Seems reasonable, huh?

Well, the home directory for "sccs" is "/". Enough said :-(
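
A guard of the following kind is one way an account-removal script could
protect itself. It is a minimal sketch under stated assumptions: the name
home_is_purgeable and the list of protected directories are invented for
illustration and say nothing about how DEC's utility actually works. It also
does not resolve symlinks or ".." components.

```shell
# Hypothetical guard: succeed only if $1 is a plausible per-user home directory.
# The list of protected directories is an assumption; extend it for your site.
home_is_purgeable() {
    case $1 in
        /*) ;;    # absolute path, carry on
        *)  echo "refusing relative path: $1" >&2; return 1 ;;
    esac

    # Strip trailing slashes so "/usr/" is treated like "/usr".
    # "/" itself ends up as the empty string.
    dir=$1
    while [ "${dir%/}" != "$dir" ]; do dir=${dir%/}; done

    case $dir in
        ''|/bin|/sbin|/etc|/usr|/var|/dev|/lib|/home|/tmp)
            echo "refusing to purge system directory: $1" >&2
            return 1 ;;
    esac
}
```

A removal script would call this on the home directory field from the password
file and skip the file deletion step whenever it fails.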

-----------------------------------------------------------------------------

From: t...@stein.u.washington.edu (Tim Smith)
Organization: University of Washington, Seattle

I was working on a line printer spooler, which lived in /etc. I wanted
to remove it, and so issued the command "rm /etc/lpspl." There was only
one problem. Out of habit, I typed "passwd" after "/etc/" and removed
the password file. Oops.

I called up the person who handled backups, and he restored the password
file.

A couple of days later, I did it again! This time, after he restored it,
he made a link, /etc/safe_from_tim.

About a week later, I overwrote /etc/passwd, rather than removing it.

After he restored it again, he installed a daemon that kept a copy of
/etc/passwd, on another file system, and automatically restored it if
it appeared to have been damaged.

Fortunately, I finished my work on /etc/lpspl around this time, so we
didn't have to see if I could find a way to wipe out a couple of
filesystems...

--Tim Smith

-----------------------------------------------------------------------------

From: ni...@BNR.CA ("Nick Pitfield", N.T.)

Greetings,

The following horror story occurred only last week....

One of my colleagues had been itching to get into sys admin for some time,
so last week he was finally sent on a 5-day sys admin course run by HP in
Bracknell..

On the following Sunday, he decided to try out his new found knowledge by
trying to connect and configure a DAT drive on one of our critical test
systems. He connected the cables up okay, and then created the device file
using 'mknod'.

Unfortunately, he gave the device file the same minor & major device numbers
as the root disk; so as soon as he tried to write to this newly installed
'DAT drive', the machine went tits up with a corrupt root disk....ho hum.

Regards.

Nick Pitfield.

-----------------------------------------------------------------------------

From: phi...@haas.berkeley.edu (Philip Enteles)
Organization: Haas School of Business, Berkeley

As a new system administrator of a Unix machine with limited space I
thought I was doing myself a favor by keeping things neat and clean. One
day as I was 'cleaning up' I removed a file called 'bzero'. Strange
things started to happen: vi didn't work, then the complaints started
coming in. Mail didn't work. The compilers didn't work. About this time
the REAL system administrator poked his head in and asked what I had
done. Further examination showed that bzero is the zeroed memory without
which the OS had no operating space so anything using temporary memory
was non-functional. The repair? Well things are tough to do when most of
the utilities don't work. Eventually the REAL system administrator took
the system to single user and rebuilt the system including full
restores from a tape system. The moral is don't be too anal about things
you don't understand. Take the time to learn what those strange files are before
removing them and screwing yourself.

Philip Enteles

-----------------------------------------------------------------------------

From: brob...@waggen.twuug.com (Bill Roberts)
Organization: Brite Systems

My most interesting experience in this regard was when I deleted "/dev/null". Of
course it was soon recreated as a "regular file", then permission problems
started to show up.

I was new at the game at the time and couldn't figure out what happened!
It looked good to me. I didn't know about "special files" and "mknod" and
major and minor device codes. A friend finally helped out and started
laughing and put me on the right track. That one episode taught me a
lot about my system.

-----------------------------------------------------------------------------

From: Frank T Lofaro <fl...@andrew.cmu.edu>
Organization: Sophomore, Math/Computer Science, Carnegie Mellon, Pittsburgh, PA

Well one time I was installing a minimal base system of Linux on a
friend's PC, so that we would have all the necessary utilities to bring
over the rest of the stuff. His 3 1/2 inch disk was dead, so we had to
get the 5 1/4 inch version of the boot/root disk. Too bad that version,
having to fit in 1.2M instead of 1.44, didn't have tar. We could get a
version of tar, but it was in a tar file (nice chicken and egg
scenario). I said, okay, since we don't have tar, we can't use that to
copy the files from floppy to the hard disk, I'll use cp instead (bad
move). It actually seemed to work for a while, then the machine
rebooted! I did it again, the same thing happened. Then I realized cp
wouldn't work on device files! (this is what happens when you try to
install un*x at 3 AM). It just read the contents of the device and made
a file containing such, which is undesirable in any event. (when it
read /dev/port, the device file that references I/O ports, it must've
done something to reboot the machine, that was the file that was causing
the reboots).

I finally got it working by having him get the tar archive of the
linux binaries (including the tar we needed), and untarring it on one of
the public decstations here, so we could ftp tar to his PC using his dos
tcp/ip stuff. A funny aside was that it untarred into ~/bin, and
superseded all his normal commands. We were wondering why everything
wouldn't run. Luckily it wasn't too hard to fix after we realized what
happened.

-----------------------------------------------------------------------------

From: mfra...@grebyn.com (Marc Fraioli)
Organization: Grebyn Timesharing

Well, here's a good one for you:

I was happily churning along developing something on a Sun workstation,
and was getting a number of annoying permission denieds from trying to
write into a directory hierarchy that I didn't own. Getting tired of
that, I decided to set the permissions on that subtree to 777 while I
was working, so I wouldn't have to worry about it. Someone had recently
told me that rather than using plain "su", it was good to use "su -",
but the implications had not yet sunk in. (You can probably see where
this is going already, but I'll go to the bitter end.) Anyway, I cd'd
to where I wanted to be, the top of my subtree, and did su -. Then I
did chmod -R 777. I then started to wonder why it was taking so damn
long when there were only about 45 files in 20 directories under where I
(thought) I was. Well, needless to say, su - simulates a real login,
and had put me into root's home directory, /, so I was proceeding to set
file permissions for the whole system to wide open. I aborted it before
it finished, realizing that something was wrong, but this took quite a
while to straighten out.

Marc Fraioli
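
One way to make this kind of slip harder is to have the recursive command
state where it expects to be run. The wrapper below is a sketch only; the name
in_dir is made up for illustration. It compares the physical current directory
with the expected one and runs the command only when they match.

```shell
# Hypothetical wrapper: run a command only if the current directory is the expected one.
# Usage: in_dir EXPECTED_DIR command [args...]
in_dir() {
    expected=$1
    shift
    # Compare physical paths so that symlinks do not cause false mismatches.
    want=$(cd "$expected" 2>/dev/null && pwd -P) || {
        echo "in_dir: cannot resolve $expected" >&2
        return 1
    }
    here=$(pwd -P)
    if [ "$here" != "$want" ]; then
        echo "in_dir: in $here, expected $want; not running: $*" >&2
        return 1
    fi
    "$@"
}
```

With this, a recursive chmod issued after "su -" would be refused, because the
login simulation leaves root in its home directory rather than in the subtree.
Passing an absolute path to chmod instead of relying on the current directory
achieves much the same thing.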

-----------------------------------------------------------------------------

From: rhe...@renext.open.ch (Richard H. E. Eiger)
Organization: Olivetti (Schweiz) AG, Branch Office Berne

In article <1992Oct9.1...@u.washington.edu> t...@stein.u.washington.edu
(Tim Smith) writes:
> I was working on a line printer spooler, which lived in /etc. I wanted
> to remove it, and so issued the command "rm /etc/lpspl." There was only
> one problem. Out of habit, I typed "passwd" after "/etc/" and removed
> the password file. Oops.
>
[deleted to save space]
>
> --Tim Smith

Here's another story. Just imagine having the sendmail.cf file in /etc. Now, I
was working on the sendmail stuff and had come up with lots of sendmail.cf.xxx
which I wanted to get rid of so I typed "rm -f sendmail.cf. *". At first I was
surprised about how much time it took to remove some 10 files or so. Hitting
the interrupt key when I finally saw what had happened was way too late,
though.

Fortune has it that I'm a very lazy person. That's why I never bothered to just
back up directories with data that changes often. Therefore I managed to
restore /etc successfully before rebooting... :-) Happy end, after all. Of
course I had lost the only well working version of my sendmail.cf...

Richard
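
The stray space is easy to spot if the argument list is printed before it is
handed to rm. The helper below is a sketch for illustration (the name
show_targets is made up); it prints each argument on its own line exactly as
the shell expanded it.

```shell
# Hypothetical helper: print each argument on its own line, exactly as the
# shell expanded it, so the list can be inspected before it is given to rm.
show_targets() {
    for target in "$@"; do
        printf '%s\n' "$target"
    done
}
```

In a directory like /etc, "show_targets sendmail.cf.*" lists only the
sendmail.cf.xxx files, while "show_targets sendmail.cf. *" lists the literal
word "sendmail.cf." followed by every file in the directory, which is the
warning sign.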

-----------------------------------------------------------------------------

From: mi...@cirrus.com (Mitch Wright)
Organization: Cirrus Logic Inc.

I guess I should add a story (or maybe not). Anyway, a fellow sysadmin
was looking to free up some much needed disk space. Since it was purely
a production machine I suggested that he go through and "strip" his binaries.
Unfortunately I made the assumption that he knew what strip does and would
use it wisely -- flashes of the Bad News Bears come to mind now.
To make it short, he stripped /vmunix which didn't destroy the system, but
certainly caused some interesting problems.

~mitch

-----------------------------------------------------------------------------

From: hi...@cc.swarthmore.edu (Eiji Hirai)
Organization: Information Services, Swarthmore College, Swarthmore, PA, USA

Some of these are stories of pure stupidity rather than of interesting horror
but they did happen.

[ BTW, these happened at a different place at a different time than where I
am now. Don't bother my current employer about it. ]

(1) A consultant we had hired (and not a very good one) was installing Unix
on one of our workstations. He was mucking with creating and deleting
/dev/tty* files and made /dev/tty a regular file. Weird things started to
happen. Commands would only print their output if you pressed return twice,
etc. Fortunately, we solved the problem by re-mknod-ing /dev/tty. However,
it took a while to realize what was causing this problem.

(2) I wanted to create a second swap partition on another disk and made the
partition start at sector 0 of the disk! (which sounded ok at the time since
all other regular 'a' partitions started on sector 0) Every time I rebooted,
fsck would complain about missing partition tables - I initially suspected
that the disk was bad but I later realized that swapping was overwriting the
partition table. I had lost an unknown percentage of the financial data for
the institution that I was working for at the time, right when they were
being audited! Yikes! Anyway, we were able to recover the data and life
returned to normal but I did wonder at the time whether I could still keep
my job there.

(3) At the same institution, we were running a system software that had a
serious bug where if anyone had logged out ungracefully, the system wouldn't
let any more users onto the system and users who were logged on couldn't
execute any new commands. (The newest release of the software later on did
fix this bug.) I had to reboot the machine to restore the system to a sane
state. I did a wall <<EOF We need to shutdown blah blah... EOF and then
shutdown. Well, I should've waited since at the precise moment, one of our
users was doing a once-a-year massive conversion of our financial data (talk
about bad luck). I had shutdown in the middle of a very long disk write and
thus, data was lost. We did recover that data and life went on. Moral:
make damn sure that *no one* is doing anything on your system before you
reboot, even if other users are vociferously clamoring for you to reboot.

(4) I heard this from a fellow sysadmin friend. My friend was forced to
work with some sysadmins who didn't have their act together. One day, one
of them was "cleaning" the filesystem and saw a file called "vmunix" in /.
"Hmm, this is taking up a lot of space - let's delete it". "rm /vmunix".

My friend had to reinstall the entire OS on that machine after his coworker
did this "cleanup". Ahh, the hazards of working with sysadmins who really
shouldn't be sysadmins in the first place.

Moral of all these stories: if I had to hire a Unix sysadmin, the first
thing I'd look for is experience. NOTHING can substitute for down-to-earth,
real-life grungy experience in this field.

-----------------------------------------------------------------------------

From: je...@incc.com (Jerry Rocteur)
Organization: InCC.com Perwez Belgium

Horror story,

I sent one of my support guys to do an Oracle update in Madrid.

As instructed he created a new user called esf and changed the files
in /u/appl to owner esf, however in doing so he *must* have cocked up
his find command, the command was:

find /u/appl -user appl -exec chown esf {} \;

He rang me up to tell me there was a problem, I logged in via x25 and
about 75% of files on system belonged to owner esf.

VERY little worked on system.

What a mess, it took me a while and I came up with a brain wave to
fix it but it really screwed up the system.

Moral: be *very* careful of find execs, get the syntax right!!!!
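
The story does not say what was actually typed, so the following is only a
sketch of a general precaution: run the same find expression with -print
first and read the list before swapping in -exec. The function name
list_owned is made up for illustration.

```shell
# Hypothetical helper: list what a "find ... -exec chown" would touch,
# without changing anything.
# $1 = directory to search, $2 = current owner (user name or numeric uid).
list_owned() {
    [ -d "$1" ] || { echo "list_owned: $1 is not a directory" >&2; return 1; }
    find "$1" -user "$2" -print
}
```

For the case above this would be "list_owned /u/appl appl". If the output
contains anything outside the application tree, the search path or the
expression is wrong and the chown should not be run.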

-----------------------------------------------------------------------------

From: we...@bach.udel.edu (Ken Weaverling)
Organization: University of Delaware

A friend of mine called me up saying he no longer could log into his
system. I asked him what he had done recently, and found out that he
thought that all executable programs in /bin /usr/bin /etc and so on
should be owned by bin, since they were all binaries! So he had
chown'ed them all.

-----------------------------------------------------------------------------

From: rsj@wa4mei (Randy Jarrett)
Organization: Amateur Radio Gateway WA4MEI, Chamblee, GA

Here's one that will show that you shouldn't work on a system
that you don't thoroughly understand.

At my "previous" employer I was instructed to install a new
(larger) disk drive in an RS/6000 system. Since a full backup
of the system was done the previous day I just looked at the file
systems via a df to see which were on the drive that I was replacing.
After this I did a tape backup of these filesystems, ran smit and
did a remove of these filesystems. I then installed the new disk
and brought the system back up. When I ran smit and when I was able
to do the installation of the new drive and setup the file systems
I was figuring that this was going to be an easy one. WRONG!! I was
aware that you could expand filesystems under AIX but was not aware
that it would expand them 'across physical drives'!!! I first
realized that I was in trouble when I went to read in the backup tape
and cpio was not found. I did an ls of the /usr/bin directory and it
said that the file was there but when I tried to run it it was not
found. And of course when I went looking for the original install tape
it was not to be found....

Randy

-----------------------------------------------------------------------------

From: gr...@Speech.SRI.COM (Steven Tepper)
Organization: SRI International

This may not exactly fit the "administration horror story" category, but...

At one place where I worked, someone had set up cron to delete any
file named "core" more than a few days old, since disk space was
always tight and most users wouldn't know what core files were or care
about them. Unfortunately not everyone knew about this and one user
lost a plain text file (a project proposal) he'd spent a lot of
time working on because he called it "core". This was around 1976,
when Unix was still considered exotic and before bookstores carried
entire sections of Unix books.

-greep

-----------------------------------------------------------------------------

From: d...@jet.uk (David J Stevenson)
Organization: Joint European Torus

In <W1NR...@cc.swarthmore.edu> hi...@cc.swarthmore.edu (Eiji Hirai) writes:
>...[some deleted]
>(4) I heard this from a fellow sysadmin friend. My friend was forced to
>work with some sysadmins who didn't have their act together. One day, one
>of them was "cleaning" the filesystem and saw a file called "vmunix" in /.
>"Hmm, this is taking up a lot of space - let's delete it". "rm /vmunix".

>My friend had to reinstall the entire OS on that machine after his coworker
>did this "cleanup". Ahh, the hazards of working with sysadmins who really
>shouldn't be sysadmins in the first place.

When this happened to a colleague (when I worked somewhere else) he restored
vmunix by copying from another machine. Unfortunately, a 68000 kernel does
not run very well on a Sparc...

-----------------------------------------------------------------------------

From: smck...@sunicnc.France.Sun.COM (Steve McKinty - Sun ICNC)
Organization: SunConnect

In article <W1NR...@cc.swarthmore.edu>, hi...@cc.swarthmore.edu (Eiji Hirai) writes:

> (4) I heard this from a fellow sysadmin friend. My friend was forced to
> work with some sysadmins who didn't have their act together. One day, one
> of them was "cleaning" the filesystem and saw a file called "vmunix" in /.
> "Hmm, this is taking up a lot of space - let's delete it". "rm /vmunix".
>
> My friend had to reinstall the entire OS on that machine after his coworker
> did this "cleanup". Ahh, the hazards of working with sysadmins who really
> shouldn't be sysadmins in the first place.

Hmm. A colleague of mine did much the same by accident on one of
our test machines. After discovering it, fortunately while the machine
was still up & running, he FTPed a copy of /vmunix from the other lab
system (both running exactly the same kernel).

After rebooting his machine everything (to his relief) worked fine.

-----------------------------------------------------------------------------

From: lin...@math.uni-frankfurt.de (Anselm Lingnau)
Organization: University of Frankfurt/Main, Dept. of Mathematics

In article <1992Oct10....@waggen.twuug.com>, brob...@waggen.twuug.com
(Bill Roberts) writes:

> My most interesting experience in this regard was when I deleted "/dev/null". Of
> course it was soon recreated as a "regular file", then permission problems
> started to show up.

Years ago when I was working in the Graphics Workshop at Edinburgh University,
we used to have a small UNIX machine for testing. The machine wasn't used too
much, so nobody bothered to set up user accounts, and so everybody was running
as root all the time. Now one of the chaps who used to come in was fond of
reading fortunes (/usr/games/fortune having been removed from the University's
real machines along with all the other games). Guess what happened when the
machine said

# fortune
fortune: write error on /dev/null --- please empty the bit bucket

Quite a lot of stuff wouldn't work after the chap was done with the machine
for the day. You bet we put up proper accounts after that!

Anselm

-----------------------------------------------------------------------------

From: pe...@NeoSoft.com (Peter da Silva)
Organization: NeoSoft Communications Services -- (713) 684-5900

Well, we had one system on which you couldn't log in on the console for a
while after rebooting, but it'd start working sometimes. What was happening
was that the manufacturer had, for some idiot reason, hardcoded the names
of the terminals they wanted to support into getty (this manufacturer's own
terminals, that I can understand, but also a handful of common types like
adm3a) so getty could clear the screen properly (I guess hacking that into
gettydefs was too obvious or something). If getty couldn't recognise the
terminal type on the command line, it'd display a message on the console
reading "Unknown terminal type pc100". We ignored this flamage, which was
a pity. Cos that was the problem.

It did this *before* opening the terminal, so if it happened to run between
the time rc completed and the getty on the console started the console got
attached to some random terminal somewhere, so when login attempted to open
/dev/tty to prompt for a password it failed.

Moral: always deal with error messages even when you *know* they're bogus.
Moral: never cry wolf.

-----------------------------------------------------------------------------

From: ri...@pmafire.inel.gov (Rick Furniss)
Organization: WINCO

Horror stories:
Did this myself many years ago, and have come close to it since.

Murphy's law #?? , preventive maintenance doesn't.

try this one: /etc/dump /dev/rmt/0m /dev/dsk/0s1
Or: tar cvf /dev/root /dev/rmt0

Backups on unix can be one of the most dangerous commands used,
and they are used to prevent rather than cause a problem. If any Unix
utility were a candidate for a warning message, or error checking, this
would be it.

Just in case you didn't catch the HORROR above, the parameters are backwards,
causing a TOTAL wipe out of the root file systems.

More systems have been wiped out by admins than any hacker could do in
a lifetime.
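
A small wrapper can catch the most obvious form of the swap. This is a sketch
under stated assumptions: the name backup_dir is made up, the argument order
copies "tar cf ARCHIVE SOURCE", and the block-device test is only a heuristic.
On many systems raw disks are character devices, just like tape drives, so
this check will not catch every case.

```shell
# Hypothetical wrapper: back up directory $2 into archive $1, refusing
# arguments that look swapped. Argument order mirrors "tar cf ARCHIVE SOURCE".
backup_dir() {
    archive=$1
    source_dir=$2

    [ -d "$source_dir" ] || {
        echo "backup_dir: source $source_dir is not a directory" >&2
        return 1
    }
    # A block device as the archive target is almost certainly a disk, not a tape.
    [ -b "$archive" ] && {
        echo "backup_dir: $archive is a block device; refusing" >&2
        return 1
    }
    [ -d "$archive" ] && {
        echo "backup_dir: $archive is a directory; refusing" >&2
        return 1
    }

    tar cf "$archive" "$source_dir"
}
```

Called with the arguments reversed, the first test fails because the "source"
is then a device or a file rather than a directory, and nothing is written.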

-----------------------------------------------------------------------------

From: gfo...@javelin.sim.es.com (Gary Fowler)
Organization: Evans & Sutherland Computer Corporation

Once I was going to make a new file system using mkfs. The device I wanted to
make it on was /dev/c0d1s8. The device name that I used, however, was
/dev/c0d0s8 which held a very important application. I had always been a little
annoyed by the 10 second wait that mkfs has before it actually makes the file
system. I'm sure glad it waited that time though. I probably waited 9.9
seconds before I realized my mistake and hit that DEL key just in time. That
was a near disaster avoided.

Another time I wasn't so lucky. I was a very new SA, and I was trying to clean
some junk out of a system. I was in /usr/bin when I noticed a sub directory
that didn't belong there. A former SA had put it there. I did an ls on it and
determined that it could be zapped. Forgetting that I was still in /usr/bin, I
did an rm *. No 10 second idiot proofing with rm. Now if someone would only
create an OS with a "Do what I mean, not what I say" feature.

Gary "Experience is what allows you to recognize a mistake the second time you
make it." Fowler

-----------------------------------------------------------------------------

From: broa...@neurocog.lrdc.pitt.edu (Bill Broadley)
Organization: University of Pittsburgh

On an old DECstation 3100 I was deleting last semester's users to try to
dig up some disk space, I also deleted some test users at the same time.

One user took longer than usual, so I hit control-c and tried ls.
"ls: command not found"

Turns out that the test user had / as the home directory and the remove
user script in ultrix just happily blew away the whole disk.

ftp, telnet, rcp, rsh, etc were all gone. Had to go to tapes, and had
one LONG rebuild of X11R5.

Fortunately it wasn't our primary system, and I'm only a student....

-----------------------------------------------------------------------------

From: hi...@cc.swarthmore.edu (Eiji Hirai)
Message-ID: <DMRR...@cc.swarthmore.edu>
Sender: ne...@cc.swarthmore.edu (USENET News System)
Nntp-Posting-Host: gingko
Organization: Information Services, Swarthmore College, Swarthmore, PA, USA
References: <djd.718900643@reading> <28...@bsu-cs.bsu.edu> <rik.71...@nella15.cc.monash.edu.au>
Date: Tue, 13 Oct 1992 16:00:28 GMT

rik.h...@fcit.monash.edu.au writes:
> I'll mount it in /tmp

Though this may strike most sane sysadmins as bad practice, SunOS (3.4 or so
- my memory is vague) shipped a command called "on". If you were logged on
machine A and wanted to execute a command on machine B, you said "on B
command", sort of like rsh.

However, A would mount B's disks under some invocations of "on" and it would
mount it in /tmp! Of course, lots of folks got bitten by this stupid
command and it was taken out after a long delay by Sun.

Anyone remember the details? I've blocked out my memory of pre-4.0 SunOS.
Am I just hallucinating?

-----------------------------------------------------------------------------

From: matt...@oberon.umd.edu (Mike Matthews)
Organization: /etc/organization

In article <Bw1G0...@gumby.ocs.com> o...@gumby.ocs.com writes:
>Now when I partition a disk I sit there with a calculator and make sure
>all the numbers add up correctly (offsets, number of cylinders, number of
>blocks, and so on).

Heh heh, now that you mention that...

We had just gotten a 1.2G disk drive for our Sun (which direly needed it) so
we felt we'd repartition everything.

All went well, except... on reboot, one of the partitions that was newly
restored from backup got a fsck error. Fixed it, it rebooted, then another
one got an error. fscked that one, rebooted it, and doggone it, the first
error was back!

We had a one cylinder overlap. Sheesh.

At least Ultrix WARNS you of that.

Mike Matthews, matt...@oberon.umd.edu (NeXTmail accepted)

-----------------------------------------------------------------------------

From: mt...@eurotherm.co.uk (Martin Tomes)
Organization: Eurotherm Limited

We had something really weird happen one day. I copied a file to
/usr/local on someone else's machine and all seemed to be OK. A bit
later the user of the machine noticed that the files and directories they
were using on another disk partition were corrupted. There were 2
gigabyte files on a 650Mb disk - and lots of them with weird names and
permissions. At first I did not connect the two events. This disk
had given trouble when the power failed a week before, so I fsck'ed
it. Now I have run fsck more times than I can begin to imagine and
seen plenty of errors, some needing 'manual intervention' but I had
never seen anything like this before! It was spectacular. And what
was more, when I ran it a second time things got worse. Then I tried
to backup the /usr/local partition before restoring this corrupt data
and lo, that was corrupt too. It turned out that our sysadmin had
created the /usr/local disk partition in the wrong place on the disk
and put it over the top of the alternate sectors partition. By
writing to the /usr/local disk I had written all over the alts which
were mapped into the user's partition. Oh dear, what a mess.

Solution, rebuild all the partitions so they don't overlap and
restore, also buy the sysadmin a calculator.

Moral, always do your sums on the /etc/partitions file very carefully
before using mkpart.
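
The sums in question are simple enough to hand to the shell instead of a
calculator. The function below is a minimal sketch (the name
partitions_overlap is made up for illustration). It treats each partition as a
start and a length in the same unit, whether sectors or cylinders, and reports
whether the two ranges share any unit.

```shell
# Hypothetical helper: succeed (exit 0) if two partitions overlap.
# Each partition is given as START LENGTH, both in the same unit.
partitions_overlap() {
    start_a=$1; len_a=$2
    start_b=$3; len_b=$4
    end_a=$((start_a + len_a))   # first unit past the end of A
    end_b=$((start_b + len_b))   # first unit past the end of B
    # Half-open ranges [start, end) overlap when each starts before the other ends.
    [ "$start_a" -lt "$end_b" ] && [ "$start_b" -lt "$end_a" ]
}
```

For example, "partitions_overlap 0 100 100 50" fails (the second partition
starts exactly where the first ends), while "partitions_overlap 0 101 100 50"
succeeds, which is the one-cylinder overlap from the earlier story. Two
partitions that both start at 0, as in the swap space stories, always overlap.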

-----------------------------------------------------------------------------

From: c...@Unify.Com (Chris A. Anderson)
Organization: Unify Corporation, Sacramento, California

Ok, here's one...

At a company that I used to work for, the CEO's brother was the
"system operator". It was his job to do backups, maintenance,
etc. Problem was, he didn't have a clue about Unix. We were
required to go through him to do anything, though.

Well, I was setting up a Plexus P-95 to be a
news/mail/communications machine and needed to wipe the disks and
install a new OS. El CEO requested that his brother do the
installation and disk partitioning. He had done this before, so I
gave him the partition maps and let him at it. When he was done,
everything seemed to be ok. Great, on with the install and
setup.

Things went fine until I started compiling the news and mail
software. All of a sudden, the machine panicked. I brought it
back up and the root file system was amazingly corrupt. After
rebuilding things, it all seemed to be fine -- diagnostics all
ran fine, etc. So I started again -- this time keeping an eye on
things. Sure enough, the root file system became corrupted again
when the system started to load.

This time I brought it down and checked everything. The problem?
Swap space started at block zero and so did the root file system.
ARRRGGGHHHHH!!

Oh yes, the brother still works there.

Chris

-----------------------------------------------------------------------------

From: mi...@Chaos.mcs.kent.edu (Roger Miles)
Organization: Kent State University

A year ago we moved to a brand spanking new building. All the equipment
was moved by professional movers. The last piece of equipment I wanted
moved was the computer (a Zilog s8000, 6ft. tall, with 3 disk drives,
cartridge drive and reel tape drive all mounted in one cabinet. It must have
weighed 250 to 300 lbs) because I wanted to keep an eye on the movers.
Actually, I was hoping they'd drop it so I could get a new computer. Anyway,
much to my surprise the movers said they would not move the computer because
of the liability. One of my co-workers owned a Ford pickup so we hoisted it
up and drove off with me riding in the back hanging on to the Zilog. It
was the longest 15 minute drive I was ever on in my life.

Roger Miles
KSU

-----------------------------------------------------------------------------

From: t...@hrt213.brooks.af.mil (Tim Miller)
Organization: AL/HRTI, Brooks AFB

This one qualified for Stupid Act of the Month:

All this happened on my sparcII...

I was making room on / because I needed to test run something
(which was using a tmp file in, of all places, /var/tmp. I could have
recompiled the application to use more memory and/or /tmp, but I'm too
lazy for that), so I figure "I'll just compress this, and this, and
this..." One of those "this'" was vmunix.

Well, of course the application crashes the machine, and stupid
me had forgotten that I'd compressed vmunix, so the damn thing won't
boot. checksum: Bad value or some such error. Took me most of the day
to figure out just what I'd done to the dang thing. 8)

Moral(s):

1) Never, ever, EVER play with vmunix.
2) Always keep a log of what you do to the root file system.
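
One way to keep such a log without relying on memory is to run root
file system changes through a small wrapper. This is only a sketch: the
name logrun and the ROOT_CHANGE_LOG variable are hypothetical, and the
log location is an assumption (it should live somewhere you can still
read when the machine won't boot).

```sh
#!/bin/sh
# logrun: append a timestamped record of a command to a log file, then
# run the command. Both the function name and ROOT_CHANGE_LOG are
# invented for this sketch.
ROOT_CHANGE_LOG=${ROOT_CHANGE_LOG:-./root-changes.log}

logrun() {
    # Record when, where, and what before doing it.
    printf '%s %s: %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$(pwd)" "$*" \
        >> "$ROOT_CHANGE_LOG"
    "$@"
}

# Typical use would look like:
#   logrun compress /some/large/file
```

The entry is written before the command runs, so the log still shows
the last thing attempted even if that command takes the machine down.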

-----------------------------------------------------------------------------

From: jar...@dvorak.amd.com (John Jarocki)
Organization: Advanced Micro Devices, Inc.; Austin, Texas

In article <ericw.718908214@hobbes> er...@hobbes.amd.com (Eric Wedaa) writes:
>
>The moral(s) of the story here:
[Eric's "Guidebook to Being a Good Paranoid UNIX Sysadmin" Deleted]
>
>>>>Ericw
>(Paranoia is a "Good Thing" when you can really muck things up!)
>--
>Eric Wedaa - eric....@amd.com | Two more kinds of lies...
>{ames apple uunet}!amd!ericw | Release Dates, and Benchmarks
>Advanced Micro Devices, M/S 167 PO Box 3453 Sunnyvale, CA 94088-3453
>=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Eric,

You left out an important one:
- Never hand out directions on "how to" do some sysadmin task
until the directions have been tested thoroughly.
- Corollary: Just because it works on one flavor
of *nix says nothing about the others. '-}
- Corollary: This goes for changes to rc.local (and
other such "vital" scripties).

-----------------------------------------------------------------------------

From: bi...@chaos.cs.umn.edu ( Hari Seldon ... psychohistorian )
Organization: University of Minnesota

In <1992Oct13.0...@ccu1.aukuni.ac.nz> russ...@ccu1.aukuni.ac.nz (Russell Street) writes:

>r...@Ingres.COM (Bob Arnold) writes:
>> 9) It's a lot less painful to learn from someone else's experience
>> than your own (that's what this thread is about, I guess :-) )

>Without trying to wander off the thread tooooo much ... In my
>experience the best experiences to learn off are your own :)
>I wonder how many stories we have got so far about "I will never
>type rm -r /" as root. (And no I have not done that _yet_, but
>the day will come :()

after a real bad crash (tm) and having been an admin (on an rs/6000)
for less than a month (honest it wasn't my fault, yea right stupid)
we got to test our backup by doing:
# cd /
# rm -rf *
ohhhhhhhh sh*t i hope those tapes are good

ya know it's kinda funny (in a perverse way) to watch the system just
slowly go away.

bill pociengel

-----------------------------------------------------------------------------

From: bar...@calvin.demon.co.uk (Barrie Spence)
Organization: DataCAD Ltd, Hamilton, Scotland

In article <1992Oct13.0...@ccu1.aukuni.ac.nz> russ...@ccu1.aukuni.ac.nz (Russell Street) writes:
>r...@Ingres.COM (Bob Arnold) writes:
>> 9) It's a lot less painful to learn from someone else's experience
>> than your own (that's what this thread is about, I guess :-) )
>
>Without trying to wander off the thread tooooo much ... In my
>experience the best experiences to learn off are your own :)
>I wonder how many stories we have got so far about "I will never
>type rm -r /" as root. (And no I have not done that _yet_, but
>the day will come :()
>

My mistake on SunOS (with OpenWindows) was to try and clean up all the
'.*' directories in /tmp. Obviously "rm -rf /tmp/*" missed these, so I
was very careful and made sure I was in /tmp and then executed
"rm -rf ./.*".

I will never do this again. If I am in any doubt as to how a wildcard
will expand I will echo it first.
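
The echo-first habit, plus a pattern that cannot match "." or "..", can
be sketched as below. Everything happens in a scratch directory made
with mktemp, and the file names are invented for the demonstration.
Whether ".*" includes "." and ".." depends on the shell, so treat the
first echo as something to try on your own system rather than a
guaranteed result.

```sh
#!/bin/sh
# Work in a scratch directory so nothing real is touched.
scratch=$(mktemp -d) || exit 1
cd "$scratch" || exit 1
mkdir .cache ..odd
touch .profile visible

# Preview first: echo shows what the shell would hand to rm.
# In many shells ".*" expands to include "." and "..", which is how
# "rm -rf ./.*" climbs into the parent directory.
echo "unsafe pattern expands to:" ./.*

# ".[!.]*" matches a dot followed by a non-dot character; "..?*" matches
# two dots followed by at least one more character. Neither can match
# "." or "..".
echo "safer pattern expands to:" ./.[!.]* ./..?*

# Collect the safer expansion so it can be inspected before deleting.
dot_entries=$(printf '%s\n' ./.[!.]* ./..?* | sort)

# Clean up the scratch directory.
cd / && rm -rf "$scratch"
```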

Barrie

-----------------------------------------------------------------------------

From: ro...@trebor.uucp (Bob Stockler)
Organization: Bob Stockler

r...@Ingres.COM (Bob Arnold) writes:

>Morals:
> 2) Don't do backups to floppies.

Once, Tandy Xenix had the largest installed base of *NIX systems extant.

My friend, mentor and guru Bob Snapp and I undertook to write a systematic
backup set of shell scripts to do what the *NIX programs then available would
not do: make a reliable compressed Master Backup, and reliable compressed
incremental backups (so 'cron' could do it) to available 8" floppy drives.

We've never found that our programs failed. Now, on SCO *NIX systems we
prefer CTAR. We've never found it to fail either.

-----------------------------------------------------------------------------

From: JR...@cen.ex.ac.uk (J.Rowe)
Organization: Computer Unit. - University of Exeter. UK

In article <rik.71...@nella15.cc.monash.edu.au> r...@nella15.cc.monash.edu.au (Rik Harris) writes:

> I said to myself (being a Friday afternoon...see previous
> post) "it's only temporary.../mnt is already being used...I'll mount
> it in /tmp". So, I mounted on /tmp/a (or something). This was fine
> for a few hours, but then the auto-cleanup script kicked in, and blew
> away half of my source (the stuff over 2 weeks old). I didn't notice
> this for a few days, though. After I figured out what had happened,
> and restored the files (we _do_ have a good backup strategy),
> everything was OK.

If you're doing this using find always put -xdev in:

find /tmp/ -xdev -fstype 4.2 -type f -atime +5 -exec rm {} \;

This stops find from working its way down filesystems mounted under
/tmp/. If you're using, say, perl you have to stat . and .. and see if
they are mounted on the same device. The fstype 4.2 is pure paranoia.
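
Before letting a cleanup like this delete anything, it can be run as a
dry run with -print in place of -exec rm. A minimal sketch follows; the
function name list_stale and the scratch files are made up, and -fstype
is left out here because the values it accepts vary between systems.
The demonstration only shows the age test, since showing -xdev at work
would need a second filesystem mounted under the directory.

```sh
#!/bin/sh
# list_stale DIR DAYS: print regular files under DIR not accessed for
# more than DAYS days, without descending into other filesystems
# mounted below DIR. Printing instead of removing makes it a dry run.
list_stale() {
    find "$1" -xdev -type f -atime "+$2" -print
}

# Demonstration in a scratch directory with one old and one fresh file.
stale_dir=$(mktemp -d) || exit 1
touch -t 200001010000 "$stale_dir/old.c"   # access time set to 1 Jan 2000
touch "$stale_dir/new.c"                   # access time is now
stale_list=$(list_stale "$stale_dir" 5)
echo "$stale_list"

# Clean up the scratch directory.
rm -rf "$stale_dir"
```

Only once the printed list looks right would the -print be swapped
back for the removal.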

Needless to say, I once forgot to do this. All was well for some weeks
until Convex's version of NQS decided to temporarily mount /mnt under
/tmp... Interestingly, only two people noticed. Yes, the chief op.
keeps good backups!

Other triumphs: I created a list of a user's files that hadn't been
accessed for three months and a perl script for him to delete them.
Of course, it had to be tested, I mislaid a quote from a print
statement... This did turn into a triumph, he only wanted a small
fraction of them back so we saved 20 MB.

I once deleted the only line from within an if.. then statement in
rc.local, the sun refused to come up, and it was surprisingly
difficult to come up single user with a writeable file system.

AIX is a whole system of nightmares strung together. If you stray
outside of the sort of setup IBM implicitly assume you have (all IBM
kit, no non IBM hosts on the network, etc.) you're liable to end up in
deep doodoo.

One thing I would like all vendors to do (I know one or two do) is
to give root the option of logging in using another shell. Am I the
only one to have mangled a root shell?

John Rowe

-----------------------------------------------------------------------------

From: koc...@sei.cmu.edu (John Kochmar)
Organization: The Software Engineering Institute

A long time ago, back when the Apollo 460 was around and I had just
graduated from college, I had the good fortune of being one of two
administrators in charge of making a cluster of 460's a part of our
environment. One of the things I was tasked with was getting them onto
our network.

Well, I was young, I had the manuals, and a guy from Apollo tech
support was there to help. How hard could it be, right?

Well, we got out the manuals, configured the system (relying heavily on
the defaults), and within 2 hours, we had that puppy on the network.
Life was good.

About 3 hours later, I get a phone call from a systems programmer /
developer from CMU campus (the SEI is a part of CMU, and we are on their
network.) He told me that if I didn't take the &%@*ing Apollo off the
network, he was going to do hurtful things to me physically.
Life was not so good.

As it turned out, in default mode, the Apollo answered every address
request it saw, even if it was not the machine the request was for.
Kind of a "hey, I'm not who you are looking for, but I'm out here in
case you decide you'd rather talk to me." Apollo considered this a
feature, and they took advantage of it in their OS environment.

However, one of the earlier versions of a heavily network dependent OS
developed at CMU considered this a bug. The OS would issue a request,
and expect only the machine it was looking for to answer it. Of
course, it would assume that if it got an answer to its request, it
must be the machine it expected to talk to. It didn't look at the
address of the answer it got, so if it wasn't the correct machine, most
of the time the OS would hang or panic.

The outcome? Over about 3 hours time, more and more of campus was
talking to our little 460, which had just enough muscle to keep up with
the requests. By the time campus figured out what was going on, we had
an Apollo merrily answering the network requests for hundreds of
machines (the ones that were still up, that is.) The result was the
part of campus that used the new OS going to hell in a bucket, one
very busy Apollo 460, and one very warm ethernet.

Well, we turned off the Apollo, configured it not to chat to all of
campus before putting it back on the ethernet (this time, we did it
while talking with campus, making sure we didn't cause the same
problems we did the last time -- we didn't have a packet monitor at the
time), and campus changed their OS to look at the request response
before assuming it was the correct one. I also learned to think very
carefully about default values before using them.
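
The check campus added amounts to comparing who answered against who
was asked for. The toy sketch below is not the CMU code: the addresses
and the function name are invented, and both are plain strings here,
where a real resolver would compare fields parsed out of the packet.

```sh
#!/bin/sh
# accept_reply WANTED SENDER: succeed only when the reply came from the
# address that was asked for.
accept_reply() {
    [ "$1" = "$2" ]
}

# Hypothetical exchange: we asked for 10.0.0.7 and two machines
# answered, the first of them an over-helpful bystander.
wanted=10.0.0.7
accepted=
for sender in 10.0.0.99 10.0.0.7; do
    if accept_reply "$wanted" "$sender"; then
        accepted=$sender
    else
        echo "ignoring unsolicited reply from $sender"
    fi
done
echo "talking to $accepted"
```

Without the comparison, the first answer to arrive would win, which is
the behaviour described in the story.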

John

-----------------------------------------------------------------------------

From: d...@csg.cs.reading.ac.uk (David J Dawkins)
Organization: University of Reading

we...@bach.udel.edu (Ken Weaverling) writes:

>A friend of mine called me up saying he no longer could log into his
>system. I asked him what he had done recently, and found out that he
>thought that all executable programs in /bin /usr/bin /etc and so on
>should be owned by bin, since they were all binaries! So he had
>chown'ed them all.

Oh you bastards. I was hoping that a thread like this would never
appear, because if it did, I knew I would have to confess. Oh well...

About a year back, I was looking through /etc and found that a few
system files had world write permission. Gasping with horror, I went
to put it right with something like

dipshit# chmod -r 664 /etc/*

(I know, I know, goddamnit!.. now)

Everything was OK for about two to three weeks, then the machine went
down for some reason (other than the obvious). Well, I expect that you
can imagine the result. The booting procedure was unable to run fsck,
so barfed and mounted the file systems read-only, and bunged me into
single-user mode. Dumb expression..gradual realisation..cold sweat. Of
course, now I can't do a frigging chmod +x on anything because it's all
read-only. In fact I can't run anything that isn't part of sh.
Wedgerama. Hysteria time. Consider reformatting disks. All sorts of
crap ideas. Headless chicken scene. Confession.

"You did WHAT??!!"

Much forehead slapping, solemn oaths and floor pacing.

Luckily, we have a local MegaUnixGenius who, having sat puzzled for an hour
or more, decided to boot from a cdrom and take things from there. He fixed
it.

My boss, totally amazed at the fix I'd got the system into, luckily
saw the funny side of it. I didn't. Even though at that stage, I didn't
know much about unix/suns/booting/admin, I did actually know enough to NOT
use a command like the one above. Don't ask. Must be the drugs.
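
For the record, a blanket recursive 664 clears the execute bits on
directories and programs along with everything else. A narrower
approach is to find only the entries that are world-writable and remove
just that one bit. The sketch below does this on a scratch tree; the
function name and file names are made up, and it is an illustration
rather than a tested procedure for a live /etc.

```sh
#!/bin/sh
# strip_world_write DIR: remove only the world-write bit from entries
# under DIR that have it set, leaving every other permission bit alone.
# Symbolic links are skipped.
strip_world_write() {
    find "$1" ! -type l -perm -002 -exec chmod o-w {} +
}

# Try it on a scratch tree first, never directly on /etc.
perm_dir=$(mktemp -d) || exit 1
mkdir "$perm_dir/rc.d"
printf '#!/bin/sh\n' > "$perm_dir/rc.d/startup"
printf 'data\n' > "$perm_dir/hosts"
chmod 777 "$perm_dir/rc.d" "$perm_dir/rc.d/startup"
chmod 666 "$perm_dir/hosts"

strip_world_write "$perm_dir"

# Anything still world-writable would be listed here.
still_writable=$(find "$perm_dir" ! -type l -perm -002 -print)
echo "still world-writable: ${still_writable:-none}"

# Remove the scratch tree when finished with it:
#   rm -rf "$perm_dir"
```

Running the same find with -print in place of the -exec shows what
would be changed before anything is touched.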

BTW, if my future employer _is_ reading this (like they say he/she might),
then I have certainly learned tonnes of stuff in the last year, especially
having had to set up a complete Sun system, fix local problems, etc :-)

Anyone else got a tale of SGS (Spontaneous Gross Stupidity) ?

-dave "I'm much better now, honest.. no, really.. hey what's this button
doooooooooOOOOOO..."
--
Anatoly Ivasyuk @ Computer Science House @ Rochester Institute of Technology
ana...@nick.csh.rit.edu || axi...@ultb.rit.edu || axi...@ritvax.rit.edu

You say you haven't heard of CSH? You will...
