Some people hate me because I'm so perfect. I know it all. My code works the first time. I do everything right. I don't screw up. And I usually keep my head on my shoulders when those around me are losing theirs.
Well, I finally did it. I screwed up. (Which just goes to show that even EPS isn't perfect.) And I screwed up bigtime. In twelve years of driving UNIX boxes, I've never done anything like this, and suddenly I'm staring at an oil slick the size of Alaska, and an unknown number of dead fish--I don't know how many, but I'm sure it's a lot. Last night I really wasn't in any condition to do what I was doing, but I wasn't in any condition to appreciate that I wasn't in any condition to do what I was doing.
"A Tale of Two Computers"
In the beginning, there was sutro. Many of you may have heard of sutro.sfsu.edu, since it's our official anonymous FTP site, one of the secondary domain name servers for SFSU.EDU and FOGNET, and generally a rather high-profile machine. It's also the center of the very first bunch of NeXTs at SFSU, and has a certain sentimental value to me, even though I now spend most of my time on the other side of the campus.
Sutro was a NeXT cube with a 660MB internal drive and a 660MB external drive, both Maxtor 8760Ses, both purchased from NeXT for some outrageous amount of money. Of course, we believed that if anything were to go wrong, NeXT would help us out. (Wrong!)
Anyway, one day in mid-May, something does go wrong. The external hard drive--the one with all the users' files--is flaking out. The problem seems to be with the Top Level Assembly attached to that drive; the disk proper seems to be more or less intact. But the kernel is complaining about SCSI errors, and after too many retries, decides that it wants nothing more to do with it. Calls to NeXT aren't terribly productive--I was hoping we'd be able to borrow another one, do a block-to-block copy, and then back up the data from there. We're basically told that we're on our own, but if we can find another Maxtor 330 or 660, the TLAs are supposedly interchangeable. We had no luck there. The machine was still running, providing name service and routing mail, but no one could log in. The end of the spring semester well within sight, sympathetic instructors canceled final [online] assignments.
The system manager eventually found someone locally who could lend us a disk temporarily so we could attempt some sort of recovery. Unfortunately, he forgot to reset the SCSI unit number on the new drive--which was zero--as was the internal drive--and the machine still booted. Then it was totally random which device would actually service I/O requests. Realizing his mistake, he tried to shut the system down, and did a halt WITH sync. Oops. A whole bunch of inodes on the root partition got smashed, and the system would no longer boot.
I brought over a 2.1 optical disk. sutro hadn't had the best luck with optical drives; its had been replaced *twice*. But it was working flawlessly now. Good thing, too.
While the root partition looked pretty much OK when mounted read-only, running fsck proved to be its final undoing. fsck blew away a bunch of things, like the network-wide NetInfo database--which hadn't been backed up. [Ironically enough, there was a passwd backup: two days before, I had asked one of the lab rats to run sutro's passwd through a password cracker to see how many stupidities he could find since we knew we'd be "hit" by marauders over Memorial Day Weekend. (They did turn up, as expected, but we got off easy compared to the same time last year.) Some 55 users had easily guessable passwords, including a lot of people who "should have known better."]
So we went from having a "minor" problem with one disk to having nothing left but a /clients partition. Since the e-mail spool was on /clients, it was still intact. However, nothing short of a builddisk was going to get the system back up, and we still needed to salvage what we could, since there were other things that had not been backed up (sigh) that had been placed on the root because there just happened to be space there.
After a few days, it was painfully clear that the failing Maxtor was not going to be repaired anytime soon, and that the most expedient way to get the machine back was to order a brand-new external hard drive. Miraculously enough, that happened. Our fiscal year runs from July 1-June 30, and all purchase requisitions for the current FY needed to be submitted by May 29. And then there was the problem of finding the funds to pay for it. The California State University has a horrific budget crisis thanks to the mushrooming state deficit; people are losing their jobs left and right, and money that might normally be used for equipment purchase wasn't expected to be there at all.
On Wednesday, I'm told that a new Fujitsu 2266S has arrived, and would I please help set it up. No problem, but I had a dinner engagement that evening, and promised to return the next day to resurrect sutro. At this point, the machine had been down for something like three weeks, and none of us wanted to know how low our popularity ratings had fallen. Mind you, I'm no longer officially responsible for sutro, but everyone knows I'm the one to call when you're in deep doo-doo, so that was little consolation.
Thursday evening, we go to work. One extra-large sausage-and-black-olive pizza and a half-dozen Cokes later, things are mostly back to normal. /clients is now on the Fujitsu, and the internal drive has been low-level formatted and built as a single partition. The latest NeXTanswers and Support Bulletin are online. Users now have about 50% more space than they did before. Only three users seem to have lost files. One mail file vanished. It's starting to get light outside.
I'm too tired to find my way off-campus, and too wired on caffeine to get any sleep, so I head for the Advanced Computing Laboratory to catch up on comp.sys.next.*. By the time I get home and lose consciousness, it's 9 a.m.
"There's such a thing as being TOO nice"
We have something of a convention of naming computers with the same first letter as our last names, and picking appropriate themes for related things. Mt. Sutro is [literally] one of the high points in San Francisco, and at the time we got its namesake, it was one of the high points as far as campus computing was concerned. Its clients, printers, disks, etc. are named after San Francisco landmarks. That gives it a certain familiarity and--dare I say--comfort.
Last year, I got to name some more babies. Computer Science had an Operating Systems Laboratory with 68010-based UNIX workstations that were no longer supported by their vendor, and the time had come to replace them. Their replacements would be 68040-based NeXTs, and the new/relocated facility named the Advanced Computing Laboratory (ACL). There wouldn't be as many NeXTs, but they'd be far more capable machines. The old machines had been named after cartoon characters--snoopy, minnie, batman. I decided that the new machines would honor their predecessors, but updated for the 90s. So I named most of them after the Simpsons. Their server: springfield. springfield is a hot machine. It _started_ with 32MB RAM. It took _years_ for sutro to get there.
There are other machines in the ACL besides NeXTs, but the NeXTs are by far the most popular. So popular that some people never seem to leave. In fact, 24-hour activity is not uncommon during the week. (And given that we're largely a commuter campus, the lab's two sofas, refrigerator, and microwave oven get good workouts too.) About the only time machines can be shut down is Friday evening. There's something about Friday evening. Drawn by the scent of the weekend's freedom, off they go. Perhaps to catch a first-run movie. Perhaps not. Whatever they do, come Saturday, the guilt sets in, and they all come back to atone for screwing off the night before. Of course, the dial-in crowd gets a head start--they start burning up the wires between 1 and 2 in the morning.
One of the "rewards" of the Advanced Computing Laboratory is Computing Without Limits. Especially when it rebels against the remnants of a bombastic computing center, whose multiprocessor VAX serving most of the lower division students enforces strict disk quotas and other passe' obstacles to productivity. This policy inevitably means that disk storage becomes something of a nonrenewable resource.
springfield is a slab with a 424MB internal drive, and a Fujitsu 2266S named DeepDeepTrouble. A small part of the latter is used for a /clients partition and the remainder for users' files. (Just as sutro is now.) We purchased a second Fujitsu with the intent of turning it all over to users, and redesignating the original partition for projects requiring "large" amounts of storage--some individuals have justifiably needed 200MB, and we're one of the few sites on campus that doesn't consider that unreasonable for a student. Last week, the new disk was initialized as DeeperTrouble, and the two drives housed in a dual-bay enclosure. The plan was for this Friday's scheduled system work to move all the users to the new disk. Those disk names would turn out to be prophetic...
I requested downtime earlier than usual, since I knew it would take a while. Normally I'm quite awake and alert at 6 p.m. on Fridays, but this Friday was different. I'd returned to work around 3 p.m., so I'd had far less sleep than I should. And far less than I'd realized. It was only Friday the 12th, but it might as well have been Friday the 13th.
I got a late start--other matters had kept me busy until nearly 7. Shutting down springfield's clients was easy. Three other machines had springfield's users' partition automounted. Normally, that wouldn't have been cause for concern, but I figured that NeXT's software probably wasn't going to cope too gracefully
>I screwed up.
>I wanted to type
>  dump 0f - /dev/rsd1b
>but my weary fingers hammered out something like
>  dump 0f /dev/rsd1b
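For anyone who hasn't been bitten yet: in the key "0f", the f means the next argument names the dump file. In the mistyped command, /dev/rsd1b is therefore taken as the place to write the dump, not as the thing to dump. Below is a minimal sketch of a pre-flight check. It assumes a hypothetical shell function, dump_args_ok, which is not part of dump and only looks at the f key (other key letters that take arguments are ignored).

```shell
#!/bin/sh
# dump_args_ok KEY [ARG...]
# Hypothetical pre-flight check; NOT part of dump itself.
# Succeeds only if an "f" key is followed by both a dump-file argument
# and a separate filesystem argument. Other argument-taking key letters
# are deliberately not handled in this sketch.
dump_args_ok() {
    # Nothing to check without a key.
    [ "$#" -ge 1 ] || return 1
    key=$1
    shift
    case $key in
        *f*)
            # "f" consumes one operand, so the filesystem must be a second one.
            if [ "$#" -lt 2 ]; then
                echo "refusing: key '$key' needs a dump file AND a filesystem" >&2
                return 1
            fi
            # Writing the dump onto the filesystem being dumped is never intended.
            if [ "$1" = "$2" ]; then
                echo "refusing: dump file and filesystem are the same" >&2
                return 1
            fi
            ;;
    esac
    return 0
}
```

Usage would be something like "dump_args_ok 0f - /dev/rsd1b && dump 0f - /dev/rsd1b". It only catches the missing-argument slip, not a wrong but well-formed device name.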
Dump isn't the only "bad" program -- The NeXT version of 'newfs' works fine on mounted file systems (I suppose because of 'BuildDisk' or the auto-formatting of new SCSI devices).
See, I wanted to quickly erase an OD one day, and since I am such a UNIX god, I decided just to plop that OD into our NeXT server and type

  newfs /dev/rsd0a (It should have been /dev/rod0a)...

Ooops...So a few weeks later, I'm at a friend's house who just borrowed a Sony MO (external) and I said "Step aside, I'll format this baby...By the way, I just ruined a disk at work the other day doing this, but don't worry, I'll be careful"

  newfs /dev/rsd0a (Funny, the OD isn't making any noise.. Yes, I made the same mistake again).
No one likes to hear my explanation -- "Mumble...'newfs' on a Sun won't work on a mounted file system..." Actually, I submitted a bug to NeXT about this, but since I surmise this defect is there on purpose, I don't think it will be fixed.
There's nothing like trying to rebuild a file system out of 'lost+found':-)
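Since this newfs evidently won't stop you, one option is to check the mount table yourself before doing anything destructive. The following is a minimal sketch, not a tested NeXT recipe. It assumes a hypothetical function, is_mounted, that reads "mount"-style output on standard input with lines beginning "/dev/NAME on ...". It also assumes the usual convention that the raw device name is the block device name with an "r" in front.

```shell
#!/bin/sh
# is_mounted DEVICE
# Hypothetical helper: succeeds if DEVICE (raw or block name) appears as
# the first field of a mount table read from standard input.
# Assumed input format: "/dev/sd0a on / type 4.3 (rw)".
is_mounted() {
    dev=${1##*/}     # strip the directory part: /dev/rsd0a -> rsd0a
    blk=${dev#r}     # assumed raw-to-block mapping: rsd0a -> sd0a
    while read -r name _on _rest; do
        if [ "$name" = "/dev/$blk" ] || [ "$name" = "/dev/$dev" ]; then
            return 0
        fi
    done
    return 1
}
```

One would use it as "mount | is_mounted /dev/rsd0a && echo 'mounted, hands off'". It does nothing about typing the wrong device name in the first place, but it would have flagged rsd0a as the live disk in both stories above.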
> >I screwed up.
> >I wanted to type
> >  dump 0f - /dev/rsd1b
> >but my weary fingers hammered out something like
> >  dump 0f /dev/rsd1b

> See, I wanted to quickly erase an OD one day, and since I am such a
> UNIX god, I decided just to plop that OD into our NeXT server and
> type:
>   newfs /dev/rsd0a (It should have been /dev/rod0a)...
Wheeee! The joys of Unix are many. My first real learning experience was while attaching an external hard drive to one of our cubes. I typed "disk -i /dev/rsd0a" and it was a good 30 seconds before I realized that the wrong drive was making the chugga-chugga noise. After I screamed and pulled the plug, I called our chief guru, who told me where the backups lived. "No problem," I thought. Little did I know...
Because the main drive was, of course, unhealthy, I booted up from optical and grabbed the backups, which were multivolume dumps. Dumped on ODs, that is. After just a few hours of cursing and swearing, I accepted the fact that the NeXT thought that restoring from a drive it wanted to use was majorly uncool. "That's OK," said I. "I have a small external drive and a network with a few other NeXTs. Surely I can arrange something."
Twenty-some hours later, after trying to NetBoot, build a too-small disk, restore over the network, and sacrificing a chicken on top of that (appropriately) black cube, something worked. I don't know what and I couldn't reproduce it now, but I didn't care; I had won! I had beaten the damned thing!
(In the end, though, Unix had its revenge; because of my supposed knowledge, somebody offered me a job I couldn't afford to turn down. Budding sysadmins should heed our warnings and find a nice, pleasant job like digging ditches or fighting oil well fires or something.)
--
William Pietri            | Email: William.Pie...@umich.edu
Stat. Dept. Sysadmin and  |    or: will...@stat.lsa.umich.edu
ITD/CSS Consultant        |    or: USERW...@UMICHUM.bitnet
University of Michigan    | Phone: (313) 764-9983
: I'll add my relatively tame story. (I am enjoying this thread!)
"Enjoying" this thread is a manner of speaking, I hope. :-)
: [...] I did this:
:
:   cd ~user
:   chown -R user.group .*
:
: After noticing this was taking a long time, I hit ctl-c. Well, I had
: managed to set the permissions to user.group on all kinds of files in
: all kinds of places.
Oh, that sounds familiar. But wait! Try doing THIS, to remove a "make install" that had gone sour in the (VERY!) wrong directory, thinking it was somewhere else:

-----
atlantis:/etc/foo [1:03pm] root> rm *

atlantis:/etc/foo [1:04pm] root> rm -fr .*

--interrupted
atlantis:/etc/foo [1:07pm] root> cd /etc
pwd: getwd: can't stat .

(screams of "ARRRRRRGH!" [and other fine obscenities])

Kinda makes you wonder just how far things have to go before total system meltdown, eh?
--
Dan Berry == President, NeXT Edmonton Owners Network / UofA Computers Tech.
bugs%atlan...@cs.ualberta.ca (b...@atlantis.uucp)  *** NO NeXTmail! ***
"On a first date, watch how your date treats the waiter, the bartender, and so
on. That's how she'll be treating YOU after three months."
This happened today, because the topic had come up, and I just had to come up with a good one...
This afternoon, I niloaded about 150 aliases. As it turned out, I really didn't want them. Since NetInfoMangler doesn't allow me to select more than one thing at once, I was doing command-r <pause> 'return'. I had to hit return after every command-r to confirm that I meant delete. So, I'm zipping along, just flying, doing command-r/return as fast as I can. I start to get near the end, and stop. Now the keyboard needs to catch up. So I sit there and watch it finish up on the aliases directory, then it removes aliases, then group, then fax_modems, then printers, then machines... you get the picture. I realized what was going on as soon as I saw aliases go, but by the time I killed NetInfoMangler, it was too late. There were other conditions that made a clean restore from my clone network.nidb difficult, but I must say I was impressed at how well the clone took over. My users didn't even know I had screwed up big time. Of course, I got home five hours late, but I know I won't make that mistake again.
> This happened today, because the topic had come up, and I just had
> to come up with a good one...

> This afternoon, I niloaded about 150 aliases. As it turned out, I
> really didn't want them. Since NetInfoMangler doesn't allow me to
> select more than one thing at once, I was doing command-r <pause>
> 'return'. I had to hit return after every command-r to confirm that I
> meant delete. So, I'm zipping along, just flying, doing
> command-r/return as fast as I can. I start to get near the end, and

[sad story deleted]

Now you know why some people stay away from NetInfoMangler... That's why they created nidump and niload -d...for deleting 150 aliases at once....
--
David Lemson   (312) 732-4741
FNBC Sys Admin (Summer)   UIUC NeXT Campus Consultant(rest of the time)
E-mail to: lem...@fnbc.com   NeXTMail accepted
In article <1992Jun19.142233.13...@fnbc.com> lem...@fnbc.com (David Lemson) writes:
> > This afternoon, I niloaded about 150 aliases. As it turned out, I
> > really didn't want them. Since NetInfoMangler doesn't allow me to
> > select more than one thing at once, I was doing command-r <pause>
> > 'return'. I had to hit return after every command-r to confirm that I
> > meant delete. So, I'm zipping along, just flying, doing
> > command-r/return as fast as I can. I start to get near the end, and
> [sad story deleted]
> Now you know why some people stay away from NetInfoMangler...
> That's why they created nidump and niload -d...for deleting 150 aliases at
> once....
Yes, but don't forget, sometimes nidump doesn't dump *all* the data fields for certain files. There is a Support Bulletin which addresses this cute little surprise.
--
Nathan Janette          "I'm a NeXTstep man,
Dept MB&B, Yale Univ     I'm a NeXTcube guy"
New Haven, CT           nat...@laplace.biology.yale.edu (NeXT)
This is an attempt to solicit help with a uucp problem that I've (blush) been working on for nearly a year. After changing from a Neuron FAX96+ to a Telebit WorldBlazer, there is still some kind of a problem occurring well after the link has been made.
I've included the entire chat script, as is, but I x'd out the passwords.
My UNIX consultants, both famous and not in the NeXT world, have been very generous and occasionally it seems that a dedicated line straight into the Internet node (newssun in this script) by my system (gorilla is my UUCPNAME) will be the only way of avoiding alleged x-on-x-off difficulties with the server in the first step of my two-step logon procedure.
Any opinions, directly emailed to me, would be gratefully accepted.
localhost# call newssun
root newssun (6/22-19:26-301) DEBUG (Local Enabled)
root newssun (6/22-19:26-301) NO CALL (MAX RECALLS)
MAX RECALL COUNT 95
root newssun (6/22-19:26-301) continuing anyway (debugging)
finds (newssun) called
ifadate returns 177
getto: call no. cufb for sys newssun
Using DIR to call
Opening /dev/cufb
login called
wanted """" got: that
send "P_ZERO"
wanted """" got: that
send "AT&F4"
wanted "OK~5" AT&F4 OKgot: that
send "ATDT326-0313"
wanted "CONNECT~30"
ATDT326-0313 CONNECTgot: that send "\r\d\r" RETURN DELAY RETURN wanted "ID:~30"
User ID:got: that send "mgilula" wanted "ssword:~10" mgilula
Password:got: that send "xxxxxxx" wanted "login.~10" .......
Successful login.got: that send "\r\d\r" RETURN DELAY RETURN wanted "local>~20"
Please type HELP if you need assistance
local>got: that send "c\snewssun" BLANK wanted "ogin:~20"
local> c newssun Xyplex -010- \007Session 1 to NEWSSUN established
SunOS UNIX (newssun.med.miami.edu)
login:got: that
send "ngorilla"
wanted "ssword:~10" ngorilla Password:got: that
send "xxxxxx"
root newssun (6/22-19:27-301) SUCCEEDED (call to newssun )
imsg looking for SYNC< \20>
imsg input<Shere=newssun\0 Using \0 as End of message char
>got 13 characters
omsg <Sgorilla -Q0 -x99>
imsg looking for SYNC<\20>
imsg input<ROK\0>got 3 characters
msg-ROK
Rmtname newssun, Role MASTER, Ifn - 5, Loginuser - root
rmesg - 'P' imsg looking for SYNC<\20>
imsg input<Pgetxf\0>got 6 characters
got Pgetxf
wmesg 'U' g
omsg <Ug>
send 073
rec h->cntl 077
send 061
state - 01
rec h->cntl 061
send 053
state - 03
rec h->cntl 057
state - 010
Proto started g
protocol g
root newssun (6/22-19:27-301) OK (startup cufb 19200 baud)
*** TOP *** - role=MASTER
gnamef returns .
bldflst rejects .
gnamef returns ..
bldflst rejects ..
gnamef returns C.newssunA0024
gnamef returns C.newssunA0034
gnamef returns C.newssunA0044
gnamef returns C.newssunA0054
bldflst returns 1
agent newssun (6/22-19:27-301) REQUEST (S D.gorillaB0022 D.gorillaS0022 agent)
expfile type - 0, wrktype - S
wmesg 'S' D.gorillaB0022 D.gorillaS0022 agent - D.gorillaB0022 0666
send 0210
rmesg - 'S' rec h->cntl 041
state - 010
rec h->cntl 0211
Set pk_rpr from 1 to 1 in pkgetpack
send 041
got SY
PROCESS: msg - SY
SNDFILE:
send 0221
send 0231
send 0241
send 0251
send 0261
send 0271
send 0301
rec h->cntl 042
state - 010
send 0311
sent data 425 bytes 0.45 secs
rmesg - 'C' rec h->cntl 043
state - 010
rec h->cntl 044
state - 010
pkcget: alarm 4001
send 0251
pkcget: alarm 7002
send 0251
rec h->cntl 044
Reack count is 1
send 0261
state - 010
pkcget: alarm 10003
send 0251
rec h->cntl 044
Reack count is 2
send 0261
state - 010
rec h->cntl 044
Reack count is 3
send 0271
state - 010
rec h->cntl 044
Reack count is 4
Reack overflow on 4
send 0251
state - 010
rec h->cntl 044
Reack count is 1
send 0261
state - 010
rec h->cntl 044
Reack count is 2
send 0271
state - 010
rec h->cntl 044
Reack count is 3
send 0301
state - 010
rec h->cntl 044
Reack count is 4
Reack overflow on 4
send 0251
state - 010
rec h->cntl 044
Reack count is 1
send 0261
state - 010
rec h->cntl 044
Reack count is 2
send 0271
state - 010
rec h->cntl 044
Reack count is 3
send 0301
state - 010
rec h->cntl 044
Reack count is 4
Reack overflow on 4
send 0251
state - 010
rec h->cntl 044
Reack count is 1
send 0261
state - 010
rec h->cntl 010
state - 06000
got FAIL
agent newssun (6/22-19:30-301) BAD READ (expected 'C' got FAIL (2))
cntrl - -1
agent newssun (6/22-19:30-301) FAILED (conversation complete)
send OO -1,omsg <OOOOOO>
imsg looking for SYNC<\20>
imsg input< "*\10 \20>
imsg input<OOOOOO\0>got 6 characters
localhost#
--
Marshall F. Gilula, M.D          "El que busca mucho nada encuentra, pero
mgil...@newssun.med.miami.edu     el que busca nada mucho encuentra"
NeRD#1054  Co-Founder, MiamiNUG   NeXTMail Welcome
Virtual Virtual Realities, German Shepherds, and Steinbergers.
Rather than ".*", I tend to use ".??*". This skips . and .., but picks up anything 3 characters or longer. (Yes, it will also miss weird files like ".r" or ".[", but that's not usually a problem for me.)
Just a little trick I picked up when I was a Unix beginner...
--
Geoff Kuenning   ge...@ITcorp.com   uunet!desint!geoff
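To see what ".??*" actually selects without risking anything, it can be tried in a scratch directory first. The following is a minimal sketch, assuming a Bourne-style shell and a hypothetical helper named list_dotfiles. It only prints names and never removes or chowns anything.

```shell
#!/bin/sh
# list_dotfiles DIR
# Hypothetical helper: print the names in DIR matched by ".??*", one per
# line. The pattern needs a dot plus at least two more characters, so it
# can never expand to "." or "..".
list_dotfiles() {
    (
        cd "$1" || exit 1
        for f in .??*; do
            # If nothing matched, the pattern is left unexpanded; skip it.
            [ -e "$f" ] || continue
            printf '%s\n' "$f"
        done
    )
}
```

Running it on a directory containing .r, .rc and .profile should list .profile and .rc but not .r, which is the trade-off described above. Echoing a glob before handing it to rm -r or chown -R is cheap insurance either way.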