
/usr/ucb/quota hang on login (4.1.3_U1 w/ C2). Stumper


Gregory B. Newby

Nov 10, 1994, 1:03:31 AM
I've been working with Sun service engineers on this one for almost
2 weeks, and they're stumped, at least temporarily.

Periodically (e.g., 1 or 2 times per DAY) I need to restart my
SS10 model 41 because login processes get 'hung' while running
/usr/ucb/quota

This started fairly recently, after I installed C2 shadowed
passwords. I've installed a truckload of patches (below), none of which
have helped.

These /usr/ucb/quota processes are not killable with kill -9. Nor
are the login processes that spawn (or fork, or whatever) them.
People cannot login at all.... and even when I'm logged in as
root, shutdown/reboot won't work since they can't kill all the processes
before restarting the system.

Here's a note I prepared for somebody else with painful detail.
Any ideas are welcome!

Thanks..... Greg Newby

--------
Thanks for your offer of help! Here's all the details I can think
of. I have some various log files in /home/staff/gbnewby/hung, if you
want to check them out. I'm happy to experiment, at this point,
and see what might help.

I'm pretty sure that the problem lies somewhere in part of NIS, the
RPC portmapper program, part of the quota system, or even in the
login program. I've installed about 10 Sun patches over the weekend,
and tried oodles of things.

The symptomology varies somewhat, but the main way the problem manifests
itself is that suddenly login processes get hung on /usr/ucb/quota (part
of the login process, as described in 'man mount' of all places).
The /usr/ucb/quota process stays in IW state, and never returns. It
is not kill-able with kill -9. No logins happen, and eventually
all 256 PTYs fill up with running login and quota processes that never
die.

Several other things will not work as well (I can login from the console,
or through an open window, or do an 'rsh prairienet.org su', even tho
regular telnet or rlogins do not go through). Generally, the system
is still usable by those who are logged in, although you'll get
your shell effectively hung if you run anything that involves a
quota check, and possibly other cases as well that I'm not able to
replicate.

In a previous version of this problem, we could "sh /etc/rc.local"
and everything would get running again. Now, though, that usually
doesn't work -- rc.local hangs right after the 'domainname' call.
I don't think the rc.local trick has worked since I installed the
various patches over the weekend, but I'm not positive whether we've
run into some of our older problems (below) where it did work.

We also can't do a remote shutdown or reboot (or even 'halt') because
they all try to kill all active processes before continuing, and the quota
processes don't die. This necessitates rebooting from the console via
the PROM monitor every time...and a few times, the machine has totally
hung, and I've needed to hit the power switch. This has happened
when I did a 'reboot' from a root shell.....the reboot never happens, and
when I hit the console BREAK key to wake up the PROM monitor, it never
answers.

No quota command will work in this state ("quotaoff -a", "repquota",
etc.). You can't run "portmap" to restart (or start another) portmapper,
and (I think this is it) if you kill the running portmapper, /etc/rc.local
still won't work to restart the various processes.

There are a few things that I suspected, but ruling them out has not helped:
- UID/GID numbers > 32767 used to exist, but I've moved them all.
- We now run with NO disk quotas (no /quotas files in any filesystem,
tho quotacheck and quotaon -a are still run from /etc/rc)
- I have not been able to pin this on a particular user, say,
somebody using a bad username/password, but have had
trouble in most cases determining just who the last
successful users were to login.
- We have one NFS mounted filesystem (R/O, noquotas), which I've
stopped mounting. The problem persists.
- All filesystems are now mounted noquota

'umount' will also not work, for any filesystem, in the semi-hung state.


Finally: These current problems, which have resulted in the computer
getting hung up on quotas processes every 12 hours or so (yes, it's
fairly regular.... seldom < 6 hours, and never more than about 15),
started when we finally installed C2 security shadow passwords last Thursday
or Wednesday. (C2 would not previously install because we were running
ypbind instead of ypbind -s from /etc/rc.local).

Previously, our problem was that every day or two (up to 4)
the Prairienet computer would suddenly drop off the Internet. It
wouldn't be able to get to hosts outside of 192.17.3, and would not
respond to pings or other inbound traffic.

This tended to happen during prime heavy usage. Now, though, we
have even heavier usage with no such problems. The /usr/ucb/quota
problem happens during any time, including late-night when the load
is very low.

That problem started in (I think) early October when I updated our
NIS hosts database. Previously, we had a huge database which was not
kept current (we're not a nameserver, remember). So, I instead built
a small database for just local information, and we now use the DNS for
other hosts lookups. I did install a (supposedly) patched version of
libresolv.a and /usr/lib/sendmail as recommended by the campus Sun
software support person at CCSO. Later, I discovered that these
would not work with the current (4.1.3_U1) version of libc, which is
1.1.9 (it's made for 1.1.8).

The sendmail/libresolv was what caused our several days of sendmail woes,
where sendmail processes were hanging (very similar to the hanging quota
processes, except that these could be killed and did not prevent logins,
except where they drove the system load so high as to render it unusable).
I did some further investigation in mid-October, and again last week,
and installed new, then newer, patches to libresolv and sendmail (later,
libc itself) from Sun, which work very well.

Patches installed are numerous. When the machine was first brought
online, somebody else installed all the 4.1.3_U1 security patches.
Over the last week, I've installed further patches. All these are
in /fn/src/4.1.3_U1_Patches

Here's the list:
Installed November 6-8 1994:

101436-08
101435-01
101461-04
101665-03
101954-04
101759-02
102060-01
101445-01
101784-02
101718-01

Previously installed (mid-August 1994):

100103-12/ 100804-03/ 101440-01/ 101579-01/ 101625-02/
100383-06/ 101434-03/ 101441-01/ 101587-01/ 101679-01/
100448-02/ 101437-01/ 101508-06/ 101621-02/


Firefly, as you recall, was previously running the slightly older
4.1.3_B (or C, I forget which) and ran like a charm for about a year.
After we needed to rebuild our disk, we installed the U1 version of
SunOS. C2 never installed correctly (I think the hanging quota processes
were the problem then, but don't recall for certain), even tho we were
running it under 4.1.3_B. Everything else seemed to be working
correctly, until I installed the ill-fated (but necessary) updated
small hosts file.

Any ideas are welcome, forward this to anybody you'd like. I've
been on the telephone with 2 Sun engineers for about 10 hours over the
last 10 days, and they have had lots of ideas but none have helped. They've
promised to bump the priority on this report tomorrow (Thursday) to
get more people involved, since they're so far collectively
stumped.

I noticed that there are 4.1.1 patches for portmap, and similar
patches for 5.3 and 5.4. I wonder if maybe there is something related
in 4.1.3_U1 that we're discovering, although it beats me how we
could be finding a bug in a system version that's been around for so
long.

We have customized the kernel, but only minimally:
- removed NIT support, per a CERT advisory
- bumped 'maxusers' to 256 (to accommodate more PTYs)

Plus some kernel source or object patches in the groups I've
mentioned above. I did make sure to rebuild the kernel AFTER the new
libc, libresolv, etc. were installed.

/* Gregory B. Newby, PhD. Ass't Professor in the Grad. School of Library */
/* and Information Science, University of Illinois at Urbana-Champaign */
/* 501 E. Daniel Street, Champaign IL, 61820-6212. Voice: 217-244-7365 */
/* Fax: 217-244-3302. Email: gbn...@uiuc.edu, gne...@ncsa.uiuc.edu */
==== Try Prairienet. Telnet to firefly.prairienet.org, login 'visitor' ====

Gregory B. Newby

Nov 13, 1994, 11:33:31 PM
It wasn't surprising that this was my fault, but I seem to have reached
a fairly arcane level of knowledge...I spoke about auditing with the
Sunservice guys, and they didn't pursue anything further with it.

Posted so that others might learn.

Executive summary: C2 passwords require that 'auditd' be running.

Newsgroups: pnet.announce.important,pnet.sys.bugs,pnet.sys.announce
Followup-to: pnet.sys.bugs
Subject: Problems of 11/2 - 11/8 resolved

We have finally resolved the frequent downtime problems of 11/2 through
11/8, when the system would become unavailable up to 2 or 3 times
per day until it was re-started by hand.

The problem turned out to be a bit embarrassing, but had everybody
stumped including a team of Sun engineers working with us on the problem.

If you're interested in some details:

On 11/2, we (finally) put a security system in place called "shadow
passwords," using Sun's product called "C2." We had C2 installed on
firefly previously, but after the system disk was re-installed in early
August we never got C2 to work again.

The problem with C2 not working in August was due to a bug in the Sun
operating system. We got good advice from the Sun folk on 11/1 and
got C2 working correctly that day.

The recent problem, as described in a previous post, was that all of a sudden
firefly would not seem to permit new logins, and would seem to "hang."
The cause of this was the 'quota' program, which is run automatically
by the Sun system when people login. This started right after we
got C2 working properly.

Here's what happened:

In order for C2 to work, a program called "auditd" needs to be working
as well. The auditd program is in charge of logging various system
activities, including things like password breakin attempts and people
running over their assigned disk quota.

We didn't know that this program was started as part of the C2
installation. Previously, we were purposely not running auditd,
because it slightly decreases the system's efficiency (there are other
utilities which we use to monitor security issues).

Here's the embarrassing part: We have a program that runs twice per
day to 'clean up' user programs that sometimes stay around AFTER
a user has logged off. This especially happens when, say, your modem
disconnects or you unexpectedly lose your telephone connection (e.g.,
with call waiting). And a bunch of other reasons as well!

So, the 'auditd' program was NOT run by user "root" as are most system
programs. Instead, it was run by a special username, "audit".
Since "audit" was not in the list of users who ARE allowed to run
programs when not logged on, our automatic program was removing the
auditd program twice per day (sometimes more frequently, depending on the
time when the system was restarted).
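
To make the failure concrete, here is a minimal sketch of that kind of
cleanup logic. It is written in Python purely for illustration; the post
does not say how the real Prairienet cleanup program was written, and the
names EXEMPT_USERS, Proc and select_orphans, as well as the process table,
are hypothetical. Only the account name "audit" and the daemon name
"auditd" come from the description above.

```python
from dataclasses import dataclass

# Hypothetical exemption list: accounts allowed to own processes without
# a login session. "audit" is the account the post says was missing.
EXEMPT_USERS = {"root", "daemon", "audit"}


@dataclass(frozen=True)
class Proc:
    pid: int
    user: str
    command: str


def select_orphans(procs, logged_in, exempt=EXEMPT_USERS):
    """Return processes whose owner is neither logged in nor exempt."""
    return [p for p in procs if p.user not in logged_in and p.user not in exempt]


# Made-up process table, for illustration only.
procs = [
    Proc(101, "audit", "auditd"),
    Proc(202, "alice", "pine"),
    Proc(303, "bob", "tin"),
]
logged_in = {"alice"}

# With "audit" exempt, only bob's leftover process is selected.
orphans = select_orphans(procs, logged_in)

# Without the exemption, auditd is selected too: the failure described above.
orphans_without_audit = select_orphans(procs, logged_in, exempt={"root", "daemon"})
```

The point of the sketch is only that a reaper keyed on "owner is not logged
in" needs every daemon account on its exemption list, not just root.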

Dumb? Yes. However, nobody knew that auditd was required to run
for C2, and absolutely nobody knew that the 'quota' program requires
the 'auditd' program in order to run ONLY when C2 is installed. Under
the circumstances, the 'quota' program was not very friendly -- most
programs, when they can't contact another program, have a built-in
"time-out" component which eventually prints a warning message and
either ends the program, or logs out the user. No such luck with
'quota' when C2 is installed, though.
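
The "time-out" behaviour mentioned above can be sketched in general terms as
follows. This is not a model of how 'quota' actually interacts with auditd
under C2, which the post does not describe; it only shows the usual pattern
of a client giving up after a bounded wait. The function name
query_with_timeout, the request bytes and the silent local listener are all
assumptions made up for the demonstration.

```python
import socket


def query_with_timeout(host, port, request, timeout=2.0):
    """Send a request and wait at most `timeout` seconds per socket operation.

    Returns the reply bytes, or None if the service did not answer in time.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(request)
            return conn.recv(1024)
    except (socket.timeout, ConnectionError):
        return None


# Demonstration: a local listener that never accepts or replies, standing in
# for a service that has stopped answering.
silent = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
silent.bind(("127.0.0.1", 0))
silent.listen(1)
port = silent.getsockname()[1]

reply = query_with_timeout("127.0.0.1", port, b"quota?", timeout=0.2)
print("no reply, giving up" if reply is None else reply)
silent.close()
```

A client written this way can print a warning and exit instead of waiting
forever, which is the behaviour the hung 'quota' processes lacked.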

The problem was further compounded because it was very hard to
know just when the problem was occurring. It turns out that it was
'caused' by the twice-daily system program to clean up unknown
processes, as described above. But since the program can take anywhere
up to 1 hour to run (during weekday afternoons), and after it
removed auditd it would take a while before anybody noticed what
the problem was with logins not working, the twice-daily pattern
was not clear.

Right now, things seem quite stable. We'll be putting back some
of the things we removed, moved, or disabled while trying to
pin-point the difficulty over the next few days, and will need
to quickly restart the Prairienet computer a few times. But
we don't anticipate further difficulties.


An additional problem described in a previous message was that
the system was periodically dropping off the Internet. We don't
know for certain that this will not re-occur, but hope that one
of the many various system patches and fixes we installed during
the past 10 days will solve that problem as well. We'll see
whether this happens again...


That's the story! It's bad news that this problem happened, but at
least it appears to be behind us now. The good news is that the
C2 system makes us more secure against the (increasing number of)
hacker attacks which occur across the Internet. Or, at least against
the large number of hackers who use methods which others have
discovered and publicized....we've made our system secure against
all of the well-known problems, but there are always more problems
that the hackers discover before the systems administrators.


Thanks for your patience during this period, and apologies to our
members who are late-night owls or early-morning folk - the system
outages were most keenly felt by you.

-- Greg Newby, for the Prairienet systems staff


--------
Prairienet is a public-access community computer system located in Urbana-
Champaign, Illinois. Prairienet is a Free-Net affiliated with the National
Public Telecomputing Network. For more information about Prairienet, please
use the system. Dial 255-9000 (press <return> a few times, N-8-1), or telnet
to prairienet.org. Login as "visitor". Email inquiries may be sent to
"in...@prairienet.org"

Pete Hartman

Nov 14, 1994, 4:57:24 PM
gbn...@prairienet.org (Gregory B. Newby) writes:
>It wasn't surprising that this was my fault, but I seem to have reached
>a fairly arcane level of knowledge...I spoke about auditing with the
>Sunservice guys, and they didn't pursue anything further with it.
>Posted so that others might learn.
>Executive summary: C2 passwords require that 'auditd' be running.

Really?

I run several systems with just rpc.passwdauthd and shadow passwords,
and I haven't had any quota problems or other problems that seemed
to indicate that auditd was required.

>The problem with C2 not working in August was due to a bug in the Sun
>operating system. We got good advice from the Sun folk on 11/1 and
>got C2 working correctly that day.
>The recent problem, as described in a previous post, was that all of a sudden
>firefly would not seem to permit new logins, and would seem to "hang."
>The cause of this was the 'quota' program, which is run automatically
>by the Sun system when people login. This started right after we
>got C2 working properly.

Ok, this makes more sense. If you used the C2conv script that Sun
supplies, then yes, you do have to have a running auditd.

However, it remains true that you can, by not using the C2conv, but
instead using another script floating around (email me if you're
interested), convert your system to use only the shadow password
facility without requiring the auditd facility (which takes up
bazongas of disk space).
--
Pete Hartman Bradley University p...@bradley.bradley.edu
It may lead to dancing.
