Hangs doing ls -l but not ls

Kevin Coulter

Dec 1, 2004, 5:01:35 PM
I've got a customer with 5.0.6, RS506A, OSS646B installed.

System "seems" fine, except the cd drive does not work, and sometimes
the tape drive does not work. We can boot off of CD without trouble.

Went poking around: If I am in root, do a "ls" I see the listing. If I
do "ls -l" it hangs. Same thing in /dev. Tried redirecting output to a
file, no help. Still hangs. When I hang, I have to kill the session,
ps still shows it active, kill -9 does not clean it up.

Something in /dev is still there, as I can sometimes hit the tape
device, /dev/rStp0, and our software uses the serial port, /dev/tty1a,
and that is working fine. If I do "file /dev/rStp0" or "file
/dev/tty1a" it works fine, if I do "file /dev/cd0" it hangs.

Ideas? I'm looking for a way to "fix" this without reload if
possible.... also need to try to confirm it is a UNIX issue, not a
hardware issue.

System does seem to reboot ok, but I'm trying to avoid doing reboots
without a good reason....

Thanks,
Kevin

NSM

Dec 1, 2004, 5:12:33 PM

"Kevin Coulter" <ke...@qantel.com> wrote in message
news:82d1ec2f.0412...@posting.google.com...

| I've got a customer with 5.0.6, RS506A, OSS646B installed.
|
| Systems "seems" fine, except cd drive does not work, and sometimes
| tape drive does not work. We can boot off of CD without trouble.
|
| Went poking around: If I am in root, do a "ls" I see the listing. If I
| do "ls -l" it hangs. ...

What if you are logged in as a user? What if you do "ls -xyz"?

N

Bill Campbell

Dec 1, 2004, 5:26:23 PM
On Wed, Dec 01, 2004, Kevin Coulter wrote:
>I've got a customer with 5.0.6, RS506A, OSS646B installed.
>
>Systems "seems" fine, except cd drive does not work, and sometimes
>tape drive does not work. We can boot off of CD without trouble.
>
>Went poking around: If I am in root, do a "ls" I see the listing. If I
>do "ls -l" it hangs. Same thing in /dev. Tried redirecting output to a
>file, no help. Still hangs. When I hang, I have to kill the session,
>ps still shows it active, kill -9 does not clean it up.

If I had to make a SWAG, I think something's hosed in the password database
since the ``ls -l'' command needs to resolve user and group information.
I've seen things get flakey on systems that are doing LDAP authentication
via pam_ldap and nss_ldap on Linux systems that were having trouble talking
to the LDAP server.

On Linux, the ``strace'' program can be very useful for debugging things
like this as one can attach to the process that's hanging to see what
system calls are being made. For example, I was having problems with long
delays when building software, and strace showed that the build was
attempting to access automounted directories where the server wasn't
available because of an outdated AMD automounter configuration file. I
would never have figured that out without looking at the system calls being
executed.

Bill
--
INTERNET: bi...@Celestial.COM Bill Campbell; Celestial Software LLC
UUCP: camco!bill PO Box 820; 6641 E. Mercer Way
FAX: (206) 232-9186 Mercer Island, WA 98040-0820; (206) 236-1676
URL: http://www.celestial.com/

Make no laws whatever concerning speech, and speech will be free; so soon as
you make a declaration on paper that speech shall be free, you will have a
hundred lawyers proving that ``freedom does not mean abuse, nor liberty
license;'' and they will define and define freedom out of existence.
- Voltairine de Cleyre (1866-1912)

Jean-Pierre Radley

Dec 1, 2004, 5:30:40 PM
Kevin Coulter typed (on Wed, Dec 01, 2004 at 02:01:35PM -0800):

| I've got a customer with 5.0.6, RS506A, OSS646B installed.
|
| Systems "seems" fine, except cd drive does not work, and sometimes
| tape drive does not work. We can boot off of CD without trouble.
|
| Went poking around: If I am in root, do a "ls" I see the listing. If I
| do "ls -l" it hangs. Same thing in /dev. Tried redirecting output to a
| file, no help. Still hangs. When I hang, I have to kill the session,
| ps still shows it active, kill -9 does not clean it up.

What does just 'l' do? And 'ls -bl'?

--
JP

ke...@qantel.com

Dec 2, 2004, 11:50:02 AM

Thanks for the help. /etc/passwd and /etc/shadow look ok, running the
tests as a user instead of root does not seem to matter. lc works, ls
-x works, "l" hangs (I let it sit for 5 minutes).

What is really bothering me is that after a hang, I disconnect and
reconnect. My process is still out there (with a parent of 1, the shell
does die). I cannot kill it.

Backups are also getting tape write errors most but not all days. Not
sure if it is related or not. When it does get the error, it occurs
during, not the beginning of the backup. The backup is only backing up
/usr/xxx where xxx has the application and application data in many
subdirectories underneath that tree. I have no problems doing ls -l,
etc on that directory. The problem only seems to be with / and /dev.

What started this was we were trying to mount a cdrom, and it hung.
Booted the machine to a bootable (non UNIX) CD, and it worked fine. So
I went to look at /dev/cd0, and hung doing an ls -l on that....

What is baffling me is why I can do a "file" on some things in /dev,
but not others.

So I'm at a point: hardware or software? If software, what can I do
remotely to fix it - the earliest I can get there to attempt a full
reload is 2 weeks away. Am I sure that a full reload will fix the
issues? And finally, what caused this so we don't end up there
again...

Thanks for the advice so far, and thanks in advance for any help you
can provide in the future.

Kevin

Boyd Lynn Gerber

Dec 2, 2004, 12:42:43 PM
to ke...@qantel.com
On Thu, 2 Dec 2004 ke...@qantel.com wrote:
> Jean-Pierre Radley wrote:
> > Kevin Coulter typed (on Wed, Dec 01, 2004 at 02:01:35PM -0800):
> > | I've got a customer with 5.0.6, RS506A, OSS646B installed.
> > | Went poking around: If I am in root, do a "ls" I see the listing. If I
> > | do "ls -l" it hangs. Same thing in /dev. Tried redirecting output to a
> > | file, no help. Still hangs. When I hang, I have to kill the session,
> > | ps still shows it active, kill -9 does not clean it up.
> > What does just 'l' do? And 'ls -bl'?
> Thanks for the help. /etc/passwd and /etc/shadow look ok, running the
> tests as a user instead of root does not seem to matter. lc works, ls
> -x works, "l" hangs (I let it sit for 5 minutes).

I have seen this problem if you have a link to a mounted filesystem and
the mounted filesystem is not accessible. For example, I had a mounted
hard drive going bad on a system. The SCSI controller generated an
error and the HD was inaccessible. A df would also hang. I had a link
made with

ln -s /filesystem_mounted/directory/file file

ls would show everything and would not hang. ls -la or l would hang. I
would check to make sure you do not have any links like that. A reboot of
the system fixed the problem because the filesystem could not be mounted
and the link was then shown as an error.
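
A minimal sketch of that link check, assuming a POSIX-style shell whose `test` supports `-L` (the function name `list_symlinks` is made up for illustration and is not from the original post). Both `-L` and `ls -ld` examine the link itself rather than its target, so listing the links should not need to touch whatever filesystem they point into:

```shell
# list_symlinks DIR: print each symbolic link directly inside DIR with its target.
# Hypothetical helper; DIR defaults to the current directory.
list_symlinks() {
    dir=${1:-.}
    for entry in "$dir"/* "$dir"/.[!.]*; do
        # -L looks at the link itself, so the link target is never opened
        if [ -L "$entry" ]; then
            ls -ld "$entry"
        fi
    done
}
```

Running `list_symlinks /` and `list_symlinks /dev` would show any links whose targets live on a filesystem that may have gone away.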

Good Luck,



> What is really bothering me is that after a hang, I disconnect and
> reconnect. My process is still out there (with a parent of 1, the shell
> does die). I cannot kill it.

Yes.



> Backups are also getting tape write errors most but not all days. Not
> sure if it is related or not. When it does get the error, it occurs
> during, not the beginning of the backup. The backup is only backing up
> /usr/xxx where xxx has the application and application data in many
> subdirectories underneath that tree. I have no problems doing ls -l,
> etc on that directory. The problem only seems to be with / and /dev.
>
> What started this was we were trying to mount a cdrom, and it hung.
> Booted the machine to a bootable (non UNIX) CD, and it worked fine. So
> I went to look at /dev/cd0, and hung doing an ls -l on that....
>
> What is baffling me is why I can do a "file" on some things in /dev,
> but not others.
>
> So I'm at a point: hardware or software? If software, what can I do
> remotely to fix it - the earliest I can get there to attempt a full
> reload is 2 weeks away. Am I sure that a full reload will fix the
> issues? And finally, what caused this so we don't end up there
> again...
>
> Thanks for the advice so far, and thanks in advance for any help you
> can provide in the future.

--
Boyd Gerber <ger...@zenez.com>
ZENEZ 1042 East Fort Union #135, Midvale Utah 84047

Jean-Pierre Radley

Dec 2, 2004, 12:19:59 PM
ke...@qantel.com typed (on Thu, Dec 02, 2004 at 08:52:04AM -0800):
| Thanks for the help. /etc/passwd and /etc/shadow look ok, running the
| tests as a user instead of root does not seem to matter. lc works, ls
| -x works, "l" hangs (I let it sit for 5 minutes).
|

I asked you to try the -b option. So?


--
JP

Bill Vermillion

Dec 2, 2004, 3:45:01 PM
In article <1102006202.8...@f14g2000cwb.googlegroups.com>,

<ke...@qantel.com> wrote:
>
>Jean-Pierre Radley wrote:
>> Kevin Coulter typed (on Wed, Dec 01, 2004 at 02:01:35PM -0800):
>> | I've got a customer with 5.0.6, RS506A, OSS646B installed.
>> |
>> | Systems "seems" fine, except cd drive does not work, and sometimes
>> | tape drive does not work. We can boot off of CD without trouble.
>> |
>> | Went poking around: If I am in root, do a "ls" I see the listing. If I
>> | do "ls -l" it hangs. Same thing in /dev. Tried redirecting output to a
>> | file, no help. Still hangs. When I hang, I have to kill the session,
>> | ps still shows it active, kill -9 does not clean it up.
>>
>> What does just 'l' do? And 'ls -bl'?
>>
>> --
>> JP

>Thanks for the help. /etc/passwd and /etc/shadow look ok, running the
>tests as a user instead of root does not seem to matter. lc works, ls
>-x works, "l" hangs (I let it sit for 5 minutes).

"looking ok" and "being ok" are sometimes two different things.

Since ls works and lc works - both of which do NOT query the
password or group file, try ls -n. That will list the
UID and GID of the files.

If that works then it still can be a problem with either password
or group.

Have either the /etc/passwd or the /etc/group file ever been
edited by hand? Are the permissions and ownership correct?
They should be owned by bin with group owner auth; only bin and auth
can have write privileges, and the final octet must be read only.

If either of those has been edited by hand, double check to make
sure there are no blank lines or anything extraneous.

Try view /etc/passwd [view is vi in read only mode and you don't
really want to accidentally fire up vi on either of those unless
you know exactly what you are doing]. You can 'list' with
1,$l while in 'view' to see if there are any spurious characters
in the file. Pass it through 'less' if you have a large number
of users.
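
The blank-line and stray-field checks can also be scripted. A rough sketch, assuming an awk that accepts `-v` (older `awk` binaries may need `nawk` instead); `check_colon_file` is a made-up helper name, and the field counts used below (7 for passwd, 4 for group) are the conventional layouts rather than anything verified on this particular system:

```shell
# check_colon_file FILE NFIELDS: report lines that are blank or that do not
# have exactly NFIELDS colon-separated fields. Exits non-zero if any are found.
check_colon_file() {
    awk -F: -v want="$2" '
        NF == 0    { printf "%s:%d: blank line\n", FILENAME, NR; bad = 1; next }
        NF != want { printf "%s:%d: %d fields, expected %d\n", FILENAME, NR, NF, want; bad = 1 }
        END        { exit bad + 0 }
    ' "$1"
}
```

Typical use would be `check_colon_file /etc/passwd 7` followed by `check_colon_file /etc/group 4`. It only reads the files, so it is safe in the same way `view` is.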

>What is really bothering me is that after a hang, I disconnect and
>reconnect. My process is still out there (with a parent of 1, the shell
>does die). I cannot kill it.

That is the expected behavior when you disconnect while a program
is hung waiting for something. Since you - the parent - have gone
away, the program becomes an orphan and is adopted by init - aka process 1.


>Backups are also getting tape write errors most but not all days. Not
>sure if it is related or not. When it does get the error, it occurs
>during, not the beginning of the backup. The backup is only backing up
>/usr/xxx where xxx has the application and application data in many
>subdirectories underneath that tree. I have no problems doing ls -l,
>etc on that directory. The problem only seems to be with / and /dev.

You might post a few lines from the error log. That might give
a hint as to just what 'xxx' is.

>What started this was we were trying to mount a cdrom, and it hung.
>Booted the machine to a bootable (non UNIX) CD, and it worked fine. So
>I went to look at /dev/cd0, and hung doing an ls -l on that....

What OS was this bootable non Unix CD? I've seen people totally
destroy things by using some MS tools [I use the word "tools"
advisedly] when trying to fix a Unix system. I've found you can
use Unix tools on non-Unix systems [Knoppix is a wonderful recovery
tool even for MS-XP] but the reverse can often do you in.

>What is baffling me is why I can do a "file" on some things in /dev,
>but not others.

And 'some' is not helping a lot either. Which files don't respond
to 'file'? And when you do a 'file' on /dev, is almost everything
a block special or character special?

I don't want to sound picky but details are important when you
aren't sitting at the keyboard.

>So I'm at a point: hardware or software? If software, what can I do
>remotely to fix it - the earliest I can get there to attempt a full
>reload is 2 weeks away. Am I sure that a full reload will fix the
>issues? And finally, what caused this so we don't end up there
>again...

When you say a 'full reload' do you mean a reinstall from the
distribution media, or a restore from a backup program such as
BackupEdge or LoneTar?

The latter makes it easy, and if you have no idea what the
problem may be, using either of their recovery tools is the best
way, starting with remaking the filesystems completely. That
means if something was torched in some way the new fs will take
care of that. But just a plain reload from tape may leave some
corruption.

>Thanks for the advice so far, and thanks in advance for any help you
>can provide in the future.

The more details the better.

Bill

--
Bill Vermillion - bv @ wjv . com

Kevin Coulter

Dec 3, 2004, 2:39:29 PM
Jean-Pierre Radley <j...@jpr.com> wrote in message news:<2004120217...@jpradley.jpr.com>...

ls -b works fine, ls -bl hangs. I went to the opt/K... where
/etc/passwd and such live, here are ls -l's from there:
-rw-rw-r-- 1 bin auth 1186 Aug 10 2001 passwd
-rw-rw-r-- 1 bin auth 471 Aug 10 2001 group
-rw-rw---- 1 root auth 423 Aug 10 2001 shadow

As these dates precede my time around, I cannot say for sure if these
files were edited by hand or not.

I wrote a script and did a "file" on all character devices, they all
come back as character devices just fine. Similar script on all block
devices, the only one that hangs is /dev/cd0. Redid the script again
in /, and the only thing it hangs on is /mnt.

Ran scoadmin software, verify system. Came up with a bunch of errors,
including checksum and file size, primarily in skunkware directories.
there were a few in /opt/K/tcp... I tried selecting save the list of
errors to file, but that would hang as well. My terminal emulator is
not cooperating to let me easily review the entire file list of
problem children, and we do not have a system printer set up. (we
print through our software).

ls -l / hangs
ls -l /dev hangs
If I go into /dev/rmt, I can do a ls -l in there. If I do a ls -l ..
from /dev/rmt, it will hang.
ls -l /dev/rStp1 (for example, as long as it's not /dev/cd0) works.

Entire system is two file systems, / and /stand. :-) Wondering if we
should try a fsck of root?

Thanks again for your help.

Kevin

Jean-Pierre Radley

Dec 3, 2004, 2:57:38 PM
Kevin Coulter typed (on Fri, Dec 03, 2004 at 11:39:29AM -0800):

In /dev, run the find command to see if there are any oddballs:

find /dev ! -type c ! -type b

should list some directories and a few FIFOS.

Then drop the directories:

find /dev ! -type c ! -type b ! -type d

On my system, this shows a couple of FIFOS and a lock file:

/dev/syslog
/dev/logfifo
/dev/edge_listen
/dev/marry/.marrylock

If you see anything else, tell us.


--
JP

Kevin

Dec 6, 2004, 4:49:05 PM
I did as suggested:

# find /dev ! -type c ! -type b
/dev
/dev/byte
/dev/byte/octal
/dev/byte/hex
/dev/byte/dec
/dev/fd
/dev/dsk
/dev/rdsk
/dev/logfifo
/dev/string
/dev/table

At this point, it hung. After 2 minutes, I bailed out (had to terminate
the session). Logged back in and did:
# find /dev ! -type c ! -type b ! -type d
/dev/logfifo

and again I hung, and bailed after 2 minutes.

/dev/edge_listen and /dev/marry do not exist. (I did a ls on them...)

Since I discovered this issue, here is what is in syslog:
Dec 1 01:15:00 libman %Stp-1 - - - Vendor=COMPAQ Product=SDT-10000
Dec 1 10:06:18 libman WARNING: table_grow - exec data table page limit of 25 pages (MAXEXECARGS) exceeded by 1 pages
Dec 1 10:09:52 libman WARNING: table_grow - exec data table page limit of 25 pages (MAXEXECARGS) exceeded by 1 pages
Dec 2 01:26:49 libman NOTICE: Stp: Error on SCSI tape 1 (ha=1 bus=0 id=6 lun=0)
Dec 2 01:26:49 libman Hardware error: Unexpected internal error
Dec 2 10:08:39 libman WARNING: table_grow - exec data table page limit of 25 pages (MAXEXECARGS) exceeded by 1 pages

ps -fe now hangs (when I started this, it had worked), but I am
thinking that the process table is so hosed from my
connections/hangs/disconnections..... I know I need to boot it to
clean it up, but I also need confidence that it WILL boot (which I'm
not completely confident of).

Thanks,
Kevin

Jean-Pierre Radley

Dec 6, 2004, 5:21:39 PM
Kevin typed (on Mon, Dec 06, 2004 at 01:49:05PM -0800):

| I did as suggested:
|
| # find /dev ! -type c ! -type b
| /dev
| /dev/byte
| /dev/byte/octal
| /dev/byte/hex
| /dev/byte/dec
| /dev/fd
| /dev/dsk
| /dev/rdsk
| /dev/logfifo
| /dev/string
| /dev/table
|
| At this point, it hung. After 2 minutes, I bailed out (had to terminate
| the session). Logged back in and did:
| # find /dev ! -type c ! -type b ! -type d
| /dev/logfifo
|
| and again I hung, and bailed after 2 minutes.

Now try:

find /dev -type f

--
JP

Bill Vermillion

Dec 6, 2004, 8:45:01 PM
In article <1102369745.6...@f14g2000cwb.googlegroups.com>,

Kevin <ke...@qantel.com> wrote:
>I did as suggested:

># find /dev ! -type c ! -type b
>/dev
>/dev/byte
>/dev/byte/octal
>/dev/byte/hex
>/dev/byte/dec
>/dev/fd
>/dev/dsk
>/dev/rdsk
>/dev/logfifo
>/dev/string
>/dev/table

>At this point, it hung. After 2 minutes, I bailed out (had to terminate
>the session). Logged back in and did:
># find /dev ! -type c ! -type b ! -type d
>/dev/logfifo

And did you try the command ls -lan as I suggested? If that
makes it through without hanging then I'd really suspect
some corruption in the password or group file. The 'n' lists
the UID and GID numerically and performs no lookups
in passwd or group.
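
To make the difference concrete, here is a small sketch (the helper name `owner_fields` is invented for illustration) that prints the owner and group columns as each form reports them. Both forms still have to stat() the file; `-n` only skips translating the numbers into names:

```shell
# owner_fields FILE: show columns 3 and 4 (owner, group) from ls -l and ls -n.
# -d lists the entry itself rather than the contents of a directory.
owner_fields() {
    ls -ld "$1" | awk '{ print "ls -l:", $3, $4 }'
    ls -nd "$1" | awk '{ print "ls -n:", $3, $4 }'
}
```

If `owner_fields somefile` prints the numeric line promptly but stalls on the named line, the name lookup is the suspect; if both stall, the problem lies elsewhere.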

Kevin

Dec 8, 2004, 10:45:42 AM
Hi there.

find /dev -type f
ls -n
ls -lan

all hang without returning any output.
Still no additional messages in /var/adm/messages or /var/adm/syslog,
still can't do a ps -fe. Wondering if I need to "risk" doing a reboot
right now just to clean up the process table - is it invalidating some
of my tests?

Kevin

Bill Vermillion

Dec 8, 2004, 11:35:08 AM
In article <1102520742.1...@c13g2000cwb.googlegroups.com>,

Kevin <ke...@qantel.com> wrote:
>Hi there.

>find /dev -type f
>ls -n
>ls -lan

>all hang without returning any output.

Well that shoots down that theory.

The only other thing I can suggest is to note the last thing you
see when you perform an ls -f [that returns the files unsorted,
in the order in which they appear in the directory]

Then I'd do an hd on that directory. The newer OSR5's don't permit
that so you'd have to snag a copy of one of the older versions.

See what comes after the last name you see there.

But if this is happening in all directories, do the same thing,
an ls -f, and see if they all return the same number of
file names before hanging. That could indicate some resource
limit.

Have you checked to see that the ls binary on the failing machine
is identical to the one on a machine that works? Maybe
ls is bad.

Does echo * return all the names?

Kevin

Dec 8, 2004, 5:38:04 PM
Did sum's on ls on a known good machine and this machine, and they
match. Permissions/ownership also match.
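
For anyone repeating that comparison, a hedged sketch follows. It uses `cksum` rather than the `sum` used above (a deliberate swap, since `cksum` output is standardized), and `same_binary` is a made-up helper that assumes both copies are reachable as files, for example after copying the known-good `ls` over:

```shell
# same_binary A B: succeed when both files have the same CRC and byte count.
# Returns 2 if either file cannot be read.
same_binary() {
    sum_a=$(cksum < "$1") || return 2
    sum_b=$(cksum < "$2") || return 2
    [ "$sum_a" = "$sum_b" ]
}
```

Across two machines without file transfer, running `cksum /bin/ls` on each and comparing the two numbers by eye does the same job.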

ls -f works without hanging, appears to list all files, and so did echo
*.

The hang only seems to occur w/ ls -l or ls -n.

Bill Vermillion

Dec 8, 2004, 7:35:05 PM
In article <1102545484.0...@c13g2000cwb.googlegroups.com>,

So then it is hanging anytime it is trying to display UID or GID
- even without lookups, as ls -l maps the UID/GID to names
in /etc/passwd and /etc/group while -n just prints the numbers.

If this hangs on all directories that is really strange. If it
only hangs on some, perhaps a directory has a UID or GID that is
outside the normal range.

At this point I'm confused, and I suspect that only a hex dump
of the problem directory will show what is going on.

The -f option - which displays them in the actual order they are
in the directory - might be your only clue. Find the last file
you can see, and then with a hexdump examine the next entry to
see what may be causing the problem.

But that can be a real pain. Back when directory entries were
limited to 14 character names and two bytes for the inum they were
easy to read.

You do have a strange problem. I'm out of guesses at the moment.

Kevin

Dec 9, 2004, 4:50:53 PM
Bill,

The hang only happens on / and /dev. /dev/rmt is ok, as are all
other directories I've tried.....

Going back to the passwd/group files, and their links and integrity,
here is what I have:

$ pwd
/etc
$ ls -l passwd
lrwxrwxrwx 1 root root 38 May 18 2001 passwd -> /var/opt/K/SCO/Unix/5.0.6Ga/etc/passwd
$ ls -l shadow
lrwxrwxrwx 1 root root 38 May 18 2001 shadow -> /var/opt/K/SCO/Unix/5.0.6Ga/etc/shadow
$ ls -l group
lrwxrwxrwx 1 root root 37 May 18 2001 group -> /var/opt/K/SCO/Unix/5.0.6Ga/etc/group

$ cd /var/opt/K/SCO/Unix/5.0.6Ga/etc
$ ls -l passwd
-rw-rw-r-- 1 bin auth 1186 Aug 10 2001 passwd
$ ls -l shadow
-rw-rw---- 1 root auth 423 Aug 10 2001 shadow
$ ls -l group
-rw-rw-r-- 1 bin auth 471 Aug 10 2001 group

So it looks to me like the links are ok. Weird, eh?

Kevin

Jean-Pierre Radley

Dec 9, 2004, 4:54:38 PM
Kevin typed (on Thu, Dec 09, 2004 at 01:50:53PM -0800):

The links of those files are not at issue. Their content may be.
Does authck -av disclose anything?

--
JP

Kevin

Dec 10, 2004, 12:36:12 PM
Here is the authck:

# authck -av
Checking defaults
Finding all entries in the Protected Password database hierarchy
Checking Protected Password hierarchy
Checking all entries in the Subsystem database
Checking all entries in the Terminal Control database
#

This completely rules out problems with the passwd/group files, right?
Thanks,
Kevin

Jean-Pierre Radley

Dec 10, 2004, 12:39:23 PM
Kevin typed (on Fri, Dec 10, 2004 at 09:36:12AM -0800):

Not necessarily. You should examine those files with 'cat -v' or with
'less' to make sure they do not contain nonsense.
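
For a quick scripted pass over the same files, here is a hedged sketch that assumes a grep with POSIX character classes; `odd_chars` is an invented name. It prints, with line numbers, any line holding a byte outside printable ASCII, space and tab:

```shell
# odd_chars FILE: list lines containing control characters or high-bit bytes.
# LC_ALL=C makes [:print:] mean plain printable ASCII.
# Exit status is grep's: 0 if something odd was found, 1 if the file looks clean.
odd_chars() {
    LC_ALL=C grep -n '[^[:print:][:blank:]]' "$1"
}
```

Legitimate accented characters in a comment field would be flagged too, so treat any hit as something to inspect with 'cat -v', not as proof of corruption.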


--
JP

Bill Vermillion

Dec 10, 2004, 4:25:01 PM
In article <1102700172....@f14g2000cwb.googlegroups.com>,

Since the ls -n also had the problem and the -n option does not
use password or group I don't think you have problems there.

Since it only happens in /dev and / the next thing I'd suspect is
a bogus name, but since a plain 'ls' works that seems to be ruled
out also.

Mike Brown

Dec 15, 2004, 11:46:47 PM

what does:

find /dev -exec ls -l {} \; | more

show? Compare the output to

find /dev -exec ls {} \; | more

to see where it hangs. It may be that you have
a problem with the /dev directory table that shows
up only with a long listing from / or /dev.

Mike

--
Michael Brown

Kevin

Dec 16, 2004, 12:03:47 PM
Mike,

The find ... ls -l {} ... hangs immediately. The find without the -l
does not hang.

The first thing that shows when I do the find without the -l is the
directory name X.

I'm heading there tonight to do a reload from scratch, that is I think
the only safe way to fix it at this point.

When I'm there, I am going to try to boot off the boot floppy and mount
the filesystem to see if I can see anything different that way...
Thanks,
Kevin

Mike Brown

Dec 16, 2004, 1:29:17 PM

You could play with the find command a bit to narrow down what it is
looking for, and see if anything from the /dev directory will list.
As an example
find /dev/rdsk -exec ls -l {} \;

Mike

--
Michael Brown

Bela Lubkin

Dec 18, 2004, 7:55:19 AM
Kevin Coulter wrote:

Reinstalling the system is an overreaction to this!

For most of the discussion, it sounded like you were not rebooting the
system. All the `find`, `ls` etc. commands that were run, were run
during a single system uptime.

You did say in the first message that the machine seemed to reboot OK.
But that doesn't seem to agree with the problem.

The problem seems to be this: your system has /dev/cd0 mounted on /mnt,
but there is something wrong with the in-core inode structure for
/dev/cd0. The "something wrong" is most likely that the structure's
lock flag is set. As a result, any attempt to stat() /dev/cd0 hangs,
waiting for the lock to be cleared. Any attempt to stat() the mount
point directory, /mnt, also hangs, because the kernel has to follow from
the mount point to the real device, /dev/cd0, whose structure is locked.

Unless you can clear that lock flag, you'll have to reboot.

I've used kernel debuggers to forcibly clear such a flag. What usually
happens is that the process that was holding the lock immediately cycles
back and reacquires it, and nothing improves. This is because the
process is running in the hardware device driver (in this case, "Srom"
for the CD-ROM drive or "wd" or some SCSI device driver for the
hardware). When you clear the lock, it wakes up, sees that whatever
operation it was waiting on didn't complete, and re-issues that
operation.

The operation is probably hung because the drive hung.

So the chain of causality is:

- CD-ROM drive or controller hardware hung
- device driver tried to access it, never got the expected response
- the device driver (or its caller) had locked /dev/cd0's in-core inode
- accesses of /dev/cd0 in-core inode hang waiting for unlock
- accesses of /mnt in-core inode hang because they refer to /dev/cd0
- `ls -l /` hangs because it tries to stat("/mnt")
- `ls -l /dev` hangs because it tries to stat("/dev/cd0")
- `find` eventually hangs trying to stat() one of those
- and so on

Rebooting ought to clear this up completely, unless the system is
configured to automatically mount the CD at boot time, and the CD-ROM
drive or controller is so hung it needs to be power-cycled, or is
actually broken.

If the machine hasn't yet been rebooted, you can try a couple of things
to demonstrate the truth of what I say. Start with:

# cd /dev
# for name in *; do
      case "$name" in cd0|rcd0) echo "=== SKIPPED $name ==="; continue;; esac
      printf %-20s "$name"
      ls -gold "$name"
  done

This will produce a lot of output; and if I'm right, it won't hang. If
it does hang, the problem file's name should be on the last line. Add
it to the "SKIPPED" case statement, e.g.:

case $name in cd0|rcd0|tty666) echo ...

and do it again.

You can do likewise for /mnt:

# cd /
# for name in *; do
      case "$name" in mnt) echo "=== SKIPPED $name ==="; continue;; esac
      printf %-20s "$name"
      ls -gold "$name"
  done

Don't add "-L" to the `ls` flags; you aren't trying to test symlink
integrity, only the accessibility of in-core inodes of files in those
directories.
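
The two loops above differ only in the directory and the skip list, so they can be folded into one helper. This is a sketch under assumptions, not part of the original advice: `probe_dir` is a hypothetical name, it needs a shell with `${var##pattern}` expansion (ksh or any POSIX sh, not the oldest Bourne shells), and it uses plain `ls -ld`; the `-g` and `-o` flags from the loops above can be added where `ls` supports them.

```shell
# probe_dir DIR [SKIP...]: stat every entry in DIR except the names in SKIP.
# The name is printed before ls runs, so if ls hangs, the last name on
# screen is the entry whose in-core inode cannot be reached.
probe_dir() {
    dir=$1
    shift
    for path in "$dir"/*; do
        name=${path##*/}
        skip=no
        for s in "$@"; do
            if [ "$name" = "$s" ]; then skip=yes; fi
        done
        if [ "$skip" = yes ]; then
            echo "=== SKIPPED $name ==="
            continue
        fi
        printf '%-20s' "$name"
        ls -ld "$path"
    done
}
```

Usage would be `probe_dir /dev cd0 rcd0` and `probe_dir / mnt`, adding any newly found hanging name to the skip list and running it again.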

>Bela<

Kevin

Dec 21, 2004, 5:13:05 PM
Problem is resolved without reloading. Bela is right, I did not boot
the machine until I was on site. It rebooted ok until I tried to mount
the cdrom. When I mounted the CD for the first time, I received on the
console: "Warning:wd(0) now using polled interface". My customer never
saw this as they were telneted in, I never saw it as I was telneted in,
and the log files had stopped getting written to when I was working on
this remotely.

After lots of trials and tribulations, determined that
/var/adm/hwconfig was set to primary/master for the CD drive. It needed
to be primary/slave. Used mkdev to remove old config and install new
one. Only explanation I can come up with is something in RS506a or
OSS646B is not forgiving to hwconfig not being quite right, but prior
to those updates, the system didn't care.

The part that I still don't understand: When I did a mount /dev/cd0
/mnt, it would just hang the session, but I could go to other sessions
and do the ls -l's on / and /dev. However, after I did a mount -f
HS,lower -ro /dev/cd0 /mnt, then the ls -l hangs would occur.
Thanks to everyone for your help!

Kevin

Tony Lawrence

Dec 23, 2004, 6:38:04 AM
Kevin wrote:
>
> The part that I still don't understand: When I did a mount /dev/cd0
> /mnt, it would just hang the session, but I could go to other
> sessions and do the ls -l's on / and /dev. However, after I did a
> mount -f HS,lower -ro /dev/cd0 /mnt, then the ls -l hangs would
> occur. Thanks to everyone for your help!

Well, it WAS a hardware problem.

It's not particularly surprising that, given a hardware misconfiguration,
different mount options have different deviant behaviour. Obviously there
is code in certain sections that is reached with some options and not
reached with others.

A couple of small points just to avoid future confusion by someone
else reading this thread: hwconfig is a report - it doesn't set
anything.

Also, "WARNING: wd(0) now using polled interface" just means that the
device you accessed can't use DMA transfers so the driver will use
(slower) PIO. It doesn't always necessarily indicate a problem in and of
itself, though it is true that you'll see that kind of thing when
master/slave is confused.


--
Tony Lawrence
http://aplawrence.com
