rqs gone bad.

Hans Peter Verne

Feb 7, 2004, 10:45:58 AM

I just installed the new patch (5473) on a bunch of machines. All
went fine, except on one of them. Here, the rqsall seems to hang
forever, i.e. inst says:

(... the usual stuff, nothing wrong ...)
Installations and removals were successful.
Requickstarting ELF files (see rqsall(1))

And there it hangs. In another shell, I can see

lavoisier:/# date
Sat Feb 7 16:21:41 MET 2004
lavoisier:/# ps -ef | head -1 ; ps -ef | grep '[r]qs'
UID PID PPID C STIME TTY TIME CMD
root 616672 619262 0 11:15:36 ? 294:31 /usr/etc/rqs -force_requickstart -load_address 0x7c70000 -timestamp 0x4024bac8
root 619262 619653 0 11:12:21 ? 1:16 /usr/etc/rqsall -force -no_echo -inst #3 -o /var/inst/.rqsfiles -rescan /var/in


So, rqs has been running for 5 hours. With top, I can see it gets
~95% CPU. There is nothing interesting in /var/adm/SYSLOG .
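
For spotting such a runaway rqs without watching top, here is a minimal sketch that filters `ps -ef` output for rqs processes whose accumulated CPU time has passed a limit. The function name long_running_rqs is made up for illustration, and the column positions (TIME in field 7, command in field 8, as in the listing above) are assumptions: an STIME value containing a space would shift the fields, and TIME is assumed to be in minutes:seconds form.

```shell
# long_running_rqs MINUTES
# Reads "ps -ef" output on stdin and prints "PID TIME" for every rqs
# process (not rqsall) that has used at least MINUTES of CPU time.
# Assumption: TIME is field 7 and the command path is field 8.
long_running_rqs() {
    limit=$1
    awk -v limit="$limit" '
        $8 ~ /\/rqs$/ {
            split($7, t, ":")          # t[1] = minutes, t[2] = seconds
            if (t[1] + 0 >= limit) print $2, $7
        }'
}
```

It would be used as `ps -ef | long_running_rqs 30`.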

Can I somehow find out what's holding up rqs? I checked the file
/var/inst/.rqsfiles, it is dated 11:12, and ends with

lb:libc.so.1
v:sgi1.0
t:0x3e948995
c:0x513644fd
el:
eo:

which tells me nothing, apart from suggesting the problem is
related to libc.... Or perhaps not.

I have never seen this before, and I have no idea what to do.
Simply kill rqs and hope for the best? Reboot? xfs_check?

This is a Power Indigo2 XZ, Extreme, IP26, running 6.5.18m.

All hints appreciated,
sincerely,
--
Hans Peter Verne ( hpv at kjemi dot uio dot no )

`It would seem that you have no useful skill or talent whatsoever,
have you thought of going into teaching?' -- Terry Pratchett, "Mort"

David Anderson

Feb 8, 2004, 4:45:55 PM
In article <c0317m$jk9$1...@readme.uio.no>,

Hans Peter Verne <dev...@kjemi.uio.invalid> wrote:
>
>I just installed the new patch (5473) on a bunch of machines. All
>went fine, except on one of them. Here, the rqsall seems to hang
>forever, i.e. inst says:
>
> (... the usual stuff, nothing wrong ...)
> Installations and removals were successful.
> Requickstarting ELF files (see rqsall(1))
>
>And there it hangs. In another shell, I can see
>
> lavoisier:/# date
> Sat Feb 7 16:21:41 MET 2004
> lavoisier:/# ps -ef | head -1 ; ps -ef | grep '[r]qs'
> UID PID PPID C STIME TTY TIME CMD
> root 616672 619262 0 11:15:36 ? 294:31 /usr/etc/rqs -force_requickstart -load_address 0x7c70000 -timestamp 0x4024bac8
> root 619262 619653 0 11:12:21 ? 1:16 /usr/etc/rqsall -force -no_echo -inst #3 -o /var/inst/.rqsfiles -rescan /var/in
>
>
>So, rqs has been running for 5 hours. With top, I can see it gets
>~95% CPU. There is nothing interesting in /var/adm/SYSLOG .

This is a big surprise. No 'hanging rqs' issue is known
or has been known for years (in fact I don't recall any such issue ever,
though a couple of botches have led to rqsall never finishing
due to circular links in /var/inst/.rqsfiles).

Try intercepting with par(1) to get a report on what it is doing.

rqs is written to be safe at all times.
A new DSO is written out and the removal and replacement
of the old is designed to be 'atomic'.

So it should be the case that killing 616672 causes
no problem.


>Can I somehow find out what's holding up rqs? I checked the file

Well, look at par(1). That can help.

>/var/inst/.rqsfiles, it is dated 11:12, and ends with
>
>lb:libc.so.1
>v:sgi1.0
>t:0x3e948995
>c:0x513644fd
>el:
>eo:
>
>which tells me nothing, apart from suggesting the problem is
>related to libc.... Or perhaps not

No, not a correct conclusion.

>I have never seen this before, and I have no idea what to do.
>Simply kill rqs and hope for the best? Reboot? xfs_check?

a) try par on rqs to find out what it is doing, report that
here.

b) kill rqs.

c) xfs_check is a good idea: a file system problem is
the only cause I can think of offhand.

c2) If /usr/lib*/so_locations is trashed this can
give rqs problems, though not leading to an infinite loop
that I know of.


>This is a Power Indigo2 XZ, Extreme, IP26, running 6.5.18m.

d) even if rqs and rqsall fail your machine will still be ok.
It just might not start up some system utilities quite
as fast as it could when they are run.

e) Neither has any known bugs (I do have things I'd like to
do to improve reporting of their actions -- right now the
available options for reporting are not very useful).

f) I don't understand what is causing the problem.

g) You can always rerun rqsall (don't run 2 at once!) later if
you wish to. Not a bad idea given this odd situation.
See the rqsall man page.

Sign me very surprised by your difficulty.
David B. Anderson davea at sgi dot com http://reality.sgiweb.org/davea
[rqs, rqsall maintainer.]

Timo Kanera

Feb 8, 2004, 5:25:19 PM
In article <c0317m$jk9$1...@readme.uio.no>,

Hans Peter Verne <dev...@kjemi.uio.invalid> writes:
>
> I have never seen this before, and I have no idea what to do.
> Simply kill rqs and hope for the best? Reboot? xfs_check?
>
> This is a Power Indigo2 XZ, Extreme, IP26, running 6.5.18m.

Interesting. I have regularly seen the same rqs behavior lately on an IP26
(Challenge M) too (on .21 and .22). I haven't looked into it more deeply yet,
so I can't really help you, but the fact that you're only experiencing it on
an R8000 made me raise an eyebrow. Probably no coincidence.

so long,
Timo

--
Timo Kanera <ti...@kanera.de> . GPG Key-ID: 1024D/30CDB412


Hans Peter Verne

Feb 18, 2004, 9:59:34 AM

In article <c06amj$u0o3j$1...@fido.engr.sgi.com> da...@quasar.engr.sgi.com (David Anderson) writes:

> >So, rqs has been running for 5 hours. With top, I can see it gets
> >~95% CPU. There is nothing interesting in /var/adm/SYSLOG .
>
> This is a big surprise.

Thanks for your interest. I had to put this away for some time,
but I tried out some more today.

> No 'hanging rqs' issue is known
> or has been known for years (in fact I don't recall such at all ever,
> though a couple of botches have led to rqsall never finishing
> due to circular-links in /var/inst/.rqsfiles).

Can .rqsfiles be regenerated somehow?

> Try intercepting with par(1) to get a report on what it is doing.

OK. I don't have all that much experience with par, but I didn't
find much. See below.

> So it should be the case that killing 616672 causes
> no problem.

Yup, so I did.

> c) xfs_check is a good idea: a file system problem is
> the only cause I can think of offhand.

The system is offsite, so I can't easily boot from CD, but
xfs_check -s said nothing.

> c2) If /usr/lib*/so_locations is trashed this can
> give rqs problems, though not leading to an infinite loop
> that I know of.

I don't know what to look for here, though:

# ls -l /usr/lib*/so_locations
-r--r--r-- 1 root root 54463 Feb 8 12:11 /usr/lib/so_locations
-r--r--r-- 1 root root 69160 Feb 8 12:11 /usr/lib32/so_locations
-r--r--r-- 1 root root 10078 Feb 8 12:11 /usr/lib64/so_locations

> a) try par on rqs to find out what it is doing, report that
> here.

Now, this is what I did:

tail /var/inst/INSTLOG showed how rqsall was called, so I tried:

/usr/etc/rqsall -v -count -log /tmp/rqs.log -force \
-o /var/inst/.rqsfiles-done2 -rescan /var/inst/.rqsfiles

This outputs a lot, then hangs, last message is

removing starting address 0xd3f7000 of length 0x7000 from memory list vl_vec[23] mlist 0x101a08c0 for /usr/lib32/libawareaudio.so.1 -- SUCCEED

/tmp/rqs.log remains empty, and /var/inst/.rqsfiles-done2 is not even
created. So (in another window) I run "ps -ef | grep rqs" and find it
hanging here:

root 762052 762698 0 14:30:13 pts/4 36:52 /usr/etc/rqs -force_requickstart -load_address 0x4ca0000 -timestamp 0x403368e5

ps won't tell me the full arg list, but pid 762052 has been running for
quite some time. So I try

# par -SS -Q -A -i -s -r -k -i -p 762052

After a couple of minutes, it has printed this list:

0mS inetd(762132): I/O queued; flags 0x14019 dev 0,48 count 16384 blkno 45600
0mS inetd(762132): I/O started; flags 0x14019 dev 0,48 count 16384 blkno 45600
830mS (762132): was sent signal SIGCLD
869mS sh(762363): I/O queued; flags 0x9 dev 0,44 count 8192 blkno 845952
870mS sh(762363): I/O started; flags 0x9 dev 0,44 count 8192 blkno 845952
888mS (762132): was sent signal SIGCLD
1040mS (762132): was sent signal SIGCLD
1220mS (762132): was sent signal SIGCLD
1396mS (762132): was sent signal SIGCLD
71354mS (750293): was sent signal SIGCLD


I'm still stumped!

David Anderson

Feb 18, 2004, 6:42:38 PM
In article <c0vukm$6pc$1...@readme.uio.no>,

Hans Peter Verne <dev...@kjemi.uio.invalid> wrote:
>
>In article <c06amj$u0o3j$1...@fido.engr.sgi.com> da...@quasar.engr.sgi.com (David Anderson) writes:
>
>> >So, rqs has been running for 5 hours. With top, I can see it gets
>> >~95% CPU. There is nothing interesting in /var/adm/SYSLOG .
...

>> No 'hanging rqs' issue is known
>> or has been known for years (in fact I don't recall such at all ever,
>> though a couple of botches have led to rqsall never finishing
>> due to circular-links in /var/inst/.rqsfiles).
>
>Can .rqsfiles be regenerated somehow?

Regrettably no. But that does not seem to be your problem,
and anyway modern rqsall notices the circular list and
avoids looping.


>> Try intercepting with par(1) to get a report on what it is doing.
>
>OK. I don't have all that much experience with par, but I didn't
>find much. See below.

Yes, not much of any use there.


>> a) try par on rqs to find out what it is doing, report that
>> here.
>
>Now, this is what I did:
>
>tail /var/inst/INSTLOG showed how rqsall was called, so I tried:
>
> /usr/etc/rqsall -v -count -log /tmp/rqs.log -force \
> -o /var/inst/.rqsfiles-done2 -rescan /var/inst/.rqsfiles
>
>This outputs a lot, then hangs, last message is

I'd like to see a lot more than 'the last message' here.
What I'm looking for is the name of the file rqsall is
going to run rqs on, and the rqs command being run.
It should be in this output somewhere, not far from the end.

> removing starting address 0xd3f7000 of length 0x7000 from memory list vl_vec[23] mlist 0x101a08c0 for /usr/lib32/libawareaudio.so.1 -- SUCCEED
>
>/tmp/rqs.log remains empty, and /var/inst/.rqsfiles-done2 is not even
>created. So (in another window) I run "ps -ef | grep rqs" and find it
>hanging here:
>
> root 762052 762698 0 14:30:13 pts/4 36:52 /usr/etc/rqs -force_requickstart -load_address 0x4ca0000 -timestamp 0x403368e5


Regrettably the kernel string array limit makes
the file name being rqs'd invisible to ps.

That's the rqs command, but missing the file name.

The fact that par tells us nothing about this process is curious.
Since no I/O is going on, it makes me think that the object
is damaged somehow. Or could the object be NFS-mounted,
with an unusable mount point hanging up rqs?
Look at the path names in
/var/inst/.rqsfiles for anything that looks like
a full path to an NFS file (for this system).
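
A minimal sketch of that check, assuming full paths show up in /var/inst/.rqsfiles as tokens beginning with "/" (possibly after a "key:" prefix like the lb:, v:, t: lines shown earlier). The exact file format is not documented in this thread, so that is a guess, and the function name list_abs_paths is only illustrative. Paths containing a colon or whitespace would be split incorrectly.

```shell
# list_abs_paths FILE
# Prints every unique token in FILE that looks like an absolute path.
# Assumption: tokens are separated by colons, spaces, or tabs.
list_abs_paths() {
    awk -F'[: \t]+' '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^\//) print $i
    }' "$1" | sort -u
}
```

Running `list_abs_paths /var/inst/.rqsfiles` and comparing the leading directories against the NFS mount points reported on this system should show whether any listed object lives on a remote file system.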

Again, keep a backup of /var/inst/.rqsfiles . Don't let it get lost.

>ps won't tell me the full arg list, but pid 762052 has been running for
>quite some time. So I try
>
> # par -SS -Q -A -i -s -r -k -i -p 762052
>
>After a couple of minutes, it has printed this list:

The so_locations files you ls -l'd were of sensible length
(/usr/lib64/so_locations was a tiny bit smaller than I would
have expected, but within bounds of reasonableness).

Could you email as much of the tail of the stdout/stderr
of an rqsall run as you can? (Add
-debug reason
to the options you give rqsall.)

When rqs hangs, kill it, letting rqsall run to completion.
Somehow I have to find out *which* file is hanging rqs.
The file will be a full path in /var/inst/.rqsfiles.
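
One way to collect that output, as a rough sketch: a small wrapper that sends a command's stdout and stderr to a log file and then prints the last lines of it. The name run_and_keep_tail and the log path in the usage line are hypothetical, and `tail -n` is assumed to be available on the system.

```shell
# run_and_keep_tail LOGFILE N COMMAND [ARGS...]
# Runs COMMAND with stdout and stderr both captured in LOGFILE,
# then prints the last N lines and returns COMMAND's exit status.
run_and_keep_tail() {
    log=$1
    n=$2
    shift 2
    "$@" > "$log" 2>&1
    status=$?
    tail -n "$n" "$log"
    return $status
}
```

With the rqsall invocation quoted earlier in this thread, and `-debug reason` added (its position among the options is a guess), that might look like:

    run_and_keep_tail /tmp/rqsall.out 200 /usr/etc/rqsall -v -count \
        -log /tmp/rqs.log -force -debug reason \
        -o /var/inst/.rqsfiles-done2 -rescan /var/inst/.rqsfiles

The wrapper only returns once rqsall exits, so a hung rqs still has to be killed from another shell.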

We should take this to support or to email and report findings
here when we have something.


Thanks for your patience.
David Anderson davea at sgi dot com
