We got some odd messages in the EMS log (linkmon unable to establish
session, no server available). I check the pathway and I have 20
servers (out of 100) in pending state. I stop the instance by name
from outside pathway and all is good.
A week later, same exact thing. Same pathway, same number pending.
Is this code not exiting properly? Is the pathmon choking?
The server object is running under OSS, could the exit/abend logic be
bad?
Tks...
---------------------------------
Rob Lesan, DBA/Systems programmer
rob_...@hotmail.com
---------------------------------
This normally indicates that the Pathmon has tried to stop an instance
of a server (due to operator command or deletedelay), and had all TCPs
or Linkmon processes 'close' the server. It is waiting for the
processes to terminate.
A properly-coded Pathway server must keep track of its openers (in
many cases, simply counting opens and closes will work), and stop (by
exiting the MAIN routine or calling PROCESS_STOP_) when the last
opener closes it. Note: to count openers, the FILE_OPEN_ call of
$RECEIVE must indicate which OPEN or CLOSE message (old or new format)
to receive. (If a static server is started by Pathmon but no links
are granted, PATHMON will call PROCESS_STOP_() instead of having
linkers close the server).
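To make the open/close counting above concrete, here is a minimal Python simulation. It is only a sketch: it calls no Guardian procedures, and the message kinds ("OPEN", "CLOSE", "REQUEST") and the name run_server are hypothetical stand-ins for whatever a real server reads from $RECEIVE.

```python
from collections import deque


def run_server(messages):
    """Handle messages in order; stop when the last opener closes.

    Returns (replies, exited). exited is False when the messages run out
    while at least one opener is still attached, i.e. the server would
    simply go back to waiting on its receive queue.
    """
    queue = deque(messages)
    open_count = 0
    replies = []
    while queue:
        kind, payload = queue.popleft()
        if kind == "OPEN":
            open_count += 1
        elif kind == "CLOSE":
            open_count -= 1
            if open_count <= 0:
                # Last opener has gone away: this is where a real
                # server would leave its main routine.
                return replies, True
        elif kind == "REQUEST":
            # Stand-in for real work: reply with the upper-cased text.
            replies.append(payload.upper())
    return replies, False
```

A server that drops the "OPEN" and "CLOSE" branches never reaches the exit, which is one way to end up with the pending state being discussed.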
Also note: If an opener is running in a CPU that fails or halts, no
CLOSE message will be sent to the server. Also, if an 'expand'
connection between the opener and the server fails, no CLOSE message
will be sent. To properly handle these situations, you must keep
track of each opener in a table. See the OPENER_LOST_ Guardian routine
for help in doing so.
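A plain counter cannot cope with openers that vanish without sending a CLOSE, which is why a table is needed. The Python sketch below shows only the bookkeeping idea. It is not the interface of the Guardian routine mentioned above, and the class name, method names and the (node, cpu, pin) key are assumptions made for illustration.

```python
class OpenerTable:
    """Tracks individual openers so lost ones can be purged."""

    def __init__(self):
        # Each entry is a hypothetical (node, cpu, pin) tuple.
        self._openers = set()

    def on_open(self, node, cpu, pin):
        self._openers.add((node, cpu, pin))

    def on_close(self, node, cpu, pin):
        self._openers.discard((node, cpu, pin))

    def on_cpu_down(self, node, cpu):
        """Drop every opener that ran in the failed CPU; return how many."""
        lost = {o for o in self._openers if o[0] == node and o[1] == cpu}
        self._openers -= lost
        return len(lost)

    def on_node_lost(self, node):
        """Drop every opener on a node that is no longer reachable."""
        lost = {o for o in self._openers if o[0] == node}
        self._openers -= lost
        return len(lost)

    def no_openers_left(self):
        return not self._openers
```

After any of these events the server would check no_openers_left() and stop if it returns True, exactly as it would after an ordinary last CLOSE.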
Finally, about 20 years ago there were some logic flaws in Pathmon
that caused the symptom you describe. I suppose it's possible that
there could be a bug here if your code seems to comply with all of the
above.
Good luck,
Bill Honaker
XID Software, Inc.
On Wed, 28 Aug 2002 11:03:12 -0400, Rob Lesan <rob_...@hotmail.com>
wrote:
>The server object is running under OSS, could the exit/abend logic be
>bad?
>
Are there any processes (other servers, or non-Pathway processes) which open
these servers by process name? That can cause the symptoms you describe. If
this is the case, those processes should be communicating with the servers
using the PathSend interface instead, which will avoid the problem.
--
Real email address is alphageek /at/ milmac /dot/ com
<speculation>
The 20 pending processes have encountered the same problem in their
processing (possibly locks, possibly a loop of some sort, possibly
inspect) and are waiting.
Each of these processes has then incurred a timeout on the linkmon open.
After dissolving the link, the pathmon sends the stop message to each of
the offending server processes. Until these processes read that message
and die, they remain PENDING for deletion.
Either they have read the message and failed to stop (logic problem), or
they remain in a waiting state, in which case the process's own timeout
logic controls the life of the process.
Until these processes disappear, pathmon won't perform an autorestart
(if specified). If the serverclass is already riding on the MAXSERVERS
limit, the odd messages in the EMS log (linkmon unable to establish
session, no server available) will appear, as no more processes are
permitted to start.
</speculation>
<tip>
If you are going to stop these pending processes from tacl, use #ABEND
<processname> instead of STOP <processname>. This triggers the pathmon
to invoke any autorestart options that are specified, otherwise a START
SERVER command (or the CREATEDELAY) determines when the process
restarts.
</tip>
When the pending problem next reoccurs, do a short-term PROCESS and FILE
measure on the processes to see what open they are waiting for. If the
processes are waiting on one of their opens, the next step will be
obvious.
Otherwise, if there is no message queue to the process, the process has
failed to acknowledge the close message from the pathmon and you have a
logic problem, or as Doug mentioned, another opener of the same process
can be preventing the close.
Ian Hadley
OneStop Tandem Performance
http://www.austral.se/onestop
[snip]
>When the pending problem next reoccurs, do a short-term PROCESS and FILE
>measure on the processes to see what open they are waiting for. If the
>processes are waiting on one of their opens, the next step will be
>obvious.
Or obtain the (unsupported) PSTATE utility from the GCSC, and run it against
each of the pending processes. This will show what files the process has open,
and what I/O operations, if any, are outstanding on each file.
>
>Otherwise, if there is no message queue to the process, the process has
>failed to acknowledge the close message from the pathmon and you have a
>logic problem, or as Doug mentioned, another opener of the same process
>can be preventing the close.
>
Another unsupported tool, WHOHAS, can be used to track this down. If you
suspect that a process $A123 is opened by another process, WHOHAS $A123 will
show the name of the opener process.
<snip>
>Each of these processes has then incurred a timeout on the linkmon open.
>After dissolving the link, the pathmon sends the stop message to each of
>the offending server processes. Until these processes read that message
>and die, they remain PENDING for deletion.
>
>Either they have read the message and failed to stop (logic problem), or
>they remain in a waiting state, in which case the processes own timeout
>logic controls the life of the process.
>
Are you quite sure that this scenario can happen?
To my knowledge, the only time Pathmon ever issues a STOP
against a server process is if it never granted a link in the
first place.
Just asking,
Oz
> Are you quite sure that this scenario can happen?
In the case we are discussing the scenario is most likely.
> To my knowledge, the only time Pathmon ever issues a STOP
> against a server process is if it never granted a link in the
> first place.
The stop you refer to is accomplished via a procedure call and not via a
message to the process. Unfortunately it does not pertain to the case of
the pending server processes.
Using a freeze & stop command sequence against a serverclass causes the
Pathmon to issue a stop. I know this one is painfully obvious but I just
thought it needed to be said.
Neither of these 2 stop techniques is what I was referring to earlier.
Another manner in which Pathmon will stop a static server process is
when the last link to the process has timed out. In this case, the link
is dissolved, the linkmon/TCP regains control of the original request on
behalf of its requester, and the Pathmon sends a terminate message to
the process. Until that message is read and correctly acted upon, the
process remains in pending state. (PENDING = pending deletion)
Because serverclass processes with MAXLINKS 1 are not all that common,
most timeouts go unnoticed. One sign that there have been timeouts is
that the links/weight information from the "status
<serverclass>,detail" command doesn't make sense (assuming it made sense
in the first place :).
When a timeout occurs in a multilink situation, the link is dissolved,
the link counter reduces by one, but the weight stays the same. Since
the weight is used to determine the distribution of new links within a
serverclass, this partially inhibits new links going to a process which
has been having difficulties (timeouts).
Returning to the context of our thread.
I would say that the serverclass has MAXSERVERS 100, NUMSTATIC 100 or
almost 100, MAXLINKS 1 and a TIMEOUT of some duration specified. It is
the MAXLINKS/MAXSERVERS combination which caused the EMS message.
When control is returned to the linkmon after the timeout, the linkmon's
requester reattempts the original request. Until at least one pending
process stops, there will be no available link to be granted, hence the
message in the EMS log.
With a little wilder speculation, one might even be able to say that all
of the pending processes have been caused by the same original request.
Prior to the first timeout, 80 links may have been active. The requester
receiving the timeouts was able to march through all of the remaining
links/serverprocesses (at the interval of the timeout) until there were
no more links left. To see if this might be true, a "status
<serverclass>,processes" will show where the links to each process come
from. This piece of speculation is supportable if all the pending
processes have the same linkmon as the opener.
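The arithmetic behind this speculation can be sketched in a few lines of Python. Everything here is assumed for illustration: MAXSERVERS 100, MAXLINKS 1, 80 servers already linked, and servers that never act on the close after a timeout. The function name and state labels are made up and have nothing to do with real Pathmon internals.

```python
def simulate_cascade(max_servers=100, busy_links=80):
    """Return (pending, available) after one requester retries through
    every free server and times out on each of them."""
    states = ["LINKED"] * busy_links + ["IDLE"] * (max_servers - busy_links)
    for i, state in enumerate(states):
        if state == "IDLE":
            # The requester is granted this link, the request times out,
            # the link is dissolved, and the server sits in pending
            # state because it never stops itself.
            states[i] = "PENDING"
    return states.count("PENDING"), states.count("IDLE")
```

With the default values this leaves 20 servers pending and none available, which would match both the count Rob reported and the "no server available" message.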
The solution to this problem doesn't lie in the pathmon or its
configuration (unless you want to relax the timeout duration or accept
multiple links).
<Assuming LINKMON is the only opener and the process is busy processing
a request>
Since the original requester (and only opener) has lost interest in this
particular request it is perfectly acceptable to drop one of the pending
processes into debug (assuming it wasn't already in inspect) and just
wait until it comes out of where it was.
This is also a good time to use PSTATE and a series of FUP LISTLOCKS
that Doug referred to, to see what that turns up in the case of locking
etc. If and when the process reactivates (and if inspect enabled), a
brief inspection of the $receive (and other parts of the process) will
tell what the request was all about.
I hope Rob doesn't post now saying I'm completely on the wrong track :)
If anything is not clear, Oz, it is quite OK to repost.
Ian Hadley
OneStop Tandem Performance
http://www.austral.se/onestop
>
> Just, asking,
> Oz
>Hi Oz,
>
>> Are you quite sure that this scenario can happen?
>
>In the case we are discussing the scenario is most likely.
>
>> To my knowledge, the only time Pathmon ever issues a STOP
>> against a server process is if it never granted a link in the
>> first place.
>
>The stop you refer to is accomplished via a procedure call and not via a
>message to the process. Unfortunately it does not pertain to the case of
>the pending server processes.
>
>Using a freeze & stop command sequence against a serverclass causes the
>Pathmon to issue a stop. I know this one is painfully obvious but I just
>thought it needed to be said.
>
>Neither of these 2 stop techniques are what I was referring to earlier.
>
>Another manner in which Pathmon will stop a static server process is
>when the last link to the process has timed out. In this case, the link
>is dissolved, the linkmon/TCP regains control of the original request on
>behalf of it's requester, and the Pathmon sends a terminate message to
>the process. Until that message is read and correctly acted upon, the
>process remains in pending state. (PENDING = pending deletion)
<snip>
I was with you up until that....what is the exact nature of this
"terminate" message that is sent from Pathmon to the serverprocess?
Oz
The "terminate" message is read by the server process as an EOF (file
status 10) on the $receive or otherwise referred to as the "at-end"
condition.
When a pathway server process receives this message it has been
commanded to stop by the pathmon and is obliged to stop itself
gracefully (coded stop). Processes that fail to stop, as well as those
that have yet to read the message (for whatever reason), are regarded as
pending.
It may be that this message travels via the TCP or LINKMON that
dissolved the last link, but it is the pathmon that orchestrates the
proceedings and is responsible for the action.
This stop mechanism is not to be confused with the technique you are
most familiar with, where the pathmon uses the close procedure call to
stop a never opened server process. In that case no direct communication
takes place between the pathmon and the server process. Even an unopened
server process suspended from tacl will stop after a Pathcom freeze &
stop command sequence.
If you try that same trick on an opened server process, the links will
dissolve and the server process will go into pending state until the
process is activated from tacl and is able to read its stop command
(which curiously enough would emulate Rob's original problem as well).
I'm not sure how the cobol85 manual looks these days (no TIM), but there
has been, and ought to still be, a section concerning the at-end
condition for $receive. Other details concerning link management can be
found in various pathway manuals.
But I was just wondering, has any of this 'discussion' shed light on Rob's
original "pending" server problem?
Ian Hadley
OneStop Tandem Performance
http://www.austral.se/onestop
>Oz,
>
>The "terminate" message is read by the server process as an EOF (file
>status 10) on the $receive or otherwise referred to as the "at-end"
>condition.
>
>When a pathway server process receives this message it has been
>commanded to stop by the pathmon and is obliged to stop itself
>gracefully (coded stop). Processes that fail to stop, as well as those
>that have yet to read the message (for whatever reason), are regarded as
>pending.
>
Ah, now I am with you...in fact there is no "terminate" message
sent at all...it is a fiction dummied up by the Cobol runtime
library in response to a CLOSE system message, and it wouldn't
come from Pathmon anyhow (unless you are saying Pathmon leaves
the server open after Pathmon has created it, opened and sent it
the startup message(s)).
Am I missing something here?
You will never see an "EOF" on $RECEIVE when doing a READ/
READUPDATE (and their X counterparts) using the raw proc calls,
say, in a TAL server.
Oz
If we don't wander from the thread too much, Rob Lesan's got a bunch of
serverclass processes (probably not cobol) that are in pending state and
it's disrupting the behaviour of the serverclass.
I'm still sticking to my speculation (using generic terminology) that
the processes have timed out, the pathmon wants them stopped (and
restarted) before it can grant any new links to the server class. Unless
there are other openers, the processes have yet to read the last close
(stop :) message, which means they are waiting on something.
How about we leave it at that and try to get the problem
explained/solved a little better, which I'm always prepared to do.
Ian Hadley
OneStop Tandem Performance
http://www.austral.se/onestop
<snip>
>
>If we don't wander from the thread too much, Rob Lesan's got a bunch of
>serverclass processes (probably not cobol) that are in pending state and
>it's disrupting the behaviour of the serverclass.
>
>I'm still sticking to my speculation (using generic terminology) that
>the processes have timed out, the pathmon wants them stopped (and
>restarted) before it can grant any new links to the server class. Unless
>there are other openers, the processes have yet to read the last close
>(stop :) message, which means they are waiting on something.
>
>How about we leave it at that and try to get the problem
>explained/solved a little better, which I'm always prepared to do.
Ian,
Sure thing, and I agree with you. When you originally posted
about a "terminate" message I thought that perhaps it was
potentially some new gizmo introduced into Pathway of which I
was unaware. That's why I was curious about the nature of these
things. Thanks for clearing that up.
And if the serverclass processes are written in Cobol then they will
receive a pseudo-EOF on $RECEIVE from the Cobol RTL after all openers
have closed them.
Consequently, as you say, if the processes are failing to stop
after all the links have been dissolved then they missed the
CLOSE message (somehow), the CLOSEs were never sent (cpu
failures are notorious for that), or they are just waiting on
something else, like a -waited- i/o completion to another
process in a chain.
As you said, we need more information. The original poster
said 20 out of 100 server processes failed to stop, as opposed to
all of them, so for all we know NUMSTATIC was set to 20 and
the other 80 were never started in the first place :O).
All the best.
Regards, Oz
It appears that faulty logic may be our problem. After some further
investigation, it appears that the programmer(s) did not code for
proper exit functionality.
I checked the state of all of the "pending" server processes, they
were all waiting on $RECEIVE. A little further investigation shows
that the program is an OSS process. They are also all servers started
dynamically and allowed to die after deletedelay.
The programmer(s) apparently did NOT code to listen for STOP messages
from the system. I am willing to bet they are ignoring system
messages altogether.
Thanks for everyone's input on this...