Seeking help finding the cause of S23E/S53E with preceding S202, and PGM 011/004

Peter Hunkeler

Mar 27, 2017, 10:34:17 AM

Home-grown application written in ASM and C. It has worked fine for years, except for some "out of socket descriptor" problems every now and then. I was asked to help find the cause of S23E or S53E (both DETACH) abends that have shown up since the developer tried to fix the above "out of socket descriptor" problem.

I have set SLIP traps to catch the S23E or S53E. I do have some svcdumps. I have found some hints in the dump. I suspect some storage overlay but am stuck at the moment. Need some fresh ideas.

From the system trace I see that the S23E or S53E is preceded by an S202 and sometimes a true PGM 011.

Some questions:

a) I see a couple of FREEMAIN (SSRV 78) trace entries pointing to a TCB in read/write nucleus (TCB address is 00FDD4F8). Do these hold some useful information for me?

b) I know "SVC D" is also entered for normal task termination. In an oooold MVS debugging manual I found that if the first byte of R1 is x'08', this indicates RTM2 is called for task termination cleanup. The x'08' no longer seems to hold true. How can I identify such a non-error SVC D entry?

c) In some dumps I see "SVC 3" (exit) trace entries, sometimes I can see the "SVC 3E" (DETACH), sometimes it is not in the trace.

Thoughts?

--
Peter Hunkeler

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to list...@listserv.ua.edu with the message: INFO IBM-MAIN

John McKown

Mar 27, 2017, 10:46:52 AM

On Mon, Mar 27, 2017 at 9:33 AM, Peter Hunkeler <ph...@gmx.ch> wrote:

> Home-grown application written in ASM and C. It has worked fine for years, except
> for some "out of socket descriptor" problems every now and then. I was
> asked to help find the cause of S23E or S53E (both DETACH) abends that have shown up
> since the developer tried to fix the above "out of socket descriptor" problem.
>

What is the code in R15? Just as a guess, for the S23E, I'd bet it's x'08'
which says "task being DETACH'd is not a subtask of the task doing the
DETACH". For the S53E, I don't have a good guess. If this is getting the
"true PGM 011" (whatever that means), perhaps the ECB to be posted is in an
area which has been FREEMAIN'd or the ETXR routine has been DELETEd, perhaps
by some other task, which did the LOAD, ending previously.

>
>
> I have set SLIP traps to catch the S23E or S53E. I do have some svcdumps.
> I have found some hints in the dump. I suspect some storage overlay but am
> stuck at the moment. Need some fresh ideas.
>
>
> From the system trace I see that the S23E or S53E is preceded by an S202
> and sometimes a true PGM 011.
>

Again, what is in R15? If it is x'0C', then the storage containing the ECB
has likely been FREEMAIN'd. Or, as you have indicated, the area that
should hold the address of the ECB was overlain.

>
>
> Some questions:
>
>
> a) I see a couple of FREEMAIN (SSRV 78) trace entries pointing to a TCB in
> read/write nucleus (TCB address is 00FDD4F8). Do these hold some useful
> information for me?
>
>
> b) I know "SVC D" is also entered for normal task termination. In an
> oooold MVS debugging manual I found that if the first byte of R1 is x'08', this
> indicates RTM2 is called for task termination cleanup. The x'08' no
> longer seems to hold true. How can I identify such a non-error SVC D
> entry?
>
>
> c) In some dumps I see "SVC 3" (exit) trace entries, sometimes I can see
> the "SVC 3E" (DETACH), sometimes it is not in the trace.
>
>
> Thoughts?
>
>
>
>
> --
> Peter Hunkeler
>
>

--
"Irrigation of the land with seawater desalinated by fusion power is
ancient. It's called 'rain'." -- Michael McClary, in alt.fusion

Maranatha! <><
John McKown

Peter Hunkeler

Mar 27, 2017, 1:06:17 PM

>> Home-grown application written in ASM and C. It has worked fine for years, except
>> for some "out of socket descriptor" problems every now and then. I was
>> asked to help find the cause of S23E or S53E (both DETACH) abends that have shown up
>> since the developer tried to fix the above "out of socket descriptor" problem.
>>
>
>What is the code in R15? Just as a guess, for the S23E, I'd bet it's x'08'
>which says "task being DETACH'd is not a subtask of the task doing the
>DETACH".

Sorry for having forgotten to write this.

The S23E had reason 00, meaning "The protection key of the address does not match the key of the issuer of the DETACH." Interesting, isn't it?

The S53E doesn't have different reasons. One of the possible causes listed in the manual is: "The protection key of the address does not match the key of the issuer of the DETACH." Seems likely since I see an S202 in the system trace before. I'm not in the office now, and I can't remember the S202's reason code for sure. I think it was 00, meaning "The system found an incorrect address for a request block (RB) in the 3 low-order bytes of the ECB specified by the problem program."

I know the program does subtasking via ATTACH and DETACH. Unfortunately, I only have a vague idea of what the program does and don't have a clue (yet) how exactly it works. In fact there seems to be no one left in the company who does.

I would open a PMR with IBM asking for help, but thanks to great management, we are not allowed to send dumps. Therefore I have to get at least a basic understanding to be able to describe the symptoms well enough before I can even think of opening a PMR.

John McKown

Mar 27, 2017, 1:28:10 PM

On Mon, Mar 27, 2017 at 12:06 PM, Peter Hunkeler <ph...@gmx.ch> wrote:

>
>
> >> Home-grown application written in ASM and C. It has worked fine for years,
> >> except for some "out of socket descriptor" problems every now and then. I was
> >> asked to help find the cause of S23E or S53E (both DETACH) abends that have
> >> shown up since the developer tried to fix the above "out of socket descriptor"
> >> problem.
> >>
> >
> >What is the code in R15? Just as a guess, for the S23E, I'd bet it's x'08'
> >which says "task being DETACH'd is not a subtask of the task doing the
> >DETACH".
>
>
> Sorry for having forgotten to write this.
>
>
> The S23E had reason 00, meaning "The protection key of the address does
> not match the key of the issuer of the DETACH." Interesting, isn't it?
>
>
> The S53E doesn't have different reasons. One of the possible causes listed
> in the manual is: "The protection key of the address does not match the key
> of the issuer of the DETACH." Seems likely since I see an S202 in the
> system trace before. I'm not in the office now, and I can't remember the
> S202's reason code for sure. I think it was 00, meaning "The system found
> an incorrect address for a request block (RB) in the 3 low-order bytes of
> the ECB specified by the problem program."
>

Hum, could be the ECB has been overlain. Or that it is not currently being
WAIT'd upon. What are the entire 4 bytes of the ECB value (in hex, please
[grin])? The values are described here:
https://www.ibm.com/support/knowledgecenter/SSLTBW_2.2.0/com.ibm.zos.v2r2.iead100/iead100658.htm
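
For anyone who wants to eyeball an ECB fullword from a dump, here is a rough Python sketch of how I would split it up. It assumes the usual layout (bit 0 = wait bit, bit 1 = post bit, low-order 3 bytes = RB address while waiting, low-order 30 bits = completion code once posted). The function name decode_ecb is made up for illustration; check the layout against the manual before relying on it.

```python
def decode_ecb(ecb: int) -> dict:
    """Split a 4-byte ECB value into its fields (assumed layout, see lead-in)."""
    if not 0 <= ecb <= 0xFFFFFFFF:
        raise ValueError("ECB must fit in 4 bytes")
    wait_bit = bool(ecb & 0x80000000)  # bit 0: a task is waiting on this ECB
    post_bit = bool(ecb & 0x40000000)  # bit 1: the ECB has been posted
    info = {"wait": wait_bit, "posted": post_bit}
    if post_bit:
        # Once posted, the low-order 30 bits carry the completion code.
        info["completion_code"] = ecb & 0x3FFFFFFF
    elif wait_bit:
        # While waiting, the 3 low-order bytes hold the waiting RB's address.
        info["rb_address"] = ecb & 0x00FFFFFF
    return info


if __name__ == "__main__":
    # Hypothetical value, not taken from any dump in this thread.
    print(decode_ecb(0x80FD1234))
```

If the wait bit is on but the low-order 3 bytes do not point at a valid RB, that would fit the S202-00 description quoted above.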

>
>
> I know the program does subtasking via ATTACH and DETACH. Unfortunately, I
> only have a vague idea of what the program does and don't have a clue (yet)
> how exactly it works. In fact there seems to be no one left in the company
> who does.
>
>
> I would open a PMR with IBM asking for help, but thanks to great
> management, we are not allowed to send dumps. Therefore I have to get at
> least a basic understanding to be able to describe the symptoms well enough
> before I can even think of opening a PMR.
>
>
> --
> Peter Hunkeler
>

--
John McKown

Barbara Nitz

Mar 28, 2017, 1:57:30 AM

>a) I see a couple of FREEMAIN (SSRV 78) trace entries pointing to a TCB in read/write nucleus (TCB address is 00FDD4F8). Do these hold some useful information for me?

The only tcb I know of that actually, really is located in the R/W nucleus is the first tcb in the *master* address space. Every address space started after that starts with a tcb in LSQA. So unless the SSRV entries were for asid(1), I would find a tcb address in R/W nucleus highly suspicious. Have you checked the storage FDD4F8? Is it really a tcb? (Easiest way to check is a cbf x'fdd4f8' str(tcb). If the eyecatcher is not there, the formatter will tell you.)

>b) I know "SVC D" is also entered for normal task termination. In an oooold MVS debugging manual I found that if the first byte of R1 is x'08', this indicates RTM2 is called for task termination cleanup. The x'08' no longer seems to hold true. How can I identify such a non-error SVC D entry?

Error entries usually have an asterisk in front of the word SVC. I learned early on to do a "f '*'" in the trace table to find the entry for the abend. If you don't see the *rcvy entries following the svc d, chances are that you're looking at normal termination. Another indication is that IIRC normal termination doesn't have an abend code.
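
If the formatted trace has been saved to a text file, the same "find the asterisks" step can be roughed out in Python. This is only a sketch: the function name flagged_entries is invented, and the sample lines are made-up stand-ins, not real IPCS SYSTRACE output. It assumes nothing more than that the asterisk sits directly in front of the entry type.

```python
def flagged_entries(trace_lines):
    """Return (line_number, text) for lines where some token starts with '*'."""
    hits = []
    for number, line in enumerate(trace_lines, start=1):
        tokens = line.split()
        # A lone '*' is ignored; we want things like '*SVC' or '*RCVY'.
        if any(len(tok) > 1 and tok.startswith("*") for tok in tokens):
            hits.append((number, line.rstrip()))
    return hits


if __name__ == "__main__":
    # Made-up lines for illustration only.
    sample = [
        "0032 008FF1B8  SVC   3E",
        "0032 008FF1B8 *RCVY  PROG",
        "0032 008FF1B8 *SVC   D",
    ]
    for number, text in flagged_entries(sample):
        print(number, text)
```

An "*SVC D" hit with no "*RCVY" hits shortly before it would then be a candidate for normal termination, as described above.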

>c) In some dumps I see "SVC 3" (exit) trace entries, sometimes I can see the "SVC 3E" (DETACH), sometimes it is not in the trace.

What's the question on this? Did you format the trace table using 'jobname(your jobname)'? Given the number of dumps you had and the abend codes, there should be some form of common denominator. 23E and 53E are both detach abends, so I would expect to see an SVC 3E entry. Detach does some validity checking and then issues these abends. So at the time of detach you already have an overlay.
Always look at the earliest error indication. Are there logrec entries for it? What is that error? If it is a pic11, check the earlier trace table in that address space for a freemain - sometimes it is a larger range that got freed.
Does a summary format on the problem address space work without errors? Is there more than one tcb with a completion code?

Barbara

Peter Relson

Mar 28, 2017, 7:42:11 AM

I'm not really sure why IBM service would help someone debug an
application error, aside from on a for-fee basis, unless there is reason
to believe there could be a system problem.

23E reason 0: you have an error in the input to detach. The system
program-checked referencing the data that you provided. Specifically, it
appears, loading the word pointed to by register 1 (where that word being
loaded is expected to contain the address of the TCB to detach).

It is certainly true that if you issue DETACH after the freemain of the
TCB things will go poorly, but probably not 23E-00.

If your ATTACH was not done with ECB or ETXR, it is wrong to issue DETACH
after the termination of the daughter task.

Peter Relson
z/OS Core Technology Design

Peter Hunkeler

Mar 28, 2017, 8:07:37 AM

>I'm not really sure why IBM service would help someone debug an
>application error, aside from on a for-fee basis, unless there is reason
>to believe there could be a system problem.

I agree. But when all else fails, we might even try going down that route.

>23E reason 0: you have an error in the input to detach. The system
>program-checked referencing the data that you provided. Specifically, it
>appears, loading the word pointed to by register 1 (where that word being
>loaded is expected to contain the address of the TCB to detach).

>It is certainly true that if you issue DETACH after the freemain of the
>TCB things will go poorly, but probably not 23E-00.

>If your ATTACH was not done with ECB or ETXR, it is wrong to issue DETACH
>after the termination of the daughter task.

Yep, I understand we've got an error in the DETACH. The problem is we have no clue why. I suspect a storage overlay somewhere.

This is complex, multitasking code that has run for years without problems. The developer tried to change a tiny bit of code, and the errors started to appear. The trouble is that the code change has been backed out, but the problem still occurs in development. The only difference there should be between development and prod is that the development code has been recompiled/reassembled (it is maintained under ChangeMan). So recompilation/reassembly seems to lead to the error (I suspect it may be some AMODE24/31 issue). If the old code (baseline) is activated instead of what we think is the identical source, but recompiled, then the error does not appear.

I'm not in development, and I do not know the code. I cannot play with it on my own, I have to "trust" the developer (which I do).

Main reason I posted is I was hoping to get comments that would give me new ideas what to look for. And to understand some of the system trace entries I currently don't understand. Such as the TCB in Read/Write nucleus.

--
Peter Hunkeler

Peter Hunkeler

Mar 28, 2017, 8:24:42 AM

>>b) I know "SVC D" is also entered for normal task termination. In an oooold MVS debugging manual I found that if the first byte of R1 is x'08', this indicates RTM2 is called for task termination cleanup. The x'08' no longer seems to hold true. How can I identify such a non-error SVC D entry?
>Error entries usually have an asterisk in front of the word SVC. I learned early on to do a "f '*'" in the trace table to find the entry for the abend. If you don't see the *rcvy entries following the svc d, chances are that you're looking at normal termination. Another indication is that IIRC normal termination doesn't have an abend code.

Yes, the first thing I always do when looking at a system trace is "DOWN MAX", "REPORT VIEW", "F '*' 20 PREV", and then search upwards.

>>c) In some dumps I see "SVC 3" (exit) trace entries, sometimes I can see the "SVC 3E" (DETACH), sometimes it is not in the trace.
>What's the question on this?

Excellent question ;-) It was more of a bit of information than a question.

>Always look at the earliest error indication.

Normally, yes. In this case the first entry marked is for the S13E which we see often, followed by a normal task termination SVC D. Both are marked with an asterisk, but this is not an error that leads to job termination.

Also, sometimes I/O entries are marked in error as well, but the I/O does not necessarily belong to the address space in question, so these can also be ignored.

>Are there logrec entries for it? What is that error?

Only for the S13E (two of them).

>If it is a pic11, check the earlier trace table in that address space for a freemain - sometimes it is a larger range that got freed.

I'll try to find that.

>Does a summary format on the problem address space work without errors? Is there more than one tcb with a completion code?

Yes. No.

--
Peter Hunkeler

John McKown

Mar 28, 2017, 9:25:15 AM

On Tue, Mar 28, 2017 at 7:06 AM, Peter Hunkeler <ph...@gmx.ch> wrote:
<snip>

>
>
> This is complex, multitasking code that has run for years without problems.
> The developer tried to change a tiny bit of code, and the errors started to
> appear. The trouble is that the code change has been backed out, but the
> problem still occurs in development. The only difference there should be
> between development and prod is that the development code has been
> recompiled/reassembled (it is maintained under ChangeMan). So
> recompilation/reassembly seems to lead to the error (I suspect it may be
> some AMODE24/31 issue). If the old code (baseline) is activated instead of
> what we think is the identical source, but recompiled, then the error does
> not appear.
>

You might want to run an AMBLIST against the newly compiled program and
the old, functional, program. You can compare the listings to see any
changes to CSECT size, AMODE, or RMODE. That could at least point you to
where the newly compiled program differs from the functional program.
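
Once both listings are saved as text, the comparison itself can be roughed out with Python's difflib. This is just a sketch: the function name listing_diff is invented, and the sample lines below are placeholders, not real AMBLIST output. It only keeps changed lines that mention AMODE or RMODE; drop the keyword filter to see every difference.

```python
import difflib


def listing_diff(old_lines, new_lines, keywords=("AMODE", "RMODE")):
    """Return changed lines (prefixed '-' or '+') that mention any keyword."""
    diff = difflib.unified_diff(
        old_lines, new_lines,
        fromfile="baseline", tofile="rebuilt",
        n=0, lineterm="",
    )
    return [
        line for line in diff
        if line.startswith(("+", "-"))
        and not line.startswith(("+++", "---"))  # skip the two file headers
        and any(key in line.upper() for key in keywords)
    ]


if __name__ == "__main__":
    # Placeholder lines; real listings would be read from files.
    baseline = ["CSECT A  AMODE 31", "CSECT B  AMODE 31"]
    rebuilt = ["CSECT A  AMODE 31", "CSECT B  AMODE 24"]
    for line in listing_diff(baseline, rebuilt):
        print(line)
```

One caveat of this simple filter: a listing line that itself begins with "--" or "++" would be mistaken for a diff header and skipped.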

>
> I'm not in development, and I do not know the code. I cannot play with it
> on my own, I have to "trust" the developer (which I do).
>

There is an old Russian proverb: Trust, but verify. It was used by
President Reagan about nuclear disarmament with the U.S.S.R. I generally
trust the programmers here. But they often see what _should_ be there and
not what _is_ there. My stance has _always_ been: If you want my help, I
want a program compile listing and a dump. If you can't supply both,
then "Son, you're on your own!"

>
>
> Main reason I posted is I was hoping to get comments that would give me
> new ideas what to look for. And to understand some of the system trace
> entries I currently don't understand. Such as the TCB in Read/Write nucleus.
>
>
>
>
> --
> Peter Hunkeler
>
>
>
>

--
John McKown

Jim Mulder

Mar 29, 2017, 1:06:14 AM

> >a) I see a couple of FREEMAIN (SSRV 78) trace entries pointing to a
> TCB in read/write nucleus (TCB address is 00FDD4F8). Do these hold
> some useful information for me?
> The only tcb I know of that actually, really is located in the R/W
> nucleus is the first tcb in the *master* address space. Every
> address space started after that starts with a tcb in LSQA. So
> unless the SSRV entries were for asid(1), I would find a tcb address
> in R/W nucleus highly suspicious. Have you checked the storage
> FDD4F8? Is it really a tcb? (Easiest way to check is a cbf x'fdd4f8'
> str(tcb). If the eyecatcher is not there, the formatter will tell you).

At the end of task termination, we free the storage containing
the TCB/STCB under which we are running, and set PSATOLD to the
address of the WAIT TCB (which is in the nucleus). That gets
traced in some SSRV 78 entries for some more FREEMAINs, and then
we enter the dispatcher.

Jim Mulder z/OS Diagnosis, Design, Development, Test IBM Corp.
Poughkeepsie NY

Peter Hunkeler

Mar 30, 2017, 8:24:26 AM

>At the end of task termination, we free the storage containing
>the TCB/STCB under which we are running, and set PSATOLD to the
>address of the WAIT TCB (which is in the nucleus). That gets
>traced in some SSRV 78 entries for some more FREEMAINs, and then
>we enter the dispatcher.

Hi Jim,
Many thanks for the above details. Much appreciated! It helped. I think I understand what the root cause of the different ABENDs is: The ECB from an ATTACH has been overlaid. I just have to find the place in the code, now....

It took me a while to understand what the system trace was telling me; never before have I had the need to dig that deep into system trace, and to analyze a trace from a multitasking application.

When I started to write this post -- actually the text below -- I still was under the impression that there will be questions I'd have to ask to be able to understand. Now, after finishing it, it seems all is clear.
Instead of deleting the summary, I'll leave it in the post. Maybe it is of interest to someone. I do have the relevant trace entries with some comments of mine in a PDF. If anyone is interested, just drop me a note.

--
Peter Hunkeler

This is on z/OS V2.1. Here is what I understand from the system trace:

o TCB A (job step task) DETACHes subtask TCB B, which has not yet completed.

o DETACH SVC code (under TCB A) initiates abnormal termination of subtask (S13E), then puts TCB A into a WAIT until subtask termination is complete.

o TCB B gets dispatched and termination is starting via SVC D.

o Many, many trace entries later, TCB B branches to POST (from within DETACH SVC code) to post the ECB TCB A is waiting for.

o The TCB/STCB are freed next, as explained by Jim. The TCB address passed to DETACH must have been correct, and the TCB and other structures not overlaid, otherwise we would not have come that far, right?

o The next trace entries show TCB A being dispatched, still in DETACH SVC code. In the very next entry, DETACH branches to POST (SSRV 129) from IGC062+02D8. Is this to POST the ECB from the ATTACH of the task that just terminated?

o The next trace entry is the PGM 011 that IEAVEPST suffers at offset 99E. The storage the TEA address points to is not allocated. This must be either the address from TCBECB or the address of the RB from the ECB.

o Next entries, all for TCB A, are
-- RCVY PROG for the 0C4-11
-- RCVY FRR when POST's FRR is entered. The 0C4-11 is changed to an S202-00, then percolation occurs
-- RCVY FRR when DETACH's FRR is entered. The S202-00 is changed to S23E-00, then percolation occurs again
-- RCVY PERC, when the S23E-00 is percolated to RTM2.
-- SVC D, when RTM2 initiates termination of TCB A (which is the JSTCB).
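
To keep the sequence above straight in my head, it helps to write it down as a toy model. The following Python sketch is only an illustration of the conversion chain as read from the trace; the function name percolate and the table layout are invented, and it says nothing about how RTM works internally.

```python
def percolate(initial, recovery_routines):
    """Pass an abend through recovery routines in order.

    Each routine is (name, conversions); a routine may convert the current
    code to another one before percolating. Returns the list of codes seen.
    """
    history = [initial]
    current = initial
    for _name, conversions in recovery_routines:
        converted = conversions.get(current, current)
        if converted != current:
            history.append(converted)
            current = converted
    return history


# Conversions as observed in the RCVY trace entries listed above.
chain = [
    ("POST FRR", {"0C4-11": "S202-00"}),
    ("DETACH FRR", {"S202-00": "S23E-00"}),
]

if __name__ == "__main__":
    print(" -> ".join(percolate("0C4-11", chain)))
    # prints: 0C4-11 -> S202-00 -> S23E-00
```

Read this way, the S23E-00 that finally surfaces is two conversions away from the original program check in POST.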

--
Peter Hunkeler