I'd like to enter this thread once again.
90 percent of the dumps we see at our site, if not more, come from
normal application code: 0Cx exceptions and other easily resolved
causes. Sometimes the cause is the same but the error is detected
somewhere below, for example in an LE routine that runs as the result
of a call to a PL/1 or C runtime routine. In that case the caller of
the runtime routine is normally to blame.
We automated dump reading as far as possible, making it unnecessary
most of the time. We provide an LE exit that runs in all our
environments. When an error occurs, the exit catches it and prints
enough information from the save area back trace that the application
developer normally only has to look at that output and doesn't need to
refer to the SYSUDUMP that follows.
For example, we print the DSA of every procedure call, together with
the name of the function, the parameter address list of every call, the
complete call hierarchy, the registers at every call level, the offset
of each call, and so on.
If the error is in fact in an LE routine below the application code, we
recognize this, go up to the application code, and identify the error
position there. The same goes for DB2 errors, that is, when the error
position is in the routine that handles the DB2 "SQLCODE not handled"
condition. And once we have found the name of the module that caused
the error, we send an alarm mail to the department responsible for the
module; we get this information from a repository.
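
To illustrate the "go up to the application code" step, here is a
minimal sketch in C. It is not the code of our exit. It assumes the
call chain has already been decoded into an array (innermost call
first) and that runtime modules can be recognized by a name prefix.
The types Frame and Owner, the functions blame_frame and
department_for, and the way the repository is represented are all
hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One call level of an already-decoded chain, innermost call first.
   Hypothetical structure, not the exit's real data model. */
typedef struct {
    const char *module;   /* module name recovered for this level */
    unsigned    offset;   /* offset of the call or failure in the module */
} Frame;

/* Hypothetical repository row: module name -> responsible department. */
typedef struct {
    const char *module;
    const char *department;
} Owner;

/* Assumption: a module is runtime code if its name starts with one of
   the given prefixes. */
static int is_runtime(const char *module,
                      const char *const *prefixes, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (strncmp(module, prefixes[i], strlen(prefixes[i])) == 0)
            return 1;
    return 0;
}

/* Walk from the failing level towards the callers and return the first
   level that is not runtime code, or NULL if there is none. */
const Frame *blame_frame(const Frame *chain, size_t depth,
                         const char *const *prefixes, size_t n)
{
    assert(chain != NULL || depth == 0);
    for (size_t i = 0; i < depth; i++)
        if (!is_runtime(chain[i].module, prefixes, n))
            return &chain[i];
    return NULL;
}

/* Look up the department for a module; NULL if the module is unknown. */
const char *department_for(const Owner *repo, size_t n, const char *module)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(repo[i].module, module) == 0)
            return repo[i].department;
    return NULL;
}
```

The caller would pass the prefixes that identify runtime modules at its
site and then use the returned department to address the alarm mail.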
The information provided this way is much easier for our people to read
than SYSUDUMP, and even easier than CEEDUMP: it has more information,
a somewhat better structure in our opinion, and (important for some of
our co-workers) it's in German.
Furthermore, we teach the developers how to cope with this.
This was necessary (we did it in 2005) because we had noticed some
problems:
- the dumps looked different in the different environments (batch, test,
  DB dialog aka IMS), but we wanted the same look and feel in every
  environment
- dump reading skills were degrading
- we didn't want to buy an expensive tool and do the customizing in the
  different environments; instead we wanted one of our own, where we
  could add additional function in an easy way (see above)
From today's viewpoint, it looks like a success story.
Even when the save area chain is destroyed (overwritten), the LE exit
does a very good job by providing at least the remains of the save area
trace. It looks for the save areas first from the bottom (register 13),
then from the top (TCBFSA). In the normal case the two chains fit
together. If not, there is a gap, and this gap is documented.
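
The two-pass search described above can be sketched roughly as follows.
This is a simplified model in C, not our exit: save areas are ordinary
structs with a back and a forward pointer, and a destroyed pointer is
assumed to show up as NULL (a real exit has to validate every address
before following it). SaveArea, Trace, build_trace and MAX_LEVELS are
names invented for the example.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LEVELS 256   /* stops the walk if a damaged chain loops */

/* Simplified save area: only the two chain pointers and a label. */
typedef struct SaveArea {
    struct SaveArea *back;     /* caller's save area (towards the top) */
    struct SaveArea *forward;  /* callee's save area (towards the bottom) */
    const char      *name;
} SaveArea;

/* Resulting call hierarchy, top level first. */
typedef struct {
    const SaveArea *level[2 * MAX_LEVELS];
    size_t count;
    int    has_gap;   /* 1 if the two partial chains did not meet */
    size_t gap_at;    /* index of the first level below the gap */
} Trace;

/* Index of p in list, or -1 if it is not there. */
static long find(const SaveArea *const *list, size_t n, const SaveArea *p)
{
    for (size_t i = 0; i < n; i++)
        if (list[i] == p)
            return (long)i;
    return -1;
}

void build_trace(const SaveArea *top, const SaveArea *bottom, Trace *t)
{
    const SaveArea *up[MAX_LEVELS];
    size_t n_up = 0;
    long meet = -1;

    assert(t != NULL);

    /* Pass 1: from the bottom, follow the back chain. */
    for (const SaveArea *p = bottom; p != NULL && n_up < MAX_LEVELS;
         p = p->back)
        up[n_up++] = p;

    /* Pass 2: from the top, follow the forward chain until it reaches
       a save area that pass 1 has already seen. */
    t->count = 0;
    for (const SaveArea *p = top; p != NULL && t->count < MAX_LEVELS;
         p = p->forward) {
        meet = find(up, n_up, p);
        if (meet >= 0)
            break;
        t->level[t->count++] = p;
    }

    /* If the chains never met, record where the gap is. */
    t->has_gap = (meet < 0);
    t->gap_at  = t->count;

    /* Append the levels found in pass 1, now in top-down order. */
    for (long i = (meet >= 0 ? meet : (long)n_up - 1); i >= 0; i--)
        t->level[t->count++] = up[i];
}
```

With an intact chain the second pass meets the first one immediately at
the top save area. With a damaged chain the result contains both
partial chains, and gap_at tells the report where to print the gap.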
The save area trace and the back chain are very important for us,
because at our site we typically have many small modules calling each
other, and it is not uncommon to see some 50 levels of call hierarchy.
BTW: the method works regardless of the programming language; we have
C, PL/1 and ASSEMBLER. At a neighbor site the exit also works with C++
functions. In fact the method to get the function name from the entry
point is the same for all LE languages, so I believe it will work for
COBOL, too, although we have no COBOL around.
Kind regards
Bernd