
Dump file analysis.

Shiva

Jan 5, 2015, 1:13:58 PM
Hi everyone,

There's this dump file which I'm trying to figure out.

Disclaimer: I've never done this kind of analysis before, and I wouldn't bet on my knowing anything about it. The code is in pTAL, which I don't know well, though it looks very readable and understandable. I don't know much about INSPECT either; INSPECT says it can't work with an E system snapshot, or something like that. So I understood that this had to go through EINSPECT, though I was unable to 'add program' this dump file through that. I might have to go through the EINSPECT manual, but for the time being, this is where I stand.

My error is a TFDS incident. I could see it in TFDSCOM, and that's how I knew about the error. At present I'm not aware of the situation or the kind of error the user faced, but the call flow was like this.

Error: Unable to write to stack.
Call flow: add^zeros subproc which is called from USER^NUMERIC^INPUT^CONVERSION.

The problem here is that it is trying to write zeros to the whole stack, and when it reaches an address that it cannot write to, because it is held by some other process or something else, this incident occurs. (This was stated by a trustable resource, who analyzed the corresponding object PATHTCP2 with the ENOFT utility. I didn't even know that such a utility existed, let alone what it does!)

The same trustable resource stated that the error occurs at the following place in the code. Line number 77, he said, though I don't know how he could possibly know that without looking at the code.

USER^NUMERIC^INPUT^CONVERSION ( USERCODE, ERROR,INPUT, INPUT^LEN, INTERNAL, INTERNAL^SCALE );

*cut*

STRING .reply^string[0:9];

*cut*

?PAGE "add^zeros subprocedure"

*cut*

IF add^post^0 THEN
BEGIN
  FOR i := 0 TO (add^post^0 - 1) DO
  BEGIN
    reply^string[reply^len] := "0"; ! This is that line 77.
    reply^len := reply^len + 1;
  END;
END;

*cut*

I've *cut* various lines of my code to display only the important parts. I understand that an infinite loop must have been happening here, which causes it to write "0" to all addresses, and when line 77 tried to write to an invalid address, or an address used by some other process, it failed. I have a few questions here.

1) How would I find the six inputs passed to the function

USER^NUMERIC^INPUT^CONVERSION ( USERCODE, ERROR,INPUT, INPUT^LEN, INTERNAL, INTERNAL^SCALE );

2) ENOFT, how cool is that and what does it do? :D

3) Can TFDSCOM tell me more about the dump file than just the info it shows on DETAIL <incident number>? I also saw an option to take screenshots when the incident occurs. Would I be able to set that, and would it be useful? Can I extract any more information from TFDSCOM?

4) INSPECT / EINSPECT: is the only difference between these that EINSPECT is for code 800 files and INSPECT is for code 100? EINSPECT is a bit hard. I was not even able to load the dump file into it to look at the code and inspect it.

wbreidbach

Jan 5, 2015, 5:03:21 PM
If you really have a dump file produced by TFDS, you need the EGARTH tool to analyze the dump.
INSPECT/EINSPECT are used to analyze ZZSA-files (filecode 130); EGARTH is used to analyze CPU dumps. I am not aware of any official documentation for EGARTH. In addition you might need super.super to analyze a CPU-dump.

wbreidbach

Jan 5, 2015, 5:05:50 PM
On Monday, January 5, 2015 at 19:13:58 UTC+1, Shiva wrote:
It is very likely that the index is running out of bounds (reply^len > 9 in line 77).
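
To make that failure mode concrete, here is a minimal sketch in Python rather than pTAL. It is not the original code: it only models reply^string[0:9] as a 10-byte buffer and refuses to pad past the end. The names add_zeros and REPLY_CAPACITY are made up for the illustration.

```python
REPLY_CAPACITY = 10  # models STRING .reply^string[0:9]: indexes 0 through 9


def add_zeros(reply: bytearray, reply_len: int, add_post_0: int) -> int:
    """Append add_post_0 ASCII zeros to reply, starting at reply_len.

    Returns the new length. Raises ValueError instead of writing past
    the end of the buffer, which is what an unguarded loop would do.
    """
    if add_post_0 < 0:
        raise ValueError(f"negative pad count: {add_post_0}")
    if reply_len < 0 or reply_len + add_post_0 > len(reply):
        raise ValueError(
            f"padding {add_post_0} zeros at offset {reply_len} "
            f"overruns a {len(reply)}-byte buffer"
        )
    for _ in range(add_post_0):
        reply[reply_len] = ord("0")
        reply_len += 1
    return reply_len


reply = bytearray(b" " * REPLY_CAPACITY)
reply[0:3] = b"125"               # three digits already in the buffer
new_len = add_zeros(reply, 3, 2)  # pad two trailing zeros
print(reply[:new_len].decode())   # prints 12500
```

In the pTAL procedure the equivalent guard would presumably report the problem through the ERROR parameter instead of raising, but that is an assumption about code the thread does not show.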

Tone

Jan 5, 2015, 6:17:39 PM
To answer question 4 first, EINSPECT is very different to INSPECT. It
has a different syntax and is documented in the Native Inspect Manual.

You use the snapshot command to open a dump file, the bt command to show
the stack trace, the frame command to select a particular frame (like
the scope command in INSPECT) and the p command to print the contents of
variables.

So, for Q1 you could frame to the USER^NUMERIC^INPUT^CONVERSION
procedure and p USERCODE etc.

Q2. Yes, ENOFT is quite useful for looking at Native objects. Try :

ENOFT
file $system.system.pathtcp2
help (to see all the commands)
la (to list the attributes)
ls * (to list the sources)
lp * (to list the procs)
xrefproc USER^NUMERIC^INPUT^CONVERSION both (to see who calls the proc)

Q3. I don't think you can get more info on the dump file. Not sure what
you mean by screenshot. Perhaps you saw the references to SNAPSHOT and
assumed this referred to SCREENshots rather than the fact the saveabend
files from code 800 objects are often referred to as snapshot files.



Tone

Jan 5, 2015, 6:18:01 PM
Whilst TFDS can take CPU dumps (it calls RCVDUMP to accomplish this),
application programs can also be instrumented to use TFDS (see the TFDS
Programmer's Guide). PATHTCP2 has been instrumented in this way.

Keith Dick

Jan 6, 2015, 12:19:06 AM
Tone has given good answers to most of your questions, but I will see whether I can add anything helpful.

eInspect is essentially gdb, the debugger from the Linux/Open Source world, in case you have had any contact with that debugger.

The error happened inside a user conversion procedure that the Pathway TCP uses when a field's attributes call for a user-written conversion procedure to be applied. You can read about Pathway user conversion procedures in the manual Pathway/iTS TCP and Terminal Programming Guide, chapter 4.

The output from USER^NUMERIC^INPUT^CONVERSION is a 64-bit integer (type FIXED), scaled according to the scale factor passed as the last argument, not a character string, so my guess is that the error happened when the code was trying to put the input string into a canonical form which it would then convert to a 64-bit integer, perhaps using $ASCIITOFIXED. The code apparently was fed some edge case that it does not handle correctly, leading it to compute an unreasonably large value for add^post^0. (I'm assuming this particular conversion is typically used a lot, and works properly for most inputs. Keep in mind that USER^NUMERIC^INPUT^CONVERSION potentially performs many different conversions, governed by the value of the first argument, USERCODE.)
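
As a rough illustration of that guess, here is a Python sketch of turning a decimal string into a scaled 64-bit integer. It is not the code from the dump. It assumes that the number of zeros to append is the scale minus the number of fractional digits, which is only my guess at what add^post^0 represents, and the names to_scaled_int, FIXED_MIN and FIXED_MAX are invented for the example.

```python
FIXED_MIN, FIXED_MAX = -2**63, 2**63 - 1  # range of a 64-bit signed integer


def to_scaled_int(text: str, scale: int) -> int:
    """Convert a decimal string to an integer scaled by 10**scale."""
    text = text.strip()
    negative = text.startswith("-")
    if text.startswith(("+", "-")):
        text = text[1:]
    whole, _, frac = text.partition(".")
    if not (whole + frac).isdigit():
        raise ValueError("not a decimal number")
    # Assumed analogue of add^post^0: zeros needed to fill out the scale.
    pad = scale - len(frac)
    if pad < 0:
        raise ValueError("more fractional digits than the scale allows")
    value = int(whole + frac + "0" * pad)
    if negative:
        value = -value
    if not FIXED_MIN <= value <= FIXED_MAX:
        raise ValueError("value does not fit in 64 bits")
    return value


print(to_scaled_int("12.5", 2))   # prints 1250
print(to_scaled_int("-3.25", 2))  # prints -325
```

The point of the sketch is the two checks around pad and the range: an input that slips past checks like these is the kind of edge case that could leave a pad count far larger than the buffer it is used with.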

When you load the zzsannnn file into eInspect and use the bt command, you probably will see a lot of lines that say "unknown function". Since the address error apparently happened in USER^NUMERIC^INPUT^CONVERSION, don't let all those unknown function lines concern you. Just find the entry in the stack trace for USER^NUMERIC^INPUT^CONVERSION and use the frame command with the number on the line for USER^NUMERIC^INPUT^CONVERSION to select that stack frame, as Tone said.

This assumes that the debugging information for the user conversion code was generated when the user conversion procedures were compiled and was not discarded along the way to putting them into the Pathway TCP. If the stack trace from the bt command does not show any line for USER^NUMERIC^INPUT^CONVERSION, you've got a harder problem. There *are* ways to examine a dump from an executable that does not contain the debugging information, or when the executable is not accessible from the userid or system where you are trying to examine the dump. If you are in that situation, say so and we can tell you your options.

Your trustable resource probably looked at the zzsannnn file using eInspect. That's probably how he could tell you the line on which the error happened. When you look at the zzsannnn file, if you cannot see USER^NUMERIC^INPUT^CONVERSION in the stack trace or cannot get eInspect to show you the source lines or eInspect does not know the names of the program variables, that trusted resource probably can quickly tell you how to get all the files that eInspect will need in order to properly examine the dump.

eInspect finds the name of the executable file from the zzsannnn file and finds the name of the source files from debugging information in the executable file (assuming debugging information was included in the object file). If you move the zzsannnn file to a different NonStop system, the name of the executable file that eInspect finds in the dump file might be the name of a file that exists on the system on which you are examining the dump, but is not the same executable file that the zzsannnn file was produced by. If that is the case, you may get error messages or very wrong or confusing results. Be alert to that possibility if you are not examining the zzsannnn file on the system on which it was created.

I have only used eInspect with C and C++ code. It works pretty well with programs in those languages. Since eInspect came from the Linux/Open Source world, the support for pTAL would have been grafted on by HP programmers, and I have no idea how good of a job they did with that. I see that the Native Inspect manual says that you must use C syntax when writing references to pTAL variables. Look in the Native Inspect manual, "set language" command for some discussion of this (and follow the reference to page 65).

If there is a way to find out which screen field the user conversion procedure was working on when it got the error, or even the screen name, I don't remember what it is. I have a feeling that there is no way to tell.

That's all I can think of to add at the moment.

Shiva

Jan 6, 2015, 3:18:15 PM
@Wolfgang: EGARTH, I'll check if that tool is available in my system - but if it is not, don't think there's much that I can do about it. So ZZSA and CPU dumps are different, that's news to me. So I can't analyse this using EINSPECT as I originally thought that I could.

And yes, I was almost sure that the index was running out of bounds.

@Tone: Precise and to the point, thank you! Of course I was able to see that EINSPECT had a different syntax compared to INSPECT, but I thought functionality wise generally EINSPECT would be used to work on save abend files from 800 type object codes, which is now clear that it is.

So in EINSPECT - I'll do the snapshot command to open the dump file and the frame command to select the particular frame, in this case the USER^NUMERIC^INPUT^CONVERSION procedure, and then I'll look up the values for user code and length - which I was concerned about. That should tell me how my code would work for those inputs.

That should really narrow down my root cause analysis. But also I'd like to know whether I would be able to go down to that particular iteration of the code and see if I could check the values of those local variables using the p command? Don't think that's possible?

But really thanks for breaking it down for me. You've done it beautifully well. Thanks again.

And I hope to look at ENOFT after I'm done with all the other manuals that I've lined up. Thanks a lot for that gist though!

And yes, I misunderstood snapshot to the screenshot. Too bad. Thanks for pointing out that snapshots are indeed saveabend dump files of 800 object codes. That makes a lot of sense. Thanks again.

So when wolfgang mentioned that TFDS produced a CPU dump he thought that it was related to CPU something, but I think it falls down to this category that Tone has mentioned:

Application programs can also be instrumented to use TFDS (see the TFDS Programmer's Guide). PATHTCP2 has been instrumented in this way.

Again makes a lot of sense. And I'll look at that guide too, thanks. PATHTCP2, ha. That's where it all started, yes. And talking about 'instrumented to use TFDS', is there a way to detach PATHTCP2 from writing to/using TFDS?

And Keith, yet again you amaze me with your analysis. You're 100% right. Pathway/iTS TCP and Terminal Programming Guide, chapter 4 - I've read through that one already - long back, but I'll have another look. This time I might just understand more!

(I'm assuming this particular conversion is typically used a lot, and works properly for most inputs. )

Again, right. That's what we assumed as well. And hence the analysis to find the USERCODE input through the dump file.

use the frame command with the number on the line for USER^NUMERIC^INPUT^CONVERSION to select that stack frame, as Tone said.

So I need the BT command to get me the line number of USER^NUMERIC^INPUT^CONVERSION and then I'd use that number with frame to get me to that position and then p command the variables. Got it.

This assumes that the debugging information for the user conversion code was generated when the user conversion procedures were compiled and were not discarded along the way to putting them into the Pathway TCP.

I'd not know what you mean by that. Do you mean compiling with symbols? If so, yes. And let me check through BT and tell you whether I can see that frame.

eInspect finds the name of the executable file from the zzsannnn file and finds the name of the source files from debugging information in the executable file (assuming debugging information was included in the object file). If you move the zzsannnn file to a different NonStop system, the name of the executable file that eInspect finds in the dump file might be the name of a file that exists on the system on which you are examining the dump, but is not the same executable file that the zzsannnn file was produced by. If that is the case, you may get error messages or very wrong or confusing results. Be alert to that possibility if you are not examining the zzsannnn file on the system on which it was created.

By executable file, you mean the object file which created the dump, right? And say if the object was compiled in a different system and brought here to just run along the pathway - the object file would have the name of the source file as it was in the system where it was compiled, I'd hope so. I understand the possibility that you're trying to say. But if I move the ZZSAnnnn file to a different NonStop system, the name of the executable file (again thinking of it as the object file which created the dump in the first place) would be having the system name in it as well? which would be different from the system that I'm inspecting it on - and in that case it would just give me a security violation if access to that server is prohibited using the user ID that I'm working on, in this other server. Or did I get it wrong?

I'll definitely do a read of the Native Inspect Manual and onto the reference to page 65, thanks.

And about, - If there is a way to find out which screen field the user conversion procedure was working on when it got the error, or even the screen name, I don't remember what it is. I have a feeling that there is no way to tell.

I just fup copied the dump file onto the screen and it gave me a source code. A requester, to be exact. Now I'm just going to wild guess that it is where the abend/failure happened. But who's to blame? ;)

Thanks for your valuable inputs again, Keith. :)

Tone

Jan 6, 2015, 6:18:43 PM
There will be a copy of EGARTH on your system. TFDS uses it after it
has used RCVDUMP to dump a halted CPU, because it needs to analyze the
CPU dump to determine whether this is a new or recurring halt.

Within EINSPECT you can look at locals within a frame by the "info
locals" command to see all of them or p <localvar> to see them
individually. Try "help info" for other options.

Note that if the code is compiled with OPTIMIZE > 0 then you may see
that some of the variables are optimized out by the compiler. This means
it decided it could do what was necessary with the variable in a
register rather than main memory, as register usage is faster than
accessing memory. Normally you would compile your programs with OPTIMIZE
0 whilst creating, testing and debugging.

One ENOFT command I neglected to mention is the dp <procname> command
which dumps out the native code for the procedure/function. It is not
usually necessary to go to this low a level but can be interesting to
look at. Manuals that help in understanding this are available from
Intel (Itanium Architecture Software Developer's Manuals).

PATHTCP2 is coded to use TFDS and there is no way to stop it from doing so.

And I agree with you that Keith's explanations are amazing in both
their quality and depth.

wbreidbach

Jan 7, 2015, 3:24:19 AM
The most important thing is to look at the filecode: 130 can be analyzed by (E)INSPECT; for other filecodes, have a look at the FUP manual. I suspect it might be a PATHTCP2 dump, which should be filecode 307. But I am not sure how to analyze such a file; I never had the need for that.
As stated in a previous comment I am pretty sure that the index is running out of bounds.

Keith Dick

Jan 7, 2015, 1:53:24 PM
Shiva wrote:
> So I need the BT command to get me the line number of USER^NUMERIC^INPUT^CONVERSION and then I'd use that number with frame to get me to that position and then p command the variables. Got it.

That is correct.

>
> This assumes that the debugging information for the user conversion code was generated when the user conversion procedures were compiled and were not discarded along the way to putting them into the Pathway TCP.
>
> I'd not know what you mean by that. Do you mean compiling with symbols? If so, yes. And let me check through BT and tell you whether I can see that frame.

Yes, compiling with symbols creates the debugging information I mentioned. However, there is an option in eld to make it strip away the debugging information, and some people might use that option when building the version of the executable file for production. If you don't do that, you'll have the debugging information.

>
> eInspect finds the name of the executable file from the zzsannnn file and finds the name of the source files from debugging information in the executable file (assuming debugging information was included in the object file). If you move the zzsannnn file to a different NonStop system, the name of the executable file that eInspect finds in the dump file might be the name of a file that exists on the system on which you are examining the dump, but is not the same executable file that the zzsannnn file was produced by. If that is the case, you may get error messages or very wrong or confusing results. Be alert to that possibility if you are not examining the zzsannnn file on the system on which it was created.
>
> By executable file, you mean the object file which created the dump, right? And say if the object was compiled in a different system and brought here to just run along the pathway - the object file would have the name of the source file as it was in the system where it was compiled, I'd hope so. I understand the possibility that you're trying to say. But if I move the ZZSAnnnn file to a different NonStop system, the name of the executable file (again thinking of it as the object file which created the dump in the first place) would be having the system name in it as well? which would be different from the system that I'm inspecting it on - and in that case it would just give me a security violation if access to that server is prohibited using the user ID that I'm working on, in this other server. Or did I get it wrong?

Yes, the executable file is the object file that was running, got the error, and whose data areas were dumped into the ZZSAnnnn file. I do not know whether the ZZSAnnnn file includes the system name in the name of the executable file. Since an executable file can be run only on the same system as it is stored in, the designers might have thought that they only needed to record the local form of the executable file name. I simply don't know which choice they made. The names of the source files that are recorded in the debugging information in the executable file definitely include the system name. You are correct that if eInspect has a network-form file name and tries to read the file, and you don't have authorization to access the system it is on or authorization to read the file, you will get an error message that tells you there was a security violation on the access.

If you get into a situation in which eInspect cannot find the debugging information or source files itself, or can find them but gets a security violation, the symbol-file command can be used to tell eInspect to get debugging information from a copy of the executable file that actually contains the debugging information. Similarly, the map-source-name command can tell eInspect a substitute name for a source file. If you use those commands, it is up to you to be sure the files you are telling eInspect to use truly correspond to the program represented in the dump file.
>
> I'll definitely do a read of the Native Inspect Manual and onto the reference to page 65, thanks.
>
> And about, - If there is a way to find out which screen field the user conversion procedure was working on when it got the error, or even the screen name, I don't remember what it is. I have a feeling that there is no way to tell.
>
> I just fup copied the dump file onto the screen and it gave me a source code. A requester, to be exact. Now I'm just going to wild guess that it is where the abend/failure happened. But who's to blame? ;)

I am a little surprised that the Screen COBOL source statements are in the data space of the PATHTCP program, since the Screen COBOL compiler compiles the Screen COBOL source into a byte code that PATHTCP interprets. Did you see procedure division statements? Maybe it keeps the screen descriptions more-or-less in source form and that is what you saw. I don't know how reliable that technique is for telling you which screen was being interpreted. It could have several screens in its memory from different points in executing the Screen COBOL program that got the error, plus screens for other Screen COBOL programs running other Pathway TERMs, so don't put too much weight on what you saw being directly connected with the error that happened. The information about exactly what line of what Screen COBOL program was being executed when the error happened certainly is in the dump somewhere, but you would have to have debugging information about the Pathway parts of PATHTCP and knowledge of what variables to look at to dig it out of the dump.

Shiva

Jan 7, 2015, 4:45:57 PM
@Tone: Thanks for that. I'll look it up.

@Wolfgang: True. But this file had a filecode of 130.

@Keith: Find my responses below:

> Yes, compiling with symbols creates the debugging information I mentioned. However, there is an option in eld to make it strip away the debugging information, and some people might use that option when building the version of the executable file for production. If you don't do that, you'll have the debugging information.

I know for a fact that it was compiled with symbols. But USER^NUMERIC^INPUT^CONVERSION was not to be found in the output of the BT command in EINSPECT. Maybe, like you said, the option to strip away debugging info was used in eld? But why would anyone do that? I'd have to check what that option is and look at the command that was used (I can check what command was used for eld through internal process monitoring).

And okie - about what I did with that snapshot file, I'll tell you. I took the snapshot file to a different system and tried EINSPECT there, using a different user obviously - where I got security violation stating that I was trying to read that corresponding object file that created the snapshot file.

So I went back to the system where the snapshot file was created and I ran EINSPECT there and I was able to do the BT trace now, and it had the unknown function which you said it would (thrice) and then there were some other (PROCS, I believe) names that I could not recognize or even find in the source code. These were some three to four more lines and that's it. Within some 10-15 lines all of BT output was over. I could not find the USER^NUMERIC^INPUT^CONVERSION in that.

So that means that debugging information is not available in the dump file and of course with the object too. So from the options you've given me, I should:

Try to compile the object with symbols and make sure that eld does not strip off the data this time, and then symbol-file command in EINSPECT to point to this object file. And then BT and then frame and then the p command.

Or if the first option doesn't work - I should just give map-source-name for this snapshot file to the corresponding source file? Highly unlikely that this is what you suggested - because more than one source files are epTAL compiled then ELD'ed to create the object. We know that, of course. But if in case this was it, then BT and then frame and then the p command.

I'll proceed with the above and get back with the results. Thanks for your help, guys. :)
Always the best.

> I just fup copied the dump file onto the screen and it gave me a source code. A requester, to be exact.

How wrong I was to state "source code". It gave me a name. The name of a requester source file. That's it. That was actually stored inside the requester itself, as a variable I believe. And that's how it was present in the dump file. So would it be possible that the abend occurred in this requester? There's no other file name in that dump file. We can only guess. We may never know for sure.

Keith Dick

Jan 7, 2015, 8:37:07 PM
Shiva wrote:
>>Yes, compiling with symbols creates the debugging information I mentioned. However, there is an option in eld to make it strip away the debugging information, and some people might use that option when building the version of the executable file for production. If you don't do that, you'll have the debugging information.
>
>
> I know for a fact that it was compiled with symbols. But USER^NUMERIC^INPUT^CONVERSION was not to be found in the output of the BT command in EINSPECT. Maybe, like you said, the option to strip away debugging info was used in eld? But why would anyone do that? I'd have to check what that option is and look at the command that was used (I can check what command was used for eld through internal process monitoring).

Sometimes, people strip the debugging information to conserve disk space, because the debugging information can be somewhat large. Sometimes they strip the debugging information to make it hard for someone to reverse-engineer their precious code. Sometimes they strip the debugging information because they don't know they are doing it, but are just using the eld commands that Joe down the hall told them they had to use, and those commands have worked so far.

>
> And okie - about what I did with that snapshot file, I'll tell you. I took the snapshot file to a different system and tried EINSPECT there, using a different user obviously - where I got security violation stating that I was trying to read that corresponding object file that created the snapshot file.
>
> So I went back to the system where the snapshot file was created and I ran EINSPECT there and I was able to do the BT trace now, and it had the unknown function which you said it would (thrice) and then there were some other (PROCS, I believe) names that I could not recognize or even find in the source code. These were some three to four more lines and that's it. Within some 10-15 lines all of BT output was over. I could not find the USER^NUMERIC^INPUT^CONVERSION in that.

Oh, oh. This sounds bad. If you know you compiled with symbols, and the bt command showed some procedure names, but not USER^NUMERIC^INPUT^CONVERSION, then I have a bad feeling that PATHTCP caught the exception, did some cleanup, and then called ABEND. That catch-the-exception-and-clean-up approach destroys the information you need to examine the state of the program at the time of the error. I can't be certain that is what happened, but I fear it has.
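
Here is a small Python analogy for why a dump taken from a handler no longer shows the procedure that failed. It says nothing about how PATHTCP is actually written; the function names conversion, tcp_main and stack_names are invented, and the "dump" is just a list of the function names on the live call stack.

```python
import traceback


def stack_names():
    """Return the function names on the live call stack, outermost first."""
    return [frame.name for frame in traceback.extract_stack()]


def conversion(snapshots):
    # A dump taken here, while the failing frame is still live.
    snapshots["at_fault"] = stack_names()
    reply = bytearray(10)
    reply[10] = ord("0")  # one past the end: raises IndexError


def tcp_main():
    snapshots = {}
    try:
        conversion(snapshots)
    except IndexError:
        # By the time the handler runs, the conversion frame is gone,
        # so a dump taken here cannot show it.
        snapshots["in_handler"] = stack_names()
    return snapshots


snaps = tcp_main()
print("conversion" in snaps["at_fault"])    # prints True
print("conversion" in snaps["in_handler"])  # prints False
```

If PATHTCP does something similar before it calls ABEND, that would match a bt listing with no USER^NUMERIC^INPUT^CONVERSION line in it.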

>
> So that means that debugging information is not available in the dump file and of course with the object too. So from the options you've given me, I should:

Debugging information, such as the procedure names, variable locations, etc., is never in the dump file. It is in the object file.

>
> Try to compile the object with symbols and make sure that eld does not strip off the data this time, and then symbol-file command in EINSPECT to point to this object file. And then BT and then frame and then the p command.

It can't hurt to try that, but I think the chance is small that USER^NUMERIC^INPUT^CONVERSION is in the call stack saved in that dump file.

>
> Or if the first option doesn't work - I should just give map-source-name for this snapshot file to the corresponding source file? Highly unlikely that this is what you suggested - because more than one source files are epTAL compiled then ELD'ed to create the object. We know that, of course. But if in case this was it, then BT and then frame and then the p command.

The procedure names and other information do not come from the source files. The source files are only used when you ask eInspect to list the source lines of the program. The debugging information is in the object file. So I think what you suggest here is pointless.

>
> I'll proceed with the above and get back with the results. Thanks for your help, guys. :)
> Always the best.

If my fears about the situation prove true, you aren't likely to get any debugging help from the dump.

One thing you could do is read the USER^NUMERIC^INPUT^CONVERSION code very carefully, keeping in mind that it seems that add^post^0 gets computed incorrectly in some situation.

Or, if you can make the problem happen in a test environment, you could put PATHTCP into eInspect, put a breakpoint on the entry to USER^NUMERIC^INPUT^CONVERSION, and run the application, counting the number of times you hit the breakpoint before PATHTCP dies -- just type the c command each time it stops at the breakpoint. Then do it again except don't type c the last time it stops at the breakpoint, but step through execution of the call of USER^NUMERIC^INPUT^CONVERSION that gets the error, checking what it does. Not easy, I know, but if you cannot figure out the problem from reading the code, that might be your only choice.

If you are not sure whether the code does the right thing in a certain case, you could create a small test program that calls USER^NUMERIC^INPUT^CONVERSION with arguments that test the case you are unsure about, compile and link with your USER^NUMERIC^INPUT^CONVERSION, and run that little test program in eInspect to see whether that case is handled correctly.

There is a possible cause that would be very hard to find. It is possible that the actual error is in another user conversion procedure, and that the error in that other procedure alters some data that later causes the bad input to USER^NUMERIC^INPUT^CONVERSION. I certainly hope this scenario is not what is happening in your situation. It is very hard to find and fix an error resulting from a delayed effect like that.

If I were you, I would check the EMS log and whatever information TFDS provides about errors like this one. There must have been some information in TFDS that your trustable resource used to tell you exactly where in the code the error was, because he certainly did not get it from the dump file. Maybe there was a little more information that he did not pass along to you, and that additional information might provide enough clues to lead you to the error. I don't know what additional information there might be.

Check to see whether the TCP is configured with DUMP ON, and if it is, look at the file(s) it has created and see whether any of them are from the time the TFDS event occurred. I don't know the format of the DUMP files, but if you find one (or two -- it dumps its backup process, too, if it is running as a process-pair), make a FUP DUP of them immediately so they won't get overwritten. Then we can inquire about what is in those files and see whether they can help track down the problem.
>
>
>>I just fup copied the dump file onto the screen and it gave me a source code. A requester, to be exact.
>
>
> How wrong I was to state "source code". It gave me a name. The name of a requester source file. That's it. That was actually stored inside the requester itself, as a variable I believe. And that's how it was present in the dump file. Now would that be possible now - that the abend occurred here in this requester? There's no other file name in that dump file. We can only guess. We may never know for sure.

That is much more plausible than what you said earlier. The data space of the TCP usually will contain the data for many Screen COBOL files. Unless you have variables that hold the name of the Screen COBOL source file in all or most of your Screen COBOL programs, seeing that one file's name in the dump does not tell us much. It probably does mean that that Screen COBOL program was executing sometime not long before the error happened, but it might have been executing for a different PATHWAY TERM than the one that got the error, so it might be giving very little useful information.

Tone

Jan 7, 2015, 10:17:30 PM
Re your attempt to look at the snapshot file on another system.

If you can, move the snapshot and PATHTCP2 and PATHTCPL object
files to the other system, then

einspect
snapshot zzsafile
symbol pathtcp2 pathtcpl (tells einspect to use these local objects)

You may also need other objects to get a complete stack trace.

info dll will show what other objects may be needed.

Shiva

Jan 8, 2015, 1:13:24 PM
> Sometimes, people strip the debugging information to conserve disk space, because the debugging information can be somewhat large. Sometimes they strip the debugging information to make it hard for someone to reverse-engineer their precious code. Sometimes they strip the debugging information because they don't know they are doing it, but are just using the eld commands that Joe down the hall told them they had to use, and those commands have worked so far.

Okie, about that - the following are the epTAL compile commands that were used while producing the PATHTCPL file on my system.

epTAL/in SLIB/ SLIBOBJ; symbols
epTAL /in <user_conversion_pTAL_code>/ <pTAL_obj>; symbols

Note that TLIB is somehow converted to TCPLIB (I don't know how), and then SLIB linked to the <user_conversion_pTAL_code> and ILIB (think that's how it should be? - I'm not 100% sure about this either).

And after that the following ELD command.

ELD / out $S.#abc/ TCPLIB <user_conversion_pTAL_code> SLIBOBJ -o PATHTCPL -ul &
-set highpin on -set highrequester on -set runnamed off -allow duplicate_process -set inspect on

I don't think any part of the above command strips off source code from the PATHTCPL object. That I say with no knowledge about what the above syntax for the command should mean. I just think that -o should mean object and -ul should mean user library and others just work the way it's written on the tin. At least that's what my understanding is.

> Oh, oh. This sounds bad. If you know you compiled with symbols, and the bt command showed some procedure names, but not USER^NUMERIC^INPUT^CONVERSION, then I have a bad feeling that PATHTCP caught the exception, did some cleanup, and then called ABEND. That catch-the-exception-and-clean-up approach destroys the information you need to examine the state of the program at the time of the error. I can't be certain that is what happened, but I fear it has.
> Debugging information, such as the procedure names, variable locations, etc., is never in the dump file. It is in the object file.

And hey, I did finally get some hope. Look at what I did:

After lots of stupid things I tried to get the dump to tell me what to do, I tried the following. I opened einspect and then the snapshot file, and then I gave symbol-file PATHTCP2 (the object which created the dump, of course). But still the BT command gave me nothing but unknown function. Like below:

#0 ...... TFDS_Capture_End
#1 ...... Create^dump
#2 ...... Dump^TCP
#3 ...... Z^Gasp
#4 ...... y^Process^Signal
#5 0xfffffffff02ecb80 Unknown function name.
#6 0x78002980 Unknown function name.
#7 0x78006140 Unknown function name.

Then I gave symbol-file PATHTCPL (the library, actually) and the BT command now gave me the call flow. (This I did before seeing Tone's reply - ah, it would've been a whole lot easier.) And I've given only the PATHTCPL; maybe I should've given PATHTCP2 also. It clearly states the following:

#0 ...... TFDS_Capture_End
#1 ...... Create^dump
#2 ...... Dump^TCP
#3 ...... Z^Gasp
#4 ...... y^Process^Signal
#5 0xfffffffff02ecb80 <SOMETHING COMES HERE>.
#6 0x78002980 USER^NUMERIC^INPUT^CONVERSION.ADD^ZEROS() AT
<SOURCE_FILE_NAME> : 77.
reply^string[reply^len] := "0"; ! This is that line 77.
#7 0x78006140 USER^CONVERSION^2.
#8 0x780022e0:0 USER^NUMERIC^INPUT^CONVERSION (USERCODE = 2, ERROR = 0X30303030, INPUT = 0X30303030 <address 0x30303030 out of bounds>, INPUT^LEN = 12336, INTERNAL = 0X30303030, INTERNAL^SCALE = 12336)

And that's how the trustable resource had told me where exactly the fault was. He also mentioned that 0xfffffffff02ecb80 in frame #5 comes from MCPDLL (don't know what that is!) which is usually a jacket procedure to handle exception caused by code in frame #6.

That made a lot of sense. But also raised more questions. Because I gave the following in EINSPECT after that.

einspect: p add^post^0
14336

einspect: info locals (this one I took from Tone ;) )
I = 688

einspect: p INPUT
$2 = (STRING *) 0x30303030 <address 0x30303030 out of bounds>

The whole point of my dump analysis was to find the 'input' value to the user conversion code and I see only an address that tells me nothing but the fact that it is out of bounds. I could not find what was the input. The code does not have any other variable that take the input (or even part of it) the only variable the input value was stored on, was in this. And it's not of any use. And one thing I know for sure.

The loop variable I should never be 688, nor should the input length be 12336, nor add^post^0 be 14336. Something somewhere has gone terribly wrong.

I'll tell you what this user conversion 2 is meant to do. It is meant to convert any kind of input into the following

2 to 2.00
198 to 198.00
2.1 to 2.10
9999 to 9999.00
.2 to .20

For any input such as the following it gives errors.
. - wrong value
99999 - limit exceeded
00000 - out of range

And various other exceptions.
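
The mapping above can be modelled in a few lines of Python. This is only a sketch of the described behaviour, not the pTAL code: the function name `convert_amount`, the four-integer-digit and two-decimal limits (taken from the 9(4)V99 picture mentioned in this thread), and the rule that an all-zero input is what triggers "out of range" are all assumptions.

```python
def convert_amount(text: str) -> str:
    """Hypothetical model of user conversion 2: normalise to two decimals."""
    whole, _, frac = text.partition(".")
    digits = whole + frac
    if not digits.isdigit():             # "", ".", letters, or a second "."
        raise ValueError("wrong value")
    if set(digits) == {"0"}:             # assumed reason "00000" is rejected
        raise ValueError("out of range")
    if len(whole) > 4 or len(frac) > 2:  # assumed limits from 9(4)V99
        raise ValueError("limit exceeded")
    return f"{whole}.{frac.ljust(2, '0')}"


for sample in ["2", "198", "2.1", "9999", ".2"]:
    print(sample, "->", convert_amount(sample))
```

Under these assumptions no valid field is longer than seven characters, which is why a length of 12336 cannot be a genuine input length.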

Now you see why the input length should never have been 12336. There's no way the input can be so large! You can't even input such a big value on the screen; you can't even input more than 9(4)V99. The COBOL people would understand that (six digits, two after the decimal point).

Maybe there was some case that was not handled properly. Without looking at the input value we can't guess very easily.

Maybe it's because I've given only the PATHTCPL, and maybe I should've given PATHTCP2 also - which is why the input value is out of bounds? But that's just for symbols. Do you guys have any idea for me? I need to know the input value. I think that would solve the mystery, then we can amend the code as required.

Tone

Jan 8, 2015, 3:44:13 PM
ERROR actually contains ASCII "0000", as does INPUT. INPUT^LEN = 12336
which is 0x3030 (ASCII "00"). So a lot of the variables on the stack
have been corrupted by ASCII zeros.
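
A quick way to check this kind of thing is to render each displayed value as bytes. The sketch below is illustrative only: the helper name `as_ascii` is made up, the widths (4 bytes for the pointer-sized values, 2 bytes for the INT values) and big-endian byte order are assumptions, and add^post^0 = 14336 is the value from the earlier `p` command.

```python
def as_ascii(value: int, width: int) -> str:
    """Render an integer as big-endian bytes; non-printables become \\xNN."""
    raw = value.to_bytes(width, "big")
    return "".join(chr(b) if 32 <= b < 127 else f"\\x{b:02x}" for b in raw)


values = [
    ("ERROR", 0x30303030, 4),
    ("INPUT^LEN", 12336, 2),
    ("add^post^0", 14336, 2),
]
for name, value, width in values:
    print(f"{name:<11} 0x{value:0{width * 2}x}  {as_ascii(value, width)}")
```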

Sometimes it is necessary to symbol in the following 2 objects :

$SYSTEM.SYSnn.MCPDLL (contains millicode procedures)
$SYSTEM.SYSnn.INITDLL (contains system procedures)


Keith Dick

Jan 8, 2015, 4:15:11 PM
I should have thought about PATHTCPL -- my mistake. But I'm glad you found it or noticed Tone's instructions about it.

Notice that the value shown for ERROR (0x30303030), INPUT (0x30303030), INPUT^LEN (12336 = 0x3030), INTERNAL (0x30303030), INTERNAL^SCALE (12336 = 0x3030)?

0x30 is the Ascii code for "0". It appears that the runaway loop has obliterated the evidence of what the arguments to the procedure originally were. I don't know anything about how the call stack for the Itanium native programs is formatted, so I don't know whether it is likely that you would be able to dig out some useful information from areas of memory that have not been destroyed by that loop.

Did the stack trace from bt end with frame 8, or did you just stop copying it there? Did you copy that stack trace by hand? I think I remember that you were not able to take any data in or out of your office. That was tedious if you did it by hand.

The value shown for I, if it can be believed, indicates that the loop has only obliterated about 688 bytes. That's surprisingly low, unless the code picked up a new value for the base address of reply^string when the loop overstored it (assuming the loop overstored it).

One thing you could do is look at memory using eInspect's x command. Work backwards through memory from some point that has been overstored until you find the beginning of the string of "0" characters. The data immediately before that should be part of the original input string that INPUT pointed to. The trouble with that approach is that I don't immediately know how to find the address of a part of the area that has already been overstored to work backwards from.

I'm also a little puzzled that USERCODE seems not to have been overstored even though all of the other arguments have been overstored.

The addresses before the procedure names in the stack trace look like code addresses, not data addresses, so using one of them to start probably is wrong.

Maybe if you asked eInspect to display the address of one of the value arguments. Maybe

print &INPUT^LEN
or
print /x &INPUT^LEN

If the second form works, you won't have to convert a decimal value to hex. You can use decimal addresses with the x command, but it's weird. I'm not sure &INPUT^LEN will get the address of the INPUT^LEN argument, but I think it will.

Then try backing off 20 or 30 from the address you get and try the command

x /20xw 0x80xxxxxx

where 0x80xxxxxx is the address you get by taking the value from the print command and subtracting 20 or 30. If this displays a bunch of words with value 0x30303030, we might be on to something. If so, keep decreasing the address until you get to the beginning of the string of Ascii zeros. That might be the beginning of the string the procedure was asked to convert. Or it might not.
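
The "keep decreasing the address" search can be sketched offline, for example on bytes typed in from the display. This is a minimal sketch, assuming the memory is available as a Python `bytes` object; the function name `start_of_fill`, the base addresses, and the sample data are hypothetical.

```python
def start_of_fill(memory: bytes, base: int, fill: int = 0x30) -> int:
    """Return the address where the run of `fill` bytes ending `memory` begins."""
    i = len(memory)
    # Walk backwards while the preceding byte is still the fill character.
    while i > 0 and memory[i - 1] == fill:
        i -= 1
    return base + i


# Made-up sample: two null bytes, a field "12345", then 20 ASCII zeros.
sample = b"\x00\x00" + b"12345" + b"0" * 20
print(hex(start_of_fill(sample, 0x1000)))
```

One caveat: if the original field itself ends in the digit 0, its trailing zeros are indistinguishable from the fill, so the address found this way is only a lower bound on where the overwrite started.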

If the stack trace went back beyond frame 8, look at the next couple of lines in the stack and see whether it gives you any argument information. Those will be procedures in Pathway, and the debugging information for the Pathway procedures might not be there (precious code and all that), but if it does show parameters, see whether any of the names look like it might be the pointer to the screen field value. If so, you might be able to see the original value of the input field. I think it is unlikely you'll be able to find that, but look and see. Maybe you'll be lucky.

I'm out of ideas right now. If something else occurs to me, I'll post again. You have a tough problem to unwind there. If you were a Pathway developer, it probably would be pretty easy to find the screen name, requester name, original value of the screen field, unless the runaway loop has overwritten it. But I imagine you won't be able to find anything except maybe some of the original input field just before the beginning of the string of Ascii zeros in memory. Maybe that will be enough of a clue to solve the puzzle.

Shiva

Jan 9, 2015, 4:24:29 PM
Tone, INPUT and ERROR had out of bounds values that I was able to see. But even INPUT^LEN = 12336 is 0x3030 (ASCII "00")? That makes things worse. I thought at least that would give me some clue.

And I also added the MCPDLL and INITDLL to symbol-file but to no effect. At least what I thought I should check, were not made any more useful by the introduction of those two files. May be I was checking in the wrong places. I don't really know. Sorry. But thanks for your tip.

Keith, yes. Like you said it does appear that the loop has cleared out any evidence of what the arguments to the procedure originally were.

> Did the stack trace from bt end with frame 8, or did you just stop copying it there?
The latter. I thought that was enough. Below are the remaining frames 9, 10, 11, 12, 13.

#9 ...... CONVERT^AND^CHECK (FLDLEN = 5)
#10 ...... Y^DO^ACCEPT ()
#11 ...... Z^INTERP ()
#12 ...... DISPATCHER ()
#13 ...... TCP ()

I went to all these frames and gave info locals which did not seem to provide any useful information. Maybe the #9 FLDLEN might have some significance. But I don't know about that, so I can't really comment on that.

> Did you copy that stack trace by hand?
Of course. What other option do I have.

> I think I remember that you were not able to take any data in or out of your office. That was tedious if you did it by hand.

Oh, you have no idea, mate. But if it counts for even the tiniest bit of learning, I'll be greatly satisfied - and so far I've learnt far more than I thought I would. Thanks much to you and Tone. :)

> The value shown for I, if it can be believed, indicates that the loop has only obliterated about 688 bytes. That's surprisingly low, unless the code picked up a new value for the base address of reply^string when the loop overstored it (assuming the loop overstored it).

Yes, 688 is the count for the loop. Why is it surprisingly low? Remember the condition for the loop is "FOR i := 0 TO (add^post^0 - 1) DO ". And my previous post states that the dump file has the value for add^post^0 is 14336. Though this much resembles the 12336 for INPUT^LEN I don't know what ASCII value that is. I hope I'm right in saying that 14336 and 12336 are decimals/integers? And for 12336 it is 00. But I could not even google at this hour what 14336 could mean. Sorry (2am and a pounding headache).

Back to your comment - the I value of 688 being surprisingly low? I'll tell you why. And this is going to be a bigger explanation.

(einspect 1,94): frame 6
#6 0x78002980: 0 in USER^NUMERIC^INPUT^CONVERSION.ADD^ZEROS() AT
<SOURCE_FILE_NAME> : 77

That's what frame 6 had for me. So I go into enoft.

enoft> file pathtcpl
enoft> da 0x78002980 (this dumpall idea I got from my trustable resource)
[2:001: 78. :1 USER^NUMERIC^INPUT^CONVERSION.ADD^ZEROS]
0X78002980: {0:08e01426000 M st1 [r20] = r19
1:080c2300440 M ld8 r17 = [r35]
2:000080000000 I nop.i 0x0;;
template: 0x09}.

Here the st1 instruction stores one byte from r19 to the memory address held in r20. So I went back to einspect.

(einspect 1,94): info register r19
r19: 0x30
(einspect 1,94): info register r20
r20: 0x70000000

Now my trustable resource who explained this, tells me that 0x70000000 is a code address and should not be written to, which is why it has failed. Now if you see, it was trying to write 0x30 in all other registers and it had been allowed and when it started trying to write to a code stack, it had failed because it is illegal to write to code address. I still don't know what's the difference and how they differentiate which is code address and which is data stack.

So as I is 688, you see that nearly 700 bytes of data had been overwritten to 0x30, and after that it had failed. Hence 688 is justifiable.

Here's a sample:

(einspect 1,94): an 0x6ffffd08 700
0x6ffffd08: ...... .0.. . ...... .0.. .
0x6ffffd18: ...... .0.. . ...... .8....
0x6ffffd28: ...... ...... ...... ......
0x6ffffd38: ...... ...... .2273. .0. 0.
0x6ffffd48: ./,.C. .o..E. .0000. .0000.
0x6ffffd58: .0000. .0000. .0000. .0000.
0x6ffffd68: .0000. .0000. .0000. .0000.
0x6ffffd78: .0000. .0000. .0000. .0000.
0x6ffffd88: .0000. .0000. .0000. .0000.
-----------x clipped a lot of zeroes x-----------
0x6ffffff8: .0000. .0000. Warning: cannot access memory at address 0x70000000.

That explains it all. But still doesn't give us enough information to understand why such a situation occurred in the first place.

Even that an command in einspect was suggested to me by that trustable resource. He definitely knew what he was doing. That command looks more like the x command that you've given Keith - but this works a bit differently. Trying to understand the difference but I think only einspect manual will teach me that.

> I'm also a little puzzled that USERCODE seems not to have been overstored even though all of the other arguments have been overstored.
Maybe USERCODE was not a variable in this level or frame; maybe it came from some other procedure or frame. Maybe that's why it is stored elsewhere, as I could not see it above. If I had given a p &USERCODE I'd have seen the address where it was stored, no? But if I would rather see the program itself I'd know - I think.

> The addresses before the procedure names in the stack trace look like code addresses, not data addresses, so using one of them to start probably is wrong.
Again, you're right. You're a genius, Keith.

> I'm not sure &INPUT^LEN will get the address of the INPUT^LEN argument, but I think it will.
It did, I think. That's what gave me the idea to suggest that p &USERCODE would say where USERCODE is stored.

So back to the einspect frame 8. I did a info locals to that and I got the following information.

decimal^pt^pos = 3
string^length = 5
my^string = "22730. 0/,"

The code has the following important lines which will help us understand the importance of my^string and the value above.

STRING .my^string[0:9];
.
.
.
my^string ':=' input [0] FOR input^len BYTES;

So the value "22730. 0/," is not indeed the input alone but the whole 0-9 places(bits?) covered because input^len in the above command was possibly larger than what my^string expects.

But string^length and decimal^pt^pos indicate that the input might have been 22730 and input length just 5, but that can't make decimal point position as 3. I understand that all these are related to code, and I don't expect you all to make any sense of it. I didn't read the whole code myself. I'm just making comments with the variable names that we see here, logically. About the code, I don't think that I can post the whole code here. For one, it's too large. For two, it's too secure to post and get away easy. I hope you understand.

> That might be the beginning of the string the procedure was asked to convert.
Well, looks like it. But I'm still not too sure why the problem occurred in the first place.

> If you were a Pathway developer, it probably would be pretty easy to find the screen name, requester name, original value of the screen field, unless the runaway loop has overwritten it.

What do you mean by a Pathway developer? I take care of the pathway, and work with it - if that's what you mean. To 'take care of pathway' is loosely put, but was intended that way. There's no job role of a pathway developer here in my workplace.

Also I've a few questions.
Why does the einspect have prompt like this - especially, what does the values in it mean?
(einspect 1,84):

And I told you I've two systems. For einspect: in one, help, the "an" command, etc. work; in the other, it's very limited. There, for the "an" command and the help command I get a reply saying no such command or something like that. Why? Both are the same einspect version: TNS/E einspect gdb debugger (T1237 - 18 Jul 2012 13:46). They both should've worked the same way. Unless they disabled some commands on one system? Hmm?

There were many dumps in my system of which I've taken five (one above already, repeated below for comparison). And the only three data I've been able to take out, I've listed. Let me know if it helps.

decimal^pt^pos = 4, string^length = 6, my^string = "680640 0/,"
decimal^pt^pos = 3, string^length = 5, my^string = "37840. 0/,"
decimal^pt^pos = 3, string^length = 5, my^string = "23870. 0/,"
decimal^pt^pos = 4, string^length = 6, my^string = "684240 0/,"
decimal^pt^pos = 3, string^length = 5, my^string = "22730. 0/,"

All the my^string have the last four characters as the same. A space followed by a zero then a forward slash and a comma. So that may mean that they are not really inputs. There's no coincidence in life.
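
The pattern in the five dumps can be checked mechanically. This is just a sketch over the values listed above; the tuple layout and the helper name `common_suffix` are my own.

```python
import os

# (decimal^pt^pos, string^length, my^string) as copied from the five dumps
dumps = [
    (4, 6, "680640 0/,"),
    (3, 5, "37840. 0/,"),
    (3, 5, "23870. 0/,"),
    (4, 6, "684240 0/,"),
    (3, 5, "22730. 0/,"),
]


def common_suffix(strings):
    """Longest suffix shared by every string (commonprefix on the reversals)."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]


print(repr(common_suffix([s for _, _, s in dumps])))    # shared tail bytes
print([s[:length] for _, length, s in dumps])           # first string^length bytes
print({length - pos for pos, length, _ in dumps})       # length minus point position
```

The shared tail is the same four bytes in every dump, which fits the idea that only the first string^length bytes of my^string are the field and the rest is leftover data. In all five dumps string^length minus decimal^pt^pos is 2.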

If the code is required, I can post parts of it as per the required variable that's needed to be analyzed. But anyway, I'm going to take a hard look at the code tomorrow. With these values as inputs, I think we can still crack a solution out of this.

Thanks all for your valuable inputs.





Keith Dick

Jan 10, 2015, 7:49:53 AM
Shiva wrote:
> Tone, INPUT and ERROR had out of bounds values that I was able to see. But even INPUT^LEN = 12336 is 0x3030 (ASCII "00")? That makes things worse. I thought at least that would give me some clue.
>
> And I also added the MCPDLL and INITDLL to symbol-file but to no effect. At least what I thought I should check, were not made any more useful by the introduction of those two files. May be I was checking in the wrong places. I don't really know. Sorry. But thanks for your tip.

MCPDLL and INITDLL contain numerous system library procedures, and when they are loaded, at least some of the stack frames that say the procedure name is unknown will be able to list the proper procedure name. This sometimes can be helpful to get an idea of what system procedures are involved when you have a stack trace with a lot of unknown procedure names. I wouldn't have expected it to help with your current problem.
>
> Keith, yes. Like you said it does appear that the loop has cleared out any evidence of what the arguments to the procedure originally were.
>
> Did the stack trace from bt end with frame 8, or did you just stop copying it there?
> The latter. I thought that was enough. Below are the remaining frames 9, 10, 11, 12, 13.
>
> #9 ...... CONVERT^AND^CHECK (FLDLEN = 5)
> #10 ...... Y^DO^ACCEPT ()
> #11 ...... Z^INTERP ()
> #12 ...... DISPATCHER ()
> #13 ...... TCP ()
>
> I went to all these frames and gave info locals which did not seem to provide any useful information. Maybe the #9 FLDLEN might have some significance. But I don't know about that, so I can't really comment on that.

That FLDLEN value of 5 probably is the length of the field that is to be converted. It also matches string^length, which makes it seem a little more likely that it is the field length. When you look at the code, you'll probably see it copies INPUT^LEN to string^length.

>
> Did you copy that stack trace by hand?
> Of course. What other option do I have.
>
> I think I remember that you were not able to take any data in or out of your office. That was tedious if you did it by hand.
>
> Oh, you have no idea, mate. But if it counts for even the tiniest bit of learning, I'll be greatly satisfied - and so far I've learnt far more than I thought I would. Thanks much to you and Tone. :)

I applaud your dedication!

>
> The value shown for I, if it can be believed, indicates that the loop has only obliterated about 688 bytes. That's surprisingly low, unless the code picked up a new value for the base address of reply^string when the loop overstored it (assuming the loop overstored it).
>
> Yes, 688 is the count for the loop. Why is it surprisingly low? Remember the condition for the loop is "FOR i := 0 TO (add^post^0 - 1) DO ". And my previous post states that the dump file has the value for add^post^0 is 14336. Though this much resembles the 12336 for INPUT^LEN I don't know what ASCII value that is. I hope I'm right in saying that 14336 and 12336 are decimals/integers? And for 12336 it is 00. But I could not even google at this hour what 14336 could mean. Sorry (2am and a pounding headache).

I said it was surprisingly low because many times a runaway loop starts storing at some data address in the program and runs for many thousands of bytes before it reaches the end of the memory area assigned to the process. Not just a few hundred bytes. Of course, if the variable the loop started storing into happened to be near the end of a memory area, it could run off the end of the assigned memory just a few hundred bytes beyond the variable it started storing into.

14336 decimal (yes, those values are displayed in decimal) is 0x3800, which is the Ascii character "8" followed by the null byte. I don't know whether it was set to that by overstoring with some ASCII data or is a coincidence that it is an Ascii digit and a null. On Windows, there is a calculator application that can switch between decimal and hex display, which is very handy for converting values between the two (when you switch the number base, it does not erase the value but just displays it in the new base), and for doing address arithmetic. Assuming you use a Windows computer at your office, you should have it. It defaults to an ordinary calculator interface, but in the View menu, you can switch to Programmer (that's for the Windows 7 version of calculator; it is Scientific for the Windows XP version).

>
> Back to your comment - the I value of 688 being surprisingly low? I'll tell you why. And this is going to be a bigger explanation.
>
> (einspect 1,94): frame 6
> #6 0x78002980: 0 in USER^NUMERIC^INPUT^CONVERSION.ADD^ZEROS() AT
> <SOURCE_FILE_NAME> : 77
>
> That's what frame 6 had for me. So I go into enoft.
>
> enoft> file pathtcpl
> enoft> da 0x78002980 (this dumpall idea I got from my trustable resource)
> [2:001: 78. :1 USER^NUMERIC^INPUT^CONVERSION.ADD^ZEROS]
> 0X78002980: {0:08e01426000 M st1 [r20] = r19
> 1:080c2300440 M ld8 r17 = [r35]
> 2:000080000000 I nop.i 0x0;;
> template: 0x09}.
>
> Here the st1 instruction stores one byte from r19 to the memory address held in r20. So I went back to einspect.
>
> (einspect 1,94): info register r19
> r19: 0x30
> (einspect 1,94): info register r20
> r20: 0x70000000
>
> Now my trustable resource who explained this, tells me that 0x70000000 is a code address and should not be written to, which is why it has failed. Now if you see, it was trying to write 0x30 in all other registers and it had been allowed and when it started trying to write to a code stack, it had failed because it is illegal to write to code address. I still don't know what's the difference and how they differentiate which is code address and which is data stack.

The areas of memory accessible to a program can be marked, in tables that the OS maintains and the hardware consults, as either read/write or read-only (or maybe it is called execute-only). The boundaries of the various segments almost always are round numbers in hex, so a transition from read/write to read-only at 0x70000000 makes sense. I don't know enough about how memory is laid out in a TNS/E native process to know what the area just before the code segment is, but the evidence here is that it holds the local stack frames of the call stack. I do remember reading that something grows from high addresses towards the low addresses, and that might have been the local call stack. I think the heap grows from the low address in that same segment toward high addresses. Or maybe the heap is at the high end and the call stack at the low end.
>
> So as I is 688, you see that nearly 700 bytes of data had been overwritten to 0x30, and after that it had failed. Hence 688 is justifiable.
>
> Here's a sample:
>
> (einspect 1,94): an 0x6ffffd08 700
> 0x6ffffd08: ...... .0.. . ...... .0.. .
> 0x6ffffd18: ...... .0.. . ...... .8....
> 0x6ffffd28: ...... ...... ...... ......
> 0x6ffffd38: ...... ...... .2273. .0. 0.
> 0x6ffffd48: ./,.C. .o..E. .0000. .0000.
> 0x6ffffd58: .0000. .0000. .0000. .0000.
> 0x6ffffd68: .0000. .0000. .0000. .0000.
> 0x6ffffd78: .0000. .0000. .0000. .0000.
> 0x6ffffd88: .0000. .0000. .0000. .0000.
> -----------x clipped a lot of zeroes x-----------
> 0x6ffffff8: .0000. .0000. Warning: cannot access memory at address 0x70000000.
>
> That explains it all. But still doesn't give us enough information to understand why such a situation occurred in the first place.

Well, it doesn't explain it all, but it does show some things. You did find the long string of zeros that the runaway loop stored. From the beginning of the string of zeros (at 0x6ffffd50) to 0x70000000 is exactly 688 bytes. I see your 22730 value 16 bytes before the beginning of that string of zeros, not immediately before it, as I expected it would be. So maybe the runaway loop started at (or even before) the beginning of reply^string, and the 22730 we see is INPUT. However, the five bytes following 22730 seem to match what your other display shows for my^string, so I think it is likely that my^string is at 0x6ffffd40, and reply^string is at 0x6ffffd50. I think you could confirm that by using

print /x &my^string
and
print /x &reply^string

Since ^ is a C operator, not allowed in names, you might have to use the eInspect command set language ptal to get it to accept those variable names, unless it recognizes you are in an epTAL program and set it automatically.
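
The address arithmetic here is easy to double-check, and the same numbers give a rough model of the runaway loop. The sketch below is illustrative only: the constant and function names are made up, and it assumes that reply^string starts at 0x6ffffd50 and that the loop stores one byte per iteration at successive addresses (the posted fragment does not show how reply^len advances).

```python
STACK_LIMIT = 0x70000000  # first address the `an` command could not access
FILL_START = 0x6FFFFD50   # where the run of ASCII zeros begins
MY_STRING = 0x6FFFFD40    # where "22730" appears in the display


def runaway_index(start: int, limit: int, count: int):
    """Index at which a byte-per-iteration store loop first reaches `limit`.

    Returns None if all `count` stores fit below `limit`.
    """
    for i in range(count):
        if start + i >= limit:
            return i
    return None


print(STACK_LIMIT - FILL_START)                         # length of the zero run
print(FILL_START - MY_STRING)                           # gap back to "22730"
print(runaway_index(FILL_START, STACK_LIMIT, 14336))    # add^post^0 = 14336
```

Under those assumptions the loop faults with the index equal to the number of bytes between the start of the zeros and 0x70000000, which is consistent with the I = 688 shown by info locals.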

At this point, I think we can be pretty sure we know three of the four inputs to USER^NUMERIC^INPUT^CONVERSION. We know USERCODE, INPUT, and INPUT^LEN. The only one we don't know is INTERNAL^SCALE.

>
> Even that an command in einspect was suggested to me by that trustable resource. He definitely knew what he was doing. That command looks more like the x command that you've given Keith - but this works a bit differently. Trying to understand the difference but I think only einspect manual will teach me that.

The an command only displays in Ascii. The x command has other ways to display as well as Ascii. I'm not sure any of the ways the x command displays Ascii matches exactly how an does the display, so it is good to know both.

>
> I'm also a little puzzled that USERCODE seems not to have been overstored even though all of the other arguments have been overstored.
> May be USERCODE was not a variable in this level or frame, may be it came from some other procedure or frame. May be that's why it is stored elsewhere as I could not see it above. If I had given a p &USERCODE I'd have seen the address where it was stored, no? But if I would rather see the program itself I'd know - I think.

USERCODE, INPUT^LEN, and INTERNAL^SCALE all are INT value arguments, so I would expect them to be handled basically the same way, and all three would be vulnerable to being overwritten the same way. It seems that is not true, which means there is something about the way value arguments are handled that I don't know.
>
> The addresses before the procedure names in the stack trace look like code addresses, not data addresses, so using one of them to start probably is wrong.
> Again, you're right. You're a genius, Keith.
>
> I'm not sure &INPUT^LEN will get the address of the INPUT^LEN argument, but I think it will.
> It did, I think. That's what gave me the idea to suggest that p &USERCODE would say where USERCODE is stored.

What value did print /x &INPUT^LEN give you? If you included that in your post, I am not seeing it. Is it a number between 0x6ffffd50 and 0x70000000? What does print /x &USERCODE show?
>
> So back to the einspect frame 8. I did a info locals to that and I got the following information.
>
> decimal^pt^pos = 3
> string^length = 5
> my^string = "22730. 0/,"
>
> The code has the following important lines which will help us understand the importance of my^string and the value above.
>
> STRING .my^string[0:9];
> .
> .
> .
> my^string ':=' input [0] FOR input^len BYTES;
>
> So the value "22730. 0/," is not indeed the input alone but the whole 0-9 places(bits?) covered because input^len in the above command was possibly larger than what my^string expects.

The space used by local variables is not cleared to zeros when entering the procedure, so their initial contents will be whatever the program previously had stored in that area from using it for the local variables of other procedures that had been called and already returned before calling the procedures currently in the call stack. So it is not at all surprising to see the bytes of my^string that had not been written to by the ':=' statement to have some "random" value in them. It often won't actually be random, but highly repeatable. It depends on what that memory had been used for previously in the history of the program's execution. Sometimes, as it seems to be from the displays from the other dump files, below, it is highly repeatable, though not easily predictable.
>
> But string^length and decimal^pt^pos indicate that the input might have been 22730 and input length just 5, but that can't make decimal point position as 3. I understand that all these are related to code, and I don't expect you all to make any sense of it. I didn't read the whole code myself. I'm just making comments with the variable names that we see here, logically. About the code, I don't think that I can post the whole code here. For one, it's too large. For two, it's too secure to post and get away easy. I hope you understand.

I understand that you might not be allowed to post the actual code. Many organizations are ridiculously protective of their precious source code, even though there usually is nothing very original about it, but you have to follow your organization's rules. But you should be able to tell us in general terms what it does. For instance, why is decimal^pt^pos of 3 not reasonable? Is it unreasonable that the desired output for this input might be 22730.000? Or maybe decimal^pt^pos is not supposed to be the number of decimal places in the output value, but something else -- maybe the number of decimal places in the input string? Is USERCODE of 2 always supposed to produce a value with 2 decimal positions? Maybe when it gets an input that has more than two decimal positions, that is what makes it go wild? If that were the case, I'd think it would have happened and have been corrected long before now, unless the error was introduced recently. Was any of the user conversion
code changed recently? If so, checking those changes carefully would be a good thing to do.

>
> That might be the beginning of the string the procedure was asked to convert.
> Well, looks like it. But not too sure still on why the problem occurred on the first place.

Actually, no, that isn't what I was suggesting. I thought the runaway loop might have been used to append a calculated number of zeros to the end of the input value, but since the apparent input value lies 16 bytes before the beginning of the string of zeros the runaway loop stored, it doesn't seem to be doing what I thought it might be doing. Maybe it was intended to store the zeroes into the reply string first, then overlay the first zeroes with digits from the input string. Or something still different from that. I don't know. Once you look at the code, what it was supposed to do might become clear. The name of the variable giving the upper limit for the loop, add^post^0, sort of implies that it is doing something about adding zeros after (post) something, but maybe the variable name is misleading.

By the way, the ultimate output from USER^INPUT^NUMERIC^CONVERSION is a 64-bit binary integer which is 10 raised to the INTERNAL^SCALE times the intended value. That is, it has an implied decimal point INTERNAL^SCALE positions from the right end of the value. 9(n)V9(m) in COBOL terms, where m is INTERNAL^SCALE. So reply^string must be only an internal result, since it is still holding ASCII digits. The point of the ADD^ZEROS subprocedure might be to add the appropriate number of zeros to get the input value to have the appropriately scaled value in ASCII prior to converting the ASCII digits to a binary integer.
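
As an illustration of the scaled-integer idea only, here is a small Python sketch. It is not the Pathway or pTAL routine; the function name to_scaled_int and its error handling are assumptions made for the example.

```python
def to_scaled_int(text: str, internal_scale: int) -> int:
    """Convert ASCII digits to an integer holding value * 10**internal_scale.

    Hypothetical helper: pads the fractional part with zeros so that the
    implied decimal point sits internal_scale digits from the right end.
    """
    whole, _, frac = text.partition(".")
    if len(frac) > internal_scale:
        raise ValueError("input has more decimal places than the scale allows")
    frac = frac.ljust(internal_scale, "0")   # append the missing zeros
    return int(whole + frac)


print(to_scaled_int("22730", 2))    # 2273000
print(to_scaled_int("227.3", 2))    # 22730
```

With a scale of 2, the input 22730 needs two zeros appended before the binary conversion, which is the kind of padding a routine named ADD^ZEROS would plausibly do.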

Have you looked into stack frame 7? Some of the input values from USER^INPUT^NUMERIC^CONVERSION might have been passed along to USER^CONVERSION^2, though the stack trace you showed does not indicate that it has any arguments. I'm actually a little puzzled about what USER^CONVERSION^2 is. ADD^ZEROS is a subproc of USER^INPUT^NUMERIC^CONVERSION, and so is only visible inside USER^INPUT^NUMERIC^CONVERSION. I think that means it can only be referenced from inside USER^INPUT^NUMERIC^CONVERSION or inside a subproc of USER^INPUT^NUMERIC^CONVERSION. The stack entry for USER^CONVERSION^2 makes it look like another procedure. Also its code address looks like it would be beyond the point in USER^INPUT^NUMERIC^CONVERSION where it was called from, and that would have to make it be a separate procedure, since subprocs have to be at the beginning of their enclosing procedure. Maybe ADD^ZEROS was passed as an argument that is a procedure to USER^CONVERSION^2. I don't remember whether subprocs are allowed to be passed as arguments that are procedures. I'd think not, since, at least in the TNS architecture, they would lose access to their enclosing procedure's local variables if called from another procedure's stack frame, but maybe pTAL did that differently. Anyway, look in your user conversion procedure source to see what USER^CONVERSION^2 is and whether its arguments or local variables hold any clues about what the arguments to USER^INPUT^NUMERIC^CONVERSION were, especially INTERNAL^SCALE. If you can learn what value INTERNAL^SCALE had, you could bench check the code for the exact argument values that caused the failure, or even create a small test program that calls USER^INPUT^NUMERIC^CONVERSION with those arguments and step through it in the debugger to see exactly what it does and where it goes wrong.

I'm now assuming that the local variables of procedures above USER^INPUT^NUMERIC^CONVERSION in the call stack, and maybe any local variables above reply^string in USER^INPUT^NUMERIC^CONVERSION, have not been overwritten by the runaway loop, so look at them in the dump file and see what additional information they give you about the case that caused the error.

>
> If you were a Pathway developer, it probably would be pretty easy to find the screen name, requester name, original value of the screen field, unless the runaway loop has overwritten it.
>
> What do you mean by a Pathway developer? I take care of the pathway, and work with it - if that's what you mean. To 'take care of pathway' is loosly put, but was intended that way. There's no job role of a pathway developer here in my workplace.

I meant a developer who works for HP doing Pathway product development. I thought that would be clear from context, but I now see that it was not. Sorry for not being more clear. An HP Pathway product developer would have access to a Pathway object file with all the debugging information and the source code so he or she could look into the areas of the dump file that you cannot view, and perhaps find more information about exactly what screen, screen field, and working-storage variable the conversion was working with. However, if the local variables of the stack below the user conversion procedure have been overwritten by the runaway loop, maybe an HP Pathway developer couldn't do much more than you can.

>
> Also I've a few questions.
> Why does the einspect have prompt like this - especially, what does the values in it mean?
> (einspect 1,84):

The 1,84 gives the CPU number and process number within that CPU of the process eInspect is currently looking at. eInspect can have control of several processes at the same time, and you can switch its attention among them as you like. So this is the way eInspect helps you keep track of which process you are looking at at the current moment. When looking at a dump file, it tells you the CPU and process number of the process whose state was recorded in the dump file. Only one dump file can be examined at a time, so this isn't quite as important when looking at a dump file, but it still might be helpful.

>
> And I told you I've two systems. For einspect - In one, the help and the "an" command etc works, in other very limited. For the "an" command and the help command I get a reply prompt saying no such command or something like that. Why? Both are same einspect versions. TNS/E einspect gdb debugger (T1237 - 18 Jull 2012 13:46) They both should've worked the same way. Unless, they disabled some commands in one system? Hmm?

I don't have a good answer. Maybe the eInspect on one of the systems has been limited in some way. Or maybe there was some error during its installation. Or maybe your userid on one system has more privileges than your userid on the other system does, and normal Guardian file security is interfering with the execution of some commands.

>
> There were many dumps in my system of which I've taken five (one above already, repeated below for comparison). And the only three data I've been able to take out, I've listed. Let me know if it helps.
>
> decimal^pt^pos = 4, string^length = 6, my^string = "680640 0/,"
> decimal^pt^pos = 3, string^length = 5, my^string = "37840. 0/,"
> decimal^pt^pos = 3, string^length = 5, my^string = "23870. 0/,"
> decimal^pt^pos = 4, string^length = 6, my^string = "684240 0/,"
> decimal^pt^pos = 3, string^length = 5, my^string = "22730. 0/,"

Useful information! Were the dates on those dumps all recent? If you have been getting complaints from end users about "the system always goes down when I try to do this", maybe you can learn from them which screen always causes the problem, and maybe figure out from that which field it is. That might give some more clues about what the code should have done in these cases when it clearly is not doing what was intended.

>
> All the my^string have the last four characters as the same. A space followed by a zero then a forward slash and a comma. So that may mean that they are not really inputs. There's no coincidence in life.

Yes, you are right. Those last few characters in my^string are not inputs to USER^INPUT^NUMERIC^CONVERSION. They are what is left over in those memory locations from previous use of those memory locations by procedures that had been called earlier and already returned.

>
> If the code is required, I can post parts of it as per the required variable that's needed to be analyzed. But anyway, I'm going to take a hard look at the code tomorrow. With these values as inputs, I think we can still crack a solution out of this.

We probably won't need to see much, if any, of the code. If there are particular statements that you are not sure what they do, you might have to post those statements, but for anything else, your description of what the code is doing, or intended to do, probably will be enough.

Shiva

Jan 13, 2015, 6:47:47 AM
> MCPDLL and INITDLL contain numerous system library procedures, and when they are loaded, it will make at least some of the stack frames that say the procedure name is unknown will be able to list the proper procedure name. This sometimes can be helpful to get an idea of what system procedures are involved when you have a stack trace with a lot of unknown procedure names. I wouldn't have expected it to help with your current problem.

Hmm, I assumed so. Thanks for that clarification.

>
> That FLDLEN value of 5 probably is the length of the field that is to be converted. It also matches string^length, which makes it seem a little more likely that it is the field length. When you look at the code, you'll probably see it copies INPUT^LEN to string^length.

Yes, it does.

>
> I applaud your dedication!
>

Ah, thank you. But that's just a selfish act, because I want to learn. You, on the other hand, reply with pages of answers to help me understand concepts that would otherwise take years of learning or proper training, or at least a lot of reading, all in a single (very large) post. The time you and people like you take every day to reply and help people like me, expecting nothing in return, now that's a selfless act. And that's something that needs higher recognition. :)

I don't ever know how I could thank you for all the help you've been providing.

> I said it was surprisingly low because many time a runaway loop starts storing at some data address in the program and runs for many thousands of bytes before it reaches the end of the memory area assigned to the process. Not just a few hundred bytes. Of course, if the variable the loop started storing into happened to be near the end of a memory area, it could run off the end of the assigned memory just a few hundred bytes beyond the variable it started storing into.
>
> 14336 decimal (yes, those values are displayed in decimal) is 0x3800, which is the Ascii character "8" followed by the null byte. I don't know whether it was set to that by overstoring with some ASCII data or is a coincidence that it is an Ascii digit and a null. On Windows, there is a calculator application that can switch between decimal and hex display, which is very handy for converting values between the two (when you switch the number base, it does not erase the value but just displays it in the new base), and for doing address arithmetic. Assuming you use a Windows computer at your office, you should have it. It defaults to an ordinary calculator interface, but in the View menu, you can switch to Programmer (that's for the Windows 7 version of calculator; it is Scientific for the Windows XP version).
>

So 14336 is just "80" - now that's confusing. Because the values for add^post^0 that's assigned within user^conversion^2 subproc are 0, 1, and 2. No other values are assigned to it.
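
To make the decoding of 14336 concrete, here is a short Python sketch. It assumes the value is a 16-bit INT stored big-endian, which is an assumption about the platform rather than something shown in the dump.

```python
value = 14336

# Assumption: a 2-byte INT with the most significant byte first.
raw = value.to_bytes(2, byteorder="big")

print(hex(value))            # 0x3800
print(raw)                   # b'8\x00'
print(raw[0] == ord("8"))    # True:  first byte is the ASCII digit "8" (0x38)
print(raw[1] == ord("0"))    # False: second byte is NUL (0x00), not "0" (0x30)
```

So under that assumption the two bytes are "8" followed by a null byte rather than the text "80".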

And yes, I know that. I mentioned that I was unable to make up my mind to even try to convert it because I was in a very sorry state when I wrote that post. Won't happen again, cool :)

>
> The areas of memory accessible to a program can be marked in tables that the OS maintains and the hardware consults to be either read/write, or read-only (or maybe it is called execute-only). The boundaries of the various segments almost always are round numbers in hex, so a transition from read/write to read-only at 0x70000000 makes sense. I don't know enough about how memory is laid out in a TNS/E native process to know what the area just before the code segment is, but the evidence here is that it holds the local stack frames of the call stack. I do remember reading that something grows from high addresses towards the low addresses, and that might have been the local call stack . I think the heap grows from the low address in that same segment toward high addresses. Or maybe the heap is at the high end and the call stack at the low end.
> >

Hmm, that makes me think that my previous notion of "some addresses are predefined to have code and others variable information" is wrong. Maybe it is decided at run time - it doesn't really matter at this point. Maybe my trustable resource suggested that because he found the value 0x70000000 in the bt stack trace for frame 8. Looks more likely and if so, he has a sharp eye!
At times experience gives you things that hardwork and dedication can't even think about! Ha, and deserved too.

> > So as I is 688, you see that nearly 700 bytes of data had been overwritten to 0x30, and after that it had failed. Hence 688 is justifiable.

Okie, I don't get from the below that it is 688 bytes that have been overwritten. How do you convert these addresses, and the values in them, to bytes? I tried my math but it's not working.

> >
> > Here's a sample:
> >
> > (einspect 1,94): an 0x6ffffd08 700
> > 0x6ffffd08: ...... .0.. . ...... .0.. .
> > 0x6ffffd18: ...... .0.. . ...... .8....
> > 0x6ffffd28: ...... ...... ...... ......
> > 0x6ffffd38: ...... ...... .2273. .0. 0.
> > 0x6ffffd48: ./,.C. .o..E. .0000. .0000.
> > 0x6ffffd58: .0000. .0000. .0000. .0000.
> > 0x6ffffd68: .0000. .0000. .0000. .0000.
> > 0x6ffffd78: .0000. .0000. .0000. .0000.
> > 0x6ffffd88: .0000. .0000. .0000. .0000.
> > -----------x clipped a lot of zeroes x-----------
> > 0x6ffffff8: .0000. .0000. Warning: cannot access memory at address 0x70000000.

If this last line says that, of the four separations here, the first one is 0x6ffffff8, the next one is 0x6ffffff9, and the next one, where the error occurs, is 0x70000000, then it matches too. But if you continue on, the fourth separation would have been 0x70000008 going by the series of addresses in the left column - and that doesn't agree with what I get by adding up address places for the values stored in the right-side columns. I know very well that I'm not using the right words or even explaining my understanding correctly, but I don't think I can try any better. That's as far as I understand.
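
One way to read the display, sketched in Python: the row addresses in the left column advance by 0x10 (16 bytes) and each full row shows four groups, which suggests each group covers 4 bytes. That grouping is inferred from the sample output, not taken from the eInspect manual, and the helper name is made up for the example.

```python
def group_addresses(row_address: int, groups: int = 4, group_size: int = 4):
    """Return the starting address of each group on one display row."""
    return [row_address + i * group_size for i in range(groups)]


# Last row of the display: two groups print, the third cannot be read.
print([hex(a) for a in group_addresses(0x6FFFFFF8)])
# ['0x6ffffff8', '0x6ffffffc', '0x70000000', '0x70000004']

# Row 0x6ffffd48: the zeros begin in the third group.
print(hex(group_addresses(0x6FFFFD48)[2]))   # 0x6ffffd50

# Row 0x6ffffd38: "2273" sits in the third group.
print(hex(group_addresses(0x6FFFFD38)[2]))   # 0x6ffffd40
```

Under that reading, the group after 0x6ffffff8 starts at 0x6ffffffc, and the first inaccessible byte is 0x70000000, which is consistent with the warning on the last row.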

> >
> > That explains it all. But still doesn't give us enough information to understand why such a situation occurred in the first place.
>
> Well, it doesn't explain it all, but it does show some things. You did find the long string of zeros that the runaway loop stored. From the beginning of the string of zeros (at 0x6ffffd50) to 0x70000000 is exactly 688 bytes. I see your 22730 value 16 bytes before the beginning of that string of zeros, not immediately before it, as I expected it would be. So maybe the runaway loop started at (or even before) the beginning of reply^string, and the 22730 we see is INPUT. However, the five bytes following 22730 seem to match what your other display shows for my^string, so I think it is likely that my^string is at 0x6ffffd40, and reply^string is at 0x6ffffd50. I think you could confirm that by using
>
> print /x &my^string
> and
> print /x &reply^string
>

Again 'exactly 688 bytes', I don't know how. And I'll try the above when I get back to office next monday. On a little break from work. Hence the delays in my replies :) But I think your deduction is correct.
Find the following which are the 'last' declaration statements from the code which makes me think that - in the above stack it is in the order as below.

STRING .my^string[0:9];
.my^ptr := @mystring,
.reply^string[0:9];

There's my^string, then for the pointer there's some junk which means the declaration above (just guessing), then starts writing junk into the reply^string.

> Since ^ is a C operator, not allowed in names, you might have to use the eInspect command "set language ptal" to get it to accept those variable names, unless it recognizes that you are in an epTAL program and sets the language automatically.

epTAL is recognized automatically. Beauty! :D
>
> At this point, I think we can be pretty sure we know three of the four inputs to USER^INPUT^NUMERIC^CONVERSION. We know USERCODE, INPUT, and INPUT^LEN. The only one we don't know is INTERNAL^SCALE.
>

True. I read those chapters that you asked me to read, and they weren't very explanatory about INTERNAL^SCALE. But your explanation below makes it clear.

> >
> > Even that an command in einspect was suggested to me by that trustable resource. He definitely knew what he was doing. That command looks more like the x command that you've given Keith - but this works a bit differently. Trying to understand the difference but I think only einspect manual will teach me that.
>
> The an command only displays in Ascii. The x command has other ways to display as well as Ascii. I'm not sure any of the ways the x command displays Ascii matches exactly how an does the display, so it is good to know both.
>
> >
> > I'm also a little puzzled that USERCODE seems not to have been overstored even though all of the other arguments have been overstored.
> > May be USERCODE was not a variable in this level or frame, may be it came from some other procedure or frame. May be that's why it is stored elsewhere as I could not see it above. If I had given a p &USERCODE I'd have seen the address where it was stored, no? But if I would rather see the program itself I'd know - I think.
>
> USERCODE, INPUT^LEN, and INTERNAL^SCALE all are INT value arguments, so I would expect them to be handled basically the same way, and all three would be vulnerable to being overwritten the same way. It seems that is not true, which means there is something about the way value arguments are handled that I don't know.

I'll maybe try to get the p &USERCODE etc. to see if they give us some more detail.
> >
> > The addresses before the procedure names in the stack trace look like code addresses, not data addresses, so using one of them to start probably is wrong.
> > Again, you're right. You're a genius, Keith.
> >
> > I'm not sure &INPUT^LEN will get the address of the INPUT^LEN argument, but I think it will.
> > It did, I think. That's what gave me the idea to suggest that p &USERCODE would say where USERCODE is stored.
>
> What value did print /x &INPUT^LEN give you? If you included that in your post, I am not seeing it. Is it a number between 0x6ffffd50 and 0x70000000? What does print /x &USERCODE show?

Not at office, Keith. Sorry - I'll try that first thing next monday. And what does /x mean in the print statement? Translated from ASCII to hex or some other conversion like that?
> >
> > So back to the einspect frame 8. I did a info locals to that and I got the following information.
> >
> > decimal^pt^pos = 3
> > string^length = 5
> > my^string = "22730. 0/,"
> >
> > The code has the following important lines which will help us understand the importance of my^string and the value above.
> >
> > STRING .my^string[0:9];
> > .
> > .
> > .
> > my^string ':=' input [0] FOR input^len BYTES;
> >
> > So the value "22730. 0/," is not indeed the input alone but the whole 0-9 places(bits?) covered because input^len in the above command was possibly larger than what my^string expects.
>
> The space used by local variables is not cleared to zeros when entering the procedure, so their initial contents will be whatever the program previously had stored in that area from using it for the local variables of other procedures that had been called and already returned before calling the procedures currently in the call stack. So it is not at all surprising to see the bytes of my^string that had not been written to by the ':=' statement to have some "random" value in them. It often won't actually be random, but highly repeatable. It depends on what that memory had been used for previously in the history of the program's execution. Sometimes, as it seems to be from the displays from the other dump files, below, it is highly repeatable, though not easily predictable.

Got you. Thanks again!

> >
> > But string^length and decimal^pt^pos indicate that the input might have been 22730 and input length just 5, but that can't make decimal point position as 3. I understand that all these are related to code, and I don't expect you all to make any sense of it. I didn't read the whole code myself. I'm just making comments with the variable names that we see here, logically. About the code, I don't think that I can post the whole code here. For one, it's too large. For two, it's too secure to post and get away easy. I hope you understand.
>
> I understand that you might not be allowed to post the actual code. Many organizations are ridiculously protective of their precious source code, even though there usually is nothing very original about it, but you have to follow your organization's rules. But you should be able to tell us in general terms what it does. For instance, why is decimal^pt^pos of 3 not reasonable? Is it unreasonable that the desired output for this input might be 22730.000? Or maybe decimal^pt^pos is not supposed to be the number of decimal places in the output value, but something else -- maybe the number of decimal places in the input string? Is USERCODE of 2 always supposed to produce a value with 2 decimal positions? Maybe when it gets an input that has more than two decimal positions, that is what makes it go wild? If that were the case, I'd think it would have happened and have been corrected long before now, unless the error was introduced recently. Was any of the user conversion
> code changed recently? If so, checking those changes carefully would be a good thing to do.
>

Ha, let me tell you a secret. The whole code was rewritten. Because of the following reasons.

Old 700 code. Had to be recompiled in new environment as 800 code. No source code (oh yes!) lost in time. So had to start from the scratch - with vague requirements! But it has worked so well all along. For most inputs, at least.

Is USERCODE of 2 always supposed to produce a value with 2 decimal positions?

Z(5)9.99 that's the input that it can handle. For others, as far as I see - the error catch is in place.

I'm guessing maybe the requester is calling the wrong user conversion procedure for that particular code. Of course, nobody changed any code in the requester, so maybe there's just one exception that the code is not handling. I'll post the logic of user^conversion^2 below.


> >
> > That might be the beginning of the string the procedure was asked to convert.
> > Well, looks like it. But not too sure still on why the problem occurred on the first place.
>
> Actually, no, that isn't what I was suggesting. I thought the runaway loop might have been used to append a calculated number of zeros to the end of the input value, but since the apparent input value lies 16 bytes before the beginning of the string of zeros the runaway loop stored, it doesn't seem to be doing what I thought it might be doing. Maybe it was intended to store the zeroes into the reply string first, then overlay the first zeroes with digits from the input string. Or something still different from that. I don't know. Once you look at the code, what it was supposed to do might become clear. The name of the variable giving the upper limit for the loop, add^post^0, sort of implies that it is doing something about adding zeros after (post) something, but maybe the variable name is misleading.

No, it does exactly that.
>
> By the way, the ultimate output from USER^INPUT^NUMERIC^CONVERSION is a 64-bit binary integer which is 10 raised to the INTERNAL^SCALE times the intended value. That is, it has an implied decimal point INTERNAL^SCALE positions from the right end of the value. 9(n)V9(m) in COBOL terms, where m is INTERNAL^SCALE. So reply^string must be only an internal result, since it is still holding ASCII digits. The point of the ADD^ZEROS subprocedure might be to add the appropriate number of zeros to get the input value to have the appropriately scaled value in ASCII prior to converting the ASCII digits to a binary integer.

True. Below is the flow of user^conv^2.
If the input has no decimal point and the input length is not > 6 then my^string[0] ':=' input[0] FOR input^len BYTES.

There's another scale^string sub procedure that uses 'INTERNAL^SCALE' as the scale factor to add the "INTERNAL^SCALE" number of zeroes to the my^string variable, at the end. And then the string^len is increased to accommodate the internal^scale length as well now.

After that $ASCIITOFIXED is done. And then add^zeros subproc is called!

This is where I'm confused. In add^zeros subproc, reply^length is initialized first! To zero. Then add^pre^0 is checked to add pre zeroes which is not the case mostly - not really sure why it is even there in the first place, and then the flag add^dec^pt is checked which is set for our input and so a dec pt is added at reply^string[reply^len] := "."; and the reply^len is incremented by one.

And as the add^post^0 is set to 2 in our case of input 22730, we go into the next condition where there is a FOR i :=0 TO (add^post^0 - 1) to reply^string[reply^len]:= "0" and again reply length is incremented by one.

Finally input[0] is set to reply^string[0] FOR reply^len BYTES and input^len is set to reply^len. And then return;

Which means that the reply^len which is initialized to 0 at the start would screw up the math for the conditions that come below. Won't it? But I could not prove that this would eventually write zeroes to all the addresses into an infinite loop. Because if this was the issue, the add^zeros sub proc should have caused errors for other user conversion procedures too. Just a bit confused with the logic of initializing the reply^len with zero.
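
The flow described above can be mimicked in a few lines of Python to see when it would run past the 10-byte reply^string. This is only a reconstruction from the prose description, not the actual pTAL; the function signature is invented for the sketch, and it assumes that the 14336 seen in the dump is the value add^post^0 held when the loop ran.

```python
REPLY_STRING_SIZE = 10   # STRING .reply^string[0:9]


def add_zeros(add_pre_0: int, add_dec_pt: bool, add_post_0: int):
    """Return (reply_len, number of bytes stored outside reply^string)."""
    reply = {}            # index -> character; a dict so overruns are recorded
    reply_len = 0         # initialized to zero, as in the description

    for _ in range(add_pre_0):        # leading zeros, usually none
        reply[reply_len] = "0"
        reply_len += 1

    if add_dec_pt:                    # decimal point
        reply[reply_len] = "."
        reply_len += 1

    for _ in range(add_post_0):       # FOR i := 0 TO (add^post^0 - 1)
        reply[reply_len] = "0"
        reply_len += 1

    out_of_bounds = sum(1 for index in reply if index >= REPLY_STRING_SIZE)
    return reply_len, out_of_bounds


print(add_zeros(0, True, 2))        # (3, 0)
print(add_zeros(0, True, 14336))    # (14337, 14327)
```

With add^post^0 equal to 2, starting reply^len at zero stays well inside the buffer, so the initialization alone does not explain the overrun. With a loop limit of 14336 the stores would go far beyond the 10 bytes, and on the real system the loop would be stopped by the memory fault long before finishing.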


>
> Have you looked into stack frame 7? Some of the input values from USER^INPUT^NUMERIC^CONVERSION might have been passed along to USER^CONVERSION^2, though the stack trace you showed does not indicate that it has any arguments. I'm actually a little puzzled about what USER^CONVERSION^2 is. ADD^ZEROS is a subproc of USER^INPUT^NUMERIC^CONVERSION, and so is only visible inside USER^INPUT^NUMERIC^CONVERSION. I think that means it can only be referenced from inside USER^INPUT^NUMERIC^CONVERSION or inside a subproc of USER^INPUT^NUMERIC^CONVERSION. The stack entry for USER^CONVERSION^2 makes it look like another procedure. Also its code address looks like it would be beyond the point in USER^INPUT^NUMERIC^CONVERSION where it was called from, and that would have to make it be a separate procedure, since subprocs have to be at the beginning of their enclosing procedure. Maybe ADD^ZEROS was passed as an argument that is a procedure to USER^CONVERSION^2. I don't remember whether subprocs are allowed to be passed as arguments that are procedures. I'd think not, since, at least in the TNS architecture, they would lose access to their enclosing procedure's local variables if called from another procedure's stack frame, but maybe pTAL did that differently. Anyway, look in your user conversion procedure source to see what USER^CONVERSION^2 is and whether its arguments or local variables hold any clues about what the arguments to USER^INPUT^NUMERIC^CONVERSION were, especially INTERNAL^SCALE. If you can learn what value INTERNAL^SCALE had, you could bench check the code for the exact argument values that caused the failure, or even create a small test program that calls USER^INPUT^NUMERIC^CONVERSION with those arguments and step through it in the debugger to see exactly what it does and where it goes wrong.
>
> I'm now assuming that the local variables of procedures above USER^INPUT^NUMERIC^CONVERSION in the call stack, and maybe any local variables above reply^string in USER^INPUT^NUMERIC^CONVERSION have not been overwritten by the runaway loop, so look at them in the dump file and see what additional information they give you about the case that caused the error.
>
I think I did and most of them had zeros. Only those which didn't - I posted here. I did this exercise as part of the BT trace that worked for me for the first time. I've posted the values which were not zero already. As I didn't mention under which frame I found them, I thought it was understandable - but not really so. A bad call from me. Sorry! :D

> >
> > If you were a Pathway developer, it probably would be pretty easy to find the screen name, requester name, original value of the screen field, unless the runaway loop has overwritten it.
> >
> > What do you mean by a Pathway developer? I take care of the pathway, and work with it - if that's what you mean. To 'take care of pathway' is loosely put, but was intended that way. There's no job role of a pathway developer here in my workplace.
>
> I meant a developer who works for HP doing Pathway product development. I thought that would be clear from context, but I now see that it was not. Sorry for not being more clear. An HP Pathway product developer would have access to a Pathway object file with all the debugging information and the source code so he or she could look into the areas of the dump file that you cannot view, and perhaps find more information about exactly what screen, screen field, and working-storage variable the conversion was working with. However, if the local variables of the stack below the user conversion procedure have been overwritten by the runaway loop, maybe an HP Pathway developer couldn't do much more than you can.

Oh I get it now. :)

>
> >
> > Also I've a few questions.
> > Why does the einspect have prompt like this - especially, what does the values in it mean?
> > (einspect 1,84):
>
> The 1,84 gives the CPU number and process number within that CPU of the process eInspect is currently looking at. eInspect can have control of several processes at the same time, and you can switch its attention among them as you like. So this is the way eInspect helps you keep track of which process you are looking at at the current moment. When looking at a dump file, it tells you the CPU and process number of the process whose state was recorded in the dump file. Only one dump file can be examined at a time, so this isn't quite as important when looking at a dump file, but it still might be helpful.

Hmm, understandable. I'll try the status <process-num> just in case. Though I'm sure they're not going to help me here :D
>
> >
> > And I told you I've two systems. For einspect - In one, the help and the "an" command etc works, in other very limited. For the "an" command and the help command I get a reply prompt saying no such command or something like that. Why? Both are same einspect versions. TNS/E einspect gdb debugger (T1237 - 18 Jull 2012 13:46) They both should've worked the same way. Unless, they disabled some commands in one system? Hmm?
>
> I don't have a good answer. Maybe the eInspect on one of the systems has been limited in some way. Or maybe there was some error during its installation. Or maybe your userid on one system has more privileges than your userid on the other system does, and normal Guardian file security is interfering with the execution of some commands.

Hmm I'll check with my admin team. May be they got an answer to this. Just wanted to check if you can limit a few commands from a utility in one particular system alone. Thought that was possible for any utility that HP provided though I don't know how.
>
> >
> > There were many dumps in my system of which I've taken five (one above already, repeated below for comparison). And the only three data I've been able to take out, I've listed. Let me know if it helps.
> >
> > decimal^pt^pos = 4, string^length = 6, my^string = "680640 0/,"
> > decimal^pt^pos = 3, string^length = 5, my^string = "37840. 0/,"
> > decimal^pt^pos = 3, string^length = 5, my^string = "23870. 0/,"
> > decimal^pt^pos = 4, string^length = 6, my^string = "684240 0/,"
> > decimal^pt^pos = 3, string^length = 5, my^string = "22730. 0/,"
>
> Useful information! Were the dates on those dumps all recent? If you have been getting complaints from end users about "the system always goes down when I try to do this", maybe you can learn from them which screen always causes the problem, and maybe figure out from that which field it is. That might give some more clues about what the code should have done in these cases when it clearly is not doing what was intended.
>
> >
> > All the my^string have the last four characters as the same. A space followed by a zero then a forward slash and a comma. So that may mean that they are not really inputs. There's no coincidence in life.
>
> Yes, you are right. Those last few characters in my^string are not inputs to USER^INPUT^NUMERIC^CONVERSION. They are what is left over in those memory locations from previous use of those memory locations by procedures that had been called earlier and already returned.
>
> >
> > If the code is required, I can post parts of it as per the required variable that's needed to be analyzed. But anyway, I'm going to take a hard look at the code tomorrow. With these values as inputs, I think we can still crack a solution out of this.
>
> We probably won't need to see much, if any, of the code. If there are particular statements that you are not sure what they do, you might have to post those statements, but for anything else, your description of what the code is doing, or intended to do, probably will be enough.
> >
> > Thanks all for your valuable inputs.
> >
> >
> >
> >
> >

The above user^conv^2 is not all of what I posted - but it is my deduction of how the flow should have gone and part of that trace. Because 1) the input was 22730 and 2) the line 77 mentioned in BT comes to here in this subproc. But the add^zeros is also called for two other conditions inside user^conv^2 such as the following:

If decimal point found check position of decimal pt.
dec pt pos 0 and anywhere else. All these cases are handled but complex. Everything tells me that add^zeros subproc is wrong because all the changes are done to my^string and reply^string is used in add^zeros subproc - I understand that you might not get any of these, because you don't have a way to look at the code. Any suggestion from this would be very helpful. I'll post the C problem tonight, Keith. Just didn't get enough time. :)

Shiva

Jan 13, 2015, 6:52:26 AM
> Useful information! Were the dates on those dumps all recent? If you have been getting complaints from end users about "the system always goes down when I try to do this", maybe you can learn from them which screen always causes the problem, and maybe figure out from that which field it is. That might give some more clues about what the code should have done in these cases when it clearly is not doing what was intended.

The user didn't report any errors. Possibly because it is backfiring only for some wrong inputs that the users are keying in and as they see the system is behaving differently they'd have rectified the errors with the input and the system would have worked normally. Still don't know the amount/kind of impact the users would be experiencing.

David Thompson

Jan 19, 2015, 1:51:44 AM
On Tue, 13 Jan 2015 03:49:59 -0800 (PST), Shiva
<subrama...@gmail.com> wrote:

Aside: overlong lines (semi)fixed. I understand googlegroups doesn't
help, but if you can try to break your text into lines of up to about
70 characters that would be helpful to many non-google readers.

> (Keith Dick:)
> > 14336 decimal (yes, those values are displayed in decimal) is
> > 0x3800, which is the Ascii character "8" followed by the null
> > byte. I don't know whether it was set to that by overstoring with
> > some ASCII data or is a coincidence that it is an Ascii digit and
> > a null. On Windows, there is a calculator application that can
> > switch between decimal and hex display, which is very handy for
> > converting values between the two (when you switch the number
> > base, it does not erase the value but just displays it in the new
> > base), and for doing address arithmetic. Assuming you use a
> > Windows computer at your office, you should have it. It defaults
> > to an ordinary calculator interface, but in the View menu, you can
> > switch to Programmer (that's for the Windows 7 version of
> > calculator; it is Scientific for the Windows XP version).
> >
>
> So 14336 is just "80" - now that's confusing. Because the values
> for add^post^0 that's assigned within user^conversion^2 subproc are
> 0, 1, and 2. No other values are assigned to it.
>
There are two kinds of zero. The *character* "0", i.e. the character
generated by the key near the right end of the top row of a usual
keyboard and that displays on a screen or printout as a slightly
squished round shape (oval), like all other displayable characters,
has a nonzero code. In ASCII (and ASCII supersets like 8859 and
Unicode) that code is hex 30 or decimal 48. (Many programmers have
burned into their brains x - 48 for the numeric value of a decimal
digit, or 48 + x as the digit for a number 0..9.)

The character with *code* zero, in other words a byte with all bits
off, is not displayable. In ASCII (and most if not all other codes) it
is a control character that does nothing, called "null", or often
"nul" or "NUL" because it was traditional to use no more than 3
characters for the name of a control character. (Compare to EOT for
End Of Transmission, CR for Carriage Return, SI for Shift In, etc.)
In some cases NUL or zero-byte can't even be in a string, most notably
in C (and things based on C) which uses NUL to terminate a string.
NonStop, and thus original TAL and thus to some extent the derived
TALs, uses a mixture: some operations like block copy use a pointer
and explicit length, and some operations like SCAN stop at a NUL.
This means putting a NUL into character data may work sometimes,
but for safety and reliability you should not do so.

Thus a word of dec 14336 hex 3800 has a high byte containing hex 38
dec 56 which is the character "8" and a low byte of 00 which as a char
is NUL. I believe Keith's point was, and mine certainly is, that NUL
is not normally used as a character, so this suggests as a possibility
that add^post^0 contained a (2-byte binary integer) 0, and the high
byte (only) was stored with the ASCII character "8". This could occur
either by running off the end of whatever occurs before it in memory,
or by storing through a pointer that is set to point to the wrong
place. Because numerous other variables in your program show evidence
of being overrun by a series of ASCII digit zeros, 0x30303030 for a
4-byte pointer or 12336 = 0x3030 for a 2-byte integer, a stray "8" in
the high byte of add^post^0 seems a particularly likely suspect.
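
The byte-level reading above can be checked with a few lines of Python. This is only an illustrative sketch, not part of the program under discussion; the helper names `split_word` and `describe` are made up for the example, and it assumes the high byte comes first, as described above.

```python
def split_word(value):
    """Return (high_byte, low_byte) of a 16-bit unsigned value."""
    return (value >> 8) & 0xFF, value & 0xFF


def describe(value):
    """Format a 16-bit value as decimal, hex, and its two bytes as characters."""
    high, low = split_word(value)
    return "%d = 0x%04X: high byte %r, low byte %r" % (
        value, value, chr(high), chr(low))


# 14336 is the value found in add^post^0; 12336 is the "00" pattern
# seen in other overwritten 2-byte variables.
print(describe(14336))
print(describe(12336))
```

The first line shows the character "8" followed by a NUL byte; the second shows two ASCII "0" characters, which is why 12336 is the signature of the runaway loop and 14336 is not.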

Keith Dick

Jan 19, 2015, 7:44:24 AM
Shiva wrote:
>>MCPDLL and INITDLL contain numerous system library procedures, and when they are loaded, at least some of the stack frames that say the procedure name is unknown will be able to list the proper procedure name. This sometimes can be helpful to get an idea of what system procedures are involved when you have a stack trace with a lot of unknown procedure names. I wouldn't have expected it to help with your current problem.
>
>
> Hmm, I assumed so. Thanks for that clarification.
>
>
>>That FLDLEN value of 5 probably is the length of the field that is to be converted. It also matches string^length, which makes it seem a little more likely that it is the field length. When you look at the code, you'll probably see it copies INPUT^LEN to string^length.
>
>
> Yes, it does.
>
>
>>I applaud your dedication!
>>
>
>
> Ah, thank you. But that's just a selfish act where I desire to learn. But you replying with pages of answers, trying to make me understand concepts which would take years of learning or proper training - even lots of pages of reading, all in a single post (a very large one, to justify) and the time you and people like you take every day to reply and help people like me while expecting nothing in return, now that's a selfless act. And that's something that needs higher recognition. :)
>
> I don't ever know how I could thank you for all the help you've been providing.
>
>
>>I said it was surprisingly low because many times a runaway loop starts storing at some data address in the program and runs for many thousands of bytes before it reaches the end of the memory area assigned to the process. Not just a few hundred bytes. Of course, if the variable the loop started storing into happened to be near the end of a memory area, it could run off the end of the assigned memory just a few hundred bytes beyond the variable it started storing into.
>>
>>14336 decimal (yes, those values are displayed in decimal) is 0x3800, which is the Ascii character "8" followed by the null byte. I don't know whether it was set to that by overstoring with some ASCII data or is a coincidence that it is an Ascii digit and a null. On Windows, there is a calculator application that can switch between decimal and hex display, which is very handy for converting values between the two (when you switch the number base, it does not erase the value but just displays it in the new base), and for doing address arithmetic. Assuming you use a Windows computer at your office, you should have it. It defaults to an ordinary calculator interface, but in the View menu, you can switch to Programmer (that's for the Windows 7 version of calculator; it is Scientific for the Windows XP version).
>>
>
>
> So 14336 is just "80" - now that's confusing. Because the values for add^post^0 that's assigned within user^conversion^2 subproc are 0, 1, and 2. No other values are assigned to it.
>
> And yes, I know that. I mentioned that I was unable to make up my mind to even try to convert it because I was in a very sorry state when I wrote that post. Won't happen again, cool :)

David explained the difference between "0" and the null byte. If the code stores only 0, 1, or 2 into add^post^0, then either there is a path through the code that does not store anything into add^post^0 and the loop upper bound is using whatever junk value was leftover in memory when the subproc was called, or something overstored the value in add^post^0. Since the runaway loop is storing "0", and neither byte of add^post^0 is "0", the runaway loop did not overstore it. Unless there is some other bad code in the subproc that could have overstored add^post^0, it seems likely that there is some path through the code that does not store a value into add^post^0 before it is used to control the loop.
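
To illustrate how a junk loop bound turns a two-byte store into a runaway, here is a toy Python model. It is only a sketch under assumptions: the 32-byte "memory", the offset 16 for reply^string, and the names `add_zeros` and `WriteFault` are invented for the example; only the loop shape and the value 14336 come from this thread.

```python
class WriteFault(Exception):
    """Raised when the model tries to store outside its writable memory."""


def add_zeros(memory, start, add_post_0):
    """Model of: FOR i := 0 TO (add^post^0 - 1) DO store "0" at the next byte."""
    for i in range(add_post_0):
        address = start + i
        if address >= len(memory):
            raise WriteFault(address)
        memory[address] = ord("0")


REPLY_STRING = 16  # assumed offset of a 10-byte reply^string in the toy memory

# Intended case: the bound is 2, so two bytes are stored and nothing else changes.
good = bytearray(b"." * 32)
add_zeros(good, REPLY_STRING, 2)
print(good.decode("ascii"))

# Junk bound as seen in the dump: the loop runs past reply^string,
# overwrites everything after it, and faults at the end of memory.
bad = bytearray(b"." * 32)
try:
    add_zeros(bad, REPLY_STRING, 14336)
except WriteFault as fault:
    print("fault at offset", fault.args[0])
print(bad.decode("ascii"))
```

In the model, as in the dump, the bytes below the starting point are untouched, which is why variables that sit before reply^string can still hold useful evidence.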
>
>
>>The areas of memory accessible to a program can be marked in tables that the OS maintains and the hardware consults to be either read/write, or read-only (or maybe it is called execute-only). The boundaries of the various segments almost always are round numbers in hex, so a transition from read/write to read-only at 0x70000000 makes sense. I don't know enough about how memory is laid out in a TNS/E native process to know what the area just before the code segment is, but the evidence here is that it holds the local stack frames of the call stack. I do remember reading that something grows from high addresses towards the low addresses, and that might have been the local call stack. I think the heap grows from the low address in that same segment toward high addresses. Or maybe the heap is at the high end and the call stack at the low end.
>>
>
> Hmm, that makes me think that my previous notion of "some addresses are predefined to have code and others variable information" is wrong. Maybe it is decided at run time - it doesn't really matter at this point. Maybe my trustable resource suggested that because he found the value 0x70000000 in the bt stack trace for frame 8. Looks more likely and if so, he has a sharp eye!
> At times experience gives you things that hardwork and dedication can't even think about! Ha, and deserved too.

Your notion is sort of accurate. It depends on what scale you take for "predefined". It is technically possible for the OS to change the flags that control a memory area while the program is running (and could provide a system call to allow a program to request such changes), but most OSes, including NSK, have conventions that they follow about the purposes assigned to various ranges of memory addresses, and those purposes don't get changed around dynamically. So I believe addresses in a NonStop program that are of the form 0x7------- are code and not writable. I don't know the details of what addresses are used for the code of DLLs. I do know that DLLs do not have to be put at the same address in every program that uses them. It might be that certain address ranges are used for data in some processes and are used for DLL code in other processes.
>
>
>>>So as I is 688, you see that nearly 700 bytes of data had been overwritten to 0x30, and after that it had failed. Hence 688 is justifiable.
>
>
> Okie, I don't get from the below that it is 688 bytes that have been over written. How do you convert this address details with values in them, to bytes. I tried my math but it's not working.
>
>
>>>Here's a sample:
>>>
>>>(einspect 1,94): an 0x6ffffd08 700
>>>0x6ffffd08: ...... .0.. . ...... .0.. .
>>>0x6ffffd18: ...... .0.. . ...... .8....
>>>0x6ffffd28: ...... ...... ...... ......
>>>0x6ffffd38: ...... ...... .2273. .0. 0.
>>>0x6ffffd48: ./,.C. .o..E. .0000. .0000.
>>>0x6ffffd58: .0000. .0000. .0000. .0000.
>>>0x6ffffd68: .0000. .0000. .0000. .0000.
>>>0x6ffffd78: .0000. .0000. .0000. .0000.
>>>0x6ffffd88: .0000. .0000. .0000. .0000.
>>>-----------x clipped a lot of zeroes x-----------
>>>0x6ffffff8: .0000. .0000. Warning: cannot access memory at address 0x70000000.
>
>
> If this last line says that of the four separations here first one is 0x6ffffff8 and the next one is 0x6ffffff9 and the next one where the error occurs is 0x70000000. And it matches too. But if you continue on, the fourth separation would have been 0x70000008 from the left corner series of address naming convention - but not so true with the addition of address places that we do for the values stored in the right side columns. I know very well that I'm not using the right words or even explaining my understanding correctly, but I don't think I can try any better. That's as far as I understand.

The numbers down the left are byte addresses in hex. Each group of four characters increases the address by 4. The first group of 0000 starts at 8 bytes past 0x6ffffd48, which is 0x6ffffd50 (do the addition in hex). The last line shows two groups of 0000, so the next address beyond is 8 bytes past 0x6ffffff8, which is 0x70000000. Subtracting 0x6ffffd50 from 0x70000000 gives 0x2b0, which is 688.
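
The same address arithmetic, written out as a small Python sketch for anyone who wants to check it without a hex calculator. The variable names are made up for the example; the addresses are the ones from the dump display quoted below.

```python
# Each display line covers 16 bytes in four groups of 4 bytes.
first_zero_line = 0x6ffffd48   # line on which the run of "0000" groups begins
last_line = 0x6ffffff8         # last line shown before the access warning

first_zero = first_zero_line + 8   # two 4-byte groups precede the first "0000"
end = last_line + 8                # two "0000" groups are shown on the last line

overwritten = end - first_zero

print(hex(first_zero))   # start of the overwritten area
print(hex(end))          # first address that could not be written
print(hex(overwritten), overwritten)
```

The last line prints the byte count in hex and decimal, matching the value of I recorded when the loop failed.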
The /x in the print command means to format the value of the variable in hex. The syntax of the eInspect commands is explained in the Native Inspect manual, but you have to jump around a bit to find the descriptions of various parts of a command, because the writer didn't want to repeat the explanation of things that were used in many commands, so he put the description of the common things before the description of the first command.
>
>>>So back to the einspect frame 8. I did a info locals to that and I got the following information.
>>>
>>>decimal^pt^pos = 3
>>>string^length = 5
>>>my^string = "22730. 0/,"
>>>
>>>The code has the following important lines which will help us understand the importance of my^string and the value above.
>>>
>>>STRING .my^string[0:9];
>>>.
>>>.
>>>.
>>>my^string ':=' input [0] FOR input^len BYTES;
>>>
>>>So the value "22730. 0/," is not indeed the input alone but the whole 0-9 places(bits?) covered because input^len in the above command was possibly larger than what my^string expects.
>>
>>The space used by local variables is not cleared to zeros when entering the procedure, so their initial contents will be whatever the program previously had stored in that area from using it for the local variables of other procedures that had been called and already returned before calling the procedures currently in the call stack. So it is not at all surprising to see the bytes of my^string that had not been written to by the ':=' statement to have some "random" value in them. It often won't actually be random, but highly repeatable. It depends on what that memory had been used for previously in the history of the program's execution. Sometimes, as it seems to be from the displays from the other dump files, below, it is highly repeatable, though not easily predictable.
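
A small Python model of that effect, offered as a sketch only. The helper name `copy_input` is invented, and the leftover bytes are a guess: the first five are unknown (shown as "?"), while the last five are taken from what the dumps have in common.

```python
def copy_input(buffer, data):
    """Model of: my^string ':=' input[0] FOR input^len BYTES.

    Only len(data) bytes are overwritten; the rest of the buffer keeps
    whatever was there before.
    """
    buffer[:len(data)] = data
    return buffer.decode("ascii")


# Hypothetical prior contents of the 10 bytes later used for my^string.
LEFTOVER = b"?????. 0/,"

print(copy_input(bytearray(LEFTOVER), b"22730"))    # 5-byte input
print(copy_input(bytearray(LEFTOVER), b"680640"))   # 6-byte input
```

With a 5-byte input the leftover "." at position 5 survives; with a 6-byte input it is overwritten. That matches the pattern across the five dumps, where the length-5 cases show a "." after the digits and the length-6 cases do not.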
>
>
> Got you. Thanks again!
>
>
>>>But string^length and decimal^pt^pos indicate that the input might have been 22730 and input length just 5, but that can't make decimal point position as 3. I understand that all these are related to code, and I don't expect you all to make any sense of it. I didn't read the whole code myself. I'm just making comments with the variable names that we see here, logically. About the code, I don't think that I can post the whole code here. For one, it's too large. For two, it's too secure to post and get away easy. I hope you understand.
>>
>>I understand that you might not be allowed to post the actual code. Many organizations are ridiculously protective of their precious source code, even though there usually is nothing very original about it, but you have to follow your organization's rules. But you should be able to tell us in general terms what it does. For instance, why is decimal^pt^pos of 3 not reasonable? Is it unreasonable that the desired output for this input might be 22730.000? Or maybe decimal^pt^pos is not supposed to be the number of decimal places in the output value, but something else -- maybe the number of decimal places in the input string? Is USERCODE of 2 always supposed to produce a value with 2 decimal positions? Maybe when it gets an input that has more than two decimal positions, that is what makes it go wild? If that were the case, I'd think it would have happened and have been corrected long before now, unless the error was introduced recently. Was any of the user conversion code changed recently? If so, checking those changes carefully would be a good thing to do.
>>
>
>
> Ha, let me tell you a secret. The whole code was rewritten. Because of the following reasons.
>
> Old 700 code. Had to be recompiled in new environment as 800 code. No source code (oh yes!) lost in time. So had to start from the scratch - with vague requirements! But it has worked so well all along. For most inputs, at least.

Ugh. No wonder you are having problems with it. Nobody could even find a listing of the source code? I have very little sympathy or respect for an organization that can lose any bit of the source code of their production code. I'd find another place to work as soon as possible.

I wonder how the rewritten code was tested. What I would do is write a test program that calls the user conversion procedures with a long series of test values and check that each call gives the correct output. Or what I guess the correct output to be, since you say that the requirements weren't even preserved. It is much easier to test and debug the user conversion procedures in such a test program. I hope whoever did the rewrite tested that way. If not, maybe that should be done now.
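
The shape of such a test program, sketched in Python rather than pTAL. Everything here is an assumption for illustration: `to_scaled_int` is a hypothetical stand-in for the real conversion procedure, and the expected values in `CASES` are my guesses at reasonable behavior, not the application's actual requirements.

```python
def to_scaled_int(text, scale):
    """Hypothetical stand-in: ASCII decimal string -> integer scaled by 10**scale."""
    whole, _, frac = text.strip().partition(".")
    if not (whole or frac) or not (whole + frac).isdigit() or len(frac) > scale:
        raise ValueError("bad input: %r" % text)
    return int((whole or "0") + frac.ljust(scale, "0"))


# (input string, scale, expected result or expected exception type)
CASES = [
    ("22730", 2, 2273000),
    ("227.30", 2, 22730),
    ("0.5", 2, 50),
    ("12.345", 2, ValueError),   # more decimal places than the scale allows
    ("abc", 2, ValueError),
]


def run_cases(convert, cases):
    """Call convert on every case and return the list of mismatches."""
    failures = []
    for text, scale, expected in cases:
        try:
            actual = convert(text, scale)
        except ValueError:
            actual = ValueError
        if actual != expected:
            failures.append((text, scale, expected, actual))
    return failures


print(run_cases(to_scaled_int, CASES))
```

An empty list means every case behaved as expected. The value of the table is that the failing inputs from the dumps can be added as new rows once the intended result for them is agreed.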

>
> Is USERCODE of 2 always supposed to produce a value with 2 decimal positions?
>
> Z(5)9.99 that's the input that it can handle. For others, as far as I see - the error catch is in place.
>
> I'm guessing may be requester is calling the wrong user conversion procedure for that particular code. Of course, nobody changed any code in requester, then may be there's just one exception that the code is not handling. I'll post the logic of the user^conversion^2 below.
>
>
>
>>>That might be the beginning of the string the procedure was asked to convert.
>>>Well, looks like it. But not too sure still on why the problem occurred on the first place.
>>
>>Actually, no, that isn't what I was suggesting. I thought the runaway loop might have been used to append a calculated number of zeros to the end of the input value, but since the apparent input value lies 16 bytes before the beginning of the string of zeros the runaway loop stored, it doesn't seem to be doing what I thought it might be doing. Maybe it was intended to store the zeroes into the reply string first, then overlay the first zeroes with digits from the input string. Or something still different from that. I don't know. Once you look at the code, what it was supposed to do might become clear. The name of the variable giving the upper limit for the loop, add^post^0, sort of implies that it is doing something about adding zeros after (post) something, but maybe the variable name is misleading.
>
>
> No, it does exactly that.

It can't be doing exactly that (adding zeros after something), since in this case, it appears to have stored zeros starting at the beginning of reply^string.

>
>>By the way, the ultimate output from USER^INPUT^NUMERIC^CONVERSION is a 64-bit binary integer which is 10 raised to the INTERNAL^SCALE times the intended value. That is, it has an implied decimal point INTERNAL^SCALE positions from the right end of the value. 9(n)V9(m) in COBOL terms, where m is INTERNAL^SCALE. So reply^string must be only an internal result, since it is still holding ASCII digits. The point of the ADD^ZEROS subprocedure might be to add the appropriate number of zeros to get the input value to have the appropriately scaled value in ASCII prior to converting the ASCII digits to a binary integer.
>
>
> True. Below is the flow of user^conv^2.
> If the input has no decimal point and the input length is not > 6 then my^string[0] ':=' input[0] FOR input^len BYTES.
>
> There's another scale^string sub procedure that uses 'INTERNAL^SCALE' as the scale factor to add the "INTERNAL^SCALE" number of zeroes to the my^string variable, at the end. And then the string^len is increased to accommodate the internal^scale length as well now.
>
> After that $ASCIITOFIXED is done. And then add^zeros subproc is called!
>
> This is where I'm confused. In add^zeros subproc, reply^length is initialized first! To zero. Then add^pre^0 is checked to add pre zeroes which is not the case mostly - not really sure why it is even there in the first place, and then the flag add^dec^pt is checked which is set for our input and so a dec pt is added at reply^string[reply^len] := "."; and the reply^len is incremented by one.
>
> And as the add^post^0 is set to 2 in our case of input 22730, we go into the next condition where there is a FOR i :=0 TO (add^post^0 - 1) to reply^string[reply^len]:= "0" and again reply length is incremented by one.
>
> Finally input[0] is set to reply^string[0] FOR reply^len BYTES and input^len is set to reply^len. And then return;
>
> Which means that the reply^len which is initialized to 0 at the start would screw up the math for the conditions that come below. Won't it? But I could not prove that this would eventually write zeroes to all the addresses into an infinite loop. Because if this was the issue, the add^zeros sub proc should have caused errors for other user conversion procedures too. Just a bit confused with the logic of initializing the reply^len with zero.

You are right to be confused. What you described sounds correct for input that does not contain a decimal point, up to the point of calling $ASCIITOFIXED. Once the output of $ASCIITOFIXED is stored into INTERNAL, the input conversion is done, unless there was an error detected by $ASCIITOFIXED, and the conversion routine should return at that point.

The code you describe following the $ASCIITOFIXED sounds like code that should be in USER^NUMERIC^OUTPUT^CONVERSION, or something it calls. But output conversion should not use $ASCIITOFIXED, so it sure seems like whoever coded that was very mixed up. One minor point shows how mixed up: INPUT^LEN is a value argument, so storing into it does nothing as far as returning a result to the caller. Sometimes code will store into a value argument if the value has to be adjusted in some circumstances before it is used, but there is no point to storing into a value argument just before returning.

INPUT, however, is a reference argument, so storing into it will affect memory in the caller, and the input conversion procedure should NOT modify its input string. There is a small chance that modifying INPUT could screw up Pathway, especially if the input conversion procedure stores more characters into INPUT than were passed in. How likely that is depends on internal details of the TCP that I don't know.
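
The difference between the two kinds of argument can be shown with a loose Python analogy. This is not pTAL semantics, only a sketch: a mutable `bytearray` plays the role of the reference argument INPUT, a plain integer plays the role of the value argument INPUT^LEN, and the function name `bad_conversion` is made up.

```python
def bad_conversion(input_buf, input_len):
    """Mimics the end of the procedure as described: copy the reply over
    INPUT and store the reply length into INPUT^LEN before returning."""
    reply = b".00"
    input_buf[:len(reply)] = reply   # changes the caller's buffer
    input_len = len(reply)           # changes only the local copy


caller_buf = bytearray(b"22730")
caller_len = 5

bad_conversion(caller_buf, caller_len)

print(caller_buf)   # the caller's data has been modified
print(caller_len)   # the caller's length is unchanged
```

The caller ends up with a changed buffer and an unchanged length, which is the combination warned about above: the store into the length is pointless, and the store into the buffer is the one with side effects.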
>
>
>
>>Have you looked into stack frame 7? Some of the input values from USER^INPUT^NUMERIC^CONVERSION might have been passed along to USER^CONVERSION^2, though the stack trace you showed does not indicate that it has any arguments. I'm actually a little puzzled about what USER^CONVERSION^2 is. ADD^ZEROS is a subproc of USER^INPUT^NUMERIC^CONVERSION, and so is only visible inside USER^INPUT^NUMERIC^CONVERSION. I think that means it can only be referenced from inside USER^INPUT^NUMERIC^CONVERSION or inside a subproc of USER^INPUT^NUMERIC^CONVERSION. The stack entry for USER^CONVERSION^2 makes it look like another procedure. Also its code address looks like it would be beyond the point in USER^INPUT^NUMERIC^CONVERSION where it was called from, and that would have to make it be a separate procedure, since subprocs have to be at the beginning of their enclosing procedure. Maybe ADD^ZEROS was passed as an argument that is a procedure to USER^CONVERSION^2. I don't remember whether subprocs are allowed to be passed as arguments that are procedures. I'd think not, since, at least in the TNS architecture, they would lose access to their enclosing procedure's local variables if called from another procedure's stack frame, but maybe pTAL did that differently. Anyway, look in your user conversion procedure source to see what USER^CONVERSION^2 is and whether its arguments or local variables hold any clues about what the arguments to USER^INPUT^NUMERIC^CONVERSION were, especially INTERNAL^SCALE. If you can learn what value INTERNAL^SCALE had, you could bench check the code for the exact argument values that caused the failure, or even create a small test program that calls USER^INPUT^NUMERIC^CONVERSION with those arguments and step through it in the debugger to see exactly what it does and where it goes wrong.
>>
>>I'm now assuming that the local variables of procedures above USER^INPUT^NUMERIC^CONVERSION in the call stack, and maybe any local variables above reply^string in USER^INPUT^NUMERIC^CONVERSION have not been overwritten by the runaway loop, so look at them in the dump file and see what additional information they give you about the case that caused the error.
>>
>
> I think I did and most of them had zeros. Only those which didn't - I posted here. I did this exercise as part of the BT trace that worked for me for the first time. I've posted the values which were not zero already. As I didn't mention under which frame I found them, I thought it was understandable - but not really so. A bad call from me. Sorry! :D
>
>
>>>If you were a Pathway developer, it probably would be pretty easy to find the screen name, requester name, original value of the screen field, unless the runaway loop has overwritten it.
>>>
>>>What do you mean by a Pathway developer? I take care of the pathway, and work with it - if that's what you mean. To 'take care of pathway' is loosely put, but was intended that way. There's no job role of a pathway developer here in my workplace.
>>
>>I meant a developer who works for HP doing Pathway product development. I thought that would be clear from context, but I now see that it was not. Sorry for not being more clear. An HP Pathway product developer would have access to a Pathway object file with all the debugging information and the source code so he or she could look into the areas of the dump file that you cannot view, and perhaps find more information about exactly what screen, screen field, and working-storage variable the conversion was working with. However, if the local variables of the stack below the user conversion procedure have been overwritten by the runaway loop, maybe an HP Pathway developer couldn't do much more than you can.
>
>
> Oh I get it now. :)
>
>
>>>Also I've a few questions.
>>>Why does the einspect have prompt like this - especially, what does the values in it mean?
>>>(einspect 1,84):
>>
>>The 1,84 gives the CPU number and process number within that CPU of the process eInspect is currently looking at. eInspect can have control of several processes at the same time, and you can switch its attention among them as you like. So this is the way eInspect helps you keep track of which process you are looking at at the current moment. When looking at a dump file, it tells you the CPU and process number of the process whose state was recorded in the dump file. Only one dump file can be examined at a time, so this isn't quite as important when looking at a dump file, but it still might be helpful.
>
>
> Hmm, understandable. I'll try the status <process-num> just in case. Though I'm sure they're not going to help me here :D

Since you are examining a snapshot file, process 1,84 at the time of the dump is long gone, so status 1,84 is going to tell you about whatever process is currently number 84 in cpu 1. In other words, nothing useful at all. If you were debugging a running process, using a status command from a TACL prompt in another window might sometimes be useful.

>
>>>And I told you I've two systems. For einspect - In one, the help and the "an" command etc works, in other very limited. For the "an" command and the help command I get a reply prompt saying no such command or something like that. Why? Both are same einspect versions. TNS/E einspect gdb debugger (T1237 - 18 Jull 2012 13:46) They both should've worked the same way. Unless, they disabled some commands in one system? Hmm?
>>
>>I don't have a good answer. Maybe the eInspect on one of the systems has been limited in some way. Or maybe there was some error during its installation. Or maybe your userid on one system has more privileges than your userid on the other system does, and normal Guardian file security is interfering with the execution of some commands.
>
>
> Hmm I'll check with my admin team. May be they got an answer to this. Just wanted to check if you can limit a few commands from a utility in one particular system alone. Thought that was possible for any utility that HP provided though I don't know how.

I believe that some third-party security packages provide some fine control over who can run individual commands inside various programs, but I am not at all familiar with those products. I don't know why anyone would use such controls to limit the "help" and "an" commands of eInspect.

>
>>>There were many dumps in my system of which I've taken five (one above already, repeated below for comparison). And the only three data I've been able to take out, I've listed. Let me know if it helps.
>>>
>>>decimal^pt^pos = 4, string^length = 6, my^string = "680640 0/,"
>>>decimal^pt^pos = 3, string^length = 5, my^string = "37840. 0/,"
>>>decimal^pt^pos = 3, string^length = 5, my^string = "23870. 0/,"
>>>decimal^pt^pos = 4, string^length = 6, my^string = "684240 0/,"
>>>decimal^pt^pos = 3, string^length = 5, my^string = "22730. 0/,"
>>
>>Useful information! Were the dates on those dumps all recent? If you have been getting complaints from end users about "the system always goes down when I try to do this", maybe you can learn from them which screen always causes the problem, and maybe figure out from that which field it is. That might give some more clues about what the code should have done in these cases when it clearly is not doing what was intended.
>>
>>
>>>All the my^string have the last four characters as the same. A space followed by a zero then a forward slash and a comma. So that may mean that they are not really inputs. There's no coincidence in life.
>>
>>Yes, you are right. Those last few characters in my^string are not inputs to USER^INPUT^NUMERIC^CONVERSION. They are what is left over in those memory locations from previous use of those memory locations by procedures that had been called earlier and already returned.
>>
>>
>>>If the code is required, I can post parts of it as per the required variable that's needed to be analyzed. But anyway, I'm going to take a hard look at the code tomorrow. With these values as inputs, I think we can still crack a solution out of this.
>>
>>We probably won't need to see much, if any, of the code. If there are particular statements that you are not sure what they do, you might have to post those statements, but for anything else, your description of what the code is doing, or intended to do, probably will be enough.
>>
>>>Thanks all for your valuable inputs.
>>>
>>>
>>>
>>>
>>>
>
>
> The above user^conv^2 is not all of what I posted - but it is my deduction of how the flow should have gone and part of that trace. Because 1) the input was 22730 and 2) the line 77 mentioned in BT comes to here in this subproc. But the add^zeros is also called for two other conditions inside user^conv^2 such as the following:
>
> If decimal point found check position of decimal pt.
> dec pt pos 0 and anywhere else. All these cases are handled but complex. Everything tells me that add^zeros subproc is wrong because all the changes are done to my^string and reply^string is used in add^zeros subproc - I understand that you might not get any of these, because you don't have a way to look at the code. Any suggestion from this would be very helpful. I'll post the C problem tonight, Keith. Just didn't get enough time. :)

I did not follow the above very well. The cases involved in input numeric conversion could be a bit complex, though a good programmer should, I think, be able to organize the code and comment it in a way to make it pretty easy to follow. As I mentioned above, the one part you described sure seems to me to indicate that the programmer did not understand what was to be done in that he or she seems to do output conversion things after finishing the input work.

Also, check on my suspicion that there is some path through the code that reaches the runaway loop without storing a value into add^post^0.

Shiva

Jan 26, 2015, 12:23:54 PM
Hi Keith,

I've been away, sorry for the delay in reply. Been ill, been busy, et al. Find my answers in line, as usual.

> Unless there is some other bad code in the subproc that could have overstored add^post^0, it seems likely that there is some path through the code that does not store a value into add^post^0 before it is used to control the loop.

You're a genius, Keith. There was. Actually I found that the other day after posting the reply above. And later when I thought to come back and post again about that - you had found that, even without looking at the code. You should be Sherlock Holmes Jr. USER^CONVERSION^2 has many paths of which one does not set any value to add^post^0 before calling that subproc. But why should that cause an infinite loop, I couldn't figure out.

How so ever the loop should not have gone on and on, no? It should have stopped at some point. Even if a garbage value is set to add^post^0.

> Nobody could even find a listing of the source code?

I've got a doubt. Now that you ask me, I went back and checked the 700 code in NOFT and it showed me all the proc names, would I be able to find the whole code from that? I'm sure that the 700 code was compiled with symbols.

> I hope whoever did the rewrite tested that way. If not, maybe that should be done now.

We did test it. We regression tested it with the old code and compared our test results as well. Seems like the test cases were not good enough.

> It can't be doing exactly that (adding zeros after something), since in this case, it appears to have stored zeros starting at the beginning of reply^string.

Well, it should be. But remember the below that I posted?

> True. Below is the flow of user^conv^2.
> If the input has no decimal point and the input length is not > 6 then my^string[0] ':=' input[0] FOR input^len BYTES.
>
> There's another scale^string sub procedure that uses 'INTERNAL^SCALE' as the scale factor to add the "INTERNAL^SCALE" number of zeroes to the my^string variable, at the end. And then the string^len is increased to accommodate the internal^scale length as well now.
>
> After that $ASCIITOFIXED is done. And then add^zeros subproc is called!
>
> This is where I'm confused. In add^zeros subproc, reply^length is initialized first! To zero. Then add^pre^0 is checked to add pre zeroes which is not the case mostly - not really sure why it is even there in the first place, and then the flag add^dec^pt is checked which is set for our input and so a dec pt is added at reply^string[reply^len] := "."; and the reply^len is incremented by one.
>
> And as the add^post^0 is set to 2 in our case of input 22730, we go into the next condition where there is a FOR i :=0 TO (add^post^0 - 1) to reply^string[reply^len]:= "0" and again reply length is incremented by one.
>
> Finally input[0] is set to reply^string[0] FOR reply^len BYTES and input^len is set to reply^len. And then return;
>
> Which means that the reply^len which is initialized to 0 at the start would screw up the math for the conditions that come below. Won't it? But I could not prove that this would eventually write zeroes to all the addresses into an infinite loop. Because if this was the issue, the add^zeros sub proc should have caused errors for other user conversion procedures too. Just a bit confused with the logic of initializing the reply^len with zero.

When they reinitialized reply^len to 0 they made a mistake because that would mean that the code would write zeros from the start of the variable.

So actually the problem within the code not only lies in "not initializing add^post^0 in one of the paths to the subproc" but also in initializing reply^len to 0 inside the subproc.

And the code does not write the output to USER^NUMERIC^OUTPUT^CONVERSION but instead writes to INPUT for input^len bytes itself within USER^NUMERIC^INPUT^CONVERSION in the end.

Weird, doesn't look like the standard process described in the pTAL guide.

> I believe that some third-party security packages provide some fine control over who can run individual commands inside various programs, but I am not at all familiar with those products. I don't know why anyone would use such controls to limit the "help" and "an" commands of eInspect.

einspect: Could not setup tcl environment:, line = 0.

This is what I get when I start einspect. May be that's got something to do with the limitations.

And please find the values below.

p /x &reply^string
0x6ffffd50

p /x &INPUT^LEN
0x6ffffd64

p /x &USERCODE
Address requested for identifier "USERCODE" which is in register $r32
info register r32
r32: 0x2

p /x &INPUT
0x6ffffd60

All your deductions seem pretty accurate.

Thanks again for all your explanations. Very detailed, as always.

About the C problem, I'll post it next week. A hectic week this one, for sure! Hope you are doing good.

David,

> Aside: overlong lines (semi)fixed. I understand googlegroups doesn't
> help, but if you can try to break your text into lines of up to about
> 70 characters that would be helpful to many non-google readers.

I understand. But breaking lines makes for bad visibility. Is google groups
website access restricted? But even in mobiles they come real nice, when
you use google groups. It's really nice, you should try it. :)

And thanks for your detailed explanations on the zeros, really knowledgeable.
I'll keep them in my notes. Thanks again. :)

Keith Dick

Jan 26, 2015, 4:24:09 PM
Shiva wrote:
> Hi Keith,
>
> I've been away, sorry for the delay in reply. Been ill, been busy, et al. Find my answers in line, as usual.
>
>
>>Unless there is some other bad code in the subproc that could have overstored add^post^0, it seems likely that there is some path through the code that does not store a value into add^post^0 before it is used to control the loop.
>
>
> You're a genius, Keith. There was. Actually I found that the other day after posting the reply above. And later when I thought to come back and post again about that - you had found that, even without looking at the code. You should be Sherlock Holmes Jr. USER^CONVERSION^2 has many paths of which one does not set any value to add^post^0 before calling that subproc. But why should that cause an infinite loop, I couldn't figure out.
>
> How so ever the loop should not have gone on and on, no? It should have stopped at some point. Even if a garbage value is set to add^post^0.

The garbage value for add^post^0 in the case for which you got the dump file was 14336, so the loop would have repeated until i reached 14336, not actually forever. It stopped after only 688 times because when i reached 688, it tried to store into an address that was not writable.
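To make that arithmetic concrete, here is a small Python simulation (not pTAL, and not the real code). It assumes the loop stores one byte per pass starting at reply^string[0], uses the address of reply^string from the dump (0x6ffffd50), and assumes the first unwritable byte sits 688 bytes above it. The names store_byte, runaway_loop and NotWritable are made up for the illustration.

```python
# Simulation only: models the stack as a range of writable addresses.
REPLY_STRING_ADDR = 0x6FFFFD50   # from "p /x &reply^string" in the dump
GARBAGE_ADD_POST_0 = 14336       # leftover value found in add^post^0
# Assumption: the first address that cannot be written is 688 bytes higher.
ASSUMED_WRITABLE_END = REPLY_STRING_ADDR + 688


class NotWritable(Exception):
    """Stands in for the trap raised on a store to an unwritable address."""


def store_byte(address):
    # Hypothetical helper: only checks whether the address is writable.
    if address >= ASSUMED_WRITABLE_END:
        raise NotWritable(hex(address))


def runaway_loop(add_post_0):
    """Mimic FOR i := 0 TO (add^post^0 - 1), storing one byte per pass."""
    reply_len = 0
    for i in range(add_post_0):
        try:
            store_byte(REPLY_STRING_ADDR + reply_len)
        except NotWritable:
            return i, REPLY_STRING_ADDR + reply_len
        reply_len += 1
    return add_post_0, None


failed_at, bad_address = runaway_loop(GARBAGE_ADD_POST_0)
print(failed_at, hex(bad_address))
```

Under those assumptions the failing store lands on 0x70000000, a round boundary, which is at least consistent with the loop running off the end of the writable stack area rather than stopping by chance.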

>
>
>>Nobody could even find a listing of the source code?
>
>
> I've got a doubt. Now that you ask me, I went back and checked the 700 code in NOFT and it showed me all the proc names, would I be able to find the whole code from that? I'm sure that the 700 code was compiled with symbols.

Compiling with SYMBOLS puts only debugging information into the object file -- the names of variables and their addresses. Compiling with or without SYMBOLS includes the procedure names in the object file so that linking can be done. The source file names also are included in the object file, I think all of the time, but maybe only when SYMBOLS is specified. In no case is the contents of the source files included in the object file. If you use the option to strip extraneous information from the object file, all of that information will be removed.

If you forget where a source file was when you compiled a particular object file, you can list the source file names from the object file and that will tell you where to look for the source files, but if the files are no longer there, that probably will not help you find the source unless you were not sure what the file names were.
>
>
>> I hope whoever did the rewrite tested that way. If not, maybe that should be done now.
>
>
> We did test it. We regression tested it with the old code and compared our test results as well. Seems like the test cases were not good enough.

You did test with a simple program calling the procedures? If so, I'm happy to hear that. You might have been unlucky enough that the garbage value that add^post^0 got when executing your test program was always some small value that did not lead to a problem, or the location of reply^string happened to be one where significant amounts of storing past its end did not cause bad effects. Maybe if you had tried more test cases, you would have spotted the problem, but maybe not. I hope you included bad input values as well as valid input values in your testing, to be sure the new code caught and reported errors for at least some bad input.

When possible in testing a relatively small procedure in that way, I like to generate a large amount of random inputs and compare the results from both procedures (using a program to do the comparisons) to be sure both are doing the same thing in all cases. You might get a lot of differences for invalid inputs if one procedure responds to the invalid input differently than the other does, but if that happens, and either result is acceptable, you can allow those differences to stand. The test program should also check to be sure that no call returns a different value in an input-only argument than was passed in.
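As a rough illustration of that kind of test program, here is a Python sketch. Everything in it is a stand-in: old_convert plays the trusted procedure, buggy_convert plays a rewrite that wrongly stores into its input buffer, and the validation rules are invented for the example, not taken from the real conversion procedure.

```python
import random


def old_convert(buffer, scale):
    """Hypothetical stand-in for the trusted old procedure."""
    text = bytes(buffer).decode("ascii")
    if not text or text.count(".") > 1 or not text.replace(".", "").isdigit():
        return ("ERROR", None)
    whole, _, frac = text.partition(".")
    if len(frac) > scale:
        return ("ERROR", None)
    # Scale the value by appending zeros, as a fixed-point conversion would.
    return ("OK", int((whole or "0") + frac.ljust(scale, "0")))


def buggy_convert(buffer, scale):
    """Hypothetical rewrite that also (wrongly) overwrites its input."""
    result = old_convert(buffer, scale)
    if result[0] == "OK":
        buffer[:] = b"0" * len(buffer)
    return result


def random_field(rng):
    # Mix valid digits with characters that should be rejected.
    alphabet = "0123456789. -x"
    return "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 8)))


def compare(old, new, cases=2000, seed=1):
    """Run both procedures on the same random inputs and collect differences."""
    rng = random.Random(seed)
    mismatches = []
    for _ in range(cases):
        before = random_field(rng).encode("ascii")
        scale = rng.randint(0, 4)
        expected = old(bytearray(before), scale)
        field = bytearray(before)
        actual = new(field, scale)
        if actual != expected:
            mismatches.append((before, scale, "result differs"))
        if bytes(field) != before:
            mismatches.append((before, scale, "input modified"))
    return mismatches


print(len(compare(old_convert, old_convert)))
print(len(compare(old_convert, buggy_convert)) > 0)
```

A fixed seed keeps a failing run reproducible, and reporting the original input with each mismatch makes it easy to replay a single case in the debugger.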
>
>
>>It can't be doing exactly that (adding zeros after something), since in this case, it appears to have stored zeros starting at the beginning of reply^string.
>
>
> Well, it should be. But remember the below that I posted?

Right. What I meant was that, in this failure case, something was making it not add to the end of something, but start at the beginning. I could have expressed that better.

>
>
>>True. Below is the flow of user^conv^2.
>>If the input has no decimal point and the input length is not > 6 then my^string[0] ':=' input[0] FOR input^len BYTES.
>>
>>There's another scale^string sub procedure that uses 'INTERNAL^SCALE' as the scale factor to add the "INTERNAL^SCALE" number of zeroes to the my^string variable, at the end. And then the string^len is increased to accommodate the internal^scale length as well now.
>>
>>After that $ASCIITOFIXED is done. And then add^zeros subproc is called!
>>
>>This is where I'm confused. In add^zeros subproc, reply^length is initialized first! To zero. Then add^pre^0 is checked to add pre zeroes which is not the case mostly - not really sure why it is even there in the first place, and then the flag add^dec^pt is checked which is set for our input and so a dec pt is added at reply^string[reply^len] := "."; and the reply^len is incremented by one.
>>
>>And as the add^post^0 is set to 2 in our case of input 22730, we go into the next condition where there is a FOR i :=0 TO (add^post^0 - 1) to reply^string[reply^len]:= "0" and again reply length is incremented by one.
>>
>>Finally input[0] is set to reply^string[0] FOR reply^len BYTES and input^len is set to reply^len. And then return;
>>
>>Which means that the reply^len which is initialized to 0 at the start would screw up the math for the conditions that come below. Won't it? But I could not prove that this would eventually write zeroes to all the addresses into an infinite loop. Because if this was the issue, the add^zeros sub proc should have caused errors for other user conversion procedures too. Just a bit confused with the logic of initializing the reply^len with zero.
>
>
> When they reinitialized reply^len to 0 they made a mistake because that would mean that the code would write zeros from the start of the variable.
>
> So actually the problem within the code not only lies in "not initializing add^post^0 in one of the paths to the subproc" but also in initializing reply^len to 0 inside the subproc.
>
> And the code does not write the output to USER^NUMERIC^OUTPUT^CONVERSION but instead writes to INPUT for input^len bytes itself within USER^NUMERIC^INPUT^CONVERSION in the end.
>
> Weird, doesn't look like the standard process described in the pTAL guide.

I'm not sure what you are referring to in the pTAL guide, but I probably don't need to know that. The part of the code that is storing into INPUT is definitely wrong and should be removed. Probably the part that is building reply^string should be removed, too, since it seems all of that is done after the result from $ASCIITOFIXED is stored into INTERNAL, which, along with ERROR, should be the only output from the procedure. Anything in USER^NUMERIC^INPUT^CONVERSION that is not contributing to calculating and storing values into INTERNAL and ERROR should not be there.
>
>
>>I believe that some third-party security packages provide some fine control over who can run individual commands inside various programs, but I am not at all familiar with those products. I don't know why anyone would use such controls to limit the "help" and "an" commands of eInspect.
>
>
> einspect: Could not setup tcl environment:, line = 0.
>
> This is what I get when I start einspect. May be that's got something to do with the limitations.

Yes, that could be the problem. TCL (Tool Command Language) is an open source programming language specifically intended for extending the command set of other programs. Some of the einspect commands described in the Native Inspect Manual actually are implemented in TCL. For example, the description of the "a" and "an" commands say they are a "Tcl command" right at the start of their description in the manual. So if TCL isn't working, I'm not surprised that you get an error that says it doesn't recognize the "an" command. The manual does not say whether the "help" command is a TCL command or not, though it is not listed in chapter 5 where it says it lists the commands that are implemented in TCL. Maybe it is implemented in TCL and the manual just does not mention that. Or maybe the help command is failing not because it depends on TCL but because of some other reason.

I don't know why you would be getting the error that says einspect could not set up the TCL environment. Maybe a file involved in getting TCL to work is secured in a way that prevents your userid from accessing it, or maybe a file that defines the TCL-implemented eInspect commands is secured in a way that prevents you from accessing it. If so, that probably is an accident, but I don't see any installation instructions in the Native Inspect Manual about what files are involved. The user manuals usually don't include installation instructions. If installation instructions are needed, they usually go into the Softdoc for the product, and HP restricts access to Softdocs, so I can't check to see whether the Softdoc for Native Inspect includes any instructions that might shed light on this problem.

I did a little investigation on a NonStop system I have access to for my current contract, and I found that quite a few files in the $system.zeinspct subvol are opened when eInspect and a help command are entered. eInspect does not keep them open, but appears to open them only when it needs them and closes them as soon as possible afterwards. The files whose open timestamp changes are at least: AUTOLOAD, EIHELP, GLOBALS, INITENV, INITTCL, TCLINDEX, and UTILS. I have a feeling that some of the other files in that subvolume would be used by eInspect under some conditions, though I can't be sure. Maybe some of them are only used for CPU dump analysis. It probably would not hurt anything to allow any user to be able to read any of the files in that subvolume. Perhaps you could compare the file protection on the files in that subvolume in the two systems you mentioned, and if you find that the protection is more limited on the system where eInspect doesn't work completely correctly, ask the system manager whether he or she forgot to adjust the protection of the files in that subvolume.

The old installation programs used to preserve the file security settings when installing a new version of a product replaced existing files, but defaulted to a very restrictive file security when installing a file that was not present in the system before. I imagine the current installation programs follow a similar policy. It is the responsibility of the system manager to decide how to set the file security when accepting a new system or installing new products on an existing system. So it easily could be that the system manager of the system on which einspect is getting those errors either forgot to change the file security for some files related to eInspect, or for some reason felt that he did not want to relax the security for those files.


>
> And please find the values below.
>
> p /x &reply^string
> 0x6ffffd50

That confirms my guess about the location of reply^string from the display you provided in an earlier post.

>
> p /x &INPUT^LEN
> 0x6ffffd64

That shows that this argument is in the area above reply^string, as expected, since it got overwritten.

>
> p /x &USERCODE
> Address requested for identifier "USERCODE" which is in register $r32
> info register r32
> r32: 0x2

Aha! The compiler decided to put that argument in a register, which is why it did not get overwritten like the others did. I don't know anything about how the compiler decides whether to put an argument in a register rather than on the stack.
>
> p /x &INPUT
> 0x6ffffd60

And that argument is in the area above reply^string, which is expected since it was overwritten.
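For what it's worth, the offsets can be checked with a few lines of Python. This is only arithmetic on the addresses you posted; it assumes reply^string is the 10-byte array declared as [0:9] and that the loop stored one byte per pass starting at reply^string[0]. The function name is invented for the example.

```python
# Addresses as printed by eInspect in the dump.
REPLY_STRING = 0x6FFFFD50
INPUT = 0x6FFFFD60
INPUT_LEN = 0x6FFFFD64

REPLY_STRING_BYTES = 10   # STRING .reply^string[0:9]
PASSES_COMPLETED = 688    # stores that succeeded before the trap


def first_overwriting_pass(address):
    """Value of i at which a byte-per-pass loop first stores into address."""
    return address - REPLY_STRING


for name, address in (("INPUT", INPUT), ("INPUT^LEN", INPUT_LEN)):
    i = first_overwriting_pass(address)
    beyond_array = i >= REPLY_STRING_BYTES
    clobbered = i < PASSES_COMPLETED
    print(name, "offset", i, "beyond array:", beyond_array, "clobbered:", clobbered)
```

Both arguments sit only 16 and 20 bytes above the start of reply^string, so under these assumptions they would have been overwritten very early in the runaway loop.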

Shiva

Jan 28, 2015, 1:13:17 PM
> The garbage value for add^post^0 in the case for which you got the dump file was 14336, so the loop would have repeated until i reached 14336, not actually forever. It stopped after only 688 times because when i reached 688, it tried to store into an address that was not writable.

Hmm, that's interesting. But also confusing. Why would 14336 be always there for all the errors? May be there's something else to it that I'm not aware of. But I get your point. That is the case here, how so ever coincidental it may be.


> >>Nobody could even find a listing of the source code?
> >
> >
> > I've got a doubt. Now that you ask me, I went back and checked the 700 code in NOFT and it showed me all the proc names, would I be able to find the whole code from that? I'm sure that the 700 code was compiled with symbols.
>
> Compiling with SYMBOLS puts only debugging information into the object file -- the names of variables and their addresses. Compiling with or without SYMBOLS includes the procedure names in the object file so that linking can be done. The source file names also are included in the object file, I think all of the time, but maybe only when SYMBOLS is specified. In no case is the contents of the source files included in the object file. If you use the option to strip extraneous information from the object file, all of that information will be removed.
>
> If you forget where a source file was when you compiled a particular object file, you can list the source file names from the object file and that will tell you where to look for the source files, but if the files are no longer there, that probably will not help you find the source unless you were not sure what the file names were.

Ah, then what did you mean by a 'listing' - the details that you mentioned above which tells me where the source is? Long past that. The source is not there. We know exactly the name of the source. It's not there anywhere in the system.


> >
> > Weird, doesn't look like the standard process described in the pTAL guide.
>
> I'm not sure what you are referring to in the pTAL guide, but I probably don't need to know that. The part of the code that is storing into INPUT is definitely wrong and should be removed. Probably the part that is building reply^string should be removed, too, since it seems all of that is done after the result from $ASCIITOFIXED is stored into INTERNAL, which, along with ERROR, should be the only output from the procedure. Anything in USER^NUMERIC^INPUT^CONVERSION that is not contributing to calculating and storing values into INTERNAL and ERROR should not be there.

I was referring to the part where it was noted that "USER^NUMERIC^OUTPUT^CONVERSION" is where you write your output to. Because in this code the output is written to the "INPUT" variable in USER^NUMERIC^INPUT^CONVERSION itself. And USER^NUMERIC^OUTPUT^CONVERSION is not used at all.

> >
> >
> >>I believe that some third-party security packages provide some fine control over who can run individual commands inside various programs, but I am not at all familiar with those products. I don't know why anyone would use such controls to limit the "help" and "an" commands of eInspect.
> >
> >
> > einspect: Could not setup tcl environment:, line = 0.
> >
> > This is what I get when I start einspect. May be that's got something to do with the limitations.
>
> Yes, that could be the problem. TCL (Tool Command Language) is an open source programming language specifically intended for extending the command set of other programs. Some of the einspect commands described in the Native Inspect Manual actually are implemented in TCL. For example, the description of the "a" and "an" commands say they are a "Tcl command" right at the start of their description in the manual. So if TCL isn't working, I'm not surprised that you get an error that says it doesn't recognize the "an" command. The manual does not say whether the "help" command is a TCL command or not, though it is not listed in chapter 5 where it says it lists the commands that are implemented in TCL. Maybe it is implemented in TCL and the manual just does not mention that. Or maybe the help command is failing not because it depends on TCL but because of some other reason.
>
> I don't know why you would be getting the error that says einspect could not set up the TCL environment. Maybe a file involved in getting TCL to work is secured in a way that prevents your userid from accessing it, or maybe a file that defines the TCL-implemented eInspect commands is secured in a way that prevents you from accessing it. If so, that probably is an accident, but I don't see any installation instructions in the Native Inspect Manual about what files are involved. The user manuals usually don't include installation instructions. If installation instructions are needed, they usually go into the Softdoc for the product, and HP restricts access to Softdocs, so I can't check to see whether the Softdoc for Native Inspect includes any instructions that might shed light on this problem.
>
> I did a little investigation on a NonStop system I have access to for my current contract, and I found that quite a few files in the $system.zeinspct subvol are opened when eInspect and a help command are entered. eInspect does not keep them open, but appears to open them only when it needs them and closes them as soon as possible afterwards. The files whose open timestamp changes are at least: AUTOLOAD, EIHELP, GLOBALS, INITENV, INITTCL, TCLINDEX, and UTILS. I have a feeling that some of the other files in that subvolume would be used by eInspect under some conditions, though I can't be sure. Maybe some of them are only used for CPU dump analysis. It probably would not hurt anything to allow any user to be able to read any of the files in that subvolume. Perhaps you could compare the file protection on the files in that subvolume in the two systems you mentioned, and if you find that the protection is more limited on the system where eInspect doesn't work completely correctly, ask the system manager whether he or she forgot to adjust the protection of the files in that subvolume.
>
> The old installation programs used to preserve the file security settings when installing a new version of a product replaced existing files, but defaulted to a very restrictive file security when installing a file that was not present in the system before. I imagine the current installation programs follow a similar policy. It is the responsibility of the system manager to decide how to set the file security when accepting a new system or installing new products on an existing system. So it easily could be that the system manager of the system on which einspect is getting those errors either forgot to change the file security for some files related to eInspect, or for some reason felt that he did not want to relax the security for those files.
>
>

Ah, how so much you know about Tandem. But I thought various versions of TAL (TAL, pTAL, epTAL) and TACL were the only true 'native' languages of HP NS systems. Now there's TCL - I suppose there's more of it!? Unlike COBOL, C, Java which are 'adopted' into NS systems.

That was a very valuable lesson. Going to search whether there's a TCL manual available in the nonstop-docs. :) Thanks much again, for your patient explanation. I'll check my system about that sub volume as well.

Tone

Jan 28, 2015, 8:20:58 PM
TCL is not a 'native' NonStop language. It actually stands for Tool
Command Language. Chapter 5 of the Native Inspect manual has details on
TCL scripting.

Keith Dick

Jan 29, 2015, 6:17:43 AM
Shiva wrote:
>>The garbage value for add^post^0 in the case for which you got the dump file was 14336, so the loop would have repeated until I reached 14336, not actually forever. It stopped after only 688 times because when I reached 688, it tried to store into an address that was not writable.
>
>
> Hmm, that's interesting. But also confusing. Why would 14336 be always there for all the errors? May be there's something else to it that I'm not aware of. But I get your point. That is the case here, how so ever coincidental it may be.

I didn't know it always had the same garbage value. Maybe it only has the same garbage value when the error happens, and has other, smaller, garbage values in cases that don't fail because the garbage value is small enough that the loop doesn't run away very much and so does very little damage.

But suppose that it is the same value every time.

Keep in mind that "garbage value" does not mean "random value". When the program uses the value of add^post^0 without first storing something into it, the value is whatever most recently was put into that area of memory. That memory probably was used by some procedure that was called and had returned before ADD^ZEROS was called. It might have been a proc or subproc called from USER^NUMERIC^INPUT^CONVERSION or from USER^CONVERSION^2, or even from some procedure called before USER^NUMERIC^INPUT^CONVERSION was called. The fact that the value comes from some previous part of this program means it would not be so surprising if it were always the same. The execution history just prior to the error is likely to be very similar each time, because the program probably was processing the fields from an input screen, and it will do the same steps each time that screen is processed.
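
To make the "garbage is not random" point concrete, here is a small Python model. It is not pTAL and not the real code from the dump. The figures 14336 and 688 are the ones from this dump; the names earlier_proc, add_zeros and run_screen, and the idea that the stale value sits in one particular reused stack slot, are assumptions made only for illustration.

```python
STALE_VALUE = 14336     # garbage found in add^post^0 in this dump
WRITABLE_LIMIT = 688    # loop index at which the store faulted in this dump


def earlier_proc(stack):
    """An earlier procedure uses slot 0 as a local and leaves its value behind."""
    stack[0] = STALE_VALUE


def add_zeros(stack, writable_limit=WRITABLE_LIMIT):
    """Read slot 0 without initializing it, then loop that many times.

    Returns (stores_completed, faulted).
    """
    count = stack[0]             # uninitialized use: whatever was left there
    stored = 0
    for i in range(count):       # models FOR i := 0 TO count - 1
        if i >= writable_limit:
            return stored, True  # store to a non-writable address
        stored += 1
    return stored, False


def run_screen():
    """Same steps every time a screen is processed, so same leftover value."""
    stack = [0] * 4
    earlier_proc(stack)
    return add_zeros(stack)


if __name__ == "__main__":
    print(run_screen())               # same result on every run
    print(add_zeros([3, 0, 0, 0]))    # a small leftover value does no visible harm
```

In this model the failure repeats identically because the same earlier step always leaves the same value behind, while a small leftover value lets the loop finish without any visible error.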
>
>
>
>>>>Nobody could even find a listing of the source code?
>>>
>>>
>>>I've got a doubt. Now that you ask me, I went back and checked the 700 code in NOFT and it showed me all the proc names, would I be able to find the whole code from that? I'm sure that the 700 code was compiled with symbols.
>>
>>Compiling with SYMBOLS puts only debugging information into the object file -- the names of variables and their addresses. Compiling with or without SYMBOLS includes the procedure names in the object file so that linking can be done. The source file names also are included in the object file, I think all of the time, but maybe only when SYMBOLS is specified. In no case is the contents of the source files included in the object file. If you use the option to strip extraneous information from the object file, all of that information will be removed.
>>
>>If you forget where a source file was when you compiled a particular object file, you can list the source file names from the object file and that will tell you where to look for the source files, but if the files are no longer there, that probably will not help you find the source unless you were not sure what the file names were.
>
>
> Ah, then what did you mean by a 'listing' - the details that you mentioned above which tells me where the source is? Long past that. The source is not there. We know exactly the name of the source. It's not there anywhere in the system.

Remember, I'm an old dinosaur. A 'listing' is what you get when you send the output of the compiler to the printer (and the compiler has the decency to provide an option that makes it list the source lines it is compiling, and hopefully include some additional information mixed in with the source lines, such as code offset of the first instruction for each line of code, etc.). You then take that nice stack of paper and file it away so that when the source file gets lost, you can type the program into the computer again by reading from the listing. Some organizations seem to know how to keep track of important papers better than they know how to keep track of important source files.
>
>
>
>>>Weird, doesn't look like the standard process described in the pTAL guide.
>>
>>I'm not sure what you are referring to in the pTAL guide, but I probably don't need to know that. The part of the code that is storing into INPUT is definitely wrong and should be removed. Probably the part that is building reply^string should be removed, too, since it seems all of that is done after the result from $ASCIITOFIXED is stored into INTERNAL, which, along with ERROR, should be the only output from the procedure. Anything in USER^NUMERIC^INPUT^CONVERSION that is not contributing to calculating and storing values into INTERNAL and ERROR should not be there.
>
>
> I mentioned to the part where it was noted that "USER^NUMERIC^OUTPUT^CONVERSION" is where you write your output to. Because in this code the output is written to the "INPUT" variable in USER^NUMERIC^INPUT^CONVERSION itself. And USER^NUMERIC^OUTPUT^CONVERSION is not used at all.

I don't follow everything you said here. I wonder whether you have some typos or missed some words. You are right that USER^INPUT^NUMERIC^CONVERSION should never store into its INPUT argument. USER^NUMERIC^OUTPUT^CONVERSION is for the opposite situation -- when you are sending out data to the screen and want to format it in some way that the normal formatting controls cannot do. The two procedures are completely independent from each other.
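
As a rough illustration of that contract, here is a Python sketch in which the input conversion routine only reads INPUT and hands back INTERNAL and ERROR. This is not the Pathway calling convention and not pTAL. The function name, the (internal, error) return shape, and the use of 0 and 1 as error codes are all assumptions, and int() merely stands in for $ASCIITOFIXED.

```python
def user_numeric_input_conversion(input_bytes, input_len):
    """Return (internal, error) and leave input_bytes untouched.

    input_bytes is a Python bytes object, which is immutable, so this
    routine cannot store into the caller's INPUT buffer even by accident.
    """
    text = input_bytes[:input_len].decode("ascii", errors="replace").strip()
    try:
        return int(text), 0   # assumed convention: 0 means no error
    except ValueError:
        return 0, 1           # assumed convention: nonzero means failure


if __name__ == "__main__":
    field = b"  42  "
    print(user_numeric_input_conversion(field, len(field)))
    print(field)              # unchanged after the call
```

The point of the sketch is only the direction of data flow: everything the caller needs comes back in the two results, and nothing is written through the input argument.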
>
>
>>>
>>>>I believe that some third-party security packages provide some fine control over who can run individual commands inside various programs, but I am not at all familiar with those products. I don't know why anyone would use such controls to limit the help and an commands of eInspect.
>>>
>>>
>>>einspect: Could not setup tcl environment:, line = 0.
>>>
>>>This is what I get when I start einspect. May be that's got something to do with the limitations.
>>
>>Yes, that could be the problem. TCL (Tool Command Language) is an open source programming language specifically intended for extending the command set of other programs. Some of the einspect commands described in the Native Inspect Manual actually are implemented in TCL. For example, the description of the "a" and "an" commands say they are a "Tcl command" right at the start of their description in the manual. So if TCL isn't working, I'm not surprised that you get an error that says it doesn't recognize the "an" command. The manual does not say whether the "help" command is a TCL command or not, though it is not listed in chapter 5 where it says it lists the commands that are implemented in TCL. Maybe it is implemented in TCL and the manual just does not mention that. Or maybe the help command is failing not because it depends on TCL but because of some other reason.
>>
>>I don't know why you would be getting the error that says einspect could not set up the TCL environment. Maybe a file involved in getting TCL to work is secured in a way that prevents your userid from accessing it, or maybe a file that defines the TCL-implemented eInspect commands is secured in a way that prevents you from accessing it. If so, that probably is an accident, but I don't see any installation instructions in the Native Inspect Manual about what files are involved. The user manuals usually don't include installation instructions. If installation instructions are needed, they usually go into the Softdoc for the product, and HP restricts access to Softdocs, so I can't check to see whether the Softdoc for Native Inspect includes any instructions that might shed light on this problem.
>>
>>I did a little investigation on a NonStop system I have access to for my current contract, and I found that quite a few files in the $system.zeinspct subvol are opened when eInspect and a help command are entered. eInspect does not keep them open, but appears to open them only when it needs them and closes them as soon as possible afterwards. The files whose open timestamp changes are at least: AUTOLOAD, EIHELP, GLOBALS, INITENV, INITTCL, TCLINDEX, and UTILS. I have a feeling that some of the other files in that subvolume would be used by eInspect under some conditions, though I can't be sure. Maybe some of them are only used for CPU dump analysis. It probably would not hurt anything to allow any user to be able to read any of the files in that subvolume. Perhaps you could compare the file protection on the files in that subvolume in the two systems you mentioned, and if you find that the protection is more limited on the system where eInspect doesn't work completely correctly, ask the system manager whether he or she forgot to adjust the protection of the files in that subvolume.
>>
>>The old installation programs used to preserve the file security settings when installing a new version of a product replaced existing files, but defaulted to a very restrictive file security when installing a file that was not present in the system before. I imagine the current installation programs follow a similar policy. It is the responsibility of the system manager to decide how to set the file security when accepting a new system or installing new products on an existing system. So it easily could be that the system manager of the system on which einspect is getting those errors either forgot to change the file security for some files related to eInspect, or for some reason felt that he did not want to relax the security for those files.
>>
>>
>
>
> Ah, you know so much about Tandem. But I thought the various versions of TAL (TAL, pTAL, epTAL) and TACL were the only true 'native' languages of HP NS systems. Now there's TCL - I suppose there's more to it!? Unlike COBOL, C, Java which are 'adopted' into NS systems.
>
> That was a very valuable lesson. Going to search whether there's a TCL manual available in the nonstop-docs. :) Thanks much again, for your patient explanation. I'll check my system about that sub volume as well.

Tone already answered this, but let me add a little. When I say something is open source, that usually means it is a program that is widely available freely, like Linux, Apache, etc. That is what TCL is. It was written by someone not at Tandem. Its purpose was for extending the commands a program has. It can be used entirely on its own to develop programs, too. As far as I know, TCL is only used in eInspect and EGARTH, but it might have found its way into some other NonStop products, too. I don't know whether you can use TCL on the NonStop system to write a standalone TCL program. The developers who used it in eInspect and EGARTH might not have distributed the parts needed to use TCL outside of eInspect and EGARTH.
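
If it helps to picture what "extending the commands a program has" means, here is a toy model in Python (not TCL, and not how eInspect is actually built). The class ToyDebugger, the built-in "bt" command, and the message texts are invented for illustration. Only the idea that "an" is a script-defined command comes from the manual as discussed earlier in the thread.

```python
class ToyDebugger:
    """Host program with compiled-in commands plus script-defined ones."""

    def __init__(self, script_env_ok):
        # "bt" stands in for any command compiled into the host program.
        self.commands = {"bt": lambda: "backtrace (built in)"}
        self.startup_message = None
        if script_env_ok:
            self.load_scripted_commands()
        else:
            # The scripting layer failed to start, so nothing is added.
            self.startup_message = "could not set up script environment"

    def load_scripted_commands(self):
        # Commands defined by scripts are registered beside the built-ins.
        self.commands["an"] = lambda: "analysis report (scripted)"

    def run(self, name):
        handler = self.commands.get(name)
        if handler is None:
            return f"unrecognized command: {name}"
        return handler()


if __name__ == "__main__":
    broken = ToyDebugger(script_env_ok=False)
    print(broken.startup_message)
    print(broken.run("an"))   # scripted command is missing
    print(broken.run("bt"))   # built-in command still works
```

In this model a failure to start the scripting layer removes only the script-defined commands, which is one way to read the symptoms described earlier in the thread.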

Bill Honaker

Jan 29, 2015, 4:23:57 PM
A little of my memory tells me that the iTP 'httpd.config' file and
things it sources in are derived from TCL? And that is the same
format used by Apache Web Server. TCL is a quarter century old this
month!

learn more here: http://en.wikipedia.org/wiki/Tcl
Open Source site: http://www.tcl.tk/

Shiva

Jan 30, 2015, 1:00:20 PM
> >>
> >>
> >>
> >>>>Weird, doesn't look like the standard process described in the pTAL guide.
> >>>
> >>>I'm not sure what you are referring to in the pTAL guide, but I probably don't need to know that. The part of the code that is storing into INPUT is definitely wrong and should be removed. Probably the part that is building reply^string should be removed, too, since it seems all of that is done after the result from $ASCIITOFIXED is stored into INTERNAL, which, along with ERROR, should be the only output from the procedure. Anything in USER^NUMERIC^INPUT^CONVERSION that is not contributing to calculating and storing values into INTERNAL and ERROR should not be there.
> >>
> >>
> >> I mentioned to the part where it was noted that "USER^NUMERIC^OUTPUT^CONVERSION" is where you write your output to. Because in this code the output is written to the "INPUT" variable in USER^NUMERIC^INPUT^CONVERSION itself. And USER^NUMERIC^OUTPUT^CONVERSION is not used at all.
> >
> >I don't follow everything you said here. I wonder whether you have some typos or missed some words. You are right that USER^INPUT^NUMERIC^CONVERSION should never store into its INPUT argument. USER^NUMERIC^OUTPUT^CONVERSION is for the opposite situation -- when you are sending out data to the screen and want to format it in some way that the normal formatting controls cannot do. The two procedures are completely independent from each other.

I actually thought that to get input from the screen you use USER^NUMERIC^INPUT^CONVERSION, do the conversion, and give the output to USER^NUMERIC^OUTPUT^CONVERSION, because both names looked similar. But now I see that it is USER^INPUT^NUMERIC^CONVERSION. Hmm, so both are totally different. Then how is the input that comes from the input conversion procedure written back to the screen? I'll have to read better. You've explained so much. I'll have a read and get back. :)

And TCL, interesting! That looks a lot like it came from the OSS space into Tandem. The other day I was reading the HP NS for dummies from Comforte and I remember the portrayal they gave for the HP Kernel. It was like Guardian inside, OSS after and NonStop OS the outer layer. Makes sense now.

Shiva

Jan 30, 2015, 1:04:51 PM
@Bill & @Tone: Wow. There's so much in the world out there! Thanks for that. I'll look them up. I did wiki TCL though. :D

Keith Dick

Jan 30, 2015, 8:28:53 PM
During a SCOBOL ACCEPT, the PATHWAY TCP takes the characters from the screen, attempts to convert the field, then calls the user conversion input procedure if the screen description says to use user conversion. The user conversion procedure gives its result back to the Pathway TCP in the INTERNAL parameter.

An output user conversion is not used in a SCOBOL ACCEPT at all. It is used in a SCOBOL DISPLAY if a field says to use user conversion.
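
Roughly, the two paths can be modelled like this in Python. It is only a sketch of which hook runs on which verb, not how the TCP is implemented. The function names, the hook signatures, and the "$1,200" style example are all hypothetical.

```python
def standard_input(chars):
    """Stand-in for the TCP's own attempt to convert a screen field."""
    try:
        return int(chars), 0
    except ValueError:
        return 0, 1


def accept_field(chars, user_conversion, input_hook):
    """ACCEPT path: screen characters in, (internal, error) out."""
    result = standard_input(chars)
    if user_conversion:
        result = input_hook(chars)   # only the input hook runs here
    return result


def display_field(value, user_conversion, output_hook):
    """DISPLAY path: internal value in, screen characters out."""
    if user_conversion:
        return output_hook(value)    # only the output hook runs here
    return str(value)


def money_in(chars):
    """Example input hook: accept digits with '$' and ',' decoration."""
    return standard_input(chars.replace("$", "").replace(",", ""))


def money_out(value):
    """Example output hook: add '$' and thousands separators."""
    return f"${value:,}"


if __name__ == "__main__":
    print(accept_field("$1,200", True, money_in))
    print(display_field(1200, True, money_out))
```

The two hooks never call each other: a value accepted through the input hook reaches the screen again only if the program later displays it, and then only the output side is involved.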

This is explained in the manual that describes the user conversion procedures. Not as well as it probably ought to be, but I think there is enough there for most situations.

Shiva

Jan 31, 2015, 12:29:57 AM
Yep, I read that last night after I understood that my notion was wrong all along! I misread it the first time. Difference between skimming through and reading through. Should be more careful. Thanks! :)