> MCPDLL and INITDLL contain numerous system library procedures, and when they are loaded, at least some of the stack frames that say the procedure name is unknown will be able to list the proper procedure name. This can sometimes help you get an idea of which system procedures are involved when you have a stack trace with a lot of unknown procedure names. I wouldn't have expected it to help with your current problem.
Hmm, I assumed so. Thanks for that clarification.
>
> That FLDLEN value of 5 probably is the length of the field that is to be converted. It also matches string^length, which makes it seem a little more likely that it is the field length. When you look at the code, you'll probably see it copies INPUT^LEN to string^length.
Yes, it does.
>
> I applaud your dedication!
>
Ah, thank you. But that's just a selfish act: I want to learn. You, on the other hand, reply with pages of answers to help me understand concepts that would take years of learning or proper training (or a lot of reading), all in a single, necessarily very large post. The time that you and people like you take every day to reply and help people like me, expecting nothing in return, now that's a selfless act. And that's something that needs higher recognition. :)
I don't know how I could ever thank you for all the help you've been providing.
> I said it was surprisingly low because many times a runaway loop starts storing at some data address in the program and runs for many thousands of bytes before it reaches the end of the memory area assigned to the process. Not just a few hundred bytes. Of course, if the variable the loop started storing into happened to be near the end of a memory area, it could run off the end of the assigned memory just a few hundred bytes beyond the variable it started storing into.
>
> 14336 decimal (yes, those values are displayed in decimal) is 0x3800, which is the Ascii character "8" followed by the null byte. I don't know whether it was set to that by overstoring with some ASCII data or is a coincidence that it is an Ascii digit and a null. On Windows, there is a calculator application that can switch between decimal and hex display, which is very handy for converting values between the two (when you switch the number base, it does not erase the value but just displays it in the new base), and for doing address arithmetic. Assuming you use a Windows computer at your office, you should have it. It defaults to an ordinary calculator interface, but in the View menu, you can switch to Programmer (that's for the Windows 7 version of calculator; it is Scientific for the Windows XP version).
>
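For anyone without the Windows calculator to hand, the same conversion can be sketched in a few lines of Python. This is only an illustration: Python is not part of the NonStop toolchain, and the high-byte-first ordering is an assumption taken from the description above ("8" followed by a null byte).

```python
# Decimal value displayed by the debugger for add^post^0.
value = 14336

# The same value in hex.
hex_text = hex(value)

# Split the 16-bit value into its two bytes, high-order byte first
# (assumed ordering, matching the "8 then null" reading above).
raw = value.to_bytes(2, byteorder="big")
high_byte, low_byte = raw[0], raw[1]

print(hex_text)                  # 0x3800
print(chr(high_byte), low_byte)  # 8 0
```

So the first byte is the ASCII digit "8" (0x38) and the second byte is a null (0x00), not the ASCII digit "0" (0x30).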
So 14336 is just "8" followed by a null byte - now that's confusing, because the only values assigned to add^post^0 within the user^conversion^2 subproc are 0, 1, and 2. No other values are assigned to it.
And yes, I know that. I mentioned that I was unable to make up my mind to even try to convert it because I was in a very sorry state when I wrote that post. Won't happen again, cool :)
>
> The areas of memory accessible to a program can be marked, in tables that the OS maintains and the hardware consults, as either read/write or read-only (or maybe it is called execute-only). The boundaries of the various segments almost always are round numbers in hex, so a transition from read/write to read-only at 0x70000000 makes sense. I don't know enough about how memory is laid out in a TNS/E native process to know what the area just before the code segment is, but the evidence here is that it holds the local stack frames of the call stack. I do remember reading that something grows from high addresses towards the low addresses, and that might have been the local call stack. I think the heap grows from the low address in that same segment toward high addresses. Or maybe the heap is at the high end and the call stack at the low end.
> >
Hmm, that makes me think that my previous notion that "some addresses are predefined to hold code and others variable information" is wrong. Maybe it is decided at run time; it doesn't really matter at this point. Maybe my trustable resource suggested that because he spotted the value 0x70000000 in the bt stack trace for frame 8. That looks more likely, and if so, he has a sharp eye!
At times experience gives you things that hard work and dedication can't even think about! Ha, and deserved too.
> > So as I is 688, you see that nearly 700 bytes of data had been overwritten to 0x30, and after that it had failed. Hence 688 is justifiable.
Okay, I don't see from the dump below how it is 688 bytes that have been overwritten. How do you convert these addresses, and the values shown at them, into a byte count? I tried the math but it's not working.
> >
> > Here's a sample:
> >
> > (einspect 1,94): an 0x6ffffd08 700
> > 0x6ffffd08: ...... .0.. . ...... .0.. .
> > 0x6ffffd18: ...... .0.. . ...... .8....
> > 0x6ffffd28: ...... ...... ...... ......
> > 0x6ffffd38: ...... ...... .2273. .0. 0.
> > 0x6ffffd48: ./,.C. .o..E. .0000. .0000.
> > 0x6ffffd58: .0000. .0000. .0000. .0000.
> > 0x6ffffd68: .0000. .0000. .0000. .0000.
> > 0x6ffffd78: .0000. .0000. .0000. .0000.
> > 0x6ffffd88: .0000. .0000. .0000. .0000.
> > -----------x clipped a lot of zeroes x-----------
> > 0x6ffffff8: .0000. .0000. Warning: cannot access memory at address 0x70000000.
If I read this last line right, of the four groups here the first one is at 0x6ffffff8, the next one at 0x6ffffff9, and the next one, where the error occurs, at 0x70000000. And it matches too. But if you continue on, the fourth group would have been at 0x70000008 going by the address labels in the left column, which doesn't agree with the address arithmetic we do for the values stored in the right-hand columns. I know very well that I'm not using the right words or even explaining my understanding correctly, but I don't think I can do any better. That's as far as I understand.
> >
> > That explains it all. But still doesn't give us enough information to understand why such a situation occurred in the first place.
>
> Well, it doesn't explain it all, but it does show some things. You did find the long string of zeros that the runaway loop stored. From the beginning of the string of zeros (at 0x6ffffd50) to 0x70000000 is exactly 688 bytes. I see your 22730 value 16 bytes before the beginning of that string of zeros, not immediately before it, as I expected it would be. So maybe the runaway loop started at (or even before) the beginning of reply^string, and the 22730 we see is INPUT. However, the five bytes following 22730 seem to match what your other display shows for my^string, so I think it is likely that my^string is at 0x6ffffd40, and reply^string is at 0x6ffffd50. I think you could confirm that by using
>
> print /x &my^string
> and
> print /x &reply^string
>
Again 'exactly 688 bytes', I don't know how. And I'll try the above when I get back to the office next Monday. I'm on a little break from work, hence the delays in my replies :) But I think your deduction is correct.
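A sketch of where the 688 comes from, written in Python purely as a calculator. It assumes, from the layout of the an display above, that each row label advances by 16 bytes (0x10) and that each row shows four groups of four bytes; those two constants are a reading of the dump, not something taken from the eInspect manual.

```python
ROW_BYTES = 16    # assumed: row labels go 0x...d08, 0x...d18, 0x...d28, ...
GROUP_BYTES = 4   # assumed: four groups of four characters per row

zeros_start = 0x6ffffd50   # where the run of "0" characters begins
segment_end = 0x70000000   # first address that cannot be accessed

# Number of bytes between the two addresses.
overwritten = segment_end - zeros_start

def group_addresses(row_address):
    """Return the start address of each group displayed on one row."""
    return [row_address + i * GROUP_BYTES for i in range(ROW_BYTES // GROUP_BYTES)]

last_row = group_addresses(0x6ffffff8)

print(overwritten, hex(overwritten))   # 688 0x2b0
print([hex(a) for a in last_row])
# ['0x6ffffff8', '0x6ffffffc', '0x70000000', '0x70000004']
```

Under those assumptions the second group on the last row starts at 0x6ffffffc, and the third group would start at 0x70000000, which is exactly where the display stops with the warning. The same helper puts the third group of the 0x6ffffd48 row at 0x6ffffd50, the start of the zeros.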
Below are the last declaration statements from the code, which make me think that the variables sit on the stack above in this order:
STRING .my^string[0:9];
.my^ptr := @mystring,
.reply^string[0:9];
There's my^string, then some junk that I guess belongs to the pointer declared above, and then the junk written into reply^string starts.
> Since ^ is a C operator, not allowed in names, you might have to use the eInspect command set language ptal to get it to accept those variable names, unless it recognizes you are in an epTAL program and sets it automatically.
epTAL is recognized automatically. Beauty! :D
>
> At this point, I think we can be pretty sure we know three of the four inputs to USER^INPUT^NUMERIC^CONVERSION. We know USERCODE, INPUT, and INPUT^LEN. The only one we don't know is INTERNAL^SCALE.
>
True. I read those chapters that you asked me to read, and they didn't explain INTERNAL^SCALE very well. But your explanation below makes it clear.
> >
> > Even the an command in einspect was suggested to me by that trustable resource. He definitely knew what he was doing. That command looks a lot like the x command that you've given, Keith, but it works a bit differently. I'm trying to understand the difference, but I think only the einspect manual will teach me that.
>
> The an command only displays in Ascii. The x command has other ways to display as well as Ascii. I'm not sure any of the ways the x command displays Ascii matches exactly how an does the display, so it is good to know both.
>
> >
> > I'm also a little puzzled that USERCODE seems not to have been overstored even though all of the other arguments have been overstored.
> > Maybe USERCODE was not a variable in this level or frame; maybe it came from some other procedure or frame. Maybe that's why it is stored elsewhere, as I could not see it above. If I had given a p &USERCODE I'd have seen the address where it was stored, no? But I think I'd know if I looked at the program itself.
>
> USERCODE, INPUT^LEN, and INTERNAL^SCALE all are INT value arguments, so I would expect them to be handled basically the same way, and all three would be vulnerable to being overwritten the same way. It seems that is not true, which means there is something about the way value arguments are handled that I don't know.
Maybe I'll try p &USERCODE etc. to see if they give us some more detail.
> >
> > The addresses before the procedure names in the stack trace look like code addresses, not data addresses, so using one of them to start probably is wrong.
> > Again, you're right. You're a genius, Keith.
> >
> > I'm not sure &INPUT^LEN will get the address of the INPUT^LEN argument, but I think it will.
> > It did, I think. That's what gave me the idea to suggest that p &USERCODE would say where USERCODE is stored.
>
> What value did print /x &INPUT^LEN give you? If you included that in your post, I am not seeing it. Is it a number between 0x6ffffd50 and 0x70000000? What does print /x &USERCODE show?
Not at the office, Keith. Sorry - I'll try that first thing next Monday. And what does /x mean in the print statement? Does it translate from ASCII to hex, or is it some other conversion like that?
> >
> > So back to the einspect frame 8. I did a info locals to that and I got the following information.
> >
> > decimal^pt^pos = 3
> > string^length = 5
> > my^string = "22730. 0/,"
> >
> > The code has the following important lines which will help us understand the importance of my^string and the value above.
> >
> > STRING .my^string[0:9];
> > .
> > .
> > .
> > my^string ':=' input [0] FOR input^len BYTES;
> >
> > So the value "22730. 0/," is indeed not the input alone but all of positions 0-9 (bytes?), possibly because input^len in the above statement was larger than what my^string expects.
>
> The space used by local variables is not cleared to zeros when entering the procedure, so their initial contents will be whatever the program previously had stored in that area from using it for the local variables of other procedures that had been called and already returned before calling the procedures currently in the call stack. So it is not at all surprising to see the bytes of my^string that had not been written to by the ':=' statement to have some "random" value in them. It often won't actually be random, but highly repeatable. It depends on what that memory had been used for previously in the history of the program's execution. Sometimes, as it seems to be from the displays from the other dump files, below, it is highly repeatable, though not easily predictable.
Got you. Thanks again!
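The leftover-bytes effect can be mimicked with a small Python sketch. The stale buffer contents are hypothetical (only the last five bytes are chosen to match the displays in this thread), and copy_for is an invented helper standing in for the ':=' ... FOR ... BYTES move statement.

```python
def copy_for(dest: bytearray, src: bytes, length: int) -> None:
    """Overwrite only the first `length` bytes of dest, like a FOR n BYTES move."""
    dest[:length] = src[:length]

# Hypothetical leftovers in the 10-byte my^string area from earlier calls.
STALE = b"?????. 0/,"

five = bytearray(STALE)
copy_for(five, b"22730", 5)     # 5-byte input leaves bytes 5..9 untouched

six = bytearray(STALE)
copy_for(six, b"680640", 6)     # 6-byte input also overwrites byte 5

print(five.decode())  # 22730. 0/,
print(six.decode())   # 680640 0/,
```

Only the bytes covered by the move change; whatever was in the rest of the area beforehand is still there when the debugger displays all ten bytes.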
> >
> > But string^length and decimal^pt^pos indicate that the input might have been 22730 and input length just 5, but that can't make decimal point position as 3. I understand that all these are related to code, and I don't expect you all to make any sense of it. I didn't read the whole code myself. I'm just making comments with the variable names that we see here, logically. About the code, I don't think that I can post the whole code here. For one, it's too large. For two, it's too secure to post and get away easy. I hope you understand.
>
> I understand that you might not be allowed to post the actual code. Many organizations are ridiculously protective of their precious source code, even though there usually is nothing very original about it, but you have to follow your organization's rules. But you should be able to tell us in general terms what it does. For instance, why is decimal^pt^pos of 3 not reasonable? Is it unreasonable that the desired output for this input might be 22730.000? Or maybe decimal^pt^pos is not supposed to be the number of decimal places in the output value, but something else -- maybe the number of decimal places in the input string? Is USERCODE of 2 always supposed to produce a value with 2 decimal positions? Maybe when it gets an input that has more than two decimal positions, that is what makes it go wild? If that were the case, I'd think it would have happened and have been corrected long before now, unless the error was introduced recently. Was any of the user conversion code changed recently? If so, checking those changes carefully would be a good thing to do.
>
Ha, let me tell you a secret. The whole code was rewritten, for the following reasons.
It was old 700 code that had to be recompiled in the new environment as 800 code. The source code was lost in time (oh yes!), so we had to start from scratch, with vague requirements! But it has worked so well all along. For most inputs, at least.
> Is USERCODE of 2 always supposed to produce a value with 2 decimal positions?
Z(5)9.99 is the input format that it can handle. For others, as far as I can see, the error handling is in place.
I'm guessing that maybe the requester is calling the wrong user conversion procedure for that particular code. Of course, nobody changed any code in the requester, so maybe there's just one exception that the code is not handling. I'll post the logic of user^conversion^2 below.
> >
> > That might be the beginning of the string the procedure was asked to convert.
> > Well, looks like it. But I'm still not too sure why the problem occurred in the first place.
>
> Actually, no, that isn't what I was suggesting. I thought the runaway loop might have been used to append a calculated number of zeros to the end of the input value, but since the apparent input value lies 16 bytes before the beginning of the string of zeros the runaway loop stored, it doesn't seem to be doing what I thought it might be doing. Maybe it was intended to store the zeroes into the reply string first, then overlay the first zeroes with digits from the input string. Or something still different from that. I don't know. Once you look at the code, what it was supposed to do might become clear. The name of the variable giving the upper limit for the loop, add^post^0, sort of implies that it is doing something about adding zeros after (post) something, but maybe the variable name is misleading.
No, it does exactly that.
>
> By the way, the ultimate output from USER^INPUT^NUMERIC^CONVERSION is a 64-bit binary integer which is 10 raised to the INTERNAL^SCALE times the intended value. That is, it has an implied decimal point INTERNAL^SCALE positions from the right end of the value. 9(n)V9(m) in COBOL terms, where m is INTERNAL^SCALE. So reply^string must be only an internal result, since it is still holding ASCII digits. The point of the ADD^ZEROS subprocedure might be to add the appropriate number of zeros to get the input value to have the appropriately scaled value in ASCII prior to converting the ASCII digits to a binary integer.
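A minimal sketch of that scaling, in Python rather than pTAL. The function name to_scaled_integer and the scale value 2 are assumptions for illustration only; the thread has not established what INTERNAL^SCALE actually was.

```python
from decimal import Decimal

def to_scaled_integer(text: str, internal_scale: int) -> int:
    """Return the value times 10**internal_scale as a plain integer."""
    return int(Decimal(text) * (10 ** internal_scale))

# With a hypothetical scale of 2, the implied decimal point sits
# two positions from the right end of the integer.
print(to_scaled_integer("22730", 2))   # 2273000  (read as 22730.00)
print(to_scaled_integer("227.30", 2))  # 22730    (read as 227.30)
```

Appending INTERNAL^SCALE zeros to the ASCII digits of a whole number and then converting to binary gives the same integer as the multiplication here.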
True. Below is the flow of user^conv^2.
If the input has no decimal point and the input length is not > 6, then my^string[0] ':=' input[0] FOR input^len BYTES.
There's another subprocedure, scale^string, that uses INTERNAL^SCALE as the scale factor and appends INTERNAL^SCALE zeroes to the end of my^string. string^len is then increased to account for the added zeroes.
After that, $ASCIITOFIXED is done. And then the add^zeros subproc is called!
This is where I'm confused. In the add^zeros subproc, reply^len is initialized first! To zero. Then add^pre^0 is checked to add leading zeroes, which is mostly not the case (I'm not really sure why it is even there in the first place). Then the flag add^dec^pt is checked, which is set for our input, so a decimal point is added with reply^string[reply^len] := "."; and reply^len is incremented by one.
And as add^post^0 is set to 2 in our case of input 22730, we go into the next condition, where a FOR i := 0 TO (add^post^0 - 1) loop does reply^string[reply^len] := "0" and again reply^len is incremented by one.
Finally input[0] is set to reply^string[0] FOR reply^len BYTES and input^len is set to reply^len. And then return;
Which means that reply^len, which is initialized to 0 at the start, would screw up the math for the conditions that come below. Wouldn't it? But I could not prove that this would eventually write zeroes to all the addresses in an infinite loop, because if this were the issue, the add^zeros subproc should have caused errors for other user conversion procedures too. I'm just a bit confused by the logic of initializing reply^len with zero.
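To make the flow easier to reason about, here is a rough Python model of add^zeros built only from the prose description above; the real epTAL source is not shown in this thread, so every detail is an assumption. In particular, add^pre^0 is treated as a simple count of leading zeros, and Python's IndexError stands in for what the native code does not have: a bounds check on the 10-byte reply^string.

```python
REPLY_SIZE = 10  # reply^string[0:9]

def add_zeros(add_pre_0: int, add_dec_pt: bool, add_post_0: int) -> bytes:
    """Build the reply string the way the description above suggests."""
    reply = bytearray(REPLY_SIZE)
    reply_len = 0                       # initialized to zero first

    for _ in range(add_pre_0):          # assumed: count of leading zeros
        reply[reply_len] = ord("0")
        reply_len += 1

    if add_dec_pt:                      # flag set for the failing input
        reply[reply_len] = ord(".")
        reply_len += 1

    for _ in range(add_post_0):         # FOR i := 0 TO add^post^0 - 1
        reply[reply_len] = ord("0")     # raises IndexError past byte 9
        reply_len += 1

    return bytes(reply[:reply_len])

print(add_zeros(0, True, 2))            # b'.00'

try:
    add_zeros(0, True, 14336)           # the value seen in the dump
except IndexError:
    print("loop ran past the end of reply^string")
```

In this model, starting reply^len at zero is harmless by itself: with add^post^0 equal to 0, 1, or 2 the loop stays well inside the buffer. The overrun only appears when the loop limit is far larger than the buffer, which fits the idea that add^post^0 held an unexpected value by the time the loop ran, though the model cannot say how it got there.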
>
> Have you looked into stack frame 7? Some of the input values from USER^INPUT^NUMERIC^CONVERSION might have been passed along to USER^CONVERSION^2, though the stack trace you showed does not indicate that it has any arguments. I'm actually a little puzzled about what USER^CONVERSION^2 is. ADD^ZEROS is a subproc of USER^INPUT^NUMERIC^CONVERSION, and so is only visible inside USER^INPUT^NUMERIC^CONVERSION. I think that means it can only be referenced from inside USER^INPUT^NUMERIC^CONVERSION or inside a subproc of USER^INPUT^NUMERIC^CONVERSION. The stack entry for USER^CONVERSION^2 makes it look like another procedure. Also its code address looks like it would be beyond the point in USER^INPUT^NUMERIC^CONVERSION where it was called from, and that would have to make it be a separate procedure, since subprocs have to be at the beginning of their enclosing procedure. Maybe ADD^ZEROS was passed as an argument that is a procedure to USER^CONVERSION^2. I don't remember whether subprocs are allowed to be passed as arguments that are procedures. I'd think not, since, at least in the TNS architecture, they would lose access to their enclosing procedure's local variables if called from another procedure's stack frame, but maybe pTAL did that differently. Anyway, look in your user conversion procedure source to see what USER^CONVERSION^2 is and whether its arguments or local variables hold any clues about what the arguments to USER^INPUT^NUMERIC^CONVERSION were, especially INTERNAL^SCALE. If you can learn what value INTERNAL^SCALE had, you could bench check the code for the exact argument values that caused the failure, or even create a small test program that calls USER^INPUT^NUMERIC^CONVERSION with those arguments and step through it in the debugger to see exactly what it does and where it goes wrong.
>
> I'm now assuming that the local variables of procedures above USER^INPUT^NUMERIC^CONVERSION in the call stack, and maybe any local variables above reply^string in USER^INPUT^NUMERIC^CONVERSION, have not been overwritten by the runaway loop, so look at them in the dump file and see what additional information they give you about the case that caused the error.
>
I think I did, and most of them had zeros. Only those which didn't, I posted here. I did this exercise as part of the BT trace that worked for me for the first time. I've already posted the values which were not zero. I didn't mention under which frame I found them; I thought it was understandable, but apparently not. A bad call from me. Sorry! :D
> >
> > If you were a Pathway developer, it probably would be pretty easy to find the screen name, requester name, original value of the screen field, unless the runaway loop has overwritten it.
> >
> > What do you mean by a Pathway developer? I take care of the pathway, and work with it - if that's what you mean. To 'take care of pathway' is loosely put, but was intended that way. There's no job role of a pathway developer here in my workplace.
>
> I meant a developer who works for HP doing Pathway product development. I thought that would be clear from context, but I now see that it was not. Sorry for not being more clear. An HP Pathway product developer would have access to a Pathway object file with all the debugging information and the source code so he or she could look into the areas of the dump file that you cannot view, and perhaps find more information about exactly what screen, screen field, and working-storage variable the conversion was working with. However, if the local variables of the stack below the user conversion procedure have been overwritten by the runaway loop, maybe an HP Pathway developer couldn't do much more than you can.
Oh I get it now. :)
>
> >
> > Also I've a few questions.
> > Why does the einspect have prompt like this - especially, what does the values in it mean?
> > (einspect 1,84):
>
> The 1,84 gives the CPU number and process number within that CPU of the process eInspect is currently looking at. eInspect can have control of several processes at the same time, and you can switch its attention among them as you like. So this is the way eInspect helps you keep track of which process you are looking at at the current moment. When looking at a dump file, it tells you the CPU and process number of the process whose state was recorded in the dump file. Only one dump file can be examined at a time, so this isn't quite as important when looking at a dump file, but it still might be helpful.
Hmm, understandable. I'll try the status <process-num> just in case. Though I'm sure they're not going to help me here :D
>
> >
> > And I told you I have two systems. For einspect: on one, help, the "an" command, etc. work; on the other, it is very limited. For the "an" command and the help command I get a reply saying no such command, or something like that. Why? Both are the same einspect version: TNS/E einspect gdb debugger (T1237 - 18 Jull 2012 13:46). They both should have worked the same way. Unless they disabled some commands on one system? Hmm?
>
> I don't have a good answer. Maybe the eInspect on one of the systems has been limited in some way. Or maybe there was some error during its installation. Or maybe your userid on one system has more privileges than your userid on the other system does, and normal Guardian file security is interfering with the execution of some commands.
Hmm, I'll check with my admin team. Maybe they have an answer to this. I just wanted to check whether you can restrict a few commands of a utility on one particular system alone. I thought that was possible for any utility that HP provides, though I don't know how.
>
> >
> > There were many dumps on my system, of which I've taken five (one shown above already, repeated below for comparison). I've listed the only three values I was able to extract from each. Let me know if it helps.
> >
> > decimal^pt^pos = 4, string^length = 6, my^string = "680640 0/,"
> > decimal^pt^pos = 3, string^length = 5, my^string = "37840. 0/,"
> > decimal^pt^pos = 3, string^length = 5, my^string = "23870. 0/,"
> > decimal^pt^pos = 4, string^length = 6, my^string = "684240 0/,"
> > decimal^pt^pos = 3, string^length = 5, my^string = "22730. 0/,"
>
> Useful information! Were the dates on those dumps all recent? If you have been getting complaints from end users about "the system always goes down when I try to do this", maybe you can learn from them which screen always causes the problem, and maybe figure out from that which field it is. That might give some more clues about what the code should have done in these cases when it clearly is not doing what was intended.
>
> >
> > All the my^string have the last four characters as the same. A space followed by a zero then a forward slash and a comma. So that may mean that they are not really inputs. There's no coincidence in life.
>
> Yes, you are right. Those last few characters in my^string are not inputs to USER^INPUT^NUMERIC^CONVERSION. They are what is left over in those memory locations from previous use of those memory locations by procedures that had been called earlier and already returned.
>
> >
> > If the code is required, I can post parts of it as per the required variable that's needed to be analyzed. But anyway, I'm going to take a hard look at the code tomorrow. With these values as inputs, I think we can still crack a solution out of this.
>
> We probably won't need to see much, if any, of the code. If there are particular statements that you are not sure what they do, you might have to post those statements, but for anything else, your description of what the code is doing, or intended to do, probably will be enough.
> >
> > Thanks all for your valuable inputs.
The user^conv^2 flow above is not everything in the code; it is my deduction of how the flow should have gone for this case, and part of that trace, because 1) the input was 22730 and 2) line 77 mentioned in the BT falls in this subproc. But add^zeros is also called for two other conditions inside user^conv^2, such as the following:
If a decimal point is found, check the position of the decimal point.
Decimal point at position 0, and anywhere else. All these cases are handled, but the logic is complex. Everything tells me that the add^zeros subproc is wrong, because all the changes are made to my^string, while reply^string is what add^zeros uses. I understand that you might not follow all of this, because you don't have a way to look at the code. Any suggestions would be very helpful. I'll post the C problem tonight, Keith. Just didn't get enough time. :)