Tips for debugging heap corruption on IA64 C program?

John E. Malmberg

unread,

Jul 17, 2016, 8:01:05 PM7/17/16

to

I am trying to chase down an issue where the chdir() function in the VMS
CRTL is access violating in LIB$VM_FREE().

I am only seeing it with the coreutils utilities that use a common
routine that simulates the fchdir() utility.

This file in question is a symbolic link that points to a non-existant file.

If I use a simple program to access just the same file with chdir() and
lstat() it works just fine, so it is not a path issue.

But when run through the coreutils 8.25 chmod() utility, that specific
path fails with the above access violation.

I have added fprintf() to report the malloc() equivalent and free()
calls in the fchdir() wrapper, but so far have not found anything that
indicates where the heap corruption is.

This debugging is being done on VMS/Itanium 8.4.

Regards,
-John
wb8...@qsl.net_work

John Reagan

unread,

Jul 17, 2016, 9:14:53 PM7/17/16

to

You've tried the heap analyzer, right? See the debugger manual.

http://h71000.www7.hp.com/doc/84final/4538/4538pro_023.html#dbg_analyzer_ch

John E. Malmberg

unread,

Jul 17, 2016, 9:53:21 PM7/17/16

to

On 7/17/2016 8:14 PM, John Reagan wrote:
> On Sunday, July 17, 2016 at 8:01:05 PM UTC-4, John E. Malmberg wrote:
>> I am trying to chase down an issue where the chdir() function in the VMS
>> CRTL is access violating in LIB$VM_FREE().
>

> You've tried the heap analyzer, right? See the debugger manual.
>
> http://h71000.www7.hp.com/doc/84final/4538/4538pro_023.html#dbg_analyzer_ch

I was totally ignorant of the heap analyzer.

Running the program now under it. It seems to have greatly slowed it down.

I use the "start heap_analyzer" in the debugger window and then clicked
start on the heap_analyzer and go in the debugger window.

Not seeing any activity under MON/SYSTEM or output from the program. It
looks like it is hung.

Regards,
-John

John E. Malmberg

unread,

Jul 18, 2016, 12:55:28 AM7/18/16

to

On 7/17/2016 8:53 PM, John E. Malmberg wrote:
> On 7/17/2016 8:14 PM, John Reagan wrote:
>> On Sunday, July 17, 2016 at 8:01:05 PM UTC-4, John E. Malmberg wrote:
>>> I am trying to chase down an issue where the chdir() function in the VMS
>>> CRTL is access violating in LIB$VM_FREE().
>>
>> You've tried the heap analyzer, right? See the debugger manual.
>>
>> http://h71000.www7.hp.com/doc/84final/4538/4538pro_023.html#dbg_analyzer_ch
>>
>
> I was totally ignorant of the heap analyzer.
>
> Running the program now under it. It seems to have greatly slowed it down.

Running with the heap_analyzer with the program launched from Bash.

Running with the heap_analyzer with the program launched from DCL works.
I am now looking at it at the point of the crash.

So now, How do I use it to find the presumed heap corruption?

Nothing seems to be jumping out at me.

%DEBUG-W-UNAOPNSRC, unable to open source file
$1$DGA106:[LIBRTL_2.SRC]LIBMALLOC.C;1
-RMS-F-DEV, error in device name or inappropriate device type for operation
%SYSTEM-F-ACCVIO, access violation, reason mask=00, virtual
address=FFFFFFFFFFFFFFF5, PC=FFFFFFFF84073DB0, PS=0000001B

DBG> show calls
module name routine name line rel PC abs PC
*LIB$MALLOC LIB$VM_FREE 26330 0000000000006440
FFFFFFFF84073DB0
*C$MALLOC decc$free 30927 0000000000000030
FFFFFFFF84600490
*C$FINDFILE decc$$dealloc_fcb
68503 00000000000025B0
FFFFFFFF84579A70
*C$DEFAULT_DIR chdir 46437 0000000000003D60
FFFFFFFF846096F0
*vms_terminal_io
vms_chdir 9141 0000000000000A50
0000000000050670
*vms_terminal_io
vmsmode_chdir
25262 0000000000001F50
0000000000051B70
*vms_terminal_io
vms_fchdir 25498 00000000000031A0
0000000000052DC0
*fstatat fstatat 28351 0000000000000CF0
0000000000035830

-John
wb8tyw@qsl_network

Louis Krupp

unread,

Jul 18, 2016, 4:24:14 AM7/18/16

to

On Sun, 17 Jul 2016 23:55:31 -0500, "John E. Malmberg"
<wb8...@qsl.net_work> wrote:

>On 7/17/2016 8:53 PM, John E. Malmberg wrote:
>> On 7/17/2016 8:14 PM, John Reagan wrote:
>>> On Sunday, July 17, 2016 at 8:01:05 PM UTC-4, John E. Malmberg wrote:
>>>> I am trying to chase down an issue where the chdir() function in the VMS
>>>> CRTL is access violating in LIB$VM_FREE().
>>>
>>> You've tried the heap analyzer, right? See the debugger manual.
>>>
>>> http://h71000.www7.hp.com/doc/84final/4538/4538pro_023.html#dbg_analyzer_ch
>>>
>>
>> I was totally ignorant of the heap analyzer.
>>
>> Running the program now under it. It seems to have greatly slowed it down.
>
>Running with the heap_analyzer with the program launched from Bash.
>
>Running with the heap_analyzer with the program launched from DCL works.
> I am now looking at it at the point of the crash.
>
>So now, How do I use it to find the presumed heap corruption?
>
>Nothing seems to be jumping out at me.
>
>%DEBUG-W-UNAOPNSRC, unable to open source file
>$1$DGA106:[LIBRTL_2.SRC]LIBMALLOC.C;1
>-RMS-F-DEV, error in device name or inappropriate device type for operation
>%SYSTEM-F-ACCVIO, access violation, reason mask=00, virtual
>address=FFFFFFFFFFFFFFF5, PC=FFFFFFFF84073DB0, PS=0000001B

It's been a while since I did anything with VMS, but my guess is that
LIB$VM_FREE is trying to read memory at address 0xFFFFFFFFFFFFFFF5
(decimal -11, an invalid address).

If you're comfortable running programs under the debugger, you could
do a variety of things, probably taking advantage of the fact that the
offending instruction is reported to be at address 0x0000000000006440
relative to module LIB$MALLOC.

If you don't want to dive into that, you could consider that in any
heap implementation, the free() routine is likely to fail for one of
three reasons:

1. You're trying to free memory using a pointer you've already passed
to free().

2. You're trying to free memory using a pointer that wasn't returned
from malloc().

3. When you malloc an area of memory, the heap allocates a few words
before and after your area which are used to keep track of the memory
preceding and following your area. If you pass a legitimate pointer
to free(), but memory before or after the area you're trying to free
has been overwritten by another error in your code, bad things can
happen.

You might want to deskcheck your code once more to make sure you're
not doing #1. The others can be a little (or a lot) harder to track
down.

Good luck.

Louis

John Reagan

unread,

Jul 18, 2016, 8:08:44 AM7/18/16

to

On Monday, July 18, 2016 at 12:55:28 AM UTC-4, John E. Malmberg wrote:
> On 7/17/2016 8:53 PM, John E. Malmberg wrote:
> > On 7/17/2016 8:14 PM, John Reagan wrote:
> >> On Sunday, July 17, 2016 at 8:01:05 PM UTC-4, John E. Malmberg wrote:
> >>> I am trying to chase down an issue where the chdir() function in the VMS
> >>> CRTL is access violating in LIB$VM_FREE().
> >>
> >> You've tried the heap analyzer, right? See the debugger manual.
> >>
> >> http://h71000.www7.hp.com/doc/84final/4538/4538pro_023.html#dbg_analyzer_ch
> >>
> >
> > I was totally ignorant of the heap analyzer.
> >
> > Running the program now under it. It seems to have greatly slowed it down.
>
> Running with the heap_analyzer with the program launched from Bash.
>
> Running with the heap_analyzer with the program launched from DCL works.
> I am now looking at it at the point of the crash.
>
> So now, How do I use it to find the presumed heap corruption?
>
> Nothing seems to be jumping out at me.
>

The visual part of the heap analyzer (when run under the DECwindows debugger) can often help you spot things like double-deallocation and the like. It won't help with stale pointers or even uninitialized pointers.

What does your call to chdir() look like? How many arguments? And their values?

John Reagan

unread,

Jul 18, 2016, 8:14:31 AM7/18/16

to

chdir() will call dealloc_fcb if something went wrong and it is trying to return an error. The fcb contains a dozen or so pointers so dealloc_fcb will attempt to call free() on each of them.

John E. Malmberg

unread,

Jul 18, 2016, 9:34:37 AM7/18/16

to

On 7/18/2016 7:14 AM, John Reagan wrote:
> On Monday, July 18, 2016 at 8:08:44 AM UTC-4, John Reagan wrote:
>> On Monday, July 18, 2016 at 12:55:28 AM UTC-4, John E. Malmberg wrote:
>>> On 7/17/2016 8:53 PM, John E. Malmberg wrote:
>>>> On 7/17/2016 8:14 PM, John Reagan wrote:
>>>>> On Sunday, July 17, 2016 at 8:01:05 PM UTC-4, John E. Malmberg wrote:
>>>>>> I am trying to chase down an issue where the chdir() function in the VMS
>>>>>> CRTL is access violating in LIB$VM_FREE().
>>>>>
>>

>> The visual part of the heap analyzer (when run under the
>> DECwindows debugger) can often help you spot things like
>> double-deallocation and the like. It won't help with stale
>> pointers or even uninitialized pointers.

I think I am looking for a buffer overrun or underrun.
None are showing up in my running dump of memory allocations() and
free() in the suspect module.

>> What does your call to chdir() look like? How many arguments? And their values?

decc$feature_set("DECC$FILENAME_UNIX_ONLY", __FEATURE_MODE_CURVAL, 0);

file = "lcl_root:[diffutils.tests.gt-no-dereference^.umdC]"

Memory address for file: 3478480 - 50 bytes allocated by strdup()

result = decc$chdir(file);

> chdir() will call dealloc_fcb if something went wrong and it is
> trying to return an error. The fcb contains a dozen or so pointers
> so dealloc_fcb will attempt to call free() on each of them.

The directory exists. A standalone C program can all chdir()
successfully on it, so it is not an issue with that specific path.

So I would not have expected a chdir() call to fail.

If I am reading the calling sequence right, the lib$vm_free is called
with 4434592.

That address is not showing up on the my instrumentation of the suspect
module.

Close addresses are:
decc$getcwd 4429984 string size 50 freed
decc$getcwd 4429984 string size 26 freed
decc$getcwd 4429984 string size 26 freed
decc$getcwd 4429984 string size 26 freed
decc$getcwd 4429984 string size 50
access violation in lib$vm_free(4434592)

4608 bytes between the decc$getcwd allocated memory and the memory
failed to be freed.

Regards,
-John

Stephen Hoffman

unread,

Jul 18, 2016, 10:09:59 AM7/18/16

to

On 2016-07-18 00:01:09 +0000, John E. Malmberg said:

> I am trying to chase down an issue where the chdir() function in the
> VMS CRTL is access violating in LIB$VM_FREE().
>
> I am only seeing it with the coreutils utilities that use a common
> routine that simulates the fchdir() utility.
>
> This file in question is a symbolic link that points to a non-existant file.
>
> If I use a simple program to access just the same file with chdir() and
> lstat() it works just fine, so it is not a path issue.
>
> But when run through the coreutils 8.25 chmod() utility, that specific
> path fails with the above access violation.

This is usually a buffer overrun in a previous allocation; an
allocation that's in the heap at a virtual address earlier in the heap
pointer chains. I prefer to centralize the allocation and
deallocation calls, and to implement and use fenceposts to try to
detect these corruptions (slightly) ahead of the crash. Once the
corruption is detected, rummaging the previous allocation in the heap
memory can sometimes tell you what corrupted the heap data structures.
It's also possible to have rogue heap corruptions — a stale or
volatile pointer to an IOSB is a popular mechanism for that — but it's
usually code that copied too much data into a buffer on the heap, or
code that copied data into previously-deallocated heap storage (and
overwriting the buffer at deallocation with sentinel values can help
with those).

OpenVMS lacks valgrind or such tools — part of those "issues with the
development tools" topic that I've commented on before — and the heap
analyzer isn't always as useful as I'd want it to be.

Related:
http://labs.hoffmanlabs.com/node/401
http://labs.hoffmanlabs.com/node/260

--
Pure Personal Opinion | HoffmanLabs LLC

Louis Krupp

unread,

Jul 18, 2016, 6:22:20 PM7/18/16

to

Can you put all the code into one source file and trim it down to a
relatively small program that reproduces the problem?

Louis

John E. Malmberg

unread,

Jul 18, 2016, 7:38:06 PM7/18/16

to

Unfortunately a small program does not reproduce the issue.

Regards,
-John

Louis Krupp

unread,

Jul 18, 2016, 10:51:27 PM7/18/16

to

On Mon, 18 Jul 2016 18:38:11 -0500, "John E. Malmberg"

<wb8...@qsl.net_work> wrote:

>On 7/18/2016 5:22 PM, Louis Krupp wrote:
>> On Mon, 18 Jul 2016 08:34:41 -0500, "John E. Malmberg"
>> <wb8...@qsl.net_work> wrote:
>>
>> Can you put all the code into one source file and trim it down to a
>> relatively small program that reproduces the problem?
>
>Unfortunately a small program does not reproduce the issue.
>
>Regards,
>-John
>

It's starting to sound more and more like a wild pointer used to dump
-11 on top of what should have been another pointer.

Does -11 sound like a value (perhaps an error code) that your program
might be storing somewhere?

How do you feel about stepping through Itanium machine code one
instruction at a time? You don't have to understand the instruction
set completely; you just have to have an idea of what's going on.

If you can figure out where (in memory) the -11 is coming from, you
might be able to set a watchpoint on that address, and you might be
able to see where it's being trashed.

Or, if you want to do some more deskchecking, look for global pointers
that hang around after their target has been deallocated or has gone
out of scope.

Something like this, for example:

char *p;

char *f(void)
{
char buf[80];
return buf;
}

p = f();
*p = 'x';

But you probably knew that.

If all else fails, what sort of project is it? Could you allow
someone to ssh (or whatever the VMS equivalent is) into your system
and try a debug run? I'm sure there are people here who've stepped
through Itanium code.

Louis

John E. Malmberg

unread,

Jul 18, 2016, 11:10:54 PM7/18/16

to

The project is coreutils which is downloadable from the vms-ports
sourceforge site.

The problem showed up when attempting to run the no-dereference test for
diffutils in the test cleanup code where it cleans up a dangling
symbolic link.

The diffutils code with the no-dereference test disabled is also at the
vms-ports sourceforge site.

I can not allow direct login's into my VMS system. My ISP will not
allow it.

If I knew how to look up the size information for a given pointer to
heap storage, I can add that to the instrumentation in my local copy to
see if there is a buffer overrun.

Regards,
-John

Craig A. Berry

unread,

Jul 18, 2016, 11:30:01 PM7/18/16

to

On 7/18/16 9:51 PM, Louis Krupp wrote:

> Does -11 sound like a value (perhaps an error code) that your program
> might be storing somewhere?

Probably unrelated but ya never know. Are threads enabled, or is there
any DECthreads debugging going on the the CRTL chdir implementation?

******************************
SYS$COMMON:[DECC$LIB.REFERENCE.SYS$STARLET_C]PTHREAD.H;1

PTHREAD_DBG_CONFLICT = -11,

It doesn't really look like a random value off the stack:

$ x = -11
$ sh sym x
X = -11 Hex = FFFFFFF5 Octal = 37777777765

John E. Malmberg

unread,

Jul 18, 2016, 11:36:46 PM7/18/16

to

No threads in my code. I do not know about the CRTL.

Regards,
-John

Louis Krupp

unread,

Jul 19, 2016, 2:44:01 AM7/19/16

to

On Mon, 18 Jul 2016 22:10:59 -0500, "John E. Malmberg"

<wb8...@qsl.net_work> wrote:

>On 7/18/2016 9:51 PM, Louis Krupp wrote:
>> On Mon, 18 Jul 2016 18:38:11 -0500, "John E. Malmberg"
>> <wb8...@qsl.net_work> wrote:
>>
>> If all else fails, what sort of project is it? Could you allow
>> someone to ssh (or whatever the VMS equivalent is) into your system
>> and try a debug run? I'm sure there are people here who've stepped
>> through Itanium code.
>
>The project is coreutils which is downloadable from the vms-ports
>sourceforge site.

This looks like it could be the directory, but it's just a guess:

https://sourceforge.net/p/gnv/coreutils/ci/v8.25_reference/tree/vms_source/coreutils/

Where is the source for fstastat and for vms_terminal_io?

One thing you might want to do is in fchdir: print or log the file
descriptor argument. Unless I'm looking at an old version,
vms_lookup_fd indexes a global array by its argument fd without
checking to see if it's in bounds. I would also print (or log)
info->vmscwd just to see if it makes sense.

Assuming all that is OK, which it probably is, it's bizarre. The CRTL
allocates an FCB (file control block structure, I believe) for the
directory and then frees it only to find that a structure member that
should be a pointer is garbage instead.

On mature reflection, that garbage probably isn't -11. It's probably
-3 or 5 or something. There would be no reason for free() to
dereference the pointer it's been passed, since the nothing in the
allocated area is of interest. free() *would* want to look at the
link words immediately before the allocated area; to find the
preceding 64 bits (presumably an address), it would subtract 8 from
its pointer argument. To get the 64 bits before those, it would
subtract 16.

If you set a breakpoint at decc$$dealloc_fcb and step through the code
one instruction at a time, you'll probably hit relative PC 0x30 pretty
quickly. That (or the preceding instruction) is probably where it's
deallocating the FCB component by calling free() on a pointer which
has gone rogue.

Louis

Scott Dorsey

unread,

Jul 19, 2016, 8:51:32 AM7/19/16

to

<wb8...@qsl.net_work> wrote:
>On 7/18/2016 5:22 PM, Louis Krupp wrote:
>> On Mon, 18 Jul 2016 08:34:41 -0500, "John E. Malmberg"
>> <wb8...@qsl.net_work> wrote:
>>
>> Can you put all the code into one source file and trim it down to a
>> relatively small program that reproduces the problem?
>
>Unfortunately a small program does not reproduce the issue.

Whenever making a change in one part of the program alters the behavior of
a bug in an unrelated part of the program, you have an issue with memory
being overwritten. So if taking every other part of the program away and
leaving a stub means the bug disappears from that stub, you have a pointer
that isn't pointing where it should be.

Try compiling with array bounds checking on, and make sure all your parameters
match up perfectly.... But worse comes to worst you can use the debugger to
set a breakpoint to stop whenever that address is touched.
--scott

--
"C'est un Nagra. C'est suisse, et tres, tres precis."

John E. Malmberg

unread,

Jul 19, 2016, 9:39:49 AM7/19/16

to

On 7/19/2016 1:44 AM, Louis Krupp wrote:
> On Mon, 18 Jul 2016 22:10:59 -0500, "John E. Malmberg"
> <wb8...@qsl.net_work> wrote:
>
>> On 7/18/2016 9:51 PM, Louis Krupp wrote:
>>> On Mon, 18 Jul 2016 18:38:11 -0500, "John E. Malmberg"
>>> <wb8...@qsl.net_work> wrote:
>>>
>>> If all else fails, what sort of project is it? Could you allow
>>> someone to ssh (or whatever the VMS equivalent is) into your system
>>> and try a debug run? I'm sure there are people here who've stepped
>>> through Itanium code.
>>
>> The project is coreutils which is downloadable from the vms-ports
>> sourceforge site.
>
> This looks like it could be the directory, but it's just a guess:
>
> https://sourceforge.net/p/gnv/coreutils/ci/v8.25_reference/tree/vms_source/coreutils/
>
> Where is the source for fstastat

../reference/coreutils/... Contains the unmodified GNU source for v8.25
Coreutils.

> and for vms_terminal_io?

vms/ contains all the VMS specific files, and the instructions for building.

I use a logical name search list as described in the build instructions.

lcl_root: is where all build products are created.

vms_root: points to the VMS specific code and is read-only to build system.

src_root: points to the un-modified upstream code and can also be
pointed at a checkout of any of the branches, to allow building against
master, pre-betas, alphas, etc.

> One thing you might want to do is in fchdir: print or log the file
> descriptor argument.

I am doing that on my local copy. No buffer overruns are showing up on
any storage that is allocated by malloc. I do not know how much storage
is actually allocated by strdup() or getcwd().

I did find two cases in vms_terminal_io where a potential buffer overrun
could exist in my local copy and put in checks for it. In the data used
in the test run, non-of the strings ever reached the length that could
cause the buffer overrun. None of the new checks got tripped.

> Unless I'm looking at an old version,
> vms_lookup_fd indexes a global array by its argument fd without
> checking to see if it's in bounds.

The global array is created based on maximum number of file descriptors
that the CRTL is allowed to create. Unless there is a serious bug in
the CRTL, a buffer overrun there can not happen.

However I am adding a bounds check to cause it to return a NULL, errno
EIO if out of bounds, and the caller will return a failure status in
that case.

> I would also print (or log)
> info->vmscwd just to see if it makes sense.

I did that, and nothing surprising showed up.

> Assuming all that is OK, which it probably is, it's bizarre. The CRTL
> allocates an FCB (file control block structure, I believe) for the
> directory and then frees it only to find that a structure member that
> should be a pointer is garbage instead.

The CRTL is apparently sharing the same VM ZONE for its internal heap
use as with the user with no guard zone in between. While this can make
the code a bit easier to write, it also makes a heap corruption by user
code easily able to damage the CRTL internal structures.

> On mature reflection, that garbage probably isn't -11. It's probably
> -3 or 5 or something. There would be no reason for free() to
> dereference the pointer it's been passed, since the nothing in the
> allocated area is of interest. free() *would* want to look at the
> link words immediately before the allocated area; to find the
> preceding 64 bits (presumably an address), it would subtract 8 from
> its pointer argument. To get the 64 bits before those, it would
> subtract 16.

Note the memory causing the accvio should not have ever been touched by
the user program while the CRTL was using it.

> If you set a breakpoint at decc$$dealloc_fcb and step through the code
> one instruction at a time, you'll probably hit relative PC 0x30 pretty
> quickly. That (or the preceding instruction) is probably where it's
> deallocating the FCB component by calling free() on a pointer which
> has gone rogue.

At that point the heap is already corrupted, I would need to know how to
get the the information about a given pointer returned by malloc() in
order to make sense of it. And if I had that information, I could add
it to my instrumentation and sanity check it.

My instrumentation shows that I am currently typically allocating and
freeing the same amount of memory that is tied to the structure returned
by info. I am changing the code to only allocating the memory for path
storage once and keeping it tied to the structure for the length of the
program and adding a flag to track how much of the allocated space has
valid data.

If this does not clear up the problem, then I am looking in the wrong
module.

Regards,
-John

Louis Krupp

unread,

Jul 19, 2016, 12:38:40 PM7/19/16

to

On Tue, 19 Jul 2016 08:39:55 -0500, "John E. Malmberg"

As Scott said, your problem could be *anywhere*, and I'd agree that
you're going to have to get down in the mud with the debugger and
figure out the address of the bad field in the FCB and set a
watchpoint to find out when and how it's being corrupted. As I
recall, the VMS debugger was a pleasure to use, relatively speaking.

If you change enough other things, you might succeed in moving things
around so that the program runs to completion, but you might also be
masking a bug that you really want to fix.

Louis

John E. Malmberg

unread,

Jul 20, 2016, 9:35:42 AM7/20/16

to

If I could get the heap tracking information for a pointer returned from
malloc() I could probably find where the issue is.

> If you change enough other things, you might succeed in moving things
> around so that the program runs to completion, but you might also be
> masking a bug that you really want to fix.

Yes, that is a risk.

However, this code is exercised a lot when running configure scripts and
also in other scripts similar to the one that it fails in.

The main thing that is different in the failure case is that one of the
subdirectories contains a dangling symbolic link, and that is normally
not the case in the others.

In other runs, and apparently covered up by the text written by my
dumping what pointers where malloced, the program was reporting that
lstat() failed on that dangling symbolic link. lstat() should never
fail at that point, because lstat() is working on the object containing
the link and not concerned with the link.

That could either be a bug in lstat(), or it could be an indication that
the heap is already corrupt. This happens about 2 loops before the
access violation if I am remembering it right.

Running on Alpha produces a different failure pattern. On IA64, the
chmod() fails. On Alpha, the rm fails, and I am seeing stack unwinds.

%SYSTEM-W-UNWIND, unwind currently in progress
%TRACE-W-TRACEBACK, symbolic stack dump follows
%TRACE-I-END, end of TRACE stack dump
%SYSTEM-W-UNWIND, unwind currently in progress
%SYSTEM-W-UNWIND, unwind currently in progress
%TRACE-W-TRACEBACK, symbolic stack dump follows
%TRACE-I-END, end of TRACE stack dump
%SYSTEM-W-UNWIND, unwind currently in progress
%SYSTEM-W-UNWIND, unwind currently in progress
%TRACE-W-TRACEBACK, symbolic stack dump follows
%TRACE-I-END, end of TRACE stack dump
%SYSTEM-W-UNWIND, unwind currently in progress
%SYSTEM-W-UNWIND, unwind currently in progress
%TRACE-W-TRACEBACK, symbolic stack dump follows
%TRACE-I-END, end of TRACE stack dump
%SYSTEM-W-UNWIND, unwind currently in progress
%SYSTEM-W-UNWIND, unwind currently in progress
%TRACE-W-TRACEBACK, symbolic stack dump follows
%TRACE-I-END, end of TRACE stack dump
%SYSTEM-W-UNWIND, unwind currently in progress
%SYSTEM-W-UNWIND, unwind currently in progress
%TRACE-W-TRACEBACK, symbolic stack dump follows
%TRACE-I-END, end of TRACE stack dump
%SYSTEM-W-UNWIND, unwind currently in progress

%SYSTEM-F-ACCVIO, access violation, reason mask=00, virtual

address=00000000010E
0005, PC=FFFFFFFF80F64354, PS=0000001B
%TRACE-F-TRACEBACK, symbolic stack dump follows
image module routine line rel PC abs PC
DECC$SHR_EV56 0 00000000000CC354
FFFFFFFF80F64354
rm rm __main 33994 0000000000000424
0000000000030424
rm 0 000000000004885C
000000000004885C
0 FFFFFFFF8037FC44
FFFFFFFF8037FC44
%TRACE-I-END, end of TRACE stack dump

Regards,
-John
wb8...@qsl.net_work

Louis Krupp

unread,

Jul 20, 2016, 2:23:26 PM7/20/16

to

On Wed, 20 Jul 2016 08:35:49 -0500, "John E. Malmberg"

<wb8...@qsl.net_work> wrote:

>On 7/19/2016 11:38 AM, Louis Krupp wrote:
>> On Tue, 19 Jul 2016 08:39:55 -0500, "John E. Malmberg"
>> <wb8...@qsl.net_work> wrote:
>>
>> As Scott said, your problem could be *anywhere*, and I'd agree that
>> you're going to have to get down in the mud with the debugger and
>> figure out the address of the bad field in the FCB and set a
>> watchpoint to find out when and how it's being corrupted. As I
>> recall, the VMS debugger was a pleasure to use, relatively speaking.
>
>If I could get the heap tracking information for a pointer returned from
>malloc() I could probably find where the issue is.
>
>> If you change enough other things, you might succeed in moving things
>> around so that the program runs to completion, but you might also be
>> masking a bug that you really want to fix.
>
>Yes, that is a risk.
>

You have a reproducible bug. Do you know how lucky you are? There
are programmers in ... I don't know, Denver ... who are chasing
phantom bugs and who would do *anything* for a failure that can be
reproduced on demand in a reasonable length of time.

There are times when the only way to figure out what's going on is to
let the debugger disassemble the instructions leading up to the
failure and figure out what address is being poisoned and what's in it
and then set a watchpoint and run the program again and watch the bad
value (I'm still betting it's a -3) dumped on top of a pointer and
then go back up the stack and see who did what.

This might be one of those times. You might have to learn a bit about
the Itanium instruction set and it might take a while and it might be
tedious, but in the end it might be the quicker than speculating on
what might have gone wrong.

I've usually done this sort of thing at night. I would get some tea
and maybe some chocolate and crank up my boombox and start with
Beethoven's 7th or maybe Bruckner's 7th depending on the difficulty of
the bug and I would transition to Runrig or something as the evening
wore on and if all went well I'd have the problem fixed or at least
identified by the time my colleagues showed up for work in the
morning. Then I'd go home and get some sleep and I wouldn't have to
sit around all day listening to when is the bug going to be fixed and
is it really a bug and why are we doing it this way and maybe you
should get someone else to help you while you go work on something
else and so on.

Have fun. Get dark chocolate. Too much sugar will ruin your
concentration.

Louis

hb

unread,

Jul 20, 2016, 3:46:25 PM7/20/16

to

On 07/20/2016 03:35 PM, John E. Malmberg wrote:
> If I could get the heap tracking information for a pointer returned from
> malloc() I could probably find where the issue is.

The usual one in the CRTL or the one with the heap analyzer?

Dunno whether this helps or you already know this. It looks like there
are two quadwords just below the malloced memory (the address returned),
which seem to keep track of the allocated memory. The info which I can
identify is the size of the malloced memory - including the two
quadwords - and a flag 0xF00D marking it used.

(Then there may be pointer to the start of the unallocated memory, but
other than that and its size I don't think that there is tracking
information.)

If the memory is freed the flag is changed to 0x7777. And it seems the
free memory is put at the beginning of a "free list". The next pointer
in this list is at the first longword of the previously allocated
memory. It is a zero terminated list.

This is what I see with a simple example in the debugger -
interpretation and/or mis-interpretation is owned by me:-):

DBG> Step
stepped to F\f\%LINE 2
2: int i = 32;
DBG> Step
stepped to F\f\%LINE 4
4: p1=malloc(i);
DBG> Step
stepped to F\f\%LINE 5
5: p1[0]=0x01;
DBG> Step
stepped to F\f\%LINE 6
6: p1[i-1]=0x11;
DBG> Step
stepped to F\f\%LINE 8
8: p2=malloc(i);
DBG> Step
stepped to F\f\%LINE 9
9: p2[0]=0x02;
DBG> Step
stepped to F\f\%LINE 10
10: p2[i-1]=0x22;
DBG> Step
stepped to F\f\%LINE 12
12: p3=malloc(i);
DBG> Step
stepped to F\f\%LINE 13
13: p3[0]=0x03;
DBG> Step
stepped to F\f\%LINE 14
14: p3[i-1]=0x33;
DBG> Step
stepped to F\f\%LINE 16
16: free(p2);
DBG> ex p1, p2, p3
F\f\p1: 002BA010
F\f\p2: 002BA040
F\f\p3: 002BA070
DBG> ex 002BA010-20:002BA070+20+20
00000000002B9FF0: 0000000000000000
00000000002B9FF8: 0000000000000000
00000000002BA000: 0F00D003300000030
00000000002BA008: 0000000000000000
00000000002BA010: 0000000000000001
00000000002BA018: 0000000000000000
00000000002BA020: 0000000000000000
00000000002BA028: 1100000000000000
00000000002BA030: 0F00D003300000030
00000000002BA038: 0000000000000000
00000000002BA040: 0000000000000002
00000000002BA048: 0000000000000000
00000000002BA050: 0000000000000000
00000000002BA058: 2200000000000000
00000000002BA060: 0F00D003300000030
00000000002BA068: 0000000000000000
00000000002BA070: 0000000000000003
00000000002BA078: 0000000000000000
00000000002BA080: 0000000000000000
00000000002BA088: 3300000000000000
00000000002BA090: 0000000000000000
00000000002BA098: 0000000000000000
00000000002BA0A0: 0000000000000000
00000000002BA0A8: 0000000000000000
00000000002BA0B0: 0000000000000000
DBG> Step
stepped to F\f\%LINE 17
17: free(p3);
DBG> ex 002BA010-20:002BA070+20+20
00000000002B9FF0: 0000000000000000
00000000002B9FF8: 0000000000000000
00000000002BA000: 0F00D003300000030
00000000002BA008: 0000000000000000
00000000002BA010: 0000000000000001
00000000002BA018: 0000000000000000
00000000002BA020: 0000000000000000
00000000002BA028: 1100000000000000
00000000002BA030: 7777003300000030
00000000002BA038: 0000000000000000
00000000002BA040: 0000000000000000
00000000002BA048: 0000000000000000
00000000002BA050: 0000000000000000
00000000002BA058: 2200000000000000
00000000002BA060: 0F00D003300000030
00000000002BA068: 0000000000000000
00000000002BA070: 0000000000000003
00000000002BA078: 0000000000000000
00000000002BA080: 0000000000000000
00000000002BA088: 3300000000000000
00000000002BA090: 0000000000000000
00000000002BA098: 0000000000000000
00000000002BA0A0: 0000000000000000
00000000002BA0A8: 0000000000000000
00000000002BA0B0: 0000000000000000
DBG> Step
stepped to F\f\%LINE 18
18: free(p1);
DBG> ex 002BA010-20:002BA070+20+20
00000000002B9FF0: 0000000000000000
00000000002B9FF8: 0000000000000000
00000000002BA000: 0F00D003300000030
00000000002BA008: 0000000000000000
00000000002BA010: 0000000000000001
00000000002BA018: 0000000000000000
00000000002BA020: 0000000000000000
00000000002BA028: 1100000000000000
00000000002BA030: 7777003300000030
00000000002BA038: 0000000000000000
00000000002BA040: 0000000000000000
00000000002BA048: 0000000000000000
00000000002BA050: 0000000000000000
00000000002BA058: 2200000000000000
00000000002BA060: 7777003300000030
00000000002BA068: 0000000000000000
00000000002BA070: 00000000002BA040
00000000002BA078: 0000000000000000
00000000002BA080: 0000000000000000
00000000002BA088: 3300000000000000
00000000002BA090: 0000000000000000
00000000002BA098: 0000000000000000
00000000002BA0A0: 0000000000000000
00000000002BA0A8: 0000000000000000
00000000002BA0B0: 0000000000000000
DBG> Step
stepped to F\f\%LINE 19
19: }
DBG> ex 002BA010-20:002BA070+20+20
00000000002B9FF0: 0000000000000000
00000000002B9FF8: 0000000000000000
00000000002BA000: 7777003300000030
00000000002BA008: 0000000000000000
00000000002BA010: 00000000002BA070
00000000002BA018: 0000000000000000
00000000002BA020: 0000000000000000
00000000002BA028: 1100000000000000
00000000002BA030: 7777003300000030
00000000002BA038: 0000000000000000
00000000002BA040: 0000000000000000
00000000002BA048: 0000000000000000
00000000002BA050: 0000000000000000
00000000002BA058: 2200000000000000
00000000002BA060: 7777003300000030
00000000002BA068: 0000000000000000
00000000002BA070: 00000000002BA040
00000000002BA078: 0000000000000000
00000000002BA080: 0000000000000000
00000000002BA088: 3300000000000000
00000000002BA090: 0000000000000000
00000000002BA098: 0000000000000000
00000000002BA0A0: 0000000000000000
00000000002BA0A8: 0000000000000000
00000000002BA0B0: 0000000000000000
DBG>

VAXman-

unread,

Jul 20, 2016, 4:36:14 PM7/20/16

to

In article <6levobt6u4a5rlohh...@4ax.com>, Louis Krupp <lkr...@nospam.pssw.com.invalid> writes:
>On Wed, 20 Jul 2016 08:35:49 -0500, "John E. Malmberg"

><wb8...@qsl.net_work> wrote:
>
>>On 7/19/2016 11:38 AM, Louis Krupp wrote:
>>> On Tue, 19 Jul 2016 08:39:55 -0500, "John E. Malmberg"
>>> <wb8...@qsl.net_work> wrote:
>>>
>>> As Scott said, your problem could be *anywhere*, and I'd agree that
>>> you're going to have to get down in the mud with the debugger and
>>> figure out the address of the bad field in the FCB and set a
>>> watchpoint to find out when and how it's being corrupted. As I
>>> recall, the VMS debugger was a pleasure to use, relatively speaking.
>>

>>If I could get the heap tracking information for a pointer returned from=
>=20

>>malloc() I could probably find where the issue is.
>>
>>> If you change enough other things, you might succeed in moving things
>>> around so that the program runs to completion, but you might also be
>>> masking a bug that you really want to fix.
>>
>>Yes, that is a risk.
>>
>

>You have a reproducible bug. Do you know how lucky you are? There

Yup. Usually, if I can see it in action, I can find its cause.

--
VAXman- A Bored Certified VMS Kernel Mode Hacker VAXman(at)TMESIS(dot)ORG

I speak to machines with the voice of humanity.

John E. Malmberg

unread,

Jul 21, 2016, 2:04:39 AM7/21/16

to

On 7/20/2016 2:45 PM, hb wrote:
> On 07/20/2016 03:35 PM, John E. Malmberg wrote:
>> If I could get the heap tracking information for a pointer returned from
>> malloc() I could probably find where the issue is.
>
> The usual one in the CRTL or the one with the heap analyzer?
>
> Dunno whether this helps or you already know this. It looks like there
> are two quadwords just below the malloced memory (the address returned),
> which seem to keep track of the allocated memory. The info which I can
> identify is the size of the malloced memory - including the two
> quadwords - and a flag 0xF00D marking it used.
>
> (Then there may be pointer to the start of the unallocated memory, but
> other than that and its size I don't think that there is tracking
> information.)
>
> If the memory is freed the flag is changed to 0x7777. And it seems the
> free memory is put at the beginning of a "free list". The next pointer
> in this list is at the first longword of the previously allocated
> memory. It is a zero terminated list.

Added two routines to the program which can also be called from the
debugger to make things easier.

The instrumentation is still showing no buffer overruns.

void dump_pointer_m(const void * ptr, const char * str, int psize) {
long * ptr1;
long msize;
long flags;
ptr1 = (long *)ptr;
msize = ptr1[-4];
flags = ptr1[-3];
fprintf(stderr,
"malloc %d, msize %d flags %xd size %d %s\n",
ptr, msize, flags, psize, str);
}

void dump_pointer_f(const void * ptr, const char * str, int psize) {
long * ptr1;
long msize;
long flags;
ptr1 = (long *)ptr;
msize = ptr1[-4];
flags = ptr1[-3];
fprintf(stderr,
"free %d, msize %d flags %xd size %d %s\n",
ptr, msize, flags, psize, str);
}

Since the access violation always occurred on a call to
decc$free(3850064), I set a watch point there.

Previous free() just before the watch point was tripped:
free 3517632, msize 288 flags f00d0048d size 256 vms_close

This is where 3850064 memory that when decc$free() accvios is first
accessed:

free 3850064, msize 96 flags f00d003bd size 0 is_open

C$FINDFILE/decc$$alloc_fcb
C$AUX_OPEN/decc$$aux_open
C$STAT/xstat
C$STAT/stat
C$AUX_OPEN/decc$$aux_open
C$AUX_OPEN/decc$$init_stdout
C$OFF64_FCNTL/decc$$init_stdout
C$OFF64_FCNTL/c$$map_fd_to_fcb
C$OFF64_FCNTL/fcntl

old 0
new 1
new 17409
new 6597069784065
new 6600291075073
new 3849104
CHMOD: cannot access
'/prj_root/diffutils/tests/gt-no-dereference.umdC/subdir2b/
foo': no such file or directory

This is the first indication of a problem. That file is a dangling
symlink and it exists. It is a lstat() call in the CRTL that is
failing, and lstat() should not be failing.

The GNU code would do a
chdir('/prj_root/diffutils/tests/gt-no-dereference.umdC/subdir2b') and
then an lstat('foo').

In this case /prj_root is a search list of lcl_root:,vms_root:,src_root:.

The file and directory only exist in the lcl_root: path, and the current
version of the code detects this and should be changing the directory to
lcl_root:[diffutils.tests.gt-no-dereference^.umdC.subdir2b]

So I need to look more in this area, and validate that the current
directory was actually set correctly.

Then the code runs for a while while it should be doing a chmod() on the
other files in "subdir2b" and then enters decc$chdir() when it runs out
of files to switch to another directory.

The decc$chdir() calls calloc() which gets 3850064 and clears it.

calloc 3850064, msize 96 flags f00d003bd size 0 decc$chdir
value = 0

It then calls decc$$alloc_fcb() which puts 20483 at the start of that
memory location.

Continuing on from there crashes with the access violation.

So this memory cell / block was only ever written to by the CRTL
according to the debugger.

Regards,
-John
wb8...@qsl.net_work

John E. Malmberg

unread,

Jul 21, 2016, 9:10:06 AM7/21/16

to

Confirmed that this sequence is being done, and there is at least one
bug in the CRTL:

Start conditions:
* Default directory is prj_root:[diffutils.tests] set by
decc$chdir() function, return success.
* decc$feature_set("DECC$FILENAME_UNIX_ONLY", current_val, 1)
* decc$feature_set("DECC$POSIX_COMPLIANT_FILENAMES", current_val, 0)

Then:

* decc$feature_set("DECC$FILENAME_UNIX_ONLY", current_val, 0)
* decc$chdir (
"local_root:[diffutils.tests.gt-no-dereference^.umdC.subdir2b]")
* decc$feature_set("DECC$FILENAME_UNIX_ONLY", current_val, 1)

* debugger confirms c$std_stat/lstat("foo") is being called.

* lstat() returns -1 on a dangling symbolic link.

**** That is a CRTL bug, as that is the expected behavior of stat().
Since the symbolic link exists, lstat() should return success.

* In debugger, calling readlink() returns 11 bytes and the string
"nonexistent", which is as expected, which confirms the bug in lstat().

Which means that while I can detect the lstat() failure, I currently do
not have a work around to why it failed.

> Then the code runs for a while while it should be doing a chmod() on the
> other files in "subdir2b" and then enters decc$chdir() when it runs out
> of files to switch to another directory.
>
> The decc$chdir() calls calloc() which gets 3850064 and clears it.
>
> calloc 3850064, msize 96 flags f00d003bd size 0 decc$chdir
> value = 0
>
> It then calls decc$$alloc_fcb() which puts 20483 at the start of that
> memory location.
>
> Continuing on from there crashes with the access violation.
>
> So this memory cell / block was only ever written to by the CRTL
> according to the debugger.

Using the debugger to skip the lstat() call that fails, does not prevent
the access violation in the later chdir().

Regards,
-John
wb8...@qsl.net_work

John Reagan

unread,

Jul 21, 2016, 12:25:38 PM7/21/16

to

What ECO kits are you running? I seem to remember that U900 had several CRTL fixes including something to do with lstat(). I'll check when I get a chance.

John E. Malmberg

unread,

Jul 21, 2016, 7:01:47 PM7/21/16

to

On 7/21/2016 11:25 AM, John Reagan wrote:

> What ECO kits are you running? I seem to remember that U900 had
> several CRTL fixes including something to do with lstat(). I'll check
> when I get a chance.

VMSIA4_UPDATE V10.0, V7.0, V6.O

VMS84I_RMS V3.0, V2.0

On the alpha:

VMS84A_UPDATE V8.0, V5.0
VMS84A_RMS V2.0

Hobbyists now have limited access to ECO kits. Occasionally roll up
kits show up.

The simple test program that just does a chdir() to the same directory
and then calls lstat() on the same file works correctly.

I think for my next test, I will add a function to do an lstat() of that
file and periodically poll to see if the access still works.

Regards,
-John

John E. Malmberg

unread,

Jul 22, 2016, 9:38:16 AM7/22/16

to

locally coretuils program is now running correctly on IA64.

The bug appears to be in lstat(), the _USE_STD_STAT variant.

decc$efs_case_preserve enabled
decc$efs_charset enabled
decc$filename_unix_noversion enabled
decc$readdir_dropdotnotype enabled
decc$filename_unix_only set to 1

decc$posix_compliant_pathanmes set to 0 or 2.

(Have not tried other combinations)

If you give lstat() a pathname argument without a directory component,
such as "foo", lstat() will sometimes fail.

Previously ".dir" or "x.dir" with out a directory component was also
known to fail.

In this case if "foo" was pointing to a dangling symbolic link which was
not null-terminated.

Per x/open standard, symbolic links do not need null termination.

Calling lstat("foo", &buf) in this case apparently is causing some
internal memory corruption in the CRTL which showed up a bit later in
the chdir() command.

The work around is before calling lstat() if there is not a directory
component on the name, to add one. Calling lstat("./foo", &buf) works.

Interestingly, it appears that the CRTL may be caching the results of
some lstat() calls. If you call lstat("./foo", &buf) before calling
lstat("foo", &buf), then the issue does not appear.

Previously it was also discovered that if the path argument to lstat()
is a symlink to a directory, and decc$posix_complient_pathnames is 0,
the default, that lstat() would sometimes behave as a stat() call.

So if lstat() reports success and a directory, and
decc$posix_compliant_pathnames is 0, the work around is to change
decc$posix_compliant_pathnames to 2, and retry the operation in a
termporary stat buffer.

Then set decc$posix_compliant_pathnames back to 0.

If this retry succeeds, then copy the temp stat buffer into the caller's
stat buffer and use that as the return status.

Otherwise return the original stat buffer.

With this change, the coretuils chmod() program operates correctly and
does not crash.

The affected source code is common to a number of the repackaged GNV
components, and as I have time, I will be creating new kits.
I still need to see if this fixes the issues seen running the same bash
script on VMS/AXP 8.4 were rm.exe instead of chmod.exe access violates.

Thanks all for their suggestions.

Regards,
-John
wb8...@qsl.net_work

John Reagan

unread,

Jul 22, 2016, 10:14:27 AM7/22/16

to

On Friday, July 22, 2016 at 9:38:16 AM UTC-4, John E. Malmberg wrote:
> locally coretuils program is now running correctly on IA64.
>
> The bug appears to be in lstat(), the _USE_STD_STAT variant.
>

I've added your analysis to the bug list. We'll look at it.

John E. Malmberg

unread,

Jul 22, 2016, 7:19:00 PM7/22/16

to

Slight correction. Posting too early in the morning after the debug
session last night, so was reading the fixed code wrong.

The work around is to append a "/." to the path for a dangling symbolic
link, or a pathname that ends in ".dir" that would be like a [.x^.dir]
in VMS format.

This apparently tricks the CRTL into trying the name as a directory
before trying the name any other way.

The stat() (_USE_STD_STAT) variant also does not understand that the
unix directory name "foo.dir" is foo^.dir with EFS and UNIX only enabled.

So for wrapper_lstat(name) and wrapper_stat(name)
(FILENAME_UNIX_ONLY) MODE.

If name does not have a slash, or has a .dir in it:
append "/." to make it a Unix directory lookup.
call CRTL lstat(newpath) or stat(newpath)

if status is ok, return we are done.

If above status is not ok, then call CRTL stat or lstat
with the original name.

For lstat() only:

If the above lstat() returns a directory mode bit
instead of a symolic link mode bit, it sometimes
gets fooled if with relative links to it self or other
directories.

if those conditions are not true, then
return the status and buffer.

If the posix_compliant_pathnames (PCP) mode is 0:
set pcp mode to 2.
retry lstat() to new buffer.
if lstat() works:
copy new buffer in to one supplied by call.
return success.
else:
return original status and buffer.

Regards,
-John