I re-added the GC_DEBUG define today, and weeded out a bunch of issues. For those who don't remember, GC_DEBUG (currently in parrot.h) causes various limits and settings and logic to be setup such that GC bugs occur relatively soon after the offending code. It allocates one header at a time, and performs DOD and collection runs extremely frequently (effectively, anywhere they could possibly occur if GC_DEBUG weren't defined.) It's goal is to make GC bugs which appear only in complex programs...appear in simpler ones as well.
Check the cvs-commit traffic if you're interested in what issues I've fixed already. From what I can tell, two things remain: - regexes (these are known to be broken. angel's latest patch should fix these in theory. Probably not worth spending time on fixing these.) - hashes (these were recently rewritten to use indices, a step forward, but they aren't 100% clean yet) - lexicals (there's one remaining issue on the last test I didn't look into) - subs (likely includes all variety of them. Basically, I got the wrong result on one test, instead of GPF failures like I received on the above bugs.) - possibly other that got lost in the noise of the above issues
Anyone more well-versed in these departments than I care to take a look at the potential problems? Just change GC_DEBUG in parrot.h, and you can be on your way. :)
I tried to have a look at the lexicals and subs issues but I am finding that with the GC_DEBUG flag set to 1, I *always* get a segmentation fault, even for simple pasm files like :
set I0, 12 end
or just:
end
Is there something I am doing wrong? I did a fresh checkout about an hour ago.
On Mon, 2002-08-12 at 01:56, Mike Lambert wrote: > Hey,
> I re-added the GC_DEBUG define today, and weeded out a bunch of issues. > For those who don't remember, GC_DEBUG (currently in parrot.h) causes > various limits and settings and logic to be setup such that GC bugs occur > relatively soon after the offending code. It allocates one header at > a time, and performs DOD and collection runs extremely frequently > (effectively, anywhere they could possibly occur if GC_DEBUG weren't > defined.) It's goal is to make GC bugs which appear only in complex > programs...appear in simpler ones as well.
> Check the cvs-commit traffic if you're interested in what issues I've > fixed already. From what I can tell, two things remain: > - regexes (these are known to be broken. angel's latest patch should fix > these in theory. Probably not worth spending time on fixing these.) > - hashes (these were recently rewritten to use indices, a step forward, > but they aren't 100% clean yet) > - lexicals (there's one remaining issue on the last test I didn't look > into) > - subs (likely includes all variety of them. Basically, I got the wrong > result on one test, instead of GPF failures like I received on the above > bugs.) > - possibly other that got lost in the noise of the above issues
> Anyone more well-versed in these departments than I care to take a look at > the potential problems? Just change GC_DEBUG in parrot.h, and you can be > on your way. :)
On Mon, Aug 12, 2002 at 03:56:27AM -0400, Mike Lambert wrote: > - hashes (these were recently rewritten to use indices, a step forward, > but they aren't 100% clean yet)
I know of at least one remaining problem with these, but...
> Anyone more well-versed in these departments than I care to take a look at > the potential problems? Just change GC_DEBUG in parrot.h, and you can be > on your way. :)
I can't get to it because parrot doesn't survive past initialization for me. When it creates the Array PMC for userargv, it allocates the PMC first and then the buffer for the array's data. During the buffer's creation, it does a collection that wipes out the PMC. My lo_var_ptr and hi_var_ptr are set to reasonable-sounding values at the top of trace_system_stack(), but I haven't been able to track it farther yet. Oh, and I do have your recent patch to set interpreter->lo_var_ptr early.
The userargv PMC is not anchored other than in the C stack, because it dies in the pmc_new() creation process before the assignment to P0 can run.
On Mon, 12 Aug 2002, Steve Fink wrote: > farther yet. Oh, and I do have your recent patch to set > interpreter->lo_var_ptr early.
How early is "early"? It looks like setting lo_var_ptr in Parrot_runcode instead of runops should be necessary/sufficient. If not, does initializing it to the address of a local in main make this go away?
On Mon, Aug 12, 2002 at 12:34:28PM -0700, Sean O'Rourke wrote: > On Mon, 12 Aug 2002, Steve Fink wrote: > > farther yet. Oh, and I do have your recent patch to set > > interpreter->lo_var_ptr early.
> How early is "early"? It looks like setting lo_var_ptr in Parrot_runcode > instead of runops should be necessary/sufficient. If not, does > initializing it to the address of a local in main make this go away?
Nope. lo_var_ptr was getting set plenty early. This was the problem, which I'll apply once I make a few more problems go away:
> > Anyone more well-versed in these departments than I care to take a look at > > the potential problems? Just change GC_DEBUG in parrot.h, and you can be > > on your way. :)
> I can't get to it because parrot doesn't survive past initialization > for me. When it creates the Array PMC for userargv, it allocates the > PMC first and then the buffer for the array's data. During the > buffer's creation, it does a collection that wipes out the PMC. My > lo_var_ptr and hi_var_ptr are set to reasonable-sounding values at the > top of trace_system_stack(), but I haven't been able to track it > farther yet. Oh, and I do have your recent patch to set > interpreter->lo_var_ptr early.
> The userargv PMC is not anchored other than in the C stack, because it > dies in the pmc_new() creation process before the assignment to P0 can > run.
Weird. I had to move the lo_var_ptr initialization code to runcode instead of runops, in order to avoid collecting the ARGV pmc. The new code looks like:
void *dummy_ptr; PMC *userargv;
Is it possible that some systems might put dummy_ptr higher in memory than userargv, thus causing userargv to become prematurely collected? If so, there are three options: - make two dummy ptrs, and choose the lesser of the two. - set the dummy ptr to userargv, and hope we don't add two header variables. ;) - force the setting of lo_var_ptr upon the 'main' code in test_main.c, above all possible functions.
I think 1 is easiest, but 3 does have the advantage of allowing the user to do GC stuff outside of the parrot execution loop, like allocating global variables (like argv, but app-specific), etc. Of course, it also imposees additional coding overhead on the embedding programmer.
On Mon, Aug 12, 2002 at 04:23:14PM -0400, Mike Lambert wrote: > Weird. I had to move the lo_var_ptr initialization code to runcode instead > of runops, in order to avoid collecting the ARGV pmc. The new code looks > like:
> void *dummy_ptr; > PMC *userargv;
You sure? That's what I thought at first too, but then I realized that (to be userargv)->vtable wasn't surviving the return from pmc_new(). So when it crashes, userargv is filled in, but at the point where GC scrozzles the PMC, userargv hasn't been assigned yet so it's address doesn't need to be within the stack range. It's actually the variable 'pmc' inside of pmc_new() that holds onto the newly-created PMC.
> Is it possible that some systems might put dummy_ptr higher in memory than > userargv, thus causing userargv to become prematurely collected? If > so, there are three options: > - make two dummy ptrs, and choose the lesser of the two.
I think anything like this will just get defeated by an optimizing compiler on some platform.
> - set the dummy ptr to userargv, and hope we don't add two > header variables. ;)
Ick. But since this *does* keep the problem isolated to Parrot code, it might be the way to go. If we want another variable, we can call a set_stack_top() function for each one, and it'll keep the minimum.
> - force the setting of lo_var_ptr upon the 'main' code in test_main.c, > above all possible functions.
I have a patch that adds an argument to Parrot_init for this purpose.
> I think 1 is easiest, but 3 does have the advantage of allowing the user > to do GC stuff outside of the parrot execution loop, like allocating > global variables (like argv, but app-specific), etc. Of course, it also > imposees additional coding overhead on the embedding programmer.
Is main() even guaranteed to be early enough (assuming the order of local variable declaration doesn't determine the order of stack usage)? Actually, it's not really main(), it's whatever function creates an interpreter and calls Parrot_init().
I can easily imagine a function that creates an interpreter, creates a PMC (say a hashtable describing the application or something), does something with the PMC using the interpreter, then shuts down the interpreter. All in the same function. Describing what's wrong with this and how to avoid it is kind of complicated. (But perhaps unavoidable.)
> /* We rely on the fact that expand_hash() will be called before > @@ -295,7 +296,12 @@ > hash->max_chain = (HashIndex) -1;
> hash->entries = 0; > - hash->bucket_pool = new_buffer_header(interpreter); > + > + /* In order to keep the bucket_pool header from being > + * collected from expand_hash(), a pointer to it is put on > + * the system stack where the stack walking code can find it */ > + for_gc = hash->bucket_pool = new_buffer_header(interpreter); > +
This is a case of a temporary allocated value being only in a CPU register? If so, I believe that this is no solution of the cause, merely working round a symptom. The stack walking code needs custom assembler to find all values in the CPU registers that may not be on the stack.
We cannot control correct third party code that is linked against parrot. So we cannot expect the solution to GC crashes to be "oh, you need to get the source to libfoo and declare a variable volatile so that the stack walking code can see it"
If I've misunderstood the cause of the above problem, oops, sorry. But I believe that we still need to check for pointers in the CPU registers, because on a register rich CPU is is not impossible that something has the only pointer to an in use value still only in a CPU register.
On Mon, Aug 12, 2002 at 09:48:56PM +0100, Nicholas Clark wrote: > This is a case of a temporary allocated value being only in a CPU register?
Not in this case. The link to the hash structure from the PMC is not established, which prevents the custom mark routine in the PMC from identifying the buffer in question. Another way of fixing this is to change new_hash to accept a pointer to the PMC's ->data pointer so it can initialize this before doing any further allocation.
> But I believe that we still need to check for pointers in the CPU registers, > because on a register rich CPU is is not impossible that something has the > only pointer to an in use value still only in a CPU register.
Yes. This is definitely necessary. We also need to flush register windows on platforms like SPARC, IA-64.
Here is another way of resolving this, as I mentioned in the last message. Having new_hash() initialize the PMC's data pointer so that the hash is immediately visible to the collector.
The last patch had an issue I didn't see on the test I was working against. mark_hash() assumes the buffer_pool is always initialized, but with the patch this is no longer the case.
On Mon, 2002-08-12 at 19:23, Jason Gloudon wrote: > The last patch had an issue I didn't see on the test I was working against. > mark_hash() assumes the buffer_pool is always initialized, but with the patch > this is no longer the case.
> /* The end of used parameter is passed into the mark_used function of > Index: parrot/include/parrot/hash.h > =================================================================== > RCS file: /cvs/public/parrot/include/parrot/hash.h,v > retrieving revision 1.6 > diff -u -r1.6 hash.h > --- parrot/include/parrot/hash.h 7 Aug 2002 19:02:06 -0000 1.6 > +++ parrot/include/parrot/hash.h 13 Aug 2002 00:20:40 -0000 > @@ -20,7 +20,7 @@ > /* HASH is really a hashtable, but 'hash' is standard perl nomenclature. */ > typedef struct _hash HASH;
On Mon, 2002-08-12 at 21:02, Jason Greene wrote: > Here is one additional check...
> -Jason (The other one)
> On Mon, 2002-08-12 at 19:23, Jason Gloudon wrote: > > The last patch had an issue I didn't see on the test I was working against. > > mark_hash() assumes the buffer_pool is always initialized, but with the patch > > this is no longer the case.
On Mon, 2002-08-12 at 22:38, Jason Greene wrote: > Sorry, this version now includes the missing modifications to hash.h.
> -Jason
> On Mon, 2002-08-12 at 21:02, Jason Greene wrote: > > Here is one additional check...
> > -Jason (The other one)
> > On Mon, 2002-08-12 at 19:23, Jason Gloudon wrote: > > > The last patch had an issue I didn't see on the test I was working against. > > > mark_hash() assumes the buffer_pool is always initialized, but with the patch > > > this is no longer the case.
> /* The end of used parameter is passed into the mark_used function of > Index: include/parrot/hash.h > =================================================================== > RCS file: /cvs/public/parrot/include/parrot/hash.h,v > retrieving revision 1.6 > diff -u -r1.6 hash.h > --- include/parrot/hash.h 7 Aug 2002 19:02:06 -0000 1.6 > +++ include/parrot/hash.h 13 Aug 2002 03:30:57 -0000 > @@ -20,7 +20,7 @@ > /* HASH is really a hashtable, but 'hash' is standard perl nomenclature. */ > typedef struct _hash HASH;
>> /* We rely on the fact that expand_hash() will be called before >> @@ -295,7 +296,12 @@ >> hash->max_chain = (HashIndex) -1;
>> hash->entries = 0; >> - hash->bucket_pool = new_buffer_header(interpreter); >> + >> + /* In order to keep the bucket_pool header from being >> + * collected from expand_hash(), a pointer to it is put on >> + * the system stack where the stack walking code can find it */ >> + for_gc = hash->bucket_pool = new_buffer_header(interpreter); >> +
>This is a case of a temporary allocated value being only in a CPU register? >If so, I believe that this is no solution of the cause, merely working round >a symptom. The stack walking code needs custom assembler to find all values in >the CPU registers that may not be on the stack.
It may also be the case where the values are on the stack, but at a spot where the GC won't look, as they're allocated in the code that's calling into the interpreter and not rooted. Which means we *must* root the argv PMCs quickly.
>If I've misunderstood the cause of the above problem, oops, sorry. >But I believe that we still need to check for pointers in the CPU registers, >because on a register rich CPU is is not impossible that something has the >only pointer to an in use value still only in a CPU register.
Yup. We have to do this, so we might as well go for it now. -- Dan
--------------------------------------"it's like this"------------------- Dan Sugalski even samurai d...@sidhe.org have teddy bears and even teddy bears get drunk
A GC can be triggered by Parrot_allocate that frees the newly created buffer. This shows up when creating interpreter->ctx.pad_stack.
This is really the same problem as the hash->buffer bug. The system stack walker will find the stack variable, but won't know how to look inside it to find the buffer.
I can think of the following ways of fixing this (these are all rehashes of things we've used or discussed in the past):
1. Also putting stack->buffer into a local variable and praying that it doesn't get optimized away 2. Disabling GC during the Parrot_allocate() call 3. Passing in an extra OUT parameter to new_stack so we can anchor the stack before the Parrot_allocate() call. 4. Tagging the newly created number with a generation counter so it won't get freed until after the next opcode is run (in this case, that'll be the first opcode). (This requires implementing Peter Gibbs' (?) idea of not freeing things created in the current generation.) 5. Stuffing the new buffer into a temporary holding area that is added to the root set. 6. Setting a neonate flag on the buffer that'll get automatically get cleared at an appropriate time. 7. Setting an immune flag on the buffer and manually clearing it after anchoring the stack to the root set. 8. Wrapping the stack in a PMC (so the stack struct would go into a ->data buffer and would be traversed with a custom mark routine) and being very, very careful about the order of initialization.
Any votes?
I have also come to the conclusion that tracking down these memory bugs is way too difficult right now. I tracked the above problem back from a seg fault, which was resulting from a garbage value in the byte code stream, which was triggered by adding a PMC to the free list until I added a bunch of debugging code. With the debugging code, I could see that the dead userargv PMC (that wasn't really dead because it was on the system stack even though it would never get used again) pointed to a PMC with a bogus type, but also the memory corruption was now triggered by a stack push where the stack and some PMC incorrectly shared a buffer.
Or, in summary: these kinds of problems result in memory corruption that is unlikely to be immediately fatal.
I wonder if we could add something to GC_DEBUG that would mark all dead Buffers and PMCs in an easily recognizable way, and then check all live PMCs and Buffers to be sure that nothing they point to is marked as being dead. (Right now, they all just unconditionally mark things as "live for this DOD pass".)
This is a patch that clears up all of the GC_DEBUG-revealed bugs, at least on my machine. As I enumerated in my previous email, there are a lot of different ways to fix these sorts of problems, and this just kinda picks one for each problem encountered. This patch includes the stuff done by Jason {Gloudon,Greene}.
In summary, this patch
- Adds a parameter to Parrot_init() to allow setting the stack top. - Fixes a bug in dod.c where the first PMC might be skipped (changed < to <=) - Adds some NULL checks in mark_hash() - Adds an OUT parameter to new_hash() so the hash is anchored to the root set while it is being constructed. - Adds an OUT parameter to hash_clone() for early anchoring. - Repeatedly calls lookupBucket() while cloning a hash in case things moved - In interpreter.c, asserts that a few of the early buffer creations do not return the same buffer (provides early warning of GC mischief) - Adds an OUT parameter to rx_allocate_info() for early anchoring. - Makes a major change to the Pointer PMC: the previously unused ->cache area is now used to hold a pointer to a custom mark routine that will get fired during PMC traversal. Previously, Pointers had the PMC_private_GC_FLAG set, but nothing ever looked at it. With this change, Pointers behave as they always did unless something externally sets the ->cache.struct_val field (in other words, there is no vtable entry for setting the mark routine, and the PMC's custom mark routine does nothing if that field is NULL.) - Reorders the rx_allocinfo opcode to assign things in the correct order and fill in the ->cache.struct_val field of the Pointer PMC it creates. - Briefly disables DOD while a stack is being created so allocating the contents of the stack buffer doesn't destroy the unanchored buffer header. - Sets the PMC_custom_mark_FLAG on a cloned PerlHash PMC. - Fixes a comment in the cloning test case in t/pmc/perlhash.t
Somebody gimme a cookie.
If the rx info object is going away, then obviously those parts of the patch need not be applied. But in the meantime, it's nice to have a Parrot that doesn't crash.
I'm not going to apply this patch yet because I'm sure someone will disagree with how it fixes some or all of these bugs. So would that someone please speak up? Thanks.
> I have also come to the conclusion that tracking down these memory > bugs is way too difficult right now. I tracked the above problem back > from a seg fault, which was resulting from a garbage value in the byte > code stream, which was triggered by adding a PMC to the free list > until I added a bunch of debugging code. With the debugging code, I > could see that the dead userargv PMC (that wasn't really dead because > it was on the system stack even though it would never get used again) > pointed to a PMC with a bogus type, but also the memory corruption was > now triggered by a stack push where the stack and some PMC incorrectly > shared a buffer.
> Or, in summary: these kinds of problems result in memory corruption > that is unlikely to be immediately fatal.
> I wonder if we could add something to GC_DEBUG that would mark all > dead Buffers and PMCs in an easily recognizable way, and then check > all live PMCs and Buffers to be sure that nothing they point to is > marked as being dead. (Right now, they all just unconditionally mark > things as "live for this DOD pass".)
+1 to something like this, I have been spent too many hours at the debugger trying to figure out some of these GC problems (one of them involving large buffers I still haven't solved)
> If the rx info object is going away, then obviously those parts of the > patch need not be applied. But in the meantime, it's nice to have a > Parrot that doesn't crash.
I agree. My disclaimer about the regex code in my original email was to suggest that we didn't need to focus on the rx issues, but if you've already done it... :)
> I'm not going to apply this patch yet because I'm sure someone will > disagree with how it fixes some or all of these bugs. So would that > someone please speak up? Thanks.
I suppose that someone is me, although there might be other someones.
> In summary, this patch
> - Adds an OUT parameter to new_hash() so the hash is anchored to the root set > while it is being constructed. > - Adds an OUT parameter to hash_clone() for early anchoring. > - Adds an OUT parameter to rx_allocate_info() for early anchoring. > - Briefly disables DOD while a stack is being created so allocating the contents > of the stack buffer doesn't destroy the unanchored buffer header.
These are needed for now. However, when we get that buffer/pmc unification, we should be able to make mark() methods in the header pools. Then, with support for non-pmc-wrapped buffers, we can find references to them on the system stack, and call their mark() method directly, avoiding the above hoops. At least, that's my hope. Is it possible to mark the above code with some XXX tag so that we can re-address it when we get the unification in place?
> - Makes a major change to the Pointer PMC: the previously unused ->cache area > is now used to hold a pointer to a custom mark routine that will get fired > during PMC traversal. Previously, Pointers had the PMC_private_GC_FLAG set, > but nothing ever looked at it. With this change, Pointers behave as they > always did unless something externally sets the ->cache.struct_val field > (in other words, there is no vtable entry for setting the mark routine, > and the PMC's custom mark routine does nothing if that field is NULL.)
> - Reorders the rx_allocinfo opcode to assign things in the correct order and > fill in the ->cache.struct_val field of the Pointer PMC it creates.
These are a bit hackish, but I agree they are needed to solve our GC_DEBUG problems (and by extension, "real-world Parrot programs" ;). Both of these should also be able to "go away" with the unification, so see previous paragraph. :)
I think I'm going to make GC_DEBUG a parameter of the interpreter, and allow it to be turned on/off via opcodes. Then we could force our test suite to use GC_DEBUG to root out GC problems a lot sooner than they otherwise would. Fixing all GC_DEBUG problems would help allow this kind of testing to be part of the standard test suite.
> - In interpreter.c, asserts that a few of the early buffer creations do not > return the same buffer (provides early warning of GC mischief)
Oooh, nice! :)
The rest of the things you listed, which I didn't comment on are, imo, perfectly fine.
In conclusion, I don't have any objections to this patch, although it would be nice if "XXX Unification" markers were included in places that needed to be addressed later.
On Wed, Aug 14, 2002 at 03:16:20AM -0400, Mike Lambert wrote: > > If the rx info object is going away, then obviously those parts of the > > patch need not be applied. But in the meantime, it's nice to have a > > Parrot that doesn't crash.
> I agree. My disclaimer about the regex code in my original email was to > suggest that we didn't need to focus on the rx issues, but if you've > already done it... :)
Yes, I knew that when I tackled it, but after climbing the painful learning curve to fix all the other problems, I figured I might as well nail this one down too while I was in there and in the right mindset (or maybe it was just that I didn't want to delete my zillions of debugging printouts quite yet...)
> > - Adds an OUT parameter to new_hash() so the hash is anchored to the root set > > while it is being constructed. > > - Adds an OUT parameter to hash_clone() for early anchoring. > > - Adds an OUT parameter to rx_allocate_info() for early anchoring. > > - Briefly disables DOD while a stack is being created so allocating the contents > > of the stack buffer doesn't destroy the unanchored buffer header.
> These are needed for now. However, when we get that buffer/pmc unification, > we should be able to make mark() methods in the header pools. Then, with > support for non-pmc-wrapped buffers, we can find references to them > on the system stack, and call their mark() method directly, avoiding > the above hoops. At least, that's my hope. Is it possible to mark the > above code with some XXX tag so that we can re-address it when we get the > unification in place?
That will fix some of these problems, but not all. For one, the Pointer PMC needs a custom mark routine in order to work properly *after* the initial creation, because whatever the Pointer PMC points to may contain PMCs and Buffers that probably won't be on the system stack after creation. So some way to define a custom mark routine will still be needed (or we'll need to restrict Pointer PMCs to point to objects that are not subject to garbage collection, either because they are allocated externally to Parrot, or they are anchored in some other way.)
Also, the rxinfo struct and the Stack_Chunk_t struct are both allocated with mem_sys_allocate, so they won't be helped by making Buffers smarter. The problem is that the system stack walk only sees the address of the struct, and doesn't know how to delve inside it. I don't know if there's any good reason why they can't be allocated with new_bufferlike_header (that's how the hash struct works), in which case the smarter Buffers would fix the problem. Or was this what you were intending? If so, then I'll plaster each of those with a FIXME comment. Otherwise, I'll leave them as-is.