Better tools for catching missing GC roots?

Tim Holy

Jan 27, 2015, 8:41:57 AM

to juli...@googlegroups.com

Historically, some of the hardest bugs to squash in julia have come from
memory errors on the C side. As I'm currently in the middle of trying to track
another likely case down, it occurs to me to ask: is it worth investing some
effort in developing better tools for finding such problems? (Or, am I not aware
of some existing tools?)

One approach that, to me, seems most likely to succeed would be a
"LintJuliaC.jl" package. (This would presumably _heavily_ leverage Clang.jl.)
Given that GC runs only when memory gets allocated, the strategy might be:
- Starting from the lowest level gc functions, build up a Set of those C
functions that might trigger GC. One would basically have to follow the call
chain down and see if you end up in a gc routine.
- Once assembled, do a second pass through src/, and look for created
variables that don't get a JL_GC_PUSH before the next function that might
trigger GC gets called.
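
The first pass described above amounts to a reachability computation over the call graph. Here is a minimal sketch in C, assuming a hand-written edge list (a real tool would presumably extract the edges from src/ with Clang). The function names in the enum are hypothetical and are not taken from Julia's sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical functions, for illustration only. */
enum { GC_COLLECT, ALLOC_OBJ, NEW_ARRAY, ARRAY_LEN, BOX_INT, NFUNCS };

typedef struct { int caller, callee; } call_edge;

/* Hypothetical call graph: each entry says "caller may call callee". */
const call_edge edges[] = {
    { ALLOC_OBJ, GC_COLLECT },  /* the allocator may run a collection */
    { NEW_ARRAY, ALLOC_OBJ },
    { BOX_INT,   ALLOC_OBJ },
    /* ARRAY_LEN calls nothing that allocates, so it has no edges */
};

/* Mark every function from which gc_root is reachable. The loop
   repeats until no new caller is marked (a fixed point). */
void mark_may_gc(const call_edge *e, size_t n_edges, int gc_root,
                 bool may_gc[], size_t n_funcs)
{
    assert(gc_root >= 0 && (size_t)gc_root < n_funcs);
    for (size_t i = 0; i < n_funcs; i++)
        may_gc[i] = false;
    may_gc[gc_root] = true;

    bool changed = true;
    while (changed) {
        changed = false;
        for (size_t i = 0; i < n_edges; i++) {
            if (may_gc[e[i].callee] && !may_gc[e[i].caller]) {
                may_gc[e[i].caller] = true;
                changed = true;
            }
        }
    }
}
```

Calling `mark_may_gc(edges, sizeof edges / sizeof edges[0], GC_COLLECT, may_gc, NFUNCS)` fills `may_gc` with the set the second pass would consult. The second pass (finding variables not yet rooted at such a call) needs real parsing and is not sketched here.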

I can imagine there might be other strategies, too, such as a special run mode
of julia that forces gc() after practically every statement. (I worry that
would be very slow.) Or one could insert some kind of sanity check into
JL_DATA_TYPE and make sure it gets checked on every access. To me these seem less
practical, but they give some indication of what else I've considered.
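
To make the "force gc() everywhere" idea concrete, here is a toy sketch in C of allocation accounting with a stress flag. It assumes a simple byte-count threshold; the struct, its fields, and `toy_note_alloc` are hypothetical and do not describe Julia's actual collector.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy allocation accounting; all names are hypothetical. */
typedef struct {
    size_t   bytes_since_gc;  /* bytes allocated since the last collection */
    size_t   threshold;       /* normal trigger point */
    bool     stress;          /* if set, collect on every allocation */
    unsigned collections;     /* number of collections so far */
} toy_gc;

/* Record an allocation of n bytes. Returns true if it triggered a
   collection, either because the threshold was reached or because
   stress mode is on. */
bool toy_note_alloc(toy_gc *gc, size_t n)
{
    gc->bytes_since_gc += n;
    if (!gc->stress && gc->bytes_since_gc < gc->threshold)
        return false;
    gc->collections++;
    gc->bytes_since_gc = 0;
    return true;
}
```

With `stress` set, every allocation becomes a collection point, so an unrooted object is exposed at the first allocation after it is created rather than only when the threshold happens to be crossed. The cost is one collection per allocation, which is the slowness worried about above.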

Thoughts? I'm aware this might not be easy, but given that some GC bugs
delayed 0.3 considerably, and I'm currently struggling to identify the source
of an object whose type is being printed as <?::<?::<?::<?::<?::<?::>>>>>>, it
seems worth discussing.

Best,
--Tim

Keno Fischer

Jan 27, 2015, 11:29:38 AM

to juli...@googlegroups.com

It probably would be doable to write a plugin for Clang's static analyzer that checks for these kinds of issues.

Erik Schnetter

Jan 27, 2015, 11:56:46 AM

to juli...@googlegroups.com

Run-time checking could be implemented purely on the Julia side:

- GC (probably triggered often, but not after every statement) would not actually deallocate objects. Instead, it would only mark them as "unreachable".
- Every access to an object would check whether it is marked unreachable; if so, it would produce a backtrace.

Since GC doesn't free anything, the wrongly-unreachable objects are still there and are intact, and you can follow either them or the stack backtrace to see how they were reached, and which root was missing.

Checking every access to objects may be difficult. In addition to marking them with a bit, we could poison them (e.g. every pointer in the object has bit 62 set so that accessing it gives a segfault). Alternatively -- and people would be willing to do this if there's no other way -- C code could be sprinkled with "CHECK_OBJECT_VALID(x)" macro calls before objects are accessed. My hope, though, is that inserting such a macro in a few strategic places would suffice, e.g. where an object's type is checked, or in generated LLVM bitcode.
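
A toy sketch in C of the mark-instead-of-free idea with a CHECK_OBJECT_VALID macro follows. It assumes flat objects with no references between them and an explicit root array. Everything except the macro name is hypothetical, and none of it reflects Julia's runtime data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Toy object: the collector sets `unreachable` instead of freeing. */
typedef struct {
    bool unreachable;
    int  payload;
} toy_obj;

enum { HEAP_SIZE = 8, MAX_ROOTS = 8 };

toy_obj  heap[HEAP_SIZE];
size_t   heap_used;
toy_obj *roots[MAX_ROOTS];
size_t   n_roots;
unsigned invalid_accesses;   /* count of accesses to dead objects */

toy_obj *toy_alloc(int payload)
{
    assert(heap_used < HEAP_SIZE);
    toy_obj *o = &heap[heap_used++];
    o->unreachable = false;
    o->payload = payload;
    return o;
}

void toy_root(toy_obj *o)
{
    assert(n_roots < MAX_ROOTS);
    roots[n_roots++] = o;
}

/* "Collect" without freeing: flag every object no root points to. */
void toy_collect(void)
{
    for (size_t i = 0; i < heap_used; i++) {
        bool rooted = false;
        for (size_t r = 0; r < n_roots; r++)
            if (roots[r] == &heap[i])
                rooted = true;
        heap[i].unreachable = !rooted;
    }
}

/* Report an access to an object the collector considered dead.
   A real version would print a backtrace here. */
#define CHECK_OBJECT_VALID(o)                                         \
    do {                                                              \
        if ((o)->unreachable) {                                       \
            invalid_accesses++;                                       \
            fprintf(stderr, "%s:%d: access to unreachable object\n",  \
                    __FILE__, __LINE__);                              \
        }                                                             \
    } while (0)

int toy_get(toy_obj *o)
{
    CHECK_OBJECT_VALID(o);
    return o->payload;   /* still intact, because nothing was freed */
}
```

If `b` is allocated but never passed to `toy_root`, the next `toy_collect()` flags it, and a later `toy_get(b)` reports the access while still returning the intact payload.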

-erik

--
Erik Schnetter <schn...@gmail.com>
http://www.perimeterinstitute.ca/personal/eschnetter/

My email is as private as my paper mail. I therefore support encrypting
and signing email messages. Get my PGP key from https://sks-keyservers.net.

Joshua Ballanco

Jan 27, 2015, 2:23:20 PM

to juli...@googlegroups.com

Having tracked down GC-related memory corruption issues in the past, I’ve found a couple of tools end up being especially useful:

* MallocStackLogging: this is a feature of malloc on OS X that records a stack trace for each allocation and free. Once you have the location of the memory that’s been corrupted, it’s relatively straightforward to go back through the history for that region and see what happened.

* NSZombies: this is a tool in Objective-C/Cocoa that (very similar to what Erik suggested), instead of freeing memory after a `dealloc` or GC, replaces the class of the object with the “NSZombie” class. This is a simple proxy that logs any calls to the object as warnings, then passes along the method call to the should-be-dead object.

* GuardMalloc (gmalloc): this is the “big gun” of them all…guard malloc aligns every allocation to a page boundary, then it also allocates the following page and marks it protected. That way, any erroneous write outside the expected bounds of an allocation causes an immediate segfault.
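
The guard-page idea in the last item can be sketched in a few lines of POSIX C. This is an illustration of the technique only, not Guard Malloc's implementation. `guarded_alloc` and `guarded_free` are hypothetical names, and the sketch places the end of the buffer against the protected page, so the returned pointer may not be suitably aligned for types wider than a byte.

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_ANONYMOUS
#define MAP_ANONYMOUS MAP_ANON
#endif

/* Return `size` writable bytes whose end abuts an inaccessible page,
   or NULL on failure. Writing one byte past the end faults at once. */
void *guarded_alloc(size_t size)
{
    assert(size > 0);
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t data_pages = (size + page - 1) / page;
    size_t total = (data_pages + 1) * page;   /* data pages + guard page */

    unsigned char *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    unsigned char *guard = base + data_pages * page;
    if (mprotect(guard, page, PROT_NONE) != 0) {
        munmap(base, total);
        return NULL;
    }
    return guard - size;   /* end of buffer == start of guard page */
}

/* Release a buffer from guarded_alloc; `size` must match the request. */
void guarded_free(void *ptr, size_t size)
{
    assert(ptr != NULL && size > 0);
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t data_pages = (size + page - 1) / page;
    unsigned char *guard = (unsigned char *)ptr + size;
    munmap(guard - data_pages * page, (data_pages + 1) * page);
}
```

Each allocation costs at least two pages, which is why this kind of tool is usually reserved for debugging runs.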

I may be a bit odd, but I actually rather enjoy tracking down memory corruption issues. Is there a tag in the GitHub issues that I could follow, and possibly lend a hand in the future?

Cheers,

Josh

Tim Holy

Jan 27, 2015, 2:29:10 PM

to juli...@googlegroups.com

An example of the type of problem I see being difficult to solve with runtime
checking is this one:

a = allocate_big_object(); // triggers GC 99.9% of the time
free(a);
b = allocate_tiny_object(); // basically never triggers GC
c = allocate_another_object(); // triggers GC if `a` didn't
JL_GC_PUSH2(&b, &c);

Allocating b basically never triggers GC: a usually forces a collection just
before it, and b is not big enough to trigger one on its own. Even when a
doesn't trigger a GC, b is tiny, so it doesn't usually cross the threshold. But
c is big enough to reliably cross the threshold. In that 0.1% of the time,
allocation of c will force a GC before you've protected b with JL_GC_PUSH2.
But since it happens very rarely, it's hard to catch at runtime.

--Tim

Tim Holy

Jan 27, 2015, 2:33:52 PM

to juli...@googlegroups.com

Might it be easier to do the analysis in Julia, though? (Showing my biases
here...)

But thanks for the pointer, I didn't know it existed. I'll read up about it:
http://clang-analyzer.llvm.org/checker_dev_manual.html

--Tim

Tim Holy

Jan 28, 2015, 10:14:40 AM

to juli...@googlegroups.com

On Tuesday, January 27, 2015 09:23:14 PM Joshua Ballanco wrote:
> I may be a bit odd, but I actually rather enjoy tracking down memory
> corruption issues. Is there a tag in the GitHub issues that I could follow,
> and possibly lend a hand in the future?

The particular one I was just working on seems to have been fixed in the last
couple of days, but its phenotype was similar to
https://github.com/JuliaLang/julia/issues/8651. That's no guarantee it's a GC
error, but you could start by checking to see if you can replicate it on your
machine using a current build.

--Tim

Tim Holy

Feb 19, 2015, 11:05:57 PM

to juli...@googlegroups.com

A very simple tool (a syntax highlighter) is now available at
https://github.com/timholy/CallGraphs.jl
There's lots of room for much more sophisticated tools, but this is at least a
start.

--Tim

On Tuesday, January 27, 2015 11:23:21 AM Keno Fischer wrote:
