[Sbcl-devel] debugging memory leaks / heap exhaustion analysis

133 views

Skip to first unread message

Marco Baringer

unread,

Apr 18, 2013, 8:25:35 AM4/18/13

to sbcl-...@lists.sourceforge.net

hi,

in the course of debugging recurring heap exhaustions i'm trying to
figure out, from whatever i can log and what little info i can get
from an ldb buffer, if a) my app has a leak, b) my app just needs more
memory than i'm giving it c) i'm expecting the gc to work differently
than it acutally does.

Are these assumptions correct:

1) If i have a 1GB heap (--dynamic-space-size 1024) and
sb-kernel:dynamic-usage reports 400MB used, i can always allocate
another 600MB of stuff (in my case these would be lots of
relatively small objects, but it could also be a few larger (10KB)
strings and hash tables).

I ask because after reading this:
http://marc.info/?l=sbcl-devel&m=129767397012797&w=2 I'm worried
that I may be hitting the "generational trap", since i have a
similar allocation pattern (in part at least) to whan nikodemus
mentioned.

1b) The long lived object caches (where i'm guessing i have a problem)
are very large trees of objects (instances whose slot value are
other instances with forward and back pointers all around).

If i want to see if i'm confusing the conservative collector,
would this work: go slowly through my caches (which i can do) and
break the forward/backward links with (setf (the-link) nil) and
then call (gc :full t). monitor the change in dynamic-usage and
see if it's small (more or less equivalent to the single object
which is no longer reachable) or large (in which case there was a
pointer in that object which was garbarge but the conservative
collector was confused).

2) Running (sb-ext:gc :full t) at regular intervals has no draw backs
and after that whatever i see in the gc logfile or on the heap is
real live data.

In dedugging this I'm trying to interpret the values in the gc
logfile, and i'm not quite sure i understand the meaning of the
various numbers. Given this table (which is what i got at the ldb
prompt and simply reformated so the columns line up):

------------------------------------------------------------------------
Heap exhausted during garbage collection: 40 bytes available, 104 requested.
+-----+-----+-----+-----+-----+-----+-------+-----+-----+-----+---------+-------+---------+-----+-----+-------+
| Gen|StaPg|UbSta|LaSta|LUbSt|Boxed|Unboxed| LB| LUB|!move|Alloc |Waste | Trig | WP| GCs|Mem-age|
+-----+-----+-----+-----+-----+-----+-------+-----+-----+-----+---------+-------+---------+-----+-----+-------+
| 0:| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0 | 0 |10737418 | 0| 0| 0.0000|
+-----+-----+-----+-----+-----+-----+-------+-----+-----+-----+---------+-------+---------+-----+-----+-------+
| 1:| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0 | 0 |10737418 | 0| 0| 0.0000|
+-----+-----+-----+-----+-----+-----+-------+-----+-----+-----+---------+-------+---------+-----+-----+-------+
| 2:| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0 | 0 |10737418 | 0| 0| 0.0000|
+-----+-----+-----+-----+-----+-----+-------+-----+-----+-----+---------+-------+---------+-----+-----+-------+
| 3:| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0 | 0 |10737418 | 0| 0| 0.0000|
+-----+-----+-----+-----+-----+-----+-------+-----+-----+-----+---------+-------+---------+-----+-----+-------+
| 4:| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0 | 0 |10737418 | 0| 0| 0.0000|
+-----+-----+-----+-----+-----+-----+-------+-----+-----+-----+---------+-------+---------+-----+-----+-------+
| 5:|38735|38733| 0| 0|85613| 31161| 0| 0| 0|473357552|4948752|508117930| 0| 20| 1.0508|
+-----+-----+-----+-----+-----+-----+-------+-----+-----+-----+---------+-------+---------+-----+-----+-------+
| 6:| 0| 0| 0| 0|18059| 4500| 0| 0| 0| 92401664| 0| 2000000|15463| 0| 0.0000|
+-----+-----+-----+-----+-----+-----+-------+-----+-----+-----+---------+-------+---------+-----+-----+-------+
Total bytes allocated = 1063752464
Dynamic-space-size bytes = 1073741824
GC control variables:
*GC-INHIBIT* = true
*GC-PENDING* = in progress
*STOP-FOR-GC-PENDING* = false
fatal error encountered in SBCL pid 12379(tid 4124879728):
Heap exhausted, game over.
------------------------------------------------------------------------

- generation 5 is the oldest generation right? (generation 6 being the
"scratch" generation the source code refers to and something i can
basically ignore)

- Does the large number of GCs in generation 5 imply that sbcl has tried
(20 times) to remove stuff from that generation and hasn't been able
to? (this would strongly suggest either a leak in my code or that my
app simply needs more memory, however it would go against my
understanding of my app's memory usage).

- What does the "Mem-age" column tell me?

- In this particular case the number of pinned objects (!move) is always
0, but in normal runs (before heap exhaustion) i'll regularly see
values in the column between 0 and 200. Do these have anything to do
with ffi objects? (if so i may have a problem in how i'm using the c
library to talk to the db)

- Would setting bytes-consed-between-gcs and
generation-bytes-consed-between-gcs to obscenely low numbers (i was
thinking of setting these to 3) cause heap exhaustions to be happen
earlier? (this would help me to, after the fact, correlate heap
exhaustion to what the log files say was going an at that time)

- the gc_trigger value of generation 5 is always much much higher than
the gc_trigger value of the other generations (in this case it's x50,
but i've seen x100 or x200 at other times). is that normal? (i assume
that generations are the same size, so that sounds weird to me (but i
don't really understand the gc either...))

As a debugging aid: is there a way to as the gc to call some hook when
an allocation would cause it to have less than X amount of free bytes?
What i would like is that if (make-array 1-bazillion-bytes) would
succeed but would only leave me with a few bytes free (not enough for me
to run any forensics) i'd like the allocation to be "roll-backed" and a
kernel:heap-exhausted erorr signaled (i'm more than willing to give up a
large amount of memory to get this feature). I can of course run code in
*after-gc*, but that is often too late. This probably isn't possible,
but if it is it would be of great help in debugging infinite allocation
loops. (actually, maybe a *before-gc* hook would be enough, with it
being my (the user) responsibility to write defensive enough code that
i don't exhaust the heap while trying to collect the heap)

ps - if it helps: my app, a web app (of course...) allocates three
different 'kinds' of objects: many (tens to hundreds of thousands) of
small and short lived (a few hundred milliseconds) strings and clos
instances, a decent amount (thousands to tens of thousands) of session
life (between 6 and 30 minutes) objects (hash tables with mapping
fixnums to clos objects), and a large number (millions) of long lived
clos instances (global caches).

pps - would there any objections to a patch which made
write_generation_stats output csv formatted data? (which would make it
easier to send this data to gplot or R, but harder for a human to read
it as it scrolls by)

sorry this is so long,
thanks,
--
-marco

------------------------------------------------------------------------------
Precog is a next-generation analytics platform capable of advanced
analytics on semi-structured data. The platform includes APIs for building
apps and a phenomenal toolset for data science. Developers can use
our toolset for easy data analysis & visualization. Get a free account!
http://www2.precog.com/precogplatform/slashdotnewsletter
_______________________________________________
Sbcl-devel mailing list
Sbcl-...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/sbcl-devel

Paul Khuong

unread,

Apr 19, 2013, 5:11:19 PM4/19/13

to Marco Baringer, sbcl-...@lists.sourceforge.net

I'll address the points I can.

Marco Baringer wrote:
> Are these assumptions correct:
>
> 1) If i have a 1GB heap (--dynamic-space-size 1024) and
> sb-kernel:dynamic-usage reports 400MB used, i can always allocate
> another 600MB of stuff (in my case these would be lots of
> relatively small objects, but it could also be a few larger (10KB)
> strings and hash tables).

Nope. The GC is copying, and may temporarily use as much as twice the
size of live data.

> 1b) The long lived object caches (where i'm guessing i have a problem)
> are very large trees of objects (instances whose slot value are
> other instances with forward and back pointers all around).
>
> If i want to see if i'm confusing the conservative collector,
> would this work: go slowly through my caches (which i can do) and
> break the forward/backward links with (setf (the-link) nil) and
> then call (gc :full t). monitor the change in dynamic-usage and
> see if it's small (more or less equivalent to the single object
> which is no longer reachable) or large (in which case there was a
> pointer in that object which was garbarge but the conservative
> collector was confused).

Yes, but you might create more conservative references in that process.

> 2) Running (sb-ext:gc :full t) at regular intervals has no draw backs
> and after that whatever i see in the gc logfile or on the heap is
> real live data.

Doing this may prematurely age data, which would let them live a lot
longer than if a minor GC had been triggered naturally.

> - In this particular case the number of pinned objects (!move) is always
> 0, but in normal runs (before heap exhaustion) i'll regularly see
> values in the column between 0 and 200. Do these have anything to do
> with ffi objects? (if so i may have a problem in how i'm using the c
> library to talk to the db)

It's much more likely to bec aused by stray words that look like
references on the stack than by foreign code directly.

> pps - would there any objections to a patch which made
> write_generation_stats output csv formatted data? (which would make it
> easier to send this data to gplot or R, but harder for a human to read
> it as it scrolls by)