[Caml-list] Strategies for finding memory leaks

Hans Ole Rafaelsen

unread,

Mar 21, 2012, 5:49:40 AM3/21/12

to caml...@inria.fr

Hello,

Is there some tools / tricks that can be used to help find memory leaks?

I have trouble with an application, that when running for a long time,
starts to use a lot of CPU and consume more memory. It starts out by using
about 20% CPU (reported by top) and after 24 hours it has increased to 80%
usage. Also physical (RES) memory usage goes from 80M to 160M. The workload
for the application is the same the whole time.

Using OProfile (http://oprofile.sourceforge.net/news/) shows that that most
of the time is being spent doing memory management.

At startup:
CPU: Core 2, speed 2667 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (Unhalted core cycles) count 100000
samples % image name symbol name
52764 22.3913 vc_client.native mark_slice
33580 14.2502 vc_client.native caml_page_table_lookup
25415 10.7853 vc_client.native sweep_slice
10119 4.2942 vc_client.native caml_fl_allocate
6423 2.7257 [vdso] (tgid:9015 range:0x7fff4256c000-0x7fff4256d000)
[vdso] (tgid:9015 range:0x7fff4256c000-0x7fff4256d000)
5233 2.2207 vc_client.native
camlLividisvc__Nalbuf_tools__replace_pattern_1033
2759 1.1708 vc_client.native caml_iterate_global_roots
2728 1.1577 vc_client.native caml_modify
2473 1.0495 vc_client.native caml_oldify_one
2204 0.9353 vc_client.native
camlLividisvc__Nalbuf_bytestream__search_1047
2183 0.9264 vc_client.native caml_darken
1935 0.8212 vc_client.native caml_stash_backtrace
1843 0.7821 vc_client.native caml_do_roots
1838 0.7800 vc_client.native caml_delete_global_root

After ca. 24 hours run:
CPU: Core 2, speed 2667 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask
of 0x00 (Unhalted core cycles) count 100000
samples % image name symbol name
1137401 56.2697 vc_client.native mark_slice
405598 20.0658 vc_client.native sweep_slice
399832 19.7806 vc_client.native caml_page_table_lookup
10106 0.5000 vc_client.native caml_fl_allocate
3548 0.1755 vc_client.native caml_iterate_global_roots
3397 0.1681 [vdso] (tgid:26129 range:0x7fff747ff000-0x7fff74800000)
[vdso]
(tgid:26129 range:0x7fff747ff000-0x7fff74800000)
2797 0.1384 vc_client.native
camlLividisvc__Nalbuf_tools__replace_
pattern_1033
2307 0.1141 vc_client.native
camlLividisvc__Nalbuf_bytestream__sea
rch_1047
2005 0.0992 vc_client.native caml_oldify_local_roots
1786 0.0884 vc_client.native caml_gc_stat
1441 0.0713 vc_client.native caml_oldify_one
1163 0.0575 vc_client.native caml_darken
1163 0.0575 vc_client.native caml_fl_merge_block
1032 0.0511 vc_client.native camlHashtbl__find_1093

The application uses several 3rd party libraries, including: LablGTK2,
OCamlNet, LWT and others.

Is there some clever trick that can by used to track down or get a hint of
what is causing this?

Thanks,

Hans Ole Rafaelsen

--
Caml-list mailing list. Subscription management and archives:
https://sympa-roc.inria.fr/wws/info/caml-list
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs

Goswin von Brederlow

unread,

Mar 22, 2012, 11:03:34 AM3/22/12

to Hans Ole Rafaelsen, caml...@inria.fr

Are you calling the GC manually somewhere in the code or in one of the
libs?

MfG
Goswin

Hans Ole Rafaelsen

unread,

Mar 23, 2012, 5:33:47 AM3/23/12

to Goswin von Brederlow, caml...@inria.fr

Hi,

I have tried to trigger Gc.compact but that only have limited effect for a
short time.

Hans Ole

Goswin von Brederlow

unread,

Mar 24, 2012, 10:00:46 AM3/24/12

to caml...@inria.fr

Hans Ole Rafaelsen <hrafa...@gmail.com> writes:

> Hi,
>
> I have tried to trigger Gc.compact but that only have limited effect for a
> short time.
>
> Hans Ole

My thinking was rather that something triggers the GC manually and over
time starts to do that far too often so more and more time is spend in
the GC.

If it isn't triggered manually then custom types can also give the GC a
wrong idea about how much memory is used/wasted. But that is also just
an idea of where to look.

Richard W.M. Jones

unread,

Apr 1, 2012, 3:58:27 PM4/1/12

to Hans Ole Rafaelsen, caml...@inria.fr

On Wed, Mar 21, 2012 at 10:49:22AM +0100, Hans Ole Rafaelsen wrote:
> Hello,
>
> Is there some tools / tricks that can be used to help find memory leaks?

[...]

> The application uses several 3rd party libraries, including: LablGTK2,
> OCamlNet, LWT and others.

If you suspect a leak in the *C* part of your program or these
libraries, then I am very successfully using valgrind on my OCaml
(native) programs to track such leaks.

https://github.com/libguestfs/libguestfs/blob/95123a9144edc1ddb3ad67d75276ca3af70eddb8/tests/extra/Makefile.am

Note that several of the programs tested are written in OCaml, and see
also the 'suppressions' file in the same directory.

However this will not find leaks in OCaml code. Although I was
pleasantly surprised the other day when valgrind pointed correctly to
a use of an uninitialized string in some OCaml code.

Rich.

--
Richard Jones
Red Hat

Hans Ole Rafaelsen

unread,

Apr 2, 2012, 4:15:54 AM4/2/12

to Richard W.M. Jones, caml...@inria.fr

Hi,

We have done some further investigation, and it does not seem to be memory
leakage. The application is a video streaming application, where we use
external processes to encode and decode the video. By configuring it in
different ways, we have found the problem to be in the code that is reading
the decoded video from an external process and displaying it using
LablGTK2. The previous thought memory leakage seems to be memory used for
buffers when the system is under high load, and this is not reclaimed by
the OS. When running on low frame rate the application does not use more
memory than at startup.

However, the application still consumes more and more CPU time. And it
seems to happen in the GC. Apart from that, the application seems to be
just fine. So it seems to be something in our code (or in LablGTK) that is
making the GC spend more and more time. Anyone experienced this kind of
problem?

Best,

Hans Ole Rafaelsen

Richard W.M. Jones

unread,

Apr 2, 2012, 6:14:15 AM4/2/12

to Hans Ole Rafaelsen, caml...@inria.fr

On Mon, Apr 02, 2012 at 10:15:02AM +0200, Hans Ole Rafaelsen wrote:
> However, the application still consumes more and more CPU time. And it
> seems to happen in the GC. Apart from that, the application seems to be
> just fine. So it seems to be something in our code (or in LablGTK) that is
> making the GC spend more and more time. Anyone experienced this kind of
> problem?

You're not swapping are you? (Run 'vmstat 1' & look at the si/so columns)

John Carr

unread,

Apr 2, 2012, 7:26:27 AM4/2/12

to Hans Ole Rafaelsen, caml...@inria.fr

Cache misses count as CPU time. I've seen people measure
parallel speedup by looking at CPU usage. They get linear
increase in CPU use to eight threads even though the task
saturated memory bandwidth after the second thread.

If you have a way to measure instruction count you can compare
increase in instruction count in GC to increase in time in GC.
If instructions are constant while time increases you probably
have a memory access problem. If instructions increase the
structure of references within your data may be forcing the
collector to do more work.

> However, the application still consumes more and more CPU time. And it
> seems to happen in the GC. Apart from that, the application seems to
> be just fine. So it seems to be something in our code (or in LablGTK)
> that is making the GC spend more and more time. Anyone experienced
> this kind of problem?

Hans Ole Rafaelsen

unread,

Apr 2, 2012, 9:41:06 AM4/2/12

to Richard W.M. Jones, caml...@inria.fr

Just did another run to be sure. It does not do any swapping.

--
Hans Ole

Jerome Vouillon

unread,

Apr 3, 2012, 8:14:24 AM4/3/12

to Gerd Stolpmann, Hans Ole Rafaelsen, Richard W.M. Jones, caml...@inria.fr

On Tue, Apr 03, 2012 at 12:42:08PM +0200, Gerd Stolpmann wrote:
> This reminds me of a problem I had with a specific C binding (for mysql),
> years ago. That binding allocated custom blocks with badly chosen
> parameters used/max (see the docs for caml_alloc_custom in
> http://caml.inria.fr/pub/docs/manual-ocaml/manual032.html#toc144). If the
> ratio used/max is > 0, these parameters accelerate the GC. If the custom
> blocks are frequently allocated, this can have a dramatic effect, even for
> quite small used/max ratios. The solution was to change the code, and to
> set used=0 and max=1.
>
> This type of problem would match your observation that the GC works more
> and more the longer the program runs, i.e. the more custom blocks have
> been allocated.
>
> The problem basically also exists with bigarrays - with
> used=<size_of_bigarary> and max=256M (hardcoded).

I have also observed this with input-output channels (in_channel and
out_channel), where used = 1 and max = 1000. A full major GC is
performed every time a thousand files are opened, which can result on
a significant overhead when you open lot of files and the heap is
large.

-- Jerome

Hans Ole Rafaelsen

unread,

Apr 4, 2012, 6:53:41 AM4/4/12

to Jerome Vouillon, Gerd Stolpmann, Richard W.M. Jones, caml...@inria.fr

Hi,

Thanks for your suggestions. I tried to patch lablgtk2 with:

--- src/ml_gdkpixbuf.c.orig 2012-04-03 13:56:29.618264702 +0200
+++ src/ml_gdkpixbuf.c 2012-04-03 13:56:58.106263510 +0200
@@ -119,7 +119,7 @@
value ret;
if (pb == NULL) ml_raise_null_pointer();
ret = alloc_custom (&ml_custom_GdkPixbuf, sizeof pb,
- 100, 1000);
+ 0, 1);
p = Data_custom_val (ret);
*p = ref ? g_object_ref (pb) : pb;
return ret;

--- src/ml_gobject.c.orig 2012-04-03 15:40:11.002004506 +0200
+++ src/ml_gobject.c 2012-04-03 15:41:04.938002250 +0200
@@ -219,7 +219,7 @@
CAMLprim value ml_g_value_new(void)
{
value ret = alloc_custom(&ml_custom_GValue,
sizeof(value)+sizeof(GValue),
- 20, 1000);
+ 0, 1);
/* create an MLPointer */
Field(ret,1) = (value)2;
((GValue*)&Field(ret,2))->g_type = 0;
@@ -272,14 +272,14 @@
custom_serialize_default, custom_deserialize_default };
CAMLprim value Val_gboxed(GType t, gpointer p)
{
- value ret = alloc_custom(&ml_custom_gboxed, 2*sizeof(value), 10, 1000);
+ value ret = alloc_custom(&ml_custom_gboxed, 2*sizeof(value), 0, 1);
Store_pointer(ret, g_boxed_copy (t,p));
Field(ret,2) = (value)t;
return ret;
}
CAMLprim value Val_gboxed_new(GType t, gpointer p)
{
- value ret = alloc_custom(&ml_custom_gboxed, 2*sizeof(value), 10, 1000);
+ value ret = alloc_custom(&ml_custom_gboxed, 2*sizeof(value), 0, 1);
Store_pointer(ret, p);
Field(ret,2) = (value)t;
return ret;

At startup is uses
top - 16:40:27 up 1 day, 7:01, 28 users, load average: 0.47, 0.50, 0.35
Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
Cpu(s): 4.8%us, 1.3%sy, 0.0%ni, 93.6%id, 0.2%wa, 0.0%hi, 0.1%si,
0.0%st
Mem: 4004736k total, 3617960k used, 386776k free, 130704k buffers
Swap: 4070396k total, 9244k used, 4061152k free, 1730344k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+
COMMAND
10275 hans 20 0 529m 77m 13m S 14 2.0 0:01.66
vc_client.nativ

and 12 hours later
top - 04:40:07 up 1 day, 19:01, 35 users, load average: 0.00, 0.01, 0.05
Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
Cpu(s): 20.2%us, 3.4%sy, 0.0%ni, 76.1%id, 0.1%wa, 0.0%hi, 0.2%si,
0.0%st
Mem: 4004736k total, 3828308k used, 176428k free, 143928k buffers
Swap: 4070396k total, 10708k used, 4059688k free, 1756524k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+
COMMAND
10275 hans 20 0 534m 82m 13m S 17 2.1 110:11.19
vc_client.nativ

Without the patch
top - 22:05:38 up 1 day, 12:26, 34 users, load average: 0.35, 0.16, 0.13
Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
Cpu(s): 5.6%us, 1.5%sy, 0.0%ni, 92.6%id, 0.2%wa, 0.0%hi, 0.1%si,
0.0%st
Mem: 4004736k total, 3868136k used, 136600k free, 140900k buffers
Swap: 4070396k total, 9680k used, 4060716k free, 1837500k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+
COMMAND
25111 hans 20 0 453m 76m 13m S 14 2.0 0:13.68
vc_client_old.n

top - 10:05:19 up 2 days, 26 min, 35 users, load average: 0.01, 0.04, 0.05
Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
Cpu(s): 20.4%us, 3.2%sy, 0.0%ni, 75.8%id, 0.4%wa, 0.0%hi, 0.2%si,
0.0%st
Mem: 4004736k total, 3830596k used, 174140k free, 261692k buffers
Swap: 4070396k total, 13640k used, 4056756k free, 1640452k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+
COMMAND
25111 hans 20 0 453m 76m 13m S 49 2.0 263:05.34
vc_client_old.n

So from this it seems that with the patch it still uses more and more CPU,
but at a much lower rate. However, it seems to increase memory usage with
the patch. I also tried to patch the wrappers.h file, but the memory
consumption just exploded.

So it is working better, but still not good enough. Is there some way to
prevent this kind of behavior? That is, no extra memory usage and no extra
CPU usage.

I have attached some additional profiling if that would be of any interest.
In short it seems to be that it is the GC that is consuming the CPU.

Best,

Hans Ole

MemoryProfiling.txt

Gabriel Scherer

unread,

Apr 4, 2012, 7:31:36 AM4/4/12

to Hans Ole Rafaelsen, caml...@inria.fr

May your program leak one of those GTK resources?

The effectiveness of your patch seems to indicate that you have a lot
of one of these values allocated (and that they were requesting the GC
much too frequently). The patch solves the CPU usage induced by
additional GC, but does not change the reason why those GC were
launched: apparently your code allocates a lot of those resources. If
there indeed is a leak in your program, it will use more and more
memory even if you fix the CPU-usage effect.

An interesting side-effect of your patch is that you could, by
selectively disabling some of the change you made (eg. by changing
Val_g_boxed but not Val_g_boxed_new), isolate which of those resources
were provoking the increased CPU usage, because it was allocated in
high number.

(Usual candidates that provoke leak are global data structures that
store references to your data. A closure will also reference the data
corresponding to the variables it captures, so storing closures in
such tables can be an indirect cause for "leaks". Do you have global
tables of callbacks or values for GTK-land?)

Gerd Stolpmann

unread,

Apr 4, 2012, 8:55:48 AM4/4/12

to Gabriel Scherer, Hans Ole Rafaelsen, caml...@inria.fr

Am Mittwoch, den 04.04.2012, 13:30 +0200 schrieb Gabriel Scherer:
> May your program leak one of those GTK resources?
>
> The effectiveness of your patch seems to indicate that you have a lot
> of one of these values allocated (and that they were requesting the GC
> much too frequently). The patch solves the CPU usage induced by
> additional GC, but does not change the reason why those GC were
> launched: apparently your code allocates a lot of those resources. If
> there indeed is a leak in your program, it will use more and more
> memory even if you fix the CPU-usage effect.
>
> An interesting side-effect of your patch is that you could, by
> selectively disabling some of the change you made (eg. by changing
> Val_g_boxed but not Val_g_boxed_new), isolate which of those resources
> were provoking the increased CPU usage, because it was allocated in
> high number.

Or just increment a counter for each type.

Gerd

------------------------------------------------------------
Gerd Stolpmann, Darmstadt, Germany ge...@gerd-stolpmann.de
Creator of GODI and camlcity.org.
Contact details: http://www.camlcity.org/contact.html
Company homepage: http://www.gerd-stolpmann.de
*** Searching for new projects! Need consulting for system
*** programming in Ocaml? Gerd Stolpmann can help you.
------------------------------------------------------------

Gerd Stolpmann

unread,

Apr 4, 2012, 9:55:39 AM4/4/12

to Gabriel Scherer, Hans Ole Rafaelsen, caml...@inria.fr

I'm trying to think beyond this particular problem, and it seems to me
that the current "GC speed control" for custom blocks is too simple.
Right now, we can practically only base the speed on either

- the size of a custom object relative to the expected maximum
size of all custom objects of the type, or
- the number of a custom objects relative to the maximum number
of objects, or
- we don't control the speed, and hope that the normal GC speed
is sufficient to collect the custom blocks fast enough.

The problem of the first two strategies is that we normally cannot
predict how many objects we will create in a program, or how much memory
we will spend, and that most C bindings just choose constants for
used/max. For instance, the Lablgtk bindings assume that there are
normally no more than 10 gdk pixbufs (max/used = 1000/100 = 10), i.e.
after every 10 pixbuf allocations one GC cycle is performed.

What about a different scheme: For each custom block we record the
number of bytes allocated outside the heap (which is often known or can
be well guessed), and when calculating the GC speed, the size of this
external memory is just added to the numbers for the heap memory (i.e.
the size of allocations in the last GC slice to caml_allocated_words and
the total to caml_stat_heap_size), so we "virtually" consider the
external memory as an extension of the heap. The effect is that the same
speed control is used as for the heap (i.e. overhead-based control).

An implementation could store the number of external bytes in the custom
block (one more word is allocated than requested) - you need it for
updating the total.

Let's look a bit closer at the effect:

- If the size of the external memory is moderate, the GC speed is
linearly increased the more external memory is allocated. The
good thing is that the speed increase is now relative to the
size of the heap. If the heap is large anyway, the external memory
almost does not count, and can never dominate the speed controls
(which is a main problem with the current scheme). If the heap
is small, the speed-up can be substantial, but this is not as
problematic, because a small heap is quickly GC'ed.
- If the size of the external memory is large relative to the heap,
it will still dominate the speed control. The user could, however,
still influence the speed by setting the overhead parameter. So,
still better than the current scheme.

Well, it's just an idea. Any comments?

Gerd

Am Mittwoch, den 04.04.2012, 13:30 +0200 schrieb Gabriel Scherer:

------------------------------------------------------------
Gerd Stolpmann, Darmstadt, Germany ge...@gerd-stolpmann.de
Creator of GODI and camlcity.org.
Contact details: http://www.camlcity.org/contact.html
Company homepage: http://www.gerd-stolpmann.de
*** Searching for new projects! Need consulting for system
*** programming in Ocaml? Gerd Stolpmann can help you.
------------------------------------------------------------

Hans Ole Rafaelsen

unread,

Apr 7, 2012, 9:28:10 AM4/7/12

to Gerd Stolpmann, Gabriel Scherer, caml...@inria.fr

So just to be clear, it seems like I'm allocating lots of objects of a kind
that I don't free. I have been trying to tracking down this in my ML part
that use the library. You suggest that trying to have a counter in the C
binding part of the library and count each time a object is created (and
maybe each time it is destroyed) might be a better option. I have not
worked much with ML<->C binding code, just want to be sure that this might
be a proper way to do it before I start.

--
Hans Ole