[Caml-list] Questions on replacing finalizers and memory footprints

Thomas Fischbacher

unread,

Dec 6, 2007, 6:05:05 AM12/6/07

to caml...@inria.fr

Hello everybody,

is there a simple and straightforward way to replace a finalizer
function attached to some entity by a different one, or do I have
to work around this problem by modifying my finalizers appropriately?

Also, is there a simple way to implement a function (perhaps using
Obj.magic) which will walk a (possibly circular) network of tuples,
arrays, variadic entities and lists, and return the total number of
bytes used up by that structure? I see that this should be possible
in principle with the present implementation of the runtime if one
could get some basic information about the internal type of an array.

Actually, the Marshal module will have to do something very similar
(but not exactly the same thing). Is there a way to get hold of
the underlying functions?

--
best regards,
Thomas Fischbacher
t...@functionality.de

_______________________________________________
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs

dmitry grebeniuk

unread,

Dec 6, 2007, 7:53:06 AM12/6/07

to Thomas Fischbacher, caml...@inria.fr

Hello, Thomas.

TF> Also, is there a simple way to implement a function
TF> (perhaps using Obj.magic) which will walk a (possibly
TF> circular) network of tuples, arrays, variadic entities
TF> and lists, and return the total number of bytes used up
TF> by that structure? I see that this should be possible in
TF> principle with the present implementation of the runtime
TF> if one could get some basic information about the
TF> internal type of an array.

When I faced this problem, I wrote a function to
get the size of value. This is a C-world function
which walks though values and marks visited
values by changing (and then restoring) their "color".
Colors are stored using RLE-like compression to
decrease memory usage.

But it's possible to calculate the size of the value
using pure ocaml code, using Obj to examine values and
Hashtbl to store visited values. It will work well
with small or medium-sized values.

--
WBR,
dmitry mailto:gds-...@moldavcable.com

Richard Jones

unread,

Dec 6, 2007, 9:27:03 AM12/6/07

to Thomas Fischbacher, caml...@inria.fr

On Thu, Dec 06, 2007 at 11:12:04AM +0000, Thomas Fischbacher wrote:
> Also, is there a simple way to implement a function (perhaps using
> Obj.magic) which will walk a (possibly circular) network of tuples,
> arrays, variadic entities and lists, and return the total number of
> bytes used up by that structure? I see that this should be possible
> in principle with the present implementation of the runtime if one
> could get some basic information about the internal type of an array.

You might get some ideas by having a look at the ancient module
(http://merjis.com/developers/ancient), specifically at how the C
function _mark in ancient_c.c is implemented. Also have a look at the
implementation of the Marshal module in the OCaml sources which takes
a slightly different approach.

If you want to do this in pure OCaml, probably your best bet would be
to just Marshal the structure and count how big it is. It'll be slow
of course.

Rich.

--
Richard Jones
Red Hat

Thomas Fischbacher

unread,

Dec 6, 2007, 9:50:43 AM12/6/07

to Richard Jones, caml...@inria.fr

Richard Jones wrote:

> If you want to do this in pure OCaml, probably your best bet would be
> to just Marshal the structure and count how big it is. It'll be slow
> of course.

Actually, the situation that brought up this question is that I have a
complicated internal data structure which will free 300 MB of RAM if I
delete it, while serializing it produces a file of 94 MB only...
So, I would like to have more clarity what is going on here, and which
part of this data structure eats how much space.

--
Best regards,
Thomas Fischbacher
t...@functionality.de

Jon Harrop

unread,

Dec 6, 2007, 12:00:24 PM12/6/07

to caml...@yquem.inria.fr

On Thursday 06 December 2007 14:57, Thomas Fischbacher wrote:
> Richard Jones wrote:
> > If you want to do this in pure OCaml, probably your best bet would be
> > to just Marshal the structure and count how big it is. It'll be slow
> > of course.
>
> Actually, the situation that brought up this question is that I have a
> complicated internal data structure which will free 300 MB of RAM if I
> delete it, while serializing it produces a file of 94 MB only...
> So, I would like to have more clarity what is going on here, and which
> part of this data structure eats how much space.

I had never though of measuring the size of a marshalled data structure. Turns
out its representation of ints can be more concise than the code
representation though:

# String.length (Marshal.to_string (Array.make 1000000 0) []);;
- : int = 1000025
# String.length (Marshal.to_string (Array.make 1000000 123456789) []);;
- : int = 5000025
# String.length (Marshal.to_string (Array.make 1000000 max_int) []);;
- : int = 9000025

which is probably what you're observing.

Marshalling also handles sharing but that seems to refer to DAGs in memory
rather than hash consing:

# String.length (Marshal.to_string (Array.make 1000000 0., Array.make 1000000
0.) []);;
- : int = 16000031

--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?e

fo...@x9c.fr

unread,

Dec 6, 2007, 4:31:37 PM12/6/07

to caml-list

Le 6 déc. 07 à 17:50, Jon Harrop a écrit :

> On Thursday 06 December 2007 14:57, Thomas Fischbacher wrote:
>> Richard Jones wrote:
>>> If you want to do this in pure OCaml, probably your best bet
>>> would be
>>> to just Marshal the structure and count how big it is. It'll be
>>> slow
>>> of course.
>>
>> Actually, the situation that brought up this question is that I
>> have a
>> complicated internal data structure which will free 300 MB of RAM
>> if I
>> delete it, while serializing it produces a file of 94 MB only...
>> So, I would like to have more clarity what is going on here, and
>> which
>> part of this data structure eats how much space.
>
> I had never though of measuring the size of a marshalled data
> structure. Turns
> out its representation of ints can be more concise than the code
> representation though:

(...)

Yes, the marshalled format although not compressed is quite optimized.
For example, if I am not wrong, an "int" value in 0..63 will be coded
on a
single byte. Such a value may take up to 8 times more space in memory
if you are using a 64-bit architecture. The same is of course also
true for
all data types that are represented at runtime by values in 0..63
(e.g. variant with no embedded data).
The same kind of encoding is used for "small" blocks, whose headers are
stored more compactly in a marshalled stream than in memory.

As a consequence, it is not very surprising that you observe a 3:1 ratio
between in-memory and marshalled representations.

Xavier Clerc

Xavier Leroy

unread,

Dec 7, 2007, 3:53:31 AM12/7/07

to Thomas Fischbacher, caml...@inria.fr

> Also, is there a simple way to implement a function (perhaps using
> Obj.magic) which will walk a (possibly circular) network of tuples,
> arrays, variadic entities and lists, and return the total number of
> bytes used up by that structure? I see that this should be possible
> in principle with the present implementation of the runtime if one
> could get some basic information about the internal type of an
> array.

Jean-Christophe Filliātre's "size" library does exactly this:

http://www.lri.fr/~filliatr/software.en.html

- Xavier Leroy

Jean-Christophe Filliâtre

unread,

Dec 7, 2007, 5:34:33 AM12/7/07

to Xavier Leroy, Thomas Fischbacher, caml...@inria.fr

Xavier Leroy wrote:
>> Also, is there a simple way to implement a function (perhaps using
>> Obj.magic) which will walk a (possibly circular) network of tuples,
>> arrays, variadic entities and lists, and return the total number of
>> bytes used up by that structure? I see that this should be possible
>> in principle with the present implementation of the runtime if one
>> could get some basic information about the internal type of an
>> array.
>
> Jean-Christophe Filliātre's "size" library does exactly this:
>
> http://www.lri.fr/~filliatr/software.en.html

Indeed. However, note that it uses internally a hash table to store
blocks already considered (in order to correctly account for sharing),
and thus it is potentially incorrect if the GC moves some blocks during
the count, for instance during a resizing of the hash table (which
triggers the GC). I don't know how to avoid this issue; any help is welcome.

--
Jean-Christophe Filliātre
http://www.lri.fr/~filliatr/

Jon Harrop

unread,

Dec 7, 2007, 5:45:42 AM12/7/07

to caml...@yquem.inria.fr

On Friday 07 December 2007 10:44, Jean-Christophe Filliātre wrote:
> Xavier Leroy wrote:
> >> Also, is there a simple way to implement a function (perhaps using
> >> Obj.magic) which will walk a (possibly circular) network of tuples,
> >> arrays, variadic entities and lists, and return the total number of
> >> bytes used up by that structure? I see that this should be possible
> >> in principle with the present implementation of the runtime if one
> >> could get some basic information about the internal type of an
> >> array.
> >
> > Jean-Christophe Filliātre's "size" library does exactly this:
> >
> > http://www.lri.fr/~filliatr/software.en.html
>
> Indeed. However, note that it uses internally a hash table to store
> blocks already considered (in order to correctly account for sharing),
> and thus it is potentially incorrect if the GC moves some blocks during
> the count, for instance during a resizing of the hash table (which
> triggers the GC). I don't know how to avoid this issue; any help is
> welcome.

Yes. I wanted this functionality as well and tried your "size" library but it
only segfaults on my machine. Presumably this must be written in C to avoid
using the GC?

Perhaps this would be a good product... :-)

--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?e

_______________________________________________

fo...@x9c.fr

unread,

Dec 7, 2007, 6:19:29 AM12/7/07

to caml...@inria.fr

Selon Jean-Christophe Filliātre <Jean-Christo...@lri.fr>:

> Xavier Leroy wrote:
> >> Also, is there a simple way to implement a function (perhaps using
> >> Obj.magic) which will walk a (possibly circular) network of tuples,
> >> arrays, variadic entities and lists, and return the total number of
> >> bytes used up by that structure? I see that this should be possible
> >> in principle with the present implementation of the runtime if one
> >> could get some basic information about the internal type of an
> >> array.
> >
> > Jean-Christophe Filliātre's "size" library does exactly this:
> >
> > http://www.lri.fr/~filliatr/software.en.html
>
> Indeed. However, note that it uses internally a hash table to store
> blocks already considered (in order to correctly account for sharing),
> and thus it is potentially incorrect if the GC moves some blocks during
> the count, for instance during a resizing of the hash table (which
> triggers the GC). I don't know how to avoid this issue; any help is welcome.

Sorry for the noise but doesn't this mean that the "size" function may not
terminate ?

Berke Durak

unread,

Dec 7, 2007, 6:32:35 AM12/7/07

to caml...@inria.fr

Xavier Leroy a écrit :

Hello,

I once had more or less the same problem in the EDOS project with huge
data structures.

Filliātre's size library was also growing too large a hashtable,
so I decided to use something else.

My solution was to marshal the data structure and then analyze the
marshalled data by recomputing its type.

You first write a TYPER module that represents the types in your
marshalled data structure, then use the Analyzer module on your
output.

I was more interested in the size strings were taking in memory,
but it should be easy to modify. All the ugly parts (parsing the
marshal format, and the combinators for describing the types)
have been defined.

https://protactinium.pps.jussieu.fr:12345/svn/edos/users/berke/dvhfz/analyze.ml
--
Berke DURAK

Jean-Christophe Filliâtre

unread,

Dec 7, 2007, 2:55:13 PM12/7/07

to fo...@x9c.fr, caml...@inria.fr

fo...@x9c.fr a écrit :

>> Indeed. However, note that it uses internally a hash table to store
>> blocks already considered (in order to correctly account for sharing),
>> and thus it is potentially incorrect if the GC moves some blocks during
>> the count, for instance during a resizing of the hash table (which
>> triggers the GC). I don't know how to avoid this issue; any help is welcome.
>
> Sorry for the noise but doesn't this mean that the "size" function may not
> terminate ?

No, simply that it may count a same block several times, because it was
moved by the GC during a resizing of the hash table, between the first
time it was seen and the next time it is reached from another block.

So you may only overestimate the size of the data. It should not fail,
and should terminate, even on cyclic values.

Jon is reporting segfaults, though, so there must be a bug somewhere in
my code (which uses Obj for obvious reasons). I'll have a look at it if
Jon is able to send some faulty application.

--
Jean-Christophe Filliātre
http://www.lri.fr/~filliatr/

_______________________________________________

Christophe Raffalli

unread,

Dec 7, 2007, 3:27:24 PM12/7/07

to Jean-Christophe Filliâtre, Thomas Fischbacher, caml...@inria.fr, Xavier Leroy

Jean-Christophe Filliātre a écrit :

> Xavier Leroy wrote:
>
>>> Also, is there a simple way to implement a function (perhaps using
>>> Obj.magic) which will walk a (possibly circular) network of tuples,
>>> arrays, variadic entities and lists, and return the total number of
>>> bytes used up by that structure? I see that this should be possible
>>> in principle with the present implementation of the runtime if one
>>> could get some basic information about the internal type of an
>>> array.
>>>
>> Jean-Christophe Filliātre's "size" library does exactly this:
>>
>> http://www.lri.fr/~filliatr/software.en.html
>>
>
> Indeed. However, note that it uses internally a hash table to store
> blocks already considered (in order to correctly account for sharing),
> and thus it is potentially incorrect if the GC moves some blocks during
> the count, for instance during a resizing of the hash table (which
> triggers the GC). I don't know how to avoid this issue; any help is welcome.
>
>

Write size in C, marshaling is written with a hashtbl in C probably for
that same reason ...

Christophe

fo...@x9c.fr

unread,

Dec 7, 2007, 3:59:16 PM12/7/07

to caml...@inria.fr

Le 7 déc. 07 ą 20:54, Jean-Christophe Filliātre a écrit :

> fo...@x9c.fr a écrit :
>>> Indeed. However, note that it uses internally a hash table to store
>>> blocks already considered (in order to correctly account for
>>> sharing),
>>> and thus it is potentially incorrect if the GC moves some blocks
>>> during
>>> the count, for instance during a resizing of the hash table (which
>>> triggers the GC). I don't know how to avoid this issue; any help
>>> is welcome.
>>
>> Sorry for the noise but doesn't this mean that the "size" function
>> may not
>> terminate ?
>
> No, simply that it may count a same block several times, because it
> was
> moved by the GC during a resizing of the hash table, between the first
> time it was seen and the next time it is reached from another block.
>
> So you may only overestimate the size of the data. It should not fail,
> and should terminate, even on cyclic values.

My mistake, I did not take into account the fact that if a block is
moved by the
garbage collector, its reference is updated in the hashtable *too*.
Hence the termination guarantee.

Alexandre Pilkiewicz

unread,

Dec 8, 2007, 5:07:11 AM12/8/07

to caml...@yquem.inria.fr

Le Friday 07 December 2007 22:01:23 fo...@x9c.fr, vous avez écrit :
> Le 7 déc. 07 ą 20:54, Jean-Christophe Filliātre a écrit :

> My mistake, I did not take into account the fact that if a block is
> moved by the
> garbage collector, its reference is updated in the hashtable *too*.
> Hence the termination guarantee.

I think the problem is with the hash function :

let hash o = Hashtbl.hash (magic o : int)

If you put an object in the hash table, it is stored under a key that depends
on it's address a1. Once it's moved by the GC to the address a2, its
reference is changed to a2, but not its key which is still the hash of a1. So
when you check after that if your object is allready in the hash table, you
look under the key hash(a2) if it's allready there, but it's not ! And if you
are very unlucky (and have very few memory), it might append several time.

One solution could be to store the objects in a normal list and to look at the
entire list every time. It would be *much* slower on huge structures, but
probably more "correct" (so if used only for debug purpose, why not..)

let node_list = ref []

let in_list o = List.memq o !node_list

let add_in_list o = node_list := o::!node_list

let reset_list () = node_list := []

--
Alexandre Pilkiewicz

Benjamin Canou

unread,

Dec 8, 2007, 9:21:23 AM12/8/07

to caml-list

Hi,

Another solution (not safe if the value is used at the same time in
another thread or signal handler however) is to modify the value while
exploring it.

Here is the principle :
When you explore a block node b, you replace its first field with a
special block saying "I've not restored this node, the original first
field is v" (a tuple (false, v) with a special tag) and you call the
recursive size function on v and remaining fields of b; when you explore
this block again, you don't call the recursive function since the first
field is a special block.
When the calculation is done, you explore the value again, and when you
find a block b with a special block (false, v) as first field, you set
its status to "I'm restoring this block", you call the recursive
reconstruction on v and replace the first field of b with v in the end;
otherwise, you do nothing.

Here is the code, but it does only handle simple blocks (not strings,
etc.) Also since one cannot mark blocks easily, I use a high block tag
which should not occur in most programs, but the code in totally unsafe
and can destroy the value if it is the case. A safer way could be to use
ad hoc custom blocks. As said in caps lock at the beginning of many
source files : "use at your own risk, no warranty"...

open Obj

(* the special tag is 240 , not safe if used in v *)
let calc_size v =
let rec calc_size v =
if is_int v then 1 else (
if size v >= 1 then (
let v0 = field v 0 in
if is_int v0 || tag v0 <> 240 then (
let s = ref (size v + 1 (* header *)) in
let d = new_block 240 2 in
set_field v 0 d ;
set_field d 0 (repr false) ;
set_field d 1 v0 ;
if is_block v0 then
s := !s + calc_size v0;
for i = 1 to size v - 1 do
if is_block (field v i) then
s := !s + calc_size (field v i)
done ;
!s
) else 0
) else 0 (* atoms are preallocated *)
) in
let rec restore v =
if is_block v then (
if size v >= 1 then (
let v0 = field v 0 in
if is_block v0 && tag v0 == 240 then (
if field v0 0 = repr false then (
set_field v0 0 (repr true) ;
restore (field v0 1) ;
for i = 1 to size v - 1 do
restore (field v i) ;
done ;
set_field v 0 (field v0 1)
)
)
)
) in
let size = calc_size v in restore v ; size

Benjamin.

PS: My code is not much tested, if you find bugs and/or enhance it with
strings, doubles, etc., I'm interested.

Le samedi 08 décembre 2007 à 10:57 +0100, Alexandre Pilkiewicz a écrit :
> Le Friday 07 December 2007 22:01:23 fo...@x9c.fr, vous avez écrit :

> > Le 7 déc. 07 à 20:54, Jean-Christophe Filliâtre a écrit :

> > My mistake, I did not take into account the fact that if a block is
> > moved by the
> > garbage collector, its reference is updated in the hashtable *too*.
> > Hence the termination guarantee.
>
> I think the problem is with the hash function :
>
> let hash o = Hashtbl.hash (magic o : int)
>
> If you put an object in the hash table, it is stored under a key that depends
> on it's address a1. Once it's moved by the GC to the address a2, its
> reference is changed to a2, but not its key which is still the hash of a1. So
> when you check after that if your object is allready in the hash table, you
> look under the key hash(a2) if it's allready there, but it's not ! And if you
> are very unlucky (and have very few memory), it might append several time.
>
> One solution could be to store the objects in a normal list and to look at the
> entire list every time. It would be *much* slower on huge structures, but
> probably more "correct" (so if used only for debug purpose, why not..)
>
> let node_list = ref []
>
> let in_list o = List.memq o !node_list
>
> let add_in_list o = node_list := o::!node_list
>
> let reset_list () = node_list := []
>

_______________________________________________

Hendrik Tews

unread,

Jan 23, 2008, 7:09:06 AM1/23/08

to caml...@yquem.inria.fr

Jean-Christophe Filliātre <Jean-Christo...@lri.fr>
writes:

> Jean-Christophe Filliātre's "size" library does exactly this:
>
> http://www.lri.fr/~filliatr/software.en.html

Indeed. However, note that it uses internally a hash table to store
blocks already considered (in order to correctly account for sharing),
and thus it is potentially incorrect if the GC moves some blocks during
the count, for instance during a resizing of the hash table (which
triggers the GC). I don't know how to avoid this issue; any help is welcome.

Use the normal hash function (ie. the hash depends on the block
and not on its address) and use the blocks themselves as keys!
Then the GC will not change the hash and will update the keys in
the hash table. Of course all in a hash table with physical
equality. I use this solution in memcheck
(http://www.cs.ru.nl/~tews/memcheck/).

The problem is that your runtime will be quadratic in the number
of blocks that are equal (=) but different (not ==). For my
applications of memcheck this is problematic, because the data
contains lots of (ref None) blocks, which cannot be identified
for obvious reasons.

The (apparently abondoned) ocaml-ty branch
(http://www.pps.jussieu.fr/~henry/marshal/) uses a list instead
of a hash table and is therefore quadratic in _all_ blocks. The
performence is only sufficient for toy applications (4 hours for
checking 4MB).

For applications like size or memcheck it would be nice if the GC
could notify the program when it moves blocks.

Bye,

Hendrik

(Better answering late than never...)