Datatype optimization?

Jeremy Fincher

unread,

Nov 13, 2002, 2:13:08 PM11/13/02

to

It seems that many datatypes in ML (and especially many of the most
used ones) have two constructors, one which actually "does something"
and the other which is the "base case". For example:

datatype 'a option = SOME of 'a | NONE
datatype 'a list = CONS of 'a * 'a list | NIL
datatype 'a tree = NODE of 'a tree * 'a * 'a tree | LEAF

If I understand correctly, sum types in ML are represented as "tagged
unions" -- each constructor name is mapped to an integer (SOME might
be 1, NONE might be 2; CONS and NIL and NODE and LEAF likewise). So
SOME 'a would require two words of storage: one for the tag, and one
for the actual data. A CONS would also require two words; one for the
tag, and one for the tuple (which itself would require 2 or 3 words,
as I understand it). That's a bit of extra overhead compared to, say,
C, in which a singly linked list would be represented by a struct with
.data and .next fields (and NIL would be represented by a NULL
pointer), or an option would be implemented as either a pointer to
something or NULL.

What I'm curious about is if this situation could be made a special
case in ML. If there are only two constructors, and only one of the
two actually contains data (as is the case in each of the three
examples above) would it be possible to represent SOME 'a as a pointer
to something, and NONE as simply 0 (a NULL pointer); or to represent
CONS as a pointer to a 2-tuple, and NIL as simply a NULL pointer?
That would reduce the memory required by each CONS or SOME by a word,
which, given their widespread usage, might be a significant gain.
This would allow a programmer to initialize an option array to NONE
without that nagging "I'm wasting a word of memory for each element in
the array just to be typesafe" feeling (which is what originally
prompted me to ask this question :)).

I'm just curious if any compilers do this optimization, and if they
don't, why not?

Jeremy

Xavier Leroy

unread,

Nov 13, 2002, 3:33:57 PM11/13/02

to

tweed...@hotmail.com (Jeremy Fincher) writes:

> What I'm curious about is if this situation could be made a special
> case in ML. If there are only two constructors, and only one of the
> two actually contains data (as is the case in each of the three
> examples above) would it be possible to represent SOME 'a as a pointer
> to something, and NONE as simply 0 (a NULL pointer); or to represent
> CONS as a pointer to a 2-tuple, and NIL as simply a NULL pointer?

I believe many (most?) ML compilers represent constant constructors by
small integers, and non-constant constructors by pointers. This is
the case for OCaml, for instance.

Be careful about "SOME 'a represented as a pointer to something": if
you mean that SOME(x) should be represented like x, it becomes
impossible to distinguish NONE from SOME(NONE), both being legal
values of type 'a option option. So, you still need the extra
indirection cell for SOME.

As for omitting the tag if there is only one non-constant constructor,
it can certainly be done, but depending on the memory format of heap
blocks, it may or may not lead to a space saving. Continuing the
OCaml example, every heap block has a one-word header giving its size
and GC information. OCaml steals a few bits from this header to store
the tag representing the constructor, thus storing the tag at no extra
cost. Omitting the tag isn't going to save any space...

- Xavier Leroy
--
Valid e-mail address (without the underscores): Xavier.Leroy@i_n_r_i_a.f_r
This is a protection against junk mail. Apologies for the inconvenience.
Home page: http://pauillac.inria.fr/~xleroy/

Stephen J. Bevan

unread,

Nov 13, 2002, 5:03:44 PM11/13/02

to

tweed...@hotmail.com (Jeremy Fincher) writes:
> ... or to represent

> CONS as a pointer to a 2-tuple, and NIL as simply a NULL pointer?
> That would reduce the memory required by each CONS or SOME by a word,
> which, given their widespread usage, might be a significant gain.

[snip]

> I'm just curious if any compilers do this optimization, and if they
> don't, why not?

IIRC a list node (aka CONS) in MLWorks consumed exactly two words and
so had the significant gain you are looking for.

Stephen Weeks

unread,

Nov 15, 2002, 6:41:35 PM11/15/02

to

Xavier Leroy wrote:

> Be careful about "SOME 'a represented as a pointer to something": if
> you mean that SOME(x) should be represented like x, it becomes
> impossible to distinguish NONE from SOME(NONE), both being legal
> values of type 'a option option. So, you still need the extra
> indirection cell for SOME.

The compiler does not always have to use an extra indirection in the
case of SOME. MLton, for example, makes the decision on a
case-by-case basis.

For example, in the type "(int * int) option", MLton will represent
NONE with the integer 1 and SOME (x, y) as an untagged pointer to the
tuple, without an extra indirection.

On the other hand, in the type "int option", MLton will represent NONE
with the integer 1 and SOME x as an untagged pointer to a 2 word
heap-allocated object, with one word for the header (that all heap
objects have) and one word for the integer.

On the other hand, unfortunately, for the type "bool option", MLton
will use exactly the same representation as for "int option", even
though it would be nice if it were smart enough to translate "bool
option" into
datatype boolopt = None | SomeFalse | SomeTrue
and represent "bool option" with the integers {0, 1, 2}.

Matthew Arcus

unread,

Nov 19, 2002, 1:18:39 PM11/19/02

to

swe...@sweeks.com (Stephen Weeks) writes:

> Xavier Leroy wrote:
>
> > Be careful about "SOME 'a represented as a pointer to something": if
> > you mean that SOME(x) should be represented like x, it becomes
> > impossible to distinguish NONE from SOME(NONE), both being legal
> > values of type 'a option option. So, you still need the extra
> > indirection cell for SOME.
>
> The compiler does not always have to use an extra indirection in the
> case of SOME. MLton, for example, makes the decision on a
> case-by-case basis.
>
> For example, in the type "(int * int) option", MLton will represent
> NONE with the integer 1 and SOME (x, y) as an untagged pointer to the
> tuple, without an extra indirection.
>
> On the other hand, in the type "int option", MLton will represent NONE
> with the integer 1 and SOME x as an untagged pointer to a 2 word
> heap-allocated object, with one word for the header (that all heap
> objects have) and one word for the integer.

IIRC, MLWorks would always produce a tagged object for SOME x,
ie. a tuple of size 2, but size 2 tuples were represented on the heap
without headers and referred to by a pointer tagged differently to a
pointer to a normal heap object with a header, so both SOME 1 and SOME
(0,0) would be represented by a 2 word heap object.

Matthew.