Fat references

Jon Harrop

Dec 29, 2009, 12:55:59 PM
I've been working on a project called HLVM in my spare time:

http://forge.ocamlcore.org/projects/hlvm

One goal was to have fast interop with C, so I didn't want to copy the
traditional style of placing a header with GC metadata before every value
in the heap because that would require C arrays to be copied just to add
this header. I couldn't be bothered to allocate a separate header so,
instead, I pulled the GC metadata into the reference. So my references are
now "fat": a quadword of pointer to run-time type, array length or union
type tag, pointer to mark state and pointer to the actual data itself.
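
In C terms, each reference is something like this (a sketch; the field
names are mine, not HLVM's actual definitions):

    #include <stdint.h>

    struct rtti;                  /* run-time type descriptor */

    /* A fat reference: four pointer-sized fields, passed by value. */
    struct fat_ref {
        struct rtti *type;        /* pointer to the run-time type */
        uintptr_t    len_or_tag;  /* array length, or union type tag */
        uint8_t     *mark;        /* pointer to this value's mark state */
        void        *data;        /* pointer to the actual data */
    };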

This actually works rather well except I sacrificed atomic read/write of
references. Has it been done before?

--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?u

Paul Biggar

Dec 30, 2009, 8:32:22 AM
Hi Jon,

On Tue, Dec 29, 2009 at 5:55 PM, Jon Harrop <j...@ffconsultancy.com> wrote:
> One goal was to have fast interop with C, so I didn't want to copy the
> traditional style of placing a header with GC metadata before every value
> in the heap because that would require C arrays to be copied just to add
> this header. I couldn't be bothered to allocate a separate header so,
> instead, I pulled the GC metadata into the reference. So my references are
> now "fat": a quadword of pointer to run-time type, array length or union
> type tag, pointer to mark state and pointer to the actual data itself.

This strongly resembles how values are represented in modern scripting
language implementations, including the canonical implementations of
Lua, PHP, Perl, Python and Ruby. In PHP, each value is a struct
containing:

- a reference count
- a union of primitive values/object pointer/array pointer/resource
- a type byte (indicates object/array/string/int/float/bool/null/resource)
- a flag indicating whether the value is a reference (we can ignore
this, I just added it for completeness)

The "resource" mentioned above is a C pointer (void*) used as part of
a foreign function interface. C data is typically referred to by the
pointer, rather than copied into some PHP representation.
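
As a rough C sketch (simplified for illustration; the real zval
definition differs in detail):

    /* Simplified sketch of a PHP-style value; not the engine's actual
       definition. */
    typedef struct sketch_zval {
        union {
            long   lval;                       /* int/bool */
            double dval;                       /* float */
            struct { char *val; int len; } str;
            void  *ptr;                        /* object/array/resource */
        } value;
        unsigned int  refcount;                /* reference count */
        unsigned char type;                    /* object/array/string/... */
        unsigned char is_ref;                  /* the "is a reference" flag */
    } sketch_zval;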

> This actually works rather well except I sacrificed atomic read/write of
> references. Has it been done before?

All in all, the PHP approach seems to be very similar to what you're
doing. All the other scripting languages have a variation on this
theme.

One difference might be that in the PHP implementation, a pointer to
the struct is passed around. I get the impression your "fat
references" are passed by copy instead, but I'm not sure? If you did,
and found that was faster due to less indirection, that would be
interesting.


Paul


--
Paul Biggar
paul....@gmail.com

Robert A Duff

Dec 30, 2009, 12:24:53 PM
Jon Harrop <j...@ffconsultancy.com> writes:

GNAT (the gcc Ada compiler) uses fat pointers by default to represent
pointers to unconstrained arrays. (Unconstrained means different
objects of a given array type can have different lengths.) The fat
pointer consists of the address of the bounds and the address of the
data.

One advantage is that it can use zero-origin indexing. That is, if
the bounds of the array are 10..20, the data address points to the
0'th array component, even though that component doesn't exist. So
indexing doesn't need to subtract off the lower bound.
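
In C terms, the trick looks something like this (a sketch of the idea,
not GNAT's actual representation):

    /* Fat pointer to an unconstrained array, zero-origin style. */
    struct bounds  { int lo, hi; };
    struct fat_ptr {
        struct bounds *bounds;  /* address of the bounds */
        double        *data0;   /* address of the (possibly nonexistent)
                                   component with index 0 */
    };

    /* Indexing needs no lower-bound subtraction: */
    static double fetch(struct fat_ptr p, int i) { return p.data0[i]; }

    /* Construction biases the data address by the lower bound.  (In
       strict C this arithmetic can be out of bounds; it is shown only
       to illustrate the representation.) */
    static struct fat_ptr make(struct bounds *b, double *storage) {
        struct fat_ptr p = { b, storage - b->lo };
        return p;
    }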

The purpose of these fat pointers has nothing to do with GC, though.
GNAT does not currently support GC, except for the versions targeting
.NET and the JVM.

The user can also request thin pointers, which are the address of the
data, with the bounds stored just before the data. The zero-origin
trick won't work in this case.

I don't fully understand the problem you're trying to solve, nor how
fat pointers are the solution. Are you saying all pointers are fat (4
words per pointer!?)? Or just pointers to arrays? Under what
circumstances are you worried about an extra copy? If the array is
allocated on the C side, you have to concoct the meta data somehow...

Why not just say that arrays allocated on the C side are not
subject to GC?

By the way, if you're interested in GC, you should join the
GC list, which our esteemed moderator set up some years ago.

- Bob

BGB / cr88192

Dec 30, 2009, 1:10:16 PM
"Jon Harrop" <j...@ffconsultancy.com> wrote in message

is it possible to do it this way:
GC objects are managed on a separate GC heap, where GC objects themselves
contain this header, but the header is otherwise invisible to C land (the
pointer they see is after this GC-internal header).

this is more how my GC works, and I use it mostly with C.
the GC itself also keeps track of object size, type, ref count, ...
actually, this whole header is packed into bits and fits into 64-bits. a
small hash value is also kept in the header so that the GC can detect if it
has been overwritten (heuristic measures can also be used to aid in locating
the origin of the offending object, and help track down the offending code).

the type stored in this header is actually a hash key into the hash of known
object types (sadly, dynamic, so this doesn't allow switch-based type
dispatch, but granted does allow for the "table of function pointers"
strategy).
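
as a sketch of the general shape (the field widths here are invented for
illustration, not my actual layout):

    #include <stdint.h>

    typedef uint64_t gc_header;   /* whole header packed into 64 bits */

    /* illustrative field extraction; widths invented */
    #define HDR_SIZE(h)  ((uint32_t)( (h)        & 0xFFFFFF)) /* object size */
    #define HDR_TYPE(h)  ((uint32_t)(((h) >> 24) & 0xFFFFFF)) /* type hash key */
    #define HDR_REFC(h)  ((uint32_t)(((h) >> 48) & 0xFF))     /* ref count */
    #define HDR_HASH(h)  ((uint32_t)(((h) >> 56) & 0xFF))     /* overwrite check */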

under this strategy, raw C data (raw pointers, malloc'ed data, ...) is,
simply, not GC'ed.

note that ref-counting is supported, but not usually enabled (this is
per-object), as most of my C code is not exactly ref-count safe...


or, if continuing to use fat pointers:
have you considered the possibility of using SIMD operations for this (AKA:
SSE/SSE2)?
I had used this both for 128 bit integers, and for an analogous "wide
pointers" system (they were 128 bits, but most of this went into a massively
expanded address space, rather than anything related to object management).
however, something analogous should work.

SSE also allows getting/setting the high qword apart from the low qword,
which could allow, for example, using the low qword for the pointer (for
both 32 and 64 bit systems, movd and movq allow moving between these
references and GPR's), and the high qword for type and GC info (although, I
do have doubts of storing this info in the reference, as this seems like it
would make GC a horrible mess, for example, trying to keep everything
between all the references in sync).
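
for example, with SSE2 intrinsics, something like this (a sketch; assumes
x86-64, and the helper names are mine):

    #include <emmintrin.h>
    #include <stdint.h>

    typedef __m128i fatval;  /* low qword: pointer; high qword: type/GC info */

    static fatval fat_make(void *ptr, uint64_t info) {
        return _mm_set_epi64x((long long)info, (long long)(intptr_t)ptr);
    }
    static void *fat_ptr(fatval v) {              /* movq xmm -> gpr */
        return (void *)(intptr_t)_mm_cvtsi128_si64(v);
    }
    static uint64_t fat_info(fatval v) {          /* extract the high qword */
        return (uint64_t)_mm_cvtsi128_si64(_mm_unpackhi_epi64(v, v));
    }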

for example, if the pointers were to contain a ref count, how would this be
kept in sync from one place to another?
...

sadly, I don't know how much control the LLVM core (presumably in use here)
gives over these things.


also possible:
doing like PHP, and storing a pointer to a header, which in turn points to
the object?...
granted, this would add a possibly notable per-object overhead, but would
likely be at least easier to keep synchronized, and could allow the actual
backing memory to be managed externally.


glen herrmannsfeldt

Dec 30, 2009, 2:13:31 PM
Robert A Duff <bob...@shell01.theworld.com> wrote:
(snip)

> One advantage is that it can use zero-origin indexing. That is, if
> the bounds of the array are 10..20, the data address points to the
> 0'th array component, even though that component doesn't exist. So
> indexing doesn't need to subtract off the lower bound.

The array descriptors used by PL/I (or at least PL/I (F)) pass the
virtual origin (address of element with all subscripts zero), along
with the upper and lower bounds and the appropriate multiplier to
address array elements.

It seems that the way Fortran (assumed-shape) arrays work, the called
routine doesn't use the lower bound, but normally uses a lower bound
of 1 for each subscript. (The called routine can change the lower
bound that it uses.)

The VAX/VMS array descriptor includes both the virtual and actual
origin, allowing for either PL/I style or Fortran style array
addressing.

There was a discussion on comp.lang.fortran some time ago (maybe
last year) on the advantages of each. To me, it seems more consistent
to pass and use the lower bound in the called routine, but it does
complicate array manipulation subroutines.

This also applies to array expressions. For PL/I array expressions,
the upper and lower bound must agree, where in Fortran only the
extent (upper-lower+1) need agree.

Something for designers of new languages to consider.

-- glen

Jon Harrop

Dec 30, 2009, 4:05:37 PM
Paul Biggar wrote:
> One difference might be that in the PHP implementation, a pointer to
> the struct is passed around. I get the impression your "fat
> references" are passed by copy instead, but I'm not sure?

That is correct, yes.

> If you did, and found that was faster due to less indirection, that would
> be interesting.

From the benchmarks I have done, it seems to have had little adverse effect
on performance. The main effect has been to slow down stack operations but
that is largely because I'm using a naive shadow stack.

Memory consumption must be higher because pointers are all 4x larger. Some
operations should be faster, like computing the sum of the lengths of the
subarrays in an array of arrays because the locality of those lengths is
better.

I also have an unusual GC design that works well in concert with these fat
pointers.

Kaz Kylheku

Dec 30, 2009, 2:58:42 PM
On 2009-12-29, Jon Harrop <j...@ffconsultancy.com> wrote:
> I've been working on a project called HLVM in my spare time:
>
> http://forge.ocamlcore.org/projects/hlvm
>
> One goal was to have fast interop with C, so I didn't want to copy the
> traditional style of placing a header with GC metadata before every value
> in the heap because that would require C arrays to be copied just to add
> this header. I couldn't be bothered to allocate a separate header so,
> instead, I pulled the GC metadata into the reference. So my references are
> now "fat": a quadword of pointer to run-time type, array length or union

So now you have references that can't fit into a machine register.

> type tag, pointer to mark state and pointer to the actual data itself.

How do you fit three pointers and a tag into a ``quad word''?

You must be using ``quad word'' differently from the rest of the world,
where it is understood to be 64 bits, according to this convention:

16 bits: word
32 bits: double word
64 bits: quad word

In a 64 bit address space, just one single pointer is already a
quad-word.

> This actually works rather well except I sacrificed atomic read/write of
> references. Has it been done before?

Fat pointers have been implemented in some C compilers, to provide
run-time bounds checking on arrays and other objects.

From your description, it seems that what you describe resembles a
handle-based approach whereby some objects are split into two parts: a
fixed size record, which contains a pointer to the remaining storage, in
addition to other fields, like a type tag and whatnot. The fixed size
records are allocated in an array fashion from heaps, where garbage
collection takes place. However, the records are not regarded as
references, and are never passed by value. The reference values that
are manipulated by the program are actually pointers to these records,
not the records themselves. So in other words, you keep your fundamental
value/reference type of a size that fits into a register. If you need
atomic operations, you can do them nicely on that type. Variables,
including function arguments, and the slots of structure and vector
objects, are all just one pointer wide, so you can do a compare-swap (or
whatever) on any memory location that holds a value, whether it's a
local variable, array element, etc.
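
In C, the shape of that approach is roughly this (a sketch):

    #include <stdint.h>

    /* Fixed-size record, allocated array-fashion from GC'd heaps. */
    struct handle {
        uint32_t tag;        /* type tag and whatnot */
        uint32_t length;
        void    *storage;    /* pointer to the remaining storage */
    };

    /* The value the program actually manipulates: one machine word, so
       compare-and-swap works on any slot that holds one. */
    typedef struct handle *ref;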

Jon Harrop

Dec 30, 2009, 4:57:57 PM
Robert A Duff wrote:
> GNAT (the gcc Ada compiler) uses fat pointers by default to represent
> pointers to unconstrained arrays. (Unconstrained means different
> objects of a given array type can have different lengths.) The fat
> pointer consists of the address of the bounds and the address of the
> data.

I see.

> The purpose of these fat pointers has nothing to do with GC, though.
> GNAT does not currently support GC, except for the versions targeting
> .NET and the JVM.

Right. With my approach, the pointer to the run-time type can be used to
fetch a GC-specific function that traverses the references in a given type,
or a user-function to pretty print a value of that type.

> I don't fully understand the problem you're trying to solve, nor how
> fat pointers are the solution.

My original motivation was to evade typeless null pointers that afflict many
VMs including the JVM and CLR. Various operations hit a dead wall with
these nulls because they don't provide any type information. For example,
pretty printing null is a serious problem in F# because lists, sets, option
types and other types all want to use null as an efficient representation
of "empty" but the run-time system just sees a typeless null and can only
print "null". So F# wraps the empty list in a heap allocated object,
massively degrading the performance of most uses of lists.

However, with the benefit of hindsight, there is a simpler but still
efficient solution to this problem on the CLR: wrap the null in a value
type instead of an object. That way there is no heap allocation and
(assuming my performance results from HLVM carry) you don't see any
slowdown at all. Unfortunately, this would work perfectly in theory but, in
practice, the CLR has problems with structs inhibiting TCO. HLVM doesn't
have that problem though (thanks to LLVM!).

> Are you saying all pointers are fat (4
> words per pointer!?)? Or just pointers to arrays?

Yes, all pointers. Every single reference in the heap now occupies a
quadword. I even blindly load and store entire quadwords when reading and
writing references.

> Under what circumstances are you worried about an extra copy?

Extra copy of what?

> If the array is allocated on the C side, you have to concoct the meta data
> somehow...

That's no problem. If the array came from the C side then I must JIT compile
a stub function that accepts HLVM's fast calling convention and invokes the
external function using C's calling convention anyway, so I can put the
extra code to convert the C array into a fat reference in that stub,
injecting type information into the typed fat reference derived from the
FFI definition of the external function.
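
In C terms, the wrapping done by the stub amounts to something like this
(a sketch; the names are illustrative, and HLVM JIT compiles the
equivalent per FFI signature):

    #include <stdint.h>

    struct rtti;
    struct fat_ref { struct rtti *type; uintptr_t length;
                     void *mark; void *data; };

    /* Run-time type derived from the FFI definition of the external
       function (hypothetical name). */
    extern struct rtti double_array_rtti;

    /* Wrap a C array in a typed fat reference -- no copying. */
    static struct fat_ref wrap_c_array(double *p, uintptr_t n) {
        struct fat_ref r = { &double_array_rtti, n, 0, p };
        return r;
    }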

> Why not just say that arrays allocated on the C side are not
> subject to GC?

Simplicity is *very* important for me. I want to create a useful language
implementation with the minimum of effort (where useful means
high-performance parallel numerics to me) by drawing upon research without
doing anything new myself. My entire VM is still well under 2kLOC of OCaml
code. If I started adding special cases like different kinds of arrays then
it would complicate everything from the type system to the GC and shadow
stack.

> By the way, if you're interested in GC, you should join the
> GC list, which our esteemed moderator set up some years ago.

I'd love to. Where is it?

HLVM has another unusual characteristic in that it defines a high-level
language (e.g. with tuples and tail call elimination) that is not only the
target language but also the language the GC is written in. This has the
advantage that I can easily JIT compile type-specialized GC code to improve
performance. The approach seems to work very well because it makes
metaprogramming trivial (e.g. instantiating generic definitions per type is
trivial and very efficient) and was easy to implement.

glen herrmannsfeldt

Dec 31, 2009, 12:43:18 AM
Kaz Kylheku <kkyl...@gmail.com> wrote:
> On 2009-12-29, Jon Harrop <j...@ffconsultancy.com> wrote:
(snip)

> You must be using ``quad word'' differently from the rest of the world,
> where it is understood to be 64 bits, according to this convention:

> 16 bits: word
> 32 bits: double word
> 64 bits: quad word

That is VAX. Everyone else uses 32 bits for a word.

-- glen

Jon Harrop

Jan 1, 2010, 1:10:02 AM
Kaz Kylheku wrote:
> On 2009-12-29, Jon Harrop <j...@ffconsultancy.com> wrote:
>> I've been working on a project called HLVM in my spare time:
>>
>> http://forge.ocamlcore.org/projects/hlvm
>>
>> One goal was to have fast interop with C, so I didn't want to copy the
>> traditional style of placing a header with GC metadata before every value
>> in the heap because that would require C arrays to be copied just to add
>> this header. I couldn't be bothered to allocate a separate header so,
>> instead, I pulled the GC metadata into the reference. So my references
>> are now "fat": a quadword of pointer to run-time type, array length or
>> union
>
> So now you have references that can't fit into a machine register.

If you don't have 128- or 256-bit registers on your 32- or 64-bit machine,
respectively.

>> type tag, pointer to mark state and pointer to the actual data itself.
>
> How do you fit three pointers and a tag into a ``quad word''?

I grew up with 32-bit words (ARM) and meant 4*sizeof(void *).

> You must be using ``quad word'' differently from the rest of the world,

> where it is understood to be 64 bits...

No, there is no consistent meaning.

>> This actually works rather well except I sacrificed atomic read/write of
>> references. Has it been done before?
>
> Fat pointers have been implemented in some C compilers, to provide
> run-time bounds checking on arrays and other objects.

Thanks.

> From your description, it seems that what you describe resembles a
> handle-based approach whereby some objects are split into two parts: a
> fixed size record, which contains a pointer to the remaining storage, in
> addition to other fields, like a type tag and whatnot. The fixed size
> records are allocated in an array fashion from heaps, where garbage
> collection takes place. However, the records are not regarded as
> references, and are never passed by value. The reference values that
> are manipulated by the program are actually pointers to these records,
> not the records themselves. So in other words, you keep your fundamental
> value/reference type of a size that fits into a register. If you need
> atomic operations, you can do them nicely on that type. Variables,
> including function arguments, and the slots of structure and vector
> objects, are all just one pointer wide, so you can do a compare-swap (or
> whatever) on any memory location that holds a value, whether it's a
> local variable, array element, etc.

I see. That is not what I'm doing but, besides atomicity, what is the
motivation for trying to squeeze a reference into a pointer?

BGB / cr88192

Jan 1, 2010, 4:11:04 AM
"Jon Harrop" <j...@ffconsultancy.com> wrote in message
> Robert A Duff wrote:

<snip>


>
> Right. With my approach, the pointer to the run-time type can be used to
> fetch a GC-specific function that traverses the references in a given
> type,
> or a user-function to pretty print a value of that type.

ok, so type-with-pointer rather than type-with-object or
type-from-compiler?...


>> I don't fully understand the problem you're trying to solve, nor how
>> fat pointers are the solution.
>
> My original motivation was to evade typeless null pointers that afflict
> many
> VMs including the JVM and CLR. Various operations hit a dead wall with
> these nulls because they don't provide any type information. For example,
> pretty printing null is a serious problem in F# because lists, sets,
> option
> types and other types all want to use null as an efficient representation
> of "empty" but the run-time system just sees a typeless null and can only
> print "null". So F# wraps the empty list in a heap allocated object,
> massively degrading the performance of most uses of lists.
>

well, two solutions come to mind here:
NULL is context dependent, so how it is printed depends on where it was seen
from (AKA: the logic for printing a list would see the NULL differently than
would a generic print function, ...).

extending the "basic value set" to include values beyond NULL.
for example, I often use a value "UNDEFINED" in addition to NULL, as well as
a few other lesser-used possibilities.

granted, these are still "generic" untyped values.

EOL (End Of List), ... could easily enough exist, but in my (current) common
uses I don't use them.


> However, with the benefit of hindsight, there is a simpler but still
> efficient solution to this problem on the CLR: wrap the null in a
> value type instead of an object. That way there is no heap
> allocation and (assuming my performance results from HLVM carry) you
> don't see any slowdown at all. Unfortunately, this would work
> perfectly in theory but, in practice, the CLR has problems with
> structs inhibiting TCO. HLVM doesn't have that problem though
> (thanks to LLVM!).


personally, for my uses, I use a "trick" for implementing things like
"fixnum" and "flonum" with an otherwise "C-like" GC setup.

I dedicate regions of the address space that are not actually used for
valid pointers to holding dynamically-typed values (normally, dynamic
types would be per-object).

so, a big chunk of address space goes into fixnum, and another big chunk
into flonum, as well as other types.

on Win32, for example, there is the whole region from
0xC0000000-0xFFFFFFFF which does not hold validly addressable memory,
and so I can "safely" allocate space from this region without risking
clashes.

on Win64, I have a much bigger space in a different location
(71000000:00000000 - 71FFFFFF:FFFFFFFF), which I can similarly allocate out
as needed.
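
in C, classifying a value is then just a range test; a sketch (the region
bases here are made up for illustration, loosely following the Win32 case
above):

    #include <stdint.h>

    /* illustrative region layout, not my actual one */
    #define FIXNUM_BASE 0xC0000000UL
    #define FIXNUM_END  0xCFFFFFFFUL

    static int is_fixnum_ptr(void *p) {
        uintptr_t a = (uintptr_t)p;
        return a >= FIXNUM_BASE && a <= FIXNUM_END;
    }

    static long fixnum_ptr_value(void *p) {
        /* the value is encoded as the offset within the region */
        return (long)((uintptr_t)p - FIXNUM_BASE);
    }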

now, there is a little bit of an edge case:
right near the end of this region in 32-bit mode, I have usually placed
certain magic constants, for example, UNDEFINED==((void *)(-1)), ...

it also mixes much better with C than tagged references do, since the C
code need not actually know or care that there is no "object" behind the
pointer (IOW: no real need for special handling, for the most part).


granted, with fat pointers, there is no real need for this trick either,
since I guess one can shove whole floats or doubles into the pointers if
they want (rather than devising 28 and 48 bit float formats to shove into
their pointers...).

actually, in recent times, the use of pure dynamic types has existed in a
state of contention with "signature typing". dynamic typing works by
identifying pointers: "oh yeah, that is a cons", "that one there is a
fixnum", ...; signature typing works via a sideband "magic string" that
ends up getting moved around all over the place (the string itself
representing the type as a much more elaborate "mini-language", vs the
simplistic "type name" system used by the dynamic types). each has a
different set of merits.

on one hand, the dynamic system uses simple pointers which are
self-identifying; on the other, signature typing can specify nearly
arbitrary data structures, and can deal with relatively "dense" memory
packing, raw C data, ..., as well as often being able to produce efficient
machine code, ... but at the cost of generally being more awkward and
somewhat more complex.


>> Are you saying all pointers are fat (4
>> words per pointer!?)? Or just pointers to arrays?
>
> Yes, all pointers. Every single reference in the heap now occupies a
> quadword. I even blindly load and store entire quadwords when reading and
> writing references.
>

you say quadword but it sounds like octword.

quadword = 64-bits.
octword = 128-bits.

unless of course you are using that bizarre ELF-ish language where
word=32-bits, dword=64-bits, and qword=128-bits...

grr...


but, anyways, I don't personally like this strategy as it would somewhat
inflate memory use, and ultimately performance will pay for it...

it may not be drastic or immediate, but with the working set not fitting
nicely in cache, this much will be felt...

it is bad enough with 64-bit code, as one can notice how, even though it
"should" be faster in a theoretical sense, 64-bit code generally drags
behind 32-bit code in terms of raw performance...

so, I advise at least "some" caution, as it may well make sense to be able
to "optimize" things down to narrow pointers in many cases.


>> Under what circumstances are you worried about an extra copy?
>
> Extra copy of what?
>
>> If the array is allocated on the C side, you have to concoct the meta
>> data
>> somehow...
>
> That's no problem. If the array came from the C side then I must JIT
> compile
> a stub function that accepts HLVM's fast calling convention and invokes
> the
> external function using C's calling convention anyway, so I can put the
> extra code to convert the C array into a fat reference in that stub,
> injecting type information into the typed fat reference derived from the
> FFI definition of the external function.
>

calling convention transitions are another thing I don't personally
like too much. IME, this point has often become a bit of a mess.

then again, "dynamic" may eventually force one to break away, even if for
such trivial reasons that the C calling convention does not exactly mix well
with continuations, makes things like closures a bit of a mess (of the sort
involving generating a new stub for every closed-over function), ...

and, yes, like seemingly everything else, a lot of this stuff gets a slight
bit nastier when working around the SysV calling convention.


I once considered (for some uses) going over to a calling convention based
instead around the use of "frames", where frames were essentially allocated
from a special purpose heap.

but I had not done much with this idea.

the partial downside was considering that any calls through a normal C
function would essentially break the ability to implement continuations,
limiting much of the appeal of this particular idea.


>> Why not just say that arrays allocated on the C side are not
>> subject to GC?
>
> Simplicity is *very* important for me. I want to create a useful language
> implementation with the minimum of effort (where useful means
> high-performance parallel numerics to me) by drawing upon research without
> doing anything new myself. My entire VM is still well under 2kLOC of OCaml
> code. If I started adding special cases like different kinds of arrays
> then
> it would complicate everything from the type system to the GC and shadow
> stack.
>

oddly, I think my strategies are fairly conservative as well...

granted, my project is C, and is a bit bigger than 2kloc...


but, I don't think this issue is too complicated in my case:
GC'ed data is allocated via the GC, so it is the GC which decides, not
anything built on top of it.


granted, my GC works primarily with C.

I guess, in a way, it is sort of like Boehm, but I guess with a few
differences (errm, that the newly written GC spec ended up going over the
encoding rules for flonums, briefly describes how cons cells work, the GC
allocating many objects via the use of bitmaps, ...). I guess it really is a
different piece of technology...


some things I take for granted, but was then reminded of:
the C I use is not exactly the same as the C many others use, as there are
many influences from many languages, and in this case, I am left realizing
that the Scheme influences have far from completely died off...

after all, had Scheme influences completely died, I probably would not be
around documenting cons cells, fixnums, and flonums?...

hell, I almost may as well just add a scheme interpreter in the mix
somewhere (vs, say, using C in many cases with a bunch of what are,
essentially, Scheme-derived data types and features).

well, unless I guess most people think it is common "C" practice, say, to
append a few lists and use a "generic apply" API call with this list and a
function pointer... (and, hell, my "architecture" really is a chimera...).

but, it is sort of an odd thought, thinking of how much one can get buried
in all of this, and then go and remember how a lot of this got started. what
all does this imply exactly? I am not sure...


>> By the way, if you're interested in GC, you should join the
>> GC list, which our esteemed moderator set up some years ago.
>
> I'd love to. Where is it?
>
> HLVM has another unusual characteristic in that it defines a high-level
> language (e.g. with tuples and tail call elimination) that is not only the
> target language but also the language the GC is written in. This has the
> advantage that I can easily JIT compile type-specialized GC code to
> improve
> performance. The approach seems to work very well because it makes
> metaprogramming trivial (e.g. instantiating generic definitions per type
> is
> trivial and very efficient) and was easy to implement.
>

well, I guess I will take your word for it...
but, admittedly, this particular approach doesn't really sound like
something I would likely do.

granted, in my case, my GC is a more-or-less disjoint component written in
C...

I guess I still have my reservations as to how all this will turn out, but
best of luck I guess...

BGB / cr88192

Jan 1, 2010, 4:35:49 PM
"Jon Harrop" <j...@ffconsultancy.com> wrote in message
> Kaz Kylheku wrote:

<snip>

>>
>> So now you have references that can't fit into a machine register.
>
> If you don't have 128- or 256-bit registers on your 32- or 64-bit machine,
> respectively.
>

one needs AVX to have 256 bit registers at present (on x86 / x86-64 and
friends).

given AVX is not (yet) available in general "run of the mill" CPUs,
this seems to be a bit aggressive. similarly, 256 bit pointers are
hardly cheap...


>>
>> How do you fit three pointers and a tag into a ``quad word''?
>
> I grew up with 32-bit words (ARM) and meant 4*sizeof(void *).

<merge>


> No, there is no consistent meaning.
>

granted, many also live with x86, where "quad word" has historically
been 64 bits. this may be because in x86 land, it is still fairly
common to use 16-bit words for many operations.


>> From your description, it seems that what you describe resembles a
>> handle-based approach whereby some objects are split into two parts: a
>> fixed size record, which contains a pointer to the remaining storage, in
>> addition to other fields, like a type tag and whatnot. The fixed size
>> records are allocated in an array fashion from heaps, where garbage
>> collection takes place. However, the records are not regarded as
>> references, and are never passed by value. The reference values that
>> are manipulated by the program are actually pointers to these records,
>> not the records themselves. So in other words, you keep your fundamental
>> value/reference type of a size that fits into a register. If you need
>> atomic operations, you can do them nicely on that type. Variables,
>> including function arguments, and the slots of structure and vector
>> objects, are all just one pointer wide, so you can do a compare-swap (or
>> whatever) on any memory location that holds a value, whether it's a
>> local variable, array element, etc.
>
> I see. That is not what I'm doing but, besides atomicity, what is the
> motivation for trying to squeeze a reference into a pointer?

the main thing I am thinking here is:
1. saving memory:
given that, typically, pointers are one of the most common value types
(especially in dynamic languages), large pointers translate to HUGE
increases in memory use.

2. performance:
increasing memory use greatly reduces cache efficiency, and as a result,
code will run much slower on average.


as I sometimes say, having lots of a resource is not a good reason to
waste it. just because a modern computer has > 1GB of RAM (my desktop
has 6GB and my laptop 4GB), does not mean it is a good idea to
increase memory use by a huge factor without good reason.

it is typically better to use more of a resource to have more stuff, than to
squander it and have the same, or less, overall.

much like how now, a person can easily have multiple TB of hard-drive space,
yet data compression has by no means become meaningless.

to make little savings at the bottom can often translate to big savings
overall...


hence my idea is maybe to consider a scheme where, most of the time,
pointers remain at their natural size, and can occasionally route through
another structure for added functionality (this structure need not be
necessarily heap-based). alternatively, most info can be placed in the
object (as is often done).

in the structure based strategy, also possible could be to have an "object
indirection table", which would basically be used for any item addressed via
a fat pointer. in this case, the narrow pointer encodes an index or pointer
into this table, and the table in turn refers to the object (with any added
info).

this would allow greatly reducing overall costs to multiple references to
the same object.
more extreme would be to make nearly all references index via this table
(possibly even allowing 32-bit "pointers" to be used on a 64 bit system, if
desired).
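
as a sketch of the indirection-table idea (names and field layout
invented):

    #include <stdint.h>

    /* 32-bit "pointers" are indices into a table that holds the real
       pointer plus any added info.  Layout invented for illustration. */
    struct obj_entry {
        void    *obj;    /* the actual object */
        uint32_t type;   /* added info lives here, not in the reference */
        uint32_t flags;
    };

    extern struct obj_entry obj_table[];  /* hypothetical global table */

    static void *deref(uint32_t ref) { return obj_table[ref].obj; }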

I actually did something roughly similar before when using a
tagged-reference based memory manager, mostly noting that this indirection
actually made the whole thing, on average, faster than encoding a pointer
directly into the tagged reference.

the main weak side of tagged references is that they are a hassle to work
with from C, but at the same time, the same could be said of fat pointers
(without explicit compiler support). actually, it may well be worse, because
at least a tagged reference will usually fit nicely into an unsigned integer
or similar...


actually, I don't think it is a coincidence that there are many high level
VM's which use tagged references, but relatively few which use fat pointers.

a simple example of a tagged reference scheme:
low 3 bits:
  0/4: fixnum (2-bit 0 tag)
  1/5: flonum (2-bit 1 tag)
  2: object reference (28 bit index into object heap/table)
  3: cons reference (28 bit index into cons-cell heap)
  6: large object reference (28 bit reference into large objects table)
  7: expanded reference, 5 additional tag bits:
    0: reserved
    1: special constants (NULL, UNDEF, EOL, TRUE, FALSE, ...)
    2: character
    3: remote object handle
    ...
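
decoding such a reference is then just a few bit operations; a minimal C
sketch of the scheme above (helper names are mine):

    #include <stdint.h>

    typedef uint32_t tagref;   /* 32-bit tagged reference */

    static int is_fixnum(tagref r) { return (r & 3) == 0; }  /* tags 0/4 */
    static int is_flonum(tagref r) { return (r & 3) == 1; }  /* tags 1/5 */
    static int is_object(tagref r) { return (r & 7) == 2; }
    static int is_cons(tagref r)   { return (r & 7) == 3; }

    /* value/index extraction; arithmetic right shift assumed */
    static int32_t  fixnum_value(tagref r) { return (int32_t)r >> 2; }
    static uint32_t object_index(tagref r) { return r >> 3; }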

even on a 64-bit system, a person may still use 32-bit tagged values,
meaning LESS space use than if they used pointers (although, there is a
finite limit on practical heap size, for example, 4GB worth of cons cells,
...).

the reason for giving so much space to fixnums and flonums is that these are
usually very common types, and it is desirable to minimize the need for
heap-allocated integers or floats.

similarly, 30 bits is a bit more accurate than 28 bits (what I currently use
on 32-bit x86 for flonums in raw pointers...), and so is more comparable to a
single-float (at 28 bits, the approximation is not as good, and at 24 bits,
the accuracy is just poor...).


sadly, it would require 64-bit tagged references to have double-approximate
tagged references.

either way is still much less than with fat pointers...


note that, in my compiler lower end, I have actually ended up packing a good
portion of the C typesystem (representation and semantics) into a 32-bit
integer (via "bit twiddling ninjutsu"), although some more complex types
(notably function-pointer types and multi-dimensional arrays) need to
"overflow" into a structure-based representation (certain bit patterns
indicate this case, so the type is still passed around as a 32-bit integer).

in most other places though, I use a string based representation (the so
called "signature-based" type system), since this is much more generic (and
does not have to deal with terribly complicated bit-twiddling logic...).

however, in the lower-end, the bit-packed representation does have the
advantage that it allows more easily doing predicate operations (basically,
returning true/false based on whether a type exhibits a given property,
which is the main way used for handling types, as opposed to the more common
set-based view of types).

Anton Ertl

Jan 2, 2010, 9:00:29 AM

glen herrmannsfeldt <g...@ugcs.caltech.edu> writes:
>That is VAX. Everyone else uses 32 bits for a word.

No, that's the 8008 and its successors. For the PDP-11 and its
successors (including the VAX) the 32-bit units were called longwords.

"Word" for 32 bits is common for architectures that originated from
32-bit architectures that were not sold as extensions of 16-bit or
8-bit architectures (e.g., IBM 360, MIPS, SPARC, ARM).

How do original 64-bit architectures like the Cray-1 call 16-bit and
32-bit units? Do they have any support for that at all?

- anton
--
M. Anton Ertl
an...@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/

glen herrmannsfeldt

Jan 2, 2010, 3:20:45 PM
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
(snip, someone wrote)

>>> 16 bits: word
>>> 32 bits: double word
>>> 64 bits: quad word

(then I wrote)


>>That is VAX. Everyone else uses 32 bits for a word.

> No, that's the 8008 and its successors. For the PDP-11 and its
> successors (including the VAX) the 32-bit units were called longwords.

Certainly the PDP-11 was described as having a 16 bit word, but I
don't remember double word, or especially quad word being used in
context with the PDP-11. There were no memory operations that would
do that.

The 8008, 8080, 6800, 6502, and others at that time were considered
'eight bit processors' even though the 8080 could do some 16 bit
operations. One might have called 16 bits a word, but it wasn't
commonly needed. (At least much less than for the PDP-11.)

Already being used to a 32 bit word for S/360, I remember being
disappointed at the time that VAX called 16 bits a word. Especially
if one believed that VAX was supposed to be in the 32 bit processor
marketplace.

> "Word" for 32 bits is common for architectures that originated from
> 32-bit architectures that were not sold as extensions of 16-bit or
> 8-bit architectures (e.g., IBM 360, MIPS, SPARC, ARM).

It always made some sense to me. As I remember, in typing a word is
considered to be five characters. The average English word seems
likely to be in the four or five character range, not two.

> How do original 64-bit architectures like the Cray-1 call 16-bit and
> 32-bit units? Do they have any support for that at all?

I believe the Cray-1 does 64 bit memory operations, and that most (if
not all) C compilers for the Cray-1 used a CHAR_BIT of 64. If C char,
short, int, and long are all 64 bits, why have a name other than word
for them?

-- glen

Robert A Duff

Jan 2, 2010, 3:39:48 PM
Jon Harrop <j...@ffconsultancy.com> writes:

> Robert A Duff wrote:

>> Are you saying all pointers are fat (4
>> words per pointer!?)? Or just pointers to arrays?
>
> Yes, all pointers. Every single reference in the heap now occupies a
> quadword. I even blindly load and store entire quadwords when reading and
> writing references.

For pointer-heavy data structures, I'd be concerned about using
fat pointers all over -- damages cache behavior by making
everything bigger.

>> Under what circumstances are you worried about an extra copy?
>
> Extra copy of what?

In your original message, you said, "...because that would require C
arrays to be copied just to add this header." I was asking what
you mean. I guess you mean C code passes you an address,
and an array length, or something like that, and you want to avoid
putting your meta-data before the array data, because then you'd
have to copy that data.

>> By the way, if you're interested in GC, you should join the
>> GC list, which our esteemed moderator set up some years ago.
>
> I'd love to. Where is it?

I was hoping our esteemed moderator would point to it.
I don't have it handy, hmm.... let's see, google points
me here:

http://www.iecc.com/gclist/

It's pretty low traffic these days.

> HLVM has another unusual characteristic in that it defines a high-level
> language (e.g. with tuples and tail call elimination) that is not only the
> target language but also the language the GC is written in.

Interesting.

- Bob

Jon Harrop

Jan 2, 2010, 7:42:26 PM
BGB / cr88192 wrote:
> "Jon Harrop" <j...@ffconsultancy.com> wrote in message
>> Kaz Kylheku wrote:
>>> So now you have references that can't fit into a machine register.
>>
>> If you don't have 128- or 256-bit registers on your 32- or 64-bit
>> machine, respectively.
>
> one needs AVX to have 256 bit registers at present (on x86 / x86-64 and
> friends).
>
> given AVX is not (yet) available in general "run of the mill" CPUs,
> this seems to be a bit aggressive. similarly, 256 bit pointers are
> hardly cheap...

On the contrary, HLVM's benchmark results have proven that fat references
are cheap.

>> I see. That is not what I'm doing but, besides atomicity, what is the
>> motivation for trying to squeeze a reference into a pointer?
>
> the main thing I am thinking here is:
> 1. saving memory:
> given that, typically, pointers are one of the most common value types
> (especially in dynamic languages), large pointers translate to HUGE
> increases in memory use.

No. Moving the header data from the heap-allocated target to the local
source of the reference only increases memory consumption by 96 bits
for every *duplicate* reference stored in the heap, i.e. when two heap
objects A and B both reference C then the metadata is duplicated in A
and B. I don't believe duplicate references in the heap are common at
all.

Moreover, HLVM is specifically designed to reduce the number of
references in the heap. Most notably by unboxing multiple values
(tuples) and all primitive types. In contrast, languages like OCaml
box multiple values (e.g. pairs) and box many basic types such as
floats.

> 2. performance:
> increasing memory use greatly reduces cache efficiency,

Yes.

> and as a result, code will run much slower on average.

No. HLVM has certainly proven that wrong. Indeed, disproving such
accepted wisdom is an important part of why I think these results are
of wider interest.

For example, I just benchmarked a simple hash table with arrays for
buckets. Note that the spine contains the *only* reference in the
heap to each bucket: there are no duplicate references so memory
consumption is no worse than with 1-word references. OCaml is probably
the world's fastest functional language implementation for this kind
of task but HLVM is 2.2x faster than OCaml on this benchmark (!).

> as I sometimes say, having lots of a resource is not a good reason
> to waste it. just because a modern computer has > 1GB of RAM (my
> desktop has 6GB and my laptop 4GB), does not mean it is a good idea
> to increase memory use by a huge factor without good reason.

I don't believe HLVM's design will bloat the heap by a huge
factor. Indeed, can you think of a practical solution to a problem
that entails large numbers of duplicate references stored in the heap?

> it is typically better to use more of a resource to have more stuff,
> than to squander it and have the same, or less, overall.

I don't believe so.

> much like how now, a person can easily have multiple TB of hard-drive
> space, yet data compression has by no means become meaningless.

Sure but there are clear trade-offs and compressing data
representations in RAM to fit in a smaller fraction of a cache line is
clearly diminishing returns.

> to make little savings at the bottom can often translate to big savings
> overall...

That is certainly wrong. Language implementations like OCaml and
Haskell go to ridiculous lengths to shrink data representations
(e.g. placing a tag bit in each int) but HLVM and F# wipe the floor
with them in benchmarks.

> hence my idea is maybe to consider a scheme where, most of the time,
> pointers remain at their natural size, and can occasionally route through
> another structure for added functionality (this structure need not be
> necessarily heap-based). alternatively, most info can be placed in the
> object (as is often done).
>
> in the structure based strategy, also possible could be to have an "object
> indirection table", which would basically be used for any item addressed
> via a fat pointer. in this case, the narrow pointer encodes an index or
> pointer into this table, and the table in turn refers to the object (with
> any added info).
>
> this would allow greatly reducing overall costs to multiple references to
> the same object.

But does that improve memory consumption and performance for the uncommon
case of duplicate references at the cost of performance for all other
references?

> the main weak side of tagged references is that they are a hassle to work
> with from C, but at the same time, the same could be said of fat pointers
> (without explicit compiler support). actually, it may well be worse,
> because at least a tagged reference will usually fit nicely into an
> unsigned integer or similar...

You just need primitive operations to extract the underlying pointer
from a fat reference (the AddressOf operator in HLVM) and to construct
a fat reference from a pointer and a type (not yet implemented in
HLVM).

> actually, I don't think it is a coincidence that there are many high level
> VM's which use tagged references, but relatively few which use fat
> pointers.
>
> a simple example of a tagged reference scheme:
> low 3 bits:
>   0/4: fixnum (2-bit 0 tag)
>   1/5: flonum (2-bit 1 tag)
>   2: object reference (28 bit index into object heap/table)
>   3: cons reference (28 bit index into cons-cell heap)
>   6: large object reference (28 bit reference into large objects table)
>   7: expanded reference, 5 additional tag bits:
>     0: reserved
>     1: special constants (NULL, UNDEF, EOL, TRUE, FALSE, ...)
>     2: character
>     3: remote object handle
>     ...

That is exactly the kind of low-level bit twiddling that old school
language implementations like OCaml, Haskell and even the JVM do. I
don't think it is such a great idea. For example, it makes OCaml's int
arithmetic 3x slower than C and boxing makes Java's hash tables 32x
slower than .NET's.

> even on a 64-bit system, a person may still use 32-bit tagged values,
> meaning LESS space use than if they used pointers (although, there is a
> finite limit on practical heap size, for example, 4GB worth of cons cells,
> ...).

You mean 2^32 cons cells. That will become a serious limitation quite
quickly. OCaml suffers from a similar legacy problem due to similar bit
twiddling: strings and arrays are limited to only 16Mb.

> the reason for giving so much space to fixnums and flonums is that these
> are usually very common types, and it is desirable to minimize the need
> for heap-allocated integers or floats.

If you're interested in performance and memory consumption then it is
absurd to ever heap allocate ints and floats.

Hans-Peter Diettrich

Jan 2, 2010, 10:30:47 PM
Anton Ertl schrieb:

>>> You must be using ``quad word'' differently from the rest of the world,
>>> where it is understood to be 64 bits, according to this convention:
>>> 16 bits: word
>>> 32 bits: double word
>>> 64 bits: quad word
>> That is VAX. Everyone else uses 32 bits for a word.
>
> No, that's the 8008 and its successors. For the PDP-11 and its
> successors (including the VAX) the 32-bit units were called longwords.

The DEC machines had very different byte and word sizes, as did many
other machines predating the era of 8-bit bytes. IMO no special size
should be associated with a "word" in general; it's a machine-specific
definition. AFAIR our TR-440 had a memory of 128K words of 52 bits.

DoDi
[Early DEC machines had 18, 12, and 36 bit words, but once the PDP-11
became a success in the early 1970s, byte sizes other than 8 and word
sizes other than 16 and 32 soon withered away. -John]

Ray

Jan 3, 2010, 12:36:40 PM
Jon Harrop wrote:

A lot of modern VMs use fat references. I know that the JVM uses at
least 2x(sizeof( void* )) to represent a pointer. And yes, it's to
support the GC.

When I implemented a lisp runtime, I went through two iterations; in
the first, pointers were the size of pointers on the native
architecture (in this case 64 bits) and there was a 5-word (ie, the
size of 5 pointers) overhead for each heap-allocated object. But the
code was complex and, as it turned out, slow.

In the second iteration, I used fat pointers; Each pointer was the
size of four native-architecture pointers where one was the pointer to
the object on the heap, two were "forward" and "back" pointers that
joined all pointers into a single doubly linked list, and one was a
set of immediate binary flags (typetag, "soft" pointer, last pointer
in this object, etc). Different regions of the list corresponded to
different populations w/r/t Garbage Collection (traversed root,
untraversed root, reached and traversed live data, reached but not
traversed live data, not yet reached possibly-live data, and known
garbage not yet finalized/deallocated). The garbage collector knew
where the dividing points between these regions were, and would
process pointers found at those points in the list, often moving them
to one of the other boundary points. At "generation boundaries" when
the "untraversed" section was empty, it would just shift the boundary
points and start over. The GC code was very simple (<200 lines of C)
and needed to know nothing about the internal structure of the
user-defined types beyond a few bits set in the object allocators.
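
The pointer record itself was roughly as follows (a sketch; the field
names and flag layout here are illustrative):

    /* Each fat pointer: four native words, threaded onto one global
       doubly linked list. */
    struct fatptr {
        void          *obj;    /* the object on the heap */
        struct fatptr *fwd;    /* forward link in the list */
        struct fatptr *back;   /* back link in the list */
        unsigned long  flags;  /* typetag, "soft" bit, last-in-object... */
    };

    /* The collector keeps boundary positions into this one list; the
       spans between boundaries are the GC populations, and moving a
       pointer between populations is a constant-time splice. */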

I expected performance with fat pointers to be horrible, but it
actually wasn't. It turned out to be faster than the first version,
(on an AMD64 dual-core processor with 128K/2M cache), probably due to
more inlined values (I was using a lot of doubles) and better locality
of reference.

Bear

Anton Ertl

Jan 3, 2010, 3:09:09 PM
glen herrmannsfeldt <g...@ugcs.caltech.edu> writes:
>Certainly the PDP-11 was described as having a 16 bit word, but I
>don't remember double word, or especially quad word being used in
>context with the PDP-11.

Not "double word", but "longword". And if that was not used for the
original PDP-11, it was used for its successors, the Virtual Address
eXtension (VAX) of the PDP-11 and the extended VAX (EV, aka Alpha).
[The PDP-11's optional floating point units did address doublewords,
which were different from VAX longwords because they goofed and stored
the high 16 bits at the lower addresses. -John]

The PDP-11 was not a compatible successor of another DEC machine, so
the PDP-11's natural word size (16-bits) was used for the "word"-named
unit.

>The 8008, 8080, 6800, 6502, and others at that time were considered
>'eight bit processors' even though the 8080 could do some 16 bit
>operations. One might have called 16 bits a word, but it wasn't
>commonly needed.

Yes, 16 bits were called a word on such 8-bit processors (there was
another name for 8-bit units: byte), and it was commonly needed,
because these machines used 16 bits for addressing their 64KB address
space.

>Already being used to a 32 bit word for S/360, I remember being
>disappointed at the time that VAX called 16 bits a word. Especially
>if one believed that VAX was supposed to be in the 32 bit processor
>marketplace.

But it also was marketed as a successor to the PDP-11, with various
compatibility features. And apparently it's important to keep "word"
the same size when doing such successions, and so 16 bits are still
called a word even in the Alpha architecture.

And you have the same story for the 6800 and the 68000.

And the same story with 8008, 8086 (ok, there it's natural), 80386
(aka IA-32) and AMD64 architectures (and also IA-64), except that here
the 32-bit units are called "double words".

And the analogous story with the originally 32-bit IBM/360, PowerPC,
MIPS, and SPARC architectures and their current 64-bit successors.
They all kept what they called a word the same even though the natural
word size grew.

Crossposted to comp.compilers,comp.arch, followups set to comp.arch.

glen herrmannsfeldt

Jan 3, 2010, 4:10:29 PM
Hans-Peter Diettrich <DrDiet...@aol.com> wrote:
(John wrote)

> [Early DEC machines had 18, 12, and 36 bit words, but once the PDP-11
> became a success in the early 1970s, byte sizes other than 8 and word
> sizes other than 16 and 32 soon withered away. -John]

There is a discussion on comp.sys.pdp10 on whether the PDP-10
architecture could have survived into the 1980's and 1990's if
DEC had tried to keep it alive.

Many of the IBM scientific machines before S/360, including
the 704 that started Fortran, were 36 bit machines.
There are many stories about the significance of the
precision loss going from the 36 bit 7090 to 32 bit S/360.
IBM's assumption was that people would go to 64 bit double
precision for most computation, which is probably about right.

I do believe that the continued use of the PDP-10 hardware would
have needed a file system based on at least 8 bit characters.
(TOPS-10 stores five 7 bit ASCII characters in a 36 bit word.)

I have heard that C compilers for the PDP-10 use a CHAR_BIT of 9,
though the architecture makes addressing 18 bit halfwords a
little easier than 9 bit bytes.

-- glen
[The 360 had well documented design errors in its floating point which
made the useful precision three bits less than it should have been, so
it wasn't really 32 vs. 36 bits. But the real fatal flaw on the PDP-10
was the 18 bit addressing, and the indirect segmented indirect addressing
kludge they added to later DEC-20s just wasn't good enough. -John]

glen herrmannsfeldt

Jan 3, 2010, 4:33:39 PM
In comp.compilers Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
(snip discussion of word, doubleword, and quadword)

> Not "double word", but "longword". And if that was not used for the
> original PDP-11, it was used for its successors, the Virtual Address
> eXtension (VAX) of the PDP-11 and the extended VAX (EV, aka Alpha).

Yes. I believe that quadword and octoword for 64 and 128 bits came
from the time of VAX.

> [The PDP-11's optional floating point units did address doublewords,
> which were different from VAX longwords because they goofed and stored
> the high 16 bits at the lower addresses. -John]

and continued that goof into the VAX and Alpha.

> The PDP-11 was not a compatible successor of another DEC machine, so
> the PDP-11's natural word size (16-bits) was used for the "word"-named
> unit.

(snip)

> Yes, 16 bits were called a word on such 8-bit processors (there was
> another name for 8-bit units: byte), and it was commonly needed,
> because these machines used 16 bits for addressing their 64KB address
> space.

But not quadword and octoword.

(snip regarding VAX and 16 bit words)

> But it also was marketed as a successor to the PDP-11, with various
> compatibility features. And apparently it's important to keep "word"
> the same size when doing such successions, and so 16 bits are still
> called a word even in the Alpha architecture.

It is also important to show the technology advances. VAX was
supposed to be DEC's entry to the 32 bit world. Keeping the word size
at 16 bits dilutes the effect.

I do remember that in early Alpha C compilers a C (long) was 64 bits,
but later changed to 32 bits with (long long) as the 64 bit type.

-- glen

Jon Harrop

Jan 3, 2010, 10:12:15 PM
Robert A Duff wrote:
> Jon Harrop <j...@ffconsultancy.com> writes:
>> Robert A Duff wrote:
>>> Are you saying all pointers are fat (4
>>> words per pointer!?)? Or just pointers to arrays?
>>
>> Yes, all pointers. Every single reference in the heap now occupies a
>> quadword. I even blindly load and store entire quadwords when reading and
>> writing references.
>
> For pointer-heavy data structures, I'd be concerned about using
> fat pointers all over -- damages cache behavior by making
> everything bigger.

Locality is different but not necessarily worse. For example,
computing the number of elements in a hash table represented as a
spine array of bucket arrays requires you to sum the lengths of the
bucket arrays. In HLVM's representation, those are stored in the spine
so you'll get 4 lengths per 64-byte cache line rather than only 1 =>
locality can be better with HLVM.
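
Concretely, the summing loop touches only the spine (a C sketch of the
layout; field names illustrative):

    #include <stddef.h>
    #include <stdint.h>

    struct fat_ref { void *type; uintptr_t length; void *mark; void *data; };

    static size_t total_length(struct fat_ref *spine, size_t nbuckets) {
        size_t n = 0;
        for (size_t i = 0; i < nbuckets; i++)
            n += spine[i].length;   /* bucket data is never dereferenced */
        return n;
    }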

Memory consumption is only increased for duplicated references
(i.e. when two or more references in the heap point to the same value)
but I believe that is comparatively rare. Indeed, a lot of literature
on reference counting says that this is so and, hence, 1-bit reference
counts that saturate can still be useful.

>>> Under what circumstances are you worried about an extra copy?
>>
>> Extra copy of what?
>
> In your original message, you said, "...because that would require C
> arrays to be copied just to add this header." I was asking what
> you mean. I guess you mean C code passes you an address,
> and an array length, or something like that, and you want to avoid
> putting your meta-data before the array data, because then you'd
> have to copy that data.

I see. Yes, that is exactly what I meant.

>>> By the way, if you're interested in GC, you should join the
>>> GC list, which our esteemed moderator set up some years ago.
>>
>> I'd love to. Where is it?
>
> I was hoping our esteemed moderator would point to it.
> I don't have it handy, hmm.... let's see, google points
> me here:
>
> http://www.iecc.com/gclist/
>
> It's pretty low traffic these days.

I've subscribed.

>> HLVM has another unusual characteristic in that it defines a high-level
>> language (e.g. with tuples and tail call elimination) that is not only
>> the target language but also the language the GC is written in.
>
> Interesting.

I just made HLVM multicore friendly (the GC allows threads to run in
parallel) and I'm thinking that it has huge potential...

Kaz Kylheku

unread,
Jan 3, 2010, 11:02:57 PM1/3/10
to

The definition of word, indeed, varies; as in ``what is the word size of
this machine''.

But of course, I was referring specifically to quad word, not to word.

But ``quad word'' is entrenched in the x86 culture and others.

In their Windows development kit, Microsoft define these types: WORD,
DWORD and QWORD. These are 16, 32 and 64 bits respectively. Programmers
from a Windows background have these size names drilled into their heads.

Outside of Microsoft: in 4.4 BSD Unix, the conversion specifier for a 64
bit integer is called %q, which stands for quad.

According to a search on http://codesearch.google.com for the terms
qword and quadword, the most popular size for type names derived
from these spellings is 64 bits.

So, proof by popularity and Microsoft: QUAD erat demonstrandum. :)

BGB / cr88192

unread,
Jan 3, 2010, 11:09:34 PM1/3/10
to
"Ray" <be...@sonic.net> wrote in message news:10-0...@comp.compilers...
> Jon Harrop wrote:

<snip>

> A lot of modern VMs use fat references. I know that the JVM uses at
> least 2x(sizeof( void* )) to represent a pointer. And yes, it's to
> support the GC.

yep.

I have used oversize pointers in a few places, but I guess the point of note
is that very few of them are this way.


> When I implemented a lisp runtime, I went through two iterations; in
> the first, pointers were the size of pointers on the native
> architecture (in this case 64 bits) and there was a 5-word (ie, the
> size of 5 pointers) overhead for each heap-allocated object. But the
> code was complex and, as it turned out, slow.

a 5-pointer-per-object overhead is itself nasty...


actually, in my case I have 0 pointers for smaller objects.

I can actually do this because my GC design was originally influenced by
those used in Guile and PLT Scheme, which had used bitmap-based memory
allocation for small objects.

my current GC mutated away from its original form in many respects, so I am
not sure exactly how much is common now (many years have passed and I don't
remember their code).

from what I remember, both used tagged references, but my modern GC uses raw
pointers, ...


in any case, storage for smaller objects is generally managed by the use of
bitmaps (actually, my current GC uses one 8-bit byte per cell, but I
don't have a better term for this; each byte is essentially treated as a
collection of bits).

this leads to a constant-factor storage overhead for small objects (around
6.25%, since cells are 16 bytes).
there is then also an 8-byte per-object header (identifies object type and
some other info).

so, the small-object heap consists of "chunks", typically 1MB, which are
filled with cells (960KiB data, 64KiB bitmap, so about 61440 usable data
cells).

an object takes at least 1 cell (for <=8 bytes payload), or 2 cells (for <=
24 bytes).

cons cells have their own heap chunks, where they have a similar 6.25%
bitmap overhead, but don't have headers.
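
roughly, in C, a chunk looks something like this (sizes as above; the
first-fit scan and the 0/1 free/used encoding are simplifications, the
real map bytes carry more information):

#include <stdint.h>
#include <string.h>

#define CELL_SIZE   16
#define CHUNK_CELLS 61440                /* ~960KiB of data cells */

typedef struct chunk {
    uint8_t map[CHUNK_CELLS];            /* one byte per cell: 0 = free */
    uint8_t cells[CHUNK_CELLS][CELL_SIZE];
} chunk;

/* first-fit: find n consecutive free cells, mark them used, return
   the first one. */
void *alloc_cells(chunk *c, size_t n)
{
    for (size_t i = 0; i + n <= CHUNK_CELLS; i++) {
        size_t j = 0;
        while (j < n && c->map[i + j] == 0) j++;
        if (j == n) {
            memset(&c->map[i], 1, n);    /* mark the run as used */
            return c->cells[i];
        }
        i += j;                          /* skip to the used cell */
    }
    return 0;                            /* chunk is full */
}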

large objects use a more traditional allocator (at present, I use malloc for
objects >= 6KiB, since malloc implementations tend to be fairly efficient
except for huge numbers of very small objects, where malloc may turn evil).

resolving a pointer to an object usually involves a quick check to determine
where in the heap it points (into the cons heap, small object heap, or to a
large object), and after this point, it is mostly a matter of arithmetic to
get the object base (this much is mostly an artifact of conservative GC; a
precise GC would not need to "resolve" objects in this way since it would
know which references were valid in advance).

(my GC can optionally do faster resolution for some operations, but at the
cost of a loss of safety checking and having to supply the right type of
object...).

<snip> description of a (generational?) GC using fat pointers and linked
lists. </snip>

I ran a line counter: my GC is 7630 LOC, so it is a bit larger.
granted, it is also a concurrent conservative mark/sweep + ref-counting GC.

1500 LOC could be trimmed (there is a deflate compressor in there which
doesn't need to be there...).
so, around 6130 LOC or so.


actually, personally I have rarely been able to get good performance out of
linked-list GCs. I had tried in the past, but they were slow (easily taking
seconds to perform a GC pass on a 50-60MB heap...).

this is why I like cells, in addition to the relatively high memory density
for small objects...

actually, it is also why I like using arrays of pointers, and quicksort (to
get these arrays sorted before GC to allow faster lookup), so that it
becomes an O(log2 n) operation to find the heap chunk (or a large object)
for a given pointer (followed by a small amount of arithmetic for small
objects or conses).
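
in C, that lookup is something like this (simplified sketch; a real
version would also handle the cons heap and large objects):

#include <stdint.h>

#define CHUNK_BYTES (1 << 20)            /* 1MB chunks */

/* binary-search a sorted array of chunk base addresses for the chunk
   containing p; O(log2 n), per the above. */
void *find_chunk(void **chunks, int nchunks, void *p)
{
    uintptr_t a = (uintptr_t)p;
    int lo = 0, hi = nchunks - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        uintptr_t base = (uintptr_t)chunks[mid];
        if (a < base)                     hi = mid - 1;
        else if (a >= base + CHUNK_BYTES) lo = mid + 1;
        else return chunks[mid];          /* p points into this chunk */
    }
    return 0;                             /* not a small-object pointer */
}

from the chunk base, getting the object base is then the cell arithmetic
mentioned above.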


>
> I expected performance with fat pointers to be horrible, but it
> actually wasn't. It turned out to be faster than the first version,
> (on an AMD64 dual-core processor with 128K/2M cache), probably due to
> more inlined values (I was using a lot of doubles) and better locality
> of reference.
>

granted, the critical issue WRT cache is actually the working set.

if the working set itself mostly fits in cache, then performance will not
change much (until one exceeds this).
OTOH, if the working set is huge, then performance will also not change
much, since it didn't fit in the cache before either (although there could
be a linear slowdown due to the extra memory traffic, which matters less
if the working set does fit).


then again, I guess the big downside would be that, if using the GC from C,
explicit fat pointers (AKA: not managed by the compiler) would be awkward
and likely prevent creation of an opaque API.

after all, if the client code knows anything, it is that the GC uses fat
pointers (and, very likely, a good deal more). this is unlike the case of
the client knowing little more of the GC's internals than that it uses raw
pointers (and some of the API calls, ...).

so, it "could" be a concern as well as to how the API looks, like, if it is
a clean and opaque API, or if the client code can see into the internals.

granted, there are many other cases where opaqueness is not as much of a
goal...

I guess all is a bit moot though, absent objective comparisons...

then again I am aware of some current internal flaws in my GC's
implementation which impede its reliability, and so trying to do a
stress-test on the thing is likely to just cause crashes...

before the thing was multi-threaded though (which is where problems were
introduced), I stress-tested the thing some, and optimized and fixed
whatever was misbehaving...

but, now I would need to fix it up some...


in general, it works well enough I guess though...

Robert A Duff

unread,
Jan 4, 2010, 2:13:29 PM1/4/10
to
Jon Harrop <j...@ffconsultancy.com> writes:

> Locality is different but not necessarily worse. For example,
> computing the number of elements in a hash table represented as a
> spine array of bucket arrays requires you to sum the lengths of the
> bucket arrays. In HLVM's representation, those are stored in the spine
> so you'll get 4 lengths per 64-byte cache line rather than only 1 =>
> locality can be better with HLVM.

OK.

> Memory consumption is only increased for duplicated references
> (i.e. when two or more references in the heap point to the same value)
> but I believe that is comparatively rare.

That's a good point.

But don't forget to count 'null' pointers. For example, consider
a balanced binary tree of depth 3. You've got 7 nodes and
7 non-null pointers to them (one from outside the tree).
But you've also got 8 null pointers stored in the leaf
nodes. As I understand it, those 8 null pointers are
going to be 4 words each, in HLVM.

You can think of 'null' as being a pointer to a single
dummy object -- and there are typically lots of duplicates!

If you're talking about something like OCaml, you don't
have null (which is a Good Thing!) but you have the
equivalent issue.

Also, why do you say "in the heap" above? I'd say "Memory consumption
is only increased for duplicated references (i.e. when two or more
references point to the same value)." Doesn't matter whether those
pointers are in the heap, or on the stack (or in a register -- four
registers! -- for that matter).

Anyway, if you've got good measurements, they trump my
speculation.

>...Indeed, a lot of literature on reference counting says that this
> is so and, hence, 1-bit reference counts that saturate can still be
> useful.

Yes, I've read that. There are lots of GC schemes that take
advantage of that.

- Bob

Anton Ertl

unread,
Jan 4, 2010, 2:57:25 PM1/4/10
to
glen herrmannsfeldt <g...@ugcs.caltech.edu> writes:
>In comp.compilers Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>> Yes, 16 bits were called a word on such 8-bit processors (there was
>> another name for 8-bit units: byte), and it was commonly needed,
>> because these machines used 16 bits for addressing their 64KB address
>> space.
>
>But not quadword and octoword.

Sure, but once they had used "word" for 16 bits, they apparently had
no imagination left for names of larger units (hmm: phrase, sentence,
paragraph; at least "line" and "page" are used:-).

>> But it also was marketed as a successor to the PDP-11, with various
>> compatibility features. And apparently it's important to keep "word"
>> the same size when doing such successions, and so 16 bits are still
>> called a word even in the Alpha architecture.
>
>It is also important to show the technology advances. VAX was
>supposed to be DEC's entry to the 32 bit world. Keeping the word size
>at 16 bits dilutes the effect.

It's just a name. I would prefer to use the natural word size for
"word" (i.e., 64 bits on Alpha and AMD64). But I guess there are
groups of people who reuse stuff from the smaller machines on the
bigger ones, and if the width of a "word" changed between the
architectures, interpreting documentation for such things would become
confusing ("Does this documentation mean a PDP-11 word, a VAX word, or
an Alpha word in this place?"), so every manufacturer decided to avoid
changing the number of bits in a "word" even if that meant that a
"word" was smaller than a natural machine word.

>I do remember that in early Alpha C compilers a C (long) was 64 bits,
>but later changed to 32 bits with (long long) as the 64 bit type.

C's long int type is 64 bits on Alpha (and other 64-bit architectures)
on Unix, and AFAIK 32 bits on Windows (even 64-bit Windows), with
64-bit long long int on both. I don't think you can get Windows for Alpha
anymore.

Kaz Kylheku

unread,
Jan 4, 2010, 5:22:00 PM1/4/10
to
On 2010-01-01, Jon Harrop <j...@ffconsultancy.com> wrote:
> I see. That is not what I'm doing but, besides atomicity, what is the
> motivation for trying to squeeze a reference into a pointer?

The real motivation is to squeeze it into a machine register. A single
memory->reg load instruction fetches the entire reference.

Moreover, calling a four-argument external function means loading only
four registers.

If your ABI provides, say, 8 registers for parameter passing (for
instance the MIPS 64-bit and n32 ABIs), you can have functions of
eight parameters that are register-based, rather than just two, when
each parameter requires four registers.
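
A sketch of the arithmetic, assuming an ABI that passes small
aggregates in registers (as the MIPS conventions above do; the types
are illustrative):

typedef struct fat_ref { void *rtti, *tag, *mark, *data; } fat_ref;

/* With 8 argument registers, all eight thin pointers arrive in
   registers... */
void f_thin(void *a, void *b, void *c, void *d,
            void *e, void *f, void *g, void *h);

/* ...but only two fat references do: each occupies four of them. */
void f_fat(fat_ref a, fat_ref b);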

Are any of your benchmarks structured as a realistic program, with
function calls going around among separately compiled external modules
(and possibly modules loaded separately at run-time)?

How much leverage are you getting in the benchmarks from just
accessing those parts of the reference that are relevant to the
computation (such as the principal pointer to the underlying data),
avoiding situations where you have to load the whole thing?

In real-world programs, there do end up copies of references to heap
data. Even if the heap itself doesn't have multiple references to an
object (i.e. most objects in the heap have a unique parent object in
the heap), the program can still be juggling lots of copies in the
registers, and throughout the call stack, whose frames may be spread
among external functions.

Some copies may not even be semantic references. If a function needs
to save some callee-saved registers to backing memory and re-load
them, a copy is made twice, even though the backing memory has no
semantic role other than holding those registers. Fat references will
create more register pressure of this type.

Say you need a whole bunch of registers to load some fat references
from memory. Oops, you've run out of the callee-clobbered registers,
and need to use callee-saved ones. So now your function has to save
those to memory and then restore them before returning. You've just
copied the parent frame's data twice, even though you aren't working
with that data.

Jon Harrop

unread,
Jan 4, 2010, 7:02:31 PM1/4/10
to
Kaz Kylheku wrote:
> So, proof by popularity and Microsoft: QUAD erad demonstrandum. :)

If you're interested in popularity, compare ARM's sales to Intel's. ;-)

Dennis Ritchie

unread,
Jan 5, 2010, 12:37:42 AM1/5/10
to
"glen herrmannsfeldt" <g...@ugcs.caltech.edu> wrote in message

> I believe the Cray-1 does 64 bit memory operations, and that most (if
> not all) C compilers for the Cray-1 used a CHAR_BIT of 64. If C char,
> short, int, and long are all 64 bits, why have a name other than word
> for them?

I'm not positive about what we did for the Cray 1, but XMP and
YMP used CHAR_BIT 8. Probably the -1 did as well.

Dennis

Kaz Kylheku

unread,
Jan 5, 2010, 3:43:08 PM1/5/10
to
On 2010-01-04, Robert A Duff <bob...@shell01.TheWorld.com> wrote:
> Jon Harrop <j...@ffconsultancy.com> writes:
>
>> Locality is different but not necessarily worse. For example,
>> computing the number of elements in a hash table represented as a
>> spine array of bucket arrays requires you to sum the lengths of the
>> bucket arrays. In HLVM's representation, those are stored in the spine
>> so you'll get 4 lengths per 64-byte cache line rather than only 1 =>
>> locality can be better with HLVM.
>
> OK.
>
>> Memory consumption is only increased for duplicated references
>> (i.e. when two or more references in the heap point to the same value)
>> but I believe that is comparatively rare.
>
> That's a good point.
>
> But don't forget to count 'null' pointers. For example, consider
> a balanced binary tree of depth 3. You've got 7 nodes and
> 7 non-null pointers to them (one from outside the tree).
> But you've also got 8 null pointers stored in the leaf
> nodes. As I understand it, those 8 null pointers are
> going to be 4 words each, in HLVM.

It seems these null pointers could be elided by using a different tree
node variant for leaf nodes, which simply has no pointers.

I.e. a tree is defined recursively as:

- an empty tree, denoted by a null value
- a leaf node record containing only a key and value
- an inner node containing a key, value and two children that are trees.

A ``half inner'' node (a node with one null child pointer) still has
two pointers, but these could get a dedicated representation also, for
both the left and right leaning variants. Though it deserves to be
noted that in a maximally balanced tree of any size, there is at most
one of these nodes, whereas more than half of the nodes are leaves.

If you have fat pointers you already have a type field in the pointer,
which indicates the node variant, so that the node need not have a
type field for this. It almost seems silly not to take advantage in
this way, given that these fat pointers are carrying a word-wide type
field.

Without fat pointers, this same compression can still be done with spare bits
in a regular pointer.
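
In C terms, a sketch of the fat-pointer-tagged version (two-word
references here for brevity; names are illustrative):

typedef enum { T_EMPTY, T_LEAF, T_INNER } tree_tag;

typedef struct tree_ref { tree_tag tag; void *node; } tree_ref;

typedef struct leaf  { int key, value; } leaf;  /* no child pointers */
typedef struct inner { int key, value; tree_ref left, right; } inner;

/* The tag in the reference selects the variant, so leaves carry no
   null children: the 8 nulls in the depth-3 example disappear. */
int tree_count(tree_ref t)
{
    if (t.tag == T_EMPTY) return 0;
    if (t.tag == T_LEAF)  return 1;
    inner *n = (inner *)t.node;
    return 1 + tree_count(n->left) + tree_count(n->right);
}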

BGB / cr88192

unread,
Jan 5, 2010, 6:05:34 PM1/5/10
to
"Robert A Duff" <bob...@shell01.TheWorld.com> wrote in message

> Jon Harrop <j...@ffconsultancy.com> writes:
>> Memory consumption is only increased for duplicated references
>> (i.e. when two or more references in the heap point to the same value)
>> but I believe that is comparatively rare.
>
> That's a good point.
>
> But don't forget to count 'null' pointers. For example, consider
> a balanced binary tree of depth 3. You've got 7 nodes and
> 7 non-null pointers to them (one from outside the tree).
> But you've also got 8 null pointers stored in the leaf
> nodes. As I understand it, those 8 null pointers are
> going to be 4 words each, in HLVM.
>
> You can think of 'null' as being a pointer to a single
> dummy object -- and there are typically lots of duplicates!
>

<snip>


>
> Anyway, if you've got good measurements, they trump my
> speculation.

yep.

I at first had more than a few doubts, but later discovered that
the VM was actually a very different technology than I had originally
thought: namely, it is statically typed, ...

so, yeah, 128 or 256 bit references performing well with static typing
is much easier to understand...

the main difference is in how things work:
in static typing, most things are not references;
in dynamic typing, almost everything is references.

I hadn't originally considered this, as I tend to use a large amount of
"mixed typing" instead.


>>...Indeed, a lot of literature on reference counting says that this
>> is so and, hence, 1-bit reference counts that saturate can still be
>> useful.
>
> Yes, I've read that. There are lots of GC schemes that take
> advantage of that.
>

yep.

my guess:
  count 0:  not encodable
  count 1:  encoded as 0
  count 2+: encoded as 1 (saturated)

I have typically used more bits (2-4), and allow 0 to be encodable (for
misc reasons, mostly so that an operation can do a "safe decrement" and
re-increment without freeing the object, and because it makes more sense
to me to allocate an object with count=0 so that putting it into a
variable will make the count=1, ...).
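
as a sketch (a 2-bit count here; the exact encoding is illustrative):

#define RC_MAX 3   /* saturated: never freed by the ref-count side */

/* increment saturates; saturated objects are left for mark/sweep. */
unsigned rc_inc(unsigned rc) { return rc < RC_MAX ? rc + 1 : rc; }

/* "safe decrement": never below 0, never off a saturated count.
   fresh objects start at count=0, so storing into a variable brings
   them to 1. */
unsigned rc_dec(unsigned rc) { return (rc == 0 || rc == RC_MAX) ? rc : rc - 1; }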


> - Bob

Jon Harrop

unread,
Jan 5, 2010, 9:01:45 PM1/5/10
to
Robert A Duff wrote:
> Jon Harrop <j...@ffconsultancy.com> writes:
>> Memory consumption is only increased for duplicated references
>> (i.e. when two or more references in the heap point to the same value)
>> but I believe that is comparatively rare.
>
> That's a good point.
>
> But don't forget to count 'null' pointers. For example, consider
> a balanced binary tree of depth 3. You've got 7 nodes and
> 7 non-null pointers to them (one from outside the tree).
> But you've also got 8 null pointers stored in the leaf
> nodes. As I understand it, those 8 null pointers are
> going to be 4 words each, in HLVM.

If you put null references in the data structure, yes.

> You can think of 'null' as being a pointer to a single
> dummy object -- and there are typically lots of duplicates!

Potentially, yes. In fact, that is exactly how F# represents many null
values: the empty list [] is represented as a pointer to a global
empty list of that type. That representation was chosen over a simple
NULL because it conveys run-time type information and, consequently,
makes it possible to pretty print the empty list.

HLVM certainly would lose out on that sharing if you used that
representation. However, if you use a better representation then you
get better performance and memory consumption and fat references do
not appear to be a hindrance. Specifically, you take any variant type
such as the type of a list:

type 'a list = Nil | Cons of 'a * 'a list

and you special case type constructors with Nils in them so Cons(x, Nil)
becomes One x:

type 'a list = Nil | One of 'a | Cons of 'a * 'a list

HLVM gives you at least 2^32 type constructors, so adding an extra one
costs nothing. Now the only lists that use Nil are empty lists: the
number of null references is dramatically reduced, but you still have
all of the advantages (e.g. a typed representation that can
be pretty printed).
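
The construction side of the trick, sketched in C with a tagged
representation (hypothetical names; HLVM's actual representation is
the fat reference described earlier):

#include <stdlib.h>

typedef enum { ONE, CONS } list_tag;

typedef struct list {
    list_tag     tag;
    int          head;
    struct list *tail;   /* only meaningful when tag == CONS */
} list;

/* The empty list is just NULL, and the smart constructor never puts
   it inside a node: a cons onto the empty list becomes One. */
list *cons(int x, list *xs)
{
    list *n = malloc(sizeof *n);
    n->head = x;
    if (xs == NULL) { n->tag = ONE;  n->tail = NULL; }
    else            { n->tag = CONS; n->tail = xs;   }
    return n;
}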

This is so successful that I use an equivalent trick to special-case
branches of the tree with empty children in OCaml's Set data structure,
and it improves performance by 30% (and reduces memory consumption). In
fact, that removes all of the null references from your example tree.

> Also, why you do say "in the heap" above? I'd say "Memory consumption
> is only increased for duplicated references (i.e. when two or more
> references point to the same value)." Doesn't matter whether those
> pointers are in the heap, or on the stack (or in a register -- four
> registers! -- for that matter).

No, because the optimizer eliminates redundant loads and
stores. Although HLVM naively generates LLVM IR asking it to load and
store entire quadwords, LLVM's optimization passes detect when the IR
requires less (e.g. is just reading the length of an array) and load
only a single word.

The main exceptions are pushing and popping quadwords from the shadow
stack and passing fat references in function arguments (where they
consume 4 registers instead of 1).

> Anyway, if you've got good measurements, they trump my speculation.

Yes. I think these results are really surprising. Everyone I spoke to
when I was designing HLVM said that it was known to be a bad idea but
it obviously isn't.

Jon Harrop

unread,
Jan 5, 2010, 9:08:14 PM1/5/10
to
Kaz Kylheku wrote:
> Are any of your benchmarks structured as a realistic program with
> function calls going around among separately compiled external module
> (and possibly separately loaded at run-time?)

There is no notion of external module. HLVM uses JIT compilation to
perform whole-program optimization on-the-fly, like the CLR.

> How much leverage are you getting in the benchmarks from just
> accessing those parts of the reference that are relevant to the
> computation (such as the principal pointer to the underlying data),
> avoiding situations where you have to load the whole thing?

Difficult to say. LLVM's optimization passes optimize such things but
generally do not improve performance.

> In real-world programs, there do end up copies of references to heap
> data. Even if the heap itself doesn't have multiple references to an
> object (i.e. most objects in the heap have a unique parent object in
> the heap), the program can still be juggling lots of copies in the
> registers, and throughout the call stack, whose frames may be spread
> among external functions.

Yes.

> Some copies may not even be semantic references. If a function needs
> to save some callee-saved registers to backing memory and re-load
> them, a copy is made twice, even though the backing memory has no
> semantic role other than holding those registers. Fat references will
> create more register pressure of this type.

In theory, yes.

> Say you need a whole bunch of registers to load some fat references
> from memory. Oops, you've run out of the callee-clobbered registers,
> and need to use callee-saved ones. So now your function has to save
> those to memory and then restore them before returning. You've just
> copied the parent frame's data twice, even though you aren't working
> with that data.

That's the concern but it seems to be unfounded in practice.

George Neuner

unread,
Jan 7, 2010, 1:23:56 AM1/7/10
to
On Mon, 04 Jan 2010 19:57:25 GMT, an...@a4.complang.tuwien.ac.at
(Anton Ertl) wrote:

>glen herrmannsfeldt <g...@ugcs.caltech.edu> writes:
>>In comp.compilers Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>>> Yes, 16 bits were called a word on such 8-bit processors (there was
>>> another name for 8-bit units: byte), and it was commonly needed,
>>> because these machines used 16 bits for addressing their 64KB address
>>> space.
>>
>>But not quadword and octoword.
>
>Sure, but once they had used "word" for 16 bits, they apparently had
>no fantasy left for names of larger units (hmm: phrase, sentence,
>paragraph; at least "line" and "page" are used:-).

Intel used the term "paragraph" to refer to the 16-byte blocks
addressed by 8086 segment registers. The so-called "huge" memory
model normalized pointers to a standard format in which the offset was
reduced modulo 16, so the linear address was (segment * 16) + offset
with the segment selecting the paragraph and the offset the byte
within it. That enabled pointers to be compared easily by value and
eliminated the issue of different segment:offset pairs referring to
the same memory location.
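
In C, the normalization amounts to (a sketch):

#include <stdint.h>

/* Normalize an 8086 segment:offset pair to the canonical "huge" form,
   where the offset is always in 0..15. Two normalized pointers are
   equal exactly when they address the same byte, so they can be
   compared as plain values. */
void normalize(uint16_t *seg, uint16_t *off)
{
    uint32_t linear = (uint32_t)*seg * 16 + *off;   /* 20-bit address */
    *seg = (uint16_t)(linear >> 4);
    *off = (uint16_t)(linear & 0xF);
}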

George

George Neuner

unread,
Jan 7, 2010, 2:10:51 AM1/7/10
to
On Sun, 03 Jan 2010 09:36:40 -0800, Ray <be...@sonic.net> wrote:

>In the second iteration, I used fat pointers; Each pointer was the
>size of four native-architecture pointers where one was the pointer to
>the object on the heap, two were "forward" and "back" pointers that
>joined all pointers into a single doubly linked list, and one was a
>set of immediate binary flags (typetag, "soft" pointer, last pointer
>in this object, etc). Different regions of the list corresponded to
>different populations w/r/t Garbage Collection (traversed root,
>untraversed root, reached and traversed live data, reached but not
>traversed live data, not yet reached possibly-live data, and known
>garbage not yet finalized/deallocated). The garbage collector knew
>where the dividing points between these regions were, and would
>process pointers found at those points in the list, often moving them
>to one of the other boundary points. At "generation boundaries" when
>the "untraversed" section was empty, it would just shift the boundary
>points and start over. The GC code was very simple (<200 lines of C)
>and needed to know nothing about the internal structure of the
>user-defined types beyond a few bits set in the object allocators.
>
>I expected performance with fat pointers to be horrible, but it
>actually wasn't. It turned out to be faster than the first version,
>(on an AMD64 dual-core processor with 128K/2M cache), probably due to
>more inlined values (I was using a lot of doubles) and better locality
>of reference.

I suspect a good part of the overall improvement came from using the
treadmill GC. Treadmills are very fast and, so far, the treadmill is the
only general GC model with performance predictable enough to enable
its use in some high-speed hard real-time (HRT) processing.

George

George Neuner

unread,
Jan 7, 2010, 1:57:45 AM1/7/10
to
On Fri, 01 Jan 2010 06:10:02 +0000, Jon Harrop <j...@ffconsultancy.com>
wrote:

>Kaz Kylheku wrote:


>> On 2009-12-29, Jon Harrop <j...@ffconsultancy.com> wrote:
>>> I couldn't be bothered to allocate a separate header so,
>>> instead, I pulled the GC metadata into the reference. So my references
>>> are now "fat": a quadword of pointer to run-time type, array length or
>>> union type tag, pointer to mark state and pointer to the actual data
>>> itself.
>>

>> So now you have references that can't fit into a machine register.
>
>If you don't have 128- or 256-bit registers on your 32- or 64-bit machine,
>respectively.

Comparing your fat references using large SIMD registers is relatively
easy (but still 2 instructions on most architectures - not as simple
as comparing a single value). But the SIMD registers on most (all?)
CPUs can't dereference pointers ... which means you still have to get
the pointer bits into a general register to use it. Any data on how
much of a hit you take by reloading or extracting pointers for use
after the tag check?

Or are you just pointing out that some CPUs have large(r) registers?

Two issues you might consider are: 1) few current machines have memory
buses wider than 64 bits even if they have larger registers; and 2)
for multiple cores or processors you need to consider the bus
arbitration scheme ... if the bus is arbitrated cycle-by-cycle there
is no guarantee that a large SIMD register load/store will be atomic.

George
