
Tagged memory system for garbage collector


robf...@gmail.com

May 11, 2021, 6:40:11 PM
Just thinking about using a tagged memory system to support garbage collection,
which needs to distinguish pointers from other data.
I do not like the bit-stealing idea, as it creates oddball-sized floating point.
It also does nothing for the case of unaligned data.
What I am proposing is:
Each byte of memory is tagged with an additional byte. The additional byte
contains a data type and a byte index into the type.
This doubles memory requirements. When a load operation occurs, the memory tag
is loaded into the register's tag if it is byte 0.

Tag  Data Type    Byte Number
000  pointer      0 to F
001  integer
010  float
011  decfloat
100  posit
101  reserved
110  instruction
111  unknown

The register file has a data type tag associated with each register.
There are some simple rules for how the tag value is maintained.
The tag is set when the value of the target register is set, depending on the
operation.
The LEA instruction sets the target data type to 'pointer'. Subsequent
additions or subtractions of integers keep the data type at 'pointer'.
Subtracting two pointers turns it into an integer.

There would be a branch instruction allowing a branch depending on the data
type.
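The tag-maintenance rules could be modeled in software roughly like this (a minimal sketch for illustration only; the type and function names are made up, not part of the proposal):

```c
#include <assert.h>

/* Register-file tag values, following the table in the post. */
typedef enum { T_POINTER, T_INTEGER, T_FLOAT, T_DECFLOAT,
               T_POSIT, T_RESERVED, T_INSTR, T_UNKNOWN } Tag;

typedef struct { long val; Tag tag; } Reg;

/* LEA sets the target data type to 'pointer'. */
static Reg lea(long addr) { return (Reg){ addr, T_POINTER }; }

/* Adding an integer to a pointer keeps the result a pointer. */
static Reg add(Reg a, Reg b) {
    Tag t = (a.tag == T_POINTER || b.tag == T_POINTER) ? T_POINTER : T_INTEGER;
    return (Reg){ a.val + b.val, t };
}

/* Subtracting two pointers turns the result into an integer;
   pointer minus integer stays a pointer. */
static Reg sub(Reg a, Reg b) {
    Tag t = (a.tag == T_POINTER && b.tag != T_POINTER) ? T_POINTER : T_INTEGER;
    return (Reg){ a.val - b.val, t };
}
```

The proposed branch-on-type instruction would then simply test the `tag` field of a register in hardware.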

Stefan Monnier

May 11, 2021, 6:49:40 PM
robf...@gmail.com [2021-05-11 15:40:10] wrote:
> Just thinking about using a tagged memory system to support garbage collection
> which needs to distinguish pointers from other data.

Your proposal gives me the impression that you haven't actually tried to
write the corresponding garbage collector.

> This doubles memory requirements.

Seems like a rather steep cost.


Stefan

MitchAlsup

May 11, 2021, 8:53:38 PM
How do you mask off the lower 23 bits of a floating point fraction ?

Strong typing is good for writing most codes, but when writing the floating point
support library, one needs to deal with the 3 fields in a FP number as if they were
bit strings.

How do you deal with unions ?

How do you pack a 48-bit pointer and 16-bits of "other data" into a 64-bit word ?

What happens if you Load an instruction ? Store an instruction ?
What happens when you fetch an Integer, a Floating point, an unknown ?

Why not go all the way and have both signed and unsigned integers ?

Who do you think will cater to this set of ideas ?

Will this make designing hardware simpler ?

robf...@gmail.com

May 11, 2021, 10:58:37 PM
>Your proposal gives me the impression that you haven't actually tried to
>write the corresponding garbage collector.

I have keyed in a simple garbage collector, translated to CC64 / assembler from text-book pseudo-code,
but not yet debugged it. Some of it hinges on what support the ISA has for GC.


> Strong typing is good for writing most codes, but when writing the floating point
> support library, one needs to deal with the 3 fields in a FP number as if they were
> bit strings.

The goal was not strong typing, but just enough distinction to get garbage
collection working. I guess this really needs only two bits, according to the text on
garbage collection.

> How do you pack a 48-bit pointer and 16-bits of "other data" into a 64-bit word ?

The idea behind having the byte index 0-15 was to distinguish this type of thing.
A 48-bit pointer would have six “pointer” type tags with byte indexes 0 to 5, and two
other tag values (integer, bytes 0 to 1).

>What happens if you Load an instruction ? Store an instruction ?
>What happens when you fetch an Integer, a Floating point, an unknown ?

The three bit data type field would be directly loaded into a data type field associated
with a register.

>Why not go all the way and have both signed and unsigned integers ?

It is probably good enough to use a two-bit field specifying pointer, object header, or
integer (other value). There is no real need to distinguish between signed and unsigned,
but since there are extra bits available when a byte is used, both could be specified. I
thought it may be handy for exceptions, for instance when trying to add a float value in a
register to an integer value.

> Who do you think will cater to this set of ideas ?

Probably not many people. It is more expensive hardware. Part of the goal is better
performance. It can take 20-30 instructions to detect a pointer on a stock machine.
The goal of tagging memory (and registers) is to make it zero instructions.

> Will this make designing hardware simpler ?

No. However, the hardware complexity is not high. The trade-off is simpler software
for hopefully better performance.

*****

One step in pointer detection is to see if the potential pointer could possibly be to a heap location.
This amounts to checking the value against the bounds of the heap. So, I was wondering if there
is any hardware to do bounds checking, rather than using a series of test-and-set-then-branch
instructions? For instance, the heap bounds could be in CSRs and a single parallel instruction used
to test against all heap bounds at the same time.

Checking if a pointer is on the heap is not as simple as it sounds. There may be multiple heaps in
the system (one or more for each app) to check. The heaps may be in their own address spaces in
virtual memory. If we would like to use a globally operating garbage collector it gets complex. And
complex means slow.
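For reference, the software-side multi-heap check that such a parallel instruction would replace looks roughly like this (a sketch; the heap table and names are illustrative):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

typedef struct { uintptr_t lo, hi; } HeapRange;

/* Conservative test: does v fall inside any known heap?  On a stock
   machine this costs a compare pair plus branch per heap; the idea
   above is to do all of these comparisons at once against CSRs. */
static int maybe_heap_pointer(uintptr_t v, const HeapRange *heaps, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (v >= heaps[i].lo && v < heaps[i].hi)
            return 1;
    return 0;
}
```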



robf...@gmail.com

May 12, 2021, 8:06:10 AM
The tagged memory system may work well with error correcting memory. ECC memory needs five extra bits for each byte. Adding in a three-bit tag rounds the memory requirements to 16 bits. Five ECC bits is enough to cover 11 data bits.
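The arithmetic can be sanity-checked with the standard Hamming bound for SECDED (a small sketch; the function name is made up):

```c
#include <assert.h>

/* Single-error-correcting Hamming needs the smallest r with
   2^r >= k + r + 1 over k data bits; one extra overall-parity
   bit adds double-error detection (SECDED). */
static int secded_bits(int k) {
    int r = 1;
    while ((1 << r) < k + r + 1)
        r++;
    return r + 1;  /* +1 parity bit for DED */
}
```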

Stefan Monnier

May 12, 2021, 9:57:23 AM
>> Your proposal gives me the impression that you haven't actually tried to
>> write the corresponding garbage collector.
> I have keyed in a simple garbage collector translated to CC64 /
> assembler from text-book pseudo-code but not yet debugged it.
> Some of it hinges on what support the ISA has for GC.

But was that code notably helped by your extra metadata?
How does it interact with standard GC optimizations?

>> Strong typing is good for writing most codes, but when writing the floating point
>> support library, one needs to deal with the 3 fields in a FP number as if they were
>> bit strings.
> The goal was not for strong typing but just enough distinguishment to get garbage
> collection working.

GC is already working quite well on off-the-shelf ISAs, so there must be
some other goal, right? AFAIK the main goals I'd expect to see in this area would be:
- Improve security via hardware-enforced strong typing.
- Improve speed (of the overall program, by reducing time spent in the
  GC or in the mutator).
- Reduce the cost of concurrent or incremental GC.
- Reduce memory overhead (tho this is rarely considered a problem with
  current GCs, AFAIK).

Clearly your scheme doesn't aim to help with the last point, and you just stated
that it doesn't aim to help with the first point either.

> One step in pointer detection is to see if the potential pointer could
> possibly be to a heap location.

That's only used in the case of conservative tracing, which you'll
generally want to avoid if you want a high-performance GC (because of
the overhead of checking whether a value points to a valid heap object,
and the overhead of maintaining the data structures needed for this
check to be fast).


Stefan

BGB

May 12, 2021, 4:35:59 PM
Yep.


A precise GC will generally know the type and layout of every object in
the GC managed heap, and also a known set of roots to start tracing from.

This means the sort of "is this location possibly a pointer into the
heap?" sorts of checks are completely unnecessary apart from a basic
tagged value check or similar.


The main drawback:
C does not traditionally interact well with precise GC (it is a major
pain). Thus, conservative GC remains popular for "not a horrible pain to
deal with from C" reasons.


Ideally, one would want language extensions, eg:
SomeObj ^obj;
obj = new^ SomeObj(); //*1,

*1: casting a pointer returned by "malloc()" or similar may no longer be
valid. Also, traditional "struct pointers" are no longer sufficient.


Which informs the compiler that this is an object reference into the GC
managed heap, and that it should deal with all of the sorts of hassles
this entails:
Providing any relevant metadata for the GC;
Generating any relevant boilerplate to make things work as expected;
...


So, for its part, the compiler:
Generates a list of global locations which may hold stack locations;
Generates function-unwind metadata which also flags GC variables;
Generates object-layout metadata for any GC managed objects;
Likely emits calls to GC handler functions rather than directly handling
things like variable-assignment itself (needed if a Ref-Counter or
similar is used);
...


We can assume that an object reference is one of several basic types:
Object (GC managed object of some sort);
Array (GC managed array of some sort);
Fixnum (Integer value, ignored by GC);
Flonum (Floating-point value, ignored by GC);
...

As noted, the traditional way to handle this is to "steal" a few bits from
the value:
00: ObjRef or Similar (48 bit addr, 8 .. 12 bit TypeID)
01: Fixnum (62 bits)
10: Flonum (62 bits)
11: Other


Where storing a type-ID in the ObjRef avoids needing to retrieve it from
an object header or similar (though, typically the GC and object-system
will have their own relevant structures), eg:
64 or 128 bits: MM/GC related header (Object Type/Size/RefCount)
May be stored in a bit-packed form.
64 bits: VTable Pointer (Object)
Object payload data

If one has a Class+Interface model, the Interfaces are typically glued
to the end of the object (after all the payload data), with an interface
reference pointing to this location, rather than the start of the
object. In this case, the GC will need to be aware of interface types
and correct for this situation.

Typically, the object will point to a VTable, where the first few fields
are special. Say, index 0 may hold a pointer to a ClassDef structure or
similar, which may define information about the size and layout of an
object (and potentially where it may find GC refs). This might also hold
things like a QName or similar.

Potentially, the compiler could generate an internal "_GCMark" method or
similar, whose main purpose is to mark any references during GC.


So, GC marking a reference:
Is it fixnum or flonum?
Ignore this reference.
Invoke associated TypeID handler.

TypeID Handlers:
Is it a ClassObj?
Invoke its "_GCMark()" method.
Is it an Interface?
Get base object and call its "_GCMark()";
Is it an array of references?
Mark array elements.
Is it a DynObj?
Mark list of object fields.
...


Depending on the language, there might also be "dynamic objects"
(DynObj) with contents that are prone to vary at runtime. These could be
handled differently, typically either as sorted associative arrays, or
as a tree-like structure (such as a B-Tree).

One might also have "Dynamic Classes", which could exist as a ClassObj
containing a reference to a DynObj representing any non-fixed fields.


...

robf...@gmail.com

May 12, 2021, 8:15:02 PM
>As noted, traditional way to handle this is to "steal" a few bits from
>the value:
>00: ObjRef or Similar (48 bit addr, 8 .. 12 bit TypeID)
>01: Fixnum (62 bits)
>10: Flonum (62 bits)
>11: Other

This implies that there is a special process to follow to perform arithmetic on a stock machine. In a GC-managed system, would it be good to have 62-bit floating-point hardware? Or is the process to manipulate fp at 64 bits, then round to the third LSB? It seems to me it would be necessary to maintain the stolen bit values in registers throughout arithmetic processing.

When performing arithmetic on a stock machine, the ADD must be followed by code to set the two LSBs to 01. Admittedly this is not a lot of code: just a field-deposit instruction following the arithmetic operation. And if the compiler handles it, we do not have to worry about it. But if the arithmetic instructions themselves set the two LSBs appropriately, code would be shorter and faster.
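The sequence described above can be sketched in C for one common low-bit tagging layout (tag values here follow the 00/01/10/11 table quoted above; the helper names are made up):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t Value;
enum { TAG_OBJREF = 0, TAG_FIXNUM = 1, TAG_FLONUM = 2, TAG_OTHER = 3 };

static Value   mk_fixnum(int64_t i) { return ((uint64_t)i << 2) | TAG_FIXNUM; }
static int64_t fixnum_val(Value v)  { return (int64_t)v >> 2; } /* arithmetic shift */

/* The plain ADD, followed by a field deposit restoring the two LSBs to 01. */
static Value fixnum_add(Value a, Value b) {
    uint64_t sum = a + b;               /* payloads add in place; tags add to 10 */
    return (sum & ~3ull) | TAG_FIXNUM;  /* deposit the fixnum tag */
}
```

If the arithmetic instructions deposited the tag themselves, the mask/OR step would disappear, which is exactly the shortening the post is after.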

The above makes me think that it might be helpful to have load/store instructions that set the stolen bits appropriately. As in load integer, load float, load pointer.
How are float quads handled? 124 bits with two bits in the middle 10? The above assumes 64-bit values.

A precise GC can begin with known roots and compiler-generated info which aids GC; what if imprecise GC is needed?
What about an orphaned-object garbage collector? Suppose an APP crashes, how would memory be recovered by the OS?

MitchAlsup

May 12, 2021, 8:40:40 PM
This is starting to smell a bit like what you want is a descriptor machine,
like the B6600 or ICL 2900.

Stefan Monnier

May 12, 2021, 11:18:07 PM
> This implies that there is special process to follow to perform arithmetic
> on a stock machine.

Not necessarily. For JVMs for example, there's no need for that.
Instead, the compiler needs to make sure the mutator provides the GC
with a precise "map" of which local variables (i.e. registers and stack
slots) contain pointers and which contain other data.

> In a GC managed system would it be good to have 62-bit
> floating point hardware?

Non-IEEE floating point is not going to fly.

> When performing arithmetic on a stock machine, the ADD must be followed by
> code to set the two LSB’s to 01.

That's often used in dynamically typed languages, yes.
Less so in statically typed languages.


Stefan

robf...@gmail.com

May 13, 2021, 1:34:00 AM
So, the VM is expected to provide data type maps for the components of an object.
This would be more memory efficient than tagging every memory word, assuming
there is a single map that applies for many instances of objects. The map could be
a special instance of the object that contains the data type rather than the data
itself at each member location. Assuming it is desirable to use the same offsets into
the object to determine the data type as would be used to determine the data value.
But then how would the data type of bit fields be specified?

Pointers to objects on the stack are tracked by tracking which stack variables have
been set to pointer values. -> Use a shadow stack that contains only pointers + a data type id?

Since everything is tracked by the VM, there is a root down connection to all potential
pointer variables. The system “knows” the location of all the pointer variables so there
is no need for tagged data then.

It seems the only place it may be good to have tags for data is in the register file.

->Assuming the GC operation is driven by the OS and a task switch occurs the register file
itself should not need to be scanned for pointers. The register values will have been stored
in the task control block. Pointers can be found from that image.

The compiler database of types is dumped somewhere as part of the code.

The CC64 compiler uses a 15-bit data type id which is used to compute method signatures.
Built-in data types are present using id’s < 64. Single byte data types can have the data type
stored in a single byte. If the id is > 128 then a two byte (fifteen bit) data type id can be assumed.
->The data type id will always fit within the data value. Pointer maps for structured data types
can be made by storing the offset of the next pointer member in the current pointer member.
The offset to the first pointer member can be stored in the header. Thus the GC code can trace
the location of pointers given the type id.
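A rough sketch of how a GC trace loop could follow such a chained-offset pointer map (all structure layouts here are hypothetical, invented for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical object header: holds the offset of the first pointer member. */
typedef struct {
    uint16_t type_id;
    uint16_t first_ptr_off;   /* 0 means "no pointer members" */
} ObjHeader;

/* Per-type map: map[off] holds the offset of the next pointer member
   after the one at 'off'; 0 terminates the chain. */
static void trace_object(char *obj, const uint16_t *map,
                         void (*mark)(void *)) {
    const ObjHeader *h = (const ObjHeader *)obj;
    for (uint16_t off = h->first_ptr_off; off != 0; off = map[off])
        mark(*(void **)(obj + off));
}
```

Given the type id, the GC only touches the pointer slots, without scanning the non-pointer members at all.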

BGB

May 13, 2021, 3:29:08 AM
On 5/12/2021 10:18 PM, Stefan Monnier wrote:
>> This implies that there is special process to follow to perform arithmetic
>> on a stock machine.
>
> Not necessarily. For JVMs for example, there's no need for that.
> Instead, the compiler needs to make sure the mutator provides the GC
> with a precise "map" of which local variables (i.e. registers and stack
> slots) contain pointers and which contain other data.
>

Pretty much.

Metadata may be used, but keeping fairly conventional in-memory formats
is preferable from a performance perspective.

If the programmer types "int i;" why not just take the hint and use a
bare integer?...



In a statically typed language, tagged types would be a special case,
rather than the rule. They are useful, though, if there is a need to be
able to store a value in a reference. Some languages (such as Java)
use "boxing" via a class instance.

Eg:
Object boxInt = (Object)(new Integer(3));

But this is fairly inefficient compared with a fixnum or flonum.
More so if a language doesn't buy into the "everything is an instance of
a class" Kool-Aid.


>> In a GC managed system would it be good to have 62-bit
>> floating point hardware?
>
> Non-IEEE floating point is not going to fly.
>

Agreed. If anything, there might be some tag Check/Set/Convert helper
ops, but otherwise the FPU will still be working with Double.


>> When performing arithmetic on a stock machine, the ADD must be followed by
>> code to set the two LSB’s to 01.
>
> That's often used in dynamically typed languages, yes.
> Less so in statically typed languages.
>

Pretty much.


It also makes sense in hybrid languages (ActionScript3 having been an
example), where one has both static and dynamically typed features in
the same language. My BGBScript languages generally fell into this
category as well.


BGBScript (original):
Started out as a JavaScript knock off;
Mutated in a direction similar to that of ActionScript3;
Eventually became hideously over-complicated.
Mostly due to a complicated scope-model and typesystem.
Static + Dynamic + Type inference.

This language could be used like JS, with optional type annotations. It
would also try to use type-inference. Some aspects of its design were
also influenced by Self and Smalltalk, including its scope model.

So, resolving a type or declaration was essentially a graph walk, but
with the possibility that parts of the graph could be mutated at runtime
with little distinction between compile-time and runtime scope semantics.


BGBScript 2:
Retained a similar syntax to BGBScript;
Semantically had more in common with Java and C#;
Switched over to Package/Namespace/Class/Function scoping;
Used a primarily static (manifest) typesystem.
Dynamic types were treated as a subset of the static typesystem.

This language was simpler in that it was possible to write a compiler
for it.

However, being a statically compiled language limited it in some ways,
and if it is put in competition with C, then C holds more advantage.


I had recently considered a language which I was calling BS1L (BGBScript
1 Lite), which would aim to be a simplified intermediate:
Similar syntax to BGBScript 1 and 2;
Primarily dynamically typed;
Would use lexical scoping as the primary scope model;
Would likely use an intermediate IR format for interpretation.

This language would be intended mostly for lightweight scripting and
being parsed from text files, rather than compiled in advance.

Its typesystem would be a subset of the one used in BGBScript2.


Code in BS1L would look something like:
var a0, a1;
a0 = [1, 2, 3, 4]; //array
a1 = {x: 3, y: 4}; //object
...
Probably with either 'var' or 'function' being allowed for declarations:
function foo(x, y) { return(x+y); }
var bar(x, y) x+y; //why not?

Like in the original BGBScript, naming an object member like "_foo" will
cause it to behave like a delegate:
var obj1 = {z: 3};
var obj2 = {x: 1, y: 2, _proto: obj1};
obj2.z => 3

This is similar to the "prototype" member in JS, just allowing any
number of delegation paths. However, this time it would be confined to
object members.

Another plan is that BS1L would primarily use reference-counting, as
opposed to a mark/sweep garbage collector or similar (the original BS
had used a mark/sweep collector), or manual memory management (as in
BGBScript2).

...

Bernd Linsel

May 13, 2021, 5:17:21 AM
It is possible to store fp numbers in standard format and hide all other
data formats (fixnum, pointers) including some type bits in NaNs.

Terje Mathisen

May 13, 2021, 8:04:39 AM
Yeah, that is the classic NaN-boxing trick; I have used it myself in a
situation where it halved the memory use of a large (memory-limited)
application.

The main considerations are usually things like:

- How many bits do we need for tags vs data vs actual NaN values?

- Can we zero/sign-extend all pointers from a ~50-bit boxed payload
value, or do we need to use a custom allocator where the 50-bit pointer
values are really indices into a single big heap array?

When you really need it, and can live with the limitations, then it can
be a huge win.
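A minimal sketch of the trick in C (the tag values and payload split are illustrative, and it assumes computed NaNs are canonicalized so that real doubles never collide with the boxed range):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t Value;

/* Boxed values live above the canonical quiet-NaN bit patterns
   (0x7FF8... / 0xFFF8...); every ordinary double stays below BOX_BASE. */
#define BOX_BASE 0xFFFC000000000000ull
#define TAG_INT  0xFFFC000000000000ull  /* illustrative tag choices */
#define TAG_PTR  0xFFFD000000000000ull
#define PAYLOAD  0x0000FFFFFFFFFFFFull  /* 48-bit boxed payload */

static Value  box_double(double d)  { Value v; memcpy(&v, &d, 8); return v; }
static double unbox_double(Value v) { double d; memcpy(&d, &v, 8); return d; }
static int    is_double(Value v)    { return v < BOX_BASE; }

static Value    box_int(uint64_t i) { return TAG_INT | (i & PAYLOAD); }
static uint64_t unbox_int(Value v)  { return v & PAYLOAD; }
```

Doubles are stored unmodified, so FP math keeps full IEEE semantics; only the non-double types pay the tagging cost.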

Terje


--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

BGB

May 13, 2021, 12:49:33 PM
On 5/13/2021 7:04 AM, Terje Mathisen wrote:
> Bernd Linsel wrote:
>> On 13.05.2021 02:15, robf...@gmail.com wrote:
>>>> As noted, traditional way to handle this is to "steal" a few bits from
>>>> the value:
>>>> 00: ObjRef or Similar (48 bit addr, 8 .. 12 bit TypeID)
>>>> 01: Fixnum (62 bits)
>>>> 10: Flonum (62 bits)
>>>> 11: Other
>>>
>>> This implies that there is special process to follow to perform
>>> arithmetic on a stock machine. In a GC managed system would it be
>>> good to have 62-bit floating point hardware? Or is the process to
>>> manipulate fp at 64-bits then round to the third LSB? It seems to me
>>> it would be necessary to maintain the stolen bit values in registers
>>> throughout arithmetic processing.
>>>
>>> When performing arithmetic on a stock machine, the ADD must be
>>> followed by code to set the two LSB’s to 01. Admittedly this is not
>>> a lot of code to perform. Just a field deposit instruction following
>>> an arithmetic operation. And if the compiler handles it we do not
>>> have to worry about it. But if the arithmetic instructions themselves
>>> set the two LSB’s appropriately code would be shorter and faster.
>>>
>>> The above makes me think that it might be helpful to have load/store
>>> instructions that set the stolen bits appropriately. As in load
>>> integer, load float, load pointer.
>>> How are float quads handled? 124 bits with two bits in the middle 10?
>>> The above assumes 64-bit values.
>>>
>>> A precise GC can begin with known roots and compiler generated info
>>> which aids GC, what about if imprecise GC is needed?
>>> What about an orphaned object garbage collector? Suppose an APP
>>> crashes, how would memory be recovered by the OS?
>>>
>>
>> It is possible to store fp numbers in standard format and hide all
>> other data formats (fixnum, pointers) including some type bits in NaNs.
>
> Yeah, that is the classic NaN-boxing trick; I have used it myself in a
> situation where it halved the memory use of a large (memory-limited)
> application.
>
> The main considerations are usually things like:
>
> - How many bits do we need for tags vs data vs actual NaN values?
>
> - Can we zero/sign-extend all pointers from a ~50-bit boxed payload
> value, or do we need to use a custom allocator where the 50-bit pointer
> values are really indices into a single big heap array?
>
> When you really need it, and can live with the limitations, then it can
> be a huge win.
>


NaN boxing can be a win if:
Double / Flonum is the dominant (or only) number type;
Bit-twiddling Double values is expensive;
...

If the code is primarily dominated by fixnums and obj-refs, and there is
no significant penalty to bit-twiddling the FP values, ..., then bit
tagging wins.

Similarly, the loss of a few bits off the low end of a double tends to
be pretty much invisible in most cases, ...

Stefan Monnier

May 13, 2021, 1:13:11 PM
>> That's often used in dynamically typed languages, yes.
>> Less so in statically typed languages.
> Pretty much.
> It also makes sense in hybrid languages (ActionScript3 having been an
> example), where one has both static and dynamically typed features in the
> same language. My BGBScript languages generally fell into this category
> as well.

I only meant to describe a correlation, not a hard rule, indeed.
It's not uncommon for 100% statically typed language implementations to
use "tagbits". It's actually "the rule" rather than the exception in
the world of Haskell/ML, AFAIK.

This is due to a reason similar to what you called the "everything is an
instance of a class" Kool-Aid: in a polymorphic function, a variable of
type `α` can (at run-time) sometimes contain a pointer and sometimes
a plain immediate value like an integer or a boolean.

The system can be arranged to keep track of "which is which" and provide
that info to the GC, but it's easier to use a single tagbit, and the
impact on performance is usually considered minor. The more significant
impacts are:
- plain integers end up being only 31 or 63 bits, thus making it more
  difficult to interact with libraries written in other languages where
  integers have 32 or 64 bits.
- floats need to be boxed, which can have a very significant impact on
performance. This is usually solved by providing ad-hoc types for
"vectors of floats" and by having the compiler do extra work to try
and avoid boxing floats when it's not indispensable (when (potential)
polymorphism doesn't get in the way), which is usually done by local
unboxing optimizations whose effectiveness depends heavily on inlining.


Stefan

Stefan Monnier

May 13, 2021, 1:37:28 PM
> Similarly, the loss of a few bits off the low end of a double tends to be
> pretty much invisible in most cases, ...

FWIW, I don't know of any language implementation which uses the
low-bits of floats for tags (and hence gives up on IEEE-floats
semantics).

It's definitely not an insane idea in theory, but "non-IEEE floats" are
really not popular, it seems (and the fact that truncating the last few
bits means basically "double rounding" doesn't help, of course).


Stefan

BGB

May 13, 2021, 2:34:06 PM
On 5/13/2021 12:37 PM, Stefan Monnier wrote:
>> Similarly, the loss of a few bits off the low end of a double tends to be
>> pretty much invisible in most cases, ...
>
> FWIW, I don't know of any language implementation which uses the
> low-bits of floats for tags (and hence gives up on IEEE-floats
> semantics).
>

The specifics of IEEE semantics matter insofar as they have a visible
effect on the results. If the results of the deviation are mostly
invisible, it is "generally acceptable".

More so, if one is casting a 'double' to 'object' or 'variant', then
a small loss of precision (vs leaving it in 'double' variables) may be
considered an acceptable tradeoff.

For a scripting language, it probably falls pretty solidly into "doesn't
matter" territory.


> It's definitely not an insane idea in theory, but "non-IEEE floats" are
> really not popular, it seems (and the fact that truncating the last few
> bits means basically "double rounding" doesn't help, of course).
>

Tagbits for flonums were semi-common AFAIK (at least more so than putting
everything else in a NaN).

Though, IIRC, VMs like SpiderMonkey and similar had used NaN boxing.


But, yeah, my own languages had typically cut the low-order bits off the
bottom (sometimes with rounding applied). As can be noted, cutting 2 or
4 bits off is mostly unnoticeable in calculations. Cutting 8 or 16 bits
off starts to get a bit more noticeable though...



Meanwhile, allocating a heap object to hold a boxed double is
unreasonably expensive. A few of my early VMs did this sort of thing,
but it "isn't really worth it".

Though, this assumes an ISA without a huge penalty to bit-twiddling
Double, the usage of 64 bit pointers, ...

Boxed Doubles make a lot more sense on 32-bit x86 or similar (though,
generally if using a specialized slab allocator or similar, rather than
the main heap).


The tradeoffs between MSB tagging and LSB tagging are also likely to
depend somewhat on the ISA:
x86 tends to favor LSB tagging;
BJX2 favors MSB tagging;
...

The MSB-tagged flonum is essentially stored in a form where it is
right-shifted by 2 bits. Converting flonum to double involves a 2-bit
left shift, and converting double to flonum consists of a 2-bit right
shift and OR'ing the tag into the high-order bits.
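In C terms, that conversion pair looks roughly like the following (a sketch; the tag value follows the '10'-in-the-top-bits flonum layout mentioned earlier in the thread):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FLONUM_TAG 0x8000000000000000ull  /* '10' in bits 63..62 */

/* double -> flonum: shift right 2 and OR in the tag;
   the low 2 mantissa bits are sacrificed. */
static uint64_t box_flonum(double d) {
    uint64_t b; memcpy(&b, &d, 8);
    return (b >> 2) | FLONUM_TAG;
}

/* flonum -> double: shift left 2; the tag bits fall off the top. */
static double unbox_flonum(uint64_t v) {
    uint64_t b = v << 2;
    double d; memcpy(&d, &b, 8);
    return d;
}
```

Values whose low 2 mantissa bits are zero (like small powers of two and their sums) round-trip exactly; everything else loses those bits.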

Terje Mathisen

May 13, 2021, 2:57:47 PM
I strongly agree: "Just don't do it!"

NaN-boxing is fine, breaking all FP math much less so.

Terje Mathisen

May 13, 2021, 3:00:42 PM
I've never done that!

NaN-boxing is for when I have a huge array of (mostly) double variables,
but some of them really need to be strings or integers or generic
objects, all of which works perfectly well with storing those other
types as NaN payloads.

Stefan Monnier

May 13, 2021, 4:01:25 PM
>>> Similarly, the loss of a few bits off the low end of a double tends to be
>>> pretty much invisible in most cases, ...
>> FWIW, I don't know of any language implementation which uses the
>> low-bits of floats for tags (and hence gives up on IEEE-floats
>> semantics).
> Specifics of IEEE semantics matters as far as it tends to have a visible
> effect on the results. If the results of the deviation are mostly invisible,
> it is "generally acceptable".

I think it's only "generally acceptable" in the sense that most casual
users won't notice, but most real/serious users of floating point will
sooner or later bump into it and get *really* frustrated. And once this
becomes known, relevant users will simply turn to other
languages/implementations.

IOW it works as long as your language implementation is toy/experimental.
E.g. it would have worked for the original implementation of Javascript, but
in today's uses of Javascript it would be a non-starter.

It might work if you provide both IEEE float and "cheap/fast floats"
under separate types, but sooner or later you'll end up discovering that
there's enough pressure to make IEEE floats efficient that the
"cheap/fast floats" aren't worth the trouble any more.

>> It's definitely not an insane idea in theory, but "non-IEEE floats" are
>> really not popular, it seems (and the fact that truncating the last few
>> bits means basically "double rounding" doesn't help, of course).
> Tagbits for flonums was semi-common AFAIK (at least more so than putting
> everything else in a NaN).

As I said, I'm not familiar with any implementation doing that (of
course, I'm only familiar with a small part of the proglang world), so
I'd be curious to see some references.

E.g. as far as I know, all the Common Lisp implementations use "flonums"
represented as boxed floats (e.g. the `+` operation allocates a heap
object in which to place the result when it's a float and returns
a (tagged) pointer to it).


Stefan

BGB

May 13, 2021, 4:50:31 PM
NaN boxing has other drawbacks:
The "tag" is larger and more expensive to load/compare/... (*1);
Significantly limits the range of fixnum values (eg: 48 bits rather than
62 bits);
Doesn't leave enough bits to allow type-tagging pointers with a 48-bit
address space (so it is necessary to look at the object's header to
type-tag it, *2);
...


So, losing the low-order 2 bits off a double seems a lot less
destructive to "everything else" (when compared to "maybe if you
do a long enough string of floating-point calculations, the low-order
digits might start to diverge").

Apart from the crazy people using Python, most people presumably aren't
using dynamically typed languages for serious scientific calculations or
similar.


*1: The selection of ops which can handle an Imm16 is more limited, so
one may need to do more manual shifting and masking when going this route.


*2: E.g., you have to look at the object header to determine if it is a
ClassObj vs an Interface, ..., rather than being able to do type-based
dispatch using some of the high-order tag bits as an index into a
dispatch table or similar.

By extension, one may also have to handle more of these cases with
decision trees rather than dispatch-tables.

...


Say, for example if one does binary type-dispatch sort of like:
Take the two operand references;
Do a Morton shuffle of the high-order bits
(say, one has a handy Morton-shuffle op);
Do a computed branch into dispatch-table using high-order bits;
Do operation for types present and be done with it.


In BJX2 ASM, this might look something like:
MOVHHD R4, R5, R6
PMORTQ R6, R6
SHLDQ R6, -60, R7 //Get interleaved tag bits
BRA2B R7 //Table Dispatch
BRA .RefRef //0, Both Ref
BRA .RefInt //1, Int, Ref
BRA .IntRef //2, Ref, Int
BRA .IntInt //3, Both Int
BRA .RefFlo //4
BRA .RefMsc //5
BRA .IntFlo //6
BRA .IntMsc //7
BRA .FloRef //8
BRA .FloInt //9
BRA .MscRef //A
BRA .MscInt //B
BRA .FloFlo //C, Both Flonum
BRA .FloMsc //D
BRA .MscFlo //E
BRA .MscMsc //F, Both Misc
...
.IntInt:
LDQHI 0x100, R6
NOT R6, R7 | ADD R4, R5, R2
OR R6, R2
AND R7, R2
RTS
...
.FloFlo:
SHLDQ R4, 2, R16 | SHLDQ R5, 2, R17
LDQHI 0x200, R6 | FADD R16, R17, R3
SHLDQ R3, -2, R2
OR R6, R2
RTS
...

Or, if an op were added to set the 2 high bits directly:
.IntInt:
ADD R4, R5, R2
LDQHI2 1, R2
RTS


Many other operations are subject to similar, and similar may also be
applied to reference-based types (albeit not usually for binary dispatch
in this case).

BGB

May 13, 2021, 7:34:57 PM
On 5/13/2021 3:01 PM, Stefan Monnier wrote:
>>>> Similarly, the loss of a few bits off the low end of a double tends to be
>>>> pretty much invisible in most cases, ...
>>> FWIW, I don't know of any language implementation which uses the
>>> low-bits of floats for tags (and hence gives up on IEEE-floats
>>> semantics).
>> Specifics of IEEE semantics matters as far as it tends to have a visible
>> effect on the results. If the results of the deviation are mostly invisible,
>> it is "generally acceptable".
>
> I think it's only "generically acceptable" in the sense that most casual
> users won't notice, but most real/serious users of floating point will
> sooner or later bump into and get *really* frustrated. And once this
> becomes known, relevant users will simply turn to other
> languages/implementations.
>
> IOW it works as long as your language implementation is toy/experimental.
> E.g. It would have worked for the original implementation of Javascript, but
> in today's uses of Javascript it would be a non-starter.
>
> It might work if you provide both IEEE float and "cheap/fast floats"
> under separate types, but sooner or later you'll end up discovering that
> there's enough pressure to make IEEE floats efficient that the
> "cheap/fast floats" aren't worth the trouble any more.
>

For most of the stuff I am doing, it seems to work without issue.


>>> It's definitely not an insane idea in theory, but "non-IEEE floats" are
>>> really not popular, it seems (and the fact that truncating the last few
>>> bits means basically "double rounding" doesn't help, of course).
>> Tagbits for flonums was semi-common AFAIK (at least more so than putting
>> everything else in a NaN).
>
> As I said, I'm not familiar with any implementation doing that (of
> course, I'm only familiar with a small part of the proglang world), so
> I'd be curious to see some references.
>
> E.g. as far as know all the Common-Lisp implementation use "flonums"
> represented as boxed floats (e.g. the `+` operation allocates a heap
> object in which to place the result when it's a float and return
> a (tagged) pointer to it).
>

It is hard to skim around and find much of an overarching view on these
matters; people don't seem that inclined to list and describe this stuff.


From what I can gather:
SpiderMonkey: Uses NaN tagging
Chrome V8: Uses heap boxing
CPython: Pointer to heap object (no tagrefs)
Erlang: Apparently also uses heap boxing
...

There was a Scheme VM that used single-precision floats inside a 64-bit
tagged value.

Also saw a Scheme VM which used a variant of NaN tagging where the
value was XORed with the NaN bit-pattern, such that the pointer case
looked like a bare pointer.

...



This leaves BGBCC (and the BJX2 port of my BS2 language) as possibly an
outlier, in that it uses a right-shifted value which discards the
low-order 2 bits.

Note that the x86-64 BGBScript2 VM had used a different scheme:
Middle-range Double values were represented at full precision;
Values with large/small exponents were shifted right 4 bits.


For the BJX2 port, a 2-bit right shift was simpler and faster.

Also, the BJX2 core is pretty slow (50MHz), so I don't necessarily want
to do type-tagging on it in a way that would be so slow as to make
dynamically typed operations basically unusable.



As noted, older VMs of mine (including early forms of the original
BGBScript VM) had generally used heap boxing. However, I had moved away
from it for performance reasons.

Some versions had also used a "float24" format which consisted of a
single-precision float with the low 8 bits cut off and shoved into a
pointer. The *suck* with this one was pretty obvious though, so it was
fairly short-lived...



As noted, during the later evolution of this VM, it had moved over to a
primarily statically-typed / type-inferred core (loosely comparable to
the Java VM). The original BGBScript2 VM was also in a similar category
(using an explicitly-typed stack-oriented bytecode).

Some amount of the VM design in both VMs was influenced by the JVM, .NET
CLR, and also the IA-64 C++ ABI spec.


In the BS VMs, pointers remained tagged by default, whereas most static
(non-variant) values were stored in a non-tagged form. Generally, they
were built on a JVM-like type model (ILFDAX/ILDAX,
Int/Long/(Float)/Double/Address/X128).

Most other types were treated as subtypes:
Int: char, short, int, ...
Long: long, long long, ...
Float: Float, Half
Double: Double
Address: pointers, object, variant, ...
X128: Int128, Float128, Vec4F, ...

Variant types remained as a special case, either as an explicit request
(BS2 and C + extensions), or if type-inference failed and if no explicit
type was given.



As can be noted, my BJX2 emulator is also roughly similar technology to
what was used in the later BGBScript VMs.

And also, BGBCC (my C compiler) was originally itself a fork off of one
of the earlier BGBScript VMs, and it sort of developed in its own way
along a similar path.

As noted, it can also function as a BGBScript2 compiler (both C and BS2
share the same backend codegen and similar on this target).


If I did implement BS1L though, it would partly be a regression in some
ways to a much earlier form of the language. Though, as noted, it still
needs to be fast and lightweight enough to hopefully be usable on
something running at 50MHz with a relatively modest amount of RAM.

It is assumed that BS2 could partly serve as a bridge between C
and BS1L code (since BS1L would be built on a subset of BS2's typesystem).


Also, some cases are still coming up where some form of scripting
language could be useful (cases which don't fit ideally with static
compilation), but which also can't really afford a big/complicated VM
on this target.

...