Bronze Age Lisp design part 1 - value representation and marks for cycle detection

115 views

Skip to first unread message

Oto Havle

unread,

Aug 18, 2013, 12:55:10 PM8/18/13

to kl...@googlegroups.com

Hi all,

as announced, I'll start a series of posts about design choices
in Bronze Age Lisp.

Kernel Language (like Scheme and many other dynamic languages) has
a notion of a basic piece of data called "value". For
a language runtime, memory representation of each value
must be defined. Because of dynamic typing, the representation
also contains type information. For all intereseted, I recommend this paper

David Gudeman: Representing Type Information in Dynamically Typed Languages

as a detailed reference on the topic.

In Klisp:

a value is represented by a 64-bit word split in a 32-bit tag and
32-bit payload. The tag defines the type (klisp distinguishes about 40
types on this level). For small types, like fixints (small integers),
characters and booleans, the value is stored directly in the payload
component. For large types (like strings or closures), the payload
contains a pointer to a heap-allocated block, where the actual value is
stored.

There is only one heap. Heap-allocated blocks have 16 byte header. The
header also contains a "source code info", which refers to the file the
object was (read) or (load ...)'ed from. Most objects also contain a
mark field which help the interpreter traversing cyclic structures.

In Bronze Age Lisp:

the representation is more complicated. Moreover, there are two heaps.
One for structured values like cons cells and closures (the "lisp heap")
and one for strings, keywords, symbols and bytevectors (the "blob
heap").

A value is represented by 32-bit word. The two least significant
bits determine the encoding of the remaining 30-bits

00) If the two LS bits are zero, the value is a pointer
to a block allocated in the lisp heap. The first 32-bit
word of the block is a header word. The header word
stores type tag and length of the block.

01) If the two LS bits are 01, the value is a fixint.
The remaining 30 bits contain two's complement
representation of the number.

Representation of fixints is nothing special,
similar pattern can be found in Chicken Scheme or even OCaml.

10) If the two LS bits are 10, the word is further split
in 6 bit tag and 24-bit payload. The 6-bit tag
determines the type, like boolean, char, #inert,
but also string or symbol. The payload contains the
actual value (e.g. unicode code point), or index in
the blob heap.

11) If the two LS are 11, the value is a pair. The rest
of the word contains 29 bits of a pointer to
the lisp heap, and the most significant bit flags
immutability. Pairs do not have header words.

Encoding of pairs is a bit contrived, but it is possible
to discard the immutability flag and extract car or cdr
of a pair with single MOV instruction.

In contrast to klisp, Bronze Age Lisp does NOT have source
information or mark fields.

Source code information is not maintained. For error reporting, the
interpreter simply searches the ground environment to find name of an
object. If the object is not there, bad luck :(

The mark fields are not stored permanently in the objects, but they can
be temporarily allocated in the unused halfspace of the copying garbage
collector. A mapping is defined between object address in the "live"
hafspace and address of a mark word in the unused halfspace.

This scheme allows to mark objects stored in read-only memory. The
built-in combiners and interpreted code is stored in binary form in a
read-only segment in the interpreter executable. At startup, the
read-only data is loaded by the operating system witout explicit
initialization. Moreover, even if the builtins take lot of memory, they
do not put pressure on the garbage collector.

This marking scheme has a drawback. The halfspace is full of garbage,
and it must be zeroed out before the temporary mark fields can be used.
This slows downs not only structural equality predicate. It also slows
down the mandatory checking of formal parameter trees (this is very
common, as it comes with every ($let ...) form). Therefore, each of the
algorithms comes in two variants. The general, but slow one uses the
marks, and serves as a fallback to a quick, but unreliable, algorithm.

For equality, the quick algorithm follows pointers naively until
equality is confirmed or a threshold is reached.

For ptree checking, the quick algorithm uses very small hash
table (two 32-bit registers, actually). Hash collision
cannot be distinguished from sharing (which is not an error),
occurence of a cycle, or repeated symbol in the parameter tree.
If a collision occurs, the general algorithm is invoked.

Marks are not needed for (get-list-metrics ..). I use
the Brent algorithm instead.

http://en.wikipedia.org/wiki/Cycle_detection#Brent.27s_algorithm

To be continued...

Oto Havle.

Reply all

Reply to author

Forward

0 new messages