Design for a persistent memory image for a simple language
This document will attempt to describe the data representation for the
types in a minimal Lisp-like language, where all the program data --
indeed the entire state of the interpreter -- can be saved to disk and
reloaded. The central idea is to define the in-memory representations
to incorporate the serialization/deserialization automatically --
baked in, so to speak.
The types are fixnum, symbol, list, string, subr, subr2, fsubr, fsubr2.
Where a symbol is a short handle that associates to an interned string,
subr and subr2 are C functions that accept one or two arguments,
and fsubr and fsubr2 are C functions that accept one or two unevaluated
lists as arguments.
Using the terminology from David Gudeman,
``Representing Type Information in Dynamically Typed Languages'',
the scheme chosen is a /staged representation/ with the first stage
tagging the low 2 bits of a 32 bit cell giving a spread of 4 types:
CONS address
symbol index
OBJ address
NUM number
where CONS cells contain an address in the remaining bits which
address is the cell number of the pair of car and cdr parts stored
sequentially, NUM contains a 30 bit signed value, symbol contains an
index into the symbol list, and OBJ is the escape valve to the second
stage. An OBJ cell contains an address locating the remaining parts
of its representation.
At the second stage, an OBJ has a full cell for an extended tag
and another cell as its payload, this covers the remaining types:
STRING address
SUBR index
SUBR2 index
FSUBR index
FSUBR2 index
where a STRING contains an address of zero terminated string of bytes
(padded to cell alignment), and where all the function types hold an
index into the C function table.
These entities exist in a memory space of 32bit int-sized cells with a
simple linear allocator (new_address = global.next++) where all the
addresses described above are the cell number. So accessing a real
object in memory from this space requires a formula like:
global.mem[ address ]
or the similar formula to yield a C pointer to a cell:
( global.mem + address )
Saving/Loading the memory image.
Since all objects exist in a convenient binary space, the whole
state of the interpeter can be saved and re-loaded by storing
the contents of the memory as bytes, along with a few sizes and
important addresses.
saved memory descriptor
cells used <=> global.next
cells malloced <=> global.mem
address of env list <=> global.env
address of symbol list <=> global.syms
One remaining issue (which I had hoped to solve and then brag about,
delaying further work on this writeup) is coordinating or synchronizing
the saved memory image with the version of the C code which will
work with it.
I think I need some kind of CRC or hash upon the suite of function
names that are loaded into the environment. So that the entries in the
C function table will match the intended meanings of the antecedent
indices that exist in the memory space.
Or there may be some (overly) clever way of using git's own hash code
that identifies the current commit of the codebase. But that may be
too restrictive.
The above has been implemented in my tiny Lisp interpreter in golfed C
as of the current commit.
https://github.com/luser-dr00g/sexp.c/tree/597d2ca7ad4f66e3f1f836b0325c21d2f3611f55
https://github.com/luser-dr00g/sexp.c/archive/597d2ca7ad4f66e3f1f836b0325c21d2f3611f55.zip
I've also written a separate program to analyze the dump file.
It uses /bin/od to interpret the bytes as little-endian words
and feeds that into an awk program which builds up an array of
locations and values which can be traversed to show the layout
of the environment data in memory. This effort feels like it
demonstrates some degree of portability or internal consistency,
as well as offering a separate debugging tool for use with the
interpreter.
[I tried posting through gnus but got a 441 error I need to investigate.]