----
SXIL will be a S-Expression based IL.
The IL will contain a functioning interpreter, as well as being able to
represent code to be compiled. It may also provide macro facilities for
generalized code rewriting.
Likely, there will be no clear distinction between compile-time and runtime
within the SXIL interpreter, meaning that a macro may be essentially similar
to a function only that it rewrites code rather than executes logic.
Both macros and functions may be used within code to be compiled, where
macros will be expected to transform code fragments, and functions will be
expected to provide compile-time logic (such as implementing compiler
constructs).
As such, there will be no real distinction between macro-expansion time and
script runtime, and bindings available to ordinary code will also be
available within macros.
Note that SXIL will represent the scripts running in the compiler, but not
necessarily the semantics of the code to be compiled. The purpose of SXIL
will then be to produce output which can be compiled, and which may be
specialized to a specific target arch (although the SXIL itself should
remain target neutral if possible).
Scope:
There will be 2 types of scope: lexical and dynamic.
Lexical scope is the default scoping model, and will be captured by
closures. As such, whenever a closure is applied, it will have the same
bindings available as when it was created.
Dynamic scope will be based on control flow, and so the most recent binding
(in terms of control flow) will be that which is used.
Lexical bindings will take precedence over dynamic bindings (for now, a
lexical binding will take precedence over a dynamic binding regardless of
which was defined more recently, although ideally taking declaration recency
into account would be better).
A single namespace will be used for both variables and functions.
Modules:
A module system will be present allowing one to import bindings from other
SXIL modules. All functions, variables, and macros will be imported as such,
but this is not recursive (if B imports A, and C imports B, C may see the
bindings from B but not A).
As another restriction, it is only possible to import modules which have
already been created.
Module names will be given as strings.
(module <modulename>)
Declare the current module name.
(import <modulename>)
Import a given module, making its bindings visible locally.
Special Forms:
(if <test> <then> [<else>])
Evaluate test, and if true then evaluate then, otherwise (if present)
evaluate else. The return value is that of the evaluated expression.
Here, false is defined as one of the following: (), 0, or #f. Any other
values are true.
(begin <expr*>)
Evaluate a group of expressions, returning the value of the final
expression.
(and <expr*>)
Evaluate a group of expressions, terminating if an expression evaluates to
false returning this value, or returning the value of the final expression.
(or <expr*>)
Evaluate a group of expressions, terminating if an expression evaluates to
true returning this value, or returning the value of the final expression.
(cond <cond-clause*> [<else-clause>])
cond-clause: (<test> <expr*>)
else-clause: (else <expr*>)
Will evaluate each clause, where for each clause it will evaluate test, and
if true, evaluate the expressions and return the result of the last
expression (as the result of the cond expression). If none of the tests are
true, then the expressions associated with the else clause are evaluated and
the final result returned.
(case <key> <case-clause*> [<else-clause>])
case-clause: ((<datum*>) <expr*>)
Evaluate key, the value of which is then matched against each datum in each
clause. If there is a match, then the expressions are evaluated and the
final result returned. If no matches, then the else clause is evaluated.
(set! <var> <val>)
Assign a variable with a value.
(define <var> <val>)
Define a variable in the lexical scope.
(define <name-args> <body*>)
Define a function in the lexical scope.
name-args: (<func> <arg*> [ . <rest>])
(defvar <var> <val>)
(defvar <name-args> <body*>)
Define a variable or function in the dynamic scope (note that in the case of
a function, bindings will still be captured from the lexical scope).
(lambda <args> <body*>)
(nlambda <args> <body*>)
Create a closure
args: <name-name> | (<arg*>) | (<arg*> . <rest-name>)
arg: <name> | (<arg*>) | (<arg*> . <rest-name>)
Note that since each arg may be similarly constructed (allowing nested
destructuring of list arguments).
The difference between lambda and nlambda will be in terms of argument
evaluation, where the arguments to lambda will be evaluated prior to the
call, and so each argument is the result of evaluating the expression. In
the case of nlambda, the input expression will itself be passed as the
argument.
(defmacro <name> <args> <body*>)
Define a macro.
A macro is used sort of like a function, but is called with the arguments
unevaluated, and the return value is assumed to be a form which is then
evaluated.
(let <bindings*> <body*>)
(let* <bindings*> <body*>)
(letrec <bindings*> <body*>)
Where bindings is:
((<var> <expr>)*)
Evaluate each expression, binding it to var, and when done, evaluate body.
Between them, let will evaluate bindings in the parent scope and will
evaluate body with the bindings in place, let* will evaluate and bind in
left-to-right order, and letrec will evaluate with the bindings in-place
already.
(nlet <name> <bindings*> <body*>)
Like let, but will bind name within body to a lambda representing the
bindings and the body.
(letvar <bindings*> <body*>)
Like let, but binds in the dynamic scope rather than the lexical scope.
(quote <expr>)
Returns the expression as is (does not evaluate).
This may be represented specially in the textual form as '<expr>
(quasiquote <expr>)
Like quote but with an important difference, namely that within the body of
quasiquote several forms are special:
(unquote <expr>)
(unquote-splicing <expr>)
Where the former will evaluate expr in-place, and substitute the value in
the result, and the latter will evaluate expr and inline the value in the
result.
(unquote-magic <name>)
Will allow a shorthand for gensyms, where any such magic name is replaced by
a gensym which is used throughout the quasiquote body.
`<expr> => (quasiquote <expr>)
,<expr> => (unquote <expr>)
,@<expr> => (unquote-splicing <expr>)
,~<name> => (unquote-magic <name>)
Functions:
(eq? <obja> <objb>) | (=== <obja> <objb>)
Returns #t if obja and objb are exactly equal (have the same identity), #f
otherwise.
(eqv? <obja> <objb>) | (== <obja> <objb>)
Returns #t if obja and objb are equal (have the same value, but not
necessarily identity).
(equal? <obja> <objb>) | (= <obja> <objb>)
SXIL will regard eqv and equal to be synonyms.
(not-eq? <obja> <objb>) | (!== <obja> <objb>)
(not-eqv? <obja> <objb>) | (!= <obja> <objb>)
(not-equal? <obja> <objb>)
Will be as above, but the value is inverted.
(less? <obja> <objb>) | (< <obja> <objb>)
(greater? <obja> <objb>) | (> <obja> <objb>)
(less-equal? <obja> <objb>) | (<= <obja> <objb>)
(greater-equal? <obja> <objb>) | (>= <obja> <objb>)
Perform relative comparrisons between objects.
The objects need to be comparable, such as being numbers or strings.
(match? <pattern> <value>)
Returns #t if pattern and value match (any lists have a compatible shape,
any literals match, ...).
A name will match with anything, and a keyword will match with a name with
the same name or with the same keyword.
(bool? <value>)
Returns #t if the value is #t or #f, #f otherwise.
(cons? <value>) | (pair? <value>)
Returns #t if value is a cons cell, #f otherwise.
(null? <value>)
Returns #t if value is (), #f otherwise.
(list? <value>)
Returns #t if value is a proper list (a chain of 0 or more cons cells
terminated by ()), of #f otherwise.
(symbol? <value>)
Returns #t if value is a symbol, #f otherwise.
(keyword? <value>)
Returns #t if value is a keyword, #f otherwise.
(string? <value>)
Returns #t if value is a string, #f otherwise.
(not <value>)
Will perform the boolean inverse of the value.
(cons <vala> <valb>)
Creates a new cons cell of vala and valb.
(car <cons>)
Returns the first item in a cons cell.
(cdr <cons>)
Returns the second item of the cons cell.
(caar <cons>) | (cadr <cons>) | (cdar <cons>) | (cddr <cons>) | ...
Represent combinations of car and cdr, where:
(cadr a) => (car (cdr a)); (cdaddr a) => (cdr (car (cdr (cdr a)))); ...
These go to 4 levels.
(list <value*>)
Creates and returns a list containing the values.
(length <list>)
Returns the number of items in a list.
(append <list*>)
Append lists together and return the result.
(reverse <list>)
Return a list with the items in reverse order.
(append! <list*>)
Destructively append lists together and return the result.
(reverse! <list>)
Destructively reverse the list and return the new list.
(symbol->keyword <symbol>)
(symbol->string <symbol>)
(keyword->symbol <keyword>)
(keyword->string <keyword>)
(string->symbol <string>)
(string->keyword <string>)
Convert between symbols, keywords, and strings.
(eval <expr> [<env>])
Evaluate an expression and return the result.
(+ <value> <value*>)
(- <value>) | (- <value> <value*>)
(* <value> <value*>)
(/ <value> <value*>)
Perform arithmetic operations on arguments.
(gensym)
Generate a unique name.
Compiler Features
Type Notation
Type signatures will be given as string literals.
Each type is intended to be parsable fairly easily, and where multiple types
may be present in the same string (them being placed end to end). However,
only the types in-sequence are specified, and this does not necessarily
specify how these will be placed in memory.
Qualifiers:
P*, pointer to type
R*, reference to type (invisible pointer)
C*, complex type (f,d,e,g,k)
G*, imaginary type (f,d,e,g,k)
U*, extended type qualifier ('<name>;')
X*, compound type (struct or union), followed by '<name>;'.
T*, will specify a tagged type (deprecated).
W*, Wide pointer type
L*, reference to object type ('<classname>;')
Q*, dynamic array of type
A*, B*, used for basic extension types
Basic Types:
a, signed char (8 bit)
b, bool (8 bit)
c, char (8 bit)
d, double (64 bit)
e, long double (80 bit data, 128 bit storage)
f, float (32 bit)
g, 128-bit float (128 bit)
h, unsigned char/byte (8 bit)
i, signed int (32 bit)
j, unsigned int (32 bit)
k, hfloat (16-bit float) (16 bit)
l, signed long (32|64 bit, 64 bit external)
m, unsigned long (32|64 bit, 64 bit external)
n, signed 128-bit int (128 bit)
o, unsigned 128-bit int (128 bit)
p, (reserved)
q, (reserved)
r, variant (reference)
s, signed short (16 bit)
t, unsigned short (16 bit)
u, custom type (?)
v, void (N/A)
w, wchar/short (16 bit)
x, long long (64 bit)
y, unsigned long long (64 bit)
z, ... (varargs, ...) (N/A)
Extended Types:
m64, 64-bit raw SSE vector
m128, 128-bit raw SSE vector
quat, quaternion
hquat, hyperbolic quaternion
vec2, 2-elem geometric vector
vec3, 3-elem geometric vector
vec4, 4-elem geometric vector
mat2, 2x2 matrix
mat3, 3x3 matrix
mat4, 4x4 matrix
v2f, 2-elem raw vector
v3f, 3-elem raw vector
v4f, 4-elem raw vector
v2d, 2-elem raw vector (double)
If the type is followed by a number, it indicates that this is an array,
with a comma allowing multidimensional arrays.
Example:
bar:PXfoo;4,4
struct foo *bar[4][4];
Example:
Uvec4;16
Array of 16 4-vectors.
These arrays will specify memory with the specified physical layout.
Object References
'L<classname>;'
These will refer to a specific object type, where classname will be the
qualified class-name for the object in question.
This will differ from 'PX<name>;' in that 'X<name>;' will refer to a
specific and known structural type (the physical layout is thus known).
Likewise, I says nothing about the pointer itself, or the nature of the
pointed-to memory.
On the other hand, 'r' specifies that a dynamically-typed reference is
given, and thus the nature of the pointer and the kind of memory it can
point to, however, 'r' does not specify anything about 'what' is referenced.
'L<classname>;' will then specify that this is a dynamically typed reference
(similar to 'r'), but will also specify that it is a reference to a specific
abstract type (for example, the class with the given classname, or a class
derived from this class), but will not specify the physical layout or
concrete type of the referenced memory (unlike 'X' or 'PX'). In many cases,
'L' may be treated as simply a special case of 'r' (differing primarily in
terms of assignment, where the implementation is strongly encouraged to
ensure that the object being referenced is of the appropriate class, as
otherwise things may be allowed to break).
The exact structure and meaning of 'classname' will be internal to the
object system, but the current idea is that it will represent a heirarchy of
the form '[<name>/]*<class>'. For example, "myApp/custom/Foo".
Dynamic Arrays
'Q' will be similar to 'P' in spirit, but differ in practice similarly to
how 'L' differs from 'PX'. 'Q' will specify that a reference is used to a
dynamically-managed array holding members with the given type, but will
specify neither the physical layout of the array nor of the referenced
values.
Presently this will be limited to reference based types, such as 'Q', 'L',
and 'r'.
For example: 'QQr' will be an array of arrays of references, and
"QLmyApp/custom/Foo;" will be an array of objects.
Functions and Methods
Signature strings as applied to functions and methods will have a slightly
modified notation, namely in that a specific designation of args and return
type may be given.
The basic layout will be:
'(<args>)<ret>'.
Examples:
"int foo();" gives "()i";
"double bar(int x, int y);" gives "(ii)d".
This whole unit will be treated as a single type-unit, so, for example, a
function-pointer could be specified like this:
"P(d)i" for the type "int (*)(double)".
The exact meaning or interpretation of this type will depend on the context
of its usage.
General Declarations
(class <modifiers> <classname> <super> <class-body*>)
super: (<superclass> <interface*>)
class-body:
(field <name> <type> [<modifiers>])
(method (<name> <type> [<modifiers>]) <cfargs> [<vars> (<body*>)])
modifiers: (<modifier*>)
Where each modifier is a name, such as static, virtual, ...
cfargs: (<cfarg*> [...])
cfarg: (<name> <type> [<modifiers>])
(interface <modifiers> <if-name> <if-super> <class-body*>)
if-super: (<interface*>)
(namespace <name> <namespace-body*>)
namespace-body: class | interface
(prototype (<name> <type> [<modifiers>]) <cfargs>)
(function (<name> <type> [<modifiers>]) <cfargs> <vars> (<body*>))
Example:
(function (main "i") ((argc "i") (argv "PPc")) ()
( (funcall printf "test...\n")
(return 0)
))
IL Ops:
These send intermediate opcodes to the code generator. These are not really
intended to be directly used by frontend scripts, but may be a large portion
of the output from translating built-in forms (such as funcall, ...).
The internal IL will make use of an abstract stack, and many operations will
serve to work in terms of this stack.
(il-op <ilopname>)
Send a single opcode (generic).
(il-value <value>)
Sends a literal value (pushed onto IL stack).
(il-label <name>)
(il-load <name>)
(il-store <name>)
(il-goto <name>)
(il-goto_true <name>)
(il-goto_false <name>)
...
Example:
...
(il-label GaB59hA)
(il-load i)
(il-goto_false sQ72Fy7)
...
(il-goto GaB59hA)
(il-label sQ72Fy7)
...
I haven't tried to understand all of it (a lot of this stuff is over my
head), but:
* It's quite high level for an intermediate language. (My intermediate
languages tend to be at the same sort of level as bytecode, and are
invariably 'flat', with regard to control structure, and scoping). Although
the set of il-ops near the end seem more typical of what I expect an IL to
do.
* It looks like Lisp! (Although I've no idea how close it actually is). Why
not just use Lisp?
* It looks like it could be used as a language in it's own right; is this
the case?
--
Bartc
yeah, this is actually part of the design.
I have raised the level of abstraction fairly high as to give the compiler
machinery a good amount of room to move-about and do its business (as such,
its input will be at a similar level of abstraction to the input language,
although the IL will hopefully be relatively language-neutral).
it is actually a good deal higher level than my prior IL (which had a level
of abstraction similar to that of a textual form of .NET bytecodes). as
such, this IL will absorb a lot of the functionality usually done in the
upper compiler as well.
the il-ops were more a reference to my prior IL, where this IL will
essentially wrap the machinery from my prior IL, but may also begin to
absorb many of the higher-level constructs from this, hopefully so that its
level of abstraction can be lowered again.
> * It looks like Lisp! (Although I've no idea how close it actually is).
> Why not just use Lisp?
>
Scheme and Lisp were notable design influences.
however, this applies more to the in-IL language, rather than the semantics
of the code to be compiled (which will follow a similar structure, but will
represent different constructs and use static typing rather than dynamic
typing).
note that, when I was younger, one of my often-used programming languages
was Scheme, and as such I had my own implementation (as such, I am fammiliar
with many of the strengths and weaknesses of this style of language, and
although IMO not well suited to general-purpose programming, they are likely
to be fairly good for implementing and working on compiler internals, or at
least better than C for this task...).
the reason for not using Lisp as such, is that I don't control the
implementation of most Lisp's, and also that most serious Lisp
implementations aren't exactly really designed to compile somewhat non-Lisp
languages (such as C, C++, C#, Java, ...).
would need a Lisp with: full JIT, strong FFI support, usable as a library,
... and even though C# and Java could probably work effectively on a
Lisp-style core, C and C++ would probably not (given primarily that they
like using pointers for everything, and depend somewhat on how the
underlying memory is organized, both of which would be problematic...).
whereas, even for its apparently lispiness, the main point of this language
will actually be mostly to handle C# and Java...
there is my prior IL, which I call RPNIL (and which was more influenced by
PostScript), which I am thinking I can free from this task and for now leave
it in charge of C, and maybe later C++, since these languages operate
differently (and this is more what RPNIL was originally designed for, it
being a good deal of awkwardness trying to extend it to handle C# and
friends...).
of course, sadly I did fork the compiler machinery between RPNIL and SXIL.
> * It looks like it could be used as a language in it's own right; is this
> the case?
>
yes, although it is being fairly specialized, and the interpreter is not
designed for high-performance operation.
it is based in large part on top of my existing compiler machinery, although
the interpreter portion is new.
the reason for this is that it will allow a good amount of fairly high-level
transformations to be embedded in the compilation process and also be
scriptable (presumably largely at the discretion of the upper compiler).
which may help reduce the amount of ugly pattern-recognition and
transformation code being done in the C portions of the compiler, and also a
lot of the high-level abstractions I had been adding to the middle-compiler
(in particular: OO related stuff, exception handling, ...).
for example, as is, things like static conditional checking and loop
unrolling are done in the upper compiler, but here this responsibility could
be moved to the middle compiler. likewise, it could allow other things
currently problematic in my compiler...
this is even if the overall design is... unusual...
yeah, even I am having some doubts about it...
> --
> Bartc
>
>
Well, for Java I could see a pointerless approach work, but for C#? It's
possible* (see below) to write "nearly C++" in C#, that is, use pointers for
everything and depend on the layout of structs.
Only you'd have GC, interfaces instead of virtual inheritance, and some
other things.
I know you didn't actually say pointless, but you raised concern about the
pointeres in C and C++, so I'm raising it for C#.
* not many C# programmers throw pointers around like rice on a wedding, but
it's possible to do so. Especially interop code, GDI code and weird hacks
depend on it. Even BitConverter uses pointers to type-pun a double to a
long.
for a proper Lisp, it is an issue...
however, my IL has pointers, and in fact the most novel thing of the IL is
the interpreter, but the code being compiled is far more mundane...
> * not many C# programmers throw pointers around like rice on a wedding,
> but it's possible to do so. Especially interop code, GDI code and weird
> hacks depend on it. Even BitConverter uses pointers to type-pun a double
> to a long.
yes, well, I wasn't really saying that one would be making use of 'unsafe'
on a Lisp style core (very possibly, one could get by making the compiler
complain and refuse to work, while still being standards conformant, since
an implementation is not "required" to accept unsafe code as such...).
but, in my case, there is no problem, since I am not using a Lisp-style
core, rather a C-style core with a bunch of extra features tacked on (such
as GC, exceptions, ...).
the big hairy issue though is just how to handle exception handling...
I started implementing an idea, but it has an silly issue:
the mechanism is structured in such a way as that the try block has to
finish successfully at least once before the handler can be properly
registered.
this means that if the first time a piece of code is run, it throws an
exception, then the handler will fail to catch it...
this issue is stupid (and would likely break many simple introductions to
exception-handling), but shouldn't really be too much of a problem in
practice...
a compiler could technically implement a hack to fix this, namely that it
could "prime" the exception handler by first jumping over the try block
(thus reaching the end of the try block), and then jumping back to before
the try block, and not jumping over the try block or jumping back this time.
in C:
static int _primed=0;
void *p;
...
L0:
p=dyllBeginTry();
if(!_primed)goto L1;
try...
L1:
p=dyllEndTry(p);
if(!_primed) { _primed=1; goto L0; }
if(p)
{
catch magic...
}
...
however, due to the limited nature of preprocessor macros, this issue
couldn't be glossed over in C... (or at least not restricting it to there
being only a single exception handler per function).
(actually, a compiler need not have this issue to begin with, as it can know
up-front where the beginning and end of the try block is, ...).
an alternative considered option is to always jump back to the start, but
this is less desirable.
actually, if the C-level interface is restricted to 1 exception handler
per-function, one can actually jump to the start and have it come out the
end. this would involve creative use of macros... (but this also messes
stuff up...).
groan... this kind of thing is why I don't yet have working exception
handling...
> the big hairy issue though is just how to handle exception handling...
>
> I started implementing an idea, but it has an silly issue:
> the mechanism is structured in such a way as that the try block has to
> finish successfully at least once before the handler can be properly
> registered.
>
> [...]
>
> in C:
>
> static int _primed=0;
> void *p;
>
> ...
> L0:
> p=dyllBeginTry();
> if(!_primed)goto L1;
> try...
> L1:
> p=dyllEndTry(p);
> if(!_primed) { _primed=1; goto L0; }
> if(p)
> {
> catch magic...
> }
> ...
Two library calls in a try/catch loop that does not even throw is
expensive. For C/C++-like languages, I think many users would expect
so-called "Zero Overhead Exception Handling", where a path that does not
throw does not slow down from the addition of a try/catch block.
A (the?) way to implement that is to have the code that searches for a
exception handler match the address where the exception is thrown to the
executable, and from there to a table in the executable that maps code
ranges to catch handlers. If that handler does not handle the exception,
unwind the stack (for C++, calling destructors if necessary), find the
next return adress, look up its table of catch handlers, etc.
The cost of this is that it makes exception handling expensive. If
exceptions are the exception (pun intended), the net effect is a speed
up.
Reinder
From that, it seems priming is completely independent of whatever "try..."
does, since you skip over "try...". So, perhaps this would work:
if(!_primed)
{
p=dyllBeginTry();
p=dyllEndTry(p);
_primed=1;
}
p=dyllBeginTry();
try...
p=dyllEndTry(p);
if(p)
{
catch magic...
}
Yes? No?...
I agree with RV on the zero impact philosophy. Depending on the language
design or what users happen to code, there could be numerous uncessary
exception attempts.
I know from your posts you've both looked at many and implemented a few
IL's. Obviously, they aren't sufficient for your needs. So, what's wrong
with some of the existing IL's or IR's? Is the problem simplicity,
complexity, flexibility, inflexibility, memory usage, too low level, too
high level, supported interfaces, language concepts? What? Or, what is
going to be better in SXIL than all the other IL's if SXIL was available to
you right now?
These are just a few of numerous IL/IR choices that I pulled from Wikipedia:
TAC, SSA, C, C--, Seed7, Gimple, RTL, MS' CIL, Tendra ANDF, Slang (smalltalk
subset), S-lang (C-subset interpreter), etc.
http://wiki.squeak.org/squeak/2267
http://www.s-lang.org/
And, I know there are probably another five or so C interpreters too, as
well as choices like Lisp and FORTH. Isn't there one already in existence
that is sufficient?
Rod Pemberton
the difficulty here is how to implement this effectively as a C API...
first problem:
AFAIK, we can't have pointers to labels (not verified, but I think this is
the case).
later problems:
I have found that, although one can return to the correct position in the
code easily enough (or restore the stack to its correct position, ...),
apart from capture it doesn't seem possible to restore the caller-save
registers (esi, edi, ebx, or others on x86-64...).
originally, I had made an exception-handling scheme based on callbacks,
however the great problem is that, typically, targetting this would beyond
the capabilities of my IL's (such the try, catch, and finally blocks would
need to be located in different functions...).
likewise, it would also be a problem to have several, generally unrelated,
exception managing schemes.
actually, I am almost thinking I may have to split up this problem:
first problem, create an interface for a "nested unwinder" (basically, a
nestable version of longjmp);
second problem, figure how to best use the above for exception dispatching;
...
now, since an unwinder, by implication, requires capture and API calls, it
is not exactly super cheap, but it can be implemented via a pure C API...
I am just thinking, at least if each call can be kept under a few-hundred
clocks, might have to be good enough...
so, it is all looking ugly at this point...
> Reinder
wont work, because BeginTry and EndTry were actually used to capture the IP
ranges and possible return state. the first time they are run, they set up
the state, but later than this could become theoretically no-op...
the discovered issue though is that BeginTry would still need to lookup the
context, making it still expensive, as well as a later realized issue that I
could not restore the callee-save registers this way...
> I agree with RV on the zero impact philosophy. Depending on the language
> design or what users happen to code, there could be numerous uncessary
> exception attempts.
>
the issue though, is that one needs metadata, and the metadata needed can't
be effectively represented in plain C, which would limit its use to my
compiler...
I also want to be able to use whatever exception handling mechanism with
whatever code I compile with gcc as well...
(sadly, there seems to be no common ABI for exception handling on x86, and
infact numerous conflicting strategies in use...).
on Windows, the native provided method is SEH, but SEH is generally accessed
via compiler extensions (for example, MSVC has these). FWIW, SEH could be
wrapped in an API with mechanisms similar to those I am considering (only
that the handler would always be generic, and would then jump back into the
code using info stored in the context).
from what I can tell, gcc on Linux generally does exception handling by
using hidden API calls anyways (I have not looked into this in detail, but
information points this way). likewise, LLVM uses this same API.
I don't know what gcc does on Windows, I have not looked into this.
> I know from your posts you've both looked at many and implemented a few
> IL's. Obviously, they aren't sufficient for your needs. So, what's wrong
> with some of the existing IL's or IR's? Is the problem simplicity,
> complexity, flexibility, inflexibility, memory usage, too low level, too
> high level, supported interfaces, language concepts? What? Or, what is
> going to be better in SXIL than all the other IL's if SXIL was available
> to
> you right now?
>
> These are just a few of numerous IL/IR choices that I pulled from
> Wikipedia:
> TAC, SSA, C, C--, Seed7, Gimple, RTL, MS' CIL, Tendra ANDF, Slang
> (smalltalk
> subset), S-lang (C-subset interpreter), etc.
>
> http://wiki.squeak.org/squeak/2267
> http://www.s-lang.org/
>
> And, I know there are probably another five or so C interpreters too, as
> well as choices like Lisp and FORTH. Isn't there one already in existence
> that is sufficient?
>
mostly, I am very particular about implementation details, and using most
other ILs implies using existing implementations, and the problem then is
that most existing code is not compatible with my practices...
likewise, due to past experience, I have generally learned to be very
careful when incorporating other's code, as most other people like producing
big tangled messes of code that tend to fall apart if not handled with
extreme care (has happened a few times IME...), where I like code I can
fairly easily rip apart and reorganize...
whereas, in my case, I place a much stronger emphasis on structure and
specific API's, where the actual code is usually a secondary concern (and
often written clean, or periodically re-written, as needed...).
so, it does not matter if I gain code, if there is not a clean API for doing
so.
now, as for the IRs listed, I have reasons for not using them (although, CIL
is intended as an input format, as is JBC, although the effort needed to
support both has currently lessened their priority, and SXIL is also partly
designed to address the issues of getting JBC and CIL into a form I can
compile...).
as for SXIL, the primary reasons are this:
I will have control over the implementation;
the design is relatively little work to implement with my existing codebase
(getting much of the code beaten together was a matter of days, and most of
the rest should go fairly easily);
I personally have experience with, and some nostalgia for, Scheme;
...
I don't expect anyone else to use it as such, but hell, it is an IL, and
most ILs tend to be project-specific anyways...
now, the great problem when it comes to exceptions, is I can't think of a
good way to structure an API for this within the confines of C...
so, I am starting to think it may need to be 2-stage:
first, will be an API for nested unwinding;
second, will be an API for passing exceptions.
p=dyllBeginUnwind();
stuff...
dyllEndUnwind(p);
where p represents an unwind state.
in this approach, control flow always comes up following the BeginUnwind
call, and the returned pointer would need to be examined to to determine the
state.
p=dyllBeginUnwind();
stuff...
dyllEndUnwind(p);
and, possible handler approach:
p=dyllBeginUnwind();
if(dyllStateNormalP(p))
{
dyllBeginTry(p);
try...
dyllEndTry(p);
}else if(dyllStateCatchP(p))
{
dyllBeginCatch(p);
exception handling...
dyllEndCatch(p);
}else
{
finalizer...
dyllUnwind(p);
}
dyllEndUnwind(p);
granted, this is not a very lightweight approach, but partially separating
exceptions and unwinding could allow more flexibility (namely, different
possible exception handling mechanisms...).
>
> Rod Pemberton
>
>
Shouldn't some of that functionality be moved into the exception handler...
?
And, can the initialization be done on application startup... ?
> I also want to be able to use whatever exception handling mechanism with
> whatever code I compile with gcc as well...
>
> (sadly, there seems to be no common ABI for exception handling on x86, and
> infact numerous conflicting strategies in use...).
What model comes closest to your needs?
> likewise, due to past experience, I have generally learned to be very
> careful when incorporating other's code, as most other people like
producing
> big tangled messes of code that tend to fall apart if not handled with
> extreme care (has happened a few times IME...), where I like code I can
> fairly easily rip apart and reorganize...
I agree. But, we are limited by time...
> as for SXIL, the primary reasons are this:
> I will have control over the implementation;
That's always good.
> the design is relatively little work to implement with my existing
codebase
Very good reason, IMO.
> I personally have experience with, and some nostalgia for, Scheme;
From Wikipedia - since I've no experience with Scheme, it seems Scheme (or
Lisp) has solved the operator precedence problem. For many years, C
programmers have away from operator precedence and have moved towards using
parenthesis to override or force operation order. A combination of a few
ideas, like transforming an operator precedence language like C to
parenthesized notation like Scheme, infix notation to RPN like FORTH,
multi-operand parameters to 0-operand like FORTH, and context-free-grammar
to NPDA (an FSM), might turn out well. It seems that they are all
implementable using stacks, and some other minor logic... I also ran across
something useful on Wikipedia for string parsing: Cocke-Younger-Kasami
algorithm.
> I personally have experience with, and some nostalgia for, Scheme;
Ah, your first love always dies hard...
It took many years, and a language review, before I realized that my fond
memories about the strength of BASIC were misplaced. It was more a
representation of my strength or skill than the language's. Although, BASIC
does have a few useful language constructs for string processing which are
still useful. In fact, I implemented a left$, right$, mid$ for another
language that was lacking in string processing functionality.
Unfortunately, native language syntax for concatentation wasn't available.
For C, I usually just put up with the slightly extra work of using C's
string functions since they are more compatible with each other than with an
implementation of left$, right$, mid$.
> I don't expect anyone else to use it as such, but hell, it is an IL, and
> most ILs tend to be project-specific anyways...
What about LLVM? (Which I forgot to mention...) Isn't it supposed to be
both language and architecture neutral?
> now, the great problem when it comes to exceptions, is I can't think of a
> good way to structure an API for this within the confines of C...
I don't have a solution for you. C is just weak here. The majority of C is
implemented as single threaded code and typically uses a stack to handle
parameters and control flow information (a "call stack" as Wikipedia calls
it...). Anything that bypasses this method of implementation is going to
cause some issues. setjmp(), longjmp(), raise(), signal(), exec(),
system(), and exit() are the functions that come to mind that bypass this
normal control flow within C. FWIW, two Wikipedia entries "Exception
handling syntax" and "Comparison of programming languages (basic
instructions)" pages shows how try-throw-catch could be implemented in C
using setjmp and longjmp (and for other languages too). For C, use of
setjmp and longjmp is less than optimal since they have many side effects
like 1) requiring volatile on automatic variables in the local procedure 2)
may cause loss of modified values in local procedure upon use 3)
non-portable, but more portable than signal and raise 4) tends to be
implemented on x86 by saving all registers to the jump buffer which is slow
5) may not work in interrupt or exception handlers for pre-ANSI C.
Rod Pemberton
it could, with compiler support...
if one is using good ol' gcc as-is, they are fairly limited as to what can
be done at startup, and more often have to result to runtime initialization
of features...
if one could safely have pointers to labels, especially between functions,
then this case would be less of a problem, but I don't know of any compiler
which does this (nor of a way in which a compiler could make this safely
usable).
so, the general issue is:
for efficient exception handling, support needs to be built into the
compiler in question (or, at least, one needs some fairly powerful
control-flow constructs and/or macro facilities, which C doesn't have
either...).
>> I also want to be able to use whatever exception handling mechanism with
>> whatever code I compile with gcc as well...
>>
>> (sadly, there seems to be no common ABI for exception handling on x86,
>> and
>> infact numerous conflicting strategies in use...).
>
> What model comes closest to your needs?
>
SEH...
but, of course, natively using SEH requires compiler support.
I have since implemented a basic unwinding mechanism, for which the
internals are based on SEH (primarily written in assembler). it basically
throws a fixed exception code in my case, and generic handlers are used.
for the Linux case, it is modified some (TLS is used instead, and there is
no generic handler).
x86-64 is not handled at present...
of course, as-is, it is still based on begin/end pairs...
>> likewise, due to past experience, I have generally learned to be very
>> careful when incorporating other's code, as most other people like
> producing
>> big tangled messes of code that tend to fall apart if not handled with
>> extreme care (has happened a few times IME...), where I like code I can
>> fairly easily rip apart and reorganize...
>
> I agree. But, we are limited by time...
>
time is not as much an issue in my case, if anything, because there is not
much point in all my efforts...
>> as for SXIL, the primary reasons are this:
>> I will have control over the implementation;
>
> That's always good.
>
>> the design is relatively little work to implement with my existing
> codebase
>
> Very good reason, IMO.
>
>> I personally have experience with, and some nostalgia for, Scheme;
>
> From Wikipedia - since I've no experience with Scheme, it seems Scheme (or
> Lisp) has solved the operator precedence problem. For many years, C
> programmers have away from operator precedence and have moved towards
> using
> parenthesis to override or force operation order. A combination of a few
> ideas, like transforming an operator precedence language like C to
> parenthesized notation like Scheme, infix notation to RPN like FORTH,
> multi-operand parameters to 0-operand like FORTH, and context-free-grammar
> to NPDA (an FSM), might turn out well. It seems that they are all
> implementable using stacks, and some other minor logic... I also ran
> across
> something useful on Wikipedia for string parsing: Cocke-Younger-Kasami
> algorithm.
>
I have a full-fledged parser in my case...
but, the reason this stage is needed, is that I have found that directly
targeting RPN (and having this take a fairly direct path to the generated
machine code) is just too inflexible in the case of internal architectural
issues (it works fine for x86, but doesn't hold up as well in the x86-64
case).
so, going to S-Exps allows keeping most of the codegen machinery intact, but
leaving more freedom to reorganize things in target-specific ways, and
likewise deal with some of the more "high-level" issues.
in this way, the RPN stage can have its level of abstraction lowered,
becomming more of a straight codegen (likewise, I will no longer be required
to more or less force everything into an RPN-based model, but may be freer
to adopt a looser, more ASM-like approach).
the alternative would be making the language-specific frontends output a
target-specialized IL, which is not desirable (even with RPNIL, this was an
issue, as RPNIL has more than a few "x86-isms", such as requiring args to be
given in right-to-left order, ...).
of course, CIL and JBC will remain as RPN, but the idea in this case is that
they would be "unwound" into S-Exps, and then passed to SXIL, and
re-flattened as RPN (in the JBC-case, this will require regenerating a lot
of info which is not present in the class files, such as the types of local
variables, ...).
the reason for this is, partly, that JBC and CIL don't fit onto the model
used by the low-level codegen (differing in terms of evaluation order,
argument order, ...), and it didn't look like there was a good way to
resolve this apart from creating a new set of calling conventions, ...
>> I personally have experience with, and some nostalgia for, Scheme;
>
> Ah, your first love always dies hard...
>
> It took many years, and a language review, before I realized that my fond
> memories about the strength of BASIC were misplaced. It was more a
> representation of my strength or skill than the language's. Although,
> BASIC
> does have a few useful language constructs for string processing which are
> still useful. In fact, I implemented a left$, right$, mid$ for another
> language that was lacking in string processing functionality.
> Unfortunately, native language syntax for concatentation wasn't available.
> For C, I usually just put up with the slightly extra work of using C's
> string functions since they are more compatible with each other than with
> an
> implementation of left$, right$, mid$.
>
yeah.
I started with BASIC as well, but stopped using it long ago...
but, Scheme (and Lisp in general) have some fairly useful features.
now, as is, I have already implemented much of Scheme in C, apart from those
constructions which just don't map so well...
this is partly what makes SXIL easy to implement, since most of the basic
machinery was already there...
(the parser, typesystem, ... were already at hand).
the main thing though was that the core interpreter was written clean, as it
is a style I have not used in a long time (due to its low performance and
high garbage production). note that I could have used a bytecoded
interpreter easily enough, but this would have been at the cost of the
expressiveness of the macro system.
>> I don't expect anyone else to use it as such, but hell, it is an IL, and
>> most ILs tend to be project-specific anyways...
>
> What about LLVM? (Which I forgot to mention...) Isn't it supposed to be
> both language and architecture neutral?
>
my issues with LLVM:
it is a good deal lower-level and I would have to produce SSA to make it
work;
the way in which the "bottom-levels" of LLVM work are not really compatible
with my project (at the level of the assembler and linker), and shows a
general disregard for things I felt were fairly important (being able to
dynamically relink running code, ...), and is slow to address these
problems;
it is "teh large" and written in C++ (C would be vastly preferable);
they take the "one true supercomponent does everything" stance/mindset,
which I dislike (my approach is far more modular and decentralized);
...
now, in terms of LLVM's IL, that is decent enough, just I don't like their
implementation...
(actually, I can say the same of Mono, but my complaints about Mono's
internals are a good deal different, well, that and my inability to get it
built on Windows thus far...).
>> now, the great problem when it comes to exceptions, is I can't think of a
>> good way to structure an API for this within the confines of C...
>
> I don't have a solution for you. C is just weak here. The majority of C
> is
> implemented as single threaded code and typically uses a stack to handle
> parameters and control flow information (a "call stack" as Wikipedia calls
> it...). Anything that bypasses this method of implementation is going to
> cause some issues. setjmp(), longjmp(), raise(), signal(), exec(),
> system(), and exit() are the functions that come to mind that bypass this
> normal control flow within C. FWIW, two Wikipedia entries "Exception
> handling syntax" and "Comparison of programming languages (basic
> instructions)" pages shows how try-throw-catch could be implemented in C
> using setjmp and longjmp (and for other languages too). For C, use of
> setjmp and longjmp is less than optimal since they have many side effects
> like 1) requiring volatile on automatic variables in the local procedure
> 2)
> may cause loss of modified values in local procedure upon use 3)
> non-portable, but more portable than signal and raise 4) tends to be
> implemented on x86 by saving all registers to the jump buffer which is
> slow
> 5) may not work in interrupt or exception handlers for pre-ANSI C.
>
yep...
my hack for implementing exceptions thus far has been to build an unwind
mechanism which internally works similar to setjmp/longjmp...
namely, the current form (actually, the main machinery is located in my
linker machinery, as are a few other related parts) has a few calls:
void *BASM_StartUnwind(BASM_JmpBuf *ctx);
does something similar to a setjmp, and links the handler into SEH.
actually, more correctly, it sets it up with SEH, and tail-calls into a
custom setjmp...
when it returns NULL, we know it is a normal call (not a return).
void BASM_EndUnwind(BASM_JmpBuf *ctx);
unregisters handler...
void BASM_DoUnwind(void *val);
technically, throws an exception via SEH (currently, via the Win32 API
related calls).
presumably, control will then come back out of the last StartUnwind,
however, it will return a pointer to the associated BASM_JmpBuf rather than
the value (however, info from SEH and similar is stuffed into the context,
which can be retrieved).
a slight optimization would be to bypass the Win32 API calls when it can be
determined that there are no "proper" SEH handlers waiting (since the SEH
crap is hairy...). in this case, it would be similar to initiating a
longjmp...
the idea then is that the frontend excpetion API (likely located in DYLL,
not BASM), will wrap these calls and provide more generic exception-handling
stuff.
dynamically compiled code could either use this API, or I could implement
IP-range based handlers instead...
performance should be acceptable so long as no one decides to use exceptions
as an alternative to loops or similar (registering and unregistering should
be cheap enough, only that throwing is not so likely to be cheap...).
>
> Rod Pemberton
>
>