fklisp: how does it work?

159 views
Skip to first unread message

Andres Navarro

unread,
Feb 11, 2013, 6:37:31 PM2/11/13
to kl...@googlegroups.com
Now that the historical perspective has been explained I'll try to
explain what fklisp is and how it works.  This is going to be a pretty
long mail!

So, fklisp is both a very simple functional dialect of Kernel and an
abstract interpretation engine for that dialect.  There is no mutation
or first-class continuations in the dialect.  The idea is to have the
simplest practical language.  The final goal of this experiment is
trying to write the abstract interpretation engine in the fklisp
dialect and have it analyze itself. This will serve as a pretty good test,
the final code size of the abstract interpreter should be at about 3000
lines of fklisp code if all goes as planned and uses a variety of common
constructs and idioms, besides being a pretty complicated program.

Abstract interpretation is used to perform static analysis on a fklisp
source expression.  The idea is to explore where the different parts
of the source expressions are evaluated (in which environments) to see
if we can replace some parts of that source expressions for more
"optimized" versions.  Right now fklisp only concerns itself with
trying to replace symbols from the original expression.  It will try
to replace any symbol it can by either a primitive combiner (i.e. a
combiner in the standard fklisp environment) or by a direct access
instruction (i.e. an instruction to access the current environment
directly at certain positions, instead of looking for the binding in
the environment and its parents at runtime).

Right now the result is a textual representation of the resulting
expression with the symbols replaced, pretty printed for reading
convinience.  There is still no mechanism to run the optimized version
(mainly because there's no way in klisp to access the current
environment directly).  However it should be simple enough to start
with the klisp interpreter and simplify it to run only the fklisp
dialect and have an array representation of environments that allow
direct access.  This should make it faster in the process because it
doesn't need to consider cyclic expressions or first class
continuations (that means it can use a simple stack, for example).

The abstract interpretation engine is already good enough to analyze a
good portion of itself (I started rewriting fklisp in the fklisp
dialect to allow this, I am still not done) and I'm confident that I
will be able to have it work on the complete thing sometime in the
near future. 


Before telling you how it works please consider that all of this is
rooted in something called abstract interpretation that has a strong
theoretical base, theoretical results, and a vast ammount of
applications.  I won't get into the definition or the details of how
abstract interpretation works and why, but I'll try to give the
intuitions and details of fklisp take on abstract interpretation so
that you can understand how it works. 

fklisp has representations for all kinds of objects in the fklisp
dialect (integers, symbols, environments, applicatives, operatives,
nulls, ignores, pairs & booleans).  In addition it has a top & bottom
type (representing all objects and no object respectively).  It also
has representations for some types of partial information (like an
unknown integer, boolean, or symbol, a pair with unknown car & cdr, or
an environment with unknown bindings).  So these abstract objects
actually represent sets of concrete objects.  There are special cases
for the set of all objects (top), the empty set or no object (bottom),
the set of all integers, the set of all pairs, but also a specific
pair, or a specific boolean, etc.  All these objects can be arranged
in a finite height lattice by the inclusion relationship, with top and
bottom occuping the obvious positions.

This abstract types also track which symbols come from the original
expression and in the case of unknown values (top), it tracks which
symbols from the original expressions it has access to.  So for
example two instances of the symbol '$vau' in the original expression
are represented by different objects in the abstract interpreter
because it needs to track where each one may be evaluated.  If we were
to do more complicated analysis (for example inlining of applicatives,
or some kind of "macro expansion" of operatives), then the pairs from
the original expression would also need to be tracked.  This is not
currently done in fklisp because I'm now only concerned with symbol
replacement.

So the fklisp takes an environment and an expressions and tries to
analyze where the different subexpressions (especially symbols) of the
expression are evaluated.  To do this it first converts the original
expressions & environment by transforming all objects to one of the
types mentioned in the paragraph above.  All objects created in this
step are sets of one element exactly of course, because the source
expression and original environment is known.  Every occurence of a
symbol in the original expression is given a fresh number to identify
it.  Currently this is done with simple tagged lists, so for example
the empty list will be converted to "(#:null)" and the symbol $vau may
be converted to "(#:symbol $vau 23)" where the number is different
each time a symbol appears in the expression, iregardless of the
symbol print name.  This is important.

After converting the expression to these abstract types, abstract
interpretation begins.  It is basically an interpreter, augmented to
work with the abstract types.  Remember that abstract types are
actually sets of concrete types.  Primitive combiners are trivially
extended to handle this.  For example boolean? will return true,
(#:boolean #t), for '(#:boolean #t)' and '(#:boolean #f)' but also for
'(#:boolean (#:top ...))' (an unknown boolean).  It would return
false, '(#:boolean #f)', for '(#:null)', '(#:ignore)', and '(#:integer
(#:top))' (an unknown integer) but would return an unknown boolean:
'(#:boolean (#:top ...)' for an unknown value like '(#:top ...)'.  So
there is a primitive handler for each primitive combiner.  It takes an
abstract list of arguments and returns an abstract object.

RE: the ellipsis after '#:top', this is to remind you that unknown
values (objects of abstract type top) track additional info that is
not shown here, namely the symbols from the source expression that may
be accessible from it.

Special care is needed to track the whereabouts of symbols from the
source expression.  So for example if 'car' receives an unknown value
it naturally should return an unknown value, but that unknown value
must have info about all symbols that may be accessible from it and
because we don't know where each symbol in the original unknown value
is in the list structure we should conclude that any one of them could
be in the car and so the result of (car (#:top (sym1 ... symn)))
shoulde be also (#:top (sym1 ... symn)).  Similarly when consing two
unknown values the symbols from both values should be combined and put
into the unknown result.  This careful tracking of symbols from the
source expression in the unknown values is key to guaranteeing that
all symbol replacements suggested by klisp are valid.  If at any time,
is possible for an unknown value to be evaluated or escape the
original expression (for example if it's part of the result of the
abstract interpretation), all symbols recorded in that value are
marked as constants and so won't be eligible for replacement, even if
those symbols are observed to evaluate to a known primitive combiner
at some other time.  The reason for this is that it's possible that
the symbol is used literally (as by $quote) and so we can't safely
replace it.  The same happens when a symbol from the source expression
is used as a constant (like in a call to $binds? or a symbol used in
an argument list).

Keen observers may notice that the handler for the eval applicative
will take an abstract expression and an abstract environment and
return an abstract object representing the possible results of that
evaluation.  This is exacly the same as the abstract interpretation
engine!


Because of the nature of abstract interpretation, sometimes it's not
possible to decide which branch of an $if to take.  To solve this, the
abstract interpreter will actually take both in that case and combine
the results (find a superset of both results).  Of course this will
cause infinite looping for many recursive combiners, so a number of
mechanism must be used to avoid this happening.

To think about this, you can see each evaluation of an expression in
an environment as a node in a directed graph where there's an arc from
one node to the other if the evaluation of the first expression caused
the evaluation of the second one directly.  For example in the
expression '($if <test> <then> <else>)', the node for '($if <test>
<then> <else>)' is connected to the node for '<test>' and depending on
what info the abstract interpreter can gather from that, there may be
an additional arc from the '$if' node to either '<then>', '<else>' or
both.  It's important to note that the node contains both the
expression and the environment in which it's evaluated and the
abstract evaluation is done, also the result of said evaluation.

So for any expression and environment, the abstract interpreter will
construct this evaluation graph and store the results of each
evaluation in the same node.  When it finishes, the result in the
original node is an abtract value representing all possible objects
resulting from the expression & additional info about the program can
be extracted from the graph (e.g. one can see where each symbol in the
original expression is evaluated and what's the result of that
evaluation).

In order for the abstract interpreter to stop eventually, this graph
should be finite and so fklisp has a number of strategies for limiting
the creation of nodes.  For example, before adding a new node for the
graph (and starting abstract interpretation of the corresponding
expression), the abstract interpreter checks a cache of already seen
expressions in the same environment and if it finds one it doesn't add
a new node, but actually just reads the result from the old node.
This is what happens for example when there is recursion and unknown
values. 

Let's say you want to abstract interpret the definition of '(list? x)'
in an environment with the regular definition of list? and '(#:top
...') for 'x'.  Evaluation of this will result in the evaluation of an
$if to check whether 'x' is a pair, obviously because 'x' is unknown
both branches have to be evaluated and one of the branches will do a
'list?'  of the 'cdr' of 'x' which is also unknown.  The abstract
interpreter will recognize the fact that it already has a node for
this and instead of creating a new one it will just link (create an
edge) to the existing one.  Then it can use the result of that node.
But what's the result, considering that we aren't done evaluating the
original expression?  The answer to this is that we don't know, but we
can approximate it with fixed points, which I will discuss shortly.

When there is a loop in the graph, as we seen in the previous example,
we need a result before we have it.  To approximate this we do the
following: when a node is created we register (#:bottom) meaning no
value as a tentative result and procceed with the evaluation of the
expression, if at any time during the evaluation the result is needed,
the saved value, (#:bottom) is used and a mark is left in the node to
indicate that the result of it was used.  When the original evaluation
terminates, it can check to see if its node was marked, if it wasn't
it means that no subexpression needed the result of this node and so
all is good, we just store the result obtained and continue, if it was
marked it means that some subexpression tried to use the result, we
solve this by merging the result with the previous result (the same
way $if merges both branches when it can't determine the result of the
test), clear the mark and restart the evaluation of this node with the
new result.  This will proceed the needed number of times, until the
result doesn't change (a fixpoint is reached).  This result will then
be a correct approximation of the result (meaning that the actual
result is in the set denoted by the approximation found).  This always
reaches a fixpoint in fklisp because the lattice has a finite height
and the merge is a superset of both values.  Some abstract interpreter
have infinite length lattices but use other mechanisms to ensure
reaching a fixpoint.

There are of course other reasons why the graph may become infinite
(besides repeating evaluations).  For example, a badly written
combiner may loop forever counting to infinity (e.g. '($labels
((forever (x) (forever (add1 x)))) (forever 0))').  Currently fklisp
doesn't have a mechanism to avoid this and consequently the abstract
interpreter will loop forever, just as the actual program would.  It
is certainly possible to find ways to avoid this, but I haven't tried
this yet.

One of the difficulties of working with lisp (compared with a language
that doesn't conflate data and code) is that sometimes it is difficult
to tell when two expression evaluations correspond with the same
application of a combiner.  For example there's this problem: let's
say we have the usual tail-recursive definition of length (without
type checking):

($labels ((length-h (ls acc)
            ($if (null? ls)
                 acc
                 (length-h (cdr ls) (add1 acc))))
          (length (ls) (length-h ls 0)))
    ...)

Now let's suppose that we wish to see what happens when length is
called using abstract interpretation with only the check of repeated
expressions as defined above.  Say we do abstract interpretation of
'(length x)' where x is an unknown object '(#:top ...)'.  The abstract
interpreter puts a new node in the graph for this evaluation, sets the
result to no value '(#:bottom)' and sees that length is an applicative
as defined above and so it creates an appropriate environment (binds
ls to '(#:top ...)' in a child of the $labels environment) and
proceeds with the abstract evaluation of the body in that environment.
Eventually the abstract interpreter evaluates the body of length-h in
an environment binding 'ls' to '(#:top ...) and 'acc' to '0'.  So far,
so good.  The problem is that when the recursive call is reached, it
corresponds to evaluating the body of length-h but in an environment
binding 'ls' still to '(#:top ...)', because the cdr of an unknown
value is the same unknown value, but 'acc' is now bound to '1'.  In
this case this new evaluation is different from the one before and so
the abstract interpreter generates a new node!  This of course keeps
on forever as it counts to infinity even though the applicative is
well defined.  The problem is that the abstract interpreter didn't
recognize the second call as related to the first one.  In this case
it's easy to see that the abstract interpreter ought to recognize this
fact and try to merge the arguments to both calls, resulting in an
undertermined integer (because merging '(#:integer 0)' and '(#:integer
1)' in the current lattice causes '(#:integer (#:top)'. And so the
next call will coincide in the bindings ('ls' to (#:top ...) and 'acc'
to '(#:integer (#:top))', the fixpoint will be reached and the result
will correctly be approximated as '(#:integer (#:top))', because add1
of an unknown integer is an unknown integer.

So the question is how to solve this.  There are a number of ways that
I'm considering (I haven't implemented one yet, and so the above
definition of length would make the current fklisp to loop forever).
The simpler of course would be to always merge the arguments of all
combiners over the course of abstract interpretation, so that only one
node is there for the body of any operative.  This will succeed in
stopping infinite recursion (provided the program didn't create an
infinite number of combiners!) of the abstract interpreter but would
be overly restricting, because, for example would merge make otherwise
known results, if say for example we used length on two different
known lists.  One solution to this would be to only merge arguments in
case of recursive calls: we could keep a stack of called functions and
if a user defined combiner is called when there is a call in the stack
below, we then merge the arguments of both calls.  This will elegantly
solve this issue of length because when the second evaluation with
'acc' equal to '1' is seen, there's the other one with 'acc' equal to
'0' below it and so both numbers are merged and the correct
approximation is reached.  In the case of two unrelated calls to
length, they won't be merged together because neither is below the
stack of the other.  So far so good.

The problem is that sometimes, there is a combiner that naturally
appears in the stack a number of times with no direct relation between
each other.  Let consider the case of the $cond operative, which in
fklisp is a defined operative, not a primitive.  It is possible
(actually quite common) that there is a $cond inside a $cond (notice
that this could simply be a function that uses $cond called inside a
cond clause).  For example:

($labels ((sign (x)
            ($cond ((positive? x) 1)
                   ((negative? x) -1)
                   (#t 0))))
  ($cond ((integer? obj) (sign obj))
         ...))

So the when evaluating the $cond, in the first clause we see a call to
sign which results in a call to $cond again, with totally different
clauses.  Now, this two uses should be considered unrelated and the
clauses shouldn't be merged, the mechanism outlined above however would
merge both of these and so wouldn't be able to correctly analyze this
otherwise simple looking case.  The solution I am currently
considering would take into consideration the presence of unknown
arguments for doing the merge.  In the case of length above, there is
an unknown object (the list) and so they should be merged.  In the
case of $cond here, all arguments to both instances of $cond are known
and so no merging should be done, even though there is a previous call
to $cond in the stack.  I am fairly confident that this will be
sufficient for most fklisp probrams, including the abstract
interpreter itself, when rewritten in fklisp. 

In any case there a number of possible alternatives.  One I should
mention is recognizing macro-like operatives like $cond and doing
macro-expansion.  One way to do this is, after analysing the body of
the operative, check to see whether it looks like a macro.  Here is
the result of applying the abstract interpreter to the definition of
$lambda & $cond, from derived.fk in the fklisp abstract interpreter
(slightly reformatted):

($labels (...
  ($lambda (formals body) denv
   (<wrap> (<eval> (<list><1.0> <$vau> <formals><0.1> #ignore <body><0.2>)
                   <denv><0.0>)))
  ...
  ($cond clauses denv
         (<$if> (<null?> <clauses><0.1>)
                (<$error> (no clause was true))
                (<$if> (<eval> (<caar><1.2> <clauses><0.1>) <denv><0.0>)
                       (<eval> (<cadar><1.8> <clauses><0.1>) <denv><0.0>)
                       (<eval> (<cons> <$cond><1.33>
                                       (<cdr> <clauses><0.1>))
                               <denv><0.0>))))
   ...)
  ...)

It is important to notice that all symbols in the definition were
correctly replaced, so <$if> is actually the primitive operative $if
(not the symbol) and <denv><0.0> is the actual dynamic environment of
the call to $cond, etc.  So the idea is to try to transform the body
so that it is of the form '(<eval> some-expr <denv><0.0>)'.  Where
some-expr is an object formed by simple manipulation of the operands
to the operative. Or some kind of checks ($if, etc) using only the
structure of the operands to the operative, and resulting in
something of the form '(<eval> some-expr <denv><0.0>)'.

In the case of $lambda, we can do put the wrap inside the eval easily
because we know it ignores the dynamic environment.  Everything is
inside the eval now and it is just simple manipulations of the
arguments to $lambda (creating list of them and some constants).
In the case of $cond, it can be done too, because the first $if can be
left as is (just analyzes the structure of the operands) and the inner
$if & 3 evals can be combined into a single <eval> with the if inside
because (<$if> (<eval> test <denv>)
               (<eval> then <denv>)
               (<eval> else <denv>)
is always equivalent to (<eval> (<list> <$if> test then else) <denv>).

So this is certainly a way worth exploring, but of course depends on
having first done static analysis done on the body of the operative to
transform symbols to primtive combiners, other macros and direct
access to the environment.  Other interesting thing that can be done
with the result of the abstract interpreter as currently defined or
with minor modifications are inlining, constant propagation, warnings
about undefined references, unreachable code, warning about types &
number of arguments, etc. 

Well that was long, but I hope this will help people understanding
fklisp and how it works.  Please write with any questions, ideas, etc,
I'll be happy to answer.

Regards,
Andres Navarro

Andres Navarro

unread,
Feb 14, 2013, 6:00:11 PM2/14/13
to kl...@googlegroups.com
Sorry a little correction about fklisp lattice:

On Mon, Feb 11, 2013 at 8:37 PM, Andres Navarro <canav...@gmail.com> wrote:
(...).  All these objects can be arranged

in a finite height lattice by the inclusion relationship, with top and
bottom occuping the obvious positions.


This lattice actually has infinite height because of pairs and environments
with partial info (e.g. unknown car or cdr, or parent environment).
 
This always
reaches a fixpoint in fklisp because the lattice has a finite height
and the merge is a superset of both values.  Some abstract interpreter
have infinite length lattices but use other mechanisms to ensure
reaching a fixpoint.


This is true of most datatypes.  However it's not true for pairs for example.
What happens with pairs is that this merging generates a sequence that
obeys the ascending chain condition, meaning that even though the domain
is infinite, the sequence of values generated by the above operation eventually
terminates (i.e. it doesn't change anymore).


Please consider that all of this is from the top of my head and even may change at any
moment while I continue working with fklisp & its termination conditions.  I wrote about
it just to give an idea of how it works/tries to work.  Take it all with a grain of salt.

I expect that once I have the abstract interpreter running on itself, I'll be able to give
these matters a lot more thought and analysis so that I can give a more formal account
of it all.

Regards,
Andres Navarro

Brandon Bloom

unread,
Sep 3, 2013, 4:18:59 PM9/3/13
to kl...@googlegroups.com
Exciting! Great work!

I'm curious if you have any preliminary results to report. How much faster are things? What portion of symbols are you able to resolve statically on code you've tested?

Andres Navarro

unread,
Sep 3, 2013, 7:15:24 PM9/3/13
to kl...@googlegroups.com

The project is currently on hold but I'm planning on getting back to it soon. The branch where I was working is "self-application", so that would have the latest results.

Regarding your questions: I did most of the tests on code for library combiners ($let, map, apply, $cond, etc) and the abstract interpreter itself, all of which I rewrote in fklisp explicitly for this purpose.  I was able to process all except the last file of the interpreter (fkeval.fk, see below), several hundreds of lines in all. All replaceable symbols were identified correctly as far as I could see, either as primitive combiners or as specific positions in the environment structure.

Sadly I was stuck trying to get it to analyze the last file: fkeval.fk before getting busy with the start of the school year. (either it loops forever or it takes a REALLY LONG time, still not sure...).  There are a number of possibilities here, all of them fixable in principle: one is that there's some remaining bug that throws the abstract interpreter on a wrong (and possibly infinite) path, other one is that the naive method I chose doesn't scale (time complexity) for the whole program,  yet another one is that the complexity of the remaining code hides something that I didn't consider and causes the engine to enter an infinite loop.

I remain confident that I'll be able to get that last file analyzed eventually which would end this experiment/proof of concept as a total success. On the other hand it already proved (in my mind at least) that the issue of Kernel compilation ísn't the impossible thing we were led to believe by the fexpr detractors :). I'll probably write an article about it in any case, over the (southern hemisphere) summer break.

As for the speed gained from the process: I didn't tackle this issue yet.  The main reason is that for the very important part of doing direct access to the environment (with an index) I need support from the interpreter.  Now, klisp uses a representation of environments (lists or hashtables depending on the env size) that is not really a good fit for this kind of optimization. Moreover, it has a number of features (and some naive implementation of them at that) that aren't needed in fklisp: first class continuations, mutation & cyclic list handling that add considerably to the runtime overhead. So the benefit of replacing the symbols in klisp, while substantial, wouldn't be as spectacular as in a proper interpreter for fklisp.

Irregardless, the main benefit of replacing the symbols is not the speed gains in interpreting the code with the symbols optimized away, but the fact that it opens the possibility of compiling the code (because it fixes the semantics of the symbols). In that sense, this project is but the first step of a long journey ahead of us.

As an example of the output produced, you can look at the file derived.fkc (in the mentioned self-application branch of the fklisp repo on bitbucket) which is the result of applying the abstract interpreter to the code for the standard derived combiners (source code in derived.fk).  Here we can see how all possible symbol optimizations have been correctly identified.  A symbol between angle brackets (such as "<$if>") represents a symbol that always refers to a primitive combiner, and so it can be replaced by it.  A symbol between angle brackets followed by two numbers (such as "<caar><1.2>") represents a symbols that always leads to a fixed position in the environment structure (relative level & offset) and so, it can be replaced by a direct access instruction to the environment.

Feel free to ask any further questions on the matter.

Regards,
Andres Navarro

Reply all
Reply to author
Forward
0 new messages