Defunctionalization, prim__intToStr, and type safety

69 views
Skip to first unread message

Ralith

unread,
Oct 2, 2012, 8:58:44 PM10/2/12
to idris...@googlegroups.com
The LLVM code generator I've been working on (visible at https://github.com/bitlisp/Idris-dev ) is nearing completion, but has been producing segfaults when fed test programs. I've isolated (one of?) these to calls to the prim__intToStr function constructed by IRTS.Compiler.build, into which the defunctionalization pass inserts a call to {EVAL0}, feeding it its argument and passing its return value on to the actual primitive operator. prim__intToStr's argument is an Int, which in the current RTS are passed unboxed by casting into pointers. However, {EVAL0} does pattern matching (a constructor case) on its argument, so is clearly expecting arguments of some type which has conventional data constructors. In the C backend, constructor tag extraction is special-cased for integers to return -1, which in {EVAL0} causes the case to fall through to its default clause, which returns the function's argument. However, in my LLVM backend I have made the assumption that an attempt will never be made to destructure an integer, as that sort of thing shouldn't normally get through high-level typechecking, so when a destructuring case is executed I go ahead and dereference the tag pointer, which causes a segfault in the theoretically-impossible instance of an unboxed value ending up there.

I can easily duplicate the C backend's special case behavior here, but is this really the correct approach? I may be new to defunctionalization, but I can't begin to imagine why it should make any changes to prim__intToStr at all, let alone one that appears to violate type safety.

For reference, here's the code generators' output for prim__intToStr:

C:
void _idris_prim_95__95_intToStr(VM* vm, VAL* oldbase) {
    INITFRAME;
    RESERVE(1);
    ADDTOP(1);
    RESERVE(1);
    TOP(0) = LOC(0);
    STOREOLD;
    BASETOP(0);
    ADDTOP(1);
    CALL(_idris__123_EVAL0_125_);
    LOC(1) = RVAL;
    RVAL = idris_castIntStr(vm, LOC(1));
    TOPBASE(0);
    REBASE;
}

LLVM:
define internal fastcc %value* @_idris_prim__intToStr(i8* %VM, %value* %"{op0}") {
entry:
  %0 = call fastcc %value* @"_idris_{EVAL0}"(i8* %VM, %value* %"{op0}")
  %1 = call %value* @idris_castIntStr(i8* %VM, %value* %0)
  ret %value* %1
}

Edwin Brady

unread,
Oct 5, 2012, 8:11:43 AM10/5/12
to idris...@googlegroups.com
Hi,
If I've understood the problem correctly… the thing about EVAL is that it has to be able to handle any input that is thrown at it, which could be a primitive or a constructor, and that constructor could represent a function. So the C backend checks which it is first. It would be better, eventually, I think, to have some compile time analysis to remove any unnecessary EVALs, and indeed to unbox primitives properly.

One place where you might need to destructure something which turns out to be an integer is where you have built a thunk to be evaluated lazily, which will eventually turn into an integer. So it is unfortunately not a safe assumption that integers will never be inspected by a case - primitives can't assume that their inputs are fully evaluated, at least not without further compile time analysis.

When I get around to it (no promises though :)) I intend to do a bit of work on the intermediate representation - specifically, a partial evaluator for the defunctionalised language. Some easy optimisations would be to inline primitives and do EVALs at compile time wherever possible.

Edwin.

Ralith

unread,
Oct 6, 2012, 6:58:31 PM10/6/12
to idris...@googlegroups.com
Is it generally true that IR case statements must handle values of
every type (by jumping to the default alternative when a type other
than that which they were written for is supplied?), or is it just
EVAL that behaves specially in that regard? The former possibility
seems counterintuitive, given how strongly typed the source language
is. Would the IR optimization passes you describe make the
aforementioned assumption completely safe again?

This brings to mind another tangental question: What motivated the
decision to use defunctionalization instead of (what I understand to
be) the more common practice of passing function pointer+environment
structs around? If it was merely the observation that not all targets
support that sort of thing, would it be reasonable for codegen for a
particular target (e.g. LLVM) to take this approach?

Edwin Brady

unread,
Oct 7, 2012, 7:55:43 AM10/7/12
to idris...@googlegroups.com
On 6 Oct 2012, at 23:58, Ralith <ral...@gmail.com> wrote:

> Is it generally true that IR case statements must handle values of
> every type (by jumping to the default alternative when a type other
> than that which they were written for is supplied?), or is it just
> EVAL that behaves specially in that regard? The former possibility
> seems counterintuitive, given how strongly typed the source language
> is. Would the IR optimization passes you describe make the
> aforementioned assumption completely safe again?

Here's the problem - the following things all have type Nat:

O
S O
plus (S O) (S O)
head []

The following things all have type Int:

6
42
factorial 8
head []

(Of course, if you require everything to be total, the last one goes away in each case)
Yes, you know what type they are, but you don't know in general at run-time whether they're in normal form. This is an important part of the operational semantics that the type system doesn't tell you. The idea with EVAL is to make sure that any function which needs to inspect a value knows that that value is in head normal form, so you can safely inspect a constructor or integer and know the range of values it can be.

The next trick is to eliminate the calls to EVAL where it is known at compile time that a value must be in head normal form.

> This brings to mind another tangental question: What motivated the
> decision to use defunctionalization instead of (what I understand to
> be) the more common practice of passing function pointer+environment
> structs around? If it was merely the observation that not all targets
> support that sort of thing, would it be reasonable for codegen for a
> particular target (e.g. LLVM) to take this approach?

I wanted to try defunctionalisation because I wanted to see what would be the benefits and drawbacks of moving EVAL and APPLY out of the runtime system, and making them representable in the IR. It seems to be a good idea, in a language that is supposed to help you prove things, to have a very simple run time system and easily explainable (and formalisable) transformations from one step to the next.

There's nothing to stop you using the higher level representation, and it may make sense for some backends, but if so you'll have to implement EVAL and APPLY yourself, and you may lose any optimisations that take advantage of them being IR code. There's some other things I want to try too, e.g. serialising closures for web/cloud programming.

Edwin.
Reply all
Reply to author
Forward
0 new messages