Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.

Dismiss

misc: new interpreter ("Fast RIR")

89 views

Skip to first unread message

BGB

unread,

Oct 16, 2012, 4:47:33 PM10/16/12

basically, this is just a few globs from some recent emails, mentioning
something I am currently looking into.

the idea would be to develop a new faster threaded-code backend for the
BGBScript VM, which would basically be used "behind the scenes" while
still keeping the existing high-level bytecode.

as-is, the interpreter already translates the bytecode into a
stack-based threaded-code format (which is faster than running it
directly via a "switch()", but still could be somewhat faster).

so, the general idea would be to "unwind" the stack-based bytecode
(vaguely like JVM or AVM2 bytecode) into using a register-machine model
(more like Dalvik or Parrot), and essentially use a register-machine to
execute the code.

a lesser goal is that it could also potentially lessen the work of
writing a JIT backend (to allow for a smaller and simpler JIT, which
should ideally be easier to implement, debug, and maintain).

First Message:
<===
but, I just went and wrote up a basic "idea spec" for the new
interpreter idea, which I decided to call FastRIR.

it is yet to be proven though whether it would be much faster than the
current interpreter (the number of inner-loop pointer-operations is not
likely to be much different than my current interpreter).

note that the basic premise of operation for this sort of interpreter is
that the function-pointers inside the structures are generally what are
used to direct control flow (so, the logic that sets up the execution
path will generally set up the structures and fill in function pointers
as-appropriate).

this is in contrast to a more traditional "while loop and switch"
interpreter, but this style of interpreters tend to be a little faster
IME (without being drastically more complicated).

this is far from comprehensive, so a little imagination would be needed
to "fill in the gaps".

(the main focus here is on the part of the interpreter which tends to
consume the bulk of the execution time, which often tends to be the
thing spinning in a loop and executing operations).

any ideas for how to make such a design idea faster are welcome (apart
maybe from putting everything in globals, which yes, can make things
faster, but kills thread-safety, and also doesn't help much on ELF targets).

but, anyways, here is the current idea spec:
-----

Fast Register IR:
Low-level statically-untyped (weakly-typed) register-machine model.

Operations are statically typed and operate on registers.
Most operations will follow a 3 register form.

add_i dest, srca, srcb

so:
add_i, add_l, add_f, add_d
sub_i, sub_l, sub_f, sub_d
...

This will be used for a threaded-code interpreter.
FRIR code will be organized into traces.

FRIR will have up to 256 registers of each type.

union FRIR_Value_u
{
s32 i;
s64 l;
float f;
double d;
byte *pb;
dyt r;
};

struct FRIR_Frame_s {
FRIR_Trace *trace; //return trace
int *ri;
s64 *rl;
float *rf;
double *rd;
byte **ra;
FRIR_Value *vspos; //var stack pos
FRIR_Value *aspos; //arg stack pos
};

struct FRIR_Context_s {
FRIR_Trace *trace; //current trace
FRIR_Frame **framestack; //frame stack base
FRIR_Value *varstack; //var stack base
FRIR_Value *argstack; //arg stack base

FRIR_Frame *frame; //active frame
FRIR_Frame **fspos; //frame stack position
FRIR_Value *vspos; //var stack pos
FRIR_Value *aspos; //arg stack pos

int ret; //ctx return state (0=normal execution)

//register state (shared with frame)
int *ri;
s64 *rl;
float *rf;
double *rd;
byte **ra;
};

The return state will be a magic number indicating the current state of
the interpreter. This value will be kept to 0 during normal execution,
but will be set for events such as an exception being thrown or a
function call/return event, which will be handled outside the normal
execution loop.

A method will hold an array of opcodes and traces.
The opcode array will hold all opcodes in the method.
Traces will represent subsets of this opcode array.
A trace may not contain a jump or an operation which may throw an
exception, but may terminate with such an instruction.

struct FRIR_Method_s {
FRIR_Opcode *ops; //first opcode in method
FRIR_Trace *trace; //first trace in method
int n_ops; //number of opcodes
int n_trace; //number of traces
};

struct FRIR_Trace_s {
FRIR_Trace *next; //next trace (linear, per-method)
FRIR_Trace *jmpnext; //jump-next trace (direct jump)
FRIR_Opcode *ops; //opcodes in trace
// FRIR_Opcode *opse; //end of opcodes
int n_ops; //number of opcodes
FRIR_Trace *(*run)(FRIR_Context *ctx, FRIR_Trace *tr);
FRIR_Trace *(*end)(FRIR_Context *ctx,
FRIR_Trace *tr, FRIR_Opcode *op);
};

run will be the function for handling executing a trace.
end will be called at the end of a trace, and will give the next trace
to execute.

struct FRIR_Opcode_s {
void (*fcn)(FRIR_Context *ctx, FRIR_Opcode *op);
int op;
int a, b, c;
FRIR_Value i;
};

Registers:
RI#, Register Int
RL#, Register Long
RF#, Register Float
RD#, Register Double
RA#, Register Address (Raw Pointer)

lea_b dest(RA), base(RA), index(RI)
lea_w dest(RA), base(RA), index(RI)
lea_i dest(RA), base(RA), index(RI)

lea2_b dest(RA), base(RA), index(RI), offset(II)
lea2_w dest(RA), base(RA), index(RI), offset(II)
lea2_i dest(RA), base(RA), index(RI), offset(II)

void ThOp_Add_I(FRIR_Context *ctx, FRIR_Opcode *op)
{
ctx->ri[op->c]=ctx->ri[op->a]+ctx->ri[op->b];
}

void ThOp_Lea_I(FRIR_Context *ctx, FRIR_Opcode *op)
{
ctx->ra[op->c]=ctx->ra[op->a]+
ctx->ri[op->b]*sizeof(int);
}

void ThOp_Lea2_I(FRIR_Context *ctx, FRIR_Opcode *op)
{
ctx->ra[op->c]=ctx->ra[op->a]+
(ctx->ri[op->b]+op->i.i)*sizeof(int);
}

//default end function for trace
void FRIR_Trace_EndDefault(FRIR_Context *ctx,
FRIR_Trace *tr, FRIR_Opcode *op)
{
return(tr->next);
}

//default end function for trace (unconditional jump)
void FRIR_Trace_EndJumpDefault(FRIR_Context *ctx,
FRIR_Trace *tr, FRIR_Opcode *op)
{
return(tr->jmpnext);
}

//default run function for trace
void FRIR_Trace_RunDefault(FRIR_Context *ctx, FRIR_Trace *tr)
{
FRIR_Opcode *op, *ope;

op=tr->ops;
ope=tr->ops+tr->n_ops-1;
while(op<ope)
{ op->fcn(ctx, op); op++; }
return(tr->end(ctx, tr, op));
}

The above may be unrolled, as in:
void FRIR_Trace_RunDefault4(FRIR_Context *ctx, FRIR_Trace *tr)
{
FRIR_Opcode *op;

op=tr->ops;
op->fcn(ctx, op); op++;
op->fcn(ctx, op); op++;
op->fcn(ctx, op); op++;
return(tr->end(ctx, tr, op));
}
or:
void FRIR_Trace_RunDefault8(FRIR_Context *ctx, FRIR_Trace *tr)
{
FRIR_Opcode *op;

op=tr->ops;
op->fcn(ctx, op); op++;
op->fcn(ctx, op); op++;
op->fcn(ctx, op); op++;
op->fcn(ctx, op); op++;
op->fcn(ctx, op); op++;
op->fcn(ctx, op); op++;
op->fcn(ctx, op); op++;
return(tr->end(ctx, tr, op));
}

The combination of unrolling and possibly adding redundancy should allow
some exploitation of the branch-predictor.

void FRIR_Thread_RunStep(FRIR_Context *ctx)
{
FRIR_Trace *tr;

tr=ctx->trace;
while(!ctx->ret)
{ tr=trace->run(ctx, tr); }
ctx->trace=tr;
}
===>

Second Message:
<===
investigating more (and [after] looking some at Dalvik, ...) it seems
there is a bit more room for possible optimizations at the interpreter
level for a register-IR.

a register IR also allows for more easily eliminating common
sub-expressions (could likely be "recognized" from the stack-based input
bytecode).

(I also looked at Parrot and LLVM IR, but these are less closely related
to my current VM).

as-is, things are still in a "general design and basic experiments" stage...

apparently, people working in similar situations in the past, were able
to get a roughly 25%-30% speedup by translating to a register IR (for an
interpreter). this was for Java ByteCode.

some further speedup is possible as it is considered that the conversion
step would also better nail down things like variable scope and object
field access, which are currently handled dynamically (FRIR will be much
lower-level than the BSVM bytecode).

for example, FRIR will distinguish between local variables, function
arguments, and captured lexical variables (in the BSVM bytecode, these
are all treated as a single unified "stack", but which may not be
physically continuous in memory, essentially requiring a linked-list walk).

basically, for example, it could possibly optimize accessing an object
slot from something like (present):
check object type;
fetch slot by name (hash-key lookup);
dispatch by field type and fetch value.

to something more like:
*(int *)(((byte *)obj->data)+op->i.i);
which would likely be significantly faster than the current logic.

(the above case could be optimized in the current interpreter, but I
hadn't done so previously because it would be a hassle, requiring more
complex logic for emitting the threaded code...).

a similar speedup may be possible for array handling (it will no longer
need to type-check the array, like to determine that it is an array).

these will likely be specific to typed variables, so:
var obj:MyClass = new MyClass();
obj.x=3;
or:
var arr:int[] = new int[256];
arr[31]=4;

but may not (necessarily) work for untyped variables:
"var obj;" or "var arr;".
which will far more likely end up using dynamic type-checks if the VM
can't type-infer them (less likely at present, as the existing
type-inference is more concerned with primitive types, rather than with
objects or arrays).

I also realized some while looking at retrofitting a few ideas onto the
existing interpreter:
in my stack-machine interpreter, a very large number of instructions may
throw exceptions (hindering the use of trace optimizations). (for
example, it would need to be proven that the stack can neither underflow
nor overflow during an operation, ...).

in the current design for a register IR, the places where exceptions may
appear are much more constrained. the possible locations for exceptions
are important to trace-building, as a trace can't go through either a
conditional branch or an operation which may throw, otherwise the
presence of an exception state has to be done following each instruction
(typically this is what is done in my prior interpreters).

uncertain:
traces are best for when there are longer sequences of instructions in
the trace, but cases of only 1 or 2 instructions may execute slower than
a single instruction.

design changes from previously:
'A' type was changed from being a raw pointer to a pointer-sized
integer, mostly as it became apparent that the interpreter logic would
require an excessive amount of casting in C, so it would be easier just
to treat it as an integer type and cast back to a pointer as-needed.

though, this was prior to deciding remove a number of "likely
irrelevant" operations, basically operations which would be unlikely on
pointers.

I am considering likely switching from a fixed-256 registers to a "fixed
unbounded" register set, basically meaning that methods will encode how
many registers they need of a given type (this will be used when
creating the execution frames), but may not be limited to a certain
fixed number.

later on, assuming all of the above goes somewhere:

I am also thinking that IR code will likely be emitted/built mostly via
an API (so, probably no bytecode or ASM-like syntax here).

probably, this would mean a special backend for the BGBScript VM mostly
serving to unwind the stack operations into a register mapping and
convert the operations into FRIR API calls, rebuilding the code as a
lower-level IR. I could try switching over to the new backend "once
everything exists".

(this would sit in-place of the current backend which unpacks the
bytecode and builds the existing threaded-code structures).

probably the FRIR ISA will be regarded as volatile, and will exist as a
component largely independent of the BS-VM proper.

or such...
===>

Rod Pemberton

unread,

Oct 17, 2012, 6:50:11 AM10/17/12

"BGB" <cr8...@hotmail.com> wrote in message
news:k5khbh$f71$1...@news.albasani.net...
...

> the idea would be to develop a new faster threaded-code backend for the
> BGBScript VM, which would basically be used "behind the scenes" while
> still keeping the existing high-level bytecode.

As you may know, I've been developing an ITC (threaded code) Forth
interpreter in C for a few years now. One issue is with branching. I don't
believe the issue is Forth specific. I think applies to all threaded code.
Threaded code is not well suited to branching or most control flow, but
branching can be made to work. Forth has no formal definition for it's
branching "words" (routines). Historically, Forth calls them: BRANCH and
either 0BRANCH or ?BRANCH. The first is an absolute branch, and the
second with two common names is a conditional branch.

> as-is, the interpreter already translates the bytecode into a
> stack-based threaded-code format (which is faster than running it
> directly via a "switch()", but still could be somewhat faster).

Well, that's good to know!

Most Forth's coded in C use a switch() since that's the C way of doing
things. I didn't use a C switch() statement for my interpreter. I
interpret the same way as early ITC Forth's in assembly, but using C
instead of assembly.

> so, the general idea would be to "unwind" the stack-based bytecode
> (vaguely like JVM or AVM2 bytecode) into using a register-machine model
> (more like Dalvik or Parrot), and essentially use a register-machine to
> execute the code.

Have you tried any of this already? Have you analyzed how many stacked
items are typically used at a time? I.e., does your bytecode only use two
or three items at once?

From my own experiences, I can "see" issues with needing a register
allocator, keeping track of which data is in which registers, spilling of
registers when you run out of available registers, etc.

> any ideas for how to make such a design idea faster are welcome (apart
> maybe from putting everything in globals, which yes, can make things
> faster, but kills thread-safety, and also doesn't help much on ELF
> targets).

I ran into a few issues with making my ITC interpreter faster in C.

The first issue is that the "register" keyword only works on "auto" or
procedure local variables for GCC. GCC deprecated the use of the "register"
keyword for file scope or global variables.

The second issue is:

"If I can't use the 'register' keyword on globals, and I've got four
procedures operating on my globals, how do I get all four procedures merged
into a _single_ procedure so that my globals become 'auto' types and can be
optimized?"

The first answer to that is something C doesn't have: nested procedures.
The second answer to that -possibly- is something C doesn't have either:
inheritance. The third answer to that is: "I'm still working on that one!"
At this point, I'm considering *anything*, like even a small switch() or
Duff's device, in order to get 'auto' types. I just haven't attempted to
get something working, yet.

> apparently, people working in similar situations in the past, were able
> to get a roughly 25%-30% speedup by translating to a register IR (for an
> interpreter). this was for Java ByteCode.

...

> I also realized some while looking at retrofitting a few ideas onto the
> existing interpreter:
> in my stack-machine interpreter, a very large number of instructions may
> throw exceptions (hindering the use of trace optimizations). (for
> example, it would need to be proven that the stack can neither underflow
> nor overflow during an operation, ...).

Or, stack check on push/pop...

Or, just abort...

> design changes from previously:
> 'A' type was changed from being a raw pointer to a pointer-sized
> integer, mostly as it became apparent that the interpreter logic would
> require an excessive amount of casting in C, so it would be easier just
> to treat it as an integer type and cast back to a pointer as-needed.

Exactly. You can't dereference a pointer type in C and end up with the
exact same type. Well, except maybe a void*, but then you can't access data
at the pointer because it's void... So, casting in C is required somewhere
threaded code.

Rod Pemberton

rug...@gmail.com

unread,

Oct 17, 2012, 5:21:56 PM10/17/12

Hi,

N.B. My experience is very low, but I'll try to add a few tidbits.

On Wednesday, October 17, 2012 5:46:27 AM UTC-5, Rod Pemberton wrote:
> "BGB" <cr8...@nospam.com> wrote in message
>
> news:k5khbh$f71$1...@news.nospam.net...

>
> > the idea would be to develop a new faster threaded-code backend for the
> > BGBScript VM, which would basically be used "behind the scenes" while
> > still keeping the existing high-level bytecode.
>

> I ran into a few issues with making my ITC interpreter faster in C.
>
> The first issue is that the "register" keyword only works on "auto" or
> procedure local variables for GCC. GCC deprecated the use of the "register"
> keyword for file scope or global variables.

This is no surprise, but indeed "register" doesn't always help very much. It's not even required to do anything, IIRC. And I don't know of anybody getting much benefit from using it with GCC. At least my own wimpy uses didn't show any noticeable benefit. But I didn't look too hard to find a good example (perhaps GForth??). Maybe x86-64 helps more, but I mostly doubt it. In other words, I wouldn't pin too many hopes on this.

> The second issue is:
> "If I can't use the 'register' keyword on globals, and I've got four
> procedures operating on my globals, how do I get all four procedures merged
> into a _single_ procedure so that my globals become 'auto' types and can be
> optimized?"

Whole-program optimization or maybe LTO? Honestly, no idea.

> The first answer to that is something C doesn't have: nested procedures.

GCC supports nested procedures as an extension. And most Algol-inspired languages do also (Pascal, Modula-2, Modula-3, Oberon). However, I have no idea if even those implementations using GCC as backend would optimize for register use in that way. I guess in theory you could use GNAT / Ada for whatever and link that to your binary, but I doubt that's a reasonable approach, if it helped at all.

> At this point, I'm considering *anything*, like even a small switch() or
> Duff's device, in order to get 'auto' types. I just haven't attempted to
> get something working, yet.

Duff's device sounds interesting but is so weird that I can't exactly understand it (nor tried). Normally speed isn't such a dire issue. I mean, it's nice, but sometimes it's just easier to wait! Or write in assembly. Or upgrade your compiler. ;-)

Just see what GForth does, they seem to know what they're doing (GCC's computed goto?).

> > apparently, people working in similar situations in the past, were able
> > to get a roughly 25%-30% speedup by translating to a register IR (for an
> > interpreter). this was for Java ByteCode.

You didn't mention Lua here, but since 5.x they now use a register-based VM (as opposed to previous stack-based), which they claim is noticeably faster. You may want to read their paper _The implementation of Lua 5.0_ for hints. See http://www.lua.org/doc/jucs05.pdf .

> > I also realized some while looking at retrofitting a few ideas onto the
> > existing interpreter:
> > in my stack-machine interpreter, a very large number of instructions may
> > throw exceptions (hindering the use of trace optimizations). (for
> > example, it would need to be proven that the stack can neither underflow
> > nor overflow during an operation, ...).
>
> Or, stack check on push/pop...
> Or, just abort...

What else can you do? It's either malloc more, halt, or crash. I can't think of any other options. Perhaps realloc or free some unnecessary stuff or compress your data structures (at runtime??).

BGB

unread,

Oct 17, 2012, 6:12:12 PM10/17/12

On 10/17/2012 5:50 AM, Rod Pemberton wrote:
> "BGB" <cr8...@hotmail.com> wrote in message
> news:k5khbh$f71$1...@news.albasani.net...
> ...
>
>> the idea would be to develop a new faster threaded-code backend for the
>> BGBScript VM, which would basically be used "behind the scenes" while
>> still keeping the existing high-level bytecode.
>
> As you may know, I've been developing an ITC (threaded code) Forth
> interpreter in C for a few years now. One issue is with branching. I don't
> believe the issue is Forth specific. I think applies to all threaded code.
> Threaded code is not well suited to branching or most control flow, but
> branching can be made to work. Forth has no formal definition for it's
> branching "words" (routines). Historically, Forth calls them: BRANCH and
> either 0BRANCH or ?BRANCH. The first is an absolute branch, and the
> second with two common names is a conditional branch.
>

BGBScript has been in-use and incrementally developed since about 2004.
I have been using threaded-code since around November 2011.

so, the point now is mostly looking into taking a different and more
efficient route towards the interpreter backend.

note that, unlike Forth or similar, the BGBScript bytecode follows
similar stack-use restrictions to JVM bytecode, basically that on a
given jump/label, the stack layout is required to match.

the prior interpreter basically operated as a sort of
continuation-passing-style, where every opcode would return the next
opcode to execute.

this time I am factoring this out and instead using "traces".
the hope is that by better nailing down the opcode-level control flow, a
more efficient means of executing the instruction traces may be used.

>> as-is, the interpreter already translates the bytecode into a
>> stack-based threaded-code format (which is faster than running it
>> directly via a "switch()", but still could be somewhat faster).
>
> Well, that's good to know!
>
> Most Forth's coded in C use a switch() since that's the C way of doing
> things. I didn't use a C switch() statement for my interpreter. I
> interpret the same way as early ITC Forth's in assembly, but using C
> instead of assembly.
>

I went away from using switch a while ago, mostly as it takes a fairly
severe performance hit for every loop iteration.

threaded code is a little faster, but all of those function
calls/returns and indirect function calls are also expensive (note:
can't use computed goto as my interpreter is not GCC specific).

sadly, there is no really "good" option short of writing the whole
interpreter core in ASM, and by this point I may as well just write a JIT.

>> so, the general idea would be to "unwind" the stack-based bytecode
>> (vaguely like JVM or AVM2 bytecode) into using a register-machine model
>> (more like Dalvik or Parrot), and essentially use a register-machine to
>> execute the code.
>
> Have you tried any of this already? Have you analyzed how many stacked
> items are typically used at a time? I.e., does your bytecode only use two
> or three items at once?
>

typically, all of the items in a typical expression will end up on the
stack.

this will likely exceed the number of x86 registers, but the threaded
code will use a "potentially unbounded" number of virtual registers.

if a JIT were used, likely it would use a mapping heuristic to figure
out which VM registers to map to hardware registers (this would more
likely apply to x86-64, I suspect that 32-bit x86 is sufficiently
register-starved to where any sort of aggressive mapping would likely do
more harm than good, and ESI/EDI would likely be needed mostly for
holding the current VM state).

> From my own experiences, I can "see" issues with needing a register
> allocator, keeping track of which data is in which registers, spilling of
> registers when you run out of available registers, etc.
>

this would be more for a native code generator, not so much for a
register-IR, where the registers aren't so much "registers" in the CPU
sense, but are more closely related to "temporary variables".

so, you can avoid the need to "spill" by simply having 256 (or more) as
the max number of VM registers, and by making the VM preserve register
state across calls.

the need for a full register allocator can also be avoided at this stage
as well (they can be allocated either sequentially or in a stack-like
manner.

>> any ideas for how to make such a design idea faster are welcome (apart
>> maybe from putting everything in globals, which yes, can make things
>> faster, but kills thread-safety, and also doesn't help much on ELF
>> targets).
>
> I ran into a few issues with making my ITC interpreter faster in C.
>
> The first issue is that the "register" keyword only works on "auto" or
> procedure local variables for GCC. GCC deprecated the use of the "register"
> keyword for file scope or global variables.
>
> The second issue is:
>
> "If I can't use the 'register' keyword on globals, and I've got four
> procedures operating on my globals, how do I get all four procedures merged
> into a _single_ procedure so that my globals become 'auto' types and can be
> optimized?"
>
> The first answer to that is something C doesn't have: nested procedures.
> The second answer to that -possibly- is something C doesn't have either:
> inheritance. The third answer to that is: "I'm still working on that one!"
> At this point, I'm considering *anything*, like even a small switch() or
> Duff's device, in order to get 'auto' types. I just haven't attempted to
> get something working, yet.
>

the "register" keyword doesn't really usually make much of a difference,
so I don't really use it.

>> apparently, people working in similar situations in the past, were able
>> to get a roughly 25%-30% speedup by translating to a register IR (for an
>> interpreter). this was for Java ByteCode.
> ...
>
>> I also realized some while looking at retrofitting a few ideas onto the
>> existing interpreter:
>> in my stack-machine interpreter, a very large number of instructions may
>> throw exceptions (hindering the use of trace optimizations). (for
>> example, it would need to be proven that the stack can neither underflow
>> nor overflow during an operation, ...).
>
> Or, stack check on push/pop...
>
> Or, just abort...
>

checking for stack overflow is just the sort of cost I am trying to
avoid here.

it is the same sort of issue as checking if an exception-state variable
has been set. it seems simple enough, but it can eat lots of clock
cycles in a tight loop.

typically, a stack might be accessed like:
ctx->stackpos--;
*(int *)ctx->stackpos=((int *)ctx->stackpos)[0]+
((int *)ctx->stackpos)[1];

but, if we add something like:
if(ctx->stackpos<0)
{ ... }

then suddenly lots of clock cycles are being wasted just performing this
silly little check.

and, if our dispatch loop is something like:
while(op && !ctx->ret)
{ op=op->fcn(ctx, op); }

that "&& !ctx->ret" also adds a bit of cost.

here is the inner-loop for the in-development interpreter:
BGBFRIR_API FRIR_Trace *FRIR_Trace_RunDefault(

FRIR_Context *ctx, FRIR_Trace *tr)
{
FRIR_Opcode *op, *ope;

op=tr->ops;
ope=tr->ops+tr->n_ops-1;
while(op<ope)
{ op->fcn(ctx, op); op++; }
return(tr->end(ctx, tr, op));
}

and the add-integer handler:
BGBFRIR_API void FRIR_ThOp_ADD_I(FRIR_Context *ctx, FRIR_Opcode *op)

{ ctx->ri[op->c]=ctx->ri[op->a]+ctx->ri[op->b]; }

>> design changes from previously:
>> 'A' type was changed from being a raw pointer to a pointer-sized
>> integer, mostly as it became apparent that the interpreter logic would
>> require an excessive amount of casting in C, so it would be easier just
>> to treat it as an integer type and cast back to a pointer as-needed.
>
> Exactly. You can't dereference a pointer type in C and end up with the
> exact same type. Well, except maybe a void*, but then you can't access data
> at the pointer because it's void... So, casting in C is required somewhere
> threaded code.
>

this more has to do with the new interpreter having a lot of operations
for pointer arithmetic, like "lea" (for calculating addressed) and
"xloada" (for loading the address of a field into a register) and similar.

an integer type makes more sense, but leaves the minor annoyance of sign
and address-size (x86 addresses are unsigned and 32-bit, x86-64 is
signed and 64-bit), requiring a little bit of special handling in
certain cases.

at the IR level, the 'A' (address type) is partly disjoint from the
integer types, so at least the issue shouldn't be apparent.

there are operations like: "conv_a2l" and "conv_l2a" to convert between
pointers and long (signed 64-bit), but doing this is not as efficient on
32-bit x86 (shouldn't matter too much, as this isn't likely to be all
that common).

but, as-is, IR would look something like:
llload_i 2, 7 //load local var 7 into r2
llload_i 3, 8 //load local var 8 into r3
add_i 2, 2, 3 //add r2=r2+r3
llstore_i 2, 9 //store r2 into local var 9

there may likely end up being a few more special cases, like:
lladd_ic 11, 2 //add 2 to local var 11
which would in C look something like:
BGBFRIR_API void FRIR_ThOp_LLADD_IC(FRIR_Context *ctx, FRIR_Opcode *op)
{ ctx->llvars[op->c].i+=op->i.i; }

or such...

BGB

unread,

Oct 17, 2012, 7:05:53 PM10/17/12

On 10/17/2012 4:21 PM, rug...@gmail.com wrote:
> Hi,
>
> N.B. My experience is very low, but I'll try to add a few tidbits.
>
> On Wednesday, October 17, 2012 5:46:27 AM UTC-5, Rod Pemberton wrote:
>> "BGB" <cr8...@nospam.com> wrote in message
>>
>> news:k5khbh$f71$1...@news.nospam.net...
>>
>>> the idea would be to develop a new faster threaded-code backend for the
>>> BGBScript VM, which would basically be used "behind the scenes" while
>>> still keeping the existing high-level bytecode.
>>
>>> so, the general idea would be to "unwind" the stack-based bytecode
>>> (vaguely like JVM or AVM2 bytecode) into using a register-machine model
>>> (more like Dalvik or Parrot), and essentially use a register-machine to
>>> execute the code.
>>
>> I ran into a few issues with making my ITC interpreter faster in C.
>>
>> The first issue is that the "register" keyword only works on "auto" or
>> procedure local variables for GCC. GCC deprecated the use of the "register"
>> keyword for file scope or global variables.
>
> This is no surprise, but indeed "register" doesn't always help very much. It's not even required to do anything, IIRC. And I don't know of anybody getting much benefit from using it with GCC. At least my own wimpy uses didn't show any noticeable benefit. But I didn't look too hard to find a good example (perhaps GForth??). Maybe x86-64 helps more, but I mostly doubt it. In other words, I wouldn't pin too many hopes on this.
>

yeah, hence why I don't use it personally...

>> The second issue is:
>> "If I can't use the 'register' keyword on globals, and I've got four
>> procedures operating on my globals, how do I get all four procedures merged
>> into a _single_ procedure so that my globals become 'auto' types and can be
>> optimized?"
>
> Whole-program optimization or maybe LTO? Honestly, no idea.
>
>> The first answer to that is something C doesn't have: nested procedures.
>
> GCC supports nested procedures as an extension. And most Algol-inspired languages do also (Pascal, Modula-2, Modula-3, Oberon). However, I have no idea if even those implementations using GCC as backend would optimize for register use in that way. I guess in theory you could use GNAT / Ada for whatever and link that to your binary, but I doubt that's a reasonable approach, if it helped at all.
>

personally, I am writing code that may be compiled with either GCC or
with MSVC / Visual Studio.

this mostly precludes the use of compiler-specific features.

>> At this point, I'm considering *anything*, like even a small switch() or
>> Duff's device, in order to get 'auto' types. I just haven't attempted to
>> get something working, yet.
>
> Duff's device sounds interesting but is so weird that I can't exactly understand it (nor tried). Normally speed isn't such a dire issue. I mean, it's nice, but sometimes it's just easier to wait! Or write in assembly. Or upgrade your compiler. ;-)
>

another drawback of Duffs device is that it still doesn't really escape
the relatively high constant-time cost of a switch.

granted, yes, conditionals are also expensive, so the number of
conditional expressions and unpredictable branches needs to hopefully be
minimized.

> Just see what GForth does, they seem to know what they're doing (GCC's computed goto?).
>

computed goto is fast, but yeah, kind of a problem that MSVC can't do it.

ironically, BGBScript has computed goto (but this is of little help in
implementing an interpreter for it).

>>> apparently, people working in similar situations in the past, were able
>>> to get a roughly 25%-30% speedup by translating to a register IR (for an
>>> interpreter). this was for Java ByteCode.
>
> You didn't mention Lua here, but since 5.x they now use a register-based VM (as opposed to previous stack-based), which they claim is noticeably faster. You may want to read their paper _The implementation of Lua 5.0_ for hints. See http://www.lua.org/doc/jucs05.pdf .
>

I wasn't aware of Lua using a register VM.

I was aware of Dalvik and Parrot though, so these are mostly what I was
looking into.

note that the use of a register interpreter backend will not effect the
use of a high-level stack-based bytecode.

>>> I also realized some while looking at retrofitting a few ideas onto the
>>> existing interpreter:
>>> in my stack-machine interpreter, a very large number of instructions may
>>> throw exceptions (hindering the use of trace optimizations). (for
>>> example, it would need to be proven that the stack can neither underflow
>>> nor overflow during an operation, ...).
>>
>> Or, stack check on push/pop...
>> Or, just abort...
>
> What else can you do? It's either malloc more, halt, or crash. I can't think of any other options. Perhaps realloc or free some unnecessary stuff or compress your data structures (at runtime??).
>

throw a "StackOverflowException"?...

like, in script code:
try {
...
}catch(ex:StackOverflowException)
{
printf("Stack blew up!\n");
}

but, anyways, factoring out the stack can improve performance because we
will know (at method call time) whether or not the stack has overflowed,
and so can check and throw as part of calling the method, rather than
having to check every time a value is pushed onto the stack.

"abort()" is no-go, given the type of app I am developing...

usually, if the exception can't be handled in script code, then the VM
will set up for an "uncaught exception" and return back into C land with
an error code (this was chosen as preferable to trying to throw an
exception in C land, which is kind of a mess).

theoretically, it means that code calling into the VM "should" check for
an exception whenever an error code is returned, or the return-value is
"undefined", but sadly, a lot of code doesn't bother.

mostly this doesn't matter as much, as script-code is generally
considered as less important than the underlying C code for the program.

note that script code may also throw things like:
"NullPointerException", "UndefinedMethodException", ...

or such...

Andy Walker

unread,

Oct 17, 2012, 9:01:39 PM10/17/12

On 18/10/12 00:05, BGB wrote:
[...]

> throw a "StackOverflowException"?...
>
> like, in script code:
> try {
> ...
> }catch(ex:StackOverflowException)
> {
> printf("Stack blew up!\n");
> }

If you've just run out of storage, then calling a
procedure-with-parameter, which may itself require large
amounts of storage, is probably not a good idea. OK, you
could reserve a small amount of storage specifically for
"catching", but then the "catch" isn't normal user code,
which has its own problems. You would get away with it
[probably!] if the "try" uses large amounts of stack which
is recovered by the "throw", but that's not the general
case.

--
Andy Walker,
Nottingham.

Rod Pemberton

unread,

Oct 17, 2012, 9:08:36 PM10/17/12

"BGB" <cr8...@hotmail.com> wrote in message

news:k5nam3$kp5$1...@news.albasani.net...

> On 10/17/2012 5:50 AM, Rod Pemberton wrote:
> > "BGB" <cr8...@hotmail.com> wrote in message
> > news:k5khbh$f71$1...@news.albasani.net...
...

> >> I also realized some while looking at retrofitting a few ideas onto the
> >> existing interpreter:
> >> in my stack-machine interpreter, a very large number of instructions
> >> may throw exceptions (hindering the use of trace optimizations). (for
> >> example, it would need to be proven that the stack can neither
> >> underflow nor overflow during an operation, ...).
> >
> > Or, stack check on push/pop...
> >
> > Or, just abort...
> >
>
> checking for stack overflow is just the sort of cost I am trying to
> avoid here.
>

I know. I didn't want such code in my push/pop routines for my Forth
interpreter either. One of my pop routines actually has *two* types of
underflow checks ... It's horrid! I don't want them there. It slows
things down.

However, without underflow/overflow checks how do you confirm underflow or
overflow for push/pop? And, where would you? Do you just allow push/pop to
crash/halt the application or overwrite non-stack memory?

I've no elegant solution to this other than to insert the checks. I was
hoping to maybe analyze the program in more depth in the future to determine
other code locations where the checks could be placed. I.e., somewhere
where they weren't constantly being checked, but checked frequently enough
that they would prevent or detect most overflows/underflows prior to them
occurring.

> it is the same sort of issue as checking if an exception-state variable
> has been set. it seems simple enough, but it can eat lots of clock
> cycles in a tight loop.

Any additional operations in a tight, continuous loop can accumulate to
consume much time.

> typically, a stack might be accessed like:
> ctx->stackpos--;
> *(int *)ctx->stackpos=((int *)ctx->stackpos)[0]+
> ((int *)ctx->stackpos)[1];
>
> but, if we add something like:
> if(ctx->stackpos<0)
> { ... }
>
> then suddenly lots of clock cycles are being wasted just performing this
> silly little check.
>

True. Unfortunately for me, stack checks are needed, but very unwanted ...

> here is the inner-loop for the in-development interpreter:
> BGBFRIR_API FRIR_Trace *FRIR_Trace_RunDefault(
> FRIR_Context *ctx, FRIR_Trace *tr)
> {
> FRIR_Opcode *op, *ope;
>
> op=tr->ops;
> ope=tr->ops+tr->n_ops-1;
> while(op<ope)
> { op->fcn(ctx, op); op++; }
> return(tr->end(ctx, tr, op));
> }
>
> and the add-integer handler:
> BGBFRIR_API void FRIR_ThOp_ADD_I(FRIR_Context *ctx, FRIR_Opcode *op)
> { ctx->ri[op->c]=ctx->ri[op->a]+ctx->ri[op->b]; }
>

Okay, I usually don't comment on your code because 1) you post alot of it
and 2) it's generally abstract enough that I don't spend much time
attempting to understand it... This is somewhat abstract too. :-)

Clearly, you can rework the code to fix trivial stuff like eliminating the
need for "-1" for ope either by adjusting tr->n_ops or using op<=ope.

Anyway, I'm not entirely sure what you're doing with all the ctx and tr->ops
and op->fnc(ctx,op) stuff. The best I can tell, this is a loop through a
list of functions to be called. The functions get called. They also get
passed 'ctx' - which I'm not sure what is. I would guess that fcn and op
should be all that's needed ... However, I've seen other code use 'ctx' and
it's usually something like a register struct or a VM save state. If so, it
seems like it's doing some additonal work since ctx pointer keeps being
passed, i.e., allocated and deallocated on the stack for each function call.
Won't declaring the fcn function as "void func(void)" reduce some overhead?
It should eliminate the prolog(ue) or epilog(ue) code for the procedure.
But, that requires op and ctx to be globals instead of locals. For GCC,
you'll still end up with some stack adjustments upon procedure entry or
exit. For other C compilers, you should be able to make the function
"naked" by using "void func(void)". Or, you can use special compiler
attributes to do so.

Rod Pemberton

BGB

unread,

Oct 17, 2012, 10:25:28 PM10/17/12

presumably, this try/catch block would occur outside of the code which
used up all the stack-space and triggered the exception, and also the
exception handling does not itself use up stack space (throwing
basically causes all stack-frames to be unwound until it gets back to
the exception handler, and the exception-object is heap-allocated...).

granted, yes, another potential cause of error is an
"OutOfMemoryException", but this is potentially a bit more severe
(basically, it means that the heap is all used up and that the garbage
collector can't free up any more, so a memory-allocation request has
failed).

stack-overflows are potentially a little more common with my VM as it
typically uses smaller stacks than are generally used by native threads,
but this is counterbalanced by my VM not generally allocating storage
for objects or arrays on the stack.

also, the "printf" in BGBScript is implemented off in C land, so this
uses the native stack (rather than the VM stack).

BGB

unread,

Oct 17, 2012, 10:58:01 PM10/17/12

On 10/17/2012 8:08 PM, Rod Pemberton wrote:
> "BGB" <cr8...@hotmail.com> wrote in message
> news:k5nam3$kp5$1...@news.albasani.net...
>> On 10/17/2012 5:50 AM, Rod Pemberton wrote:
>>> "BGB" <cr8...@hotmail.com> wrote in message
>>> news:k5khbh$f71$1...@news.albasani.net...
> ...
>
>>>> I also realized some while looking at retrofitting a few ideas onto the
>>>> existing interpreter:
>>>> in my stack-machine interpreter, a very large number of instructions
>>>> may throw exceptions (hindering the use of trace optimizations). (for
>>>> example, it would need to be proven that the stack can neither
>>>> underflow nor overflow during an operation, ...).
>>>
>>> Or, stack check on push/pop...
>>>
>>> Or, just abort...
>>>
>>
>> checking for stack overflow is just the sort of cost I am trying to
>> avoid here.
>>
>
> I know. I didn't want such code in my push/pop routines for my Forth
> interpreter either. One of my pop routines actually has *two* types of
> underflow checks ... It's horrid! I don't want them there. It slows
> things down.
>
> However, without underflow/overflow checks how do you confirm underflow or
> overflow for push/pop? And, where would you? Do you just allow push/pop to
> crash/halt the application or overwrite non-stack memory?
>

static analysis:
if you know the max number of stack items ever used, it is possible to
allocate stack space for just that many items (and thus determine at
method-call time whether or not an overflow is possible).

this means that, granted, an overflow exception may be generated even if
the execution path which would generate the overflow is never called.

but, say, if we know that the method has, say, a maximum stack-depth of
11, then we only need to reserve space for 11 stack items.

similar goes for reserving space for things like lexical environment
frames, ...

I don't know whether this would be applicable or not to something like
Forth, but given the VM has a similar stack-usage restriction to the JVM
(namely that stack layout and element types be statically determinable),
this makes things a little easier.

> I've no elegant solution to this other than to insert the checks. I was
> hoping to maybe analyze the program in more depth in the future to determine
> other code locations where the checks could be placed. I.e., somewhere
> where they weren't constantly being checked, but checked frequently enough
> that they would prevent or detect most overflows/underflows prior to them
> occurring.
>

in my case, generally checks like this are done when trying to call into
a method (this is also when the "bytecode -> threaded-code" translation
takes place).

>> it is the same sort of issue as checking if an exception-state variable
>> has been set. it seems simple enough, but it can eat lots of clock
>> cycles in a tight loop.
>
> Any additional operations in a tight, continuous loop can accumulate to
> consume much time.
>

yes, this is why I want to factor this out, as at present, the
generation of exception events is a lot harder to factor out, so better
would be to execute code in a form where static elimination of most
possible exception cases is possible.

>> typically, a stack might be accessed like:
>> ctx->stackpos--;
>> *(int *)ctx->stackpos=((int *)ctx->stackpos)[0]+
>> ((int *)ctx->stackpos)[1];
>>
>> but, if we add something like:
>> if(ctx->stackpos<0)
>> { ... }
>>
>> then suddenly lots of clock cycles are being wasted just performing this
>> silly little check.
>>
>
> True. Unfortunately for me, stack checks are needed, but very unwanted ...
>

fair enough...

>> here is the inner-loop for the in-development interpreter:
>> BGBFRIR_API FRIR_Trace *FRIR_Trace_RunDefault(
>> FRIR_Context *ctx, FRIR_Trace *tr)
>> {
>> FRIR_Opcode *op, *ope;
>>
>> op=tr->ops;
>> ope=tr->ops+tr->n_ops-1;
>> while(op<ope)
>> { op->fcn(ctx, op); op++; }
>> return(tr->end(ctx, tr, op));
>> }
>>
>> and the add-integer handler:
>> BGBFRIR_API void FRIR_ThOp_ADD_I(FRIR_Context *ctx, FRIR_Opcode *op)
>> { ctx->ri[op->c]=ctx->ri[op->a]+ctx->ri[op->b]; }
>>
>
> Okay, I usually don't comment on your code because 1) you post alot of it
> and 2) it's generally abstract enough that I don't spend much time
> attempting to understand it... This is somewhat abstract too. :-)
>
> Clearly, you can rework the code to fix trivial stuff like eliminating the
> need for "-1" for ope either by adjusting tr->n_ops or using op<=ope.
>

I considered the possibility of having it be precomputed, but was left
to debate whether precomputing it would add more cost than calculating
it on the fly.

> Anyway, I'm not entirely sure what you're doing with all the ctx and tr->ops
> and op->fnc(ctx,op) stuff. The best I can tell, this is a loop through a
> list of functions to be called. The functions get called. They also get
> passed 'ctx' - which I'm not sure what is. I would guess that fcn and op
> should be all that's needed ... However, I've seen other code use 'ctx' and
> it's usually something like a register struct or a VM save state. If so, it
> seems like it's doing some additonal work since ctx pointer keeps being
> passed, i.e., allocated and deallocated on the stack for each function call.
> Won't declaring the fcn function as "void func(void)" reduce some overhead?
> It should eliminate the prolog(ue) or epilog(ue) code for the procedure.
> But, that requires op and ctx to be globals instead of locals. For GCC,
> you'll still end up with some stack adjustments upon procedure entry or
> exit. For other C compilers, you should be able to make the function
> "naked" by using "void func(void)". Or, you can use special compiler
> attributes to do so.
>

globals are no-go, mostly as they are not thread-safe (if used for
volatile state), and often the BGBScript VM may be used from multiple
threads at the same time (and each thread needs its own VM state).

the VM is mostly used by the "BGBTech Game Engine":
http://www.youtube.com/user/BGBTech
ex:
http://www.youtube.com/watch?v=E1kN_bM6KBs&feature=plcp

note that this 3D engine is itself internally multi-threaded in many areas.

or, in other words, using global variables would risk making stuff blow
up pretty damn hard... (and TLS costs more than passing state around in
arguments).

the ctx and op are basically the VM-state-context and opcode-state.

the reason for all the function-calls through function-pointers stuff is
that, in my tests, this has generally be one of the fastest ways I know
of to do this.

>
> Rod Pemberton
>
>

Dmitry A. Kazakov

unread,

Oct 18, 2012, 3:22:36 AM10/18/12

On Wed, 17 Oct 2012 21:25:28 -0500, BGB wrote:

> On 10/17/2012 8:01 PM, Andy Walker wrote:
>> On 18/10/12 00:05, BGB wrote:
>> [...]
>>> throw a "StackOverflowException"?...
>>>
>>> like, in script code:
>>> try {
>>> ...
>>> }catch(ex:StackOverflowException)
>>> {
>>> printf("Stack blew up!\n");
>>> }
>>
>> If you've just run out of storage, then calling a
>> procedure-with-parameter, which may itself require large
>> amounts of storage, is probably not a good idea. OK, you
>> could reserve a small amount of storage specifically for
>> "catching", but then the "catch" isn't normal user code,
>> which has its own problems. You would get away with it
>> [probably!] if the "try" uses large amounts of stack which
>> is recovered by the "throw", but that's not the general
>> case.
>
> presumably, this try/catch block would occur outside of the code which
> used up all the stack-space and triggered the exception, and also the
> exception handling does not itself use up stack space (throwing
> basically causes all stack-frames to be unwound until it gets back to
> the exception handler, and the exception-object is heap-allocated...).

A counterexample:

{
int Boo; // Allocated on the stack

try
{ ...
}
catch (ex:StackOverflowException)
{
printf("Stack blew up when Boo=%d!\n", Boo);
} }

If you switch stacks in exception handlers, you would have a problem with
accessing the stuff allocated on the stack by the enclosing context.

Anyway. Handling exceptions in this way is much like co-routine calls. A
co-routine may have stack of its own. I wonder what could become of that.
(Not returning to on-clauses of early PL/1 of course.)

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

BartC

unread,

Oct 18, 2012, 11:56:20 AM10/18/12

"BGB" <cr8...@hotmail.com> wrote in message

news:k5nam3$kp5$1...@news.albasani.net...

>>> as-is, the interpreter already translates the bytecode into a
>>> stack-based threaded-code format (which is faster than running it
>>> directly via a "switch()", but still could be somewhat faster).

> I went away from using switch a while ago, mostly as it takes a fairly
> severe performance hit for every loop iteration.

And I'm going away from fast threaded code, and towards using a switch!
That's because I might need to write in C in order to port to a different
processor, because getting back all the speed I've lost.

Using a very simple mock-up of the interpreter (just repeatedly executing a
single bytecode, that used for an unindexed loop, but it will test the
dispatching), I got these results (to execute this bytecode a billion
times):

4.5 seconds (C using a do-while loop, and calling handlers via function
pointers)
3.4 seconds (C using a big switch statement)
3.3 seconds (C using gotos to gcc's indirect labels; a kind of threaded
code)
2.1 seconds (x86 ASM also using direct threading)
1.9 seconds (The current interpreter also using direct threading)

So if I'm going to use C at all, then it will be probably be a switch.

>> Or, stack check on push/pop...
>>
>> Or, just abort...
>>
>
> checking for stack overflow is just the sort of cost I am trying to avoid
> here.

Why do you have to check for stack overflow, when a language such as C
doesn't bother?

Either make it optional (for debugging for example), or check it
infrequently, perhaps per function-call (since stack overflow is most likely
to be caused by out-of-control recursion) rather than per expression.
(Assuming it is too hard to detect an actual stack fault.)

> and, if our dispatch loop is something like:
> while(op && !ctx->ret)
> { op=op->fcn(ctx, op); }
>
> that "&& !ctx->ret" also adds a bit of cost.

You'd take that check out of the loop. I assume that checks whether the next
opcode is some sort of stop code. Since it's only ever going to be true once
in any program run (perhaps after a trillion iterations) it is rather
inefficient!

--
Bartc

BGB

unread,

Oct 18, 2012, 12:06:22 PM10/18/12

the variable declaration syntax would be more like:
var boo:int;

but, the stack issue is not actually an issue.

this is partly because local variables are not "only" stored on the
stack, but any captured bindings (due to a closure or otherwise) will
typically end up in heap-allocated environment frames.

actually, even then, within the current interpreter, variables have
their own stack, and there are parallel stacks for various tasks:
value-stack (holds intermediate values);
mark-stack (holds "marks" for the value-stack);
flow-control stack (holds call frames);
local variable stack (holds non-captured local variables and similar);
dynamic variable stack (holds any dynamically-scoped variables);
...

actually, the bytecode makes the lexical environment itself look like a
single big stack, except that this stack is not physically contiguous in
memory whenever closures are made (the lexical environment is more of a
linked-list structure).

currently, different threaded code is generated based on whether or not
the current function has captured bindings or is itself a closure
(captures bindings from a parent function).

in the case of FRIR, it would actually keep track of the difference
between lexical and local variables, and identifies lexical variables
via a pair of numbers (indicating how many levels to walk outwards, and
the binding within that level).

ex: "3, 8" means "up 3 levels, entry 8".

currently a stack-overflow exception doesn't actually indicate which
stack overflowed.

> Anyway. Handling exceptions in this way is much like co-routine calls. A
> co-routine may have stack of its own. I wonder what could become of that.
> (Not returning to on-clauses of early PL/1 of course.)
>

actually, it is not that the exception handling has its own stack, but
rather that "printf" has its own stack.

this is mostly because the current interpreter is internally implemented
using CPS (continuation-passing-style), and uses its own internal
stacks. similarly, calls within the VM don't actually use any more space
on the CPU stack, mostly because in C land, it is all just a single
trampoline loop which spins in place (so, the interpreter is not
recursive in this case).

when a call is made into C land (such as for "printf()"), it will use
the native CPU stack as the stack, rather than the stack internal to the
interpreter.

BGB

unread,

Oct 18, 2012, 12:55:51 PM10/18/12

On 10/18/2012 10:56 AM, BartC wrote:
> "BGB" <cr8...@hotmail.com> wrote in message
> news:k5nam3$kp5$1...@news.albasani.net...
>
>>>> as-is, the interpreter already translates the bytecode into a
>>>> stack-based threaded-code format (which is faster than running it
>>>> directly via a "switch()", but still could be somewhat faster).
>
>> I went away from using switch a while ago, mostly as it takes a fairly
>> severe performance hit for every loop iteration.
>
> And I'm going away from fast threaded code, and towards using a switch!
> That's because I might need to write in C in order to port to a
> different processor, because getting back all the speed I've lost.
>
> Using a very simple mock-up of the interpreter (just repeatedly
> executing a single bytecode, that used for an unindexed loop, but it
> will test the dispatching), I got these results (to execute this
> bytecode a billion times):
>
> 4.5 seconds (C using a do-while loop, and calling handlers via
> function pointers)
> 3.4 seconds (C using a big switch statement)
> 3.3 seconds (C using gotos to gcc's indirect labels; a kind of
> threaded code)
> 2.1 seconds (x86 ASM also using direct threading)
> 1.9 seconds (The current interpreter also using direct threading)
>
> So if I'm going to use C at all, then it will be probably be a switch.
>

I am doing threaded-code in C (via function pointers).
generally it was going faster than using switch statements.

granted, a possible issue is that MSVC generates some pretty bad code
for switch statements.

so, an indirect call via a function pointer will typically outperform
the "switch()" IME.

>>> Or, stack check on push/pop...
>>>
>>> Or, just abort...
>>>
>>
>> checking for stack overflow is just the sort of cost I am trying to
>> avoid here.
>
> Why do you have to check for stack overflow, when a language such as C
> doesn't bother?
>

mostly to lessen the chance of accidentally crashing the host
application. the idea is that even with bad code, the host app should be
able to handle it gracefully.

> Either make it optional (for debugging for example), or check it
> infrequently, perhaps per function-call (since stack overflow is most
> likely to be caused by out-of-control recursion) rather than per
> expression. (Assuming it is too hard to detect an actual stack fault.)
>

the current plan is to do it per-call.

>> and, if our dispatch loop is something like:
>> while(op && !ctx->ret)
>> { op=op->fcn(ctx, op); }
>>
>> that "&& !ctx->ret" also adds a bit of cost.
>
> You'd take that check out of the loop. I assume that checks whether the
> next opcode is some sort of stop code. Since it's only ever going to be
> true once in any program run (perhaps after a trillion iterations) it is
> rather inefficient!
>

basically it indicates some need to break out of the main instruction
dispatch loop.

usually, this would be for a few scenarios:
the interpreted code has returned from the outermost level;
the code has blocked or executed a sleep call;
an exception has been thrown;
a call is being made back into C land;
...

the goal would be to organize instructions into "traces", where this
doesn't have to be checked for nearly as often.

Kaz Kylheku

unread,

Oct 18, 2012, 2:19:56 PM10/18/12

Well, no, because, obviously, the unwinding would restore the stack to the
correct frame. At the point where the exception is caught, unwinding has
already taken place. Thus the condition of a stack overflow no longer exists.
That happened in the try block which tried to allocate more stack space, but
all of that is undone at this point.

The switching to a some emergency stack is only for the purpose of the
exception handling mechanism which may need its own stack frames.

Anyway, in protected operating systems, interrupts and exceptions (the
real, architectural ones) are handled on a different stack inside the
kernel. You cannot rely on unprivileged mode to have a stack for catching
interrupts and exceptions. Suppose user space has pushed its stack pointer all
the way to the edge of available space so that the next address is an unmapped
page. In that situation, interrupts like the clock tick or network receive
still have to work! They do because the exception handling in the architecture
switches to a kernel stack before saving the machine state.

A stack overflow situation might be intercepted at the CPU level, and
turned into an exception that traps to the kernel, on the kernels' stack.
That kernel code could pull some strings to arrange for the user code to have a
different stack for running the user handler for that condition.

Unix has a such a thing: sigaltstack.

http://pubs.opengroup.org/onlinepubs/007904875/functions/sigaltstack.html

"The sigaltstack() function allows a process to define and examine the state of
an alternate stack for signal handlers for the current thread. Signals that
have been explicitly declared to execute on the alternate stack shall be
delivered on the alternate stack."

This doesn't mean that there is no linkage from the alternate stack to the
parent stack. It may still be possible to traverse the stack from the alternate
one to the original one to perform an exception search and do unwinding.

BGB

unread,

Nov 20, 2012, 2:57:02 PM11/20/12

On 10/16/2012 3:47 PM, BGB wrote:
> basically, this is just a few globs from some recent emails, mentioning
> something I am currently looking into.
>
> the idea would be to develop a new faster threaded-code backend for the
> BGBScript VM, which would basically be used "behind the scenes" while
> still keeping the existing high-level bytecode.
>
> as-is, the interpreter already translates the bytecode into a
> stack-based threaded-code format (which is faster than running it
> directly via a "switch()", but still could be somewhat faster).
>

after being on hold for a while (I was busy working more on other
stuff), I finished implementing it at least enough to test the basic design.

I was working some on adding basic mechanisms, like function calls, but
am left thinking more about "efficient" ways to manage scoping, and
hopefully not be left with ugly problem areas or poor performance (the
whole effort would be pointless if most of the speedup is squandered
away again with something like slow method or variable lookups).

a partial tradeoff is between "directness" and "mutability", for
example, consider I am calling a "static method":
I can cache the static method with the instruction, so that the next
time around the static method is invoked directly, and does not incur a
performance hit from having to fetch it. but what then if the class
containing the method is redefined and the method is replaced?...

this implies the added cost of fetching the method again each time from
the class's vtable (~ 45ns).

the only other obvious solution is for the interpreter to basically
retain pointers to all static method calls, which can then be flushed
as-needed (setting the method-pointer back to NULL, and setting the
handler from the direct-call case to the "lookup and then attempt to
call" case).

another alternative here being of course a kind of "static method call
table", where basically all of the active static methods are laid out in
a big global table which calls are directed through. the static-calls
would then hold pointers to the table entries. (note that entries would
likely be keyed as "class-qname"+"methodname"+"signature").

as of yet, I have doubts about strategies that would nail down the class
or object types (as-is, there are currently 2 major types of object, one
used for class/instance and dynamic classes, and another used for
prototype-OO via ex-nihilo objects, which are currently also used to
implement packages).

as-is, the currently existing interpreter will just fetch the method
from the class/object each time (and then use type-checks to see how to
apply it). this works but isn't particularly efficient.

otherwise, here is something from an email:

<---
after fiddling around with image compression stuff more, and a brief run
of trying to optimize my 3D engine somewhat, I went and finally got
around to finishing up the new "FRIR" (the "register IR") interpreter
enough to test it (at least in a very minimalistic form).

so, for the FRIR interpreter test, the equivalent of a "for()" loop
spins 45.2M cycles per second (23.23s to spin 1,000,000,000 times).

native "for()" loop, 385M cycles per second (2.6s).
IOW: "for(i=0; i<1000000000; i++);"

difference: 8.9x of native.

this seems promising, as it is nearly 7x as fast as my current interpreter.

this is not to say though that it wouldn't slow down once weighted down
by all the other crap and used for "actual" script-code.

though, it would probably at least be somewhat faster, as the existing
script interpreter is still a bit of a cobbled-together mess...

not like my other "real" benchmarks were that realistic, as most were
doing things like the recursive fibonacci function and implementing
bubble-sort and similar, which aren't likely to be particularly
representative of the normal workloads... (which has thus far been
mostly for running part of the AIs for several of the monster types,
among other mostly misc things...).

will see though if I actually do much with it.
it would still take a bit of work to really make the new backend "ready
for use".
--->

I went and calculated, the above works out to roughly 22ns per loop
iteration (in the interpreter), vs around 2.6ns per-iteration in native.

0 new messages