Account Options

  1. Sign in
The old Google Groups will be going away soon, but your browser is incompatible with the new version.
Google Groups Home
« Groups Home
misc: new interpreter ("Fast RIR")
There are currently too many topics in this group that display first. To make this topic appear first, remove this option from another topic.
There was an error processing your request. Please try again.
flag
  15 messages - Collapse all  -  Translate all to Translated (View all originals)
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
BGB  
View profile  
 More options Oct 16 2012, 4:50 pm
Newsgroups: comp.lang.misc
From: BGB <cr88...@hotmail.com>
Date: Tue, 16 Oct 2012 15:47:33 -0500
Local: Tues, Oct 16 2012 4:47 pm
Subject: misc: new interpreter ("Fast RIR")
basically, this is just a few globs from some recent emails, mentioning
something I am currently looking into.

the idea would be to develop a new faster threaded-code backend for the
BGBScript VM, which would basically be used "behind the scenes" while
still keeping the existing high-level bytecode.

as-is, the interpreter already translates the bytecode into a
stack-based threaded-code format (which is faster than running it
directly via a "switch()", but still could be somewhat faster).

so, the general idea would be to "unwind" the stack-based bytecode
(vaguely like JVM or AVM2 bytecode) into using a register-machine model
(more like Dalvik or Parrot), and essentially use a register-machine to
execute the code.

a lesser goal is that it could also potentially lessen the work of
writing a JIT backend (to allow for a smaller and simpler JIT, which
should ideally be easier to implement, debug, and maintain).

First Message:
<===
but, I just went and wrote up a basic "idea spec" for the new
interpreter idea, which I decided to call FastRIR.

it is yet to be proven though whether it would be much faster than the
current interpreter (the number of inner-loop pointer-operations is not
likely to be much different than my current interpreter).

note that the basic premise of operation for this sort of interpreter is
that the function-pointers inside the structures are generally what are
used to direct control flow (so, the logic that sets up the execution
path will generally set up the structures and fill in function pointers
as-appropriate).

this is in contrast to a more traditional "while loop and switch"
interpreter, but this style of interpreters tend to be a little faster
IME (without being drastically more complicated).

this is far from comprehensive, so a little imagination would be needed
to "fill in the gaps".

(the main focus here is on the part of the interpreter which tends to
consume the bulk of the execution time, which often tends to be the
thing spinning in a loop and executing operations).

any ideas for how to make such a design idea faster are welcome (apart
maybe from putting everything in globals, which yes, can make things
faster, but kills thread-safety, and also doesn't help much on ELF targets).

but, anyways, here is the current idea spec:
-----

Fast Register IR:
Low-level statically-untyped (weakly-typed) register-machine model.

Operations are statically typed and operate on registers.
Most operations will follow a 3 register form.

add_i dest, srca, srcb

so:
add_i, add_l, add_f, add_d
sub_i, sub_l, sub_f, sub_d
...

This will be used for a threaded-code interpreter.
FRIR code will be organized into traces.

FRIR will have up to 256 registers of each type.

union FRIR_Value_u
{
     s32 i;
     s64 l;
     float f;
     double d;
     byte *pb;
     dyt r;

};

struct FRIR_Frame_s {
FRIR_Trace *trace;        //return trace
int *ri;
s64 *rl;
float *rf;
double *rd;
byte **ra;
FRIR_Value *vspos;    //var stack pos
FRIR_Value *aspos;    //arg stack pos

};

struct FRIR_Context_s {
FRIR_Trace *trace;                //current trace
FRIR_Frame **framestack;        //frame stack base
FRIR_Value *varstack;            //var stack base
FRIR_Value *argstack;            //arg stack base

FRIR_Frame *frame;                //active frame
FRIR_Frame **fspos;                //frame stack position
FRIR_Value *vspos;                //var stack pos
FRIR_Value *aspos;                //arg stack pos

int ret;    //ctx return state (0=normal execution)

//register state (shared with frame)
int *ri;
s64 *rl;
float *rf;
double *rd;
byte **ra;

};

The return state will be a magic number indicating the current state of
the interpreter. This value will be kept to 0 during normal execution,
but will be set for events such as an exception being thrown or a
function call/return event, which will be handled outside the normal
execution loop.

A method will hold an array of opcodes and traces.
The opcode array will hold all opcodes in the method.
Traces will represent subsets of this opcode array.
A trace may not contain a jump or an operation which may throw an
exception, but may terminate with such an instruction.

struct FRIR_Method_s {
FRIR_Opcode *ops;        //first opcode in method
FRIR_Trace *trace;        //first trace in method
int n_ops;                //number of opcodes
int n_trace;            //number of traces

};

struct FRIR_Trace_s {
FRIR_Trace *next;        //next trace (linear, per-method)
FRIR_Trace *jmpnext;    //jump-next trace (direct jump)
FRIR_Opcode *ops;        //opcodes in trace
// FRIR_Opcode *opse;    //end of opcodes
int n_ops;                //number of opcodes
FRIR_Trace *(*run)(FRIR_Context *ctx, FRIR_Trace *tr);
FRIR_Trace *(*end)(FRIR_Context *ctx,
     FRIR_Trace *tr, FRIR_Opcode *op);

};

run will be the function for handling executing a trace.
end will be called at the end of a trace, and will give the next trace
to execute.

struct FRIR_Opcode_s {
void (*fcn)(FRIR_Context *ctx, FRIR_Opcode *op);
int op;
int a, b, c;
FRIR_Value i;

};

Registers:
     RI#, Register Int
     RL#, Register Long
     RF#, Register Float
     RD#, Register Double
     RA#, Register Address (Raw Pointer)

lea_b dest(RA), base(RA), index(RI)
lea_w dest(RA), base(RA), index(RI)
lea_i dest(RA), base(RA), index(RI)

lea2_b dest(RA), base(RA), index(RI), offset(II)
lea2_w dest(RA), base(RA), index(RI), offset(II)
lea2_i dest(RA), base(RA), index(RI), offset(II)

void ThOp_Add_I(FRIR_Context *ctx, FRIR_Opcode *op)
{
     ctx->ri[op->c]=ctx->ri[op->a]+ctx->ri[op->b];

}

void ThOp_Lea_I(FRIR_Context *ctx, FRIR_Opcode *op)
{
     ctx->ra[op->c]=ctx->ra[op->a]+
         ctx->ri[op->b]*sizeof(int);

}

void ThOp_Lea2_I(FRIR_Context *ctx, FRIR_Opcode *op)
{
     ctx->ra[op->c]=ctx->ra[op->a]+
         (ctx->ri[op->b]+op->i.i)*sizeof(int);

}

//default end function for trace
void FRIR_Trace_EndDefault(FRIR_Context *ctx,
     FRIR_Trace *tr, FRIR_Opcode *op)
{
     return(tr->next);

}

//default end function for trace (unconditional jump)
void FRIR_Trace_EndJumpDefault(FRIR_Context *ctx,
     FRIR_Trace *tr, FRIR_Opcode *op)
{
     return(tr->jmpnext);

}

//default run function for trace
void FRIR_Trace_RunDefault(FRIR_Context *ctx, FRIR_Trace *tr)
{
     FRIR_Opcode *op, *ope;

     op=tr->ops;
     ope=tr->ops+tr->n_ops-1;
     while(op<ope)
         { op->fcn(ctx, op); op++; }
     return(tr->end(ctx, tr, op));

}

The above may be unrolled, as in:
void FRIR_Trace_RunDefault4(FRIR_Context *ctx, FRIR_Trace *tr)
{
     FRIR_Opcode *op;

     op=tr->ops;
     op->fcn(ctx, op); op++;
     op->fcn(ctx, op); op++;
     op->fcn(ctx, op); op++;
     return(tr->end(ctx, tr, op));

}

or:
void FRIR_Trace_RunDefault8(FRIR_Context *ctx, FRIR_Trace *tr)
{
     FRIR_Opcode *op;

     op=tr->ops;
     op->fcn(ctx, op); op++;
     op->fcn(ctx, op); op++;
     op->fcn(ctx, op); op++;
     op->fcn(ctx, op); op++;
     op->fcn(ctx, op); op++;
     op->fcn(ctx, op); op++;
     op->fcn(ctx, op); op++;
     return(tr->end(ctx, tr, op));

}

The combination of unrolling and possibly adding redundancy should allow
some exploitation of the branch-predictor.

void FRIR_Thread_RunStep(FRIR_Context *ctx)
{
     FRIR_Trace *tr;

     tr=ctx->trace;
     while(!ctx->ret)
         { tr=trace->run(ctx, tr); }
     ctx->trace=tr;

}

===>

Second Message:
<===
investigating more (and [after] looking some at Dalvik, ...) it seems
there is a bit more room for possible optimizations at the interpreter
level for a register-IR.

a register IR also allows for more easily eliminating common
sub-expressions (could likely be "recognized" from the stack-based input
bytecode).

(I also looked at Parrot and LLVM IR, but these are less closely related
to my current VM).

as-is, things are still in a "general design and basic experiments" stage...

apparently, people working in similar situations in the past, were able
to get a roughly 25%-30% speedup by translating to a register IR (for an
interpreter). this was for Java ByteCode.

some further speedup is possible as it is considered that the conversion
step would also better nail down things like variable scope and object
field access, which are currently handled dynamically (FRIR will be much
lower-level than the BSVM bytecode).

for example, FRIR will distinguish between local variables, function
arguments, and captured lexical variables (in the BSVM bytecode, these
are all treated as a single unified "stack", but which may not be
physically continuous in memory, essentially requiring a linked-list walk).

basically, for example, it could possibly optimize accessing an object
slot from something like (present):
check object type;
fetch slot by name (hash-key lookup);
dispatch by field type and fetch value.

to something more like:
*(int *)(((byte *)obj->data)+op->i.i);
which would likely be significantly faster than the current logic.

(the above case could be optimized in the current interpreter, but I
hadn't done so previously because it would be a hassle, requiring more
complex logic for emitting the threaded code...).

a similar speedup may be possible for array handling (it will no longer
need to type-check the array, like to determine that it is an array).

these will likely be specific to typed variables, so:
var obj:MyClass = new MyClass();
obj.x=3;
or:
var arr:int[] = new int[256];
arr[31]=4;

but may not (necessarily) work for untyped variables:
"var obj;" or "var arr;".
which will far more likely end up using dynamic type-checks if the VM
can't type-infer them (less likely at present, as the existing
type-inference is
...

read more »


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Discussion subject changed to "new interpreter ("Fast RIR")" by Rod Pemberton
Rod Pemberton  
View profile  
 More options Oct 17 2012, 6:46 am
Newsgroups: comp.lang.misc
From: "Rod Pemberton" <do_not_h...@notemailnotz.cnm>
Date: Wed, 17 Oct 2012 06:50:11 -0400
Local: Wed, Oct 17 2012 6:50 am
Subject: Re: new interpreter ("Fast RIR")
"BGB" <cr88...@hotmail.com> wrote in message

news:k5khbh$f71$1@news.albasani.net...
...

> the idea would be to develop a new faster threaded-code backend for the
> BGBScript VM, which would basically be used "behind the scenes" while
> still keeping the existing high-level bytecode.

As you may know, I've been developing an ITC (threaded code) Forth
interpreter in C for a few years now.  One issue is with branching.  I don't
believe the issue is Forth specific.  I think applies to all threaded code.
Threaded code is not well suited to branching or most control flow, but
branching can be made to work.  Forth has no formal definition for it's
branching "words" (routines).  Historically, Forth calls them: BRANCH and
either 0BRANCH or ?BRANCH.  The first is an absolute branch, and the
second with two common names is a conditional branch.

> as-is, the interpreter already translates the bytecode into a
> stack-based threaded-code format (which is faster than running it
> directly via a "switch()", but still could be somewhat faster).

Well, that's good to know!

Most Forth's coded in C use a switch() since that's the C way of doing
things.  I didn't use a C switch() statement for my interpreter.  I
interpret the same way as early ITC Forth's in assembly, but using C
instead of assembly.

> so, the general idea would be to "unwind" the stack-based bytecode
> (vaguely like JVM or AVM2 bytecode) into using a register-machine model
> (more like Dalvik or Parrot), and essentially use a register-machine to
> execute the code.

Have you tried any of this already?  Have you analyzed how many stacked
items are typically used at a time?  I.e., does your bytecode only use two
or three items at once?

From my own experiences, I can "see" issues with needing a register
allocator, keeping track of which data is in which registers, spilling of
registers when you run out of available registers, etc.

> any ideas for how to make such a design idea faster are welcome (apart
> maybe from putting everything in globals, which yes, can make things
> faster, but kills thread-safety, and also doesn't help much on ELF
> targets).

I ran into a few issues with making my ITC interpreter faster in C.

The first issue is that the "register" keyword only works on "auto" or
procedure local variables for GCC.  GCC deprecated the use of the "register"
keyword for file scope or global variables.

The second issue is:

"If I can't use the 'register' keyword on globals, and I've got four
procedures operating on my globals, how do I get all four procedures merged
into a _single_ procedure so that my globals become 'auto' types and can be
optimized?"

The first answer to that is something C doesn't have: nested procedures.
The second answer to that -possibly- is something C doesn't have either:
inheritance.  The third answer to that is: "I'm still working on that one!"
At this point, I'm considering *anything*, like even a small switch() or
Duff's device, in order to get 'auto' types.  I just haven't attempted to
get something working, yet.

> apparently, people working in similar situations in the past, were able
> to get a roughly 25%-30% speedup by translating to a register IR (for an
> interpreter). this was for Java ByteCode.

...

> I also realized some while looking at retrofitting a few ideas onto the
> existing interpreter:
> in my stack-machine interpreter, a very large number of instructions may
> throw exceptions (hindering the use of trace optimizations). (for
> example, it would need to be proven that the stack can neither underflow
> nor overflow during an operation, ...).

Or, stack check on push/pop...

Or, just abort...

> design changes from previously:
> 'A' type was changed from being a raw pointer to a pointer-sized
> integer, mostly as it became apparent that the interpreter logic would
> require an excessive amount of casting in C, so it would be easier just
> to treat it as an integer type and cast back to a pointer as-needed.

Exactly.  You can't dereference a pointer type in C and end up with the
exact same type.  Well, except maybe a void*, but then you can't access data
at the pointer because it's void...  So, casting in C is required somewhere
threaded code.

Rod Pemberton


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
rugx...@gmail.com  
View profile  
 More options Oct 17 2012, 5:21 pm
Newsgroups: comp.lang.misc
From: rugx...@gmail.com
Date: Wed, 17 Oct 2012 14:21:56 -0700 (PDT)
Local: Wed, Oct 17 2012 5:21 pm
Subject: Re: new interpreter ("Fast RIR")
Hi,

N.B. My experience is very low, but I'll try to add a few tidbits.

This is no surprise, but indeed "register" doesn't always help very much. It's not even required to do anything, IIRC. And I don't know of anybody getting much benefit from using it with GCC. At least my own wimpy uses didn't show any noticeable benefit. But I didn't look too hard to find a good example (perhaps GForth??). Maybe x86-64 helps more, but I mostly doubt it. In other words, I wouldn't pin too many hopes on this.

> The second issue is:
> "If I can't use the 'register' keyword on globals, and I've got four
> procedures operating on my globals, how do I get all four procedures merged
> into a _single_ procedure so that my globals become 'auto' types and can be
> optimized?"

Whole-program optimization or maybe LTO? Honestly, no idea.

> The first answer to that is something C doesn't have: nested procedures.

GCC supports nested procedures as an extension. And most Algol-inspired languages do also (Pascal, Modula-2, Modula-3, Oberon). However, I have no idea if even those implementations using GCC as backend would optimize for register use in that way. I guess in theory you could use GNAT / Ada for whatever and link that to your binary, but I doubt that's a reasonable approach, if it helped at all.

> At this point, I'm considering *anything*, like even a small switch() or
> Duff's device, in order to get 'auto' types.  I just haven't attempted to
> get something working, yet.

Duff's device sounds interesting but is so weird that I can't exactly understand it (nor tried). Normally speed isn't such a dire issue. I mean, it's nice, but sometimes it's just easier to wait! Or write in assembly. Or upgrade your compiler.  ;-)

Just see what GForth does, they seem to know what they're doing (GCC's computed goto?).

> > apparently, people working in similar situations in the past, were able
> > to get a roughly 25%-30% speedup by translating to a register IR (for an
> > interpreter). this was for Java ByteCode.

You didn't mention Lua here, but since 5.x they now use a register-based VM (as opposed to previous stack-based), which they claim is noticeably faster. You may want to read their paper _The implementation of Lua 5.0_ for hints. See http://www.lua.org/doc/jucs05.pdf .

> > I also realized some while looking at retrofitting a few ideas onto the
> > existing interpreter:
> > in my stack-machine interpreter, a very large number of instructions may
> > throw exceptions (hindering the use of trace optimizations). (for
> > example, it would need to be proven that the stack can neither underflow
> > nor overflow during an operation, ...).

> Or, stack check on push/pop...
> Or, just abort...

What else can you do? It's either malloc more, halt, or crash. I can't think of any other options. Perhaps realloc or free some unnecessary stuff or compress your data structures (at runtime??).

 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
BGB  
View profile  
 More options Oct 17 2012, 6:15 pm
Newsgroups: comp.lang.misc
From: BGB <cr88...@hotmail.com>
Date: Wed, 17 Oct 2012 17:12:12 -0500
Local: Wed, Oct 17 2012 6:12 pm
Subject: Re: new interpreter ("Fast RIR")
On 10/17/2012 5:50 AM, Rod Pemberton wrote:

BGBScript has been in-use and incrementally developed since about 2004.
I have been using threaded-code since around November 2011.

so, the point now is mostly looking into taking a different and more
efficient route towards the interpreter backend.

note that, unlike Forth or similar, the BGBScript bytecode follows
similar stack-use restrictions to JVM bytecode, basically that on a
given jump/label, the stack layout is required to match.

the prior interpreter basically operated as a sort of
continuation-passing-style, where every opcode would return the next
opcode to execute.

this time I am factoring this out and instead using "traces".
the hope is that by better nailing down the opcode-level control flow, a
more efficient means of executing the instruction traces may be used.

>> as-is, the interpreter already translates the bytecode into a
>> stack-based threaded-code format (which is faster than running it
>> directly via a "switch()", but still could be somewhat faster).

> Well, that's good to know!

> Most Forth's coded in C use a switch() since that's the C way of doing
> things.  I didn't use a C switch() statement for my interpreter.  I
> interpret the same way as early ITC Forth's in assembly, but using C
> instead of assembly.

I went away from using switch a while ago, mostly as it takes a fairly
severe performance hit for every loop iteration.

threaded code is a little faster, but all of those function
calls/returns and indirect function calls are also expensive (note:
can't use computed goto as my interpreter is not GCC specific).

sadly, there is no really "good" option short of writing the whole
interpreter core in ASM, and by this point I may as well just write a JIT.

>> so, the general idea would be to "unwind" the stack-based bytecode
>> (vaguely like JVM or AVM2 bytecode) into using a register-machine model
>> (more like Dalvik or Parrot), and essentially use a register-machine to
>> execute the code.

> Have you tried any of this already?  Have you analyzed how many stacked
> items are typically used at a time?  I.e., does your bytecode only use two
> or three items at once?

typically, all of the items in a typical expression will end up on the
stack.

this will likely exceed the number of x86 registers, but the threaded
code will use a "potentially unbounded" number of virtual registers.

if a JIT were used, likely it would use a mapping heuristic to figure
out which VM registers to map to hardware registers (this would more
likely apply to x86-64, I suspect that 32-bit x86 is sufficiently
register-starved to where any sort of aggressive mapping would likely do
more harm than good, and ESI/EDI would likely be needed mostly for
holding the current VM state).

>  From my own experiences, I can "see" issues with needing a register
> allocator, keeping track of which data is in which registers, spilling of
> registers when you run out of available registers, etc.

this would be more for a native code generator, not so much for a
register-IR, where the registers aren't so much "registers" in the CPU
sense, but are more closely related to "temporary variables".

so, you can avoid the need to "spill" by simply having 256 (or more) as
the max number of VM registers, and by making the VM preserve register
state across calls.

the need for a full register allocator can also be avoided at this stage
as well (they can be allocated either sequentially or in a stack-like
manner.

the "register" keyword doesn't really usually make much of a difference,
so I don't really use it.

checking for stack overflow is just the sort of cost I am trying to
avoid here.

it is the same sort of issue as checking if an exception-state variable
has been set. it seems simple enough, but it can eat lots of clock
cycles in a tight loop.

typically, a stack might be accessed like:
ctx->stackpos--;
*(int *)ctx->stackpos=((int *)ctx->stackpos)[0]+
        ((int *)ctx->stackpos)[1];

but, if we add something like:
if(ctx->stackpos<0)
        { ... }

then suddenly lots of clock cycles are being wasted just performing this
silly little check.

and, if our dispatch loop is something like:
while(op && !ctx->ret)
        { op=op->fcn(ctx, op); }

that "&& !ctx->ret" also adds a bit of cost.

here is the inner-loop for the in-development interpreter:
BGBFRIR_API FRIR_Trace *FRIR_Trace_RunDefault(
        FRIR_Context *ctx, FRIR_Trace *tr)
{
        FRIR_Opcode *op, *ope;

        op=tr->ops;
        ope=tr->ops+tr->n_ops-1;
        while(op<ope)
                { op->fcn(ctx, op); op++; }
        return(tr->end(ctx, tr, op));

}

and the add-integer handler:
BGBFRIR_API void FRIR_ThOp_ADD_I(FRIR_Context *ctx, FRIR_Opcode *op)
        { ctx->ri[op->c]=ctx->ri[op->a]+ctx->ri[op->b]; }

>> design changes from previously:
>> 'A' type was changed from being a raw pointer to a pointer-sized
>> integer, mostly as it became apparent that the interpreter logic would
>> require an excessive amount of casting in C, so it would be easier just
>> to treat it as an integer type and cast back to a pointer as-needed.

> Exactly.  You can't dereference a pointer type in C and end up with the
> exact same type.  Well, except maybe a void*, but then you can't access data
> at the pointer because it's void...  So, casting in C is required somewhere
> threaded code.

this more has to do with the new interpreter having a lot of operations
for pointer arithmetic, like "lea" (for calculating addressed) and
"xloada" (for loading the address of a field into a register) and similar.

an integer type makes more sense, but leaves the minor annoyance of sign
and address-size (x86 addresses are unsigned and 32-bit, x86-64 is
signed and 64-bit), requiring a little bit of special handling in
certain cases.

at the IR level, the 'A' (address type) is partly disjoint from the
integer types, so at least the issue shouldn't be apparent.

there are operations like: "conv_a2l" and "conv_l2a" to convert between
pointers and long (signed 64-bit), but doing this is not as efficient on
32-bit x86 (shouldn't matter too much, as this isn't likely to be all
that common).

but, as-is, IR would look something like:
llload_i 2, 7   //load local var 7 into r2
llload_i 3, 8   //load local var 8 into r3
add_i 2, 2, 3   //add r2=r2+r3
llstore_i 2, 9  //store r2 into local var 9

there may likely end up being a few more special cases, like:
lladd_ic 11, 2  //add 2 to local var 11
which would in C look something like:
BGBFRIR_API void FRIR_ThOp_LLADD_IC(FRIR_Context *ctx, FRIR_Opcode *op)
        { ctx->llvars[op->c].i+=op->i.i; }

or such...


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
BGB  
View profile  
 More options Oct 17 2012, 7:09 pm
Newsgroups: comp.lang.misc
From: BGB <cr88...@hotmail.com>
Date: Wed, 17 Oct 2012 18:05:53 -0500
Local: Wed, Oct 17 2012 7:05 pm
Subject: Re: new interpreter ("Fast RIR")
On 10/17/2012 4:21 PM, rugx...@gmail.com wrote:

yeah, hence why I don't use it personally...

>> The second issue is:
>> "If I can't use the 'register' keyword on globals, and I've got four
>> procedures operating on my globals, how do I get all four procedures merged
>> into a _single_ procedure so that my globals become 'auto' types and can be
>> optimized?"

> Whole-program optimization or maybe LTO? Honestly, no idea.

>> The first answer to that is something C doesn't have: nested procedures.

> GCC supports nested procedures as an extension. And most Algol-inspired languages do also (Pascal, Modula-2, Modula-3, Oberon). However, I have no idea if even those implementations using GCC as backend would optimize for register use in that way. I guess in theory you could use GNAT / Ada for whatever and link that to your binary, but I doubt that's a reasonable approach, if it helped at all.

personally, I am writing code that may be compiled with either GCC or
with MSVC / Visual Studio.

this mostly precludes the use of compiler-specific features.

>> At this point, I'm considering *anything*, like even a small switch() or
>> Duff's device, in order to get 'auto' types.  I just haven't attempted to
>> get something working, yet.

> Duff's device sounds interesting but is so weird that I can't exactly understand it (nor tried). Normally speed isn't such a dire issue. I mean, it's nice, but sometimes it's just easier to wait! Or write in assembly. Or upgrade your compiler.  ;-)

another drawback of Duffs device is that it still doesn't really escape
the relatively high constant-time cost of a switch.

granted, yes, conditionals are also expensive, so the number of
conditional expressions and unpredictable branches needs to hopefully be
minimized.

> Just see what GForth does, they seem to know what they're doing (GCC's computed goto?).

computed goto is fast, but yeah, kind of a problem that MSVC can't do it.

ironically, BGBScript has computed goto (but this is of little help in
implementing an interpreter for it).

>>> apparently, people working in similar situations in the past, were able
>>> to get a roughly 25%-30% speedup by translating to a register IR (for an
>>> interpreter). this was for Java ByteCode.

> You didn't mention Lua here, but since 5.x they now use a register-based VM (as opposed to previous stack-based), which they claim is noticeably faster. You may want to read their paper _The implementation of Lua 5.0_ for hints. See http://www.lua.org/doc/jucs05.pdf .

I wasn't aware of Lua using a register VM.

I was aware of Dalvik and Parrot though, so these are mostly what I was
looking into.

note that the use of a register interpreter backend will not effect the
use of a high-level stack-based bytecode.

>>> I also realized some while looking at retrofitting a few ideas onto the
>>> existing interpreter:
>>> in my stack-machine interpreter, a very large number of instructions may
>>> throw exceptions (hindering the use of trace optimizations). (for
>>> example, it would need to be proven that the stack can neither underflow
>>> nor overflow during an operation, ...).

>> Or, stack check on push/pop...
>> Or, just abort...

> What else can you do? It's either malloc more, halt, or crash. I can't think of any other options. Perhaps realloc or free some unnecessary stuff or compress your data structures (at runtime??).

throw a "StackOverflowException"?...

like, in script code:
try {
        ...

}catch(ex:StackOverflowException)

{
        printf("Stack blew up!\n");

}

but, anyways, factoring out the stack can improve performance because we
will know (at method call time) whether or not the stack has overflowed,
and so can check and throw as part of calling the method, rather than
having to check every time a value is pushed onto the stack.

"abort()" is no-go, given the type of app I am developing...

usually, if the exception can't be handled in script code, then the VM
will set up for an "uncaught exception" and return back into C land with
an error code (this was chosen as preferable to trying to throw an
exception in C land, which is kind of a mess).

theoretically, it means that code calling into the VM "should" check for
an exception whenever an error code is returned, or the return-value is
"undefined", but sadly, a lot of code doesn't bother.

mostly this doesn't matter as much, as script-code is generally
considered as less important than the underlying C code for the program.

note that script code may also throw things like:
"NullPointerException", "UndefinedMethodException", ...

or such...


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Andy Walker  
View profile  
 More options Oct 17 2012, 9:01 pm
Newsgroups: comp.lang.misc
From: Andy Walker <n...@cuboid.co.uk>
Date: Thu, 18 Oct 2012 02:01:39 +0100
Local: Wed, Oct 17 2012 9:01 pm
Subject: Re: new interpreter ("Fast RIR")
On 18/10/12 00:05, BGB wrote:
[...]

> throw a "StackOverflowException"?...

> like, in script code:
> try {
>      ...
> }catch(ex:StackOverflowException)
> {
>      printf("Stack blew up!\n");
> }

        If you've just run out of storage, then calling a
procedure-with-parameter, which may itself require large
amounts of storage, is probably not a good idea.  OK, you
could reserve a small amount of storage specifically for
"catching", but then the "catch" isn't normal user code,
which has its own problems.  You would get away with it
[probably!] if the "try" uses large amounts of stack which
is recovered by the "throw", but that's not the general
case.

--
Andy Walker,
Nottingham.


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Rod Pemberton  
View profile  
 More options Oct 17 2012, 9:04 pm
Newsgroups: comp.lang.misc
From: "Rod Pemberton" <do_not_h...@notemailnotz.cnm>
Date: Wed, 17 Oct 2012 21:08:36 -0400
Local: Wed, Oct 17 2012 9:08 pm
Subject: Re: new interpreter ("Fast RIR")
"BGB" <cr88...@hotmail.com> wrote in message

news:k5nam3$kp5$1@news.albasani.net...
> On 10/17/2012 5:50 AM, Rod Pemberton wrote:
> > "BGB" <cr88...@hotmail.com> wrote in message
> > news:k5khbh$f71$1@news.albasani.net...

...

> >> I also realized some while looking at retrofitting a few ideas onto the
> >> existing interpreter:
> >> in my stack-machine interpreter, a very large number of instructions
> >> may throw exceptions (hindering the use of trace optimizations). (for
> >> example, it would need to be proven that the stack can neither
> >> underflow nor overflow during an operation, ...).

> > Or, stack check on push/pop...

> > Or, just abort...

> checking for stack overflow is just the sort of cost I am trying to
> avoid here.

I know.  I didn't want such code in my push/pop routines for my Forth
interpreter either.  One of my pop routines actually has *two* types of
underflow checks ...  It's horrid!  I don't want them there.  It slows
things down.

However, without underflow/overflow checks how do you confirm underflow or
overflow for push/pop?  And, where would you?  Do you just allow push/pop to
crash/halt the application or overwrite non-stack memory?

I've no elegant solution to this other than to insert the checks.  I was
hoping to maybe analyze the program in more depth in the future to determine
other code locations where the checks could be placed.  I.e., somewhere
where they weren't constantly being checked, but checked frequently enough
that they would prevent or detect most overflows/underflows prior to them
occurring.

> it is the same sort of issue as checking if an exception-state variable
> has been set. it seems simple enough, but it can eat lots of clock
> cycles in a tight loop.

Any additional operations in a tight, continuous loop can accumulate to
consume much time.

> typically, a stack might be accessed like:
> ctx->stackpos--;
> *(int *)ctx->stackpos=((int *)ctx->stackpos)[0]+
> ((int *)ctx->stackpos)[1];

> but, if we add something like:
> if(ctx->stackpos<0)
> { ... }

> then suddenly lots of clock cycles are being wasted just performing this
> silly little check.

True.  Unfortunately for me, stack checks are needed, but very unwanted ...

Okay, I usually don't comment on your code because 1) you post alot of it
and 2) it's generally abstract enough that I don't spend much time
attempting to understand it...  This is somewhat abstract too.  :-)

Clearly, you can rework the code to fix trivial stuff like eliminating the
need for "-1" for ope either by adjusting tr->n_ops or using op<=ope.

Anyway, I'm not entirely sure what you're doing with all the ctx and tr->ops
and op->fnc(ctx,op) stuff.  The best I can tell, this is a loop through a
list of functions to be called.  The functions get called.  They also get
passed 'ctx' - which I'm not sure what is.  I would guess that fcn and op
should be all that's needed ...  However, I've seen other code use 'ctx' and
it's usually something like a register struct or a VM save state.  If so, it
seems like it's doing some additonal work since ctx pointer keeps being
passed, i.e., allocated and deallocated on the stack for each function call.
Won't declaring the fcn function as "void func(void)" reduce some overhead?
It should eliminate the prolog(ue) or epilog(ue) code for the procedure.
But, that requires op and ctx to be globals instead of locals.  For GCC,
you'll still end up with some stack adjustments upon procedure entry or
exit.  For other C compilers, you should be able to make the function
"naked" by using "void func(void)".  Or, you can use special compiler
attributes to do so.

Rod Pemberton


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
BGB  
View profile  
 More options Oct 17 2012, 10:28 pm
Newsgroups: comp.lang.misc
From: BGB <cr88...@hotmail.com>
Date: Wed, 17 Oct 2012 21:25:28 -0500
Local: Wed, Oct 17 2012 10:25 pm
Subject: Re: new interpreter ("Fast RIR")
On 10/17/2012 8:01 PM, Andy Walker wrote:

presumably, this try/catch block would occur outside of the code which
used up all the stack-space and triggered the exception, and also the
exception handling does not itself use up stack space (throwing
basically causes all stack-frames to be unwound until it gets back to
the exception handler, and the exception-object is heap-allocated...).

granted, yes, another potential cause of error is an
"OutOfMemoryException", but this is potentially a bit more severe
(basically, it means that the heap is all used up and that the garbage
collector can't free up any more, so a memory-allocation request has
failed).

stack-overflows are potentially a little more common with my VM as it
typically uses smaller stacks than are generally used by native threads,
but this is counterbalanced by my VM not generally allocating storage
for objects or arrays on the stack.

also, the "printf" in BGBScript is implemented off in C land, so this
uses the native stack (rather than the VM stack).


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
BGB  
View profile  
 More options Oct 17 2012, 11:01 pm
Newsgroups: comp.lang.misc
From: BGB <cr88...@hotmail.com>
Date: Wed, 17 Oct 2012 21:58:01 -0500
Local: Wed, Oct 17 2012 10:58 pm
Subject: Re: new interpreter ("Fast RIR")
On 10/17/2012 8:08 PM, Rod Pemberton wrote:

static analysis:
if you know the max number of stack items ever used, it is possible to
allocate stack space for just that many items (and thus determine at
method-call time whether or not an overflow is possible).

this means that, granted, an overflow exception may be generated even if
the execution path which would generate the overflow is never called.

but, say, if we know that the method has, say, a maximum stack-depth of
11, then we only need to reserve space for 11 stack items.

similar goes for reserving space for things like lexical environment
frames, ...

I don't know whether this would be applicable or not to something like
Forth, but given the VM has a similar stack-usage restriction to the JVM
(namely that stack layout and element types be statically determinable),
this makes things a little easier.

> I've no elegant solution to this other than to insert the checks.  I was
> hoping to maybe analyze the program in more depth in the future to determine
> other code locations where the checks could be placed.  I.e., somewhere
> where they weren't constantly being checked, but checked frequently enough
> that they would prevent or detect most overflows/underflows prior to them
> occurring.

in my case, generally checks like this are done when trying to call into
a method (this is also when the "bytecode -> threaded-code" translation
takes place).

>> it is the same sort of issue as checking if an exception-state variable
>> has been set. it seems simple enough, but it can eat lots of clock
>> cycles in a tight loop.

> Any additional operations in a tight, continuous loop can accumulate to
> consume much time.

yes, this is why I want to factor this out, as at present, the
generation of exception events is a lot harder to factor out, so better
would be to execute code in a form where static elimination of most
possible exception cases is possible.

>> typically, a stack might be accessed like:
>> ctx->stackpos--;
>> *(int *)ctx->stackpos=((int *)ctx->stackpos)[0]+
>> ((int *)ctx->stackpos)[1];

>> but, if we add something like:
>> if(ctx->stackpos<0)
>> { ... }

>> then suddenly lots of clock cycles are being wasted just performing this
>> silly little check.

> True.  Unfortunately for me, stack checks are needed, but very unwanted ...

fair enough...

I considered the possibility of having it be precomputed, but was left
to debate whether precomputing it would add more cost than calculating
it on the fly.

globals are no-go, mostly as they are not thread-safe (if used for
volatile state), and often the BGBScript VM may be used from multiple
threads at the same time (and each thread needs its own VM state).

the VM is mostly used by the "BGBTech Game Engine":
http://www.youtube.com/user/BGBTech
ex:
http://www.youtube.com/watch?v=E1kN_bM6KBs&feature=plcp

note that this 3D engine is itself internally multi-threaded in many areas.

or, in other words, using global variables would risk making stuff blow
up pretty damn hard... (and TLS costs more than passing state around in
arguments).

the ctx and op are basically the VM-state-context and opcode-state.

the reason for all the function-calls through function-pointers stuff is
that, in my tests, this has generally be one of the fastest ways I know
of to do this.


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Dmitry A. Kazakov  
View profile  
 More options Oct 18 2012, 3:21 am
Newsgroups: comp.lang.misc
From: "Dmitry A. Kazakov" <mail...@dmitry-kazakov.de>
Date: Thu, 18 Oct 2012 09:22:36 +0200
Local: Thurs, Oct 18 2012 3:22 am
Subject: Re: new interpreter ("Fast RIR")

A counterexample:

{
   int Boo; // Allocated on the stack

   try
   {  ...
   }
   catch (ex:StackOverflowException)
   {
      printf("Stack blew up when Boo=%d!\n", Boo);

}  }

If you switch stacks in exception handlers, you would have a problem with
accessing the stuff allocated on the stack by the enclosing context.

Anyway. Handling exceptions in this way is much like co-routine calls. A
co-routine may have stack of its own. I wonder what could become of that.
(Not returning to on-clauses of early PL/1 of course.)

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
BartC  
View profile  
 More options Oct 18 2012, 11:58 am
Newsgroups: comp.lang.misc
From: "BartC" <b...@freeuk.com>
Date: Thu, 18 Oct 2012 16:56:20 +0100
Local: Thurs, Oct 18 2012 11:56 am
Subject: Re: new interpreter ("Fast RIR")
"BGB" <cr88...@hotmail.com> wrote in message

news:k5nam3$kp5$1@news.albasani.net...

>>> as-is, the interpreter already translates the bytecode into a
>>> stack-based threaded-code format (which is faster than running it
>>> directly via a "switch()", but still could be somewhat faster).
> I went away from using switch a while ago, mostly as it takes a fairly
> severe performance hit for every loop iteration.

And I'm going away from fast threaded code, and towards using a switch!
That's because I might need to write in C in order to port to a different
processor, because getting back all the speed I've lost.

Using a very simple mock-up of the interpreter (just repeatedly executing a
single bytecode, that used for an unindexed loop, but it will test the
dispatching), I got these results (to execute this bytecode a billion
times):

4.5 seconds   (C using a do-while loop, and calling handlers via function
pointers)
3.4 seconds   (C using a big switch statement)
3.3 seconds   (C using gotos to gcc's indirect labels; a kind of threaded
code)
2.1 seconds   (x86 ASM also using direct threading)
1.9 seconds   (The current interpreter also using direct threading)

So if I'm going to use C at all, then it will be probably be a switch.

>> Or, stack check on push/pop...

>> Or, just abort...

> checking for stack overflow is just the sort of cost I am trying to avoid
> here.

Why do you have to check for stack overflow, when a language such as C
doesn't bother?

Either make it optional (for debugging for example), or check it
infrequently, perhaps per function-call (since stack overflow is most likely
to be caused by out-of-control recursion) rather than per expression.
(Assuming it is too hard to detect an actual stack fault.)

> and, if our dispatch loop is something like:
> while(op && !ctx->ret)
> { op=op->fcn(ctx, op); }

> that "&& !ctx->ret" also adds a bit of cost.

You'd take that check out of the loop. I assume that checks whether the next
opcode is some sort of stop code. Since it's only ever going to be true once
in any program run (perhaps after a trillion iterations) it is rather
inefficient!

--
Bartc


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
BGB  
View profile  
 More options Oct 18 2012, 12:09 pm
Newsgroups: comp.lang.misc
From: BGB <cr88...@hotmail.com>
Date: Thu, 18 Oct 2012 11:06:22 -0500
Local: Thurs, Oct 18 2012 12:06 pm
Subject: Re: new interpreter ("Fast RIR")
On 10/18/2012 2:22 AM, Dmitry A. Kazakov wrote:

the variable declaration syntax would be more like:
var boo:int;

but, the stack issue is not actually an issue.

this is partly because local variables are not "only" stored on the
stack, but any captured bindings (due to a closure or otherwise) will
typically end up in heap-allocated environment frames.

actually, even then, within the current interpreter, variables have
their own stack, and there are parallel stacks for various tasks:
value-stack (holds intermediate values);
mark-stack (holds "marks" for the value-stack);
flow-control stack (holds call frames);
local variable stack (holds non-captured local variables and similar);
dynamic variable stack (holds any dynamically-scoped variables);
...

actually, the bytecode makes the lexical environment itself look like a
single big stack, except that this stack is not physically contiguous in
memory whenever closures are made (the lexical environment is more of a
linked-list structure).

currently, different threaded code is generated based on whether or not
the current function has captured bindings or is itself a closure
(captures bindings from a parent function).

in the case of FRIR, it would actually keep track of the difference
between lexical and local variables, and identifies lexical variables
via a pair of numbers (indicating how many levels to walk outwards, and
the binding within that level).

ex: "3, 8" means "up 3 levels, entry 8".

currently a stack-overflow exception doesn't actually indicate which
stack overflowed.

> Anyway. Handling exceptions in this way is much like co-routine calls. A
> co-routine may have stack of its own. I wonder what could become of that.
> (Not returning to on-clauses of early PL/1 of course.)

actually, it is not that the exception handling has its own stack, but
rather that "printf" has its own stack.

this is mostly because the current interpreter is internally implemented
using CPS (continuation-passing-style), and uses its own internal
stacks. similarly, calls within the VM don't actually use any more space
on the CPU stack, mostly because in C land, it is all just a single
trampoline loop which spins in place (so, the interpreter is not
recursive in this case).

when a call is made into C land (such as for "printf()"), it will use
the native CPU stack as the stack, rather than the stack internal to the
interpreter.


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
BGB  
View profile  
 More options Oct 18 2012, 12:59 pm
Newsgroups: comp.lang.misc
From: BGB <cr88...@hotmail.com>
Date: Thu, 18 Oct 2012 11:55:51 -0500
Local: Thurs, Oct 18 2012 12:55 pm
Subject: Re: new interpreter ("Fast RIR")
On 10/18/2012 10:56 AM, BartC wrote:

I am doing threaded-code in C (via function pointers).
generally it was going faster than using switch statements.

granted, a possible issue is that MSVC generates some pretty bad code
for switch statements.

so, an indirect call via a function pointer will typically outperform
the "switch()" IME.

>>> Or, stack check on push/pop...

>>> Or, just abort...

>> checking for stack overflow is just the sort of cost I am trying to
>> avoid here.

> Why do you have to check for stack overflow, when a language such as C
> doesn't bother?

mostly to lessen the chance of accidentally crashing the host
application. the idea is that even with bad code, the host app should be
able to handle it gracefully.

> Either make it optional (for debugging for example), or check it
> infrequently, perhaps per function-call (since stack overflow is most
> likely to be caused by out-of-control recursion) rather than per
> expression. (Assuming it is too hard to detect an actual stack fault.)

the current plan is to do it per-call.

>> and, if our dispatch loop is something like:
>> while(op && !ctx->ret)
>> { op=op->fcn(ctx, op); }

>> that "&& !ctx->ret" also adds a bit of cost.

> You'd take that check out of the loop. I assume that checks whether the
> next opcode is some sort of stop code. Since it's only ever going to be
> true once in any program run (perhaps after a trillion iterations) it is
> rather inefficient!

basically it indicates some need to break out of the main instruction
dispatch loop.

usually, this would be for a few scenarios:
the interpreted code has returned from the outermost level;
the code has blocked or executed a sleep call;
an exception has been thrown;
a call is being made back into C land;
...

the goal would be to organize instructions into "traces", where this
doesn't have to be checked for nearly as often.


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Kaz Kylheku  
View profile  
 More options Oct 18 2012, 2:20 pm
Newsgroups: comp.lang.misc
From: Kaz Kylheku <k...@kylheku.com>
Date: Thu, 18 Oct 2012 18:19:56 +0000 (UTC)
Local: Thurs, Oct 18 2012 2:19 pm
Subject: Re: new interpreter ("Fast RIR")
On 2012-10-18, Dmitry A. Kazakov <mail...@dmitry-kazakov.de> wrote:

Well, no, because, obviously, the unwinding would restore the stack to the
correct frame.  At the point where the exception is caught, unwinding has
already taken place. Thus the condition of a stack overflow no longer exists.
That happened in the try block which tried to allocate more stack space, but
all of that is undone at this point.

The switching to a some emergency stack is only for the purpose of the
exception handling mechanism which may need its own stack frames.

Anyway, in protected operating systems, interrupts and exceptions (the
real, architectural ones) are handled on a different stack inside the
kernel.  You cannot rely on unprivileged mode to have a stack for catching
interrupts and exceptions. Suppose user space has pushed its stack pointer all
the way to the edge of available space so that the next address is an unmapped
page. In that situation, interrupts like the clock tick or network receive
still have to work! They do because the exception handling in the architecture
switches to a kernel stack before saving the machine state.

A stack overflow situation might be intercepted at the CPU level, and
turned into an exception that traps to the kernel, on the kernels' stack.
That kernel code could pull some strings to arrange for the user code to have a
different stack for running the user handler for that condition.

Unix has a such a thing: sigaltstack.

http://pubs.opengroup.org/onlinepubs/007904875/functions/sigaltstack....

"The sigaltstack() function allows a process to define and examine the state of
an alternate stack for signal handlers for the current thread. Signals that
have been explicitly declared to execute on the alternate stack shall be
delivered on the alternate stack."

This doesn't mean that there is no linkage from the alternate stack to the
parent stack. It may still be possible to traverse the stack from the alternate
one to the original one to perform an exception search and do unwinding.


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Discussion subject changed to "misc: new interpreter ("Fast RIR")" by BGB
BGB  
View profile  
 More options Nov 20 2012, 2:59 pm
Newsgroups: comp.lang.misc
From: BGB <cr88...@hotmail.com>
Date: Tue, 20 Nov 2012 13:57:02 -0600
Local: Tues, Nov 20 2012 2:57 pm
Subject: Re: misc: new interpreter ("Fast RIR")
On 10/16/2012 3:47 PM, BGB wrote:

> basically, this is just a few globs from some recent emails, mentioning
> something I am currently looking into.

> the idea would be to develop a new faster threaded-code backend for the
> BGBScript VM, which would basically be used "behind the scenes" while
> still keeping the existing high-level bytecode.

> as-is, the interpreter already translates the bytecode into a
> stack-based threaded-code format (which is faster than running it
> directly via a "switch()", but still could be somewhat faster).

after being on hold for a while (I was busy working more on other
stuff), I finished implementing it at least enough to test the basic design.

I was working some on adding basic mechanisms, like function calls, but
am left thinking more about "efficient" ways to manage scoping, and
hopefully not be left with ugly problem areas or poor performance (the
whole effort would be pointless if most of the speedup is squandered
away again with something like slow method or variable lookups).

a partial tradeoff is between "directness" and "mutability", for
example, consider I am calling a "static method":
I can cache the static method with the instruction, so that the next
time around the static method is invoked directly, and does not incur a
performance hit from having to fetch it. but what then if the class
containing the method is redefined and the method is replaced?...

this implies the added cost of fetching the method again each time from
the class's vtable (~ 45ns).

the only other obvious solution is for the interpreter to basically
retain pointers to all static method calls, which can then be flushed
as-needed (setting the method-pointer back to NULL, and setting the
handler from the direct-call case to the "lookup and then attempt to
call" case).

another alternative here being of course a kind of "static method call
table", where basically all of the active static methods are laid out in
a big global table which calls are directed through. the static-calls
would then hold pointers to the table entries. (note that entries would
likely be keyed as "class-qname"+"methodname"+"signature").

as of yet, I have doubts about strategies that would nail down the class
or object types (as-is, there are currently 2 major types of object, one
used for class/instance and dynamic classes, and another used for
prototype-OO via ex-nihilo objects, which are currently also used to
implement packages).

as-is, the currently existing interpreter will just fetch the method
from the class/object each time (and then use type-checks to see how to
apply it). this works but isn't particularly efficient.

otherwise, here is something from an email:

<---
after fiddling around with image compression stuff more, and a brief run
of trying to optimize my 3D engine somewhat, I went and finally got
around to finishing up the new "FRIR" (the "register IR") interpreter
enough to test it (at least in a very minimalistic form).

so, for the FRIR interpreter test, the equivalent of a "for()" loop
spins 45.2M cycles per second (23.23s to spin 1,000,000,000 times).

native "for()" loop, 385M cycles per second (2.6s).
IOW: "for(i=0; i<1000000000; i++);"

difference: 8.9x of native.

this seems promising, as it is nearly 7x as fast as my current interpreter.

this is not to say though that it wouldn't slow down once weighted down
by all the other crap and used for "actual" script-code.

though, it would probably at least be somewhat faster, as the existing
script interpreter is still a bit of a cobbled-together mess...

not like my other "real" benchmarks were that realistic, as most were
doing things like the recursive fibonacci function and implementing
bubble-sort and similar, which aren't likely to be particularly
representative of the normal workloads... (which has thus far been
mostly for running part of the AIs for several of the monster types,
among other mostly misc things...).

will see though if I actually do much with it.
it would still take a bit of work to really make the new backend "ready
for use".
--->

I went and calculated, the above works out to roughly 22ns per loop
iteration (in the interpreter), vs around 2.6ns per-iteration in native.


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
End of messages
« Back to Discussions « Newer topic     Older topic »