On 10/17/2012 5:50 AM, Rod Pemberton wrote:
> "BGB" <
cr8...@hotmail.com> wrote in message
> news:k5khbh$f71$1...@news.albasani.net...
> ...
>
>> the idea would be to develop a new faster threaded-code backend for the
>> BGBScript VM, which would basically be used "behind the scenes" while
>> still keeping the existing high-level bytecode.
>
> As you may know, I've been developing an ITC (threaded code) Forth
> interpreter in C for a few years now. One issue is with branching. I don't
> believe the issue is Forth specific. I think applies to all threaded code.
> Threaded code is not well suited to branching or most control flow, but
> branching can be made to work. Forth has no formal definition for it's
> branching "words" (routines). Historically, Forth calls them: BRANCH and
> either 0BRANCH or ?BRANCH. The first is an absolute branch, and the
> second with two common names is a conditional branch.
>
BGBScript has been in-use and incrementally developed since about 2004.
I have been using threaded-code since around November 2011.
so, the point now is mostly looking into taking a different and more
efficient route towards the interpreter backend.
note that, unlike Forth or similar, the BGBScript bytecode follows
similar stack-use restrictions to JVM bytecode, basically that on a
given jump/label, the stack layout is required to match.
the prior interpreter basically operated as a sort of
continuation-passing-style, where every opcode would return the next
opcode to execute.
this time I am factoring this out and instead using "traces".
the hope is that by better nailing down the opcode-level control flow, a
more efficient means of executing the instruction traces may be used.
>> as-is, the interpreter already translates the bytecode into a
>> stack-based threaded-code format (which is faster than running it
>> directly via a "switch()", but still could be somewhat faster).
>
> Well, that's good to know!
>
> Most Forth's coded in C use a switch() since that's the C way of doing
> things. I didn't use a C switch() statement for my interpreter. I
> interpret the same way as early ITC Forth's in assembly, but using C
> instead of assembly.
>
I went away from using switch a while ago, mostly as it takes a fairly
severe performance hit for every loop iteration.
threaded code is a little faster, but all of those function
calls/returns and indirect function calls are also expensive (note:
can't use computed goto as my interpreter is not GCC specific).
sadly, there is no really "good" option short of writing the whole
interpreter core in ASM, and by this point I may as well just write a JIT.
>> so, the general idea would be to "unwind" the stack-based bytecode
>> (vaguely like JVM or AVM2 bytecode) into using a register-machine model
>> (more like Dalvik or Parrot), and essentially use a register-machine to
>> execute the code.
>
> Have you tried any of this already? Have you analyzed how many stacked
> items are typically used at a time? I.e., does your bytecode only use two
> or three items at once?
>
typically, all of the items in a typical expression will end up on the
stack.
this will likely exceed the number of x86 registers, but the threaded
code will use a "potentially unbounded" number of virtual registers.
if a JIT were used, likely it would use a mapping heuristic to figure
out which VM registers to map to hardware registers (this would more
likely apply to x86-64, I suspect that 32-bit x86 is sufficiently
register-starved to where any sort of aggressive mapping would likely do
more harm than good, and ESI/EDI would likely be needed mostly for
holding the current VM state).
> From my own experiences, I can "see" issues with needing a register
> allocator, keeping track of which data is in which registers, spilling of
> registers when you run out of available registers, etc.
>
this would be more for a native code generator, not so much for a
register-IR, where the registers aren't so much "registers" in the CPU
sense, but are more closely related to "temporary variables".
so, you can avoid the need to "spill" by simply having 256 (or more) as
the max number of VM registers, and by making the VM preserve register
state across calls.
the need for a full register allocator can also be avoided at this stage
as well (they can be allocated either sequentially or in a stack-like
manner.
>> any ideas for how to make such a design idea faster are welcome (apart
>> maybe from putting everything in globals, which yes, can make things
>> faster, but kills thread-safety, and also doesn't help much on ELF
>> targets).
>
> I ran into a few issues with making my ITC interpreter faster in C.
>
> The first issue is that the "register" keyword only works on "auto" or
> procedure local variables for GCC. GCC deprecated the use of the "register"
> keyword for file scope or global variables.
>
> The second issue is:
>
> "If I can't use the 'register' keyword on globals, and I've got four
> procedures operating on my globals, how do I get all four procedures merged
> into a _single_ procedure so that my globals become 'auto' types and can be
> optimized?"
>
> The first answer to that is something C doesn't have: nested procedures.
> The second answer to that -possibly- is something C doesn't have either:
> inheritance. The third answer to that is: "I'm still working on that one!"
> At this point, I'm considering *anything*, like even a small switch() or
> Duff's device, in order to get 'auto' types. I just haven't attempted to
> get something working, yet.
>
the "register" keyword doesn't really usually make much of a difference,
so I don't really use it.
>> apparently, people working in similar situations in the past, were able
>> to get a roughly 25%-30% speedup by translating to a register IR (for an
>> interpreter). this was for Java ByteCode.
> ...
>
>> I also realized some while looking at retrofitting a few ideas onto the
>> existing interpreter:
>> in my stack-machine interpreter, a very large number of instructions may
>> throw exceptions (hindering the use of trace optimizations). (for
>> example, it would need to be proven that the stack can neither underflow
>> nor overflow during an operation, ...).
>
> Or, stack check on push/pop...
>
> Or, just abort...
>
checking for stack overflow is just the sort of cost I am trying to
avoid here.
it is the same sort of issue as checking if an exception-state variable
has been set. it seems simple enough, but it can eat lots of clock
cycles in a tight loop.
typically, a stack might be accessed like:
ctx->stackpos--;
*(int *)ctx->stackpos=((int *)ctx->stackpos)[0]+
((int *)ctx->stackpos)[1];
but, if we add something like:
if(ctx->stackpos<0)
{ ... }
then suddenly lots of clock cycles are being wasted just performing this
silly little check.
and, if our dispatch loop is something like:
while(op && !ctx->ret)
{ op=op->fcn(ctx, op); }
that "&& !ctx->ret" also adds a bit of cost.
here is the inner-loop for the in-development interpreter:
BGBFRIR_API FRIR_Trace *FRIR_Trace_RunDefault(
FRIR_Context *ctx, FRIR_Trace *tr)
{
FRIR_Opcode *op, *ope;
op=tr->ops;
ope=tr->ops+tr->n_ops-1;
while(op<ope)
{ op->fcn(ctx, op); op++; }
return(tr->end(ctx, tr, op));
}
and the add-integer handler:
BGBFRIR_API void FRIR_ThOp_ADD_I(FRIR_Context *ctx, FRIR_Opcode *op)
{ ctx->ri[op->c]=ctx->ri[op->a]+ctx->ri[op->b]; }
>> design changes from previously:
>> 'A' type was changed from being a raw pointer to a pointer-sized
>> integer, mostly as it became apparent that the interpreter logic would
>> require an excessive amount of casting in C, so it would be easier just
>> to treat it as an integer type and cast back to a pointer as-needed.
>
> Exactly. You can't dereference a pointer type in C and end up with the
> exact same type. Well, except maybe a void*, but then you can't access data
> at the pointer because it's void... So, casting in C is required somewhere
> threaded code.
>
this more has to do with the new interpreter having a lot of operations
for pointer arithmetic, like "lea" (for calculating addressed) and
"xloada" (for loading the address of a field into a register) and similar.
an integer type makes more sense, but leaves the minor annoyance of sign
and address-size (x86 addresses are unsigned and 32-bit, x86-64 is
signed and 64-bit), requiring a little bit of special handling in
certain cases.
at the IR level, the 'A' (address type) is partly disjoint from the
integer types, so at least the issue shouldn't be apparent.
there are operations like: "conv_a2l" and "conv_l2a" to convert between
pointers and long (signed 64-bit), but doing this is not as efficient on
32-bit x86 (shouldn't matter too much, as this isn't likely to be all
that common).
but, as-is, IR would look something like:
llload_i 2, 7 //load local var 7 into r2
llload_i 3, 8 //load local var 8 into r3
add_i 2, 2, 3 //add r2=r2+r3
llstore_i 2, 9 //store r2 into local var 9
there may likely end up being a few more special cases, like:
lladd_ic 11, 2 //add 2 to local var 11
which would in C look something like:
BGBFRIR_API void FRIR_ThOp_LLADD_IC(FRIR_Context *ctx, FRIR_Opcode *op)
{ ctx->llvars[op->c].i+=op->i.i; }
or such...