
beyond 'switch limit', IDL's...


BGB / cr88192

Nov 5, 2009, 2:04:07 AM
well, I will start out by acknowledging that the prior thread was not all
useless, as I did get a rather useful idea from the thread.

in particular, the idea had been suggested of using function pointers in
place of a switch-based dispatch. I was able to use this strategy along with
my prior strategy to essentially eliminate several of the major switches,
and in a few cases, to point the function pointers directly at
micro-optimized handlers.

essentially, this was done by putting a function pointer directly into the
opcode structure, and calling this function pointer, rather than a generic
dispatch function.

this did not require a general reorganization, since I basically ended up
adding another function which does a mini-version of the dispatch, and
essentially looks for the "best" handler function to handle a given opcode,
which in many cases just ends up returning the generic handlers (although,
the initial first-level switch has been almost completely eliminated).
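
A minimal sketch of the arrangement described above, assuming a toy register
machine. Every name here (VmOp, vm_op_bind, op_mov_rr32, ...) is hypothetical
and not taken from the actual interpreter; only the shape is the point: a
handler pointer stored in the decoded opcode, and a small selection function
that falls back to a generic, still switch-based handler.

```c
#include <assert.h>
#include <stdint.h>

/* All names are hypothetical; this is not the poster's code. */
typedef struct VmCtx VmCtx;
typedef struct VmOp VmOp;
typedef void (*VmOpFn)(VmCtx *ctx, VmOp *op);

enum { OP_MOV, OP_ADD, OP_XOR };
enum { FORM_REG_REG, FORM_REG_IMM };

struct VmCtx { uint32_t reg[8]; };

struct VmOp {
    int opcode, form, width;  /* decoded fields (width in bits) */
    int rd, rs;               /* destination / source register index */
    uint32_t imm;             /* immediate, used by FORM_REG_IMM */
    VmOpFn run;               /* handler chosen once, at decode time */
};

/* Generic handler: still uses a switch, works for any form.
   The sketch only models 32-bit operations. */
void op_generic(VmCtx *ctx, VmOp *op)
{
    uint32_t src = (op->form == FORM_REG_REG) ? ctx->reg[op->rs] : op->imm;
    switch (op->opcode) {
    case OP_MOV: ctx->reg[op->rd] = src;  break;
    case OP_ADD: ctx->reg[op->rd] += src; break;
    case OP_XOR: ctx->reg[op->rd] ^= src; break;
    default:     assert(0 && "unknown opcode"); break;
    }
}

/* Specialized handlers for the 32-bit 'reg, reg' case. */
void op_mov_rr32(VmCtx *ctx, VmOp *op) { ctx->reg[op->rd] = ctx->reg[op->rs]; }
void op_xor_rr32(VmCtx *ctx, VmOp *op) { ctx->reg[op->rd] ^= ctx->reg[op->rs]; }

/* Mini-dispatch: look for the "best" handler, else return the generic one. */
VmOpFn vm_select_handler(const VmOp *op)
{
    if (op->form == FORM_REG_REG && op->width == 32) {
        switch (op->opcode) {
        case OP_MOV: return op_mov_rr32;
        case OP_XOR: return op_xor_rr32;
        default:     break;
        }
    }
    return op_generic;
}

/* Called once when an opcode has been decoded. */
void vm_op_bind(VmOp *op) { op->run = vm_select_handler(op); }
```

With this layout the execution path is just op->run(ctx, op); the selection
cost is paid once per decoded opcode rather than once per execution.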

I also reduced the hash to 4k entries, and fiddled with its logic a little,
...

the result is, now the interpreter is pulling off around 12.4 MIPS, which is
measuring at around 76.6x slower than native...

this trick also "opens the door" to further micro-optimizing (since, thus
far, for most of the ISA the prior "generic" logic is still used).

I am not certain of a present load-distribution ranking, since I think the
running time has gotten short enough that the profiler is not returning
particularly sensible results...


taking this idea further, it came up as a possibility that "an interpreter"
could essentially structure its main interpretation loop sort of like this:

cur=first;
while(cur)cur=cur->exec(ctx, cur);

which although not a very generic design, could be very fast (although, I am
not certain how much so, when compared against native). this approach could
also be fairly easily amenable to JIT.
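
The loop above can be fleshed out into something compilable. The following is
only a sketch under assumptions: the Op layout, the accumulator-style context,
and the three handlers (op_add, op_jnz, op_halt) are all invented for
illustration. The one thing taken from the post is the contract that each
handler returns the next operation to run, with NULL ending the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types; the post only gives the shape of the loop. */
typedef struct Ctx Ctx;
typedef struct Op Op;

struct Ctx { long acc; };

struct Op {
    Op *(*exec)(Ctx *ctx, Op *self); /* returns next op, or NULL to stop */
    Op *next;                        /* fall-through successor */
    Op *target;                      /* branch target (jumps only) */
    long arg;                        /* immediate operand */
};

/* acc += arg, then fall through */
Op *op_add(Ctx *ctx, Op *self)
{
    ctx->acc += self->arg;
    return self->next;
}

/* branch to target while acc is non-zero, else fall through */
Op *op_jnz(Ctx *ctx, Op *self)
{
    return ctx->acc != 0 ? self->target : self->next;
}

/* stop the interpreter */
Op *op_halt(Ctx *ctx, Op *self)
{
    (void)ctx;
    (void)self;
    return NULL;
}

/* The main loop from the post: each op hands back its successor. */
void vm_run(Ctx *ctx, Op *first)
{
    Op *cur = first;
    assert(ctx != NULL);
    while (cur)
        cur = cur->exec(ctx, cur);
}
```

Because control flow lives in the returned pointer, a branch costs the same as
a fall-through, and there is no central decode or switch step left in the loop.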

as-is, this particular approach is not well suited to my interpreter, but I
may keep it in mind for later (since it IS a lot more generic than JIT, and
for certain VM designs could make interpreter overhead largely negligible).

next issue:
I am recently considering ideas for some sort of IDL-like technology.

possible options include:
an MIDL-based IDL format (complicated, as MIDL seems to use a C-style
parser);
a custom format (likely line-oriented and based on "signature strings"),
which would be much simpler, but would be an unusual and generally ugly
format;
annotated C headers, which could allow using the same file as a header and
as an IDL, but (like MIDL) would be a hassle to implement (although, I could
use my C compiler frontend as a base easily enough, for example, because it
parses to fairly generic XML-based ASTs, ...).

current leaning is more towards the annotated C headers...

this would be mostly for easing gluing native API's to my interpreter, since
as-is, many of my core API's are simply too bulky to wrap (I have tried, but
with a single "interface" having easily around 1000 functions, it becomes
apparent that this is not entirely reasonable).

so, my current thinking is that I will embed IDL commands into headers as
special macros which, in plain C, will simply remove themselves. this could
be pulled off by having a "mandatory" header, which would mostly be filled
with stuff like:
#ifndef bidl_begin_interface
#define bidl_begin_interface
#endif

#ifndef bidl_guid
#define bidl_guid(x)
#endif

...


the IDL tool would then essentially define these macros internally (prior to
preprocessing/parsing the header), so that they don't get replaced by dummy
versions (instead, they could expand to IDL tool internal syntax).
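
As an illustration of how such an annotated header might look to a stock C
compiler, here is a small sketch. Only bidl_begin_interface and bidl_guid
appear in the post; bidl_end_interface, the demo_* functions, and the GUID
value are placeholders made up for the example.

```c
/* --- the "mandatory" header: every annotation defaults to nothing --- */
#ifndef bidl_begin_interface
#define bidl_begin_interface
#endif

#ifndef bidl_guid
#define bidl_guid(x)
#endif

#ifndef bidl_end_interface      /* hypothetical; not named in the post */
#define bidl_end_interface
#endif

/* --- an annotated API header (hypothetical API) --- */
bidl_begin_interface
bidl_guid("00000000-0000-0000-0000-000000000000")  /* placeholder GUID */
int demo_add(int a, int b);
int demo_scale(int x);
bidl_end_interface

/* --- ordinary C source: the annotations above expanded to nothing --- */
int demo_add(int a, int b) { return a + b; }
int demo_scale(int x) { return x * 2; }
```

An IDL tool would presumably predefine the same macro names before including
the header, so the #ifndef guards leave its own definitions in place.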

all this also allows other possibilities as well, such as embedding info
related to my object system (class and interface definitions, ...) directly
into C headers without so much risk that a stock C compiler will be confused
by them (and is still better than using either manual API calls, or dummy
Java files, as is my current approach...).

a possible use of IDL's here would be to facilitate auto-wrapping C API's as
class/instance objects, and making C/I objects visible as C APIs (in a
similar manner to CLOS). this could also be of some use to C/Java
integration as well (currently a largely unaddressed issue in my
framework...).

beyond interpreter gluing and OO facilities, another possible use is of
"components" (in the COM sense).

granted, this technology would not necessarily be compatible with COM or
CORBA, but I have my reasons not to use them...


I am also idly thinking about integrating basic "command shell" style
functionality into the interpreter, but have not gotten around to this as of
yet. likely this would be modeled after a "simple posix-style shell".
shell-scripts would likely be somewhat limited (lacking any of the sorts of
conditional or procedural facilities of many bash scripts, ...).

...


or such...


Rod Pemberton

Nov 5, 2009, 2:44:42 AM
"BGB / cr88192" <cr8...@hotmail.com> wrote in message
news:hcttd7$r6p$1...@news.albasani.net...

>
> the result is, now the interpreter is pulling off around 12.4 MIPS, which is
> measuring at around 76.6x slower than native...
>
> [...]

>
> which although not a very generic design, could be very fast (although, I am
> not certain how much so, when compared against native).
>

Currently, you're measuring the instructions/time as a determination of how
fast the interpreter is. I might be confused, but I gather that you're
comparing apples and oranges, i.e., assembly for a non-interpreter (native
code) to assembly for an interpreter from a compiler. At some point, you
need to compare apples to apples, i.e., assembly for an interpreter by hand
to assembly for an interpreter from a compiler. But, you don't want to use
time to compare them. You want to visually compare the emitted assembly.
With much work, they can become very close through trial and error,
rewriting, different techniques, etc.


Rod Pemberton


BGB / cr88192

Nov 5, 2009, 9:51:33 AM

"Rod Pemberton" <do_no...@nohavenot.cmm> wrote in message
news:hctvsv$nhj$1...@aioe.org...

actually, it is not that drastic.
I am comparing x86-64 code (from MSVC), to x86 code (from GCC), where the
native code in this case is x86-64, and the interpreted code is x86.

I am working on the vague assumption that both compilers are producing
"similar" code, allowing a reasonable comparison (granted, GCC and MSVC
"could" produce drastically different code, but I am assuming against this
here).

the measure on the native CPU is purely based on time (and some internal
knowledge of the loop, ...), where the theoretical MIPS rate is inferred
from the interpreter's experience.

the interpreter is using a "time stamp counter", which is basically a 64 bit
integer which is incremented for every instruction.

MIPS=TSC/(Seconds*1000000)

the comparison with native is based on a loop (the same loop used in the
interpreter), which is basically (InterpSecs/RealSecs)*Scale, where
Scale=100, and corresponds to the difference in loop lengths between the
real and interpreted loops.

(I don't have an instruction-level TSC on native, as the native RDTSC
actually measures clock cycles).
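
The two formulas above, written out as a small sketch. The function names are
hypothetical, and "scale" stands for the loop-length ratio between the native
and interpreted runs (100 in the post).

```c
#include <assert.h>
#include <stdint.h>

/* MIPS = TSC / (Seconds * 1000000), where tsc counts interpreted
   instructions (not clock cycles, unlike the native RDTSC). */
double vm_mips(uint64_t tsc, double seconds)
{
    assert(seconds > 0.0);
    return (double)tsc / (seconds * 1000000.0);
}

/* Slowdown = (InterpSecs / RealSecs) * Scale, where scale compensates
   for the native loop running 'scale' times longer than the
   interpreted one. */
double vm_slowdown(double interp_secs, double real_secs, double scale)
{
    assert(real_secs > 0.0);
    return (interp_secs / real_secs) * scale;
}
```

With invented inputs chosen only to match the figures quoted earlier:
12,400,000 counted instructions in 1.0 s gives 12.4 MIPS, and an interpreted
run of 0.766 s against a native run of 1.0 s at Scale=100 gives 76.6x.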


granted, to be really accurate, one would need to run essentially the same
instruction sequence on both processors (with only trivial modifications to
compensate for x86 vs x86-64), but I am not that fussy at present, and am
relying on C code doing the same task (essentially, a bit-twiddly loop on an
array of unsigned chars...).

further accuracy would demand non-trivial tasks which use a more varied set
of the ISA.


FWIW, I am actually more concerned with the "generic" C-level performance,
than with the precise instruction-per-instruction performance differences...

I would much rather know "about how much slower is this", than have some
terribly complicated benchmark.


currently, the interpreter is using pure C-based interpretation, but I am
using function pointers to get around the use of big switch statements in
many cases, as apparently MSVC does not do them terribly well...

a lot of gain has been pulled off mostly by bypassing said switches, via
function pointers which are looked up and added to the opcode structure
around the same time it is decoded (although, the code for doing so is still
within the interpreter portion, so modularity is not broken).


in general, 'reg, reg' opcodes have been mostly broken off to individual
functions (and mov/and/or/xor have been specialized in the 32-bit case).

from the profiler, it may also make sense to optimize some of the 'reg' and
'reg, imm' cases.
judging from the profiler, 'reg, mem' operations (and things relating to the
Virtual Address space) are actually using more time (2nd place now was a
function related to the VA handling), but I can't do as much with all this
(although using a binary search for VA resolution, vs a linear search, could
help some...).
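
A possible shape for the binary-search variant, assuming the mapped regions
are kept in an array sorted by base address and do not overlap. The VaRegion
layout and va_find are hypothetical; the interpreter's real VA structures are
not shown in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical region record: [base, base + size) maps to 'host'. */
typedef struct {
    uint32_t base;
    uint32_t size;
    void *host;
} VaRegion;

/* Find the region containing addr in a table sorted by base.
   Returns NULL if the address is unmapped. O(log n) instead of O(n). */
const VaRegion *va_find(const VaRegion *tab, size_t n, uint32_t addr)
{
    size_t lo = 0, hi = n;              /* search window is [lo, hi) */
    assert(tab != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (addr < tab[mid].base)
            hi = mid;                   /* addr lies below this region */
        else if (addr - tab[mid].base >= tab[mid].size)
            lo = mid + 1;               /* addr lies above this region */
        else
            return &tab[mid];           /* base <= addr < base + size */
    }
    return NULL;
}
```

The range test subtracts the base from the address rather than adding the
size to the base, so a region ending at the top of the 32-bit space does not
overflow.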


>
> Rod Pemberton
>
>

