the 'switch' limit...

BGB / cr88192

unread,

Nov 1, 2009, 1:46:13 AM11/1/09

to

this is an observation I have made in the past, and I think I will state
here as if it were some kind of rule:
when a profiler says that the largest amount of time used in an interpreter
is the main opcode dispatch switch (rather than some function called by it,
or some logic contained within said switch, ... especially if this switch
does nothing more than call functions...), then they are rapidly approaching
the performance limits they can hope to gain from said interpreter.

which is kind of lame in this case, as this interpreter is, technically, not
all that fast...

but, then again, there may still be a little tweaking left, as said switch
has not gained 1st place (it is 2nd), and holds approx 12% of the total
running time.

1st place, at 15%, is the hash-table for fetching instructions (and, sadly,
is not really optimizable unless I discover some way to 'optimize' basic
arithmetic expressions, which FWIW is about as productive as trying to
optimize a switch...).

this means: 27% of time:
hash EIP (currently: "((EIP*65521)>>16)&65535");
fetch opcode-struct from hash-table (the hash-fail rate is apparently very
low);
switch on opcode info (branches to functions containing further dispatch and
logic code).

(well, all this, as well as a few arithmetic ops, such as adding the opcode
size to EIP, ...).

previously I had an opcode decoder which was not super fast (approx 60% of
runtime if sequential decoding is used), but the hash seems to have largely
eliminated it (since most of the time, pre-decoded opcodes are used from the
hash).

so, alas, it is an x86 interpreter performing at approx 386 (or maybe
486)-like speeds on my computer (~6 MIPS...).
could be faster, except that MSVC on x64 has no real idea what it is doing
WRT optimizations.

then again, it is around 12x faster than when I started trying to optimize
it (~0.5 MIPS...).

note: I have plenty of past experience with both ASM and JIT, but at the
moment am leaning against this, and in favor of using pure C-based
interpretation for now.

dunno if anyone knows some major way around this.

IOW: if there is anything that can be done when the main switch creeps into
a high position, and code starts becomming very fussy about the fine details
WRT performance issues...

past experience is generally not... (that this is a general limit to C-based
interpreters...).

any comments?...

Pascal J. Bourguignon

unread,

Nov 1, 2009, 7:00:51 AM11/1/09

to

"BGB / cr88192" <cr8...@hotmail.com> writes:
> note: I have plenty of past experience with both ASM and JIT, but at the
> moment am leaning against this, and in favor of using pure C-based
> interpretation for now.
>
> dunno if anyone knows some major way around this.

Modify the C compiler so that it generates optimized code for your
specific case (ie. hide your specific assembly in the C compiler).

--
__Pascal Bourguignon__

BGB / cr88192

unread,

Nov 1, 2009, 10:19:39 AM11/1/09

to

"Pascal J. Bourguignon" <p...@informatimago.com> wrote in message
news:877huai...@galatea.local...

I am using MSVC on x64 for this...
I can't even do stupid inline assembly, as MS removed this feature for
x64...

(either way, for now I have tried to make this interpreter "generic", rather
than trying to rely on details of the specific host arch...).

my main complaint though is mostly that the compiler is stupid and does not
know about even trivial optimizations... (its "optimization" is "all for
show", it apparently just turns off a few more absurd features, but
apparently still does not go and actually "optimize" code, and more so,
trying to use optimization and debug options at the same time will generally
break the compiler, forcing one to choose one or another...).

...
mov rcx, [rsp+280h]
mor rax, [rcx]
mov rcx, [rsp+280h]
mor rdx, [rcx]
add rax, rdx
...

and, worse, like when it starts using memory and spewing out long chains of
loading and storing things to memory.

mov [rsp+28h], rax
mov [rsp+20h], rcx
mov rcx, [rsp+20h]
mov rax, [rsp+28h]
...

hence, there is around a > 2x-3x speed difference between 32-bit GCC and
64-bit MSVC, with GCC winning...

it is also worth noting that, from what I have seen from the profiler (which
tends to auto-disassemble things), this sort of stuff apparently plagues the
OS as well...

say, what about the case where one is like:
and eax, ecx
jz foo

MSVC?... nope, does not know of this either...

and eax, ecx
cmp eax, 0
je foo

oh yes, and the initial setting up for a switch is a long and ugly chunk of
code (around 10-12 instructions), except in the cases where it decides to
compile the switch as the equivalent of a long if/else chain (not even
binary sorting?... apparently it is linear dispatch).

yep...

now, I could use my own compiler, but alas then I would have to debug its
output...

or even GCC's Win64 support, but last time I tried to use it, it produced
code so bad (buggy and ill formed) as to scare me away. granted, it has been
a fairly long time now (~ 1 year already?...), so maybe they have improved
it some, I don't know...

so, alas, maybe it is "good enough" that I am getting 386-like speeds, as
this could easily mean P1-like speeds if building with a less terrible
compiler...

but, from a theoretical perspective, the main dispatch switch is the
effective limit for optimizing an interpreter?... (at the plain C level).

(note, given MSVC is in use in this case, certain GCC'isms do not apply...).

>
> --
> __Pascal Bourguignon__

Robert Redelmeier

unread,

Nov 1, 2009, 10:26:16 AM11/1/09

to

In alt.lang.asm BGB / cr88192 <cr8...@hotmail.com> wrote in part:

> fetch opcode-struct from hash-table
> (the hash-fail rate is apparently very low);

There's your improvement opportunity -- substitute a simpler
hash with larger failure rate and/or larger jump table.

-- Robert

BGB / cr88192

unread,

Nov 1, 2009, 11:39:57 AM11/1/09

to

"Robert Redelmeier" <red...@ev1.net.invalid> wrote in message
news:hck9ao$j6$1...@aioe.org...

only real way to do this I think is to use a mersenne prime and eliminate
the shift, but the issue then is that only likely the low 16 bits or so of
EIP would be considered in the hash.

IME, (((value*PRIME)>>SHIFT)&MASK) is generally both fast and effective,
though slightly slower than ((value*MERSENNE)&MASK).

according to the profiler, another major source of time use is:
"rip=ctx->sreg_base[0]+ctx->eip;"

but, alas, this much is needed to allow segmentation to actually be usable
(even if much of the time CS.base==0 anyways...).

but, alas, it seems there is an approx 170x speed difference between the
interpreter and native.
apparently there are too many complicated factors involved, and I am not
sure how to effectively convert this to 386-relative MHz...

direct relative clock-rate scaling relative to current processor would place
it at around 20 MHz.

calculating based on available "marketing" info (relative to my current
processor) it would be ~25 MHz.

going by some other info (instructions-per-clock on a 386, ...), it would be
closer to around 33 MHz (although I don't know "which" instructions they
were measuring...).
...

sadly, I don't have a 386 or 486 around to compare against...

>
> -- Robert

Pascal J. Bourguignon

unread,

Nov 1, 2009, 12:03:36 PM11/1/09

to

"BGB / cr88192" <cr8...@hotmail.com> writes:

> "Pascal J. Bourguignon" <p...@informatimago.com> wrote in message
> news:877huai...@galatea.local...
>> "BGB / cr88192" <cr8...@hotmail.com> writes:
>>> note: I have plenty of past experience with both ASM and JIT, but at the
>>> moment am leaning against this, and in favor of using pure C-based
>>> interpretation for now.
>>>
>>> dunno if anyone knows some major way around this.
>>
>> Modify the C compiler so that it generates optimized code for your
>> specific case (ie. hide your specific assembly in the C compiler).
>>
>
> I am using MSVC on x64 for this...

So unless you're rich enough to buy the sources of MSVC, forget it.

> I can't even do stupid inline assembly, as MS removed this feature for
> x64...

May I suggest you to switch to an open source compiler (such as gcc)?
Or just write your own compiler?

> now, I could use my own compiler, but alas then I would have to debug its
> output...

Hence the suggestion for a debugged compiler such as gcc.

> or even GCC's Win64 support, but last time I tried to use it, it produced
> code so bad (buggy and ill formed) as to scare me away. granted, it has been
> a fairly long time now (~ 1 year already?...), so maybe they have improved
> it some, I don't know...

This doesn't matter. What matters is that you have the sources, so
you can improve its optimizer.

(But also, read the interview of Fran Allen in "Coders at Work" (Peter
Seibel, Apress), you might prefer another language altogether to be
able to really optimize its compiler).

--
__Pascal Bourguignon__

Richard Harter

unread,

Nov 1, 2009, 1:26:01 PM11/1/09

to

On Sat, 31 Oct 2009 23:46:13 -0700, "BGB / cr88192"
<cr8...@hotmail.com> wrote:

>this is an observation I have made in the past, and I think I will state
>here as if it were some kind of rule:
>when a profiler says that the largest amount of time used in an interpreter
>is the main opcode dispatch switch (rather than some function called by it,
>or some logic contained within said switch, ... especially if this switch
>does nothing more than call functions...), then they are rapidly approaching
>the performance limits they can hope to gain from said interpreter.
>
>which is kind of lame in this case, as this interpreter is, technically, not
>all that fast...
>
>but, then again, there may still be a little tweaking left, as said switch
>has not gained 1st place (it is 2nd), and holds approx 12% of the total
>running time.
>
>1st place, at 15%, is the hash-table for fetching instructions (and, sadly,
>is not really optimizable unless I discover some way to 'optimize' basic
>arithmetic expressions, which FWIW is about as productive as trying to
>optimize a switch...).
>
>this means: 27% of time:
>hash EIP (currently: "((EIP*65521)>>16)&65535");
>fetch opcode-struct from hash-table (the hash-fail rate is apparently very
>low);
>switch on opcode info (branches to functions containing further dispatch and
>logic code).

Offhand the above sounds ugly. I don't know many op codes you
are using but I expect that it most it is a few hundred. The
hash function you use is relatively expensive. Using a hash
table at all is relatively expensive. A trie might be cheaper or
possibly a perfect hash. Then there is the question of why you
are using a switch at all. If the op-code lookup yields an
index, why aren't you using it as an index into a table that
contains the pointer to the action function and the op-code
struct?

Mind you, there might be very good reasons for doing what you are
doing. I don't have access to your code. Still, it sounds as
though your code is cumbersome.

>(well, all this, as well as a few arithmetic ops, such as adding the opcode
>size to EIP, ...).
>
>previously I had an opcode decoder which was not super fast (approx 60% of
>runtime if sequential decoding is used), but the hash seems to have largely
>eliminated it (since most of the time, pre-decoded opcodes are used from the
>hash).
>
>
>so, alas, it is an x86 interpreter performing at approx 386 (or maybe
>486)-like speeds on my computer (~6 MIPS...).
>could be faster, except that MSVC on x64 has no real idea what it is doing
>WRT optimizations.
>
>then again, it is around 12x faster than when I started trying to optimize
>it (~0.5 MIPS...).
>
>
>note: I have plenty of past experience with both ASM and JIT, but at the
>moment am leaning against this, and in favor of using pure C-based
>interpretation for now.
>
>
>dunno if anyone knows some major way around this.
>
>IOW: if there is anything that can be done when the main switch creeps into
>a high position, and code starts becomming very fussy about the fine details
>WRT performance issues...
>
>past experience is generally not... (that this is a general limit to C-based
>interpreters...).
>
>
>any comments?...
>
>

Richard Harter, c...@tiac.net
http://home.tiac.net/~cri, http://www.varinoma.com
Kafka wasn't an author;
Kafka was a prophet!

BGB / cr88192

unread,

Nov 1, 2009, 2:16:36 PM11/1/09

to

"Pascal J. Bourguignon" <p...@informatimago.com> wrote in message

news:87y6mqg...@galatea.local...

> "BGB / cr88192" <cr8...@hotmail.com> writes:
>
>> "Pascal J. Bourguignon" <p...@informatimago.com> wrote in message
>> news:877huai...@galatea.local...
>>> "BGB / cr88192" <cr8...@hotmail.com> writes:
>>>> note: I have plenty of past experience with both ASM and JIT, but at
>>>> the
>>>> moment am leaning against this, and in favor of using pure C-based
>>>> interpretation for now.
>>>>
>>>> dunno if anyone knows some major way around this.
>>>
>>> Modify the C compiler so that it generates optimized code for your
>>> specific case (ie. hide your specific assembly in the C compiler).
>>>
>>
>> I am using MSVC on x64 for this...
>
> So unless you're rich enough to buy the sources of MSVC, forget it.
>

yep.

>> I can't even do stupid inline assembly, as MS removed this feature for
>> x64...
>
> May I suggest you to switch to an open source compiler (such as gcc)?
> Or just write your own compiler?
>

GCC on Win64 was very unreliable last I checked (occasionally generating bad
code, ...), but it may have improved since then.

my compiler works, just I similarly don't trust it that well to produce good
code, and using it for compiling much of anything "non-trivial" is likely to
leave bugs all over the place which would be rather difficult to track down
(I leave it mostly for compiling smaller code fragments at runtime, since a
failure here is at least usually obvious and nearly not so difficult to
track down and fix...).

granted, running head-first into the land of bug fixing would be a good way
to improve its reliability, but I guess I am not yet willing to go down this
path (AKA: having lots of code built with lots of little hidden bugs...).

>> now, I could use my own compiler, but alas then I would have to debug its
>> output...
>
> Hence the suggestion for a debugged compiler such as gcc.
>

it may have improved since last I looked, but at the time GCC's Win64 output
was buggy.

I had compiled a big mass of code, and it was crashing. I eventually tracked
it down to GCC having been messing up with handling the Win64 calling
convention, and so I ended up using MSVC instead, figuring at least its
Win64 support demonstratably works reliably...

>
>> or even GCC's Win64 support, but last time I tried to use it, it produced
>> code so bad (buggy and ill formed) as to scare me away. granted, it has
>> been
>> a fairly long time now (~ 1 year already?...), so maybe they have
>> improved
>> it some, I don't know...
>
> This doesn't matter. What matters is that you have the sources, so
> you can improve its optimizer.
>

with GCC, one probably will not have to, as presumably at least GCC has an
idea how to produce efficient code (it seems to do so well enough on Linux).

so, the biggie issue is mostly a matter of bugs, and I am not really willing
to go digging through GCC's (IMO, teh huge) codebase on a bug hunt...

granted, they have hopefully improved the bugginess at least, which is the
main issue.

> (But also, read the interview of Fran Allen in "Coders at Work" (Peter
> Seibel, Apress), you might prefer another language altogether to be
> able to really optimize its compiler).
>

I have a fairly large existing C codebase, and I will probably not switch to
another language noting that most others have poor C integration.

it is, likewise:
it is not that there was no reason that I chose x86 as the main ISA for this
interpreter...

I chose x86 for similar reasons to why I stick with C...

even if, granted, x86 is not the most ideal ISA for an interpreter, but
hell, at least it allows me to use GCC for building a lot of code that runs
within my VM, which is itself enough of a reason IMO...

>
> --
> __Pascal Bourguignon__

BGB / cr88192

unread,

Nov 1, 2009, 2:59:06 PM11/1/09

to

"Richard Harter" <c...@tiac.net> wrote in message
news:4aedcfb0....@text.giganews.com...

errm...

the x86 ISA has a bit more than this, closer to around 1,800 (core integer
ops, x87, SSE/SSE2/..., and partial AVX). full AVX and SSE5 might push it a
bit over 2000.

note that this is actual opcodes, not nmonics, where x86 likes using a
number of different opcodes with the same nmonic, although in my list there
are still around 800 nmonics...

as for there being only a few hundred opcodes, in nearly any other bytecode
this would probably be the case, but is not so much the case with x86...

the interpreter does not at present handle the full set though...

but, the hash is not used for opcode lookup/decoding, rather it is used for
grabbing already decoded instructions from a cache (which is based on memory
address).

(this is because instruction decoding is fairly expensive, and so is not
done inline, and when instructions are actually decoded, they are converted
from an in-memory opcode, to a struct-based representation, which is what is
fetched here by the hash lookup).

so, it serves a role similar to that of the L1 cache.

> The
> hash function you use is relatively expensive. Using a hash
> table at all is relatively expensive. A trie might be cheaper or
> possibly a perfect hash. Then there is the question of why you
> are using a switch at all. If the op-code lookup yields an
> index, why aren't you using it as an index into a table that
> contains the pointer to the action function and the op-code
> struct?
>

the hash table fetches the (already decoded) opcode struct, and the switch
is used to have the interpreter actually so something.

Step:
hash EIP address (actually: EIP+CS.Base);
try fetch decoded opcode:
ExecOpcode.
fail:
set up hash entry;
decode opcode at EIP into hash;
ExecOpcode.

ExecOpcode:
switch(op->type) //not exact, but close enough
case Basic: ExecOpcodeBasic(); break;
case Reg: ExecOpcodeReg(); break;
case RM: ExecOpcodeRM(); break;
case RegRM: ExecOpcodeRegRM(); break;
case RMReg: ExecOpcodeRMReg(); break;
case Imm: ExecOpcodeImm(); break;
case RegImm: ExecOpcodeRegImm(); break;
case RMImm: ExecOpcodeRMImm(); break;
...

...
ExecOpcodeRegRM:
resolve RM, ...
switch(op->opnum)
case ADC: ...
case ADD: ...
case AND: ...
case BSF: ...
case BSR: ...
...
...

in all, it is a fairly "heavyweight" interpreter, but alas, x86 is also a
fairly heavy ISA...

> Mind you, there might be very good reasons for doing what you are
> doing. I don't have access to your code. Still, it sounds as
> though your code is cumbersome.
>

kind of, but it could be a whole lot worse...

how would you approach something like x86?...

hopefully not by suggesting a huge switch to dispatch for every possible
opcode byte/prefix/... and then re-enter for following bytes. many opcodes
are long (2/3 bytes for just the 'opcode' part, or 4/5 for many SSE
opcodes), and then there is the Mod/RM field, which itself is an issue to
decode.

(if you don't believe me, go and take a good long hard look at the Intel
docs, and imagine how you would go about writing an interpreter for all this
stuff...).

I started initially trying this, but soon found it unworkable.

I then switched to a more "generic" strategy (essentially grabbing my
pre-existing disassembler and using this as a template for an opcode
decoder, which itself was mostly based on a pattern-matching strategy).

it took a bit of tweaking to get the speed up (following that of reworking
it into a piece of usable interpreter machinery), but it was still fairly
slow vs the interpreter, hence I added the hash (actually, originally I
considered a more involved strategy, but this was much simpler for now, and
can still pull off more-or-less the same effect...).

Richard Harter

unread,

Nov 1, 2009, 4:50:28 PM11/1/09

to

On Sun, 1 Nov 2009 12:59:06 -0700, "BGB / cr88192"
<cr8...@hotmail.com> wrote:

>
>"Richard Harter" <c...@tiac.net> wrote in message
>news:4aedcfb0....@text.giganews.com...
>> On Sat, 31 Oct 2009 23:46:13 -0700, "BGB / cr88192"
>> <cr8...@hotmail.com> wrote:

[snip]

>>
>> Offhand the above sounds ugly. I don't know many op codes you
>> are using but I expect that it most it is a few hundred.
>
>errm...
>
>the x86 ISA has a bit more than this, closer to around 1,800 (core integer
>ops, x87, SSE/SSE2/..., and partial AVX). full AVX and SSE5 might push it a
>bit over 2000.
>
>note that this is actual opcodes, not nmonics, where x86 likes using a
>number of different opcodes with the same nmonic, although in my list there
>are still around 800 nmonics...
>
>as for there being only a few hundred opcodes, in nearly any other bytecode
>this would probably be the case, but is not so much the case with x86...

That's still not all that bad, but it is less than pleasant.

>
>
>the interpreter does not at present handle the full set though...
>
>
>but, the hash is not used for opcode lookup/decoding, rather it is used for
>grabbing already decoded instructions from a cache (which is based on memory
>address).
>
>(this is because instruction decoding is fairly expensive, and so is not
>done inline, and when instructions are actually decoded, they are converted
>from an in-memory opcode, to a struct-based representation, which is what is
>fetched here by the hash lookup).
>
>so, it serves a role similar to that of the L1 cache.

I'm a little confused here. Are we just talking about opcodes
(decoded or not) or are we talking about full instructions. If
we are only talking about opcodes then everything can be
precomputed except an opcode->index mapping.

Something isn't quite right here. In the switch ExecOpcodeRegRM
is a function; in the code immediately after it is a label.

One problem with your switches is that they are big. Now a good
compiler can and should optimize a compact set of cases into a
jump table. However all bets are off if there are holes or if
the cases are sequential. If doesn't produce a jump table it
probably defaults to sequential if/then/else chains which are
deadly for large switches. Thus in your ExecOpCode switch you
really want a table of function pointers indexed by op->type so
that the switch becomes something like

optypes[op->type]();

The op types you use for the table have to be approximately
sequential integers. If Intel doesn't give you the right kind of
numbers to use as an index, create an index field in your struct.
You can even have holes in the indexing - just fill the holes
with pointers to an error function.

All of this is fairly standard. There probably is some good
reason why you aren't doing this, but it isn't obvious.

>
>in all, it is a fairly "heavyweight" interpreter, but alas, x86 is also a
>fairly heavy ISA...
>
>
>> Mind you, there might be very good reasons for doing what you are
>> doing. I don't have access to your code. Still, it sounds as
>> though your code is cumbersome.
>>
>
>kind of, but it could be a whole lot worse...
>
>
>how would you approach something like x86?...
>
>hopefully not by suggesting a huge switch to dispatch for every possible
>opcode byte/prefix/... and then re-enter for following bytes. many opcodes
>are long (2/3 bytes for just the 'opcode' part, or 4/5 for many SSE
>opcodes), and then there is the Mod/RM field, which itself is an issue to
>decode.

Shudder. If in fact there are only a couple of thousand op codes
then 2/3 is plausible for organizational reasons. However 4/5
sounds a little fishy. Are there really only a couple of
thousand when you take everything into account.

But no, if there only a few thousand opcodes in 2/3 bytes a
specialized trie should do quite nicely. I like to do
precomputed magic tables myself, but it may turn out that a hash
function will do fine. You aren't by any chance using a prime
number size table are you? The modulo operation could be the
bulk of your time in the hash table.

>
>(if you don't believe me, go and take a good long hard look at the Intel
>docs, and imagine how you would go about writing an interpreter for all this
>stuff...).

Thanks but no thanks. Looking at the Intel docs is very low on
my list of priorities.

bartc

unread,

Nov 1, 2009, 4:59:57 PM11/1/09

to

"BGB / cr88192" <cr8...@hotmail.com> wrote in message
news:hcjarj$vro$1...@news.albasani.net...

> but, then again, there may still be a little tweaking left, as said switch
> has not gained 1st place (it is 2nd), and holds approx 12% of the total
> running time.
>
> 1st place, at 15%, is the hash-table for fetching instructions (and,
> sadly, is not really optimizable unless I discover some way to 'optimize'
> basic arithmetic expressions, which FWIW is about as productive as trying
> to optimize a switch...).
>
> this means: 27% of time:
> hash EIP (currently: "((EIP*65521)>>16)&65535");
> fetch opcode-struct from hash-table (the hash-fail rate is apparently very
> low);
> switch on opcode info (branches to functions containing further dispatch
> and logic code).
>
> (well, all this, as well as a few arithmetic ops, such as adding the
> opcode size to EIP, ...).
>
> previously I had an opcode decoder which was not super fast (approx 60% of
> runtime if sequential decoding is used), but the hash seems to have
> largely eliminated it (since most of the time, pre-decoded opcodes are
> used from the hash).

Forget C for the crucial dispatch code. If you don't have inline asm, use a
separate linked function.

I don't understand your hash technique. I would just have decoded an opcode
at a time (ie. 8-bits at a time), although I've long forgotten the
intricacies of intel opcode decodes. If ebx points to the next instruction
of 32-bit code:

movzx eax,byte [ebx]
inc ebx
jmp [eax*4 + mainopcode_jumptable]

When further decodes are required, a similar 3 lines but with a different
table.

And after successfully dealing with an opcode, repeat those initial three
lines (no need to return to a common loop point).

Some experience of creating a disassembler (which is not speed critical)
would be needed to help plan the best approach, but I would still go for asm
code like this rather than C (especially if having trouble getting
reasonable code out of it).

(I had my own hll which also had trouble generating decent code in cases
like this, insisting on range checking that I didn't need:

do
switch x
when a then...
....
end switch
od

I made a couple of language/compiler mods which allowed me to write:

doswitch
asm .....
when a then
...
end switch

Ie. I could write my own switch jump code in asm! And automatically looped
back to the start, although quite often I just repeated the switch code at
the end of some handlers)

> note: I have plenty of past experience with both ASM and JIT, but at the
> moment am leaning against this, and in favor of using pure C-based
> interpretation for now.

What slowdown are you getting compared with a real machine?

--
Bartc

Pascal J. Bourguignon

unread,

Nov 1, 2009, 5:08:33 PM11/1/09

to

"BGB / cr88192" <cr8...@hotmail.com> writes:

> dunno if anyone knows some major way around this.

Another way to do it would be to generate portably assembly with LLVM.

--
__Pascal Bourguignon__

bartc

unread,

Nov 1, 2009, 5:25:52 PM11/1/09

to

"BGB / cr88192" <cr8...@hotmail.com> wrote in message
news:hckpa8$aj2$1...@news.albasani.net...

> (if you don't believe me, go and take a good long hard look at the Intel
> docs, and imagine how you would go about writing an interpreter for all
> this stuff...).

I think emulator would be a better name. Interpreter has other connotations.

> I started initially trying this, but soon found it unworkable.

I have a feeling that that would still be the best approach, as I guess
that's how actual processors have to decode this stuff.

But I haven't tried it myself and it's too big a task to have a quick go (I
haven't even been able to find any opcode maps anywhere).

--
Bartc

BGB / cr88192

unread,

Nov 1, 2009, 6:14:40 PM11/1/09

to

"Richard Harter" <c...@tiac.net> wrote in message

news:4aedf3bd....@text.giganews.com...

> On Sun, 1 Nov 2009 12:59:06 -0700, "BGB / cr88192"
> <cr8...@hotmail.com> wrote:
>
>>
>>"Richard Harter" <c...@tiac.net> wrote in message
>>news:4aedcfb0....@text.giganews.com...
>>> On Sat, 31 Oct 2009 23:46:13 -0700, "BGB / cr88192"
>>> <cr8...@hotmail.com> wrote:
> [snip]
>
>>>
>>> Offhand the above sounds ugly. I don't know many op codes you
>>> are using but I expect that it most it is a few hundred.
>>
>>errm...
>>
>>the x86 ISA has a bit more than this, closer to around 1,800 (core integer
>>ops, x87, SSE/SSE2/..., and partial AVX). full AVX and SSE5 might push it
>>a
>>bit over 2000.
>>
>>note that this is actual opcodes, not nmonics, where x86 likes using a
>>number of different opcodes with the same nmonic, although in my list
>>there
>>are still around 800 nmonics...
>>
>>as for there being only a few hundred opcodes, in nearly any other
>>bytecode
>>this would probably be the case, but is not so much the case with x86...
>
> That's still not all that bad, but it is less than pleasant.

yes, ok.

some of my past bytecodes ended up with around 500 opcodes.

I think JBC has somewhere around 170-200 (I forget the exact number),
however, this is too few to do "that" much more with it.

some newer JVM versions added a few opcodes, with a good deal more
functionality being essentially multiplexed in via existing mechanisms (such
as method calls, ...). (actually, it could be loosely compared to building a
network filesystem interface on top of HTTP via WebDAV, where pre-existing
mechanisms are "extended" to handle new operations, ...).

the alternative strategy is, of course, adding piles of new opcodes...

>>
>>
>>the interpreter does not at present handle the full set though...
>>
>>
>>but, the hash is not used for opcode lookup/decoding, rather it is used
>>for
>>grabbing already decoded instructions from a cache (which is based on
>>memory
>>address).
>>
>>(this is because instruction decoding is fairly expensive, and so is not
>>done inline, and when instructions are actually decoded, they are
>>converted
>>from an in-memory opcode, to a struct-based representation, which is what
>>is
>>fetched here by the hash lookup).
>>
>>so, it serves a role similar to that of the L1 cache.
>
> I'm a little confused here. Are we just talking about opcodes
> (decoded or not) or are we talking about full instructions. If
> we are only talking about opcodes then everything can be
> precomputed except an opcode->index mapping.
>

ok, I was using the term to 'opcode' to refer to the entire instruction
(including arguments, ...).

so, in this case the 'opcode' structure contains:
opnum, which is an internal opcode number, which idendifies the nmonic;
opidx, which is an internal opcode index, which identifies the binary
representation of said imstruction;
width, which identifies the bit-width of integer ops;
reg, a register;
op_sz, cached in-memory size of opcode/instruction (used mostly to step
to the next EIP);
rm_base, memory base register, or another data register;
rm_idx, index register;
rm_sc, register scale;
rm_disp, an offset added to memory ops;
rm_fl, which holds flags;
imm, for holding an immediate;
reg2/reg3, for AVX (because we really need 4-register opcodes...);
...

the opcode decoder recognizes the pattern, and fills in all the fields in
the struct as appropriate.
in this way, the main interpreter just grabs the fields from the structure
it needs, without having to fiddle with its serialized form.

these structures are keyed by memory address, which is where the hash table
comes in.

ok, my notation isn't very good.
it is a function, but the label-like notation shows what it contains.

ok, imagine it were something like:

ExecOpcodeRegRM():
...

> One problem with your switches is that they are big. Now a good
> compiler can and should optimize a compact set of cases into a
> jump table. However all bets are off if there are holes or if
> the cases are sequential. If doesn't produce a jump table it
> probably defaults to sequential if/then/else chains which are
> deadly for large switches. Thus in your ExecOpCode switch you
> really want a table of function pointers indexed by op->type so
> that the switch becomes something like
>
> optypes[op->type]();
>
> The op types you use for the table have to be approximately
> sequential integers. If Intel doesn't give you the right kind of
> numbers to use as an index, create an index field in your struct.
> You can even have holes in the indexing - just fill the holes
> with pointers to an error function.
>
> All of this is fairly standard. There probably is some good
> reason why you aren't doing this, but it isn't obvious.
>

actually, I have noted already that the compiler produces jump tables
internally for all these, so this much is good (I know, I have seen the
produced ASM for these cases).

the main cost is that it ends up taking around 10-12 instructions to handle
the jump-table logic, which is mostly the fault of MSVC lameness...

it is apparently mostly just producing if/else style sequences for very
small switches, such as apparently the switch for segment-overrides. I could
use a table for this, but in this case it is not in a place which causes
notable impact.

>>
>>in all, it is a fairly "heavyweight" interpreter, but alas, x86 is also a
>>fairly heavy ISA...
>>
>>
>>> Mind you, there might be very good reasons for doing what you are
>>> doing. I don't have access to your code. Still, it sounds as
>>> though your code is cumbersome.
>>>
>>
>>kind of, but it could be a whole lot worse...
>>
>>
>>how would you approach something like x86?...
>>
>>hopefully not by suggesting a huge switch to dispatch for every possible
>>opcode byte/prefix/... and then re-enter for following bytes. many opcodes
>>are long (2/3 bytes for just the 'opcode' part, or 4/5 for many SSE
>>opcodes), and then there is the Mod/RM field, which itself is an issue to
>>decode.
>
> Shudder. If in fact there are only a couple of thousand op codes
> then 2/3 is plausible for organizational reasons. However 4/5
> sounds a little fishy. Are there really only a couple of
> thousand when you take everything into account.
>

this is mostly because of how Intel organizes their opcodes.

a lot of the SSE opcodes use 0x66, 0xF3, 0xF2, ... as part of the opcode.

here are a few, which may encode as 3 or 4 bytes for the opcode (X is the
optional REX prefix).
addpd 66X0F58/r xr,xrm
addps X0F58/r xr,xrm
addsd F2X0F58/r xr,xrm
addss F3X0F58/r xr,xrm
addsubpd 66X0FD0/r xr,xrm
addsubps F2X0FD0/r xr,xrm

here are a few with 4 or 5 byte opcodes:
dppd 66X0F3A41/r,ib xr,xrm,i8
dpps 66X0F3A40/r,ib xr,xrm,i8
extractps 66X0F3A17/r,ib xr,xrm,i8

insertps 66X0F3A21/r,ib xr,xrm,i8

mpsadbw 66X0F3A42/r,ib xr,xrm,i8

...

aesenc 66X0F38DC/r xr,xrm
aesenclast 66X0F38DD/r xr,xrm
aesdec 66X0F38DE/r xr,xrm
aesdeclast 66X0F38DF/r xr,xrm
aesimc 66X0F38DB/r xr,xrm
aeskeygenassist 66X0F3ADF/r,ib xr,xrm,i8

...

for AVX Intel switched to an essentially partly new opcode encoding scheme.

the AVX opcodes in the listing thus far seem to be using a mostly fixed
4-byte opcodes.

vblendpd Ijr0D/r,ib xr,xrv,xrm,i8
Ijv0D/r,ib yr,yrv,yrm,i8
vblendps Ijr0C/r,ib xr,xrv,xrm,i8
Ijv0C/r,ib yr,yrv,yrm,i8
vblendvpd Ijr4B/r,is xr,xrv,xrm,xri
Ijv4B/r,is yr,yrv,yrm,yri
vblendvps Ijr4A/r,is xr,xrv,xrm,xri
Ijv4A/r,is yr,yrv,yrm,yri

where Ixx basically encodes a 3-byte VEX prefix, so essentially a 4-byte
opcode.

luckily at least, most of the basic opcodes use only 1 or 2 bytes for the
opcode...

this is followed by however many more are needed to encode Mod/RM or
immediates...

so, for example:
mov [eax+4*ecx+0xF00BA5], 0xBAF00

can take 10 bytes (in addition to the opcode, this is the "/r" in the
listings above).
with a 5-byte opcode, this could take 15 bytes for an instruction.

luckily, this should not happen in practice (usually it takes special
construction to create an opcode exceeding Intel's 16-byte limit).

> But no, if there only a few thousand opcodes in 2/3 bytes a
> specialized trie should do quite nicely. I like to do
> precomputed magic tables myself, but it may turn out that a hash
> function will do fine. You aren't by any chance using a prime
> number size table are you? The modulo operation could be the
> bulk of your time in the hash table.
>
>

luckily the main interpreter does not need to see the horrors of the
opcode/instruction encoding scheme, as it is all much more "sterile" than
this (all of this is left to the "opcode decoder", which is, luckily, not
currently a significant part of the running time).

as for the hash table, it is a power-of-2 size, and hence I use a mask...
actually, I since replaced the mask with a cast to 'u16', since this is
slightly faster in this case.

I don't use modulo in hash calculations as this tends to kill performance.
I usually use a multiply, a shift, an a mask.

multiply+mask also works, but it requires a mersenne prime to work well, and
has its own set of "interesting" properties, which as I see it did not match
the current use (I chose instead to multiply by a non-mersenne prime and use
a shift).

>>
>>(if you don't believe me, go and take a good long hard look at the Intel
>>docs, and imagine how you would go about writing an interpreter for all
>>this
>>stuff...).
>
> Thanks but no thanks. Looking at the Intel docs is very low on
> my list of priorities.

I deal with them a lot...

much of what I have done in the past few years has involved these docs...

bartc

unread,

Nov 1, 2009, 6:54:36 PM11/1/09

to

"BGB / cr88192" <cr8...@hotmail.com> wrote in message
news:hcl4ot$sia$1...@news.albasani.net...

> ok, I was using the term to 'opcode' to refer to the entire instruction
> (including arguments, ...).
>
> so, in this case the 'opcode' structure contains:
> opnum, which is an internal opcode number, which idendifies the nmonic;
> opidx, which is an internal opcode index, which identifies the binary
> representation of said imstruction;
> width, which identifies the bit-width of integer ops;
> reg, a register;
> op_sz, cached in-memory size of opcode/instruction (used mostly to step
> to the next EIP);
> rm_base, memory base register, or another data register;
> rm_idx, index register;
> rm_sc, register scale;
> rm_disp, an offset added to memory ops;
> rm_fl, which holds flags;
> imm, for holding an immediate;
> reg2/reg3, for AVX (because we really need 4-register opcodes...);
> ...
>
> the opcode decoder recognizes the pattern, and fills in all the fields in
> the struct as appropriate.
> in this way, the main interpreter just grabs the fields from the structure
> it needs, without having to fiddle with its serialized form.
>
>
> these structures are keyed by memory address, which is where the hash
> table comes in.

It looks like you're translating from the conventional packed form of the
instruction, to an unpacked form that your code finds it easier to deal
with.

In that case why not complete the process and just run this code directly
and forget the original? You will need to change all references to code
(jumps and calls) to addresses within this new block of code.

Then the hashing disappears and you can use whatever switch index you like.
(Although you are now no longer emulating x86 but interpreting something
different.)

(I understood anyway that your 'x86' code was not exactly x86, but only
similar. This means you might be able to directly generate the simpler,
unpacked form, and not bother with any decoding at all.)

--
Bartc

BGB / cr88192

unread,

Nov 1, 2009, 7:46:30 PM11/1/09

to

"bartc" <ba...@freeuk.com> wrote in message
news:0dpHm.792$Ym4...@text.news.virginmedia.com...

yes.

> In that case why not complete the process and just run this code directly
> and forget the original? You will need to change all references to code
> (jumps and calls) to addresses within this new block of code.
>

this case is still tied to the original image, and needs to be able to deal
with cases like self-modifying code...

granted, if code were assumed to be static and read/only, there is little to
prevent discarding the original code...

> Then the hashing disappears and you can use whatever switch index you
> like. (Although you are now no longer emulating x86 but interpreting
> something different.)
>

the 'switch' is actually using indices which are generated automatically by
tools which write parts of the interpreter (itself borrowed from my
assembler/disassembler).

the tool essentially processes high-level listings and writes a lot of
code/tables/... allowing a lot of the decode process to be largely automatic
(I ended up mostly adding a lot of additional specialized logic mostly to
make this part of the process go faster...).

the hash itself it used for fetching the opcode structure, not for figuring
out an index to use for the switch...

> (I understood anyway that your 'x86' code was not exactly x86, but only
> similar. This means you might be able to directly generate the simpler,
> unpacked form, and not bother with any decoding at all.)
>

this was a different idea...

I had considered an x86 subset mostly to allow the possibility of high-level
analysis, verification, and translation...

if you remember, I had used the aleph/bet/gimel/dalet naming scheme for that
idea...

all this is, in effect, plain x86, not my modified subset...
(although, it is restricted to userspace code, and some of the internal
details of the interpreter's internals are not strictly orthodox, but this
is largely invisible from the POV of the code being run...).

they are, however, intended for different purposes, although given both
technologies would be based on the binary representation of x86, they would
be, essentially, compatible...

BGB / cr88192

unread,

Nov 1, 2009, 7:57:11 PM11/1/09

to

"bartc" <ba...@freeuk.com> wrote in message

news:QVnHm.760$Ym4...@text.news.virginmedia.com...

>
> "BGB / cr88192" <cr8...@hotmail.com> wrote in message
> news:hckpa8$aj2$1...@news.albasani.net...
>
>
>> (if you don't believe me, go and take a good long hard look at the Intel
>> docs, and imagine how you would go about writing an interpreter for all
>> this stuff...).
>
> I think emulator would be a better name. Interpreter has other
> connotations.
>

so does emulator...

emulator implies emulating the complete arch (hardware, BIOS, ...), rather
than simply the userspace portion of the CPU / ISA...

this is thus much more "lightweight" than a full emulator.

>> I started initially trying this, but soon found it unworkable.
>
> I have a feeling that that would still be the best approach, as I guess
> that's how actual processors have to decode this stuff.
>

but, it is not strictly necessary...

> But I haven't tried it myself and it's too big a task to have a quick go
> (I
> haven't even been able to find any opcode maps anywhere).
>

check the Intel docs...

> --
> Bartc
>
>

BGB / cr88192

unread,

Nov 2, 2009, 1:24:29 AM11/2/09

to

"Pascal J. Bourguignon" <p...@informatimago.com> wrote in message

news:87my35h...@galatea.local...

> "BGB / cr88192" <cr8...@hotmail.com> writes:
>> dunno if anyone knows some major way around this.
>
> Another way to do it would be to generate portably assembly with LLVM.
>

or, if I were going to go this route, I could use my pre-existing codegen
backend as well...

(basically, this would be because the interpreter was written to be used
along with my pre-existing compiler/VM framework, and quite possibly
accepting code generated by my custom compiler, as well as possibly libs
compiled with GCC, or some other compiler for 32-bit x86...).

either way, it would make the general "cost" much higher than an
interpreter, although it would allow for much better performance.

elsewhere in the thread I think I stated that I was sticking with pure C for
now.
there are a few reasons for this...

either way, I am not certain if a general-purpose codegen/backend of either
style is ideal for JITing x86 machine code, vs say a JIT based more around
the "model" already in-use within x86.

granted, a generic JIT could be better for anything<->anything translation
though, vs x86 to some-other-machine-code (probably x86 or x86-64).

but, for now, the interpreter may have to do as-is...
I can maybe micro-optimize it a little further, but I guess at this point I
can no longer expect drastic speed-ups (the only other major way to speed it
up being by having it do less, as in, the possiblity of using system calls
for many operations).

another possibility is the use of "magic" sequences (can be compared with
the prologues and epilogues in Win64), which would consist of special
patterns which the interpreter would recognize and handle specially.

I have already done this with a few sequences (revolving around 'rep movsb'
and friends), although it seems that the current C library I am using is
using generic loop code instead, which of course would not stuble on these
special optimizations (not that it would be difficult to change if needed).

another place is with more complex operations, which is where system calls
come into play.

specialized pattern recognition and optimization is also a possibility
(similar to a compiler/optimizer), but would likely be too expensive for
real-time use, and is not ideally matched for the design I ended up using
(where opcodes are executed individually rather than organized into
'traces').

so, for now, it will have to work...

I guess a 170x slowdown is not THAT bad, FWIW...

> --
> __Pascal Bourguignon__

BGB / cr88192

unread,

Nov 2, 2009, 1:50:41 AM11/2/09

to

"bartc" <ba...@freeuk.com> wrote in message

news:xxnHm.748$Ym4...@text.news.virginmedia.com...

what is to say that the main interpreter switch is dealing with matters of
decoding?...
what is to say that this would actually be any faster than having it NOT
deal with matters of decoding?...
...

apart from needing it to calculate jump targets, the main interpreter almost
does not need to even keep track of EIP...

> Some experience of creating a disassembler (which is not speed critical)
> would be needed to help plan the best approach, but I would still go for
> asm code like this rather than C (especially if having trouble getting
> reasonable code out of it).
>
> (I had my own hll which also had trouble generating decent code in cases
> like this, insisting on range checking that I didn't need:
>
> do
> switch x
> when a then...
> ....
> end switch
> od
>
> I made a couple of language/compiler mods which allowed me to write:
>
> doswitch
> asm .....
> when a then
> ...
> end switch
>
> Ie. I could write my own switch jump code in asm! And automatically looped
> back to the start, although quite often I just repeated the switch code at
> the end of some handlers)
>

the decoder is fairly directly based on a disassembler, and yeah, switches
on some of the opcode bytes are used to assist in speeding up the decode
logic.

however, I was able to mostly eliminate instruction decoding from the
runtime overhead via the use of the hash.

also, there are reasons I am using C here...
what is to say an interpreter for x86 will necessarily run ON x86?...

as for doing interpreter logic partly in ASM, I had partly considered this
as well, but for now am sticking with C for other reasons (mostly that, at
the moment, ASM would add more complexities than are worthwhile...).

>> note: I have plenty of past experience with both ASM and JIT, but at the
>> moment am leaning against this, and in favor of using pure C-based
>> interpretation for now.
>
> What slowdown are you getting compared with a real machine?
>

currently, by simplistic tests, around 170x at present...

so, it is not great, but it could be worse...

> --
> Bartc

tm

unread,

Nov 2, 2009, 5:26:46 AM11/2/09

to

On 1 Nov., 20:16, "BGB / cr88192" <cr88...@hotmail.com> wrote:
> "Pascal J. Bourguignon" <p...@informatimago.com> wrote in messagenews:87y6mqg...@galatea.local...
>

> > "BGB / cr88192" <cr88...@hotmail.com> writes:
>
> >> "Pascal J. Bourguignon" <p...@informatimago.com> wrote in message
> >>news:877huai...@galatea.local...

> >>> "BGB / cr88192" <cr88...@hotmail.com> writes:
> >>>> note: I have plenty of past experience with both ASM and JIT, but at
> >>>> the
> >>>> moment am leaning against this, and in favor of using pure C-based
> >>>> interpretation for now.
>
> >>>> dunno if anyone knows some major way around this.
>
> >>> Modify the C compiler so that it generates optimized code for your
> >>> specific case (ie. hide your specific assembly in the C compiler).
>
> >> I am using MSVC on x64 for this...
>
> > So unless you're rich enough to buy the sources of MSVC, forget it.
>
> yep.
>
> >> I can't even do stupid inline assembly, as MS removed this feature for
> >> x64...
>
> > May I suggest you to switch to an open source compiler (such as gcc)?
> > Or just write your own compiler?
>
> GCC on Win64 was very unreliable last I checked (occasionally generating bad
> code, ...), but it may have improved since then.

Did you tell them about the errors you found?

> my compiler works, just I similarly don't trust it that well to produce good
> code, and using it for compiling much of anything "non-trivial" is likely to
> leave bugs all over the place which would be rather difficult to track down
> (I leave it mostly for compiling smaller code fragments at runtime, since a
> failure here is at least usually obvious and nearly not so difficult to
> track down and fix...).

When you think that GCC and MSVC are not okay you should
definitely use your own C compiler. When there are bugs in it
create a test suite of C programs. That way you can reduce the
number of errors.

You seem to write programs in many areas. But to get this
programs finished with good quality I can only suggest:
Eat your own dogfood.

Greetings Thomas Mertes

Seed7 Homepage: http://seed7.sourceforge.net
Seed7 - The extensible programming language: User defined statements
and operators, abstract data types, templates without special
syntax, OO with interfaces and multiple dispatch, statically typed,
interpreted or compiled, portable, runs under linux/unix/windows.

bartc

unread,

Nov 2, 2009, 6:45:11 AM11/2/09

to

"BGB / cr88192" <cr8...@hotmail.com> wrote in message

news:hclvg3$ubg$1...@news.albasani.net...

>
> "bartc" <ba...@freeuk.com> wrote in message
> news:xxnHm.748$Ym4...@text.news.virginmedia.com...

>

>> What slowdown are you getting compared with a real machine?
>>
>
> currently, by simplistic tests, around 170x at present...
>
> so, it is not great, but it could be worse...

I put together a very quick test for the following code:

mov ebx,100000000
l1:
sub ebx,1
jnz l1
hlt

Emulating only the last 3 instructions, the emulation was 30x slower than
the real code (the first 3 instructions), using a mix of hll and assembler.
Could be a bit tighter with dedicated registers, but I'm a bit short of
time.

The sub instructions used a 2-level jump table (on 1st 8 bits of first byte,
and top 5 bits of modrm byte).

--
bartc

James Harris

unread,

Nov 2, 2009, 8:26:26 AM11/2/09

to

On 2 Nov, 06:50, "BGB / cr88192" <cr88...@hotmail.com> wrote:
> "bartc" <ba...@freeuk.com> wrote in message

...

I'll add my comments here as they are similar to bartc's.

> > I don't understand your hash technique. I would just have decoded an
> > opcode at a time (ie. 8-bits at a time), although I've long forgotten the
> > intricacies of intel opcode decodes. If ebx points to the next instruction
> > of 32-bit code:
>
> > movzx eax,byte [ebx]
> > inc ebx
> > jmp [eax*4 + mainopcode_jumptable]
>
> > When further decodes are required, a similar 3 lines but with a different
> > table.
>
> > And after successfully dealing with an opcode, repeat those initial three
> > lines (no need to return to a common loop point).
>
> what is to say that the main interpreter switch is dealing with matters of
> decoding?...
> what is to say that this would actually be any faster than having it NOT
> deal with matters of decoding?...
> ...
>
> apart from needing it to calculate jump targets, the main interpreter almost
> does not need to even keep track of EIP...

As I understand it your hash function looks up instructions that you
have already decoded. That's quite sophisticated and I infer from that
that your decoding is expensive. Whether that's the case or not I
would think that it's likely to be fastest to decode by simple lookup
tables.

...

> the decoder is fairly directly based on a disassembler, and yeah, switches
> on some of the opcode bytes are used to assist in speeding up the decode
> logic.
>
> however, I was able to mostly eliminate instruction decoding from the
> runtime overhead via the use of the hash.

OK but the hash (together with interpreting what the hash returns and
dealing with any duplicate hashes) is still costly and ISTM that a
straight table-based decode may be faster. Why not try the following.

>
> also, there are reasons I am using C here...
> what is to say an interpreter for x86 will necessarily run ON x86?...

Good point. I wrote my suggestion in assembler before I'd seen your
comment. Trying to write in assembler can be a good idea anyway to
confirm that code *should* compile to be fast. It should be easy
enough to cast to C.

The main code is just the following. It is untested. Apart from the
comments it is only five instructions - all of which are pairable.
Before starting the code you would need to load esi with the first
instruction address (to set esi as the initial simulated eip) and
point edi at the initial opcode table.

;----- Code start -----

;Registers used
; eax to edx - workspace; on dispatch al holds the fetched opcode
byte
; ebp - prefix flags to indicate rep, lock, segment and size
overrides
; esi - copy of the simulated instruction pointer
; edi - points to the relevant 256-entry opcode table

next_instruction:

;Clear all prefix instruction indications
xor ebp, ebp

next_instruction_byte:

;Fetch the next byte of the instruction and increment instruction
pointer
movzx al, [esi]
inc esi

;Execute the corresponding routine
jmp [edi + 4 * eax]

;----- Code end -----

That's it. Once that's in place the service routines would vary
according to what was needed. So I'm not leaving the hard part undone
ere are some samples. First, dealing with a prefix byte, all that's
needed is to set the relevant bit in the prefix-flags register.

;Code for 0x66 - operand size prefix
or ebp, PREFIX_FLAGS_OPSIZ
jmp next_instruction_byte

Second, code for nop. This jumps back to a different point so that the
prefix flags are cleared. Easy to do in assembler, of course, but
maybe some nested loops would just about be able to deal with this in
C without using goto.

;Code for 0x90 - nop
jmp next_instruction

Third, if the first byte is an "escape" code such as 0x0f the
instruction table needs to be changed. Subsequent code executed would
be pointed at by the new table.

;Code for 0x0f - instruction extension
lea edi, [opcode_following_0f_table]
jmp next_instruction_byte

Finally, here's some code to show that entry points can be combined
together to save code space. The intention is to push whatever
register is addressed by three bits (in this case the last three) of
an instruction byte. A small subtlety here is that the 80386 and later
push the value of esp as it was before the push. This routine does
that too.

;Code for initial opcodes 0x50 through 0x57 - push reg
;Identify and fetch the simulated register
and eax, 7
mov eax, [simulated_general_regs + 4 * eax]

;Push on to the simulated stack
;(only 32-bit shown; 16-bit needs to be added)
mov ebx, [simulated_esp]
sub ebx, 4
mov [simulated_esp], ebx
mov [ebx], eax

;And we are done
jmp next_instruction

Naturally, any instructions after an escape would reset to using the
initial opcode table before starting the next instruction with such as

lea edi, [initial_opcode_table]

> as for doing interpreter logic partly in ASM, I had partly considered this
> as well, but for now am sticking with C for other reasons (mostly that, at
> the moment, ASM would add more complexities than are worthwhile...).

Maybe you were trying to do complex stuff in ASM. The above assembler
is blindingly simple. All it is doing is looking up what to do next
from a table with 256 entries.

Come back ASM, all is forgiven. :-)

...

> > What slowdown are you getting compared with a real machine?
>
> currently, by simplistic tests, around 170x at present...
>
> so, it is not great, but it could be worse...

You've maybe already explained why but if you want to emulate the
80386 why not use an existing emulator? Is it something to do with
wanting to restrict to a subset of its instructions?

James

BGB / cr88192

unread,

Nov 2, 2009, 10:53:20 AM11/2/09

to

"tm" <thomas...@gmx.at> wrote in message
news:6bc33382-90b5-49aa...@a21g2000yqc.googlegroups.com...

I was not sure who to tell about what, nor was I not sure of the nature or
the cause of the bug (not really knowing GCC's internals...). only, that it
was a bug with passing arguments (args got messed up in transit, from what I
remember due to incorrect register handling).

>> my compiler works, just I similarly don't trust it that well to produce
>> good
>> code, and using it for compiling much of anything "non-trivial" is likely
>> to
>> leave bugs all over the place which would be rather difficult to track
>> down
>> (I leave it mostly for compiling smaller code fragments at runtime, since
>> a
>> failure here is at least usually obvious and nearly not so difficult to
>> track down and fix...).
>
> When you think that GCC and MSVC are not okay you should
> definitely use your own C compiler. When there are bugs in it
> create a test suite of C programs. That way you can reduce the
> number of errors.
>

yeah, proper testing should be good...
I guess I have not done a whole lot of this.

as for MSVC, usually it is good enough (after all, it is good enough for
MS).
it is just not as ideal for areas of code which notably effect performance.

my compiler can produce often more efficient code (except in the case of
switch, errm... because I never got around to properly implementing
jump-tables, and so there is an "O(log2 n)" switch overhead...).

so, nevermind switch, I do address many optimizations which MSVC apparently
does not, which at the time I had found unimpressive (given MS has lots of
resources, ...).

> You seem to write programs in many areas. But to get this
> programs finished with good quality I can only suggest:
> Eat your own dogfood.
>

I originally wrote it mostly for JIT / at-runtime compilation, and this is
mostly what I use it for (generating scripts at runtime, and compiling and
running these).

it was only recently that I had (ever) used it for a statically compiled
piece of code, but I forget for what reason I made it usable as a static
compiler.

actually, now I remember, it was the idea of using the Win64 calling
convention on Linux, just before I ended up figuring that I could probably
get more for less effort by instead using virtualized x86 as a
cross-platform IL (in between with a failed attempt at designing a bytecode,
which ended up as a hybrid of x86 and EFI bytecode...).

I guess if I were to use it as a static compiler, for building project
sources, I would have to fix up a few of its known deficiencies:
its lack of a properly working 'switch';
its lack of statically-initialized structs;
...

and, I guess, maybe writing proper test-cases. I have a few small pieces of
code which I had been using for tests, but they are far from
comprehensive...

tm

unread,

Nov 2, 2009, 11:00:08 AM11/2/09

to

Do I understand this correctly?
The instructions are already decoded and when they
are executed a second decode (switch) is necessary.

This sounds very strange for me, but maybe I just
misunderstand the problem.

I have a question:
Why don't you decode to function pointers instead
of integers (which need to be decoded again)?
Function pointers would even be faster than a
lookup table.

The Seed7 interpreter (hi) works with function
pointers. That way no switch is necessary when
a program is executed.

James Harris

unread,

Nov 2, 2009, 12:00:24 PM11/2/09

to

Although you are replying to my post I presume your questions are for
BGB. I may have misunderstood him.

>
> This sounds very strange for me, but maybe I just
> misunderstand the problem.
>
> I have a question:
> Why don't you decode to function pointers instead
> of integers (which need to be decoded again)?

> Function pointers would even be faster than a
> lookup table.

Would they? I'm suggesting arrays (of function pointers). It's just
one pairable instruction to index into such arrays. You can't get
faster.

>
> The Seed7 interpreter (hi) works with function
> pointers. That way no switch is necessary when
> a program is executed.

Yes, a big switch will be slow except in the special case that the
data values are adjacent and the compiler converts it to a simple
indexing operation. Rather than hope the compiler does what one wants
it's better to express the function pointers in an array directly,
IMHO.

James

BGB / cr88192

unread,

Nov 2, 2009, 12:03:27 PM11/2/09

to

"James Harris" <james.h...@googlemail.com> wrote in message
news:475f3699-141a-46ea...@d5g2000yqm.googlegroups.com...

...

<--

As I understand it your hash function looks up instructions that you
have already decoded. That's quite sophisticated and I infer from that
that your decoding is expensive. Whether that's the case or not I
would think that it's likely to be fastest to decode by simple lookup
tables.

...
-->

yeah, the decoder was fairly expensive, and when it was used for every
instruction, it was around 60% of the total running time.

it uses LUT's for optimizing the opcode lookup, but the "final steps" are
done by generic logic code, which makes use of strings-driven-logic to match
the exact pattern and process the bytes.

for many things, string-driven-logic is fast enough, but in some cases, it
is not the fastest possible route...

by loose analogy, consider if the machine code were being matched by
regex'es, ...

hence, I used the hash to escape having to do it for every op, since it was,
granted, sort of expensive.

> the decoder is fairly directly based on a disassembler, and yeah, switches
> on some of the opcode bytes are used to assist in speeding up the decode
> logic.
>
> however, I was able to mostly eliminate instruction decoding from the
> runtime overhead via the use of the hash.

<--

OK but the hash (together with interpreting what the hash returns and
dealing with any duplicate hashes) is still costly and ISTM that a
straight table-based decode may be faster. Why not try the following.

-->

possibly, it could be faster, except that there are many "hidden" costs with
x86...
even if one could use a direct-lookup to match the opcode, they would still
need to decode the ModRM/SIB/disp mess, which is itself not likely to be
fast (in my case, in the decoder, it is handled via lots of bit-masking and
around 4-levels of switches, and was still a not-small portion of the total
running time of the decoder).

I am not sure if ModRM could "reasonably" be handled via pure LUT's (or, at
least, absent mechanically-generated code).

>
> also, there are reasons I am using C here...
> what is to say an interpreter for x86 will necessarily run ON x86?...

<--

Good point. I wrote my suggestion in assembler before I'd seen your
comment. Trying to write in assembler can be a good idea anyway to
confirm that code *should* compile to be fast. It should be easy
enough to cast to C.

-->

C is also good for prototyping and keeping the design more flexible...

even though it is an interpreter, and subject somewhat to performance
concerns, I still generally try to retain modularity and internal integrity.

it is not perfect, for example, my memory manager is tangled with the
internals of the address-translation code (rather than, say, a separate
component making use of address-space alloc/free functions and implementing
the MM similar to how it would do so with a conventional OS), ...

the DLL loader, however, does use the external interface, which was modeled
mostly after the Win32 API's VirtualAlloc/VirtualFree interface.
the DLL loader then exports something which resembles the LoadLibrary /
GetProcAddress / ... API.

elsewhere, a function is created which resembles 'CreateProcess', only that
I call it 'SpawnProcess' and it takes different args.

...

but, it could be worse.

<--

;----- Code start -----

next_instruction:

next_instruction_byte:

;----- Code end -----

lea edi, [initial_opcode_table]

-->

naturally enough, this could be done, but would essentially be a very
different beast than my current interpreter...

technically, the current interpreter is MUCH higher level in terms of its
internals, thinking mostly in terms of abstract register handles which are
get/set via special functions.

example:
rm=BGBV86_ResolveRMAddr(ctx, op);

switch(op->opnum)
{
...
case BGBV86_OP_ADD:
aq=BGBV86_GetRegGeneric(ctx, op->reg);
bq=BGBV86_ImageReadUGeneric(ctx, rm, op->width);
aq+=bq;
BGBV86_SetRegGeneric(ctx, op->reg, aq);
BGBV86_AdjustArithFlagsUGeneric(ctx, aq, op->width);
break;
case BGBV86_OP_AND:
aq=BGBV86_GetRegGeneric(ctx, op->reg);
bq=BGBV86_ImageReadUGeneric(ctx, rm, op->width);
aq&=bq;
BGBV86_SetRegGeneric(ctx, op->reg, aq);
BGBV86_AdjustArithFlagsUGeneric(ctx, aq, op->width);
break;
...
case BGBV86_OP_LEA:
BGBV86_SetRegGeneric(ctx, op->reg, rm);
break;
case BGBV86_OP_MOV:
aq=BGBV86_ImageReadUGeneric(ctx, rm, op->width);
BGBV86_SetRegGeneric(ctx, op->reg, aq);
break;
...

(oddly, LEA could maybe be optimized, as the profiler seems to say that this
is apparently one of the most heavily used instructions in the tests I was
running...).

then again, GetReg* and SetReg* are filled with an evil: large switches...

or, from a different function (RMReg):

case BGBV86_OP_SHLD:
aq=BGBV86_ImageReadUGeneric(ctx, rm, op->width);
bq=BGBV86_GetRegGeneric(ctx, op->reg);
i=BGBV86_GetRegGeneric(ctx, BGBV86_REG_CL);
aq=(aq<<i)|(bq>>((op->width)-i));
BGBV86_ImageWriteUGeneric(ctx, rm, aq, op->width);
BGBV86_AdjustArithFlagsUGeneric(ctx, aq, op->width);
break;
case BGBV86_OP_SHRD:
aq=BGBV86_ImageReadUGeneric(ctx, rm, op->width);
bq=BGBV86_GetRegGeneric(ctx, op->reg);
i=BGBV86_GetRegGeneric(ctx, BGBV86_REG_CL);
aq=(aq>>i)|(bq<<((op->width)-i));
BGBV86_ImageWriteUGeneric(ctx, rm, aq, op->width);
BGBV86_AdjustArithFlagsUGeneric(ctx, aq, op->width);
break;

...

from Imm:
case BGBV86_OP_INT:
BGBV86_CheckRaiseInt(ctx, op->imm);
break;
case BGBV86_OP_INTXT:
BGBV86_CheckRaiseInt(ctx, op->imm);
break;
case BGBV86_OP_JA:
BGBV86_JumpAddrCC(ctx, ctx->eip + op->imm, BGBV86_COND_A);
break;
case BGBV86_OP_JAE:
BGBV86_JumpAddrCC(ctx, ctx->eip + op->imm, BGBV86_COND_AE);
break;
case BGBV86_OP_JB:
BGBV86_JumpAddrCC(ctx, ctx->eip + op->imm, BGBV86_COND_B);
break;
...
case BGBV86_OP_LOOP:
aq=BGBV86_GetRegGeneric(ctx, BGBV86_REG_ECX);
aq--;
BGBV86_SetRegGeneric(ctx, BGBV86_REG_ECX, aq);
if(aq)BGBV86_JumpAddr(ctx, ctx->eip + op->imm);
break;
...

> as for doing interpreter logic partly in ASM, I had partly considered this
> as well, but for now am sticking with C for other reasons (mostly that, at
> the moment, ASM would add more complexities than are worthwhile...).

<--

Maybe you were trying to do complex stuff in ASM. The above assembler
is blindingly simple. All it is doing is looking up what to do next
from a table with 256 entries.

Come back ASM, all is forgiven. :-)

...
-->

granted, however, by "complexities", I mean, overall architectural ones...

similarly, it does not much help that, at the moment, I have several
different arch's I am building for, and so would have to write ASM for all
of my targets...

x86 ASM, and x64 ASM for both Windows and Linux, which use different calling
conventions.

as well as possibly tangled level-of-abstraction issues, ...

in these cases, for ASM-based micro-optimization, I usually "confine" it,
and then dynamically produce the ASM via procedural generation.

however, this raises other architectural issues (currently, my interpreter
is not in the direct usability-scope of my assembler, which will probably
mean the "API sharing via structs of function pointers" trick...).

actually, in one place I am already needing this, but this would wrap a
region of code already exporting an interface to the assembler, so I could
include this in said struct as well.

> > What slowdown are you getting compared with a real machine?
>
> currently, by simplistic tests, around 170x at present...
>
> so, it is not great, but it could be worse...

<--

You've maybe already explained why but if you want to emulate the
80386 why not use an existing emulator? Is it something to do with
wanting to restrict to a subset of its instructions?

-->

well, I had looked at QEMU some, but opted against it mostly for
code-related reasons (it did not appear to be all that modular, and I
couldn't really get it to build on Windows...).

more so, apparently officially QEMU currently does not work directly on
Win64 (if built for 64-bits), ...

another issue is that it is only doing partial emulation, rather than full
emulation (it is, technically, much closer to an interpreter than an
emulator in terms of purpose and general operation).

but, alas, it is in a some vague area between the 2 technologies...

so, alas, I just beat together my own...

BGB / cr88192

unread,

Nov 2, 2009, 12:20:34 PM11/2/09

to

"tm" <thomas...@gmx.at> wrote in message

news:11b12e14-d4ce-4f2d...@l2g2000yqd.googlegroups.com...

<--

Do I understand this correctly?
The instructions are already decoded and when they
are executed a second decode (switch) is necessary.

This sounds very strange for me, but maybe I just
misunderstand the problem.

-->

the instructions are decoded into an unpacked form (structs), which
basically decode the serialized form of the instruction.

the switch is used mostly to go from this opcode structure into the actual
logic for "doing" whatever the instruction indicates.

so, the switch is used for instruction dispatch, not for decoding.

<--

I have a question:
Why don't you decode to function pointers instead
of integers (which need to be decoded again)?
Function pointers would even be faster than a
lookup table.

-->

this would require redesigning the interpreter.
it would also be very limiting and almost completely destroy modularity
between the decoder and instruction dispatcher (since each would need to be
concerned with what goes on in the other), and would interfere with
structural re-use (for example, using the same decoder, but using it for
different tasks).

integers allow keeping a higher level of abstraction...

for example, I could copy my current opcode decoder back into my assembler
library to provide an "improved" disassembler, or maybe reuse it for other
contexts where I might want to decode x86, ...

it is much like how I grabbed a lot of components from my assembler to build
an interpreter, only that I had to tweak some things and bolt on the big
mass of logic for actually "doing" stuff, ...

<--

The Seed7 interpreter (hi) works with function
pointers. That way no switch is necessary when
a program is executed.

-->

it is an interesting idea, but I don't yet know of an ideal way to use this
idea...

it could be possible, but the logic would likely be located in the
interpreter, rather than the decoder, and would likely be done as a sort of
cache step.

however, it could easily expand to a huge level of complexity for the
individual leaf operations, since each would need to be its own
self-contained function and likely fully specialized (rather than the
current strategy which involves a sort of layered specialization approach,
where different things are calculated along the various steps in the
process...).

...

BGB / cr88192

unread,

Nov 2, 2009, 12:54:42 PM11/2/09

to

"James Harris" <james.h...@googlemail.com> wrote in message

news:76f128e3-5d9e-4905...@n35g2000yqm.googlegroups.com...

On 2 Nov, 16:00, tm <thomas.mer...@gmx.at> wrote:
> On 2 Nov., 14:26, James Harris <james.harri...@googlemail.com> wrote:
>
>

<snip>

>
> The Seed7 interpreter (hi) works with function
> pointers. That way no switch is necessary when
> a program is executed.

<--

Yes, a big switch will be slow except in the special case that the
data values are adjacent and the compiler converts it to a simple
indexing operation. Rather than hope the compiler does what one wants
it's better to express the function pointers in an array directly,
IMHO.

-->

a big downside with an array though is that an array is very sensitive to
positioning and ordering.
this would be difficult here since much of my handling is done with
tool-assigned symbolic constants...

granted, my tool does generally assign values sequentially, but there is
little to say that these values are not subject to change.

now, the downside of MSVC and switches is that MSVC apparently does not do
them well...

apparently, MSVC does it sort of like this:

calc reg=index //depends on input expr
mov [esp+X], reg
mov reg, [esp+X]
sub reg, base
mov [esp+X], reg
cmp [esp+X], (limit-base)
jnbe default //yet to investigate how this one works
mov reg, [esp+X]
lea reg3, [Y]
movzx reg, [reg3+reg+Z]
mov reg, [reg3+reg*SZPTR+W]
add reg, reg3
jmp reg

or, essentially, it works, but I can imagine less terrible ways to do jump
tables...
X/Y/Z/W are apparently switch-specific magic constants...

and, for unrelated reasons, I am thinking I may need to write some sort of
IDL-based tool:
IDL -> export interface; IDL -> header; IDL -> import interface.

this is mostly because non-local code gluing is getting problematic, and
more automated means may be preferable to manually writing lots of headers.

granted, via a naive approach, both the importer and exporter IDL's would
need to be exactly the same for the thing to work.

I may need to investigate...

bartc

unread,

Nov 2, 2009, 1:08:58 PM11/2/09

to

"bartc" <ba...@freeuk.com> wrote in message

news:bDzHm.910$Ym4...@text.news.virginmedia.com...

>
> "BGB / cr88192" <cr8...@hotmail.com> wrote in message
> news:hclvg3$ubg$1...@news.albasani.net...
>>
>> "bartc" <ba...@freeuk.com> wrote in message
>> news:xxnHm.748$Ym4...@text.news.virginmedia.com...
>>> What slowdown are you getting compared with a real machine?

>> currently, by simplistic tests, around 170x at present...

> I put together a very quick test for the following code:

>
> mov ebx,100000000
> l1:
> sub ebx,1
> jnz l1
> hlt
>
> Emulating only the last 3 instructions, the emulation was 30x slower than
> the real code (the first 3 instructions), using a mix of hll and
> assembler.
> Could be a bit tighter with dedicated registers, but I'm a bit short of
> time.

The 'tighter' asm code yielded about a 15x slowdown compared with executing
on the actual processor.

I also tried pure C code and was surprised it managed to achieve only a 30x
slowdown (this test code is below).

Now, this instruction (sub reg,imm) isn't the most complex (only one operand
and no complex addressing), only the zflag is set, and it ignores prefixes
(ie. it assumes a 32-bit op). Even so, there is plenty of margin here for
adding more stuff, although it's not clear how well this approach will
scale.

You say you're getting 170x and that's with already partly decoded
instructions... So I think this approach could be worth looking at again.

#include <stdio.h>
#include <stdlib.h>

typedef void (*fnptr)(void);
typedef unsigned char byte;

int registers[8];
byte *pcptr;
byte zflag;
byte stopped;

fnptr opcodetable[256];
fnptr arithtable[256];

void arith81(void) {
arithtable[(*(pcptr+1))>>3]();
}

void jumpnz75(void) {
if (!zflag)
pcptr += *((signed char*)(pcptr+1))+2;
else
pcptr += 2;
}

void haltf4(void) {
stopped=1;
}

void subregimm(void) {
int reg = *(pcptr+1) & 7;
registers[reg] -= *( (int*)(pcptr+2));
zflag = registers[reg]==0;
pcptr += 6;
}

int main(void)
{
/* l1: sub ebx,1; jnz l1; hlt */
byte testcode[] = {0x81, 0xeb, 0x01, 0x00,0x00,0x00, 0x75, 0xf8, 0xf4};

opcodetable[0x81]=&arith81;
opcodetable[0x75]=&jumpnz75;
opcodetable[0xf4]=&haltf4;

arithtable[0x1d]=&subregimm;

pcptr = testcode;
registers[3] = 100000000;

printf("EBX register at start: %d\n",registers[3]);

stopped=0;

do {
opcodetable[*pcptr]();
} while (!stopped);

printf("EBX register at end: %d\n",registers[3]);

}

--
Bartc

bartc

unread,

Nov 2, 2009, 1:49:00 PM11/2/09

to

"BGB / cr88192" <cr8...@hotmail.com> wrote in message

news:hcn3d1$mpo$1...@news.albasani.net...

>
> "James Harris" <james.h...@googlemail.com> wrote in message
> news:475f3699-141a-46ea...@d5g2000yqm.googlegroups.com...
> On 2 Nov, 06:50, "BGB / cr88192" <cr88...@hotmail.com> wrote:
>> "bartc" <ba...@freeuk.com> wrote in message

> possibly, it could be faster, except that there are many "hidden" costs

> with x86...
> even if one could use a direct-lookup to match the opcode, they would
> still need to decode the ModRM/SIB/disp mess, which is itself not likely
> to be

I seem to remember that some modrm value indicates that SIB follows? SIB
then just needs a function that uses logic obtain the effective address, or
perhaps uses a 256-way function table.

Example: add [mem],reg

This can be quickly decoded down to the level where it knows it's adding a
register to memory. It even knows if it's 8 bits, or 16/32 bits (this latter
needs an extra check). Since there's a memory address involved, a function
call will sort out the modrm/sib/disp stuff (and maybe even step the program
counter).

Now you have the address A, and a register code R: *A += registers[R], and
you're done.

OK, it's not quite that simple, but I'm sure it's possible to do all this
with simpler code than you've shown below (I don't know if the following
code is prehash or posthash).

> rm=BGBV86_ResolveRMAddr(ctx, op);
>
> switch(op->opnum)
> {
> ...
> case BGBV86_OP_ADD:
> aq=BGBV86_GetRegGeneric(ctx, op->reg);
> bq=BGBV86_ImageReadUGeneric(ctx, rm, op->width);
> aq+=bq;
> BGBV86_SetRegGeneric(ctx, op->reg, aq);
> BGBV86_AdjustArithFlagsUGeneric(ctx, aq, op->width);
> break;

> case BGBV86_OP_JB:

> BGBV86_JumpAddrCC(ctx, ctx->eip + op->imm, BGBV86_COND_B);
> break;
> ...
> case BGBV86_OP_LOOP:
> aq=BGBV86_GetRegGeneric(ctx, BGBV86_REG_ECX);
> aq--;
> BGBV86_SetRegGeneric(ctx, BGBV86_REG_ECX, aq);
> if(aq)BGBV86_JumpAddr(ctx, ctx->eip + op->imm);
> break;

--
Bartc

Rod Pemberton

unread,

Nov 2, 2009, 4:37:08 PM11/2/09

to

"BGB / cr88192" <cr8...@hotmail.com> wrote in message
news:hclak6$4ej$1...@news.albasani.net...
>
> at which point the main switch became the bottleneck...

How so? Too many case values? Too widely dispersed case values which
prevents optimization? What?

Rod Pemberton

unread,

Nov 2, 2009, 4:37:49 PM11/2/09

to

"BGB / cr88192" <cr8...@hotmail.com> wrote in message
news:hckpa8$aj2$1...@news.albasani.net...

>
> but, the hash is not used for opcode lookup/decoding, rather it is used
> for grabbing already decoded instructions from a cache (which is
> based on memory address).
>

Your hash generates, what, 64k of possible hash values or memory locations?
What if you reduce the size to 4k? 4k/sizeof(void *)? Will this allow to
compiler to simplify the generated assembly?

The randomness doesn't have to come from multiplication. It can come from
other sources such as a lookup array of randomized data.

Rod Pemberton

unread,

Nov 2, 2009, 4:38:12 PM11/2/09

to

"BGB / cr88192" <cr8...@hotmail.com> wrote in message

news:hcjarj$vro$1...@news.albasani.net...

>
> hash EIP (currently: "((EIP*65521)>>16)&65535");
>

The shift truncates the value to the mask size. I.e., &65535 is not needed.
Yes?

Rod Pemberton

unread,

Nov 2, 2009, 4:38:40 PM11/2/09

to

"BGB / cr88192" <cr8...@hotmail.com> wrote in message

news:hcke2u$n5l$1...@news.albasani.net...
>
> according to the profiler, another major source of time use is:
> "rip=ctx->sreg_base[0]+ctx->eip;"
>

What happens if you eliminate struct ctx? I.e., make both sreg_base and eip
separate variables. What happens with file scope? ... with local scope?

Rod Pemberton

BGB / cr88192

unread,

Nov 2, 2009, 6:34:41 PM11/2/09

to

"Rod Pemberton" <do_no...@nohavenot.cmm> wrote in message
news:hcnjkn$7bt$1...@aioe.org...

what happens?...

well, then, I could no longer do multi-threading...

ctx essentially represents the current simulated thread context, and there
may be 1 or more (OS-level) worker threads essentially serving as virtual
processors.

I am not willing to make a design change which would essentially prohibit
multi-threaded operation...

>
> Rod Pemberton
>
>
>
>

BGB / cr88192

unread,

Nov 2, 2009, 6:54:32 PM11/2/09

to

"Rod Pemberton" <do_no...@nohavenot.cmm> wrote in message

news:hcnjhr$72k$1...@aioe.org...

it jumped near the top of the list in the profiler...

basically, this is because this is a more or less central location through
nearly all control flows (before branching off into all of the deeper
internals of the interpreter).

this is because, one may eliminate slowdowns in one place, and the app gets
overall faster, but in terms of the profiler, the load has shifted somewhere
else, and one could then normally optimize this location for yet further
gains.

sometimes though, it will shift to code which can't be optimized, and in an
interpreter, when the top load shifts to the main switch statement (AKA: the
central part driving operation of the interpreter), it is my observation
that often one is essentially rapidly approaching the optimizability of an
interpreter.

from the POV of further optimization, it is usually better if the running
time is mostly in leaf functions, since this case is usually much easier to
optimize (for example, by optimizing the caller such that they are called
less often).

from the POV of the main loop, there is only a single complexity: O(n).

>
> Rod Pemberton
>
>
>
>
>

BGB / cr88192

unread,

Nov 2, 2009, 7:03:16 PM11/2/09

to

"Rod Pemberton" <do_no...@nohavenot.cmm> wrote in message

news:hcnjj4$755$1...@aioe.org...

I tried both ways, but it does not seem to make much difference between 4k
and 64k for the hash.
I used 64k figureing it would scale better, but then thinking of it, a full
hash might end up using an unreasonably large amount of memory (4MB? 16MB?
more?...), whereas a 4k hash is self-limiting I guess (maybe ~1MB, assuming
around 256 bytes per decode-op...).

actually, I checked, with the current size of the DecodeOp structure, the
current memory limit is around 2MB with a 64k hash, and would be around
128kB with a 4k hash...

note that, generally, a multiplication is cheaper than an array lookup,
however an array lookup is generally cheaper than a division or modulo...

>
> Rod Pemberton
>
>
>
>

BGB / cr88192

unread,

Nov 2, 2009, 7:15:15 PM11/2/09

to

"bartc" <ba...@freeuk.com> wrote in message

news:_eFHm.1021$Ym4...@text.news.virginmedia.com...

>
> "bartc" <ba...@freeuk.com> wrote in message
> news:bDzHm.910$Ym4...@text.news.virginmedia.com...
>>
>> "BGB / cr88192" <cr8...@hotmail.com> wrote in message
>> news:hclvg3$ubg$1...@news.albasani.net...
>>>
>>> "bartc" <ba...@freeuk.com> wrote in message
>>> news:xxnHm.748$Ym4...@text.news.virginmedia.com...
>>>> What slowdown are you getting compared with a real machine?
>
>>> currently, by simplistic tests, around 170x at present...
>
>> I put together a very quick test for the following code:
>>
>> mov ebx,100000000
>> l1:
>> sub ebx,1
>> jnz l1
>> hlt
>>
>> Emulating only the last 3 instructions, the emulation was 30x slower than
>> the real code (the first 3 instructions), using a mix of hll and
>> assembler.
>> Could be a bit tighter with dedicated registers, but I'm a bit short of
>> time.
>
> The 'tighter' asm code yielded about a 15x slowdown compared with
> executing on the actual processor.
>
> I also tried pure C code and was surprised it managed to achieve only a
> 30x slowdown (this test code is below).
>

in a few rare cases, I have managed to get 15x with an interpreter.
it depends, however, highly on the particular type of code being run.

> Now, this instruction (sub reg,imm) isn't the most complex (only one
> operand and no complex addressing), only the zflag is set, and it ignores
> prefixes (ie. it assumes a 32-bit op). Even so, there is plenty of margin
> here for adding more stuff, although it's not clear how well this approach
> will scale.
>
> You say you're getting 170x and that's with already partly decoded
> instructions... So I think this approach could be worth looking at again.
>

there are many reasons for such a level of slowdown...

one of the major ones may well be the level of abstraction I am using for a
lot of the code.
in general, it is hardly a design aimed at "max speed", since in general I
put "making it work" and "having clean code" above "having the maximum raw
speed"...

at which cost?...

as can be seen, this does not even address a tiny fraction of the issues
which exist in the x86 ISA (for example, how do you intend to handle issues
like "Virtual Addressing" and segmentation, maybe page-table translation,
... ?).

what even about eflags?...

maybe keep these sorts of issues in mind.

as well, consider that the code above looks, well, terrible...

> --
> Bartc

BGB / cr88192

unread,

Nov 2, 2009, 7:23:29 PM11/2/09

to

"bartc" <ba...@freeuk.com> wrote in message

news:wQFHm.1039$Ym4...@text.news.virginmedia.com...

the decode does decode...

the main problem with trying to do ModRM and SIB decoding via tables would
be that one would have to provide an entry for every spot in the table.

the use of generic logic code may be an acceptable tradeoff, FWIW...

>> rm=BGBV86_ResolveRMAddr(ctx, op);
>>
>> switch(op->opnum)
>> {
>> ...
>> case BGBV86_OP_ADD:
>> aq=BGBV86_GetRegGeneric(ctx, op->reg);
>> bq=BGBV86_ImageReadUGeneric(ctx, rm, op->width);
>> aq+=bq;
>> BGBV86_SetRegGeneric(ctx, op->reg, aq);
>> BGBV86_AdjustArithFlagsUGeneric(ctx, aq, op->width);
>> break;
>
>> case BGBV86_OP_JB:
>> BGBV86_JumpAddrCC(ctx, ctx->eip + op->imm, BGBV86_COND_B);
>> break;
>> ...
>> case BGBV86_OP_LOOP:
>> aq=BGBV86_GetRegGeneric(ctx, BGBV86_REG_ECX);
>> aq--;
>> BGBV86_SetRegGeneric(ctx, BGBV86_REG_ECX, aq);
>> if(aq)BGBV86_JumpAddr(ctx, ctx->eip + op->imm);
>> break;
>

the code below is well after the hash step...

this is some of the code from the main interpreter logic.
its point is mostly to point out that both opcodes and registers are
generally handled symbolically, and at a fairly high level of abstraction vs
the raw machine code.

but, admittedly, it is far from being free performance-wise...

personally, I pursue performance, but not at the cost of clean coding
practices...

> --
> Bartc

BGB / cr88192

unread,

Nov 2, 2009, 7:29:22 PM11/2/09

to

"Rod Pemberton" <do_no...@nohavenot.cmm> wrote in message

news:hcnjjs$7af$1...@aioe.org...

errm, this would be if I were doing these calculations with 32-bit unsigned
arithmetic...

actually, most of my addressing calculations are being done with 64-bit
signed arithmetic mostly since the interpreter may also handle simulated
long-mode, and because 64-bit arithmetic is far less prone to issues related
to overflow behavior...

(ctx->eip is usually 32-bit EIP, but may also be a 64-bit RIP, FWIW...).

granted, if I were to compile my code on a 32-bit system, this would likely
hurt performance fairly severe...

bartc

unread,

Nov 2, 2009, 8:00:01 PM11/2/09

to

"BGB / cr88192" <cr8...@hotmail.com> wrote in message

news:hcnsmm$vu8$1...@news.albasani.net...

>
> "bartc" <ba...@freeuk.com> wrote in message
> news:_eFHm.1021$Ym4...@text.news.virginmedia.com...
>>
>> "bartc" <ba...@freeuk.com> wrote in message
>> news:bDzHm.910$Ym4...@text.news.virginmedia.com...
>>>
>>> "BGB / cr88192" <cr8...@hotmail.com> wrote in message
>>> news:hclvg3$ubg$1...@news.albasani.net...
>>>>
>>>> "bartc" <ba...@freeuk.com> wrote in message
>>>> news:xxnHm.748$Ym4...@text.news.virginmedia.com...
>>>>> What slowdown are you getting compared with a real machine?
>>
>>>> currently, by simplistic tests, around 170x at present...

>> The 'tighter' asm code yielded about a 15x slowdown compared with

>> executing on the actual processor.
>>
>> I also tried pure C code and was surprised it managed to achieve only a
>> 30x slowdown (this test code is below).
>
> in a few rare cases, I have managed to get 15x with an interpreter.
> it depends, however, highly on the particular type of code being run.

>> You say you're getting 170x and that's with already partly decoded

>> instructions... So I think this approach could be worth looking at again.
>>
>
> there are many reasons for such a level of slowdown...
>
> one of the major ones may well be the level of abstraction I am using for
> a lot of the code.
> in general, it is hardly a design aimed at "max speed", since in general I
> put "making it work" and "having clean code" above "having the maximum raw
> speed"...

With this sort of code the performance level will be inherent; messing about
with switches and possibly changing to function pointers is going to make
only a small difference.

>> int registers[8];
...

> at which cost?...
>
> as can be seen, this does not even address a tiny fraction of the issues
> which exist in the x86 ISA (for example, how do you intend to handle
> issues like "Virtual Addressing" and segmentation, maybe page-table
> translation, ... ?).

This is starting to depend on how serious an emulation you want of the x86
processor. I would have been happy emulating a single virtual task, and not
intrude into OS and device driver territory, or even into memory caches and
instruction pipelines.

You might have point about virtual memory mapping, but then it's also
possible to make use of such facilities on the host processor (the one
running the emulator); I would expect the memory of the emulated task to
exist inside the memory space of the emulator.

Elsewhere you mentioned threading, but was this threading of your
'interpreter', or were you also trying to deal with threading in the cpu
(multiple cores and whatever)?

> what even about eflags?...

You mean stuff like the Virtual Interrupt Pending flag? If I needed to model
a complete processor with 100% accuracy, then I might be bothered with all
that. If I just wanted to run an exe application, this is probably not
necessary.

>
> maybe keep these sorts of issues in mind.
>
> as well, consider that the code above looks, well, terrible...

Thanks...

--
bartc

BGB / cr88192

unread,

Nov 2, 2009, 11:36:50 PM11/2/09

to

"bartc" <ba...@freeuk.com> wrote in message

news:lgLHm.1171$Ym4...@text.news.virginmedia.com...

yep...

a little more fiddling, a little improvement, can't expect that much more at
this point...

>>> int registers[8];
> ...
>
>> at which cost?...
>>
>> as can be seen, this does not even address a tiny fraction of the issues
>> which exist in the x86 ISA (for example, how do you intend to handle
>> issues like "Virtual Addressing" and segmentation, maybe page-table
>> translation, ... ?).
>
> This is starting to depend on how serious an emulation you want of the x86
> processor. I would have been happy emulating a single virtual task, and
> not
> intrude into OS and device driver territory, or even into memory caches
> and
> instruction pipelines.
>
> You might have point about virtual memory mapping, but then it's also
> possible to make use of such facilities on the host processor (the one
> running the emulator); I would expect the memory of the emulated task to
> exist inside the memory space of the emulator.
>

I do partial address-translation, but, yes, given this is not strictly an
emulator, the address translation mechanisms differ some from that of the
real CPU (I am using spans which pretend to be collections of pages...).

this is mostly because spans was a little faster and had a few other uses,
and because 'userspace' code is not likely to notice the difference. I may
add full paging support eventually though.

> Elsewhere you mentioned threading, but was this threading of your
> 'interpreter', or were you also trying to deal with threading in the cpu
> (multiple cores and whatever)?
>

both...

the interpreter will simulate multiple threads, but the interpreter will
also run in a multi-threaded environment, and may be itself
multi-threaded...

I don't want to commit to any overly restrictive design decisions, such as
ones which would break in the face of multi-threading.

I have had this sort of bad experience before...

>> what even about eflags?...
>
> You mean stuff like the Virtual Interrupt Pending flag? If I needed to
> model
> a complete processor with 100% accuracy, then I might be bothered with all
> that. If I just wanted to run an exe application, this is probably not
> necessary.
>

basic eflags behavior is needed even for code to run, so alas, nearly every
arithmetic op needs to set eflags as appropriate, otherwise they risk that
code will fail to work.

hence, I simulate most of the eflags behavior (except maybe PF and a few
others, the PF case being mostly because this one would be difficult to
check and also because almost no modern code is likely to depend on it).

>>
>> maybe keep these sorts of issues in mind.
>>
>> as well, consider that the code above looks, well, terrible...
>
> Thanks...
>

hmm...

I guess I am mostly just really fussy about coding practices...

I have had some bad experiences in the past though.

cleanly designed code is, generally, far more resistant to gradual bit-rot,
which starts to matter a whole lot as one keeps working on a codebase over
some amount of time (years or more...).

bit-rot is an almost inescapable issue, and one is better served in writing
code which can resist this condition to some extent...

so, I am fussy I guess...

oh well, here is the great evil switch which is all off in top position in
the profiler...
note: this mask is because the 'configuration' field exists in the same
place as the 'flags', and is actually a proper subset of the flags, only
that I have ended up special-casing many combinations of flags to
'synthesize' a field I could use for a switch.

yes, the compiler does compile this switch to a jump table...
(and, yes, you see correctly if the existence of 5 argument opcodes is
noticed. we can thank AVX for this one...).

int BGBV86_ExecOpcode(BGBV86_Context *ctx, BGBV86_DecodeOp *op)
{
switch(op->rm_fl&BGBV86_RMFL_CFGMASK)
{
case BGBV86_RMCFG_BASIC:
BGBV86_ExecOpcode_Basic(ctx, op); break;
case BGBV86_RMCFG_REG:
BGBV86_ExecOpcode_Reg(ctx, op); break;
case BGBV86_RMCFG_RM:
BGBV86_ExecOpcode_RM(ctx, op); break;
case BGBV86_RMCFG_REGRM:
BGBV86_ExecOpcode_RegRM(ctx, op); break;
case BGBV86_RMCFG_RMREG:
BGBV86_ExecOpcode_RMReg(ctx, op); break;
case BGBV86_RMCFG_REGRMREG2:
BGBV86_ExecOpcode_RegRMReg2(ctx, op); break;
case BGBV86_RMCFG_RMREGREG2:
BGBV86_ExecOpcode_RMRegReg2(ctx, op); break;
case BGBV86_RMCFG_REGREG2RM:
BGBV86_ExecOpcode_RegReg2RM(ctx, op); break;
case BGBV86_RMCFG_RMREG2REG:
BGBV86_ExecOpcode_RMReg2Reg(ctx, op); break;
case BGBV86_RMCFG_REGRMREG2REG3:
BGBV86_ExecOpcode_RegRMReg2Reg3(ctx, op); break;
case BGBV86_RMCFG_RMREGREG2REG3:
BGBV86_ExecOpcode_RMRegReg2Reg3(ctx, op); break;
case BGBV86_RMCFG_REGREG2RMREG3:
BGBV86_ExecOpcode_RegReg2RMReg3(ctx, op); break;
case BGBV86_RMCFG_RMREG2REGREG3:
BGBV86_ExecOpcode_RMReg2RegReg3(ctx, op); break;

case BGBV86_RMCFG_IMM:
BGBV86_ExecOpcode_Imm(ctx, op); break;
case BGBV86_RMCFG_REGIMM:
BGBV86_ExecOpcode_RegImm(ctx, op); break;
case BGBV86_RMCFG_RMIMM:
BGBV86_ExecOpcode_RMImm(ctx, op); break;
case BGBV86_RMCFG_REGRMIMM:
BGBV86_ExecOpcode_RegRMImm(ctx, op); break;
case BGBV86_RMCFG_RMREGIMM:
BGBV86_ExecOpcode_RMRegImm(ctx, op); break;
case BGBV86_RMCFG_REGRMREG2IMM:
BGBV86_ExecOpcode_RegRMReg2Imm(ctx, op); break;
case BGBV86_RMCFG_RMREGREG2IMM:
BGBV86_ExecOpcode_RMRegReg2Imm(ctx, op); break;
case BGBV86_RMCFG_REGREG2RMIMM:
BGBV86_ExecOpcode_RegReg2RMImm(ctx, op); break;
case BGBV86_RMCFG_RMREG2REGIMM:
BGBV86_ExecOpcode_RMReg2RegImm(ctx, op); break;
case BGBV86_RMCFG_REGRMREG2REG3IMM:
BGBV86_ExecOpcode_RegRMReg2Reg3Imm(ctx, op); break;
case BGBV86_RMCFG_RMREGREG2REG3IMM:
BGBV86_ExecOpcode_RMRegReg2Reg3Imm(ctx, op); break;
case BGBV86_RMCFG_REGREG2RMREG3IMM:
BGBV86_ExecOpcode_RegReg2RMReg3Imm(ctx, op); break;
case BGBV86_RMCFG_RMREG2REGREG3IMM:
BGBV86_ExecOpcode_RMReg2RegReg3Imm(ctx, op); break;

case BGBV86_RMCFG_REGREG1:
BGBV86_ExecOpcode_RegReg1(ctx, op); break;
case BGBV86_RMCFG_REG1REG:
BGBV86_ExecOpcode_Reg1Reg(ctx, op); break;

default: BGBV86_CheckRaiseUD(ctx); break;
}

return(0);
}

Mike Austin

unread,

Nov 5, 2009, 2:11:24 AM11/5/09

to

BGB / cr88192 wrote:
> this is an observation I have made in the past, and I think I will state
> here as if it were some kind of rule:
> when a profiler says that the largest amount of time used in an interpreter
> is the main opcode dispatch switch (rather than some function called by it,
> or some logic contained within said switch, ... especially if this switch
> does nothing more than call functions...), then they are rapidly approaching
> the performance limits they can hope to gain from said interpreter.

A little late to the party, but...

I say let an interpreter be an interpreter, and provide access to fast
libraries and frameworks. These days, you can write a 3D game in a scripting
language using math and physics libraries and opengl vertex objects. Throw in
a simple native scene graph, and your bottleneck is no longer the runtime. I
admire projects like LuaJit, but if I need absolute speed, it seems easier to
just use C or asm.

Mike

> which is kind of lame in this case, as this interpreter is, technically, not
> all that fast...

>
> but, then again, there may still be a little tweaking left, as said switch

> has not gained 1st place (it is 2nd), and holds approx 12% of the total
> running time.
>

> 1st place, at 15%, is the hash-table for fetching instructions (and, sadly,
> is not really optimizable unless I discover some way to 'optimize' basic
> arithmetic expressions, which FWIW is about as productive as trying to
> optimize a switch...).
>
> this means: 27% of time:

> hash EIP (currently: "((EIP*65521)>>16)&65535");

> fetch opcode-struct from hash-table (the hash-fail rate is apparently very
> low);
> switch on opcode info (branches to functions containing further dispatch and
> logic code).
>
> (well, all this, as well as a few arithmetic ops, such as adding the opcode
> size to EIP, ...).
>
> previously I had an opcode decoder which was not super fast (approx 60% of
> runtime if sequential decoding is used), but the hash seems to have largely
> eliminated it (since most of the time, pre-decoded opcodes are used from the
> hash).
>
>

> so, alas, it is an x86 interpreter performing at approx 386 (or maybe
> 486)-like speeds on my computer (~6 MIPS...).
> could be faster, except that MSVC on x64 has no real idea what it is doing
> WRT optimizations.
>
> then again, it is around 12x faster than when I started trying to optimize
> it (~0.5 MIPS...).

>
>
> note: I have plenty of past experience with both ASM and JIT, but at the
> moment am leaning against this, and in favor of using pure C-based
> interpretation for now.
>
>
> dunno if anyone knows some major way around this.
>

> IOW: if there is anything that can be done when the main switch creeps into
> a high position, and code starts becomming very fussy about the fine details
> WRT performance issues...
>
> past experience is generally not... (that this is a general limit to C-based
> interpreters...).
>
>
> any comments?...
>
>

BGB / cr88192

unread,

Nov 5, 2009, 10:40:18 AM11/5/09

to

"Mike Austin" <mi...@mike-nospam-austin.com> wrote in message
news:DYSdnSRFsvEA52_X...@giganews.com...

> BGB / cr88192 wrote:
>> this is an observation I have made in the past, and I think I will state
>> here as if it were some kind of rule:
>> when a profiler says that the largest amount of time used in an
>> interpreter is the main opcode dispatch switch (rather than some function
>> called by it, or some logic contained within said switch, ... especially
>> if this switch does nothing more than call functions...), then they are
>> rapidly approaching the performance limits they can hope to gain from
>> said interpreter.
>
> A little late to the party, but...
>
> I say let an interpreter be an interpreter, and provide access to fast
> libraries and frameworks. These days, you can write a 3D game in a
> scripting
> language using math and physics libraries and opengl vertex objects.
> Throw in
> a simple native scene graph, and your bottleneck is no longer the runtime.
> I
> admire projects like LuaJit, but if I need absolute speed, it seems easier
> to
> just use C or asm.
>

FWIW, the interpreter is currently running C, and also ASM...

granted, I have yet to resolve the issue as to how to generally/easily
marshall APIs to the interpreter (with the differences in address space and
word size being just a few of the issues).

using a modified C frontend as the basis of an IDL tool is a possibility,
where I am currently thinking of C headers with embedded IDL commands.

a prior idea had involved the use of special preprocessor magic, but another
possibility is to embed all of the IDL markup in comments, which would be
treated specially by the IDL tool's preprocessor (unwrapped and placed
directly into the syntax stream...).

/*IDL|...*/ --> ...
/*IDL[...]*/ --> [...]

/*IDL[guid(b0f0-...)]*/
/*IDL| interface { */
...

the tool would then write new code and headers to attempt to marshall API's
across different situations (current thinking is via 'native call', via the
object system, ...).

I would probably use a vaguel "CLOS-like" style for presenting OO style
interfaces in C, so for example, C-code is written in this style, and with
some IDL magic in the headers, and the tool figures out how to write code to
import/export said interfaces to/from the object system.

...

/*IDL|
namespace MyApp {
class Foo {
Foo();
public int doSomething(int x);
...
}
}
*/

and, in C:
MyApp_Foo *foo;
foo=new_MyApp_Foo();
foo->doSomething(foo, 100);
or, maybe:
MyApp_Foo_doSomething(foo, 100);

granted, this is not the most elegant possibility, but it could be usable at
least...