On 11/28/2012 3:15 AM, Rod Pemberton wrote:
> "Hugh Aguilar" <
hughag...@nospicedham.yahoo.com> wrote in
> message
> news:0aac8c84-77aa-4e8f...@kt16g2000pbb.googlegroups.com...
> ...
>
>> I would like to provide capability of relocatable binary
>> overlays. I don't think this is possible with stack threading,
>> because all of the threaded code is composed of absolute
>> addresses.
>
> Use offsets ...
>
> I use a variant of a Forth interpreter for another project
> (sigh...) that uses offsets instead of absolute addresses. But,
> it's in C, not assembly.
>
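(for illustration, offset-based threading in C can look something like
this; just a rough sketch, all names made up:)

    #include <stdint.h>

    typedef void (*ForthPrim)(void);

    static uint8_t *img_base;  /* address the image was loaded at */
    static int32_t *ip;        /* threaded-code instruction pointer */

    /* inner interpreter: each cell holds an offset from img_base
       rather than an absolute address, so the compiled image can be
       loaded anywhere without relocation fixups */
    static void inner(void)
    {
        for (;;) {
            int32_t ofs = *ip++;
            ((ForthPrim)(void *)(img_base + ofs))();
        }
    }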
>> In order to allow for relocatability, I would need to dedicate a
>> register (such as rbp) as a base pointer for the entire
>> program.
>
> Well, something like that ... indirect addressing of some form or
> other.
>
>> All pointers to data or to code, would actually be offsets
>> from this base register.
>
> For 32-bit mode, you can set the base address of the selector.
> Was that functionality removed for 64-bit mode?
>
yes, mostly.
in 64-bit mode, the CS/DS/ES/SS segment bases are forced to zero, so
they no longer work for this (FS and GS still have usable base
addresses though, set via MSRs).
IIRC I had heard rumors that some later chip may re-add them.
then again, I also heard rumors recently of Intel wanting to move
entirely to BGA packaging (with the CPUs coming pre-soldered to the
MOBO), which other people commented was stupid and unlikely (since Intel
would create a lot of backlash and lose market share by going this route).
then again, this brings up a long-ago memory that I saw 386SX MOBOs back
in the 90s which did this (just with QFP). apparently there were QFP 486
chips as well...
>> The threading scheme would have to use the base register.
>
> See the LEA instruction.
>
>> Also, all Forth words that access memory (@ and ! and so
>> forth) would have to use the base register.
>
> Why?
>
> You can adjust the offsets to absolute addresses simply by adding
> the base address. If all memory reads (fetches in Forth) and
> writes (stores in Forth) reduce to Forth's memory operations: @
> (fetch, i.e., read) ! (store, i.e., write) C@ (char fetch) C!
> (char store), then you can add the base directly in those
> low-level Forth words or "primitives". It's only if you have some
> words or "primitives" that bypass Forth's memory operators that
> you'd have to rewrite more words.
>
yeah.
base-registers can be useful.
I had considered something similar before, partly as an option to allow
a code generator to be clever and mostly use 32-bit addressing
internally (except when dealing with explicit pointers or similar).
never did much with the idea though, mostly as the higher complexity
and maintenance costs of native code generators have largely caused me
to use them sparingly (and most of this is as ugly shims to glue crap
together).
note that as-is, most of the "executable heap" is clustered into a 2GB
region in 64-bit targets (though actually 4GB is reserved, with a 2GB
usable area in the middle).
dynamically generated code and data / bss areas may be put there, such
that everything is within easy reach. currently, the whole thing is RWX
though, and may require changing eventually.
there was some uncertainty as apparently SELinux doesn't like RWX
memory, but as-is AFAICT the "no RWX memory" restriction is only
enforced by default for daemons.
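(roughly what the reserve/commit layout can look like, sketched here
with POSIX mmap/mprotect; VirtualAlloc would be the Win32 analogue,
and error handling is omitted:)

    #include <stdint.h>
    #include <sys/mman.h>

    #define HEAP_RESERVE (4ULL << 30)  /* reserve 4GB address space */
    #define HEAP_USABLE  (2ULL << 30)  /* commit the middle 2GB */

    static void *AllocExecHeap(void)
    {
        uint8_t *resv = mmap(NULL, HEAP_RESERVE, PROT_NONE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        uint8_t *mid = resv + (HEAP_RESERVE - HEAP_USABLE) / 2;
        /* the RWX protection here is the part SELinux setups may
           object to; a W^X scheme would instead flip protections
           with mprotect as needed */
        mprotect(mid, HEAP_USABLE, PROT_READ | PROT_WRITE | PROT_EXEC);
        return mid;  /* everything here is within +/-2GB of itself */
    }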
>> If subroutine-threading were used, then the CALL instruction
>> would have to use the base register.
>
> Perhaps, use LEA with CALL.
>
why?...
CALL is normally relative anyways, so you would only need an indirect
call if going to another memory region (or outside the +/-2GB window).
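(e.g., a minimal emitter sketch: use CALL rel32 when the target is in
the window, else an indirect call; the encodings are standard x86-64,
but the function itself is made up:)

    #include <stdint.h>
    #include <string.h>

    static void EmitCall(uint8_t **ip, void *target)
    {
        uint8_t *out = *ip;
        intptr_t disp = (uint8_t *)target - (out + 5);
        if (disp == (int32_t)disp) {        /* fits in rel32? */
            int32_t d32 = (int32_t)disp;
            out[0] = 0xE8;                  /* CALL rel32 */
            memcpy(out + 1, &d32, 4);
            out += 5;
        } else {
            uint64_t abs = (uint64_t)(uintptr_t)target;
            out[0] = 0x48; out[1] = 0xB8;   /* MOV RAX, imm64 */
            memcpy(out + 2, &abs, 8);
            out[10] = 0xFF; out[11] = 0xD0; /* CALL RAX */
            out += 12;
        }
        *ip = out;
    }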
>> Back in the old days, UR/Forth for MS-DOS had relocatable binary
>> overlays. It was essentially using CS and DS as its base
>> registers, similar to what I've described above. Now everything
>> is "flat" --- so to get relocatable binary overlays it is
>> necessary to dedicate one of the general-purpose registers as a
>> base register. This isn't too difficult on the 64-bit x86 that
>> has a lot of registers, but it was a problem on the 32-bit x86
>> that is already register starved (SwiftForth used EDI as I
>> recall).
>
> I'm not entirely up to date on 64-bit x86, but aren't all
> instructions relative? If so, I'd think there would be a CALL and
> JMP to a relative address too ... Yes? No?
>
CALL and JMP are relative even on 32-bit x86.
the only significant limitation in 64-bit mode is that they have a
+/-2GB window, as the relative displacement is still 32 bits.
>> [String instructions] are definitely convenient, so I still use
>> them. I had thought that there was a speed penalty for using
>> these old CISC instructions on the modern processors, but you
>> say they are fast --- that is great --- convenience and speed.
>
> There is a large penalty for decoding them since they truly are
> CISC instructions. There is also a penalty for loading the
> registers they need. But, the instructions themselves, when used
> with REP/NE/NZ prefixes, execute faster than other instructions
> over large blocks.
>
yeah. the trick here is basically that the processor can treat them
more like direct memory-block copies, rather than actually executing
them step-by-step per-se.
so, the costs aren't really all that bad.
for fixed-size block moves though, certain SSE operations (MOVDQA and
MOVDQU) can be a little faster.
so, the logic can be, in a code generator (for a "memcpy" or similar
intrinsic; a rough C sketch follows below):
  multiple-of-16 (constant size, and under a certain size limit)?
    both ends aligned?
      use a MOVDQA chain.
    else:
      use a MOVDQU chain.
  else:
    use "REP MOVSB" or "REP MOVSD" (multiple of 4).
ironically, "REP MOVSB" can sometimes beat out more "clever" ways of
doing memory copies, like copying everything into GPRs and writing them
to the destination.
this is ultimately a moot tradeoff for larger memory copies though, as
once a person goes outside of what easily fits in cache, then the speed
at which memory can be read/written becomes the dominant factor.
my past x86 interpreter exploited this, and ironically "REP MOVSB" and
friends were some of the fastest operations in the interpreter.
>> BTW: UR/Forth used direct threading. It used the SI register as
>> the Forth IP. The NEXT consisted of a LODS to get the code
>> address into AX, and then a JMP indirect through AX --- this was
>> pretty fast on the 16-bit x86 --- my experience was that
>> UR/Forth ran benchmark programs at the same speed as Turbo C.
>> I might switch over to doing this so I can have binary overlays.
>> The JMP will have to use the base register, as the pointer in
>> RAX is actually an offset from the base rather than an absolute
>> address.
>
> I'm not sure that LODS would be the fastest solution anymore.
>
yeah...
a much faster way of doing threaded code, IME, is basically just to
have calls in-sequence:
...
call A
...
call B
...
call C
...
ret
since, even if the calls are indirect, the branch-predictor can do its
thing.
an experimental threaded code interpreter of mine (written in C) does
attempt to exploit this (by organizing the threaded code into "traces"
which basically consist of sequential unwound calls, terminated by any
operation which may change the execution path or raise an exception,
such as a jump or call, ...).
though, as-is, it runs things like simple loops and similar still about
8x slower than native (MSVC compiled) code.
while considerably faster than my existing threaded-code interpreter
(the core of the BGBScript VM), which is closer to around 70x slower
than native, it is still much slower than native code (sadly), and also
far from usably complete (and attention is much more directed to more
immediately relevant parts of the project).
I am not sure exactly, but I suspect unrolling the operation-handler
dispatch loops is a big part of the speed gain here (at least for
shorter traces; currently, traces with up to about 8 operations are
handled this way).
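(for illustration, the unrolled short-trace dispatch can look roughly
like this in plain C; the structure and names are made up, not the
actual BGBScript VM code:)

    typedef struct VMCtx_s VMCtx;
    typedef void (*VMOp)(VMCtx *);

    typedef struct VMTrace_s {
        int  n;       /* op count, up to 8 for the unrolled path */
        VMOp ops[8];  /* handlers for straight-line ops; the last
                         one is whatever terminated the trace */
    } VMTrace;

    static void RunTrace(VMCtx *ctx, const VMTrace *tr)
    {
        const VMOp *op = tr->ops;
        switch (tr->n) {  /* falls through on every case */
        case 8: (*op++)(ctx);
        case 7: (*op++)(ctx);
        case 6: (*op++)(ctx);
        case 5: (*op++)(ctx);
        case 4: (*op++)(ctx);
        case 3: (*op++)(ctx);
        case 2: (*op++)(ctx);
        case 1: (*op++)(ctx);
        }
    }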
not sure if there is any obviously faster strategy for a plain-C
interpreter.
I suspect actually that, as time goes on, it is getting considerably
harder to approach native speeds with an interpreter (designs which got
within 15x of native 8 years ago, now run closer to around 100x slower
than native). (and "switch()" has gotten considerably more expensive as
well, in addition to things like "if()", so code is basically faster if
it is "straight-shot" with very few conditionals).
I suspect it mostly has to do with the types of optimizations being
used in modern processors (things like deep pipelines and similar).
so, the CPU generally gets faster, but interpreter performance is left
behind.
>
> Rod Pemberton