
64-bit noobie questions


Hugh Aguilar

Nov 26, 2012, 6:30:52 PM
1.) I have heard that only RAX can be used with mov to get a 64-bit
immediate value, but the Intel manual seems to indicate that any
register can be used. The limitation of 32-bit only occurs when the
destination is memory. Is it true that, for any instruction, when a 32-
bit immediate goes to a 64-bit destination, it will be sign-extended
and not zero-extended?

2.) Is it true that all memory access is relative to the current
position (there is a rip register used internally that contains the
address of the instruction being executed), and there is no absolute
memory addressing now? If I use indirect addressing though, then I
have to have the absolute address in the register. If I use the lea
instruction with a label, I will get the absolute address of that
label in my register, although the operand of the instruction will
actually be an offset from the instruction.

3.) Is there any speed difference between using the traditional
registers and using the new registers (r8 through r15)? I wonder if
the opcodes have a prefix or whatever that makes them bigger and
slower.

4.) For the most part, the registers are all the same and we no longer
have specific registers for specific uses. There are some exceptions:
Multiplication now uses any registers, but it still needs RDX:RAX if
you want mixed-precision. Division always uses RDX:RAX and always does
mixed-precision. Also, there is the question of whether RAX is
necessary for 64-bit immediate values as mentioned above. EBP is
necessary for ENTER and LEAVE. RSI, RDI and RCX are necessary for
string instructions. Are there any other exceptions that I should know
about?

wolfgang kern

Nov 27, 2012, 3:38:43 PM

Hugh Aguilar asked:

> 1.) I have heard that only RAX can be used with mov to get a 64-bit
> immediate value, but the Intel manual seems to indicate that any
> register can be used. The limitation of 32-bit only occurs when the
> destination is memory. Is it true that, for any instruction, when a 32-
> bit immediate goes to a 64-bit destination, it will be sign-extended
> and not zero-extended?

All 16 available regs can be loaded with a 64-bit immediate value via
MOV reg,imm64 (and there are also the 32-bit sign-extended loads).

The special advantage for RAX is found only in the 64-bit addressing
capability [opcodes A0/A1/A2/A3], ie: MOV rax,[moffs64].

> 2.) Is it true that all memory access is relative to the current
> position (there is a rip register used internally that contains the
> address of the instruction being executed), and there is no absolute
> memory addressing now? If I use indirect addressing though, then I
> have to have the absolute address in the register. If I use the lea
> instruction with a label, I will get the absolute address of that
> label in my register, although the operand of the instruction will
> actually be an offset from the instruction.

64-bit mode (aka long mode) is (except FS/GS) always FLAT and based at zero.
RIP-addressing is available within 64-bit mode only, but there is
no difference to the previous opportunities in the 32-bit world,
except that the values found within instructions refer to code pages.
'Normal' code/data-relative references work as before in 32 bits,
but extended (and truncated on roll-over) to 64 bits.

> 3.) Is there any speed difference between using the traditional
> registers and using the new registers (r8 through r15)? I wonder if
> the opcodes have a prefix or whatever that makes them bigger and
> slower.

This one-byte (REX) prefix has a timing cost of zero (it is prefetched).
New with 64-bit coding is just that we got more opportunities...
reg to reg is fastest, while reg <-> mem depends on bus speed as before.

> 4.) For the most part, the registers are all the same and we no longer
> have specific registers for specific uses. There are some exceptions:
> Multiplication now uses any registers, but it still needs RDX:RAX if
> you want mixed-precision. Division always uses RDX:RAX and always does
> mixed-precision. Also, there is the question of whether RAX is
> necessary for 64-bit immediate values as mentioned above.

Mul/Div are specials anyway. But all regs are accepted for addr-calc...

> EBP is necessary for ENTER and LEAVE. RSI, RDI and RCX are necessary for
> string instructions. Are there any other exceptions that I should know
> about?

the usage of EBP depends on your compiler...
and as I use just my brain instead of a stupid tool, EBP is free to be used
for a pointer to stack-bottom (my global variables are there).

Right, ECX, EDI and ESI were designed for, and are handy with, the block
instructions, and against all the odd comments from whoever, I found that
the old REP MOVSB/W/D/Q and friends are still unbeatable on modern machines
in terms of both size and speed.

__
wolfgang


Hugh Aguilar

Nov 27, 2012, 9:04:16 PM
On Nov 27, 1:38 pm, "wolfgang kern" <nowh...@never.at> wrote:
> Hugh Aguilar asked:
> > 2.) Is it true that all memory access is relative to the current
> > position (there is a rip register used internally that contains the
> > address of the instruction being executed), and there is no absolute
> > memory addressing now? If I use indirect addressing though, then I
> > have to have the absolute address in the register. If I use the lea
> > instruction with a label, I will get the absolute address of that
> > label in my register, although the operand of the instruction will
> > actually be an offset from the instruction.
>
> 64-bit mode (aka long mode) is (except FS/GS) always FLAT and based at zero.
> RIP-addressing is available within 64-bit mode only, but there is
> no difference to the previous opportunities in the 32-bit world,
> except that the values found within instructions refer to code pages.
> 'Normal' code/data-relative references work as before in 32 bits,
> but extended (and truncated on roll-over) to 64 bits.

The reason I'm asking is that I'm writing a Forth interpreter. I'm
using stack-threading. The rsp register is the Forth IP. The ret
instruction is the Forth NEXT code.
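
For readers new to the idea: with stack-threading the thread is just a
list of code addresses sitting on the RSP stack, so RET itself does the
fetch-and-jump. A minimal sketch (FASM-style; the register assignments
are illustrative, not necessarily Hugh's actual ones):

        plus:   add  rbx, [r15]     ; a primitive: TOS cached in RBX,
                lea  r15, [r15+8]   ; data stack at R15; drop second item
                ret                 ; NEXT: pops the next primitive's
                                    ; address off RSP and jumps to it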

I would like to provide the capability of relocatable binary overlays.
I don't think this is possible with stack threading, because all of the
threaded code is composed of absolute addresses. In order to allow for
relocatability, I would need to dedicate a register (such as rbp) as
a base pointer for the entire program. All pointers to data or to
code would actually be offsets from this base register. The threading
scheme would have to use the base register. Also, all Forth words that
access memory (@ and ! and so forth) would have to use the base
register. If subroutine-threading were used, then the CALL instruction
would have to use the base register.

Back in the old days, UR/Forth for MS-DOS had relocatable binary
overlays. It was essentially using CS and DS as its base registers,
similar to what I've described above. Now everything is "flat" --- so
to get relocatable binary overlays it is necessary to dedicate one of
the general-purpose registers as a base register. This isn't too
difficult on the 64-bit x86 that has a lot of registers, but it was a
problem on the 32-bit x86 that is already register starved (SwiftForth
used EDI as I recall).

> > EBP is necessary for ENTER and LEAVE. RSI, RDI and RCX are necessary for
> > string instructions. Are there any other exceptions that I should know
> > about?
>
> the usage of EBP depends on your compiler...
> and as I use just my brain instead of a stupid tool, EBP is free to be used
> for a pointer to stack-bottom (my global variables are there).

Well, I'm using my brain to build a stupid tool. :-)

> Right, ECX, EDI and ESI were designed for, and are handy with, the block
> instructions, and against all the odd comments from whoever, I found that
> the old REP MOVSB/W/D/Q and friends are still unbeatable on modern machines
> in terms of both size and speed.

They are definitely convenient, so I still use them. I had thought
that there was a speed penalty for using these old CISC instructions
on the modern processors, but you say they are fast --- that is great
--- convenience and speed.

BTW: UR/Forth used direct threading. It used the SI register as the
Forth IP. The NEXT consisted of a LODS to get the code address into
AX, and then a JMP indirect through AX --- this was pretty fast on the
16-bit x86 --- my experience was that UR/Forth ran benchmark programs
at the same speed as Turbo C. I might switch over to doing this so I
can have binary overlays. The JMP will have to use the base register,
as the pointer in RAX is actually an offset from the base rather than
an absolute address.
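
Reconstructed from that description, the classic direct-threaded NEXT
was just two instructions (a sketch; the actual UR/Forth code may have
differed):

        next:   lodsw         ; AX = next code address, SI += 2
                jmp  ax       ; the indirect jump through AX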

Hugh Aguilar

Nov 27, 2012, 11:51:52 PM
On Nov 27, 7:04 pm, Hugh Aguilar <hughaguila...@nospicedham.yahoo.com>
wrote:
> BTW: UR/Forth used direct threading. It used the SI register as the
> Forth IP. The NEXT consisted of a LODS to get the code address into
> AX, and then a JMP indirect through AX --- this was pretty fast on the
> 16-bit x86 --- my experience was that UR/Forth ran benchmark programs
> at the same speed as Turbo C. I might switch over to doing this so I
> can have binary overlays. The JMP will have to use the base register,
> as the pointer in RAX is actually an offset from the base rather than
> an absolute address.

Here is another noobie question:

If I do a LODSW in 64-bit mode, then I'm loading a 32-bit number into
EAX indirectly through RSI (which gets incremented by 4 assuming that
the DF flag is zero). My question is, does this number get sign-
extended or zero-extended into RAX, or is the upper half of RAX left
as it was?

This is similar to my previous question about MOV of an immediate.
That was a dumb question because I could have just read the Intel
manual and found out the answer (it is sign-extended). This time
however, I did read the Intel manual regarding LODS and I couldn't
find an answer.

Frank Kotler

Nov 28, 2012, 12:46:46 AM
Hugh Aguilar wrote:

...
> If I do a LODSW

I think you mean LODSD... (I don't know the answer to the rest of your
question)

Best,
Frank

Hugh Aguilar

Nov 28, 2012, 1:39:50 AM
On Nov 27, 10:46 pm, Frank Kotler wrote:
Yes, LODSD is what I meant --- 32-bit --- that was a typo, or a thinko,
or something like that.

One of the problems with the stack-threading discussed earlier is that
it requires the pointers to the cfas (the threaded code) to be 64-bit
(because only 64-bit addresses can be on the rsp stack in 64-bit
mode). This is overkill for me though, as the Forth program is
certainly going to be less than 4GB in size. It wastes memory, which
in turn slows down the code, because the threaded code may not fit in
the data cache. By comparison, with LODSD as described above, I can
kill two birds with one stone:
1.) Save memory by using 32-bit pointers in the threads.
2.) Allow for binary overlays by doing a JMP [rbp+rax] to get to the
cfa. All memory access, to both code and data, will be relative to RBP
which is my base pointer.
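
Put together, the token-fetching NEXT would look something like this (a
sketch; RBP = base pointer and RSI = Forth IP, as described above):

        next:   lodsd                  ; EAX = next 32-bit token, RSI += 4
                                       ; (a 32-bit write zero-extends into RAX)
                jmp   qword [rbp+rax]  ; jump through the cfa at base+token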

Here are some more questions:

1.) LODSD is smaller than doing it manually, which is good because it
makes the code smaller, but is it faster?

2.) I could do it manually like this: MOV EAX, [RSI] then ADD RSI, 4. In
this case, does the 32-bit value get sign-extended into RAX by the MOV
instruction? I don't want this because my program will go haywire if
it is above the 2GB boundary.

3.) Is it faster if I do the LODSD early in the function and hold the
value in EAX throughout the function (assuming that I don't need RAX
for anything else), rather than do the LODSD immediately prior to the
JMP that uses RAX? This was true on the 80486, but it may not be true
nowadays. Similarly, if I do it manually, I could spread out the MOV,
the ADD and the JMP so they aren't right next to each other.

The last time that I had a job writing x86 code, was over 10 years
ago, and that was for the 80486. I've read Abrash's second book which
discusses the 80486 and Pentium, but I'm not familiar with these new
x86 processors at all. That is why I'm asking all of these noobie
questions. :-)

Dick Wesseling

Nov 28, 2012, 1:56:26 AM
In article <0aac8c84-77aa-4e8f...@kt16g2000pbb.googlegroups.com>,
Hugh Aguilar <hughag...@nospicedham.yahoo.com> writes:

> The reason I'm asking is that I'm writing a Forth interpreter.
> I'm using stack-threading. The rsp register is the Forth IP. The ret
> instruction is the Forth NEXT code.

Kind of scary. What happens when the OS delivers a signal? Do you have
an alternate signal stack?

> I would like to provide the capability of relocatable binary overlays. I
> don't think this is possible with stack threading, because all of the
> threaded code is composed of absolute addresses.


Then relocate your binaries when loading. If your Forth compiler cannot
generate relocation info, you can always use a multipass algorithm:

- compile at offset A save binary
- compile at offset B save binary
- compare both binaries. Absolute data will be identical in A and B,
relocatable data will differ.
- emit code + relocation info
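
A sketch of the comparison pass in assembly (a hypothetical layout where
both images are arrays of qword cells, compiled DELTA bytes apart):

        ; rsi -> image A, rdi -> image B, rcx = size in qwords,
        ; rbx -> relocation table to fill, rdx = DELTA (offsetB - offsetA)
                xor  r8, r8            ; current offset within the image
        scan:   mov  rax, [rdi+r8]
                sub  rax, [rsi+r8]     ; how the two compiles differ here
                cmp  rax, rdx          ; off by exactly DELTA -> relocatable
                jne  next_cell
                mov  [rbx], r8         ; record the offset in the reloc table
                add  rbx, 8
        next_cell:
                add  r8, 8
                loop scan              ; rcx counts down the cells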

Dick Wesseling

Nov 28, 2012, 1:44:09 AM
In article <3793b860-5661-4a7f...@6g2000pbh.googlegroups.com>,
Hugh Aguilar <hughag...@nospicedham.yahoo.com> writes:
> Here is another noobie question:
> If I do a LODSW in 64-bit mode, then I'm loading a 32-bit number into
> EAX indirectly through RSI (which gets incremented by 4 assuming that
> the DF flag is zero). My question is, does this number get sign-
> extended or zero-extended into RAX, or is the upper half of RAX left
> as it was?

The upper 48 bits of RAX are left as they were.

Hugh Aguilar

Nov 28, 2012, 2:38:06 AM
On Nov 27, 11:56 pm, f...@nospicedham.securityaudit.val.newsbank.net
(Dick Wesseling) wrote:
> In article <0aac8c84-77aa-4e8f-8058-f4ecd8f72...@kt16g2000pbb.googlegroups.com>,
>         Hugh Aguilar <hughaguila...@nospicedham.yahoo.com> writes:
>
> > The reason I'm asking is that I'm writing a Forth interpreter.
> > I'm using stack-threading. The rsp register is the Forth IP. The ret
> > instruction is the Forth NEXT code.
>
> Kind of scary. What happens when the OS delivers a signal? Do you have
> an alternate signal stack?

When I first heard of stack-threading (here on clax, where I get most
of my wild ideas), my first thought was that it wouldn't work because
the OS would overwrite the rsp stack. I was told that it does work, by
somebody who has supposedly done it successfully. That was in this
thread:
https://groups.google.com/group/comp.lang.asm.x86/browse_thread/thread/f4718838eeb3c9a8
George Neuner said this:
On Aug 24, 4:09 pm, George Neuner <gneun...@nospicedham.comcast.net>
wrote:
> Well ... most ISAs include a simple subroutine call/return mechanism
> which involves only an address pushed onto or pulled from a stack.
> [I say "most" because there are chips where the ISA call/return
> mechanism automatically saves/restores certain CPU state and thus puts
> more than just an address onto the stack.]
>
> The return instruction pulls an address from the stack, updates the
> stack pointer, and transfers control to the address.  In all ways but
> one - the address source being a stack vs a list - its behavior
> parallels that of the subroutine exit code in a conventional DTC
> implementation.
>
> So, it should be obvious that a DTC thread list can be pushed (in
> reverse order) onto a stack and the ISA return instruction can be used
> to jump from one thread subroutine to the next.  On many chips this
> will execute faster than the 3 instruction jump, and additionally it
> frees up a register that is no longer needed for the thread list
> pointer.

He even called it "obvious"(!). That was the first I had ever heard of
the idea, but when he assured me that interrupts won't overwrite my
code, I decided to go for it. Now I'm having my doubts though, and I'm
thinking that I should go with RSI as the IP.

BTW: He was mistaken about it freeing up a register --- the same
number of registers are needed, just different ones.

> > I would like to provide the capability of relocatable binary overlays. I
> > don't think this is possible with stack threading, because all of the
> > threaded code is composed of absolute addresses.
>
> Then relocate your binaries when loading. If your Forth compiler cannot
> generate relocation info, you can always use a multipass algorithm:
>
>   - compile at offset A   save binary
>   - compile at offset B   save binary
>   - compare both binaries. Absolute data will be identical in A and B,
>     relocatable data will differ.
>   - emit code + relocation info

I am aware of the method you are describing, and that is what I
intend to do. I still need to use pointers that are relative to my
base register, however. I can't use absolute addresses, because the
program will likely get loaded into a different place in memory every
time it is run, so all of the references in the overlay to code or
data in the main program would be wrong. I have to use the base
register so that all of those references will be correct when the
program is run later and the base register is set up to point to the
base of the main program. All of the references in the overlay to code
and data in the overlay itself will get adjusted using the above
method. Each overlay will have two files --- one file is the overlay
itself, and the other lists all of the adjustments that have to be
made to the first. When an overlay gets loaded, the program will
traverse the overlay, making the adjustments listed in the second
file.

A lot of MS-DOS programs used overlays --- this was a common way to
work around the memory limitations of the 8088. This worked because,
although the program could get loaded into a different place every
time that it was run, we had the CS and DS registers pointing to the
base of the program --- they were effectively the base register that
I'm using RBP for now. The reason I want overlays now, although
memory usage is no longer an issue, is so my users can distribute
libraries as binary overlays and they don't have to provide the
source-code and have their users compile it. It is kind of a crude
linker --- Forth is interactive, which is why I don't want to generate
an object file and have a standard linker construct an executable.
Users should be able to load and discard overlays as they are using
the system, without rebuilding the whole system.

Using absolute addresses would only work if the program is loaded into
memory in the same place every time that it is run. This was true in
regard to .com files in MS-DOS days (they were at 0x100). Do we still
have .com files nowadays? Are they still limited to 64K like in MS-
DOS? Those were just a throwback to CP/M to allow people to port old
8080 programs over to the 8088 easily --- they were somewhat handy for
small utility programs though --- but I need more than 64K.

Hugh Aguilar

Nov 28, 2012, 2:39:37 AM
On Nov 27, 11:44 pm, f...@nospicedham.securityaudit.val.newsbank.net
(Dick Wesseling) wrote:
> In article <3793b860-5661-4a7f-8039-5fad03697...@6g2000pbh.googlegroups.com>,
>         Hugh Aguilar <hughaguila...@nospicedham.yahoo.com> writes:
>
> > Here is another noobie question:
> > If I do a LODSW in 64-bit mode, then I'm loading a 32-bit number into
> > EAX indirectly through RSI (which gets incremented by 4 assuming that
> > the DF flag is zero). My question is, does this number get sign-
> > extended or zero-extended into RAX, or is the upper half of RAX left
> > as it was?
>
> The upper 48 bits of RAX are left as they were.

Well, I can deal with that --- I'll just do an XOR RAX, RAX prior to
doing the LODSD.

japheth

Nov 28, 2012, 3:21:05 AM
> > The upper 48 bits of RAX are left as they were.
>
> Well, I can deal with that --- I'll just do an XOR RAX, RAX prior to
> doing the LODSD.

LODSW will not change the upper 48 bits, but LODSD will clear the
upper 32-bits.

IIRC, whenever the whole 32-bit register is written, the upper 32
bits are cleared.

Rod Pemberton

Nov 28, 2012, 4:11:05 AM
"Hugh Aguilar" <hughag...@nospicedham.yahoo.com> wrote in
message
news:5351bc3f-524b-43b5...@qi8g2000pbb.googlegroups.com...
...

> One of the problems with the stack-threading discussed earlier
> is that it requires the pointers to the cfas (the threaded code)
> to be 64-bit (because only 64-bit addresses can be on the rsp
> stack in 64-bit mode).

For 16-bit or 32-bit mode, you can load registers with 8-bit
immediates. But, all pushes of registers, without overrides, push
the stack size, i.e., 16-bits or 32-bits. I'm not familiar with
64-bit, but I'd suspect that if you load a 32-bit value into a
register and push the 64-bit register, you'd push a 64-bit value.
Whether the upper bits are set or clear depends on the
instructions used, or perhaps on whether the REX prefix is used (?)

> This is overkill for me though, as the Forth program is
> certainly going to be less than 4GB in size.

Is that a hint to stick to 32-bit mode ... ?

> 1.) LODSD is smaller than doing it manually, which is good
> because it makes the code smaller, but is it faster?

Most likely, not faster. It is an actual CISC instruction. If
you locate the pipelining information for instructions in the AMD
and Intel manuals, those pages should, if LODS is the same as for
16-bit or 32-bit modes, indicate that LODS goes through the
slowest, most complicated, instruction decoding pipeline.

> 2.) I could do it manually like this: MOV EAX, [RSI] ADD RSI,
> 4. In this case, does the 32-bit value get sign-extended into
> RAX by the MOV instruction? I don't want this because my
> program will go haywire if it is above the 2GB boundary.

You might look at instructions like LEA, MOVSX/ZX, or CBW, CWDE,
CDQ, CWD, etc. Those last Cxxx instructions probably have 64-bit
forms or new names for 64-bits.


Rod Pemberton



Rod Pemberton

Nov 28, 2012, 4:15:24 AM
"Hugh Aguilar" <hughag...@nospicedham.yahoo.com> wrote in
message
news:0aac8c84-77aa-4e8f...@kt16g2000pbb.googlegroups.com...
...

> I would like to provide the capability of relocatable binary
> overlays. I don't think this is possible with stack threading,
> because all of the threaded code is composed of absolute
> addresses.

Use offsets ...

I use a variant of a Forth interpreter for another project
(sigh...) that uses offsets instead of absolute addresses. But,
it's in C, not assembly.

> In order to allow for relocatability, I would need to dedicate a
> register (such as rbp) as a base pointer for the entire
> program.

Well, something like that ... indirect addressing of some form or
other.

> All pointers to data or to code would actually be offsets
> from this base register.

For 32-bit mode, you can set the base address of the selector.
Was that functionality removed for 64-bit mode?

> The threading scheme would have to use the base register.

See the LEA instruction.

> Also, all Forth words that access memory (@ and ! and so
> forth) would have to use the base register.

Why?

You can adjust the offsets to absolute addresses simply by adding
the base address. If all memory reads (fetches in Forth) and
writes (stores in Forth) reduce to Forth's memory operations: @
(fetch, i.e., read) ! (store, i.e., write) C@ (char fetch) C!
(char store), then you can add the base directly in those
low-level Forth words or "primitives". It's only if you have some
words or "primitives" that bypass Forth's memory operators that
you'd have to rewrite more words.
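
For example, base-relative fetch and store primitives might look like
this (a sketch; RBP = image base, RBX = cached top of stack, R15 =
data-stack pointer --- all of these register assignments are illustrative):

        fetch:  mov  rbx, [rbp+rbx]   ; @  ( off -- x )  read cell at base+off
                ret
        store:  mov  rax, [r15]       ; !  ( x off -- )  x is second on stack
                mov  [rbp+rbx], rax   ; write x to base+off
                mov  rbx, [r15+8]     ; reload the new top of stack
                add  r15, 16          ; drop x and off
                ret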

> If subroutine-threading were used, then the CALL instruction
> would have to use the base register.

Perhaps, use LEA with CALL.

> Back in the old days, UR/Forth for MS-DOS had relocatable binary
> overlays. It was essentially using CS and DS as its base
> registers, similar to what I've described above. Now everything
> is "flat" --- so to get relocatable binary overlays it is
> necessary to dedicate one of the general-purpose registers as a
> base register. This isn't too difficult on the 64-bit x86 that
> has a lot of registers, but it was a problem on the 32-bit x86
> that is already register starved (SwiftForth used EDI as I
> recall).

I'm not entirely up to date on 64-bit x86, but aren't all
instructions relative? If so, I'd think there would be a CALL and
JMP to a relative address too ... Yes? No?

> [String instructions] are definitely convenient, so I still use
> them. I had thought that there was a speed penalty for using
> these old CISC instructions on the modern processors, but you
> say they are fast --- that is great --- convenience and speed.

There is a large penalty for decoding them since they truly are
CISC instructions. There is also a penalty for loading the
registers they need. But, the instructions themselves, when used
with REP/NE/NZ prefixes, execute faster than other instructions
over large blocks.

> BTW: UR/Forth used direct threading. It used the SI register as
> the Forth IP. The NEXT consisted of a LODS to get the code
> address into AX, and then a JMP indirect through AX --- this was
> pretty fast on the 16-bit x86 --- my experience was that
> UR/Forth ran benchmark programs at the same speed as Turbo C.
> I might switch over to doing this so I can have binary overlays.
> The JMP will have to use the base register, as the pointer in
> RAX is actually an offset from the base rather than an absolute
> address.

I'm not sure that LODS would be the fastest solution anymore.


Rod Pemberton



Philip Lantz

Nov 28, 2012, 6:04:44 AM
Hugh Aguilar wrote:
>
> 1.) I have heard that only RAX can be used with mov to get a 64-bit
> immediate value, but the Intel manual seems to indicate that any
> register can be used. The limitation of 32-bit only occurs when the
> destination is memory. Is it true that, for any instruction, when a 32-
> bit immediate goes to a 64-bit destination, it will be sign-extended
> and not zero-extended?

No, any 64-bit register can be loaded with a 64-bit immediate. Probably
what you have heard is that only AL, AX, EAX, and RAX can be loaded from
memory using a 64-bit absolute address encoded in the instruction.

Yes, when a 32-bit immediate is loaded to a 64-bit destination, it is
sign extended.

> 2.) Is it true that all memory access is relative to the current
> position (there is a rip register used internally that contains the
> address of the instruction being executed), and there is no absolute
> memory addressing now? If I use indirect addressing though, then I
> have to have the absolute address in the register. If I use the lea
> instruction with a label, I will get the absolute address of that
> label in my register, although the operand of the instruction will
> actually be an offset from the instruction.

No, not all memory access is rip relative; only mod=00 and r/m=101
specifies rip-relative addressing. The other addressing modes are the
same as in 32-bit mode (except they use 64-bit base and index registers,
of course). Absolute addressing with a 32-bit address is still available
using a SIB encoding of 0x25.
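
In NASM syntax (FASM spells these differently) the two encodings can be
requested explicitly; a sketch, with "table" as an illustrative label:

        mov  eax, [rel table]   ; mod=00, r/m=101: RIP-relative disp32
        mov  eax, [abs table]   ; SIB form: 32-bit absolute address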

> 3.) Is there any speed difference between using the traditional
> registers and using the new registers (r8 through r15)? I wonder if
> the opcodes have a prefix or whatever that makes them bigger and
> slower.

A REX prefix is required to use r8 through r15, so instructions using
these registers may be one byte longer. However, the REX prefix is also
used to specify a 64-bit operation, so there is no additional prefix
needed to specify r8 - r15 for a 64-bit operation.

BGB

Nov 28, 2012, 3:28:14 PM
On 11/28/2012 3:15 AM, Rod Pemberton wrote:
> "Hugh Aguilar" <hughag...@nospicedham.yahoo.com> wrote in
> message
> news:0aac8c84-77aa-4e8f...@kt16g2000pbb.googlegroups.com...
> ...
>
>> I would like to provide the capability of relocatable binary
>> overlays. I don't think this is possible with stack threading,
>> because all of the threaded code is composed of absolute
>> addresses.
>
> Use offsets ...
>
> I use a variant of a Forth interpreter for another project
> (sigh...) that uses offsets instead of absolute addresses. But,
> it's in C, not assembly.
>
>> In order to allow for relocatability, I would need to dedicate a
>> register (such as rbp) as a base pointer for the entire
>> program.
>
> Well, something like that ... indirect addressing of some form or
> other.
>
>> All pointers to data or to code would actually be offsets
>> from this base register.
>
> For 32-bit mode, you can set the base address of the selector.
> Was that functionality removed for 64-bit mode?
>

yes.

in 64-bit mode, segment base addresses no longer work.
IIRC I had heard rumors before though that some later chip may re-add them.


then again, I also heard rumors recently of Intel wanting to move
entirely to BGA packaging (with the CPUs coming pre-soldered to the
MOBO), which other people commented was stupid and unlikely (since Intel
would create a lot of backlash and lose market share by going this route).

then again, this brings up a long-ago memory that I saw 386SX MOBOs back
in the 90s which did this (just with QFP). apparently there were QFP 486
chips as well...


>> The threading scheme would have to use the base register.
>
> See the LEA instruction.
>
>> Also, all Forth words that access memory (@ and ! and so
>> forth) would have to use the base register.
>
> Why?
>
> You can adjust the offsets to absolute addresses simply by adding
> the base address. If all memory reads (fetches in Forth) and
> writes (stores in Forth) reduce to Forth's memory operations: @
> (fetch, i.e., read) ! (store, i.e., write) C@ (char fetch) C!
> (char store), then you can add the base directly in those
> low-level Forth words or "primitives". It's only if you have some
> words or "primitives" that bypass Forth's memory operators that
> you'd have to rewrite more words.
>

yeah.
base-registers can be useful.

I had considered similar before partly as an option to allow a code
generator to be clever and mostly use 32-bit addressing internally
(except when dealing with explicit pointers or similar).

never did much with the idea though. mostly as the higher complexity and
maintenance costs of native code generators have largely caused me to use
them sparingly (and most of this is as ugly shims to glue crap together).

note that as-is, most of the "executable heap" is clustered into a 2GB
region in 64-bit targets (though actually 4GB is reserved, with a 2GB
usable area in the middle).

dynamically generated code and data / bss areas may be put there, such
that everything is within easy reach. currently, the whole thing is RWX
though, and may require changing eventually.

there was some uncertainty as apparently SELinux doesn't like RWX
memory, but as-is AFAICT the "no RWX memory" restriction is only
enforced by default for daemons.


>> If subroutine-threading were used, then the CALL instruction
>> would have to use the base register.
>
> Perhaps, use LEA with CALL.
>

why?...

CALL is normally relative anyways, so you would still only need an
indirect call if going to another memory-region (or outside the +-2GB
window).


>> Back in the old days, UR/Forth for MS-DOS had relocatable binary
>> overlays. It was essentially using CS and DS as its base
>> registers, similar to what I've described above. Now everything
>> is "flat" --- so to get relocatable binary overlays it is
>> necessary to dedicate one of the general-purpose registers as a
>> base register. This isn't too difficult on the 64-bit x86 that
>> has a lot of registers, but it was a problem on the 32-bit x86
>> that is already register starved (SwiftForth used EDI as I
>> recall).
>
> I'm not entirely up to date on 64-bit x86, but aren't all
> instructions relative? If so, I'd think there would be a CALL and
> JMP to a relative address too ... Yes? No?
>

CALL and JMP are relative even on 32-bit x86.

the only significant limitation on 64-bits is that they have a +-2GB
window, as the relative address is still 32-bits.


>> [String instructions] are definitely convenient, so I still use
>> them. I had thought that there was a speed penalty for using
>> these old CISC instructions on the modern processors, but you
>> say they are fast --- that is great --- convenience and speed.
>
> There is a large penalty for decoding them since they truly are
> CISC instructions. There is also a penalty for loading the
> registers they need. But, the instructions themselves, when used
> with REP/NE/NZ prefixes, execute faster than other instructions
> over large blocks.
>

yeah. a little possible trick here is basically that they can be treated
more like direct memory-block copies by the processor, rather than
actually executed as-expected per-se.

so, costs aren't really all that bad.

for fixed-size block-moves though, certain SSE operations (MOVDQA and
MOVDQU) can be a little faster though.


so, logic can be, in a code-generator (for a "memcpy" or similar intrinsic):
    multiple-of-16 (constant size, and under a certain size limit)?
        both ends aligned?
            use a MOVDQA chain.
        else:
            use a MOVDQU chain.
    else:
        use "REP MOVSB" or "REP MOVSD" (multiple of 4).

ironically, "REP MOVSB" can sometimes beat out more "clever" ways of
doing memory copies, like copying everything into GPRs and writing them
to the destination.

this is ultimately a moot tradeoff for larger memory copies though, as
once a person goes outside of what easily fits in cache, then the speed
at which memory can be read/written becomes the dominant factor.


my past x86 interpreter exploited this, and ironically "REP MOVSB" and
friends were some of the fastest operations in the interpreter.


>> BTW: UR/Forth used direct threading. It used the SI register as
>> the Forth IP. The NEXT consisted of a LODS to get the code
>> address into AX, and then a JMP indirect through AX --- this was
>> pretty fast on the 16-bit x86 --- my experience was that
>> UR/Forth ran benchmark programs at the same speed as Turbo C.
>> I might switch over to doing this so I can have binary overlays.
>> The JMP will have to use the base register, as the pointer in
>> RAX is actually an offset from the base rather than an absolute
>> address.
>
> I'm not sure that LODS would be the fastest solution anymore.
>

yeah...


much faster ways of threaded code IME are basically just to have calls
in sequence:

...
call A
...
call B
...
call C
...
ret

since, even if the calls are indirect, the branch-predictor can do its
thing.


an experimental threaded code interpreter of mine (written in C) does
attempt to exploit this (by organizing the threaded code into "traces"
which basically consist of sequential unwound calls, terminated by any
operation which may change the execution path or raise an exception,
such as a jump or call, ...).

though, as-is, it runs things like simple loops and similar still about
8x slower than native (MSVC compiled) code.

while considerably faster than my existing threaded-code interpreter
(the core of the BGBScript VM), which is closer to around 70x slower
than native, it is still much slower than native code (sadly), and also
far from usably complete (and attention is much more directed to more
immediately relevant parts of the project).

I am not sure exactly, but I suspect unrolling the operation-handler
dispatch loops is a big part of the speed gain here (at least for
shorter traces, currently traces with up-to about 8 operations are
handled this way).

not sure if there is any obviously faster strategy for a plain-C
interpreter.


I suspect actually that, as time goes on, it is getting considerably
harder to approach native speeds with an interpreter (designs which got
within 15x of native 8 years ago, now run closer to around 100x slower
than native). (and "switch()" has gotten considerably more expensive as
well, in addition to things like "if()", so code is basically faster if
it is "straight-shot" with very few conditionals).

I suspect it is mostly that this has to deal with the types of
optimizations being used in modern processors (things like deep
pipelines and similar).

so, the CPU generally gets faster, but interpreter performance is left
behind.


>
> Rod Pemberton
>
>
>

Robert Wessel

Nov 28, 2012, 4:11:13 PM
FS and GS have always worked as base registers in 64-bit mode. I
think it's unlikely that that will be expanded in the future.


>then again, I also heard rumors recently of Intel wanting to move
>entirely to BGA packaging (with the CPUs coming pre-soldered to the
>MOBO), which other people commented was stupid and unlikely (since Intel
>would create a lot of backlash and lose market share by going this route).
>
>then again, this brings up a long-ago memory that I saw 386SX MOBOs back
>in the 90s which did this (just with QFP). apparently there were QFP 486
>chips as well...


Solderable packages (BGAs, QFPs, etc.) make sense for low cost parts.
So most (all?) Intel Atoms are available in such. For the high end
parts in Intel's product line, it's hard to see how that would be a
good idea.

Andrew Cooper

Nov 28, 2012, 7:25:48 PM
This would probably be because of "Fast String Operations" (Intel SDM
Volume 1, 7.3.9.3) supported on newer processors.

Experimentally, on our new Ivy Bridge processors at work, it appears
that the fastest way to zero a page is

REP STOSQ (with RAX as 0 and RCX as 512, i.e. one 4K page)

Which certainly appears to outperform aligned SSE writes.
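
Spelled out (a sketch, assuming RDI already holds the page address):

        xor  eax, eax     ; RAX = 0 (the 32-bit write clears the top half)
        mov  ecx, 512     ; 512 qwords * 8 bytes = 4096 bytes
        cld               ; DF = 0, ascending stores
        rep  stosq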

~Andrew

Robert Redelmeier

Nov 28, 2012, 10:12:43 PM
Andrew Cooper <and...@nospicedham.nospam.example.com> wrote in part:
> This would probably be because of "Fast String Operations"
> (Intel SDM Volume 1, 7.3.9.3) supported on newer processors.
>
> Experimentally, on our new Ivy Bridge processors at work,
> it appears that the fastest way to zero a page is
>
> REP STOSQ (with RAX as 0 and RCX as 512, i.e. one 4K page)
>
> Which certainly appears to outperform aligned SSE writes.

Outperforms MOVNTQ ? This is news.

-- Robert


Hugh Aguilar

Nov 29, 2012, 12:05:33 AM
On Nov 28, 1:21 am, japheth <japhe...@nospicedham.googlemail.com>
wrote:
Where did you read that?

I couldn't find anything in the Intel manual that said one way or the
other --- I just looked up LODS in volume 2.

Hugh Aguilar

Nov 29, 2012, 12:17:18 AM
On Nov 28, 2:15 am, "Rod Pemberton"
<do_not_h...@nospicedham.notemailnotz.cnm> wrote:
> "Hugh Aguilar" <hughaguila...@nospicedham.yahoo.com> wrote in
> messagenews:0aac8c84-77aa-4e8f...@kt16g2000pbb.googlegroups.com...
> > Also, all Forth words that access memory (@ and ! and so
> > forth) would have to use the base register.
>
> Why?
>
> You can adjust the offsets to absolute addresses simply by adding
> the base address.  If all memory reads (fetches in Forth) and
> writes (stores in Forth) reduce to Forth's memory operations: @
> (fetch, i.e., read) ! (store, i.e., write) C@ (char fetch) C!
> (char store), then you can add the base directly in those
> low-level Forth words or "primitives".  It's only if you have some
> words or "primitives" that bypass Forth's memory operators that
> you'd have to rewrite more words.

Because Forth code runs at both compile-time and run-time.

When the user writes an overlay, he may execute Forth code at compile-
time that works with pointers, and stores those pointers in data
structures that will be accessed at run-time for that overlay. If
those pointers are absolute addresses then they are going to have to
be relocated when the overlay is loaded later on. This is okay for
pointers in the overlay that point to code or data in the overlay
itself. There will also be pointers stored in the overlay that point
to code or data in the main program. The big problem here is that the
main program is not loaded into the same place in memory every time
that it is run --- so we can't have absolute addresses in the main
program saved inside of the overlay --- because the overlay isn't
getting relocated as part of the relocation of the main program that
the OS does when loading a program into memory.

I've given some thought to this subject over the last few days, and I'm
pretty sure that I don't want to store any absolute addresses at all
inside of the overlay --- because storing absolute addresses that
point to the main program is a problem. I think that I have to use
only offsets from a base register, and have @ and ! etc. convert these
into absolute addresses at run-time just before they are used.

Philip Lantz

Nov 29, 2012, 2:50:13 AM
Hugh Aguilar wrote:
> japheth wrote:
> > > > The upper 48 bits of RAX are left as they were.
> >
> > > Well, I can deal with that --- I'll just do an XOR RAX, RAX prior to
> > > doing the LODSD.
> >
> > LODSW will not change the upper 48 bits, but LODSD will clear the
> > upper 32-bits.
> >
> > IIRC, whenever the whole 32-bit register is written, the upper 32
> > bits are cleared.
>
> Where did you read that?
>
> I couldn't find anything in the Intel manual that said one way or the
> other --- I just looked up LODS in volume 2.

This behavior is common to all instructions in 64-bit mode where the
destination is a 32-bit register, so it isn't mentioned for each
instruction in volume 2.

Instead, it is described in Volume 1, Section 3.4.1.1, "General-Purpose
Registers in 64-Bit Mode". (I'm looking at edition 43 (May 2012);
section numbers may vary in other editions.)

"When in 64-bit mode, operand size determines the number of valid bits
in the destination general-purpose register:
• 64-bit operands generate a 64-bit result in the destination general-
purpose register.
• 32-bit operands generate a 32-bit result, zero-extended to a 64-bit
result in the destination general-purpose register.
• 8-bit and 16-bit operands generate an 8-bit or 16-bit result. The
upper 56 bits or 48 bits (respectively) of the destination general-
purpose register are not modified by the operation. If the result of an
8-bit or 16-bit operation is intended for 64-bit address calculation,
explicitly sign-extend the register to the full 64-bits."
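
A three-line illustration of those rules:

        mov  rax, -1      ; RAX = FFFFFFFF_FFFFFFFF
        mov  eax, 1       ; 32-bit result: RAX = 00000000_00000001
        mov  ax, 2        ; 16-bit result: only AX changes; bits 16-63
                          ; keep whatever they held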

BGB

Nov 29, 2012, 2:50:45 AM
I meant, besides FS and GS...

there was a rumor around somewhere that "some future version" might
allow other segment registers to work as they did before.

not really like it would probably make a whole lot of difference though:
even if it worked again, OS's probably wouldn't bother to make much use
of it.


I could actually just as easily see a CPU going and stripping out
segmentation nearly altogether, and moving what is left of it into MSRs
or similar (probably a chip that boots directly into long-mode and omits
support for non-flat memory models). not that it would buy all that much
at the moment though (and it would probably require changes to an OS,
and probably mandate using UEFI or similar for booting up, ...).


>
>> then again, I also heard rumors recently of Intel wanting to move
>> entirely to BGA packaging (with the CPUs coming pre-soldered to the
>> MOBO), which other people commented was stupid and unlikely (since Intel
>> would create a lot of backlash and lose market share by going this route).
>>
>> then again, this brings up a long-ago memory that I saw 386SX MOBOs back
>> in the 90s which did this (just with QFP). apparently there were QFP 486
>> chips as well...
>
>
> Solderable packages (BGAs, QFPs, etc.) make sense for low cost parts.
> So most (all?) Intel Atoms are available in such. For the high end
> parts in Intel's product line, it's hard to see how that would be a
> good idea.
>

pretty much, this was the issue.

a person was claiming that all their chips would become BGA (including
high-end ones), but other people were like "no that is just stupid".


Terje Mathisen

Nov 29, 2012, 2:52:20 AM
Andrew Cooper wrote:
>> use "REP MOVSB" or "REP MOVSD" (multiple of 4).
>>
>> ironically, "REP MOVSB" can sometimes beat out more "clever" ways of
>> doing memory copies, like copying everything into GPRs and writing them
>> to the destination.
>
> This would probably be because of "Fast String Operations" (Intel SDM
> Volume 1, 7.3.9.3) supported on newer processors.
>
> Experimentally, on our new Ivy Bridge processors at work, it appears
> that the fastest way to zero a page is
>
> REP STOSQ (With rax as 0 and rcx as 4K)
>
> Which certainly appears to outperform aligned SSE writes.

This is finally the way it should have been a _long_ time ago!

My friend Andy Glew (one of the PentiumPro architects) has told me that
he wanted to add fast strings to that cpu but didn't have the time &
room to do it properly:

A complete fast strings implementation needs to handle (nearly?) all the
possible cases efficiently, i.e.

1) Target and/or source misaligned

2) Target and/or source already in cache level N (N=1,2,3 etc)

3) Target and/or source in a special memory range (Write Combining,
Write Back, Uncached...)

and all the possible permutations of these alternatives.

I.e. this turns out to be a very wide switch statement, something which
is slow in sw or even in microcode but can be very fast in dedicated hw.

Of course, by the time you do have such an engine, you can simply use
the byte based operations for everything, as long as the block to be
moved is less than 4 GB in 32-bit mode! The HW will turn this into an
optimal set of cache-line sized block transfers, probably using the
SSE/AVX shifter to align the destination if target/source is relatively
misaligned.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

BGB

Nov 29, 2012, 3:54:32 AM
I haven't done any extensive tests recently (and my CPU is several
years old now, ~ 2010 era).


but, from past tests, the cost of "REP MOVS*" and friends is that it
effectively requires evacuating and then assigning several registers,
which isn't free.

so, in my tests, for small fixed-size objects, SSE operations were
faster, but didn't scale as well, or deal effectively with
variable-sized copies.

this would be like copying a 16-byte aligned struct with:
movdqa xmm0, [...]
movdqa [...], xmm0

the performance killer mostly has to do with looping, where it becomes
impractical to copy a large structure simply by using very-long chains
of "movdqa" operations or similar (and the speed advantage falls off
quickly), and using an explicit loop puts some major hurt on the thing.

"REP *" was faster though for larger and variable-sized structures.
this is because, apart from getting things loaded into the correct
registers, the operation itself goes very fast.

(whereas "REP MOVS*" does not seem to suffer from the same performance
problems as loops, and so seems able to run much faster than an explicit
loop).


past a certain size limit, and the difference drops off suddenly:
like, for copying a several-MB array, all reasonably-efficient
operations drop down to the same speed (say, 6.4GB/s or so), I suspect
mostly because the bus starts choking things.

most options (apart from maybe copying bytes via a "for()" loop) will be
limited by this factor (in a recent test, a loop like "for(i=0; i<sz;
i++)dst[i]=src[i];" with bytes, will only pull off about 300MB/s or
so...). in my recent test, a plain "for()" loop by itself can only spin
about 400M iterations/second (IOW: the loop won't be able to spin fast
enough for the bus speed to matter).


I suspect this is a big impact on interpreters as well (where every
conditional and similar saps performance). this ends up being part of
why the structure of many things starts being built from "plumbing-pipe
logic" with function pointers. because, yes, it is actually getting
faster to write high-performance code which mostly works by calling
through function pointers (and trying to take as many conditionals as
possible out of the direct execution path).

like, for some unclear reason:

    if(whatever)
        { do something... }

is often slower than:

    state->handler(state);

with support machinery like:

    void Foo_StateCheckReset(FooState *state)
    {
        ...
        if(whatever)
            state->handler=Foo_StateDoSomething;
        else
            state->handler=Foo_StateDoNothing;
    }
    ...

with the idea that the state's handler will be called far more often
than the state will be changed.

mostly as indirect calls like this seem to be often surprisingly fast on
recent hardware.


though, the main limiting factor regarding wide adoption of this sort of
logic structure is just how convoluted/awkward it can get, and most code
is not *that* speed critical.


or such...

Rod Pemberton

Nov 29, 2012, 4:15:37 AM
"Hugh Aguilar" <hughag...@nospicedham.yahoo.com> wrote in
message
news:36cec20f-ce9a-4316...@b4g2000pby.googlegroups.com...
> On Nov 28, 2:15 am, "Rod Pemberton"
> <do_not_h...@nospicedham.notemailnotz.cnm> wrote:
> > "Hugh Aguilar" <hughaguila...@nospicedham.yahoo.com> wrote in
...

> > > Also, all Forth words that access memory (@ and ! and so
> > > forth) would have to use the base register.
>
> > Why?
>
> > You can adjust the offsets to absolute addresses simply by
> > adding the base address. If all memory reads (fetches in
> > Forth) and writes (stores in Forth) reduce to Forth's memory
> > operations: @ (fetch, i.e., read) ! (store, i.e., write) C@
> > (char fetch) C! (char store), then you can add the base
> > directly in those low-level Forth words or "primitives". It's
> > only if you have some words or "primitives" that bypass
> > Forth's memory operators that you'd have to rewrite more
> > words.
>
> Because Forth code runs at both compile-time and run-time.
>
> When the user writes an overlay, he may execute Forth code at
> compile-time that works with pointers, and stores those pointers
> in data structures that will be accessed at run-time for that
> overlay. If those pointers are absolute addresses then they are
> going to have to be relocated when the overlay is loaded later
> on. This is okay for pointers in the overlay that point to code
> or data in the overlay itself. There will also be pointers
> stored in the overlay that point to code or data in the main
> program. The big problem here, is that the main program is not
> loaded into the same place in memory every time that it is
> run --- so we can't have absolute addresses in the main program
> saved inside of the overlay --- because the overlay isn't
> getting relocated as part of the relocation of the main program
> that the OS does when loading a program into memory.

Possible solution #1:
Use a single method for both compile-time and run-time. If you
want overlays, then make everything use offsets.

> I've given some thought to this subject over the last few days,
> and I'm pretty sure that I don't want to store any absolute
> addresses at all inside of the overlay --- because storing
> absolute addresses that point to the main program is a problem.
> I think that I have to use only offsets from a base register,
> and have @ and ! etc. convert these into absolute addresses at
> run-time just before they are used.

Possible solution #2:
Use absolute addresses. When saving an overlay, subtract the base
address from all stored addresses. Hopefully, colon-definitions
(i.e., address lists to be called) are the primary place using
absolute addresses in your Forth. The CFA fields will need
adjusting too. Of course, you probably want an ITC Forth for
this. Most Forth words seem to be offset based already. If other
low-level Forth 'words' or "primitives" (i.e., functions or
procedures), perhaps string related 'words', use absolute
addresses, then you'll have to fix them up also. E.g., as I think
you're aware, for ITC, they can be made to be relative to the
stacked instruction pointer on the return stack using R> at the
start of the definition.


Rod Pemberton








Rod Pemberton

Nov 29, 2012, 4:31:07 AM
"BGB" <cr8...@nospicedham.hotmail.com> wrote in message
news:k974a4$23e$1...@news.albasani.net...
> On 11/28/2012 3:11 PM, Robert Wessel wrote:
> > On Wed, 28 Nov 2012 14:28:14 -0600, BGB
> > <cr8...@nospicedham.hotmail.com> wrote:
> >> On 11/28/2012 3:15 AM, Rod Pemberton wrote:
> >>> "Hugh Aguilar" <hughag...@nospicedham.yahoo.com> wrote
> >>>in message
> >>>
news:0aac8c84-77aa-4e8f...@kt16g2000pbb.googlegroups.com...
...

> >> then again, I also heard rumors recently of Intel wanting to
> >> move entirely to BGA packaging (with the CPUs coming
> >> pre-soldered to the MOBO), which other people commented was
> >> stupid and unlikely (since Intel would create a lot of
> >> backlash and lose market share by going this route).
> >> then again, this brings up a long-ago memory that I saw 386SX
> >> MOBOs back in the 90s which did this (just with QFP).
> >> apparently there were QFP 486 chips as well...
> >
> >
> > Solderable packages (BGAs, QFPs, etc.) make sense for low cost
> > parts. So most (all?) Intel Atoms are available in such.
> > For the high end parts in Intel's product line, it's hard to
> > see how that would be a good idea.
> >
>
> pretty much, this was the issue.
>
> a person was claiming that all their chips would become BGA
> (including high-end ones), but other people were like
> "no that is just stupid".
>

Eliminating the socket reduces production cost of the motherboard,
perhaps $15 to $35 USD. However, the microprocessor is a far more
expensive component. So, making motherboards with a processor
already soldered means the board manufacturer needs more cash to
produce that product than they would for a motherboard with just a
socket and no microprocessor. But, the board manufacturer also
has to take the market demands into account. So, I'd think the
choice of a socket or a soldered processor would be determined by whether
the motherboard buyer wants a processor included or not. If the
customer is an individual and wants to install their own
processor, they don't want a soldered-in processor. If the
customer is a large company assembling PCs, they don't want to
have to pay for labor to insert many microprocessors into sockets.
They'd rather have soldered processors.


Rod Pemberton


Robert Redelmeier

Nov 29, 2012, 9:49:43 AM
BGB <cr8...@nospicedham.hotmail.com> wrote in part:
> On 11/28/2012 9:12 PM, Robert Redelmeier wrote:
>> Andrew Cooper <and...@nospicedham.nospam.example.com> wrote in part:
>>> This would probably be because of "Fast String Operations"
>>> (Intel SDM Volume 1, 7.3.9.3) supported on newer processors.
>>>
>>> Experimentally, on our new Ivy Bridge processors at work,
>>> it appears that the fastest way to zero a page is
>>>
>>> REP STOSQ (with RAX as 0 and RCX as 512, i.e. one 4K page)
>>>
>>> Which certainly appears to outperform aligned SSE writes.
>>
>> Outperforms MOVNTQ ? This is news.
>>
>
> I haven't done any extensive tests recently (and my CPU
> is several years old now, ~ 2010 era).
>
> but, from past tests, the cost of "REP MOVS*" and friends is
> that it effectively requires evacuating and then assigning
> several registers, which isn't free.
>
> so, in my tests, for small fixed-size objects, SSE operations
> were faster, but didn't scale as well, or deal effectively
> with variable-sized copies.
>
> this would be like copying a 16-byte aligned struct with:
> movdqa xmm0, [...] movdqa [...], xmm0
>
> the performance killer mostly has to do with looping, where
> it becomes impractical to copy a large structure simply by
> using very-long chains of "movdqa" operations or similar
> (and the speed advantage falls off quickly), and using an
> explicit loop puts some major hurt on the thing. [snip]

No, that's not the problem I was referring to. bzero()
[and even bcopy()] can be slow because most instructions
mov/movdqa/... require _reading_ the cacheline into L1 before
writing any bytes. Even when the whole cacheline is going to
be written (MMU cannot know). movntq avoids [delays?] this
expensive read-before-write.

The other possible optimization for bcopy() is bus-bursting, doing
long reads into cache, then long writes to save mem bus turnaround.

-- Robert


yesma

Nov 29, 2012, 4:35:25 PM
On 29.11.2012 08:50, Philip Lantz wrote:
[...]
> "When in 64-bit mode, operand size determines the number of valid bits
> in the destination general-purpose register:
only a MOV with a 64-bit operand does not zero the upper 32-bit part
of the register --- and of course the PUSH instruction :)


Hugh Aguilar

Nov 30, 2012, 1:46:15 AM
On Nov 29, 12:50 am, Philip Lantz <p...@nospicedham.canterey.us>
wrote:
Thanks for the info!

I'm gradually learning this stuff. There is a lot of new material
added since the time of the 80486, but I'm picking it up little by
little. :-)

Hugh Aguilar

unread,
Nov 30, 2012, 5:12:54 PM11/30/12
to
On Nov 29, 11:46 pm, Hugh Aguilar
<hughaguila...@nospicedham.yahoo.com> wrote:
> I'm gradually learning this stuff. There is a lot of new material
> added since the time of the 80486, but I'm picking it up little by
> little. :-)

Here are a couple of more noobie questions:

1.) I have an instruction: JMP [rbp+rbx] This failed under FASM
because there was no size for the datum. So I changed it to this: JMP
qword [rbp+rbx] That worked. This makes no sense though! Why should
the datum size be qword or dword or any other kind of word? There is
no datum! This is code not data.

2.) Is there a small code cache of about 32 bytes? I should align the
start of every function on this. Is this true? If so, is the size 32
bytes or 16 or what?

3.) Is there a large code cache of about 64K bytes? I should strive to
fit my entire VM inside of this. Is this true? If so, is the size 64K
or 32K or what?

Thanks for your help. :-)

Tim Roberts

unread,
Nov 30, 2012, 11:29:21 PM11/30/12
to
Hugh Aguilar <hughag...@nospicedham.yahoo.com> wrote:
>
>1.) I have an instruction: JMP [rbp+rbx] This failed under FASM
>because there was no size for the datum. So I changed it to this: JMP
>qword [rbp+rbx] That worked. This makes no sense though! Why should
>the datum size be qword or dword or any other kind of word? There is
>no datum! This is code not data.

It is both. That's an indirect jump. rbp+rbx points at the address to be
loaded, and the jump can be near or far. It has to know the size of the
address to determine that.
--
Tim Roberts, ti...@probo.com
Providenza & Boekelheide, Inc.

Philip Lantz

unread,
Dec 1, 2012, 3:12:51 AM12/1/12
to
Hugh Aguilar wrote:
> Here are a couple of more noobie questions:
>
> 1.) I have an instruction: JMP [rbp+rbx] This failed under FASM
> because there was no size for the datum. So I changed it to this: JMP
> qword [rbp+rbx] That worked. This makes no sense though! Why should
> the datum size be qword or dword or any other kind of word? There is
> no datum! This is code not data.

rbp+rbx is not the new value of the RIP; rather it is the address of a
memory location that contains the new value of the RIP. And of course
that memory location has a size. There is no instruction that allows you
to jump to rbp+rbx.
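
To make that concrete, here is a hypothetical jump-table sketch
(FASM-style; handlers, handler0, and handler1 are made-up names). The
qword at rbp+rbx is the "datum" whose size the assembler was asking
about:

        lea     rbp, [handlers]      ; rbp = base of the table
        mov     rbx, 1*8             ; index 1; entries are 8 bytes wide
        jmp     qword [rbp+rbx]      ; fetch the qword AT rbp+rbx and jump
                                     ; to it (lands at handler1)

handlers:
        dq      handler0
        dq      handler1

handler0:
        ret
handler1:
        mov     rax, 1
        ret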

Hugh Aguilar

unread,
Dec 2, 2012, 7:13:25 PM12/2/12
to
Well, I read Intel's optimization manual, and I answered my own
question (maybe):

On Nov 30, 3:12 pm, Hugh Aguilar <hughaguila...@nospicedham.yahoo.com>
wrote:
> 2.) Is there a small code cache of about 32 bytes? I should align the
> start of every function on this. Is this true? If so, is the size 32
> bytes or 16 or what?

All labels that can be branched to should be aligned on 16-byte
boundaries.
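
In FASM that is just an align directive in front of each branch
target --- a sketch (next_word is a made-up label):

        align   16                 ; pad to the next 16-byte boundary
next_word:
        mov     rax, [rsi]         ; ...body of the routine...
        ret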

> 3.) Is there a large code cache of about 64K bytes? I should strive to
> fit my entire VM inside of this. Is this true? If so, is the size 64K
> or 32K or what?

The instruction-cache is 32KB. Fitting the entire VM into 32K is the
goal. If I can't do this, I should still put the most common words in
a single 32KB block, and the less common ones in another (maybe all of
the floating-point words, as HostForth is primarily going to be used
for cross-compilation, which doesn't do much floating-point
arithmetic).

There are 8 4KB data-caches, which can be used simultaneously. It is a
good idea, then, to put data together to minimize the number of 4KB
blocks that are in use at any time. In my heap, I will sort the free
blocks by their address (in a tree). When I need memory, I will find
the one with the lowest address that is big enough. This will tend to
make all of the used blocks pretty much adjacent in low memory. Also,
when a block is freed, I will split it into multiple free blocks on
4KB boundaries so that no block later given out will cross a 4KB
boundary (except in the rare case that a block is requested larger
than 4KB).
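
The boundary arithmetic for that splitting is cheap --- a sketch of
rounding an address in rax up to the next 4KB boundary (a 4KB-aligned
address has its low 12 bits clear):

        add     rax, 4095          ; bump past the boundary unless aligned
        and     rax, -4096         ; clear the low 12 bits: now 4KB aligned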

The colon words are actually data not code, so they should also be
grouped into 4KB blocks of related words --- hopefully when one word
calls another word, they will both be in the same 4KB block so
everything will already be in cache.

I'm referring to a 4KB block as a "book" --- I just made that term up
--- is there some other term that is already in use? 16-byte blocks
have been referred to as "paragraphs" for as long as I can remember
(on 16-bit x86, ES would point to paragraphs, so the nodes in a
tree or list had to be paragraph aligned assuming that ES was being
used as the node-pointer).

Hugh Aguilar

unread,
Dec 2, 2012, 7:28:20 PM12/2/12
to
Okay, now I'm confused!

I expected there to be code at that address --- because I assembled
some code there!

Consider this:

lea rbx, [destination]
jmp [rbx]

destination:
mov rax, 1
...

This will presumably execute the code at destination, starting with
the mov instruction. What exactly is in rbx, if not an absolute
address?

This is related to what I asked earlier about overlays. I'm assuming
that I need to dedicate a register (ebp) to hold the address of the
base of the program, and all of "pointers" in the program will
actually be offsets from base of the program. It is possible that this
is already being done --- all of the pointers in the program are
actually offsets from the base of the program, and there is an
internal register that I don't know about that holds the absolute
address of the base of the program (similar to how cs and ds were
pointers to the base of the program's code and data segments on the
old 16-bit x86, and all near pointers were actually offsets from these
bases, rather than absolute addresses).

Is there a document that describes rip-relative addressing? I've seen
it mentioned in the Intel manuals, but I've never seen it defined
anywhere.

Thanks --- Hugh

Philip Lantz

unread,
Dec 3, 2012, 1:45:43 AM12/3/12
to
Hugh Aguilar wrote:
Philip Lantz wrote:
> > Hugh Aguilar wrote:
> > > Here are a couple of more noobie questions:
> >
> > > 1.) I have an instruction: JMP [rbp+rbx] This failed under FASM
> > > because there was no size for the datum. So I changed it to this: JMP
> > > qword [rbp+rbx] That worked. This makes no sense though! Why should
> > > the datum size be qword or dword or any other kind of word? There is
> > > no datum! This is code not data.
> >
> > rbp+rbx is not the new value of the RIP; rather it is the address of a
> > memory location that contains the new value of the RIP. And of course
> > that memory location has a size. There is no instruction that allows you
> > to jump to rbp+rbx.
>
> Okay, now I'm confused!

Sorry! I considered adding more information, but I didn't want to make
it confusing!

> I expected there to be code at that address --- because I assembled
> some code there!
>
> Consider this:
>
> lea rbx, [destination]
> jmp [rbx]
>
> destination:
> mov rax, 1
> ...
>
> This will presumably execute the code at destination, starting with
> the mov instruction. What exactly is in rbx, if not an absolute
> address?

Depending on your assembler syntax, it could be exactly what you expect.

While there is no instruction that jumps to, say, rbp+rbx, you can jump
to the address in rbx (or any other general-purpose register).

The instruction is jmp r/m64 (opcode FF /4). If the addressing mode is
register (mod = 11), the destination comes from a register. If the
addressing mode is any of the myriad memory addressing modes, the
destination comes from the addressed memory location.

The assembler might use the syntax
jmp rbx
for the register-direct addressing mode, or it might use
jmp [rbx]

I'm pretty sure I've seen both, but I don't remember which assemblers
did which.

If your assembler uses
jmp rbx
for the register-direct addressing mode (FF E3), then
jmp [rbx]
should be FF 23, which means jump to the address that is in the memory
location addressed by rbx.

In any case, if you write
jmp [rbx+rbp]
and your assembler accepts it, it has to be indirect through memory
(FF 24 2B), because there is no "register-direct" addressing mode that
uses the sum of two registers.
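
Putting the three encodings side by side (bytes as given above; worth
verifying against your assembler's listing output):

        jmp rbx                ; FF E3     new RIP = rbx itself
        jmp qword [rbx]        ; FF 23     new RIP = the qword stored at [rbx]
        jmp qword [rbx+rbp]    ; FF 24 2B  new RIP = the qword at [rbx+rbp]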

I hope that's clear, now, but if it isn't, let me know, and I'll try
again. :-) If you include a fragment of an assembler listing with the
actual bytes of the instructions you have questions about, I can give
more precise answers.


> Is there a document that describes rip-relative addressing? I've seen
> it mentioned in the Intel manuals, but I've never seen it defined
> anywhere.

Volume 1, section 3.7.5.1, and volume 2, section 2.2.1.6.
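
Until you get to those sections, here is a small sketch of the common
RIP-relative forms with their encodings (msg is a made-up label;
whether an assembler emits RIP-relative or 32-bit absolute for a plain
[msg] depends on the assembler and output format, so check a listing):

use64
        lea     rsi, [msg]     ; 48 8D 35 <disp32>: rsi = rip + disp32
        mov     al, [msg]      ; 8A 05 <disp32>: also rip-relative
        ret
msg     db      'hi', 0        ; disp32 is measured from the END of each
                               ; instruction to msg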

Philip Lantz

unread,
Dec 3, 2012, 2:18:06 AM12/3/12
to
Hugh Aguilar wrote:
> There are 8 4KB data-caches, which can be used simultaneously.

I think you may have misunderstood what you read, or at least your
explanation doesn't seem to me to match the way the caches work. The
data caches store cache lines of 64 bytes, not full pages. The caches
are set associative, which you should read about if you care to
understand how the caches work in detail. (I couldn't begin to do the
topic justice here.)

Philip Lantz

unread,
Dec 3, 2012, 2:17:56 AM12/3/12
to
Hugh Aguilar wrote:

> I'm referring to a 4KB block as a "book" --- I just made that term up
> --- is there some other term that is already in use?

"Page", because it is the size of a virtual memory page. (However, I
wouldn't call a 4kb block a page unless it is aligned on a 4k boundary.)

Hugh Aguilar

unread,
Dec 3, 2012, 6:39:09 PM12/3/12
to
I'm using FASM. Section 2.1.6 (page 28) seems to be the applicable
section. It lists both JMP AX and JMP PWORD [EBX] as valid. I think I
understand what you are saying, though: JMP PWORD [EBX] jumps to an
address stored in a variable pointed to by EBX (most likely a jump
table). By comparison, JMP AX jumps directly to the address held in
AX.

Tim Roberts

unread,
Dec 3, 2012, 11:44:38 PM12/3/12
to
Hugh Aguilar <hughag...@nospicedham.yahoo.com> wrote:
>
>I expected there to be code at that address --- because I assembled
>some code there!
>
>Consider this:
>
> lea rbx, [destination]
> jmp [rbx]

These two instructions are different:
jmp rbx
jmp [rbx]

The first jumps to the address in rbx. That's what you expected. The
second jumps to the address contained in memory at rbx. That's what you
wrote.

So, these two sequences do the same thing:

lea rbx, [destination]
jmp rbx

; ...

.data
gohere qword destination
.code
lea rbx, [gohere]
jmp [rbx]


>This will presumably execute the code at destination, starting with
>the mov instruction. What exactly is in rbx, if not an absolute
>address?

Sure, it's an address, but the question is whether it is the address of the
next instruction, or the address of a memory variable that CONTAINS the
address of the next instruction.

Hugh Aguilar

unread,
Dec 5, 2012, 9:26:11 PM12/5/12
to
I'll call it a "page" --- a 4KB block aligned on a 4KB boundary ---
I'm not referring to virtual memory though.

I got the idea of 4KB page from the Intel Optimization manual. I don't
recall now where in there I read that though. My understanding is that
there are 8 4KB caches active simultaneously. If the program accesses
data somewhere outside of these 8 pages, one of these caches will have
to be reloaded with the 4KB page that this datum is in. For this
reason, it is best to keep related data together in a 4KB
page, so that any memory access will hopefully be in a 4KB page that
is already active.

Also, however, there is the problem of "aliasing." This seems to
happen when memory accesses are exactly 4KB apart from each other. It
can also happen when data is exactly 64KB apart from each other. This
whole aliasing problem is very confusing to me.

Also, the code cache is 32KB in size. It is one big cache, not 8 4KB
caches. So I want to put my entire VM inside of a single 32KB aligned
memory-block.

This is all very vague in my mind --- those Intel manuals are
difficult to understand!

Dick Wesseling

unread,
Dec 6, 2012, 1:33:25 AM12/6/12
to
In article <899752b9-531d-4582...@qi8g2000pbb.googlegroups.com>,
Hugh Aguilar <hughag...@nospicedham.yahoo.com> writes:
> On Dec 3, 12:17 am, Philip Lantz <p...@nospicedham.canterey.us> wrote:
>> Hugh Aguilar wrote:
>
> I got the idea of 4KB page from the Intel Optimization manual. I don't
> recall now where in there I read that though. My understanding is that
> there are 8 4KB caches active simultaneously. If the program accesses
> data somewhere outside of these 8 pages, one of these caches will have
> to be reloaded with the 4KB page that this datum is in.

There are several kinds of caches:

1) data cache
2) instruction cache
3) translation lookaside buffer (TLB)
4) et cetera

1) and 2) cache data in chunks called "cache lines". The size of
a cache line varies: 32 bytes on older CPUs, 64 bytes on most current ones.

3) is related to paging. However, it does not cache data, but
rather the physical _addresses_ of logical pages in your program. The
size of a physical address is typically 4 or 6 (rounded up to 8) bytes.

A TLB miss is expensive because the CPU must access several levels
of page tables when computing a physical address. However, that
does not mean that it has to load 4K from memory. For the 2 level
page tables used in x86 it has to load 2 cache lines at most. x86-64
may load 4 cache lines. (Of course the page tables entries themselves
may be cached somewhere).

> For this
> reason, it is best to keep related data together together in a 4KB
> page, so that any memory access will hopefully be in a 4KB page that
> is already active.

That is a good idea, but the penalty of not doing so is not as bad
as you seem to think.

> Also, however, there is the problem of "aliasing." This seems to
> happen when memory accesses are exactly 4KB apart from each other. It
> can also happen when data is exactly 64KB apart from each other. This
> whole aliasing problem is very confusing to me.

A typical data/instruction cache is n-way associative where "n" depends
on the CPU model. A cache splits an address in three fields:

+---------------+---------------+---------+
| tag | cacheslot | ignore |
+---------------+---------------+---------+
3 2 1

1) Discard the lowest bits, depending on the cache line size, e.g.
if cache lines are 32 bytes the lowest 5 bits are ignored.

2) The next x bits (x depends on the number of slots in the cache)
address a location in the cache.

3) The remaining bits are stored along with the cache line. If these
don't match then the cache slot contains something else.

This describes a 1-way cache. For higher values of n there is more
than 1 slot for each value of field 2. The CPU searches all n slots
for a matching tag before replacing a cache line.
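
As a worked example of that field split, take a 32KB, 8-way cache
with 64-byte lines: 32768 / (8 * 64) = 64 sets, so 6 offset bits are
ignored and the next 6 bits select the set. A sketch of the arithmetic
(purely illustrative --- the CPU does this in hardware):

        mov     rax, rdi       ; rdi = some data address
        shr     rax, 6         ; field 1: drop 6 offset bits (64-byte lines)
        and     rax, 63        ; field 2: 6 index bits pick one of 64 sets
                               ; field 3 (the tag) is everything above those
                               ; bits, compared within the chosen set

With these numbers one way spans 64 * 64 = 4K, so addresses exactly
4K apart land in the same set --- which is where the aliasing
contention mentioned earlier comes from.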

I vaguely recall that older Intel CPUs used a suboptimal algorithm
for cache replacement, leading to the problem with addresses exactly
64K apart that you mention, but to my knowledge that is no longer
a problem.

> Also, the code cache is 32KB in size. It is one big cache,

No, it is either 1024 cachelines of 32 bytes each or 512 cachelines
of 64 bytes each.

> This is all very vague in my mind --- those Intel manuals are
> difficult to understand!

Difficult to understand? Perhaps. Fun to read? Definitely!

Philip Lantz

unread,
Dec 6, 2012, 4:33:23 AM12/6/12
to
Hugh Aguilar wrote:
> My understanding is that
> there are 8 4KB caches active simultaneously. If the program accesses
> data somewhere outside of these 8 pages, one of these caches will have
> to be reloaded with the 4KB page that this datum is in.

No, this is definitely not the way the cache works. It stores individual
lines of 64 bytes each, which can come from anywhere in memory. (The
cache-line size depends on the processor; current processors are all 64
bytes.) I can only guess that your misunderstanding came from a poor
description of an 8-way set associative cache.