R15d Register

Inacayal Tanoesoedibjo

Aug 5, 2024, 3:28:08 AM
to rinmoriwind
x64 extends x86's 8 general-purpose registers to be 64-bit, and adds 8 new 64-bit registers. The 64-bit registers have names beginning with "r", so for example the 64-bit extension of eax is called rax. The new registers are named r8 through r15.

The lower 32 bits, 16 bits, and 8 bits of each register are directly addressable in operands. This includes registers, like esi, whose lower 8 bits were not previously addressable. The following table specifies the assembly-language names for the lower portions of 64-bit registers.


In addition, there are some extra general-purpose registers, r8 through r15, which can also be accessed as (for example) r8d, r8w and r8b (the lower 32-bit double-word, 16-bit word and 8-bit byte respectively). The b suffix is the original AMD nomenclature but you'll sometimes see it written as l (lower case L) for "low byte".


I tend to prefer the b suffix myself (even though the current low-byte registers are al, bl, and so on) since it matches the d/w = double/word names and l could potentially be mistaken for long. Or, worse, the digit 1, leading you to question what the heck register number 81 is :-)


The high bytes of the old 16-bit registers are still accessible, under many circumstances, as ah, bh, and so on (though this does not appear to be the case for the new r8 through r15 registers). There are some new instruction encodings, specifically those using the REX prefix, that cannot access those original high bytes, but others are still free to use them.


There are many more control registers which have various side effects and generally should not be written to unless you want those effects (and often require ring 0). These are summarized in "Volume 3 System Programming Guide - 2.1.6 System Registers", which is more for OS developers.



Operations that output to a 32-bit subregister are automatically zero-extended to the entire 64-bit register. Operations that output to 8-bit or 16-bit subregisters aren't zero-extended (this is x86-compatible behavior).
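As a rough illustration of that rule, here is a small C model of what a sub-register write does to the full 64-bit register. The helper name write_subreg is made up for this sketch, and it only models the low-byte forms (al, r8b, and so on), not ah/bh/ch/dh.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: returns the new 64-bit register contents after
   writing `value` to the low `width_bits` bits of a register holding `full`. */
static uint64_t write_subreg(uint64_t full, uint64_t value, unsigned width_bits)
{
    assert(width_bits == 8 || width_bits == 16 ||
           width_bits == 32 || width_bits == 64);
    switch (width_bits) {
    case 64:
        return value;
    case 32:
        /* 32-bit writes zero-extend: the upper 32 bits are cleared. */
        return value & 0xFFFFFFFFull;
    case 16:
        /* 16-bit writes leave bits 16..63 untouched. */
        return (full & ~0xFFFFull) | (value & 0xFFFFull);
    default:
        /* 8-bit (low byte) writes leave bits 8..63 untouched. */
        return (full & ~0xFFull) | (value & 0xFFull);
    }
}
```

So writing to r15d wipes the top half of r15, while writing to r15w or r15b keeps whatever was in the upper bits.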


The calling convention for C++ is similar. The this pointer is passed as an implicit first parameter. The next three parameters are passed in remaining registers, while the rest are passed on the stack.


Instructions that refer to 64-bit registers are automatically performed with 64-bit precision. For example, mov rax, [rbx] moves 8 bytes beginning at rbx into rax.


A special form of the mov instruction has been added for 64-bit immediate constants or constant addresses. For all other instructions, immediate constants or constant addresses are still 32 bits.
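A quick sketch of what the 32-bit limit means in practice, assuming (as I understand it) that the 32-bit immediate is sign-extended to 64 bits by most instructions. The function names here are hypothetical, not from any assembler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if `value` survives being stored as a 32-bit immediate and
   sign-extended back to 64 bits. Otherwise the 64-bit mov form is needed. */
static bool fits_imm32(int64_t value)
{
    return value >= INT32_MIN && value <= INT32_MAX;
}

/* The 32 bits that would be placed in the instruction encoding. */
static uint32_t encode_imm32(int64_t value)
{
    assert(fits_imm32(value));
    return (uint32_t)value; /* conversion to unsigned keeps the low 32 bits */
}

/* What the CPU would reconstruct from those 32 bits under this assumption. */
static int64_t sign_extend_imm32(uint32_t encoded)
{
    int64_t v = (int64_t)encoded;
    return (encoded & 0x80000000u) ? v - 0x100000000LL : v;
}
```

For example, -1 fits (it is encoded as 0xFFFFFFFF and sign-extends back), but 0x80000000 as a positive 64-bit value does not.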


The 64-bit x86 register set consists of 16 general purpose registers, only 8 of which are available in 16-bit and 32-bit mode. The core eight 16-bit registers are AX, BX, CX, DX, SI, DI, BP, and SP. The least significant 8 bits of the first four of these registers are accessible via AL, BL, CL, and DL in all execution modes. In 64-bit mode, the least significant 8 bits of the other four of these registers are also accessible; these are named SIL, DIL, SPL, and BPL. The most significant 8 bits of the first four 16-bit registers are also available, although there are some restrictions on when they can be used in 64-bit mode; these are named AH, BH, CH, and DH.


The 80386 extended these registers to 32 bits while retaining all of the 16-bit and 8-bit names that were available in 16-bit mode. The new extended registers are denoted by adding an E prefix; thus the core eight 32-bit registers are named EAX, EBX, ECX, EDX, ESI, EDI, EBP, and ESP. The original 8-bit and 16-bit register names map into the least significant portion of the 32-bit registers.


64-bit long mode further extended these registers to 64 bits in size by adding an R prefix to the 16-bit name; thus the base eight 64-bit registers are named RAX, RBX, etc. Long mode also added eight extra registers named numerically r8 through r15. The least significant 32 bits of these registers are available via a d suffix (r8d through r15d), the least significant 16 bits via a w suffix (r8w through r15w), and the least significant 8 bits via a b suffix (r8b through r15b).
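The naming scheme is regular enough to write down as a small function. This is just a sketch: reg_name is a made-up helper, the index follows the hardware encoding order (0 = ax, 1 = cx, 2 = dx, 3 = bx, 4 = sp, 5 = bp, 6 = si, 7 = di, 8..15 = r8..r15), and the high-byte names (ah and friends) are left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the assembly name of general-purpose
   register `index` (0..15) at `width_bits` (8, 16, 32 or 64) into `out`. */
static void reg_name(unsigned index, unsigned width_bits,
                     char *out, size_t out_size)
{
    static const char *const legacy16[8] =
        { "ax", "cx", "dx", "bx", "sp", "bp", "si", "di" };
    static const char *const legacy8[8] =
        { "al", "cl", "dl", "bl", "spl", "bpl", "sil", "dil" };

    assert(index < 16);
    assert(width_bits == 8 || width_bits == 16 ||
           width_bits == 32 || width_bits == 64);

    if (index >= 8) {
        /* New registers: rN plus a d/w/b suffix for the narrower forms. */
        const char *suffix = width_bits == 64 ? ""
                           : width_bits == 32 ? "d"
                           : width_bits == 16 ? "w" : "b";
        snprintf(out, out_size, "r%u%s", index, suffix);
    } else if (width_bits == 8) {
        /* Low-byte names of the original eight registers are irregular. */
        snprintf(out, out_size, "%s", legacy8[index]);
    } else {
        /* 16-bit name, with an e or r prefix for 32 or 64 bits. */
        const char *prefix = width_bits == 64 ? "r"
                           : width_bits == 32 ? "e" : "";
        snprintf(out, out_size, "%s%s", prefix, legacy16[index]);
    }
}
```

Calling reg_name(15, 32, buf, sizeof buf) gives "r15d", the low 32 bits of r15.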


The AMD64 architecture allows software to define up to 15 external interrupt-priority classes. Priority classes are numbered from 1 to 15, with priority-class 1 being the lowest and priority-class 15 the highest. CR8 uses the four low-order bits for specifying a task priority and the remaining 60 bits are reserved and must be written with zeros.


System software can use the TPR register to temporarily block low-priority interrupts from interrupting a high-priority task. This is accomplished by loading TPR with a value corresponding to the highest-priority interrupt that is to be blocked. For example, loading TPR with a value of 9 (1001b) blocks all interrupts with a priority class of 9 or less, while allowing all interrupts with a priority class of 10 or more to be recognized. Loading TPR with 0 enables all external interrupts. Loading TPR with 15 (1111b) disables all external interrupts.
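The blocking rule boils down to one comparison. A minimal sketch, with a made-up function name, that takes a raw CR8 value and an interrupt's priority class:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if an external interrupt of the given
   priority class (1..15) is recognized with this CR8 value loaded. */
static bool interrupt_recognized(uint64_t cr8, unsigned priority_class)
{
    assert(priority_class >= 1 && priority_class <= 15);
    /* Only the four low-order bits of CR8 hold the task priority. */
    unsigned tpr = (unsigned)(cr8 & 0xF);
    /* Classes at or below the TPR value are blocked. */
    return priority_class > tpr;
}
```

With TPR = 9 this returns false for classes 1 through 9 and true for 10 through 15, matching the example above.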


The Extended Feature Enable Register (EFER) is a model-specific register added in the AMD K6 processor to allow enabling the SYSCALL/SYSRET instructions, and later used for entering and exiting long mode. This register became architectural in AMD64 and has been adopted by Intel. Its MSR number is 0xC0000080.


MSRs with the addresses 0xC0000100 (for FS) and 0xC0000101 (for GS) contain the base addresses of the FS and GS segment registers. These are commonly used for thread pointers in user code and CPU-local pointers in kernel code. They are safe to contain anything, since use of a segment does not confer additional privileges to user code.


A local breakpoint bit deactivates on hardware task switches, while a global does not.

00b condition means execution break, 01b means a write watchpoint, and 11b means an R/W watchpoint. 10b is reserved for I/O R/W (unsupported).
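I'm assuming the two lines above describe the debug control register DR7. Under that assumption, and going by the usual layout as I remember it (L/G enable bits for breakpoint n at bits 2n and 2n+1, the 2-bit condition field at bits 16+4n), decoding could look something like this. All names are invented for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Condition codes as listed above. */
typedef enum {
    BP_EXECUTE    = 0, /* 00b: break on execution */
    BP_WRITE      = 1, /* 01b: write watchpoint */
    BP_IO         = 2, /* 10b: reserved for I/O R/W */
    BP_READ_WRITE = 3  /* 11b: R/W watchpoint */
} bp_condition;

/* Assumed layout: local enable for breakpoint n is bit 2n. */
static bool dr7_local_enabled(uint64_t dr7, unsigned n)
{
    assert(n < 4);
    return ((dr7 >> (2 * n)) & 1u) != 0;
}

/* Assumed layout: global enable for breakpoint n is bit 2n+1. */
static bool dr7_global_enabled(uint64_t dr7, unsigned n)
{
    assert(n < 4);
    return ((dr7 >> (2 * n + 1)) & 1u) != 0;
}

/* Assumed layout: condition field for breakpoint n is bits 16+4n..17+4n. */
static bp_condition dr7_condition(uint64_t dr7, unsigned n)
{
    assert(n < 4);
    return (bp_condition)((dr7 >> (16 + 4 * n)) & 3u);
}
```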


The problem is that between the store and the load the value hasn't been retired / placed in the cache. One would expect store-to-load forwarding to kick in, but on x86 that doesn't happen because x86 requires the store to be of equal or greater size than the load. So instead the load takes the slow path, causing unacceptable slowdowns.
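To make the size condition concrete, here is a deliberately simplified model: forwarding is treated as possible only when every byte the load reads was written by the preceding store. Real hardware has more rules than this (alignment, partial overlaps, and so on), and can_forward is a name made up for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model, not a description of any specific microarchitecture:
   offsets and sizes are in bytes, relative to the same base address. */
static bool can_forward(unsigned store_off, unsigned store_size,
                        unsigned load_off, unsigned load_size)
{
    assert(store_size > 0 && load_size > 0);
    /* The load must lie entirely inside the bytes the store wrote. */
    return load_off >= store_off &&
           load_off + load_size <= store_off + store_size;
}
```

A 1-byte store followed by a 2-byte load of the same address fails the check, which is the narrow-store / wide-load pattern that hits the slow path.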


GCC gets around this by using the smallest load for a bitfield. It seems to use a byte for everything, at least in our examples. From the comments, this is intentional, because according to the comments (which are never wrong) C++0x doesn't allow one to touch bits outside of the bitfield. (I'm not a language lawyer, but take this to mean that gcc is trying to minimize which bits it's accessing by using byte stores and loads whenever possible.)


By default, clang emits all bitfield load/store operations using the width of the entire sequence of bitfield members. If you look at the LLVM IR for your testcase, all the bitfield operations are i16. (For thread safety, the C/C++ standards treat a sequence of bitfield members as a single "field".)


If you look at the assembly, though, an "andb $-2, (%rdi)" slips in. This is specific to the x86 backend: it's narrowing the store to save a couple bytes in the encoding, and a potential decoding stall due to a 2-byte immediate. Maybe we shouldn't do that, or we should guard it with a better heuristic.
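To see why the narrowing is legal, here is a small sketch of the two ways of clearing bit 0 of a 16-bit field: the full-width load/and/store, and the byte-wide version that corresponds to the andb. The function names are invented, and the bytes are combined little-endian by hand so the example doesn't depend on the host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clear bit 0 with a 16-bit load, and, store (the i16 form). */
static void clear_bit0_wide(uint8_t mem[2])
{
    assert(mem != NULL);
    uint16_t v = (uint16_t)(mem[0] | (mem[1] << 8)); /* little-endian load */
    v &= (uint16_t)~1u;
    mem[0] = (uint8_t)(v & 0xFF);
    mem[1] = (uint8_t)(v >> 8);
}

/* Clear bit 0 by touching only the low byte (the narrowed form). */
static void clear_bit0_narrow(uint8_t mem[2])
{
    assert(mem != NULL);
    mem[0] &= 0xFE; /* same mask as -2 in an 8-bit and */
}
```

Both leave memory in the same state; the difference is only in how many bytes the store writes, which is what matters for a later 16-bit load.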


When writing my initial email, I forgot another option which Eli pointed out: don't shrink the store's size. That would be acceptable for our purposes. If it's something that needs further consideration, perhaps we could disable it via a flag (not an official "-m..." flag, but "-mllvm -disable-store-shortening" or whatever)?


I spent some time looking at this back in March when Bill mentioned it on IRC. I think I saw a write to one bit in one of the 8-bit pieces and then a read of that bit and a bit from the adjacent byte. So we used a narrow store and then a wider load due to the 2 bits.


I think you're referring to same_flow and free in the structure below. Those both have stores, as does most of the rest of the bitfield (it's an initialization, which seems like it could be done with a few bitwise operations on the whole bitfield, but I digress). But yeah, in the case that we have consecutive accesses of bitfields in adjacent bytes, then a bigger read & store are better.


At least in this test-case, the "bitfield" part of this seems to be a distraction. As Eli notes, Clang has lowered the function to LLVM IR containing consistent i16 operations. Despite that being a different choice from GCC, it should still be correct and consistent.


I suspect that this is more prevalent with bitfields as they're more likely to have the load / bitwise op / store operations done on them, resulting in an access type that can be shortened. But yes, it's not specific to just bitfields.


I'm more interested in consistency, to be honest. If the loads and stores for the bitfields (or other such shorten-able objects) were the same, then we wouldn't run into the store-to-load forwarding issue on x86 (I don't know about other platforms, but suspect that consistency wouldn't hurt). I liked Arthur's idea of accessing the object using the type size the bitfield was defined with (i8, i16, i256). It would help with improving the heuristic. The downside is that it could lead to suboptimal code, but that's the situation we have now, so...
