X64 General Purpose Registers

Beverly Zielonko

Aug 5, 2024, 8:15:18 AM

to conlylittmen

I don't think any instruction can do this directly (or in the other direction), since it would need five operands. However, the code above seems silly to me. Is there a better way to do it? The only alternative I can think of is pinsrd and related instructions, but that does not seem any better.

The motivation is that sometimes it is faster to do some things in AVX2 and others in general-purpose registers. For example, say a small piece of code works on four 64-bit unsigned integers and needs four xor operations and two mulx (from BMI2). It would be faster to do the xor with vpxor, but mulx does not have an AVX2 equivalent. Any performance gain of vpxor over four xor is lost to packing and unpacking.

Spending a lot of instructions/uops on moving data between integer and vector regs is usually a bad idea. PMULUDQ does give you the equivalent of a 32-bit mulx, but you're right that 64-bit multiplies aren't available directly in AVX2. (AVX512 has them).

You can do a 64-bit vector multiply using the usual extended-precision techniques with PMULUDQ. My answer on Fastest way to multiply an array of int64_t? found that vectorizing 64 x 64 => 64b multiplies was worth it with AVX2 256b vectors, but not with 128b vectors. But that was with data in memory, not with data starting and ending in vector regs.
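To make the extended-precision idea concrete, here is a scalar C sketch of what each 64-bit lane has to compute when only a 32 x 32 => 64 multiply (the operation PMULUDQ performs on the low halves of each lane) is available. The function names are made up for illustration, and this is plain C rather than intrinsics, so it only models the arithmetic, not the vector code from the linked answer.

```c
#include <stdint.h>

/* Models one PMULUDQ lane: multiply the low 32 bits of each
   64-bit element, producing a full 64-bit product. */
static uint64_t mul32x32_64(uint64_t a, uint64_t b)
{
    return (a & 0xFFFFFFFFu) * (b & 0xFFFFFFFFu);
}

/* 64 x 64 => low 64 bits, built from three 32 x 32 => 64 products.
   The a_hi * b_hi term only affects bits 64 and up, so it is dropped. */
uint64_t mul64_lo_from_32bit_parts(uint64_t a, uint64_t b)
{
    uint64_t lo_lo = mul32x32_64(a, b);        /* a_lo * b_lo */
    uint64_t lo_hi = mul32x32_64(a, b >> 32);  /* a_lo * b_hi */
    uint64_t hi_lo = mul32x32_64(a >> 32, b);  /* a_hi * b_lo */

    /* Cross terms are shifted up by 32; wraparound is intended. */
    return lo_lo + ((lo_hi + hi_lo) << 32);
}
```

That is three multiplies plus shifts and adds per lane, which is part of why the vectorized version only pays off with wide vectors.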

Integer XOR is extremely cheap, with excellent ILP (latency=1, throughput = 4 per clock). It's definitely not worth moving your data into vector regs just to XOR it, if you don't have anything else vector-friendly to do there. See the x86 tag wiki for performance links.

Also worth considering: whatever you were doing to produce r8..r11, do it with vector-integer instructions so your data is already in XMM regs. Then you still need to shuffle them together, though, with 2x PUNPCKLQDQ and VINSERTI128.

My previous project was built on an STM32F051K4U7 microcontroller using the legacy Arm Compiler (v5.06).

For fast data operations I used general purpose registers r5 all the way up to r11 (7 in total). The project would build just fine (potentially at a cost of higher code size, but I am OK with that).

However, for a new project I decided to move to an STM32G0-series microcontroller and use Arm Compiler v6.20.1. The compiler appears to be unhappy if I use more than 4 general purpose registers (r8, r9, r10, r11).

It throws a warning if I try to use r7 as follows:

I understand that the frame pointer is stored in register R11 for A32 code and register R7 for T32 code, but can/should I use it or is there a major complication if I do so?

In addition, the default setting for Arm Compiler v6.2x is -fomit-frame-pointer as specified in Arm Compiler for Embedded. I enforced this through compiler command window, but no success - I get the same error as the above.

This may be a stupid question but... The arduino page states "1K of RAM", while the datasheet states 1KB of ram with "32 general purpose working registers". Does this mean that the rest of the 1K is reserved, and I actually have only 32 bytes to work with?

The 32 general purpose registers are used by the compiler, typically to manipulate values in computations. You get to use however much of the 1K RAM is not used by the runtime code to store your data. The actual amount free for your sketch varies depending on which libraries you use, but it's somewhat less than 900 bytes. Still, it's surprising how much can be done with that little memory.

Most of the instructions operating on the register file have direct access to all registers and most of them are single-cycle instructions. Each register is also assigned a data memory address, mapping them directly into the first 32 locations of the user data space. Although not being physically implemented as SRAM locations, this memory organization provides great flexibility in access to the registers, as the X-, Y-, and Z-pointer registers can be set to index any register in the file.

Registers R26 through R31 have some added functions to their general-purpose usage. These registers are 16-bit address pointers for indirect addressing of the data space. The three indirect address registers (X, Y, and Z) are defined as described in the accompanying figure.
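As a rough model of how two 8-bit registers form one 16-bit pointer, the C sketch below treats the register file as a 32-byte array. The figure is not reproduced here, so the layout is an assumption based on the usual AVR convention: X is R27:R26, Y is R29:R28 and Z is R31:R30, with the lower-numbered register holding the low byte. The type and function names are invented for this example.

```c
#include <stdint.h>

/* Model of the 32 8-bit working registers R0..R31. */
typedef struct {
    uint8_t r[32];
} AvrRegisterFile;

/* Index of the low byte of each pointer pair (assumed layout). */
enum { AVR_X = 26, AVR_Y = 28, AVR_Z = 30 };

/* Combine the low and high byte registers into a 16-bit address. */
uint16_t avr_read_pointer(const AvrRegisterFile *rf, int low_index)
{
    return (uint16_t)(rf->r[low_index] | (rf->r[low_index + 1] << 8));
}

/* Split a 16-bit address across the two registers of a pair. */
void avr_write_pointer(AvrRegisterFile *rf, int low_index, uint16_t addr)
{
    rf->r[low_index]     = (uint8_t)(addr & 0xFF);
    rf->r[low_index + 1] = (uint8_t)(addr >> 8);
}
```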

I believe that (b) requires the use of OS mode as well, because the processor may have more than one process executing at the same time. As both of them have access to the general purpose register, it is quite possible that one process may overwrite the contents of another process if switching takes place.

Input/output requires privileges, to access a peripheral or to communicate with the part of the system that manages files. So (c) does generally require going to kernel mode. It's possible for some printf calls to remain entirely within the calling process, for example if output is buffered and the output of this call is going entirely inside the buffer. But in general a printf call does need to do actual I/O and thus does need a transition to kernel mode.

Processes execute code as if the other processes didn't exist. When the kernel decides to switch to another process, it suspends the running process and unsuspends the process that it wants to run next. Part of this suspension mechanism is to save the values of the general-purpose registers into a dedicated memory area that belongs to the process that is being suspended. There is one such register store for each process. Part of the unsuspension mechanism is to restore the registers of the process that is being unsuspended. When a process is unsuspended, it keeps running where it left off, with the same values in registers as when it was suspended. This suspension/unsuspension mechanism is called a context switch.

A process never overwrites the registers used by another process because only the currently running process's register values are in the processor registers. The other processes' register values are in their register store. There has to be a context switch to change which process's register values are in the processor registers.
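The save/restore step can be sketched in a few lines of C. This is a toy model, not how any real kernel is written: the struct names, the 16-register file and the `context_switch` function are all assumptions made for illustration, and a real kernel also saves the program counter, flags, and more.

```c
#include <stdint.h>

#define NUM_GPRS 16

/* The one physical set of general-purpose registers in the CPU. */
typedef struct {
    uint64_t gpr[NUM_GPRS];
} RegisterFile;

/* Each process owns a private register store used while suspended. */
typedef struct {
    int pid;
    RegisterFile saved;
} Process;

/* Suspend `from` and resume `to`. */
void context_switch(RegisterFile *cpu, Process *from, Process *to)
{
    from->saved = *cpu;  /* save the outgoing process's registers */
    *cpu = to->saved;    /* restore the incoming process's registers */
}
```

Whatever the second process writes into the CPU registers while it runs, the first process's values sit untouched in its own store until it is switched back in.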

x86-64 (also known as just x64 and/or AMD64) is the 64-bit version of the x86/IA32 instruction set. Below is our overview of its features that are relevant to CS107. There is more extensive coverage on these topics in Chapter 3 of the B&O textbook. See also our x86-64 sheet for a compact reference.

The table below lists the commonly used registers (sixteen general-purpose plus two special). Each register is 64 bits wide; the lower 32-, 16- and 8-bit portions are selectable by a pseudo-register name. Some registers are designated for a certain purpose, such as %rsp being used as the stack pointer or %rax for the return value from a function. Other registers are all-purpose, but have a conventional use depending on whether they are caller-owned or callee-owned. If the function binky calls winky, we refer to binky as the caller and winky as the callee. For example, the registers used for the first 6 arguments and the return value are all callee-owned. The callee can freely use those registers, overwriting existing values without taking any precautions. If %rax holds a value the caller wants to retain, the caller must copy the value to a "safe" location before making a call. The callee-owned registers are ideal for scratch/temporary use by the callee. In contrast, if the callee intends to use a caller-owned register, it must first preserve its value and restore it before exiting the call. The caller-owned registers are used for local state of the caller that needs to be preserved across further function calls.

True to its CISC nature, x86-64 supports a variety of addressing modes. An addressing mode is an expression that calculates an address in memory to be read/written to. These expressions are used as the source or destination for a mov instruction and other instructions that access memory. The code below demonstrates how to write the immediate value 1 to various memory locations in an example of each of the available addressing modes:
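As a C-level illustration (not the original assembly listing), the function below stores the value 1 through several kinds of pointer expression. The comments show the AT&T-syntax instruction a compiler might plausibly emit for each store, assuming the usual System V argument registers (p in %rdi, arr in %rsi, i in %rdx); actual compiler output will vary. The names `global_cell` and `store_ones` are invented for this sketch.

```c
/* A global with a fixed address, for the direct/absolute case. */
int global_cell;

/* Store 1 using expressions that map onto different addressing modes.
   Requires p to point to at least 3 ints and arr to at least i + 4. */
void store_ones(int *p, int *arr, long i)
{
    global_cell = 1;  /* direct:           movl $1, global_cell(%rip) */
    *p = 1;           /* indirect:         movl $1, (%rdi)            */
    p[2] = 1;         /* displacement:     movl $1, 8(%rdi)           */
    arr[i] = 1;       /* scaled index:     movl $1, (%rsi,%rdx,4)     */
    arr[i + 3] = 1;   /* disp + scaled:    movl $1, 12(%rsi,%rdx,4)   */
}
```

In the most general form, displacement(base, index, scale), the address is base + index * scale + displacement, where scale is 1, 2, 4, or 8.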

A note about instruction suffixes: many instructions have a suffix (b, w, l, or q) which indicates the bitwidth of the operation (1, 2, 4, or 8 bytes, respectively). The suffix is often elided when the bitwidth can be determined from the operands. For example, if the destination register is %eax, it must be 4 bytes, if %ax it must be 2 bytes, and %al would be 1 byte. A few instructions such as movs and movz have two suffixes: the first is for the source operand, the second for the destination. For example, movzbl moves a 1-byte source value to a 4-byte destination.
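The two-suffix forms correspond to ordinary widening conversions in C. The sketch below shows three such conversions, with a comment naming the instruction a compiler would typically pick for each on x86-64; the function names are made up for this example and the exact code generated depends on the compiler and flags.

```c
#include <stdint.h>

/* Unsigned 1-byte to 4-byte: upper bits filled with zeros (movzbl). */
uint32_t zero_extend_byte(uint8_t b)
{
    return b;
}

/* Signed 1-byte to 4-byte: upper bits copy the sign bit (movsbl). */
int32_t sign_extend_byte(int8_t b)
{
    return b;
}

/* Signed 4-byte to 8-byte: upper bits copy the sign bit (movslq). */
int64_t sign_extend_long(int32_t v)
{
    return v;
}
```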

When the destination is a sub-register, only the bytes belonging to that sub-register are written, with one broad exception: a 32-bit instruction zeroes the high-order 32 bits of the destination register.
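The rule can be modeled in C by treating a register as a `uint64_t` and returning its new value after a write of each width. These helper names are hypothetical; the point is only to show which bits survive.

```c
#include <stdint.h>

/* 8-bit write (e.g. to %al): only the low byte changes. */
uint64_t write_low8(uint64_t reg, uint8_t v)
{
    return (reg & ~0xFFULL) | v;
}

/* 16-bit write (e.g. to %ax): only the low two bytes change. */
uint64_t write_low16(uint64_t reg, uint16_t v)
{
    return (reg & ~0xFFFFULL) | v;
}

/* 32-bit write (e.g. to %eax): the upper 32 bits are zeroed,
   so nothing of the old register value survives. */
uint64_t write_low32(uint64_t reg, uint32_t v)
{
    (void)reg;  /* old contents are irrelevant */
    return (uint64_t)v;
}
```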

By far the most frequent instruction you'll encounter is mov in one of its multi-faceted variants. Mov copies a value from source to destination. The source can be an immediate value, a register, or a memory location (expressed using one of the addressing mode expressions from above). The destination is either a register or a memory location. At most one of source or destination can be memory. The mov suffix (b, w, l, or q) indicates how many bytes are being copied (1, 2, 4, or 8 respectively). For the lea (load effective address) instruction, the source operand is a memory location (using an addressing mode from above) and it copies the calculated source address to the destination. Note that lea does not dereference the source address; it simply calculates its location. This means lea is nothing more than an arithmetic operation, and it is commonly used to calculate the value of simple linear combinations that have nothing to do with memory locations!
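For instance, small linear expressions like the ones in this C sketch are the kind a compiler will often turn into a single lea. The instructions in the comments are what one might expect with x in %rdi and y in %rsi, not guaranteed output, and the function names are invented for this example.

```c
/* x + 4*y + 7 fits the form disp(base, index, scale):
   possibly  lea 7(%rdi,%rsi,4), %rax  */
long scale_and_offset(long x, long y)
{
    return x + 4 * y + 7;
}

/* 5*x written as x + 4*x, using the same register as base and index:
   possibly  lea (%rdi,%rdi,4), %rax  */
long times_five(long x)
{
    return x + 4 * x;
}
```

No memory is read in either case; the addressing-mode hardware is used purely as an adder and shifter.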