Pdp-40 Assembly

0 views
Skip to first unread message

Gibert Chisholm

unread,
Aug 5, 2024, 12:02:23 AM8/5/24
to thaytecompclos
Inrecent years, it has been widely accepted by the applied cryptographycommunity. It has been standardized and incorporated in many softwareprojects and Internet protocols, including the Linux kernel, OpenBSD,OpenSSH, TLSv1.3, and Tor.

In ChaCha20, the basic unit of operation is the Chacha20 quarterround function. It takes 4 unsigned 32-bit integer a, b, c, das inputs and applies a series of transformations in 4 steps. Ineach step, 3-different 32-bit unsigned integers are added and XORedwith one another, then the result is rotated (circular shifted) to theleft.


Six 16-bit general purpose registers are available, allowingus to hold three 32-bit integers in registers. Eachstep in the ChaCha20 quarter round happens to operate on three32-bit integers, we found the number of registers is justenough - but we do have to load and store one register per step.ChaCha20 is really designed for machines with four 32-bit registersor more, not three.


No. There are many legends about the infamous PDP-endian in the computerfolklore. In PDP-endian, a 32-bit integer is stored as two 16-bit integersin big-endian, but the 32-bit intergers themselves are stored in little-endianin memory.


Because we have 6 registers to hold 3 integers, but ChaCha20 operates on 4.It forces us to store one integer back to RAM in each step within thequarter round. Storing it prior to the XOR is the best time - it frees uptwo spare registers, r0 and r1, as our scratchpad registers.


ChaCha20 requires a 16-bit, 12-bit, 8-bit and 7-bit rotation in eachquarter round. This is the most painful operation on the PDP-11. For abasic PDP-11/40 machine, only 1 bit can be shifted at a time. Writing thispart takes a lot of head scratching. I found some encouragement in theJargon File.


To rotate an integer, We have to do a left shift on the lower16-bit word, a rotation (shift with carry) at the higher 16-bit,add the overflowed carry bit back to the lower 16-bit. And we needto do it for 16 times!


It takes 2.75 s (+ 0.3 s) to adjust the MSB per shift. If no adjustment isneeded, it takes only 1.40 s (+ 0.15 s). But it makes the execution of thecipher depend on the secret internal state, which is derived from our secretkey - it opens up an entire category of side-channel attacks, a big no incryptography.


To do this, first, we copy the lower word to a scratchpad register (to avoiddestroying the original word) and shift it by 1 bit. After this shift, the carryflag is set according to the rightmost bit. Later, we rotate the high word -now rightmost bit is transferred to the leftmost bit. Finally, rotate the lowword.


Curiously, shifting to the right has an additional 0.25 s performance penalty,my uneducated guess is that the hardware prefers leftshifting, when shiftingto the right, the microcode has to execute one more instruction or so(yes, almost all PDP-11 models are microcoded).


On PDP-11, almost all 16-bit instructions have a 8-bit version, suffixed withb (e.g. MOV and MOVB). This allows ASCII strings and byte arrays (orshall I say octet arrays?) to be handled directly and is regarded as a majorstrength of the platform, unlike 18-bit or 36-bit computers of the same erawhich were best suited for 6-bit characters.


One solution is reloading the same integer from RAM, but accessing memory is veryexpensive and must be avoided when possible. Thus, the only remaining option issaving it to a scratchpad register. At this point, we are running out of regularregisters, forcing us to seize the stack pointer, r6.


When we use a timesharing operating system (like UNIX v6) on a PDP-11/40,the optional Memory Management Unit hardware would be purchased andinstalled on the PDP-11. The MMU provides two independent stack pointersfor the kernel and userspace and switches them automatically. No need toworry, just be sure to save and restore r6 before and after the entireencryption routine is finished.


In the first attempt, reloading the data from RAM was ruled out due toperformance concerns. But now we have realized, memory accesses per se isnot the problem - saving and restoring the stack pointer and interruptsitself requires 4 memory accesses. The question is whether we can keep thenumber of memory accesses no more than 4, without relying on the stack pointer.


Again, the PDP-11, as a quintessential and highly orthogonal CISC architecture, is notlimited to load-and-store. Register-memory, and even memory-memory operations aresupported by almost all instructions on all registers. It allows programmers to writereally dense and tight code in assembly - which made perfect sense before highlypipelined processors.


For convenience, the initial backup is shown and calculated as part of the mixingprocess. After the mixing step, the output of ChaCha20 becomes highly nonlinear andirreversible, secure for cryptographic purposes.


Until now, the priority was to optimize for speed at all cost. But in order tomake our encryption routine be practical, program size must be considered.Our quarter round implementation takes 198 bytes. In full ChaCha20, thereare 80 quarter rounds in total, which eats 15.47 KiB of RAM, this is clearlyunacceptable.


The author advocated an objective and fact-based analysis,and claimed that based on the current cryptanalysis, ChaCha8 is secure, andanything higher is only for show without practical significance (i.e.security theater).


ChaCha20 is known for its inherent resistence to side-channel attacks due tothe careful avoidance of secret-dependent branching or memory accesses -it naturally runs in constant time. Still, constant time does not implyconstant electromagnetic radiation!


Turbocharged with Schottky TTL circuits and semiconductor memory (unlike thePDP-11/40 that only uses core memory), it allows a 3x speedup. In fact, a PDP-11/45class machine became the standard for UNIX v7 and later, the underpowered PDP-11/40was struggling to run them.


On PDP-11/40, the execution time of an instruction is calculated byadding a source time, a destination time, and an execution and fetchtime together. They are given in microseconds. All the threetimes can be different depending on the addressing mode (see Appendix3) and the installed memory option of your paticular machine.


The MOV instruction is special - only t_src and t_ef are needed, sowe assume t_dest is always 0. After looking at the manual, I found thata MOV instruction from memory to register takes 1.74 s of source timeand 1.46 s of EF time.


EIS instructions are also special, the timing is calculated just likeMOV, only t_src and t_ef are used, but they are given in a differenttable. Also, their execution time depends on the bitshift distance, shiftingmore bits causes a longer EF time.


They look like two very dissimilar operations: One dereferences a hardcodedconstant pointer, another dereferences a pointer to a list of pointers, thenauto-increments the register. Why on Earth are they the same operation?


When an instruction that dereferences a constant pointer is executed, theprogram pointer is pointed to the beginning of the instruction itself, atwhere our hardcoded address is located! If we dereferencing PC twice (firstto get the hardcoded memory address itself, then to dereference that address),we can get the content at memory address XXXX and load to the register,finally, we increments PC, so we start running the next instruction.


Similarly, both PC-relative addressing mode (for Position-Independent Code)and index mode (base + offset, for obtaining elements from an array) are thesame mode: mode 6. PC-relative mode is simply the index mode using PC as thebase address.


I took a short-term assignment to design a typesetting system for a friend ofmine who had thought that typesetting was the new future. And so we put atypesetting system on a PDP-11 and started doing the local town Cave Creeknewspapers and everything. So the PDP-11 was an extremely good eye-opener for me.Because it showed you that memory registers and memory addressing was important.


Although the architectures of 8-bit processors are quite different from the PDP-11(the PDP-11 has a highly orthogonal ISA with many general-purpose registers, the8-bit microprocessors mainly use accumulator-based designs with a lot more limitations,VLSI was still in its infancy), but at least we can see an indirect influence.


The middle endian format of the PDP11 comes out of the necessity to emulate 32 bit operations on a 16 bit machine. This is done by first storing the high word, then the low word in memory, even though the PDP11 uses little endian for its data. This causes the weird endianess. Though, in practice there isn't really a performance difference. Left shifting an integer stored in memory by one place is still done in three instructions:


Data in registers on any kind of machine always behaves as binary with the MSB on the left (like you'd write it in a place-value representation like 0b10110101). A left shift always multiplies by a power of two, and a right shift is always divides by a power of two, regardless of the machine being big, little, or pdp-endian.


Think of endianness as being applied in the load-store unit of the CPU when executing a load or store wider than one byte. The registers and execution units part of the core just sees binary integers of register width.


It's possible to write "endian agnostic" code to serialize integers to/from a stream of bytes without any dependency on the machine's native endianness. e.g. to extract 4 bytes into an int, in this case using little-endian order for the byte stream:


Middle-Endian or PDP-Endian systems save the most significant word first, with each word having the least significant byte first. For developers of new software, it is not only perfectly reasonable, but strongly recommended, to ignore this possiblity. I don't think there ever was a processor that stored 32-bit integer values to memory in a middle-endian format, though middle-endianness has occasionally appeared in things like packed-decimal formats, floating point formats, and obscure communications protocols (it's used for the length of TCP/IP packets in Visa's "Visa Base I" protocol).

3a8082e126
Reply all
Reply to author
Forward
0 new messages