x87 is not used by GCC on x86_64. By default GCC emits scalar SSE single-precision (32-bit) and double-precision (64-bit) floating point, and SSE floating point lacks transcendentals.
The x87 transcendentals date back to the original design of the 8087 coprocessor from 1980; they are too inaccurate for scientific applications and too slow for graphics applications.
I was thinking about GPUs when I mentioned F_TD and D_TD as possible ISA extensions in a prior email.
Modern GPUs have RISC-like ISAs (though on Nvidia GPUs the ISA is not exposed directly to the compiler, because the PTX IR sits in between; the underlying ISA can change with each new chip family). These ISAs have transcendental instructions, executed by what is usually called an SFU (Special Function Unit). I don’t think this makes the GPU ISAs CISC.
In graphics applications, sin and cos approximations can be “good enough” for use in pixel shaders, i.e. any loss of accuracy shows up only as reduced visual quality. These approximations wouldn’t be used for scientific computing.
Transcendentals don’t necessarily need to be micro-coded; they could be implemented as multi-stage units like other FPU operations. I’d say these instructions fall into the category of domain-specific instructions. GPUs do all sorts of things in their ISAs while avoiding micro-code, e.g. special load/store instructions for local and shared memory versus what we would consider one set of normal load/store instructions that access global memory. That doesn’t make them CISC. Neither does having transcendentals.
Also, FPUs are not micro-coded; they are multi-stage pipelined units. It’s entirely possible to have a multi-stage pipelined functional unit that is not micro-coded, i.e. one hard-coded for a special function.
ROMs are 1T cells versus the 6T or 8T cells of SRAMs, and depending on their geometry, ROMs can be extremely fast to access. If you place a small ROM next to the special functional units, the polynomial approximation vectors can be fetched without going through the memory bus or disturbing L1/L2, so transcendental functions can be optimised without resorting to micro-code. Essentially you broadcast x*x to a vector and multiply it against the polynomial constants from ROM. There are many ways to do this, and some are quite accurate.
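A minimal sketch of that scheme in C, as a scalar model of the vector evaluation (my illustration; the constants are the truncated Taylor coefficients a real design would hold in ROM, and the input is assumed already range-reduced to [-π/4, π/4]):

```c
/* Hypothetical scalar model of the ROM-constant scheme: compute
 * t = x*x once (the "broadcast"), then run a Horner chain of
 * multiply-adds against the coefficient table.  Accurate to well
 * under 1 ulp of double on [-pi/4, pi/4] is not claimed; the
 * truncation error of this 9th-order series is ~2e-9 at pi/4. */
static double sin_poly(double x)
{
    double t = x * x;                    /* broadcast x*x          */
    double p = 2.7557319223985893e-6;    /*  1/9!  from "ROM"      */
    p = p * t - 1.9841269841269841e-4;   /* -1/7!                  */
    p = p * t + 8.3333333333333333e-3;   /*  1/5!                  */
    p = p * t - 1.6666666666666667e-1;   /* -1/3!                  */
    return x + x * t * p;                /* sin x ~ x + x*t*P(t)   */
}
```

In a vector unit each multiply-add above becomes one FMA against a broadcast lane of ROM constants, which is why keeping the table next to the functional unit matters.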
Loading the constants from RAM wouldn’t be appropriate, and permanently reserving registers for them wouldn’t be either, unless a special shader-kernel compiler can reserve registers for the polynomial approximation vector. The only way I can think of implementing these intrinsics efficiently without special function units is to have a way to load vector constants from a constant memory address space.
GPUs have constant memory with load instructions that don’t disturb the local/shared/global memory data paths. When you execute sin, cos, log2 or exp2 in a pixel shader being called for 8M pixels at 60Hz, you don’t want to constantly bump the data you are working on in and out of cache by loading constants via the single data path to memory; that interferes with the memory system’s ability to keep loads constrained to the array you are working on, and with optimisations like coalescing loads from multiple threads. A graphics-optimised domain-specific processor would likely just include fcos.[sd]/fsin.[sd]; an AI-optimised processor might exclude them. A set of optional instructions that can be added or removed for processors customised to application-specific domains seems like the perfect example of where you might have a profile that includes transcendentals. It’s interesting to note that tan and arctan are not included in the graphics ISAs; they probably don’t occur frequently enough in shader kernels.
Now that we have domain-specific processors, the lines between NPU/CPU/GPU/TPU/XPU are blurring. Pixel shaders call transcendental functions frequently enough that they merit their own instructions in domain-specific processors.
I think CISC micro-code is orthogonal to domain-specific instruction extensions implemented with multi-stage functional units, such as FPUs or even crypto. A divider requires a multi-cycle state machine, and divide wasn’t included on early ARM and Alpha. Does divide make a processor CISC, given the number of algorithms that exist to implement high-radix multiply and divide?
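To make the “multi-cycle state machine” concrete, here is a hypothetical C model of a radix-2 restoring divider: one quotient bit per cycle, no micro-code, just a counter, a shift, and a conditional subtract. A real high-radix unit retires several bits per cycle but sequences the same way.

```c
#include <stdint.h>

/* Radix-2 restoring division: 32 "cycles", one quotient bit each.
 * Models the hardware state machine; d must be non-zero. */
static uint32_t divide_restoring(uint32_t n, uint32_t d, uint32_t *rem)
{
    uint64_t r = 0;  /* partial remainder */
    uint32_t q = 0;  /* quotient under construction */
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1);  /* shift in next dividend bit */
        if (r >= d) {                   /* trial subtract succeeds    */
            r -= d;
            q |= 1u << i;
        }                               /* else: "restore" is a no-op */
    }
    *rem = (uint32_t)r;
    return q;
}
```

The loop body is exactly one state of the hardware sequencer, which is why divide latency on early in-order cores was data-independent and long.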
I think the principle is that one can implement simple unary and binary instructions that make up the verbs for one’s particular application domain. I think it would be possible to hard-code a sincos functional unit.
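A hypothetical sketch of what such a fused unit would compute: one squaring stage t = x·x feeds both the odd (sin) and even (cos) polynomial pipelines, and that shared stage is exactly what makes a hard-coded sincos attractive. Coefficients are truncated Taylor constants, input assumed range-reduced to [-π/4, π/4]; this is my illustration, not a real design.

```c
/* Fused sincos model: compute t = x*x once, then evaluate the odd
 * polynomial (sin) and the even polynomial (cos) against it.  In
 * hardware the two Horner chains run in parallel off one squarer. */
static void sincos_poly(double x, double *s, double *c)
{
    double t = x * x;                     /* shared squaring stage  */

    double ps = 2.7557319223985893e-6;    /*  1/9!                  */
    ps = ps * t - 1.9841269841269841e-4;  /* -1/7!                  */
    ps = ps * t + 8.3333333333333333e-3;  /*  1/5!                  */
    ps = ps * t - 1.6666666666666667e-1;  /* -1/3!                  */
    *s = x + x * t * ps;                  /* sin x ~ x + x*t*Ps(t)  */

    double pc = 2.4801587301587302e-5;    /*  1/8!                  */
    pc = pc * t - 1.3888888888888889e-3;  /* -1/6!                  */
    pc = pc * t + 4.1666666666666667e-2;  /*  1/4!                  */
    pc = pc * t - 5.0e-1;                 /* -1/2!                  */
    *c = 1.0 + t * pc;                    /* cos x ~ 1 + t*Pc(t)    */
}
```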