Standardize the posit environment


Phil M

Jan 14, 2021, 2:40:13 AM
to Unum Computing
There are two ways to address memory: with one register, as in a CPU, or with many registers, as in a GPU. The first way covers sequential algorithms; the second covers parallel algorithms.

The second way makes it possible to reduce the size of the registers in all the parallel cores, of which there are many. The size of the posit will be limited by the size of these small registers. There is no problem here.

But what about the first way? There are a few cores, each with large registers. Software and operating-system distributions have buried 32 bits, so large registers are 64-bit at minimum. Should 64-bit processors support posit32 and other small formats, or should they limit themselves to posit64 only?

Posits have a wonderful property: converting a smaller posit to a larger posit is free (like uint32 to uint64, except that posits always sit in the upper part of the register). This allows you to store in memory, for example, the range [1, 4) of a 12-bit posit inside an 8-bit word. You only need two operations (an integer addition and a left shift) to convert this range back to a register-sized posit.
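To illustrate the "free" widening claim, here is a minimal sketch (my own, assuming both widths use the standard two's-complement posit encoding with the same es): widening only appends zero fraction bits, so it reduces to one logical left shift, for negative codes too.

```python
def widen_posit(bits, from_bits=8, to_bits=32):
    """Convert a posit code to a wider posit of the same es.

    Widening only appends zero fraction bits at the bottom, and because
    the encoding is two's complement over the whole word, a plain
    logical left shift is correct for negative codes as well.
    """
    return (bits << (to_bits - from_bits)) & ((1 << to_bits) - 1)
```

For example, the posit8 code of 1.0 (0x40) widens to the posit32 code 0x40000000 with a single 24-bit shift, and the code of -1.0 (0xC0) widens to 0xC0000000.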

I suggest making this part of the posit standard by making the posit size equal to the register size. As a result, a 64-bit processor will not be overloaded with small posits and their quires; it can use the freed space more wisely. Programmers, for their part, need to understand the posit environment they work in. And it just so happens that consumer devices and servers are 64-bit only, and these 64 bits will be with us for a long time.

Shredding posits is convenient, and a register-sized posit is the basis for it.

P.S. To convert a truncated posit to a register-sized posit: load the word into the lower part of the register as an unsigned integer; add to it the unsigned-integer code of the start of the untruncated posit's range (the first column of the Posit Lookup Table); then shift left by a constant equal to the difference between the register size and the size of the untruncated posit.
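A sketch of this P.S. in Python, using the earlier example of the [1, 4) range of a 12-bit posit: with es = 2, the 12-bit code of 1.0 is 0x400 and the 256 codes in [0x400, 0x500) cover [1, 4) exactly, so they fit in one byte. The constants are my own working, not from a published lookup table.

```python
# Register and posit sizes for this sketch; RANGE_BASE is the 12-bit
# (es = 2) code of 1.0, i.e. the start of the range [1, 4).
REG_BITS   = 64
POSIT_BITS = 12
RANGE_BASE = 0x400

def store_truncated(code12):
    # keep only the low byte; valid for 12-bit codes in [0x400, 0x500)
    assert RANGE_BASE <= code12 < RANGE_BASE + 0x100
    return (code12 - RANGE_BASE) & 0xFF

def load_truncated(byte):
    # integer addition of the range base, then a left shift that moves
    # the posit to the upper part of the register
    return (byte + RANGE_BASE) << (REG_BITS - POSIT_BITS)
```

The load path is exactly the two operations described: one integer add and one shift by a compile-time constant (here 64 - 12 = 52).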

Phil

theo

Jan 14, 2021, 5:50:36 AM
to Unum Computing
Hi Phil:

We came to a different conclusion with respect to the posit register file. 
1- The encode/decode of a posit is a very expensive operation, something you DON'T want to do as part of the arithmetic path.
2- The actual arithmetic of computing on posits follows the standard floating-point data path on a triple (sign, scale, significand).
3- Posits without quires do not fundamentally improve on catastrophic cancellation or the loss of associativity and distributivity, so the micro-architecture of a posit data path needs to leverage a quire asset to be a posit processing device.

We can deal with the first problem by making the decode/encode part of the load/store instructions, which removes a big chunk of time from the ALU critical path; we can then use a unified floating-point representation to support both types, creating an opportunity to reuse the same ALU for IEEE and posit operations. This implies that the register file does not actually store posits in such a design. Secondly, it is essential that the output of the ALU has a custom path to a quire asset, which is itself a fixed-point format and consumes fixed-point operands. This yields a micro-architecture in which no posit-encoded bits move through the pipeline at all; the bits that are stored and moved are triples or fixed-point values. It is only when you interact with memory that you reintroduce the posit format.
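To make the decode step concrete, here is an exploratory sketch of extracting the (sign, scale, significand) triple from a standard posit code (es = 2 assumed); it is a software illustration, not the hardware algorithm.

```python
def decode_posit(bits, nbits=8, es=2):
    """Decode an nbits-wide posit code into (sign, scale, significand).

    Sketch only: zero decodes to a zero significand, NaR raises, and
    decoding is exact so no rounding is involved.
    """
    mask = (1 << nbits) - 1
    if bits == 0:
        return (0, 0, 0.0)              # zero
    if bits == 1 << (nbits - 1):
        raise ValueError("NaR")         # Not a Real
    sign = bits >> (nbits - 1)
    if sign:
        bits = (-bits) & mask           # two's complement for negatives
    # regime: run of identical bits immediately after the sign bit
    rest = [(bits >> i) & 1 for i in range(nbits - 2, -1, -1)]
    run = 1
    while run < len(rest) and rest[run] == rest[0]:
        run += 1
    k = run - 1 if rest[0] == 1 else -run
    # skip the regime and its terminating bit, then take es exponent
    # bits; bits that fall off the end are implicitly zero
    tail = rest[run + 1:]
    exp_bits = tail[:es] + [0] * (es - len(tail[:es]))
    e = int("".join(map(str, exp_bits)), 2)
    frac_bits = tail[es:]
    frac = sum(b << (len(frac_bits) - 1 - i) for i, b in enumerate(frac_bits))
    scale = k * (1 << es) + e
    significand = 1 + (frac / (1 << len(frac_bits)) if frac_bits else 0)
    return (sign, scale, significand)
```

The decoded value is (-1)^sign * significand * 2^scale; e.g. the posit8 code 0x50 decodes to (0, 2, 1.0), i.e. 4.0.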

In summary, posits without quire integration do not yield significant benefit over linear floats. This quire integration dominates the micro-architecture's structure and resources and is thus the core area for optimization.

Phil

Jan 17, 2021, 2:06:00 AM
to Unum Computing
Theo,

Isolating posits from integers makes sense, and it devalues the idea of equalizing their sizes so that a combined register is filled completely.

Programming languages could provide a convenient way to shred posits into ranges; it could then become part of a programmer's everyday work. What do you think about this? Would a hardware implementation of it be simple enough?

Phil


Thursday, January 14, 2021 at 15:50:36 UTC+5, theo:

theo

Jan 17, 2021, 8:53:54 AM
to Unum Computing
Phil:

Very good point: if your code needs a lot of comparisons, having the posit encoding available would make that operation more energy-efficient.
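The reason the comparison is cheap: posit codes order like two's-complement integers, so a signed integer compare suffices, with no decode at all. A small sketch (the posit8, es = 2, code points in the comment are standard encodings):

```python
def posit_lt(a, b, nbits=8):
    """Compare two posit codes by reinterpreting them as signed
    two's-complement integers; the ordering of the codes matches the
    ordering of the real values they encode."""
    def signed(x):
        return x - (1 << nbits) if x >> (nbits - 1) else x
    return signed(a) < signed(b)

# standard posit8 (es = 2) codes:
#   0xC0 = -1.0, 0x30 = 0.25, 0x40 = 1.0, 0x50 = 4.0
```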

In the paper we published at SC'17, we make the argument that the quire (Kulisch super-accumulator) is the essence of restoring the associative and distributive properties of real-number arithmetic, which would benefit numerical reproducibility in concurrent environments. We highlight the observation that all high-performance codes use system-level schedules to soften the memory wall, which manifest themselves as divide-and-conquer schemes that compute sub-problems. That leads to the need for a quire register file in a stored-program machine, as the system-level schedules generate partial quire results that need to be managed. The number of partials, combined with the size of the quire state, makes an associative and distributive real-number stored-program machine expensive. But that is not due to the arithmetic; it is caused by the resource-constraint management of a stored-program machine. The system-level schedules in, for example, a data-flow machine for that same mathematical operation will have different constraints that fit associative and distributive real-number arithmetic much better.
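A toy illustration of the quire idea (an exact rational accumulator standing in for the Kulisch super-accumulator; `float_dot` and `quire_dot` are my illustrative names, not from the SC'17 paper): deferring rounding to a single final step makes the dot product independent of summation order, which is the associativity being restored.

```python
from fractions import Fraction

def float_dot(xs, ys):
    # conventional float accumulation: the result is rounded after
    # every step, so it depends on the summation order
    acc = 0.0
    for x, y in zip(xs, ys):
        acc += x * y
    return acc

def quire_dot(xs, ys):
    # quire-style accumulation: every IEEE double is a dyadic rational,
    # so products and sums are kept exact and rounding happens once,
    # at the very end -- the result is order-independent
    acc = Fraction(0)
    for x, y in zip(xs, ys):
        acc += Fraction(x) * Fraction(y)
    return float(acc)
```

With terms [1e16, 1.0, -1e16] and unit weights, float accumulation returns 0.0 or 1.0 depending on the order of the terms, while the exact accumulator returns 1.0 in every order.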

That was a long way to explain that the complexity of posit compute engines can be made more than competitive with IEEE floating point by looking at the system-level data flows and figuring out how to integrate the quire into the data flow. This is very much the same realization that the DL and in-memory-compute folks are rediscovering, but it has been standard knowledge in digital signal processing for half a century. GPU designs also make heavy use of custom managed data paths, hence their popularity in high-performance computing. And the Graphcore IPU is one of the best examples of this trend: a custom system architecture for DL algorithms. A rough calculation indicates that the Graphcore IPU would double its performance if it used posits.

There is significant upside in the signal- and sensor-processing world to ditching IEEE floating point in favor of robust and, at the same time, higher-performance posit arithmetic. The secret is to dissect the algorithm in terms of its dynamic-range and precision needs and then judiciously apply the user-defined rounding offered by the quire. In all cases where we have applied posits, they beat IEEE handily: in performance, performance per Watt, numerical accuracy, and reproducibility.

And to answer your last question directly: yes, the hardware implementation is simple enough.
