Synthesis with floating point arithmetic fails to find blackbox


peter.t...@gmail.com

Feb 17, 2021, 7:15:14 PM
to Clash - Hardware Description Language
Clash 1.2.4. The only report I can see in the group of problems with float is from clash 0.6.x, for the vhdl backend. I am synthesising for SystemVerilog. I suppose the cure is to get the right black box primitives from somewhere and put them somewhere ..

KPU/Float.hs:15:1: error:
   
    Clash.Netlist.BlackBox(284): No blackbox found for: GHC.Prim.plusFloat#. Did you forget to include directories containing primitives? You can use '-i/my/prim/dir' to achieve this.


I believe I encountered this kind of problem once before, having built clash 0.99, or 1.0 or thereabouts back then, and it turned out that my compilation didn't install the (was it json?) files containing the primitives. Building 1.2.4 I presume I have somehow managed the same kind of omission. But it's so peculiar that only the floats are missing - everything else synthesises! - that I'm posting here first just in case it's a problem you recognize that should (also) be addressed at your end.

Please do let me know what to get from where to put to where. I do have the source  for 1.2.4. I'll only guess if not told, and that may make more mess!

Regards

Peter

peter.t...@gmail.com

Feb 18, 2021, 12:09:50 AM
to Clash - Hardware Description Language

This passage in the tutorial explains it (the prelude manual doesn't seem to mark any of the float stuff as unsynthesizable, though there are things it does mark - this is a different kind of unsynthesizable, closer to "we don't have a good option", I think).

Floating point types
  • There is no support for the Float and Double types, if you need numbers with a fractional part you can use the Fixed point type.

    As to why there is no support for these floating point types:

    1. In order to achieve reasonable operating frequencies, arithmetic circuits for floating point data types must be pipelined.
    2. Haskell's primitive arithmetic operators on floating point data types, such as plusFloat#

      plusFloat# :: Float# -> Float# -> Float#

      which underlie Float's Num instance, must be implemented as purely combinational circuits according to their type. Remember, sequential circuits operate on values of type "Signal a".

    Although it is possible to implement purely combinational (not pipelined) arithmetic circuits for floating point data types, the circuit would be unreasonably slow. And so, without synthesis possibilities for the basic arithmetic operations, there is no point in supporting the floating point data types.


Yadda. Actually all that is false, more or less, isn't it? While SOME floating point ops usually take several cycles, add and subtract take about 1 cycle nowadays even for clocks at a couple of GHz or more. Why would they not? All one has to do is shift left or right on the smaller order operand and then do an integer add or subtract. Compare exponents, shift, add, find first 1, reshift. Not much at all.

More or less the same for multiplication, except that you don't even have to shift.

The complications are to do with the NaN and infinities and a couple of zeros, and the special forms which don't fit the standard format ... all of which one can ignore entirely. Ditto all the silly overflow and error conditions. The first thing a compiler or operating system lead-in does is issue an instruction to turn all that off. At any rate the extra complexity can be rather trivial if one so chooses.

Bottom line: if you are willing to provide integer 64-bit division as a logic circuit, 32-bit float ops certainly should be OK. They don't even have 32 bits of precision, once one knocks off something for the exponent bits. About 22 bit, or thereabouts, no?

And never mind all that, surely the xilinx and intel fpga systems you are aiming at provide fpus, so you can map to them? Ditto for integer ops and alus.

(how may I trigger synthesis for xilinx, for example?)

Christiaan Baaij

Feb 18, 2021, 4:42:57 AM
to clash-l...@googlegroups.com
You're correct in that it's somewhat of a cop-out.
However, while it's true that you can set Xilinx Floating Point coreGen to have zero latency on the operators: https://www.xilinx.com/support/documentation/ip_documentation/floating_point/v7_1/pg060-floating-point.pdf#page=9
You definitely will not get high operating frequencies for zero-latency operators; as you can see at https://www.xilinx.com/support/documentation/ip_documentation/ru/floating-point.html the high operating frequencies can only be achieved with pipelining (all those designs use FFs).
To be fair though, I cannot quantify how much slower a zero latency operator is going to be; so perhaps qualifying it as "unreasonably" slow is somewhat hyperbole.

On the other hand, Clash can/must currently only generate vendor-independent code, meaning we have to manually write the operators in VHDL and Verilog.
This requires engineering effort for which until now there hasn't been much demand.
Generating vendor/part-targeting code, thus using vendor primitives, is going to require even more engineering effort.

If you could investigate and deliver vendor-independent zero-latency Floating point operators in either VHDL or Verilog, we're willing to transliterate that to the other HDL and include that in the Clash compiler and maintain it from then on.

Regards,
Christiaan


peter.t...@gmail.com

Feb 18, 2021, 5:03:20 PM
to Clash - Hardware Description Language
OK, well, I'll just do something quickly for multiplication in 32-bit floats, as that's simplest. The arithmetic representation for IEEE is (Bool,Int8,Word23), where the bool is the top/sign bit of the 32 bits, the Int8 is the next 8 bits, being the (signed) "exponent" (aka "scale factor" to some, meaning how many binary 0s to add on in the end), and the last 23 bits are the mantissa (the real info). What it means is sign * 1.mantissa * 2^exponent.

A more formal way of putting that is  sign * [(2^23 + mantissa) / 2^23] * 2^exponent.

For the purpose of being able to make examples and test them, here's how to break out sign, exponent and mantissa from a Haskell/Clash Float:

mydecodeFloat :: Float -> (Bool,Int8,Word23)
mydecodeFloat x = let x' = cast x :: UInt32
                  in ( x < 0 :: Bool , cast( truncateB (x' .>>. 23) ) :: Int8 , truncateB x' :: Word23 )


where for convenience I am using "cast" to map bit for bit from one type to another

cast = unpack . pack

and I hope the type aliases are obvious .. Word23 = Unsigned 23, Int8 = Signed 8, UInt32 = Unsigned 32, etc (in case I do that again - apologies in advance). I am writing ".>>." for shiftR. Please believe that I really am breaking out 1+8+23 bits, and if I'm not ... my mistake and apologies.
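Spelled out, so the snippets in this thread stand alone (these are my definitions, collected in one place; note this Int8 is Signed 8, not Data.Int's):

import Clash.Prelude

type UInt32 = Unsigned 32
type Word23 = Unsigned 23
type Word25 = Unsigned 25
type Word32 = Unsigned 32
type Word48 = Unsigned 48
type Int8   = Signed 8

-- map bit for bit between equal-width types
cast :: (BitPack a, BitPack b, BitSize a ~ BitSize b) => a -> b
cast = unpack . pack

-- shift aliases used throughout
(.>>.), (.<<.) :: Bits a => a -> Int -> a
(.>>.) = shiftR
(.<<.) = shiftL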

Again for purposes of actually being able to test and check anything, here's how to put those fields together to make a Haskell Float. The really, really annoying thing about IEEE is that a literal exponent of 0x7f (i.e. 127, aka maxBound in Int8) actually _means_ 0. The IEEE set the zero half way through the type, in some sort of misbegotten thought process that has no explanation. For that reason, I will always have to write exponent - 127 when I actually mean the arithmetic value, not the literal value.

There is also a really annoying thing about Haskell floats, as far as I can tell, in that their exponent numbers are 23 off from the "right" IEEE numbers, maybe because internally they represent numbers in opposite endianness, but I really do not care what or why either. Treat this as a magic formula:

myencodeFloat :: (Bool,Int8,Word23) -> Float
myencodeFloat (True,zExp,zFrac)  = - myencodeFloatPos(zExp,zFrac)
myencodeFloat (False,zExp,zFrac) =   myencodeFloatPos(zExp,zFrac)

myencodeFloatPos :: (Int8,Word23) -> Float
myencodeFloatPos (zExp,zFrac) = result
             where
             exponent :: Int
             exponent = cast (resize (zExp - 127 - 23))  -- literal 0..127 really means -127..0 !
             mantissa':: Word32
             mantissa'= if exponent > -127  then setBit (zeroExtend zFrac) 23 else zeroExtend zFrac
             result   :: Float
             result   = let e = max exponent (-126)
                        in scaleFloat e ((fromInteger . toInteger) mantissa')

I've used the Haskell scaleFloat to add the right number of zeros onto the basic 1.mantissa float that I wrote as mantissa'. That's what I mean about Haskell and 23. That "e" is 23 less than I naively thought it would be. I have no idea what goes on that makes that happen.

You/I can now make and break floats into/out of the arithmetic IEEE representation:

> mydecodeFloat 1.23
(False,127,1929380)

> myencodeFloat (False,127,1929380)
1.23
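As a hand check against the value formula above (computed in Double here, so the single-precision granularity shows):

> (2^23 + 1929380) / 2^23 * 2^(127-127)
1.2300000190734863

which is 1.23 to within the available 2^-23 resolution.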

Please test some more. I believe I have done right by "0" (which ought to end up as zeroBits in IEEE - a special treat in their design). I have no idea what I did with/about NaN and infinity and really, really do not care. I suspect I got it mostly right but have absolutely not tested. Feel free to use more/different magic.

Now that we can make and read the arithmetic IEEE representation, I'll do multiplication. That should be totally easy, since all I have to do is build 1.mantissaA and 1.mantissaB, multiply them as perfectly ordinary binary numbers, and get 1.mantissaC. Yes, I will need double the number of bits for C as for A and B, and I will throw away the bottom half to keep the same (lousy) precision. Being cleverer would be not calculating the lower bits at all, if it could be avoided, but can it? Probably not. I don't think anybody at all would care since we are only talking about 23 bits here. It just cannot be hard.

Oh yes. We also have to fix the sign and push on the right number of zeros in the end ...

--  1..23b.. * 1..23b.. = 1..46b.. , take top 23b
--  127 as literal exponent really means "0", and is the literal exponent expected from 1.23
mymulFloat :: (Bool,Int8,Word23) -> (Bool,Int8,Word23) -> (Bool,Int8,Word23)
mymulFloat (s1,e1,m1) (s2,e2,m2) = (s,e,m)   
    where s = s1 /= s2                        :: Bool
          e = (e1 - 127) + (e2 - 127) + 127   :: Int8
          m1' = setBit (zeroExtend m1) 23     :: Word48 -- calculate in 48b
          m2' = setBit (zeroExtend m2) 23     :: Word48 -- and drop last 23b
          m = truncateB ((m1' * m2') .>>. 23) :: Word23 -- taking subtop 23b


I did the calculation in 48 bits. Surely 46 would be enough? I only need 23 bits in the mantissa in the end. I have cunningly arranged for the leading 1 in 1.mantissaC to be lost via the final truncate to 23 bits ... I hope. So now the test:

> myencodeFloat(mymulFloat (mydecodeFloat 1.23) (mydecodeFloat 4.56))
5.6088
> 1.23 * 4.56
5.6088

Good enough for me!  Please try more (and fix as reqd!)

I didn't round the result, though, I truncated. Probably want to put in a "+ bit 22" to that final binary multiplication in 48 bits. That should be below the level of explicit observability when one drops the final 23 bits (of which bit 22 is the top one), but any carry from it will stick, which is just what we want in rounding.

I hope this illustrates the principle. I am not going to even try writing that in system verilog, as it would be far less clear (than it even is) and I hope the person who usually does the templates can take it and run. Please ask away for things that are unclear.

To do addition, one wants to work in one of 23 or 24 bits, according to whether a carry/overflow is expected or not. "Expected" would be when both addends are the same size, with the same exponents. Then we expect 1.mantissaA+1.mantissaB = 1m.antissaC and the extra bit of data has been needed for the m.antissaC. But if the addends are mismatched sizes, then that overflow would not usually be expected. One can't tell for sure so one will have to test and shift 0 or 1. .... but I'm not going to do it now. Multiplication was easier! No if/then/else required!

One could get clever and actually do addition via multiplication, using a few more bits. I mean that 1.0000aaaa * 1.0000bbbb = 1.0000(aaaa+bbbb)(aaaa*bbbb)   and the two calculations will miss each other in the result bits, and one can drag out the addition. But maybe that's too crazy for engineers ...
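For the curious, the trick does check out in plain Integer arithmetic (toy numbers of my own, with the fields 8 bits apart so that aaaa+bbbb and aaaa*bbbb don't carry into each other):

> let a = 5; b = 9
> let x = 2^16 + a * 2^8; y = 2^16 + b * 2^8    -- 1.0000aaaa and 1.0000bbbb
> ((x * y) `div` 2^16 - 2^16) `divMod` 2^8      -- renormalize, drop the leading 1
(14,45)

and indeed a+b = 14 lands in one field and a*b = 45 in the other.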

HTH

Regards

Peter

peter.t...@gmail.com

Feb 18, 2021, 11:36:32 PM
to Clash - Hardware Description Language
Let me streamline the "test" encoding of an IEEE triple as a haskell float. I'll also do division and one or two more. Assuming haskell uses the standard ieee binary layout internally (apparently yes, since this works), just map bit to bit from the triple bits to the haskell float bits:

myencodeFloatPos :: (Int8,Word23) -> Float
myencodeFloatPos (zExp,zFrac) = result
     where result = cast((zeroExtend(cast zExp) .<<. 23) .|. zeroExtend zFrac :: Word32)

By ".<<." I mean shiftL.

That will synthesize. Haskell float to/from whatever transforms probably won't, since they will have been written as recursive functions.

Sorry about the unnecessary complication originally in the encode function.

Further, one can avoid the  haskell negation of floats that I used in the top level encoding function by setting the sign bit in the underlying IEEE layout:

myencodeFloat :: (Bool,Int8,Word23) -> Float
myencodeFloat (True,zExp,zFrac)  =
               unpack(setBit (pack(myencodeFloatPos(zExp,zFrac))) 31)        -- was "- stuff"
myencodeFloat (False,zExp,zFrac) = myencodeFloatPos(zExp,zFrac)


or words to that effect.

I should emphasize that in principle the internal layout in terms of bits of an IEEE float is not really specified. It is just that _for interchange_ the format is specified. That's called the interchange or binary format. One can't rely on it being adhered to inside one single platform/program. But, yes, in practice, to avoid surprises, the interchange format is the one generally used internally. Maybe one can come up with examples in which it isn't. It looks like haskell uses it internally, at least in broad outline. That would be expected if they used libraries, or if they just read the docs and did what they say. I didn't bother investigating ... I don't really care outside of for testing. There may still be differences ... I would be suspicious about the nonstandard form ("subnormal"? Something like that as a name) that IEEE permits in order to get better precision for things out of range of the usual exponent limits. Probably it's not catered for. One doesn't have to care. And there are lots of kinds of NaN ... and please don't mention endianness. I seem to recall the interchange format says bigendian, but don't quote me. Maybe it doesn't.

Here is division, modeled on the multiplication I did already in the post above:

myDivFloat :: (Bool,Int8,Word23) -> (Bool,Int8,Word23) -> (Bool,Int8,Word23)
myDivFloat (s1,e1,m1) (s2,e2,m2) = (s,e,m)

    where s = s1 /= s2                             :: Bool
          e = (e1 - 127) - (e2 - 127) + 127        :: Int8

          m1' = setBit (zeroExtend m1) 23          :: Word48 -- calculate in 48b
          m2' = setBit (zeroExtend m2) 23          :: Word48
          m = truncateB ((m1' .<<. 23) `div` m2')  :: Word23
Test:
> myencodeFloat(myDivFloat (mydecodeFloat 1.3) (mydecodeFloat 4.5))
0.28888887
> 1.3 / 4.5 :: Float
0.28888887

I'm truncating, not rounding. Boo hoo. IEEE probably allows either. Somebody else can read the standard and tell me. I'll rely on supposing the standard accommodates all existing implementations at the time it was drawn up, and at that time some vendors did it the least complex way and saved on adding a half-something somewhere in there. So sue me. I'll fix if reqd.

For help in converting to/fro, here's how to turn an Int into an IEEE float. Recall that cast = unpack . pack maps equal bitsize things onto each other, such as Signed onto Unsigned (Int onto Word) and back again.

-- works on up to 128 bit ints (could use Integer instead of Int)
myIntToFloat :: Int -> (Bool,Int8,Word23)
myIntToFloat x = (x < 0,e,m)
             where (e,m) = myIntToFloatPos (abs x)

-- only on positive ints, up to 127 bit posints at least
myIntToFloatPos :: Int -> (Int8,Word23)
myIntToFloatPos 0 = (0,0)
myIntToFloatPos x = if n < 23
                    then (127+toEnum n,truncateB(cast x .<<. (23-n)))
                    else (127+toEnum n,truncateB(cast x .>>. (n-23)))
                      where n = ff1 x -- index of highest set bit


I used this helper function to find the position of the most significant bit in an Int, or whatever:

-- highest set bit number
ff1 x =  case findIndex (/=0) (bv2v (pack x)) of
           Just i  -> finiteBitSize x - 1 - fromEnum i
           Nothing -> -1 -- for x == 0 !!
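
Example, on a 64-bit Int:

> ff1 (8 :: Int)
3
> ff1 (0 :: Int)
-1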

Test:

> myencodeFloat(myIntToFloat (-8000001))
-8000001.0


That looks OK.

I'll need to compare floats. Comparing with zero would do if I had done addition and subtraction already, but I haven't, so I'll do the float against float compare.  Only one comparison is needed, as one can swap sides and use not and && to get the others. This is basically alphabetic ordering on the IEEE triples, with sign thrown in. I'll delegate the comparison to positive against positive, after dealing with the signs:

myLtFloat (s1,e1,m1) (s2,e2,m2) =
    case undefined of
      _ | s1 && not s2     -> True                          -- neg < pos
      _ | not s1 && s2     -> False                         -- pos < neg
      _ | not s1 && not s2 -> myLtFloatPos (e1,m1) (e2,m2)  -- pos < pos
      _ | s1 && s2         -> myLtFloatPos (e2,m2) (e1,m1)  -- neg < neg

myLtFloatPos (e1,m1) (e2,m2) =
    case undefined of
      _ | e2 - e1 > 0      -> True
      _ | e2 - e1 < 0      -> False
      _ | m2 > m1          -> True
      _ | m2 <= m1         -> False

Test:

> myLtFloat (mydecodeFloat 1.3) (mydecodeFloat 4.5)
True


And a few more like that which I won't show.
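
If anyone wants them spelled out, they are just flips and negations of the one comparison, along these lines (names of my own choosing):

myGtFloat x y = myLtFloat y x
myGeFloat x y = not (myLtFloat x y)
myLeFloat x y = not (myLtFloat y x)
myEqFloat x y = not (myLtFloat x y) && not (myLtFloat y x)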

I do insist that the (single precision) descriptions above do not invoke complex logic, and I can see no reason why they should not be done in one cycle.  The multiplication and division were 48 bit. You will say that double precision floats (64 bit) are more complex, and yes, I guess they would require about 90 bit logic (I haven't looked it up - maybe it's 80, maybe it's 100) and maybe at present need two or more cycles for multiplication and division, but that does not derive from some frightening (hypothetical) fact of them being fundamentally different and mysterious by virtue of their floating point-ness. The operations and logic are just the same as integer, in all truth, plus a trivial unpacking and packing into/from integers fore and aft, and have that complexity and no more.

One can do better than the way I have set them out above, but I didn't wish to invoke any exotic techniques.

Regards

Peter



peter.t...@gmail.com

Feb 19, 2021, 12:02:20 PM
to Clash - Hardware Description Language
I'll carry on to do (single precision) floating point addition and subtraction while I still have strength ... I've been leaving them because I know there'll be lots of cases and that's a headache and I'm sure to make a mistake ...

First let me correct my implementation of division. I can't count. I meant to shift the dividend left by 24 bits, not 23 bits. That way it clears the divisor, itself extended to 24 bits from the 23 bits in the IEEE representation by putting back the missing leading 1. That's how one gets maximal possible accuracy in a/b via integer division. Write it as a0000000/b and scale the extra 0s off again afterwards. I'll also put in an extra trap for 0 as dividend, else one gets a VSN ("very small number" - trademark) as result instead of 0. Shrug.

Recall that the IEEE arithmetic representation is a triple (Bool,Int8,Word23) representing sign, exponent, mantissa respectively, and the value it represents is (informally) sign * 1.mantissa * 2^exponent . The binary "interchange" format for that is a straight map bitwise of that triple, elements taken in order left to right, numbers bigendian (I think! I haven't done anything that would check the internal arrangement) and 2s complement. A nuance is that 127 (maxBound in Int8) as literal exponent really means 0, so I will write exponent-127 whenever I want the value. I'm not confused.

myDivFloat :: (Bool,Int8,Word23) -> (Bool,Int8,Word23) -> (Bool,Int8,Word23)
myDivFloat (s1,0,0) (s2,e2,m2) = (s1/=s2,0,0)              -- (new trap for 0/..)

myDivFloat (s1,e1,m1) (s2,e2,m2) = (s,e,m)
    where s = s1 /= s2                           :: Bool
          e = (e1 - 127) - (e2 - 127) + 128      :: Int8   -- (was +127)

          m1' = setBit (zeroExtend m1) 23        :: Word48 -- calculate in 48b

          m2' = setBit (zeroExtend m2) 23        :: Word48
          m = truncateB ((m1' .<<. 24) `div` m2'):: Word23 -- (was .<<. 23)

I should trap for 0 as divisor too, but I don't recall the NaN format (one of them!) so I won't. Please somebody look it up and trap for ../0.

Hmm. I also forgot the trap for 0 as multiplicand in the multiplication operator. Please rescue as follows:

myMulFloat :: (Bool,Int8,Word23) -> (Bool,Int8,Word23) -> (Bool,Int8,Word23)
myMulFloat (s1,0,0) (s2,e2,m2) = (s1/=s2,0,0)
myMulFloat (s1,e1,m1) (s2,0,0) = (s1/=s2,0,0)
myMulFloat (s1,e1,m1) (s2,e2,m2) = ...


The s1/=s2 business is so one can have the thrill of seeing an occasional "-0.0" pop out as result. Who am I to say no.

So, apologies, mea culpa, and errata apart, take a deep breath and on to addition/subtraction. I'll get rid of having to consider the sign bit by reducing everything to sum of two positive numbers, or difference of two positive numbers, before even thinking:

myAddFloat :: (Bool,Int8,Word23) -> (Bool,Int8,Word23) -> (Bool,Int8,Word23)
myAddFloat (s1,e1,m1) (s2,e2,m2) =
   case (s1,s2) of
     (False,False) -> myAddFloatPos (e1,m1) (e2,m2)               -- pos + pos
     (False,True)  -> mySubFloatPos (e1,m1) (e2,m2)               -- pos + neg
     (True,False)  -> let (s,e,m) = mySubFloatPos (e1,m1) (e2,m2) -- neg + pos
                      in  (not s,e,m)
     (True,True)   -> let (s,e,m) = myAddFloatPos (e1,m1) (e2,m2) -- neg + neg
                      in  (not s,e,m)

mySubFloat :: (Bool,Int8,Word23) -> (Bool,Int8,Word23) -> (Bool,Int8,Word23)
mySubFloat (s1,e1,m1) (s2,e2,m2) = myAddFloat (s1,e1,m1) (not s2,e2,m2)

I set subtraction to be just addition of a negative, since all I have to do to negate a number is change its sign bit in the IEEE triple ... and hope. (You can just see that I ought to trap for zero here too, but let's go on without ...).
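
For the record, the missing traps would look like the ones in multiplication - an untested sketch, and they must come before the general equations:

myAddFloat (_,0,0) y = y
myAddFloat x (_,0,0) = x

modulo any -0.0 pedantry.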

All I have to do to add up two positive numbers is put back the leading 1 on the mantissas and add as integers, then take the leading 1 back off to get the mantissa of the result.

I'll have to watch for the leading 1 moving left one place (when one does large + large), in order to hit it when aiming to knock it off, which is the "testBit m 24" below. The addends have leading 1 in position 23, so adding the leading 1 either stays in position 23 or moves one left to position 24. In the latter case I have to drop the 0th bit of the result, losing 1 bit of precision,
but keeping to the allotted 23 bits for the mantissa. That means a shift right by 1 (" .>>. 1"). The leading 1 gets dropped in any case by the obligatory truncation to 23 bits for the mantissa.

myAddFloatPos :: (Int8,Word23) -> (Int8,Word23) -> (Bool,Int8,Word23)
myAddFloatPos (e1,m1) (e2,m2) =
                    if testBit m 24
                    then (False,e+1,truncateB (m .>>. 1) :: Word23) -- grew!
                    else (False,e,  truncateB m          :: Word23)
  where
    m1' = setBit (zeroExtend m1) 23     :: Word25 -- calculate in 25b
    m2' = setBit (zeroExtend m2) 23     :: Word25
    (e,m) = case undefined of
               _ | e1-e2 > 0 -> (e1, m1' + (m2' .>>. fromEnum(e1-e2)))
               _ | e1-e2 < 0 -> (e2, (m1' .>>. fromEnum(e2-e1)) + m2')
               _ | e1 == e2  -> (e1, m1' + m2')


I had to shift the smaller addend to the right to match up its digits in the correct places with the digits of the larger addend. That loses some precision. The more caring should round rather than truncate when doing the shift right.

I still have to handle subtraction of two positive numbers (below). This splits into several cases according to whether to expect a positive (nonnegative) or negative result. Those cases are large - small and small - large, plus a couple in which it is harder to tell. One can primarily tell by which has the larger exponent. If the exponents are equal, one can tell by which has the larger mantissa. If the mantissas are equal too, then the answer for the subtraction is 0. I will use "ff1" to return the position of a leading 1:

mySubFloatPos :: (Int8,Word23) -> (Int8,Word23) -> (Bool,Int8,Word23)
mySubFloatPos (e1,m1) (e2,m2) = (s,e',m')
      where
          m1' = setBit (zeroExtend m1) 23     :: Word25 -- calculate in 25b
          m2' = setBit (zeroExtend m2) 23     :: Word25
          -- shift left or right if leading 1 not in expected posn 23
          (e',m') = let n = if e1==e2 && m1==m2 then 23 else ff1 m
                    in
                    case n of
                      23 -> (e,             truncateB m               :: Word23)
                      _ | n > 23                                  -- impossible
                         -> (e-toEnum(23-n),truncateB (m .>>. (n-23)) :: Word23)
                      _ | n < 23                                  -- implausible
                         -> (e-toEnum(23-n),truncateB (m .<<. (23-n)) :: Word23)
          (s,e,m) = case undefined of
                      _ | e1-e2 > 0 -> (False,e1,m1'-(m2' .>>. fromEnum(e1-e2))) -- big minus small
                      _ | e1-e2 < 0 -> (True,e2, m2'-(m1' .>>. fromEnum(e2-e1))) -- small minus big
                      _ | m2 < m1   -> (False,e1,m1'-m2')                        -- big minus small
                      _ | m1 < m2   -> (True,e2, m2'-m1')                        -- small minus big
                      _ -> (False,0,0)                                           -- x - x = 0

In the x-x case, I have sneaked a (_,0,0) result by setting an imaginary "leading 1 position" on the intermediate result at 23, the expected default position, which will stop the code attempting to further mutate it. (That I have to explain it means it is wrong code, in software engineering terms -- just trap for a zero result at top level and skip the subterfuge).

All that needs much more testing than I have given it. The subtraction is the more delicate code. It invokes the addition. So here is a test:

> myencodeFloat(mySubFloat (mydecodeFloat 17.3) (mydecodeFloat 9.3))
8.0
> myencodeFloat(mySubFloat (mydecodeFloat 17.3) (mydecodeFloat 19.3))
-2.0


Yes, I have tried subtracting and adding negative numbers. Please try more and let me know what I got wrong.

I forgot to give a function that maps from float to integer. This one truncates towards zero (which I hate ... but may even be right in some universe, maybe the IEEE one too!). Here it is. Recall that cast = unpack . pack maps equal-sized objects onto each other bitwise. Again, I'll remove the sign bit first before even thinking about it, in order to simplify this problem. That already guarantees symmetrical behavior around zero, for better or worse:

myFloatToInt :: (Bool,Int8,Word23) -> Int
myFloatToInt (_,0,0)     = 0
myFloatToInt (True,e,m)  = - myFloatToIntPos (e,m)
myFloatToInt (False,e,m) =   myFloatToIntPos (e,m)

-- needs Word/Int to have more than 24 bits
myFloatToIntPos :: (Int8,Word23) -> Int
myFloatToIntPos (e,m) = if e-127 > 23
                        then m' .<<. fromEnum ((e-127)-23)
                        else m' .>>. fromEnum (23-(e-127))
     where m' = cast(setBit (zeroExtend m) 23) :: Int


(A late reminder that .<<. is shiftL and .>>. is shiftR). Test:

> myFloatToInt (mydecodeFloat 98.7)
98
> myFloatToInt (mydecodeFloat (-101.1))
-101


That's what I mean by symmetrical  truncation towards zero. One might have expected -101.1 to truncate to -102 as an integer, instead. I certainly do. It's definitely the case that rounding is preferable to truncation here, in terms of creating least surprise. I'll do that if requested.
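
For the record, the change is small. Following the same recipe as elsewhere (add half of whatever the shift is about to discard before discarding it), the positive case becomes - an untested sketch:

myFloatToIntPosR :: (Int8,Word23) -> Int
myFloatToIntPosR (e,m) = if sh >= 0
                         then m' .<<. sh
                         else (m' + bit (-sh - 1)) .>>. (-sh)
     where m' = cast(setBit (zeroExtend m) 23) :: Int
           sh = fromEnum (e-127) - 23

It rounds the magnitude, so behavior stays symmetrical around zero.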

I can't think of anything much else one might need as floating point ops. Modulus and remainder, perhaps. Please do translate to primitive operation "templates" in the backend languages ... and add traps for zero etc to taste. It's not going to be terribly wrong whatever you do, because IEEE as a standard was originally intended to be inclusive. At worst some people might tut-tut until a correction.

That I've done only single precision floating point operations is not significant, other than that I intended to leave no space for the argument that  floating point logic cannot fit in one cycle.

For double precision, change some of the bitsize numbers.

I don't think we're far off being able to handle even 64-bit floating point logic in one cycle. Addition and subtraction should fit.

Regards

Peter

peter.t...@gmail.com

Feb 21, 2021, 9:58:47 AM
to Clash - Hardware Description Language
I checked the rounding mode (truncate towards zero) I used with the IEEE standard, and it's fine by them (I hate it). As one might guess, the standard was set up to be inclusive and it okays what any vendor did at the time, in order to offend nobody. The object is formalization, not policing. Since truncate is simplest, it was done, and is allowed.

The standard speaks of "rounding modes", so I presume the big fancy FPUs can be switched from one rounding mode to another. In practice I suppose they stay in what I call rounding (the standard calls it rounding to nearest expressible value, with tie-breaking (upward? Away from zero?)).

If you want to implement rounding like that, you have to add a half of whatever may be lost before you throw away those bits of precision. So if one throws away 9 bits of precision during the calculation, by shifting 9 right, then one must first add a "bit 8". A "bit 8" on its own would be lost by a shift right 9, but there may already be a bit 8 set in what it is added to, which would make a bit 9, and be preserved by the shift. So if you want to change the multiplication I gave, for example, from truncation to rounding, you would change the calculation that drops 23 lower bits of a 48 bit intermediate result (and also drops 2 excess highest bits):

m = truncateB ((m1' * m2') .>>. 23) :: Word23

to

m = truncateB ((m1' * m2' + bit 22) .>>. 23) :: Word23

and that would make me, if nobody else, happier.

With respect to errors, the general principle is to not error. Instead one should produce a "NaN", which is (_,-1,*) as an IEEE triple (Bool,Int8,Word23) (i.e., sign,exponent,mantissa). The -1 is the exponent part. The * for the mantissa part should be non-zero. That's because (_,-1,0) is infinity. NaNs should ideally be preserved through calculations, so most of the calculations I gave should have an extra first line or two to pick out the situation when one of the arguments to the floating point operation is a NaN, and make the result a NaN. I would suggest for multiplication, for example, a guard line like

myMulFloat (s1,-1,m1) (s2,e2,m2) = (s1/=s2,-1,m1)            -- NaN * _ = NaN

Sticklers may wish to pick out the infinity case just before that, so infinity times zero can be detected, and NaN produced. Otherwise one will get some ordinary value, and I have no clue what it will be. Maybe zero. Surely zero. I've kept the sign calculation as-is above, because it's info.
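
Such a trap might look like this (my guess at the encodings, untested, and it must precede the generic NaN guard above):

myMulFloat (s1,-1,0) (s2,0,0)  = (s1/=s2,-1,1)   -- Inf * 0 = NaN
myMulFloat (s1,0,0)  (s2,-1,0) = (s1/=s2,-1,1)   -- 0 * Inf = NaN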

To build a complete floating point operation one does (for example)

mul_f :: Float -> Float -> Float
mul_f x y = myencodeFloat(myMulFloat (mydecodeFloat x) (mydecodeFloat y))

The simplest encoding/decoding function expressions are probably:

myencodeFloat :: (Bool,Int8,Word23) -> Float
myencodeFloat (zSign,zExp,zFrac) = unpack (pack zSign ++# pack zExp ++# pack zFrac)

                   
mydecodeFloat :: Float -> (Bool,Int8,Word23) -- sign, exponent, mantissa (significand without leading bit)
mydecodeFloat x = (zSign,zExp,zFrac)
                where x' = pack x :: BitVector 32
                      zSign = testBit x' 31                  :: Bool
                      zExp  = unpack( truncateB(x' .>>. 23)) :: Int8
                      zFrac = unpack( truncateB x')          :: Word23

         
You should be able to do much better with more access to and knowledge of Clash internals. This assumes the haskell internal representation is the standard IEEE binary interchange format, and I believe it is. But on an other-endian machine that story needs checking.
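
A cheap simulation-time check of that belief (a sketch; NaNs excluded, since NaN /= NaN):

prop_roundtrip :: Float -> Bool
prop_roundtrip x = myencodeFloat (mydecodeFloat x) == x

> all prop_roundtrip [1.23, -4.56, 0, -0, 1/0, -1/0]
True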

I would really appreciate it if somebody would produce a new backend primitive operation from this. I have not succeeded in making head nor tail of the tutorial exposition, I am afraid - it fails to tell me what you want,  so I don't know what to supply, let alone how. I can dump a system verilog "module" for the mul_f above, for example. Here it is! Please modify as required, and maybe I can spot what is needed from that.

/* AUTOMATICALLY GENERATED SYSTEMVERILOG-2005 SOURCE CODE.
** GENERATED BY CLASH 1.2.4. DO NOT MODIFY.
*/
`timescale 100fs/100fs
module FPUa_mul_f
    ( // Inputs
      input logic [31:0] x
    , input logic [31:0] y
      // Outputs
    , output logic [31:0] result
    );
  // KPU/Float.hs:120:1-13
  logic zSign;
  // KPU/Float.hs:120:1-13
  logic signed [7:0] zExp;
  // KPU/Float.hs:120:1-13
  logic [22:0] zFrac;
  logic [0:0] c$app_arg;
  FPUa_types::Tup3 c$case_alt_0;
  FPUa_types::Tup3 c$case_alt_1;
  FPUa_types::Tup3 result_0;
  FPUa_types::Tup3 c$case_alt_2;
  logic c$app_arg_0;
  logic c$case_alt_3;
  logic [47:0] c$app_arg_1;
  logic [47:0] c$app_arg_2;
  logic c$app_arg_3;
  logic c$case_alt_4;
  FPUa_types::Tup3 result_1;
  FPUa_types::Tup3 result_2;
  FPUa_types::Tup3 c$case_alt_5;
  logic c$app_arg_4;
  logic c$case_alt_6;
  // KPU/Float.hs:136:1-10
  logic [22:0] m2;
  // KPU/Float.hs:136:1-10
  logic s2;
  // KPU/Float.hs:136:1-10
  logic [22:0] m1;
  // KPU/Float.hs:136:1-10
  logic s1;
  // KPU/Float.hs:126:1-13
  logic [31:0] \x' ;
  FPUa_types::Tup3 result_3;
  // KPU/Float.hs:126:1-13
  logic [31:0] \c$x'_0 ;
  FPUa_types::Tup3 result_4;
  logic signed [7:0] result_0_selection;
  logic [47:0] c$bv;
  logic signed [7:0] result_2_selection;
  logic [31:0] c$bv_0;
  logic [31:0] c$bv_1;
  assign zSign = c$case_alt_0.Tup3_sel0;
  assign zExp = $signed(c$case_alt_0.Tup3_sel1);
  assign zFrac = c$case_alt_0.Tup3_sel2;
  assign c$app_arg = zSign ? 1'b1 : 1'b0;
  assign c$case_alt_0 = ($signed(result_4.Tup3_sel1) == -8'sd1) ? {c$app_arg_0
                                                                  ,-8'sd1
                                                                  ,m1} : c$case_alt_1;
  assign c$case_alt_1 = ($signed(result_3.Tup3_sel1) == -8'sd1) ? {c$app_arg_0
                                                                  ,-8'sd1
                                                                  ,m2} : result_0;
  assign result_0_selection = $signed(result_4.Tup3_sel1);
  always_comb begin
    case(result_0_selection)
      8'sd0 : result_0 = c$case_alt_2;
      default : result_0 = result_2;
    endcase
  end
  always_comb begin
    case(m1)
      23'd0 : c$case_alt_2 = {c$app_arg_0
                             ,8'sd0
                             ,23'd0};
      default : c$case_alt_2 = result_2;
    endcase
  end
  assign c$app_arg_0 = s1 ? c$case_alt_3 : s2;
  assign c$case_alt_3 = s2 ? 1'b0 : 1'b1;
  // replaceBit start
  always_comb begin
    c$app_arg_1 = (({{(48-23) {1'b0}},m2}));
    c$app_arg_1[(64'sd23)] = (1'b1);
  end
  // replaceBit end
  // replaceBit start
  always_comb begin
    c$app_arg_2 = (({{(48-23) {1'b0}},m1}));
    c$app_arg_2[(64'sd23)] = (1'b1);
  end
  // replaceBit end
  assign c$app_arg_3 = s1 ? c$case_alt_4 : s2;
  assign c$case_alt_4 = s2 ? 1'b0 : 1'b1;
  assign c$bv = ((((c$app_arg_2) * (c$app_arg_1)) + 48'd4194304) >> (64'sd23));
  assign result_1 = {c$app_arg_3
                    ,(($signed(result_4.Tup3_sel1) - 8'sd127) + ($signed(result_3.Tup3_sel1) - 8'sd127)) + 8'sd127
                    ,c$bv[0+:23]};
  assign result_2_selection = $signed(result_3.Tup3_sel1);
  always_comb begin
    case(result_2_selection)
      8'sd0 : result_2 = c$case_alt_5;
      default : result_2 = result_1;
    endcase
  end
  always_comb begin
    case(m2)
      23'd0 : c$case_alt_5 = {c$app_arg_4
                             ,8'sd0
                             ,23'd0};
      default : c$case_alt_5 = result_1;
    endcase
  end
  assign c$app_arg_4 = s1 ? c$case_alt_6 : s2;
  assign c$case_alt_6 = s2 ? 1'b0 : 1'b1;
  assign m2 = result_3.Tup3_sel2;
  assign s2 = result_3.Tup3_sel0;
  assign m1 = result_4.Tup3_sel2;
  assign s1 = result_4.Tup3_sel0;
  assign \x'  = __VOID__;
  assign c$bv_0 = (\x'  >> (64'sd23));
  assign result_3 = {(\x' [(64'sd31)]) == (1'b1)
                    ,$signed((c$bv_0[0+:8]))
                    ,(\x' [0+:23])};
  assign \c$x'_0  = __VOID__;
  assign c$bv_1 = (\c$x'_0  >> (64'sd23));
  assign result_4 = {(\c$x'_0 [(64'sd31)]) == (1'b1)
                    ,$signed((c$bv_1[0+:8]))
                    ,(\c$x'_0 [0+:23])};
  assign result = (c$case_alt);
endmodule


Ben Blaxill

Feb 21, 2021, 2:21:02 PM
to clash-l...@googlegroups.com
> While SOME floating point ops usually take several cycles, add and subtract take about 1 cycle nowadays even for clocks at a couple of GHz or more.

Intel processors have a 3 cycle latency for FADD/FSUB as far as I can tell - source https://www.agner.org/optimize/instruction_tables.pdf

Thanks,
Ben

On 21 Feb 2021, at 14:58, peter.t...@gmail.com <peter.t...@gmail.com> wrote:



peter.t...@gmail.com

Feb 21, 2021, 5:14:13 PM
to Clash - Hardware Description Language
Hi - thanks for the response! I thought the world had died!

Intel processors have a 3 cycle latency for FADD/FSUB as far as I can tell - source https://www.agner.org/optimize/instruction_tables.pdf

The first table (p8) is for AMD K7, where FADD is listed on p13 as 1 cycle throughput, 4 cycle latency, but the note says it's working on memory OR registers ("r/m"), so memory latency is thrown into the mix there. The timing is the same as memory only, and the same as for multiplication! So it looks like that one unit works in multiples of 4 cycles. Maybe it's 64 bit. Maybe 80. I don't know.

About p150 of the document, the timings for the Intel Pentium M are listed (I am typing on one ...) and FADD is 1 cycle latency (p160) there. Multiplication is 2 cycle latency. That makes sense since this is a 32-bit machine! It is 1.4GHz, vintage about 2006, with 90nm tech. If they could do it then, they can do it now.

The code I wrote does addition for a 32-bit single precision float using only  24-bit integer addition, applied once. If that cannot be done in one cycle, something is wrong.

But it doesn't require experimental confirmation because the floating point single precision addition really is 24-bit integer addition, done once. It's just math. So if some "they" can't do it in one cycle, _they_ are doing something wrong.

Double precision floating point addition would be 54 bit integer addition, done once.

Floating point units, however, I believe should standardly come with 40 bit or 80 bit internal accumulators, and are correspondingly really 40 bit or 80 bit internally.  Probably all 80 bit nowadays! Or more?

Is that consistent with what you see? There are 400+ pages to that document. If you point me to the table you are looking at, I can try for an interpretation of it.

Best regards

Peter

Ben Blaxill

Feb 21, 2021, 6:49:47 PM
to clash-l...@googlegroups.com
> About p150 of the document, the timings for the Intel Pentium M are listed (I am typing on one ...) and FADD is  1 cycle latency (p160) there. Multiplication is 2 cycle latency. That  makes sense since this is a 32-bit machine!  .. it is 1.4GHz, vintage about 2006, with 90nm tech.  If they could do it then, they can do it now.

Latencies on that page for FADD/FMUL are 3/5 respectively; it is reciprocal throughput that is 1/2. Additionally there are other operations such as FCOMP on r/m that do reach 1 cycle latency rather than 3/5.

> The code I wrote does addition for a 32-bit single precision float using only  24-bit integer addition, applied once. If that cannot be done in one cycle, something is wrong. 

Latency and max frequency are design-specific; if performance for your design meets your needs then great. I don’t have insight into Intel FADD, but the cost of a barrel shift should not be underestimated.

Thanks,
Ben

On 21 Feb 2021, at 22:14, peter.t...@gmail.com <peter.t...@gmail.com> wrote:



peter.t...@gmail.com

Feb 21, 2021, 6:53:12 PM
to Clash - Hardware Description Language
Further clarification of what the x87 unit is (the fpu of x86 series processors) is given in wikipedia, and explains what is being measured in the intel tables for FADD of https://www.agner.org/optimize/instruction_tables.pdf . It's 80 bit, in principle, always, internally, even for a 32-bit machine (though I have a bit of a doubt about my Pentium M, which feels more like 40 bit ...), even if the arguments and result are 32 bit (i.e., single precision IEEE floating point). That's 3 to 4 times as long as required for IEEE single precision for 32-bit architectures ... i.e. Haskell "Float".

The x87 provides single-precision, double-precision and 80-bit double-extended precision binary floating-point arithmetic as per the IEEE 754-1985 standard. By default, the x87 processors all use 80-bit double-extended precision internally (to allow sustained precision over many calculations, see IEEE 754 design rationale). A given sequence of arithmetic operations may thus behave slightly differently compared to a strict single-precision or double-precision IEEE 754 FPU.[4] As this may sometimes be problematic for some semi-numerical calculations written to assume double precision for correct operation, to avoid such problems, the x87 can be configured using a special configuration/status register to automatically round to single or double precision after each operation. Since the introduction of SSE2, the x87 instructions are not as essential as they once were, but remain important as a high-precision scalar unit for numerical calculations sensitive to round-off error and requiring the 64-bit mantissa precision and extended range available in the 80-bit format.

And no, I've read it wrong,  the tables in https://www.agner.org/optimize/instruction_tables.pdf show my 1.4GHz Pentium M doing (presumably 80 bit)  FADD with THREE cycle latency (p160). That's the SECOND TO LAST column. Multiplication takes FIVE cycles. Grrr. I looked at the LAST column.  The column headings are about 10 pages before that. Apologies ...

The tables for Intel Coffee Lake (p297) show FADD (p303) at THREE cycles latency too. Does anyone know if by then they have doubled FPU length again? If not, it's still 80 bit. Multiplication is 5 cycles again.

In any case, the SSE2 instructions (available already on Pentium M) do 128-bit floating point operations, so the FPU, and there may be more than one unit that could be called that, maybe with different precisions, must be at least 113 bit (assuming the same proportion of bits are devoted to the exponent as in the IEEE standard). Maybe it's 160 bit!

The bottom line is that processor FPUs are way longer precision than what is required for IEEE single precision floating point operations, even much longer than what is required for IEEE double precision, so one cannot get info on how long the IEEE single precision (and double precision) operations should take from them. Single precision IEEE floating point addition is 24 bit integer addition and should take as long as that takes. Single precision IEEE floating point multiplication is 24 bit integer multiplication for a 48 bit result, of which the bottom half is either dropped (truncation) or incremented (rounding) for a possible carry into the top half and then dropped. The exponents get 8-bit added in parallel.

Ben Blaxill

Feb 21, 2021, 7:20:42 PM
to clash-l...@googlegroups.com
> It's 80 bit,

Yeah my mistake, although ADDPS (32bit float vectorized add) also seems to be 3 or 4 cycle latency, which I would take as a proxy.

Thanks,
Ben

On 21 Feb 2021, at 23:53, peter.t...@gmail.com <peter.t...@gmail.com> wrote:


peter.t...@gmail.com

Feb 21, 2021, 7:51:46 PM
to Clash - Hardware Description Language
Out of interest, 80 bit FPU actually means 64 bit integer addition and multiplication, and also 64 bit precision. There are  16 bits of exponent and sign in the 80 bit format. Perhaps one can compare the timings with the 64-bit integer ops and draw some conclusions there.

That design is - I read - intended to give 7 or more (10!) extra bits of precision over what is strictly needed for IEEE double precision floating point ops, which have nominally 53 bits of precision (plus 11 bits of exponent and a sign). So lots of additions can be done in sequence at high accuracy.

I wonder if the 128-ness of SSE2 instructions refers to the significand part only or the significand plus exponent and sign.

I think the 80 bit IEEE format doesn't drop the leading 1 in the binary representation of the significand part, so one doesn't waste time and logic sometimes "renormalizing" after an op. That means one can't just take the single-precision ops I coded and change the numbers (23, 48, etc) to get the extended precision ops. Well, one could, but they would (a) be longer and more complex than they need to be, which is OK (b) wrong format, which is not OK.

Still, the code I provided is good for single precision IEEE (barring the occasional mistake which I will be glad to rectify on demand ..), and I would be very pleased to see somebody turn them into Clash backend primitives! The logic should flatten well, and it can't have to cascade more than 24/48 bits right to left. It doesn't have to be perfect. It can always be improved.

Peter

peter.t...@gmail.com

Feb 21, 2021, 11:35:08 PM
to Clash - Hardware Description Language
You also got me curious as to the vector instruction timings. The first thing I would infer from "ADDPS (32bit float vectorized add) also seems to be 3or4 cycle latency which I would take as a proxy." is that the reasoning there is _therefore_ mistaken. You know that 32-bit float add must be faster than 128-bit float add (see p226 where the timing for Ivy Bridge ADDPS is the same as ADDPD and ADDSS and ADDSD - three cycles for all) if the vector calculation is truly done as 32-bit calculations in parallel, so it isn't. So one shouldn't take it as a proxy.

Another puzzle is that the operands are given as "x/m128" for the same measurement, which means either register or memory, both at 128 bit. This statement in the Intro says they code the memory access instruction(s) to/from the same address, so the pipeline spots that and does not do any memory access at all: "It is not possible to measure the latency of a memory read or write instruction with software methods. ... in most cases the microprocessor is smart enough to make a "store forwarding" directly from the write unit to the read unit rather than waiting for the data to go to the cache and back again."

That's right. So that explains it (the effective address calculation apparently only takes an insignificant fraction of a cycle).

The ADD{P/S}{S/D} I presume is 128 bits divvied up as either 4x32 or 2x64 or ... ? Possibly there are four 80-bit FPUs and they simply run all four in parallel at 80 bits and round to 32 bits after each op, as the wikipedia explains should be a FPU configuration option, or they run two in parallel at 80 bits and round to 64 after each op. Either way it's the 80-bit as the critical time.

An alternative is that they split one 80-bit FPU and run it twice as fast as two 40-bit FPUs, and let one half do first one 32-bit floating point add in half the time, then another in half the time, while its other half does the other two in parallel. Nett result is the same.

Or something like that ... or very different. Anyway, we know good proxy reasoning cannot hold.

I would love to know how come some of the instruction latency measurements are "0.5 cycles". How does that happen? Have they accidentally measured two pipelines emitting instructions in parallel? Maybe. There are weasel words in the intro about their test codes avoiding register dependencies. A stream of the same instruction with changed registers has no hazards and would split down two pipes at full occupancy.

If so, that's not a fair number. All the instruction timings should be multiplied up by the number of pipelines if they do that. Or does intel really have double-clocked pipelines?

Clues!

Regards

PTB

Christiaan Baaij

Feb 22, 2021, 3:43:54 AM
to clash-l...@googlegroups.com
Thanks for all the effort you've put in so far!

However, I would want the code that Clash generates for `plusFloat#` and friends to be bit-accurate.
Especially when you start to have long-running accumulating operations, I wouldn't want my FPGA/ASIC to drift from the Haskell simulation.

However! I think it would make a great start towards a `Clash.FloatingPoint` module.
At least until we have bit-accurate implementations; I hope you can see the implications of a non-bit-accurate implementation (wrt Haskell's plusFloat and friends) when someone depends on Clash for a mission-critical application.
With a separate `Clash.FloatingPoint` module, a developer could assess the differences wrt to Haskell standard Float and determine whether it suits their needs and/or slight adjustments to their algorithms are needed.
Could you give me permission to distribute the code you posted under Clash's BSD2 license?
(If you could just collate your implementations in one .hs file, add your copyright statement and BSD2 license text to the top, and send it over to my personal email address, that would be appreciated)

With regards to your assessment on the signal propagation delay, and thus cycle latency in a pipelined operation:
For addition/subtraction, the issue isn't so much the integer addition by itself, as it is the integer addition plus:
1. the count-leading-zero's (find-first-one)
2. the dynamic shift (a barrel-shifter)
3. the multiplexers for dealing with infinity and NaN
Those are all in the critical path. Especially on FPGAs those circuit structures impede high operating frequencies.



Peter Lebbing

Feb 22, 2021, 5:08:17 AM
to clash-l...@googlegroups.com
On Mon, 22 Feb 2021, Christiaan Baaij wrote:
> 2. the dynamic shift (a barrel-shifter)

The 8087 had a really novel barrel shifter. We might be able to take
some cues from it to implement a more efficient barrel shifter in
FPGAs, although frankly I'm not sure how well it translates into an
FPGA, and how much the synthesis already reaches an equivalent state
unaided.

The following link might describe techniques that are still patented. If
you are one of those who are careful about diving into patented
material, you have been warned. Patents might have ended, I don't know.

Recently, Ken Shirriff explored and explained the 8087 barrel shifter
here:

https://www.righto.com/2020/05/

Peter.

peter.t...@gmail.com

Feb 22, 2021, 5:08:49 AM
to Clash - Hardware Description Language
Hi Christiaan

You are saying (1) you want rounding, not truncation? (I assume haskell ends up doing rounding, unless compiled on an embedded platform ....); (2) you want the "subnormal" format extension of basic IEEE to be supported (I assume haskell ... ditto). [I have steered clear, since it's an option in IEEE, not mandatory]; (3) whatever every hardware platform does should be matched to the bit.

(1) is trivial, (2) is likely not worth the bother (and it's somewhat nasty) in light of (3). (3) is impossible since platforms differ, but is not too far off a generic "softfloat" support for platforms without an FPU, compiled on the same platform even though it has an FPU.

One can use the IEEE softfloat reference implementation by J Hauser:

This C source file is part of the SoftFloat IEC/IEEE Floating-point Arithmetic Package, Release 2b. Written by John R. Hauser. This work was made possible in part by the International Computer Science Institute, located at Suite 600, 1947 Center Street, Berkeley, California 94704. Funding was partially provided by the National Science Foundation under grant MIP-9311980. The original version of this code was written as part of a project to build a fixed-point vector processor in collaboration with the University of California at Berkeley, overseen by Profs. Nelson Morgan and John Wawrzynek. More information is available through the Web page `http://www.cs.berkeley.edu/~jhauser/arithmetic/SoftFloat.html'.

But after all that you get to have to choose things like the rounding mode:

int8 float_rounding_mode = float_round_nearest_even; // default

Which is one going to choose?  Then there's the subnormal form business, which allows for more precision at the cost of fewer exponent bits, or vice versa (to get more range). All that is up to the hardware platform.

The IEEE standard is not prescriptive in a number of ways.

I'd be happy with (1) + some of (2). I can't in general guess how much of (2) is supported on particular hardware, nor in what way. Same for NaN support. But I don't think anyone cares. One conforming implementation should be as good as another to a user, and different implementations will differ (because of (2), and NaNs, and some other stuff), so I don't see the point. Take anything and modify it if anyone wants it modified.

If you want rounding instead of truncation, it's easy. I gave the recipe a couple of times. One has to add half of the amount of precision that may be lost before losing it, wherever one may lose it. For example, this line in the "Add" function should read:

if testBit m 24 then (False,e+1,truncateB ((m+1).>>. 1) :: Word23) else (False,e, truncateB m :: Word23)

instead of

if testBit m 24 then (False,e+1,truncateB (m .>>. 1) :: Word23) else (False,e, truncateB m :: Word23)

The case where one was about to lose a bit of precision via ".>>. 1" first gets a bit added before the shift, with the effect that half the possible precision loss has been added to the result. That's rounding.

Same idea for the Div function. For rounding you want this line:

m = ((m1' .<<. 24) + (m2' .>>. 1)) `div` m2'

instead of this one (truncation):

m = (m1' .<<. 24) `div` m2'

That's because for rounding one adds half an m2 before dividing by m2, which amounts to the half a bit extra required.

To do better in software I'd have to match exactly what the hardware does for the "subnormal" format tradeoffs, when it does it.

peter.t...@gmail.com

Feb 25, 2021, 3:01:35 PM
to Clash - Hardware Description Language
I sent Christiaan a file with a single precision IEEE implementation.

Re: count leading zeros. One can do it with constant depth logic, for the fixed 24 bit data required, but one always can, since it's just a stateless bits to bits transform. But I did check that there is a constant depth implementation (as the data length increases), using a constant depth implementation of findIndex. That is aimed at the situation when there is always zero or one element of the vector satisfying the findIndex predicate. That maps the index or zero onto each element, as it satisfies the predicate or not, then ors all the index-or-zeros together bitwise. Out pops 0 or the index of the one element satisfying the predicate.

To count leading zeros, map every bit to 0000000 or 000011111 as the bit is 0 or 1, where the 1s start just after the bit position. Or all those together bitwise and one gets a 0000111111 where the 1s start just after the leading 1 in the original. Do a findIndex with a predicate that looks for the 01 boundary. To do that, form the vector of successive pairs of bits 00 00 00 01 11 11 ... by zipping the vector against itself shifted one. The 01 pair is the one to get with findIndex.
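
To pin the idea down, here is the same construction modeled on ordinary Haskell lists (my own model code, not Clash; the hardware version would use fixed-size vectors, with the big OR being one wide gate per bit position):

import Data.List (findIndex)
import Data.Maybe (fromMaybe)

-- count leading zeros, per the construction above (non-empty input assumed)
clzModel :: [Bool] -> Int
clzModel bs = fromMaybe n (findIndex (== (False,True)) pairs)
  where
    n      = length bs
    -- each set bit at position i contributes ones strictly after i
    masks  = [ [ b && j > i | j <- [0..n-1] ] | (i,b) <- zip [0..] bs ]
    -- or all the masks together bitwise
    orMask = foldr (zipWith (||)) (replicate n False) masks
    -- zip against itself shifted one; the 01 pair marks the leading 1
    -- (the appended "or bs" catches a leading 1 in the last position)
    pairs  = zip orMask (tail orMask ++ [or bs])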

You probably do it that way or much better anyway, for all I know!