I'll carry on to do (single precision) floating point addition and subtraction while I still have strength ... I've been leaving them because I know there'll be lots of cases and that's a headache and I'm sure to make a mistake ...
First let me correct my implementation of division. I can't count. I meant to shift the dividend left by 24 bits, not 23 bits. That way it clears the divisor, itself extended to 24 bits from the 23 bits in the IEEE representation by putting back the missing leading 1. That's how one gets maximal possible accuracy in a/b via integer division. Write it as a0000000/b and scale the extra 0s off again afterwards. I'll also put in an extra trap for 0 as dividend, else one gets a VSN ("very small number" - trademark) as result instead of 0. Shrug.
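To make the a0000000/b idea concrete, here is a small sketch in plain Haskell. It uses Word64 rather than the Word48 used below, and the names significand, scaledQuotient and topBit are invented for this illustration. It only shows where the leading 1 of the quotient ends up for two easy cases.

```haskell
import Data.Bits (shiftL, setBit, finiteBitSize, countLeadingZeros)
import Data.Word (Word64)

-- Hypothetical helper: put the hidden leading 1 back on a 23-bit mantissa
-- field, giving a 24-bit significand with its leading 1 in bit 23.
significand :: Word64 -> Word64
significand m = setBit m 23

-- Hypothetical helper: shift the dividend left by 24 bits so that it clears
-- the 24-bit divisor, then divide as integers.
scaledQuotient :: Word64 -> Word64 -> Word64
scaledQuotient a b = (a `shiftL` 24) `div` b

-- Index of the most significant set bit (argument assumed nonzero).
topBit :: Word64 -> Int
topBit x = finiteBitSize x - 1 - countLeadingZeros x

main :: IO ()
main = do
  let one          = significand 0         -- 1.0
      onePointFive = significand 0x400000  -- 1.5
  print (topBit (scaledQuotient onePointFive one))  -- 1.5 / 1.0: prints 24
  print (topBit (scaledQuotient one onePointFive))  -- 1.0 / 1.5: prints 23
```

So the quotient's leading 1 lands in bit 24 when the dividend's significand is at least as large as the divisor's, and in bit 23 otherwise. A complete implementation would presumably need to check which case it is in before truncating to 23 bits and settling the exponent.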
Recall that the IEEE arithmetic representation is a triple (Bool,Int8,Word23) representing sign, exponent, mantissa respectively, and the value it represents is (informally) sign * 1.mantissa * 2^exponent . The binary "interchange" format for that is a straight map bitwise of that triple, elements taken in order left to right, numbers big-endian (I think! I haven't done anything that would check the internal arrangement). The exponent is stored biased rather than in 2s complement: 127 (maxPos in Int8) as literal exponent really means 0, so I will write exponent-127 whenever I want the value. I'm not confused.
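For anyone who wants to poke at the bit layout without the hardware types, here is a rough sketch using only GHC's base library. The names decodeBits and encodeBits are made up for this illustration (they are not the mydecodeFloat/myencodeFloat used in the tests further down, whose definitions are not shown in this post), and Word8/Word32 stand in for Int8/Word23.

```haskell
import Data.Bits (shiftL, shiftR, testBit, (.&.), (.|.))
import Data.Word (Word8, Word32)
import GHC.Float (castFloatToWord32, castWord32ToFloat)

-- Hypothetical: split a Float into (sign, biased exponent, mantissa field).
decodeBits :: Float -> (Bool, Word8, Word32)
decodeBits x = (testBit w 31, fromIntegral ((w `shiftR` 23) .&. 0xFF), w .&. 0x7FFFFF)
  where w = castFloatToWord32 x

-- Hypothetical: reassemble the three fields into a Float.
encodeBits :: (Bool, Word8, Word32) -> Float
encodeBits (s, e, m) =
    castWord32ToFloat (signBit .|. (fromIntegral e `shiftL` 23) .|. (m .&. 0x7FFFFF))
  where signBit = if s then 0x80000000 else 0

main :: IO ()
main = do
  print (decodeBits 1.0)                        -- (False,127,0)
  print (decodeBits (-2.5))                     -- (True,128,2097152)
  print (encodeBits (decodeBits 17.3) == 17.3)  -- True
```

The first line of output shows the bias at work: 1.0 is 1.0 * 2^0, and the exponent field holds 127.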
myDivFloat :: (Bool,Int8,Word23) -> (Bool,Int8,Word23) -> (Bool,Int8,Word23)
myDivFloat (s1,0,0)   (s2,e2,m2) = (s1/=s2,0,0)               -- (new trap for 0/..)
myDivFloat (s1,e1,m1) (s2,e2,m2) = (s,e,m)
  where s   = s1 /= s2 :: Bool
        e   = (e1 - 127) - (e2 - 127) + 128 :: Int8          -- (was +127)
        m1' = setBit (zeroExtend m1) 23 :: Word48            -- calculate in 48b
        m2' = setBit (zeroExtend m2) 23 :: Word48
        m   = truncateB ((m1' .<<. 24) `div` m2') :: Word23  -- (was .<<. 23)
I should trap for 0 as divisor too, but I don't recall the NaN format (one of them!) so I won't. Please somebody look it up and trap for ../0.
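For the record, in IEEE 754 single precision an exponent field of all ones (255) with a zero mantissa is an infinity, and all ones with a nonzero mantissa is a NaN. A nonzero finite number divided by zero gives an infinity of the appropriate sign, and 0/0 gives a NaN. A possible trap along those lines is sketched below in plain Haskell. The names Triple, infinity, quietNaN and divByZeroTrap are hypothetical, and Word8/Word32 stand in for Int8/Word23.

```haskell
import Data.Word (Word8, Word32)

-- Hypothetical stand-in for the (Bool,Int8,Word23) triple.
type Triple = (Bool, Word8, Word32)

-- Exponent field all ones, mantissa zero: infinity of the given sign.
infinity :: Bool -> Triple
infinity s = (s, 0xFF, 0)

-- Exponent field all ones, mantissa nonzero: a NaN. Setting the top
-- mantissa bit (bit 22) gives the usual "quiet" NaN.
quietNaN :: Triple
quietNaN = (False, 0xFF, 0x400000)

-- Hypothetical trap to try before the ordinary division: Just a special
-- result when the divisor is zero, Nothing when normal division applies.
divByZeroTrap :: Triple -> Triple -> Maybe Triple
divByZeroTrap (s1, e1, m1) (s2, 0, 0)
  | e1 == 0 && m1 == 0    = Just quietNaN               -- 0/0
  | e1 == 0xFF && m1 /= 0 = Just quietNaN               -- NaN/0
  | otherwise             = Just (infinity (s1 /= s2))  -- x/0, x nonzero
divByZeroTrap _ _ = Nothing

main :: IO ()
main = do
  print (divByZeroTrap (False, 127, 0) (True, 0, 0))   -- Just (True,255,0)
  print (divByZeroTrap (False, 0, 0) (False, 0, 0))    -- Just (False,255,4194304)
  print (divByZeroTrap (False, 127, 0) (False, 127, 0)) -- Nothing
```

Like the rest of the code here, this treats (_,0,0) as the only zero and ignores denormals.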
Hmm. I also forgot the trap for 0 as multiplicand in the multiplication operator. Please rescue as follows:
myMulFloat :: (Bool,Int8,Word23) -> (Bool,Int8,Word23) -> (Bool,Int8,Word23)
myMulFloat (s1,0,0) (s2,e2,m2) = (s1/=s2,0,0)
myMulFloat (s1,e1,m1) (s2,0,0) = (s1/=s2,0,0)
myMulFloat (s1,e1,m1) (s2,e2,m2) = ...
The s1/=s2 business is so one can have the thrill of seeing an occasional "-0.0" pop out as result. Who am I to say no.
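A quick check in plain GHC of what a negative zero looks like from the outside; this is only an illustration using the native Float type, not the triple code above, and negZero is a name made up for it.

```haskell
import GHC.Float (castWord32ToFloat)

-- Sign bit set, exponent and mantissa fields all zero.
negZero :: Float
negZero = castWord32ToFloat 0x80000000

main :: IO ()
main = do
  print negZero                   -- prints -0.0
  print (negZero == 0.0)          -- True: compares equal to +0.0
  print (isNegativeZero negZero)  -- True: but the sign bit is observable
```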
So, apologies, mea culpa, and errata apart, take a deep breath and on to addition/subtraction. I'll get rid of having to consider the sign bit by reducing everything to sum of two positive numbers, or difference of two positive numbers, before even thinking:
myAddFloat :: (Bool,Int8,Word23) -> (Bool,Int8,Word23) -> (Bool,Int8,Word23)
myAddFloat (s1,e1,m1) (s2,e2,m2) =
  case (s1,s2) of
    (False,False) -> myAddFloatPos (e1,m1) (e2,m2)                -- pos + pos
    (False,True)  -> mySubFloatPos (e1,m1) (e2,m2)                -- pos + neg
    (True,False)  -> let (s,e,m) = mySubFloatPos (e1,m1) (e2,m2)  -- neg + pos
                     in (not s,e,m)
    (True,True)   -> let (s,e,m) = myAddFloatPos (e1,m1) (e2,m2)  -- neg + neg
                     in (not s,e,m)
mySubFloat :: (Bool,Int8,Word23) -> (Bool,Int8,Word23) -> (Bool,Int8,Word23)
mySubFloat (s1,e1,m1) (s2,e2,m2) = myAddFloat (s1,e1,m1) (not s2,e2,m2)
I set subtraction to be just addition of a negative, since all I have to do to negate a number is change its sign bit in the IEEE triple ... and hope. (You can just see that I ought to trap for zero here too, but let's go on without ...).
All I have to do to add up two positive numbers is put back the leading 1 on the mantissas and add as integers, then take the leading 1 back off to get the mantissa of the result.
I'll have to watch for the leading 1 moving left one place (when one does large + large), in order to hit it when aiming to knock it off, which is the "testBit m 24" below. The addends have leading 1 in position 23, so after adding, the leading 1 either stays in position 23 or moves one left to position 24. In the latter case I have to drop the 0th bit of the result, losing 1 bit of precision, but keeping to the allotted 23 bits for the mantissa. That means a shift right by 1 (" .>>. 1"). The leading 1 gets dropped in any case by the obligatory truncation to 23 bits for the mantissa.
myAddFloatPos :: (Int8,Word23) -> (Int8,Word23) -> (Bool,Int8,Word23)
myAddFloatPos (e1,m1) (e2,m2) =
    if testBit m 24
    then (False,e+1,truncateB (m .>>. 1) :: Word23)  -- grew!
    else (False,e,  truncateB m :: Word23)
  where
    m1' = setBit (zeroExtend m1) 23 :: Word25        -- calculate in 25b
    m2' = setBit (zeroExtend m2) 23 :: Word25
    (e,m) = case undefined of
      _ | e1-e2 > 0 -> (e1, m1' + (m2' .>>. fromEnum(e1-e2)))
      _ | e1-e2 < 0 -> (e2, (m1' .>>. fromEnum(e2-e1)) + m2')
      _ | e1 == e2  -> (e1, m1' + m2')
I had to shift the smaller addend to the right to match up its digits in the correct places with the digits of the larger addend. That loses some precision. The more caring should round rather than truncate when doing the shift right.
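As a sketch of what "round rather than truncate" could look like when aligning the smaller addend: add half the weight of the last kept bit before shifting. The name shiftRRound is hypothetical, Word32 stands in for Word25, and this is plain round-half-up rather than the round-half-to-even that IEEE uses by default.

```haskell
import Data.Bits (shiftR, bit)
import Data.Word (Word32)

-- Hypothetical helper: shift right by n bits, rounding to nearest
-- (ties go up) instead of truncating. Assumes x fits in 25 bits, as the
-- significands here do, so the addition cannot overflow.
shiftRRound :: Word32 -> Int -> Word32
shiftRRound x n
  | n <= 0    = x
  | n > 31    = 0
  | otherwise = (x + bit (n - 1)) `shiftR` n

main :: IO ()
main = do
  print (0xB `shiftR` 2 :: Word32)  -- 1011b truncated: prints 2
  print (shiftRRound 0xB 2)         -- 1011b rounded:   prints 3
```

In the addition above this would replace the two ".>>. fromEnum(...)" shifts; the final shift right by 1 in the "grew!" case could be treated the same way.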
I have yet to handle subtraction of two positive numbers (below). This splits into several cases according to whether to expect a positive (nonnegative) or negative result. Those cases are large - small and small - large, plus a couple in which it is harder to tell. One can primarily tell by which has the larger exponent. If the exponents are equal, one can tell by which has the larger mantissa. If the mantissas are equal too, then the answer for the subtraction is 0. I will use "ff1" to return the position of a leading 1:
mySubFloatPos :: (Int8,Word23) -> (Int8,Word23) -> (Bool,Int8,Word23)
mySubFloatPos (e1,m1) (e2,m2) = (s,e',m')
  where
    m1' = setBit (zeroExtend m1) 23 :: Word25   -- calculate in 25b
    m2' = setBit (zeroExtend m2) 23 :: Word25
    -- shift left or right if leading 1 not in expected posn 23
    (e',m') = let n = if e1==e2 && m1==m2 then 23 else ff1 m
              in
              case n of
                23 -> (e, truncateB m :: Word23)
                _ | n > 23   -- impossible
                    -> (e-toEnum(23-n),truncateB (m .>>. (n-23)) :: Word23)
                _ | n < 23   -- cancellation moved the leading 1 right
                    -> (e-toEnum(23-n),truncateB (m .<<. (23-n)) :: Word23)
    (s,e,m) = case undefined of
      _ | e1-e2 > 0 -> (False,e1,m1'-(m2' .>>. fromEnum(e1-e2)))  -- big minus small
      _ | e1-e2 < 0 -> (True,e2, m2'-(m1' .>>. fromEnum(e2-e1)))  -- small minus big
      _ | m2 < m1   -> (False,e1,m1'-m2')                         -- big minus small
      _ | m1 < m2   -> (True,e2, m2'-m1')                         -- small minus big
      _             -> (False,0,0)                                -- x - x = 0
In the x-x case, I have sneaked a (_,0,0) result by setting an imaginary "leading 1 position" on the intermediate result at 23, the expected default position, which will stop the code attempting to further mutate it. (That I have to explain it means it is wrong code, in software engineering terms -- just trap for a zero result at top level and skip the subterfuge).
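The "ff1" helper is not defined in this post. A minimal sketch of what it might be, in plain Haskell with Word32 standing in for Word25, together with the renormalisation step it feeds (normalise is a made-up name, and it works on an unbiased Int exponent for readability):

```haskell
import Data.Bits (finiteBitSize, countLeadingZeros, shiftL)
import Data.Word (Word32)

-- Hypothetical ff1: index of the most significant set bit (bit 0 is the
-- least significant). Returns -1 for a zero argument.
ff1 :: Word32 -> Int
ff1 x = finiteBitSize x - 1 - countLeadingZeros x

-- Renormalise after a subtraction: move the leading 1 back up to bit 23
-- and lower the exponent by the same number of places.
-- Assumes m is nonzero and its leading 1 is at or below bit 23.
normalise :: Int -> Word32 -> (Int, Word32)
normalise e m = (e - k, m `shiftL` k)
  where k = 23 - ff1 m

main :: IO ()
main = do
  print (ff1 0x800000)          -- prints 23
  print (ff1 0x100000)          -- prints 20
  print (normalise 4 0x100000)  -- prints (1,8388608)
```

The last line is the shape of a difference of 2.0 between two numbers in the range 16 to 32: the subtraction leaves the leading 1 three places too low at exponent 4, and renormalising gives 1.0 * 2^1.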
All that needs much more testing than I have given it. The subtraction is the more delicate code. It invokes the addition. So here is a test:
> myencodeFloat(mySubFloat (mydecodeFloat 17.3) (mydecodeFloat 9.3))
8.0
> myencodeFloat(mySubFloat (mydecodeFloat 17.3) (mydecodeFloat 19.3))
-2.0
Yes, I have tried subtracting and adding negative numbers. Please try more and let me know what I got wrong.
I forgot to give a function that maps from float to integer. This one truncates towards zero (which I hate ... but may even be right in some universe, maybe the IEEE one too!). Here it is. Recall that cast = unpack . pack maps equal-sized objects onto each other bitwise. Again, I'll remove the sign bit first before even thinking about it, in order to simplify this problem. That already guarantees symmetrical behavior around zero, for better or worse:
myFloatToInt :: (Bool,Int8,Word23) -> Int
myFloatToInt (_,0,0) = 0
myFloatToInt (True,e,m) = - myFloatToIntPos (e,m)
myFloatToInt (False,e,m) = myFloatToIntPos (e,m)
-- needs Word/Int to have more than 24 bits
myFloatToIntPos :: (Int8,Word23) -> Int
myFloatToIntPos (e,m) = if e-127 > 23
                        then m' .<<. fromEnum ((e-127)-23)
                        else m' .>>. fromEnum (23-(e-127))
  where m' = cast(setBit (zeroExtend m) 23) :: Int
(A late reminder that .<<. is shiftL and .>>. is shiftR). Test:
> myFloatToInt (mydecodeFloat 98.7)
98
> myFloatToInt (mydecodeFloat (-101.1))
-101
That's what I mean by symmetrical truncation towards zero. One might have expected -101.1 to go to -102 as an integer instead, which is what rounding towards minus infinity (floor) gives. I certainly do. Rounding is definitely preferable to truncation here, in terms of creating least surprise. I'll do that if requested.
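For comparison, here is a sketch in plain Haskell of truncation next to a floor variant, which is the one that sends -101.1 to -102. The names Triple, magnitude, truncToInt and floorToInt are hypothetical, Word8/Word32 stand in for Int8/Word23, and only normal numbers small enough to fit in an Int are handled.

```haskell
import Data.Bits (shiftL, shiftR, setBit, bit, (.&.))
import Data.Word (Word8, Word32)

-- Hypothetical stand-in for the (Bool,Int8,Word23) triple.
type Triple = (Bool, Word8, Word32)

-- Integer part of the magnitude, plus a flag saying whether any nonzero
-- fraction bits were shifted out.
magnitude :: (Word8, Word32) -> (Int, Bool)
magnitude (e, m)
  | k >= 0    = (sig `shiftL` k, False)
  | k < -24   = (0, True)
  | otherwise = (sig `shiftR` negate k, sig .&. (bit (negate k) - 1) /= 0)
  where sig = setBit (fromIntegral m) 23 :: Int
        k   = fromIntegral e - 127 - 23 :: Int  -- net shift left

truncToInt, floorToInt :: Triple -> Int
truncToInt (_, 0, 0) = 0
truncToInt (s, e, m) = let (i, _) = magnitude (e, m) in if s then negate i else i

floorToInt (_, 0, 0) = 0
floorToInt (s, e, m) = case (s, magnitude (e, m)) of
  (False, (i, _))     -> i
  (True,  (i, False)) -> negate i      -- already an integer
  (True,  (i, True))  -> negate i - 1  -- step down past the fraction

main :: IO ()
main = do
  let x = (True, 133, 0x4A3333)  -- the bits of -101.1 as a single
  print (truncToInt x)           -- prints -101
  print (floorToInt x)           -- prints -102
```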
I can't think of anything much else one might need as floating point ops. Modulus and remainder, perhaps. Please do translate to primitive operation "templates" in the backend languages ... and add traps for zero etc to taste. It's not going to be terribly wrong whatever you do, because IEEE as a standard was originally intended to be inclusive. At worst some people might tut-tut until a correction.
That I've done only single precision floating point operations is not significant, other than that I intended to leave no space for the argument that floating point logic cannot fit in one cycle.
For double precision, change some of the bitsize numbers.
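The numbers that change are the field widths and the bias: IEEE 754 double precision has 1 sign bit, 11 exponent bits and 52 mantissa bits, with an exponent bias of 1023. A rough sketch in plain Haskell, where Format and decodeDouble are names invented for this illustration:

```haskell
import Data.Bits (shiftR, testBit, bit, (.&.))
import Data.Word (Word64)
import GHC.Float (castDoubleToWord64)

-- Field widths and exponent bias for the two IEEE 754 formats.
data Format = Format { expBits :: Int, manBits :: Int, bias :: Int }
  deriving (Show, Eq)

single, double :: Format
single = Format { expBits = 8,  manBits = 23, bias = 127 }
double = Format { expBits = 11, manBits = 52, bias = 1023 }

-- Hypothetical decoder: (sign, biased exponent, mantissa field) of a Double.
decodeDouble :: Double -> (Bool, Int, Word64)
decodeDouble x = (testBit w 63, fromIntegral expField, w .&. manMask)
  where w        = castDoubleToWord64 x
        manMask  = bit (manBits double) - 1
        expField = (w `shiftR` manBits double) .&. (bit (expBits double) - 1)

main :: IO ()
main = do
  print (decodeDouble 1.0)     -- (False,1023,0)
  print (decodeDouble (-2.5))  -- (True,1024,1125899906842624)
```

The working widths scale the same way: the leading 1 goes back at bit 52 rather than bit 23, and the intermediate integers need correspondingly more bits (around 106 for the scaled division).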
I don't think we're far off being able to handle even 64-bit floating point logic in one cycle. Addition and subtraction should fit.
Regards
Peter