Arithmetic Operators in Hardware  Seima Rao  8/1/12 11:56 PM  Hi,
When introducing a new operator in Hardware, typically, its cost is estimated to understand tractability. Sometime back, I was analyzing the square root operator and thought of this: 1/2 a = x 1/2 1/2 b = (x + C) = x + f(x, C) where, 0 < f(x, C) < 1 The above merely states the following: 1/2 2 = 4 1/2 1/2 1/2 5 = (4 + 1) = 4 + f(4, 1) where 0 < f(4, 1) < 1 Is this a good path to research? Sincerely, Seima Rao. 
Arithmetic Operators in Hardware  Seima Rao  8/1/12 11:56 PM  
Re: Arithmetic Operators in Hardware  MitchAlsup  8/2/12 9:46 AM  I think you should look into both a) Newton Rapson iteration, and b) Goldschmidt's quadradic convergence divide. There are also polynomial functions that approximate SQRT at lowtomoderate costs.
In particular, most of the iterative SQRT loops contain: root [n+1] = root [n]*(1 + error[n]/2) drive[n+1] = drive[n]*(1 + error[n]/2)**2 error[n+1] = 1  drive[n+1] A little math and one finds that: error[n+1] = 3/4*error[n]**2 + 1/4*error[n]**3 Which guarentees quadradic convergence. When error[n] < 2**53 the iteration is done, and one proceeds to determine how to round the calculated root. Mitch 
Re: Arithmetic Operators in Hardware  jacko  8/2/12 12:11 PM  It depends what your after.
Subtraction can generate all calculation. 1/root(x), x*y, xy are enough for a reasonably efficient arithmetic. a = b*(cd*a*a)/2 is perhaps the best operator!! 
Re: Arithmetic Operators in Hardware  jacko  8/2/12 12:24 PM  > a = b*(cd*a*a)/2 is perhaps the best operator!!If only four float instructions are allowed with single operands, then a = b*(cd*a*a)/2 ... (specific a reg) set b from reg set c from reg set d from reg Setting the a reg via b=2, c=value, d=0. Of course then b, c and d become part of the needed save state on interrupt. 
Re: Arithmetic Operators in Hardware  BGB  8/2/12 6:00 PM  I before implemented support for float128 math in software, and used the
NewtonRaphson method to implement "sqrt()". it actually worked fairly well. 
Re: Arithmetic Operators in Hardware  jacko  8/2/12 6:19 PM  Maybe floating point should be one instruction.
Each opcode is just 4 register specifiers a, b, c, d. Forget the /2 as this is within b. The starting of a float operation sequence is the by a loading of the float sequence PC, which continues incrementally until the float sequence PC is changed. The FSPC could be like any other float register. The integer/ALU PC is separate, and it would have to have instructions to load and save to the float registers. The question then becomes do polynomial expansions scale well with this abcd operator? What is the optimal latency for this opcode, and in general use how many are interleaved say in series calculation... Is there a general method to decompose any calculation into a minimal number of abcd ops? 
Re: Arithmetic Operators in Hardware  jacko  8/2/12 7:24 PM  I just invented another 1 instruction CPU. Use all the memory as register space, and hyperbolic squaring under/overflow vectoring gives conditional branch. I'm not sure of its efficiency, but the ALU and instruction decode is easy, with no multiplexing. The instruction fetch has the complication of the exception vectoring, with say maybe a pair of vectors per block of code.
;D (C) Jacko 
Re: Arithmetic Operators in Hardware  jacko  8/3/12 12:03 AM  So the five cycle mul, mul, normviamul, sub, mul of the ALU overlaps with an abcd fetch a store cycle, and first instruction does a safe write. When the negamac (a*bc) architecture is compared, it can be seen the negamac has a four cycle pattern, and only achieves a computation occupancy of mul, norm, sub. This along with the fact that polynomials in powers of z (= x^2) can lead to faster computation on the invroot (this one of mine) five cycle architecture. Also the lower writeback pressure, leads to less cache problems.
I expect a 64 bit memory cell size, and 64K cells, for a 64 bit instruction size as a starting point. 512 KB total memory per CPU. Allocate a bit of I/O address space for the of chip or inter core serial interfaces. Forget about reading the PC memory location for a value and make it read a constant (or set of constants based on cycle count?)... How would it compare against a GPGPU or just a good float core. Contact by email if your are interested in the idea. Cheers Jacko 
Re: Arithmetic Operators in Hardware  Terje Mathisen  8/3/12 6:52 AM  BGB wrote:In my own 128bit fp library I simply used an 80bit fsqrt(1.0/x) as a starting point for a single NR iteration followed by a multiplication by x. Since I didn't actually bother with getting the rounding modes correct, this was sufficient. (AFAIR I did play some tricks with the exponent, making sure that it was within the proper range...) Terje   <Terje.Mathisen at tmsw.no> "almost all programming can be viewed as an exercise in caching" 
Re: Arithmetic Operators in Hardware  BGB  8/3/12 11:25 AM  On 8/3/2012 8:52 AM, Terje Mathisen wrote:I did it purely using integer math as the backing structure. probably not the fastest possible option, but it worked (anyways, a little fiddling would be needed anyways as IEEE float128 and x87 float80 values not exactly the same). in my own codegenerators (when applicable), I have typically used SSE registers to hold 128bit values (both integer and float). (owing to pretty much all recent HW having SSE2 or newer, I have tended to use this as a baseline, potentially using newer features if CPUID says they exist). when applicable, packedinteger operations may be used, but very often they may be stored temporarily on the stack followed by reverting to traditional integer operations, and loading the result back into an SSE register. some other logic instead assumes using pointers to boxed values instead (this is more how my current script VM handles it). > Terje > 
Re: Arithmetic Operators in Hardware  jacko  8/3/12 5:11 PM  It's not that we discourage you, it's that the best optimum of the present day IS optimal, by definition. The replacement of an optimum algorithm occurs infrequently, and this replacement frequency gets less given time. Do not let this stop you from trying. For it is only the challenge that provides the future.
Cheers Jacko 
Re: Arithmetic Operators in Hardware  ChrisQ  8/4/12 3:00 AM  On 08/04/12 00:11, jacko wrote:Wise words indeed. Many people will say you are mad if you describe a genuinely original idea. The satisfaction is proving it correct, but don't expect to be thanked for it :)... Regards, Chris 
Re: Arithmetic Operators in Hardware  Terje Mathisen  8/4/12 5:16 AM  BGB wrote: > I did it purely using integer math as the backing structure.My own fp128 was designed for speed mostly, I wrote it over a day or two to verify the FDIV/FPATAN2 fix/sw workaround we wrote around new year 94/95. I used a 1:31:96 format even though I know today that I should have had at least 107+ mantissa bits, i.e. something like 1:17:110 would have been better. Otoh the 96bit mantissa format would fit inside 3 32bit registers. :) On the gripping hand I would have needed something like fp160 in order to verify correct rounding for the 80bit fp hardware. 
Re: Arithmetic Operators in Hardware  bert  8/4/12 5:06 AM  No, it is not a good path to research. Many many
years ago when I wrote microcode, I found that it was remarkably simple to implement the square root function for floating point numbers with very few more steps than the division function, and that those steps to build up the square root were very similar to the steps in building up the quotient of a division. Consequently, it was enormously faster than any of the convergent algorithms in which division occurs repeatedly! The method is one which I learned at school in the early 1950's, being a variation of manual long division, and you can find it described on the web if you look hard. I suppose there is too much investment in existing floating point chips for any change to occur now, but I would surmise that if floating point square root was a useful function to have in hardware, it could be implemented very similarly to division, and with a very similar performance.  
Re: Arithmetic Operators in Hardware  Terje Mathisen  8/4/12 5:20 AM  It took me 1,5 hours to show my university lecturer that the algorithm I
had invented as part of an "Algorithms and Data Structures" exam was in fact O(n*log(n)), which was required for a passing grade on this question. It was _very_ different from the textbook answer which I didn't know about. :) 
Re: Arithmetic Operators in Hardware  ChrisQ  8/4/12 5:39 AM  On 08/04/12 12:20, Terje Mathisen wrote:I'm more engineer than mathematician and engineers use as much math as is needed to solve a given problem to the required accuracy. Perhaps I should have paid more attention, since the really clever stuff is the ability to optimise a math requirement for computing with limited resources. Something that's often not at all obvious without an extensive math background... Regards, Chris 
Re: Arithmetic Operators in Hardware  BGB  8/4/12 9:28 AM  On 8/4/2012 7:16 AM, Terje Mathisen wrote:mine was mostly because it made sense to have a higher precision type, and float80 was an awkward size (a lot of my stuff basically just handles float80 by padding it to 128 bits anyways). it is a problematic type as well because MSVC doesn't use it, but rather just interprets "long double" as the same as "double". the cost though is that float128 (in my VM, and also my C compiler when it still worked) is considerably slower than using double. FWIW, there is also int128, but the performance hit isn't as bad. mine implemented IEEE 754r float128 values. IIRC, these are something like 1.15.112. a float256 format could be: 1.31.224 there is also some code for float16 / halffloat, but the usual strategy here is basically just to fudge them to/from 32bit floats. for float16 and smaller (8 or 12bit variants), a partial advantage is that using a LUT becomes more practical for decoding them. I also have several "tweaked" formats: float28, and float48. both are mostly intended to be used in pointers as "flonum" types, and are essentially just shaved down floats and doubles. FWIW, it is also possible to shove a pointer into a double (as NaN values), and several script VMs actually do this, but my current VM does not. > Terje > 
Re: Arithmetic Operators in Hardware  BGB  8/4/12 11:16 AM  I recently had tried to argue with someone that a math expression which
was written slightly differently was: mathematically equivalent to the original; worked more reliably on computers (vs the original), due to differences in the rounding behavior (the original form would lose 1bit of accuracy, the latter would preserve it, thus being integerexact). but, I am hardly all that good with math, but in this case pretty much everything was within the confines of algebra (moreorless, nevermind the concept of roundoff). in this case, the calculation being integerexact (and reversible) was an important point of the calculation, and I saw little point on people insisting on a form of the calculation which would often result in the wrong answer (I saw it, got suspicious, and rigged up a test that had confirmed that, yes, the original calculation was not reliable, and that the reworked calculation was). but, whatever sometimes, luckily I have authority over my own code... (note: technically this was related to a lossless variant of JPEG which remains backwards compatible with lossy decoders, and works mostly by using a tweaked DCT transform and colorspace conversions). > Regards, > > Chris > > > 
Re: Arithmetic Operators in Hardware  Terje Mathisen  8/4/12 12:39 PM  BGB wrote:Nice! 
Re: Arithmetic Operators in Hardware  BGB  8/4/12 7:37 PM  On 8/4/2012 2:39 PM, Terje Mathisen wrote:yeah. FWIW, the guy working on libjpeg has added similar functionality, but there are minor differences in the implementation (so, full compatibility is less certain). the basic idea though it that an alternate version of DCT is used (SERMS RDCT), which differs from the normal DCT in being fully reversible, but is also still compatible with the (lossy) DCT. also, all quantization values are set to 1 (no quantization). the color conversion is a little harder, since the usual YCbCr color conversion ends up using a smaller output range than the input range, and so can't be made lossless. the solution is partly to use an alternate color conversion with a larger value range (based on JPEG2000 RCT). a downside of this though is that, for decoders which are not aware of the alternate colorspace, the colors come out "slightly off" (mostly the color saturation is a bit higher). the math in question mostly had to do with the RCT color conversions. another theoretical option would have been the use of rawRGB, but of the apps I had to test with, none seemed to recognize the markers to indicate the use of the RGB colorspace, so it was a tradeoff between "slightly off" and "totally wrong" (most apps don't recognize much outside the range of generic JFIF style images). (the main alternative, used by libjpeg, would require storing an alternate image for the lossless version, considerably adding to the image size). technically, my implementation is domain specific and has a number of nonstandard extensions to JPEG anyways (mostly adding stuff relevant to textures in a 3D engine, such as alpha blending, depth/normal maps, specular maps, luminance maps, support for limitedprecision floatingpoint data in images, ...). these extensions are backwards compatible in that an unaware implementation will generally just ignore them. > Terje > > 
Re: Arithmetic Operators in Hardware  Quadibloc  8/5/12 3:01 PM  On Aug 4, 6:06 am, bert <bert.hutchi...@btinternet.com> wrote:Indeed. I describe it on my web site at http://www.quadibloc.com/comp/cp0202.htm however, that is *not* an optimal method for square root. Considerably faster algorithms for division exist than long division  Goldschmidt division, for example  and so NewtonRaphson iteration would be faster than an implementation of this algorithm. For a system with minimal hardware, however, another alternative  CORDIC  is also faster than the traditional paperandpencil square root method. John Savard 
Re: Arithmetic Operators in Hardware  jacko  8/7/12 5:27 PM  So some NaN exception hardware, and maybe some limit point exception hardware with fast obvious result vectoring. Maybe add address and data delta ++/ values. Maybe some 1, 0, 1 specials of mul bypass, and a multiplier allocation pool. Ummm.
Cheers Jacko C ...? J ...? FORTRAN ...? 
Re: Arithmetic Operators in Hardware  jacko  8/9/12 5:23 PM  In terms of opcode compression it has maybe address usage constraints. If address d contains 0, then address a is still required, but need not be fetched, but may have to be fetched, to know that the same value need not be flushed. If address b contains 0, then c and d are not needed. This leads to an operand ordering, and a potential 2 bit loss of resolution on all data storage addressing, or some alignment constraints on data. Is this opcode compacted code worth the execute order penalty? Or is the opcode space a possibility? What is the best action on c and d knowing a can be a write target?
Cheers Jacko 
Re: Arithmetic Operators in Hardware  jacko  9/6/12 5:41 PM  https://github.com/jackokring/jarvar/wiki/DSP for a version 1 specification of a 1KB DSP unit.
