
Arithmetic Operators in Hardware


Seima

Aug 2, 2012, 2:56:48 AM
Hi,

When introducing a new operator in hardware, its cost is typically
estimated to understand whether it is tractable.

Sometime back, I was analyzing the square root operator and thought
of this:

a = x^(1/2)

b = (x + C)^(1/2) = x^(1/2) + f(x, C)

where

0 < f(x, C) < 1

The above merely states the following:

2 = 4^(1/2)

5^(1/2) = (4 + 1)^(1/2) = 4^(1/2) + f(4, 1)

where

0 < f(4, 1) < 1
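
(A step left implicit above: multiplying and dividing by the conjugate gives

f(x, C) = (x + C)^(1/2) - x^(1/2) = C / ((x + C)^(1/2) + x^(1/2))

so f(4, 1) = 1/(5^(1/2) + 2) ~= 0.236, and the bound 0 < f(x, C) < 1
holds exactly when 0 < C < (x + C)^(1/2) + x^(1/2).)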

Is this a good path to research?

Sincerely,
Seima Rao.

MitchAlsup

Aug 2, 2012, 12:46:50 PM
I think you should look into both a) Newton-Raphson iteration, and b) Goldschmidt's quadratic-convergence divide. There are also polynomial functions that approximate SQRT at low-to-moderate cost.

In particular, most of the iterative SQRT loops contain:

root[n+1]  = root[n]*(1 + error[n]/2)
drive[n+1] = drive[n]*(1 + error[n]/2)**2
error[n+1] = 1 - drive[n+1]

A little math and one finds that:

error[n+1] = 3/4*error[n]**2 + 1/4*error[n]**3

Which guarantees quadratic convergence.

When error[n] < 2**-53 the iteration is done, and one proceeds to determine how to round the calculated root.

Mitch
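
A minimal C sketch of the loop Mitch describes (an illustration under
assumptions, not his hardware: plain doubles, a deliberately crude seed
y = 1 in place of the table lookup real hardware would use, and no
proper final rounding step):

#include <math.h>
#include <stdio.h>

/* Goldschmidt-style square root: root converges to sqrt(x), drive to 1.
 * The stop threshold is a little looser than the 2**-53 above because
 * this sketch itself runs in doubles and the last bit is rounding noise. */
static double gold_sqrt(double x)
{
    if (x <= 0.0)
        return (x == 0.0) ? 0.0 : NAN;

    /* Range-reduce to m in [0.5, 2) with an even exponent, so that
       sqrt(x) = sqrt(m) * 2^(e/2); hardware does the same with the
       floating-point exponent field. */
    int e;
    double m = frexp(x, &e);            /* x = m * 2^e, m in [0.5, 1) */
    if (e & 1) { m *= 2.0; e--; }

    double root  = m;                   /* m*y with seed y = 1: -> sqrt(m) */
    double drive = m;                   /* m*y*y:               -> 1       */
    double error = 1.0 - drive;

    for (int i = 0; i < 10 && fabs(error) > 0x1p-50; i++) {
        double r = 1.0 + error / 2.0;   /* the (1 + error[n]/2) factor  */
        root  *= r;
        drive *= r * r;
        error  = 1.0 - drive;           /* error[n+1] ~= 3/4*error[n]^2 */
    }
    return ldexp(root, e / 2);          /* restore the exponent */
}

int main(void)
{
    printf("%.17g\n", gold_sqrt(2.0));  /* ~1.4142135623730951 */
    return 0;
}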

jacko

Aug 2, 2012, 3:11:26 PM
It depends what you're after.

Subtraction can generate all calculation.

1/root(x), x*y, x-y are enough for a reasonably efficient arithmetic.

a = b*(c-d*a*a)/2 is perhaps the best operator!!
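
An illustrative C sketch (mine, not from the post): with b fed the
previous estimate, c = 3 and d = x, this single operator is exactly one
Newton-Raphson step for 1/sqrt(x), since the true value satisfies
a = a*(3 - x*a*a)/2.

#include <stdio.h>

int main(void)
{
    double x = 2.0;
    double a = 0.7;                       /* rough seed for 1/sqrt(2) ~ 0.707 */
    double c = 3.0, d = x;

    for (int i = 0; i < 6; i++) {
        double b = a;                     /* feed the previous estimate back in */
        a = b * (c - d * a * a) / 2.0;    /* the abcd operator */
    }
    printf("1/sqrt(%g) ~= %.17g\n", x, a);  /* ~0.70710678118654752 */
    return 0;
}

This also shows why b, c and d end up as state that has to be set per
use (and saved on interrupt), as the follow-up post notes.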

jacko

Aug 2, 2012, 3:24:58 PM
> a = b*(c-d*a*a)/2 is perhaps the best operator!!

If only four float instructions are allowed with single operands, then

a = b*(c-d*a*a)/2 ... (specific a reg)
set b from reg
set c from reg
set d from reg

Setting the a reg via b=2, c=value, d=0.

Of course then b, c and d become part of the needed save state on interrupt.

BGB

Aug 2, 2012, 9:00:02 PM
I previously implemented support for float128 math in software, and used the
Newton-Raphson method to implement "sqrt()". It actually worked fairly well.


jacko

Aug 2, 2012, 9:19:45 PM
Maybe floating point should be one instruction.

Each opcode is just 4 register specifiers a, b, c, d. Forget the /2 as this is within b. A float operation sequence is started by loading the float sequence PC, which then continues incrementally until the float sequence PC is changed. The FSPC could be like any other float register. The integer/ALU PC is separate, and it would have to have instructions to load and save to the float registers. The question then becomes: do polynomial expansions scale well with this abcd operator? What is the optimal latency for this opcode, and in general use how many are interleaved, say in a series calculation...

Is there a general method to decompose any calculation into a minimal number of abcd ops?

jacko

Aug 2, 2012, 10:24:18 PM
I just invented another 1-instruction CPU. Use all the memory as register space, and hyperbolic squaring under/overflow vectoring gives the conditional branch. I'm not sure of its efficiency, but the ALU and instruction decode are easy, with no multiplexing. The instruction fetch has the complication of the exception vectoring, with say maybe a pair of vectors per block of code.

;D
(C) Jacko

jacko

Aug 3, 2012, 3:03:19 AM
So the five-cycle mul, mul, norm-via-mul, sub, mul of the ALU overlaps with an abcd fetch and a store cycle, and the first instruction does a safe write. When the negamac (a*b - c) architecture is compared, it can be seen that the negamac has a four-cycle pattern and only achieves a computation occupancy of mul, norm, sub. This, along with the fact that polynomials in powers of z (= x^2) can be evaluated faster, favours the invroot (this one is mine) five-cycle architecture. Also, the lower write-back pressure leads to fewer cache problems.

I expect a 64-bit memory cell size, and 64K cells, for a 64-bit instruction size as a starting point: 512 KB total memory per CPU. Allocate a bit of I/O address space for the off-chip or inter-core serial interfaces. Forget about reading the PC memory location for a value and make it read a constant (or a set of constants based on cycle count?)...

How would it compare against a GPGPU or just a good float core? Contact me by e-mail if you are interested in the idea.

Cheers Jacko

Terje Mathisen

Aug 3, 2012, 9:52:40 AM
BGB wrote:
> I before implemented support for float128 math in software, and used the
> Newton-Raphson method to implement "sqrt()". it actually worked fairly
> well.

In my own 128-bit fp library I simply used an 80-bit fsqrt(1.0/x) as a
starting point for a single NR iteration followed by a multiplication by x.

Since I didn't actually bother with getting the rounding modes correct,
this was sufficient.

(AFAIR I did play some tricks with the exponent, making sure that it was
within the proper range...)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

BGB

Aug 3, 2012, 2:25:05 PM
On 8/3/2012 8:52 AM, Terje Mathisen wrote:
> BGB wrote:
>> I before implemented support for float128 math in software, and used the
>> Newton-Raphson method to implement "sqrt()". it actually worked fairly
>> well.
>
> In my own 128-bit fp library I simply used an 80-bit fsqrt(1.0/x) as a
> starting point for a single NR iteration followed by a multiplication by x.
>
> Since I didn't actually bother with getting the rounding modes correct,
> this was sufficient.
>
> (AFAIR I did play some tricks with the exponent, making sure that it was
> within the proper range...)
>

I did it purely using integer math as the backing structure.
Probably not the fastest possible option, but it worked (anyway, a
little fiddling would be needed, as IEEE float128 and x87 float80
values are not exactly the same).


In my own code-generators (when applicable), I have typically used SSE
registers to hold 128-bit values (both integer and float), owing to
pretty much all recent HW having SSE2 or newer; I have tended to use
this as a baseline, potentially using newer features if CPUID says they
exist.

When applicable, packed-integer operations may be used, but very often
the values are stored temporarily on the stack, handled with
traditional integer operations, and the result loaded back into an SSE
register.

Some other logic assumes using pointers to boxed values instead
(this is more how my current script VM handles it).


> Terje
>

jacko

Aug 3, 2012, 8:11:37 PM
It's not that we discourage you, it's that the best optimum of the present day IS optimal, by definition. The replacement of an optimal algorithm occurs infrequently, and this replacement frequency decreases over time. Do not let this stop you from trying. For it is only the challenge that provides the future.

Cheers Jacko

ChrisQ

Aug 4, 2012, 6:00:03 AM
On 08/04/12 00:11, jacko wrote:

> Do not let this stop you from trying. For it is only
>the challenge that provides the future.
>
> Cheers Jacko

Wise words indeed. Many people will say you are mad if you
describe a genuinely original idea. The satisfaction is
proving it correct, but don't expect to be thanked for it :-)...

Regards,

Chris

Terje Mathisen

Aug 4, 2012, 8:16:57 AM
BGB wrote:
> On 8/3/2012 8:52 AM, Terje Mathisen wrote:
>> BGB wrote:
>>> I before implemented support for float128 math in software, and used the
>>> Newton-Raphson method to implement "sqrt()". it actually worked fairly
>>> well.
>>
>> In my own 128-bit fp library I simply used an 80-bit fsqrt(1.0/x) as a
>> starting point for a single NR iteration followed by a multiplication
>> by x.
>>
>
> I did it purely using integer math as the backing structure.
> probably not the fastest possible option, but it worked (anyways, a
> little fiddling would be needed anyways as IEEE float128 and x87 float80
> values not exactly the same).

My own fp128 was designed mostly for speed; I wrote it over a day or two
to verify the FDIV/FPATAN2 software workaround we wrote around new year 94/95.

I used a 1:31:96 format even though I know today that I should have had
at least 107+ mantissa bits, i.e. something like 1:17:110 would have
been better. Otoh the 96-bit mantissa format would fit inside 3 32-bit
registers. :-)

On the gripping hand I would have needed something like fp160 in order
to verify correct rounding for the 80-bit fp hardware.

bert

Aug 4, 2012, 8:06:11 AM
No, it is not a good path to research. Many many
years ago when I wrote microcode, I found that it
was remarkably simple to implement the square root
function for floating point numbers with very few
more steps than the division function, and that
those steps to build up the square root were very
similar to the steps in building up the quotient
of a division. Consequently, it was enormously
faster than any of the convergent algorithms in
which division occurs repeatedly! The method is
one which I learned at school in the early 1950's,
being a variation of manual long division, and you
can find it described on the web if you look hard.
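
A C sketch of that schoolbook digit-at-a-time method in binary, here on
32-bit integers (bert's microcode would have worked on a floating-point
mantissa, but each step is the same flavour as one step of restoring
division: a shift, a compare and a subtract):

#include <stdint.h>
#include <stdio.h>

static uint32_t isqrt32(uint32_t x)
{
    uint32_t root = 0;
    uint32_t bit  = 1u << 30;      /* the highest even power of two */

    while (bit > x)                /* align with the top bit pair of x */
        bit >>= 2;

    while (bit != 0) {
        if (x >= root + bit) {     /* does this result bit fit? */
            x   -= root + bit;
            root = (root >> 1) + bit;
        } else {
            root >>= 1;
        }
        bit >>= 2;
    }
    return root;                   /* floor(sqrt(original x)) */
}

int main(void)
{
    printf("%u\n", isqrt32(1000000u));  /* prints 1000 */
    return 0;
}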

I suppose there is too much investment in existing
floating point chips for any change to occur now,
but I would surmise that if floating point square
root was a useful function to have in hardware, it
could be implemented very similarly to division,
and with a very similar performance.

Terje Mathisen

Aug 4, 2012, 8:20:00 AM
It took me 1.5 hours to show my university lecturer that the algorithm I
had invented as part of an "Algorithms and Data Structures" exam was in
fact O(n*log(n)), which was required for a passing grade on this question.

It was _very_ different from the textbook answer which I didn't know
about. :-)

ChrisQ

Aug 4, 2012, 8:39:04 AM
On 08/04/12 12:20, Terje Mathisen wrote:

>
> It took me 1,5 hours to show my university lecturer that the algorithm I
> had invented as part of an "Algorithms and Data Structures" exam was in
> fact O(n*log(n)), which was required for a passing grade on this question.
>
> It was _very_ different from the textbook answer which I didn't know
> about. :-)
>
> Terje

I'm more of an engineer than a mathematician, and engineers use as much math as
is needed to solve a given problem to the required accuracy. Perhaps I should
have paid more attention, since the really clever stuff is the ability to
optimise a math requirement for computing with limited resources. Something
that's often not at all obvious without an extensive math background...

Regards,

Chris



BGB

Aug 4, 2012, 12:28:57 PM
On 8/4/2012 7:16 AM, Terje Mathisen wrote:
> BGB wrote:
>> On 8/3/2012 8:52 AM, Terje Mathisen wrote:
>>> BGB wrote:
>>>> I before implemented support for float128 math in software, and used
>>>> the
>>>> Newton-Raphson method to implement "sqrt()". it actually worked fairly
>>>> well.
>>>
>>> In my own 128-bit fp library I simply used an 80-bit fsqrt(1.0/x) as a
>>> starting point for a single NR iteration followed by a multiplication
>>> by x.
>>>
>>
>> I did it purely using integer math as the backing structure.
>> probably not the fastest possible option, but it worked (anyways, a
>> little fiddling would be needed anyways as IEEE float128 and x87 float80
>> values not exactly the same).
>
> My own fp128 was designed for speed mostly, I wrote it over a day or two
> to verify the FDIV/FPATAN2 fix/sw workaround we wrote around new year
> 94/95.
>

mine was mostly because it made sense to have a higher precision type,
and float80 was an awkward size (a lot of my stuff basically just
handles float80 by padding it to 128 bits anyways).

it is a problematic type as well because MSVC doesn't use it, but rather
just interprets "long double" as the same as "double".


the cost though is that float128 (in my VM, and also my C compiler when
it still worked) is considerably slower than using double.

FWIW, there is also int128, but the performance hit isn't as bad.



> I used a 1:31:96 format even though I know today that I should have had
> at least 107+ mantissa bits, i.e. something like 1:17:110 would have
> been better. Otoh the 96-bit mantissa format would fit inside 3 32-bit
> registers. :-)
>
> On the gripping hand I would have needed something like fp160 in order
> to verify correct rounding for the 80-bit fp hardware.
>

mine implemented IEEE 754r float128 values.
IIRC, these are something like 1.15.112.
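
A small C sketch of that 1.15.112 layout, assuming the value is held as
two 64-bit halves with the sign and exponent in the high half (this is
just the IEEE binary128 field layout, not necessarily BGB's internal
representation):

#include <stdint.h>
#include <stdio.h>

typedef struct { uint64_t hi, lo; } f128bits;   /* hi holds the sign/exponent end */

static int      f128_sign(f128bits v)    { return (int)(v.hi >> 63); }
static int      f128_exp(f128bits v)     { return (int)((v.hi >> 48) & 0x7FFF); } /* biased by 16383 */
static uint64_t f128_frac_hi(f128bits v) { return v.hi & 0x0000FFFFFFFFFFFFull; } /* top 48 of 112 fraction bits */

int main(void)
{
    f128bits one = { 0x3FFF000000000000ull, 0 };   /* 1.0: exponent = bias, fraction = 0 */
    printf("sign=%d exp=%d frac_hi=%llu\n",
           f128_sign(one), f128_exp(one),
           (unsigned long long)f128_frac_hi(one));
    return 0;
}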

a float256 format could be:
1.31.224

there is also some code for float16 / half-float, but the usual strategy
here is basically just to fudge them to/from 32-bit floats.

for float16 and smaller (8 or 12-bit variants), a partial advantage is
that using a LUT becomes more practical for decoding them.


I also have several "tweaked" formats:
float28, and float48.
both are mostly intended to be used in pointers as "flonum" types, and
are essentially just shaved down floats and doubles. FWIW, it is also
possible to shove a pointer into a double (as NaN values), and several
script VMs actually do this, but my current VM does not.
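
A rough sketch of the NaN-boxing trick mentioned above, assuming 48-bit
user pointers (as on current x86-64); the tag value here is made up for
illustration and is not any particular VM's format:

#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Any double with all exponent bits set and a non-zero mantissa is a NaN,
 * so the low 48 bits are free to carry a pointer, provided the tag is a
 * pattern that ordinary arithmetic never produces. */
#define BOX_TAG 0x7FFC000000000000ull   /* quiet NaN + tag bits */

static double box_ptr(void *p)
{
    uint64_t bits = BOX_TAG | (uint64_t)(uintptr_t)p;
    double d;
    memcpy(&d, &bits, sizeof d);        /* type-pun without UB */
    return d;
}

static int is_boxed(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return (bits & 0xFFFF000000000000ull) == BOX_TAG;
}

static void *unbox_ptr(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return (void *)(uintptr_t)(bits & 0x0000FFFFFFFFFFFFull);
}

int main(void)
{
    int x = 42;
    double boxed = box_ptr(&x);
    if (is_boxed(boxed))
        printf("%d\n", *(int *)unbox_ptr(boxed));  /* prints 42 */
    return 0;
}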


> Terje
>

BGB

Aug 4, 2012, 2:16:24 PM
I recently tried to argue with someone that a math expression which
was written slightly differently was:
mathematically equivalent to the original;
more reliable on computers than the original, due to differences
in the rounding behavior (the original form would lose 1 bit of
accuracy, the latter would preserve it, thus being integer-exact).


I am hardly all that good with math, but in this case pretty much
everything was within the confines of algebra (more or less, never mind
the concept of roundoff).

In this case, the calculation being integer-exact (and reversible) was
an important point of the calculation, and I saw little point in people
insisting on a form of the calculation which would often result in the
wrong answer (I saw it, got suspicious, and rigged up a test that
confirmed that, yes, the original calculation was not reliable, and that
the reworked calculation was).

But whatever; luckily I have authority over my own code...


(note: technically this was related to a lossless variant of JPEG which
remains backwards compatible with lossy decoders, and works mostly by
using a tweaked DCT transform and colorspace conversions).


> Regards,
>
> Chris
>
>
>

Terje Mathisen

Aug 4, 2012, 3:39:52 PM
BGB wrote:
> in this case, the calculation being integer-exact (and reversible) was
> an important point of the calculation, and I saw little point on people
> insisting on a form of the calculation which would often result in the
> wrong answer (I saw it, got suspicious, and rigged up a test that had
> confirmed that, yes, the original calculation was not reliable, and that
> the reworked calculation was).
>
> but, whatever sometimes, luckily I have authority over my own code...
>
>
> (note: technically this was related to a lossless variant of JPEG which
> remains backwards compatible with lossy decoders, and works mostly by
> using a tweaked DCT transform and colorspace conversions).

Nice!

BGB

Aug 4, 2012, 10:37:20 PM
On 8/4/2012 2:39 PM, Terje Mathisen wrote:
> BGB wrote:
>> in this case, the calculation being integer-exact (and reversible) was
>> an important point of the calculation, and I saw little point on people
>> insisting on a form of the calculation which would often result in the
>> wrong answer (I saw it, got suspicious, and rigged up a test that had
>> confirmed that, yes, the original calculation was not reliable, and that
>> the reworked calculation was).
>>
>> but, whatever sometimes, luckily I have authority over my own code...
>>
>>
>> (note: technically this was related to a lossless variant of JPEG which
>> remains backwards compatible with lossy decoders, and works mostly by
>> using a tweaked DCT transform and colorspace conversions).
>
> Nice!
>

yeah.

FWIW, the guy working on libjpeg has added similar functionality, but
there are minor differences in the implementation (so, full
compatibility is less certain).

The basic idea, though, is that an alternate version of the DCT is used
(SERMS RDCT), which differs from the normal DCT in being fully reversible,
but is also still compatible with the (lossy) DCT.
Also, all quantization values are set to 1 (no quantization).


the color conversion is a little harder, since the usual YCbCr color
conversion ends up using a smaller output range than the input range,
and so can't be made lossless.

the solution is partly to use an alternate color conversion with a
larger value range (based on JPEG-2000 RCT). a downside of this though
is that, for decoders which are not aware of the alternate colorspace,
the colors come out "slightly off" (mostly the color saturation is a bit
higher).

the math in question mostly had to do with the RCT color conversions.
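
For reference, a sketch of the JPEG-2000 style RCT pair being referred
to; the exact offsets and clamping in BGB's encoder are not given, this
just shows why the transform round-trips exactly in integers:

#include <stdio.h>

/* floor division by 4, valid for negative values too */
static int fdiv4(int v) { return (v >= 0) ? v / 4 : -((-v + 3) / 4); }

static void rct_forward(int r, int g, int b, int *y, int *cb, int *cr)
{
    *y  = fdiv4(r + 2 * g + b);   /* floor((R + 2G + B) / 4) */
    *cb = b - g;                  /* chroma needs 9 bits: -255..255 */
    *cr = r - g;
}

static void rct_inverse(int y, int cb, int cr, int *r, int *g, int *b)
{
    *g = y - fdiv4(cb + cr);      /* recovers G exactly */
    *r = cr + *g;
    *b = cb + *g;
}

int main(void)
{
    /* round-trip a coarse grid of 8-bit RGB triples to show exactness */
    for (int r = 0; r < 256; r += 17)
        for (int g = 0; g < 256; g += 17)
            for (int b = 0; b < 256; b += 17) {
                int y, cb, cr, r2, g2, b2;
                rct_forward(r, g, b, &y, &cb, &cr);
                rct_inverse(y, cb, cr, &r2, &g2, &b2);
                if (r != r2 || g != g2 || b != b2)
                    printf("mismatch at %d %d %d\n", r, g, b);
            }
    printf("done\n");   /* no mismatches expected */
    return 0;
}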


another theoretical option would have been the use of raw-RGB, but of
the apps I had to test with, none seemed to recognize the markers to
indicate the use of the RGB colorspace, so it was a tradeoff between
"slightly off" and "totally wrong" (most apps don't recognize much
outside the range of generic JFIF style images). (the main alternative,
used by libjpeg, would require storing an alternate image for the
lossless version, considerably adding to the image size).

technically, my implementation is domain specific and has a number of
non-standard extensions to JPEG anyways (mostly adding stuff relevant to
textures in a 3D engine, such as alpha blending, depth/normal maps,
specular maps, luminance maps, support for limited-precision
floating-point data in images, ...).

these extensions are backwards compatible in that an unaware
implementation will generally just ignore them.


> Terje
>
>

Quadibloc

Aug 5, 2012, 6:01:47 PM
On Aug 4, 6:06 am, bert <bert.hutchi...@btinternet.com> wrote:
> Many many
> years ago when I wrote microcode, I found that it
> was remarkably simple to implement the square root
> function for floating point numbers with very few
> more steps than the division function, and that
> those steps to build up the square root were very
> similar to the steps in building up the quotient
> of a division.  Consequently, it was enormously
> faster than any of the convergent algorithms in
> which division occurs repeatedly!  The method is
> one which I learned at school in the early 1950's,
> being a variation of manual long division, and you
> can find it described on the web if you look hard.

Indeed. I describe it on my web site at

http://www.quadibloc.com/comp/cp0202.htm

However, that is *not* an optimal method for square root. Considerably
faster algorithms for division than long division exist - Goldschmidt
division, for example - and so Newton-Raphson iteration would be
faster than an implementation of this algorithm.

For a system with minimal hardware, however, another alternative -
CORDIC - is also faster than the traditional paper-and-pencil square
root method.

John Savard

jacko

Aug 7, 2012, 8:27:13 PM
So some NaN exception hardware, and maybe some limit point exception hardware with fast obvious result vectoring. Maybe add address and data delta ++/-- values. Maybe some 1, 0, -1 specials of mul bypass, and a multiplier allocation pool. Ummm.

Cheers Jacko

C ...? J ...? FORTRAN ...?

jacko

Aug 9, 2012, 8:23:29 PM
In terms of opcode compression it maybe has address-usage constraints. If address d contains 0, then address a is still required, but need not be fetched - yet it may have to be fetched, to know that the same value need not be flushed. If address b contains 0, then c and d are not needed. This leads to an operand ordering, and a potential 2-bit loss of resolution on all data-storage addressing, or some alignment constraints on data. Is this opcode-compacted code worth the execute-order penalty? Or is the opcode space a possibility? What is the best action on c and d, knowing a can be a write target?

Cheers Jacko

jacko

Sep 6, 2012, 8:41:04 PM
See https://github.com/jackokring/jarvar/wiki/DSP for a version 1 specification of a 1 KB DSP unit.