Odd-length Data Once Again

Quadibloc

Nov 23, 2021, 11:23:07 AM

How to design a computer that handles 36-bit single-precision floating-point
numbers, and 60-bit double-precision floating-point numbers, that is just
as fast and efficient as a conventional computer, where all the data is in
lengths that are a power of two?

I have toyed with a number of different schemes of approaching this goal.

But those schemes have all required compromises of one sort or another,
thus falling somewhat short of that goal. In some cases, not by much, and
it had seemed to me that perhaps the goal was impossible to achieve.

But now I think I have found a way to achieve it. Of course, there are still
costs and compromises, but now they no longer get in the way of
performance.

The scheme I came up with is outlined on:

http://www.quadibloc.com/arch/per12.htm

Basically, what it consists of is this:

Memory is divided into 180-bit memory lines. In this way, each memory
line can contain either five 36-bit single-precision floating-point numbers,
or three 60-bit double-precision floating-point numbers, which are aligned
in both cases, thus never crossing memory line boundaries.

To avoid extra overhead when addressing either type of floating-point
numbers, both the displacements in instructions, and the contents of
index registers, are in *mixed-radix* format. So mixed-radix arithmetic
is used for indexed addressing.

And, therefore, in addition to plain binary index registers, used for
addressing 45-bit memory words, there is an additional set of index
registers for addressing 36-bit floats, and another set of index
registers for addressing 60-bit floats.
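
As a rough illustration of the mixed-radix addressing described above, here is a minimal C sketch. The names `mr_index`, `mr_add` and `mr_bit_offset` are hypothetical and not taken from the linked page; the sketch only assumes that an index is a binary line number plus a slot digit whose radix is the number of elements per line.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mixed-radix index: a binary line number plus a slot
   digit that counts in radix slots_per_line (5 for 36-bit floats,
   3 for 60-bit floats in a 180-bit line). */
typedef struct {
    uint32_t line;
    uint32_t slot;
} mr_index;

/* Add two mixed-radix values: the slot digits are added modulo the
   radix and the carry goes into the binary line field. */
mr_index mr_add(mr_index a, mr_index b, uint32_t slots_per_line)
{
    assert(a.slot < slots_per_line && b.slot < slots_per_line);
    uint32_t slot = a.slot + b.slot;
    uint32_t carry = (slot >= slots_per_line);
    if (carry)
        slot -= slots_per_line;
    mr_index r = { a.line + b.line + carry, slot };
    return r;
}

/* Bit position of the addressed element within memory; because the
   slot digit never reaches the radix, an element never straddles a
   line boundary. */
uint64_t mr_bit_offset(mr_index i, uint32_t line_bits, uint32_t elem_bits)
{
    return (uint64_t)i.line * line_bits + (uint64_t)i.slot * elem_bits;
}
```

Adding a displacement to an index register is then a small-radix digit add with one carry into the ordinary binary adder, with no multiply or divide in the address path.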

So there are three (actually four in the more complete proposal for
this architecture) sets of index registers, and there is some inefficiency
in the format of displacements in instructions. Those compromises,
though, don't get in the way of memory accesses for these unusual-length
data items being fast and straightforward.

John Savard

Thomas Koenig

Nov 23, 2021, 11:32:01 AM

Quadibloc <jsa...@ecn.ab.ca> wrote:

> How to design a computer that handles 36-bit single-precision floating-point
> numbers, and 60-bit double-precision floating-point numbers,

Is there a particular reason why you chose 60 and not 72 bits for the
double precision?

At least FORTRAN prescribed that a DOUBLE PRECISION number takes
up twice the amount of memory that a REAL variable had. You
could, of course, pad a common block, but it would be a waste.

Quadibloc

Nov 23, 2021, 1:15:32 PM

On Tuesday, November 23, 2021 at 9:32:01 AM UTC-7, Thomas Koenig wrote:

> Is there a particular reason why you chose 60 and not 72 bits for the
> double precision?

Basically, on the premise that even 64 bits is wastefully long, and the Control
Data 6600 proves that 60 bit floats are long enough for the most demanding
scientific computations.

That may not be reasonable. The same principle as I outline here would also work
for a machine with 144-bit memory lines, and 36-bit single, 48-bit intermediate,
and 72-bit double; I prefer 48 bits to 45 bits as a length for intermediate floating
point. So there certainly are other options.

John Savard

MitchAlsup

Nov 23, 2021, 1:51:12 PM

On Tuesday, November 23, 2021 at 12:15:32 PM UTC-6, Quadibloc wrote:
> On Tuesday, November 23, 2021 at 9:32:01 AM UTC-7, Thomas Koenig wrote:
>
> > Is there a particular reason why you chose 60 and not 72 bits for the
> > double precision?
> Basically, on the premise that even 64 bits is wastefully long, and the Control
> Data 6600 proves that 60 bit floats are long enough for the most demanding
> scientific computations.
<

Misnomer:: CDC 6600 proved that 60-bits of irregularly rounded FP was
sufficient for FP applications of the 1960s.
<
This in no way proves that 60-bits is (or will be found to be) sufficient for
FP applications of the 2020s. The sizes of the data sets are 10^5 times larger.
<
On the other hand, 32-bit posits are being found to be sufficient for a
lot of smaller applications (as we speak).
<
So, I am in basic agreement with Thomas:: why not unify at the larger boundary.

Quadibloc

Nov 23, 2021, 2:09:02 PM

On Tuesday, November 23, 2021 at 11:51:12 AM UTC-7, MitchAlsup wrote:

> Misnomer:: CDC 6600 proved that 60-bits of irregularly rounded FP was
> sufficient for FP applications of the 1960s.

> This in no way proves that 60-bits is (or will be found to be) sufficient for
> FP applications of the '2020s. The sizes of the data sets are 10^5 larger.

Oh, yes, it might well be that the requirements may be different.

I'd be using a 60-bit format based on IEEE-754, so except possibly for
allowing slightly more than 0.5 ULP for division, I wouldn't be copying
the bad points of the 6600 or the Cray-I.

The idea is, of course, that whatever length of floating-point numbers you
may actually need... unless they just happen to be the power-of-two
lengths, the ability to use other lengths for one's data types instead could
well be useful. Even if those lengths aren't the same as those I may
use in an example, or which inspired me to look into these possibilities.

John Savard

Quadibloc

Nov 23, 2021, 5:04:35 PM

On Tuesday, November 23, 2021 at 12:09:02 PM UTC-7, Quadibloc wrote:
> On Tuesday, November 23, 2021 at 11:51:12 AM UTC-7, MitchAlsup wrote:

> > Misnomer:: CDC 6600 proved that 60-bits of irregularly rounded FP was
> > sufficient for FP applications of the 1960s.

> > This in no way proves that 60-bits is (or will be found to be) sufficient for
> > FP applications of the '2020s. The sizes of the data sets are 10^5 larger.

> Oh, yes, it might well be that the requirements may be different.

I've added a little table to the page at

http://www.quadibloc.com/arch/per12.htm

which shows various other possibilities for use with this scheme.

What I've occasionally encountered in the literature are papers about using
80-bit or 128-bit floating-point numbers for some specialized applications
for which 64-bit double precision is inadequate, but not anything saying that
in general because programs are performing larger simulations these days
that run longer, we should change from using 64-bit floats to something
modestly larger, like 72 bits, to reflect that.

While I described the ordinary floating-point formats with a 180-bit memory
line, there would also be longer formats than 60-bit double-precision available.
The temporary real format, where the leading bit of the significand is no
longer hidden, would be 90 bits long, and there could even be a double format
that is 180 bits long.

The table I've added shows how the same scheme would work out, dividing
a memory line of various lengths into aligned floats that don't cross the boundaries
between memory lines (except possibly for the double format, which could
occupy two lines in some cases); the examples I've given are memory lines of
144 bits, 180 bits, 192 bits, and 256 bits.

Among those possibilities, people should be able to find something they would
like.

John Savard

Quadibloc

Nov 24, 2021, 3:38:33 AM

On Tuesday, November 23, 2021 at 9:23:07 AM UTC-7, Quadibloc wrote:

> To avoid extra overhead when addressing either type of floating-point
> numbers, both the displacements in instructions, and the contents of
> index registers, are in *mixed-radix* format. So mixed-radix arithmetic
> is used for indexed addressing.

Of course, now I can see why I hadn't thought of this before.

This means that an *array subscript* has to be a special type of number
depending on the type of the elements of the array. So code in ordinary
computer languages would be difficult to compile, given that the
advantages of the architecture would be lost if unnecessary conversions
from binary integers to mixed-radix displacements were performed.

This isn't an insuperable problem - one could define a computer language
that allowed the programmer to write efficient higher-level code for this
kind of architecture by specifying counter/index/pointer types explicitly, or a
conventional language could be used, since keeping track of this sort of
thing is not more complicated than things compilers already do.
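
To make the cost difference concrete, here is a small C sketch, assuming five 36-bit floats per 180-bit line. The names `mr_index`, `mr_from_binary` and `mr_next` are hypothetical; the point is only that converting a binary subscript needs a divide and a remainder, while a subscript kept in mixed-radix form can be stepped with an increment and a compare.

```c
#include <assert.h>
#include <stdint.h>

enum { SLOTS = 5 };  /* 36-bit floats per 180-bit line */

typedef struct {
    uint32_t line;
    uint32_t slot;
} mr_index;

/* Converting a binary subscript costs a divide and a remainder ... */
mr_index mr_from_binary(uint32_t i)
{
    mr_index r = { i / SLOTS, i % SLOTS };
    return r;
}

/* ... whereas stepping a subscript that is already in mixed-radix
   form is an increment with a small compare-and-carry. */
mr_index mr_next(mr_index p)
{
    assert(p.slot < SLOTS);
    if (++p.slot == SLOTS) {
        p.slot = 0;
        p.line++;
    }
    return p;
}
```

A compiler that can prove a loop counter is only ever used to subscript arrays of one element type could keep it in the stepped form throughout, which is the kind of bookkeeping referred to above.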

John Savard

JimBrakefield

Nov 25, 2021, 10:17:41 AM

Given a cache line of 256 bits, the number 252 is almost magical.
Divides evenly by:
6 & 12
7, 14 and 28
9, 18 & 36
21, 42 & 84
and of course 256 divides evenly by 4, 8, 16 ...
For split radix addresses the data size and instruction size choices are magnificent
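
The arithmetic behind this can be checked with a few lines of C. This is only an illustrative sketch; the `packing` struct and `pack_line` function are hypothetical names.

```c
#include <assert.h>

typedef struct {
    unsigned count;   /* elements that fit in one line */
    unsigned used;    /* bits occupied by those elements */
    unsigned wasted;  /* bits left over in the line */
} packing;

/* How many aligned elements of elem_bits fit in a line of line_bits,
   and how many bits are left unused. */
packing pack_line(unsigned line_bits, unsigned elem_bits)
{
    assert(elem_bits != 0 && elem_bits <= line_bits);
    packing p;
    p.count = line_bits / elem_bits;
    p.used = p.count * elem_bits;
    p.wasted = line_bits - p.used;
    return p;
}
```

For a 256-bit line, `pack_line(256, 36)` gives seven elements using 252 bits with 4 bits to spare, and the same 252 bits are reached with elements of 42 or 84 bits.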

Quadibloc

Nov 25, 2021, 7:23:12 PM

On Thursday, November 25, 2021 at 8:17:41 AM UTC-7, JimBrakefield wrote:

> Given a cache line of 256 bits, the number 252 is almost magical.

I know that in my original Concertina article, I took 256 bits, and for 36 bits
I used 252, and for 51 bits I used 255, to have single precision and intermediate
precision values that fitted into a unit of alignment.

But I am disappointed, though, that while I found a compromise that met my
performance goals, it's so different from the standpoint of programming for it
that it remains impractical. I think I finally have reached the limits of exploring
this subject, and there isn't an option that would provide a better solution than
those I've already examined.

John Savard

MitchAlsup

Nov 26, 2021, 6:43:02 PM

On Thursday, November 25, 2021 at 6:23:12 PM UTC-6, Quadibloc wrote:
> On Thursday, November 25, 2021 at 8:17:41 AM UTC-7, JimBrakefield wrote:
>
> > Given a cache line of 256 bits, the number 252 is almost magical.
<

Yes, but::
The length of a cache line is a balancing act between memory performance
and the size of the tag-overhead and latency of the data beats moving the
data back and forth.
x86 went 512-bits in about 2003
256-bits is probably too small today
GPUs have been using 1024-bits since the-mid-teens.
<
Important to note: it remains a balancing act !

<
> I know that in my original Concertina article, I took 256 bits, and for 36 bits
> I used 252, and for 51 bits I used 255, to have single precision and intermediate
> precision values that fitted into a unit of alignment.
<

Pairs of these "things" in a 512-bit line or quads in a 1024-bit line are reasonable.
But the magicity of 256 (or 252) should not lead decisions of cache-line-size--
but be led by decisions of cache-line-size.

>
> But I am disappointed, though, that while I found a compromise that met my
> performance goals, it's so different from the standpoint of programming for it
> that it remains impractical. I think I finally have reached the limits of exploring
> this subject, and there isn't an option that would provide a better solution than
> those I've already examined.
<

It will remain impractical (in my opinion) until you realize that performing AGEN
arithmetic for these odd-sized thingies is inherently a fixed-point calculation
(not integer). Ultimately, if you "buy" a small multiplier* in the address path (ala
Burroughs BSP) to properly round these fixed-point things--you will find that
there is a solution awaiting.
<
(*) or suitable adder configured to round 252->256 to make the AGENs work.
>
> John Savard

Quadibloc

Nov 26, 2021, 10:08:05 PM

On Friday, November 26, 2021 at 4:43:02 PM UTC-7, MitchAlsup wrote:
> Ultimately, if you "buy" a small multiplier* in the address path (ala
> Burroughs BSP) to properly round these fixed-point things--you will find that
> there is a solution awaiting.

Oh, yes, I noted that the BSP presumably multiplied by 15, and then "divided"
by 255 by means of dividing by its reciprocal - and the trick is to have an
end-around-carry on the eight bits after the binary point so that the result
is an integer plus a fraction of the form n/255 so that the result will always
be correct.

So I am very much aware of that solution, I just tried to do even better, and
that attempt, on further reflection, was a failure.

John Savard

JimBrakefield

Nov 26, 2021, 11:03:04 PM

The Wikipedia page "division algorithm" covers "division by a constant" with authority?
The scaled reciprocal multiplies shown can be shortened by appropriate shift and add trees?

BGB

Nov 27, 2021, 2:01:19 AM

Yeah.

Addressing by non-power-of-2 scales could work, but the harder thing
would be getting good throughput.

If one had specific divisors, like 15 or 255, iterative shift-and-add
could work.

But, FWIW:
y=x/n;
Can be transformed into, say:
y=(x*c)>>d;
Where, in my case, typically:
d=32;
c=(0x100000000ULL+(n-1))/n;

It is possible in many cases to use a smaller reciprocal, depends on the
ranges of the values.

Curiously, Wikipedia shows a different / more complicated algorithm,
implying that possibly my algorithm is not exact?...

I guess I may need to do some more extensive testing to determine
whether or not my approach is accurate (or if I may need to tweak it).
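
A minimal C sketch of the transformation just described, for checking it against ordinary division. The function name `div_recip32` is hypothetical; the constants follow the formulas above (d = 32, c = ceil(2^32 / n)).

```c
#include <assert.h>
#include <stdint.h>

/* y = (x * c) >> 32 with c = ceil(2^32 / n), as described above.
   The product is formed in 64 bits so that it cannot overflow. */
uint32_t div_recip32(uint32_t x, uint32_t n)
{
    assert(n != 0);
    uint64_t c = (0x100000000ULL + (n - 1)) / n;
    return (uint32_t)(((uint64_t)x * c) >> 32);
}
```

With n = 7 the reciprocal is rounded up by 3/7 of a unit, and that error is scaled by x. Small inputs are unaffected, but for example x = 1431655770 yields 204522253 where the true quotient is 204522252, so a 32-bit reciprocal is not exact over the whole 32-bit range for every divisor.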

It is possible to use a table of reciprocals, which works well for small N.

There is a trick where it is possible to extend the range of the table
(up to larger divisors) by using shifts (works well with a CLZ
instruction), but with the primary drawback that this approach is not
integer exact.

There are ways to fix up the exactness issue (and get exact results),
however, IME, these tend to add complexity and on average work out
slower than falling back to something like binary long division (if the
divisor is out of the range of the lookup table).

...

Quadibloc

Nov 27, 2021, 3:04:23 AM

On Saturday, November 27, 2021 at 12:01:19 AM UTC-7, BGB wrote:

> There are ways to fix up the exactness issue (and get exact results),
> however, IME, these tend to add complexity and on average work out
> slower than falling back to something like binary long division (if the
> divisor is out of the range of the lookup table).

Interesting. In my reply to Mitch, I noted that fixing up exactness was
simple, once I understood it.

The reciprocal of 255 is .000000010000000100000001... in binary.

So to multiply by that exactly, what you do is take the last eight
bits of the result, the eight bits after the binary point, and slap an
end-around-carry on them. Once you do that, instead of having a
binary expansion that goes on forever, you have an integer, plus
a fraction in the form n/255. An end-around-carry isn't a big deal,
after all, it was standard equipment in ALUs for computers that
used one's complement arithmetic, quite common in the old
days.
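
Here is one way to model that in C for 32-bit operands. It is a sketch of the idea as described in this post, not a description of the BSP's actual circuitry, and the names `divmod255_t` and `divmod255` are hypothetical. The integer part is the sum of the operand shifted right by 8, 16 and 24 bits; the bytes that fall off the end are summed in an 8-bit accumulator with end-around carry, and each carry out adds one to the integer part.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t quot;  /* integer part of x / 255 */
    uint32_t rem;   /* numerator n of the fraction n/255 */
} divmod255_t;

divmod255_t divmod255(uint32_t x)
{
    /* Integer part of x * 0.00000001 00000001 ... (binary). */
    uint32_t quot = (x >> 8) + (x >> 16) + (x >> 24);
    uint32_t frac = 0;

    /* Sum the four bytes in an 8-bit accumulator; a carry out of
       bit 7 is worth 256 = 255 + 1, so it is added back in at the
       bottom (end-around carry) and also counted in the integer. */
    for (int i = 0; i < 4; i++) {
        frac += (x >> (8 * i)) & 0xFFu;
        if (frac > 0xFFu) {
            frac = (frac & 0xFFu) + 1u;
            quot++;
        }
    }
    /* 255/255 is the "negative zero" of this arithmetic: normalize. */
    if (frac == 0xFFu) {
        frac = 0;
        quot++;
    }
    divmod255_t r = { quot, frac };
    return r;
}
```

Since 15/255 = 1/17, calling `divmod255(15 * x)` returns the quotient x / 17, with the remainder scaled by 15, which matches the multiply-by-15 step mentioned earlier in the thread.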

John Savard

BGB

Nov 27, 2021, 5:02:50 AM

On 11/27/2021 2:04 AM, Quadibloc wrote:
> On Saturday, November 27, 2021 at 12:01:19 AM UTC-7, BGB wrote:
>

I did more testing, and noted that for dividing sufficiently large
numbers, my approach:
  r = ((1ULL<<32)+(d-1))/d;  /* ceil(2^32/d); a plain 1<<32 overflows a 32-bit int */
  q = ((uint64_t)n*r)>>32;   /* 64-bit product */

Was prone to occasional off-by-one errors.
These generally happened when the numerator was over ~ 500M or so.

Did fiddle with it, and came up with an intermediate option:
k = 32+floor(log2(d));
if(!(d&(d-1)))k--;
r = ((1LL<<k)+(d-1))/d;
q = (n*r)>>k;

This did seem to pass with a brute-force check at least over the parts
of the space I was able to verify within a reasonable timeframe.

Yeah. What I meant was exactness of trying to use shift-trickery to
extend the use of a small division lookup table up to arbitrarily large
divisors.

It is a nifty trick for 3D renderers, which don't generally care about
bit-exact division. Not so great for the C library or compiler, where:
c=a/b;
Is generally expected to be exact.

It is possible to fetch different pieces for different bit-ranges and
add them together to get a reciprocal, but this is more complicated and
still not exact.

One can use an iterative fixup approach, but this is typically slower
than using binary long-division in these cases.

Thomas Koenig

Nov 27, 2021, 5:28:42 AM

BGB <cr8...@gmail.com> wrote:

> On 11/27/2021 2:04 AM, Quadibloc wrote:
>> On Saturday, November 27, 2021 at 12:01:19 AM UTC-7, BGB wrote:
>>
>
> I did more testing, and noted that for dividing sufficiently large
> numbers, my approach:
> r = ((1<<32)+(d-1))/d;
> q = (n*r)>>32;

Get "Hacker's Delight" and read the stuff about division by constants.
It's quite well written.

Anton Ertl

Nov 27, 2021, 6:52:17 AM

JimBrakefield <jim.bra...@ieee.org> writes:
>The Wikipedia page "division algorithm" covers "division by a constant" with authority?

It's not bad, but it does not reference the (IMO) best paper on the
topic:

Arch D. Robison. N-bit unsigned division via N-bit multiply-add. In
Proceedings of the 17th IEEE Symposium on Computer Arithmetic
(ARITH-17). IEEE Computer Society Press, 2005.

You can also read my paper on the topic, which presents a solution
that has a better latency in some cases; it also has a relatively
up-to-date Section on "Related Work".

@InProceedings{ertl19kps,
author = {M. Anton Ertl},
title = {Integer Division by Multiplying with the
Double-Width Reciprocal},
crossref = {kps19},
pages = {75--84},
url = {http://www.complang.tuwien.ac.at/papers/ertl19kps.pdf},
url-slides = {http://www.complang.tuwien.ac.at/papers/ertl19kps-slides.pdf},
abstract = {Earlier work on integer division by multiplying with
the reciprocal has focused on multiplying with a
single-width reciprocal, combined with a correction
and followed by a shift. The present work explores
using a double-width reciprocal to allow getting rid
of the correction and shift.}
}

@Proceedings{kps19,
title = {20. Kolloquium Programmiersprachen und Grundlagen
der Programmierung (KPS)},
booktitle = {20. Kolloquium Programmiersprachen und Grundlagen
der Programmierung (KPS)},
year = {2019},
key = {kps19},
editor = {Martin Pl\"umicke and Fayez Abu Alia},
url = {https://www.hb.dhbw-stuttgart.de/kps2019/kps2019_Tagungsband.pdf}
}

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

BGB

Nov 27, 2021, 12:43:22 PM

Yeah, I don't have this.

My initial algo was basically something I threw together experimentally,
and did basic tests, and it seemed to work so I used it.

However, I didn't really test it "exhaustively", as brute-forcing the
entire 32/32 space is a bit of a problem. Trying to brute-force up to
the 32-bit limit for the numerator did reveal some problems though
(occasional off-by-one errors for large values).

The modified algo seemed to work and avoided this issue. But, as noted,
can't test over the entire space as this seems like it would take an
implausibly long time.

Not so sure about whatever exactly is going on with the approach
described on Wikipedia, my own attempts to mimic it led to results which
"didn't work".

Did add the modified algo to my compiler (for the division by constant
case), and to the lookup-table divider, no visible change in the
behavior of Doom or similar.

This mostly applies to 'int' cases, as the 64-bit and 128-bit dividers
do not use lookup tables (these cases only use the binary long division
loop).

...

Thomas Koenig

Nov 28, 2021, 11:37:41 AM

BGB <cr8...@gmail.com> wrote:
> On 11/27/2021 4:28 AM, Thomas Koenig wrote:
>> BGB <cr8...@gmail.com> wrote:
>>> On 11/27/2021 2:04 AM, Quadibloc wrote:
>>>> On Saturday, November 27, 2021 at 12:01:19 AM UTC-7, BGB wrote:
>>>>
>>>
>>> I did more testing, and noted that for dividing sufficiently large
>>> numbers, my approach:
>>> r = ((1<<32)+(d-1))/d;
>>> q = (n*r)>>32;
>>
>> Get "Hacker's Delight" and read the stuff about division by constants.
>> It's quite well written.
>>
>
> Yeah, I don't have this.

It is addressed to compiler writers, so it is definitely something
that could benefit you. (As you roll your own architecture, you
could even implement a "signed division by 2" instruction, or at
least POWER's ADDZE so you only need two instructions for that).

As far as division by constants is concerned,
https://oeis.org/A346496 has a sequence for 32-bit unsigned
division.

Terje Mathisen

Nov 28, 2021, 12:51:08 PM

When I re-invented reciprocal muls for constant DIV, I was told by Agner
Fog that DEC had published something about the trick several years
earlier, but that's fine, I had already written code that did exhaustive
testing for all 32/32->32 divisions, just to verify my intuition that
having a correctly rounded up 33+ bit reciprocal would always give the
exact result.

Agner & I came up with optimal x86 sequences for both divisors that are
OK with a 32-bit reciprocal and those that need the extra bit at the end
(i.e. when the 33rd bit after rounding up 34+ is a 1 bit).
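
For reference, a C sketch of a rounded-up reciprocal of up to 33 bits. This is not the x86 sequence mentioned above; it simply does the arithmetic in a 128-bit integer so that the extra bit does not need special handling. The function name `div_recip33` is hypothetical, and `__uint128_t` is a GCC/Clang extension rather than standard C.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned 32-bit n / d using a rounded-up reciprocal:
     l = ceil(log2(d)),  m = ceil(2^(32+l) / d),  q = (n * m) >> (32 + l).
   m needs at most 33 bits, so the product is formed in 128 bits. */
uint32_t div_recip33(uint32_t n, uint32_t d)
{
    assert(d != 0);
    unsigned l = 0;
    while (l < 32 && (1ULL << l) < d)   /* l = ceil(log2(d)) */
        l++;
    __uint128_t one = 1;
    __uint128_t m = ((one << (32 + l)) + (d - 1)) / d;
    return (uint32_t)(((__uint128_t)n * m) >> (32 + l));
}
```

In a compiler, m and l would be computed once per constant divisor; the division used here to build m is only for illustration.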

Since then every compiler where I have checked the asm/machine code
output will use effectively the same approach, sometimes simplifying it
slightly for smaller divisors.

When you want to use this trick for BSP style prime modulus division,
then you have to look for and select a divisor whose reciprocal has
only a small number of set bits.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Terje Mathisen

Nov 28, 2021, 12:58:47 PM

BGB wrote:
> On 11/27/2021 4:28 AM, Thomas Koenig wrote:
>> BGB <cr8...@gmail.com> wrote:
>>> On 11/27/2021 2:04 AM, Quadibloc wrote:
>>>> On Saturday, November 27, 2021 at 12:01:19 AM UTC-7, BGB wrote:
>>>
>>> I did more testing, and noted that for dividing sufficiently large
>>> numbers, my approach:
>>> r = ((1<<32)+(d-1))/d;
>>> q = (n*r)>>32;
>>
>> Get "Hacker's Delight" and read the stuff about division by constants.
>> It's quite well written.
>>
>
> Yeah, I don't have this.
>
> My initial algo was basically something I threw together experimentally,
> and did basic tests, and it seemed to work so I used it.
>
> However, I didn't really test it "exhaustively", as brute-forcing the
> entire 32/32 space is a bit of a problem. Trying to brute-force up to
> the 32-bit limit for the numerator did reveal some problems though
> (occasional off-by-one errors for large values).
>
>
> The modified algo seemed to work and avoided this issue. But, as noted,
> can't test over the entire space as this seems like it would take an
> implausibly long time.

Not really: You only need to test the boundary values, i.e. those N/M
where N is very close to a multiple of M, so you check (N-1)/M and N/M,
and you should start with the largest possible N that fits in 32 bits.

Doing that for all divisors is quite feasible for 32-bit values,
somewhat less so for 64-bit. :-)
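
A small C sketch of that boundary test. The candidate being checked is the 32-bit rounded-up reciprocal from earlier in the thread; `div_candidate` and `first_boundary_failure` are hypothetical names, and the `steps` limit is only there so that a quick run does not have to walk every multiple.

```c
#include <assert.h>
#include <stdint.h>

/* Candidate under test: q = (n * ceil(2^32 / d)) >> 32. */
uint32_t div_candidate(uint32_t n, uint32_t d)
{
    uint64_t r = (0x100000000ULL + (d - 1)) / d;
    return (uint32_t)(((uint64_t)n * r) >> 32);
}

/* Check only the boundary numerators k*d and k*d - 1, starting from
   the largest multiple of d that fits in 32 bits and working down.
   Returns the first numerator with a wrong quotient, or 0 if none of
   the 'steps' multiples examined fails (0 itself always divides
   correctly, so it is free to use as the "no failure" value). */
uint32_t first_boundary_failure(uint32_t d, uint32_t steps)
{
    assert(d != 0);
    uint32_t k = UINT32_MAX / d;
    for (; k > 0 && steps > 0; k--, steps--) {
        uint32_t n = k * d;
        if (div_candidate(n, d) != k)
            return n;
        if (div_candidate(n - 1, d) != k - 1)
            return n - 1;
    }
    return 0;
}
```

For d = 7 the very first multiple examined already exposes the problem: `first_boundary_failure(7, 1)` returns 4294967291. Divisors for which the rounded-up reciprocal is exact or nearly so, such as 2 or 641 (a factor of 2^32 + 1), pass.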

BGB

Nov 28, 2021, 1:35:58 PM

On 11/28/2021 10:37 AM, Thomas Koenig wrote:
> BGB <cr8...@gmail.com> wrote:
>> On 11/27/2021 4:28 AM, Thomas Koenig wrote:
>>> BGB <cr8...@gmail.com> wrote:
>>>> On 11/27/2021 2:04 AM, Quadibloc wrote:
>>>>> On Saturday, November 27, 2021 at 12:01:19 AM UTC-7, BGB wrote:
>>>>>
>>>>
>>>> I did more testing, and noted that for dividing sufficiently large
>>>> numbers, my approach:
>>>> r = ((1<<32)+(d-1))/d;
>>>> q = (n*r)>>32;
>>>
>>> Get "Hacker's Delight" and read the stuff about division by constants.
>>> It's quite well written.
>>>
>>
>> Yeah, I don't have this.
>
> It is addressed to compiler writers, so it is definitely something
> that could benefit you. (As you roll your own architecture, you
> could even implement a "signed division by 2" instruction, or at
> least POWER's ADDZE so you only need two instructions for that).
>

I did look it up, and noted that it is apparently freely available
online (well, as opposed to needing to order a book).

But, yeah, such an instruction could be possible.

As-is, signed division by power-of-2 is 4 instructions:
MOV Rs, Rt | CMPGE 0, Rs
ADD?F (d-1), Rt
SHAD Rt, -k, Rd

Where: k=log2(d), and Rt is a scratch register.
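
In C terms, the sequence above appears to correspond to the usual bias-then-shift idiom. This is a sketch under that assumption; `sdiv_pow2` is a hypothetical name, and right-shifting a negative value is implementation-defined in C, so the code assumes an arithmetic shift.

```c
#include <assert.h>
#include <stdint.h>

/* Truncating signed division by 2^k: add (2^k - 1) to negative
   inputs, then shift right arithmetically.  Without the bias the
   shift alone would round toward minus infinity instead of zero. */
int32_t sdiv_pow2(int32_t x, unsigned k)
{
    assert(k < 31);
    int32_t bias = (x < 0) ? (int32_t)((1u << k) - 1u) : 0;
    return (x + bias) >> k;   /* assumes arithmetic right shift */
}
```

The bias is only ever added to a negative x, so the addition cannot overflow.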

> As far as division by constants is concerned,
> https://oeis.org/A346496 has a sequence for 32-bit unsigned
> division.
>

OK. I had done more testing and noted that my revised form still failed
for larger values with various divisors (7, 14, 19, 28, 31, ...).

I added logic to detect these cases and disable the optimization if it
would lead to unsafe results (and noted that these cases can be detected
without needing to brute-force the entire range).

Also switched the calculation for the shift to:
k = 31+ceil(log2(d));

The table-driven divider also now has a few extra checks for this.

It is possible I could add runtime calls for a 64*64->128 multiply with
a 128-bit right shift, which could potentially be used as a fallback case.

TBD.

Ironically, I did note recently when attempting to run a profiler on my
emulator (having to resort to WSL + gprof as none of the Windows
profilers seem to want to work right now), that apparently:
MOV (IR / 2R), ADD/SUB, SHAD/SHLD, ..., are actually some of the most
common instructions being executed, but are partly hidden away in my
other stats given that they are also typically executed in parallel with
other instructions, and thus partly avoid being counted in the
per-instruction clock-cycle budget.

So, say:
ADD 4, R5 | MOV.L (SP, 24), R4

Will only count the 'MOV.L' for its clock-cycles, and ignore the 'ADD'
(since its clock-cycles are effectively hidden behind the 'MOV.L').

For compiler stats:
~ 40% of F8 block instructions are in Lane 2/3;
~ 20% of F2 block instructions are in Lane 2/3;
~ 13% of F0 block instructions are in Lane 2/3;

None of the F1 block is in these lanes, partly as there is nothing in
the F1 block that is allowed in these lanes.

Annoyingly, some amount of the recent debugging did adversely affect the
Dhrystone score, but alas...

BGB

Nov 28, 2021, 1:55:53 PM

My initial tests were brute forcing the range.

My first stage testing of the revised algorithm was only checking up to
2^31, but changing it to 2^32 showed that some divisors were still
failing (7, 14, 19, 28, 31, ...).

Did observe though that in the cases where it started failing, it would
also tend to fail within 2*D of the maximum value (which can generally
be checked a bit more quickly).

But, yeah, checking in the immediate vicinity of an integer multiple
could further narrow the check.

Say:
Find T=((2^32)-1)/D
P=T*D
Search, say, (P-10)..(P+10), and see if a mismatch happens.

My initial (earlier) tests for the original division algo were more like:
Brute force the 64K * 64K space;
Check a bunch of randomly generated numbers;
Call it good if it passed.

But, as noted, it turns out this wasn't quite good enough.
Didn't account for an error rate that got bigger the further one gets
from zero.

> Terje
>

Anton Ertl

Nov 29, 2021, 8:03:45 AM

Terje Mathisen <terje.m...@tmsw.no> writes:
>Since then every compiler where I have checked the asm/machine code
>output will use effectively the same approach, sometimes simplifying it
>slightly for smaller divisors.

For u/7 on AMD64 (latency measured on Skylake) there are at least the
following sequences:

gcc                    Robison/fish           ertl
latency=8c             latency=6.25c          latency=6c
                                              %Cl=0x4924924924924925
%C=0x2492492492492493  %C=0x9249249249249249  %Ch=0x2492492492492492
movabs $C,%rdx         movabs $C,%rax         movabs $Cl,%rax
mov %rdi,%rax          mov %rax,%rcx          mul %rdi
mul %rdx               mul %rdi               mov %rdx,%rcx
sub %rdx,%rdi          add %rcx,%rax          movabs $Ch,%rax
shr %rdi               adc $0x0,%rdx          mul %rdi
lea (%rdx,%rdi),%rax   shr $0x2,%rdx          add %rcx,%rax
shr $0x2,%rax          mov %rdx,%rax          adc $0x0,%rdx
                                              mov %rdx,%rax
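
For readers who prefer C to assembly, the gcc and ertl sequences above can be transcribed roughly as follows. This is a sketch, not compiler output; the helper `mulhi64` and the function names are hypothetical, the constants are the ones listed above, and `__uint128_t` is a GCC/Clang extension.

```c
#include <assert.h>
#include <stdint.h>

/* High 64 bits of a 64x64-bit product (what MUL leaves in %rdx). */
uint64_t mulhi64(uint64_t a, uint64_t b)
{
    return (uint64_t)(((__uint128_t)a * b) >> 64);
}

/* gcc column: 65-bit reciprocal 2^64 + C, with the subtract, shift
   and add that fold in the implicit top bit. */
uint64_t div7_gcc(uint64_t n)
{
    const uint64_t C = 0x2492492492492493ULL;
    uint64_t t = mulhi64(n, C);
    return (((n - t) >> 1) + t) >> 2;
}

/* ertl column: multiply by the 128-bit reciprocal Ch:Cl, which is
   ceil(2^128 / 7), and keep only bits 128 and up of the product. */
uint64_t div7_double_width(uint64_t n)
{
    const uint64_t Cl = 0x4924924924924925ULL;
    const uint64_t Ch = 0x2492492492492492ULL;
    __uint128_t sum = (__uint128_t)n * Ch + mulhi64(n, Cl);
    return (uint64_t)(sum >> 64);
}
```

The second version has no final shift or correction step; the price is the second multiply.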

Terje Mathisen

Nov 29, 2021, 10:21:49 AM

Anton Ertl wrote:
> Terje Mathisen <terje.m...@tmsw.no> writes:
>> Since then every compiler where I have checked the asm/machine code
>> output will use effectively the same approach, sometimes simplifying it
>> slightly for smaller divisors.
>
> For u/7 on AMD64 (latency measured on Skylake) there are at least the
> following sequences:
>
> gcc                    Robison/fish           ertl
> latency=8c             latency=6.25c          latency=6c
>                                               %Cl=0x4924924924924925
> %C=0x2492492492492493  %C=0x9249249249249249  %Ch=0x2492492492492492
> movabs $C,%rdx         movabs $C,%rax         movabs $Cl,%rax
> mov %rdi,%rax          mov %rax,%rcx          mul %rdi
> mul %rdx               mul %rdi               mov %rdx,%rcx
> sub %rdx,%rdi          add %rcx,%rax          movabs $Ch,%rax
> shr %rdi               adc $0x0,%rdx          mul %rdi
> lea (%rdx,%rdi),%rax   shr $0x2,%rdx          add %rcx,%rax
> shr $0x2,%rax          mov %rdx,%rax          adc $0x0,%rdx
>                                               mov %rdx,%rax

The main difference between the last two approaches is the extra MUL versus
the ADD/ADC used to add in the 65th bit of the reciprocal. We used the first
of these, but as long as the multiplier is properly pipelined, your
double-wide MUL is very nice.

I suspect that for divisors with less than max significant bits, your
version will also generate a sufficiently accurate fractional part which,
after multiplication, can generate the remainder from the division.

Thomas Koenig

Nov 29, 2021, 11:30:31 AM

Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:

> Terje Mathisen <terje.m...@tmsw.no> writes:
>>Since then every compiler where I have checked the asm/machine code
>>output will use effectively the same approach, sometimes simplifying it
>>slightly for smaller divisors.
>
> For u/7 on AMD64 (latency measured on Skylake) there are at least the
> following sequences:
>
> gcc                    Robison/fish           ertl
> latency=8c             latency=6.25c          latency=6c
>                                               %Cl=0x4924924924924925
> %C=0x2492492492492493  %C=0x9249249249249249  %Ch=0x2492492492492492
> movabs $C,%rdx         movabs $C,%rax         movabs $Cl,%rax
> mov %rdi,%rax          mov %rax,%rcx          mul %rdi
> mul %rdx               mul %rdi               mov %rdx,%rcx
> sub %rdx,%rdi          add %rcx,%rax          movabs $Ch,%rax
> shr %rdi               adc $0x0,%rdx          mul %rdi
> lea (%rdx,%rdi),%rax   shr $0x2,%rdx          add %rcx,%rax
> shr $0x2,%rax          mov %rdx,%rax          adc $0x0,%rdx
>                                               mov %rdx,%rax

Makes me wonder... what do you think of the 128-bit division and
remainder by some constants with gcc11, like

unsigned int divrem_10 (__uint128_t x, __uint128_t *div)
{
  *div = x / 10;
  return x % 10;
}

? (see https://godbolt.org/z/czqWEn8jo ).

Anton Ertl

Nov 29, 2021, 12:11:48 PM

Terje Mathisen <terje.m...@tmsw.no> writes:
>I suspect that for divisors with less than max significant bits, your
>version will also generate a sufficiently accurate fractional part which,
>after multiplication, can generate the remainder from the division.

Interesting idea, but I did not investigate it; the classical approach
of multiplying the quotient by the divisor and subtracting the result
from the dividend looked good enough (and it's not clear if it's
slower).

Terje Mathisen

Nov 29, 2021, 2:17:05 PM

It looks like 25-30 cycles of latency on a 3+ wide core?

It isn't horrible, but it also isn't the best we can do:

What I think is that this particular code will pretty much always occur
in the context of transforming multiple digits to decimal, in which case
I am sure I could outperform a loop of this code by a big factor using
my divide & conquer approach:

An uint128 can require up to 39 decimal digits, so we start by splitting
the input into 13-digit chunks with a constant div-mod by 10^13, then we
can extract all 39 digits in parallel:

Start by scaling each chunk by 2^60/1e12, leaving the top digit in the
top 4 bits of a 64-bit register, then repeatedly mask away those 4 bits,
multiply the remainder by 5 (a single-cycle LEA) and move the mask down
one bit position, so effectively a mul-by-10.

Since all these operations are fast and unrolled by the three parallel
chunks, we'll get close to a digit/cycle, possibly staying under 100
cycles for the full conversion?
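
A C sketch of the digit-peeling loop for one 13-digit chunk. It differs from the description above in two ways that should be read as simplifications: it keeps the mask fixed and multiplies by 10, rather than multiplying by 5 and moving the mask down one bit per step, and it forms the initial scaled value with a 128-bit division by a constant instead of a reciprocal multiply. The name `chunk13_to_digits` is hypothetical and `__uint128_t` is a GCC/Clang extension.

```c
#include <assert.h>
#include <stdint.h>

/* Write the 13 decimal digits of c (0 <= c < 10^13) into out[0..12],
   most significant first, with no division inside the loop. */
void chunk13_to_digits(uint64_t c, char out[13])
{
    assert(c < 10000000000000ULL);
    const uint64_t frac_mask = (1ULL << 60) - 1;

    /* v = ceil(c * 2^60 / 10^12): the leading digit sits in the top
       4 bits, the other twelve digits form a 60-bit binary fraction.
       Rounding up keeps the fraction from ever falling just short of
       a digit boundary; the error stays below one unit in the last
       digit because 10^12 < 2^60. */
    uint64_t v = (uint64_t)((((__uint128_t)c << 60) + 999999999999ULL)
                            / 1000000000000ULL);

    for (int i = 0; i < 13; i++) {
        out[i] = (char)('0' + (v >> 60));   /* peel off the integer part */
        v = (v & frac_mask) * 10;           /* bring up the next digit */
    }
}
```

Three such chunks are independent of one another, which is what allows the three digit streams to be interleaved on a superscalar core.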

An easier approach will do the same using divrem_1e13() then convert
each chunk using standard 64-bit operations:

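#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

char *u128_to_ascii(char *buf, int buflen, __uint128_t n)
{
    /* up to 39 digits plus the terminating NUL */
    if (buflen < 40) return NULL;
    const uint64_t e13 = 10000000000000ULL;   /* 10^13 */
    __uint128_t t;
    uint64_t c0, c1, c2;

    t = n / e13;
    c0 = (uint64_t)(n % e13);   /* low 13 digits */

    c2 = (uint64_t)(t / e13);   /* high 13 digits */
    c1 = (uint64_t)(t % e13);   /* middle 13 digits */

    sprintf(buf, "%013" PRIu64 "%013" PRIu64 "%013" PRIu64, c2, c1, c0);

    return buf;
}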

Terje Mathisen

Nov 29, 2021, 2:19:58 PM

Anton Ertl wrote:
> Terje Mathisen <terje.m...@tmsw.no> writes:
>> I suspect that for divisors with less than max significant bits, your
>> version will also generate a sufficiently accurate fractional part which,
>> after multiplication, can generate the remainder from the division.
>
> Interesting idea, but I did not investigate it; the classical approach
> of multiplying the quotient by the divisor and subtracting the result
> from the dividend looked good enough (and it's not clear if it's
> slower).

It is probably just marginally slower, by needing a couple more
housekeeping instructions, but nothing significant.

Thomas Koenig

Nov 29, 2021, 3:57:11 PM

Terje Mathisen <terje.m...@tmsw.no> wrote:

Actually, truth be told, this was partially inspired by my search
to find non-trivial numbers whose sum of digits is the same for
every prime number up to 17 (you may remember the discussion about
this some time ago).

I've found 70911040973874056146188543 and 77332999599545910254098143
with an algorithm that used the approach above, which gained me
an OEIS entry (even if it is one of the "less interesting" ones).

> An uint128 can require up to 39 decimal digits, so we start by splitting
> the input into 13-digit chunks with a constant div-mod by 10^13, then we
> can extract all 39 digits in parallel:

Hm, yes, that approach could have sped up my calculations somewhat :-)

> Start by scaling each chunk by 2^60/1e12, leaving the top digit in the
> top 4 bits of a 64-bit register, then repeatedly mask away those 4 bits,
> multiply the remainder by 5 (a single-cycle LEA) and move the mask down
> one bit position, so effectively a mul-by-10.
>
> Since all these operations are fast and unrolled by the three parallel
> chunks, we'll get close to a digit/cycle, possibly staying under 100
> cycles for the full conversion?

I am tempted to put this approach into libgfortran, but unfortunately
the general overhead of Fortran I/O is very high.

> An easier approach will do the same using divrem_1e13() then convert
> each chunk using standard 64-bit operations:
>
> #include <inttypes.h>
> #include <stdint.h>
> #include <stdio.h>
>
> char *u128_to_ascii(char *buf, int buflen, __uint128_t n)
> {
>     /* up to 39 digits plus the terminating NUL */
>     if (buflen < 40) return NULL;
>     const uint64_t e13 = 10000000000000ULL;   /* 10^13 */
>     __uint128_t t;
>     uint64_t c0, c1, c2;
>
>     t = n / e13;
>     c0 = (uint64_t)(n % e13);   /* low 13 digits */
>
>     c2 = (uint64_t)(t / e13);   /* high 13 digits */
>     c1 = (uint64_t)(t % e13);   /* middle 13 digits */
>
>     sprintf(buf, "%013" PRIu64 "%013" PRIu64 "%013" PRIu64, c2, c1, c0);
>
>     return buf;
> }

That, as well.