Unsigned 48-Bit Integer Divide for 8086

Jerome H. Fine

unread,

Jul 7, 2014, 10:33:24 PM7/7/14

to

Is there an algorithm that will perform an Unsigned Integer Divide
for a 48-Bit Value using ONLY 16 bit instructions PLUS the
Signed Integer Divide (is is DIV or IDIV?) instruction which
divides DX,AX by a 16 bit value? OR maybe the Signed Integer
Multiply (is it MUL or IMUL?) instruction could be substituted
which multiplies two Signed 16-Bit Integers to produce a 32-Bit
Integer?

If this question is not of interest, you can forget the rest.

Note that I am using the 8086 as the closest familiar hardware.
The actual hardware is a bit different, but still has only 16 bits
words.

Basically, is there a simple, reasonable way to effectively divide
a 48-Bit value residing in three convenient registers such as
CX,DX,AX by 100,000,000?

Specifically, I am starting with an Unsigned 48-Bit Integer
in CX,DX,AX and I want to Encode the value (into Decimal
digits). There is already a subroutine to quickly Encode an
Unsigned 16-Bit Integer which I use as a starting point. For
values up to 327,679,999 a DIV by 10,000 produces a
quotient and remainder which handles most of the range for
an Unsigned 32-Bit Integer. Then, at present, subtracting
100,000,000 as many times as needed (from the 32-Bit
maximum limit of 4,294,967,295 and leaving a remainder
of up to 99,999,999) produces the required number for
the top 2 Decimal digits.

Extending those subtractions into the 48-Bit portion would use
a table of the powers of ten (from 100,000,000 up to
1,000,000,000,000 or five entries - to handle a result up to
6,553,599,999,999 as the maximum that is really needed
since a rough estimate is that reaching such a value would
take about 7.5 days).

A short (current?) logic summary is:
(a) Values up to 65,535 are when CX = DX = 0, then AX
       is Encoded as an Unsigned 16-Bit Integer
(b) Values up to 327,679,999 are when CX=0 and DX<5,000
       (327,679,999 / 65,536 = 4999 so the maximum for DX
       is 4999 for this case), then DX,AX is divided by 10,000
       after which first the quotient, then the remainder are Encoded
       as Unsigned 16-Bit Integers
(c) Values up to 6,553,599,999,999 are handled by using many
       subtractions (using a table of the powers of ten from 100,000,000
       up to 1,000,000,000,000) to obtain the top 5 digits and leave a
       remainder of the lower 8 digits. Then those top five digits are
       Encoded followed by (b) for the rest of the 48-Bit Value

Replacing the multiple subtractions for (c) is the goal and would
use (hopefully) less than 15 to 20 instructions. Using a cascaded
multiply on the 48-Bit Value to produce the value equal to a
divide by 100,000,000 is what I guess is needed, but I don't
have any idea of where to start. However, if I were to guess,
the Top Five Digits (in binary):

TFD = ( x / 4294067296 ) * 42.94067296

Then ( x / 49067296 ) is just the original 48-Bit Integer with
the binary point NOW between the top 16 bits and the lower
32 bits. I would likewise use 48 bits to hold 42.94067296 with
a 16 bit integer and a 48 bit fraction - with the binary point in the
same location. Figuring out the value of 42.94067296 should not
be too difficult. Does this seem like the solution to my question?
And if so, what is the best way to perform the multiply of 48-Bits
times 48-Bits? Must all 80 bits be retained or are 64 bits quite
sufficient (the least 16 significant bits don't seem to have any
affect)?

One other possibility is to convert everything to 64-Bit
double precision floating point since those instructions
are also available. All that would be needed would be
a divide of "x" by 100,000,000 and convert that quotient
to an integer. This would still require MANY instructions.

Or is there an algorithm that is even easier?

It would be easy to test the 65,535 possibilities by adding
100,000,000 to a total that many times, then checking if just
one less from the current total results in one less when using
any algorithm to obtain those upper 5 Decimal digits under (c)
above.

Jerome Fine

news

unread,

Jul 8, 2014, 1:43:16 AM7/8/14

to

Well, of course there's a way to do multiple precision division,
although it can be somewhat painful.

Doing it in general in 20 instructions is not going to be possible,
but for constants, you should be able to multiply be the reciprocal in
not much more than 20 instructions.

OTOH, if you're actually dividing by 100,000,000 (or any number that
you can actual split into two word-sized factors), you can do it
fairly simply, just divide by 10,000 twice.

Jerome H. Fine

unread,

Jul 8, 2014, 10:35:25 AM7/8/14

to

> news wrote:

Unfortunately, the first divide by 10,000 (using IDIV) would overflow, so a bitwise subroutine
would be required - which would defeat the whole purpose.

It looks like converting the Unsigned 48-Bit Integers to Double Precision Floating Point
may be the best solution. The conversion from a Signed 32-Bit Integer to Double
Precision Floating Point would need to be done twice (only 32 bits of Integer can be
handled at a time - so it might be just as easy to transfer the Unsigned 48-Bit Integer
into a 64-Bit Floating Point memory location with the exponent already set up - three
MOVs of the three registers into memory, then load the 64-Bit Floating Point Number).
It would then be trivial to remove the value due to the hidden bit by subtracting that
value. After that, a direct Floating Point Division (at 56 bit precision) by 100,000,000
which completely solves the multiple subtractions along with avoiding a few other
problems. The Integer portion of the divide should be the upper 5 Decimal digits
under (c) that I am looking for. I would still need to multiply that Integer by
100,000,000 and subtract it from the original 48-Bit to obtain the 32-Bit remainder
that (b) needs to complete the rest of the algorithm, but is seems possible at this
point.

If anyone else has any comments or suggestions, they would be appreciated.

Jerome Fine

Terje Mathisen

unread,

Jul 8, 2014, 11:18:50 AM7/8/14

to

Jerome H. Fine wrote:
> Is there an algorithm that will perform an Unsigned Integer Divide
> for a 48-Bit Value using ONLY 16 bit instructions PLUS the
> Signed Integer Divide (is is DIV or IDIV?) instruction which
> divides DX,AX by a 16 bit value? OR maybe the Signed Integer
> Multiply (is it MUL or IMUL?) instruction could be substituted
> which multiplies two Signed 16-Bit Integers to produce a 32-Bit
> Integer?

[snipped horribly complex binary to ascii conversion!]

> One other possibility is to convert everything to 64-Bit
> double precision floating point since those instructions
> are also available. All that would be needed would be
> a divide of "x" by 100,000,000 and convert that quotient
> to an integer. This would still require MANY instructions.

Using the x87 hw would indeed work, there is even a BCD conversion
opcode available but it is very slow.

>
> Or is there an algorithm that is even easier?

Yes!

The easiest is to split the input into chunks based on the largest power
of ten that fits in 16 bits, i.e. 10000:

Assuming SI points to 48-bit input variable, DI points to low end of 4
16-bit variables to receive the remainders, each to be less than 10000:

mov bx,10000
mov cx, 3
next:
xor dx,dx
mov ax,[si+4]
div bx
mov [si+4],ax
mov ax,[si+2]
div bx
mov [si+2],ax
mov ax,[si]
div bx
mov [di], dx
add di,2
dec cx
jnz next
stosw ; The most significant chunk ends up here.

The code above does more divisions than required, since after the first
two iterations the top word has to be zero.

Anyway, at this point you can convert each of the 4 chunks in parallel,
using reciprocal multiplication or any other algorithm you like.

Please do a little google for my optimal 32-bit binary to ascii
conversion, AMD "borrowed" it for some versions of their optimization
manual, without pointing to my original code.

>
> It would be easy to test the 65,535 possibilities by adding
> 100,000,000 to a total that many times, then checking if just
> one less from the current total results in one less when using
> any algorithm to obtain those upper 5 Decimal digits under (c)
> above.

Ouch!

Terje
>
> Jerome Fine

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Steve

unread,

Jul 8, 2014, 2:49:11 PM7/8/14

to

Terje Mathisen <terje.m...@nospicedham.tmsw.no> writes:
>Jerome H. Fine wrote:
>> Is there an algorithm that will perform an Unsigned Integer Divide
>> for a 48-Bit Value using ONLY 16 bit instructions PLUS the
>> Signed Integer Divide (is is DIV or IDIV?) instruction which
>> divides DX,AX by a 16 bit value? OR maybe the Signed Integer
>> Multiply (is it MUL or IMUL?) instruction could be substituted
>> which multiplies two Signed 16-Bit Integers to produce a 32-Bit
>> Integer?

<Snip>

The code I wrote to solve this problem is much cruder that that
by Terje. So feel free to ignore it. Most of the added complexity
is due to debugging. I kept getting the wrong answer, until I over
simplified it, and finally got the right answer.

Cheers,

Steve N.

; - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
; Set up the data.

Data48 DW 0FFFFH, 0FFFFH, 0FFFFH
Work1 DW 0, 0, 0, 0
Work2 DW 0, 0, 0, 0, 0
TenThou DW 10000
Ten DW 10

; - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
; Divide 48-bit number by 100,000,000.

; - - - First round, multiple precision divide by 10,000

XOR DX,DX
MOV AX,[Data48] ; High word of data (fake big endian).
DIV [TenTHou]
MOV [Work1],AX

MOV AX,[Data48+2] ; Middle word.
DIV [TenTHou]
MOV [Work1+2],AX

MOV AX,[Data48+4] ; Low word.
DIV [TenTHou]
MOV [Work1+4],AX

MOV AX,01000H
DIV [TenTHou]
MOV [Work1+6],AX ; NNFHD.

; - - - Second round, multiple precision divide by 10,000

XOR DX,DX
MOV AX,[Work1] ; High word of intermediate result.
DIV [TenTHou]
MOV [Work2],AX ; It better be zero here.

MOV AX,[Work1+2] ; Second word.
DIV [TenTHou]
MOV [Work2+2],AX

MOV AX,[Work1+4] ; Third word.
DIV [TenTHou]
MOV [Work2+4],AX

MOV AX,[Work1+6] ; Low word. Not needed FHD.
DIV [TenTHou] ; NNFHD.
MOV [Work2+6],AX ; NNFHD.

MOV AX,01000H
DIV [TenTHou]
MOV [Work2+8],AX ; NNFHD.

; - - - Display integer digits of Data48 / 100,000.
MOV AX,[Work2+4] ; Low word of integer result.
MOV DX,[Work2+2] ; High word of result.
MOV CH,6 ; Used by display routine.

CALL Bin2DecL ; High digits of Data48.
CALL CRLF

; - - - Display rest of the digits of Data48 / 100,000.
MOV CX,8 ; Number of digits in 100,000,000 remainder.
Fraction:
MOV AX,[Work2+6]
MUL [Ten]
MOV [Work2+6],AX
ADD DX,'0' ; Convert DL to ASCII
SCALL CONOUT ; Display DL. (DOS Fn 2.)

MOV AX,[Work2+8]
MUL [Ten]
MOV [Work2+8],AX
ADD [Work2+6],DX
JC Oops
LOOP Fraction

Phil Carmody

unread,

Jul 8, 2014, 7:28:33 PM7/8/14

to

"Jerome H. Fine" <ever...@nospicedham.nospam.com> writes:
> Is there an algorithm that will perform an Unsigned Integer Divide
> for a 48-Bit Value using ONLY 16 bit instructions PLUS the
> Signed Integer Divide (is is DIV or IDIV?) instruction which
> divides DX,AX by a 16 bit value? OR maybe the Signed Integer
> Multiply (is it MUL or IMUL?) instruction could be substituted
> which multiplies two Signed 16-Bit Integers to produce a 32-Bit
> Integer?
>
> If this question is not of interest, you can forget the rest.
>
> Note that I am using the 8086 as the closest familiar hardware.
> The actual hardware is a bit different, but still has only 16 bits
> words.
>
> Basically, is there a simple, reasonable way to effectively divide
> a 48-Bit value residing in three convenient registers such as
> CX,DX,AX by 100,000,000?

As the divisor is a constant, I'd be tempted to multiply by
the reciprocal. That would in turn be a multi-word value.
1/10^8
= (2^48/10^8)/2^48
= (2814749.76710656)/2^48
= (42*2^16 + 62237 + 50273*2^-16 +...)/2^48
= 42*2^-32 + 62237*2^-48 + 50273*2^-64

If you look at the 9 multiplies which that implies, you'll
see that some of them won't be able to influence the result
except perhaps in weird corner cases.

Just 3 multiplies:

[ 42*CX ] 0000
0000 [ 42*DX ] +
0000 [ 62237*CX ] +

gives you a maximum error of 2 in the middle limb (which represents
*2^32*2^-32 = units). 6 multiplies gives you a maximum error of 2
in the 2^-16 column. That's probably all you need.

An improvement will be possible if you use a signed radix
representation for the numbers (so limbs can be in the range
[-32768, 32767], as the accumulated carry that you could get
from lower limbs is 1/4 the size (and thus less than 1 in the
3-multiply case above). See Dan Bernstein's fast mulmod/expmod
library for an example of such a representation.

Note, however, that you always have the option of instead dividing
by 390625 (again using reciprocal multiply), and then shifting down
by 8 bits. That's almost certainly superior. The constants there are
10995*2^-32 + 7620*2^-48 + 24856*2^-64

I've been assuming this is cheap silicon, where divides are likely
to me much more expensive than multiplies.

Phil
--
Religion is too important a matter to its devotees to be a subject of
ridicule. If they indulge in absurdities, they are to be pitied rather
than ridiculed. -- Immanuel Kant (1724-1804), lecture at Konigsberg, 1775

news

unread,

Jul 8, 2014, 6:55:25 PM7/8/14

to

Not if you do it correctly. Let's say you have a 48 bit unsigned
value in dividend1/2/3 (most significant-to-least), you can divide by
10,000 with something like:

mov bx,10000
mov dx,0
mov ax,dividend1
div bx
mov quotient1,ax
mov ax,dividend2
div bx
mov quotient2,ax
mov ax,dividend3
div bx
mov quotient3,ax
mov remainder,dx

That gets you a 48 bit quotient in quotient1/2/3 and the remainder.
Lather, rinse, repeat.

Rosario193

unread,

Jul 9, 2014, 2:41:19 PM7/9/14

to

if i remember well this problem can be resolved with
some shift both the number, that depend from the number a in a/b
one only division
[from two division type 32:16 or 16:8 the number are in bit, the
use of withch depend from divisor ]

but possible i remember wrong...

Jerome H. Fine

unread,

Jul 9, 2014, 10:02:47 PM7/9/14

to

>news wrote:

It is quite possible that I am calculating (by hand) incorrectly.
I started with an Unsigned 48-Bit Integer of 8,796,093,022,208
which is 2**43. If my powers of 2 are correct, the high order
value in dividend 1 = 2048 with dividend 2 = 0 and dividend 3 = 0.

Just as a check, 4,294,967,296 is 2**32, so dividend 1 = 1 with
dividend 2 = 0 and dividend 3 = 0.

In addition, within your algorithm, it is easier to understand if you
replace "quotient" with "dividend" OR you define the location for
quotient N: to be identical to dividend N:, as per the algorithm that
Terje Mathisen provided yesterday a few hours before your post.
Note that both of these two algorithms are essentially identical.

Then, dividing 2048 by 10000 and quotient 1 = 0 and the
remainder of 2048 in dx when combined with dividend 2 = 0
produces a divisor of 134,217,728 for the second div. The
quotient 2 = 13,421 is saved and the remainder of 7728 is
combined with dividend 3 = 0.

UNFORTUNATELY within the hardware I am using, that
produces a divisor of 506,462,208 which has a value for
quotient 3 = 50642 and a remainder of 2208. (Note that
the remainder already has the lower 4 Decimal Digits of
the original 48-Bit Integer.) CONSEQUENTLY, if no
additional code is present when the SIGNED div that is
the ONLY div available with the instruction set on the
hardware I am using, the div overflows.

PLEASE check my hand calculations!! I am reasonably
confident since the final result (after a few repeats) was
correct.

FORTUNATELY, adding a test to determine if the divisor
is too large is VERY easy! Just check that dx (well it is not
dx in my case, but I am keeping the symbols the same) is
less than 5000 (since 5000 * 65536 = 327,680,000). If
not, then subtract 5000 from dx and place 32768 in the
quotient N: as an initial value. After the div, just add the
32768 back to the actual (reduced) ax before moving ax
to quotient N and VIOLA, the saved quotient N is now
correct and saved for the next repeat. This extra code to
support the SIGNED idiv instruction takes about 6 extra
instructions, but fortunately they are all simple mov, cmp
and add instructions, so they should not significantly increase
the time for the algorithm - although the extra 6 instructions
will obviously increase the code size for the algorithm. But
I am fairly confident that I can use an index for the 3
quotients along with a second index for the remainders,
so it is very probable that the above 3 div instructions
will collapse in a loop that is repeated 3 times and the
6 extra instructions needed to support a SIGNED idiv
instruction will be needed only once. I will also add code
to determine if the divisor is less than 10000 (when the
idiv instruction need not be executed - just XCHG ax and
dx - or does the microcode collapse the idiv instruction
in that case as far as timing is concerned?).

I have finally satisfied myself that the divisor can NEVER
exceed 655,359,999 during the course of the algorithm.
That is impossible since the maximum value carried
into the next divisor by dx will always be 9999 or less.
Since every quotient N is limited to 65535, a bit of
arithmetic: divisor = ( dx * 65535 ) + ax
results in 655,359,999, so everything is OK!!

AND MANY, MANY THANKS TO BOTH Terje
and Eric for the algorithm which performs multi-precision
dividing. Somehow, it did not occur to me to start the
divides with the high order dividend. With add, sub and
mul, the algorithm for multi-precision always starts with
the low order portion and uses adc (or sbc) to cascade
to the higher order portions.

Jerome Fine

Rosario193

unread,

Jul 10, 2014, 6:17:36 AM7/10/14

to

On Wed, 09 Jul 2014 20:41:19 +0200, Rosario193 wrote:

>if i remember well this problem can be resolved with
>some shift both the number, that depend from the number a in a/b
>one only division
>[from two division type 32:16 or 16:8 the number are in bit, the
>use of withch depend from divisor ]
>
>but possible i remember wrong...

i remember wrong...
this seems ok only for find only the first 16 bit digit of division
in some case only...
section DATA

denominatore dw 0, 0, 0, 0
numeratore dw 10000, 0
risultato dw 0, 0

section BSS
section TEXT

; ax:rx:cx diviso bx risultato in ax
dividi:
int3
cmp bx, 0
je .e
cmp cx, 0
jne .1
cmp dx, 0
jne .1
cmp ax, 0
jne .1
mov ax, 0
jmp short .f
.1: test dx, 08000h
jnz .3
test bx, 08000h
jnz .3
clc
rcl cx, 1
rcl ax, 1
rcl dx, 1
clc
rcl bx, 1
jmp short .1

.3: div bx

.f: clc
jmp short .z
.e: mov ax, -1
stc
.z:
ret

Main: :
pushad
xor eax, eax
xor ebx, ebx
xor ecx, ecx
xor edx, edx
xor esi, esi

; carricare i valore nei registri
mov cx, 01000h
mov ax, 02000h
mov dx, 04000h
; divisore

mov bx, 02710h
mov bx, 08710h

; 04000:2000:1000h / 02710h == 01_A36F_0069h not find first digit
; 04000:2000:1000h / 08710h == 794E_C741 find first digit
call dividi

.f:
popad
mov eax, 0
ret
_____________________
section DATA

denominatore dw 0, 0, 0, 0
numeratore dw 10000, 0
risultato dw 0, 0

section BSS
section TEXT

; ax:rx:cx diviso bx risultato in ax
dividi:
int3
bx==0#.e
cx#.1|rx#.1|ax#.1|ax=0|#.f
.1: rx& 08000h #.3
bx& 08000h #.3
clc
rcl cx, 1
rcl ax, 1
rcl rx, 1
clc
rcl bx, 1
#.1

.3: div bx

.f: clc |#.z
.e: ax=-1|stc
.z:
ret

Main::
pushad
a^=a|b^=b|c^=c|r^=r|i^=i

; carricare i valore nei registri
cx=01000h|ax=02000h|rx=04000h
; divisore

bx=02710h
bx=08710h

; 04000:2000:1000h / 02710h == 01_A36F_0069h
; 04000:2000:1000h / 08710h == 794E_C741
dividi()

.f:
popad
a=0
ret

Rosario193

unread,

Jul 10, 2014, 6:36:28 AM7/10/14

to

On Thu, 10 Jul 2014 12:17:36 +0200, Rosario193
<Ros...@nospicedham.invalid.invalid> wrote:

>; ax:rx:cx diviso bx risultato in ax

wrong
; rx:ax:cx diviso bx risultato in ax

Jerome H. Fine

unread,

Jul 10, 2014, 6:37:40 AM7/10/14

to

>Jerome H. Fine wrote:

>>>>[Snip]

As a CORRECTION to the above explanation, it is
hopefully obvious that whereever I used the word "divisor",
I should have used "TOTAL 32-Bit DIVIDEND" of dx,ax -
the divisor is, of course, always 10000.

Jerome Fine

Jerome H. Fine

unread,

Jul 16, 2014, 7:57:29 PM7/16/14

to

>Terje Mathisen wrote:

[Snip]

The easiest is to split the input into chunks based on the largest power of ten that fits in 16 bits, i.e. 10000:
Assuming SI points to 48-bit input variable, DI points to low end of 4 16-bit variables to receive the remainders, each to be less than 10000:

mov bx,10000
mov cx, 3
next:
xor dx,dx
mov ax,[si+4]
div bx
mov [si+4],ax
mov ax,[si+2]
div bx
mov [si+2],ax
mov ax,[si]
div bx
mov [di], dx
add di,2
dec cx
jnz next
stosw ; The most significant chunk ends up here.

I have some comments on your algorithm, so I thought
I would respond privately. But I don't think I have
your e-mail address. I tried the e-mail address we
both used back in 2007 ....

Please respond privately if you are interested. My
e-mail address has not changed.

Jerome Fine

Terje Mathisen

unread,

Jul 17, 2014, 9:44:54 AM7/17/14

to

Don't mind me, if I've messed up it should be fixed in public. :-)

> I would respond privately. But I don't think I have
> your e-mail address. I tried the e-mail address we
> both used back in 2007 ....

I'm at tmsw.no these days, I got my own domain(s) around that time.

Terje

>
> Please respond privately if you are interested. My
> e-mail address has not changed.
>
> Jerome Fine

Terje Mathisen

unread,

Jul 18, 2014, 2:26:17 AM7/18/14

to

Jerome H. Fine wrote:
> >Terje Mathisen wrote:
>> I'm at tmsw.no these days, I got my own domain(s) around that time.

> If I had opened my eyes and actually looked (rather than just
> saw your "quote", I would have realized.
>
> When someone at your level makes a typo (which no one else
> seems to care about in any case), I prefer to make the correction
> privately.

Don't do that!

As I said, when I make mistakes in public, everyone else should have the
opportunity to learn from them as well!

I make such mistakes almost every day, particularly when I don't even
try to assemble/compile my code. :-)
(I've replaced the newsgroup in my reply.)
>
> In any case, ...:
>
> mov bx,10000
> mov cx,4 ;Remove - not used
> next:
> xor cx,cx

> xor dx,dx
> mov ax,[si+4]
> div bx
> mov [si+4],ax

> or cx,ax

> mov ax,[si+2]
> div bx
> mov [si+2],ax

> or cx,ax
> or cx,dx
> jz finish

> mov ax,[si]
> div bx

> mov [si],ax

> mov [di], dx
> add di,2

> jmp next
> finish:
> mov dx,[si]
>
> The final value is in dx and di can be used to determine
> how many chunks were needed. If I am correct, the
> last (high order) chunk is between 0 and 65535 (not
> limited to a maximum of 9999). Unfortunately, for small
> values up to 65,535 the code still needs 2 div instructions.
> Even for values up to 655,369,999 the code still needs
> 5 div instructions, whereas a couple of test instructions
> eliminates all div instructions for the former; for the latter
> adding a cmp instruction requires exactly one div. I am
> not sure what I will do with the final code.
If you want this to be faster, you can sometimes replace the final DIV
with a reciprocal MUL, and you should indeed check where the most
significant chunk is located!

Instead of blind divisions you can start by testing the AX value:

xor dx,dx
mov ax,[si+4]

test ax,ax
jz skip1

div bx
mov [si+4],ax

skip1:
mov ax,[si+2]
test ax,ax
jz skip2

div bx
mov [si+2],ax

skip2:

This gets rid of any initial DIVs that you don't need.

Terje
>
> I am not familiar enough with either the 8086 instruction set
> or the syntax to be sure, but I suspect the extra few OR
> instructions should be useful when the original values are
> relatively small. In addition, "mov [si],ax" seems to have
> been missing. This code seems to be a good compromise
> over using cx as a counter.
>
> Just to clarify, while I occasionally do write code for 80386 (and
> higher) hardware (fortunately at least with 32-Bit registers and
> instructions), my primary (99.9% of the time) focus is writing
> assembler code for the PDP-11 running RT-11. However,
> there is sufficient correlation that an algorithm which works with
> the 8086 instruction set will almost always (with modifications)
> work with the PDP-11 instruction set. Note that my only recent
> experience writing x86 code was to enhance a 32-Bit DLL used
> with a PDP-11 emulator (Ersatz-11) which provides direct access
> to the PC memory - under a 32 bit Windows XP on a system with
> 4 GB of physical memory (of course Microsoft ignores the upper
> 1 GB), the DLL gives the PDP-11 code access to 1.75 GB of
> memory via the so-called IOPAGE registers that are used by
> the PDP-11 hardware to access physical devices. Just to
> clarify, I use that PDP-11 emulator (Ersatz-11 which just
> turned 20 years old while RT-11 is now 40 years old) under
> Windows 98% of the time when I run RT-11. The only real
> difference I notice is that the code runs MUCH faster.
>
> In fact, I can even take a SCSI hard drive (connected to
> a host adapter on the PDP-11) and connect it to a host
> adapter on the PC (Adaptec AHA-2940AU). Ersatz-11
> has commands which will MOUNT that SCSI hard drive,
> then run the PDP-11 instructions as they are found on the
> hard drive. Normally I don't do that since SATA drives
> are a bit faster - about 200 to 1000 times as fast.
>
> The reason I wanted the 48-Bit Encode was to display the
> value of a 48-Bit counter in the code for a device handler
> which acts as a DEBUG routine. The counter keeps track
> of how many instructions have been executed. On the real
> PDP-11, I suspect that the overhead associated with the
> extra code of the DEBUG routine limits the number of
> instructions to about 10,000 per second (from a normal
> rate of about 1,000,000 per second). So a 32-Bit counter
> would have been more than sufficient. But with the emulator
> running on current 4 GHz CPUs, the normal rate is probably
> more than 100,000,000 instructions per second, so a 48-Bit
> counter seems to be required. Obviously the PDP-11 Encode
> routine will look VERY different, but I think you would easily
> recognize your concept. If you are interested, I will send you
> a copy in a few weeks when I manage to squeeze in the extra
> code - the device driver uses extended memory for most of
> the code and data and unless I modify the code for another
> piece of 4096 words (which I don't want to do - three areas
> of 4096 words are already used), I only have 12 words left
> where the 32-Bit version of the code resides and I need to
> make some other code and / or data more efficient.

pe...@nospam.demon.co.uk

unread,

Jul 18, 2014, 2:39:26 AM7/18/14

to

On 8th Jul 2014 at 00:43 "news" <ne...@nospicedham.fx09.iad.highwinds-media.com> wrote:

Sorry for the late reply -- work got in the way...

> On Mon, 07 Jul 2014 22:33:24 -0400, "Jerome H. Fine"
> <ever...@nospicedham.nospam.com> wrote:
>
> >Is there an algorithm that will perform an Unsigned Integer Divide
> >for a 48-Bit Value using ONLY 16 bit instructions PLUS the
> >Signed Integer Divide (is is DIV or IDIV?) instruction which
> >divides DX,AX by a 16 bit value? OR maybe the Signed Integer
> >Multiply (is it MUL or IMUL?) instruction could be substituted
> >which multiplies two Signed 16-Bit Integers to produce a 32-Bit
> >Integer?

[snip]

> >Specifically, I am starting with an Unsigned 48-Bit Integer
> >in CX,DX,AX and I want to Encode the value (into Decimal
> >digits).

[snip rest]

For fun I thought I'd extend the idea to a 64 bit dividend, code below
if you are still interested. For 48 bits, just remove the "w3" block.
This is for MASM4/5 and callable from C -- adjust as required.

-- start U64TOA.ASM --
SLEN equ 32 ;overkill - 21 would do
TENTHOU equ 10000
HUNDRED equ 100

_TEXT SEGMENT PUBLIC WORD 'CODE'
_TEXT ENDS

CONST SEGMENT WORD 'CONST'
;lookup table for div by 100
lut db '00010203040506070809'
db '10111213141516171819'
db '20212223242526272829'
db '30313233343536373839'
db '40414243444546474849'
db '50515253545556575859'
db '60616263646566676869'
db '70717273747576777879'
db '80818283848586878889'
db '90919293949596979899'
CONST ENDS

_DATA SEGMENT PUBLIC WORD 'DATA'
answer db SLEN dup(0)
_DATA ENDS

CGROUP GROUP _TEXT
DGROUP GROUP CONST, _DATA

_TEXT SEGMENT PUBLIC WORD 'CODE'
ASSUME CS:CGROUP, DS:DGROUP
PUBLIC _u64toa

_u64toa proc near ;(char *) u64toa (unsigned int *x);
push bp
mov bp, sp
push si
push di

mov si, 4[bp] ;small model - get ptr to uint[4]
lea di, DGROUP:answer + SLEN - 1
mov word ptr -1[di], '0' ;init & terminate C string

entry: mov bx, TENTHOU
xor dx, dx ;init dividend MSW
xor cx, cx ;init 'done' flag
w3: mov ax, 6[si] ;point at MSW
or cx, ax
jz w2
div bx ;remainder in DX
mov 6[si], ax
w2: mov ax, 4[si] ;next word
or cx, ax
jz w1
div bx ;remainder in DX
mov 4[si], ax
w1: mov ax, 2[si] ;next word
or cx, ax
jz w0
div bx ;remainder in DX
mov 2[si], ax
w0: mov ax, 0[si] ;LSW
or cx, ax
jz done
div bx ;remainder in DX
mov 0[si], ax
call update_result ;rinse
jmp short entry ;repeat

done: call trim_leading_zeros
mov ax, di ;point at answer
pop di
pop si
pop bp
ret
_u64toa endp

update_result:
sub di, 4 ;update write ptr
mov ax, dx
xor dx, dx
mov bx, HUNDRED
div bx
mov bx, ax ;thousands + hundreds
shl bx, 1
mov ax, word ptr DGROUP:lut[bx]
mov [di], ax
mov bx, dx ;tens + units
shl bx, 1
mov ax, word ptr DGROUP:lut[bx]
mov 2[di], ax
ret

trim_leading_zeros:
cmp di, offset DGROUP:answer + SLEN - 2
jnbe tz0
cmp byte ptr [di], '0'
jne tz_ret
inc di
jmp short trim_leading_zeros
tz0: jna tz_ret
dec di ;edge case - initial value was zero
tz_ret: ret

_TEXT ENDS
END
-- end U64TOA.ASM --

You could probably shave off a few cycles by inlining the calls but
the biggest win would probably be replacing the DIVs with reciprocal
MULs, but in my testing I couldn't find the magic to preserve the
precision with only 16 bits. But doing this sort of thing on a 16 bit
processor is not going to be blisteringly fast anyway...

And here's a small C test harness:
-- start TEST.C --
#include <stdio.h>
#include <stdlib.h>
typedef unsigned int uint;
const char *u64toa (uint *);

int main (int argc, char *argv[])
{
uint x[4];
const char *res;

if (argc > 4)
{
x[0] = (uint) atoi(argv[1]);
x[1] = (uint) atoi(argv[2]);
x[2] = (uint) atoi(argv[3]);
x[3] = (uint) atoi(argv[4]);
res = u64toa(x);
printf ("res = %s\n", res);
}
return 0;
}
-- end TEST.C --

The 64 bit array is little endian (for no special reason), and the
result is a pointer to a static string; you could of course include a
pointer parameter to the receiving buffer for the string and do the
copy at the end -- your choice.

Pete
--
Believe those who are seeking the truth.
Doubt those who find it. - Andr� Gide

Jerome H. Fine

unread,

Jul 18, 2014, 11:15:18 AM7/18/14

to

>Terje Mathisen wrote:

>Jerome H. Fine wrote:
When someone at your level makes a typo (which no one else
seems to care about in any case), I prefer to make the correction
privately.
Don't do that!

As I said, when I make mistakes in public, everyone else should have the opportunity to learn from them as well!

I make such mistakes almost every day, particularly when I don't even try to assemble/compile my code. :-)

In my case, I must also execute the code to at least
catch to stupid mistakes. Normally, I focus sufficiently
on the more difficult code to get it correct, often the
first time.:-) On stuff which looks obvious (such as a
jz vs a jnz), I will sometimes use the wrong one.

(I've replaced the newsgroup in my reply.)

I understand. And I do agree. Which is the reason for
my reply.

div2:

div bx
mov [si+2],ax
skip2:

This gets rid of any initial DIVs that you don't need.

Just after "mov ax,[si+2], I suspect that:
test dx,dx
jnz div2
is needed since the divide instruction still needs to be done
if the dividend (dx,ax) is non-zero. (Note that I also inserted
the label div2: into the code so it would be available.) For
my actual target hardware, fortunately the first "test ax,ax" is
unnecessary since on the PDP-11, the "z" flag is set by the
PDP-11 instruction, "Mov -(R3),R0", so it is probably
time effective to avoid the first div instruction. However,
at least using the Ersatz-11 emulator, the second set of
test instructions are very probably not time effective since
the div instruction is only about 3 times the time for the
test instruction. On real DEC PDP-11 hardware, the
ratio is much higher, around 30 times for some CPUs.
Thus I suspect that executing up to 4 instructions just to
save the second div is probably not all the time effective
for the PDP-11. And probably (except maybe for the
8086), on more recent x86 CPUs, avoiding the second
div instruction is also not time effective.

As for replacing the final division with the mul instruction,
I think it seems unlikely since the dividend is 32 bits and
both the quotient and remainder are required.

Finally, for the PDP-11, the div instruction is equivalent
to the x86 idiv instruction. Consequently, it will be essential
to add code (just 5 instructions including one cmp does
the job) to take care of any dividend from 327,680,000 to
655,359,999. (Note that 655,359,999 is the maximum
dividend when dx = 9999 and ax = 65535.) Fortunately,
only the second and third dividend can be that large.

If I have missed anything, please let me know.

Jerome Fine

Jerome H. Fine

unread,

Jul 18, 2014, 11:25:52 AM7/18/14

to

A VERY quick inspection at your algorithm suggests that
the first "or cx,ax" should be "test ax,ax". In addition, it
seems like just before the last "or cx,ax", there needs to
be the instruction "or cx,dx". Please confirm. Otherwise,
your algorithm is extremely helpful. I especially like the
lut table to avoid extra divides by 10.

Jerome Fine

Rosario193

unread,

Jul 20, 2014, 5:36:03 AM7/20/14

to

On Tue, 08 Jul 2014 17:18:50 +0200, Terje Mathisen wrote:
>The easiest is to split the input into chunks based on the largest power
>of ten that fits in 16 bits, i.e. 10000:
>
>Assuming SI points to 48-bit input variable, DI points to low end of 4
>16-bit variables to receive the remainders, each to be less than 10000:
>
> mov bx,10000
> mov cx, 3
>next:
> xor dx,dx
> mov ax,[si+4]
> div bx
> mov [si+4],ax
> mov ax,[si+2]
> div bx
> mov [si+2],ax
> mov ax,[si]
> div bx
> mov [di], dx
> add di,2
> dec cx
> jnz next
> stosw ; The most significant chunk ends up here.

i traslate that for 386 in

dividi2:
int3
mov ecx, 3
next:
xor edx, edx
mov ax,[esi+4]
div bx
mov [esi+4],ax
mov ax,[esi+2]
div bx
mov [esi+2],ax
mov ax,[esi]
div bx
mov [edi], dx
add edi,2
dec ecx

jnz next
stosw ; The most significant chunk ends up here.

ret

than in with windows calculator i have:
04000:2000:1000h / 02710h == 01:A36F:0069h
04000:2000:1000h mod 02710h == 0A70h

all numbers are in hex and 1000 decimal == 02710h

i have this mem first of the computation
numeratore dw 01000h, 02000h, 04000h, 0
ck dd -1
modulo dw 0, 0, 0, 0, 0, 0
chk dd -1
denominatore dw 10000, 0

i have
mov esi, numeratore
mov edi, modulo
mov ebx, 02710h

and than
call dividi2()

afther the computation:
the debugger show i have

numeratore dw 1000 0000 0000 0000
dw FFFF FFFF
modulo dw 0A70 0830 0000 0000
dw FFFF FFFF
2710 0000

all in hex
where is the result?

the remainder will be that 0A70 but what means 0830?

Terje Mathisen

unread,

Jul 20, 2014, 7:46:29 AM7/20/14

to

Rosario193 wrote:
> i traslate that for 386 in

If you have a 32-bit cpu then the best choice is pretty obvious:

Calculate a 49+ bit accurate reciprocal of 100 000 000 and use this to
split the incoming value into two halves:

The top half will have the upper 6-7 digits and the low half the bottom 8.

The actual calculation would look like this:

;; Aggregate the 128-bit intermediate result into edi:esi:ecx:ebx

mov eax,[value]
mul [recip]
mov ebx,eax
mov ecx,edx ;; Next part
mov eax,[value]
mul [recip+4]
xor esi,esi
add ecx,eax
adc esi,edx

mov eax,[value+4]
mul [recip]
xor edi,edi
add ecx,eax
adc esi,edx
adc edi,edi

mov eax,[value+4]
mul [recip+4]
add esi,eax
adc edi,edx

;; With the proper scaling of the reciprocal, the result
;; should end up in ESI
;; I.e. the stages that aggregates stuff in EDI can probably be skipped

;; Now we back-multiply and subtract in order to get the lower
;; 8 digits of the result:
mov eax,100000000
mul esi
mov ebx,[value]
mov ecx,[value+4]
sub ecx,eax
sbb ebx,edx

At this point ECX should contain a value in the 0..99 999 999 range and
ESI a value between 0 and 2814749, i.e. the top 7 digits.

An alternative approach would take advantage of the limited (48-bit)
range and start with the top 32 bits of this value, rounded up (i.e. add
65535 to the bottom 16 bits and add in any carry) and multiply by a
32-bit accurate reciprocal of 100 000. The result can be a little bit
too large, but that is easily fixed with a back-multiply and subtract:
If we get a carry, then decrement the result and add 100 000 to the
remainder!

Terje

Jerome H. Fine

unread,

Jul 20, 2014, 8:35:40 AM7/20/14

to

I did not follow very much of the above, but perhaps you might
change the code just a bit?

On the other hand, as Terje just suggested, you may instead decide
to change to using all 32-bit values and registers. But if you are
just attempting to understand and use Terje's algorithm as is with
16 bit values and registers, then the following may help.

Just above the label "next:", change "mov ecx,3" to "mov ecx,4"
since 4 loops are required. You also need to add "mov bx,10000".
Then, just before "mov [edi],dx", you need to insert "mov [esi],ax".

In addition, esi and edi also need to be initialized as Terje mentioned
in his original response.

As Terje noted and especially for small 48-Bit Integers, many of
the div instructions are unnecessary. A few extra "or" instructions
and a conditional branch are helpful.

Also note that with a 386, your can divide by 1,000,000,000
using (edx,eax) and ebx RATHER than just 10,000 when using
(dx,ax) and bx. As a consequence, an Unsigned 48-Bit Integer
Divide is rather trivial for the 386. Now an Unsigned 96-Bit
Integer Divide would require the above type of algorithm even
for the 386 CPU, but then you would start with 32-bit values.

Since I rarely actually write or even modify x86 code, I really do
not understand the rest of your rely. But I hope that the above
helps correct the instructions which Terje's code needed. Please
confirm if my suggestions are helpful.

My primary purpose in starting this thread was to modify the
8086 code to be used with a PDP-11 which is also limited to
16-bit registers just like the 8086. That purpose Terje
accomplished PERFECTLY and I VERY MUCH appreciated
the algorithm which Terje provided. If anyone is interested in
the final PDP-11 code, please ask and I will post it when I
finish.

Jerome Fine

Rosario193

unread,

Jul 20, 2014, 12:03:01 PM7/20/14

to

On Sun, 20 Jul 2014 13:46:29 +0200, Terje Mathisen wrote:
>Rosario193 wrote:
>> i traslate that for 386 in
>
>If you have a 32-bit cpu then the best choice is pretty obvious:
>
>Calculate a 49+ bit accurate reciprocal of 100 000 000 and use this to
>split the incoming value into two halves:

this could be ok if you have to do only divisions with the
number 100 000 000 as divisor...

so i have to find hexdigits(1/100 000 000)

my phone says it is
2,af31dc46_11873bf4 ^-7?

so 1/100 000 000 in hex digit is: 0.00000002_af31dc46_11873bf4
or not?

so the 64 bit number to store in memory as 1/100 000 000 would be

rec1e8 dd af31dc46, 2

or is it wrong?

Rosario193

unread,

Jul 20, 2014, 2:14:33 PM7/20/14

to

On Sun, 20 Jul 2014 13:46:29 +0200, Terje Mathisen

---------------------------------------
one try:
section DATA

; with my phone calc
; value = 1828882828825655 = 67F5B_F7591437
; value/100 000 000 = 18288828.28825655 = 11710BC.49CB2E671C
; 100000000== 0af31dc46h, 02h,

value dd 0F7591437h, 067F5Bh, 0, 0
recip dd 0af31dc46h, 02h, 0, 0

section BSS
section TEXT

; Aggregate the 128-bit intermediate result into edi:esi:ecx:ebx

divrec:
int3
mov eax,[value]
mul D[recip]

mov ebx,eax
mov ecx,edx ;; Next part
mov eax,[value]

mul D[recip+4]

xor esi,esi
add ecx,eax
adc esi,edx

mov eax,[value+4]

mul D[recip]

xor edi,edi
add ecx,eax
adc esi,edx
adc edi,edi

mov eax,[value+4]

mul D[recip+4]

add esi,eax
adc edi,edx

; With the proper scaling of the reciprocal, the result
; should end up in ESI
; I.e. the stages that aggregates stuff in EDI can probably be skipped

; Now we back-multiply and subtract in order to get the lower
; 8 digits of the result:

mov eax,100000000
mul esi
mov ebx,[value]
mov ecx,[value+4]
sub ecx,eax
sbb ebx,edx

ret

i have to print, and think on this code

the result in esi is wrong if i make a compare to the one by the
calculator of the phone i have

the result i see it is in esi:ecx and for find the right point one
have to shift the result in esi:ecx by 4

this below find the correct result in esi of this only division
; value = 1828882828825655 = 67F5B_F7591437
; value/100 000 000 = 18288828.28825655 = 11710BC.49CB2E671C

; Aggregate the 128-bit intermediate result into edi:esi:ecx:ebx

divrec:
int3
mov eax,[value]
mul D[recip]

mov ebx,eax
mov ecx,edx ;; Next part
mov eax,[value]

mul D[recip+4]

xor esi,esi
add ecx,eax
adc esi,edx

mov eax,[value+4]

mul D[recip]

xor edi,edi
add ecx,eax
adc esi,edx
adc edi,edi

mov eax,[value+4]

mul D[recip+4]
add esi,eax
adc edi,edx

SHLD esi,ecx, 4

ret

this find in esi the right result

where D means dword
thanks

Terje Mathisen

unread,

Jul 20, 2014, 2:59:09 PM7/20/14

to

I don't know. :-)

What I do know is that I wrote a test program (using 64-bit C) to verify
my conjectures, it does work! :-)

I.e. you can take a 48-bit number and split it into two blocks, with 8
digits in the lower block and 7 in the top, then convert those two
blocks independently to ascii:

void test(uint64_t t)
{
unsigned recip = 2882303762; // ceil(2^58/1e8)
// Calculate a 32-bit accurate reciprocal, using two 32x32->64 MULs
uint64_t p = (t >> 16)*recip + (((t & 0xffff) * recip) >> 16);
// Adjust the result for the 58-16=42 bit shift of the reciprocal
p >>= 42;
// Calculate the remainder, can be negative
int rem = (int) (t - p*100000000);
int mask = rem >> 31; // -1 if p was too large
rem += 100000000 & mask;
p += mask;

// The following codes are never entered, tested around every
// possible 100M block boundary
while (rem < 0) {
rem += 100000000;
p--;
printf("%llu needed multiple adjustments!\n", t);
posadj++;
}
while (rem >= 100000000) {
rem -= 100000000;
p++;
printf("%llu needed negative adjustment\n", t);
negadj++;
}
}

I.e. by multiplying the 48-bit input value by the 32-bit 2882303762
reciprocal value and shifting the full 80-bit result down by 58 bits we
get a result which is sometimes off-by-one (too high).

With a back-multiply (32*32->64) and subtract we get an exact remainder
which will be negative if this happened, so we use the sign bit as a
mask to correct the result if needed!

Using 16-bit operations the initial 32x48 mul needs 6 16x16->32 MULs,
then a 32x32->64 backmul (needing 4 16x16->32 MULs), a 48-bit
subtraction and the adjustment step.

Converting a 32-bit value in the 0-1e8 range (8 digits) to ascii is most
easily handled by a division by 10000, leaving the top 4 digits as the
result and the bottom 4 as the remainder.

Rosario193

unread,

Jul 20, 2014, 3:28:53 PM7/20/14

to

On Sun, 20 Jul 2014 20:59:09 +0200, Terje Mathisen wrote:
>I wrote

>> so the 64 bit number to store in memory as 1/100 000 000 would be
>>
>> rec1e8 dd af31dc46, 2
>>
>> or is it wrong?
>>
>
>I don't know. :-)
>
>What I do know is that I wrote a test program (using 64-bit C) to verify
>my conjectures, it does work! :-)
>
>I.e. you can take a 48-bit number and split it into two blocks, with 8
>digits in the lower block and 7 in the top,

do you mean decimal digit or hex[4 bit] digit? or it is the register
bits have to do in this case the big number multiplication [32 bit]?

i when i speak in this argument for digit i mean the bit register
in this case one digit is 32 bit

when the algo for mult or div have register of 16 bits the digit is 16
bit etc

but i think it is here digit==hex digit

Terje Mathisen

unread,

Jul 20, 2014, 4:25:55 PM7/20/14

to

Rosario193 wrote:
> On Sun, 20 Jul 2014 20:59:09 +0200, Terje Mathisen wrote:
>> I wrote
>
>>> so the 64 bit number to store in memory as 1/100 000 000 would be
>>>
>>> rec1e8 dd af31dc46, 2
>>>
>>> or is it wrong?
>>>
>>
>> I don't know. :-)
>>
>> What I do know is that I wrote a test program (using 64-bit C) to verify
>> my conjectures, it does work! :-)
>>
>> I.e. you can take a 48-bit number and split it into two blocks, with 8
>> digits in the lower block and 7 in the top,
>
> do you mean decimal digit or hex[4 bit] digit? or it is the register
> bits have to do in this case the big number multiplication [32 bit]?

Decimal digit!

The OP started with a request to convert an arbitrary 48-bit value into
ascii, I suggested various approaches using long division, then when
32-bit ops were suggested (by you?) I suggested alternatives more suited
to such a cpu.

Terje

>
> i when i speak in this argument for digit i mean the bit register
> in this case one digit is 32 bit
>
> when the algo for mult or div have register of 16 bits the digit is 16
> bit etc
>
> but i think it is here digit==hex digit
>

Rosario193

unread,

Jul 21, 2014, 2:55:03 PM7/21/14

to

On Mon, 07 Jul 2014 22:33:24 -0400, "Jerome H. Fine" wrote:
>Is there an algorithm that will perform an Unsigned Integer Divide
>for a 48-Bit Value using ONLY 16 bit instructions PLUS the
>Signed Integer Divide (is is DIV or IDIV?) instruction which
>divides DX,AX by a 16 bit value? OR maybe the Signed Integer
>Multiply (is it MUL or IMUL?) instruction could be substituted
>which multiplies two Signed 16-Bit Integers to produce a 32-Bit
>Integer?

i not understand well

you search routine /(,) [division 2 argument]
unsigned 48 bit / unsigned 8 bit
unsigned 48 bit / unsigned 16 bit
unsigned 48 bit / unsigned 32 bit

or you search

unsigned 48 bit / unsigned 48 bit?

above there is not the problem for divide from 10 100 1000 etc

the problem for divide from 10 100 1000 etc
will be only for show the result in decimal
or get from extern the input in decimal

but these div functions in a debugger
can [*have*] to be seen with register in hex
all one can wish ...

with no need of decimal routines... no need register seen in decimal

i find impossible to debug these routine in decimal
i think-debug them in hex

result in hex all in hex...
hex==hexadecimal

Rosario193

unread,

Jul 21, 2014, 2:55:26 PM7/21/14

to

On Tue, 08 Jul 2014 17:18:50 +0200, Terje Mathisen wrote:

>Jerome H. Fine wrote:
>> Is there an algorithm that will perform an Unsigned Integer Divide
>> for a 48-Bit Value using ONLY 16 bit instructions PLUS the
>> Signed Integer Divide (is is DIV or IDIV?) instruction which
>> divides DX,AX by a 16 bit value? OR maybe the Signed Integer
>> Multiply (is it MUL or IMUL?) instruction could be substituted
>> which multiplies two Signed 16-Bit Integers to produce a 32-Bit
>> Integer?
>[snipped horribly complex binary to ascii conversion!]
>
>> One other possibility is to convert everything to 64-Bit
>> double precision floating point since those instructions
>> are also available. All that would be needed would be
>> a divide of "x" by 100,000,000 and convert that quotient
>> to an integer. This would still require MANY instructions.
>
>Using the x87 hw would indeed work, there is even a BCD conversion
>opcode available but it is very slow.
>>
>> Or is there an algorithm that is even easier?
>
>Yes!
>
>The easiest is to split the input into chunks based on the largest power
>of ten that fits in 16 bits, i.e. 10000:
>
>Assuming SI points to 48-bit input variable, DI points to low end of 4
>16-bit variables to receive the remainders, each to be less than 10000:
>
> mov bx,10000

if here you put "mov bx, 0E000h"
the below routine is ok
or it is ok only for 10000 decimal ?

[and si point to
value dw 0,0,0123h, 0 [it would be the number in hex 0123 0000 0000h]
]

i think this routine has to have a problem
10000 = 02710h

in this division:
0123 0000 0000h / 02710h = 077318FCh

> mov cx, 3
>next:
> xor dx,dx
> mov ax,[si+4]
> div bx
> mov [si+4],ax
> mov ax,[si+2]
> div bx
> mov [si+2],ax
> mov ax,[si]
> div bx
> mov [di], dx
> add di,2
> dec cx
> jnz next
> stosw ; The most significant chunk ends up here.

it seems to me the result of division it is in "si", "si+2" "si+4"

Rosario193

unread,

Jul 21, 2014, 2:58:53 PM7/21/14

to

On Mon, 07 Jul 2014 22:33:24 -0400, "Jerome H. Fine"

<ever...@nospicedham.nospam.com> wrote:

>Is there an algorithm that will perform an Unsigned Integer Divide
>for a 48-Bit Value using ONLY 16 bit instructions PLUS the
>Signed Integer Divide (is is DIV or IDIV?) instruction which
>divides DX,AX by a 16 bit value? OR maybe the Signed Integer
>Multiply (is it MUL or IMUL?) instruction could be substituted
>which multiplies two Signed 16-Bit Integers to produce a 32-Bit
>Integer?

this is one other try: [if you want 16 bit cpu code it would be
enought swap esi <-> si; edi <-> di]

this divide only 48 bit unsigned numbers to 16 bit unsigned numbers
i never know if it is ok or there are bug in it

; input esi:
;punta in mem a numero da dividere di 48 bit [16[0] 16[2] 16[4]]
; input bx: divisore
; output edi: punta in mem al risultato
; output esi: punta in mem a numero modulo della divisione
div48to16:
xor eax, eax
xor edx, edx
cmp word[esi+4], bx
jae .1
mov ax, [esi+2]
mov dx, [esi+4]
div bx
mov [edi+2], ax
mov [esi+2], dx
mov word[esi+4], 0
mov word[edi+4], 0
jmp short .3
.1: mov ax, [esi+4]
div bx
mov [edi+4], ax
mov [esi+4], dx

mov ax, [esi+2]
div bx

mov [edi+2], ax
mov [esi+2], dx
mov word[esi+4], 0
.3: mov ax, [esi]
div bx
mov [edi], ax
mov [esi], dx
mov word[esi+2], 0
ret
; 04000:2000:1000h / 02710h == 01_A36F_0069h

; input i: punta in mem a numero da dividere di 48 bit [16[0] 16[2]
16[4]]
; input bx: divisore
; output j: punta in mem al risultato
; output i: punta in mem a numero modulo della divisione
div48to16:
a^=a |r^=r
W*i+4>=bx#.1|ax=*i+2|rx=*i+4
div bx | *j+2=ax|*i+2=rx|W*i+4=0|W*j+4=0|#.3
.1: ax=*i+4 |div bx | *j+4=ax|*i+4=rx
ax=*i+2 |div bx | *j+2=ax|*i+2=rx|W*i+4=0
.3: ax=*i |div bx | *j =ax|*i =rx|W*i+2=0
ret
; 04000:2000:1000h / 02710h == 01_A36F_0069h

Jerome H. Fine

unread,

Jul 21, 2014, 9:47:00 PM7/21/14

to

>Rosario193 wrote:

>On Mon, 07 Jul 2014 22:33:24 -0400, "Jerome H. Fine" wrote:

Is there an algorithm that will perform an Unsigned Integer Divide
for a 48-Bit Value using ONLY 16 bit instructions PLUS the
Signed Integer Divide (is is DIV or IDIV?) instruction which
divides DX,AX by a 16 bit value?  OR maybe the Signed Integer
Multiply (is it MUL or IMUL?) instruction could be substituted
which multiplies two Signed 16-Bit Integers to produce a 32-Bit
Integer?

i not understand well

Since I my original post started this thread and I seem to have been the
primary beneficiary of code originally supplied by Terje Mathisen that
provided a successful solution, I feel some responsibility to answer.

Just to clarify, while I occasionally do write code for 80386 (and
higher) hardware (fortunately at least with 32-Bit registers and
instructions), my primary (99.9% of the time) focus is writing
assembler code for the PDP-11 running RT-11. However,
there is sufficient correlation that an algorithm which works with
the 8086 instruction set will almost always (with modifications)
work with the PDP-11 instruction set. Note that my only recent
experience writing x86 code was to enhance a 32-Bit DLL used
with a PDP-11 emulator (Ersatz-11) which provides direct access

to the PC memory. Under a 32 bit Windows XP on a system with

4 GB of physical memory (of course Microsoft ignores the upper
1 GB), the DLL gives the PDP-11 code access to 1.75 GB of
memory via the so-called IOPAGE registers that are used by
the PDP-11 hardware to access physical devices. Just to
clarify, I use that PDP-11 emulator (Ersatz-11 which just
turned 20 years old while RT-11 is now 40 years old) under
Windows 98% of the time when I run RT-11. The only real
difference I notice is that the code runs MUCH faster.

For those of you who are not familiar with the PDP-11 instruction
set, many of the 8086 instructions are almost identical, but the
DIV and MUL in the PDP-11 are SIGNED instructions. That
adds a bit of complication to the algorithm, but the code is still
essentially the same.

As for the specific reason for my original request, I modified
the DEBUG program which executes under RT-11 as a device
driver. As part of the enhancement, I added code which saves
the Program Counter Addresses so that a user could back trace
which instructions had previously been executed when the DEBUG
program assumed control upon a halt of the code at a breakpoint.

Since many billions of instructions could have been executed up to
that point, it occurred to me that it would be helpful to add a few
more instructions to increment a counter (each time an instruction
was executed) in the loop which saved the Program Counter
Address - the circular buffer only has room for a maximum of
3000 Addresses. Since an Unsigned 32-Bit Integer can count
only up to 4.2 Billion and one extra storage word (of 16 bits)
along with one extra instruction to add the carry bit is all that
is needed, it seemed efficient to use an Unsigned 48-Bit Integer
for the Count. Unfortunately, the original code to Encode just
an Unsigned 32-Bit Integer was inadequate. THUS the request
for an Encode which supports up to 48 bits! And just to be
sure there is no misunderstanding, the Encode would support
converting from an Unsigned 48-Bit Binary Integer to a Decimal
Ascii Character String - since most of us think in Decimal, not
Binary, Octal or Hex for everyday counting - or at least I do and
I assume that the the target users of the DEBUG program (there
might be more than one user, but that is questionable) also think
in Decimal.

If you have any questions I have not answered, please ask!

you search routine /(,) [division 2 argument]
           unsigned 48 bit /  unsigned  8 bit
           unsigned 48 bit /  unsigned 16 bit
           unsigned 48 bit /  unsigned 32 bit
 
or you search

           unsigned 48 bit /  unsigned 48 bit?

above there is not the problem for divide from 10 100 1000 etc

the problem for divide from 10 100 1000 etc
will be only for show the result in decimal
or get from extern the input in decimal
 
but these div functions in a debugger 
can [*have*] to be seen with register in hex 
all one can wish ...

with no need of decimal routines... no need register seen in decimal

i find impossible to debug these routine in decimal 
i think-debug them in hex

result in hex all in hex...
hex==hexadecimal

The maximum Decimal value of an Unsigned 48-Bit Binary Integer
is 281,474,976,710,655 which is 15 characters. The code which
was originally supplied by Terje Mathisen breaks that down into
4 Unsigned 16-Bit Binary Integers, all of which are between
0 and 9999 ( the maximum of the fourth is obviously only 281).
It is then almost trivial to convert each of the 4 Integers into
4 Decimal Ascii Characters and display the Decimal value to
the end user. I have not analysed the essence of the algorithm,
but I sort of assume that long division by 10,000 is being done
on each pass through the original Unsigned 48-Bit Binary Integer.

Note that other code is available to divide a 48-Bit dividend by
a 16-Bit divisor, but it does so ONE bit at a time and 48 passes
through the code are required.

If I have not answered all of your questions, please ask something
specific so I can help you.

I realize that you usual language is probably not English, but I
will attempt to interpret as well as I can.

Jerome Fine

Jerome H. Fine

unread,

Jul 21, 2014, 10:58:51 PM7/21/14

to

You can divide by any value that you choose. The reason
for 10000 decimal is that it allows the ENCODE to proceed
a portion at a time where each portion is between 0 and 9999.

[and si point to 
value dw 0,0,0123h, 0  [it would be the number in hex 0123 0000 0000h]
]

i think this routine has to have a problem
10000 = 02710h

in this division:
0123 0000 0000h / 02710h = 077318FCh

This part I do not understand.

  mov cx, 3
next:
  xor dx,dx
  mov ax,[si+4]
  div bx
  mov [si+4],ax
  mov ax,[si+2]
  div bx
  mov [si+2],ax
  mov ax,[si]
  div bx
  mov [di], dx
  add di,2
  dec cx
   jnz next
  stosw		; The most significant chunk ends up here.

it seems to me the result of division it is in "si", "si+2" "si+4"

If I can suggest two changes to make it work?

mov bx,10000
mov cx,4 ;originally mov cx,3

next:
xor dx,dx
mov ax,[si+4]
div bx
mov [si+4],ax
mov ax,[si+2]
div bx
mov [si+2],ax
mov ax,[si]
div bx

mov [si],ax ;was missing

mov [di], dx
add di,2
dec cx
jnz next

Of course, this portion just produces the 4 16-Bit Binary Integers
which must still be converted into 4 Ascii Decimal Characters
each, but that part is trivial.

The alternative is repeated subtraction by powers of 10 which
is trivial to understand, but probably takes many more instructions
to execute.

Jerome Fine

Rosario193

unread,

Jul 22, 2014, 1:21:41 AM7/22/14

to

i try just above code where
[e]si point to "numeratore2" and [e]di point to "risultato"

numeratore2 dw 0, 0, 0123h, 0
risultato dw 0, 0, 0, 0

i would know that:
;0123 0000 0000h / 02710h = 0773_18FCh
;0123 0000 0000h % 02710h = 0C40h

at end of the first iteration i have:

numeratore2 dw 018FCh, 0773h, 0, 0
risultato dw 0C40h, 0, 0, 0

that is the result i want [so i thought wrong that the above code have
to had some problem with these numbers]:

if i follow the loop until cx==0
i have

numeratore2 dw 0, 0, 0, 0
risultato dw 0C40h, 0DDCh, 09C2h, 01h

the right result is only in the first iteration, afther 4 iterations
it show numbers i not know how to understand

>The alternative is repeated subtraction by powers of 10 which
>is trivial to understand, but probably takes many more instructions
>to execute.
>
>Jerome Fine

the problem of decimal it is for me only routines of base conversion
and traslate result in base decimal or some other number

Jerome H. Fine

unread,

Jul 22, 2014, 9:23:16 AM7/22/14

to

>Rosario193 wrote:

>On Mon, 21 Jul 2014 22:58:51 -0400, "Jerome H. Fine" wrote:

[Snip]
it seems to me the result of division it is in :si", "si+2" "si+4"

Based on my (sometimes faulty) arithmetic and with all of my numbers
being in decimal, numeratore2 is 291 * 4,294,967,296 which equals
1,249,835,483,136. The decimal value in the third word (after converting
from hex to decimal) is 291. The first word is multiplied by 1. The
second word is multiplied by 65,536. The third word is multiplied
by 4,294,967,296. Since the first and second word are zero, the
result is as I have stated.

i would know that:
;0123 0000 0000h / 02710h = 0773_18FCh
;0123 0000 0000h % 02710h =      0C40h

at end of the first iteration i have:

numeratore2   dw  018FCh, 0773h, 0, 0
risultato     dw   0C40h,     0, 0, 0

that is the result i want [so i thought wrong that the above code have
to had some problem with these numbers]:

if i follow the loop until cx==0
i have

numeratore2   dw      0,     0,     0,   0
risultato     dw  0C40h, 0DDCh, 09C2h, 01h

Again, my (sometimes faulty) arithmetic and with all of my numbers
being in decimal, the above values in risultato are (after converting from
hex to decimal) 3136 3548 2498 0001

Now your "simple" subroutine to convert the four values in
risultato from unsigned binary to decimal ascii starting with
the fourth value and processing backward will be:

0001249835483136

NOTE that the "simple" subroutine MUST produce EXACTLY
FOUR decimal ascii characters each time it is called and zero fill
in front each time. If you prefer an "artistic" display, then you only
need to zero fill in front AFTER the first NON-ZERO value is
fount in risultato.

So your exercise produced the correct results. It would seem
that your difficulty was in not understanding what to do with
the four unsigned binary integer values in risultato.

the right result is only in the first iteration, afther 4 iterations
it show numbers i not know how to understand

Do you understand now?

The alternative is repeated subtraction by powers of 10 which
is trivial to understand, but probably takes many more instructions
to execute.

the problem of decimal it is for me only routines of base conversion
and traslate result in base decimal or some other number

Converting from an unsigned binary integer to ascii decimal
characters is not trivial. It requires a complete understanding
of all of the computer's internal handling of numbers.

NOTE that converting from unsigned binary integer to
hexadecimal characters is MUCH easier since the code
just takes 4 binary bits at a time and changes those
values to hex. The order is not changed and every
16 bit word is changed to 4 hex characters. Thus
your original value in hex was:

012300000000

For display, just start at the end and change each word
to hex: 0000 0000 0123

Of course, you again MUST again start at the end and
produce 4 hex ascii characters zero filled to display:

012300000000

which is, of course, the correct value again.

The problem is that most people (including myself) prefer
to think in decimal rather than hex. So we tend to need all
the code to make the conversions all the time.

Any other questions? Please ask. I am confident that many
other individuals are just as confused.

Jerome Fine

wolfgang kern

unread,

Jul 22, 2014, 1:42:52 PM7/22/14

to

Jerome H. Fine posted several interesting news:

<...> snipped the interesting stuff.

I receive all your posts with an added HTM-attachment
which contains just a bloated copy of your news-text.

For the sake of bandwidth and speed, I ask you why you
send plain text (common to all NWS-groups) as htm ?

__
wolfgang

Jerome H. Fine

unread,

Jul 22, 2014, 8:47:49 PM7/22/14

to

Basically, I am a WINDOWS DUMMY, so what you
are requesting is normally outside my knowledge base.

I asked for some help. Is this any better?

Jerome Fine

Frank Kotler

unread,

Jul 23, 2014, 4:40:15 AM7/23/14

to

Yes. I don't think it's a problem to very many people, but nntp is
"supposed" to be plain ascii. What I see is your html text big and bold,
which my tired old eyes actually like pretty well, but it's "supposed"
to be plain ascii. So if it isn't a problem to you, yes it's better.

(Thanks for the interesting question, BTW!)

Best,
Frank

Jerome H. Fine

unread,

Jul 30, 2014, 9:14:11 PM7/30/14

to

>Terje Mathisen wrote:

> >Jerome H. Fine wrote:
>
>> Is there an algorithm that will perform an Unsigned Integer Divide
>> for a 48-Bit Value using ONLY 16 bit instructions PLUS the
>> Signed Integer Divide (is is DIV or IDIV?) instruction which
>> divides DX,AX by a 16 bit value? OR maybe the Signed Integer
>> Multiply (is it MUL or IMUL?) instruction could be substituted
>> which multiplies two Signed 16-Bit Integers to produce a 32-Bit
>> Integer?
>

> Yes!
>
> The easiest is to split the input into chunks based on the largest
> power of ten that fits in 16 bits, i.e. 10000:
>
> Assuming SI points to 48-bit input variable, DI points to low end of 4
> 16-bit variables to receive the remainders, each to be less than 10000:
>
> mov bx,10000

> mov cx, 3

> next:
> xor dx,dx
> mov ax,[si+4]
> div bx
> mov [si+4],ax
> mov ax,[si+2]
> div bx
> mov [si+2],ax
> mov ax,[si]
> div bx

> mov [di], dx
> add di,2
> dec cx
> jnz next

> stosw ; The most significant chunk ends up here.
>

> The code above does more divisions than required, since after the
> first two iterations the top word has to be zero.

I believe that I have managed to improve the efficiency of the
code a bit by using the cx register in a somewhat different manner.
This took a bit of testing, but I think the code on my actual
hardware has now settled down enough to suggest this variation
for 8086 code:

----------------------------------------------------------------
ENCODE:
push di
mov bx,10000

next:
xor dx,dx
mov ax,[si+4]

test ax,ax ;probably omit this code
jz skip1 ; on faster hardware

div bx
mov [si+4],ax

skip1:
mov ax,[si+2]
mov cx,ax ;probably omit
or cx,dx ; this code on
jz skip2 ; faster hardware

div bx
mov [si+2],ax

skip2:
mov ax,[si]
mov cx,ax ;probably omit
or cx,dx ; this code on
jz skip3 ; faster hardware

div bx
mov [si],ax

skip3:
mov [di],dx
add di,2
mov cx.[si+2]
or cx.[si+4]
jnz next

xchg dx,ax
test dx,dx
jnz finish
xchg dx,ax
dec dx,2

finish:
pop bx ;suggested code
lea cx,[di] ; to determine
sub cx,bx ; the number of
shr cx,1 ; remainders in memory

makeascii:
call encode ;value to be encoded is in dx
loop:
cmp bx,di
jz done
dec di,2
mov dx,[di]
call encode ;value to be encoded is in dx
jmp loop

done:
ret
----------------------------------------------------------------

Note that the extra code to omit the "div bx" when any dividend
is zero assumes that the divide instruction is so slow that the extra
instructions and the branches will be faster than performing the
actual divide instruction. That will obviously be dependent on
the CPU being used.

The four instructions which follow finish "should" produce the number
of additional remainders (used from the bottom up) which are
to be encoded. I assume that the "shr cx,1" is a divide by two.
If the user wishes to pad the encoded value with blanks to
produce the same total number of characters each time, cx
can be used to determine the number of characters used for
the encode the first time. The maximum value of an Unsigned
48-Bit Integer is 281,474,976,710,655 which usually requires
at least 16 characters (with a leading blank). If cx is zero,
then use all 16 characters for the first (and final) encode.
Otherwise, multiply cx by 4 and subtract that from the number
of characters used for the first encode. Use 4 characters for
remaining encodes and remember to ZERO fill after the first
encode.

At makeascii, dx holds the final Unsigned 16-Bit Integer which
may be as large as 65535. cx can be used to determine how
many additional 16-Bit Integers have been produced. They
are removed from higher to lower until di equals the original
address. Check cx BEFORE (cx may be zero) removing any
additional remainders which may have been placed there since
the above code may have produced ONLY one value that is
less than 10000. Otherwise, if there are two or more values
(with the additional values being in memory), the first value in dx
may be as large as 65535.

NOTE that the above code has been neither assembled NOR even
tested, so there may be mistakes as I changed from my actual
hardware to 8086 instructions. There are probably faster versions
of the code as well. Mainly, I just wanted to reduce the number
of "div" instructions and provide some suggestions for anyone who
is still interested.

For those who are curious, I have included the code from my actual
hardware that is replaced by "Call Divide" from between skip 1 and
skip3.

On my actual hardware, the instructions are able to "auto-decrement"
a register during the execution of a Mov instruction. In addition,
there was an extreme need to conserve code space. Consequently,
on my actual hardware. the (repeated) code between "skip1" and
"skip3" was placed into a subroutine. Also note that since my
actual hardware has ONLY a SIGNED "Div" instruction, it was
necessary to divide by 20,000 and convert the result: Note that
R3 initially holds the address of the following value.

----------------------------------------------------------------
; NOTE: The following code supports a 48-Bit Unsigned Integer up to
; 281,474,976,710,655 and has been modified from 8086
; code supplied by Terje Mathisen

; NOTE: Since the maximum dividend can be 655,359,999 and the PDP-11
; Div Instruction is a SIGNED Divide Instruction, 20,000 is
used as the
; Divisor instead of 10,000 - which produces a maximum quotient
; of 32767 instead of 65537 and prevents overflow from occuring

; NOTE: The instructions which have been commented out are
; useful if the "Div" instruction takes an extremely long time.
; Otherwise, the additional overhead is not cost-effective.

Divide: Call (PC) ;Execute the Subroutine Twice

Mov R1,R0 ;Branch if the Divide Instruction
;;; Bne 10$ ; MUST be executed
;;; Mov -(R3),R1 ;Branch if the Divide Instruction
;;; Beq 50$ ; NEED not be executed
;;; Cmp R1,#20000. ;Branch if the Divide
;;; Bhis 20$ ; MUST be executed
;;; Br 30$ ;Continue with Common Code
10$: Mov -(R3),R1 ;Get the Low Order Portion of the
Dividend
20$: Div #20000.,R0 ;Divide by 20,000
Asl R0 ;Change from
30$: Cmp R1,#10000. ; Divide by
Blo 40$ ; 20,000
Sub #10000.,R1 ; to Divide
Inc R0 ; by 10,000
40$: Mov R0,(R3) ;Save the Quotient

50$: Return
----------------------------------------------------------------

THANK YOU MANY, MANY TIMES to Terje Mathisen
for the basic original algorithm.

Jerome Fine