poo = val % 320;
... which would usually be done in ASM like...
mov ax, [val]
cwd
mov bx, 320
idiv bx
mov [poo], dx
... IDIV is too slow for my purposes.
well.. if you are dividing by a power of 2... ie. 2, 4, 8, 16, 32, 64,
128 etc. you can get around using the IDIV command by just lopping off
the top of the number using a mask... example... say you have the
number 45 which is 00101101... to find the mod of that number divided
by 16 you would AND the value with a mask of 0F.. which turns
everything to 0 except for the bottom 4 numbers. leaving 1101 or 13d.
To take into account the other powers of 2 you need to shift the mask
to the left or right to take those into account. You should probably
start with a mask of 1 and then shift it arithmetically to the left so
that a 1 will appear to the far right. This preserves the mask because
the saved portion HAS to be 1. A 2 will have 1, 1's... a 4 will have
2's.. an 8 will have 3... and so on.
-MJ
That makes sense. Before this reply I figured out another way of doing it, which I'd
imagine is slower, but anyway:
mov ax, [poo]
mov bx, ax
shr ax, 6 ; AX = AX / 320
shr bx, 8
add ax, bx
mov bx, ax ; AX = AX * 320
shl ax, 6
shl bx, 8
add ax, bx
sub [poo], ax ; poo = poo - AX
... ie. divide by 320, multiply by 320, then the difference becomes the remainder. 17
cycles on a 486 as opposed to 34. (pretty stupid though eh?)
: poo = val % 320;
: ... which would usually be done in ASM like...
: mov ax, [val]
: cwd
: mov bx, 320
: idiv bx
: mov [poo], dx
: ... IDIV is too slow for my purposes.
You must divide. If you need to divide by a power of 2, you can
extract the and with a simple AND instruction.
--
<---->
It strikes me that many of the posters to this group are quite innocent as
regards elementary # theory. Since (unsigned) arithmetic on x86 actually
means arithmetic (mod 2**32), division by a small constant is possible via
multiplication by its reciprocal (mod 2**32). For best results on a Pentium,
this multiplication should be carried out with shift instructions. On MMX
Pentium or PentiumPro, this would not be a good idea. Here is some code
that divides by 100 for quotient and remainder, thus converting a 32-bit
unsigned number to a decimal string in about 15 clocks/digit pair on a
Pentium. Since school is in session, I don't have time to modify this
to divide by 320, and so I leave it as a challenge to the other posters
to do so. Your code should achieve its goal in less than 15 clocks, with
much smaller LUTs than required below. Hopefully, some interested readers
will take up this challenge and find # theory as entertaining as I do.
The code has few useful comments, but that's part of the challenge!
The input number is in EAX, while the pointer to the output string is in EDX.
..586
DGROUP group CONST
_TEXT segment para public use32 'CODE'
assume CS:_TEXT, DS:DGROUP
public hundred_
hundred_ proc near
push ebx
push ecx
push ebp
push esi
push edi
mov ecx, eax
shr ecx, 2
mov bh, [edx+2]
mov bl, [edx+1]
cmp eax, 100
jnb top
cmp eax, 10
jnb put_last_2
put_last_1:
or eax, 30h
pop edi
mov [edx], al
pop esi
pop ebp
pop ecx
pop ebx
ret
top:
lea edi, [ecx+2*ecx] ; 3
lea esi, [ecx+4*ecx] ; 5
shl edi, 10 ; c00
and eax, 3
lea ebp, [ecx+2*esi] ; b
lea ecx, [ecx+8*esi] ; 29
shl ebp, 12 ; b000
add ecx, edi ; c29
mov edi, ecx ; c29
sub ecx, ebp ; ffff5c29
shl edi, 20 ; c2900000
mov ebp, ecx ; ffff5c29
add ecx, edi ; c28f5c29
add ebp, edi ; c28f5c29
shr ebp, 27
mov [edx+2], bh
mov [edx+1], bl
sub edx, 2
add ecx, [LBT+4*ebp]
add al, [SBT+ebp]
mov ebp, eax
mov eax, ecx
shr ecx, 2
cmp eax, 100
mov bx, [ABT+2*ebp]
jnb top
cmp eax, 10
mov [edx+1], bl
mov [edx+2], bh
jb put_last_1
put_last_2:
mov bl, [ABT+2*eax]
pop edi
mov [edx-1], bl
mov bh, [ABT+1+2*eax]
mov [edx], bh
pop esi
pop ebp
pop ecx
pop ebx
ret
hundred_ endp
_TEXT ends
CONST segment para public use32 'DATA'
; Large boring table
LBT dd 0,0f5c28f5ch,0eb851eb8h,0e147ae14h,0e147ae14h,0d70a3d70h,0cccccccch
dd 0c28f5c28h,0b851eb85h,0b851eb85h,0ae147ae1h,0a3d70a3dh,99999999h
dd 99999999h,8f5c28f5h,851eb851h,7ae147aeh,70a3d70ah,70a3d70ah,66666666h
dd 5c28f5c2h,51eb851eh,51eb851eh,47ae147ah,3d70a3d7h,33333333h,28f5c28fh
dd 28f5c28fh,1eb851ebh,147ae147h,0a3d70a3h,0a3d70a3h
; Smaller boring table
SBT db 0,16,32,48,48,64,80,96,12,12,28,44,60,60,76,92
db 8,24,24,40,56,72,72,88, 4,20,36,36,52,68,84,84
; Another boring table
ABT:
irp @tens, "0, 1, 2, 3, 4, 5, 6, 7, 8, 9"
irp @ones, "0, 1, 2, 3, 4, 5, 6, 7, 8, 9"
dw 256 * @ones + @tens OR 3030h
endm
endm
CONST ends
end
Well ... 16 bits IDIV is only 30 cycles on Pentium processor
>Hi y'all - looking for a way to perform mods without using DIV or IDIV. Eg. in C...
>poo = val % 320;
>... which would usually be done in ASM like...
>mov ax, [val]
>cwd
>mov bx, 320
>idiv bx
>mov [poo], dx
>... IDIV is too slow for my purposes.
For the general case, there is no better way.
However, there are special cases...
Assuming you're only interested in nonnegative numbers (all values >= 0),
let M be the "modulus" where you're trying to get (x % M).
1) If (x / M) is known to be small, you can use a loop
while (x > M)
x = x - M;
2) If M is a power of two,
x = x & (M - 1);
3) When adding two numbers x and y,
if x < M and y < M then do
sum = x + y;
if (sum > M)
sum = sum - M;
/* sum will be (x + y) % M */
=====
Translation to assembly language left as an exercise.
--
Henry S. Takeuchi
ht...@eskimo.com
Seattle, Washington (USA)
What on earth are you calculating here? You are calculating AX/256+AX/64
>> AX / 320. While one can use SHL and ADD to create a general
multiplication, it is not so easy to create division with SHR and ADD.
> mov bx, ax ; AX = AX * 320
> shl ax, 6
> shl bx, 8
> add ax, bx
> sub [poo], ax ; poo = poo - AX
>
>... ie. divide by 320, multiply by 320, then the difference becomes the remainder. 17
>cycles on a 486 as opposed to 34. (pretty stupid though eh?)
Osmo
What do you need it for? To get the x,y coordinates of an address in the
video screen? Why do you need that fast?
Osmo
Well one could always do
function mod320(x:word):word; assembler;
asm
mov ax,x
mov bx,40960
mov cx,8
@l2: cmp ax,bx
jb @x
sub ax,bx
@x: shr bx,1
loop @l2
end;
But this is basically what DIV does and is probably not faster even if
the loop is unrolled.
Division is simply a hard process. Any kid who has learned do so it (I
do not mean using a calculator) knows it.
Osmo
How about:
performs (ebx = eax % 320). edx and eax are destroyed.
mov ebx, eax
mov edx, 0CCCCCCCCh
mul eax, edx
and edx, 0FFFFFF00h
sub ebx, edx
shr edx, 2
sub ebx, edx
Byron
>How about:
>performs (ebx = eax % 320). edx and eax are destroyed.
> mov ebx, eax
> mov edx, 0CCCCCCCCh
> mul eax, edx
> and edx, 0FFFFFF00h
> sub ebx, edx
> shr edx, 2
> sub ebx, edx
>
This doesn't work if eax % 320 == 0. You can fix this by using the constant
0CCCCCCCDh instead of 0CCCCCCCCh, but I still think that the low dword variant
of this technique has some advantages because the multiplication can be carried
out via shifts and adds on a Pentium, and IMUL can be used on a PentiumPro,
which has one clock less latency. Using the high dword does avoid the 8 byte
(or 32 byte) LUT that the low dword method requires (to convert the high 3 bits
of the product into a remainder).
I've come up with something slightly more satisfying than the number
theory method, and faster as well as more correct than Byron V. Foster's
submission:
mov edx,033333334h
mov ebx,eax
mul edx
and edx,0FFFFFFC0h
lea edx,[edx+edx*4]
sub ebx,edx
This follows from:
r = a - lowint[(a)/320]*320
= a - (lowint[((a)/5)/64]*64)*5
= a - ((((a)/5)) & (-64)) * 5
= a - ( ((2^32+2^2)/5 * (a))/(2^32) & (-64)) * 5 (approx.)
= a - ( (0x33333334 * (a))/(2^32) & (-64)) * 5
Approximation error occurs if a >= 2^30. Fails if a is negative.
--
Paul Hsieh
qed "at" chromatic "dot" com
http://www.geocities.com/SiliconValley/9498
Graphics Programmer
Chromatic Research
What I say and what my company says are not always the same thing