Integer division using IMUL???

Jaelani

unread,

Mar 28, 2009, 2:42:01 AM3/28/09

to

Hello,

While debugging an application made by someone, I've come accross a
confusing integer calculation. The application is a disk defragmenter
with a disk cluster map display shown as rows and columns. The
application displays a fixed 8x8 pixel blocks with one pixel as the
separator between each blocks. The blocks always fills the entire
available drawing area (with some left over space), so only the number
of rows & column is calculated. The objective of the calculation is to
find out how many blocks can be fitted in the drawing area.

To make things simple, I adjusted the window height so that the number
of columns is only one. The relevant variables/values that are used by
the application is like so:

W = Drawing area width
B = Block width (not sure if it includes the separator)
C = Number of blocks

Obviously, I will have this formula:
C = W div B

However, the application uses confusing calculation:
mov eax, 38E38E39h ;edx:eax = 38E38E39h (hard coded)
xor edx, edx
mov ecx, W ;ecx = W
imul ecx
sar edx, 1
mov C, edx ;C = edx

So the question is, what kind of calculation is that? I've never seen
something like this. A division calculation using a multiplication. What
I can understand is that it's a compiler optimization because
multiplication instructions are relatively faster than division
instructions. But the method/trick of accomplishing that is new and
unknown to me. Does anyone know something like this?

Thank you in advance.

Regards,
Jaelani

Terje Mathisen

unread,

Mar 29, 2009, 6:10:27 AM3/29/09

to

Sure!

I once helped write a paper about it, together with Agner Fog (do a
search!).

The key here is that the compiler could figure out that B was a
constant, in which case _all_ modern compilers will replace division by
a reciprocal multiplication followed by a shift.

It looks like the divisor could have been around 10?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

spam...@crayne.org

unread,

Mar 29, 2009, 2:19:39 AM3/29/09

to

On Mar 28, 12:42 am, Jaelani <spamt...@crayne.org> wrote:

> So the question is, what kind of calculation is that? I've never seen
> something like this. A division calculation using a multiplication. What
> I can understand is that it's a compiler optimization because
> multiplication instructions are relatively faster than division
> instructions. But the method/trick of accomplishing that is new and
> unknown to me. Does anyone know something like this?
>
> Thank you in advance.
>
> Regards,
> Jaelani

Chapter 8 of the AMD Optimization Guide for x86 does a pretty good
job:

http://support.amd.com/us/Processor_TechDocs/22007.pdf

And it will help you better understand, where the constant is coming
from
in the code you provided.

Brian

Dick Wesseling

unread,

Mar 29, 2009, 12:54:06 AM3/29/09

to

In article <49cdc658$0$90272$1472...@news.sunsite.dk>,

Jaelani <spam...@crayne.org> writes:
> Hello,
>
> While debugging an application made by someone, I've come accross a
> confusing integer calculation. The application is a disk defragmenter
> with a disk cluster map display shown as rows and columns. The
> application displays a fixed 8x8 pixel blocks with one pixel as the
> separator between each blocks. The blocks always fills the entire
> available drawing area (with some left over space), so only the number
> of rows & column is calculated. The objective of the calculation is to
> find out how many blocks can be fitted in the drawing area.
>
> To make things simple, I adjusted the window height so that the number
> of columns is only one. The relevant variables/values that are used by
> the application is like so:
>
>
> W = Drawing area width
> B = Block width (not sure if it includes the separator)

It includes the separator, so B=9.

> C = Number of blocks

> Obviously, I will have this formula:
> C = W div B
>

That is not obvious. Division is slow. Multiplication by the
reciprocal is faster:

C = W * (1/B)

assuming that 1/B is known at compile-time.

>
> However, the application uses confusing calculation:
>
> mov eax, 38E38E39h ;edx:eax = 38E38E39h (hard coded)
> xor edx, edx
> mov ecx, W ;ecx = W
> imul ecx
> sar edx, 1
> mov C, edx ;C = edx

> So the question is, what kind of calculation is that?

As said, the reciprocal is 1/9. But that would be truncated to zero
using integer arithmetic. Instead use:

C = W * (2^n/B)
---------
2^n

for some convenient value of n. n=32 is very convenient but may not
be accurate enough. This program fragment uses n=33.

2^33/9 = 38E38E38

with a fraction of .38E38E3838E38E3838E38E38....et cetera.
The division by 2^33 is carried out in two steps:

1) discard the low 32 bits of W * (2^n/B). In other words use DX
which contains the upper half of the result, ignoring AX which
contains the lowest 32 bits.

2) Shift right to conclude division by 2^33.

Hendrik van der Heijden

unread,

Mar 29, 2009, 5:31:25 AM3/29/09

to

Jaelani schrieb:

> However, the application uses confusing calculation:
> mov eax, 38E38E39h ;edx:eax = 38E38E39h (hard coded)
> xor edx, edx
> mov ecx, W ;ecx = W
> imul ecx
> sar edx, 1
> mov C, edx ;C = edx
>
> So the question is, what kind of calculation is that? I've never seen
> something like this. A division calculation using a multiplication. What
> I can understand is that it's a compiler optimization because
> multiplication instructions are relatively faster than division
> instructions. But the method/trick of accomplishing that is new and
> unknown to me. Does anyone know something like this?

Division is the same as multplying by the reciprocal.
38E38E39h is (2^33)/9. SAR corrects for 2^33 vs 2^32.

Hendrik vdH

Alexei A. Frounze

unread,

Mar 29, 2009, 2:23:11 AM3/29/09

to

It's so-called fixed-point arithmetic. I'm pretty sure you can work it
out if you follow the calculations.

An example in decimal... Suppose your CPU registers aren't binary but
decimal and can hold 4 decimal digits.
Suppose you want to calculate 3333 / 10.
Now, what if you represent the 10 as 10000/1000 and calculate 3333 *
1000 / 10000?
Watch it:
mov ax, 3333
mov bx, 1000
mul bx
After this dx:ax = 3333*1000=3333000. But if ax holds 4 decimal digits
as we agreed earlier (3000) and we simply drop them off, we end up
with 333 in dx.
And that's what you wanted: 3333 / 10 = 333.

Now, of course, those aren't decimal registers, they're binary. Do the
math.

Alex

Dick Wesseling

unread,

Mar 29, 2009, 1:07:39 AM3/29/09

to

Oops.
Correction on my previous posting:

> 2^33/9 = 38E38E38
>
> with a fraction of .38E38E3838E38E3838E38E38....et cetera.

That should be:

> 2^33/9 = 38E38E38
>
> with a fraction of .E38E3838E38E3838E38E38....et cetera.

Rounding 38E38E38.E3 gives 38E38E39

Jaelani

unread,

Mar 29, 2009, 6:43:37 PM3/29/09

to

>> Obviously, I will have this formula:
>> C = W div B
>>
>
> That is not obvious. Division is slow. Multiplication by the
> reciprocal is faster:
>
> C = W * (1/B)
>
> assuming that 1/B is known at compile-time.

I'm an idiot, so it was the only formula I can think of.

>> So the question is, what kind of calculation is that?
>
> As said, the reciprocal is 1/9. But that would be truncated to zero
> using integer arithmetic. Instead use:
>
>
> C = W * (2^n/B)
> ---------
> 2^n
>
> for some convenient value of n. n=32 is very convenient but may not
> be accurate enough. This program fragment uses n=33.
>
> 2^33/9 = 38E38E38
>
> with a fraction of .38E38E3838E38E3838E38E38....et cetera.
> The division by 2^33 is carried out in two steps:
>
> 1) discard the low 32 bits of W * (2^n/B). In other words use DX
> which contains the upper half of the result, ignoring AX which
> contains the lowest 32 bits.
>
> 2) Shift right to conclude division by 2^33.

The explanation makes more sense. So it's similar to scaling-up the
dividend prior dividing and scaling-down back the result, correct?

For 32-bit integers, does that mean the additional bits (n above 32)
serves as the fractional part? 32 significant bits and 32 fractional
bits perhaps?

Jaelani

unread,

Mar 29, 2009, 7:21:49 PM3/29/09

to

Alexei A. Frounze wrote:
> It's so-called fixed-point arithmetic. I'm pretty sure you can work it
> out if you follow the calculations.
>
> An example in decimal... Suppose your CPU registers aren't binary but
> decimal and can hold 4 decimal digits.
> Suppose you want to calculate 3333 / 10.
> Now, what if you represent the 10 as 10000/1000 and calculate 3333 *
> 1000 / 10000?
> Watch it:
> mov ax, 3333
> mov bx, 1000
> mul bx
> After this dx:ax = 3333*1000=3333000. But if ax holds 4 decimal digits
> as we agreed earlier (3000) and we simply drop them off, we end up
> with 333 in dx.
> And that's what you wanted: 3333 / 10 = 333.
>
> Now, of course, those aren't decimal registers, they're binary. Do the
> math.
>
> Alex

Great. This isn't just an answer to my question, but also helps me
understand how to multiply two 32-bit integers using only 16-bit
registers. A question of "how did they come up with that?" which was
long overdue.

Thanks.

Jaelani

unread,

Mar 29, 2009, 12:12:31 PM3/29/09

to

Terje Mathisen wrote:
> Sure!
>
> I once helped write a paper about it, together with Agner Fog (do a
> search!).
>
> The key here is that the compiler could figure out that B was a
> constant, in which case _all_ modern compilers will replace division by
> a reciprocal multiplication followed by a shift.
>
> It looks like the divisor could have been around 10?
>
> Terje

Many thanks for that. I later found a similar method in the AMD
documentations, but Agner Fog's document is much easier to understand.

I only sure that the divisor is 9 (8 pixel block & 1 pixel separator),
but I don't know why the application uses 38E38E39h instead of
E38E38E4h. It probably uses the other method described in the AMD
document although the application seems to use the exact same method.
But nevermind about this. I already have my question answered.

Thanks again.

_______
Jaelani

Jaelani

unread,

Mar 29, 2009, 7:29:16 PM3/29/09

to

Hendrik van der Heijden wrote:
> Division is the same as multplying by the reciprocal.
> 38E38E39h is (2^33)/9. SAR corrects for 2^33 vs 2^32.
>
>
> Hendrik vdH

Thanks.

With all the people here bursting with answers makes me feel like a newbie.

spam...@crayne.org

unread,

Mar 31, 2009, 1:48:25 AM3/31/09

to

On Mar 29, 4:10 am, Terje Mathisen <spamt...@crayne.org> wrote:
>
> I once helped write a paper about it, together with Agner Fog (do a
> search!).

These are really good:

http://www.agner.org/optimize/#manuals

Now, I'm learning C++ because of them.

Thanks,

Brian

Jaelani

unread,

Mar 29, 2009, 6:22:41 PM3/29/09

to

bwa...@yahoo.com wrote:
> Chapter 8 of the AMD Optimization Guide for x86 does a pretty good
> job:
>
> http://support.amd.com/us/Processor_TechDocs/22007.pdf
>
> And it will help you better understand, where the constant is coming
> from
> in the code you provided.
>
> Brian

Yes, it helps. Thank you.

One things though... It's unfortunate that, for Intel CPUs which has a
relatively slower integer division instructions than AMD, don't have
this particular topic in their CPU optimization manual.

Terje Mathisen

unread,

Mar 31, 2009, 3:54:00 PM3/31/09

to

Jaelani wrote:
> Terje Mathisen wrote:
>> Sure!
>>
>> I once helped write a paper about it, together with Agner Fog (do a
>> search!).
>>
>> The key here is that the compiler could figure out that B was a
>> constant, in which case _all_ modern compilers will replace division
>> by a reciprocal multiplication followed by a shift.
>>
>> It looks like the divisor could have been around 10?
>>
>> Terje
>
> Many thanks for that. I later found a similar method in the AMD
> documentations, but Agner Fog's document is much easier to understand.
>
> I only sure that the divisor is 9 (8 pixel block & 1 pixel separator),
> but I don't know why the application uses 38E38E39h instead of
> E38E38E4h. It probably uses the other method described in the AMD

Ah! This is easy to answer: It uses the shortest equivalent reciprocal,
i.e. the one where any trailing zero bits have been shifted out:

....E4 is a number which ends in two zero bits, so the only effect these
bits have is to increase the required shift amount by 2, i.e. you would
need

SHR EDX,3

instead of

SHR EDX,1

> document although the application seems to use the exact same method.
> But nevermind about this. I already have my question answered.

Good luck!