Issue 9 in libfixmath: 32 bit div

libfi...@googlecode.com

unread,

Mar 3, 2011, 10:24:47 AM3/3/11

to libfi...@googlegroups.com

Status: New
Owner: ----
Labels: Type-Defect Priority-Medium

New issue 9 by birdla...@gmail.com: 32 bit div
http://code.google.com/p/libfixmath/issues/detail?id=9

Hi,

32 bit lib seems very expensive in terms of number of cycles and accuracy.

Is there any one working on it ?

libfi...@googlecode.com

unread,

Mar 3, 2011, 10:35:55 AM3/3/11

to libfi...@googlegroups.com

Updates:
Labels: Performance

Comment #1 on issue 9 by Flatm...@googlemail.com: 32 bit div
http://code.google.com/p/libfixmath/issues/detail?id=9

There are only 2 people on this project at the minute since it's very new.
I've implemented all the 32-bit stuff so far, but I only have so much time
to work on it amongst full-time work many other projects and commitments.
If you want better/faster algorithms and are interested in implementing
them then I'd be happy to give you committer rights.

As the division goes, division will always be a slow operation, there
aren't that many optimizations I can think to make to the algorithm I used
(not without assembler anyway) but that doesn't mean that there isn't a
fundamentally better algorithm out there that could be used.

The inaccuracy must be due to a small error somewhere in the code, the fact
that it's such a small error indicates that it's likely to be a rounding
error or an error in some extreme case where the result overflows.

libfi...@googlecode.com

unread,

Mar 3, 2011, 10:46:12 AM3/3/11

to libfi...@googlegroups.com

Comment #2 on issue 9 by Flatm...@googlemail.com: 32 bit div
http://code.google.com/p/libfixmath/issues/detail?id=9

I'd be interested to hear what you're using the library for birdland (if
you can tell me), I might be able to help you achieve what you want better
that way with the limited time I have to work on this project.

libfi...@googlecode.com

unread,

Mar 3, 2011, 12:05:21 PM3/3/11

to libfi...@googlegroups.com

Comment #3 on issue 9 by birdla...@gmail.com: 32 bit div
http://code.google.com/p/libfixmath/issues/detail?id=9

hi
I need these fixed point functions in the baseband processing of some
wireless communication protocol such as WLAN channel estimation. Timing
rquirements are very strict. Please run the following code and see how many
cycles division is taking. Please which processor you are runing on. Please
tell me also what value of d you get ?

#include <stdio.h>
#include <stdlib.h>

typedef __int32_t fix16_t;

fix16_t fix16_div(fix16_t inArg0, fix16_t inArg1);
fix16_t fix16_mul(fix16_t inArg0, fix16_t inArg1);
volatile int d;

int main(){

int x = 5063265; //77.2593;
int y = -6338360;//-96.7157;

int a;
d = fix16_div(x, y);
printf("%d \n",d);
//printf("%d \n",a);

}

fix16_t fix16_div(fix16_t inArg0, fix16_t inArg1) {
int neg = ((inArg0 < 0) != (inArg1 < 0));
inArg0 = (inArg0 < 0 ? -inArg0 : inArg0);
inArg1 = (inArg1 < 0 ? -inArg1 : inArg1);

while(((inArg0 | inArg1) & 1) == 0) {
inArg0 >>= 1;
inArg1 >>= 1;
}

__uint32_t r_hi = (inArg0 / inArg1);

__uint32_t n_lo = (inArg0 % inArg1);
__uint32_t n_hi = (n_lo >> 16);
n_lo <<= 16;

__uint32_t i, arg;
for(i = 1, arg = inArg1; ((n_lo | arg) & 1) == 0; i <<= 1) {
n_lo = ((n_lo >> 1) | (n_hi << 31));
n_hi = (n_hi >> 1);
arg >>= 1;
}

__uint32_t res = 0;
if(n_hi) {
__uint32_t arg_lo, arg_hi;
for(arg_lo = inArg1; (arg_lo >> 31) == 0; arg_lo <<= 1, i
<<= 1);
for(arg_hi = (arg_lo >> 31), arg_lo <<= 1, i <<= 1; arg_hi
< n_hi; arg_hi = (arg_hi << 1) | (arg_lo >> 31), arg_lo <<= 1, i <<= 1);

do {
arg_lo = (arg_lo >> 1) | (arg_hi << 31);
arg_hi = (arg_hi >> 1);
i >>= 1;
if(arg_hi < n_hi) {
n_hi -= arg_hi;
if(arg_lo > n_lo)
n_hi--;
n_lo -= arg_lo;
res += i;
} else if((arg_hi == n_hi) && (arg_lo <= n_lo)) {
n_hi -= arg_hi;
n_lo -= arg_lo;
res += i;
}
} while(n_hi);
}

res += (n_lo / inArg1);
#ifndef FIXMATH_NO_ROUNDING
if((n_lo % inArg1) >= (inArg1 >> 1))
res++;
#endif
res += (r_hi << 16);

return (neg ? -res : res);
}

fix16_t fix16_mul(fix16_t inArg0, fix16_t inArg1) {
__int16_t hi[2] = { (inArg0 >> 16), (inArg1 >> 16) };
__uint16_t lo[2] = { (inArg0 & 0xFFFF), (inArg1 & 0xFFFF) };

__int32_t r_hi = hi[0] * hi[1];
__int32_t r_md = (hi[0] * lo[1]) + (hi[1] * lo[0]);
__uint32_t r_lo = lo[0] * lo[1];
#ifndef FIXMATH_NO_ROUNDING
r_lo += 0xFFFF;
#endif

r_md += (r_hi & 0xFFFF) << 16;
r_md += (r_lo >> 16);

return r_md;

}

libfi...@googlecode.com

unread,

Mar 3, 2011, 12:36:13 PM3/3/11

to libfi...@googlegroups.com

Comment #4 on issue 9 by joe.scha...@gmail.com: 32 bit div
http://code.google.com/p/libfixmath/issues/detail?id=9

Running the code I get -49552. This is using gcc 4.3.1. I don't have time
right now to look into the excecution speed, but ill take a look later today

libfi...@googlecode.com

unread,

Mar 3, 2011, 1:30:20 PM3/3/11

to libfi...@googlegroups.com

Comment #5 on issue 9 by birdla...@gmail.com: 32 bit div
http://code.google.com/p/libfixmath/issues/detail?id=9

Yes I get the same value but the accurate one is -52352. cycle count seems
very high for this code.

let see what you get for cycle count

libfi...@googlecode.com

unread,

Mar 3, 2011, 1:36:28 PM3/3/11

to libfi...@googlegroups.com

Comment #6 on issue 9 by joe.scha...@gmail.com: 32 bit div
http://code.google.com/p/libfixmath/issues/detail?id=9

How are you measuring cycle count

libfi...@googlecode.com

unread,

Mar 3, 2011, 2:53:53 PM3/3/11

to libfi...@googlegroups.com

Comment #7 on issue 9 by Flatm...@googlemail.com: 32 bit div
http://code.google.com/p/libfixmath/issues/detail?id=9

Birdland, are you targetting ARM, if so then we could make a much faster
assembler optimized version, especially if it has a clz instruction.
This implementation is very generic and aimed at accuracy, there's a
similar divide function to divide a 64-bit number by a 32-bit number in the
newly submitted int64.h header which may work better, though it is largely
the same.

libfi...@googlecode.com

unread,

Mar 3, 2011, 9:26:43 PM3/3/11

to libfi...@googlegroups.com

Comment #8 on issue 9 by joe.scha...@gmail.com: 32 bit div
http://code.google.com/p/libfixmath/issues/detail?id=9

Ok so I have never run a profiler before so I am not entirely sure what I
am doing here, but here it goes.

so as a test I took birdlands code and copied it into fix16.c appending an
_b to the function name.
The standard fix16_div is using 64 bit math.
I threw in mul for good measure
I modified the main code to be:

int x = 5063265; //77.2593;
int y = -6338360;//-96.7157;

d = fix16_div_b(x, y);
k = fix16_div(x,y);
t = fix16_mul(x,y);
printf("%d \n",d);
printf("%d \n",k);
printf("%d \n",t);
return 0;

compiled it with gcc 4.2.1 with the -g option. Then used valgrind to look
at the number of instructions per function. Here is what I came up with.

#############################################################################
Results for Core Duo @ 2.0Ghz(32 bit)

valgrind --tool=callgrind --toggle-collect=fix16_div_b ./main
No optimization = 267
O3 = 198
Os = 209

valgrind --tool=callgrind --toggle-collect=fix16_div ./main
No optimization = 2341
O3 = 2334
Os = 2334

valgrind --tool=callgrind --toggle-collect=fix16_mul ./main
No optimization = 56
O3 = 10
Os = 10

#############################################################################
Results for Core 2 Duo @ 2Ghz(64 bit)

valgrind --tool=callgrind --toggle-collect=fix16_div_b ./main
No optimization = 268
O3 = 188
Os = 200

valgrind --tool=callgrind --toggle-collect=fix16_div ./main
No optimization = 23
O3 = 14
Os = 14

valgrind --tool=callgrind --toggle-collect=fix16_mul ./main
No optimization = 18
O3 = 10
Os = 10

#############################################################################
Results for Arm Cortex M0 @ 48 Mhz (32 bit)
(this uses a timer, which is running at the system clock speed) since it is
single threaded
it always runs in the same amount of time.
Also I only compiled with Os, because the code grows too large without it.

also keep in mind that this processor has NO division instruction. So some
magic is happening
which is probably why the 32bit and 64 bit functions are similar

fix16_div_b
Os = 880

fix16_div
Os = 1104

fix16_mul
Os = 145

#############################################################################

Wow. So the fix16_div routine is very slow for 32 bit systems even if the
compiler can handle 64 bit numbers
The ARM doesnt take as large of a hit, but that is because the div_b is
slow as it is, mostly due to the lack of a division routine

Hope all of this makes sense...

libfi...@googlecode.com

unread,

Mar 4, 2011, 4:27:59 AM3/4/11

to libfi...@googlegroups.com

Comment #9 on issue 9 by Flatm...@googlemail.com: 32 bit div
http://code.google.com/p/libfixmath/issues/detail?id=9

It's expected that 64-bit division would be slow (though concrete results
are very nice to have) because there is no trivial way to perform division
unlike multiplication.
I think that the algorithm I wrote could do with some work, but I'm not
sure there is going to be a very fast way of doing it.

libfi...@googlecode.com

unread,

Mar 4, 2011, 7:28:17 AM3/4/11

to libfi...@googlegroups.com

Comment #10 on issue 9 by Flatm...@googlemail.com: 32 bit div
http://code.google.com/p/libfixmath/issues/detail?id=9

I've found a good way to do fixed point division on arm:
http://me.henri.net/fp-div.html
It'll be very useful for you birdland, I don't want to steal it to put it
in the SDK though so I'm going to have to contact him and check if he wants
to join in or if he's happy for his work to go into our project.

libfi...@googlecode.com

unread,

Mar 4, 2011, 9:10:18 AM3/4/11

to libfi...@googlegroups.com

Comment #11 on issue 9 by Flatm...@googlemail.com: 32 bit div
http://code.google.com/p/libfixmath/issues/detail?id=9

Well we have Henri's permission (and encouragement) to use his code as part
of our library, so I'll add it when I get time (unless one of you want to).

Henri:
"By all means please use the code. It took a lot of work and research to
figure out, and at the time, I didn't want it all to go to waste - so I
wrote it all down in that article - but no-one was interested in publishing
it at the time. For this reason I would be very happy to see it part of an
open source library. That being said, I don't do ARM work any more - I
don't even have the build tools installed, so if you can use or adapt the
code 'as-is' please let that be my contribution."

libfi...@googlecode.com

unread,

Mar 4, 2011, 9:14:19 AM3/4/11

to libfi...@googlegroups.com

Comment #12 on issue 9 by birdla...@gmail.com: 32 bit div
http://code.google.com/p/libfixmath/issues/detail?id=9

What I did with my application now is that I saw the range of my numbers in
my application and saw that 2^12 gives me 3 decimal places precision. So at
the moment just using simple integer division with precision upto 3 decimal
places.

input1 = A*2^12;
input2 = B*2^12;

int div(int input1, int input2){
input1 = input1 << 12;
return input1/input2;
}

I need to keep track of my precision in my application.

Now I am stuck again in sqrt :( looking for some faster implementation.

libfi...@googlecode.com

unread,

Mar 4, 2011, 9:58:26 AM3/4/11

to libfi...@googlegroups.com

Comment #13 on issue 9 by birdla...@gmail.com: 32 bit div
http://code.google.com/p/libfixmath/issues/detail?id=9

currently using the following sqrt. I have not tested its speed but it
should be varying depending on the input value. Any one knows more faster
approach ?

int isqrt32 (int value)
{
int g = 0;
int bshft = 12;
int b = 1<<bshft;
do {
int temp = (g+g+b)<<bshft;
if (value >= temp) {
g += b;
value -= temp;
}
b>>=1;
} while (bshft--);

return g;
}

libfi...@googlecode.com

unread,

Mar 4, 2011, 10:10:34 AM3/4/11

to libfi...@googlegroups.com

Comment #14 on issue 9 by Flatm...@googlemail.com: 32 bit div
http://code.google.com/p/libfixmath/issues/detail?id=9

With sqrt functions the faster approaches tend to be approximations, but
this is meant to be an accurate library (nothing saying we can't use macros
and have both though). Implementing the function in 32-bit or using ARM
assembler would be a good bet, though I'm thinking this sqrt stuff should
be a new issue.

libfi...@googlecode.com

unread,

Jun 21, 2011, 8:21:52 PM6/21/11

to libfi...@googlegroups.com

Comment #15 on issue 9 by joe.scha...@gmail.com: 32 bit div
http://code.google.com/p/libfixmath/issues/detail?id=9

Birdland are you using an ARM. If so I have an assembly fixed point
division algorithm that I know will run on an ARM thumb architecture and I
think should run on any other ARM architecture.

I will commit it but would like someone with something other than what I
have test it out

fix16_t fix16_div(fix16_t inArg0, fix16_t inArg1)

{
register int32_t quotient;
asm(
"num .req %[inArg0] @ Map Register Equates\n\t"
"den .req %[inArg1]\n\t"
"mod .req r2\n\t"
"cnt .req r3\n\t"
"quo .req r4\n\t"
"notden .req r12\n\t"
"sign .req r11\n\t"

"@ Make sure that we are not dividing by zero\n\t"
"cmp den, #0\n\t"
"bne .skip_div0\n\t"
"@ If we are dividing by zero return all -1\n\t"
"mov num, #1\n\t"
"neg num, num\n\t"
"b .quit\n\t"
".skip_div0:\n\t"

"@ Store the sign in the sign register for later\n\t"
"@ Sign is computed by exoring the numerator and denominator\n\t"
"@ and checking if the result is positive or negative\n\t"
"movs mod, num\n\t"
"eor mod,mod,den\n\t"
"movs sign, mod\n\t"

"@ Negate the denominator if necessary\n\t"
"@ The algorithm requires the denominator to be signed negative\n\t"
"cmp den, #0\n\t"
"bmi .skip_negate_den\n\t"
"neg den,den\n\t"
".skip_negate_den:\n\t"

"@ Negate the numerator if necessary\n\t"
"@ The algoright requires the numerator to be signed positive\n\t"
"cmp num, #0\n\t"
"bpl .skip_negate_num\n\t"
"neg num,num\n\t"
".skip_negate_num:\n\t"

"@ Begin the actual division algorithm\n\t"
"neg mod, den\n\t"
"movs notden, mod\n\t"
"movs mod, #0\n\t"
"add num, num, num\n\t"
".rept 48\n\t"
" adc mod, mod, mod\n\t"
" add mod, den, mod\n\t"
" bcs .+4\n\t"
" add mod, mod, notden\n\t"
" adc quo, quo, quo\n\t"
" add num, num, num\n\t"
".endr\n\t"
"movs num,quo\n\t"

"@ Check the sign to determine if we need to negate the answer\n\t"
"movs quo,sign\n\t"
"cmp quo,#0\n\t"
"bpl .skip_negate_quo\n\t"
"neg num,num\n\t"
".skip_negate_quo:\n\t"

".quit:\n\t"
/* output registers */
: [quotient] "=r" (quotient)
/* input registers */
: [inArg0] "0" (inArg0), [inArg1] "r" (inArg1)
/* clobbered registers */
: "cc", "r2", "r3", "r4", "r11", "r12");
return quotient;
}

libfi...@googlecode.com

unread,

Dec 10, 2011, 1:20:27 PM12/10/11

to libfi...@googlegroups.com

Comment #16 on issue 9 by sbmeirow: 32 bit div
http://code.google.com/p/libfixmath/issues/detail?id=9

It would be interesting to see code run on multiple flavors of ARM Cortex-M
family: Cortex-M0 vs Cortex-M3 vs Cortex-M4 vs Cortex-M4F. See instruction
differences at http://en.wikipedia.org/wiki/ARM_Cortex-M#Overview

I added links to this project at
http://en.wikipedia.org/wiki/STM32#Development_tools
http://en.wikipedia.org/wiki/ARM_Cortex-M#Development_tools

Keep up the good work!
http://en.wikipedia.org/wiki/User:Sbmeirow

libfi...@googlecode.com

unread,

Dec 10, 2011, 2:21:43 PM12/10/11

to libfi...@googlegroups.com

Comment #17 on issue 9 by vincent....@gmail.com: 32 bit div
http://code.google.com/p/libfixmath/issues/detail?id=9

Comparing cores does not make sense ...
Moreover, M3, M4 and M4F are exactly identical excepting DSP (+FPU) on
M4(F) which is not used by libfixmath since we're working on integer
operations ... you will measure exactly the same timings.

Nevertheless, I agree it could be interesting to compare toolchains.

Best regards,
Vincent.

libfi...@googlecode.com

unread,

Feb 19, 2015, 4:38:39 AM2/19/15

to libfi...@googlegroups.com

Comment #18 on issue 9 by sbmeirow: 32 bit div
https://code.google.com/p/libfixmath/issues/detail?id=9

The Cortex-M0 / M0+ / M1 only has 32-bit multiply instructions with a
lower-32-bit result (32bit x 32bit = 32bit),

where as the Cortex-M3 / M4 / M7 includes additional 32-bit multiply
instructions with 64-bit results (32bit x 32bit = 64bit).

The Cortex-M4 / M7 also include DSP instructions for (16bit x 16bit) and
(32bit x 16bit) multiplication.

--
You received this message because this project is configured to send all
issue notifications to this address.
You may adjust your notification preferences at:
https://code.google.com/hosting/settings

Reply all

Reply to author

Forward