AMD's 3DNow! imprecise

spam...@crayne.org

unread,

Apr 2, 2008, 3:50:35 PM4/2/08

to

The subject may be a bit misleading, but the imprecisement is rather
derived from how binary works, but maybe you smart people can help me
out with a workaround or something.
Ok, I'm relatively new to ASM(a few days actually), but I have a
background with C, Python, CLisp and such.

Anyway, I am trying to build the very-fast-prime-counter i assembly
and I succeeded when using only the regular instructions, but a 6.5
secs for a 100k primes is way too slow, so I looked around and
profiled my code...and of course, a major bottleneck was the idivl. So
I decided to use the 3DNow! instructions.

ecx = ecx+2 (begins from 3; 2 is already in the stack)
edx = all found primes less than current ecx, are looped through it.

MOVD %ecx, %MM0 /* divisor */
PI2FD %MM0, %MM0 /* divisor to float */
MOVQ %MM0, %MM2 /* copy of divisor */
PFRCP %MM0, %MM0 /* 1/divisor */
MOVD %edx, %MM1 /* prime_candidate */
PI2FD %MM1, %MM1 /* prime. to float */
MOVQ %MM1, %MM3 /* copy of prime. */
PFMUL %MM0, %MM1 /* 1/divisor * prime */
PF2ID %MM1, %MM1 /* answer to int ...*/
PI2FD %MM1, %MM1 /* ... and back to float */ /* NOTE1 */
PFMUL %MM1, %MM2 /* roundeddown(1/divisor*prime.)*divisor
PFCMPEQ %MM2, %MM3
MOVD %MM3, %eax /* if prime check "good", then != */

I will reorganize it into a tight loop later, and will, of course, do
other logical optimizations, but the real problem is currently the
following:
It will run perfectly until ecx=9 and edx=3. After "NOTE1", the MM1
should be 3.0, but it seems to get rounded down to 2 on the toint-
tofloat round. It seems the previous calculation before the converting
stays a bit below 3, and then gets rounded down.

Any workarounds on this problem? I tried using the SSE instructions,
but then I read that GNU "as" doesn't support them :/

Any help(even if just saying "you're screwed!") will be greatly
appreciated :)

Thanks,
Tanel

kuratkull

unread,

Apr 3, 2008, 5:13:07 PM4/3/08

to

I posted this yesterday, so I have had some advances with this
problem:

prefetchnta (%ebp)
MOVD (%ebp), %MM0 /* divisor */
addl $4,%ebp /* move pointer */

PI2FD %MM0, %MM0 /* divisor to float */
MOVQ %MM0, %MM2 /* copy of divisor */

PFRCP %MM0, %MM5 /* 1/div imprecise */
PFRCPIT1 %MM5, %MM0 /* 1/div more precise*/
PFRCPIT2 %MM5, %MM0 /* 1/div very precise */
MOVD %edx, %MM1 /* prime_cand */

PI2FD %MM1, %MM1 /* prime. to float */
MOVQ %MM1, %MM3 /* copy of prime. */

PFMUL %MM0, %MM1 /* 1/div * prime */

PF2ID %MM1, %MM1 /* answer to int ...*/
PI2FD %MM1, %MM1 /* ... and back to float */

PFMUL %MM1, %MM2 /* answer * divisor. */

PFCMPEQ %MM2, %MM3
MOVD %MM3, %eax

FEMMS

test% time ./primes

real 0m2.474s
user 0m2.387s
sys 0m0.023s

... for up to 100000. Slow i know, but i don't know how to optmize it
any further, the code above is the best I could come out with.

Also, it isn't still running smoothly. For 100, it gives 1 prime too
many... for 100000 it gives 9642 primes..although there should be
9592. So the 32 bit precision is not enough. For 10000000(which runs
slooow) it gives ~500 primes too much.

Anyone more experienced know any tips?

Thanks in advance,
Tanel

kuratkull

unread,

Apr 3, 2008, 5:23:43 PM4/3/08

to

Ok here is the full proggie, any suggestions are highly welcome. I
wanted to optimize this as much as I could by reading documentation
and tutorials. So this code is running 100000 primes for ~2.5 secs. I
know this is very slow, since I've read people having ~1000,000 primes
in 0.5secs, in handwritten ASM. Although the person used SSE2
instructions which, sadly, aren't supported by my old Athlon XP 2600+
Barton.

Any help is greatly appreciated either on optimization or the
precision-rounding problem.
Thanks!

PS! The "test" tags are for profiling.

.data

to:
.long 100000

.text
.globl _start

_start:
pushl $2 /* we push 2 to the stack */
movl $1, %ecx /* add 1 to total primes */
movl $1, %ebx /* nrs. to check on this round */
movl $3, %edx /* nr. in hand */

top:
pushl %edx /* push current num into stack */
incl %ecx /* increment count of primes*/

bad:
cmp to, %edx /*compare if number in hand is not over the
limit */
jge done /* ... if it is -> done */
mov %esp,%ebp /* set index_pointer to end of stack */
mov %ecx,%ebx /* set "to check" to beginning */
addl $2,%edx /* increment numbers by 2 */

test1:
cmp $0,%ebx /* compare 0 and "left to check" */
je top /* if 0 go to top, we found a prime */

test2:
/* 3DNOW! instructions */
prefetchnta (%ebp)
MOVD (%ebp), %MM0 /* stack_num */
addl $4,%ebp /* move pointer */
PI2FD %MM0, %MM0 /* stack to float */
MOVQ %MM0, %MM2 /* copy of stack */
PFRCP %MM0, %MM5 /*1/stack imprecise */
PFRCPIT1 %MM5, %MM0 /* 1/stack more precise*/
PFRCPIT2 %MM5, %MM0 /* 1/stack very precise */
MOVD %edx, %MM1 /* prime_cand */
PI2FD %MM1, %MM1 /* prime to float */
MOVQ %MM1, %MM3 /* copy of prime */
PFMUL %MM0, %MM1 /* 1/stack * prime */

PF2ID %MM1, %MM1 /* answer to int ...*/
PI2FD %MM1, %MM1 /* ... and back to float */

PFMUL %MM1, %MM2

PFCMPEQ %MM2, %MM3
MOVD %MM3, %eax

FEMMS

test3:
decl %ebx /* decrease "to check" */
cmp $0,%eax /* if comparison result is equal to 1,
then not prime -> bad */
je test1
jmp bad

done:
movl %ecx, %eax /* DEBUG */
movl $0, %ebx
movl $1, %eax
int $0x80

Sebastian Biallas

unread,

Apr 3, 2008, 6:23:22 PM4/3/08

to

kura...@gmail.com wrote:
> Any workarounds on this problem? I tried using the SSE instructions,
> but then I read that GNU "as" doesn't support them :/

GNU as supports SSE since I-don't-know-how-many years.

Since today they even support the AVX-extension Intel officially
released just yesterday!

--
Gruß,
Sebastian

Waldek Hebisch

unread,

Apr 3, 2008, 10:23:23 PM4/3/08

to

kura...@gmail.com <spam...@crayne.org> wrote:
> Anyway, I am trying to build the very-fast-prime-counter i assembly
> and I succeeded when using only the regular instructions, but a 6.5
> secs for a 100k primes is way too slow, so I looked around and
> profiled my code...and of course, a major bottleneck was the idivl.

I am not sure what "prime-counter" mean -- I you want just generate
primes than sieve of Eratosthenes is the standard method (much faster
than method using division).

In case you really want to use division, you may consider storing
recipracals -- you are using the same divisor many times.

> So
> I decided to use the 3DNow! instructions.
>
> ecx = ecx+2 (begins from 3; 2 is already in the stack)
> edx = all found primes less than current ecx, are looped through it.
>
> MOVD %ecx, %MM0 /* divisor */
> PI2FD %MM0, %MM0 /* divisor to float */
> MOVQ %MM0, %MM2 /* copy of divisor */
> PFRCP %MM0, %MM0 /* 1/divisor */
> MOVD %edx, %MM1 /* prime_candidate */
> PI2FD %MM1, %MM1 /* prime. to float */
> MOVQ %MM1, %MM3 /* copy of prime. */
> PFMUL %MM0, %MM1 /* 1/divisor * prime */
> PF2ID %MM1, %MM1 /* answer to int ...*/
> PI2FD %MM1, %MM1 /* ... and back to float */ /* NOTE1 */
> PFMUL %MM1, %MM2 /* roundeddown(1/divisor*prime.)*divisor
> PFCMPEQ %MM2, %MM3
> MOVD %MM3, %eax /* if prime check "good", then != */
>

> It will run perfectly until ecx=9 and edx=3. After "NOTE1", the MM1

> should be 3.0, but it seems to get rounded down to 2 on the toint-
> tofloat round. It seems the previous calculation before the converting
> stays a bit below 3, and then gets rounded down.
>

3DNow! PFRCP intruction give only approximate result (unlike 387 FPU
the result in _not_ correctly rounded). Worse, according to AMD
documentation PFRCP has only 12 bits of accuracy, so you are limited
to really tiny numbers. So, you probably should add PFRCPIT1 and
PFRCPIT2 instructions, which together should give you 24 bits of
accuracy (however, the documentation I have is pretty unclear what
accuraccy is guaranteed, so I would not trust in the last 1-2 bits).

Given inaccuracy of reciprocal and rounding of multiplicatin you
may get result which is smaller than true quotient. If the true
result is an integer and rounded result is smaller then rounding
down give you wrong result. To avoid this error you must make
sure that approximate result is _always_ bigger than true result.
Simple way to to multiply approximate a/b by a number slightly
bigger than 1, say 1+2^(-20) (this should be enough if the
approximate a/b has 21 bits of accuracy). Of course, there is
a danger that real a/b is smaller than an integer, but
after biasing up may get bigger and you may get the result
which is bigger than correct result. However, this can not
happen if the numbers are small enough. Namely, if a and b
are positive integer and a/b in not an integer, than distance
from a/b to the nearest integer is at least 1/b. As long as
absolute error of the calculation is smaller than 1/b there
is no risk that after rounding you get number bigger than
true result. Error bounds for floating point computations
are relative, so you really want relative error smaller than
1/a. Which means that you are limited to 18-19 bit a.
If you want bigger numbers than it is hard to beat IDIV.
One may try the same trick used by PFRCP, PFRCPIT1, PFRCPIT2 --
first get rough approximantion (say, via table lookup) and
refine using Newton iteration. But it is not clear if
table lookup and several aritmetic instructions needed
for Newton iteration will be faster than IDIV...

--
Waldek Hebisch
heb...@math.uni.wroc.pl

Terje Mathisen

unread,

Apr 4, 2008, 1:36:03 AM4/4/08

to

Waldek Hebisch wrote:
> kura...@gmail.com <spam...@crayne.org> wrote:
>> Anyway, I am trying to build the very-fast-prime-counter i assembly
>> and I succeeded when using only the regular instructions, but a 6.5
>> secs for a 100k primes is way too slow, so I looked around and
>> profiled my code...and of course, a major bottleneck was the idivl.
>
> I am not sure what "prime-counter" mean -- I you want just generate
> primes than sieve of Eratosthenes is the standard method (much faster
> than method using division).

Indeed, that is the way to do it for any kind of range processing, until
you get into stuff like 500+ bit prime number searching (for RSA keys),
in which case you should be using statistical procedures, i.e. modular
arithmetic.

> In case you really want to use division, you may consider storing
> recipracals -- you are using the same divisor many times.

Using the same divisor/reciprocal many times simply means using a sieve,
and for a properly compressed base table (personally I prefer storing 30
numbers in each 8-bit entry) you can even do stuff like casting out
multiple trial divisor multiples at the same time, when the divisors are
still pretty small.

Do a search on +prime +"sieve of Eratosthenes"!

I'm guessing you have at least a couple of orders of magnitude in speed
to gain. :-)

Terje

--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

kuratkull

unread,

Apr 4, 2008, 2:13:00 AM4/4/08

to

On Apr 4, 8:36 am, Terje Mathisen <spamt...@crayne.org> wrote:
> Waldek Hebisch wrote:

> - <Terje.Mathi...@hda.hydro.com>

> "almost all programming can be viewed as an exercise in caching"

Primes are like my personal hobby, so yes, I know pretty much about
them (trying not to boast). I have ~10 different algorythms
implemented in C, I also have 4 primes which belong to the TOP5000
largest primes in the world list(found using LLR and sieving).
So please stop suggesting me to use the SOE, I know :)

Tanel

kuratkull

unread,

Apr 4, 2008, 1:45:12 AM4/4/08

to

On Apr 4, 5:23 am, Waldek Hebisch <spamt...@crayne.org> wrote:

> hebi...@math.uni.wroc.pl

Blah, our posts are falling out of sync with eachother.

Sorry Sebastian, I mixed up: the gas doc. says:
"Currently, as does not support Intel's floating point SIMD, Katmai
(KNI)."
->
SSE works, but the examples I tried didn't work for some reason, after
I had "converted" them from Intel to AT&T. Enough about this :)

Waldek:
By prime-counter I meant trial division with every previously found
prime. (I have looked into number theory as a personal interest ).

Thanks for the detailed info on the workaround, but I am more
staggered to find that the 3DNow! actually IS imprecise :/ So it is
basically useless, because as I said, it generates ~500 primes more
for 1000,000 than it should. (I previously wrote 10 mil, sorry).

The solution you offered is very nice, but I was hoping the solution
to be without limits, though I appreciate your effort. It just looks
like I have to dump 3DNow! altogether. (I'm going to get myself a new
IntC2D laptop, so I will be able to use SSE upto SSSE3).

Phil Carmody

unread,

Apr 4, 2008, 4:18:38 PM4/4/08

to

> Primes are like my personal hobby, so yes, I know pretty much about
> them (trying not to boast). I have ~10 different algorythms
> implemented in C, I also have 4 primes which belong to the TOP5000
> largest primes in the world list(found using LLR and sieving).
> So please stop suggesting me to use the SOE, I know :)

*cough*

The problem is that if you're looking at optimising assembly
at this stage, you've obviously ignored everything that Knuth
has said about optimisation. As has been pointed out before,
you've not even clearly described exactly what task you wish
to perform.

If, as you first said, you want a prime counter, and the
magnitudes of your bounds are quite small, but the range
between them not insignificant (say 1e18-1.01e18) then the
quickest way is to use something like a Meissel-Lehmer
algorithm, or at least Legendre's algorithm, rather than
a sieve. PD C and Java source can be found on sci.math
dating about 5 years ago. (When James Harris was in full
flood.)

However, if your range is narrow, then a sieve may be best.
The precise type of sieve that's best will depend on the
magnitude of the numbers involved. If your implementation
is polished, then an eratosthenes with intelligent queuing
of the small primes will be optimal up to 10^17 or so.
Tomás Oliveira e Silva has a pretty good open source
implementation which if tweaked correctly can approach
James van Buskirk's, which I believe to be the fastest
Eratosthenes.

Beyond that, you'll want an Atkin or Galway sieve. DJB's
primegen does work well at ~10^18 and beyond despite his
disclaimers, as long as you parametrise it correctly.
(Most people who have complained about its speed have
simply not read the instructions, it degrades badly if
you mis-parametrise.) As far as I know there are no
particularly fast implementations of a Galway sieve,
but as long as you're prepared to handle numbers >64
bits, a Galway sieve should run quite happily up to 10^20
on a commodity PC.

If you're actually looking at finding moderately large
primes in a modest range, then deep sieving becomes
relatively unimportant, as the PRP tests begin to dominate.
Shallow sieving, of course, is vital. The cutoff between
shallow and deep is something for which you'll need to do
a little bit of arithmetic. (I.e. what Terje said above.)

However, whatever you're doing, to reiterate, the first
thing to do is make sure you know what you really want
to do. The second most important thing is to chose the
right algorithm, and only the final thing to worry about
is the bit bashing itself.

Phil
--
Dear aunt, let's set so double the killer delete select all.
-- Microsoft voice recognition live demonstration

Sebastian Biallas

unread,

Apr 7, 2008, 7:55:19 AM4/7/08

to

kuratkull wrote:
> Sorry Sebastian, I mixed up: the gas doc. says:
> "Currently, as does not support Intel's floating point SIMD, Katmai
> (KNI)."

I guess they didn't updated their documentation for years. It is
definitely supported.

--
Gruß,
Sebastian

kuratkull

unread,

Apr 8, 2008, 8:19:12 AM4/8/08

to

On Apr 4, 11:18 pm, Phil Carmody <thefatphil_demun...@yahoo.co.uk>
wrote:

Thanks for the long post, I really appreciate it Phil. Although the
point was not about the algorithm itself, but about any means to have
the app run faster with "simple" ASM tricks. But It doesn't matter any
more, since I have implemented a Sieve of E. in 55 lines which finds
10 million primes in 0.54 seconds, so I guess I managed to write fast
code anyway(because it's possible to write a rather slow S. of E.
too).
The 3DNow! problem is fixed by simply not using it, so I guess this
topic has served its purpose.

Thanks to all who responded!

Tanel

James Van Buskirk

unread,

Apr 10, 2008, 3:08:32 PM4/10/08

to

"kuratkull" <spam...@crayne.org> wrote in message
news:38432b4f-1f55-45ed...@l64g2000hse.googlegroups.com...

> Thanks for the long post, I really appreciate it Phil. Although the
> point was not about the algorithm itself, but about any means to have
> the app run faster with "simple" ASM tricks. But It doesn't matter any
> more, since I have implemented a Sieve of E. in 55 lines which finds
> 10 million primes in 0.54 seconds, so I guess I managed to write fast
> code anyway(because it's possible to write a rather slow S. of E.
> too).
> The 3DNow! problem is fixed by simply not using it, so I guess this
> topic has served its purpose.

It's funny how everyone who writes a Sieve program thinks it's fast.
Fot timing results on my old 1.4 GHz Athlon, see
http://home.comcast.net/~kmbtib/Sieve30h/time1000000000.dat
and similar files. Writing a fast sieve takes a lot of effort
because there are so many features that can be added, each of
which contributes significantly to performance.

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end