I could use the repeated square-and-multiply algorithms but when I do
(A*A) % n I can get an overflow because A is a 64-bit integer.
Now I'm using miracl library, but somebody knows a *faster* way?
Thanks
Cristiano
If you're only using ~62 bits and you're on an x86, or if you're only using ~51 bits and you're on a non-x86, then there's a trick you can use for speeding up the operation.
Do either of these apply?
Method:
Create a*b's top bits in the FP unit, and its bottom bits in the int unit.
Multiply the approximate product by the reciprocal of n.
Round to integer.
In the int unit, multiply that rounded value by n.
Subtract that from the first integer result you obtained.
Phil
"Cristiano" <cristi...@NSquipo.it> wrote in message
news:7TL69.213426$Jj7.4...@news1.tin.it...
Which is why you'd put only 32-bits in the variable [um duh?]
> Now I'm using miracl library, but somebody knows a *faster* way?
miracl is in fact a very fast library. So is GMP. There are of
course other ways todo x^y mod z. For example, if you know y belongs
to h [read a number theory text to find out what that means] and h has
factors q1,q2,q3,...qN you could actually perform the exponentiation
in the smaller sub-groups and use CRT to combine the results.
That is, for example, how RSA is commonly sped up.
There are other tricks as well. If you know the range of y is small
[but not too small] you could use precomputed tables. For example, if
y is a 256-bit quantity you could use 32 lookup tables. Then a
modular exponentiation would require only 31 modular multiplications
[instead of the on average 384 required].
Then there are other wierd ideas like vector chains etc... e.g. the
shortest chain that adds and/or multiplies to the value you want
[mult-square is a derivative of this method].
Tom
Seems good. But for 2^64 mod n? 2^64 is a 65-bit number.
Thank you
Cristiano
Yes, I'm lucky :-) I'm using 50-bit numbers and I'm on P4.
> Method:
> Create a*b's top bits in the FP unit, and its bottom bits in the int
> unit. Multiply the approximate product by the reciprocal of n.
> Round to integer.
> In the int unit, multiply that rounded value by n.
> Subtract that from the first integer result you obtained.
I know just a bit of assembly. Do you have the code for x^y mod n?
Thank you
Cristiano
I was sleeping.
2^64 mod n = [(2^32 mod n) * (2^32 mod n)] mod n.
It is not very fast but I can try...
Thank you
Cristiano
> Seems good.
Unless n^3<2^128, the first multiplication will overflow a 64 bit
register.
Colin Percival
And I don't know how to calculate 2^64 mod n :-(
Cristiano
So do ((2^64 - 1) mod n) + 1
--
Edward Elliott
Calculate (2^64-n) mod n.
Colin Percival
No need for assembly, it can all be done in portable C.
The above is a mulmod, replace a*b with a*a for a sqmod, and just use an ordinary L2R or R2L exponentiation. R2L is more pipelineable, IIRC.
Phil
At the moment you can't even do _anything_ mod a number larger than 32 bit.
This route is a dead dog.
Here's some code, that requires some macros for token concatenation and shit like that, and also somemacros to chose which rounding mode to use, and whether errors are permissible in either the + or - direction. Only the final mulmod needs to be exact, as long as you've got a spare bit.
//------------------------------------------------------
// mod104_52_pr |
//---------------+
// xyl=low 64 bits of 'xy', a double-word integer
// xyf=approx value of xy as a float
// p =prime
// prf=approx 1.0/prime
//
static INLINE i52 CAT(mod104_52_pr,MATH52_SUFFIX) (int64 xyl, double xyf, i52 p, double prf)
{
int64 rounded=(int64)p*(int64)(xyf*prf); /* p*(xy/p) */
RET_SUB52_METHOD(xyl,rounded,p);
}
//------------------------------------------------------
// sqmod52_pr |
//---------------+
// x =integer
// p =prime
// prf=approx 1.0/prime
//
static INLINE i52 CAT(sqmod52_pr,MATH52_SUFFIX) (i52 x, i52 p, double prf)
{
return CAT(mod104_52_pr,MATH52_SUFFIX) ((int64)x*x, (double)x*x, p, prf);
}
//------------------------------------------------------
// mulmod52_pr |
//---------------+
// x,y=integer
// p =prime
// prf=approx 1.0/prime
//
static INLINE i52 CAT(mulmod52_pr,MATH52_SUFFIX) (i52 x, i52 y, i52 p, double prf)
{
return CAT(mod104_52_pr,MATH52_SUFFIX) ((int64)x*y, (double)x*y, p, prf);
}
There is also a problem for (a1^2 mod n)*(2^64 mod n) because n is a
50-bit number.
Cristiano
I tried to implement your functions in this way:
UI64 mod104_52_pr(UI64 xyl, double xyf, UI64 p, double prf)
{
UI64 rounded=(UI64)p*(UI64)(xyf*prf); /* p*(xy/p) */
return (xyl-rounded)%p;
}
UI64 mulmod52_pr(UI64 x, UI64 y, UI64 p)
{
return mod104_52_pr(x*y,(double)x*y, p, 1./double(p));
}
UI64 is unsigned int64. It doesn't work. Where am I wrong?
*Must* p be a prime or a generic number? In my program n (the modulus)
is always a 50-bit odd number.
Cristiano
You don't want that %p, because if the thing ever goes negative, the unsigned value will go haywire with the %. A signed value would work though.
> UI64 mulmod52_pr(UI64 x, UI64 y, UI64 p)
> {
> return mod104_52_pr(x*y,(double)x*y, p, 1./double(p));
> }
>
> UI64 is unsigned int64. It doesn't work. Where am I wrong?
> *Must* p be a prime or a generic number? In my program n (the modulus)
> is always a 50-bit odd number.
It should work as long as nothing ever goes negative.
Here are the macros I use that permit an error by -p, +p, +/-p or no error.
I've just noticed, this version of my code doesn't muck around with the rounding options at all, so I guess that it works out of the box on the assumtion of 'round to nearest' (which will round up 50% of the time, and cause the above value to be negative).
Assume the previous code was in file math52.tmplh
//
// SOLO MOD104_52, SQMOD, MULMOD
//
// Sometimes we can be out by +/- p.
// Perform a sloppy subtraction if we know we can get away with it.
// e.g. if the next thing to do is a square, then we don't care about being 'exact-p'
// if the next thing to do is +small, then we don't mind being 'exact+p'
//
#define RET_SUB52_BADPM(a,b,p) return (a)-(b)
#define RET_SUB52_BADP(a,b,p) return (a)-(b)+((b)>(a)?(p):0)
#define RET_SUB52_BADM(a,b,p) { i52 stab=a-b; return(stab>p)?stab-p:stab; }
#define RET_SUB52(a,b,p) { i52 stab=a-b; if(stab<0) return stab+p; if(stab>=p) return stab-p; re\
turn stab; }
#define RET_SUB52_METHOD(a,b,c) RET_SUB52_BADPM(a,b,c)
#define MATH52_SUFFIX _badpm
#include "math52.tmplh"
#undef MATH52_SUFFIX
#undef RET_SUB52_METHOD
#define RET_SUB52_METHOD(a,b,c) RET_SUB52_BADM(a,b,c)
#define MATH52_SUFFIX _badm
#include "math52.tmplh"
#undef MATH52_SUFFIX
#undef RET_SUB52_METHOD
#define RET_SUB52_METHOD(a,b,c) RET_SUB52_BADP(a,b,c)
#define MATH52_SUFFIX _badp
#include "math52.tmplh"
#undef MATH52_SUFFIX
#undef RET_SUB52_METHOD
#define RET_SUB52_METHOD(a,b,c) RET_SUB52(a,b,c)
#define MATH52_SUFFIX
#include "math52.tmplh"
#undef MATH52_SUFFIX
#undef RET_SUB52_METHOD
Phil
I tried your macros but sometime I get a wrong result even with 48-bit
numbers.
Your code works with you? I'm not able to find a SUB_METHOD which works.
Cristiano
I've never taken them above ~44 bits, so I can't make any promises beyond that. I used it in the pre-siever I used for last year's twin-prime hunt, and I don't think I sieved much beyond 20T. I was, however, using an Alpha, not an x86 - maybe there's some slight difference between their behaviour.
Can you give an example of it failing?
Phil
x= D55 71CBD705 bits: 44 14660632565509
y= 5C0 9F322B7E bits: 43 6324862724990
n= 7 F125CA98 bits: 35 34110556824
x * y mod n= 6 D150A0DE bits: 35 29281525982
I get fffffffee02ad646.
x= 9EB 12739CEB bits: 44 10905231531243
y= 966 5A37B5B7 bits: 44 10335204914615
n= D03 C7044180 bits: 44 14309875007872
x * y mod n= 2 0EF65CFD bits: 34 8840961277
I get d05d5fa9e7d.
Cristiano
Represented as hex, perhaps. However, the _value_ that represents as a _signed_ 64 bit integer is -4829030842
34110556824-4829030842 = 29281525982
The subtract macros will only work if you are treating the values as signed.
The only way you can get away with unsigned values is if you set the rounding mode to always round down, _and_ you sacrifice a chicken to the gods of IEEE754 each time you use it.
Signed,
Phil
Cristiano,
Maybe this piece of code can help you. I'm using it in order to find
factorizations of number near googolplex at:
http://www.alpertron.com.ar/GOOGOL.HTM
fild base
fild baseAux
fmul
fild prime
fdiv
fistp quotient
mov eax,dword ptr base
mul dword ptr base[4]
lea ebx,[eax+eax]
mov eax,dword ptr base
mul eax
add ebx,edx
mov ecx,eax
mov eax,edi
mul dword ptr quotient[4]
sub ebx,eax
mov eax,esi
mul dword ptr quotient
sub ebx,eax
mov eax,edi
mul dword ptr quotient
sub ecx,eax
sbb ebx,edx
js add_prime1
sub ecx,edi
sbb ebx,esi
jnc cont1
add_prime1:
add ecx,edi
adc ebx,esi
cont1:
mov dword ptr base,ecx
mov dword ptr base[4],ebx
In this case prime is the modulus, base and baseAux are the factors
and quotient is a temporary variable (all variables are quadwords). I
think it should work for 52 bits. Also ESI:EDI = prime.
If you know a faster method (that works of course) please let me know.
Best regards,
Dario Alejandro Alpern
Buenos Aires - Argentina
http://www.alpertron.com.ar/ENGLISH.HTM
If the result is in "base", I get a result totally wrong even with 40-bit
numbers. Do you know why?
> If you know a faster method (that works of course) please let me know.
I think the method of Phil is very fast but it doesn't work also with round
mode set to truncate:
unsigned short cw; // FPU control word
asm FNSTCW cw; // Store the control word in cw
cw|=0xC00; // Set bit 10 and 11 (RC flag)
asm FLDCW cw; // Load cw in the FPU
I haven't tried with the chicken to the gods of IEEE754... :-)
Cristiano
After the subract of the two integer values, the result (as a signed value) must, modulo the base, be equal to the desired value. I fail to see how rounding can account for a discrepancy of more than 1*base n absolute value, as that would mean the rounded value would have to have more than 1ulp error.
Can you post an example of it either being wrong by a value not a multiple of the base, or by more than 1*base in absolute difference, please?
Phil
You're right, Phil. I missed "modulo the base".
The following version works:
__int64 mulmod(__int64 x, __int64 y, __int64 p)
{
__int64 xyl = x * y;
FPU_CW(truncate);
double xyf = double(x) * y / double(p);
__int64 r = p * __int64(xyf); // CRITICAL
RESTORE_fpu_CW();
return (xyl - r) % p;
}
*but* if p is too small (say 34, 39 bits) the line "CRITICAL" generates an
exception: invalid floating point, because xyf is > 2^63 (we are using
*signed* int64) or, if we use *unsigned* int64 xyf cannot exceed 2^64.
Example:
x= 6D3D0 FDCD8098 bits: 51 1,92174442502569E15
y= A29CB 6D98B592 bits: 52 2,86070346093096E15
p= 6D 9DDB85F5 bits: 39 470799844853
x * y mod p= 5A E1BBC237 bits: 39 390334235191
xyf= 1.16770236604736e19
2^63= 9.22e18
And I need p from 2 to 2^49.
I'd like to hear some comment about my problems with the Dario's code...
Cristiano
The algorithm can trivially be made to work whatever the rounding mode, so don't change it repeatedly.
> return (xyl - r) % p;
Avoid % for speed. You know that you need to either add or subtract p, at most.
> }
>
> *but* if p is too small (say 34, 39 bits) the line "CRITICAL" generates an
> exception: invalid floating point, because xyf is > 2^63 (we are using
> *signed* int64) or, if we use *unsigned* int64 xyf cannot exceed 2^64.
If x<p and y<p, then x*y/p < p, so can't be > 2^63.
> Example:
> x= 6D3D0 FDCD8098 bits: 51 1,92174442502569E15
> y= A29CB 6D98B592 bits: 52 2,86070346093096E15
Well that's the problem. Start off within [0,p-1], or [-p/2, p/2], or whatever, and you'll be fine.
> p= 6D 9DDB85F5 bits: 39 470799844853
> x * y mod p= 5A E1BBC237 bits: 39 390334235191
> xyf= 1.16770236604736e19
> 2^63= 9.22e18
>
> And I need p from 2 to 2^49.
>
> I'd like to hear some comment about my problems with the Dario's code...
I'm fairly sure Dario has implemented the same algorithm that I've described. However, with a 1980s 32bit processor the 64-bit multiplies take a lot more lines of code that with virtual 64bit values, and so it looks more obfuscated.
Phi
In this case please write the factors that make the code fail. I'm
using it and it works OK for me. Anyway, if Phil's code is faster I
will change my code by Phil's.
Just generate 3 random numbers of 34-54 bits and you'll have the factors
which fail.
Anyway, this is an example:
x= 695EE 88DE5BC5 bits: 51 1,85370159129287E15
y= 359DA 32275851 bits: 50 943218609313873
p= 10F F9847EB9 bits: 41 1168122347193
x * y mod p= 26 5FCEE2E4 bits: 38 164816151268
x ^ y mod p= FE 7F83389C bits: 40 1093060999324
with your code I get: 50434cbcccda5e37.
I also tried to calculate base ^ baseAux (mod prime) with your code, but the
result is wrong in this case too.
What should your code calculate?
> I'm using it and it works OK for me.
I used your code in this way (UI64 is an unsigned 64-bit integer):
UI64 mulmod(UI64 base,UI64 baseAux,UI64 prime)
{
UI64 quotient;
asm{
YOUR CODE
}
return base;
}
Is that code right?
> Anyway, if Phil's code is faster I will change my code by Phil's.
I'm pretty sure that Phil's code is faster, but it seems that it has many
limitations. I'd like to compare his code with yours. Could you help me,
please?
Cristiano
How? If I don't change the rounding mode the code doesn't work.
>> return (xyl - r) % p;
>
> Avoid % for speed. You know that you need to either add or subtract
> p, at most.
Agreed.
>> }
>>
>> *but* if p is too small (say 34, 39 bits) the line "CRITICAL"
>> generates an exception: invalid floating point, because xyf is >
>> 2^63 (we are using *signed* int64) or, if we use *unsigned* int64
>> xyf cannot exceed 2^64.
>
> If x<p and y<p, then x*y/p < p, so can't be > 2^63.
But I can have also x>p in my applications.
>> I'd like to hear some comment about my problems with the Dario's
>> code...
>
> I'm fairly sure Dario has implemented the same algorithm that I've
> described. However, with a 1980s 32bit processor the 64-bit
> multiplies take a lot more lines of code that with virtual 64bit
> values, and so it looks more obfuscated.
Perhaps you're right. But I think the code seems more obfuscated because it
implements also your RET_SUB52_METHOD(a,b,c).
Cristiano
Yet no example provided.
Meet me half way - show me it not working, show me the intermediates and let me know the rounding mode in force.
Then I can help diagnose your problem.
The technique works for Paul Jobling in NewPGen; it works for Peter Montgomery and Francois Morain in the appendix of one of his (Morain's) papers; it works for Yves Gallot in Proth; it works for Jim Fougeron in GFNSieve; it works for David Underbakke in AthGFNSieve; and it works for me in several programs.
Basically it's nothing more than Barrett's method, but optimised for single-word operands, and a FP unit that can calculate the high part of a product.
Phil
Cristiano,
Please notice that the routine computes the modular multiplication of
base * baseAux mod prime and that prime should be loaded also in
ESI:EDI.
Also the prime should be greater than base and baseAux.
Do you have another example given these constraints?
Thank you.
The code didn't work because p was < x or y. When p>x and y the code works.
> The technique works for Paul Jobling in NewPGen; it works for Peter
> Montgomery and Francois Morain in the appendix of one of his
> (Morain's) papers; it works for Yves Gallot in Proth; it works for
> Jim Fougeron in GFNSieve; it works for David Underbakke in
> AthGFNSieve; and it works for me in several programs.
But you are better than me :-)
Cristiano
base= 365E CC576CD4 bits: 46 59780783107284
baseAux= 108C9 92ED0E55 bits: 49 291136823168597
prime= 1F2FB 614BFCF3 bits: 49 548636459793651
base * baseAux (mod prime)= 1A463 351EF6E2 bits: 49 462220976649954
I get 93031b44f23bcc0e.
I load ESI:EDI in this way:
UI64 mulmod(UI64 base,UI64 baseAux,UI64 prime)
{
UI64 quotient;
asm{
mov edi,dword ptr prime // lsw
mov esi,dword ptr prime[4] // msw
fild base // your code
fild baseAux
...
Cristiano
This example also works for me. I will show you all steps, so this
post is long.
All numbers are in _decimal_ so you can perform your calculations in
any calculator. Only changed registers/variables are shown.
base 59780783107284
baseAux 291136823168597
prime 548636459793651
quotient 0
eax 0
ebx 0
ecx 0
edx 0
esi 127739
edi 1632369907
st0 +0.00000000000000000e+0000
st1 +0.00000000000000000e+0000
fild base
st0 +5.97807831072840000e+0013
fild baseAux
st0 +2.91136823168597000e+0014
st1 +5.97807831072840000e+0013
fmul
st0 +1.74043872803855924e+0028
st1 +0.00000000000000000e+0000
fild prime
st0 +5.48636459793651000e+0014
st1 +1.74043872803855924e+0028
fdiv
st0 +3.17229869974948437e+0013
st1 +0.00000000000000000e+0000
fistp quotient
quotient 31722986997495
st0 +0.00000000000000000e+0000
st1 +0.00000000000000000e+0000
mov eax,dword ptr base
eax 3428281556
mul dword ptr base[4]
eax 2564756084
edx 54106
mov ebx,eax
ebx 2564756084
mov eax,dword ptr baseAux
eax 2465009237
mul dword ptr base[4]
eax 4094767414
edx 7987
add ebx,eax
ebx 2364556202
mov eax,dword ptr base
eax 3428281556
mul dword ptr baseAux
eax 1228847716
edx 1967592561
add ebx,edx
ebx 37181467
mov ecx,eax
ecx 1228847716
mov eax,edi
eax 1632369907
mul dword ptr quotient[4]
eax 710933230
edx 2807
sub ebx,eax
ebx 3621215533
mov eax,esi
eax 127739
mul dword ptr quotient
eax 3484963373
edx 10663
sub ebx,eax
ebx 136252160
mov eax,edi
eax 1632369907
mul dword ptr quotient
eax 1969995893
edx 136272280
sub ecx,eax
ecx 3553819119
sbb ebx,edx
ebx 4294947175
js add_prime3
EBX negative -> Jump done
add_prime3:
add ecx,edi
ecx 891221730
adc ebx,esi
ebx 107619
cont3:
mov dword ptr base,ecx
base 59778246047458
mov dword ptr base[4],ebx
base 462220976649954
Please let me know exactly what step differs. Maybe your processor
have a floating point bug (there are several versions of Intel
processor with these problems).
In my case I was running in a Celeron 333 MHz, but I checked it on a
Celeron 566 MHz, an AMD K6/2 400 MHz and an AMD Duron 950 MHz.
How do you get these?
mul dword ptr base[4]=
EDX:EAX <- EAX * base[4]
3428281556 * 13918 = 47714822696408 = 2b65790eadd8
EAX= 790eadd8 = 2031005144
EDX= 2b65 = 11109
and this is what I get.
> Please let me know exactly what step differs. Maybe your processor
> have a floating point bug (there are several versions of Intel
> processor with these problems).
The problem seems not in the fpu (mul should be performed in the integer
unit).
Anyway, I ran the program in a P4 and a P2, I got exactly the same result.
Thank you very much
Cristiano
I got these numbers because I copied wrong the instruction: please
replace base[4] by baseAux[4]. The code I gave in my first post does
not work also because I copied the wrong part of the program (it
happens sometimes :-( ).
Please use the following code instead:
fild base
fild baseAux
fmul
fild prime
fdiv
fistp quotient
mov eax,dword ptr base
mul dword ptr baseAux[4]
mov ebx,eax
mov eax,dword ptr baseAux
mul dword ptr base[4]
add ebx,eax
mov eax,dword ptr base
mul dword ptr baseAux
add ebx,edx
mov ecx,eax
mov eax,edi
mul dword ptr quotient[4]
sub ebx,eax
mov eax,esi
mul dword ptr quotient
sub ebx,eax
mov eax,edi
mul dword ptr quotient
sub ecx,eax
sbb ebx,edx
js add_prime3
sub ecx,edi
sbb ebx,esi
jnc cont3
add_prime3:
add ecx,edi
adc ebx,esi
cont3:
mov dword ptr base,ecx
mov dword ptr base[4],ebx
>
> Thank you very much
> Cristiano
Sorry for the inconvenience and best regards,
Good!
As long as the modulus > base and baseAux, the code works also for 62-bit
numbers. And it is very fast.
Thank you
Cristiano
Hi maybe this formula for mod helps you:
z = x mod y is the same as:
z = x – (x div y) * y