x^y mod n

Cristiano

unread,

Aug 15, 2002, 7:02:27 AM8/15/02

to

I should implement a *fast* modular exponentiation with the 64-bit
unsigned integers.
My compiler can handle 64-bit integers, that is I can do "unsigned
__int64 A=0x123456789abcdef0".

I could use the repeated square-and-multiply algorithms but when I do
(A*A) % n I can get an overflow because A is a 64-bit integer.

Now I'm using miracl library, but somebody knows a *faster* way?

Thanks
Cristiano

Phil Carmody

unread,

Aug 15, 2002, 9:25:18 AM8/15/02

to

If you're only using ~62 bits and you're on an x86, or if you're only using ~51 bits and you're on a non-x86, then there's a trick you can use for speeding up the operation.

Do either of these apply?

Method:
Create a*b's top bits in the FP unit, and its bottom bits in the int unit.
Multiply the approximate product by the reciprocal of n.
Round to integer.
In the int unit, multiply that rounded value by n.
Subtract that from the first integer result you obtained.

Phil

nospam

unread,

Aug 15, 2002, 9:52:26 AM8/15/02

to

A*A mod n=
(a1*2^32+a2)^2 mod n=
a1^2*2^64+2*a1*2^32*a2+a2^2 mod n=
[ (a1^2 mod n)*(2^64 mod n)+(2*a1*2^32*a2 mod n)+(a2^2 mod n) ] mod n

"Cristiano" <cristi...@NSquipo.it> wrote in message
news:7TL69.213426$Jj7.4...@news1.tin.it...

Tom St Denis

unread,

Aug 15, 2002, 10:53:45 AM8/15/02

to

"Cristiano" <cristi...@NSquipo.it> wrote in message news:<7TL69.213426$Jj7.4...@news1.tin.it>...

> I should implement a *fast* modular exponentiation with the 64-bit
> unsigned integers.
> My compiler can handle 64-bit integers, that is I can do "unsigned
> __int64 A=0x123456789abcdef0".
>
> I could use the repeated square-and-multiply algorithms but when I do
> (A*A) % n I can get an overflow because A is a 64-bit integer.

Which is why you'd put only 32-bits in the variable [um duh?]

> Now I'm using miracl library, but somebody knows a *faster* way?

miracl is in fact a very fast library. So is GMP. There are of
course other ways todo x^y mod z. For example, if you know y belongs
to h [read a number theory text to find out what that means] and h has
factors q1,q2,q3,...qN you could actually perform the exponentiation
in the smaller sub-groups and use CRT to combine the results.

That is, for example, how RSA is commonly sped up.

There are other tricks as well. If you know the range of y is small
[but not too small] you could use precomputed tables. For example, if
y is a 256-bit quantity you could use 32 lookup tables. Then a
modular exponentiation would require only 31 modular multiplications
[instead of the on average 384 required].

Then there are other wierd ideas like vector chains etc... e.g. the
shortest chain that adds and/or multiplies to the value you want
[mult-square is a derivative of this method].

Tom

Cristiano

unread,

Aug 15, 2002, 12:04:18 PM8/15/02

to

nospam wrote:
> A*A mod n=
> (a1*2^32+a2)^2 mod n=
> a1^2*2^64+2*a1*2^32*a2+a2^2 mod n=
> [ (a1^2 mod n)*(2^64 mod n)+(2*a1*2^32*a2 mod n)+(a2^2 mod n) ] mod n

Seems good. But for 2^64 mod n? 2^64 is a 65-bit number.

Thank you
Cristiano

Cristiano

unread,

Aug 15, 2002, 12:04:19 PM8/15/02

to

Phil Carmody wrote:
> Cristiano wrote:
>>
>> I should implement a *fast* modular exponentiation with the 64-bit
>> unsigned integers.
>> My compiler can handle 64-bit integers, that is I can do "unsigned
>> __int64 A=0x123456789abcdef0".
>>
>> I could use the repeated square-and-multiply algorithms but when I do
>> (A*A) % n I can get an overflow because A is a 64-bit integer.
>>
>> Now I'm using miracl library, but somebody knows a *faster* way?
>
> If you're only using ~62 bits and you're on an x86, or if you're only
> using ~51 bits and you're on a non-x86, then there's a trick you can
> use for speeding up the operation.
>
> Do either of these apply?

Yes, I'm lucky :-) I'm using 50-bit numbers and I'm on P4.

> Method:
> Create a*b's top bits in the FP unit, and its bottom bits in the int
> unit. Multiply the approximate product by the reciprocal of n.
> Round to integer.
> In the int unit, multiply that rounded value by n.
> Subtract that from the first integer result you obtained.

I know just a bit of assembly. Do you have the code for x^y mod n?

Thank you
Cristiano

Cristiano

unread,

Aug 15, 2002, 12:17:34 PM8/15/02

to

I was sleeping.
2^64 mod n = [(2^32 mod n) * (2^32 mod n)] mod n.
It is not very fast but I can try...

Thank you
Cristiano

Colin Andrew Percival

unread,

Aug 15, 2002, 12:19:00 PM8/15/02

to

>> A*A mod n=
>> (a1*2^32+a2)^2 mod n=
>> a1^2*2^64+2*a1*2^32*a2+a2^2 mod n=
>> [ (a1^2 mod n)*(2^64 mod n)+(2*a1*2^32*a2 mod n)+(a2^2 mod n) ] mod n

> Seems good.

Unless n^3<2^128, the first multiplication will overflow a 64 bit
register.

Colin Percival

Cristiano

unread,

Aug 15, 2002, 2:09:57 PM8/15/02

to

And I don't know how to calculate 2^64 mod n :-(

Cristiano

Edward Elliott

unread,

Aug 15, 2002, 2:22:51 PM8/15/02

to

Cristiano wrote:
> Seems good. But for 2^64 mod n? 2^64 is a 65-bit number.

So do ((2^64 - 1) mod n) + 1

--

Edward Elliott

Colin Andrew Percival

unread,

Aug 15, 2002, 2:31:25 PM8/15/02

to

Cristiano <cristi...@nsquipo.it> wrote:
> And I don't know how to calculate 2^64 mod n :-(

Calculate (2^64-n) mod n.

Colin Percival

Phil Carmody

unread,

Aug 15, 2002, 2:13:41 PM8/15/02

to

Cristiano wrote:
> > Method:
> > Create a*b's top bits in the FP unit, and its bottom bits in the int
> > unit. Multiply the approximate product by the reciprocal of n.
> > Round to integer.
> > In the int unit, multiply that rounded value by n.
> > Subtract that from the first integer result you obtained.
>
> I know just a bit of assembly. Do you have the code for x^y mod n?

No need for assembly, it can all be done in portable C.
The above is a mulmod, replace a*b with a*a for a sqmod, and just use an ordinary L2R or R2L exponentiation. R2L is more pipelineable, IIRC.

Phil

Phil Carmody

unread,

Aug 15, 2002, 2:22:03 PM8/15/02

to

At the moment you can't even do _anything_ mod a number larger than 32 bit.
This route is a dead dog.

Here's some code, that requires some macros for token concatenation and shit like that, and also somemacros to chose which rounding mode to use, and whether errors are permissible in either the + or - direction. Only the final mulmod needs to be exact, as long as you've got a spare bit.

//------------------------------------------------------
// mod104_52_pr |
//---------------+
// xyl=low 64 bits of 'xy', a double-word integer
// xyf=approx value of xy as a float
// p =prime
// prf=approx 1.0/prime
//
static INLINE i52 CAT(mod104_52_pr,MATH52_SUFFIX) (int64 xyl, double xyf, i52 p, double prf)
{
int64 rounded=(int64)p*(int64)(xyf*prf); /* p*(xy/p) */
RET_SUB52_METHOD(xyl,rounded,p);
}

//------------------------------------------------------
// sqmod52_pr |
//---------------+
// x =integer
// p =prime
// prf=approx 1.0/prime
//
static INLINE i52 CAT(sqmod52_pr,MATH52_SUFFIX) (i52 x, i52 p, double prf)
{
return CAT(mod104_52_pr,MATH52_SUFFIX) ((int64)x*x, (double)x*x, p, prf);
}

//------------------------------------------------------
// mulmod52_pr |
//---------------+
// x,y=integer
// p =prime
// prf=approx 1.0/prime
//
static INLINE i52 CAT(mulmod52_pr,MATH52_SUFFIX) (i52 x, i52 y, i52 p, double prf)
{
return CAT(mod104_52_pr,MATH52_SUFFIX) ((int64)x*y, (double)x*y, p, prf);
}

Cristiano

unread,

Aug 15, 2002, 5:14:25 PM8/15/02

to

Edward Elliott wrote:
> Cristiano wrote:
>> Seems good. But for 2^64 mod n? 2^64 is a 65-bit number.
>
> So do ((2^64 - 1) mod n) + 1

There is also a problem for (a1^2 mod n)*(2^64 mod n) because n is a
50-bit number.

Cristiano

unread,

Aug 15, 2002, 5:14:26 PM8/15/02

to

Phil Carmody wrote:
>
> Here's some code, that requires some macros for token concatenation
> and shit like that, and also somemacros to chose which rounding mode
> to use, and whether errors are permissible in either the + or -
> direction. Only the final mulmod needs to be exact, as long as you've
> got a spare bit.

[...]

I tried to implement your functions in this way:

UI64 mod104_52_pr(UI64 xyl, double xyf, UI64 p, double prf)
{
UI64 rounded=(UI64)p*(UI64)(xyf*prf); /* p*(xy/p) */
return (xyl-rounded)%p;
}

UI64 mulmod52_pr(UI64 x, UI64 y, UI64 p)
{
return mod104_52_pr(x*y,(double)x*y, p, 1./double(p));
}

UI64 is unsigned int64. It doesn't work. Where am I wrong?
*Must* p be a prime or a generic number? In my program n (the modulus)
is always a 50-bit odd number.

Cristiano

Phil Carmody

unread,

Aug 15, 2002, 6:36:15 PM8/15/02

to

Cristiano wrote:
>
> Phil Carmody wrote:
> >
> > Here's some code, that requires some macros for token concatenation
> > and shit like that, and also somemacros to chose which rounding mode
> > to use, and whether errors are permissible in either the + or -
> > direction. Only the final mulmod needs to be exact, as long as you've
> > got a spare bit.
> [...]
>
> I tried to implement your functions in this way:
>
> UI64 mod104_52_pr(UI64 xyl, double xyf, UI64 p, double prf)
> {
> UI64 rounded=(UI64)p*(UI64)(xyf*prf); /* p*(xy/p) */
> return (xyl-rounded)%p;
> }

You don't want that %p, because if the thing ever goes negative, the unsigned value will go haywire with the %. A signed value would work though.

> UI64 mulmod52_pr(UI64 x, UI64 y, UI64 p)
> {
> return mod104_52_pr(x*y,(double)x*y, p, 1./double(p));
> }
>
> UI64 is unsigned int64. It doesn't work. Where am I wrong?
> *Must* p be a prime or a generic number? In my program n (the modulus)
> is always a 50-bit odd number.

It should work as long as nothing ever goes negative.
Here are the macros I use that permit an error by -p, +p, +/-p or no error.
I've just noticed, this version of my code doesn't muck around with the rounding options at all, so I guess that it works out of the box on the assumtion of 'round to nearest' (which will round up 50% of the time, and cause the above value to be negative).
Assume the previous code was in file math52.tmplh

//
// SOLO MOD104_52, SQMOD, MULMOD
//

// Sometimes we can be out by +/- p.
// Perform a sloppy subtraction if we know we can get away with it.
// e.g. if the next thing to do is a square, then we don't care about being 'exact-p'
// if the next thing to do is +small, then we don't mind being 'exact+p'
//
#define RET_SUB52_BADPM(a,b,p) return (a)-(b)
#define RET_SUB52_BADP(a,b,p) return (a)-(b)+((b)>(a)?(p):0)
#define RET_SUB52_BADM(a,b,p) { i52 stab=a-b; return(stab>p)?stab-p:stab; }
#define RET_SUB52(a,b,p) { i52 stab=a-b; if(stab<0) return stab+p; if(stab>=p) return stab-p; re\
turn stab; }

#define RET_SUB52_METHOD(a,b,c) RET_SUB52_BADPM(a,b,c)
#define MATH52_SUFFIX _badpm
#include "math52.tmplh"
#undef MATH52_SUFFIX
#undef RET_SUB52_METHOD

#define RET_SUB52_METHOD(a,b,c) RET_SUB52_BADM(a,b,c)
#define MATH52_SUFFIX _badm
#include "math52.tmplh"
#undef MATH52_SUFFIX
#undef RET_SUB52_METHOD

#define RET_SUB52_METHOD(a,b,c) RET_SUB52_BADP(a,b,c)
#define MATH52_SUFFIX _badp
#include "math52.tmplh"
#undef MATH52_SUFFIX
#undef RET_SUB52_METHOD

#define RET_SUB52_METHOD(a,b,c) RET_SUB52(a,b,c)
#define MATH52_SUFFIX
#include "math52.tmplh"
#undef MATH52_SUFFIX
#undef RET_SUB52_METHOD

Phil

Cristiano

unread,

Aug 16, 2002, 10:54:27 AM8/16/02

to

Phil Carmody wrote:
> It should work as long as nothing ever goes negative.
> Here are the macros I use that permit an error by -p, +p, +/-p or no
> error. I've just noticed, this version of my code doesn't muck around
> with the rounding options at all, so I guess that it works out of the
> box on the assumtion of 'round to nearest' (which will round up 50%
> of the time, and cause the above value to be negative). Assume the
> previous code was in file math52.tmplh

[...]

I tried your macros but sometime I get a wrong result even with 48-bit
numbers.
Your code works with you? I'm not able to find a SUB_METHOD which works.

Cristiano

Phil Carmody

unread,

Aug 16, 2002, 10:50:19 AM8/16/02

to

I've never taken them above ~44 bits, so I can't make any promises beyond that. I used it in the pre-siever I used for last year's twin-prime hunt, and I don't think I sieved much beyond 20T. I was, however, using an Alpha, not an x86 - maybe there's some slight difference between their behaviour.

Can you give an example of it failing?

Phil

Cristiano

unread,

Aug 16, 2002, 1:39:34 PM8/16/02

to

Phil Carmody wrote:
> Cristiano wrote:

>> I tried your macros but sometime I get a wrong result even with 48-

>> bit numbers.

>> Your code works with you? I'm not able to find a SUB_METHOD which
>> works.
>
> I've never taken them above ~44 bits, so I can't make any promises
> beyond that. I used it in the pre-siever I used for last year's twin-
> prime hunt, and I don't think I sieved much beyond 20T. I was,
> however, using an Alpha, not an x86 - maybe there's some slight
> difference between their behaviour.
>
> Can you give an example of it failing?

x= D55 71CBD705 bits: 44 14660632565509
y= 5C0 9F322B7E bits: 43 6324862724990
n= 7 F125CA98 bits: 35 34110556824

x * y mod n= 6 D150A0DE bits: 35 29281525982

I get fffffffee02ad646.

x= 9EB 12739CEB bits: 44 10905231531243
y= 966 5A37B5B7 bits: 44 10335204914615
n= D03 C7044180 bits: 44 14309875007872

x * y mod n= 2 0EF65CFD bits: 34 8840961277

I get d05d5fa9e7d.

Cristiano

Phil Carmody

unread,

Aug 16, 2002, 3:56:10 PM8/16/02

to

Cristiano wrote:
>
> Phil Carmody wrote:
> > Cristiano wrote:
> >> I tried your macros but sometime I get a wrong result even with 48-
> >> bit numbers.
> >> Your code works with you? I'm not able to find a SUB_METHOD which
> >> works.
> >
> > I've never taken them above ~44 bits, so I can't make any promises
> > beyond that. I used it in the pre-siever I used for last year's twin-
> > prime hunt, and I don't think I sieved much beyond 20T. I was,
> > however, using an Alpha, not an x86 - maybe there's some slight
> > difference between their behaviour.
> >
> > Can you give an example of it failing?
>
> x= D55 71CBD705 bits: 44 14660632565509
> y= 5C0 9F322B7E bits: 43 6324862724990
> n= 7 F125CA98 bits: 35 34110556824
>
> x * y mod n= 6 D150A0DE bits: 35 29281525982
>
> I get fffffffee02ad646.

Represented as hex, perhaps. However, the _value_ that represents as a _signed_ 64 bit integer is -4829030842
34110556824-4829030842 = 29281525982

The subtract macros will only work if you are treating the values as signed.

The only way you can get away with unsigned values is if you set the rounding mode to always round down, _and_ you sacrifice a chicken to the gods of IEEE754 each time you use it.

Signed,
Phil

Dario Alpern

unread,

Aug 17, 2002, 5:01:54 PM8/17/02

to

"Cristiano" <cristi...@NSquipo.it> wrote in message news:<7iQ69.214191$Jj7.4...@news1.tin.it>...

Cristiano,

Maybe this piece of code can help you. I'm using it in order to find
factorizations of number near googolplex at:

http://www.alpertron.com.ar/GOOGOL.HTM

fild base
fild baseAux
fmul
fild prime
fdiv
fistp quotient
mov eax,dword ptr base
mul dword ptr base[4]
lea ebx,[eax+eax]
mov eax,dword ptr base
mul eax
add ebx,edx
mov ecx,eax
mov eax,edi
mul dword ptr quotient[4]
sub ebx,eax
mov eax,esi
mul dword ptr quotient
sub ebx,eax
mov eax,edi
mul dword ptr quotient
sub ecx,eax
sbb ebx,edx
js add_prime1
sub ecx,edi
sbb ebx,esi
jnc cont1
add_prime1:
add ecx,edi
adc ebx,esi
cont1:
mov dword ptr base,ecx
mov dword ptr base[4],ebx

In this case prime is the modulus, base and baseAux are the factors
and quotient is a temporary variable (all variables are quadwords). I
think it should work for 52 bits. Also ESI:EDI = prime.

If you know a faster method (that works of course) please let me know.

Best regards,

Dario Alejandro Alpern
Buenos Aires - Argentina
http://www.alpertron.com.ar/ENGLISH.HTM

Cristiano

unread,

Aug 20, 2002, 4:39:22 AM8/20/02

to

Dario Alpern wrote:
>
> Maybe this piece of code can help you. I'm using it in order to find
> factorizations of number near googolplex at:

[...]

> In this case prime is the modulus, base and baseAux are the factors
> and quotient is a temporary variable (all variables are quadwords). I
> think it should work for 52 bits. Also ESI:EDI = prime.

If the result is in "base", I get a result totally wrong even with 40-bit
numbers. Do you know why?

> If you know a faster method (that works of course) please let me know.

I think the method of Phil is very fast but it doesn't work also with round
mode set to truncate:

unsigned short cw; // FPU control word
asm FNSTCW cw; // Store the control word in cw
cw|=0xC00; // Set bit 10 and 11 (RC flag)
asm FLDCW cw; // Load cw in the FPU

I haven't tried with the chicken to the gods of IEEE754... :-)

Cristiano

Phil Carmody

unread,

Aug 20, 2002, 10:01:02 AM8/20/02

to

After the subract of the two integer values, the result (as a signed value) must, modulo the base, be equal to the desired value. I fail to see how rounding can account for a discrepancy of more than 1*base n absolute value, as that would mean the rounded value would have to have more than 1ulp error.

Can you post an example of it either being wrong by a value not a multiple of the base, or by more than 1*base in absolute difference, please?

Phil

Cristiano

unread,

Aug 20, 2002, 3:45:03 PM8/20/02

to

Phil Carmody wrote:
> After the subract of the two integer values, the result (as a signed

> value) must, modulo the base, be equal to the desired value. [...]

You're right, Phil. I missed "modulo the base".

The following version works:

__int64 mulmod(__int64 x, __int64 y, __int64 p)
{
__int64 xyl = x * y;
FPU_CW(truncate);
double xyf = double(x) * y / double(p);
__int64 r = p * __int64(xyf); // CRITICAL
RESTORE_fpu_CW();
return (xyl - r) % p;
}

*but* if p is too small (say 34, 39 bits) the line "CRITICAL" generates an
exception: invalid floating point, because xyf is > 2^63 (we are using
*signed* int64) or, if we use *unsigned* int64 xyf cannot exceed 2^64.

Example:
x= 6D3D0 FDCD8098 bits: 51 1,92174442502569E15
y= A29CB 6D98B592 bits: 52 2,86070346093096E15
p= 6D 9DDB85F5 bits: 39 470799844853
x * y mod p= 5A E1BBC237 bits: 39 390334235191
xyf= 1.16770236604736e19
2^63= 9.22e18

And I need p from 2 to 2^49.

I'd like to hear some comment about my problems with the Dario's code...

Cristiano

Phil Carmody

unread,

Aug 21, 2002, 4:38:11 AM8/21/02

to

Cristiano wrote:
>
> Phil Carmody wrote:
> > After the subract of the two integer values, the result (as a signed
> > value) must, modulo the base, be equal to the desired value. [...]
>
> You're right, Phil. I missed "modulo the base".
>
> The following version works:
>
> __int64 mulmod(__int64 x, __int64 y, __int64 p)
> {
> __int64 xyl = x * y;
> FPU_CW(truncate);
> double xyf = double(x) * y / double(p);
> __int64 r = p * __int64(xyf); // CRITICAL
> RESTORE_fpu_CW();

The algorithm can trivially be made to work whatever the rounding mode, so don't change it repeatedly.

> return (xyl - r) % p;

Avoid % for speed. You know that you need to either add or subtract p, at most.

> }
>
> *but* if p is too small (say 34, 39 bits) the line "CRITICAL" generates an
> exception: invalid floating point, because xyf is > 2^63 (we are using
> *signed* int64) or, if we use *unsigned* int64 xyf cannot exceed 2^64.

If x<p and y<p, then x*y/p < p, so can't be > 2^63.

> Example:
> x= 6D3D0 FDCD8098 bits: 51 1,92174442502569E15
> y= A29CB 6D98B592 bits: 52 2,86070346093096E15

Well that's the problem. Start off within [0,p-1], or [-p/2, p/2], or whatever, and you'll be fine.

> p= 6D 9DDB85F5 bits: 39 470799844853
> x * y mod p= 5A E1BBC237 bits: 39 390334235191
> xyf= 1.16770236604736e19
> 2^63= 9.22e18
>
> And I need p from 2 to 2^49.
>
> I'd like to hear some comment about my problems with the Dario's code...

I'm fairly sure Dario has implemented the same algorithm that I've described. However, with a 1980s 32bit processor the 64-bit multiplies take a lot more lines of code that with virtual 64bit values, and so it looks more obfuscated.

Phi

Dario Alpern

unread,

Aug 21, 2002, 8:34:21 AM8/21/02

to

"Cristiano" <cristi...@nsquipo.it> wrote in message news:<3%w89.5569$xi1.2...@news2.tin.it>...

In this case please write the factors that make the code fail. I'm
using it and it works OK for me. Anyway, if Phil's code is faster I
will change my code by Phil's.

Cristiano

unread,

Aug 21, 2002, 4:35:43 PM8/21/02

to

Dario Alpern wrote:

> "Cristiano" wrote:
>
>> I'd like to hear some comment about my problems with the Dario's
>> code...
>>
>> Cristiano
>
> In this case please write the factors that make the code fail.

Just generate 3 random numbers of 34-54 bits and you'll have the factors
which fail.
Anyway, this is an example:

x= 695EE 88DE5BC5 bits: 51 1,85370159129287E15
y= 359DA 32275851 bits: 50 943218609313873
p= 10F F9847EB9 bits: 41 1168122347193
x * y mod p= 26 5FCEE2E4 bits: 38 164816151268
x ^ y mod p= FE 7F83389C bits: 40 1093060999324

with your code I get: 50434cbcccda5e37.

I also tried to calculate base ^ baseAux (mod prime) with your code, but the
result is wrong in this case too.
What should your code calculate?

> I'm using it and it works OK for me.

I used your code in this way (UI64 is an unsigned 64-bit integer):

UI64 mulmod(UI64 base,UI64 baseAux,UI64 prime)
{
UI64 quotient;
asm{
YOUR CODE
}
return base;
}

Is that code right?

> Anyway, if Phil's code is faster I will change my code by Phil's.

I'm pretty sure that Phil's code is faster, but it seems that it has many
limitations. I'd like to compare his code with yours. Could you help me,
please?

Cristiano

unread,

Aug 21, 2002, 4:35:44 PM8/21/02

to

Phil Carmody wrote:
> Cristiano wrote:
>>
>> Phil Carmody wrote:
>>> After the subract of the two integer values, the result (as a signed
>>> value) must, modulo the base, be equal to the desired value. [...]
>>
>> You're right, Phil. I missed "modulo the base".
>>
>> The following version works:
>>
>> __int64 mulmod(__int64 x, __int64 y, __int64 p)
>> {
>> __int64 xyl = x * y;
>> FPU_CW(truncate);
>> double xyf = double(x) * y / double(p);
>> __int64 r = p * __int64(xyf); // CRITICAL
>> RESTORE_fpu_CW();
>
> The algorithm can trivially be made to work whatever the rounding
> mode, so don't change it repeatedly.

How? If I don't change the rounding mode the code doesn't work.

>> return (xyl - r) % p;
>
> Avoid % for speed. You know that you need to either add or subtract
> p, at most.

Agreed.

>> }
>>
>> *but* if p is too small (say 34, 39 bits) the line "CRITICAL"
>> generates an exception: invalid floating point, because xyf is >
>> 2^63 (we are using *signed* int64) or, if we use *unsigned* int64
>> xyf cannot exceed 2^64.
>
> If x<p and y<p, then x*y/p < p, so can't be > 2^63.

But I can have also x>p in my applications.

>> I'd like to hear some comment about my problems with the Dario's
>> code...
>
> I'm fairly sure Dario has implemented the same algorithm that I've
> described. However, with a 1980s 32bit processor the 64-bit
> multiplies take a lot more lines of code that with virtual 64bit
> values, and so it looks more obfuscated.

Perhaps you're right. But I think the code seems more obfuscated because it
implements also your RET_SUB52_METHOD(a,b,c).

Cristiano

Phil Carmody

unread,

Aug 22, 2002, 6:34:37 AM8/22/02

to

Cristiano wrote:
> How? If I don't change the rounding mode the code doesn't work.

Yet no example provided.

Meet me half way - show me it not working, show me the intermediates and let me know the rounding mode in force.

Then I can help diagnose your problem.

The technique works for Paul Jobling in NewPGen; it works for Peter Montgomery and Francois Morain in the appendix of one of his (Morain's) papers; it works for Yves Gallot in Proth; it works for Jim Fougeron in GFNSieve; it works for David Underbakke in AthGFNSieve; and it works for me in several programs.

Basically it's nothing more than Barrett's method, but optimised for single-word operands, and a FP unit that can calculate the high part of a product.

Phil

Dario Alpern

unread,

Aug 22, 2002, 8:30:15 AM8/22/02

to

"Cristiano" <cristi...@nsquipo.it> wrote in message news:<zQS89.11990$xi1.6...@news2.tin.it>...

Cristiano,

Please notice that the routine computes the modular multiplication of
base * baseAux mod prime and that prime should be loaded also in
ESI:EDI.

Also the prime should be greater than base and baseAux.

Do you have another example given these constraints?

Cristiano

unread,

Aug 22, 2002, 1:34:43 PM8/22/02

to

Phil Carmody wrote:
> Cristiano wrote:
>> How? If I don't change the rounding mode the code doesn't work.
>
> Yet no example provided.
>
> Meet me half way - show me it not working, show me the intermediates
> and let me know the rounding mode in force.
>
> Then I can help diagnose your problem.

Thank you.
The code didn't work because p was < x or y. When p>x and y the code works.

> The technique works for Paul Jobling in NewPGen; it works for Peter
> Montgomery and Francois Morain in the appendix of one of his
> (Morain's) papers; it works for Yves Gallot in Proth; it works for
> Jim Fougeron in GFNSieve; it works for David Underbakke in
> AthGFNSieve; and it works for me in several programs.

But you are better than me :-)

Cristiano

unread,

Aug 22, 2002, 1:34:44 PM8/22/02

to

Dario Alpern wrote:
>
> Please notice that the routine computes the modular multiplication of
> base * baseAux mod prime and that prime should be loaded also in
> ESI:EDI.
>
> Also the prime should be greater than base and baseAux.
>
> Do you have another example given these constraints?

base= 365E CC576CD4 bits: 46 59780783107284
baseAux= 108C9 92ED0E55 bits: 49 291136823168597
prime= 1F2FB 614BFCF3 bits: 49 548636459793651

base * baseAux (mod prime)= 1A463 351EF6E2 bits: 49 462220976649954

I get 93031b44f23bcc0e.

I load ESI:EDI in this way:

UI64 mulmod(UI64 base,UI64 baseAux,UI64 prime)
{
UI64 quotient;
asm{

mov edi,dword ptr prime // lsw
mov esi,dword ptr prime[4] // msw
fild base // your code
fild baseAux
...

Cristiano

Dario Alpern

unread,

Aug 23, 2002, 9:13:34 AM8/23/02

to

"Cristiano" <cristi...@nsquipo.it> wrote in message news:<Ug999.2271$pX1....@news2.tin.it>...

This example also works for me. I will show you all steps, so this
post is long.
All numbers are in _decimal_ so you can perform your calculations in
any calculator. Only changed registers/variables are shown.

base 59780783107284
baseAux 291136823168597
prime 548636459793651
quotient 0
eax 0
ebx 0
ecx 0
edx 0
esi 127739
edi 1632369907
st0 +0.00000000000000000e+0000
st1 +0.00000000000000000e+0000

fild base

st0 +5.97807831072840000e+0013

fild baseAux

st0 +2.91136823168597000e+0014
st1 +5.97807831072840000e+0013

fmul

st0 +1.74043872803855924e+0028
st1 +0.00000000000000000e+0000

fild prime

st0 +5.48636459793651000e+0014
st1 +1.74043872803855924e+0028

fdiv

st0 +3.17229869974948437e+0013
st1 +0.00000000000000000e+0000

fistp quotient

quotient 31722986997495
st0 +0.00000000000000000e+0000
st1 +0.00000000000000000e+0000

mov eax,dword ptr base

eax 3428281556

mul dword ptr base[4]

eax 2564756084
edx 54106

mov ebx,eax

ebx 2564756084

mov eax,dword ptr baseAux

eax 2465009237

mul dword ptr base[4]

eax 4094767414
edx 7987

add ebx,eax

ebx 2364556202

mov eax,dword ptr base

eax 3428281556

mul dword ptr baseAux

eax 1228847716
edx 1967592561

add ebx,edx

ebx 37181467

mov ecx,eax

ecx 1228847716

mov eax,edi

eax 1632369907

mul dword ptr quotient[4]

eax 710933230
edx 2807

sub ebx,eax

ebx 3621215533

mov eax,esi

eax 127739

mul dword ptr quotient

eax 3484963373
edx 10663

sub ebx,eax

ebx 136252160

mov eax,edi

eax 1632369907

mul dword ptr quotient

eax 1969995893
edx 136272280

sub ecx,eax

ecx 3553819119

sbb ebx,edx

ebx 4294947175

js add_prime3

EBX negative -> Jump done

add_prime3:
add ecx,edi

ecx 891221730

adc ebx,esi

ebx 107619

cont3:
mov dword ptr base,ecx

base 59778246047458

mov dword ptr base[4],ebx

base 462220976649954

Please let me know exactly what step differs. Maybe your processor
have a floating point bug (there are several versions of Intel
processor with these problems).

In my case I was running in a Celeron 333 MHz, but I checked it on a
Celeron 566 MHz, an AMD K6/2 400 MHz and an AMD Duron 950 MHz.

Cristiano

unread,

Aug 24, 2002, 5:14:14 AM8/24/02

to

How do you get these?

mul dword ptr base[4]=
EDX:EAX <- EAX * base[4]

3428281556 * 13918 = 47714822696408 = 2b65790eadd8
EAX= 790eadd8 = 2031005144
EDX= 2b65 = 11109

and this is what I get.

> Please let me know exactly what step differs. Maybe your processor
> have a floating point bug (there are several versions of Intel
> processor with these problems).

The problem seems not in the fpu (mul should be performed in the integer
unit).
Anyway, I ran the program in a P4 and a P2, I got exactly the same result.

Thank you very much
Cristiano

Dario Alpern

unread,

Aug 24, 2002, 8:17:05 PM8/24/02

to

"Cristiano" <cristi...@nsquipo.it> wrote in message news:<G7I99.9871$pX1.3...@news2.tin.it>...

> Dario Alpern wrote:
> >
> > All numbers are in _decimal_ so you can perform your calculations in
> > any calculator. Only changed registers/variables are shown.
> >

[snip]

> >
> > mul dword ptr base[4]
> >
> > eax 2564756084
> > edx 54106
>
> How do you get these?
>
> mul dword ptr base[4]=
> EDX:EAX <- EAX * base[4]
>
> 3428281556 * 13918 = 47714822696408 = 2b65790eadd8
> EAX= 790eadd8 = 2031005144
> EDX= 2b65 = 11109
>
> and this is what I get.
>

I got these numbers because I copied wrong the instruction: please
replace base[4] by baseAux[4]. The code I gave in my first post does
not work also because I copied the wrong part of the program (it
happens sometimes :-( ).

Please use the following code instead:

fild base
fild baseAux
fmul
fild prime
fdiv
fistp quotient
mov eax,dword ptr base

mul dword ptr baseAux[4]
mov ebx,eax
mov eax,dword ptr baseAux

mul dword ptr base[4]

add ebx,eax
mov eax,dword ptr base
mul dword ptr baseAux

add ebx,edx
mov ecx,eax
mov eax,edi

mul dword ptr quotient[4]

sub ebx,eax
mov eax,esi
mul dword ptr quotient
sub ebx,eax
mov eax,edi
mul dword ptr quotient
sub ecx,eax
sbb ebx,edx

js add_prime3
sub ecx,edi
sbb ebx,esi
jnc cont3
add_prime3:
add ecx,edi
adc ebx,esi

cont3:
mov dword ptr base,ecx

mov dword ptr base[4],ebx

>

> Thank you very much
> Cristiano

Sorry for the inconvenience and best regards,

Cristiano

unread,

Aug 25, 2002, 10:09:39 AM8/25/02

to

Dario Alpern wrote:
>
> Please use the following code instead:

[...]

Good!
As long as the modulus > base and baseAux, the code works also for 62-bit
numbers. And it is very fast.

Thank you
Cristiano

skybuck

unread,

Aug 25, 2002, 10:34:13 PM8/25/02

to

"Cristiano" <cristi...@NSquipo.it> wrote in message news:<7TL69.213426$Jj7.4...@news1.tin.it>...

> I should implement a *fast* modular exponentiation with the 64-bit
> unsigned integers.
> My compiler can handle 64-bit integers, that is I can do "unsigned
> __int64 A=0x123456789abcdef0".
>
> I could use the repeated square-and-multiply algorithms but when I do
> (A*A) % n I can get an overflow because A is a 64-bit integer.
>
> Now I'm using miracl library, but somebody knows a *faster* way?
>

> Thanks
> Cristiano

Hi maybe this formula for mod helps you:

z = x mod y is the same as:

z = x – (x div y) * y