
Apr 6, 1994, 5:31:16 AM4/6/94


I got to thinking some more, and wrote down the following ideas. I'm
posting it in the hopes it'll be useful to someone. The Newton's iteration
thing is, IMHO, really nifty.

More on finding multiplicative inverses

The general method for finding multiplicative inverses (the Extended

Euclidean Algorithm) involves repeated division. This is annoyingly

slow if you're working in extended precision. Fortunately, there are

some special cases where it can be made much faster.

The special cases are where the modulus is a power of a small prime.

This arises particularly in Montgomery's modular reduction algorithm.

Montgomery's algorithm is typically implemented using single-digit

inverses, but can be expressed using full-width inverses, which lets

you use a subquadratic multiplication algorithm.

(See "Comparison of three modular reduction functions", Antoon

Bosselaers, René Govaerts and Joos Vandewalle, _Advances in Cryptology,

Proc. Crypto '93_, Springer-Verlag, ISBN 3-540-57766-1, ISBN 0-387-57766-1)

The following can be derived (look up "valuations"), but I'm going to

just pull it out of a hat.

First, recall Newton's iteration for finding x = 1/A. Find the root

of f(x) = 1/x - A = 0. The equation to iterate is

	x = x - f(x)/f'(x)
	  = x - (x^-1 - A)/(-x^-2)
	  = x + x^2 * (x^-1 - A)
	  = x + x - A*x^2
	  = 2*x - A*x*x

This is usually applied in the real numbers, but it can be applied in

modular arithmetic as well. In particular, suppose you have an initial

approximation x, such that x*A == 1 (mod m). So write it as follows:

x*A == 1 + k*m

Substituting in the iteration, you get

	(2*x - A*x*x)*A == 2*x*A - (x*A)^2
	                == 2 + 2*k*m - (1+k*m)^2
	                == 2 + 2*k*m - 1 - 2*k*m - k^2*m^2
	                == 2-1 + 2*k*m-2*k*m - k^2*m^2
	                == 1 - k^2*m^2

... which is 1, mod m^2.

So one iteration of Newton's method squared the modulus to which the

approximation is accurate! This is particularly useful in finding

multiplicative inverses modulo 2^k, since x = 1 is an initial

approximation that is good to 1 significant bit. You can double the

number of bits each iteration.

Each iteration requires two multiplies, plus some adds, but only the

last iteration need be full-width. Other iterations are half the width

or less. Thus, the total work is less than that of two iterations, or

four multiplies of k bits.

Actually, there is a better starting estimate. For all odd x,

x*x == 1 (mod 8). Modulo 16, it's a bit more awkward, but if you

look at the pattern of inverses in binary:

	  x     x^-1
	0001    0001
	0011    1011
	0101    1101
	0111    0111
	1001    1001
	1011    0011
	1101    0101
	1111    1111

You can see that x^-1 is either x or x+8, and it's x+8 when the two

middle bits differ. So you can form your initial estimate thus:

x = ((A<<1 ^ A) & 4) << 1 ^ A;

which is good to 4 bits. Using either of these starting values, you

save the first few iterations. Then, working from there, you can

compute an inverse modulo 2^32 of any odd a:

unsigned long
inverse(unsigned long a)
{
    unsigned long x = a;

    assert((x*a & 0x7) == 1);
    x += x - a*x*x;
    assert((x*a & 0x3F) == 1);
    x += x - a*x*x;
    assert((x*a & 0xFFF) == 1);
    x += x - a*x*x;
    assert((x*a & 0xFFFFFF) == 1);
    x += x - a*x*x;
    assert((x*a & 0xFFFFFFFF) == 1);
    return x;
}

or:

unsigned long
inverse(unsigned long a)
{
    unsigned long x = ((a<<1 ^ a) & 4) << 1 ^ a;

    assert((x*a & 0xF) == 1);
    x += x - a*x*x;
    assert((x*a & 0xFF) == 1);
    x += x - a*x*x;
    assert((x*a & 0xFFFF) == 1);
    x += x - a*x*x;
    assert((x*a & 0xFFFFFFFF) == 1);
    return x;
}

This is considerably simpler than the Extended Euclidean algorithm,

and requires substantially fewer iterations. (See Knuth, section

4.5.3.) The average number of iterations needed to find an inverse

modulo n is 0.843 * ln(n) + 1.47, or 0.584 * log_2(n) + 1.47, so that's

10.8 iterations for 16 bits and 20.2 for 32. This method requires

log_2(log_2(n)) iterations, obviously asymptotically better!

I seem to recall that there is a division algorithm that uses

full-width multiplicative inverses like this, but I can't recall it.

You can't just compute B/A as B * 1/A modulo 2^k, where k is chosen

to be big enough, because remainders cause all sorts of problems.

Just for example, let B = 3*14+1 = 43 and A = 3. The desired

result is B/A = 14, and it can be seen that it will fit into

4 bits (modulo 16, that is), but trying to compute it that way produces:

1/3 == 11 (mod 16)

43 * 11 == (32 + 11) * 11

== 121

== 9

42 would come out exactly, so it can be used as-is for exact division,

but division with remainder is a different matter.

I have a rather complex algorithm for division in terms of

multiplication from Maurice Mignotte's _Mathematics For Computer

Algebra_ (Springer-Verlag, ISBN 0-387-97675-2, ISBN 3-540-97675-2),

but it doesn't look terribly practical. It reduces an n-bit division

to two n/2-bit divisions plus some multiplications.

This is also useful for doing trial division. There's a prime sieving

technique which replaces division by p with multiplication by p^-1,

modulo the word size b = 2^n. To find if a word is divisible by p, note

that the multiples of p are 0, p, ..., k*p, where k = floor((b-1)/p).

So multiplication by p^-1 will map multiples of p back onto that range.

All you have to do is multiply and compare with the (easily precomputed)

value floor((b-1)/p).

Assuming you're building these tables on the fly, a cost saving never

hurts.

Apr 6, 1994, 12:07:11 PM4/6/94


In article <1994Apr6.0...@mnemosyne.cs.du.edu> co...@nyx10.cs.du.edu (Colin Plumb) writes:

:I got to thinking some more, and wrote down the follwoing ideas. I'm

:posting it in the hopes it'll be useful to someone. The Newton's iteration

:thing is, IMHO, really nifty.

:

:More on finding multiplicative inverses

:

:The general method for finding multiplicative inverses (the Extended

:Euclidean Algorithm) involves repeated division. This is annoyingly

Negative. Use Lehmer's method (see Knuth Vol. 2); it involves

NO division at all; only shifts and add/subtract.

--

Bob Silverman

These are my opinions and not MITRE's.

Mitre Corporation, Bedford, MA 01730

"You can lead a horse's ass to knowledge, but you can't make him think"


Apr 6, 1994, 1:18:12 PM4/6/94


Colin Plumb (co...@nyx10.cs.du.edu) wrote:

> I got to thinking some more, and wrote down the follwoing ideas. I'm

> posting it in the hopes it'll be useful to someone. The Newton's iteration

> thing is, IMHO, really nifty.

I only scanned through your posting (quite a long Usenet posting... don't

you have a job?? :-) ), but what I saw seemed similar to the following

reference --- if it's not the same, I think you would still find the

following paper interesting (as well as the references that it contains):

M. Mnuk. "A div(n) depth boolean circuit for smooth modular inverse",

Information Processing Letters, Vol. 38, No. 3, May 17 1991, p. 153ff.

A "smooth modular inverse" is one in which all prime factors of the

modulus are "small", which is what you were looking at, I think.

--

Steve Tate --- s...@cs.unt.edu | "A mind is like a parachute; it only

Dept. of Computer Sciences | works when it's open."

University of North Texas | --- Duke Chronicle

Denton, TX 76201 |

Apr 6, 1994, 2:06:21 PM4/6/94


::More on finding multiplicative inverses

::

::The general method for finding multiplicative inverses (the Extended

::Euclidean Algorithm) involves repeated division. This is annoyingly

:

:Negative. Use Lehmer's method. (see Knuth Vol. 2) it involves

:NO division at all; only shifts and add/subtract.


I don't know if this is relevant to the current discussion, but it's

a nice hack I thought more people would be interested in knowing about,
if they haven't come across it before.

Way back when, I used to do a little video-game programming on 8-bit

micros. We used all sorts of tricks for getting speed up. Lots of

stuff would be table driven, for instance, but every now and then

you had to do something like a multiply over a range that was too

big for a table. So we'd use the old shift-and-add trick. (Obviously

I'm talking about multiplication by a constant here)

for instance, n * 10 was equivalent to n * 8 + n * 2, which was
(n << 3) + (n << 1), which was ((n << 2) + n) << 1, which all worked out

very neatly on a processor that could shift by constants quickly, or

add, but not multiply.

Anyway, why am I telling you this, when it's stuff everyone knows?

Well, the same trick isn't quite so obvious for *division* by a

constant!

How we did it was like this: first of all, assume 16 bit ints (you

can scale this up yourself later; in fact we *were* using 16 bit ints

in our space games.) Let's say we want to divide by 10 instead of

multiplying by it... well, n / 10 is n / 10 * 65536/65536 which is

(n * 65536/10) / 65536 - so we'd code up a shift-and-add multiply

routine for multiplying by 6553, then do the divide just by extracting

the top two result bytes from the 4-byte product...

Now, the more astute of you will realise that this doesn't actually

give 100% accurate results near the boundaries, and the rounding error

doesn't always go the way you'd expect, but for what we used it for

it was a useful and very fast technique. I just thought I'd mention

it on the offchance that it might be useful for someone else out there

in some application or other...

G

Apr 6, 1994, 11:41:35 PM4/6/94


I ought to add to that post just a few details:

Instead of x += x - a*x*x, you can make the iteration step

x *= (2-a*x) and save one addition per step, reducing it to

two multiplies and one add.

You can also use a third-order recurrence,

t = a*x - 1;
x *= t*t - t + 1;

For an additional two adds and a multiply, this triples the number of

valid bits. If you want, you can use this to take a 3-bit estimate to

a 9-bit estimate and avoid the bit twiddling needed to make a 4-bit

estimate to save an iteration. It depends on your multiply speed.

--

-Colin

Apr 7, 1994, 3:42:24 AM4/7/94


In article <2numnf$q...@linus.mitre.org>,

Robert D. Silverman <b...@gauss.mitre.org> wrote:

>:Euclidean Algorithm) involves repeated division. This is annoyingly

>

>Negative. Use Lehmer's method. (see Knuth Vol. 2) it involves

>NO division at all; only shifts and add/subtract.


Um, you've misremembered. Knuth credits the binary method (shifts and

subtracts) to J. Stein in 1961 (J. Comp. Phys. 1 (1967), 397-405).

Lehmer's method (AMM 45 (1938), 227-233) uses the leading digits to

reduce the number of multiple-precision steps, but still uses the classic

divide method, and produces the same quotients.

Exercise 34 (using Lehmer-like methods on the binary algorithm) is credited

to R.W. Gosper.

I'll also have to work through Exercise 35, extending the binary algorithm

to an extended one.

Since you're showing me where I'm stuck everywhere, can you remind me

of any fast high-precision division (with remainder) algorithms? All

this reduces the need for it considerably, but it's still a useful

primitive. The best I can think of right now for computing a/b is to

use Newton's method to form an estimate of 1/b to sufficiently many

decimal places, then multiply by a to form a quotient estimate, multiply

by b and subtract to form a remainder estimate, and do final corrections.

I thought I remembered something a bit more direct, that didn't require

final corrections, but I can't find the reference now.

--

-Colin

Apr 7, 1994, 3:51:15 AM4/7/94


In article <1994040617...@an-teallach.com>,

Graham Toal <gt...@an-teallach.com> wrote:

>Anyway, why am I telling you this, when it's stuff everyone knows?

>Well, the same trick isn't quite so obvious for *division* by a

>constant!

>

>How we did it was like this: first of all, assume 16 bit ints (you

>can scale this up yourself later; in fact we *were* using 16 bit ints

>in our space games.) Let's say we want to divide by 10 instead of

>multiplying by it... well, n / 10 is n / 10 * 65536/65536 which is

>(n * 65536/10) / 65536 - so we'd code up a shift-and-add multiply

>routine for multiplying by 6553, then do the divide just by extracting

>the top two result bytes from the 4-byte product...


This general technique is analyzed in P.D. Barrett, "Implementing the

Rivest Shamir and Adleman public key encryption algorithm on a standard

digital signal processor," Advances in Cryptology: Proc. Crypto '86,

Lecture Notes in Computer Science 263, Springer-Verlag 1987. (This is

out of another bibliography I have, in the Crypto '93 proceedings,

so it may refer to other papers for the bulk of the work.)

Basically, yes, this works. Apparently something similar is done in the

HP Precision compilers to do division by constants.

--

-Colin

Apr 11, 1994, 5:25:21 PM4/11/94


I used a different method to avoid a long division. I use the public
exponents 3, 5, 7, 11 and 13; they are all smaller than 16 and therefore
fit in a nibble (because I use an 8-bit microprocessor). Now I perform the
Euclidean algorithm, dividing by 3 first. If the remainder is 0, I'll try
5, and so on. Only when the modulus is a multiple of 3*5*7*11*13 = 15015
do I have to choose other primes. If the remainder is 1, then I am nearly
finished, just calculating the additive inverse.

In the other cases (remainder > 1) I use a table with the factors arising
from the remainders of the Euclidean algorithm:

	xxx = yyy * 13 + 7
	 13 = 1 * 7 + 6
	  7 = 1 * 6 + 1
	-> 1 = ... = yyy * -3 * 13 (+ zzz * xxx)

yyy * -3 is the wanted multiplicative inverse for 13, and xxx is the
modulus.

(I've got some (!) problems with my arrow keys, I beg pardon...)

And if you want a big public exponent, exchange it with the secret one, or

multiply the public exponent by a big random number and use the modulus

to hide it...

Using this method and an RSA ASIC for DM 100,- I calculate a random RSA
keypair within 1-2 seconds on an 8051 microprocessor (similar to an 8086
without segments), with an 8-bit accumulator and 1 million cycles/second.

I also read parts of Knuth's book and some others (the funniest was a
speedup using redundant conversions of the divisor, which needed just a
4-bit lookahead to determine one bit of the result, and then one bit more
for one bit more of the divisor). But why all this work, if you just need
an RSA key???

Kurt

I think I make myself perfectly clear:

Step 1: Find the plan! Kurt Huwig Universitaet Saarbruecken

Step 2: Save the world!

Step 3: Get out of my house! ku...@stud.uni-sb.de

Let's get crackin'!
