Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Computing multiplicative inverses

416 views
Skip to first unread message

Colin Plumb

unread,
Apr 6, 1994, 5:31:16 AM4/6/94
to
I got to thinking some more, and wrote down the follwoing ideas. I'm
posting it in the hopes it'll be useful to someone. The Newton's iteration
thing is, IMHO, really nifty.

More on finding multiplicative inverses

The general method for finding multiplicative inverses (the Extended
Euclidean Algorithm) involves repeated division. This is annoyingly
slow if you're working in extended precision. Fortunately, there are
some special cases where it can be made much faster.

The special cases are where the modulus is a power of a small prime.
This arises particularly in Montgomery's modular reduction algorithm.
Montgomery's algorithm is typically implemented using signle-digit
inverses, but can be expressed using full-width inverses, which lets
you use a subquadratic multiplication algorithm.

(See "Comparison of three modular reduction functions", Antoon
Bosselaers, Rene' Govaerts and Joos Vandewalle, _Advances in Cryptology,
Proc. Crypto '93_, Springer-Verlag, ISBN 3-540-57766-1, ISBN 0-387-57766-1)

The following can be derived (look up "valuations"), but I'm going to
just pull it out of a hat.

First, recall Newton's iteration for finding x = 1/A. Find the root
of f(x) = 1/x - A = 0. The equation to iterate is

x = x - f(x)/f'(x)
= x - (x^-1 - A)/(-x^-2)
= x + x^2 * (x^-1 - A)
= x + x - A*x^2
= 2*x - A*x*x

This is usually applied in the real numbers, but it can be applied in
modular arithmetic as well. In particular, suppose you have an initial
approximation x, such that x*A == 1 (mod m). So write it as follows:

x*A == 1 + k*m

Substituting in the interation, you get

(2*x - A*x*x)*A == 2*x*A - (x*A)^2
== 2 + 2*k*m - (1+k*m)^2
== 2 + 2*k*m - 1 - 2*k*m - k^2*m^2
== 2-1 + 2*k*m-2*k*m - k^2*m^2
== 1 - k^2*m^2

... which is 1, mod m^2.

So one iteration of Newton's method squared the modulus to which the
approximation is accurate! This is particularly useful in finding
multiplicative inverses modulo 2^k, since x = 1 is an initial
approximation that is good to 1 significant bit. You can double the
number of bits each iteration.

Each iteration requires two multiplies, plus some adds, but only the
last iteration need be full-width. Other iterations are half the width
or less. Thus, the total work is less than that of two iterations, or
four multiplies of k bits.

Actually, there is a better starting estimate. For all odd x,
x*x == 1 (mod 8). Modulo 16, it's a bit more awkward, but if you
look at the pattern of inverses in binary:

x x^-1
0001 0001
0011 1011
0101 1101
0111 0111
1001 1001
1011 0011
1101 0101
1111 1111

You can see that x^-1 is either x or x+8, and it's x+8 when the two
middle bits differ. So you can form your initial estimate thus:

x = ((A<<1 ^ A) & 4) << 1 ^ A;

which is good to 4 bits. Using either of these starting values, you
save the first few iterations. Then, working from there, you can
compute an inverse modulo 2^32 of any odd a:

unsigned long
inverse(unsigned long a)
{
unsigned long x = a;

assert((x*a & 0x7) == 1);
x += x - a*x*x;
assert((x*a & 0x3F) == 1);
x += x - a*x*x;
assert((x*a & 0xFFF) == 1);
x += x - a*x*x;
assert((x*a & 0xFFFFFF) == 1);
x += x - a*x*x;
assert((x*a & 0xFFFFFFFF) == 1);

return x;
}

or:

unsigned long
inverse(unsigned long a)
{
unsigned long x = ((a<<1 ^ a) & 4) << 1 ^ a;

assert((x*a & 0xF) == 1);
x += x - a*x*x;
assert((x*a & 0xFF) == 1);
x += x - a*x*x;
assert((x*a & 0xFFFF) == 1);
x += x - a*x*x;
assert((x*a & 0xFFFFFFFF) == 1);

return x;
}

This is considerably simpler than the Extended Euclidean algorithm,
and requires substantially fewer iterations. (See Knuth, section
4.5.3.) The average number of iterations needed to find an inverse
modulo n is 0.843 * ln(n) + 1.47, or 0.584 * log_2(n) + 1.47, so that's
10.8 iterations for 16 bits and 20.2 for 32. This method requires
log_2(log_2(n)) iterations, obviously asymptotically better!


I seem to recall that there is a division algorithm that uses
full-width multiplicative inverses like this, but I can't recall it.
You can't just compute B/A as B * 1/A modulo 2^k, where k is chosen
to be big enough, because remainders cause all sorts of problems.

Just for example, let B = 3*14+1 = 43 and A = 3. The desired
result is B/A = 14, and it can be seen that it will fit into
4 bits (modulo 16, that is), but trying to compute it that way produces:

1/3 == 11 (mod 16)

43 * 11 == (32 + 11) * 11
== 121
== 9

42 would come out exactly, so it can be used as-is for exact division,
but division with remainder is a different matter.

I have a rather complex algorithm for division in terms of
multiplication from Maurice Mignotte's _Mathematics For Computer
Algebra_ (Springer-Verlag, ISBN 0-387-97675-2, ISBN 0-3-540-97675-2),
but it doesn't look terribly practical. It reduces an n-bit division
to two n/2-bit divisions plus some multiplications.


This is also useful for doing trial division. There's a prime sieving
technique which replaces division by p with multiplication by p^-1,
modulo the word size b = 2^n. To find if a word is divisible by p, note
that the multiples of p are 0, p, ..., k*p, where k = floor((b-1)/p).

So multiplication by p^-1 will map multiples of p back onto that range.
All you have to do is multiply and compare with the (easily precomputed)
value floor((b-1)/p).

Assuming you're building these tables on the fly, a cost saving never
hurts.

Robert D. Silverman

unread,
Apr 6, 1994, 12:07:11 PM4/6/94
to
In article <1994Apr6.0...@mnemosyne.cs.du.edu> co...@nyx10.cs.du.edu (Colin Plumb) writes:
:I got to thinking some more, and wrote down the follwoing ideas. I'm

:posting it in the hopes it'll be useful to someone. The Newton's iteration
:thing is, IMHO, really nifty.
:
:More on finding multiplicative inverses
:
:The general method for finding multiplicative inverses (the Extended
:Euclidean Algorithm) involves repeated division. This is annoyingly

Negative. Use Lehmer's method. (see Knuth Vol. 2) it involves
NO division at all; only shifts and add/subtract.
--
Bob Silverman
These are my opinions and not MITRE's.
Mitre Corporation, Bedford, MA 01730
"You can lead a horse's ass to knowledge, but you can't make him think"

Steve Tate

unread,
Apr 6, 1994, 1:18:12 PM4/6/94
to

Colin Plumb (co...@nyx10.cs.du.edu) wrote:
> I got to thinking some more, and wrote down the follwoing ideas. I'm
> posting it in the hopes it'll be useful to someone. The Newton's iteration
> thing is, IMHO, really nifty.

I only scanned through your posting (quite a long Usenet posting... don't
you have a job?? :-) ), but what I saw seemed similar to the following
reference --- if it's not the same, I think you would still find the
following paper interesting (as well as the references that it contains):

M. Mnuk. "A div(n) depth boolean circuit for smooth modular inverse",
Information Processing Letters, Vol. 38, No. 3, May 17 1991, p. 153ff.

A "smooth modular inverse" is one in which all prime factors of the
modulus are "small", which is what you were looking at, I think.

--
Steve Tate --- s...@cs.unt.edu | "A mind is like a parachute; it only
Dept. of Computer Sciences | works when it's open."
University of North Texas | --- Duke Chronicle
Denton, TX 76201 |

Graham Toal

unread,
Apr 6, 1994, 2:06:21 PM4/6/94
to
::More on finding multiplicative inverses

::
::The general method for finding multiplicative inverses (the Extended
::Euclidean Algorithm) involves repeated division. This is annoyingly
:
:Negative. Use Lehmer's method. (see Knuth Vol. 2) it involves
:NO division at all; only shifts and add/subtract.

I don't know if this is relevant to the current discussion, but it's
a nice hack I thought more people would be interested in knowing if
they haven't come across it before.

Way back when, I used to do a little video-game programming on 8-bit
micros. We used all sorts of tricks for getting speed up. Lots of
stuff would be table driven, for instance, but every now and then
you had to do something like a multiply over a range that was too
big for a table. So we'd use the old shift-and-add trick. (Obviously
I'm talking about multiplication by a constant here)

for instance, n * 10 was equivalent to n * 8 + n * 2 which was
n << 3 + n << 1 which was ((n << 2) + n) << 1) which all worked out
very neatly on a processor that could shift by constants quickly, or
add, but not multiply.

Anyway, why am I telling you this, when it's stuff everyone knows?
Well, the same trick isn't quite so obvious for *division* by a
constant!

How we did it was like this: first of all, assume 16 bit ints (you
can scale this up yourself later; in fact we *were* using 16 bit ints
in our space games.) Let's say we want to divide by 10 instead of
multiplying by it... well, n / 10 is n / 10 * 65536/65536 which is
(n * 65536/10) / 65536 - so we'd code up a shift-and-add multiply
routine for multiplying by 6553, then do the divide just by extracting
the top two result bytes from the 4-byte product...

Now, the more astute of you will realise that this doesn't actually
give 100% accurate results near the boundaries, and the rounding error
doesn't always go the way you'd expect, but for what we used it for
it was a useful and very fast technique. I just thought I'd mention
it on the offchance that it might be useful for someone else out there
in some application or other...

G

Colin Plumb

unread,
Apr 6, 1994, 11:41:35 PM4/6/94
to
I ought to add to that post just a few details:

Instead of x += x - a*x*x, you can make the iteration step
x *= (2-a*x) and save one addition per step, reducing it to
two multiplies and one add.

You can also use a third-order recurrence,
t = a*x - 1;
x *= t*t - t - 1;

For an additional two adds and a multiply, this triples the number of
valid bits. If you want, you can use this to take a 3-bit estimate to
a 9-bit estimate and avoid the bit twiddling needed to make a 4-bit
estimate to save an iteration. It depends on your multiply speed.
--
-Colin

Colin Plumb

unread,
Apr 7, 1994, 3:42:24 AM4/7/94
to
In article <2numnf$q...@linus.mitre.org>,

Robert D. Silverman <b...@gauss.mitre.org> wrote:
>:Euclidean Algorithm) involves repeated division. This is annoyingly
>
>Negative. Use Lehmer's method. (see Knuth Vol. 2) it involves
>NO division at all; only shifts and add/subtract.

Um, you've misremembered. Knuth credits the binary method (shifts and
subtracts) to J. Stein in 1961 (J. Comp. Phys. 1 (1967), 397-405).
Lehmer's method (AMM 45 (1938), 227-233) uses the leading digits to
reduce the number of multiple-precision steps, but still uses the classic
divide method, and produces the same quotients.

Exercise 34 (using LEhmer-like methods on the binary algorithm) is credited
to R.W. Gosper.

I'll also have to work through Exercise 35, extending the binary algorithm
to an extended one.

Since you're showing me where I'm stuck everywhere, can you remind me
of any fast high-precision division (ith remainder) algorithms? All
this reduces the need for it considerably, but it's still a useful
primitive. The best I can think of right now for computing a/b is to
use Newton's method to form an estimate of 1/b to sufficiently many
decimal places, then multiply by a to form a quotient estimate, multiply
by b and subtract to form a remainder estimate, and do final corrections.

I thought I remembered something a bit more direct, that didn't require
final corrections, but I can't find the reference now.
--
-Colin

Colin Plumb

unread,
Apr 7, 1994, 3:51:15 AM4/7/94
to
In article <1994040617...@an-teallach.com>,

Graham Toal <gt...@an-teallach.com> wrote:
>Anyway, why am I telling you this, when it's stuff everyone knows?
>Well, the same trick isn't quite so obvious for *division* by a
>constant!
>
>How we did it was like this: first of all, assume 16 bit ints (you
>can scale this up yourself later; in fact we *were* using 16 bit ints
>in our space games.) Let's say we want to divide by 10 instead of
>multiplying by it... well, n / 10 is n / 10 * 65536/65536 which is
>(n * 65536/10) / 65536 - so we'd code up a shift-and-add multiply
>routine for multiplying by 6553, then do the divide just by extracting
>the top two result bytes from the 4-byte product...

This general technique is analyzed in "P.D. Barrett, "Implementing the
Rivest Shamir and Adleman public key encryption algorithm on a standard
digital signal processor," Advances in Cryptology: Proc. Crypto '86,
Lecture Notes in Computer Science 263, Springer-Verlag 1987. (This is
out of another bibliography I have, in the Crypto '93 proceedings,
so it may refer to other papers for the bulk of the work.)

Basically, yes, this works. Apparently something similar is done in the
HP Precision compilers to do division by constants.
--
-Colin

Kurt Huwig

unread,
Apr 11, 1994, 5:25:21 PM4/11/94
to
I used a different method to avoid a long-division. I use the public
exponents 3,5,7,11 and 13, they are all smaller than 16 therefore fit in a
nibble (because I use an 8-bit-uProcessor). No I perform the euclidian
algorithm, dividing by 3 first. If the rest is 0, I'll try 5 and so on.
Just when the modulus is a multiple of 3*5*7*11*13=15015 I have to choose
other primes. If the rest is 1, then I am nearly finished, just calculating
the additive inverse.
In the other cases (remainder > 1) I use a tabular with the factors arising
from the rest of the euclidian algorithm:

xxx = yyy * 13 + 7
13 = 1 * 7 + 6
7 = 1 * 6 + 1

-> 1=...= yyy * -3 * 13 (+ zzz * xxx)
yyy * -3 is the wanted multiplicative inverse for 13.


xxx = modulus
(I've got some (!) problems with my arrow keys, I beg pardon...)

And if you want a big public exponent, exchange it with the secret one, or
multiplikate the public exponent with a big random number and use the modulus
to hide it...
Using this method and an RSA-ASIC for DM 100,- I calculate a random RSA-
keypair within 1-2 seconds on an 8051 uProcessor (similar to an 8086 without
segments), with 8 bit Accumulator and 1 million cycles/second.

I also read parts of the book of Knuth and some others (funniest one was a
speedup using redundant conversions of the divisor, that needed just a 4-bit
lookahead to determine one bit of the result and then one bit more for one
bit more of the divisor). But why this work, if you just need an RSA-key???

Kurt
I think I make myself perfectly clear:

Step 1: Find the plan! Kurt Huwig Universitaet Saarbruecken
Step 2: Save the world!
Step 3: Get out of my house! ku...@stud.uni-sb.de

Let's get crackin'!

0 new messages