Brian Kessler has uploaded this change for review.
math/big: implement lehmer's GCD algorithm
Updates #15833
Lehmer's GCD algorithm uses single precision calculations
to simulate several steps of multiple precision calculations
in Euclid's GCD algorithm which leads to a considerable
speed up. This implementation uses Collin's simplified
testing condition on the single digit cosequences that
requires only one quotient and avoids any possibility of
overflow.
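For illustration only (not part of this change): a minimal standalone sketch of the
single-word inner loop with the Collins stopping condition, using uint64 in place of
the package's Word type.

package main

import "fmt"

// lehmerStep runs the single-word inner loop on leading words a1 >= a2 and
// returns the cosequences that stand in for several multiple-precision
// Euclidean steps; even tracks the alternating signs of the cosequences.
func lehmerStep(a1, a2 uint64) (u0, u1, v0, v1 uint64, even bool) {
	var u2, v2 uint64
	u0, u1, u2 = 0, 1, 0
	v0, v1, v2 = 0, 0, 1
	for a2 >= v2 && a1-a2 >= v1+v2 { // Collins' stopping condition
		q := a1 / a2
		a1, a2 = a2, a1-q*a2
		u0, u1, u2 = u1, u2, u1+q*u2
		v0, v1, v2 = v1, v2, v1+q*v2
		even = !even
	}
	return
}

func main() {
	// arbitrary example inputs
	u0, u1, v0, v1, even := lehmerStep(0xfedcba9876543210, 0x0123456789abcdef)
	fmt.Println(u0, u1, v0, v1, even)
}

The cosequences from one such step drive a single multiple-precision update
A, B = u0*A + v0*B, u1*A + v1*B, with the signs reconstructed from even.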
name old time/op new time/op delta
GCD10x10/WithoutXY-4 1.82µs ±24% 0.28µs ± 6% -84.40% (p=0.008 n=5+5)
GCD10x10/WithXY-4 1.69µs ± 6% 1.71µs ± 6% ~ (p=0.595 n=5+5)
GCD10x100/WithoutXY-4 1.87µs ± 2% 0.56µs ± 4% -70.13% (p=0.008 n=5+5)
GCD10x100/WithXY-4 2.61µs ± 2% 2.65µs ± 4% ~ (p=0.635 n=5+5)
GCD10x1000/WithoutXY-4 2.75µs ± 2% 1.48µs ± 1% -46.06% (p=0.008 n=5+5)
GCD10x1000/WithXY-4 5.29µs ± 2% 5.25µs ± 2% ~ (p=0.548 n=5+5)
GCD10x10000/WithoutXY-4 10.7µs ± 2% 10.3µs ± 0% -4.38% (p=0.008 n=5+5)
GCD10x10000/WithXY-4 22.3µs ± 6% 22.1µs ± 1% ~ (p=1.000 n=5+5)
GCD10x100000/WithoutXY-4 93.7µs ± 2% 99.4µs ± 2% +6.09% (p=0.008 n=5+5)
GCD10x100000/WithXY-4 196µs ± 2% 199µs ± 2% ~ (p=0.222 n=5+5)
GCD100x100/WithoutXY-4 10.1µs ± 2% 2.5µs ± 2% -74.84% (p=0.008 n=5+5)
GCD100x100/WithXY-4 21.4µs ± 2% 21.3µs ± 7% ~ (p=0.548 n=5+5)
GCD100x1000/WithoutXY-4 11.3µs ± 2% 4.4µs ± 4% -60.87% (p=0.008 n=5+5)
GCD100x1000/WithXY-4 24.7µs ± 3% 23.9µs ± 1% ~ (p=0.056 n=5+5)
GCD100x10000/WithoutXY-4 26.6µs ± 1% 20.0µs ± 2% -24.82% (p=0.008 n=5+5)
GCD100x10000/WithXY-4 78.7µs ± 2% 78.2µs ± 2% ~ (p=0.690 n=5+5)
GCD100x100000/WithoutXY-4 174µs ± 2% 171µs ± 1% ~ (p=0.056 n=5+5)
GCD100x100000/WithXY-4 563µs ± 4% 561µs ± 2% ~ (p=1.000 n=5+5)
GCD1000x1000/WithoutXY-4 120µs ± 5% 29µs ± 3% -75.71% (p=0.008 n=5+5)
GCD1000x1000/WithXY-4 355µs ± 4% 358µs ± 2% ~ (p=0.841 n=5+5)
GCD1000x10000/WithoutXY-4 140µs ± 2% 49µs ± 2% -65.07% (p=0.008 n=5+5)
GCD1000x10000/WithXY-4 626µs ± 3% 628µs ± 9% ~ (p=0.690 n=5+5)
GCD1000x100000/WithoutXY-4 340µs ± 4% 259µs ± 6% -23.79% (p=0.008 n=5+5)
GCD1000x100000/WithXY-4 3.76ms ± 4% 3.82ms ± 5% ~ (p=0.310 n=5+5)
GCD10000x10000/WithoutXY-4 3.11ms ± 3% 0.54ms ± 2% -82.74% (p=0.008 n=5+5)
GCD10000x10000/WithXY-4 7.96ms ± 3% 7.69ms ± 3% ~ (p=0.151 n=5+5)
GCD10000x100000/WithoutXY-4 3.88ms ± 1% 1.27ms ± 2% -67.21% (p=0.008 n=5+5)
GCD10000x100000/WithXY-4 38.1ms ± 2% 38.8ms ± 1% ~ (p=0.095 n=5+5)
GCD100000x100000/WithoutXY-4 208ms ± 1% 25ms ± 4% -88.07% (p=0.008 n=5+5)
GCD100000x100000/WithXY-4 533ms ± 5% 525ms ± 4% ~ (p=0.548 n=5+5)
Change-Id: Ic1e007eb807b93e75f4752e968e98c1f0cb90e43
---
M src/math/big/int.go
M src/math/big/int_test.go
M src/math/big/rat.go
3 files changed, 121 insertions(+), 68 deletions(-)
diff --git a/src/math/big/int.go b/src/math/big/int.go
index 63a750c..a47eb12 100644
--- a/src/math/big/int.go
+++ b/src/math/big/int.go
@@ -466,7 +466,7 @@
return z
}
if x == nil && y == nil {
- return z.binaryGCD(a, b)
+ return z.lehmerGCD(a, b)
}
A := new(Int).Set(a)
@@ -505,64 +505,127 @@
return z
}
-// binaryGCD sets z to the greatest common divisor of a and b, which both must
-// be > 0, and returns z.
-// See Knuth, The Art of Computer Programming, Vol. 2, Section 4.5.2, Algorithm B.
-func (z *Int) binaryGCD(a, b *Int) *Int {
- u := z
- v := new(Int)
+// lehmerGCD sets z to the greatest common divisor of a and b,
+// both of which must be > 0, and returns z.
+// See Knuth, The Art of Computer Programming, Vol. 2, Section 4.5.2, Algorithm L.
+// This implementation uses the improved condition by Collin's requiring only one
+// quotient and avoiding the possibility of single Word overflow.
+// See Jebelean, "Improving the multiprecision Euclidean algorithm"
+// Design and Implementation of Symbolic Computation Systems pp 45-58
+func (z *Int) lehmerGCD(a, b *Int) *Int {
- // use one Euclidean iteration to ensure that u and v are approx. the same size
- switch {
- case len(a.abs) > len(b.abs):
- // must set v before u since u may be alias for a or b (was issue #11284)
- v.Rem(a, b)
- u.Set(b)
- case len(a.abs) < len(b.abs):
- v.Rem(b, a)
- u.Set(a)
+ A := new(Int)
+ B := new(Int)
+
+ // ensure A >= B
+ switch a.abs.cmp(b.abs) {
+ case -1:
+ B.Set(a)
+ A.Set(b)
default:
- v.Set(b)
- u.Set(a)
+ B.Set(b)
+ A.Set(a)
}
- // a, b must not be used anymore (may be aliases with u)
- // v might be 0 now
- if len(v.abs) == 0 {
- return u
- }
- // u > 0 && v > 0
-
- // determine largest k such that u = u' << k, v = v' << k
- k := u.abs.trailingZeroBits()
- if vk := v.abs.trailingZeroBits(); vk < k {
- k = vk
- }
- u.Rsh(u, k)
- v.Rsh(v, k)
-
- // determine t (we know that u > 0)
+ // temp variables for multiprecision update
t := new(Int)
- if u.abs[0]&1 != 0 {
- // u is odd
- t.Neg(v)
- } else {
- t.Set(u)
- }
+ r := new(Int)
+ s := new(Int)
+ w := new(Int)
- for len(t.abs) > 0 {
- // reduce t
- t.Rsh(t, t.abs.trailingZeroBits())
- if t.neg {
- v, t = t, v
- v.neg = len(v.abs) > 0 && !v.neg // 0 has no sign
- } else {
- u, t = t, u
+ // loop invariant A >= B
+ for len(B.abs) > 1 {
+
+ // Initialize the digits
+ var a1, a2, u0, u1, u2, v0, v1, v2 Word
+
+ m := len(B.abs) // m >= 2
+ n := len(A.abs) // n >= m >= 2
+
+ // extract the top Word of bits from A and B
+ h := nlz(A.abs[n-1])
+ a1 = (A.abs[n-1] << h) | (A.abs[n-2] >> (_W - h))
+ // B may have implicit zero words in the high bits if the lengths differ
+ switch {
+ case n == m:
+ a2 = (B.abs[n-1] << h) | (B.abs[n-2] >> (_W - h))
+ case n == m+1:
+ a2 = (B.abs[n-2] >> (_W - h))
+ default:
+ a2 = 0
}
- t.Sub(u, v)
+
+ // Since we are calculating with full words to avoid overflow
+ // we use even to track the sign of the cosequences
+ // for even iterations: u0, v1 >= 0 && u1, v0 <= 0
+ // for odd iterations: u0, v1 <= 0 && u1, v0 >= 0
+ // the first iteration starts with k=1 (odd)
+ even := false
+ // variables to track the cosequences
+ u0, u1, u2 = 0, 1, 0
+ v0, v1, v2 = 0, 0, 1
+
+ // Calculate the quotient and cosequences using Collin's stopping condition
+ for a2 >= v2 && (a1-a2) >= (v1+v2) {
+ q := a1 / a2
+ a1, a2 = a2, a1-q*a2
+ u0, u1, u2 = u1, u2, u1+q*u2
+ v0, v1, v2 = v1, v2, v1+q*v2
+ even = !even
+ }
+
+ // Multiprecision Step
+ if v0 != 0 {
+ // Simulate the effect of the single precision steps using the cosequences
+ // A = u0*A + v0*B
+ // B = u1*A + v1*B
+
+ t.abs = t.abs.setWord(u0)
+ s.abs = s.abs.setWord(v0)
+ t.neg = !even
+ s.neg = even
+
+ t = t.Mul(A, t)
+ s = s.Mul(B, s)
+
+ r.abs = r.abs.setWord(u1)
+ w.abs = w.abs.setWord(v1)
+ r.neg = even
+ w.neg = !even
+
+ r = r.Mul(A, r)
+ w = w.Mul(B, w)
+
+ A = A.Add(t, s)
+ B = B.Add(r, w)
+
+ } else {
+ // Single-digit calculations failed to simulate any quotients
+ // do a standard Euclidean step
+ t = t.Rem(A, B)
+ A, B, t = B, t, A
+ }
}
- return z.Lsh(u, k)
+ if len(B.abs) > 0 {
+ // standard Euclidean algorithm base case for B a single Word
+ if len(A.abs) > 1 {
+ // A is longer than a single Word
+ t = t.Rem(A, B)
+ A, B, t = B, t, A
+ }
+ if len(B.abs) > 0 {
+ // A and B are both a single Word
+ a1, a2 := A.abs[0], B.abs[0]
+ for a2 != 0 {
+ a1, a2 = a2, a1%a2
+ }
+ A.abs[0] = a1
+ }
+ }
+ *z = *A
+
+ return z
}
// Rand sets z to a pseudo-random number in [0, n) and returns z.
diff --git a/src/math/big/int_test.go b/src/math/big/int_test.go
index 65e24f1..f346c81 100644
--- a/src/math/big/int_test.go
+++ b/src/math/big/int_test.go
@@ -708,38 +708,28 @@
D := new(Int).GCD(X, Y, a, b)
if D.Cmp(d) != 0 {
- t.Errorf("GCD(%s, %s): got d = %s, want %s", a, b, D, d)
+ t.Errorf("GCD(%s, %s, %s, %s): got d = %s, want %s", x, y, a, b, D, d)
}
if x != nil && X.Cmp(x) != 0 {
- t.Errorf("GCD(%s, %s): got x = %s, want %s", a, b, X, x)
+ t.Errorf("GCD(%s, %s, %s, %s): got x = %s, want %s", x, y, a, b, X, x)
}
if y != nil && Y.Cmp(y) != 0 {
- t.Errorf("GCD(%s, %s): got y = %s, want %s", a, b, Y, y)
- }
-
- // binaryGCD requires a > 0 && b > 0
- if a.Sign() <= 0 || b.Sign() <= 0 {
- return
- }
-
- D.binaryGCD(a, b)
- if D.Cmp(d) != 0 {
- t.Errorf("binaryGcd(%s, %s): got d = %s, want %s", a, b, D, d)
+ t.Errorf("GCD(%s, %s, %s, %s): got y = %s, want %s", x, y, a, b, Y, y)
}
// check results in presence of aliasing (issue #11284)
a2 := new(Int).Set(a)
b2 := new(Int).Set(b)
- a2.binaryGCD(a2, b2) // result is same as 1st argument
+ a2.GCD(X, Y, a2, b2) // result is same as 1st argument
if a2.Cmp(d) != 0 {
- t.Errorf("binaryGcd(%s, %s): got d = %s, want %s", a, b, a2, d)
+ t.Errorf("aliased z = a GCD(%s, %s, %s, %s): got d = %s, want %s", x, y, a, b, a2, d)
}
a2 = new(Int).Set(a)
b2 = new(Int).Set(b)
- b2.binaryGCD(a2, b2) // result is same as 2nd argument
+ b2.GCD(X, Y, a2, b2) // result is same as 2nd argument
if b2.Cmp(d) != 0 {
- t.Errorf("binaryGcd(%s, %s): got d = %s, want %s", a, b, b2, d)
+ t.Errorf("aliased z = b GCD(%s, %s, %s, %s): got d = %s, want %s", x, y, a, b, b2, d)
}
}
diff --git a/src/math/big/rat.go b/src/math/big/rat.go
index f0f436e..99bc09b 100644
--- a/src/math/big/rat.go
+++ b/src/math/big/rat.go
@@ -422,7 +422,7 @@
neg := z.a.neg
z.a.neg = false
z.b.neg = false
- if f := NewInt(0).binaryGCD(&z.a, &z.b); f.Cmp(intOne) != 0 {
+ if f := NewInt(0).GCD(nil, nil, &z.a, &z.b); f.Cmp(intOne) != 0 {
z.a.abs, _ = z.a.abs.div(nil, z.a.abs, f.abs)
z.b.abs, _ = z.b.abs.div(nil, z.b.abs, f.abs)
if z.b.abs.cmp(natOne) == 0 {
Hi,
I had a go at implementing Lehmer's algorithm for the GCD and got pretty good speed up for a basic implementation. It should be straightforward to apply the same algorithm to the extended GCD case if this looks good.
Very cool, thank you @briankessler!
Patch set 1:Run-TryBot +1
TryBots are happy.
Patch set 1:TryBot-Result +1
Some suggestions for improvement, some questions, and some nitpicks regarding your comments.
16 comments:
Patch Set #1, Line 509: // both of which must be > 0, and returns z.
which both must be > 0
Patch Set #1, Line 511: Collin's
s/Collin's/Collins/
(according to the citations in the paper below)
please add a comma at the end of this line
s/pp 45-58/, pp 45-48./
Patch Set #1, Line 517: A := new(Int)
simpler:
A := new(Int).Set(a)
B := new(Int).Set(b)
// ensure A >= B
if A.abs.cmp(b.abs) < 0 {
A, B = B, A
}
(these are pointer swaps, so this is cheap)
But (read on):
In the end you want the result in z. That is, ideally you want A to be z so we can re-use z's memory and we don't need the assignment *z = *A before the return.
Because a and b are never used anymore in this function body (I believe), we can do:
// ensure a >= b
if a.abs.cmp(b.abs) < 0 {
a, b = b, a
}
// don't destroy incoming values of a and b
B := new(Int).Set(b) // do this first in case b is an alias of z
A := z.Set(a)
With the first assignment we make sure to have a copy of b. If b is an alias for z it doesn't matter, because b has been copied before we change z. Then we use z as A. If z is an alias for a, the assignment is a no-op (superfluous, but still a no-op). Otherwise it uses z's memory if it can. At the end, because A == z, we don't need the *z = *A assignment.
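For illustration, a minimal standalone sketch (not from this CL) of the aliasing
hazard described above, using the exported big.Int API and arbitrary values:

package main

import (
	"fmt"
	"math/big"
)

func main() {
	// safe order: copy b before writing through z
	z := big.NewInt(30)
	b := z // b aliases z
	B := new(big.Int).Set(b)
	A := z.Set(big.NewInt(42))
	fmt.Println(A, B) // 42 30: B kept b's original value

	// unsafe order: writing through z first also changes b,
	// so the later copy picks up the clobbered value
	z2 := big.NewInt(30)
	b2 := z2
	A2 := z2.Set(big.NewInt(42))
	B2 := new(big.Int).Set(b2)
	fmt.Println(A2, B2) // 42 42: b2's original value is lost
}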
Patch Set #1, Line 539: Initialize
A comment on comments:
In this package I've used all lower-case comments when the comment is short and when it's not a sentence. Here:
// initialize the digits
I've used Uppercase (first word) and end with a period, if the comment is a proper sentence. In that case I also try to do proper punctuation.
Please try to follow this existing style.
Patch Set #1, Line 568: Calculate
s/Calculate/calculate/ or use a period at the end.
no need for these parentheses (both sides of >=)
Patch Set #1, Line 577: // Multiprecision Step
lowercase M
These are all integers (pointers) not nat slices. There's no need to assign to the receiver as it can be updated in place:
t.Mul(A, t)
is sufficient (no need to assign to t).
ditto
ditto for all 4 lines
and here as well
and here
This throws away whatever underlying array belonged to z. What you want is to re-use that array if at all possible and avoid this assignment altogether.
Patch Set #1, Line 425: if f := NewInt(0).GCD(nil, nil, &z.a, &z.b); f.Cmp(intOne) != 0 {
why can't this be lehmerGCD? (both z.a and z.b are > 0)
I made most of the updates. There are still a few open items around the reuse of backing arrays. Also, I have the extended Lehmer algorithm working with similar performance improvements.
https://perf.golang.org/search?q=upload:20170829.1
It's basically identical except the update steps use the extended Euclidean algorithm. Should I include that with this CL or a subsequent one?
16 comments:
Patch Set #1, Line 509: // both of which must be > 0, and returns z.
which both must be > 0
Done
Patch Set #1, Line 511: Collin's
s/Collin's/Collins/ […]
Done
please add a comma at the end of this line
Done
s/pp 45-58/, pp 45-48. […]
Done
Patch Set #1, Line 517: A := new(Int)
simpler: […]
I updated to use this simplification and reuse z's memory. But I think the assignment is still necessary due to pointer swaps in the loop causing A and z to point to different Ints.
Patch Set #1, Line 539: Initialize
A comment on comments: […]
Done
Patch Set #1, Line 568: Calculate
s/Calculate/calculate/ or use a period at the end.
Done
no need for these parentheses (both sides of >=)
Done
Patch Set #1, Line 577: // Multiprecision Step
lowercase M
Done
These are all integers (pointers) not nat slices. […]
Related to the above comments about reusing z's backing array, updating these in place will throw out the backing arrays. I haven't benchmarked it, but do you think it's worth checking whether introducing a couple more temp variables here to avoid extra allocations is worthwhile?
ditto
Done
ditto for all 4 lines
Done
and here as well
Done
and here
Done
This throws away whatever underlying array belonged to z. […]
With the update above, I think this is still needed because A and z may not point to the same Int at the end due to the pointer swaps in the Euclidean update.
A, B, t = B, t, A
but after the change above, it's not wasting z's underlying nat array since that was used to back A.
Patch Set #1, Line 425: if f := NewInt(0).GCD(nil, nil, &z.a, &z.b); f.Cmp(intOne) != 0 {
why can't this be lehmerGCD? (both z.a and z. […]
No specific reason. I just used the generic GCD in case the internal implementation changed at some point in the future, but using lehmerGCD avoids the extra checks so I will update.
Brian Kessler uploaded patch set #2 to this change.
3 files changed, 116 insertions(+), 68 deletions(-)
I have some more minor nits. The Go code looks ok, but I need to think it through a bit more, or find another pair of eyes (I'm not familiar with the respective math; and this is a pretty crucial piece of code).
8 comments:
Patch Set #2, Line 7: lehmer's
s/lehmer/Lehmer/
s/Collin's/Collins'/
s/that/which/
s/for/For/
ditto
s/the/The/
and add a period at the end
Patch Set #2, Line 599: Single
c/Single/single/
(or add a period at the end of the sentence)
agreed that this is needed
one could perhaps do
if A != z {
*z = *A
}
but an integer is small, so it may not be worth it (or even be counterproductive)
Robert Griesemer removed Adam Langley from this change.
Aliaksandr, I removed Adam. While he would be great, he doesn't have any bandwidth at the moment.
I cleaned up all of the comments and commit message. I will make sure to follow the standard comment style in the future.
Definitely agree that it deserves a thorough confirmation. I chose the basic implementation from the first section of the Jebelean paper because it seemed to be the easiest to verify for correctness and didn't contain the edge cases of the version in Knuth. Would it be worthwhile to include a slow but definitely correct implementation of Euclid's algorithm solely for testing? There is currently checkGcd, called using quick.Check, which confirms d = x*a + b*y for the extended algorithm. I could include another quick.Check that confirms both implementations produce the same results for further confidence.
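For reference, the Bézout identity that checkGcd verifies can be exercised directly
through the exported API; a minimal sketch with arbitrary inputs:

package main

import (
	"fmt"
	"math/big"
)

func main() {
	a := big.NewInt(123456789)
	b := big.NewInt(987654321)
	x, y := new(big.Int), new(big.Int)
	d := new(big.Int).GCD(x, y, a, b)

	// check d == x*a + y*b
	rhs := new(big.Int).Mul(x, a)
	rhs.Add(rhs, new(big.Int).Mul(y, b))
	fmt.Println(d, rhs, d.Cmp(rhs) == 0) // 9 9 true
}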
7 comments:
Patch Set #2, Line 7: lehmer's
s/lehmer/Lehmer/
Done
s/Collin's/Collins'/
Done
s/that/which/
s/for/For/
Done
ditto
Done
s/the/The/ […]
Done
Patch Set #2, Line 599: Single
c/Single/single/ […]
Done
Brian Kessler uploaded patch set #3 to this change.
math/big: implement Lehmer's GCD algorithm
Updates #15833
Lehmer's GCD algorithm uses single precision calculations
to simulate several steps of multiple precision calculations
in Euclid's GCD algorithm which leads to a considerable
speed up. This implementation uses Collins' simplified
testing condition on the single digit cosequences which
requires only one quotient and avoids any possibility of
overflow.
3 files changed, 116 insertions(+), 68 deletions(-)
Patch Set 2:
Yes, that seems worthwhile: add the basic Euclid's algorithm to the testing code and use it to verify the fast version.
I'm pretty excited to see this happen as it should also speed up rational arithmetic significantly.
Brian Kessler uploaded patch set #4 to this change.
math/big: implement Lehmer's GCD algorithm
Updates #15833
Lehmer's GCD algorithm uses single precision calculations
to simulate several steps of multiple precision calculations
in Euclid's GCD algorithm which leads to a considerable
speed up. This implementation uses Collins' simplified
testing condition on the single digit cosequences which
requires only one quotient and avoids any possibility of
overflow.
3 files changed, 153 insertions(+), 68 deletions(-)
Patch Set 3:
Ok, I added the reference Euclid's algorithm for testing with quick.Check.
Patch Set 3:
Thanks. Just FYI I am ooo now for two weeks so most likely I won't get to this before I return. If you don't hear from me by 9/20 please ping me. Thanks.
R=gri
I am running the trybots and I've also added a tag to remind @gri when he returns mid-month.
Patch set 4:Run-TryBot +1
TryBots beginning. Status page: https://farmer.golang.org/try?commit=527d3d58
TryBots are happy.
Patch set 4:TryBot-Result +1
Patch Set 4:
Pinging to see if you had a chance to look at this again or identified any additional reviewers. Thanks.
Apologies for the delay.
2 comments:
File src/math/big/int_test.go:
Patch Set #4, Line 681: func (z *Int) euclidGCD(a, b *Int) *Int {
You are always throwing away z; also this is testing code. No need to pass in z. Just make this a function of a and b returning a result.
Patch Set #4, Line 701: if a.Cmp(zero) <= 0 || b.Cmp(zero) <= 0 {
if a.Sign() <= 0 || b.Sign() <= 0 ...
and get rid of zero.
Cleaned up the test.
2 comments:
Patch Set #4, Line 681: func (z *Int) euclidGCD(a, b *Int) *Int {
You are always throwing away z; also this is testing code. No need to pass in z. […]
Done
Patch Set #4, Line 701: if a.Cmp(zero) <= 0 || b.Cmp(zero) <= 0 {
if a.Sign() <= 0 || b.Sign() <= 0 ... […]
Done
Brian Kessler uploaded patch set #6 to this change.
math/big: implement Lehmer's GCD algorithm
Updates #15833
Lehmer's GCD algorithm uses single precision calculations
to simulate several steps of multiple precision calculations
in Euclid's GCD algorithm which leads to a considerable
speed up. This implementation uses Collins' simplified
testing condition on the single digit cosequences which
requires only one quotient and avoids any possibility of
overflow.
3 files changed, 151 insertions(+), 68 deletions(-)
LGTM.
I'd still love to have somebody with expertise in GCD look over this. I haven't forgotten. Sorry about all these delays.
Patch set 6:Run-TryBot +1
TryBots beginning. Status page: https://farmer.golang.org/try?commit=b4cd2a75
TryBots are happy.
Patch set 6:TryBot-Result +1
Patch Set 6: Run-TryBot+1
Sounds good. Let me know if there is anything else you'd like from me. I'm not sure if I could help find reviewers at all.
I also have the code for the extended algorithm ready to go. I was waiting on that CL in case there were updates on this one since they share the same single-digit code.
cc'ing Filippo, since this may optimize the crypto package.
Hi Brian;
Uri Mendlovic from Google who is familiar with Lehmer GCD took a look and he had the following comments:
----
Overall he agrees that the code is algorithmically correct. I'm going to approve this and check it in so we can move forward with incremental changes.
Thanks for doing this and bearing with me.
Patch set 6:Code-Review +2
Robert Griesemer merged this change.
Reviewed-on: https://go-review.googlesource.com/59450
Run-TryBot: Robert Griesemer <g...@golang.org>
TryBot-Result: Gobot Gobot <go...@golang.org>
Reviewed-by: Robert Griesemer <g...@golang.org>
---
M src/math/big/int.go
M src/math/big/int_test.go
M src/math/big/rat.go
3 files changed, 151 insertions(+), 68 deletions(-)
diff --git a/src/math/big/int.go b/src/math/big/int.go
index 73d48de..c5ff672 100644
--- a/src/math/big/int.go
+++ b/src/math/big/int.go
@@ -476,7 +476,7 @@
return z
}
if x == nil && y == nil {
- return z.binaryGCD(a, b)
+ return z.lehmerGCD(a, b)
}
A := new(Int).Set(a)
@@ -515,64 +515,122 @@
return z
}
-// binaryGCD sets z to the greatest common divisor of a and b, which both must
-// be > 0, and returns z.
-// See Knuth, The Art of Computer Programming, Vol. 2, Section 4.5.2, Algorithm B.
-func (z *Int) binaryGCD(a, b *Int) *Int {
- u := z
- v := new(Int)
+// lehmerGCD sets z to the greatest common divisor of a and b,
+// which both must be > 0, and returns z.
+// See Knuth, The Art of Computer Programming, Vol. 2, Section 4.5.2, Algorithm L.
+// This implementation uses the improved condition by Collins requiring only one
+// quotient and avoiding the possibility of single Word overflow.
+// See Jebelean, "Improving the multiprecision Euclidean algorithm",
+// Design and Implementation of Symbolic Computation Systems, pp 45-58.
+func (z *Int) lehmerGCD(a, b *Int) *Int {
- // use one Euclidean iteration to ensure that u and v are approx. the same size
- switch {
- case len(a.abs) > len(b.abs):
- // must set v before u since u may be alias for a or b (was issue #11284)
- v.Rem(a, b)
- u.Set(b)
- case len(a.abs) < len(b.abs):
- v.Rem(b, a)
- u.Set(a)
- default:
- v.Set(b)
- u.Set(a)
+ // ensure a >= b
+ if a.abs.cmp(b.abs) < 0 {
+ a, b = b, a
}
- // a, b must not be used anymore (may be aliases with u)
- // v might be 0 now
- if len(v.abs) == 0 {
- return u
- }
- // u > 0 && v > 0
+ // don't destroy incoming values of a and b
+ B := new(Int).Set(b) // must be set first in case b is an alias of z
+ A := z.Set(a)
- // determine largest k such that u = u' << k, v = v' << k
- k := u.abs.trailingZeroBits()
- if vk := v.abs.trailingZeroBits(); vk < k {
- k = vk
- }
- u.Rsh(u, k)
- v.Rsh(v, k)
-
- // determine t (we know that u > 0)
+ // temp variables for multiprecision update
t := new(Int)
- if u.abs[0]&1 != 0 {
- // u is odd
- t.Neg(v)
- } else {
- t.Set(u)
- }
+ r := new(Int)
+ s := new(Int)
+ w := new(Int)
- for len(t.abs) > 0 {
- // reduce t
- t.Rsh(t, t.abs.trailingZeroBits())
- if t.neg {
- v, t = t, v
- v.neg = len(v.abs) > 0 && !v.neg // 0 has no sign
- } else {
- u, t = t, u
+ // loop invariant A >= B
+ for len(B.abs) > 1 {
+
+ // initialize the digits
+ var a1, a2, u0, u1, u2, v0, v1, v2 Word
+
+ m := len(B.abs) // m >= 2
+ n := len(A.abs) // n >= m >= 2
+
+ // extract the top Word of bits from A and B
+ h := nlz(A.abs[n-1])
+ a1 = (A.abs[n-1] << h) | (A.abs[n-2] >> (_W - h))
+ // B may have implicit zero words in the high bits if the lengths differ
+ switch {
+ case n == m:
+ a2 = (B.abs[n-1] << h) | (B.abs[n-2] >> (_W - h))
+ case n == m+1:
+ a2 = (B.abs[n-2] >> (_W - h))
+ default:
+ a2 = 0
}
- t.Sub(u, v)
+
+ // Since we are calculating with full words to avoid overflow,
+ // we use 'even' to track the sign of the cosequences.
+ // For even iterations: u0, v1 >= 0 && u1, v0 <= 0
+ // For odd iterations: u0, v1 <= 0 && u1, v0 >= 0
+ // The first iteration starts with k=1 (odd).
+ even := false
+ // variables to track the cosequences
+ u0, u1, u2 = 0, 1, 0
+ v0, v1, v2 = 0, 0, 1
+
+ // calculate the quotient and cosequences using Collins' stopping condition
+ for a2 >= v2 && a1-a2 >= v1+v2 {
+ q := a1 / a2
+ a1, a2 = a2, a1-q*a2
+ u0, u1, u2 = u1, u2, u1+q*u2
+ v0, v1, v2 = v1, v2, v1+q*v2
+ even = !even
+ }
+
+ // multiprecision step
+ if v0 != 0 {
+ // simulate the effect of the single precision steps using the cosequences
+ // A = u0*A + v0*B
+ // B = u1*A + v1*B
+
+ t.abs = t.abs.setWord(u0)
+ s.abs = s.abs.setWord(v0)
+ t.neg = !even
+ s.neg = even
+
+ t.Mul(A, t)
+ s.Mul(B, s)
+
+ r.abs = r.abs.setWord(u1)
+ w.abs = w.abs.setWord(v1)
+ r.neg = even
+ w.neg = !even
+
+ r.Mul(A, r)
+ w.Mul(B, w)
+
+ A.Add(t, s)
+ B.Add(r, w)
+
+ } else {
+ // single-digit calculations failed to simulate any quotients
+ // do a standard Euclidean step
+ t.Rem(A, B)
+ A, B, t = B, t, A
+ }
}
- return z.Lsh(u, k)
+ if len(B.abs) > 0 {
+ // standard Euclidean algorithm base case for B a single Word
+ if len(A.abs) > 1 {
+ // A is longer than a single Word
+ t.Rem(A, B)
+ A, B, t = B, t, A
+ }
+ if len(B.abs) > 0 {
+ // A and B are both a single Word
+ a1, a2 := A.abs[0], B.abs[0]
+ for a2 != 0 {
+ a1, a2 = a2, a1%a2
+ }
+ A.abs[0] = a1
+ }
+ }
+ *z = *A
+ return z
}
// Rand sets z to a pseudo-random number in [0, n) and returns z.
diff --git a/src/math/big/int_test.go b/src/math/big/int_test.go
index bc2eef5..e42917b 100644
--- a/src/math/big/int_test.go
+++ b/src/math/big/int_test.go
@@ -675,6 +675,37 @@
return x.Cmp(d) == 0
}
+// euclidGCD is a reference implementation of Euclid's GCD
+// algorithm for testing against optimized algorithms.
+// Requirements: a, b > 0
+func euclidGCD(a, b *Int) *Int {
+
+ A := new(Int).Set(a)
+ B := new(Int).Set(b)
+ t := new(Int)
+
+ for len(B.abs) > 0 {
+ // A, B = B, A % B
+ t.Rem(A, B)
+ A, B, t = B, t, A
+ }
+ return A
+}
+
+func checkLehmerGcd(aBytes, bBytes []byte) bool {
+ a := new(Int).SetBytes(aBytes)
+ b := new(Int).SetBytes(bBytes)
+
+ if a.Sign() <= 0 || b.Sign() <= 0 {
+ return true // can only test positive arguments
+ }
+
+ d := new(Int).lehmerGCD(a, b)
+ d0 := euclidGCD(a, b)
+
+ return d.Cmp(d0) == 0
+}
+
var gcdTests = []struct {
d, x, y, a, b string
}{
@@ -708,38 +739,28 @@
D := new(Int).GCD(X, Y, a, b)
if D.Cmp(d) != 0 {
- t.Errorf("GCD(%s, %s): got d = %s, want %s", a, b, D, d)
+ t.Errorf("GCD(%s, %s, %s, %s): got d = %s, want %s", x, y, a, b, D, d)
}
if x != nil && X.Cmp(x) != 0 {
- t.Errorf("GCD(%s, %s): got x = %s, want %s", a, b, X, x)
+ t.Errorf("GCD(%s, %s, %s, %s): got x = %s, want %s", x, y, a, b, X, x)
}
if y != nil && Y.Cmp(y) != 0 {
- t.Errorf("GCD(%s, %s): got y = %s, want %s", a, b, Y, y)
- }
-
- // binaryGCD requires a > 0 && b > 0
- if a.Sign() <= 0 || b.Sign() <= 0 {
- return
- }
-
- D.binaryGCD(a, b)
- if D.Cmp(d) != 0 {
- t.Errorf("binaryGcd(%s, %s): got d = %s, want %s", a, b, D, d)
+ t.Errorf("GCD(%s, %s, %s, %s): got y = %s, want %s", x, y, a, b, Y, y)
}
// check results in presence of aliasing (issue #11284)
a2 := new(Int).Set(a)
b2 := new(Int).Set(b)
- a2.binaryGCD(a2, b2) // result is same as 1st argument
+ a2.GCD(X, Y, a2, b2) // result is same as 1st argument
if a2.Cmp(d) != 0 {
- t.Errorf("binaryGcd(%s, %s): got d = %s, want %s", a, b, a2, d)
+ t.Errorf("aliased z = a GCD(%s, %s, %s, %s): got d = %s, want %s", x, y, a, b, a2, d)
}
a2 = new(Int).Set(a)
b2 = new(Int).Set(b)
- b2.binaryGCD(a2, b2) // result is same as 2nd argument
+ b2.GCD(X, Y, a2, b2) // result is same as 2nd argument
if b2.Cmp(d) != 0 {
- t.Errorf("binaryGcd(%s, %s): got d = %s, want %s", a, b, b2, d)
+ t.Errorf("aliased z = b GCD(%s, %s, %s, %s): got d = %s, want %s", x, y, a, b, b2, d)
}
}
@@ -760,6 +781,10 @@
if err := quick.Check(checkGcd, nil); err != nil {
t.Error(err)
}
+
+ if err := quick.Check(checkLehmerGcd, nil); err != nil {
+ t.Error(err)
+ }
}
type intShiftTest struct {
diff --git a/src/math/big/rat.go b/src/math/big/rat.go
index f0f436e..b33fc69 100644
--- a/src/math/big/rat.go
+++ b/src/math/big/rat.go
@@ -422,7 +422,7 @@
neg := z.a.neg
z.a.neg = false
z.b.neg = false
- if f := NewInt(0).binaryGCD(&z.a, &z.b); f.Cmp(intOne) != 0 {
+ if f := NewInt(0).lehmerGCD(&z.a, &z.b); f.Cmp(intOne) != 0 {
z.a.abs, _ = z.a.abs.div(nil, z.a.abs, f.abs)
z.b.abs, _ = z.b.abs.div(nil, z.b.abs, f.abs)
if z.b.abs.cmp(natOne) == 0 {
1 comment:
Does the algorithm guarantee that we never run into overflow (of a Word) here and on the next two lines?
Can we document this?
Patch Set 6: Code-Review+2
Hi Brian;
Uri Mendlovic from Google who is familiar with Lehmer GCD took a look and he had the following comments:
- The performance is O(num words ^ 2), which can be improved. This should help for numbers with tens of words or more.
I looked at the subquadratic algorithms a bit and Niels Moller has a paper http://www.ams.org/journals/mcom/2008-77-261/S0025-5718-07-02017-0/S0025-5718-07-02017-0.pdf that suggests a payoff only around ~100 words, so I thought it wasn't worth the additional complexity for a first improvement. We can definitely add that extension as a follow-up.
- There is a variant that manipulates the leading words (a1, a2) by aligning them to the MSBit. This replaces the division/multiplication with shifts and may be faster.
I believe he may be referencing Sorenson's k-ary GCD algorithm
ftp://ftp.cs.wisc.edu/pub/techreports/1990/TR979.pdf, which looks basically comparable in performance to Lehmer's algorithm but more complex to implement.
- I saw that there is also another naive implementation for the case of extended GCD (when the coefficients are of interest). This implementation is much slower than both the old binary GCD and the proposed implementation. The new implementation can be modified to return the coefficients with small (x2) performance loss, thus replacing the duplicate implementation.
Yes, I have the extension to the cofactors ready to go but it probably will be a couple of weeks before I can clean it up to submit because I'm in the middle of a move.
Would you prefer a single unified implementation? I was planning on factoring out the shared single digit code, but having separate update and base cases for the extended algorithm to handle the extra bookkeeping for cofactors.
----
Overall he agrees that the code is algorithmically correct. I'm going to approve this and check it in so we can move forward with incremental changes.
Thanks for doing this and bearing with me.
- gri
The process was really great. Thanks for taking the time to thoroughly review, vet and improve the code.
Yes, the first line is just the remainder, so q*a2 <= a1 and a1 - q*a2 >= 0. u and v are positive and bounded by the size of the inputs (a1 and a2) so are always guaranteed to fit into a Word. Thus the products and sums on the following two lines will not overflow. Jebelean's paper discussed this as well. I can document this, but it will likely be a couple of weeks as I am in the middle of a move.
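That bound can also be sanity-checked mechanically. A rough sketch (not part of the
CL) that replays the inner loop on random normalized leading words and asserts that
the cosequence updates fit in a word, using 128-bit intermediates from math/bits;
sign tracking and the u0/v0 terms are omitted since only overflow is checked:

package main

import (
	"fmt"
	"math/bits"
	"math/rand"
)

// mulAdd returns x + q*y and reports whether the result overflows 64 bits.
func mulAdd(x, q, y uint64) (uint64, bool) {
	hi, lo := bits.Mul64(q, y)
	sum, carry := bits.Add64(x, lo, 0)
	return sum, hi != 0 || carry != 0
}

func main() {
	for i := 0; i < 100000; i++ {
		a1 := rand.Uint64() | 1<<63 // leading word is normalized (top bit set)
		a2 := rand.Uint64() % a1    // a1 > a2
		var u1, u2 uint64 = 1, 0
		var v1, v2 uint64 = 0, 1
		for a2 >= v2 && a1-a2 >= v1+v2 { // Collins' stopping condition
			q := a1 / a2
			a1, a2 = a2, a1-q*a2
			nu, ou := mulAdd(u1, q, u2) // u1 + q*u2
			nv, ov := mulAdd(v1, q, v2) // v1 + q*v2
			if ou || ov {
				panic("cosequence update overflowed a word")
			}
			u1, u2 = u2, nu
			v1, v2 = v2, nv
		}
	}
	fmt.Println("no overflow observed")
}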
Feel free to ping me in mid November if this is still open.
Patch Set 7:
Please take your time. The factored approach seems fine.
Note that the 1.10 freeze is around Nov. 1, so future work may have to wait until 1.11. We can still get it going and reviewed though.
Patch Set 7:
PS: Some comments from Uri Mendlovic:
The sub-quadratic variant may indeed provide significant acceleration only at around 100 words; that depends on the implementation. 100 words are possible in practical use cases, for example RSA-4096 numbers.
But this indeed can wait for future work. Consider adding a comment about the complexity.
The benefit of the shift-based variant is platform dependent, specifically on the latency ratio of single-word shift/multiplication. In modern CPUs the ratio is indeed small, so if Go doesn't add overhead to the multiplication the improvement will be small.
As for computing the coefficients, I thought this code could be put inside if statements. Probably there will be 3 such ifs, so separating the implementations is reasonable, but it will have some code duplication even if the single-word manipulation is partially shared. I will be glad to have a look at the code when Brian has it ready.