Hi all,
recently, two "data points" came up suggesting that our code is not optimal:
1) Richard Parker explained his strategy for matrix multiplication on this
mailing list and estimated that we could do better by being more clever with
respect to cache utilisation. I spent 90 minutes on the phone with Richard
discussing strategies, which gave me a lot to think about.
2) "Fast matrix decomposition in F2" by Enrico Bertolazzi and Anna Rimoldi
http://arxiv.org/pdf/1209.5198.pdf
proposes a new algorithm for Gaussian elimination over GF(2) avoiding column
swaps entirely. The authors benchmark their implementation against M4RI (they
mention Sage 5.2 but not the M4RI library itself, though they cite our two
papers).
Their implementation is solidly faster than our code in their benchmarks.
Here are their timings (in seconds):

n      PLE     their code
4096   0.0504  0.0314
8192   0.334   0.207
16384  2.392   1.532
32768  18.341  11.529
Although (2) (implicitly) claims these improvements are due to the new
algorithm, I am not convinced of this. For dense random matrices column swaps
are irrelevant: there are so few that they do not significantly contribute to
the run-time. But I have to admit that I cannot claim to have completely
understood what they are doing. In any case, I noticed that their
implementation uses a data representation which shares some features with
Richard's, for example cutting up C and B - in C = A*B - into vertical
stripes.
So, I played a bit with different chopping-up strategies, to see whether we
can improve our code to close this gap.
A) In my experiments I did not manage to improve our timings on my machine
(Macbook Pro, i7) using different strategies for chopping up the matrices A, B
and C. I wonder whether this is due to me not being ambitious enough or
perhaps due to the fact that I made no changes whatsoever to our internal
matrix representation. For now, I have found no strategy that improves upon
what we have. If anybody is up for trying it themselves, I'd be very curious
to see what you can come up with!
B) However, I did manage to produce some speed-up! It comes from two patches:
1) When we reach the dimension where we switch to M4RM, copy over all three
matrices A, B and C to improve data locality. So far, we only copied over C:
copy.patch
2) Save instructions in M4RM when reading bits from A. Previously we had 8
calls to mzd_read_bits; now we have one call. The price for this is that k
must be <= 8, but for now this doesn't seem to be much of a restriction. If we
need larger k, we can easily write the logic that deals with this as well; I
was just being lazy so far. Note that this is about saving operations, not
cache efficiency: simpler_m4rm.patch
In simpler_m4rm.patch I hardcoded k=6 because that worked best on my machine.
However, it seems that k should be chosen such that the eight grease tables
fit into L2. This needs some changes to our cache-detection logic, though, as
we currently test for L1 and L3 only (which we confusingly call L2; it's a
mess).
Here are some results, in seconds (note that these timings are somewhat
volatile; I tried to be careful and took many samples, but still):
= Multiplication =
== 2.66Ghz Intel Core i7 ==
n before after
4096 0.07710400 0.06999600
8192 0.56896300 0.51854700
16384 4.34118200 3.65992600
32768 30.8518770 26.0082520
== 2.8Ghz AMD Opteron 8439 SE (k=7 to fit L2) ==
# these timings are pretty volatile, machine under load
n before after
4096 0.17036300 0.15838200
8192 1.23192000 1.04591700
16384 9.49730300 9.92871000
32768 74.1292670 64.3472200
= Elimination =
== 2.66Ghz Intel Core i7 ==
n before after
4096 0.046127 0.046502
8192 0.317956 0.304784
16384 2.435766 1.951766
32768 16.95562 13.22079
65536 131.4590 97.82825
== 2.8Ghz AMD Opteron 8439 SE (k=7 to fit L2) ==
# these timings are pretty volatile, machine under load
n before after
4096 0.065061 0.063964
8192 0.490642 0.495988
16384 3.804558 3.512572
32768 27.52439 25.341536
What do you guys think? Can you perhaps test these patches on your machines?
If they provide speed-ups on modern hardware, I am inclined to commit them.
Cheers,
Martin
PS: I noticed that the performance figures on our website are rather outdated.
= Multiplication on 2.66Ghz Intel Core i7 =
n x n website current (with patches)
10,000 x 10,000 1.170 0.956
16,384 x 16,384 4.836 3.723
20,000 x 20,000 7.835 6.778
32,000 x 32,000 33.400 28.510
= Elimination on 2.66Ghz Intel Core i7 =
m x n website current (with patches)
10,000 x 10,000 0.612 0.517
16,384 x 16,384 2.886 2.099
20,000 x 20,000 4.228 3.523
32,000 x 32,000 18.309 13.437
--
name: Martin Albrecht
_pgp:
http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
_otr: 47F43D1A 5D68C36F 468BAEBA 640E8856 D7951CCF
_www:
http://martinralbrecht.wordpress.com/
_jab:
martinr...@jabber.ccc.de