(this e-mail contains details on a particular implementation for
GF(2) linear algebra, feel free to ignore it if that doesn't get you going)
I've just submitted a new (much improved) version of the M4RI library for
inclusion in Sage:
http://trac.sagemath.org/sage_trac/ticket/3204
M4RI is a library by Gregory Bard and myself for fast arithmetic with dense
matrices over GF(2). M4RI stands for the "Method of the Four Russians"
inversion which is an algorithm for matrix reduction first published by Greg.
The website is here:
http://sage.math.washington.edu/home/malb/m4ri/
This new version has among other things two new features:
- SSE2 support where available (XOR two 128-bit words in one instruction)
- Strassen-Winograd matrix multiplication on top of M4RM ("Method of the Four
Russians" multiplication or Kronrod's algorithm)
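For readers unfamiliar with M4RM, the core idea can be sketched in a few lines of C. This is a toy for 64x64 bit matrices with k = 8, not the library's actual code: precompute a table of all 2^k XOR combinations of k rows of B, then add table rows selected by k-bit chunks of the rows of A.

```c
#include <stdint.h>
#include <string.h>

/* Multiply two 64x64 matrices over GF(2); rows are stored as
   uint64_t, with bit j of row i being entry (i, j).  Sketch of the
   "Method of the Four Russians" multiplication (M4RM) with k = 8:
   for each block of 8 rows of B, precompute all 2^8 row
   combinations, then add table rows selected by 8-bit chunks of A. */
void m4rm_mul64(uint64_t C[64], const uint64_t A[64], const uint64_t B[64])
{
    memset(C, 0, 64 * sizeof(uint64_t));
    for (int blk = 0; blk < 64; blk += 8) {
        uint64_t T[256];
        T[0] = 0;
        /* Gray-table style construction: every entry is a single
           XOR away from an already computed entry. */
        for (int i = 1; i < 256; i++) {
            int low = i & -i;              /* lowest set bit      */
            int idx = __builtin_ctz(low);  /* position of the bit */
            T[i] = T[i ^ low] ^ B[blk + idx];
        }
        for (int r = 0; r < 64; r++) {
            int chunk = (A[r] >> blk) & 0xFF;  /* 8 bits of row r */
            C[r] ^= T[chunk];
        }
    }
}
```

Row r of the product is the XOR of the rows B[j] with A[r, j] = 1, but each table is reused across all 64 rows of A, which is where the log factor in the complexity comes from.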
I'm writing to [sage-devel] because I observed some odd performance
characteristics and hope to pick your brains on how to improve performance:
At least in my experience enabling SSE2 results in slower code on AMD CPUs. I
think this is rumoured across the net but did anyone else also witness this
behaviour? I'm only using 128-bit integer XOR intrinsics. On my Core2Duo
notebook SSE2 speeds up the computation at least up to the size of the L2
cache.
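For reference, the kind of inner loop in question looks roughly like this (a sketch, not M4RI's actual code): a GF(2) row addition that XORs two 64-bit words per instruction when SSE2 is available and falls back to plain word-wise XOR otherwise.

```c
#include <stdint.h>
#include <stddef.h>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

/* dst ^= src over n 64-bit words -- a GF(2) row addition.  With
   SSE2, _mm_xor_si128 handles two words at a time; the alignment
   handling needed in real code is what makes it messy. */
void row_xor(uint64_t *dst, const uint64_t *src, size_t n)
{
    size_t i = 0;
#ifdef __SSE2__
    for (; i + 2 <= n; i += 2) {
        __m128i a = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(dst + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_xor_si128(a, b));
    }
#endif
    for (; i < n; i++)
        dst[i] ^= src[i];
}
```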
One would expect the M4RM + Strassen implementation to have an easy time
beating Magma since it uses a better algorithm. Like Magma, it performs
Strassen-Winograd down to the cutoff, but then it switches to the M4RM
algorithm (O(n^3/log_2(n))) rather than to naive multiplication (O(n^3)). This
should give a nice constant speed-up. So far this theory seems to match
reality on a Core2Duo notebook with 4MB L2 cache:
----------------------------------------------------------------------
| SAGE Version 3.0.1, Release Date: 2008-05-05 |
| Type notebook() for the GUI, and license() for information. |
----------------------------------------------------------------------
sage: A = random_matrix(GF(2),10^4,10^4)
sage: B = random_matrix(GF(2),10^4,10^4)
sage: time C = A._multiply_strassen(B,cutoff=3200)
CPU times: user 4.50 s, sys: 0.03 s, total: 4.53 s
Wall time: 4.56
Magma V2.13-5 Wed May 14 2008 21:55:31 on road [Seed = 2246483032]
Type ? for help. Type <Ctrl>-D to quit.
> A := RandomMatrix(GF(2),10^4,10^4);
> B := RandomMatrix(GF(2),10^4,10^4);
> t := Cputime(); C := A*B; Cputime(t);
7.170
sage: A = random_matrix(GF(2),2*10^4,2*10^4)
sage: B = random_matrix(GF(2),2*10^4,2*10^4)
sage: time C = A._multiply_strassen(B,cutoff=3200)
CPU times: user 33.35 s, sys: 0.09 s, total: 33.44 s
Wall time: 33.59
> A := RandomMatrix(GF(2),2*10^4,2*10^4);
> B := RandomMatrix(GF(2),2*10^4,2*10^4);
> t := Cputime(); C := A*B; Cputime(t);
49.990
Finally, a perfect cutting example (no extra rows/columns to take care of):
sage: A = random_matrix(GF(2),1<<14,1<<14)
sage: B = random_matrix(GF(2),1<<14,1<<14)
sage: time C = A._multiply_strassen(B,cutoff=1<<10) # !=3200
CPU times: user 16.82 s, sys: 0.07 s, total: 16.88 s
Wall time: 16.96
> A:= RandomMatrix(GF(2),2^14,2^14);
> B:= RandomMatrix(GF(2),2^14,2^14);
> t := Cputime(); C := A*B; Cputime(t);
24.060
Note however that this comparison is NOT FAIR since this version of Magma is
not optimised for the C2D chip!
However, on AMD CPUs the times don't look that nice. Both on sage.math and on
an
AMD Athlon(tm) 64 X2 Dual Core Processor 6400+
the performance is way way behind Magma (or what I'd expect Magma to perform
like), e.g. on sage.math:
sage: A = random_matrix(GF(2),10^4,10^4)
sage: B = random_matrix(GF(2),10^4,10^4)
sage: time C = A._multiply_strassen(B,cutoff=1500)
CPU times: user 14.40 s, sys: 0.17 s, total: 14.58 s
Wall time: 14.62
Magma V2.13-5 Wed May 14 2008 14:07:18 on sage [Seed = 4255048009]
Type ? for help. Type <Ctrl>-D to quit.
> A := RandomMatrix(GF(2),10^4,10^4);
> B := RandomMatrix(GF(2),10^4,10^4);
> t := Cputime(); C := A*B; Cputime(t);
3.890 <<--------- They even beat the C2D time!
sage: A = random_matrix(GF(2),2*10^4,2*10^4)
sage: B = random_matrix(GF(2),2*10^4,2*10^4)
sage: time C = A._multiply_strassen(B,cutoff=1500)
CPU times: user 99.42 s, sys: 0.64 s, total: 100.06 s
Wall time: 100.06
> B := RandomMatrix(GF(2),2*10^4,2*10^4);
> A := RandomMatrix(GF(2),2*10^4,2*10^4);
> t := Cputime(); C := A*B; Cputime(t);
27.470
# perfect cutoff
sage: A = random_matrix(GF(2),1<<14,1<<14)
sage: B = random_matrix(GF(2),1<<14,1<<14)
sage: time C = A._multiply_strassen(B,cutoff=1<<10)
CPU times: user 100.14 s, sys: 0.32 s, total: 100.46 s
Wall time: 100.45
sage: time C = A._multiply_strassen(B,cutoff=1<<9)
CPU times: user 76.48 s, sys: 0.39 s, total: 76.87 s
Wall time: 76.87
sage: time C = A._multiply_strassen(B,cutoff=1<<8)
CPU times: user 52.45 s, sys: 0.33 s, total: 52.78 s
Wall time: 52.78
> A := RandomMatrix(GF(2),2^14,2^14);
> B := RandomMatrix(GF(2),2^14,2^14);
> t := Cputime(); C := A*B; Cputime(t);
14.970
I guess the bad performance of the M4RI library is partly related to the
smaller L2 cache (AMD: 1MB, C2D: 4MB), which forces us to set the cutoff lower
(since we want all three matrices in C = A*B to fit in the L2 cache). Also it
seems that Magma is almost always ~3.5 times faster than our implementation.
I am slowly running out of ideas for what to improve in order to close the gap,
so if anyone has some ideas please let me hear them. It might be worth noting
that we are using the same cutting strategy as Sage in strassen.pyx, which
ignores uneven rows/columns at every recursion step and deals with the extra
rows/columns after the Strassen-Winograd formula (see strassen.pyx for
details). We are also using the improved operation schedule from the paper by
Clement et al. Though I don't know whether the cutting strategy is optimal, we
also get beaten in a perfect cutting scenario, so I guess it is not that
important for now. Anyway, I'd happily provide profiling data if anyone wants
to take a look and squeeze a couple of cycles out. If not, then this e-mail is
just a rather long status report :-)
Btw. I don't have access to Magma 2.14 which I believe to be the fastest in
linear algebra over GF(2). In version 2.14 they added SSE2 support too.
So if anybody could compare the new M4RI with Magma 2.14 I'd be happy to hear
about the results. I don't know what else to compare with except Magma. Any
suggestions? Thoughts? Rants?
Cheers,
Martin
--
name: Martin Albrecht
_pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
_www: http://www.informatik.uni-bremen.de/~malb
_jab: martinr...@jabber.ccc.de
Here are Magma V2.14-9 timings on sage.math:
was@sage:~/magma64/bin$ ./magma
Magma V2.14-9 Wed May 14 2008 17:33:23 on sage [Seed = 793568142]
Type ? for help. Type <Ctrl>-D to quit.
> A := RandomMatrix(GF(2),10^4,10^4);
> B := RandomMatrix(GF(2),10^4,10^4);
> time C := A*B;
Time: 4.020
> time C := A*B;
Time: 3.980
> time C := A*B;
Time: 3.990
> A := RandomMatrix(GF(2),2^14,2^14);
> B := RandomMatrix(GF(2),2^14,2^14);
> time C := A*B;
Time: 15.450
> time C := A*B;
Time: 15.420
> Any suggestions? Thoughts? Rants?
Allan Steel is a really amazing programmer.
----
I tried your new code up at #3204 under OS X and get this:
sage: A = random_matrix(GF(2),10^4,10^4)
sage: B = random_matrix(GF(2),10^4,10^4)
sage: time C = A._multiply_strassen(B,cutoff=3200)
sage.bin(39971) malloc: *** error for object 0xb95c010: Non-aligned
pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
sage.bin(39971) malloc: *** error for object 0x79c9c10: Non-aligned
pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
sage.bin(39971) malloc: *** error for object 0x7465a00:
non-page-aligned, non-allocated pointer being freed
*** set a breakpoint in malloc_error_break to debug
sage.bin(39971) malloc: *** error for object 0x79ca610: Non-aligned
pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
...
CPU times: user 10.29 s, sys: 0.26 s, total: 10.55 s
Wall time: 16.31
Maybe you're doing something wrong?
-- William
> time C := A*B;
Time: 3.980
> time C := A*B;
Time: 3.990
> A := RandomMatrix(GF(2),2^14,2^14);
> B := RandomMatrix(GF(2),2^14,2^14);
> time C := A*B;
Time: 15.450
> time C := A*B;
Time: 15.420
This doesn't seem much different from 2.13, which is good for my purposes.
Once I've fixed the OSX issue mentioned below, could you run the benchmark on
the C2D? I want to see if they can make better use of the 4MB L2 cache.
> ----
>
> I tried your new code up at #3204 under OS X and get this:
>
> sage: A = random_matrix(GF(2),10^4,10^4)
> sage: B = random_matrix(GF(2),10^4,10^4)
> sage: time C = A._multiply_strassen(B,cutoff=3200)
> sage.bin(39971) malloc: *** error for object 0xb95c010: Non-aligned
> pointer being freed (2)
> *** set a breakpoint in malloc_error_break to debug
> sage.bin(39971) malloc: *** error for object 0x79c9c10: Non-aligned
> pointer being freed (2)
> *** set a breakpoint in malloc_error_break to debug
> sage.bin(39971) malloc: *** error for object 0x7465a00:
> non-page-aligned, non-allocated pointer being freed
> *** set a breakpoint in malloc_error_break to debug
> sage.bin(39971) malloc: *** error for object 0x79ca610: Non-aligned
> pointer being freed (2)
> *** set a breakpoint in malloc_error_break to debug
> ...
> CPU times: user 10.29 s, sys: 0.26 s, total: 10.55 s
> Wall time: 16.31
>
> Maybe you're doing something wrong?
Likely, though my Valgrind log is clean under Linux. I'll try under OSX.
This bug is fixed in:
http://sage.math.washington.edu/home/malb/spkgs/libm4ri-20080515.p0.spkg
Hi Bill,
I cannot explain the random behaviour of the cpucycles counter. The cpucycles
library is by Dan Bernstein, so I'd expect it to be carefully coded. Some
potential sources of variance are: dynticks (i.e. the kernel doesn't have a
fixed tick rate; this is done to save power), CPU frequency scaling, and other
applications. Then again, I probably don't need to tell you this, since you
have benchmarked more code than I have. To account for the variance I have the
small run_bench.py script in the testsuite directory. Run it like
this:
./run_bench.py -t mul -n 10000 -c 3200
It'll run the experiment 17 times (17 for no particular reason) and return the
median runtime. -n is the size and -c the cutoff.
Cheers,
Martin
Median? Shouldn't you take the minimum? Are there any good papers
on benchmarking?
By the way -- present for you:
http://m4ri.sagemath.org/
To update that page you have to ssh to sagemath.org.
William
> Median? Shouldn't you take the minimum? Are there any good papers
> on benchmarking?
I'm using cpucycles from:
http://www.ecrypt.eu.org/ebats/cpucycles.html
which advises:
------------------------
Notes on accuracy
Benchmarking tools are encouraged to record several timings of a function:
call cpucycles(), function(), cpucycles(), function(), etc., and then print
one line reporting the differences between successive cpucycles() results.
The median of several differences is much more stable than the average.
Cycle counts continue to increase while other programs are running, while the
operating system is handling an interruption such as a network packet, etc.
This won't affect the median of several timings of a fast function---the
function usually won't be interrupted---but it can affect the median of
several timings of a slow function. Hopefully a benchmarking machine isn't
running other programs.
On dual-CPU systems (and dual-core systems such as the Athlon 64 X2), the CPUs
often don't have synchronized cycle counters, so a process that switches CPUs
can have its cycle counts jump forwards or backwards. I've never seen this
affect the median of several timings.
Some CPUs dynamically reduce CPU speed to save power, but deliberately keep
their cycle counters running at full speed, the idea being that measuring
time is more important than measuring cycles. Hopefully a benchmarking
machine won't enter power-saving mode.
Cycle counts are occasionally off by a multiple of 2^32 on some CPUs, as
discussed below. I've never seen this affect the median of several timings.
The estimate returned by cpucycles_persecond() may improve accuracy after
cpucycles() has been called repeatedly.
------------------------
I just followed that. I guess whether one takes the minimum or not depends on
the data. If there are (much) easier cases, then it is probably not best to
take the minimum. However, for matrix multiplication the times don't really
depend on the entries of the matrices, so I could use the minimum or just
output all three (minimum, average, median).
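Outputting all three is cheap anyway; a sketch (hypothetical helper, not part of the testsuite):

```c
#include <stdlib.h>
#include <string.h>

static int cmp_dbl(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Summarise n timing samples as minimum, average and median.  For
   odd n (like the 17 runs of run_bench.py) s[n / 2] is the true
   median; for even n it is the upper of the two middle values. */
void summarise(const double *t, size_t n,
               double *min, double *avg, double *med)
{
    double *s = malloc(n * sizeof(double));
    memcpy(s, t, n * sizeof(double));
    qsort(s, n, sizeof(double), cmp_dbl);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += s[i];
    *min = s[0];
    *avg = sum / n;
    *med = s[n / 2];
    free(s);
}
```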
> By the way -- present for you:
>
> http://m4ri.sagemath.org/
>
> To update that page you have to ssh to sagemath.org.
Cool, thanks! Any chance that I could get write access to the home directory
so that I can upload my SSH public key?
No, I was being stupid. The cpucycles values are printed with %u, but they
should be printed with %llu since they are wider than an int. I've attached
the fixed C file (since it is < 1KB).
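The bug class is easy to illustrate (a minimal sketch, not the attached file): a 64-bit cycle count printed with %u is silently truncated to its low 32 bits.

```c
#include <stdio.h>

/* Format a 64-bit cycle count correctly (%llu) and incorrectly
   (%u, which keeps only the low 32 bits) into the given buffers. */
void format_cycles(unsigned long long cycles,
                   char *good, char *bad, size_t len)
{
    snprintf(good, len, "%llu", cycles);
    snprintf(bad, len, "%u", (unsigned)cycles);  /* the bug */
}
```

Any run longer than about two seconds on a GHz-class CPU exceeds 2^32 cycles, so the truncated numbers look random rather than merely wrong.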
I observed the same behaviour but have no explanation besides stuff like this:
http://www.theinquirer.net/en/inquirer/news/2003/05/02/sse2-makes-opterons-slower-than-athlon-xps
How do you know what Magma's base case is? Can you tell Magma somehow to
print its crossover? I'll look into plugging in naive multiplication as the
base case (after optimisation) to see if your idea is right, though I doubt
that the crossover to M4RM is beyond the crossover to Strassen-Winograd.
Martin
On the other hand: a squiggly line is what one would expect for
Strassen-Winograd, due to the extra rows/columns that have to be taken care
of. In any case, 3600 seems rather late for a 1MB L2 cache. I think Allan Steel
once gave a talk about his implementation and stated that they don't use
classical block multiplication (I saw some slides with that remark).
Martin
Bill, thanks so much for looking into this! It is very much appreciated. I'm
going to read/try your patch right away!
Martin
PS: I have a slight delay in my replies because my provider's spam filter
consistently flags you and mabshoff as spam for some reason.
Are you suggesting Magma uses M4RM too? I'd doubt that, since they don't state
that anywhere. Probably I'm just misunderstanding you.
Yes, the change worked like a charm. I made some changes (the fixed k is
replaced by a k that depends on the new block dimensions etc.) and it is much,
much faster now. I'm working on re-introducing SSE2 now to see if it makes the
world a better place at least on the Core2Duo. Btw. all the stuff I wrote
about the L2 cache size C2D vs. Opteron was bollocks. The reason I beat Magma
so badly earlier was that I used a 32-bit Magma and compared it with a 64-bit
version of M4RI. To state it clearly: we don't beat Magma anywhere. Anyhow,
here are the times:
64-bit Debian/GNU Linux, 2.33Ghz Core2Duo
Matrix Dimension Magma 2.14-13 (64-bit) M4RI-20080517 (64-bit)
10,000 x 10,000 2.920 4.130
16,384 x 16,384 11.140 15.740
20,000 x 20,000 20.370 28.950
64-bit Debian/GNU Linux, 1.8Ghz Opteron (sage.math)
Matrix Dimension Magma 2.13-5 (64-bit) M4RI-20080517 (64-bit)
10,000 x 10,000 3.930 7.860
16,384 x 16,384 16.230 104.77???
20,000 x 20,000 27.080 56.420
(My university today finally granted me access to Magma 2.14 for my notebook.)
As you can see there is an odd (reproducible) spike at 2^14 x 2^14. I cannot
explain that for now, it might just be a bug. I have a similar spike in
another run on another AMD (Athlon X2) machine:
n: 2048, cutoff: 1024, speedup: 1.06, m4rm: 0.01 strassen: 0.01
n: 3072, cutoff: 1536, speedup: 0.87, m4rm: 0.04 strassen: 0.05
n: 4096, cutoff: 2048, speedup: 1.13, m4rm: 0.15 strassen: 0.13
n: 5120, cutoff: 2560, speedup: 1.35, m4rm: 0.39 strassen: 0.29
n: 6144, cutoff: 3072, speedup: 1.20, m4rm: 0.66 strassen: 0.55
n: 7168, cutoff: 3584, speedup: 1.87, m4rm: 1.64 strassen: 0.88
n: 8192, cutoff: 4096, speedup: 4.48, m4rm: 6.07 strassen: 1.35
>>> n: 9216, cutoff: 4608, speedup: 2.94, m4rm: 8.90 strassen: 3.02 <<<
n: 10240, cutoff: 5120, speedup: 4.68, m4rm: 12.99 strassen: 2.78
n: 11264, cutoff: 5632, speedup: 4.84, m4rm: 18.78 strassen: 3.88
n: 12288, cutoff: 6144, speedup: 4.65, m4rm: 24.40 strassen: 5.24
n: 13312, cutoff: 6656, speedup: 3.32, m4rm: 30.92 strassen: 9.33
I'm investigating this one too and playing around with the parameters. Since
the C2D times are considerably better, I also bet it is L2 related; we'll see.
Thanks again!
Martin
I was stuck there too yesterday. Maybe the pipeline only gets fully utilised
at 10000x10000?
Martin
PS: If we run out of ideas we can simply go for parallelism, that should help
on sage.math ;-)
Since all routines use the RADIX, and I only check whether their results
match, they are all wrong in the same way and it isn't detected. I suppose I
should add a test with known answers.
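Such a known-answer test could be as small as this sketch (illustrative matrices with a hand-computed product, not the library's test suite): it pins a routine to an independently verified result instead of to the other routines.

```c
#include <stdint.h>

/* Naive GF(2) product for tiny matrices: bit j of row[i] is entry
   (i, j); row r of C is the XOR of the rows of B selected by the
   set bits of row r of A. */
void mul_gf2_4x4(uint8_t C[4], const uint8_t A[4], const uint8_t B[4])
{
    for (int r = 0; r < 4; r++) {
        C[r] = 0;
        for (int j = 0; j < 4; j++)
            if (A[r] >> j & 1)
                C[r] ^= B[j];
    }
}
```

The expected values in the check below were computed by hand, so a shared RADIX-style bug in the code under test cannot cancel out.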
> I notice that in the SSE code, you check to see if alignment can be
> achieved, otherwise it doesn't use SSE. But this introduces an
> unpredictable branch. Also, where there are three operands, you can't
> use SSE2 because the likelihood of all three being aligned is too
> small.
> I think a better idea would be to explicitly force all matrices and
> all rows to be 128 bit aligned if the matrices are wide enough to
> benefit from SSE2, Then the combine function can always use SSE2 and
> there will be no need to check for alignment.
I'll try that.
> I experimented with interleaving MMX and GPR XOR's, but this doesn't
> speed anything up. There are more instructions emitted and the time
> stays about the same. The only way interleaving the MMX and GPR code
> would speed things up is if there was more computation going on in the
> registers and less memory loading and storing, I think.
I came to the same conclusion (but my code might not have been as good as
yours). I improved other areas of the code (e.g. using naive multiplication
rather than M4RM if B->ncols < RADIX, since it is faster, etc.). I can forward
you my newest tarball (but the speed improvements aren't really noticeable
yet).
Martin
That doesn't seem to make a noticeable difference for me (on C2D). However, I
realised that in some multiplications the target matrix is a real matrix
rather than a window (which has bad data locality). Copying everything over
doesn't seem like a good idea, but it at least indicates an area for
improvement.
Martin
Okay, if I only copy when we crossover to M4RM then the memory overhead is
constant (~ cutoff^2) and the performance still improves.
Old: 64-bit Debian/GNU Linux, 2.33Ghz Core2Duo
Matrix Dimension Magma 2.14-13 (64-bit) M4RI-20080517 (64-bit)
10,000 x 10,000 2.920 3.610
16,384 x 16,384 11.140 12.120
20,000 x 20,000 20.370 24.390
32,000 x 32,000 74.290 94.910
New: 64-bit Debian/GNU Linux, 2.33Ghz Core2Duo
Matrix Dimension Magma 2.14-13 (64-bit) M4RI-20080517 (64-bit)
10,000 x 10,000 2.920 2.990
16,384 x 16,384 11.140 11.750
20,000 x 20,000 20.370 21.180
32,000 x 32,000 74.290 86.570
On the Opteron things don't look this way, but I think sage.math is pretty
heavily used right now, so my benchmarks there are not very telling.
If you take this + Bill's idea:
> For 10000x10000 we currently use k = 6. Instead of this, we could use
> k = 5 and make two Gray tables simultaneously. This will still fit in
> cache.
>
> Instead of doing 6 bits at a time, we can then do 10 bits at a time.
> We'd load the appropriate line from the first Gray table, then the
> appropriate one from the second and xor them, then xor with the output
> matrix. This should decrease the number of loads and stores
> considerably. Moreover, the SSE instructions will then be much more
> efficient as the ratio of arithmetic instructions to loads and stores
> is higher.
>
> Of course one could also do 16 bits at a time, by doing 4 tables, but
> I think this might actually get slower again since you've only
> increased the amount of work done by 60%, but you've had a 30 %
> increase in instructions.
You get (on the C2D):
sage: B = random_matrix(GF(2), 3.2*10^4, 3.2*10^4)
sage: A = random_matrix(GF(2), 3.2*10^4, 3.2*10^4)
sage: time C= A._multiply_strassen(B,cutoff=2^11)
CPU times: user 75.82 s, sys: 0.22 s, total: 76.04 s
Wall time: 76.31
sage: A = random_matrix(GF(2), 2*10^4, 2*10^4)
sage: B = random_matrix(GF(2), 2*10^4, 2*10^4)
sage: time C= A._multiply_strassen(B,cutoff=2^11)
CPU times: user 19.14 s, sys: 0.09 s, total: 19.24 s
Wall time: 19.29
sage: B = random_matrix(GF(2), 2^14, 2^14)
sage: A = random_matrix(GF(2), 2^14, 2^14)
sage: time C= A._multiply_strassen(B,cutoff=2^11)
CPU times: user 10.62 s, sys: 0.05 s, total: 10.67 s
Wall time: 10.70
sage: B = random_matrix(GF(2), 10^4, 10^4)
sage: A = random_matrix(GF(2), 10^4, 10^4)
sage: time C= A._multiply_strassen(B,cutoff=2^11)
CPU times: user 2.73 s, sys: 0.02 s, total: 2.75 s
Wall time: 2.76
i.e. the speed of my current Magma install on the same computer (mind you,
this one might not be optimised for the C2D but for the Opteron, I don't
know). The times above don't have SSE2 yet. I guess documenting Bill's tricks
- process the rows of A in blocks
- use two rather than one Gray code table
well is in order, since M4RM now looks quite different from the original
algorithm. I'll do that tomorrow.
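The second trick can be sketched as follows (hypothetical names and sizes, assuming two precomputed 5-bit Gray code tables T0 and T1): one pass over the destination row consumes 10 bits of A, so the row is loaded and stored once instead of twice.

```c
#include <stdint.h>
#include <stddef.h>

enum { KBITS = 5, TSIZE = 1 << KBITS };

/* c ^= T0[low 5 bits] ^ T1[high 5 bits].  Compared with two
   separate single-table passes, each load/store of c now carries
   two table XORs, halving the memory traffic on the output row. */
void add_two_tables(uint64_t *c, size_t width, unsigned bits10,
                    uint64_t T0[TSIZE][4], uint64_t T1[TSIZE][4])
{
    const uint64_t *r0 = T0[bits10 & (TSIZE - 1)];
    const uint64_t *r1 = T1[(bits10 >> KBITS) & (TSIZE - 1)];
    for (size_t j = 0; j < width; j++)
        c[j] ^= r0[j] ^ r1[j];
}
```

Each table still only holds 2^5 = 32 rows, so both fit in L1 where a single 10-bit table of 1024 rows might not.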
Again, thanks Bill!
If one uses M4RI with the new patch from within Sage, another PRBG is used,
but coinflip should be fine. I don't see these speedups (but I have two Gray
code tables, and this calls for more ifs).
Hi, I think we might consider merging our two forks again? Or do you also have
the two Gray code tables? Are your timings on the Opteron? Because then things
look really good, since mine are on the C2D.
Exciting times,
Martin
I'm using a crossover of 2048 here, so maybe our improvements are orthogonal?
Even more puzzling, I'd expect that my crossover should be bigger than yours.
(On a side note: my code changes how the crossover is used. Your
version: 'size < cutoff'; my version: '|cutoff - size| is minimal', which
should give actual cutoffs closer to the desired values.)
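To make the difference concrete, here is a sketch (hypothetical helpers, assuming plain halving without the uneven row/column handling) of the base-case size each strategy ends up with:

```c
#include <stdlib.h>

/* Strategy A ("size < cutoff"): keep halving until the submatrix
   is strictly below the cutoff. */
long base_lt(long n, long cutoff)
{
    while (n >= cutoff)
        n /= 2;
    return n;
}

/* Strategy B ("|cutoff - size| is minimal"): halve only while the
   halved size is closer to the cutoff than the current one. */
long base_closest(long n, long cutoff)
{
    while (labs(n / 2 - cutoff) < labs(n - cutoff))
        n /= 2;
    return n;
}
```

For n = 10000 and cutoff = 2048, 'size < cutoff' recurses down to a 1250-wide base case while the 'closest' rule stops at 2500, which is indeed nearer the desired value.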
My version is here:
http://sage.math.washington.edu/home/malb/spkgs/libm4ri-20080516.p1.spkg
(this needs an updated patch for Sage)
and here:
http://sage.math.washington.edu/home/malb/m4ri-20080516.tar.gz
(which is the raw source). Those don't have SSE2 yet but it doesn't seem to
make that much of a difference anyway. I'll add that back before doing an
official release. However, unfortunately I'll probably have limited/no time
tomorrow to commit.
Martin
PS: To give at least some indication that my code still does the right thing,
a 'known answer' test:
sage: A = random_matrix(GF(2), 10^3, 10^3)
sage: B = random_matrix(GF(2), 10^3, 10^3)
sage: (A*B)._magma_() == A._magma_() * B._magma_()
True
sage: (A._multiply_strassen(B,cutoff=256))._magma_() == A._magma_() *
B._magma_()
True
> Of course one can go too crazy with optimisation.
No.... surely that never happens around here.....
david
First, I recorded the different speed-ups in a small table for an overview in
the attachment (I think we've come a long way :-)). To disable the copying-out
one needs to edit
/* we copy the matrix first since it is only constant memory
overhead and improves data locality, if you remove it make sure
there are no speed regressions */
/* C = _mzd_mul_m4rm_impl(C, A, B, 0, TRUE); */
packedmatrix *Cbar = mzd_init(C->nrows, C->ncols);
Cbar = _mzd_mul_m4rm_impl(Cbar, A, B, 0, FALSE);
mzd_copy(C, Cbar);
mzd_free(Cbar);
return C;
in strassen.c to
/* we copy the matrix first since it is only constant memory
overhead and improves data locality, if you remove it make sure
there are no speed regressions */
C = _mzd_mul_m4rm_impl(C, A, B, 0, TRUE);
return C;
This disables the copying out.
Martin
PS: If I find some time later today I'll make some changes such that SSE2 can
be used more often, i.e. align each row at 16-byte borders if HAVE_SSE2 is
used.
Cool, I'll take a look tomorrow. It seems we're closing in on the classical
algorithm, though I notice you don't seem to decrease k anymore.
As for the copying and 2^14 vs. 10^4: maybe the extra rows/columns cost too
many cycles. I'll try valgrind/callgrind/cachegrind on the code to see if
that is true. Also, maybe for 2^14 x 2^14 the submatrices are nicely aligned
on cache lines. I'm wildly speculating here.
Martin
I do get a small speed-up on the Core2Duo for SSE2, but I'm not sure it is
worth the trouble (I agree that it makes the otherwise pretty-looking code
ugly).
I have some timings (for an old implementation) here:
http://trac.sagemath.org/sage_trac/ticket/3204#comment:2
My guess is that SSE2 is slower on the Opteron because SSE2 is basically an
Intel thing and only provided by AMD for compatibility reasons. There are
several reports of SSE2 being slow on the Opteron, and I guess the SSE2
integer operations were not a focus for speed, since MMX/SSE is mainly about
floating point.
One thing I noticed on the Opteron is that whether I put the code in
mzd_combine or put the same code directly in the calling function made a huge
difference. I blamed it on better cache prefetching, but that was probably
premature.
My proposal:
- This evening I'll update my code with your 8 Gray tables and check the
performance on the C2D
- Then I'll re-introduce SSE2 and check whether it makes a worthy difference,
if not we drop SSE2 from the multiplication.
Martin
PS: I tried a quick and dirty OpenMP (which is cool, btw) based parallel
implementation of Strassen-Winograd yesterday, and as-is it gives a speedup
of 1.8 or so (so not optimal yet). But comparing that with Magma feels like
cheating: first we should aim for better speed with the same resources, and
then switch to parallel implementations for even better times. Anyway, a week
ago I wouldn't have believed that I could do a 10^4 x 10^4 matrix
multiplication in 1.7 seconds on my notebook :-)
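For the curious, the OpenMP toy amounts to little more than a pragma on row-independent work (a sketch, not the actual Strassen code; without -fopenmp the pragma is ignored and the loop simply runs serially):

```c
#include <stdint.h>
#include <stddef.h>

/* Toy parallel GF(2) matrix addition: the output rows are
   independent, so a single "parallel for" suffices.  Compiled
   without -fopenmp the pragma is ignored and the behaviour is
   identical, just serial. */
void mat_xor(uint64_t *c, const uint64_t *a, const uint64_t *b,
             size_t rows, size_t width)
{
    #pragma omp parallel for
    for (long i = 0; i < (long)rows; i++)
        for (size_t j = 0; j < width; j++)
            c[i * width + j] = a[i * width + j] ^ b[i * width + j];
}
```

The real win would come from running the independent Strassen-Winograd products in parallel rather than the additions, but the pattern is the same.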
Okay, so a good compromise is to remove all SSE2 stuff from the main function
_mzd_mul_m4rm_impl and put it in a static inline _mzd_combine8 function which
is specifically tailored to this particular application. Thus the code
still looks relatively pretty/elegant, but we can have SSE2 support.
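In sketch form the idea looks like this (a hypothetical scalar version; the real _mzd_combine8 would use __m128i operations internally): eight table rows are XORed in per load/store of the destination row, so the arithmetic-to-memory ratio is high enough for SSE2 to pay off.

```c
#include <stdint.h>
#include <stddef.h>

/* XOR eight precomputed Gray-table rows into the destination row c
   in a single pass: one load and one store of c per word, amortised
   over eight XORs.  This is the shape of function an SSE2 version
   can be dropped into without uglifying the caller. */
static inline void combine8(uint64_t *c, size_t width,
                            const uint64_t *t1, const uint64_t *t2,
                            const uint64_t *t3, const uint64_t *t4,
                            const uint64_t *t5, const uint64_t *t6,
                            const uint64_t *t7, const uint64_t *t8)
{
    for (size_t j = 0; j < width; j++)
        c[j] ^= t1[j] ^ t2[j] ^ t3[j] ^ t4[j]
              ^ t5[j] ^ t6[j] ^ t7[j] ^ t8[j];
}
```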
Martin
Bill,
some progress report for the C2D:
I incorporated your changes with the following modifications:
- if (x1==0 && x2==0 && x3==0 ... && x8==0) ... I removed this one since it
seems rather unlikely that this is ever true
- I added a #define called GRAY8 which determines whether 8 or 4 tables ought
to be used. I figured this might be handy for machines with a smaller L1.
- I added the correct a_nc%k values back in
- We don't need to make sure that the allocated buffers are correctly aligned,
since we try to allocate with _mm_malloc. If that is not available we should
probably just use posix_memalign.
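The fallback mentioned in the last point might look like this (a sketch assuming a POSIX system; real code would pick _mm_malloc or this at configure time):

```c
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

/* Allocate size bytes aligned to a 16-byte boundary so that SSE2
   loads/stores can assume alignment.  posix_memalign is the POSIX
   fallback when _mm_malloc is not available; free() releases the
   memory as usual. */
void *aligned_alloc16(size_t size)
{
    void *p = NULL;
    if (posix_memalign(&p, 16, size) != 0)
        return NULL;
    return p;
}
```

With every row allocated this way, the combine functions can use aligned SSE2 loads unconditionally and drop the per-call alignment branch.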
The speed-up on the C2D (similar to the Opteron) is considerable (the last
column is a parallel toy implementation):
64-bit Debian/GNU Linux, 2.33Ghz Core2Duo, cutoff=2^12 (*)
Matrix Dimension Magma M4RI M4RI (OpenMP), walltime
10,000 x 10,000 2.920 2.270 1.470
16,384 x 16,384 11.140 9.130 5.540
20,000 x 20,000 20.370 16.110 11.800
32,000 x 32,000 75.510 64.340 40.040
Amazingly M4RM alone (w.o. Strassen-Winograd) now beats Magma up to 2*10^4 x
2*10^4 in this configuration:
sage: A = random_matrix(GF(2),10^3, 10^3);
sage: B = random_matrix(GF(2),10^3, 10^3);
sage: magma(A._multiply_m4rm(B)) == magma(A)*magma(B)
True
sage: A = random_matrix(GF(2),2*10^4, 2*10^4);
sage: B = random_matrix(GF(2),2*10^4, 2*10^4);
sage: time A._multiply_m4rm(B)
CPU times: user 18.32 s, sys: 0.10 s, total: 18.42 s
Wall time: 18.65
64-bit Debian/GNU Linux, 1.8Ghz Opteron (sage.math), cutoff=2^11
Matrix Dimension Magma 2.13-5 (64-bit) M4RI-20080518 (64-bit)
10,000 x 10,000 4.190 4.290
16,384 x 16,384 15.360 15.230
20,000 x 20,000 29.530 28.640
32,000 x 32,000 103.970 114.620
That does seem to roughly match what you reported. I'll now look into SSE2
support.
Cheers,
Martin
(*) Note: Magma is 64-bit but not optimised for C2D.
Okay, I added SSE2 support again and the timings are pretty good on the C2D:
Dimension Old New
10000 x 10000 2.270 1.720
16384 x 16384 9.130 6.760
20000 x 20000 16.110 12.310
32000 x 32000 64.340 50.690
Throwing parallelism in the mix (still lame implementation):
Dimension Old New
10000 x 10000 1.470 1.220
16384 x 16384 5.540 4.390
20000 x 20000 11.800 8.580
32000 x 32000 40.040 32.810
Btw. Mike Hansen pointed out on IRC that GAP has a pretty fast implementation
of matrix multiplication too:
GAP4, Version: 4.4.10 of 02-Oct-2007, x86_64-unknown-linux-gnu-gcc
gap> A := RandomMat(10000,10000,GF(2));
<a 10000x10000 matrix over GF2>
gap> B := RandomMat(10000,10000,GF(2));
<a 10000x10000 matrix over GF2>
gap> C := A*B;
<a 10000x10000 matrix over GF2>
gap> time;
5951
The unit here is ms, so this takes about 6 seconds. However, the generation of
random matrices takes forever. Mike also pointed out that GAP is twice as fast
as the current Sage code (i.e. the code before the improvements discussed in
this thread) for the example he tried.
On sage.math things don't improve as expected:
sage: A = random_matrix(GF(2),32000,32000)
sage: B = random_matrix(GF(2),32000,32000)
sage: time C = A._multiply_strassen(B,cutoff=2^11)
CPU times: user 121.69 s, sys: 3.93 s, total: 125.62 s
Wall time: 125.62
This was 114.620 before.
Martin
My computer is a MacBook Pro, so it is one of those 64-bit Intel (OS X)
machines, but I'm running Debian/GNU Linux. According to
https://magma.maths.usyd.edu.au/magma/export/x86_64-linux/
there is a special Intel64 version of Magma 2.14. But even though I have a
license to use it, I can't download it, since my university keeps the login
data for Magma and only puts versions on an internal server for me to grab. So
basically I have to wait until they have grabbed the Intel64 version for me.
Maybe William could run some benchmarks on his machine, which is identical to
mine (except that I upgraded my RAM and he is running OS X, not Linux)?
> It's amazing how much difference the SSE makes on your machine. I
> believe the AMD essentially uses its MMX or SSE hardware to read in
> cache lines, so basically unless you are doing something requiring
> lots of wide arithmetic/logic, you aren't going to get anything more
> out of the chip.
>
> I look forward to seeing the new code now that you've cleaned it up.
The tarball is here:
http://m4ri.sagemath.org/downloads/m4ri-20080519.alpha0.tar.gz
and the SPKG is here:
http://sage.math.washington.edu/home/malb/spkgs/libm4ri-20080519.p0.spkg
The SPKG needs a patch:
http://sage.math.washington.edu/home/malb/new_m4ri_2.patch
> I'm going to try and figure out what GAP does, in case there's any
> ideas we missed. It's surely old code, but there might be lots of
> interesting things in there.
I'll also check again but it seems they are doing M4RM with a fixed k and
matrix blocking.
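For readers following along, here is a minimal sketch of M4RM with a fixed k,
which is what GAP appears to do. This is my own illustration, not GAP's or
M4RI's actual code: rows are stored as Python ints for brevity, whereas the
real libraries work on packed 64-bit machine words.

```python
def m4rm(A, B, n, k=4):
    """Multiply two n x n GF(2) matrices, rows stored as Python ints
    (bit j of A[i] is the entry in row i, column j).  Method of the
    Four Russians: for each strip of k columns of A, precompute all
    2^k GF(2) combinations of the matching k rows of B, then handle
    k columns per table lookup instead of one."""
    C = [0] * n
    for s in range(0, n, k):
        kk = min(k, n - s)                # last strip may be narrower
        # table[x] = XOR of rows B[s+j] for every set bit j of x;
        # built incrementally, one XOR per entry (Gray-code style)
        table = [0] * (1 << kk)
        for x in range(1, 1 << kk):
            low = x & -x                  # lowest set bit of x
            table[x] = table[x ^ low] ^ B[s + low.bit_length() - 1]
        for i in range(n):
            idx = (A[i] >> s) & ((1 << kk) - 1)
            C[i] ^= table[idx]            # one lookup covers k columns
    return C
```

The point of the fixed-k table is that each row of the product costs n/k
lookups instead of n row XORs, giving the O(n^3/log n) behaviour mentioned
earlier once k is chosen proportional to log n.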
> Anyhow, who would have thought that one would see 1.22s for a
> 10000x10000 matrix multiply. That's pretty exciting.
Yeah, good work Bill!
Martin
PS: I now actually believe that it is possible that Magma uses M4RM (though
maybe not M4RI). If GAP has it and it is old code, I don't see why Magma
wouldn't. So having a hard time beating them isn't that implausible any more,
since we don't have a better algorithm just like that.
I am finally up to date with this discussion (I was being interviewed,
and then flying, when it started).
First, congrats on the great job you have achieved. I have started to
dive into m4ri, and I really like the quality of the code.
I have a few remarks:
* The loop unrolling technique used for the creation of the table could
maybe be used in the computation of the product as well.
Is 8 optimal? I remember seeing 32 in ATLAS, but I don't know of any
justification. Since some pipelines are longer than 8, it might be
better to have a longer unrolled loop.
* I am not sure about this, but wouldn't it be better to have a block
decomposition that matches the babystep-giantstep structure?
This could happen at the Strassen threshold: instead of simply copying
the matrix (which already improves data locality), copy it into a
bunch of blocks of size blocksize and call m4rm on that structure. ATLAS
does this kind of copying for dimensions not larger than 200, if I
recall correctly.
Maybe I am just missing something about your babystep/giantstep algorithm.
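To make the suggestion concrete, here is a toy sketch of the copy described
above (helper names are hypothetical, and rows are stored as Python ints
rather than M4RI's actual packed representation): at the Strassen threshold
one would copy the operand into contiguous blocksize x blocksize tiles, so a
blocked multiply walks one small tile at a time instead of striding across
long rows.

```python
def to_tiles(M, n, bs):
    """Copy an n x n GF(2) matrix (rows stored as Python ints) into
    bs x bs tiles, each tile a contiguous list of int rows, keyed by
    its top-left coordinate (bi, bj)."""
    mask = (1 << bs) - 1
    return {(bi, bj): [(M[bi + r] >> bj) & mask
                       for r in range(min(bs, n - bi))]
            for bi in range(0, n, bs)
            for bj in range(0, n, bs)}

def from_tiles(tiles, n, bs):
    """Inverse copy, useful for checking that no bits are lost."""
    rows = [0] * n
    for (bi, bj), tile in tiles.items():
        for r, bits in enumerate(tile):
            rows[bi + r] |= bits << bj
    return rows
```

The copy itself is O(n^2), so it is cheap relative to the multiply; whether it
pays off below the cutoff is exactly the open question raised here.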
Anyway, as you pointed out, the battle is now on the asymptotic
comparison with Magma, and I still have no idea how to improve your
Strassen implementation. Still thinking about it...
Cheers,
Clément
Bill Hart wrote:
Clement, since you are actively doing research on fast mod-3 matrix
multiplication, any chance you could spend 2-3 paragraphs here and
advertise that? It would fit nicely in this thread.
-- William
Yes, I'll fix that.
> By the way, I wrote some code ages ago for computing the row echelon
> form of sparse matrices over GF2 which essentially inverts 6000x6000
> sparse matrices in about a second.
Excellent, is the code available somewhere? How dense are these matrices?
My plan for M4RI is:
- polish what we have for multiplication now, find appropriate/acceptable
parameters for a wide range of platforms, make sure it builds everywhere, etc.
- during dev1 Greg and I are going to work on Strassen + M4RI/echelonisation.
This is the actual application we are most interested in.
- once that is done, I want to look into sparse matrices (which are probably
much more relevant for my work than dense matrices), so I'd be interested in
looking into your old code. I'm not sure this should go under the roof of the
M4RI library, but if it does then this probably warrants a name change :-)
Martin
Same here, I'll look into it right away. "Only" Strassen fails, so it
shouldn't be too hard to figure out.
> Also I later replaced the following lines of strassen.c:
>
> a -= a%RADIX;
> b -= b%RADIX;
> c -= c%RADIX;
>
> with
>
> unsigned long mult = 1;
> unsigned long width = a;
> while (width > 2*cutoff)
> {
>     width /= 2;
>     mult *= 2;
> }
> a -= a % (RADIX*mult);
> b -= b % (RADIX*mult);
> c -= c % (RADIX*mult);
>
> and this sped up the 32000x32000 multiplication by a further 5% or so.
> The other times didn't change (the 10000x10000 may have been slightly
> quicker).
>
> :-)
Good, I'll add it and compare times on the C2D :-)
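For reference, a quick standalone sanity check of what Bill's replacement
achieves (a Python translation of the C snippet, with RADIX = 64 as in M4RI):
after the rounding, every dimension reached by repeatedly halving down to the
cutoff is still a multiple of the word size, so no recursive call has to deal
with a partial word.

```python
RADIX = 64  # bits per machine word, as in M4RI

def round_down(a, cutoff):
    """Bill's rounding: make `a` divisible by RADIX * 2^levels, where
    `levels` is how many times Strassen will halve the dimension
    before dropping below the cutoff."""
    mult = 1
    width = a
    while width > 2 * cutoff:
        width //= 2
        mult *= 2
    return a - a % (RADIX * mult)

a = round_down(32000, 2**11)
# every dimension reached by repeated halving stays word-aligned
while a > 2 * 2**11:
    assert a % RADIX == 0
    a //= 2
```

With the old rounding only the top-level dimension was a multiple of RADIX;
the halves could misalign, which plausibly explains the ~5% gain at 32000.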
Martin
I've uploaded a new version to
http://m4ri.sagemath.org/downloads/m4ri-20080521.tar.gz
which
- fixes the bug described above
- adds your code snippet
- fixes a couple of other problems reported by Valgrind.
It seems the last bugfix also speeds things up (on my notebook), since we
computed too much and wrote it to unallocated memory. Updated
performance figures:
64-bit Debian/GNU Linux, 2.33GHz Core2Duo (MacBook Pro, 2nd gen.)

Matrix Dimension      Magma     GAP      M4RI
10,000 x 10,000       2.920     6.691    1.760
16,384 x 16,384      11.140    36.063    6.760
20,000 x 20,000      20.370      -      12.200
32,000 x 32,000      74.260      -      51.510
I'm having trouble believing those figures but all tests pass (including some
equality checks with Magma via Sage). On sage.math things don't seem to get
any better but it is hard to tell since it is often so loaded. Also, this
version only enables SSE2 on Intel CPUs and has a configure switch for OpenMP
(--enable-openmp).
Martin
PS: One thing that seems significant on sage.math: the sys time is
considerable, e.g.:
sage: A = random_matrix(GF(2),32000,32000)
sage: B = random_matrix(GF(2),32000,32000)
sage: time C = A._multiply_strassen(B,cutoff=2^11)
CPU times: user 108.02 s, sys: 5.03 s, total: 113.05 s
Wall time: 113.05
Bill, I suppose that also means that we now actually beat (or are close to
beating) Magma on the C2D "for real". My M4RI times on the C2D are quite
similar to your times on your Opteron, but my version of Magma (on the C2D)
is much worse than your version of Magma (on the Opteron). So it is probably
best to assume at least your Magma times on my machine too.
Martin
PS: I wonder if this argument makes sense:
We have a complexity of n^2.807 and a CPU of, say, 2.333 GHz which can operate
on 64 bits per clock (128 if we use SSE2). So if we had optimal code (no
missed branch predictions, no caching issues, everything optimal) we would
expect a running time of n^2.807 / 64 / (2.333 * 10^9) seconds.
If we plug in 20,000 for n we get 7.923 seconds without SSE2 and 3.961 with
SSE2. So our implementation (12.2 s) is a factor of ~1.5 (or ~3 with SSE2)
away from being optimal? Does that sound correct or complete bollocks?
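The arithmetic in this estimate, spelled out (same assumptions as above: one
64-bit word XORed per cycle, and Strassen's constant factor ignored):

```python
# Back-of-envelope check of the "optimal time" estimate above.
n = 20000
clock_hz = 2.333e9       # Core2Duo clock rate
bits_per_op = 64         # one machine word XORed per cycle (128 w/ SSE2)

ops = n ** 2.807                          # Strassen-exponent op count
t_no_sse2 = ops / bits_per_op / clock_hz
t_sse2 = ops / (2 * bits_per_op) / clock_hz
print(t_no_sse2, t_sse2)                  # roughly 7.9 s and 4.0 s
```

Note this ignores the multiplicative constant of Strassen-Winograd entirely,
so it is a lower bound on the lower bound, so to speak.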
>
> Bill, I suppose that also means that now we actually beat (or are close to
> beating) Magma on the C2D "for real". My M4RI times are quite similar on the
> C2D as your times on your Opteron. But my version of Magma (on the C2D) is
> much worse than your version of Magma (on the Opteron). So it is probably
> best to assume at least your times for Magma on my machine too.
>
That's awesome news!
> PS: I wonder if this argument makes sense:
>
> We have a complexity of n^2.807 and a CPU of say 2.333 Ghz which can operate
> on 64 bits per clock (128 if we use SSE2). So if we had optimal code (no
> missed branch predictions, no caching issues, everything optimal) we would
> expect a running time of n^2.807 / 64 / 2.333 / 10^9
Don't forget the constants!
Strassen-Winograd is 6n^2.807.
Now this constant corresponds to both muls and adds, and I guess that your
boolean word operation ^= computes a + and a * in one clock cycle, so I
don't really know the constant in this case (6/2 = 3 seems dubious to me).
Furthermore, as Bill pointed out, one really has to count the actual
number of ops, since only a few recursive calls are made.
Even so, this would mean that the expected optimal time would be
larger, and consequently that your code is closer to optimal!