slightly OT: new M4RI library

Martin Albrecht

May 14, 2008, 5:52:43 PM
to Sage Development
Oh gods of the cpu cycle ... or hi there,

(this e-mail contains details on a particular implementation for
GF(2) linear algebra, feel free to ignore it if that doesn't get you going)

I've just submitted a new (much improved) version of the M4RI library for
inclusion in Sage:

http://trac.sagemath.org/sage_trac/ticket/3204

M4RI is a library by Gregory Bard and myself for fast arithmetic with dense
matrices over GF(2). M4RI stands for "Method of the Four Russians" Inversion,
an algorithm for matrix reduction first published by Greg.
The website is here:

http://sage.math.washington.edu/home/malb/m4ri/

This new version has, among other things, two new features:
- SSE2 support where available (XOR of two 128-bit words in one instruction)
- Strassen-Winograd matrix multiplication on top of M4RM ("Method of the Four
Russians" multiplication, or Kronrod's algorithm)

I'm writing to [sage-devel] because I observed some odd performance
characteristics and hope to pick your brains on how to improve things:

At least in my experience, enabling SSE2 results in slower code on AMD CPUs. I
think this is rumoured across the net, but did anyone else witness this
behaviour? I'm only using 128-bit integer XOR intrinsics. On my Core2Duo
notebook SSE2 speeds up the computation at least up to the size of the L2
cache.

One would expect the M4RM + Strassen implementation to have an easy time
beating Magma, since it uses a better algorithm. Like Magma, it does
Strassen-Winograd down to the cutoff, but then it switches to the M4RM
algorithm (O(n^3/log_2(n))) rather than naive multiplication (O(n^3)). This
should give a nice constant speed-up. So far this theory seems to match
reality on a Core2Duo notebook with 4MB L2 cache:

----------------------------------------------------------------------
| SAGE Version 3.0.1, Release Date: 2008-05-05 |
| Type notebook() for the GUI, and license() for information. |
----------------------------------------------------------------------
sage: A = random_matrix(GF(2),10^4,10^4)
sage: B = random_matrix(GF(2),10^4,10^4)
sage: time C = A._multiply_strassen(B,cutoff=3200)
CPU times: user 4.50 s, sys: 0.03 s, total: 4.53 s
Wall time: 4.56

Magma V2.13-5 Wed May 14 2008 21:55:31 on road [Seed = 2246483032]
Type ? for help. Type <Ctrl>-D to quit.
> A := RandomMatrix(GF(2),10^4,10^4);
> B := RandomMatrix(GF(2),10^4,10^4);
> t := Cputime(); C := A*B; Cputime(t);
7.170

sage: A = random_matrix(GF(2),2*10^4,2*10^4)
sage: B = random_matrix(GF(2),2*10^4,2*10^4)
sage: time C = A._multiply_strassen(B,cutoff=3200)
CPU times: user 33.35 s, sys: 0.09 s, total: 33.44 s
Wall time: 33.59

> A := RandomMatrix(GF(2),2*10^4,2*10^4);
> B := RandomMatrix(GF(2),2*10^4,2*10^4);
> t := Cputime(); C := A*B; Cputime(t);
49.990

Finally, a perfect cutting example (no extra rows/columns to take care of):

sage: A = random_matrix(GF(2),1<<14,1<<14)
sage: B = random_matrix(GF(2),1<<14,1<<14)
sage: time C = A._multiply_strassen(B,cutoff=1<<10) # !=3200
CPU times: user 16.82 s, sys: 0.07 s, total: 16.88 s
Wall time: 16.96

> A:= RandomMatrix(GF(2),2^14,2^14);
> B:= RandomMatrix(GF(2),2^14,2^14);
> t := Cputime(); C := A*B; Cputime(t);
24.060

Note however that this comparison is NOT FAIR since this version of Magma is
not optimised for the C2D chip!

However, on AMD CPUs the times don't look that nice. Both on sage.math and on
an

AMD Athlon(tm) 64 X2 Dual Core Processor 6400+

the performance is way, way behind Magma (or what I'd expect Magma to perform
like), e.g. on sage.math:

sage: A = random_matrix(GF(2),10^4,10^4)
sage: B = random_matrix(GF(2),10^4,10^4)
sage: time C = A._multiply_strassen(B,cutoff=1500)
CPU times: user 14.40 s, sys: 0.17 s, total: 14.58 s
Wall time: 14.62

Magma V2.13-5 Wed May 14 2008 14:07:18 on sage [Seed = 4255048009]
Type ? for help. Type <Ctrl>-D to quit.
> A := RandomMatrix(GF(2),10^4,10^4);
> B := RandomMatrix(GF(2),10^4,10^4);
> t := Cputime(); C := A*B; Cputime(t);
3.890 <<--------- They even beat the C2D time!

sage: A = random_matrix(GF(2),2*10^4,2*10^4)
sage: B = random_matrix(GF(2),2*10^4,2*10^4)
sage: time C = A._multiply_strassen(B,cutoff=1500)
CPU times: user 99.42 s, sys: 0.64 s, total: 100.06 s
Wall time: 100.06

> B := RandomMatrix(GF(2),2*10^4,2*10^4);
> A := RandomMatrix(GF(2),2*10^4,2*10^4);
> t := Cputime(); C := A*B; Cputime(t);
27.470

# perfect cutoff
sage: A = random_matrix(GF(2),1<<14,1<<14)
sage: B = random_matrix(GF(2),1<<14,1<<14)
sage: time C = A._multiply_strassen(B,cutoff=1<<10)
CPU times: user 100.14 s, sys: 0.32 s, total: 100.46 s
Wall time: 100.45
sage: time C = A._multiply_strassen(B,cutoff=1<<9)
CPU times: user 76.48 s, sys: 0.39 s, total: 76.87 s
Wall time: 76.87
sage: time C = A._multiply_strassen(B,cutoff=1<<8)
CPU times: user 52.45 s, sys: 0.33 s, total: 52.78 s
Wall time: 52.78

> A := RandomMatrix(GF(2),2^14,2^14);
> B := RandomMatrix(GF(2),2^14,2^14);
> t := Cputime(); C := A*B; Cputime(t);
14.970

I guess the bad performance of the M4RI library is partly related to the
smaller L2 cache (AMD: 1MB, C2D: 4MB), which forces us to set the cutoff
lower, since we want all three submatrices in C = A*B to fit in L2 cache.
(At a cutoff of 3200, three 3200 x 3200 bit matrices take 3 * 3200^2/8 bytes,
roughly 3.8MB, which fits in a 4MB cache but not in 1MB.) Also, it seems that
Magma is almost always ~3.5 times faster than our implementation.

I am slowly running out of ideas for what to improve in order to close the
gap, so if anyone has some ideas please let me hear them. It might be worth
noting that we are using the same cutting strategy as Sage in strassen.pyx,
which ignores uneven rows/columns at every recursion step and deals with the
extra rows/columns after the Strassen-Winograd formula (see strassen.pyx for
details). We are also using the improved operation schedule from the paper of
Clement et al. Though I don't know if the cutting strategy is optimal, we
also get beaten in a perfect cutting scenario, so I guess it is not that
important for now.

Anyway, I'd happily provide profiling data if anyone wants to take a look and
squeeze a couple of cycles out. If not, then this e-mail is just a quite long
status report :-)
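For reference, the Strassen-Winograd step looks as follows (my transcription
of the standard 7-multiplication, 15-addition schedule; over GF(2) the
subtractions are of course just additions):

  S1 = A21 + A22      T1 = B12 - B11
  S2 = S1 - A11       T2 = B22 - T1
  S3 = A11 - A21      T3 = B22 - B12
  S4 = A12 - S2       T4 = T2 - B21

  P1 = A11*B11    P2 = A12*B21    P3 = S4*B22    P4 = A22*T4
  P5 = S1*T1      P6 = S2*T2      P7 = S3*T3

  U2 = P1 + P6    U3 = U2 + P7    U4 = U2 + P5
  C11 = P1 + P2   C12 = U4 + P3   C21 = U3 - P4   C22 = U3 + P5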

Btw. I don't have access to Magma 2.14 which I believe to be the fastest in
linear algebra over GF(2). In version 2.14 they added SSE2 support too.
So if anybody could compare the new M4RI with Magma 2.14 I'd be happy to hear
about the results. I don't know what else to compare with except Magma. Any
suggestions? Thoughts? Rants?

Cheers,
Martin

--
name: Martin Albrecht
_pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
_www: http://www.informatik.uni-bremen.de/~malb
_jab: martinr...@jabber.ccc.de

William Stein

May 14, 2008, 8:41:35 PM
to sage-...@googlegroups.com
> Btw. I don't have access to Magma 2.14 which I believe to be the fastest in
> linear algebra over GF(2). In version 2.14 they added SSE2 support too.
> So if anybody could compare the new M4RI with Magma 2.14 I'd be happy to hear
> about the results. I don't know what else to compare with except Magma.

Here are magma V2.14-9 timings on sage.math:

was@sage:~/magma64/bin$ ./magma
Magma V2.14-9 Wed May 14 2008 17:33:23 on sage [Seed = 793568142]


Type ? for help. Type <Ctrl>-D to quit.
> A := RandomMatrix(GF(2),10^4,10^4);
> B := RandomMatrix(GF(2),10^4,10^4);

> time C := A*B;
Time: 4.020
> time C := A*B;
Time: 3.980
> time C := A*B;
Time: 3.990


> A := RandomMatrix(GF(2),2^14,2^14);
> B := RandomMatrix(GF(2),2^14,2^14);

> time C := A*B;
Time: 15.450
> time C := A*B;
Time: 15.420


> Any suggestions? Thoughts? Rants?

Allan Steel is a really amazing programmer.

----

I tried your new code up at #3204 under OS X and get this:

sage: A = random_matrix(GF(2),10^4,10^4)
sage: B = random_matrix(GF(2),10^4,10^4)
sage: time C = A._multiply_strassen(B,cutoff=3200)

sage.bin(39971) malloc: *** error for object 0xb95c010: Non-aligned
pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
sage.bin(39971) malloc: *** error for object 0x79c9c10: Non-aligned
pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
sage.bin(39971) malloc: *** error for object 0x7465a00:
non-page-aligned, non-allocated pointer being freed
*** set a breakpoint in malloc_error_break to debug
sage.bin(39971) malloc: *** error for object 0x79ca610: Non-aligned
pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
...
CPU times: user 10.29 s, sys: 0.26 s, total: 10.55 s
Wall time: 16.31

Maybe you're doing something wrong?

-- William

Martin Albrecht

May 15, 2008, 4:14:39 AM
to sage-...@googlegroups.com
On Thursday 15 May 2008, William Stein wrote:
> > Btw. I don't have access to Magma 2.14 which I believe to be the fastest
> > in linear algebra over GF(2). In version 2.14 they added SSE2 support
> > too. So if anybody could compare the new M4RI with Magma 2.14 I'd be
> > happy to hear about the results. I don't know what else to compare with
> > except Magma.
>
> Here are magma V2.14-9 timings on sage.math:
> was@sage:~/magma64/bin$ ./magma
> Magma V2.14-9 Wed May 14 2008 17:33:23 on sage [Seed = 793568142]
> Type ? for help. Type <Ctrl>-D to quit.
> > A := RandomMatrix(GF(2),10^4,10^4);
> > B := RandomMatrix(GF(2),10^4,10^4);
> > time C := A*B;
> Time: 4.020
>
> > time C := A*B;
> Time: 3.980
> > time C := A*B;
> Time: 3.990
> > A := RandomMatrix(GF(2),2^14,2^14);
> > B := RandomMatrix(GF(2),2^14,2^14);
> > time C := A*B;
> Time: 15.450
> > time C := A*B;
> Time: 15.420

This doesn't seem much different from 2.13, which is good for my purposes.
Once I've fixed the OSX issue mentioned below, could you run the benchmark on
the C2D? I want to see if they can make better use of the 4MB L2 cache.

> ----
>
> I tried your new code up at #3204 under OS X and get this:
>
> sage: A = random_matrix(GF(2),10^4,10^4)
> sage: B = random_matrix(GF(2),10^4,10^4)
> sage: time C = A._multiply_strassen(B,cutoff=3200)
> sage.bin(39971) malloc: *** error for object 0xb95c010: Non-aligned
> pointer being freed (2)
> *** set a breakpoint in malloc_error_break to debug
> sage.bin(39971) malloc: *** error for object 0x79c9c10: Non-aligned
> pointer being freed (2)
> *** set a breakpoint in malloc_error_break to debug
> sage.bin(39971) malloc: *** error for object 0x7465a00:
> non-page-aligned, non-allocated pointer being freed
> *** set a breakpoint in malloc_error_break to debug
> sage.bin(39971) malloc: *** error for object 0x79ca610: Non-aligned
> pointer being freed (2)
> *** set a breakpoint in malloc_error_break to debug
> ...
> CPU times: user 10.29 s, sys: 0.26 s, total: 10.55 s
> Wall time: 16.31
>
> Maybe you're doing something wrong?

Likely, though my Valgrind log is clean under Linux. I'll try under OSX.

Bill Hart

May 15, 2008, 8:46:17 AM
to sage-devel
Hi Martin,

The cycle counter in your bench code seems to give random values on
the 2.4GHz Opteron with SUSE 9 Linux that I have access to, which has
Magma 2.14-10 on it.

Anyhow, here are the Magma times:

> A := RandomMatrix(GF(2),10^4,10^4);
> B := RandomMatrix(GF(2),10^4,10^4);
> t := Cputime(); C := A*B; Cputime(t);
2.940
> A := RandomMatrix(GF(2),2*10^4,2*10^4);
> B := RandomMatrix(GF(2),2*10^4,2*10^4);
> t := Cputime(); C := A*B; Cputime(t);
16.570
> A := RandomMatrix(GF(2),2^14,2^14);
> B := RandomMatrix(GF(2),2^14,2^14);
> t := Cputime(); C := A*B; Cputime(t);
9.250

The corresponding times for your code are approximately (hand timed;
adjusted for generation of random matrices):

10^4: 8s
2*10^4: 59s
2^14: 43s

I think it is polynomials over GF(2), rather than linear algebra, that
Magma 2.14.x is much faster at.

Bill.

Martin Albrecht

May 15, 2008, 1:35:49 PM
to sage-...@googlegroups.com
> Maybe you're doing something wrong?

This bug is fixed in:

http://sage.math.washington.edu/home/malb/spkgs/libm4ri-20080515.p0.spkg

Martin Albrecht

May 15, 2008, 1:42:11 PM
to sage-...@googlegroups.com

Hi Bill,

I cannot explain the random behaviour of the cpucycles counter. The cpucycles
library is by Dan Bernstein, so I'd suspect it is carefully coded. Some
potential sources of variance are: dynticks (i.e. the kernel doesn't have a
fixed tick rate; this is done to save power), CPU frequency scaling, and other
applications. On the other hand, I suspect I don't need to tell you this,
since you've probably benchmarked more code than I have. To account for the
variance I have the small run_bench.py script in the testsuite directory. Run
it like this:

./run_bench.py -t mul -n 10000 -c 3200

It'll run the experiment 17 times (17 for no particular reason) and return the
median runtime. -n is the size and -c the cutoff.

Cheers,
Martin

_jab: martinr...@jabber.ccc.de

William Stein

May 15, 2008, 2:37:49 PM
to sage-...@googlegroups.com

Median? Shouldn't you take the minimum? Are there any good papers
on benchmarking?

By the way -- present for you:

http://m4ri.sagemath.org/

To update that page you have to ssh to sagemath.org.

William

Martin Albrecht

May 15, 2008, 3:01:20 PM
to sage-...@googlegroups.com
> > It'll run the experiment 17 times (17 for no particular reason) and
> > return the median runtime. -n is the size and -c the cutoff.

> Median? Shouldn't you take the minimum? Are there any good papers
> on benchmarking?

I'm using cpucycles from:

http://www.ecrypt.eu.org/ebats/cpucycles.html

which advises:

------------------------

Notes on accuracy

Benchmarking tools are encouraged to record several timings of a function:
call cpucycles(), function(), cpucycles(), function(), etc., and then print
one line reporting the differences between successive cpucycles() results.
The median of several differences is much more stable than the average.

Cycle counts continue to increase while other programs are running, while the
operating system is handling an interruption such as a network packet, etc.
This won't affect the median of several timings of a fast function---the
function usually won't be interrupted---but it can affect the median of
several timings of a slow function. Hopefully a benchmarking machine isn't
running other programs.

On dual-CPU systems (and dual-core systems such as the Athlon 64 X2), the CPUs
often don't have synchronized cycle counters, so a process that switches CPUs
can have its cycle counts jump forwards or backwards. I've never seen this
affect the median of several timings.

Some CPUs dynamically reduce CPU speed to save power, but deliberately keep
their cycle counters running at full speed, the idea being that measuring
time is more important than measuring cycles. Hopefully a benchmarking
machine won't enter power-saving mode.

Cycle counts are occasionally off by a multiple of 2^32 on some CPUs, as
discussed below. I've never seen this affect the median of several timings.

The estimate returned by cpucycles_persecond() may improve accuracy after
cpucycles() has been called repeatedly.

------------------------

I just followed that. I guess whether one takes the minimum or not depends on
the data. If there are (much) easier cases then it is probably not best to
take the minimum. However, for matrix multiplication the times don't really
depend on the entries of the matrices, so I could use the minimum or just
output all three (minimum, average, median).
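In C, the measurement loop looks something like this (a sketch of what
run_bench.py drives, not the actual code; it assumes the cpucycles API,
long long cpucycles(void)):

#include <stdlib.h>

extern long long cpucycles(void); /* D. J. Bernstein's cpucycles */

static int cmp_ll(const void *x, const void *y)
{
    long long a = *(const long long *)x, b = *(const long long *)y;
    return (a > b) - (a < b);
}

/* stamp, call, stamp, call, ... as the cpucycles notes suggest, then
   take the median of the differences between successive stamps */
static long long median_cycles(void (*fn)(void))
{
    enum { RUNS = 17 };
    long long t[RUNS + 1], d[RUNS];
    for (int i = 0; i <= RUNS; i++) {
        t[i] = cpucycles();
        if (i < RUNS)
            fn();
    }
    for (int i = 0; i < RUNS; i++)
        d[i] = t[i + 1] - t[i];
    qsort(d, RUNS, sizeof(long long), cmp_ll);
    return d[RUNS / 2];
}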

> By the way -- present for you:
>
> http://m4ri.sagemath.org/
>
> To update that page you have to ssh to sagemath.org.

Cool, thanks! Any chance that I could get write access to the home directory
so that I can upload my SSH public key?

Bill Hart

May 15, 2008, 3:03:22 PM
to sage-devel
Hi Martin,

Here is a run that illustrates the problem. Am I doing something
wrong?

wbhart@host-57-44:~/m4ri-20080514/testsuite> time ./bench_multiplication 5000 2048
n: 5000, cutoff: 2048, cpu cycles: 2670143764

real 0m2.768s
user 0m2.760s
sys 0m0.008s
wbhart@host-57-44:~/m4ri-20080514/testsuite> time ./bench_multiplication 5000 3000
n: 5000, cutoff: 3000, cpu cycles: 746344039

real 0m3.277s
user 0m3.252s
sys 0m0.024s

Bill.

Martin Albrecht

May 15, 2008, 3:25:32 PM
to sage-...@googlegroups.com
On Thursday 15 May 2008, Bill Hart wrote:
> Hi Martin,
>
> Here is a run that illustrates the problem. Am I doing something
> wrong?

No, I was stupid. The cpucycles are printed as %u but they should be printed
as %llu since they are longer than an int. I've attached the fixed C file
(since it is < 1KB).

_jab: martinr...@jabber.ccc.de

bench_multiplication.c

Bill Hart

May 15, 2008, 4:47:02 PM
to sage-devel
Martin,

Do you think Magma uses naive multiplication for its base case? This
can be ridiculously fast, especially over GF(2). I note for example
that Magma's base case is about 6 times faster than M4RI at 1000x1000.
Is it possible that the naive multiplication can just be optimised
with a far better constant, and that the crossover with M4R is above
the crossover with Strassen-Winograd?

I also note that the M4RI SSE2 code doesn't seem faster than the
generic C code on the Opteron, in fact it seems the other way around.
Very strange.

Bill.

Martin Albrecht

May 15, 2008, 5:23:40 PM
to sage-...@googlegroups.com
On Thursday 15 May 2008, Bill Hart wrote:
> Martin,
>
> Do you think Magma uses naive multiplication for its base case? This
> can be ridicoulously fast, especially over GF2. I note for example
> that Magma's base case is about 6 times faster than M4RI at 1000x1000.
> Is it possible that the naive multiplication can just be optimised
> with a far better constant and that the crossover with m4r is above
> the crossover with Strassen-Winograd?
>
> I also note that the M4RI SSE2 code doesn't seem faster than the
> generic C code on the Opteron, in fact it seems the other way around.
> Very strange.

I observed the same behaviour but have no explanation besides stuff like this:

http://www.theinquirer.net/en/inquirer/news/2003/05/02/sse2-makes-opterons-slower-than-athlon-xps

How do you know what the Magma base case is? Can you tell Magma somehow to
print its crossover? I'll look into plugging in naive multiplication as the
base case (after optimisation) to see if your idea is right, though I doubt
that the crossover to M4RM is beyond the crossover to Strassen-Winograd.

Martin

_jab: martinr...@jabber.ccc.de

Bill Hart

May 15, 2008, 5:41:18 PM
to sage-devel
Oh, actually I have no idea where Magma's crossover is. I'll wager it
is somewhere between 4000x4000 and 6000x6000, but let's not speculate.
I'll try and work it out with some timings.

Bill.

Bill Hart

May 15, 2008, 6:33:23 PM
to sage-devel
Here is the graph of Magma times:

http://sage.math.washington.edu/home/wbhart/flint-trunk/graphing/gf2.png

The crossover is not clear. The change from a smooth curve to a
squiggly line is about 3600. So presumably that is it, but the graph
also seems to change character at about 6200 or 7000 as well. One of
these changes may be cache related.

At 3600, the total data for all three matrices is almost 5mb and the
cache on my machine is 1024kb. But if Magma is using classical
multiplication, then this is pretty much irrelevant anyway, since you
can keep the working data within the cache for quite a while during
the algorithm.

Bill.

Martin Albrecht

May 15, 2008, 7:03:19 PM
to sage-...@googlegroups.com
On Thursday 15 May 2008, Bill Hart wrote:
> Here is the graph of Magma times:
>
> http://sage.math.washington.edu/home/wbhart/flint-trunk/graphing/gf2.png
>
> The crossover is not clear. The change from a smooth curve to a
> squiggly line is about 3600. So presumably that is it, but the graph
> also seems to change character at about 6200 or 7000 as well. One of
> these changes may be cache related.
> At 3600, the total data for all three matrices is almost 5mb and the
> cache on my machine is 1024kb. But if Magma is using classical
> multiplication, then this is pretty much irrelevant anyway, since you
> can keep the working data within the cache for quite a while during
> the algorithm.

On the other hand, a squiggly line is what one would expect for
Strassen-Winograd, due to the extra rows/columns that have to be taken care
of. In any case, 3600 seems rather late for a 1MB L2 cache. I think Allan
Steel once gave a talk about his implementation and stated that they don't
use classical block multiplication (I saw some slides with that remark).

Martin

_jab: martinr...@jabber.ccc.de

Bill Hart

May 15, 2008, 8:41:02 PM
to sage-devel
I think it might just be possible to get down to the speed of Magma
with a highly optimised classical multiplication routine. At 3600x3600
one clearly has to do 3600x3600 scalar products of a row by a column.
We'll assume one of the matrices has been transposed to facilitate
this.

If we use SSE2 then we can operate 128 bits at a time. There are 16
SSE registers.

The basic idea would be to load 4 SSE registers from different rows of
matrix A and 2 from different columns of matrix B. We compute the
scalar products of all 8 combinations of rowsxcolumns simultaneously.
For this we need 2 temporary registers and 8 registers to hold the
running totals. So all in all we need 4+2+2+8 = 16 registers. We only
need to do an AND and an XOR to do the multiplication and addition
required.

Caching becomes irrelevant if we choose a large selection of rows from
A and a large selection of columns from B and do all the possible
scalar products of rows by columns before moving to the next part.

Assuming the Opteron can do the memory loads at the same time as doing
arithmetic not depending on those loads, careful instruction
scheduling should get everything down to the cost of an SSE AND and an
SSE XOR per 128x128 bit pair in the classical algorithm. That puts the
entire algorithm pretty close to what Magma is doing timewise.
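For one row/column pair the inner kernel would look something like this
(just a sketch of the AND/XOR idea, without the 4x2 register blocking
described above; the names are mine, not M4RI's):

#include <emmintrin.h>
#include <stdint.h>
#include <stddef.h>

/* scalar product of two bit-packed vectors over GF(2): AND 128 bits
   at a time (multiplication), XOR into an accumulator (addition),
   then take the parity of the accumulated 128-bit word */
static int scalar_product_gf2(const uint64_t *row, const uint64_t *col,
                              size_t nwords /* multiple of 2 */)
{
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < nwords; i += 2) {
        __m128i a = _mm_loadu_si128((const __m128i *)(row + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(col + i));
        acc = _mm_xor_si128(acc, _mm_and_si128(a, b));
    }
    uint64_t lo = (uint64_t)_mm_cvtsi128_si64(acc);
    uint64_t hi = (uint64_t)_mm_cvtsi128_si64(_mm_unpackhi_epi64(acc, acc));
    /* parity distributes over XOR, so this is the parity of all 128 bits */
    return __builtin_parityll(lo ^ hi);
}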

Another option is to not use SSE and just use the 64 bit integer
registers. The disadvantage is one has to load things 64 bits at a
time. But the advantage is the Opteron can do three arithmetic
operations per cycle if properly scheduled. That gets 50% more work
done than the SSE registers, which can only do 1 x 128 bit operation
per cycle.

Of course I'm making the assumption here that the Opteron can indeed
do loads at the same time as arithmetic.

If not, then there is no way Magma can be using classical
multiplication out to 3600x3600, since there are simply too many
operations to perform in the number of cycles available.

Bill.

Bill Hart

May 16, 2008, 8:53:39 AM
to sage-devel
I made some changes to the original code so it would use the cache a
bit better. The test code seems to pass, so I don't think I've screwed
anything up.

The time for 16384x16384 on my machine is now 20s, so a factor of
2.15x faster. The time for 2000x2000 also seems to be the same as
Magma's now. Hopefully I didn't just mistime things before, and this
is a real improvement.

I am still trying things out.

Bill.

Bill Hart

May 16, 2008, 9:16:27 AM
to sage-devel
10000x10000 is down to 4.5s now, so nearly a 2x speedup.

20000x20000 is down to 32.0s, so again nearly a 2x speedup.

Bill.

Bill Hart

May 16, 2008, 9:37:20 AM
to sage-devel
Here is the modified _mzd_mul_m4rm_impl function which gives this
speedup:

http://sage.math.washington.edu/home/wbhart/m4rmul.c

I used a crossover of 3600 and I indicate at the top of this file how
constants should be changed to get the speedups for the various values
of n. I didn't put any code in to automatically choose the correct
values.

I am presuming that the test_multiplication code actually tests this
function, otherwise maybe my code is just rubbish.

The basic idea is to cache-block the A matrix into groups of BLOCK
rows. If one also cache-blocked the B matrix and left the computed
Gray tables in memory instead of writing over them and recalculating
them all the time, as I currently do, one could probably get the other
factor of 2 that we need.

Note I've turned SSE off in this function, but left it on in
_mzd_combine since it makes no real difference there.

Bill.

Bill Hart

May 16, 2008, 10:59:42 AM
to sage-devel
I tried cache blocking matrix B, but the times are exactly the same. I
think the processor is happy to keep loading B one row at a time
sequentially, and since it is only done a handful of times in the
algorithm, it accounts for little of the runtime.

It might go faster in combination with storing the Gray tables, but
the optimal block size seems to be about 100 (*k) for the matrix B,
which would necessitate storage of 100 Gray tables, which is an awful
lot of memory. I think it is probably best to avoid this kind of
memory usage.

So I am not sure where the other factor of 2 will come from.

Bill.

Bill Hart

May 16, 2008, 11:00:20 AM
to sage-devel
P.S: I also tried cache hints, but no luck. They just slow it down.

Bill.

Martin Albrecht

May 16, 2008, 11:24:08 AM
to sage-...@googlegroups.com
On Friday 16 May 2008, Bill Hart wrote:
> P.S: I also tried cache hints, but no luck. They just slow it down.
>
> Bill.

Bill, thanks so much for looking into this! It is very much appreciated. I'm
going to read/try your patch right away!

Martin

PS: I have a slight delay in my replies because my provider's spam filter
consistently flags you and mabshoff as spam for some reason.

_jab: martinr...@jabber.ccc.de

Bill Hart

May 16, 2008, 2:26:24 PM
to sage-devel
Here are the times I get for Magma vs M4RI now. Note the crossover
between the two programs is now above about 5000 and M4RI beats Magma
below that point. This suggests the remaining factor of 2 is in the
Strassen-Winograd function. Probably Strassen-Winograd is falling out
of L2 cache (the previous adjustments I made were to prevent the M4R
algorithm falling out of L1 cache).

The other possibility is that Magma combines the two algorithms so
that there is even greater usage of the Gray code tables. This would
be an ugly hack, but could work.

40000x40000:
Magma: 112.6s
M4RI: 232.4s

20000x20000:
Magma: 16.40s
M4RI: 32.34s

10000x10000:
Magma: 2.750s
M4RI: 4.529s

5000x5000:
Magma: 0.700s
M4RI: 0.672s

2500x2500:
Magma: 0.13s
M4RI: 0.079s

1250x1250:
Magma: 0.015s
M4RI: 0.012s

625x625:
Magma: 0.0030s
M4RI: 0.0023s

312x312:
Magma: 0.0014s
M4RI: 0.00032s

156x156:
Magma: 0.001s
M4RI: 0.0001s

If I get some time I'll look into this.

Did those changes work for you Martin?

Bill.

Martin Albrecht

May 16, 2008, 6:05:58 PM
to sage-...@googlegroups.com
On Friday 16 May 2008, Bill Hart wrote:
> Here are the times I get for Magma vs M4RI now. Note the crossover
> between the two programs is now above about 5000 and M4RI beats Magma
> below that point. This suggests the remaining factor of 2 is in the
> Strassen-Winograd function. Probably Winograd-Strassen is falling out
> of L2 cache (the previous adjustments I made were to prevent the M4R
> algorithm falling out of L1 cache).
>
> The other possibility is that Magma combines the two algorithms so
> that there is even greater usage of the Gray code tables. This would
> be an ugly hack, but could work.

Are you suggesting Magma uses M4RM too? I'd doubt that, since they don't
state that anywhere. Probably I'm just misunderstanding you.

Yes, the change worked like a charm. I made some changes (the fixed k is
replaced by a k that depends on the new block dimensions etc.) and it is
much, much faster now. I'm working on re-introducing SSE2 now, to see if, at
least on the Core2Duo, it makes the world a better place. Btw. all the stuff
I wrote about the L2 cache size C2D vs. Opteron was bollocks. The reason I
beat Magma so badly earlier was that I used a 32-bit Magma and compared it
with a 64-bit version of M4RI. To state it clearly: we don't beat Magma
anywhere. Anyhow, here are the times:

64-bit Debian/GNU Linux, 2.33GHz Core2Duo
Matrix Dimension     Magma 2.14-13 (64-bit)   M4RI-20080517 (64-bit)
10,000 x 10,000       2.920                     4.130
16,384 x 16,384      11.140                    15.740
20,000 x 20,000      20.370                    28.950

64-bit Debian/GNU Linux, 1.8GHz Opteron (sage.math)
Matrix Dimension     Magma 2.13-5 (64-bit)    M4RI-20080517 (64-bit)
10,000 x 10,000       3.930                     7.860
16,384 x 16,384      16.230                   104.77 ???
20,000 x 20,000      27.080                    56.420

(My university today finally granted me access to Magma 2.14 for my notebook.)

As you can see, there is an odd (reproducible) spike at 2^14 x 2^14. I cannot
explain it for now; it might just be a bug. I have a similar spike in another
run on another AMD (Athlon X2) machine:

n: 2048, cutoff: 1024, speedup: 1.06, m4rm: 0.01 strassen: 0.01
n: 3072, cutoff: 1536, speedup: 0.87, m4rm: 0.04 strassen: 0.05
n: 4096, cutoff: 2048, speedup: 1.13, m4rm: 0.15 strassen: 0.13
n: 5120, cutoff: 2560, speedup: 1.35, m4rm: 0.39 strassen: 0.29
n: 6144, cutoff: 3072, speedup: 1.20, m4rm: 0.66 strassen: 0.55
n: 7168, cutoff: 3584, speedup: 1.87, m4rm: 1.64 strassen: 0.88
n: 8192, cutoff: 4096, speedup: 4.48, m4rm: 6.07 strassen: 1.35
>>> n: 9216, cutoff: 4608, speedup: 2.94, m4rm: 8.90 strassen: 3.02 <<<
n: 10240, cutoff: 5120, speedup: 4.68, m4rm: 12.99 strassen: 2.78
n: 11264, cutoff: 5632, speedup: 4.84, m4rm: 18.78 strassen: 3.88
n: 12288, cutoff: 6144, speedup: 4.65, m4rm: 24.40 strassen: 5.24
n: 13312, cutoff: 6656, speedup: 3.32, m4rm: 30.92 strassen: 9.33

I'm investigating this one too and playing around with the parameters. Since
the C2D times are considerably better I bet it is L2 related; we'll see.

Thanks again!
Martin

_jab: martinr...@jabber.ccc.de

Bill Hart

May 16, 2008, 8:48:35 PM
to sage-devel
Hi Martin,

That spike is weird. Basically I got closer to 2x the time of Magma
for 16384x16384, but you need different parameters than for
10000x10000 or 20000x20000 since the size of the M4R matrices will be
different than in either of those cases.

For 16384x16384 my notes say that I used k = 6 and BLOCK = 256. One
might also need to fiddle with the cutoff. I think I used 3600, but at
various times I had the cutoff set lower.

It would be awfully surprising if there wasn't a set of parameters
that dropped this time right down. It is possible that the L1 cache is
smaller on sage.math than on the Opteron I was using. Perhaps my
parameters don't apply on that machine.

Bill.

mabshoff

May 16, 2008, 8:53:42 PM
to sage-devel


On May 17, 2:48 am, Bill Hart <goodwillh...@googlemail.com> wrote:
> Hi Martin,
>
> That spike is wierd. Basically I got closer to 2x the time of Magma
> for 16384x16384, but you need different parameters than for
> 10000x10000 or 20000x20000 since the size of the M4R matrices will be
> different than in either of those cases.
>
> For 16384x16384 my notes say that I used k = 6 and BLOCK = 256. One
> might also need to fiddle with the cutoff. I think I used 3600, but at
> various times I had the cutoff set lower.
>
> It would be awfully surprising if there wasn't a set of parameters
> that dropped this time right down. It is possible that the L1 cache is
> smaller on sage.math than on the Opteron I was using. Perhaps my
> parameters don't apply on that machine.

Shouldn't we use an ATLAS-like tuning process? [Obviously not that
heavy timewise.] It would likely find good default values.

> Bill.

<SNIP>

Bill Hart

May 16, 2008, 9:02:59 PM
to sage-devel


On 16 May, 23:05, Martin Albrecht <m...@informatik.uni-bremen.de>
wrote:
> On Friday 16 May 2008, Bill Hart wrote:
> > The other possibility is that Magma combines the two algorithms so
> > that there is even greater usage of the Gray code tables. This would
> > be an ugly hack, but could work.
>
> Are you suggesting Magma uses M4RM too? I'd doubt that, since they don't state
> that anywhere. Probably I'm just misunderstanding you.

Well, the thing is, Allan Steel may never have read a paper on the
Method of the Four Russians. He may consider it just a highly
optimised classical algorithm and may have rediscovered it, or
something similar. It is very similar to a "window" method that is
used for polynomials over GF(2), so I could easily imagine this
happening.

Then again, Magma does seem a lot slower for very small matrix
multiplications, which makes me wonder if they aren't just using a
highly efficiently implemented classical routine. I suppose it is
vaguely possible that you could order operations in such a way as to
effectively mix the classical and Strassen-Winograd algorithms for the
purpose of getting better cache efficiency. I don't know if it is
likely, but if someone went to the trouble of optimising a classical
routine down to one cycle per 64 by 64 bit pair, they are sure serious
about squeezing absolutely everything out.

Bill.

Bill Hart

May 16, 2008, 9:08:12 PM
to sage-devel
That may not be necessary. It may only be necessary to know the size
of the L1 and L2 caches and then it is probably possible to figure out
the optimal values. Probably it is something like 2^k rows of B must
fit in half the L1 cache and BLOCK rows of A must fit in half the L1
cache.
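In code the rule of thumb would be something like this (a hypothetical
heuristic, nothing tuned):

#include <stddef.h>

/* pick k so that a 2^k-row table fills at most half of L1, and BLOCK
   so that BLOCK rows of A fill the other half */
static void choose_params(size_t l1_bytes, size_t words_per_row,
                          size_t *k, size_t *block)
{
    size_t row_bytes = words_per_row * 8; /* 64-bit words */
    size_t half = l1_bytes / 2;
    size_t kk = 1;
    while ((((size_t)1 << (kk + 1)) * row_bytes) <= half)
        kk++;
    *k = kk;
    *block = half / row_bytes > 0 ? half / row_bytes : 1;
}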

I'm not sure what happens with the L2 cache yet, since we haven't
really solved that issue yet.

Bill.

Bill Hart

May 16, 2008, 11:08:33 PM
to sage-devel
In going from 5000x5000 to 10000x10000 Magma's time increases by a
factor of less than 4. That is impossible: doubling n costs a factor
of 8 with the classical algorithm and about 7 with one more level of
Strassen. Strassen will never help us there. They must be doing
something else. Probably something clever.

Bill.

Martin Albrecht

May 17, 2008, 4:40:49 AM
to sage-...@googlegroups.com
On Saturday 17 May 2008, Bill Hart wrote:
> In going from 5000x5000 to 10000x10000 Magma's time increases by a
> factor of less than 4. That is impossible. Strassen will never help us
> there. They must be doing something else. Probably something clever.
>
> Bill.

I was stuck on that too yesterday. Maybe the pipeline only gets fully
utilised at 10000x10000?

Martin

PS: If we run out of ideas we can simply go for parallelism; that should help
on sage.math ;-)

_jab: martinr...@jabber.ccc.de

Bill Hart

May 17, 2008, 5:28:23 AM
to sage-devel
Yes, of course, that is it. The Opteron can perform an MMX instruction
at the same time as an integer instruction (even an SSE instruction if
need be). We just need to set it up so that instructions get
interleaved between the two units.

The reason Magma jumps nearly a factor of 2 at that point is probably
that it is difficult to do this parallel operation for the naive
algorithm, or whatever Magma uses for its base case, but easy to do
with the Strassen algorithm.

Bill.

Bill Hart

May 17, 2008, 10:45:35 AM
to sage-devel
Hi Martin,

Here is another 10% improvement. In the loop at the bottom of
mzd_combine you can explicitly unroll by a factor of 8:

word *end = b1_ptr + wide;
register word *end8 = end - 8;

/* main loop, unrolled 8x: one word XORed per statement */
while (b1_ptr < end8)
{
    *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
    *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
    *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
    *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
    *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
    *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
    *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
    *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
}

/* tail: the remaining words */
while (b1_ptr < end)
{
    *(b1_ptr++) = *(b2_ptr++) ^ *(b3_ptr++);
}

I did this in combination with changing the crossover for 10000x10000
from 3600 to 7200.

Bill.

Bill Hart

May 17, 2008, 12:45:37 PM
to sage-devel
Martin,

The test code still passes if you change RADIX to 128. I've no idea
how it passes, but it does. Shame the results are not correct, because
this speeds the code up by a factor of 2.

I notice that in the SSE code you check whether alignment can be
achieved, and otherwise don't use SSE. But this introduces an
unpredictable branch. Also, where there are three operands, you can't
use SSE2 because the likelihood of all three being aligned is too
small.

I think a better idea would be to explicitly force all matrices and
all rows to be 128-bit aligned if the matrices are wide enough to
benefit from SSE2. Then the combine function can always use SSE2 and
there will be no need to check for alignment.
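Something along these lines, I mean (illustrative, not a patch):

#include <stdlib.h>
#include <stdint.h>

/* allocate nrows rows of 64-bit words, each row padded to a multiple
   of 128 bits and the whole block 16-byte aligned, so SSE2 loads and
   stores never need an alignment check */
static uint64_t *alloc_aligned_rows(size_t nrows, size_t words_per_row)
{
    size_t padded = (words_per_row + 1) & ~(size_t)1;
    void *p = NULL;
    if (posix_memalign(&p, 16, nrows * padded * sizeof(uint64_t)) != 0)
        return NULL;
    return (uint64_t *)p;
}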

I experimented with interleaving MMX and GPR XORs, but this doesn't
speed anything up. There are more instructions emitted and the time
stays about the same. The only way interleaving the MMX and GPR code
would speed things up is if there was more computation going on in the
registers and less memory loading and storing, I think.

Bill.

Bill Hart

May 17, 2008, 1:19:28 PM
to sage-devel
Here's another idea which should speed things up a bit.

For 10000x10000 we currently use k = 6. Instead of this, we could use
k = 5 and make two Gray tables simultaneously. This will still fit in
cache.

Instead of doing 6 bits at a time, we can then do 10 bits at a time.
We'd load the appropriate line from the first Gray table, then the
appropriate one from the second and xor them, then xor with the output
matrix. This should decrease the number of loads and stores
considerably. Moreover, the SSE instructions will then be much more
efficient as the ratio of arithmetic instructions to loads and stores
is higher.

Of course one could also do 16 bits at a time, by doing 4 tables, but
I think this might actually get slower again, since you've only
increased the amount of work done by 60% while taking a 30% increase
in instructions.
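The combined update would then be a single sweep over each row of C,
something like this (sketch with illustrative names; t0_row and t1_row
are the rows picked from the two prebuilt tables):

#include <stddef.h>
#include <stdint.h>

/* c_row ^= t0_row ^ t1_row: two table loads plus one load and one
   store of C, instead of two separate read-modify-write passes */
static void add_two_table_rows(uint64_t *c_row, const uint64_t *t0_row,
                               const uint64_t *t1_row, size_t w)
{
    for (size_t t = 0; t < w; t++)
        c_row[t] ^= t0_row[t] ^ t1_row[t];
}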

Bill.

Martin Albrecht

May 17, 2008, 1:29:04 PM
to sage-...@googlegroups.com
On Saturday 17 May 2008, Bill Hart wrote:
> Martin,
>
> The test code still passes if you change RADIX to 128. I've no idea
> how it passes, but it does. Shame the results are not correct, because
> this speeds the code up by a factor of 2.

Since all routines use RADIX, and I only check whether their results match,
they are all wrong in the same way and it isn't detected. I should add a test
with known answers, I suppose.

> I notice that in the SSE code, you check to see if alignment can be
> achieved, otherwise it doesn't use SSE. But this introduces an
> unpredictable branch. Also, where ther are three operands, you can't
> use SSE2 because the likelihood of all three being aligned is too
> small.

> I think a better idea would be to explicitly force all matrices and
> all rows to be 128 bit aligned if the matrices are wide enough to
> benefit from SSE2, Then the combine function can always use SSE2 and
> there will be no need to check for alignment.

I'll try that.

> I experimented with interleaving MMX and GPR XOR's, but this doesn't
> speed anything up. There are more instructions emitted and the time
> stays about the same. The only way interleaving the MMX and GPR code
> would speed things up is if there was more computation going on in the
> registers and less memory loading and storing, I think.

I came to the same conclusion (but my code might not have been as good as
yours). I improved other areas of the code (e.g. using naive multiplication
rather than M4RM if B->ncols < RADIX, since it is faster, etc.). I can
forward you my newest tarball (but the speed improvements aren't really
noticeable yet).

Martin

_jab: martinr...@jabber.ccc.de

Martin Albrecht

May 17, 2008, 1:57:35 PM
to sage-...@googlegroups.com
> I think a better idea would be to explicitly force all matrices and
> all rows to be 128 bit aligned if the matrices are wide enough to
> benefit from SSE2, Then the combine function can always use SSE2 and
> there will be no need to check for alignment.

That doesn't seem to make a noticeable difference for me (on the C2D).
However, I realised that in some of the multiplications the target matrix is
a window rather than a real matrix, and a window has bad data locality.
Copying everything over doesn't seem like a good idea, but it at least
indicates an area for improvement.

Martin

_jab: martinr...@jabber.ccc.de

Martin Albrecht

May 17, 2008, 3:32:36 PM
to sage-...@googlegroups.com
On Saturday 17 May 2008, Martin Albrecht wrote:
> > I think a better idea would be to explicitly force all matrices and
> > all rows to be 128 bit aligned if the matrices are wide enough to
> > benefit from SSE2, Then the combine function can always use SSE2 and
> > there will be no need to check for alignment.
>
> That doesn't seem to make a noticeable difference for me (on C2D). However,
> I realised that the multiplications where the target matrix is a real
> matrix rather than a window (which has bad data locality). Copying
> everything over seems not like a good idea but it at least indicates an
> area for improvements.

Okay, if I only copy when we crossover to M4RM then the memory overhead is
constant (~ cutoff^2) and the performance still improves.

Old: 64-bit Debian/GNU Linux, 2.33GHz Core2Duo

Matrix Dimension     Magma 2.14-13 (64-bit)   M4RI-20080517 (64-bit)
10,000 x 10,000       2.920                    3.610
16,384 x 16,384      11.140                   12.120
20,000 x 20,000      20.370                   24.390
32,000 x 32,000      74.290                   94.910

New: 64-bit Debian/GNU Linux, 2.33GHz Core2Duo

Matrix Dimension     Magma 2.14-13 (64-bit)   M4RI-20080517 (64-bit)
10,000 x 10,000       2.920                    2.990
16,384 x 16,384      11.140                   11.750
20,000 x 20,000      20.370                   21.180
32,000 x 32,000      74.290                   86.570

On the Opteron things don't look this good, but sage.math is pretty heavily
used right now, so my benchmarks there are not very telling.

Bill Hart

May 17, 2008, 4:05:54 PM
to sage-devel
That's looking good. Would you like me to run it on an unburdened
Opteron to see how it goes? If you like you can send me a tarball and
I'll try it out.

I think our best bet for a significant improvement now is the idea of
using two Gray tables of half the size simultaneously. I also realised
it possibly improves the cache performance for the A matrix too.

I was casually wondering whether Magma might use a highly optimised
Winograd algorithm instead of the naive algorithm. But over GF(2) I
think it probably actually takes longer, since it basically replaces
n^2 full-length scalar multiplies by n^2 half-length ones and 2*n^2
half-row additions, plus a pile of other overhead.

Bill.

Bill Hart

May 17, 2008, 5:05:14 PM
to sage-devel
Yet another idea.

Suppose we do not combine entire rows in the Gray table, but only half
rows. Once half a row is bigger than a single cache line (512 bits on
the Opteron) we may as well work with half rows. This allows us to
work with twice as many rows at once in the Gray tables (each of half
the size). This means that we are dealing with twice as many bits from
rows of A as usual and twice as many rows of B as usual, but we need
to do it all again for the second half of the rows. This means we get
twice the work done in the same amount of cache space.

Combined with the idea of using two Gray tables of 2^5 combinations of
rows instead of a single table of 2^6 combinations of rows, this would
equate to dealing with 20 bits of each row of A at a time and 20 rows
of B at a time.

With this scheme, there would then be 4 arithmetic operations in SSE
registers, 5 loads and 1 store, when combining rows from Gray tables,
instead of about 6.6 loads, 3.3 stores and 3.3 arithmetic operations,
changing the ratio of load/stores to arithmetic ops from 2.7 to 1.5.

This is another example where copying the data (the half rows) out and
reordering it so it has better locality, would probably make a big
difference. That sort of thing always works exceptionally well on AMD
chips.

Bill.

Bill Hart

May 17, 2008, 5:40:26 PM
to sage-devel
Martin,

Here's a really unusual thing. Perhaps you can confirm this. I get a
20% improvement if I add:

if (x)
{
}

in the three obvious places in the function _mzd_mul_m4rm_impl. This
stops it calling mzd_combine on the zero row.

But I don't understand why this works. The time should be only 1.5%
better since k = 6 and there are 2^k rows in the table, only one of
which is zero.

Could it be that your coinflip function is not quite random?

Anyway, I'm down to 3.40s for 10000x10000 with this change. Test
functions still pass.

Bill.

Bill Hart

May 17, 2008, 5:57:04 PM
to sage-devel
I suppose that this might be due to the ends of rows all being zero as
they aren't a multiple of 64 bits long. But I checked for 16384x16384
and we are nearly down to the speed of Magma there too. I just don't
get it. The coinflip has to be broken I think.

Bill.

Bill Hart

May 17, 2008, 6:22:22 PM
to sage-devel
I checked the coinflip and it is definitely fine. There is no greater
probability of 6 zeroes in a row than there ought to be. So the
speedup I just reported is quite a mystery.

Bill.

Martin Albrecht

May 17, 2008, 6:46:13 PM
to sage-...@googlegroups.com, Gregory Bard
> Old: 64-bit Debian/GNU Linux, 2.33GHz Core2Duo
> Matrix Dimension     Magma 2.14-13 (64-bit)   M4RI-20080517 (64-bit)
> 10,000 x 10,000       2.920                    3.610
> 16,384 x 16,384      11.140                   12.120
> 20,000 x 20,000      20.370                   24.390
> 32,000 x 32,000      74.290                   94.910
>
> New: 64-bit Debian/GNU Linux, 2.33GHz Core2Duo
> Matrix Dimension     Magma 2.14-13 (64-bit)   M4RI-20080517 (64-bit)
> 10,000 x 10,000       2.920                    2.990
> 16,384 x 16,384      11.140                   11.750
> 20,000 x 20,000      20.370                   21.180
> 32,000 x 32,000      74.290                   86.570

If you take this + Bill's idea:

> For 10000x10000 we currently use k = 6. Instead of this, we could use
> k = 5 and make two Gray tables simultaneously. This will still fit in
> cache.
>
> Instead of doing 6 bits at a time, we can then do 10 bits at a time.
> We'd load the appropriate line from the first Gray table, then the
> appropriate one from the second and xor them, then xor with the output
> matrix. This should decrease the number of loads and stores
> considerably. Moreover, the SSE instructions will then be much more
> efficient as the ratio of arithmetic instructions to loads and stores
> is higher.
>
> Of course one could also do 16 bits at a time, by doing 4 tables, but
> I think this might actually get slower again since you've only
> increased the amount of work done by 60%, but you've had a 30 %
> increase in instructions.

You get (on the C2D):

sage: B = random_matrix(GF(2), 3.2*10^4, 3.2*10^4)
sage: A = random_matrix(GF(2), 3.2*10^4, 3.2*10^4)
sage: time C= A._multiply_strassen(B,cutoff=2^11)
CPU times: user 75.82 s, sys: 0.22 s, total: 76.04 s
Wall time: 76.31

sage: A = random_matrix(GF(2), 2*10^4, 2*10^4)
sage: B = random_matrix(GF(2), 2*10^4, 2*10^4)
sage: time C= A._multiply_strassen(B,cutoff=2^11)
CPU times: user 19.14 s, sys: 0.09 s, total: 19.24 s
Wall time: 19.29

sage: B = random_matrix(GF(2), 2^14, 2^14)
sage: A = random_matrix(GF(2), 2^14, 2^14)
sage: time C= A._multiply_strassen(B,cutoff=2^11)
CPU times: user 10.62 s, sys: 0.05 s, total: 10.67 s
Wall time: 10.70

sage: B = random_matrix(GF(2), 10^4, 10^4)
sage: A = random_matrix(GF(2), 10^4, 10^4)
sage: time C= A._multiply_strassen(B,cutoff=2^11)
CPU times: user 2.73 s, sys: 0.02 s, total: 2.75 s
Wall time: 2.76

i.e. the speed of my current Magma install on the same computer (mind you,
this one might not be optimised for the C2D but for the Opteron, I don't
know). The times above don't have SSE2 yet. I guess documenting Bill's tricks

- process the rows of A in blocks
- use two Gray code tables rather than one

well is in order, since M4RM now looks quite different from the original
algorithm. I'll do that tomorrow.

Again, thanks Bill!

Bill Hart

May 17, 2008, 6:52:43 PM
to sage-devel
Woot!!

Martin Albrecht

May 17, 2008, 7:12:16 PM
to sage-...@googlegroups.com
> I suppose that this might be due to the ends of rows all being zero as
> they aren't a multiple of 64 bits long. But I checked for 16384x16384
> and we are nearly down to the speed of Magma there too. I just don't
> get it. The coinflip has to be broken I think.

If one uses M4RI with the new patch from within Sage another PRBG is used,
but coinflip should be fine. I don't see these speedups (but I have the two
Gray code tables, and those warrant more ifs).

Also, I think we might consider merging our two forks again? Or do you also
have the two Gray code tables? Are your timings on the Opteron? Because then
things look really good, since mine are on the C2D.

Exciting times,
Martin

_jab: martinr...@jabber.ccc.de

Bill Hart

May 17, 2008, 7:23:10 PM
to sage-devel
I don't have the two Gray code tables, so it would be good to get your
version. Also my code is currently a mess, so it would be good to
clean it up by merging with a cleaner version (yours). Tomorrow I'll
check carefully what I've changed and try and merge the ideas if there
are any you don't have which definitely improve performance on the
Opteron.

The speedups I am seeing from the ifs are possibly a feature of the
Opteron cache algorithms. It is very sensitive when things just begin
to fall out of cache, as they certainly are here. Not combining with
the zero row just nudges things closer in to the cache boundary since
it never has to read that row.

I have checked and the speedups are quite reproducible, and they
definitely come from the ifs, though I am now using a crossover with
Strassen of 7200!!

Bill.

Bill Hart

May 17, 2008, 7:28:01 PM
to sage-devel
P.S.: yes, all my times are on a 2.8GHz Opteron. Cpuinfo says:

wbhart@host-57-44:~/m4ri-20080514/testsuite> cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 65
model name : Dual-Core AMD Opteron(tm) Processor 2220
stepping : 3
cpu MHz : 1000.000
cache size : 1024 KB
<snip>
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm
extapic cr8_legacy
<snip>
The 1000.000 there refers to the FSB.

Bill.

Martin Albrecht

May 17, 2008, 7:40:42 PM
to sage-...@googlegroups.com
On Sunday 18 May 2008, Bill Hart wrote:
> I don't have the two Gray code tables, so it would be good to get your
> version. Also my code is currently a mess, so it would be good to
> clean it up by merging with a cleaner version (yours). Tomorrow I'll
> check carefully what I've changed and try and merge the ideas if there
> are any you don't have which definitely improve performance on the
> Opteron.
>
> The speedups I am seeing from the ifs are possibly a feature of the
> Opteron cache algorithms. It is very sensitive when things just begin
> to fall out of cache, as they certainly are here. Not combining with
> the zero row just nudges things closer in to the cache boundary since
> it never has to read that row.
>
> I have checked and the speedups are quite reproducible, and they
> definitely come from the ifs, though I am now using a crossover with
> Strassen of 7200!!

I'm using a crossover of 2048 here, so maybe our improvements are orthogonal?
Even more puzzling, I'd expect my crossover to be bigger than yours.
(On a side note: my code changes how the crossover is used. Your
version: 'size < cutoff'; my version: '|cutoff - size| is minimal', which
should give actual cutoffs closer to the desired values.)

My version is here:

http://sage.math.washington.edu/home/malb/spkgs/libm4ri-20080516.p1.spkg

(this needs an updated patch for Sage)

and here:

http://sage.math.washington.edu/home/malb/m4ri-20080516.tar.gz

(which is the raw source). Those don't have SSE2 yet but it doesn't seem to
make that much of a difference anyway. I'll add that back before doing an
official release. However, unfortunately I'll probably have limited/no time
tomorrow to commit.

Martin

PS: To give at least some indication that my code still does the right thing,
a 'known answer' test:

sage: A = random_matrix(GF(2), 10^3, 10^3)
sage: B = random_matrix(GF(2), 10^3, 10^3)
sage: (A*B)._magma_() == A._magma_() * B._magma_()
True
sage: (A._multiply_strassen(B,cutoff=256))._magma_() == A._magma_() *
B._magma_()
True

_jab: martinr...@jabber.ccc.de

Bill Hart

May 17, 2008, 8:38:49 PM
to sage-devel
Here are the times I get with the different cutoffs.

               Magma    M4RI:7200  M4RI:2048
10000x10000:   2.940s   3.442s     4.132s
16384x16384:   9.250s   11.47s     11.80s
20000x20000:   16.57s   19.3s      26.05s
32000x32000:   59.05s   71.9s      71.8s

So it seems when there is not an exact cut, the higher cutoff is
substantially better. Don't know why that is.

Tomorrow I'll see if there is anything I have that speeds up your
code. I'm hopeful we'll be within about 5% on the Opteron by then. The
other ideas I outlined above should push us 10-15% ahead of Magma if
we end up implementing them, I think. Of course one can go too crazy
with optimisation.

Bill.

David Harvey

May 17, 2008, 8:41:09 PM
to sage-...@googlegroups.com

On May 17, 2008, at 8:38 PM, Bill Hart wrote:

> Of course one can go too crazy with optimisation.

No.... surely that never happens around here.....

david

Bill Hart

May 17, 2008, 8:58:39 PM
to sage-devel


On 18 May, 00:40, Martin Albrecht <m...@informatik.uni-bremen.de>
wrote:
> My version is here:
>
>    http://sage.math.washington.edu/home/malb/spkgs/libm4ri-20080516.p1.spkg
>
> (this needs an updated patch for Sage)
>
> and here:
>
>    http://sage.math.washington.edu/home/malb/m4ri-20080516.tar.gz
>
> (which is the raw source).

This pure C version seems to be the old version, before you made
either of the two big changes.

Bill.

Bill Hart

May 17, 2008, 9:21:11 PM
to sage-devel
I managed to get the modified version from the spkg. Nice code!!

Unfortunately it is not as fast on my Opteron. So more work tomorrow
for me to try and get it down to the same times as I have with my
version.

Here are the times, all on my Opteron. Note your C2D version was
optimal at a cutoff of 2048, not 7200 as for my code. Now I am worried
that maybe my code is actually broken somehow and still passing the
test code. I'll carefully make the changes to your code tomorrow to
see if that is the case.

               Magma    C2D-M4RI:2048  AMD-M4RI:7200  AMD-M4RI:2048
10000x10000:   2.940s   3.13s          3.442s         4.132s
16384x16384:   9.250s   12.96s         11.47s         11.80s
20000x20000:   16.57s   22.43s         19.3s          26.05s
32000x32000:   59.05s   90.20s         71.9s          71.8s

Bill.

Bill Hart

unread,
May 18, 2008, 9:40:29 AM5/18/08
to sage-devel
I sorted out what was wrong. In my combine code I was combining operands in
the wrong order. This meant that a whole lot of zeroes ended up where they
shouldn't be, which is why the if (x) check appeared to help. So my times
were unfortunately completely screwed up. I went back to a fresh tarball
with the original code and applied my changes one by one. Only two changes
actually improved the time.

First I give the Magma times vs your (Martin's) code:

             Magma    M4RI
10000x10000: 2.940s   3.13s
16384x16384: 9.250s   12.96s
20000x20000: 16.57s   22.43s
32000x32000: 59.1s    90.2s

Now I give times for my code with a single Gray table. The legend below
indicates what the three columns mean.

             (0)       (1)      (2)
10000x10000: 7.76s     6.70s    4.40s
16384x16384: 44.6s     37.3s    18.3s
20000x20000: 53.3s     45.9s    31.2s
32000x32000: -------   194s     134s

(0) cutoff = 2048, original code, no SSE
(1) cutoff = 3600, CACHE BLOCK A (256, 768)
(2) cutoff = 7200, fixed k = 6

So it is clear that two Gray tables are much better than one. The only
thing puzzling me now is how you get away with k = 7 with two Gray tables,
whereas I was using k = 6 with one table. Also, you stick with BLOCK = 768,
whereas I found it optimal to switch to 256 for 16384x16384. However, if I
make any additional changes to your code it just slows down. It must have
to do with the copying out that you do; that must significantly affect
cache performance.

Anyhow, I'm now going to wipe all versions I have (except my working
one with an optimal single Gray table implementation) and just start
from your code as a base for further improvements. We are still up to
50% slower than Magma on the Opteron!!

Bill.

Martin Albrecht

unread,
May 18, 2008, 12:36:59 PM5/18/08
to sage-...@googlegroups.com
Hi,

first, I recorded the different speed-ups in a small table for an overview,
in the attachment (I think we've come a long way :-)). To disable the
copying out one needs to edit

    /* we copy the matrix first since it is only constant memory
       overhead and improves data locality; if you remove it make sure
       there are no speed regressions */
    /* C = _mzd_mul_m4rm_impl(C, A, B, 0, TRUE); */
    packedmatrix *Cbar = mzd_init(C->nrows, C->ncols);
    Cbar = _mzd_mul_m4rm_impl(Cbar, A, B, 0, FALSE);
    mzd_copy(C, Cbar);
    mzd_free(Cbar);
    return C;

in strassen.c to

    /* we copy the matrix first since it is only constant memory
       overhead and improves data locality; if you remove it make sure
       there are no speed regressions */
    C = _mzd_mul_m4rm_impl(C, A, B, 0, TRUE);
    return C;

This disables the copying out.

Martin

PS: If I find some time later today I'll make some changes so that SSE2 can
be used more often, i.e. align each row at a 16-byte boundary if HAVE_SSE2
is defined.

_jab: martinr...@jabber.ccc.de

timings.html

Bill Hart

unread,
May 18, 2008, 2:19:30 PM5/18/08
to sage-devel
If two Gray tables are better than one, then you can't have too much of a
good thing, right? So I have now made three Gray tables.

The files I modified are in:

http://sage.math.washington.edu/home/wbhart/m4ri3gray/

There are three main modifications:

1) Make all matrices SSE aligned if we have SSE2, and make all rows of
all matrices aligned. This required a fix in brilliantrussian.c, since
it makes some assumptions about how the data is stored in matrices.

2) Introduce functions combine2 and combine3, which use SSE to do
a[i] ^= b[i] ^ c[i] and a[i] ^= b[i] ^ c[i] ^ d[i] respectively (a sketch
of the idea follows below).

3) Code to use three Gray tables. I had to remove the a_nc%k's since
these were causing segfaults for reasons I don't understand.
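
For reference, here is a minimal sketch of the combine2 idea, just to show
the intrinsics (my reconstruction, not the exact code in the directory
above; it assumes all three rows are 16-byte aligned, as arranged by (1)):

    #include <stddef.h>
    #include <emmintrin.h>  /* SSE2 intrinsics */

    /* a[i] ^= b[i] ^ c[i] over `wide` 128-bit words */
    static inline void combine2(__m128i *a, const __m128i *b,
                                const __m128i *c, size_t wide) {
        size_t i;
        for (i = 0; i < wide; i++) {
            __m128i t = _mm_xor_si128(_mm_load_si128(b + i),
                                      _mm_load_si128(c + i));
            _mm_store_si128(a + i, _mm_xor_si128(_mm_load_si128(a + i), t));
        }
    }

combine3 is the same with one more XOR for the fourth operand.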

Here are the Magma times, your old times and the new times on my
Opteron:

             Magma    Old      New
10000x10000: 2.940s   3.13s    2.82s
16384x16384: 9.250s   12.96s   11.25s
20000x20000: 16.57s   22.43s   19.39s
32000x32000: 59.1s    90.2s    81.56s

Note I do not use the new combine2 and combine3 functions as they slow
it down on my machine. I cannot believe how useless SSE seems to be!

I've commented the three lines out that use combine3 in
brilliantrussian.c in case you want to try it with SSE on your CTD.

Bill.



Bill Hart

unread,
May 18, 2008, 3:09:31 PM5/18/08
to sage-devel
The copying out makes a 50% difference (it's better with copying) to the
speed of 16384x16384, but no difference to 10000x10000 or 20000x20000.

That's weird.

Bill.


Bill Hart

unread,
May 18, 2008, 4:50:14 PM5/18/08
to sage-devel
I tried copying out the input matrices to the M4RM routine, but only
when the rows aren't all contiguous in memory. This didn't speed
anything up. Of course the reason for that is the second matrix is
only read a handful of times, to construct the Gray tables which are
then used extensively. The first matrix is read out of order anyway,
by M4RM, so there's no point making all its rows contiguous.

It's hard to see how to get that last bit we need to beat Magma.

What I don't quite understand now is that we are beating Magma all the way
up to 10000x10000, which is surely past their crossover to Strassen, yet we
start losing for large matrices when we use Strassen. I checked that the
addition is not slower than Magma's (it's not; it's up to 5 times faster).

The only trick I have left to try is to use twice the number of Gray tables
but make them half the width. That seems like cheating, though, since we
should already have a fast enough base case by now!

Bill.

Bill Hart

unread,
May 18, 2008, 7:30:10 PM5/18/08
to sage-devel
Well, apparently there are speed gains right up to 8 Gray tables,
though I really have no idea why that is.

Here are the new times:

             Magma    Old      New
10000x10000: 2.940s   3.13s    2.32s
16384x16384: 9.250s   12.96s   9.17s
20000x20000: 16.57s   22.43s   16.49s
32000x32000: 59.1s    90.2s    65.7s   (81.6s with 3 Gray tables)

So now we beat Magma all the way up to 20000x20000.

I'll put the adjusted code in the directory:

http://sage.math.washington.edu/home/wbhart/m4ri3gray/

in just a moment. I also found a memory leak in my code which I fixed.

Bill.

Martin Albrecht

unread,
May 18, 2008, 9:00:23 PM5/18/08
to sage-...@googlegroups.com
On Monday 19 May 2008, Bill Hart wrote:
> [...]

Cool, I'll take a look tomorrow. It seems we're closing in on the classical
algorithm though, although I notice you don't decrease k anymore.

As for the copying out and 2^14 vs. 10^4: maybe the extra rows/columns cost
too many cycles. I'll try valgrind/callgrind/cachegrind on the code to see
whether that is true. Also, maybe for 2^14 x 2^14 the submatrices are
nicely aligned on cache lines. I'm wildly speculating here.

Martin

_jab: martinr...@jabber.ccc.de

Bill Hart

unread,
May 18, 2008, 9:10:42 PM5/18/08
to sage-devel
I figured out why 8 tables is optimal on this machine. The L1 cache is
128KB, and at 2048x2048, 8 Gray tables with k = 5 fit exactly in half the
cache, which is probably optimal.
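
Spelling out the arithmetic: each table has 2^5 = 32 rows, a row of a
2048-column matrix is 2048/8 = 256 bytes, so

    8 tables * 32 rows * 256 bytes = 65536 bytes = 64KB,

which is exactly half of the 128KB L1.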

Bill.


Bill Hart

unread,
May 19, 2008, 8:52:29 AM5/19/08
to sage-devel
Martin,

does your machine speed up at all with any of the SSE code? I spent
considerable time yesterday trying to optimise the combine3 function I
wrote (note: I didn't submit this improved code). Whilst I did speed it up
considerably by removing %'s and /'s, changing the function prototype to
accept rows directly, and lots of other minor improvements, the overall
result was still much slower than the generic C code.

If the SSE code doesn't ever speed things up (and it might not be faster
to use SSE when there is at least one load/store per arithmetic operation),
then it might be worth throwing all the SSE code out completely, since it
just pollutes what is otherwise very nice looking code.

I did check, by compiling the code to assembler without assembling it
(the -S option in gcc), that it is actually using SSE instructions and
doing so in what I would consider an optimal way. So it really is SSE
itself that is slow, not just inefficiencies introduced by the compiler.

One possible avenue for exploration is the gf2x library by Zimmermann,
Brent, et al. They use SSE in their code for polynomials over GF(2),
apparently to good effect. We could look to see if there is some trick to
using it efficiently. The difference there, however, is probably that the
polynomials get quite long, so multiple arithmetic operations need to be
done for each store/load. I don't know if any of their tricks can help us,
for that reason.

Bill.


Bill Hart

unread,
May 19, 2008, 9:05:22 AM5/19/08
to sage-devel
I had a look at gf2x. The reason they can make use of SSE is the shl and
shr capabilities: doing that 128 bits at a time is more efficient, since
one operation can be done instead of quite a few using general purpose
registers.
Bill.

Martin Albrecht

unread,
May 19, 2008, 9:23:59 AM5/19/08
to sage-...@googlegroups.com
Bill,

I do get a small speed-up on the Core2Duo for SSE2 but I'm not sure it is
worth the trouble (I agree that it makes the otherwise pretty looking code
ugly).

I have some timings (for an old implementation) here:

http://trac.sagemath.org/sage_trac/ticket/3204#comment:2

My guess is that SSE2 is slower on the Opteron because SSE2 is basically an
Intel thing and only provided by AMD for compatibility reasons. There are
several reports of SSE2 being slow on the Opteron, and I guess the SSE2
integer operations were not optimised for speed, since MMX/SSE is mainly
about floating point.

One thing I noticed on the Opteron is that whether I put the code in
mzd_combine or put the same code directly in the calling function made a
huge difference. I attributed it to better cache prefetching, but that was
probably premature.

My proposal:
- This evening I'll update my code with your 8 Gray tables and check the
performance on the C2D
- Then I'll re-introduce SSE2 and check whether it makes a worthy difference,
if not we drop SSE2 from the multiplication.

Martin

PS: I tried a quick and dirty OpenMP-based (OpenMP is cool, btw) parallel
implementation of Strassen-Winograd yesterday and, as is, it gives a
speedup of about 1.8 (so not optimal yet). But comparing that with Magma
feels like cheating: first we should aim for better speed with the same
resources, and then we switch to parallel implementations for even better
times. Anyway, a week ago I wouldn't have believed that I could do a
10^4 x 10^4 matrix multiplication in 1.7 seconds on my notebook :-)

_jab: martinr...@jabber.ccc.de

Bill Hart

unread,
May 19, 2008, 9:43:05 AM5/19/08
to sage-devel
You seemed to be getting up to 8% at points there. That's definitely
worth it. I'll be interested to see this evening how it comes out,
though I recommend optimising my combine3 function (which I suppose
should now be combine8), even including it inline rather than having it
in a separate file.

Of course on the Opteron, SSE should be switched off, since it is
definitely slower by about 5%-10% even with careful optimisation.

Bill.


Martin Albrecht

unread,
May 19, 2008, 9:54:28 AM5/19/08
to sage-...@googlegroups.com
On Monday 19 May 2008, Bill Hart wrote:
> [...]

Okay, so a good compromise is to remove all the SSE2 stuff from the main
function _mzd_mul_m4rm_impl and put it in a static inline _mzd_combine8
function which is specifically tailored towards this particular
application. Thus the code still looks relatively pretty/elegant, but we
can have SSE2 support.

Martin

_jab: martinr...@jabber.ccc.de

Martin Albrecht

unread,
May 19, 2008, 2:58:02 PM5/19/08
to sage-...@googlegroups.com
On Monday 19 May 2008, Bill Hart wrote:
> [...]

Bill,

some progress report for the C2D:

I incorporated your changes with the following modifications:
- if (x1==0 & x2==0 & x3==0 ... & x8==0): I removed this one since it seems
rather unlikely that it is ever true
- I added a #define called GRAY8 which selects whether 8 or 4 tables are
used; I figured this might be handy for machines with a smaller L1 (see the
sketch after this list)
- I added the correct a_nc%k values back in
- We don't need to make sure that the allocated buffers are correctly
aligned, since we allocate with _mm_malloc. If that is not available we
should probably just use posix_memalign.
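
The toggle is roughly the following pattern (a sketch with a made-up
NTABLES name; the real code differs in detail):

    #ifdef GRAY8
    #define NTABLES 8  /* 8 Gray tables, wants a larger L1 cache */
    #else
    #define NTABLES 4  /* 4 Gray tables for machines with a smaller L1 */
    #endif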

The speed-up on the C2D (similar to the Opteron) is considerable (the last
column is a parallel toy implementation):

64-bit Debian/GNU Linux, 2.33Ghz Core2Duo, cutoff=2^12 (*)
Matrix Dimension Magma M4RI M4RI (OpenMP), walltime
10,000 x 10,000 2.920 2.270 1.470
16,384 x 16,384 11.140 9.130 5.540
20,000 x 20,000 20.370 16.110 11.800
32,000 x 32,000 75.510 64.340 40.040

Amazingly, M4RM alone (without Strassen-Winograd) now beats Magma up to
2*10^4 x 2*10^4 in this configuration:

sage: A = random_matrix(GF(2),10^3, 10^3);
sage: B = random_matrix(GF(2),10^3, 10^3);
sage: magma(A._multiply_m4rm(B)) == magma(A)*magma(B)
True
sage: A = random_matrix(GF(2),2*10^4, 2*10^4);
sage: B = random_matrix(GF(2),2*10^4, 2*10^4);
sage: time A._multiply_m4rm(B)
CPU times: user 18.32 s, sys: 0.10 s, total: 18.42 s
Wall time: 18.65

64-bit Debian/GNU Linux, 1.8Ghz Opteron (sage.math), cutoff=2^11
Matrix Dimension Magma 2.13-5 (64-bit) M4RI-20080518 (64-bit)
10,000 x 10,000 4.190 4.290
16,384 x 16,384 15.360 15.230
20,000 x 20,000 29.530 28.640
32,000 x 32,000 103.970 114.620

That does seem to roughly match what you reported. I'll now look into SSE2
support.

Cheers,
Martin

(*) Note: Magma is 64-bit but not optimised for C2D.

_jab: martinr...@jabber.ccc.de

Martin Albrecht

unread,
May 19, 2008, 4:39:16 PM5/19/08
to sage-...@googlegroups.com
On Monday 19 May 2008, Bill Hart wrote:
> [...]

Okay, I added SSE2 support again and the timings are pretty good on the C2D:

Dimension Old New
10000 x 10000 2.270 1.720
16384 x 16384 9.130 6.760
20000 x 20000 16.110 12.310
32000 x 32000 64.340 50.690

Throwing parallelism in the mix (still lame implementation):

Dimension Old New
10000 x 10000 1.470 1.220
16384 x 16384 5.540 4.390
20000 x 20000 11.800 8.580
32000 x 32000 40.040 32.810

Btw. Mike Hansen pointed out on IRC that GAP has a pretty fast implementation
of matrix multiplication too:

GAP4, Version: 4.4.10 of 02-Oct-2007, x86_64-unknown-linux-gnu-gcc
gap> A := RandomMat(10000,10000,GF(2));
<a 10000x10000 matrix over GF2>
gap> B := RandomMat(10000,10000,GF(2));
<a 10000x10000 matrix over GF2>
gap> C := A*B;
<a 10000x10000 matrix over GF2>
gap> time;
5951

The unit here is ms, so this takes about 6 seconds. However, the generation
of random matrices takes forever. Mike also pointed out that GAP was twice
as fast as the current Sage code (i.e. the code before the improvements
discussed in this thread) for the example he tried.

On sage.math things don't improve as expected:

sage: A = random_matrix(GF(2),32000,32000)
sage: B = random_matrix(GF(2),32000,32000)
sage: time C = A._multiply_strassen(B,cutoff=2^11)
CPU times: user 121.69 s, sys: 3.93 s, total: 125.62 s
Wall time: 125.62

This was 114.620 before.

Martin

_jab: martinr...@jabber.ccc.de

Bill Hart

unread,
May 19, 2008, 5:23:44 PM5/19/08
to sage-devel
Martin,

That's all excellent news!! So on the C2D we are caning Magma. But we
should try to figure out whether your Magma version is optimised for the
C2D or for AMD64, since that will make a big difference. Is your machine
some kind of 64-bit Intel OSX machine? I don't see a specific Core 2
version of Magma on their current list. Of course, if you just had a
generic Linux x86 version of Magma, that would be much slower than optimal.

It's amazing how much difference the SSE makes on your machine. The
AMD does essentially use its MMX or SSE hardware to read in cache
lines I believe, so basically unless you are doing something requiring
lots of wide arithmetic/logic, you aren't going to get anything more
out of the chip.

I look forward to seeing the new code now that you've cleaned it up.

I'm going to try to figure out what GAP does, in case there are any ideas
we missed. It's surely old code, but there might be lots of interesting
things in there.

Anyhow, who would have thought that one would see 1.22s for a
10000x10000 matrix multiply. That's pretty exciting.

Bill.


Martin Albrecht

unread,
May 19, 2008, 6:19:32 PM5/19/08
to sage-...@googlegroups.com
On Monday 19 May 2008, Bill Hart wrote:
> Martin,
>
> That's all excellent news!! So on the c2d we are caning magma. But we
> should try and figure out if your magma version is optimised for c2d
> or for amd64, since that will make a big difference. Is your machine
> some kind of 64 bit Intel OSX machine? I don't see a specific core 2
> version of Magma on their current list. Of course if you just had a
> generic linux x86 version of Magma, that would be much slower than
> optimal.

My computer is a Macbook Pro so it is one of those 64-bit Intel (OSX) machines
but I'm running Debian/GNU Linux. According to

https://magma.maths.usyd.edu.au/magma/export/x86_64-linux/

there is a special Intel64 version of Magma 2.14. But even though I have a
license to use it, I can't download it, since my Uni keeps the login data
for Magma and only puts versions on an internal server for me to grab. So
basically I have to wait until they have grabbed the Intel64 version for me.

Maybe William could run some benchmarks on his machine which is identical to
mine (except that I upgraded my RAM and he is running OSX not Linux)?

> It's amazing how much difference the SSE makes on your machine. The
> AMD does essentially use its MMX or SSE hardware to read in cache
> lines I believe, so basically unless you are doing something requiring
> lots of wide arithmetic/logic, you aren't going to get anything more
> out of the chip.
>
> I look forward to seeing the new code now that you've cleaned it up.

The tarball is here:

http://m4ri.sagemath.org/downloads/m4ri-20080519.alpha0.tar.gz

and the SPKG is here:

http://sage.math.washington.edu/home/malb/spkgs/libm4ri-20080519.p0.spkg

The SPKG needs a patch:

http://sage.math.washington.edu/home/malb/new_m4ri_2.patch

> I'm going to try and figure out what GAP does, in case there's any
> ideas we missed. It's surely old code, but there might be lots of
> interesting things in there.

I'll also check again but it seems they are doing M4RM with a fixed k and
matrix blocking.

> Anyhow, who would have thought that one would see 1.22s for a
> 10000x10000 matrix multiply. That's pretty exciting.

Yeah, good work Bill!

Martin

PS: I now actually believe it is possible that Magma uses M4RM (though
maybe not M4RI). If GAP has it and it is old code, I don't see why Magma
wouldn't. So having a hard time beating them isn't that implausible
anymore, since it means we don't simply have a better algorithm.

_jab: martinr...@jabber.ccc.de

Bill Hart

unread,
May 19, 2008, 7:01:42 PM5/19/08
to sage-devel
I can't tell exactly what GAP does. It is beautifully documented, but
it talks about "grease units", which is terminology I don't
understand. It does look like M4RM though.

One trick they use is to handle the case where the bits they get from
the A matrix equal 1. But I think they only do this to speed things up
when the "grease level" is 1, which I think is our k. So no ideas for
us there.

I don't see where they get so much speed from.

I agree, it is highly likely Magma uses this algorithm. It seems like
a relatively well known technique.

Bill.

On 19 May, 23:19, Martin Albrecht <m...@informatik.uni-bremen.de>
wrote:
> _jab: martinralbre...@jabber.ccc.de

Robert Miller

unread,
May 19, 2008, 7:22:33 PM5/19/08
to sage-devel
> I can't tell exactly what GAP does. It is beautifully documented, but
> it talks about "grease units", which is terminology I don't
> understand. It does look like M4RM though.

Grease is a concept for speeding up certain things using caching. For
example, suppose I have the permutation group S_{32} acting on ints. I
can represent a particular permutation as a permutation matrix, which
means that applying that permutation is just vector-matrix
multiplication. However, instead of computing the full matrix
multiplication, we can cut the matrix into pieces (probably of length
"grease units" or something). Essentially, we compute every possible
sum of the first four rows, then the next four rows, etc. Then, to see
how to multiply the vector by the matrix, we cut the vector into
chunks of four, and simply look up the corresponding entry in the
"grease table", finally adding them together in the end.

-- RLM

Bill Hart

unread,
May 19, 2008, 7:47:50 PM5/19/08
to sage-devel
Yep that's exactly the same thing as what M4RM does. Thanks for the
explanation.

Bill.

Bill Hart

unread,
May 19, 2008, 8:55:55 PM5/19/08
to sage-devel
Martin,

On the M4RI website you say that M4R inversion is asymptotically fast
with time complexity n^3/log_2(n), but Strassen's method has complexity
n^(log_2 7), which I would call asymptotically fast.

Do you just mean asymptotically faster than the classical algorithm?
By the way, I wrote some code ages ago for computing the row echelon
form of sparse matrices over GF2 which essentially inverts 6000x6000
sparse matrices in about a second.

Bill.


Clement Pernet

unread,
May 19, 2008, 9:14:14 PM5/19/08
to sage-...@googlegroups.com
hi guys,

I am finally up to date with this discussion (I was being interviewed,
and then flying when it started).
First, congrats on the great job you have achieved. I have started to
dive into m4ri, and I really like the quality of the code.

I have a few remarks

* The loop unrolling technique used for the creation of the table could
maybe be used in the computation of the product as well. Is 8 optimal? I
remember seeing 32 in ATLAS, but I don't know of any justification. Since
some pipelines are longer than 8, it might be better to have a longer
unrolled loop (see the sketch after these remarks).

* I am not sure about this, but wouldn't it be better to have a block
decomposition that matches the babystep-giantstep structure? This could
happen at the Strassen threshold: instead of simply copying the matrix
(which already improves data locality), copy it into a bunch of blocks of
size blocksize and call m4rm on that structure. ATLAS does this kind of
copying for dimensions not larger than 200, if I recall correctly. Maybe I
am just missing something about your babystep/giantstep algorithm.
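
For the first remark, the kind of unrolling I have in mind is (an
illustration only, with made-up names):

    #include <stdint.h>
    #include <stddef.h>

    /* 8-way unrolled XOR of a table row t into a result row c */
    static void xor_row_unrolled(uint64_t *c, const uint64_t *t,
                                 size_t wide) {
        size_t i = 0;
        for (; i + 8 <= wide; i += 8) {
            c[i]   ^= t[i];   c[i+1] ^= t[i+1];
            c[i+2] ^= t[i+2]; c[i+3] ^= t[i+3];
            c[i+4] ^= t[i+4]; c[i+5] ^= t[i+5];
            c[i+6] ^= t[i+6]; c[i+7] ^= t[i+7];
        }
        for (; i < wide; i++)  /* remaining words */
            c[i] ^= t[i];
    }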

Anyway, as you pointed out, the battle is now over the asymptotic
comparison with Magma, and I still have no idea how to improve your
Strassen implementation. Still thinking about it...

Cheers,

Clément


William Stein

unread,
May 19, 2008, 9:24:23 PM5/19/08
to sage-...@googlegroups.com
On Mon, May 19, 2008 at 6:14 PM, Clement Pernet
<clement...@gmail.com> wrote:
> [...]

Clement, since you are actively doing research on fast mod-3 matrix
multiplication, any chance you could spend 2-3 paragraphs here and
advertise that? It would fit nicely in this thread.

-- William

Bill Hart

unread,
May 19, 2008, 9:55:15 PM5/19/08
to sage-devel
Hi Clement,

I heard you had a big smile on your face today. Well done.

Regarding your suggestion about copying into blocks, that is a very
good idea. The problem at present is that we break up into blocks
vertically, not horizontally. But we absolutely should be doing it
horizontally. The reason is that we could break along cache line
boundaries. Then if we copied into blocks as you suggest, before
calling M4RM, the time saving should be quite noticeable. I think I
will investigate this when I get some more time to work on this.

Regarding the asymptotics, I think they are the same as Magma's. The
time for addition is absolutely tiny in comparison with multiplication
(like 1000 times smaller). Each matrix above our crossover to Strassen
is also already well outside the L2 cache size. Thus there just isn't
anything to save either in cache friendliness or in the additions.
Thus I think the issue is actually still in the M4RM routine.
Basically 32000x32000 breaks down into matrices of size 2000x2000 when
it gets below the crossover. These smaller matrices are just taking
too long to multiply.
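(Four halvings: 32000 -> 16000 -> 8000 -> 4000 -> 2000.)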

The reason that I formerly felt that there was a problem with Strassen
is that when we compared 10000x10000 and 20000x20000 it seemed that
Magma beat us for the latter, but not the former. But when 20000x20000
gets broken down, it can't get broken into 10000x10000 because 10000
is not divisible by 64. So it can't be broken in this way along a limb
boundary. So whatever it is broken down into is too slow. That's where
we need to focus our efforts now, I think.

Bill.

Martin Albrecht

unread,
May 20, 2008, 4:47:48 AM5/20/08
to sage-...@googlegroups.com
On Tuesday 20 May 2008, Bill Hart wrote:
> Martin,
>
> On the M4RI website you say that M4R inversion is asymptotically fast
> with time complexity n^3/log_2(n), but Strassen's method has
> complexity n^log2(7), which I would call asymptotically fast.
>
> Do you just mean asymptotically faster than the classical algorithm?

Yes, I'll fix that.

> By the way, I wrote some code ages ago for computing the row echelon
> form of sparse matrices over GF2 which essentially inverts 6000x6000
> sparse matrices in about a second.

Excellent, is the code available somewhere? How dense are these matrices?

My plan for M4RI is:

- polish what we have for multiplication now, find appropriate/acceptable
parameters for a wide range of platforms, make sure it builds everywhere etc.

- during dev1 Greg and I are going to work on Strassen + M4RI/echelonisation.
This is the actual application we are most interested in

- Once that is done, I want to look into sparse matrices (which are probably
much more relevant for my work than dense matrices) so I'd be interested in
looking into your old code. I'm not sure this should go under the roof of the
M4RI library but if it does then this probably warrants a name change :-)

Martin

_jab: martinr...@jabber.ccc.de

Bill Hart

unread,
May 20, 2008, 8:30:39 AM5/20/08
to sage-devel
My code is not downloadable anywhere that I am aware of. I used to use
it in my quadratic sieve before I switched to Jason P's block Lanczos
code (BL solves a more restricted problem but is fine for the QS). My
old code is some of the earliest mathematical C code that I wrote, so it
can certainly be improved. I can make it available at some point. The
crossover with block Lanczos was about 6000x6000.

The matrices of 6000x6000 had something like 12 nonzero entries in
each row, so quite sparse. Of course the matrix quickly becomes dense
as the algorithm proceeds. I was also taking advantage of the fact
that the left hand side of the matrix was more dense than the right,
which is probably not the situation you would have arising in your
research, Martin.

At any rate, it might make a reasonable base case for an
implementation of something.

Bill.


Bill Hart

unread,
May 20, 2008, 10:42:50 PM5/20/08
to sage-devel
Hi Martin,

I downloaded the clean tarball and added an extra test, but I get:

mul: m: 4096, l: 3528, n: 4096, k: 0, cutoff: 1024
FAIL: Strassen != M4RM
FAIL: Strassen != Naive

:-(

Also I later replaced the following lines of strassen.c:

a -= a%RADIX;
b -= b%RADIX;
c -= c%RADIX;

with

unsigned long mult = 1;
unsigned long width = a;
while (width > 2*cutoff)
{
    width /= 2;
    mult *= 2;
}
/* round the dimensions down to a multiple of RADIX*mult, so that every
   level of the recursion cuts along a word boundary */
a -= a%(RADIX*mult);
b -= b%(RADIX*mult);
c -= c%(RADIX*mult);

and this sped up the 32000x32000 multiplication by a further 5% or so.
The other times didn't change (the 10000x10000 may have been slightly
quicker).

:-)

Bill.



Martin Albrecht

unread,
May 21, 2008, 4:50:25 AM5/21/08
to sage-...@googlegroups.com
On Wednesday 21 May 2008, Bill Hart wrote:
> Hi Martin,
>
> I downloaded the clean tarball and added an extra test, but I get:
>
> mul: m: 4096, l: 3528, n: 4096, k: 0, cutoff: 1024
> FAIL: Strassen != M4RM
> FAIL: Strassen != Naive
>
> :-(

Same here, I'll look into it right away. "Only" Strassen fails so it shouldn't
be too hard to figure out.

> Also I later replaced the following lines of strassen.c:
>
> a -= a%RADIX;
> b -= b%RADIX;
> c -= c%RADIX;
>
> with
>
> unsigned long mult = 1;
> unsigned long width = a;
> while (width > 2*cutoff)
> {
>     width /= 2;
>     mult *= 2;
> }
> a -= a%(RADIX*mult);
> b -= b%(RADIX*mult);
> c -= c%(RADIX*mult);
>
> and this sped up the 32000x32000 multiplication by a further 5% or so.
> The other times didn't change (the 10000x10000 may have been slightly
> quicker).
>
> :-)

Good, I'll add it and compare times on the C2D :-)

Martin

_jab: martinr...@jabber.ccc.de

Martin Albrecht

unread,
May 21, 2008, 2:57:59 PM5/21/08
to sage-...@googlegroups.com

I've uploaded a new version to

http://m4ri.sagemath.org/downloads/m4ri-20080521.tar.gz

which
- fixes the bug described above
- adds your code snippet
- fixes a couple of other problems reported by Valgrind.

It seems the last bugfix also speeds things up (on my notebook), since we
computed too much and wrote it to unallocated memory. Updated performance
figures:

64-bit Debian/GNU Linux, 2.33Ghz Core2Duo (Macbook Pro, 2nd Gen.)
Matrix Dimension Magma GAP M4RI
10,000 x 10,000 2.920 6.691 1.760
16,384 x 16,384 11.140 36.063 6.760
20,000 x 20,000 20.370 - 12.200
32,000 x 32,000 74.260 - 51.510

I'm having trouble believing those figures but all tests pass (including some
equality checks with Magma via Sage). On sage.math things don't seem to get
any better but it is hard to tell since it is often so loaded. Also, this
version only enables SSE2 on Intel CPUs and has a configure switch for OpenMP
(--enable-openmp).

Martin

PS: One thing that seems significant on sage.math: the system time is
considerable, e.g.:

sage: A = random_matrix(GF(2),32000,32000)
sage: B = random_matrix(GF(2),32000,32000)
sage: time C = A._multiply_strassen(B,cutoff=2^11)

CPU times: user 108.02 s, sys: 5.03 s, total: 113.05 s
Wall time: 113.05

_jab: martinr...@jabber.ccc.de

Bill Hart

unread,
May 21, 2008, 10:08:04 PM5/21/08
to sage-devel
Hi Martin,

This version works great. Here are the times on my unburdened 2.8GHz
Opteron: first the Magma times, then the times for an older version of
m4ri, and now, for the first time ever, the new Magma-beating times:

             Magma    Old      New
10000x10000: 2.940s   3.13s    2.25s
16384x16384: 9.250s   12.96s   8.80s
20000x20000: 16.57s   22.43s   15.48s
32000x32000: 59.1s    90.2s    57.8s

The first three times pretty much agree with what I had before and are
only marginally better. The fourth time is 4.3s faster than the time I
had (before you fixed the bug). I am not so surprised that a bug fix
could make this much difference; I've seen that sort of thing before.

I suppose we should do a comparison at 32000x32000 with Magma just to
verify, but other than that, well done Martin!!

Bill.


Martin Albrecht

unread,
May 22, 2008, 6:21:45 AM5/22/08
to sage-...@googlegroups.com
On Thursday 22 May 2008, Bill Hart wrote:
> [...]

Bill, I suppose that also means that now we actually beat (or are close to
beating) Magma on the C2D "for real". My M4RI times on the C2D are quite
similar to your times on your Opteron, but my version of Magma (on the C2D)
is much worse than your version of Magma (on the Opteron). So it is
probably fair to assume Magma would achieve at least your times on my
machine too.

Martin

PS: I wonder if this argument makes sense:

We have a complexity of n^2.807 and a CPU of, say, 2.333GHz which can
operate on 64 bits per clock (128 if we use SSE2). So if we had optimal
code (no missed branch predictions, no caching issues, everything optimal)
we would expect a running time of n^2.807 / 64 / 2.333 / 10^9 seconds.

If we plug in 20,000 for n we get 7.923 seconds without SSE2 and 3.961 with
SSE2. So our implementation (12.2 s) is a factor of ~1.5 or ~3 away from
being optimal? Does that sound correct or complete bollocks?
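
Spelling out the n = 20,000 case (my arithmetic):

    20000^2.807 ~ 1.18 * 10^12 word operations
    / 64 bits per operation ~ 1.85 * 10^10 cycles
    / (2.333 * 10^9 cycles per second) ~ 7.9 seconds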

_jab: martinr...@jabber.ccc.de

Bill Hart

unread,
May 22, 2008, 9:46:11 AM5/22/08
to sage-devel
The asymptotics really only kick in when we are using Strassen. The
starting point is thus not 0, but the crossover point between M4RM and
Strassen. So I don't think you can read too much into the asymptotic
statement. But once the asymptotics do kick in, we aren't too far off.
The only thing throwing them off slightly is the inexact cut business,
which my patch mitigates to some extent, though not totally.

I wouldn't be surprised to find we were a factor of 1.5 or 2 from
optimal, but this would be because we could probably do more work in
our base case without falling out of cache.

There is still room for improvement. A few percent can be gained by
splitting A horizontally into two matrices.

I also believe we could get a significant gain by allowing exact cuts,
i.e. not along 64-bit boundaries. This could be done, when doing Strassen,
by copying out the data from the two submatrices on the right and shifting
it (for the shifting we'd benefit from SSE on both the Opteron and the
C2D). But this would also dramatically increase memory usage, so I'm not
sure how pronounced the overall effect would be. Perhaps we'd gain 10-15%.
The other problem is that we'd have to mask the end of each row in the
submatrices on the left hand side before using them, so this could
complicate the code quite a bit.
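
The shifting itself is straightforward. A sketch (my illustration, not
existing m4ri code; it assumes an LSB-first layout and that src has one
extra readable word at the end):

    #include <stdint.h>
    #include <stddef.h>

    /* copy nwords words starting `bits` bits into src (0 < bits < 64)
       to a word-aligned destination */
    static void copy_row_shifted(uint64_t *dst, const uint64_t *src,
                                 size_t nwords, unsigned bits) {
        size_t i;
        for (i = 0; i < nwords; i++)
            dst[i] = (src[i] >> bits) | (src[i + 1] << (64 - bits));
    }

With SSE one could do the same 128 bits at a time, which is exactly the
shl/shr trick gf2x uses.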

Another option would be to represent our matrices a different way.
We'd store a 32000 by 32000 matrix as a series of 32000 x 2000
matrices, padded with zeroes out to a 64 bit boundary. Addition of
matrices wouldn't be affected much by this, but the m4rm code would
have to be substantially altered in the way it deals with the matrix
A.
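
As a data structure this might look like (illustration only; packedmatrix
is the existing m4ri matrix type):

    /* a wide matrix stored as vertical slices, each padded with zeroes
       out to a 64-bit boundary */
    typedef struct {
        packedmatrix **slices;  /* one m4ri matrix per column block */
        size_t nslices;
    } sliced_matrix;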

At any rate I think most of the possible substantial improvements now
result in yucky code.

Bill.



Clement Pernet

unread,
May 22, 2008, 1:19:11 PM5/22/08
to sage-...@googlegroups.com
Hi,

>
> Bill, I suppose that also means that now we actually beat (or are close to
> beating) Magma on the C2D "for real". My M4RI times are quite similar on the
> C2D as your times on your Opteron. But my version of Magma (on the C2D) is
> much worse than your version of Magma (on the Opteron). So it is probably
> best to assume at least your times for Magma on my machine too.
>

That's awesome news!

> PS: I wonder if this argument makes sense:
>
> We have a complexity of n^2.807 and a CPU of say 2.333 Ghz which can operate
> on 64 bits per clock (128 if we use SSE2). So if we had optimal code (no
> missed branch predictions, no caching issues, everything optimal) we would
> expect a running time of n^2.807 / 64 / 2.333 / 10^9

Don't forget the constants!
Strassen-Winograd is 6n^2.807. Now this constant corresponds to both muls
and adds, and I guess your boolean word operation ^= computes a + and a *
in one clock cycle, so I don't really know the constant in this case
(6/2 = 3 seems dubious to me).

Furthermore, as Bill pointed out, one really has to count the real number
of operations, since only a few recursive calls are made. Even so, this
would mean that the expected optimal time is larger, and consequently that
your code is closer to optimal!
