M4RI is a library by Gregory Bard and myself for fast arithmetic with dense matrices over GF(2). M4RI stands for the "Method of the Four Russians" inversion which is an algorithm for matrix reduction first published by Greg. The website is here:
This new version has among other things two new features: - SSE2 support where support (XOR two 128-bit words in one instruction) - Strassen-Winograd matrix multiplication on top of M4RM ("Method of the Four Russians" multiplication or Kronrod's algorithm)
I'm writing to [sage-devel] because I observed some odd performance characteristics and hope to pick your brains on how to improve performance:
At least in my experience enabling SSE2 results in slower code on AMD CPUs. I think this is rumoured across the net but did anyone else also witness this behaviour? I'm only using 128-bit integer XOR instrinsics. On my Core2Duo notebook SSE2 speeds up the computation at least up to the size of the L2 cache.
One would expect that the M4RM + Strassen implementation would have an easy time beating Magma since it uses a better algorithm. Same as Magma it does Strassen-Winograd down to the cutoff but then it switches to the M4RM algorithm (O(n^3/log_2(n))) rather than naive multiplication (O(n^3)). This should give a nice constant speed-up. This theory so far seems to match reality on a Core2Duo notebook with 4MB L2 cache:
---------------------------------------------------------------------- | SAGE Version 3.0.1, Release Date: 2008-05-05 | | Type notebook() for the GUI, and license() for information. | ---------------------------------------------------------------------- sage: A = random_matrix(GF(2),10^4,10^4) sage: B = random_matrix(GF(2),10^4,10^4) sage: time C = A._multiply_strassen(B,cutoff=3200) CPU times: user 4.50 s, sys: 0.03 s, total: 4.53 s Wall time: 4.56
Magma V2.13-5 Wed May 14 2008 21:55:31 on road [Seed = 2246483032] Type ? for help. Type <Ctrl>-D to quit.
> A := RandomMatrix(GF(2),10^4,10^4); > B := RandomMatrix(GF(2),10^4,10^4); > t := Cputime(); C := A*B; Cputime(t);
7.170
sage: A = random_matrix(GF(2),2*10^4,2*10^4) sage: B = random_matrix(GF(2),2*10^4,2*10^4) sage: time C = A._multiply_strassen(B,cutoff=3200) CPU times: user 33.35 s, sys: 0.09 s, total: 33.44 s Wall time: 33.59
> A := RandomMatrix(GF(2),2*10^4,2*10^4); > B := RandomMatrix(GF(2),2*10^4,2*10^4); > t := Cputime(); C := A*B; Cputime(t);
49.990
Finally, a perfect cutting example (no extra rows/columns to take care of):
sage: A = random_matrix(GF(2),1<<14,1<<14) sage: B = random_matrix(GF(2),1<<14,1<<14) sage: time C = A._multiply_strassen(B,cutoff=1<<10) # !=3200 CPU times: user 16.82 s, sys: 0.07 s, total: 16.88 s Wall time: 16.96
> A:= RandomMatrix(GF(2),2^14,2^14); > B:= RandomMatrix(GF(2),2^14,2^14); > t := Cputime(); C := A*B; Cputime(t);
24.060
Note however that this comparison is NOT FAIR since this version of Magma is not optimised for the C2D chip!
However, on AMD CPUs the times don't look that nice. Both on sage.math and on a
AMD Athlon(tm) 64 X2 Dual Core Processor 6400+
the performance is way way behind Magma (or what I'd expect Magma to perform like), e.g. on sage.math:
sage: A = random_matrix(GF(2),10^4,10^4) sage: B = random_matrix(GF(2),10^4,10^4) sage: time C = A._multiply_strassen(B,cutoff=1500) CPU times: user 14.40 s, sys: 0.17 s, total: 14.58 s Wall time: 14.62
Magma V2.13-5 Wed May 14 2008 14:07:18 on sage [Seed = 4255048009] Type ? for help. Type <Ctrl>-D to quit.
> A := RandomMatrix(GF(2),10^4,10^4); > B := RandomMatrix(GF(2),10^4,10^4); > t := Cputime(); C := A*B; Cputime(t);
3.890 <<--------- They even beat the C2D time!
sage: A = random_matrix(GF(2),2*10^4,2*10^4) sage: B = random_matrix(GF(2),2*10^4,2*10^4) sage: time C = A._multiply_strassen(B,cutoff=1500) CPU times: user 99.42 s, sys: 0.64 s, total: 100.06 s Wall time: 100.06
> B := RandomMatrix(GF(2),2*10^4,2*10^4); > A := RandomMatrix(GF(2),2*10^4,2*10^4); > t := Cputime(); C := A*B; Cputime(t);
27.470
# perfect cutoff sage: A = random_matrix(GF(2),1<<14,1<<14) sage: B = random_matrix(GF(2),1<<14,1<<14) sage: time C = A._multiply_strassen(B,cutoff=1<<10) CPU times: user 100.14 s, sys: 0.32 s, total: 100.46 s Wall time: 100.45 sage: time C = A._multiply_strassen(B,cutoff=1<<9) CPU times: user 76.48 s, sys: 0.39 s, total: 76.87 s Wall time: 76.87 sage: time C = A._multiply_strassen(B,cutoff=1<<8) CPU times: user 52.45 s, sys: 0.33 s, total: 52.78 s Wall time: 52.78
> A := RandomMatrix(GF(2),2^14,2^14); > B := RandomMatrix(GF(2),2^14,2^14); > t := Cputime(); C := A*B; Cputime(t);
14.970
I guess the bad performance of the M4RI library is partly related to the smaller L2 cache (AMD: 1MB, C2D: 4MB) which forces us to set the cutoff lower (since we want all three C = A*B submatrices to fit in L2 cache). Also it seems that Magma is almost always ~ 3.5 times faster than our implementation. I am slowly running out of ideas what to improve in order to close to the gap so if anyone has some ideas please let me hear them. It might be worth noting that we are using the same cutting strategy as Sage in strassen.pyx which ignores uneven rows/columns at every recursion step and deals with the extra rows/columns after the Strassen-Winograd formula (see strassen.pyx for details). We are also using the improved operation schedule from paper of Clement et al.Though I don't know if the cutting strategy is optimal, we also get beaten in a perfect cutting scenario, so I guess it is not that important for now. Anyway, I'd happily provide profiling data if anyone wants to take a look and squeeze a couple of cycles out. If not, then this e-mail is jut a quite long status report :-)
Btw. I don't have access to Magma 2.14 which I believe to be the fastest in linear algebra over GF(2). In version 2.14 they added SSE2 support too. So if anybody could compare the new M4RI with Magma 2.14 I'd be happy to hear about the results. I don't know what else to compare with except Magma. Any suggestions? Thoughts? Rants?
> Btw. I don't have access to Magma 2.14 which I believe to be the fastest in > linear algebra over GF(2). In version 2.14 they added SSE2 support too. > So if anybody could compare the new M4RI with Magma 2.14 I'd be happy to hear > about the results. I don't know what else to compare with except Magma.
Here are magma V2.14-9 timings on sage.math:
was@sage:~/magma64/bin$ ./magma Magma V2.14-9 Wed May 14 2008 17:33:23 on sage [Seed = 793568142] Type ? for help. Type <Ctrl>-D to quit.
> A := RandomMatrix(GF(2),10^4,10^4); > B := RandomMatrix(GF(2),10^4,10^4); > time C := A*B; Time: 4.020 > time C := A*B; Time: 3.980 > time C := A*B; Time: 3.990 > A := RandomMatrix(GF(2),2^14,2^14); > B := RandomMatrix(GF(2),2^14,2^14); > time C := A*B; Time: 15.450 > time C := A*B;
Time: 15.420
> Any suggestions? Thoughts? Rants?
Allan Steel is a really amazing programmer.
----
I tried your new code up at #3204 under OS X and get this:
sage: A = random_matrix(GF(2),10^4,10^4) sage: B = random_matrix(GF(2),10^4,10^4) sage: time C = A._multiply_strassen(B,cutoff=3200) sage.bin(39971) malloc: *** error for object 0xb95c010: Non-aligned pointer being freed (2) *** set a breakpoint in malloc_error_break to debug sage.bin(39971) malloc: *** error for object 0x79c9c10: Non-aligned pointer being freed (2) *** set a breakpoint in malloc_error_break to debug sage.bin(39971) malloc: *** error for object 0x7465a00: non-page-aligned, non-allocated pointer being freed *** set a breakpoint in malloc_error_break to debug sage.bin(39971) malloc: *** error for object 0x79ca610: Non-aligned pointer being freed (2) *** set a breakpoint in malloc_error_break to debug ... CPU times: user 10.29 s, sys: 0.26 s, total: 10.55 s Wall time: 16.31
> > Btw. I don't have access to Magma 2.14 which I believe to be the fastest > > in linear algebra over GF(2). In version 2.14 they added SSE2 support > > too. So if anybody could compare the new M4RI with Magma 2.14 I'd be > > happy to hear about the results. I don't know what else to compare with > > except Magma.
> Here are magma V2.14-9 timings on sage.math:
was@sage:~/magma64/bin$ ./magma Magma V2.14-9 Wed May 14 2008 17:33:23 on sage [Seed = 793568142] Type ? for help. Type <Ctrl>-D to quit.
> A := RandomMatrix(GF(2),10^4,10^4); > B := RandomMatrix(GF(2),10^4,10^4); > time C := A*B;
Time: 4.020
> time C := A*B; Time: 3.980 > time C := A*B; Time: 3.990 > A := RandomMatrix(GF(2),2^14,2^14); > B := RandomMatrix(GF(2),2^14,2^14); > time C := A*B; Time: 15.450 > time C := A*B;
Time: 15.420
This doesn't seem much different from 2.13, which is good for my purposes. Once I fixed the OSX issue mentioned below could you run the benchmark on the C2D? I want to see if they can make better use of the 4MB L2 cache.
> I tried your new code up at #3204 under OS X and get this:
> sage: A = random_matrix(GF(2),10^4,10^4) > sage: B = random_matrix(GF(2),10^4,10^4) > sage: time C = A._multiply_strassen(B,cutoff=3200) > sage.bin(39971) malloc: *** error for object 0xb95c010: Non-aligned > pointer being freed (2) > *** set a breakpoint in malloc_error_break to debug > sage.bin(39971) malloc: *** error for object 0x79c9c10: Non-aligned > pointer being freed (2) > *** set a breakpoint in malloc_error_break to debug > sage.bin(39971) malloc: *** error for object 0x7465a00: > non-page-aligned, non-allocated pointer being freed > *** set a breakpoint in malloc_error_break to debug > sage.bin(39971) malloc: *** error for object 0x79ca610: Non-aligned > pointer being freed (2) > *** set a breakpoint in malloc_error_break to debug > ... > CPU times: user 10.29 s, sys: 0.26 s, total: 10.55 s > Wall time: 16.31
> Maybe you're doing something wrong?
Likely, though my Valgrind log is clean under Linux. I'll try under OSX.
The cycle counter in your bench code seems to give random values on
the 2.4GHz Opteron with SUSE 9 linux that I have access to, which has
Magma 2-14.10 on it.
Anyhow, here are the Magma times:
> A := RandomMatrix(GF(2),10^4,10^4);
> B := RandomMatrix(GF(2),10^4,10^4);
> t := Cputime(); C := A*B; Cputime(t);
2.940
> A := RandomMatrix(GF(2),2*10^4,2*10^4);
> B := RandomMatrix(GF(2),2*10^4,2*10^4);
> t := Cputime(); C := A*B; Cputime(t);
16.570
> A := RandomMatrix(GF(2),2^14,2^14);
> B := RandomMatrix(GF(2),2^14,2^14);
> t := Cputime(); C := A*B; Cputime(t);
9.250
The corresponding times for your code are approximately (hand timed;
adjusted for generation of random matrices):
10^4: 8s
2*10^4: 59s
2^14: 43s
I think it is polynomials over GF2 which Magma 2-14.x is much faster
at rather than linear algebra.
Bill.
On 14 May, 22:52, Martin Albrecht <m...@informatik.uni-bremen.de>
wrote:
> M4RI is a library by Gregory Bard and myself for fast arithmetic with dense
> matrices over GF(2). M4RI stands for the "Method of the Four Russians"
> inversion which is an algorithm for matrix reduction first published by Greg.
> The website is here:
> This new version has among other things two new features:
> - SSE2 support where support (XOR two 128-bit words in one instruction)
> - Strassen-Winograd matrix multiplication on top of M4RM ("Method of the Four
> Russians" multiplication or Kronrod's algorithm)
> I'm writing to [sage-devel] because I observed some odd performance
> characteristics and hope to pick your brains on how to improve performance:
> At least in my experience enabling SSE2 results in slower code on AMD CPUs. I
> think this is rumoured across the net but did anyone else also witness this
> behaviour? I'm only using 128-bit integer XOR instrinsics. On my Core2Duo
> notebook SSE2 speeds up the computation at least up to the size of the L2
> cache.
> One would expect that the M4RM + Strassen implementation would have an easy
> time beating Magma since it uses a better algorithm. Same as Magma it does
> Strassen-Winograd down to the cutoff but then it switches to the M4RM
> algorithm (O(n^3/log_2(n))) rather than naive multiplication (O(n^3)). This
> should give a nice constant speed-up. This theory so far seems to match
> reality on a Core2Duo notebook with 4MB L2 cache:
> ----------------------------------------------------------------------
> | SAGE Version 3.0.1, Release Date: 2008-05-05 |
> | Type notebook() for the GUI, and license() for information. |
> ----------------------------------------------------------------------
> sage: A = random_matrix(GF(2),10^4,10^4)
> sage: B = random_matrix(GF(2),10^4,10^4)
> sage: time C = A._multiply_strassen(B,cutoff=3200)
> CPU times: user 4.50 s, sys: 0.03 s, total: 4.53 s
> Wall time: 4.56
> Magma V2.13-5 Wed May 14 2008 21:55:31 on road [Seed = 2246483032]
> Type ? for help. Type <Ctrl>-D to quit.> A := RandomMatrix(GF(2),10^4,10^4);
> > B := RandomMatrix(GF(2),10^4,10^4);
> > t := Cputime(); C := A*B; Cputime(t);
> 7.170
> sage: A = random_matrix(GF(2),2*10^4,2*10^4)
> sage: B = random_matrix(GF(2),2*10^4,2*10^4)
> sage: time C = A._multiply_strassen(B,cutoff=3200)
> CPU times: user 33.35 s, sys: 0.09 s, total: 33.44 s
> Wall time: 33.59
> > A := RandomMatrix(GF(2),2*10^4,2*10^4);
> > B := RandomMatrix(GF(2),2*10^4,2*10^4);
> > t := Cputime(); C := A*B; Cputime(t);
> 49.990
> Finally, a perfect cutting example (no extra rows/columns to take care of):
> sage: A = random_matrix(GF(2),1<<14,1<<14)
> sage: B = random_matrix(GF(2),1<<14,1<<14)
> sage: time C = A._multiply_strassen(B,cutoff=1<<10) # !=3200
> CPU times: user 16.82 s, sys: 0.07 s, total: 16.88 s
> Wall time: 16.96
> > A:= RandomMatrix(GF(2),2^14,2^14);
> > B:= RandomMatrix(GF(2),2^14,2^14);
> > t := Cputime(); C := A*B; Cputime(t);
> 24.060
> Note however that this comparison is NOT FAIR since this version of Magma is
> not optimised for the C2D chip!
> However, on AMD CPUs the times don't look that nice. Both on sage.math and on
> a
> AMD Athlon(tm) 64 X2 Dual Core Processor 6400+
> the performance is way way behind Magma (or what I'd expect Magma to perform
> like), e.g. on sage.math:
> sage: A = random_matrix(GF(2),10^4,10^4)
> sage: B = random_matrix(GF(2),10^4,10^4)
> sage: time C = A._multiply_strassen(B,cutoff=1500)
> CPU times: user 14.40 s, sys: 0.17 s, total: 14.58 s
> Wall time: 14.62
> Magma V2.13-5 Wed May 14 2008 14:07:18 on sage [Seed = 4255048009]
> Type ? for help. Type <Ctrl>-D to quit.> A := RandomMatrix(GF(2),10^4,10^4);
> > B := RandomMatrix(GF(2),10^4,10^4);
> > t := Cputime(); C := A*B; Cputime(t);
> 3.890 <<--------- They even beat the C2D time!
> sage: A = random_matrix(GF(2),2*10^4,2*10^4)
> sage: B = random_matrix(GF(2),2*10^4,2*10^4)
> sage: time C = A._multiply_strassen(B,cutoff=1500)
> CPU times: user 99.42 s, sys: 0.64 s, total: 100.06 s
> Wall time: 100.06
> > B := RandomMatrix(GF(2),2*10^4,2*10^4);
> > A := RandomMatrix(GF(2),2*10^4,2*10^4);
> > t := Cputime(); C := A*B; Cputime(t);
> 27.470
> # perfect cutoff
> sage: A = random_matrix(GF(2),1<<14,1<<14)
> sage: B = random_matrix(GF(2),1<<14,1<<14)
> sage: time C = A._multiply_strassen(B,cutoff=1<<10)
> CPU times: user 100.14 s, sys: 0.32 s, total: 100.46 s
> Wall time: 100.45
> sage: time C = A._multiply_strassen(B,cutoff=1<<9)
> CPU times: user 76.48 s, sys: 0.39 s, total: 76.87 s
> Wall time: 76.87
> sage: time C = A._multiply_strassen(B,cutoff=1<<8)
> CPU times: user 52.45 s, sys: 0.33 s, total: 52.78 s
> Wall time: 52.78
> > A := RandomMatrix(GF(2),2^14,2^14);
> > B := RandomMatrix(GF(2),2^14,2^14);
> > t := Cputime(); C := A*B; Cputime(t);
> 14.970
> I guess the bad performance of the M4RI library is partly related to the
> smaller L2 cache (AMD: 1MB, C2D: 4MB) which forces us to set the cutoff lower
> (since we want all three C = A*B submatrices to fit in L2 cache). Also it
> seems that Magma is almost always ~ 3.5 times faster than our implementation.
> I am slowly running out of ideas what to improve in order to close to the gap
> so if anyone has some ideas please let me hear them. It might be worth noting
> that we are using the same cutting strategy as Sage in strassen.pyx which
> ignores uneven rows/columns at every recursion step and deals with the extra
> rows/columns after the Strassen-Winograd formula (see strassen.pyx for
> details). We are also using the improved operation schedule from paper of
> Clement et al.Though I don't know if the cutting strategy is optimal, we also
> get beaten in a perfect cutting scenario, so I guess it is not that important
> for now. Anyway, I'd happily provide profiling data if anyone wants to take a
> look and squeeze a couple of cycles out. If not, then this e-mail is jut a
> quite long status report :-)
> Btw. I don't have access to Magma 2.14 which I believe to be the fastest in
> linear algebra over GF(2). In version 2.14 they added SSE2 support too.
> So if anybody could compare the new M4RI with Magma 2.14 I'd be happy to hear
> about the results. I don't know what else to compare with except Magma. Any
> suggestions? Thoughts? Rants?
> The cycle counter in your bench code seems to give random values on > the 2.4GHz Opteron with SUSE 9 linux that I have access to, which has > Magma 2-14.10 on it.
> Anyhow, here are the Magma times: > > A := RandomMatrix(GF(2),10^4,10^4); > > B := RandomMatrix(GF(2),10^4,10^4); > > t := Cputime(); C := A*B; Cputime(t);
> 2.940
> > A := RandomMatrix(GF(2),2*10^4,2*10^4); > > B := RandomMatrix(GF(2),2*10^4,2*10^4); > > t := Cputime(); C := A*B; Cputime(t);
> 16.570
> > A := RandomMatrix(GF(2),2^14,2^14); > > B := RandomMatrix(GF(2),2^14,2^14); > > t := Cputime(); C := A*B; Cputime(t);
> 9.250
> The corresponding times for your code are approximately (hand timed; > adjusted for generation of random matrices):
> 10^4: 8s > 2*10^4: 59s > 2^14: 43s
> I think it is polynomials over GF2 which Magma 2-14.x is much faster > at rather than linear algebra.
> Bill.
Hi Bill,
I cannot explain the random behaviour of the cpucycles counter. The cpucycles library is by Dan Bernstein so I'd suspect it is carefully coded. Some potential sources of variance are: dynticks (i.e. the kernel doesn't have a fixed tick rate, this is done to safe power), CPU frequency scaling and other applications. On the other hand, I'd suspect I don't need to tell you since you probably benchmarked more code than I did. To account for the variance I have the small run_bench.py script in the testsuite directory. Run it like this:
./run_bench.py -t mul -n 10000 -c 3200
It'll run the experiment 17 times (17 for no particular reason) and return the median runtime. -n is the size and -c the cutoff.
> On Thursday 15 May 2008, Bill Hart wrote: >> Hi Martin,
>> The cycle counter in your bench code seems to give random values on >> the 2.4GHz Opteron with SUSE 9 linux that I have access to, which has >> Magma 2-14.10 on it.
>> Anyhow, here are the Magma times: >> > A := RandomMatrix(GF(2),10^4,10^4); >> > B := RandomMatrix(GF(2),10^4,10^4); >> > t := Cputime(); C := A*B; Cputime(t);
>> 2.940
>> > A := RandomMatrix(GF(2),2*10^4,2*10^4); >> > B := RandomMatrix(GF(2),2*10^4,2*10^4); >> > t := Cputime(); C := A*B; Cputime(t);
>> 16.570
>> > A := RandomMatrix(GF(2),2^14,2^14); >> > B := RandomMatrix(GF(2),2^14,2^14); >> > t := Cputime(); C := A*B; Cputime(t);
>> 9.250
>> The corresponding times for your code are approximately (hand timed; >> adjusted for generation of random matrices):
>> 10^4: 8s >> 2*10^4: 59s >> 2^14: 43s
>> I think it is polynomials over GF2 which Magma 2-14.x is much faster >> at rather than linear algebra.
>> Bill.
> Hi Bill,
> I cannot explain the random behaviour of the cpucycles counter. The cpucycles > library is by Dan Bernstein so I'd suspect it is carefully coded. Some > potential sources of variance are: dynticks (i.e. the kernel doesn't have a > fixed tick rate, this is done to safe power), CPU frequency scaling and other > applications. On the other hand, I'd suspect I don't need to tell you since > you probably benchmarked more code than I did. To account for the variance I > have the small run_bench.py script in the testsuite directory. Run it like > this:
> ./run_bench.py -t mul -n 10000 -c 3200
> It'll run the experiment 17 times (17 for no particular reason) and return the > median runtime. -n is the size and -c the cutoff.
Median? Shouldn't you take the minimum? Are there any good papers on benchmarking?
> > It'll run the experiment 17 times (17 for no particular reason) and > > return the median runtime. -n is the size and -c the cutoff. > Median? Shouldn't you take the minimum? Are there any good papers > on benchmarking?
Benchmarking tools are encouraged to record several timings of a function: call cpucycles(), function(), cpucycles(), function(), etc., and then print one line reporting the differences between successive cpucycles() results. The median of several differences is much more stable than the average.
Cycle counts continue to increase while other programs are running, while the operating system is handling an interruption such as a network packet, etc. This won't affect the median of several timings of a fast function---the function usually won't be interrupted---but it can affect the median of several timings of a slow function. Hopefully a benchmarking machine isn't running other programs.
On dual-CPU systems (and dual-core systems such as the Athlon 64 X2), the CPUs often don't have synchronized cycle counters, so a process that switches CPUs can have its cycle counts jump forwards or backwards. I've never seen this affect the median of several timings.
Some CPUs dynamically reduce CPU speed to save power, but deliberately keep their cycle counters running at full speed, the idea being that measuring time is more important than measuring cycles. Hopefully a benchmarking machine won't enter power-saving mode.
Cycle counts are occasionally off by a multiple of 2^32 on some CPUs, as discussed below. I've never seen this affect the median of several timings.
The estimate returned by cpucycles_persecond() may improve accuracy after cpucycles() has been called repeatedly.
------------------------
I just followed that. I guess if one takes the minimum or not depends on the data. If there are (much) easier cases then it is probably not best to take the minimum. However, I guess for matrix multiplication the times don't really depend on the entries of the matrices so I could use the minimum or just output all (minimum, average, median).
Here is a run that illustrates the problem. Am I doing something
wrong?
wbhart@host-57-44:~/m4ri-20080514/testsuite> time ./
bench_multiplication 5000 2048
n: 5000, cutoff: 2048, cpu cycles: 2670143764
real 0m2.768s
user 0m2.760s
sys 0m0.008s
wbhart@host-57-44:~/m4ri-20080514/testsuite> time ./
bench_multiplication 5000 3000
n: 5000, cutoff: 3000, cpu cycles: 746344039
real 0m3.277s
user 0m3.252s
sys 0m0.024s
Bill.
On 15 May, 18:42, Martin Albrecht <m...@informatik.uni-bremen.de>
wrote:
> > The cycle counter in your bench code seems to give random values on
> > the 2.4GHz Opteron with SUSE 9 linux that I have access to, which has
> > Magma 2-14.10 on it.
> > Anyhow, here are the Magma times:
> > > A := RandomMatrix(GF(2),10^4,10^4);
> > > B := RandomMatrix(GF(2),10^4,10^4);
> > > t := Cputime(); C := A*B; Cputime(t);
> > 2.940
> > > A := RandomMatrix(GF(2),2*10^4,2*10^4);
> > > B := RandomMatrix(GF(2),2*10^4,2*10^4);
> > > t := Cputime(); C := A*B; Cputime(t);
> > 16.570
> > > A := RandomMatrix(GF(2),2^14,2^14);
> > > B := RandomMatrix(GF(2),2^14,2^14);
> > > t := Cputime(); C := A*B; Cputime(t);
> > 9.250
> > The corresponding times for your code are approximately (hand timed;
> > adjusted for generation of random matrices):
> > 10^4: 8s
> > 2*10^4: 59s
> > 2^14: 43s
> > I think it is polynomials over GF2 which Magma 2-14.x is much faster
> > at rather than linear algebra.
> > Bill.
> Hi Bill,
> I cannot explain the random behaviour of the cpucycles counter. The cpucycles
> library is by Dan Bernstein so I'd suspect it is carefully coded. Some
> potential sources of variance are: dynticks (i.e. the kernel doesn't have a
> fixed tick rate, this is done to safe power), CPU frequency scaling and other
> applications. On the other hand, I'd suspect I don't need to tell you since
> you probably benchmarked more code than I did. To account for the variance I
> have the small run_bench.py script in the testsuite directory. Run it like
> this:
> ./run_bench.py -t mul -n 10000 -c 3200
> It'll run the experiment 17 times (17 for no particular reason) and return the
> median runtime. -n is the size and -c the cutoff.
> Here is a run that illustrates the problem. Am I doing something > wrong?
No, I was stupid. The cpucycles are printed as %u but they should be printed as %llu since they are longer than an int. I've attached the fixed C file (since it is < 1KB).
Do you think Magma uses naive multiplication for its base case? This
can be ridicoulously fast, especially over GF2. I note for example
that Magma's base case is about 6 times faster than M4RI at 1000x1000.
Is it possible that the naive multiplication can just be optimised
with a far better constant and that the crossover with m4r is above
the crossover with Strassen-Winograd?
I also note that the M4RI SSE2 code doesn't seem faster than the
generic C code on the Opteron, in fact it seems the other way around.
Very strange.
Bill.
On 15 May, 20:25, Martin Albrecht <m...@informatik.uni-bremen.de>
wrote:
> > Here is a run that illustrates the problem. Am I doing something
> > wrong?
> No, I was stupid. The cpucycles are printed as %u but they should be printed
> as %llu since they are longer than an int. I've attached the fixed C file
> (since it is < 1KB).
> Do you think Magma uses naive multiplication for its base case? This > can be ridicoulously fast, especially over GF2. I note for example > that Magma's base case is about 6 times faster than M4RI at 1000x1000. > Is it possible that the naive multiplication can just be optimised > with a far better constant and that the crossover with m4r is above > the crossover with Strassen-Winograd?
> I also note that the M4RI SSE2 code doesn't seem faster than the > generic C code on the Opteron, in fact it seems the other way around. > Very strange.
I observed the same behaviour but have no explanation besides stuff like this:
How do you know what the Magma base case is, can you tell Magma somehow to print its crossover? I'll look into plugging in naive multiplication as the base case (after optimisation) to see if your idea is right, though I doubt that the crossover to M4RM is beyond the crossover to Strassen-Winograd.
Oh, actually I have no idea where Magma's crossover is. I'll wager it
is somewhere between 4000x4000 and 6000x6000, but let's not speculate.
I'll try and work it out with some timings.
Bill.
On 15 May, 22:23, Martin Albrecht <m...@informatik.uni-bremen.de>
wrote:
> > Do you think Magma uses naive multiplication for its base case? This
> > can be ridicoulously fast, especially over GF2. I note for example
> > that Magma's base case is about 6 times faster than M4RI at 1000x1000.
> > Is it possible that the naive multiplication can just be optimised
> > with a far better constant and that the crossover with m4r is above
> > the crossover with Strassen-Winograd?
> > I also note that the M4RI SSE2 code doesn't seem faster than the
> > generic C code on the Opteron, in fact it seems the other way around.
> > Very strange.
> I observed the same behaviour but have no explanation besides stuff like this:
> How do you know what the Magma base case is, can you tell Magma somehow to
> print its crossover? I'll look into plugging in naive multiplication as the
> base case (after optimisation) to see if your idea is right, though I doubt
> that the crossover to M4RM is beyond the crossover to Strassen-Winograd.
The crossover is not clear. The change from a smooth curve to a
squiggly line is about 3600. So presumably that is it, but the graph
also seems to change character at about 6200 or 7000 as well. One of
these changes may be cache related.
At 3600, the total data for all three matrices is almost 5mb and the
cache on my machine is 1024kb. But if Magma is using classical
multiplication, then this is pretty much irrelevant anyway, since you
can keep the working data within the cache for quite a while during
the algorithm.
Bill.
On 15 May, 22:41, Bill Hart <goodwillh...@googlemail.com> wrote:
> Oh, actually I have no idea where Magma's crossover is. I'll wager it
> is somewhere between 4000x4000 and 6000x6000, but let's not speculate.
> I'll try and work it out with some timings.
> Bill.
> On 15 May, 22:23, Martin Albrecht <m...@informatik.uni-bremen.de>
> wrote:
> > On Thursday 15 May 2008, Bill Hart wrote:
> > > Martin,
> > > Do you think Magma uses naive multiplication for its base case? This
> > > can be ridicoulously fast, especially over GF2. I note for example
> > > that Magma's base case is about 6 times faster than M4RI at 1000x1000.
> > > Is it possible that the naive multiplication can just be optimised
> > > with a far better constant and that the crossover with m4r is above
> > > the crossover with Strassen-Winograd?
> > > I also note that the M4RI SSE2 code doesn't seem faster than the
> > > generic C code on the Opteron, in fact it seems the other way around.
> > > Very strange.
> > I observed the same behaviour but have no explanation besides stuff like this:
> > How do you know what the Magma base case is, can you tell Magma somehow to
> > print its crossover? I'll look into plugging in naive multiplication as the
> > base case (after optimisation) to see if your idea is right, though I doubt
> > that the crossover to M4RM is beyond the crossover to Strassen-Winograd.
> The crossover is not clear. The change from a smooth curve to a > squiggly line is about 3600. So presumably that is it, but the graph > also seems to change character at about 6200 or 7000 as well. One of > these changes may be cache related. > At 3600, the total data for all three matrices is almost 5mb and the > cache on my machine is 1024kb. But if Magma is using classical > multiplication, then this is pretty much irrelevant anyway, since you > can keep the working data within the cache for quite a while during > the algorithm.
On the other hand: a squiggly line is what one one would expect for Strassen-Winograd due to the extra rows/columns that have to be taken care of. In any case: 3600 seems rather late for 1MB L2 cache. I think Allan Steel once gave a talk about his implementation and stated that they don't use classical block multiplication (I saw some slides with that remark).
I think it might just be possible to get down to the speed of Magma
with a highly optimised classical multiplication routine. At 3600X3600
one clearly has to do 3600x3600 scalar products of a row by a column.
We'll assume one of the matrices has been transposed to facilitate
this.
If we use SSE2 then we can operate 128 bits at a time. There are 16
SSE registers.
The basic idea would be to load 4 SSE registers from different rows of
matrix A and 2 from different columns of matrix B. We compute the
scalar products of all 8 combinations of rowsxcolumns simultaneously.
For this we need 2 temporary registers and 8 registers to hold the
running totals. So all in all we need 4+2+2+8 = 16 registers. We only
need to do an AND and an XOR to do the multiplication and addition
required.
Caching becomes irrelevant if we choose a large selection of rows from
A and a large selection of columns from B and do all the possible
scalar products or rows by columns before moving to the next part.
Assuming the Opteron can do the memory loads at the same time as doing
arithmetic not depending on those loads, careful instruction
scheduling should get everything down to the cost of an SSE AND and an
SSE OR per 128x128 bit pair in the classical algorithm. That puts the
entire algorithm at pretty close to what Magma is doing timewise.
Another option is to not use SSE and just use the 64 bit integer
registers. The disadvantage is one has to load things 64 bits at a
time. But the advantage is the Opteron can do three arithmetic
operations per cycle if properly scheduled. That gets 50% more work
done than the SSE registers, which can only do 1 x 128 bit operation
per cycle.
Of course I'm making the assumption here that the Opteron can indeed
do loads at the same time as arithmetic.
if not, then there is no way Magma can be using classical
multiplication out to 3600x3600 since there are simply too many
operations to perform in the number of cycles available.
Bill.
On 16 May, 00:03, Martin Albrecht <m...@informatik.uni-bremen.de>
wrote:
> > The crossover is not clear. The change from a smooth curve to a
> > squiggly line is about 3600. So presumably that is it, but the graph
> > also seems to change character at about 6200 or 7000 as well. One of
> > these changes may be cache related.
> > At 3600, the total data for all three matrices is almost 5mb and the
> > cache on my machine is 1024kb. But if Magma is using classical
> > multiplication, then this is pretty much irrelevant anyway, since you
> > can keep the working data within the cache for quite a while during
> > the algorithm.
> On the other hand: a squiggly line is what one one would expect for
> Strassen-Winograd due to the extra rows/columns that have to be taken care
> of. In any case: 3600 seems rather late for 1MB L2 cache. I think Allan Steel
> once gave a talk about his implementation and stated that they don't use
> classical block multiplication (I saw some slides with that remark).
I made some changes to the original code so it would use the cache a
bit better. The test code seems to pass, so I don't think I've screwed
anything up.
The time for 16384x16384 on my machine is now 20s, so a factor of
2.15x faster. The time for 2000x2000 also seems to be the same time as
Magma now. Hopefully I didn't just mistime things before, and this is
a real improvement.
I am still trying things out.
Bill.
On 16 May, 01:41, Bill Hart <goodwillh...@googlemail.com> wrote:
> I think it might just be possible to get down to the speed of Magma
> with a highly optimised classical multiplication routine. At 3600X3600
> one clearly has to do 3600x3600 scalar products of a row by a column.
> We'll assume one of the matrices has been transposed to facilitate
> this.
> If we use SSE2 then we can operate 128 bits at a time. There are 16
> SSE registers.
> The basic idea would be to load 4 SSE registers from different rows of
> matrix A and 2 from different columns of matrix B. We compute the
> scalar products of all 8 combinations of rowsxcolumns simultaneously.
> For this we need 2 temporary registers and 8 registers to hold the
> running totals. So all in all we need 4+2+2+8 = 16 registers. We only
> need to do an AND and an XOR to do the multiplication and addition
> required.
> Caching becomes irrelevant if we choose a large selection of rows from
> A and a large selection of columns from B and do all the possible
> scalar products or rows by columns before moving to the next part.
> Assuming the Opteron can do the memory loads at the same time as doing
> arithmetic not depending on those loads, careful instruction
> scheduling should get everything down to the cost of an SSE AND and an
> SSE OR per 128x128 bit pair in the classical algorithm. That puts the
> entire algorithm at pretty close to what Magma is doing timewise.
> Another option is to not use SSE and just use the 64 bit integer
> registers. The disadvantage is one has to load things 64 bits at a
> time. But the advantage is the Opteron can do three arithmetic
> operations per cycle if properly scheduled. That gets 50% more work
> done than the SSE registers, which can only do 1 x 128 bit operation
> per cycle.
> Of course I'm making the assumption here that the Opteron can indeed
> do loads at the same time as arithmetic.
> if not, then there is no way Magma can be using classical
> multiplication out to 3600x3600 since there are simply too many
> operations to perform in the number of cycles available.
> Bill.
> On 16 May, 00:03, Martin Albrecht <m...@informatik.uni-bremen.de>
> wrote:
> > > The crossover is not clear. The change from a smooth curve to a
> > > squiggly line is about 3600. So presumably that is it, but the graph
> > > also seems to change character at about 6200 or 7000 as well. One of
> > > these changes may be cache related.
> > > At 3600, the total data for all three matrices is almost 5mb and the
> > > cache on my machine is 1024kb. But if Magma is using classical
> > > multiplication, then this is pretty much irrelevant anyway, since you
> > > can keep the working data within the cache for quite a while during
> > > the algorithm.
> > On the other hand: a squiggly line is what one one would expect for
> > Strassen-Winograd due to the extra rows/columns that have to be taken care
> > of. In any case: 3600 seems rather late for 1MB L2 cache. I think Allan Steel
> > once gave a talk about his implementation and stated that they don't use
> > classical block multiplication (I saw some slides with that remark).
> I made some changes to the original code so it would use the cache a
> bit better. The test code seems to pass, so I don't think I've screwed
> anything up.
> The time for 16384x16384 on my machine is now 20s, so a factor of
> 2.15x faster. The time for 2000x2000 also seems to be the same time as
> Magma now. Hopefully I didn't just mistime things before, and this is
> a real improvement.
> I am still trying things out.
> Bill.
> On 16 May, 01:41, Bill Hart <goodwillh...@googlemail.com> wrote:
> > I think it might just be possible to get down to the speed of Magma
> > with a highly optimised classical multiplication routine. At 3600X3600
> > one clearly has to do 3600x3600 scalar products of a row by a column.
> > We'll assume one of the matrices has been transposed to facilitate
> > this.
> > If we use SSE2 then we can operate 128 bits at a time. There are 16
> > SSE registers.
> > The basic idea would be to load 4 SSE registers from different rows of
> > matrix A and 2 from different columns of matrix B. We compute the
> > scalar products of all 8 combinations of rowsxcolumns simultaneously.
> > For this we need 2 temporary registers and 8 registers to hold the
> > running totals. So all in all we need 4+2+2+8 = 16 registers. We only
> > need to do an AND and an XOR to do the multiplication and addition
> > required.
> > Caching becomes irrelevant if we choose a large selection of rows from
> > A and a large selection of columns from B and do all the possible
> > scalar products or rows by columns before moving to the next part.
> > Assuming the Opteron can do the memory loads at the same time as doing
> > arithmetic not depending on those loads, careful instruction
> > scheduling should get everything down to the cost of an SSE AND and an
> > SSE OR per 128x128 bit pair in the classical algorithm. That puts the
> > entire algorithm at pretty close to what Magma is doing timewise.
> > Another option is to not use SSE and just use the 64 bit integer
> > registers. The disadvantage is one has to load things 64 bits at a
> > time. But the advantage is the Opteron can do three arithmetic
> > operations per cycle if properly scheduled. That gets 50% more work
> > done than the SSE registers, which can only do 1 x 128 bit operation
> > per cycle.
> > Of course I'm making the assumption here that the Opteron can indeed
> > do loads at the same time as arithmetic.
> > if not, then there is no way Magma can be using classical
> > multiplication out to 3600x3600 since there are simply too many
> > operations to perform in the number of cycles available.
> > Bill.
> > On 16 May, 00:03, Martin Albrecht <m...@informatik.uni-bremen.de>
> > wrote:
> > > > The crossover is not clear. The change from a smooth curve to a
> > > > squiggly line is about 3600. So presumably that is it, but the graph
> > > > also seems to change character at about 6200 or 7000 as well. One of
> > > > these changes may be cache related.
> > > > At 3600, the total data for all three matrices is almost 5mb and the
> > > > cache on my machine is 1024kb. But if Magma is using classical
> > > > multiplication, then this is pretty much irrelevant anyway, since you
> > > > can keep the working data within the cache for quite a while during
> > > > the algorithm.
> > > On the other hand: a squiggly line is what one one would expect for
> > > Strassen-Winograd due to the extra rows/columns that have to be taken care
> > > of. In any case: 3600 seems rather late for 1MB L2 cache. I think Allan Steel
> > > once gave a talk about his implementation and stated that they don't use
> > > classical block multiplication (I saw some slides with that remark).
I used a crossover of 3600 and I indicate at the top of this file how
constants should be changed to get the speedups for the various values
of n. I didn't put any code in to automatically choose the correct
values.
I am presuming that the test_multiplication code actually tests this
function, otherwise maybe my code is just rubbish.
The basic idea is to cache block the A matrix into groups of BLOCK
rows. If one also cache blocked the B matrix and left the computed
gray tables in memory instead of writing over them and recalculating
them all the time, as I currently do, one could probably get the other
factor of 2 that we need.
Note I've turned SSE off in this function, but left it on in
_mzd_combine since it makes no real difference there.
Bill.
On 16 May, 14:16, Bill Hart <goodwillh...@googlemail.com> wrote:
> 10000x10000 is down to 4.5s now, so nearly a 2x speedup.
> 20000x20000 is down to 32.0s, so again nearly a 2x speedup.
> Bill.
> On 16 May, 13:53, Bill Hart <goodwillh...@googlemail.com> wrote:
> > I made some changes to the original code so it would use the cache a
> > bit better. The test code seems to pass, so I don't think I've screwed
> > anything up.
> > The time for 16384x16384 on my machine is now 20s, so a factor of
> > 2.15x faster. The time for 2000x2000 also seems to be the same time as
> > Magma now. Hopefully I didn't just mistime things before, and this is
> > a real improvement.
> > I am still trying things out.
> > Bill.
> > On 16 May, 01:41, Bill Hart <goodwillh...@googlemail.com> wrote:
> > > I think it might just be possible to get down to the speed of Magma
> > > with a highly optimised classical multiplication routine. At 3600X3600
> > > one clearly has to do 3600x3600 scalar products of a row by a column.
> > > We'll assume one of the matrices has been transposed to facilitate
> > > this.
> > > If we use SSE2 then we can operate 128 bits at a time. There are 16
> > > SSE registers.
> > > The basic idea would be to load 4 SSE registers from different rows of
> > > matrix A and 2 from different columns of matrix B. We compute the
> > > scalar products of all 8 combinations of rowsxcolumns simultaneously.
> > > For this we need 2 temporary registers and 8 registers to hold the
> > > running totals. So all in all we need 4+2+2+8 = 16 registers. We only
> > > need to do an AND and an XOR to do the multiplication and addition
> > > required.
> > > Caching becomes irrelevant if we choose a large selection of rows from
> > > A and a large selection of columns from B and do all the possible
> > > scalar products or rows by columns before moving to the next part.
> > > Assuming the Opteron can do the memory loads at the same time as doing
> > > arithmetic not depending on those loads, careful instruction
> > > scheduling should get everything down to the cost of an SSE AND and an
> > > SSE OR per 128x128 bit pair in the classical algorithm. That puts the
> > > entire algorithm at pretty close to what Magma is doing timewise.
> > > Another option is to not use SSE and just use the 64 bit integer
> > > registers. The disadvantage is one has to load things 64 bits at a
> > > time. But the advantage is the Opteron can do three arithmetic
> > > operations per cycle if properly scheduled. That gets 50% more work
> > > done than the SSE registers, which can only do 1 x 128 bit operation
> > > per cycle.
> > > Of course I'm making the assumption here that the Opteron can indeed
> > > do loads at the same time as arithmetic.
> > > if not, then there is no way Magma can be using classical
> > > multiplication out to 3600x3600 since there are simply too many
> > > operations to perform in the number of cycles available.
> > > Bill.
> > > On 16 May, 00:03, Martin Albrecht <m...@informatik.uni-bremen.de>
> > > wrote:
> > > > > The crossover is not clear. The change from a smooth curve to a
> > > > > squiggly line is about 3600. So presumably that is it, but the graph
> > > > > also seems to change character at about 6200 or 7000 as well. One of
> > > > > these changes may be cache related.
> > > > > At 3600, the total data for all three matrices is almost 5mb and the
> > > > > cache on my machine is 1024kb. But if Magma is using classical
> > > > > multiplication, then this is pretty much irrelevant anyway, since you
> > > > > can keep the working data within the cache for quite a while during
> > > > > the algorithm.
> > > > On the other hand: a squiggly line is what one one would expect for
> > > > Strassen-Winograd due to the extra rows/columns that have to be taken care
> > > > of. In any case: 3600 seems rather late for 1MB L2 cache. I think Allan Steel
> > > > once gave a talk about his implementation and stated that they don't use
> > > > classical block multiplication (I saw some slides with that remark).
I tried cache blocking matrix B, but the times are exactly the same. I
think the processor is happy to keep loading B one row at a time
sequentially, and since it is only done a handful of times in the
algorithm, it accounts for little of the runtime.
It might go faster in combination with storing the gray tables, but
the optimal block size seems to be about 100 (*k) for the matrix B,
which would necessitate storage of 100 gray tables, which is an awful
lot of memory. Probably I think it is best to avoid this kind of
memory usage.
So I am not sure where the other factor of 2 will come from.
Bill.
On 16 May, 14:37, Bill Hart <goodwillh...@googlemail.com> wrote:
> I used a crossover of 3600 and I indicate at the top of this file how
> constants should be changed to get the speedups for the various values
> of n. I didn't put any code in to automatically choose the correct
> values.
> I am presuming that the test_multiplication code actually tests this
> function, otherwise maybe my code is just rubbish.
> The basic idea is to cache block the A matrix into groups of BLOCK
> rows. If one also cache blocked the B matrix and left the computed
> gray tables in memory instead of writing over them and recalculating
> them all the time, as I currently do, one could probably get the other
> factor of 2 that we need.
> Note I've turned SSE off in this function, but left it on in
> _mzd_combine since it makes no real difference there.
> Bill.
> On 16 May, 14:16, Bill Hart <goodwillh...@googlemail.com> wrote:
> > 10000x10000 is down to 4.5s now, so nearly a 2x speedup.
> > 20000x20000 is down to 32.0s, so again nearly a 2x speedup.
> > Bill.
> > On 16 May, 13:53, Bill Hart <goodwillh...@googlemail.com> wrote:
> > > I made some changes to the original code so it would use the cache a
> > > bit better. The test code seems to pass, so I don't think I've screwed
> > > anything up.
> > > The time for 16384x16384 on my machine is now 20s, so a factor of
> > > 2.15x faster. The time for 2000x2000 also seems to be the same time as
> > > Magma now. Hopefully I didn't just mistime things before, and this is
> > > a real improvement.
> > > I am still trying things out.
> > > Bill.
> > > On 16 May, 01:41, Bill Hart <goodwillh...@googlemail.com> wrote:
> > > > I think it might just be possible to get down to the speed of Magma
> > > > with a highly optimised classical multiplication routine. At 3600X3600
> > > > one clearly has to do 3600x3600 scalar products of a row by a column.
> > > > We'll assume one of the matrices has been transposed to facilitate
> > > > this.
> > > > If we use SSE2 then we can operate 128 bits at a time. There are 16
> > > > SSE registers.
> > > > The basic idea would be to load 4 SSE registers from different rows of
> > > > matrix A and 2 from different columns of matrix B. We compute the
> > > > scalar products of all 8 combinations of rowsxcolumns simultaneously.
> > > > For this we need 2 temporary registers and 8 registers to hold the
> > > > running totals. So all in all we need 4+2+2+8 = 16 registers. We only
> > > > need to do an AND and an XOR to do the multiplication and addition
> > > > required.
> > > > Caching becomes irrelevant if we choose a large selection of rows from
> > > > A and a large selection of columns from B and do all the possible
> > > > scalar products or rows by columns before moving to the next part.
> > > > Assuming the Opteron can do the memory loads at the same time as doing
> > > > arithmetic not depending on those loads, careful instruction
> > > > scheduling should get everything down to the cost of an SSE AND and an
> > > > SSE OR per 128x128 bit pair in the classical algorithm. That puts the
> > > > entire algorithm at pretty close to what Magma is doing timewise.
> > > > Another option is to not use SSE and just use the 64 bit integer
> > > > registers. The disadvantage is one has to load things 64 bits at a
> > > > time. But the advantage is the Opteron can do three arithmetic
> > > > operations per cycle if properly scheduled. That gets 50% more work
> > > > done than the SSE registers, which can only do 1 x 128 bit operation
> > > > per cycle.
> > > > Of course I'm making the assumption here that the Opteron can indeed
> > > > do loads at the same time as arithmetic.
> > > > if not, then there is no way Magma can be using classical
> > > > multiplication out to 3600x3600 since there are simply too many
> > > > operations to perform in the number of cycles available.
> > > > Bill.
> > > > On 16 May, 00:03, Martin Albrecht <m...@informatik.uni-bremen.de>
> > > > wrote:
> > > > > On Thursday 15 May 2008, Bill Hart wrote:
> > > > > > The crossover is not clear. The change from a smooth curve to a
> > > > > > squiggly line is about 3600. So presumably that is it, but the graph
> > > > > > also seems to change character at about 6200 or 7000 as well. One of
> > > > > > these changes may be cache related.
> > > > > > At 3600, the total data for all three matrices is almost 5mb and the
> > > > > > cache on my machine is 1024kb. But if Magma is using classical
> > > > > > multiplication, then this is pretty much irrelevant anyway, since you
> > > > > > can keep the working data within the cache for quite a while during
> > > > > > the algorithm.
> > > > > On the other hand: a squiggly line is what one one would expect for
> > > > > Strassen-Winograd due to the extra rows/columns that have to be taken care
> > > > > of. In any case: 3600 seems rather late for 1MB L2 cache. I think Allan Steel
> > > > > once gave a talk about his implementation and stated that they don't use
> > > > > classical block multiplication (I saw some slides with that remark).
> I tried cache blocking matrix B, but the times are exactly the same. I
> think the processor is happy to keep loading B one row at a time
> sequentially, and since it is only done a handful of times in the
> algorithm, it accounts for little of the runtime.
> It might go faster in combination with storing the gray tables, but
> the optimal block size seems to be about 100 (*k) for the matrix B,
> which would necessitate storage of 100 gray tables, which is an awful
> lot of memory. Probably I think it is best to avoid this kind of
> memory usage.
> So I am not sure where the other factor of 2 will come from.
> Bill.
> On 16 May, 14:37, Bill Hart <goodwillh...@googlemail.com> wrote:
> > Here is the modified _mzd_mul_m4rm_impl function which gives this
> > speedup:
> > I used a crossover of 3600 and I indicate at the top of this file how
> > constants should be changed to get the speedups for the various values
> > of n. I didn't put any code in to automatically choose the correct
> > values.
> > I am presuming that the test_multiplication code actually tests this
> > function, otherwise maybe my code is just rubbish.
> > The basic idea is to cache block the A matrix into groups of BLOCK
> > rows. If one also cache blocked the B matrix and left the computed
> > gray tables in memory instead of writing over them and recalculating
> > them all the time, as I currently do, one could probably get the other
> > factor of 2 that we need.
> > Note I've turned SSE off in this function, but left it on in
> > _mzd_combine since it makes no real difference there.
> > Bill.
> > On 16 May, 14:16, Bill Hart <goodwillh...@googlemail.com> wrote:
> > > 10000x10000 is down to 4.5s now, so nearly a 2x speedup.
> > > 20000x20000 is down to 32.0s, so again nearly a 2x speedup.
> > > Bill.
> > > On 16 May, 13:53, Bill Hart <goodwillh...@googlemail.com> wrote:
> > > > I made some changes to the original code so it would use the cache a
> > > > bit better. The test code seems to pass, so I don't think I've screwed
> > > > anything up.
> > > > The time for 16384x16384 on my machine is now 20s, so a factor of
> > > > 2.15x faster. The time for 2000x2000 also seems to be the same time as
> > > > Magma now. Hopefully I didn't just mistime things before, and this is
> > > > a real improvement.
> > > > I am still trying things out.
> > > > Bill.
> > > > On 16 May, 01:41, Bill Hart <goodwillh...@googlemail.com> wrote:
> > > > > I think it might just be possible to get down to the speed of Magma
> > > > > with a highly optimised classical multiplication routine. At 3600X3600
> > > > > one clearly has to do 3600x3600 scalar products of a row by a column.
> > > > > We'll assume one of the matrices has been transposed to facilitate
> > > > > this.
> > > > > If we use SSE2 then we can operate 128 bits at a time. There are 16
> > > > > SSE registers.
> > > > > The basic idea would be to load 4 SSE registers from different rows of
> > > > > matrix A and 2 from different columns of matrix B. We compute the
> > > > > scalar products of all 8 combinations of rowsxcolumns simultaneously.
> > > > > For this we need 2 temporary registers and 8 registers to hold the
> > > > > running totals. So all in all we need 4+2+2+8 = 16 registers. We only
> > > > > need to do an AND and an XOR to do the multiplication and addition
> > > > > required.
> > > > > Caching becomes irrelevant if we choose a large selection of rows from
> > > > > A and a large selection of columns from B and do all the possible
> > > > > scalar products or rows by columns before moving to the next part.
> > > > > Assuming the Opteron can do the memory loads at the same time as doing
> > > > > arithmetic not depending on those loads, careful instruction
> > > > > scheduling should get everything down to the cost of an SSE AND and an
> > > > > SSE OR per 128x128 bit pair in the classical algorithm. That puts the
> > > > > entire algorithm at pretty close to what Magma is doing timewise.
> > > > > Another option is to not use SSE and just use the 64 bit integer
> > > > > registers. The disadvantage is one has to load things 64 bits at a
> > > > > time. But the advantage is the Opteron can do three arithmetic
> > > > > operations per cycle if properly scheduled. That gets 50% more work
> > > > > done than the SSE registers, which can only do 1 x 128 bit operation
> > > > > per cycle.
> > > > > Of course I'm making the assumption here that the Opteron can indeed
> > > > > do loads at the same time as arithmetic.
> > > > > if not, then there is no way Magma can be using classical
> > > > > multiplication out to 3600x3600 since there are simply too many
> > > > > operations to perform in the number of cycles available.
> > > > > Bill.
> > > > > On 16 May, 00:03, Martin Albrecht <m...@informatik.uni-bremen.de>
> > > > > wrote:
> > > > > > On Thursday 15 May 2008, Bill Hart wrote:
> > > > > > > The crossover is not clear. The change from a smooth curve to a
> > > > > > > squiggly line is about 3600. So presumably that is it, but the graph
> > > > > > > also seems to change character at about 6200 or 7000 as well. One of
> > > > > > > these changes may be cache related.
> > > > > > > At 3600, the total data for all three matrices is almost 5mb and the
> > > > > > > cache on my machine is 1024kb. But if Magma is using classical
> > > > > > > multiplication, then this is pretty much irrelevant anyway, since you
> > > > > > > can keep the working data within the cache for quite a while during
> > > > > > > the algorithm.
> > > > > > On the other hand: a squiggly line is what one one would expect for
> > > > > > Strassen-Winograd due to the extra rows/columns that have to be taken care
> > > > > > of. In any case: 3600 seems rather late for 1MB L2 cache. I think Allan Steel
> > > > > > once gave a talk about his implementation and stated that they don't use
> > > > > > classical block multiplication (I saw some slides with that remark).
Here are the times I get for Magma vs M4RI now. Note the crossover
between the two programs is now above about 5000 and M4RI beats Magma
below that point. This suggests the remaining factor of 2 is in the
Strassen-Winograd function. Probably Winograd-Strassen is falling out
of L2 cache (the previous adjustments I made were to prevent the M4R
algorithm falling out of L1 cache).
The other possibility is that Magma combines the two algorithms so
that there is even greater usage of the Gray code tables. This would
be an ugly hack, but could work.
40000x40000:
Magma: 112.6s
M4RI: 232.4s
20000x20000:
Magma: 16.40s
M4RI: 32.34s
10000x10000:
Magma: 2.750s
M4RI: 4.529s
5000x5000:
Magma: 0.700s
M4RI: 0.672s
2500x2500:
Magma: 0.13s
M4RI: 0.079s
1250x1250:
Magma: 0.015s
M4RI: 0.012s
625x625:
Magma: 0.0030s
M4RI: 0.0023s
312x312:
Magma: 0.0014s
M4RI: 0.00032s
156x156:
Magma: 0.001s
M4RI: 0.0001s
If I get some time I'll look into this.
Did those changes work for you Martin?
Bill.
On 16 May, 16:24, Martin Albrecht <m...@informatik.uni-bremen.de>
wrote:
> Here are the times I get for Magma vs M4RI now. Note the crossover > between the two programs is now above about 5000 and M4RI beats Magma > below that point. This suggests the remaining factor of 2 is in the > Strassen-Winograd function. Probably Winograd-Strassen is falling out > of L2 cache (the previous adjustments I made were to prevent the M4R > algorithm falling out of L1 cache).
> The other possibility is that Magma combines the two algorithms so > that there is even greater usage of the Gray code tables. This would > be an ugly hack, but could work.
Are you suggesting Magma uses M4RM too? I'd doubt that, since they don't state that anyway. Probably I'm just misunderstanding you.
Yes, the change worked like a charm. I made some changes (the fixed k is replaced by a k that depends on the new block dimensions etc.) and it is much much faster now. I'm working on re-introducing SSE2 now to see if it at least on the Core2Duo makes the world a better place. Btw. all the stuff I wrote about the L2 cache size C2D vs. Opteron was bollocks. The reason I beat Magma so badly earlier was that I used a 32-bit Magma and compared it with a 64-bit version of M4RI. To state it clearly: We don't beat Magma anywhere. Anyhow, here are the times:
64-bit Debian/GNU Linux, 2.33Ghz Core2Duo Matrix Dimension Magma 2.14-13 (64-bit) M4RI-20080517 (64-bit) 10,000 x 10,000 2.920 4.130 16,384 x 16,384 11.140 15.740 20,000 x 20,000 20.370 28.950
64-bit Debian/GNU Linux, 1.8Ghz Opteron (sage.math) Matrix Dimension Magma 2.13-5 (64-bit) M4RI-20080517 (64-bit) 10,000 x 10,000 3.930 7.860 16,384 x 16,384 16.230 104.77??? 20,000 x 20,000 27.080 56.420
(My university today finally granted me access to Magma 2.14 for my notebook.)
As you can see there is an odd (reproducible) spike at 2^14 x 2^14. I cannot explain that for now, it might just be a bug. I have a similar spike in another run on another AMD (Athlon X2) machine:
I'm investigating this one too and play around with the parameters. Since the C2D times are considerably better I also bet it is L2 related, we'll see.
That spike is wierd. Basically I got closer to 2x the time of Magma
for 16384x16384, but you need different parameters than for
10000x10000 or 20000x20000 since the size of the M4R matrices will be
different than in either of those cases.
For 16384x16384 my notes say that I used k = 6 and BLOCK = 256. One
might also need to fiddle with the cutoff. I think I used 3600, but at
various times I had the cutoff set lower.
It would be awfully surprising if there wasn't a set of parameters
that dropped this time right down. It is possible that the L1 cache is
smaller on sage.math than on the Opteron I was using. Perhaps my
parameters don't apply on that machine.
Bill.
On 16 May, 23:05, Martin Albrecht <m...@informatik.uni-bremen.de>
wrote:
> > Here are the times I get for Magma vs M4RI now. Note the crossover
> > between the two programs is now above about 5000 and M4RI beats Magma
> > below that point. This suggests the remaining factor of 2 is in the
> > Strassen-Winograd function. Probably Winograd-Strassen is falling out
> > of L2 cache (the previous adjustments I made were to prevent the M4R
> > algorithm falling out of L1 cache).
> > The other possibility is that Magma combines the two algorithms so
> > that there is even greater usage of the Gray code tables. This would
> > be an ugly hack, but could work.
> Are you suggesting Magma uses M4RM too? I'd doubt that, since they don't state
> that anyway. Probably I'm just misunderstanding you.
> Yes, the change worked like a charm. I made some changes (the fixed k is
> replaced by a k that depends on the new block dimensions etc.) and it is much
> much faster now. I'm working on re-introducing SSE2 now to see if it at least
> on the Core2Duo makes the world a better place. Btw. all the stuff I wrote
> about the L2 cache size C2D vs. Opteron was bollocks. The reason I beat Magma
> so badly earlier was that I used a 32-bit Magma and compared it with a 64-bit
> version of M4RI. To state it clearly: We don't beat Magma anywhere. Anyhow,
> here are the times:
> 64-bit Debian/GNU Linux, 2.33Ghz Core2Duo
> Matrix Dimension Magma 2.14-13 (64-bit) M4RI-20080517 (64-bit)
> 10,000 x 10,000 2.920 4.130
> 16,384 x 16,384 11.140 15.740
> 20,000 x 20,000 20.370 28.950
> 64-bit Debian/GNU Linux, 1.8Ghz Opteron (sage.math)
> Matrix Dimension Magma 2.13-5 (64-bit) M4RI-20080517 (64-bit)
> 10,000 x 10,000 3.930 7.860
> 16,384 x 16,384 16.230 104.77???
> 20,000 x 20,000 27.080 56.420
> (My university today finally granted me access to Magma 2.14 for my notebook.)
> As you can see there is an odd (reproducible) spike at 2^14 x 2^14. I cannot
> explain that for now, it might just be a bug. I have a similar spike in
> another run on another AMD (Athlon X2) machine:
> I'm investigating this one too and play around with the parameters. Since the
> C2D times are considerably better I also bet it is L2 related, we'll see.