Account Options

  1. Sign in
The old Google Groups will be going away soon, but your browser is incompatible with the new version.
Google Groups Home
« Groups Home
slightly OT: new M4RI library
There are currently too many topics in this group that display first. To make this topic appear first, remove this option from another topic.
There was an error processing your request. Please try again.
flag
  Messages 1 - 25 of 85 - Collapse all  -  Translate all to Translated (View all originals)   Newer >
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
Martin Albrecht  
View profile  
 More options May 14 2008, 5:52 pm
From: Martin Albrecht <m...@informatik.uni-bremen.de>
Date: Wed, 14 May 2008 22:52:43 +0100
Local: Wed, May 14 2008 5:52 pm
Subject: slightly OT: new M4RI library
Oh gods of the cpu cycle ... or hi there,

(this e-mail contains details on a particular implementation for
GF(2) linear algebra, feel free to ignore it if that doesn't get you going)

I've just submitted a new (much improved) version of the M4RI library for
inclusion in Sage:

  http://trac.sagemath.org/sage_trac/ticket/3204

M4RI is a library by Gregory Bard and myself for fast arithmetic with dense
matrices over GF(2). M4RI stands for the "Method of the Four Russians"
inversion which is an algorithm for matrix reduction first published by Greg.
The website is here:

  http://sage.math.washington.edu/home/malb/m4ri/

This new version has among other things two new features:
 - SSE2 support where support (XOR two 128-bit words in one instruction)
 - Strassen-Winograd matrix multiplication on top of M4RM ("Method of the Four
Russians" multiplication or Kronrod's algorithm)

I'm writing to [sage-devel] because I observed some odd performance
characteristics and hope to pick your brains on how to improve performance:

At least in my experience enabling SSE2 results in slower code on AMD CPUs. I
think this is rumoured across the net but did anyone else also witness this
behaviour? I'm only using 128-bit integer XOR instrinsics. On my Core2Duo
notebook SSE2 speeds up the computation at least up to the size of the L2
cache.

One would expect that the M4RM + Strassen implementation would have an easy
time beating Magma since it uses a better algorithm. Same as Magma it does
Strassen-Winograd down to the cutoff but then it switches to the M4RM
algorithm (O(n^3/log_2(n))) rather than naive multiplication (O(n^3)). This
should give a nice constant speed-up. This theory so far seems to match
reality on a Core2Duo notebook with 4MB L2 cache:

----------------------------------------------------------------------
| SAGE Version 3.0.1, Release Date: 2008-05-05                       |
| Type notebook() for the GUI, and license() for information.        |
----------------------------------------------------------------------
sage: A = random_matrix(GF(2),10^4,10^4)
sage: B = random_matrix(GF(2),10^4,10^4)
sage: time C = A._multiply_strassen(B,cutoff=3200)
CPU times: user 4.50 s, sys: 0.03 s, total: 4.53 s
Wall time: 4.56

Magma V2.13-5     Wed May 14 2008 21:55:31 on road     [Seed = 2246483032]
Type ? for help.  Type <Ctrl>-D to quit.

> A := RandomMatrix(GF(2),10^4,10^4);
> B := RandomMatrix(GF(2),10^4,10^4);
> t := Cputime(); C := A*B; Cputime(t);

7.170

sage: A = random_matrix(GF(2),2*10^4,2*10^4)
sage: B = random_matrix(GF(2),2*10^4,2*10^4)
sage: time C = A._multiply_strassen(B,cutoff=3200)
CPU times: user 33.35 s, sys: 0.09 s, total: 33.44 s
Wall time: 33.59

> A := RandomMatrix(GF(2),2*10^4,2*10^4);
> B := RandomMatrix(GF(2),2*10^4,2*10^4);
> t := Cputime(); C := A*B; Cputime(t);

49.990

Finally, a perfect cutting example (no extra rows/columns to take care of):

sage: A = random_matrix(GF(2),1<<14,1<<14)
sage: B = random_matrix(GF(2),1<<14,1<<14)
sage: time C = A._multiply_strassen(B,cutoff=1<<10) # !=3200
CPU times: user 16.82 s, sys: 0.07 s, total: 16.88 s
Wall time: 16.96

> A:= RandomMatrix(GF(2),2^14,2^14);
> B:= RandomMatrix(GF(2),2^14,2^14);
> t := Cputime(); C := A*B; Cputime(t);

24.060

Note however that this comparison is NOT FAIR since this version of Magma is
not optimised for the C2D chip!

However, on AMD CPUs the times don't look that nice. Both on sage.math and on
a

   AMD Athlon(tm) 64 X2 Dual Core Processor 6400+

the performance is way way behind Magma (or what I'd expect Magma to perform
like), e.g. on sage.math:

sage: A = random_matrix(GF(2),10^4,10^4)
sage: B = random_matrix(GF(2),10^4,10^4)
sage: time C = A._multiply_strassen(B,cutoff=1500)
CPU times: user 14.40 s, sys: 0.17 s, total: 14.58 s
Wall time: 14.62

Magma V2.13-5     Wed May 14 2008 14:07:18 on sage     [Seed = 4255048009]
Type ? for help.  Type <Ctrl>-D to quit.

> A := RandomMatrix(GF(2),10^4,10^4);
> B := RandomMatrix(GF(2),10^4,10^4);
> t := Cputime(); C := A*B; Cputime(t);

3.890 <<--------- They even beat the C2D time!

sage: A = random_matrix(GF(2),2*10^4,2*10^4)
sage: B = random_matrix(GF(2),2*10^4,2*10^4)
sage: time C = A._multiply_strassen(B,cutoff=1500)
CPU times: user 99.42 s, sys: 0.64 s, total: 100.06 s
Wall time: 100.06

> B := RandomMatrix(GF(2),2*10^4,2*10^4);
> A := RandomMatrix(GF(2),2*10^4,2*10^4);
> t := Cputime(); C := A*B; Cputime(t);

27.470

# perfect cutoff
sage: A = random_matrix(GF(2),1<<14,1<<14)
sage: B = random_matrix(GF(2),1<<14,1<<14)
sage: time C = A._multiply_strassen(B,cutoff=1<<10)
CPU times: user 100.14 s, sys: 0.32 s, total: 100.46 s
Wall time: 100.45
sage: time C = A._multiply_strassen(B,cutoff=1<<9)
CPU times: user 76.48 s, sys: 0.39 s, total: 76.87 s
Wall time: 76.87
sage: time C = A._multiply_strassen(B,cutoff=1<<8)
CPU times: user 52.45 s, sys: 0.33 s, total: 52.78 s
Wall time: 52.78

> A := RandomMatrix(GF(2),2^14,2^14);
> B := RandomMatrix(GF(2),2^14,2^14);
> t := Cputime(); C := A*B; Cputime(t);

14.970

I guess the bad performance of the M4RI library is partly related to the
smaller L2 cache (AMD: 1MB, C2D: 4MB) which forces us to set the cutoff lower
(since we want all three C = A*B submatrices to  fit in L2 cache). Also it
seems that Magma is almost always ~ 3.5 times faster than our implementation.
I am slowly running out of ideas what to improve in order to close to the gap
so if anyone has some ideas please let me hear them. It might be worth noting
that we are using the same cutting strategy as Sage in strassen.pyx which
ignores uneven rows/columns at every recursion step and deals with the extra
rows/columns after the Strassen-Winograd formula (see strassen.pyx for
details). We are also using the improved operation schedule from paper of
Clement et al.Though I don't know if the cutting strategy is optimal, we also
get beaten in a perfect cutting scenario, so I guess it is not that important
for now. Anyway, I'd happily provide profiling data if anyone wants to take a
look and squeeze a couple of cycles out. If not, then this e-mail is jut a
quite long status report :-)

Btw. I don't have access to Magma 2.14 which I believe to be the fastest in
linear algebra over GF(2). In version 2.14 they added SSE2 support too.
So if anybody could compare the new M4RI with Magma 2.14 I'd be happy to hear
about the results. I don't know what else to compare with except Magma. Any
suggestions? Thoughts? Rants?

Cheers,
Martin

--
name: Martin Albrecht
_pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
_www: http://www.informatik.uni-bremen.de/~malb
_jab: martinralbre...@jabber.ccc.de


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
William Stein  
View profile  
 More options May 14 2008, 8:41 pm
From: "William Stein" <wst...@gmail.com>
Date: Wed, 14 May 2008 17:41:35 -0700
Local: Wed, May 14 2008 8:41 pm
Subject: Re: [sage-devel] slightly OT: new M4RI library

> Btw. I don't have access to Magma 2.14 which I believe to be the fastest in
> linear algebra over GF(2). In version 2.14 they added SSE2 support too.
> So if anybody could compare the new M4RI with Magma 2.14 I'd be happy to hear
> about the results. I don't know what else to compare with except Magma.

Here are magma V2.14-9 timings on sage.math:

was@sage:~/magma64/bin$ ./magma
Magma V2.14-9     Wed May 14 2008 17:33:23 on sage     [Seed = 793568142]
Type ? for help.  Type <Ctrl>-D to quit.

> A := RandomMatrix(GF(2),10^4,10^4);
> B := RandomMatrix(GF(2),10^4,10^4);
> time C := A*B;
Time: 4.020
> time C := A*B;
Time: 3.980
> time C := A*B;
Time: 3.990
> A := RandomMatrix(GF(2),2^14,2^14);
> B := RandomMatrix(GF(2),2^14,2^14);
> time C := A*B;
Time: 15.450
> time C := A*B;

Time: 15.420

> Any suggestions? Thoughts? Rants?

Allan Steel is a really amazing programmer.

----

I tried your new code up at #3204 under OS X and get this:

sage: A = random_matrix(GF(2),10^4,10^4)
sage: B = random_matrix(GF(2),10^4,10^4)
sage: time C = A._multiply_strassen(B,cutoff=3200)
sage.bin(39971) malloc: *** error for object 0xb95c010: Non-aligned
pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
sage.bin(39971) malloc: *** error for object 0x79c9c10: Non-aligned
pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
sage.bin(39971) malloc: *** error for object 0x7465a00:
non-page-aligned, non-allocated pointer being freed
*** set a breakpoint in malloc_error_break to debug
sage.bin(39971) malloc: *** error for object 0x79ca610: Non-aligned
pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
...
CPU times: user 10.29 s, sys: 0.26 s, total: 10.55 s
Wall time: 16.31

Maybe you're doing something wrong?

 -- William


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Martin Albrecht  
View profile  
 More options May 15 2008, 4:14 am
From: Martin Albrecht <m...@informatik.uni-bremen.de>
Date: Thu, 15 May 2008 09:14:39 +0100
Local: Thurs, May 15 2008 4:14 am
Subject: Re: [sage-devel] Re: slightly OT: new M4RI library
On Thursday 15 May 2008, William Stein wrote:
> > Btw. I don't have access to Magma 2.14 which I believe to be the fastest
> > in linear algebra over GF(2). In version 2.14 they added SSE2 support
> > too. So if anybody could compare the new M4RI with Magma 2.14 I'd be
> > happy to hear about the results. I don't know what else to compare with
> > except Magma.

> Here are magma V2.14-9 timings on sage.math:

was@sage:~/magma64/bin$ ./magma
Magma V2.14-9     Wed May 14 2008 17:33:23 on sage     [Seed = 793568142]
Type ? for help.  Type <Ctrl>-D to quit.
> A := RandomMatrix(GF(2),10^4,10^4);
> B := RandomMatrix(GF(2),10^4,10^4);
> time C := A*B;

Time: 4.020

> time C := A*B;
Time: 3.980
> time C := A*B;
Time: 3.990
> A := RandomMatrix(GF(2),2^14,2^14);
> B := RandomMatrix(GF(2),2^14,2^14);
> time C := A*B;
Time: 15.450
> time C := A*B;

Time: 15.420

This doesn't seem much different from 2.13, which is good for my purposes.
Once I fixed the OSX issue mentioned below could you run the benchmark on the
C2D? I want to see if they can make better use of the 4MB L2 cache.

Likely, though my Valgrind log is clean under Linux. I'll try under OSX.

Martin

--
name: Martin Albrecht
_pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
_www: http://www.informatik.uni-bremen.de/~malb
_jab: martinralbre...@jabber.ccc.de


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Bill Hart  
View profile  
 More options May 15 2008, 8:46 am
From: Bill Hart <goodwillh...@googlemail.com>
Date: Thu, 15 May 2008 05:46:17 -0700 (PDT)
Local: Thurs, May 15 2008 8:46 am
Subject: Re: slightly OT: new M4RI library
Hi Martin,

The cycle counter in your bench code seems to give random values on
the 2.4GHz Opteron with SUSE 9 linux that I have access to, which has
Magma 2-14.10 on it.

Anyhow, here are the Magma times:

> A := RandomMatrix(GF(2),10^4,10^4);
> B := RandomMatrix(GF(2),10^4,10^4);
> t := Cputime(); C := A*B; Cputime(t);
2.940
> A := RandomMatrix(GF(2),2*10^4,2*10^4);
> B := RandomMatrix(GF(2),2*10^4,2*10^4);
> t := Cputime(); C := A*B; Cputime(t);
16.570
> A := RandomMatrix(GF(2),2^14,2^14);
> B := RandomMatrix(GF(2),2^14,2^14);
> t := Cputime(); C := A*B; Cputime(t);

9.250

The corresponding times for your code are approximately (hand timed;
adjusted for generation of random matrices):

10^4: 8s
2*10^4: 59s
2^14: 43s

I think it is polynomials over GF2 which Magma 2-14.x is much faster
at rather than linear algebra.

Bill.

On 14 May, 22:52, Martin Albrecht <m...@informatik.uni-bremen.de>
wrote:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Martin Albrecht  
View profile  
 More options May 15 2008, 1:35 pm
From: Martin Albrecht <m...@informatik.uni-bremen.de>
Date: Thu, 15 May 2008 18:35:49 +0100
Local: Thurs, May 15 2008 1:35 pm
Subject: Re: [sage-devel] Re: slightly OT: new M4RI library

> Maybe you're doing something wrong?

This bug is fixed in:

   http://sage.math.washington.edu/home/malb/spkgs/libm4ri-20080515.p0.spkg

Cheers,
Martin

--
name: Martin Albrecht
_pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
_www: http://www.informatik.uni-bremen.de/~malb
_jab: martinralbre...@jabber.ccc.de


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Martin Albrecht  
View profile  
 More options May 15 2008, 1:42 pm
From: Martin Albrecht <m...@informatik.uni-bremen.de>
Date: Thu, 15 May 2008 18:42:11 +0100
Local: Thurs, May 15 2008 1:42 pm
Subject: Re: [sage-devel] Re: slightly OT: new M4RI library
On Thursday 15 May 2008, Bill Hart wrote:

Hi Bill,

I cannot explain the random behaviour of the cpucycles counter. The cpucycles
library is by Dan Bernstein so I'd suspect it is carefully coded. Some
potential sources of variance are: dynticks (i.e. the kernel doesn't have a
fixed tick rate, this is done to safe power), CPU frequency scaling and other
applications. On the other hand, I'd suspect I don't need to tell you since
you probably benchmarked more code than I did. To account for the variance I
have the small run_bench.py script in the testsuite directory. Run it like
this:

    ./run_bench.py -t mul -n 10000 -c 3200

It'll run the experiment 17 times (17 for no particular reason) and return the
median runtime. -n is the size and -c the cutoff.

Cheers,
Martin

--
name: Martin Albrecht
_pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
_www: http://www.informatik.uni-bremen.de/~malb
_jab: martinralbre...@jabber.ccc.de


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
William Stein  
View profile  
 More options May 15 2008, 2:37 pm
From: "William Stein" <wst...@gmail.com>
Date: Thu, 15 May 2008 11:37:49 -0700
Local: Thurs, May 15 2008 2:37 pm
Subject: Re: [sage-devel] Re: slightly OT: new M4RI library
On Thu, May 15, 2008 at 10:42 AM, Martin Albrecht

Median?  Shouldn't you take the minimum?  Are there any good papers
on benchmarking?

By the way -- present for you:

      http://m4ri.sagemath.org/

To update that page you have to ssh to sagemath.org.

William


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Martin Albrecht  
View profile  
 More options May 15 2008, 3:01 pm
From: Martin Albrecht <m...@informatik.uni-bremen.de>
Date: Thu, 15 May 2008 20:01:20 +0100
Local: Thurs, May 15 2008 3:01 pm
Subject: Re: [sage-devel] Re: slightly OT: new M4RI library

> > It'll run the experiment 17 times (17 for no particular reason) and
> > return the median runtime. -n is the size and -c the cutoff.
> Median?  Shouldn't you take the minimum?  Are there any good papers
> on benchmarking?

I'm using cpucycles from:

http://www.ecrypt.eu.org/ebats/cpucycles.html

which advises:

------------------------

Notes on accuracy

Benchmarking tools are encouraged to record several timings of a function:
call cpucycles(), function(), cpucycles(), function(), etc., and then print
one line reporting the differences between successive cpucycles() results.
The median of several differences is much more stable than the average.

Cycle counts continue to increase while other programs are running, while the
operating system is handling an interruption such as a network packet, etc.
This won't affect the median of several timings of a fast function---the
function usually won't be interrupted---but it can affect the median of
several timings of a slow function. Hopefully a benchmarking machine isn't
running other programs.

On dual-CPU systems (and dual-core systems such as the Athlon 64 X2), the CPUs
often don't have synchronized cycle counters, so a process that switches CPUs
can have its cycle counts jump forwards or backwards. I've never seen this
affect the median of several timings.

Some CPUs dynamically reduce CPU speed to save power, but deliberately keep
their cycle counters running at full speed, the idea being that measuring
time is more important than measuring cycles. Hopefully a benchmarking
machine won't enter power-saving mode.

Cycle counts are occasionally off by a multiple of 2^32 on some CPUs, as
discussed below. I've never seen this affect the median of several timings.

The estimate returned by cpucycles_persecond() may improve accuracy after
cpucycles() has been called repeatedly.

------------------------

I just followed that. I guess if one takes the minimum or not depends on the
data. If there are (much) easier cases then it is probably not best to take
the minimum. However, I guess for matrix multiplication the times don't
really depend on the entries of the matrices so I could use the minimum or
just output all (minimum, average, median).

> By the way -- present for you:

>       http://m4ri.sagemath.org/

> To update that page you have to ssh to sagemath.org.

Cool, thanks! Any chance that I could get write access to the home directory
so that I can upload my SSH public key?

Martin

--
name: Martin Albrecht
_pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
_www: http://www.informatik.uni-bremen.de/~malb
_jab: martinralbre...@jabber.ccc.de


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Bill Hart  
View profile  
 More options May 15 2008, 3:03 pm
From: Bill Hart <goodwillh...@googlemail.com>
Date: Thu, 15 May 2008 12:03:22 -0700 (PDT)
Local: Thurs, May 15 2008 3:03 pm
Subject: Re: slightly OT: new M4RI library
Hi Martin,

Here is a run that illustrates the problem. Am I doing something
wrong?

wbhart@host-57-44:~/m4ri-20080514/testsuite> time ./
bench_multiplication 5000 2048
n:  5000, cutoff:  2048, cpu cycles: 2670143764

real    0m2.768s
user    0m2.760s
sys     0m0.008s
wbhart@host-57-44:~/m4ri-20080514/testsuite> time ./
bench_multiplication 5000 3000
n:  5000, cutoff:  3000, cpu cycles:  746344039

real    0m3.277s
user    0m3.252s
sys     0m0.024s

Bill.

On 15 May, 18:42, Martin Albrecht <m...@informatik.uni-bremen.de>
wrote:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Martin Albrecht  
View profile  
 More options May 15 2008, 3:25 pm
From: Martin Albrecht <m...@informatik.uni-bremen.de>
Date: Thu, 15 May 2008 20:25:32 +0100
Local: Thurs, May 15 2008 3:25 pm
Subject: Re: [sage-devel] Re: slightly OT: new M4RI library

On Thursday 15 May 2008, Bill Hart wrote:

> Hi Martin,

> Here is a run that illustrates the problem. Am I doing something
> wrong?

No, I was stupid. The cpucycles are printed as %u but they should be printed
as %llu since they are longer than an int. I've attached the fixed C file
(since it is < 1KB).

--
name: Martin Albrecht
_pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
_www: http://www.informatik.uni-bremen.de/~malb
_jab: martinralbre...@jabber.ccc.de

  bench_multiplication.c
< 1K Download

 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Bill Hart  
View profile  
 More options May 15 2008, 4:47 pm
From: Bill Hart <goodwillh...@googlemail.com>
Date: Thu, 15 May 2008 13:47:02 -0700 (PDT)
Local: Thurs, May 15 2008 4:47 pm
Subject: Re: slightly OT: new M4RI library
Martin,

Do you think Magma uses naive multiplication for its base case? This
can be ridicoulously fast, especially over GF2. I note for example
that Magma's base case is about 6 times faster than M4RI at 1000x1000.
Is it possible that the naive multiplication can just be optimised
with a far better constant and that the crossover with m4r is above
the crossover with Strassen-Winograd?

I also note that the M4RI SSE2 code doesn't seem faster than the
generic C code on the Opteron, in fact it seems the other way around.
Very strange.

Bill.

On 15 May, 20:25, Martin Albrecht <m...@informatik.uni-bremen.de>
wrote:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Martin Albrecht  
View profile  
 More options May 15 2008, 5:23 pm
From: Martin Albrecht <m...@informatik.uni-bremen.de>
Date: Thu, 15 May 2008 22:23:40 +0100
Local: Thurs, May 15 2008 5:23 pm
Subject: Re: [sage-devel] Re: slightly OT: new M4RI library
On Thursday 15 May 2008, Bill Hart wrote:

> Martin,

> Do you think Magma uses naive multiplication for its base case? This
> can be ridicoulously fast, especially over GF2. I note for example
> that Magma's base case is about 6 times faster than M4RI at 1000x1000.
> Is it possible that the naive multiplication can just be optimised
> with a far better constant and that the crossover with m4r is above
> the crossover with Strassen-Winograd?

> I also note that the M4RI SSE2 code doesn't seem faster than the
> generic C code on the Opteron, in fact it seems the other way around.
> Very strange.

I observed the same behaviour but have no explanation besides stuff like this:

http://www.theinquirer.net/en/inquirer/news/2003/05/02/sse2-makes-opt...

How do you know what the Magma base case is, can you tell Magma somehow to
print its crossover? I'll look into plugging in naive multiplication as the
base case (after optimisation) to see if your idea is right, though I doubt
that the crossover to M4RM is beyond the crossover to Strassen-Winograd.

Martin

--
name: Martin Albrecht
_pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
_www: http://www.informatik.uni-bremen.de/~malb
_jab: martinralbre...@jabber.ccc.de


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Bill Hart  
View profile  
 More options May 15 2008, 5:41 pm
From: Bill Hart <goodwillh...@googlemail.com>
Date: Thu, 15 May 2008 14:41:18 -0700 (PDT)
Local: Thurs, May 15 2008 5:41 pm
Subject: Re: slightly OT: new M4RI library
Oh, actually I have no idea where Magma's crossover is. I'll wager it
is somewhere between 4000x4000 and 6000x6000, but let's not speculate.
I'll try and work it out with some timings.

Bill.

On 15 May, 22:23, Martin Albrecht <m...@informatik.uni-bremen.de>
wrote:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Bill Hart  
View profile  
 More options May 15 2008, 6:33 pm
From: Bill Hart <goodwillh...@googlemail.com>
Date: Thu, 15 May 2008 15:33:23 -0700 (PDT)
Local: Thurs, May 15 2008 6:33 pm
Subject: Re: slightly OT: new M4RI library
Here is the graph of Magma times:

http://sage.math.washington.edu/home/wbhart/flint-trunk/graphing/gf2.png

The crossover is not clear. The change from a smooth curve to a
squiggly line is about 3600. So presumably that is it, but the graph
also seems to change character at about 6200 or 7000 as well. One of
these changes may be cache related.

At 3600, the total data for all three matrices is almost 5mb and the
cache on my machine is 1024kb. But if Magma is using classical
multiplication, then this is pretty much irrelevant anyway, since you
can keep the working data within the cache for quite a while during
the algorithm.

Bill.

On 15 May, 22:41, Bill Hart <goodwillh...@googlemail.com> wrote:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Martin Albrecht  
View profile  
 More options May 15 2008, 7:03 pm
From: Martin Albrecht <m...@informatik.uni-bremen.de>
Date: Fri, 16 May 2008 00:03:19 +0100
Local: Thurs, May 15 2008 7:03 pm
Subject: Re: [sage-devel] Re: slightly OT: new M4RI library
On Thursday 15 May 2008, Bill Hart wrote:

> Here is the graph of Magma times:

> http://sage.math.washington.edu/home/wbhart/flint-trunk/graphing/gf2.png

> The crossover is not clear. The change from a smooth curve to a
> squiggly line is about 3600. So presumably that is it, but the graph
> also seems to change character at about 6200 or 7000 as well. One of
> these changes may be cache related.
> At 3600, the total data for all three matrices is almost 5mb and the
> cache on my machine is 1024kb. But if Magma is using classical
> multiplication, then this is pretty much irrelevant anyway, since you
> can keep the working data within the cache for quite a while during
> the algorithm.

On the other hand: a squiggly line is what one one would expect for
Strassen-Winograd due to the extra rows/columns that have to be taken care
of. In any case: 3600 seems rather late for 1MB L2 cache. I think Allan Steel
once gave a talk about his implementation and stated that they don't use
classical block multiplication (I saw some slides with that remark).

Martin

--
name: Martin Albrecht
_pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
_www: http://www.informatik.uni-bremen.de/~malb
_jab: martinralbre...@jabber.ccc.de


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Bill Hart  
View profile  
 More options May 15 2008, 8:41 pm
From: Bill Hart <goodwillh...@googlemail.com>
Date: Thu, 15 May 2008 17:41:02 -0700 (PDT)
Local: Thurs, May 15 2008 8:41 pm
Subject: Re: slightly OT: new M4RI library
I think it might just be possible to get down to the speed of Magma
with a highly optimised classical multiplication routine. At 3600X3600
one clearly has to do 3600x3600 scalar products of a row by a column.
We'll assume one of the matrices has been transposed to facilitate
this.

If we use SSE2 then we can operate 128 bits at a time. There are 16
SSE registers.

The basic idea would be to load 4 SSE registers from different rows of
matrix A and 2 from different columns of matrix B. We compute the
scalar products of all 8 combinations of rowsxcolumns simultaneously.
For this we need 2 temporary registers and 8 registers to hold the
running totals. So all in all we need 4+2+2+8 = 16 registers. We only
need to do an AND and an XOR to do the multiplication and addition
required.

Caching becomes irrelevant if we choose a large selection of rows from
A and a large selection of columns from B and do all the possible
scalar products or rows by columns before moving to the next part.

Assuming the Opteron can do the memory loads at the same time as doing
arithmetic not depending on those loads, careful instruction
scheduling should get everything down to the cost of an SSE AND and an
SSE OR per 128x128 bit pair in the classical algorithm. That puts the
entire algorithm at pretty close to what Magma is doing timewise.

Another option is to not use SSE and just use the 64 bit integer
registers. The disadvantage is one has to load things 64 bits at a
time. But the advantage is the Opteron can do three arithmetic
operations per cycle if properly scheduled. That gets 50% more work
done than the SSE registers, which can only do 1 x 128 bit operation
per cycle.

Of course I'm making the assumption here that the Opteron can indeed
do loads at the same time as arithmetic.

if not, then there is no way Magma can be using classical
multiplication out to 3600x3600 since there are simply too many
operations to perform in the number of cycles available.

Bill.

On 16 May, 00:03, Martin Albrecht <m...@informatik.uni-bremen.de>
wrote:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Bill Hart  
View profile  
 More options May 16 2008, 8:53 am
From: Bill Hart <goodwillh...@googlemail.com>
Date: Fri, 16 May 2008 05:53:39 -0700 (PDT)
Local: Fri, May 16 2008 8:53 am
Subject: Re: slightly OT: new M4RI library
I made some changes to the original code so it would use the cache a
bit better. The test code seems to pass, so I don't think I've screwed
anything up.

The time for 16384x16384 on my machine is now 20s, so a factor of
2.15x faster. The time for 2000x2000 also seems to be the same time as
Magma now. Hopefully I didn't just mistime things before, and this is
a real improvement.

I am still trying things out.

Bill.

On 16 May, 01:41, Bill Hart <goodwillh...@googlemail.com> wrote:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Bill Hart  
View profile  
 More options May 16 2008, 9:16 am
From: Bill Hart <goodwillh...@googlemail.com>
Date: Fri, 16 May 2008 06:16:27 -0700 (PDT)
Local: Fri, May 16 2008 9:16 am
Subject: Re: slightly OT: new M4RI library
10000x10000 is down to 4.5s now, so nearly a 2x speedup.

20000x20000 is down to 32.0s, so again nearly a 2x speedup.

Bill.

On 16 May, 13:53, Bill Hart <goodwillh...@googlemail.com> wrote:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Bill Hart  
View profile  
 More options May 16 2008, 9:37 am
From: Bill Hart <goodwillh...@googlemail.com>
Date: Fri, 16 May 2008 06:37:20 -0700 (PDT)
Local: Fri, May 16 2008 9:37 am
Subject: Re: slightly OT: new M4RI library
Here is the modified _mzd_mul_m4rm_impl function which gives this
speedup:

http://sage.math.washington.edu/home/wbhart/m4rmul.c

I used a crossover of 3600 and I indicate at the top of this file how
constants should be changed to get the speedups for the various values
of n. I didn't put any code in to automatically choose the correct
values.

I am presuming that the test_multiplication code actually tests this
function, otherwise maybe my code is just rubbish.

The basic idea is to cache block the A matrix into groups of BLOCK
rows. If one also cache blocked the B matrix and left the computed
gray tables in memory instead of writing over them and recalculating
them all the time, as I currently do, one could probably get the other
factor of 2 that we need.

Note I've turned SSE off in this function, but left it on in
_mzd_combine since it makes no real difference there.

Bill.

On 16 May, 14:16, Bill Hart <goodwillh...@googlemail.com> wrote:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Bill Hart  
View profile  
 More options May 16 2008, 10:59 am
From: Bill Hart <goodwillh...@googlemail.com>
Date: Fri, 16 May 2008 07:59:42 -0700 (PDT)
Local: Fri, May 16 2008 10:59 am
Subject: Re: slightly OT: new M4RI library
I tried cache blocking matrix B, but the times are exactly the same. I
think the processor is happy to keep loading B one row at a time
sequentially, and since it is only done a handful of times in the
algorithm, it accounts for little of the runtime.

It might go faster in combination with storing the gray tables, but
the optimal block size seems to be about 100 (*k) for the matrix B,
which would necessitate storage of 100 gray tables, which is an awful
lot of memory. Probably I think it is best to avoid this kind of
memory usage.

So I am not sure where the other factor of 2 will come from.

Bill.

On 16 May, 14:37, Bill Hart <goodwillh...@googlemail.com> wrote:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Bill Hart  
View profile  
 More options May 16 2008, 11:00 am
From: Bill Hart <goodwillh...@googlemail.com>
Date: Fri, 16 May 2008 08:00:20 -0700 (PDT)
Local: Fri, May 16 2008 11:00 am
Subject: Re: slightly OT: new M4RI library
P.S: I also tried cache hints, but no luck. They just slow it down.

Bill.

On 16 May, 15:59, Bill Hart <goodwillh...@googlemail.com> wrote:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Martin Albrecht  
View profile  
 More options May 16 2008, 11:24 am
From: Martin Albrecht <m...@informatik.uni-bremen.de>
Date: Fri, 16 May 2008 16:24:08 +0100
Local: Fri, May 16 2008 11:24 am
Subject: Re: [sage-devel] Re: slightly OT: new M4RI library
On Friday 16 May 2008, Bill Hart wrote:

> P.S: I also tried cache hints, but no luck. They just slow it down.

> Bill.

Bill, thanks so much for looking into this! It is very much appreciated. I'm
going to read/try your patch right away!

Martin

PS: I have a slight delay in my replies because my provider's spam filter
consistently flags you and mabshoff as spam for some reason.

--
name: Martin Albrecht
_pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
_www: http://www.informatik.uni-bremen.de/~malb
_jab: martinralbre...@jabber.ccc.de


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Bill Hart  
View profile  
 More options May 16 2008, 2:26 pm
From: Bill Hart <goodwillh...@googlemail.com>
Date: Fri, 16 May 2008 11:26:24 -0700 (PDT)
Local: Fri, May 16 2008 2:26 pm
Subject: Re: slightly OT: new M4RI library
Here are the times I get for Magma vs M4RI now. Note the crossover
between the two programs is now above about 5000 and M4RI beats Magma
below that point. This suggests the remaining factor of 2 is in the
Strassen-Winograd function. Probably Winograd-Strassen is falling out
of L2 cache (the previous adjustments I made were to prevent the M4R
algorithm falling out of L1 cache).

The other possibility is that Magma combines the two algorithms so
that there is even greater usage of the Gray code tables. This would
be an ugly hack, but could work.

40000x40000:
  Magma: 112.6s
  M4RI: 232.4s

20000x20000:
  Magma: 16.40s
  M4RI: 32.34s

10000x10000:
  Magma: 2.750s
  M4RI: 4.529s

5000x5000:
  Magma: 0.700s
  M4RI: 0.672s

2500x2500:
  Magma: 0.13s
  M4RI: 0.079s

1250x1250:
  Magma: 0.015s
  M4RI: 0.012s

625x625:
  Magma: 0.0030s
  M4RI: 0.0023s

312x312:
  Magma: 0.0014s
  M4RI: 0.00032s

156x156:
  Magma: 0.001s
  M4RI: 0.0001s

If I get some time I'll look into this.

Did those changes work for you Martin?

Bill.

On 16 May, 16:24, Martin Albrecht <m...@informatik.uni-bremen.de>
wrote:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Martin Albrecht  
View profile  
 More options May 16 2008, 6:05 pm
From: Martin Albrecht <m...@informatik.uni-bremen.de>
Date: Fri, 16 May 2008 23:05:58 +0100
Local: Fri, May 16 2008 6:05 pm
Subject: Re: [sage-devel] Re: slightly OT: new M4RI library
On Friday 16 May 2008, Bill Hart wrote:

> Here are the times I get for Magma vs M4RI now. Note the crossover
> between the two programs is now above about 5000 and M4RI beats Magma
> below that point. This suggests the remaining factor of 2 is in the
> Strassen-Winograd function. Probably Winograd-Strassen is falling out
> of L2 cache (the previous adjustments I made were to prevent the M4R
> algorithm falling out of L1 cache).

> The other possibility is that Magma combines the two algorithms so
> that there is even greater usage of the Gray code tables. This would
> be an ugly hack, but could work.

Are you suggesting Magma uses M4RM too? I'd doubt that, since they don't state
that anyway. Probably I'm just misunderstanding you.

Yes, the change worked like a charm. I made some changes (the fixed k is
replaced by a k that depends on the new block dimensions etc.) and it is much
much faster now. I'm working on re-introducing SSE2 now to see if it at least
on the Core2Duo makes the world a better place. Btw. all the stuff I wrote
about the L2 cache size C2D vs. Opteron was bollocks. The reason I beat Magma
so badly earlier was that I used a 32-bit Magma and compared it with a 64-bit
version of M4RI. To state it clearly: We don't beat Magma anywhere. Anyhow,
here are the times:

64-bit Debian/GNU Linux, 2.33Ghz Core2Duo
Matrix Dimension        Magma 2.14-13 (64-bit)  M4RI-20080517 (64-bit)
10,000 x 10,000         2.920                   4.130
16,384 x 16,384         11.140                  15.740
20,000 x 20,000         20.370                  28.950

64-bit Debian/GNU Linux, 1.8Ghz Opteron (sage.math)
Matrix Dimension        Magma 2.13-5 (64-bit)   M4RI-20080517 (64-bit)
10,000 x 10,000         3.930                   7.860
16,384 x 16,384         16.230                  104.77???
20,000 x 20,000         27.080                  56.420

(My university today finally granted me access to Magma 2.14 for my notebook.)

As you can see there is an odd (reproducible) spike at 2^14 x 2^14. I cannot
explain that for now, it might just be a bug. I have a similar spike in
another run on another AMD (Athlon X2) machine:

n: 2048, cutoff: 1024, speedup: 1.06, m4rm: 0.01 strassen: 0.01
n: 3072, cutoff: 1536, speedup: 0.87, m4rm: 0.04 strassen: 0.05
n: 4096, cutoff: 2048, speedup: 1.13, m4rm: 0.15 strassen: 0.13
n: 5120, cutoff: 2560, speedup: 1.35, m4rm: 0.39 strassen: 0.29
n: 6144, cutoff: 3072, speedup: 1.20, m4rm: 0.66 strassen: 0.55
n: 7168, cutoff: 3584, speedup: 1.87, m4rm: 1.64 strassen: 0.88
n: 8192, cutoff: 4096, speedup: 4.48, m4rm: 6.07 strassen: 1.35

>>> n: 9216, cutoff: 4608, speedup: 2.94, m4rm: 8.90 strassen: 3.02 <<<

n: 10240, cutoff: 5120, speedup: 4.68, m4rm: 12.99 strassen: 2.78
n: 11264, cutoff: 5632, speedup: 4.84, m4rm: 18.78 strassen: 3.88
n: 12288, cutoff: 6144, speedup: 4.65, m4rm: 24.40 strassen: 5.24
n: 13312, cutoff: 6656, speedup: 3.32, m4rm: 30.92 strassen: 9.33

I'm investigating this one too and play around with the parameters. Since the
C2D times are considerably better I also bet it is L2 related, we'll see.

Thanks again!
Martin

--
name: Martin Albrecht
_pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
_www: http://www.informatik.uni-bremen.de/~malb
_jab: martinralbre...@jabber.ccc.de


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Bill Hart  
View profile  
 More options May 16 2008, 8:48 pm
From: Bill Hart <goodwillh...@googlemail.com>
Date: Fri, 16 May 2008 17:48:35 -0700 (PDT)
Local: Fri, May 16 2008 8:48 pm
Subject: Re: slightly OT: new M4RI library
Hi Martin,

That spike is wierd. Basically I got closer to 2x the time of Magma
for 16384x16384, but you need different parameters than for
10000x10000 or 20000x20000 since the size of the M4R matrices will be
different than in either of those cases.

For 16384x16384 my notes say that I used k = 6 and BLOCK = 256. One
might also need to fiddle with the cutoff. I think I used 3600, but at
various times I had the cutoff set lower.

It would be awfully surprising if there wasn't a set of parameters
that dropped this time right down. It is possible that the L1 cache is
smaller on sage.math than on the Opteron I was using. Perhaps my
parameters don't apply on that machine.

Bill.

On 16 May, 23:05, Martin Albrecht <m...@informatik.uni-bremen.de>
wrote:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Messages 1 - 25 of 85   Newer >
« Back to Discussions « Newer topic     Older topic »