Strassen or no Strassen

Martin Albrecht

Aug 5, 2012, 9:02:12 AM
to M4RI Development, Richard Parker
Hi [m4ri-devel],

I was contacted by Richard Parker who argues that Strassen is not a good idea
for GF(2). I've reproduced his e-mails below (slightly edited). Sorry, I
should have moved this here earlier, but I was travelling and not thinking
straight :)

What do you think?

Cheers,
Martin

---------- Forwarded Message ----------
Subject: Matrices over GF2.
Date: Friday 03 Aug 2012
From: Richard Parker <richp...@hotmail.co.uk>
To: martinr...@googlemail.com

I'm not sure whether we've met, but it appears we have a common
interest in the fast implementation of matrix operations (multiply,
Gaussian elimination, etc) over GF2 in particular.

My own interest comes from working with groups of matrices as
part of group theory and, in particular, their modular representations.

So far I have been unable to see how the ideas of Strassen can help,
since they seem to reduce multiplications at the expense of data movement,
and it seems to me that for GF2 specifically it is the data movement
that is taking the time.

Reading your stuff about M4RI suggests you have concluded otherwise,
and - if you are prepared to divulge - I'd be most interested to understand
how multiplication can be speeded up by such methods.

To give you a first view of where I am coming from, standard Strassen
does 11 adds and 7 multiplies, as against the 4 adds and 8 multiplies
the ordinary way, and at many scales seven adds is slower than
one multiply because of all the I/O, whether this be from file-server,
disk or ordinary memory.

Richard Parker

---------- Forwarded Message ----------
Subject: Re: Matrices over GF2.
Date: Saturday 04 Aug 2012
From: Martin Albrecht <martinr...@googlemail.com>
To: Richard Parker <richp...@hotmail.co.uk>


Hey,

before getting into the why perhaps it's useful to establish that indeed
Strassen-Winograd makes multiplication faster over GF(2). For example,
if you run

./bench_multiplication 10000 0

in the M4RI testsuite you'll get the time it takes with Strassen down to the
dimension where matrix blocks fit into L2. If you run

./bench_multiplication 10000 10000

you get the time for the "Method of the Four Russians", i.e., cubic
multiplication + "greasing".
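
In case it is easier to play with from C directly, the same comparison in a
handful of lines against the public M4RI API looks roughly like this (the
dimensions are placeholders; link against libm4ri):

  #include <m4ri/m4ri.h>

  int main(void) {
    rci_t n = 10000;
    mzd_t *A = mzd_init(n, n);
    mzd_t *B = mzd_init(n, n);
    mzd_t *C = mzd_init(n, n);
    mzd_randomize(A);
    mzd_randomize(B);

    /* cutoff 0: Strassen-Winograd recursion down to the default crossover */
    mzd_mul(C, A, B, 0);

    /* cutoff >= n: no Strassen recursion at all, i.e. plain M4RM */
    mzd_mul(C, A, B, n);

    mzd_free(A); mzd_free(B); mzd_free(C);
    return 0;
  }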

Moving on to the "why". I am not sure what you mean by "data movement". So to
avoid a misunderstanding, in our implementation data is not moved in the
strict sense, i.e., copied from A to B. Of course, there are many memory reads
and writes.

However, Strassen-Winograd uses fewer memory reads and writes than classical
cubic multiplication if your dimension is big enough. That is, 7
multiplications + 15 additions uses less I/O than 8 multiplications + 8
additions for n x n matrices starting at some value of n. For 64 x 64 matrices
indeed adds and multiplications seem to be equally expensive but for larger
dimensions this is not the case. In our implementation we cross over from
Strassen to M4RM at dimensions of a few thousand.

Put simply, I think this statement is incorrect:

> at many scales seven adds is slower than one multiply because of all the
> I/O, whether this be from file-server, disk or ordinary memory.

because seven adds do a lot less I/O than one multiply, which amounts to ~n
additions.
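
To make that concrete with a back-of-the-envelope count - words touched only,
ignoring caches and greasing entirely, so a sketch of the argument rather than
of the implementation:

  #include <stdio.h>

  int main(void) {
    double n     = 4096.0;
    double words = n * n / 64.0;     /* 64-bit words in one n x n GF(2) matrix */

    /* one addition: read both operands once, write the result once */
    double add_io = 3.0 * words;

    /* one cubic multiplication: each row of C accumulates ~n rows of B,
       so B is streamed roughly n times over                            */
    double mul_io = words + n * words + words;

    printf("one add : %.3g words\n", add_io);
    printf("one mul : %.3g words (~%.0f adds worth of I/O)\n",
           mul_io, mul_io / add_io);
    return 0;
  }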

I hope this is helpful, feel free to get back to me if not etc.

Cheers,
Martin

---------- Forwarded Message ----------
Subject: RE: Matrices over GF2.
Date: Saturday 04 Aug 2012
From: Richard Parker <richp...@hotmail.co.uk>
To: martinr...@googlemail.com



Thanks for your reply. I suspect I have a rather faster basic multiply
(mainly by being cache-friendly), which makes the difference. Unless I
have missed something (and I still suspect I have) the crossover seems to
be at around 100,000 dimensions rather than 1000.

If you are still active in this area, perhaps the first step would be to
persuade you to consider chopping up your matrix in both directions
so that the cache-full is more nearly square. This balances the memory
usage across the three matrices (oversimplification!) and in my work
produced a factor of ten (!) improvement in the basic multiplication once
I tried using multiple cores.

Actually by using L1, L2 and L3 in a balanced way, I suspect there is
yet another factor of two available, though it is a lot of work, and
will have a very short shelf life - as soon as they change the sizes or
speeds of the caches, I'd have to tweak it! :(

[This would involve processing matrix A very carefully, keeping track of
which vectors had been used recently so that they can be favoured since
they will be in L1 cache].

Richard Parker

---------- Forwarded Message ----------
Subject: Re: RE: Matrices over GF2.
Date: Saturday 04 Aug 2012
From: Martin Albrecht <martinr...@googlemail.com>
To: Richard Parker <richp...@hotmail.co.uk>


Hey,

On Saturday 04 Aug 2012, you wrote:
> Thanks for your reply. I suspect I have a rather faster basic multiply,
> (mainly by being cache-friendly) which makes the difference. Unless I
> have missed something (and I still suspect I have) the crossover seems to
> be at around 100,000 dimensions rather than 1000.

is your implementation available somewhere? In any case, we could easily
resolve this question by comparing your implementation with ours (and
Magma's). If your base case is so efficient this would of course be very
interesting for us.

> If you are still active in this area, perhaps the first step would be to
> persuade you to consider chopping up your matrix in both directions
> so that the cache-full is more nearly square.

I am not sure what you mean by that; we do chop up our matrices into blocks,
e.g.,

[A0 A1]
[A2 A3]

is that what you mean? However, we do store these matrices in row-major
representation, i.e., there's a jump in memory between the zeroth and first
rows of A0.
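
In code the blocks are windows into the parent matrix rather than copies,
which is where that jump comes from. A minimal sketch (dimensions are
placeholders and must be word-aligned for windows):

  #include <m4ri/m4ri.h>

  int main(void) {
    rci_t n = 2048;
    mzd_t *A = mzd_init(n, n);
    mzd_randomize(A);

    /* four n/2 x n/2 views into the same row-major storage; nothing is
       copied, hence the jump between consecutive rows of each block     */
    mzd_t *A0 = mzd_init_window(A, 0,   0,   n/2, n/2);
    mzd_t *A1 = mzd_init_window(A, 0,   n/2, n/2, n);
    mzd_t *A2 = mzd_init_window(A, n/2, 0,   n,   n/2);
    mzd_t *A3 = mzd_init_window(A, n/2, n/2, n,   n);

    /* ... work on the blocks ... */

    mzd_free_window(A0); mzd_free_window(A1);
    mzd_free_window(A2); mzd_free_window(A3);
    mzd_free(A);
    return 0;
  }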

> This balances the memory
> usage across the three matrices (oversimplification!) and in my work
> produced a factor of ten (!) improvement in the basic multiplication once
> I tried using multiple cores.

Mhh, how many cores did you use? Our code definitely doesn't scale well on
multicore systems.

---------- Forwarded Message ----------
Subject: RE: Matrices over GF2.
Date: Saturday 04 Aug 2012
From: Richard Parker <richp...@hotmail.co.uk>
To: martinr...@googlemail.com

> is your implementation available somewhere? In any case, we could easily
> resolve this question by comparing your implementation with ours (and
> Magma's). If your base case is so efficient this would of course be very
> interesting for us.

I did a prototype in Essen in February. I guess it still works, but I'm in
the middle of rewriting it, so the version I currently have doesn't even
compile, let alone run.

My target was to use the 12 core (6+6) Nehalem in Essen to multiply two
random dense matrices at 350,000 x 350,000 in 1 Hour. I failed, in that
it took 80 minutes. This uses no Strassen.

I must apologise for what I wrote after that. I was guessing what you were
doing, and after due consideration have concluded that I guessed wrong.

> Mhh, how many cores did you use? Our code definitely doesn't scale well on
> multicore systems.

12, as stated above. The more cores you use, the more you have to worry
about memory bandwidth, since it is shared between them. I suspect this
is the major difference, in fact, between our approaches. It scaled
perfectly, except that the reading of the matrices and the writing of the
answer was not multi-threaded (and thereby hangs this tale).

If you really have to take memory bandwidth into account so you scale well,
this seems to make the whole problem slightly different at every scale, and
I am beginning to think that the five scales I had (file server, local disk,
memory, thread-worth and cache-full) are going to turn out too simplistic. :(

I can send my documentation if you like.

---------- Forwarded Message ----------
Subject: RE: Matrices over GF2.
Date: Saturday 04 Aug 2012
From: Richard Parker <richp...@hotmail.co.uk>
To: martinr...@googlemail.com



OK. Looked at Albrecht, Bard and Hart.

If that is what you are still doing, I think I see the main improvement
needed.

In your pre-Strassen algorithm, you always deal with whole rows of C.

You should chop them up. That way your (now very short) rows of B
can be greased, and still have quite a few of them so that each little
row of C gets 16 XORs (doing 96 rows).

For my 350,000 prototype . . .

I used strips of A of width 96.
As a first step, I "prepared" matrix A by putting the strips
contiguous (so that the first 96 columns, in there entirety,
came first, then the next 96, and so on. Knowing that grease
level 6 was coming, I also reformatted each 3 bytes of A
into four, to save the shifting/anding in the main loop.

I used "bricks" of B that were 96 x 1024.

Matrix C was generated in strips 1024 columns wide.

There were 342 strips of C, and had I had 342 cores I could
have done them all at once. I only had 12, so I was running
12 worker threads, each doing one strip of C.

Each brick therefore corresponded to one 350000 x 96 strip of A
and one 350000 x 1024 strip of C. To do this work, L2 was filled
with 16 "slices" of grease level 6 (16 x 64 x 1024 bits).
"Fill your cache with grease".

One way of looking at this is that I pass matrix C 350000/96 times.

You pass it many times more.

---------- Forwarded Message ----------
Subject: Strassen.
Date: Sunday 05 Aug 2012
From: Richard Parker <richp...@hotmail.co.uk>
To: martinr...@googlemail.com



I have been pondering the applicability of Strassen-Winograd (S-W) to matrix
multiplication since my last mail, and have provisionally come to the
following conclusions. What do you think? Does this sound right?

1. That by chopping up the rows, you could have speeded up your base case
for dimensions over 2048, but you do not use base cases that large, so it will
not, in fact, help you as your system currently stands.

2. The existence of a better base-case throws into doubt your conclusion
(if I have it correct) that S-W is faster at 2048 than the base case. I suspect
that if you use these tricks, you will find that the new 2048 base case is
faster than S-W applied to the 1024 base case, and that this will improve your
software, though I suspect this will be rather marginal and perhaps not worth
bothering with. Nevertheless, if you want to understand these tricks, I'm more
than happy to explain.

3. If you try to scale to use multiple cores, you will have to make a drastic
reduction to your use of main memory bandwidth, and the key (which you have
already found) is to use more of the things that I call slices and you, I
think, call Gray-code tables. You must then chop up your rows of B and C to
shorten the rows so you can get all you need into L2 cache. This is no longer
a marginal improvement. It is huge. It makes each core run at about the same
speed as for the base case, but now it scales perfectly.

For my software, I have learned already that at the level where the matrices
no longer fit in memory, S-W is basically compulsory. I thank you for forcing
me to think this out!

For memory work (i.e. less than 350,000 on the Essen machine) strategically
this seems to be a fairly close call. S-W's gains are obvious, but the
losses are
also quite substantial.

1. The first is, of course, the additional memory bandwidth needed to do the
extra additions. This requires about six complete passes of a matrix which
equates (in my warped thinking) to about 600 extra rows. If this were the
only consideration, it would suggest that 8192 is better done by S-W.
[This is because I am using all the cores, so the memory bandwidth
used in one core doing an add is slowing down the multiplies
taking place in the other cores and I am taking this into account].

2. I pre-process matrix A to spot sparsity and to use coding theory
to squeeze a bit more out of the same resources. S-W makes this harder,
since there are now seven different matrices used as left-multipliers as
against the (virtual) four in the base case. This means that the more
sophisticated codes can no longer be used since they are too slow to encode.
Also any attempt at cache-prediction would be too slow for a S-W environment.

3. By adding the matrices together, any systematic blocks of zeros are likely
to be messed-up. S-W actually left-multiplies by three single blocks and
only "messes up" four of them, but nevertheless the impact is to both A and
B, and important in many cases.

I propose, I think, to proceed as planned with the base case up to one memory
full. I basically have to do this anyway, since Gaussian Elimination needs its
components and I haven't got my head around S-W with the rectangular matrices
that are needed for Gaussian Elimination.

I should probably write an alternative S-W-based multiply at the memory level
so that there is a choice - the base case will be faster if there is even just
moderate sparsity, whereas S-W will be faster for fully dense matrices. The
main use for this is, in fact, for Gaussian Elimination at the large scale, and
I therefore need to understand how best to multiply rectangular dense matrices
using S-W.

Questions for you, then . . .

I wish to multiply an AxB matrix by a BxC one. How do I do that using
Strassen-Winograd?

Also - Is Laderman still the best for 3x3? 23 multiplies? How many adds? I
suspect this is going to figure in here somewhere!

--
name: Martin Albrecht
_pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
_otr: 47F43D1A 5D68C36F 468BAEBA 640E8856 D7951CCF
_www: http://martinralbrecht.wordpress.com/
_jab: martinr...@jabber.ccc.de

Bill Hart

Aug 5, 2012, 9:33:25 AM
to m4ri-...@googlegroups.com
From memory we already tried slicing the matrices in the other
dimension and it was slower on a single core machine. Naturally, I
look forward to a comparison of the two pieces of code when he has
finished his implementation. Chips have changed a lot and it is
feasible that with different cache hierarchies and instruction
scheduling a speedup over our approach is possible.

Of course, we didn't code this to be a fast parallel algorithm, so I
am not surprised if it can be beaten on a multicore computer.

I don't like this: "The existence of a better base-case throws into
doubt your conclusion (if I have it correct) that S-W is faster at
2048 than the base case."

This is a misunderstanding. We never claimed that S-W was faster at
2048 than base case. Rather, we found that our implementation of the
particular algorithms we chose happens to have that crossover on the
given machines. That is not a conclusion, but an experimental
observation. So by definition, no "conclusion" is being thrown into
doubt.

Bill.

Martin Albrecht

Aug 5, 2012, 9:38:49 AM
to m4ri-...@googlegroups.com
On Sunday 05 Aug 2012, Bill Hart wrote:
> From memory we already tried slicing the matrices in the other
> dimension and it was slower on a single core machine.

Yep, we tried. Fun fact, our discussion was public so it's somewhat well
documented what we tried and how:

https://groups.google.com/forum/?fromgroups#!topic/sage-devel/qk7cJByk1rs

This might also uncover that we didn't try the right way though :)

Cheers,
Martin

Bill Hart

Aug 5, 2012, 10:00:32 AM
to m4ri-...@googlegroups.com
One thing that wouldn't surprise me these days, but which may have surprised
me back then, is that some technique to better prepare the data (its layout
in memory) might lead to a speed gain.

He does mention such a thing, so I will be interested to see if that
is important.

His claim that chopping the matrices up in both dimensions is a win
for multicore does not surprise me at all. I don't quite understand
his claim that we "pass" the matrix C many more times. The benefits of
what he is suggesting is surely that increasing the number of cores
with our technique means that a much larger amount of data needs to be
available at any one time, and the (L3) cache quickly becomes
overwhelmed.

I could be wrong, but I don't recall giving all that much
consideration to the multicore situation. I vaguely recall that as we
were finishing the paper, there had been some attempts to use OpenMP
to give a "free" speedup on multicore machines. But that was not too
developed; there was just some parallelised "for" loops. Martin, you
probably remember this part better than me.

Bill.

Richard Parker

Aug 5, 2012, 10:15:15 AM
to m4ri-...@googlegroups.com


> as you can see I moved our discussion to [m4ri-devel], I hope that's okay with
> you. IMHO it's always better to have these kind of discussions in the open, so
> that other people can learn and contribute.

Agree 100%

This is partly a test-post!

>
> On Saturday 04 Aug 2012, you wrote:
> > My target was to use the 12 core (6+6) Nehalem in Essen to multiply two
> > random dense matrices at 350,000 x 350,000 in 1 Hour. I failed, in that
> > it took 80 minutes. This uses no Strassen.
>
> The biggest computation I did was to compute the row echelon form of a dense
> 500,000 x 500,000 matrix in 1.05 hours on four cores. Using one core it took
> 2.7 hours.



That is IMPRESSIVE.  How do you manage to do row echelon form so much
faster than multiplication? I thought it took at least 1/4 of the time of a
multiply to get the rank, another 1/4 to get full echelon form, and another
1/2 if you want the left-multiplier as well.

Also this is so much faster than your paper says.  Why is that?

How on earth did you do this?

> A somewhat hand waving computation puts the required CPU time for
> multiplication of two 350,000 x 350,000 matrices at
>
> constant * n^2.807 / (cycles per second) / (seconds per hour)
>
> 0.02034 * 350000^2.807 / (2.6 * 10^9) /3600.0 = 7.93h
>
> on a single core, where I estimated the constant using timings for 35,000 x
> 35,000 matrices.



This is probably about right.  Actually I do not have an algorithm that runs
much faster on a single core, and my program would have taken about 14-15
hours on a single core.

Also, of course, I don't use Strassen (yet) so my 2.807 is 3.000.



> No worries and no need to apologise :)



Oops.  Sorry, I did it again.  :)


RP

Martin Albrecht

Aug 5, 2012, 10:30:24 AM
to m4ri-...@googlegroups.com
Hi,

On Sunday 05 Aug 2012, Richard Parker wrote:
> > as you can see I moved our discussion to [m4ri-devel], I hope that's okay
> > with you. IMHO it's always better to have these kind of discussions in
> > the open, so that other people can learn and contribute.
>
> Agree 100%
>
> This is partly a test-post!
>
> > On Saturday 04 Aug 2012, you wrote:
> > > My target was to use the 12 core (6+6) Nehalem in Essen to multiply two
> > > random dense matrices at 350,000 x 350,000 in 1 Hour. I failed, in that
> > > it took 80 minutes. This uses no Strassen.
> >
> > The biggest computation I did was to compute the row echelon form of a
> > dense 500,000 x 500,000 matrix in 1.05 hours on four cores. Using one
> > core it took 2.7 hours.
>
> That is IMPRESSIVE. How do you manage to do row echelon form so much
> faster than multiplication? I thought it took 1/4 of the time of a multiply
> at least to get the rank, another 1/4 to get full echelon form, and
> another 1/2 if you want the left-multiplier as well.

This is only the row echelon form, *not* the reduced row echelon form, so the
left-multiplier is kinda free. Our implementation is described in:

http://arxiv.org/abs/1111.6549

> Also this is so much faster than your paper says. Why is that?

Different machine I guess.

> How on earth did you do this?

I wish I could remember. I did this benchmark a while ago, so I should
probably repeat it to make sure it still holds; it was definitely on the
sage.math machine. I did find an older write-up of mine, where I report
timings, here:

https://martinralbrecht.wordpress.com/2009/03/20/%EF%BB%BFm4ri-api-change-and-big-matrices/

which seems a bit slower than what I claimed above by a factor of two or so.

On my machine the constant for elimination is 4.3 * 10^-12, which would give

4.3 * 10^(-12) * 500000^2.807 / 3600.0 = 11.86 hours for the *reduced* row
echelon form. So about 6h for the non-reduced row echelon form (== rank). So
again, the benchmark claimed above is about a factor of two faster (and on a
different computer).

> > A somewhat hand waving computation puts the required CPU time for
> > multiplication of two 350,000 x 350,000 matrices at
> >
> > constant * n^2.807 / (cycles per second) / (seconds per hour)
> >
> > 0.02034 * 350000^2.807 / (2.6 * 10^9) /3600.0 = 7.93h
> >
> > on a single core, where I estimated the constant using timings for 35,000
> > x 35,000 matrices.
>
> This is probably about right. Actually I do not have an algorithm that runs
> much faster on a single core, and my program would have taken about 14-15
> hours on a single core.
>
> Also, of course, I don't use Strassen (yet) so my 2.807 is 3.000.

Cheers,
Martin

Martin Albrecht

Aug 5, 2012, 10:37:46 AM
to m4ri-...@googlegroups.com


On Sunday 05 Aug 2012, Bill Hart wrote:
> One thing that wouldn't surprise me these days, but which may have
> back then, is that some technique to better prepare the (layout in
> memory of the) data might lead to a speed gain.

Agreed that this could make a difference.

However, if you take a look at slides 19 and 20 of

https://bitbucket.org/malb/talks/src/7ae176bc6c08/20120607%20-%20M4RIE%20-%20Kaiserslautern

or the "small matrices" slide in:

https://martinralbrecht.files.wordpress.com/2010/07/20111219_-_m4ri_-_warwick.pdf

we can estimate the raw computation time without memory access overhead. That
is, Emmanuel tried many different strategies for 64 x 64 matrices and the best
he could find was 0.2533 microseconds for both addition and multiplication.
Based on that we should be able to estimate what would be optimal. For
example, our implementation and Magma's implementation pretty much achieve the
same complexity as running Strassen on 128 x 128 matrices with 64 x 64 matrix
blocks (possibly not optimal, I know, it's just that these timings match
somewhat).

> He does mention such a thing, so I will be interested to see if that
> is important.
>
> His claim that chopping the matrices up in both dimensions is a win
> for multicore does not surprise me at all. I don't quite understand
> his claim that we "pass" the matrix C many more times. The benefits of
> what he is suggesting is surely that increasing the number of cores
> with our technique means that a much larger amount of data needs to be
> available at any one time, and the (L3) cache quickly becomes
> overwhelmed.
>
> I could be wrong, but I don't recall giving all that much
> consideration to the multicore situation. I vaguely recall that as we
> were finishing the paper, there had been some attempts to use OpenMP
> to give a "free" speedup on multicore machines. But that was not too
> developed; there was just some parallelised "for" loops. Martin, you
> probably remember this part better than me.

Yes, to this day that's the sophistication of our parallel efforts, i.e., we
never *really* tried.

Bill Hart

Aug 5, 2012, 10:42:44 AM
to m4ri-...@googlegroups.com
Excellent! Then there is a real chance for Richard Parker to extend
the "state-of-the-art" here. Perhaps he will make his implementation
available to be used in m4ri?

I look forward to hearing about Richard's progress on this list!

Bill.

Martin Albrecht

Aug 6, 2012, 10:32:40 AM
to M4RI Development
Hi, sorry for the delay in replying more in depth, lots of other work stuff on
my plate.

> ---------- Forwarded Message ----------
> Subject: RE: Matrices over GF2.
> Date: Saturday 04 Aug 2012
> From: Richard Parker <richp...@hotmail.co.uk>
> To: martinr...@googlemail.com
>
>
>
> OK. Looked at Albrecht, Bard and Hart.
>
> If that is what you are still doing, I think I see the main improvement
> needed.
>
> In your pre-Strassen algorithm, you always deal with whole rows of C.
>
> You should chop them up.

I think this is what Bill tried when we originally wrote this code? cf.

https://groups.google.com/d/msg/sage-devel/qk7cJByk1rs/99mzlYAieWIJ

> That way your (now very short) rows of B
> can be greased, and still have quite a few of them so that each little
> row of C gets 16 XORs (doing 96 rows).



> For my 350,000 prototype . . .
>
> I used strips of A of width 96.
> As a first step, I "prepared" matrix A by putting the strips
> contiguous (so that the first 96 columns, in their entirety,
> came first, then the next 96, and so on). Knowing that grease
> level 6 was coming, I also reformatted each 3 bytes of A
> into four, to save the shifting/anding in the main loop.
>
> I used "bricks" of B that were 96 x 1024.
>
> Matrix C was generated in strips 1024 columns wide.
>
> There were 342 strips of C, and had I had 342 cores I could
> have done them all at once. I only had 12, so I was running
> 12 worker threads, each doing one strip of C.
>
> Each brick therefore corresponded to one 350000 x 96 strip of A
> and one 350000 x 1024 strip of C. To do this work, L2 was filled
> with 16 "slices" of grease level 6 (16 x 64 x 1024 bits).
> "Fill your cache with grease".

Thanks, very well explained!
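
If I read this right, each of your worker threads runs a loop roughly like the
toy single-threaded program below. This is only my paraphrase, with byte-aligned
parameters for simplicity (8-column strips of A and grease level 8 instead of
your 96 and 6), so all the constants and helper names are mine, not yours:

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  #define N      4096             /* all matrices are N x N bits        */
  #define WORDS  (N / 64)         /* 64-bit words per full row          */
  #define CSTRIP 16               /* words per strip of C (1024 bits)   */

  static uint64_t A[N][WORDS], B[N][WORDS], C[N][WORDS];

  /* the 8 bits of row `row` of A that belong to 8-column strip `strip` */
  static uint8_t a_byte(int row, int strip) {
    return (uint8_t)(A[row][strip / 8] >> (8 * (strip % 8)));
  }

  int main(void) {
    static uint64_t table[256][CSTRIP];          /* one grease "slice"  */

    for (int r = 0; r < N; r++)                  /* some deterministic  */
      for (int w = 0; w < WORDS; w++) {          /* junk to work on     */
        A[r][w] = 0x9E3779B97F4A7C15ull * (uint64_t)(r * WORDS + w + 1);
        B[r][w] = 0xC2B2AE3D27D4EB4Full * (uint64_t)(r * WORDS + w + 1);
      }

    for (int cs = 0; cs < WORDS / CSTRIP; cs++) {   /* strip of C: one  */
                                                    /* thread's job     */
      for (int as = 0; as < N / 8; as++) {          /* strip of A       */
        /* build the table for this 8 x 1024 brick of B                 */
        memset(table, 0, sizeof table);
        for (int v = 1; v < 256; v++) {
          int low = v & -v;                         /* lowest set bit   */
          int row = 8 * as + __builtin_ctz((unsigned)v);
          for (int w = 0; w < CSTRIP; w++)
            table[v][w] = table[v ^ low][w] ^ B[row][cs * CSTRIP + w];
        }
        /* stream every row of A once per brick, XORing a table row into
           the matching 1024-bit piece of C                             */
        for (int r = 0; r < N; r++) {
          const uint64_t *t = table[a_byte(r, as)];
          for (int w = 0; w < CSTRIP; w++)
            C[r][cs * CSTRIP + w] ^= t[w];
        }
      }
    }
    printf("C[0][0] = %016llx\n", (unsigned long long)C[0][0]);
    return 0;
  }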

> One way of looking at this is that I pass matrix C 350000/96 times.

Each pass costs

16 * 2^6 * 1024 operations to construct the tables
+ 350000 * 1024 operations to add rows from the tables to C

per 1024 bit slice of B, i.e., this is repeated 350000/1024 times.

So you should end up with 4.47921250000000e14 bit operations:

=> n=350000; e = n/1024 * (n * 1024 + (16 * 2^6 * 1024)) * n/96.0
<= 4.47921250000000e14

Is that correct?

Doing the same computation for 175,000 x 175,000 matrices we get:


=> n=175000; e = n/1024 * (n * 1024 + (16 * 2^6 * 1024)) * n/96.0
<= 5.61534895833333e13

Finally, we get

7*e + 15 * n^2 = 3.93533802083333e14

which is slightly fewer operations with one level of Strassen. One more level
would give 3.45940976718750e14 operations.
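
In case anyone wants to plug in other parameters, the same count as a few
lines of C (this just re-evaluates the formulas above, nothing more):

  #include <stdio.h>

  /* cost of one pass (one 96-column strip of A): per 1024-column slice
     of C, build 16 grease tables (16 * 2^6 * 1024 bit operations) and
     add n rows of 1024 bits into C; there are n/1024 slices per pass
     and n/96 passes in total.                                          */
  static double base_ops(double n) {
    return (n / 1024.0) * (n * 1024.0 + 16.0 * 64.0 * 1024.0) * (n / 96.0);
  }

  int main(void) {
    double n  = 350000.0;
    double e0 = base_ops(n);
    double e1 = 7.0 * base_ops(n / 2.0) + 15.0 * (n / 2.0) * (n / 2.0);
    printf("base case     : %.6e bit operations\n", e0);
    printf("one S-W level : %.6e bit operations\n", e1);
    return 0;
  }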

Of course, these are arithmetic operations rather than memory reads and
writes. So the question is how these compare?

On a side note, in our implementation when we switch to greasing, we copy out
C, effectively rearranging C in a similar way to you (but with fewer rows).
Also, we switch over to greasing when C fits into L2, so each greasing step
costs one read and one write to RAM (but many more to/from L2).

Not sure what I tried to accomplish by this, just to get a better feeling of
how things stand I guess :)

PS: Fingers crossed I didn't screw up the computation which I am prone to do.

> You pass it many times more.



> ---------- Forwarded Message ----------
> Subject: Strassen.
> Date: Sunday 05 Aug 2012
> From: Richard Parker <richp...@hotmail.co.uk>
> To: martinr...@googlemail.com
>
>
>
> I have been pondering the applicability of Strassen-Winograd (S-W) to
> matrix multiplication since my last mail, and have provisionally come to
> the following conclusions. What do you think? Does this sound right?
>
> 1. That by chopping up the rows, you could have speeded up your base case
> for dimensions over 2048, but you do not use base cases that large, so it
> will not, in fact, help you as your system currently stands.
>
> 2. The existence of a better base-case throws into doubt your conclusion
> (if I have it correct) that S-W is faster at 2048 than the base case. I
> suspect
> that if you use these tricks, you will find that the new 2048 base case is
> faster
> than S-W applied to the 1024 base case and that this will improve your
> software,
> though I suspect this will be rather marginal and perhaps not worth
> bothering with.
> Nevertheless if you want to understand these tricks, I'm more than happy to
> explain.

In our experiments we found that cutting roughly at L2 cache size was optimal
for us.

> 3. If you try to scale to use multiple cores, you will have to make a
> drastic reduction
> to your use of main memory bandwidth, and the key (which you have already
> found)
> is to use more of the things that I call slices and you, I think, call
> Gray-code tables. You must
> then chop up your rows of B and C to shorten the rows so you can get all
> you need
> into L2 cache. This is no longer a marginal improvement. It is huge. It
> makes each
> core run at about the same speed as for the base case, but now it scales
> perfectly.

Yes, it seems the way to parallelise these things is to do what you are doing.
I wonder how much it will cost us to convert between the two formats, i.e.,
whether it would be useful to implement your strategy for multi-cpu and
perhaps some multicore scenarios and to convert between the two
representations for our high-level code.

> For my software, I have learned already that at the level where the
> matrices no longer
> fit in memory, S-W is basically compulsory. I thank you for forcing me to
> think this
> out!

I am not sure I follow this conclusion. You could also do cubic block-recursive
multiplication? I think the only difficulty is left-over rows when the dimension
is not neatly divisible by 2; the rectangular case is no different from the
square case.

> I should probably write an alternative S-W-based multiply at the memory
> level so
> that there is a choice - the base case will be faster if there is even just
> moderate
> sparsity, whereas S-W will be faster for fully dense matrices. The main
> use for this
> is, in fact, for Gaussian Elimination at the large scale, and I therefore
> need to
> understand how best to multiply rectangular dense using S-W.
>
> Questions for you, then . . .
>
> I wish to multiply an AxB matrix by a BxC one. How do I do that using
> Strassen-Winograd?

We don't even differentiate square vs non-square:

https://bitbucket.org/malb/m4ri/src/daf7eaebf778/src/strassen.c#cl-50

Strassen "just" goes through.

> Also - Is Laderman still the best for 3x3? 23 multiplies? How many adds?
> I suspect
> this is going to figure in here somewhere!

I thought there is nothing better in practice than Strassen-Winograd?

Cheers,
Martin

Bill Hart

Aug 6, 2012, 6:10:25 PM
to m4ri-...@googlegroups.com
Doesn't Bard have the record for 3x3?

Bill.

Martin Albrecht

Aug 7, 2012, 6:59:23 AM
to m4ri-...@googlegroups.com


On Sunday 05 Aug 2012, Richard Parker wrote:
> > as you can see I moved our discussion to [m4ri-devel], I hope that's okay
> > with you. IMHO it's always better to have these kind of discussions in
> > the open, so that other people can learn and contribute.
>
> Agree 100%
>
> This is partly a test-post!
>
> > On Saturday 04 Aug 2012, you wrote:
> > > My target was to use the 12 core (6+6) Nehalem in Essen to multiply two
> > > random dense matrices at 350,000 x 350,000 in 1 Hour. I failed, in that
> > > it took 80 minutes. This uses no Strassen.
> >
> > The biggest computation I did was to compute the row echelon form of a
> > dense 500,000 x 500,000 matrix in 1.05 hours on four cores. Using one
> > core it took 2.7 hours.
>
> That is IMPRESSIVE. How do you manage to do row echelon form so much
> faster than multiplication? I thought it took 1/4 of the time of a multiply
> at least to get the rank, another 1/4 to get full echelon form, and
> another 1/2 if you want the left-multiplier as well.

Okay, more data points, I ran

./bench_elimination -t 36000 350000 350000 pluq

on geom.math.washington.edu:

model name : Intel(R) Xeon(R) CPU X7460 @ 2.66GHz
cpu MHz : 2670.000
cache size : 16384 KB

which means: compute the *reduced* row echelon form using PLE decomposition
for random 350,000 x 350,000 matrices for about 36000 seconds = 10 hours. The
result is:

Total running time: 115650.132 seconds.
Sample size: 2; mean: 62481106613290; standard deviation: 1321923046957
99% confidence interval: +/- 59502793365722 (95.2%):
[2978313247568..121983899979012]
m: 350000, n: 350000, last r: 350000, cpu cycles: 62481106613290,
cc/(mnr^0.807): 0.01712, wall time: 23429.825020

So it takes 6.50 hours to do this computation on a single core on the
above-mentioned machine using the M4RI software.

Cheers,
Martin

PS: This means I should stop using that benchmark data which I posted before,
because I cannot reproduce it. Apologies for the noise.

Martin Albrecht

Aug 11, 2012, 8:10:49 AM
to m4ri-...@googlegroups.com
> Doesn't Bard have the record for 3x3?

His approach only works for approximate algorithms for reals and complex
numbers (I asked him and he just got back to me).

Bill Hart

Aug 12, 2012, 5:41:48 AM
to m4ri-...@googlegroups.com
Thanks. I keep forgetting these details. Must be old age coming early.

Bill.