Hi [m4ri-devel],
I was contacted by Richard Parker who argues that Strassen is not a good idea
for GF(2). I've reproduced his e-mails below (slightly edited). Sorry, I
should have moved this here earlier, but I was travelling and not thinking
straight :)
What do you think?
Cheers,
Martin
---------- Forwarded Message ----------
Subject: Matrices over GF2.
Date: Friday 03 Aug 2012
From: Richard Parker <richp...@hotmail.co.uk>
To: martinr...@googlemail.com
I'm not sure whether we've met, but it appears we have a common
interest in the fast implementation of matrix operations (multiply,
Gaussian elimination, etc) over GF2 in particular.
My own interest comes from working with groups of matrices as
part of group theory and, in particular, their modular representations.
So far I have been unable to see how the ideas of Strassen can help,
since they seem to reduce multiplications at the expense of data movement,
and it seems to me that for GF2 specifically it is the data movement
that is taking the time.
Reading your stuff about M4RI suggests you have concluded otherwise,
and - if you are prepared to divulge - I'd be most interested to understand
how multiplication can be speeded up by such methods.
To give you a first view of where I am coming from, standard Strassen
does 11 adds and 7 multiplies, as against the 4 adds and 8 multiplies
the ordinary way, and at many scales seven adds is slower than
one multiply because of all the I/O, whether this be from file-server,
disk or ordinary memory.
Richard Parker
---------- Forwarded Message ----------
Subject: Re: Matrices over GF2.
Date: Saturday 04 Aug 2012
From: Martin Albrecht <martinr...@googlemail.com>
To: Richard Parker <richp...@hotmail.co.uk>
Hey,
before getting into the why perhaps it's useful to establish that indeed
Strassen-Winograd makes multiplication faster over GF(2). For example,
if you run
./bench_multiplication 10000 0
in the M4RI testsuite you'll get the time it takes with Strassen down to the
dimension where matrix blocks fit into L2. If you run
./bench_multiplication 10000 10000
you get the time for the "Method of the Four Russians", i.e., cubic
multiplication + "greasing".
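(For reference, in terms of the library calls those two runs roughly correspond
to the sketch below. This is only illustrative: the names and the exact
mzd_mul() signature are from memory, so check them against the current headers.
The last argument is the Strassen cutoff; 0 picks the library default, and a
cutoff >= n means the recursion never fires, i.e. plain M4RM. Timing is
omitted.)

/* Sketch: multiply two random 10000 x 10000 matrices over GF(2) with and
 * without Strassen-Winograd, via M4RI's public interface.                */
#include <m4ri/m4ri.h>

int main(void) {
  const rci_t n = 10000;
  mzd_t *A = mzd_init(n, n);
  mzd_t *B = mzd_init(n, n);
  mzd_randomize(A);
  mzd_randomize(B);

  mzd_t *C1 = mzd_mul(NULL, A, B, 0);   /* Strassen-Winograd + M4RM base case      */
  mzd_t *C2 = mzd_mul(NULL, A, B, n);   /* cutoff >= n: M4RM ("Four Russians") only */

  mzd_free(C1); mzd_free(C2);
  mzd_free(A);  mzd_free(B);
  return 0;
}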
Moving on to the "why". I am not sure what you mean by "data movement". So to
avoid a misunderstanding, in our implementation data is not moved in the
strict sense, i.e., copied from A to B. Of course, there are many memory reads
and writes.
However, Strassen-Winograd uses fewer memory reads and writes than classical
cubic multiplication once the dimension is big enough. That is, 7
multiplications + 15 additions use less I/O than 8 multiplications + 8
additions for n x n matrices starting at some value of n. For 64 x 64 matrices
adds and multiplications do indeed seem to be roughly equally expensive, but
for larger dimensions this is not the case. In our implementation we cross
over from Strassen to M4RM at dimensions of a few thousand.
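For concreteness, the standard one-level Strassen-Winograd schedule (the 7
products and 15 additions mentioned above) looks as follows over GF(2), where
subtraction is the same XOR as addition. This is only a sketch: single bits
stand in for the (n/2 x n/2) blocks, so '&' stands for a block product
(recursive in the real algorithm) and '^' for a block XOR.

/* One level of Strassen-Winograd over GF(2): 7 multiplications, 15 additions
 * (8 pre-additions, 7 post-additions). Scalars stand in for matrix blocks.  */
#include <stdio.h>

static void sw2x2(int a[2][2], int b[2][2], int c[2][2]) {
  /* 8 pre-additions */
  int s0 = a[1][0] ^ a[1][1];
  int s1 = s0      ^ a[0][0];
  int s2 = a[0][0] ^ a[1][0];
  int s3 = a[0][1] ^ s1;
  int t0 = b[0][1] ^ b[0][0];
  int t1 = b[1][1] ^ t0;
  int t2 = b[1][1] ^ b[0][1];
  int t3 = t1      ^ b[1][0];
  /* 7 multiplications (recursive block products in the real algorithm) */
  int p0 = a[0][0] & b[0][0];
  int p1 = a[0][1] & b[1][0];
  int p2 = s3      & b[1][1];
  int p3 = a[1][1] & t3;
  int p4 = s0      & t0;
  int p5 = s1      & t1;
  int p6 = s2      & t2;
  /* 7 post-additions */
  int u1 = p0 ^ p5;
  int u2 = u1 ^ p6;
  int u3 = u1 ^ p4;
  c[0][0] = p0 ^ p1;
  c[0][1] = u3 ^ p2;
  c[1][0] = u2 ^ p3;
  c[1][1] = u2 ^ p4;
}

int main(void) {
  /* sanity check against the naive product for all 2x2 inputs over GF(2) */
  for (int m = 0; m < 256; m++) {
    int a[2][2] = {{m & 1, (m >> 1) & 1}, {(m >> 2) & 1, (m >> 3) & 1}};
    int b[2][2] = {{(m >> 4) & 1, (m >> 5) & 1}, {(m >> 6) & 1, (m >> 7) & 1}};
    int c[2][2];
    sw2x2(a, b, c);
    for (int i = 0; i < 2; i++)
      for (int j = 0; j < 2; j++)
        if (c[i][j] != ((a[i][0] & b[0][j]) ^ (a[i][1] & b[1][j])))
          printf("mismatch for input %d\n", m);
  }
  return 0;
}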
Put simply, I think this statement is incorrect:
> at many scales seven adds is slower than one multiply because of all the
> I/O, whether this be from file-server, disk or ordinary memory.
because seven adds do a lot less I/O than one multiply, which does the
equivalent of ~n additions.
I hope this is helpful, feel free to get back to me if not etc.
Cheers,
Martin
---------- Forwarded Message ----------
Subject: RE: Matrices over GF2.
Date: Saturday 04 Aug 2012
From: Richard Parker <richp...@hotmail.co.uk>
To: martinr...@googlemail.com
Thanks for your reply. I suspect I have a rather faster basic multiply
(mainly by being cache-friendly), which makes the difference. Unless I
have missed something (and I still suspect I have) the crossover seems to
be at a dimension of around 100,000 rather than 1,000.
If you are still active in this area, perhaps the first step would be to
persuade you to consider chopping up your matrix in both directions
so that the cache-full is more nearly square. This balances the memory
usage across the three matrices (oversimplification!) and in my work
produced a factor of ten (!) improvement in the basic multiplication once
I tried using multiple cores.
Actually by using L1, L2 and L3 in a balanced way, I suspect there is
yet another factor of two available, though it is a lot of work, and
will have a very short shelf life - as soon as they change the sizes or
speeds of the caches, I'd have to tweak it! :(
[This would involve processing matrix A very carefully, keeping track of
which vectors had been used recently so that they can be favoured since
they will be in L1 cache].
Richard Parker
---------- Forwarded Message ----------
Subject: Re: RE: Matrices over GF2.
Date: Saturday 04 Aug 2012
From: Martin Albrecht <martinr...@googlemail.com>
To: Richard Parker <richp...@hotmail.co.uk>
Hey,
On Saturday 04 Aug 2012, you wrote:
> Thanks for your reply. I suspect I have a rather faster basic multiply,
> (mainly by being cache-friendly) which makes the difference. Unless I
> have missed something (and I still suspect I have) the crossover seems to
> be at around 100,000 dimensions rather than 1000.
is your implementation available somewhere? In any case, we could easily
resolve this question by comparing your implementation with ours (and
Magma's). If your base case is so efficient this would of course be very
interesting for us.
> If you are still active in this area, perhaps the first step would be to
> persuade you to consider chopping up your matrix in both directions
> so that the cache-full is more nearly square.
I am not sure what you mean by that; we do chop up our matrices into blocks,
e.g.,
[A0 A1]
[A2 A3]
Is that what you mean? However, we do store these matrices in row-major
representation, i.e., there's a jump in memory between the zeroth and first
rows of A0.
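(In code these blocks are just windows, i.e. views, into the row-major matrix,
roughly as below. A sketch from memory of the mzd_init_window() interface; if
I recall correctly the column offset of a window has to be a multiple of the
word size, which is why n/2 is chosen as a multiple of 64 here.)

/* Sketch: the four blocks of a row-major mzd_t as windows; no data is copied,
 * the rows of A0 are simply not contiguous in memory.                        */
#include <m4ri/m4ri.h>

int main(void) {
  const rci_t n = 4096;                 /* n/2 is a multiple of 64 */
  mzd_t *A = mzd_init(n, n);
  mzd_randomize(A);

  /* [A0 A1]
     [A2 A3]   each Ai is an (n/2 x n/2) view into A */
  mzd_t *A0 = mzd_init_window(A, 0,   0,   n/2, n/2);
  mzd_t *A1 = mzd_init_window(A, 0,   n/2, n/2, n);
  mzd_t *A2 = mzd_init_window(A, n/2, 0,   n,   n/2);
  mzd_t *A3 = mzd_init_window(A, n/2, n/2, n,   n);

  /* ... recurse on the windows ... */

  mzd_free_window(A0); mzd_free_window(A1);
  mzd_free_window(A2); mzd_free_window(A3);
  mzd_free(A);
  return 0;
}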
> This balances the memory
> usage across the three matrices (oversimplification!) and in my work
> produced a factor of ten (!) improvement in the basic multiplication once
> I tried using multiple cores.
Mhh, how many cores did you use? Our code definitely doesn't scale well on
multicore systems.
---------- Forwarded Message ----------
Subject: RE: Matrices over GF2.
Date: Saturday 04 Aug 2012
From: Richard Parker <richp...@hotmail.co.uk>
To: martinr...@googlemail.com
> is your implementation available somewhere? In any case, we could easily
> resolve this question by comparing your implementation with ours (and
> Magma's). If your base case is so efficient this would of course be very
> interesting for us.
I did a prototype in Essen in February. I guess it still works, but I'm in
the middle of rewriting it, so the version I currently have doesn't even
compile, let alone run.
My target was to use the 12 core (6+6) Nehalem in Essen to multiply two
random dense matrices at 350,000 x 350,000 in 1 Hour. I failed, in that
it took 80 minutes. This uses no Strassen.
I must apologise for what I wrote after that. I was guessing what you were
doing, and after due consideration have concluded that I guessed wrong.
> Mhh, how many cores did you use? Our code definitely doesn't scale well on
> multicore systems.
12, as stated above. The more cores you use, the more you have to worry
about memory bandwidth, since it is shared between them. I suspect this
is the major difference, in fact, between our approaches. It scaled
perfectly, except that the reading of the matrices and the writing of the
answer was not multi-threaded (and thereby hangs this tale).
If you really have to take memory bandwidth into account so you scale well,
this seems to make the whole problem slightly different at every scale, and
I am beginning to think that the five scales I had (file server, local disk,
memory, thread-worth and cache-full) are going to turn out too simplistic. :(
I can send my documentation if you like.
---------- Forwarded Message ----------
Subject: RE: Matrices over GF2.
Date: Saturday 04 Aug 2012
From: Richard Parker <richp...@hotmail.co.uk>
To: martinr...@googlemail.com
OK. Looked at Albrecht, Bard and Hart.
If that is what you are still doing, I think I see the main improvement needed.
In your pre-Strassen algorithm, you always deal with whole rows of C.
You should chop them up. That way your (now very short) rows of B
can be greased, and still have quite a few of them so that each little
row of C gets 16 XORs (doing 96 rows).
For my 350,000 prototype . . .
I used strips of A of width 96.
As a first step, I "prepared" matrix A by putting the strips contiguous
(so that the first 96 columns, in their entirety, came first, then the
next 96, and so on). Knowing that grease level 6 was coming, I also
reformatted each 3 bytes of A into four, to save the shifting/anding in
the main loop.
I used "bricks" of B that were 96 x 1024.
Matrix C was generated in strips 1024 columns wide.
There were 342 strips of C, and had I had 342 cores I could
have done them all at once. I only had 12, so I was running
12 worker threads, each doing one strip of C.
Each brick therefore corresponded to one 350000 x 96 strip of A
and one 350000 x 1024 strip of C. To do this work, L2 was filled
with 16 "slices" of grease level 6 (16 x 64 x 1024 bits).
"Fill your cache with grease".
One way of looking at this is that I pass matrix C 350000/96 times.
You pass it many times more.
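(In case it helps to see the shape of it, below is a rough single-threaded
sketch of that loop structure. All the names are made up, the dimension is a
toy 3072 (a multiple of both 96 and 1024) instead of 350,000, and A is not
preprocessed, so the 6-bit chunks are extracted on the fly instead of being
repacked from 3 bytes into 4. In the real thing each worker thread takes one
strip of C and keeps its own grease tables in L2.)

/* Strips of A 96 wide, bricks of B 96 x 1024, strips of C 1024 wide,
 * grease level 6: 16 "slices" of 64 precomputed row combinations each,
 * i.e. 16 x 64 x 1024 bits = 128 KB of tables per brick.
 * Bit c of a row lives in word c/64, bit position c%64.                 */
#include <stdint.h>
#include <string.h>

#define N      3072               /* toy dimension                        */
#define WPR    (N / 64)           /* 64-bit words per full row            */
#define ASTRIP 96                 /* width of a strip of A                */
#define CSTRIP 1024               /* width of a strip of C / brick of B   */
#define CWORDS (CSTRIP / 64)      /* words per C-strip row                */
#define GREASE 6                  /* grease level: rows per slice         */
#define SLICES (ASTRIP / GREASE)  /* slices per brick (16)                */

/* extract 6 bits of a row starting at bit position pos (pos + 6 <= N) */
static unsigned get6(const uint64_t *row, unsigned pos) {
  unsigned w = pos >> 6, off = pos & 63;
  uint64_t v = row[w] >> off;
  if (off > 58) v |= row[w + 1] << (64 - off);   /* chunk crosses a word */
  return (unsigned)(v & 63);
}

void mul(const uint64_t *A, const uint64_t *B, uint64_t *C) {
  /* one grease table per slice; in a threaded version this is per thread */
  static uint64_t table[SLICES][64][CWORDS];

  memset(C, 0, (size_t)N * WPR * sizeof(uint64_t));

  for (unsigned cs = 0; cs < N / CSTRIP; cs++) {      /* strip of C             */
    unsigned cw = cs * CWORDS;                        /* its word offset        */
    for (unsigned as = 0; as < N / ASTRIP; as++) {    /* strip of A, brick of B */
      /* build the 16 grease tables for this 96 x 1024 brick of B */
      for (unsigned t = 0; t < SLICES; t++) {
        memset(table[t][0], 0, CWORDS * sizeof(uint64_t));
        for (unsigned v = 1; v < 64; v++) {
          unsigned j = 0;
          while (!(v & (1u << j))) j++;               /* lowest set bit of v */
          const uint64_t *brow =
              B + (size_t)(as * ASTRIP + t * GREASE + j) * WPR + cw;
          for (unsigned k = 0; k < CWORDS; k++)       /* reuse smaller entry */
            table[t][v][k] = table[t][v & (v - 1)][k] ^ brow[k];
        }
      }
      /* one pass over the rows of A per brick: 16 table lookups per row */
      for (unsigned i = 0; i < N; i++) {
        const uint64_t *arow = A + (size_t)i * WPR;
        uint64_t       *crow = C + (size_t)i * WPR + cw;
        for (unsigned t = 0; t < SLICES; t++) {
          unsigned v = get6(arow, as * ASTRIP + t * GREASE);
          for (unsigned k = 0; k < CWORDS; k++)
            crow[k] ^= table[t][v][k];
        }
      }
    }
  }
}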
---------- Forwarded Message ----------
Subject: Strassen.
Date: Sunday 05 Aug 2012
From: Richard Parker <richp...@hotmail.co.uk>
To: martinr...@googlemail.com
I have been pondering the applicability of Strassen-Winograd (S-W) to matrix
multiplication since my last mail, and have provisionally come to the
following conclusions. What do you think? Does this sound right?
1. That by chopping up the rows, you could have speeded up your base case
for dimensions over 2048, but you do not use the base case at such dimensions,
so it will not, in fact, help you as your system currently stands.
2. The existence of a better base case throws into doubt your conclusion
(if I have it correct) that S-W is faster at 2048 than the base case. I
suspect that if you use these tricks, you will find that the new 2048 base
case is faster than S-W applied to the 1024 base case, and that this will
improve your software, though I suspect this will be rather marginal and
perhaps not worth bothering with. Nevertheless, if you want to understand
these tricks, I'm more than happy to explain.
3. If you try to scale to use multiple cores, you will have to make a drastic
reduction in your use of main memory bandwidth, and the key (which you have
already found) is to use more of the things that I call slices and you, I
think, call Gray-code tables. You must then chop up your rows of B and C to
shorten the rows so you can get all you need into L2 cache. This is no longer
a marginal improvement. It is huge. It makes each core run at about the same
speed as for the base case, but now it scales perfectly.
For my software, I have learned already that at the level where the matrices
no longer fit in memory, S-W is basically compulsory. I thank you for forcing
me to think this out!
For memory work (i.e. less than 350,000 on the Essen machine) strategically
this seems to be a fairly close call. S-W's gains are obvious, but the losses
are also quite substantial.
1. The first is, of course, the additional memory bandwidth needed to do the
extra additions. This requires about six complete passes of a matrix which
equates (in my warped thinking) to about 600 extra rows. If this were the
only consideration, it would suggest that 8192 is better done by S-W.
[This is because I am using all the cores, so the memory bandwidth
used in one core doing an add is slowing down the multiplies
taking place in the other cores and I am taking this into account].
2. I pre-process matrix A to spot sparsity and to use coding theory
to squeeze a bit more out of the same resources. S-W makes this harder,
since there are now seven different matrices used as left-multipliers as
against the (virtual) four in the base case. This means that the more
sophisticated codes can no longer be used since they are too slow to encode.
Also any attempt at cache-prediction would be too slow for a S-W environment.
3. By adding the matrices together, any systematic blocks of zeros are likely
to be messed-up. S-W actually left-multiplies by three single blocks and
only "messes up" four of them, but nevertheless the impact is to both A and
B, and important in many cases.
I propose, I think, to proceed as planned with the base case up to one memory
full. I basically have to do this anyway, since Gaussian Elimination needs its
components and I haven't got my head around S-W with the rectangular matrices
that are needed for Gaussian Elimination.
I should probably write an alternative S-W-based multiply at the memory level
so that there is a choice - the base case will be faster if there is even just
moderate sparsity, whereas S-W will be faster for fully dense matrices. The
main use for this is, in fact, for Gaussian Elimination at the large scale,
and I therefore need to understand how best to multiply rectangular dense
matrices using S-W.
Questions for you, then . . .
I wish to multiply an AxB matrix by a BxC one. How do I do that using
Strassen-Winograd?
Also - is Laderman still the best for 3x3? 23 multiplies? How many adds? I
suspect this is going to figure in here somewhere!
--
name: Martin Albrecht
_pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99
_otr: 47F43D1A 5D68C36F 468BAEBA 640E8856 D7951CCF
_www: http://martinralbrecht.wordpress.com/
_jab: martinr...@jabber.ccc.de