
Dec 13, 2011, 1:12:29 PM

to linbox-use

Hi,

I am using the attached code to benchmark LinBox's performance over GF(2^e)

for e small. I know there is no specialised implementation for it in

LinBox/FFLAS; that's the point I want to demonstrate in my benchmark.

But I want to make sure I'm doing it fairly. Hence my question: will the

attached code (based on code by Clément) choose the fastest code path?

Cheers,

Martin

--

name: Martin Albrecht

_pgp: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x8EF0DC99

_otr: 47F43D1A 5D68C36F 468BAEBA 640E8856 D7951CCF

_www: http://martinralbrecht.wordpress.com/

_jab: martinr...@jabber.ccc.de

Dec 14, 2011, 7:24:16 AM

to linbo...@googlegroups.com, Martin Albrecht

Martin Albrecht wrote, on 13/12/2011 19:12:

> Hi,
>
> I am using the attached code to benchmark LinBox's performance over GF(2^e)
> for e small. I know there is no specialised implementation for it in
> LinBox/FFLAS; that's the point I want to demonstrate in my benchmark.
>
> But I want to make sure I'm doing it fairly. Hence my question: will the
> attached code (based on code by Clément) choose the fastest code path?
>
> Cheers,
> Martin

Dear Martin,

Yes, this is our fastest!

It is roughly within a small constant factor (2/3/4 ...) of peak

floating-point performance.

Thanks for the benchmarking, please let us know the results ...

Best,

--

Jean-Guillaume Dumas.

____________________________________________________________________

Jean-Guill...@imag.fr Tél.: +33 476 514 866

Université Joseph Fourier, Grenoble I. Fax.: +33 476 631 263

Laboratoire Jean Kuntzmann, Mathématiques Appliquées et Informatique

51, avenue des Mathématiques. LJK/IMAG - BP53. 38041 Grenoble FRANCE

http://ljk.imag.fr/membres/Jean-Guillaume.Dumas

____________________________________________________________________

Dec 14, 2011, 8:05:18 AM

to linbo...@googlegroups.com

On Wednesday 14 December 2011, Jean-Guillaume Dumas wrote:

> Martin Albrecht wrote, on 13/12/2011 19:12:
> > Hi,
> >
> > I am using the attached code to benchmark LinBox's performance over
> > GF(2^e) for e small. I know there is no specialised implementation for
> > it in LinBox/FFLAS; that's the point I want to demonstrate in my
> > benchmark.
> >
> > But I want to make sure I'm doing it fairly. Hence my question: will the
> > attached code (based on code by Clément) choose the fastest code path?
> >
> > Cheers,
> > Martin
>
> Dear Martin,
> Yes, this is our fastest!
> It is roughly within a small constant factor (2/3/4 ...) of peak
> floating-point performance.
> Thanks for the benchmarking, please let us know the results ...
> Best,

Hi,

currently it takes about 10 seconds on my computer to multiply two 4000 x 4000

matrices over GF(2^2). But I am apparently linking against the wrong BLAS. In

Sage, GF(3) takes about 6 seconds, using a custom-built ATLAS + LinBox. For

comparison, my dedicated implementation needs 0.24 seconds.

Dec 14, 2011, 8:43:23 AM

to linbo...@googlegroups.com, Martin Albrecht

Martin Albrecht wrote, on 14/12/2011 14:05:

> currently it takes about 10 seconds on my computer to multiply two 4000 x
> 4000 matrices over GF(2^2). But I am apparently linking against the
> wrong BLAS.

With Goto on my quad 2.8GHz, I have
GF(2^2); n = 4000: 5.64035s

> In Sage GF(3) takes about 6 seconds which uses a custom built ATLAS +
> LinBox.

With the same blas and machine I have
GF(3); n = 4000: 5.35234s
so this seems coherent.

> For comparison my dedicated implementation needs 0.24 seconds.

Dec 14, 2011, 9:26:04 AM

to linbo...@googlegroups.com

Hi,

> > Hi,

> >

> > currently it takes about 10 seconds on my computer to multiply two 4000 x

> > 4000 matrices over GF(2^2). But I am apparently linking against the

> > wrong BLAS.

>

> With Goto on my quad 2.8GHz, I have

> GF(2^2); n = 4000: 5.64035s

>

> > In Sage GF(3) takes about 6 seconds which uses a custom built ATLAS +

> > LinBox.

>

> With the same blas and machine I have

> GF(3); n = 4000: 5.35234s

> so this seems coherent.

>

> > For

> > comparison my dedicated implementation needs 0.24 seconds.

Thanks for those timings. GF(2^2) is probably a choice where LinBox is

particularly disadvantaged, because it's so small. That is, the time it takes

LinBox to perform the computation doesn't seem to increase between GF(2^2) and

GF(2^10). So, GF(2^8) still takes only 10 seconds (using OpenBLAS on Debian

still) whereas my code takes 2.5 seconds. For GF(2^10) LinBox still only takes

10 seconds but my implementation 36 seconds (I didn't implement the bitslice

business yet for e > 8).

One more question: does LinBox compute echelon forms over GF(2^e)?

Dec 14, 2011, 9:39:31 AM

to linbo...@googlegroups.com, Martin Albrecht

Martin Albrecht wrote, on 14/12/2011 15:26:

> Thanks for those timings. GF(2^2) is probably a choice where LinBox is
> particularly disadvantaged, because it's so small. That is, the time it takes
> LinBox to perform the computation doesn't seem to increase between GF(2^2) and
> GF(2^10). So, GF(2^8) still takes only 10 seconds (using OpenBLAS on Debian
> still) whereas my code takes 2.5 seconds. For GF(2^10) LinBox still only takes
> 10 seconds but my implementation 36 seconds (I didn't implement the bitslice
> business yet for e > 8).

Hum, wait, I am not sure about this anymore.

To compute with Kronecker substitution, we need to specialize the bound

computations for FFLAS.

We worked on it at some point, but at first glance I don't find it in the

svn.

Clément, do you remember where we put it?

> One more question: does linbox compute echelon forms for GF(2^e)?

>

That, I don't think we did.

Dec 14, 2011, 9:42:44 AM

to Jean-Guillaume Dumas, linbo...@googlegroups.com

One more question, I'm having trouble to convince myself the computation does

what I expect. I added printing of A, B and C and get for example:


$ ./test-fgemm-gfq 1 2

A:

3

B:

2

C:

3

But (a+1)*a = 1 in GF(2^2). Am I interpreting/doing this wrong?

Dec 15, 2011, 9:17:16 AM

to Martin Albrecht, linbo...@googlegroups.com

Martin Albrecht wrote, on 14/12/2011 15:42:

> One more question, I'm having trouble to convince myself the computation does
> what I expect. I added printing of A, B and C and get for example:
>
> $ ./test-fgemm-gfq 1 2
> A:
> 3
>
> B:
> 2
>
> C:
> 3
>
> But (a+1)*a = 1 in GF(2^2). Am I interpreting/doing this wrong?
>
> Cheers,
> Martin

Well, this is because random is not yet correct.

For now it uses a random value without shifts.

For instance, if there is a shift, then 1+a should be in binary, say, 1 001,

so 3 is actually interpreted as 0 011, which has no meaning ...

Best,

--

Jean-Guillaume Dumas.

____________________________________________________________________

Jean-Guill...@imag.fr Tél.: +33 476 514 866

Université Joseph Fourier, Grenoble I. Fax.: +33 476 631 263

Laboratoire Jean Kuntzmann, Mathématiques Appliquées et Informatique

51, avenue des Mathématiques. LJK/IMAG - BP53. 38041 Grenoble FRANCE

http://ljk.imag.fr/membres/Jean-Guillaume.Dumas

____________________________________________________________________

Dec 15, 2011, 10:01:30 AM

to Martin Albrecht, linbo...@googlegroups.com

Martin Albrecht wrote, on 14/12/2011 15:42:

> One more question, I'm having trouble to convince myself the computation does
> what I expect. I added printing of A, B and C and get for example:
>
> $ ./test-fgemm-gfq 1 2
> A:
> 3
>
> B:
> 2
>
> C:
> 3

Also be careful between values and the representation in the field.

Your "one", for instance, is not correct; you should use F.one or F.init(one, 1).

Then printing should also use F.write.

Then I have, for instance:

A:2 is x

B:3 is 1+x

C:1 is 1, which is correctly x*(1+x) mod x^2+x+1

Best,

--

Jean-Guillaume Dumas.

____________________________________________________________________

Jean-Guill...@imag.fr Tél.: +33 476 514 866

Université Joseph Fourier, Grenoble I. Fax.: +33 476 631 263

Laboratoire Jean Kuntzmann, Mathématiques Appliquées et Informatique

51, avenue des Mathématiques. LJK/IMAG - BP53. 38041 Grenoble FRANCE

http://ljk.imag.fr/membres/Jean-Guillaume.Dumas

____________________________________________________________________

Dec 15, 2011, 11:23:07 AM

to linbo...@googlegroups.com, Martin Albrecht

OK, I worked things out.

1) First, there is one issue in fflas-ffpack:

e.g. in fflas.h there are tests like

---

F.areEqual(F.mOne, alpha) alphad= -1.0

---

which are not correct over char 2 with givgfqext.h:

indeed, in char 2, 1 = -1, but if init from givgfqext.h receives a negative

value it can segfault.

A first patch for this is to add tests for F.one beforehand, everywhere?

See attached partial patch for fflas.

The bug becomes more important if Winograd is used, as more negative

values are needed ...

Clément, Brice, what should we do there?

2) Second, I added a random generator in givgfqext.h, in the svn:

----

template<class RandIter> Rep& random(RandIter& g, Rep& r) const {
    return init(r, static_cast<double>( (UTT)g() % _MODOUT));
}

----

3) Third, the specializations of BaseCompute and Mantissa for FFLAS and

givgfqext.h are missing.

Not sure where to put them ...

----

namespace FFLAS {
namespace Protected {
    template<> FFLAS_BASE BaseCompute (const Field& F, const size_t w) {
        return FflasDouble;
    }
    template<> unsigned long Mantissa (const Field& F, const FFLAS_BASE base) {
        return F.bits();
    }
}
}

----

see attached code.

Then I have this behavior (my Winograd threshold is 2536):

GF(2^2); n = 2500: 3.1962s

GF(2^3); n = 2500: 3.42021s

GF(2^4); n = 2500: 7.19245s

GF(2^5); n = 2500: 20.4533s

GF(2^6); n = 2500: 38.7384s

GF(3^2); n = 2500: 3.09619s

GF(3^3); n = 2500: 4.8323s

GF(3^4); n = 2500: 18.6812s

GF(5^2); n = 2500: 3.05619s

GF(5^3); n = 2500: 10.8167s

GF(5^4); n = 2500: 68.5043s

etc.

Best regards,

Dec 16, 2011, 8:17:20 AM

to Jean-Guillaume Dumas, linbo...@googlegroups.com

Hi,

On Thursday 15 December 2011, Jean-Guillaume Dumas wrote:

> OK, I worked things out.

> 1) First there is one issue in fflas-ffpack :

> e.g. in fflas.h there are tests like

> ---

> F.areEqual(F.mOne, alpha) alphad= -1.0

> ---

> Which are not correct over char 2 with givgfqext.h :

> indeed in char 2, 1=-1 but if init from givgfqext.h receives a negative

> value it can segfault.

> A first patch for this is to add tests for F.one beforehand, everywhere

> ? see attached partial patch for fflas.

> The bug becomes more important if Winograd is used as more negative

> values are needed ...

> Clément, Brice what should we do there ?

Great. Are there instructions posted somewhere on how to get all the bits and

pieces (i.e., SVN versions of the various libraries) and how to install them,

by any chance?
