Parallel echelonization

charles.b...@gmail.com

unread,

Aug 28, 2023, 4:39:00 AM8/28/23

to ffpack-devel

Dear all,

I am trying to use FFPACK to compute the row echelon form of largish matrices (dimension 150K) modulo a 16-bit prime, in parallel. I have a lots of threads to use (say 128).

I use the ModularBalanced<double> field (as suggested by JGD in 2016...) but I could use anything else.

My questions are:

0) I have to call pRowEchelonForm() and not rowEchelonForm, right?

1) should I use FfpackSlabRecursive (LUdivine) or FfpackTileRecursive?

2) I understand that I can provide a num_threads argument. Should I give 0 (I assume it means "do whatever is best")?

3) The underlying BLAS is OpenBLAS. It can be multi-threaded. Should I try to have a multi-threaded BLAS or a sequential one?

(I am doing this because I am trying to obtain a "usable" version of [SpaSM](https://github.com/cbouilla/spasm), in particular code that can echelonize a sparse matrix while keeping the result sparse; this leads to a "switch-to-dense" strategy at the end of the factorization, hence the use of FFLAS-FFPACK; it will become a hard dependency for SpaSM; I keep being contacted by various people who want sparse kernel basis of largish sparse matrices, of size, say, 1M, and it is sometimes feasible).

Cheers,

---

Charles Bouillaguet

Clement Pernet

unread,

Aug 29, 2023, 12:46:42 PM8/29/23

to ffpack...@googlegroups.com, charles.b...@gmail.com

Hi Charles,

Le 28/08/2023 à 10:39, charles.b...@gmail.com a écrit :
> Dear all,
>
> I am trying to use FFPACK to compute the row echelon form of largish matrices (dimension 150K)
> modulo a 16-bit prime, in parallel. I have a lots of threads to use (say 128).
>
> I use the ModularBalanced<double> field (as suggested by JGD in 2016...) but I could use anything else.

This is fine. Modular<double> could be another good choice, but likely with similar timings.

>
> My questions are:
> 0) I have to call pRowEchelonForm() and not rowEchelonForm, right?

Yes.

> 1) should I use FfpackSlabRecursive (LUdivine) or FfpackTileRecursive?

The SlabRecursive variant is sequential only, while the TileRecursive allow for parallelization.

> 2) I understand that I can provide a num_threads argument. Should I give 0 (I assume it means "do
> whatever is best")?

Quite right. numthreads=0 sets the max number of ffpack threads to the current OPENMP NUM_THREADS.

> 3) The underlying BLAS is OpenBLAS. It can be multi-threaded. Should I try to have a multi-threaded
> BLAS or a sequential one?

By default, fflas-ffpack forces the BLAS to run on only 1 thread, even if the BLAS is compiled using
multithreaded (in order to manage the parallism at a higher level).

Yet if you want to run fflas-ffpack using the multithreaded OpenBLAS, you can set the macro
__FFLASFFPACK_OPENBLAS_NUM_THREADS
to the number of threads you want for OpenBLAS.

This can be automated by running configure with option
--with-openblas-num-threads=<num-threads>

You should then add in your main:

#ifdef __FFLASFFPACK_OPENBLAS_NUM_THREADS
openblas_set_num_threads(__FFLASFFPACK_OPENBLAS_NUM_THREADS);
#endif

and

#define __FFLASFFPACK_OPENBLAS_NT_ALREADY_SET 1
before all other includes.

(see benchmarks/benchmark-pluq.C for instance)

>
> (I am doing this because I am trying to obtain a "usable" version of
> [SpaSM](https://github.com/cbouilla/spasm), in particular code that can echelonize a sparse matrix
> while keeping the result sparse; this leads to a "switch-to-dense" strategy at the end of the
> factorization, hence the use of FFLAS-FFPACK; it will become a hard dependency for SpaSM; I keep
> being contacted by various people who want sparse kernel basis of largish sparse matrices, of size,
> say, 1M, and it is sometimes feasible).
>

Great! Let us know how this works and feel free to ask for further assistance if needed.

Btw: I presume that you saw the boolean option "transform" which allow you to avoid computing the
transformation matrix if you don't need it.

Best.

Clément

Charles Bouillaguet

unread,

Sep 4, 2023, 10:59:16 AM9/4/23

to Clement Pernet, ffpack...@googlegroups.com

On 8/29/23 18:46, Clement Pernet wrote:
>> I am trying to use FFPACK to compute the row echelon form of largish
>> matrices (dimension 150K) modulo a 16-bit prime, in parallel. I have
>> a lots of threads to use (say 128).

I have further questions...

I am running :

size_t rank = FFPACK::pRowEchelonForm(GFp, n, m, A, lda, P, Qt,
false, 0, FFPACK::FfpackTileRecursive);

I can't really figure out the output format. In particular, I assumed
that Qt[j] would give the row where the pivot is on column j, but it
does not.

For instance, running it on a 4x4 matrix yields:

# P[0] = 0
# P[1] = 1
# P[2] = 2
# P[3] = 3

# Qt[0] = 0
# Qt[1] = 2
# Qt[2] = 3
# Qt[3] = 3

So Qt is not even a permutation, or I misunderstand its representation.

(not that with FfpackSlabRecursive I have no problems).

Best regards,

--
Charles Bouillaguet

Jean-Guillaume Dumas

unread,

Sep 4, 2023, 11:13:56 AM9/4/23

to ffpack...@googlegroups.com, Charles Bouillaguet, Clement Pernet

Dear Charles, QT is given in LAPACK format (@param Qt the column
position of the pivots in the echelon form)

This means that, sequentially, for 1 <= i <= min(m,n), column[i] was
interchanged with column Qt[i]

That is in your case, i.i.a.n.m., the mathematical permutation Qt is :

1000

0001

0100

0010

Conversions routines are provided if needed, like LAPACKPerm2MathPerm ;
MathPerm2LAPACKPerm ; as well as applicators with the different formats,
see ffpack/ffpack.h ...

Yours,

--
Jean-Guillaume Dumas.
____________________________________________________________________
Jean-Guill...@univ-grenoble-alpes.fr Tél.: +33 457 421 732
Directeur, Laboratoire Jean Kuntzmann. Bât. IMAG, bureau 115
Mathématiques Appliquées et Informatique, Université Grenoble Alpes.
150 place du Torrent. IMAG, CS 40700, 38058 GRENOBLE cedex 9, FRANCE
http://membres-ljk.imag.fr/Jean-Guillaume.Dumas
____________________________________________________________________

Charles Bouillaguet

unread,

Sep 5, 2023, 5:34:47 AM9/5/23

to Jean-Guillaume Dumas, ffpack...@googlegroups.com, Clement Pernet

Hello,

On 9/4/23 17:13, Jean-Guillaume Dumas wrote:
> On 04/09/2023 16:59, Charles Bouillaguet wrote:
>> size_t rank = FFPACK::pRowEchelonForm(GFp, n, m, A, lda, P, Qt,
>> false, 0, FFPACK::FfpackTileRecursive);

I have managed to read the output now :-).

One more thing. Running pReducedRowEchelonForm() on a matrix of size
160K x 160K, of rank about 120K takes about 3h using 32 cores. However,
I observe a significant sequential section at the end of the computation
(e.g. only one thread for something like, say, 2 hours).

Is there any way I could avoir that? I could potentially use
pRowEchelonForm() (e.g. non-reduced).

Thanks,

--

Charles Bouillaguet

Reply all

Reply to author

Forward