BLIS multi-threading implementation using OpenMP


Md Mosharaf Hossain

Jan 30, 2018, 12:20:23 PM
to blis-discuss
Hi,

Recently I have been working with the BLIS library for my thesis. I looked on GitHub, but could not find the source code that automatically sets the parallelism in the five loops of BLIS's matrix multiplication algorithm. The manual way of setting parallelism seems simple. I want to know how BLIS automatically assigns OpenMP threads to the different loops around the kernels.

The multi-threading wiki page of BLIS (https://github.com/flame/blis/wiki/Multithreading) says the following:
"This causes BLIS to automatically determine a reasonable threading strategy based on what is known about your architecture"

I would appreciate it if anyone could point me to the source code for that.

Thanks and Regards,
Mosharaf
Email: mosh...@gmail.com

Devin Matthews

Jan 30, 2018, 12:32:00 PM
to blis-discuss

Hi Mosharaf,

The code which automatically determines how many threads to use at each level is at https://github.com/flame/blis/blob/1f94bb7b96eb2b67257e6c4df89e29c73e9ab386/frame/base/bli_cntx.c#L868.

Basically, it takes the total number of threads and performs a weighted partitioning based on the relative lengths of the m and n dimensions. E.g., by default a square multiplication tries to have twice as many threads along m as along n.

Then, separately for m and n, the threads are split between the register and cache loops (e.g. BLIS_IC_NT vs. BLIS_IR_NT). Threads are assigned to the register loop greedily up to a predefined maximum, since more threads at that level stresses the cache coherency protocol.
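
To give a rough idea, here is a minimal illustrative sketch of that kind of strategy (this is NOT the actual BLIS code and all names are made up; the real logic is in the bli_cntx.c file linked above):

#include <stdio.h>

/* Hypothetical sketch: factor nt threads into an m-direction count and an
   n-direction count.  The 2.0 weight below just encodes the assumed
   "twice as many threads along m" default described above. */
static void split_threads( int nt, int m, int n, int* m_nt, int* n_nt )
{
    double want = 2.0 * ( double )m / ( double )n;  /* desired m_nt : n_nt ratio */
    int    best_m = nt, best_n = 1;
    double best_score = 1e300;

    /* Try every factorization nt = i * (nt/i) and keep the one whose
       thread ratio i:j is closest to the desired ratio. */
    for ( int i = 1; i <= nt; i++ )
    {
        if ( nt % i != 0 ) continue;
        int    j     = nt / i;
        double ratio = ( ( double )i / ( double )j ) / want;
        double score = ratio + 1.0 / ratio;          /* minimized when ratio == 1 */

        if ( score < best_score ) { best_score = score; best_m = i; best_n = j; }
    }
    *m_nt = best_m;
    *n_nt = best_n;
}

/* Within the m direction, greedily give threads to the register (ir) loop
   up to some cap, and the rest to the cache (ic) loop. */
static void split_m_threads( int m_nt, int max_ir_nt, int* ic_nt, int* ir_nt )
{
    *ir_nt = 1;
    for ( int i = max_ir_nt; i >= 1; i-- )
        if ( m_nt % i == 0 ) { *ir_nt = i; break; }
    *ic_nt = m_nt / *ir_nt;
}

int main( void )
{
    int m_nt, n_nt, ic_nt, ir_nt;

    split_threads( 8, 4000, 4000, &m_nt, &n_nt );   /* -> m_nt = 4, n_nt = 2 */
    split_m_threads( m_nt, 2, &ic_nt, &ir_nt );     /* -> ic_nt = 2, ir_nt = 2 */

    printf( "m_nt = %d (ic_nt = %d x ir_nt = %d), n_nt = %d\n",
            m_nt, ic_nt, ir_nt, n_nt );
    return 0;
}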

All of the partitioning parameters are configuration-dependent and are defined in each configuration's bli_family_XXX.h file. See e.g. https://github.com/flame/blis/blob/1f94bb7b96eb2b67257e6c4df89e29c73e9ab386/frame/include/bli_kernel_macro_defs.h#L39 and https://github.com/flame/blis/blob/1f94bb7b96eb2b67257e6c4df89e29c73e9ab386/config/knl/bli_family_knl.h#L39.

Thanks,
Devin Matthews

Md Mosharaf Hossain

Jan 30, 2018, 1:33:06 PM
to blis-discuss
Hi Devin,
Thanks a lot for the prompt reply. I would be really glad if you could help me answer some other questions.

q1. When setting the threads automatically, like below: bli_thread_set_num_threads(threads_count).
I printed the values of jc_nt, ic_nt, jr_nt and ir_nt. Although I set threads_count = 15, jc_nt, ic_nt, jr_nt and ir_nt all give me 1. I don't know why. These should give a weighted assignment of threads, right?


q2. When setting the threads manually like below:

bli_thread_set_jc_nt( 1 );
bli_thread_set_ic_nt( 15 );
bli_thread_set_jr_nt( 1 );
bli_thread_set_ir_nt( 1 );

I am multiplying two double-precision matrices, roughly 250x200000 and 20000x3000, on our 20-core shared-memory system (Ivy Bridge, Intel, 2 sockets). I found that the mc x kc block for double precision on Ivy Bridge/Sandy Bridge is 96x256. So only 3 (250/96) threads should work on the three blocks of the A matrix, and the remaining threads should remain idle, right? But I found 15 threads working simultaneously, even though this is a limited-parallelism situation with m = 250 only.

Can you please explain how the threads behave in this particular situation?


Thanks and Regards,
Mosharaf

Devin Matthews

Jan 30, 2018, 1:41:56 PM
to blis-discuss
> q1. When setting the threads automatically, like below: bli_thread_set_num_threads(threads_count).
> I printed the values of jc_nt, ic_nt, jr_nt and ir_nt. Although I set threads_count = 15, jc_nt, ic_nt, jr_nt and ir_nt all give me 1. I don't know why. These should give a weighted assignment of threads, right?

What exactly do you mean by "printed ... jc_nt, ..."? Where are you printing these, and from what variables or function return values?

> q2. When setting the threads manually like below:
>
> bli_thread_set_jc_nt( 1 );
> bli_thread_set_ic_nt( 15 );
> bli_thread_set_jr_nt( 1 );
> bli_thread_set_ir_nt( 1 );
>
> I am multiplying two double-precision matrices, roughly 250x200000 and 20000x3000, on our 20-core shared-memory system (Ivy Bridge, Intel, 2 sockets). I found that the mc x kc block for double precision on Ivy Bridge/Sandy Bridge is 96x256. So only 3 (250/96) threads should work on the three blocks of the A matrix, and the remaining threads should remain idle, right? But I found 15 threads working simultaneously, even though this is a limited-parallelism situation with m = 250 only.

How are the matrices stored (row-major or column-major)? BLIS may transpose the matrices internally if it is advantageous to do so. Also, the thread partitioning happens at the granularity of the micro-kernel (mr x nr), so 8x4 instead of 96x4096 (since C is partitioned it would be mc x nc). Note that when partitioning a small dimension among many threads, the threads may all be "busy" this way but the scalability will be poor.

Thanks,
Devin

Md Mosharaf Hossain

Jan 30, 2018, 1:57:05 PM
to blis-discuss
Hi Devin,

Let me put a sample C program here that uses dsyrk_(). No transpose is used for the matrices.

/**************************
Author: mosh...@gmail.com

*/

#include "stdio.h"
#include "stdlib.h"
#include "sys/time.h"
#include "time.h"
#include "blis.h"


void initialize_matrix(double* matrix, int row, int col){
    for (int i = 0; i < row; i++){
        for (int j = 0; j < col; j++){
            matrix[j + i*col] = (double)rand()/RAND_MAX;
        }
    }
}

void initialize_matrix_with_zero(double* matrix, int row, int col){
    for (int i = 0; i < row; i++){
        for (int j = 0; j < col; j++){
            matrix[j + i*col] = 0.0;
        }
    }
}

void show_matrix(double *matrix, int row, int col){
    for (int i = 0; i < row; i++){
        for (int j = 0; j < col; j++){
            printf("%10.4f ", matrix[j * row + i]);
        }
        printf("\n");
    }
}

int main(int argc, char* argv[])
{
    if (argc < 4){
        printf("Input Error\n");
        return 1;
    }

    int threads_count = atoi(argv[1]);
    int m = atoi(argv[2]);
    int k = atoi(argv[3]);

    int sizeofa = m * k;
    int sizeofc = m * m;

    double alpha = 1.0;
    double beta = 0.0;
    struct timeval start, finish;
    double duration;

    double* A = (double*)malloc(sizeof(double) * sizeofa);
    double* C = (double*)malloc(sizeof(double) * sizeofc);

    srand((unsigned)time(NULL));

    initialize_matrix(A, m, k);
    initialize_matrix_with_zero(C, m, m);

    bli_init();

    char uplo = 'L'; // uplo = 'U';
    char trans = 'N';
    int N = m;
    int K = k;
    int LDA = m;
    int LDC = m;

    //bli_thread_set_num_threads(threads_count);

    bli_thread_set_jc_nt( 1 );             // Set BLIS_JC_NT to 1.
    bli_thread_set_ic_nt( threads_count ); // Set BLIS_IC_NT to threads_count.
    bli_thread_set_jr_nt( 1 );             // Set BLIS_JR_NT to 1.
    bli_thread_set_ir_nt( 1 );             // Set BLIS_IR_NT to 1.

    gettimeofday(&start, NULL);
    dsyrk_(&uplo, &trans, &N, &K, &alpha, A, &LDA, &beta, C, &LDC);
    gettimeofday(&finish, NULL);

    int jc_nt = bli_thread_get_jc_nt();
    int ic_nt = bli_thread_get_ic_nt();
    int jr_nt = bli_thread_get_jr_nt();
    int ir_nt = bli_thread_get_ir_nt();

    duration = ((double)(finish.tv_sec - start.tv_sec) * 1000000 + (double)(finish.tv_usec - start.tv_usec)) / 1000000;
    printf("jc_nt = %d, ic_nt = %d, jr_nt = %d, ir_nt = %d\n", jc_nt, ic_nt, jr_nt, ir_nt);
    //printf("C[0] = %lf C[1] = %lf C[2] = %lf\n", C[0], C[1], C[2]);

    FILE *fp;
    fp = fopen("blis_dsyrk.csv", "a");
    fprintf(fp, "%d,%d,%d,%lf\n", threads_count, m, k, duration);
    fclose(fp);

    free(A);
    free(C);

    bli_finalize();
    return 0;
}


Thanks and Regards,
Mosharaf


Devin Matthews

Jan 30, 2018, 2:06:12 PM
to blis-discuss
OK,

1) bli_thread_get_xx_nt() returns the current value of the environment variable BLIS_XX_NT, which is unaffected by the automatic thread assignment machinery. If you don't set it explicitly (on the command line or through bli_thread_set_xx_nt()), then it will return 1. **In this case (the variable is unset) the automatic machinery takes over**.

There is no way to get what thread assignment the automatic system used through the BLIS API, but you could edit the code I linked to print the values out.

2) I thought you were doing dgemm and that C was 250x200000, but it looks like you are doing dsyrk with m=n=250 and k=2000000? In this case it will still parallelize but as I noted above scaling will be poor. We haven't added parallelization over the k dimension for this kind of shape.

3) You are using column-major matrices which matches what the sandybridge kernel "prefers". But note that on haswell the framework would compute C^T = B^T A^T instead. Of course for dsyrk B = A^T so it is all the same, but it is something to keep in mind for dgemm.

Thanks,
Devin

Md Mosharaf Hossain

Jan 30, 2018, 5:32:55 PM
to blis-discuss
Hi Devin,

I am sorry I am asking so many questions, but this is really helping me understand the BLIS library. I am basically using DSYRK for the experiment.

As I understand it, in the situation below the micro-kernel loop has 4 threads (jr_nt = 4) to parallelize over.

bli_thread_set_jc_nt( 1 );
bli_thread_set_ic_nt( 3);
bli_thread_set_jr_nt( 4 );
bli_thread_set_ir_nt( 1 );


But I wonder why all the threads run simultaneously in the particular situation below, say with m = n = 200 and k = 2000000, where mc x kc = 96x256 and kc x nc = 256x4096.

bli_thread_set_jc_nt( 1 );
bli_thread_set_ic_nt( 15 );
bli_thread_set_jr_nt( 1 );
bli_thread_set_ir_nt( 1 );


As I am setting only one thread for jr_nt and ir_nt, the micro-kernel loops should have no opportunity to parallelize. As I understand it, in this case most of the threads should remain idle, right?


Another two things I need to know:
1) DSYRK and DSYMM internally use DGEMM, right?
2) If the matrices are not evenly partitioned by the block size, does the last thread use the same block size or an adaptive block size?

Thanks and Regards,
Mosharaf


Devin Matthews

Jan 30, 2018, 5:41:59 PM
to blis-discuss
Hi Mosharaf,

When parallelizing over ic_nt and jc_nt, BLIS doesn't take the cache block sizes (such as mc) into account at all. Rather, it partitions as evenly as possible at register-block-size granularity, and then each thread loops over cache blocks independently. In your case, each thread just gets a submatrix (much) smaller than mc, so the cache (mc) loop degenerates to a single pass.
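
Roughly, the per-thread ranges look something like this (an illustrative sketch only, with made-up names; the real partitioning code lives in the blocked variants):

#include <stdio.h>

/* Hypothetical illustration: partition the m dimension among ic_nt threads
   at register-block (mr) granularity.  Each thread gets a contiguous range
   [start, end) that is a multiple of mr, except possibly the last, which
   may end with a partial block. */
static void thread_range( int m, int mr, int ic_nt, int tid,
                          int* start, int* end )
{
    int nblocks    = ( m + mr - 1 ) / mr;      /* register blocks, incl. partial */
    int blocks_per = nblocks / ic_nt;
    int leftover   = nblocks % ic_nt;

    /* Give the first 'leftover' threads one extra block each. */
    int my_blocks = blocks_per + ( tid < leftover ? 1 : 0 );
    int my_first  = tid * blocks_per + ( tid < leftover ? tid : leftover );

    *start = my_first * mr;
    *end   = ( my_first + my_blocks ) * mr;
    if ( *end > m ) *end = m;                  /* last block may be partial */
}

int main( void )
{
    /* m = 250, mr = 8, ic_nt = 15: every thread gets a slice much smaller
       than mc = 96, so the mc loop runs only once per thread -- which is
       why all 15 threads appear busy even for small m. */
    for ( int tid = 0; tid < 15; tid++ )
    {
        int s, e;
        thread_range( 250, 8, 15, tid, &s, &e );
        printf( "thread %2d: rows %3d to %3d\n", tid, s, e );
    }
    return 0;
}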

Note that for us, the microkernel operates on the mr x nr sized block of C and has no parallelism internally. The mc x nc x kc operation is called the macro-kernel by us and inner kernel by others (Goto). You are correct that no threading is done in the macro-kernel in your case--instead each thread does a smaller macro-kernel independent of the others. Having an incomplete macro-kernel block is no problem and happens all the time.

1) dsyrk and dsymm use similar techniques and code paths as dgemm, but they are separate operations. E.g. dsyrk does roughly half as many flops as a dgemm of the same dimensions, since only half of C is updated.

2) As I mentioned above, the cache block size has no effect on thread partitioning. The register block size is used, though, and so the last thread may have an incomplete register block (zero-padding is used in this case).

Md Mosharaf Hossain

Feb 2, 2018, 1:08:14 PM
to blis-discuss
Many thanks, Devin. This is really helpful.

Md Mosharaf Hossain

Feb 5, 2018, 4:59:38 PM
to blis-discuss
Hi Devin,

Many thanks for your reply. I am having some difficulty finding the code for the loops of the gemm or syrk functions in BLIS that use OpenMP threads. I found that the paper "Anatomy of High-Performance Many-Threaded Matrix Multiplication" suggests gemm uses the five loops below.


for jc = 0, ..., n-1 in steps of nc
  for pc = 0, ..., k-1 in steps of kc
    for ic = 0, ..., m-1 in steps of mc
      for jr = 0, ..., nc-1 in steps of nr
        for ir = 0, ..., mc-1 in steps of mr
          C(ir:ir+mr-1, jr:jr+nr-1) += ...
        endfor
      endfor
    endfor
  endfor
endfor

Can you please help me find out the code for the actual gemm/syrk threaded implementation in BLIS?

Regards,
Mosharaf

Devin Matthews

Feb 5, 2018, 5:28:36 PM
to blis-discuss

> for jc = 0, ..., n-1 in steps of nc
>   for pc = 0, ..., k-1 in steps of kc
>     for ic = 0, ..., m-1 in steps of mc
>       for jr = 0, ..., nc-1 in steps of nr
>         for ir = 0, ..., mc-1 in steps of mr
>           C(ir:ir+mr-1, jr:jr+nr-1) += ...
>         endfor
>       endfor
>     endfor
>   endfor
> endfor
>
> Can you please help me find out the code for the actual gemm/syrk threaded implementation in BLIS?

You won't find anything that looks even remotely like the above code in BLIS, because BLIS uses a technique called "control trees" to compose the GEMM/SYRK/etc. algorithms. Basically, each loop in the above, plus some packing operations, plus the microkernel in the middle, are all represented as nodes in a linked list. This list (the control tree) is traversed, and for each node a piece of code is executed that partitions, packs, computes, etc. Some examples for GEMM (all of these are in frame/3/gemm); a toy sketch of the control-tree idea follows the list:

1) bli_gemm_int parses the next node on the control tree.
2) Partitioning is performed by the bli_gemm_blk_var[123] functions (m, n, and k are different variants).
3) bli_gemm_pack[ab] do the packing by calling bli_l3_packm and then recurse.
4) The microkernel and register loops are handled by the macrokernel, but I'm not actually sure where that code is. Field?
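
Here is that toy sketch, just to convey the structure (completely made-up types and names; the real nodes also carry blocksizes, packing buffers, variant function pointers, etc.):

#include <stdio.h>

/* Toy illustration of a "control tree": each node names the action to
   perform (partition along some dimension, pack, or run the macro-kernel)
   and points at the node to execute inside it.  The real BLIS nodes are
   much richer and are walked by bli_gemm_int(). */
typedef struct cntl_node
{
    const char*       label;   /* e.g. "partition n by nc" */
    struct cntl_node* sub;     /* what to do inside each partition */
} cntl_node_t;

/* Recursively "execute" the tree; here we just print the traversal. */
static void execute( const cntl_node_t* node, int depth )
{
    if ( node == NULL ) return;
    printf( "%*s%s\n", 2 * depth, "", node->label );
    execute( node->sub, depth + 1 );
}

int main( void )
{
    cntl_node_t ukr   = { "macro-kernel (jr/ir loops + micro-kernel)", NULL };
    cntl_node_t packa = { "pack block of A",                           &ukr };
    cntl_node_t ic    = { "partition m by mc",                         &packa };
    cntl_node_t packb = { "pack panel of B",                           &ic };
    cntl_node_t pc    = { "partition k by kc",                         &packb };
    cntl_node_t jc    = { "partition n by nc",                         &pc };

    execute( &jc, 0 );
    return 0;
}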

Devin

Devin Matthews

Feb 5, 2018, 5:30:54 PM
to blis-discuss
I should also add that threading isn't done with simple OpenMP pragmas. Rather, we use something we call "thread communicators", which are a lot like MPI communicators in distributed computing. This allows us to efficiently manage and synchronize nested teams of threads.
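
As a rough analogy (this is not the BLIS implementation or API, just a toy sketch of the concept): a communicator groups some number of threads and gives them their own barrier, so nested teams can synchronize independently of one another.

#include <stdio.h>
#include <omp.h>

/* Toy "thread communicator": a group of n_way threads that share a
   sense-reversing barrier.  Real communicators also support broadcasting a
   buffer pointer within the group, and a real implementation would need
   proper atomics/flushes; this only conveys the idea. */
typedef struct
{
    int          n_way;    /* number of threads in this group       */
    volatile int arrived;  /* how many have reached the barrier     */
    volatile int sense;    /* flips each time the barrier completes */
} comm_t;

static void comm_barrier( comm_t* comm )
{
    int my_sense = comm->sense;
    int n;
    #pragma omp atomic capture
    n = ++comm->arrived;

    if ( n == comm->n_way )
    {
        /* Last thread in resets the counter and releases the others. */
        comm->arrived = 0;
        comm->sense   = !my_sense;
    }
    else
    {
        while ( comm->sense == my_sense ) { /* spin */ }
    }
}

int main( void )
{
    comm_t comm = { 4, 0, 0 };

    #pragma omp parallel num_threads(4)
    {
        int tid = omp_get_thread_num();
        printf( "thread %d before barrier\n", tid );
        comm_barrier( &comm );
        printf( "thread %d after barrier\n", tid );
    }
    return 0;
}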

Devin

Field G. Van Zee

Feb 6, 2018, 2:25:52 PM
to blis-d...@googlegroups.com
Mosharaf,

For the syrk operation, the higher-level blocking (jc, pc, ic loops) is
handled by the gemm blocked variants in

frame/3/gemm/

while macrokernel-level blocking (jr and ir loops) is handled by the
herk macrokernels in

frame/3/herk/bli_herk_[lu]_ker_var2.c

I refer to "macrokernels" since we have two: one for upper-stored C
matrices and one for lower-stored matrices.

If you ignore the higher-level front-end functions, the packing
functions, as well as the internal back-end (control tree decoder)
functions, the algorithmic code path currently used is:

blk_var2 -> blk_var3 -> blk_var1 -> herk_[lu]_ker_var2

Keep in mind that we greatly simplify things in our papers so that our
readers can easily grasp the concepts, but the code in our libraries
oftentimes looks very different (usually for code
consolidation/maintenance purposes).

Field

Robert van de Geijn

Feb 7, 2018, 10:02:45 AM
to blis-d...@googlegroups.com

You may want to read


Tyler M. Smith, Robert van de Geijn, Mikhail Smelyanskiy, Jeff R. Hammond, and Field G. Van Zee. "Anatomy of High-Performance Many-Threaded Matrix Multiplication." International Parallel and Distributed Processing Symposium (IPDPS), 2014.

Robert van de Geijn

Feb 9, 2018, 11:20:19 PM
to blis-d...@googlegroups.com
Field pointed out to me that Mosharaf already found the "Anatomy"
paper.  My problem is that when I see "Anatomy" I think of what we call
the "Goto paper".