SUPERMATRIX parallelism


Roberto Corradini

Apr 25, 2020, 12:50:23 PM
to libflame...@googlegroups.com, rob.co...@gmail.com
Dear libflame discussion group,

I have libflame and BLIS compiled and installed, and BLIS low-level
parallelism works fine.

Is there a recommended practice for using SUPERMATRIX high-level
parallelism on "plain flat" FLAME objects? Or is it required to have
hierarchical FLASH objects in order to run functions concurrently?

I then have a second question, about running the function:

FLASH_Obj_create_hier_copy_of_flat(...)

Whether it works properly depends on the block size parameter b_mn.
With small values of the parameter it works correctly, and the Cholesky
factorization (FLASH_Chol()) then runs fine and in parallel, with
performance comparable to my own OpenMP implementation based on tasks
calling BLAS functions linked against the BLIS library.
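
For context, the essential shape of what I am running is roughly the
following. This is a stripped-down sketch, not my real test code; the
lower-triangular choice, the thread count, and the column-major storage
are illustrative assumptions on my side:

#include "FLAME.h"

/* buf: an existing flat, column-major buffer holding an n x n SPD matrix */
void chol_via_flash(double *buf, dim_t n, dim_t b_mn)
{
  FLA_Obj A, AH;

  FLA_Init();

  /* wrap the existing flat buffer in a flat FLAME object */
  FLA_Obj_create_without_buffer(FLA_DOUBLE, n, n, &A);
  FLA_Obj_attach_buffer(buf, 1, n, &A);

  /* copy it into a one-level hierarchical (FLASH) object */
  FLASH_Obj_create_hier_copy_of_flat(A, 1, &b_mn, &AH);

  /* let SuperMatrix build and execute the task DAG */
  FLASH_Queue_set_num_threads(4);            /* illustrative value */
  FLASH_Chol(FLA_LOWER_TRIANGULAR, AH);

  /* copy the factor back to flat storage as needed, then clean up */
  FLASH_Obj_free(&AH);
  FLA_Obj_free_without_buffer(&A);

  FLA_Finalize();
}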

When b_mn grows larger (256 in my case is fine, 512 doesn't work) I
get a segmentation fault. Opening the core file with GDB I get:

(gdb) bt
#0  0x000055cdeee5d93e in FLA_Obj_elemtype (obj=...) at
src/base/flamec/main/FLA_Query.c:56
#1  0x000055cdeef40ebf in FLASH_Obj_adjust_views_hierarchy
(attach_buffer=attach_buffer@entry=1, offm=offm@entry=0,
offn=offn@entry=0, m=m@entry=768, n=n@entry=548, A=...,
S=S@entry=0x55cdefb3dab8) at src/base/flamec/hierarchy/main/FLASH_View.c:292
#2  0x000055cdeef41ac3 in FLASH_Obj_adjust_views_hierarchy
(attach_buffer=attach_buffer@entry=1, offm=offm@entry=0,
offn=offn@entry=0, m=m@entry=5156, n=n@entry=5156, A=...,
S=S@entry=0x7fff3a414da0) at src/base/flamec/hierarchy/main/FLASH_View.c:530
#3  0x000055cdeef43e14 in FLASH_Obj_adjust_views (S=0x7fff3a414da0,
A=..., n=<optimized out>, m=5156, offn=0, offm=0, attach_buffer=1) at
src/base/flamec/hierarchy/main/FLASH_View.c:278
#4  FLASH_Part_create_2x2 (A=..., ATL=ATL@entry=0x7fff3a414ce0,
ATR=ATR@entry=0x7fff3a414d20, ABL=ABL@entry=0x7fff3a414d60,
ABR=ABR@entry=0x7fff3a414da0, n_rows=<optimized out>, n_rows@entry=0,
n_cols=<optimized out>, n_cols@entry=0, side=11) at
src/base/flamec/hierarchy/main/FLASH_View.c:266
#5  0x000055cdeef4075c in FLASH_Copy_flat_to_hier (F=..., i=i@entry=0,
j=j@entry=0, H=...) at src/base/flamec/hierarchy/main/FLASH_Copy_other.c:92
#6  0x000055cdeee4ffc1 in FLASH_Obj_create_hier_copy_of_flat (F=...,
depth=1, b_mn=0x7fff3a4150b8, H=0x7fff3a415220) at
src/base/flamec/hierarchy/main/FLASH_Obj.c:601
#7  0x000055cdeee1d82a in aux_perf_sdf_flame_t (t=0x55cdefb05610,
test_data_file_name=<optimized out>) at test/ut_flame.c:312
#8  0x000055cdeee450ea in ut_suite_run (s=0x55cdefb052c0) at
src/unit_test.c:392
#9  0x000055cdeee1c6b8 in main (argc=<optimized out>, argv=<optimized
out>) at test/ut_flame.c:1055
(gdb) p obj.base
$1 = (FLA_Base_obj *) 0x300
(gdb) p obj.base->elemtype
Cannot access memory at address 0x304
(gdb) p obj
$2 = {offm = 160, offn = 648540061797, m = 768, n = 768, m_inner = 1,
n_inner = 768, base = 0x300}

Do you have any suggestions?

Thank you,

Roberto

rob.co...@gmail.com

Apr 26, 2020, 2:08:37 AM
to libflame-discuss
I have done some further investigation.
My test set is composed of three symmetric positive definite real matrices of sizes 3002, 5156, and 24046.
I am testing block sizes of 256, 512, 768, and 1024:
Matrix size 3002 works fine with block sizes 256, 512, 768, and 1024.
Matrix size 5156 works fine with block sizes 256, 512, and 1024, and fails with 768.
Matrix size 24046 works fine with block sizes 256 and 1024, and fails with 512 and 768.

Roberto

rob.co...@gmail.com

Apr 26, 2020, 11:18:27 AM
to libflame-discuss
Using the smallest matrix (3002x3002) I have prepared a shorter test program and tested block sizes in the range 1-1024:

1:3 OK
4 FAILS
5:12 OK
13:14 FAIL
15:20 OK
21 FAILS
22:37 OK
38 FAILS
39:47 OK
48 FAILS
49:54 OK
55 FAILS
56:63 OK
64:65 FAIL
66:76 OK
77:78 FAIL
79:96 OK
97:100 FAIL
101:130 OK
131:136 FAIL
137:200 OK
201:214 FAIL
215:428 OK
429:500 FAIL
501:1024 OK

This is the test code:

/* debug function FLASH_Obj_create_hier_copy_of_flat */
static void
aux_debug_flame_t (ut_test_t *const t,
                   const char test_data_file_name[])
{
  int ret;
  size_t nr, nc;
  double **a;

  FLA_Error fla_ret;
  FLA_Obj A, B;
  dim_t bs;

  printf("\n");

  /*
   * STEP (0) - Retrieve from storage the A matrix
   */
  a = lial_retrieve_matrix(test_data_file_name, &nr, &nc, &ret);
  if (ret != 0 || !a || (nr != nc)) {
    printf("Unable to read properly matrix a from file: %s\n", test_data_file_name);
    ut_assert(t, false);
  }  else {
    printf("Reading from storage SDF matrix of size : %zu, from file \"%s\"\n", nr, test_data_file_name);
  }


  FLA_Init();

  fla_ret = FLA_Obj_create_without_buffer(FLA_DOUBLE, nr, nr, &A);
  ut_assert(t, fla_ret == FLA_SUCCESS);

  fla_ret = FLA_Obj_attach_buffer(*a, nr, 1, &A);
  ut_assert(t, fla_ret == FLA_SUCCESS);

  for (size_t i = 445; i < 1024; i++) {
    printf("bs=%zu\n", i);
    bs = i;
    fla_ret = FLASH_Obj_create_hier_copy_of_flat(A, 1, &bs, &B);
    ut_assert(t, fla_ret == FLA_SUCCESS);

    /* --- Cholesky factorization , solution , ... --- */

    FLASH_Obj_free(&B);
  }

  fla_ret = FLA_Obj_free_without_buffer(&A);
  ut_assert(t, fla_ret == FLA_SUCCESS);

  FLA_Finalize();


  lial_free_matrix(a, nr);

  printf("\n");
}

rob.co...@gmail.com

Apr 27, 2020, 12:03:47 PM
to libflame-discuss
I have observed that, when the software fails, the value of the cs field in the base object is larger by 1 than the m and n values.
Here is the most interesting debugging info.

In function FLASH_Part_create_2x2:

bs=200 - OK
(gdb) p *A.base
$41 = {datatype = 101, elemtype = 150, m = 16, n = 16, rs = 1, cs = 16, m_inner = 3002, n_inner = 3002, id = 93824999846208, m_index = 0, n_index = 0, n_elem_alloc = 256, buffer = 0x555555c97200, buffer_info = 0, uplo = 0, n_read_blocks = 0, n_write_blocks = 0, n_read_tasks = 0, read_task_head = 0x0, read_task_tail = 0x0, write_task = 0x0}

bs=201 - FAILS
(gdb) p *A.base
$42 = {datatype = 101, elemtype = 150, m = 15, n = 15, rs = 1, cs = 16, m_inner = 3002, n_inner = 3002, id = 93824999846208, m_index = 0, n_index = 0, n_elem_alloc = 240, buffer = 0x555555c97200, buffer_info = 0, uplo = 0, n_read_blocks = 0, n_write_blocks = 0, n_read_tasks = 0, read_task_head = 0x0, read_task_tail = 0x0, write_task = 0x0}
.
.
.
bs=214 - FAILS
(gdb) p *A.base
$43 = {datatype = 101, elemtype = 150, m = 15, n = 15, rs = 1, cs = 16, m_inner = 3002, n_inner = 3002, id = 93824999846208, m_index = 0, n_index = 0, n_elem_alloc = 240, buffer = 0x555555c97200, buffer_info = 0, uplo = 0, n_read_blocks = 0, n_write_blocks = 0, n_read_tasks = 0, read_task_head = 0x0, read_task_tail = 0x0, write_task = 0x0}

bs=215 - OK
(gdb) p *A.base
$44 = {datatype = 101, elemtype = 150, m = 14, n = 14, rs = 1, cs = 14, m_inner = 3002, n_inner = 3002, id = 93824999846208, m_index = 0, n_index = 0, n_elem_alloc = 196, buffer = 0x555555c97200, buffer_info = 0, uplo = 0, n_read_blocks = 0, n_write_blocks = 0, n_read_tasks = 0, read_task_head = 0x0, read_task_tail = 0x0, write_task = 0x0}



bs=428 - OK
(gdb) p *A.base
$36 = {datatype = 101, elemtype = 150, m = 8, n = 8, rs = 1, cs = 8, m_inner = 3002, n_inner = 3002, id = 93824999846208, m_index = 0, n_index = 0, n_elem_alloc = 64, buffer = 0x555555c97200, buffer_info = 0, uplo = 0, n_read_blocks = 0, n_write_blocks = 0, n_read_tasks = 0, read_task_head = 0x0, read_task_tail = 0x0, write_task = 0x0}

bs=429 - FAILS
(gdb) p *A.base
$35 = {datatype = 101, elemtype = 150, m = 7, n = 7, rs = 1, cs = 8, m_inner = 3002, n_inner = 3002, id = 93824999846208, m_index = 0, n_index = 0, n_elem_alloc = 56, buffer = 0x555555c97200, buffer_info = 0, uplo = 0, n_read_blocks = 0, n_write_blocks = 0, n_read_tasks = 0, read_task_head = 0x0, read_task_tail = 0x0, write_task = 0x0}
.
.
.
bs=500 - FAILS
(gdb) p *A.base
$39 = {datatype = 101, elemtype = 150, m = 7, n = 7, rs = 1, cs = 8, m_inner = 3002, n_inner = 3002, id = 93824999846208, m_index = 0, n_index = 0, n_elem_alloc = 56, buffer = 0x555555c97200, buffer_info = 0, uplo = 0, n_read_blocks = 0, n_write_blocks = 0, n_read_tasks = 0, read_task_head = 0x0, read_task_tail = 0x0, write_task = 0x0}

bs=501 - OK
(gdb) p *A.base
$40 = {datatype = 101, elemtype = 150, m = 6, n = 6, rs = 1, cs = 6, m_inner = 3002, n_inner = 3002, id = 93824999846208, m_index = 0, n_index = 0, n_elem_alloc = 36, buffer = 0x555555c97200, buffer_info = 0, uplo = 0, n_read_blocks = 0, n_write_blocks = 0, n_read_tasks = 0, read_task_head = 0x0, read_task_tail = 0x0, write_task = 0x0}
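
As a cross-check, the m and n values in these dumps are simply the ceiling of 3002 divided by the block size, and the failures are exactly the cases where cs comes out one larger than that count. A throwaway snippet like the one below (purely illustrative, not part of my test suite) reproduces the block counts at the boundary block sizes; it prints 16, 15, 15, 14, 8, 7, 7, 6, matching the m values above:

#include <stdio.h>

int main(void)
{
  const unsigned long n = 3002;
  /* boundary block sizes observed above */
  const unsigned long bs[] = { 200, 201, 214, 215, 428, 429, 500, 501 };

  for (unsigned long i = 0; i < sizeof bs / sizeof bs[0]; i++) {
    unsigned long m = (n + bs[i] - 1) / bs[i];    /* ceil(n / bs) */
    printf("bs=%lu -> %lu x %lu blocks\n", bs[i], m, m);
  }
  return 0;
}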

Roberto

rob.co...@gmail.com

Apr 27, 2020, 12:44:11 PM
to libflame-discuss
While debugging the code I began to suspect that the issue is related to alignment.
I have reconfigured the library, removing the options:
--enable-memory-alignment=64 --enable-ldim-alignment

With this change the test code runs fine for all block_size values in the range [1..1024]. GREAT!!!

Debugging the case with block size 429, which was among the failing ones, I spotted a relevant difference:

bs=429 - FAILS ( --enable-memory-alignment=64 --enable-ldim-alignment )
(gdb) p *A.base
$35 = {datatype = 101, elemtype = 150, m = 7, n = 7, rs = 1, cs = 8, m_inner = 3002, n_inner = 3002, id = 93824999846208, m_index = 0, n_index = 0, n_elem_alloc = 56, buffer = 0x555555c97200, buffer_info = 0, uplo = 0, n_read_blocks = 0, n_write_blocks = 0, n_read_tasks = 0, read_task_head = 0x0, read_task_tail = 0x0, write_task = 0x0}
bs=429 - OK ( with the alignment options removed )
(gdb) p *A.base
$4  = {datatype = 101, elemtype = 150, m = 7, n = 7, rs = 1, cs = 7, m_inner = 3002, n_inner = 3002, id = 93824999803232, m_index = 0, n_index = 0, n_elem_alloc = 49, buffer = 0x555555c8ca00, buffer_info = 0, uplo = 0, n_read_blocks = 0, n_write_blocks = 0, n_read_tasks = 0, read_task_head = 0x0, read_task_tail = 0x0, write_task = 0x0}

Maybe, just a guess, the alignment code affects not only the leaf buffers in the hierarchy but also the generation of the internal nodes.

Roberto

Field G. Van Zee

Apr 29, 2020, 1:25:18 PM
to libflame...@googlegroups.com
Roberto,

Sorry for the delayed response, and thanks for sharing the results of
your debugging. It does indeed look like there is something going on
with the object buffer alignments. Those options are disabled by default
(even if they are currently "enabled" in my run-configure.sh convenience
script), so it's possible that the FLASH code was never properly
exercised with these options. Although, it's also possible there is
something more subtle going on here.

I wish I could dig into this further right now, but I have other more
pressing priorities, especially since you seem to have a workaround.

Also, in the future, you may wish to consider using the Issues feature
on GitHub to report bugs [1]. This may reach more people who are
able/willing to assist.

Regards,
Field

[1] https://github.com/flame/libflame/issues

Roberto Corradini

Apr 29, 2020, 1:51:09 PM
to Field G. Van Zee, libflame...@googlegroups.com
Thank you Field,

So far the workaround holds, so I am fine!
Next time I will post on GitHub.

I understand that the FLASH storage design adopted by SuperMatrix is also suited to bulk data transfers between distributed servers, accelerators, and GPUs, and I was wondering why not the new NVMe flash storage as well.

Since I adopted the BLIS library, CPU time is no longer my constraint; my computations are limited by the amount of RAM. So I was planning to avoid the Hessian computation and investigate limited-memory quasi-Newton techniques.
But I am also starting to consider a different approach: keep my current algorithm and develop a task-based Cholesky solver that reads and writes blocks of data from and to the NVMe storage (roughly the task structure sketched below). With proper prefetching, and by amortizing the reads over several block operations, maybe it is possible to keep the cores fully busy.
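
To make the idea a bit more concrete, this is roughly the task decomposition I have in mind, written as a plain in-memory sketch with no NVMe I/O yet. It is illustrative only: the tile layout, the helper name tiled_chol_lower, and the LAPACKE/CBLAS per-tile calls (with BLIS providing the BLAS underneath) are assumptions of mine, not validated code:

#include <cblas.h>
#include <lapacke.h>

/* Tiled right-looking Cholesky (lower), task-parallel with OpenMP.
 * Assumptions: the matrix is stored as T*T pointers to b-by-b
 * column-major tiles, tile (i,j) at A[i + j*T], with n = T*b exactly;
 * only the lower tiles (j <= i) are touched. */
static void tiled_chol_lower(double **A, int T, int b)
{
  #pragma omp parallel
  #pragma omp single
  {
    for (int k = 0; k < T; k++) {

      /* factor the diagonal tile: A(k,k) = L_kk * L_kk^T */
      #pragma omp task depend(inout: A[k + k*T][0])
      { LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', b, A[k + k*T], b); }

      /* triangular solves down panel k: A(i,k) <- A(i,k) * L_kk^-T */
      for (int i = k + 1; i < T; i++) {
        #pragma omp task depend(in: A[k + k*T][0]) depend(inout: A[i + k*T][0])
        { cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                      CblasNonUnit, b, b, 1.0, A[k + k*T], b, A[i + k*T], b); }
      }

      /* trailing matrix update */
      for (int i = k + 1; i < T; i++) {
        /* diagonal tile: A(i,i) -= A(i,k) * A(i,k)^T */
        #pragma omp task depend(in: A[i + k*T][0]) depend(inout: A[i + i*T][0])
        { cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                      b, b, -1.0, A[i + k*T], b, 1.0, A[i + i*T], b); }

        /* off-diagonal tiles: A(i,j) -= A(i,k) * A(j,k)^T */
        for (int j = k + 1; j < i; j++) {
          #pragma omp task depend(in: A[i + k*T][0], A[j + k*T][0]) \
                           depend(inout: A[i + j*T][0])
          { cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                        b, b, b, -1.0, A[i + k*T], b, A[j + k*T], b,
                        1.0, A[i + j*T], b); }
        }
      }
    }
  } /* barrier at the end of the single region waits for all tasks */
}

The NVMe reads and writes would then become additional tasks with their own dependences: fetch a tile before the first task that consumes it, and write it back after the last task that updates it.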

Too much wondering for now! It will take many months to accomplish what I am doing before I can start to experiment with these ideas.

Anyhow, thank you again for your time.

Roberto
