Igor Kozachenko <ig...@berkeley.edu>: Jul 17 02:58PM -0700
*QUESTION 1. I am confused about the interpretation of the auxiliary blocksizes.*

I am reading the following digression in blis/docs/ConfigurationHowTo.md.

*Digression:* Auxiliary blocksize values for cache blocksizes are interpreted as the maximum cache blocksizes. The maximum cache blocksizes are a convenient and portable way of smoothing performance of the level-3 operations when computing with a matrix operand that is just slightly larger than a multiple of the preferred cache blocksize in that dimension. In these "edge cases," iterations run with highly sub-optimal blocking. We can address this problem by merging the "edge case" iteration with the second-to-last iteration, such that the cache blocksizes are slightly larger -- rather than significantly smaller -- than optimal. The maximum cache blocksizes allow the developer to specify the maximum size of this merged iteration; if the edge case causes the merged iteration to exceed this maximum, then the edge case is not merged and instead it is computed upon in separate (final) iteration.

From this description, it follows that the auxiliary block size is the maximum allowed sum of the optimal block size and the edge block size: if the sum does not exceed it, the last (edge) iteration is merged with the second-to-last iteration instead of running as a separate iteration. Thus, the maximum block size, i.e. the auxiliary block size, is ALWAYS larger than the optimal block size.

According to this logic, if KC_optimal = 512 and K = 1026, then the optimal iteration size is 512 and the edge iteration size is 2. The edge iteration merged with the second-to-last iteration would have size 514. If we set the maximum block size to 512, the merge should not happen, and a separate edge iteration of size 2 is run. If we set the maximum block size to 640, the edge iteration is merged with the second-to-last iteration, resulting in a merged iteration of size 514.
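The merging rule quoted above can be sketched in a few lines of C. This is only an illustration of one reading of the documentation, not the BLIS implementation: `next_block` and `partition` are hypothetical names, and the extra `left <= b_def` test is an added assumption that keeps the result within bounds when the maximum is set below the default.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a BLIS function: size of the block that starts
 * at offset i when partitioning a dimension of length dim with default
 * blocksize b_def and maximum blocksize b_max. */
static size_t next_block(size_t i, size_t dim, size_t b_def, size_t b_max)
{
    assert(i < dim && b_def > 0);
    size_t left = dim - i;

    /* Take everything that is left if it fits in one (possibly merged)
     * iteration. The b_def test is an assumption of this sketch: it keeps
     * the block within bounds even when b_max is smaller than b_def. */
    if (left <= b_max || left <= b_def)
        return left;

    return b_def;
}

/* Stores the block sizes in sizes[] (at most cap entries) and returns the
 * number of iterations needed to cover dim. */
static size_t partition(size_t dim, size_t b_def, size_t b_max,
                        size_t *sizes, size_t cap)
{
    size_t n = 0;
    for (size_t i = 0; i < dim; ) {
        size_t b = next_block(i, dim, b_def, b_max);
        if (n < cap)
            sizes[n] = b;
        n++;
        i += b;
    }
    return n;
}
```

Under these assumptions, `partition(1026, 512, 640, ...)` yields two iterations (512, then the merged 514), while a maximum of 512, or any maximum below the default, yields three (512, 512, 2), because a remainder of 514 never fits within the maximum.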
In contrast, looking at blis/config/zen4/bli_cntx_init_zen4.c:

    37 /*
    38  * List of default block sizes for zen4.
    39  * Converted it to macro as this list is used at multiple places in this file.
    40  */
    41
    42 #define BLI_CNTX_DEFAULT_BLKSZ_LIST_GENOA(blkszs) \
    43     /*                                          s     d     c     z */ \
    44     bli_blksz_init_easy( &blkszs[ BLIS_MR ],   32,   32,    3,   12 ); \
    45     bli_blksz_init_easy( &blkszs[ BLIS_NR ],   12,    6,    8,    4 ); \
    46     bli_blksz_init_easy( &blkszs[ BLIS_MC ],  512,  128,  144,   60 ); \
    47     bli_blksz_init     ( &blkszs[ BLIS_KC ],  480,  512,  256,  512,   \
    48                                               480,  320,  256,  160 ); \
    49     bli_blksz_init_easy( &blkszs[ BLIS_NC ], 6144, 4002, 4080, 2004 ); \
    50     \
    51     bli_blksz_init_easy( &blkszs[ BLIS_AF ],    5,    5,   -1,   -1 ); \
    52     bli_blksz_init_easy( &blkszs[ BLIS_DF ],    8,    8,   -1,   -1 ); \
    53
    54
    55 #define BLI_CNTX_DEFAULT_BLKSZ_LIST_BERGAMO(blkszs) \
    56     /*                                          s     d     c     z */ \
    57     bli_blksz_init_easy( &blkszs[ BLIS_MR ],   32,   32,    3,   12 ); \
    58     bli_blksz_init_easy( &blkszs[ BLIS_NR ],   12,    6,    8,    4 ); \
    59     bli_blksz_init_easy( &blkszs[ BLIS_MC ],  512,   64,  144,   60 ); \
    60     bli_blksz_init     ( &blkszs[ BLIS_KC ],  480,  512,  256,  512,   \
    61                                               480,  320,  256,  160 ); \
    62     bli_blksz_init_easy( &blkszs[ BLIS_NC ], 6144, 3600, 4080, 2004 ); \
    63     \
    64     bli_blksz_init_easy( &blkszs[ BLIS_AF ],    5,    5,   -1,   -1 ); \
    65     bli_blksz_init_easy( &blkszs[ BLIS_DF ],    8,    8,   -1,   -1 ); \

one can infer from lines 47 and 48 (for double precision) that the auxiliary block size for KC most likely means "the maximum size of the last edge block that may be merged with the previous optimally sized block into a single iteration". An edge block whose size exceeds the maximum block size, i.e. the auxiliary block size, would then not be merged with the previous iteration and would get its own iteration.

Could you please resolve my concern about the definition of the auxiliary blocksize?

*QUESTION 2. Descriptions of the SUP thresholds.*

Looking at blis/config/zen4/bli_cntx_init_zen4:

    // Initialize sup thresholds with architecture-appropriate values.
    //                                          s     d     c     z
    bli_blksz_init_easy( &thresh[ BLIS_MT ],  682, 1000,  380,  110 );
    bli_blksz_init_easy( &thresh[ BLIS_NT ],  512, 1000,  256,  128 );
    bli_blksz_init_easy( &thresh[ BLIS_KT ],  240,  220,  220,  110 );

    // Initialize level-3 sup blocksize objects with architecture-specific
    // values.
    //                                          s     d     c     z
    bli_blksz_init     ( &blkszs[ BLIS_MR ],    6,   24,    3,   12,
                                                6,    9,    3,   12 );
    bli_blksz_init_easy( &blkszs[ BLIS_NR ],   64,    8,    8,    4 );
    bli_blksz_init_easy( &blkszs[ BLIS_MC ],  192,  144,   72,   48 );
    bli_blksz_init_easy( &blkszs[ BLIS_KC ],  512,  480,  128,   64 );
    bli_blksz_init_easy( &blkszs[ BLIS_NC ], 8064, 4080, 2040, 1020 );

Where can I read more about the SUP thresholds, their meaning, and their function? I suppose SUP stands for small unpacked matrices.

I also see thresholds in blis/config/zen4/bli_family_zen4.h:

    #define BLIS_ENABLE_SMALL_MATRIX
    #define BLIS_ENABLE_SMALL_MATRIX_TRSM

    // This will select the threshold below which small matrix code will be called.
    #define BLIS_SMALL_MATRIX_THRES        700
    #define BLIS_SMALL_M_RECT_MATRIX_THRES 160
    #define BLIS_SMALL_K_RECT_MATRIX_THRES 128

    #define BLIS_SMALL_MATRIX_A_THRES_M_SYRK 96
    #define BLIS_SMALL_MATRIX_A_THRES_N_SYRK 128

How do these thresholds relate to each other?

Thank you,
Matthews, Devin <damat...@mail.smu.edu>: Jul 17 10:49PM
Question 1: This file must come from AMD's version of BLIS, since Zen4 support hasn't been merged into vanilla BLIS yet. It looks like the maximum block size is being set "incorrectly": because 320 < 512, an edge case can never actually be merged. I'll pass this on to our colleagues at AMD.

Question 2: The AMD version of BLIS has two separate mechanisms for dealing with small matrix multiplications. The "SUP" mechanism also exists in vanilla BLIS and uses the thresholds set in bli_cntx_init_zen4. The other "small matrix" mechanism is AMD-specific and uses a different set of thresholds. From https://github.com/amd/blis/blob/7c564c74e103249b52636e6cfc5a93ba8c2b0406/frame/compat/bla_gemm_amd.c it looks like (a) those macros aren't actually used and the thresholds are hardcoded instead, and (b) this check only happens when dgemm is called, not bli_gemm/bli_dgemm.

Devin Matthews

From: 'Igor Kozachenko' via blis-devel <blis-...@googlegroups.com>
Date: Wednesday, July 17, 2024 at 4:59 PM
To: blis-devel <blis-...@googlegroups.com>
Subject: [blis-devel] BLIS auxiliary blocksize meaning and matrix size thresholds question
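As a rough illustration of how the SUP thresholds from bli_cntx_init_zen4 might be applied, here is a minimal C sketch. It assumes the SUP path is chosen whenever any one of m, n, or k falls below its threshold for the datatype; that rule is an assumption of this sketch and should be checked against the source of the BLIS version in use, since the AMD fork may decide differently. The type `sup_thresh_t` and the function `sup_thresh_is_met` are hypothetical names, not BLIS API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical container for the MT/NT/KT thresholds of one datatype. */
typedef struct {
    size_t mt;
    size_t nt;
    size_t kt;
} sup_thresh_t;

/* Assumed rule: the problem counts as "small" (SUP candidate) if at least
 * one dimension is below its threshold. */
static bool sup_thresh_is_met(size_t m, size_t n, size_t k, sup_thresh_t t)
{
    return m < t.mt || n < t.nt || k < t.kt;
}
```

With the double-precision values quoted in this thread (MT = 1000, NT = 1000, KT = 220), a 2000 x 2000 product with k = 100 would qualify under this rule because k < 220, while one with m = n = k = 2000 would take the conventional path.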