QUESTION 1. I am confused about the interpretation of the auxiliary blocksizes.
I am reading the following digression in the blis/docs/ConfigurationHowTo.md.
Digression: Auxiliary blocksize values for cache blocksizes are interpreted as the maximum cache blocksizes. The maximum cache blocksizes are a convenient and portable way of smoothing performance of the level-3 operations when computing with a matrix operand that is just slightly larger than a multiple of the preferred cache blocksize in that dimension. In these "edge cases," iterations run with highly sub-optimal blocking. We can address this problem by merging the "edge case" iteration with the second-to-last iteration, such that the cache blocksizes are slightly larger --rather than significantly smaller -- than optimal. The maximum cache blocksizes allow the developer to specify the maximum size of this merged iteration; if the edge case causes the merged iteration to exceed this maximum, then the edge case is not merged and instead it is computed upon in separate (final) iteration.
From this description, it follows that the auxiliary blocksize is the maximum allowed sum of the optimal blocksize and the edge blocksize: as long as the sum stays within it, the last (edge) iteration is merged with the second-to-last iteration instead of running as a separate iteration. Thus, the maximum blocksize, i.e. the auxiliary blocksize, is ALWAYS at least as large as the optimal blocksize.
According to this logic, if KC_optimal = 512 and K = 1026, then the optimal iteration size is 512 and the edge iteration size is 2. Merging the edge iteration with the second-to-last iteration would give an iteration of size 514. If we set the maximum blocksize to 512, then the merge should not happen, and a separate edge iteration of size 2 is run. If we set the maximum blocksize to 640, then the edge iteration is merged with the second-to-last iteration, resulting in a merged iteration of size 514.
In contrast, looking at blis/config/zen4/bli_cntx_init_zen4.c:
37 /*
38 * List of default block sizes for zen4.
39 * Converted it to macro as this list is used at multiple places in this file.
40 */
41
42 #define BLI_CNTX_DEFAULT_BLKSZ_LIST_GENOA(blkszs) \
43 /* s d c z */ \
44 bli_blksz_init_easy( &blkszs[ BLIS_MR ], 32, 32, 3, 12 ); \
45 bli_blksz_init_easy( &blkszs[ BLIS_NR ], 12, 6, 8, 4 ); \
46 bli_blksz_init_easy( &blkszs[ BLIS_MC ], 512, 128, 144, 60 ); \
47 bli_blksz_init ( &blkszs[ BLIS_KC ], 480, 512, 256, 512, \
48 480, 320, 256, 160 ); \
49 bli_blksz_init_easy( &blkszs[ BLIS_NC ], 6144, 4002, 4080, 2004 ); \
50 \
51 bli_blksz_init_easy( &blkszs[ BLIS_AF ], 5, 5, -1, -1 ); \
52 bli_blksz_init_easy( &blkszs[ BLIS_DF ], 8, 8, -1, -1 ); \
53
54
55 #define BLI_CNTX_DEFAULT_BLKSZ_LIST_BERGAMO(blkszs) \
56 /* s d c z */ \
57 bli_blksz_init_easy( &blkszs[ BLIS_MR ], 32, 32, 3, 12 ); \
58 bli_blksz_init_easy( &blkszs[ BLIS_NR ], 12, 6, 8, 4 ); \
59 bli_blksz_init_easy( &blkszs[ BLIS_MC ], 512, 64, 144, 60 ); \
60 bli_blksz_init ( &blkszs[ BLIS_KC ], 480, 512, 256, 512, \
61 480, 320, 256, 160 ); \
62 bli_blksz_init_easy( &blkszs[ BLIS_NC ], 6144, 3600, 4080, 2004 ); \
63 \
64 bli_blksz_init_easy( &blkszs[ BLIS_AF ], 5, 5, -1, -1 ); \
65 bli_blksz_init_easy( &blkszs[ BLIS_DF ], 8, 8, -1, -1 ); \
one can infer from lines 47-48 (for double precision, where the default KC is 512 but the auxiliary value is 320) that the meaning of the auxiliary blocksize for KC is most likely "the maximum size of the last edge block that may be merged with the previous optimal-sized block into a single iteration". A last edge block whose size is larger than this maximum, i.e. the auxiliary blocksize, will not be merged with the previous iteration and will get its own iteration.
Could you please clarify the definition of the auxiliary blocksize?
QUESTION 2. Descriptions of the SUP thresholds.
Looking at blis/config/zen4/bli_cntx_init_zen4.
269 // Initialize sup thresholds with architecture-appropriate values.
270 // s d c z
271 bli_blksz_init_easy( &thresh[ BLIS_MT ], 682, 1000, 380, 110 );
272 bli_blksz_init_easy( &thresh[ BLIS_NT ], 512, 1000, 256, 128 );
273 bli_blksz_init_easy( &thresh[ BLIS_KT ], 240, 220, 220, 110 );
274
334 // Initialize level-3 sup blocksize objects with architecture-specific
335 // values.
336 // s d c z
337 bli_blksz_init ( &blkszs[ BLIS_MR ], 6, 24, 3, 12,
338 6, 9, 3, 12 );
339 bli_blksz_init_easy( &blkszs[ BLIS_NR ], 64, 8, 8, 4 );
340 bli_blksz_init_easy( &blkszs[ BLIS_MC ], 192, 144, 72, 48 );
341 bli_blksz_init_easy( &blkszs[ BLIS_KC ], 512, 480, 128, 64 );
342 bli_blksz_init_easy( &blkszs[ BLIS_NC ], 8064, 4080, 2040, 1020 );
343
Where can I read more about SUP thresholds and their meaning and function? I suppose SUP stands for small unpacked matrices.
I also see thresholds in blis/config/zen4/bli_family_zen4.h.
44 #define BLIS_ENABLE_SMALL_MATRIX
45 #define BLIS_ENABLE_SMALL_MATRIX_TRSM
46
47 // This will select the threshold below which small matrix code will be called.
48 #define BLIS_SMALL_MATRIX_THRES 700
49 #define BLIS_SMALL_M_RECT_MATRIX_THRES 160
50 #define BLIS_SMALL_K_RECT_MATRIX_THRES 128
51
52 #define BLIS_SMALL_MATRIX_A_THRES_M_SYRK 96
53 #define BLIS_SMALL_MATRIX_A_THRES_N_SYRK 128
How do these thresholds relate to each other?
Thank you,
Question 1: This file must come from AMD’s version of BLIS, since Zen4 support hasn’t been merged into vanilla BLIS yet. It looks like the maximum block size is being set “incorrectly”: because 320 < 512, an edge case can never actually be merged. I’ll pass this on to our colleagues at AMD.
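To make the arithmetic concrete, here is a minimal sketch of the merging rule as the digression describes it, not a copy of the BLIS source. The names next_blocksize and count_iterations are hypothetical helpers invented for this illustration, and the clamp to the remaining length is an added safeguard for the case where the maximum is set below the default.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a BLIS function): size of the next block when
 * partitioning a dimension of length dim, starting at offset i, with an
 * optimal blocksize b_opt and a maximum (auxiliary) blocksize b_max. */
size_t next_blocksize(size_t i, size_t dim, size_t b_opt, size_t b_max)
{
    size_t left = dim - i;

    /* If everything that remains fits within b_max, take it all at once.
     * This is what merges a small edge case into the preceding block. */
    if (left <= b_max)
        return left;

    /* Otherwise use the optimal size, clamped so that a b_max smaller
     * than b_opt cannot make a block run past the end (an assumption of
     * this sketch). */
    return (b_opt < left) ? b_opt : left;
}

/* Hypothetical helper: walk the whole dimension, return the number of
 * iterations, and store the size of the final iteration in *last. */
size_t count_iterations(size_t dim, size_t b_opt, size_t b_max, size_t *last)
{
    size_t n = 0;

    assert(b_opt > 0); /* a zero blocksize would never make progress */

    for (size_t i = 0; i < dim; n++) {
        *last = next_blocksize(i, dim, b_opt, b_max);
        i += *last;
    }
    return n;
}
```

Under this sketch, K = 1026 with an optimal blocksize of 512 and a maximum of 640 gives two iterations (512, then a merged 514). A maximum of 512 gives three iterations (512, 512, 2), and so does the zen4 double-precision maximum of 320, which is why a maximum below the default never merges anything.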
Question 2: The AMD version of BLIS has two separate mechanisms for dealing with small matrix multiplications. The “SUP” mechanism also exists in vanilla BLIS and uses the thresholds set in bli_cntx_init_zen4. The other “small matrix” mechanism is AMD-specific and uses a different set of thresholds. From https://github.com/amd/blis/blob/7c564c74e103249b52636e6cfc5a93ba8c2b0406/frame/compat/bla_gemm_amd.c it looks like a) those macros aren’t actually used and the thresholds are hardcoded instead and b) this check only happens when dgemm is called and not bli_gemm/bli_dgemm.
Devin Matthews
From: 'Igor Kozachenko' via blis-devel <blis-...@googlegroups.com>
Date: Wednesday, July 17, 2024 at 4:59 PM
To: blis-devel <blis-...@googlegroups.com>
Subject: [blis-devel] BLIS auxiliary blocksize meaning and matrix size thresholds question