Re: [blis-devel] Digest for blis-devel@googlegroups.com - 2 updates in 1 topic


Varaganti, Kiran

Jul 22, 2024, 2:50:36 AM
to blis-...@googlegroups.com

[AMD Official Use Only - AMD Internal Distribution Only]


Hi Igor Kozachenko,
Question 1: As Devin pointed out, the auxiliary blocksize should be greater than the primary blocksize. We will fix it. Thanks for pointing this out.
Question 2: For now you can ignore these thresholds; we have multiple gemm code paths to handle different sizes. We used these thresholds in the past to decide between the small and native code paths. This might require some cleanup, which we will do.

Thanks,
Kiran V

From: blis-...@googlegroups.com <blis-...@googlegroups.com>
Sent: Friday, July 19, 2024 1:38 AM
To: Digest recipients <blis-...@googlegroups.com>
Subject: [blis-devel] Digest for blis-...@googlegroups.com - 2 updates in 1 topic
 

Igor Kozachenko <ig...@berkeley.edu>: Jul 17 02:58PM -0700

*QUESTION 1. I am confused about the interpretation of the auxiliary blocksizes.*

I am reading the following digression in blis/docs/ConfigurationHowTo.md.

*Digression:* Auxiliary blocksize values for cache blocksizes are interpreted as the maximum cache blocksizes. The maximum cache blocksizes are a convenient and portable way of smoothing performance of the level-3 operations when computing with a matrix operand that is just slightly larger than a multiple of the preferred cache blocksize in that dimension. In these "edge cases," iterations run with highly sub-optimal blocking. We can address this problem by merging the "edge case" iteration with the second-to-last iteration, such that the cache blocksizes are slightly larger -- rather than significantly smaller -- than optimal. The maximum cache blocksizes allow the developer to specify the maximum size of this merged iteration; if the edge case causes the merged iteration to exceed this maximum, then the edge case is not merged and instead it is computed upon in separate (final) iteration.

From this description, it follows that the auxiliary block size is the maximum allowed sum of the optimal block size and the edge block size: if the sum fits, the last (edge) iteration is merged with the second-to-last iteration instead of running as a separate iteration. Thus, the maximum block size, i.e. the auxiliary block size, is ALWAYS larger than the optimal block size.

According to this logic, if KC_optimal = 512 and K = 1026, then the full iteration size is 512 and the edge iteration size is 2. The edge iteration merged with the second-to-last iteration would have size 514. If we set the maximum block size to 512, the merge should not happen, and the edge iteration of size 2 runs on its own. If we set the maximum block size to 640, the edge iteration is merged with the second-to-last iteration, resulting in a merged iteration of size 514.
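
To make the arithmetic above concrete, here is a minimal sketch in C of the merging rule as the digression describes it. It is not BLIS's actual code: `next_blocksize` and `count_blocks` are hypothetical names, and clamping a regular block to the remaining length is my own assumption so the sketch stays safe when the maximum is smaller than the preferred size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a BLIS function): size of the next block when
   partitioning a dimension of length `dim`, starting at offset `i`, with
   preferred blocksize `b_alg` and maximum blocksize `b_max`. */
static size_t next_blocksize(size_t i, size_t dim, size_t b_alg, size_t b_max)
{
    assert(i < dim && b_alg > 0);
    size_t left = dim - i;

    /* Everything that remains fits in one (possibly merged) block. */
    if (left <= b_max) return left;

    /* Otherwise take a regular block, clamped so it never overruns. */
    return left < b_alg ? left : b_alg;
}

/* Count the iterations needed to sweep the whole dimension and report the
   size of the final block through `last`. */
static size_t count_blocks(size_t dim, size_t b_alg, size_t b_max, size_t *last)
{
    size_t n = 0;
    for (size_t i = 0; i < dim; ) {
        *last = next_blocksize(i, dim, b_alg, b_max);
        i += *last;
        ++n;
    }
    return n;
}
```

With dim = 1026 and b_alg = 512, `count_blocks` yields two blocks (512, 514) for b_max = 640 and three blocks (512, 512, 2) for b_max = 512, matching the example above.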
 

 
In contrast, looking at blis/config/zen4/bli_cntx_init_zen4.c:
 

 
37 /*
 
38 * List of default block sizes for zen4.
 
39 * Converted it to macro as this list is used at multiple places in
this file.
 
40 */
 
41
 
42 #define BLI_CNTX_DEFAULT_BLKSZ_LIST_GENOA(blkszs) \
 
43 /* s d
c z */ \
 
44 bli_blksz_init_easy( &blkszs[ BLIS_MR ], 32, 32, 3
, 12 ); \
 
45 bli_blksz_init_easy( &blkszs[ BLIS_NR ], 12, 6, 8
, 4 ); \
 
46 bli_blksz_init_easy( &blkszs[ BLIS_MC ], 512, 128, 144
, 60 ); \
 
47 bli_blksz_init ( &blkszs[ BLIS_KC ], 480, 512, 256,
512, \
 
48 480, 320, 256,
160 ); \
 
49 bli_blksz_init_easy( &blkszs[ BLIS_NC ], 6144, 4002, 4080,
2004 ); \
 
50
\
 
51 bli_blksz_init_easy( &blkszs[ BLIS_AF ], 5, 5, -1,
-1 ); \
 
52 bli_blksz_init_easy( &blkszs[ BLIS_DF ], 8, 8, -1,
-1 ); \
 
53
 
54
 
55 #define BLI_CNTX_DEFAULT_BLKSZ_LIST_BERGAMO(blkszs) \
 
56 /* s d
c z */ \
 
57 bli_blksz_init_easy( &blkszs[ BLIS_MR ], 32, 32, 3
, 12 ); \
 
58 bli_blksz_init_easy( &blkszs[ BLIS_NR ], 12, 6, 8
, 4 ); \
 
59 bli_blksz_init_easy( &blkszs[ BLIS_MC ], 512, 64, 144
, 60 ); \
 
60 bli_blksz_init ( &blkszs[ BLIS_KC ], 480, 512, 256,
512, \
 
61 480, 320, 256,
160 ); \
 
62 bli_blksz_init_easy( &blkszs[ BLIS_NC ], 6144, 3600, 4080,
2004 ); \
 
63
\
 
64 bli_blksz_init_easy( &blkszs[ BLIS_AF ], 5, 5, -1,
-1 ); \
 
65 bli_blksz_init_easy( &blkszs[ BLIS_DF ], 8, 8, -1,
-1 ); \
 

 
one can infer from lines 47 and 48 (for double precision) that the auxiliary block size for KC most likely means “the maximum size of the last edge block that may be merged with the previous optimally sized block into a single iteration”. A last edge block whose size is larger than this maximum, i.e. the auxiliary block size, would not be merged with the previous iteration and would get its own iteration.

Could you please resolve my concern about the definition of the auxiliary blocksize?
 
*QUESTION 2. Descriptions of the SUP thresholds.*

Looking at blis/config/zen4/bli_cntx_init_zen4.

    // Initialize sup thresholds with architecture-appropriate values.
    //                                           s      d      c      z
    bli_blksz_init_easy( &thresh[ BLIS_MT ],   682,  1000,   380,   110 );
    bli_blksz_init_easy( &thresh[ BLIS_NT ],   512,  1000,   256,   128 );
    bli_blksz_init_easy( &thresh[ BLIS_KT ],   240,   220,   220,   110 );

    // Initialize level-3 sup blocksize objects with architecture-specific
    // values.
    //                                           s      d      c      z
    bli_blksz_init     ( &blkszs[ BLIS_MR ],     6,    24,     3,    12,
                                                 6,     9,     3,    12 );
    bli_blksz_init_easy( &blkszs[ BLIS_NR ],    64,     8,     8,     4 );
    bli_blksz_init_easy( &blkszs[ BLIS_MC ],   192,   144,    72,    48 );
    bli_blksz_init_easy( &blkszs[ BLIS_KC ],   512,   480,   128,    64 );
    bli_blksz_init_easy( &blkszs[ BLIS_NC ],  8064,  4080,  2040,  1020 );
 
Where can I read more about the SUP thresholds, their meaning, and their function? I suppose SUP stands for small unpacked matrices.

I also see thresholds in blis/config/zen4/bli_family_zen4.h:

    #define BLIS_ENABLE_SMALL_MATRIX
    #define BLIS_ENABLE_SMALL_MATRIX_TRSM

    // This will select the threshold below which small matrix code will be called.
    #define BLIS_SMALL_MATRIX_THRES        700
    #define BLIS_SMALL_M_RECT_MATRIX_THRES 160
    #define BLIS_SMALL_K_RECT_MATRIX_THRES 128

    #define BLIS_SMALL_MATRIX_A_THRES_M_SYRK 96
    #define BLIS_SMALL_MATRIX_A_THRES_N_SYRK 128

How do these thresholds relate to each other?
 
Thank you,

Matthews, Devin <damat...@mail.smu.edu>: Jul 17 10:49PM

Question 1: This file must come from AMD’s version of BLIS, since Zen4 support hasn’t been merged into vanilla BLIS yet. It looks like the maximum block size is being set “incorrectly”; that is, because 320 < 512, an edge case can never actually be merged. I’ll pass this on to our colleagues at AMD.
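
The "320 < 512" argument can be checked for all four datatypes with a small sketch in C. The values are copied from the quoted `bli_blksz_init( &blkszs[ BLIS_KC ], ... )` call, assuming the first row holds the default blocksizes and the second row the maximums; `merge_possible` and `count_mergeable` are hypothetical names, not BLIS functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Default and maximum KC per datatype (s, d, c, z), copied from the
   bli_blksz_init( &blkszs[ BLIS_KC ], ... ) call quoted in this thread.
   Assumption: first row = defaults, second row = maximums. */
static const size_t kc_def[4] = { 480, 512, 256, 512 };
static const size_t kc_max[4] = { 480, 320, 256, 160 };

/* Hypothetical check: a merged block has size b_def + edge with edge >= 1,
   so a merge can only ever fit if b_max is strictly larger than b_def. */
static bool merge_possible(size_t b_def, size_t b_max)
{
    return b_max > b_def;
}

/* Number of datatypes whose KC setting allows an edge-case merge. */
static int count_mergeable(void)
{
    int n = 0;
    for (int dt = 0; dt < 4; ++dt)
        if (merge_possible(kc_def[dt], kc_max[dt])) ++n;
    return n;
}
```

Under that reading, none of the four KC settings leaves room for a merged iteration, whereas a maximum such as 640 for a default of 512 would.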
 
Question 2: The AMD version of BLIS has two separate mechanisms for dealing with small matrix multiplications. The “SUP” mechanism also exists in vanilla BLIS and uses the thresholds set in bli_cntx_init_zen4. The other “small matrix” mechanism is AMD-specific and uses a different set of thresholds. From https://github.com/amd/blis/blob/7c564c74e103249b52636e6cfc5a93ba8c2b0406/frame/compat/bla_gemm_amd.c it looks like a) those macros aren’t actually used and the thresholds are hardcoded instead and b) this check only happens when dgemm is called and not bli_gemm/bli_dgemm.
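
As a rough illustration of how the sup thresholds could act as a dispatch test, here is a sketch in C using the double-precision values from the quoted `bli_cntx_init_zen4` excerpt. The rule it encodes, that a problem counts as small/skinny when any one of m, n, k is below its threshold, is an assumption to be checked against the BLIS source; `sup_thresh_is_met` here is a hypothetical stand-alone function, not the library's API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Double-precision sup thresholds from the quoted zen4 context
   (BLIS_MT, BLIS_NT, BLIS_KT). */
enum { MT_D = 1000, NT_D = 1000, KT_D = 220 };

/* Assumed rule (verify against the BLIS source): the problem is treated as
   small/skinny if ANY dimension is below its threshold. */
static bool sup_thresh_is_met(size_t m, size_t n, size_t k)
{
    return m < MT_D || n < NT_D || k < KT_D;
}
```

Under this assumption, a 2000 x 2000 x 2000 dgemm would take the regular path, while shrinking any single dimension below its threshold (for example k = 219) would select the sup path.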
 
Devin Matthews
 
From: 'Igor Kozachenko' via blis-devel <blis-...@googlegroups.com>
Date: Wednesday, July 17, 2024 at 4:59 PM
To: blis-devel <blis-...@googlegroups.com>
Subject: [blis-devel] BLIS auxiliary blocksize meaning and matrix size thresholds question
--
You received this message because you are subscribed to the Google Groups "blis-devel" group.
To unsubscribe from this group and stop receiving emails from it, send an email to blis-devel+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/blis-devel/f8babbef-ab1c-4988-ad90-338aa47b7418n%40googlegroups.com.