SFU ID: 301294664
Github username: trevorbonas
Github link: g...@github.com:CMPT-295-SFU/assignment-4-trevorbonas.git
Line and file: Lines 25-38, trans.c
Expected behavior: 5 points for transposing all three matrices
Observed behavior: 4.8 for 32x32, 0 for 64x64, and 4.9 for 61x67
Question:
Do I have this right?:
The problem with transposing A into B with trans (the given baseline transpose function) is the access pattern. A is iterated row by row, that is A[row][0:MAX], so its elements are accessed in the same order they are laid out in memory. If the cache block size is 32 bytes, then iterating through A with stride 1 causes a miss only once every 8 ints: A[0][0] misses and brings A[0][0:7] into the cache, the accesses up to A[0][7] all hit, then A[0][8] misses and brings in A[0][8:15], and so on.
But for B it's very different. For every element read from a row of A, an element of a column of B is written, so B is walked through memory with a stride of MAX ints. Every element down a column of B misses, and when the next column is accessed it misses on every element as well.
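Roughly, I picture the baseline doing something like this (a minimal sketch in my own words; trans_baseline, MAX, and the exact parameter list are my placeholders, not the actual starter code):

/* Baseline transpose as I understand it: A is read row by row (stride 1),
 * so it misses only once per cache line; B is written down a column
 * (stride MAX ints), so every write to B misses. */
void trans_baseline(int MAX, int A[MAX][MAX], int B[MAX][MAX])
{
    for (int i = 0; i < MAX; i++) {        /* walk a row of A */
        for (int j = 0; j < MAX; j++) {    /* walk a column of B */
            B[j][i] = A[i][j];
        }
    }
}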
The solution presented in the Intel post, as I understand it, is to take advantage of the fact that every time an element of a column of B is accessed (say B[2][0]), that element is brought into the cache along with 7 neighbouring ints (B[2][0:7]). trans never exploits this: it continues down the column and moves on to the next column without ever touching those cached ints. So we "chunk" (block) the loops: instead of going down a column all the way to MAX, we only go size_of_block elements before moving to the next column, hitting the ints that are already in the cache.
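What I tried is essentially this kind of loop blocking, something like the sketch below (again my own paraphrase; BSIZE and trans_blocked are placeholder names, and it assumes MAX is a multiple of the block size):

#define BSIZE 8   /* 8 ints * 4 bytes = one 32-byte cache block (my assumption) */

/* Blocked ("chunked") transpose sketch: the two outer loops pick a
 * BSIZE x BSIZE tile, the two inner loops transpose just that tile,
 * so the lines of B pulled in by the first column access are reused
 * before they get evicted. */
void trans_blocked(int MAX, int A[MAX][MAX], int B[MAX][MAX])
{
    for (int ii = 0; ii < MAX; ii += BSIZE) {
        for (int jj = 0; jj < MAX; jj += BSIZE) {
            for (int i = ii; i < ii + BSIZE; i++) {
                for (int j = jj; j < jj + BSIZE; j++) {
                    B[j][i] = A[i][j];
                }
            }
        }
    }
}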
I've tried to do exactly this, but for some reason it doesn't work at all for 64x64.