I have finished the first stage and started testing the new kernel. After running the test program, I got the following log:
PASS: 32x32x32 RowMajor x ColMajor -> ColMajor, SingleThreadGemm, Kernel: NEON aarch64, 12x4, depth 2, assuming 12-bit products, offsets -75/-91/74980, mult 123, shift 18
FAIL: 33x33x33 RowMajor x ColMajor -> ColMajor, SingleThreadGemm, Kernel: NEON aarch64, 12x4, depth 2, assuming 12-bit products, offsets 0/0/0, mult 1, shift 8
number of matrix entries: 1089
median value: 59
median unsigned diff: 196 (tolerating 7)
max unsigned diff: 255 (tolerating 29)
median signed diff: -196 (tolerating 2)
mean signed diff: -180 (tolerating 2)
No error: 0.00 % of entries
Error in 1..1 range: 0.00 % of entries
Error in 2..3 range: 0.00 % of entries
Error in 4..7 range: 0.00 % of entries
Error in 8..15 range: 0.00 % of entries
Error in 16..31 range: 0.55 % of entries
Error in 32..63 range: 4.13 % of entries
Error in 64..127 range: 16.71 % of entries
Error in 128..255 range: 78.60 % of entries
bad coeff: at (0, 0), expected 255, got 45
bad coeff: at (1, 0), expected 255, got 72
bad coeff: at (2, 0), expected 255, got 109
bad coeff: at (3, 0), expected 255, got 32
I added the failing 33x33x33 test case myself, because I found that the new kernel fails for operands larger than 32x32x32. I have been stuck on this for a day. I carefully reread the code but did not find any bugs, and I tried running this test case under gdb, but it was hard to dig into the asm code.
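One way the failure could be narrowed down is to grow only one dimension past 32 at a time and compare against a plain triple-loop reference. Below is a minimal sketch of that idea, not code from the library: `RunKernelUnderTest` is a hypothetical placeholder that would have to be wired to the real GEMM entry point (here it just calls the reference so the sketch compiles), and it compares raw 32-bit accumulators rather than the final 8-bit output after offsets, multiplier and shift.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

using Bytes = std::vector<std::uint8_t>;
using Accums = std::vector<std::int32_t>;

// Plain triple-loop reference. Layouts follow the log above:
// lhs is row-major (rows x depth), rhs is column-major (depth x cols),
// result is column-major (rows x cols).
Accums ReferenceGemm(const Bytes& lhs, const Bytes& rhs,
                     int rows, int depth, int cols) {
  Accums result(static_cast<std::size_t>(rows) * cols, 0);
  for (int c = 0; c < cols; ++c) {
    for (int r = 0; r < rows; ++r) {
      std::int32_t acc = 0;
      for (int d = 0; d < depth; ++d) {
        acc += static_cast<std::int32_t>(lhs[r * depth + d]) *
               static_cast<std::int32_t>(rhs[c * depth + d]);
      }
      result[c * rows + r] = acc;
    }
  }
  return result;
}

// HYPOTHETICAL placeholder: replace the body with a call into the
// kernel being debugged. It forwards to the reference only so that
// this sketch compiles and runs on its own.
Accums RunKernelUnderTest(const Bytes& lhs, const Bytes& rhs,
                          int rows, int depth, int cols) {
  return ReferenceGemm(lhs, rhs, rows, depth, cols);
}

// Fills both operands with a deterministic pattern and returns the
// number of result entries where the kernel and the reference differ.
int CountMismatches(int rows, int depth, int cols) {
  Bytes lhs(static_cast<std::size_t>(rows) * depth);
  Bytes rhs(static_cast<std::size_t>(depth) * cols);
  for (std::size_t i = 0; i < lhs.size(); ++i) {
    lhs[i] = static_cast<std::uint8_t>((i * 7 + 3) % 256);
  }
  for (std::size_t i = 0; i < rhs.size(); ++i) {
    rhs[i] = static_cast<std::uint8_t>((i * 13 + 5) % 256);
  }
  const Accums expected = ReferenceGemm(lhs, rhs, rows, depth, cols);
  const Accums actual = RunKernelUnderTest(lhs, rhs, rows, depth, cols);
  int mismatches = 0;
  for (std::size_t i = 0; i < expected.size(); ++i) {
    if (expected[i] != actual[i]) ++mismatches;
  }
  return mismatches;
}

// Tries every combination of 32 and 33 for rows, depth and cols, prints
// one line per shape, and returns how many shapes had a mismatch.
int SweepAroundBoundary() {
  const int sizes[] = {32, 33};
  int failing_shapes = 0;
  for (int rows : sizes) {
    for (int depth : sizes) {
      for (int cols : sizes) {
        const int bad = CountMismatches(rows, depth, cols);
        std::printf("%s: %dx%dx%d (%d mismatches)\n",
                    bad == 0 ? "PASS" : "FAIL", rows, depth, cols, bad);
        if (bad != 0) ++failing_shapes;
      }
    }
  }
  return failing_shapes;
}
```

Calling `SweepAroundBoundary()` from a small `main` prints eight lines. If only the shapes with a particular dimension equal to 33 fail, that would point at the handling of the leftover rows, columns, or depth slices along that dimension.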
Could you give me some idea of which part of the code could lead to this problem, so that I can focus on it?