On my machine, an Intel i9-9880H (L2 cache: unified, 256 KiB, 4-way set associative),
rows vs. columns performance is basically the same as long as the array fits into the L2 cache.
This seems to be the case up to rowSize = colSize = 180. For slightly higher values (190) the
column bench gets slower than the row bench.
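For reference, a minimal sketch of the two traversals I am comparing (the function names and the timing harness are my own; the original benchmark code is not shown here):

```go
package main

import (
	"fmt"
	"time"
)

const size = 180 // fits in a 256 KiB L2 cache; try 190 to spill out of it

var array [size][size]float64

// sumRows walks the array in memory order, row by row:
// consecutive accesses are 8 bytes apart.
func sumRows() float64 {
	var sum float64
	for r := 0; r < size; r++ {
		for c := 0; c < size; c++ {
			sum += array[r][c]
		}
	}
	return sum
}

// sumCols walks the array column by column: consecutive
// accesses stride size*8 bytes, a full row apart.
func sumCols() float64 {
	var sum float64
	for c := 0; c < size; c++ {
		for r := 0; r < size; r++ {
			sum += array[r][c]
		}
	}
	return sum
}

func main() {
	for i := range array {
		for j := range array[i] {
			array[i][j] = float64(i + j)
		}
	}
	start := time.Now()
	r := sumRows()
	fmt.Println("rows:", time.Since(start), "sum:", r)
	start = time.Now()
	c := sumCols()
	fmt.Println("cols:", time.Since(start), "sum:", c)
}
```

(For real measurements you would of course use `go test -bench` with `testing.B` rather than a one-shot timing like this.)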
If you print the array sizes using

	fmt.Println("array size:", unsafe.Sizeof(array))

you get:

180: array size: 259200
190: array size: 288800

Since 256 KiB is 262144 bytes, the 180×180 array fits into L2 while the 190×190 one does not.
If you log the array’s element addresses like this:
	for r := 0; r < rowSize; r++ {
		for c := 0; c < colSize; c++ {
			log.Printf("rows: %x", &array[r][c])
			sum += array[r][c]
		}
	}
and
	for c := 0; c < colSize; c++ {
		for r := 0; r < rowSize; r++ {
			log.Printf("cols: %x", &array[r][c])
			sum += array[r][c]
		}
	}
it becomes clear that the “row method” is more cache-line friendly: consecutive row accesses are 8 bytes apart (eight float64s share one 64-byte cache line), while consecutive column accesses are a whole row apart and land on a different cache line each time:
2019/09/29 17:33:59 rows: 1198400
2019/09/29 17:33:59 rows: 1198408
2019/09/29 17:33:59 rows: 1198410
2019/09/29 17:33:59 rows: 1198418
2019/09/29 17:33:59 rows: 1198420
2019/09/29 17:33:59 rows: 1198428
2019/09/29 17:33:59 rows: 1198430
…
2019/09/29 17:33:59 cols: 1198400
2019/09/29 17:33:59 cols: 1198450
2019/09/29 17:33:59 cols: 11984a0
2019/09/29 17:33:59 cols: 11984f0
2019/09/29 17:33:59 cols: 1198540
2019/09/29 17:33:59 cols: 1198590
2019/09/29 17:33:59 cols: 11985e0
2019/09/29 17:33:59 cols: 1198630
…
Cheers,
-Michael