A question about performance when traverse the array with row-wise and column-wise

116 views
Skip to first unread message

zct

unread,
Sep 29, 2019, 10:18:15 AM9/29/19
to golang-nuts
The test code is below:
package main

import (
"testing"
)

const rowSize = 1000000
const colSize = 100

var array [rowSize][colSize]int

func BenchmarkRow(b *testing.B) {
b.ResetTimer()
for i := 0; i < b.N; i++ {
sum := 0
for r := 0; r < rowSize; r++ {
for c := 0; c < colSize; c++ {
sum += array[r][c]
}
}
}
}

func BenchmarkColumn(b *testing.B) {
b.ResetTimer()
for i := 0; i < b.N; i++ {
sum := 0
for c := 0; c < colSize; c++ {
for r := 0; r < rowSize; r++ {
sum += array[r][c]
}
}
}
}


As we known, there is a cpu cache in computer, so the row-wise should perform better than the column-wise. But the test result is :

go test -bench=. -count=5
goos: darwin
goarch: amd64
BenchmarkRow-4                30          42926367 ns/op
BenchmarkRow-4                30          50048505 ns/op
BenchmarkRow-4                32          38466153 ns/op
BenchmarkRow-4                28          40887279 ns/op
BenchmarkRow-4                30          36325967 ns/op
BenchmarkColumn-4             34          30991838 ns/op
BenchmarkColumn-4             36          30965998 ns/op
BenchmarkColumn-4             39          31575142 ns/op
BenchmarkColumn-4             33          35048352 ns/op
BenchmarkColumn-4             38          32584167 ns/op
PASS

It show that the column-wise traverse is quicker than the row-wise. I test it in my macbook and ubuntu server, the result is the same

Michael Stiller

unread,
Sep 29, 2019, 11:35:52 AM9/29/19
to zct, golang-nuts
On my machine using an intel 9880H with a L2 Cache: Unified, 256 KiB, 4-way set associative,

rows vs. columns performance is basically the same as long as the array size fits into the L2 cache.

This seems to be the case for a rowSize = colSize = 180. For slightly higher values (190) the
column bench gets slower as the row bench.

If you print the array sizes using

fmt.Println("array size:", unsafe.Sizeof(array))

you get:

180: array size: 259200

190: array size: 288800

If log the array’s element addresses to the code like this:

for r := 0; r < rowSize; r++ {
for c := 0; c < colSize; c++ {
log.Printf("rows: %x", &array[r][c])
sum += array[r][c]
}
}

and

for c := 0; c < colSize; c++ {
for r := 0; r < rowSize; r++ {
log.Printf("cols: %x", &array[r][c])
sum += array[r][c]
}
}

It seems that the “row method” looks more cache (line) friendly:

2019/09/29 17:33:59 rows: 1198400
2019/09/29 17:33:59 rows: 1198408
2019/09/29 17:33:59 rows: 1198410
2019/09/29 17:33:59 rows: 1198418
2019/09/29 17:33:59 rows: 1198420
2019/09/29 17:33:59 rows: 1198428
2019/09/29 17:33:59 rows: 1198430


2019/09/29 17:33:59 cols: 1198400
2019/09/29 17:33:59 cols: 1198450
2019/09/29 17:33:59 cols: 11984a0
2019/09/29 17:33:59 cols: 11984f0
2019/09/29 17:33:59 cols: 1198540
2019/09/29 17:33:59 cols: 1198590
2019/09/29 17:33:59 cols: 11985e0
2019/09/29 17:33:59 cols: 1198630


Cheers,

-Michael


zct

unread,
Sep 29, 2019, 10:28:12 PM9/29/19
to golang-nuts
I think in my test code, what affect the performance maybe the cpu branches misses? The row num is bigger than the column num, so in column-wise traverse, the inner loop is easier to predict. 

perf result:

row
-wise
     0.034014712            741,811      branch-misses             #    1.05% of all branches          (37.17%)

column-wise
     0.028383333             27,624      branch-misses             #    0.04% of all branches          (45.29%)



在 2019年9月29日星期日 UTC+8下午11:35:52,Michael Stiller写道:

lgo...@gmail.com

unread,
Oct 1, 2019, 4:52:11 PM10/1/19
to golang-nuts
Are your test results different  when rowSize = colSize    > 1000 say ??
Reply all
Reply to author
Forward
0 new messages