A question about performance when traverse the array with row-wise and column-wise

zct

unread,

Sep 29, 2019, 10:18:15 AM9/29/19

to golang-nuts

The test code is below:

package main

import (
	"testing"
)

const rowSize = 1000000
const colSize = 100

var array [rowSize][colSize]int

func BenchmarkRow(b *testing.B) {
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sum := 0
		for r := 0; r < rowSize; r++ {
			for c := 0; c < colSize; c++ {
				sum += array[r][c]
			}
		}
	}
}

func BenchmarkColumn(b *testing.B) {
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sum := 0
		for c := 0; c < colSize; c++ {
			for r := 0; r < rowSize; r++ {
				sum += array[r][c]
			}
		}
	}
}

As we known, there is a cpu cache in computer, so the row-wise should perform better than the column-wise. But the test result is :

go test -bench=. -count=5
goos: darwin
goarch: amd64
BenchmarkRow-4                30          42926367 ns/op
BenchmarkRow-4                30          50048505 ns/op
BenchmarkRow-4                32          38466153 ns/op
BenchmarkRow-4                28          40887279 ns/op
BenchmarkRow-4                30          36325967 ns/op
BenchmarkColumn-4             34          30991838 ns/op
BenchmarkColumn-4             36          30965998 ns/op
BenchmarkColumn-4             39          31575142 ns/op
BenchmarkColumn-4             33          35048352 ns/op
BenchmarkColumn-4             38          32584167 ns/op
PASS

It show that the column-wise traverse is quicker than the row-wise. I test it in my macbook and ubuntu server, the result is the same

Michael Stiller

unread,

Sep 29, 2019, 11:35:52 AM9/29/19

to zct, golang-nuts

On my machine using an intel 9880H with a L2 Cache: Unified, 256 KiB, 4-way set associative,

rows vs. columns performance is basically the same as long as the array size fits into the L2 cache.

This seems to be the case for a rowSize = colSize = 180. For slightly higher values (190) the
column bench gets slower as the row bench.

If you print the array sizes using

fmt.Println("array size:", unsafe.Sizeof(array))

you get:

180: array size: 259200

190: array size: 288800

If log the array’s element addresses to the code like this:

for r := 0; r < rowSize; r++ {
for c := 0; c < colSize; c++ {

log.Printf("rows: %x", &array[r][c])

sum += array[r][c]
}
}

and

for c := 0; c < colSize; c++ {
for r := 0; r < rowSize; r++ {

log.Printf("cols: %x", &array[r][c])

sum += array[r][c]
}
}

It seems that the “row method” looks more cache (line) friendly:

2019/09/29 17:33:59 rows: 1198400
2019/09/29 17:33:59 rows: 1198408
2019/09/29 17:33:59 rows: 1198410
2019/09/29 17:33:59 rows: 1198418
2019/09/29 17:33:59 rows: 1198420
2019/09/29 17:33:59 rows: 1198428
2019/09/29 17:33:59 rows: 1198430
…

2019/09/29 17:33:59 cols: 1198400
2019/09/29 17:33:59 cols: 1198450
2019/09/29 17:33:59 cols: 11984a0
2019/09/29 17:33:59 cols: 11984f0
2019/09/29 17:33:59 cols: 1198540
2019/09/29 17:33:59 cols: 1198590
2019/09/29 17:33:59 cols: 11985e0
2019/09/29 17:33:59 cols: 1198630
…

Cheers,

-Michael

zct

unread,

Sep 29, 2019, 10:28:12 PM9/29/19

to golang-nuts

I think in my test code, what affect the performance maybe the cpu branches misses? The row num is bigger than the column num, so in column-wise traverse, the inner loop is easier to predict.

perf result:

row-wise
     0.034014712            741,811      branch-misses             #    1.05% of all branches          (37.17%)

column-wise
     0.028383333             27,624      branch-misses             #    0.04% of all branches          (45.29%)

在 2019年9月29日星期日 UTC+8下午11:35:52，Michael Stiller写道：

lgo...@gmail.com

unread,

Oct 1, 2019, 4:52:11 PM10/1/19

to golang-nuts

Are your test results different when rowSize = colSize > 1000 say ??

Reply all

Reply to author

Forward