How to optimize?

Christian LeMoussel

unread,

Nov 15, 2017, 12:15:32 PM11/15/17

to golang-nuts

Hi,

I have this program (https://play.golang.org/p/qHPzjj2uj3) that take a decent amount of time for computing.
Let’s open pprof and see what it spent its time on.
Type: cpu
Time: Nov 15, 2017 at 5:48pm (CET)
Duration: 9.92s, Total samples = 9.78s (98.54%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) list main.main
Total: 9.78s
ROUTINE ======================== main.main in c:\Users\lemoussel\go\src\C-Libraries\main.go
      80ms      9.78s (flat, cum)   100% of Total
         .          .    170:   bufferHexHash, err := HexToUint32(hash, binary.BigEndian)
         .          .    171:   /*for i, val := range bufferHexHash {
         .          .    172:           fmt.Printf("bufferHexHash[%d]: %08x \n", i, val)
         .          .    173:   }*/
         .          .    174:   startTime = time.Now()
      20ms       20ms    175:   for i := 0; i < 10000000; i++ {
      60ms      9.76s    176:           result = findHex(bufferHexSearch, bufferHexHash)
         .          .    177:   }
         .          .    178:   fmt.Printf("findHex()               Find: %t Time : %.2f sec.\n", result, time.Now().Sub(startTime).Seconds())
         .          .    179:}

Ok, so it seems we spent most of the time in findHex, makes sense!

(pprof) list main.findHex
Total: 9.78s
ROUTINE ======================== main.findHex in c:\Users\lemoussel\go\src\C-Libraries\main.go
     9.52s      9.52s (flat, cum) 97.34% of Total
         .          .     72:   {0x00000fff, 0xfffffff0, 0x0},        // 5
         .          .     73:   {0x000000ff, 0xffffffff, 0x0},        // 6
         .          .     74:   {0x0000000f, 0xffffffff, 0xf0000000}, // 7
         .          .     75:}
         .          .     76:
      20ms       20ms     77:func findHex(bufferHexSearch [8][3]uint32, bufferHexHash []uint32) bool {
         .          .     78:   var bFind = false
         .          .     79:   var i, j, k int
         .          .     80:   var lenBufferHexHash = len(bufferHexHash)
         .          .     81:   var lenBufferHexSearch = len(bufferHexSearch)
         .          .     82:   var lenMaskHex = len(maskHex)
         .          .     83:LoopSearch:
     300ms      300ms     84:   for i = 0; i < lenBufferHexHash; i++ {
     410ms      410ms     85:           for j = 0; j < lenBufferHexSearch; j++ {
     2.53s      2.53s     86:                   for k = 0; k < lenMaskHex; k++ {
     6.16s      6.16s     87:                           if bufferHexSearch[j][0] == bufferHexHash[i]&maskHex[k][0] &&
     100ms      100ms     88:                                   bufferHexSearch[j][1] == bufferHexHash[i+1]&maskHex[k][1] {
         .          .     89:                                   bFind = true
         .          .     90:                                   break LoopSearch
         .          .     91:                           }
         .          .     92:                   }
         .          .     93:           }

Oh ! most of the time is in the line if bufferHexSearch[j][0] == bufferHexHash[i]&maskHex[k][0] &&

I don't understand why?
Should I use goroutine to compute concurrently?
I'm interested in any recommendations to optimize this program.

Christian LeMoussel

unread,

Nov 15, 2017, 1:01:03 PM11/15/17

to golang-nuts

I created benchmarks : https://play.golang.org/p/kgBsciRbpe

BenchmarkByteIndex-2             5000000               210 ns/op              64 B/op          1 allocs/op
BenchmarkByteIndexPointeur-2    20000000                90.1 ns/op             0 B/op          0 allocs/op
BenchmarkFindByte-2             10000000               181 ns/op               0 B/op          0 allocs/op
BenchmarkFindHex-2               1000000              1003 ns/op               0 B/op          0 allocs/op

the idea being that BenchmarkFindHex should be as good as BenchmarkByteIndexPointeur, because int32 comparison instead of byte

Tamás Gulácsi

unread,

Nov 15, 2017, 1:13:15 PM11/15/17

to golang-nuts

Help the compiler and put everything that can be looked up beforehand into a separate variable:

hhi0, hhi1 := bufferHexHash[i], bufferHexHash[i+1]

for j = 0; j < lenBufferHexSearch; j++ {

hsj0, hsj1 := bufferHexSearch[j][0], bufferHexSearch[j][1]

for k = 0; k < lenMaskHex; k++ {

                   if hsj0 == hhi0&maskHex[k][0] && hsj1 == hhi1&maskHex[k][1] {
                       bFind = true
                       break LoopSearch
                   }
               }
           }

Jan Mercl

unread,

Nov 15, 2017, 1:29:15 PM11/15/17

to Christian LeMoussel, golang-nuts

On Wed, Nov 15, 2017 at 6:15 PM Christian LeMoussel <cnh...@gmail.com> wrote:

> I have this program (https://play.golang.org/p/qHPzjj2uj3) that take a decent amount of time for computing.

Don't know where to start. Maybe this one: It passes ~1GB of data by value on the stack for no reason. (Unless the compiler is much smarter than I think it is.)

--

-j

Christian LeMoussel

unread,

Nov 15, 2017, 2:44:46 PM11/15/17

to golang-nuts

@Jan, Why not but I don't understand.
Because if I do

        startTime = time.Now()

        for i := 0; i < 10000000; i++ {

            result = bytes.Index([]byte(hash), []byte(toSearch)) != -1
        }
        fmt.Printf("bytes.Contains          Find: %t Time : %.2f sec.\n", result, time.Now().Sub(startTime).Seconds())

performance is better with so much data manipulated.

Christian LeMoussel

unread,

Nov 15, 2017, 2:51:15 PM11/15/17

to golang-nuts

this has a little improved performance

BenchmarkFindHex-2 2000000 954 ns/op 0 B/op 0 allocs/op
BenchmarkFindHexOpti-2 2000000 759 ns/op 0 B/op 0 allocs/op

Christian LeMoussel

unread,

Nov 15, 2017, 2:57:21 PM11/15/17

to golang-nuts

         .          .     83:LoopSearch:
      20ms       20ms     84:   for i = 0; i < lenBufferHexHash; i++ {
      90ms       90ms     85:           hhi0, hhi1 := bufferHexHash[i], bufferHexHash[i+1]
     390ms      390ms     86:           for j = 0; j < lenBufferHexSearch; j++ {
     710ms      710ms     87:                   hsj0, hsj1 := bufferHexSearch[j][0], bufferHexSearch[j][1]
     2.96s      2.96s     88:                   for k = 0; k < lenMaskHex; k++ {
         .          .     89:                           /*if bufferHexSearch[j][0] == bufferHexHash[i]&maskHex[k][0] &&
         .          .     90:                           bufferHexSearch[j][1] == bufferHexHash[i+1]&maskHex[k][1] {*/
     4.74s      4.74s     91:                           if hsj0 == hhi0&maskHex[k][0] && hsj1 == hhi1&maskHex[k][1] {
         .          .     92:                                   bFind = true
         .          .     93:                                   break LoopSearch
         .          .     94:                           }
         .          .     95:                   }
         .          .     96:           }
         .          .     97:   }

Jan Mercl

unread,

Nov 15, 2017, 3:04:05 PM11/15/17

to Christian LeMoussel, golang-nuts

On Wed, Nov 15, 2017 at 8:45 PM Christian LeMoussel <cnh...@gmail.com> wrote:

> Why not but I don't understand.

Neither do I understand, unfortunately. Additionally, the unreadable, low-contrast picture is, well, unreadable for me. Link to the playground version is perfect. Plain, black on white text without any formatting in the mail is also very good.

Anyway, I was talking about passing a 96 byte array 10 million times _by value_ (cf. the OP). Especially when it doesn't have to be passed at all. Some modern CPUs may perhaps optimize a lot of that away, but still, it's a mistake on other processors. Would one do that when codding the same logic in assembler? (Rhetoric question)

--

-j

jake...@gmail.com

unread,

Nov 16, 2017, 8:27:03 PM11/16/17

to golang-nuts

It is not entirely clear what you are trying to accomplish. I suspect effective optimization would require rethinking or refining your algorithm.

In the benchmarks you are comparing apples to oranges. BenchmarkByteIndexPointeur searches for a 10 byte pattern in a 56 byte slice. On amd64 at least, it falls through to assembly. I have not looked at the assembly, but given that there are no substantial occurrences of the pattern prefix in search slice, I would expect it to involve around 56 comparisons in the naive case. In your code, on the other hand, the hot comparison (line 87 in the original, and line 76 in the benchmark) gets called 375 times. As I said, I'm not sure what the expected behavior is, but I suspect your algorithm is sub-optimal.

Good Luck.

Reply all

Reply to author

Forward