What explains this 6x slowdown ?

262 views
Skip to first unread message

Isaac Gouy

unread,
Mar 25, 2021, 3:02:52 PM3/25/21
to golang-nuts
(
Yeah, it's a tiny tiny program from someone's language comparison but this is a hugely bigger slowdown than gcc or java?

The code change is compile time constant —
    array_length int = 100000000   

— changed to run time value —
    array_length int = 1000000 * iterations
)


package main
import "os"
import "fmt"
import "strconv"
func main() {
    var (
       element, iteration, iterations, innerloop int
       sum float64
    )

    if len(os.Args) > 1 {
        iterations,_ = strconv.Atoi(os.Args[1])
    }
    fmt.Printf("iterations %d\n", iterations)
    
    var (
        array_length int = 100000000    
        array []float64 = make([]float64, array_length)
    )    
    
    for element = 0; element < array_length; element++ {
        array[element] = float64(element)
    }
    for iteration = 0; iteration < iterations; iteration++ {
        for innerloop = 0; innerloop < 1000000000; innerloop++ {
            sum += array[(iteration + innerloop) % array_length]
        }
    }
    fmt.Printf("sum %E\n", sum)
    array = nil
}

$ /opt/src/go1.16/go/bin/go build -o out test.go
$ time ./out 100
iterations 100
sum 5.000000E+18

real    3m3.225s

====

package main
import "os"
import "fmt"
import "strconv"
func main() {
    var (
       element, iteration, iterations, innerloop int
       sum float64
    )

    if len(os.Args) > 1 {
        iterations,_ = strconv.Atoi(os.Args[1])
    }
    fmt.Printf("iterations %d\n", iterations)
    
    var (
        array_length int = 1000000 * iterations    
        array []float64 = make([]float64, array_length)
    )    
    
    for element = 0; element < array_length; element++ {
        array[element] = float64(element)
    }
    for iteration = 0; iteration < iterations; iteration++ {
        for innerloop = 0; innerloop < 1000000000; innerloop++ {
            sum += array[(iteration + innerloop) % array_length]
        }
    }
    fmt.Printf("sum %E\n", sum)
    array = nil
}

$ /opt/src/go1.16/go/bin/go build -o out test.go
$ time ./out 100
iterations 100
sum 5.000000E+18

real    18m20.737s

Didier Spezia

unread,
Mar 25, 2021, 3:31:25 PM3/25/21
to golang-nuts
I believe this is simply due to the 64 bits integer division.

1. Contrary to Java or most C implementations int is 64 bits with Go on AMD64, and not 32 bits. Integer division on 64 bits is more expensive than for 32 bits.
2. When array_length is known at compile time, the compiler can replace the integer division done for % with cheaper operations.

Regards,
Didier.

Alex

unread,
Mar 25, 2021, 3:41:36 PM3/25/21
to golang-nuts
On my AMD 5950x I get 1m26.273s vs 2m30.918s
The huge time difference (3m vs 18m) in your times could be due to thermal throttling or core clock boost strategy
e.g. The CPU may boost clocks high for a short while and then drop down.

You can disable intel turbo boost and set a static core clock to rule that out. 
Not much you can do for thermal throttling unless you underclock until it doesn't thermal throttle.

Isaac Gouy

unread,
Mar 25, 2021, 6:31:17 PM3/25/21
to golang-nuts
I believe this is simply due to the 64 bits integer division.

1. Contrary to Java or most C implementations int is 64 bits with Go on AMD64, and not 32 bits. Integer division on 64 bits is more expensive than for 32 bits.
2. When array_length is known at compile time, the compiler can replace the integer division done for % with cheaper operations.

Well, the compile-time-constant program did not become slower with int64, but the run-time-value program did become faster with int32 —

package main
import "os"
import "fmt"
import "strconv"
func main() {
    var (
       element, iteration, iterations, innerloop int64

       sum float64
    )

    if len(os.Args) > 1 {
        var i,_ = strconv.Atoi(os.Args[1])
        iterations = int64(i)

    }
    fmt.Printf("iterations %d\n", iterations)
    
    var (
        array_length int64 = 100000000    

        array []float64 = make([]float64, array_length)
    )    
    
    for element = 0; element < array_length; element++ {
        array[element] = float64(element)
    }
    for iteration = 0; iteration < iterations; iteration++ {
        for innerloop = 0; innerloop < 1000000000; innerloop++ {
            sum += array[(iteration + innerloop) % array_length]
        }
    }
    fmt.Printf("sum %E\n", sum)
    array = nil
}
$ /opt/src/go1.16/go/bin/go build -o out test.go
$ time ./out 100
iterations 100
sum 5.000000E+18

real    3m3.240s

====

package main
import "os"
import "fmt"
import "strconv"
func main() {
    var (
       element, iteration, iterations, innerloop int32

       sum float64
    )

    if len(os.Args) > 1 {
        var i,_ = strconv.Atoi(os.Args[1])
        iterations = int32(i)

    }
    fmt.Printf("iterations %d\n", iterations)
    
    var (
        array_length int32 = 1000000 * iterations    

        array []float64 = make([]float64, array_length)
    )    
    
    for element = 0; element < array_length; element++ {
        array[element] = float64(element)
    }
    for iteration = 0; iteration < iterations; iteration++ {
        for innerloop = 0; innerloop < 1000000000; innerloop++ {
            sum += array[(iteration + innerloop) % array_length]
        }
    }
    fmt.Printf("sum %E\n", sum)
    array = nil
}
$ /opt/src/go1.16/go/bin/go build -o out test.go
$ time ./out 100
iterations 100
sum 5.000000E+18

real    6m57.880s

Isaac Gouy

unread,
Mar 25, 2021, 7:21:15 PM3/25/21
to golang-nuts
On Thursday, March 25, 2021 at 12:31:25 PM UTC-7 Didier wrote:
I believe this is simply due to the 64 bits integer division.

1. Contrary to Java or most C implementations int is 64 bits with Go on AMD64, and not 32 bits. Integer division on 64 bits is more expensive than for 32 bits.
2. When array_length is known at compile time, the compiler can replace the integer division done for % with cheaper operations.

And now compile-time-constant program with explicit int32 and run-time-value program with explicit int64.

In summary —

compile-time-constant program
int    3m3.225s
int32    2m43.404s
int64    3m3.240s

run-time-value program
int    18m20.737s
int32    6m57.880s
int64    18m21.020s

Hmmm what explains the difference for the compile-time-constant program between int and explicit int32 ?
 

Kurtis Rader

unread,
Mar 25, 2021, 7:45:31 PM3/25/21
to Isaac Gouy, golang-nuts
On Thu, Mar 25, 2021 at 4:21 PM 'Isaac Gouy' via golang-nuts <golan...@googlegroups.com> wrote:
compile-time-constant program
int    3m3.225s
int32    2m43.404s
int64    3m3.240s

run-time-value program
int    18m20.737s
int32    6m57.880s
int64    18m21.020s

Hmmm what explains the difference for the compile-time-constant program between int and explicit int32 ?

Hint #1: Notice anything interesting about the relationship of the int and int64 times?


--
Kurtis Rader
Caretaker of the exceptional canines Junior and Hank

Isaac Gouy

unread,
Mar 25, 2021, 8:35:55 PM3/25/21
to golang-nuts


On Thursday, March 25, 2021 at 4:45:31 PM UTC-7 Kurtis Rader wrote:
On Thu, Mar 25, 2021 at 4:21 PM 'Isaac Gouy' via golang-nuts wrote:
compile-time-constant program
int    3m3.225s
int32    2m43.404s
int64    3m3.240s

run-time-value program
int    18m20.737s
int32    6m57.880s
int64    18m21.020s

Hmmm what explains the difference for the compile-time-constant program between int and explicit int32 ?

Hint #1: Notice anything interesting about the relationship of the int and int64 times?


    uint either 32 or 64 bits
    int same size as uint
 
Well I figured int was going to be either 32 or 64 bits :-)


Didier Spezia

unread,
Mar 27, 2021, 9:10:23 AM3/27/21
to golang-nuts
I suppose the difference between int and int32 comes from the behavior of the modulo strength reduction mechanism.

With int (=int64), n%100000000 is compiled as:
MOVQ $0xabcc77118461cefd, AX
MOVQ #n, CX
IMULQ CX
ADDQ CX, DX
SARQ $0x1a, DX
MOVQ CX, BX
SARQ $0x3f, CX
SUBQ CX, DX
IMULQ $0x5f5e100, DX, CX
SUBQ CX, BX
MOVQ BX, #result

With int32, n%100000000 is compiled as:
MOVL #n, AX
MOVSXD AX, CX
MOVL $-0x543388ee, DX
IMULQ CX, DX
SARQ $0x3a, DX
SARQ $0x3f, CX
SUBL CX, DX
IMULL $0x5f5e100, DX, CX
SUBL CX, AX
MOVL AX, #result

With int32, there are a bit less instructions, and the last multiplication is using a 32 bits instruction.
I think this accounts for the slightly better performance, but it is likely dependent on the CPU model.

Btw, if you change to uint32, you will get even better performance, because in that case, the strength reduction gives:
MOVL $-0x543388ee, AX
MOVL #n, CX
IMULQ CX, AX
SHRQ $0x3a, AX
IMULL $0x5f5e100, AX, AX
SUBL AX, CX
MOVL CX, #result

In the Go compiler, I believe these optimizations are implemented in:
You may want to search for Div/Mod related transformations.

Best regards,
Didier.
Reply all
Reply to author
Forward
0 new messages