Re: [go-nuts] append() is 200% to 300% slower than manual indexing and fmt.Sprintf is 80 times slower yet.


Dave Cheney

unread,
Feb 8, 2013, 10:09:51 PM2/8/13
to joex...@gmail.com, golang-nuts

I'm not really sure what you are trying to demonstrate. You appear to be comparing apples, oranges and Yugos.

On 9 Feb 2013 13:57, <joex...@gmail.com> wrote:
I was experimenting with buffer management and ran across some interesting performance numbers. I wanted to see how much overhead the append functionality adds, so I wrote several variations of the test. My analysis indicates that for complex buffers with lots of individual items, the single-item append will be 2 to 3 times slower. This could be important, but if you happen to be using fmt.Sprintf() to generate the content then it will not matter, since Sprintf() is 80 times slower still. Comments or improvements welcome.

A) Append a string to a slice 1 million times.
B) Save a string to a slice using index assignments 1 million times, manually growing the slice by a big chunk when it runs out.
C) Save a string to a slice using index assignments 1 million times, using an array already big enough to contain the result.
D) Save the string using the manually grown slice, but generate the strings with a simple call to Sprintf().
E) Generate the same number of similarly sized strings with Sprintf(), but do not save them.

The representative results were as follows:
A = 92ms,  B = 31ms, C = 30ms, D = 2,412ms, E = 2446ms
A = 115ms, B = 38ms, C = 31ms, D = 2,384ms, E = 2332ms
A = 80ms,  B = 28ms, C = 30ms, D = 2,326ms, E = 2670ms

Sample code:

package main

import "fmt"
import . "time"
import "runtime"

func now_ms() float64 {
    nn := Now()
    return float64(nn.UnixNano()) / float64(Millisecond)
}

func elap_ms(msg string, beg_time float64, end_time float64) float64 {
    elap := end_time - beg_time
    fmt.Printf("ELAP %s = %12.3f ms\n", msg, elap)
    return elap
}

func main() {
    const numLoop = 1000000
    const numCycle = 10
    const DefaultSBSize = 500

    fmt.Printf("Number of Appends for this test run=%d\n", numLoop)

    // GC calls added so the next test is not spending
    // time cleaning up the prior test's garbage.
    for xxk := 0; xxk < 5; xxk++ {
        //
        //
        // Demonstrate the simple case
        // append numLoop times to initially
        // empty slice. 
        runtime.GC()
        beg_time := now_ms()
        b := make([]string, DefaultSBSize)
        for ii := 0; ii < numLoop; ii++ {
            b = append(b, "I am a simple string 198282\n")
        }
        elap_ms("build with simple Append", beg_time, now_ms())
        beg_time = now_ms()
        b = nil
        runtime.GC()
        elap_ms("    post test GC", beg_time, now_ms())

        //
        //
        // Demonstrate manual array pre-allocation.
        // Each time we run out, manually allocate 8
        // times the existing storage size (2 times
        // once it is already large). This may be too
        // aggressive: we do not want to pay to
        // allocate millions up front, but we also
        // want to minimize the alloc and copy
        // operations.
        runtime.GC()
        beg_time = now_ms()
        max := DefaultSBSize
        b = make([]string, DefaultSBSize)
        for i := 0; i < numLoop; i++ {
            if i >= max {
                if max > 500000 {
                    max = max * 2
                } else {
                    max = max * 8
                }
                t := make([]string, max)
                copy(t, b)
                b = t
            }
            b[i] = "I am a simple string 198282\n"
        }
        elap_ms("\n\nbuild with simple pre-grow array", beg_time, now_ms())
        beg_time = now_ms()
        for i := 0; i < numLoop; i++ {
            b[i] = ""
        }
        elap_ms("    Clear elements to nic but leave array", beg_time, now_ms())
        beg_time = now_ms()
        runtime.GC()
        elap_ms("    post test GC", beg_time, now_ms())

        //
        //
        // Demonstrate manual array pre-allocation
        // again, but this time re-use the array
        // allocated by the previous test.
        runtime.GC()
        beg_time = now_ms()
        max = DefaultSBSize
        for i := 0; i < numLoop; i++ {
            if i >= max {
                if max > 500000 {
                    max = max * 2
                } else {
                    max = max * 8
                }
                t := make([]string, max)
                copy(t, b)
                b = t
            }
            b[i] = "I am a simple string 198282\n"
        }
        elap_ms("\n\nbuild but re-use prior array", beg_time, now_ms())
        b = nil
        beg_time = now_ms()
        runtime.GC()
        elap_ms("    post test GC", beg_time, now_ms())

        //
        //
        // Demonstrate how slow fmt.Sprintf() runs
        // for the same loop with no append overhead.
        runtime.GC()
        beg_time = now_ms()
        for i := 0; i < numLoop; i++ {
            fmt.Sprintf("I am a simple string %d", i)
        }
        elap_ms("\n\nRun fmt.Sprintf() the same number of times ", beg_time, now_ms())
        beg_time = now_ms()
        runtime.GC()
        elap_ms("    post test GC", beg_time, now_ms())

        //
        //
        // Demonstrate the really slow fmt.Sprintf().
        // We will re-use the fast manual grow, but this
        // time use fmt.Sprintf() to add the index
        // to the string being stored.
        runtime.GC()
        beg_time = now_ms()
        max = DefaultSBSize
        b = make([]string, DefaultSBSize)
        for i := 0; i < numLoop; i++ {
            if i >= max {
                max = max * 5
                t := make([]string, max)
                copy(t, b)
                b = t
            }
            b[i] = fmt.Sprintf("I am a simple string %d", i)
        }
        elap_ms("\n\nbuild with pre-grow array with Sprintf()", beg_time, now_ms())
        // Time converting the array to a single
        // output string (disabled):
        //beg_time = now_ms()
        //tout := strings.Join(b[:numLoop], "")
        //elap_ms("  convert array to single str", beg_time, now_ms())
        //fmt.Printf("length of string=%d\n", len(tout))
        //tout = ""
        b = nil
        beg_time = now_ms()
        runtime.GC()
        elap_ms("    post test GC", beg_time, now_ms())
        fmt.Printf("\n\n***********\n")
    }

}



Sample Output
***********
ELAP build with simple Append =       80.004 ms
ELAP     post test GC =      227.013 ms
ELAP 

build with simple pre-grow array =       33.002 ms
ELAP     Clear elements to empty but leave array =       16.001 ms
ELAP     post test GC =      265.015 ms
ELAP 

build but re-use prior array =       30.002 ms
ELAP     post test GC =      223.013 ms
ELAP 

Run fmt.Sprintf() the same number of times  =     2371.136 ms
ELAP     post test GC =      378.022 ms
ELAP 

build with pre-grow array with Sprintf() =     2412.138 ms
ELAP     post test GC =      339.020 ms



Martin Schnabel

unread,
Feb 8, 2013, 10:34:30 PM2/8/13
to golan...@googlegroups.com
On 02/09/2013 03:57 AM, joex...@gmail.com wrote:
> I was experimenting with buffer management and ran across some
> interesting performance numbers.

you can use go test for benchmarks. see:
http://golang.org/doc/code.html#Testing
http://golang.org/pkg/testing/
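
for example, a minimal benchmark file might look like this (a sketch;
the names and strings are just illustrative), run with go test -bench=.

// append_test.go
package main

import (
    "fmt"
    "testing"
)

func BenchmarkAppend(b *testing.B) {
    var s []string
    for i := 0; i < b.N; i++ {
        s = append(s, "I am a simple string 198282\n")
    }
}

func BenchmarkSprintf(b *testing.B) {
    for i := 0; i < b.N; i++ {
        _ = fmt.Sprintf("I am a simple string %d", i)
    }
}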

> b := make([]string, DefaultSBSize)
> for ii := 0; ii < numLoop; ii++ {
> b = append(b, "I am a simple string 198282\n")
> }

this does not do what you expect: you create a slice with len 500 (500
empty strings) and then append 1000000 simple strings after them. you
probably want an empty slice with an initial capacity: make([]T, 0, cap)
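
a quick sketch of the difference:

package main

import "fmt"

func main() {
    a := make([]string, 500) // len 500, cap 500
    a = append(a, "x")       // lands at index 500
    fmt.Println(len(a))      // 501

    b := make([]string, 0, 500) // len 0, cap 500
    b = append(b, "x")          // lands at index 0
    fmt.Println(len(b), b[0])   // 1 x
}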

read the Effective Go doc and its sections about make, arrays and
slices. it has examples.
http://golang.org/doc/effective_go.html

or even better the language specification. it is very good.
http://golang.org/ref/spec

hope that helps

andrey mirtchovski

unread,
Feb 9, 2013, 12:32:40 AM2/9/13
to joex...@gmail.com, golang-nuts
> When I replaced it with a manually
> written method which reads the bytes in larger chunks and then breaks them
> into arrays of strings GO moved from last place to second place behind F#.
> Knowing this kind of performance information is the difference between GO
> being viable or being thrown out as a toy.

If I understand you correctly, you seem to be complaining that because
Go provides you with a rich set of tools to analyze and easily
replace parts of its standard library to fit your specific needs, it
may be discarded as a toy. Am I correct?

Kyle Lemons

unread,
Feb 9, 2013, 1:20:49 AM2/9/13
to joex...@gmail.com, golang-nuts
I don't think anything here is really surprising.  If you know how much data you have, you should clearly preallocate if you care about performance.  If you preallocate and assign indices (first case below), it will be slightly faster than preallocating and appending (second case), but not by much.  If you have no idea how many you will have (fourth case), you should use append.  If you have a guess, you should use append (third).  Sprintf is a completely different operation and incurs multiple allocations and reflection, so of course it's going to be slower (benchmarked separately below).


Code: http://play.golang.org/p/F3vFIwRw9C (you need a lot of memory to fit half a billion ints; consider setting test.benchtime down a bit if you want to reproduce)

[prealloc] arr[i] = n            500000000               3.15 ns/op
[prealloc] arr = append(arr, n)  500000000               3.95 ns/op
[partial]  arr = append(arr, n)  200000000              10.6 ns/op
[dynamic]  arr = append(arr, n)  100000000              14.8 ns/op
Sprintf("%d", n)                  5000000              376 ns/op

What you're measuring here is the speed of allocation: of course it's faster to allocate up front.  In many cases, though, you either don't know how many you'll have or you aren't in an inner loop and so it doesn't matter.  The built-in append function, in both of the current implementations, tries to help you out here: if there is no room left in the backing array, it will allocate one that's big enough to fit both the new data and some more (how much depends on the size; it doubles the size up to 1000 elements and then adds 25%, if memory serves, but this could change) to minimize the number of allocations it has to do.
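
If you're curious, you can watch that growth policy in action with a quick sketch (the exact factors are an implementation detail and may change):

package main

import "fmt"

func main() {
    var s []int
    oldCap := 0
    for i := 0; i < 2000; i++ {
        s = append(s, i)
        if cap(s) != oldCap {
            // print one line per reallocation
            fmt.Printf("len=%4d cap=%4d\n", len(s), cap(s))
            oldCap = cap(s)
        }
    }
}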


On Fri, Feb 8, 2013 at 9:35 PM, <joex...@gmail.com> wrote:
Good catch. I changed the slice construction to use b := make([]string, 0, DefaultSBSize). Unfortunately it had no observed impact, or the changes were within the statistical variation on this system. The original 500 was wasted effort, but since it took so little time it had no meaningful impact. The original point stands.

By the way, I have read and re-read http://golang.org/doc/effective_go.html; it has been quite helpful.

I will look at moving over to the testing framework (http://golang.org/doc/code.html#Testing), but I believe my current numbers are sufficiently accurate for this test.


Nigel Tao

unread,
Feb 9, 2013, 1:49:42 AM2/9/13
to joex...@gmail.com, golang-nuts
On Sat, Feb 9, 2013 at 3:59 PM, <joex...@gmail.com> wrote:
> also found that Sprintf() should be avoided where performance matters.
> (GO CORE TEAM PLEASE FIX)

I don't know exactly what you want changed here. What I learned from
your code regarding Sprintf is that calling Sprintf a million times is
more expensive than not calling Sprintf a million times. These two
lines:
b[i] = "I am a simple string 198282\n"
b[i] = fmt.Sprintf("I am a simple string %d", i)
do different things.

The RHS (right hand side) on the first line is a string constant; no
memory allocation is needed. A Go string is two words: a pointer to
the contents and a length: 16 bytes, on a 64-bit system. So you're
just copying 16 bytes from a location known at compile time to b[i],
regardless of the length of the RHS string. Copying 16 bytes is two
MOVQ instructions on amd64.
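
You can check the header size directly (a small sketch):

package main

import (
    "fmt"
    "unsafe"
)

func main() {
    s := "I am a simple string 198282\n"
    // pointer + length, regardless of how long the contents are:
    fmt.Println(unsafe.Sizeof(s)) // 16 on a 64-bit system
}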

The RHS on the second line is dynamic: the result depends on i, and
new memory must be allocated at runtime at each iteration. Allocations
can cause garbage collections, and the combination of memory
allocations and the consequent garbage collections is obviously much
more expensive than just copying 16 bytes from one location to
another.

So, creating a new string at runtime is more expensive than referring
to an existing, immutable string at compile time. Is F#, Python or Lua
any different?
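
As an aside, if the dynamic part really is just an integer, strconv is far cheaper than Sprintf, since there is no format string to parse and no reflection (a sketch, not taken from your code):

package main

import (
    "fmt"
    "strconv"
)

func main() {
    i := 198282
    // one concatenation, no format-string parsing, no reflection
    s := "I am a simple string " + strconv.Itoa(i) + "\n"
    fmt.Print(s)
}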


> f.ReadString() is a performance pig.

It's worth knowing that Go makes a distinction between the string and
[]byte types. A string's contents are immutable, a byte slice is not.
s[i] = 'x' is a compile-time error, b[i] = 'x' is not.

Strings are easier to reason about; with []byte values you might
accidentally read from or write to a []byte buffer after it is re-used
elsewhere. But in order to enforce immutability, creating a new string
requires allocation of dedicated memory, which may require a garbage
collection. It can be more efficient to use []byte instead of string;
io.Reader and io.Writer take []byte args instead of string args for this
reason.

You may want the bufio.Reader's ReadSlice method instead of its
ReadString method. It's hard to say, though, without knowing exactly
what you're trying to do. You've shared some code for some
micro-benchmarks, but I don't know what problem you're really trying
to solve.
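
For illustration only, a line-counting sketch with ReadSlice might look like this (the file name is hypothetical, and the slice ReadSlice returns is only valid until the next read, so copy it if you keep it):

package main

import (
    "bufio"
    "fmt"
    "io"
    "os"
)

func main() {
    fi, err := os.Open("data.txt") // hypothetical input file
    if err != nil {
        panic(err)
    }
    defer fi.Close()

    r := bufio.NewReader(fi)
    lines := 0
    for {
        // a view into the reader's internal buffer:
        // no per-line string allocation
        line, err := r.ReadSlice('\n')
        if len(line) > 0 {
            lines++ // process line here; copy it if retained
        }
        if err == io.EOF {
            break
        }
        if err != nil {
            // includes bufio.ErrBufferFull for very long lines
            panic(err)
        }
    }
    fmt.Println("lines:", lines)
}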

Kevin Gillette

unread,
Feb 9, 2013, 2:21:15 AM2/9/13
to golan...@googlegroups.com, joex...@gmail.com
I don't doubt that the relative figures you came up with are reasonably accurate for the test scenario you devised, but I agree with Kyle that you're probably artificially benchmarking reallocations, not append directly. Although I have said that allocation is reasonably fast (using an optimized memcpy), that cost is generally dwarfed by the other computation in the main loop; since all we've been testing is append in a loop, the relative cost of those allocations will, of course, be higher. Here are my figures (intel atom 64-bit):

BenchmarkAppend 20000000                95.7 ns/op
BenchmarkAppendTwo      50000000                88.4 ns/op
BenchmarkAppendThree    20000000                73.4 ns/op
BenchmarkAppendPre      100000000               20.0 ns/op
BenchmarkAppendTwoPre   100000000               14.5 ns/op
BenchmarkAppendThreePre 100000000               13.5 ns/op
BenchmarkIndexSet       100000000               10.4 ns/op
BenchmarkIndexSetThree  100000000               10.4 ns/op

And with an amd x2:

BenchmarkAppend 20000000               116 ns/op
BenchmarkAppendTwo      20000000               101 ns/op
BenchmarkAppendThree    20000000               103 ns/op
BenchmarkAppendPre      100000000               19.0 ns/op
BenchmarkAppendTwoPre   100000000               15.3 ns/op
BenchmarkAppendThreePre 100000000               15.1 ns/op
BenchmarkIndexSet       100000000               14.2 ns/op
BenchmarkIndexSetThree  100000000               14.3 ns/op


As can be seen, the difference between append in the fully preallocated case and full preallocation with directly setting indices is negligible. append is effectively zero cost when it doesn't need to reallocate: its capacity check triggers an allocation only when the capacity is exceeded, while setting an index directly has a bounds check which, if violated, triggers a panic; in the typical case they both do about the same amount of work. Also, if you were to reimplement append in pure Go (for your specific types, since you cannot do it generically), the builtin append would be considerably faster, unless the amount by which you extended the capacity were precisely adapted to the behavior of your app and any other growth value would behave pathologically.

Even in the worst case, when the workload is centered around using non-preallocated appends for data of known size (which in practice is a misuse, by the way, but that is what the initial benchmarks are doing), being 200% slower than the optimal approach for that particular concrete task (preallocated append/index-assignment in a tight loop) does not mean that append without preallocation is "slow": it is, in effect, doing far more than twice the data processing (occasional malloc+memcpy sweeps over potentially millions of elements, plus whatever you're "appending" each step). In any use case with so much data that the 200% makes a perceptible difference, the app design is probably the real problem: if 200% is slow, then 100% is also a lot slower than your users should be allowing you to get away with. You should be interleaving the data processing with other UI tasks rather than making the user wait, you should be preallocating when you know the amount of data you're dealing with, and you should be leveraging "2nd order allocators" (see http://blog.golang.org/2011/06/profiling-go-programs.html) when this kind of data processing is done periodically, so that memory usage becomes stable.
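
A minimal sketch of that last idea, keeping the backing array between runs instead of letting each run allocate its own (illustrative only):

package main

// buf is grown only when the input outgrows it, so
// steady-state runs allocate nothing.
var buf []string

func process(n int) {
    if cap(buf) < n {
        buf = make([]string, 0, n)
    }
    buf = buf[:0] // reset length, keep capacity
    for i := 0; i < n; i++ {
        buf = append(buf, "item")
    }
    // consume buf here; don't retain it past the next call
}

func main() {
    process(1000000) // first run allocates
    process(1000000) // later runs reuse the same array
}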

Printf is not designed primarily for speed -- if it were, the API would look more like regexp, where you would pre-compile format strings, rather than passing them in each time. I believe that format strings are not currently being interned (and therefore must be re-parsed for each call) -- for one thing, interning would make potentially false assumptions about the nature of the application, which the stdlib tries to avoid doing. node and python may indeed win out here if they intern strings, yet Go has a much lower assumed memory overhead than either of those languages, and so it would be more concerning if Go made assumptions on your behalf than if node or python did. Additionally, any formatted output facilities in node, like the rest of the core, would almost certainly be implemented in C/C++, just as format strings are implemented in C for CPython. The Go community doesn't generally try to defend itself against claims that C is faster than Go. That said, a small LRU cache of format strings would probably be reasonable, though since reflection is involved, it would still pretty much be guaranteed to always be slower compared to, e.g. C's printf, since C format strings encode type sizes as well as semantics (so C unsafely makes assumptions about the arguments being passed), while Go uses reflection to make those size determinations (another safety vs time tradeoff).

On Friday, February 8, 2013 9:59:53 PM UTC-7, joex...@gmail.com wrote:
My tests show that GO is fast but not all that much faster than other languages. My area of focus is centered around  heavy duty information mining and predictive analytics.     I never believe FUD and always write empirical tests representative of my applications.   I was simply trying to share some findings I thought others could use.

The append() function is heavily hyped, and multiple GO experts have recently told me that it is fast. Well, fast is a relative term. I recently fixed a system for a major company where a 300% improvement in speed was enough to move them from unhappy customers to very happy customers. Net net, a 300% speed delta between append() and manually indexed buffers could be important. Many people claim they adopted GO for speed reasons. If this is true then they should care about a 300% performance delta between similar techniques commonly used for a common task.

It is one thing to wait and hot-spot the code; it is even better to run enough micro tests to know what to avoid before you have to substantially refactor a complex system because it fails to perform. I have spent a good portion of my career fixing distributed performance problems in complex systems. My customers are generally on an emergency footing. Their root-cause problems are varied, but they are almost always caused by engineers who failed to smoke test the code as it was written. It is always horribly expensive, because they generally call me after they missed their launch date or after they are losing revenue and their customers are complaining.

Accumulating a collection or buffer from many small bits of input is a common task. I find it important to know which techniques are likely to be hot spots. One of our known hot spots is exactly this kind of problem. I have seen problems in this area at enough customer sites to know that it is an area where performance problems multiply. This test shows the new append method is far slower than managing pre-allocated buffers. (The same is true in C.) The code difference is trivial, but it could be expensive to ferret out all the places to change later. I am not trying to say the new append() function should never be used, as there is a lot of code which is not performance sensitive. I am saying that it should either be fixed or avoided in areas where you know performance counts.

I find that stress tests like this are essential when designing high performance systems. I need to know exactly how much speed penalty is paid when using different parts of the system. As part of this testing I also found that Sprintf() should be avoided where performance matters. (GO CORE TEAM PLEASE FIX) I think Python and Node.js would beat you in this area right now.

Some of my tests show that GO was barely twice as fast as Python for complex jobs. If you throw away 300% in performance in a critical hot area then you may be running slower than the equivalent Python code. In one of my tests we parsed about 1.5 million rows of data from a 320 MB file. GO was initially the slowest performer of the bunch, behind F#, Python and Lua, by over 200%. It turned out that the problem was in f := bufio.NewReader(fi); head_str, _ := f.ReadString('\n'). The f.ReadString() is a performance pig. When I replaced it with a manually written method which reads the bytes in larger chunks and then breaks them into arrays of strings, GO moved from last place to second place behind F#. Knowing this kind of performance information is the difference between GO being viable or being thrown out as a toy.

peterGo

unread,
Feb 9, 2013, 10:27:42 AM2/9/13
to golan...@googlegroups.com, joex...@gmail.com
Joe,

I've optimized a lot of systems too.

Analyzing performance by black-box metrics is not always useful; your black-box model may be flawed. Go is open source; look inside the black box.

Use the Go benchmarking facilities. For Go data structures, read Russ Cox's "Go Data Structures" article. Look at the append built-in function in detail. For example, examine the code for the runtime.appendslice and runtime.growslice1 functions. Read and run the runtime append benchmarks. Read the code for package fmt. If you are an assembly language programmer, analyze the generated pseudo-assembler and machine code. You should now be able to anticipate and explain your benchmark results.

Peter