Bitmask is slower than modulo calculation

leon

unread,

May 12, 2024, 12:58:41 AMMay 12

to golang-nuts

I'm trying to prove an optimization technique for ring buffer is effective. One of the technique is using bitmask instead of modulo to calculate a wrap around. However, in my environment, modulo is slightly faster in a test where 1 billion items are enqueued /dequeued by a single goroutine. What do you think could be the cause?

Full code:

https://go.dev/play/p/H933oqrhPI-

Environment:

* go version go1.21.4 darwin/arm64

* Apple M1 Pro

RingBuffer with modulo:

```

type RingBuffer0 struct {
writeIdx uint64
readIdx uint64
buffers []any
size uint64
}

func NewRingBuffer0(size uint64) *RingBuffer0 {
rb := &RingBuffer0{}
rb.init(size)
return rb
}

func (rb *RingBuffer0) init(size uint64) {
rb.buffers = make([]any, size)
rb.size = size
}

func (rb *RingBuffer0) Enqueue(item any) error {
if rb.writeIdx-rb.readIdx == rb.size {
return ErrBufferFull
}
rb.buffers[rb.writeIdx%rb.size] = item
rb.writeIdx++
return nil
}

func (rb *RingBuffer0) Dequeue() (any, error) {
if rb.writeIdx == rb.readIdx {
return nil, ErrBufferEmpty
}
item := rb.buffers[rb.readIdx%rb.size]
rb.readIdx++
return item, nil
}

```

RingBuffer with bitmask:

change each module calculation to the code below

* rb.buffers[rb.writeIdx&(rb.size-1)] = item

* item := rb.buffers[rb.readIdx&(rb.size-1)]

Test:

func TestSingle(rb RingBuffer) {
start := time.Now()
total := 500000
for i := 0; i < total; i++ {
for j := 0; j < 1000; j++ {
rb.Enqueue(j)
}
for j := 0; j < 1000; j++ {
rb.Dequeue()
}
}
end := time.Now()
count := total * 2000
duration := end.Sub(start).Milliseconds()
fmt.Printf("%d ops in %d ms\n", count, duration)
}

robert engels

unread,

May 12, 2024, 7:20:05 PMMay 12

to leon, golang-nuts

Use the Go benchmarking facilities, see https://dave.cheney.net/2013/06/30/how-to-write-benchmarks-in-go

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/b9c4d2e0-4ab4-4d27-9359-abd8c090ae33n%40googlegroups.com.

Kurtis Rader

unread,

May 12, 2024, 11:58:09 PMMay 12

to leon, golang-nuts

On my Apple Studio system the difference in speed is barely measurable and sometimes the bitmask implementation is faster:

macpro ~/experiment > go run ring.go
1000000000 ops in 5092 ms
1000000000 ops in 5140 ms
macpro ~/experiment > go run ring.go
1000000000 ops in 5080 ms
1000000000 ops in 5122 ms
macpro ~/experiment > go run ring.go
1000000000 ops in 5166 ms
1000000000 ops in 5133 ms

Micro-benchmarks of this nature are notoriously sensitive to noise and unexpected factors such as the location of code and variables in memory. Those examples show a +0.94%, +0.83%, and -0,64% change in execution time respectively. I would not be surprised if a huge number of runs showed the modulo variant is consistently faster on the order of 0.5% to 1.0%. Figuring out why will likely require examining the actual instruction sequences being executed and a very detailed understanding of the CPU architecture.

P.S., the code in the Go Playground can't be run as-is and was very different from the code in your email. If you want people to help you should make it easier for people to test the behavior of your code.

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/b9c4d2e0-4ab4-4d27-9359-abd8c090ae33n%40googlegroups.com.

--

Kurtis Rader

Caretaker of the exceptional canines Junior and Hank

leon

unread,

May 13, 2024, 12:40:35 AMMay 13

to golang-nuts

Thank you so much.

2024年5月13日月曜日 12:58:09 UTC+9 Kurtis Rader:

Message has been deleted

robert engels

unread,

May 13, 2024, 1:18:38 AMMay 13

to Yuta MIYAKE, golang-nuts

Hi. This is still not correct. Use the “for.. in b.N” as discussed in the blog in order to understand the per op difference - which will be more accurate for a microbenchmark timing,

But, if you really want to make a case that bit mask is slower than mod, then a simpler test would be better - you can do these ops on in loop using b.N to time the difference.

On May 12, 2024, at 8:28 PM, Yuta MIYAKE <yuta_...@mercari.com> wrote:

Thank you.

This is benchstat result. new test code follows:

❯ go test -test.bench BenchmarkTestSingleModulo -count=10 -cpu=1 > modulo.txt
❯ go test -test.bench BenchmarkTestSingleBitMask -count=10 -cpu=1 > bitmask.txt
❯ benchstat modulo.txt bitmask.txt
goos: darwin
goarch: arm64
pkg: ringbuffer
│ modulo.txt │ bitmask.txt │
│ sec/op │ sec/op vs base │
TestSingleModulo 6.648 ± 1%
TestSingleBitMask 6.694 ± 5%
geomean 6.648 6.694 ? ¹ ²

new test code:

const BufferSize = 2 * 1024 * 1024

func benchmarkSingle(rb RingBuffer) {

total := 500000
for i := 0; i < total; i++ {
for j := 0; j < 1000; j++ {
rb.Enqueue(j)
}
for j := 0; j < 1000; j++ {
rb.Dequeue()
}
}
}

func BenchmarkTestSingleModulo(b *testing.B) {
rb := NewRingBuffer0(BufferSize)
b.ResetTimer()
benchmarkSingle(rb)
}

func BenchmarkTestSingleBitMask(b *testing.B) {
rb := NewRingBuffer1(BufferSize)
b.ResetTimer()
benchmarkSingle(rb)
}

To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/efe2618f-c520-4b53-b233-a724fe9b77d5n%40googlegroups.com.

Keith Randall

unread,

May 14, 2024, 12:55:34 PMMay 14

to golang-nuts

Your test is benchmarking int->any conversions more than it is testing the underlying modulo/mask difference.

When you pass an integer to Enqueue, it is converted to any, which (usually) involves an allocation. That will swamp the cost of any single arithmetic operation.

There are a few ways to fix it. Probably the simplest is to make your ring buffers generic on the element type.

You could also fix the benchmark, by precomputing the any-typed elements you're going to enqueue before starting the timer.

When you get a surprising benchmark result, always run the benchmark with profiling on (-cpuprofile) and look at the resulting profile. This is a case where it would be obvious what the problem is.

Reply all

Reply to author

Forward