Bitmask is slower than modulo calculation

239 views
Skip to first unread message

leon

unread,
May 12, 2024, 12:58:41 AMMay 12
to golang-nuts
I'm trying to prove an optimization technique for ring buffer is effective. One of the technique is using bitmask instead of modulo to calculate a wrap around. However, in my environment, modulo is slightly faster in a test where 1 billion items are enqueued /dequeued by a single goroutine. What do you think could be the cause? 


Environment:
* go version go1.21.4 darwin/arm64
* Apple M1 Pro

RingBuffer with modulo:
```
type RingBuffer0 struct {
writeIdx uint64
readIdx  uint64
buffers  []any
size     uint64
}

func NewRingBuffer0(size uint64) *RingBuffer0 {
rb := &RingBuffer0{}
rb.init(size)
return rb
}

func (rb *RingBuffer0) init(size uint64) {
rb.buffers = make([]any, size)
rb.size = size
}

func (rb *RingBuffer0) Enqueue(item any) error {
if rb.writeIdx-rb.readIdx == rb.size {
return ErrBufferFull
}
rb.buffers[rb.writeIdx%rb.size] = item
rb.writeIdx++
return nil
}

func (rb *RingBuffer0) Dequeue() (any, error) {
if rb.writeIdx == rb.readIdx {
return nil, ErrBufferEmpty
}
item := rb.buffers[rb.readIdx%rb.size]
rb.readIdx++
return item, nil
}
```

RingBuffer with bitmask:
change each module calculation to the code below
* rb.buffers[rb.writeIdx&(rb.size-1)] = item
* item := rb.buffers[rb.readIdx&(rb.size-1)]

Test:
func TestSingle(rb RingBuffer) {
start := time.Now()
total := 500000
for i := 0; i < total; i++ {
for j := 0; j < 1000; j++ {
rb.Enqueue(j)
}
for j := 0; j < 1000; j++ {
rb.Dequeue()
}
}
end := time.Now()
count := total * 2000
duration := end.Sub(start).Milliseconds()
fmt.Printf("%d ops in %d ms\n", count, duration)
}

robert engels

unread,
May 12, 2024, 7:20:05 PMMay 12
to leon, golang-nuts
--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/b9c4d2e0-4ab4-4d27-9359-abd8c090ae33n%40googlegroups.com.

Kurtis Rader

unread,
May 12, 2024, 11:58:09 PMMay 12
to leon, golang-nuts
On my Apple Studio system the difference in speed is barely measurable and sometimes the bitmask implementation is faster:

 macpro  ~/experiment  > go run ring.go
1000000000 ops in 5092 ms
1000000000 ops in 5140 ms
 macpro  ~/experiment  > go run ring.go
1000000000 ops in 5080 ms
1000000000 ops in 5122 ms
 macpro  ~/experiment  > go run ring.go
1000000000 ops in 5166 ms
1000000000 ops in 5133 ms

Micro-benchmarks of this nature are notoriously sensitive to noise and unexpected factors such as the location of code and variables in memory. Those examples show a +0.94%, +0.83%, and -0,64% change in execution time respectively. I would not be surprised if a huge number of runs showed the modulo variant is consistently faster on the order of 0.5% to 1.0%. Figuring out why will likely require examining the actual instruction sequences being executed and a very detailed understanding of the CPU architecture.

P.S., the code in the Go Playground can't be run as-is and was very different from the code in your email. If you want people to help you should make it easier for people to test the behavior of your code.

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/b9c4d2e0-4ab4-4d27-9359-abd8c090ae33n%40googlegroups.com.


--
Kurtis Rader
Caretaker of the exceptional canines Junior and Hank

leon

unread,
May 13, 2024, 12:40:35 AMMay 13
to golang-nuts
Thank you so much. 

2024年5月13日月曜日 12:58:09 UTC+9 Kurtis Rader:
Message has been deleted

robert engels

unread,
May 13, 2024, 1:18:38 AMMay 13
to Yuta MIYAKE, golang-nuts
Hi. This is still not correct. Use the “for.. in b.N” as discussed in the blog in order to understand the per op difference - which will be more accurate for a microbenchmark timing,

But, if you really want to make a case that bit mask is slower than mod, then a simpler test would be better - you can do these ops on in loop using b.N to time the difference.



On May 12, 2024, at 8:28 PM, Yuta MIYAKE <yuta_...@mercari.com> wrote:

Thank you. 

This is benchstat result. new test code follows:

❯ go test -test.bench BenchmarkTestSingleModulo -count=10 -cpu=1 > modulo.txt
❯ go test -test.bench BenchmarkTestSingleBitMask -count=10 -cpu=1 > bitmask.txt
❯ benchstat modulo.txt bitmask.txt
goos: darwin
goarch: arm64
pkg: ringbuffer
                  │ modulo.txt │    bitmask.txt    │
                  │   sec/op   │   sec/op    vs base   │
TestSingleModulo    6.648 ± 1%
TestSingleBitMask                6.694 ± 5%
geomean             6.648        6.694       ? ¹ ²


new test code:

const BufferSize = 2 * 1024 * 1024

func benchmarkSingle(rb RingBuffer) {

total := 500000
for i := 0; i < total; i++ {
for j := 0; j < 1000; j++ {
rb.Enqueue(j)
}
for j := 0; j < 1000; j++ {
rb.Dequeue()
}
}
}

func BenchmarkTestSingleModulo(b *testing.B) {
rb := NewRingBuffer0(BufferSize)
b.ResetTimer()
benchmarkSingle(rb)
}

func BenchmarkTestSingleBitMask(b *testing.B) {
rb := NewRingBuffer1(BufferSize)
b.ResetTimer()
benchmarkSingle(rb)
}

Keith Randall

unread,
May 14, 2024, 12:55:34 PMMay 14
to golang-nuts
Your test is benchmarking int->any conversions more than it is testing the underlying modulo/mask difference.
When you pass an integer to Enqueue, it is converted to any, which (usually) involves an allocation. That will swamp the cost of any single arithmetic operation.

There are a few ways to fix it. Probably the simplest is to make your ring buffers generic on the element type.
You could also fix the benchmark, by precomputing the any-typed elements you're going to enqueue before starting the timer.

When you get a surprising benchmark result, always run the benchmark with profiling on (-cpuprofile) and look at the resulting profile. This is a case where it would be obvious what the problem is.
Reply all
Reply to author
Forward
0 new messages