Joseph Tsai has uploaded this change for review.
encoding/json: improve Marshal memory utilization
Previously, Marshal (and Encoder.Encode) used a pooled encodeState
that included an embedded bytes.Buffer.
This is an incorrect use of sync.Pool (see #23199).
This results in problems when a high-frequency stream of tiny Marshals
is interspersed with a low-frequency stream of huge Marshals.
The occasional huge Marshal will cause bytes.Buffer to be grown in size
to accommodate the large output, but never be reduced in size.
However, the continuous stream of tiny Marshals will keep the buffers alive,
preventing the GC from being able to reclaim the under-utilized capacity.
Previous attempts to trivially cap the buffer capacity (see CL 136035)
resulted in unacceptably poor performance in marshaling large outputs.
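As a minimal sketch of the problematic pattern (illustrative names and
package, not the actual encoder code):
```
package pooldemo

import (
	"bytes"
	"encoding/json"
	"sync"
)

var bufPool = sync.Pool{New: func() any { return new(bytes.Buffer) }}

// marshalPooled mimics the old scheme: one pooled contiguous buffer,
// cloned on return so the caller owns a unique copy.
func marshalPooled(v any) ([]byte, error) {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	// Put never shrinks the buffer: one huge value grows its capacity
	// permanently, while a steady stream of tiny values keeps it cycling
	// through the pool, hiding the unused capacity from the GC.
	defer bufPool.Put(buf)
	if err := json.NewEncoder(buf).Encode(v); err != nil {
		return nil, err
	}
	return append([]byte(nil), buf.Bytes()...), nil
}
```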
We fix this problem by switching from bytes.Buffer to a custom pooledBuffer
implementation that uses a series of pooled buffer segments.
The content of the buffer is the concatenation of all the buffer segments.
Using buffer segments allows the implementation to more readily return
unused buffers to the pool, and allows the GC to reclaim them if necessary.
The semantics of the Marshal function require that the returned buffer
be a unique copy owned by the caller.
With the segmented buffer, we use bytes.Join to join all the segments,
while the previous code did the semantic equivalent of bytes.Clone.
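Concretely, the caller's copy changes from a clone of one contiguous
buffer to a join of the segments (both forms appear in the diff below):
```
// Previously (contiguous buffer): semantically a bytes.Clone.
bufOld := append([]byte(nil), e.Bytes()...)

// Now (segmented buffer): concatenate every segment into one freshly
// allocated slice that the caller owns.
bufNew := bytes.Join(e.fullList(), nil)
```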
Performance:
name                    old time/op    new time/op    delta
LargeOutput             1.85s  ± 3%    1.31s  ± 2%    -29.25%  (p=0.000 n=10+8)
CodeEncoder             394µs  ± 1%    382µs  ± 2%     -2.95%  (p=0.000 n=9+10)
CodeEncoderError        437µs  ± 1%    428µs  ± 1%     -1.97%  (p=0.000 n=9+9)
AllocationWastage       396µs  ± 2%    476µs  ± 1%    +20.16%  (p=0.000 n=10+8)
CodeMarshal             534µs  ± 1%    712µs  ± 3%    +33.31%  (p=0.000 n=10+10)
CodeMarshalError        694µs  ± 2%    903µs  ± 4%    +29.98%  (p=0.000 n=10+10)
MarshalBytes/32         170ns  ± 3%    164ns  ± 1%     -3.22%  (p=0.000 n=10+10)
MarshalBytes/256        387ns  ± 2%    381ns  ± 1%     -1.63%  (p=0.000 n=10+10)
MarshalBytes/4096       3.93µs ± 2%    3.91µs ± 1%       ~     (p=0.184 n=10+10)
MarshalBytesError/32    372µs  ± 1%    370µs  ± 0%     -0.71%  (p=0.002 n=10+8)
MarshalBytesError/256   372µs  ± 1%    370µs  ± 1%       ~     (p=0.133 n=9+10)
MarshalBytesError/4096  379µs  ± 1%    375µs  ± 1%     -1.03%  (p=0.007 n=10+10)
EncodeMarshaler         22.8ns ± 2%    24.4ns ± 2%     +7.06%  (p=0.000 n=8+9)
EncoderEncode           12.5ns ± 3%    13.6ns ± 8%     +8.23%  (p=0.005 n=8+8)

name                    old alloc/op   new alloc/op     delta
LargeOutput             1.55GB ± 0%    0.97GB ± 0%        -37.61%  (p=0.000 n=10+7)
CodeEncoder             4.00B  ± 0%    3608.40B ±77%   +90110.00%  (p=0.000 n=9+10)
CodeEncoderError        141B   ± 2%    2605B   ±207%    +1751.22%  (p=0.000 n=10+9)
AllocationWastage       2.87B  ±30%    2017.10B ±54%   +70060.00%  (p=0.000 n=8+10)
CodeMarshal             1.96MB ± 2%    1.99MB ± 1%         +1.52%  (p=0.000 n=10+10)
CodeMarshalError        2.04MB ± 2%    2.00MB ± 1%         -1.89%  (p=0.000 n=10+9)
EncodeMarshaler         4.00B  ± 0%    4.00B  ± 0%            ~    (all equal)
EncoderEncode           0.00B          0.00B                  ~    (all equal)

name                    old allocs/op  new allocs/op    delta
LargeOutput             44.8 ±23%      19649.6 ± 0%    +43760.71%  (p=0.000 n=10+10)
CodeEncoder             0.00           47.00 ± 0%           +Inf%  (p=0.000 n=10+10)
CodeEncoderError        4.00 ± 0%      51.00 ± 0%       +1175.00%  (p=0.000 n=10+10)
AllocationWastage       0.00           52.00 ± 0%           +Inf%  (p=0.000 n=10+10)
CodeMarshal             1.00 ± 0%      53.00 ± 0%       +5200.00%  (p=0.000 n=10+8)
CodeMarshalError        6.30 ±11%      59.40 ± 1%        +842.86%  (p=0.000 n=10+10)
EncodeMarshaler         1.00 ± 0%      1.00 ± 0%              ~    (all equal)
EncoderEncode           0.00           0.00                   ~    (all equal)

name                    old wasted_b/op  new wasted_b/op  delta
AllocationWastage       2.25M ± 0%       0.16M ± 0%       -92.88%  (p=0.000 n=10+10)
There is no notable performance difference on small output (i.e., < 64KiB).
For larger outputs (e.g., CodeMarshal), we do see a performance degradation.
This is unavoidable, as a single contiguous buffer that is large enough
is always better for CPU performance for many reasons.
However, the nature of the Marshal call makes it such that it is impossible
to know what size buffer we need prior to marshaling the value itself.
The loss of performance is deemed acceptable in order to gain
stronger guarantees on the maximum amount of heap used.
Part of the cost of using bytes.Join over bytes.Clone is that
bytes.Clone can rely on compiler optimizations that avoid unnecessarily
zeroing the buffer for the portion that is gauranteed to be overwritten.
The bytes.Join function does not benefit from that compiler optimization.
In CL 456336, bytes.Join is optimized to also avoid the zeroing.
This results in the performance loss being reduced to:
name              old time/op    new time/op    delta
CodeMarshal       534µs ± 1%     633µs ± 4%     +18.46%  (p=0.000 n=10+10)
CodeMarshalError  694µs ± 2%     789µs ± 3%     +13.64%  (p=0.000 n=10+9)
This is a reduction of performance loss from +33% down to +18%,
but is still much better than the performance loss of +72%
without the use of any pooled buffers at all.
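A sketch of where the zeroing cost comes from (illustrative of the
mechanism, not the actual bytes package source):
```
// Clone via append: growslice allocates without zeroing the region
// that the copy immediately overwrites, so no byte is written twice.
func cloneFast(b []byte) []byte {
	return append([]byte(nil), b...)
}

// Join via make-then-copy: make([]byte, n) zeroes all n bytes up
// front, even though the copies below overwrite every one of them.
func joinNaive(segs [][]byte) []byte {
	n := 0
	for _, s := range segs {
		n += len(s)
	}
	out := make([]byte, n)
	i := 0
	for _, s := range segs {
		i += copy(out[i:], s)
	}
	return out
}
```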
The idea of a segmented buffer came about during discussion with
Daniel Smith on CL 455236 as a counterproposal
to using pooled contiguous buffers.
The benchmarks come from Daniel Smith's alternate implementation in CL 455776.
Huge thanks go to Daniel Smith, who did the bulk of the work
to prove that segmented buffers were a viable approach.
The implementation in this CL is a simplified version of CL 455776.
Fixes #27735
Change-Id: Ib23f00eb64615b21b1ddddf9fadc6498bcefa935
---
M src/encoding/json/bench_test.go
A src/encoding/json/buffers.go
A src/encoding/json/buffers_test.go
M src/encoding/json/encode.go
M src/encoding/json/stream.go
5 files changed, 495 insertions(+), 32 deletions(-)
diff --git a/src/encoding/json/bench_test.go b/src/encoding/json/bench_test.go
index d3af0dc..5d1741b 100644
--- a/src/encoding/json/bench_test.go
+++ b/src/encoding/json/bench_test.go
@@ -22,6 +22,7 @@
"runtime"
"strings"
"sync"
+ "sync/atomic"
"testing"
)
@@ -82,6 +83,30 @@
}
}
+func BenchmarkLargeOutput(b *testing.B) {
+ b.ReportAllocs()
+
+ b.StopTimer()
+ if codeJSON == nil {
+ codeInit()
+ }
+ list := make([]*codeResponse, 250)
+ for i := range list {
+ list[i] = &codeStruct
+ }
+ b.StartTimer()
+
+ for i := 0; i < b.N; i++ {
+ buf, err := Marshal(list)
+ if err != nil {
+ b.Fatal("Marshal:", err)
+ }
+ if i == b.N-1 {
+ b.SetBytes(int64(len(buf)))
+ }
+ }
+}
+
func BenchmarkCodeEncoder(b *testing.B) {
b.ReportAllocs()
if codeJSON == nil {
@@ -130,6 +155,74 @@
b.SetBytes(int64(len(codeJSON)))
}
+func (e *encodeState) wastedSpace() (n int) {
+ for _, p := range e.fullList() {
+ n += cap(p) - len(p)
+ }
+ return n
+}
+
+func marshalReturnWastage(v any) ([]byte, int, error) {
+ e := getEncodeState()
+ defer putEncodeState(e)
+
+ err := e.marshal(v, encOpts{escapeHTML: true})
+ if err != nil {
+ return nil, e.wastedSpace(), err
+ }
+ return e.Bytes(), e.wastedSpace(), nil
+}
+
+func BenchmarkAllocationWastage(b *testing.B) {
+ b.ReportAllocs()
+
+ b.StopTimer()
+ type SmallNested struct {
+ Name string
+ Next *SmallNested
+ }
+ sn := &SmallNested{Name: "Small and nested"}
+ for i := 0; i < 10; i++ {
+ sn = &SmallNested{
+ Name: fmt.Sprintf("loop iteration %v", i),
+ Next: sn,
+ }
+ }
+ smallObj, err := Marshal(sn)
+ if err != nil {
+ b.Fatal("Marshal:", err)
+ }
+ if codeJSON == nil {
+ codeInit()
+ }
+ b.StartTimer()
+
+ var totalTotal int64
+ var totalCount int64
+
+ b.RunParallel(func(pb *testing.PB) {
+ var totalWasted int64
+ var iterCount int
+ for pb.Next() {
+ if _, wasted, err := marshalReturnWastage(&codeStruct); err != nil {
+ b.Fatal("Marshal:", err)
+ } else {
+ totalWasted += int64(wasted)
+ }
+ if _, wasted, err := marshalReturnWastage(sn); err != nil {
+ b.Fatal("Marshal:", err)
+ } else {
+ totalWasted += int64(wasted)
+ }
+ iterCount++
+ }
+ atomic.AddInt64(&totalTotal, totalWasted)
+ atomic.AddInt64(&totalCount, int64(iterCount))
+ })
+ b.ReportMetric(float64(totalTotal)/float64(totalCount), "wasted_b/op")
+ b.ReportMetric(float64(len(smallObj)+len(codeJSON)), "useful_b/op")
+}
+
func BenchmarkCodeMarshal(b *testing.B) {
b.ReportAllocs()
if codeJSON == nil {
diff --git a/src/encoding/json/buffers.go b/src/encoding/json/buffers.go
new file mode 100644
index 0000000..2da5728
--- /dev/null
+++ b/src/encoding/json/buffers.go
@@ -0,0 +1,186 @@
+// Copyright 2023 The Go Authors. All rights reserved.
+// Use of this source code is governed by a BSD-style
+// license that can be found in the LICENSE file.
+
+package json
+
+import (
+ "bytes"
+ "io"
+ "math/bits"
+ "slices"
+ "sync"
+)
+
+// pooledBuffer is similar to bytes.Buffer,
+// but uses a series of segmented buffers instead of a single contiguous buffer.
+// Each segment never has a smaller capacity than the previous segment,
+// and segments are retrieved from a tiered list of sync.Pools as needed.
+//
+// Invariant: len(pooledBuffer.fullList()) <= maxNumSegments
+type pooledBuffer struct {
+ list [][]byte // list of segments except the last one
+ last []byte
+}
+
+func (b *pooledBuffer) fullList() [][]byte {
+ if cap(b.last) > 0 {
+ b.list = slices.Grow(b.list, 1)
+ return append(b.list, b.last) // avoid changing length of b.list
+ }
+ return b.list
+}
+
+func (b *pooledBuffer) Len() (n int) {
+ if len(b.list) == 0 {
+ return len(b.last)
+ }
+ for _, p := range b.fullList() {
+ n += len(p)
+ }
+ return n
+}
+
+func (b *pooledBuffer) Cap() int { return b.Len() + b.Available() }
+func (b *pooledBuffer) Available() int { return cap(b.AvailableBuffer()) }
+func (b *pooledBuffer) AvailableBuffer() []byte { return b.last[len(b.last):] }
+
+func (b *pooledBuffer) Grow(n int) {
+ if cap(b.last)-len(b.last) < n {
+ b.growSlow(n)
+ }
+}
+func (b *pooledBuffer) growSlow(n int) {
+ if cap(b.last) > 0 {
+ if n <= cap(b.last) {
+ n = cap(b.last) // ensure segments never decrease in capacity
+ if n < maxRetainSegmentSize {
+ n *= 2 // only double capacity until maxRetainSegmentSize
+ }
+ }
+ if len(b.last) == 0 {
+ putBuffer(b.last)
+ } else {
+ b.list = append(b.list, b.last)
+ }
+ }
+ b.last = getBuffer(n)
+}
+
+func (b *pooledBuffer) Write(p []byte) (int, error) {
+ b.Grow(len(p))
+ b.last = b.last[:len(b.last)+len(p)]
+ copy(b.last[len(b.last)-len(p):], p)
+ return len(p), nil
+}
+
+func (b *pooledBuffer) WriteString(s string) (int, error) {
+ b.Grow(len(s))
+ b.last = b.last[:len(b.last)+len(s)]
+ copy(b.last[len(b.last)-len(s):], s)
+ return len(s), nil
+}
+
+func (b *pooledBuffer) WriteByte(c byte) error {
+ b.Grow(1)
+ b.last = append(b.last, c)
+ return nil
+}
+
+func (b *pooledBuffer) WriteTo(w io.Writer) (n int64, err error) {
+ for _, p := range b.fullList() {
+ m, err := w.Write(p)
+ if err != nil {
+ return n, err
+ }
+ n += int64(m)
+ }
+ return n, err
+}
+
+// Bytes returns the buffer content as a single contiguous buffer.
+// It may need to merge segments to produce a contiguous buffer.
+func (b *pooledBuffer) Bytes() []byte {
+ if len(b.list) == 0 {
+ return b.last
+ }
+ p := getBuffer(b.Len())
+ for i, list := 0, b.fullList(); i < len(list); i++ {
+ p = append(p, list[i]...)
+ putBuffer(list[i])
+ list[i] = nil // allow GC to reclaim the buffer
+ }
+ b.list = b.list[:0]
+ b.last = p
+ return b.last
+}
+
+// BytesClone returns a copy of the buffer content.
+func (b *pooledBuffer) BytesClone() []byte {
+ if len(b.list) == 0 {
+ return bytes.Clone(b.last)
+ }
+ return bytes.Join(b.fullList(), nil)
+}
+
+func (b *pooledBuffer) Reset() {
+ // Return all segments to the pools except one segment,
+ // which may be retained locally if sufficiently small.
+ if len(b.list) == 0 && cap(b.last) <= maxRetainSegmentSize {
+ b.last = b.last[:0]
+ return
+ }
+ var retain []byte
+ list := b.fullList()
+ for i := len(list) - 1; i >= 0; i-- {
+ if retain == nil && cap(list[i]) <= maxRetainSegmentSize {
+ retain = list[i][:0] // retain locally, but clear the length
+ } else {
+ putBuffer(list[i])
+ }
+ list[i] = nil // allow GC to reclaim the buffer
+ }
+ b.list = b.list[:0]
+ b.last = retain
+}
+
+const (
+ minPooledSegmentShift = 12 // minimum size of buffer to pool
+ maxRetainSegmentShift = 16 // maximum size of buffer to retain locally
+ maxRetainSegmentSize = 1 << maxRetainSegmentShift
+ maxNumSegments = bits.UintSize - minPooledSegmentShift
+)
+
+// TODO(https://go.dev/issue/47657): Use sync.PoolOf.
+// putBuffer must allocate a slice header whenever calling sync.Pool.Put.
+// Using *[]byte instead of []byte avoids the allocation,
+// but complicates the logic of pooledBuffer and is slower as a result.
+
+// bufferPools is a list of buffer pools.
+// Each pool manages buffers of capacity within (1<<(shift-1) : 1<<shift],
+// where shift is (index+minPooledSegmentShift).
+var bufferPools [maxNumSegments]sync.Pool
+
+// getBuffer acquires an empty buffer with enough capacity to hold n bytes.
+func getBuffer(n int) []byte {
+ if n < 1<<minPooledSegmentShift {
+ n = 1 << minPooledSegmentShift
+ }
+ shift := bits.Len(uint(n - 1))
+ if p, _ := bufferPools[shift-minPooledSegmentShift].Get().([]byte); p != nil {
+ return p[:0]
+ }
+ return make([]byte, 0, 1<<shift)
+}
+
+// putBuffer releases a buffer back to the pools.
+func putBuffer(p []byte) {
+ if cap(p) < 1<<minPooledSegmentShift {
+ return
+ }
+ shift := bits.Len(uint(cap(p)) - 1)
+ // TODO: In race detector mode, asynchronously write to the buffer to detect
+ // buffers that may have accidentally leaked to users.
+ // See https://go.dev/issue/58452 for inspiration.
+ bufferPools[shift-minPooledSegmentShift].Put(p)
+}
diff --git a/src/encoding/json/buffers_test.go b/src/encoding/json/buffers_test.go
new file mode 100644
index 0000000..e481ebd
--- /dev/null
+++ b/src/encoding/json/buffers_test.go
@@ -0,0 +1,81 @@
+// Copyright 2023 The Go Authors. All rights reserved.
+// Use of this source code is governed by a BSD-style
+// license that can be found in the LICENSE file.
+
+package json
+
+import (
+ "bytes"
+ "crypto/md5"
+ "io"
+ "math/bits"
+ "math/rand"
+ "testing"
+)
+
+func FuzzBuffer(f *testing.F) {
+ f.Add(int64(0))
+ f.Fuzz(func(t *testing.T, seed int64) {
+ const maxCapacity = 1 << (maxRetainSegmentShift + 4)
+ rn := rand.New(rand.NewSource(seed))
+ var gotBuffer pooledBuffer
+ var wantBuffer bytes.Buffer
+ var exit bool
+ for i := 0; !exit; i++ {
+ exit = gotBuffer.Cap() > maxCapacity || wantBuffer.Cap() > maxCapacity || i >= 100
+ switch {
+ case gotBuffer.Len() != wantBuffer.Len():
+ t.Fatalf("Buffer.Len = %d, want %d", gotBuffer.Len(), wantBuffer.Len())
+ case gotBuffer.Len() > gotBuffer.Cap():
+ t.Fatalf("Buffer.Len > Buffer.Cap: %d > %d", gotBuffer.Len(), gotBuffer.Cap())
+ case len(gotBuffer.AvailableBuffer()) > 0:
+ t.Fatalf("len(Buffer.AvailableBuffer) = %d, want 0", len(gotBuffer.AvailableBuffer()))
+ case cap(gotBuffer.AvailableBuffer()) > gotBuffer.Cap()-gotBuffer.Len():
+ t.Fatalf("cap(Buffer.AvailableBuffer) = %d, want %d", cap(gotBuffer.AvailableBuffer()), gotBuffer.Cap()-gotBuffer.Len())
+ }
+
+ // Random size that is exponentially spaced.
+ n := (1 << rn.Intn(bits.Len(maxCapacity-1))) >> 1
+ n += rn.Intn(n + 1)
+
+ // Randomly perform some mutating buffer operation.
+ switch j := rn.Intn(20); {
+ case j < 8 && !exit: // 40% probability
+ b := gotBuffer.AvailableBuffer()
+ b = b[:rn.Intn(cap(b)+1)]
+ io.ReadFull(rn, b)
+ gotBuffer.Write(b)
+ wantBuffer.Write(b)
+ case j < 14 && !exit: // 30% probability
+ b := make([]byte, n)
+ io.ReadFull(rn, b)
+ gotBuffer.Write(b)
+ wantBuffer.Write(b)
+ case j < 19 && !exit: // 25% probability
+ gotBuffer.Grow(n)
+ default: // 5% probability
+ var prev []byte
+ for _, p := range gotBuffer.fullList() {
+ if cap(p) < cap(prev) {
+ t.Fatalf("buffer segments are not growing in size")
+ }
+ prev = p
+ }
+ var gotBytes []byte
+ if rn.Intn(2) == 0 {
+ gotBytes = gotBuffer.Bytes()
+ } else {
+ gotBytes = gotBuffer.BytesClone()
+ }
+ if !bytes.Equal(gotBytes, wantBuffer.Bytes()) {
+ t.Fatalf("content mismatch: %x != %x", md5.Sum(gotBytes), md5.Sum(wantBuffer.Bytes()))
+ }
+ gotBuffer.Reset()
+ wantBuffer.Reset()
+ if gotBuffer.Len() > 0 {
+ t.Fatalf("Buffer.Len = %d, want 0", gotBuffer.Len())
+ }
+ }
+ }
+ })
+}
diff --git a/src/encoding/json/encode.go b/src/encoding/json/encode.go
index ff0e40d..602583a 100644
--- a/src/encoding/json/encode.go
+++ b/src/encoding/json/encode.go
@@ -155,16 +155,13 @@
// handle them. Passing cyclic structures to Marshal will result in
// an error.
func Marshal(v any) ([]byte, error) {
- e := newEncodeState()
- defer encodeStatePool.Put(e)
-
+ e := getEncodeState()
+ defer putEncodeState(e)
err := e.marshal(v, encOpts{escapeHTML: true})
if err != nil {
return nil, err
}
- buf := append([]byte(nil), e.Bytes()...)
-
- return buf, nil
+ return e.BytesClone(), nil
}
// MarshalIndent is like Marshal but applies Indent to format the output.
@@ -281,9 +278,9 @@
var hex = "0123456789abcdef"
-// An encodeState encodes JSON into a bytes.Buffer.
+// An encodeState encodes JSON into a pooled buffer.
type encodeState struct {
- bytes.Buffer // accumulated output
+ pooledBuffer // accumulated output
// Keep track of what pointers we've seen in the current recursive call
// path, to avoid cycles that could lead to a stack overflow. Only do
@@ -294,18 +291,13 @@
ptrSeen map[any]struct{}
}
-func (e *encodeState) AvailableBuffer() []byte {
- return availableBuffer(&e.Buffer)
-}
-
const startDetectingCyclesAfter = 1000
var encodeStatePool sync.Pool
-func newEncodeState() *encodeState {
+func getEncodeState() *encodeState {
if v := encodeStatePool.Get(); v != nil {
e := v.(*encodeState)
- e.Reset()
if len(e.ptrSeen) > 0 {
panic("ptrEncoder.encode should have emptied ptrSeen via defers")
}
@@ -315,6 +307,11 @@
return &encodeState{ptrSeen: make(map[any]struct{})}
}
+func putEncodeState(e *encodeState) {
+ e.Reset()
+ encodeStatePool.Put(e)
+}
+
// jsonError is an error wrapper type for internal use only.
// Panics with errors are wrapped in jsonError so that the top-level recover
// can distinguish intentional panics from this package.
@@ -480,9 +477,9 @@
b, err := m.MarshalJSON()
if err == nil {
e.Grow(len(b))
- out := availableBuffer(&e.Buffer)
+ out := e.AvailableBuffer()
out, err = appendCompact(out, b, opts.escapeHTML)
- e.Buffer.Write(out)
+ e.Write(out)
}
if err != nil {
e.error(&MarshalerError{v.Type(), err, "MarshalJSON"})
@@ -499,9 +496,9 @@
b, err := m.MarshalJSON()
if err == nil {
e.Grow(len(b))
- out := availableBuffer(&e.Buffer)
+ out := e.AvailableBuffer()
out, err = appendCompact(out, b, opts.escapeHTML)
- e.Buffer.Write(out)
+ e.Write(out)
}
if err != nil {
e.error(&MarshalerError{v.Type(), err, "MarshalJSON"})
diff --git a/src/encoding/json/stream.go b/src/encoding/json/stream.go
index b4146a3..e63b1cc 100644
--- a/src/encoding/json/stream.go
+++ b/src/encoding/json/stream.go
@@ -179,8 +179,8 @@
// An Encoder writes JSON values to an output stream.
type Encoder struct {
- w io.Writer
- err error
+ wr io.Writer
+ wrErr error
escapeHTML bool
indentBuf []byte
@@ -190,7 +190,7 @@
// NewEncoder returns a new encoder that writes to w.
func NewEncoder(w io.Writer) *Encoder {
- return &Encoder{w: w, escapeHTML: true}
+ return &Encoder{wr: w, escapeHTML: true}
}
// Encode writes the JSON encoding of v to the stream,
@@ -199,12 +199,12 @@
// See the documentation for Marshal for details about the
// conversion of Go values to JSON.
func (enc *Encoder) Encode(v any) error {
- if enc.err != nil {
- return enc.err
+ if enc.wrErr != nil {
+ return enc.wrErr
}
- e := newEncodeState()
- defer encodeStatePool.Put(e)
+ e := getEncodeState()
+ defer putEncodeState(e)
err := e.marshal(v, encOpts{escapeHTML: enc.escapeHTML})
if err != nil {
@@ -219,18 +219,16 @@
// digits coming.
e.WriteByte('\n')
- b := e.Bytes()
if enc.indentPrefix != "" || enc.indentValue != "" {
- enc.indentBuf, err = appendIndent(enc.indentBuf[:0], b, enc.indentPrefix, enc.indentValue)
+ enc.indentBuf, err = appendIndent(enc.indentBuf[:0], e.Bytes(), enc.indentPrefix, enc.indentValue)
if err != nil {
return err
}
- b = enc.indentBuf
+ _, enc.wrErr = enc.wr.Write(enc.indentBuf)
+ } else {
+ _, enc.wrErr = e.WriteTo(enc.wr)
}
- if _, err = enc.w.Write(b); err != nil {
- enc.err = err
- }
- return err
+ return enc.wrErr
}
// SetIndent instructs the encoder to format each subsequent encoded
Patch set 1: Run-TryBot+1, Auto-Submit+1
1 comment:
Commit Message:
Patch Set #1, Line 75: For larger outputs (e.g., CodeMarshal), we do see a performance degradation.
Maybe something else is wrong, too? On my CL I saw:
```
CodeMarshal-16 1.364m ± 1% 1.439m ± ∞ ¹ +5.47% (p=0.000 n=30+5)
CodeMarshalError-16 1.570m ± 1% 1.686m ± ∞ ¹ +7.39% (p=0.000 n=30+5)
```
Which is a much smaller degradation.
1 comment:
Commit Message:
Patch Set #1, Line 75: For larger outputs (e.g., CodeMarshal), we do see a performance degradation.
Maybe something else is wrong, too? on my CL I saw: […]
Quite possibly due to different CPU architectures. I'm on an AMD Ryzen 5900X.
Could you run both this CL and your CL and see how they compare?
That way we can compare apples to apples.
1 comment:
Commit Message:
Patch Set #1, Line 75: For larger outputs (e.g., CodeMarshal), we do see a performance degradation.
Quite possibly due to different CPU architectures. I'm on a AMD Ryzen 5900x. […]
Sure -- but on Monday :)
1 comment:
Commit Message:
Patch Set #1, Line 75: For larger outputs (e.g., CodeMarshal), we do see a performance degradation.
Quite possibly due to different CPU architectures. I'm on a AMD Ryzen 5900x. […]
Also, quite possibly related to RAM speed too. I cheaped out and bought mediocre RAM for my computer.
Joseph Tsai uploaded patch set #2 to this change.
The following approvals got outdated and were removed: Auto-Submit+1 by Joseph Tsai
encoding/json: improve Marshal memory utilization
Previously, Marshal (and Encoder.Encode) used a pooled encodeState
that included an embedded bytes.Buffer.
This is an incorrect use of sync.Pool (see #23199).
This results in problems when a high-frequency stream of tiny Marshals
is interspersed with a low-frequency stream of huge Marshals.
The occasional huge Marshal will cause bytes.Buffer to be grown in size
to accommodate the large output, but never be reduced in size.
However, a continuous stream of tiny Marshals will keep buffers alive,
preventing GC from being able to reclaim the under-utilized capacity.
Previous attempts to trivially cap the buffer capacity (see CL 136035)
resulted in unacceptably poor performance in marshaling large outputs.
We fix this problem by switching from bytes.Buffer to a pooledBuffer
implementation that uses a series of pooled buffer segments.
The content of the buffer is the concatenation of all buffer segments.
Using buffer segments allows the implementation to more readily return
unused buffers to the pool, and allows the GC to reclaim them if necessary.
The semantics of the Marshal function require that the returned buffer
be a unique copy owned by the caller.
With the segmented buffer, we use bytes.Join to join all the segments,
while the previous code did the semantic equivalent of bytes.Clone.
Performance:
name                    old time/op    new time/op    delta
LargeOutput             1.85s  ± 3%    1.31s  ± 2%    -29.25%
CodeEncoder             394µs  ± 1%    382µs  ± 2%     -2.95%
CodeEncoderError        437µs  ± 1%    428µs  ± 1%     -1.97%
AllocationWastage       396µs  ± 2%    476µs  ± 1%    +20.16%
CodeMarshal             534µs  ± 1%    712µs  ± 3%    +33.31%
CodeMarshalError        694µs  ± 2%    903µs  ± 4%    +29.98%
MarshalBytes/32         170ns  ± 3%    164ns  ± 1%     -3.22%
MarshalBytes/256        387ns  ± 2%    381ns  ± 1%     -1.63%
MarshalBytes/4096       3.93µs ± 2%    3.91µs ± 1%       ~
MarshalBytesError/32    372µs  ± 1%    370µs  ± 0%     -0.71%
MarshalBytesError/256   372µs  ± 1%    370µs  ± 1%       ~
MarshalBytesError/4096  379µs  ± 1%    375µs  ± 1%     -1.03%
EncodeMarshaler         22.8ns ± 2%    24.4ns ± 2%     +7.06%
EncoderEncode           12.5ns ± 3%    13.6ns ± 8%     +8.23%

name                    old alloc/op   new alloc/op     delta
LargeOutput             1.55GB ± 0%    0.97GB ± 0%        -37.61%
CodeEncoder             4.00B  ± 0%    3608.40B ±77%   +90110.00%
CodeEncoderError        141B   ± 2%    2605B   ±207%    +1751.22%
AllocationWastage       2.87B  ±30%    2017.10B ±54%   +70060.00%
CodeMarshal             1.96MB ± 2%    1.99MB ± 1%         +1.52%
CodeMarshalError        2.04MB ± 2%    2.00MB ± 1%         -1.89%
EncodeMarshaler         4.00B  ± 0%    4.00B  ± 0%            ~
EncoderEncode           0.00B          0.00B                  ~

name                    old allocs/op  new allocs/op    delta
LargeOutput             44.8 ±23%      19649.6 ± 0%    +43760.71%
CodeEncoder             0.00           47.00 ± 0%           +Inf%
CodeEncoderError        4.00 ± 0%      51.00 ± 0%       +1175.00%
AllocationWastage       0.00           52.00 ± 0%           +Inf%
CodeMarshal             1.00 ± 0%      53.00 ± 0%       +5200.00%
CodeMarshalError        6.30 ±11%      59.40 ± 1%        +842.86%
EncodeMarshaler         1.00 ± 0%      1.00 ± 0%              ~
EncoderEncode           0.00           0.00                   ~

name                    old wasted_b/op  new wasted_b/op  delta
AllocationWastage       2.25M ± 0%       0.16M ± 0%       -92.88%
There is no notable performance difference on small outputs.
For larger outputs (e.g., CodeMarshal), we see a performance drop.
This is unavoidable, as a single contiguous buffer that is
large enough is always better for CPU performance for many reasons.
However, the nature of the Marshal call makes it such that it is
impossible to know what size buffer we need prior to marshaling.
The loss of performance is deemed acceptable in order to gain
stronger guarantees on the maximum amount of heap used.
Part of the cost of using bytes.Join over bytes.Clone is that
bytes.Clone can rely on compiler optimizations that avoid unnecessarily
zeroing the buffer for the portion that is gauranteed to be overwritten.
The bytes.Join function does not benefit from the compiler optimization.
In CL 456336, bytes.Join is optimized to also avoid the zeroing.
This results in the performance loss being reduced to:
name              old time/op    new time/op    delta
CodeMarshal       534µs ± 1%     633µs ± 4%     +18.46%
CodeMarshalError  694µs ± 2%     789µs ± 3%     +13.64%
This is a reduction of performance loss from +33% down to +18%,
but is still much better than the performance loss of +72%
without the use of any pooled buffers at all.
The idea of a segmented buffer came about during discussion with
Daniel Smith on CL 455236 as a counterproposal
to using pooled contiguous buffers.
The benchmarks come from Daniel Smith's implementation in CL 455776.
Huge thanks go to Daniel Smith, who did the bulk of the work
to prove that segmented buffers were a viable approach.
The implementation in this CL is a simplified version of CL 455776.
Fixes #27735
Co-authored-by: Daniel Smith <dbs...@google.com>
Change-Id: Ib23f00eb64615b21b1ddddf9fadc6498bcefa935
---
M src/encoding/json/bench_test.go
A src/encoding/json/buffers.go
A src/encoding/json/buffers_test.go
M src/encoding/json/encode.go
M src/encoding/json/stream.go
5 files changed, 496 insertions(+), 32 deletions(-)
3 comments:
Commit Message:
Patch Set #1, Line 75: For larger outputs (e.g., CodeMarshal), we do see a performance degradation.
Also, quite possible related to RAM speed too. […]
I checked this and your change is actually an improvement on baseline, congrats. I was pretty skeptical of the Grow/AvailableBuffer paradigm but I guess if you start out with that approach it works just fine.
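For reference, the Grow/AvailableBuffer pattern in question, condensed from encode.go in this CL:
```
e.Grow(len(b))             // reserve capacity in the last segment
out := e.AvailableBuffer() // empty slice aliasing that spare capacity
out, err = appendCompact(out, b, opts.escapeHTML)
e.Write(out)               // room was reserved, so no new segment is started
```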
I am using some sort of cloud system, so my numbers are probably less meaningful than yours, but here they are anyway:
```
goos: linux
goarch: amd64
pkg: encoding/json
cpu: Intel(R) Xeon(R) CPU @ 2.20GHz
│ change-455776.bench.txt │ change-471200.bench.txt │
│ sec/op │ sec/op vs base │
LargeOutput-16 2.359 ± 1% 2.185 ± 3% -7.37% (p=0.002 n=10)
CodeEncoder-16 1.163m ± 7% 1.064m ± 1% -8.47% (p=0.000 n=10)
CodeEncoderError-16 1.239m ± 2% 1.170m ± 1% -5.54% (p=0.000 n=10)
AllocationWastage-16 1.491m ± 3% 1.099m ± 1% -26.25% (p=0.000 n=10)
CodeMarshal-16 1.474m ± 15% 1.383m ± 2% -6.15% (p=0.000 n=10)
CodeMarshalError-16 1.696m ± 2% 1.674m ± 1% ~ (p=0.075 n=10)
MarshalBytes/32-16 305.8n ± 2% 283.9n ± 1% -7.16% (p=0.000 n=10)
MarshalBytes/256-16 873.6n ± 3% 746.9n ± 1% -14.51% (p=0.000 n=10)
MarshalBytes/4096-16 8.806µ ± 0% 8.310µ ± 0% -5.64% (p=0.000 n=10)
MarshalBytesError/32-16 632.8µ ± 1% 660.9µ ± 1% +4.43% (p=0.000 n=10)
MarshalBytesError/256-16 631.3µ ± 1% 667.3µ ± 1% +5.72% (p=0.000 n=10)
MarshalBytesError/4096-16 647.0µ ± 1% 676.1µ ± 1% +4.49% (p=0.000 n=10)
CodeDecoder-16 5.080m ± 1% 5.145m ± 3% ~ (p=0.089 n=10)
UnicodeDecoder-16 334.0n ± 1% 333.4n ± 0% ~ (p=0.644 n=10)
DecoderStream-16 248.0n ± 1% 246.8n ± 1% -0.48% (p=0.050 n=10)
CodeUnmarshal-16 6.503m ± 2% 6.515m ± 1% ~ (p=0.481 n=10)
CodeUnmarshalReuse-16 5.185m ± 2% 5.210m ± 1% ~ (p=0.796 n=10)
UnmarshalString-16 98.07n ± 2% 104.85n ± 4% +6.92% (p=0.000 n=10)
UnmarshalFloat64-16 88.93n ± 2% 89.68n ± 2% +0.85% (p=0.009 n=10)
UnmarshalInt64-16 76.63n ± 6% 79.98n ± 4% +4.38% (p=0.007 n=10)
Issue10335-16 129.4n ± 3% 129.4n ± 2% ~ (p=0.810 n=10)
Issue34127-16 60.48n ± 3% 57.61n ± 3% -4.75% (p=0.000 n=10)
Unmapped-16 288.7n ± 2% 283.7n ± 1% -1.75% (p=0.023 n=10)
TypeFieldsCache/MissTypes1-16 18.31µ ± 4% 18.29µ ± 2% ~ (p=0.565 n=10)
TypeFieldsCache/MissTypes10-16 74.79µ ± 1% 75.51µ ± 2% ~ (p=0.105 n=10)
TypeFieldsCache/MissTypes100-16 272.4µ ± 1% 277.8µ ± 1% +1.98% (p=0.001 n=10)
TypeFieldsCache/MissTypes1000-16 2.347m ± 2% 2.321m ± 2% ~ (p=0.315 n=10)
TypeFieldsCache/MissTypes10000-16 19.92m ± 2% 20.03m ± 1% ~ (p=0.247 n=10)
TypeFieldsCache/MissTypes100000-16 218.4m ± 2% 217.5m ± 4% ~ (p=0.971 n=10)
TypeFieldsCache/MissTypes1000000-16 2.539 ± 16% 2.596 ± 18% ~ (p=0.481 n=10)
TypeFieldsCache/HitTypes1-16 3.526n ± 1% 3.460n ± 1% -1.86% (p=0.009 n=10)
TypeFieldsCache/HitTypes10-16 3.478n ± 1% 3.486n ± 2% ~ (p=0.272 n=10)
TypeFieldsCache/HitTypes100-16 3.388n ± 1% 3.486n ± 1% +2.86% (p=0.000 n=10)
TypeFieldsCache/HitTypes1000-16 3.436n ± 1% 3.385n ± ∞ ¹ ~ (p=0.286 n=6+1)
```
If you were to spend more time looking at this, the only thing I would look at is LargeOutput's allocation count; everything else looks great.
```
│ change-455776.bench.txt │ change-471200.bench.txt │
│ allocs/op │ allocs/op vs base │
LargeOutput-16 5.674k ± 1% 12.194k ± 0% +114.90% (p=0.000 n=10)
```
File src/encoding/json/buffers.go:
Patch Set #2, Line 150: maxRetainSegmentSize
This actually appears to retain all segment sizes; this is just where it stops making each segment twice as big as the prior one, right? (Likely this explains why it runs the allocation wastage test so much faster.) So I would maybe change this name.
File src/encoding/json/buffers_test.go:
Patch Set #2, Line 41: // Randomly perform some mutating buffer operation.
Should probably exercise WriteString and WriteByte somewhere in here?
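For example, two additional cases in the existing switch might look like this (hypothetical; probabilities illustrative):
```
case j < 16 && !exit: // exercise WriteString
	b := make([]byte, n)
	io.ReadFull(rn, b)
	gotBuffer.WriteString(string(b))
	wantBuffer.WriteString(string(b))
case j < 18 && !exit: // exercise WriteByte
	c := byte(rn.Intn(256))
	gotBuffer.WriteByte(c)
	wantBuffer.WriteByte(c)
```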
Patch set 2: Code-Review+1
2 comments:
Patchset:
Thanks for doing this, and I feel bad again for adding the example that helped promote this use.
Commit Message:
Patch Set #2, Line 85: zeroing the buffer for the portion that is gauranteed to be overwritten.
guaranteed