Joseph Tsai has uploaded this change for review.
encoding/json: improve Marshal memory utilization
Previously, Marshal (and Encoder.Encode) used a pooled encodeState
that included an embedded bytes.Buffer.
This is an incorrect use of sync.Pool (see #23199).
This results in problems when a high-frequency stream of tiny Marshals
is interspersed with a low-frequency stream of huge Marshals.
The occasional huge Marshal will cause bytes.Buffer to be grown in size
to accommodate the large output, but never be reduced in size.
However, the continuous stream of tiny Marshals will keep the buffers alive,
preventing the GC from being able to reclaim the under-utilized capacity.
Previous attempts to trivially cap the buffer capacity (see CL 136035)
resulted in unacceptably poor performance in marshaling large outputs.
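As a minimal sketch of the problematic pattern (illustrative names and
package, not the actual encoder code):
```
package pooldemo

import (
	"bytes"
	"encoding/json"
	"sync"
)

var bufPool = sync.Pool{New: func() any { return new(bytes.Buffer) }}

// marshalPooled mimics the old scheme: one pooled contiguous buffer,
// cloned on return so the caller owns a unique copy.
func marshalPooled(v any) ([]byte, error) {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	// Put never shrinks the buffer: one huge value grows its capacity
	// permanently, while a steady stream of tiny values keeps it cycling
	// through the pool, hiding the unused capacity from the GC.
	defer bufPool.Put(buf)
	if err := json.NewEncoder(buf).Encode(v); err != nil {
		return nil, err
	}
	return append([]byte(nil), buf.Bytes()...), nil
}
```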
We fix this problem by switching from bytes.Buffer to a custom pooledBuffer
implementation that uses a series of pooled buffer segments.
The content of the buffer is the concatenation of all the buffer segments.
Using buffer segments allows the implementation to more readily return
unused buffers to the pool, and allows the GC to reclaim them if necessary.
The semantics of the Marshal function require that the returned buffer
be a unique copy owned by the caller.
With the segmented buffer, we use bytes.Join to join all the segments,
while the previous code did the semantic equivalent of bytes.Clone.
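Concretely, the caller's copy changes from a clone of one contiguous
buffer to a join of the segments (both forms appear in the diff below):
```
// Previously (contiguous buffer): semantically a bytes.Clone.
bufOld := append([]byte(nil), e.Bytes()...)

// Now (segmented buffer): concatenate every segment into one freshly
// allocated slice that the caller owns.
bufNew := bytes.Join(e.fullList(), nil)
```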
Performance:
name                    old time/op    new time/op    delta
LargeOutput             1.85s  ± 3%    1.31s  ± 2%    -29.25%  (p=0.000 n=10+8)
CodeEncoder             394µs  ± 1%    382µs  ± 2%     -2.95%  (p=0.000 n=9+10)
CodeEncoderError        437µs  ± 1%    428µs  ± 1%     -1.97%  (p=0.000 n=9+9)
AllocationWastage       396µs  ± 2%    476µs  ± 1%    +20.16%  (p=0.000 n=10+8)
CodeMarshal             534µs  ± 1%    712µs  ± 3%    +33.31%  (p=0.000 n=10+10)
CodeMarshalError        694µs  ± 2%    903µs  ± 4%    +29.98%  (p=0.000 n=10+10)
MarshalBytes/32         170ns  ± 3%    164ns  ± 1%     -3.22%  (p=0.000 n=10+10)
MarshalBytes/256        387ns  ± 2%    381ns  ± 1%     -1.63%  (p=0.000 n=10+10)
MarshalBytes/4096       3.93µs ± 2%    3.91µs ± 1%       ~     (p=0.184 n=10+10)
MarshalBytesError/32    372µs  ± 1%    370µs  ± 0%     -0.71%  (p=0.002 n=10+8)
MarshalBytesError/256   372µs  ± 1%    370µs  ± 1%       ~     (p=0.133 n=9+10)
MarshalBytesError/4096  379µs  ± 1%    375µs  ± 1%     -1.03%  (p=0.007 n=10+10)
EncodeMarshaler         22.8ns ± 2%    24.4ns ± 2%     +7.06%  (p=0.000 n=8+9)
EncoderEncode           12.5ns ± 3%    13.6ns ± 8%     +8.23%  (p=0.005 n=8+8)

name                    old alloc/op   new alloc/op     delta
LargeOutput             1.55GB ± 0%    0.97GB ± 0%        -37.61%  (p=0.000 n=10+7)
CodeEncoder             4.00B  ± 0%    3608.40B ±77%   +90110.00%  (p=0.000 n=9+10)
CodeEncoderError        141B   ± 2%    2605B   ±207%    +1751.22%  (p=0.000 n=10+9)
AllocationWastage       2.87B  ±30%    2017.10B ±54%   +70060.00%  (p=0.000 n=8+10)
CodeMarshal             1.96MB ± 2%    1.99MB ± 1%         +1.52%  (p=0.000 n=10+10)
CodeMarshalError        2.04MB ± 2%    2.00MB ± 1%         -1.89%  (p=0.000 n=10+9)
EncodeMarshaler         4.00B  ± 0%    4.00B  ± 0%            ~    (all equal)
EncoderEncode           0.00B          0.00B                  ~    (all equal)

name                    old allocs/op  new allocs/op    delta
LargeOutput             44.8 ±23%      19649.6 ± 0%    +43760.71%  (p=0.000 n=10+10)
CodeEncoder             0.00           47.00 ± 0%           +Inf%  (p=0.000 n=10+10)
CodeEncoderError        4.00 ± 0%      51.00 ± 0%       +1175.00%  (p=0.000 n=10+10)
AllocationWastage       0.00           52.00 ± 0%           +Inf%  (p=0.000 n=10+10)
CodeMarshal             1.00 ± 0%      53.00 ± 0%       +5200.00%  (p=0.000 n=10+8)
CodeMarshalError        6.30 ±11%      59.40 ± 1%        +842.86%  (p=0.000 n=10+10)
EncodeMarshaler         1.00 ± 0%      1.00 ± 0%              ~    (all equal)
EncoderEncode           0.00           0.00                   ~    (all equal)

name                    old wasted_b/op  new wasted_b/op  delta
AllocationWastage       2.25M ± 0%       0.16M ± 0%       -92.88%  (p=0.000 n=10+10)
There is no notable performance difference on small output (i.e., < 64KiB).
For larger outputs (e.g., CodeMarshal), we do see a performance degradation.
This is unavoidable, as a single contiguous buffer that is large enough
is always better for CPU performance for many reasons.
However, the nature of the Marshal call makes it such that it is impossible
to know what size buffer we need prior to marshaling the value itself.
The loss of performance is deemed acceptable in order to gain
stronger guarantees on the maximum amount of heap used.
Part of the cost of using bytes.Join over bytes.Clone is that
bytes.Clone can rely on compiler optimizations that avoid unnecessarily
zeroing the buffer for the portion that is gauranteed to be overwritten.
The bytes.Join function does not benefit from that compiler optimization.
In CL 456336, bytes.Join is optimized to also avoid the zeroing.
This results in the performance loss being reduced to:
name              old time/op    new time/op    delta
CodeMarshal       534µs ± 1%     633µs ± 4%     +18.46%  (p=0.000 n=10+10)
CodeMarshalError  694µs ± 2%     789µs ± 3%     +13.64%  (p=0.000 n=10+9)
This is a reduction of performance loss from +33% down to +18%,
but is still much better than the performance loss of +72%
without the use of any pooled buffers at all.
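A sketch of where the zeroing cost comes from (illustrative of the
mechanism, not the actual bytes package source):
```
// Clone via append: growslice allocates without zeroing the region
// that the copy immediately overwrites, so no byte is written twice.
func cloneFast(b []byte) []byte {
	return append([]byte(nil), b...)
}

// Join via make-then-copy: make([]byte, n) zeroes all n bytes up
// front, even though the copies below overwrite every one of them.
func joinNaive(segs [][]byte) []byte {
	n := 0
	for _, s := range segs {
		n += len(s)
	}
	out := make([]byte, n)
	i := 0
	for _, s := range segs {
		i += copy(out[i:], s)
	}
	return out
}
```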
The idea of a segmented buffer came about during discussion with
Daniel Smith on CL 455236 as a counterproposal
to using pooled contiguous buffers.
The benchmarks come from Daniel Smith's alternate implementation in CL 455776.
Huge thanks go to Daniel Smith, who did the bulk of the work
to prove that segmented buffers were a viable approach.
The implementation in this CL is a simplified version of CL 455776.
Fixes #27735
Change-Id: Ib23f00eb64615b21b1ddddf9fadc6498bcefa935
---
M src/encoding/json/bench_test.go
A src/encoding/json/buffers.go
A src/encoding/json/buffers_test.go
M src/encoding/json/encode.go
M src/encoding/json/stream.go
5 files changed, 495 insertions(+), 32 deletions(-)
diff --git a/src/encoding/json/bench_test.go b/src/encoding/json/bench_test.go
index d3af0dc..5d1741b 100644
--- a/src/encoding/json/bench_test.go
+++ b/src/encoding/json/bench_test.go
@@ -22,6 +22,7 @@
"runtime"
"strings"
"sync"
+ "sync/atomic"
"testing"
)
@@ -82,6 +83,30 @@
}
}
+func BenchmarkLargeOutput(b *testing.B) {
+ b.ReportAllocs()
+
+ b.StopTimer()
+ if codeJSON == nil {
+ codeInit()
+ }
+ list := make([]*codeResponse, 250)
+ for i := range list {
+ list[i] = &codeStruct
+ }
+ b.StartTimer()
+
+ for i := 0; i < b.N; i++ {
+ buf, err := Marshal(list)
+ if err != nil {
+ b.Fatal("Marshal:", err)
+ }
+ if i == b.N-1 {
+ b.SetBytes(int64(len(buf)))
+ }
+ }
+}
+
func BenchmarkCodeEncoder(b *testing.B) {
b.ReportAllocs()
if codeJSON == nil {
@@ -130,6 +155,74 @@
b.SetBytes(int64(len(codeJSON)))
}
+func (e *encodeState) wastedSpace() (n int) {
+ for _, p := range e.fullList() {
+ n += cap(p) - len(p)
+ }
+ return n
+}
+
+func marshalReturnWastage(v any) ([]byte, int, error) {
+ e := getEncodeState()
+ defer putEncodeState(e)
+
+ err := e.marshal(v, encOpts{escapeHTML: true})
+ if err != nil {
+ return nil, e.wastedSpace(), err
+ }
+ return e.Bytes(), e.wastedSpace(), nil
+}
+
+func BenchmarkAllocationWastage(b *testing.B) {
+ b.ReportAllocs()
+
+ b.StopTimer()
+ type SmallNested struct {
+ Name string
+ Next *SmallNested
+ }
+ sn := &SmallNested{Name: "Small and nested"}
+ for i := 0; i < 10; i++ {
+ sn = &SmallNested{
+ Name: fmt.Sprintf("loop iteration %v", i),
+ Next: sn,
+ }
+ }
+ smallObj, err := Marshal(sn)
+ if err != nil {
+ b.Fatal("Marshal:", err)
+ }
+ if codeJSON == nil {
+ codeInit()
+ }
+ b.StartTimer()
+
+ var totalTotal int64
+ var totalCount int64
+
+ b.RunParallel(func(pb *testing.PB) {
+ var totalWasted int64
+ var iterCount int
+ for pb.Next() {
+ if _, wasted, err := marshalReturnWastage(&codeStruct); err != nil {
+ b.Fatal("Marshal:", err)
+ } else {
+ totalWasted += int64(wasted)
+ }
+ if _, wasted, err := marshalReturnWastage(sn); err != nil {
+ b.Fatal("Marshal:", err)
+ } else {
+ totalWasted += int64(wasted)
+ }
+ iterCount++
+ }
+ atomic.AddInt64(&totalTotal, totalWasted)
+ atomic.AddInt64(&totalCount, int64(iterCount))
+ })
+ b.ReportMetric(float64(totalTotal)/float64(totalCount), "wasted_b/op")
+ b.ReportMetric(float64(len(smallObj)+len(codeJSON)), "useful_b/op")
+}
+
func BenchmarkCodeMarshal(b *testing.B) {
b.ReportAllocs()
if codeJSON == nil {
diff --git a/src/encoding/json/buffers.go b/src/encoding/json/buffers.go
new file mode 100644
index 0000000..2da5728
--- /dev/null
+++ b/src/encoding/json/buffers.go
@@ -0,0 +1,186 @@
+// Copyright 2023 The Go Authors. All rights reserved.
+// Use of this source code is governed by a BSD-style
+// license that can be found in the LICENSE file.
+
+package json
+
+import (
+ "bytes"
+ "io"
+ "math/bits"
+ "slices"
+ "sync"
+)
+
+// pooledBuffer is similar to bytes.Buffer,
+// but uses a series of segmented buffers instead of a single contiguous buffer.
+// Each segment never has a smaller capacity than the previous segment,
+// and segments are retrieved from a tiered list of sync.Pools as needed.
+//
+// Invariant: len(pooledBuffer.fullList()) <= maxNumSegments
+type pooledBuffer struct {
+ list [][]byte // list of segments except the last one
+ last []byte
+}
+
+func (b *pooledBuffer) fullList() [][]byte {
+ if cap(b.last) > 0 {
+ b.list = slices.Grow(b.list, 1)
+ return append(b.list, b.last) // avoid changing length of b.list
+ }
+ return b.list
+}
+
+func (b *pooledBuffer) Len() (n int) {
+ if len(b.list) == 0 {
+ return len(b.last)
+ }
+ for _, p := range b.fullList() {
+ n += len(p)
+ }
+ return n
+}
+
+func (b *pooledBuffer) Cap() int { return b.Len() + b.Available() }
+func (b *pooledBuffer) Available() int { return cap(b.AvailableBuffer()) }
+func (b *pooledBuffer) AvailableBuffer() []byte { return b.last[len(b.last):] }
+
+func (b *pooledBuffer) Grow(n int) {
+ if cap(b.last)-len(b.last) < n {
+ b.growSlow(n)
+ }
+}
+func (b *pooledBuffer) growSlow(n int) {
+ if cap(b.last) > 0 {
+ if n <= cap(b.last) {
+ n = cap(b.last) // ensure segments never decrease in capacity
+ if n < maxRetainSegmentSize {
+ n *= 2 // only double capacity until maxRetainSegmentSize
+ }
+ }
+ if len(b.last) == 0 {
+ putBuffer(b.last)
+ } else {
+ b.list = append(b.list, b.last)
+ }
+ }
+ b.last = getBuffer(n)
+}
+
+func (b *pooledBuffer) Write(p []byte) (int, error) {
+ b.Grow(len(p))
+ b.last = b.last[:len(b.last)+len(p)]
+ copy(b.last[len(b.last)-len(p):], p)
+ return len(p), nil
+}
+
+func (b *pooledBuffer) WriteString(s string) (int, error) {
+ b.Grow(len(s))
+ b.last = b.last[:len(b.last)+len(s)]
+ copy(b.last[len(b.last)-len(s):], s)
+ return len(s), nil
+}
+
+func (b *pooledBuffer) WriteByte(c byte) error {
+ b.Grow(1)
+ b.last = append(b.last, c)
+ return nil
+}
+
+func (b *pooledBuffer) WriteTo(w io.Writer) (n int64, err error) {
+ for _, p := range b.fullList() {
+ m, err := w.Write(p)
+ if err != nil {
+ return n, err
+ }
+ n += int64(m)
+ }
+ return n, err
+}
+
+// Bytes returns the buffer content as a single contiguous buffer.
+// It may need to merge segments to produce a contiguous buffer.
+func (b *pooledBuffer) Bytes() []byte {
+ if len(b.list) == 0 {
+ return b.last
+ }
+ p := getBuffer(b.Len())
+ for i, list := 0, b.fullList(); i < len(list); i++ {
+ p = append(p, list[i]...)
+ putBuffer(list[i])
+ list[i] = nil // allow GC to reclaim the buffer
+ }
+ b.list = b.list[:0]
+ b.last = p
+ return b.last
+}
+
+// BytesClone returns a copy of the buffer content.
+func (b *pooledBuffer) BytesClone() []byte {
+ if len(b.list) == 0 {
+ return bytes.Clone(b.last)
+ }
+ return bytes.Join(b.fullList(), nil)
+}
+
+func (b *pooledBuffer) Reset() {
+ // Return all segments to the pools except one segment,
+ // which may be retained locally if sufficiently small.
+ if len(b.list) == 0 && cap(b.last) <= maxRetainSegmentSize {
+ b.last = b.last[:0]
+ return
+ }
+ var retain []byte
+ list := b.fullList()
+ for i := len(list) - 1; i >= 0; i-- {
+ if retain == nil && cap(list[i]) <= maxRetainSegmentSize {
+ retain = list[i][:0] // retain locally, but clear the length
+ } else {
+ putBuffer(list[i])
+ }
+ list[i] = nil // allow GC to reclaim the buffer
+ }
+ b.list = b.list[:0]
+ b.last = retain
+}
+
+const (
+ minPooledSegmentShift = 12 // minimum size of buffer to pool
+ maxRetainSegmentShift = 16 // maximum size of buffer to retain locally
+ maxRetainSegmentSize = 1 << maxRetainSegmentShift
+ maxNumSegments = bits.UintSize - minPooledSegmentShift
+)
+
+// TODO(https://go.dev/issue/47657): Use sync.PoolOf.
+// putBuffer must allocate a slice header whenever calling sync.Pool.Put.
+// Using *[]byte instead of []byte avoids the allocation,
+// but complicates the logic of pooledBuffer and is slower as a result.
+
+// bufferPools is a list of buffer pools.
+// Each pool manages buffers of capacity within (1<<(shift-1) : 1<<shift],
+// where shift is (index+minPooledSegmentShift).
+var bufferPools [maxNumSegments]sync.Pool
+
+// getBuffer acquires an empty buffer with enough capacity to hold n bytes.
+func getBuffer(n int) []byte {
+ if n < 1<<minPooledSegmentShift {
+ n = 1 << minPooledSegmentShift
+ }
+ shift := bits.Len(uint(n - 1))
+ if p, _ := bufferPools[shift-minPooledSegmentShift].Get().([]byte); p != nil {
+ return p[:0]
+ }
+ return make([]byte, 0, 1<<shift)
+}
+
+// putBuffer releases a buffer back to the pools.
+func putBuffer(p []byte) {
+ if cap(p) < 1<<minPooledSegmentShift {
+ return
+ }
+ shift := bits.Len(uint(cap(p)) - 1)
+ // TODO: In race detector mode, asynchronously write to the buffer to detect
+ // buffers that may have accidentally leaked to users.
+ // See https://go.dev/issue/58452 for inspiration.
+ bufferPools[shift-minPooledSegmentShift].Put(p)
+}
diff --git a/src/encoding/json/buffers_test.go b/src/encoding/json/buffers_test.go
new file mode 100644
index 0000000..e481ebd
--- /dev/null
+++ b/src/encoding/json/buffers_test.go
@@ -0,0 +1,81 @@
+// Copyright 2023 The Go Authors. All rights reserved.
+// Use of this source code is governed by a BSD-style
+// license that can be found in the LICENSE file.
+
+package json
+
+import (
+ "bytes"
+ "crypto/md5"
+ "io"
+ "math/bits"
+ "math/rand"
+ "testing"
+)
+
+func FuzzBuffer(f *testing.F) {
+ f.Add(int64(0))
+ f.Fuzz(func(t *testing.T, seed int64) {
+ const maxCapacity = 1 << (maxRetainSegmentShift + 4)
+ rn := rand.New(rand.NewSource(seed))
+ var gotBuffer pooledBuffer
+ var wantBuffer bytes.Buffer
+ var exit bool
+ for i := 0; !exit; i++ {
+ exit = gotBuffer.Cap() > maxCapacity || wantBuffer.Cap() > maxCapacity || i >= 100
+ switch {
+ case gotBuffer.Len() != wantBuffer.Len():
+ t.Fatalf("Buffer.Len = %d, want %d", gotBuffer.Len(), wantBuffer.Len())
+ case gotBuffer.Len() > gotBuffer.Cap():
+ t.Fatalf("Buffer.Len > Buffer.Cap: %d > %d", gotBuffer.Len(), gotBuffer.Cap())
+ case len(gotBuffer.AvailableBuffer()) > 0:
+ t.Fatalf("len(Buffer.AvailableBuffer) = %d, want 0", len(gotBuffer.AvailableBuffer()))
+ case cap(gotBuffer.AvailableBuffer()) > gotBuffer.Cap()-gotBuffer.Len():
+ t.Fatalf("cap(Buffer.AvailableBuffer) = %d, want %d", cap(gotBuffer.AvailableBuffer()), gotBuffer.Cap()-gotBuffer.Len())
+ }
+
+ // Random size that is exponentially spaced.
+ n := (1 << rn.Intn(bits.Len(maxCapacity-1))) >> 1
+ n += rn.Intn(n + 1)
+
+ // Randomly perform some mutating buffer operation.
+ switch j := rn.Intn(20); {
+ case j < 8 && !exit: // 40% probability
+ b := gotBuffer.AvailableBuffer()
+ b = b[:rn.Intn(cap(b)+1)]
+ io.ReadFull(rn, b)
+ gotBuffer.Write(b)
+ wantBuffer.Write(b)
+ case j < 14 && !exit: // 30% probability
+ b := make([]byte, n)
+ io.ReadFull(rn, b)
+ gotBuffer.Write(b)
+ wantBuffer.Write(b)
+ case j < 19 && !exit: // 25% probability
+ gotBuffer.Grow(n)
+ default: // 5% probability
+ var prev []byte
+ for _, p := range gotBuffer.fullList() {
+ if cap(p) < cap(prev) {
+ t.Fatalf("buffer segments are not growing in size")
+ }
+ prev = p
+ }
+ var gotBytes []byte
+ if rn.Intn(2) == 0 {
+ gotBytes = gotBuffer.Bytes()
+ } else {
+ gotBytes = gotBuffer.BytesClone()
+ }
+ if !bytes.Equal(gotBytes, wantBuffer.Bytes()) {
+ t.Fatalf("content mismatch: %x != %x", md5.Sum(gotBytes), md5.Sum(wantBuffer.Bytes()))
+ }
+ gotBuffer.Reset()
+ wantBuffer.Reset()
+ if gotBuffer.Len() > 0 {
+ t.Fatalf("Buffer.Len = %d, want 0", gotBuffer.Len())
+ }
+ }
+ }
+ })
+}
diff --git a/src/encoding/json/encode.go b/src/encoding/json/encode.go
index ff0e40d..602583a 100644
--- a/src/encoding/json/encode.go
+++ b/src/encoding/json/encode.go
@@ -155,16 +155,13 @@
// handle them. Passing cyclic structures to Marshal will result in
// an error.
func Marshal(v any) ([]byte, error) {
- e := newEncodeState()
- defer encodeStatePool.Put(e)
-
+ e := getEncodeState()
+ defer putEncodeState(e)
err := e.marshal(v, encOpts{escapeHTML: true})
if err != nil {
return nil, err
}
- buf := append([]byte(nil), e.Bytes()...)
-
- return buf, nil
+ return e.BytesClone(), nil
}
// MarshalIndent is like Marshal but applies Indent to format the output.
@@ -281,9 +278,9 @@
var hex = "0123456789abcdef"
-// An encodeState encodes JSON into a bytes.Buffer.
+// An encodeState encodes JSON into a pooled buffer.
type encodeState struct {
- bytes.Buffer // accumulated output
+ pooledBuffer // accumulated output
// Keep track of what pointers we've seen in the current recursive call
// path, to avoid cycles that could lead to a stack overflow. Only do
@@ -294,18 +291,13 @@
ptrSeen map[any]struct{}
}
-func (e *encodeState) AvailableBuffer() []byte {
- return availableBuffer(&e.Buffer)
-}
-
const startDetectingCyclesAfter = 1000
var encodeStatePool sync.Pool
-func newEncodeState() *encodeState {
+func getEncodeState() *encodeState {
if v := encodeStatePool.Get(); v != nil {
e := v.(*encodeState)
- e.Reset()
if len(e.ptrSeen) > 0 {
panic("ptrEncoder.encode should have emptied ptrSeen via defers")
}
@@ -315,6 +307,11 @@
return &encodeState{ptrSeen: make(map[any]struct{})}
}
+func putEncodeState(e *encodeState) {
+ e.Reset()
+ encodeStatePool.Put(e)
+}
+
// jsonError is an error wrapper type for internal use only.
// Panics with errors are wrapped in jsonError so that the top-level recover
// can distinguish intentional panics from this package.
@@ -480,9 +477,9 @@
b, err := m.MarshalJSON()
if err == nil {
e.Grow(len(b))
- out := availableBuffer(&e.Buffer)
+ out := e.AvailableBuffer()
out, err = appendCompact(out, b, opts.escapeHTML)
- e.Buffer.Write(out)
+ e.Write(out)
}
if err != nil {
e.error(&MarshalerError{v.Type(), err, "MarshalJSON"})
@@ -499,9 +496,9 @@
b, err := m.MarshalJSON()
if err == nil {
e.Grow(len(b))
- out := availableBuffer(&e.Buffer)
+ out := e.AvailableBuffer()
out, err = appendCompact(out, b, opts.escapeHTML)
- e.Buffer.Write(out)
+ e.Write(out)
}
if err != nil {
e.error(&MarshalerError{v.Type(), err, "MarshalJSON"})
diff --git a/src/encoding/json/stream.go b/src/encoding/json/stream.go
index b4146a3..e63b1cc 100644
--- a/src/encoding/json/stream.go
+++ b/src/encoding/json/stream.go
@@ -179,8 +179,8 @@
// An Encoder writes JSON values to an output stream.
type Encoder struct {
- w io.Writer
- err error
+ wr io.Writer
+ wrErr error
escapeHTML bool
indentBuf []byte
@@ -190,7 +190,7 @@
// NewEncoder returns a new encoder that writes to w.
func NewEncoder(w io.Writer) *Encoder {
- return &Encoder{w: w, escapeHTML: true}
+ return &Encoder{wr: w, escapeHTML: true}
}
// Encode writes the JSON encoding of v to the stream,
@@ -199,12 +199,12 @@
// See the documentation for Marshal for details about the
// conversion of Go values to JSON.
func (enc *Encoder) Encode(v any) error {
- if enc.err != nil {
- return enc.err
+ if enc.wrErr != nil {
+ return enc.wrErr
}
- e := newEncodeState()
- defer encodeStatePool.Put(e)
+ e := getEncodeState()
+ defer putEncodeState(e)
err := e.marshal(v, encOpts{escapeHTML: enc.escapeHTML})
if err != nil {
@@ -219,18 +219,16 @@
// digits coming.
e.WriteByte('\n')
- b := e.Bytes()
if enc.indentPrefix != "" || enc.indentValue != "" {
- enc.indentBuf, err = appendIndent(enc.indentBuf[:0], b, enc.indentPrefix, enc.indentValue)
+ enc.indentBuf, err = appendIndent(enc.indentBuf[:0], e.Bytes(), enc.indentPrefix, enc.indentValue)
if err != nil {
return err
}
- b = enc.indentBuf
+ _, enc.wrErr = enc.wr.Write(enc.indentBuf)
+ } else {
+ _, enc.wrErr = e.WriteTo(enc.wr)
}
- if _, err = enc.w.Write(b); err != nil {
- enc.err = err
- }
- return err
+ return enc.wrErr
}
// SetIndent instructs the encoder to format each subsequent encoded
Patch set 1: Run-TryBot+1, Auto-Submit+1
1 comment:
Commit Message:
Patch Set #1, Line 75: For larger outputs (e.g., CodeMarshal), we do see a performance degradation.
Maybe something else is wrong, too? On my CL I saw:
```
CodeMarshal-16 1.364m ± 1% 1.439m ± ∞ ¹ +5.47% (p=0.000 n=30+5)
CodeMarshalError-16 1.570m ± 1% 1.686m ± ∞ ¹ +7.39% (p=0.000 n=30+5)
```
Which is a much smaller degradation.
1 comment:
Commit Message:
Patch Set #1, Line 75: For larger outputs (e.g., CodeMarshal), we do see a performance degradation.
Maybe something else is wrong, too? on my CL I saw: […]
Quite possibly due to different CPU architectures. I'm on an AMD Ryzen 5900X.
Could you run both this CL and your CL and see how they compare?
That way we can compare apples to apples.
1 comment:
Commit Message:
Patch Set #1, Line 75: For larger outputs (e.g., CodeMarshal), we do see a performance degradation.
Quite possibly due to different CPU architectures. I'm on a AMD Ryzen 5900x. […]
Sure -- but on Monday :)
1 comment:
Commit Message:
Patch Set #1, Line 75: For larger outputs (e.g., CodeMarshal), we do see a performance degradation.
Quite possibly due to different CPU architectures. I'm on a AMD Ryzen 5900x. […]
Also, quite possibly related to RAM speed too. I cheaped out and bought mediocre RAM for my computer.
Joseph Tsai uploaded patch set #2 to this change.
The following approvals got outdated and were removed: Auto-Submit+1 by Joseph Tsai
encoding/json: improve Marshal memory utilization
Previously, Marshal (and Encoder.Encode) used a pooled encodeState
that included an embedded bytes.Buffer.
This is an incorrect use of sync.Pool (see #23199).
This results in problems when a high-frequency stream of tiny Marshals
is interspersed with a low-frequency stream of huge Marshals.
The occasional huge Marshal will cause bytes.Buffer to be grown in size
to accommodate the large output, but never be reduced in size.
However, a continuous stream of tiny Marshals will keep buffers alive,
preventing GC from being able to reclaim the under-utilized capacity.
Previous attempts to trivially cap the buffer capacity (see CL 136035)
resulted in unacceptably poor performance in marshaling large outputs.
We fix this problem by switching from bytes.Buffer to a pooledBuffer
implementation that uses a series of pooled buffer segments.
The content of the buffer is the concatenation of all buffer segments.
Using buffer segments allows the implementation to more readily return
unused buffers to the pool, and allows the GC to reclaim them if necessary.
The semantics of the Marshal function require that the returned buffer
be a unique copy owned by the caller.
With the segmented buffer, we use bytes.Join to join all the segments,
while the previous code did the semantic equivalent of bytes.Clone.
Performance:
name                    old time/op    new time/op    delta
LargeOutput             1.85s  ± 3%    1.31s  ± 2%    -29.25%
CodeEncoder             394µs  ± 1%    382µs  ± 2%     -2.95%
CodeEncoderError        437µs  ± 1%    428µs  ± 1%     -1.97%
AllocationWastage       396µs  ± 2%    476µs  ± 1%    +20.16%
CodeMarshal             534µs  ± 1%    712µs  ± 3%    +33.31%
CodeMarshalError        694µs  ± 2%    903µs  ± 4%    +29.98%
MarshalBytes/32         170ns  ± 3%    164ns  ± 1%     -3.22%
MarshalBytes/256        387ns  ± 2%    381ns  ± 1%     -1.63%
MarshalBytes/4096       3.93µs ± 2%    3.91µs ± 1%       ~
MarshalBytesError/32    372µs  ± 1%    370µs  ± 0%     -0.71%
MarshalBytesError/256   372µs  ± 1%    370µs  ± 1%       ~
MarshalBytesError/4096  379µs  ± 1%    375µs  ± 1%     -1.03%
EncodeMarshaler         22.8ns ± 2%    24.4ns ± 2%     +7.06%
EncoderEncode           12.5ns ± 3%    13.6ns ± 8%     +8.23%

name                    old alloc/op   new alloc/op     delta
LargeOutput             1.55GB ± 0%    0.97GB ± 0%        -37.61%
CodeEncoder             4.00B  ± 0%    3608.40B ±77%   +90110.00%
CodeEncoderError        141B   ± 2%    2605B   ±207%    +1751.22%
AllocationWastage       2.87B  ±30%    2017.10B ±54%   +70060.00%
CodeMarshal             1.96MB ± 2%    1.99MB ± 1%         +1.52%
CodeMarshalError        2.04MB ± 2%    2.00MB ± 1%         -1.89%
EncodeMarshaler         4.00B  ± 0%    4.00B  ± 0%            ~
EncoderEncode           0.00B          0.00B                  ~

name                    old allocs/op  new allocs/op    delta
LargeOutput             44.8 ±23%      19649.6 ± 0%    +43760.71%
CodeEncoder             0.00           47.00 ± 0%           +Inf%
CodeEncoderError        4.00 ± 0%      51.00 ± 0%       +1175.00%
AllocationWastage       0.00           52.00 ± 0%           +Inf%
CodeMarshal             1.00 ± 0%      53.00 ± 0%       +5200.00%
CodeMarshalError        6.30 ±11%      59.40 ± 1%        +842.86%
EncodeMarshaler         1.00 ± 0%      1.00 ± 0%              ~
EncoderEncode           0.00           0.00                   ~

name                    old wasted_b/op  new wasted_b/op  delta
AllocationWastage       2.25M ± 0%       0.16M ± 0%       -92.88%
There is no notable performance difference on small outputs.
For larger outputs (e.g., CodeMarshal), we see a performance drop.
This is unavoidable, as a single contiguous buffer that is
large enough is always better for CPU performance for many reasons.
However, the nature of the Marshal call makes it such that it is
impossible to know what size buffer we need prior to marshaling.
The loss of performance is deemed acceptable in order to gain
stronger guarantees on the maximum amount of heap used.
Part of the cost of using bytes.Join over bytes.Clone is that
bytes.Clone can rely on compiler optimizations that avoid unnecessarily
zeroing the buffer for the portion that is gauranteed to be overwritten.
The bytes.Join function does not benefit from the compiler optimization.
In CL 456336, bytes.Join is optimized to also avoid the zeroing.
This results in the performance loss being reduced to:
name              old time/op    new time/op    delta
CodeMarshal       534µs ± 1%     633µs ± 4%     +18.46%
CodeMarshalError  694µs ± 2%     789µs ± 3%     +13.64%
This is a reduction of performance loss from +33% down to +18%,
but is still much better than the performance loss of +72%
without the use of any pooled buffers at all.
The idea of a segmented buffer came about during discussion with
Daniel Smith on CL 455236 as a counterproposal
to using pooled contiguous buffers.
The benchmarks come from Daniel Smith's implementation in CL 455776.
Huge thanks go to Daniel Smith, who did the bulk of the work
to prove that segmented buffers were a viable approach.
The implementation in this CL is a simplified version of CL 455776.
Fixes #27735
Co-authored-by: Daniel Smith <dbs...@google.com>
Change-Id: Ib23f00eb64615b21b1ddddf9fadc6498bcefa935
---
M src/encoding/json/bench_test.go
A src/encoding/json/buffers.go
A src/encoding/json/buffers_test.go
M src/encoding/json/encode.go
M src/encoding/json/stream.go
5 files changed, 496 insertions(+), 32 deletions(-)
3 comments:
Commit Message:
Patch Set #1, Line 75: For larger outputs (e.g., CodeMarshal), we do see a performance degradation.
Also, quite possible related to RAM speed too. […]
I checked this and your change is actually an improvement on baseline, congrats. I was pretty skeptical of the Grow/AvailableBuffer paradigm but I guess if you start out with that approach it works just fine.
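For reference, the Grow/AvailableBuffer pattern in question, condensed from encode.go in this CL:
```
e.Grow(len(b))             // reserve capacity in the last segment
out := e.AvailableBuffer() // empty slice aliasing that spare capacity
out, err = appendCompact(out, b, opts.escapeHTML)
e.Write(out)               // room was reserved, so no new segment is started
```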
I am using some sort of cloud system, so my numbers are probably less meaningful than yours, but here they are anyway:
```
goos: linux
goarch: amd64
pkg: encoding/json
cpu: Intel(R) Xeon(R) CPU @ 2.20GHz
│ change-455776.bench.txt │ change-471200.bench.txt │
│ sec/op │ sec/op vs base │
LargeOutput-16 2.359 ± 1% 2.185 ± 3% -7.37% (p=0.002 n=10)
CodeEncoder-16 1.163m ± 7% 1.064m ± 1% -8.47% (p=0.000 n=10)
CodeEncoderError-16 1.239m ± 2% 1.170m ± 1% -5.54% (p=0.000 n=10)
AllocationWastage-16 1.491m ± 3% 1.099m ± 1% -26.25% (p=0.000 n=10)
CodeMarshal-16 1.474m ± 15% 1.383m ± 2% -6.15% (p=0.000 n=10)
CodeMarshalError-16 1.696m ± 2% 1.674m ± 1% ~ (p=0.075 n=10)
MarshalBytes/32-16 305.8n ± 2% 283.9n ± 1% -7.16% (p=0.000 n=10)
MarshalBytes/256-16 873.6n ± 3% 746.9n ± 1% -14.51% (p=0.000 n=10)
MarshalBytes/4096-16 8.806µ ± 0% 8.310µ ± 0% -5.64% (p=0.000 n=10)
MarshalBytesError/32-16 632.8µ ± 1% 660.9µ ± 1% +4.43% (p=0.000 n=10)
MarshalBytesError/256-16 631.3µ ± 1% 667.3µ ± 1% +5.72% (p=0.000 n=10)
MarshalBytesError/4096-16 647.0µ ± 1% 676.1µ ± 1% +4.49% (p=0.000 n=10)
CodeDecoder-16 5.080m ± 1% 5.145m ± 3% ~ (p=0.089 n=10)
UnicodeDecoder-16 334.0n ± 1% 333.4n ± 0% ~ (p=0.644 n=10)
DecoderStream-16 248.0n ± 1% 246.8n ± 1% -0.48% (p=0.050 n=10)
CodeUnmarshal-16 6.503m ± 2% 6.515m ± 1% ~ (p=0.481 n=10)
CodeUnmarshalReuse-16 5.185m ± 2% 5.210m ± 1% ~ (p=0.796 n=10)
UnmarshalString-16 98.07n ± 2% 104.85n ± 4% +6.92% (p=0.000 n=10)
UnmarshalFloat64-16 88.93n ± 2% 89.68n ± 2% +0.85% (p=0.009 n=10)
UnmarshalInt64-16 76.63n ± 6% 79.98n ± 4% +4.38% (p=0.007 n=10)
Issue10335-16 129.4n ± 3% 129.4n ± 2% ~ (p=0.810 n=10)
Issue34127-16 60.48n ± 3% 57.61n ± 3% -4.75% (p=0.000 n=10)
Unmapped-16 288.7n ± 2% 283.7n ± 1% -1.75% (p=0.023 n=10)
TypeFieldsCache/MissTypes1-16 18.31µ ± 4% 18.29µ ± 2% ~ (p=0.565 n=10)
TypeFieldsCache/MissTypes10-16 74.79µ ± 1% 75.51µ ± 2% ~ (p=0.105 n=10)
TypeFieldsCache/MissTypes100-16 272.4µ ± 1% 277.8µ ± 1% +1.98% (p=0.001 n=10)
TypeFieldsCache/MissTypes1000-16 2.347m ± 2% 2.321m ± 2% ~ (p=0.315 n=10)
TypeFieldsCache/MissTypes10000-16 19.92m ± 2% 20.03m ± 1% ~ (p=0.247 n=10)
TypeFieldsCache/MissTypes100000-16 218.4m ± 2% 217.5m ± 4% ~ (p=0.971 n=10)
TypeFieldsCache/MissTypes1000000-16 2.539 ± 16% 2.596 ± 18% ~ (p=0.481 n=10)
TypeFieldsCache/HitTypes1-16 3.526n ± 1% 3.460n ± 1% -1.86% (p=0.009 n=10)
TypeFieldsCache/HitTypes10-16 3.478n ± 1% 3.486n ± 2% ~ (p=0.272 n=10)
TypeFieldsCache/HitTypes100-16 3.388n ± 1% 3.486n ± 1% +2.86% (p=0.000 n=10)
TypeFieldsCache/HitTypes1000-16 3.436n ± 1% 3.385n ± ∞ ¹ ~ (p=0.286 n=6+1)
```
If you were to spend more time looking at this, the only thing I would look at is LargeOutput's allocation count; everything else looks great.
```
│ change-455776.bench.txt │ change-471200.bench.txt │
│ allocs/op │ allocs/op vs base │
LargeOutput-16 5.674k ± 1% 12.194k ± 0% +114.90% (p=0.000 n=10)
```
File src/encoding/json/buffers.go:
Patch Set #2, Line 150: maxRetainSegmentSize
This actually appears to retain all segment sizes; this is just where it stops making each segment twice as big as the prior one, right? (Likely this explains why it runs the allocation wastage test so much faster.) So I would maybe change this name.
File src/encoding/json/buffers_test.go:
Patch Set #2, Line 41: // Randomly perform some mutating buffer operation.
Should probably exercise WriteString and WriteByte somewhere in here?
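For example, two additional cases in the existing switch might look like this (hypothetical; probabilities illustrative):
```
case j < 16 && !exit: // exercise WriteString
	b := make([]byte, n)
	io.ReadFull(rn, b)
	gotBuffer.WriteString(string(b))
	wantBuffer.WriteString(string(b))
case j < 18 && !exit: // exercise WriteByte
	c := byte(rn.Intn(256))
	gotBuffer.WriteByte(c)
	wantBuffer.WriteByte(c)
```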
Patch set 2: Code-Review+1
2 comments:
Patchset:
Thanks for doing this, and I feel bad again for adding the example that helped promote this use.
Commit Message:
Patch Set #2, Line 85: zeroing the buffer for the portion that is gauranteed to be overwritten.
guaranteed