Gerrit Bot would like Klaus Post to review this change.
compress/flate: improve decompression speed
Improve inflate decompression speed, mainly through 3 optimizations:
1) Read further ahead on non-final blocks.
The reader guarantees that it will not read beyond the end of the stream.
This limits how far ahead the decoder can read; the limit is set to the size
of an end-of-block marker via `f.h1.min = f.bits[endBlockMarker]`.
We can, however, take advantage of the fact that each block header states
whether the block is the final one in the stream.
So if we are not reading the final block, we can safely add the size
of the smallest block possible: one with a predefined table and nothing
but a single EOB marker. Since we know the size of that block header
(3 bits) and the fixed encoding of the EOB marker (7 bits), this totals
10 additional bits.
Adding these 10 bits reduces the number of stream reads significantly,
for approximately a 5% throughput increase.
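The arithmetic behind those 10 bits can be sketched as follows (a standalone illustration; the constant names are hypothetical, not from this change):

```go
package main

import "fmt"

func main() {
	// A DEFLATE block header is 3 bits: 1 bit BFINAL + 2 bits BTYPE.
	const blockHeaderBits = 1 + 2
	// With the predefined (fixed) Huffman table, the end-of-block
	// symbol (value 256) is encoded as a 7-bit code.
	const fixedEOBBits = 7
	// The smallest block that can follow a non-final block is a fixed
	// block holding nothing but an EOB marker.
	fmt.Println(blockHeaderBits + fixedEOBBits) // 10
}
```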
2) Manually inline the f.huffSym call
This change by itself gives about a 13% throughput increase.
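The pattern behind the manual inlining can be sketched like this (a hypothetical bitReader type for illustration, not code from this change): the hot fields are copied into locals for the duration of the loop, since the compiler keeps locals in registers but reloads fields reached through a pointer, and the locals are written back on every exit path.

```go
package main

import "fmt"

type bitReader struct {
	b  uint32 // bit buffer
	nb uint   // number of valid bits in b
}

// refill loads bytes into the bit buffer until it is nearly full,
// returning how many bytes were consumed.
func (f *bitReader) refill(src []byte) int {
	// Locals b and nb can live in registers; f.b and f.nb often cannot.
	nb, b := f.nb, f.b
	n := 0
	for _, c := range src {
		b |= uint32(c) << (nb & 31)
		nb += 8
		n++
		if nb > 24 {
			break
		}
	}
	f.b, f.nb = b, nb // write the locals back before returning
	return n
}

func main() {
	var f bitReader
	fmt.Println(f.refill([]byte{1, 2, 3, 4}))
}
```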
3) Generate decoders for stdlib io.ByteReader types
We generate decoders for the known implementations of `io.ByteReader`,
namely `*bytes.Buffer`, `*bytes.Reader`, `*bufio.Reader` and `*strings.Reader`.
This change by itself gives about 20-25% throughput increase,
including when an `io.Reader` is passed.
Of these, only `*strings.Reader` is probably uncommon in practice.
Minor changes:
* Reuse `h.chunks` and `h.links`.
* Trade some bounds checks for AND operations.
* Change chunks from uint32 to uint16.
* Avoid padding of decompressor struct members.
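The bounds-check trade mentioned above relies on masking the index with a power-of-two-minus-one so the compiler can prove the access is in range, as in count[n&maxCodeLenMask] in the diff. A minimal sketch:

```go
package main

import "fmt"

func main() {
	var count [16]int
	const mask = 15 // len(count) - 1, a power of two minus one
	for _, n := range []int{3, 7, 3, 15} {
		// n & mask is provably < 16, so the compiler can drop the
		// bounds check; out-of-range n values simply land in the
		// wrong bucket and are caught by later validity checks.
		count[n&mask]++
	}
	fmt.Println(count[3], count[7], count[15]) // 2 1 1
}
```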
Per-loop allocations have been removed from the benchmarks.
The 'old' numbers in the benchmark below already include this change.
```
name old time/op new time/op delta
Decode/Digits/Huffman/1e4-32 78.0µs ± 0% 50.5µs ± 1% -35.26% (p=0.008 n=5+5)
Decode/Digits/Huffman/1e5-32 779µs ± 2% 487µs ± 0% -37.48% (p=0.008 n=5+5)
Decode/Digits/Huffman/1e6-32 7.68ms ± 0% 4.88ms ± 1% -36.44% (p=0.008 n=5+5)
Decode/Digits/Speed/1e4-32 88.5µs ± 1% 59.9µs ± 1% -32.33% (p=0.008 n=5+5)
Decode/Digits/Speed/1e5-32 963µs ± 1% 678µs ± 1% -29.58% (p=0.008 n=5+5)
Decode/Digits/Speed/1e6-32 9.75ms ± 1% 6.90ms ± 0% -29.21% (p=0.008 n=5+5)
Decode/Digits/Default/1e4-32 91.2µs ± 1% 61.4µs ± 0% -32.72% (p=0.008 n=5+5)
Decode/Digits/Default/1e5-32 954µs ± 0% 675µs ± 0% -29.25% (p=0.008 n=5+5)
Decode/Digits/Default/1e6-32 9.67ms ± 0% 6.79ms ± 1% -29.76% (p=0.008 n=5+5)
Decode/Digits/Compression/1e4-32 90.7µs ± 1% 61.5µs ± 1% -32.21% (p=0.008 n=5+5)
Decode/Digits/Compression/1e5-32 953µs ± 1% 672µs ± 0% -29.46% (p=0.016 n=4+5)
Decode/Digits/Compression/1e6-32 9.76ms ± 4% 6.78ms ± 0% -30.54% (p=0.008 n=5+5)
Decode/Newton/Huffman/1e4-32 90.4µs ± 0% 54.7µs ± 0% -39.52% (p=0.008 n=5+5)
Decode/Newton/Huffman/1e5-32 885µs ± 0% 538µs ± 0% -39.19% (p=0.008 n=5+5)
Decode/Newton/Huffman/1e6-32 8.84ms ± 0% 5.44ms ± 0% -38.46% (p=0.016 n=4+5)
Decode/Newton/Speed/1e4-32 81.5µs ± 0% 55.1µs ± 1% -32.42% (p=0.016 n=4+5)
Decode/Newton/Speed/1e5-32 751µs ± 4% 528µs ± 0% -29.70% (p=0.008 n=5+5)
Decode/Newton/Speed/1e6-32 7.49ms ± 2% 5.32ms ± 0% -28.92% (p=0.008 n=5+5)
Decode/Newton/Default/1e4-32 73.3µs ± 1% 48.9µs ± 1% -33.36% (p=0.008 n=5+5)
Decode/Newton/Default/1e5-32 601µs ± 2% 418µs ± 0% -30.40% (p=0.008 n=5+5)
Decode/Newton/Default/1e6-32 5.92ms ± 0% 4.17ms ± 0% -29.60% (p=0.008 n=5+5)
Decode/Newton/Compression/1e4-32 72.7µs ± 0% 48.5µs ± 0% -33.21% (p=0.008 n=5+5)
Decode/Newton/Compression/1e5-32 597µs ± 0% 418µs ± 0% -29.90% (p=0.008 n=5+5)
Decode/Newton/Compression/1e6-32 5.90ms ± 0% 4.15ms ± 0% -29.63% (p=0.016 n=4+5)
name old speed new speed delta
Decode/Digits/Huffman/1e4-32 128MB/s ± 0% 198MB/s ± 1% +54.46% (p=0.008 n=5+5)
Decode/Digits/Huffman/1e5-32 128MB/s ± 2% 205MB/s ± 0% +59.92% (p=0.008 n=5+5)
Decode/Digits/Huffman/1e6-32 130MB/s ± 0% 205MB/s ± 1% +57.33% (p=0.008 n=5+5)
Decode/Digits/Speed/1e4-32 113MB/s ± 1% 167MB/s ± 1% +47.79% (p=0.008 n=5+5)
Decode/Digits/Speed/1e5-32 104MB/s ± 1% 147MB/s ± 1% +42.01% (p=0.008 n=5+5)
Decode/Digits/Speed/1e6-32 103MB/s ± 1% 145MB/s ± 0% +41.26% (p=0.008 n=5+5)
Decode/Digits/Default/1e4-32 110MB/s ± 1% 163MB/s ± 0% +48.63% (p=0.008 n=5+5)
Decode/Digits/Default/1e5-32 105MB/s ± 0% 148MB/s ± 0% +41.34% (p=0.008 n=5+5)
Decode/Digits/Default/1e6-32 103MB/s ± 0% 147MB/s ± 1% +42.37% (p=0.008 n=5+5)
Decode/Digits/Compression/1e4-32 110MB/s ± 1% 163MB/s ± 1% +47.51% (p=0.008 n=5+5)
Decode/Digits/Compression/1e5-32 105MB/s ± 1% 149MB/s ± 0% +41.77% (p=0.016 n=4+5)
Decode/Digits/Compression/1e6-32 102MB/s ± 4% 147MB/s ± 0% +43.91% (p=0.008 n=5+5)
Decode/Newton/Huffman/1e4-32 111MB/s ± 0% 183MB/s ± 0% +65.35% (p=0.008 n=5+5)
Decode/Newton/Huffman/1e5-32 113MB/s ± 0% 186MB/s ± 0% +64.44% (p=0.008 n=5+5)
Decode/Newton/Huffman/1e6-32 113MB/s ± 0% 184MB/s ± 0% +62.50% (p=0.016 n=4+5)
Decode/Newton/Speed/1e4-32 123MB/s ± 0% 182MB/s ± 1% +47.98% (p=0.016 n=4+5)
Decode/Newton/Speed/1e5-32 133MB/s ± 4% 189MB/s ± 0% +42.20% (p=0.008 n=5+5)
Decode/Newton/Speed/1e6-32 134MB/s ± 2% 188MB/s ± 0% +40.67% (p=0.008 n=5+5)
Decode/Newton/Default/1e4-32 136MB/s ± 1% 205MB/s ± 1% +50.05% (p=0.008 n=5+5)
Decode/Newton/Default/1e5-32 166MB/s ± 2% 239MB/s ± 0% +43.67% (p=0.008 n=5+5)
Decode/Newton/Default/1e6-32 169MB/s ± 0% 240MB/s ± 0% +42.04% (p=0.008 n=5+5)
Decode/Newton/Compression/1e4-32 138MB/s ± 0% 206MB/s ± 0% +49.73% (p=0.008 n=5+5)
Decode/Newton/Compression/1e5-32 168MB/s ± 0% 239MB/s ± 0% +42.66% (p=0.008 n=5+5)
Decode/Newton/Compression/1e6-32 170MB/s ± 0% 241MB/s ± 0% +42.11% (p=0.016 n=4+5)
name old alloc/op new alloc/op delta
Decode/Digits/Huffman/1e4-32 0.00B ±NaN% 16.00B ± 0% +Inf% (p=0.008 n=5+5)
Decode/Digits/Huffman/1e5-32 7.60B ± 8% 32.00B ± 0% +321.05% (p=0.008 n=5+5)
Decode/Digits/Huffman/1e6-32 79.6B ± 1% 264.0B ± 0% +231.66% (p=0.008 n=5+5)
Decode/Digits/Speed/1e4-32 80.0B ± 0% 16.0B ± 0% -80.00% (p=0.008 n=5+5)
Decode/Digits/Speed/1e5-32 297B ± 0% 33B ± 0% ~ (p=0.079 n=4+5)
Decode/Digits/Speed/1e6-32 3.86kB ± 0% 0.27kB ± 0% -92.98% (p=0.008 n=5+5)
Decode/Digits/Default/1e4-32 48.0B ± 0% 16.0B ± 0% -66.67% (p=0.008 n=5+5)
Decode/Digits/Default/1e5-32 297B ± 0% 49B ± 0% -83.50% (p=0.008 n=5+5)
Decode/Digits/Default/1e6-32 4.28kB ± 0% 0.38kB ± 0% ~ (p=0.079 n=4+5)
Decode/Digits/Compression/1e4-32 48.0B ± 0% 16.0B ± 0% -66.67% (p=0.008 n=5+5)
Decode/Digits/Compression/1e5-32 297B ± 0% 49B ± 0% ~ (p=0.079 n=4+5)
Decode/Digits/Compression/1e6-32 4.28kB ± 0% 0.38kB ± 0% -91.09% (p=0.000 n=4+5)
Decode/Newton/Huffman/1e4-32 705B ± 0% 16B ± 0% -97.73% (p=0.008 n=5+5)
Decode/Newton/Huffman/1e5-32 4.50kB ± 0% 0.03kB ± 0% -99.27% (p=0.008 n=5+5)
Decode/Newton/Huffman/1e6-32 39.4kB ± 0% 0.3kB ± 0% -99.29% (p=0.008 n=5+5)
Decode/Newton/Speed/1e4-32 625B ± 0% 16B ± 0% -97.44% (p=0.008 n=5+5)
Decode/Newton/Speed/1e5-32 3.21kB ± 0% 0.03kB ± 0% -98.97% (p=0.008 n=5+5)
Decode/Newton/Speed/1e6-32 40.6kB ± 0% 0.3kB ± 0% -99.25% (p=0.008 n=5+5)
Decode/Newton/Default/1e4-32 513B ± 0% 16B ± 0% -96.88% (p=0.008 n=5+5)
Decode/Newton/Default/1e5-32 2.37kB ± 0% 0.03kB ± 0% -98.61% (p=0.008 n=5+5)
Decode/Newton/Default/1e6-32 21.2kB ± 0% 0.2kB ± 0% -98.97% (p=0.008 n=5+5)
Decode/Newton/Compression/1e4-32 513B ± 0% 16B ± 0% -96.88% (p=0.008 n=5+5)
Decode/Newton/Compression/1e5-32 2.37kB ± 0% 0.03kB ± 0% -98.61% (p=0.008 n=5+5)
Decode/Newton/Compression/1e6-32 23.0kB ± 0% 0.2kB ± 0% -99.07% (p=0.008 n=5+5)
name old allocs/op new allocs/op delta
Decode/Digits/Huffman/1e4-32 0.00 ±NaN% 1.00 ± 0% +Inf% (p=0.008 n=5+5)
Decode/Digits/Huffman/1e5-32 0.00 ±NaN% 2.00 ± 0% +Inf% (p=0.008 n=5+5)
Decode/Digits/Huffman/1e6-32 0.00 ±NaN% 16.00 ± 0% +Inf% (p=0.008 n=5+5)
Decode/Digits/Speed/1e4-32 3.00 ± 0% 1.00 ± 0% -66.67% (p=0.008 n=5+5)
Decode/Digits/Speed/1e5-32 6.00 ± 0% 2.00 ± 0% -66.67% (p=0.008 n=5+5)
Decode/Digits/Speed/1e6-32 68.0 ± 0% 16.0 ± 0% -76.47% (p=0.008 n=5+5)
Decode/Digits/Default/1e4-32 2.00 ± 0% 1.00 ± 0% -50.00% (p=0.008 n=5+5)
Decode/Digits/Default/1e5-32 8.00 ± 0% 3.00 ± 0% -62.50% (p=0.008 n=5+5)
Decode/Digits/Default/1e6-32 74.0 ± 0% 23.0 ± 0% -68.92% (p=0.008 n=5+5)
Decode/Digits/Compression/1e4-32 2.00 ± 0% 1.00 ± 0% -50.00% (p=0.008 n=5+5)
Decode/Digits/Compression/1e5-32 8.00 ± 0% 3.00 ± 0% -62.50% (p=0.008 n=5+5)
Decode/Digits/Compression/1e6-32 74.0 ± 0% 23.0 ± 0% -68.92% (p=0.008 n=5+5)
Decode/Newton/Huffman/1e4-32 9.00 ± 0% 1.00 ± 0% -88.89% (p=0.008 n=5+5)
Decode/Newton/Huffman/1e5-32 18.0 ± 0% 2.0 ± 0% -88.89% (p=0.008 n=5+5)
Decode/Newton/Huffman/1e6-32 156 ± 0% 16 ± 0% -89.74% (p=0.008 n=5+5)
Decode/Newton/Speed/1e4-32 13.0 ± 0% 1.0 ± 0% -92.31% (p=0.008 n=5+5)
Decode/Newton/Speed/1e5-32 26.0 ± 0% 2.0 ± 0% -92.31% (p=0.008 n=5+5)
Decode/Newton/Speed/1e6-32 223 ± 0% 16 ± 0% -92.83% (p=0.008 n=5+5)
Decode/Newton/Default/1e4-32 10.0 ± 0% 1.0 ± 0% -90.00% (p=0.008 n=5+5)
Decode/Newton/Default/1e5-32 27.0 ± 0% 2.0 ± 0% -92.59% (p=0.008 n=5+5)
Decode/Newton/Default/1e6-32 153 ± 0% 12 ± 0% -92.16% (p=0.008 n=5+5)
Decode/Newton/Compression/1e4-32 10.0 ± 0% 1.0 ± 0% -90.00% (p=0.008 n=5+5)
Decode/Newton/Compression/1e5-32 27.0 ± 0% 2.0 ± 0% -92.59% (p=0.008 n=5+5)
Decode/Newton/Compression/1e6-32 145 ± 0% 12 ± 0% -91.72% (p=0.008 n=5+5)
```
These changes have been included in github.com/klauspost/compress
for a little more than a month now, during which they have also been fuzz tested.
Change-Id: I7e346330512116baa27e448aa606a2f4e551054c
GitHub-Last-Rev: 6180f3c8841bad9184e2d9d43c6829eb9b7a63e5
GitHub-Pull-Request: golang/go#38324
---
A src/compress/flate/gen_inflate.go
M src/compress/flate/inflate.go
A src/compress/flate/inflate_gen.go
M src/compress/flate/reader_test.go
4 files changed, 1,220 insertions(+), 46 deletions(-)
diff --git a/src/compress/flate/gen_inflate.go b/src/compress/flate/gen_inflate.go
new file mode 100644
index 0000000..9db11f3
--- /dev/null
+++ b/src/compress/flate/gen_inflate.go
@@ -0,0 +1,259 @@
+// Copyright 2020 The Go Authors. All rights reserved.
+// Use of this source code is governed by a BSD-style
+// license that can be found in the LICENSE file.
+
+// Package flate implements the DEFLATE compressed data format, described in
+// RFC 1951. The gzip and zlib packages implement access to DEFLATE-based file
+// formats.
+
+// +build generate
+
+//go:generate go run $GOFILE && gofmt -w inflate_gen.go
+
+package main
+
+import (
+ "os"
+ "strings"
+)
+
+func main() {
+ f, err := os.Create("inflate_gen.go")
+ if err != nil {
+ panic(err)
+ }
+ defer f.Close()
+ types := []string{"*bytes.Buffer", "*bytes.Reader", "*bufio.Reader", "*strings.Reader"}
+ names := []string{"BytesBuffer", "BytesReader", "BufioReader", "StringsReader"}
+ imports := []string{"bytes", "bufio", "io", "strings", "math/bits"}
+ f.WriteString(`// Code generated by go generate gen_inflate.go. DO NOT EDIT.
+
+package flate
+
+import (
+`)
+
+ for _, imp := range imports {
+ f.WriteString("\t\"" + imp + "\"\n")
+ }
+ f.WriteString(")\n\n")
+
+ template := `
+
+// $FUNCNAME$ decodes a single Huffman block from f.
+// f.r must be a $TYPE$.
+// hl and hd are the Huffman states for the lit/length values
+// and the distance values, respectively. If hd == nil, using the
+// fixed distance encoding associated with fixed Huffman blocks.
+func (f *decompressor) $FUNCNAME$() {
+ const (
+ stateInit = iota // Zero value must be stateInit
+ stateDict
+ )
+ fr := f.r.($TYPE$)
+ moreBits := func() error {
+ c, err := fr.ReadByte()
+ if err != nil {
+ return noEOF(err)
+ }
+ f.roffset++
+ f.b |= uint32(c) << f.nb
+ f.nb += 8
+ return nil
+ }
+
+ switch f.stepState {
+ case stateInit:
+ goto readLiteral
+ case stateDict:
+ goto copyHistory
+ }
+
+readLiteral:
+ // Read literal and/or (length, distance) according to RFC section 3.2.3.
+ {
+ var v int
+ {
+ // Inlined v, err := f.huffSym(f.hl)
+ // Since a huffmanDecoder can be empty or be composed of a degenerate tree
+ // with single element, huffSym must error on these two edge cases. In both
+ // cases, the chunks slice will be 0 for the invalid sequence, leading it to
+ // satisfy the n == 0 check below.
+ n := uint(f.hl.maxRead)
+ // Optimization. Compiler isn't smart enough to keep f.b,f.nb in registers,
+ // but is smart enough to keep local variables in registers, so use nb and b,
+ // inline call to moreBits and reassign b,nb back to f on return.
+ nb, b := f.nb, f.b
+ for {
+ for nb < n {
+ c, err := fr.ReadByte()
+ if err != nil {
+ f.b = b
+ f.nb = nb
+ f.err = noEOF(err)
+ return
+ }
+ f.roffset++
+ b |= uint32(c) << (nb & 31)
+ nb += 8
+ }
+ chunk := f.hl.chunks[b&(huffmanNumChunks-1)]
+ n = uint(chunk & huffmanCountMask)
+ if n > huffmanChunkBits {
+ chunk = f.hl.links[chunk>>huffmanValueShift][(b>>huffmanChunkBits)&f.hl.linkMask]
+ n = uint(chunk & huffmanCountMask)
+ }
+ if n <= nb {
+ if n == 0 {
+ f.b = b
+ f.nb = nb
+ f.err = CorruptInputError(f.roffset)
+ return
+ }
+ f.b = b >> (n & 31)
+ f.nb = nb - n
+ v = int(chunk >> huffmanValueShift)
+ break
+ }
+ }
+ }
+
+ var n uint // number of bits extra
+ var length int
+ var err error
+ switch {
+ case v < 256:
+ f.dict.writeByte(byte(v))
+ if f.dict.availWrite() == 0 {
+ f.toRead = f.dict.readFlush()
+ f.step = (*decompressor).$FUNCNAME$
+ f.stepState = stateInit
+ return
+ }
+ goto readLiteral
+ case v == 256:
+ f.finishBlock()
+ return
+ // otherwise, reference to older data
+ case v < 265:
+ length = v - (257 - 3)
+ n = 0
+ case v < 269:
+ length = v*2 - (265*2 - 11)
+ n = 1
+ case v < 273:
+ length = v*4 - (269*4 - 19)
+ n = 2
+ case v < 277:
+ length = v*8 - (273*8 - 35)
+ n = 3
+ case v < 281:
+ length = v*16 - (277*16 - 67)
+ n = 4
+ case v < 285:
+ length = v*32 - (281*32 - 131)
+ n = 5
+ case v < maxNumLit:
+ length = 258
+ n = 0
+ default:
+ f.err = CorruptInputError(f.roffset)
+ return
+ }
+ if n > 0 {
+ for f.nb < n {
+ if err = moreBits(); err != nil {
+ f.err = err
+ return
+ }
+ }
+ length += int(f.b & uint32(1<<n-1))
+ f.b >>= n
+ f.nb -= n
+ }
+
+ var dist int
+ if f.hd == nil {
+ for f.nb < 5 {
+ if err = moreBits(); err != nil {
+ f.err = err
+ return
+ }
+ }
+ dist = int(bits.Reverse8(uint8(f.b & 0x1F << 3)))
+ f.b >>= 5
+ f.nb -= 5
+ } else {
+ if dist, err = f.huffSym(f.hd); err != nil {
+ f.err = err
+ return
+ }
+ }
+
+ switch {
+ case dist < 4:
+ dist++
+ case dist < maxNumDist:
+ nb := uint(dist-2) >> 1
+ // have 1 bit in bottom of dist, need nb more.
+ extra := (dist & 1) << nb
+ for f.nb < nb {
+ if err = moreBits(); err != nil {
+ f.err = err
+ return
+ }
+ }
+ extra |= int(f.b & uint32(1<<nb-1))
+ f.b >>= nb
+ f.nb -= nb
+ dist = 1<<(nb+1) + 1 + extra
+ default:
+ f.err = CorruptInputError(f.roffset)
+ return
+ }
+
+ // No check on length; encoding can be prescient.
+ if dist > f.dict.histSize() {
+ f.err = CorruptInputError(f.roffset)
+ return
+ }
+
+ f.copyLen, f.copyDist = length, dist
+ goto copyHistory
+ }
+
+copyHistory:
+ // Perform a backwards copy according to RFC section 3.2.3.
+ {
+ cnt := f.dict.tryWriteCopy(f.copyDist, f.copyLen)
+ if cnt == 0 {
+ cnt = f.dict.writeCopy(f.copyDist, f.copyLen)
+ }
+ f.copyLen -= cnt
+
+ if f.dict.availWrite() == 0 || f.copyLen > 0 {
+ f.toRead = f.dict.readFlush()
+ f.step = (*decompressor).$FUNCNAME$ // We need to continue this work
+ f.stepState = stateDict
+ return
+ }
+ goto readLiteral
+ }
+}
+
+`
+ for i, t := range types {
+ s := strings.Replace(template, "$FUNCNAME$", "huffman"+names[i], -1)
+ s = strings.Replace(s, "$TYPE$", t, -1)
+ f.WriteString(s)
+ }
+ f.WriteString("func (f *decompressor) huffmanBlockDecoder() func() {\n")
+ f.WriteString("\tswitch f.r.(type) {\n")
+ for i, t := range types {
+ f.WriteString("\t\tcase " + t + ":\n")
+ f.WriteString("\t\t\treturn f.huffman" + names[i] + "\n")
+ }
+ f.WriteString("\t\tdefault:\n")
+ f.WriteString("\t\t\treturn f.huffmanBlockGeneric")
+ f.WriteString("\t}\n}\n")
+}
diff --git a/src/compress/flate/inflate.go b/src/compress/flate/inflate.go
index 4992139..0145ee8 100644
--- a/src/compress/flate/inflate.go
+++ b/src/compress/flate/inflate.go
@@ -16,7 +16,8 @@
)
const (
- maxCodeLen = 16 // max length of Huffman code
+ maxCodeLen = 16 // max length of Huffman code
+ maxCodeLenMask = 15 // mask for max length of Huffman code
// The next three numbers come from the RFC section 3.2.7, with the
// additional proviso in section 3.2.5 which implies that distance codes
// 30 and 31 should never occur in compressed data.
@@ -102,10 +103,10 @@
)
type huffmanDecoder struct {
- min int // the minimum code length
- chunks [huffmanNumChunks]uint32 // chunks as described above
- links [][]uint32 // overflow links
- linkMask uint32 // mask the width of the link table
+ maxRead int // the maximum number of bits we can read and not overread
+ chunks *[huffmanNumChunks]uint16 // chunks as described above
+ links [][]uint16 // overflow links
+ linkMask uint32 // mask the width of the link table
}
// Initialize Huffman decoding tables from array of code lengths.
@@ -119,12 +120,15 @@
// development to supplement the currently ad-hoc unit tests.
const sanity = false
- if h.min != 0 {
- *h = huffmanDecoder{}
+ if h.chunks == nil {
+ h.chunks = &[huffmanNumChunks]uint16{}
+ }
+ if h.maxRead != 0 {
+ *h = huffmanDecoder{chunks: h.chunks, links: h.links}
}
// Count number of codes of each length,
- // compute min and max length.
+ // compute maxRead and max length.
var count [maxCodeLen]int
var min, max int
for _, n := range lengths {
@@ -137,7 +141,7 @@
if n > max {
max = n
}
- count[n]++
+ count[n&maxCodeLenMask]++
}
// Empty tree. The decompressor.huffSym function will fail later if the tree
@@ -155,8 +159,8 @@
var nextcode [maxCodeLen]int
for i := min; i <= max; i++ {
code <<= 1
- nextcode[i] = code
- code += count[i]
+ nextcode[i&maxCodeLenMask] = code
+ code += count[i&maxCodeLenMask]
}
// Check that the coding is complete (i.e., that we've
@@ -168,14 +172,23 @@
return false
}
- h.min = min
+ h.maxRead = min
+ chunks := h.chunks[:]
+ for i := range chunks {
+ chunks[i] = 0
+ }
+
if max > huffmanChunkBits {
numLinks := 1 << (uint(max) - huffmanChunkBits)
h.linkMask = uint32(numLinks - 1)
// create link tables
link := nextcode[huffmanChunkBits+1] >> 1
- h.links = make([][]uint32, huffmanNumChunks-link)
+ if cap(h.links) < huffmanNumChunks-link {
+ h.links = make([][]uint16, huffmanNumChunks-link)
+ } else {
+ h.links = h.links[:huffmanNumChunks-link]
+ }
for j := uint(link); j < huffmanNumChunks; j++ {
reverse := int(bits.Reverse16(uint16(j)))
reverse >>= uint(16 - huffmanChunkBits)
@@ -183,9 +196,16 @@
if sanity && h.chunks[reverse] != 0 {
panic("impossible: overwriting existing chunk")
}
- h.chunks[reverse] = uint32(off<<huffmanValueShift | (huffmanChunkBits + 1))
- h.links[off] = make([]uint32, numLinks)
+ h.chunks[reverse] = uint16(off<<huffmanValueShift | (huffmanChunkBits + 1))
+ if cap(h.links[off]) < numLinks {
+ h.links[off] = make([]uint16, numLinks)
+ } else {
+ links := h.links[off][:0]
+ h.links[off] = links[:numLinks]
+ }
}
+ } else {
+ h.links = h.links[:0]
}
for i, n := range lengths {
@@ -194,7 +214,7 @@
}
code := nextcode[n]
nextcode[n]++
- chunk := uint32(i<<huffmanValueShift | n)
+ chunk := uint16(i<<huffmanValueShift | n)
reverse := int(bits.Reverse16(uint16(code)))
reverse >>= uint(16 - n)
if n <= huffmanChunkBits {
@@ -269,10 +289,6 @@
r Reader
roffset int64
- // Input bits, in top of b.
- b uint32
- nb uint
-
// Huffman decoders for literal/length, distance.
h1, h2 huffmanDecoder
@@ -283,19 +299,24 @@
// Output history, buffer.
dict dictDecoder
- // Temporary buffer (avoids repeated allocation).
- buf [4]byte
-
// Next step in the decompression,
// and decompression state.
step func(*decompressor)
stepState int
- final bool
err error
toRead []byte
hl, hd *huffmanDecoder
copyLen int
copyDist int
+
+ // Temporary buffer (avoids repeated allocation).
+ buf [4]byte
+
+ // Input bits, in top of b.
+ b uint32
+
+ nb uint
+ final bool
}
func (f *decompressor) nextBlock() {
@@ -316,7 +337,7 @@
// compressed, fixed Huffman tables
f.hl = &fixedHuffmanDecoder
f.hd = nil
- f.huffmanBlock()
+ f.huffmanBlockDecoder()()
case 2:
// compressed, dynamic Huffman tables
if f.err = f.readHuffman(); f.err != nil {
@@ -324,7 +345,7 @@
}
f.hl = &f.h1
f.hd = &f.h2
- f.huffmanBlock()
+ f.huffmanBlockDecoder()()
default:
// 3 is reserved.
f.err = CorruptInputError(f.roffset)
@@ -460,12 +481,18 @@
return CorruptInputError(f.roffset)
}
- // As an optimization, we can initialize the min bits to read at a time
+ // As an optimization, we can initialize the maxRead bits to read at a time
// for the HLIT tree to the length of the EOB marker since we know that
// every block must terminate with one. This preserves the property that
// we never read any extra bytes after the end of the DEFLATE stream.
- if f.h1.min < f.bits[endBlockMarker] {
- f.h1.min = f.bits[endBlockMarker]
+ if f.h1.maxRead < f.bits[endBlockMarker] {
+ f.h1.maxRead = f.bits[endBlockMarker]
+ }
+ if !f.final {
+ // If not the final block, the smallest block possible is
+ // a predefined table, BTYPE=01, with a single EOB marker.
+ // This will take up 3 + 7 bits.
+ f.h1.maxRead += 10
}
return nil
@@ -475,7 +502,7 @@
// hl and hd are the Huffman states for the lit/length values
// and the distance values, respectively. If hd == nil, using the
// fixed distance encoding associated with fixed Huffman blocks.
-func (f *decompressor) huffmanBlock() {
+func (f *decompressor) huffmanBlockGeneric() {
const (
stateInit = iota // Zero value must be stateInit
stateDict
@@ -491,19 +518,61 @@
readLiteral:
// Read literal and/or (length, distance) according to RFC section 3.2.3.
{
- v, err := f.huffSym(f.hl)
- if err != nil {
- f.err = err
- return
+ var v int
+ {
+ // Inlined v, err := f.huffSym(f.hl)
+ // Since a huffmanDecoder can be empty or be composed of a degenerate tree
+ // with single element, huffSym must error on these two edge cases. In both
+ // cases, the chunks slice will be 0 for the invalid sequence, leading it to
+ // satisfy the n == 0 check below.
+ n := uint(f.hl.maxRead)
+ // Optimization. Compiler isn't smart enough to keep f.b,f.nb in registers,
+ // but is smart enough to keep local variables in registers, so use nb and b,
+ // inline call to moreBits and reassign b,nb back to f on return.
+ nb, b := f.nb, f.b
+ for {
+ for nb < n {
+ c, err := f.r.ReadByte()
+ if err != nil {
+ f.b = b
+ f.nb = nb
+ f.err = noEOF(err)
+ return
+ }
+ f.roffset++
+ b |= uint32(c) << (nb & 31)
+ nb += 8
+ }
+ chunk := f.hl.chunks[b&(huffmanNumChunks-1)]
+ n = uint(chunk & huffmanCountMask)
+ if n > huffmanChunkBits {
+ chunk = f.hl.links[chunk>>huffmanValueShift][(b>>huffmanChunkBits)&f.hl.linkMask]
+ n = uint(chunk & huffmanCountMask)
+ }
+ if n <= nb {
+ if n == 0 {
+ f.b = b
+ f.nb = nb
+ f.err = CorruptInputError(f.roffset)
+ return
+ }
+ f.b = b >> (n & 31)
+ f.nb = nb - n
+ v = int(chunk >> huffmanValueShift)
+ break
+ }
+ }
}
+
var n uint // number of bits extra
var length int
+ var err error
switch {
case v < 256:
f.dict.writeByte(byte(v))
if f.dict.availWrite() == 0 {
f.toRead = f.dict.readFlush()
- f.step = (*decompressor).huffmanBlock
+ f.step = (*decompressor).huffmanBlockGeneric
f.stepState = stateInit
return
}
@@ -610,7 +679,7 @@
if f.dict.availWrite() == 0 || f.copyLen > 0 {
f.toRead = f.dict.readFlush()
- f.step = (*decompressor).huffmanBlock // We need to continue this work
+ f.step = (*decompressor).huffmanBlockGeneric // We need to continue this work
f.stepState = stateDict
return
}
@@ -621,20 +690,31 @@
// Copy a single uncompressed data block from input to output.
func (f *decompressor) dataBlock() {
// Uncompressed.
- // Discard current half-byte.
- f.nb = 0
- f.b = 0
+ // Discard current partial byte.
+ left := (f.nb) & 7
+ f.nb -= left
+ f.b >>= left
+
+ offBytes := f.nb >> 3
+ // Unfilled values will be overwritten.
+ f.buf[0] = uint8(f.b)
+ f.buf[1] = uint8(f.b >> 8)
+ f.buf[2] = uint8(f.b >> 16)
+ f.buf[3] = uint8(f.b >> 24)
+
+ f.roffset += int64(offBytes)
+ f.nb, f.b = 0, 0
// Length then ones-complement of length.
- nr, err := io.ReadFull(f.r, f.buf[0:4])
+ nr, err := io.ReadFull(f.r, f.buf[offBytes:4])
f.roffset += int64(nr)
if err != nil {
f.err = noEOF(err)
return
}
- n := int(f.buf[0]) | int(f.buf[1])<<8
- nn := int(f.buf[2]) | int(f.buf[3])<<8
- if uint16(nn) != uint16(^n) {
+ n := uint16(f.buf[0]) | uint16(f.buf[1])<<8
+ nn := uint16(f.buf[2]) | uint16(f.buf[3])<<8
+ if nn != ^n {
f.err = CorruptInputError(f.roffset)
return
}
@@ -645,7 +725,7 @@
return
}
- f.copyLen = n
+ f.copyLen = int(n)
f.copyData()
}
@@ -709,7 +789,7 @@
// with single element, huffSym must error on these two edge cases. In both
// cases, the chunks slice will be 0 for the invalid sequence, leading it to
// satisfy the n == 0 check below.
- n := uint(h.min)
+ n := uint(h.maxRead)
// Optimization. Compiler isn't smart enough to keep f.b,f.nb in registers,
// but is smart enough to keep local variables in registers, so use nb and b,
// inline call to moreBits and reassign b,nb back to f on return.
@@ -778,6 +858,8 @@
r: makeReader(r),
bits: f.bits,
codebits: f.codebits,
+ h1: f.h1,
+ h2: f.h2,
dict: f.dict,
step: (*decompressor).nextBlock,
}
diff --git a/src/compress/flate/inflate_gen.go b/src/compress/flate/inflate_gen.go
new file mode 100644
index 0000000..b7e6727
--- /dev/null
+++ b/src/compress/flate/inflate_gen.go
@@ -0,0 +1,829 @@
+// Code generated by go generate gen_inflate.go. DO NOT EDIT.
+
+package flate
+
+import (
+ "bufio"
+ "bytes"
+ "math/bits"
+ "strings"
+)
+
+// huffmanBytesBuffer decodes a single Huffman block from f.
+// f.r must be a *bytes.Buffer.
+// hl and hd are the Huffman states for the lit/length values
+// and the distance values, respectively. If hd == nil, using the
+// fixed distance encoding associated with fixed Huffman blocks.
+func (f *decompressor) huffmanBytesBuffer() {
+ const (
+ stateInit = iota // Zero value must be stateInit
+ stateDict
+ )
+ fr := f.r.(*bytes.Buffer)
+ moreBits := func() error {
+ c, err := fr.ReadByte()
+ if err != nil {
+ return noEOF(err)
+ }
+ f.roffset++
+ f.b |= uint32(c) << f.nb
+ f.nb += 8
+ return nil
+ }
+
+ switch f.stepState {
+ case stateInit:
+ goto readLiteral
+ case stateDict:
+ goto copyHistory
+ }
+
+readLiteral:
+ // Read literal and/or (length, distance) according to RFC section 3.2.3.
+ {
+ var v int
+ {
+ // Inlined v, err := f.huffSym(f.hl)
+ // Since a huffmanDecoder can be empty or be composed of a degenerate tree
+ // with single element, huffSym must error on these two edge cases. In both
+ // cases, the chunks slice will be 0 for the invalid sequence, leading it to
+ // satisfy the n == 0 check below.
+ n := uint(f.hl.maxRead)
+ // Optimization. Compiler isn't smart enough to keep f.b,f.nb in registers,
+ // but is smart enough to keep local variables in registers, so use nb and b,
+ // inline call to moreBits and reassign b,nb back to f on return.
+ nb, b := f.nb, f.b
+ for {
+ for nb < n {
+ c, err := fr.ReadByte()
+ if err != nil {
+ f.b = b
+ f.nb = nb
+ f.err = noEOF(err)
+ return
+ }
+ f.roffset++
+ b |= uint32(c) << (nb & 31)
+ nb += 8
+ }
+ chunk := f.hl.chunks[b&(huffmanNumChunks-1)]
+ n = uint(chunk & huffmanCountMask)
+ if n > huffmanChunkBits {
+ chunk = f.hl.links[chunk>>huffmanValueShift][(b>>huffmanChunkBits)&f.hl.linkMask]
+ n = uint(chunk & huffmanCountMask)
+ }
+ if n <= nb {
+ if n == 0 {
+ f.b = b
+ f.nb = nb
+ f.err = CorruptInputError(f.roffset)
+ return
+ }
+ f.b = b >> (n & 31)
+ f.nb = nb - n
+ v = int(chunk >> huffmanValueShift)
+ break
+ }
+ }
+ }
+
+ var n uint // number of bits extra
+ var length int
+ var err error
+ switch {
+ case v < 256:
+ f.dict.writeByte(byte(v))
+ if f.dict.availWrite() == 0 {
+ f.toRead = f.dict.readFlush()
+ f.step = (*decompressor).huffmanBytesBuffer
+ f.stepState = stateInit
+ return
+ }
+ goto readLiteral
+ case v == 256:
+ f.finishBlock()
+ return
+ // otherwise, reference to older data
+ case v < 265:
+ length = v - (257 - 3)
+ n = 0
+ case v < 269:
+ length = v*2 - (265*2 - 11)
+ n = 1
+ case v < 273:
+ length = v*4 - (269*4 - 19)
+ n = 2
+ case v < 277:
+ length = v*8 - (273*8 - 35)
+ n = 3
+ case v < 281:
+ length = v*16 - (277*16 - 67)
+ n = 4
+ case v < 285:
+ length = v*32 - (281*32 - 131)
+ n = 5
+ case v < maxNumLit:
+ length = 258
+ n = 0
+ default:
+ f.err = CorruptInputError(f.roffset)
+ return
+ }
+ if n > 0 {
+ for f.nb < n {
+ if err = moreBits(); err != nil {
+ f.err = err
+ return
+ }
+ }
+ length += int(f.b & uint32(1<<n-1))
+ f.b >>= n
+ f.nb -= n
+ }
+
+ var dist int
+ if f.hd == nil {
+ for f.nb < 5 {
+ if err = moreBits(); err != nil {
+ f.err = err
+ return
+ }
+ }
+ dist = int(bits.Reverse8(uint8(f.b & 0x1F << 3)))
+ f.b >>= 5
+ f.nb -= 5
+ } else {
+ if dist, err = f.huffSym(f.hd); err != nil {
+ f.err = err
+ return
+ }
+ }
+
+ switch {
+ case dist < 4:
+ dist++
+ case dist < maxNumDist:
+ nb := uint(dist-2) >> 1
+ // have 1 bit in bottom of dist, need nb more.
+ extra := (dist & 1) << nb
+ for f.nb < nb {
+ if err = moreBits(); err != nil {
+ f.err = err
+ return
+ }
+ }
+ extra |= int(f.b & uint32(1<<nb-1))
+ f.b >>= nb
+ f.nb -= nb
+ dist = 1<<(nb+1) + 1 + extra
+ default:
+ f.err = CorruptInputError(f.roffset)
+ return
+ }
+
+ // No check on length; encoding can be prescient.
+ if dist > f.dict.histSize() {
+ f.err = CorruptInputError(f.roffset)
+ return
+ }
+
+ f.copyLen, f.copyDist = length, dist
+ goto copyHistory
+ }
+
+copyHistory:
+ // Perform a backwards copy according to RFC section 3.2.3.
+ {
+ cnt := f.dict.tryWriteCopy(f.copyDist, f.copyLen)
+ if cnt == 0 {
+ cnt = f.dict.writeCopy(f.copyDist, f.copyLen)
+ }
+ f.copyLen -= cnt
+
+ if f.dict.availWrite() == 0 || f.copyLen > 0 {
+ f.toRead = f.dict.readFlush()
+ f.step = (*decompressor).huffmanBytesBuffer // We need to continue this work
+ f.stepState = stateDict
+ return
+ }
+ goto readLiteral
+ }
+}
+
+// huffmanBytesReader decodes a single Huffman block from f.
+// f.r must be a *bytes.Reader.
+// hl and hd are the Huffman states for the lit/length values
+// and the distance values, respectively. If hd == nil, using the
+// fixed distance encoding associated with fixed Huffman blocks.
+func (f *decompressor) huffmanBytesReader() {
+ const (
+ stateInit = iota // Zero value must be stateInit
+ stateDict
+ )
+ fr := f.r.(*bytes.Reader)
+ moreBits := func() error {
+ c, err := fr.ReadByte()
+ if err != nil {
+ return noEOF(err)
+ }
+ f.roffset++
+ f.b |= uint32(c) << f.nb
+ f.nb += 8
+ return nil
+ }
+
+ switch f.stepState {
+ case stateInit:
+ goto readLiteral
+ case stateDict:
+ goto copyHistory
+ }
+
+readLiteral:
+ // Read literal and/or (length, distance) according to RFC section 3.2.3.
+ {
+ var v int
+ {
+ // Inlined v, err := f.huffSym(f.hl)
+ // Since a huffmanDecoder can be empty or be composed of a degenerate tree
+ // with single element, huffSym must error on these two edge cases. In both
+ // cases, the chunks slice will be 0 for the invalid sequence, leading it to
+ // satisfy the n == 0 check below.
+ n := uint(f.hl.maxRead)
+ // Optimization. Compiler isn't smart enough to keep f.b,f.nb in registers,
+ // but is smart enough to keep local variables in registers, so use nb and b,
+ // inline call to moreBits and reassign b,nb back to f on return.
+ nb, b := f.nb, f.b
+ for {
+ for nb < n {
+ c, err := fr.ReadByte()
+ if err != nil {
+ f.b = b
+ f.nb = nb
+ f.err = noEOF(err)
+ return
+ }
+ f.roffset++
+ b |= uint32(c) << (nb & 31)
+ nb += 8
+ }
+ chunk := f.hl.chunks[b&(huffmanNumChunks-1)]
+ n = uint(chunk & huffmanCountMask)
+ if n > huffmanChunkBits {
+ chunk = f.hl.links[chunk>>huffmanValueShift][(b>>huffmanChunkBits)&f.hl.linkMask]
+ n = uint(chunk & huffmanCountMask)
+ }
+ if n <= nb {
+ if n == 0 {
+ f.b = b
+ f.nb = nb
+ f.err = CorruptInputError(f.roffset)
+ return
+ }
+ f.b = b >> (n & 31)
+ f.nb = nb - n
+ v = int(chunk >> huffmanValueShift)
+ break
+ }
+ }
+ }
+
+ var n uint // number of bits extra
+ var length int
+ var err error
+ switch {
+ case v < 256:
+ f.dict.writeByte(byte(v))
+ if f.dict.availWrite() == 0 {
+ f.toRead = f.dict.readFlush()
+ f.step = (*decompressor).huffmanBytesReader
+ f.stepState = stateInit
+ return
+ }
+ goto readLiteral
+ case v == 256:
+ f.finishBlock()
+ return
+ // otherwise, reference to older data
+ case v < 265:
+ length = v - (257 - 3)
+ n = 0
+ case v < 269:
+ length = v*2 - (265*2 - 11)
+ n = 1
+ case v < 273:
+ length = v*4 - (269*4 - 19)
+ n = 2
+ case v < 277:
+ length = v*8 - (273*8 - 35)
+ n = 3
+ case v < 281:
+ length = v*16 - (277*16 - 67)
+ n = 4
+ case v < 285:
+ length = v*32 - (281*32 - 131)
+ n = 5
+ case v < maxNumLit:
+ length = 258
+ n = 0
+ default:
+ f.err = CorruptInputError(f.roffset)
+ return
+ }
+ if n > 0 {
+ for f.nb < n {
+ if err = moreBits(); err != nil {
+ f.err = err
+ return
+ }
+ }
+ length += int(f.b & uint32(1<<n-1))
+ f.b >>= n
+ f.nb -= n
+ }
+
+ var dist int
+ if f.hd == nil {
+ for f.nb < 5 {
+ if err = moreBits(); err != nil {
+ f.err = err
+ return
+ }
+ }
+ dist = int(bits.Reverse8(uint8(f.b & 0x1F << 3)))
+ f.b >>= 5
+ f.nb -= 5
+ } else {
+ if dist, err = f.huffSym(f.hd); err != nil {
+ f.err = err
+ return
+ }
+ }
+
+ switch {
+ case dist < 4:
+ dist++
+ case dist < maxNumDist:
+ nb := uint(dist-2) >> 1
+ // have 1 bit in bottom of dist, need nb more.
+ extra := (dist & 1) << nb
+ for f.nb < nb {
+ if err = moreBits(); err != nil {
+ f.err = err
+ return
+ }
+ }
+ extra |= int(f.b & uint32(1<<nb-1))
+ f.b >>= nb
+ f.nb -= nb
+ dist = 1<<(nb+1) + 1 + extra
+ default:
+ f.err = CorruptInputError(f.roffset)
+ return
+ }
+
+ // No check on length; encoding can be prescient.
+ if dist > f.dict.histSize() {
+ f.err = CorruptInputError(f.roffset)
+ return
+ }
+
+ f.copyLen, f.copyDist = length, dist
+ goto copyHistory
+ }
+
+copyHistory:
+ // Perform a backwards copy according to RFC section 3.2.3.
+ {
+ cnt := f.dict.tryWriteCopy(f.copyDist, f.copyLen)
+ if cnt == 0 {
+ cnt = f.dict.writeCopy(f.copyDist, f.copyLen)
+ }
+ f.copyLen -= cnt
+
+ if f.dict.availWrite() == 0 || f.copyLen > 0 {
+ f.toRead = f.dict.readFlush()
+ f.step = (*decompressor).huffmanBytesReader // We need to continue this work
+ f.stepState = stateDict
+ return
+ }
+ goto readLiteral
+ }
+}
+
+// huffmanBufioReader decodes a single Huffman block from f.
+// f.r must be a *bufio.Reader.
+// hl and hd are the Huffman states for the lit/length values
+// and the distance values, respectively. If hd == nil, the fixed
+// distance encoding associated with fixed Huffman blocks is used.
+func (f *decompressor) huffmanBufioReader() {
+ const (
+ stateInit = iota // Zero value must be stateInit
+ stateDict
+ )
+ fr := f.r.(*bufio.Reader)
+ moreBits := func() error {
+ c, err := fr.ReadByte()
+ if err != nil {
+ return noEOF(err)
+ }
+ f.roffset++
+ f.b |= uint32(c) << f.nb
+ f.nb += 8
+ return nil
+ }
+
+ switch f.stepState {
+ case stateInit:
+ goto readLiteral
+ case stateDict:
+ goto copyHistory
+ }
+
+readLiteral:
+ // Read literal and/or (length, distance) according to RFC section 3.2.3.
+ {
+ var v int
+ {
+ // Inlined v, err := f.huffSym(f.hl)
+ // Since a huffmanDecoder can be empty or be composed of a degenerate tree
+ // with a single element, huffSym must error on these two edge cases. In both
+ // cases, the chunks slice will be 0 for the invalid sequence, leading it to
+ // satisfy the n == 0 check below.
+ n := uint(f.hl.maxRead)
+ // Optimization. The compiler isn't smart enough to keep f.b and f.nb in registers,
+ // but is smart enough to keep local variables in registers, so use nb and b,
+ // inline the call to moreBits, and reassign b and nb back to f on return.
+ nb, b := f.nb, f.b
+ for {
+ for nb < n {
+ c, err := fr.ReadByte()
+ if err != nil {
+ f.b = b
+ f.nb = nb
+ f.err = noEOF(err)
+ return
+ }
+ f.roffset++
+ b |= uint32(c) << (nb & 31)
+ nb += 8
+ }
+ chunk := f.hl.chunks[b&(huffmanNumChunks-1)]
+ n = uint(chunk & huffmanCountMask)
+ if n > huffmanChunkBits {
+ chunk = f.hl.links[chunk>>huffmanValueShift][(b>>huffmanChunkBits)&f.hl.linkMask]
+ n = uint(chunk & huffmanCountMask)
+ }
+ if n <= nb {
+ if n == 0 {
+ f.b = b
+ f.nb = nb
+ f.err = CorruptInputError(f.roffset)
+ return
+ }
+ f.b = b >> (n & 31)
+ f.nb = nb - n
+ v = int(chunk >> huffmanValueShift)
+ break
+ }
+ }
+ }
+
+ var n uint // number of bits extra
+ var length int
+ var err error
+ switch {
+ case v < 256:
+ f.dict.writeByte(byte(v))
+ if f.dict.availWrite() == 0 {
+ f.toRead = f.dict.readFlush()
+ f.step = (*decompressor).huffmanBufioReader
+ f.stepState = stateInit
+ return
+ }
+ goto readLiteral
+ case v == 256:
+ f.finishBlock()
+ return
+ // otherwise, reference to older data
+ case v < 265:
+ length = v - (257 - 3)
+ n = 0
+ case v < 269:
+ length = v*2 - (265*2 - 11)
+ n = 1
+ case v < 273:
+ length = v*4 - (269*4 - 19)
+ n = 2
+ case v < 277:
+ length = v*8 - (273*8 - 35)
+ n = 3
+ case v < 281:
+ length = v*16 - (277*16 - 67)
+ n = 4
+ case v < 285:
+ length = v*32 - (281*32 - 131)
+ n = 5
+ case v < maxNumLit:
+ length = 258
+ n = 0
+ default:
+ f.err = CorruptInputError(f.roffset)
+ return
+ }
+ if n > 0 {
+ for f.nb < n {
+ if err = moreBits(); err != nil {
+ f.err = err
+ return
+ }
+ }
+ length += int(f.b & uint32(1<<n-1))
+ f.b >>= n
+ f.nb -= n
+ }
+
+ var dist int
+ if f.hd == nil {
+ for f.nb < 5 {
+ if err = moreBits(); err != nil {
+ f.err = err
+ return
+ }
+ }
+ dist = int(bits.Reverse8(uint8(f.b & 0x1F << 3)))
+ f.b >>= 5
+ f.nb -= 5
+ } else {
+ if dist, err = f.huffSym(f.hd); err != nil {
+ f.err = err
+ return
+ }
+ }
+
+ switch {
+ case dist < 4:
+ dist++
+ case dist < maxNumDist:
+ nb := uint(dist-2) >> 1
+ // have 1 bit in bottom of dist, need nb more.
+ extra := (dist & 1) << nb
+ for f.nb < nb {
+ if err = moreBits(); err != nil {
+ f.err = err
+ return
+ }
+ }
+ extra |= int(f.b & uint32(1<<nb-1))
+ f.b >>= nb
+ f.nb -= nb
+ dist = 1<<(nb+1) + 1 + extra
+ default:
+ f.err = CorruptInputError(f.roffset)
+ return
+ }
+
+ // No check on length; encoding can be prescient.
+ if dist > f.dict.histSize() {
+ f.err = CorruptInputError(f.roffset)
+ return
+ }
+
+ f.copyLen, f.copyDist = length, dist
+ goto copyHistory
+ }
+
+copyHistory:
+ // Perform a backwards copy according to RFC section 3.2.3.
+ {
+ cnt := f.dict.tryWriteCopy(f.copyDist, f.copyLen)
+ if cnt == 0 {
+ cnt = f.dict.writeCopy(f.copyDist, f.copyLen)
+ }
+ f.copyLen -= cnt
+
+ if f.dict.availWrite() == 0 || f.copyLen > 0 {
+ f.toRead = f.dict.readFlush()
+ f.step = (*decompressor).huffmanBufioReader // We need to continue this work
+ f.stepState = stateDict
+ return
+ }
+ goto readLiteral
+ }
+}
+
+// huffmanStringsReader decodes a single Huffman block from f.
+// f.r must be a *strings.Reader.
+// hl and hd are the Huffman states for the lit/length values
+// and the distance values, respectively. If hd == nil, the fixed
+// distance encoding associated with fixed Huffman blocks is used.
+func (f *decompressor) huffmanStringsReader() {
+ const (
+ stateInit = iota // Zero value must be stateInit
+ stateDict
+ )
+ fr := f.r.(*strings.Reader)
+ moreBits := func() error {
+ c, err := fr.ReadByte()
+ if err != nil {
+ return noEOF(err)
+ }
+ f.roffset++
+ f.b |= uint32(c) << f.nb
+ f.nb += 8
+ return nil
+ }
+
+ switch f.stepState {
+ case stateInit:
+ goto readLiteral
+ case stateDict:
+ goto copyHistory
+ }
+
+readLiteral:
+ // Read literal and/or (length, distance) according to RFC section 3.2.3.
+ {
+ var v int
+ {
+ // Inlined v, err := f.huffSym(f.hl)
+ // Since a huffmanDecoder can be empty or be composed of a degenerate tree
+ // with a single element, huffSym must error on these two edge cases. In both
+ // cases, the chunks slice will be 0 for the invalid sequence, leading it to
+ // satisfy the n == 0 check below.
+ n := uint(f.hl.maxRead)
+ // Optimization. The compiler isn't smart enough to keep f.b and f.nb in registers,
+ // but is smart enough to keep local variables in registers, so use nb and b,
+ // inline the call to moreBits, and reassign b and nb back to f on return.
+ nb, b := f.nb, f.b
+ for {
+ for nb < n {
+ c, err := fr.ReadByte()
+ if err != nil {
+ f.b = b
+ f.nb = nb
+ f.err = noEOF(err)
+ return
+ }
+ f.roffset++
+ b |= uint32(c) << (nb & 31)
+ nb += 8
+ }
+ chunk := f.hl.chunks[b&(huffmanNumChunks-1)]
+ n = uint(chunk & huffmanCountMask)
+ if n > huffmanChunkBits {
+ chunk = f.hl.links[chunk>>huffmanValueShift][(b>>huffmanChunkBits)&f.hl.linkMask]
+ n = uint(chunk & huffmanCountMask)
+ }
+ if n <= nb {
+ if n == 0 {
+ f.b = b
+ f.nb = nb
+ f.err = CorruptInputError(f.roffset)
+ return
+ }
+ f.b = b >> (n & 31)
+ f.nb = nb - n
+ v = int(chunk >> huffmanValueShift)
+ break
+ }
+ }
+ }
+
+ var n uint // number of bits extra
+ var length int
+ var err error
+ switch {
+ case v < 256:
+ f.dict.writeByte(byte(v))
+ if f.dict.availWrite() == 0 {
+ f.toRead = f.dict.readFlush()
+ f.step = (*decompressor).huffmanStringsReader
+ f.stepState = stateInit
+ return
+ }
+ goto readLiteral
+ case v == 256:
+ f.finishBlock()
+ return
+ // otherwise, reference to older data
+ case v < 265:
+ length = v - (257 - 3)
+ n = 0
+ case v < 269:
+ length = v*2 - (265*2 - 11)
+ n = 1
+ case v < 273:
+ length = v*4 - (269*4 - 19)
+ n = 2
+ case v < 277:
+ length = v*8 - (273*8 - 35)
+ n = 3
+ case v < 281:
+ length = v*16 - (277*16 - 67)
+ n = 4
+ case v < 285:
+ length = v*32 - (281*32 - 131)
+ n = 5
+ case v < maxNumLit:
+ length = 258
+ n = 0
+ default:
+ f.err = CorruptInputError(f.roffset)
+ return
+ }
+ if n > 0 {
+ for f.nb < n {
+ if err = moreBits(); err != nil {
+ f.err = err
+ return
+ }
+ }
+ length += int(f.b & uint32(1<<n-1))
+ f.b >>= n
+ f.nb -= n
+ }
+
+ var dist int
+ if f.hd == nil {
+ for f.nb < 5 {
+ if err = moreBits(); err != nil {
+ f.err = err
+ return
+ }
+ }
+ dist = int(bits.Reverse8(uint8(f.b & 0x1F << 3)))
+ f.b >>= 5
+ f.nb -= 5
+ } else {
+ if dist, err = f.huffSym(f.hd); err != nil {
+ f.err = err
+ return
+ }
+ }
+
+ switch {
+ case dist < 4:
+ dist++
+ case dist < maxNumDist:
+ nb := uint(dist-2) >> 1
+ // have 1 bit in bottom of dist, need nb more.
+ extra := (dist & 1) << nb
+ for f.nb < nb {
+ if err = moreBits(); err != nil {
+ f.err = err
+ return
+ }
+ }
+ extra |= int(f.b & uint32(1<<nb-1))
+ f.b >>= nb
+ f.nb -= nb
+ dist = 1<<(nb+1) + 1 + extra
+ default:
+ f.err = CorruptInputError(f.roffset)
+ return
+ }
+
+ // No check on length; encoding can be prescient.
+ if dist > f.dict.histSize() {
+ f.err = CorruptInputError(f.roffset)
+ return
+ }
+
+ f.copyLen, f.copyDist = length, dist
+ goto copyHistory
+ }
+
+copyHistory:
+ // Perform a backwards copy according to RFC section 3.2.3.
+ {
+ cnt := f.dict.tryWriteCopy(f.copyDist, f.copyLen)
+ if cnt == 0 {
+ cnt = f.dict.writeCopy(f.copyDist, f.copyLen)
+ }
+ f.copyLen -= cnt
+
+ if f.dict.availWrite() == 0 || f.copyLen > 0 {
+ f.toRead = f.dict.readFlush()
+ f.step = (*decompressor).huffmanStringsReader // We need to continue this work
+ f.stepState = stateDict
+ return
+ }
+ goto readLiteral
+ }
+}
+
+// huffmanBlockDecoder returns the Huffman block decoding function
+// specialized for the concrete type of f.r.
+func (f *decompressor) huffmanBlockDecoder() func() {
+ switch f.r.(type) {
+ case *bytes.Buffer:
+ return f.huffmanBytesBuffer
+ case *bytes.Reader:
+ return f.huffmanBytesReader
+ case *bufio.Reader:
+ return f.huffmanBufioReader
+ case *strings.Reader:
+ return f.huffmanStringsReader
+ default:
+ return f.huffmanBlockGeneric
+ }
+}
diff --git a/src/compress/flate/reader_test.go b/src/compress/flate/reader_test.go
index 9d2943a..cac99a8 100644
--- a/src/compress/flate/reader_test.go
+++ b/src/compress/flate/reader_test.go
@@ -51,10 +51,14 @@
w.Close()
buf1 := compressed.Bytes()
buf0, compressed, w = nil, nil, nil
+ src := bytes.NewReader(buf1)
+ dec := NewReader(src)
runtime.GC()
b.StartTimer()
for i := 0; i < b.N; i++ {
- io.Copy(ioutil.Discard, NewReader(bytes.NewReader(buf1)))
+ src.Reset(buf1)
+ dec.(Resetter).Reset(src, nil)
+ io.Copy(ioutil.Discard, dec)
}
})
}
To view, visit change 227737. To unsubscribe, or for help writing mail filters, visit settings.
Gerrit Bot uploaded patch set #2 to this change.
Gopher Robot abandoned this change.
Heschi Kreinick restored this change.
1 comment:
Patchset:
Sorry this has sat around for so long. Any interest in fixing the merge conflicts? Thanks.
Attention is currently required from: Ian Lance Taylor.
1 comment:
Patchset:
> Sorry this has sat around for so long. Any interest in fixing the merge conflicts? Thanks.
Sure. I have an additional 5-10% I found since then, which I will see if I can incorporate.
Maybe this could be achieved through generics instead of codegen. I will investigate a bit, since I haven't done anything with generics yet.
Gerrit Bot uploaded patch set #3 to this change.
Change-Id: I7e346330512116baa27e448aa606a2f4e551054c
GitHub-Last-Rev: c00babd617fabf290f784c5bd074a82a6af781c5
GitHub-Pull-Request: golang/go#38324
---
A src/compress/flate/gen_inflate.go
M src/compress/flate/inflate.go
A src/compress/flate/inflate_gen.go
M src/compress/flate/reader_test.go
4 files changed, 1,386 insertions(+), 46 deletions(-)
Gerrit Bot uploaded patch set #4 to this change.
Change-Id: I7e346330512116baa27e448aa606a2f4e551054c
GitHub-Last-Rev: ae9b62a9851eb3d8acf14cebfef1d8c663e407dc
GitHub-Pull-Request: golang/go#38324
---
A src/compress/flate/gen_inflate.go
M src/compress/flate/inflate.go
A src/compress/flate/inflate_gen.go
M src/compress/flate/reader_test.go
4 files changed, 1,665 insertions(+), 186 deletions(-)
Gerrit Bot uploaded patch set #5 to this change.
```
name old time/op new time/op delta
Decode/Digits/Huffman/1e4-32 63.8µs ± 0% 41.3µs ± 0% -35.22% (p=0.000 n=10+10)
Decode/Digits/Huffman/1e5-32 625µs ± 0% 404µs ± 0% -35.31% (p=0.000 n=10+10)
Decode/Digits/Huffman/1e6-32 6.25ms ± 0% 4.02ms ± 1% -35.64% (p=0.000 n=10+10)
Decode/Digits/Speed/1e4-32 72.1µs ± 1% 48.0µs ± 0% -33.36% (p=0.000 n=10+9)
Decode/Digits/Speed/1e5-32 792µs ± 1% 578µs ± 1% -27.04% (p=0.000 n=10+10)
Decode/Digits/Speed/1e6-32 8.09ms ± 0% 5.85ms ± 0% -27.68% (p=0.000 n=9+10)
Decode/Digits/Default/1e4-32 74.1µs ± 1% 49.7µs ± 1% -32.87% (p=0.000 n=10+9)
Decode/Digits/Default/1e5-32 775µs ± 1% 579µs ± 0% -25.35% (p=0.000 n=10+10)
Decode/Digits/Default/1e6-32 7.84ms ± 1% 5.84ms ± 1% -25.59% (p=0.000 n=10+10)
Decode/Digits/Compression/1e4-32 74.1µs ± 0% 49.8µs ± 0% -32.83% (p=0.000 n=10+10)
Decode/Digits/Compression/1e5-32 777µs ± 1% 579µs ± 0% -25.47% (p=0.000 n=10+10)
Decode/Digits/Compression/1e6-32 7.83ms ± 1% 5.83ms ± 0% -25.59% (p=0.000 n=10+10)
Decode/Newton/Huffman/1e4-32 72.9µs ± 0% 45.6µs ± 1% -37.48% (p=0.000 n=10+10)
Decode/Newton/Huffman/1e5-32 712µs ± 1% 471µs ± 1% -33.92% (p=0.000 n=10+10)
Decode/Newton/Huffman/1e6-32 7.11ms ± 0% 4.70ms ± 1% -33.98% (p=0.000 n=10+10)
Decode/Newton/Speed/1e4-32 67.0µs ± 1% 45.4µs ± 1% -32.19% (p=0.000 n=10+10)
Decode/Newton/Speed/1e5-32 616µs ± 1% 447µs ± 0% -27.49% (p=0.000 n=10+10)
Decode/Newton/Speed/1e6-32 6.17ms ± 0% 4.50ms ± 0% -26.98% (p=0.000 n=10+10)
Decode/Newton/Default/1e4-32 60.7µs ± 0% 39.6µs ± 0% -34.84% (p=0.000 n=10+10)
Decode/Newton/Default/1e5-32 492µs ± 0% 360µs ± 0% -26.84% (p=0.000 n=10+10)
Decode/Newton/Default/1e6-32 4.87ms ± 1% 3.59ms ± 0% -26.34% (p=0.000 n=10+10)
Decode/Newton/Compression/1e4-32 60.8µs ± 1% 39.6µs ± 1% -34.92% (p=0.000 n=10+10)
Decode/Newton/Compression/1e5-32 491µs ± 1% 357µs ± 1% -27.23% (p=0.000 n=10+10)
Decode/Newton/Compression/1e6-32 4.84ms ± 0% 3.58ms ± 0% -26.17% (p=0.000 n=10+10)
name old speed new speed delta
Decode/Digits/Huffman/1e4-32 157MB/s ± 0% 242MB/s ± 0% +54.37% (p=0.000 n=10+10)
Decode/Digits/Huffman/1e5-32 160MB/s ± 0% 247MB/s ± 0% +54.58% (p=0.000 n=10+10)
Decode/Digits/Huffman/1e6-32 160MB/s ± 0% 249MB/s ± 1% +55.39% (p=0.000 n=10+10)
Decode/Digits/Speed/1e4-32 139MB/s ± 1% 208MB/s ± 0% +50.06% (p=0.000 n=10+9)
Decode/Digits/Speed/1e5-32 126MB/s ± 1% 173MB/s ± 1% +37.05% (p=0.000 n=10+10)
Decode/Digits/Speed/1e6-32 124MB/s ± 0% 171MB/s ± 0% +38.28% (p=0.000 n=9+10)
Decode/Digits/Default/1e4-32 135MB/s ± 1% 201MB/s ± 1% +48.95% (p=0.000 n=10+9)
Decode/Digits/Default/1e5-32 129MB/s ± 1% 173MB/s ± 0% +33.95% (p=0.000 n=10+10)
Decode/Digits/Default/1e6-32 127MB/s ± 1% 171MB/s ± 1% +34.39% (p=0.000 n=10+10)
Decode/Digits/Compression/1e4-32 135MB/s ± 0% 201MB/s ± 0% +48.88% (p=0.000 n=10+10)
Decode/Digits/Compression/1e5-32 129MB/s ± 1% 173MB/s ± 0% +34.17% (p=0.000 n=10+10)
Decode/Digits/Compression/1e6-32 128MB/s ± 1% 172MB/s ± 0% +34.39% (p=0.000 n=10+10)
Decode/Newton/Huffman/1e4-32 137MB/s ± 0% 219MB/s ± 1% +59.96% (p=0.000 n=10+10)
Decode/Newton/Huffman/1e5-32 140MB/s ± 1% 212MB/s ± 1% +51.32% (p=0.000 n=10+10)
Decode/Newton/Huffman/1e6-32 141MB/s ± 0% 213MB/s ± 1% +51.46% (p=0.000 n=10+10)
Decode/Newton/Speed/1e4-32 149MB/s ± 1% 220MB/s ± 1% +47.48% (p=0.000 n=10+10)
Decode/Newton/Speed/1e5-32 162MB/s ± 1% 224MB/s ± 0% +37.92% (p=0.000 n=10+10)
Decode/Newton/Speed/1e6-32 162MB/s ± 0% 222MB/s ± 0% +36.95% (p=0.000 n=10+10)
Decode/Newton/Default/1e4-32 165MB/s ± 0% 253MB/s ± 0% +53.47% (p=0.000 n=10+10)
Decode/Newton/Default/1e5-32 203MB/s ± 0% 278MB/s ± 0% +36.68% (p=0.000 n=10+10)
Decode/Newton/Default/1e6-32 205MB/s ± 1% 279MB/s ± 0% +35.77% (p=0.000 n=10+10)
Decode/Newton/Compression/1e4-32 164MB/s ± 1% 253MB/s ± 1% +53.66% (p=0.000 n=10+10)
Decode/Newton/Compression/1e5-32 204MB/s ± 1% 280MB/s ± 1% +37.41% (p=0.000 n=10+10)
Decode/Newton/Compression/1e6-32 206MB/s ± 0% 280MB/s ± 0% +35.44% (p=0.000 n=10+10)
name old alloc/op new alloc/op delta
Decode/Digits/Huffman/1e4-32 0.00B ±NaN% 16.00B ± 0% +Inf% (p=0.000 n=10+10)
Decode/Digits/Huffman/1e5-32 6.00B ± 0% 36.00B ± 0% +500.00% (p=0.000 n=10+10)
Decode/Digits/Huffman/1e6-32 64.0B ± 0% 304.0B ± 0% +375.00% (p=0.000 n=10+9)
Decode/Digits/Speed/1e4-32 80.0B ± 0% 16.0B ± 0% -80.00% (p=0.000 n=10+10)
Decode/Digits/Speed/1e5-32 296B ± 0% 39B ± 0% -86.82% (p=0.000 n=10+8)
Decode/Digits/Speed/1e6-32 3.78kB ± 0% 0.33kB ± 0% -91.29% (p=0.000 n=10+10)
Decode/Digits/Default/1e4-32 40.0B ± 0% 16.0B ± 0% -60.00% (p=0.000 n=10+10)
Decode/Digits/Default/1e5-32 287B ± 0% 54B ± 1% -81.04% (p=0.000 n=10+10)
Decode/Digits/Default/1e6-32 4.15kB ± 0% 0.44kB ± 0% -89.43% (p=0.000 n=10+9)
Decode/Digits/Compression/1e4-32 40.0B ± 0% 16.0B ± 0% -60.00% (p=0.000 n=10+10)
Decode/Digits/Compression/1e5-32 288B ± 0% 55B ± 1% -81.01% (p=0.000 n=10+10)
Decode/Digits/Compression/1e6-32 4.15kB ± 0% 0.44kB ± 0% -89.43% (p=0.000 n=10+8)
Decode/Newton/Huffman/1e4-32 705B ± 0% 16B ± 0% -97.73% (p=0.000 n=9+10)
Decode/Newton/Huffman/1e5-32 4.49kB ± 0% 0.04kB ± 0% -99.15% (p=0.000 n=10+10)
Decode/Newton/Huffman/1e6-32 39.4kB ± 0% 0.3kB ± 0% -99.18% (p=0.000 n=9+10)
Decode/Newton/Speed/1e4-32 617B ± 0% 16B ± 0% -97.41% (p=0.000 n=10+10)
Decode/Newton/Speed/1e5-32 3.19kB ± 0% 0.04kB ± 0% -98.84% (p=0.000 n=10+10)
Decode/Newton/Speed/1e6-32 40.5kB ± 0% 0.3kB ± 0% -99.15% (p=0.000 n=10+10)
Decode/Newton/Default/1e4-32 513B ± 0% 16B ± 0% -96.88% (p=0.000 n=10+10)
Decode/Newton/Default/1e5-32 2.35kB ± 0% 0.04kB ± 0% -98.47% (p=0.000 n=10+10)
Decode/Newton/Default/1e6-32 21.1kB ± 0% 0.3kB ± 0% -98.80% (p=0.000 n=8+8)
Decode/Newton/Compression/1e4-32 513B ± 0% 16B ± 0% -96.88% (p=0.000 n=10+10)
Decode/Newton/Compression/1e5-32 2.35kB ± 0% 0.04kB ± 0% -98.47% (p=0.000 n=10+10)
Decode/Newton/Compression/1e6-32 22.9kB ± 0% 0.2kB ± 0% -98.92% (p=0.000 n=8+8)
name old allocs/op new allocs/op delta
Decode/Digits/Huffman/1e4-32 0.00 ±NaN% 1.00 ± 0% +Inf% (p=0.000 n=10+10)
Decode/Digits/Huffman/1e5-32 0.00 ±NaN% 2.00 ± 0% +Inf% (p=0.000 n=10+10)
Decode/Digits/Huffman/1e6-32 0.00 ±NaN% 16.00 ± 0% +Inf% (p=0.000 n=10+10)
Decode/Digits/Speed/1e4-32 3.00 ± 0% 1.00 ± 0% -66.67% (p=0.000 n=10+10)
Decode/Digits/Speed/1e5-32 6.00 ± 0% 2.00 ± 0% -66.67% (p=0.000 n=10+10)
Decode/Digits/Speed/1e6-32 68.0 ± 0% 16.0 ± 0% -76.47% (p=0.000 n=10+10)
Decode/Digits/Default/1e4-32 2.00 ± 0% 1.00 ± 0% -50.00% (p=0.000 n=10+10)
Decode/Digits/Default/1e5-32 8.00 ± 0% 3.00 ± 0% -62.50% (p=0.000 n=10+10)
Decode/Digits/Default/1e6-32 74.0 ± 0% 23.0 ± 0% -68.92% (p=0.000 n=10+10)
Decode/Digits/Compression/1e4-32 2.00 ± 0% 1.00 ± 0% -50.00% (p=0.000 n=10+10)
Decode/Digits/Compression/1e5-32 8.00 ± 0% 3.00 ± 0% -62.50% (p=0.000 n=10+10)
Decode/Digits/Compression/1e6-32 74.0 ± 0% 23.0 ± 0% -68.92% (p=0.000 n=10+10)
Decode/Newton/Huffman/1e4-32 9.00 ± 0% 1.00 ± 0% -88.89% (p=0.000 n=10+10)
Decode/Newton/Huffman/1e5-32 18.0 ± 0% 2.0 ± 0% -88.89% (p=0.000 n=10+10)
Decode/Newton/Huffman/1e6-32 156 ± 0% 16 ± 0% -89.74% (p=0.000 n=10+10)
Decode/Newton/Speed/1e4-32 13.0 ± 0% 1.0 ± 0% -92.31% (p=0.000 n=10+10)
Decode/Newton/Speed/1e5-32 26.0 ± 0% 2.0 ± 0% -92.31% (p=0.000 n=10+10)
Decode/Newton/Speed/1e6-32 223 ± 0% 16 ± 0% -92.83% (p=0.000 n=10+10)
Decode/Newton/Default/1e4-32 10.0 ± 0% 1.0 ± 0% -90.00% (p=0.000 n=10+10)
Decode/Newton/Default/1e5-32 27.0 ± 0% 2.0 ± 0% -92.59% (p=0.000 n=10+10)
Decode/Newton/Default/1e6-32 153 ± 0% 12 ± 0% -92.16% (p=0.000 n=10+10)
Decode/Newton/Compression/1e4-32 10.0 ± 0% 1.0 ± 0% -90.00% (p=0.000 n=10+10)
Decode/Newton/Compression/1e5-32 27.0 ± 0% 2.0 ± 0% -92.59% (p=0.000 n=10+10)
Decode/Newton/Compression/1e6-32 145 ± 0% 12 ± 0% -91.72% (p=0.000 n=10+10)
```
These changes have been included in github.com/klauspost/compress
for a little more than a month now, which includes fuzz testing.
Change-Id: I7e346330512116baa27e448aa606a2f4e551054c
GitHub-Last-Rev: ae9b62a9851eb3d8acf14cebfef1d8c663e407dc
GitHub-Pull-Request: golang/go#38324
---
A src/compress/flate/gen_inflate.go
M src/compress/flate/inflate.go
A src/compress/flate/inflate_gen.go
M src/compress/flate/reader_test.go
4 files changed, 1,665 insertions(+), 186 deletions(-)
2 comments:
Patchset:
> Sure. I have an additional 5-10% I found since then, which I will see if I can incorporate. […]
I updated the PR with the latest version as well as resolving the conflict. This matches https://github.com/klauspost/compress/tree/master/flate
Benchmark numbers updated.
Patchset:
New changes integrated. Benchmark updated compared to current master.
Gerrit Bot uploaded patch set #6 to this change.
Change-Id: I7e346330512116baa27e448aa606a2f4e551054c
GitHub-Last-Rev: 161f02171c05a37751c7ca314daf5e5b4809905f
GitHub-Pull-Request: golang/go#38324
---
A src/compress/flate/gen_inflate.go
M src/compress/flate/inflate.go
A src/compress/flate/inflate_gen.go
M src/compress/flate/reader_test.go
4 files changed, 1,653 insertions(+), 186 deletions(-)
1 comment:
Patchset:
Ping Joe Tsai: do you have any time to look at this one? Thanks.
1 comment:
Patchset:
This is a lot of code to work around the fact that the API guarantees that we never read a single byte past the final block IF the input reader happens to implement `io.ByteReader`.
Most usages do not actually pass an `io.ByteReader`, and so the implementation implicitly wraps it within a `bufio.Reader`. There seems to be an impedance mismatch in the complexity of this code and what happens in practice.
I argue that we should instead specialize for `bufio.Reader` and make use of the `bufio.Reader.{Buffered,Peek,Discard}` methods to fetch a chunk of the upcoming data. This avoids the tricky logic checking for the final block and computing the maximum number of bits we are allowed to read. This approach is what's taken in https://github.com/dsnet/compress/blob/f66993602bf5da07ef49d35b08e7264ae9fe2b6e/internal/prefix/reader.go#L253-L317 and results in performance gains comparable to what's reported here for much less code complexity.
We already have a dependency on `bufio.Reader`, so special-casing that type seems palatable. Anyone wanting better performance can just wrap their input in a `bufio.Reader`.
Adding a dependency on "bytes" and "strings" is expensive and shouldn't be done lightly. See https://github.com/golang/go/issues/54098. Until that issue is resolved, we shouldn't be adding dependencies on these if it's not necessary.
Attention is currently required from: Ian Lance Taylor.
1 comment:
Patchset:
Ping Joe Tsai: do you have any time to look at this one? Thanks.
It seems that I never got the email since I'm no longer reachable at joe...@google.com :|
Patchset:
This is a lot of code to work around the fact that the API guarantees that we never read a single by […]
It seems that "bufio" already depends on "bytes" and "strings", so my last point isn't valid, but the rest of my argument still stands.
If we want to optimize `bytes.Buffer`, `strings.Reader`, and `bytes.Reader` with my proposed approach, we can wrap them in a way that presents an API similar to `bufio.Reader`: https://github.com/dsnet/compress/blob/master/internal/prefix/wrap.go
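The wrapping idea might look like the sketch below: expose just the few `bufio.Reader` methods the decoder needs behind one interface, with thin adapters for the non-`bufio` types. The `peeker` interface and `stringWrap` adapter are hypothetical names, not the wrap.go code:

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
)

// peeker captures the three bufio.Reader methods the decoder would
// use. *bufio.Reader satisfies it directly; other source types get
// thin adapters. Hypothetical interface, sketching the wrap.go idea.
type peeker interface {
	Buffered() int
	Peek(n int) ([]byte, error)
	Discard(n int) (int, error)
}

// stringWrap adapts an in-memory string source by tracking an
// offset, so Peek and Discard become simple slicing.
type stringWrap struct {
	s   string
	off int
}

func (w *stringWrap) Buffered() int { return len(w.s) - w.off }

func (w *stringWrap) Peek(n int) ([]byte, error) {
	if n > w.Buffered() {
		return []byte(w.s[w.off:]), io.EOF
	}
	return []byte(w.s[w.off : w.off+n]), nil
}

func (w *stringWrap) Discard(n int) (int, error) {
	if n > w.Buffered() {
		n = w.Buffered()
	}
	w.off += n
	return n, nil
}

func main() {
	// Both a real *bufio.Reader and the adapter drive the same path.
	readers := []peeker{
		bufio.NewReader(bytes.NewReader([]byte("abc"))),
		&stringWrap{s: "abc"},
	}
	for _, r := range readers {
		b, _ := r.Peek(2)
		r.Discard(1)
		fmt.Printf("%s %d\n", b, r.Buffered())
	}
}
```

With this shape, a single decoder body written against `peeker` covers all the in-memory types, instead of generating one decoder per concrete type.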
I honestly don't feel like doing a total rewrite, and this is mostly a verbatim copy of well-tested code.
Most usages do not actually pass an `io.ByteReader`.
HTTP requests do, which is by far the most common usage.
Most usages do not actually pass an io.ByteReader, and so the implementation implicitly wraps it within a bufio.Reader.
Actually, it will also speed these up, since the wrapping `bufio.Reader` is currently never accessed directly, but only through the `(decompressor).r` interface.
I have attempted to do a generic implementation, but it is something like 10-20% slower than this: https://github.com/klauspost/compress/commit/355740f356693049e3b107fb140a5a4813cf6397
It doesn't really change the imports either.