I've been thinking about a semi-official Go package to convert between
UTF-8 and other encodings (e.g. UTF-16, Windows-1252). It would live
in the go.text sub-repo, as
code.google.com/p/go.text/unicode/encoding.
The key ideas are that an Encoding is an interface (e.g. "package
big5; var Encoding encoding.Encoding = etc"), that it primarily operates
on []byte (although it can also operate on individual runes), and that it
is stateless:
type Encoding interface {
    // Decode converts encoded runes in src to UTF-8 bytes in dst. It returns
    // the number of dst bytes written, the number of src bytes read, and the
    // next encoding to use to decode the rest of the byte stream.
    Decode(dst, src []byte) (nDst, nSrc int, enc Encoding)
    DecodeRune(p []byte) (r rune, n int, enc Encoding)
    Encode(dst, src []byte) (nDst, nSrc int, enc Encoding)
    EncodeRune(p []byte, r rune) (n int, enc Encoding)
}
func NewReader(e Encoding, r io.Reader) *Reader
func NewWriter(e Encoding, w io.Writer) *Writer
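
To make the intended call pattern concrete, here is a rough usage sketch
that converts a Windows-1252 stream to UTF-8 via NewReader. The
windows1252 package name and the import paths are placeholders of mine,
not settled API:

package main

import (
    "fmt"
    "io/ioutil"
    "log"
    "os"

    "code.google.com/p/go.text/unicode/encoding"    // placeholder path
    "code.google.com/p/go.text/unicode/windows1252" // placeholder path
)

func main() {
    f, err := os.Open("legacy-cp1252.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()
    // Reads from r yield UTF-8, regardless of the encoding of f's contents.
    r := encoding.NewReader(windows1252.Encoding, f)
    utf8Bytes, err := ioutil.ReadAll(r)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("%s", utf8Bytes)
}

Any per-conversion buffering lives in the Reader; the Encoding value
itself stays stateless and can be shared freely.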
The act of decoding (or encoding) can change the Encoding in use. For
example, you could start with an endian-agnostic UTF-16 Encoding, and
Decode could return a UTF-16 (Little Endian) Encoding on encountering
a byte order mark.
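
A rough sketch of that mechanism, assuming concrete little- and big-endian
Encodings exist elsewhere (utf16LE and utf16BE below are placeholders, and
everything other than Decode is stubbed out):

// utf16Agnostic is a hypothetical endian-agnostic UTF-16 Encoding. Its
// Decode looks for a byte order mark and returns the concrete Encoding to
// use for the rest of the stream.
type utf16Agnostic struct{}

var utf16LE, utf16BE Encoding // concrete implementations, defined elsewhere

func (utf16Agnostic) Decode(dst, src []byte) (nDst, nSrc int, enc Encoding) {
    if len(src) < 2 {
        // Not enough input to decide yet; report no progress.
        return 0, 0, utf16Agnostic{}
    }
    switch {
    case src[0] == 0xFF && src[1] == 0xFE:
        // Little-endian BOM: consume it and switch encodings.
        return 0, 2, utf16LE
    case src[0] == 0xFE && src[1] == 0xFF:
        // Big-endian BOM: consume it and switch encodings.
        return 0, 2, utf16BE
    default:
        // No BOM: default to big-endian, per the Unicode standard.
        nDst, nSrc, _ = utf16BE.Decode(dst, src)
        return nDst, nSrc, utf16BE
    }
}

// Stubs so that utf16Agnostic satisfies Encoding in this sketch.
func (utf16Agnostic) DecodeRune(p []byte) (rune, int, Encoding)   { panic("not in this sketch") }
func (utf16Agnostic) Encode(dst, src []byte) (int, int, Encoding) { panic("not in this sketch") }
func (utf16Agnostic) EncodeRune(p []byte, r rune) (int, Encoding) { panic("not in this sketch") }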
I am aware of two existing similar packages:
code.google.com/p/go-charset/charset/
code.google.com/p/mahonia/
My proposal differs in a number of ways:
1. "package encoding" only provides a minimal number of encodings:
UTF-8 and UTF-16. Other encodings like Big-5 or GBK, which can require
large data tables, would be in separate packages. If your program
needs the Big-5 encoding, it can import big5 and refer to
big5.Encoding as a variable, instead of having to look up by string.
If your program does not need Big-5 or GBK, then the compiler and
linker do not need to see those data tables. Data tables are generated
before compile time; no data files are read at run time.
2. There is no central registry of encodings, keyed by strings such as
"windows-1252" or "cp1252". If you want the Big-5 encoding, import the
big5 package and refer to big5.Encoding. If you want to implement the
equivalent of iconv, provide your own map[string]Encoding (see the
sketch after this list).
3. There's only an Encoding. There is no separation of (stateless)
Factory and (stateful) Translator, or (stateless) Charset and
(stateful) Decoder.
4. The primary interface is batch-oriented: I anticipate that most users
will want to use NewReader or NewWriter, which boil down to Decode or
Encode. Decode takes []byte; DecodeRune is provided mostly as a
convenience. Decode takes a destination buffer as an argument, like
io.Reader does, instead of returning a buffer, like go-charset does.
Unlike mahonia, conversion does not require a Decoder call per decoded
rune.
5. The Decode method does not take an explicit eof argument, unlike
go-charset's. Decode will return nSrc == 0 if it cannot decode a rune
from src. It is up to the caller to decide whether to behave differently
depending on whether or not it is at EOF.
6. The Writer implementation is buffered; it has an explicit Flush method.
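
As promised above, here is what an iconv-style lookup by name could look
like, built by the application rather than by package encoding. The big5
and windows1252 import paths are placeholders, and only the encodings that
a program actually imports end up in its binary:

package myiconv // hypothetical application package

import (
    "strings"

    "code.google.com/p/go.text/unicode/big5"        // placeholder paths
    "code.google.com/p/go.text/unicode/encoding"
    "code.google.com/p/go.text/unicode/windows1252"
)

// encodings is an application-level registry; package encoding itself
// provides no such table.
var encodings = map[string]encoding.Encoding{
    "big5":         big5.Encoding,
    "windows-1252": windows1252.Encoding,
    "cp1252":       windows1252.Encoding, // aliases are the caller's concern
}

func lookup(name string) (encoding.Encoding, bool) {
    e, ok := encodings[strings.ToLower(name)]
    return e, ok
}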
https://codereview.appspot.com/10085049 has a proof of concept for the
Windows-1252 encoding, a lot more wordage about the Encoding
interface, and a discussion of how e.g. DecodeRune differs from
utf8.DecodeRune in the standard unicode/utf8 package.
Some benchmark numbers comparing "this package" against go-charset and
mahonia on 26K of Windows-1252 data:
$ go test -bench=. -benchmem
PASS
BenchmarkReaderGoCharset         10000    131647 ns/op   41157 B/op   6 allocs/op
BenchmarkReaderMahonia           10000    259387 ns/op   12565 B/op   7 allocs/op
BenchmarkReaderThisPackage       50000     64742 ns/op    8337 B/op   3 allocs/op
BenchmarkWriterGoCharset8K       --- FAIL: BenchmarkWriterGoCharset8K
    encoding_test.go:190: written 25879 bytes, want 25877
BenchmarkWriterGoCharset64K      10000    150204 ns/op   28805 B/op   4 allocs/op
BenchmarkWriterMahonia8K          5000    370096 ns/op   17651 B/op   7 allocs/op
BenchmarkWriterMahonia64K         5000    374755 ns/op   28907 B/op   5 allocs/op
BenchmarkWriterThisPackage8K     10000    135038 ns/op    8261 B/op   2 allocs/op
BenchmarkWriterThisPackage64K    10000    134980 ns/op    8261 B/op   2 allocs/op
ok      code.google.com/p/go.text/unicode/encoding      15.918s
I'm not sure why the GoCharset benchmark fails for the
many-smaller-writes case but passes the one-big-write case. I might be
doing something dumb with go-charset (and/or mahonia).
WDYT? Am I missing any subtleties?