Decoding malformed data with x/text/encoding


Rin Tohsaka

Dec 2, 2015, 10:21:26 AM
to golang-nuts
Hello everyone!
I'm having some trouble decoding data that contains invalid bytes. From encoding.go:
type Encoding interface {
        // NewDecoder returns a transformer that converts to UTF-8.
        //
        // Transforming source bytes that are not of that encoding will not
        // result in an error per se. Each byte that cannot be transcoded will
        // be represented in the output by the UTF-8 encoding of '\uFFFD', the
        // replacement rune.
        NewDecoder() transform.Transformer
        ...

So I expect the decoder not to stop at the first error, but to replace each invalid byte with U+FFFD and carry on.
Can I do something to force this behavior?
My code is very simple:
t := japanese.ShiftJIS
reader := transform.NewReader(file, t.NewDecoder())
data, err := ioutil.ReadAll(reader)
if err != nil {
        fmt.Println(err.Error())
}

Resulting in:
japanese: invalid Shift JIS encoding

I tried a workaround, but I think it's ugly and, as far as I can see, potentially buggy:
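// ForceDecode decodes src with t, skipping one byte after every decoding
// error. It returns the decoded output and the number of bytes skipped.
func ForceDecode(src []byte, t transform.Transformer) (result []byte, skipped int) {
        src0, src1 := 0, len(src)
        // Use < rather than != so the loop cannot run past the end of src.
        for src0 < src1 {
                buf, n, err := transform.Bytes(t, src[src0:src1])
                src0 += n
                result = append(result, buf...)
                if err != nil {
                        // Skip the offending byte and retry from the next one.
                        src0++
                        skipped++
                }
        }
        return
}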

Any advice would be appreciated.

Nigel Tao

Dec 6, 2015, 9:49:38 PM
to Rin Tohsaka, Marcel van Lohuizen, golang-nuts
I remember, a year or three ago, discussing whether or not Encoding
transformers should return an error early, use a substitute character,
or be configurable between the two. I can't remember the details,
though. Marcel, do you?

Maybe we thought that people could write their own ForceDecode
function if they wanted to, although I'd make it a function that
returned a Transformer. Perhaps such a beast should live in
golang.org/x/text/transform.

In any case, it seems like a bug that the NewDecoder docs don't match
the implementation. One or the other should change.
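
For illustration only, here is a rough sketch of what a ForceDecode that returns a Transformer might look like. Everything in it is an assumption rather than anything that exists in x/text: replaceInvalid and asciiOnly are hypothetical names, the Transformer interface and the two sentinel errors are local stand-ins for transform.Transformer, transform.ErrShortDst and transform.ErrShortSrc (redeclared so the sketch has no external dependencies), and skipping exactly one byte per error is an arbitrary policy.

```go
package main

import (
	"errors"
	"fmt"
)

// Transformer mirrors the method set of transform.Transformer from
// golang.org/x/text/transform. It is redeclared here only to keep the
// sketch dependency-free.
type Transformer interface {
	Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error)
	Reset()
}

// Local stand-ins for transform.ErrShortDst and transform.ErrShortSrc.
var (
	ErrShortDst = errors.New("short destination buffer")
	ErrShortSrc = errors.New("short source buffer")
)

const replacement = "\uFFFD"

// replaceInvalid (hypothetical) wraps a Transformer. Any error other than
// ErrShortDst or ErrShortSrc is swallowed: one source byte is skipped and
// U+FFFD is written to the output in its place.
type replaceInvalid struct{ t Transformer }

func (r replaceInvalid) Reset() { r.t.Reset() }

func (r replaceInvalid) Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error) {
	for {
		d, s, e := r.t.Transform(dst[nDst:], src[nSrc:], atEOF)
		nDst += d
		nSrc += s
		if e == nil || e == ErrShortDst || e == ErrShortSrc {
			return nDst, nSrc, e
		}
		if nSrc >= len(src) {
			// Nothing left to skip; surface the error instead of looping.
			return nDst, nSrc, e
		}
		if len(dst)-nDst < len(replacement) {
			// No room for U+FFFD yet; the caller retries with more space.
			return nDst, nSrc, ErrShortDst
		}
		nDst += copy(dst[nDst:], replacement)
		nSrc++ // assumption: gobble exactly one byte per error
		r.t.Reset()
	}
}

var errInvalid = errors.New("toy: invalid byte")

// asciiOnly is a toy decoder standing in for a real one: it copies ASCII
// bytes and fails on anything else.
type asciiOnly struct{}

func (asciiOnly) Reset() {}

func (asciiOnly) Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error) {
	for nSrc < len(src) {
		if src[nSrc] >= 0x80 {
			return nDst, nSrc, errInvalid
		}
		if nDst == len(dst) {
			return nDst, nSrc, ErrShortDst
		}
		dst[nDst] = src[nSrc]
		nDst++
		nSrc++
	}
	return nDst, nSrc, nil
}

func main() {
	tr := replaceInvalid{asciiOnly{}}
	dst := make([]byte, 64)
	n, _, err := tr.Transform(dst, []byte("ok\x81\xffgo"), true)
	fmt.Printf("%+q %v\n", dst[:n], err) // "ok\ufffd\ufffdgo" <nil>
}
```

With the real package, a wrapper of this shape could presumably be passed to transform.NewReader in place of the bare decoder. Whether one byte is the right amount to skip for a multi-byte encoding such as Shift JIS is an open question.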

mp...@golang.org

Dec 19, 2015, 9:40:19 AM
to Nigel Tao, Rin Tohsaka, Marcel van Lohuizen, golang-nuts
Sorry for the late reply. Just found this email among the noise.

I indeed recently found the same discrepancy between the decoder's doc and implementation. Moreover, decoders do not always return an error on invalid input. There seems to be some pattern/system, but I'm not sure what it is.

I recently changed the Encoders to return errors. There is no single method of replacement that is generally applicable so there is no way around this. There are now two different decorators for handling errors. See CL

We could do a similar thing for Decoding. However, it is a bit more tricky to do for Decoders (how many bytes should be gobbled per error). Also, unlike with encoders, decoders should not return an error by default. This makes the "decorator" technique used for encodings less suitable.

I think it would be fine for Decoders to simply never return an error, as the documentation suggests (one can always scan the output for U+FFFD), but it would be good to know whether people make use of the errors, and why different errors were handled differently in the first place.
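
As a minimal sketch of the "scan for U+FFFD" idea, using only the standard library: replacementOffsets is a hypothetical helper name, and the input string simply stands in for whatever a decoder produced.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// replacementOffsets (hypothetical helper) returns the byte offset of every
// U+FFFD rune in s. It assumes s is valid UTF-8, as decoder output should
// be. It cannot tell a U+FFFD inserted by a decoder apart from one that was
// already present in the source text.
func replacementOffsets(s string) []int {
	var offsets []int
	for i, r := range s {
		if r == utf8.RuneError {
			offsets = append(offsets, i)
		}
	}
	return offsets
}

func main() {
	decoded := "ab\uFFFDcd\uFFFD" // stand-in for decoder output
	fmt.Println(replacementOffsets(decoded)) // [2 7]
}
```

Note that the offsets refer to the decoded UTF-8 output, not to positions in the original input, so this only tells you that something was replaced, not which source bytes were bad.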
