Email header, subject charset decoding (email might be encoded in wide range of charsets like ISO-2022-JP, GB-2312 etc...)


hiepkh...@gmail.com

Jan 30, 2016, 11:36:21 AM
to golang-nuts
Hi,

I am working on a project that needs to decode email headers written in many different charsets. The Python code I use for this is shown below:

from email.header import Header, decode_header, make_header
from charset import text_to_utf8

class ....
    def decode_header(self, header):
        decoded_header = decode_header(header)

        if decoded_header[0][1] is None:
            return text_to_utf8(decoded_header[0][0]).decode("utf-8", "replace")
        else:
            return decoded_header[0][0].decode(decoded_header[0][1].replace("windows-", "cp"), "replace")


Basically, for text such as "=?iso-2022-jp?b?GyRCRW1CQE86GyhCIDxtb21vQHRhcm8ubmUuanA=?=", the "decode_header" function only identifies the charset and returns the raw bytes alongside it: [('\x1b$BEmB@O:\x1b(B <mo...@taro.ne.jp', 'iso-2022-jp')]. The "decode" method is then used to convert those bytes from that charset to Unicode.

Now, in Go, I can do something similar:
import "mime"

dec := new(mime.WordDecoder)
text := "=?utf-8?q?=C3=89ric?= <er...@example.org>, =?utf-8?q?Ana=C3=AFs?= <an...@example.org>"
header, err := dec.DecodeHeader(text)

It seems that mime.WordDecoder allows a charset decoder "hook" to be plugged in:
type WordDecoder struct {
	// CharsetReader, if non-nil, defines a function to generate
	// charset-conversion readers, converting from the provided
	// charset into UTF-8.
	// Charsets are always lower-case. utf-8, iso-8859-1 and us-ascii charsets
	// are handled by default.
	// One of the CharsetReader's result values must be non-nil.
	CharsetReader func(charset string, input io.Reader) (io.Reader, error)
}
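As a rough illustration of how such a hook could be wired in, here is a standard-library-only sketch. The names latin9Reader and decodeSubject are hypothetical, and the hook handles just one extra charset (iso-8859-15) by hand; a real hook would presumably look the decoder up in a charset library rather than hard-code a table.

```go
package main

import (
	"fmt"
	"io"
	"mime"
	"strings"
)

// latin9Overrides holds the eight bytes where ISO-8859-15 differs
// from ISO-8859-1; every other byte maps to the same code point.
var latin9Overrides = map[byte]rune{
	0xA4: '€', 0xA6: 'Š', 0xA8: 'š', 0xB4: 'Ž',
	0xB8: 'ž', 0xBC: 'Œ', 0xBD: 'œ', 0xBE: 'Ÿ',
}

// latin9Reader is a hypothetical CharsetReader hook. WordDecoder
// passes the charset name in lower case.
func latin9Reader(charset string, input io.Reader) (io.Reader, error) {
	if charset != "iso-8859-15" {
		return nil, fmt.Errorf("unhandled charset %q", charset)
	}
	raw, err := io.ReadAll(input)
	if err != nil {
		return nil, err
	}
	var sb strings.Builder
	for _, b := range raw {
		if r, ok := latin9Overrides[b]; ok {
			sb.WriteRune(r)
		} else {
			sb.WriteRune(rune(b))
		}
	}
	// The returned reader yields UTF-8, as the hook contract requires.
	return strings.NewReader(sb.String()), nil
}

// decodeSubject decodes a header value using the hook above.
func decodeSubject(header string) (string, error) {
	dec := &mime.WordDecoder{CharsetReader: latin9Reader}
	return dec.DecodeHeader(header)
}

func main() {
	out, err := decodeSubject("=?iso-8859-15?q?100_=A4?=")
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(out) // prints "100 €"
}
```

The hook is only consulted for charsets that WordDecoder does not handle itself, so utf-8, iso-8859-1 and us-ascii never reach it.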

Is there a library that can convert from an arbitrary charset, like the "decode" method in the Python example above? I don't want to write a big switch statement like the one used in mime/encodedword.go:

func (d *WordDecoder) convert(buf *bytes.Buffer, charset string, content []byte) error {
	switch {
	case strings.EqualFold("utf-8", charset):
		buf.Write(content)
	case strings.EqualFold("iso-8859-1", charset):
		for _, c := range content {
			buf.WriteRune(rune(c))
		}
	....


Any help would be much appreciated.

Thanks.


Tamás Gulácsi

Jan 30, 2016, 11:47:09 AM
to golang-nuts
Use golang.org/x/text/encoding/htmlindex; see i18nmail.HeadDecode in github.com/tgulacsi/go.