[New User] Handling Character Encodings


Donovan

Jan 13, 2011, 1:55:39 AM
to golang-nuts
First, I want to describe the problem I'm seeing...

It seems like Go is lacking any real character decoding / encoding
functionality. All byte translating functionality seems to assume
utf-8. If I want to read a text file that isn't ascii / utf-8 and
manipulate it as strings I have to translate the bytes to utf-8
myself. Similarly, I have to do my own translation for writing out to
a file or network socket. Have I missed something builtin to Go or a
community package that would make this simpler?

Second, assuming my observation is correct, my followup is:

Is this an intentional omission, or just that no one has needed to
implement it yet?

- Donovan


Jessta

Jan 13, 2011, 2:15:41 AM
to Donovan, golang-nuts
On Thu, Jan 13, 2011 at 5:55 PM, Donovan <donovan...@gmail.com> wrote:
> First, I want to describe the problem I'm seeing...
>
> It seems like Go is lacking any real character decoding / encoding
> functionality. All byte translating functionality seems to assume
> utf-8. If I want to read a text file that isn't ascii / utf-8 and
> manipulate it as strings I have to translate the bytes to utf-8
> myself.

What encoding are you wanting to use?
There is support for utf8, utf16, and utf32. If you want to manipulate
it as a string you'll have to convert it to utf8, but if you want to
keep the original encoding you can just use []byte. Most of the string
functions also have a comparable []byte function.
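
(A minimal sketch of this, using the unicode/utf16, strings, and bytes
packages as they exist in today's standard library; the exact APIs
available at the time of this thread may have differed:)

package main

import (
	"bytes"
	"fmt"
	"strings"
	"unicode/utf16"
)

func main() {
	// decode a sequence of UTF-16 code units into runes, then into a UTF-8 string
	units := []uint16{'h', 'e', 'l', 'l', 0x00f8} // "hellø"
	s := string(utf16.Decode(units))
	fmt.Println(s) // outputs "hellø"

	// many string functions have a comparable []byte function in package bytes
	fmt.Println(strings.Contains(s, "ø"))               // outputs "true"
	fmt.Println(bytes.Contains([]byte(s), []byte("ø"))) // outputs "true"
}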

>Similarly, I have to do my own translation for writing out to
> a file or network socket. Have I missed something builtin to Go or a
> community package that would make this simpler?

Encoding and decoding streams is usually pretty easy. Encoding packages
tend to provide types that satisfy the io.Reader and io.Writer interfaces,
so encoding while writing out to a file is as simple as using two calls
to io.Copy().
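
(A sketch of the wrap-a-Reader approach, using the
golang.org/x/text/encoding/charmap package, which did not exist at the
time of this thread; the input file name is hypothetical:)

package main

import (
	"io"
	"log"
	"os"

	"golang.org/x/text/encoding/charmap"
)

func main() {
	// open a windows-1252 encoded file (hypothetical path)
	f, err := os.Open("legacy-1252.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// wrap the file in a decoding reader: bytes read from it come out as utf-8
	r := charmap.Windows1252.NewDecoder().Reader(f)

	// copy the decoded stream to any io.Writer (stdout here)
	if _, err := io.Copy(os.Stdout, r); err != nil {
		log.Fatal(err)
	}
}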

> Second, assuming my observation is correct, my followup is:
>
> Is this an intentional omission, or just that no one has needed to
> implement it yet?

Since the encoding of a text being read by a Go program has nothing to do
with the standard library or the Go language, I can't imagine why any
encoding would be intentionally omitted.
The omission is most likely down to a lack of need. UTF-8 does a
fine job and is widely used.


--
=====================
http://jessta.id.au

Donovan Jimenez

Jan 13, 2011, 3:49:01 AM
to Jessta, golang-nuts
file encodings I deal with often: utf-8 (hooray!), latin1, windows-1252, and ebcdic (boo!)

Can you point me at these byte stream encoders / decoders? I couldn't find any packages that allowed me to wrap a Reader or Writer and interpret the byte stream as anything other than utf-8.

The other thing I find odd is that the len of a string value is its byte count rather than its character count. This old-school convention is often the source of bugs and data mangling when developers confuse byte length with character length because they don't test with non-ascii data. For example, let's look at a character sequence right out of the specification:

// show that a byte sequence is interpreted as utf-8, and print its value's len
myString := string([]byte{'h', 'e', 'l', 'l', '\xc3', '\xb8'}) // "hellø"
fmt.Println("string len:", len(myString)) // outputs "string len: 6"
// show that the byte count is different than the character count
myUtf8String := utf8.NewString(myString)
fmt.Println("string len:", myUtf8String.RuneCount()) // outputs "string len: 5"

If you had to validate that some input string was <= 5 characters, which way would you have coded the check? If you're a C / C++ programmer you might have done it right (but only because you've made the mistake before); if you're a Java programmer you probably did it wrong (because you're accustomed to a string's length being its number of characters no matter what encoding you read it in as).
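
(A minimal sketch of the character-count check, using the standard
unicode/utf8 package:)

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	s := "hellø" // 6 bytes, 5 characters (runes)

	// wrong: len counts bytes
	fmt.Println(len(s) <= 5) // outputs "false"

	// right: count runes instead
	fmt.Println(utf8.RuneCountInString(s) <= 5) // outputs "true"
}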

- Donovan

Donovan

Jan 13, 2011, 11:24:25 AM
to golang-nuts
After a night of thinking about it, my problem really was my
expectation of what a string value is. I expected it to be more like
a Java string, rather than just an immutable byte array.
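
(A small sketch of that byte-array behaviour, using only the standard
library:)

package main

import "fmt"

func main() {
	s := "hellø"

	// indexing a string yields individual bytes, not characters
	fmt.Println(s[4]) // outputs "195" (0xc3, the first byte of 'ø')

	// ranging over a string decodes it as utf-8, yielding runes
	for i, r := range s {
		fmt.Println(i, string(r))
	}
	// outputs: 0 h, 1 e, 2 l, 3 l, 4 ø
}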

I still have not found any transcoding packages, so I may try to wrap
iconv or icu.

- Donovan


peterGo

Jan 13, 2011, 11:38:45 AM
to golang-nuts
Donovan,

On Jan 13, 11:24 am, Donovan <donovan.jime...@gmail.com> wrote:

> I still have not found any transcoding packages, so I may try to wrap
> iconv or icu.

Here's an example of an iconv wrapper.

oibore/go-iconv - GitHub
https://github.com/oibore/go-iconv

Peter


Donovan

Jan 13, 2011, 11:56:49 PM
to golang-nuts
FANTASTIC! thank you very much!

- Donovan

On Jan 13, 11:38 am, peterGo <go.peter...@gmail.com> wrote:
> Donovan,
>
> On Jan 13, 11:24 am, Donovan <donovan.jime...@gmail.com> wrote:
>
> > I still have not found any transcoding packages, so I may try to wrap
> > iconv or icu.
>
> Here's an example of an iconv wrapper.
>
> oibore/go-iconv - GitHub
> https://github.com/oibore/go-iconv

Eoghan Sherry

Jan 14, 2011, 9:05:08 AM
to Donovan Jimenez, Jessta, golang-nuts
On 13 January 2011 03:49, Donovan Jimenez <donovan...@gmail.com> wrote:
> If you had to validate that some input string was <= 5 characters, which way
> would you have coded the check? If you're a C / C++ programmer you might
> have done it right (but only because you've made the mistake before); if
> you're a Java programmer you probably did it wrong (because you're accustomed
> to a string's length being its number of characters no matter what encoding
> you read it in as).

You'd be wrong in Java too since the length of a Java string is the number
of the UTF-16 code points, not the number of characters. You'd just be
wrong less often.

Eoghan

peterGo

Jan 14, 2011, 9:54:19 AM
to golang-nuts
Eoghan,

On Jan 14, 9:05 am, Eoghan Sherry <ejshe...@gmail.com> wrote:

> You'd be wrong in Java too since the length of a Java string is the number
> of the UTF-16 code points, not the number of characters. You'd just be
> wrong less often.

The length of a Java string is the number of Unicode UTF-16 code
units, not the number of Unicode code points.

[Class String] public int length() Returns the length of this string.
The length is equal to the number of Unicode code units in the string.
http://download.oracle.com/javase/6/docs/api/java/lang/String.html#length%28%29

Unicode code point is used for character values in the range between
U+0000 and U+10FFFF, and Unicode code unit is used for 16-bit char
values that are code units of the UTF-16 encoding.
http://download.oracle.com/javase/6/docs/api/java/lang/Character.html#unicode
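
(The same distinction in Go terms; a minimal sketch using the standard
unicode/utf8 and unicode/utf16 packages:)

package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

func main() {
	s := "𝄞" // U+1D11E, a character outside the Basic Multilingual Plane
	fmt.Println(len(s))                       // outputs "4": UTF-8 bytes
	fmt.Println(utf8.RuneCountInString(s))    // outputs "1": Unicode code points
	fmt.Println(len(utf16.Encode([]rune(s)))) // outputs "2": UTF-16 code units (a surrogate pair)
}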

Peter

Donovan Jimenez

Jan 14, 2011, 1:10:26 PM
to Eoghan Sherry, golang-nuts
touché  ;)