understanding utf-8 for a newbie


rob solomon

May 5, 2017, 9:11:16 PM
to golan...@googlegroups.com
Hi. I decided to write a small program in Go to convert utf8 to simple
ASCII. The need arose when I copied a file created on Ubuntu 16.04
amd64 to a Windows 10 computer and used it there.

I decided to first change ", ' and emdash characters. Using hexdump -C
in Ubuntu, the runes in the file are:

open quote = 0xE2809C

close quote = 0xE2809D

apostrophe = 0xE28099

emdash = 0xE28094


However, when I write a simple program to display these runes from the
file, using the routines in unicode/utf8, I get very different values.
I do not understand this.

open quote = 0x201C

close quote = 0x201D

apostrophe = 0x2019

emdash = 0x2014.


Why are the runes returned by utf8.DecodeRuneInString different from
what hexdump shows when inspecting the file directly?

--rob solomon

Andy Balholm

May 5, 2017, 10:49:12 PM
to r...@drrob1.com, golan...@googlegroups.com
Hexdump shows the actual bytes in the file—the UTF-8 encoding of the runes (Unicode code points). Apparently you are reading them with utf8.DecodeRune or something like that; those return the code points, without the UTF-8 encoding.

Andy
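
To see the distinction Andy describes, here is a minimal sketch that feeds the three bytes hexdump showed for the open quote into utf8.DecodeRune. The helper name decodeFirst is only for illustration and does not come from the thread.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeFirst (a hypothetical helper) returns the first code point
// stored in b and the number of bytes its UTF-8 encoding occupies.
func decodeFirst(b []byte) (rune, int) {
	return utf8.DecodeRune(b)
}

func main() {
	// The three bytes hexdump reported for the open quote.
	raw := []byte{0xE2, 0x80, 0x9C}

	r, size := decodeFirst(raw)

	// Prints: bytes E2 80 9C -> code point U+201C (3 bytes)
	fmt.Printf("bytes % X -> code point U+%04X (%d bytes)\n", raw, r, size)
}
```

The bytes on disk and the decoded value are two views of the same character: the file holds the encoding, and the rune holds the code point.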

Sam Whited

May 6, 2017, 7:54:19 PM
to r...@drrob1.com, golang-nuts
On Fri, May 5, 2017 at 8:11 PM, rob solomon <drro...@verizon.net> wrote:
> I decided to first change ", ' and emdash characters. Using hexdump -C in
> Ubuntu, the runes in the file are:
>
> open quote = 0xE2809C
>
> close quote = 0xE2809D
>
> apostrophe = 0xE28099
>
> emdash = 0xE28094

The output of hexdump will be the actual bytes of the file; these are
the UTF-8 encoded values.

> However, when I write a simple program to display these runes from the file,
> using the routines in unicode/utf8, I get very different values. I do not
> understand this.
>
> open quote = 0x201C
>
> close quote = 0x201D
>
> apostrophe = 0x2019
>
> emdash = 0x2014.

These are called Unicode codepoints. In Unicode lots of different
things like letters, numbers, emoji, etc. are assigned numbers (Go's
type for storing codepoints is called "rune"). These numbers are then
encoded using an encoding such as UTF-8 to make the final output which
you saw when you used hexdump. The Unicode codepoint of an em dash is
always U+2014 (sometimes they're written this way, prefixed by `U+'),
but the encoding might be different depending on what system you're on
or what file format you're using.

Here is an example of encoding a rune with the value 0x2014 as UTF-8,
which gives the number you observed in your hexdump output:
https://play.golang.org/p/ddIfzobKD4

—Sam

peterGo

May 6, 2017, 8:40:03 PM
to golang-nuts, r...@drrob1.com, drro...@verizon.net
rob,

Why are you converting UTF-8 to ASCII for Windows 10? Convert UTF-8 to CP1252 (Windows-1252) or UTF-16.

Corrected, corrected link. I will get it right eventually!

For UTF-8 to CP1252: https://play.golang.org/p/vzupJY78XB

Peter

Sam Whited

May 6, 2017, 8:48:31 PM
to peterGo, golang-nuts, r...@drrob1.com, drro...@verizon.net
On Sat, May 6, 2017 at 7:40 PM, peterGo <go.pe...@gmail.com> wrote:
> Corrected, corrected link. I will get it right eventually!
>
> For UTF-8 to CP1252: https://play.golang.org/p/vzupJY78XB

FWIW, if you do need CP1252 (you probably don't) it already exists in
Go-land as one of the encodings specified in this package:
https://godoc.org/golang.org/x/text/encoding/charmap

—Sam

rob solomon

May 7, 2017, 10:43:58 AM
to golan...@googlegroups.com
Thanks to those who answered.

I grew up in the EBCDIC vs ASCII era, and I've always expected that the
bytes in the file were the same as those that represented a character.

I now understand that the bytes may be different.

Thanks guys.

-- rob solomon
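
For the substitution Rob originally set out to do, a minimal sketch using only the standard library might look like the following. The helper name toASCII and the choice of "--" as the stand-in for the em dash are assumptions, not something anyone in the thread specified. Because Go source strings are UTF-8, the replacements can be written as code points and matched against the file contents directly.

```go
package main

import (
	"fmt"
	"strings"
)

// punctuation maps the four characters from the original post to
// ASCII stand-ins. The "--" for the em dash is an arbitrary choice.
var punctuation = strings.NewReplacer(
	"\u201C", `"`, // open quote
	"\u201D", `"`, // close quote
	"\u2019", "'", // apostrophe
	"\u2014", "--", // em dash
)

// toASCII (a hypothetical helper) replaces only those four characters;
// any other non-ASCII text passes through unchanged.
func toASCII(s string) string {
	return punctuation.Replace(s)
}

func main() {
	in := "\u201CIt\u2019s fine\u201D\u2014really"
	fmt.Println(toASCII(in)) // "It's fine"--really
}
```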

Tom Limoncelli

May 7, 2017, 11:20:48 AM
to Sam Whited, peterGo, golang-nuts, r...@drrob1.com, drro...@verizon.net
I highly recommend "The Absolute Minimum Every Software Developer
Absolutely, Positively Must Know About Unicode and Character Sets (No
Excuses!)"
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

Plug: The Golang unicode decode/encode libraries are pretty confusing
if you aren't already an expert in Unicode. I wrote
https://github.com/TomOnTime/utfutil to make certain things easier.

Tom



--
Email: t...@whatexit.org Work: tlimo...@StackOverflow.com
Blog: http://EverythingSysadmin.com

Sam Whited

May 7, 2017, 11:29:53 AM
to Robert Solomon, golang-nuts
On Sun, May 7, 2017 at 9:44 AM, rob solomon <drro...@verizon.net> wrote:
> I now understand that the bytes may be different.

It's also worth noting that when Ken Thompson and Rob Pike (yes, the
same Rob Pike and Ken Thompson that created Go) created UTF-8, they
made sure it was backwards compatible with ASCII. Any characters that
are representable in ASCII will be the exact same bytes when encoded
to UTF-8. I'd be surprised if Windows didn't understand UTF-8 these
days, so it may be that you really don't need to "convert" your file
at all.

Here's a fun introduction to Unicode (with a brief discussion of
encoding methods), if you're interested:

http://reedbeta.com/blog/programmers-intro-to-unicode/

—Sam

peterGo

May 7, 2017, 2:33:48 PM
to golang-nuts, r...@drrob1.com
Sam,


"I'd be surprised if Windows didn't understand UTF-8 these days,"

Be surprised! For Unicode, Microsoft Windows uses UTF-16.

Peter

peterGo

May 7, 2017, 2:38:59 PM
to golang-nuts, r...@drrob1.com
Sam,

"[Rob Pike and Ken Thompson] they made sure it was backwards compatible with ASCII."

ASCII is 7-bits.


Peter

On Sunday, May 7, 2017 at 11:29:53 AM UTC-4, Sam Whited wrote:

Jan Mercl

May 7, 2017, 2:44:18 PM
to peterGo, golang-nuts, r...@drrob1.com
On Sun, May 7, 2017 at 8:39 PM peterGo <go.pe...@gmail.com> wrote:

> "[Rob Pike and Ken Thompson] they made sure it was backwards compatible with ASCII."

> ASCII is 7-bits.

So is any UTF-8 encoded ASCII.

--

-j
