UTF-8 normalization?

Mark Summerfield

Feb 21, 2011, 6:49:43 AM
to golang-nuts
Hi,

This program:

package main

import "fmt"

func main() {
	a1 := string([]byte{0xe2, 0x84, 0xab}) // U+212B ANGSTROM SIGN
	a2 := string([]byte{0xc3, 0x85})       // U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
	fmt.Println(a1, a2, a1 == a2)
}

Prints:

Å Å false

The two characters are not the same as such (one is Angstrom the other
an A with a ring above), but they are rendered the same visually so it
is reasonable to expect that end users would expect them to compare
equal, e.g., when sorted.
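
A minimal sketch of why the comparison is false, using only the standard library: the two strings decode to different code points even though they usually look alike. The helper name codePoints is made up for this illustration.

```go
package main

import "fmt"

// codePoints returns the Unicode code points of s in U+XXXX notation.
// (Hypothetical helper, named here only for the example.)
func codePoints(s string) []string {
	var out []string
	for _, r := range s {
		out = append(out, fmt.Sprintf("%U", r))
	}
	return out
}

func main() {
	a1 := string([]byte{0xe2, 0x84, 0xab}) // ANGSTROM SIGN
	a2 := string([]byte{0xc3, 0x85})       // LATIN CAPITAL LETTER A WITH RING ABOVE
	fmt.Println(codePoints(a1), codePoints(a2)) // [U+212B] [U+00C5]
}
```

Go's == on strings compares bytes, so any two strings with different encodings are unequal regardless of how they are rendered.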

Python provides the unicodedata.normalize() function which can help in
such cases, but I can't find a Go equivalent in the standard library.

Thanks!

--
Mark Summerfield, Qtrac Ltd, www.qtrac.eu
C++, Python, Qt, PyQt - training and consultancy
"Advanced Qt Programming" - ISBN 0321635906
http://www.qtrac.eu/aqpbook.html

Anthony Martin

Feb 21, 2011, 7:40:59 AM
to Mark Summerfield, golang-nuts
Mark Summerfield <li...@qtrac.plus.com> once said:
> The two characters are not the same as such (one is Angstrom the other
> an A with a ring above), but they are rendered the same visually so it
> is reasonable to expect that end users would expect them to compare
> equal, e.g., when sorted.

I don't think those two code points are
required to render the same way, given a
specific font. And it's reasonable to
assume that different fonts may be used
for arbitrary groups of code points.

Now if you want to talk about combining
characters ... well, that's a can of worms
I'd rather not open.

Anthony

Mark Summerfield

Feb 21, 2011, 8:09:14 AM
to Anthony Martin, golang-nuts
Hi Anthony,

On Mon, 21 Feb 2011 04:40:59 -0800
Anthony Martin <al...@pbrane.org> wrote:
> Mark Summerfield <li...@qtrac.plus.com> once said:
> > The two characters are not the same as such (one is Angstrom the
> > other an A with a ring above), but they are rendered the same
> > visually so it is reasonable to expect that end users would expect
> > them to compare equal, e.g., when sorted.
>
> I don't think those two code points are
> required to render the same way, given a
> specific font. And it's reasonable to
> assume that different fonts may be used
> for arbitrary groups of code points.

Yes, I chose a poor example.



> Now if you want to talk about combining
> characters ... well, that's a can of worms
> I'd rather not open.

This is exactly what I meant!

Bytes \xC3\x85 are A with a ring above (U+00C5) and
bytes \x41\xCC\x8A are A followed by a combining ring above (U+030A), so in
both cases, character 'Å'.

So I guess the answer is that there's no normalization function for such
cases in Go at the moment.
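
To make the problem concrete, here is a toy sketch that folds the three spellings of 'Å' mentioned in this thread onto one form. It is only a hand-written table for this single character, not Unicode normalization; the name foldRingA is invented for the example, and the combining form is assumed to be A followed by U+030A (bytes 41 CC 8A).

```go
package main

import (
	"fmt"
	"strings"
)

// foldRingA is a toy, hand-written table covering only the spellings of "Å"
// discussed here. It is NOT a substitute for real normalization (UAX #15).
var foldRingA = strings.NewReplacer(
	"\u212b", "\u00c5", // ANGSTROM SIGN -> A WITH RING ABOVE
	"A\u030a", "\u00c5", // A + COMBINING RING ABOVE -> A WITH RING ABOVE
)

func main() {
	angstrom := "\xe2\x84\xab"  // U+212B
	precomposed := "\xc3\x85"   // U+00C5
	combining := "\x41\xcc\x8a" // U+0041 U+030A

	// Byte-wise comparison treats all three as different strings.
	fmt.Println(angstrom == precomposed, combining == precomposed) // false false

	// After folding, they compare equal.
	fmt.Println(foldRingA.Replace(angstrom) == precomposed,
		foldRingA.Replace(combining) == precomposed) // true true
}
```

A real implementation needs the full Unicode decomposition and composition data rather than a table like this. Go later gained such a package outside the standard library (golang.org/x/text/unicode/norm); the sketch avoids it only to stay dependency-free.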

Thanks!

Jan Mercl

Feb 21, 2011, 9:08:47 AM
to golang-nuts
On Feb 21, 2:09 pm, Mark Summerfield <l...@qtrac.plus.com> wrote:
> So I guess the answer is that there's no normalization function for such
> cases in Go at the moment.

See also:
http://groups.google.com/group/golang-nuts/browse_frm/thread/01242d005e76d9f0/64db5ca670deb4fc?#64db5ca670deb4fc

Go wrapper for ICU might be the best answer to this task, but I'm not
aware of it being available yet.

peterGo

Feb 21, 2011, 9:03:50 AM
to golang-nuts
Mark,

Currently, Go does not support Unicode normalization forms.
http://unicode.org/reports/tr15/

Peter

Rob 'Commander' Pike

Feb 21, 2011, 1:34:05 PM
to Jan Mercl, golang-nuts
This is a huge item on our TODO list.

-rob

Dave Cheney

Feb 21, 2011, 4:05:04 PM
to Mark Summerfield, golang-nuts

andrey mirtchovski

Feb 21, 2011, 4:13:21 PM
to Mark Summerfield, golang-nuts
> but they are rendered the same visually so it
> is reasonable to expect that end users would expect them to compare
> equal, e.g., when sorted.

combining characters notwithstanding, there's a very good rationale
for _not_ having identically-looking characters be equal in
comparison. for example, i would want to distinguish between скука
(cyrillic) and ckyka (latin) when parsing text. also, nobody would be
really happy if their browser decided that a url pointing at bank.com
was identical to a url pointing at bаnk.com, under the control of some
bad guys (the second one has one character rendered in cyrillic,
visually identical in most fonts).
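
a rough sketch of the bank.com case, assuming the look-alike is U+0430 (cyrillic small a). the function name mixesLatinAndCyrillic is invented here, and checking only these two scripts is a simplification, not a complete homograph defence.

```go
package main

import (
	"fmt"
	"unicode"
)

// mixesLatinAndCyrillic reports whether s contains letters from both scripts.
// (Hypothetical helper; real homograph checks consider many more scripts.)
func mixesLatinAndCyrillic(s string) bool {
	var latin, cyrillic bool
	for _, r := range s {
		switch {
		case unicode.Is(unicode.Latin, r):
			latin = true
		case unicode.Is(unicode.Cyrillic, r):
			cyrillic = true
		}
	}
	return latin && cyrillic
}

func main() {
	genuine := "bank.com"
	lookalike := "b\u0430nk.com" // U+0430 CYRILLIC SMALL LETTER A

	fmt.Println(genuine == lookalike) // false: the bytes differ
	fmt.Println(mixesLatinAndCyrillic(genuine), mixesLatinAndCyrillic(lookalike)) // false true
}
```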

Steven

Feb 21, 2011, 4:26:57 PM
to andrey mirtchovski, Mark Summerfield, golang-nuts
I'd say this is an argument the other way... If you copy a link to bank.com/whatever, you can't tell which particular code point underlies each character from visual inspection. So, assuming UTF-8 URLs, you would want normalization, so that you can't have two identical-looking domain names that direct you to two different web sites (one being your bank, the other being an attack site).

Even if your reasoning were sound, I don't see its relevance. Whether normalizing is appropriate for a specific application is irrelevant to whether it is useful in general.

David Brown

Feb 21, 2011, 4:30:25 PM
to Dave Cheney, Mark Summerfield, golang-nuts
On Mon, Feb 21 2011, Dave Cheney wrote:

They do normalize the same, though, apparently a consequence of history:

<http://en.wikipedia.org/wiki/%C3%85ngstr%C3%B6m#Symbol>

One concern I have about normalizing UTF-8 is how will one handle
situations where normalization is not desired? For example, Linux
pathnames are not normalized (and in fact, aren't even required to be
valid UTF-8), but it would still be useful to work with the pathnames.

David

David Brown

Feb 21, 2011, 9:41:43 PM
to Ian Lance Taylor, David Brown, Dave Cheney, Mark Summerfield, golang-nuts
On Mon, Feb 21 2011, Ian Lance Taylor wrote:

> David Brown <gol...@davidb.org> writes:
>
>> One concern I have about normalizing UTF-8 is how will one handle
>> situations where normalization is not desired? For example, Linux
>> pathnames are not normalized (and in fact, aren't even required to be
>> valid UTF-8), but it would still be useful to work with the pathnames.
>

> The language is certainly not going to automatically normalize UTF-8
> strings for you. However, the standard library should at some point
> provide facilities for an application to normalize them when that is
> appropriate.

That's good to hear. I've used too many languages/libraries that
would either crash or fail to handle invalid UTF-8 sequences in
filenames.

I like the approach that Go seems to use, that strings are sequences of
bytes, that can be interpreted as UTF-8 if desired.
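
A small sketch of that behaviour, assuming a hypothetical filename containing one byte (0xff) that can never appear in valid UTF-8. The helper runesOf is made up for the example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runesOf decodes s as UTF-8. Each invalid byte decodes to U+FFFD,
// but the string itself is left untouched.
func runesOf(s string) []rune {
	var out []rune
	for _, r := range s {
		out = append(out, r)
	}
	return out
}

func main() {
	name := string([]byte{'f', 0xff, 'o'}) // hypothetical filename, not valid UTF-8

	fmt.Println(len(name), utf8.ValidString(name)) // 3 false
	fmt.Printf("%U\n", runesOf(name))              // [U+0066 U+FFFD U+006F]
	fmt.Printf("% x\n", name)                      // 66 ff 6f: original bytes preserved
}
```

Interpreting the string as UTF-8 is a separate step from storing it, so the invalid byte survives any round trip that does not explicitly decode and re-encode.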

Speaking of which, would it be safe to use arbitrary binary slices as
string keys in a map? I didn't have any problems with it, but wasn't
sure if it was supposed to work?

Thanks,
David

Ian Lance Taylor

Feb 21, 2011, 8:47:35 PM
to David Brown, Dave Cheney, Mark Summerfield, golang-nuts
David Brown <gol...@davidb.org> writes:

> One concern I have about normalizing UTF-8 is how will one handle
> situations where normalization is not desired? For example, Linux
> pathnames are not normalized (and in fact, aren't even required to be
> valid UTF-8), but it would still be useful to work with the pathnames.

The language is certainly not going to automatically normalize UTF-8
strings for you. However, the standard library should at some point
provide facilities for an application to normalize them when that is
appropriate.

Ian

Mark Summerfield

Feb 22, 2011, 3:22:27 AM
to Rob 'Commander' Pike, Jan Mercl, golang-nuts

Good!

I suspect that in most cases normalization won't be needed, e.g., when
all the strings are created within Go programs. But sometimes I guess it
will be essential.

And I imagine an even bigger item is http://unicode.org/reports/tr10
unless you wrap an existing library for that:-)

Ian Lance Taylor

Feb 22, 2011, 10:33:24 AM
to David Brown, Dave Cheney, Mark Summerfield, golang-nuts
David Brown <gol...@davidb.org> writes:

> Speaking of which, would it be safe to use arbitrary binary slices as
> string keys in a map? I didn't have any problems with it, but wasn't
> sure if it was supposed to work?

Yes, that should work fine. The only thing that really matters for the
key type of a map is equality, and that is well defined for arbitrary
byte sequences.
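
A minimal sketch of that, assuming made-up binary keys that include NUL and bytes that are not valid UTF-8. The function name countKeys is invented for the example.

```go
package main

import "fmt"

// countKeys counts how often each byte slice occurs, using the slice's
// bytes (converted to a string) as the map key.
func countKeys(keys [][]byte) map[string]int {
	counts := make(map[string]int)
	for _, k := range keys {
		counts[string(k)]++ // the conversion copies the bytes unchanged
	}
	return counts
}

func main() {
	counts := countKeys([][]byte{
		{0x00, 0xff, 0xfe},
		{0x00, 0xff, 0xfe}, // same bytes, so the same key
		{0x00, 0xff},
	})
	fmt.Println(len(counts), counts["\x00\xff\xfe"]) // 2 2
}
```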

Ian

peterGo

Feb 22, 2011, 10:41:27 AM
to golang-nuts
David,

> Speaking of which, would it be safe to use arbitrary binary slices as
> string keys in a map? I didn't have any problems with it, but wasn't
> sure if it was supposed to work?

See this thread for an example.

What is the best way to map slices or arrays?
http://groups.google.com/group/golang-nuts/browse_thread/thread/590ef0b622bde60a

Peter