This program:
    package main

    import "fmt"

    func main() {
        a1 := string([]byte{0xe2, 0x84, 0xab})
        a2 := string([]byte{0xc3, 0x85})
        fmt.Println(a1, a2, a1 == a2)
    }
Prints:
Å Å false
The two characters are not the same as such (one is Angstrom the other
an A with a ring above), but they are rendered the same visually so it
is reasonable to expect that end users would expect them to compare
equal, e.g., when sorted.
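To see why the comparison fails, it may help to print the code points rather than the glyphs. The sketch below assumes only the standard fmt package; the helper name codePoints is made up for illustration.

```go
package main

import "fmt"

// codePoints returns the Unicode code points of s in U+XXXX notation.
// (Hypothetical helper, not part of any library.)
func codePoints(s string) []string {
	var out []string
	for _, r := range s {
		out = append(out, fmt.Sprintf("%U", r))
	}
	return out
}

func main() {
	a1 := string([]byte{0xe2, 0x84, 0xab}) // ANGSTROM SIGN
	a2 := string([]byte{0xc3, 0x85})       // LATIN CAPITAL LETTER A WITH RING ABOVE

	// Prints: [U+212B] [U+00C5] false
	fmt.Println(codePoints(a1), codePoints(a2), a1 == a2)
}
```

The == operator on strings compares bytes, so two different code points never compare equal, however similar they look.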
Python provides the unicodedata.normalize() function which can help in
such cases, but I can't find a Go equivalent in the standard library.
Thanks!
--
Mark Summerfield, Qtrac Ltd, www.qtrac.eu
C++, Python, Qt, PyQt - training and consultancy
"Advanced Qt Programming" - ISBN 0321635906
http://www.qtrac.eu/aqpbook.html
I don't think those two code points are
required to render the same way, given a
specific font. And it's reasonable to
assume that different fonts may be used
for arbitrary groups of code points.
Now if you want to talk about combining
characters ... well, that's a can of worms
I'd rather not open.
Anthony
On Mon, 21 Feb 2011 04:40:59 -0800
Anthony Martin <al...@pbrane.org> wrote:
> Mark Summerfield <li...@qtrac.plus.com> once said:
> > The two characters are not the same as such (one is Angstrom the
> > other an A with a ring above), but they are rendered the same
> > visually so it is reasonable to expect that end users would expect
> > them to compare equal, e.g., when sorted.
>
> I don't think those two code points are
> required to render the same way, given a
> specific font. And it's reasonable to
> assume that different fonts may be used
> for arbitrary groups of code points.
Yes, I chose a poor example.
> Now if you want to talk about combining
> characters ... well, that's a can of worms
> I'd rather not open.
This is exactly what I meant!
Bytes \xC3\x85 are A with a ring above and
bytes \x41\xCC\x8A are A followed by combining ring above, so in both
cases the character is 'Å'.
So I guess the answer is that there's no normalization function for such
cases in Go at the moment.
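To make the combining-character case concrete, here is a rough sketch. It assumes the combining ring above is U+030A (bytes cc 8a). The function toyCompose is a made-up, two-rule stand-in and nowhere near a real normalizer; it only shows the kind of mapping that Python's unicodedata.normalize('NFC', s) performs for this one character.

```go
package main

import "fmt"

// toyCompose is a deliberately tiny, hypothetical stand-in for a real
// normalizer. It knows only two rules, both yielding U+00C5:
//
//	U+212B (ANGSTROM SIGN)                 -> U+00C5
//	U+0041 U+030A (A, combining ring above) -> U+00C5
func toyCompose(s string) string {
	in := []rune(s)
	out := make([]rune, 0, len(in))
	for i := 0; i < len(in); i++ {
		switch {
		case in[i] == '\u212b':
			out = append(out, '\u00c5')
		case in[i] == 'A' && i+1 < len(in) && in[i+1] == '\u030a':
			out = append(out, '\u00c5')
			i++ // skip the combining mark that was just consumed
		default:
			out = append(out, in[i])
		}
	}
	return string(out)
}

func main() {
	precomposed := "\u00c5" // bytes c3 85
	decomposed := "A\u030a" // bytes 41 cc 8a

	fmt.Println(precomposed == decomposed)                         // false
	fmt.Println(toyCompose(precomposed) == toyCompose(decomposed)) // true
	fmt.Printf("% x\n", decomposed)                                // 41 cc 8a
}
```

A real implementation has to follow the full Unicode composition tables and canonical ordering rules. The golang.org/x/text/unicode/norm package, which is outside the standard library, is one place to look for that.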
Thanks!
--
Mark Summerfield, Qtrac Ltd, www.qtrac.eu
-rob
They don't appear to be visually the same.

https://skitch.com/davecheney/rty1c/screen-shot-2011-02-22-at-8.02.47-am
combining characters notwithstanding, there's a very good rationale
for _not_ having identically-looking characters be equal in
comparison. for example, i would want to distinguish between скука
(cyrillic) and ckyka (latin) when parsing text. also, nobody would be
really happy if their browser decided that a url pointing at bank.com
was identical to a url pointing at bаnk.com, under the control of some
bad guys (the second one has one character rendered in cyrillic,
visually identical in most fonts).
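As a rough illustration of that point, the standard unicode package exposes per-script range tables, so a program can at least notice when a string mixes scripts. The function name mixesLatinAndCyrillic is invented here, and this is only a sketch, not a complete homograph defence.

```go
package main

import (
	"fmt"
	"unicode"
)

// mixesLatinAndCyrillic reports whether s contains at least one Latin
// letter and at least one Cyrillic letter. (Hypothetical helper.)
func mixesLatinAndCyrillic(s string) bool {
	var latin, cyrillic bool
	for _, r := range s {
		switch {
		case unicode.Is(unicode.Latin, r):
			latin = true
		case unicode.Is(unicode.Cyrillic, r):
			cyrillic = true
		}
	}
	return latin && cyrillic
}

func main() {
	fmt.Println(mixesLatinAndCyrillic("bank.com"))      // false: all Latin
	fmt.Println(mixesLatinAndCyrillic("b\u0430nk.com")) // true: U+0430 is Cyrillic
	fmt.Println("скука" == "ckyka")                     // false: different code points
}
```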
> They don't appear to be visually the same.
>
> https://skitch.com/davecheney/rty1c/screen-shot-2011-02-22-at-8.02.47-am
They do normalize the same, though, apparently a consequence of history:
<http://en.wikipedia.org/wiki/%C3%85ngstr%C3%B6m#Symbol>
One concern I have about normalizing UTF-8 is how will one handle
situations where normalization is not desired? For example, Linux
pathnames are not normalized (and in fact, aren't even required to be
valid UTF-8), but it would still be useful to work with the pathnames.
David
> David Brown <gol...@davidb.org> writes:
>
>> One concern I have about normalizing UTF-8 is how will one handle
>> situations where normalization is not desired? For example, Linux
>> pathnames are not normalized (and in fact, aren't even required to be
>> valid UTF-8), but it would still be useful to work with the pathnames.
>
> The language is certainly not going to automatically normalize UTF-8
> strings for you. However, the standard library should at some point
> provide facilities for an application to normalize them when that is
> appropriate.
That's good to hear. I've used too many languages/libraries that
would either crash or fail to handle invalid UTF-8 sequences in
filenames.
I like the approach that Go seems to use, that strings are sequences of
bytes, that can be interpreted as UTF-8 if desired.
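A small sketch of what that looks like in practice, using the standard unicode/utf8 package. The file name below is made up and invalidBytes is a hypothetical helper. The string keeps every byte, and the invalid one only shows up as U+FFFD when the string is decoded as UTF-8.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// invalidBytes counts the bytes in s that are not part of a valid
// UTF-8 encoding. (Hypothetical helper.)
func invalidBytes(s string) int {
	n := 0
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		if r == utf8.RuneError && size == 1 {
			n++
		}
		i += size
	}
	return n
}

func main() {
	// An invented file name containing 0xff, which never occurs in valid UTF-8.
	name := string([]byte{'a', 0xff, 'b'})

	fmt.Println(len(name))              // 3: the byte is preserved
	fmt.Println(utf8.ValidString(name)) // false
	fmt.Println(invalidBytes(name))     // 1
	for i, r := range name {
		fmt.Printf("%d %U\n", i, r) // the bad byte decodes as U+FFFD
	}
}
```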
Speaking of which, would it be safe to use arbitrary binary slices as
string keys in a map? I didn't have any problems with it, but wasn't
sure if it was supposed to work?
Thanks,
David
> One concern I have about normalizing UTF-8 is how will one handle
> situations where normalization is not desired? For example, Linux
> pathnames are not normalized (and in fact, aren't even required to be
> valid UTF-8), but it would still be useful to work with the pathnames.
The language is certainly not going to automatically normalize UTF-8
strings for you. However, the standard library should at some point
provide facilities for an application to normalize them when that is
appropriate.
Ian
Good!
I suspect that in most cases normalization won't be needed, e.g., when
all the strings are created within Go programs. But sometimes I guess it
will be essential.
And I imagine an even bigger item is http://unicode.org/reports/tr10
unless you wrap an existing library for that:-)
--
Mark Summerfield, Qtrac Ltd, www.qtrac.eu
> Speaking of which, would it be safe to use arbitrary binary slices as
> string keys in a map? I didn't have any problems with it, but wasn't
> sure if it was supposed to work?
Yes, that should work fine. The only thing that really matters for the
key type of a map is equality, and that is well defined for arbitrary
byte sequences.
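For what it's worth, a minimal sketch of that usage follows. The helper countKeys and the sample keys are made up for illustration. Converting a []byte to string copies the bytes unchanged, and keys are equal exactly when their bytes are equal.

```go
package main

import "fmt"

// countKeys counts how often each byte sequence occurs, using the raw
// bytes (converted to string) as the map key. (Hypothetical helper.)
func countKeys(keys [][]byte) map[string]int {
	m := make(map[string]int)
	for _, k := range keys {
		m[string(k)]++
	}
	return m
}

func main() {
	m := countKeys([][]byte{
		{0x00, 0xff, 0xfe}, // not valid UTF-8
		{0x00, 0xff, 0xfe}, // same bytes, so same key
		{0x00, 0xff},       // a prefix is a different key
	})
	fmt.Println(len(m), m["\x00\xff\xfe"], m["\x00\xff"]) // 2 2 1
}
```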
Ian