Hi.
I’m just starting to use Unicode and am looking for some guidance.
I'm interested in sorting a slice of UTF-8 strings in lexicographical order (code point by code point, comparing each simply by its numeric value), ignoring normalization of duplicate/equivalent sequences and language/locale collation issues for now. I've seen it said in a few places that this can be done by treating each of the strings as a series of bytes and sorting accordingly.
I don't see how this can possibly work, however. If all of the code points used the same number of bytes, fine, but that's not the case. A code point encoded in UTF-8 can take one to four bytes. Thus you might end up comparing a byte of one code point to a byte of another code point.
I would think that you would have to convert the UTF-8 string to a slice of runes (or walk the UTF-8 string a rune at a time). Fine, but wouldn't you have to allow for combined code points, for example one with one or more diacritical marks? You would really have to walk the string a combined code point at a time, right?
Does Go offer any support for extracting a combined code point (a base code point plus any combining code points) from a UTF-8 string or a slice of runes?
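For illustration, here is a rough sketch of walking a string one "combined code point" at a time using only the standard unicode package. The helper name clusters is made up for this example, and the rule it applies (a base code point followed by any code points for which unicode.IsMark is true) is a simplifying assumption, not full Unicode grapheme segmentation.

```go
package main

import (
	"fmt"
	"unicode"
)

// clusters is a hypothetical helper: it splits s into groups made of one
// base code point followed by any combining marks (Unicode category M).
// This is only an approximation of user-perceived characters.
func clusters(s string) []string {
	var out []string
	start := 0
	for i, r := range s { // range decodes one rune at a time; i is a byte offset
		if i > start && !unicode.IsMark(r) {
			// A non-mark rune starts a new group; close the previous one.
			out = append(out, s[start:i])
			start = i
		}
	}
	if start < len(s) {
		out = append(out, s[start:])
	}
	return out
}

func main() {
	// e + combining acute, a + combining diaeresis + combining dot below, z
	s := "e\u0301a\u0308\u0323z"
	for _, c := range clusters(s) {
		fmt.Printf("%q %U\n", c, []rune(c))
	}
}
```

A mark at the very start of the string has no base to attach to; this sketch simply keeps it in the first group.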
Once I get this much figured out, I’ll worry about normalization and collation. :)
Note: For purely numeric sorts, I do see that there is a library function, unicode.IsDigit, which identifies decimal digits across all scripts.
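A small sketch of what that looks like in practice. The helper allDigits is a made-up name for this example; unicode.IsDigit itself reports whether a rune is a decimal digit (category Nd), so letter-like numerals such as Roman numeral characters are not counted.

```go
package main

import (
	"fmt"
	"unicode"
)

// allDigits is a hypothetical helper: it reports whether s is non-empty
// and consists only of decimal digits from any script.
func allDigits(s string) bool {
	if s == "" {
		return false
	}
	for _, r := range s {
		if !unicode.IsDigit(r) {
			return false
		}
	}
	return true
}

func main() {
	for _, s := range []string{"123", "١٢٣", "12a", "Ⅷ"} {
		fmt.Printf("%q: %v\n", s, allDigits(s))
	}
}
```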
Thanks,
John
John Souvestre - New Orleans LA
Jan,
Thank you very much. I did not understand the effect of the UTF-8 encoding bumping the leading bits in the first byte as bytes were added. Very nice!
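To convince myself, a quick sketch that sorts the same words two ways: bytewise with sort.Strings, and code point by code point after converting to []rune. The helper names (lessByRunes, sortByBytes, sortByRunes) are just made up for this test, and it assumes the input is valid UTF-8.

```go
package main

import (
	"fmt"
	"sort"
)

// lessByRunes compares two strings code point by code point.
func lessByRunes(a, b string) bool {
	ra, rb := []rune(a), []rune(b)
	for i := 0; i < len(ra) && i < len(rb); i++ {
		if ra[i] != rb[i] {
			return ra[i] < rb[i]
		}
	}
	return len(ra) < len(rb) // a proper prefix sorts first
}

// sortByBytes returns a copy of words sorted by plain byte comparison.
func sortByBytes(words []string) []string {
	out := append([]string(nil), words...)
	sort.Strings(out)
	return out
}

// sortByRunes returns a copy of words sorted by code point values.
func sortByRunes(words []string) []string {
	out := append([]string(nil), words...)
	sort.Slice(out, func(i, j int) bool { return lessByRunes(out[i], out[j]) })
	return out
}

func main() {
	// Mix of 1-, 2-, 3- and 4-byte leading code points.
	words := []string{"zebra", "éclair", "apple", "日本", "€uro", "𝄞clef"}
	fmt.Println(sortByBytes(words))
	fmt.Println(sortByRunes(words))
}
```

Both lines come out in the same order, because a code point that needs more bytes always gets a larger lead byte than any shorter encoding.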
So now I guess it’s time to look into normalization and collation. Do you have any tips?
Thanks,
John
Thanks!
John
From: Jan Mercl [mailto:0xj...@gmail.com]
Sent: 2015 April 15, Wed 07:27
To: John Souvestre; golan...@googlegroups.com
Subject: Re: [go-nuts] Sorting UTF-8 Strings
On Wed, Apr 15, 2015 at 2:19 PM John Souvestre <jo...@sstar.com> wrote: