Sorting UTF-8 Strings

1,821 views
Skip to first unread message

John Souvestre

unread,
Apr 15, 2015, 5:15:50 AM4/15/15
to golan...@googlegroups.com, Frits van Bommel, Maxim Kupriianov

Hi.

 

I’m just starting to use Unicode and am looking for some guidance.

 

I'm interested in sorting a slice of UTF-8 strings in lexicographical order (code point by code point, based simply on its numeric value), ignoring duplicate/equivalent normalization and language/locale collation issues for now.  I've seen it said in a few places that this can be done by treating each of the strings as a series of bytes and sorting accordingly. 

 

I don't see how this can possibly work, however.  If all of the code points used the same number of bytes, fine, but that's not the case.  A UTF-8 code point can use one to six bytes.  Thus you might end up comparing a byte of one code point to a byte of another code point.

 

I would think that you would have to convert the UTF-8 string to a rune string (or walk the UTF-8 string a rune at a time).  Fine, but wouldn't you have to allow for combined code points, for example one with a diacritical mark(s)?  You would really have to walk the string a combined code point at a time, right?

 

Does Go offer any support to extract combined code point (base code point plus any combiner code points) from a UTF-8 string or a slice of runes?

 

Once I get this much figured out, I’ll worry about normalization and collation.  J

 

Note:  For doing solely numeric sorts, I do see that there is a library unicode/IsDigit function which identifies digits for all of the languages.

 

Thanks,

 

John

    John Souvestre - New Orleans LA

 

Jan Mercl

unread,
Apr 15, 2015, 5:33:44 AM4/15/15
to golan...@googlegroups.com

On Wed, Apr 15, 2015 at 11:15 AM John Souvestre <jo...@sstar.com> wrote:

> I don't see how this can possibly work, however.

Then a nice exercise would be to try to come with a pair of correctly UTF-8 encoded strings that do not sort the same way as the strings of the codepoints they represent (encode) would ;-)


> If all of the code points used the same number of bytes, fine, but that's
> not the case.

That's not the case, correct, but this fact cannot change the ordering.


> Thus you might end up comparing a byte of one code point to a byte of 
> another code point.

No. If the lengths are different, the order was already determined using only at most min(len(a),len(b)) bytes of the two UTF-8 encoded codepoints, provided the codepoints differ. And if they do not differ then len(a) == len(b). And once a difference is found, no more codepoints - and thus no more UTF-8 bytes - are needed to be considered.

From [0]:
""""
Sorting a set of UTF-8 encoded strings as strings of unsigned bytes yields the same order as sorting the corresponding Unicode strings lexicographically by codepoint.
""""

[0]: http://en.wikipedia.org/wiki/UTF-8#Advantages

-j

John Souvestre

unread,
Apr 15, 2015, 8:20:01 AM4/15/15
to Jan Mercl, golan...@googlegroups.com

Jan,

 

Thank you very much.  I did not understand the effect of the UTF-8 encoding bumping the leading bits in the first byte as bytes were added.  Very nice!

 

So now I guess it’s time to look into normalization and collation.  Do you have any tips?

 

Thanks,

 

John

    John Souvestre - New Orleans LA

 

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jan Mercl

unread,
Apr 15, 2015, 8:27:01 AM4/15/15
to John Souvestre, golan...@googlegroups.com
On Wed, Apr 15, 2015 at 2:19 PM John Souvestre <jo...@sstar.com> wrote:

> So now I guess it’s time to look into normalization and collation. Do you 
> have any tips?

For example, the blog post at https://blog.golang.org/normalization and the packages linked from there: https://godoc.org/golang.org/x/text/unicode/norm and http://godoc.org/golang.org/x/text/collate

-j

John Souvestre

unread,
Apr 15, 2015, 8:32:41 AM4/15/15
to golan...@googlegroups.com

Thanks!

 

John

    John Souvestre - New Orleans LA

 

From: Jan Mercl [mailto:0xj...@gmail.com]
Sent: 2015 April 15, Wed 07:27
To: John Souvestre; golan...@googlegroups.com
Subject: Re: [go-nuts] Sorting UTF-8 Strings

 

On Wed, Apr 15, 2015 at 2:19 PM John Souvestre <jo...@sstar.com> wrote:

Reply all
Reply to author
Forward
0 new messages