Sorting a list of Unicode strings?


oli...@obeattie.com

Aug 19, 2007, 12:50:37 PM
Hey Guys,

Maybe I'm missing something fundamental here, but if I have a list of
Unicode strings, and I want to sort these alphabetically, then it
places those that begin with unicode characters at the bottom. Is
there a way to avoid this, and make it sort them properly?

I'm sure that this is the "proper way" programmatically with character
entities etc. - but when I have a list of countries, and I have Åland
Islands right at the bottom, it just doesn't look right.

Any help would be really appreciated.

Thanks,
Oliver

Stefan Behnel

Aug 19, 2007, 1:01:38 PM
to oli...@obeattie.com
oli...@obeattie.com wrote:
> Hey Guys,

... and girls - maybe ...


> Maybe I'm missing something fundamental here, but if I have a list of
> Unicode strings, and I want to sort these alphabetically, then it
> places those that begin with unicode characters at the bottom.

That's because "Unicode" covers more than one alphabet. unicode objects compare
by Unicode code point value, so sort() does likewise.
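
To make the code point comparison concrete, here is a minimal sketch in present-day Python 3, where str plays the role of the old unicode type. The country list is made up for illustration and is not the original poster's data.

```python
# Strings compare code point by code point, so 'Å' (U+00C5, 197)
# lands after every ASCII letter, including 'Z' (U+005A, 90).
countries = ["Zimbabwe", "Åland Islands", "Albania", "Sweden"]

ordered = sorted(countries)
print(ordered)

# The comparison that drives the result:
print(ord("Z"), ord("Å"))
```

With these inputs "Åland Islands" ends up last, which is the behaviour described in the question.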

Stefan

oli...@obeattie.com

Aug 19, 2007, 1:05:16 PM

Thanks for putting me right -- gals indeed!

Anyway, I know _why_ it does this, but I really do need it to sort
them correctly based on how humans would look at it.

Any ideas?

Alex Martelli

Aug 19, 2007, 2:09:42 PM
oli...@obeattie.com <oli...@obeattie.com> wrote:
...
> > > Maybe I'm missing something fundamental here, but if I have a list of
> > > Unicode strings, and I want to sort these alphabetically, then it
> > > places those that begin with unicode characters at the bottom.
...

> Anyway, I know _why_ it does this, but I really do need it to sort
> them correctly based on how humans would look at it.

Depending on the nationality of those humans, you may need very
different sorting criteria; indeed, in some countries, different sorting
criteria apply to different use cases (such as sorting surnames versus
sorting book titles, etc; sorry, I don't recall specific examples, but
if you delve into sites about i18n issues you'll find some).

In both Swedish and Danish, I believe, A-with-ring sorts AFTER the
letter Z in the alphabet; so, having Aaland (where I'm using Aa for
A-with-ring, since this newsreader has some problem in letting me enter
non-ascii characters;-) sort "right at the bottom", while it "doesn't
look right" to YOU (maybe an English-speaker?) may look right to the
inhabitants of that locality (be they Danes or Swedes -- but I believe
Norwegian may also work similarly in terms of sorting).

The Unicode consortium does define a standard collation algorithm (UCA)
and table (DUCET) to use when you need a locale-independent ordering; at
<http://jtauber.com/blog/2006/01/27/python_unicode_collation_algorithm>
you'll be able to obtain James Tauber's Python implementation of UCA, to
work with the DUCET found at
<http://jtauber.com/blog/2006/01/27/python_unicode_collation_algorithm>.

I suspect you won't like the collation order you obtain this way, but
you might start from there, subsetting and tweaking the DUCET into an
OUCET (Oliver Unicode Collation Element Table;-) that suits you better.

A simpler, rougher approach, if you think the "right" collation is
obtained by ignoring accents, diacritics, etc (even though the speakers
of many languages that include diacritics, &c, disagree;-) is to use the
key=coll argument in your sorting call, passing a function coll that
maps any Unicode string to what you _think_ it should be like for
sorting purposes. The .translate method of Unicode string objects may
help there: it takes a dict mapping Unicode ordinals to ordinals or
strings (or None for characters you want to delete as part of the
translation).

For example, suppose that what we want is the following somewhat silly
collation: we only care about ISO-8859-1 characters, and want to ignore
for sorting purposes any accent (be it grave, acute or circumflex),
umlauts, slashes through letters, tildes, cedillas. htmlentitydefs has
a useful dict called codepoint2name that helps us identify those "weirdly
decorated foreign characters".

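def make_transdict():
    # Python 2 module: maps code points to HTML entity names, e.g. 233 -> 'eacute'
    import htmlentitydefs
    cp2n = htmlentitydefs.codepoint2name
    suffixes = 'acute grave circ uml slash tilde cedil'.split()
    td = {}
    for x in range(128, 256):
        if x not in cp2n: continue
        n = cp2n[x]
        for s in suffixes:
            if n.endswith(s):
                # strip the suffix from the entity name, leaving the base letter
                td[x] = unicode(n[:-len(s)])
                break
    return td

def coll(us, td=make_transdict()):
    # key function: the string with the listed decorations removed
    return us.translate(td)

listofus.sort(key=coll)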


I haven't tested this code, but it should be reasonably easy to fix any
problems it might have, as well as making make_transdict "richer" to
meet your goals. Just be aware that the resulting collation (e.g.,
sorting a-ring just as if it was a plain a) will be ABSOLUTELY WEIRD to
anybody who knows something about Scandinavian languages...!!!-)
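
For readers on Python 3, the same idea might look like the sketch below. It assumes html.entities.codepoint2name as the replacement for the Python 2 htmlentitydefs module, a suffix list spelled with 'grave', and a made-up country list; the names SUFFIXES and TRANSDICT are only illustrative.

```python
from html.entities import codepoint2name

# Entity-name suffixes naming the decorations to ignore (no 'ring' here,
# so Å is deliberately left alone, as in the description above).
SUFFIXES = ("acute", "grave", "circ", "uml", "slash", "tilde", "cedil")

def make_transdict():
    """Map Latin-1 code points such as 233 ('eacute') to base letters ('e')."""
    td = {}
    for cp in range(128, 256):
        name = codepoint2name.get(cp)
        if name is None:
            continue
        for suffix in SUFFIXES:
            if name.endswith(suffix):
                # 'eacute' -> 'e'; a bare 'acute' maps to '' and is dropped
                td[cp] = name[:-len(suffix)]
                break
    return td

TRANSDICT = make_transdict()

def coll(s):
    """Sort key: the string with the listed decorations stripped."""
    return s.translate(TRANSDICT)

countries = ["Zambia", "Éire", "Côte d'Ivoire", "Chile", "Åland Islands"]
ordered = sorted(countries, key=coll)
print(ordered)
```

This only handles precomposed characters in the ISO-8859-1 range, and since 'ring' is not among the suffixes, Åland still sorts after Zambia.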


Alex

Steve Holden

Aug 19, 2007, 8:45:02 PM
to pytho...@python.org
Alex Martelli wrote:
> oli...@obeattie.com <oli...@obeattie.com> wrote:
> ...
>>>> Maybe I'm missing something fundamental here, but if I have a list of
>>>> Unicode strings, and I want to sort these alphabetically, then it
>>>> places those that begin with unicode characters at the bottom.
> ...
>> Anyway, I know _why_ it does this, but I really do need it to sort
>> them correctly based on how humans would look at it.
>
> Depending on the nationality of those humans, you may need very
> different sorting criteria; indeed, in some countries, different sorting
> criteria apply to different use cases (such as sorting surnames versus
> sorting book titles, etc; sorry, I don't recall specific examples, but
> if you delve on sites about i18n issues you'll find some).
>
Just one example from my own experience. When sorting names in Scotland
(and technically in the rest of the UK too in deference to Scotland,
though this is often ignored) names beginning with "Mc" have to be
sorted /as though/ they began with "Mac". Since the two prefixes are
indistinguishable phonetically it would otherwise mean twice as much
work to look up one of those names.
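
A rule like this fits naturally into a key function. The following is a rough Python 3 sketch, not a complete treatment of British surname filing; surname_key and the sample names are invented for illustration.

```python
def surname_key(name):
    """Sort key that files a leading 'Mc' as though it were 'Mac'."""
    if name.startswith("Mc"):
        name = "Mac" + name[2:]
    # casefold so that 'MacBeth' and 'Macbeth' file together
    return name.casefold()

names = ["MacDonald", "McAdam", "Mabbott", "MacBeth"]
ordered = sorted(names, key=surname_key)
print(ordered)
```

The stored names are untouched; only the comparison key changes, so "McAdam" is displayed as written but filed among the "Mac" entries.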

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
--------------- Asciimercial ------------------
Get on the web: Blog, lens and tag the Internet
Many services currently offer free registration
----------- Thank You for Reading -------------

thebjorn

Aug 19, 2007, 9:47:05 PM
On Aug 19, 8:09 pm, al...@mac.com (Alex Martelli) wrote:
[...]

> In both Swedish and Danish, I believe, A-with-ring sorts AFTER the
> letter Z in the alphabet; so, having Aaland (where I'm using Aa for
> A-with-ring, since this newsreader has some problem in letting me enter
> non-ascii characters;-) sort "right at the bottom", while it "doesn't
> look right" to YOU (maybe an English-speaker?) may look right to the
> inhabitants of that locality (be they Danes or Swedes -- but I believe
> Norwegian may also work similarly in terms of sorting).

You're absolutely correct, the Norwegian and Danish alphabets end
with ..xyzæøå, while the Swedish alphabet ends with ..xyzåäö and sort
order follows placement. Indeed, my first reaction to the op was:
where else would Åland be but at the end? One, perhaps interesting,
tidbit, is that Åland "belongs" to Finland (it's an autonomous,
demilitarized, monolingually Swedish-speaking administrative province
of Finland). The Finnish alphabet is identical to the Swedish
alphabet, including sort order (at least in this case)
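
If you want sort order to follow placement in one particular alphabet, a small rank table is enough for simple cases. This is only a sketch in Python 3: make_alphabet_key is a made-up helper, it ignores the finer points of real collation rules, and characters outside the given alphabet are simply sorted before the letters.

```python
import unicodedata

def make_alphabet_key(alphabet):
    """Build a sort key that ranks letters by their position in `alphabet`."""
    alphabet = unicodedata.normalize("NFC", alphabet)
    rank = {ch: i for i, ch in enumerate(alphabet)}

    def key(s):
        # Normalize so that Å is one code point, then ignore case.
        folded = unicodedata.normalize("NFC", s).casefold()
        # Characters outside the alphabet get rank -1 and tie-break by code point.
        return [(rank.get(ch, -1), ch) for ch in folded]

    return key

swedish_key = make_alphabet_key("abcdefghijklmnopqrstuvwxyzåäö")
danish_key = make_alphabet_key("abcdefghijklmnopqrstuvwxyzæøå")

swedish = sorted(["Östersund", "Åland", "Zambia", "Ärla", "Albania"], key=swedish_key)
danish = sorted(["Århus", "Ærø", "Ørsted", "Zoo"], key=danish_key)
print(swedish)
print(danish)
```

The same letter Å ranks 27th in the Swedish table and 29th in the Danish one, which is why a single locale-independent order cannot satisfy both.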

For the ascii-speakers out there, the key point to remember is that
the letter Å (pronounced like the au in British autumn) is not an
ascii A with a ring on top. The ring-on-top is an intrinsic part of
the letter, in the same way the tail on the letter Q isn't a
decoration of the letter O.

-- bjorn

Tommy Nordgren

Aug 20, 2007, 7:08:21 AM
to oli...@obeattie.com, pytho...@python.org

That is the correct alphabetical sort order for Åland.
The Swedish letters Å, Ä and Ö sort last in alphabetical order.
-----------------------------------------------------
An astronomer to a colleague:
-I can't understand how you can go to the brothel as often as you
do. Not only is it a filthy habit, but it must cost a lot of money too.
-That's no problem. I've got a big government grant for the study of
black holes.
Tommy Nordgren
tommy.n...@comhem.se

oli...@obeattie.com

Aug 20, 2007, 7:13:08 AM
Thank you all for your very quick and informative replies. I was
basing my assumption that Å was classed as a standard 'A' on a
list of countries I was looking at (Wikipedia sorts it like this, too,
though this isn't what I was using: http://en.wikipedia.org/wiki/List_of_countries#A)

I will leave it as it is, with Å at the bottom, if this is the correct
ordering.

Once again, thank you!

Oliver

koen.va...@gmail.com

Aug 30, 2007, 9:53:01 AM
The Swedish-language Wikipedia lists it at the bottom ;-)

http://sv.wikipedia.org/wiki/Lista_%C3%B6ver_l%C3%A4nder#.C3.85

Cheers
~K
