Proper Unicode case folding

Thomas Bushnell, BSG

Jul 6, 2012, 7:47:24 PM
to golang-nuts
It came up recently on another thread; I wanted to expand here.

Let me say first off that the presence of strings.EqualFold as the key interface is a great win. Because proper case-insensitive equality testing can only be done for strings, and not rune-by-rune, the availability of this function with its semantics is the most important thing. So while the current implementation is not ideal, that's neither here nor there. It can be fixed whenever there's the desire, and as long as code uses this call--which please do!--it will get the fix when it happens.

The problem is that even language-neutral case-folding requires the ability to handle a non-1:1 mapping of runes between cases. For example, ß in German is a lower-case letter with no one-to-one uppercase equivalent. In uppercase it shows up as SS or SZ, depending on the word.

Note that case-mapping is much harder than mere case-insensitive equality testing. When you see a word in all upper case with SS you cannot necessarily replace SS with ß; for example if you see BUSSE you cannot know except from semantic context whether the word in lower case should be Busse or Buße, which are entirely separate words. (The first vowel in the two is different, but that is not visible in the spelling of the all-caps word.) But you can reliably say that BUSSE and Busse test equal on a case-insensitive match, and that BUSSE and Buße also test equal.

This might raise the concern that case-insensitive matching could be non-transitive. But you also want Busse and Buße to match, because in Switzerland ß is being phased out, and both would be spelled Busse.

Another example is the obsolete letter kra in Kalaallisut, which similarly has a lower case form ĸ (which is not the same as k) and is upper-cased as the two-rune sequence K'.

Because the current implementation of strings.EqualFold works only rune-by-rune, it does not get these cases correct.  http://play.golang.org/p/3VsQMI4Rx5

There are other things which the current implementation gets wrong for case equality testing. The correct algorithm is described in section 3.13 of the Unicode Standard, under "Default Caseless Matching". In various ways it gives different results from the current implementation in Go.
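As a rough sketch of what string-level matching looks like: fold both strings completely, then compare the results. Everything named here (fullFold, canon, foldFull, equalFoldFull) is hypothetical, the table holds only four hand-picked entries, and normalization is ignored; a real implementation would generate the full table from the Unicode data files.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// fullFold is a tiny hand-picked excerpt of the one-to-many case foldings.
// Illustrative only; the real table is much larger.
var fullFold = map[rune]string{
	'\u00DF': "ss", // ß LATIN SMALL LETTER SHARP S
	'\u1E9E': "ss", // ẞ LATIN CAPITAL LETTER SHARP S
	'\uFB00': "ff", // ﬀ LATIN SMALL LIGATURE FF
	'\uFB01': "fi", // ﬁ LATIN SMALL LIGATURE FI
}

// canon picks one representative (the smallest rune) from r's simple
// case-folding orbit, so that all case variants map to the same rune.
func canon(r rune) rune {
	lo := r
	for f := unicode.SimpleFold(r); f != r; f = unicode.SimpleFold(f) {
		if f < lo {
			lo = f
		}
	}
	return lo
}

// foldFull expands the one-to-many cases first, then canonicalizes each rune.
func foldFull(s string) string {
	var b strings.Builder
	for _, r := range s {
		if exp, ok := fullFold[r]; ok {
			for _, e := range exp {
				b.WriteRune(canon(e))
			}
			continue
		}
		b.WriteRune(canon(r))
	}
	return b.String()
}

// equalFoldFull compares whole folded strings rather than rune pairs.
func equalFoldFull(a, b string) bool {
	return foldFull(a) == foldFull(b)
}

func main() {
	fmt.Println(equalFoldFull("BUSSE", "Buße")) // true
	fmt.Println(equalFoldFull("Busse", "Buße")) // true
	fmt.Println(equalFoldFull("Busse", "Buse")) // false
}
```

Because every string is reduced to one folded form before comparison, the match is transitive: BUSSE, Busse and Buße all fold to the same string.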

However, the most important thing is what has already been done: the relevant operation in Go works on strings, and not on runes. Given that, the implementation can be fixed whenever someone wants to invest the effort.

Thomas Bushnell, BSG

Jul 6, 2012, 7:49:03 PM
to golang-nuts
Oops, I'm sorry I posted the wrong link. The code which demonstrates the incorrect failure to match ß and SS is at http://play.golang.org/p/ox9jby9-RK.

Sam Freiberg

Jul 6, 2012, 8:06:40 PM
to Thomas Bushnell, BSG, golang-nuts
So in light of this does it make sense to also have a strings.ContainsEqualFold function?
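For illustration only, such a function could be sketched as below. The names simpleFold and containsFold are hypothetical (nothing like them exists in the strings package), and the sketch uses the same simple rune-by-rune folding as strings.EqualFold, so it inherits the ß/SS limitation discussed in this thread.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// simpleFold maps every rune to the smallest rune in its simple
// case-folding orbit, so case variants become identical.
func simpleFold(s string) string {
	return strings.Map(func(r rune) rune {
		lo := r
		for f := unicode.SimpleFold(r); f != r; f = unicode.SimpleFold(f) {
			if f < lo {
				lo = f
			}
		}
		return lo
	}, s)
}

// containsFold reports whether substr occurs in s, ignoring case
// under simple folding only.
func containsFold(s, substr string) bool {
	return strings.Contains(simpleFold(s), simpleFold(substr))
}

func main() {
	fmt.Println(containsFold("Hauptstrasse 5", "STRASSE")) // true
	fmt.Println(containsFold("Hauptstraße 5", "STRASSE"))  // false
}
```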

--
Sam Freiberg
+ samue...@gmail.com
+- http://www.darkaslight.com/

Thomas Bushnell, BSG

Jul 6, 2012, 8:27:35 PM
to Sam Freiberg, golang-nuts
It might, but there isn't currently any case-independent Contains function, so at least we don't have the wrong one. :)

I think that full correct Unicode support for Go is a fine goal, and this might be part of it, but I'm not one to say whether it should be in strings or elsewhere. The most important thing that needs to be there is a good case-folding interface and down-deep correct support for runes, and the mere fact that we have this means we have won 95% of the battle. All the rest is icing.

It should be clear why the current Go implementation does not have full correct Unicode support. It's very difficult to get all the details right, let alone do so in a way which lets the common cases run fast without getting bogged down in the rare details. There have been far more important fish to fry in Go! Please do not interpret my comments as some kind of criticism of their work.

(I don't speak for the official gophers, by the way, if that isn't clear.)

Paul Borman

Jul 6, 2012, 9:08:00 PM
to Thomas Bushnell, BSG, Sam Freiberg, golang-nuts
There are Unicode standards which say precisely how to make canonical strings so they can be properly compared. It is mostly tedious, but there are even tests to make sure it is done right. You need that to do proper comparisons of case-folded strings.

    -Paul

Thomas Bushnell, BSG

Jul 6, 2012, 9:16:07 PM
to Paul Borman, golang-nuts, Sam Freiberg

Yes; I even gave the reference in the OP. :-)

peterGo

Jul 6, 2012, 10:07:59 PM
to golang-nuts
Thomas,

Full Unicode support in Go is a work in progress.

http://unicode.org/

$GOROOT/src/pkg/exp/locale
$GOROOT/src/pkg/exp/norm

Peter

tomwilde

Jul 7, 2012, 4:02:08 AM
to golan...@googlegroups.com
Hi

The ß sign is actually case-neutral. It can be "Ss", "ss", "Sz" or "sz" (never "SS") depending on where it's used.

Peter Kleiweg

Jul 7, 2012, 6:05:57 AM
to golang-nuts
Related: strings.Title(). This is also a language specific thing. For
Dutch, "ijsvrij" should become "IJsvrij". It doesn't work, even with
LC_ALL=nl_NL.utf8. And strings.ToTitleSpecial() doesn't offer a
solution either.

http://play.golang.org/p/wfwtbFnmkF

Peter Kleiweg

Jul 7, 2012, 6:08:14 AM
to golang-nuts
Typo in that one.

Corrected: http://play.golang.org/p/UWfvwFo-vU

DisposaBoy

Jul 7, 2012, 6:22:01 AM
to golan...@googlegroups.com
the function is working as documented. whether its behaviour is useful is another topic.

 

Thomas Bushnell, BSG

Jul 7, 2012, 7:05:36 AM
to tomwilde, golan...@googlegroups.com
When a word with ß is spelled in all upper-case, it is replaced with SS, as in the example I gave. The examples you give are appropriate for titlecase, but that's different from all uppercase.

Thomas Bushnell, BSG

Jul 7, 2012, 7:08:22 AM
to Peter Kleiweg, golang-nuts
Note that Unicode defines rules for language-neutral case folding, case-insensitive equality testing, and so forth. A good Unicode implementation should provide access to the language-neutral algorithms and might also provide suitable language-specific ones. There are of course cases where proper case functions require more semantic information than is available.

Title casing is, as you note, a particularly difficult one.

Peter Kleiweg

Jul 7, 2012, 7:16:06 AM
to golang-nuts
On 7 jul, 12:22, DisposaBoy <disposa...@dby.me> wrote:
> On Saturday, July 7, 2012 11:05:57 AM UTC+1, Peter Kleiweg wrote:
>
> > Related: strings.Title(). This is also a language specific thing. For
> > Dutch, "ijsvrij" should become "IJsvrij". It doesn't work, even with
> > LC_ALL=nl_NL.utf8. And strings.ToTitleSpecial() doesn't offer a
> > solution either.
>
> >http://play.golang.org/p/wfwtbFnmkF
>
> the function is working as documented.

I don't think so.

"Title returns a copy of the string s with all Unicode letters that
begin words mapped to their title case."

It says letter, not character or code point.

In my example, 'ij' is the letter that begins a word, not 'i'.

> whether its behaviour is useful is another topic.

It is the same topic as mapping between 'ß' and 'SS'. The 'SS' is a
single letter, even though it uses two characters.
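To make the Dutch case concrete, here is a rough sketch of a language-specific helper. The name titleWordDutch is hypothetical and the rule is deliberately simplified to "a word-initial ij is one letter"; it is not how any Go package handles this.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// titleWordDutch title-cases a single word, treating a leading "ij"
// (in any mix of cases) as one letter that becomes "IJ".
func titleWordDutch(w string) string {
	if len(w) >= 2 && strings.EqualFold(w[:2], "ij") {
		return "IJ" + w[2:]
	}
	r, size := utf8.DecodeRuneInString(w)
	if size == 0 {
		return w // empty input
	}
	return string(unicode.ToTitle(r)) + w[size:]
}

func main() {
	fmt.Println(titleWordDutch("ijsvrij")) // IJsvrij
	fmt.Println(titleWordDutch("water"))   // Water
}
```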

Thomas Bushnell, BSG

Jul 7, 2012, 7:22:23 AM
to Peter Kleiweg, golang-nuts
Yes; it is best not to use the word "letter" in these contexts, because of cases where a letter is composed of multiple characters. (It is also crucial to understand the difference between a codepoint ["rune" in Go] and a character; multiple codepoints can make up a single character.)

Spanish is another language where multiple characters can make a single letter, in the cases of the letters ch, ll, and rr.

Thomas Bushnell, BSG

Jul 7, 2012, 7:23:29 AM
to tomwilde, golan...@googlegroups.com
I don't know how ß could ever be Ss or Sz, since it cannot occur at the beginning of a word. But it certainly can be SS in a word like Straße, which, when rendered in all caps, is spelled STRASSE.

Sometimes it is SZ, as in Maße.

This is complicated by the creation of the capital ß in recent years, which has had only small use.


tomwilde

Jul 7, 2012, 8:03:05 AM
to golan...@googlegroups.com
You are right, there are no words where ß is at the beginning, but I could make one up:

ßerfindung

Tell me, which of the 4 mentioned cases is used?

The word looks like a noun so it would probably be spelled upper-case. But is it "Ss" or "Sz" or maybe the rule applies as in Dutch and it is either "SS" or "SZ"...

The short answer is that ß is case-neutral and my made up word is written the same upper-case and lower-case.

In Spanish "ll" and "ch" are treated as separate letters when written at the beginning of a sentence:

- Lluvia (rain)
- Charco (pond)

"rr" is never written at the beginning of a word, and therefore never at the start of a sentence either, so it is the same mystery as ß.

Thomas Bushnell, BSG

Jul 7, 2012, 12:45:18 PM
to tomwilde, golan...@googlegroups.com
On Sat, Jul 7, 2012 at 5:03 AM, tomwilde <sedevel...@gmail.com> wrote:
> You are right, there are no words where ß is at the beginning, but I could make one up:
>
> ßerfindung

Well, first off, it isn't German; the rules of German orthography and phonology don't permit it. It is hardly surprising that the rules of German spelling don't address it.
 
> Tell me, which of the 4 mentioned cases is used?
>
> The word looks like a noun so it would probably be spelled upper-case. But is it "Ss" or "Sz" or maybe the rule applies as in Dutch and it is either "SS" or "SZ"...

Since there can be no German words beginning with ß, there are also no German rules for how you capitalize them. Follow the Unicode case-independent rules and be done with it.

> The short answer is that ß is case-neutral and my made up word is written the same upper-case and lower-case

So what? My post was about case-independent equality matching. The point is that ß (one rune) should match SS (two runes) just as s matches S.

> In Spanish "ll" and "ch" are treated as separate letters when written at the beginning of a sentence:
>
> - Lluvia (rain)
> - Charco (pond)

No, they most certainly are not. This is why titlecasing is not the same as uppercasing the first letter.

If you were to uppercase the word "lluvia", you get "LLUVIA"; if you title case it, you get "Lluvia." But the first letter is "ll", not "l". But this does not mean that "ll" is treated as separate letters; it means that the titlecase form of ll is different from the uppercase form of it. The distinction is important, because when collating you must sort words beginning with ll after all the words beginning with l, and not in between hypothetical "lk" and "lm".

Different languages have different rules about collating multi-character letters and combined-form letters; there is no single rule which would work for all languages. Unicode prescribes a language-neutral collating rule, however, which should be used for generic locale-independent functions.

> "rr" is never written at the beginning of a word, and therefore never at the start of a sentence either, so it is the same mystery as ß.

Right, but it certainly does have an uppercase form: RR, used for example in uppercasing perro to get PERRO.

There is really only one lesson here: if you want to handle Unicode correctly, you should not just figure you know enough languages and guess. You should always simply implement the functions where the Unicode standard prescribes exactly what the algorithms should be.

In the actual context here, you are simply wildly wrong when you say that the uppercase form of ß is ß; perhaps you have confused uppercase with titlecase. There is no German titlecase form of ß, but that's irrelevant.