Case conversion not working correctly for Turkish?

Mehmet D. Akın

Mar 29, 2010, 10:44:54 AM
to golan...@googlegroups.com
Hi,

It seems ToLower and ToUpper in the strings package do not work correctly for the Turkish letters "ı" and "i". The correct case conversion for the Turkish locale would be i <-> İ and ı <-> I. Is there a package for locale settings or locale-aware string functions? AFAIK, these case conversion rules are the same for Azeri and some other Turkic languages as well.

I have this simple test application:

package main

import (
	"fmt"
	"strings"
)

func main() {
	// Round-trip the Turkish alphabet through the default (locale-unaware) mappings.
	tr_alphabet := "a,b,c,ç,d,e,f,g,ğ,h,ı,i,j,k,l,m,n,o,ö,p,r,s,ş,t,u,ü,v,y,z,â,î,û"
	tr_upper := strings.ToUpper(tr_alphabet)
	tr_lower := strings.ToLower(tr_upper)
	fmt.Printf("Alphabet      : %s \n", tr_alphabet)
	fmt.Printf("Alphabet Upper: %s \n", tr_upper)
	fmt.Printf("Alphabet Lower: %s \n", tr_lower)
}

The output is:
Alphabet      : a,b,c,ç,d,e,f,g,ğ,h,ı,i,j,k,l,m,n,o,ö,p,r,s,ş,t,u,ü,v,y,z,â,î,û 
Alphabet Upper: A,B,C,Ç,D,E,F,G,Ğ,H,I,I,J,K,L,M,N,O,Ö,P,R,S,Ş,T,U,Ü,V,Y,Z,Â,Î,Û 
Alphabet Lower: a,b,c,ç,d,e,f,g,ğ,h,i,i,j,k,l,m,n,o,ö,p,r,s,ş,t,u,ü,v,y,z,â,î,û 

tr_alphabet and tr_lower should have been the same, but they are not.

Some information about this problem: http://www.i18nguy.com/unicode/turkish-i18n.html

Mehmet.

Rob 'Commander' Pike

Mar 29, 2010, 2:33:51 PM
to Mehmet D. Akın, golan...@googlegroups.com
I am not a Unicode expert, so I may be missing some nuance, but here is my understanding.

The Unicode tables do not respect locale.  (Just ask the Chinese.)  You would need a different set of tables to achieve what you want, since the mapping you're asking for is not that specified by Unicode.

In short, the tables specified by Unicode do not implement the mapping you want. I don't know if there's some official alternate mapping that does.  If so, please advise because there might be a way to make it happen.

Note that Go's Unicode functions are all trivial wrappings of generated tables.  Make new tables and you can use or write a simple function that will implement the mapping you need.
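A minimal sketch of that suggestion, assuming a hand-written override table that covers only the two Turkish pairs. The names turkishUpper, turkishLower, mapWith, ToUpperTR and ToLowerTR are hypothetical and not part of any Go package; runes without an override fall through to the standard unicode tables.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Hypothetical override tables for the Turkish dotted/dotless I pairs only.
var turkishUpper = map[rune]rune{'i': 'İ', 'ı': 'I'}
var turkishLower = map[rune]rune{'İ': 'i', 'I': 'ı'}

// mapWith applies the override table first and the default mapping otherwise.
func mapWith(table map[rune]rune, fallback func(rune) rune, s string) string {
	return strings.Map(func(r rune) rune {
		if m, ok := table[r]; ok {
			return m
		}
		return fallback(r)
	}, s)
}

// ToUpperTR and ToLowerTR are illustrative names, not standard library functions.
func ToUpperTR(s string) string { return mapWith(turkishUpper, unicode.ToUpper, s) }
func ToLowerTR(s string) string { return mapWith(turkishLower, unicode.ToLower, s) }

func main() {
	upper := ToUpperTR("ı,i")
	fmt.Println(upper)            // I,İ
	fmt.Println(ToLowerTR(upper)) // ı,i
}
```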

-rob

Marcin 'Qrczak' Kowalczyk

Mar 29, 2010, 2:48:07 PM
to Rob 'Commander' Pike, Mehmet D. Akın, golan...@googlegroups.com
2010/3/29 Rob 'Commander' Pike <r...@google.com>:

> The Unicode tables do not respect locale.

UnicodeData.txt is not the whole case mapping specified by Unicode.
It only covers context-independent 1-1 mappings. The remaining rules are
described at http://unicode.org/Public/UNIDATA/SpecialCasing.txt and
they include Turkish i.
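
To see what the context-independent 1-1 mappings give for the four letters in question, here is a small sketch using the standard unicode package. The helper name simpleCase is made up for illustration.

```go
package main

import (
	"fmt"
	"unicode"
)

// simpleCase returns the context-independent 1-1 upper and lower mappings
// of a single rune, as implemented by the standard unicode package.
func simpleCase(r rune) (upper, lower rune) {
	return unicode.ToUpper(r), unicode.ToLower(r)
}

func main() {
	for _, r := range "iıIİ" {
		up, lo := simpleCase(r)
		fmt.Printf("%c (U+%04X): upper=%c lower=%c\n", r, r, up, lo)
	}
}
```

Both i and ı uppercase to I, and both I and İ lowercase to i, which is why the round trip in the original post loses the dotless ı.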

--
Marcin Kowalczyk

peterGo

Mar 29, 2010, 2:35:34 PM
to golang-nuts
Mehmet,

Here's a more precise formulation of the problem, in a form suitable
for programming, which uses authoritative sources to be more
persuasive.

The Unicode Standard includes case mapping rules.

Unicode Technical Report #21 - Case Mappings
http://unicode.org/reports/tr21/tr21-3.html

The Unicode Standard 5.2 - Chapter 5 - 5.18 Case Mappings
http://unicode.org/versions/Unicode5.0.0/ch05.pdf

"The case mappings specified by the Unicode Character Database are in
the union of the UnicodeData.txt and SpecialCasing.txt files." TR21-3

The Go unicode package only refers to the rules in the UnicodeData.txt
file; it omits the rules in the SpecialCasing.txt file. Therefore,
the case mappings for Turkish and some other languages are incorrect.

Peter

peterGo

Mar 29, 2010, 3:08:03 PM
to golang-nuts
Mehmet,

Here's an example of the use of a culture qualifier to apply the
Unicode case mapping and sorting rules to Turkish and several other
alphabets.

Custom Case Mappings and Sorting Rules
http://msdn.microsoft.com/en-us/library/xk2wykcz%28VS.100%29.aspx

Peter

Mehmet D. Akın

Mar 29, 2010, 3:17:17 PM
to peterGo, golang-nuts
Hi,

Thanks. So what is the solution for getting correct ToLower, ToUpper and ToTitle for Turkish in Go?

A new package with ToUpperTr, ToLowerTr, etc.? That would not work, because other applications using the standard strings package would still break on Turkish text. Or implementing the rules from SpecialCasing.txt in the standard strings package? I am not sure, but this might introduce some complexity, like toLowerCase(Locale l) in Java's String implementation (http://java.sun.com/j2se/1.5.0/docs/api/java/lang/String.html#toLowerCase(java.util.Locale)).

Mehmet

Rob 'Commander' Pike

Mar 29, 2010, 5:17:44 PM
to Mehmet D. Akın, peterGo, golang-nuts
We've come up with a reasonable design to add variant case mapping tables to the interface. It will take a few days to make the changes.

Having code understand when to use the Turkish mapping vs. other languages' mappings will take another round. In general, the program must know the language before it can get the right answer.

Let's start with the simpler step of getting all the mapping tables installed. That will at least make it possible for code that knows the language at hand to get the right answer locally.
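
For readers on a current Go release: the variant tables are exposed in the standard library as unicode.SpecialCase values such as unicode.TurkishCase, used through strings.ToUpperSpecial and strings.ToLowerSpecial. A minimal sketch of code that already knows the text is Turkish follows; the helper name upperLowerTR is made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// upperLowerTR round-trips s through the Turkish variant tables.
func upperLowerTR(s string) (upper, lower string) {
	upper = strings.ToUpperSpecial(unicode.TurkishCase, s)
	lower = strings.ToLowerSpecial(unicode.TurkishCase, upper)
	return upper, lower
}

func main() {
	upper, lower := upperLowerTR("ı,i")
	fmt.Println(upper) // I,İ
	fmt.Println(lower) // ı,i
}
```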

-rob

Mehmet D. Akın

Mar 29, 2010, 5:21:06 PM
to Rob 'Commander' Pike, peterGo, golang-nuts
Thanks, Rob. So it will be fixed eventually; good news. Should I open a bug to track this?

Rob 'Commander' Pike

Mar 29, 2010, 5:26:17 PM
to Mehmet D. Akın, peterGo, golang-nuts

On Mar 29, 2010, at 2:21 PM, Mehmet D. Akın wrote:

> Thanks Rob, So it will be fixed eventually. Good news. Should I open
> a bug to track this?

Yes please.

-rob


Mehmet D. Akın

Mar 30, 2010, 5:32:09 AM
to Rob 'Commander' Pike, peterGo, golang-nuts
2010/3/29 Rob 'Commander' Pike <r...@google.com>:
> Yes please.

I created the issue. I also added a note that the correct Turkish case conversion breaks the ASCII -> ASCII property of case mapping (i -> İ and I -> ı), so there should be a way to opt out of it as well.

http://code.google.com/p/go/issues/detail?id=703
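
To illustrate the ASCII concern in the note above: a sketch that compares a protocol keyword after uppercasing with the default tables and with the Turkish tables. It assumes the unicode.TurkishCase table and strings.ToUpperSpecial from current Go releases; the function names matchesDefault and matchesTurkish are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// matchesDefault uppercases input with the locale-independent tables.
func matchesDefault(input, keyword string) bool {
	return strings.ToUpper(input) == keyword
}

// matchesTurkish uppercases input with the Turkish tables, so an ASCII 'i'
// becomes the non-ASCII 'İ' and ASCII keywords stop matching.
func matchesTurkish(input, keyword string) bool {
	return strings.ToUpperSpecial(unicode.TurkishCase, input) == keyword
}

func main() {
	fmt.Println(matchesDefault("quit", "QUIT")) // true
	fmt.Println(matchesTurkish("quit", "QUIT")) // false: the result is QUİT
}
```

Keeping the default mapping for identifiers and keywords, and applying the Turkish tables only to text known to be Turkish, is one way to get the opt-out.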

Mehmet

Johann Höchtl

Mar 31, 2010, 12:35:54 PM
to golang-nuts

On Mar 29, 8:33 pm, "Rob 'Commander' Pike" <r...@google.com> wrote:
> I am not a Unicode expert, so I may be missing some nuance, but here  
> is my understanding.
>

The "Turkish test" has gained some popularity. To see what problems people
from Turkey face, take a look at this post:
http://www.moserware.com/2008/02/does-your-code-pass-turkey-test.html

alco

Jul 7, 2013, 6:02:48 AM
to golan...@googlegroups.com
I'd like to revive this thread because there is clearly more work to be done in this space.

I recently stumbled upon this program: http://play.golang.org/p/3q4t_rDeif. If you're not familiar with ß, it doesn't have an established uppercase variant and is transformed into SS by the full case mapping (as per Unicode).

Usable information on Unicode is notoriously hard to find, but I trust ICU and I've found this on their site -- http://userguide.icu-project.org/transforms/casemappings. It has the following paragraph:

A character is considered to have a lowercase, uppercase, or title case equivalent if there is a respective "simple" case mapping specified for the character in the Unicode Character Database (UnicodeData.txt). If a character has no mapping equivalent, the result is the character itself.

So now the strings.ToUpper() behavior makes sense -- Go simply maps each character individually based on the UnicodeData.txt file, as has been mentioned in this thread.
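
A small sketch of the difference, assuming a hand-written expansion for this one character only. The helper upperBoth is hypothetical; it is not how ICU or Go implement full case mapping.

```go
package main

import (
	"fmt"
	"strings"
)

// upperBoth returns the simple per-rune uppercase form and a form where ß
// has first been expanded to SS, approximating the full mapping for ß only.
func upperBoth(s string) (simple, full string) {
	simple = strings.ToUpper(s)
	full = strings.ToUpper(strings.ReplaceAll(s, "ß", "SS"))
	return simple, full
}

func main() {
	simple, full := upperBoth("straße")
	fmt.Println(simple) // STRAßE: ß has no simple uppercase mapping
	fmt.Println(full)   // STRASSE: one rune became two
}
```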

If you read the page at the above link a bit further, you'll see the following:

ICU implements full Unicode string case mappings. In general,
  • case mapping can change the number of code points and/or code units of a string, 
  • is language-sensitive (results may differ depending on language), and
  • is context-sensitive (a character in the input string may map differently depending on surrounding characters).

So, the present list thread prompted Go to obtain a SpecialCase type allowing users to select a mapping for their local language. But what about the general case?

Imagine a theoretical example: a server that accepts unicode documents and converts them to uppercase to send to some archive. Can you write such a server in Go that would accept documents in any language? You could probably do that, but the clients would need to specify the language their document is in, so that the server could select the proper SpecialCase type to process the document with.
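
A sketch of that server-side selection, assuming clients send a language tag with each document. The registry specialCases and the function upperFor are hypothetical; only the two tables named here, unicode.TurkishCase and unicode.AzeriCase, come from the standard unicode package.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// specialCases is a hypothetical registry keyed by a client-supplied
// language tag. It lists only tables that ship in package unicode.
var specialCases = map[string]unicode.SpecialCase{
	"tr": unicode.TurkishCase,
	"az": unicode.AzeriCase,
}

// upperFor uppercases doc with the table registered for lang, falling back
// to the default mapping for any other language.
func upperFor(lang, doc string) string {
	if sc, ok := specialCases[lang]; ok {
		return strings.ToUpperSpecial(sc, doc)
	}
	return strings.ToUpper(doc)
}

func main() {
	fmt.Println(upperFor("tr", "izmir")) // İZMİR
	fmt.Println(upperFor("en", "izmir")) // IZMIR
}
```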

What about documents in mixed languages? That is not currently possible in Go out of the box. The SpecialCase type is just a slice[1], so you could fill it with both Turkish and other mappings (once they're added), but that is not scalable: when a Go update adds new mappings, you'd need to change your program's code to start using them.
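
Because SpecialCase is a slice of unicode.CaseRange, a program can assemble its own table. The sketch below builds a hypothetical one-entry table, dottedOnly, by hand. It assumes the Delta field accepts a [unicode.MaxCase]rune array indexed by UpperCase, LowerCase and TitleCase, and that entries are kept sorted by Lo.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// dottedOnly is a hypothetical hand-built table with a single entry:
// 'i' uppercases and titlecases to 'İ' (U+0130). The Delta array is indexed
// by unicode.UpperCase, unicode.LowerCase and unicode.TitleCase.
// Entries are kept sorted by Lo.
var dottedOnly = unicode.SpecialCase{
	unicode.CaseRange{
		Lo:    'i',
		Hi:    'i',
		Delta: [unicode.MaxCase]rune{'İ' - 'i', 0, 'İ' - 'i'},
	},
}

// upperDotted uppercases s with the custom table; runes outside the table
// use the default mapping.
func upperDotted(s string) string {
	return strings.ToUpperSpecial(dottedOnly, s)
}

func main() {
	fmt.Println(upperDotted("pi")) // Pİ
}
```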

I have two questions:
* is Go interested in implementing the full Unicode spec or should users rely on libraries like ICU with a Go wrapper?
* is Go interested in adopting a more scalable API to provide context- and locale-independent behavior when working with arbitrary unicode strings without too much effort on the user's part?

Thanks.

[1]: http://golang.org/src/pkg/unicode/letter.go?s=2386:2414#L52

peterGo

Jul 7, 2013, 8:25:47 AM
to golan...@googlegroups.com
alco,

The Go Programming Language
FAQ
What is the status of the project?
http://golang.org/doc/faq#What_is_the_status_of_the_project

"Of course, development will continue on Go itself, but the focus will be on performance, reliability, portability and the addition of new functionality such as improved support for internationalization."


The Go Programming Language
References
http://golang.org/ref/

Sub-repositories

These packages are part of the Go Project but outside the main Go tree. They are developed under looser compatibility requirements than the Go core. Install them with "go get".

    code.google.com/p/go.text [docs]

See the documents page for more documentation.

https://code.google.com/p/go/source/browse/?repo=text
http://godoc.org/code.google.com/p/go.text


The Go Programming Language
The Go Project
Contributing code
http://golang.org/project/

Go is an open source project and we welcome contributions from the community.

To get started, read these contribution guidelines for information on design, testing, and our code review process.

The Go Programming Language
Contribution Guidelines
http://golang.org/doc/contribute.html


Peter

Nigel Tao

Jul 7, 2013, 9:01:07 AM
to alco, Marcel van Lohuizen, golang-nuts
On Sun, Jul 7, 2013 at 8:02 PM, alco <alcos...@gmail.com> wrote:
> * is Go interested in implementing the full Unicode spec or should users
> rely on libraries like ICU with a Go wrapper?
> * is Go interested in adopting a more scalable API to provide context- and
> locale-independent behavior when working with arbitrary unicode strings
> without too much effort on the user's part?

mp...@golang.org is working on i18n for Go. A design document is at
https://docs.google.com/document/d/1Q64ktYh7XptpEI3L2G7xYqsusohOeLUft865Zd7fbGU/edit
and code will live in the go.text sub-repository. Watch the golang-dev
mailing list if you're interested.

Alexei Sholik

Jul 7, 2013, 10:26:14 AM
to golang-nuts
Thanks for the info!
--
Best regards
Alexei Sholik