In my opinion, Ruby is practically useless for many applications without proper Unicode support. How a modern language can ignore this issue is really beyond me.
Is there a plan to get Unicode support into the language anytime soon?
In message "Re: Unicode roadmap?" on Wed, 14 Jun 2006 06:13:03 +0900, Roman Hausner <roman.haus...@gmail.com> writes: |In my opinion, Ruby is practically useless for many applications without |proper Unicode support. How a modern language can ignore this issue is |really beyond me.
Define "proper Unicode support" first.
|Is there a plan to get Unicode support into the language anytime soon?
I'm planning to enhance Unicode support in 1.9 in a year or so (finally). But I'm not sure that conforms to your definition of "proper Unicode support". Note that 1.8 handles Unicode (UTF-8) if your string operations are based on Regexp.
> having an unicode-equivalent for all methods of class String
> like size, slice, upcase
> E.g. I tried the unicode plugin... but, alas, who wants to write stuff like 'normalize_KC' etc. if you just want the frickin' substring of a string?!
def substring(str, start, len)
  md = str.match(/\A.{#{start}}(.{#{len}})/)
  md[1]
end
def strlength(str)
  n = 0
  str.gsub(/./m) { n += 1; $& }
  n
end
See! Regexps do everything!
Just so you know: set $KCODE, use these methods, and you are set!
> you need to read books on unicode just to properly use the plugin...
> aargg :-((
> Best regards > Peter
> (I am kidding... btw)
From: Pete [mailto:pe...@gmx.org] Sent: Wednesday, June 14, 2006 1:58 AM
> As I am German the 'missing' unicode support is one of the greatest > obstacles for me (and probably all other Germans doing their stuff > seriously)...
The same goes for Russians/Ukrainians. In our programming communities the question "does the programming language support Unicode as 'native'?" has very high priority.
BTW, this is one of the areas where Python beats Ruby completely.
I suspect the Japanese posters on this list can answer better than I can, but my impression is that Unicode is, shall we say, not highly thought of outside Europe and North America. The way they dealt with "Chinese" characters was apparently more than a bit of a hack, and just doesn't work very well in the real world. Reading some of the explanations for glyphs versus characters in Unicode just makes you shake your head. What were they thinking? Sure doesn't pass the smell test, although I'll be the first to admit I haven't exactly thought deeply about the subject.
There's another problem with Japanese - I've got a friend who's been dealing with some issues around the fact that Japanese apparently innovates new characters on a regular basis, and everyone is expected to use the new characters. (I believe this is called gaiji). The concept of a fixed character set apparently just isn't a good idea to start with.
[Awaiting corrections from people who actually know something about this topic :-)...]
There is a good summary of the Han unification controversy on Wikipedia.
> There's another problem with Japanese - I've got a friend who's > been dealing > with some issues around the fact that Japanese apparently innovates > new > characters on a regular basis, and everyone is expected to use the new > characters. (I believe this is called gaiji). The concept of a fixed > character set apparently just isn't a good idea to start with.
> [Awaiting corrections from people who actually know something about > this > topic :-)...]
I have one Japanese person here who's never heard of this gaiji concept. But it could be new and behind a generation gap of some kind. They do sure like to add symbols where they can, though. Especially graphical star characters. I see that a lot. -Mat
In message "Re: Unicode roadmap?" on Wed, 14 Jun 2006 08:11:49 +0900, "Victor Shepelev" <vshepe...@imho.com.ua> writes:
|From: Pete [mailto:pe...@gmx.org] |Sent: Wednesday, June 14, 2006 1:58 AM |> As I am German the 'missing' unicode support is one of the greatest |> obstacles for me (and probably all other Germans doing their stuff |> seriously)... | |The same is for Russians/Ukrainians. In our programming communities question |"does the programming language supports Unicode as 'native'?" has very high |priority.
Alright, then what specific features are you (both) missing? I don't think it is a method to get number of characters in a string. It can't be THAT crucial. I do want to cover "your missing features" in the future M17N support in Ruby.
I suppose all we (non-English-writers) need is to have all string-related methods working. For now, I'm thinking of simply testing each string method; also, some other classes can be affected by Unicode (possibly regexps and paths). Regexps seem to work fine (in my 1.9), but paths do not: File.open with Russian letters in the path doesn't find the file.
More generally, it could make sense to have Unicode as the "base" mode, with non-Unicode remaining as an "old, compatibility" mode.
In message "Re: Unicode roadmap?" on Wed, 14 Jun 2006 14:26:30 +0900, "Victor Shepelev" <vshepe...@imho.com.ua> writes:
|I suppose, all we (non-English-writers) need is to have all string-related |methods working. Just for now, I think about plain testing each string |method;
In that sense, _I_ am one of the non-English-writers, so that I can suppose I know what we need. And I have no problem with the current UTF-8 support. Maybe that's because Japanese don't have cases in our characters. Or maybe I'm missing something. Can you show us your concrete problems caused by Ruby's lack of "proper" Unicode support?
|also, some other classes can be affected by Unicode (possibly |regexps, and pathes). Regexps seems to work fine (in my 1.9), but pathes are |not: File.open with Russian letters in path don't finds the file.
Strange. Ruby does not convert encodings, so there should be no problem opening files if you are using strings in the encoding your OS expects. If they differ, you have to specify (and convert) them properly, regardless of how good the Unicode support is.
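A minimal sketch of the "convert them properly" step, for readers on a current Ruby. It assumes Ruby 1.9 or later, where String#encode exists; on the 1.8 series discussed in this thread the Iconv library played that role. The file name and the choice of Windows-1251 as the OS encoding are illustrative assumptions, not something taken from Victor's setup.

```ruby
# Convert a UTF-8 file name into the single-byte Windows-1251 encoding
# that a Russian-locale Windows system might expect (assumed here).
utf8_name = "Cyrillic_Я.txt"

cp1251_name = utf8_name.encode("Windows-1251")

# "Я" takes two bytes in UTF-8 but only one in Windows-1251,
# so the two byte sizes differ by one.
puts utf8_name.bytesize
puts cp1251_name.bytesize

# Round-trip back to UTF-8 to confirm the conversion lost nothing.
round_trip = cp1251_name.encode("UTF-8")
puts round_trip == utf8_name
```

The point is only that the same visible name is a different byte sequence in each encoding, so the bytes handed to File.open have to match what the OS stores.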
From: Yukihiro Matsumoto [mailto:m...@ruby-lang.org] Sent: Wednesday, June 14, 2006 9:35 AM
> Hi,
> In message "Re: Unicode roadmap?" > on Wed, 14 Jun 2006 14:26:30 +0900, "Victor Shepelev" > <vshepe...@imho.com.ua> writes:
> |I suppose, all we (non-English-writers) need is to have all string- > related > |methods working. Just for now, I think about plain testing each string > |method;
> In that sense, _I_ am one of the non-English-writers,
Sorry, Matz, I know, of course. But I know too little about Japanese to see how close our tasks are. By "non-English-writers" I perhaps should have said "European languages" or so, which have common punctuation, LTR writing, "words" and "whitespace" and so on. I have almost no knowledge of what Japanese, Korean, Arabic, or Hebrew people need.
> so that I can > suppose I know what we need. And I have no problem with the current > UTF-8 support. Maybe that's because Japanese don't have cases in our > characters. Or maybe I'm missing something.
Just what I've said above.
> Can you show us your > concrete problems caused by Ruby's lack of "proper" Unicode support?
As mentioned in this topic, it's String#length, upcase, downcase, capitalize.
BTW, does String#length work well for you?
Moreover, there seem to be some huge problems with paths containing Russian letters, but I'm really not convinced that Ruby has to handle this.
> |also, some other classes can be affected by Unicode (possibly > |regexps, and pathes). Regexps seems to work fine (in my 1.9), but pathes > are > |not: File.open with Russian letters in path don't finds the file.
> Strange. Ruby does not convert encoding, so that there should be no > problem opening files, if you are using strings in the encoding your OS > expect. If they are differ, you have to specify (and convert) them > properly, no matter how Unicode support is.
Oh, that's a rather hard topic for me. I know Windows XP must support Unicode file names; I see my filenames in Russian, but I know too little about system internals to say whether they are really Unicode.
Leaving those problems aside, only the String problems remain, but these are such basic core methods!
On Jun 14, 2006, at 15:56 , Victor Shepelev wrote:
> As mentioned in this topic, it's String#length, upcase, downcase, > capitalize.
Just to chime in, aren't upcase, downcase, and capitalize a locale/ localization issue rather than a Unicode-only issue per se? For example, different languages will have different rules for capitalization. Or am I wrong? Does Unicode in and of itself address these issues?
Granted, proper support for upcase, downcase, and capitalize is important, but I think it's a separate issue, part of m17n as a whole rather than support for Unicode in particular.
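To make the locale point concrete, here is a small sketch. It assumes Ruby 2.4 or later, which is much newer than anything in this thread; from that version String#upcase applies full Unicode case mapping and accepts a :turkic option. The sample words are illustrative.

```ruby
# Case mapping depends on language rules, not only on the character set.
german = "straße"
german_upper = german.upcase          # full Unicode mapping expands "ß"
puts german_upper

letter = "i"
default_upper = letter.upcase         # language-neutral mapping
turkic_upper  = letter.upcase(:turkic) # Turkish/Azeri rule: dotted capital I
puts default_upper
puts turkic_upper
```

The same input letter gets two different uppercase forms depending on the language rule chosen, which is why case conversion is arguably an m17n question rather than a pure encoding question.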
> As mentioned in this topic, it's String#length, upcase, downcase, > capitalize.
> BTW, does String#length works good for you?
To get the length of a Unicode string, just do str.split(//).length, or "require 'jcode'" at the beginning of your code. For the other functions, try looking at the unicode library http://www.yoshidam.net/Ruby.html#unicode
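A short sketch of the split(//) idea. It assumes a UTF-8 string and a current Ruby (1.9 or later), where regexps follow the string's encoding; on 1.8 the same calls would need $KCODE = 'u' set first, as discussed in this thread. The variable names are illustrative.

```ruby
# Counting characters versus bytes in a UTF-8 string.
str = "Привет"   # six Cyrillic letters, two bytes each in UTF-8

byte_count = str.bytesize            # raw storage size in bytes
char_count = str.split(//).length    # an empty regexp splits between characters
scan_count = str.scan(/./m).length   # same idea with an explicit "any character" regexp

puts "bytes: #{byte_count}, chars: #{char_count}, scanned: #{scan_count}"
```

Both regexp-based counts agree with each other and differ from the byte count, which is the gap the posters are complaining about for String#length on 1.8.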
> > |also, some other classes can be affected by Unicode (possibly > > |regexps, and pathes). Regexps seems to work fine (in my 1.9), but pathes > > are > > |not: File.open with Russian letters in path don't finds the file.
> > Strange. Ruby does not convert encoding, so that there should be no > > problem opening files, if you are using strings in the encoding your OS > > expect. If they are differ, you have to specify (and convert) them > > properly, no matter how Unicode support is.
> Oh, it's a bit hard theme for me. I know Windows XP must support Unicode > file names; I see my filenames in Russian, but I have low knowledge of > system internals to say, are they really Unicode?
Windows XP does support Unicode file names, but I'm not sure you can use them with Ruby (I do not use Ruby much under Windows). Try converting the file names to your current locale, it should work if the file names can be converted to it. What I mean is that Russian file names encoded in the Windows Russian encoding should work on a Russian PC.
In message "Re: Unicode roadmap?" on Wed, 14 Jun 2006 15:56:02 +0900, "Victor Shepelev" <vshepe...@imho.com.ua> writes:
|> Can you show us your |> concrete problems caused by Ruby's lack of "proper" Unicode support? | |As mentioned in this topic, it's String#length, upcase, downcase, |capitalize.
OK. Case is the problem. I understand.
|BTW, does String#length works good for you?
I don't remember the last time I needed the length method to count characters. Actually, I don't count string length at all, either in bytes or in characters, in my string processing. Maybe this is a special case. I am too optimized for Ruby string operations using Regexp.
|Oh, it's a bit hard theme for me. I know Windows XP must support Unicode |file names; I see my filenames in Russian, but I have low knowledge of |system internals to say, are they really Unicode?
Windows 32 path encoding is a nightmare. Our Win32 maintainers are often troubled by unexpected OS behavior. I am sure we _can_ handle Russian path names, but we need help from Russian people to improve.
From: Vincent Isambart [mailto:vincent.isamb...@gmail.com] Sent: Wednesday, June 14, 2006 10:14 AM
> > As mentioned in this topic, it's String#length, upcase, downcase, > > capitalize.
> > BTW, does String#length works good for you?
> To have the length of a Unicode string, just do str.split(//).length, > or "require 'jcode'" at the beginning of your code. > For the other functions, try looking at the unicode library > http://www.yoshidam.net/Ruby.html#unicode
I know about it. But, theoretically speaking, such "core" methods must be in core. No?
> > > |also, some other classes can be affected by Unicode (possibly > > > |regexps, and pathes). Regexps seems to work fine (in my 1.9), but > pathes > > > are > > > |not: File.open with Russian letters in path don't finds the file.
> > > Strange. Ruby does not convert encoding, so that there should be no > > > problem opening files, if you are using strings in the encoding your > OS > > > expect. If they are differ, you have to specify (and convert) them > > > properly, no matter how Unicode support is.
> > Oh, it's a bit hard theme for me. I know Windows XP must support Unicode > > file names; I see my filenames in Russian, but I have low knowledge of > > system internals to say, are they really Unicode?
> Windows XP does support Unicode file names, but I'm not sure you can > use them with Ruby (I do not use Ruby much under Windows). Try > converting the file names to your current locale, it should work if > the file names can be converted to it. What I mean is that Russian > file names encoded in the Windows Russian encoding should work on a > Russian PC.
Yes, they work. But I can't settle the question: does Ruby's Unicode support need to include filename operations?
From: Michael Glaesemann [mailto:g...@seespotcode.net] Sent: Wednesday, June 14, 2006 10:08 AM
> On Jun 14, 2006, at 15:56 , Victor Shepelev wrote:
> > As mentioned in this topic, it's String#length, upcase, downcase, > > capitalize.
> Just to chime in, aren't upcase, downcase, and capitalize a locale/ > localization issue rather than a Unicode-only issue per se? For > example, different languages will have different rules for > capitalization.
Really? I know of two cases: European capitalization and no capitalization.
But, really, you may be right. I suppose Florian Gross can say something about German-specific capitalization issues.
> Granted, proper support for upcase, downcase, and capitalize is > important, but I think it's a separate issue, part of m17n as a whole > rather than support for Unicode in particular.
Maybe. Generally, sometimes I want Unicode, and sometimes (for "quick and dirty" scripts) I'd prefer that capitalization and regexps "just work" with Windows-1251 (the one-byte Russian encoding).
> In message "Re: Unicode roadmap?" > on Wed, 14 Jun 2006 15:56:02 +0900, "Victor Shepelev" > <vshepe...@imho.com.ua> writes:
> |> Can you show us your > |> concrete problems caused by Ruby's lack of "proper" Unicode support? > | > |As mentioned in this topic, it's String#length, upcase, downcase, > |capitalize.
> OK. Case is the problem. I understand.
> |BTW, does String#length works good for you?
> I don't remember the last time I needed length method to count > character numbers. Actually I don't count string length at all both > in bytes and characters in my string processing. Maybe this is a > special case. I am too optimized for Ruby string operations using > Regexp.
I can confirm. But I'm afraid that some libraries I rely on use #length and can break when #length doesn't work.
> |Oh, it's a bit hard theme for me. I know Windows XP must support Unicode > |file names; I see my filenames in Russian, but I have low knowledge of > |system internals to say, are they really Unicode?
> Windows 32 path encoding is a nightmare. Our Win32 maintainers often > troubled by unexpected OS behavior. I am sure we _can_ handle Russian > path names, but we need help from Russian people to improve.
With the Russian encoding (Win-1251) on a Russian PC everything works well. With Unicode it doesn't, but I'm not convinced it must.
In any case, I'm ready to spend my time helping Ruby community (especially in Russian/Ukrainian localization issues), because I really love the language.
> In message "Re: Unicode roadmap?" > on Wed, 14 Jun 2006 06:13:03 +0900, Roman Hausner <roman.haus...@gmail.com> writes: > |In my opinion, Ruby is practically useless for many applications without > |proper Unicode support. How a modern language can ignore this issue is > |really beyond me.
> Define "proper Unicode support" first.
I won't define "proper Unicode support" here.
But there must be a problem somewhere, since pure-Ruby Ferret doesn't support UTF-8. You need to use the C extension of Ferret to have it support UTF-8 (which doesn't work on Windows yet :( ). I don't know if that is just a sucky impl of Ferret or if it's Ruby that makes it so.
Maybe Dave Balmain can enlighten us why UTF-8 doesn't work in the pure Ruby version and what is needed of Ruby to make it work (if it's actually Ruby's fault that is)?
My personal belief is that it should just work in a case like this, if the data in is UTF-8 and the search strings are UTF-8, without the lib author and/or user having to do anything very special to make it work (apart from specifying the encoding). Am I wrong in this?
Almost all typical tasks on Unicode can be handled with UTF-8 support in Regexp, Iconv, jcode and $KCODE=u, and the unicode[1] library (as in unicode_hack[2]) :) (but case-insensitive regexps don't work for non-ASCII chars in Ruby 1.8; that can probably be solved using the latest Oniguruma).
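For comparison, a sketch of the case-insensitive matching that is reported as broken on 1.8. It assumes a current Ruby (2.4 or later for the downcase line), where the regexp engine applies Unicode case folding to UTF-8 strings; it says nothing about how 1.8 with a newer Oniguruma would behave. The sample words are illustrative.

```ruby
# Case-insensitive matching on non-ASCII text.
word    = "ПРИВЕТ"     # upper-case Cyrillic
pattern = /привет/i    # lower-case pattern with the ignore-case flag

match_index = (word =~ pattern)   # index of the match, or nil if none
puts match_index.inspect

# String#downcase handles non-ASCII letters from Ruby 2.4 onward.
lowered = word.downcase
puts lowered
```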
But if you're looking for a deeper level of "Unicode support", e.g. as described in the Unicode FAQ[3], those problems aren't about handling Unicode per se, but are rather L10N/I18N problems, such as locale-dependent text breaks, collation, formatting etc. To deal with them from Ruby, take a look at the somewhat broken wrappers to the ICU library icu4r[4], g11n[5] and Ruby/CLDR[6].
And if you want Unicode as default String encoding and want to use national chars in names for your vars/functions/classes in Ruby code, I believe, it will never happen. :)
On Jun 13, 2006, at 10:26 PM, Victor Shepelev wrote:
> Regexps seems to work fine (in my 1.9), but pathes are > not: File.open with Russian letters in path don't finds the file.
On OS X multibyte filenames work:
$ cat x.rb
$KCODE = 'u'

puts File.read('Cyrillic_Я.txt')
$ cat Cyrillic_\320\257.txt
test file with Я!
$ ruby x.rb
test file with Я!
$ uname -a
Darwin kaa.jijo.segment7.net 8.6.0 Darwin Kernel Version 8.6.0: Tue Mar 7 16:58:48 PST 2006; root:xnu-792.6.70.obj~1/RELEASE_PPC Power Macintosh powerpc
$ ruby -v
ruby 1.8.4 (2006-05-18) [powerpc-darwin8.6.0]
$
-- Eric Hodel - drbr...@segment7.net - http://blog.segment7.net This implementation is HODEL-HASH-9600 compliant
From: Dmitry Severin [mailto:dmitry.seve...@gmail.com] Sent: Wednesday, June 14, 2006 11:20 AM
> To: ruby-talk ML > Subject: Re: Unicode roadmap?
> Almost all typical tasks on Unicode can be handled with UTF8 support in > Regexp, Iconv, jcode and $KCODE=u, and unicode[1] library (as in > unicode_hack[2]) :) > (but case-insensitive regexp don't work for non ASCII chars in Ruby 1.8, > that can be probably solved using latest Oniguruma).
> But if you're looking for deeper level of "Unicode support", e.g. as > described in Unicode FAQ[3], those problems aren't about handling Unicode > per se, but are rather L10N/I18N problems, such as locale dependent text > breaks,collation, formatting etc. > To deal with them from Ruby take look at somewhat broken wrappers to ICU > library icu4r[4], g11n[5] and Ruby/CLDR[6].
Thanks Dmitry!
> And if you want Unicode as default String encoding and want to use > national > chars in names for your vars/functions/classes in Ruby code, I believe, it > will never happen. :)
Hmmm... I thought Unicode IS the default String encoding when $KCODE=u. No?
On 6/14/06, Victor Shepelev <vshepe...@imho.com.ua> wrote:
> Hmmm... I thought Unicode IS the default String encoding when $KCODE=u. No?
No. Current String implementation has no notion of "encoding" (Ruby String is just a sequence of bytes) and $KCODE is just a hint for methods to change their behaviour (e.g. in Regexp) and treat those bytes as text represented in some encoding.
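The "just a sequence of bytes" view can be sketched on a current Ruby as follows. This is an emulation under stated assumptions: Ruby 1.9 and later attach an encoding to every String, so the example strips it with force_encoding("BINARY") to approximate how 1.8 saw the data. The variable names are illustrative.

```ruby
# The same bytes, viewed with and without an encoding attached.
text = "Я"                                    # one character, UTF-8 source

as_bytes = text.dup.force_encoding("BINARY")  # no encoding: plain bytes

byte_length = as_bytes.length   # counts bytes
char_length = text.length       # counts characters

# Print each byte in octal, the notation a shell uses for such file names.
octal = as_bytes.bytes.map { |b| "%03o" % b }.join(" ")
puts "#{byte_length} bytes (#{octal}), #{char_length} character"
```

Nothing about the stored data changes between the two views; only the interpretation does, which is the role $KCODE plays as a hint on 1.8.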