[Rails-core] mb_chars.upcase and Ruby 1.9.2


Rodrigo Rosenfeld Rosas

May 7, 2010, 11:00:10 PM
to rubyonrails-core
I'm testing ruby-head through rvm but can't get 'ação'.mb_chars.upcase
== 'AÇÃO'... I get 'AçãO' instead...

This happens both for Rails 2.3.5 and Rails 3 beta 3...

How can I get upcase to work correctly?

Thanks in advance,

Rodrigo.

--
You received this message because you are subscribed to the Google Groups "Ruby on Rails: Core" group.
To post to this group, send email to rubyonra...@googlegroups.com.
To unsubscribe from this group, send email to rubyonrails-co...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/rubyonrails-core?hl=en.

Michael Koziarski

May 7, 2010, 11:04:23 PM
to rubyonra...@googlegroups.com
In 1.9, mb_chars simply returns self. This behaviour comes
straight from Ruby core:

http://github.com/rails/rails/blob/master/activesupport/lib/active_support/core_ext/string/multibyte.rb#L53-65
--
Cheers

Koz

Rodrigo Rosenfeld Rosas

May 8, 2010, 12:03:47 AM
to rubyonra...@googlegroups.com
Is there any approach currently used for making the Ruby 1.8/Rails 2.3.5
behavior the same in Ruby 1.9?

This is important for virtually any non-English application... Are there
any plans to integrate some library that achieves the same results
Rails currently supports?

Rodrigo.

Michael Koziarski

May 8, 2010, 1:34:51 AM
to rubyonra...@googlegroups.com
On Sat, May 8, 2010 at 12:03 PM, Rodrigo Rosenfeld Rosas
<rr.r...@gmail.com> wrote:
> Is there any approach currently used for making the Ruby 1.8/Rails 2.3.5
> behavior the same in Ruby 1.9?
>
> This is important for virtually any non-english application... Are there any
> plans for integration some library for achieving the same results as Rails
> currently supports?

My understanding is that Ruby 1.9 is meant to support all these
operations internally; our mb_chars functionality was only ever
intended as a stop-gap until Ruby itself could do native multi-byte
aware string operations. So what you're seeing are bugs in Ruby which
should be fixed there; we probably shouldn't be maintaining a second
multi-byte aware library.


--
Cheers

Koz

Norman Clarke

May 8, 2010, 8:57:51 AM
to rubyonra...@googlegroups.com
On Sat, May 8, 2010 at 02:34, Michael Koziarski <mic...@koziarski.com> wrote:
> On Sat, May 8, 2010 at 12:03 PM, Rodrigo Rosenfeld Rosas
> <rr.r...@gmail.com> wrote:
>> Is there any approach currently used for making the Ruby 1.8/Rails 2.3.5
>> behavior the same in Ruby 1.9?

Not a solution, and perhaps you're already aware of this, but as a
workaround to these issues you can get an instance of
ActiveSupport::Multibyte::Chars and perform the operations you need:

ActiveSupport::Multibyte::Chars.new("café").upcase

This lets you use the same methods that would be used on Ruby 1.8.
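
To illustrate the general idea of wrapping a string in a proxy object, here is a minimal sketch in plain Ruby. It is not ActiveSupport::Multibyte::Chars: ToyChars and its tiny hand-written mapping (only ç, ã and é besides ASCII) are hypothetical and only show the shape of the approach.

```ruby
# encoding: utf-8

# Toy stand-in for a character proxy; NOT ActiveSupport::Multibyte::Chars.
class ToyChars
  # Tiny hand-written mapping, for illustration only.
  LOWER = 'a-zçãé'
  UPPER = 'A-ZÇÃÉ'

  def initialize(string)
    @wrapped = string
  end

  # Case mapping is handled by the proxy, not by String#upcase.
  def upcase
    ToyChars.new(@wrapped.tr(LOWER, UPPER))
  end

  def downcase
    ToyChars.new(@wrapped.tr(UPPER, LOWER))
  end

  def to_s
    @wrapped
  end

  # Forward everything else to the wrapped string, rewrapping String results.
  def method_missing(name, *args, &block)
    return super unless @wrapped.respond_to?(name)
    result = @wrapped.public_send(name, *args, &block)
    result.is_a?(String) ? ToyChars.new(result) : result
  end

  def respond_to_missing?(name, include_private = false)
    @wrapped.respond_to?(name, include_private) || super
  end
end

ToyChars.new('café').upcase.to_s  # => "CAFÉ"
```

The caller asks the proxy for the case conversion, so the result no longer depends on what the underlying String#upcase does.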

Regards,

Norman

Rodrigo Rosenfeld Rosas

May 8, 2010, 11:24:55 AM
to rubyonra...@googlegroups.com

On 08-05-2010 02:34, Michael Koziarski wrote:
> On Sat, May 8, 2010 at 12:03 PM, Rodrigo Rosenfeld Rosas
> <rr.r...@gmail.com> wrote:
>
>> Is there any approach currently used for making the Ruby 1.8/Rails 2.3.5
>> behavior the same in Ruby 1.9?
>>
>> This is important for virtually any non-english application... Are there any
>> plans for integration some library for achieving the same results as Rails
>> currently supports?
>>
> My understanding is that ruby 1.9 is meant to support all these
> operations internally, our mb_chars functionality was only ever
> intended as a stop-gap until ruby itself could do native multi-byte
> aware string operations. So what you're seeing are bugs in ruby which
> should be fixed there, we probably shouldn't be maintaining a second
> multi-byte aware library.
>
>
>

Please, take a look at this documentation for String#upcase:

http://ruby-doc.org/ruby-1.9/classes/String.html#M000593

"Returns a copy of str with all lowercase letters replaced with their
uppercase counterparts. The operation is locale insensitive—*only
characters ``a’’ to ``z’’ are affected*. Note: case replacement is
effective only in ASCII region."
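
The rule quoted above ("only characters a to z are affected") can be imitated with String#tr and ASCII ranges. This is only an imitation of the documented behaviour, not how String#upcase is implemented, and ascii_only_upcase is a hypothetical helper name:

```ruby
# encoding: utf-8

# Imitates the documented rule: only the letters a to z are changed.
def ascii_only_upcase(string)
  string.tr('a-z', 'A-Z')
end

ascii_only_upcase('ação')      # => "AçãO"
ascii_only_upcase('aoueäöüé')  # => "AOUEäöüé"
```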

It doesn't seem that Ruby 1.9 will change this behavior, so Rails should
keep using its proxy approach for as long as Ruby doesn't support this itself.

My guess is that mb_chars should be set on Rails initialization with
something like:

    # default on Ruby 1.9: mb_chars is a no-op
    def mb_chars
      self
    end

    # apply the patch only when native upcase is not Unicode-aware
    String.send :include, StringMultiBytePatch unless 'ação'.upcase == 'AÇÃO'

Of course this is not the real code, but a suggestion of an approach...
The StringMultiBytePatch module would override mb_chars to use
ActiveSupport::Multibyte::Chars proxy as noted by Norman Clarke.
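
A runnable sketch of the same feature-detection idea, under stated assumptions: FallbackChars and mb_chars_for are hypothetical names, and the fallback uses a tiny hand-written table where a real one would delegate to ActiveSupport::Multibyte::Chars. The sketch picks the implementation with a plain helper rather than with include, because a module included into String is looked up after String's own methods and so would not replace an mb_chars that String already defines.

```ruby
# encoding: utf-8

# Probe once at load time: does this Ruby upcase non-ASCII letters itself?
NATIVE_UNICODE_UPCASE = ('ação'.upcase == 'AÇÃO')

# Hypothetical fallback proxy with a tiny hand-written mapping.
class FallbackChars
  def initialize(string)
    @wrapped = string
  end

  def upcase
    FallbackChars.new(@wrapped.tr('a-zçã', 'A-ZÇÃ'))
  end

  def to_s
    @wrapped
  end
end

# Hypothetical helper standing in for String#mb_chars:
# returns the string itself when the native upcase is good enough,
# otherwise wraps it in the fallback proxy.
def mb_chars_for(string)
  NATIVE_UNICODE_UPCASE ? string : FallbackChars.new(string)
end

mb_chars_for('ação').upcase.to_s  # => "AÇÃO"
```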

Please, see also this thread from 2008:
http://old.nabble.com/String-upcase-downcase-with-UTF-8-strings-in-Ruby-1.9-td18372062.html

---
|in Ruby 1.9 I get the following behaviour:
|
|>> "aoueäöüé".upcase
|=> "AOUEäöüé"
|>> "AOUEÄÖÜÉ".downcase
|=> "aoueÄÖÜÉ"
|
|I can't however find a bug in the bug tracking system.
|Doesn't this qualify as a bug?

The document for String#upcase says:

call-seq:
str.upcase => new_str

Returns a copy of <i>str</i> with all lowercase letters replaced with their
uppercase counterparts. The operation is locale insensitive---only
characters ``a'' to ``z'' are affected.
Note: case replacement is effective only in ASCII region.

"hEllO".upcase #=> "HELLO"

See "Note:". Tim Bray have persuaded me to do so, since case
conversion outside of ASCII region is highly dependent on country,
language, culture and script.

matz.
---
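
As a small illustration of why case conversion outside ASCII depends on language: Turkish and Azerbaijani uppercase "i" to the dotted "İ", while most other Latin-script languages use "I". The helper below is hypothetical, handles only ASCII plus the two Turkic i letters, and does not reflect how Ruby or ActiveSupport implement anything.

```ruby
# encoding: utf-8

# Language tags treated as Turkic in this sketch (an assumption).
TURKIC_LANGUAGES = %w[tr az].freeze

# Hypothetical helper: upcases ASCII letters, with the Turkic i rule
# applied first when the language tag asks for it.
def upcase_for(string, language)
  if TURKIC_LANGUAGES.include?(language)
    # i -> İ (dotted capital), ı (dotless) -> I
    string.tr('iı', 'İI').tr('a-z', 'A-Z')
  else
    string.tr('a-z', 'A-Z')
  end
end

upcase_for('istanbul', 'tr')  # => "İSTANBUL"
upcase_for('istanbul', 'en')  # => "ISTANBUL"
```

The same input gives two different correct answers, which is why the string alone does not carry enough information.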

So it doesn't seem that Matz considers this a bug, and he probably won't
change this behavior for Ruby 1.9...

So, don't you think we should continue supporting mb_chars as before?

Best regards,

Rodrigo.

Rodrigo Rosenfeld Rosas

May 8, 2010, 11:59:12 AM
to rubyonra...@googlegroups.com
On 08-05-2010 09:57, Norman Clarke wrote:
> On Sat, May 8, 2010 at 02:34, Michael Koziarski<mic...@koziarski.com> wrote:
>
>> On Sat, May 8, 2010 at 12:03 PM, Rodrigo Rosenfeld Rosas
>> <rr.r...@gmail.com> wrote:
>>
>>> Is there any approach currently used for making the Ruby 1.8/Rails 2.3.5
>>> behavior the same in Ruby 1.9?
>>>
> Not a solution, and perhaps you're already aware of this, but as a
> workaround to these issues you can get an instance of
> ActiveSupport::Multibyte::Chars and perform the operations you need:
>
> ActiveSupport::Multibyte::Chars.new("café").upcase
>
> This lets you use the same methods that would be used on Ruby 1.8.
>
> Regards,
>
> Norman

Hi Norman, while this seems to work with Rails 3 beta, it didn't work
with Rails 2.3.5 in my tests...

Any idea why this behavior differs between 2.3.5 and 3?

Thanks,

Rodrigo.

Norman Clarke

May 8, 2010, 1:31:19 PM
to rubyonra...@googlegroups.com
On Sat, May 8, 2010 at 12:24, Rodrigo Rosenfeld Rosas
<rr.r...@gmail.com> wrote:
>
> On 08-05-2010 02:34, Michael Koziarski wrote:
>>
>> On Sat, May 8, 2010 at 12:03 PM, Rodrigo Rosenfeld Rosas
>> <rr.r...@gmail.com>  wrote:
>>
>>>
> Please, take a look at this documentation for String#upcase:
>
> http://ruby-doc.org/ruby-1.9/classes/String.html#M000593
>
> "Returns a copy of str with all lowercase letters replaced with their
> uppercase counterparts. The operation is locale insensitive—*only characters
> ``a’’ to ``z’’ are affected*. Note: case replacement is effective only in
> ASCII region."

> Please, see also this thread from 2008:
> http://old.nabble.com/String-upcase-downcase-with-UTF-8-strings-in-Ruby-1.9-td18372062.html
>
> ---
> |in Ruby 1.9 I get the following behaviour:
> |
> |>> "aoueäöüé".upcase
> |=> "AOUEäöüé"
> |>> "AOUEÄÖÜÉ".downcase
> |=> "aoueÄÖÜÉ"
> |
> |I can't however find a bug in the bug tracking system.
> |Doesn't this qualify as a bug?
>
> The document for String#upcase says:
>
> call-seq:
> str.upcase => new_str
>
> Returns a copy of <i>str</i> with all lowercase letters replaced with their
> uppercase counterparts. The operation is locale insensitive---only
> characters ``a'' to ``z'' are affected.
> Note: case replacement is effective only in ASCII region.
>
> "hEllO".upcase #=> "HELLO"
>
> See "Note:". Tim Bray have persuaded me to do so, since case
> conversion outside of ASCII region is highly dependent on country,
> language, culture and script.

I had been considering working on a patch to add a "light" proxy class
for 1.9.x that uses some but not all of the methods in the proxy class
for 1.8.

If it's true that there are no plans to add UTF-8 case-folding to Ruby
1.9 then I think it would be a good idea. I've been working on
multibyte a bit lately and would be happy to work on it some more if
folks think it would be useful.

There are also a couple of pedantic issues with AS's case folding,
such as incomplete support for Greek and Turkic languages, that I'd
like to fix. I'll look into it this week to see if maybe that would be
worthwhile as well.

-Norman

Norman Clarke

May 8, 2010, 2:39:01 PM
to rubyonra...@googlegroups.com

Interesting, I didn't realize this was going to change in 1.9.2. While I sympathize with Matz for not wanting to step into the minefield that is case folding, I'm a bit disappointed. With no built-in support for that, or normalization, Ruby's UTF-8 support is so weak that I find myself relying on AS more and more, even outside Rails apps.

I had considered working on a light multibyte proxy class for 1.9 when 1.9.1-p343 broke String#center and a few other methods, but decided against it when I saw it fixed in 1.9.2. AS's case folding is a little lacking too, because it doesn't implement case folding for Greek and Turkic as recommended for Unicode 5.1.
I've been hacking on multibyte quite a bit lately and would be happy to take a longer look if folks think it's worthwhile.

-Norman


Norman Clarke

May 8, 2010, 2:45:04 PM
to rubyonra...@googlegroups.com
> sympathize with Matz for not wanting to step into the minefield that is case

Sorry for the double post; it looks like I accidentally sent an
earlier draft from my phone.

Mateo Murphy

May 8, 2010, 2:56:44 PM
to rubyonra...@googlegroups.com

On 8-May-10, at 1:31 PM, Norman Clarke wrote:

> If it's true that there are no plans to add UTF-8 case-folding to Ruby
> 1.9 then I think it would be a good idea. I've been working on
> multibyte a bit lately and would be happy to work on it some more if
> folks think it would be useful.

I'd say that developing this as part of the I18n gem or even
standalone would be better than as part of rails, as it would be very
useful outside of rails, and not everybody who uses rails would need
this functionality.

Rodrigo Rosenfeld Rosas

May 8, 2010, 3:31:47 PM
to rubyonra...@googlegroups.com
On 08-05-2010 15:56, Mateo Murphy wrote:
>
> On 8-May-10, at 1:31 PM, Norman Clarke wrote:
>
>> If it's true that there are no plans to add UTF-8 case-folding to Ruby
>> 1.9 then I think it would be a good idea. I've been working on
>> multibyte a bit lately and would be happy to work on it some more if
>> folks think it would be useful.
>
> I'd say that developing this as part of the I18n gem or even
> standalone would be better than as part of rails, as it would be very
> useful outside of rails, and not everybody who uses rails would need
> this functionality.
>
>
I agree that writing this in I18n or a standalone library would probably
be better because of your first argument, but not because of the last one...

Rails takes a different approach from Merb or Sinatra in that it is a
full-stack framework. I believe multibyte support would be more useful
for most people than REST support, for instance...

But since AS is also an independent library and could be used outside
Rails too, I don't see any problems in patching String in AS... But I
think it would be cleaner if it was an independent library that could be
used inside I18n or AS gem...

Rodrigo.

Norman Clarke

May 8, 2010, 4:02:34 PM
to rubyonra...@googlegroups.com
On Sat, May 8, 2010 at 16:31, Rodrigo Rosenfeld Rosas
<rr.r...@gmail.com> wrote:
> On 08-05-2010 15:56, Mateo Murphy wrote:
>>
>> On 8-May-10, at 1:31 PM, Norman Clarke wrote:
>>
>>> If it's true that there are no plans to add UTF-8 case-folding to Ruby
>>> 1.9 then I think it would be a good idea. I've been working on
>>> multibyte a bit lately and would be happy to work on it some more if
>>> folks think it would be useful.
>>
>> I'd say that developing this as part of the I18n gem or even standalone
>> would be better than as part of rails, as it would be very useful outside of
>> rails, and not everybody who uses rails would need this functionality.
>>
>>
> I agree that writing this in I18n or a standalone library would probably be
> better because of you first argument, but not for the last one...
>
> Rails has an approach different from Merb or Sinatra in the way it is a
> full-stack framework. I believe multibyte support would be more useful for
> most people than REST support, for instance...
>
> But since AS is also an independent library and could be used outside Rails
> too, I don't see any problems in patching String in AS... But I think it
> would be cleaner if it was an independent library that could be used inside
> I18n or AS gem...

These two libraries provide pretty good support for UTF-8 manipulation:

http://github.com/blackwinter/unicode
http://github.com/lang/unicode_utils

Yoshida Masato's is written in C and provides good performance, while
Stefan Lang's is written in Ruby and also appears to provide support
for proper UTF-8 case folding, so there's probably no need to
duplicate the effort of adding that to AS; it should be easy enough to
just implement proxy classes that use them, and make AS use them in
place of its default proxy class:

ActiveSupport::Multibyte.proxy_class = PutativeUnicodeProxyClass
ActiveSupport::Multibyte.proxy_class = PutativeUnicodeUtilsProxyClass

But I do think that Rails should still provide decent support for case
folding, and the behavior of commonly-used things like #upcase and
#downcase should not change so dramatically when you use Ruby 1.9 vs
1.8. It would be pretty simple to extract some methods from
Multibyte::Chars into a module that can be shared between the current
feature-rich proxy class for 1.8 and a thinner one for 1.9.
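
A rough sketch of that extraction, in plain Ruby. Every name here (ToyCaseMapping, ThinProxy, RichProxy, ToyMultibyte) is hypothetical, the mapping table is a tiny hand-written one, and none of this is ActiveSupport's actual code; it only shows one module shared by two proxies behind a swappable proxy class.

```ruby
# encoding: utf-8

# Case mapping shared by both proxies (tiny hand-written table).
module ToyCaseMapping
  LOWER = 'a-zçãé'
  UPPER = 'A-ZÇÃÉ'

  def upcase
    self.class.new(to_s.tr(LOWER, UPPER))
  end

  def downcase
    self.class.new(to_s.tr(UPPER, LOWER))
  end
end

# Thin proxy: only adds case mapping, as one might want on Ruby 1.9.
class ThinProxy
  include ToyCaseMapping

  def initialize(string)
    @wrapped = string
  end

  def to_s
    @wrapped
  end
end

# Richer proxy: same shared case mapping plus extra methods.
class RichProxy
  include ToyCaseMapping

  def initialize(string)
    @wrapped = string
  end

  def to_s
    @wrapped
  end

  # Stand-in for the character-aware methods a 1.8 proxy has to supply.
  def reverse
    self.class.new(to_s.reverse)
  end
end

# Registry with a swappable proxy class, loosely modelled on the
# proxy_class setting mentioned above.
module ToyMultibyte
  class << self
    attr_accessor :proxy_class
  end
  self.proxy_class = RichProxy

  def self.chars(string)
    proxy_class.new(string)
  end
end

ToyMultibyte.proxy_class = ThinProxy
ToyMultibyte.chars('café').upcase.to_s  # => "CAFÉ"
```

Because both proxies include the same module, #upcase and #downcase behave identically whichever class is configured.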

-Norman

Rodrigo Rosenfeld Rosas

May 8, 2010, 5:00:27 PM
to rubyonra...@googlegroups.com
Agreed. Is it possible in Bundler to add a dependency on either the unicode
or the unicode_utils gem? This should work like script/server in Rails 2: if
it finds Mongrel, it uses it; otherwise, it uses WEBrick... If the faster C
implementation is available, use it, else try the pure Ruby
alternative... Is it possible?
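
Setting Bundler aside, the "use the fast one if it is there, else fall back" idea can be sketched at runtime with require and LoadError. first_loadable is a hypothetical helper, and the assumption that the two gems are required under the names 'unicode' and 'unicode_utils' is mine:

```ruby
# Returns the first library name that can be required, or nil if none can.
def first_loadable(*names)
  names.find do |name|
    begin
      require name
      true
    rescue LoadError
      false
    end
  end
end

# Prefer the C extension, fall back to the pure Ruby gem.
# The result is nil when neither gem is installed.
backend = first_loadable('unicode', 'unicode_utils')
```

The caller can then branch on the returned name to decide which proxy class to install.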

Rodrigo.

Manfred Stienstra

May 10, 2010, 3:12:10 AM
to Ruby on Rails: Core
AS::Multibyte currently implements two things: encoding-aware string
operations and Unicode algorithms. 1.9 only implements encoding-aware
string operations. We could activate the proxy with the Unicode
operations for 1.9; that should solve most people's problems.
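
To make the first half of that split concrete, these are the kind of encoding-aware operations a Ruby 1.9 string provides on its own for UTF-8 data, shown on a Ruby where they work as intended. Case mapping and normalization belong to the second half, the Unicode algorithms.

```ruby
# encoding: utf-8

word = 'ação'

word.bytesize        # => 6 (ç and ã take two bytes each in UTF-8)
word.length          # => 4, counted in characters
word.reverse         # => "oãça"
word[1]              # => "ç"
word.center(8, '*')  # => "**ação**"
```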

I don't really like the idea of depending on external libraries for
this kind of functionality because the most used algorithms are
already defined in Multibyte.

Manfred

On May 8, 10:02 pm, Norman Clarke <nor...@njclarke.com> wrote:
> On Sat, May 8, 2010 at 16:31, Rodrigo Rosenfeld Rosas
> <rr.ro...@gmail.com> wrote:
> > On 08-05-2010 15:56, Mateo Murphy wrote:
>
> >> On 8-May-10, at 1:31 PM, Norman Clarke wrote:
>
> >>> If it's true that there are no plans to add UTF-8 case-folding to Ruby
> >>> 1.9 then I think it would be a good idea. I've been working on
> >>> multibyte a bit lately and would be happy to work on it some more if
> >>> folks think it would be useful.
>
> >> I'd say that developing this as part of the I18n gem or even standalone
> >> would be better than as part of rails, as it would be very useful outside of
> >> rails, and not everybody who uses rails would need this functionality.
>
> > I agree that writing this in I18n or a standalone library would probably be
> > better because of you first argument, but not for the last one...
>
> > Rails has an approach different from Merb or Sinatra in the way it is a
> > full-stack framework. I believe multibyte support would be more useful for
> > most people than REST support, for instance...
>
> > But since AS is also an independent library and could be used outside Rails
> > too, I don't see any problems in patching String in AS... But I think it
> > would be cleaner if it was an independent library that could be used inside
> > I18n or AS gem...
>
> These two libraries provide pretty good support for UTF-8 manipulation:
>
> http://github.com/blackwinter/unicode
> http://github.com/lang/unicode_utils

Norman Clarke

May 10, 2010, 7:55:37 AM
to rubyonra...@googlegroups.com
On Mon, May 10, 2010 at 04:12, Manfred Stienstra <man...@gmail.com> wrote:

> I don't really like the idea of depending on external libraries for
> this kind of functionality because the most used algorithms are
> already defined in Multibyte.

I agree. I was thinking more about implementing proxy classes for them
in a separate library that people could use, for example, if they
needed either the high performance of the library written in C, or the
proper case-folding for Greek and Turkic that the other one provides.

Manfred Stienstra

May 11, 2010, 3:30:53 AM
to Ruby on Rails: Core
That's more or less how it is right now. The C implementation is called
Unichars: http://github.com/Manfred/unichars.

On May 10, 1:55 pm, Norman Clarke <nor...@njclarke.com> wrote:

NARUSE, Yui

May 13, 2010, 3:54:03 AM
to rubyonra...@googlegroups.com
2010/5/9 Norman Clarke <nor...@njclarke.com>:
> Interesting, I didn't realize this was going to change in 1.9.2.

1.9.2's feature set is already frozen, and it doesn't have such Unicode utilities.

We on ruby-core know about the need for Unicode utilities and have had some
discussion about it, but we can't agree on a spec and implementation.
I think it needs more time.

> While I
> sympathize with Matz for not wanting to step into the minefield that is case
> folding, I'm a bit disappointed. With no built-in support for that, or
> normalization, Ruby's UTF-8 support is so weak that I find myself relying on
> AS more and more, even outside Rails apps.
>
> I had considered working on a light multibyte proxy class for 1.9 when
> 1.9.1-p343 broke String#center and a few other methods, but decided against
> it when I saw it fixed in 1.9.2. AS's case folding is a little lacking too,
> because it doesn't implement case folding for Greek and Turkic as
> recommended for Unicode 5.1.
> I've been hacking on multibyte quite a bit lately and would be happy to take
> a longer look if folks think it's worthwhile.

FYI:
If you implement case folding for Greek and Turkic, a string (or something)
needs language information. Selecting font, calculating width,

--
NARUSE, Yui
nar...@airemix.jp

Norman Clarke

May 21, 2010, 1:30:25 PM
to rubyonra...@googlegroups.com
On Thu, May 13, 2010 at 04:54, NARUSE, Yui <nar...@airemix.jp> wrote:
> 2010/5/9 Norman Clarke <nor...@njclarke.com>:
>> Interesting, I didn't realize this was going to change in 1.9.2.
>
> 1.9.2's feature is already froze and it doesn't have such Unicode utilities.
>
> We ruby-core know such needs for Unicode utility and had some discussion
> about it but we can't agree its spec and implementation.
> I think it needs more time.
>
>> While I
>> sympathize with Matz for not wanting to step into the minefield that is case
>> folding, I'm a bit disappointed. With no built-in support for that, or
>> normalization, Ruby's UTF-8 support is so weak that I find myself relying on
>> AS more and more, even outside Rails apps.
>>
>> I had considered working on a light multibyte proxy class for 1.9 when
>> 1.9.1-p343 broke String#center and a few other methods, but decided against
>> it when I saw it fixed in 1.9.2. AS's case folding is a little lacking too,
>> because it doesn't implement case folding for Greek and Turkic as
>> recommended for Unicode 5.1.
>> I've been hacking on multibyte quite a bit lately and would be happy to take
>> a longer look if folks think it's worthwhile.
>
> FYI:
> If you implement case folding for greek and Turkic, a string (or something),
> the string needs language information. Selecting font, calculating  width,

Hi all,

I submitted a patch to fix the upcasing issue with 1.9 about a week
ago[1], but haven't gotten any followup yet. I saw today that there's
been some more work on this area, so my patch now conflicts with Rails
master.

If somebody has the time and inclination, could you let me know if
there's any interest in including my changes? In addition to resolving
the issue with upcasing on Ruby 1.9, I added an
ActiveSupport::Multibyte::Unicode module to contain the class methods
from ActiveSupport::Multibyte::Chars, and then moved in some related
functionality to the module for the sake of consistency.

I'm happy to resolve the conflicts to make the patch apply again, but
if people don't like the direction my refactoring went and don't want
to include the changes, then no problem, I'll just kill my branch[2]
and won't bother resolving the conflicts.

Either way, I think it would still be ideal to get a fix for the
upcasing issue before 3.0 is released.

Regards,

Norman


[1] https://rails.lighthouseapp.com/projects/8994/tickets/4595-stringmb_charsupcase-doesnt-upcase-non-ascii-chars-on-with-ruby-19x
[2] http://github.com/norman/rails/commit/f01dd100a7853e9bb5c7eb9097068ddb9ed1909d

Rodrigo Rosenfeld Rosas

May 21, 2010, 1:40:44 PM
to rubyonra...@googlegroups.com
Norman, take a look at the link above. It seems Jeremy is willing to
accept your patch. Please rebase against master again.

Best regards,

Rodrigo.