ActiveSupport::Multibyte for better Unicode support


Manfred Stienstra

Sep 20, 2006, 9:03:18 AM
to rubyonra...@googlegroups.com
Three months ago Julian Tarkhanov submitted a test implementation of
his ActiveSupport::Multibyte string extension patch. Since then we've
been steadily improving the extension based on the feedback we received.

The code has been completely refactored to be more transparent and
easier to understand. There is now a single optional accelerated
backend and all multibyte-safe operations have a pure Ruby
implementation. Test structure and coverage have also been greatly
improved.

ActiveSupport::Multibyte is available as a plugin and can be
converted to a patch using the included 'create_patch' rake task.

We would like to see ActiveSupport::Multibyte included in Rails so
that developers can start depending on it for simpler and better
Unicode support.

The ticket for the patch is at http://dev.rubyonrails.org/ticket/6242.
More information and code can be found at
https://fngtps.com/projects/multibyte_for_rails.

Manfred

DHH

Sep 20, 2006, 10:15:05 AM
to Ruby on Rails: Core
> We would like to see ActiveSupport::Multibyte included in Rails so
> that developers can start depending on it for simpler and better
> Unicode support.

I concur. Let this start an official request for comments. Any
objections to getting this into core?

Michael Koziarski

Sep 23, 2006, 2:08:55 AM
to rubyonra...@googlegroups.com

I'm definitely keen to see this get added. However I'm a bit
concerned about the lack of discussion in this thread. It's a big
piece of work, and I was hoping more people would have opinions on it.
--
Cheers

Koz

Manfred Stienstra

Sep 23, 2006, 3:08:48 AM
to rubyonra...@googlegroups.com
On Sep 23, 2006, at 8:08 AM, Michael Koziarski wrote:

> I'm definitely keen to see this get added. However I'm a bit
> concerned about the lack of discussion in this thread. It's a big
> piece of work, and I was hoping more people would have opinions on it

I think that's the problem: because the codebase is pretty esoteric,
not many people want to dive in and give their opinion. I could
explain on a global level, without getting into all the details
concerning encoding, what it does and what decisions were made during
coding, if anyone is interested.

Manfred

Peter Michaux

Sep 23, 2006, 3:19:14 AM
to rubyonra...@googlegroups.com

I'm interested in a general overview of what problem it fixes and why
it is needed. I don't know much about the whole Unicode problem with
Ruby that people keep bringing up, while others say it isn't really a
problem.

Peter

Mathieu Jobin

Sep 23, 2006, 4:15:20 AM
to rubyonra...@googlegroups.com
The ticket description already seems to be a very good general overview.
If my opinion counts and this package has been well tested, I'd say "Add please".
Although if it only patches Ruby, not Rails, it could be a separate gem or a patch on Ruby core/stdlib.

Mathieu

Manfred Stienstra

Sep 23, 2006, 4:30:36 AM
to rubyonra...@googlegroups.com

On Sep 23, 2006, at 10:15 AM, Mathieu Jobin wrote:

> The ticket description already seems to be a very good general
> overview.
> if my opinion count and this package has been well tested, I'd say
> "Add please".
> although if it only patches ruby, not rails, it could be a separate
> gems or a patch on ruby core/stdlib

Matz claims that Ruby currently has enough tools to deal with
encoding. The problem is that you have to be an expert to do it
right. The earliest Ruby is going to deal with encoding is in Ruby
2.0, and that's not going to come out any time soon. So this leaves the
encoding problem with the application programmers. Even though I have
to admit that I would rather see a good solution in Ruby core or in
stdlib, it's not going to happen. ActiveSupport::Multibyte is an
attempt to make dealing with encoding simpler for the Rails (core)
programmer, right now. It could also work as a deprecation mechanism
when/if support in Ruby comes out.

If ActiveSupport::Multibyte were released as a gem or standalone
library, Rails code couldn't depend on it and we'd have to litter the
code with if statements.

Manfred

Mathieu Jobin

Sep 23, 2006, 4:38:01 AM
to rubyonra...@googlegroups.com
Makes total sense. Thanks.

Mislav Marohnić

Sep 23, 2006, 10:10:33 AM
to rubyonra...@googlegroups.com
Peter,

The problem is correctly supporting multibyte strings. Unicode, the most complete character set, has several encodings (UTF-8 being the most popular one), each of them having some (or all) characters expressed with two or more bytes (unlike ASCII, for instance). In UTF-8, "abc" is a three-character string encoded in 3 bytes, but "čžš" (3 characters from the Croatian alphabet) is encoded in 6 bytes (2 bytes each).

Multibyte-unaware programming languages (like Ruby and PHP < 6) assume 1 character = 1 byte. In Ruby, try string.reverse or string.length on strings containing special characters to see some unexpected results. Reverse will corrupt the string while length will report in bytes, not in characters. These are trivial examples, while the problem goes much deeper.
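
A small sketch of the two symptoms described above. It assumes a modern Ruby (1.9 or later), where strings carry an encoding, so the byte-oriented behaviour of Ruby 1.8 is simulated by tagging the string as binary. The variable names are illustrative only.

```ruby
# Sketch: byte-oriented vs character-oriented string operations.
# Assumes Ruby >= 1.9; Ruby 1.8 behaved like the "binary" case by default.
text = "čžš"                       # 3 characters, 2 bytes each in UTF-8

text.bytesize                      # => 6 (bytes)
text.unpack('U*').length           # => 3 (code points, decoded by hand)

# Treat the same bytes as raw 8-bit data, the way Ruby 1.8 did.
raw = text.dup.force_encoding(Encoding::BINARY)
raw.length                         # => 6, length is reported in bytes

# Reversing byte by byte splits every 2-byte sequence in half.
broken = raw.reverse.force_encoding(Encoding::UTF_8)
broken.valid_encoding?             # => false, the string is corrupted

# Reversing code points instead keeps each character intact.
fixed = text.unpack('U*').reverse.pack('U*')
fixed                              # => "šžč"
```

Decoding to code points with unpack('U*') works on both old and new Rubies, which is roughly the kind of work a multibyte-safe layer has to do on behalf of the programmer.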

Rails needs this.

--
Mislav


Charles O Nutter

Sep 23, 2006, 10:43:37 AM
to rubyonra...@googlegroups.com
On 9/20/06, Manfred Stienstra <man...@gmail.com> wrote:
>
> Three months ago Julian Tarkhanov submitted a test implementation of
> his ActiveSupport::Multibyte string extension patch. Since then we've
> been steadily improving the extension based on the feedback we received.

It appears this doesn't have any native/C code, but can you confirm
that in case I'm not looking hard enough? Obviously we JRubyists
wouldn't want anything in Rails to start requiring code we can't run.

--
Contribute to RubySpec! @ www.headius.com/rubyspec
Charles Oliver Nutter @ headius.blogspot.com
Ruby User @ ruby.mn

Sam Ruby

Sep 23, 2006, 11:04:05 AM
to rubyonra...@googlegroups.com
Charles O Nutter wrote:
> On 9/20/06, Manfred Stienstra <man...@gmail.com> wrote:
>> Three months ago Julian Tarkhanov submitted a test implementation of
>> his ActiveSupport::Multibyte string extension patch. Since then we've
>> been steadily improving the extension based on the feedback we received.
>
> It appears this doesn't have any native/C code, but can you confirm
> that in case I'm not looking hard enough? Obviously we JRubyists
> wouldn't want anything in Rails to start requiring code we can't run.

How does JRuby handle strings? If they are mapped to java.lang.String,
then JRuby already has more than adequate Unicode support.

It seems to me that .chars should return back the same object, if the
underlying VM supports Unicode. I would guess that today that would
include JRuby, and in the future, that would include Ruby 2.0.

Some day in the future, when Ruby 1.x is a distant memory, .chars should
be deprecated, and ultimately removed.

- Sam Ruby

Michael Koziarski

Sep 23, 2006, 6:45:37 PM
to rubyonra...@googlegroups.com
> Some day in the future, when Ruby 1.x is a distant memory, .chars should
> be deprecated, and ultimately removed.

That's definitely our intention. If JRuby is using java.lang.String,
then a simple plugin which does the following would be sufficient:

class String
  # On a Unicode-aware VM the string itself can stand in for the proxy.
  def chars
    self
  end
end

We'll update ActiveSupport to contain that (with appropriate
deprecation) when ruby 2.x comes to the party.
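
For contrast, here is a minimal sketch of what a proxy has to do on a byte-oriented VM. CharsSketch and its methods are hypothetical names chosen for illustration; this is not the ActiveSupport::Multibyte implementation, only the general idea of wrapping a string and working on code points.

```ruby
# Hypothetical proxy, NOT the ActiveSupport::Multibyte code.
class CharsSketch
  def initialize(string)
    @string = string
  end

  # Decode the UTF-8 bytes into code points; this works even when the
  # VM itself has no notion of characters.
  def codepoints
    @string.unpack('U*')
  end

  # Length in characters rather than bytes.
  def length
    codepoints.length
  end

  # Reverse by code point and wrap the result in a new proxy.
  def reverse
    self.class.new(codepoints.reverse.pack('U*'))
  end

  # Hand back the plain string.
  def to_s
    @string
  end
end

chars = CharsSketch.new("čžš")
chars.length           # => 3
chars.reverse.to_s     # => "šžč"
```

On a VM whose strings are already character-aware, all of this collapses into returning self, which is the point of the plugin above.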

--
Cheers

Koz

Pete Yandell

Sep 23, 2006, 9:21:37 PM
to rubyonra...@googlegroups.com

I'm definitely in favour of seeing something like this in core.
Better unicode handling is needed yesterday! The chars proxy is a
very nice way of handling this.

A question:

How does this compare to the unicode_hacks plugin? (See
http://julik.textdriven.com/svn/tools/rails_plugins/unicode_hacks/)
They seem very similar in both intent and interface.

Some comments:

Even with this plugin, supporting unicode in a Rails app is too
complicated and fiddly. For those who haven't tried it, here are the
steps:

- Make sure your database character set is utf8
- Make sure all your tables have a character set of utf8
- Make sure your database.yml has 'encoding: utf8' set for each database
- Put $KCODE='u' in your environment.rb
- Add an after_filter to application.rb to set the Content-Type
header correctly
- Add 'normalize_unicode_params :form => :kc' to your application.rb

Missing one of these steps can produce strange results and corrupted
data.

If unicode support is being included in core, then this needs to be
rationalised. Ideally a single setting in environment.rb should take
care of all of this. I also think it should be enabled by default.
(Who doesn't want to support unicode nowadays?)

Rumour also has it that ActiveRecord, when recreating timed-out
database connections, doesn't honour the 'encoding: utf8' setting.
I've never run into this personally, so I assume it was fixed at some
point?

Cheers,

Pete Yandell

Thijs van der Vossen

Sep 24, 2006, 7:07:08 AM
to rubyonra...@googlegroups.com
On 23 Sep 2006, at 16:43 , Charles O Nutter wrote:
> On 9/20/06, Manfred Stienstra <man...@gmail.com> wrote:
>> Three months ago Julian Tarkhanov submitted a test implementation of
>> his ActiveSupport::Multibyte string extension patch. Since then we've
>> been steadily improving the extension based on the feedback we
>> received.
>
> It appears this doesn't have any native/C code, but can you confirm
> that in case I'm not looking hard enough?

Confirmed. All operations are implemented as pure Ruby.

Kind regards,
Thijs

--
Fingertips - http://www.fngtps.com

Phone: +31 (0)6 24204845
Skype: tvandervossen

MSN Messenger: th...@fngtps.com
iChat/AOL: t.vande...@mac.com
Jabber IM: th...@jabber.org


Joshua Sierles

Sep 24, 2006, 8:20:37 AM
to rubyonra...@googlegroups.com
> - Make sure your database character set is utf8
> - Make sure all your tables have a character set of utf8
> - Make sure your database.yml has 'encoding: utf8' set for each database

None of these steps are required officially unless you use utf-8
specific features of the database (collation). The last setting seems
to set the connection encoding, which shouldn't be required unless
there is non-utf8 data stored in the database.

> - Put $KCODE='u' in your environment.rb

This is only required if you use unicode strings in your Ruby code.

> - Add an after_filter to application.rb to set the Content-Type
> header correctly

Rails now defaults to utf-8 Content-Type.

Joshua Sierles

Thijs van der Vossen

Sep 24, 2006, 2:35:41 PM
to rubyonra...@googlegroups.com
On 24 Sep 2006, at 03:21 , Pete Yandell wrote:
> How does this compare to the unicode_hacks plugin? (See http://
> julik.textdriven.com/svn/tools/rails_plugins/unicode_hacks/) They
> seem very similar in both intent and interface.

ActiveSupport::Multibyte is a component of the Multibyte for Rails
project, which is basically the next version of the unicode_hacks
plugin.


Pete Yandell

Sep 24, 2006, 9:58:41 PM
to rubyonra...@googlegroups.com

On 24/09/2006, at 10:20 PM, Joshua Sierles wrote:

>> - Make sure your database character set is utf8
>> - Make sure all your tables have a character set of utf8
>> - Make sure your database.yml has 'encoding: utf8' set for each
>> database
>
> None of these steps are required officially unless you use utf-8
> specific features of the database (collation). The last setting seems
> to set the connection encoding, which shouldn't be required unless
> there is non-utf8 data stored in the database.

Not true! Collation and character set are separate things.

There are a couple of obvious reasons you want your database
character set to be UTF8 if you're storing UTF8 strings:

1. When you access the database through the mysql (or pgsql, or
other) command line, or through tools such as CocoaMySQL, you want
strings to display properly.

2. MySQL never treats strings as binary; they always have a character
set, which is latin1 (CP1252) by default. Putting UTF8 data into
fields marked as latin1 seems like asking for trouble. (There are
some byte values that are invalid in CP1252, so technically strings
containing those bytes are illegal. It's only through MySQL's
laziness in not checking the strings when the connection and table
character sets match up that you can get away with this at all.)

There are even worse potential pitfalls here too. On one of our
projects, we did everything except set the connection encoding.
What happened was that a UTF8 string in Rails would be regarded as
CP1252 by MySQL, but MySQL knew that the tables needed UTF8, so it
did a CP1252 to UTF8 conversion on the (already UTF8) string before
writing it. As you can imagine, we ended up with all sorts of crap in
the database, and the occasional string got completely munged as
invalid CP1252 bytes were replaced with question marks.
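
The double conversion described above can be reproduced outside the database. This is only a sketch using Ruby's own transcoding (Ruby 1.9 or later) to mimic what the server did; it says nothing about MySQL internals, and the variable names are illustrative.

```ruby
# Sketch of the double-encoding pitfall, simulated with String#encode.
utf8 = "é"                                   # bytes C3 A9 in UTF-8

# The connection encoding is unset, so the bytes are assumed to be CP1252 ...
assumed = utf8.dup.force_encoding(Encoding::Windows_1252)

# ... and are then converted to UTF-8 for the utf8 table.
stored = assumed.encode(Encoding::UTF_8)

stored                                       # => "Ã©"
stored.bytesize                              # => 4, two bytes became four

# Some byte values have no meaning in CP1252. Asking for a replacement
# character mimics the question marks mentioned above.
odd = "č".dup.force_encoding(Encoding::Windows_1252)
lossy = odd.encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: '?')
```

Once the data has gone through the lossy path there is no way to recover the original bytes, which is why the connection setting matters as much as the table character set.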

These three things should at least be reduced to a single setting to
avoid mistakes. I can't imagine a situation in which you would want
to do one of them without the others.

>> - Put $KCODE='u' in your environment.rb
>
> This is only required if you use unicode strings in your Ruby code.

If your app handles UTF8, then you're going to want to write tests
involving UTF8 strings, so you're going to need this turned on. You
do write UTF8 tests for your apps, right? :)

> - Add an after_filter to application.rb to set the Content-Type
> header correctly
>
> Rails now defaults to utf-8 Content-Type.

Good to know. I'll take this as an endorsement of the idea that UTF8
should be the default for Rails apps. :)

Cheers,

Pete Yandell

David Goodlad

Sep 24, 2006, 11:06:05 PM
to rubyonra...@googlegroups.com
On 9/24/06, Pete Yandell <pete.y...@gmail.com> wrote:
> Good to know. I'll take this as an endorsement of the idea the UTF8
> should be the default for Rails apps. :)

I have to put in my two cents here. I can't see any reason why one
_wouldn't_ want to use UTF-8 over plain-ol' ASCII. It's a totally
different ball game than localization; I just want my users to be able
to input data using their own native characters. What app doesn't
have a "full name" field for a user? Shouldn't your users be able to
input their name properly? :)

Besides implementation issues, I can't see any real downside to
supporting UTF-8 out of the box in Rails. It would sure avoid a lot
of potential issues...

Dave

--
Dave Goodlad
dgoo...@gmail.com or da...@goodlad.ca
http://david.goodlad.ca/

Julian 'Julik' Tarkhanov

Sep 25, 2006, 10:48:56 AM
to rubyonra...@googlegroups.com

On 24-sep-2006, at 3:21, Pete Yandell wrote:

>
> How does this compare to the unicode_hacks plugin? (See http://
> julik.textdriven.com/svn/tools/rails_plugins/unicode_hacks/) They
> seem very similar in both intent and interface.

It's a sanctioned evolution thereof. Manfred and Thijs took over the
business while I am plowing through my internship (which BTW has
nothing to do with Rails and web development). We split the
repositories so that they can perform exhaustive code changes without
hurting everyone sitting on unicode_hacks.
--
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl


Julian 'Julik' Tarkhanov

Sep 25, 2006, 10:49:52 AM
to rubyonra...@googlegroups.com

On 25-sep-2006, at 5:06, David Goodlad wrote:

> I can't see any real downside to
> supporting UTF-8 out of the box in Rails.

Tell it to the Japanese and the Chinese railers. I wonder how long
you will stand before you get your ass served :-)

David Goodlad

Sep 25, 2006, 12:06:44 PM
to rubyonra...@googlegroups.com
On 9/25/06, Julian 'Julik' Tarkhanov <lis...@julik.nl> wrote:
>
>
> On 25-sep-2006, at 5:06, David Goodlad wrote:
>
> > I can't see any real downside to
> > supporting UTF-8 out of the box in Rails.
>
> Tell it to the Japanese and the Chinese railers. I wonder how long
> you will stand before you get your ass served :-)

You mean they would get mad if Rails _did_ support UTF-8 out of the box?

Michael Koziarski

Sep 25, 2006, 6:17:34 PM
to rubyonra...@googlegroups.com
> You mean they would get mad if Rails _did_ support UTF-8 out of the box?

Yeah, UTF-8 and Unicode aren't terribly popular in Japan. For more
information than you ever thought you'd want, you can read up on
Han unification. It's also much less efficient (space-wise) than
their 'legacy' encodings.


--
Cheers

Koz

Pete Yandell

Sep 25, 2006, 6:30:14 PM
to rubyonra...@googlegroups.com
On 26/09/2006, at 12:49 AM, Julian 'Julik' Tarkhanov wrote:

> On 25-sep-2006, at 5:06, David Goodlad wrote:
>
>> I can't see any real downside to
>> supporting UTF-8 out of the box in Rails.
>
> Tell it to the Japanese and the Chinese railers. I wonder how long
> you will stand before you get your ass served :-)

Why? It's not like Rails supports Japanese or Chinese encodings out
of the box now. How is going from supporting just ASCII to supporting
UTF-8 taking anything away from Japanese or Chinese railers?

Like David said, what exactly is the downside to default UTF-8
support? Who does it hurt, how, and why?

Pete Yandell

Sam Ruby

Sep 25, 2006, 7:15:35 PM
to rubyonra...@googlegroups.com

Java and C# seem to do OK in Japan.

I would also imagine that ASCII wouldn't be very popular in Japan. :-)

- Sam Ruby

Michael Koziarski

Sep 25, 2006, 7:24:25 PM
to rubyonra...@googlegroups.com
> Java and C# seem to do OK in Japan.
>
> I would also imagine that ASCII wouldn't be very popular in Japan. :-)

I should clarify, don't take my previous statement as disagreeing
with "utf-8 everywhere", I'm for it, not against it. But it's
definitely not as simple an issue as it appears at first glance ;)

--
Cheers

Koz

Manfred Stienstra

Sep 26, 2006, 2:41:25 AM
to rubyonra...@googlegroups.com

ActiveSupport::Multibyte doesn't favor any encoding. It currently
implements UTF-8 operations because that's what we, and a lot of
other people on the web, use daily. We believe that you shouldn't
implement anything you're not going to use yourself. This is also
explained on our Trac page, in the FAQ.

https://fngtps.com/projects/multibyte_for_rails/wiki/FAQ

Manfred

Thijs van der Vossen

Sep 26, 2006, 3:12:55 AM
to rubyonra...@googlegroups.com
On 26 Sep 2006, at 01:15 , Sam Ruby wrote:
> Michael Koziarski wrote:
>>> You mean they would get mad if Rails _did_ support UTF-8 out of
>>> the box?
>>
>> Yeah, UTF-8 and unicode aren't terribly popular in japan. For more
>> information than you ever thought you'd want, you can read up on
>> the Han unification. It's also much less efficient (space wise)
>> than their 'legacy' encodings.
>
> Java and C# seem to do OK in Japan.

And for good reason. I have yet to see an example of something that
you can do in Shift-JIS and EUC that you can't do with Unicode 5
encoded as UTF-8. I'm not saying there are no issues some people feel
strongly about, but there are certainly no compelling technical or
practical reasons why you can't use Unicode in Japan.

Even so, Ruby supports Shift-JIS and EUC and will continue to.
Because Rails gets so much out of Ruby it would be somewhat rude if
the next Rails release were to make it impossible to use these encodings.

That's _exactly_ why ActiveSupport::Multibyte is designed to support
multiple encodings. The only reason Shift-JIS and EUC are currently
not implemented in ActiveSupport::Multibyte is that we don't feel
comfortable building stuff we don't use.

So, if you need Shift-JIS or EUC, please add it to
ActiveSupport::Multibyte and send us a patch.

For more information see the Multibyte for Rails FAQ:

https://fngtps.com/projects/multibyte_for_rails/wiki/FAQ


Thijs van der Vossen

Sep 26, 2006, 3:44:01 AM
to rubyonra...@googlegroups.com

There's no downside to default UTF-8 support, but it would be nice if
switching from the default to Shift-JIS or EUC is going to be as easy
as changing $KCODE = 'utf-8' to $KCODE = 'sjis'.

If you want this, please add Shift-JIS and/or EUC support in
ActiveSupport::Multibyte and send us a patch.

Kind regards,


Michael Koziarski

Sep 26, 2006, 3:49:22 AM
to rubyonra...@googlegroups.com
> So, if you need Shift-JIS or EUC, please add it to
> ActiveSupport::Multibyte and send us a patch.

Other encodings can be supported with plugins initially; I'm personally
happy with utf-8 only as a position for 1.2.
--
Cheers

Koz

Michael Koziarski

Sep 26, 2006, 4:06:08 AM
to rubyonra...@googlegroups.com

So, if we merged in ActiveSupport::Multibyte, and updated helpers
like truncate to use the chars proxy, what other changes would be
required to make this stuff simple? Normalisation of input
parameters? Anything else?

It would be nice if we could make it really easy to have this stuff
'just work' without much in the way of additional user intervention.

--
Cheers

Koz

Manfred Stienstra

Sep 26, 2006, 4:27:43 AM
to rubyonra...@googlegroups.com
On 26-sep-2006, at 10:06, Michael Koziarski wrote:

> So, if we merged in ActiveSupport::Multibyte, and updated helpers
> like truncate to use the chars proxy, what other changes would be
> required to make this stuff simple? Normalisation of input
> parameters? Anything else?

Well, normalization of input parameters depends on the situation. If
you want to compare strings you probably want compatibility
normalization (like NFKC), but compatibility normalization forms also
lose data.

For instance, the ligature ﬃ (U+FB03):

"ﬃ".chars.normalize(:kc) #=> "ffi"

Or the 'vulgar fraction one quarter':

"¼".chars.normalize(:kc) #=> "1/4"

When you're comparing strings, you might want "¼" to be equal to
"1/4". When you want your users to use nice glyphs, you can't just
discard this data.
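
The same data loss can be observed with String#unicode_normalize from the Ruby standard library (Ruby 2.2 or later). This is offered as a sketch for readers on a current Ruby; it is not the chars.normalize API discussed in this thread.

```ruby
# Sketch using the stdlib String#unicode_normalize (Ruby >= 2.2),
# not the ActiveSupport::Multibyte API.
ligature = "\uFB03"            # LATIN SMALL LIGATURE FFI
quarter  = "\u00BC"            # VULGAR FRACTION ONE QUARTER

ligature.unicode_normalize(:nfkc)             # => "ffi", three plain letters
quarter.unicode_normalize(:nfkc)              # => "1\u20444", digits around U+2044

# Canonical composition (NFC) leaves both characters alone.
ligature.unicode_normalize(:nfc) == ligature  # => true
quarter.unicode_normalize(:nfc) == quarter    # => true
```

One detail worth knowing: the slash produced for the fraction is U+2044 FRACTION SLASH, not the ASCII "/", so a "1/4" typed on a keyboard still needs separate handling before the two compare as equal.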

But _if_ you normalize, you have to make sure you _always_ normalize.
For instance, when you save a password to the database and normalize
it, you have to make sure that you always normalize passwords from
forms otherwise the password might not match when filled out by the
user. Using NFKC might introduce false positives because
"¼".chars.normalize == "1/4".chars.normalize, which isn't a very
large problem if the rest of the password is strong enough.
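
The "always normalize" rule can be sketched as follows. The helper name canonical is hypothetical, NFC is picked only for the example, and a real application would of course hash the password after normalizing rather than store it.

```ruby
# Hypothetical helper: one normalization form, applied on every path.
def canonical(password)
  password.unicode_normalize(:nfc)
end

stored = canonical("caf\u00E9")   # saved at signup, precomposed "é"
typed  = "cafe\u0301"             # later typed as "e" + combining accent

stored == typed                   # => false, raw input does not match
stored == canonical(typed)        # => true, normalized input matches
```

Skipping the helper on either the save path or the login path reproduces the mismatch described above.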

Currently normalization is implemented in a separate plugin called
'utf8_plugin' [1], and can be turned on by the class method
`normalize_unicode_params'.

You can find more information in our Unicode Primer [2].

Manfred

[1] https://fngtps.com/svn/multibyte_for_rails/utf8_plugin
[2] https://fngtps.com/projects/multibyte_for_rails/wiki/UnicodePrimer

Mathieu Jobin

Sep 26, 2006, 4:42:42 AM
to rubyonra...@googlegroups.com
OK, so ActiveSupport::Multibyte would work with SJIS and EUC-JP, but it seems to need some extra work from someone who understands those encodings.

Well, I think if ActiveSupport::Multibyte gets integrated into Rails with decent docs (docs that include writing plugins for other encodings),
I'm sure you have a lot more chance to see a Japanese guru sending you a patch. If it does not get integrated, they won't know about it, or won't care because it isn't mainstream.

And I am using UTF-8 a good 80% of the time anyway, so I'm totally with the motion.

Julian 'Julik' Tarkhanov

Sep 26, 2006, 4:46:55 AM
to rubyonra...@googlegroups.com

On 26-sep-2006, at 9:49, Michael Koziarski wrote:

> Other encodings can be support with plugins initially, I'm personally
> happy with utf-8 only as a position for 1.2.

+3

Julian 'Julik' Tarkhanov

Sep 26, 2006, 4:51:03 AM
to rubyonra...@googlegroups.com

On 26-sep-2006, at 10:06, Michael Koziarski wrote:

>
> It would be nice if we could make it really easy to have this stuff
> 'just work' without much in the way of additional user intervention.

Normalization on input and before saving to the database, but this
might scare some people off if used wrong.
What Rails might do is adopt the Character Model for the Web and just
stick to C normalizations everywhere.

However, I think this should stay optional, because it might raise
exceptions and leave loose ends in situations where people send
intrinsic bytestrings as input parameters. What I did was define input
normalization as a filter for ApplicationController, as the step in the
chain responsible for input sanitization.

Implicit normalization at runtime is not the way to go, because it
transiently changes the offsets of strings as soon as you
slice/truncate/concatenate.
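
A short sketch of the offset problem, again using the stdlib String#unicode_normalize (Ruby 2.2 or later) rather than the plugin under discussion. The sample string is made up for the illustration.

```ruby
# Sketch: normalization changes lengths and offsets.
decomposed = "cafe\u0301s"                    # "e" + COMBINING ACUTE ACCENT
composed   = decomposed.unicode_normalize(:nfc)

decomposed.length       # => 6
composed.length         # => 5

decomposed.index("s")   # => 5
composed.index("s")     # => 4, every offset after the accent has shifted

decomposed[0, 4]        # => "cafe", the accent is cut off
composed[0, 4]          # => "caf\u00E9", the accent survives
```

An offset computed before normalization is therefore not safe to use afterwards, which is the argument for normalizing once at the input boundary.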

Julian 'Julik' Tarkhanov

Sep 26, 2006, 4:52:42 AM
to rubyonra...@googlegroups.com

On 26-sep-2006, at 10:06, Michael Koziarski wrote:

>
> So, if we merged in ActiveSupport::Multibyte, and updated helpers
> like truncate to use the chars proxy, what other changes would be
> required to make this stuff simple? Normalisation of input
> parameters? Anything else?

KCODE, all response charsets out of the box UTF, maybe processing the
params with iconv according to the request-charset.
But first and foremost - clear documentation.

Michael Koziarski

Sep 26, 2006, 5:16:39 AM
to rubyonra...@googlegroups.com
> KCODE, all response charsets out of the box UTF, maybe processing the
> params with iconv according to the request-charset.

Is the request charset sent by all browsers for all requests? How
risky is automatically translating with iconv (assuming it's
available)? Incidentally, this is what I meant by normalization,
that'll teach me to use a reserved word ;).

> But first and foremost - clear documentation.

What do you feel is currently missing from the ActiveSupport::Multibyte patch?



--
Cheers

Koz

Thijs van der Vossen

Sep 26, 2006, 5:17:52 AM
to rubyonra...@googlegroups.com
On 26 Sep 2006, at 10:52 , Julian 'Julik' Tarkhanov wrote:
> On 26-sep-2006, at 10:06, Michael Koziarski wrote:
>> So, if we merged in ActiveSupport::Multibyte, and updated helpers
>> like truncate to use the chars proxy, what other changes would be
>> required to make this stuff simple? Normalisation of input
>> parameters? Anything else?
>
> KCODE,

I agree. It's the Ruby way to set your encoding using $KCODE so Rails
1.2 should have $KCODE='utf-8' in environment.rb

> all response charsets out of the box UTF,

This is already in trunk since changeset 5129.

> maybe processing the params with iconv according to the request-
> charset.

This is only needed for very old and badly broken browsers. I don't
think Rails should do this by default.

Kind regards,
Thijs

Julian 'Julik' Tarkhanov

Sep 26, 2006, 5:18:37 AM
to rubyonra...@googlegroups.com

On 26-sep-2006, at 0:30, Pete Yandell wrote:

> Like David said, what exactly is the downside to default UTF-8
> support? Who does it hurt, how, and why?

It doesn't hurt _us_. I'm 200% for it anyways, just wanted to bring
the point before anyone sneaks up on us about it.

Julian 'Julik' Tarkhanov

Sep 26, 2006, 5:29:52 AM
to rubyonra...@googlegroups.com

On 26-sep-2006, at 11:16, Michael Koziarski wrote:

>> KCODE, all response charsets out of the box UTF, maybe processing the
>> params with iconv according to the request-charset.
>
> Is the request charset sent by all browsers for all requests? How
> risky is automatically translating with iconv (assuming it's
> available)? Incidentally, this is what I meant by normalization,
> that'll teach me to use a reserved word ;).

I see almost no risk. It has to do with a browser (or a REST client,
for that matter) using a wrong charset when doing the request. The
server receiving the request can then decode the request into its
internal encoding. This is how (among others) the Trackback system
works in MovableType. But, just as Thijs said, we might as well omit
that.

It has nothing to do with normalisation.
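
Decoding a request into the internal encoding could look roughly like this. The sketch uses String#encode, which took over from iconv in later Ruby versions; to_internal is a hypothetical helper, and how the charset is obtained from the request is left out on purpose.

```ruby
# Hypothetical helper: decode raw parameter bytes using the charset the
# client declared, and return them as UTF-8.
def to_internal(raw_bytes, request_charset)
  raw_bytes.dup.force_encoding(request_charset).encode(Encoding::UTF_8)
end

latin1_bytes = "caf\xE9".b                    # "café" as sent by an ISO-8859-1 client
decoded = to_internal(latin1_bytes, "ISO-8859-1")

decoded                   # => "café"
decoded.valid_encoding?   # => true
```

This is transcoding between character sets and leaves the choice of normalization form untouched.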

>
>> But first and foremost - clear documentation.
>
> What do you feel is currently missing from the
> ActiveSupport::Multibyte patch?

As one of the authors I feel pretty secure here. Just wanted to make
sure the big README we have put there gets
a visible spot in the AS docs.

Charles O Nutter

Sep 26, 2006, 11:09:23 AM
to rubyonra...@googlegroups.com
On 9/23/06, Sam Ruby <ru...@intertwingly.net> wrote:
> How does JRuby handle strings? If they are mapped to java.lang.String,
> the JRuby already has more than adequate Unicode support.

JRuby does use java.lang.String, but we have to artificially downgrade
everything to a single-byte encoding for Ruby's sake. Because there's
no concept of characters versus bytes in Ruby, we can't really support
multibyte characters or code points or what have you without creating
incompatible interfaces. It's a source of great frustration for us, so
much so that we're probably just going to create some
incompatibilities to solve the Unicode issue on our end. It's likely
that in the future all strings in JRuby will be UTF-16 strings as in
Java, and all operations will deal in characters instead of bytes
wherever possible. We'll deal with issues that arise as they come up,
such as for handling IO that wants byte counts when we're providing
character counts.

>
> It seems to me that .chars should return back the same object, if the
> underlying VM supports Unicode. I would guess that today that would
> include JRuby, and in the future, that would include Ruby 2.0.

chars would be easy to implement today; and really we may look at the
ActiveSupport::Multibyte way to handle Unicode as "the one way" we
also do it in JRuby. Rails is driving Unicode innovation at this
point, so if this sees wider adoption we're not opposed to including
it in core JRuby.

To be absolutely clear: we want to support Unicode natively in JRuby,
and we're really just looking to the community to decide what form
that should take. If there's something that can be done within Ruby
1.8-semantics that works with Ruby 1.8-compatible apps, we'll include
it.

--
Contribute to RubySpec! @ www.headius.com/rubyspec
Charles Oliver Nutter @ headius.blogspot.com
Ruby User @ ruby.mn

Charles O Nutter

Sep 26, 2006, 1:02:24 PM
to rubyonra...@googlegroups.com
On 9/20/06, Manfred Stienstra <man...@gmail.com> wrote:
>
> Three months ago Julian Tarkhanov submitted a test implementation of
> his ActiveSupport::Multibyte string extension patch. Since then we've
> been steadily improving the extension based on the feedback we received.

I'm studying it now...a few notes as I go and thoughts at the end:

- I think we could support Chars natively under JRuby pretty easily,
though everything would be UTF-16 internally. However Java has many,
many utilities already present for converting UTF-16 to damn near
every encoding under the sun, so this wouldn't be a real limitation.
Native JRuby support for Multibyte could potentially be significantly
faster than a pure Ruby version, but fully API-equivalent.
- We've been kicking around the possibility of migrating to a mutable
UTF-8 string inside JRuby, to avoid the wasted high byte and to get
all our mutability in a single type that's friendlier than what Java
provides. If we could say that the base JRuby String implementation
supports a fast, solid UTF-8 backing store, with Multibyte's
String#chars and Chars for actual multibyte operations, I think we'd
have the best of all worlds.

I like the interface I'm seeing so far. The separation of the base
"dumb" String from the "smart" multibyte-aware Chars seems to be the
path of least resistance. In my opinion, it's potentially the "right
way" long term too...let String remain a dumb byte-box and provide a
character-aware type that knows how to do the "right thing" with
multibyte encodings.
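
To make the split concrete, here is a minimal sketch of the idea on a current Ruby. CharsSketch is a hypothetical name made up for illustration; it is not the ActiveSupport::Multibyte implementation, only an assumption about the general shape of a character-aware wrapper around a plain String.

```ruby
# Hypothetical illustration only; NOT the ActiveSupport::Multibyte code.
# The wrapped String stays a plain byte container, and the wrapper
# answers questions in terms of code points.
class CharsSketch
  def initialize(string)
    @string = string
  end

  # Number of code points rather than number of bytes.
  def length
    codepoints.length
  end

  # Reverse by code point so multibyte sequences stay intact.
  def reverse
    self.class.new(codepoints.reverse.pack("U*"))
  end

  # Hand back the underlying plain String.
  def to_s
    @string
  end

  private

  # Decode the UTF-8 bytes into an array of integer code points.
  def codepoints
    @string.unpack("U*")
  end
end

word  = "na\u00EFve"            # "naive" with a diaeresis on the i
chars = CharsSketch.new(word)

puts word.bytesize              # byte count of the plain String
puts chars.length               # code point count from the wrapper
puts chars.reverse.to_s
```

Anything that needs bytes keeps talking to the String; anything that needs characters asks the wrapper.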

I also think you guys are going to drive unicode adoption in Ruby for
the next year. Matz's m17n is a long way out, and people need unicode
now. If something like MultiByte gains serious traction, there's going
to be a lot of pressure to support that API in the long term, and
there would be little reason we couldn't support it out of the box in
core JRuby right now.

Thijs van der Vossen

Sep 26, 2006, 2:22:49 PM
to rubyonra...@googlegroups.com
On 26 Sep 2006, at 17:09 , Charles O Nutter wrote:
> [...] we're probably just going to create some incompatibilities to
> solve the Unicode issue on our end. It's likely that in the future
> all strings in JRuby will be UTF-16 strings as in Java, and all
> operations will deal in characters instead of bytes wherever
> possible. We'll deal with issues that arise as they come up, such
> as for handling IO that wants byte counts when we're providing
> character counts.

Early versions of the unicode_hacks plugin redefined string methods
to work on codepoints instead of bytes. This turned out to break a
lot of libraries and applications in sometimes subtle but very nasty
ways. Patching up IO might work, but suppose you have something like
this:

header('Content-Length', body.length)

Here, length must return the number of bytes and not the number of
characters. How can you ever know what to return in this case?
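
A small sketch of how far apart the two counts can be, written for a current Ruby (the "\u" escapes and String#bytesize are assumptions about the runtime and are not available on the Ruby 1.8 releases discussed here):

```ruby
body = "Gr\u00FC\u00DFe"                 # "Grüße" as UTF-8

char_count = body.unpack("U*").length    # number of code points
byte_count = body.bytesize               # number of bytes sent on the wire

puts "characters: #{char_count}"         # prints "characters: 5"
puts "bytes:      #{byte_count}"         # prints "bytes:      7"
```

A Content-Length header built from the first number would be two bytes short for this body.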

Kind regards,
Thijs



Charles O Nutter

Sep 26, 2006, 2:59:23 PM
to rubyonra...@googlegroups.com
On 9/26/06, Thijs van der Vossen <t.vande...@gmail.com> wrote:

It's for exactly this reason I advocated a separate char sequence type
in future Ruby versions, and why I like AS::MB's approach to the
problem best so far.

>
> Kind regards,
> Thijs

Thijs van der Vossen

Sep 26, 2006, 3:30:37 PM
to rubyonra...@googlegroups.com
On 26 Sep 2006, at 19:02 , Charles O Nutter wrote:
> [...] Native JRuby support for MultiByte could potentially be
> significantly
> faster than a pure Ruby version, but fully API-equivalent.

Finally a good reason to run Rails on Java! :-)

> - We've been kicking around the possibility of migrating to a mutable
> UTF-8 string inside JRuby, to avoid the wasted high byte and to get
> all our mutability in a single type that's friendlier than what Java
> provides. If we could say that the base JRuby String implementation
> supports a fast, solid UTF-8 backing store normally and MultiByte's
> String#chars and Chars for actual multibyte operations I think we'd
> have the best of all worlds

Sounds like the way to go to me. UTF-8 is what Ruby has (although
limited) built-in support for and for most Rails apps it's what you
have to convert to in the end anyway.

Please note that ActiveSupport::Multibyte supports all Unicode planes
and not only the Basic Multilingual Plane. My knowledge of Java is
very limited, but judging from the article at [1], working with
anything beyond U+FFFF takes some serious effort.
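
To illustrate why code points beyond U+FFFF need extra care in a UTF-16 world, here is a rough sketch on a current Ruby (String#encode and String#bytesize are assumptions about the runtime, not something Ruby 1.8 offers). A single supplementary character takes two 16-bit units, a surrogate pair, in UTF-16:

```ruby
clef = [0x1D11E].pack("U")               # MUSICAL SYMBOL G CLEF, outside the BMP

code_points = clef.unpack("U*").length   # one character as far as Unicode is concerned
utf8_bytes  = clef.bytesize              # four bytes in UTF-8
utf16_units = clef.encode("UTF-16BE").unpack("n*")  # 16-bit code units

puts code_points                         # prints 1
puts utf8_bytes                          # prints 4
puts utf16_units.map { |u| format("%04X", u) }.join(" ")  # prints "D834 DD1E"
```

Any API that counts or indexes 16-bit units sees two items here where there is only one character.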

Kind regards,
Thijs

[1] http://java.sun.com/developer/technicalArticles/Intl/Supplementary/


Charles O Nutter

Sep 26, 2006, 3:51:07 PM
to rubyonra...@googlegroups.com
On 9/26/06, Thijs van der Vossen <t.vande...@gmail.com> wrote:
> Please note that ActiveSupport::Multibyte supports all Unicode planes
> and not only the Basic Multilingual Plane. My knowledge of Java is
> very limited, but judging from the article at [1] working with
> anything beyond U+FFFF takes some serious effort.

Yes, the limitations are fairly well-known for the "astral plane", but
hopefully they'll comprise rare edge cases. On the other hand, our
potential UTF-8 string implementation (a gift from Tim Bray) is
intended to do full Unicode support "the right way" so it could serve
as a general-purpose AS::MB implementation in JRuby as well as our
implementation of Ruby's normal String...

Pete Yandell

Sep 26, 2006, 8:27:25 PM
to rubyonra...@googlegroups.com
On 26/09/2006, at 6:06 PM, Michael Koziarski wrote:

> So, if we merged in ActiveSupport::Multibyte, and updated helpers
> like truncate to use the chars proxy, what other changes would be
> required to make this stuff simple? Normalisation of input
> parameters? Anything else?

As I said in an earlier email, the laundry list reads something like:
- Make sure your database character set is utf8 <- this should
possibly be checked by Rails
- Make sure all your tables have a character set of utf8 <- this
should be done in migrations
- Make sure your database.yml has 'encoding: utf8' set for each database
- Put $KCODE='u' in your environment.rb
- Add 'normalize_unicode_params :form => :kc' to your application.rb

> It would be nice if we could make it really easy to have this stuff
> 'just work' without much in the way of additional user intervention.

I'll sit down next week and write a plugin that does all this (if
someone doesn't beat me to it).

Cheers,

Pete Yandell

Michael Koziarski

Sep 26, 2006, 9:51:29 PM
to rubyonra...@googlegroups.com
> As I said in an earlier email, the laundry list reads something like:
> - Make sure your database character set is utf8 <- this should
> possibly be checked by Rails
> - Make sure all your tables have a character set of utf8 <- this
> should be done in migrations
> - Make sure your database.yml has 'encoding: utf8' set for each database

We can't change these without the user's intervention, and doing utf-8
with postgres is a little harder than just 'setting the encoding' for
the table. Perhaps this is just something we need to include in our
documentation?

> - Put $KCODE='u' in your environment.rb

We could update the railties templates, but people will still need to
manually update their application.

> - Add 'normalize_unicode_params :form => :kc' to your application.rb

Why do we need this? I can understand the rationale for doing iconv
for 'differently encoded' strings, but can't quite follow the
justification of normalization.

> I'll sit down next week and write a plugin that does all this (if
> someone doesn't beat me to it).

Sounds good.

--
Cheers

Koz

Manfred Stienstra

Sep 27, 2006, 2:48:42 AM
to rubyonra...@googlegroups.com
On Sep 27, 2006, at 2:27 AM, Pete Yandell wrote:
>
> As I said in an earlier email, the laundry list reads something like:
> - Make sure your database character set is utf8 <- this should
> possibly be checked by Rails

Like someone said before, setting your database encoding to utf-8 is
only important if you want to do string operations. Otherwise you can
just use the database as a bitbucket and it won't matter. I think
this should be the default in railties and not handled by a plugin.

> - Make sure all your tables have a character set of utf8 <- this
> should be done in migrations

The best solution is to set the default encoding of the database when
you create it, that way you can't miss a table and you still have the
option to override it for certain tables.

> - Make sure your database.yml has 'encoding: utf8' set for each
> database

Again, I think this is a matter of defaults in Rails.

> - Put $KCODE='u' in your environment.rb

This should probably be a default in environment.rb if we want Rails
to be completely utf-8.

> - Add 'normalize_unicode_params :form => :kc' to your application.rb

Compatibility normalization should _never_ be a default, because it
causes data loss. If there is a default, it should probably be NFC or
NFD. I'm still not convinced it's important to normalize all incoming
data.
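
A quick sketch of the data loss, using String#unicode_normalize from current Ruby as a stand-in (an assumption for illustration; it is not the plugin's normalize_unicode_params and did not exist in Ruby 1.8):

```ruby
original = "x\u00B2 \uFB01"                # "x" + SUPERSCRIPT TWO, then the "fi" ligature

nfc  = original.unicode_normalize(:nfc)    # canonical composition
nfkc = original.unicode_normalize(:nfkc)   # compatibility composition

puts nfc == original    # prints true: nothing is lost
puts nfkc == "x2 fi"    # prints true: superscript and ligature are gone
puts nfkc == original   # prints false
```

After the compatibility form there is no way to tell that the user typed a superscript or a ligature.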

> I'll sit down next week and write a plugin that does all this (if
> someone doesn't beat me to it).

The plugin that defines normalize_unicode_params is called
utf8_plugin and it's in the same repository as the rest of Multibyte
for Rails stuff. It was meant as a plugin to do all the utf-8
settings and operations you need to do utf-8 in a Rails application.

The plugin is a descendant of unicode_hacks and in the past this also
set the database client encoding and the content-type header. We feel
this is no longer necessary; we think it's better to solve this with
good defaults and documentation. Thijs van der Vossen is currently
writing a series of blog posts on our weblog about which steps you
have to take to have a fully utf-8 Rails; we hope to convert this to
documentation in the near future.

I am in no way trying to stop you from writing your own plugin, but I
hope you don't waste time going down the same route we did.

Manfred

Pete Yandell

Sep 27, 2006, 8:15:25 PM
to rubyonra...@googlegroups.com

On 27/09/2006, at 11:51 AM, Michael Koziarski wrote:

>> As I said in an earlier email, the laundry list reads something like:
>> - Make sure your database character set is utf8 <- this should
>> possibly be checked by Rails
>> - Make sure all your tables have a character set of utf8 <- this
>> should be done in migrations
>> - Make sure your database.yml has 'encoding: utf8' set for each
>> database
>
> We can't change these without the user's intervention, and doing utf-8
> with postgres is a little harder than just 'setting the encoding' for
> the table. Perhaps this is just something we need to include in our
> documentation?

We can certainly make sure tables created with migrations have the
right character set, and we can at least check and give a warning if
the various character sets (database, table, connection, Rails) don't
match up.
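
The check itself could be as small as the sketch below. charset_mismatches and the layer names are hypothetical, made up for illustration; how each value is actually read from MySQL or from Rails is left out entirely.

```ruby
# Hypothetical helper: returns the layers whose character set differs
# from the expected one. Obtaining the real values is out of scope here.
def charset_mismatches(charsets, expected = "utf8")
  charsets.reject { |_layer, charset| charset.to_s.downcase == expected }
end

# Example values, assumed for the sake of the sketch.
layers = {
  :database   => "utf8",
  :table      => "latin1",
  :connection => "utf8",
  :rails      => "utf8"
}

charset_mismatches(layers).each do |layer, charset|
  warn "WARNING: #{layer} character set is #{charset}, expected utf8"
end
```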

I don't know what's required for Postgres, but I'll build for MySQL
and somebody with Postgres experience can extend from there.

>> - Add 'normalize_unicode_params :form => :kc' to your application.rb
>
> Why do we need this? I can understand the rationale for doing iconv
> for 'differently encoded' strings, but can't quite follow the
> justification of normalization.

Because there are characters in unicode (for example 'ü') that can be
encoded in multiple possible ways (either as a single 'ü' character,
or as a 'u' followed by an umlaut modifier), and without
normalisation comparisons between them will fail.
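
For example, on a current Ruby (String#unicode_normalize is an assumption about the runtime; it is not part of Ruby 1.8 or of the plugin):

```ruby
precomposed = "\u00FC"     # LATIN SMALL LETTER U WITH DIAERESIS, one code point
decomposed  = "u\u0308"    # "u" followed by COMBINING DIAERESIS, two code points

puts precomposed == decomposed   # prints false: different byte sequences

same = precomposed.unicode_normalize(:nfc) == decomposed.unicode_normalize(:nfc)
puts same                        # prints true: equal once both are in NFC
```

Both strings look identical on screen, which is what makes the failed comparison so confusing.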

This is probably the trickiest area. When and how to normalise is
something a developer really needs to think about, so just enabling
auto-normalisation of all parameters is possibly not the solution.

Pete Yandell

Pete Yandell

Sep 27, 2006, 8:25:19 PM
to rubyonra...@googlegroups.com

On 27/09/2006, at 4:48 PM, Manfred Stienstra wrote:

> On Sep 27, 2006, at 2:27 AM, Pete Yandell wrote:
>>
>> As I said in an earlier email, the laundry list reads something like:
>> - Make sure your database character set is utf8 <- this should
>> possibly be checked by Rails
>
> Like someone said before, setting your database encoding to utf-8 is
> only important if you want to do string operations. Otherwise you can
> just use the database as a bitbucket and it won't matter. I think
> this should be the default in railties and not handled by a
> plugin.

I vehemently disagree! :) If you're storing UTF8 data, you shouldn't
have your database think it's latin1. (You can only get away with
this at all because MySQL is lazy with checking strings.) Backing up,
exporting, or accessing the database through something other than
Rails can all give you trouble if you do this.

>> - Make sure all your tables have a character set of utf8 <- this
>> should be done in migrations
>
> The best solution is to set the default encoding of the database when
> you create it, that way you can't miss a table and you still have the
> option to override it for certain tables.

I agree, but it would be nice to have Rails at least say "I think I'm
running in utf8 mode, so I'd better make sure the database agrees and
warn the developer if it doesn't."

>> - Add 'normalize_unicode_params :form => :kc' to your application.rb
>
> Compatibility normalization should _never_ be a default, because it
> causes data loss. If there is a default, it should probably be NFC or
> NFD. I'm still not convinced it's important to normalize all incoming
> data.

Yep, fair call. I think this is the trickiest point.

> I am in no way trying to stop you from writing your own plugin, but I
> hope you don't waste time going down the same route we did.

Well, if nothing else my plugin will be useful to me. I'm sick of
having to go through all the steps required to support unicode in
every app I write, and I've accidentally missed steps before in ways
that have caused nasty data corruption and been hard to fix. (Try
setting your database character set to utf8, but not setting the
connection character set.)
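
A rough simulation of that kind of corruption in plain Ruby, assuming the server treats the incoming UTF-8 bytes as Latin-1 and converts them to UTF-8 for storage (the exact behaviour depends on the MySQL version and settings, so take this as a sketch of the mechanism only):

```ruby
original = "\u00FC"    # "ü", two bytes in UTF-8: C3 BC

# The bytes arrive unchanged but are labelled Latin-1...
misread = original.dup.force_encoding("ISO-8859-1")

# ...and are then converted to UTF-8, one Latin-1 character at a time.
stored = misread.encode("UTF-8")

puts stored.unpack("U*").length   # prints 2: two characters instead of one
puts stored.bytesize              # prints 4: the data doubled in size

# Undoing it means reversing exactly the same wrong step.
recovered = stored.encode("ISO-8859-1").force_encoding("UTF-8")
puts recovered == original        # prints true
```

The repair only works if every row was damaged the same way, which is why this is so painful to fix after the fact.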

Pete Yandell

Michael Glaesemann

Sep 27, 2006, 10:00:57 PM
to rubyonra...@googlegroups.com

On Sep 28, 2006, at 9:15 , Pete Yandell wrote:

> We can certainly make sure tables created with migrations have the
> right character set, and we can at least check and give a warning if
> the various character sets (database, table, connection, Rails) don't
> match up.
>
> I don't know what's required for Postgres, but I'll build for MySQL
> and somebody with Postgres experience can extend from there.

In PostgreSQL, encoding is a database-level setting, not a table
attribute. IIRC, changing from one encoding to another requires
dumping the database, passing the dump through iconv, creating a new
database with the target encoding, and loading the dump into the new
database.

Michael Glaesemann
grzm seespotcode net


Pete Yandell

Sep 27, 2006, 10:14:49 PM
to rubyonra...@googlegroups.com

Yep, which is yet another reason to have UTF8 be the convention for
new Rails apps. Re-encoding all the strings in your database is not fun.

Pete Yandell

Thijs van der Vossen

Sep 28, 2006, 3:04:57 AM
to rubyonra...@googlegroups.com
On 28 Sep 2006, at 02:15 , Pete Yandell wrote:
> We can certainly make sure tables created with migrations have the
> right character set,

If you set the default character set to utf-8 when you create a MySQL
or PostgreSQL database, all tables you create using Rails migrations
will inherit this character set. In other words, if you create your
MySQL database with:

> CREATE DATABASE db_name CHARACTER SET utf8 COLLATE utf8_unicode_ci;

or your PostgreSQL database with:

$ createdb --encoding=UTF8 db_name

all tables will use UTF-8 so there's not really a need to set or
check the character set on the tables.


Julian 'Julik' Tarkhanov

Sep 28, 2006, 3:01:46 PM
to rubyonra...@googlegroups.com

On 28-sep-2006, at 4:00, Michael Glaesemann wrote:

> In PostgreSQL, encoding is a database-level setting, not a table
> attribute. I

AFAIK it's customizable all the way, from the cluster to the database
to the tables and columns.
And the locale of the postmaster user plays its part too.

Thijs van der Vossen

Sep 28, 2006, 3:15:05 PM
to rubyonra...@googlegroups.com
On 28 Sep 2006, at 21:01 , Julian 'Julik' Tarkhanov wrote:
> On 28-sep-2006, at 4:00, Michael Glaesemann wrote:
>> In PostgreSQL, encoding is a database-level setting, not a table
>> attribute. I
> AFAIK it's customizable all the way, from the cluster to the
> database to the tables and columns.

You can set the encoding on multiple levels, but you can only set the
locale, which defines the collation, when you create the 'database
cluster'. You can get a list of available locales on your system with
locale -a

> And the locale of the postmaster user plays its part too.

Only if you do not set the locale when you create the 'database
cluster'.

The easiest way that we could find to do UTF-8 with PostgreSQL is to
first create the 'database cluster' with:

$ initdb --locale=en_GB.UTF-8 -D data_dir

...and then create the databases with something like:

$ createdb --encoding=UTF8 db_name

And yes, you've read it right, you can only get _one collation type_
for your cluster...


Michael Glaesemann

Sep 28, 2006, 10:17:52 PM
to rubyonra...@googlegroups.com

On Sep 29, 2006, at 4:01 , Julian 'Julik' Tarkhanov wrote:

>
>
> On 28-sep-2006, at 4:00, Michael Glaesemann wrote:
>
>> In PostgreSQL, encoding is a database-level setting, not a table
>> attribute. I
> AFAIK it's customizable all the way, from the cluster to the database
> to the tables and columns.
> And the locale of the postmaster user plays its part too.

Could you please point me to where you can specify table or column
encodings separate from those of the database? Encoding from the
client side is negotiated by the client (so you might be sending
Latin-1 to the server and it gets translated to the database
encoding) so in some (weak) sense you can handle the data for tables
and columns in different encodings *on the client side*, but on the
server side, the encoding is fixed for the database at the time of
database creation.

At the time of initdb, a default encoding can be chosen for the
entire cluster, but it can be overridden for individual databases at
the time of database creation.

http://www.postgresql.org/docs/8.1/interactive/multibyte.html
