I tried to identify the general and root causes for these problems
with 1.9, by taking into account non-utf encoding, current patches,
comments and ideas. I used ticket #2188 as base for explanations.
This is a long read. I wanted to include all the relevant information
in one place. I also included information about related tickets in LH
and their status. I decided that adding parts of this to LH would just
add to the confusion.
Two patches are included (one is from Andrew Grim) that should fix one
issue (#2188) without breaking anything else. Two small steps for
Rails, one giant step for proper encoding support. I hope.
I welcome any feedback that would help get Rails closer to fully
supporting Ruby 1.9 and vice-versa.
SOLUTION:
---------
The general idea is: allow only one "internal" encoding in Rails at
any given time, based on the default Ruby encoding (or configurable).
And treat any incoming external strings that cannot be converted to
this "internal" encoding as errors in the gems in which they occur.
And possibly report mismatches before they even "enter" Rails, by
attempting to convert them into the "internal" encoding immediately.
As a result of enforcing this, all Rails tests should work with any
encoding that is a superset of the encodings used for input (db,
Rack, ERB, Haml, ...) in a given environment.
With an optimal setup (db encoding, Ruby encoding, Rack encoding
settings, I18n translations, ...), no transcoding will occur during
the rendering process, no matter which default Rails encoding is
used (including ASCII_8BIT), and no force_encoding would be needed
internally in Rails, except as workarounds for gems and libraries
where this is difficult otherwise.
The guideline for gem and plugin developers would be: do not create or
return strings (other than internal use) that are not compatible with
the default encoding both ways.
In some cases, it may be acceptable to drop or escape characters that
cannot be transcoded (maybe Rack input, for example).
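As a rough illustration of the guideline above (not part of either
patch), the "compatible both ways" check and the optional dropping of
untranscodable characters could look like the following on a current
Ruby. The helper names to_internal and compatible_both_ways? are made
up for this sketch, and the sample strings use \u escapes so that the
encoding of the file itself does not matter.

```ruby
# Hypothetical helpers for illustration only; these are not Rails API.

# Convert an incoming string to the single "internal" encoding.
# With lossy: true, characters that cannot be transcoded are dropped.
def to_internal(str, internal = Encoding.default_external, lossy: false)
  opts = lossy ? { invalid: :replace, undef: :replace, replace: "" } : {}
  str.encode(internal, **opts)
end

# True if the string survives a round trip through the internal encoding.
def compatible_both_ways?(str, internal = Encoding.default_external)
  str.encode(internal).encode(str.encoding) == str
rescue EncodingError
  false
end

utf8 = "\u65E5\u672C\u8A9E"   # three kanji, encoded as UTF-8
euc  = utf8.encode("EUC-JP")  # the same text as an EUC-JP source would supply it

puts compatible_both_ways?(euc, Encoding::UTF_8)      # EUC-JP fits into UTF-8
puts compatible_both_ways?(utf8, Encoding::US_ASCII)  # kanji do not fit into US-ASCII
puts to_internal("a\u65E5b", Encoding::US_ASCII, lossy: true)
```

Without lossy: true, to_internal lets the conversion error propagate,
which is the "report the mismatch early" behaviour described above.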
The idea is based on:
- Jeremy Kemper's strong attitude toward avoiding solutions
requiring UTF-8 as default or forcing it
- Yehuda's opinion about using UTF-8 as default in Ruby instead of
ASCII-8BIT
- James Edward Gray's solution for encoding issues in CSV
- the multitude of ways to set the encoding in Ruby
- giving everyone the liberty to use any encoding they want for any
task, without the need of porting and modifying existing code if
possible
- personal experience with many encoding pitfalls
For those interested in Ruby encoding support, I very much recommend
the extremely well written in-depth article by James Edward Gray II:
http://blog.grayproductions.net/articles/understanding_m17n
Results of "Please do investigate":
----------------------------------
The ticket:
#2188: (March 9th, 2009): Encoding error in Ruby1.9 for templates
Actual cause: ERB uses force_encoding("ASCII-8BIT") which is just an
alias for "BINARY". This is actually ok, except for the way Ruby 1.9
handles concat with a non-BINARY string, e.g. UTF-8:
>> '日本'.force_encoding('BINARY').concat('語'.force_encoding('UTF-8'))
Encoding::CompatibilityError: incompatible character encodings: ASCII-8BIT and UTF-8
Although the following works (equivalent to how Ruby 1.8 works):
>> '日本'.force_encoding('BINARY').concat('語'.force_encoding('BINARY'))
=> "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E"
The surprise is that it "sometimes works": when the BINARY string
contains only 7-bit ASCII characters the concat succeeds, giving the
impression that a patch fixed the problem:
>> 'abc'.force_encoding('BINARY').concat('語'.force_encoding('UTF-8'))
=> "abc語"
(I used force_encoding here for consistency in different locale
settings).
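The three irb cases above can be predicted with Encoding.compatible?,
which returns the encoding a concat would produce, or nil if it would
raise. The helper name concat_outcome is made up for this sketch, and
\u escapes are used instead of literal kanji so the file encoding does
not matter.

```ruby
# Hypothetical helper: report what concat would do, without raising.
def concat_outcome(left, right)
  return :incompatible unless Encoding.compatible?(left, right)
  (left.dup << right).encoding
end

kanji = "\u65E5\u672C"   # non-ASCII text, UTF-8
go    = "\u8A9E"         # one more kanji, UTF-8

binary_kanji = kanji.dup.force_encoding("BINARY")
binary_ascii = "abc".dup.force_encoding("BINARY")
binary_go    = go.dup.force_encoding("BINARY")

p concat_outcome(binary_kanji, go)         # both sides non-ASCII, labels differ
p concat_outcome(binary_ascii, go)         # ASCII-only receiver adopts UTF-8
p concat_outcome(binary_kanji, binary_go)  # both BINARY, as in Ruby 1.8
```

The second case is the misleading one: an ASCII-only BINARY string is
compatible with anything ASCII-compatible, so the error only appears
once real non-ASCII data reaches the template.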
Solutions that come to mind:
----------------------------
1. force_encoding should not be used, unless really necessary, and
this rule should be applied to ERB. Unfortunately, I have no idea
why ERB uses force_encoding, but I can come up with a few reasons,
the main one being: Rails uses ERB (a general lib) for a specific
purpose and requiring a non-ASCII-8BIT encoding is just as specific.
I would really like an opinion on this.
2. Don't use ERB. AFAIK, this is why Rails 3.0 works.
3. Treat everything as binary, since the resulting file is sent to a
browser, which will detect the encoding anyway. This also doesn't
affect performance, but it ruins the whole idea of having encoding
support, possibly breaking test frameworks instead.
4. Force UTF-8. This is the brute-force idea used in many patches
and workarounds, and this prevents commits from happening. People
should have a right to use non-utf8 ERB files and render in any
encoding e.g. EUC-JP.
5. Try to be intelligent, and guess. This means handling
everything, except BINARY. The problem is how do we know what
encoding to use for template input? And what encoding do we use for
output?
Solution 1 would be best, but ERB's force_encoding is already in the
wild with Ruby 1.9, including ruby-head, so that leaves solution 5.
Option 3 is a way to get Ruby 1.9 to behave more like 1.8, but it will
require all template input strings to be set to BINARY.
Solution 5
----------
force_encoding has to be used at least once somewhere in
Rails - to fix what ERB "breaks", but on what basis should the
encoding be selected? For performance, there should be no
transcoding during rendering, unless absolutely necessary.
When we think about it, the output depends on what we want the
browser to receive, and that is why many people are pushing UTF-8:
the layout usually has UTF-8 anyway, and it would otherwise have to
be parsed to get the encoding from the content-type value.
The input used in rendering a template is a mixture of what web
designers provide, the translators use, the databases return and
Rack emits, among other things.
The policy in Rails could be: "don't allow multiple encodings
during template rendering". I believe the effort required to do
otherwise is not justified.
This would force other gem developers to provide a way to set or
read the encoding they use, or to stick with the current default.
In this case (#2188), either ERB has to provide a way to return the
result in an encoding specified by Rails, or the ERB handler should
be adapted to provide this functionality.
The problem with this: ERB templates do not have an embedded
encoding. Which means we need a way to specify the encoding used in
the template.
Andrew Grim fixes this in his patch here:
https://rails.lighthouseapp.com/projects/8994/tickets/2188/a/359640/erb_encoding.diff
I am only worried about the default case, when no encoding is set.
"ASCII_8BIT", the result of ERB, is not acceptable, unless the
"internal" encoding would also be BINARY. I would propose merging the
following with the patch above:
def compile(template)
  input = "<% __in_erb_template=true %>#{template.source}"
  src = ::ERB.new(input, nil, erb_trim_mode, '@output_buffer').src

  if RUBY_VERSION >= '1.9' and src.encoding != input.encoding
    if src.encoding == Encoding::ASCII_8BIT
      src = src.force_encoding(input.encoding) #ERB workaround
    else
      src = src.encode(input.encoding)
    end
  end

  # Ruby 1.9 prepends an encoding to the source. However this is
  # useless because you can only set an encoding on the first line
  RUBY_VERSION >= '1.9' ? src.sub(/\A#coding:.*\n/, '') : src
end
And here is an example test case, similar to many others already in
the tickets, which shows the issue:
<%= "日本" %><%= "語".force_encoding("UTF-8") %>
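The encoding fix-up from the proposed compile method can be tried in
isolation, without ERB or Rails. In the sketch below the method name
normalize_src_encoding is made up, and the BINARY string merely
stands in for what ERB returns; it is not produced by ERB here.

```ruby
# Hypothetical extraction of the fix-up step from the compile method.
def normalize_src_encoding(src, input_encoding)
  return src if src.encoding == input_encoding

  if src.encoding == Encoding::ASCII_8BIT
    # Bytes are already correct, only the label is wrong: relabel them.
    src.dup.force_encoding(input_encoding)
  else
    # A real, different encoding: transcode.
    src.encode(input_encoding)
  end
end

input    = "\u65E5\u672C"                      # template text, UTF-8
from_erb = input.dup.force_encoding("BINARY")  # stand-in for ERB's output

fixed = normalize_src_encoding(from_erb, input.encoding)
puts fixed.encoding
puts((fixed + "\u8A9E").encoding)              # concat no longer raises
```

Relabeling is only safe for the ASCII_8BIT case because the bytes
came from the input in the first place; for any other mismatch the
sketch transcodes, so errors surface inside the handler.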
A few things here to note (for both patches put together):
- the fallback encoding would be assumed to be the same as ruby
default, which can be set by the locale, RUBYOPT with -K option,
or using Encoding.default_*. I believe this is sufficient
flexibility.
- note that there are no assumptions regarding the charset and the
ASCII_8BIT case is handled with this in mind
- obviously, test cases would be executed with different Ruby
encoding defaults - testing one setup no longer guarantees
anything. Rails tests should work with almost any default
encoding, which means testing at least on 3 should be recommended
before a patch is committed: (BINARY + UTF-8 + EUC ?).
- similar conversion to the "internal" encoding would be required
for all strings from other engines, databases and Rack, regardless
of whether they are in UTF-8 or not. As for Rack and strings
submitted through forms, they should ultimately be also in the
"internal" encoding and not BINARY (unless "internal" *is*
BINARY), but getting this to work is a can of worms in itself
(AFAIK, this is true for native Japanese sites, where assuming
UTF-8 is almost never valid).
- there are a few other places where ERB is used, but I prefer to
leave that until this single case is solved. Fixing other
template issues should be done separately.
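For quick experiments, a test can also switch the default encoding
in-process instead of restarting Ruby with another -K option. This is
only a sketch: with_default_external is a made-up helper, and it is a
weaker check than a real run under RUBYOPT, because string literals
in files that are already loaded keep their source encoding.

```ruby
# Hypothetical helper: run a block under another default external
# encoding and always restore the previous one.
def with_default_external(encoding)
  previous = Encoding.default_external
  Encoding.default_external = encoding
  yield
ensure
  Encoding.default_external = previous
end

%w[BINARY UTF-8 EUC-JP].each do |name|
  with_default_external(name) do
    # A real suite would render templates or read fixtures here.
    puts "running with default_external = #{Encoding.default_external}"
  end
end
```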
I hope this is enough to be committed into 2-3-stable, at least as a
first step after many months of threads, discussions, issues,
tickets, articles, without any fully acceptable patches or progress.
Also, I believe the tickets in LH need some love - just to straighten
out the issue and introduce more clarity. The best results would be to
start closing the tickets with definite conclusions and guidelines, so
that people start using Ruby 1.9 with Rails, so plugin developers in
turn get enough time and feedback to get things right.
IMPORTANT: I had no intention of offending anyone by the following
digests - I just wanted to provide an overview of the lack of
progress, the complexity of the issue and the willingness to help,
despite months without progress. I admit I have no idea what prevented
the problem from being solved a long time ago.
Ticket #2188:
https://rails.lighthouseapp.com/projects/8994/tickets/2188-i18n-fails-with-multibyte-strings-in-ruby-19-similar-to-2038
1. Incorrect mention of I18n and #2038 as similar error
2. Correctly identified problem (Hector E. Gomez Morales)
3. Patch forcing UTF8 as workaround, #1988 reported as dup (Hector)
4. Unintentional hijacking with a MySQL problem (crazy_bug)
5. MySQL DB problem redirected to #2476 (Hector)
6. Unintentional hijacking with a HAML problem (Portfonica)
7. Jakub Kuźma identifies a wider set of problems
8. Jakub Kuźma identifies Rack problems
9. Adam S talks about setting default encoding in Rails
10. Jérôme points out the need for a default encoding for erb
files
11. Jeremy Kemper notes that the reports are not really helpful
12. Rocco Di Leo provides detailed test case, but formatting
problems make it unreadable
13. Adam S suggests solving the problem by converting ASCII ->
UTF8
14. hkstar mentions the lack of progress
15. Jeremy Kemper notes that the issue still hasn't been properly
investigated
16. Turns into a discussion about UTF-8 support in 1.9
17. Andrew Grim proposes alternative patch that honors ERB
template encoding
18. ahaller notes strange behaviour in ERB
19. Marcello Barnaba proposes general monkey patch for ActionView,
probably related to Rack issues
20. UVSoft proposes patch for HAML
21. Alberto describes the problem - just as Hector did
22. TICKET STATUS IS STILL OPEN WITH NO ACCEPTABLE PATCH
What I propose is combining the two patches above to close this
issue, and give references to non-related tickets which give a
similar error.
Ticket #1988: Make utf8 partial rendering from within a content_for work in ruby1.9
https://rails.lighthouseapp.com/projects/8994/tickets/1988
1. Patch that works around the issue
2. Jeremy Kemper does not accept the patch because it is UTF-8-only
3. TICKET STATUS IS INCOMPLETE
What I propose is solving #2188 first and then investigating this
bug further - it could be a bad assumption about the encoding of
strings returned by tag helpers in a specific case.
Ticket #2476: ASCII-8BIT encoding of query results in rails 2.3.2 and ruby 1.9.1
https://rails.lighthouseapp.com/projects/8994/tickets/2476
1. Hector describes database adapter problem with 1.9 encodings,
provides a mysql-ruby fork and other links
2. Patches and fixes for databases / adaptors (James Healy, Jakub
Kuźma, Yugui)
3. Talk about assuming UTF-8 for databases
4. Loren Segal proposes hack instead of modifying mysql-ruby
5. Micheal Hasensein asks about issue 5 months later
6. UVSoft accidentally posts HAML workaround
7. TICKET STATUS IS NEW
My proposal - after fixing #2188, a short description of
adapters/databases and fixed versions could be presented - and
possibly have this issue closed, to prevent it being listed as a
pending UTF-8 issue. Work could be started on validation code for
the strings returned by database adapters and their compatibility
with the "internal" encoding.
Open/new tickets related to Rack:
https://rails.lighthouseapp.com/projects/8994/tickets/3331-patch-block-invalid-chars-to-come-in-rails-app
https://rails.lighthouseapp.com/projects/8994/tickets/3392-rackinput-requires-ascii-8bit-encoded-stringio
https://rails.lighthouseapp.com/projects/8994/tickets/4336-ruby19-submitted-string-form-parameters-with-non-ascii-characters-cause-encoding-errors
My proposal: gather issues and investigate with the help of people
working with non-utf and non-ascii input - I believe Japan is such
a country, where UTF-8 assumptions about Rack input are wrong.
I would like to thank everyone who invested even the slightest bit of
time in solving this issue.
I hope the information here will help find a solution that will work
without issues for years to come and that creating Rails applications
will be an enjoyable experience for users, designers, developers,
translators and all contributors, regardless of their environment and
language preferences.
--
Cezary Baginski
On Mon, Apr 19, 2010 at 11:30:16AM -0700, Jeremy Kemper wrote:
> On Mon, Apr 19, 2010 at 6:58 AM, Czarek <cezary....@gmail.com> wrote:
> > The general idea is: allow only one "internal" encoding in Rails at
> > any given time, based on the default Ruby encoding (or configurable).
I chose Encoding::default_external for this.
The short story is that Encoding::default_internal shouldn't really
matter for Rails.
> > As a result of enforcing this, all Rails tests should work with any
> > encoding
Probably the most convenient way to test this is:
RUBYOPT=-Ke rake tests
See #4466 for an example test script for ActionPack and the trivial
fixes that make everything work.
> > The guideline for gem and plugin developers would be: do not create or
> > return strings (other than internal use) that are not compatible with
> > the default encoding both ways.
> >
> > In some cases, it may be acceptable to drop or escape characters that
> > cannot be transcoded (maybe Rack input, for example).
>
> +1
String#{encode,encode!} both have nice options for replacing
characters and provide almost all the necessary functionality
(force_encoding handles a few other surprise cases). Rack, and
converting between incompatible encodings, are places where this
seems useful.
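A small sketch of both points, using only core String methods: the
xml: :text option of String#encode escapes characters the target
encoding cannot hold, while a BINARY string cannot be encoded at all
and has to be relabeled with force_encoding. The sample text is an
arbitrary example.

```ruby
text = "\u65E5 & \u672C"   # two kanji around an ampersand, UTF-8

# Escape instead of failing: undefined characters become numeric
# character references, and & < > are escaped as well.
escaped = text.encode("US-ASCII", xml: :text)
puts escaped

# BINARY has no character mapping, so encode refuses non-ASCII bytes...
bytes = text.dup.force_encoding("BINARY")
begin
  bytes.encode("UTF-8")
rescue Encoding::UndefinedConversionError => e
  puts "encode failed: #{e.class}"
end

# ...and relabeling is the only way back to a character encoding.
relabeled = bytes.dup.force_encoding("UTF-8")
puts relabeled == text
```

Relabeling asserts that the bytes are UTF-8 without verifying it, so
a valid_encoding? check afterwards is a reasonable precaution.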
> I don't know why ERB forces encoding to ASCII-8BIT in the absence of a
> magic comment. See r21170. The ERB compiler should probably take a
> default source encoding option that's used if the magic comment is
> missing.
Two issues are worth mentioning: regexes have their own
encoding semantics and force_encoding is actually necessary if you
want to "encode" a string to or from ascii-8bit specifically.
ERB uses a regex to detect the encoding comment, but the regex has to
have the same encoding as the source stream, so ERB uses ASCII-8BIT to
be able to run the regex on the stream, regardless of the stream's
encoding.
Then ERB continues to use that ASCII-8BIT string for compiling, which
seems to be ok, because the strings are passed to eval, with an
encoding comment in the beginning...
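The regex behaviour can be reproduced outside ERB. In this sketch the
pattern CODING_LINE and the helper magic_encoding are simplified
stand-ins, not ERB's actual code: a regex with non-ASCII content is
tied to one encoding, whereas relabeling the source as BINARY lets an
ASCII-only pattern run on it whatever its encoding is.

```ruby
euc_source = "# coding: euc-jp\n\u65E5\u672C\u8A9E".encode("EUC-JP")

# A regex containing non-ASCII characters has a fixed encoding and
# cannot be matched against non-ASCII text in another encoding.
fixed_utf8 = /\u65E5/
begin
  fixed_utf8 =~ euc_source
rescue Encoding::CompatibilityError => e
  puts "mismatch: #{e.message}"
end

# Hypothetical, simplified magic-comment pattern (ASCII only).
CODING_LINE = /\A#\s*coding:\s*([\w.-]+)/

# Relabel as BINARY, then search the raw bytes.
def magic_encoding(source)
  bytes = source.dup.force_encoding(Encoding::BINARY)
  name = bytes[CODING_LINE, 1]
  name && Encoding.find(name)
end

p magic_encoding(euc_source)
```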
The problem actually lies elsewhere: ERB didn't detect the encoding,
because the encoding magic wasn't in the first tag. The first tag was
added by Rails ERB handler:
"<% __in_erb_template=true %><%# encoding ...."
Andrew Grim worked this out and created a patch for this in #2188.
Should ERB search the whole stream for an encoding tag? Or should
Rails guarantee the first tag has the encoding information? I believe
the second option will save more time. Erubis is also a reason to
forget about patching ERB directly.
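To make the second option concrete, here is a minimal sketch of a
handler that keeps the encoding comment in the first tag. It is not
Andrew Grim's patch: MAGIC_TAG is a simplified pattern assumed for
this example, and first_tag_encoding and prepend_rails_tag are
made-up names.

```ruby
# The tag that the Rails ERB handler puts in front of every template.
RAILS_TAG = "<% __in_erb_template=true %>"

# Hypothetical, simplified pattern: an encoding comment in the FIRST tag.
MAGIC_TAG = /\A<%#[^%]*?coding[:=]\s*([\w.-]+)[^%]*%>/

# Mimics a detector that only inspects the first tag.
def first_tag_encoding(source)
  name = source[MAGIC_TAG, 1]
  name && Encoding.find(name)
end

# Insert the Rails tag AFTER the encoding comment, if there is one.
def prepend_rails_tag(source)
  magic = source[MAGIC_TAG]
  return RAILS_TAG + source unless magic
  magic + RAILS_TAG + source[magic.length..-1]
end

template = "<%# encoding: euc-jp %>hello"

p first_tag_encoding(template)                     # found
p first_tag_encoding(RAILS_TAG + template)         # hidden behind the Rails tag
p first_tag_encoding(prepend_rails_tag(template))  # found again
```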
> Using Erubis is a possibility as well.
Patching the ERB problem taught me that although this will solve many
encoding issues and headaches, it may unfortunately hide a few general
design flaws that should be worked on before Rails 3.0 or Ruby 1.9.2
become production ready.
The workarounds I used for patching ERB seem actually quite generic.
They allow one to have partials in different encodings and even have
ASCII-8BIT as the Ruby default_external without breaking anything.
And any encoding incompatibilities occur during encode! calls in the
ERB handler - close to the problem.
Something similar could be done for db adapters, because just like the
template handler being ERB instead of Erubis, people can have
old/broken libs, gems and plugins. And since Rails is becoming more
modular with 3.0, additional issues may surface, slowing down
development in the long run.
>
> > 3. Treat everything as binary, since the resulting file is sent to a
> > browser, which will detect the encoding anyway. This also doesn't
> > affect performance, but it ruins the whole idea of having encoding
> > support, possibly breaking test frameworks instead.
>
> -1
Actually, it turns out that supporting everything as binary takes
really no more effort than supporting multiple encodings, and it is a
good way to test Rails, applications and gems. ASCII-8BIT is the most
restrictive when it comes to encoding, making it ideal for regression
tests. Allowing an application to support ASCII-8BIT through
default_external requires more effort, but is worth it.
> > 4. Force UTF-8. This is the brute-force idea used in many patches
> > and workarounds, and this prevents commits from happening. People
> > should have a right to use non-utf8 ERB files and render in any
> > encoding e.g. EUC-JP.
>
> -1
Complementary to ASCII-8BIT, UTF-8 is ideal for an 'internal' encoding
and for detecting cases where ASCII-8BIT is (mis)used. UTF-8 should
actually *be* used when there are multiple - incompatible otherwise -
encodings. Ruby 1.8 just glues anything together, but in 1.9
everything should first be encoded to something as general as UTF-8
before encoded to ASCII-8BIT (if there is such a need). For example,
this would allow people to make ISO2022_JP web pages from EUC-JP
templates and SJIS databases - by using UTF-8 as the internal
encoding.
Although choosing UTF-8 seems wrong, in this case it prevents us from
losing encoding information by converting to ASCII-8BIT.
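The EUC-JP / SJIS / ISO2022_JP example can be sketched with plain
String#encode calls. The variable names and the sample text are made
up for illustration; the strings are built from \u escapes and
transcoded, standing in for data that would really come from a
template file and a database.

```ruby
template_part = "\u65E5\u672C".encode("EUC-JP")   # as read from an EUC-JP template
db_part       = "\u8A9E".encode("Shift_JIS")      # as returned by a SJIS database

# Gluing them together directly is what Ruby 1.9 refuses to do.
begin
  template_part + db_part
rescue Encoding::CompatibilityError => e
  puts "direct concat failed: #{e.message}"
end

# Pivot through UTF-8 as the single internal encoding...
internal = Encoding::UTF_8
buffer = [template_part, db_part].map { |part| part.encode(internal) }.join

# ...and transcode once more on the way out.
page = buffer.encode("ISO-2022-JP")
puts page.encoding
```

Every conversion here goes through a real character encoding, so no
step discards the information about what the bytes mean.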
> We could set a single default encoding for the app, like we're doing
> in Rails 3.
I admit I haven't even tried Rails 3.0. Shame on me.
A single default encoding within rails is a must to gracefully handle
the example I gave above (with EUC, SJIS and ISO2022). Of course UTF-8
is reasonable, but there is no reason to assume UTF-8 for all cases.
> The ERB compiler is supposed to preserve the input file's source
> encoding unless it has a magic comment. Puzzled why this is necessary.
> It should also be fixed in ERB itself, I think.
Rails inserts code that breaks ERB's magic comment detection. How does
Erubis handle the issue? Does it regex the stream?
> > - obviously, test cases would be executed with different Ruby
> > encoding defaults - testing one setup no longer guarantees
> > anything. Rails tests should work with almost any default
> > encoding, which means testing at least on 3 should be recommended
> > before a patch is committed: (BINARY + UTF-8 + EUC ?).
Actually, all 5 cases could be used in Rails tests and in apps:
- no K option, Ks (sjis), Ke (euc-jp), Ku (utf-8), Kn
(binary/ascii-8bit)
ActionPack is trivial to fix. Other Rails gems may require more work.
>
> Ok, good. They'll need to be rebased against master, and I think
> Andrew's patch breaks some tests since it changes the ERB line
> numbers.
I haven't noticed this. Could you provide some details? I am wondering
how I missed this.
I didn't check his patch too thoroughly, since I was busy getting a
patch for #2188 out the door.
I only checked my own patch (based on his) on ActionPack and
ActiveSupport. Currently, everything seems to work, so let me know if
I overlooked something.
> Rack is woefully lagging on encoding support. It needs an encoding
> push of its own.
>
> Ruby CGI has updated to include just-enough support, e.g. for giving
> an encoding for parsed query parameters.
I would handle Rack last or at least after Rails tests work in all the
encodings. The reason is: I learned not to underestimate encoding
problems and leaving Rack for last seems like a good choice.
> Indeed! Thanks for leading the charge, Cezary.
I'm happy to be helpful in some way.
>
> jeremy
--
Cezary Baginski
Thanks :)
> Maybe one additional thing: make all generators put the magic comment
> with the standard encoding at the top of all source files they create.
> Does that sound like a good idea? Should we open a ticket for it?
This is a great idea, since people new to Rails are usually both new
to Ruby and users of generators. The question is: how do we choose
the encoding? Consider the following:
% LC_CTYPE=en_US ruby -e 'p IO.read("_foo.rhtml").encoding'
#<Encoding:US-ASCII>
% LC_CTYPE=en_US.UTF-8 ruby -e 'p IO.read("_foo.rhtml").encoding'
#<Encoding:UTF-8>
This is important for partials. People will eventually create partials
without the encoding information, which will be rendered from
templates. I would prefer us-ascii to be used by generators instead of
Ruby's Encoding::default_external for the following reasons:
- user may have a non-UTF8 environment, and us-ascii will more likely
give an error closer in the call stack to the file without the
encoding comment
- user shouldn't really use non ascii characters in partials and
templates - i18n is the solution and will help localize the
application when it goes global
- this would help adopt using '# encoding: us-ascii' as a no-brainer
solution instead of '# encoding: utf-8' which usually just makes
problems more obscure
The only upside to using UTF-8 at all instead is quickly fixing huge
sites with many localized pages, but generators are for new projects
anyway.
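The "error closer to the file" argument can be sketched as follows.
read_template is a made-up helper, not something Rails provides: it
reads a file with an explicit encoding instead of the locale default
and fails right away if the bytes do not fit. It assumes
Encoding.default_internal is not set, so that no transcoding happens
on read.

```ruby
require "tempfile"

# Hypothetical helper: read a template under an assumed encoding and
# complain immediately, naming the file, if the bytes do not fit.
def read_template(path, encoding = "US-ASCII")
  source = File.read(path, encoding: encoding)
  unless source.valid_encoding?
    raise ArgumentError,
          "#{path} is not valid #{encoding}; add an encoding comment"
  end
  source
end

Tempfile.create(["_foo", ".rhtml"]) do |file|
  File.binwrite(file.path, "<p>caf\u00E9</p>")   # UTF-8 bytes on disk

  puts read_template(file.path, "UTF-8").encoding  # explicit encoding: fine

  begin
    read_template(file.path)                       # assumed us-ascii: rejected
  rescue ArgumentError => e
    puts e.message
  end
end
```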
So, by all means, yes, please open a ticket, since this may not be too
trivial and encoding issues will more likely need good understanding
rather than assuming Rails can and will magically fix everything.
> Just to clarify how important this issue is: Rails 2.3 claims to be
> Ruby 1.9 compatible, but until this is fixed, even the most trivial of
> applications simply don't work on 1.9, especially if the application
> is in a language that often uses non-ASCII characters (pretty much
> anything other than English, in other words). This has prevented me
> from moving to Ruby 1.9.
The m17n support in Ruby 1.9 and later is a great concept.
Unfortunately, balancing:
- correctness
- performance
- robustness in a production environment
quickly turns encoding problems into philosophical debates. Without a
deep understanding of encoding internals it is too easy to "fix"
things by just converting to UTF-8, hiding the real issues.
Thanks for bringing this up!
>
> /Jonas
>
--
Cezary Baginski
Forgive me for not making the context clear. There is no 'policy'
here, just a suggested generator default behavior for users writing
mainly US applications, possibly wishing to easily globalize their
applications in the future. In *this* case specifically, my
conclusions are:
- using utf-8 instead of us-ascii for encoding comments hides
problems for those users
- people with no experience in encodings other than us-ascii will
forget the encoding comments more often than not
- Ruby 1.9 chokes when trying to convert two non us-ascii compatible
strings
- generators could create files with us-ascii by default to prevent
the above
If that case does not describe your own, chances are you already know
what you are doing and Rails gives you all the freedom you can get to
adapt things to your own situation, choosing the right tool for the
right job.
The reason for the proposed generator default is *exactly* to help
people unaware of encoding problems to deliver applications that spare
others the suffering and grief.
>
> On Apr 25, 1:01 pm, Czarek <cezary.bagin...@gmail.com> wrote:
> ....
> > - user shouldn't really use non ascii characters in partials and
> > templates - i18n is the solution and will help localize the
> > application when it goes global
> ...
--
Cezary Baginski
I didn't correctly state what I meant and thank you for helping me
realize that :)
What I did mean was that users shouldn't assume non-ascii characters
will always work correctly with Ruby 1.9, without specifying encoding
comments or assuring specific, correct environment settings. So, let
me rephrase myself:
Users should not be able to use non-ascii characters in a us-ascii
environment without providing an alternative encoding comment or
overriding the environment settings. If neither of these is
acceptable, i18n is a suggestion.
This behavior would be consistent with the way Ruby loads source
files. The reason is that doing otherwise can give obscure, hard to
track encoding problems, looking like Rails bugs.
By supplying a _default_ "us-ascii" encoding comment in generated
template files, we help people oblivious to encoding details to do the
right thing or do the necessary research (i18n, change encoding
comments, localized versions of pages, etc).
Encoding problems can be so frustrating, it is easy to perceive US
developers as being ignorant. The truth is, it is unusual for them to
even experience the problems or reproduce without effort, let alone
research ways to test the issues effectively. This feature may
slightly help with the latter.
Suggestion
----------
I am wondering if Rails could actually assume us-ascii for all types
of template files without a specified encoding, or emit a warning
unless Ruby is running in full UTF-8 mode (-Ku) or full binary (-Kn).
The fix would be adding encoding comments to all the files and may
mean a lot of work for existing projects. On the other hand, this is
more consistent with how Ruby 1.9 handles source files, so it won't be
a surprise to anyone.
This would prevent people from forgetting to put encoding comments in
partials, for example. And if this would really be troublesome, people
can always stick to Ruby 1.8 or run their servers with -Ku or even
-Kn.
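One possible reading of this suggestion, as a sketch only: the
pattern ENCODING_COMMENT and the method warn_about_template? are
assumptions made for this example, and the rule it implements (warn
when a template has non-ascii bytes, no encoding comment, and Ruby is
in neither UTF-8 nor binary mode) is my interpretation of the idea
above rather than an agreed design.

```ruby
# Hypothetical, simplified pattern for an encoding comment in the first tag.
ENCODING_COMMENT = /\A<%#[^%]*coding[:=]/

# Should Rails warn about this template source?
def warn_about_template?(source, default_external = Encoding.default_external)
  # Full UTF-8 (-Ku) or full binary (-Kn) mode: nothing to warn about.
  return false if [Encoding::UTF_8, Encoding::BINARY].include?(default_external)

  bytes = source.dup.force_encoding(Encoding::BINARY)
  return false if bytes =~ ENCODING_COMMENT   # the author stated an encoding

  !bytes.ascii_only?                          # us-ascii assumption is violated
end

partial = "<p>caf\u00E9</p>"

p warn_about_template?(partial, Encoding::EUC_JP)
p warn_about_template?(partial, Encoding::UTF_8)
p warn_about_template?("<%# encoding: utf-8 %>" + partial, Encoding::EUC_JP)
```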
Would anyone care to comment on this idea?
And thank you too for helping out! Especially for giving the summary
of rack issues with patches, which obviously saved me hours of
research.
It may be a while before Rails 2.3 becomes 1.9 compatible as the
result of detailed test cases and well thought out, politically
correct patches, but it is so encouraging to see Rails users not
giving up!
Thanks again!
--
Cezary Baginski