On 12 Sep 2010, at 13:48, Hongli Lai wrote:
> The current Rack specification doesn't say anything about the encoding
> of the value strings in the Rack environment. However from various bug
> reports it has become clear that Rails and possibly many other apps
> expect some value strings, such as REQUEST_URI, to be UTF-8. See #16
> at http://code.google.com/p/phusion-passenger/issues/detail?id=404.
I'm not sure, but I don't think they expect them to be UTF-8 as such; they expect them to be compatible with string literals.
> I believe the encoding should be standardized. Here are some ideas
> that might serve as a starting point for discussion.
Maybe. From what I am told, we will need users of CP932 and other awkward encodings to come forward and test, so that we neither break their stuff nor have to revert these specs in future. Binary is the only lossless representation we have right now. I personally can't (fully) audit this yet, although I'm trying to learn.
> PATH_INFO and QUERY_STRING are usually extracted from REQUEST_URI,
> however REQUEST_URI is not standardized even though lots of people use
> it. Furthermore REQUEST_URI tends to be a URI. I therefore propose the
> following requirements:
>
> - REQUEST_URI, if exists, MUST be a valid URI. This implies that
> REQUEST_URI must contain the escaped form of the URI (e.g.
> "/clubs/%C3%BC", not "/clubs/ü").
Percent-encoded is definitely the way to go. Percent-encoded data must fit within the ASCII character set (no multibyte characters). It is commonly accepted that most percent-encoded URIs expand to UTF-8; however, as far as I know, this is not exclusively the case. IIRC there is some mention of this in the newer IRI spec, which, to be honest, is quite horrible.
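To make the distinction concrete, here is a minimal sketch showing that the escaped form is plain ASCII while the decoded bytes need not be UTF-8. The helper name `unescape_bytes` and the sample paths are made up for illustration; this is not how Rack's own utilities are written.

```ruby
# Hypothetical helper: percent-decode into raw bytes without assuming
# any particular character encoding for the result.
def unescape_bytes(escaped)
  escaped.b.gsub(/%([0-9A-Fa-f]{2})/) { [$1.hex].pack("C") }
end

escaped_utf8   = "/clubs/%C3%BC"  # "ü" percent-encoded from UTF-8
escaped_latin1 = "/clubs/%FC"     # "ü" percent-encoded from ISO-8859-1

# Both escaped forms are plain ASCII...
both_ascii = escaped_utf8.ascii_only? && escaped_latin1.ascii_only?

# ...but only one of them expands to valid UTF-8.
utf8_valid   = unescape_bytes(escaped_utf8).force_encoding(Encoding::UTF_8).valid_encoding?
latin1_valid = unescape_bytes(escaped_latin1).force_encoding(Encoding::UTF_8).valid_encoding?

puts both_ascii, utf8_valid, latin1_valid
```

Keeping the decoded result as BINARY until something positively identifies its encoding is the lossless choice; tagging it earlier is a guess.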
> - All required Rack variables that are strings (PATH_INFO,
> REQUEST_URI, etc) except for HTTP_ variables MUST be encoded as UTF-8.
> That is, the #encoding method must return #<Encoding:UTF-8>.
Although this might be literal compatible, I'm not sure it's strictly correct. That may not be a technical problem because it's compatible with the most likely actual encoding of the /data itself/ (n.b. not the non-percent-encoded data as noted above), however it may lead to a social problem, whereby it suggests that the data really is UTF-8. Maybe I'm being pedantic here, I'm not sure, but this stuff is hard enough without misleading defaults.
> - HTTP_ variables MUST be encoded as binary.
Seems sensible.
> The valid URI requirement for REQUEST_URI guarantees that encoding it
> in UTF-8 is possible because URIs are valid ASCII.
> Because PATH_INFO and QUERY_STRING tend to be extracted from
> REQUEST_URI and are therefore substrings of an URI, it is also
> possible for them to be UTF-8.
Yes, they are compatible, that much I agree with, but with the caveats above.
> The binary requirement for HTTP_ is necessary because HTTP allows
> header values to contain characters that are not valid UTF-8 nor valid
> US-ASCII (see the HTTP grammar's TEXT rule).
It's hard to determine what they should be. The snowmen do help, but I think more research is needed to figure out the real-world use cases: browsers set to non-UTF-8 encodings, headers set from JS, and so on. I would really appreciate someone, or some company in the community, sponsoring real research and documentation in this area, and doing so extensively. (Extensively means headers, form data, multipart data, etc., across all major browsers, in all major encoding settings, and with all major encodings as inputs: files, pasted data, etc.) A common and major example of a noticeable issue is pasting rich text from programs like Word into forms in Windows browsers that are not in automatic encoding mode. This is of course the very essence of the snowman hack from Rails 3, but applying any of this to Rack requires more research and at least some helpers and documentation. (I have started on the latter whilst trying to get as much of this out of Yehuda's brain as possible, but have not had time to finish yet.)
> Non-HTTP_ required Rack values must not be ASCII-encoded because Rails
> and many apps work primarily with UTF-8 strings.
Literal strings. I should also note at this point that -e, irb, and TextMate create lies during testing; please don't rely on "rules of the road" you derive via these tools, which enforce UTF-8 literals by default, whereas normal source files start out ASCII-encoded. This also depends on the arguments passed to ruby, and on $LC_CTYPE.
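A quick way to check what your own environment is actually doing, rather than trusting a tool's defaults, is to inspect the encodings directly. This is only a sketch; the variable names are illustrative. Note that Ruby 1.9 treats source files without a magic comment as US-ASCII, while Ruby 2.0 and later default to UTF-8, so results differ across versions.

```ruby
# A plain literal is tagged with the encoding of the source it appears in,
# which __ENCODING__ reports.
plain = "PATH_INFO"
literal_matches_source = (plain.encoding == __ENCODING__)

# A literal containing a \u escape is always UTF-8, whatever the source
# encoding is, so it is a dependable way to build UTF-8 test strings.
escaped = "/clubs/\u00FC"
escaped_is_utf8 = (escaped.encoding == Encoding::UTF_8)

puts "source encoding:  #{__ENCODING__}"
puts "external default: #{Encoding.default_external}"
puts "internal default: #{Encoding.default_internal.inspect}"
```

Running the same file through `ruby file.rb`, `ruby -e`, and irb, under different locale settings, shows how much of the observed behaviour comes from the tool rather than from the code.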
> If the app does
> something like
>
> some_utf8_string + env['PATH_INFO']
>
> then Ruby 1.9 will complain with an incompatible encoding error.
On your system.
> On the other hand, if the app does something like
>
> some_utf8_string + env['HTTP_FOO_BAR']
>
> then things will still blow up so I'm not sure whether my requirement
> makes sense. Does Rails mandate an encoding for its request.env?
Rails does a lot of work on the /client side/ to try to ensure it receives UTF-8, and tries to enforce UTF-8 elsewhere. Rack can't enforce this, as it doesn't operate client side (it doesn't build forms). It's also worth noting that Rails accepts losing some percentage of use cases here: it makes no attempt at full support for encodings that can't round-trip through Unicode. For them this is sensible, and it might be for us too, but this is why I particularly need CP932 users to pay attention here. Until I hear from someone who deals with these issues in the real world, I cannot defer to the advice "just use Unicode". Alas, one of the larger issues here is that I don't speak the languages required to track down most of these users, so I need help from people who do. I hope there's someone on this list proactive enough to do this, or who knows someone to call on.
> I'm unsure what to do with all other variables. Should there be
> requirements about their encodings?
I think we do need to either:
1. Set some specific requirements based on complete research (and document potential loss/error cases)
2. Not set requirements based on complete research (and document as such)
It should be noted that 1 may result in the software being simply incompatible with certain requirements, whereas 2 may require the common user to do more work themselves. At present there may be no workaround for when 1 is a problem, due to the pervasiveness of Rack in Ruby's modern libraries and frameworks.
In any case, the minimum output of these discussions should be documentation on the topic, so that once we're done, we can stop having them with everyone who hits another issue. As 1.9 becomes more common, this is going to come up more and more.
I am a super-noob when it comes to this stuff, so I don't feel I can
add that much to the discussion. What I can tell you is that using the
escape_utils gem [1] took care of my error. They also discuss a patch
to Rack itself [2] in one of the links on that page. I'm not
experienced enough to say if this is a good idea or not; but hopefully
it's some food for thought?
1: http://openhood.com/rack/ruby/2010/07/15/rack-test-warning/
2: http://jonathan.tron.name/2009/08/20/sinatra-sequel-haml-postgresql-utf-8-and-ruby-1-9-1
-Steve
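# Pseudocode for what a no-argument String#encode! already does in
# Ruby 1.9; shown for illustration, not meant to be defined literally.
class String
  def encode!
    encode!(Encoding.default_internal) if Encoding.default_internal
  end
end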
Ruby itself never sets Encoding.default_internal. It is used by
end-users to specify that they would like libraries operating at the
boundary to convert known encodings to the one specified. Rails sets
this to the value of config.encoding, which defaults to UTF-8.
As a result, if a boundary library knows the encoding of a String (and
it honors default_internal, like most well-behaved libraries), Rails
will get it as UTF-8, but libraries don't need to hardcode UTF-8, and
they can allow people who need to work with encoding on a more
fine-grained level to do as they will.
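As a rough illustration of that contract, the sketch below plays both roles: the application setting Encoding.default_internal, and a boundary library that knows its data is Latin-1. The sample bytes are made up, and the previous setting is restored afterwards because default_internal is process-wide.

```ruby
previous = Encoding.default_internal
begin
  # What an application (Rails, in the description above) does at boot.
  Encoding.default_internal = Encoding::UTF_8

  # A boundary library that knows its data is Latin-1 tags it as such...
  value = [0x63, 0x61, 0x66, 0xE9].pack("C*")   # bytes of "café" in ISO-8859-1
  value.force_encoding(Encoding::ISO_8859_1)

  # ...and calls encode! with no arguments, which transcodes to
  # Encoding.default_internal when one is set.
  value.encode!
  converted_encoding = value.encoding
  converted_text = value
ensure
  Encoding.default_internal = previous
end

puts converted_encoding
```

The library never mentions UTF-8 itself; it only states what it knows (the source encoding) and leaves the target to the application.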
I should point out that Encoding.default_external is an entirely
different setting, which tells Ruby which encoding to assume for files
read from the file system. It defaults to the locale encoding of the
operating system. In general, libraries that read files from disk
should either let the operating system's default encoding
($LC_CTYPE/$LANG) win, or they should read files in the "rb" mode,
which will tag the String as BINARY.
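The two ways of reading a file can be compared with a small sketch like this one. It uses a temporary file with ASCII-only content so that it behaves the same under any locale; the file name prefix is arbitrary.

```ruby
require "tempfile"

text_encoding = nil
binary_encoding = nil

Tempfile.create("encoding-demo") do |file|
  file.write("PATH_INFO=/clubs\n")
  file.flush

  # Text mode: the String is tagged according to Encoding.default_external
  # (and transcoded to Encoding.default_internal if one is set).
  text_encoding = File.read(file.path).encoding

  # "rb" mode: the String is tagged BINARY (ASCII-8BIT) and left untouched.
  binary_encoding = File.open(file.path, "rb") { |f| f.read }.encoding
end

puts text_encoding
puts binary_encoding
```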
Yehuda Katz
Architect | Engine Yard
On Sun, Sep 12, 2010 at 9:48 AM, Hongli Lai <hon...@phusion.nl> wrote:
>
> The current Rack specification doesn't say anything about the encoding
> of the value strings in the Rack environment. However from various bug
> reports it has become clear that Rails and possibly many other apps
> expect some value strings, such as REQUEST_URI, to be UTF-8. See #16
> at http://code.google.com/p/phusion-passenger/issues/detail?id=404.
>
> I believe the encoding should be standardized. Here are some ideas
> that might serve as a starting point for discussion.
>
> PATH_INFO and QUERY_STRING are usually extracted from REQUEST_URI,
> however REQUEST_URI is not standardized even though lots of people use
> it. Furthermore REQUEST_URI tends to be a URI. I therefore propose the
> following requirements:
I'm actually opposed to standardizing REQUEST_URI. It's always
possible to reconstruct the request URI from SCRIPT_NAME and PATH_INFO,
and endpoints that rely on REQUEST_URI cannot be mounted under another
path. This was a serious problem for both Rails and Merb (before we
both switched to using PATH_INFO).
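For what it's worth, rebuilding the URI from the standard keys is short. The helper below is a hypothetical sketch, and appending QUERY_STRING is my own assumption about what a caller would usually want.

```ruby
# Hypothetical helper: rebuild the request URI from standard Rack keys.
def request_uri_from(env)
  uri = env.fetch("SCRIPT_NAME", "") + env.fetch("PATH_INFO", "")
  query = env["QUERY_STRING"].to_s
  query.empty? ? uri : "#{uri}?#{query}"
end

# An endpoint mounted at /blog sees only its own part in PATH_INFO, so
# routing on PATH_INFO keeps working wherever the endpoint is mounted.
env = { "SCRIPT_NAME" => "/blog", "PATH_INFO" => "/posts/1", "QUERY_STRING" => "page=2" }
puts request_uri_from(env)
```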
> - REQUEST_URI, if exists, MUST be a valid URI. This implies that
> REQUEST_URI must contain the escaped form of the URI (e.g.
> "/clubs/%C3%BC", not "/clubs/ü").
> - All required Rack variables that are strings (PATH_INFO,
> REQUEST_URI, etc) except for HTTP_ variables MUST be encoded as UTF-8.
> That is, the #encoding method must return #<Encoding:UTF-8>.
The main Rack variables (REQUEST_METHOD, SCRIPT_NAME, PATH_INFO,
QUERY_STRING, SERVER_NAME, and SERVER_PORT) should always be ASCII.
The server should tag them as ASCII and then call encode!. This
either gives the end application the encoding that it expects
(represented by Encoding.default_internal) or leaves the String in
ASCII, which is the correct encoding.
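On the server side this could look roughly like the following. The helper name `tag_core_variables!` is hypothetical, and the sketch assumes the values really are ASCII (percent-encoded paths and so on); with non-ASCII bytes and a default_internal set, encode! would raise.

```ruby
# Hypothetical server-side helper: tag the core variables as US-ASCII,
# then let a no-argument encode! apply Encoding.default_internal, if any.
def tag_core_variables!(env)
  keys = %w[REQUEST_METHOD SCRIPT_NAME PATH_INFO QUERY_STRING SERVER_NAME SERVER_PORT]
  keys.each do |key|
    value = env[key]
    next unless value
    tagged = value.dup.force_encoding(Encoding::US_ASCII)
    env[key] = tagged.encode!   # leaves US-ASCII alone when default_internal is nil
  end
  env
end

env = { "REQUEST_METHOD" => "GET".b, "PATH_INFO" => "/clubs/%C3%BC".b }
tag_core_variables!(env)
puts env["PATH_INFO"].encoding
```

What gets printed depends on whether the process has set Encoding.default_internal, which is exactly the point: the server states a fact, and the application chooses the final encoding.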
> - HTTP_ variables MUST be encoded as binary.
This seems correct, because headers can be encoded as Latin-1. Since
we can't know for sure which encoding the client used, we should leave
it as BINARY and let the application (which might know better) decode
it.
> The valid URI requirement for REQUEST_URI guarantees that encoding it
> in UTF-8 is possible because URIs are valid ASCII.
Again, I don't think we should standardize REQUEST_URI.
> Because PATH_INFO and QUERY_STRING tend to be extracted from
> REQUEST_URI and are therefore substrings of an URI, it is also
> possible for them to be UTF-8.
Again, I think the best way to achieve this would be to mark these as
ASCII (which they actually are), and then let the application specify
what transcoding it wants using the standard Ruby mechanism.
> The binary requirement for HTTP_ is necessary because HTTP allows
> header values to contain characters that are not valid UTF-8 nor valid
> US-ASCII (see the HTTP grammar's TEXT rule).
Yep.
> Non-HTTP_ required Rack values must not be ASCII-encoded because Rails
> and many apps work primarily with UTF-8 strings. If the app does
> something like
>
> some_utf8_string + env['PATH_INFO']
>
> then Ruby 1.9 will complain with an incompatible encoding error.
Actually, ASCII and UTF-8 should always concatenate with no error.
Maybe you're thinking about putting BINARY Strings through a Unicode
regular expression?
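Both cases are easy to check. The sketch below uses \u escapes so the literals are UTF-8 regardless of the source encoding; the sample values are made up.

```ruby
utf8  = "caf\u00E9 "                                     # UTF-8 with non-ASCII content
ascii = "/clubs".dup.force_encoding(Encoding::US_ASCII)  # ASCII-tagged, ASCII-only

# US-ASCII and UTF-8 are compatible, so this concatenates cleanly.
joined = utf8 + ascii

# A BINARY String holding a non-ASCII byte cannot be matched against a
# regexp that contains non-ASCII (here UTF-8) characters.
binary = [0x2F, 0xFC].pack("C*")   # "/" followed by a Latin-1 "ü" byte
begin
  binary =~ /\u00FC/
  regexp_error = nil
rescue Encoding::CompatibilityError => e
  regexp_error = e
end

puts joined.encoding
puts regexp_error.class
```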
> On the other hand, if the app does something like
>
> some_utf8_string + env['HTTP_FOO_BAR']
>
> then things will still blow up so I'm not sure whether my requirement
> makes sense. Does Rails mandate an encoding for its request.env?
Concatenating UTF-8 and BINARY should blow up. As you pointed out, we
can't be sure that HTTP_FOO_BAR *is* UTF-8. For all we know, it's
Latin-1. In Ruby 1.9, if a BINARY string contains bytes that are not
ASCII and you concatenate it with a UTF-8 String that also has
non-ASCII characters, you get an exception. This is correct. The only
real solution is to somehow know for sure what the encoding of the
header is.
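A small sketch of the header case, with made-up header values: an ASCII-only BINARY value concatenates fine, a BINARY value with high bytes does not, and once the application declares the real encoding the problem goes away.

```ruby
utf8 = "caf\u00E9 "

ascii_header  = "text/html".b                   # BINARY, but ASCII-only
latin1_header = [0x66, 0xFC, 0x72].pack("C*")   # BINARY, "für" as Latin-1 bytes

ok = utf8 + ascii_header   # fine: one side is ASCII-only

begin
  utf8 + latin1_header     # both sides carry non-ASCII content
  concat_error = nil
rescue Encoding::CompatibilityError => e
  concat_error = e
end

# If the application knows the header is Latin-1, it can say so and transcode.
decoded = latin1_header.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
safe = utf8 + decoded

puts concat_error.class
puts safe
```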
One solution could be a middleware that marks the Strings as
Encoding::ASCII if the String is #ascii_only? and then uses rchardet
to guess the encoding if it's not. Of course, it'd call encode!
afterward, which would mean that Rails apps would see the String as
UTF-8 no matter what.
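A rough sketch of such a middleware follows. The class name `HeaderEncodingGuesser` is hypothetical, and the guessing step is passed in as a callable so that the sketch does not depend on any particular gem's API; a charset-guessing library could be wrapped to provide it. The stub detector here simply claims everything is Latin-1.

```ruby
# Hypothetical middleware. `detector` is any callable that takes raw bytes
# and returns an encoding name, or nil when it cannot guess.
class HeaderEncodingGuesser
  def initialize(app, detector)
    @app = app
    @detector = detector
  end

  def call(env)
    env.keys.each do |key|
      value = env[key]
      next unless key.is_a?(String) && key.start_with?("HTTP_") && value.is_a?(String)
      env[key] = retag(value)
    end
    @app.call(env)
  end

  private

  def retag(value)
    str = value.dup
    if str.ascii_only?
      str.force_encoding(Encoding::US_ASCII)
    else
      guess = @detector.call(str)
      return value unless guess   # no guess: pass the header through as received
      str.force_encoding(guess)
    end
    str.encode!                   # no-op unless Encoding.default_internal is set
  end
end

# Stub detector for the example; a real one would inspect the bytes.
always_latin1 = ->(_bytes) { "ISO-8859-1" }
app = ->(env) { [200, { "Content-Type" => "text/plain" }, [env["HTTP_X_NAME"].encoding.name]] }

stack = HeaderEncodingGuesser.new(app, always_latin1)
status, _headers, body = stack.call("HTTP_X_NAME" => [0x66, 0xFC, 0x72].pack("C*"))
puts status, body
```

A wrong guess is the weak point: if the detector names an encoding the bytes are not valid in, encode! raises when a default_internal is set, so a production version would need a fallback.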
> I'm unsure what to do with all other variables. Should there be
> requirements about their encodings?
As far as I can tell, when unescaped, SCRIPT_NAME and PATH_INFO will
always be UTF-8 in the wild (I've tried with quite a number of
browsers). The Utils method that unescapes percent-encoded Strings
should first mark the String as UTF-8, and then call encode! (which
should almost always be a no-op, unless someone made their
default_internal UTF-16 or Latin-1 for some odd reason).
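In code, that suggestion might look something like this. `unescape_path` is a hypothetical stand-in, not the actual Rack utility, and it assumes the decoded bytes really are UTF-8, as observed from browsers.

```ruby
# Hypothetical stand-in for a Rack utility method.
def unescape_path(escaped)
  bytes = escaped.b.gsub(/%([0-9A-Fa-f]{2})/) { [$1.hex].pack("C") }
  # Mark the decoded bytes as UTF-8, then let the application's
  # Encoding.default_internal (if any) decide the final encoding.
  bytes.force_encoding(Encoding::UTF_8).encode!
end

path = unescape_path("/clubs/%C3%BC")
puts path.encoding
```

If the bytes turn out not to be valid UTF-8 and default_internal names a different encoding, the encode! call raises, so the caller at least finds out instead of carrying mislabelled data around.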