Rack environment encoding

Hongli Lai

Sep 12, 2010, 12:48:51 PM

to Rack Development

The current Rack specification doesn't say anything about the encoding
of the value strings in the Rack environment. However from various bug
reports it has become clear that Rails and possibly many other apps
expect some value strings, such as REQUEST_URI, to be UTF-8. See #16
at http://code.google.com/p/phusion-passenger/issues/detail?id=404.

I believe the encoding should be standardized. Here are some ideas
that might serve as a starting point for discussion.

PATH_INFO and QUERY_STRING are usually extracted from REQUEST_URI,
however REQUEST_URI is not standardized even though lots of people use
it. Furthermore REQUEST_URI tends to be a URI. I therefore propose the
following requirements:

- REQUEST_URI, if it exists, MUST be a valid URI. This implies that
REQUEST_URI must contain the escaped (percent-encoded) form of the URI
(e.g. "/clubs/%C3%BC", not "/clubs/ü").
- All required Rack variables that are strings (PATH_INFO,
REQUEST_URI, etc) except for HTTP_ variables MUST be encoded as UTF-8.
That is, the #encoding method must return #<Encoding:UTF-8>.
- HTTP_ variables MUST be encoded as binary.

The valid URI requirement for REQUEST_URI guarantees that encoding it
in UTF-8 is possible because URIs are valid ASCII, and every ASCII
string is also valid UTF-8.
Because PATH_INFO and QUERY_STRING tend to be extracted from
REQUEST_URI and are therefore substrings of a URI, it is also
possible for them to be UTF-8.
The binary requirement for HTTP_ is necessary because HTTP allows
header values to contain characters that are neither valid UTF-8 nor
valid US-ASCII (see the HTTP grammar's TEXT rule).
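
The tagging described above could be sketched roughly as follows. This is only an illustration: the helper name `tag_env_encodings` is hypothetical (it is not part of Rack), and for brevity it tags every non-HTTP_ string rather than only the required variables.

```ruby
# Hypothetical helper (not part of Rack): returns a copy of env whose
# string values are tagged as in the proposal above.
def tag_env_encodings(env)
  env.each_with_object({}) do |(key, value), tagged|
    tagged[key] =
      if !value.is_a?(String)
        value                                        # non-strings pass through
      elsif key.start_with?("HTTP_")
        value.dup.force_encoding(Encoding::BINARY)   # header bytes stay opaque
      else
        value.dup.force_encoding(Encoding::UTF_8)    # percent-encoded ASCII is valid UTF-8
      end
  end
end

env = tag_env_encodings(
  "PATH_INFO"      => "/clubs/%C3%BC".b,
  "HTTP_X_COMMENT" => "caf\xE9".b,   # Latin-1 bytes, not valid UTF-8
  "rack.version"   => [1, 1]
)
```

Note that force_encoding only changes the tag; the bytes themselves are never converted.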

Non-HTTP_ required Rack values must not be ASCII-encoded because Rails
and many apps work primarily with UTF-8 strings. If the app does
something like

some_utf8_string + env['PATH_INFO']

then Ruby 1.9 will complain with an incompatible encoding error.

On the other hand, if the app does something like

some_utf8_string + env['HTTP_FOO_BAR']

then things will still blow up so I'm not sure whether my requirement
makes sense. Does Rails mandate an encoding for its request.env?

I'm unsure what to do with all other variables. Should there be
requirements about their encodings?

James Tucker

Sep 12, 2010, 2:00:19 PM

to rack-...@googlegroups.com

Eeek, encodings.. here we go!

On 12 Sep 2010, at 13:48, Hongli Lai wrote:

> The current Rack specification doesn't say anything about the encoding
> of the value strings in the Rack environment. However from various bug
> reports it has become clear that Rails and possibly many other apps
> expect some value strings, such as REQUEST_URI, to be UTF-8. See #16
> at http://code.google.com/p/phusion-passenger/issues/detail?id=404.

I'm not sure, but I think they don't expect them to be utf-8, they actually expect them to be compatible with literals.

> I believe the encoding should be standardized. Here are some ideas
> that might serve as a starting point for discussion.

Maybe, from what I am told, we will need users of CP932 and other really annoying stuff to come forward and test, such that we don't either break their stuff or have to revert these specs in future. Binary is the only lossless manner we have right now. I personally can't (fully) audit this yet, although I'm trying to learn.

> PATH_INFO and QUERY_STRING are usually extracted from REQUEST_URI,
> however REQUEST_URI is not standardized even though lots of people use
> it. Furthermore REQUEST_URI tends to be a URI. I therefore propose the
> following requirements:
>
> - REQUEST_URI, if exists, MUST be a valid URI. This implies that
> REQUEST_URI must contain the unescaped form of the URI (e.g. "/clubs/
> %C3%BC", not "/clubs/ü").

Percent-encoded is definitely the way to go. Percent-encoded data must fit within the ASCII character set (no multibyte). It is commonly accepted that most percent-encoded URIs actually expand out to UTF-8; however, as far as I know, this is not exclusively the case. IIRC, there is some mention of this in the newer IRI spec, which, tbh, is quite horrible.

> - All required Rack variables that are strings (PATH_INFO,
> REQUEST_URI, etc) except for HTTP_ variables MUST be encoded as UTF-8.
> That is, the #encoding method must return #<Encoding:UTF-8>.

Although this might be literal compatible, I'm not sure it's strictly correct. That may not be a technical problem because it's compatible with the most likely actual encoding of the /data itself/ (n.b. not the non-percent-encoded data as noted above), however it may lead to a social problem, whereby it suggests that the data really is UTF-8. Maybe I'm being pedantic here, I'm not sure, but this stuff is hard enough without misleading defaults.

> - HTTP_ variables MUST be encoded as binary.

Seems sensible.

> The valid URI requirement for REQUEST_URI guarantees that encoding it
> in UTF-8 is possible because URIs are valid ASCII.
> Because PATH_INFO and QUERY_STRING tend to be extracted from
> REQUEST_URI and are therefore substrings of an URI, it is also
> possible for them to be UTF-8.

Yes, they are compatible, that much I agree with, but with the caveats above.

> The binary requirement for HTTP_ is necessary because HTTP allows
> header values to contain characters that are not valid UTF-8 nor valid
> US-ASCII (see the HTTP grammar's TEXT rule).

It's hard to determine what they should be. The snowmen do help, but I think more research is needed to figure out the real-world use cases, e.g. when browsers are set to non-UTF-8 encodings and headers are set from JS. I would really appreciate someone or some company in the community sponsoring real, extensive research and documentation in this area. Extensive here means covering headers, form data, multipart data, etc. across all major browsers, in all major encoding settings, and with all major encodings as inputs (files, pasted data, etc.).

A common and major example of where noticeable issues occur is pasting rich text from programs like Word into forms in Windows browsers that are in non-automatic encoding modes. This is of course the very essence of the snowman hack from Rails 3, but applying any of this to Rack requires more research and at least some helpers and documentation. (I have started on the latter while trying to get as much of this out of Yehuda's brain as possible, but have not had time to finish yet.)

> Non-HTTP_ required Rack values must not be ASCII-encoded because Rails
> and many apps work primarily with UTF-8 strings.

Literal strings. I should also note at this point that -e, irb, and TextMate give misleading results during testing. Please don't rely on "rules of the road" you derive via these tools, which enforce UTF-8 literals by default, whereas normal source files start out ASCII-encoded. This also depends on the arguments to ruby and on $LC_CTYPE.

> If the app does
> something like
>
> some_utf8_string + env['PATH_INFO']
>
> then Ruby 1.9 will complain with an incompatible encoding error.

On your system.

> On the other hand, if the app does something like
>
> some_utf8_string + env['HTTP_FOO_BAR']
>
> then things will still blow up so I'm not sure whether my requirement
> makes sense. Does Rails mandate an encoding for its request.env?

Rails does a lot of work on the /client side/ to try to ensure it receives UTF-8, and tries to enforce UTF-8 elsewhere. Rack can't enforce this as it doesn't operate client side (it doesn't build forms). It's also worth noting that Rails accepts losing a small percentage of use cases here: it makes no attempt to fully support encodings that can't round-trip through Unicode. For them this is sensible, and maybe it would be for us too, but this is why I particularly need CP932 users to pay attention here. Until I hear from someone who deals with these issues in the real world, I cannot defer to the advice "just use unicode". Alas, one of the larger issues here is that I don't speak the languages required to actually track down most of these users, so I need help from people who do. I hope there's someone on this list who is proactive enough to do this, or who knows someone to call on.

> I'm unsure what to do with all other variables. Should there be
> requirements about their encodings?

I think we do need to either:

1. Set some specific requirements based on complete research (and document potential loss/error cases)
2. Not set requirements based on complete research (and document as such)

It should be noted that 1 may result in the software being simply incompatible with certain requirements, whereas 2 may require the common user to do more work themselves. At present there may be no workaround for when 1 is a problem, due to the pervasiveness of Rack in Ruby's modern libraries and frameworks.

In any case, the minimum output of these discussions should be documentation on the topic, so that once we're done, we can stop having them with everyone who hits another issue. As 1.9 becomes more common, this is going to come up more and more.

Steve Klabnik

Sep 12, 2010, 2:23:03 PM

to rack-...@googlegroups.com

How funny, this thread gets started up when I found a solution to my
issue before.

I am a super-noob when it comes to this stuff, so I don't feel I can
add that much to the discussion. What I can tell you is that using the
escape_utils gem [1] took care of my error. They also discuss a patch
to Rack itself [2] in one of the links on that page. I'm not
experienced enough to say if this is a good idea or not; but hopefully
it's some food for thought?

1: http://openhood.com/rack/ruby/2010/07/15/rack-test-warning/

2: http://jonathan.tron.name/2009/08/20/sinatra-sequel-haml-postgresql-utf-8-and-ruby-1-9-1

-Steve

Yehuda Katz

Sep 13, 2010, 12:21:20 AM

to rack-devel

In general, when dealing with Strings from external sources (as Rack
is), you have two options:
1) The String comes with some out-of-band information about its
encoding (or you happen to know the encoding for sure), and you should
tag it with that encoding
2) The String does not come with out-of-band information, and you
don't know the encoding, and you should leave it as BINARY.
In the case of (1), after marking the encoding with force_encoding,
you should call encode! (with no arguments). The encode! method works
conceptually like this:

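class String
  # Conceptual sketch only, not the real implementation: the
  # zero-argument form behaves as if it called the one-argument form.
  def encode!
    encode!(Encoding.default_internal) if Encoding.default_internal
  end
end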

Ruby itself never sets Encoding.default_internal. It is used by
end-users to specify that they would like libraries operating at the
boundary to convert known encodings to the one specified. Rails sets
this to the value of config.encoding, which defaults to UTF-8.

As a result, if a boundary library knows the encoding of a String (and
it honors default_internal, like most well-behaved libraries), Rails
will get it as UTF-8, but libraries don't need to hardcode UTF-8, and
they can allow people who need to work with encoding on a more
fine-grained level to do as they will.
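
As a rough sketch of the tag-then-transcode pattern described above: the helper name `tag_and_transcode` is hypothetical, and it takes the target encoding as a parameter (defaulting to Encoding.default_internal) only so that the example does not depend on global settings.

```ruby
# Hypothetical boundary helper: tag bytes with their known encoding, then
# transcode to the application's preferred internal encoding, if any.
def tag_and_transcode(bytes, known_encoding, internal = Encoding.default_internal)
  str = bytes.dup.force_encoding(known_encoding)   # changes the tag, not the bytes
  internal ? str.encode(internal) : str            # converts the bytes
end

latin1_bytes = "caf\xE9".b   # "café" as it would arrive in ISO-8859-1

# An application that asked for UTF-8 gets UTF-8 ...
as_utf8 = tag_and_transcode(latin1_bytes, Encoding::ISO_8859_1, Encoding::UTF_8)

# ... and one that expressed no preference gets the correctly tagged original.
untouched = tag_and_transcode(latin1_bytes, Encoding::ISO_8859_1, nil)
```

The library never hardcodes UTF-8 here; the target comes from whoever configured the internal encoding.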

I should point out that Encoding.default_external is an entirely
different setting, which tells Ruby what encoding to default files on
the file system to. This defaults to the encoding of the operating
system. In general, libraries that read files from disk should either
let the operating system's default encoding ($LC_CTYPE/$LANG) win, or
they should read files in the "rb" mode, which will tag the String as
BINARY.
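
A small sketch of the difference between the two ways of reading a file, assuming Encoding.default_internal has not been set (the Ruby default); the temporary file and its contents are made up for the example.

```ruby
require "tempfile"

binary_read, default_read = Tempfile.create("encoding-demo") do |file|
  file.binmode
  file.write("caf\xE9".b)   # Latin-1 bytes written verbatim
  file.flush

  [
    File.open(file.path, "rb") { |f| f.read },   # tagged BINARY (ASCII-8BIT)
    File.read(file.path)                         # tagged with Encoding.default_external
  ]
end
```

Both reads return the same bytes; only the encoding tag differs.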

Yehuda Katz
Architect | Engine Yard
(ph) 718.877.1325

On Sun, Sep 12, 2010 at 9:48 AM, Hongli Lai <hon...@phusion.nl> wrote:
>
> The current Rack specification doesn't say anything about the encoding
> of the value strings in the Rack environment. However from various bug
> reports it has become clear that Rails and possibly many other apps
> expect some value strings, such as REQUEST_URI, to be UTF-8. See #16
> at http://code.google.com/p/phusion-passenger/issues/detail?id=404.
>
> I believe the encoding should be standardized. Here are some ideas
> that might serve as a starting point for discussion.
>
> PATH_INFO and QUERY_STRING are usually extracted from REQUEST_URI,
> however REQUEST_URI is not standardized even though lots of people use
> it. Furthermore REQUEST_URI tends to be a URI. I therefore propose the
> following requirements:

I'm actually opposed to standardizing REQUEST_URI. It's always
possible to extract REQUEST_URI from SCRIPT_NAME and PATH_INFO, and
endpoints that rely on REQUEST_URI cannot be mounted. This was a
serious problem for both Rails and Merb (before we both switched to
using PATH_INFO).

> - REQUEST_URI, if exists, MUST be a valid URI. This implies that
> REQUEST_URI must contain the unescaped form of the URI (e.g. "/clubs/
> %C3%BC", not "/clubs/ü").
> - All required Rack variables that are strings (PATH_INFO,
> REQUEST_URI, etc) except for HTTP_ variables MUST be encoded as UTF-8.
> That is, the #encoding method must return #<Encoding:UTF-8>.

The main Rack variables (REQUEST_METHOD, SCRIPT_NAME, PATH_INFO,
QUERY_STRING, SERVER_NAME, and SERVER_PORT) should always be ASCII.
These should be encoded as ASCII, and then the server should call
encode!. This will have the effect of either giving the end application
the encoding that it expects (represented by Encoding.default_internal)
or leaving it in ASCII, which is the correct encoding.

> - HTTP_ variables MUST be encoded as binary.

This seems correct, because headers can be encoded as Latin-1. Since
we can't know for sure which encoding the client used, we should leave
it as BINARY and let the application (which might know better) decode
it.

> The valid URI requirement for REQUEST_URI guarantees that encoding it
> in UTF-8 is possible because URIs are valid ASCII.

Again, I don't think we should standardize REQUEST_URI.

> Because PATH_INFO and QUERY_STRING tend to be extracted from
> REQUEST_URI and are therefore substrings of an URI, it is also
> possible for them to be UTF-8.

Again, I think the best way to achieve this would be to mark these as
ASCII (which they actually are), and then let the application specify
what transcoding it wants using the standard Ruby mechanism.

> The binary requirement for HTTP_ is necessary because HTTP allows
> header values to contain characters that are not valid UTF-8 nor valid
> US-ASCII (see the HTTP grammar's TEXT rule).

Yep.

> Non-HTTP_ required Rack values must not be ASCII-encoded because Rails
> and many apps work primarily with UTF-8 strings. If the app does
> something like
>
> some_utf8_string + env['PATH_INFO']
>
> then Ruby 1.9 will complain with an incompatible encoding error.

Actually, ASCII and UTF-8 should always concatenate with no error.
Maybe you're thinking about putting BINARY Strings through a Unicode
regular expression?
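
Both halves of this remark can be checked with a short sketch (the sample strings are made up):

```ruby
utf8  = "caf\u00E9 at "                                          # UTF-8 with a non-ASCII character
ascii = "/clubs/%C3%BC".dup.force_encoding(Encoding::US_ASCII)   # percent-encoded, pure ASCII

joined = utf8 + ascii   # no error; the result is tagged UTF-8

# A BINARY String with non-ASCII bytes matched against a Unicode regexp
# is a different matter.
regexp_error =
  begin
    "caf\xE9".b =~ Regexp.new("\u00E9")
    nil
  rescue Encoding::CompatibilityError => e
    e.class
  end
```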

> On the other hand, if the app does something like
>
> some_utf8_string + env['HTTP_FOO_BAR']
>
> then things will still blow up so I'm not sure whether my requirement
> makes sense. Does Rails mandate an encoding for its request.env?

Concatenating UTF-8 and BINARY should blow up. As you pointed out, we
can't be sure that HTTP_FOO_BAR *is* UTF-8. For all we know, it's
Latin-1. In Ruby 1.9, if a BINARY String contains bytes that are
not ASCII, concatenating it with a non-ASCII UTF-8 String raises an
exception. This is correct. The only real solution is to somehow know
for sure what the encoding of the header is.
One solution could be a middleware that marks the Strings as
Encoding::ASCII if the String is #ascii_only? and then uses rchardet
to guess the encoding if it's not. Of course, it'd call encode!
afterward, which would mean that Rails apps would see the String as
UTF-8 no matter what.
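
A rough sketch of such a middleware follows. Everything here is an assumption for illustration: the class name is hypothetical, and instead of calling rchardet directly it takes a `guesser` callable (binary String in, Encoding out), because the exact detector interface is not something this thread specifies.

```ruby
# Hypothetical middleware, not part of Rack. `guesser` stands in for a
# detector such as rchardet; its interface here is an assumption.
class HeaderEncodingTagger
  def initialize(app, guesser:, internal: Encoding.default_internal)
    @app = app
    @guesser = guesser     # callable: binary String -> Encoding
    @internal = internal   # transcoding target, or nil for none
  end

  def call(env)
    env.keys.each do |key|
      value = env[key]
      next unless key.is_a?(String) && key.start_with?("HTTP_") && value.is_a?(String)
      env[key] = retag(value)
    end
    @app.call(env)
  end

  private

  def retag(value)
    str = value.dup
    encoding = str.ascii_only? ? Encoding::US_ASCII : @guesser.call(str)
    str.force_encoding(encoding)
    # Leave the String alone when there is no target or the guess does not fit.
    return str if @internal.nil? || encoding == Encoding::BINARY || !str.valid_encoding?
    str.encode(@internal)
  end
end
```

A guess is still only a guess, so an application that knows the real encoding of a header should override it rather than trust the detector.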

> I'm unsure what to do with all other variables. Should there be
> requirements about their encodings?

As far as I can tell, when unescaped, SCRIPT_NAME and PATH_INFO will
always be UTF-8 in the wild (I've tried with quite a number of
browsers). The Utils method that unescapes percent-encoded Strings
should first mark the String as UTF-8, and then call encode! (which
should almost always be a no-op, unless someone made their
default_internal UTF-16 or Latin-1 for some odd reason).
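
A minimal sketch of that unescape-then-tag step, assuming a hand-rolled helper (`unescape_as_utf8` is a hypothetical name; Rack's own Utils may well differ):

```ruby
# Hypothetical helper: unescape percent-encoding, tag the result as
# UTF-8, then transcode to the preferred internal encoding, if any.
def unescape_as_utf8(escaped, internal = Encoding.default_internal)
  # Work on bytes so that each %XX becomes exactly one raw byte.
  bytes = escaped.b.gsub(/%([0-9A-Fa-f]{2})/) { Regexp.last_match(1).hex.chr }
  str = bytes.force_encoding(Encoding::UTF_8)
  internal && str.valid_encoding? ? str.encode(internal) : str
end

path = unescape_as_utf8("/clubs/%C3%BC")
```

If the unescaped bytes turn out not to be valid UTF-8, this sketch returns them tagged but untranscoded, leaving the decision to the caller.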

naruse

Sep 13, 2010, 5:05:46 AM

to Rack Development

Hello,

I agree with wycats on almost every point.

On Sep 13, 1:21 pm, Yehuda Katz <wyc...@gmail.com> wrote:
> On Sun, Sep 12, 2010 at 9:48 AM, Hongli Lai <hon...@phusion.nl> wrote:
> > I'm unsure what to do with all other variables. Should there be
> > requirements about their encodings?
>
> As far as I can tell, when unescaped, SCRIPT_NAME and PATH_INFO will
> always be UTF-8 in the wild (I've tried with quite a number of
> browsers). The Utils that unescapes percent encoded Strings should
> first mark the String as UTF-8, and then call encode! (which should
> almost always be a no-op, unless somewhat made their default_internal
> UTF-16 or Latin-1 from some odd reason).

A long time ago I wrote a web application with URLs like
/items/<item name in EUC-JP>.
Yes, new applications should use UTF-8 in URIs, but such apps do
exist.
So the assumed external encoding (UTF-8) should be configurable.

The following ticket may be related to this thread.
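
To illustrate why a hardcoded UTF-8 assumption fails for such an application, here is a small sketch using the bytes that "%A4%A2" unescapes to (the sample value is made up):

```ruby
# "%A4%A2" unescapes to the two bytes below, which spell "あ" in EUC-JP.
bytes = "\xA4\xA2".b

as_utf8  = bytes.dup.force_encoding(Encoding::UTF_8)
as_eucjp = bytes.dup.force_encoding(Encoding::EUC_JP)

utf8_ok  = as_utf8.valid_encoding?    # false: not a valid UTF-8 sequence
eucjp_ok = as_eucjp.valid_encoding?   # true

# With the right assumption the name transcodes cleanly to UTF-8.
item_name = as_eucjp.encode(Encoding::UTF_8)
```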
http://rack.lighthouseapp.com/projects/22435/tickets/48-rackutilsunescape-problems-in-ruby-191

Regards

Hongli Lai

Sep 13, 2010, 9:56:42 AM

to Rack Development

On Sep 12, 8:00 pm, James Tucker <jftuc...@gmail.com> wrote:
> I'm not sure, but I think they don't expect them to be utf-8, they actually expect them to be compatible with literals.

Yes. I'm fine with US-ASCII for PATH_INFO and friends as long as
comment #16 in the bug report doesn't result in breakage anymore.

> > If the app does
> > something like
> > some_utf8_string + env['PATH_INFO']
> > then Ruby 1.9 will complain with an incompatible encoding error.
>
> On your system.

No. Specifically, it breaks in Phusion Passenger because we set the
encoding of the entire environment hash to binary, regardless of the
system encoding, exactly to prevent data loss as you've mentioned
earlier. However setting everything to binary results in breakages as
described in the bug report which is the reason why I proposed setting
some things to UTF-8/ASCII/whatever and other things to binary.

> Rails does a lot of work on the /client side/ to try and ensure it receives UTF-8, and tries to enforce UTF-8 elsewhere. Rack can't enforce this as it doesn't operate client side (build forms). It's also worth noting that rails accepts a percentile use case hit here, whereby it makes no attempt to expect full support for encodings that can't round-trip through unicode. For them this is sensible, and maybe it might be for us, but this is why I need particularly CP932 users to actually pay attention here. Until I hear from someone who deals with these issues in the real world, I cannot defer to the advice "just use unicode". Alas, one of the larger issues here is that I don't speak the languages required to actually track down most of these users, so I need help from people who do. I hope there's someone on this list proactive enough to do this, or knows someone to call on.

Woah, I think we have a misunderstanding here. I started this thread
to discuss what env['something'].encoding should return. Whether
env['something'] actually contains UTF-8 data is a different
discussion.

To re-iterate: the problem that we're running into is that
env['something'].encoding always returns #<Encoding: binary> in
Phusion Passenger, even if env['something'] contains valid UTF-8 data.
Should env['something'] - assuming it contains valid UTF-8 data or
ASCII data or whatever - have its #encoding return #<Encoding: UTF-8>?
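
The distinction between the bytes and the encoding tag can be shown in a few lines (the sample value is made up):

```ruby
raw = "/clubs/\xC3\xBC".b   # valid UTF-8 bytes, but tagged binary

retagged = raw.dup.force_encoding(Encoding::UTF_8)

same_bytes = raw.bytes == retagged.bytes   # the data itself is untouched
tags = [raw.encoding, retagged.encoding]   # only the tags differ
```

So the question in this thread is purely about which tag the server should attach, not about converting any data.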

Of course, the easiest way to solve this problem is to mandate that all
Rack web servers set the encoding to binary and have the frameworks
deal with conversions.

Hongli Lai

Sep 13, 2010, 10:08:03 AM

to Rack Development

On Sep 13, 6:21 am, Yehuda Katz <wyc...@gmail.com> wrote:
> I'm actually opposed to standardizing REQUEST_URI. It's always
> possible to extract REQUEST_URI from SCRIPT_NAME and PATH_INFO, and
> endpoints that rely on REQUEST_URI cannot be mounted. This was a
> serious problem for both Rails and Merb (before we both switched to
> using PATH_INFO).

I'm not proposing standardizing REQUEST_URI as a variable that must
exist or standardizing its meaning. I'm only proposing standardizing
its encoding, if it exists. I was using REQUEST_URI as an example
rationale to describe why PATH_INFO and QUERY_STRING should only
contain ASCII characters (because they're extracted from a URI) and
that therefore it is okay to encode PATH_INFO and QUERY_STRING as
UTF-8 (or ASCII).

> The main Rack variables (REQUEST_METHOD, SCRIPT_NAME, PATH_INFO,
> QUERY_STRING, SERVER_NAME, and SERVER_PORT) should always be ASCII.
> These should be encoded as ASCII, and then the server should call
> encode!. This will have the effect of giving the end application the
> encoding that it expects (representing by Encoding.default_internal)
> or by leaving it in ASCII, which is the correct encoding.

I'm fine with ASCII for most variables, but are you sure SERVER_NAME
should be ASCII as well? I don't have any strong opinions on this.

> > Because PATH_INFO and QUERY_STRING tend to be extracted from
> > REQUEST_URI and are therefore substrings of an URI, it is also
> > possible for them to be UTF-8.
>
> Again, I think the best way to achieve this would be to mark these as
> ASCII (which they actually are), and then let the application specify
> what transcoding it wants using the standard Ruby mechanism.

Fine with this.

> > Non-HTTP_ required Rack values must not be ASCII-encoded because Rails
> > and many apps work primarily with UTF-8 strings. If the app does
> > something like
>
>
> > some_utf8_string + env['PATH_INFO']
>
>
> > then Ruby 1.9 will complain with an incompatible encoding error.
>
>
> Actually, ASCII and UTF-8 should always concatenate with no error.
> Maybe you're thinking about putting BINARY Strings through a Unicode
> regular expression?

Yeah I wasn't thinking straight, sorry. Fine with ASCII for those.

> Concatenating UTF-8 and BINARY should blow up. As you pointed out, we
> can't be sure that HTTP_FOO_BAR *is* UTF-8.

Yes. After having given it some thought, I'm fine with it blowing up.
However the encoding should be standardized so that all web servers
consistently blow up, instead of the current situation where things
blow up in web server A but not in web server B.
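
For reference, the failure being discussed looks roughly like this (the header values are made up):

```ruby
utf8   = "caf\u00E9"     # UTF-8 String with a non-ASCII character
header = "caf\xE9".b     # BINARY String with a non-ASCII byte

error =
  begin
    utf8 + header
    nil
  rescue Encoding::CompatibilityError => e
    e
  end

# A BINARY String that happens to be pure ASCII concatenates fine.
ascii_only_header = "text/html".b
joined = utf8 + ascii_only_header
```

This is why the same application code can work for most requests and fail only when a header carries non-ASCII bytes.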

So here's a new proposal:
- All required Rack variables must be ASCII, except for HTTP_
variables.
- HTTP_ variables must be binary.
- REQUEST_URI, if exists, must be ASCII. This is the only requirement,
I'm not proposing standardizing its meaning or requiring that it
exists.
- All other variables can have arbitrary encodings (i.e. no
standardizations).

Outstanding issues:
- Should SERVER_NAME be ASCII or UTF-8? I'm fine with either.
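
A sketch of how a lint-style check for this proposal might look. The helper name `encoding_violations` is hypothetical, and it reads "ASCII" as Ruby's US-ASCII encoding, which is an assumption.

```ruby
# Hypothetical checker: returns a list of messages for env entries that
# do not follow the proposal above. Other variables are not checked.
def encoding_violations(env)
  ascii_keys = %w[REQUEST_METHOD SCRIPT_NAME PATH_INFO QUERY_STRING
                  SERVER_NAME SERVER_PORT REQUEST_URI]

  env.each_with_object([]) do |(key, value), problems|
    next unless key.is_a?(String) && value.is_a?(String)

    if key.start_with?("HTTP_")
      problems << "#{key} should be binary" unless value.encoding == Encoding::BINARY
    elsif ascii_keys.include?(key)
      ok = value.encoding == Encoding::US_ASCII && value.ascii_only?
      problems << "#{key} should be US-ASCII" unless ok
    end
  end
end

problems = encoding_violations(
  "PATH_INFO" => "/clubs/%C3%BC",   # UTF-8 literal, so it is flagged
  "HTTP_HOST" => "example.com".b
)
```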

naruse

Sep 14, 2010, 9:23:40 PM

to Rack Development

On Sep 13, 11:08 pm, Hongli Lai <hon...@phusion.nl> wrote:
> So here's a new proposal:
> - All required Rack variables must be ASCII, except for HTTP_
> variables.

According to the Rack spec, the required variables are REQUEST_METHOD,
SCRIPT_NAME, PATH_INFO, QUERY_STRING,
SERVER_NAME, and SERVER_PORT.
http://rack.rubyforge.org/doc/SPEC.html

They come from PEP 333 and RFC 3875.
http://www.python.org/dev/peps/pep-0333/
http://www.ietf.org/rfc/rfc3875.txt

According to those specs, their content is limited to ASCII.
So their encoding can be US-ASCII (or an ASCII-compatible encoding).

> - HTTP_ variables must be binary.

Yes, so their encoding should be ASCII-8BIT.

> - REQUEST_URI, if exists, must be ASCII. This is the only requirement,
> I'm not proposing standardizing its meaning or requiring that it
> exists.

REQUEST_URI is defined in neither PEP 333 nor RFC 3875.
But RFC 3050 and Apache's extension seem to be related.
http://www.ietf.org/rfc/rfc3050.txt
http://httpd.apache.org/docs/2.0/en/mod/mod_setenvif.html

They say it is part of a URI, so it must be ASCII.

> - All other variables can have arbitrary encodings (i.e. no
> standardizations).
>
> Outstanding issues:
> - Should SERVER_NAME be ASCII or UTF-8? I'm fine with either.

SERVER_NAME's content is limited within ASCII characters.
So its encoding should be US-ASCII.
