Merb vs. Ruby 1.9.x vs. Rack vs. Encoding

Pavel Kunc

Apr 10, 2011, 6:25:57 PM
to merb
This has been a problem for a long time, and we have been forced to do
crazy things. Rack development goes slowly, and some patches,
including the patch for Encoding support in Ruby 1.9.x, are still not
there.

The result is that what comes from Rack is not what you might expect,
and that leads to the nasty exceptions about incompatible encodings
that we all know. It happens during POST, sometimes GET, and so on.
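
A minimal sketch of how this kind of exception arises, assuming the request data arrives tagged as ASCII-8BIT while templates and helpers produce UTF-8. The strings and the try_concat helper are made up for illustration; this is not Merb code.

```ruby
# Simulate a parameter as it may come out of Rack: UTF-8 bytes, but tagged
# as ASCII-8BIT (binary).
raw_param = "Plze\xC5\x88".dup.force_encoding(Encoding::ASCII_8BIT)

# A UTF-8 string with a non-ASCII character, as a helper or template might
# produce.
label = "M\u011Bsto: "

# Hypothetical helper: concatenate and report the error class instead of
# raising.
def try_concat(a, b)
  a + b
rescue Encoding::CompatibilityError => e
  "error: #{e.class}"
end

# Both sides contain non-ASCII bytes and carry different encodings, so Ruby
# refuses to join them.
result_raw = try_concat(label, raw_param)

# Re-tagging the same bytes as UTF-8 (no bytes change) makes them compatible.
fixed_param = raw_param.dup.force_encoding(Encoding::UTF_8)
result_fixed = try_concat(label, fixed_param)

puts result_raw
puts result_fixed
```

Note that the error only shows up when both strings contain non-ASCII characters, which is why it tends to surface only for some requests.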

Also, the Merb source code doesn't set its encoding, so unless you do
something about it, the source files (helpers, for example) are in the
default encoding.

I've been successfully using:

Encoding.default_internal = 'UTF-8'
Encoding.default_external = 'UTF-8'

in my init, but the latest upgrade of the server to 1.9.2-p180 broke
things again.

So I tried to UBFF (Ugly-Brute-Force-Fix) it in the experimental 1.2.1
branch:

https://github.com/merb/merb/tree/ruby-1.9.x-encoding-fixes

Until Rack handles encodings, I don't know of any other way to solve
the problem.

I WOULD LOVE TO HEAR IF IT BREAKS YOUR APPS!

:-)

Pavel

ngo...@googlemail.com

Apr 11, 2011, 1:34:51 PM
to me...@googlegroups.com
Hi,


On Apr 11, 2011 12:25am, Pavel Kunc <pavel...@gmail.com> wrote:
> This has been a problem for a long time, and we have been forced to do
> crazy things. Rack development goes slowly, and some patches,
> including the patch for Encoding support in Ruby 1.9.x, are still not
> there.
>
> The result is that what comes from Rack is not what you might expect,
> and that leads to the nasty exceptions about incompatible encodings
> that we all know. It happens during POST, sometimes GET, and so on.

There's a bit in the Rack specs:

"The input stream is an IO-like object which contains the raw HTTP POST data. When applicable, its external encoding must be “ASCII-8BIT” and it must be opened in binary mode, for Ruby 1.9 compatibility."

http://rack.rubyforge.org/doc/SPEC.html

So the external (and internal) encoding of the input object given by Rack could be set in the Rack stack when using Ruby 1.9 as long as it is a "proper enough" IO, or wrapped so that the mandatory methods deliver ASCII..UTF-8. The spec still has a bunch of holes regarding it, but handling that and the encoding of the environment variables in Merb::Rack::Application before it even hits the dispatcher sounds good to me.
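
A rough sketch of what that could look like as a step in the Rack stack. EnvEncodingFix and its list of keys are hypothetical, not existing Merb or Rack API, and it deliberately leaves rack.input alone since the spec wants that to stay binary.

```ruby
# Hypothetical middleware: re-tag selected environment strings as UTF-8
# before the application sees them. Only the encoding tag changes; the bytes
# stay the same.
class EnvEncodingFix
  # Assumption: which keys are worth re-tagging is a guess for illustration.
  KEYS = %w[PATH_INFO QUERY_STRING REQUEST_URI].freeze

  def initialize(app, encoding = Encoding::UTF_8)
    @app = app
    @encoding = encoding
  end

  def call(env)
    KEYS.each do |key|
      value = env[key]
      next unless value.is_a?(String)

      retagged = value.dup.force_encoding(@encoding)
      # Keep the original binary string when the bytes are not valid in the
      # target encoding.
      env[key] = retagged if retagged.valid_encoding?
    end
    @app.call(env)
  end
end

# Usage with a trivial app that reports the encoding it sees.
app = lambda { |env| [200, {}, [env["QUERY_STRING"].encoding.name]] }
env = { "QUERY_STRING" => "q=\xC5\x88".dup.force_encoding(Encoding::ASCII_8BIT) }
status, headers, body = EnvEncodingFix.new(app).call(env)
puts body.first
```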

pedro mg

Apr 11, 2011, 3:24:40 PM
to me...@googlegroups.com
Hi,

I hope to test it real soon, since I have some Merb apps to move to 1.9.
I started a new Merb app 2 weeks ago ;-)

Thanks in advance,
pedro mg

Pavel Kunc

Apr 11, 2011, 5:48:09 PM
to merb
I agree that the best place would be Merb::Rack::Application, where I
thought we could do it. What stopped me from doing that is that the
multipart POST will arrive as rack.input, which is a StringIO. And
I'm not that sure that changing the whole rack.input encoding to
Encoding.default_external would not break things. This was actually
the solution in the Rack Lighthouse, which was dropped due to this fear.

Also, in theory each part of a multipart POST can have a different
encoding AFAIK, which makes things even more complicated. Or am I wrong
here?

It's worth trying though.

Pavel

Nicos Gollan

Apr 12, 2011, 5:28:31 AM
to merb
On Apr 11, 11:48 pm, Pavel Kunc <pavel.k...@gmail.com> wrote:
> Multipart POST will arrive as rack.input, which is a StringIO.

Big POST request bodies could arrive as a File. I don't think we should
assume anything but the interface specified by Rack.
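
As a sketch of what relying only on that interface might mean: the read_body helper below is hypothetical and uses nothing beyond read and rewind, so it behaves the same for a StringIO and a file-backed body.

```ruby
require "stringio"
require "tempfile"

# Hypothetical helper: read the whole request body using only methods the
# Rack input interface provides, and hand it back tagged as binary.
def read_body(input)
  input.rewind
  data = input.read || ""
  input.rewind # leave the stream usable for whoever reads it next
  data.dup.force_encoding(Encoding::ASCII_8BIT)
end

# Small body, as it might arrive in memory.
small = StringIO.new("a=1&b=2")
small_body = read_body(small)

# Larger body, as it might arrive spooled to disk.
file = Tempfile.new("body")
file.binmode
file.write("a=1&b=2")
large_body = read_body(file)
file.close!

puts small_body == large_body
```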

> I'm not that sure that changing the whole rack.input encoding to
> Encoding.default_external would not break things. This was actually
> the solution in the Rack Lighthouse, which was dropped due to this fear.
>
> Also, in theory each part of a multipart POST can have a different
> encoding AFAIK, which makes things even more complicated. Or am I wrong
> here?

Probably not. This is The Web, where everything is more complicated.
There are also two rather orthogonal concepts: encoding and character
sets. The encoding (the Content-Transfer-Encoding header of a message
part) only specifies the wire format for transmission.

Anyway, looking at Thin, rack.input contains raw request data; at
least it should since it doesn't do any fancy interpretation of the
Content-Type header(s). So as I read it, we actually need to
*reinterpret* (as opposed to *convert*) incoming data based on

* each part's Content-Transfer-Encoding header
* each part's Content-Encoding header (yay)
* the type and character set specified in the part's Content-Type
header (defaulting to US-ASCII text/plain)

Effectively, we should probably consider the content of rack.input as
byte soup with 7bit-clean delimiters. It might still be interesting to
push the processing into Merb::Application and create a merb.input
environment part with more convenient semantics.
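
To illustrate the reinterpret-then-convert idea for the character set only, here is a sketch. part_charset and part_to_utf8 are hypothetical names, the charset regexp is a simplification rather than a full RFC-compliant parameter parser, and decoding of Content-Transfer-Encoding or Content-Encoding is left out entirely.

```ruby
# Hypothetical helper: pick the character set out of a part's Content-Type
# header, defaulting to US-ASCII. Unknown names fall back to binary.
def part_charset(content_type)
  match = /;\s*charset\s*=\s*"?([^\s;"]+)"?/i.match(content_type.to_s)
  Encoding.find(match ? match[1] : "US-ASCII")
rescue ArgumentError
  Encoding::ASCII_8BIT
end

# Hypothetical helper: first reinterpret the raw bytes in the declared
# charset (force_encoding changes only the tag), then convert to UTF-8
# (encode changes the bytes). Anything that does not fit stays binary.
def part_to_utf8(bytes, content_type)
  tagged = bytes.dup.force_encoding(part_charset(content_type))
  if tagged.encoding != Encoding::ASCII_8BIT && tagged.valid_encoding?
    tagged.encode(Encoding::UTF_8)
  else
    tagged.force_encoding(Encoding::ASCII_8BIT)
  end
end

latin1_bytes = "caf\xE9".dup.force_encoding(Encoding::ASCII_8BIT)
converted = part_to_utf8(latin1_bytes, "text/plain; charset=ISO-8859-1")
undeclared = part_to_utf8(latin1_bytes, "text/plain")

puts converted.encoding
puts undeclared.encoding
```

Keeping undecodable parts as binary rather than raising is a design choice for the sketch; a real implementation might prefer to reject the request.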

Nicos Gollan

Apr 12, 2011, 10:37:44 AM
to merb
Latest master has specs for character set handling in multipart
headers, which is yet another bag of fleas:
merb-core/spec/public/request/multipart_spec.rb

That is using the parser as it is now, and obviously failing since
merb doesn't support those encodings at all. There is also
Rack::Utils::Multipart, which may be interesting, as well as the
rack-multipart-related gem, which provides a middleware to piece
together a multipart/related request with parts referenced from a JSON
document.

Nicos Gollan

Apr 28, 2011, 11:56:30 AM
to merb
Smallish update: I'm doing some work on RFC-compliant header and
parameter handling in my private fork at the moment. It will only work
well on Ruby VMs with 1.9-compliant encoding handling, but should lead
to a clean transition from the outside (ASCII-like byte soup) world to
a UTF-8 environment.

It currently passes the additional multipart specs, and a bunch of
others that aren't yet integrated into dispatch. I'll see how that
works out and merge it and Pavel's changes into the active_support
branch (and thus into main development) if it doesn't break more than
the AS stuff already does ;-)

This will eventually mean that merb becomes incompatible with, or at
least rough to use on, <1.9 VMs, but IMO proper charset support is too
good a thing to pass up.