Matching both + and %20 as space in path


candlerb

Feb 7, 2009, 4:19:02 PM
to sinatrarb
There is a problem with path definitions which include spaces. They
are fine if the incoming request encodes the space as %20, but not if
it encodes as +. For example:
--------------
require 'rubygems'
require 'sinatra'

get '/hello :id' do
  "hello!#{params[:id].inspect}"
end
--------------

This matches GET /hello%20world HTTP/1.0
but not GET /hello+world HTTP/1.0

A similar issue exists for %HH encodings, which match only uppercase.
That is,

get '/hello{:id}'
matches GET /hello%7Bworld%7D HTTP/1.0
but not GET /hello%7bworld%7d HTTP/1.0
(even though both are equally valid encodings of open and close
braces).

I made a (somewhat ugly) fix for this issue - you can find it at
http://github.com/candlerb/sinatra/tree/master
in the candlerb/matchspace branch.
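
For illustration only, here is a minimal plain-Ruby sketch of the general idea. It is not the code in the branch above and not Sinatra's API; `tolerant_pattern` is a hypothetical helper. It builds a Regexp from a literal path so that a space matches " ", "%20" or "+", and so that percent-escapes match in either hex case.

```ruby
# Hypothetical helper (an assumption for this sketch, not part of Sinatra).
def tolerant_pattern(literal_path)
  source = literal_path.each_char.map do |ch|
    case ch
    when ' '
      # Accept all three spellings of a space.
      '(?:%20|\+| )'
    when /[A-Za-z0-9\-._~\/]/
      # Unreserved characters and "/" are matched literally.
      Regexp.escape(ch)
    else
      # Accept the raw character or its escape; (?i:...) ignores hex case.
      hex = ch.bytes.map { |b| format('%%%02X', b) }.join
      "(?:#{Regexp.escape(ch)}|(?i:#{hex}))"
    end
  end.join
  Regexp.new("\\A#{source}\\z")
end

pattern = tolerant_pattern('/hello {world}')
pattern.match?('/hello%20%7Bworld%7D')  # => true
pattern.match?('/hello+%7bworld%7d')    # => true
pattern.match?('/hello%20world')        # => false
```

Whether "+" in a path should count as a space at all is a separate question; the sketch only shows that both spellings can be accepted in one pattern.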

Regards,

Brian.

Jeff Hodges

Feb 8, 2009, 12:38:25 AM
to sina...@googlegroups.com
The '+' example seems like reasonable, correct behavior, IMHO. '+' is
only unescaped to ' ' in CGI parameters, not in the rest of the URI.
This is because '+' is allowed as a sub-delimiter in URIs. See
RFC-3986[1] for more info.

[1] http://labs.apache.org/webarch/uri/rfc/rfc3986.html

candlerb

Feb 8, 2009, 5:09:36 AM
to sinatrarb
On Feb 8, 5:38 am, Jeff Hodges <j...@somethingsimilar.com> wrote:
> The '+' example seems like reasonable, correct behavior, IMHO. '+' is
> only unescaped to ' ' in CGI parameters, not in the rest of the URI.

In that case, Sinatra's current behaviour is inconsistent, because

get '/:id' do ...
matches /Hello+world *and* decodes the + to a space (params[:id] ==
"Hello world")

> This is because '+' is allowed as a sub-delimiter in URIs. See
> RFC-3986[1] for more info.

RFC 3986 just says that + is a sub-delim, which can occur in the query
part of a URI and also in a path component (see pchar). However it
doesn't give the semantics of + in HTTP URIs.

So you also need the HTTP scheme-specific info. However, as far as I
can see, the current spec for HTTP (RFC 2616) doesn't describe using +
to encode space in either path or query. RFC 2616 hasn't been updated
to reference RFC 3986, but rather uses the older RFC 2396. RFC 2396
says that + can occur in pchar and in query, but again I can't see any
reference to using + to encode space, in either a path component or a
query.

RFC 1630 (old and informational) *does* say that + can be used to
encode a space in the query part of the URL, but doesn't mention its
use in the path.

I'll certainly agree that this is a minefield :-)

To explain why this matters to me: I have been using CGI.escape(foo)
to escape path components for insertion into a URL, because URI.escape
is too limited (it doesn't escape "/" to "%2F", for example). However,
CGI.escape(" ") gives "+" and so I end up with "/Foo+bar"

Maybe CGI.escape should just use %20 for space instead of +.
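
A short sketch of the escaping behaviour described above, using only the standard library. `escape_path_segment` is a hypothetical helper name introduced here; the post-processing step is one possible workaround, not something the thread settled on.

```ruby
require 'cgi'
require 'erb'

# Hypothetical helper: escape a single path segment, writing spaces as %20.
# This is safe because CGI.escape has already turned any literal "+" into
# "%2B", so every "+" left in its output stands for a space.
def escape_path_segment(segment)
  CGI.escape(segment).gsub('+', '%20')
end

CGI.escape('Foo bar')            # => "Foo+bar"
escape_path_segment('Foo bar')   # => "Foo%20bar"
escape_path_segment('a/b')       # => "a%2Fb"
escape_path_segment('1+1')       # => "1%2B1"

# ERB::Util.url_encode also uses %20 for a space.
ERB::Util.url_encode('Foo bar')  # => "Foo%20bar"
```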

Regards,

Brian.

Ryan Tomayko

Feb 8, 2009, 9:42:13 AM
to sina...@googlegroups.com
Yeah. The root of the issue is that we URL encode the route patterns
before turning them into Regexps. I'm not sure why. We should be
matching on the unencoded path_info. I thought this was a performance
optimization but that doesn't make much sense. We should get the
path_info from Rack in unencoded form and then we re-encode it to
match against routes. We should be matching decoded/normalized
values.

I've opened a ticket for tracking:

http://sinatra.lighthouseapp.com/projects/9779-sinatra/tickets/147

When I get a sec, I'll try removing the URL encoding logic and see
where the specs fail.

Thanks,
Ryan

candlerb

Feb 8, 2009, 11:16:37 AM
to sinatrarb
On Feb 8, 2:42 pm, Ryan Tomayko <r...@tomayko.com> wrote:
> Yeah. The root of the issue is that we URL encode the route patterns
> before turning them into Regexps. I'm not sure why. We should be
> matching on the unencoded path_info.

But this would require splitting on / first. E.g. consider matching
get '/:foo/:bar' against

/this%2Fpath/has%2Ftwo%2Fcomponents
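
The difference between the two orders of operation can be sketched in plain Ruby. `percent_decode` is a hypothetical, ASCII-only helper written for this example; it deliberately leaves "+" untouched.

```ruby
# Minimal percent-decoder for ASCII escapes (an assumption for this sketch).
def percent_decode(str)
  str.gsub(/%([0-9A-Fa-f]{2})/) { $1.hex.chr }
end

raw = '/this%2Fpath/has%2Ftwo%2Fcomponents'

# Decode first, then split: the encoded slashes turn into separators.
decoded_first = percent_decode(raw).split('/').reject(&:empty?)
# => ["this", "path", "has", "two", "components"]

# Split first, then decode each segment: two segments survive.
split_first = raw.split('/').reject(&:empty?).map { |seg| percent_decode(seg) }
# => ["this/path", "has/two/components"]
```

Only the second order lets '/:foo/:bar' capture "this/path" and "has/two/components".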

Ryan Tomayko

Feb 8, 2009, 12:04:27 PM
to sina...@googlegroups.com

I believe servers are to process that as "/this/path/has/two/components".

Ryan

Sam Roberts

Feb 8, 2009, 12:59:46 PM
to sina...@googlegroups.com

You're sure? What would be the point of encoding the '/' if you can't
make it different from the semantic use of '/' as the separator
between path segments?

I think it's a path with two components "this/path" and "has/two/components".

http://labs.apache.org/webarch/uri/rfc/rfc3986.html#path

Sam

Ryan Tomayko

Feb 8, 2009, 1:32:35 PM
to sina...@googlegroups.com
On Sun, Feb 8, 2009 at 9:59 AM, Sam Roberts <vieu...@gmail.com> wrote:
>
> On Sun, Feb 8, 2009 at 9:04 AM, Ryan Tomayko <r...@tomayko.com> wrote:
>>
>> On Sun, Feb 8, 2009 at 8:16 AM, candlerb <b.ca...@pobox.com> wrote:
>>>
>>> On Feb 8, 2:42 pm, Ryan Tomayko <r...@tomayko.com> wrote:
>>>> Yeah. The root of the issue is that we URL encode the route patterns
>>>> before turning them into Regexps. I'm not sure why. We should be
>>>> matching on the unencoded path_info.
>>>
>>> But this would require splitting on / first. E.g. consider matching
>>> get '/:foo/:bar' against
>>>
>>> /this%2Fpath/has%2Ftwo%2Fcomponents
>>
>> I believe servers are to process that as "/this/path/has/two/components".
>
> You're sure? What would be the point of encoding the '/' if you can't
> make it different from the semantic use of '/' as the separator
> between path segments?

I'm definitely not sure. I didn't realize that slashes after the first
had any semantic value to be honest. I thought the entire path part
was an opaque value. But browsers obviously have some smarts here
since they're capable of traversing relative URLs like "../foo/bar".

> I think it's a path with two components "this/path" and "has/two/components".
>
> http://labs.apache.org/webarch/uri/rfc/rfc3986.html#path

Interesting. And it looks like apache at least considers the URLs to
be different:

http://labs.apache.org/webarch%2Furi%2Frfc%2Frfc3986.html#path
http://labs.apache.org/webarch/uri/rfc/rfc3986.html#path

Nginx treats them as the same (i.e., "foo/bar" and "foo%2Fbar" are equivalent):

http://tomayko.com/misc%2Fbob/
http://tomayko.com/misc/bob/

I haven't tried other servers.

Still, I can't think of any practical advantage in treating them
differently. Can you? IMO, embedding slashes in the path part is
begging for trouble. If that's the only issue with comparing routes
after decoding and it fixes all of the other issues, I'd be very much
for accepting the limitation of not being able to embed slashes in
path segments.

Thanks,
Ryan

Sam Roberts

Feb 8, 2009, 2:53:15 PM
to sina...@googlegroups.com

It's a corner case, but I can think of reasons to put slashes in paths
for stuff I'm working on; one reason might be embedding entire URIs as a
path component (rather than as a query parameter).

I assume I can do that with sinatra, though; it's just that I can't match
it with routes? That'd be OK, and if somebody was dead set, they could
do their own routing as long as they can get the undecoded path
component of the URI.

I tried to confirm this by looking for the docs on how route matching
is done, and couldn't find them. It looks like routes can be a RegEx,
or a vaguely glob-like pattern? Except that unlike real globs, the *
can match a / character? So ** isn't used for matching multiple path
levels?

I've only found examples, and I believe that at least one example in
the intro must have a typo:

get '/hello/:name' do
    # matches "GET /foo" and "GET /bar"
    # params[:name] is 'foo' or 'bar'

- http://www.sinatrarb.com/intro.html

The GET commands must be missing a leading /hello?

Cheers,
Sam

candlerb

Feb 8, 2009, 4:55:43 PM
to sinatrarb
> > I think it's a path with two components "this/path" and "has/two/components".
>
> > http://labs.apache.org/webarch/uri/rfc/rfc3986.html#path
>
> Interesting. And it looks like apache at least considers the URLs to
> be different:
>
> http://labs.apache.org/webarch%2Furi%2Frfc%2Frfc3986.html#path
> http://labs.apache.org/webarch/uri/rfc/rfc3986.html#path
>
> Nginx treats them as the same (i.e., "foo/bar" and "foo%2Fbar" are equivalent):
>
> http://tomayko.com/misc%2Fbob/
> http://tomayko.com/misc/bob/
>
> I haven't tried other servers.
>
> Still, I can't think of any practical advantage in treating them
> differently. Can you?

For RESTful-style URLs where the resource ID itself contains a slash,
such as an invoice whose (business-assigned) ID is 2009/1234

/invoices/2009%2F1234
/invoices/2009%2F1234/print
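
As a rough sketch of how such URLs could be handled if matching were done against the raw (still encoded) path: the pattern and the `invoice_id` helper below are hypothetical, written in plain Ruby rather than as a Sinatra route, and the decoder handles ASCII escapes only.

```ruby
# Hypothetical pattern: one path segment for the ID, optionally followed by /print.
INVOICE = %r{\A/invoices/([^/]+)(/print)?\z}

# Match the raw path first, then decode only the captured segment.
def invoice_id(raw_path)
  m = INVOICE.match(raw_path)
  return nil unless m
  m[1].gsub(/%([0-9A-Fa-f]{2})/) { $1.hex.chr }
end

invoice_id('/invoices/2009%2F1234')        # => "2009/1234"
invoice_id('/invoices/2009%2F1234/print')  # => "2009/1234"
invoice_id('/invoices/2009/1234')          # => nil
```

If the path were decoded before matching, the first two requests would look like the third, and the ID could no longer be told apart from a sub-path.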

Ryan Tomayko

Feb 8, 2009, 10:10:02 PM
to sina...@googlegroups.com

Well, there's actually nothing more "RESTful" about stuffing things
like this into the path part than the query string. URLs that are
simple and meaningful definitely have value but there's nothing in
REST that favors the path part of a URL over the query string. I'd
stick to the query string for values like this or replace the slash
with a dash or other character. You're begging for compatibility
issues with a scheme like that.

As much as I dislike the idea of making it hard to take advantage of
features of the underlying specs (HTTP, URIs, etc.), this still feels
like a really good trade off that fixes a bunch of practical issues
for the cost of one corner-case feature that's also a bad practice.

Thanks,
Ryan

candlerb

Feb 9, 2009, 3:48:41 AM
to sinatrarb
On Feb 8, 7:53 pm, Sam Roberts <vieuxt...@gmail.com> wrote:
> I tried to confirm this by looking for the docs on how route matching
> is done, and couldn't find them. It looks like routes can be a RegEx,
> or a vaguely glob-like pattern? Except that unlike real globs, the *
> can match a / character? So ** isn't used for matching multiple path
> levels?

I worked this out from the source, which is actually very
straightforward. Methods 'get', 'post' etc. call 'route', and 'route'
calls method 'compile', so search for 'def compile'. You can see here
how the passed path is converted into a regexp.

If you pass a regexp to route, e.g. get %r{\A/hello/(.*)} then this
regexp is used as-is.

Matching is done in 'route!', where each regexp is matched in turn,
together with any associated conditions. If the match is successful,
then the captures are turned into params using the keys you provided,
like /:id. If you pass your own native regexp then the captures are
just stuffed into params[:captures]
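
The flow described above can be approximated in a few lines of plain Ruby. This is a simplified, hypothetical reconstruction, not the actual Sinatra source: it ignores `*` wildcards, conditions and URL encoding, and the names `compile` and `match_route` are used here only for illustration.

```ruby
# Turn a route into [regexp, keys]. Regexps are used as-is with no keys.
def compile(path)
  return [path, []] if path.is_a?(Regexp)
  keys = []
  # Splitting on a capture group keeps the ":name" tokens in the result.
  source = path.split(/(:\w+)/).map do |part|
    if part.start_with?(':')
      keys << part[1..-1]
      '([^/?#]+)'          # a named param matches one path segment
    else
      Regexp.escape(part)  # everything else is matched literally
    end
  end.join
  [Regexp.new("\\A#{source}\\z"), keys]
end

# Return a params-like hash on a match, or nil when the route does not match.
def match_route(route, request_path)
  pattern, keys = compile(route)
  m = pattern.match(request_path)
  return nil unless m
  keys.empty? ? { 'captures' => m.captures } : keys.zip(m.captures).to_h
end

match_route('/hello/:name', '/hello/foo')     # => {"name"=>"foo"}
match_route('/hello/:name', '/foo')           # => nil
match_route(%r{\A/hello/(.*)}, '/hello/a/b')  # => {"captures"=>["a/b"]}
```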

> I've only found examples, and I believe that at least one example in
> the intro must have a typo:
>
> get '/hello/:name' do
>     # matches "GET /foo" and "GET /bar"
>     # params[:name] is 'foo' or 'bar'
>
> - http://www.sinatrarb.com/intro.html
>
> The GET commands must be missing a leading /hello?

Yes, that looks like a typo to me too.