unicode support for args


Pawel Jasinski

Oct 25, 2011, 7:24:11 AM
to web2py-users
hi all,

I have discovered that args in the URL are restricted to ASCII.
In addition, tilde (~) is not considered valid in arguments.
I am using the REST mapping @request.restful(), where as far as I can tell
there is no technical reason to restrict args (RFC 3986).
I also found a similar limitation for HTTP headers: values can only be
ASCII.

I would like to provide a patch.
Is this something you would consider including in a future release?

Cheers,
Pawel
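
For reference, RFC 3986 lists tilde among the unreserved characters, and non-ASCII text travels in a path as percent-encoded octets. A minimal Python 3 sketch of that, assuming UTF-8 as the encoding (the sample arg is made up for illustration):

```python
from urllib.parse import quote, unquote
import string

# RFC 3986, section 2.3: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

# Hypothetical arg containing a tilde and Latin-1 range characters.
arg = '~pawel/öäü'

# Mark '/' and '~' as safe explicitly so the result does not depend on
# the Python version's default handling of '~'.
encoded = quote(arg, safe='/~')
decoded = unquote(encoded)

print(encoded)
print(decoded)
```

The tilde passes through unchanged, while each of ö, ä and ü becomes two percent-encoded UTF-8 octets.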

Massimo Di Pierro

Oct 25, 2011, 9:14:55 AM
to web2py-users
The validation is done on purpose. We think that allowing special
characters in the path info makes URLs less readable and can be a cause
of directory traversal attacks (~ specifically).

Massimo

Pawel Jasinski

Oct 25, 2011, 2:57:21 PM
to web2py-users
hi,

> of directory traversal attacks (~ specifically).
how exactly?

I am talking about arguments and only arguments.
I agree that ~ in application/controller/method names makes no sense.
For static files I agree 100%, but that is a different control path.

The arguments are just that, arguments. If you make such a blanket
statement about arguments in the URL, should you also apply it to forms? In
the end these are also arguments, and someone may take them 1:1 and feed
them into 'open'.
It is up to the controller to decide what to do with args. I believe
nobody takes anything that comes from the browser (args or form elements)
and tries to use it as the argument of 'open'. In the case of web2py, the DAL
already provides a perfect mechanism to take whatever comes in and
convert it into a reasonable name:
filename=db.table.field.store(content,whatever_convoluted_name_we_get).
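
For a controller that does want to turn an arg into a path, the check can live in the controller itself. A minimal Python 3 sketch, assuming a hypothetical helper `safe_join` and a made-up base directory (neither is part of web2py); note that `open` and `os.path.join` do not expand `~`, only `os.path.expanduser` or a shell does:

```python
import os

def safe_join(base_dir, user_arg):
    """Return a path under base_dir, or raise ValueError if it escapes."""
    base = os.path.realpath(base_dir)
    candidate = os.path.realpath(os.path.join(base, user_arg))
    # commonpath equals base only when candidate is base or lies below it.
    if os.path.commonpath([base, candidate]) != base:
        raise ValueError('path escapes base directory: %r' % user_arg)
    return candidate

# '~root' is treated as an ordinary directory name, so it stays inside.
inside = safe_join('/srv/uploads', '~root/notes.txt')

# '..' segments and absolute paths are what actually escape the base.
try:
    safe_join('/srv/uploads', '../../etc/passwd')
    escaped = True
except ValueError:
    escaped = False
```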


To be specific about the args filtering:

args must match:
regex_args = re.compile(r'''
    (^
      (?P<s>
        ( [\w@/-][=.]? )*   # s=args
      )?
    /?$)                    # trailing slash
    ''', re.X)

what I suggest is:
regex_args = re.compile(r'''
    (^
      (?P<s>
        ( [~\w@/-][=.]? )*  # s=args
      )?
    /?$)                    # trailing slash
    ''', re.X|re.U)



Cheers,
Pawel
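
The two patterns above can be compared side by side. This is a Python 3 sketch rather than the Python 2 code of the thread: Python 3 `str` patterns are Unicode-aware by default, so `re.ASCII` is used here as an assumption to imitate the current ASCII-only behaviour. The sample args are made up.

```python
import re

# Current pattern; re.ASCII stands in for Python 2's default \w semantics.
current_args = re.compile(r'''
    (^
      (?P<s>
        ( [\w@/-][=.]? )*   # s=args
      )?
    /?$)                    # trailing slash
    ''', re.X | re.ASCII)

# Proposed pattern: tilde added, \w interpreted as Unicode word characters.
proposed_args = re.compile(r'''
    (^
      (?P<s>
        ( [~\w@/-][=.]? )*  # s=args
      )?
    /?$)                    # trailing slash
    ''', re.X | re.U)

for arg in ['photos/2011', '~pawel', 'öäü', '../etc']:
    print(arg, bool(current_args.match(arg)), bool(proposed_args.match(arg)))
```

Both patterns still reject `../etc`, because `.` is only allowed directly after a character from the first class.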

Jonathan Lundell

Oct 25, 2011, 3:18:16 PM
to web...@googlegroups.com

On Oct 25, 2011, at 11:57 AM, Pawel Jasinski <pawel.j...@gmail.com> wrote:

> hi,
>
>> of directory traversal attacks (~ specifically).
> how exactly?
>
> I am talking about arguments and only arguments.
> I agree that ~ in case of application/controller/method makes no sense
> In case of static agree 100%, but that is different control path.

If you enable the parametric router, you'll get the kind of args handling you want, with the added feature that you can rewrite the args validation regex.
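
A rough sketch of what such a routes.py override might look like. The key name `args_match` is an assumption, inferred from the compiled `router._args_match` attribute that appears later in this thread, and the pattern is illustrative, not a copy of web2py's default:

```python
import re

# Hypothetical routes.py fragment for the parametric router.
# 'args_match' is assumed to be the router key behind router._args_match.
routers = dict(
    BASE=dict(
        # Illustrative pattern: word chars, '@', '-' and '~', with '.' or '='
        # allowed only directly after one of those characters.
        args_match=r'([~\w@-]|(?<=[~\w@-])[.=])*$',
    ),
)

# Quick local check of the pattern before deploying it.
args_re = re.compile(routers['BASE']['args_match'])
print(bool(args_re.match('~pawel')), bool(args_re.match('..')))
```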

Pawel Jasinski

Oct 25, 2011, 5:06:18 PM
to web2py-users
hi,

thanks! That solved my ~ problem.

Unfortunately for my öäü (chars above 128 and below 255 in latin-1) I
still need to overcome 2 challenges:

1. re.U must be supplied to compile or match to take advantage of
unicode interpretation of \w.
I could shift compile into the routes.py. Is it acceptable?

2. at some point before match call args have to be subjected to
decode('utf-8') to become unicode
Any suggestions?

--Pawel
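
Both challenges can be seen in a small Python 3 sketch, where `bytes` plays the role of the Python 2 `str` and `str` the role of `unicode`. The pattern is a simplified stand-in, not the router's actual regex:

```python
import re

# What the server sees after percent-decoding: raw UTF-8 octets.
raw = 'öäü'.encode('utf-8')

# A bytes pattern only treats ASCII letters, digits and '_' as \w.
bytes_word = re.compile(rb'\w+$')

# A text pattern with re.U treats Unicode letters as \w,
# but it can only be applied to decoded text.
text_word = re.compile(r'\w+$', re.U)

print(bytes_word.match(raw))                  # no match on raw octets
print(text_word.match(raw.decode('utf-8')))   # match after decoding
```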



Jonathan Lundell

Oct 25, 2011, 11:25:01 PM
to web...@googlegroups.com
On Oct 25, 2011, at 2:06 PM, Pawel Jasinski wrote:

> hi,
>
> thanks! That solved my ~ problem.
>
> Unfortunately for my öäü (chars above 128 and below 255 in latin-1) I
> still need to overcome 2 challenges:
>
> 1. re.U must be supplied to compile or match to take advantage of
> unicode interpretation of \w.
> I could shift compile into the routes.py. Is it acceptable?

I think so, yes.

>
> 2. at some point before match call args have to be subjected to
> decode('utf-8') to become unicode
> Any suggestions?

I'd like to do this right, but I'm a little confused. Do we need to consider Punycode, for example? Or is that just for domain names?
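
On the Punycode question: IDNA/Punycode is defined for host names, while path segments carry non-ASCII text as percent-encoded octets, conventionally UTF-8. A small Python 3 sketch of the difference, using a made-up host and path:

```python
from urllib.parse import quote

# Hypothetical internationalized host and path.
host = 'bücher.example'
path = '/bücher/öäü'

# Host names go through IDNA, which produces Punycode labels.
ascii_host = host.encode('idna').decode('ascii')

# Paths are percent-encoded as UTF-8 octets; '/' is kept as a separator.
ascii_path = quote(path, safe='/')

print(ascii_host)
print(ascii_path)
```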

Pawel Jasinski

Oct 26, 2011, 2:26:29 AM
to web2py-users
On Oct 26, 5:25 am, Jonathan Lundell <jlund...@pobox.com> wrote:
> > 2. at some point before match call args have to be subjected to
> > decode('utf-8') to become unicode
> > Any suggestions?
>
> I'd like to do this right, but I'm a little confused. Do we need to consider Punycode, for example? Or is that just for domain names?
>

what I mean is very trivial:

***************
*** 915,925 ****
      def validate_args(self):
          '''
          check args against validation pattern
          '''
          for arg in self.args:
!             if not self.router._args_match.match(arg):
                  raise HTTP(400, thread.routes.error_message % 'invalid request',
                             web2py_error='invalid arg <%s>' % arg)

      def update_request(self):
          '''
--- 939,949 ----
      def validate_args(self):
          '''
          check args against validation pattern
          '''
          for arg in self.args:
!             if not self.router._args_match.match(unicode(arg,'utf-8')):
                  raise HTTP(400, thread.routes.error_message % 'invalid request',
                             web2py_error='invalid arg <%s>' % arg)

This makes the validation pass. The args are still passed down to the
application as byte strings and have to be converted to unicode again
there. Ideally the framework should do the conversion once.
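
One way to do the conversion once is to decode first, validate the decoded text, and hand the decoded values on. A minimal Python 3 sketch; `decode_and_validate`, `InvalidArg` and the pattern are hypothetical names for illustration, not web2py code:

```python
import re

# Illustrative pattern modelled on the one proposed in this thread.
ARG_PATTERN = re.compile(r'([~\w@-][=.]?)*$', re.U)

class InvalidArg(ValueError):
    """Raised when an arg is not valid UTF-8 or fails the pattern."""

def decode_and_validate(raw_args, pattern=ARG_PATTERN, encoding='utf-8'):
    """Decode each raw arg once, validate it, and return the decoded list."""
    decoded = []
    for raw in raw_args:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            raise InvalidArg('invalid arg <%r>' % raw)
        if not pattern.match(text):
            raise InvalidArg('invalid arg <%s>' % text)
        decoded.append(text)
    return decoded

# The application receives text and does not need to decode again.
args = decode_and_validate([b'caf\xc3\xa9', b'~x'])
```

Malformed UTF-8 is rejected at the same point as a pattern mismatch, so the application never sees undecodable input.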

Jonathan Lundell

Oct 28, 2011, 11:14:37 AM
to web...@googlegroups.com
On Oct 25, 2011, at 11:26 PM, Pawel Jasinski wrote:

> On Oct 26, 5:25 am, Jonathan Lundell <jlund...@pobox.com> wrote:
>>> 2. at some point before match call args have to be subjected to
>>> decode('utf-8') to become unicode
>>> Any suggestions?
>>
>> I'd like to do this right, but I'm a little confused. Do we need to consider Punycode, for example? Or is that just for domain names?
>>
>
> what I mean is very trivial:

Thanks.

I'm a little concerned, for compatibility reasons, about making an unconditional change. I'm thinking I'll put an enable flag into the routing parameters and then implement both changes you suggest (vars, too), conditional on that flag.
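
A sketch of how such a flag could gate the behaviour while leaving the default unchanged. The flag name `unicode_args` and the helper `make_arg_validator` are purely hypothetical, and the code is Python 3 with `re.ASCII` standing in for the existing ASCII-only matching:

```python
import re

def make_arg_validator(args_match, unicode_args=False):
    """Build a validator; unicode_args is a hypothetical opt-in flag."""
    flags = re.UNICODE if unicode_args else re.ASCII
    pattern = re.compile(args_match, flags)
    encoding = 'utf-8' if unicode_args else 'ascii'

    def is_valid(raw_arg):
        try:
            text = raw_arg.decode(encoding)
        except UnicodeDecodeError:
            return False
        return pattern.match(text) is not None

    return is_valid

legacy = make_arg_validator(r'[\w@-]+$')                     # flag off
opt_in = make_arg_validator(r'[\w@-]+$', unicode_args=True)  # flag on

raw = 'öäü'.encode('utf-8')
print(legacy(raw), opt_in(raw))
```

With the flag off, existing applications keep rejecting non-ASCII args exactly as before.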

I'm traveling now and am not likely to get to it for about a week, though.
