problem downloading files


Chris Leonello

May 7, 2008, 4:32:55 PM
to web2py Web Framework
When I try to download a file and set the filename with an extra
arg in the path like so:

http://www.somehost.com/myapp/appadmin/download/files03020393.pdf/my%20friendly%20name.pdf

web2py hangs. I can download if I replace the spaces (which are
escaped as %20 above) with underscores:

http://www.somehost.com/myapp/appadmin/download/files03020393.pdf/my_friendly_name.pdf

I am trying to preserve the original filenames of the files that the
users uploaded but am forced presently to replace spaces with
underscores when they download the file. Does anyone else notice this
behavior? Is this a bug?

I'm running rev. 209 behind an apache proxy server.

Massimo Di Pierro

May 7, 2008, 5:05:44 PM
to web...@googlegroups.com
This should not be happening. Has anybody else seen this?
What OS are you using? Are you using the source or the binary version
of web2py?

Massimo

Massimo Di Pierro

May 7, 2008, 5:17:29 PM
to web...@googlegroups.com
I just tried on my system. It does not hang. Are you sure it hangs on
your system?
The correct behavior is to send an http 400 "Invalid Request"

This is because web2py does not allow spaces or other special
characters that need escaping in the URL.

Massimo


Chris Leonello

May 7, 2008, 7:46:27 PM
to web2py Web Framework
Are the %20 escape codes acceptable? Those are working for you?

I'm on Mac OS X 10.5.2. When I use a URL with these escape codes, the
python process hangs, using up 100% of a processor. It works fine if
there are no %20 codes.

It looks like a peculiarity of my configuration, and I'd like to track
it down. Where is the renaming of the file for the browser actually
handled? I suppose a Content-Disposition header is being set somewhere,
or something like that? I couldn't find anything in globals.py
Response.stream() or in stream_file_or_304_or_206() in streamer.py.


Massimo Di Pierro

May 7, 2008, 10:20:25 PM
to web...@googlegroups.com
In gluon/main.py there is a regex_url. %20 should not pass the
regex validation.

Massimo

Chris Leonello

May 8, 2008, 9:10:02 AM
to web2py Web Framework
When I run regex_url.match(path) in a plain python shell on a path
with a space in the name, it hangs. I've done this on Mac OS X 10.5.2
with Python 2.5.2 and on Windows XP with Python 2.4.3, and both times
it hung the python process. I copied regex_url from main.py:

c:\>python
Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> regex_url=re.compile('(?:^$)|(?:^(\w+/?){0,3}$)|(?:^(\w+/){3}\w+(/?\.?[\w\-\.]+)*/?$)|(?:^(\w+)/static(/\.?[\w\-\.]+)*/?$)')
>>> path='myapp/appadmin/download/file400404.doc/abig_presentation 082005.doc'
>>> regex_url.match(path)

The %20 that is in the path that the user requests is replaced with a
space somewhere before evaluation of the regex.
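
To illustrate that last point, here is a minimal sketch (Python 3
standard library, not web2py code) of how percent-decoding turns %20
back into a space. Which layer does the decoding in this setup (the
proxy, the web server, or web2py itself) is not established in this
thread, so the decoding step below is an assumption.

```python
from urllib.parse import quote, unquote

# The path as the browser sends it, with spaces escaped as %20.
requested = "/myapp/appadmin/download/files03020393.pdf/my%20friendly%20name.pdf"

# Assumption: some layer decodes the path before it is validated.
decoded = unquote(requested)
print(decoded)  # the %20 sequences are now literal spaces

# Re-encoding gives back the escaped form ("/" is left alone by default).
print(quote(decoded) == requested)
```

If the decoded form is what reaches regex_url.match(), the regex sees a
literal space even though the user typed %20.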

Massimo Di Pierro

May 8, 2008, 9:49:52 AM
to web...@googlegroups.com
very bad. This is a Python bug.

Massimo

Chris Leonello

May 8, 2008, 10:00:51 AM
to web2py Web Framework
So, are the %20 escape codes in URLs working for you? Do you get a
400 error message in your browser, or does your python process
running web2py hang? If you get a 400 response, what is the path that
is seen by the regex_url.match() expression? That is where web2py
is hanging for me.

Massimo Di Pierro

May 8, 2008, 10:42:03 AM
to web...@googlegroups.com
I get a 400 error as it should. web2py does not allow special
characters in the path.

yarko

May 8, 2008, 1:57:32 PM
to web2py Web Framework
This I found sort of interesting, so I played around w/ chris's
example a little while eating my lunch...

Windows, Python 2.5.2....

I isolated the problem to this sub-regular expression:

>>> regex_url=re.compile(r'(?:^(\w+/)\w+(/?\.?[\w\-\.]+)*/?$)')

If path contains no space, then there's no problem. If path contains
a space, then there's a problem.

Since this "feels" like a multi-line sort of issue, I tried removing
the "end of line" anchor from the RE, like so:

>>> regex_url=re.compile(r'(?:^(\w+/)\w+(/?\.?[\w\-\.]+)*/?)')

Now no problem...

Multiline, dotall flags to re.compile didn't seem to affect..... and
the other alternations (with ending '$') don't seem to be involved in
the problem...

.... as far as I got at lunch...

mdipierro

May 8, 2008, 4:16:12 PM
to web2py Web Framework
Unfortunately that is not a solution because we need to match the
entire URL.
I cannot reproduce this problem. I wonder whether this is a regex
implementation issue or a runaway problem.
Do you use Windows? Can you send me a short example so I can try once
more to reproduce the problem?

Massimo

yarko

May 8, 2008, 6:40:37 PM
to web2py Web Framework
I understand... was intended to narrow the problem area (possibly
report something on python; possibly find alternative form of RE that
did the same thing)...

Let me try Chris's (concise) command line example on a couple of more
python interpreters, platforms... see what I find...

yarko

May 8, 2008, 6:56:14 PM
to web2py Web Framework
Ok - this from Chris:

>>> import re
>>> regex_url=re.compile('(?:^$)|(?:^(\w+/?){0,3}$)|(?:^(\w+/){3}\w+(/?\.?[\w\-\.]+)*/?$)|(?:^(\w+)/static(/\.?[\w\-\.]+)*/?$)')
>>> path='myapp/appadmin/download/file400404.doc/abig_presentation 082005.doc'
>>> regex_url.match(path)


Also "hangs" (fails) on Fedora-8, with Python 2.5.1 that comes w/
Fedora:

Name         : python
Relocations  : (not relocatable)
Version      : 2.5.1
Vendor       : Fedora Project
Release      : 15.fc8
Build Date   : Tue 30 Oct 2007 12:57:32 PM CDT
Install Date : Thu 28 Feb 2008 02:45:48 PM CST
Build Host   : xenbuilder1.fedora.redhat.com
Group        : Development/Languages
Source RPM   : python-2.5.1-15.fc8.src.rpm
Size         : 18499208
License      : Python Software Foundation License v2
Signature    : DSA/SHA1, Tue 30 Oct 2007 03:15:15 PM CDT, Key ID b44269d04f2a6fd2
Packager     : Fedora Project
URL          : http://www.python.org/
Summary      : An interpreted, interactive, object-oriented programming language.

So - I'll look over what this RE is trying to say a bit, and see if I
can find an equivalent / alternate way that will work with Chris's
example path.

BTW - my windows Python was from python.org (not the activestate build
- will try that too later tonight)


Yarko

Chris Leonello

May 8, 2008, 7:22:35 PM
to web2py Web Framework
Thanks yarko,

I am very busy and don't have the desire to debug a regex right now
anyway. Do you get a hang when web2py is passed a url with a %20 in
the path or just when you try this regex on the command line? Massimo
indicated he didn't have a problem. This would indicate that the path
passed to regex_url.match(path) (in gluon/main.py) didn't have the %20
replaced by a space on his system. But it did have the %20 replaced
with a space on mine. I think this is another piece of the puzzle that
needs to be confirmed.

yarko

May 8, 2008, 9:01:01 PM
to web2py Web Framework
This is the regular expression engine.

To answer your question, Chris, I can cause this effect with or
without the space in the path (with or without '%20'). I think
someone from python project is going to have to debug this at code
level.

Here's what I can say about it.

The minimal "failing" set seems to involve:

1 - '/' at the beginning of a pattern ('/?' exacerbates it);
2 - '$' at the end of a pattern;
3 - lack of space (' ') in the matching pattern;

Example:

Given this starting point:
--------------
import re
path='myapp/appadmin/download/file400404.doc/abig_presentation 082005.doc'
-----------

Here are some minimal failing (and near passing) patterns on Windows:

------- hangs: ---------
regex_url=re.compile(r'((/?[\w]*)*$)')
print regex_url.match(path)

------ completes (returns "None"): --------
regex_url=re.compile(r'((/?[\d]*)*$)')
print regex_url.match(path)

----- completes: --------
regex_url=re.compile(r'((/?[-.0-9A-Z]*)*$)')

----- completes: --------
regex_url=re.compile(r'((/?[-.0-9A-Za-t]*)*$)')

----- hangs: --------
regex_url=re.compile(r'((/?[-.0-9A-Za-w]*)*$)')

----- completes (note lower case range starts w/ 'b'): --------
regex_url=re.compile(r'((/?[-.0-9A-Zb-w]*)*$)')


I've run these same lines on Windows (Python 2.5.2) and Fedora-8
(Python 2.5.1), with the same results.

I thought for a bit that it was just one aspect, and tried various
permutations to get around it, but I think there are several things
that lead to this, and it will just need some debugging of the RE
engine code.

This may be related to this bug:
http://bugs.python.org/issue1160

To confirm, will compile... may do later or in a few days...


Yarko.

yarko

May 8, 2008, 9:19:09 PM
to web2py Web Framework
...anyway, just peeking at the Python 2.6 code, it's not changed.

It might be that this code (around line 44 in gluon/main.py) will
need to rework the RE into some python logic - just to get around this
RE bug.
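
A minimal sketch of what "python logic" could look like, assuming the
path is split on '/' and each segment is checked on its own with a
single, non-nested character class. The function name is_valid_path
and the exact rules are hypothetical; this does not reproduce every
case of regex_url (for example the static branch or the limit on
leading dots), and it is written for Python 3.

```python
import re

# Hypothetical per-segment patterns: no nested quantifiers, so each
# check is a single linear scan of one segment.
WORD = re.compile(r"\w+")
ARG = re.compile(r"[\w\-.]+")

def is_valid_path(path):
    """Return True if path looks like app/controller/function[/args...]."""
    if path.endswith("/"):
        path = path[:-1]  # tolerate a single trailing slash
    if path == "":
        return True
    segments = path.split("/")
    head, args = segments[:3], segments[3:]
    # First three segments are plain words; the rest may hold '-' and '.'.
    return (all(WORD.fullmatch(s) for s in head)
            and all(ARG.fullmatch(s) for s in args))

print(is_valid_path("myapp/appadmin/download/file400404.doc/abig_presentation_082005.doc"))
print(is_valid_path("myapp/appadmin/download/file400404.doc/abig_presentation 082005.doc"))
```

Because every segment is validated independently, a bad character makes
the check fail at that segment instead of sending the engine back over
the whole path.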

Massimo Di Pierro

May 8, 2008, 10:08:16 PM
to web...@googlegroups.com
Could you do two tests for me?

1) Add ^ at the beginning of the regular expression.
2) Print the string as passed to the match function.

This may be related to the regex "runaway problem"...

Massimo

Massimo Di Pierro

May 8, 2008, 10:13:46 PM
to web...@googlegroups.com
Ignore my email.... I am working on this

Massimo Di Pierro

May 8, 2008, 10:27:17 PM
to web...@googlegroups.com
I think I understand what the problem is in Yarko's examples and I can
reproduce it.
(/?[\w]*)* would result in a very large number of matching combinations.

I still cannot reproduce the error with web2py. I am using leopard.

Massimo
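
To put a rough number on "a very large number of matching
combinations": a run of n word characters can be cut into consecutive
non-empty chunks in 2^(n-1) ways, and a repeated group such as
([\w]*)* may in principle consume the run as any of those chunkings.
The sketch below only counts the chunkings; it is a back-of-the-envelope
model, not a trace of what Python's engine actually does, and
count_splits is a name made up for this illustration.

```python
from functools import lru_cache

def count_splits(n):
    """Count the ways to cut n characters into non-empty consecutive chunks."""
    @lru_cache(maxsize=None)
    def ways(remaining):
        if remaining == 0:
            return 1  # nothing left: exactly one way (stop here)
        # Choose the length k of the next chunk, then split the rest.
        return sum(ways(remaining - k) for k in range(1, remaining + 1))
    return ways(n)

for n in (5, 10, 20):
    print(n, count_splits(n))  # doubles with every extra character
```

The count doubles with each extra character, which would explain why a
path only a few dozen characters long can look like a hang when the
overall match has to fail.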


Massimo Di Pierro

May 8, 2008, 10:42:22 PM
to web...@googlegroups.com
Can you try if this hangs?

regex_url=re.compile('(?:^$)|(?:^\w+/?$)|(?:^\w+/\w+/?$)|(?:^(\w+/){2}\w+/?$)|(?:^(\w+/){2}\w+(/[\w\-]+(\.[\w\-]+)*)+$)|(?:^(\w+)/static(/[\w\-]+(\.[\w\-]+)*)+$)')

I think this should fix the runaway problem caused by ((...)*)*

Massimo
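
A quick check of the expression above, written for Python 3 with raw
strings (the thread itself used Python 2.4/2.5). The pattern is the one
posted above, split over several source lines only for readability; the
underscore variant of the path is an assumption added here to show a
path that is accepted.

```python
import re

# The proposed pattern, unchanged, as adjacent raw-string literals.
regex_url = re.compile(
    r'(?:^$)|(?:^\w+/?$)|(?:^\w+/\w+/?$)|(?:^(\w+/){2}\w+/?$)'
    r'|(?:^(\w+/){2}\w+(/[\w\-]+(\.[\w\-]+)*)+$)'
    r'|(?:^(\w+)/static(/[\w\-]+(\.[\w\-]+)*)+$)'
)

with_space = 'myapp/appadmin/download/file400404.doc/abig_presentation 082005.doc'
no_space = 'myapp/appadmin/download/file400404.doc/abig_presentation_082005.doc'

rejected = regex_url.match(with_space)  # no match, and it returns promptly
accepted = regex_url.match(no_space)

print(rejected)
print(accepted.group(3), accepted.group(4))
```

Each path segment now has to start with '/', and each extension with
'.', so there is only one way to carve up the string and the engine has
nothing to retry when the space makes the match fail.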

yarko

May 8, 2008, 11:24:14 PM
to web2py Web Framework
Just a note on these:

On May 8, 9:08 pm, Massimo Di Pierro <mdipie...@cs.depaul.edu> wrote:
> Could you do two tests for me?
>
> 1) Add ^ at the beginning of the regular expression.

I ran my original tests with this, and gradually removed pieces of
the RE until I got the smallest one that I could easily push across
the boundary between what "passed" and what "hung". So "^" didn't
seem to have any impact on that boundary...

> 2) Print the string as passed to the match function.

I can, but I did all of this from the command line, so I continually
printed the stored string before running match(); in == out.

yarko

May 9, 2008, 12:03:25 AM
to web2py Web Framework
Thank you Massimo!

On both Fedora-8 (Python 2.5.1) and Windows (Python 2.5.2), this
does the trick.

Where the "old" RE captured the filename (no space) in group 3; this
does too, but captures the extension ('.doc' in Chris's example) in
group 4.

Probably time to see how it plays in real use.

Regards,
Yarko


For the filename with space, this makes "no match".

Massimo Di Pierro

May 9, 2008, 12:10:39 AM
to web...@googlegroups.com
I think it is fixed and posted it in trunk 214. Thank you yarko. Your
tests really helped nail this down.

The problem is that in expressions like

((...)*)*$

the inner expression can match the empty string, and the python engine
(on some systems) is stupid enough to test whether an endless
sequence of empty strings adds up to the expression to be matched. I
still do not see why regex_url was falling into this trap, but the
new one is more explicit and therefore I am more confident this does
not happen.

Massimo


yarko

May 9, 2008, 12:20:38 AM
to web2py Web Framework
Thanks! What is interesting is that whether runaway happens depends on
the match phrase (increasing the character set size by one causes
runaway, otherwise not), so I still suspect there is some boundary or
overflow in the RE code in Python. I may still try compiling on
Fedora, maybe this weekend, maybe next, just to see if I can catch
what is going on at this boundary. It would be good to report to
Python...

Yarko

Massimo Di Pierro

May 9, 2008, 12:27:03 AM
to web...@googlegroups.com
Yes, you should report it. This is a Python bug. Even if the problem
cannot, conceptually, be detected in every regex, Python should still
have a timeout.

Massimo
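
As far as I know, the standard re module takes no timeout argument, so
any time limit has to be imposed from outside. One possible workaround,
sketched here for Python 3 and not part of web2py, is to run the match
in a child interpreter and give up after a deadline. The helper name
match_with_timeout is made up for this example, and starting a process
per match is far too slow for a request path; it is meant for
experiments like the ones in this thread.

```python
import subprocess
import sys

# One-line child program: prints True or False for re.match(pattern, text).
_CHILD = "import re, sys; print(re.match(sys.argv[1], sys.argv[2]) is not None)"

def match_with_timeout(pattern, text, seconds=5.0):
    """Return True/False for a match, or None if the deadline passed."""
    try:
        done = subprocess.run(
            [sys.executable, "-c", _CHILD, pattern, text],
            capture_output=True, text=True, timeout=seconds, check=True,
        )
    except subprocess.TimeoutExpired:
        return None  # subprocess.run kills the child before raising
    return done.stdout.strip() == "True"

print(match_with_timeout(r"^\w+$", "abc"))
print(match_with_timeout(r"^\w+$", "a b"))
```

Passing one of the runaway patterns from earlier in the thread together
with the example path should make the call return None once the
deadline expires, instead of pinning a processor indefinitely.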
