Bug possibly in IS_URL or other validator

Kyle Smith

Jun 10, 2008, 5:24:41 PM
to web2py Web Framework
I haven't had a chance to dig into the validators yet to determine
exactly where this bug is occurring, but I have been able to duplicate
it repeatedly.

When inserting or updating a field that contains a requirement such
as:

db.mytable.myurl.requires=IS_URL()

There are a number of cases in IS_URL() where valid URLs are said to
be invalid:

http://test.com/
http://test.com/?param=1
http://test.com/index/&test=1

All of the above are perfectly valid. The larger problem I've run into
is with long URLs. While entering real data into a test system, we
found that attempting to insert a URL of more than about 50 characters,
with something IS_URL doesn't like at the end, causes Python to use as
much CPU as is available, web2py and CherryPy to become completely
unresponsive, and the app to need to be killed and restarted.

http://myreallylongdomain.com/r.aspx?vendcode=AD999&productid=01922&departmentid=392929&optionalinfo=

The above URL is an example that caused the process death that I'm
describing.
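
In case it helps with reproducing this, here is a minimal timing harness (the pattern below is the existing IS_URL regexp from validators.py; the trailing '%' is just an assumed character that the character class rejects, to force the failing case):

import re
import time

# The IS_URL pattern from web2py's validators.py as it stands today.
pattern = re.compile(
    r'^(http|HTTP|https|HTTPS|ftp|FTP|file|FILE|rstp|RSTP)'
    r'\://[\w/=&\?\.]+$')

# Assumed failing input: a long URL ending in '%', which the
# character class above rejects.
url = ('http://myreallylongdomain.com/r.aspx?vendcode=AD999'
       '&productid=01922&departmentid=392929&optionalinfo=%')

start = time.time()
result = pattern.match(url)  # observe whether this returns promptly
print result, 'elapsed: %.3fs' % (time.time() - start)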

I'll attempt to look into this later tonight, but the validation in
IS_URL is probably not as robust as it needs to be.

Kyle

yarko

Jun 10, 2008, 6:09:44 PM
to web2py Web Framework
Hi Kyle -

Your hint of "over about 50 characters" makes this sound suspiciously
familiar...

...see the regular expression discussion at
http://groups.google.com/group/web2py/browse_thread/thread/59ff2e31698bced6/

I confess that I have not reported this Python bug yet (and need to).

Yarko


On Jun 10, 3:37 pm, "Kyle Smith" <kyletsm...@gmail.com> wrote:
> I haven't had a chance to dig into the validators yet to determine exactly
> where this bug is occurring but I have been able to duplicate it repeatedly.
>
> When inserting or updating a field that contains a requirement such as:
>
> db.mytable.myurl.requires=IS_URL()
>
> There are a number of cases in IS_URL() where valid URL's are said to be
> invalid:
>
> http://test.com/
> http://test.com/?param=1
> http://test.com/index/&test=1
>
> All of the above are perfectly valid. The larger problem I've run into is
> with large URL's. While doing some data input of real data into a test
> system we found that attempting to insert a URL over about 50 characters
> with something that IS_URL doesn't like at the end of the URL causes python
> to use as much CPU as is available, web2py and cherrypy to become completely
> unresponsive, and the app needing to be killed and restarted.
>
> http://myreallylongdomain.com/r.aspx?vendcode=AD999&productid=01922&d...

Kyle Smith

Jun 10, 2008, 6:38:09 PM
to web...@googlegroups.com
Thank you for pointing me to that thread; it didn't come up in my search prior to sending this message. This is likely the same type of problem that you encountered, since the runaway does appear to be in Python, from what little debugging I've been able to do today.

Kyle

yarko

Jun 10, 2008, 6:56:08 PM
to web2py Web Framework
Kyle -

If you could give as much detail as possible, particularly on the long
URLs, then perhaps people from the community could chime in and narrow
it down, as we were able to in that other thread....

I would be happy to have more data to report if indeed it is a related
Python bug...

Regards,
Yarko

On Jun 10, 5:38 pm, "Kyle Smith" <kyletsm...@gmail.com> wrote:
> Thank you for pointing me to that thread, it didn't come up in my search
> prior to sending this message. This is likely the same type of problem that
> you encountered since the runaway does appear to be in python from what
> little bit of debugging I've been able to do today.
>
> Kyle
>
> On Tue, Jun 10, 2008 at 3:09 PM, yarko <yark...@gmail.com> wrote:
>
> > Hi Kyle -
>
> > Your hint of "over about 50 characters" makes this sound suspiciously
> > familiar...
>
> > ...see the regular expression discussion at
>
> >http://groups.google.com/group/web2py/browse_thread/thread/59ff2e3169...

Massimo Di Pierro

Jun 10, 2008, 8:03:07 PM
to web...@googlegroups.com
I posted trunk revision 151 with an IS_URL that will validate what you suggest. The fact is, the less strict IS_URL is, the less useful it is.

The limitation is due to field size. Try something like SQLField(....., length=128).

Massimo Di Pierro

Jun 10, 2008, 8:09:50 PM
to web...@googlegroups.com
Check if it is now fixed.

Kyle Smith

Jun 10, 2008, 8:16:29 PM
to web...@googlegroups.com
I'll be in a position to do some more specific debugging this evening, and I will be more than happy to provide specific details, with examples of cases that cause Python to run away while performing the regular expression check.

Kyle

Massimo Di Pierro

Jun 10, 2008, 8:21:02 PM
to web...@googlegroups.com
Thanks. Please check the trunk. I remind you that you can test it like this:

python web2py.py -S newapp
... create? yes
>>> IS_URL()('http://test.com')
('http://test.com', None)

None here means no error -> the URL is valid.

Massimo

Kyle Smith

Jun 10, 2008, 8:28:33 PM
to web...@googlegroups.com
Thanks for the quick response. I understand that strict URL validation has benefits, but at the same time, what constitutes a valid URL is pretty liberal by definition. In my particular case I have to store URLs that will be provided as entry points to third-party systems. As such, I deal with URLs that have lots of parameters, large query strings, small query strings, large paths, small paths, etc. Obviously this makes them more difficult to validate using a regular expression, and it may just make more sense for me to use a custom validator. There were a few limitations in the existing regular expression that seemed to miss very common cases, however; thanks for addressing those. I will check out rev 151 shortly and run my tests again.

As for the limitation being the field size, I'm not sure what you mean by that. I am using the following in my model:

SQLField("url","string",length=255,notnull=True,default="",required=True)

I have in my existing database URLs that are up to 180 characters long, including the query portion, hence the preset length of 255. Are you implying that the length should be smaller?

I do have a question about the choice of using a regular expression for the IS_URL validation. Would it make more sense to use urlparse to break the URL up into its parts and then validate each part individually? While this would certainly add some overhead, it would seem to me to provide a more robust URL validation than a single regular expression can provide.
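
For instance, Python 2.5's urlparse already does the splitting (older versions print a plain 6-tuple instead of a ParseResult); a quick interactive check:

from urlparse import urlparse

# urlparse splits a URL into (scheme, netloc, path, params, query,
# fragment); each piece could then be validated on its own.
print urlparse('http://test.com/index?param=1')
# -> ParseResult(scheme='http', netloc='test.com', path='/index',
#                params='', query='param=1', fragment='')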

Kyle

Massimo Di Pierro

Jun 10, 2008, 8:29:48 PM
to web...@googlegroups.com
> I have in my existing database URLs that are up to 180 characters long, including the query portion, hence the preset length of 255. Are you implying that the length should be smaller?

No, there is no reason for a limit.

> I do have a question about the choice of using a regular expression for the IS_URL validation. Would it make more sense to use urlparse to break the URL up into its parts and then validate each part individually? While this would certainly add some overhead, it would seem to me to provide a more robust URL validation than a single regular expression can provide.

Yes, it would make sense. Could you send me a patch?

Massimo

Kyle Smith

Jun 10, 2008, 9:05:03 PM
to web...@googlegroups.com
Revision 151 does appear to validate the various URLs that I mentioned above, and all of the URLs that I am currently working with.

As for a patch to validators.py, I would be more than happy to work on that.

Kyle

Kyle Smith

Jun 11, 2008, 10:02:38 PM
to web...@googlegroups.com
I've spent a bit of time thinking about this and wanted to discuss it with the list before I went any further. The question for me breaks down to how much validation we want to do on URLs. Right now the regular expression in use will validate many URLs in a set list of schemes, but it will probably invalidate some urlencoded data in the parameters portion of the URL, and it will not validate URLs that include a username and password at all. This is the existing regexp:

^(http|HTTP|https|HTTPS|ftp|FTP|file|FILE|rstp|RSTP)\://[\w/=&\?\.]+$

A slight modification to this will allow for including user:pass@ in a URL, though even this should be expanded to work with complex characters in the password portion (it also allows dashes):

^(http|HTTP|https|HTTPS|ftp|FTP|file|FILE|rstp|RSTP)\://([\w]+\:{1}[\w]+\@{1})?[\w\-/=#&\?\.]+$
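
A quick interactive check of that modified pattern (the credentials in the first URL are just a made-up example):

import re

pattern = re.compile(
    r'^(http|HTTP|https|HTTPS|ftp|FTP|file|FILE|rstp|RSTP)'
    r'\://([\w]+\:{1}[\w]+\@{1})?[\w\-/=#&\?\.]+$')

# A URL with user:pass@ now matches; the original pattern rejected it.
print pattern.match('http://user:secret@test.com/index?param=1') is not None  # True
print pattern.match('ftp://test.com/pub/file.txt') is not None                # True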

We can use urlparse to split a URL into its component pieces and then validate each piece. For example, we can do something like this in IS_URL:

from urlparse import urlparse
import re

def __init__(self, error_message='invalid url!'):
    self.error_message = error_message

def __call__(self, value):
    o = urlparse(value)

    # check that a supported scheme and a network location are present
    if o.scheme == '' or o.netloc == '':
        return (value, self.error_message)

    # validate netloc: optional user:pass@ followed by the host
    regex = re.compile(r'^([\w]+\:{1}[\w]+\@{1})?[\w\-\.]+$')
    if not regex.match(o.netloc):
        return (value, self.error_message)

    # validate path, if present
    if o.path:
        regex = re.compile(r'^/([\w\/\-_\.])+$')
        if not regex.match(o.path):
            return (value, self.error_message)

    # validate query, if present
    if o.query:
        regex = re.compile(r'^[\w\=&%@_\.\+\-]+$')
        if not regex.match(o.query):
            return (value, self.error_message)

    # passes validation
    return (value, None)
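
As a quick sanity check, usage would look something like this (assuming the two methods above are pasted into the IS_URL class):

validator = IS_URL()
print validator('http://test.com/index?param=1')
# -> ('http://test.com/index?param=1', None)
print validator('not a url')
# -> ('not a url', 'invalid url!')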


I guess my question boils down to the following: is it worthwhile to do this level of validation?

Warning: the above code is just my writing while thinking about the problem, and it wouldn't be a drop-in replacement for __init__ or __call__ in IS_URL.

Kyle

Massimo Di Pierro

Jun 11, 2008, 10:34:38 PM
to web...@googlegroups.com
Can't o = urlparse(value) fail too?

If you are happy with the regex you proposed I have no objection to it.

Massimo

Kyle Smith

Jun 11, 2008, 10:47:40 PM
to web...@googlegroups.com
urlparse returns a subclass of tuple, and I could be wrong, but I believe that in a failure case its output would simply be ('', '', '', '', '', ''). Of course, the code provided above should be taken as just my ideas while looking over the docs and thinking about how best to deal with the issue of validation.
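
On the question of whether it can fail: a quick interactive check (Python 2.5) shows that urlparse never raises on malformed input; it just shunts the unrecognized text into the path component, which the scheme/netloc check in the sketch above would catch:

>>> from urlparse import urlparse
>>> urlparse('not a url')
ParseResult(scheme='', netloc='', path='not a url', params='', query='', fragment='')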

As far as the regular expression goes, it's an improvement over what is currently being used in IS_URL, but I'm sure there is still room for improvement.

Kyle

Massimo Di Pierro

Jun 11, 2008, 11:27:26 PM
to web...@googlegroups.com
Probably urlparse is implemented with a regex inside.

Massimo

Kyle Smith

Jun 12, 2008, 1:07:50 AM
to web...@googlegroups.com
Actually, it isn't. Mostly it uses a pre-defined set of lists and a tuple, and it determines how to split the URL into its pieces based on which list the scheme is found in, and so on. That's why, even if urlparse is used, the content of each piece of the tuple would still need some validation. urlparse will return the following output for 'http://www.abc;;;xyz.com/123/456/?a=1b=2c=3':

('http', 'www.abc;;;xyz.com', '/123/456/', '', 'a=1b=2c=3', '')

While it nicely breaks the URL into its components of scheme, netloc, path, and query, it doesn't do any validation, and we are left with a netloc portion that we wouldn't want to approve.
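
For example, the netloc regex from my earlier sketch would reject that host portion, since ';' is not in its character class:

import re

netloc_regex = re.compile(r'^([\w]+\:{1}[\w]+\@{1})?[\w\-\.]+$')
print netloc_regex.match('www.abc;;;xyz.com')  # -> None, so it fails validation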

There's a package in the cheeseshop (netaddress) that tries to be compliant with RFC 3986, which defines the generic URI syntax, but it also has another package as a dependency, which defeats some of the goals you've set forth with web2py. It's also complete overkill for what IS_URL tries to accomplish.

Kyle