I've spent a bit of time thinking about this and wanted to discuss it with the list before I went any further. The question for me breaks down to how much validation we want to do on URLs. Right now the regular expression in use will validate many URLs in a set list of schemes, but it will probably reject some urlencoded data in the query portion of the URL, and it will not validate URLs that include a username and password at all. This is the existing regexp:
^(http|HTTP|https|HTTPS|ftp|FTP|file|FILE|rstp|RSTP)\://[\w/=&\?\.]+$
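To make that concrete, here is a quick sanity check (just illustrative; example.com and the sample URLs are mine, and I'm assuming the pattern is applied with re.match exactly as written):

import re

# the existing IS_URL pattern, verbatim
EXISTING = re.compile(r'^(http|HTTP|https|HTTPS|ftp|FTP|file|FILE|rstp|RSTP)\://[\w/=&\?\.]+$')

print(bool(EXISTING.match('http://example.com/page')))        # True: plain URL
print(bool(EXISTING.match('http://example.com/?q=a%20b')))    # False: '%' is not in the class
print(bool(EXISTING.match('http://user:pass@example.com/')))  # False: neither are ':' and '@'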
A slight modification to this will allow for including user:pass@ in a URL, though even this should be expanded to handle complex characters in the password portion (it also allows dashes):
^(http|HTTP|https|HTTPS|ftp|FTP|file|FILE|rstp|RSTP)\://([\w]+\:{1}[\w]+\@{1})?[\w\-/=#&\?\.]+$
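As a quick illustration (sample URLs are mine again), the modified pattern accepts a simple user:pass pair but still rejects any punctuation in the password, which is what I mean about complex characters:

import re

MODIFIED = re.compile(
    r'^(http|HTTP|https|HTTPS|ftp|FTP|file|FILE|rstp|RSTP)'
    r'\://([\w]+\:{1}[\w]+\@{1})?[\w\-/=#&\?\.]+$')

print(bool(MODIFIED.match('http://user:pass@example.com/')))   # True
print(bool(MODIFIED.match('http://user:p!ss@example.com/')))   # False: '!' breaks [\w]+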
We can use urlparse to split a URL into its component pieces and then validate each piece. For example, we can do something like this in IS_URL:
import re
from urlparse import urlparse  # Python 2; urllib.parse in Python 3

def __init__(self, error_message='invalid url!'):
    self.error_message = error_message

def __call__(self, value):
    o = urlparse(value)
    # a full check would test o.scheme against the supported list;
    # at minimum a scheme and a network location must be present
    if o.scheme == '' or o.netloc == '':
        return (value, self.error_message)
    # validate netloc, with an optional user:pass@ prefix
    regex = re.compile(r'^([\w]+:[\w]+@)?[\w\-\.]+$')
    if not regex.match(o.netloc):
        return (value, self.error_message)
    # validate path, if present (a bare '/' should pass, hence *)
    regex = re.compile(r'^/[\w/\-\.]*$')
    if o.path and not regex.match(o.path):
        return (value, self.error_message)
    # validate query, if present
    regex = re.compile(r'^[\w=&%@\.\+\-]+$')
    if o.query and not regex.match(o.query):
        return (value, self.error_message)
    # passes validation
    return (value, None)
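Assuming those two methods were attached to a validator class (call it IS_URL2 for the sake of the example; the name is hypothetical), usage would look something like:

validator = IS_URL2()
print(validator('http://user:pass@example.com/some/path?a=1&b=2'))  # (value, None)
print(validator('example.com'))           # no scheme -> (value, 'invalid url!')
print(validator('http://exa mple.com/'))  # space in netloc -> (value, 'invalid url!')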
I guess my question boils down to the following: is it worthwhile to do this level of validation?
warning: the above code is just what I wrote while thinking through the problem and wouldn't be a drop-in replacement for __init__ or __call__ in IS_URL
Kyle