I have a program – written in Python 2.7.3 and wxPython 2.9.4.0 – that uses
Python's urllib2.urlopen() function. The purpose of my program is to
count the occurrences of a user-given word on a user-given
website. So if I want to know how many times the word "library" occurs on
"wxpython.org", I type "http://www.wxpython.org" (without quotes) in the
URL box and "library" (without quotes) in the word box.
My problem is that the user has to enter the URL exactly, because
"wxpython.org" is not enough; it has to be entered as
"http://www.wxpython.org", which is rather annoying
for the user.
Is there a way for the URL to be entered only as "wxpython.org", with
urllib2.urlopen() filling in the missing "http://www."
part? Or is there another way, such as checking for the "http://www." part
and prepending it if it is not present? What is the best way to solve my
issue, and how exactly should I implement the solution?
On Mon, Oct 29, 2012 at 09:14:56PM +0100, Boštjan Mejak wrote:
> Is there a way for the URL to be entered only as "wxpython.org", with
> urllib2.urlopen() filling in the missing "http://www."
> part? Or is there another way, such as checking for the "http://www." part
> and prepending it if it is not present? What is the best way to solve my
> issue, and how exactly should I implement the solution?
There is, and that's probably why it is your homework
assignment. You need to read up on string handling.
> My problem is that the user needs to be exact at inputting the URL,
> because "wxpython.org" is not enough; it has to
> be inputted exactly as "http://www.wxpython.org", which is rather
> annoying for the user.
The "http://" part is really required. The latest round of browsers is
making people lazy, because they fill that in if it's not provided. I
still find myself typing it by hand. It's not hard to detect this,
however. Remember that there are a lot of protocols available for
URLs. There's no reason for urllib to assume that you meant "http". If
your program knows that http:// is the default, then it's up to you to
provide it.
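A minimal sketch of that check, using nothing but string handling. The
names normalize_url and default_scheme are made up for illustration, and
the "://" test is a simplification: it only asks whether the text already
names some protocol, not whether the result is a valid URL.

```python
def normalize_url(text, default_scheme="http"):
    """Return text with 'default_scheme://' prepended if no scheme is given.

    Hypothetical helper: the program, not urllib, decides that http is
    the default.
    """
    url = text.strip()
    if not url:
        raise ValueError("no URL given")
    # Anything that already names a protocol (http://, https://, ftp://, ...)
    # is left alone.
    if "://" not in url:
        url = default_scheme + "://" + url
    return url


if __name__ == "__main__":
    for raw in ("wxpython.org", "http://www.wxpython.org"):
        print(normalize_url(raw))
```

The result can then be passed to urlopen() in place of the raw text from
the URL box. Note that nothing here adds "www."; only the protocol is
filled in.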
The "www" part is different. That's not universal. You should be able
to fetch http://wxpython.org and have it work just fine. You might get
a "redirect" response telling you to fetch www.wxpython.org instead, but
that's something you need to be handling.
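For what it's worth, urllib2.urlopen() follows HTTP redirects by default,
and the response's geturl() method reports the URL that was finally
retrieved. A rough sketch, assuming that default behaviour is all you
need; fetch and opener are made-up names, and the import fallback is only
there so the same code also runs on Python 3.

```python
try:
    from urllib2 import urlopen          # Python 2, as in the original post
except ImportError:
    from urllib.request import urlopen   # Python 3 equivalent


def fetch(url, opener=urlopen):
    """Return (final_url, body) for url.

    Hypothetical helper. urlopen follows redirects on its own, so
    final_url may differ from url (for example, a "www." host).
    The opener argument exists only so the function can be exercised
    without network access.
    """
    response = opener(url)
    try:
        return response.geturl(), response.read()
    finally:
        response.close()
```

Calling fetch("http://wxpython.org") would return whatever URL the server
finally settled on, together with the page body to search for the word.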
-- Tim Roberts, t...@probo.com
Providenza & Boekelheide, Inc.