An HTML page contains anchor elements whose href attributes have
a semicolon in the URL. When I fetch the page using
urllib2.urlopen, all hrefs containing semicolons are
truncated.
For example, the href http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i;_ylt=AlWSqpkpqhICp1lMgChtJkCdGWoL
gets truncated to http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i
The page I am talking about can be fetched from
http://travel.yahoo.com/p-travelguide-485468-pune_india_vacations-i;_ylc=X3oDMTFka28zOGNuBF9TAzI3NjY2NzkEX3MDOTY5NTUzMjUEc2VjA3NzcC1kZXN0BHNsawN0aXRsZQ--
Thanks a lot.
Regards
jitu
Hi,
Sorry, what I wanted to ask was whether this is the
correct behaviour or a bug.
Thanks a lot.
Regards
jitu
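
One way to narrow down where the truncation happens is to pull the hrefs out of the raw HTML and look at them directly. The sketch below is only an illustration: it uses Python 3's html.parser rather than anything from the original post, and HrefCollector and SAMPLE_HTML are made-up names with a hand-written snippet standing in for the fetched page.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value is not None:
                self.hrefs.append(value)


# Assumption: a hand-written snippet, not the real Yahoo page.
SAMPLE_HTML = (
    '<a href="http://travel.yahoo.com/p-travelguide-6901959-'
    'pune_restaurants-i;_ylt=AlWSqpkpqhICp1lMgChtJkCdGWoL">Restaurants</a>'
)

collector = HrefCollector()
collector.feed(SAMPLE_HTML)
print(collector.hrefs)
```

If the semicolon part shows up here but not in the hrefs taken from the fetched page, the difference is in the HTML the server returned, not in the parsing.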
It might be worth checking that you are actually getting the page you
want; I seem to remember that semicolons need to be encoded, similar
to '&'.
Dorzey
> "geturl - this returns the real URL of the page fetched. This is
> useful because urlopen (or the opener object used) may have followed a
> redirect. The URL of the page fetched may not be the same as the URL
> requested." from
> http://www.voidspace.org.uk/python/articles/urllib2.shtml#info-and-geturl
>
> It might be worth checking that you are actually getting the page you
> want; I seem to remember that semicolons need to be encoded, similar
> to '&'.
You remember wrong.
http://www.faqs.org/rfcs/rfc2396.html
See Section 3.3, path-components.
Diez
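
To illustrate the point about path parameters, here is a small sketch that splits the example URL from the thread. It assumes Python 3's urllib.parse (the thread itself uses Python 2, where the equivalent module is urlparse); the variable names are only for illustration.

```python
from urllib.parse import urlparse

url = ('http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i'
       ';_ylt=AlWSqpkpqhICp1lMgChtJkCdGWoL')

parts = urlparse(url)

# The text after ';' in the last path segment is reported as "params",
# separately from the path itself.
print(parts.path)
print(parts.params)

# Reassembling the pieces gives back the original URL, semicolon included.
print(parts.geturl() == url)
```

The semicolon is a legal, unencoded delimiter here, so nothing in the URL syntax forces it to be dropped.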
My memory has been known to let me down on occasions ;) Thank you for
correcting my mistake.
>j> Hi,
>j> A html page contains 'anchor' elements with 'href' attribute having
>j> a semicolon in the url , while fetching the page using
>j> urllib2.urlopen, all such href's containing 'semicolons' are
>j> truncated.
>j> For example the href http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i;_ylt=AlWSqpkpqhICp1lMgChtJkCdGWoL
>j> get truncated to http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i
>j> The page I am talking about can be fetched from
>j> http://travel.yahoo.com/p-travelguide-485468-pune_india_vacations-i;_ylc=X3oDMTFka28zOGNuBF9TAzI3NjY2NzkEX3MDOTY5NTUzMjUEc2VjA3NzcC1kZXN0BHNsawN0aXRsZQ--
It's not Python that causes this. It is the server that sends you the
URLs without these parameters (that's what the parts after the semicolon are).
To get them you have to tell the server that you are a respectable
browser, e.g.:
import urllib2

url = 'http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i;_ylt=AlWSqpkpqhICp1lMgChtJkCdGWoL'
# Browser-like headers, so the server includes the ';...' parameters.
hdrs = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13',
        'Accept': 'image/*'}

request = urllib2.Request(url=url, headers=hdrs)
page = urllib2.urlopen(request).read()
--
Piet van Oostrum <pi...@cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: pi...@vanoostrum.org
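
As a rough check that the library itself leaves the semicolon alone, a request object can be built and inspected without touching the network. This is a sketch under assumptions: it uses Python 3's urllib.request in place of urllib2, and build_request, URL and HEADERS are hypothetical names introduced only for this example.

```python
from urllib.request import Request

URL = ('http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i'
       ';_ylt=AlWSqpkpqhICp1lMgChtJkCdGWoL')

HEADERS = {
    'User-Agent': ('Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; '
                   'rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13'),
    'Accept': 'image/*',
}


def build_request(url, headers):
    """Build the request object only; nothing is sent (hypothetical helper)."""
    return Request(url, headers=headers)


request = build_request(URL, HEADERS)

# The URL and the path that would go on the request line keep the ';...' part.
print(request.full_url)
print(request.selector)

# Request stores header names capitalized as 'User-agent'.
print(request.get_header('User-agent'))
```

Since the outgoing request still carries the parameters, any difference in the hrefs comes from what the server chooses to send back for a given set of headers.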
'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US;
rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13 AppEngine-Google;
(+http://code.google.com/appengine)'
Anyway, thanks. Good to know about the User-Agent field.
Jitu
On Aug 11, 12:36 am, Piet van Oostrum <p...@cs.uu.nl> wrote:
> It's not python that causes this. It is the server that sends you the
> URLs without these parameters (that's what they are).
>
> To get them you have to tell the server that you are a respectable
> browser.