Problem when fetching page using urllib2.urlopen

jitu

Aug 10, 2009, 7:39:03 AM

Hi,

An HTML page contains anchor elements whose href attributes have
a semicolon in the URL. When I fetch the page using
urllib2.urlopen, all such hrefs containing semicolons are
truncated.

For example, the href http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i;_ylt=AlWSqpkpqhICp1lMgChtJkCdGWoL
gets truncated to http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i

The page I am talking about can be fetched from
http://travel.yahoo.com/p-travelguide-485468-pune_india_vacations-i;_ylc=X3oDMTFka28zOGNuBF9TAzI3NjY2NzkEX3MDOTY5NTUzMjUEc2VjA3NzcC1kZXN0BHNsawN0aXRsZQ--

Thanks a Lot
Regards
jitu

jitu

Aug 10, 2009, 7:43:12 AM

On Aug 10, 4:39 pm, jitu <nair.jiten...@gmail.com> wrote:
> Hi,
>
> A html page contains 'anchor' elements with 'href' attribute having
> a semicolon in the url , while fetching the page using
> urllib2.urlopen, all such href's containing 'semicolons' are
> truncated.
>
> For example the href http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i;_ylt...
> get truncated to http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i
>
> The page I am talking about can be fetched from http://travel.yahoo.com/p-travelguide-485468-pune_india_vacations-i;_...
>
> Thanks a Lot
> Regards
> jitu

Hi

Sorry, what I wanted to ask was whether this is the correct
behaviour or a bug.

Thanks A Lot.
Regards
jitu

dorzey

Aug 10, 2009, 12:15:07 PM

"geturl - this returns the real URL of the page fetched. This is
useful because urlopen (or the opener object used) may have followed a
redirect. The URL of the page fetched may not be the same as the URL
requested." from http://www.voidspace.org.uk/python/articles/urllib2.shtml#info-and-geturl

It might be worth checking that you are actually getting the page you
want; I seem to remember that semicolons need to be encoded, similar
to '&'.
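
As a rough illustration of the geturl behaviour quoted above, here is a
minimal sketch in Python 3, where urllib2 became urllib.request. It
starts a throwaway local server that redirects /old to /new;id=1, so no
real site is contacted. The handler class, the function name and both
paths are made up for this example.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class RedirectingHandler(BaseHTTPRequestHandler):
    """Hypothetical local server: /old redirects to /new;id=1."""

    def do_GET(self):
        if self.path == "/old":
            # Send a redirect whose target contains a semicolon.
            self.send_response(302)
            self.send_header("Location", "/new;id=1")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def requested_and_final_url():
    """Fetch /old from the local server and return (requested, final) URLs."""
    server = HTTPServer(("127.0.0.1", 0), RedirectingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    # An empty ProxyHandler bypasses any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        requested = "http://127.0.0.1:%d/old" % server.server_address[1]
        with opener.open(requested) as response:
            return requested, response.geturl()
    finally:
        server.shutdown()
        server.server_close()
```

Calling requested_and_final_url() returns the URL that was asked for
together with the URL that was actually served after the redirect, which
is the comparison suggested here for the Yahoo page.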

Dorzey

Diez B. Roggisch

Aug 10, 2009, 1:11:12 PM

dorzey wrote:

> "geturl - this returns the real URL of the page fetched. This is
> useful because urlopen (or the opener object used) may have followed a
> redirect. The URL of the page fetched may not be the same as the URL
> requested." from
> http://www.voidspace.org.uk/python/articles/urllib2.shtml#info-and-geturl
>
> It might be worth checking that you are actually getting the page you
> want; I seem to remember that semicolons need to be encoded, similar
> to '&'.

You remember wrong.

http://www.faqs.org/rfcs/rfc2396.html

See Section 3.3, path-components.
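
A small sketch of how the standard library treats such a URL may help
here. It uses Python 3's urllib.parse (the module is called urlparse in
Python 2) and the example href from the original post. The variable
names are only for illustration.

```python
from urllib.parse import urlparse

# Example href from the original post, split over two lines for readability.
url = ("http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i"
       ";_ylt=AlWSqpkpqhICp1lMgChtJkCdGWoL")

parts = urlparse(url)

# The part after the semicolon is kept as the parameters of the last
# path segment; it is not dropped and does not need to be encoded.
path_only = parts.path
params = parts.params

# Reassembling the pieces gives back the original URL, semicolon included.
round_trip = parts.geturl()
```

Here path_only holds the truncated form seen in the fetched page, while
params holds the _ylt=... part, so the semicolon is a legal separator
within a path segment rather than a character that must be escaped.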

Diez

dorzey

Aug 10, 2009, 1:28:51 PM

On 10 Aug, 18:11, "Diez B. Roggisch" <de...@nospam.web.de> wrote:
> dorzey wrote:
> > "geturl - this returns the real URL of the page fetched. This is
> > useful because urlopen (or the opener object used) may have followed a
> > redirect. The URL of the page fetched may not be the same as the URL
> > requested." from
> > http://www.voidspace.org.uk/python/articles/urllib2.shtml#info-and-ge...
>
> > It might be worth checking that you are actually getting the page you
> > want; I seem to remember that semicolons need to be encoded, similar
> > to '&'.
>
> You remember wrong.
>
> http://www.faqs.org/rfcs/rfc2396.html
>
> See Section 3.3, path-components.
>
> Diez

My memory has been known to let me down on occasions ;) Thank you for
correcting my mistake.

Piet van Oostrum

Aug 10, 2009, 3:36:55 PM

>>>>> jitu <nair.j...@gmail.com> (j) wrote:

>j> Hi,
>j> A html page contains 'anchor' elements with 'href' attribute having
>j> a semicolon in the url , while fetching the page using
>j> urllib2.urlopen, all such href's containing 'semicolons' are
>j> truncated.

>j> For example the href http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i;_ylt=AlWSqpkpqhICp1lMgChtJkCdGWoL
>j> get truncated to http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i

>j> The page I am talking about can be fetched from
>j> http://travel.yahoo.com/p-travelguide-485468-pune_india_vacations-i;_ylc=X3oDMTFka28zOGNuBF9TAzI3NjY2NzkEX3MDOTY5NTUzMjUEc2VjA3NzcC1kZXN0BHNsawN0aXRsZQ--

It's not Python that causes this. It is the server that sends you the
URLs without these parameters (that's what the part after the
semicolon is).

To get them you have to tell the server that you are a respectable
browser. E.g.

import urllib2

# Two example URLs from the original post; the second assignment
# overrides the first, so only the second page is fetched.
url = 'http://travel.yahoo.com/p-travelguide-6901959-pune_restaurants-i;_ylt=AlWSqpkpqhICp1lMgChtJkCdGWoL'

url = 'http://travel.yahoo.com/p-travelguide-485468-pune_india_vacations-i;_ylc=X3oDMTFka28zOGNuBF9TAzI3NjY2NzkEX3MDOTY5NTUzMjUEc2VjA3NzcC1kZXN0BHNsawN0aXRsZQ--'

# Headers that make the request look like it comes from a browser.
hdrs = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13',
        'Accept': 'image/*'}

request = urllib2.Request(url=url, headers=hdrs)
page = urllib2.urlopen(request).read()
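
For anyone on Python 3, where urllib2 was split into urllib.request and
urllib.error, a rough equivalent might look like the sketch below. The
helper names build_request and fetch are made up for this example, and
whether the server still answers the same way is not something the code
can guarantee.

```python
import urllib.request

# Page URL from the original post.
URL = ("http://travel.yahoo.com/p-travelguide-485468-pune_india_vacations-i"
       ";_ylc=X3oDMTFka28zOGNuBF9TAzI3NjY2NzkEX3MDOTY5NTUzMjUEc2VjA3NzcC1kZXN0BHNsawN0aXRsZQ--")

# Same browser-like headers as in the urllib2 version.
BROWSER_HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; "
                   "rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13"),
    "Accept": "image/*",
}


def build_request(url, headers=BROWSER_HEADERS):
    """Hypothetical helper: wrap the URL and headers in a Request object."""
    return urllib.request.Request(url, headers=headers)


def fetch(url):
    """Hypothetical helper: send the request and return the raw page bytes."""
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read()
```

Building the request is kept separate from sending it, so the headers
can be inspected (for example with request.get_header) before anything
goes over the network.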

--
Piet van Oostrum <pi...@cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: pi...@vanoostrum.org

jitu

Aug 11, 2009, 1:15:31 AM

Yes Piet, you were right, this works. But it does not seem to work on
Google App Engine, since it appends its own agent info, as seen below:

'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13 AppEngine-Google; (+http://code.google.com/appengine)'

Anyway, thanks. Good to know about the User-Agent field.

Jitu
