___doPostBack challenge


Paul Bradshaw

unread,
Apr 2, 2014, 12:24:27 PM4/2/14
to scrap...@googlegroups.com
Trying to scrape this page but as it's not a form I'm hitting an issue with the javascript:

https://jobsearch.direct.gov.uk/Reports/Reports.aspx?setype=1&seswitch=1

This StackOverflow thread <http://stackoverflow.com/questions/3898660/emulate-javascript-dopostback-in-python-web-scrapping> says you can't do it with Python - correct?

Paul

Páll Hilmarsson

unread,
Apr 2, 2014, 12:34:09 PM4/2/14
to scrap...@googlegroups.com
Incorrect. It can be done in, at least, three ways:

1. Emulate the POST requests with the standard urllib library (or
better: Requests)

2. Use PhantomJS
(http://www.realpython.com/blog/python/headless-selenium-testing-with-python-and-phantomjs/#.Uzw7JlF_vE4)

3. Use Mechanize (which is really the same as 1, with a few shortcuts)
(https://stackoverflow.com/questions/6116023/screenscaping-aspx-with-python-mechanize-javascript-form-submission/6124393#6124393)

This should get you started:

https://stackoverflow.com/questions/13114977/why-does-this-scraperwiki-for-an-aspx-site-return-only-the-same-page-of-search-r
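
For option 2, here's a rough sketch of what it could look like (my assumption of the wiring: PhantomJS on your PATH, the selenium package installed, and the element id taken from the first report link on the page):

import time
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get('https://jobsearch.direct.gov.uk/Reports/Reports.aspx?setype=1&seswitch=1')

# Clicking the link makes PhantomJS run the javascript:__doPostBack(...) for you
link = driver.find_element_by_id(
    'MasterPage1_MainContent_folderControl_rptItems__ctl0_btnItem')
link.click()

time.sleep(2)  # crude wait for the postback to finish loading
print driver.page_source
driver.quit()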

All the best,

P


--
pal...@gogn.in | http://gogn.in | http://twitter.com/pallih |
https://github.com/pallih
GPG: C266 603E 9918 A38B F11D 9F9B E721 347C 45B1 04E9


Peter Waller

unread,
Apr 2, 2014, 12:44:03 PM4/2/14
to scrap...@googlegroups.com
Please could you give a bit more detail about what you're trying to achieve and what you've tried so far? I don't understand the question.

Are you trying to scrape a search with a keyword?

A quick test shows that one can just insert query parameters and it works, no javascript necessary, e.g. searching for "Nurse" (q=Nurse) and navigating to page 3 (pg=3):

https://jobsearch.direct.gov.uk/JobSearch/Search.aspx?setype=1&pp=25&pg=3&q=Nurse&cy=UK&sort=rv.dt.di&re=4
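
As a sketch of that approach (plain Requests, no JavaScript; the parameters are the same ones as in the URL above):

import requests

# Fetch page 3 of a "Nurse" search directly - no __doPostBack needed
params = {
    'setype': 1, 'pp': 25, 'pg': 3, 'q': 'Nurse',
    'cy': 'UK', 'sort': 'rv.dt.di', 're': 4,
}
response = requests.get(
    'https://jobsearch.direct.gov.uk/JobSearch/Search.aspx', params=params)
print response.text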

If you want to improve your chances of getting quick and accurate help, and to make efficient use of other people's time, you could improve the way you ask - this is a great general article on mailing list etiquette: http://catb.org/~esr/faqs/smart-questions.html#beprecise


Páll Hilmarsson

unread,
Apr 2, 2014, 6:44:52 PM4/2/14
to scrap...@googlegroups.com, Peter Waller

Peter is right, in his passive aggressive way.

The site seems to accept GET requests with the supplied keywords.

All the best,

P


--
pal...@gogn.in | https://gogn.in | https://twitter.com/pallih | https://github.com/pallih
GPG: C266 603E 9918 A38B F11D 9F9B E721 347C 45B1 04E9

Paul Bradshaw

unread,
Apr 3, 2014, 2:01:21 AM4/3/14
to scrap...@googlegroups.com
Thanks Peter, Pall,

My question said "it's not a form" so no, I'm not trying to scrape a search. 

The page contains links to reports which use __doPostBack. Apologies for not giving more details, but my question was not about that page - it was about the broad practice of scraping __doPostBack where it's not attached to a form, and this StackOverflow thread, which says you can't do it with Python.

In terms of etiquette (thanks for the link), I did try to make my question succinct and relevant as per the guidance, rather than bogging it down in details that weren't relevant to the question. I wasn't asking a question about a scraper, but about a general possibility within Python. Sorry if that wasn't clear.

Pall answered that question succinctly. 

Paul Bradshaw

Behind The Numbers - the Birmingham Mail datablog

Out now - Scraping for Journalists: http://leanpub.com/scrapingforjournalists 
8,000 Holes: How the 2012 Olympic Torch Relay Lost its Way: https://leanpub.com/8000holes (all proceeds to the Brittle Bone Society)
The Online Journalism Handbook: http://amzn.to/jEND3p 

Online Journalism Blog http://onlinejournalismblog.com 
Help Me Investigate http://helpmeinvestigate.com - Shortlisted for 
Multimedia Publisher of the Year, 2010; winner of Talk About Local investigation of the year 2010

Organiser, Hacks and Hackers Birmingham http://meetupbirmingham.hackshackers.com/

Visiting Professor, City University, London http://www.city.ac.uk/journalism/
Course Leader, MA Online Journalism, Birmingham City University http://bit.ly/maonlinejournalism

http://twitter.com/paulbradshaw
LinkedIn profile and recommendations at http://bit.ly/paulbrecommendations


Paul Bradshaw

unread,
Apr 3, 2014, 2:49:45 AM4/3/14
to scrap...@googlegroups.com
Specifically, however, my challenge is this: 

I've used Mechanize and the page has one form, which is for the search at the top. 

Beneath that, however, are a number of links to report pages which also use __doPostBack. The first <a href>, for example, has this value:

javascript:__doPostBack('MasterPage1$MainContent$folderControl$rptItems$_ctl0$btnItem','')

Not surprisingly, submitting this through the one form doesn't work, because that's for the job search.

The code I tried for that was:

br.select_form(name='aspnetForm')
br.form.set_all_readonly(False)
br['__EVENTTARGET'] = 'MasterPage1$MainContent$folderControl$rptItems$_ctl0$btnItem'
br['__EVENTARGUMENT'] = ''
response = br.submit()
html = response.read()

Páll Hilmarsson

unread,
Apr 3, 2014, 6:58:07 AM4/3/14
to scrap...@googlegroups.com
Hey Paul.

This is how you get to the individual report views (the script is also attached).
From there you can post in a similar way using the form on those pages:

import requests
import lxml.html

START_URL = 'https://jobsearch.direct.gov.uk/Reports/Reports.aspx?setype=1&seswitch=1'

# Set up a session
s = requests.session()

# Get a cookie by requesting the initial url
response = s.get(START_URL)

root = lxml.html.fromstring(response.text.encode('utf-8'))

# Capture the __VIEWSTATE in the form
VIEWSTATE = root.xpath('//input[@name="__VIEWSTATE"]')[0].attrib['value']

# Set up some headers (some of these might not be needed)
headers = {}
headers['User-Agent'] = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.149 Safari/537.36')
headers['Referer'] = 'https://jobsearch.direct.gov.uk/Reports/Reports.aspx?setype=1&seswitch=1'
headers['Origin'] = 'https://jobsearch.direct.gov.uk'
headers['Host'] = 'jobsearch.direct.gov.uk'
# No need to set Content-Length by hand - requests calculates it for you

# Set up the payload for the post request
payload = {}

# This is the first link ('Active jobs by industrial and occupational classification')
# You can get the __EVENTTARGET from the link's id:
"""
<a id="MasterPage1_MainContent_folderControl_rptItems__ctl0_btnItem"
href="javascript:__doPostBack('MasterPage1$MainContent$folderControl$rptItems$_ctl0$btnItem','')"
tabindex="100">
<!--<img alt='Report'
src="http://media.newjobs.com/id/WebAdmin/reportwrapper/Report.png"
style="width:24px;height:24px;vertical-align:middle;border:0" /> -->
<span class="fnt24">
Active jobs by industrial and occupational classification
</span>
</a>
"""
payload['__EVENTTARGET'] = 'MasterPage1$MainContent$folderControl$rptItems$_ctl0$btnItem'
payload['__EVENTARGUMENT'] = ''
payload['__VIEWSTATE'] = VIEWSTATE
payload['__VIEWSTATEENCRYPTED'] = ''

# These control the default inputs in the form's search; they might be redundant
payload['MasterPage1:HeaderContent:Header_Default:searchControlsSwitcher:_ctl0:_tbKeywords'] = 'Keywords (e.g. nurse)'
payload['MasterPage1:HeaderContent:Header_Default:searchControlsSwitcher:_ctl0:_ddlCountries'] = '160'
payload['MasterPage1:HeaderContent:Header_Default:searchControlsSwitcher:_ctl0:_tbWhere'] = 'City, county or postcode'
payload['MasterPage1:HeaderContent:Header_Default:searchControlsSwitcher:_ctl0:defaultRadius'] = '20'
payload['MasterPage1:HeaderContent:Header_Default:searchControlsSwitcher:_ctl0:_joblocations'] = ''
payload['MasterPage1:HeaderContent:Header_Default:searchControlsSwitcher:_ctl0:radiusUnits'] = ''

# Post away
response2 = s.post(START_URL, data=payload, headers=headers)

print response2.text.encode('utf-8')
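
If you want all of the reports rather than just the first, a rough sketch (continuing from the variables above, and assuming the other links follow the same __doPostBack pattern as the one quoted) would be to scrape the event targets out of the hrefs and loop over them:

import re

# Collect every __doPostBack target from the report links on the start page
targets = []
for link in root.xpath('//a[starts-with(@href, "javascript:__doPostBack")]'):
    match = re.search(r"__doPostBack\('([^']+)'", link.attrib['href'])
    if match:
        targets.append(match.group(1))

for target in targets:
    payload['__EVENTTARGET'] = target
    report = s.post(START_URL, data=payload, headers=headers)
    # you may need to re-capture __VIEWSTATE from each response before the next post
    print report.url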

All the best,

p


--
pal...@gogn.in | http://gogn.in | http://twitter.com/pallih | https://github.com/pallih
scrape.py

Paul Bradshaw

unread,
Apr 3, 2014, 7:24:04 AM4/3/14
to scrap...@googlegroups.com
Thanks Páll - that's enormously helpful.

Paul Bradshaw

Behind The Numbers - the Birmingham Mail datablog