Hey Paul.
This is how you get to the individual report views (it's also attached).
From there you can post in a similar way in the form on those pages:
import requests
import lxml.html
START_URL = 'https://jobsearch.direct.gov.uk/Reports/Reports.aspx?setype=1&seswitch=1'
# Set up a session
s = requests.session()
# Get a cookie by requesting the initial url
response = s.get(START_URL)
root = lxml.html.fromstring(response.text.encode('utf-8'))
# Capture the __VIEWSTATE in the form
VIEWSTATE = root.xpath('//input[@name="__VIEWSTATE"]')[0].attrib['value']
# Set up some headers (some of these might not be needed)
headers = {}
headers['User-Agent'] = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/33.0.1750.149 Safari/537.36')
headers['Referer'] = 'https://jobsearch.direct.gov.uk/Reports/Reports.aspx?setype=1&seswitch=1'
headers['Origin'] = 'https://jobsearch.direct.gov.uk'
headers['Host'] = 'jobsearch.direct.gov.uk'
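# Content-Length is deliberately not set: requests computes it from the
# payload, and a hard-coded value (11485 came from the browser's request)
# would not match the body actually sent.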
# Set up the payload for the post request
payload = {}
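# This is the first link ('Active jobs by industrial and occupational classification')
# You can get the __EVENTTARGET from the link: it is the first argument to
# __doPostBack in the href (the id is the same string with '$' replaced by '_'):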
"""
<a id="MasterPage1_MainContent_folderControl_rptItems__ctl0_btnItem"
href="javascript:__doPostBack('MasterPage1$MainContent$folderControl$rptItems$_ctl0$btnItem','')"
tabindex="100">
<!--<img alt='Report' src="http://media.newjobs.com/id/WebAdmin/reportwrapper/Report.png" style="width:24px;height:24px;vertical-align:middle;border:0" /> -->
<span class="fnt24">
Active jobs by industrial and
occupational classification
</span>
</a>
"""
payload['__EVENTTARGET'] = 'MasterPage1$MainContent$folderControl$rptItems$_ctl0$btnItem'
payload['__EVENTARGUMENT'] = ''
payload['__VIEWSTATE'] = VIEWSTATE
payload['__VIEWSTATEENCRYPTED'] = ''
# These control the default inputs in the form's search; they might be redundant
payload['MasterPage1:HeaderContent:Header_Default:searchControlsSwitcher:_ctl0:_tbKeywords'] = 'Keywords (e.g. nurse)'
payload['MasterPage1:HeaderContent:Header_Default:searchControlsSwitcher:_ctl0:_ddlCountries'] = '160'
payload['MasterPage1:HeaderContent:Header_Default:searchControlsSwitcher:_ctl0:_tbWhere'] = 'City, county or postcode'
payload['MasterPage1:HeaderContent:Header_Default:searchControlsSwitcher:_ctl0:defaultRadius'] = '20'
payload['MasterPage1:HeaderContent:Header_Default:searchControlsSwitcher:_ctl0:_joblocations'] = ''
payload['MasterPage1:HeaderContent:Header_Default:searchControlsSwitcher:_ctl0:radiusUnits'] = ''
# Post away
response2 = s.post(START_URL, data=payload, headers=headers)
print(response2.text)
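
If you want to do the same for the other report links without copying each target by hand, something along these lines might work. It is only a rough sketch, not tested against the live site: it uses just the standard library (Python 3) so it runs without a network connection, and the names parse_postback, HiddenInputCollector and build_postback_payload are made up for illustration. It assumes every report link has an href of the form javascript:__doPostBack('target','argument') and that the hidden inputs on the page (__VIEWSTATE and friends) are all the server needs.

```python
import re
from html.parser import HTMLParser

# Matches __doPostBack('target','argument') inside a link's href.
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)','([^']*)'\)")


class HiddenInputCollector(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'input' and (attrs.get('type') or '').lower() == 'hidden':
            name = attrs.get('name')
            if name:
                self.fields[name] = attrs.get('value') or ''


def parse_postback(href):
    """Return (event_target, event_argument), or None if href is not a postback."""
    match = POSTBACK_RE.search(href)
    return match.groups() if match else None


def build_postback_payload(page_html, href):
    """Hypothetical helper: the page's hidden fields plus the link's target."""
    parsed = parse_postback(href)
    if parsed is None:
        raise ValueError('not a __doPostBack link: %r' % href)
    collector = HiddenInputCollector()
    collector.feed(page_html)
    payload = dict(collector.fields)
    payload['__EVENTTARGET'], payload['__EVENTARGUMENT'] = parsed
    return payload
```

You would then pass the result to the session in the same way as above, i.e. s.post(START_URL, data=build_postback_payload(response.text, href), headers=headers), where href is the href attribute of whichever report link you are after.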
All the best,
p
On 03/04/14 06:49, Paul Bradshaw wrote:
> Specifically, however, my challenge is this:
>
> I've used Mechanize and the page has one form, which is for the search
> at the top.
>
> On Thu, Apr 3, 2014 at 7:01 AM, Paul Bradshaw <paulonh...@gmail.com> wrote:
>
> Thanks Peter, Pall,
>
> My question said "it's not a form" so no, I'm not trying to scrape a
> search.
>
> The page contains links to reports which use __doPostBack. Apologies
> for not giving more details but my question was not about that page
> - it was about the broad practice of scraping doPostBack where it's
> not attached to a form, and this StackOverflow thread
> <http://stackoverflow.com/questions/3898660/emulate-javascript-dopostback-in-python-web-scrapping>
> which says you can't do it with Python.
>
> In terms of the etiquette (thanks for the link), I hope I tried to
> make my question succinct and relevant as per the guidance, rather
> than bogging it down in details which were not relevant to the
> question. I wasn't asking a question about a scraper, but about a
> general possibility within Python. Sorry if that wasn't clear.
>
> Pall answered that question succinctly.
>
>
> On Wed, Apr 2, 2014 at 11:44 PM, Páll Hilmarsson <pal...@gogn.in> wrote:
>
> Peter is right, in his passive aggressive way.
>
> The site seems to accept GET requests with the supplied keywords.
>
> All the best,
>
> P
>
> On 2 April 2014 16:44:03 GMT+00:00, Peter Waller wrote:
pal...@gogn.in | http://gogn.in | http://twitter.com/pallih