Scraping a Serbian website with doPostBack -- success! (Almost)


John Maines

Oct 26, 2012, 6:18:30 PM
to scrap...@googlegroups.com
Hello,
 
I posted my first post on ScraperWiki a couple of days ago and got great help from some people. I was trying to scrape a Serbian .aspx webpage with __doPostBack. It lists contracts that the government there has entered into. They won't release the raw data.
 
 
Páll Hilmarsson gave me some Python code that worked fine ... for the first 15 pages. Then I needed to jump from the first set of pages to the second. A web user would get there by clicking on the "..." link that takes the user to page 16.
 
BUT the scraper won't get past page 16. The site generates an error saying the page the scraper requested does not exist. I can't figure out how to make the scraper jump to the next series of pages (16-30).
 
Any thoughts? Thanks in advance.
 
Here's Páll's code:
 
import lxml.html
import requests

starturl = 'http://portal.ujn.gov.rs/Izvestaji.aspx'

s = requests.session()  # create a session object so cookies persist
r1 = s.get(starturl)    # get page 1

# process page one
root = lxml.html.fromstring(r1.text)

# pick up the hidden ASP.NET form values
EVENTVALIDATION = root.xpath('//input[@name="__EVENTVALIDATION"]')[0].attrib['value']
VIEWSTATE = root.xpath('//input[@name="__VIEWSTATE"]')[0].attrib['value']

# build a dictionary to post to the site with the values we have collected.
# __EVENTARGUMENT can be changed to fetch another result page (Page$3, Page$4, etc.)
payload = {
    '__EVENTTARGET': 'ctl00$ContentPlaceHolder3$grwIzvestaji',
    '__EVENTARGUMENT': 'Page$2',
    'referer': 'http://portal.ujn.gov.rs/Izvestaji.aspx',
    '__EVENTVALIDATION': EVENTVALIDATION,
    '__VIEWSTATE': VIEWSTATE,
    '__VIEWSTATEENCRYPTED': '',
    'ctl00$txtUser': '',
    'ctl00$txtPass': '',
    'ctl00$ContentPlaceHolder1$txtSearchIzvestaj': '',
}

# post it; our response is now page 2
r2 = s.post(starturl, data=payload)
print(r2.text)
 

Silvio Traversaro

Oct 27, 2012, 9:51:43 PM
to scrap...@googlegroups.com
Apparently, to reach pages 16 to 30, you first have to get page 16 and then (using the EVENTVALIDATION and VIEWSTATE scraped from page 16) request the page you want. A similar approach is needed for all the other page intervals.
I have tried to implement a scraper for this data; the code is not polished, but it should work:
https://scraperwiki.com/scrapers/serbian_contracts/
I hope it can be of help.
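
The two-step jump Silvio describes might be sketched like this; the helper function names here are my own, and the form-field names are taken from Páll's code earlier in the thread:

```python
import lxml.html

STARTURL = 'http://portal.ujn.gov.rs/Izvestaji.aspx'

def extract_state(html):
    """Pull the hidden ASP.NET form fields out of a result page."""
    root = lxml.html.fromstring(html)
    return {
        '__EVENTVALIDATION': root.xpath('//input[@name="__EVENTVALIDATION"]')[0].attrib['value'],
        '__VIEWSTATE': root.xpath('//input[@name="__VIEWSTATE"]')[0].attrib['value'],
    }

def build_payload(state, page):
    """Form fields that ask the grid for result page `page`."""
    payload = {
        '__EVENTTARGET': 'ctl00$ContentPlaceHolder3$grwIzvestaji',
        '__EVENTARGUMENT': 'Page$%d' % page,
        '__VIEWSTATEENCRYPTED': '',
        'ctl00$txtUser': '',
        'ctl00$txtPass': '',
        'ctl00$ContentPlaceHolder1$txtSearchIzvestaj': '',
    }
    payload.update(state)
    return payload

# Usage against the live site (makes real HTTP requests, so not run here):
#   import requests
#   s = requests.session()
#   state = extract_state(s.get(STARTURL).text)            # state from page 1
#   r16 = s.post(STARTURL, data=build_payload(state, 16))  # jump to page 16 first
#   state = extract_state(r16.text)                        # re-extract from page 16
#   r20 = s.post(STARTURL, data=build_payload(state, 20))  # pages 16-30 now reachable
```

The second `extract_state` call is the crucial step: the server only accepts `Page$17` through `Page$30` when they are posted together with the state taken from a page inside that block.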

John Maines

Oct 29, 2012, 9:22:39 AM
to scrap...@googlegroups.com
Wow, thanks for all that work, Silvio. I saw the scraper on Sunday and it looked fine. The web page you built appears to be down now, though, generating some sort of error.
 
Ultimately the goal is to get the file out into Excel or delimited text, something that can be used for analysis.
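
Once the rows are in Python, writing delimited text is straightforward with the standard csv module; a minimal sketch, with made-up sample rows standing in for the scraped data:

```python
import csv

# hypothetical sample rows; in practice these come from the scraper's output
rows = [
    {'contractor': 'Example d.o.o.', 'value': '1200000'},
    {'contractor': 'Sample a.d.', 'value': '450000'},
]

with open('contracts.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['contractor', 'value'])
    writer.writeheader()
    writer.writerows(rows)
```

For Cyrillic text, opening the file with `encoding='utf-8-sig'` helps Excel detect the encoding when the CSV is double-clicked.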
 
I will try what you suggest with Page 16.
 
Thanks again,
 
 
John

Silvio Traversaro

Oct 29, 2012, 9:48:43 AM
to scrap...@googlegroups.com
Yes, there was a connection problem, but the scraper is now running correctly. At the moment it has scraped approximately 6,000 of the 17,818 pages. When the scraping is complete (you can check the progress by looking at the variable "last_page_scraped" in swvariables), you can simply download the data using the "Download as a CSV" button.
Just a warning: I was not able to use Cyrillic names for the columns (and the column order is changed with respect to the website), so I used generic column names (COLUMN00, COLUMN01, ...). The meaning of these generic column names is given in the "header_info" table.

John Maines

Oct 29, 2012, 10:54:49 AM
to scrap...@googlegroups.com
 
Great. I do not read Cyrillic either, so I can fix it. Thanks so much
 
How did you overcome the problem of it stopping at page 16? I was looking at that this morning. I tried to grab the EVENTVALIDATION and VIEWSTATE from page 16, but it did not seem to work.
 
I do data analysis for newspapers ... things like election analysis, crime, mapping, etc. My scraping skills are limited, but I sure would like to learn more. I'm going to use Python. 
 
Thanks again. If this works you will make some Serbian journalists very happy.
 
John

Silvio Traversaro

Oct 29, 2012, 11:13:00 AM
to scrap...@googlegroups.com
I have tried to modify the Páll Hilmarsson code that you posted to show how to grab page 20 (just as an example); the code is here:
https://scraperwiki.com/scrapers/example_serbian_contract_get_page/

I have tried to comment the code; if something is not clear, please ask.

Páll Hilmarsson

Oct 29, 2012, 11:26:15 AM
to scrap...@googlegroups.com
Silvio is correct.

Every 15 pages you have to make an additional request. So the logic is like this:

Get page 1
Extract VIEWSTATE and EVENTVALIDATION
Get page 2,3,4,5,6,7,8,9,10,11,12,13,14,15
Get page 16
Extract VIEWSTATE and EVENTVALIDATION
Get page 17,18,19, etc.

This means you make 1,187 extra requests to cover all 17,818 result pages.
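
Páll's count can be checked with a little arithmetic: the result pages split into blocks of 15, and every block after the first needs one extra state-refreshing request (the first block's state comes from the initial GET):

```python
import math

total_pages = 17818
pages_per_block = 15

blocks = math.ceil(total_pages / pages_per_block)  # 1188 blocks of (up to) 15 pages
extra_requests = blocks - 1                        # the first block needs no extra jump
print(extra_requests)  # 1187
```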


All the best,

p
--
pal...@kaninka.net | http://gogn.in | http://twitter.com/pallih | https://github.com/pallih

PGP: C266 603E 9918 A38B F11D  9F9B E721 347C 45B1 04E9

John Maines

Oct 29, 2012, 1:56:29 PM
to scrap...@googlegroups.com


Thanks, Páll. I thought that might be the case. I tried it once this morning, but it did not work. I was in a hurry and probably did it wrong.

I will practice again tonight.

John Maines

Oct 29, 2012, 2:01:11 PM
to scrap...@googlegroups.com


Thanks, Silvio and Páll.





Djordje Padejski

Nov 30, 2012, 2:22:10 PM
to scrap...@googlegroups.com
Awesome work, guys. Thanks a lot, Silvio.

This is a great tool now :)

It is not a big deal to convert that weird Cyrillic file to a nice spreadsheet.

best,
Djordje