Regarding scraping Open Govt. NREGA Data


mali mukesh

May 21, 2012, 10:21:45 AM
to data...@googlegroups.com
Hi Friends,
        
I am currently working on an Open Govt. Data project, and I have chosen the NREGA project. I want to do some analysis of the NREGA data which is provided online. I want to scrape the data from http://nrega.nic.in . So I need some help if someone has already worked on Open Govt. Data earlier.

Thanking you.

Mali Mukesh

Gautam John

May 21, 2012, 10:22:43 AM
to data...@googlegroups.com
On Mon, May 21, 2012 at 7:51 PM, mali mukesh <malimuk...@gmail.com> wrote:

> provided online. I want to scrape the data from http://nrega.nic.in . So I
> need some help if someone has already worked on Open Govt. Data earlier.

ScraperWiki? https://scraperwiki.com/

mali mukesh

May 21, 2012, 10:45:11 AM
to data...@googlegroups.com
Yes, it is online, but if somebody has already written a script or has a backup of the downloaded data, please share it.
Or if someone has already worked on it, then I need help.

S Anand

May 21, 2012, 10:56:16 AM
to data...@googlegroups.com
This looks like your best bet for now: https://github.com/ravibalgi/nrega

It's about 9 months old, and you might want to reach out to Ravi.

Venkata Pingali

May 21, 2012, 11:16:48 AM
to data...@googlegroups.com
It is an asp.net site. You will not be able to scrape anything
behind a form using python/html parsing (been there! done that!).

I used PhantomJS (headless browser) to scrape a PSU site
(an asp.net site like the nrega site). I intend to open-source it
after some cleaning (along with the data). Happy to share
the raw version with you (or anybody else for that matter) if
interested. You will need this if you ever have to fill forms
on aspx sites.

The likely issue you are going to face is that the core database
(job cards) is very large (43GB based on my last estimate). You
can't scrape it without tacit or explicit approval.

-V

Karthik Shashidhar

May 21, 2012, 10:07:21 PM
to data...@googlegroups.com
On Mon, May 21, 2012 at 7:51 PM, mali mukesh <malimuk...@gmail.com> wrote:
> Hi Friends,
>
> I am currently working on an Open Govt. Data project, and I have chosen the NREGA project. I want to do some analysis of the NREGA data which is provided online. I want to scrape the data from http://nrega.nic.in . So I need some help if someone has already worked on Open Govt. Data earlier.


For a moment, I got excited when I misread the headline as "scrapping NREGA" and thought the government was finally seeing some sense and thinking of scrapping NREGA! Unfortunately, not to be!

Deepak Shenoy

May 21, 2012, 10:19:59 PM
to data...@googlegroups.com
On Mon, May 21, 2012 at 8:46 PM, Venkata Pingali <pin...@gmail.com> wrote:
> It is an asp.net site. You will not be able to scrape anything
> behind a form using python/html parsing (been there! done that!).

The one way it can be done is using Windows and automating IE instead.
Have done that using C#, but yes, it's a painful exercise and NREGA
more so because they constantly validate with VIEWSTATE and other
hidden parameters.
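
For illustration, the VIEWSTATE and other hidden parameters mentioned above are ordinary hidden form inputs in the page HTML. The sketch below only shows how they could be collected with Python's standard library. The sample page, the field values, and the names `ddlState` and `btnGo` are made up and are not taken from nrega.nic.in, and whether posting these values back is enough for the NREGA server to accept a request is not established here.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <input ... />.
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    """Return a dict of all hidden input fields found in the HTML text."""
    parser = HiddenFieldParser()
    parser.feed(html)
    parser.close()
    return parser.fields


# Hypothetical sample page (assumption), not copied from any real site.
SAMPLE = """
<form method="post" action="report.aspx">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIzNDU2Nzg5Ozs+" />
  <input type="hidden" name="__EVENTVALIDATION" value="AbCdEf" />
  <select name="ddlState"><option value="02">Andhra Pradesh</option></select>
  <input type="submit" name="btnGo" value="Go" />
</form>
"""

fields = hidden_fields(SAMPLE)
for name, value in fields.items():
    print(name, "=", value)
```

A browser-automation approach sidesteps this step because the browser carries the hidden fields along by itself.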

Having said that, the end result is a bunch of HTML files (for job
cards) which I assume can be obtained using some test framework tools?

S Anand

May 21, 2012, 10:38:32 PM
to data...@googlegroups.com
If 2009 data is OK for you, it looks like it's available, at least at the village level: http://ifmr.ac.in/cmf/resources/data/Andhra%20Pradesh%20-%20NREGA,%20MPTC%20and%20BPL%20Data/ 

A team from Berkeley seems to have done some work with the raw data as well -- http://tier.cs.berkeley.edu/resources/data/villagemap/docs/CITRIS-ITS-NREGA-v4.pdf -- might be useful to contact.

Regards,
Anand



Venkata Pingali

May 21, 2012, 11:34:01 PM
to data...@googlegroups.com
On Tue, May 22, 2012 at 7:49 AM, Deepak Shenoy <deepak...@gmail.com> wrote:
> On Mon, May 21, 2012 at 8:46 PM, Venkata Pingali <pin...@gmail.com> wrote:
>> It is an asp.net site. You will not be able to scrape anything
>> behind a form using python/html parsing (been there! done that!).
>
> The one way it can be done is using Windows and automating IE instead.
> Have done that using C#, but yes, it's a painful exercise and NREGA
> more so because they constantly validate with VIEWSTATE and other
> hidden parameters.
>

Right. That's the problem. The server side is very sensitive to these values,
and not being an ASP developer, I don't know the logic by which they are
generated or validated.

The approach will eventually look like browser automation of some kind -
IE or another framework. PhantomJS uses WebKit (the Chrome/Safari engine).
I looked at the available test automation frameworks, including Selenium.
I needed something to run off AWS servers (read: no display; I found
xvfb inefficient and hard to use). So I chose to go with PhantomJS. Another
advantage I found with it is that all the code is in JavaScript. So it is
cross-platform, compact, and potentially convertible into a
Chrome extension - which I am working on.

> Having said that, the end result is a bunch of HTML files (for job
> cards) which I assume can be obtained using some test framework tools?

That's what is going on here. Once we get the HTML, we can use whatever
mechanism suits. I used Python.
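
As a rough sketch of that last step (not the actual script used here), the saved HTML could be turned into rows with Python's standard library alone. The table layout, column names, and values below are invented for illustration; a real job card page would need its own inspection.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Join the chunks and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_rows(html):
    """Return a list of rows, each a list of cell strings."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical sample (assumption): invented columns and values.
SAMPLE = """
<table>
  <tr><th>Job card</th><th>Head of household</th><th>Days worked</th></tr>
  <tr><td>JC-001</td><td>Example  Name</td><td>12</td></tr>
</table>
"""

rows = table_rows(SAMPLE)
for row in rows:
    print(row)
```

From here the rows can be written out with the `csv` module or loaded into whatever analysis tool is preferred.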