Land records data scraping


Nikhil VJ

Oct 24, 2016, 12:39:58 PM
to datameet
Hi,

I'm looking at Maharashtra's land records portal:
https://mahabhulekh.maharashtra.gov.in

...and I'm wondering if it's possible to scrape data from it.

Here's the workflow:
choose 7/12 (७/१२) > select any जिल्हा (district) > तालुका (taluka) > गाव (village)
select शोध (search): सर्वे नंबर / गट नंबर (survey number / gat number, the first option)
type 1 in the text box and press the "शोधा" (search) button
Then we get a dropdown with options like 1/1, 1/2, 1/3 etc.

On selecting any option and clicking "७/१२ पहा" (view 7/12),
a new window/tab opens up (you have to enable popups) with static
HTML content (some tables). I need to capture this content.

The URL is always the same:
https://mahabhulekh.maharashtra.gov.in/Konkan/pg712.aspx
..but the content changes depending on the options chosen.

On using the browser's "Inspect Element" > Network tab and clicking
the final button, there is a request to this URL:

https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx/call712

and the request params/payload looks like this:

{'sno':'1','vid':'273200030398260000','dn':'रत्नागिरी','tn':'खेड','vn':'वाळंजवाडी','tc':'3','dc':'32','did':'32','tid':'3'}

when you change the survey/gat number to 1/10, the params change like so:
{'sno':'1#10','vid':'273200030398260000','dn':'रत्नागिरी','tn':'खेड','vn':'वाळंजवाडी','tc':'3','dc':'32','did':'32','tid':'3'}

for 1/1अ:
{'sno':'1#1अ','vid':'273200030398260000','dn':'रत्नागिरी','tn':'खेड','vn':'वाळंजवाडी','tc':'3','dc':'32','did':'32','tid':'3'}

I tried some wget and curl commands but no luck so far. Do let me know
if you can make some headway.
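In case someone wants to poke at this in Python instead, here is a stdlib-only sketch of how I imagine the POST could be replicated. The payload values are copied from the captured request; the one guess (unverified, since my curl attempts return null) is that the ASP.NET_SessionId cookie must first be obtained by loading Home.aspx in the same session.

```python
import json
import urllib.request
from http.cookiejar import CookieJar

BASE = "https://mahabhulekh.maharashtra.gov.in/Konkan"

def build_payload(sno):
    # '#' stands in for '/' in the survey/gat number, per the captured params
    return {'sno': sno.replace('/', '#'),
            'vid': '273200030398260000',
            'dn': 'रत्नागिरी', 'tn': 'खेड', 'vn': 'वाळंजवाडी',
            'tc': '3', 'dc': '32', 'did': '32', 'tid': '3'}

def fetch_712(sno):
    # Cookie-aware opener so the ASP.NET_SessionId set by Home.aspx
    # carries over to the call712 request (assumption: this is required)
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar()))
    opener.open(BASE + "/Home.aspx")  # establish the session first
    req = urllib.request.Request(
        BASE + "/Home.aspx/call712",
        data=json.dumps(build_payload(sno), ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json;charset=utf-8",
                 "Referer": BASE + "/Home.aspx"})
    with opener.open(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

If the session-cookie guess is wrong, the missing piece may be some other hidden token the page sets before the call.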

Also, it would be great to learn how to extract the list of
districts, the talukas (subdistricts) in each district, and the
villages in each taluka.

I'm dumping other info at the bottom in case it helps.

Why do this:
At present it's just an exploration following on from our work on
village shapefiles.
The district > taluka > village mapping from official Land Records
data could serve as a good source for triangulation.
And while I don't see myself going deeper into this right now, I am
aware that land records / ownership has major corruption,
entanglements and other issues precisely because of the lack of
transparency. The mahabhulekh website itself is a significant step
towards making this sector a little more transparent, and more push
in this direction would probably do more good, IMHO. At some point
GIS/lat-long info might come in, and it would be good to bring the
data to a level that is ready for it.


Data dump:
When we press the button to fetch the 7/12 (saatbarah) record, the
console records a POST with these parameters:

Copy as cURL:
curl 'https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx/call712' \
  -H 'Host: mahabhulekh.maharashtra.gov.in' \
  -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:42.0) Gecko/20100101 Firefox/42.0' \
  -H 'Accept: application/json, text/plain, */*' \
  -H 'Accept-Language: en-US,en;q=0.5' \
  --compressed \
  -H 'Content-Type: application/json;charset=utf-8' \
  -H 'Referer: https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx' \
  -H 'Content-Length: 170' \
  -H 'Cookie: ASP.NET_SessionId=3ozsnwd3nhh4py4hmiqcjeoc' \
  -H 'Connection: keep-alive' \
  -H 'Pragma: no-cache' \
  -H 'Cache-Control: no-cache'

Copy POST data:
{'sno':'1#1अ','vid':'273200030398260000','dn':'रत्नागिरी','tn':'खेड','vn':'वाळंजवाडी','tc':'3','dc':'32','did':'32','tid':'3'}

request headers:
POST /Konkan/Home.aspx/call712 HTTP/1.1
Host: mahabhulekh.maharashtra.gov.in
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:42.0)
Gecko/20100101 Firefox/42.0
Accept: application/json, text/plain, */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Content-Type: application/json;charset=utf-8
Referer: https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx
Content-Length: 170
Cookie: ASP.NET_SessionId=3ozsnwd3nhh4py4hmiqcjeoc
Connection: keep-alive
Pragma: no-cache
Cache-Control: no-cache

response headers:
HTTP/1.1 200 OK
Cache-Control: private, max-age=0
Content-Type: application/json; charset=utf-8
Server: Microsoft-IIS/8.0
X-Powered-By: ASP.NET
Date: Mon, 24 Oct 2016 15:31:40 GMT
Content-Length: 10

Copy Response:
{"d":null}
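(Aside: that {"d":...} body is the standard envelope ASP.NET page methods wrap around their return value. If a call ever succeeds, the 7/12 content should presumably arrive in the "d" field; the success case below is an assumption, since only null has been observed so far.)

```python
import json

def unwrap_aspnet(response_text):
    # ASP.NET page methods return {"d": <value>}; pull out the value.
    # A None result mirrors the {"d":null} failure case shown above.
    return json.loads(response_text).get("d")
```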


--
Cheers,
Nikhil
+91-966-583-1250
Pune, India
Self-designed learner at Swaraj University <http://www.swarajuniversity.org>
Blog <http://nikhilsheth.blogspot.in> | Contribute
<https://www.payumoney.com/webfronts/#/index/NikhilVJ>

Ankit Gaur

Oct 24, 2016, 11:32:42 PM
to data...@googlegroups.com
Though I am not very well conversant with data science and web scraping, we had a recent DataKind meetup (https://www.meetup.com/DataKind-Bangalore/events/234855978/) in Bangalore where Bargava talked about using R's rvest library. We were able to do some basic scraping of Goodreads with it. See if this fits your needs.

Thanks,
Ankit 

--
Datameet is a community of Data Science enthusiasts in India. Know more about us by visiting http://datameet.org
---
You received this message because you are subscribed to the Google Groups "datameet" group.
To unsubscribe from this group and stop receiving emails from it, send an email to datameet+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Nikhil VJ

Oct 27, 2016, 1:16:59 AM
to data...@googlegroups.com
Hi Ankit,

Thanks for the R lead! I checked it out. I'm already doing something
like it using some quick shell/bash commands and a Python script that
converts any HTML table to CSV (http://stackoverflow.com/a/16697784).
Once we have the data down as HTML files it's fairly straightforward;
that part comes after the scraping.
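For reference, the table-to-CSV step can also be done in a few lines of stdlib Python; this is my own sketch, not the script from the linked Stack Overflow answer:

```python
import csv
import io
from html.parser import HTMLParser

class TableToCSV(HTMLParser):
    """Collect the text of each <td>/<th> into rows, one row per <tr>."""
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.in_cell = [], [], False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.in_cell, self.cell = True, []
    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.in_cell:
            self.row.append("".join(self.cell).strip())
            self.in_cell = False
        elif tag == "tr" and self.row:
            self.rows.append(self.row)
    def handle_data(self, data):
        if self.in_cell:
            self.cell.append(data)

def html_table_to_csv(html):
    # Parse the table cells, then serialise the collected rows as CSV text
    p = TableToCSV()
    p.feed(html)
    out = io.StringIO()
    csv.writer(out).writerows(p.rows)
    return out.getvalue()
```

(Nested tables would need more care; for flat 7/12 tables this should do.)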

The data in this case is not in permanent HTML pages that we can just
save in batch. It is generated server-side on the Mahabhulekh server
from the form inputs in an authenticated user session, and then
rendered as HTML at one constant URL. So what I'm looking for is
something that would simulate/automate the calls to the Mahabhulekh
server (with due time intervals between each call, of course; we must
not overload the server) and capture the output it returns.
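The pacing part at least is easy to sketch; the fetch function and the 5-second default delay here are placeholders for whatever finally works:

```python
import time

def crawl(survey_numbers, fetch, delay=5.0):
    """Fetch each survey/gat number in turn, sleeping between requests
    so as not to overload the server. `fetch` is whatever function
    eventually manages to pull one record."""
    results = {}
    for sno in survey_numbers:
        results[sno] = fetch(sno)
        time.sleep(delay)
    return results
```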

So far I'm not able to programmatically capture the HTML coming in
the popup window it generates. The POST request returns a generic
null response, or the site's main webpage, in all the wget and curl
commands I've tried. Folks who have done some scraping earlier might
be able to help.

Another track worth exploring might be iMacros or other ways to
automate browser sessions. Folks working in the testing departments
of ticketing/booking sites etc. might know, and could help, so please
share this with your friends working on such projects!

I've read in a few places that R can be used to simulate this, so
yes, it'll be worth continuing to explore; but I know shell scripting
better, so I'm hoping something turns up there.






Nikhil VJ

Nov 13, 2016, 10:51:41 PM
to data...@googlegroups.com
Hi friends,

I've created some shell scripts to aggregate the data from downloaded
7/12 records (HTML files) into two CSVs. Sharing a GitHub link with
the code and instructions:
https://github.com/answerquest/mahabhulekh-7-12-aggregating

Still no luck with automated scraping from the site, but this
aggregation was the next step, and it has really simplified the
process of inspecting multiple records at once.

-Nikhil

Pradeep Bhatt

Nov 14, 2016, 7:14:37 AM
to data...@googlegroups.com
This is very interesting. 

Can this be used for commercial purposes? Where can I read about the data policy for this?

Regards,
Pradeep



Nikhil VJ

Nov 14, 2016, 1:18:14 PM
to data...@googlegroups.com
Hi Pradeep,

My aim is more that people can use snippets from the scripts to
devise their own tools.

I've put the repo under the GPL license; I would prefer to share
freely and have others take it forward. And I don't think commercial
use of this kind of data would be permitted, but feel free to check.

-Nikhil

Sandeep Kumar

Oct 17, 2017, 3:02:17 AM
to datameet
Can somebody educate us on what kind of legal action we could face if we scrape data from a website where the data is publicly visualised but not readily available for download, like the Bhuvan portal?


Devendra Damle

Oct 23, 2017, 5:25:35 AM
to datameet
Hi Nikhil.

A colleague of mine wrote a Python script for scraping data from the Debt Recovery Tribunals website. The problem was similar to yours.

His script uses Selenium WebDriver with geckodriver for Firefox. It opens the website in Firefox, simulates clicks to select entries from the drop-down menus that generate the tables, and then downloads the data into a JSON file. I am attaching the source code file. I am not a coder myself, so I won't be able to help you with the code, but you might be able to modify it to suit your needs.

Regards,
Devendra
drtdownloader.py

harsha

Nov 14, 2017, 3:48:29 AM
to datameet
Hi Nikhil,

I have been thinking along similar lines for Telangana (http://mabhoomi.telangana.gov.in/) and have spoken to local land activists and researchers. Why do this? One reason is to keep a dump of the records: they are changing very fast in Telangana, with a huge number of surveys being done; we have no clue how the records are changing, and only the final changes are in the public domain.
Second, we have been running a farmer distress helpline for the last 7 months in Vikarabad District, Telangana, and 50% of the issues we get are land issues, so this would also make the records easier to access.
Third is to understand and do some analysis of the land acreage: who owns it, and who cultivates/benefits from it (currently noted in the 13th column of the pahani). As we have been working on the rights of tenant farmers, this is an important data point and understanding we need to build.

So we would be eager to know how we can collaborate and take this forward.
We can take help from Srinivas Kodali (iota.kodali@gmail.com), who has offered to help locally and has experience in scraping.

Cheers,
SreeHarsha

Naveen Francis

Nov 14, 2017, 10:09:34 AM
to datameet