Re: Web Scraping Mission Antyodaya website


Piyush Kumar

Feb 4, 2022, 2:33:25 AM
to data...@googlegroups.com
Could folks here suggest how to go about this? 


When we click this link, we get data on village-level infrastructure spread across multiple HTML tables on many pages (separated into state, district, block, etc.).

Suppose I want to scrape data up to the village level for a particular state. Is there any way to do that without too much back and forth with Selenium WebDriver? Please note that to access village-level data you have to go through a nested hierarchy of links (gram panchayat within block, which is within a district, and so on). To make matters more complicated, the pages are not numbered either.

Can someone in the know help me figure this out?

Thanks in advance
Piyush

Sanjay Bhangar

Feb 4, 2022, 3:12:29 AM
to data...@googlegroups.com
Piyush -

You could write a Python (or your preferred language) script that just requests the HTML, parses it, and follows the hierarchy, without using Selenium. This could be a bunch of work because the site doesn't use regular links with GET requests; instead, when you click on a state in the table, it uses JavaScript to fill hidden form fields with the state code, etc., and then submits the form, causing a POST request to be made with those values.

For example, you can see that the links in the table have an onClick handler like "selectState(2,'HIMACHAL PRADESH','preloginDistrictInfrastructureReports2020.html')".

Then, in the JavaScript, you can see the selectState function defined like so:

function selectState(stateCode, stateName, action) {
    $("#stateCode").val(stateCode);
    $("#stateName").val(stateName);
    $("#reportForm").attr('action', action);
    $("#reportForm").submit();
}


So this will make a POST request to preloginDistrictInfrastructureReports2020.html with stateCode=2 and stateName=HIMACHAL PRADESH.
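As a rough sketch, replicating that form submit from Python might look like the following (the endpoint path is taken from the onClick handler above; I'm assuming the action resolves relative to the site root, and using the third-party requests library):

```python
from urllib.parse import urljoin

BASE = "https://missionantyodaya.nic.in/"

def state_form(state_code, state_name):
    """The hidden fields that selectState() fills in before submitting."""
    return {"stateCode": str(state_code), "stateName": state_name}

if __name__ == "__main__":
    import requests  # third-party: pip install requests

    # Mirror the JS form submit: POST the hidden fields to the action URL.
    url = urljoin(BASE, "preloginDistrictInfrastructureReports2020.html")
    resp = requests.post(url, data=state_form(2, "HIMACHAL PRADESH"), timeout=30)
    print(resp.status_code, len(resp.text))
```
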

Similarly, there are different onClick handlers defined for selecting districts, etc., which you can follow down to see what URLs they call with what parameters. And in theory, you could write some HTML-parsing code and a regex to go through the items in each table, parse out the parameters and URLs to call, and follow things down.
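That regex step could be sketched like this (the HTML snippet below is a hypothetical stand-in for the real state table, shaped after the onClick handler quoted above; the real page would come from an HTTP request first):

```python
import re

# Hypothetical stand-in for the fetched state table.
SAMPLE = """
<a href="#" onClick="selectState(2,'HIMACHAL PRADESH','preloginDistrictInfrastructureReports2020.html')">HIMACHAL PRADESH</a>
<a href="#" onClick="selectState(27,'MAHARASHTRA','preloginDistrictInfrastructureReports2020.html')">MAHARASHTRA</a>
"""

# Capture (stateCode, stateName, action) from each handler.
SELECT_STATE = re.compile(r"selectState\((\d+),'([^']*)','([^']*)'\)")

def parse_states(html):
    """Return one (code, name, action_url) tuple per selectState() handler."""
    return SELECT_STATE.findall(html)

print(parse_states(SAMPLE))
# [('2', 'HIMACHAL PRADESH', 'preloginDistrictInfrastructureReports2020.html'),
#  ('27', 'MAHARASHTRA', 'preloginDistrictInfrastructureReports2020.html')]
```

The same idea applies one level down: a second regex for the district-level handlers, and so on.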

So, in theory you could write this without mucking around with Selenium, but it is also a lot more work than if the site were structured "normally" with unique URLs and GET requests.

As for the page numbering, this seems okay: the HTML contains all the items across all the pages, and the pagination on the page is purely client-side JavaScript. So if you read the HTML via Python or similar, you would just get all the items in the table without having to worry about pagination.
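To illustrate that last point, here is a stdlib-only sketch that collects every row of a table regardless of what the client-side pager displays (the two-row table below is a made-up example; BeautifulSoup would work just as well):

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of every <td>, grouped by <tr>. The site's pager is
    client-side JS, so all rows are already present in the fetched HTML."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = None
        elif tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td and self._row is not None:
            self._row.append(data.strip())

# Hypothetical two-"page" table: both rows arrive in one response.
parser = TableRows()
parser.feed("<table><tr><td>1</td><td>Ambala</td></tr>"
            "<tr><td>2</td><td>Karnal</td></tr></table>")
print(parser.rows)  # [['1', 'Ambala'], ['2', 'Karnal']]
```
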

Unfortunately, this does seem like a lot of work and I don't really have the time to take it further, but it seemed like an interesting problem and I was curious, so I took a look. Hope it helps a bit.

All the best,
Sanjay

--
Datameet is a community of Data Science enthusiasts in India. Know more about us by visiting http://datameet.org

Nikhil VJ

Feb 6, 2022, 7:18:58 AM
to datameet
Hi,

I don't think Selenium is required; this looks like it can be done by just varying the request payload of one POST API call. The POST request content type is application/x-www-form-urlencoded.

At state level, the request payload is like:
stateCode: 27
stateName: MAHARASHTRA
districtCode: 
districtName: 
blockCode: 
blockName: 
gpCode: 
gpName: 

At district level it becomes:
stateCode: 27
stateName: MAHARASHTRA
districtCode: 469
districtName: AURANGABAD
blockCode: 
blockName: 
gpCode: 
gpName: 

Then at block level:
stateCode: 27
stateName: MAHARASHTRA
districtCode: 469
districtName: AURANGABAD
blockCode: 4315
blockName: KHULTABAD
gpCode: 
gpName: 

Then at GP level:
stateCode: 27
stateName: MAHARASHTRA
districtCode: 469
districtName: AURANGABAD
blockCode: 4315
blockName: KHULTABAD
gpCode: 170584
gpName: BODKHA

In Python, one can use BeautifulSoup to capture the table data as well as to get the (code + name) pairs for the next level.
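A sketch of building those payloads and walking one level down, with the field names copied from the captures above (the endpoint path here is an assumption; take the real one from the browser's network tab; requests is third-party and encodes data= as x-www-form-urlencoded by default):

```python
LEVELS = ["state", "district", "block", "gp"]

def make_payload(**levels):
    """Build the full form body. Each level is a (code, name) pair;
    levels not supplied stay blank, exactly as in the captured requests."""
    body = {}
    for level in LEVELS:
        code, name = levels.get(level, ("", ""))
        body[f"{level}Code"] = code
        body[f"{level}Name"] = name
    return body

if __name__ == "__main__":
    import requests  # third-party: pip install requests

    # Hypothetical endpoint path; use the one the browser actually posts to.
    url = "https://missionantyodaya.nic.in/preloginDistrictInfrastructureReports2020.html"
    r = requests.post(
        url,
        data=make_payload(state=("27", "MAHARASHTRA"),
                          district=("469", "AURANGABAD")),
        timeout=30,
    )
    print(r.status_code)
```
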

--
Cheers,
Nikhil VJ
https://nikhilvj.co.in


Piyush Kumar

Feb 7, 2022, 2:12:50 AM
to data...@googlegroups.com
Thank you Sanjay and Nikhil. I think these are good starting points to figure out how to get this done, and I am sure that with some time and effort it is possible.

Piyush

Abhilash Chowdhary

Feb 10, 2022, 7:04:14 PM
to data...@googlegroups.com
Piyush,

I took a look, and it looks like you can use these APIs. I'm providing the curl requests; you can paste them into https://curlconverter.com/ to convert them into the language of your choice :)

1. Get the blocks of a district

curl 'https://missionantyodaya.nic.in/getPreLoginAnalyticsData.html?stateCode=6&districtCode=61' \
  -X 'POST' \
  -H 'Connection: keep-alive' \
  -H 'Content-Length: 0' \
  -H 'Pragma: no-cache' \
  -H 'Cache-Control: no-cache' \
  -H 'sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="96", "Google Chrome";v="96"' \
  -H 'Accept: */*' \
  -H 'Content-Type: application/json' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36' \
  -H 'sec-ch-ua-platform: "Linux"' \
  -H 'Origin: https://missionantyodaya.nic.in' \
  -H 'Sec-Fetch-Site: same-origin' \
  -H 'Sec-Fetch-Mode: cors' \
  -H 'Sec-Fetch-Dest: empty' \
  -H 'Referer: https://missionantyodaya.nic.in/preloginAnalytics2020.html' \
  -H 'Accept-Language: en-US,en;q=0.9' \
  -H 'Cookie: JSESSIONID=obT6zCBsqbClJdpkAhrHxIbVaNog5IcQNt1WerzF.nqj1p-lxapp8-001' \
  --compressed


2. Get all the metrics for a block

curl 'https://missionantyodaya.nic.in/getPreLoginAnalyticsData.html?stateCode=6&districtCode=61&blockCode=469' \
  -X 'POST' \
  -H 'Connection: keep-alive' \
  -H 'Content-Length: 0' \
  -H 'Pragma: no-cache' \
  -H 'Cache-Control: no-cache' \
  -H 'sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="96", "Google Chrome";v="96"' \
  -H 'Accept: */*' \
  -H 'Content-Type: application/json' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36' \
  -H 'sec-ch-ua-platform: "Linux"' \
  -H 'Origin: https://missionantyodaya.nic.in' \
  -H 'Sec-Fetch-Site: same-origin' \
  -H 'Sec-Fetch-Mode: cors' \
  -H 'Sec-Fetch-Dest: empty' \
  -H 'Referer: https://missionantyodaya.nic.in/preloginAnalytics2020.html' \
  -H 'Accept-Language: en-US,en;q=0.9' \
  -H 'Cookie: JSESSIONID=obT6zCBsqbClJdpkAhrHxIbVaNog5IcQNt1WerzF.nqj1p-lxapp8-001' \
  --compressed


I basically went to the analytics tab and looked for the APIs being called.
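For reference, the same two calls could be sketched in Python without the browser headers, most of which are optional; a Session keeps the JSESSIONID cookie across calls (requests is third-party):

```python
from urllib.parse import urlencode

API = "https://missionantyodaya.nic.in/getPreLoginAnalyticsData.html"

def api_url(**codes):
    """Endpoint plus the codes as query parameters, as in the curl captures."""
    return API + "?" + urlencode(codes)

if __name__ == "__main__":
    import requests  # third-party: pip install requests

    s = requests.Session()  # reuses cookies (e.g. JSESSIONID) between calls
    # 1. Get the blocks of a district.
    blocks = s.post(api_url(stateCode=6, districtCode=61), timeout=30)
    # 2. Get all the metrics for one block.
    metrics = s.post(api_url(stateCode=6, districtCode=61, blockCode=469), timeout=30)
    print(blocks.status_code, metrics.status_code)
```
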

sagar srivastava

Feb 10, 2022, 7:04:19 PM
to data...@googlegroups.com
I have a Python script for Mission Antyodaya; if anyone wants it, I can help. I scraped the data GP-wise, and the questionnaire as well.

Uzair Khan

Feb 10, 2022, 7:04:36 PM
to data...@googlegroups.com
Hello seniors,
Where can I get the city shapefiles of Tarapur, Aurangabad, and Nashik? All these cities are in Maharashtra state.


Uzair

a.ja...@gmail.com

Feb 18, 2022, 4:54:05 AM
to datameet
Data from the Antyodaya mission is already available on the Indiadataportal site. Not sure whether that is complete or not, but you can check.