ANN: Opening of 7,00,000+ Rural Points of Interest Data


Nov 21, 2020, 12:19:52 AM
to datameet
Hello all,

ANN: PMGSY has opened data for about 7,00,000 geo-tagged rural facilities across India.

The data was collected to help plan road investments in PMGSY-III. Collection has been under way for the last year and is still continuing. Depending on which state's data you download, the survey is either complete or still in progress.

The list of facilities to be surveyed under the scheme's guidelines can be seen on Pg 37 of the PMGSY-III Guidelines.

Eg. High Schools, Higher Secondary Schools, Vet Hospitals, PHCs, CHCs, Bedded Hospitals, Bus Stands, Block HQs, Panchayat HQs, Banks, Fuel Stations, Cold Storages, Agro Industries, Pack Houses, Collection Centres etc.

Data opened includes name of facility, address, category, sub-category and lat/long.

Some context:
While a common Android application was used for this data collection, there was no in-depth centralized training or SOP for how the data was to be collected. States were given the freedom to interpret the definitions of the facilities to be surveyed, as long as they met the overarching categories and goals. Eg. some states may have included privately owned facilities in certain categories, interpreted bus stands to include taxi stands where those are the only relevant means of transport, or not counted weekly haats as agro-markets. There is no documentation of these variations. Once the survey is completed in a Block, it won't be updated in the future.

Even within a state you'll find variation because different divisions may have undertaken the survey independently with different levels of completeness, intent and accuracy. No standard mobile was used and GPS accuracy will vary from place to place. Further, the surveyors could be either on contract or government engineers. Treating it as a census may lead to claims of little substance.

Nevertheless, it was a massive exercise and hopefully of some secondary use as well.

License is Open Data License - India (s/o Naveen Francis) and you can download data for one state at a time. Other disclaimers are on the website.

Link: Other Reports -> Facility Details

PS. Any pointers on how to collect citation metrics for this dataset are appreciated. It may help create a case for future such attempts to open data.

Harsh Nisar

Arun Ganesh

Nov 23, 2020, 2:05:47 AM
to datameet
Wonderful Harsh, it's amazing to see such rural spatial datasets opened by the government. The OpenStreetMap India community is looking into the data to get a better sense of its quality and to see how it could be integrated with the existing OSM basemap.

The dataset points are in yellow, the rest are from OSM. Some initial evaluations in Kerala and Karnataka suggest that the data is pretty good and spatially accurate within 100m.
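
For anyone wanting to repeat this kind of spot check, here is a minimal sketch of a distance test between a dataset point and the corresponding OSM feature. It uses the standard haversine formula; the coordinates, the function names and the 100 m default are illustrative assumptions, not the method used for the evaluation above.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/long points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within_tolerance(dataset_point, osm_point, tolerance_m=100):
    """True if the two (lat, lon) points are no more than tolerance_m apart."""
    return haversine_m(*dataset_point, *osm_point) <= tolerance_m


# Hypothetical coordinates for the same facility in the two sources.
dataset_school = (12.9716, 77.5946)
osm_school = (12.9721, 77.5949)

if __name__ == "__main__":
    distance = haversine_m(*dataset_school, *osm_school)
    print(f"{distance:.0f} m apart, within 100 m: "
          f"{within_tolerance(dataset_school, osm_school)}")
```

A check like this only says how far apart two already-paired points are; pairing the facilities in the first place still needs manual review or name matching.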

Pratap Vardhan

Nov 27, 2020, 12:20:23 AM
to datameet
I've pulled the state CSVs into this repo. The consolidated India CSV is at
And I've posted a thread with a couple of visuals and minor data issues here
Would love to hear if you use this data to create something.

Nov 27, 2020, 12:47:59 AM
to datameet
Really beautiful! 

I'll answer some of the queries you raised on the tweet thread here for everyone.

You've commented that the data in UK, HP & Nagaland appears erroneous. I am assuming that's because the lat/long is missing. It is missing because these states are still doing the survey and haven't completed it. They may complete it in the next few months. Many of the other NE states are in a similar position. Goa and most UTs haven't been onboarded to the scheme yet. For the same reason, I would point people towards the original dataset, or suggest putting in a system to update your GH repos regularly.

Apart from that - I'll re-iterate that the data was collected by government rural engineers at the block level. Intention, accuracy and even understanding of definitions will vary across blocks/districts and especially states. The data serves its primary purpose with these assumptions but may lead to misleading statistics if treated as a census for cross-geography comparisons.

Pratap Vardhan

Nov 27, 2020, 9:27:56 AM
to datameet
Thanks Nisar, that's useful to know. I'll update the repo with these pointers. 
What frequency of updates would you suggest (monthly, or something else)? If you prefer, we can move the repo to the datameet org or any other widely accessible one and edit it collectively.
What I meant about the coordinates data is that some rows are blank and some have coordinates that fall outside India's extent. I guess they will get fixed with updates too.
Here's the distribution for states.
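
For reference, a minimal sketch of how rows could be bucketed into blank, out-of-extent and usable coordinates. The column names (`latitude`, `longitude`), the sample rows and the bounding box are assumptions for illustration; the box is a rough envelope around India including the islands, not an official boundary.

```python
import csv
import io
from collections import Counter

# Assumed, approximate bounding box for India including the island territories.
INDIA_BBOX = {"lat": (6.0, 38.0), "lon": (68.0, 98.0)}


def classify_coordinates(lat_text, lon_text):
    """Return 'missing', 'unparseable', 'out_of_extent' or 'ok' for one row."""
    if not (lat_text or "").strip() or not (lon_text or "").strip():
        return "missing"
    try:
        lat, lon = float(lat_text), float(lon_text)
    except ValueError:
        return "unparseable"
    lat_ok = INDIA_BBOX["lat"][0] <= lat <= INDIA_BBOX["lat"][1]
    lon_ok = INDIA_BBOX["lon"][0] <= lon <= INDIA_BBOX["lon"][1]
    return "ok" if lat_ok and lon_ok else "out_of_extent"


def summarise(csv_text):
    """Count rows per coordinate status in a CSV given as text."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[classify_coordinates(row["latitude"], row["longitude"])] += 1
    return dict(counts)


# Hypothetical rows; the real files may use different column names.
SAMPLE = """facility_name,latitude,longitude
School A,12.9716,77.5946
PHC B,,
Bank C,77.5946,12.9716
Bus Stand D,0,0
"""

if __name__ == "__main__":
    print(summarise(SAMPLE))
```

Note that a swapped lat/long pair (the "Bank C" row) is caught by the same bounds test, since a latitude of 77 falls outside the box.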

Separately, a minor issue perhaps: there are 22 rows which probably have the sub-category mislabeled.
Thanks for the details!

Pratap Vardhan

Nov 27, 2020, 9:31:35 AM
to datameet
Images seem to have been lost. Attaching them.
Distribution for states.  

22 rows which probably have sub-category mislabeled? 


Nov 27, 2020, 4:01:15 PM
to datameet
Rows which don't have lat-long will get updated as the work progresses in the concerned states. The lat-longs which are out of extent will not be corrected and will remain as-is; they result from errors in the app or simply low GPS accuracy, and there will be very few of them. Once the survey work is complete and finalized in all states, it will be mostly usable as-is from . It's snowing right now, so survey work is on hold.

As the department has already released the data in a machine-readable format that doesn't require scraping etc., and under the Open Data License, please try to attribute the original source, link and license in your work and repo (4a/5 of the GODL).

I have two concerns. The primary one is to ensure that the data collection mechanism and assumptions are documented officially and readily available (not just to people subscribed to this Google group), so that inferences people make from the data are grounded and more useful. The second is to find a mechanism for people to cite the original source, so that a case can be made in the future for releasing other such datasets within the government.

On my side, it seems the best way to do this is to dedicate a static page with FAQs on ommas & further release this data on the . Though in true spirit, anyone can host it anywhere subject to Sections 4 and 7 of the GODL.

Pratap Vardhan

Nov 27, 2020, 4:33:54 PM
to datameet
I've added a note on top with source, license and last updated date. Is there a recommended citation you'd like me to place? (Also, feel free to submit a PR with anything you think would keep readers better informed.)

Arun Ganesh

Nov 27, 2020, 10:28:36 PM
to datameet
Nisar, this is such an amazing dataset. To assist with spatial data joins, it would be helpful to include the unique LGD code for each block and village, so that one does not have to rely on names, which are always inconsistent across datasets.

I tried matching names up to the block level against the LGD block list and was able to get the following matches. Here is the spreadsheet of matches if you or others would like to use it to join to this dataset: PMGSY block LGD code lookup

Screen Shot 2020-11-27 at 9.49.07 PM.png

Datameet is a community of Data Science enthusiasts in India. Know more about us by visiting
You received this message because you are subscribed to the Google Groups "datameet" group.
To unsubscribe from this group and stop receiving emails from it, send an email to
To view this discussion on the web visit

Arun Ganesh

Nov 28, 2020, 1:14:41 AM
to datameet
Btw, Bhanu and I have made some great progress matching the LGD codes in the last hour. 100% of districts matched and it looks like the PMGSY district names are definitely older (eg. Allahabad vs Prayagraj).

We're down to 1539 unmatched blocks, which can be finished with some help. Feel free to request access to the above sheet if you would like to help out. Instructions are in columns H & I. You basically need to copy and paste the matching lookup key from the LGD block sheet to the PMGSY sheet.
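
For those who would rather script part of this, below is a rough sketch of name-based matching: exact matches on normalised (state, district, block) keys, an alias table for renamed districts, and fuzzy suggestions that are kept separate for manual review. This is not how the sheet above was built; the alias table, the cutoff, the row layout and the codes (`X001` etc.) are placeholders, not real LGD codes.

```python
import difflib
import re

# Assumed alias table mapping older district names to current ones.
DISTRICT_ALIASES = {"allahabad": "prayagraj"}


def normalise(name):
    """Lower-case, replace punctuation with spaces and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(cleaned.split())


def make_key(state, district, block):
    """Build a comparable (state, district, block) key, applying aliases."""
    district_n = normalise(district)
    district_n = DISTRICT_ALIASES.get(district_n, district_n)
    return (normalise(state), district_n, normalise(block))


def match_blocks(pmgsy_rows, lgd_rows, cutoff=0.85):
    """Return (exact, suggestions, unmatched) for PMGSY block names."""
    lgd_index = {make_key(s, d, b): code for s, d, b, code in lgd_rows}
    exact, suggestions, unmatched = {}, {}, []
    for row in pmgsy_rows:
        key = make_key(*row)
        if key in lgd_index:
            exact[row] = lgd_index[key]
            continue
        # Only compare against blocks in the same state and district.
        candidates = {k[2]: code for k, code in lgd_index.items()
                      if k[:2] == key[:2]}
        close = difflib.get_close_matches(key[2], list(candidates),
                                          n=1, cutoff=cutoff)
        if close:
            suggestions[row] = candidates[close[0]]  # needs manual review
        else:
            unmatched.append(row)
    return exact, suggestions, unmatched


# Illustrative rows only; the codes are placeholders.
LGD_ROWS = [
    ("Uttar Pradesh", "Prayagraj", "Phulpur", "X001"),
    ("Uttar Pradesh", "Prayagraj", "Soraon", "X002"),
    ("Kerala", "Idukki", "Devikulam", "X003"),
]
PMGSY_ROWS = [
    ("UTTAR PRADESH", "Allahabad", "Phulpur"),
    ("Kerala", "Idukki", "Devikolam"),
    ("Kerala", "Idukki", "Nonexistent"),
]

if __name__ == "__main__":
    exact, suggestions, unmatched = match_blocks(PMGSY_ROWS, LGD_ROWS)
    print("exact:", exact)
    print("to review:", suggestions)
    print("unmatched:", unmatched)
```

Keeping fuzzy hits in a separate bucket is deliberate: similarly named blocks in one district are common enough that auto-accepting them could silently mis-join rows.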

Nov 30, 2020, 7:24:01 AM
to datameet
Hi Harsh, All; 

Thank you for pointing this dataset out and storing it in an easy to access location. It looks super useful.

I was trying to find out exactly how they've classified schools and hospitals and couldn't find anything in the PMGSY documentation. The Census 2011, for example, documents what constitutes a high school and a higher secondary school here. I wonder if they've used the same definitions for schools and hospitals as the census? Does anyone have any information on how they've chosen which public facilities to document, or is it pretty ad hoc, as Harsh indicated in the opening post of this thread?

Kritarth Jha

Digvijay Bendrikar Shinde

Nov 30, 2020, 8:27:17 AM
This is Amazing!

Thank you very much!!


Harsh Nisar

Dec 3, 2020, 1:53:27 AM
to datameet

The definitions were purposely kept loose, as the primary focus of the dataset was to aid local selection of roads for a government programme. In such cases, consistency of definitions within a block/district/state was assumed to be more important than consistency across geographies. The primary focus isn't a comparative census.

So the definitions are as assumed by the JE/AE (frontline road engineers) residing in the Block, or in some states they were borrowed from the census. There isn't a documented consistency. That said, the variance in high schools will be minimal compared with, say, agro industry.

The generic list of facilities to be surveyed are in the guidelines document Annexure 1 Pg 37.

We are in the process of getting a dataset FAQ uploaded on ommas.


Dec 4, 2020, 8:06:33 PM
to datameet
Ah, makes sense! Still, I look forward to the dataset FAQ getting uploaded. But thank you so much for the quick clarification.


Harsh Nisar

Jan 6, 2021, 4:29:00 AM
to datameet
Closing the loop: The team has uploaded FAQs on the main website and given a static link for attribution & sharing.

Jan 11, 2021, 4:10:09 PM
to datameet
Really useful to have this FAQ! Just a quick question: is there a habitation to PC11/LGD village mapping anywhere,
in case we want to match this at the village level?

Arun Ganesh

Jan 11, 2021, 10:02:59 PM
to datameet
On Mon, Jan 11, 2021 at 4:10 PM <> wrote:
Really useful to have this FAQ! Just a quick question: is there a habitation to PC11/LGD village mapping anywhere,
in case we want to match this at the village level?
A lookup of the PMGSY data block names to block LGD codes was crowdsourced here:

Going the next level to map habitation to village LGD code would be amazing.

Rajesvari Parasa

Jan 20, 2021, 1:10:46 AM
Hello Harsh and everyone,
This is great work! 

Would it be possible to include these two pieces of info in the FAQs: 
1. Date/ Month of the opening of the dataset to the public 
2. The survey period over which this dataset was collected (I understand the survey is ongoing for some states, so maybe at least the start month would be useful to have).


Harsh Nisar

Mar 1, 2021, 4:19:26 AM
to datameet
On Wednesday, 20 January 2021 at 11:40:46 UTC+5:30 wrote:
Hello Harsh and everyone,
This is great work! 

Would it be possible to include these two pieces of info in the FAQs: 
1. Date/ Month of the opening of the dataset to the public 
2. The survey period over which this dataset was collected (I understand the survey is ongoing for some states, so maybe at least the start month would be useful to have).

Yes - I'll get that updated. Thanks.

Does anyone know how to get this data ported to OSM (if that's at all a possibility)?

Arun Ganesh

Mar 1, 2021, 10:33:07 AM
to datameet

Does anyone know how to get this data ported to OSM (if that's at all a possibility)?

Importing data to OSM is possible, but since it will have to be conflated with any existing data, it will require quite a bit of data preparation with many volunteers. An example of a recent import was the health facilities dataset . An overview of the import process is here: . If the data quality is not consistent and requires manual cleanup, going for an import might be a lot of effort.

That said, the PMGSY data is quite valuable and can add a lot of missing info to OSM for rural areas. It would make sense to start a conversation with the OSM community on ideas and how to take this forward. A good way to begin is by sending an intro email to the mailing list and starting a conversation about it on the Telegram group . There are quite a few folks experienced with OSM imports who can help out.

Nikhil VJ

Mar 2, 2021, 1:17:43 AM
to datameet
Hi Harsh,

The PMGSY site is dizzyingly full of data! Kudos and gratitude to all the people who have been working on it and the govt / elected officials who supported its release to the public. Sets a great benchmark / precedent.

Even apart from the data itself, the hierarchy in the dropdown selects is valuable too as people can use that for mapping so many other things in other fields.

I'm not able to see geo-tagging in the sections I'm checking out. Can you guide pls? 

Suggestion: Make short screen-recording videos on YouTube showing how to use the site. There are a lot of free tools and sites for this, but if Zoom is already available, one can start a call with recording on, share the screen and do the job.

Nikhil VJ


Pratap Vardhan

Mar 10, 2021, 12:26:14 PM
to datameet
I've updated the data files today; the dataset now has 7,83,014 facilities. You can download individual state files from