Hi,
I have merged all excel files and created a single CSV that can be downloaded from here....
http://oksoft.s3.amazonaws.com/all_state_data.csv.gzThis is 100 MB gzip compressed file. Contains 800 MB CSV data of all states merged into a single file. Here are the steps how to download and import the data in MySQL database. It will take a few minutes to import the data in SQL - This is better option than using excel.
1) download and unzip the file
wget
http://oksoft.s3.amazonaws.com/all_state_data.csv.gzgunzip all_state_data.csv.gz
2) Download the table structure:
wget
https://gist.githubusercontent.com/shantanuo/302f2d983c5282e56124/raw/442a11b33b3a5284644886a0a3876a4a014ed3e5/create_table.sql3) create table
mysql test < create_table.sql
4) load data
mysql> load data infile '/home/ubuntu/datameet/tmp/all_state_data.csv' into table temp_aaa fields terminated by ',' optionally enclosed by '"' ignore 1 lines;
_____
This query shows that a village named "mahad" is in 3 tehsils of Maharashtra.
mysql> select Sub_District_Name344 from temp_aaa where State_Name340 = 'Maharashtra' and Village_Name390 = 'Mahad' limit 10;
+----------------------+
| Sub_District_Name344 |
+----------------------+
| Baglan |
| Khalapur |
| Mahad |
+----------------------+
3 rows in set (3.94 sec)
If I omit the state name from the where clause of the query, I can see that a village with the same name also exist in Rajasthan and Uttarakhand.
_____
This query shows that there are 83 Crore people living in villages.
select sum(Total_Female_Population_of_Village364) as female, sum(Total_Male_Population_of_Village366) as male, sum(Total_Population_of_Village367) as total from temp_aaa;
+--------------------+--------------+--------------+
| female | male | total |
+--------------------+--------------+--------------+
| 405953826 | 427824396 | 833647211 |
+--------------------+--------------+--------------+
_____
Some interesting facts:
1) The data is collected from 35 files starting with number 01 to 35. for e.g. DCHB_Village_Release_3500.xlsx
2) 0900 is the biggest file with 195 MB data followed by 2300 (100 MB) and 2100 (95 MB) While 0400, 3100 and 2500 are smallest files with size around 50,000 bytes.
3) There are around 6.5 lakh villages across all states.
# wc -l all_state_data.csv
646170
4) There are around 24,000 villages growing potato as their primary or secondary crop.
# grep POTATO all_state_data.csv | wc -l
24069
Several columns like nearest village / town will be obsolete if geo co-ordinates are linked with this data.
-- Shantanu