SOS: Need help extracting data


Shree D N

Aug 18, 2015, 4:01:54 AM
to data...@googlegroups.com
This link has Form 7 (the list of candidates) for all contesting candidates in the BBMP polls 2015. As is typical, they are all PDFs, scanned and uploaded. Language: Kannada
http://117.247.176.82/
Is there a way to download them all and merge them into one document or table that lists all candidates for all wards?
We are trying to put this together because, as far as I know, this consolidated list is not available anywhere. Has anyone else seen one?
--
-------
Cheers,

Shree | Associate Editor | Oorvani Foundation
Citizen Matters - Bangalore's own online news magazine
Bangalore
Tel: +91-80-4173 7584 | Mobile: +91-95909 35559

Bhanu Kamapantula

Aug 18, 2015, 12:33:28 PM
to data...@googlegroups.com
Hi Shree,

You would want to write a script that scrapes the data from the website. This can be automated using Python and the mechanize library, with handling for the doPostBack calls these webpages use.
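
A minimal sketch of how a doPostBack call can be reproduced from a script, using only the Python standard library rather than mechanize. It assumes the pages are ASP.NET WebForms, where the JavaScript call fills the hidden `__EVENTTARGET` and `__EVENTARGUMENT` fields and submits the form along with the other hidden fields such as `__VIEWSTATE`. The function names and the control name in the usage note are hypothetical; the real control names have to be read from the links on the page.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import Request


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if is_hidden and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback(page_html, event_target, event_argument=""):
    """Return the form payload a doPostBack(target, argument) call would submit."""
    parser = HiddenFieldParser()
    parser.feed(page_html)
    payload = dict(parser.fields)  # __VIEWSTATE and friends must be echoed back
    payload["__EVENTTARGET"] = event_target
    payload["__EVENTARGUMENT"] = event_argument
    return payload


def postback_request(url, payload):
    """Wrap the payload in a POST request; pass it to urllib.request.urlopen to send."""
    data = urlencode(payload).encode("utf-8")
    return Request(url, data=data, method="POST")
```

Usage would look roughly like `postback_request(page_url, build_postback(html, "ctl00$lnkWard1"))`, where `ctl00$lnkWard1` is a made-up control name. The response to each postback is either the next page or the PDF itself, depending on how the site is built.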

Once downloaded, the PDFs can be combined using the PDFtk toolkit (one method among several). Xpdf might then be useful for retrieving text from the combined PDF.
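
The merge-and-extract step could be scripted roughly as below. This is a sketch that assumes the `pdftk` and `pdftotext` (shipped with Xpdf) command-line tools are installed and on the PATH; the function names and default file names are hypothetical. Note that `pdftotext` only reads an embedded text layer, so scanned pages are likely to produce little or no text.

```python
import subprocess
from pathlib import Path


def merge_command(pdf_paths, output_path):
    """Build the pdftk command that concatenates the PDFs in the given order."""
    return ["pdftk", *map(str, pdf_paths), "cat", "output", str(output_path)]


def text_command(pdf_path, txt_path):
    """Build the pdftotext command; UTF-8 output is needed for Kannada text."""
    return ["pdftotext", "-enc", "UTF-8", str(pdf_path), str(txt_path)]


def merge_and_extract(folder, merged="all_wards.pdf", text="all_wards.txt"):
    """Merge every PDF in `folder` (sorted by name), then dump its text layer."""
    merged_path = Path(merged).resolve()
    # Skip a merged file left over from an earlier run in the same folder.
    pdfs = sorted(
        p for p in Path(folder).glob("*.pdf") if p.resolve() != merged_path
    )
    if not pdfs:
        raise FileNotFoundError(f"no PDF files found in {folder}")
    subprocess.run(merge_command(pdfs, merged), check=True)
    subprocess.run(text_command(merged, text), check=True)
```

Calling `merge_and_extract("downloads")` would write the combined PDF and a text file to the current directory, with `downloads` standing in for wherever the ward files were saved.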

best,
Bhanu

--
Datameet is a community of Data Science enthusiasts in India. Know more about us by visiting http://datameet.org

Shree D N

Aug 19, 2015, 12:10:33 AM
to data...@googlegroups.com
We downloaded them all manually. However, we are unable to merge them. The text is in Kannada, and may be unclear even if we are able to extract it. We have, in any case, manually merged some data from these 190+ files into existing data we had already prepared.
Digitizing it all, starting from the affidavit filing stage, would have helped us greatly, but sadly neither BBMP nor the EC nor the SEC has such a system.

Nikhil VJ

Aug 19, 2015, 7:56:01 PM
to datameet
Is your target text in text form or image form?

If text: http://tabula.technology/

If image: http://www.i2ocr.com/free-online-kannada-ocr

If image with handwritten text : interns :P
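
One rough way to answer the text-or-image question for a whole batch of files: run a text extractor over each PDF and check how much of the output is actually Kannada script. The sketch below assumes the extracted text is already available as a Python string; the function names and the 0.5 threshold are arbitrary choices for illustration, not a tested cutoff.

```python
def kannada_share(text):
    """Fraction of non-whitespace characters in the Kannada Unicode block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    # The Kannada block spans U+0C80 to U+0CFF.
    kannada = sum(1 for ch in chars if "\u0c80" <= ch <= "\u0cff")
    return kannada / len(chars)


def has_text_layer(extracted_text, threshold=0.5):
    """Guess whether a PDF held real Kannada text rather than a scanned image."""
    return kannada_share(extracted_text) >= threshold
```

Files for which `has_text_layer` returns `False` (empty or mostly non-Kannada output) are the likely candidates for OCR.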

--
Cheers,
Nikhil
+91-966-583-1250
Pune, India
Self-designed learner at Swaraj University <http://www.swarajuniversity.org>
http://nikhilsheth.blogspot.in


