Electoral Rolls Karnataka - Request for resources


Anirudh K

Nov 23, 2020, 2:29:27 PM
to datameet
Hi all,

The Chief Electoral Officer - Karnataka has published a new version of the electoral rolls. These are image-based PDFs that have to be converted to text-based PDFs.

Additional compute resources are needed to convert these large files. If anyone would like to help with this, the process entails running a Python script (already written) on Google Colab and sharing the output folder on Google Drive. A more technical description of the process is given below.

Please reach out to bha...@gmail.com (or call PG Bhat - 9900141232) to help out with this project, or in case of any queries. 

The full process:
  1. Create a shared folder on Drive called 'ERMS' and give edit access to bha...@gmail.com.
  2. He will create 3 subfolders:
    • Code - This will contain the script. There is no need for any software to be installed locally.
    • Image files - This houses the image files
    • Text files - This is where the script will write the results
  3. Run the script on Colab (a free account is enough). The text files can then be downloaded from the Drive folder.
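
For anyone unfamiliar with Colab, the folder layout above can be sketched roughly as follows. This is only an illustration, not the actual script: the helper names `erms_paths` and `mount_drive` are hypothetical, and the path `/content/drive/MyDrive` assumes that you created the 'ERMS' folder in your own Drive and mounted Drive at Colab's usual mount point.

```python
from pathlib import Path


def erms_paths(drive_root="/content/drive/MyDrive"):
    """Return the three subfolders described above, under the shared 'ERMS' folder.

    drive_root is an assumption: it is where a folder created in your own
    Drive normally appears after mounting Drive in Colab.
    """
    base = Path(drive_root) / "ERMS"
    return {
        "code": base / "Code",           # holds the script
        "images": base / "Image files",  # input: the image files
        "text": base / "Text files",     # output: results written by the script
    }


def mount_drive(mount_point="/content/drive"):
    """Mount Google Drive when running inside Colab; do nothing elsewhere."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return False
    drive.mount(mount_point)
    return True


if __name__ == "__main__":
    if mount_drive():
        for name, path in erms_paths().items():
            print(name, path, "exists" if path.exists() else "missing")
```

Running this in a Colab cell should prompt for Drive authorisation and then report whether each of the three subfolders is visible.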
Thank you for considering this request.

Regards,
Anirudh


Nikhil VJ

Nov 25, 2020, 2:31:42 AM
to datameet, PG Bhat
Hi,

Just to update: I got in touch with Mr. PG, and we have set up a workflow on a cloud server that is chugging along nicely.

What the program does is cool: it runs the Python library ocrmypdf in bulk mode.

This description from their docs is what it's mainly doing:
OCRmyPDF adds an optical character recognition (OCR) text layer to scanned PDF files, allowing them to be searched.

I made some tweaks to PG's program and have put it on GitHub here: https://github.com/answerquest/bulk_pdf_OCR/

I think it may be useful at other places too.
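
For a rough idea of what running ocrmypdf in bulk can look like, here is a minimal sketch. It is not the code from the repository above: the function names `bulk_ocr` and `default_ocr` are made up for illustration, and it assumes ocrmypdf is installed along with its Tesseract and Ghostscript dependencies. Language selection for the OCR engine is left out.

```python
from pathlib import Path
from typing import Callable, List


def default_ocr(src: Path, dst: Path) -> None:
    """Add an OCR text layer to one PDF using ocrmypdf."""
    # Imported lazily so bulk_ocr can be exercised without ocrmypdf installed.
    import ocrmypdf

    # skip_text=True leaves pages that already contain text untouched.
    ocrmypdf.ocr(src, dst, skip_text=True)


def bulk_ocr(in_dir, out_dir,
             ocr: Callable[[Path, Path], None] = default_ocr) -> List[Path]:
    """Run `ocr` on every *.pdf in in_dir, writing same-named files to out_dir.

    Files that already exist in out_dir are skipped, so an interrupted
    batch can simply be started again. Returns the files written this run.
    """
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(in_dir.glob("*.pdf")):
        dst = out_dir / src.name
        if dst.exists():
            continue  # converted in an earlier run
        ocr(src, dst)
        written.append(dst)
    return written
```

Calling `bulk_ocr("scanned", "searchable")` would process every PDF in a hypothetical `scanned` folder. Skipping existing outputs matters on free cloud sessions, which can disconnect in the middle of a long batch.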

--
Cheers,
Nikhil VJ
https://nikhilvj.co.in


--
Datameet is a community of Data Science enthusiasts in India. Know more about us by visiting http://datameet.org

bob quest

Nov 25, 2020, 8:14:22 AM
to data...@googlegroups.com
Hi,

Where can we get the electoral rolls in image format?

PG Bhat

Nov 25, 2020, 9:13:31 AM
to Nikhil VJ, datameet

I request you all not to publicise this. Let us limit the discussion to technical matters and not discuss the data itself. Let us not put the data anywhere public.

 

ECI has placed a lot of restrictions on the data to deny access. If they see the voter records being reconstructed and made public, that may not go down well with them.

 

If anyone wants to discuss the issue further, please call me.

 

Warm Regards,

PG

990 014 1232
