New Dataset: India's Electoral Rolls (2024-25, Parsed)


Sharik Laliwala

Aug 26, 2025, 12:58:03 AM
to data...@googlegroups.com
Hi everyone, 

For more than a year, I’ve been working on parsing India’s recent electoral rolls to make them available in analyzable formats. Earlier this week, I released a small portion of this dataset: the state of Haryana (2024 Vidhansabha), with over 20 million individual-level voter records. I plan to update the dataset about once a month, potentially adding one state at a time, with the goal of preparing a journal article introducing the dataset by early 2026.

As some of you may know, India’s electoral rolls have been made available only in non-machine-readable (non-OCR) formats over the last five years or so. Unfortunately, widely used OCR models do not perform well on these rolls. To address this, I used a new OCR engine, Surya-OCR, on high-performance computing clusters. This required a deep dive into the growing frontier of machine learning and supercomputing infrastructure.

The dataset is hosted on Harvard Dataverse. To get access, you will have to fill out this Google Form (also linked in the Dataverse page's description). You can find the documentation for the dataset on my GitHub repo.

Please feel free to send in any questions or feedback, and to share and circulate this announcement. Thanks so much!

Warmly,
--
Sharik Laliwala
PhD Candidate
Department of Political Science
University of California, Berkeley