Hi everyone,
For more than a year, I’ve been working on parsing India’s recent electoral rolls to make them available in analyzable formats. Earlier this week, I released a small portion of this dataset: the state of Haryana (2024 Vidhansabha), with over 20 million individual-level voter records. I plan to update the dataset about once a month, potentially adding one state at a time, and aim to prepare a journal article introducing the dataset by early 2026.
As some of you may know, for the last five years or so India’s electoral rolls have been made available only in non-machine-readable (non-OCR) formats. Unfortunately, widely used OCR models do not perform well on these rolls. To address this, I used a new OCR engine, Surya-OCR, on high-performance computing clusters. This required a deep dive into the fast-growing frontier of machine learning and supercomputing infrastructure.
The dataset is hosted on Harvard Dataverse. To get access, you will need to fill out this Google Form (also linked in the Dataverse page's description). Documentation for the dataset is available on my GitHub repo.
Please feel free to send me any questions or feedback, and to share and circulate this announcement. Thanks so much!
Warmly,