Parsing Voters List : Glyph to Unicode issue


Siddharth Vijayakrishnan

Sep 1, 2015, 10:09:21 AM
to datameet
Hi,

I downloaded a few files containing voter rolls and tried to parse the PDFs using pdfminer. I ran straight into a problem [1] where the glyphs are converted to Unicode using the wrong character map. Before I try to solve this on my own, does anyone in this community have a ready-made solution?

[1] http://stackoverflow.com/questions/31876415/parsing-a-pdfdevanagari-script-using-pdfminer-gives-incorrect-output

Nikhil VJ

Sep 19, 2015, 5:51:24 AM
to datameet
Hi Siddharth,

Sorry I missed this earlier.
In April this year I converted a budget PDF with Marathi content in a legacy font (similar to ShreeDev) to Excel. It was a two-step process: first extract to Excel, then replace all the text after passing it through a legacy-font-to-Unicode converter (an HTML file with JavaScript).


Just check your document or send me a copy. If it uses legacy fonts, then copy-pasting from it gives random English letters and punctuation. If it's Unicode, then copy-pasting gives Unicode text, but inaccurate. Someone may already have made a converter for your font; if not, and you have enough content, you could make your own converter.
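The copy-paste check above can also be run on text that has already been extracted. This is only a rough sketch: the function names and the 0.2 threshold are my own assumptions, not anything from a library, and a document that mixes a lot of English or digits into Devanagari text would need a different cut-off.

```python
def devanagari_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the Devanagari block (U+0900 to U+097F)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if "\u0900" <= c <= "\u097f")
    return hits / len(chars)


def looks_like_legacy_font(text: str, threshold: float = 0.2) -> bool:
    """Guess 'legacy font' when almost none of the extracted text is Devanagari.

    The threshold is an arbitrary assumption; tune it on your own documents.
    """
    return devanagari_ratio(text) < threshold
```

Text that comes out as mostly ASCII letters and punctuation would be flagged as legacy-font output; text that is mostly Devanagari code points would not, even though it may still be inaccurate.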

If the PDF has a Unicode font in it, then my method fails.
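For anyone who wants to build their own converter, here is a minimal sketch of the idea in Python rather than JavaScript. The glyph table is made up for illustration and is NOT the real table of ShreeDev or any other font; a real converter needs the full table for the specific font, and the vowel-sign reordering here only handles a single consonant, not conjuncts.

```python
# Hypothetical glyph table for illustration only; not the table of any real font.
LEGACY_TO_UNICODE = {
    "d": "\u0915",   # ka
    "e": "\u092e",   # ma
    "y": "\u0932",   # la
    "k": "\u093e",   # vowel sign aa
    "ks": "\u094b",  # vowel sign o (two-character key, so longest match matters)
    "f": "\u093f",   # vowel sign i, stored BEFORE its consonant in legacy fonts
}
SHORT_I = "\u093f"


def legacy_to_unicode(text, table=LEGACY_TO_UNICODE):
    """Map legacy glyph codes to Unicode, trying the longest keys first."""
    keys = sorted(table, key=len, reverse=True)
    pieces = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                pieces.append(table[key])
                i += len(key)
                break
        else:
            # Unknown character: pass it through unchanged.
            pieces.append(text[i])
            i += 1

    # Unicode stores the short-i sign AFTER the consonant, so swap it
    # with the piece that follows it (simplification: one consonant only).
    result = []
    j = 0
    while j < len(pieces):
        if pieces[j] == SHORT_I and j + 1 < len(pieces):
            result.append(pieces[j + 1])
            result.append(SHORT_I)
            j += 2
        else:
            result.append(pieces[j])
            j += 1
    return "".join(result)
```

With this toy table, `legacy_to_unicode("dke")` gives the three code points for "काम". Characters missing from the table are left as they are, which makes gaps in the table easy to spot in the output.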

I wasn't aware of the Stack Overflow question you linked to. It has great insights into why Unicode extraction is failing.

If there are only a few pages, then this free online multi-language OCR tool might help: http://www.i2ocr.com/free-online-hindi-ocr
(it is a time-consuming, page-by-page process, so only advisable if there is little content or if you have a slave army of interns at your disposal :P)

--
Cheers,
Nikhil
+91-966-583-1250
Pune, India
Self-designed learner at Swaraj University <http://www.swarajuniversity.org>
http://nikhilsheth.blogspot.in

--
Datameet is a community of Data Science enthusiasts in India. Know more about us by visiting http://datameet.org
---
You received this message because you are subscribed to the Google Groups "datameet" group.
To unsubscribe from this group and stop receiving emails from it, send an email to datameet+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Raphael Susewind

Sep 21, 2015, 9:27:00 AM
to data...@googlegroups.com
Hi Siddharth and Nikhil,

Sorry for the delay, I was travelling for the past weeks. I have worked
extensively with the electoral rolls, and ultimately the only solution I
found for the problem of corrupted text is OCR. Tesseract was the most
accurate in my experiments (and also relatively fast). It can be
automated, though scaling up would require vast resources.
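As a rough illustration of what automating this could look like, here is a sketch that calls the tesseract command line from Python. It assumes tesseract and the relevant language data (for example "hin" for Hindi or "mar" for Marathi) are installed; the function names are my own, and this is not the setup used in the experiments mentioned above.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_command(image, lang="hin"):
    """Build the command `tesseract IMAGE OUTPUTBASE -l LANG`.

    The output base is the image path without its extension;
    tesseract itself appends ".txt" to it.
    """
    image = Path(image)
    out_base = image.with_suffix("")
    return ["tesseract", str(image), str(out_base), "-l", lang]


def ocr_images(images, lang="hin"):
    """Run tesseract on each image and return the paths of the text files."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    outputs = []
    for image in images:
        cmd = tesseract_command(image, lang)
        subprocess.run(cmd, check=True)
        outputs.append(Path(cmd[2] + ".txt"))
    return outputs
```

Each page is an independent call, so the loop can be spread over several processes or machines, which is where the resource cost of scaling up comes in.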

Let us know if you find an alternative (though I am sceptical),

Best,
Raphael

--
Dr. Raphael Susewind | Political anthropologist, Associate CSASP Oxford
Snail Mail | Melanchthonstr. 4a, 33615 Bielefeld, Germany
Web & Twitter | http://www.raphael-susewind.de | @RaphaelSusewind

Please do consider http://www.gnupg.org for encryption (key id 10AEE42F)

Nikhil VJ

Sep 29, 2015, 3:20:00 PM
to datameet
Hi Raphael,

Thanks for sharing about Tesseract: it always helps to know what's in the engines ~:)

I wish we had a way of OCR'ing tabular documents: Tabula's interface combined with OCR.
I created a feature request on Tabula for this :
https://github.com/tabulapdf/tabula/issues/409
Let's hope it gets some love! Please +1 it!

Siddharth, you should share at least a one-page PDF sample of what you're working with; then we'll be able to see which way is best for what you've got.

If one goes the OCR way, we might need to convert the target PDF to image format. There are quite a few online sites for doing that, but it gets tricky with non-English scripts. If you're on a Linux OS, then pdftoppm is a good command-line tool to use.

Sample command: pdftoppm -rx 200 -ry 200 -png b.pdf b
(200 sets the DPI; I found this worked best with the docs I was processing)
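If there are many PDFs, the sample command can be wrapped in a small script. This is just a sketch assuming pdftoppm (from poppler-utils) is on the PATH; the function names are my own, and the optional -f / -l page-range flags are standard pdftoppm options that I added for convenience.

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf, out_root, dpi=200, first=None, last=None):
    """Build the pdftoppm command from the sample above, with an optional page range."""
    cmd = ["pdftoppm", "-rx", str(dpi), "-ry", str(dpi), "-png"]
    if first is not None:
        cmd += ["-f", str(first)]
    if last is not None:
        cmd += ["-l", str(last)]
    return cmd + [str(pdf), str(out_root)]


def pdf_to_pngs(pdf, out_root, dpi=200):
    """Convert every page to PNG and return the image paths, sorted by file name."""
    subprocess.run(pdftoppm_command(pdf, out_root, dpi), check=True)
    root = Path(out_root)
    # pdftoppm names the images "<out_root>-<page number>.png".
    return sorted(root.parent.glob(root.name + "-*.png"))
```

Calling `pdftoppm_command("b.pdf", "b")` reproduces the sample command exactly.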


Raphael Susewind

Sep 29, 2015, 4:28:37 PM
to data...@googlegroups.com
Hi Nikhil and all,

I had the best results with a python tool called pdf-table-extract:

https://github.com/ashima/pdf-table-extract

You have to tweak the parameters a bit, but then it rather nicely
extracts the coordinates of each cell (defined as something surrounded
by a black rectangle), which you can then feed into ghostscript or some
such to extract the image (gs is faster than pdftoppm IMHO). In most
cases this worked nicely for electoral rolls:

pdf-table-extract -i FILE -p PAGE -r 300 -l 0.7 -t cells_xml
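For scripting this, the two command lines could be built like so. The pdf-table-extract arguments are exactly the ones quoted above; the ghostscript call is only my assumption of a reasonable way to render one page to PNG at the same resolution (standard gs switches), and it does not crop individual cells or parse the cells_xml output.

```python
def pdf_table_extract_command(pdf, page):
    """The pdf-table-extract invocation quoted above, for one page."""
    return ["pdf-table-extract", "-i", str(pdf), "-p", str(page),
            "-r", "300", "-l", "0.7", "-t", "cells_xml"]


def gs_render_page_command(pdf, page, out_png, dpi=300):
    """Ghostscript command that renders a single page to a 24-bit PNG.

    Assumption: rendering the whole page; cropping to one cell is not shown.
    """
    return [
        "gs", "-dSAFER", "-dBATCH", "-dNOPAUSE", "-q",
        "-sDEVICE=png16m",
        f"-r{dpi}",
        f"-dFirstPage={page}", f"-dLastPage={page}",
        f"-sOutputFile={out_png}",
        str(pdf),
    ]
```

Both lists can be passed to `subprocess.run(..., check=True)`. Rendering at the same DPI that was given to `-r` keeps the page image at the resolution the extraction step used.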

Just my 5 cents,

Raphael
