Local Language PDF Parsing

Johnson Chetty

Jan 6, 2017, 5:35:46 AM

to data...@googlegroups.com

Hey,

I've run into the proverbial PDF parsing problem on a project.

It's a Tamil + English PDF, and it's getting a bit harrowing.

I've tried a couple of things, like PyPDF and pdfminer.
The structure of the documents and data points comes out a bit amiss, and chasing after them gets tedious.

Just thought I should send a shoutout and ask if anyone knows of a solution that works well.

I wanted to know which Python libraries are good for extracting data from Unicode (Tamil + English) PDFs and for parsing Tamil characters.

--

Regards,
Johnson Chetty

Thejesh GN

Jan 6, 2017, 10:51:45 AM

to data...@googlegroups.com

Please post an example PDF that you are trying to scrape.

Thej
--
Thejesh GN ⏚ ತೇಜೇಶ್ ಜಿ.ಎನ್
http://thejeshgn.com

--
Datameet is a community of Data Science enthusiasts in India. Know more about us by visiting http://datameet.org

Johnson Chetty

Jan 7, 2017, 6:12:55 AM

to data...@googlegroups.com

Hey,

Yes, sorry, here is a sample document!

It would need the TAB_Reginet font.

http://www.tnreginet.net/TAB_R___.TTF

--

Regards,
Johnson Chetty

sample.pdf

Gora Mohanty

Jan 7, 2017, 7:16:43 AM

to data...@googlegroups.com

The problem here is not parsing the PDF per se; the text needs to be converted from a non-standard font encoding to Unicode. This is difficult because:

* One has to examine the individual glyphs in the font to determine which Unicode character(s) each should map to.

* Because of the way Indian scripts work, glyphs also have to be reordered and substituted to produce valid Unicode text.

Your best bet would be to try to find an existing utility where someone has already done this work, e.g., try searching Google for "tamil font converter unicode". There seem to be some likely-looking candidates.
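
To make the two steps above concrete, here is a minimal Python sketch of what such a converter does. The glyph table is invented purely for illustration and is NOT the real TAB_Reginet layout; a real table would have to be built by inspecting the font. The function name `legacy_to_unicode` is likewise hypothetical. The reordering rule shown is the standard Tamil one: the vowel signs ெ, ே and ை are drawn before the consonant, so legacy fonts usually store them first, whereas Unicode stores them after the consonant.

```python
# Hypothetical glyph table: these legacy codes are made up for illustration.
# They are NOT the actual TAB_Reginet code points.
GLYPH_TO_UNICODE = {
    "k": "\u0b95",  # TAMIL LETTER KA
    "m": "\u0bae",  # TAMIL LETTER MA
    "a": "\u0bbe",  # TAMIL VOWEL SIGN AA
    "e": "\u0bc6",  # TAMIL VOWEL SIGN E  (drawn before the consonant)
    "E": "\u0bc7",  # TAMIL VOWEL SIGN EE (drawn before the consonant)
}

# Vowel signs that appear visually to the left of the consonant.
PREFIX_SIGNS = {"\u0bc6", "\u0bc7", "\u0bc8"}

# Two-part vowel signs: a prefix sign plus AA after the consonant
# collapse into a single Unicode vowel sign (O and OO).
TWO_PART = {
    ("\u0bc6", "\u0bbe"): "\u0bca",
    ("\u0bc7", "\u0bbe"): "\u0bcb",
}


def is_consonant(ch):
    """True if ch lies in the Tamil consonant range KA..HA."""
    return "\u0b95" <= ch <= "\u0bb9"


def legacy_to_unicode(text):
    """Convert a run of legacy-font text to Unicode.

    Assumes the whole run is set in the legacy font; English runs
    must be separated out before calling this.
    """
    # Step 1: glyph-by-glyph substitution (unknown codes pass through).
    chars = [GLYPH_TO_UNICODE.get(c, c) for c in text]

    # Step 2: move prefix vowel signs after their consonant.
    out = []
    i = 0
    while i < len(chars):
        ch = chars[i]
        if ch in PREFIX_SIGNS and i + 1 < len(chars) and is_consonant(chars[i + 1]):
            consonant, sign = chars[i + 1], ch
            i += 2
            # Merge a trailing AA into a two-part vowel sign if applicable.
            if i < len(chars) and (sign, chars[i]) in TWO_PART:
                sign = TWO_PART[(sign, chars[i])]
                i += 1
            out.append(consonant + sign)
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

A real font will need many more rules than this (ligature glyphs, consonant-vowel combinations with their own glyphs, and so on), which is why an existing converter is worth looking for first.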

Regards,

Gora

Venkata Pingali

Jan 7, 2017, 10:44:15 AM

to data...@googlegroups.com

Based on my past experience doing this, you will have work beyond what Gora has identified:

(a) The glyph sequences will have gaps, mix-ups from nearby words, etc., so you are looking at time spent reverse-engineering the rules.

(b) Text from separate sections will be mixed up because PDF mixes 'what' (the text that needs to be rendered) with 'how' (the precise layout), so you are looking at writing additional code to reorganize the text based on the coordinates of the glyphs.
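
As a rough illustration of point (b), the sketch below regroups text fragments into reading order using only their coordinates. The `Fragment` tuple, the `fragments_to_lines` function and the `y_tolerance` value are all assumptions made for this example; in practice the coordinates would come from whatever layout objects your PDF library exposes (pdfminer's layout objects carry a bounding box, for instance). It assumes the usual PDF convention that y grows upward from the bottom of the page.

```python
from collections import namedtuple

# Hypothetical container for one extracted piece of text and its position.
Fragment = namedtuple("Fragment", "x y text")


def fragments_to_lines(fragments, y_tolerance=2.0):
    """Group fragments into lines by y, then order each line by x.

    y_tolerance is an assumed threshold (in PDF points) for treating
    two fragments as sitting on the same line.
    """
    lines = []  # each entry: [y of the first fragment on the line, [fragments]]
    # Top of the page first: larger y means higher on the page.
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0] - frag.y) <= y_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag.y, [frag]])
    # Within a line, read left to right.
    return [
        " ".join(f.text for f in sorted(frags, key=lambda f: f.x))
        for _, frags in lines
    ]


fragments = [
    Fragment(100, 700, "world"),
    Fragment(10, 500.5, "second"),
    Fragment(10, 701, "hello"),
    Fragment(80, 500, "line"),
]
print(fragments_to_lines(fragments))  # ['hello world', 'second line']
```

This handles only a single column of text. Multi-column pages and tables need an extra pass that first splits the fragments by x ranges.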
