Local Language PDF Parsing


Johnson Chetty

Jan 6, 2017, 5:35:46 AM
to data...@googlegroups.com
Hey, 

I've run into the proverbial PDF parsing problem on a project: the PDFs mix Tamil and English, and it's getting a bit harrowing.
I've tried a couple of things like PyPDF and pdfminer.
The structure of the documents and data points comes out a bit amiss, and chasing them down gets tedious.

Just thought I should send a shoutout and ask if anyone knows of a solution that works well.
I'd like to know which Python libraries are good for extracting data from Unicode (Tamil + English) PDFs and parsing the Tamil characters.



--
Regards,
Johnson Chetty




Thejesh GN

Jan 6, 2017, 10:51:45 AM
to data...@googlegroups.com
Please post an example of the PDF that you are trying to scrape.


Thej
--
Thejesh GN ⏚ ತೇಜೇಶ್ ಜಿ.ಎನ್
http://thejeshgn.com

--
Datameet is a community of Data Science enthusiasts in India. Know more about us by visiting http://datameet.org

Johnson Chetty

Jan 7, 2017, 6:12:55 AM
to data...@googlegroups.com
Hey, 

Yes, sorry, here is a sample document! 

It would need the TAB_Reginet font. 






--
Regards,
Johnson Chetty




sample.pdf

Gora Mohanty

Jan 7, 2017, 7:16:43 AM
to data...@googlegroups.com
This has nothing to do with parsing the PDF per se; rather, it needs a conversion from a non-standard font encoding to Unicode. This is a difficult problem because:
* One has to examine the individual glyphs in the font to determine the Unicode character(s) that each should map to.
* Because of the way Indian languages work, various reorderings and substitutions of glyphs are also needed to produce valid Unicode.

Your best bet would be to find an existing utility where someone has already done this work, e.g., try searching Google for "tamil font converter unicode". There seem to be some likely-looking candidates.
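
To make the two steps listed above concrete, here is a rough Python sketch. The glyph table is entirely hypothetical (the real TAB_Reginet layout is not known here and would have to be worked out by inspecting the font), and `legacy_to_unicode` is just an illustrative name. What the sketch does rely on is a property of Tamil in Unicode: the vowel signs ெ, ே and ை are drawn before the consonant but stored after it, and a left sign plus a trailing ா around a consonant together form ொ or ோ. It should only be applied to text runs set in the legacy font, not to the English parts.

```python
# Hypothetical glyph table: these ASCII stand-ins are NOT the real
# TAB_Reginet code points; a real table comes from inspecting the font.
GLYPH_TO_UNICODE = {
    "k": "\u0b95",  # TAMIL LETTER KA
    "m": "\u0bae",  # TAMIL LETTER MA
    "A": "\u0bbe",  # TAMIL VOWEL SIGN AA
    "e": "\u0bc6",  # TAMIL VOWEL SIGN E  (drawn left of the consonant)
    "E": "\u0bc7",  # TAMIL VOWEL SIGN EE (drawn left of the consonant)
    "Y": "\u0bc8",  # TAMIL VOWEL SIGN AI (drawn left of the consonant)
}

# Vowel signs that appear before the consonant visually but must
# follow it in Unicode logical order.
PREFIX_SIGNS = {"\u0bc6", "\u0bc7", "\u0bc8"}

# Left part + trailing AA sign around a consonant form one vowel sign.
TWO_PART = {
    ("\u0bc6", "\u0bbe"): "\u0bca",  # E  + AA -> TAMIL VOWEL SIGN O
    ("\u0bc7", "\u0bbe"): "\u0bcb",  # EE + AA -> TAMIL VOWEL SIGN OO
}


def is_consonant(ch):
    """True for Tamil consonants (KA..HA, U+0B95 to U+0BB9)."""
    return "\u0b95" <= ch <= "\u0bb9"


def legacy_to_unicode(text):
    """Map legacy glyph codes to Unicode, then fix visual-order vowel signs."""
    mapped = [GLYPH_TO_UNICODE.get(ch, ch) for ch in text]
    out = []
    i = 0
    while i < len(mapped):
        ch = mapped[i]
        if ch in PREFIX_SIGNS and i + 1 < len(mapped) and is_consonant(mapped[i + 1]):
            consonant, sign = mapped[i + 1], ch
            i += 2
            # Merge a trailing AA sign into a two-part vowel sign if present.
            if i < len(mapped) and (sign, mapped[i]) in TWO_PART:
                sign = TWO_PART[(sign, mapped[i])]
                i += 1
            out.append(consonant + sign)
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

A real converter would need a much larger table plus rules for ligatures and other substitutions, which is why an existing, tested utility is likely the better route.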

Regards,
Gora

Venkata Pingali

Jan 7, 2017, 10:44:15 AM
to data...@googlegroups.com
Based on my past experience doing this, you will have work beyond
what Gora has identified:

(a) The glyph sequences will have gaps, mix-ups with nearby
words, etc., so you are looking at spending time to reverse engineer the rules.

(b) Text from separate sections will be mixed up because PDF
mixes 'what' (the text that needs to be rendered) with 'how' (the precise
layout), so you are looking at writing additional code to reorganize
the text based on the coordinates of the glyphs.
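
As a minimal sketch of the coordinate-based reorganization in (b), the function below regroups text fragments into lines using their positions. It assumes the fragments have already been extracted as `(x, y, text)` tuples with `y` increasing upward, as in PDF page coordinates; with pdfminer these would presumably come from the `bbox` of its layout objects. The name `rebuild_lines` and the `y_tolerance` value are illustrative choices, not part of any library.

```python
def rebuild_lines(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into lines, top to bottom, left to right.

    Fragments whose y value is within y_tolerance of the first fragment
    on a line are treated as belonging to that line.
    """
    # Higher y is nearer the top of the page in PDF coordinates.
    ordered = sorted(fragments, key=lambda frag: -frag[1])

    lines = []
    for x, y, text in ordered:
        if lines and abs(lines[-1]["y"] - y) <= y_tolerance:
            lines[-1]["items"].append((x, text))
        else:
            lines.append({"y": y, "items": [(x, text)]})

    # Within each line, order the fragments by their x position.
    return [
        " ".join(text for _, text in sorted(line["items"], key=lambda item: item[0]))
        for line in lines
    ]


# Fragments arrive in the order the PDF drew them, not in reading order.
sample = [
    (120.0, 700.0, "Parsing"),
    (50.0, 700.5, "PDF"),
    (50.0, 680.0, "Tamil"),
    (100.0, 680.0, "text"),
]
print(rebuild_lines(sample))
```

This only handles simple single-column lines; separating columns, tables, or sections would need extra grouping on the x coordinates as well.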

