I need help in extracting data from a kannada pdf


rang...@onlinerti.com

Mar 2, 2018, 9:39:34 AM
to datameet
I have a Kannada PDF file and am trying to extract the data from it. It seems the font used is not Unicode: I tried copying and pasting the text from the PDF, but the characters still do not display properly.

I have attached the PDF also. 
HO_1.pdf

Nikhil VJ

Mar 2, 2018, 12:10:14 PM
to datameet
Hi Ranganath,

You're in luck. I'll explain why at the end.

First, open the document properties in your PDF viewer (look around the menus for it); the fonts used in the file are listed there.
I see "Nudi Akshar-01" and "Nudi Akshar-06" listed in addition to the regular fonts. So I started searching using this string:

kannada font "nudi akshar" to unicode convert
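
If the document properties dialog is hard to find, the font names can sometimes be pulled straight out of the file. The following is only a rough sketch using the Python standard library: it scans the raw bytes for /BaseFont entries, so it will miss fonts whose definitions sit inside compressed object streams. The function name list_pdf_fonts is made up for this example.

```python
import re

def list_pdf_fonts(pdf_bytes):
    """Return sorted font names found in /BaseFont entries of raw PDF bytes."""
    names = set()
    for raw in re.findall(rb"/BaseFont\s*/([^\s/<>\[\]()]+)", pdf_bytes):
        name = raw.decode("latin-1")
        # PDF names escape special characters as #xx hex codes (#20 is a space).
        name = re.sub(r"#([0-9A-Fa-f]{2})",
                      lambda m: chr(int(m.group(1), 16)), name)
        # Subset-embedded fonts carry a six-letter prefix such as "ABCDEF+".
        name = re.sub(r"^[A-Z]{6}\+", "", name)
        names.add(name)
    return sorted(names)

# Usage (HO_1.pdf is the attachment from the question):
# with open("HO_1.pdf", "rb") as f:
#     print(list_pdf_fonts(f.read()))
```

An empty result does not prove the file has no embedded fonts; it may only mean the font dictionaries are compressed, in which case the viewer's properties dialog is the more reliable route.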

Sharing a few potential leads for conversion:
- http://aravindavk.in/projects/ : It seems he has created a converter for another Kannada font and released it on github. One can take it and change the mappings to make it work for Nudi Akshar. He has shared his email id on one of the pages.
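
To give an idea of what "changing the mappings" involves, here is a minimal sketch of a table-driven converter in Python. It is not the code from the project above, and the table below is a HYPOTHETICAL placeholder: the real Nudi Akshar codes would have to be worked out from the font itself. A real converter may also need to reorder some glyphs after substitution, which is not shown.

```python
import re

# HYPOTHETICAL mapping for illustration only. These are NOT real
# Nudi Akshar codes; replace them with the font's actual table.
EXAMPLE_MAP = {
    "a": "\u0C85",   # KANNADA LETTER A
    "k": "\u0C95",   # KANNADA LETTER KA
    "kh": "\u0C96",  # KANNADA LETTER KHA
}

def build_converter(mapping):
    """Return a function that replaces legacy codes with Unicode text."""
    # Try longer legacy sequences first so "kh" is not read as "k" + "h".
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))

    def convert(text):
        # Anything not covered by the table passes through unchanged.
        return pattern.sub(lambda m: mapping[m.group(0)], text)

    return convert

convert = build_converter(EXAMPLE_MAP)
converted = convert("kh k a!")
```

Keeping the table separate from the replacement logic is what makes this kind of converter easy to adapt: supporting another legacy font means swapping the dictionary, not rewriting the code.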

I found this page mentioned in this search result which managed to come around the top of my search: https://bitbin.it/KV0Mn1x1/ ... talk about digital breadcrumbs. I advise you make an entry on the Talk page here to get in touch with others like yourself.


I'm not exploring further.. please check it out at your end.

If you're more interested in just having that content read than in converting it to Unicode, and you have some control over the places where it'll be read, then you can find and install the fonts mentioned, and share their .TTF files for installing elsewhere. However, this will not be possible on phones and tablets (as far as I know).

----------

For folks having a similar issue with Devanagari fonts (Hindi, Marathi etc.), check out this: https://sites.google.com/site/technicalhindi/home/converters.
Brilliant work, but I wish someone would help them move to github. I had to customize one of their converters as the text I was dealing with had slightly differing mappings. It was a fun reverse engineering exercise. I've shared my customized converters here: http://ourpuneourbudget.in/tools/

----------

Why you're in luck

Non-English Unicode text and PDF technology have a weird problem that hasn't been resolved yet. PDF has to re-arrange the character glyphs to make them appear properly, and that messes the text up. Display is achieved but fidelity is lost. So, Unicode text that goes into a PDF... may or may not make it back out in one piece. The degree of distortion even seems to vary across software and operating systems.

An intervention at the PDF creating end (hence not applicable to our case) is shared here: 

Legacy ANSI fonts on the other hand.. retain full fidelity. You can convert a legacy-font document (like yours) to PDF, copy the text back out, and get the original character codes unchanged.

So, since your text is in a legacy font (Nudi Akshar), you stand a chance of converting the whole thing into Unicode Kannada at the click of a button.
Had it been in Unicode Kannada, you might have had to manually proof-read everything and make the necessary edits.
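
A quick way to tell which of the two cases a PDF falls into is to look at the code points of the text you copy out of it. The sketch below is only a heuristic, and the function name kannada_fraction is made up for this example: it measures how much of the text lies in the Unicode Kannada block (U+0C80 to U+0CFF). Text from a legacy font typically comes out as Latin letters and symbols, so the fraction will be near zero.

```python
def kannada_fraction(text):
    """Fraction of non-space characters inside the Unicode Kannada block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    # The Kannada block spans U+0C80 through U+0CFF.
    in_block = sum(1 for ch in chars if "\u0C80" <= ch <= "\u0CFF")
    return in_block / len(chars)

# Unicode Kannada text scores 1.0; plain Latin text scores 0.0.
unicode_score = kannada_fraction("\u0C95\u0CA8\u0CCD\u0CA8\u0CA1")
legacy_score = kannada_fraction("abc def")
```

Digits and punctuation count against the score, so a mixed document will land somewhere in between; treat the number as a hint, not a verdict.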

------------

For those trying to get Unicode text out of PDFs : Hope you find a way, all the best. Check out that documentfoundation link above. See past discussions on this group: https://groups.google.com/forum/#!searchin/datameet/pdf$20unicode%7Csort:date



--
Cheers,
Nikhil VJ
+91-966-583-1250
Pune, India
Website <http://nikhilvj.cu.cc>
DataMeet Pune chapter <https://datameet-pune.github.io/>
Self-designed learner at Swaraj University <http://www.swarajuniversity.org>


rang...@onlinerti.com

Mar 5, 2018, 7:07:36 AM
to datameet
Hi Nikhil, thank you so much for your valuable inputs. I will try your solution.

Nikhil VJ

Mar 16, 2018, 12:44:24 AM
to datameet
Hi Ranganath,

Update from the technical-hindi group: Vishal Goel from Punjabi University, Patiala is working on a one-stop solution integrating all the converters. Check out this post; it might be valuable to get in touch and work towards adding Kannada conversion too.



