Converting old Devanagari fonts to Unicode


Rushabh Mehta

Jul 5, 2013, 1:50:59 AM
to data...@googlegroups.com
Hello all,

I am not sure if this is the right forum, but would love to get any pointers.

I am volunteering with a local Hindi newspaper and want to get their editions online in a web-searchable format. Here is the link to the site.


The biggest hurdle I am facing is converting the text from the font encoding the paper uses (APS-Priyanka) to Unicode (assuming that I can extract the text from the PDFs, and setting formatting issues aside for the moment).

From what I gathered from web searches, APS Priyanka is a really old font and does not follow any standard encoding such as ISCII. I tried some basic scripts and character maps, but it does not seem to be a "trivial" problem.
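To illustrate why a plain character map falls short, here is a minimal sketch in Python. The map entries are invented for illustration and are not the real APS-Priyanka layout. It shows two common complications with legacy fonts: some letters are stored as multi-character sequences, so longer keys must be matched first, and the short "i" vowel sign is stored before its consonant in visual order, while Unicode expects it after the consonant.

```python
import re

# HYPOTHETICAL map: these legacy sequences are made up for illustration
# and are NOT the actual APS-Priyanka encoding.
LEGACY_TO_UNICODE = {
    "d": "\u0915",   # क
    "[k": "\u0916",  # ख (a two-character legacy sequence)
    "k": "\u093e",   # ा (vowel sign aa)
    "f": "\u093f",   # ि (vowel sign i, stored BEFORE the consonant)
}


def convert(text, mapping=LEGACY_TO_UNICODE):
    """Convert legacy-encoded text to Unicode using a per-font map."""
    # Longest keys first, so "[k" is tried before the single "k".
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    out = pattern.sub(lambda m: mapping[m.group(0)], text)
    # Unicode stores the i vowel sign after its consonant (U+0915..U+0939),
    # so swap every "sign + consonant" pair into "consonant + sign".
    return re.sub("\u093f([\u0915-\u0939])", "\\1\u093f", out)


print(convert("fd"))  # legacy visual order becomes consonant + vowel sign
```

This sketch does not handle conjuncts, half forms, or the reph, each of which needs its own reordering rule, which is where most of the remaining work would be.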

If anyone has experience in this and can help, it would be great.

best,
Rushabh


T: @rushabh_mehta

Anand Chitipothu

Jul 5, 2013, 2:02:30 AM
to data...@googlegroups.com
Hi Rushabh,

It looks like you want to convert text stored in the custom encoding of a proprietary font to Unicode, rather than make a legacy font Unicode-friendly.

I'm not an expert in that area, but I can give you some pointers.

There used to be a website, uni.medhas.org, that converted websites using Windows-specific fonts to Unicode on the fly. It looks like the site is gone, but here is a copy of it from the Wayback Machine.


The same people created a Firefox extension that does the same conversion.


Look at the code, or ask them how to convert the fonts.

Anand

Rushabh Mehta

Jul 5, 2013, 2:24:53 AM
to data...@googlegroups.com
Anand,

Thanks for the Padma tip! I never found that in my web searches. I see a lot of character maps. I guess the way forward would be to run the text through all of them, find the closest match, and then fix the encodings that are off.
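A rough sketch of that "run it through every map and pick the closest" idea, in Python. The candidate maps and their names here are hypothetical placeholders, not maps taken from Padma. The score is simply the share of non-space characters that land in the Devanagari block (U+0900 to U+097F) after conversion, so it measures coverage only: a map could score well and still produce the wrong letters, and a manual check of the winner is still needed.

```python
import re


def apply_map(text, mapping):
    """Substitute legacy sequences in one pass, longest keys first."""
    if not mapping:
        return text
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mapping[m.group(0)], text)


def devanagari_ratio(text):
    """Fraction of non-space characters inside U+0900..U+097F."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if "\u0900" <= c <= "\u097f")
    return hits / len(chars)


def best_map(sample, candidates):
    """Score every candidate map on a sample; return (best name, scores)."""
    scores = {
        name: devanagari_ratio(apply_map(sample, mapping))
        for name, mapping in candidates.items()
    }
    return max(scores, key=scores.get), scores


# HYPOTHETICAL candidate maps, invented for this example.
candidates = {
    "map_a": {"d": "\u0915", "k": "\u093e"},
    "map_b": {"d": "\u0915"},
}
name, scores = best_map("dk dk", candidates)
print(name, scores)
```

The characters a winning map leaves unconverted are a useful list of the entries that still need fixing by hand.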

best,
Rushabh


T: @rushabh_mehta

--
For more details about this list
http://datameet.org/discussions/

Gautam John

Jul 5, 2013, 3:24:41 AM
to data...@googlegroups.com
This is a huge problem and something that we have been struggling with
at Pratham Books. While you can create maps, they are wildly inaccurate,
and you'll need one for every font too.

Rushabh Mehta

Jul 5, 2013, 11:29:30 AM
to data...@googlegroups.com
Gautam,

I'd be curious to learn how you do it at Pratham. Any chance there is a Python script lying around? :) I am looking at Padma and wondering how to use it, because even if I manage to decode the text, I will still need to run batch processes to convert from PDFs.
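For the batch part, something along these lines is what I have in mind. This is only a sketch under several assumptions: the text has already been extracted from the PDFs into `.txt` files by a separate tool, the extracted files can be read as Latin-1 (one character per byte), and `convert` is whatever legacy-to-Unicode function ends up working. The function name `convert_tree` and the directory names are made up.

```python
from pathlib import Path


def convert_tree(src_dir, dst_dir, convert, src_encoding="latin-1"):
    """Convert every .txt file under src_dir, mirroring the layout in dst_dir.

    `convert` is any function that takes legacy-encoded text and returns
    Unicode text. The default source encoding is an assumption.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    written = []
    for path in sorted(src.rglob("*.txt")):
        target = dst / path.relative_to(src)
        target.parent.mkdir(parents=True, exist_ok=True)
        legacy_text = path.read_text(encoding=src_encoding)
        target.write_text(convert(legacy_text), encoding="utf-8")
        written.append(target)
    return written


# Example call with hypothetical directory names:
# convert_tree("extracted_text", "unicode_text", my_convert_function)
```

Keeping the extraction step separate from the conversion step means the maps can be corrected and the whole archive re-run without touching the PDFs again.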

best,
Rushabh


T: @rushabh_mehta

Gautam John

Jul 5, 2013, 11:32:47 PM
to data...@googlegroups.com
On Fri, Jul 5, 2013 at 8:59 PM, Rushabh Mehta <rme...@gmail.com> wrote:

> Will be curious to learn how do you do it at Pratham? Any chance there is a
> Python script lying around :) ? I am looking at Padma and thinking how do I
> use it because even if I manage to decode it, I will still need to run batch
> processes to convert from PDFs.

Our problem is that the source files are InDesign or CorelDraw, so it is
much harder to run any sort of script on them. For now, we replace them
manually.

Anivar Aravind

Jul 6, 2013, 2:04:33 AM
to data...@googlegroups.com
http://wiki.smc.org.in/Payyans
http://wiki.smc.org.in/Payyans#How_to_create_a_font_map.3F

http://silpa.org.in/ASCII2Unicode

--
"[It is not] possible to distinguish between 'numerical' and 'nonnumerical'
algorithms, as if numbers were somehow different from other kinds of precise
information." - Donald Knuth