converting data from unknown encoding in pdf to unicode


विश्वासो वासुकिजः (Vishvas Vasuki)

May 23, 2016, 10:22:04 AM
to Anunad Singh, sanskrit-programmers
I need some help with https://github.com/sanskrit-coders/stardict-sanskrit/issues/12 - the problem is to copy text from the PDF http://ignca.nic.in/vedic_heritage/Vedic_heritage_Illustratred_dic_hindi.pdf and convert it to Unicode. The PDF is an important resource for students and practitioners of shrauta kalpa like me. Any ideas? (+ anunaad, who has prior related experience)

--
Vishvas /विश्वासः

Sandeep Nangia

May 23, 2016, 11:35:54 AM
to sanskrit-p...@googlegroups.com, Anunad Singh
Thinking out loud here. I am not sure how useful this is, but here goes.

The problem can be broken into two parts:

A. Extracting text from pdf. This can be done using Apache PDFBox command line tools (See http://pdfbox.apache.org and http://pdfbox.apache.org/2.0/commandline.html#extracttext). 

So a command like below should get you the output in a text file (hindioutput.txt):

java -jar pdfbox-app-2.0.0.jar ExtractText Vedic_heritage_Illustratred_dic_hindi.pdf hindioutput.txt


I am not sure which value to pass to the -encoding option.
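A minimal sketch of running the same extraction over a page range, which may help in comparing different parts of the PDF separately. It assumes the PDFBox 2.0 ExtractText options -encoding, -startPage and -endPage behave as listed on the command-line page linked above; as far as I understand that page, -encoding sets the character encoding of the output text file (UTF-8 by default), not the font encoding inside the PDF. The function names and the default jar path are hypothetical.

```python
import subprocess


def build_extract_command(pdf_path, out_path, start_page, end_page,
                          jar="pdfbox-app-2.0.0.jar", encoding="UTF-8"):
    """Build the PDFBox ExtractText command line for a page range.

    The jar path is an assumption; point it at wherever the PDFBox
    application jar actually lives.
    """
    return [
        "java", "-jar", jar, "ExtractText",
        "-encoding", encoding,          # encoding of the output text file
        "-startPage", str(start_page),  # first page to extract (1-based)
        "-endPage", str(end_page),      # last page to extract
        pdf_path, out_path,
    ]


def extract_pages(pdf_path, out_path, start_page, end_page):
    """Run the extraction; requires Java and the PDFBox jar to be present."""
    cmd = build_extract_command(pdf_path, out_path, start_page, end_page)
    subprocess.run(cmd, check=True)
```

For example, extract_pages("Vedic_heritage_Illustratred_dic_hindi.pdf", "cover.txt", 1, 4) would write only the first four pages to cover.txt.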

B. Now we have a text file that is in some proprietary font encoding (not Unicode).

I copy-pasted the initial portion of hindioutput.txt and experimented with the various possible choices. By selecting kruti on this page, http://foss.coep.org.in/foss/marathi/ConvertStringGui.py, I was able to convert the initial cover pages to Unicode Devanagari.
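A converter for a legacy font encoding can be sketched as a table-driven substitution: match the longest legacy code first, and move the pre-base vowel sign (which legacy fonts store before its consonant) to after it, as Unicode requires. The sketch below is not the code behind the page above. The handful of table entries are my recollection of the Kruti Dev layout and are only illustrative; they should be checked against a complete, verified table before any real use.

```python
# Illustrative subset only: assumed Kruti Dev codes, to be verified.
SAMPLE_MAP = {
    "d": "क",
    "e": "म",
    "j": "र",
    "k": "ा",
    "ks": "ो",
}
# Assumed legacy code for the vowel sign i, stored before its consonant.
PRE_BASE_I = "f"
VOWEL_SIGN_I = "\u093f"


def convert(text, table=SAMPLE_MAP, pre_base=PRE_BASE_I):
    """Convert legacy-encoded text to Unicode using longest-match lookup."""
    keys = sorted(table, key=len, reverse=True)  # try longer codes first
    out = []
    pending_i = False
    pos = 0
    while pos < len(text):
        if text[pos] == pre_base:
            # Remember the vowel sign; emit it after the next unit.
            pending_i = True
            pos += 1
            continue
        for key in keys:
            if text.startswith(key, pos):
                out.append(table[key])
                pos += len(key)
                break
        else:
            out.append(text[pos])  # unmapped characters pass through
            pos += 1
        if pending_i:
            out.append(VOWEL_SIGN_I)
            pending_i = False
    if pending_i:
        out.append(pre_base)  # dangling code at end of input: keep as-is
    return "".join(out)
```

A real converter also has to handle conjuncts, half forms, and the reph, which need more reordering rules than this sketch shows.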

However, the file does not seem to use the same encoding throughout: when I tried the same conversion on later text (picked at random), I got garbage.

So the problem boils down to figuring out the various encoding options in (A) and then figuring out which font encoding has been used. Perhaps this can be found by trial and error. I tried a few: the first page seems to be kruti, and it does not seem to be chanakya.
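The trial and error could be partly automated by running each candidate converter on a chunk of text and scoring how much of the output lands in the Unicode Devanagari block (U+0900 to U+097F). This is only a rough sketch: the converters are assumed to be plain functions from string to string, and the names are hypothetical. A high score is necessary but not sufficient, because a wrong table can still produce Devanagari garbage, so the top candidates need to be checked by eye.

```python
def devanagari_ratio(text):
    """Fraction of non-space characters inside U+0900..U+097F."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for ch in chars if "\u0900" <= ch <= "\u097f")
    return hits / len(chars)


def best_converter(chunk, converters):
    """Score every candidate converter on one chunk of extracted text.

    converters maps a label (e.g. a font name) to a str -> str function.
    Returns the best-scoring label and the full score table.
    """
    scores = {name: devanagari_ratio(fn(chunk))
              for name, fn in converters.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

Running this per page rather than on the whole file would show where the font encoding changes.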

Another possibility is to call DV Printers (the printer's phone numbers are given on page 4), tell them you want to get something printed, and ask which Devanagari fonts they use. This is a fairly recent print (2015), so the fonts they use now are unlikely to be very different. That way the possible choices in step B can be narrowed down. Once the encoding is known, an appropriate converter can be found (or written).

Regards,

Sandeep


Bhasha IME

May 23, 2016, 1:28:45 PM
to sanskrit-p...@googlegroups.com
Hi,
Here's the Unicode-converted document, minus the images.

regards
Venkatesh


Vedic_heritage_Illustratred_dic_hindi.7z

विश्वासो वासुकिजः (Vishvas Vasuki)

May 23, 2016, 6:48:09 PM
to sanskrit-programmers
Thank you so much sandIpa and venkatesh!

venkatesh - how exactly did you ultimately do it?

Sandeep Nangia

Jun 1, 2016, 10:52:59 AM
to sanskrit-p...@googlegroups.com

Wondering if I missed a reply. This would be very useful to know.

Regards,

Sandeep

Bhasha IME

Jun 1, 2016, 11:01:52 AM
to sanskrit-p...@googlegroups.com
No, you didn't miss a reply.


I used the latest version of BhashaIME (not yet released, but coming soon).

It can analyze RTF text (which preserves font information) and selectively apply conversion. The free Foxit PDF reader extracts and copies text as RTF, preserving the font information, color, size, and style of the text. This can be fed to BhashaIME for conversion.
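The idea of applying conversion selectively by font can be sketched as follows. This is not BhashaIME's implementation, which is not described here. It assumes the RTF has already been parsed into (font name, text) runs by some other tool; the function name and the run format are hypothetical.

```python
def convert_runs(runs, converters):
    """Convert each text run with the converter registered for its font.

    runs: iterable of (font_name, text) pairs in document order.
    converters: dict mapping a legacy font name to a str -> str function.
    Runs in fonts without a registered converter (for example text that
    is already Unicode or plain Latin) are passed through unchanged.
    """
    out = []
    for font, text in runs:
        convert = converters.get(font)
        out.append(convert(text) if convert else text)
    return "".join(out)
```

Keeping the font name with each run is what lets a document that mixes several legacy fonts be converted correctly, which plain-text extraction cannot do.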

Foxit is by far the best of all the free readers, up to and including Acrobat Reader 11. I have no experience with later versions of Acrobat Reader, since I am on XP and cannot run them.

I shall be uploading BhashaIME in a few days.

regards

Sai Susarla

Jun 2, 2016, 3:43:05 AM
to sanskrit-p...@googlegroups.com
This might be a repetition in this list, but I urgently need a piece of software (Python preferred) that does the following:
1) generates all the vibhakti forms of a subanta word, given its praatipadikam
2) given a subanta word, returns its praatipadikam and linga, vachana, vibhakti
3) given a ti~nanta word, returns its lakaara, vachana, purusha
4) given a dhaatu, generates its various forms.

We need it for a Samskrita bharati E-learning program that auto-generates exercises for various lessons.

I am also building a RESTful API-based public service called IndicTools, that provides these functionalities via a programmatic API for everybody to use. I'll wrap up your software with this API. Everything will be open-source.

Please accrue puNya by giving me the above functionality for the society's benefit.
- Sai.



विश्वासो वासुकिजः (Vishvas Vasuki)

Jun 2, 2016, 11:17:59 AM
to sanskrit-programmers


A new thread should have been started for this message. I've done so - https://groups.google.com/forum/#!topic/sanskrit-programmers/1HU8PV6UT8Q - please follow up there rather than here.