Batch extract Devanagari text from AIR PDF files

Avinash

Jul 6, 2018, 4:46:00 PM

to samskrita

Salutations,

I know that the topic of extracting Devanagari text from PDF files keeps coming up in various forums. So far, the solutions I have seen involve some manual work (figuring out the font, copying and pasting the text into a suitable converter, etc.). Has there been any effort to create an automated, programmatic way to extract the text? If not, any thoughts on the best way to go about it would be welcome. I am specifically interested in the PDFs of the AIR Sanskrit News.
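One piece of the manual "figure out the font" step can be automated cheaply. The following is only a minimal sketch, assuming the text has already been pulled out of a PDF by some extractor (not shown here): it measures how much of that text lies in the Unicode Devanagari block (U+0900 to U+097F). A low share suggests the PDF uses a legacy font encoding and still needs a converter. The function names and the 0.5 threshold are illustrative assumptions, not part of any existing tool.

```python
import unicodedata

# Unicode Devanagari block boundaries.
DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F


def devanagari_ratio(text: str) -> float:
    """Share of letters and combining marks in `text` that are Devanagari."""
    relevant = [ch for ch in text if unicodedata.category(ch)[0] in ("L", "M")]
    if not relevant:
        return 0.0
    hits = sum(DEVANAGARI_START <= ord(ch) <= DEVANAGARI_END for ch in relevant)
    return hits / len(relevant)


def needs_font_conversion(text: str, threshold: float = 0.5) -> bool:
    """Guess whether extracted text is legacy-font output rather than Unicode.

    The threshold is an assumed value; tune it on real files.
    Empty text (for example from a scanned page) is flagged as well.
    """
    return devanagari_ratio(text) < threshold
```

In a batch run, files flagged by `needs_font_conversion` could be set aside for a font-specific converter while the rest are used as they are.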

For context: I was having an offline discussion with an acquaintance about the lack of a speech-to-text or text-to-speech program for Sanskrit (at least, I am unaware of any). We were discussing ideas for collecting a database for research and experimentation. The recent thread regarding AIR news sparked the thought that it could serve as a bootstrap database, since the audio and (mostly) the corresponding text are available. As Shree Pooja pointed out, these have been collected for the last 5-6 years in https://groups.google.com/forum/#!forum/samskrithanews. I have already written a Python script to download the attachments from this group and will make the script and data available soon. Unfortunately, the text of the news program is available on this forum only as a PDF. Hence this question. Alternatively, if you are aware of a source for the plain text of the news programs, please point that out as well.
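To make the bootstrap-database idea concrete, here is a rough sketch of how downloaded attachments might be paired into (audio, text) items. It assumes, purely for illustration, that an audio file and its PDF share the same file stem (for example `2018-07-06.mp3` and `2018-07-06.pdf`); the actual attachment names in the group may well differ, and the function name `pair_by_stem` is hypothetical.

```python
from pathlib import Path


def pair_by_stem(folder, audio_exts=(".mp3",), text_exts=(".pdf",)):
    """Pair audio and text files in `folder` that share a file stem.

    Assumption: matching items have identical stems. Returns the list of
    (audio_path, text_path) pairs and the sorted stems that have no partner.
    """
    audio, text = {}, {}
    for path in sorted(Path(folder).iterdir()):
        ext = path.suffix.lower()
        if ext in audio_exts:
            audio[path.stem] = path
        elif ext in text_exts:
            text[path.stem] = path
    paired = [(audio[s], text[s]) for s in sorted(audio.keys() & text.keys())]
    unpaired = sorted(audio.keys() ^ text.keys())
    return paired, unpaired
```

The list of unpaired stems is worth keeping, since the text is only mostly available and gaps should be visible rather than silently dropped.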

Gratefully yours,

Avinash

P.S. I am aware of a few other audio sources, such as https://archive.org/details/bAlamodinI-01, that have the corresponding text. The same question applies in many of these cases, as the text is largely available as PDF.

Taff Rivers

Jul 8, 2018, 9:01:27 AM

to samskrita

Avinash,

Those PDF files will have been generated from a text file in the first place.

And the AIR broadcasts will most probably be scripted broadcasts.

The whole idea behind Portable Document Format files is that the document itself contains all the fonts required for the language(s) it employs, so that it is accessible on all platforms.

Given that the language of interest is Sanskrit, that font will be a Devanagari one, easily obtainable by everyone.

In short, why not simply ask the supplier of those PDFs whether you can access the source texts?

Especially since the PDFs come in the user-friendly, unprotected security mode, as is the case with those AIR PDFs.

But much more flexible than a database is an algorithm with a fuzzy speller that scans the entire text, or batches of texts, digitally.

Any database then need only be a list of URLs of distributed texts that are already out there.

But again, such lists, searchable with fuzzy spelling, already exist: select वेदान्त, right-click on it, and select Search Google from the dropdown...

... and choose from 202,000-odd results.
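As an illustration of the fuzzy-speller idea, here is a small sketch using only Python's standard `difflib` module. It is an assumed stand-in, not a description of how a search engine matches spellings: it splits a text on whitespace and returns the words whose spelling is close to the query. The function name `fuzzy_find` and the 0.7 cutoff are illustrative choices.

```python
import difflib
import unicodedata


def fuzzy_find(query, text, cutoff=0.7, limit=5):
    """Return words in `text` spelled similarly to `query`, best match first.

    Both inputs are NFC-normalized so that equivalent Unicode sequences
    compare equal. `cutoff` is a similarity score between 0 and 1.
    """
    query = unicodedata.normalize("NFC", query)
    words = sorted(set(unicodedata.normalize("NFC", text).split()))
    return difflib.get_close_matches(query, words, n=limit, cutoff=cutoff)
```

With this cutoff, a query for वेदान्त would also pick up the anusvara spelling वेदांत, which is the kind of variant a fuzzy search is meant to catch. Run over a batch of extracted texts, the same function gives a crude searchable index without any database.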

Regards,

Taff_Rivers,

Research & Development, Information Technology, retd.
