Need help recognizing English texts with Sanskrit Roman diacritical marks

Srivas

Nov 26, 2013, 2:10:30 AM
to tesser...@googlegroups.com
Hi!
I have a bunch of PDF journal files and need to get the text out of them. They contain a lot of romanized Sanskrit diacritical marks, and that creates a difficulty. I tried FineReader and OmniPage, but they cannot be trained to recognize those symbols. I just need an OCR program that I can train to recognize any symbol required, which the above programs cannot do.

Where should I start? I feel like this program can do the job, but can you help me get started? I downloaded Tesseract and installed it (Windows). There are different GUIs available, and I think one would make it easier to work. Can you suggest a good one? I tried gImageReader, but it's too primitive and leaves a lot of cleanup work on the text afterwards.

I don't think a language pack of this kind is available, so how do I create one?

I am attaching one PDF and the fonts that were used to create it. Maybe someone would like to try it and let me know how to do it?

Thank you for any help!

Regards,
Srivas
37-6.pdf
TTF.zip

V S Rawat

Nov 26, 2013, 4:47:11 AM
to tesser...@googlegroups.com
Dear Sir Srivas ji,

Firstly, you should not have sent a 2.2 MB, 68-page PDF file and a 181 KB zip
to all the list members unasked. You could have uploaded them somewhere and
sent the link, so that only those who can contribute would download them.
It is a waste of time and bandwidth to get such huge messages.

Secondly, I couldn't really understand your issue. I looked at your PDF file.
It is pure English. You can open it in any PDF reader, copy the
entire text from there, and paste it into a text or Word file. So what else
exactly are you looking for? Please elaborate.

You don't even need to OCR it. It is already ASCII text.

Thanks.
--
Rawat

Shree Devi Kumar

Nov 26, 2013, 6:53:03 AM
to tesser...@googlegroups.com
For a GUI, you can try VietOCR:
http://sourceforge.net/projects/vietocr/files/vietocr/

For language data for Sanskrit transliteration, try:
http://sourceforge.net/projects/tesseracthindi/files/Tesseract-3-02-SanskritTransliteration/

Shree Devi Kumar
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


--
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to tesser...@googlegroups.com
To unsubscribe from this group, send email to
tesseract-oc...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
 
---
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Srivas

Nov 26, 2013, 10:46:18 PM
to tesser...@googlegroups.com
Hi Rawat!

I'm really sorry, I didn't know that this is a mailing-list type of forum ;-(

Second, if you look carefully, you will see that the text is not entirely English. In many places, words with Sanskrit transliteration marks are used. But as you said, the text can simply be copied and pasted, which didn't even come to my mind! So this part is actually working, and that is great. I am almost there. The remaining problem is of another type. The provided Tamalten font displays the marks, but I need to use another font for the final document. It contains the same diacritical marks but uses a different encoding. This might be a question for another person; I know the author of the fonts and will ask him. Thanks for the help!

By the way, if anyone needs Sanskrit transliteration fonts, here are the resources: http://www.krishna-das.com/ksyberspace/fonts/

Srivas

Nov 26, 2013, 10:50:57 PM
to tesser...@googlegroups.com
Thanks, I almost have my problem solved, but I also want to try this out. I'm quite sure I will need it, since I have some scanned Vedic texts that I would like to get recognized as well.

I'm encountering the following problem: after installing VietOCR and trying to open a PDF file, the following error comes up: The gsdll32.dll wasn't found in default DLL search path. Please install GPL Ghostscript and/or set the appropriate environment variable.

I did download and install Ghostscript, but the error remains. What should I do next?

V S Rawat

Nov 27, 2013, 4:41:47 AM
to tesser...@googlegroups.com

"words with sanskrit transliteration marks are used"

Could you please point out the exact pages where I should look? I will try
to OCR them and see the results.

Also,
http://www.omkarananda-ashram.org/Sanskrit/itranslator99.htm#downloads

The above page and several links from it also have a lot of
Sanskrit fonts. Some of them might be useful to you.

Thanks.
--
Rawat

Jaanus Henno

Nov 27, 2013, 4:47:05 AM
to tesser...@googlegroups.com
OK, you can try page 11. There is a glossary with lots of words with diacritics. Thanks.



V S Rawat

Nov 27, 2013, 5:00:35 AM
to tesser...@googlegroups.com
I am also not able to read a PDF file directly in VietOCR. The same
Ghostscript problem appears.

As a workaround, use a PDF reader (e.g. PDF-XChange Viewer) to
"export" all pages as TIFF image files and keep them in one folder. Then
use VietOCR's batch processing option: it will OCR all images in that
folder and create a corresponding txt file for each. You then
combine all those files into a single text file and bring that into MS Word.

That is the long route I have been using as an alternative.
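
The same long route can also be scripted. The following is only a sketch under stated assumptions, not VietOCR's batch mode: it calls the tesseract command-line program directly on each .tif file in a folder and then concatenates the resulting .txt files. The function names, the .tif extension and the default language "eng" are illustrative choices. Pages are combined in filename order, so zero-padded names such as page_001.tif are assumed.

```python
import subprocess
from pathlib import Path


def tesseract_command(image, lang="eng"):
    """Build the command line: tesseract <image> <outputbase> -l <lang>.

    Tesseract appends .txt to the output base, so page_001.tif
    produces page_001.txt next to the image.
    """
    image = Path(image)
    return ["tesseract", str(image), str(image.with_suffix("")), "-l", lang]


def ocr_folder(folder, lang="eng"):
    """Run tesseract on every .tif file in the folder (requires tesseract on PATH)."""
    for image in sorted(Path(folder).glob("*.tif")):
        subprocess.run(tesseract_command(image, lang), check=True)


def combine_text_files(folder, combined):
    """Concatenate all .txt files in the folder, in filename order, into one file.

    Returns the number of files that were combined.
    """
    folder, combined = Path(folder), Path(combined)
    # Collect the parts first so the combined file never includes itself.
    parts = sorted(p for p in folder.glob("*.txt") if p.resolve() != combined.resolve())
    with combined.open("w", encoding="utf-8") as out:
        for part in parts:
            out.write(part.read_text(encoding="utf-8"))
            out.write("\n")
    return len(parts)
```

A typical run would call ocr_folder on the folder of exported pages and then combine_text_files on the same folder; the combined file can be opened in MS Word as before.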

Thanks.
--
Rawat

V S Rawat

Nov 27, 2013, 5:15:04 AM
to tesser...@googlegroups.com
Those Ā á characters are defined in the Garamond font, but the character codes
used in this document are not the ones Garamond assigns to them.

So these codes must be mapped to those characters in some other font.

The document lists a dozen fonts, and it might be one of them. You need to
figure out which one by trial and error.
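
Before trying fonts one by one, it can help to see exactly which non-ASCII codes the extracted text contains. The sketch below is an illustration only (the function name is hypothetical): it lists every non-ASCII character in a string with its code point, Unicode name and frequency, which gives a starting inventory for comparing against a font's character map.

```python
import unicodedata
from collections import Counter


def non_ascii_inventory(text):
    """Return (character, code point, Unicode name, count) rows, most frequent first."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    return [
        (ch, "U+%04X" % ord(ch), unicodedata.name(ch, "UNKNOWN"), n)
        for ch, n in counts.most_common()
    ]


def print_inventory(text):
    """Print one line per distinct non-ASCII character."""
    for ch, code, name, n in non_ascii_inventory(text):
        print(ch, code, name, n)
```

Running print_inventory on the text copied from page 11 would show which stand-in characters occur and how often.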

Thanks.
--
Rawat

On 11/27/2013 3:17 PM, Jaanus Henno wrote:
> Ok, you can try page 11. There is glossary and lots of words with
> diacritics. Thanks.

Shree Devi Kumar

Nov 27, 2013, 8:20:27 AM
to tesser...@googlegroups.com
I think that, rather than trying OCR, you should extract the text and then run a conversion script to replace the letters that carry diacritical marks.

E.g., for the sample text from page 11, you would do the following substitutions using sed:

s/Å/Ā/g
s/å/ā/g
s/®/ṛ/g
s/ß/ṣ/g
s/∫/ṇ/g
s/î/ī/g
s/Ê/Ī/g
s/¸/Ś/g
s/Ω/ś/g
s/ü/ū/g

Also attaching sed script as a utf-8 text file.
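
For readers without sed, the same substitutions can be sketched in Python. This is an illustrative equivalent, not the attached script: the table below copies only the ten pairs listed above, and the names SUBSTITUTIONS and fix_diacritics are made up for the example. All replacements happen in a single pass over the text.

```python
import re

# Pairs copied from the sed list above: character as extracted -> intended letter.
# Extend the table as more stand-in characters turn up.
SUBSTITUTIONS = {
    "Å": "Ā",
    "å": "ā",
    "®": "ṛ",
    "ß": "ṣ",
    "∫": "ṇ",
    "î": "ī",
    "Ê": "Ī",
    "¸": "Ś",
    "Ω": "ś",
    "ü": "ū",
}

# One alternation pattern, so the text is scanned only once.
_PATTERN = re.compile("|".join(re.escape(key) for key in SUBSTITUTIONS))


def fix_diacritics(text):
    """Replace every mapped character in text with its intended letter."""
    return _PATTERN.sub(lambda match: SUBSTITUTIONS[match.group(0)], text)
```

Note that the table also rewrites genuine occurrences of characters such as ü or ß, so it is only suitable for text that came out of this particular font encoding.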

Shree Devi Kumar
roman.sed

Jaanus Henno

Nov 27, 2013, 8:38:52 AM
to tesser...@googlegroups.com
Thank you both for your help. This letter replacement is a good idea! Looks like this sed script will do the work. I will just have to see how to use sed... Tomorrow I will check it out.




Shree Devi Kumar

Nov 27, 2013, 9:11:01 AM
to tesser...@googlegroups.com
sed -f roman.sed inputfile.txt > outputfile.txt

You will have to add other substitutions to the file roman.sed - it only has the first few substitutions that I encountered.
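
Where sed itself is not installed, a file like roman.sed can still be reused. The sketch below is a rough stand-in, assuming the file contains only plain rules of the form s/X/Y/g with literal characters (no regular-expression syntax, no other sed commands); anything else is silently skipped. The function names are hypothetical.

```python
import re
from pathlib import Path

# Matches simple rules such as s/X/Y/g (the trailing g is optional).
_RULE = re.compile(r"^s/(.+?)/(.*?)/g?$")


def load_rules(sed_text):
    """Parse s/X/Y/g lines into a dict, skipping blank lines and # comments."""
    rules = {}
    for line in sed_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _RULE.match(line)
        if match:
            rules[match.group(1)] = match.group(2)
    return rules


def apply_rules(text, rules):
    """Apply all literal replacements in one pass, longest pattern first."""
    if not rules:
        return text
    keys = sorted(rules, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda m: rules[m.group(0)], text)


def convert_file(sed_path, in_path, out_path):
    """Rough equivalent of: sed -f roman.sed inputfile.txt > outputfile.txt"""
    rules = load_rules(Path(sed_path).read_text(encoding="utf-8"))
    text = Path(in_path).read_text(encoding="utf-8")
    Path(out_path).write_text(apply_rules(text, rules), encoding="utf-8")
```

Both files are read and written as UTF-8 here, which matches the encoding of the attached script; text copied from the PDF would need to be saved as UTF-8 as well.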

Shree Devi Kumar

V S Rawat

Nov 27, 2013, 10:59:43 AM
to tesser...@googlegroups.com
That is a very convenient solution, Shree Devi ji.

However, if sed or other "substitutors" are not available, or if one wants
to avoid using them, I think it can be done using Tesseract's built-in
post-processing method.

Use san.DangAmbigs.txt or hin.DangAmbigs.txt, whichever language you are
using.

Then put the replacements in as
Å=Ā
one per line.

Should it work equally well and automatically, without needing a manual step?

If so, then, Shree Devi ji, is there any major benefit to post-processing
in sed?

Please remind me where this DangAmbigs file is to be put.

Thanks.
--
Rawat

Nick White

Nov 27, 2013, 11:06:55 AM
to tesser...@googlegroups.com
On Wed, Nov 27, 2013 at 09:29:43PM +0530, V S Rawat wrote:
> However, if sed or other "substitutors" are not there, or if one
> wants to avoid using them, I think it can be done using built in
> post-processing method of tesseract.
>
> use san.DangAmbigs.txt or hin.DangAmbigs.txt whichever language you
> are using.
>
> then put them as
> Å=Ā
> one per line.
>
> Should it work equally well and automatically, without needing manual step?

Yes, that should work as well. DangAmbigs was the format for
Tesseract 2, current tesseract uses unicharambigs instead - see
http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#The_last_file_(unicharambigs).

So the file would be of the form:
v1
1 Å 1 Ā 1
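
If many such pairs are needed, the file can be generated rather than typed. The sketch below rests on assumptions: it follows the layout shown above (a v1 header, then source count, source characters, target count, target characters and a final flag), and it assumes the fields are tab-separated as described in the Tesseract 3 training documentation linked above. The function name is hypothetical, and each Python character is treated as one unichar, which may not hold for every trained character set.

```python
def unicharambigs_text(replacements, mandatory=True):
    """Build the contents of a unicharambigs file from a {wrong: right} dict.

    Each line has five tab-separated fields: number of source characters,
    the source characters separated by spaces, number of target characters,
    the target characters separated by spaces, and a flag
    (1 = always substitute, 0 = only a hint).
    """
    flag = "1" if mandatory else "0"
    lines = ["v1"]
    for wrong, right in replacements.items():
        fields = [str(len(wrong)), " ".join(wrong), str(len(right)), " ".join(right), flag]
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"
```

The returned text would be saved under the language-specific file name used for the training data.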

Nick

Shree Devi Kumar

Nov 27, 2013, 11:20:19 AM
to tesser...@googlegroups.com
Rawatji,

I was going by the assumption that the text can be easily extracted from his PDF by saving it as txt. In that case, just running the sed script will fix the letters with diacritics, which were mapped to other letters in the ASCII font.

OCR never gives a 100% correct result, so running OCR and post-processing its output may not be the best solution in this case.

You could try the Windows version of sed from
http://gnuwin32.sourceforge.net/packages/sed.htm

I only tested using one paragraph of text from page 11.

Shree

Shree Devi Kumar


V S Rawat

Nov 27, 2013, 11:21:22 AM
to tesser...@googlegroups.com
Wow. That was new information, that we should use unicharambigs instead
of DangAmbigs.

Thanks, Nick, for sharing it.

1. Where should this file be put?

2. Is the same file to be used for all languages? The previous method was
convenient, as each language had its own file name.

A file should have a recognized extension so that it opens
automatically in the relevant standard editor. It is bad practice to have a
file without an extension.

Thanks.
--
Rawat

Nick White

Nov 27, 2013, 11:48:14 AM
to tesser...@googlegroups.com
On Wed, Nov 27, 2013 at 09:51:22PM +0530, V S Rawat wrote:
> wow. That was new information that we should use unicharambigs
> instead of DangAmbigs.

Not that new really, it's been that way for years now :)

> 1. where should this file be put.

It should be called xxx.unicharambigs (where xxx is the language
code you're using), and then added to the training file using
combine_tessdata.

> 2. Is the same file to be used for all languages? previous method was
> convenient when each language has its own file name.

The file is specific to each training.

> File should have a recognized extension to help it getting opened
> automatically in standard relevant editor. it is bad method to have
> a file without an extension.

It's only a bad habit in the Windows world, so it isn't really a bad
habit at all ;)

Note though that this is just a quick hack really, which probably
won't work well for different fonts or pages. The correct way of
doing this would be to retrain with the extra characters with all
the diacritics you need, but obviously that would take more work.
That's why Shree Devi Kumar just recommended it as a quick sed
script; it may well work well for a few pages that you need to sort
out, but shouldn't be relied upon for anything more than that.

Nick

V S Rawat

Nov 27, 2013, 12:53:39 PM
to tesser...@googlegroups.com
Yes, Srivas ji's file is 100% text, not images, and it is 100%
extractable to a Word/text file by simple copy and paste. OCR is just not needed.

So it is good that sed will make the changes without any need for OCR.
Good thought.

I use vim on Windows 8, so I wouldn't downgrade to sed. He he, just kidding.
Vim has sed-style substitution built in. :-)

Thanks.
--
Rawat

Quan Nguyen

Nov 27, 2013, 1:25:58 PM
to tesser...@googlegroups.com
Download and install http://sourceforge.net/projects/ghostscript/files/GPL%20Ghostscript/9.10/gs910w32.exe. Then follow the steps for setting Path environment variable as described in http://vietocr.sourceforge.net/usage.html.

Srivas

Nov 27, 2013, 10:13:23 PM
to tesser...@googlegroups.com
I'm a little new to all this. How do you run sed under Windows 7? I read that it can also be run under Windows, but I cannot understand how to do that.

Srivas

Nov 27, 2013, 10:15:27 PM
to tesser...@googlegroups.com
That would be great; however, I still cannot import a PDF into VietOCR. Of course, there are other GUIs that can do the work, but this one looks nice. I have already written to the author of the program about it. As soon as this is solved, I will post it here as well.

Jaanus Henno

Nov 27, 2013, 10:22:41 PM
to tesser...@googlegroups.com
OK, somehow I missed the last part of the conversation. I will try out the Windows-based options you mentioned for sed.

Jaanus Henno

Nov 27, 2013, 10:32:57 PM
to tesser...@googlegroups.com
Nope, it doesn't work. I use Windows 7 64-bit. The program is installed into the Program Files (x86) folder. Even if I add it to the Path environment variable, it still gives the same error.



Jaanus Henno

Nov 27, 2013, 11:01:48 PM
to tesser...@googlegroups.com
How do you run sed on Vim?



V S Rawat

Nov 28, 2013, 4:22:51 AM
to tesser...@googlegroups.com
Sed-style substitution is available from vim's command line.

Type : in vim (when not in insert mode).

The cursor will move to the bottom line, where you can
type a sed-style substitute command (for example %s/å/ā/g) and press
Enter to run it on the text. You can also give a line range on which it
should operate, and add the c flag to confirm each replacement, a feature
that I guess might not be possible with a sed script.

I hope I am saying it correctly.

Thanks.
--
Rawat
> http://gnuwin32.sourceforge.net/packages/sed.htm
>
> i only tested using one para of text from page 11.
>
> Shree

V S Rawat

Nov 28, 2013, 4:24:48 AM
to tesser...@googlegroups.com
On 11/27/2013 3:17 PM, Jaanus Henno wrote:
> Ok, you can try page 11. There is glossary and lots of words with
> diacritics. Thanks.
>

Oh, so Jaanus Henno is the ID of Mr Srivas.

I was wondering why I was getting replies from two persons.

Just curious: why does it sometimes show the name as Srivas and at other times
as Jaanus Henno? Are you using two different interfaces to send
different mails?

Thanks.
--
Rawat

V S Rawat

Nov 28, 2013, 4:29:33 AM
to tesser...@googlegroups.com
On 11/27/2013 9:50 PM, Shree Devi Kumar wrote:
> Rawatji,
>
> I was going by the assumption that the text can be easily extracted from

It is good that we have found two methods for replacing these letters.

However, the fundamental solution is that there has to be a font in which
these same character codes already show the correct letters.

So, if anyone gets time to do some research or somehow figures out which
font it is, it will be very helpful for handling such text in future.
Then replacement would not be required.

To begin with, the font has to be one of the dozen listed in the PDF file's
properties under Fonts.

Thanks.
--
Rawat


Jaanus Henno

Nov 28, 2013, 5:08:24 AM
to tesser...@googlegroups.com
I configured my Gmail inbox to receive mail from all my different accounts. Srivas is my nickname among my friends in spiritual circles. I'm not Indian ;-)



Jaanus Henno

Nov 28, 2013, 5:28:43 AM
to tesser...@googlegroups.com
Yes, this font is called Tamalten. But the problem is that I need to use another font (Balaram, in the font list I sent). This is part of a project on Vedic scriptures; you can see the online version here: http://vedabase.com/en
The texts I need to get from those PDFs are for the offline version, which uses the Balaram font, and the two fonts are not compatible. So a find-and-replace method to get the proper symbols is fine, since there is not much material to get from those PDFs.

In a broader sense, there are people traveling to libraries throughout India to photograph old manuscripts in order to preserve and digitize them. For that purpose, a working OCR is much needed. I think I will contact one person, because if he actually needs help in this regard, it will definitely be worth trying to train Tesseract to properly recognize those images. But that is native Sanskrit, Bengali and other languages. And there are others who are looking for a way to recognize Sanskrit transliteration as well. What do you think, can it be done in Tesseract? Neither FineReader nor other commercial OCR programs can do that.



Shree Devi Kumar

Nov 28, 2013, 6:01:37 AM
to tesser...@googlegroups.com
You may want to look at software called SANSKRITOCR. The old version was free. There is also a new commercial version. Please see http://www.sanskritreader.de/

Shree Devi Kumar

Jaanus Henno

Nov 28, 2013, 7:42:24 AM
to tesser...@googlegroups.com
Ok, thank you!



Quan Nguyen

Nov 28, 2013, 11:07:25 AM
to tesser...@googlegroups.com
On Win7 64-bit, Ghostscript would be installed into C:\Program Files (x86)\gs\gs9.10. So when you bring up the Environment Variables dialog, append ";C:\Program Files (x86)\gs\gs9.10\bin" to the Path value. After saving, log out or restart Windows for the change to take effect. It should work.
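
To check whether the Path change actually took effect, one can look for gsdll32.dll in each Path entry. The sketch below is a diagnostic aid only, with a hypothetical function name. It inspects the PATH environment variable alone, whereas the Windows DLL search order also covers other locations, so an empty result does not prove the DLL cannot be loaded.

```python
import os
from pathlib import Path


def find_on_path(filename, path_value=None):
    """Return every PATH directory entry that contains the given file."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    hits = []
    for entry in path_value.split(os.pathsep):
        # Tolerate stray whitespace and quotes around an entry.
        entry = entry.strip().strip('"')
        if not entry:
            continue
        candidate = Path(entry) / filename
        if candidate.is_file():
            hits.append(candidate)
    return hits


def report(filename="gsdll32.dll"):
    """Print where the file was found, or that it was not found on PATH."""
    hits = find_on_path(filename)
    if hits:
        for hit in hits:
            print("found:", hit)
    else:
        print(filename, "was not found in any PATH entry")
```

Calling report() from a newly opened console after logging back in shows whether the Ghostscript bin folder is being picked up.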

V S Rawat

Nov 28, 2013, 1:13:02 PM
to tesser...@googlegroups.com
The Windows DOS mode still doesn't recognize spaces within folder/file names.

So, if you face a problem, try putting "C:\Program Files
(x86)\gs\gs9.10\bin" WITH QUOTES.

If that doesn't work, try putting ;C:\Progra~2\gs\gs9.10\bin WITHOUT QUOTES.

On a 64-bit system, one of the "Program Files" folders would be progra~1,
and I guess Program Files (x86) would be progra~2.

Try both. I am on pure 32-bit Windows 8, so I can't test it myself.

--
Rawat

> On Tuesday, November 26, 2013 6:53:03 PM UTC+7, shree wrote:
>
> For GUI
> you can try VietOCR -
> http://sourceforge.net/projects/vietocr/files/vietocr/
>
> For Language data for sanskrit transliteration
> Try
> http://sourceforge.net/projects/tesseracthindi/files/Tesseract-3-02-SanskritTransliteration/

Ravi Roshan

Jan 9, 2014, 1:57:17 PM
to tesser...@googlegroups.com
Please tell me where I could find this "hin.DangAmbigs.txt" file.
Thank you.

Quan Nguyen

Jan 10, 2014, 8:36:51 PM
to tesser...@googlegroups.com
aaa.DangAmbigs.txt is a user-defined file used by VietOCR for post-processing (post-OCR) corrections.