How to use the "latin sanskrit" language?

immarried

Jul 25, 2018, 8:56:14 AM

to tesseract-ocr

I just started using tesseract and want to do exactly what ShreeShrii did here to get 'kamakoti-san_latn_1.txt':

https://github.com/tesseract-ocr/langdata/pull/4

However, what value do I need to use for the "-l" option to do this? Or do I need to install an additional language?

I'm on macOS and installed tesseract using 'brew install tesseract'.

$ tesseract --version
tesseract 4.0.0-beta.3
 leptonica-1.76.0
  libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
 Found AVX2
 Found AVX
 Found SSE

I suspect that my tesseract setup is different, because, using the Latin option,

tesseract -l lat --oem 1 --psm 3

I get the following output text

drstva devam mahakalam kalikangam mahaprabhum |

bhargavah patito bhumau dandavatsurapujite ||

bhargava uvaca

kalyantakalagnisamanabhasam

caturbhujam kalikayopajustam |

kapalakhatvangavarabhayadhya-

karam mahakalamanantamide ||

namah paramarüpaya paramalasurupine |

niyatipraptadehaya tattvarupaya te namah ||

namah paramarüpaya paramarthaikarupine |

viyanmayasvarupaya bhairavaya namo.astute ||

OM namah parameésaya paratattvarthadaráine |

viyanmayadyadhisaya dhivicitraya $ambhave ||

triloke$aya güdhaya suksmayavyaktarupine |

parakasthadirupaya paraya $ambhave namah ||

OM namah kalikankaya kalatjananibhaya te |

jagatsamharakartre ca mahakalaya te namah ||

nama ugraya devaya bhimaya bhayadayine |

mahabhayavinasaya srstisamharakarine ||

namah paraparanandasvarupaya mahatmane |

paraprakasarüpaya praka$anam praka$sine ||

OM namo dhyanagamyaya yogihrtpadmavasine |

vedatantrarthagamyaya vedatantrarthadarsine ||

vedagamaparamar$aparamanandadayine |

tantravedantavedyaya $ambhave vibhave namah ||

dhiyam pracodakam yattu paramam jyotiruttamam |

tatprerakaya devaya paramajyotise namah ||

gunaérayaya devaya nirgunaya kapardine |

atisthulaya devaya hyatisuksmaya te namah ||

trigunaya tryadhisaya saktitritayasaline |

namastrijyotise tubhyam tryaksaya ca trimürtaye ||

which is not the same as the text in kamakoti-Latin.txt that ShreeShrii obtained.

Help much appreciated, thanks.
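
One quick way to confirm that the output above came from a model that does not know IAST is to list the characters in it that fall outside the IAST alphabet. Below is a minimal Python sketch; the function name `non_iast_chars` and the allowed punctuation set are illustrative choices, not part of tesseract.

```python
import unicodedata

# Lowercase letters used by IAST (base Latin letters plus the diacritic forms).
IAST_LETTERS = set(unicodedata.normalize(
    "NFC", "aāiīuūṛṝḷḹeokgṅcjñṭḍṇtdnpbmyrlvśṣshṃḥ"))
# Punctuation assumed to be legitimate here: space, danda "|", hyphen,
# period, apostrophe and newline.
ALLOWED_PUNCT = set(" |-.'\n")


def non_iast_chars(text):
    """Return the sorted characters that are neither IAST letters nor allowed punctuation."""
    normalized = unicodedata.normalize("NFC", text.lower())
    return sorted(set(normalized) - IAST_LETTERS - ALLOWED_PUNCT)


# One line copied from the OCR output above.
line = "triloke$aya güdhaya suksmayavyaktarupine |"
print(non_iast_chars(line))  # ['$', 'ü']
```

Characters such as `$` and `ü` are not IAST letters, so seeing them suggests the model is guessing among Latin-script glyphs it was trained on rather than recognising ś or ū.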

John Muccigrosso

Jul 26, 2018, 2:51:57 PM

to tesseract-ocr

You're telling tesseract that your text is in Latin. You need the traineddata for san-lat.

Shree Devi Kumar

Jul 26, 2018, 10:57:58 PM

to tesser...@googlegroups.com

There is no official traineddata for san_latn or IAST. I have created some experimental versions, but the output is not fully accurate.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/d2fc7942-16a2-48f0-9651-920616179d54%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Shree Devi Kumar

Jul 27, 2018, 12:29:09 AM

to tesser...@googlegroups.com

You can try IAST ones from https://github.com/Shreeshrii/tessdata_shreetest?files=1

Frank

May 15, 2020, 5:39:07 AM

to tesseract-ocr

Hi, I've just installed tesseract to OCR some old epigraphy documents. I used Google Colab as well as a Mac install. Everything works, except that I am unable to get the text with IAST: characters are substituted (ā becomes i, etc.). I tried setting the language to lat, but tesseract doesn't find a Latin language package, and installing the Latin script data didn't help. I've searched through all of Shree's work on GitHub but can't figure this out. I have three objectives:

1. OCR English pages and search through them

2. It would be nice to convert the Sanskrit into IAST and search through it

3. OCR Kannada inscriptions and keep them in OCR'ed format (this is optional, a "good to have")

Writing the search code doesn't seem tough; the IAST recognition/transcription is the challenge. Accuracy is not very important, as I have to search through volumes of inscriptions for specific keywords in order to recategorize a lot of miscategorised inscriptions on my research topic. The sheer volume makes the Google OCR solution suggested by Shree elsewhere impracticable.

I'm new to Python and tesseract, though I have programmed in the past.

Any help is appreciated.
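
Since the goal described here is keyword search where accuracy is not critical, one option is to fold diacritics away on both the OCR text and the search terms, so a keyword still matches when the OCR drops a macron or underdot. The following is a minimal Python sketch under that assumption; `fold_diacritics`, `find_keywords` and the sample pages are hypothetical names and data for illustration. It does not help when a letter is replaced by a different base letter (for example ā read as i).

```python
import unicodedata


def fold_diacritics(text):
    """Lowercase the text and strip combining marks (e.g. 'ā' becomes 'a')."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def find_keywords(pages, keywords):
    """Map each keyword to the page numbers whose folded text contains it."""
    folded_keywords = {kw: fold_diacritics(kw) for kw in keywords}
    hits = {kw: [] for kw in keywords}
    for page_no, text in pages.items():
        folded_text = fold_diacritics(text)
        for kw, folded_kw in folded_keywords.items():
            if folded_kw in folded_text:
                hits[kw].append(page_no)
    return hits


# Hypothetical pages: page 1 kept its diacritics, page 2 lost them in OCR.
pages = {1: "namaḥ paramarūpāya", 2: "mahakalaya te namah"}
print(find_keywords(pages, ["mahākāla", "namaḥ"]))
```

With the sample data, "mahākāla" is found on page 2 even though the OCR text there has no diacritics, and "namaḥ" is found on both pages.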

shree

Feb 24, 2021, 12:04:23 AM

to tesseract-ocr

Please try the models from https://github.com/Shreeshrii/tesstrain-Sanskrit-IAST

Jajwalya Karajgikar

Sep 15, 2021, 3:09:04 AM

to tesseract-ocr

Hello Frank, I am wondering if you have worked on "3. OCR Kannada inscriptions and keep them in OCR'ed format". I am very interested in multilingual OCR for Kannada inscriptions. You mention epigraphy documents; might they be Epigraphia Carnatica? If so, I would be grateful for any knowledge you can share.

Thank you,

Jajwalya

Greg Jay

Nov 21, 2022, 1:13:44 AM

to tesseract-ocr

I have installed Tesseract 5.2.0 on a MacBook Pro (M1, Apple Silicon) running macOS 12.6.1 Monterey, using Homebrew.

I have downloaded IAST.traineddata.

Moving this file into /opt/homebrew/cellar/tesseract/5.2.0/share/tessdata or /opt/homebrew/opt/tesseract-lang/share/tessdata doesn't seem to work.

I get error messages.

How do I load it into the program and use it for OCRing IAST diacritics?

Also, are there any traineddata files for ISO 15919 diacritics, or for the Indian Grantha script?
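
One way to sidestep the question of which Homebrew directory tesseract reads is to point it at the folder holding the downloaded model with `--tessdata-dir` and pass the file name without `.traineddata` to `-l`. Below is a minimal Python sketch that builds such a command; the helper `tesseract_command`, the `models/` folder and `page.png` are hypothetical, and since the error messages are not shown above, this may not address the actual cause.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_command(image, output_base, traineddata):
    """Build a tesseract command that loads one specific .traineddata file."""
    model = Path(traineddata)
    return [
        "tesseract",
        str(image),
        str(output_base),
        "--tessdata-dir", str(model.parent),  # folder that contains the model
        "-l", model.stem,  # language name is the file name without .traineddata
    ]


# Hypothetical paths: adjust to where IAST.traineddata and the scan really are.
cmd = tesseract_command("page.png", "page", "models/IAST.traineddata")
print(" ".join(cmd))

# Only run tesseract if it is installed and the input image exists.
if shutil.which("tesseract") and Path("page.png").exists():
    subprocess.run(cmd, check=True)
```

Running `tesseract --list-langs --tessdata-dir models` should show whether tesseract can see the model in that folder.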