Tesseract for Tibetan


Yizhen Hai

Nov 25, 2015, 2:11:52 AM
to tesseract-ocr
I am working on a volunteer project to digitize the Sutra and all related materials, most of them in Tibetan. It would save a lot of time if I could use some OCR technology in this process. However, there is hardly any software available for Tibetan.
Therefore, I wonder how I can get help using Tesseract for Tibetan. (I am new to both OCR and Tesseract, and the only programming language I know is R.) I have no idea how to get started. Do I need to train Tesseract for a new language, Tibetan? And what if the image contains both Chinese and Tibetan? Please give me some hints.
Thanks a lot.


Sriranga(83yrsold)

Nov 25, 2015, 2:37:42 AM
to tesser...@googlegroups.com


Nick White

Nov 25, 2015, 5:12:06 AM
to tesser...@googlegroups.com
Hi Yizhen,

On Tue, Nov 24, 2015 at 07:08:24PM -0800, Yizhen Hai wrote:
> I am working on a volunteer project to digitize the Sutra and all related
> materials, most of them in Tibetan.

Sounds like a great project :)

> Therefore, I wonder how I can get help to use Tesseract for Tibetan. (I am new
> to both OCR and Tesseract and the only programming language I know is R.) I
> have no idea how to get started, training Tesseract for a new language?

Are you sure Tesseract doesn't already support the Tibetan language
you need? I know almost nothing about Tibetan, but I see in the
langdata[0] repository (which is used to build the official training
files) a Tibetan.unicharset file, which implies it probably does
have support. Look in the tessdata repository[1] for the ISO 639
code of the language(s) you're interested in.

I quickly compared the ISO 639 codes from this Wikipedia page[2]
with the tessdata, and bod (Lhasa Tibetan) is the only one I see
available there. But maybe it's the language you want anyway?
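One way to check whether a local installation already has the Tibetan data is to ask Tesseract for its installed languages and look for `bod`. The following is a minimal sketch, not a definitive tool: it assumes a `tesseract` executable on the PATH that supports `--list-langs`, that the output is a heading line ending in a colon followed by one language code per line, and the helper names `parse_list_langs` and `installed_languages` are made up for this example.

```python
import shutil
import subprocess


def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`.

    Assumes one code per line, preceded by a heading line that ends with ':'.
    """
    codes = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the heading line.
        if not line or line.endswith(":"):
            continue
        codes.append(line)
    return codes


def installed_languages():
    """Return the installed language codes, or [] if tesseract is not found."""
    if shutil.which("tesseract") is None:
        return []
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
    )
    # Some releases print the list to stderr, so read both streams.
    return parse_list_langs(result.stdout + "\n" + result.stderr)
```

Calling `"bod" in installed_languages()` then tells you whether the Tibetan data is available locally. If it is not, the `bod.traineddata` file from the tessdata repository[1] has to be placed in the local tessdata directory first.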

> And what if the image contains both Chinese and Tibetan? Please
> give me some hints.

Tesseract can be told to expect multiple languages in an image,
using a plus in the language argument (e.g. '-l eng+spa').
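For the Tibetan plus Chinese case, the same argument could look like `-l bod+chi_sim`. Here is a minimal sketch of building and running such a command from Python. It assumes `tesseract` is on the PATH and that both `bod` and `chi_sim` (Simplified Chinese; `chi_tra` would be Traditional) traineddata files are installed. The helper names and the file name `page.png` are hypothetical.

```python
import subprocess


def build_tesseract_command(image_path, output_base, languages):
    """Build a tesseract command line that expects several languages.

    Language codes are joined with '+', as in '-l eng+spa'.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


def run_ocr(image_path, output_base, languages):
    """Run tesseract; the text is written to '<output_base>.txt'."""
    command = build_tesseract_command(image_path, output_base, languages)
    subprocess.run(command, check=True)
```

For example, `run_ocr("page.png", "page", ["bod", "chi_sim"])` would run `tesseract page.png page -l bod+chi_sim` and write the recognized text to `page.txt`.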

Hope that's helpful.

Nick

0. https://github.com/tesseract-ocr/langdata
1. https://github.com/tesseract-ocr/tessdata
2. https://en.wikipedia.org/wiki/Central_Tibetan_language

Yizhen Hai

Nov 25, 2015, 7:17:17 AM
to tesseract-ocr
Hi Nick,

Thanks a lot! I was not sure whether Tesseract already supported the Tibetan language. That is why I asked. :)
I will get started with your and Sriranga's suggestions and see how far I can get.

Yizhen


On Wednesday, November 25, 2015 at 6:12:06 PM UTC+8, Nick White wrote:

Zach

Jan 15, 2016, 3:03:19 PM
to tesseract-ocr
I am the developer of the Namsel OCR project (https://www.namsel.com/) and can speak to a few different Tibetan OCR implementations. First, you may want to look at tbrc.org and particularly their e-text section. We've OCR'd the entire Tibetan Tengyur and Kangyur as well as hundreds of thousands of additional pages of Tibetan literature and made it available for search there.

As already mentioned in this thread, there is also the Yakpo OCR project (http://www.dharmabook.ru/ocr/). The Google Drive/Google Docs and Google Books projects have recently added support for Tibetan OCR, although from what I understand it is still a work in progress. As far as I can tell, the Google Docs OCR service isn't presently meant for building large collections of OCR text, but it can handle documents of a few pages.

Otherwise, you can try searching this email list for previous discussions on training Tesseract. For example, Tenzin Dendup has spent time attempting to train Tesseract on Tibetan/Dzongkha: https://groups.google.com/d/msg/tesseract-ocr/ONkAD2kuxUQ/EQsepM67D94J