Hi Yizhen,
On Tue, Nov 24, 2015 at 07:08:24PM -0800, Yizhen Hai wrote:
> I am working on a volunteer project to digitize the Sutra and all related
> materials, most of them in Tibetan.
Sounds like a great project :)
> Therefore, I wonder how I can get help to use Tesseract for Tibetan. (I am new
> on both OCR and Tesseract and the only programming language I know is R.) I
> have no idea how to get started, training Tesseract for a new language?
Are you sure Tesseract doesn't already support the Tibetan language
you need? I know almost nothing about Tibetan, but I see in the
langdata[0] repository (which is used to build the official training
files) a Tibetan.unicharset file, which implies it probably does
have support. Look in the tessdata repository[1] for the ISO 639
code of the language(s) you're interested in.
I quickly compared the ISO 639 codes from this Wikipedia page[2]
with the tessdata and bod (Lhasa Tibetan) is the only one there that
I see available. But maybe it's the language you want anyway?
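If you already have Tesseract installed, you can also ask it which
languages it has data for. Here is a rough Python sketch, assuming the
tesseract binary is on your PATH and supports the --list-langs option.
The function names are my own, and the parsing assumes the header line
ends with a colon, which may differ between versions:

```python
import shutil
import subprocess


def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header line, which is assumed to end
        # with a colon (e.g. 'List of available languages (3):').
        if not line or line.endswith(":"):
            continue
        langs.append(line)
    return langs


def installed_languages():
    """Return the installed language codes, or None if tesseract is not found."""
    if shutil.which("tesseract") is None:
        return None
    # Some versions print the list on stderr, so merge it into stdout.
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return parse_list_langs(result.stdout)


if __name__ == "__main__":
    langs = installed_languages()
    if langs is None:
        print("tesseract not found on PATH")
    else:
        print("bod installed:", "bod" in langs)
```

If bod is missing, copying bod.traineddata from the tessdata
repository[1] into your local tessdata directory should be enough.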
> And what if the image contains both Chinese and Tibetan? Please
> give me some hints.
Tesseract can be told to expect multiple languages in an image by
joining the language codes with a plus sign in the language argument
(e.g. '-l eng+spa').
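For Tibetan plus Chinese that would presumably be something like
'-l bod+chi_sim' (chi_sim is Simplified Chinese, chi_tra is
Traditional), provided both traineddata files are installed. A minimal
Python sketch of building and running such a command follows. The
function names and file names are made up for illustration, and I have
not tested this on Tibetan text:

```python
import subprocess


def build_command(image_path, output_base, languages):
    """Build a tesseract command line for one or more language codes."""
    if not languages:
        raise ValueError("at least one language code is required")
    # Tesseract takes multiple languages joined by '+', e.g. 'bod+chi_sim'.
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


def run_ocr(image_path, output_base, languages):
    """Run tesseract; the recognized text is written to <output_base>.txt."""
    subprocess.run(build_command(image_path, output_base, languages), check=True)
```

Calling run_ocr("page.png", "page", ["bod", "chi_sim"]) would then
write the recognized text to page.txt.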
Hope that's helpful.
Nick
[0] https://github.com/tesseract-ocr/langdata
[1] https://github.com/tesseract-ocr/tessdata
[2] https://en.wikipedia.org/wiki/Central_Tibetan_language