Ocropus adaptation to the Tibetan Language

Czenrezig

Oct 14, 2009, 11:34:47 AM
to ocropus
Hello

I am writing to this list to ask a few questions about adapting
OCRopus to a new language. As the system does not support the Tibetan
script yet, I would like to commit myself to developing an OCRopus
extension for it.

- Is there any guideline describing adaptation of the system to a
new kind of script?
- I found in the "Contributed" section that another person is also
working on such an extension. However, I was not able to get in
contact with Mr./Mrs. Ryan.

I should also explain why I am interested in the Tibetan language.
I am associated with a group of Tibetan translators. My long-term goal
is to OCR the Buddhist Kanjur and Tanjur into a Unicode representation
and then create a web-based system supporting translation of the
scanned verses.

The initial plan is to adapt the system to one particular print, as
the shapes of Tibetan letters vary greatly among the wooden blocks
used for traditional printing. The second stage will be to produce an
extension that makes it easy to re-adapt the system to other writing
styles.

Wojtek

Tom Breuel

Oct 19, 2009, 6:08:24 PM
to ocropus
> - Is there any guideline describing adaptation of the system to a
> new kind of script?

There is some information on training OCRopus on the Wiki on
ocropus.org; there will be more after the beta release.

We will be starting work on multilingual support (in particular Indic
scripts) early next year. Right now, we're still working hard on the
beta release.

Tom

Czenrezig

Oct 31, 2009, 4:17:07 PM
to ocropus
Hello

First of all, thank you for this response.
I discussed the topic further with the people I am associated with,
and the possibilities of the system make us seriously consider joining
the project with a number of people and extensive resources. As I
mentioned before, we are interested in the Tibetan script (and
possibly Sanskrit). Our purpose is to produce electronic
representations of historical Tibetan volumes for further
processing.

I understand that the team is busy with the beta version, but before
we start studying the system and join the effort, we need to clarify
some important issues.

1. What precisely is sponsored by Google?
2. I understand that the whole distribution is licensed under the
Apache License 2.0 (except for PNG). The copyright statement says that
the base system is copyrighted by "Deutsches Forschungszentrum fuer
Kuenstliche Intelligenz", except for the content of the "ext"
subdirectory. I have not studied how the system is structured yet,
so I need to ask whether the language adaptation would be completely
contained in the ext directory. If we modify files outside of the ext
directory, is it possible to negotiate licensing so that our
organization is included in the copyright statement?
3. Are you interested in cooperating at all? :)

Kind Regards,
Wojciech Dobrowolski