Re: GSoC idea proofing [Better CAT tool]


Forest Rui Jiang

Mar 18, 2021, 1:18:25 PM
to Drupchen Dorje, tibetan-initi...@googlegroups.com, Garima Singh
+tibetan-initi...@googlegroups.com for archiving such nice questions and explanations by Drupchen.

Forest Rui Jiang | Software Engineer | for...@google.com | +1 650-862-0630


On Thu, Mar 18, 2021 at 2:57 AM Drupchen Dorje <drup...@esukhia.org> wrote:
Dear Garima,

Thank you for your interest in CAT and Tibetan. I'll answer your questions below.

1. The short answer is that this is a good idea, but not possible in the context of Tibetan.
The long answer is that the problem is not really about the Tibetan words themselves; it is about parsing a text into words. For English, the general idea is that spaces delimit words. There are some fancy things that can be done to improve on that oversimplified algorithm, but nothing very complicated.

In Tibetan, if I have a sentence as simple as "The cat ate the mouse", here is what we get: "ཞི་མིས་ཙི་ཙི་ཟས་སོང་།".
The first thing you notice is that there are no spaces at all, so any strategy that starts from spaces and then refines that first segmentation is not applicable. You will have observed, though, that Tibetans invented the dot (the tsheg) to separate syllables. That is an improvement over ancient Sanskrit, which delimited neither syllables nor words, yet finding word boundaries remains a challenge.
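To make the contrast concrete, here is a minimal sketch (the function name `split_syllables` is only illustrative). It shows that whitespace splitting returns the whole Tibetan sentence as a single token, while splitting on the tsheg (U+0F0B) gives syllables, not words.

```python
TSHEG = "\u0f0b"  # the dot that separates syllables
SHAD = "\u0f0d"   # the stroke that closes a clause or sentence

def split_syllables(text):
    """Split Tibetan text into syllables on the tsheg, ignoring shad marks."""
    cleaned = text.replace(SHAD, "")
    return [s for s in cleaned.split(TSHEG) if s.strip()]

english = "The cat ate the mouse"
tibetan = "ཞི་མིས་ཙི་ཙི་ཟས་སོང་།"

print(english.split())           # whitespace is enough to get English words
print(tibetan.split())           # the whole Tibetan sentence comes back as one token
print(split_syllables(tibetan))  # syllables only; word boundaries are still unknown
```

The last call yields six syllables for a sentence that has four words plus a case marker, which is exactly the gap a word parser has to close.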

Let me now gloss the sentence above:
"ཞི་མི ས་ ཙི་ཙི་ ཟས་ སོང་ །"
cat SUBJ mouse ate PAST
As you see, one of the tricks of the Tibetan language is to merge the subject case marker into the last syllable of a word: ཞི་མི་གིས་ becomes ཞི་མིས་. So when I split into words, I can't just take a list of words and check whether I get matches or not; I need to reconstruct the lemma from the inflected form. This is similar to sandhi as found in Sanskrit.
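A rough sketch of what "reconstructing the lemma" means in practice. Everything here is a toy assumption: the four-entry `LEXICON`, the `match` and `gloss` helpers, and the rule that a trailing ས may be a merged case marker. It is not how botok works internally. The lookup first tries the syllables as they are and only then tries stripping the marker, so that a word like ཟས་, which ends in ས on its own, is left alone.

```python
TSHEG = "\u0f0b"     # syllable delimiter
SHAD = "\u0f0d"      # clause/sentence end mark
AGENTIVE = "\u0f66"  # ས, the merged form of the marker in this toy example

# Hypothetical toy lexicon: tuple of syllables -> gloss.
LEXICON = {
    ("ཞི", "མི"): "cat",
    ("ཙི", "ཙི"): "mouse",
    ("ཟས",): "ate",
    ("སོང",): "PAST",
}
MAX_WORD_LEN = max(len(key) for key in LEXICON)
SENTENCE = "ཞི་མིས་ཙི་ཙི་ཟས་སོང་།"

def match(chunk):
    """Return [(form, gloss), ...] for a tuple of syllables, or None."""
    if chunk in LEXICON:  # plain dictionary hit
        return [(TSHEG.join(chunk), LEXICON[chunk])]
    last = chunk[-1]
    if len(last) > 1 and last.endswith(AGENTIVE):
        # Strip the merged marker and retry with the reconstructed lemma.
        stem = chunk[:-1] + (last[:-1],)
        if stem in LEXICON:
            return [(TSHEG.join(stem), LEXICON[stem]), (AGENTIVE, "SUBJ")]
    return None

def gloss(text):
    """Greedy longest-match segmentation with marker stripping."""
    syllables = [s for s in text.replace(SHAD, "").split(TSHEG) if s]
    result, i = [], 0
    while i < len(syllables):
        for length in range(min(MAX_WORD_LEN, len(syllables) - i), 0, -1):
            found = match(tuple(syllables[i:i + length]))
            if found:
                result.extend(found)
                i += length
                break
        else:
            result.append((syllables[i], "?"))  # unknown syllable
            i += 1
    return result

print(gloss(SENTENCE))
```

With this lexicon the glosses come out in the same order as in the gloss above: cat, SUBJ, mouse, ate, PAST.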

What all this means for us is that unless you build a rather complex word parser, you can't find word boundaries. Sanskrit has exactly the same problem and, to this day, as you will see here, parsing a single verse of Sanskrit yields thousands of possible parses. The link shows all the combinations that can be obtained by making different choices. The developer behind this Sanskrit parser chose never to make a choice, but to present the human reader with all the possibilities and let them choose.

On the other hand, my strategy in botok was to use basic heuristics to choose the most probable parsing. So it gives a reading that is not perfect, but something plausible nonetheless. That can be used as a basis for improving the support of Tibetan by CAT tools.
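The two strategies can be put side by side in a small sketch. The lexicon uses placeholder syllables ("a", "b", "c"), and the "fewest words wins" rule is only an assumed stand-in for a heuristic; botok's actual heuristics are not reproduced here.

```python
from functools import lru_cache

# Hypothetical lexicon over placeholder syllables.
LEXICON = {("a",), ("b",), ("c",), ("a", "b"), ("b", "c")}

def all_parses(syllables):
    """Enumerate every way to cut the syllables into lexicon words."""
    syllables = tuple(syllables)

    @lru_cache(maxsize=None)
    def parses_from(i):
        if i == len(syllables):
            return [[]]  # one way to parse nothing
        found = []
        for j in range(i + 1, len(syllables) + 1):
            word = syllables[i:j]
            if word in LEXICON:
                found.extend([word] + rest for rest in parses_from(j))
        return found

    return parses_from(0)

def best_parse(syllables):
    """Pick one parse with a simple heuristic: the fewest words."""
    parses = all_parses(syllables)
    return min(parses, key=len) if parses else None

print(all_parses(["a", "b", "c"]))  # everything a human reader could choose from
print(best_parse(["a", "b", "c"]))  # one plausible reading
```

Even three syllables give three parses here; the count grows quickly with sentence length, which is why a tool has to either show every option or commit to one.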

In short, supporting Tibetan in CAT tools such as OmegaT means finding a way to have something like botok pre-process the input text.
Making something that integrates with OmegaT, ideally in their own code base, would be best. You will see that OmegaT uses regexes, which can be modified, to parse texts into sentences and then into words. If we could have something within OmegaT that let us pre-process Tibetan text with a parser like botok, it would be marvelous.
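One possible shape for such a pre-processing step, as a sketch only: put a plain space between the words a parser returns, so that space-based rules have something to work with. `preprocess` and `fake_segment` are hypothetical names, the segmenter is a hard-coded stand-in for a real parser such as botok, and whether OmegaT would handle the spaced output well is an assumption, not something tested.

```python
from typing import Callable, List

SENTENCE = "ཞི་མིས་ཙི་ཙི་ཟས་སོང་།"

def preprocess(text: str, segment: Callable[[str], List[str]]) -> str:
    """Return the text with a plain space between the words found by `segment`."""
    return " ".join(segment(text))

def fake_segment(text: str) -> List[str]:
    """Stand-in for a real word parser; returns a fixed segmentation."""
    return ["ཞི་མིས་", "ཙི་ཙི་", "ཟས་", "སོང་།"]

spaced = preprocess(SENTENCE, fake_segment)
print(spaced)
print(len(spaced.split()))  # whitespace splitting now finds separate tokens
```

Passing the segmenter in as a parameter keeps the step independent of any particular parser.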

2.
About the idea of having multiple sources for a translation, I think the best approach is to keep one main source, but have a menu where the variants can be consulted, with something like checkboxes to indicate that, for a specific word or sentence, a particular version is followed.
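As a sketch of the data such a menu could sit on top of (the `Segment` class and its field names are hypothetical, not part of any existing tool): each segment keeps the main reading, the variant readings by witness, and a note of which one is followed.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Segment:
    """One word or sentence of the source, with its variant readings."""
    main: str                                                # reading of the main source
    variants: Dict[str, str] = field(default_factory=dict)   # witness name -> reading
    followed: Optional[str] = None                           # None means: follow the main source

    def text(self) -> str:
        """Return the reading the translator chose to follow."""
        if self.followed is None:
            return self.main
        return self.variants[self.followed]

segment = Segment(main="reading A", variants={"witness-2": "reading B"})
print(segment.text())           # main source by default
segment.followed = "witness-2"  # the "checkbox" for this segment
print(segment.text())
```

Storing the choice per segment keeps the main source intact while recording where the translation departs from it.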

No question is dumb. It only shows that something in what is written in the ideas is not clear, so thanks for asking!

As for the BDRC project, the person to contact is Élie, whom you can reach directly in Slack. (I believe you know how to get there; otherwise, please tell me.)

I hope this answers your questions. I'll also post this answer about the CAT tools in the Slack channel so others can benefit from it as well.

On Sat, 13 Mar 2021 at 18:51, Garima Singh <garimasi...@gmail.com> wrote:
Hello,
My name is Garima Singh. I am a final-year Information Technology student from India, and this specific topic of bettering a CAT tool has greatly interested me. I do have a few doubts which, if cleared, would greatly help me in designing an idea proposal.
1. The one problem I understand is the lack of support for Tibetan words, so would one of the primary tasks be accumulating source and target words (both Tibetan and non-Latin) and pushing them into their database?
2. Since the idea describes how multiple texts are referred to instead of just one source text, does that mean the CAT tool GUI should display multiple versions in the same window frame (for easy lookup) to get one single target translation? Or do we translate them all separately and see which is the better fit?

The BDRC database idea caught my interest as well, but I haven't researched it much. I would like to know, though: why does the site use eXist-db instead of MongoDB, or rather XML instead of JSON? I've always heard that JSON is much more advantageous, so I would love to know the reasoning behind this.

Thank you for your time and I look forward to hearing from you. Pardon me if I have asked something too obvious; for me, it's better to be a fool for 5 minutes rather than 5 weeks.