1) Using Recurrent Neural Networks for joint compound splitting and Sandhi
resolution in Sanskrit
Oliver Hellwig
This paper uses a "shallow" approach based on Recurrent Neural Networks (RNNs). A corpus of sentences from SanskritTagger, developed by the author, is used as the training/test set. An RNN is trained on gold sandhi/samasa splits from the corpus (a sequence of input phonemes paired with a sequence of output phonemes), and learns to generate a "split" sequence of phonemes from an "unsplit" input. No additional tag information seems to be used.
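As an illustration of how a bidirectional RNN turns per-position context into a split decision, here is a minimal numpy sketch. Everything here is invented for illustration - untrained random weights, an arbitrary sample string, and a made-up per-position "split here" output layer; only the data flow matches the description above.

```python
import numpy as np

# Toy setup: each phoneme is a one-hot vector; a tiny bidirectional RNN
# produces, per position, a probability of "split after this phoneme".
# Weights are random -- this illustrates the data flow, not a trained model.
rng = np.random.default_rng(0)
phonemes = "tacchrutvaa"          # unsplit surface string (illustrative)
vocab = sorted(set(phonemes))
idx = {c: i for i, c in enumerate(vocab)}
V, H = len(vocab), 8              # vocab size, hidden size

Wx_f, Wh_f = rng.normal(size=(H, V)), rng.normal(size=(H, H))
Wx_b, Wh_b = rng.normal(size=(H, V)), rng.normal(size=(H, H))
Wo = rng.normal(size=(2 * H,))

def one_hot(c):
    v = np.zeros(V)
    v[idx[c]] = 1.0
    return v

# Forward pass over the sequence in both directions.
h_f = [np.zeros(H)]
for c in phonemes:
    h_f.append(np.tanh(Wx_f @ one_hot(c) + Wh_f @ h_f[-1]))
h_b = [np.zeros(H)]
for c in reversed(phonemes):
    h_b.append(np.tanh(Wx_b @ one_hot(c) + Wh_b @ h_b[-1]))
h_b = h_b[:0:-1]  # align backward states with positions

# Per-position split probability from the concatenated states (sigmoid).
probs = [float(1 / (1 + np.exp(-(Wo @ np.concatenate([f, b])))))
         for f, b in zip(h_f[1:], h_b)]
print([round(p, 2) for p in probs])
```

The forward states encode the prefix up to each position and the backward states the suffix from it, which is the "contextual information" referred to below.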
Three goodness measures are used - Precision (the fraction of predicted splits that are correct), Recall (the fraction of actual splits that are recovered, i.e. 1 minus the false negative rate) and F score (the harmonic mean of Precision and Recall). Sandhis are divided into 5 classes (with one class being a null operation), and these measures are computed over the entire dataset after training. The more the data, the higher the accuracy - up to 93.24 (F score, presumably) for the entire corpus.
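As a sketch of the three measures (the counts here are hypothetical, not from the paper):

```python
# Precision, Recall and F score from counts of true positives (tp),
# false positives (fp) and false negatives (fn) for one sandhi class.
def precision(tp, fp):
    return tp / (tp + fp)   # of the splits we predicted, how many were right

def recall(tp, fn):
    return tp / (tp + fn)   # of the true splits, how many did we find

def f_score(p, r):
    return 2 * p * r / (p + r)   # harmonic mean of the two

p, r = precision(tp=90, fp=10), recall(tp=90, fn=5)
print(round(f_score(p, r), 4))
```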
Intuitively, this would tend to learn word splits from common usage, splitting commonly found phoneme sequences that are found as a result of sandhi, using contextual information (as encoded by the forward and backward parts of the RNN). No morphological information is used.
Strengths - simple! Does not need cumbersome tagging beyond the basic split training data. Intuitively, this seems to mimic the first level of splitting that humans do, which is based purely on relatively local context and the set of words they have come across in the past.
Weaknesses - the lack of morphological information can lead to errors, like the example in the paper, where bhujaagra is split as bhujaa-agra instead of bhuja-agra. Both are lexically correct, but the former, while a common surface form in the corpus, is semantically incorrect in the sentence being decoded.
2) Morphological Disambiguation of Classical
Sanskrit
Oliver Hellwig
This one takes a linguistically "deep" approach. It relies on several major inputs:
1) A lexical database with lemmata (lexical items), semantic information, and inflected verb forms (ti~Nantas)
2) A corpus of Sanskrit texts, tagged with lexicographic, morphological and word semantic "gold" annotations
3) Linguistic models for sandhi split, declension (sup addition), verb conjugation (ti~N addition)
4) tag sets for indeclinables (avyaya), nouns and verbal forms
5) A linguistic processor that uses 1-4 to analyze a sentence
Each string is scanned from left to right, with possible sandhi splits attempted at each phonemic position. If the left part after a split is a valid lexical form, it is added to a hypothesis "lattice" and the right part is recursively split. The Viterbi algorithm (which I've seen used in Hard Disk Drives and Wireless Communication!) is then used to traverse the lattice and pick the best split. Goodness measures for each split are derived from bigram probabilities estimated from the annotated corpus. An accuracy of 94% is claimed on random sets of 10000 sentences from the corpus (the rest of which was used to train the algorithm).
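A toy sketch of the recursive splitting described above. Everything here is invented: an SLP1-style spelling ("A" for long aa), a tiny lexicon, made-up bigram log-probabilities, and a single toy sandhi rule (a surface "A" at a juncture may be undone as a+a, A+a or a+A). For brevity it scores candidates by exhaustive recursion; Viterbi over the shared lattice is what makes this efficient in practice.

```python
# Invented lexicon and bigram log-probabilities for illustration only.
LEXICON = {"bhuja", "bhujA", "agra"}
LOGP = {("<s>", "bhujA"): -0.5, ("<s>", "bhuja"): -1.0,
        ("bhuja", "agra"): -0.2, ("bhujA", "agra"): -3.0}
UNSEEN = -10.0   # penalty for bigrams never seen in the "corpus"

def undo_sandhi(s, i):
    """Ways to cut s before position i: yields (left_piece, remainder)."""
    yield s[:i], s[i:]                       # plain cut, no phoneme change
    if 0 < i <= len(s) and s[i - 1] == "A":  # undo vowel merger at the cut
        for l, r in (("a", "a"), ("A", "a"), ("a", "A")):
            yield s[:i - 1] + l, r + s[i:]

def best_split(s, prev="<s>"):
    """Best-scoring full segmentation of s, or (-inf, None) if none."""
    if not s:
        return 0.0, []
    best = (float("-inf"), None)
    for i in range(1, len(s) + 1):
        for left, rest in undo_sandhi(s, i):
            if left in LEXICON:              # valid lexical form -> recurse
                score, words = best_split(rest, left)
                score += LOGP.get((prev, left), UNSEEN)
                if words is not None and score > best[0]:
                    best = (score, [left] + words)
    return best

print(best_split("bhujAgra"))
```

With these toy numbers the bigram ("bhuja", "agra") outweighs the higher unigram-style preference for "bhujA", so the surface form bhujAgra comes out as bhuja + agra - the kind of disambiguation the corpus statistics are meant to provide.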
Once a lexical split is chosen, morphological disambiguation (say, distinguishing whether vanaM is prathamaa or dvitiiyaa) is done by a trained machine learning model.
Advantages: Uses more morphological information, and hence can get better splits (at least intuitively). Seems closer to the way humans split sentences, using all available lexical and morphological information about the potential splits.
Disadvantages: Cumbersome, requires more tagging.
I'm not convinced that Viterbi is a good solution for this problem. I may need to dig deeper, though, because clearly these folks have thought about it longer and harder than I have. From my knowledge, Viterbi works where problems can be decomposed into prefix-suffix splits. For example, in many applications the optimal decoding D(S) of a string S, split as S1_S2, can be proven to satisfy D(S)=D(S1)_D(S2). That is not true in this case! I would have thought a dynamic programming (knapsack-style) approach would be required.
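For concreteness, the kind of knapsack-style dynamic program I have in mind might look like the sketch below - per-word scores are invented, and sandhi rewrites and bigram context are ignored to keep the recurrence minimal.

```python
# best[i] holds the best (score, words) for the prefix s[:i], built from
# the best score of every valid earlier split point j plus the word s[j:i].
WORD_SCORE = {"bhuja": -1.0, "bhujaa": -0.5, "agra": -0.2}  # invented

def dp_split(s):
    best = {0: (0.0, [])}
    for i in range(1, len(s) + 1):
        for j in range(i):
            w = s[j:i]
            if j in best and w in WORD_SCORE:
                cand = (best[j][0] + WORD_SCORE[w], best[j][1] + [w])
                if i not in best or cand[0] > best[i][0]:
                    best[i] = cand
    return best.get(len(s))    # None if no full segmentation exists

print(dp_split("bhujaagra"))
```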
I need to understand n-gram methods better.
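For my own reference, the core of it seems simple enough: estimate the probability of a word given its predecessor from corpus counts, with smoothing so that unseen pairs do not get probability zero. A minimal sketch with an invented toy corpus and add-one (Laplace) smoothing:

```python
from collections import Counter

# Invented toy "corpus" of word sequences, purely for illustration.
corpus = [["rAmaH", "vanam", "gacCati"], ["rAmaH", "gacCati"]]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    padded = ["<s>"] + sent              # sentence-start marker
    unigrams.update(padded)
    bigrams.update(zip(padded, padded[1:]))
vocab_size = len(unigrams)

def p_bigram(prev, word):
    # Add-one smoothed conditional probability P(word | prev).
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

print(p_bigram("rAmaH", "vanam"))   # (1 + 1) / (2 + 4) = 1/3
```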
Approach 2) has similarities with the idea I've floated here, but has major differences which require more thought. I will write this up in more detail for discussion.