# -*- mode: org -*-

[These are just my notes while reading the papers, so much of the text is liberally quoted verbatim from the papers, often without quotation marks. Also I've tried to mark the parts where I differ with [] but not always. -Shreevatsa]

* Oliver Hellwig, 2015. Using Recurrent Neural Networks for joint compound splitting and Sandhi resolution in Sanskrit. /7th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics/

** Abstract

"The interacting levels of its [Sanskrit's] linguistic processes complicate the computer based analysis of its corpus."

"using only gold transcripts of correct string splits, but no further lexical or morphological resources"

** 1. Introduction

*** 1.1 Sandhi

Distinguishes between:
- inner-word sandhi: applied while deriving an inflected form from its root, and
- inter-word sandhi: applied while combining multiple inflected words.
Deals only with the latter.

Example:
  gardabhaścāśvaśca
  = gardabha[śc][ā]śva[śc]a   (*) [line I added]
  = gardabha(ḥ-c)(a-a)śva(ḥ-c)a
  = gardabhaḥ ca aśvaḥ ca

Although sandhi application is strictly deterministic (gardabhaḥ ca aśvaḥ ca → gardabhaścāśvaśca uniquely), analysis is not. E.g. ā could be (a-a), (a-ā), (ā-a), (ā-ā), or ā- (in a compound, first part ends in ā, no sandhi), or just be ā within the stem of a lexeme (neither sandhi nor compound boundary). So 6 possibilities for an ā.

Resolving sandhi needs lexical and semantic context. Not just lexical: e.g. even though, by analogy with tajjalam → ta(t-j)alam, we could split kajjalam → ka(t-j)alam, which is lexically valid [? he gives *"how many water?"], it is not semantically valid.

*** 1.2 Compounds

Can be recursive. Gives example of /aśva-avi-uṣṭrāḥ/ = /aśvāvyuṣṭrāḥ/ (a dvandva compound), which can be part of a tatpuruṣa compound /aśvāvyuṣṭradarśanam/.

Need to consider compounds that are lexicalized, i.e. recorded as independent units in a (contextual) dictionary. E.g. /mahā-ratnāni/ is split as "big jewels" in most texts, but in gemmological texts /mahāratna/ is a technical term for a class of precious jewels and should not be split.

*** 1.3 Paper's goal

Of course the number of possibilities grows exponentially with the length of the string, for both sandhis and compounds. [A toy sketch of this blow-up is below.] The paper's goal is a learning system that splits compounds and sandhis, and shows the sandhi rules applied. E.g. from /uttamādhamamadhyānāṃ/, produce /uttama-adhama-madhyānām/.
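[To make the combinatorics concrete, here is a toy sketch of my own (not from the paper) that enumerates candidate unsandhied readings by non-deterministically undoing sandhi at each phoneme. The rule table is a tiny invented sample and the function names are mine.]

#+BEGIN_SRC python
# Toy sketch (mine, not the paper's): enumerate candidate analyses of a
# sandhied string by non-deterministically undoing sandhi at each phoneme.
# The rule table is a tiny invented sample; a realistic one is far larger.

SPLITS = {
    "ā": ["ā", "a a", "a ā", "ā a", "ā ā", "ā-"],  # the 6 readings of ā from §1.1
    "ś": ["ś", "ḥ "],                              # e.g. ...aśca... ← ...aḥ ca...
}

def analyses(s):
    """Yield every rewriting of s under the (toy) reversed-sandhi table."""
    if not s:
        yield ""
        return
    head, tail = s[0], s[1:]
    for rest in analyses(tail):
        for option in SPLITS.get(head, [head]):   # unlisted phonemes stay put
            yield option + rest

# 3 ś's and 1 ā already give 2*2*2*6 = 48 candidates for this one string;
# the count grows exponentially with string length (§1.3).
print(sum(1 for _ in analyses("gardabhaścāśvaśca")))   # 48
#+END_SRC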
** 2. Current NLP methods

Some modern approaches try to use the Pāṇinian system.

*** Mishra

"reformulates the rules of the Aṣṭādhyāyī using ideas from set theory to build a generator for valid Sanskrit forms".
- Anand Mishra, 2009. Simulating the Pāṇinian system of Sanskrit grammar. In /Sanskrit Computational Linguistics/. Springer, pages 127–138.

*** Huet, Kulkarni

"combine formal methods from the Aṣṭādhyāyī with a statistical scorer for the analysis of Sanskrit".
- Gérard Huet, 2005. A functional toolkit for morphological and phonological processing, application to a Sanskrit tagger. /Journal of Functional Programming/, 15(04):573–614.
- Amba Kulkarni and Devanand Shukla, 2009. Sanskrit morphological analyser: Some issues. /Indian Linguistics/, 70(1-4):169–177.

*** Mittal

"reports 92.8% split accuracy in Sandhi resolution by combining an FSA trained on a parallel corpus of sandhied and unsandhied texts with lexical frequencies and a morphological analyzer".
- Vipul Mittal, 2010. Automatic Sanskrit segmentizer using finite state transducers. In /Proceedings of the ACL 2010 Student Research Workshop/. Stroudsburg, PA, USA: Association for Computational Linguistics.

*** Hellwig

"applies a Sandhi rule base, a morphological analyzer, and a language model estimated from a corpus to generate lexical and morphological analyses of unrestricted Sanskrit text".
- Oliver Hellwig, 2009. SanskritTagger, a stochastic lexical and POS tagger for Sanskrit. In Gérard Huet, Amba Kulkarni, and Peter Scharf (eds.), /Sanskrit Computational Linguistics. First and Second International Symposia/, Lecture Notes in Artificial Intelligence, 5402. Berlin: Springer Verlag.
- Oliver Hellwig, 2015. Morphological disambiguation of Classical Sanskrit. In Cerstin Mahlow and Michael Piotrowski (eds.), /Systems and Frameworks for Computational Morphology/. Cham: Springer.

This paper instead treats its problem (sandhi and compound resolution) as a "sequence labeling task". Sequence labeling is an important ML task. [Occurs in part-of-speech tagging, semantic role labeling, bioinformatics, etc. Typical application: *classify* a sequence of things, but instead of running a classifier on each item independently, use the whole sequence as input/signal. Stub on Wikipedia: https://en.wikipedia.org/wiki/Sequence_labeling Useful post here: http://nlpers.blogspot.com/2006/11/getting-started-in-sequence-labeling.html]

Hellwig picks Recurrent Neural Networks over other common approaches because those (Hidden Markov Models, Conditional Random Fields) usually look at a smaller range around the input symbol. In particular, he uses Long Short-Term Memory (LSTM) neural cells.

** 3. What data is used, and how

Data is from the corpus of Hellwig's SanskritTagger. This already contains an analysis (but not sandhi information) of sentences in the corpus, and this analysis is taken as the "gold analysis".

For the purpose of this paper, distinguishes two types of sandhis (different from the Pāṇinian classification):
- vocalic sandhi: when the result of applying sandhi is a single vowel.
- non-vocalic sandhi: everything else.

Each string in the corpus is split into its phonemes p (the observed sequence), and associated with one of five transformation rules (target sequences):

- R1: p → p (do nothing, e.g. in varāhaḥ, keep ā as ā)
- R2: p → a-b (add a compound split after undoing vocalic sandhi, e.g. caiva → ca-eva, specifically ai → a-e). [Note: I wouldn't call this a compound split; it's a word split and we should write "ca eva". —S]
- R3: p → p- (add a compound split without change in phonemes/sandhi, e.g. mahāgiriḥ → mahā-giriḥ, specifically ā → ā-)
- R4: p → a- (add a compound split after undoing non-vocalic sandhi, e.g. aśvaśca → aśvaḥ-ca, specifically ś → ḥ-)
  Note 1: In the "applying sandhi" direction, this is contextual, e.g. ḥ- becomes ś only when followed by c (or a voiceless palatal in general).
  Note 2: In a dictionary word like āścarya, the śc should *not* be broken into ḥ-c.
  For these two reasons, this knowledge about Sanskrit is not encoded in the training data; the classifier learns it and stays unprejudiced.
- R5: p → a (undo a sandhi without inserting a split, e.g. ratnaṃ → ratnam, specifically ṃ → m).

Examples:

  observed  r  a  t  n  a  ṃ  c
  target    r  a  t  n  a  m  BOW

  observed  t  ā  ṃ  ś  c  a  g
  target    t  ā  x  n  c  a  BOW

What's going on in the examples:
1. As many string-final sandhis depend on the first phoneme of the next string, in the training data the first phoneme of the next string is included too. In the target sequence, the first phoneme of the next string is encoded as the dummy class BOW. [My guess is that BOW stands for Beginning Of Word.]
2. Some non-vocalic sandhis replace a single phoneme by multiple, like (n-c) → (ṃś-c). In such cases, the superfluous phoneme in the observed sequence is marked as deletable (x). This is treated as an instance of R5.
[3. Note that these are two examples of complete strings given as input to the program. A toy decoder for this labeling is sketched below.]
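[To pin down the labeling scheme, here is a toy decoder of my own (not the paper's): it maps an observed phoneme sequence plus its per-position target labels back to the unsandhied text. Representation details (Python lists of single codepoints, the name decode) are assumptions for illustration.]

#+BEGIN_SRC python
# Toy decoder (mine, not the paper's): turn an observed phoneme sequence plus
# its per-position target labels back into the unsandhied, split text.
# Assumes each phoneme is a single precomposed Unicode character.

def decode(observed, targets):
    out = []
    for phoneme, label in zip(observed, targets):
        if label in ("BOW", "x"):  # BOW: context phoneme of the next string;
            continue               # x: phoneme deleted by a multi-phoneme sandhi
        out.append(label)          # otherwise the label is the unsandhied
    return "".join(out)            # replacement; "-" in a label marks a split

# R5 example: ratnaṃ (before c...) → ratnam
print(decode(list("ratnaṃ") + ["c"],
             ["r", "a", "t", "n", "a", "m", "BOW"]))   # ratnam

# (n-c) → (ṃś-c) example: tāṃśca (before g...) → tānca, via the x mark
print(decode(list("tāṃśca") + ["g"],
             ["t", "ā", "x", "n", "c", "a", "BOW"]))   # tānca
#+END_SRC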
The full data set has 2,591,000 phoneme sequences of various lengths (33.9% are ≤ 5 in length, 45.9% are 6–10, 13.7% are 11–15, and 6.5% are > 15).

In the data set, these are the relative frequencies of R1–R5 [I've sorted by frequency]:

| Rule | Freq. (%) |
|------+-----------|
| R1   |     90.96 |
| R5   |      4.05 |
| R3   |      2.62 |
| R2   |      1.49 |
| R4   |      0.88 |

** 4. The neural network / algorithm used

A recurrent neural network containing an input layer, hidden forward and backward layers, and an output layer.

The size of the input layer is the number of distinct input phonemes. At time t, the input layer receives the phoneme observed at position t in a string. The hidden forward and backward layers capture the left and right context; LSTM cells are used in these layers. "The output layer receives the individual outputs from both hidden layers and performs a softmax regression for the desired target values. The network is trained with stochastic gradient descent"
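[The paper does not name a toolkit; the following is a minimal PyTorch sketch of my own of a network matching the description above. N_PHONEMES, N_CLASSES and HIDDEN are invented sizes; only the momentum and learning rate come from §5.]

#+BEGIN_SRC python
# Sketch (mine; the paper names no toolkit) of a network matching §4, in
# PyTorch. N_PHONEMES, N_CLASSES, HIDDEN are invented sizes for illustration.
import torch
import torch.nn as nn

N_PHONEMES = 50    # input layer size = number of distinct phonemes (§4)
N_CLASSES  = 120   # number of distinct target labels
HIDDEN     = 64    # size of each hidden (forward/backward) layer

class SandhiLabeler(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(N_PHONEMES, N_PHONEMES)  # input layer
        # hidden forward and backward layers with LSTM cells
        self.lstm = nn.LSTM(N_PHONEMES, HIDDEN,
                            bidirectional=True, batch_first=True)
        # output layer: softmax regression over the target classes
        self.out = nn.Linear(2 * HIDDEN, N_CLASSES)

    def forward(self, phoneme_ids):        # (batch, seq_len) phoneme indices
        h, _ = self.lstm(self.embed(phoneme_ids))
        return self.out(h)                 # logits; softmax lives in the loss

model = SandhiLabeler()
# §5: "trained ... with a momentum of 0.9 and a learning rate of 0.0005"
opt = torch.optim.SGD(model.parameters(), lr=0.0005, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

# one training step on a fake 7-phoneme sequence (e.g. r a t n a ṃ c)
x = torch.randint(0, N_PHONEMES, (1, 7))
y = torch.randint(0, N_CLASSES, (1, 7))
opt.zero_grad()
loss = loss_fn(model(x).view(-1, N_CLASSES), y.view(-1))
loss.backward()
opt.step()
#+END_SRC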
** 5. The results obtained by this method

"The labeler is trained for 100 epochs with a momentum of 0.9 and a learning rate of 0.0005"

Takes 1 ms per string, so about 1000 strings per second.

These are the results (precision, recall, F score) [I've added the "Frequency" and "What" columns and sorted by the former]:

| Rule type | [Frequency] | [What]  | Precision | Recall | F score |
|-----------+-------------+---------+-----------+--------+---------|
| R1        |       90.96 | p → p   |     99.59 |  99.44 |   99.51 |
| R5        |        4.05 | p → a   |     98.28 |  98.63 |   98.46 |
| R3        |        2.62 | p → p-  |     89.53 |  91.67 |   90.59 |
| R2        |        1.49 | p → a-b |     92.96 |  93.23 |   93.09 |
| R4        |        0.88 | p → a-  |     89.35 |  95.55 |   92.35 |

Note that the high scores are for R1 and R5, i.e. it prefers to be conservative and not insert boundaries ("-"s).

For ā in particular, this is what it does [I've sorted by "Proportion"]:

| Target | Precision | Recall | F     | Proportion |
|--------+-----------+--------+-------+------------|
| ā      |     98.00 |  97.82 | 97.95 |       0.82 |
| a-a    |     89.08 |  92.67 | 90.84 |       0.07 |
| a-ā    |     88.26 |  86.48 | 87.36 |       0.03 |
| āḥ     |     73.13 |  77.66 | 75.33 |       0.03 |
| ā-     |     84.60 |  87.72 | 86.13 |       0.02 |
| ā-a    |     83.59 |  75.00 | 79.06 |       0.01 |
| ā-ā    |     72.45 |  58.97 | 65.02 |      <0.01 |

He shows that the R3/R2/R4 scores depend on having good training data. Some examples:
- ātreyādayaḥ → ātrey(a-ā)dayaḥ correctly. Probably because ādayaḥ is common.
- śarkarāmaricopetaṃ → śarkar(ā-)maric(a-u)peta(m) correctly. From a medicinal subcorpus that contains many lists of similar ingredients.
- prakīrṇakeyūrabhujāgramaṇḍalaḥ → prakīrṇ(a-)keyūr(a-)bhuj(ā-a)gr(a-)maṇḍalaḥ, with an error at bhujāgra: labeled bhuj(ā-a)gra instead of bhuj(a-a)gra. Says bhujā ("by the enjoying one") is "quite regularly" found as a compound termination.

The overall accuracy of 93.24 comes with the full dataset; with 10,000 and 500,000 samples, accuracy is 77.92 and 91.03 respectively.

** 6. Summary of the paper

This algorithm was trained with just "shallow" annotations; it did not use any resources like language models or morphological/phonetic analyzers. A deep architecture (e.g. an additional layer after the input layer) may improve accuracy.

* Oliver Hellwig, 2015. Morphological disambiguation of Classical Sanskrit. In Cerstin Mahlow and Michael Piotrowski (eds.), /Systems and Frameworks for Computational Morphology/. Cham: Springer.

** Abstract

"a system for the tokenization and morphosyntactic analysis of Sanskrit [...] combines a fixed morphological rule base with a statistical selection of the most probable analysis of an input text."

** Slides

Apart from the paper, this also has associated slides at http://sfcm.eu/sfcm2015/data/uploads/slides/sfcm2015_slides_02.pdf

aśvasyāhāraḥ = aśvasya āhāraḥ (food of the horse); could also be aśvasya ahāraḥ, "the non-catcher of the horse".

Orthography:
- Western style: yas tv ekāgre cetasi sadbhūtam arthaṃ pradyotayati
- Traditional style: yastvekāgrecetasisadbhūtamarthaṃpradyotayati
- Any intermediate level: yas tvekāgre cetasi sadbhūtamarthaṃ pradyotayati

** 2. Previous research and resources

See a version of Pāṇini's Aṣṭādhyāyī with some annotation at http://panini.phil-fak.uni-duesseldorf.de/panini (redirects to http://panini.phil.hhu.de/panini/panini/)

"Some researchers try to transpose the Pāṇinian system of morphosyntactic analysis, more or less directly, into an NLP tool. These approaches frequently face problems with overgeneration [...] and with the order in which the rules of the Aṣṭādhyāyī need to be applied." The order is regulated by anuvṛtti rules, which "are not part of the text of the Aṣṭādhyāyī, but are recorded – and heavily discussed – in the commentarial literature".

Example of such a problem:
- M. Kulkarni, 2009. Phonological overgeneration in Paninian system. In G. Huet, A. Kulkarni, and P. Scharf (eds.), /Sanskrit Computational Linguistics/, pp. 306–319.

"Mishra handles these problems by reformulating the rules of the Aṣṭādhyāyī in terms of set theory."
- Anand Mishra, 2009. Simulating the Pāṇinian system of Sanskrit grammar. In /Sanskrit Computational Linguistics/. Springer, pages 127–138.

Huet and (separately) Kulkarni "combine formal methods from the Aṣṭādhyāyī with a statistical scorer."
- Gérard Huet, 2005. A functional toolkit for morphological and phonological processing, application to a Sanskrit tagger. /Journal of Functional Programming/, 15(04):573–614.
- Amba Kulkarni and Devanand Shukla, 2009. Sanskrit morphological analyser: Some issues. /Indian Linguistics/, 70(1-4):169–177.

"Mittal estimates the probability of Sandhi splits from a parallel corpus of sandhied and unsandhied texts. He combines a finite-state automaton built from the parallel corpus, estimations of word frequencies, and a morphological analyzer with a scoring function that calculates a joint lexical and phonological weight for a given analysis of a Sanskrit sentence. The author reports that his system selects the best split of a given Sanskrit string in 92.8% of all cases."
- Vipul Mittal, 2010. Automatic Sanskrit segmentizer using finite state transducers. In /Proceedings of the ACL 2010 Student Research Workshop/. Stroudsburg, PA, USA: Association for Computational Linguistics.

"Hellwig presents a statistical lexical and morphological analyzer" ... "performance data of this system"
- Oliver Hellwig, 2009. SanskritTagger, a stochastic lexical and POS tagger for Sanskrit. In Gérard Huet, Amba Kulkarni, and Peter Scharf (eds.), /Sanskrit Computational Linguistics. First and Second International Symposia/, Lecture Notes in Artificial Intelligence, 5402. Berlin: Springer Verlag.
- Oliver Hellwig, 2010. Performance of a lexical and POS tagger for Sanskrit. In G. Jha (ed.), /Proceedings of the Fourth International Sanskrit Computational Linguistics Symposium/, pp. 162–172. Berlin: Springer Verlag.

** 3. Background and challenges

[Couple of notes for me:]
- nouns and adjectives are found in 3 numbers × 8 cases, *and* a stem form in compounds. [3×8 + 1 = 25 forms.]
- bahuvrīhi compounds can change the gender of a nominal compound.
- a testcase: niḥśeṣo hi kṛto vaṃśo mama taiḥ pāpakarmabhiḥ, "because my family was completely destroyed by these bad persons". Here "taiḥ pāpakarmabhiḥ" can equally well be translated "by these bad actions" (as a tatpuruṣa compound, neuter); what makes the bahuvrīhi (masc.) more plausible? Numerous references to the word with unambiguous (masc.) case endings.
- sandhi, e.g. (see the toy sketch after this list):
  - tān cet jayati → tāṃścejjayati
  - pāṇḍavāḥ api → pāṇḍavā api
  The morphological analyzer must first[?] reconstruct the pre-sandhied form before analysis.
- sandhi is not strictly followed in the epics; also authors and scribes make mistakes.
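[A toy illustration of my own of the deterministic forward direction, with hard-coded rules covering exactly the two examples above; the rule set and function are invented, not from the paper.]

#+BEGIN_SRC python
# Toy forward sandhi (mine, not from the paper): hard-coded rules covering
# exactly the two examples above, to illustrate that application is
# deterministic while the analyzer has to undo it before analysis.
from functools import reduce

def sandhi_pair(a, b):
    if a.endswith("n") and b.startswith("c"):    # tān + cet → tāṃścet
        return a[:-1] + "ṃś" + b
    if a.endswith("t") and b.startswith("j"):    # cet + jayati → cejjayati
        return a[:-1] + "j" + b
    if a.endswith("āḥ") and b.startswith("a"):   # pāṇḍavāḥ + api → pāṇḍavā api
        return a[:-1] + " " + b
    return a + " " + b                           # no rule: plain juxtaposition

print(reduce(sandhi_pair, ["tān", "cet", "jayati"]))   # tāṃścejjayati
print(reduce(sandhi_pair, ["pāṇḍavāḥ", "api"]))        # pāṇḍavā api
#+END_SRC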
** 4. Architecture of the system

Based on SanskritTagger described in the 2009 paper.

*** Core components

Five components:

1. Lexical database: stores lemmata (174k distinct lemmata), their grammatical categories, meanings (314k meanings), word semantic information (105k connections to a word semantic repository), and inflected verbal forms (generated automatically, checked manually, then stored).

2. Corpus: "Sanskrit texts along with their lexicographic, morphological and word semantic gold annotations" (273 texts, 69 completely annotated; 2,674,000 strings, split into 3,587,000 lexical tokens with morphological annotations). This corpus data is used to train statistical models.

3. Linguistic models:
   (a) hard-coded: sandhi rule base, rules for determining morphological categories of nouns
   (b) learned: parameters of the statistical algorithms (from being trained on the corpus data)
   Example of (a): to analyse noun forms, remove possible inflectional suffixes and look up the remaining word root in the noun section of the lexical database. To analyze verb forms, just look up the fully inflected form stored in the database. (Generating verb forms automatically would require implementing a large range of the Aṣṭādhyāyī.)

4. Tag set: for nouns, gender-case-number. For verbs: tense (just present/future/past), person, number.

5. Linguistic processor: uses the models and the lexical database to analyze a sentence. For the problem of this paper: tokenization and morphological analysis.

*** Tokenization

Scan the string from left to right. Try breaking sandhi at position i. If you can, analyze the left part. If valid, analyze the right part in the same way. If this reaches the end of the string, insert all proposed analyses into a "hypothesis lattice".

Extract all morpho-lexical analyses LM for each possible split string. E.g. if a split yields "vanam", 3 analyses: (nom. sing. of neut. vana), (acc. sing. of neut. vana), (voc. sing. of neut. vana).

Then Viterbi decoding to find the most probable sequence of lexical tokens from this lattice. [https://en.wikipedia.org/w/index.php?title=Viterbi_algorithm&oldid=712115185] This uses bigram probabilities (LM_{j-1}, LM_j) estimated from the annotated corpus. (See paper for some detail of the weights.)
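[A compressed sketch of my own of the decoding step: a simplified lattice where each position holds candidate morpho-lexical analyses, and a bigram() log-probability function is assumed to come from the annotated corpus. The real lattice also contains competing splits of different lengths, which this sketch ignores.]

#+BEGIN_SRC python
# Compressed sketch (mine) of Viterbi decoding over a simplified lattice:
# each position holds candidate morpho-lexical analyses, and bigram(a, b) is
# assumed to return log P(b | a) estimated from the annotated corpus.
import math

def viterbi(lattice, bigram, start="<S>"):
    best = {start: (0.0, [])}              # analysis → (log-prob, best path)
    for candidates in lattice:
        nxt = {}
        for cand in candidates:
            score, path = max(
                ((p + bigram(prev, cand), path)
                 for prev, (p, path) in best.items()),
                key=lambda t: t[0])
            nxt[cand] = (score, path + [cand])
        best = nxt
    return max(best.values(), key=lambda t: t[0])[1]

# "vanam" has three analyses; a (fake) bigram model picks one in context.
lattice = [[("rāma", "m.sg.nom")],         # hypothetical left context
           [("vana", "n.sg.nom"), ("vana", "n.sg.acc"), ("vana", "n.sg.voc")]]
bigram = lambda a, b: math.log(0.5 if b[1] == "n.sg.acc" else 0.25)  # fake
print(viterbi(lattice, bigram))            # → rāma (nom.), vana (acc.)
#+END_SRC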
*** Results / evaluation (how well the algorithm does)

10,000 sentences are removed; the algorithm, trained on the rest of the corpus, is run on these 10,000.

Step 1: Produces the correct lexical tokenization for 94.4% of these sentences; 3.3% have one error (edit distance = 1). (Works slightly better for the epic/Purāṇic corpus.)

Step 2: The morphological analysis is frequently wrong. Run another round of Viterbi decoding on just the morphological tags (no lexical data) for the highest-scoring lexical path. Still not great.

** 5. Further improvements + evaluation

42% of all words have more than 1 morphological possibility. [E.g. they are words like "vanam".]

Instead of ignoring lexical data and trying to guess just the morphological tags in Step 2, include the lexical and morphological data of surrounding words (at distance up to 3). Tries two kinds of classifiers: Maximum Entropy and Conditional Random Fields. Gets some results; see paper for details. [A sketch of such context features is below.]

Does well at forms that are inherently ambiguous but have one interpretation occurring much more frequently:
- uvāca (both 3rd person "he/she/it said" and 1st person "I said") is mostly used in the former sense.
- gacchati is mostly used as a verb (3rd person singular present) rather than as the locative (7th vibhakti) singular of the present participle (as in "evaṃ gacchati kāle...").

Note that compared to the 94.4% of this system, the other system that uses no linguistic information [the RNN system above?] achieves 93.2%.
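[For concreteness, a guess of my own (not the paper's actual feature set) at what "lexical and morphological data of surrounding words at distance up to 3" could look like as classifier features; the representation and names are invented.]

#+BEGIN_SRC python
# A guess (mine, not the paper's actual feature set) at context features:
# lexical and morphological data of surrounding words up to distance 3.

def context_features(tokens, i, window=3):
    """tokens: list of (lemma, morph-tag) pairs; features for position i."""
    feats = [f"lemma0={tokens[i][0]}"]
    for d in range(1, window + 1):
        for j, side in ((i - d, f"-{d}"), (i + d, f"+{d}")):
            if 0 <= j < len(tokens):
                feats.append(f"lemma{side}={tokens[j][0]}")
                feats.append(f"morph{side}={tokens[j][1]}")
    return feats

# Disambiguating gacchati in "evaṃ gacchati kāle": the neighbouring locative
# makes the participle reading plausible; such features feed the MaxEnt/CRF.
sent = [("evam", "ind"), ("gam", "?"), ("kāla", "m.sg.loc")]
print(context_features(sent, 1))
#+END_SRC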